The problem with using splice(2) for a faster cat(1)
2023-03-12
A few weeks ago, I was reading a Hacker News post about a clipboard manager. I can't remember which one exactly, but an example is gpaste - they let you have a clipboard history, view that history, persist things to disk if you want, and so on.
One comment caught my eye: it asked why clipboard managers didn't use the splice(2) syscall. After all, splice allows copying the contents of a file descriptor to a pipe without any copies between userspace and kernelspace.
Indeed, replacing a read-write combo with splice does yield massive performance gains, and we can benchmark that. That got me thinking: why don't other tools use splice too, like cat? What are the performance gains? Are there any edge cases where it doesn't work? How can we profile this?
There are older blog posts lamenting how little splice gets used, e.g. https://endler.dev/2018/fastcat/. Interestingly enough, things may have changed since 2018 (specifically, in 2021), giving us new reasons to avoid splice.
The conclusion is basically that splice isn't generic enough, but the details are pretty interesting.
What's our performance metric?
The basic question we're trying to answer is: how fast can a program take a filename and write that file's contents to stdout? We're measuring performance in bytes per second.
One important point is that we want to benchmark with the kernel read cache warmed, i.e. we run the benchmarks a few times until the number settles down. This matters because the only difference between our methods is an extra memory-to-memory copy, which is multiple times faster than a disk-to-memory read, even with DMA, so with a cold cache the disk I/O would dominate and hide it.
Warming the read cache means everything is memory-to-memory, and differences in how we do that copy will actually show up.
I'll create a file with 10,000M of zeroes and benchmark cat using pv as follows:
$ dd if=/dev/zero of=10g_zero bs=1M count=10000
$ cat 10g_zero | pv > /dev/null
...
$ !! # Repeat to warm cache
9.77GiB 0:00:02 [4.72GiB/s] [ <=> ]
So 4.72GiB/s is the number to beat!
read-write implementation
This is the dumb way you'd write a file to stdout. Make a buffer, open the file, read it out in chunks, and write those chunks to stdout. The only thing to tune here is really the buffer size I think. 32k seems to get the best performance on my machine.
Here's the code, no error handling:
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>

int main(int argc, char* argv[]) {
    size_t buf_size = 32 * 1024;
    char *buf = malloc(buf_size);
    char *fname = argv[1];
    int fd = open(fname, O_RDONLY);
    while (1) {
        /* Read a chunk into our userspace buffer... */
        ssize_t bytes_read = read(fd, buf, buf_size);
        if (bytes_read == 0) {
            return EXIT_SUCCESS;  /* EOF */
        }
        /* ...and write that chunk back out to stdout. */
        write(STDOUT_FILENO, buf, bytes_read);
    }
}
I called this slow.c. Here's the benchmark:
$ ./slow 10g_zero | pv > /dev/null
9.77GiB 0:00:01 [7.38GiB/s] [ <=> ]
So that's actually faster than cat already. 7.38 GiB/s vs 4.72 GiB/s. But this is doing unnecessary memory-to-memory copies from kernelspace to userspace on read, then from userspace to kernelspace on write. Our ideal solution would just move (not even copy) pages from the file to stdout, with all buffers owned by the kernel.
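One aside before we move on: slow.c deliberately skips error handling, and a real tool couldn't. Here's a rough sketch (mine, not part of slow.c) of what a more careful copy loop might look like, retrying calls interrupted by signals and handling short writes:

#include <errno.h>
#include <unistd.h>

/* Sketch only: copy everything from in_fd to out_fd, retrying on EINTR
 * and handling short writes. Returns 0 on success, -1 on error. */
static int copy_fd(int in_fd, int out_fd, char *buf, size_t buf_size) {
    while (1) {
        ssize_t bytes_read = read(in_fd, buf, buf_size);
        if (bytes_read == 0)
            return 0;                      /* EOF */
        if (bytes_read < 0) {
            if (errno == EINTR)
                continue;                  /* interrupted, try again */
            return -1;
        }
        ssize_t written = 0;
        while (written < bytes_read) {
            ssize_t n = write(out_fd, buf + written, bytes_read - written);
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            written += n;
        }
    }
}

None of that changes the performance story, though: the copies between userspace and kernelspace are still there, they're just checked.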
splice implementation
The splice implementation is a bit more complex, but not much. Looking at the man page for splice with man 2 splice, we can see the description:
splice() moves data between two file descriptors without copying between kernel
address space and user address space. It transfers up to len bytes of data from
the file descriptor fd_in to the file descriptor fd_out, where one of the file
descriptors must refer to a pipe.
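For reference, the prototype (a GNU extension declared in fcntl.h when _GNU_SOURCE is defined) looks roughly like this:

ssize_t splice(int fd_in, off64_t *off_in,
               int fd_out, off64_t *off_out,
               size_t len, unsigned int flags);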
Here's my code for my splice-based cat:
#define _GNU_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[]) {
    size_t buf_size = 16 * 1024;
    char *fname = argv[1];
    int fd = open(fname, O_RDONLY);
    off64_t offset = 0;
    while (1) {
        /* Move up to buf_size bytes from the file straight to stdout
         * (which must be a pipe). No userspace buffer involved. */
        ssize_t bytes_spliced = splice(fd, &offset, STDOUT_FILENO, NULL,
                                       buf_size, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (bytes_spliced == 0) {
            return EXIT_SUCCESS;  /* EOF */
        }
        if (bytes_spliced < 0) {
            fprintf(stderr, "%s\n", strerror(errno));
            return EXIT_FAILURE;
        }
    }
}
I called this fast.c.
Some notes about this:
- #define _GNU_SOURCE gives us access to splice, which is a non-standard (where the standard is POSIX) extension to fcntl.h. This is one reason splice probably isn't used more widely: it's not portable.
- The flag SPLICE_F_MOVE is a no-op these days. It used to be a hint to the kernel to move pages where possible, but now does literally nothing. I added it because I do want a move, even knowing it has no effect.
- SPLICE_F_MORE is a hint saying more data is coming in a future splice. It's true for most of our splices (all but the last; see the sketch after this list). Not sure how useful it is outside of socket programming, where it's sometimes not obvious to the kernel that more data is coming.
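To make that last note concrete, here's a hypothetical variant (mine, not what fast.c does) that uses fstat to learn the file size up front and only sets SPLICE_F_MORE while there's actually more data left to send:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    off64_t offset = 0;
    while (offset < st.st_size) {
        size_t chunk = 16 * 1024;
        unsigned int flags = SPLICE_F_MOVE;
        if (st.st_size - offset > (off64_t)chunk)
            flags |= SPLICE_F_MORE;   /* more splices will follow this one */
        /* splice advances offset by the number of bytes moved */
        ssize_t n = splice(fd, &offset, STDOUT_FILENO, NULL, chunk, flags);
        if (n <= 0)
            return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}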
Enough with the notes! Let's see some performance numbers!
$ ./fast 10g_zero | pv > /dev/null
9.77GiB 0:00:00 [26.8GiB/s] [ <=> ]
Whoa, holy shit, 26.8 GiB/s? That's more than 5.6x as fast as cat! This warrants some further investigation.
Profiling, fast and slow
This section title is a reference to "Thinking, Fast and Slow" by Daniel Kahneman, which I haven't read.
fast is so fast I feel like we have to look into it to make sure nothing weird is going on.
We can use perf to profile our programs and see where we're spending time. You can install it via the linux-tools package matching your kernel version. I'm on Ubuntu, so I needed to do:
$ sudo apt install linux-tools-5.19.0-32-generic
Let's look at cat first. Here's the command to run your program and record
performance in perf.data
:
$ sudo perf record -- cat ../10g_zero > /dev/null
Why sudo? Without sudo, perf says something about kernel symbols and symbol map restrictions if you're not root, so I just run everything here as root. Sue me. It's not like we're running untrusted code here!
To generate a breakdown with the percentage of time spent in each function:
$ sudo perf report
For the above case, the report looks like:
Overhead Command Shared Object Symbol
75.43% cat [kernel.kallsyms] [k] copy_user_generic_string
3.22% cat [kernel.kallsyms] [k] filemap_read
2.75% cat [kernel.kallsyms] [k] filemap_get_read_batch
Then a bunch of negligible <1% stuff.
The function copy_user_generic_string
copies to/from userspace. It's clear
that's what's taking the vast majority of time. The perf report for slow
looks the same:
Overhead Command Shared Object Symbol
70.53% slow [kernel.kallsyms] [k] copy_user_generic_string
3.86% slow [kernel.kallsyms] [k] filemap_read
3.82% slow [kernel.kallsyms] [k] filemap_get_read_batch
This is as expected. Let's look at the perf report for fast
:
$ sudo perf record ../fast ../10g_zero > /dev/null
Invalid argument
Oh, that's because at least one of the input and output has to be a pipe, and in this case both are files (the input is a regular file and stdout is /dev/null). Let's just throw a cat in there:
$ sudo perf record ../fast ../10g_zero | cat > /dev/null
Invalid argument
Huh? What? This is annoying, maybe perf does something dodgy to stdout so we can't splice to it? Let's try making perf output to a file:
$ sudo perf record -o perf.out -- ../fast ../10g_zero | cat > /dev/null
That finally works. What an ordeal. The report looks like this:
Overhead Command Shared Object Symbol
60.55% fast [kernel.kallsyms] [k] mutex_spin_on_owner
7.86% fast [kernel.kallsyms] [k] filemap_get_read_batch
2.95% fast [kernel.kallsyms] [k] copy_page_to_iter
2.86% fast [kernel.kallsyms] [k] __mutex_lock.constprop.0
2.47% fast [kernel.kallsyms] [k] copy_user_generic_string
Notice how little time we're spending copying pages between user and kernel. It's clear that the stories of increased performance are true.
The final straw: why splice isn't more widely used
Our journey has led us to a few reasons why splice isn't used more widely:
- Not portable: this is kind of a non-reason because everyone just uses Linux, but maybe someone cares about this.
- Not general: you can't splice from a file to a file (you can just use sendfile for that anyway) or from a socket to a socket; one end of the splice has to be a pipe. This means file-to-file operations like cat f1 f2 f3 > f4 are impossible with splice.
- Not universally supported: not all filesystems actually let you splice to/from them. It's possible to try a fast implementation and fall back to a slow one if we're on a non-splice filesystem (see the sketch after this list), but that adds complexity for little gain.
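As mentioned above, here's a sketch (again mine, purely illustrative) of what such a fallback might look like: attempt splice, and if the kernel rejects it with EINVAL (stdout isn't a pipe, the filesystem doesn't support splice, and so on), drop back to the boring read/write loop:

/* Illustrative only: try splice first, fall back to read/write on EINVAL. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static void copy_slow(int fd) {
    size_t buf_size = 32 * 1024;
    char *buf = malloc(buf_size);
    ssize_t n;
    while ((n = read(fd, buf, buf_size)) > 0)
        write(STDOUT_FILENO, buf, n);
    free(buf);
}

int main(int argc, char *argv[]) {
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return EXIT_FAILURE;

    off64_t offset = 0;
    while (1) {
        ssize_t n = splice(fd, &offset, STDOUT_FILENO, NULL,
                           16 * 1024, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n == 0)
            return EXIT_SUCCESS;             /* EOF via splice */
        if (n < 0) {
            if (errno == EINVAL && offset == 0) {
                /* Nothing spliced yet and splice isn't usable here:
                 * fall back to the slow path for the whole file. */
                copy_slow(fd);
                return EXIT_SUCCESS;
            }
            return EXIT_FAILURE;
        }
    }
}

Even this toy has to decide what happens if splice fails partway through the file, which is exactly the kind of complexity the list above is talking about.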
And here's the kicker IMO: there still are bugs. Here's one: you still can't splice from /dev/zero to a pipe:
$ ./fast /dev/zero | pv > /dev/null
Invalid argument
Here's a thread on the kernel mailing list about that: https://lore.kernel.org/all/202105071116.638258236E@keescook/t/. It's slightly unfair to call this a bug since it was intentional - the death of generic splice was a planned affair:
The general loss of generic splice read/write is known.
The ultimate reason for this /dev/zero funkiness is that there's no real demand for it to work, I guess. Instead of directly using /dev/zero, I used actual zero files.
Conclusion
My advice is to use splice where you can, but keep in mind its drawbacks and lack of generality. If you control the types of fds passed in and the filesystem, then you can really go crazy and experience almost zero-copy file copies.
But if you're writing a general tool in the vein of cat or tee, it's probably best to stay away from splice unless you really handle all of the weird cases.