A programmer-friendly I/O abstraction over io_uring and kqueue (2022)

Consider this tale of I/O and performance. We’ll start with blocking
I/O, explore io_uring and kqueue, and take home an event loop very
similar to some software you may find familiar.

This is a twist on King’s talk at Software You Can Love
Milan ’22.

When you want to read from a file you might open() and
then call read() as many times as necessary to fill a
buffer of bytes from the file. And in the opposite direction, you call
write() as many times as needed until everything is
written. It’s similar for a TCP client with sockets, but instead of
open() you first call socket() and then
connect() to your server. Fun stuff.

In the real world though you can’t always read everything you want
immediately from a file descriptor. Nor can you always write everything
you want immediately to a file descriptor.

You can switch
a file descriptor into non-blocking mode so the call won’t block
while data you requested is not available. But system calls are still
expensive, incurring context switches and cache misses. In fact,
networks and disks have become so fast that these costs can start to
approach the cost of doing the I/O itself. For the duration of time a
file descriptor is unable to read or write, you don’t want to waste time
continuously retrying read or write system calls.

So you switch to io_uring on Linux or kqueue on FreeBSD/macOS. (I’m
skipping the generation of epoll/select users.) These APIs let you
submit requests to the kernel to learn about readiness: when a file
descriptor is ready to read or write. You can send readiness requests in
batches (also referred to as queues). Completion events, one for each
submitted request, are available in a separate queue.

Being able to batch I/O like this is especially important for TCP
servers that want to multiplex reads and writes for multiple connected
clients.

However in io_uring, you can even go one step further. Instead of
having to call read() or write() in userland
after a readiness event, you can request that the kernel do the
read() or write() itself with a buffer you
provide. Thus almost all of your I/O is done in the kernel, amortizing
the overhead of system calls.

If you haven’t seen io_uring or kqueue before, you’d probably like an
example! Consider this code: a simple, minimal, not-production-ready TCP
echo server.

const std = @import("std");
const os = std.os;
const linux = os.linux;
const allocator = std.heap.page_allocator;

const State = enum{ accept, recv, send };
const Socket = struct {
    handle: os.socket_t,
    buffer: [1024]u8,
    state: State,
};

pub fn main() !void {
    const entries = 32;
    const flags = 0;
    var ring = try linux.IO_Uring.init(entries, flags);
    defer ring.deinit();

    var server: Socket = undefined;
    server.handle = try os.socket(os.AF.INET, os.SOCK.STREAM, os.IPPROTO.TCP);
    defer os.closeSocket(server.handle);

    const port = 12345;
    var addr = std.net.Address.initIp4(.{127, 0, 0, 1}, port);
    var addr_len: os.socklen_t = addr.getOsSockLen();

    try os.setsockopt(server.handle, os.SOL.SOCKET, os.SO.REUSEADDR, &std.mem.toBytes(@as(c_int, 1)));
    try os.bind(server.handle, &addr.any, addr_len);
    const backlog = 128;
    try os.listen(server.handle, backlog);

    server.state = .accept;
    _ = try ring.accept(@ptrToInt(&server), server.handle, &addr.any, &addr_len, 0);

    while (true) {
        _ = try ring.submit_and_wait(1);

        while (ring.cq_ready() > 0) {
            const cqe = try ring.copy_cqe();
            var client = @intToPtr(*Socket, @intCast(usize, cqe.user_data));

            if (cqe.res < 0) std.debug.panic("{}({}): {}", .{
                client.state,
                client.handle,
                @intToEnum(os.E, -cqe.res),
            });

            switch (client.state) {
                .accept => {
                    client = try allocator.create(Socket);
                    client.handle = @intCast(os.socket_t, cqe.res);
                    client.state = .recv;
                    _ = try ring.recv(@ptrToInt(client), client.handle, .{.buffer = &client.buffer}, 0);
                    _ = try ring.accept(@ptrToInt(&server), server.handle, &addr.any, &addr_len, 0);
                },
                .recv => {
                    const read = @intCast(usize, cqe.res);
                    client.state = .send;
                    _ = try ring.send(@ptrToInt(client), client.handle, client.buffer[0..read], 0);
                },
                .send => {
                    os.closeSocket(client.handle);
                    allocator.destroy(client);
                },
            }
        }
    }
}

This is a great, minimal example. But notice that this code ties
io_uring behavior directly to business logic (in this case, handling
echoing data between request and response). It is fine for a small
example like this. But in a large application you might want to do I/O
throughout the code base, not just in one place. You might not want to
keep adding business logic to this single loop.

Instead, you might want to be able to schedule I/O and pass a
callback (and sometimes with some application context) to be called when
the event is complete.

The interface might look like:

io_dispatch.dispatch({
    // some big struct/union with relevant fields for all event types
}, my_callback);

This is great! Now your business logic can schedule and handle I/O no
matter where in the code base it is.

Under the hood it can decide whether to use io_uring or kqueue
depending on what kernel it’s running on. The dispatch can also batch
these individual calls through io_uring or kqueue to amortize system
calls. The application no longer needs to know the details.

Additionally, we can use this wrapper to stop thinking about
readiness events, just I/O completion. That is, if we dispatch a read
event, the io_uring implementation would actually ask the kernel to read
data into a buffer. Whereas the kqueue implementation would send a
“read” readiness event, do the read back in userland, and then call our
callback.

And finally, now that we’ve got this central dispatcher, we don’t
need spaghetti code in a loop switching on every possible submission and
completion event.

Every time we call io_uring or kqueue we both submit event requests
and poll for completion events. The io_uring and kqueue APIs tie these
two actions together in the same system call.

To sync our requests to io_uring or kqueue we’ll build a
flush function that submits requests and polls for
completion events. (In the next section we’ll talk about how the user of
the central dispatch learns about completion events.)

To make flush more convenient, we’ll build a nice
wrapper around it so that we can submit as many requests (and process as
many completion events) as possible. To avoid accidentally blocking
indefinitely we’ll also introduce a time limit. We’ll call the wrapper
run_for_ns.

Finally we’ll put the user in charge of setting up a loop to call
this run_for_ns function, independent of normal program
execution.

This is now your traditional event loop.

You may have noticed that in the API above we passed a callback. The
idea is that after the requested I/O has completed, our callback should
be invoked. But the question remains: how to track this callback between
the submission and completion queue?

Thankfully, io_uring and kqueue events have user data fields. The
user data field is opaque to the kernel. When a submitted event
completes, the kernel sends a completion event back to userland
containing the user data value from the submission event.

We can store the callback in the user data field by setting it to the
callback’s pointer casted to an integer. When the completion for a
requested event comes up, we cast from the integer in the user data
field back to the callback pointer. Then, we invoke the callback.

As described above, the struct for io_dispatch.dispatch
could get quite large handling all the different kinds of I/O events and
their arguments. We could make our API a little more expressive by
creating wrapper functions for each event type.

So if we wanted to schedule a read function we could call:

io_dispatch.read(fd, &buf, nBytesToRead, callback);

Or to write, similarly:

io_dispatch.write(fd, buf, nBytesToWrite, callback);

One more thing we need to worry about is that the batch we pass to
io_uring or kqueue has a fixed size (technically, kqueue allows any
batch size but using that might introduce unnecessary allocations). So
we’ll build our own queue on top of our I/O abstraction to keep track of
requests that we could not immediately submit to io_uring or kqueue.

To keep this API simple we could allocate for each entry in the
queue. Or we could modify the io_dispatch.X calls slightly
to accept a struct that can be used in an intrusive
linked list to contain all request context, including the callback.
The latter is what
we do in TigerBeetle.

Put another way: every time code calls io_dispatch,
we’ll try to immediately submit the requested event to io_uring or
kqueue. But if there’s no room, we store the event in an overflow
queue.

The overflow queue needs to be processed eventually, so we update our
flush function (described in Callbacks and context above) to pull
as many events from our overflow queue before submitting a batch to
io_uring or kqueue.

We’ve now built something similar to libuv, the I/O library that
Node.js uses. And if you squint, it is basically TigerBeetle’s I/O
library! (And interestingly enough, TigerBeetle’s I/O code was adopted
into Bun! Open-source for the win!)

Let’s check out how the Darwin
version of TigerBeetle’s I/O library (with kqueue) differs from the
Linux
version. As mentioned, the complete send call in the
Darwin implementation waits for file descriptor readiness (through
kqueue). Once ready, the actual send call is made back in
userland:

pub fn send(
    self: *IO,
    comptime Context: type,
    context: Context,
    comptime callback: fn (
        context: Context,
        completion: *Completion,
        result: SendError!usize,
    ) void,
    completion: *Completion,
    socket: os.socket_t,
    buffer: []const u8,
) void {
    self.submit(
        context,
        callback,
        completion,
        .send,
        .{
            .socket = socket,
            .buf = buffer.ptr,
            .len = @intCast(u32, buffer_limit(buffer.len)),
        },
        struct {
            fn do_operation(op: anytype) SendError!usize {
                return os.send(op.socket, op.buf[0..op.len], 0);
            }
        },
    );
}

Compare this to the Linux
version (with io_uring) where the kernel handles everything and
there is no send system call in userland:

pub fn send(
    self: *IO,
    comptime Context: type,
    context: Context,
    comptime callback: fn (
        context: Context,
        completion: *Completion,
        result: SendError!usize,
    ) void,
    completion: *Completion,
    socket: os.socket_t,
    buffer: []const u8,
) void {
    completion.* = .{
        .io = self,
        .context = context,
        .callback = struct {
            fn wrapper(ctx: ?*anyopaque, comp: *Completion, res: *const anyopaque) void {
                callback(
                    @intToPtr(Context, @ptrToInt(ctx)),
                    comp,
                    @intToPtr(*const SendError!usize, @ptrToInt(res)).*,
                );
            }
        }.wrapper,
        .operation = .{
            .send = .{
                .socket = socket,
                .buffer = buffer,
            },
        },
    };
    // Fill out a submission immediately if possible, otherwise adds to overflow buffer
    self.enqueue(completion);
}

Similarly, take a look at flush on Linux
and macOS
for event processing. Look at run_for_ns on Linux
and macOS
for the public API users must call. And finally, look at what puts this
all into practice, the loop calling run_for_ns in
src/main.zig.

We’ve come this far and you might be wondering — what about
cross-platform support for Windows? The good news is that Windows also
has a completion based system similar to io_uring but without batching,
called IOCP.
And for bonus points, TigerBeetle provides the same
I/O abstraction over it! But it’s enough to cover just Linux and
macOS in this post. 🙂

In both this blog post and in TigerBeetle, we implemented a
single-threaded event loop. Keeping I/O code single-threaded in
userspace is beneficial (whether or not I/O processing is
single-threaded in the kernel is not our concern). It’s the simplest
code and best for workloads that are not embarrassingly parallel. It is
also best for determinism, which is integral to the design of
TigerBeetle because it enables us to do Deterministic Simulation
Testing

But there are other valid architectures for other workloads.

For workloads that are embarrassingly parallel, like many web
servers, you could instead use multiple threads where each thread has
its own queue. In optimal conditions, this architecture has the highest
I/O throughput possible.

But if each thread has its own queue, individual threads can become
starved if an uneven amount of work is scheduled on one thread. In the
case of dynamic amounts of work, the better architecture would be to
have a single queue but multiple worker threads doing the work made
available on the queue.

Hey, maybe we’ll split this out so you can use it too. It’s written
in Zig so we can easily expose a C API. Any language with a C foreign
function interface (i.e. every language) should work well with it. Keep
an eye on our GitHub.
🙂

Additional resources:

Subscribe to Updates

What's Hot

A programmer-friendly I/O abstraction over io_uring and kqueue (2022)

A programmer-friendly I/O abstraction over io_uring and kqueue (2022)

Related Posts