PGWM 0.4, io-uring, stability, and static pie linking

A while back I decided to look into io-uring as an event-loop for pgwm. I should have written about it when I implemented it, but I couldn't find the time then.

Now that I finally got pgwm to compile using the stable toolchain, I'm going to write a bit about the way there.

Io-uring

Io-uring is a Linux syscall interface that allows you to submit io-tasks and later collect the results of those tasks. It does so by providing two ring buffers: one for submissions and one for completions.

In the simplest possible terms, you put some tasks on one queue, and later collect their results from another queue. In practice, it's a lot less simple than that.
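To make the two-queue idea concrete, here is a toy model in plain Rust. It is only a sketch: nothing in it talks to the kernel, and the names Ring, Submission, Completion and fake_kernel are made up for illustration. What it does show is the shape of the thing: power-of-two rings indexed by free-running head and tail counters, and a user_data tag that lets you match a completion to its submission.

```rust
/// Fixed-capacity ring indexed by free-running head/tail counters.
struct Ring<T> {
    slots: Vec<Option<T>>,
    mask: u32, // capacity - 1, so the capacity must be a power of two
    head: u32, // next slot the consumer reads
    tail: u32, // next slot the producer writes
}

impl<T> Ring<T> {
    fn new(capacity: u32) -> Self {
        assert!(capacity.is_power_of_two());
        Ring {
            slots: (0..capacity).map(|_| None).collect(),
            mask: capacity - 1,
            head: 0,
            tail: 0,
        }
    }

    fn is_full(&self) -> bool {
        self.tail.wrapping_sub(self.head) > self.mask
    }

    /// Hands the value back if there is no free slot.
    fn push(&mut self, value: T) -> Result<(), T> {
        if self.is_full() {
            return Err(value);
        }
        let slot = (self.tail & self.mask) as usize;
        self.slots[slot] = Some(value);
        self.tail = self.tail.wrapping_add(1);
        Ok(())
    }

    fn pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None;
        }
        let slot = (self.head & self.mask) as usize;
        self.head = self.head.wrapping_add(1);
        self.slots[slot].take()
    }
}

#[derive(Debug, PartialEq)]
struct Submission {
    user_data: u64,
    len: u32,
}

#[derive(Debug, PartialEq)]
struct Completion {
    user_data: u64,
    result: i32,
}

/// Stand-in for the kernel: drains submissions and posts one completion each.
/// The "result" is just the requested length, as if every read succeeded.
fn fake_kernel(sq: &mut Ring<Submission>, cq: &mut Ring<Completion>) {
    while !cq.is_full() {
        let Some(task) = sq.pop() else { break };
        let done = Completion { user_data: task.user_data, result: task.len as i32 };
        cq.push(done).expect("checked for space above");
    }
}

fn main() {
    let mut sq = Ring::new(4);
    let mut cq = Ring::new(8);
    for id in 1..=3u64 {
        sq.push(Submission { user_data: id, len: 32 }).expect("queue has room");
    }
    fake_kernel(&mut sq, &mut cq);
    while let Some(done) = cq.pop() {
        println!("task {} finished with result {}", done.user_data, done.result);
    }
}
```

The real rings live in memory mapped from the kernel and the counters are updated atomically, which is where most of the extra complexity comes from.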

As I've written about in previous entries on this website, I decided to scrap the std-lib and libc, and write my own syscall interface in tiny-std.
Therefore I had to look into the gritty details of how to set up these buffers. You can see those details here, or look at the C implementation which I ripped off here.

Why io-uring?

I've written before about my x11-wm pgwm, but in short: it's an x11-wm based on async socket communication, where the WM reacts to incoming messages, like a key-press, and responds with some set of outgoing messages on that same socket.
When the WM had nothing to do, it used the poll interface to await another message.

So the loop could be summed up as:

1. Poll until there's a message on the socket.
2. Read from the socket.
3. Handle the message.
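The three steps above can be sketched like this. Big caveat: this is not pgwm's code. pgwm goes through tiny-std's own syscall wrappers, while this sketch assumes a normal std build on Linux and declares libc's poll by hand. A UnixStream pair stands in for the X11 socket, and handle, wait_readable and run_once are made-up names.

```rust
use std::io::{self, Read, Write};
use std::os::raw::{c_int, c_ulong};
use std::os::unix::io::AsRawFd;
use std::os::unix::net::UnixStream;

/// Mirrors C's `struct pollfd`.
#[repr(C)]
struct PollFd {
    fd: c_int,
    events: i16,
    revents: i16,
}

const POLLIN: i16 = 0x001;

// libc's poll(2). std already links libc on Linux, so the symbol resolves.
unsafe extern "C" {
    fn poll(fds: *mut PollFd, nfds: c_ulong, timeout: c_int) -> c_int;
}

/// Step 1: block until the socket has something to read.
fn wait_readable(stream: &UnixStream) -> io::Result<()> {
    let mut pfd = PollFd { fd: stream.as_raw_fd(), events: POLLIN, revents: 0 };
    // A timeout of -1 means "wait forever". EINTR is not retried in this sketch.
    let rc = unsafe { poll(&mut pfd, 1, -1) };
    if rc < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}

/// Step 3: placeholder for whatever the WM does with a message.
fn handle(message: &[u8]) -> String {
    format!("handled {} bytes", message.len())
}

/// One turn of the loop: poll, read, handle.
fn run_once(stream: &mut UnixStream) -> io::Result<String> {
    wait_readable(stream)?;
    let mut buf = [0u8; 64];
    let n = stream.read(&mut buf)?; // step 2
    Ok(handle(&buf[..n]))
}

fn main() -> io::Result<()> {
    let (mut server, mut wm) = UnixStream::pair()?;
    server.write_all(b"key-press")?;
    println!("{}", run_once(&mut wm)?);
    Ok(())
}
```

That is two syscalls per message (poll, then read), which is the overhead the io-uring version tries to shave off.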

With io-uring that could be compacted to:

1. Read from the socket when there are bytes available.
2. Handle the message.

io-uring sounded cool, and this seemed efficient, so off I went.

Why not io-uring?

Io-uring is complex. The set-up is complex, and there are quite a few considerations that need to be made. Ring-buffers are set up, but how big should they be? What if we get an incoming message pile-up? What if we get an outgoing message pile-up? When is the best time to flush the buffers? What settings should I put on the uring?

There are more considerations than that, but I didn't really need to tackle most of these issues. I'm not shipping a production-ready lib that I'll support indefinitely, I'm just messing around with my WM. I cranked up the buffer size to more than necessary, and it works fine.

Something that I did consider, however, was whether to use SQ-poll. More on what that is below.

Sharing memory with the kernel

Something that theoretically makes io-uring more efficient than other io-alternatives is that the ring-buffers are shared with the kernel. There is no need to make a separate syscall for each sent message: if you put a message on the buffer and update its offset through an atomic operation, it becomes available for the kernel to use.
But the kernel does need to find out about the submission somehow, beyond the updated state. There are two ways of doing this:

1. Make a syscall. Write an arbitrary number of tasks to the submission queue, then tell the kernel about them through a syscall. That same syscall can also be used to wait until there are completions available, so it's very flexible.
2. Have the kernel poll the shared memory for changes in the queue-offset and pick tasks up as they're added (SQPoll). Potentially, this is a large latency-decrease as well as a throughput increase, no more waiting for syscalls!

I thought this sounded great. In practice, however, SQPoll resulted in a massive cpu-usage increase. I couldn't tolerate that, so I'll have to save that setting for a different project. In the end io-uring didn't change much about pgwm.
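The polling variant, and why it eats CPU, can be modelled with two threads and a few atomics. This is a rough sketch under heavy assumptions: a plain thread plays the kernel's poller, the queue carries bare u64 values instead of real submission entries, and SharedQueue, submit, poll_until_stopped and run_demo are made-up names. The part that carries over is the ordering: write the entry first, then publish it with a release-store of the tail, which the other side reads with an acquire-load.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

const ENTRIES: usize = 8; // must be a power of two
const MASK: u32 = ENTRIES as u32 - 1;

struct SharedQueue {
    slots: [AtomicU64; ENTRIES],
    tail: AtomicU32, // written by the submitter only
    head: AtomicU32, // written by the poller only
    stop: AtomicBool,
}

impl SharedQueue {
    fn new() -> Self {
        SharedQueue {
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
            tail: AtomicU32::new(0),
            head: AtomicU32::new(0),
            stop: AtomicBool::new(false),
        }
    }

    /// Submitter side: write the entry, then publish it by bumping the tail.
    /// No syscall involved. Returns false if the queue is full.
    fn submit(&self, value: u64) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) as usize == ENTRIES {
            return false;
        }
        self.slots[(tail & MASK) as usize].store(value, Ordering::Relaxed);
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    /// Poller side: spin on the tail and pick entries up as they appear.
    /// Returns what it saw and how many times it looked and found nothing.
    fn poll_until_stopped(&self) -> (Vec<u64>, u64) {
        let mut seen = Vec::new();
        let mut idle_spins = 0u64;
        let mut head = 0u32;
        loop {
            // Read `stop` before `tail`, so entries published before the
            // stop flag was set are never missed.
            let stopping = self.stop.load(Ordering::Acquire);
            let tail = self.tail.load(Ordering::Acquire);
            if head != tail {
                seen.push(self.slots[(head & MASK) as usize].load(Ordering::Relaxed));
                head = head.wrapping_add(1);
                self.head.store(head, Ordering::Release);
            } else if stopping {
                return (seen, idle_spins);
            } else {
                idle_spins += 1; // this is where the CPU time goes
                std::hint::spin_loop();
            }
        }
    }
}

fn run_demo(values: &[u64]) -> (Vec<u64>, u64) {
    let queue = Arc::new(SharedQueue::new());
    let poller = {
        let queue = Arc::clone(&queue);
        thread::spawn(move || queue.poll_until_stopped())
    };
    // Nothing is happening here, but the poller is still burning a core.
    thread::sleep(Duration::from_millis(10));
    for &value in values {
        while !queue.submit(value) {
            std::hint::spin_loop();
        }
    }
    queue.stop.store(true, Ordering::Release);
    poller.join().expect("poller thread panicked")
}

fn main() {
    let (seen, idle_spins) = run_demo(&[10, 20, 30]);
    println!("picked up {:?} after {} idle spins", seen, idle_spins);
}
```

The idle spin count is the interesting number: a window manager spends most of its time waiting, and a poller that never sleeps pays for every moment of it. The kernel's poller does go to sleep after a configurable idle period, so treat this as a worst-case caricature rather than a measurement.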

Stable

Since I ripped out libc, pgwm has required nightly to build, which has bothered me quite a bit. The nightly compiler was necessary because tiny-std used the #[naked] feature to create the assembly entrypoint (the _start function), where the application starts execution.

Asm to global_asm

To be able to get aux-values, the environment variable pointer, and the arguments passed to the binary, access to the stack-pointer at its start-position is required. Therefore, a function that doesn't mess up the stack needs to be injected, passing that pointer to a normal function that can extract what's necessary.

An example:

/// Binary entrypoint
#[naked]
#[no_mangle]
#[cfg(all(feature = "symbols", feature = "start"))]
pub unsafe extern "C" fn _start() {
    // Naked function making sure that main gets the first stack address as an arg
    #[cfg(target_arch = "x86_64")]
    {
        core::arch::asm!("mov rdi, rsp", "call __proxy_main", options(noreturn))
    }
    #[cfg(target_arch = "aarch64")]
    {
        core::arch::asm!("MOV X0, sp", "bl __proxy_main", options(noreturn))
    }
}
/// Called with a pointer to the top of the stack
#[no_mangle]
#[cfg(all(feature = "symbols", feature = "start"))]
unsafe fn __proxy_main(stack_ptr: *const u8) {
    // First 8 bytes is a u64 with the number of arguments
    let argc = *(stack_ptr as *const u64);
    // Directly followed by those arguments, bump pointer by 8
    let argv = stack_ptr.add(8) as *const *const u8;
    let ptr_size = core::mem::size_of::<usize>();
    // The argument pointers end with a null pointer, directly followed by the pointers to the environment variables (null terminated strings).
    // This isn't specified in Posix and is not great for portability, but we're targeting Linux so it's fine
    let env_offset = 8 + argc as usize * ptr_size + ptr_size;
    // Bump pointer by combined offset
    let envp = stack_ptr.add(env_offset) as *const *const u8;
    unsafe {
        ENV.arg_c = argc;
        ENV.arg_v = argv;
        ENV.env_p = envp;
    }
    ...etc

I got this from an article by fasterthanli.me, but later realized that you can use the global_asm macro to generate the full function instead:

// Binary entrypoint
#[cfg(all(feature = "symbols", feature = "start", target_arch = "x86_64"))]
core::arch::global_asm!(
    ".text",
    ".global _start",
    ".type _start,@function",
    "_start:",
    "mov rdi, rsp",
    "call __proxy_main"
);
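The offset arithmetic that __proxy_main does on the stack pointer can be tried out without a real process start. The sketch below builds a fake start block in an array (argc, the argv pointers, a null, the envp pointers, a null) and parses it the same way. StartInfo, parse_start_block, collect_strings and fake_stack_demo are made-up names, and the word size comes from size_of::<usize>() instead of a hard-coded 8. The aux-values that follow the environment on a real stack are left out.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;
use std::ptr;

struct StartInfo {
    argc: usize,
    argv: *const *const c_char,
    envp: *const *const c_char,
}

/// # Safety
/// `stack_ptr` must point at a word-aligned block laid out as
/// [argc][argv pointers][null][envp pointers][null].
unsafe fn parse_start_block(stack_ptr: *const u8) -> StartInfo {
    let word = std::mem::size_of::<usize>();
    unsafe {
        // The first word is the number of arguments.
        let argc = *(stack_ptr as *const usize);
        // The argument pointers start one word in.
        let argv = stack_ptr.add(word) as *const *const c_char;
        // Skip argc, the argc argument pointers, and the null after them.
        let envp = stack_ptr.add(word + argc * word + word) as *const *const c_char;
        StartInfo { argc, argv, envp }
    }
}

/// # Safety
/// `list` must point at a null-terminated array of valid C strings.
unsafe fn collect_strings(mut list: *const *const c_char) -> Vec<String> {
    let mut out = Vec::new();
    unsafe {
        while !(*list).is_null() {
            out.push(CStr::from_ptr(*list).to_string_lossy().into_owned());
            list = list.add(1);
        }
    }
    out
}

fn fake_stack_demo() -> (usize, Vec<String>, Vec<String>) {
    // A hand-built stand-in for what the kernel leaves at the stack pointer.
    let fake_stack: [*const c_char; 6] = [
        2usize as *const c_char, // argc, stored in a pointer-sized slot
        b"pgwm\0".as_ptr() as *const c_char,
        b"--verbose\0".as_ptr() as *const c_char,
        ptr::null(), // end of argv
        b"HOME=/root\0".as_ptr() as *const c_char,
        ptr::null(), // end of envp
    ];
    let info = unsafe { parse_start_block(fake_stack.as_ptr() as *const u8) };
    let args = unsafe { collect_strings(info.argv) };
    let env = unsafe { collect_strings(info.envp) };
    (info.argc, args, env)
}

fn main() {
    let (argc, args, env) = fake_stack_demo();
    println!("argc = {}, args = {:?}, env = {:?}", argc, args, env);
}
```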

Symbols

While this means that tiny-std itself could potentially be part of a binary compiled with stable, if one would like to use for example alloc to have an allocator, then rustc would start emitting calls to symbols like memcpy, which Rust doesn't provide for some reason.

The solution to the missing symbols is simple enough. These symbols are provided in the external compiler-builtins library, but that uses a whole host of features that require nightly. So I copied the implementation (and license), removing dependencies on nightly features, and exposed the symbols in tiny-std.
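For a feel of what such a symbol boils down to, here is a naive byte-by-byte copy with memcpy's C signature. It is a sketch only, not the compiler-builtins code that tiny-std copied, and it is deliberately named byte_copy (a made-up name) so it doesn't clash with libc's memcpy in a normal std build. In a freestanding crate the function would instead be exported under the name memcpy, as an extern "C" function with a no_mangle attribute.

```rust
/// Copies `n` bytes from `src` to `dest` and returns `dest`, like C's memcpy.
///
/// # Safety
/// Both pointers must be valid for `n` bytes and the regions must not overlap.
pub unsafe fn byte_copy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let mut i = 0;
    while i < n {
        // One byte at a time: slow, but it depends on nothing else.
        unsafe { dest.add(i).write(src.add(i).read()) };
        i += 1;
    }
    dest
}

fn copy_demo() -> [u8; 5] {
    let src = *b"hello";
    let mut dest = [0u8; 5];
    unsafe { byte_copy(dest.as_mut_ptr(), src.as_ptr(), src.len()) };
    dest
}

fn main() {
    println!("{}", String::from_utf8_lossy(&copy_demo()));
}
```

One thing to watch out for when doing this for real: an optimizer may recognize a copy loop like this one and replace it with a call to memcpy, which is not what you want inside memcpy itself.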

Now an application (like pgwm) can be built with the stable toolchain using tiny-std.

Static

In my boot-writeup I wrote about creating a minimal rust bootloader. A problem I encountered was that it needed an interpreter. You can't see it with ldd:

[21:55:04 gramar@grarch marcusgrass.github.io]$ ldd ../pgwm/target/x86_64-unknown-linux-gnu/lto/pgwm
        statically linked

Ldd lies (or maybe technically not). Using file:

file ../pgwm/target/x86_64-unknown-linux-gnu/lto/pgwm
../pgwm/target/x86_64-unknown-linux-gnu/lto/pgwm: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=9b54c91e5e84a8d3c90fdb9523f46e09cbf5c6e2, stripped

Or readelf -S:

[21:57:21 gramar@grarch marcusgrass.github.io]$ readelf -S ../pgwm/target/x86_64-unknown-linux-gnu/lto/pgwm
There are 18 section headers, starting at offset 0x16a0b0:
Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .interp           PROGBITS         00000000000002a8  000002a8
       000000000000001c  0000000000000000   A       0     0     1
  [ 2] .note.gnu.bu[...] NOTE             00000000000002c4  000002c4
       0000000000000024  0000000000000000   A       0     0     4
  [ 3] .gnu.hash         GNU_HASH         00000000000002e8  000002e8
       000000000000001c  0000000000000000   A       4     0     8
  [ 4] .dynsym           DYNSYM           0000000000000308  00000308
       0000000000000018  0000000000000018   A       5     1     8
  [ 5] .dynstr           STRTAB           0000000000000320  00000320
       0000000000000001  0000000000000000   A       0     0     1
  [ 6] .rela.dyn         RELA             0000000000000328  00000328
       0000000000008310  0000000000000018   A       4     0     8
  [ 7] .text             PROGBITS         0000000000009000  00009000
       000000000013d5a4  0000000000000000  AX       0     0     16
  [ 8] .rodata           PROGBITS         0000000000147000  00147000
       000000000000eb20  0000000000000000   A       0     0     32
  [ 9] .eh_frame_hdr     PROGBITS         0000000000155b20  00155b20
       0000000000001a8c  0000000000000000   A       0     0     4
  [10] .eh_frame         PROGBITS         00000000001575b0  001575b0
       000000000000c1dc  0000000000000000   A       0     0     8
  [11] .gcc_except_table PROGBITS         000000000016378c  0016378c
       000000000000000c  0000000000000000   A       0     0     4
  [12] .data.rel.ro      PROGBITS         0000000000164e28  00163e28
       0000000000006088  0000000000000000  WA       0     0     8
  [13] .dynamic          DYNAMIC          000000000016aeb0  00169eb0
       0000000000000110  0000000000000010  WA       5     0     8
  [14] .got              PROGBITS         000000000016afc0  00169fc0
       0000000000000040  0000000000000008  WA       0     0     8
  [15] .data             PROGBITS         000000000016b000  0016a000
       0000000000000008  0000000000000000  WA       0     0     8
  [16] .bss              NOBITS           000000000016b008  0016a008
       0000000000000458  0000000000000000  WA       0     0     8
  [17] .shstrtab         STRTAB           0000000000000000  0016a008
       00000000000000a8  0000000000000000           0     0     1

Both file and readelf (the .interp section) show that this binary needs an interpreter, that being /lib64/ld-linux-x86-64.so.2. If the binary is run in an environment without it, it will immediately crash.
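The same check can be done by hand. The interpreter request is recorded as a PT_INTERP entry in the ELF program headers, so a few reads at fixed offsets are enough to find it. The sketch below only handles 64-bit little-endian ELF files, and find_interpreter, read_le and tiny_elf are made-up names. tiny_elf builds a minimal fake file so the parser can be exercised without a real binary.

```rust
const PT_INTERP: u32 = 3;

/// Reads a little-endian unsigned integer of `width` bytes (at most 8).
fn read_le(bytes: &[u8], offset: usize, width: usize) -> Option<u64> {
    let field = bytes.get(offset..offset.checked_add(width)?)?;
    Some(field.iter().rev().fold(0u64, |acc, &b| (acc << 8) | b as u64))
}

/// Returns the requested interpreter path, if the file asks for one.
fn find_interpreter(elf: &[u8]) -> Option<String> {
    // Magic number, then class 2 (64-bit) and data encoding 1 (little-endian).
    if elf.get(..4)? != &b"\x7fELF"[..] || *elf.get(4)? != 2 || *elf.get(5)? != 1 {
        return None;
    }
    let phoff = read_le(elf, 0x20, 8)? as usize; // e_phoff
    let phentsize = read_le(elf, 0x36, 2)? as usize; // e_phentsize
    let phnum = read_le(elf, 0x38, 2)? as usize; // e_phnum
    for i in 0..phnum {
        let ph = phoff.checked_add(i.checked_mul(phentsize)?)?;
        if read_le(elf, ph, 4)? as u32 != PT_INTERP {
            continue;
        }
        let offset = read_le(elf, ph.checked_add(8)?, 8)? as usize; // p_offset
        let size = read_le(elf, ph.checked_add(32)?, 8)? as usize; // p_filesz
        let path = elf.get(offset..offset.checked_add(size)?)?;
        // The stored path includes its trailing NUL byte, drop it.
        let path = match path.split_last() {
            Some((&0, rest)) => rest,
            _ => path,
        };
        return Some(String::from_utf8_lossy(path).into_owned());
    }
    None
}

/// Builds a fake ELF: a 64-byte header, one 56-byte program header, and
/// optionally an interpreter path right after them.
fn tiny_elf(interp: Option<&str>) -> Vec<u8> {
    let mut elf = vec![0u8; 64 + 56];
    elf[..4].copy_from_slice(b"\x7fELF");
    elf[4] = 2;
    elf[5] = 1;
    elf[0x20..0x28].copy_from_slice(&64u64.to_le_bytes());
    elf[0x36..0x38].copy_from_slice(&56u16.to_le_bytes());
    elf[0x38..0x3a].copy_from_slice(&1u16.to_le_bytes());
    if let Some(path) = interp {
        elf[64..68].copy_from_slice(&PT_INTERP.to_le_bytes());
        elf[72..80].copy_from_slice(&120u64.to_le_bytes());
        elf[96..104].copy_from_slice(&(path.len() as u64 + 1).to_le_bytes());
        elf.extend_from_slice(path.as_bytes());
        elf.push(0);
    }
    elf
}

fn main() {
    let demo = tiny_elf(Some("/lib64/ld-linux-x86-64.so.2"));
    println!("synthetic: {:?}", find_interpreter(&demo));
    // Optionally inspect a real file passed as the first argument.
    if let Some(path) = std::env::args().nth(1) {
        match std::fs::read(&path) {
            Ok(bytes) => println!("{}: {:?}", path, find_interpreter(&bytes)),
            Err(err) => eprintln!("{}: {}", path, err),
        }
    }
}
```

A binary that doesn't request an interpreter has no PT_INTERP entry, so the function returns None for it.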

If compiled statically with RUSTFLAGS='-C target-feature=+crt-static' the application segfaults, oof.

I haven't figured out how to make tiny-std run as a position-independent executable. Or rather, I know why it fails: all the addresses to symbols (like static variables) are wrong. What I don't know yet is how to fix it.

There is a no-code way of fixing it though: RUSTFLAGS='-C target-feature=+crt-static -C relocation-model=static'.
This way the application will be statically linked, without requiring an interpreter, but it will not be position independent.

If you know how to make a statically linked, position-independent build work, please tell me, because figuring that out isn't easy.

Future plans

I'm tentatively looking into making threading work, but that is a lot of work and a lot of segfaults on the way.