PGWM 0.3, tiny-std, and xcb-parse
I recently made a substantial rewrite of my (now) pure rust x11 window manager and want to collect my thoughts on it somewhere.
X11 and the Linux desktop
PGWM is an educational experience into Linux desktop environments. The X Window System first came about in 1984, and its X11 protocol has for a long time been the only mainstream way for GUI applications on Linux to show what they need on screen for their users.
When working on desktop applications for Linux, the intricacies of that protocol are mostly hidden by the desktop
frameworks a developer might encounter. In Rust, the cross-platform library winit can be used for this purpose,
and applications written in Rust, like the terminal emulator Alacritty, use winit.
At the core of the Linux desktop experience lies the Window Manager, either alone or accompanied by a Desktop Environment (DE). The Window Manager makes decisions on how windows are displayed.
The concept of a Window
Window is a loose term often used to describe some surface that can be drawn to on screen.
In X11, a window is a u32 id that the xorg-server keeps information about. It has properties, such as a height and
width, it can be visible or not visible, and it enables the developer to ask the server to subscribe to events.
WM inner workings and X11 (no compositor)
X11 works by starting the xorg-server. The xorg-server takes care of collecting input
from HIDs like the keyboard and mouse, collecting information about device state,
such as when a screen is connected or disconnected,
and coordinating messages from running applications including the Window Manager.
This communication goes over a socket, TCP or Unix. The default is /tmp/.X11-unix/X0 for a single-display desktop
Linux environment.
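A minimal sketch of finding and opening that socket, using only the standard library. The function names are made up for illustration, and it assumes the conventional /tmp/.X11-unix/X<display> naming for local displays:

```rust
use std::io;
use std::os::unix::net::UnixStream;
use std::path::PathBuf;

/// Builds the conventional Unix socket path for a local display number,
/// for example display 0 gives /tmp/.X11-unix/X0.
fn x11_socket_path(display: u32) -> PathBuf {
    PathBuf::from(format!("/tmp/.X11-unix/X{display}"))
}

/// Opens a stream to the local X server for `display`.
/// Returns an io::Error if nothing is listening on that socket.
fn connect_local(display: u32) -> io::Result<UnixStream> {
    UnixStream::connect(x11_socket_path(display))
}
```

Calling connect_local(0) on a machine without a running X server simply returns an error, which is a quick way to check whether a server is up.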
The details of the communication are specified in XML files in Xorg's GitLab repo xcbproto.
The repo contains language bindings, XML schemas that specify how an object passed over the socket should be structured
to be recognized by the xorg-server.
The name for the language bindings is XCB for 'X protocol C-language Binding'.
Having this kind of protocol means that a developer who can't or won't directly link to and use the xlib C-library
can instead construct their own representations of those objects and send those over the socket.
In PGWM a Rust language representation of these objects is used, containing serialization and deserialization methods
that turn Rust structs into raw bytes that can be transmitted on the socket.
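As a rough illustration of what such a representation can look like, here is a hand-written sketch of the core MapWindow request (opcode 8, two 4-byte words). The struct and method names are hypothetical, not PGWM's generated code, and the sketch assumes little-endian byte order was chosen during connection setup:

```rust
/// Hypothetical Rust-side representation of the core MapWindow request.
#[derive(Debug, PartialEq)]
struct MapWindow {
    window: u32,
}

impl MapWindow {
    const OPCODE: u8 = 8;
    /// Request length in 4-byte units: one header word plus the window id.
    const LENGTH_WORDS: u16 = 2;

    /// Turns the struct into the bytes that go over the socket.
    fn serialize(&self) -> [u8; 8] {
        let len = Self::LENGTH_WORDS.to_le_bytes();
        let win = self.window.to_le_bytes();
        // Layout: opcode, unused byte, length, window id.
        [Self::OPCODE, 0, len[0], len[1], win[0], win[1], win[2], win[3]]
    }

    /// Reads the struct back from raw bytes, rejecting other opcodes.
    fn deserialize(bytes: &[u8; 8]) -> Option<Self> {
        if bytes[0] != Self::OPCODE {
            return None;
        }
        let window = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
        Some(Self { window })
    }
}
```

Generated code does the same thing for every request, reply, and event in the protocol files, which is why hand-writing it does not scale.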
If PGWM is launched through xinit, an xorg-server is started at the beginning of that script. If PGWM is launched inside that script, it will try to become that server's Window Manager.
When an application starts within the context of X11, a handshake takes place. The application asks for setup
information from the server, and if the server replies with a success the application can start interfacing
with the server.
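The opening message of that handshake is small enough to sketch by hand. This version assumes a little-endian client, protocol version 11.0, and no authorization data; a real server will usually also want a cookie from ~/.Xauthority, which is left out here. The names are illustrative only:

```rust
/// Builds a connection setup request without authorization data.
fn setup_request() -> [u8; 12] {
    let mut buf = [0u8; 12];
    buf[0] = b'l'; // byte order: 'l' for little-endian, 'B' for big-endian
    buf[2..4].copy_from_slice(&11u16.to_le_bytes()); // protocol major version
    buf[4..6].copy_from_slice(&0u16.to_le_bytes()); // protocol minor version
    // Bytes 6..8 and 8..10 hold the auth name and auth data lengths, both 0 here.
    buf
}

/// Outcome signalled by the first byte of the server's setup reply.
#[derive(Debug, PartialEq)]
enum SetupStatus {
    Failed,
    Success,
    Authenticate,
}

fn setup_status(first_reply_byte: u8) -> Option<SetupStatus> {
    match first_reply_byte {
        0 => Some(SetupStatus::Failed),
        1 => Some(SetupStatus::Success),
        2 => Some(SetupStatus::Authenticate),
        _ => None,
    }
}
```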
In a WM's case, it will request to set the SubstructureRedirectMask on the root X11 window.
Only one application can have that mask on the root window at a given time. Therefore, there can only be one WM active
for a running xorg-server.
If the change is granted, layout change requests will be sent to the WM. From then on the WM can make decisions on the
placements of windows.
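On the wire, that request is a ChangeWindowAttributes on the root window with the event mask set. A sketch under the assumption of little-endian byte order; the function name is made up, and SubstructureNotify is included only because window managers commonly select it alongside the redirect mask:

```rust
const CHANGE_WINDOW_ATTRIBUTES: u8 = 2; // core request opcode
const CW_EVENT_MASK: u32 = 1 << 11; // value-mask bit that selects the event mask
const SUBSTRUCTURE_NOTIFY: u32 = 1 << 19;
const SUBSTRUCTURE_REDIRECT: u32 = 1 << 20;

/// Serializes the request that asks the server to redirect layout changes
/// on `root` to this client.
fn become_wm_request(root: u32) -> [u8; 16] {
    let mut buf = [0u8; 16];
    buf[0] = CHANGE_WINDOW_ATTRIBUTES;
    // Length in 4-byte units: 3 fixed words plus 1 value word.
    buf[2..4].copy_from_slice(&4u16.to_le_bytes());
    buf[4..8].copy_from_slice(&root.to_le_bytes());
    buf[8..12].copy_from_slice(&CW_EVENT_MASK.to_le_bytes());
    let events = SUBSTRUCTURE_REDIRECT | SUBSTRUCTURE_NOTIFY;
    buf[12..16].copy_from_slice(&events.to_le_bytes());
    buf
}
```

If another client already holds the redirect mask on the root window, the server answers with an Access error instead of granting the change.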
When an application wants to be displayed on screen it asks the server to map its window, and the server forwards that
to the WM as a MapRequest. When the WM gets that request it will
make a decision whether that window will be shown, and its dimensions, and forward that decision to the xorg-server
which is responsible for drawing it on screen. Changing window dimensions works much the same way.
A large part of the trickiness of writing a WM, apart from the plumbing of getting the socket communication right, is
handling focus.
In X11, focus determines which window will receive user input. Aside from the WM making the decision of what should
be focused at some given time, some Events will by default trigger focus changes, making careful reading of the
protocol an important part of finding maddening bugs.
What is currently focused can be requested from the xorg-server by any application, and notifications on focus changes
are produced if requested. In PGWM, focus becomes a state that needs to be kept on both the WM's and X11's side to
enable swapping between workspaces and having previous windows re-focused, and has been a constant source of bugs.
Apart from that, the pure WM responsibilities are not that difficult: wait for events, respond by changing focus or layout, rinse and repeat. The hard parts of PGWM have been removing all C-library dependencies, and taking optimization to a stupid extent.
Remove C library dependencies, statically link PGWM 0.2
I wanted PGWM to be statically linked, small and have no C-library dependencies for 0.2. I had one problem.
Drawing characters on screen
At 0.1, PGWM used language bindings to the XFT (X FreeType interface library) C-library, through the Rust libx11
bindings library X11. XFT handles font rendering.
It was used to draw characters on the status bar.
XFT provides a fairly nice interface, and comes with the added bonus
of Fontconfig integration.
Maybe you've encountered something like this: JetBrainsMono Nerd Font Mono:size=12:antialias=true. It's
an excerpt from my ~/.Xresources file and configures the font for Xterm. Xterm uses fontconfig to figure out where
that font is located on my machine. Removing XFT, and fontconfig with it, means that fonts have to be specified by path.
Now this is necessary to find fonts: /usr/share/fonts/JetBrains\ Mono\ Medium\ Nerd\ Font\ Complete\ Mono.ttf, oof.
I still haven't found a non-C replacement for finding fonts without specifying an absolute path.
One step in drawing a font is taking the font data and creating a vector of light intensities. This process is called
Rasterization. Rust has a font rasterization library, fontdue,
that at least at one point claimed to be the fastest font rasterizer available.
Since I needed to turn the fonts into something that could be displayed as a vector of bytes,
I integrated that into PGWM. The next part was drawing it in the correct place. But, instead of looking
at how XFT did it I went for a search around the protocol and found the shm (shared memory) extension (this maneuver
cost me about a week).
SHM
The X11 shm extension allows an application to share memory with X11, and request the xorg-server to draw what's in
that shared memory at some chosen location.
So I spent some time encoding what should be displayed, pixel by pixel from the background color, with the characters as
bitmaps rasterized by fontdue on top, into a shared memory segment, then having the xorg-server draw from that
segment.
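The per-pixel work can be sketched roughly like this. It assumes the rasterizer hands back one coverage byte per pixel (0 for background, 255 for full foreground) and that the target is a row-major buffer of 0x00RRGGBB pixels; the function names and pixel format are assumptions for illustration, not PGWM's actual code:

```rust
/// Mixes one 8-bit channel of `fg` over `bg` by `coverage`.
fn mix(bg: u8, fg: u8, coverage: u8) -> u8 {
    let c = coverage as u32;
    ((fg as u32 * c + bg as u32 * (255 - c)) / 255) as u8
}

/// Draws a glyph's coverage bitmap onto `pixels` with its top-left corner at (x, y).
fn draw_glyph(
    pixels: &mut [u32],
    width: usize,
    coverage: &[u8],
    glyph_width: usize,
    x: usize,
    y: usize,
    fg: u32,
) {
    if glyph_width == 0 {
        return;
    }
    for (i, &c) in coverage.iter().enumerate() {
        let (gx, gy) = (i % glyph_width, i / glyph_width);
        if x + gx >= width {
            continue; // clip instead of wrapping onto the next row
        }
        let Some(px) = pixels.get_mut((y + gy) * width + x + gx) else {
            continue; // below the bottom edge of the buffer
        };
        let bg = *px;
        let mut out = 0u32;
        for shift in [16, 8, 0] {
            let channel = mix((bg >> shift) as u8, (fg >> shift) as u8, c);
            out |= (channel as u32) << shift;
        }
        *px = out;
    }
}
```

Doing this for every glyph on every redraw, and then shipping the whole segment, goes some way toward explaining the memory and CPU cost.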
It worked, but it took a lot of memory, increased CPU usage, and was slow.
Render
I finally went to look at XFT's code and found that it uses the render extension, an extension that can register byte representations as glyphs, and then draw those glyphs at specified locations, by glyph-id. This is the sane way to do it. After implementing that, font rendering was again working, and the performance was good.
PGWM 0.3 how can I make this smaller and faster?
I wanted PGWM to be as resource efficient as possible, so I decided to dig into the library that I used to do serialization
and deserialization of Rust structs that were to go over the socket to the xorg-server.
The library I was using was X11rb, an excellent, safe, and performant library for doing just that. However, I was taking optimization to a ridiculous extent, so I decided to make that library optimized for my specific use case.
PGWM runs single threaded
X11rb can handle multithreading, making the execution path for single threaded applications longer than necessary.
I first rewrote the connection logic from interior mutability (the connection handles synchronization) to exterior
mutability (user handles synchronization, by for example wrapping it in an Arc<RwLock<Connection>>).
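The difference between the two designs, reduced to a sketch. Both types are hypothetical stand-ins that just queue bytes in a Vec, not the actual x11rb or PGWM connection types:

```rust
use std::sync::Mutex;

/// Interior mutability: the connection synchronizes internally, so `&self`
/// is enough, but every call pays for taking the lock.
struct SharedConnection {
    out: Mutex<Vec<u8>>,
}

impl SharedConnection {
    fn send(&self, request: &[u8]) {
        self.out.lock().unwrap().extend_from_slice(request);
    }
}

/// Exterior mutability: the caller must hold `&mut self`. A single-threaded
/// program gets that for free, a multithreaded one wraps the whole connection.
struct ExclusiveConnection {
    out: Vec<u8>,
}

impl ExclusiveConnection {
    fn send(&mut self, request: &[u8]) {
        self.out.extend_from_slice(request);
    }
}
```

With the second design the borrow checker, rather than a lock, guarantees that only one caller touches the connection at a time.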
This meant a latency decrease of about 5%, which was pretty good. However, it did mean
that RAII no longer applied and the risk of memory leaks went up.
I set the WM to panic on leaks in debug and cleaned them up where I found them to handle that.
Optimize generated code
In X11rb, structs were serialized into owned allocated buffers of bytes, which were then sent over the socket. This means a lot of allocations. Instead, I created a connection which holds an out-buffer. Structs are always serialized directly into it, and that buffer is then flushed over the socket, meaning no allocations are necessary during serialization.
The main drawback of that method is management of that buffer. If it's growable then the largest unflushed batch will take up memory for the WM's runtime, or shrink-logic needs to be inserted after each flush. If the buffer isn't growable, some messages might not fit depending on how the buffer is proportioned. It's pretty painful in edge-cases. I chose to have a fixed-size buffer of 64kb.
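A sketch of the fixed-size variant, assuming 64 KiB means 64 * 1024 bytes and that any std::io::Write can stand in for the socket. The type and method names are invented for illustration:

```rust
use std::io::{self, Write};

const OUT_BUF_SIZE: usize = 64 * 1024;

/// Connection that serializes requests straight into a fixed out-buffer.
struct BufferedConnection<W: Write> {
    sink: W,
    buf: [u8; OUT_BUF_SIZE],
    len: usize,
}

impl<W: Write> BufferedConnection<W> {
    fn new(sink: W) -> Self {
        Self { sink, buf: [0; OUT_BUF_SIZE], len: 0 }
    }

    /// Reserves `n` bytes and lets `write` serialize directly into them.
    /// Flushes first if the request does not fit behind what is already queued.
    fn queue(&mut self, n: usize, write: impl FnOnce(&mut [u8])) -> io::Result<()> {
        if n > OUT_BUF_SIZE {
            // The edge case from above: this message can never fit.
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "request larger than out-buffer",
            ));
        }
        if self.len + n > OUT_BUF_SIZE {
            self.flush()?;
        }
        write(&mut self.buf[self.len..self.len + n]);
        self.len += n;
        Ok(())
    }

    /// Sends everything queued so far and resets the buffer.
    fn flush(&mut self) -> io::Result<()> {
        self.sink.write_all(&self.buf[..self.len])?;
        self.len = 0;
        self.sink.flush()
    }
}
```

Nothing is allocated per request; the price is the hard error for anything larger than the buffer, which a real implementation would have to handle some other way.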
At this point I realized that the code generation was hard to understand and needed a lot of changes to support my
needs. Additionally, making my WM no_std and removing all traces of libc was starting to look achievable.
Extreme yak-shaving, generate XCB from scratch
This was by far the dumbest part of the process: reworking the entire library to support no_std and generate the
structures and code from scratch. From probing the Wayland specification I had written a very basic Rust code
generation library, codegen-rs, and I decided to use that for code generation.
After a few weeks I had managed to write a parser for the xproto.xsd, a parser for the actual protocol files, and a
code generator that I could work with.
A few more weeks followed and I had finally generated a no_std, fairly optimized library for interfacing with X11
over socket, mostly by looking at how x11rb does it.
Extreme yak-shaving, pt 2, no libc allowed
In Rust, libc is the most common way that the standard library interfaces with the OS, with some direct
syscalls where necessary.
There are many good reasons for using libc, even when not building cross-platform/cross-architecture libraries,
but I wanted something pure Rust, so that went out the window.
Libc
libc does a vast amount of things. On Linux there are two implementations that dominate, glibc and musl.
I won't go into the details of the differences between them, but simplified, they are C-libraries that make your C-code
run as you expect on Linux.
As libraries they expose methods to interface with the OS, for example reading or writing to a file,
or connecting to a socket.
Some functions are essentially just proxies for syscalls, but some do more things behind the scenes, like
synchronization of shared application resources such as access to the environment pointer.
Removing the std-library functions and replacing them with syscalls
I decided to set PGWM to #![no_std] and see what compiled. Many things in std::* are just re-exports from core::*
and were easily replaced. For other things like talking to a socket I used raw syscalls through the
excellent syscall crate and some glue-code to approximate what libc does. It was a bit messy,
but not too much work replacing it. PGWM is now 100% not cross-platform, although it wasn't really before either...
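To show what a raw syscall boils down to, here is a sketch that calls getpid with inline assembly instead of going through libc or the syscall crate. It only covers Linux on x86_64 and aarch64, getpid is picked because it takes no arguments, and the function name is made up for illustration:

```rust
use std::arch::asm;

/// Calls getpid(2) directly, without going through libc.
#[cfg(all(target_os = "linux", target_arch = "x86_64"))]
fn raw_getpid() -> i32 {
    let ret: isize;
    // SAFETY: getpid takes no arguments and cannot fail. The `syscall`
    // instruction clobbers rcx and r11, which are declared as outputs.
    unsafe {
        asm!(
            "syscall",
            inlateout("rax") 39isize => ret, // 39 is getpid on x86_64
            out("rcx") _,
            out("r11") _,
            options(nostack),
        );
    }
    ret as i32
}

/// Same call on aarch64, where the syscall number goes in x8.
#[cfg(all(target_os = "linux", target_arch = "aarch64"))]
fn raw_getpid() -> i32 {
    let ret: isize;
    // SAFETY: as above, the result comes back in x0.
    unsafe {
        asm!(
            "svc 0",
            in("x8") 172isize, // 172 is getpid on aarch64
            lateout("x0") ret,
            options(nostack),
        );
    }
    ret as i32
}
```

Syscalls that take arguments work the same way with more registers filled in, plus the error handling that libc would otherwise do by translating negative return values into errno.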
No allocator
Since the standard library provides the allocator I had to find a new one. I decided to
use dlmalloc, it works no_std, and it was a fairly simple change.
Still libc
I look into my crate graph and see that quite a few dependencies still pull in libc:
I got to work forking these libraries and replacing libc with direct syscalls.
time was easy, just some Cargo.toml magic that could easily be upstreamed.
toml was a bit trickier, the solution was ugly and I decided not to upstream it.
dlmalloc-rs was even harder, it used the pthread-api to make the allocator synchronize, and implementing that
was beyond even my yak-shaving. Since PGWM is single threaded anyway I left it as-is and unsafe impl'd Send and Sync.
smallmap was fairly simple, upstreaming in progress.
The ghost of libc, time for nightly
With no traces of libc I try to compile the WM. It can't start, it doesn't know how to start.
The reason is that libc provides the application's entrypoint _start, without linking libc Rust doesn't
know how to create an entrypoint.
As always the amazing fasterthanli.me has
a write-up about how to get around that issue. The solution required nightly and some assembly.
Now the application won't compile, but for a different reason: I have no global alloc error handler.
When running a no_std binary with an allocator, Rust needs to know what to do if allocation fails, but there is
at present no way to tell it without another nightly feature,
default_alloc_error_handler, which looks like it's about to be
stabilized soon (TM).
Now the WM works, no_std, no libc, life is good.
Tiny-std
I was looking at terminal emulator performance. Many new terminal emulators seem to
have very poor input performance.
I had noticed this one of the many times PGWM crashed and sent me back to the cold hard tty, a comforting
speed. alacritty is noticeably sluggish at rendering keyboard input to the screen, so
I went back to xterm, but now that PGWM worked I was toying with the idea to write a fast, small,
terminal emulator in pure Rust.
I wanted to share the code I used for that in PGWM with this new application, and clean it up in the process: tiny-std.
The goal of tiny-std is to make a std-compatible no_std library with no libc dependencies available for use with
Linux Rust applications on x86_64 and aarch64, which are the platforms I'm interested in. Additionally, all
functionality that can work without an allocator should. You shouldn't need to pull in alloc to read/write from a file, just
provide your own buffer.
The nightmare of cross-architecture
Almost immediately I realize why libc is so well-used. After a couple of hours of debugging a segfault, and it turning
out to be incompatible field ordering depending on architecture, one tends to see the light.
Never mind the third time that happens.
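One well-known instance of this kind of mismatch, offered as an illustration and not necessarily one of the structs that bit me, is the start of the kernel's struct stat. The two partial definitions below follow the x86_64 and the generic (aarch64) kernel headers; the Rust type names are made up:

```rust
use std::mem::offset_of;

/// First fields of `struct stat` as laid out on x86_64.
#[repr(C)]
struct StatHeadX64 {
    st_dev: u64,
    st_ino: u64,
    st_nlink: u64,
    st_mode: u32,
}

/// First fields of `struct stat` as laid out on aarch64.
#[repr(C)]
struct StatHeadArm64 {
    st_dev: u64,
    st_ino: u64,
    st_mode: u32,
    st_nlink: u32,
}

/// Byte offsets of `st_mode` as (x86_64, aarch64).
fn st_mode_offsets() -> (usize, usize) {
    (
        offset_of!(StatHeadX64, st_mode),
        offset_of!(StatHeadArm64, st_mode),
    )
}
```

Reading st_mode at the x86_64 offset on aarch64 lands in the middle of a different field, and nothing complains until the values stop making sense.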
I'm unsure of the best way to handle this, perhaps by doing some libgen straight from the kernel source, but we'll see.
Start, what's this on my stack?
I wanted to be able to get arguments and preferably environment variables
into tiny-std. Fasterthanli.me helped with the args, but for the rest I had to go to the musl source.
When an application starts on Linux, the first 8 bytes of the stack contain argc, the number of input arguments.
Following that are the pointers to the null-terminated strings of the arguments (argv), then a null pointer,
then come the pointers to the environment variables.
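The layout is easier to follow as code. This sketch walks a slice of machine words standing in for the initial stack (argc, the argv pointers, a null, the environment pointers, a null, then whatever follows). A real entrypoint would read through a raw stack pointer instead; the struct and function names are hypothetical:

```rust
use std::ops::Range;

/// Positions of the startup data inside a slice of stack words.
#[derive(Debug, PartialEq)]
struct StartLayout {
    argc: usize,
    argv: Range<usize>, // indices of the argument pointers
    envp: Range<usize>, // indices of the environment pointers
    after_env: usize,   // index of the first word after the environment's null
}

fn parse_start_stack(words: &[usize]) -> Option<StartLayout> {
    let argc = *words.first()?;
    let argv_end = argc.checked_add(1)?;
    // The argument pointers are terminated by a null pointer.
    if *words.get(argv_end)? != 0 {
        return None;
    }
    let env_start = argv_end + 1;
    // The environment pointers run until the next null pointer.
    let env_len = words.get(env_start..)?.iter().position(|&w| w == 0)?;
    let env_end = env_start + env_len;
    Some(StartLayout {
        argc,
        argv: 1..argv_end,
        envp: env_start..env_end,
        after_env: env_end + 1,
    })
}
```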
musl then puts that pointer into a global mutable variable, and that's the environment.
I buckle under and do the same. I can see a world where arguments and environment are passed to main, and it's the
application's job, not the library's, to handle them in a thread-safe way
(although you can use env_p as an argument to main in C).
Being no better than my predecessors, I store the environment pointer in a static variable. Things like spawning
processes become a lot simpler that way. C owns the world, we just live in it.
vDSO (virtual dynamic shared object), what, there's more on the stack?
Through some coincidence, when trying to make sure all the processes that I spawn don't become zombies, I encounter
the vDSO.
ldd has whispered the words, but I never looked it up.
[gramar@grarch marcusgrass.github.io]$ ldd $(which cat)
linux-vdso.so.1 (0x00007ffc0f59c000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007ff14e93d000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007ff14eb4f000)
It turns out to be a shared library between the Linux kernel and a running program, mapped into that program's memory.
When I read that it provides faster ways to interface with the kernel I immediately stopped reading and started
implementing, I could smell the nanoseconds.
Aux values
To find out where the vDSO is mapped into memory for an application, the application needs to inspect the
AUX values at runtime.
After the environment variable pointers comes another null pointer, following that are the AUX values.
The AUX values are key-value(like) pairs of information sent to the process.
Among them are a pointer to 16 random bytes, the uid of the process, the gid, and about two dozen more entries of
possibly useful values.
I write some more code into the entrypoint to save these values.
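Looking up one of those values can be sketched as a scan over key-value pairs until the terminating AT_NULL key. The slice stands in for the memory after the environment's null pointer, the key constants are the Linux values, and the function name is made up:

```rust
const AT_NULL: usize = 0; // terminates the aux vector
const AT_UID: usize = 11; // real user id of the process
const AT_RANDOM: usize = 25; // address of 16 random bytes
const AT_SYSINFO_EHDR: usize = 33; // address where the vDSO is mapped

/// Returns the value stored for `key`, if the aux vector contains it.
fn aux_value(auxv: &[usize], key: usize) -> Option<usize> {
    auxv.chunks_exact(2)
        .take_while(|pair| pair[0] != AT_NULL)
        .find(|pair| pair[0] == key)
        .map(|pair| pair[1])
}
```

Saving the values amounts to running this kind of scan once in the entrypoint and stashing the interesting results somewhere the rest of the library can reach.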
A memory mapped elf-file
Among the aux-values is AT_SYSINFO_EHDR, a pointer to the start of the vDSO, which is a full
ELF-file mapped into the process' memory.
I know that in this file is a function pointer for the clock_gettime function through the
Linux vDSO docs. I had benchmarked tiny-std's Instant::now() vs the standard library's, and found it to be almost seven times slower.
I needed to find this function pointer.
After reading more Linux documentation, and ELF-documentation, and Linux-ELF-documentation,
I managed to write some code that parses the ELF-file to find the address of the function.
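The very first step of that parsing, validating the ELF header and pulling out where the program and section headers live, can be sketched as below. It only handles 64-bit little-endian files and stops well short of the symbol lookup that actually finds clock_gettime; the struct and function names are illustrative:

```rust
/// The handful of ELF64 header fields needed to start walking the file.
#[derive(Debug, PartialEq)]
struct Elf64Header {
    e_type: u16,
    e_machine: u16,
    e_phoff: u64, // file offset of the program headers
    e_shoff: u64, // file offset of the section headers
    e_phnum: u16,
    e_shnum: u16,
}

fn parse_elf64_header(b: &[u8]) -> Option<Elf64Header> {
    // An ELF64 header is 64 bytes and starts with the magic 0x7f 'E' 'L' 'F'.
    if b.len() < 64 || !b.starts_with(b"\x7fELF") {
        return None;
    }
    // e_ident[4] == 2 means 64-bit, e_ident[5] == 1 means little-endian.
    if b[4] != 2 || b[5] != 1 {
        return None;
    }
    let u16_at = |o: usize| u16::from_le_bytes([b[o], b[o + 1]]);
    let u64_at = |o: usize| {
        let mut raw = [0u8; 8];
        raw.copy_from_slice(&b[o..o + 8]);
        u64::from_le_bytes(raw)
    };
    Some(Elf64Header {
        e_type: u16_at(16),
        e_machine: u16_at(18),
        e_phoff: u64_at(32),
        e_shoff: u64_at(40),
        e_phnum: u16_at(56),
        e_shnum: u16_at(60),
    })
}
```

From there the dynamic section leads to the symbol and string tables, which is where the function's address is eventually found.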
Of course that goes into another global variable, you know, C-world and all that.
I created a feature that does the vDSO parsing, and if clock_gettime is found, uses that instead of the syscall.
This increased the performance of Instant::now() from ~std * 7 to < std * 0.9. In other words, it now outperforms
standard by taking around 12% less time to get the current time from the system.
Conclusion
I do a lot of strange yak-shaving, mostly for my own learning, I hope that this write-up might have given you something
too.
The experience of taking PGWM to no_std and no libc has been incredibly rewarding, although I think PGWM is mostly
the same, a bit more efficient, a bit less stable.
I'll keep working out the bugs and API of tiny-std. Plans to do a minimal terminal emulator are still in the back of
my mind, we'll see if I can find the time.