Threads, some assembly required.
Lately I've been thinking about adding threads to tiny-std, my Linux-only, x86_64/aarch64-only tiny standard library for Rust. Now I've finally done that, with some jankiness. In this write-up I'll go through that process.
Parallelism
Sometimes in programming, parallelism (doing multiple things at the same time) is desirable. For example, completing some task may require two different long-running calculations. If those can be run in parallel, our latency becomes that of the slowest of the tasks (plus some overhead).
Some ways of achieving parallelism in your program are:
- SIMD, hopefully your compiler does that for you. But here we're talking about singular processor operations, not arbitrary tasks.
- Offloading tasks to the OS. If your OS has asynchronous APIs, then you could ask it to do multiple things at once and achieve parallelism that way.
- Running tasks in other processes.
- Running tasks in threads.
Threads
Wikipedia says of threads:
"In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system."
Threads, from a programming perspective, are managed by the OS, and how threads work is highly OS-dependent. I'll only go into Linux specifically here, and only from an API consumer's perspective.
Spawning a thread with a minimal task
In the Rust std-library, a thread can be spawned with:
fn main() {
    let handle = std::thread::spawn(|| {
        std::thread::sleep(std::time::Duration::from_millis(500));
        println!("Hello from my thread");
    });
    // Suspends execution of the calling thread until the child-thread completes.
    handle.join().unwrap();
}
In the above program, some setup runs before the main function, some of it delegated to libc, which sets up what it deems appropriate. Rust sets up a panic handler and miscellaneous things the program needs to run correctly, then the main thread starts executing the main function.
In the main function, the main thread spawns a child, which at the point of spawn starts executing the task provided by the supplied closure (wait 500 millis, then print a message), then waits for that thread to complete.
I wanted to replicate this API without using libc.
Clone, the Swiss army syscall
The Linux clone syscall can be used for a lot of things. So many, in fact, that it's extremely difficult to use correctly; it's very easy to cause security issues through various memory-management mistakes, many of which I discovered on this journey.
The signature for the glibc clone wrapper function looks like:
int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...
/* pid_t *parent_tid, void *tls, pid_t *child_tid */ );
Right away I can tell that calling this is not going to be easy from Rust, we've got varargs in there, which is problematic because:
- Rust doesn't have varargs, so porting some C-functionality from, for example, musl won't be straightforward.
- Varargs are not readable (objectively true opinion).
Skipping down to the Notes-section of the documentation shows the actual syscall interface (for x86_64; in a conspiracy to ruin my life, the last two args are switched on aarch64):
long clone(unsigned long flags, void *stack,
           int *parent_tid, int *child_tid,
           unsigned long tls);
Very disconcerting, since the C-api, which accepts varargs, seems to do quite a bit of work before making the syscall and handing the task over to the OS.
In simple terms, clone is a way to "clone" the current process. If you have experience with fork, that's an example of clone.
Here's an equivalent fork using the clone syscall from tiny-std:
/// Fork isn't implemented for aarch64, we're substituting with a clone call here
/// # Errors
/// See above
/// # Safety
/// See above
#[cfg(target_arch = "aarch64")]
pub unsafe fn fork() -> Result<PidT> {
    // SIGCHLD is mandatory on aarch64 if mimicking fork it seems
    let cflgs = crate::platform::SignalKind::SIGCHLD;
    let res = syscall!(CLONE, cflgs.bits().0, 0, 0, 0, 0);
    bail_on_below_zero!(res, "CLONE syscall failed");
    #[allow(clippy::cast_possible_truncation, clippy::cast_possible_wrap)]
    Ok(res as i32)
}
What happens immediately after this call is that our process is cloned and starts executing past the code which called clone, as in the following Rust example:
fn parallelism_through_multiprocess() {
    let pid = unsafe { rusl::process::fork().unwrap() };
    if pid == 0 {
        println!("In child!");
        rusl::process::exit(0);
    } else {
        println!("In parent, spawned child {pid}");
    }
}
This program will print (in non-deterministic order) In parent, spawned child 24748 and In child!, and return to the caller.
Here we achieved parallelism by spawning another process and doing work there, separately scheduled by the OS, then exited that process. At the same time, our caller returns as usual, only stopping briefly to make the syscall.
Achieving parallelism in this way can be fine. If you want to run a command, forking/cloning and then executing another binary through the execve-syscall is usually how that's done.
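As a rough sketch (not the actual tiny-std/rusl API), that pattern could look like the following; the EXECVE identifier and the arity of the syscall! macro are assumptions here, and error handling is omitted:
fn run_echo() {
    let pid = unsafe { rusl::process::fork().unwrap() };
    if pid == 0 {
        // In the child: replace this process image with /bin/echo.
        // execve wants null-terminated strings and null-terminated pointer arrays.
        let path = b"/bin/echo\0";
        let argv = [path.as_ptr(), b"hello\0".as_ptr(), core::ptr::null()];
        let envp = [core::ptr::null::<u8>()];
        unsafe {
            // EXECVE is syscall nr 59 on x86_64; assumed to be exposed by the syscall! macro.
            syscall!(EXECVE, path.as_ptr() as usize, argv.as_ptr() as usize, envp.as_ptr() as usize, 0, 0);
        }
        // execve only returns on failure.
        rusl::process::exit(1);
    } else {
        println!("In parent, spawned child {pid}");
    }
}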
Multiprocessing can be a bad choice if the task is small, because setting up an entire other process can be resource
intensive, and communicating between processes can be slower than communicating through shared memory.
Threads: Cloning intra-process with shared memory
What we think of as threads in Linux are sometimes called Light-Weight Processes; the above clone call spawned a regular process, which got a full copy of the parent-process' memory with copy-on-write semantics.
To reduce overhead, both in spawning and in communicating between the cloned process and the rest of the processes in the application, a combination of flags is used:
let flags = CloneFlags::CLONE_VM
    | CloneFlags::CLONE_FS
    | CloneFlags::CLONE_FILES
    | CloneFlags::CLONE_SIGHAND
    | CloneFlags::CLONE_THREAD
    | CloneFlags::CLONE_SYSVSEM
    | CloneFlags::CLONE_CHILD_CLEARTID
    | CloneFlags::CLONE_SETTLS;
Clone flags are tricky to explain; they interact with each other as well, but in short:
- CLONE_VM, clone memory without copy-on-write semantics, meaning the parent and child share memory space and can modify each other's memory.
- CLONE_FS, the parent and child share the same filesystem information, such as the current directory.
- CLONE_FILES, the parent and child share the same file-descriptor table (if one opens an fd, that fd is available to the other).
- CLONE_SIGHAND, the parent and child share signal handlers.
- CLONE_THREAD, the child-process is placed in the same thread-group as the parent.
- CLONE_SYSVSEM, the parent and child share semaphores.
- CLONE_CHILD_CLEARTID, wake up waiters for the supplied child_tid futex pointer when the child exits (we'll get into this).
- CLONE_SETTLS, set the thread-local storage to the data pointed at by the tls-variable (architecture specific, we'll get into this as well).
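For reference, these are the raw flag values from the Linux headers; a sketch with plain constants, whereas tiny-std wraps them in the CloneFlags bitflag type shown above:
// Values from linux/sched.h
const CLONE_VM: u64 = 0x0000_0100;
const CLONE_FS: u64 = 0x0000_0200;
const CLONE_FILES: u64 = 0x0000_0400;
const CLONE_SIGHAND: u64 = 0x0000_0800;
const CLONE_THREAD: u64 = 0x0001_0000;
const CLONE_SYSVSEM: u64 = 0x0004_0000;
const CLONE_SETTLS: u64 = 0x0008_0000;
const CLONE_CHILD_CLEARTID: u64 = 0x0020_0000;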
The crucial flags to run some tasks in a thread are only:
- CLONE_VM
- CLONE_THREAD
The rest are for usability and expectation, as well as cleanup reasons.
Implementation
Now towards the actual implementation of a minimal threading API.
API expectation
The std library in Rust provides an interface that could be used like this:
let join_handle = std::thread::spawn(|| println!("Hello from my thread!"));
join_handle.join().unwrap();
A closure that is run on another thread is supplied and a JoinHandle<T> is returned. The join handle can be awaited by calling its join-method, which will block the calling thread until the thread executing the closure has completed. If it panics, the Result will be an Err; if it succeeds, it will be an Ok(T) where T is the return value from the closure, which in this case is nothing (()).
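For example, the value-returning case still using the std API, which is the behaviour we'd like to replicate:
fn main() {
    // The closure's return value is handed back through the JoinHandle
    let handle = std::thread::spawn(|| 21 * 2);
    let value: i32 = handle.join().unwrap();
    assert_eq!(42, value);
}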
Executing a clone call
If CLONE_VM is specified, a stack should be supplied. CLONE_VM means sharing mutable memory; if we didn't supply the stack, both threads would continue mutating the same stack area. Here's an excerpt from the docs about that:
[..] (If the child shares the parent's memory because of the use of the CLONE_VM flag, then no copy-on-write duplication occurs and chaos is likely to result.) - "C library/kernel differences"-section
Allocating the stack
Therefore, setting up a stack is required. There are a few options for that; the kernel only needs a chunk of correctly aligned memory, with the alignment depending on what platform we're targeting. We could even just take some memory off our own stack if we wanted to.
Use the caller's stack
fn clone() {
    // 16 KiB stack allocation
    let mut my_stack = [0u8; 16384];
    let stack_ptr = my_stack.as_mut_ptr();
    // pass through to syscall
    syscall!(CLONE, ..., stack_ptr, ...);
}
This is bad for a generic API for a multitude of reasons. It restricts the user to threads that complete before the caller has popped the stack frame in which they were created, since the part of the stack that was used in this frame will be reused by the caller later, possibly while the child-thread still uses it for its own stack, which, as we now know, would result in chaos.
Additionally, if this API was exposed to users, we would have to have stack space available on the calling thread for an arbitrary number of children. In the case of heap-allocations, the assumption that we will have enough memory for reasonable thread-usage is valid. Rust's default thread stack size is 2 MiB. On a system with 16 GiB of RAM, with 8 GiB available at a given time, that's 4000 threads; spawning that many is likely not intentional or performant. Keeping threads on the main thread's stack significantly reduces our memory availability, along with carrying the risk of chaos.
There is a case to be made for some very specific application which spawns some threads in scope, does some work, then exits, to reuse the caller's stack. But I have yet to encounter that kind of use-case in practice, so let's move on to something more reasonable.
Mmap more stack-space
This is what musl does. We allocate the memory that we want to use from new OS-pages and use those. We could potentially do a regular malloc as well, although that would mean less control over the allocated memory.
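As a rough sketch of such an allocation, reusing the syscall! macro from the fork example (the MMAP identifier, the macro accepting mmap's six arguments, and the spelled-out x86_64 flag values are assumptions here; error handling and guard pages are omitted):
// Map an anonymous, private, writable region to use as a thread stack.
const PROT_READ: usize = 0x1;
const PROT_WRITE: usize = 0x2;
const MAP_PRIVATE: usize = 0x02;
const MAP_ANONYMOUS: usize = 0x20;
const MAP_STACK: usize = 0x20000;

unsafe fn alloc_stack(size: usize) -> *mut u8 {
    let ptr = syscall!(
        MMAP,
        0,                                       // let the kernel pick an address
        size,                                    // requested length
        PROT_READ | PROT_WRITE,                  // readable and writable
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, // not backed by a file
        -1isize as usize,                        // no fd
        0                                        // no offset
    );
    ptr as *mut u8
}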
Communicating with the started thread
Now mmap-ing some stack-memory is enough for the OS to start a thread with its own stack, but then what?
The thread needs to know what to do, and we can't provide it with any arguments; we need to put all the data it needs on its stack before starting execution of the task.
This means that we'll need some assembly, since using the clone syscall and then continuing in Rust relinquishes control that we need over the stack; we need to put almost the entire child-thread's lifetime in assembly.
The structure of the call is mostly stolen from musl, with some changes for this more minimal use-case.
The Rust function will look like this:
extern "C" {
fn __clone(
start_fn: usize,
stack_ptr: usize,
flags: i32,
args_ptr: usize,
tls_ptr: usize,
child_tid_ptr: usize,
stack_unmap_ptr: usize,
stack_sz: usize,
) -> i32;
}
- It takes a pointer to a start_fn, which is a C calling convention function pointer, where our thread will pick up.
- It also takes a pointer to the stack, stack_ptr.
- It takes clone-flags which we send onto the OS in the syscall.
- It takes an args_ptr, which is the closure we want to run, converted to a C calling convention function pointer.
- It takes a tls_ptr, a pointer to some thread local storage, which we'll need to deallocate the thread's stack and communicate with the calling thread.
- It takes a child_tid_ptr, which will be used to synchronize with the calling thread.
- It takes a stack_unmap_ptr, which is the base address that we allocated for the stack at its original 0 offset.
- It takes the stack_sz, stack-size, which we'll need to deallocate the stack later.
Syscalls
x86_64 and aarch64 assembly both have an instruction to execute a syscall.
A syscall is like a function call to the kernel. We'll need to make three syscalls using assembly:
- CLONE, nr 56 on x86_64
- MUNMAP, nr 11 on x86_64
- EXIT, nr 60 on x86_64
The interface for the syscall is as follows:
/// Syscall conventions are on 5 args:
/// - arg -> arch: reg,
/// - nr -> x86: rax, aarch64: x8
/// - a1 -> x86: rdi, aarch64: x0
/// - a2 -> x86: rsi, aarch64: x1
/// - a3 -> x86: rdx, aarch64: x2
/// - a4 -> x86: r10, aarch64: x3
/// - a5 -> x86: r8, aarch64: x4
/// Pseudocode syscall as extern function:
extern "C" {
fn syscall(nr: usize, a1: usize, a2: usize, a3: usize, a4: usize, a5: usize);
}
Onto the assembly, it can be boiled down to this:
- Prepare arguments to go in the right registers for the syscall.
- Put what the thread needs into its stack.
- Execute the clone syscall, return directly to the caller (parent-thread).
- Pop data from the spawned thread's stack into registers.
- Execute the function we wanted to run in the spawned thread.
- Unmap the spawned thread's own stack
- Exit 0
// Boilerplate to expose the symbol
.text
.global __clone
.hidden __clone
.type __clone,@function
// Actual declaration
__clone:
// tls_ptr already in r8, syscall arg 5 register, due to C calling convention on this function, same with stack_ptr in rsi
// Zero syscall nr register ax (eax = 32bit ax)
xor eax, eax
// Move 56 into the lower 8 bits of ax (al = 8bit ax), 56 is the CLONE syscall nr for x86_64, will become: syscall(56, .., stack_ptr, .., tls_ptr)
mov al, 56
// Move start function into r11, scratch register, save it there since we need to shuffle stuff around
mov r11, rdi
// Move flags into rdi, syscall arg 1 register, will become: syscall(56, flags, stack_ptr, .., .., tls_ptr)
mov rdi, rdx
// Zero parent_tid_ptr from syscall arg 3 register (not using), will become: syscall(56, flags, stack_ptr, 0, .., tls_ptr)
xor rdx, rdx
// Move child_tid_ptr into syscall arg 4 register (our arg 6), will become: syscall(56, flags, stack_ptr, 0, child_tid_ptr, tls_ptr)
mov r10, r9
// Move start function into r9
mov r9, r11
// Align stack ptr to -16
and rsi, -16
// Move down 8 bytes on the stack ptr
sub rsi, 8
// Move args onto the top of the stack
mov [rsi], rcx
// Move down 8 bytes more on the stack ptr
sub rsi, 8
// Move the first arg that went on the stack into rcx (stack_unmap_ptr)
mov rcx, [8 + rsp]
// Move stack_unmap_ptr onto our new stack
mov [rsi], rcx
// Move the second arg that went on the stack into rcx (stack_sz)
mov rcx, [16 + rsp]
// Move down stack ptr
sub rsi, 8
// Move stack_sz onto the new stack
mov [rsi], rcx
// Make clone syscall
syscall
// Check if the syscall return value is 0
test eax, eax
// if not zero, return (we're the calling thread)
jnz 1f
// Child:
// Zero the base pointer
xor ebp, ebp
// Pop the stack_sz off the provided stack into callee saved register
pop r13
// Pop the stack_ptr off the provided stack into another callee saved register
pop r12
// Pop the start fn args off the provided stack into rdi
pop rdi
// Call the function we saved in r9, rdi first arg
call r9
// Zero rax (function return, we don't care)
xor rax, rax
// Move MUNMAP syscall into ax
mov al, 11
// Stack ptr as the first arg
mov rdi, r12
// Stack len as the second arg
mov rsi, r13
// Syscall, unmapping the stack
syscall
// Clear the output register, we can't use the return value anyway
xor eax,eax
// Move EXIT syscall nr into ax
mov al, 60
// Set exit code for the thread to 0
mov rdi, 0
// Make exit syscall
syscall
1: ret
And that's it, kinda. With some code wrapping this, we can run an arbitrary closure on a separate thread!
Race conditions
We're far from done. In the happy case we're starting a thread, it completes, and it deallocates its own stack. But we need to get its returned value, and we need to know if it's done.
Unlike with a regular child process, we cannot use the wait-syscall to wait for it to complete, but there is another way, alluded to in the note on CLONE_CHILD_CLEARTID.
Futex messaging
If CLONE_CHILD_CLEARTID is supplied in the clone-flags along with a pointer to a futex variable (something with a u32-layout, in Rust most reasonably an AtomicU32), then the OS will set that futex-value to 0 (not null) when the thread exits, successfully or not.
This means that if the caller wants to join, i.e. blocking-wait for the child-thread to finish, it can use the futex-syscall.
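A sketch of what such a join-wait could look like, assuming the futex was initialized to something non-zero at spawn time and that the syscall! macro exposes a FUTEX identifier (FUTEX_WAIT is operation 0):
use core::sync::atomic::{AtomicU32, Ordering};

const FUTEX_WAIT: usize = 0;

unsafe fn wait_for_thread_exit(futex: &AtomicU32) {
    loop {
        let current = futex.load(Ordering::Acquire);
        if current == 0 {
            // The kernel cleared the futex and woke us: the child has exited.
            return;
        }
        // Sleep until the value changes from `current`; spurious wakeups just loop.
        syscall!(FUTEX, futex as *const AtomicU32 as usize, FUTEX_WAIT, current as usize, 0, 0);
    }
}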
Getting the returned value
The return value is fairly simple: we need to allocate space for it, for example with a pointer to an UnsafeCell<Option<T>>, and then have the child-thread update it. The catch here is that we can't have &-references to that value while the child-thread may be writing to it, since that's UB. We share a pointer to the value with the child, and we need to be absolutely certain that the child-thread is done with its modification before we try to read it, for example by waiting for it to exit by join-ing.
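A minimal sketch of that shared slot, assuming heap allocation through Box and that the parent only ever reads after the futex-wait above, so a reader and writer never exist at the same time:
use core::cell::UnsafeCell;

// Allocate the slot the child will write its result into.
fn make_return_slot<T>() -> *mut UnsafeCell<Option<T>> {
    Box::into_raw(Box::new(UnsafeCell::new(None)))
}

// Child side: write the closure's return value through the raw pointer.
unsafe fn store_result<T>(slot: *mut UnsafeCell<Option<T>>, value: T) {
    *(*slot).get() = Some(value);
}

// Parent side, only after observing thread exit: take the value and free the allocation.
unsafe fn take_result<T>(slot: *mut UnsafeCell<Option<T>>) -> Option<T> {
    let cell: UnsafeCell<Option<T>> = *Box::from_raw(slot);
    cell.into_inner()
}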
Memory leaks, who deallocates what?
We don't necessarily have to keep our JoinHandle<T> around after spawning a thread. A perfectly valid use-case is to just spawn some long-running thread and then forget about it. This causes a problem: if the calling thread doesn't have sole responsibility for deallocating the shared memory (the futex variable and the return value), then we need a way to signal to the child-thread that it's that thread's responsibility to deallocate those variables before exiting.
Enter the third shared variable, an AtomicBool called should_dealloc; both threads share a pointer to this variable as well.
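So the caller and the child now share three allocations. A sketch of what a handle over them could hold (not tiny-std's actual JoinHandle<T> definition):
use core::cell::UnsafeCell;
use core::sync::atomic::{AtomicBool, AtomicU32};

pub struct JoinHandle<T> {
    // Cleared to 0 and woken by the kernel when the child exits (CLONE_CHILD_CLEARTID)
    futex_ptr: *const AtomicU32,
    // Where the child stores its return value
    payload_ptr: *mut UnsafeCell<Option<T>>,
    // Decides whether the caller or the child deallocates the shared state
    should_dealloc_ptr: *const AtomicBool,
}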
Now there are three deallocation-scenarios:
- The caller joins the child thread by waiting for the futex-variable to change value to 0. In this case the caller deallocates the futex, takes the return value off the heap (freeing its memory), and deallocates the should_dealloc pointer.
- The caller drops the JoinHandle<T>. This is racy; we need to read should_dealloc to see that the child thread hasn't already completed its work. If it has, we wait on the futex to make sure the child thread is completely done, then deallocate as above.
- The child thread tries to set should_dealloc to true and fails, meaning that the calling thread has already dropped the JoinHandle<T>. In this case, the child thread needs to signal to the OS that the futex is no longer to be updated on thread exit, through the set_tid_address-syscall (forgetting to do this results in a use after free, oof). Then it can safely deallocate the shared variables. Here's a Linux-code-comment calling me a dumbass that I found when trying to find the source of the segfaults:

// 929ed21dfdb6ee94391db51c9eedb63314ef6847, kernel/fork.c#L1634, written by Linus himself
if (tsk->clear_child_tid) {
    if (atomic_read(&mm->mm_users) > 1) {
        /*
         * We don't check the error code - if userspace has
         * not set up a proper pointer then tough luck.
         */
        put_user(0, tsk->clear_child_tid);
        do_futex(tsk->clear_child_tid, FUTEX_WAKE,
                 1, NULL, NULL, 0, 0);
    }
    tsk->clear_child_tid = NULL;
}
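A sketch of how that should_dealloc hand-off could look on both sides, assuming the pointers from the handle sketch above (the set_tid_address handling from the third scenario is left out):
use core::sync::atomic::{AtomicBool, Ordering};

// Child side, run when the thread's task has finished: returns true if the
// caller already dropped its JoinHandle<T>, i.e. cleanup is now on the child.
fn child_should_clean_up(should_dealloc: &AtomicBool) -> bool {
    should_dealloc
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
}

// Caller side, run when the JoinHandle<T> is dropped without joining: returns
// true if the child has already finished, so the caller must wait on the futex
// and then deallocate the shared state itself.
fn dropped_handle_should_clean_up(should_dealloc: &AtomicBool) -> bool {
    should_dealloc
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
}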
Oh, right. Panics...
I imagine a world where Rust doesn't contain panics. Sadly, we don't live in that world, and thus we need to handle them.
If the thread panics and we try to join, then it's no issue; we'll get a None return value and can continue with the regular cleanup from the caller.
However, if the thread panics after the caller has dropped the JoinHandle<T>, the shared memory is leaked and the stack isn't deallocated.
A Rust panic handler could look like this:
/// Dummy panic handler
#[panic_handler]
pub fn on_panic(info: &core::panic::PanicInfo) -> ! {
    loop {}
}
The signature shows that it gets a PanicInfo and never returns.
When a thread panics, it enters that function and never returns; it's here that we need to handle cleanup in the case that the thread panics.
What we need:
- A pointer to the futex
- A pointer to the return value
- A pointer to the should_dealloc variable
- The address at which we allocated this thread's stack
- The size of that allocated stack
We could insert those into registers that shouldn't be touched by the user-supplied function, but that's fairly brittle; instead we'll use the dreaded tls.
Thread-local storage
Thread-local storage, or tls, is a way to store thread-specific data.
For x86_64 and aarch64 there is a specific register we can use to store a pointer to some arbitrary data; we can read that data at any time, from any place. In other words, the data is global to the thread.
In practice:
#[repr(C)]
#[derive(Copy, Clone)]
pub(crate) struct ThreadLocalStorage {
    // First arg needs to be a pointer to this struct, it's immediately dereferenced
    pub(crate) self_addr: usize,
    // Info on spawned threads that allow us to unmap the stack later
    pub(crate) stack_info: Option<ThreadDealloc>,
}

#[repr(C)]
#[derive(Copy, Clone)]
pub(crate) struct ThreadDealloc {
    // For the stack dealloc
    stack_addr: usize,
    stack_sz: usize,
    // For the return value dealloc
    payload_ptr: usize,
    payload_layout: Layout,
    // Futex,
    futex_ptr: usize,
    // Sync who deallocs
    sync_ptr: usize,
}
#[inline]
#[must_use]
fn get_tls_ptr() -> *mut ThreadLocalStorage {
    let mut output: usize;
    #[cfg(target_arch = "x86_64")]
    unsafe {
        core::arch::asm!("mov {x}, fs:0", x = out(reg) output);
    }
    #[cfg(target_arch = "aarch64")]
    unsafe {
        core::arch::asm!("mrs {x}, tpidr_el0", x = out(reg) output);
    }
    output as _
}
This takes us to another of our clone-flags, CLONE_SETTLS. We can now allocate and supply a pointer to a ThreadLocalStorage-struct, and that will be put into the thread's thread-local storage register by the OS; which registers are used can be seen in get_tls_ptr.
Now, when entering the panic_handler, we can use get_tls_ptr and see if there is a ThreadDealloc associated with the thread that's currently panicking. If there isn't, we're on the main thread, and we'll just bail out by exiting with code 1, terminating the program.
If there is a ThreadDealloc, we first check whether the caller has dropped the JoinHandle<T> and therefore whether we have exclusive access to the shared memory. If we do have exclusive access, we deallocate it; if we don't, we let the caller handle it. Then, again, we have to exit with some asm:
// We need to be able to unmap the thread's own stack, we can't use the stack anymore after that
// so it needs to be done in asm.
// With the stack_ptr and stack_len in rdi/x0 and rsi/x1, respectively, we can call munmap and then
// exit the thread
#[cfg(target_arch = "x86_64")]
core::arch::asm!(
    // Call munmap, all args are provided in this macro call.
    "syscall",
    // Zero eax from munmap ret value
    "xor eax, eax",
    // Move exit into ax
    "mov al, 60",
    // Exit code 0 from thread.
    "mov rdi, 0",
    // Call exit, no return
    "syscall",
    in("rax") MUNMAP,
    in("rdi") map_ptr,
    in("rsi") map_len,
    options(nostack, noreturn)
);
We also need to remember to deallocate the ThreadLocalStorage; what we keep in the register is just a pointer to that allocated heap-memory. This needs to be done on both successful and panicking thread-exits.
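A sketch of that reclamation, assuming the ThreadLocalStorage was heap-allocated with Box and handed to the kernel as a raw pointer at spawn time; it has to run before the stack is unmapped and the thread exits:
// Reclaim the heap allocation backing the thread's TLS. After this, the pointer
// left in the TLS register is dangling and must not be read again.
unsafe fn drop_tls() {
    let tls_ptr: *mut ThreadLocalStorage = get_tls_ptr();
    drop(Box::from_raw(tls_ptr));
}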
Final thoughts
I've been dreading reinventing this particular wheel, but I'm glad I did.
I learnt a lot, and it was interesting to dig into how threading works in practice on Linux, plus tiny-std now has threads!
The code for threads in tiny-std can be found here. With a huge amount of comments, it's 500 lines.
I believe that it doesn't contain UB or leakage, but it's incredibly hard to test. What I know is lacking is signal handling, which is something else that I have been dreading getting into.
Next up
I've ordered a PinePhone Explorer Edition; I'll probably try doing stuff with that next.