I had a wrong mental model of Rust's thread-local variables performance

I had always assumed that accessing thread-local variables is somewhat expensive. This made me reluctant to use them in my code.

However, it turns out that in practice, accessing a thread-local value costs about the same as accessing a normal global variable.
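
To make the comparison concrete, here is a minimal sketch of the two kinds of reads (the declarations are my assumption, not the actual benchmark code; only the function names mirror those in the results below). A relaxed atomic load stands in for the plain global read, and the thread-local uses a const initializer, which I believe avoids the lazy-initialization check on each access:

use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};

// An ordinary global variable; a relaxed load compiles to a plain mov.
static GLOBAL: AtomicU64 = AtomicU64::new(0);

// A thread-local variable: every thread gets its own copy. The const
// initializer lets the compiler avoid the lazy-initialization check.
thread_local! {
    static LOCAL: Cell<u64> = const { Cell::new(0) };
}

pub fn do_static_read() -> u64 {
    GLOBAL.load(Ordering::Relaxed)
}

pub fn do_thread_static_read() -> u64 {
    LOCAL.with(|cell| cell.get())
}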

I've made a quick-and-dirty benchmark and here are its (non-scientific) results:

bin::do_static_read: 1.28 ns/iter
bin::do_thread_static_read: 1.28 ns/iter
lib::do_static_read: 1.28 ns/iter
lib::do_thread_static_read: 1.28 ns/iter

Of course, this is a (probably flawed) microbenchmark, but it still suggests that I should not worry too much about the cost of Rust's tracing crate (which uses thread-locals in several places).
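
I don't have the actual benchmark source here, but a rough, hypothetical sketch of how such reads could be timed with just the standard library (black_box keeps the optimizer from removing or hoisting the loads) might look like this:

use std::cell::Cell;
use std::hint::black_box;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Same two variables as in the sketch above.
static GLOBAL: AtomicU64 = AtomicU64::new(0);

thread_local! {
    static LOCAL: Cell<u64> = const { Cell::new(0) };
}

fn main() {
    const ITERS: u32 = 100_000_000;

    // Time repeated reads of the ordinary global.
    let start = Instant::now();
    for _ in 0..ITERS {
        // black_box hides the address and the result from the optimizer,
        // so the load cannot be hoisted out of the loop or removed.
        black_box(black_box(&GLOBAL).load(Ordering::Relaxed));
    }
    let static_ns = start.elapsed().as_secs_f64() * 1e9 / f64::from(ITERS);

    // Time repeated reads of the thread-local.
    let start = Instant::now();
    for _ in 0..ITERS {
        black_box(LOCAL.with(|cell| black_box(cell).get()));
    }
    let tls_ns = start.elapsed().as_secs_f64() * 1e9 / f64::from(ITERS);

    println!("static read:       {static_ns:.2} ns/iter");
    println!("thread-local read: {tls_ns:.2} ns/iter");
}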

Here's the relevant disassembly (on an x86-64 machine, as reported by cargo asm):

# accessing a static variable in an executable
mov rax, qword ptr [rip + VAR]

# accessing a static variable in a library
mov rax, qword ptr [rip + VAR]

# accessing a thread-local variable in an executable
mov rax, qword ptr fs:[VAR::VAL@TPOFF]

# accessing a thread-local variable in a library
lea rdi, [rip + VAR::VAL@TLSLD]
call __tls_get_addr@PLT
mov rax, qword ptr [rax + VAR::VAL@DTPOFF]

For static variables, Rustc was able to use PC-relative addressing (which is quite efficient), but thread-local variables are a bit more interesting.

Apparently, thread-local storage is implemented differently depending on whether the variable is defined in an executable or in a (shared) library. This StackOverflow comment and the OsDev article seem to have some details about it. Here is what seems to be happening:

  • when defining a thread-local in a binary, the compiler is able to use the FS segment register to reference the thread-local area
  • when defining a thread-local in a library, the compiler must use a less efficient way of obtaining the base address of the thread-local area, by calling __tls_get_addr
  • Why the benchmark results are the same for bin and lib:
    • I think Rustc does cross-library inlining (to a certain degree), so the actual assembly being executed (after inlining) is probably a bit different from what cargo asm reported
    • If I add #[inline(never)] to a function that contains the reading code, thread-local access latency increases to 1.58 ns/iter (see the sketch below).
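
To illustrate that last point, here is a hypothetical sketch of such a non-inlined read function (my own code, not the actual benchmark); when it is compiled into a (shared) library, callers have to go through the __tls_get_addr path shown in the disassembly above:

use std::cell::Cell;

thread_local! {
    static VAL: Cell<u64> = const { Cell::new(0) };
}

// With inlining disabled, callers execute this body exactly as it was
// compiled in this crate, so the slower TLS access path stays on the
// hot path instead of being inlined away at the call site.
#[inline(never)]
pub fn do_thread_static_read() -> u64 {
    VAL.with(|cell| cell.get())
}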