I had always assumed that accessing thread-local variables is somewhat expensive. This made me reluctant to use them in my code.
However, it turns out that, in practice, accessing a thread-local value costs about the same as accessing a normal global variable.
I've made a quick-and-dirty benchmark and here are its (non-scientific) results:
```
bin::do_static_read:        1.28 ns/iter
bin::do_thread_static_read: 1.28 ns/iter
lib::do_static_read:        1.28 ns/iter
lib::do_thread_static_read: 1.28 ns/iter
```
Of course, this is a (probably flawed) microbenchmark, but it suggests that I should not worry too much about the cost of Rust's `tracing` crate (which uses thread-locals in several places).
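For reference, here is a minimal sketch of this kind of measurement (my reconstruction with illustrative names and a simple `Instant`-based loop, not the exact harness that produced the numbers above):

```rust
use std::cell::Cell;
use std::hint::black_box;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// A mutable global, so each iteration performs a real load.
static STATIC_VAL: AtomicU64 = AtomicU64::new(42);

thread_local! {
    // `const` initialization allows the access to compile to a plain load.
    static THREAD_VAL: Cell<u64> = const { Cell::new(42) };
}

fn do_static_read() -> u64 {
    STATIC_VAL.load(Ordering::Relaxed)
}

fn do_thread_static_read() -> u64 {
    THREAD_VAL.with(|cell| cell.get())
}

fn bench(name: &str, mut f: impl FnMut() -> u64) {
    const ITERS: u64 = 100_000_000;
    let start = Instant::now();
    let mut sum = 0u64;
    for _ in 0..ITERS {
        // `black_box` keeps the compiler from optimizing the reads away.
        sum = sum.wrapping_add(black_box(f()));
    }
    let ns_per_iter = start.elapsed().as_nanos() as f64 / ITERS as f64;
    println!("{name}: {ns_per_iter:.2} ns/iter (checksum {sum})");
}

fn main() {
    bench("do_static_read", do_static_read);
    bench("do_thread_static_read", do_thread_static_read);
}
```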
Here's the relevant disassembly (on an x86-64 machine, as reported by `cargo asm`):
```
# accessing a static variable in an executable
mov rax, qword ptr [rip + VAR]

# accessing a static variable in a library
mov rax, qword ptr [rip + VAR]

# accessing a thread-local variable in an executable
mov rax, qword ptr fs:[VAR::VAL@TPOFF]

# accessing a thread-local variable in a library
lea rdi, [rip + VAR::VAL@TLSLD]
call __tls_get_addr@PLT
mov rax, qword ptr [rax + VAR::VAL@DTPOFF]
```
For static variables, rustc was able to use PC-relative addressing (which is quite efficient), but thread-local variables are a bit more interesting.
Apparently, thread-local storage is implemented differently depending on whether it is defined in an executable or in a (shared) library. This StackOverflow comment and the OSDev article seem to have some details regarding it. What seems to be happening is:

- when the thread-local is defined in an executable, the code uses the `FS` segment register to reference the thread-local area directly;
- when it is defined in a library, the code has to call the `__tls_get_addr` helper function instead.

This doesn't explain why the benchmark reports identical numbers for `bin` and `lib`, though; perhaps the code that actually runs isn't what `cargo asm` has reported. Notably, if I add `#[inline(never)]` to a function that contains the reading code, thread-local access latency increases to 1.58 ns/iter.
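Roughly, that `#[inline(never)]` variant looks like the following sketch (names assumed); the attribute forces the thread-local read to go through a real function call instead of being inlined into the measurement loop:

```rust
use std::cell::Cell;

thread_local! {
    static THREAD_VAL: Cell<u64> = const { Cell::new(42) };
}

// The read now happens behind an actual call; per the numbers above, this
// variant measures ~1.58 ns/iter instead of ~1.28 ns/iter.
#[inline(never)]
fn do_thread_static_read() -> u64 {
    THREAD_VAL.with(|cell| cell.get())
}

fn main() {
    assert_eq!(do_thread_static_read(), 42);
}
```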