How to adapt SBCL's garbage collection and threading to Windows

I'm trying to implement threading support in the Windows version of SBCL. I've been looking at the source code and I was quite surprised that this had not been implemented yet. I guess the SBCL community is quite small.

I did a preliminary analysis of what needs to be done to support threads and how it can be done.

The SBCL sources are quite interesting. There are a lot of places which hadn't seen attention for quite a long time. And several places in the source code are quite involved to hard to wrap my head around – especially code that deals with signals and signals masking. The documentation is quite scarce (especially when it comes to introductory materials) but on the whole, it's approachable. The old CMUCL papers and articles are quite useful (since SBCL is based on CMUCL) and are linked from the sbcl-internals.

SBCL uses the OS API in nontypical ways. I was particularly paying attention to things that are required to make SBCL work on Windows.

The primary feature of Lisp that made SBCL developers go to great lengths and employ various tricks and hacks is, of course, the garbage collector. SBCL performs garbage collection if either successive allocation fails or periodically (the trigger is based on the number of bytes since the last collection). The garbage collection is performed in the calling thread and it suspends all other threads. When all threads are suspended, it starts to scan the garbage collection roots (that is, stacks of all threads and global variables). Having completed the garbage collection, all stopped threads are resumed.

As memory allocation is a frequently invoked operation in Lisp, it should run as fast as possible. For this, every thread in SBCL has a per-thread memory region in which allocation happens without locking by just a simple bump allocator.

But memory allocation is not an atomic operation and consists of several steps:

  • Increase the pointer to free memory by the allocation size
  • Compare the pointer with the bounds of the free region
  • Initialize the object in the allocated space.

Now, suppose the garbage collector stopped the thread in the middle of the allocation. It would observe an inconsistent state in which the GC invariants are violated and would corrupt the memory if it would try to do collection in such a case.

That's why SBCL employs a technique named “pseudo-atomic sections”. These sections are short segments of code that are handled specially by the garbage collector - the garbage collector ensures that no thread is suspended inside a pseudo-atomic section.

The pseudo-atomic sections are implemented as follows:

  • Upon entering the pseudo-atomic section the thread sets the flag to signify that it's in such a section
  • The garbage collector checks this flag for each thread and if it's set, sets another flag to signify that the garbage collector is waiting for this thread and unblocks this thread
  • Upon exiting the pseudo-atomic section, the thread checks for this flag and suspends itself if necessary.

While starting the collection, the garbage collector ensures that the collection is not running already and if this is the case, it waits for the collection to complete and then retries the memory allocation.

When running under Linux, the SBCL uses signals delivered to specific threads in order to stop or resume their execution. It uses real-time signals for that purpose since normal signals are not queued: the thread will skip a signal if already has a signal queued for it while that signal is masked. For correct signal handling, threads manipulate their signals mask.

On Windows, it makes no sense to use signals since Windows does not have signals (not even talking about real-time signals). But it supports suspending threads. After the thread is suspended, it is possible to query or set the thread context. This lets us, for example, imitate a function call. This lets us implement thread interruption by invoking a special code inside it. Also, this is one of the ways to synchronize pseudo-atomic sections with the garbage collector.

Every thread has an area that is local to the current thread. On Linux, the pointer to it is stored in either the %FS register on x86 or the %R12 register on x86-64. Windows lets each thread store an arbitrary pointer-sized value at the address %FS:0x14. And some runtimes use the %GS register to store the base pointer to the thread-local data area. Using the functions NtQueryInformationProcess/NtSetInformationProcess we can set the base and limit for the memory segment, and then using the GetThreadContext/SetThreadContext it is possible to set the selector for this segment in a segment register (%GS for Windows x86 since %FS is already taken for Thread Execution Block). A segment register stored the segment selector which is defined in the LDT (Local Descriptor Table). It is only possible to have 8192 segment selectors which limits the possible number of threads to just 8192.

In a per-thread area we would be able to store the following data:

  • dynamic binding stack
  • control stack
  • thread flags
  • thread synchronization objects
  • current dynamic values of symbols
  • per-thread allocation area

Let's see how the GENCGC garbage collector allocates memory. Memory is allocated from the per-thread memory region in a pseudo-atomic code section. The allocation sequence is as follows:

  • The thread enters a pseudo-atomic section
  • If the per-thread region has sufficient free space to satisfy the allocation request, the allocation happens from this region
  • Otherwise, a new region is allocated (this might trigger garbage collection)

Now, about the dynamic bindings for special variables. SBCL stores in the symbol structure the following:

  • global value
  • TLS index for the per-thread current value

Also, there is a per-thread binding stack that stores the pairs consisting of a symbol and the symbol's value. When a code dynamically binds a special variable to a new value, it pushes the symbol and its old value to the bindings stack and writes a new value into a thread-local area using the TLS index. Upon leaving the scope of dynamic binding, everything is done in reverse order.

The Lisp stack (also named the control stack) is separated from the foreign code stack (named the numeric stack) and from the dynamic bindings stack. The difference between the Lisp stack and the foreign code stack is that the Lisp stack only contains pointers to boxed Lisp objects and is scanned by the garbage collector while the foreign code stack is not inspected by the garbage collector at all.

The SB-THREAD package defines a set function to control threads:

  • make-thread creates a new thread
  • join-threads waits for thread completion
  • interrupt-thread interrupts a thread. Interruption is analogous to UNIX signals: thread stops doing whatever it was doing and invokes the specified function. If the thread is already being interrupted then the new interrupt is queued rather than interrupting the current interrupt handler. In Linux, this is implemented with the help of a signal. On Windows, it might be possible to implement using the combination of SuspendThread and SetThreadContext to manipulate the thread state.
  • terminate-thread forcibly terminates a thread. Thread termination is based on thread interruption: a thread is interrupted with a special function that unwinds the stack and exits from the thread.

Also, the timers are based on interrupts. A timer is a facility to periodically call a function in a specified thread (or in a new thread).

The main synchronization object used by SBCL is the futex. On Linux, it is implemented as a native Linux futex and is called a “lutex”. In a nutshell, futex is a combination of a mutex with a conditional variable, i.e. it protects the data from concurrent access and allows to wait for changes in the data. Waiting for changes is used, for example, inside the SBCL to wait for the thread state to change. On Windows, there are no conditional variables (or rather, they are available but only in the newest versions of Windows), so we will need to implement futexes based on the primitives that we have (for example, a critical section and an event). One peculiarity of SBCL is that the futexes themselves are not exposed, but rather the synchronization functions accept the pointer to the data being protected by the futex. This allows not dumping the futexes when saving the image to the disk and in general to disregard their initialization and finalization which somewhat simplifies their implementation.