More troubles in Windows thread suspension

For now, my work on threading support in SBCL is stalled because I still cannot implement thread suspension for garbage collection. The main problem is that the Windows API does not offer asynchronous signals (which are available on all other platforms and are exposed as the pthread_kill function in the POSIX Threads API).

I've tried several options:

  • A thread in a pseudo-atomic section suspends on its own, and in all other cases it is stopped with SuspendThread. This was fraught with problems in many places: threads sometimes just hung, there were errors inside the garbage collector, and I was getting spurious page faults. I later understood that this was because the threads were being stopped in the wrong place: SBCL marks non-interruptible code not just with pseudo-atomic sections but also with the thread signal mask.

  • Then I tried adding a signal mask to each thread and taking this mask into account when suspending it: first, we suspend the thread using SuspendThread; if the thread has masked the SIG_STOP_FOR_GC signal, we resume it using ResumeThread, sleep for a while, and then repeat. After this change the threads stopped hanging, but I got a different problem: as soon as garbage collection completes and the threads resume, some threads discover invalid virtual memory protection flags. I attribute this to a thread being suspended inside the exception handler, so that it observes the exception handler context instead of the proper thread context.
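The suspend, check, resume and retry loop from the second attempt can be sketched roughly as follows. This is only an illustration in portable C, not the code I actually wrote: struct sim_thread, os_suspend, os_resume and os_sleep_a_little are hypothetical stand-ins for the real thread structure and for SuspendThread, ResumeThread and Sleep, and the "sleep" simply simulates the mutator making progress towards unmasking the signal.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread record; the real SBCL thread structure differs. */
struct sim_thread {
    bool suspended;         /* models the state toggled by SuspendThread/ResumeThread */
    int  masked_polls_left; /* how many more checks will see SIG_STOP_FOR_GC masked */
};

/* Stand-in for SuspendThread. */
static void os_suspend(struct sim_thread *t)
{
    assert(!t->suspended);
    t->suspended = true;
}

/* Stand-in for ResumeThread. */
static void os_resume(struct sim_thread *t)
{
    assert(t->suspended);
    t->suspended = false;
}

/* Stand-in for Sleep: while we sleep, the simulated mutator gets closer
   to unmasking SIG_STOP_FOR_GC. */
static void os_sleep_a_little(struct sim_thread *t)
{
    if (t->masked_polls_left > 0)
        t->masked_polls_left--;
}

static bool stop_for_gc_masked(const struct sim_thread *t)
{
    return t->masked_polls_left > 0;
}

/* Returns the number of suspend attempts it took to stop the thread with
   the signal unmasked, or -1 if max_attempts was exhausted (in which case
   the thread is left running). */
int suspend_for_gc(struct sim_thread *t, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        os_suspend(t);
        if (!stop_for_gc_masked(t))
            return attempt;      /* safe to keep it suspended */
        os_resume(t);            /* stopped in a masked region: back off */
        os_sleep_a_little(t);
    }
    return -1;
}
```

Note that this loop only consults the signal mask; it knows nothing about a thread that happens to be inside an exception handler, which is where I suspect the second attempt goes wrong.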

Given all of that, I conclude that I won't be able to achieve asynchronous, non-cooperative thread suspension.

One of the things I researched was how the .NET garbage collector handles thread suspension (http://msdn.microsoft.com/en-us/library/678ysw69.aspx). The .NET CLR tracks suspension requests for each thread, and the compiler inserts special instructions, GC safepoints, so that each thread checks from time to time whether it should suspend. The specific CPU instructions used for the safepoints are not that important. They could be, for example, a read from a special region of memory that is unmapped when garbage collection starts, which causes a page fault and an exception in the reading thread and gives it a chance to react to the garbage collection.

When a thread leaves CLR-managed code (that is, performs a blocking operation or calls out to foreign code), it sets a flag signalling that it will not touch the GC heap. When the thread returns to CLR-managed code, it checks whether it should stop and wait for the GC to complete.

There is also a weird quirk: if a thread does not stop within 250 milliseconds, it is forcibly stopped using SuspendThread. A thread might not stop in time if it does not reach a safepoint for a long while. In the case of .NET, this can happen if the thread is running a long loop that neither calls other methods nor allocates memory. In this regard, the .NET garbage collector is much less careful than the JVM, which inserts safepoints not just at method calls but also at backward jumps (any loop is implemented as a backward jump), which guarantees that a thread will eventually reach a safepoint.
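To make the cooperative scheme concrete, here is a minimal sketch in C11. It is not how the CLR is implemented. It polls a flag instead of reading from an unmapped page, and every name in it (struct mutator, gc_requested, safepoint_poll and so on) is invented for illustration. Actually parking the thread is left to the caller; the poll only reports that the thread must stop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical thread states, named for this sketch only. */
enum mutator_state { IN_MANAGED_CODE, IN_FOREIGN_CODE, STOPPED_AT_SAFEPOINT };

struct mutator {
    atomic_int state;
};

/* Set by the collector when it wants all mutators to stop. */
static atomic_bool gc_requested;

void request_gc(void) { atomic_store(&gc_requested, true); }
void finish_gc(void)  { atomic_store(&gc_requested, false); }

/* The safepoint check the compiler would insert. Returns true if the
   thread must park itself now and wait for the collector. */
bool safepoint_poll(struct mutator *m)
{
    if (!atomic_load(&gc_requested))
        return false;
    atomic_store(&m->state, STOPPED_AT_SAFEPOINT);
    return true;
}

/* Called before a foreign call or a blocking operation: from now on the
   thread promises not to touch the GC heap. */
void enter_foreign_code(struct mutator *m)
{
    atomic_store(&m->state, IN_FOREIGN_CODE);
}

/* Called on the way back: first become a mutator again, then check
   whether a collection started in the meantime. */
bool leave_foreign_code(struct mutator *m)
{
    assert(atomic_load(&m->state) == IN_FOREIGN_CODE);
    atomic_store(&m->state, IN_MANAGED_CODE);
    return safepoint_poll(m);
}

/* Called by a parked thread once the collector has let it go. */
void resume_managed_code(struct mutator *m)
{
    atomic_store(&m->state, IN_MANAGED_CODE);
}

/* Collector side: a thread that is outside managed code, or already
   parked, does not have to be stopped. */
bool gc_may_ignore(struct mutator *m)
{
    return atomic_load(&m->state) != IN_MANAGED_CODE;
}
```

The order in leave_foreign_code matters: the thread publishes its new state before it reads gc_requested, while the collector sets gc_requested before it reads thread states. With the default sequentially consistent atomics, at least one side notices the other, so a returning thread cannot slip into managed code unobserved.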

I shall try to employ a similar technique.

  • Safepoints will be placed in several kinds of places:

    • exit from a pseudo-atomic section
    • change of the thread signal mask
    • handler for CPU trap instructions
    • return from foreign code
    • return from blocking operations
  • When entering foreign code (such as a C function call) or invoking a blocking operation, a thread will set a flag to signify that it will not do anything with the GC-managed heap, so this thread may be ignored when suspending all mutator threads during a GC cycle

  • As I don't want to go too deep into the compiler, I will try to implement GC as follows:

    • First, all threads are suspended
    • GC enumerates all threads and checks their state:
      • if the thread is already safe for garbage collection, it is resumed
      • if the thread is in a pseudo-atomic section or has masked SIG_STOP_FOR_GC, it is resumed and the GC waits until the thread leaves this section
      • if the thread is running Lisp code, put a breakpoint on the closest jmp or call instruction by replacing it with a trap instruction, and resume the thread. Soon enough the thread will reach the breakpoint and enter the exception handler. The exception handler will restore the original instruction and suspend the thread.
      • So far I haven't come up with what to do if the thread is running neither Lisp code nor foreign code, is not in a pseudo-atomic section, and has its signals unmasked. We could probably resume the thread, wait a little, and then try again.
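The per-thread decision in the plan above, together with the breakpoint trick, might look roughly like this in C. This is a sketch under several assumptions, not working SBCL code: struct thread_snapshot and the function names are invented, the trap instruction is assumed to be the one-byte x86 int3 (0xCC), and locating "the closest jmp or call" is skipped entirely (the offset is simply passed in). The code buffer is an ordinary array here, whereas real code would also have to be writable and the instruction cache would need flushing after patching.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What the GC decides to do with one suspended thread. */
enum gc_action {
    RESUME_AND_IGNORE,  /* already safe for GC (foreign code, blocking op) */
    RESUME_AND_WAIT,    /* pseudo-atomic or SIG_STOP_FOR_GC masked */
    PLANT_BREAKPOINT,   /* running Lisp code */
    RESUME_AND_RETRY    /* none of the above: the tentative fallback */
};

/* Hypothetical snapshot of the state the GC inspects. */
struct thread_snapshot {
    bool gc_safe;
    bool in_pseudo_atomic;
    bool stop_for_gc_masked;
    bool in_lisp_code;
};

enum gc_action classify_thread(const struct thread_snapshot *t)
{
    if (t->gc_safe)
        return RESUME_AND_IGNORE;
    if (t->in_pseudo_atomic || t->stop_for_gc_masked)
        return RESUME_AND_WAIT;
    if (t->in_lisp_code)
        return PLANT_BREAKPOINT;
    return RESUME_AND_RETRY;
}

#define TRAP_OPCODE 0xCC /* x86 int3; assumed here as the trap instruction */

struct breakpoint {
    uint8_t *addr;   /* where the trap was written */
    uint8_t  saved;  /* the byte it replaced */
};

/* Overwrite the first byte of the instruction at `offset` with a trap,
   remembering the original byte. */
struct breakpoint plant_breakpoint(uint8_t *code, size_t len, size_t offset)
{
    assert(offset < len);
    struct breakpoint bp = { code + offset, code[offset] };
    code[offset] = TRAP_OPCODE;
    return bp;
}

/* What the exception handler would do before suspending the thread. */
void restore_breakpoint(const struct breakpoint *bp)
{
    *bp->addr = bp->saved;
}
```

The order of the checks follows the list: the "safe for GC" flag wins over everything else, and the breakpoint is planted only when the thread is known to be in Lisp code with nothing masked.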