Hacking on SWIG

I recently ran into a bug in SWIG. It was crashing and reporting syntax errors in one of the header files, at line numbers that did not actually exist in the file.

Some bisection led me to the _GtkArg structure, which contains anonymous structures with several levels of nesting:

struct _GtkArg
{
  //....
  union {
    /* flat values */
    //....
    /* structured values */
    struct {
      GCallback f;
      gpointer d;
    } signal_data;
  } d;
};

The SWIG documentation mentioned limited support for nested structures, but it was totally unexpected to me that they would produce syntax errors.

All my attempts to make a smaller reproduction case were unsuccessful.

Then I tried removing various parts of the structure one by one. It was quite puzzling, but things became clear when I got to the comments. The comments somehow prevented SWIG from parsing this struct.

Then I tried removing characters from the comment one by one to see whether anything would change. “structured values”, …, “structured”: still the same, SWIG would report nonexistent syntax errors. Not until the comment shrank down to just “struct” did the error disappear.

That seemed totally weird. The comment should have been removed by the preprocessor, so its contents should not have mattered at all.

But the “struct” seemed quite suspicious. And that hypothesis turned out to be true: any comment containing the word “struct”, placed at this spot, causes syntax errors. If such a comment is placed anywhere else, there are no errors.

Further experimentation showed that the same thing happens with the words “class” and “union”. This suggested that something was wrong with the lexer, the first and most obvious suspect. So I changed SWIG's lexer to print the sequence of tokens it sees. Without the comment, the tokens were all fine, but with the comment added the token sequence became corrupt. It was as if part of the text after the comment were invisible to the lexer. Also, the syntax error messages contained a random garbage string, which hinted at memory corruption happening somewhere.

Further debugging led to the grammar rule for structures in the parser. Right before it was a funny comment. For maximum fun, here it is in its full glory:

/* ----------------------------------------------------------------------
   Nested structure.    This is a sick "hack".   If we encounter
   a nested structure, we're going to grab the text of its definition and
   feed it back into the scanner.  In the meantime, we need to grab
   variable declaration information and generate the associated wrapper
   code later.  Yikes!

   This really only works in a limited sense.   Since we use the
   code attached to the nested class to generate both C/C++ code,
   it can't have any SWIG directives in it.  It also needs to be parsable
   by SWIG or this whole thing is going to puke.
   ---------------------------------------------------------------------- */

Well, that's a nice comment!

Another debugging session led to the dump_nested function, which feeds a slightly modified substring containing the structure definition back to the parser. This turned out to be the source of the syntax errors. The next clue was the “nasty hack alert” in a comment there.

To make a long story short, this “sick hack” used strstr to modify some code fragments, and it did so in such a way that the comment lost its terminating sequence, so the lexer couldn't find the end of the comment.
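The failure mode can be sketched roughly as follows. This is not SWIG's actual code: the function name blank_after_keyword and the exact rule it applies (overwrite everything between a keyword match and the next '{' with spaces) are assumptions made purely for illustration. The point is only that a plain strstr search cannot tell a keyword from the same letters inside a comment.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for a keyword-based text rewrite.
   For every occurrence of `keyword` in `code`, overwrite the text between
   the end of the match and the next '{' with spaces. The search is a plain
   substring match, so it knows nothing about comments or identifiers. */
static void blank_after_keyword(char *code, const char *keyword) {
  char *p = code;
  while ((p = strstr(p, keyword)) != NULL) {
    char *brace;
    p += strlen(keyword);     /* skip past the matched keyword */
    brace = strchr(p, '{');   /* next opening brace, if any */
    if (brace == NULL)
      break;
    while (p < brace)         /* blank everything up to the brace */
      *p++ = ' ';
  }
}
```

Applied to the text `/* structured values */ struct {`, the search matches the first six letters of “structured”, and the blanking then runs over the rest of the comment, its closing `*/`, and the real `struct` keyword. What remains is an opening `/*` with nothing to close it.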

It took me two days to find the root cause. The fix (also a hack, by the way, but a simpler one: just delete all the comments from the code before the modifications) took only about 10 minutes.
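A minimal sketch of that kind of fix might look like the following. It is not the patch itself: the function name blank_comments is made up, and instead of deleting comments it overwrites them with spaces, which keeps the length of the text and its line breaks unchanged. It also handles only /* ... */ comments and does not treat string or character literals specially.

```c
/* Hypothetical helper: overwrite every C block comment in `code` with
   spaces, in place. Newlines inside comments are kept so that line
   numbers stay the same. An unterminated comment is blanked to the end
   of the string. */
static void blank_comments(char *code) {
  char *p = code;
  while (*p) {
    if (p[0] == '/' && p[1] == '*') {
      *p++ = ' ';                       /* blank the opening slash */
      *p++ = ' ';                       /* blank the opening star */
      while (*p && !(p[0] == '*' && p[1] == '/')) {
        if (*p != '\n')
          *p = ' ';                     /* blank the comment body */
        p++;
      }
      if (*p) {                         /* blank the closing star and slash */
        *p++ = ' ';
        *p++ = ' ';
      }
    } else {
      p++;
    }
  }
}
```

Running something like this over the nested structure's text before any keyword-based rewriting means that a word such as “structured” in a comment is already gone by the time strstr goes looking for “struct”.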

The patch was sent to the maintainers.