8 April 2014
Machine code and garbage collection
So you're building a language.
If you're constructing a programming language, chances are that you want to use machine code somewhere in your implementation. It might be because you want to use C modules, it might be that you're using a JITing, or you might just want to compile ahead-of-time to a binary.
If your language is garbage-collected, there are a number of things you must consider. This is an attempt to mention just some of them.
But first, let's take a brief look at how garbage collectors work, or more specifically, how one type of garbage collectors work. You have your heap with a collection of objects. These objects can have references to other objects. A tracing garbage collector stop the running program at some point during its execution. It will then go through the already known set of objects, usually global variables and objects on the stack. It'll mark these objects as reachable, find all the references to other objects within them and mark them as reachable as well. This is called the mark phase.
Once it has run out of objects to scan, it will take all objects that it didn't reach and collect them. This is called the sweep phase.
The fact that a garbage collector only collects objects it didn't mark as reachable has some serious implications. It means that if you didn't recognize a reference somewhere, you will end up collecting items that are still live.
The big problem with machine code and garbage collection is that machine code does not know what a reference is. To the machine code, memory is just a large array of numbers and it has no idea what's reference and what's not. Since we'll be collecting all objects that we didn't reach, missing a reference means potentially freeing memory that isn't ready to be free. We'll need to find some way of getting around this problem.
One strategy is to treat all memory as references. This is what's called a conservative garbage collector. By doing this, we can be sure that no matter what, you will not miss a reference.
The downside with this tactic is that you can have spurious references. Somewhere in your data, you could have a string of bytes which looks like a pointer. The garbage collector would then keep the memory around, even though it could be freed. This is how the Boehm garbage collector for C works.
Even if you implement this scheme, there are still things to watch out for. Since compilers have become very good at optimizing and keeping often referenced variables in registers, you need to make sure that you're looking at the registers as well. This problem manifested itself as a bug in Ruby modules, where it would free memory that was only referenced by register.
Another strategy is to disable garbage collection while your native code is running. Once you're back in your interpreter, the native code would have left you in a state that your interpreter can understand and you can start your garbage collection. If you're implementing a dynamically typed language, you will have type information at hand anyway and you can use this to your advantage by just looking up where the references are.
The disadvantage is that your garbage collection is delayed until a return from your native code. This is fine for something like JITed code where you'll eventually return into your intepreter, but implementing event loops in C is a problem since you never return. This is also a problem if you're running more than one thread in your implementation. Since any one thread executing machine code can stall your collection, multiple threads increase the chance that you'll have to wait even longer before you can garbage collect.
Yet another strategy is to build type information for your machine code that the garbage collector can use. This usually manifests itself as 2 data structures. One is a bitmap of the stack frame, showing the garbage collector where pointers on the stack are for any given instruction. The other is bytecode telling the garbage collector where to find references. Using this combination of data structures, the garbage collector is able to precisely figure out where the references are.
Having this data means that you'll have to calculate it somehow. If you're compiling ahead-of-time, this shouldn't be a big problem. You just make sure that you emit the data when handling your types. However, if you're trying to integrate C modules, the normal C compilers will not help you. You'll either have to use conservative garbage collection or implement your own C compiler, just to get this information.
So far, I've only just scratched the surface. There are many more things to consider like how to make sure that all your threads can be stopped for collection, interrupting your machine code only when the heap is in a consistent state and how to make your compiler generate machine code that won't lose references.
I am by no means an expert on this subject. You shouldn't use any of this advice to actually build a garbage collector. But I do have an appreciation of how much work goes into building one and after this, I hope you do too.