This week I found myself digging through the code of c4, an implementation of C “in four functions”, by Robert Swierczek.
I remember coming across c4 when it was released ten years ago. It got me excited: hey, C in four functions, that means it’s easy to understand right?
That excitement turned into “oh, I see” as soon as I scrolled through the code. c4 is dense, barely commented, and, frankly, strange. It’s unlike anything else I had come across in compiler land.
After reading through the code this week here are the other adjectives I would add to the list: clever, tricky, fascinating, cool. It’s a compiler, it’s a VM, it’s an interpreter, it’s a parser, it’s art, it’s trickshot programming.
I mean, look at this:
Let’s start at the top: it’s C in four functions — what? And if you open the code you’ll see that, yes, this is true, no cheating:
And it works!
First you need an existing C compiler to compile c4 and then you can use c4 to run another C program.
Say I have this program:
// my_program.c
#include <stdio.h>
int add(int a, int b) {
return a + b;
}
int main() {
int a, b, c;
a = 99;
b = 20;
c = add(a, b);
printf("c: %d\n", c);
return 0;
}
Once I compiled c4, I can run my_program.c
with c4:
$ gcc -m32 -o c4 ./c4.c && ./c4 my_program.c
c: 119
exit(0) cycle = 39
c is 119. Correct. So how does that work? Did it compile it? Did it run it?
The README gives us a hint: c4 takes an -s
flag. I assumed that’s to give us the assembly, but no:
That’s not x86 assembly, right? I mean, some of it sounds like x86 — LEA, ADD —
but then there’s PSH
and LEV
and PRTF
. I’m pretty sure PRTF
corresponds to our printf
statement and PRTF
is not an x86 statement.
Turns out it’s bytecode. This week I figured out that c4 is a bytecode compiler and virtual machine.
Here’s the high-level overview of how the four functions work, as best as I can tell:
Function
next
is a tokenizer/lexer function that produces the next token whenever it’s called.Function
expr
parses expressions (by callingnext
too) and emits bytecode right away.Function
stmt
parses statements (callingnext
andexpr)
and is also emitting bytecode right away.main
ties it altogether: it allocates memory for various structures, then kicks off the parsing that will result in bytecode being emitted, and then it sets up a “virtual machine” to run the bytecode.
Look at that virtual machine, it’s the very last code in all of c4.c
and it’s beautiful:
Look at how dense this is, look at the terse comments, how it’s formatted, how the operators are aligned, how the nasty side-effect-having stuff is all crammed together at the bottom where it belongs.
C implemented as a bytecode compiler and virtual machine, in four functions — and it’s self-hosting, too. Yes, c4 can run c4:
$ gcc -m32 -o c4 ./c4.c
$ ./c4 c4.c my_program.c
c: 119
exit(0) cycle = 39
exit(0) cycle = 59822
This is c4 running c4.c
which then runs my_program.c
. You can see that it takes 39 VM cycles to execute my_program.c
and 59822 cycles to execute c4.c
executing my_program.c
.
If you just didn’t say “dude, that’s sick” out loud, I’m sorry to tell you, but you might not be alive.
What I realised this week: the self-hosting aspect is a very big part in the puzzle. Because which language is the easiest to self-host? A small language. A language that’s just small enough so you can build a compiler in it — a compiler that’s just big enough to compile the language it’s written in.
The snake has to be just long enough to be able to eat itself. If it’s not, it’s a dog chaising its tail.
Here’s an example. In order to keep the language as small as possible, there are no structs in c4. Yes, that very fundamental data structure — it does not exist in this self-devouring compiler/machine beast. Integers, chars, enums, pointers, functions. That’s it.
That leads to some fascinating code. Here’s a definition from the top of the file:
// identifier offsets (since we can't create an ident struct)
enum { Tk, Hash, Name, Class, Type, Val, HClass, HType, HVal, Idsz };
This enum is used to simulate structs when tokenizing. It’s pretty clever.
Here are the two relevant declarations at the top:
int *id, // currently parsed identifier
*sym, // symbol table (simple list of identifiers)
And here’s the part of next
that tokenizes identifiers:
else if ((tk >= 'a' && tk <= 'z') /*...*/)) {
pp = p - 1;
while ((*p >= 'a' && *p <= 'z') /* ... */)
tk = tk * 147 + *p++;
tk = (tk << 6) + (p - pp);
id = sym;
while (id[Tk]) {
if (tk == id[Hash] && !memcmp((char *)id[Name], pp, p - pp)) {
tk = id[Tk]; return;
}
id = id + Idsz;
}
id[Name] = (int)pp;
id[Hash] = tk;
tk = id[Tk] = Id;
return;
}
I simplified it a bit. But the core part is there:
tk
is reused to become a hash of the identifierid
(the “currently parsed identifier”) gets set to the start of the symbol tablethe symbol table is then traversed, checking whether there is already a symbol with the same hash in there (
tk == id[Hash]
) and the same name (memcmp
returns 0 if the strings are the same, remember, that’s why the ! is there)if no match is found,
id[Name]
andid[Hash]
get set,tk
gets set to the current token type (Id, identifier
), and function is done
There’s a lot going on, but see how the enum fields are used to simulate a struct?
Tk, Hash, Name
— these all come from the enum. And in the enum, the values are integers according to their position in the enum. Tk
is 0, Hash
is 1, etc. So these offsets are used to index id, which points to the symbol table.
So id[Hash]
points to the Hash field in the “struct”. id = id + Idsz
advances to the next “struct” in the list by advancing the cursor by a full “struct” length.
Blew my mind.
There’s more mind-blowing stuff in these 525 lines that are c4.
#include
statements, for example, are completely skipped. Instead c4 has its own stdlib in the form of these bytecodes we saw above:
else if (i == OPEN) a = open((char *)sp[1], *sp);
else if (i == READ) a = read(sp[2], (char *)sp[1], *sp);
else if (i == CLOS) a = close(*sp);
else if (i == PRTF) { t = sp + pc[1]; a = printf((char *)t[-1], t[-2], t[-3], t[-4], t[-5], t[-6]); }
else if (i == MALC) a = (int)malloc(*sp);
else if (i == MSET) a = (int)memset((char *)sp[2], sp[1], *sp);
else if (i == MCMP) a = memcmp((char *)sp[2], (char *)sp[1], *sp);
else if (i == EXIT) { printf("exit(%d) cycle = %d\n", *sp, cycle); return *sp; }
All the other instructions and their implementations are C at its … most C — very, very dense.
Then there’s the fact that the base type is an integer. Or that it has a debug mode. Or that the order of the token types in the enum definition also define their precedence. Or…
Isn’t it amazing what code is floating around, out there in the world, ready to reveal secrets and delight, if one takes the time to dig?
"...how the nasty side-effect-having stuff is all crammed together at the bottom where it belongs." sounded almost lewd hahah. Awesome write-up, really enjoyed reading / learning about this!
Dlaczego autor nie napisał opisu do tego programu? nawet po polsku? Dziwne, nie? No cóż Polacy tak mają.