SBCL as a low-level machine code breadboard
A 2014 technical analysis reveals that specialising primitives for a restricted 10-slot stack yields a virtual machine with reasonable runtime overhead compared to native execution.
In a 2014 blog post, developer Paul Khuong detailed the use of Steel Bank Common Lisp (SBCL) to generate and inspect domain-specific machine code. The work involved constructing a small, stack-based virtual machine with a restricted 10-slot stack, designed to mimic the behaviour of the x87 floating-point unit and the F18 processor. By specialising primitives based on the virtual stack pointer and utilising SBCL’s assembler alongside the SLIME REPL, Khuong achieved a runtime overhead approximately six times slower than native code, a significant improvement over the 11–15x overhead of unspecialised bytecode.
The project draws conceptual parallels to Chuck Moore’s Forth language and Daniel J. Bernstein’s research on x87 rotating stacks. Khuong noted that the small stack size, with no fancy overflow or underflow traps, aligns with design choices seen in the HP-41C calculator and x87 architecture. Bernstein’s thesis suggested that with careful scheduling, the implicit stack rotation of x87 could be equivalent or superior to registers. This inspired Khuong to explore implementation techniques for stack-based virtual machines that restrict their stack to a small number of slots, aiming to keep operations within registers without the complexity of traditional push and pop operations.
To achieve this, the implementation mimics the x87 and F18 by using a modular counter for the stack top rather than moving data. The key innovation was specialising primitives for all eight values the stack counter can take, avoiding data shuffling. Khuong stored variants of each primitive at regular intervals to facilitate dispatch, using offsets from the primitive code block rather than hardcoding addresses. This approach allowed for the emission of repetitive machine code for primitives, including unconditional jumps, calls, returns, and conditionals, using SBCL’s interactive programming capabilities.
Performance testing revealed that a test loop of one million iterations took 15 million cycles, or 15 cycles per iteration. This was improved by fusing operations like 'lit' and 'sub' into 'dec', and by exposing branch predictability to the hardware through duplicated NEXT sequences. The specialised virtual machine implementation was found to be approximately six times slower than native code on a MacBook Air, compared to 11–15 times for unspecialised bytecode. An edit noted that a wasteful encoding mistake in the NEXT sequence, used to encode an effective address, was fixed in the definition but not in the disassembly snippets shown.
Khuong concluded that SBCL proved to be a strong companion for exploring the generation of domain-specific machine code, offering unique support for interactive programming and machine code inspection that C abstracts away. While not interested in straight stack languages, he believed a fixed stack virtual machine could serve as a nice runtime intermediate representation when coupled with a mechanism for local variables. The work also mentioned LuaJIT with dynasm as a potential comparable tool for similar purposes, highlighting SBCL’s position in low-level hardware exploration.


