Just over two weeks ago I released :Firth pre-alpha1 into the wild. It had progressed to the point where it was starting to be able to do what I’d intended it for–implement small DSLs with relatively little code. It was fairly small, simple, and fast, in the grand scheme of things. Writing code in :Firth was a joy, the syntax was clean and easy to understand, and I was generally happy with what I’d built. So I cut a release, made a little announcement, and waited to see what people thought.
The feedback? Overwhelmingly positive. I did get some technical notes and observations, though: it’s too big, too complicated, and it’s totally incomprehensible.
Ouch. Confusion about how it works and how to read the code is mostly a cultural issue. To Forth veterans, the REPL with the “ok” prompt should be perfectly mundane and familiar. The idiosyncratic colon definition syntax (: ... ;) and reverse-polish notation take some getting used to, but they aren’t especially difficult except insofar as they’re so different from pretty much all mainstream languages.
But I took to heart the observation that it could be smaller and simpler. I built the first prototype as quickly as possible, in the way that was most obvious to me. As a longtime C++ programmer, that meant an OO design. Unfortunately, it’s not a very good match for the Forth-style compilation and runtime models. The division between “the compiler object” (a phrase which should make any Forth programmers out there cringe) and the language primitives that form the runtime API was artificial, arbitrary, and more of a hindrance than a help. The stack is also bigger and slower than I would prefer; I wrote it like an STL container. Lua is designed for this kind of usage, but it does make it more difficult for LuaJIT to optimize code that uses the stack. There’s also some overhead for calling metamethods with the OO style, which makes a difference when everything in the language touches the stack.
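For context, the original stack looked roughly like the sketch below. This is an illustration of the style, not the actual pre-alpha1 code: an array-style table with methods dispatched through a metatable, so every push and pop goes through __index lookups and metamethod machinery.

-- Illustrative sketch of the OO-style stack, not the actual pre-alpha1 code.
local stack = {}
stack.__index = stack

function stack.new()
    return setmetatable({ height = 0 }, stack)
end

function stack:push(x)
    self.height = self.height + 1
    self[self.height] = x
end

function stack:pop()
    local x = self[self.height]
    self[self.height] = nil     -- zero out the slot
    self.height = self.height - 1
    return x
end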
Prototyping a simpler, sleeker stack led necessarily to prototyping new threading/code generation implementations. Based on feedback I’d received, I decided to try to exploit the Lua call stack. After several iterations, I ended up with a set of closures wrapping a coroutine, which served as the execution context for a small swarm of mutually tail-recursive, data-yielding continuations. It sounds convoluted, but it’s actually extremely small, tight, elegant, and, I think, beautiful code. I mean, just look at this gorgeous chunk of bits:
local function pushs(x, ...)
    local continuation, nx = yield()
    return continuation(nx, x, ...)
end

local function pops(_, ...)
    local continuation, x = yield((...))
    return continuation(x, ...)
end

local function shoves(_, ...)
    local continuation, x = yield()
    return continuation(x, shove(...))
end

local function yanks(x, ...)
    local continuation, nx = yield((select(x+1, ...)))
    return continuation(nx, yankfilter(0, x, ...))
end

local function peeks(x, ...)
    local continuation, x = yield((select(x+1, ...)))
    return continuation(x, ...)
end

local function unpacks(x, ...)
    local continuation, x = yield(...)
    return continuation(x, ...)
end

local function clears(x, ...)
    local continuation, x = yield()
    return continuation(x)
end

local function heights(x, ...)
    local continuation, x = yield(count(...))
    return continuation(x, ...)
end
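To give a feel for how these continuations fit together, here’s a hypothetical driver. This is not :Firth’s actual wiring, and it assumes yield above is a local alias for coroutine.yield; but the idea is that the coroutine body kicks off the chain with an empty stack, and each resume hands the suspended continuation the next operation plus its argument.

-- Hypothetical driver, not :Firth's actual code: each resume hands the
-- suspended continuation the next operation and its argument, and gets back
-- whatever the previous operation yielded.
local stack = coroutine.wrap(function(op, x)
    return op(x)        -- the stack's contents live in the chain's varargs
end)

stack(pushs, 42)        -- queue a push of 42
stack(pushs, 7)         -- 42 lands on the stack; queue a push of 7
print(stack(pops))      -- 7 lands on the stack; pops yields the top: prints 7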
Bonus: this is actually about 15% faster than the ADT-style stack under Lua 5.2. The bummer, though, is that this version is a couple orders of magnitude slower than my original stack under LuaJIT. After some debugging and a conversation with the denizens of the Lua mailing list, I realized what you may have already figured out: LuaJIT’s tracing compiler doesn’t know how to cross coroutine boundaries, so hot paths don’t get JIT’ed or fully optimized.
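The problem is easy to see in miniature. The loop below (nothing to do with :Firth itself) crosses a coroutine boundary on every iteration, so the tracer bails out instead of compiling the hot path the way it would an ordinary call.

-- Minimal illustration, unrelated to :Firth's code: every iteration bounces
-- through coroutine.resume/yield, which the tracer won't compile, so this
-- loop stays in the interpreter.
local co = coroutine.wrap(function(x)
    while true do
        x = coroutine.yield(x + 1)
    end
end)

local total = 0
for i = 1, 1e6 do
    total = total + co(i)   -- coroutine boundary crossed every time through
end
print(total)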
This led me to an in-thread implementation that works on the same underlying principle: exploit the Lua call stack. Once I’d proved to myself that this was a viable option, I took a look at the performance. This new stack implementation is by far the fastest. It’s decidedly faster under Lua 5.2, and where the coroutines were a couple of orders of magnitude slower than my original ADT class under LuaJIT, this version is a couple of orders of magnitude faster.
This necessitates a drastic rework of the entire compilation and execution apparatus, especially code generation and threading. The code :Firth currently generates and executes is all built around the assumption of a global stack object we can push values onto, manipulate, and later pop back off at our leisure. The new stack routines act more like a set of filters on the varargs of the call stack. As an example, where drop used to pop the topmost value from the stack (an array-style Lua table) and zero out the slot, throwing the value away, it now looks more like this:
local function drop(tos, ...)
    return ...
end
We might call it from Lua code as return drop(...), with the first item in ... getting mapped to tos. It then returns everything it was handed, minus the tos element. Simple, clean, and fast as crap. Some of you may have noticed that the example call to drop is a tail call, as well. This just screams under LuaJIT, and it’s quite fast under plain Lua, too.
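The same filter-on-varargs idea extends naturally to the other stack words. The definitions below are illustrative sketches in the new style, not the actual :Firth code:

-- Illustrative sketches in the new style, not the actual :Firth definitions.
local function dup(tos, ...)
    return tos, tos, ...    -- duplicate the top of the stack
end

local function swap(a, b, ...)
    return b, a, ...        -- exchange the top two items
end

-- Generated code can then chain them in tail position, e.g.
-- return dup(swap(...))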
Given all this, the next version of :Firth is shaping up to be a nearly-complete rewrite. All these changes in the very core of the language implementation mean just about everything has to be refactored. Additionally, I have some interesting ideas for the compilation process, and some more functional-style extensions to make it more than “just another Forth”. I’ll post about these in more detail when I’ve hacked some actual code and proved to myself that they’re a good idea.
My original stack implementation is getting a reprieve, although it will be subject to some pretty ruthless refactoring. It’ll probably change from an OO stack object to a table of closures. Forths are generally built around at least three runtime stacks (one for data flow and manipulation; one for handling subroutine calls, control flow, and temporary storage space; and one for managing compilation state). Additionally, something akin to namespaces and locally imported modules can be implemented by storing the dictionary as a stack of vocabularies. Plus, I definitely want stack structure objects I can pass around and/or read from/write to disk. The manipulation of the Lua call stack is sleek and sexy, but ill-suited to handling multiple stacks in an efficient manner.
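For the “table of closures” idea, one possible shape is sketched below. The names and details are hypothetical rather than the final design; the point is that the storage lives in upvalues instead of a self parameter dispatched through metamethods, and each call mints an independent stack, so the data, return, and compile stacks (and a vocabulary stack) can all share one implementation.

-- One possible shape for the closure-based stack; names and details are
-- hypothetical, not the final design. Storage lives in upvalues, so there is
-- no metamethod dispatch, and each call creates an independent stack.
local function newstack()
    local items, height = {}, 0
    return {
        push = function(x)
            height = height + 1
            items[height] = x
        end,
        pop = function()
            local x = items[height]
            items[height] = nil
            height = height - 1
            return x
        end,
        height = function() return height end,
    }
end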
If I had my way, I’d go with my coroutines-and-continuations solution, but the enormous performance hit on LuaJIT (my primary target platform) just isn’t worth it. I’m going to hold onto this code though, in case future versions of the tracer find a suitable way to deal with coroutines, or I find in practice that the speed just doesn’t matter that much.