Last night benabik told me about a problem that, while serious, hardly caused me to raise an eyebrow. Some innocuous-looking code, written while trying to follow along with some ill-written documentation, lead IMCC to enter into an infinite loop. I wasn’t surprised that IMCC contained such a bug, though I was surprised that it was so easy to reproduce and the functionality in question wasn’t really tested at all. The code that caused this bug is this:

.const 'FixedIntegerArray' foo = 'test'

The Sub ‘test’ was an :immediate subroutine which generated a FixedIntegerArray. For people not familiar with IMCC’s workings an :immediate Sub is executed as soon as the packfile is compiled and its entry in the packfile is replaced with the return value of the Sub. So, the Sub doesn’t exist in the packfile, only the thing that the Sub generated. This is the mechanism by which arbitrary PMCs can (in theory) be serialized into the packfile at runtime.

Compare with this syntax:

.const 'Sub' foo = 'test'

In this example, the local variable ‘foo’ will refer to the item in the constants table where the Sub ‘test’ is (or would be, if it weren’t an :immediate). The tag ‘Sub’ there is a bit of a misnomer, since the PMC in that slot might not be a Sub. This syntax is basically a really lousy way of saying “give me the PMC in the constant table in the slot where the given named ‘Sub’ is or would be.” Also, that type information isn’t really used for anything, since PIR is a dynamic language. At first glance you might suspect that the .const directive can take any PMC name as its type, but that is wrong. It only accepts four types: 'Sub', 'Integer', 'String' and 'FixedIntegerArray'. Don’t ask me why these are the only four supported from among all our built-in types, or why we support more than just ‘Sub’ (the other three options seem superfluous to me).

The poorness of this syntax is not really something that I want to write about in this post. We can do it better and in the next incarnation of PIR (if we ever get to building such a beast) will do better. This is just one more “of course it does it that way” kind of moment that you learn to ignore when you’re dealing with the code that is the current incarnation of IMCC.

What I do want to write about instead is the file where the parsing code for this exists: compilers/imcc/pbc.c. That file contains much of the logic for taking IMCC’s internal SymReg representation and turning it into a packfile. It’s a huge mess, and encapsulation is broken here in ways that are as bad or worse than any other single example in the repository. That’s why I’m planning a shotgun cleanup of it soon.

The functions and logical blocks in this file fall into two broad categories: First, there are functions for iterating the AST (struct SymReg and friends) and pulling out relevant values. Second, there are functions for inserting new data into the budding packfile. The first category of functions is generally fine and a necessary part of any assembler, even if some of the code could be cleaned and modernized. It’s that second category that’s of much broader interest. My plan is to take functions and sequences from pbc.c out of IMCC, wrap them up all pretty, and add them to the proper packfile subsystem API.

Of course, I do start to think about how exactly to do that. At runtime the PackFile* structure is basically read-only. Bytecode is read-only and contains fixed integer indices into the constants table which is also not expected to change. If bytecode isn’t changing then annotations and debugging info for that bytecode is probably not changing either. Once a packfile is loaded into the interp, given a PackfileView PMC wrapper, and made executable it really shouldn’t be modified any more.

However when we’re talking about a compiler or other code generating system, we want the ability to write and modify packfiles. When we’re done modifying we might want to stamp them with a flag to say that they are read-only and suitable for executation.

So I’m thinking we want two APIs. The first uses a bit flag on the packfile to determine if it’s editable, and can edit it if that flag is set. The second is the normal read-only accessor API which can generally ignore that flag except for the routines that load a packfile in to be executed by the interpreter. For those handful of routines that do actual loading and verification we can throw an exception or something.

My general plan, and I want to get a lot of feedback on this before I touch anything, is to make a second API for packfile editing routines. I’ll prefix those functions with something like Parrot_pfw_ (instead of Parrot_pf_) to set them aside. I’ll then start moving packfile building logic from IMCC into this new API. Instantly we get much improved encapsulation, clear separation of concerns between packfile writing and executing, and a more robust interface for compiler writers to use in the future. It will also be a nice development in terms of security, where we can limit certain packfiles to be executable. I think it’s a pretty good idea and I don’t think it should take too much effort to accomplish. At least, not in a first draft.

I’m not starting any new projects until I get my remove_sub_flags branch much further along. I think it’s a good idea to follow up a tough project with one which is more straight-forward and offers clear rewards. Speaking of that branch, parrot is building fine and I’m slowly working my way through the list of test failures. When I get the majority of those sorted out I’m going to start working on patches for Winxed, NQP-rx, NQP and Rakudo. We’re a long way off but the progress is extremely rewarding.