I’ve been on a bit of a roll in terms of cleaning up IMCC lately. I’ve been
wanting to clean up the interfaces for IMCC so that we can call into it in a
consistent manner. Part of this is motivated by other projects and project
ideas, such as my desire to create a new PIRCompiler PMC type
for use with the compreg opcode. It’s also motivated by my desire to see
packfiles be cleaned and for PackFile PMCs to come into ubiquitous usage. That
isn’t going to happen unless we modify IMCC to return a PackFile PMC properly,
and that isn’t going to happen if we don’t understand, cleanup, and refactor
the call-in interfaces to IMCC.
In the imcc_cleanups branch I’ve been working on, I moved a number of
functions around, I’ve deleted some dead code, and I’ve started refactoring
one of the several call-in interfaces. [Parrot 3.0][parrot30] was released
this morning followed shortly thereafter by the merge of the imcc_cleanups
branch. This is the first part of a multi-part cleanup and refactor project
of IMCC, and I am happy to see it merged into master.
I decided, to make the rest of the work easier, to diagram the call-chain and the responsibilities chain for the various interfaces. This gives me an at-a-glance view of what is happening and where. With this little diagram I have a much more clear understanding of what various interfaces do and what expectations are made about each.
Lines in parenthesis are basic descriptions of processes, the details of which
are not necessarily important to understand or don’t play much of a role in
the upcoming refactors. Things not in parenthesis are the actual function
names which are called. Sometimes I’ve included little comments as well.
First, here’s a function which I’ve been working on diligently in the past
few days in the imcc_cleanups branch:
imcc_run_api # called only from frontend/parrot/main.c
    imcc_parseflags
    imcc_run (returns PackFile* As UnManagedStruct PMC)
        (Setup yyscanner)
        determine_input_file_type
        imc_yyin_set
        do_pre_process (to be moved out to it's own API)
        determine_output_file_type
        compile_to_bytecode (Returns PackFile *)
            (Create new PackFile*)
            emit_open(output_file)
            imcc_run_compilation
                (Setup IMCC_TRY, IMCC_CATCH groups)
                yyparse
                imc_compile_all_units
            imc_cleanup
            (PackFile_fixup_subs PBC_POSTCOMP, if necessary)
        imcc_write_pbc
        (teardown yyscanner)
        (setup UnManagedStruct PMC to return)
Here, the function imcc_run_api is a relatively new function which I added
as part of the embed_api branch. It serves basically as a thin wrapper
around the imcc_run function. The other function, imcc_parseflags is where
IMCC parses it’s own commandline arguments. As part of the imcc_cleanups
branch work, I’ve been trying to move commandline processing out of IMCC, and
move it into the Parrot executable frontend where I think it belongs.
Unfortunately, there are far too many options right now to move all of them
out. imcc_parseflags is much smaller now than it once was, but it hasn’t
disappeared entirely.
We can see from this call graph a few different things. First, we determine
the type of the input file (PIR or PASM), set up the bison scanner object,
and set it’s input file handle (imc_yyin_set). If we’ve invoked Parrot with
the -E command-line argument, it will preprocess the input file only,
dumping the poorly-formatted, and not syntactically correct preprocessed
source to stdout, then exit.
On a side note, I would really like to have a Parrot equivalent of ccache
to help speed up lengthy builds. We’re never going to have something like this
unless we get a good preprocessor routine working. do_pre_process hardly
qualifies as “good enough” for this purpose, so improvements are necessary
there.
If we are not preprocessing, which is just about always since the preprocessor
is dead-broke, we move into the function compile_to_bytecode. This is where
things get a little bit interesting.
compile_to_bytecode creates a new PackFile* structure, setting it as
Parrot’s current packfile (in interp->initial_pf and interp->code). It
calls ‘emit_open’ to setup the output vector for the generated bytecode, then
moves to imcc_run_compilation. imcc_run_compilation is basically the heart
of IMCC. It sets up a home-brew, basic kind of exception handling mechanism
with some macros and a call to setjmp, then calls yyparse and
imc_compile_all_units. I didn’t trace any further down the call chain, I
know that imc_compile_all_units does all the actual compilation work of
turning the parse tree into a bytecode file, but understanding that mechanism
is a different project for a different day.
Notice that these functions don’t do any monkeying with Parrot contexts,
Parrot namespaces, or anything else like that. The imcc_run function assumes
that it is being run first in Parrot, that there is no existing PBC file
loaded, and that no compilations have occured previously during this process.
In the master branch this function had also blocked Parrot’s GC from running
during the compilation, but with the imcc_cleanups branch merge that logic
has been moved out into the Parrot executable front-end code and the new
embedding API.
What I most notice about these functions here is that the logic seems to be
split into routines not because it makes good logical sense to break things
up in a certain way, but simply because larger functions could be broken up
into smaller ones. Of the functions I list above, hardly any of them are used
throughout the source more than once. These blocks of code were not designed
to be reusable, and were certainly not named to be more readable or
understandable. It’s not until we get down to imcc_run_compilation that
code starts to be reused between frontends. Everything else, no matter how
common or generic, is almost always duplicated.
imcc_run_api is called from frontend/parrot/main.c, where it acts as the
entry point for IMCC from the command-line. It is one of six total
interfaces to IMCC. Actually in master there were eight of them, but I
discovered two were dead code and removed them in my branch.
These next two functions may not look familiar, but they are. These are the functions called by the current PIR compreg.
$P0 = compreg "PIR"
When you call that snippet above, $P0 is a thin NCI wrapper around the
function imcc_compile_pir_ex. The PASM compreg, likewise, is a very thin
wrapper around imcc_compile_pasm_ex. These two functions, in turn, are very
thin wrappers around the function imcc_compile, and they throw Parrot
exceptions if imcc_compile returns any errors.
imcc_compile_pasm_ex
imcc_compile_pir_ex
    imcc_compile (Returns Eval PMC*)
        (setup yyscanner)
        (block GC)
        (handle reentrant compiles)
        (Create a new Parrot context)
        (lock and mutex nonsense)
        (create new default segs)
        (switch to new CS, saving old CS)
        compile_string
            (save current yy buffer)
            yy_scan_string
            emit_open(NULL)
            imcc_run_compilation
            (maybe switch back to previous yy buffer)
        (pop Parrot context)
        (create Eval PMC from result)
        (undo reentrant compile bookkeeping)
        imc_cleanup
        (unblock GC)
        (fixup subs, PBC_MAIN, on Eval PMC)
        (switch back to saved old CS)
    (if we have an error message, throw it as an exception)
Unlike the imcc_run_api function call chain I showed above, the
imcc_compile function actually is reused among these two functions. It is
extremely ugly and duplicates more than a little functionality from
imcc_run, but it is itself reused more than once.
imcc_compile sets up the Bison scanner object and blocks Parrot’s GC,
essentially just like the imcc_run function does (or, it did in master).
Here is where things get a little different: imcc_compile does not assume
that it is executing prior to Parrot execution. In fact, imcc_compile
makes the alternate assumption: It is probably going to be called from an
executing PIR program, and could potentially even be called during another
compilation. Here’s a short PIR snippet that explains how:
.sub _test :immediate
    $P0 = compreg 'PIR'
    $S0 = <<'PIRCODE'
        .sub _test2 :immediate
            say "In an inner compilation!"
        .end
    $P0($S0)
.end
Fun. imcc_compile seems to either handle this correctly or at least attempts
to handle it correctly. I would be surprised if any tests for this behavior
existed in our test suite. I’m not going to test it; I sincerely do not want
to know the results.
imcc_compile detects a reentrant compile and does some bookkeeping for it
if necessary. Then, it pushes a new Parrot CallContext so that the compilation
has a parent context to do it’s compilations and load/init/immediate flag
handling inside of. It creates a new PackFile_ByteCode segment and loads it
into Parrot, saving the old one. From here we descend into compile_string,
Which saves the current bison buffer (if any), sets up the output vector,
and then calls into imcc_run_compilation which we’ve seen before.
Some obvious things jump out at me. First, all interfaces should detect and handle reentrant compiles. There’s no reason not to, even in cases like the Parrot frontend where we know IMCC is being called before anything else. It doesn’t make any sense for us to have two call-in interface functions which do essentially the same thing but one of them handles a reentrant situation and one doesn’t. These can and should be unified.
Second, things like blocking Parrot’s GC should not be done here, but instead
should be done in Parrot if we know we are calling into a place that can’t
handle it. If we have a PIRCompiler PMC type, the .'compile'() method of
it can handle the GC tricks. I did this already in imcc_run, and will do it
again here.
Third, likewise we shouldn’t be monkeying with Parrot contexts here. That stuff should be taken care of before we ever get into IMCC. If Parrot wants to call into an unsafe situation, it should do the necessary prep work to ensure it’s call graph is prepared for it.
Fourth, we shouldn’t be updating Parrot’s interp->initial_pf or
interp->code fields. I’m going to keep repeating this phrase like…well,
a parrot. These functions should either take a fresh packfile to
use or it should create a fresh one, populate it, and return it without ever
storing temporary values in the interpreter or storing global values, or
whatever. This is a subject for another post, but it’s probably the most
important thing I say today.
The next two functions on our grand tour, IMCC_compile_pasm_s and
IMCC_compile_pir_s are used only from the function
src/embed.c:Parrot_compile_string. That function isn’t used anywhere in the
Parrot repo besides in some examples and tests for the old embedding API. I
don’t know if any of our embedders or extenders are using
Parrot_compile_string, but I can guarantee that it is going to be
[deleted][embed_deprecation] eventually (Don’t worry, it will be replaced with
something nice before I try to rip it out). Besides the names and the fact
that these do not perform any error handling, these two functions are nearly
identical to imcc_compile_pasm_ex and imcc_compile_pir_ex above, so I
won’t go into any more details about them. Expect these functions to disappear
as soon as Parrot_compile_string does.
IMCC_compile_pasm_s
IMCC_compile_pir_s
    imcc_compile (seen above)
        compile_string (seen above)
    (return error message to caller instead of throwing exception)
Finally we have imcc_compile_file. The difference between this function and
imcc_compile_pir_ex and imcc_compile_pasm_ex from above is that this
function compiles a file from a given file name instead of compiling a string
directly. That’s the only difference and as we will see almost all of the
code here is faithfully duplicated from imcc_compile. Almost all of the
functionality is duplicated from imcc_run.
imcc_compile_file
    (setup reentrant compile)
    (open input file)
    (block GC)
    (push a new Parrot context)
    (check input file type from file extension)
    compile_file
        (save current yy buffer)
        emit_open(NULL)
        imcc_run_compilation
        (maybe move back to saved buffer)
    (unblock GC)
    (pop context)
    imc_cleanup
    (close input file)
    (undo reentrant compile bookkeeping)
We go through the same stuff as above, we detect and handle a reentrant
situation, block the GC, and set up a new Parrot context. One slight
difference is that we detect the type of the input file (which is what
determine_input_file_type does, all the way up in imcc_run), set a few
flags based on that information, and call imc_yyin_set to set the input
file handle instead of inserting the raw string into the bison buffer. From
here we call into compile_file which is almost identical to compile_string
except we read from the file instead of a string, and then call into
imcc_run_compilation.
So what are the action items to take away from all this? Let me enumerate:
- We should move logic concerning the Parrot GC and Parrot contexts out of IMCC. We should always assume we are calling into IMCC from a running PBC program. Parrot should do the work of keeping IMCC safe, and IMCC should not assume that it is built into libparrot (because eventually it won’t be).
- We should set up a simple function, macro, or new API interface to automatically handle reentrant compilations. We should just always provide that feature, because we never know what kind of crafty stuff our embedders are going to do with Parrot.
- Setting up the bison input source, be it from a string or a file, is very easy to do. It would be pretty easy to set up a source-agnostic interface for specifying an input.
- Below the call-in functions, we shouldn’t need to know whether we are compiling from a string or from a file. Bison’s input buffer should already be set up. All we need to do is run the compilation, then output directly to a new PackFile. Besides a handful of thin interface functions we should have a single primary code path that turns code into PBC.
In reality, I think this process can become very simple, and we can excise a hell of a lot of code from IMCC in the process. Combine that with a few basic refactors, an improved interface, a new PIRCompiler compreg PMC, and the ability to return a PackFile PMC without monkeying with the interpreter’s internal state, and suddenly IMCC doesn’t seem so bad.
In fact, with all this cleanup work done, IMCC might nearly become pleasant. It will almost be a shame to have to wrap it up and kick it out of the Parrot repository at that point.
Almost.
