Programming, Software and Code

L1: The possibilities



I know a little bit about microprocessor design (it's what I focused on in grad school), and the whole idea of L1 reminds me very much of microcode. Microcode, in this context, is a very small set of instructions, very close to the underlying hardware and blindingly fast. A processor like an Intel Pentium 4 doesn't execute the vast, expansive x86 instruction set directly. In most cases, it has a translation module that converts x86 opcodes into a stream of RISC-like micro-operations, which are then executed by a small, super-fast microcore. In some cases the microcore that executes them even runs at a faster clock speed than the rest of the processor does, since its operations are so simple and atomic to execute.

We talked about this issue a lot yesterday, and I foresee that we will do a lot more talking about it all in the future too.

chromatic envisions L1 as being a subset of PIR: L1 opcodes are just regular PASM opcodes, but the subset of fundamental operations in terms of which all other opcodes are defined. Only the L1 opcodes are written in C; the rest, potentially, are written in terms of L1. I envision L1 as a completely separate layer, not directly usable by the programmer. Specifically, I don't like mixing opcodes written in C with opcodes written in L1 and expecting Parrot to figure out the difference at runtime. I would like to see L1 written strictly in C, and PASM ops written strictly in L1. L1 is not only a small set of atomic operations; the operations themselves are very simple, easy to parse and decode, and extremely fast. I'm thinking of them as three-argument operations, which opens all sorts of doors for optimization.
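To give a flavor of what I mean, here is a minimal sketch of a fixed-width, three-argument L1 instruction and the dispatch loop that would run it. Everything here (the names, the opcode set, the register-file layout) is hypothetical, invented purely for illustration:

    /* Hypothetical fixed-width, three-argument L1 instruction. */
    typedef struct {
        unsigned char op;    /* L1 opcode number     */
        unsigned char dst;   /* destination register */
        unsigned char src1;  /* first source         */
        unsigned char src2;  /* second source        */
    } L1_Insn;

    enum { L1_ADD, L1_SUB, L1_MOV, L1_HALT };

    /* Every op is one C statement; that simplicity is what makes
     * the ops easy to decode now and easy to JIT later. */
    void l1_run(const L1_Insn *pc, long *regs) {
        for (;; pc++) {
            switch (pc->op) {
            case L1_ADD:  regs[pc->dst] = regs[pc->src1] + regs[pc->src2]; break;
            case L1_SUB:  regs[pc->dst] = regs[pc->src1] - regs[pc->src2]; break;
            case L1_MOV:  regs[pc->dst] = regs[pc->src1];                  break;
            case L1_HALT: return;
            }
        }
    }

A fixed four-byte shape like this is part of what makes the "table-lookup assembler" idea below plausible: nothing needs to be parsed, only indexed.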

Here's how I think execution would go using this kind of scheme:
  1. L1 ops are written in C and are all trivially simple: Individual operations or individual calls to internal functions, or something simple like that. They need to be trivially simple to write and trivially simple to JIT.
  2. PASM opcodes are written in terms of L1 instructions.
  3. At build time, PASM opcodes are "compiled" into a sequence of L1 byte code (L1BC) values in an array, instead of being "compiled" into standards-compliant C code and then compiled again into machine code.
  4. At execution time, user-defined L1 overlays are added so that existing ops and PMCs can be redefined or extended to add specific information if needed.
  5. At execution time, executing PIR code is converted into PBC, which in turn is used to index the sequences of L1 code that represent the ops. These sequences are strung together to form an execution stream of L1BC which is executed by the core (or JITted, or whatever).
If the mapping from PIR to L1 is simple and direct enough, the conversion process will be very speedy. pmichaud and chromatic suggested that we would have a full-fledged compiler for it, but I think something closer to a table-lookup assembler would be just fine, especially if we were using sequences of three-argument ops. Of course, L1 code would only be human-readable at build time; once we compile it into L1BC it is essentially frozen that way forever and never needs to be human-readable (and therefore never compiled) again. So it doesn't really matter how complicated the compilation process is, although I always tend towards simplicity.
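To illustrate the table-lookup idea, here is a minimal sketch. All the names and structures are invented, and real templates would also need their operands patched in, which this skips:

    #include <stddef.h>

    /* Hypothetical: each PASM op has a frozen L1BC template,
     * generated once at Parrot build time. */
    typedef struct {
        const unsigned *l1bc;  /* frozen L1 bytecode words */
        size_t          len;
    } L1_Template;

    /* Toy table: PASM op 0 expands to one L1 word, op 1 to three. */
    static const unsigned tmpl_noop[] = { 0u };
    static const unsigned tmpl_add3[] = { 10u, 11u, 12u };
    static const L1_Template l1_table[] = {
        { tmpl_noop, 1 },
        { tmpl_add3, 3 },
    };

    /* Expand a PBC opcode stream into one flat L1BC stream:
     * an "assembler" that is nothing but table lookups. */
    size_t expand_to_l1bc(const unsigned *pbc, size_t n_ops, unsigned *out) {
        size_t total = 0;
        for (size_t i = 0; i < n_ops; i++) {
            const L1_Template *t = &l1_table[pbc[i]];
            for (size_t j = 0; j < t->len; j++)
                out[total++] = t->l1bc[j];
        }
        return total;
    }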

Having all ops represented as frozen sequences of L1BC opens up a few options for us, some of which are very enticing indeed:
  1. A pseudo-JIT, where we convert PBC on the fly to a straight L1BC stream. This saves us from having to look up L1BC segments for each op repeatedly.
  2. We still store packfiles using PBC, because it's a more compact format (although we might consider a "super fast" frozen L1BC stream format too)
  3. Overridable and dynamically extensible ops. If you use a custom PASM op in your code (written in L1, of course), you can save that custom definition in your packfiles and then execute your custom ops on any Parrot. Currently, custom ops are written in C and form a dynamically linked library which needs to be shipped with your program in order for it to execute properly. With L1, this mechanism all disappears. You could also override existing op definitions locally, by supplying a custom L1BC stream to use instead.
  4. Overridable and dynamically extensible PMCs. Same as above, but with PMC types instead of ops. PMC types can be defined at compile time and included in the packfile as a custom L1BC stream. No need to compile dynpmc dynamic libraries and shuffle them around; they can be included in the packfile directly.
  5. Serious optimization potential. If you're willing to spend a few cycles optimizing, a unified L1BC stream makes for quite an attractive optimization target. Right now we can't optimize much because execution is flowing between C and PIR and it's just impossible to make any predictions. As chromatic is quick to point out, we suddenly gain the ability to do escape analysis, code inlining, duplicate and unused code removal, and other optimizations that we can't do now. Plus, having more information about code execution will allow us to make a smarter garbage collector, register allocator, etc.
  6. Built-in PMCs and user-defined PMCs will have all their VTABLEs written in L1 instead of in C. The VTABLE structure contains pointers to L1 streams instead of C function pointers (see the sketch after this list). Overriding a vtable method becomes transparent, and there is no functional difference at all between built-in and user-defined PMCs. Also, built-in PMC VTABLEs could potentially be locally overridden and then restored later, or copied to another PMC type, or something else entirely. Lots of potential here.
  7. If PMCs and ops are ultimately compiled into L1, there's no saying that they all have to be hand-written in L1. They could be written in PIR, or NQP, or Perl 6, or Tcl. Imagine using Perl 6 to write the parts of Parrot that run the Perl 6 compiler, which is itself written in Perl 6. Bootstrapping indeed, and only the start of the potential uses!
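To make point 6 concrete, here is a rough structural sketch of the vtable change; the type names are invented, not a proposed API:

    /* Today (simplified): a vtable slot is a C function pointer. */
    typedef struct {
        long (*get_integer)(void *interp, void *pmc);
        /* ... dozens more slots ... */
    } VTABLE_C;

    /* Under the L1 scheme (hypothetical): a slot points at a frozen
     * L1BC stream, run by the same core as everything else. */
    typedef struct {
        const unsigned *get_integer;  /* L1BC stream */
        /* ... */
    } VTABLE_L1;

    /* Overriding, restoring, or copying an entry becomes a plain
     * pointer swap, identical for built-in and user-defined PMCs. */
    void override_get_integer(VTABLE_L1 *vt, const unsigned *user_stream,
                              const unsigned **saved) {
        *saved = vt->get_integer;       /* stash the old stream    */
        vt->get_integer = user_stream;  /* install the user's L1BC */
    }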
So these are some of the enticing possibilities that arise from this method, and I think everybody will agree that some of them are very valuable indeed. As I mentioned before, I do have some differences of opinion from chromatic, and I do get the feeling that he's given deeper consideration to some points than I have, but that doesn't stop me from talking about it here. What are blogs for, if not the easy dissemination of ill-informed opinions? If chromatic wanted to write a guest post here to present his vision for what L1 is and how it works, I would be happy to have it. Or, if he posted a discussion of it on his own blog, I would definitely link to it. I am definitely a congregant in the church of chromatic.

So, how do we get from here to there? This is the million-dollar question, and without a solid roadmap we may never be able to implement such radical changes to Parrot. On this point, chromatic has already written a partial plan, although he mostly focuses on rewriting PMCs in something other than C. I'll adapt his basic timeline, but add a few of my own specifics (which, as I've mentioned before, may be at odds with what chromatic thinks, and may actually be quite ridiculous):
  1. Create a PCT-based compiler for PMCs and ops that reads the current mishmash of C-like code and emits ordinary C code, which is what our Perl5-based build system does already.
  2. Develop a compiler that converts L1 to equivalent C code (see the sketch after this list).
  3. Replace the current Perl5-based build system with the PCT-based one.
  4. Change the PCT-based compiler to emit L1, and use the L1 to create the C code for the PMCs and ops.
  5. Develop a runcore (nanoparrot) which executes L1 directly, including all the necessary control flow for ops to interact with PMC vtables, etc.
  6. Remove the L1->C step, and have the L1 code generated by the PCT-based OP and PMC compiler(s) be executed by Parrot directly.
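To picture step 2, here is a toy emitter for the hypothetical three-argument encoding sketched earlier. Each L1 instruction becomes a single line of C, which is why the intermediate L1-to-C stage should be cheap to build and easy to discard later:

    #include <stdio.h>

    /* The invented L1 encoding from earlier in the post. */
    typedef struct { unsigned char op, dst, src1, src2; } L1_Insn;
    enum { L1_ADD, L1_SUB, L1_MOV, L1_HALT };

    /* Translate each L1 instruction into one C statement. */
    void emit_c(const L1_Insn *code, size_t n, FILE *out) {
        for (size_t i = 0; i < n; i++) {
            const L1_Insn *c = &code[i];
            switch (c->op) {
            case L1_ADD:
                fprintf(out, "regs[%d] = regs[%d] + regs[%d];\n",
                        c->dst, c->src1, c->src2);
                break;
            case L1_SUB:
                fprintf(out, "regs[%d] = regs[%d] - regs[%d];\n",
                        c->dst, c->src1, c->src2);
                break;
            case L1_MOV:
                fprintf(out, "regs[%d] = regs[%d];\n", c->dst, c->src1);
                break;
            case L1_HALT:
                fprintf(out, "return;\n");
                break;
            }
        }
    }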
What's remarkable about all this is that we add a new layer to Parrot, but the end result should be less complexity and better performance overall. At least, that's the hope. I'm planning at least one more post about L1 to talk about what this beast may look like, although with other things on my plate and the release on Tuesday, I don't know when I will be posting it.

Comments

I have to admit I'm a bit worried about having yet another intermediate layer. That may fade as L1 evolves, though, because right now this is barely even solid enough a concept to offer comments on (but just try and stop me! :-). From the previous posts about the need for L1, it appears that maybe a classful assembly language (PIR) is turning out to be a bit more high-level (with associated efficiency issues) than it was sold as. I think the recent Rakudo MMD-based RPN calculator code shows us that it's not the flow control that makes a language high-level, but the type system. PIR can be very compact by virtue of the dispatch system, but perhaps it took that a bit too far.

In that respect I can see L1 helping.

Whenever a layer is introduced I worry about the hidden costs incurred by the need to work around that layer, which seems perfectly functional until you actually try to optimize some code. A large population of coders never notice this because they do not visualize at the machine-code level -- they just live in a world where computers work several orders of magnitude slower than they actually do.

My new favorite example of inefficiency by subdivision is the libGMP rational class, which defines a set of operations on rationals that expect canonicalized operands and canonicalize their results. For some workloads with rationals, this is going to be a performance killer: it canonicalizes at the wrong times.
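A quick illustration of what I mean (this uses the real GMP C-level mpq API; compile with -lgmp):

    #include <gmp.h>
    #include <stdio.h>

    /* Sum 1/1 + 1/2 + ... + 1/100. Every mpq_add keeps its result
     * canonical (gcd-reduced), so the reduction cost is paid on
     * every single step, whether or not we care about the
     * intermediate results being in lowest terms. */
    int main(void) {
        mpq_t sum, term;
        mpq_init(sum);               /* starts at 0/1 */
        mpq_init(term);
        for (unsigned long i = 1; i <= 100; i++) {
            mpq_set_ui(term, 1, i);  /* 1/i, already canonical */
            mpq_add(sum, sum, term);
        }
        gmp_printf("H(100) = %Qd\n", sum);
        mpq_clear(sum);
        mpq_clear(term);
        return 0;
    }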

My least favorite example is the katakana of mallocs and reallocs in Parrot's string ops, which manage, with ease, the seemingly impossible feat of making a simple task rather complex by dividing it into smaller parts.

Now C has its drawbacks, but it also has the most effort already poured into its optimization, and you can use it just for the portability. But consider what it would take to compile L1 well: someone with an encyclopedic knowledge of the huge array of computer architectures, and of gcc's behavior on each of them, would have to show up and write something that magically takes this string of beads of compiled C codelets which the L1 ops represent, and ensures that gcc emits well-grafted assembly despite the artificial register-shuffling barriers bisecting the algorithms. Unless you feel pretty confident that will happen, you are going to lose out on efficiency, even if you benefit from keeping the icache mixture lean.

Not that gcc is the be-all and end-all of optimization -- even without SIMD hardware, it's been shown that a good hand-coding session can often beat compilers where it matters. But unless we want to start working with register files, processor profiles, and strength tables for each and every chipset, we are stuck with it (and with the heinous ugliness it causes by not allowing various operations on pointers, not making endianness a low-level construct, etc., etc., etc.).

Even with C, there are parts of Parrot that are eventually going to have to use architecture features if Parrot wants to be as fast as it can be. This means knowing what architecture you are on -- down to some fine details -- knowing how to test at runtime for what special hardware is at your disposal (you would not want to compile a version for each specific type of vector unit available for a given architecture), and selecting or JITting your code accordingly based on that information.

C is very weak in this area. All you get is inline assembly and a bunch of clunky autoconf macros. Once you pull those all together into something resembling a coherent whole (as I recall from my libgg work years ago) and have a couple examples working for popular architectures -- and maybe an unpopular one to show off with -- you have to hope someone else will feel like fleshing out the meat of that code for the rest of them, because by then you're totally exhausted.

And those are the real problems I see facing Parrot going forward. So my question is, how does L1 help surmount them, rather than add to the indirection?

Great questions, all! I just hope I can address them appropriately. L1 opcodes themselves are going to be implemented as C functions or something similar, and we are going to rely on the compiler to optimize the implementations of individual opcodes using machine-specific tricks.

At the moment, Parrot has the unenviable problem that primary control flow passes between modules written in two very different languages: PIR and C. PIR opcodes execute C functions, which can call PIR subroutines, which can execute C functions... Each time we cross that boundary, we have to shuffle data from PIR registers to the C system stack and back again. This is a HUGE time killer and one of the biggest drains on our performance. As an example, removing calls to PMC methods from the IO system last month more than quadrupled performance in some benchmarks.
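To caricature the round trip (all names here are invented; this is not the real Parrot internals):

    /* Invented types, purely for illustration. */
    typedef struct { long int_regs[32]; } Interp;

    static long c_level_helper(long a, long b) { return a + b; }

    /* A PIR-visible op implemented in C: every boundary crossing
     * copies operands out of the interpreter's register file onto
     * the C stack and copies results back. Nest this a few levels
     * deep (C calling PIR calling C ...) and the shuffling starts
     * to dominate the real work. */
    void op_do_something(Interp *interp) {
        long a = interp->int_regs[0];   /* PIR -> C marshalling */
        long b = interp->int_regs[1];
        long r = c_level_helper(a, b);  /* the actual work      */
        interp->int_regs[2] = r;        /* C -> PIR marshalling */
    }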

Combined with this performance issue, we have a persistent problem where recursive calls into PIR create multiple runloops, which cannot easily be reconciled because they are separated by layers of C and PIR. This problem is difficult to explain here, but I'll be happy to write more about it if you want.

Because we are jumping back and forth between C and PIR, we lose the ability to do all sorts of high-level optimizations ourselves. GCC can't even optimize across compilation units, and at the moment we can't either. We should be able to do flow analysis, code inlining, etc. Plus, we could do escape analysis to help produce a speedier and more accurate GC.

So, to answer your question: right now we have PIR and C, with different components written in each, and that brings overhead because we're constantly translating data from one form into the other. By using L1 for everything, we gain a simplicity that is not possible in our current setup: no switching between languages, no marshalling of data in our hot paths, and no maintaining of separate control-flow constructs (subs, exceptions, etc.) in both C and PIR. Everything will be in L1, and everything will get simpler and (hopefully) faster.

I hope that answers your questions! Thanks!

Perhaps it would help build a better picture of L1 if some light were shed on the borders of it: what's appropriate content for an L1 op, and what's not appropriate? Some examples of each could be really useful in understanding the concept.

For example, for string ops, where does L1 end -- bytes? comparisons? hashval? copy? concat? substr? splice? capitalize?

Does an L1 op do any level of MMD, or fake MMD by switching on its argument types? What kinds of types do you feed an L1 op as parameters, anyway?

Can one reliably leave things in a math unit during one L1 op and expect them to still be there during the next?

Finally what's the advantage of L1 over just demanding any C routines be wrapped by PIR and never directly call PIR? (Except for embeds in a C application, that is.)

I've been working on a follow-up blog post that will talk about what the possible form of L1 could be, in an effort to shed some more light on all of this.

L1 should have enough capability to do everything we need it to do. I know that's something of an evasion of the question, but there really aren't many firm answers I can give yet. chromatic has a different conception of this than I have, one closer to an LLVM bytecode, so that we could interface with C-level functions directly from L1. I personally don't want to be playing with platform-specific calling conventions in the L1 dispatcher, but it's certainly possible to do.

I don't know if L1 will be able to do MMD, although I can say that whatever it does will not have a lot of associated magic. I'm sure that any MMD will involve manually constructing a signature, passing it to a search routine, and getting back any suitable candidates. I'm sure it will have some capability to interface with our MMD API; I just don't know what that interface may look like quite yet.
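To give a flavor of what "no associated magic" might mean, here is a sketch with entirely invented names; none of this is the real Parrot MMD API:

    #include <stddef.h>
    #include <string.h>

    typedef void (*MMD_Func)(void **args, int n);

    typedef struct {
        const char *op;   /* e.g. "add"           */
        const char *sig;  /* e.g. "PP" (two PMCs) */
        MMD_Func    func;
    } MMD_Candidate;

    static void add_pmc_pmc(void **args, int n) { (void)args; (void)n; }

    static MMD_Candidate mmd_table[] = {
        { "add", "PP", add_pmc_pmc },
    };

    static MMD_Candidate *mmd_search(const char *op, const char *sig) {
        for (size_t i = 0; i < sizeof mmd_table / sizeof mmd_table[0]; i++)
            if (!strcmp(mmd_table[i].op, op) && !strcmp(mmd_table[i].sig, sig))
                return &mmd_table[i];
        return NULL;  /* no suitable candidate */
    }

    /* The L1 code would do all three steps explicitly:
     * build the signature, search, invoke. */
    void example_dispatch(void **args, int n) {
        const char *sig = "PP";  /* constructed manually */
        MMD_Candidate *c = mmd_search("add", sig);
        if (c)
            c->func(args, n);
    }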

I don't know what you mean by leaving things in a math unit, could you expand on that a bit more?

A pure-PIR system would be nice, but PIR ops have so far been notoriously difficult to JIT directly: we end up having to maintain separate but equivalent definitions of each op for each JITable platform. L1 ops, which will be hard-limited to a certain level of complexity, will be easier to JIT by design.
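To show why the complexity cap matters, here is a minimal sketch of JITting one three-argument add on x86-64 Linux. The L1 encoding is the invented one from the post; the instruction bytes are real x86-64, but this is an illustration, not Parrot's JIT:

    #include <string.h>
    #include <sys/mman.h>

    /* JIT "add dst, src1, src2" over a register file whose address
     * arrives in RDI (System V calling convention). Because the op
     * is bounded, it maps onto one fixed byte template. Register
     * numbers must stay below 16 so the 8-bit displacements fit. */
    typedef void (*jit_fn)(long *regs);

    jit_fn jit_add(int dst, int src1, int src2) {
        unsigned char code[] = {
            0x48, 0x8B, 0x47, (unsigned char)(src1 * 8), /* mov rax,[rdi+src1*8] */
            0x48, 0x03, 0x47, (unsigned char)(src2 * 8), /* add rax,[rdi+src2*8] */
            0x48, 0x89, 0x47, (unsigned char)(dst  * 8), /* mov [rdi+dst*8],rax  */
            0xC3                                         /* ret                  */
        };
        void *buf = mmap(NULL, sizeof code,
                         PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return NULL;
        memcpy(buf, code, sizeof code);
        return (jit_fn)buf;
    }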

You actually brought up a question I forgot to ask, about L1 op complexity: will L1 ops have to offer a realtime profile, such that they are limited to a certain number of CPU cycles? That would aid in cooperative multitasking -- the program would be guaranteed to be broken down into small chunks, so that event handlers are guaranteed an opportunity to run between the chunks.

As for the math unit question: it often takes a lot of operations -- and more than a few parameters -- to set up a vector unit like SSE or Altivec. What I'm wondering is whether L1 is going to expect that anything using a vector unit will start and finish with that unit within a single L1 operation, or whether the vector unit will be able to carry its state over from one L1 op to another (and expect that state to be saved and restored if another vector-using task is allowed to interrupt).

I don't foresee L1 ops having a strict realtime profile, no. L1 ops may call arbitrary C functions or VTABLEs, which themselves may be very complicated and not strictly bounded. Currently the Parrot scheduler runs in response to specific PIR ops, and I'm sure the same will hold when we transition to L1. So the scheduler, and hence all event handlers, do run regularly, but without a firm guarantee on timing.

I don't foresee Parrot ever providing explicit access to SSE, MMX, or Altivec operations. It may use these implicitly as an optimization tool, but the user won't be manipulating their own vector registers. Keep in mind that Parrot is designed to be cross-platform and to smooth out the differences between those platforms as much as possible.

Vector operations, if we are going to support them at all, are likely to be defined in terms of a custom PMC instead of L1 ops.

I think the LLVM Intermediate Representation is a candidate to be L1.
In this way, PIR becomes the third macro-assembler built on LLVM:

- 1st : C, a macro-assembler
- 2nd : C++, a macro-assembler with objects (and a complex syntax)
- 3rd : PIR, a macro-assembler with objects, GC, and Unicode support (and a simple syntax)

A great advantage: LLVM comes with a JIT.
An inconvenience: LLVM is written in C++.

Thanks for mentioning that too, fperrad. That's another thing that's been on the table. However, I worry that tying ourselves too strongly to LLVM is a bad thing, because then we lose the ability to use alternate JIT backends like libJIT. We're also limited to building Parrot on platforms where LLVM is ported already.

It's not a terrible idea, but there are some issues with it.

On the vector registers I was not implying user-level access, more using them to implement internals. FWIW.


