To follow up on two of my [previous][pcc_refactors] [posts][vtable_timings], I wanted to do a little benchmarking to see what performance savings we could get (if any) if we pursued my idea of exposing calls directly to the user, instead of doing the heavy-weight processing we currently do in fill_params for every call. For this benchmarking I need to use PASM for some tests, because PIR hides a lot of complexity and forces the use of set_args and get_params (and their analogs for returns) in subroutine calls and definitions. Here is the PASM wrapper function I am using for timings. The code to benchmark, which I will show in shorter snippets in the rest of the post, will be copy+pasted into the “Inner Loop” portion:

.pcc_sub :main main:
    get_global P1, "foo"
    set I1, 1000000
    print "Starting ... "
    time N0
  loop_top:


    # Inner loop
    # Code to time goes here


    dec I1
    if I1, loop_top
    time N1
    sub N2, N1, N0
    print "Total time: "
    print N2
    say "s."
    exit 0

.pcc_sub foo:
    # Empty Sub. Return immediately and do nothing
    returncc

Likewise, here’s the PIR equivalent, to show the differences between the two. I’ve deliberately kept the PIR version almost identical to the PASM version, so it’s easier to compare directly:

.sub main :main
    get_global $P1, "foo"
    $I1 = 1000000
    print "Starting ... "
    $N0 = time
  loop_top:


    # Inner loop.


    dec $I1
    if $I1, loop_top
    $N1 = time
    $N2 = $N1 - $N0
    print "Total time: "
    print $N2
    say "s."
.end

.sub foo
.end

All the changes I talk about making to Parrot in this post are being performed in the whiteknight/pcc_prototyping branch for now. I may include some of these benchmarks there as well, although I haven’t done so yet. If you want to follow along with this work, or you want to get involved with it, you can do it in that branch.

For the first test on this machine, I’m using a direct invoke call for PASM vs a normal sugared call for PIR:

PASM:

invoke P1

PIR:

$P1()

And the timing numbers for just this call:

Starting ... Total time: 1.40410900115967s.
Starting ... Total time: 1.67736411094666s.

I’ve run this a few times, and while there is some variability (I’m running this on a VM right now), the PIR version consistently takes about 120% of the time that the PASM version takes. This is without any args or returns of any kind, so at baseline the PIR call is about 20% slower.
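
As a loose analogy for where that baseline gap comes from (this is entirely my own construction in Python; it has nothing to do with Parrot’s actual internals), the sugared call pays to build and unpack argument containers even when there is nothing to pass:

```python
import timeit

def foo():
    pass  # stand-in for the empty sub

def call_direct():
    # bare invocation, like the PASM invoke
    foo()

def call_via_containers():
    # sugared invocation: build an (empty) argument container, then
    # unpack it on the callee side, even with nothing to pass
    ctx = {"positional": (), "named": {}}
    foo(*ctx["positional"], **ctx["named"])

direct = timeit.timeit(call_direct, number=1_000_000)
sugared = timeit.timeit(call_via_containers, number=1_000_000)
print(f"direct: {direct:.3f}s  sugared: {sugared:.3f}s")
```

The absolute numbers are meaningless; the point is that the container bookkeeping is pure per-call overhead when the callee takes no arguments.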

For the next test, I created a new op, a variant of invokecc which takes two arguments: the Sub PMC to invoke and a CallContext which contains the function arguments. Now let’s compare:

PASM:

new P2, "CallContext"
set P2[0], 1
set P2[1], 2
invokecc P1, P2

PIR:

$P1(1, 2)

Now the timings swing back the other way, with the PASM version being more expensive to use:

Starting ... Total time: 2.33223795890808s.
Starting ... Total time: 2.07209300994873s.

Why is this? A big part of the problem is the use of the new_p_sc opcode, which uses a string literal to look up the type for CallContext. This is far slower than it needs to be. So next I added a new opcode called new_call_context, which does a numeric lookup on the built-in type, the same way PCC builds it internally. Again, the numbers swing back the other way.

PASM:

new_call_context P2
set P2[0], 1
set P2[1], 2
invokecc P1, P2

And the results:

Starting ... Total time: 1.90112495422363s.
Starting ... Total time: 2.06399393081665s.

Not as big a change as last time, but with the correct infrastructure in place, the PASM approach without set_args or get_params still seems to be doing just as well or better. One other important thing to demonstrate is that when we compile these two files down to bytecode, the file sizes are different. The PASM version is 1824 bytes, while the PIR version is 1888 bytes. In the disassembly the two versions have the same number of opcodes (16 total), but the PIR version has two additional PMC constants for signature arrays that the PASM version does not have. When you consider that every single callsite in your program is likely to have an additional signature-array constant PMC, you can imagine how the costs of that mechanism grow for non-trivial programs.
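
The new_p_sc vs new_call_context difference can be sketched with a toy Python analogy (my own construction; type_registry and friends here are invented, not Parrot’s real data structures): resolving a type from a string literal on every allocation pays for a hash lookup that a pre-resolved numeric id avoids entirely.

```python
import timeit

# Hypothetical type registry: new_p_sc must resolve the string
# "CallContext" to a type on every allocation, while
# new_call_context bakes the numeric id into the opcode.
type_registry = {"CallContext": 0, "Sub": 1, "Integer": 2}
type_table = [dict, tuple, int]          # stand-ins for PMC classes

def new_by_name(name):
    # hash the string and look it up on every single call
    return type_table[type_registry[name]]()

CALL_CONTEXT_ID = type_registry["CallContext"]   # resolved once, up front

def new_by_id():
    # direct index into the type table; no string lookup at all
    return type_table[CALL_CONTEXT_ID]()

# Both paths construct the same kind of object.
assert type(new_by_name("CallContext")) is type(new_by_id())

by_name = timeit.timeit(lambda: new_by_name("CallContext"), number=100_000)
by_id = timeit.timeit(new_by_id, number=100_000)
print(f"by name: {by_name:.3f}s  by id: {by_id:.3f}s")
```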

Let’s try named parameters now:

PASM:

new_call_context P2
set P2["Foo"], 1
set P2["Bar"], 2
invokecc P1, P2

PIR:

$P1(1 :named("Foo"), 2 :named("Bar"))

Results:

Starting ... Total time: 2.52663016319275s.
Starting ... Total time: 2.98304104804993s.

The PASM version still wins, although still not by much. Part of the reason is that we aren’t doing any argument unpacking in the callee, which is where fill_params would be called, and where I think the real costs are hidden. I’m going to expand the foo subroutine to read two positional arguments as integers and add them together. To get the raw CallContext in the callee, I need to add a new get_call_context op to read it directly without going through get_params:

PASM:

.pcc_sub foo:
    get_call_context P0
    set I0, P0[0]
    set I1, P0[1]
    add I2, I0, I1
    returncc

PIR:

.sub foo
    .param int a
    .param int b
    $I2 = a + b
.end

Results:

Starting ... Total time: 2.54172682762146s.
Starting ... Total time: 2.62854504585266s.

No huge change here. The PASM version is still faster, still by a slim margin. Let’s repeat this with named arguments instead.

PASM:

.pcc_sub foo:
    get_call_context P0
    set I0, P0["Foo"]
    set I1, P0["Bar"]
    add I2, I0, I1
    returncc

PIR:

.sub foo
    .param int a :named("Foo")
    .param int b :named("Bar")
    $I2 = a + b
.end

Results:

Starting ... Total time: 3.31844401359558s.
Starting ... Total time: 4.35232996940613s.

Ooopsies. Here PIR demonstrates the huge fail of fill_params processing named arguments. PASM wins this round by a convincing margin: when named arguments are involved, it takes almost a whole second longer to run this test with the “old” style of dispatch than with my new version. Again, multiply that across the millions of function calls in a non-trivial program using these features, and you start to see the problem.

Let’s try with optional/positional args:

PASM:

.pcc_sub foo:
    get_call_context P0
    set I0, P0[0]
    set I1, P0[1]
    set I2, P0[2]
    exists I3, P0[2]
    add I2, I0, I1
    returncc

PIR:

.sub foo
    .param int a
    .param int b
    .param int c :optional
    .param int has_c :opt_flag
    $I2 = a + b
.end

Results:

Starting ... Total time: 2.70708703994751s.
Starting ... Total time: 2.70915794372559s.

The two are neck and neck. It seems that the exists opcode is more expensive than whatever check fill_params uses in this case. Now, let’s try making those optional args named:

PASM:

.pcc_sub foo:
    get_call_context P0
    set I0, P0["Foo"]
    set I1, P0["Bar"]
    set I2, P0["Baz"]
    exists I3, P0["Baz"]
    add I2, I0, I1
    returncc

PIR:

.sub foo
    .param int a :named("Foo")
    .param int b :named("Bar")
    .param int c :named("Baz") :optional
    .param int has_c :opt_flag
    $I2 = a + b
.end

Results:

Starting ... Total time: 3.70791792869568s.
Starting ... Total time: 4.56210494041443s.

… And here comes the fail train again. exists makes the PASM version more expensive, but the cost of named-argument processing in fill_params makes the PIR version more expensive still.
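
To make the cost model concrete, here is a toy Python sketch (entirely my own construction; the CallContext and fill_params here are hypothetical stand-ins, not Parrot’s real internals) contrasting direct unpacking of a call context against a generic binder that dispatches on per-parameter flags for every single call:

```python
class CallContext:
    """Toy argument container: positional args keyed by int, named by str."""
    def __init__(self):
        self._pos, self._named = {}, {}
    def _store(self, key):
        return self._pos if isinstance(key, int) else self._named
    def __setitem__(self, key, value):
        self._store(key)[key] = value
    def __getitem__(self, key):
        return self._store(key)[key]
    def __contains__(self, key):
        return key in self._store(key)

def foo_manual(ctx):
    # Direct unpack, like the PASM above: fetch, exists-check, add.
    a, b = ctx["Foo"], ctx["Bar"]
    c = ctx["Baz"] if "Baz" in ctx else 0   # optional named arg
    return a + b

def fill_params(ctx, signature):
    # Generic binder: walks the signature and checks flags for every
    # parameter on every call -- the cost the PIR path pays.
    out, pos = [], 0
    for flags in signature:
        if "named" in flags:
            key = flags["named"]
        else:
            key, pos = pos, pos + 1
        if key in ctx:
            out.append(ctx[key])
        elif flags.get("optional"):
            out.append(None)    # real PIR would also set an opt_flag
        else:
            raise TypeError(f"missing argument {key!r}")
    return out

def foo_generic(ctx):
    a, b, c = fill_params(ctx, [{"named": "Foo"},
                                {"named": "Bar"},
                                {"named": "Baz", "optional": True}])
    return a + b

ctx = CallContext()
ctx["Foo"], ctx["Bar"] = 1, 2       # "Baz" deliberately absent
assert foo_manual(ctx) == foo_generic(ctx) == 3
```

The two callees compute the same thing; the generic one just pays for flag dispatch and signature traversal on every invocation, which is exactly the work a compiler could do once, at compile time, if it controlled the unpacking itself.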

I’m not going to grind this point into the ground any further, and I’m also not going to analyze these numbers too hard. I ran each test a few times to make sure the numbers were looking consistent, but I’m not going to compute all the statistics. At the very least, I can say that my approach of avoiding the set_args and get_params opcodes, and the fill_params function, is no worse and in some cases noticeably better than the “standard” PIR approach. I added three new opcodes, but should be able to delete four others in return, so Parrot is not getting any more complex because of it.

bacek pointed out yesterday that we probably need to keep fill_params around as the mechanism for passing arguments between C and PIR. That’s fine by me, although I feel like we could be passing the CallContext to embedding users as well and unpacking that manually using direct VTABLE calls in the same exact way as I show above. That would enable us to kill fill_params completely, and prevent the need for us to be passing around signature strings at the C level to describe calls. That, unlike most of what I talk about above, would require a deprecation cycle to implement.

Even if the timing numbers were all dead even, this approach still gives more power and flexibility to the users of Parrot and allows Parrot to make fewer decisions by default. Parrot doesn’t need to add a mechanism for processing arguments, because users can do it themselves. Compilers and code generators like PCT and Winxed can make their own decisions about calling conventions and argument processing, and spare Parrot the headache of having to guess what users need (and, in many cases, getting those guesses wrong). It’s worth pointing out that Rakudo already processes its arguments like this, because the Parrot system of fill_params and its semantics are not what Rakudo wants or needs. When our biggest user tells us that our core calling convention and argument-processing mechanisms are not adequate, maybe it’s time we open the gates and let users do what they need to do themselves.