Saturday, June 26, 2010

"Blackhole" interpreter

Hi all,

Here are a few words about the JIT's "great speedup in compiling time" advertized on the PyPy 1.3 release (see the previous blog post). The exact meaning behind these words needs a fair bit of explanation, so here it is in case you are interested.

If you download a version of PyPy 1.3 that includes a JIT compiler, you get an executable that could be qualified as rather fat: it actually contains three interpreters. You have on the one hand the regular Python interpreter. It is here because it's not possible to JIT-compile every single piece of Python code you try to run; only the most executed loops are JIT-compiled. They are JIT-compiled with a tracing interpreter that operates one level down. This is the second interpreter. This tracing step is quite slow, but it's all right because it's only invoked on the most executed loops (on the order of 100 to 1000 times in total in a run of a Python script that takes anyway seconds or minutes to run).

So apart from the JIT compilation itself, we have two worlds in which the execution proceeds: either by regular interpretation, or by the execution of assembler code generated by the JIT compiler. And of course, we need to be able to switch from one world to the other quickly: during regular interpretation we have to detect if we already have generated assembler for this piece of code and if so, jump to it; and during execution of the assembler, when a "guard" fails, i.e. when we meet a path of execution for which we did not produce assembler, then we need to switch back to regular interpretation (or occasionally invoke the JIT compiler again).

Let us consider the cost of switching from one world to another. During regular interpretation, if we detect that we already have assembler corresponding to this Python loop, then we just jump to it instead of interpreting the Python loop. This is fairly cheap, as it involves just one fast extra check per Python loop. The reverse is harder because "guard" failures can occur at any point in time: it is possible that the bit of assembler that we already executed so far corresponds to running the first 4 Python opcodes of the loop and a half. The guard that failed just now is somewhere in the middle of interpreting that opcode -- say, multiplying these two Python objects.

It's almost impossible to just "jump" at the right place in the code of the regular interpreter -- how do you jump inside a regular function compiled in C, itself in a call chain, resuming execution of the function from somewhere in the middle?

So here is the important new bit in PyPy 1.3. Previously, what we would do is invoke the JIT compiler again in order to follow what needs to happen between the guard failure and the real end of the Python opcode. We would then throw away the trace generated, as the only purpose was to finish running the current opcode. We call this "blackhole interpretation". After the end of the Python opcode, we can jump to the regular interpreter easily.

Doing so was straightforward, but slow, in case it needs to be done very often (as in the case in some examples, but not all). In PyPy 1.3, this blackhole interpretation step has been redesigned as a time-critical component, and that's where the third interpreter comes from. It is an interpreter that works like the JIT compiler, but without the overhead of tracing (e.g. it does not need to box all values). It was designed from the ground up for the sole purpose of finishing the execution of the current Python opcode. The bytecode format that it interprets is also new, designed for that purpose, and the JIT compiler itself (the second interpreter) was adapted to it. The old bytecode format in PyPy 1.2 is gone (it was more suited for the JIT compiler, but less for blackhole interpretation).

In summary, it was a lot of changes in the most front-end-ish parts of the JIT compiler, even though it was mostly hidden changes. I hope that this longish blog post helped bring it a bit more to the light :-)

8 comments:

GRon said...

Interesting, is there any documentation for the different bytecode sets you have/had?

I would be especially interested in the differences, and the reasons for those design decisions.

Armin Rigo said...

I fear not. The bytecode set is quite custom, made to represent RPython code, which is at the level (roughly speaking) of Java -- with a few additional instructions to guide the JIT compiler. The latest version uses a register-based machine, which is more convenient than a Java-like stack-based approach starting from the control flow graphs of RPython functions. It has three independent sets of registers: integers, pointers, and floating-point (pointers are different from integers at this level because the GC needs to track them and possibly move them). Register numbers are encoded in one byte, so there is room for 256 registers of each kind, but in practice doing a simple register allocation step on each graph means that no bytecode ends up using more than ~15 registers. A few parts are needed only by the JIT compiler and not by the blackhole interpreter; these are encoded "off-line" to avoid slowing down the blackhole interpreter.

Well, I could talk at length about all the details of the format, but in truth there is nothing very deep there :-) See the comments in http://codespeak.net/svn/pypy/trunk/pypy/jit/codewriter/codewriter.py as well as the tests like test/test_flatten.py and test/test_regalloc.py.

Zeev said...

Does the PyPy JIT replace a running interpreted loop with a compiled one mid-run or only on the next iteration or only the next time this loop starts?

Is there a way to ask the PyPy interpreter to tell me what it jitted as it ran some code?

Or will it be too difficult for me to relate the produced machine code with my python source code (because it's not a straightforward method jit)?

Maciej Fijalkowski said...

Hi Zeev.

Only at the next iteration of the loop. However, you have to have at least ~1000 iterations before it happens.

There is a variety of tools that we use for inspecting generated loops. There is no programmable interface from python yet, but there are some external tools.

Run: PYPYJITLOG=jit-log-opt:log pypy

and you'll get a file log which contains all the loops. There are tools in the source checkout pypy/jit/tool, loopviewer.py, showstats.py and traceviewer.py which can help you viewing those loops. They'll contain debug_merge_points which are with info about python opcodes (including functions and file), but they can span several functions. Have fun :)

If you want more info, drop by on #pypy at irc.freenode.net.

Cheers,
fijal

Luis said...

Is this Blackhole interpreter the Jaegermonkey of pypy?

Armin Rigo said...

Luis: no.

Antonio Cuni said...

@Luis: just to expand a bit Armin's answer :-).

Jaegermonkey is a method-by-method compiler that Tracemonkey uses *before* the tracing compiler enters in action. In pypy, this is equivalent to the normal Python interpreter that profiles your loops to find the hot ones, with the obvious difference that Jaegermonkey is a compiler, while ours is an interpreter.

The blackhole interpreter is something that it's used internally by our tracing jit compiler, and AFAIK it has no equivalent in tracemonkey.

Luis said...

I see. Thanks Armin and Antonio.