Wednesday, February 1, 2012

Almost There - PyPy's ARM Backend

In this post I want to give an update on the status of the ARM backend for PyPy's JIT and describe some of the issues and details of the backend.

Current Status

It has been a more than a year that I have been working on the ARM backend. Now it is in a shape, that we can measure meaningful numbers and also ask for some feedback. Since the last post about the backend we have added support floating point operations as well as for PyPy's framework GC's. Another area of work was to keep up with the constant improvements done in the main development branch, such as out-of-line guards, labels, etc. It has been possible for about a year to cross-translate the PyPy Python interpreter and other interpreters such as Pyrolog, with a JIT, to run benchmarks on ARM. Up until now there remained some hard to track bugs that would cause the interpreter to crash with a segmentation fault in certain cases when running with the JIT on ARM. Lately it was possible to run all benchmarks without problems, but when running the translation toolchain itself it would crash. During the last PyPy sprint in Leysin Armin and I managed to fix several of these hard to track bugs in the ARM backend with the result that, it is now possible to run the PyPy translator on ARM itself (at least unless until it runs out of memory), which is a kind of litmus test for the backend itself and used to crash before. Just to point it out, we are not able to complete a PyPy translation on ARM, because on the hardware we have currently available there is not enough memory. But up to the point we run out of memory the JIT does not hit any issues.

Implementation Details

The hardware requirements to run the JIT on ARM follow those for Ubuntu on ARM which targets ARMv7 with a VFP unit running in little endian mode. The JIT can be translated without floating point support, but there might be a few places that need to be fixed to fully work in this setting. We are targeting the ARM instruction set, because at least at the time we decided to use it seemed to be the best choice in terms of speed while having some size overhead compared to the Thumb2 instruction set. It appears that the Thumb2 instruction set should give comparable speed with better code density but has a few restriction on the number of registers available and the use of conditional execution. Also the implementation is a bit easier using a fixed width instruction set and we can use the full set of registers in the generated code when using the ARM instruction set.

The calling convention on ARM

The calling convention on ARM uses 4 of the general purpose registers to pass arguments to functions, further arguments are passed on the stack. The presence of a floating point unit is not required for ARM cores, for this reason there are different ways of handling floats with relation to the calling convention. There is a so called soft-float calling convention that is independent of the presence of a floating point unit. For this calling convention floating point arguments to functions are stored in the general purpose registers and on the stack. Passing floats around this way works with software and hardware floating point implementations. But in presence of a floating point unit it produces some overhead, because floating point numbers need to be moved from the floating point unit to the core registers to do a call and moved back to the floating point registers by the callee. The alternative calling convention is the so-called hard-float calling convention which requires the presence of a floating point unit but has the advantage of getting rid of the overhead of moving floating point values around when performing a call. Although it would be better in the long term to support the hard-float calling convention, we need to be able to interoperate with external code compiled for the operating system we are running on. For this reason at the moment we only support the soft-float to interoperate with external code. We implemented and tested the backend on a BeagleBoard-xM with a Cortex-A8 processor running Ubuntu 11.04 for ARM.

Translating for ARM

The toolchain used to translate PyPy currently is based on a Scratchbox2. Scratchbox2 is a cross-compiling environment. Development had stopped for a while, but it seems to have revived again. We run a 32-bit Python interpreter on the host system and perform all calls to the compiler using a Scratchbox2 based environment. A description on how to setup the cross translation toolchain can be found here.

Results

The current results on ARM, as shown in the graph below, show that the JIT currently gives a speedup of about 3.5 times compared to CPython on ARM. The benchmarks were run on the before mentioned BeagleBoard-xM with a 1GHz ARM Cortex-A8 processor and 512MB of memory. The operating system on the board is Ubuntu 11.04 for ARM. We measured the PyPy interpreter with the JIT enabled and disabled comparing each to CPython Python 2.7.1+ (r271:86832) for ARM. The graph shows the speedup or slowdown of both PyPy versions for the different benchmarks from our benchmark suite normalized to the runtime of CPython. The data used for the graph can be seen below.

The speedup is less than the speedup of 5.2 times we currently get on x86 on our own benchmark suite (see http://speed.pypy.org for details). There are several possible reasons for this. Comparing the results for the interpreter without the JIT on ARM and x86 suggests that the interpreter generated by PyPy, without the JIT, has a worse performance when compared to CPython that it does on x86. Also it is quite possible that the code we are generating with the JIT is not yet optimal. Also there are some architectural constraints produce some overhead. One of these differences is the handling of constants, most ARM instructions only support 8 bit (that can be shifted) immediate values, larger constants need to be loaded into a register, something that is not necessary on x86.

BenchmarkPyPy JITPyPy no JIT
ai0.4844397800473.72756749625
chaos0.08072916919342.2908692212
crypto_pyaes0.07111148322453.30112318509
django0.09777432455192.56779947601
fannkuch0.2104237356982.49163632938
float0.1542753346752.12053281495
go0.3304830342025.84628320479
html5lib0.6292643898623.60333138526
meteor-contest0.9847474269122.93838610037
nbody_modified0.2369695930821.40027234936
pyflate-fast0.3674471918072.72472422146
raytrace-simple0.02905274614371.97270054339
richards0.0345755735533.29767342015
slowspitfire0.7866425519083.7397367403
spambayes0.6603243794563.29059863111
spectral-norm0.0636107837314.01788986233
spitfire0.436171311652.72050579076
spitfire_cstringio0.2555387021341.7418593111
telco0.1029189304133.86388866047
twisted_iteration0.1227239868054.33632475491
twisted_names2.423677971352.99878698076
twisted_pb1.309918374314.48877805486
twisted_tcp0.9270333540552.8161624665
waf1.020598119321.03793427321


The next steps and call for help

Although there probably still are some remaining issues which have not surfaced yet, the JIT backend for ARM is working. Before we can merge the backend into the main development line there are some things that we would like to do first, in particular it we are looking for a way to run the all PyPy tests to verify that things work on ARM before we can merge. Additionally there are some other longterm ideas. To do this we are looking for people willing to help, either by contributing to implement the open features or that can help us with hardware to test.

The incomplete list of open topics:
  • We are looking for a better way to translate PyPy for ARM, than the one describe above. I am not sure if there currently is hardware with enough memory to directly translate PyPy on an ARM based system, this would require between 1.5 or 2 Gig of memory. A fully QEMU based approach could also work, instead of Scratchbox2 that uses QEMU under the hood.
  • Test the JIT on different hardware.
  • Experiment with the JIT settings to find the optimal thresholds for ARM.
  • Continuous integration: We are looking for a way to run the PyPy test suite to make sure everything works as expected on ARM, here QEMU also might provide an alternative.
  • A long term plan would be to port the backend to ARMv5 ISA and improve the support for systems without a floating point unit. This would require to implement the ISA and create different code paths and improve the instruction selection depending on the target architecture.
  • Review of the generated machine code the JIT generates on ARM to see if the instruction selection makes sense for ARM.
  • Build a version that runs on Android.
  • Improve the tools, i.e. integrate with jitviewer.
So if you are interested or willing to help in any way contact us.

19 comments:

said...
This comment has been removed by a blog administrator.
Michael Hudson-Doyle said...

Awesome news. We might be able to donate some time in the Linaro validation lab to running tests, I'll see what we can do...

Anonymous said...

"Just to point it out, we are not able to complete a PyPy translation on ARM, because on the hardware we have currently available there is not enough memory."

Can't you just add more swap?

said...

You wrote: "The speedup is less than the speedup of 5.2 times you currently get on x86."

The removed comment was meant to point out that the author of the blog post does not (and cannot) know the actual speedups (and slowdowns) people are getting on their machines.

The speedup of 5.2 you mentioned is contradicting my own experience.

I suggest you rewrite your sentence into "The speedup is less than the speedup of 5.2 times we currently get on x86."

Maciej Fijalkowski said...

"The speedup of 5.2 you mentioned is contradicting my own experience."

Did you run the very same benchmark suite or some arbitrary programs? If arbitrary programs then sorry, but it's seriously impossible for us to optimize stuff we don't know about. Please submit bug tracker issues for that.

I agree the speedup should be qualified "on our own benchmark suite", but if you don't contribute benchmarks, you can't complain.

David Schneider said...

@⚛ to avoid misunderstandings I updated the sentence in question to make it clear that I was comparing the performance of the ARM backend running our own benchmark suite to the results of the benchmarks as shown on speed.pypy.org.

Naos said...

Every time I visit comments to blog posts on this page I see some hater or two or even more who, don't know why, have wierd problems without a reason. People chill out, you get this brilliant piece of software and you do not have to pay for it.

said...

@Naos: I do *not* hate PyPy. I like it and want to make it better. To do that, I am using a method different from your method. I would like the PyPy team to figure out how to run my benchmark faster without me disclosing the name of the benchmark. I believe that in the end this method will lead to a couple of universal optimizations in PyPy.

Maciej Fijalkowski said...

@⚛ Contrary to what you might believe, this ends up with you being annoying and nothing else. If you think we don't think all the time about general improvements, you're wrong, but pointless, unscientific complaining is not welcomed here.

Anonymous said...

will it work on the raspberry pi or does it use a different arm architecture?

(i am confused by arms. :) apparently it isn't like in the x86 world where everything is compatible with each other.)

Anonymous said...

@Anonymous

> The hardware requirements to run the JIT on ARM follow those for Ubuntu on ARM which targets ARMv7 with a VFP unit running in little endian mode.

This is higher than Raspberry Pi's ARMv6.

Anonymous said...

A long term plan would be to port the backend to ARMv5 ISA and improve the support for systems without a floating point unit. This would require to implement the ISA and create different code paths and improve the instruction selection depending on the target architecture.

so someday it will support the v6 too? or isn't v5 and v6 compatible either?

David Schneider said...

@Anonymous ARMv6 should be backwards compatible to the ARMv5 ISA. So it should be possible to use the JIT on ARMv6 once it works for ARMv5.

Anonymous said...

Really nice with raspberry Pi being released and ARM proto boards gaining traction. Looking forward to develop w/ this on rPi

d.q. said...

I have an arm system with enough swap, microsd or cifs or both... though a setup like this will probably take until next release to translate pypy, won't it?
Can also do usermode qemu of course.

David Schneider said...

@d.q. It would at least take until the next release, if not several...
A qemu based solution would be interesting. If you are interested I would propose you join #pypy on IRC to discuss

Paul said...

I'm pretty sure the guys over at Boundary Devices are making a SabreLite variant with 2GB ram.

http://boundarydevices.com/imx6-options-single-dual-core-2gb-ddr/

You may need to email them directly to purchase it still, but they're responsive and a pleasure to work with.

Eric van Riet Paap said...

Anyone try to create a qemu based solution? Even if only a build system. I would be interested in having some documentation about how to build pypy for Raspberry Pi. I know it's a non-compatible ARM version but we have to start somewhere. Plus I have a very RPi's laying around to play with...

Eric van Riet Paap said...

Anyone tried to build with a qemu setup? I have several Raspberry Pi's around that I could play with. I know the JIT will not work because of the difference in ARM versions but it's a start.