This page is about pypy-stm, a special in-development version of PyPy which can run multiple independent CPU-hungry threads in parallel in the same process. It side-steps what is known in the Python world as the “global interpreter lock (GIL)” problem.
“STM” stands for Software Transactional Memory, the technique used internally. This page describes pypy-stm from the perspective of a user, describes work in progress, and finally gives references to more implementation details.
This work was done mostly by Remi Meier and Armin Rigo. Thanks to all donors for crowd-funding the work so far! Please have a look at the 2nd call for donation.
pypy-stm is a variant of the regular PyPy interpreter. With the caveats listed below, it should in theory be no more than 25%-50% slower than a regular PyPy, comparing the JIT version in both cases. It is called STM for Software Transactional Memory, the internal technique used (see Reference to implementation details).
What you get in exchange for this slow-down is that pypy-stm runs any multithreaded Python program on multiple CPUs at once. Programs running two threads or more in parallel should ideally run faster than in a regular PyPy, either now or soon as issues are fixed. In one way, that’s all there is to it: this is a GIL-less Python, feel free to download and try it. However, the deeper idea behind the pypy-stm project is to improve what is so far the state-of-the-art for using multiple CPUs, which for cases where separate processes don’t work is done by writing explicitly multi-threaded programs. Instead, pypy-stm is pushing forward an approach to hide the threads, as described below in atomic sections.
pypy-stm requires 64-bit Linux for now.
Development is done in the branch stmgc-c7. If you are only interested in trying it out, you can download an Ubuntu 12.04 binary here (pypy-2.2.x-stm*.tar.bz2; this version is built in release mode, but not stripped of debug symbols). The current version supports four “segments”, which means that it will run up to four threads in parallel; in other words, it internally runs a thread pool of up to 4 threads emulating the normal threads.
To build a version from sources, you first need to compile a custom version of clang; we recommend downloading llvm and clang as described here, but at revision 201645 (use svn co -r 201645 ... for all checkouts). Then apply all the patches in this directory: they are fixes for the very extensive use that pypy-stm makes of a clang-only feature (without them, clang crashes). Then get the branch stmgc-c7 of PyPy and run:
rpython/bin/rpython -Ojit --stm pypy/goal/targetpypystandalone.py
Caveats:
When a non-main thread finishes, you get statistics printed to stderr, looking like this:
thread 0x7f73377fe600:
    outside transaction      42182  0.506 s
    run current              85466  0.000 s
    run committed            34262  3.178 s
    run aborted write write   6982  0.083 s
    run aborted write read     550  0.005 s
    run aborted inevitable     388  0.010 s
    run aborted other            0  0.000 s
    wait free segment            0  0.000 s
    wait write read             78  0.027 s
    wait inevitable            887  0.490 s
    wait other                   0  0.000 s
    bookkeeping              51418  0.606 s
    minor gc                162970  1.135 s
    major gc                     1  0.019 s
    sync pause               59173  1.738 s
    spin loop               129512  0.094 s
The first number is a counter; the second number gives the associated time (the amount of real time that the thread was in this state; the sum of all the times should be equal to the total time between the thread’s start and the thread’s end). The most important points are “run committed”, which gives the amount of useful work, and “outside transaction”, which should give the time spent e.g. in library calls (right now it seems to be a bit larger than that; this remains to be investigated). Everything else is overhead of various forms. (Short-, medium- and long-term future work involves reducing this overhead :-)
These statistics are not printed out for the main thread, for now.
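To illustrate how to read these statistics, here is a small sketch that sums the sample times shown above and computes the fraction of useful work ("run committed"). The numbers are copied verbatim from the example output; the total should approximate the thread's wall-clock lifetime:

```python
# Times (in seconds) copied from the sample statistics above.
stats = {
    "outside transaction":     0.506,
    "run current":             0.000,
    "run committed":           3.178,
    "run aborted write write": 0.083,
    "run aborted write read":  0.005,
    "run aborted inevitable":  0.010,
    "run aborted other":       0.000,
    "wait free segment":       0.000,
    "wait write read":         0.027,
    "wait inevitable":         0.490,
    "wait other":              0.000,
    "bookkeeping":             0.606,
    "minor gc":                1.135,
    "major gc":                0.019,
    "sync pause":              1.738,
    "spin loop":               0.094,
}

# The sum of all states approximates the thread's total lifetime.
total = sum(stats.values())
# "run committed" is the useful work; the rest is overhead or waiting.
useful = stats["run committed"]
print("total: %.3f s, useful work: %.1f%%" % (total, 100 * useful / total))
```

For this sample, only about 40% of the thread's lifetime was spent doing committed work, which matches the text's point that the remaining states are overhead to be reduced.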
While one of the goals of pypy-stm is to give a GIL-free but otherwise unmodified Python, the other goal is to push for a better way to use multithreading. For this, you (as the Python programmer) get an API in the __pypy__.thread submodule:
The obvious usage is to use atomic blocks in the same way as one would use locks: to protect changes to some shared data, you do them in a with atomic block, just like you would otherwise do them in a with mylock block after mylock = thread.allocate_lock(). This allows you not to care about acquiring the correct locks in the correct order; it is equivalent to having only one global lock. This is how transactional memory is generally described: as a way to efficiently execute such atomic blocks, running them in parallel while giving the illusion that they run in some serial order.
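As a concrete illustration of this locking-style usage, here is a minimal sketch. The atomic context manager is taken from __pypy__.thread on pypy-stm; on other interpreters, the sketch falls back to an ordinary lock as a stand-in (an assumption of this example), which gives the same single-global-lock semantics without any parallelism:

```python
import threading

try:
    from __pypy__.thread import atomic  # available on pypy-stm
except ImportError:
    # Fallback stand-in for other interpreters: an ordinary lock
    # emulates the "one global lock" semantics of atomic blocks.
    atomic = threading.RLock()

counter = 0

def increment(n):
    global counter
    for _ in range(n):
        # The whole read-modify-write is one atomic block, as if
        # protected by a single global lock; no per-object locks
        # need to be acquired in any particular order.
        with atomic:
            counter += 1

threads = [threading.Thread(target=increment, args=(10000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: no increments were lost
```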
However, the less obvious intended usage of atomic sections is as a wide-ranging replacement of explicit threads. You can turn a program that is not multi-threaded at all into a program that uses threads internally, together with large atomic sections to keep the behavior unchanged. This capability can be hidden in a library or in the framework you use; the end user’s code does not need to be explicitly aware of using threads. For a simple example of this, see transaction.py in lib_pypy. The idea is that if you have a program where the function f(key, value) runs on every item of some big dictionary, you can replace the loop with:
for key, value in bigdict.items():
    transaction.add(f, key, value)
transaction.run()
This code runs the various calls to f(key, value) using a thread pool, but every single call is done in an atomic section. The end result is that the behavior should be exactly equivalent: you don’t get any extra multithreading issues.
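To make the add()/run() pattern above runnable anywhere, the following sketch substitutes a hypothetical serial stand-in with the same interface when lib_pypy's transaction module is not importable. This preserves the key property the text describes: the observable behavior is the same whether or not the calls actually run in parallel:

```python
try:
    import transaction  # lib_pypy's module, available with PyPy
except ImportError:
    # Hypothetical serial stand-in with the same add()/run()
    # interface; it runs the pending calls one after the other.
    class _SerialTransaction(object):
        def __init__(self):
            self._pending = []
        def add(self, f, *args, **kwds):
            self._pending.append((f, args, kwds))
        def run(self):
            pending, self._pending = self._pending, []
            for f, args, kwds in pending:
                f(*args, **kwds)
    transaction = _SerialTransaction()

results = {}

def f(key, value):
    # Each call would run in its own atomic section on pypy-stm.
    results[key] = value * 2

bigdict = {"a": 1, "b": 2, "c": 3}
for key, value in bigdict.items():
    transaction.add(f, key, value)
transaction.run()
print(results)  # {"a": 2, "b": 4, "c": 6} in some order
```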
This approach hides the notion of threads from the end programmer, including all the hard multithreading-related issues. This is not the first alternative approach to explicit threads; for example, OpenMP is one. However, it is one of the first ones which does not require the code to be organized in a particular fashion. Instead, it works on any Python program that has latent, imperfect parallelism. Ideally, it only requires that the end programmer identifies where this parallelism is likely to be found, and communicates it to the system, using for example the transaction.add() scheme.
Other APIs in pypy-stm:
Note that all of this API is (or will be) implemented in a regular PyPy too: for example, with atomic will simply mean “don’t release the GIL” and getsegmentlimit() will return 1.
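As a sketch of how portable code might feature-detect this API, the fallback below simply returns 1 when getsegmentlimit() is unavailable, mirroring what the text says a regular PyPy would report (the fallback itself is an assumption of this example, not part of the API):

```python
try:
    from __pypy__.thread import getsegmentlimit  # pypy-stm API
except ImportError:
    def getsegmentlimit():
        # Without STM segments, only one thread runs Python
        # code at a time, so report a single "segment".
        return 1

# Size a worker pool to the number of threads that can actually
# run in parallel (4 on current pypy-stm builds, 1 elsewhere).
pool_size = getsegmentlimit()
print(pool_size)
```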
The core of the implementation is in a separate C library called stmgc, in the c7 subdirectory. Please see the README.txt for more information. In particular, the notion of segment is discussed there.
PyPy itself adds on top of it the automatic placement of read and write barriers and of “becomes-inevitable-now” barriers, the logic to start/stop transactions as an RPython transformation and as supporting C code, and the support in the JIT (mostly as a transformation step on the trace and generation of custom assembler in assembler.py).