Thursday, June 11, 2009

The Nasty I/O Problem in Featherweight Threading

Since I devoted so much of my time to devising a featherweight threading system (aka, two-thirds of my master's thesis), it's clearly something I care alot about to say the least. This is, of course, an understatement. The truth is I think it's the only way forward in a concurrent world.

It begins with a criticism I hear all the time: "Functional languages have this implicit parallelism, but no one's ever realized it." The reason, is twofold. One is pure politics which I won't discuss here. The other is that functional languages expose an amazing level of fine-grained parallelism. But what we're given by pthreads and similar platforms isn't fine-grained at all. Just how fine-grained? Every single expression can compute its sub-expressions in parallel. Of course, that doesn't mean we can exploit it.

A fairly basic common-sense argument holds that the number of instructions it takes to launch a thread is rougly the same as the minimum number of instructions that we want that thread to execute before terminating. Of course, we can analyze further, but the basic notion that the cheaper it is to launch threads, the more parallelism we can afford to exploit. To parallelize a language like Haskell or Standard ML, we'd like to ideally be able to spawn threads even for medium-sized functions, consisting of 50 or so real instructions. (Anything smaller will likely end up losing more than it gains due to caching, speculation, branch predictors, and other such effects anyway). And herein lies the problem.

Your standard 1:1 pthreads library needs to allocate a stack, make a system call to allocate a kernel thread, plus its stack, lock/unlock mutexes, and initialize a whole lot of system-specific cruft. The result is a cost on the order of tens of thousands of instructions. It also costs several thousand instructions to switch between threads.

The solutions I proposed weren't necessarily new; the exact way in which I combined them was. First, begin with a copying garbage collector: a staple in most functional languages anyway. Copying collectors have the nice property of better memory locality, as well as extremely cheap allocation (just advancing a pointer). With this, you can replace conventional stacks with gc-allocated frames a la Andrew Appel's CPS/closure conversion, and employing all the usual tricks to cut down allocation (flatten to squash non-recursive call trees, liveness analysis to reuse the memory of object that have died, etc). This gets you both threads and continuations at very little cost. You then employ the reviled M:N threading, only in a more clever way. You can probably tell from the title that I/O is, as yet, unsatisfactorally resolved, so I'll dance around it and address the synchronization problems. Locking mutexs potentially blocks system threads (I call them executors), which for M:N threading is bad. The solution is quite simple, actually: don't fight the battle at all. Build the scheduler from lock-free data structures. Implementing synchronization for the user-level threads without relying on OS synchronization is a bit trickier. My solution was to rely on voluntary context-switching, which is enabled by the presence of the safepoints necessary for garbage-collection. The result is a runtime system whose only use of locks is to implement a barrier to turn the garbage collector on and off.

This is all fine in a perfect world where we just run CPU-bound jobs and go home happy. That being said, I/O mucks it all up.

The problem with I/O is that it is all done using a hopelessly outdated API. The POSIX APIs, were built for a much simpler world: one processor, usually no preemption in the kernel, running on one machine, with no concept of a distributed system. The resulting situation resembles very much the European liberal democracies implemented on top of an old, defunct absolute monarchy, which is kept in place by a quaint sort of nationalism, and the fact that they are too convenient both to the rich and powerful gentry as well as the young sycophants looking to inflate their reputation by paying court to the obselete.

Any I/O call is potentially blocking, even close, for the simple reason that it may be on an NFS mount. Anyone with cursory experience in non-blocking or asynchronous I/O knows that cake is a lie. The only thing we can trust is blocking I/O, and everything potentially blocks. Of course, we have no clean, let alone accurate way of knowing what will... Barring all kinds of platform-specific warlock arts, this means every single I/O call has to take place off of the main executors.

This gives us two options. The first is to start a thread specifically to handle every I/O call. The second, and far more sane, is to give the job to a set of I/O threads. One can imagine some sort of lock-free, load-balancing multi-queue in the vein of Herlihy's old work on counting networks, or perhaps just a randomized bucket approach based on the birthday paradox. Ideally, this should cut down the level of cross-talk between I/O threads to acceptable levels. And hopefully I/O jobs will take long enough that the level of real contention will be minimal.

The question then becomes how many I/O threads to have. There is no clear answer. For executors, this is fairly obvious: the same as the number of CPUs, perhaps plus some constant. With I/O threads, it could literally be anything. On a system with one disk, and all I/O headed there, it's clearly one. Truly non-blocking I/O, of course, would benefit from many.

My answer at the moment is simple: one per executor. There's no particular inspiration or genius going into this, aside from the simple observation that you'll do no worse than you would if you had nonblocking I/O that actually worked, or with a system such as FreeBSD's KSE system. Anything more clever than this gets platform-specific, arcane, and opaque very fast.

Regarding I/O, executors, and garbage collectors, GHC deals with this in an elegant fashion, which I adopted for my collector. One option when allocating is to "pin" the block. This can be done either by a scheme I describe in one of my papers, or simply by farming out the request to a lock-free mark/sweep collector. The result is an immobile block of garbage-collected memory, which can be used directly in I/O. You can pop this into a lock-free I/O queueing mechanism and employ the same wait mechanisms I use to implement user-level mutexes to wait on the job to finish.

One of the best arguments I can come up with for all this is that it maps very cleanly onto bare metal. The executor model, of course, is the way it actually works with multiple CPUs, the synchronization implementation ends up being very lightweight and elegant, and the I/O problems go away (though the queueing becomes far more complex).

Moreover, do we really expect POSIX to stick around forever as-is? FreeBSD has already made one attempt to switch to activations. Mac OS is on a microkernel, and I know of at least two projects to implement Linux on a microkernel. On the other hand, OS research these days has a terrible inertia, coupled with a tendency to get badly sidetracked. Which is, of course, why I got out of it.

No comments:

Post a Comment