Eric's Research Logbook: June 2009

Sunday, June 28, 2009

Complete GC and Scheduler Intrinsics

This is an attempt to add the mailbox data to the GC and transactional intrinsics. I will need to further develop some transactional concepts before the transactional intrinsics are where they need to be.

On the other hand, those are for far-future work, not present-day. The idea is to make sure I don't have to redesign all the instructions at a later date.

This is consciously trying to look more like LLVM instructions...

Types:

gcallocator: A structure used to allocate GC objects. These are good for one use, and the gcalloc intrinsic yields a new allocator. The program is expected to use this, and give the latest version to safepoints.

gcstate: A collection of boolean flags representing various things. Doesn't change between safepoints.

gc, undo, buf: Each of these are a kind of log.

mbox: A thread's mailbox. At present equal to {i32 execid, gcstate state, gcallocator alloc, gc gclog }.

thread: A thread id. This is a differentiated type because it is traced by the garbage collector.

schedstate: Scheduling states. Includes at the least run, suspend, and term.

tstate: A thread's state. At the present, equal to {schedstate state, i32 priority }

Calling Conventions:

thread: A calling convention for function that serve as the starting point of threads. Enters with the architecture's thread pointer pointing to an inital frame, which is initialized to contain the arguments to the function. The object containing this frame will likely also contain the thread structure, though it is not required.

In general, the other calling conventions are geared towards using GC-allocated frames, and flattening them to the fullest extent possible. Functions expect to enter with their stack pointer pointing to a frame of the proper size and type. When leaving, they restore the return address and old frame pointer. I call this the "blind callee convention" since the callee has no idea where it got its frame from.

There are roughly three sub-conventions that arise from this. The first is the most basic: allocate a frame from GC, initialize it, set up the arguments, return address, and return frame, and jump to the function.

In the second, we actually use the frame of the caller. We can do this for calls to non-recursive functions.

In the third, we use a classic stack, which is a big aggregate GC object. We can just use the blind callee convention, or we can have the callee be aware that it is on a stack (and thus, can allocate more frames using the stack). This is useable when the frames themselves don't escape a call (curried functions, inner functions that access variables in their parents and escape the parent, and continuations break this convention).

Instructions:

result = spawn execid execid, gcallocator gcallocator, funptrval(initial args)

Spawn a thread. The starting function funptrval must use the thread calling convention. result is of type {thread, gcallocator }.

result = tstat execid execid, thread threadval

Get the state for threadval. The result is of type tstate.

update execid execid [, schedstate (run|suspend|term)] [, i32 pri ], thread threadval

Update state for threadval. There must be at least one state change given. For threads updating themselves, this does not take effect until the next safepoint.

result = safepoint mbox mbox

Possibly context switch or run garbage collection, acknowledge any changes to scheduling state, generate a new mbox for subsequent calls. The mbox given becomes defunct, and using it after this point causes undefined behavior. A safepoint will spill and restore all registers to the current frame if a garbage collection or a context switch takes place.

result = gcalloc [tx, ] mbox mbox, type [, options ]

Allocate an object with type type, initializing only its header. The type of result is {type*, mbox }. This call may also have the effect of executing a safepoint.

result = watch [r ][w ], type* ptrval ... [, set setval ]

Watch for activity on ptrvals. If no specific type of activity is given, rw is assumed. Append ptrvals to setval, or create a new set containing ptrvals.

touch r [w ], type* ptrval ...
result = touch [r ]w, type* ptrval ... [, (gc / undo) logval ]

Report access to ptrvals, which may be read or write. If write access is reported, then it may also be logged. If the touch is logged, then there is a result, which is a log equal to the type of log argument given. Depending on the exact log type, this instruction must be placed differently relative to the memory operation to which it corresponds. If undo logs are used, the instruction must precede any write to the locations, whereas the instruction can be anywhere with gc logs.

result = read [gc gcstate, ] [tx ,] type* ptrval
result = read [gc gcstate, ] tx, type* ptrval, buf logval

Exactly equivalent to load, except it exists for extra transactional or garbage collection semantics. At least one of gc or tx must be present. The second variant checks a set of buffer logs first. This does not inform watch sets of the activity. If this is a read of gc-traced memory, the gcstate from the last mbox must be given.

write [gc gcstate, ] [tx ], type* ptrval, type val
result = write [gc gcstate, ] tx, type* ptrval, type val [, buf logval ]

Exactly equivalent to store, except for extra transactional or garbage collection semantics. At least one of gc or tx must be present. The second variant buffers the write in a write-buffer log. This does not actually write through to memory. This does not inform watch sets of the activity. If this is a read of gc-traced memory, the gcstate from the last mbox must be given.

result = validate [r ][w ], [set setval ... ]

Validate setvals, or all outstanding watched values for this thread if no sets are given. If no access mode is given, rw is assumed. result is a i1, indicating success or failure.

discard [r ][w ], set setval [, set setval ... ]

Discard setvals from the watched locations for this thread.

commit buf logval [, buf logval ... ]

Perform the actions described in logvals.

rollback undo logval [, undo logval ... ]

Abort all actions described in logvals.

listen set setval [, set setval ... ]

Following the completion of this instruction, the current thread's status will be set to run whenever anyone touches an of the locations contained in setvals.

There is still no conflict management. Also, the transactional instructions completely lack the kind of meta-arguments present in the gc and scheduler intrinsics. The next steps will be to develop conflict management and the necessary metastate. There also need to be semantics and operations for managing logs. Adding these will complete the intrinsic set.

Wednesday, June 24, 2009

Scheduler Intrinsics

This is the basic scheduler intrinsic set I developed as part of my master's thesis (a very, very tiny part, and there was a fully formal description of the execution model to go along with it), modified for the runtime system. The actual runtime system had a mailbox pointer, which gets passed around. This pointer is a required first argument to anyscheduler function

%thr = spawn %mbox, @func(...)

Spawn the call to func as a thread.

%state, %pri = stat %thr, %mbox

Get the state and priority of the given thread

update %thr [, run | suspend | term] [, %pri], %mbox

Update a thread's state and priority. State changes do not take effect until the next safepoint.

%mbox = safepoint, %mbox'

Possibly deschedule or context switch, and acknowledge any changes in state. This effectively destroys %mbox', replacing it with %mbox.

The cooperative nature of this threading works much nicer with a lock-free scheduler, and lock-free systems in general. It is possible to implement asynchronous channel communications quite simply:

send(%achan, %ptr, %mbox):
%box = getelementptr, %achan, 0
%recv = getelementptr, %achan, 1
%thr = load %recv
store %val, %box
update %thr, run, %mbox
ret

recv(%achan, %ptr, %mbox):
%box = getelementptr, %achan, 0
%recv = getelementptr, %achan, 1
update %self, suspend, %mbox
store %self, %recv
safepoint %mbox
%val = load %box
ret %box

Lastly, it works nicely for doing a TM system as well. The next step is to tie all this in with the GC/TM intrinsics, and wire in all the %mboxes into those as well.

Saturday, June 20, 2009

Transactional and GC Intrinsics, 3rd Revision

Attempting to combine the intrinsicts I developed in the latest entry about transactional intrinsics (a RISC approach) with GC intrinsics. (Note: in this text "or" means inclusive, "either ... or" means exclusive. Damn English language and its logical inadequacies).

watch (r | w), %ptr* [, %set]

Watch for read or write events on %ptr, possibly adding it to %set.

touch (tx | gc), (r | w), %ptr* [, %gclog]

Report read or write activity on %ptr to transaction or garbage collection system. If activity is being reported to the gc system, %gclog is required.

discard [r | w,] %set*

Stop watching the specified type of activity on %set.

result = validate [r | w], [%set*]

Yield true if no conflicts of the specified type has taken place on %set. If %set is omitted, check for any conflict.

log (undo/buf), %ptr*, %log

Log an action in either undo or write-buffering style, storing the information to %log.

commit %log*

Perform all actions described in %log if it is a write-buffering log, and finalize the actions in either case, discarding the log.

rollback %log*

Undo all actions in %log if it is an undo log, and abort the actions in either case, discarding the log.

suspend %set*

Suspend the current thread until someone accesses a location in %set.

gc_alloc [tx,] type, options

Allocate garbage-collected memory with type and options. If tx is specified, the allocation will be cached and not repeated if the transaction aborts.

safepoint %gclog

Possibly context switch or do garbage collection. %gclog must contain at least one entry for each location that has been written to since the last safe point.

Problems with this intrinsic set: 1) there is no way to create logs or sets and maintain them, 2) log would seem to be inseparable from a touch w, which would mean they should probably be combined, 3) there is no obvious way to indicate how conflicts are to be resolved (which thread wins, for instance). I also need to add the gc allocator stuff, and scheduler intrinsics.

The next revision will have the scheduler operations and more metastate, perhaps conflict stuff.

Friday, June 19, 2009

Languages and Derivatives of Evaluation Semantics

Brief thought. A program language is at a minimum a syntax definition and a set of evaluation semantics of the form a → b, meaning "a gets rewritten as b" (of course, they are terms in reality). So lambda calculus looks like:

t := n
| λn.t
| t t

and has three evaluation rules:

t1 λn.t2 → [t1/n]t2
t1 → t1' ⇒ t1 t2 → t1' t2
t2 → t2' ⇒ t1 t2 → t1 t2'

Only the first is of any real substance. The last two say that an application steps to a particular point if one of its sub-terms steps to a particular point.

Essentially, these evaluation rules are a vector field, or a system of equations. They tell you, for a given point, what the next point is. This is important, of course, because we do need to know where to go. We can even apply a logic system that operates over the syntax of the language to restrict the possibilities to programs that never have undefined behavior (aka a type system).

In my previous entries, I discussed concurrency and linearizability (and its similarity to limit cycles in chaos). For this, it's nice to know the difference between our present state and the next one. For this, we need differentiation.

So an interesting thing might be to take a language, and develop both it's operational semantics and its derivative semantics, and develop some theorems relating them to the equivalent concepts in calculus. What's also nice is that the corresponding type judgments over derivative semantics give a very straightforward notion of linearization points (see below).

More on this later.

Tuesday, June 16, 2009

Types, Atomic Actions and Dirac's Notation

My recent work has dealt quite alot with Hoare types and how they can be used to reason about concepts like linearizability. This has been a bit rough, for several reasons. Hoare types are fine for reasoning about sequential consistency in programs. For instance, in the following simple (and infomal, and probably incorrect by "true" Hoare logic) example:

x ← [l] : { v : Int = [l] } Unit { x : Int = v }
[l] ← x + 1 : { x : Int } Unit { [l] = x + 1 }
x : { x : Int } x { }

the types can be combined by the rules of Hoare logic into the type { v : int = [l] } v { [l] = v + 1 }.

But in concurrency, this isn't what we want. We don't really care about one step guaranteeing the preconditions of another. We care about observing properties and causing effects. It's a subtle, but distinct difference, and it seems related to what I was talking about earlier with differentiation of propositions. Observing a property is when some proposition in the outside world causes a change in the eventual result of the term. Causing an effect is when some proposition is changed. So basically, you have the properties you observe, the effects you cause, and a change of types (as in, a before type, and an after type)

Of all ways to represent this, Dirac's notation comes to mind. The representation of effects would go something like this: 〈 properties observed | effects caused 〉. So, for instance, we might type an atomic counter as 〈 v : Int = [l] | [l] = v + 1 〉. Now, we also need to add the type of the term before the action, versus the type after. So, the type might look something like this: Int〈 v : Int = [l] | [l] = v + 1 〉v. This is, of course, a simplification. In a real dependent type system, expect something like this:

Π l : Int Ref.Π x : Int. Int〈 [l] = v | [l] = v+1 〉{ y : Int. y = v }

The other benefit of this is that it allows sequences of atomic actions to be differentiated from individual ones. The counter I gave above would be typed as something like (in the simpler form):

Int〈 v : Int = [l] | 〉v ; v〈 | [l] = v + 1 〉v

The best part of all this is that I have a pretty good idea how to get these automatically from the sequences of Hoare-typed terms and dependently-typed heaps I've been working with in all my writings on this subject, and that method is a propositional differentiation function...

I told you, it all comes together in the end.

Saturday, June 13, 2009

The Point of It All

Garbage Collection and Transactions, in an advanced concurrent runtime environment are as essential as things like function calls, frames, dynamic linking and loading, and everything else we take for granted now. The compiler should take a very active role in implementing and optimizing them, not just cough out the bare minimum and punch the clock.

I might add, this is very much the thinking of a functional language person: the compiler bears the responsibility of making an elegant language fast.

A RISC Approach to Transactional Instructions

This is a different approach, taking a more low-level, RISC-based approach. Ultimately, we'd like either instructions or intrinsics that help us implement a transactions using whatever strategy we'd like, and do so efficiently, with the ability to optimize away much of the cost.

As an analogy, look at early CPU research, which was concerned with the "semantic gap". This produced instruction sets like VAX, which were a nightmare to implement in silicon, and had the effect of forcing compilers to follow very specific idioms (and generally were very difficult to optimize in any meaningful way). For a modern example, look at JVM. It was built to map directly from Java expressions, and it's been a tragedy of errors ever since. Then along came the RISC idea: present instructions that can be implemented efficiently in silicon, and semantic gaps are good: they give you space in which to optimize.

So in transactional memory, what does this equate to? Throw out begin/end transaction, specialized styles of nesting, commit handlers, abort, and all that and start over. What do I end up implementing when I do a pure STM system? Logging/watching and commit/rollback, and validation. I use some combination of these depending on exactly what strategy I'm employing (write-locking vs read-validation, lazy vs eager conflict detection, write-buffering vs undo-logging).

Going down the list we need a way to say "watch for activity on this location(s)", as well as a way to say "notify people of activity on this location(s)", and a way to say "make sure no one has conflicted with my watched set". In terms of actually doing things, we need a way to say "buffer this write" or "log the previous value", as well as a way to say "roll back/perform this action(s)" And lastly, we need the ability to discard watched locations.

I present the following:

watch (r | w), %ptr*

Add %ptr to the watch set, alerting on read or write.

touch (r | w), %ptr*

Notify watchers of activity on %ptr.

discard (r | w), %ptr*

Stop watching activity on %ptr.

validate (r | w), [%ptr*]

Check for conflicts on all watched locations, or just %ptr.

log (undo/redo), %ptr, [%val]

Log activity on %ptr, either saving the old value at %ptr or logging a write of %val.

commit [%ptr*]

Perform all buffered actions, or just those on %ptr.

rollback [%ptr*]

Rollback all buffered actions, or just those on %ptr.

Perhaps commit and rollback should be combined into a flush instruction. This isn't finished yet, except I don't quite like this since they do opposing things. We need specific watch-sets and logs so we can do nesting. We also need the ability to store logging information in SSA values, so the illustrious flatten can squish it out of the logs, as I describe in earlier posts. The watch-sets would also give a straightforward implementation of retry functionality (perhaps we want a modified halt instruction).

But this gives you the flexibility without forcing you into a particular regime. It also (hopefully) gives you the space to optimize.

Repacking Garbage Collection

If you have a copying collector, like I do, and you want to have the compiler generate the trace code, like I do, the next logical step is to have the ability to "repack" objects as they are being collected.

Take this example: you have a hash-table implementation you want to resize periodically. The naïve way is to redo the entire table whenever you want to resize. The slightly smarter way is to build a second table and move entries over whenever you look them up. This is how Haskell and SML's hash tables work.

But why not mark the table, and when you allocate the new table, then extend it. Or take something like a tree. You can pack immutable trees into a very dense, very cache-efficient array representation if you create them all at once (think like a binary heap, with a bitmap indicating whether a given slot contains an entry or a pointer to a classical node. This also works for structures that need their own internal garbage collection (like graphs), and lazy structures with some kind of lookahead caching.

More on I/O

Disclaimer: I don't endorse having one's car booted and having to walk 2 hours in the dark to get home as a way to stimulate thought.

I recalled something I had realized earlier about I/O in a highly-concurrent world.

Stream-based I/O is very detrimental to parallelism. While it's nice and convenient in a sequential world, in the concurrent world, it forces an awkward sort of serialization phase. I found this out while experimenting with implementing a parallel compiler (as in, a parallelized compiler, not a parallelizing compiler) when looking at the final stage, which outputs to a file. If you insist on using streams, the result is a dramatic reduction in parallelism. The same goes for input, for that matter. If you can get away with storing data in a way that doesn't require stream-based parsing (say, in a relational, or even just indexed form), this is better. So the question is, what kind of I/O do we want to be doing, exactly?

On the output side, it's much cleaner to do a kind of indexed scheme. For instance, even something as simple as writing out an ELF binary becomes much more parallel if my I/O calls are something like write(fd, index, size, data). This is, of course, because ELF files are a collection of arrays with some cross-referencing, essentially. Trees can also be written out using this style I/O, and relations are already so parallel it practically hurts. As far as input goes, the same style works just as well.

Where all this breaks down is when you have a sequence of different-sized things. Most of the time, this comes up in flat text files, XML data, and the like (which makes one wonder, might it be a good idea to implement some kind of OS-level XML framework that actually stores everything in some kind of ELF-like binary form, and pretends to be text when it needs to?). Even worse is when you have a sequence of different-sized things, and you index into it. This is the case with JVM class files (yet another bad JVM design choice).

The point is, we want to move away from sequences and streams and move towards trees, arrays, and relations. This all reminds me of a talk given by Guy Steele, which basically said the same thing about data structures (lists are bad, trees are good). Using the indexed I/O model, we can either build up buffers and flush them in very large writes (which lends itself naturally to having little threads write into a buffer and an I/O thread periodically sync it to disk), or we can mmap and work directly. The mmap approach is potentially dangerous, though, depending on exactly how the OS handles mmaped NFS files. What we don't want is the executors getting hit with page faults and blocking while the OS fetches data remotely. And while we're on the subject, we probably don't want to wait on disks either, in a concurrent setting. Not when we can just grab some memory and go to work.

Of course, there's more to I/O than shipping data to disk. We have, among other things, the filesystem maintenance operations and network traffic. Regarding the second, if we're in a highly-concurrent setting, chances are we want some kind of event-based or reactive model, rather than the single-processor "read a request, process it, write a response" approach.

The "maintenance" operations are the annoying ones. These are the operations that should be effectively function calls, but because of NFS mounts and the shortcomings of POSIX when it comes to distributed filesystems, they can potentially block. Having no better alternative, I'll default to the older idea: one I/O thread per executor, plus whatever buffer-maintainers and socket/event-watchers you end up creating.

Friday, June 12, 2009

Current Thoughts on the GC/Transaction Intrinsic Set

These are my current thoughts on intrinsics for the combined GC/Transaction stuff I want to do some day:

log (gc | tx), (r | w), %ptr

Perform a logging action for %ptr.

gc_alloc type, options

Allocate something from GC'ed memory.

safepoint

Possibly deschedule. Possibly switch to allocator. These must be placed by the compiler so that no more than some number of instructions get executed between each one.

tx_begin (open/closed)

Begin a new transaction, with closed or open nesting.

tx_validate (r | w) [ %ptr*]

Validate the given pointers, or the entire read/write set.

tx_discard %ptr*

Discard the given pointers from the read set.

tx_commit

Commit a transaction.

tx_abort [num]

Abort the num innermost transactions (or just the innermost).

tx_retry

Roll back, then suspend until someone touches the current read set, then continue.

This is the intrinsics, not what the compiler will end up using. The stuff actually coming from the TM runtime will be much more minimal, primarily contention management, suspend, and wakeup stuff in the end.

It should be noted that retries inside orelses are known to be evil. I have a paper about this...

Region and Effect Analysis for Optimization of Transactional Memory

One of my research activities deals with compiling, analyzing, and optimizing transactional programs. I'm doing work for Tony Hosking's RuggedJ stuff, but also gathering intelligence and experience for doing my own LLVM work at a later time, which I expect will have much more potential, since I won't be tied to JVM, Java, and their limitations.

The primitives one might see in a TM system, based on the latest research might look something like this:

open (r/w)
log (r/w)
validate
begin (open/closed)
commit
abort
retry

A beneficial, but difficult optimization is to eliminate redundant open and logging operations. This is also quite beneficial for concurrent garbage collectors, as I discuss in a previous entry.

An important point here is that open operations are idempotent, and cannot be undone. This means that once a given object is open, it stays that way until the end of the transaction, and additional opens have no effect. Also, there is no way to "unopen" an object, aside from commiting the transaction.

Simple data-flow analysis is a start, but there's another subtlety. For instance, in the following example:

%x = load %ptr
%y = %x
open read,%x
open read,%y

The second open read is unnecessary since the values %x and %y both point to the same thing. This notion of "pointing to the same thing" has a name: regions. A region is some spot in memory where some number of values are located. It originates in the context of region types, which are able to reason about memory allocation and deallocation. It also works for what we're doing: object-granular conflict detection (though in the LLVM stuff, I'd like to try a more intelligent strategy of grouping related shared fields together).

We can track regions and note which ones have been opened for reading and writing (and which fields have been logged, or any other effect we might wish). When we do a phi instruction, we create a new region, whose state is the lower bound of all the states of the arguments (unless all the arguments have the same region, in which case we just keep it).

It's more complicated, though. If we create the merge regions based on the SSA values being merged, we might be too conservative. Consider this:

%w = load %p1
%x = load %p2
br ..., label %L1, label %L2
L1:
%y1 = %w
%z1 = %w
open read %w
br label %L3
L2:
%y2 = %x
%y1 = %x
open read %x
L3:
%y3 = phi [%y1, L1], [%y2, L2]
%z3 = phi [%z1, L1], [%z2, L2]
open write %y3

We end up falsely treating %y3 and %z3 as having a separate region, even though they don't. The answer might seem to create the new regions based on the regions being merged, but this doesn't quite work. Consider:

%w = load %p1 ; w -> r1
%x = load %p2 ; x -> r2
br ..., label %L1, label %L2
L1:
%y1 = %w ; y1 -> a
%z1 = %x ; z1 -> b
open read %y1
br label %L3
L2:
%y2 = %x ; y2 -> b
%z2 = %w ; z2 -> a
open read %y2
L3:
%y3 = phi [%y1, L1], [%y2, L2] ; y3 -> (a | b)
%z3 = phi [%z1, L1], [%z2, L2] ; z3 -> (a | b)
open read %y3
open read %z3

We would treat %y3 and %z3 as being the same region. However, notice that we can get rid of the first open read, but not the second. This is because the effects visited upon a and b differ based on the path by which they arrived. Therefore, we must record the paths. This gets nasty the moment we have loops, of course, but recall that we are dealing with monotonic, idempotent effects. Running through a basic block 5 times is the same as running through it once. Therefore, we can record merged regions as a "primitive region", plus a set of control-flow edges. The reason it's control-flow edges will become apparent in the next example.

There's still one problem. Consider this loop:

L1:
%z = %y
open read %z
%y = %x
open read %y
%x = (new object)
br ..., label L1, ...

Are %x, %y, and %z all pointing to the same region? By our previous method, they are. However, this would lead us to eliminate the open read %y, when in fact, we want to eliminate open read %z. The problem is that we didn't account for the fact that each look iteration, a "fresh" region gets assigned to %x.

The answer here is to associate with each region an "age" number, representing the loop iteration during which we saw it. We disregard this age when checking to see if anything changed, for the purposes of telling if we've reached a fixed point, but we count it when merging.

In summary, merged regions are a set of primitive regions, their ages, and the control flow edges they've visited. When we merge two regions, we line up all the equal primitive region-age pairs, and union their paths). We get the state for regions created this way by taking the lower bound of all the mergees. We can supplement this with alias analysis. We can treat loads from pointers we determine "must alias" as yielding the same region. Once we have regions assigned in this way, we can eliminate extraneous operations, or perhaps do more aggressive optimizations like PRE.

Verifying Linearizability with Types

A major theme of my research for the past year has been the desire to build type systems which are able to verify that certain term (or more accurately, certain objects) are linearizable. Linearizability means that something (an execution, a term, operations on an object) all take effect in a single atomic operation.

Of course, to build a type system, we'll need to be more specific. A term's behavior is described by a Hoare type of the form {P} T {Q}, meaning that the term assumes P, guarantees Q, and reduces to a value in T. As I discuss in a previous entry, the term will observe some properties and change others. If all this happens in the same step, then we have linearizability.

Of course, life is always more complicated. In optimistic concurrency control mechanisms, we usually have a read operation which gives us a value, then a validate operation, which ensures that the location hasn't changed. In LL/SC, the write and validate operations are a combined atomic operation. "Observation" happens at any point between the read and the validate, though we usually figure out our exact control flow at the validate.

So essentially, what we have are read, write, and validate operations. We can do any number of reads, so long as we validate them and throw away the values if validation fails. All validations and writes must be done in a single atomic step. But it's still more complicated than that, actually. We're allowed to do writes that don't produce observable changes (such as rebalancing a tree), and we're allowed to skip validation on reads under less obvious circumstances.

Colin Gordon and I worked out an approach for dealing with simple languages with LL/SC-style atomic memory instructions using non-dependent (yes, you heard me correctly) types, and a ridiculous amount of caffeine. Of course, we had to combine the control-flow effects of the SC instructions with the write effects. The type system itself works something like this:

Pointers to shared data have a special "guarded" type, which is linear (meaning use-exactly-once), but special. Instead, the entry in the type environment can be "split" into several other entries, which we call "tokens": read, write, and value, each of which are also linear. The tokens are consumed by operations which read from, write to, and use the value of the corresponding tokens. The type system also uses a primitive form of Hoare-types to track whether it has seen a linearization point yet.

I'm glossing over quite alot here, but it's enough to get the general idea. Now, how can this be generalized? The tracking of whether or not we've seen a linearization point yet is also critical, since it determines our ability to make observable changes to state. If you go from a pre-point state to a post-point state, you need to validate all reads you've done and perform any writes such that you perform one of the possible behaviors of the term.

All writes that aren't part of a linearization point need a proposition showing that they don't cause any observable effects (meaning that the before and after states are equivalent). The ability to discard reads that have not been validated yet depends on a slightly more intricate property. Unvalidated reads can be discarded if the result of the term is not affected by changes to the location from which the read took place. Proving this is another matter.

In Colin's and my work, we consumed the value token in a SC or check (basically a straight validation), and if the operation failed, you got kicked to a "fail" control flow path. In a more general solution, you end up with two situations. The first is failed validation, in which you need to prove that you don't use the value, or any value derived from it. I've still not worked out the details of this yet.

The second is by proving that concurrent modifications don't affect the outcome. For instance, in a linked list, you would have a proposition that hd(cons(a, b)) = b, so if you're traversing down a list, removing deleted values from the list won't affect the outcome.

This is clearly not yet worked-out, and it won't be until I build a language and start imlementing some more complex linearizable systems. But this is a lab-book, afterall.

Thursday, June 11, 2009

The Nasty I/O Problem in Featherweight Threading

Since I devoted so much of my time to devising a featherweight threading system (aka, two-thirds of my master's thesis), it's clearly something I care alot about to say the least. This is, of course, an understatement. The truth is I think it's the only way forward in a concurrent world.

It begins with a criticism I hear all the time: "Functional languages have this implicit parallelism, but no one's ever realized it." The reason, is twofold. One is pure politics which I won't discuss here. The other is that functional languages expose an amazing level of fine-grained parallelism. But what we're given by pthreads and similar platforms isn't fine-grained at all. Just how fine-grained? Every single expression can compute its sub-expressions in parallel. Of course, that doesn't mean we can exploit it.

A fairly basic common-sense argument holds that the number of instructions it takes to launch a thread is rougly the same as the minimum number of instructions that we want that thread to execute before terminating. Of course, we can analyze further, but the basic notion that the cheaper it is to launch threads, the more parallelism we can afford to exploit. To parallelize a language like Haskell or Standard ML, we'd like to ideally be able to spawn threads even for medium-sized functions, consisting of 50 or so real instructions. (Anything smaller will likely end up losing more than it gains due to caching, speculation, branch predictors, and other such effects anyway). And herein lies the problem.

Your standard 1:1 pthreads library needs to allocate a stack, make a system call to allocate a kernel thread, plus its stack, lock/unlock mutexes, and initialize a whole lot of system-specific cruft. The result is a cost on the order of tens of thousands of instructions. It also costs several thousand instructions to switch between threads.

The solutions I proposed weren't necessarily new; the exact way in which I combined them was. First, begin with a copying garbage collector: a staple in most functional languages anyway. Copying collectors have the nice property of better memory locality, as well as extremely cheap allocation (just advancing a pointer). With this, you can replace conventional stacks with gc-allocated frames a la Andrew Appel's CPS/closure conversion, and employing all the usual tricks to cut down allocation (flatten to squash non-recursive call trees, liveness analysis to reuse the memory of object that have died, etc). This gets you both threads and continuations at very little cost. You then employ the reviled M:N threading, only in a more clever way. You can probably tell from the title that I/O is, as yet, unsatisfactorally resolved, so I'll dance around it and address the synchronization problems. Locking mutexs potentially blocks system threads (I call them executors), which for M:N threading is bad. The solution is quite simple, actually: don't fight the battle at all. Build the scheduler from lock-free data structures. Implementing synchronization for the user-level threads without relying on OS synchronization is a bit trickier. My solution was to rely on voluntary context-switching, which is enabled by the presence of the safepoints necessary for garbage-collection. The result is a runtime system whose only use of locks is to implement a barrier to turn the garbage collector on and off.

This is all fine in a perfect world where we just run CPU-bound jobs and go home happy. That being said, I/O mucks it all up.

The problem with I/O is that it is all done using a hopelessly outdated API. The POSIX APIs, were built for a much simpler world: one processor, usually no preemption in the kernel, running on one machine, with no concept of a distributed system. The resulting situation resembles very much the European liberal democracies implemented on top of an old, defunct absolute monarchy, which is kept in place by a quaint sort of nationalism, and the fact that they are too convenient both to the rich and powerful gentry as well as the young sycophants looking to inflate their reputation by paying court to the obselete.

Any I/O call is potentially blocking, even close, for the simple reason that it may be on an NFS mount. Anyone with cursory experience in non-blocking or asynchronous I/O knows that cake is a lie. The only thing we can trust is blocking I/O, and everything potentially blocks. Of course, we have no clean, let alone accurate way of knowing what will... Barring all kinds of platform-specific warlock arts, this means every single I/O call has to take place off of the main executors.

This gives us two options. The first is to start a thread specifically to handle every I/O call. The second, and far more sane, is to give the job to a set of I/O threads. One can imagine some sort of lock-free, load-balancing multi-queue in the vein of Herlihy's old work on counting networks, or perhaps just a randomized bucket approach based on the birthday paradox. Ideally, this should cut down the level of cross-talk between I/O threads to acceptable levels. And hopefully I/O jobs will take long enough that the level of real contention will be minimal.

The question then becomes how many I/O threads to have. There is no clear answer. For executors, this is fairly obvious: the same as the number of CPUs, perhaps plus some constant. With I/O threads, it could literally be anything. On a system with one disk, and all I/O headed there, it's clearly one. Truly non-blocking I/O, of course, would benefit from many.

My answer at the moment is simple: one per executor. There's no particular inspiration or genius going into this, aside from the simple observation that you'll do no worse than you would if you had nonblocking I/O that actually worked, or with a system such as FreeBSD's KSE system. Anything more clever than this gets platform-specific, arcane, and opaque very fast.

Regarding I/O, executors, and garbage collectors, GHC deals with this in an elegant fashion, which I adopted for my collector. One option when allocating is to "pin" the block. This can be done either by a scheme I describe in one of my papers, or simply by farming out the request to a lock-free mark/sweep collector. The result is an immobile block of garbage-collected memory, which can be used directly in I/O. You can pop this into a lock-free I/O queueing mechanism and employ the same wait mechanisms I use to implement user-level mutexes to wait on the job to finish.

One of the best arguments I can come up with for all this is that it maps very cleanly onto bare metal. The executor model, of course, is the way it actually works with multiple CPUs, the synchronization implementation ends up being very lightweight and elegant, and the I/O problems go away (though the queueing becomes far more complex).

Moreover, do we really expect POSIX to stick around forever as-is? FreeBSD has already made one attempt to switch to activations. Mac OS is on a microkernel, and I know of at least two projects to implement Linux on a microkernel. On the other hand, OS research these days has a terrible inertia, coupled with a tendency to get badly sidetracked. Which is, of course, why I got out of it.

Dan Grossman was Right

For those who don't follow Onward!, Dan Grossman published an essay talking about the relationship between garbage collection and transactional memory.

To give you a background, I've been working on some compiler-based transactional memory stuff in the context of Tony Hosking's RuggedJ, and on my own vision of the high-concurrency, featherweight threading runtime environment, which I'm now trying to bring to life using the components I've already built as a runtime support system, and LLVM both as an abstract instruction set and the arbitrary compiler support I need to make the whole thing fly.

The grand vision is that the compiler will graft in most of the code needed to do garbage collection and transactional memory, applying the fabled "sufficiently aggressive optimizations", which I sincerely hope will get the real costs of transactional memory down low enough to where it consistently beats locking synchronization.

Part of this, of course, involves creating a set of intrinsics, which your compiler frontend must generate. LLVM's garbage collection intrinsics are... almost... good enough. But they are just lacking enough when it comes to the kind of highly-concurrent, compiler-mangled stuff I want to do. To give you an example, LLVM gives you:

llvm.gc_read
llvm.gc_write
llvm.gc_root

Where gc_root marks a local variable (stack-slot) for visit by the collector. This is fine for your quaint, cute, lock-based, stop-the-world type collectors. But that's exactly the sort of non-concurrent barbarism we're trying to move beyond! Add in the desire to throw in STM at a future date, and it's better just to scrap theirs and build my own. Besides I really don't have much of a desire to do anything with LLVM other than pillage it for compiler machine parts anyway...

Specifically, I need a couple of things. First, I need to separate logging from the actual read/writes. I also need safepoints, at which actual collection can take place. The idea is you want only one logging action for each location that is accessed between the access and the next safe point. Enter difficult compiler analysis #1 (Eliot and I found out just how difficult...)

Second, I don't have stacks. The basic assumption is you have a frame pointer which points to an object which gets traced just like everything else. The basic assumption is that everything gets GC-allocated. Of course, the flatten optimization that you find in basically every functional language compiler on the planet will squash the all the frames for a non-recursive call tree into one (and my calling conventions allow this). Or you can figure out when a C-style stack actually is better and use it instead. But the point is that the gc_root is totally unnecessary.

In a distant third, allocation can't be done with a function. There's more state variables that get carried around, used for allocation, and passed into safepoints to be modified. This is cruft, but it's important cruft. All the same, I can probably have LLVM stick it all in for you, assuming you actually have a gc_alloc intrinsic. And of course, gc_alloc needs to know the type of the thing it's allocating since I do everyhing with type signatures.

So I have something along the lines of this instruction set:

gc_alloc
gc_log (r/w)
safepoint

And the normal LLVM load/store stuff.

Now this is where it gets interesting. At some point, I'll want to add TM intrinsics. The basics of what you want are this:

tx_begin (open/closed)
tx_validate
tx_commit
tx_abort
tx_retry
tx_log

And again, the usual LLVM load/store stuff. Notice the log. You now have two of them: one for the collector and one for the transactional memory. It gets deeper...

In the library-based world, you allocate objects, build up this list of logged locations, and then process it when you abort (or commit) a transaction. In GC world, you do the same. All of this is opaque to the compiler. But if you have a compiler that you can make do as you wish, you can do things differently. The compiler can stash the log itself in an SSA value, carrying it along as it goes. How is this special, you ask? Once again, apply the illustrious flatten. Whenever possible, you'll squish as much of the log out into SSA values, which will in turn be stashed in registers where possible and frame slots when not. Now, spit out the code to grind through the log directly into the program itself, and apply all your other optimizations. Result: much better than calling out to a library (I hope).

What's the point of all this digression? You'll have different behaviors depending on whether or not you're logging a transactional-only, gc-only, or gc and transactional thing. You also really don't want to be having a dual regime when it comes to logging, since you'll be duplicating a whole lot of stuff, and gobbling up alot of registers in the process.

Also, the TM really needs to know about and affect the placement of safepoints. Safepoints represent a potentially huge gap in time, so the transaction probably wants to validate immediately before and after. You'll also want at least some kind of notification to the contention manager, and in the future, we can imagine some kind of crosstalk between the TM system and the scheduler (I'm running a transaction that's logging A, B, and C, so it'd be really nice if you don't schedule anyone who walks all over them...)

Last and not least, safepoints can make the impossible possible. In a naive world, you can't do undo-logging with lazy conflict detection, because you might see inconsistent states. Likewise, some old lock-free TM implementations had the same problem. Let people see inconsistent states, and you get the potential for wacky control flows that shouldn't happen, loops that should terminate not terminating, and I/O side-effects that shouldn't happen happening (I/O is to transactional memory what nuclear waste containment is to nuclear power). In an I/O-less world, where you don't have some of the less-cooperative control-flow constructs (like continuations or exceptions), you can deal with the problem by using a timer interrupt to break and check consistency every so often. Safepoints are the lovely alternative to this at a fraction of the cost. Toss your safepoints around well enough, and you can actually do things like lazy detection with undo logging, or revive those old lock-free STMs...

Lastly, there's potential for some really nasty or beneficial interactions between GC and STM, depending on how naïve or smart you are. For instance, if you allocate inside a transaction, you don't want to discard the object and allocate another one just because you got aborted. Likewise, GC logging inside a transaction is likely to lend itself to certain subtleties. Start throwing in stuff like boosting, and... I'm getting ahead of myself. I'm sure there's probably several papers worth of stuff you could do here.

The last thing I'll mention is that alot of transactions can probably be compiled into simpler implementations. For instance, single-word transactions don't need a full TM suite; you can just do them with CAS or LL/SC instructions. Likewise, read-only transactions can be transformed into an atomic snapshot. Negotiating the interactions between these and GC logging will require more care in the compiler.

So in summary, Dan Grossman, you were right.

Proposition Spaces

I was on the road back from Brattleboro, thinking over my work on structural theories of atomicity, synchronization, and the like. I'll do a brief overview, since I'm writing the tech report on all this right now, and trying to turn it into a conference paper.

Concepts like atomicity, lock-freedom, and the like exist only as conceptual descriptions at the moment, but nothing you could use to build a type system or a model checker. To do that, you need to describe these concepts using structural rules. This is what I've been working on recently (one of the things, at least). My approach is to develop a theory of concurrency which operates on executions, or histories of states as a program is executed. A concept like "atomicity" can be stated as a structural property of executions, similar to the way a type judgment is a structural property of a program's syntax. You can also state things like synchronization rules, memory consistency models, and database serializability properties this way.

All this heavily depends on dependent Hoare type theory. Dependent types let you state arbitrary properties of a term, like "an integer less than 4, a hash table containing x but not y", and so on. Hoare types extend this with pre- and post-conditions, written {P} T {Q}, where P is a condition that must hold before executing a term, and Q is the condition that holds after. T, of course, is the term's type. All these can contain arbitrary predicates, so you can have something like "delete : { x ∈ t } t : Table → () { x ∉ t }", which describes the exact behavior of delete.

How does all this relate to concurrency and specifically, atomicity? These types describe behaviors. Some operation has a type {P} T {Q}, meaning that it changes the state of the world from P to Q and evaluates to T, or else {P} T {P}, meaning that it observes P, and evaluates to T (recall all these can contain arbitrary predicates). I make heavy use of a thing called a "definitive type" which is essentially the most restrictive type you can give a term. So the definitive type of a "contains" function on a set might be

contains : {x ∈ s} s : Set → true {x ∈ s} ∨ {x ∉ s} s : Set → false {x ∉ s}

Notice the disjunction operator ∨. If you take all the behaviors combined with this operator, you have a bunch of definitively typed possible behaviors. Each of these is a threshold.

Assuming your type system is sound, then a given term's definitive type represents a monotonically decreasing set as the term gets executed, meaning that its type encompasses a smaller and smaller set of values. (If the type ever grows, then your type systems lies) Here's the key point: once your term becomes a subtype of only one of its possible behaviors, the outcome is now fixed. This is analogous to what Herlihy calls a "univalent state". Prior to this, your term is a supertype of some set of the possible behaviors, which is analogous to a "multivalent state".

Now, finally on to what I was thinking about. When I wrote the logic rules for spotting linearization points, good behavior of locks, and such, I remarked that it was awkward, because you have to operate on pairs of states in order to say exactly what parts of a term got reduced. What would be really nice is a differentiation function (Harrison, here's your precious taking the deriviative of a data structure coming up soon). To do this you need several things: 1) what part of a term got eliminated, 2) what changed in the concrete state, and lastly but not least 3) what changed in abstract state (meaning what propositions' truth changed). 3 subsumes 2, since concrete states are just sets of very uninteresting propositions (address 1 holds true, 2 holds the address of 4, ...) Types are also propositions in a dependent system, and as we've already stated, they are monotonically decreasing, so all we only ever lose propositions. The propositions representing abstract states, of course, can change arbitrarily.

So what we're essentially setting up here is a sort of "propositional space". A point in this space is the truth assignments to all propositions. A vector in this space is our differentiation function. A program represents a vector field in this space (or a system of equations for math people). Lastly, an execution represents a single path.

Of course, all this would be just a mind-bending abstraction if it weren't for what else I'm working on: chaos and computing. You see, when you start to think about things in this way, it suddenly starts to look very much like concepts in nonlinear dynamics. You have your vector fields, bifurcations, limit cycles, stable and unstable fixed points, and so on. Undecidability implies that you probably can't predict the outcome or the position in propositional space past a certain point. Most importantly, you have an infinite number of propositions represented in a single point, but for every program, these are determined by a finite number of propositions. This is analogous to high-dimension systems which are actually determined by a small number of real dimensions.

Just as Euclidean space is defined by some set of vectors, it would seem that a propositional space would be defined by a set changes in propositions. Figure out what these are for a program, and you can probably tell an awful lot about its behavior at any given point.

I also wonder about the exact relationship between a nonlinear system's being chaotic and a program's outcome being undecidable.

The Beginning

There will probably be a flurry of activity as I put up the more pertinent themes in my research, opting for the things I haven't yet transcribed into papers. Some things may be out of order, as I'm more interested in getting newer ideas out than describing old ones that I've already hashed out.

Eric's Research Logbook