As previously discussed, there are two different ways of looking at concurrency. The more common pthread model presents an interface in which parallelism is explicit. The model found in parallel functional languages, Cilk, or similar environments is one in which parallelism is an optimization. From a different angle, the pthread approach is an interface for a scheduler, where the Cilk-style approach is an interface for a dispatcher. And of course, schedulers produce concurrency, while dispatchers consume them.
Put the two together, and you have a nice little hylomorphism. This, of course, is an opportunity for optimization. Put another way, it's a potential inefficiency. Since we will have to put the kernel/user boundary somewhere in here, the goal is to make this as efficient as possible.
Fundamentally, this is an extension of the uniprocessor scheduling problem. In one way (the current dominant) of looking at it, we apply uniprocessor scheduling techniques, except we have multiple processors to schedule. This works nicely when we have a bunch of threads running in isolation; it introduces no more complexity than is needed. However, this isolation isn't necessarily representative of reality. An alternate (also flawed) view of things would be to turn over all the execution resources of a machine to one task at a time. This is ill-advised, but it has its virtues. Tasks which consist of many threads are likely to run better: reactive mutexes and other synchronization constructs will work better, interthread communication will work better, and cache- and memory-affinity will help, and load-balancing is much more predictable, among other things. This approach works quite well, in fact, as long as each task can effectively use all the CPUs it is given.
The truth is we want a combination of the two: we want to assign groups of CPUs, call them "mobs" to groups of threads, call them "jobs". Theoretically, this is multimachine scheduling, which is a hard enough problem even when we know the future (which we don't), so we'll rely on simplifying heuristics. It sounds complicated, but in reality the problem can be solved using a modification of a standard epoch scheduler. For any given scheduling cycle, we can assign the highest priority job it's requested CPUs, then repeat the process with the leftovers (if there are any). The only question is how to differentiate between jobs requesting large numbers of CPUs and those requesting smaller numbers. I would argue that the scheduler should favor the smaller requests for several reasons: 1) smaller requests would tend to correspond to event handlers or other highly interactive jobs, which are likely to process a small amount of work, then yield their CPUs, 2) in an activation-type interface, if a job has a large number of idle activations, its CPU demand will likely increase in the near future, so it should be allowed to run, and 3) such an approach would preferentially run multiple, simultaneous jobs as opposed to allowing one to take over all the processing resources of the system (though, due to the epoch property, it will be allowed to run eventually).
The approach I envision would be to compare jobs requesting various CPU counts, scaling their priorities by the number of CPUs requested (priority / log CPUs would probably be a good formula). Schedule the highest ranking job, then repeat, first trying the jobs that don't request more than the remaining CPUs, then trying those that do if there are no such jobs remaining. Of course, designing a lock-free, concurrent algorithm to do all this is tricky to say the least, especially if it's supposed to deal with processor affinity.
As for the actual interface, activations are the only sane way to go about it, for a number of reasons. In essence, activations are the only way that you won't lose information when crossing the kernel/user boundary, and they are the only way that the kernel's own scheduling facilities can be properly constructed in user space. I'll go over the particulars later.
Sunday, October 11, 2009
Sunday, October 4, 2009
Operating Systems from First Principles, Part 1: Rethinking Threads
Arguably the most basic functionality of any modern OS is multitasking, and a decent threading model is fast becoming a required feature of any modern language runtime; therefore, it makes sense that I start what will in time become a lengthy elaboration with this topic.
An often-missed subtlety is that there are two different goal sets that can arise regarding multiple processors and threads, and though the realization of one can support the other, this may not be the best way to go about doing things.
The first goal is the most common: having the appearance of an infinite number of processors regardless of the underlying hardware. The primary concern here is interactivity. Raw number-crunching can be postponed provided that threads which acknowledge events are able to do so. This is what we all learn in undergrad, and is the goal of a pthread-style API. This is the production of concurrent execution, sometimes from a single processor, sometimes from several.
The second goal is increased performance, or the consumption of concurrent execution. Because the pthread model is so common, we might be tempted roll this line of thinking into it, but there are a few key differences. Parallel evaluation of terms differs from pthread-style concurrency in that it is purely an optimization; it does not affect the outcome of the terms in anyway, and . Furthermore, the interactivity and timing issues are irrelevant. The goal here is to commit more processors to a task, hopefully achieving better performance. Cilk is probably the best example of this particular way of doing things.
Building a scheduler system that caters to both is, to my knowledge, a largely unexplored problem. Such a system needs to support threads, which represent the expectation of interactivity. However, multiple CPUs might be committed to a single thread. The scheduling problem then becomes one of balancing the desire for interactivity with the desire to commit large numbers of CPUs to a given thread, so that it completes faster. A useful heuristic here might be that highly interactive threads desire relatively few CPU's, as they would function primarily as dispatchers to more compute-bound threads.
Crossing the user/kernel boundary gets even more complicated, which I'll discuss in later entries. Suffice to say, the activation model is vastly superior to anything else, and the problem gets further complicated by the desire to commit exactly the number of desired CPUs to a given task.
An often-missed subtlety is that there are two different goal sets that can arise regarding multiple processors and threads, and though the realization of one can support the other, this may not be the best way to go about doing things.
The first goal is the most common: having the appearance of an infinite number of processors regardless of the underlying hardware. The primary concern here is interactivity. Raw number-crunching can be postponed provided that threads which acknowledge events are able to do so. This is what we all learn in undergrad, and is the goal of a pthread-style API. This is the production of concurrent execution, sometimes from a single processor, sometimes from several.
The second goal is increased performance, or the consumption of concurrent execution. Because the pthread model is so common, we might be tempted roll this line of thinking into it, but there are a few key differences. Parallel evaluation of terms differs from pthread-style concurrency in that it is purely an optimization; it does not affect the outcome of the terms in anyway, and . Furthermore, the interactivity and timing issues are irrelevant. The goal here is to commit more processors to a task, hopefully achieving better performance. Cilk is probably the best example of this particular way of doing things.
Building a scheduler system that caters to both is, to my knowledge, a largely unexplored problem. Such a system needs to support threads, which represent the expectation of interactivity. However, multiple CPUs might be committed to a single thread. The scheduling problem then becomes one of balancing the desire for interactivity with the desire to commit large numbers of CPUs to a given thread, so that it completes faster. A useful heuristic here might be that highly interactive threads desire relatively few CPU's, as they would function primarily as dispatchers to more compute-bound threads.
Crossing the user/kernel boundary gets even more complicated, which I'll discuss in later entries. Suffice to say, the activation model is vastly superior to anything else, and the problem gets further complicated by the desire to commit exactly the number of desired CPUs to a given task.
Labels:
concurrency,
operating systems,
runtime systems,
threading
Saturday, October 3, 2009
Specifying Concurrent and Synchronous Operations
I've spoken before on how to specify concurrent operations in a lock-free setting. The trick there is observations and effects, and safety is guaranteed when all possible ending states of all operations are valid starting states for each operation.
However, we need more than just this. Such a type system would be unable to certify the atomic snapshot protocol. To reason about this, we need two additional reasoning capabilities. First, we need to be able to reason about when an observation is made, or when an effect was caused. Secondly, we need to be able to infer this with arbitrary predicates over terms. For instance, in the atomic snapshot protocol, we need to be able to connect a memory tag with the time at which an operation occurred. In other words, if tag a is observed in one observation of a given location, and later, a tag b is observed such that a < b, then something effected that location in between the two observations. More importantly, if they are the same, then nothing affected the location.
Lastly, we need some means of reasoning about actual synchronous operations and their behavior. For instance, the synchronization barrier operation or lock/unlock operations imply certain timing behaviors between various threads. Barriers allow a thread to continue at some point after some other number of threads have executed a barrier wait.
Temporal Logic may be the key here, or at least a starting point. Temporal logic provides six different predicates on propositions. They are as follows:
However, this logic alone isn't enough. Specifically, we need quantification over time quanta probably based on these predicates. Linearizability of terms in such a type system would then come down to identifying a specific time quantum which acts as the linearization point. For your basic lock-free algorithms, this is simply the time quantum of some operation in the term itself. For more complicated things, such as a lock-free garbage collector, then the "start collection" function specifies some point in future time (F) at which the heap is exchanged for an isomorphic superset of the observable locations at that point.
In any case, this is likely the way forward on this front.
However, we need more than just this. Such a type system would be unable to certify the atomic snapshot protocol. To reason about this, we need two additional reasoning capabilities. First, we need to be able to reason about when an observation is made, or when an effect was caused. Secondly, we need to be able to infer this with arbitrary predicates over terms. For instance, in the atomic snapshot protocol, we need to be able to connect a memory tag with the time at which an operation occurred. In other words, if tag a is observed in one observation of a given location, and later, a tag b is observed such that a < b, then something effected that location in between the two observations. More importantly, if they are the same, then nothing affected the location.
Lastly, we need some means of reasoning about actual synchronous operations and their behavior. For instance, the synchronization barrier operation or lock/unlock operations imply certain timing behaviors between various threads. Barriers allow a thread to continue at some point after some other number of threads have executed a barrier wait.
Temporal Logic may be the key here, or at least a starting point. Temporal logic provides six different predicates on propositions. They are as follows:
- Fp (future): At some point in the future, p holds
- Pp (past): At some point in the past, p held
- Hp (henceforth): At all points in the future, p holds
- Gp (hitherto): At all points in the past, p held
- ☐p (always) = Gp ∧ p ∧ Hp: At all points in time, p holds
- ◇p (sometimes) = Pp ∨ p ∨ Fp: At some point in time, p holds
However, this logic alone isn't enough. Specifically, we need quantification over time quanta probably based on these predicates. Linearizability of terms in such a type system would then come down to identifying a specific time quantum which acts as the linearization point. For your basic lock-free algorithms, this is simply the time quantum of some operation in the term itself. For more complicated things, such as a lock-free garbage collector, then the "start collection" function specifies some point in future time (F) at which the heap is exchanged for an isomorphic superset of the observable locations at that point.
In any case, this is likely the way forward on this front.
Subscribe to:
Posts (Atom)