Eric's Research Logbook: Operating Systems from First Principles, Part 2: Mob-Job Scheduling

As previously discussed, there are two different ways of looking at concurrency. The more common pthread model presents an interface in which parallelism is explicit. The model found in parallel functional languages, Cilk, or similar environments is one in which parallelism is an optimization. From a different angle, the pthread approach is an interface for a scheduler, where the Cilk-style approach is an interface for a dispatcher. And of course, schedulers produce concurrency, while dispatchers consume them.

Put the two together, and you have a nice little hylomorphism. This, of course, is an opportunity for optimization. Put another way, it's a potential inefficiency. Since we will have to put the kernel/user boundary somewhere in here, the goal is to make this as efficient as possible.

Fundamentally, this is an extension of the uniprocessor scheduling problem. In one way (the current dominant) of looking at it, we apply uniprocessor scheduling techniques, except we have multiple processors to schedule. This works nicely when we have a bunch of threads running in isolation; it introduces no more complexity than is needed. However, this isolation isn't necessarily representative of reality. An alternate (also flawed) view of things would be to turn over all the execution resources of a machine to one task at a time. This is ill-advised, but it has its virtues. Tasks which consist of many threads are likely to run better: reactive mutexes and other synchronization constructs will work better, interthread communication will work better, and cache- and memory-affinity will help, and load-balancing is much more predictable, among other things. This approach works quite well, in fact, as long as each task can effectively use all the CPUs it is given.

The truth is we want a combination of the two: we want to assign groups of CPUs, call them "mobs" to groups of threads, call them "jobs". Theoretically, this is multimachine scheduling, which is a hard enough problem even when we know the future (which we don't), so we'll rely on simplifying heuristics. It sounds complicated, but in reality the problem can be solved using a modification of a standard epoch scheduler. For any given scheduling cycle, we can assign the highest priority job it's requested CPUs, then repeat the process with the leftovers (if there are any). The only question is how to differentiate between jobs requesting large numbers of CPUs and those requesting smaller numbers. I would argue that the scheduler should favor the smaller requests for several reasons: 1) smaller requests would tend to correspond to event handlers or other highly interactive jobs, which are likely to process a small amount of work, then yield their CPUs, 2) in an activation-type interface, if a job has a large number of idle activations, its CPU demand will likely increase in the near future, so it should be allowed to run, and 3) such an approach would preferentially run multiple, simultaneous jobs as opposed to allowing one to take over all the processing resources of the system (though, due to the epoch property, it will be allowed to run eventually).

The approach I envision would be to compare jobs requesting various CPU counts, scaling their priorities by the number of CPUs requested (priority / log CPUs would probably be a good formula). Schedule the highest ranking job, then repeat, first trying the jobs that don't request more than the remaining CPUs, then trying those that do if there are no such jobs remaining. Of course, designing a lock-free, concurrent algorithm to do all this is tricky to say the least, especially if it's supposed to deal with processor affinity.

As for the actual interface, activations are the only sane way to go about it, for a number of reasons. In essence, activations are the only way that you won't lose information when crossing the kernel/user boundary, and they are the only way that the kernel's own scheduling facilities can be properly constructed in user space. I'll go over the particulars later.

Eric's Research Logbook

Sunday, October 11, 2009

Operating Systems from First Principles, Part 2: Mob-Job Scheduling

No comments:

Post a Comment

About Me

Followers

Blog Archive