Description
Currently: tasks are assigned to sched_loops in round-robin fashion and are tied to their sched_loop for their entire lifetime. This can lead to surprising performance characteristics with irregular task setups.
It does not seem too hard to implement a 'steal' mechanism by which idle sched_loops can reach over to overloaded ones to balance some of the load. My impression is that the classic presentation involves idle processors busy-looping around the steal mechanism until they pick up something to do; I've even seen a talk on a purportedly very clever mechanism that was totally lockless. But... we just can't burn every CPU spinning when the system isn't fully loaded.
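For concreteness, a minimal sketch of that busy-looping shape (in Rust; `Task`, `SchedLoop`, and `run_loop` are invented names, and a `Mutex<VecDeque>` stands in for the lock-free deque a real implementation would use):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Stand-in task type for illustration only.
type Task = Box<dyn FnOnce() + Send>;

struct SchedLoop {
    // Per-loop runqueue; a real work-stealing scheduler would use a
    // lock-free deque (e.g. Chase-Lev) instead of a mutexed VecDeque.
    runq: Mutex<VecDeque<Task>>,
}

impl SchedLoop {
    fn pop_local(&self) -> Option<Task> {
        self.runq.lock().unwrap().pop_front()
    }
    fn steal_from(&self) -> Option<Task> {
        // Thieves take from the opposite end of the victim's queue.
        self.runq.lock().unwrap().pop_back()
    }
}

// The part this issue worries about: with nothing local to run, the idle
// loop just keeps spinning over the other loops' queues, burning a CPU.
fn run_loop(me: usize, loops: &[Arc<SchedLoop>]) {
    loop {
        if let Some(task) = loops[me].pop_local() {
            task();
            continue;
        }
        // Busy-loop around the steal mechanism until something turns up.
        for (i, victim) in loops.iter().enumerate() {
            if i == me {
                continue;
            }
            if let Some(task) = victim.steal_from() {
                task();
                break;
            }
        }
    }
}
```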
One thing I haven't quite grasped is how to do this without thrashing some global synchronisation whenever we decide to steal. If I'm not mistaken, the whole point of work stealing is that each processor's runqueue not be in a cacheline that multiple processors want... but I think that if we want to suspend idle CPUs, we need a global semaphore-like setup that gets poked every time a task (a) spawns, (b) blocks, (c) wakes, or (d) exits. And, well, those are basically the points at which scheduler runqueues get accessed anyway.
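A hedged sketch of what that global setup looks like, using plain std primitives (`IdleGate` and the method names are invented for illustration, and the lost-wakeup handling a real version needs is elided):

```rust
use std::sync::{Condvar, Mutex};

struct IdleGate {
    // Global, shared by all sched_loops: exactly the kind of contended
    // cacheline that per-loop runqueues are supposed to avoid.
    sleepers: Mutex<usize>,
    wakeup: Condvar,
}

impl IdleGate {
    // Poked on (a) spawn and (c) wake: new runnable work may exist somewhere.
    // (On (b) block and (d) exit, a loop re-checks its queues and may end up
    // in wait_for_work below.)
    fn notify_work(&self) {
        let _guard = self.sleepers.lock().unwrap();
        self.wakeup.notify_one();
    }

    // Called by a loop that found nothing to run or steal.
    // Simplified: a real version must re-check the runqueues under the lock
    // to avoid sleeping through a notification.
    fn wait_for_work(&self) {
        let mut sleepers = self.sleepers.lock().unwrap();
        *sleepers += 1;
        sleepers = self.wakeup.wait(sleepers).unwrap();
        *sleepers -= 1;
    }
}
```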
Currently we already take a global (per-scheduler) lock on (a) and (d), but not on (b) or (c). Is there a work-stealing design that puts idle CPUs to sleep without hitting global state on the extra lifecycle events?
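To make the question concrete, here is a sketch (not something proposed above; names invented, ordering details glossed over) of the shape such a design would presumably need: the (b)/(c) hot paths touch at most one atomic, and the semaphore-like state is only hit when some loop is actually asleep.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

struct Sleepers {
    count: AtomicUsize, // loops currently parked
    lock: Mutex<()>,    // only taken on the slow paths
    wakeup: Condvar,
}

impl Sleepers {
    // Hot path, poked on spawn/wake: a single atomic load when nobody sleeps.
    fn maybe_notify(&self) {
        if self.count.load(Ordering::SeqCst) == 0 {
            return; // common case under load: no global lock taken
        }
        let _g = self.lock.lock().unwrap();
        self.wakeup.notify_one();
    }

    // Slow path: a loop that found nothing to run or steal parks itself.
    fn park(&self) {
        let g = self.lock.lock().unwrap();
        self.count.fetch_add(1, Ordering::SeqCst);
        // NOTE: the delicate part is re-checking the runqueues between the
        // increment and the wait; otherwise a wakeup can be lost. Whether
        // that re-check can avoid global state is exactly the open question.
        let _g = self.wakeup.wait(g).unwrap();
        self.count.fetch_sub(1, Ordering::SeqCst);
    }
}
```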