-ck hacking: MuQSS - The Multiple Queue Skiplist Scheduler v0.105

Saturday 1 October 2016

Announcing a multiple runqueue variant of BFS, with the more mundane name of MuQSS (pronounced mux), for Linux 4.7:

Full patch for linux-4.7
4.7-sched-MuQSS_105.patch

Keep watching this blog for newer versions!

Incremental patch to take bfs502 to MuQSS 0.1:
bfs502-MuQSS_103.patch

It was inevitable that one day I would find myself tackling the two major scalability limitations in BFS, and this is the result. The two issues were:

1. The single runqueue, which means all CPUs contend for the lock on the one runqueue, and
2. The O(n) lookup, which means a linear increase in overhead for task lookups as the number of processes increases.

As you're all aware by now, skiplists were recently introduced into BFS to tackle number 2 with a modest improvement in throughput at high loads.
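To illustrate why a skiplist helps with number 2, here is a minimal sketch in Python rather than kernel C. It is not the code from the patch: the class names, the `MAX_LEVEL` cap and the seeded random level choice are all assumptions made for the example. It only shows the general technique of keeping tasks sorted by deadline so that the earliest one is always at the front.

```python
import random

MAX_LEVEL = 8  # assumed cap on tower height, chosen only for this sketch


class Node:
    def __init__(self, deadline, task, level):
        self.deadline = deadline
        self.task = task
        self.forward = [None] * level  # forward[i] is the next node on level i


class DeadlineSkiplist:
    """Tasks kept sorted by deadline; the earliest is first on level 0."""

    def __init__(self, seed=0):
        self.head = Node(float("-inf"), None, MAX_LEVEL)  # sentinel node
        self.level = 1
        self.rng = random.Random(seed)

    def _random_level(self):
        # Each extra level is taken with probability 1/2.
        level = 1
        while level < MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def insert(self, deadline, task):
        # Walk down from the top level, remembering the last node
        # before the insertion point on every level.
        update = [self.head] * MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].deadline <= deadline:
                node = node.forward[i]
            update[i] = node
        level = self._random_level()
        self.level = max(self.level, level)
        new = Node(deadline, task, level)
        for i in range(level):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def pop_earliest(self):
        # The earliest deadline is always the first node, so no scan is needed.
        first = self.head.forward[0]
        if first is None:
            return None
        for i in range(len(first.forward)):
            self.head.forward[i] = first.forward[i]
        return first.task
```

Finding the task with the earliest deadline is a constant-time step here, and insertion takes an expected logarithmic number of steps, instead of a linear scan over every queued task.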

Until now I had neither the energy nor the time to try to find a solution for number 1 that maintained BFS' scheduling decision algorithm, as the single runqueue was actually the reason latency remains bounded and deterministic on BFS, capitalising on more CPUs instead of fighting against them for scalability.

This scheduler variant is an evolution of BFS, which hopefully will be mature enough to replace BFS one day when stability is assured. It still uses the same scheduling algorithm as BFS, meaning latency and responsiveness remain as good as always, but with per-CPU runqueues and discrete locking it will also scale to any number of CPUs, as the mainline scheduler does.
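As a rough illustration of per-CPU runqueues with discrete locking, the following Python sketch gives every CPU its own queue and its own lock while still choosing the earliest deadline overall. It is a toy model under stated assumptions, not the patch's implementation: `RunQueue`, `pick_next` and the idea of peeking at every queue before taking a task are hypothetical names and choices made for the example.

```python
import heapq
import threading


class RunQueue:
    """One runqueue per CPU, protected by its own lock (hypothetical layout)."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.lock = threading.Lock()
        self.tasks = []  # min-heap of (deadline, name)

    def enqueue(self, deadline, name):
        with self.lock:
            heapq.heappush(self.tasks, (deadline, name))

    def peek_deadline(self):
        with self.lock:
            return self.tasks[0][0] if self.tasks else None

    def dequeue(self):
        with self.lock:
            return heapq.heappop(self.tasks) if self.tasks else None


def pick_next(runqueues):
    """Take the task with the earliest deadline across all runqueues."""
    while True:
        best = None
        for rq in runqueues:
            d = rq.peek_deadline()
            if d is not None and (best is None or d < best[0]):
                best = (d, rq)
        if best is None:
            return None  # every queue is empty
        task = best[1].dequeue()
        if task is not None:
            return task
        # The chosen queue was emptied in the meantime, so look again.
```

Only one queue's lock is held at any moment, so CPUs working on different queues do not block each other the way they would on a single shared runqueue.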

It does NOT guarantee the best possible throughput, as there is still virtually no complex balancing mechanism whatsoever. Tasks are selected primarily according to deadline, and only CPU cache distances are used to determine which idle CPU to go to or, in non-interactive mode, which overloaded CPU to pull from to fill an idle CPU.
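The cache-distance idea can be sketched in a few lines of Python. Everything specific here is an assumption for the example: the `distance` table, the function names, and treating a CPU with more than one queued task as overloaded are not taken from the patch.

```python
def nearest_idle_cpu(last_cpu, idle_cpus, distance):
    """Return the idle CPU with the smallest cache distance from last_cpu."""
    if not idle_cpus:
        return None
    # Ties are broken by CPU number so the result is deterministic.
    return min(idle_cpus, key=lambda cpu: (distance[last_cpu][cpu], cpu))


def nearest_overloaded_cpu(idle_cpu, queued, distance):
    """Return the closest CPU with more than one queued task, if any."""
    overloaded = [cpu for cpu, n in enumerate(queued) if n > 1 and cpu != idle_cpu]
    if not overloaded:
        return None
    return min(overloaded, key=lambda cpu: (distance[idle_cpu][cpu], cpu))


# Hypothetical 4-CPU machine: CPUs 0 and 1 share a cache, as do CPUs 2 and 3.
distance = [
    [0, 1, 2, 2],
    [1, 0, 2, 2],
    [2, 2, 0, 1],
    [2, 2, 1, 0],
]
```

With this table, a task that last ran on CPU 0 prefers idle CPU 1 over idle CPU 3, because CPU 1 shares a cache with it.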

It would be possible, with a lot of effort, to wedge the entire balancing algorithm for scalability from mainline into this, though it will probably offset the deterministic latency that makes it special.

This is a massive rewrite and consequently there are bound to still be race conditions and hidden bugs, though I have been running it for a while now with reasonable stability. I'm putting this out there for the braver people to test. There's a lot more to document about it, but for now let's just say: give it a try.

Please don't use any lock debugging as it will light up every possible complaint for the time being!

Regarding 4.8, for the time being I will still be releasing BFS for it and incorporating it into -ck.

EDIT: Updated to version 0.105 with significant bugfixes.

Enjoy!
お楽しみ下さい
-ck

35 comments:

Anonymous 2 October 2016 at 00:57
Unfortunately, 101 already fails for me at kernel startup, without leaving any message, and the step at which it stops is not reproducible (across several trials).
It's a dual-core Intel mobile CPU without hyperthreading capability, so that and SMT_NICE are disabled in .config.

Oh, and I got the following compile time warning:
CC arch/x86/kernel/cpu/intel.o
LD kernel/rcu/built-in.o
CC arch/x86/kernel/cpu/mcheck/mce.o
CC kernel/sched/bfs.o
kernel/sched/bfs.c: In function ‘resched_task’:
kernel/sched/bfs.c:1233:13: warning: unused variable ‘rq’ [-Wunused-variable]
struct rq *rq = task_rq(p);
^
CC arch/x86/kernel/cpu/mcheck/mce-severity.o
CC arch/x86/kernel/cpu/mcheck/mce-genpool.o
CC arch/x86/kernel/cpu/mcheck/mce_intel.o
CC arch/x86/kernel/cpu/mcheck/threshold.o

Con, it's very nice to see your actual innovative activity. Keep up your good work, all users would really appreciate it, and hopefully many are willing to help debugging.

BR, Manuel Krause

ck 2 October 2016 at 01:03
Well, I said it's very new code. I think 100 booted more reliably than 101... but it's more for demonstration of the code than actual usage at this stage.

ck 2 October 2016 at 03:26
I've uploaded version 102, which has a lot of bugfixes and makes it boot properly again for me.

Anonymous 2 October 2016 at 03:29
As a non-native speaker, there is one sentence in your outline at the top that I don't understand at all. It's:
"It would be possible, with a lot of effort, to wedge the entire balancing algorithm for scalability from mainline into this, though it will probably offset the deterministic latency that makes it special."
Can you describe it in other words, please, if you find time?
I really don't understand what "offsetting the special deterministic latency" would lead to, or which is the culprit: the wedging, mainline, or the offsetting.
I looked up the related vocabulary but still haven't got a clue.

Thank you, Manuel Krause

Harold Naparst 2 October 2016 at 08:26
I have packaged this up for Gentoo in my overlay.

# layman -a hnaparst
# USE=muqss emerge ck-sources

It seems stable for me.

Anonymous 2 October 2016 at 09:36
Thank you for your work Con! Glad to see you are still hacking bfs with new ideas.

MuQSS102 is running fine here for a couple of hours. I had sort of a lockup with MuQSS101, but no problem so far with 102. Suspend/resume works.

I've run my little tests on muqss and bfs502.
They don't show how MuQSS is scalable as the CPU is only 2 core + hyperthreading, but they give an overview of throughput.
The performance is on par with bfs502, and there is still a little regression on 50% workload (make -j2).

A word of caution: the tests are not the same as the ones I ran on older BFS releases, so do not compare these results with the older ones.

https://docs.google.com/spreadsheets/d/1ZfXUfcP2fBpQA6LLb-DP6xyDgPdFYZMwJdE0SQ6y3Xg/edit?usp=sharing

Pedro

ck 2 October 2016 at 14:15
Updated to version 0.103.

ck 3 October 2016 at 11:39
Updated to version 0.104 with more throughput improvements. Will be releasing a new BFS for 4.8 shortly.

Anonymous 3 October 2016 at 22:33
Thanks Con! You're putting so much work into BFS these days.
I've added the results for MuQSS104.
I'll now try linux 4.8.

On a side note, some may have noticed that several results were messed up when I changed the layout of my spreadsheet (mainly on the bfs 497, 490 and 467 sheets). I didn't notice this at first, and I don't know what happened, as I was careful with my copy/paste. Anyway, I had kept a backup of the old spreadsheet, so I was able to put the correct results back. Apologies for the inconvenience.

Pedro

kernelOfTruth 4 October 2016 at 00:30
Con, thanks very much for MuQSS104 !

I couldn't stress-test it yet but there seems to be a regression compared to former BFS behavior:

taskset seems broken ?

Usually I update the "cached repo crawler (eix)"

via

time taskset 1 eix-update && time taskset 1 eix-diff

(since it's single-threaded only, pinning it to one virtual CPU lets turbo boost be used optimally)

and that command just hangs without any output message in dmesg.

Incidentally, I also added uksm to this newly built kernel, but I'm pretty sure it cannot be the reason for that hang.

Thanks

kernelOfTruth

thunderrd 4 October 2016 at 05:10
@kernelOfTruth, FWIW, until Con sorts out your issue, eix-sync seems to be OK used with your taskset command, and I think it accomplishes the same thing.

Anonymous 4 October 2016 at 10:10
@kernelOfTruth:
I'm using threadirqs, too, since you advertised it many months ago. Is it still of benefit, or has it been overtaken by current development? (Maybe that's why I sometimes suffer from failed boots.)

BR, Manuel Krause

Sean 5 October 2016 at 05:00
Have you considered using a lock-free queue for the runqueue? It could eliminate some overhead. For example, there is a C++ implementation of a lock-free concurrent (multiple producer, multiple consumer) queue in Facebook's Folly library: https://github.com/facebook/folly/blob/master/folly/docs/Overview.md

Chiitoo 22 October 2016 at 20:37
Hmmm... “mux”, eh? All I can think of, is “mucusss”, heh. :]

Been using BFS exclusively here, for most of my journey with Linux, and it's quite exciting to see development such as this take place.

Will need to give this a go very soon.

Many thanks to you and your work!