Monday, 25 August 2014

BFS 453/454/455/456 and 3.16-ck2

Here is an updated set of BFS patches with the accumulated bugfixes as debugged on this blog for kernels 3.13 to 3.16 inclusive. The main obvious bug which affected people was the ath9k module which would hang on suspend/resume. However there were likely a number of subtle bugs across the board that most people would not be aware of and even I only noticed that kvm behaved much better after this applied bugfix which stretches back to every BFS after 3.12.

In order to make up for the fact that there are numerous kernels out there based on BFS across the different versions, I have updated BFS and numbered the versions according to which base kernel they are on. Note that there are no feature backports on the older kernels, only the bugfixes, so SMT nice is only on the 3.16 BFS.


And along with that an updated ck branded release for 3.16, 3.16-ck2:



Saturday, 16 August 2014

BFS 450, 3.16-ck1

Announcing a resync and update of BFS for linux kernel 3.16.x. Coding has proven a nice distraction from unpleasant life events so I've been able to bring the patch up to date with the latest kernel.

A number of minor fixes as queued up post 3.15-ck1 made their way into this patchset, along with some changes inspired by the development work of Alfred Chen (thanks!).

The major feature upgrade in this one is the inclusion of SMT nice as discussed at length on this blog. This version of BFS includes an updated version of SMT nice beyond version 6 posted here with one change - 25% of the CPU time of any nice level of SCHED_NORMAL tasks can be shared with any other nice level over and above the nice-based CPU distribution. This is to capitalise on the slightly increased throughput that is available by using the sibling CPU concurrently without too dramatically affecting higher priority process CPU loss. In addition it dramatically reduces the massive latencies that can sometimes otherwise be seen by heavily niced tasks with SMT nice enabled by dithering the metering out of CPU instead of giving it all as a burst only when it's entitled to CPU.

Making SMT nice configurable means users can get to choose if they still want the standard behaviour. The config option will recommend users who enable the SMT scheduler option also enable the SMT nice option. I believe this to be a good default choice for virtually all desktop users, and selectively for server users if they depend heavily on the use of 'nice' or scheduling policies for their work cases (but otherwise it should be disabled).

BFS by itself:
3.16-ck1 branded BFS patchset directory:

EDIT: A build fix for non SMT enabled kernels to prevent it being possible to enable SMT nice is here:
Just disabling SMT nice will achieve the same thing for those affected.


Sunday, 10 August 2014

SMT Nice 6

In my last post I discussed the problem with nice levels, scheduling policies and SMT, and my first public patch for BFS to work around the issue, "SMT Nice":


With a bit of extra testing, and feedback from a number of users, a few issues were discovered with the first patch, so I've reworked it. Thanks very much to those who tested it and provided feedback. There were a couple of scheduling points where SMT siblings weren't being examined, and the difference between nice levels was far more aggressive than it was supposed to be.

Here is an updated patch for BFS 449 with all pending patches:

And a convenient all inclusive patch for 3.15 with ck1+pending+smtnice3456:

EDIT: Added one minor change to not allow kernel threads to deschedule any users tasks on smt siblings and bumped the patch version up to smtnice4.

EDIT2: Added a change to fix the high power usage bug bringing version up to smtnice5

EDIT3: Fixed a logic fail which would cause far too many reschedules and not use full CPU with many niced tasks bringing the version up to smtnice6.


Friday, 1 August 2014

SMT/Hyperthreading, nice and scheduling policies

The concept of symmetric multi-threading, which Intel called "Hyperthreading" and introduced into their commodity CPUs first around 2001, is not remotely a new one and goes back a long way before Intel introduced it into the mainstream market. I suspect the introduction of it back then by Intel was them easing the concept of increasing threads and cores for marketing reasons with the imminent walls they'd soon hit with CPU heat and power requirements that would stop the pursuit for higher and higher single CPU frequencies. The idea is that, since a lot of the CPU sits unused even when something is running as fast as it can on part of it, with a bit of extra logic and architecture, you could throw another "virtual core" at some of the unused execution units and behave like 2 (or more) CPUs, putting more of the CPU to good use. These days the vast majority of CPUs sold by Intel have hyperthreading on them, thus doubling the virtual or "logical" cores the CPU has, including even their low power atom offerings.

There have been numerous benchmarks, in-field tests, workloads etc., where people have tried to find whether hyperthreading is better or not. With a bit of knowledge of the workings of hyperthreading, it's pretty easy to know what the answer is, and not surprisingly, it's the frustrating answer of "it depends". And that's the most accurate answer by far, but I'd go further than that and say that if you have any kind of mixed workload, hyperthreading is always going to be better, whereas if you have precisely one workload , then you have to define exactly how it's going to work and whether hyperthreading will be better or not. Which means that in my opinion at least, hyperthreading is advantageous on a desktop, laptop, tablet and even phone since by design they're nothing but mixed workloads. I won't spend much longer on this discussion, but suffice to say that I think about 4 threads (at the moment) is about optimal for most real world desktop(y) workloads.

Imagine for a moment you have a single core CPU which you can run as is, or enable hyperthreading to run as a 2 thread CPU. If you were to run your CPU in single core only mode, then when you run one task at a time it will always use the full power of the CPU, but if you run two tasks, each task runs at 50% the speed and completes in double the time. If you enable hyperthreading, then if you have two mixed workloads that actually use different parts of the CPU, you can actually get effectively (at best) about 140% of the performance of running the CPU in single core mode. This means that instead of the two tasks running at 50% speed when run concurrently, they run at 70% speed. In practice, the actual performance benefit is rarely 40% but it is often on the order of 25%, so each task tends to run about 60% speed instead of 50% speed. Still a nice speedup for "free".

One thing has always troubled me about hyperthreading, though, and that is the way it tends to break priority support in the scheduler. By priority support, I refer to the use of 'nice' and other scheduling policies, such as realtime, sched idleprio etc.

If you have a single core CPU and run a nice 0 task concurrently with a nice +19 task, the nice 0 task will get about 98% of the CPU time and the nice +19 task only about 2%. The scheduler does this by serialising and metering out the time each task gets to spend on the CPU. Now if you enable hyperthreading on that CPU, the scheduler no longer serialises access to the CPU, but gives each of those tasks one logical "core" on the CPU, and you get an overall 25% increase in throughput. However of the total throughput, both the nice 0 and nice +19 task get precisely half. This would be fine if we had two real cores, but they're not, and the performance of both tasks is sacrificed to ~60% to achieve this. Which means that for this contrived but simple example, enabling hyperthreading slows down the overall execution speed of your nice 0 task when you run a nice +19 task much more than on a single core - it runs at 60% speed instead of 98%.

An even more dramatic example is what happens with realtime tasks, which these days most audio backends on linux use (usually through pulseaudio). Running a realtime task concurrently with a SCHED_NORMAL nice 0 task on a single core means the realtime task will get 100% CPU and the nice 0 task will get zero CPU time. Enable hyperthreading and suddenly the realtime task only runs at 60% of its normal speed even with a heavily niced +19 task running in the background.

Enter SMT-nice as I call it. This is not a new idea, and in fact my first iteration of it was for mainline 10(!) years ago. See here: SMT Nice 2.6.4-rc1-mm1

I actually had the patch removed myself from mainline for criticism regarding throughput reasons, though I still argue that worrying about the last percentage points of throughput are not relevant if you break a mechanism as valuable as nice and scheduling policies, but I had lost the energy for defending it which is why I pushed it be removed myself. Note that although throughput overall may be slightly decreased, the throughput of higher priority tasks is not only fairer with respect to low priority tasks, but enhanced because the low priority tasks will have less cache trashing effects.

What this does is it examines all hyperthread "siblings" to see what is running on them, and then decides whether the currently running or next running task should actually have access to the sibling or allow the sibling to go idle completely, allowing a higher priority task to have the actual true core and all its execution units to itself. I'd been meaning to create an equivalent patch for BFS for the longest time but CPUs got faster, cheaper, more cores, I got lazy etc... though I recently found more enthusiasm for hacking.

So here is a reincarnation of the SMT-nice concept for BFS, improved to work across multiple scheduling policies from realtime, iso down to idleprio, and I've made it a compile time option in case people feel they don't wish to sacrifice any throughput:

Patch for BFS449 with pending patches:

And to make life easy, here's an all inclusive ck1+pending+smtnice patch:

The TL;DR is: On Intel hyperthreaded CPUs, 'nice', realtime and sched idleprio works better, and background tasks interfere much less with the foreground tasks. Note: This patch does nothing if you don't have a hyperthreaded CPU.

If you wish to do testing to see how this works, try running with and without the patch and running two benchmarks concurrently, one at nice 0 and one at nice +19 (such as 'make -j2' on one kernel and 'nice -19 make -j2' on another kernel on a machine with 2 cores/4 threads) and compare times. Or run some jackd benchmarks of your choice to see  what it takes to get xruns etc.

This patch will almost certainly make its way into the next BFS in some form.

EDIT: It seems people have missed the point of this patch. It improves the performance of foreground applications at the expense of background ones. So your desktop/gui/applications will remain fast even if you run folding@home, mprime, seti@home etc., but those background tasks will slow down more. If you don't want it doing that, disable it in your build config.