Linux task scheduling, slice time and threads

There is an ongoing debate about the use of processes versus threads for programs that want (or need) to parallelise work. In Linux and most operating systems, a process is a completely separate entity, which means that upon creation it gets its own address space (virtual memory environment). Threads share the address space of the process that spawned them with the other threads in that process.

Therefore the obvious advantage of threads is that there is less work to do at creation time. However, at least on Linux, a forked/cloned process uses its parent's memory allocations in a COW (copy on write) fashion, so the memory allocation part of process creation might be less work than anticipated, and the amount of new physical memory actually allocated for the forked process might be smaller than assumed.

This article is meant to show and prove the effectiveness of threads relative to the available CPUs in the system, including the code to test this on your own systems. A secondary purpose is to provide some guidance on 'tuning' the number of (potentially concurrently active) threads. Important remark: this test is purely CPU bound.

Threads more CPU efficient than processes?

I regularly encounter the assumption that threads are incredibly efficient and very much different from processes, and that it is therefore okay to use many more threads than one would use processes, and far more threads than there are CPUs that can be active (on cpu) at the same time. It's not uncommon to see the ability to spawn more than 1000 threads even on a two or four CPU system.

Threads and processes are equal, but grouped

In Linux, the common (non-elevated priority) processes and threads are managed by the scheduler, which in current kernels is CFS, the completely fair scheduler. Even before CFS, the scheduler did not really distinguish between a process and a thread: both are 'tasks' to the scheduler.
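
As a quick illustration (my own sketch, not part of the cpu-eater repository): every thread of a process shows up as its own task under /proc/<pid>/task, which is the granularity the scheduler works with.

use std::fs;
use std::process;
use std::thread;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // spawn two extra threads that stay alive long enough to be listed
    let handles: Vec<_> = (0..2)
        .map(|_| thread::spawn(|| thread::sleep(Duration::from_millis(200))))
        .collect();

    // the main thread plus the two spawned threads each appear as a task
    for entry in fs::read_dir(format!("/proc/{}/task", process::id()))? {
        println!("task: {}", entry?.file_name().to_string_lossy());
    }

    for handle in handles {
        handle.join().unwrap();
    }
    Ok(())
}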

However, when sched_autogroup_enabled is set to 1 (enabled), which is the default, tasks are first grouped (per session, so all threads of a given process end up in the same group), and the group is considered for scheduling, so that processes with lots of threads do not overwhelm processes with few or no threads.
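
To check this setting on your own system, you can read the sysctl; here is a minimal sketch (my own, not from the article's repositories) that does so:

use std::fs;

fn main() -> std::io::Result<()> {
    // "1" means autogrouping is enabled, "0" means it is disabled
    let value = fs::read_to_string("/proc/sys/kernel/sched_autogroup_enabled")?;
    println!("sched_autogroup_enabled: {}", value.trim());
    Ok(())
}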

Page tables and threads

If threads within a process access (lots of) memory in the address space, they have an inherent advantage over processes, because threads live in the same address space and can therefore use the same page tables. Page tables are the memory allocated for the logical-to-physical memory page administration, and a page table entry must be created for every memory 'page' that is 'resident' (= has been accessed at least once).
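
To get a feel for how much memory page tables take, the VmPTE field in /proc/<pid>/status reports the page table size of a process; a small sketch (my own illustration, not part of the cpu-eater code):

use std::fs;

fn main() -> std::io::Result<()> {
    // VmPTE: size of the page table entries of this process; threads share
    // this, separate processes each carry their own page tables
    let status = fs::read_to_string("/proc/self/status")?;
    for line in status.lines() {
        if line.starts_with("VmPTE") {
            println!("{}", line);
        }
    }
    Ok(())
}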

The question: how do threads behave with different CPU bound loads?

The question I want to answer in this article is how threads behave and perform in a "pure CPU bound" condition. This is artificial, and does NOT correspond to any real life situation(!) The reason is to highlight how scheduling works in this extreme case.

Important: this is about dedicated (operating system) threads, not lightweight/green threads such as Java green threads, C++ futures or Rust futures.

Code explanation

The way I create a "pure CPU bound" condition is with the following (Rust) code:

loop
{
    // do some artificial work: add and subtract one, wrapped in black_box()
    // so the compiler cannot optimise it away
    _x = black_box(black_box(_x) + black_box(1));
    _x = black_box(black_box(_x) - black_box(1));
    loop_counter += 1;
    // every COUNTER_STEP iterations, publish progress in this thread's
    // shared atomic counter
    if loop_counter == COUNTER_STEP
    {
        let _ = counter_vector_clone[nr].fetch_add(1,Ordering::Relaxed);
        loop_counter = 0;
    }
}
  • This highlights that the code is absolutely artificial.
  • The black_box() function prevents the compiler from optimising the work away: adding and subtracting one from a variable that is never used can, and thus will, be eliminated by compiler optimisation (thanks to the Rust users forum for pointing that out).
  • Additionally I count every invocation of _x = _x + 1 and _x = _x - 1 (with the variable loop_counter).
  • If loop_counter equals COUNTER_STEP (1000), I add one to the element in counter_vector_clone, which is a vector of Arc<AtomicU64>.
  • The vector with the Arc and AtomicU64 types allows the vector to be shared by the threads. Every thread adds one to the vector element at its logical thread number, which makes it possible to quantify the amount of work done per thread individually (a sketch of this construct follows after this list).
  • This code does not block on anything.
  • This code does not use any system calls.
  • This program can be found: github.com/FritsHoogland/cpu-eater
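
For illustration, here is a minimal sketch of the shared counter construct described above (a simplified stand-in, not the actual cpu-eater source; variable names are chosen to match the snippet above):

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let nr_threads: usize = 4;
    // one atomic counter per worker thread, shared via Arc
    let counter_vector: Vec<Arc<AtomicU64>> =
        (0..nr_threads).map(|_| Arc::new(AtomicU64::new(0))).collect();

    let mut handles = Vec::new();
    for nr in 0..nr_threads {
        let counter_vector_clone = counter_vector.clone();
        handles.push(thread::spawn(move || {
            // the artificial work loop from the snippet above would run here;
            // for the sketch, just record a single step for this thread
            counter_vector_clone[nr].fetch_add(1, Ordering::Relaxed);
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    for (nr, counter) in counter_vector.iter().enumerate() {
        println!("thread number: {}, steps: {}", nr, counter.load(Ordering::Relaxed));
    }
}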

Additionally I created a bpftrace script, called task_life_histo2.bt, to account for the time spent running on CPU, as well as the time spent pushed off CPU. The task_life_histo2.bt script requires the name of the task as its first argument. Each worker thread that the cpu eater creates to perform the work described above is named cpu-eater-w-<NR>; the main thread is called cpu-eater. By using cpu-eater-w as the argument to the task_life_histo2.bt script, it will only pick up the worker threads of the cpu eater executable.
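
As an aside, here is a minimal sketch (my own illustration, not necessarily the exact cpu-eater code) of how a worker thread can be given such an OS-level name in Rust, which is the name bpftrace matches on:

use std::thread;

fn main() {
    let nr = 0;
    // thread::Builder sets the native (OS) thread name, visible as comm
    let handle = thread::Builder::new()
        .name(format!("cpu-eater-w-{}", nr))
        .spawn(|| {
            println!("worker running as {:?}", thread::current().name());
        })
        .unwrap();
    handle.join().unwrap();
}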

The bpftrace script produces an average and total calculation of on and off cpu times, as well as histograms, giving insight into how the operating system scheduler handled the threads. This bpftrace script can be found in this repository: github.com/FritsHoogland/postgres-bpftrace

Run with a single active thread

When I run the cpu-eater benchmark for 60 seconds with one thread:

cargo run --release -- 1 60

It provides the following output on my machine:

$ cargo r --release 1 60
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/cpu-eater 1 60`
threadid: 9092

thread number: 0, steps: 9780023, steps/ms: 162,
Total steps/ms: 162, total time of test: 60.00225932s

The important things here:

  • The cpu-eater shows a line per thread number and a line with totals.
  • The thread number line shows the total counted steps and the number of steps per millisecond. Steps per millisecond is the important number; the total depends on the runtime.
  • The total steps/ms is the sum of the steps per millisecond over all threads.
  • The time shows/validates the length of the test.

To get the operating system perspective of the worker threads of the cpu-eater, run:

sudo ./task_life_histo2.bt cpu-eater-w

Start it before the cpu-eater is run, and ctrl-c it after the cpu-eater has finished. In my case it shows:

$ sudo ./task_life_histo2.bt cpu-eater-w
Attaching 4 probes...
^C
Average off-cpu time:          7 us, total: 0 ms
Average slice time  :    1538510 us, total: 60001 ms

@off_cpu_histogram_us:
[1]                    1 |@@                                                  |
[2, 4)                 2 |@@@@                                                |
[4, 8)                24 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)               11 |@@@@@@@@@@@@@@@@@@@@@@@                             |

@slice_time_histogram_us:
[512, 1K)              1 |@@@@                                                |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               0 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             1 |@@@@                                                |
[64K, 128K)            3 |@@@@@@@@@@@@                                        |
[128K, 256K)           2 |@@@@@@@@                                            |
[256K, 512K)           5 |@@@@@@@@@@@@@@@@@@@@                                |
[512K, 1M)             8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
[1M, 2M)               6 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[2M, 4M)              13 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

First look at the off-cpu time: the average is 7us and the total rounds down to 0ms, which is the time the cpu eater worker thread spent scheduled off cpu while still in TASK_RUNNING state. The histogram shows the distribution over the different off-cpu times. These are mostly single digit microseconds, because there is no CPU pressure. In other words: although the thread has been off cpu, it spent almost no time off cpu relative to the on-cpu time.

Then look at the average slice time. The average time the cpu-eater was on cpu per slice is 1538510us, or roughly 1.5s! The different time slice lengths can be seen in the histogram. It seems recent kernels can run a task in a slice of up to 4 seconds, if there is enough CPU time available.

It's nice to see that the total slice time is 60,001ms, or 60 seconds, which is the total time we set cpu-eater to run.

Run with 100% more threads than CPUs

Now let's use more threads than there are CPUs visible to Linux. This system has two CPUs, so let's run the cpu-eater with 4 threads:

$ cargo r --release 4 60
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/cpu-eater 4 60`
threadid: 9102
threadid: 9103
threadid: 9104
threadid: 9105

thread number: 0, steps: 4805276, steps/ms: 80,
thread number: 1, steps: 4818986, steps/ms: 80,
thread number: 2, steps: 4814345, steps/ms: 80,
thread number: 3, steps: 4814556, steps/ms: 80,
Total steps/ms: 320, total time of test: 60.000490737s

And have a look at the task_life_histo2.bt output:

$ sudo ./task_life_histo2.bt cpu-eater-w
Attaching 4 probes...
^C
Average off-cpu time:       5516 us, total: 119501 ms
Average slice time  :       5517 us, total: 119595 ms

@off_cpu_histogram_us:
[1]                 3680 |@@@@@@@@@@@@@@@@@@@@                                |
[2, 4)              2360 |@@@@@@@@@@@@                                        |
[4, 8)              2659 |@@@@@@@@@@@@@@                                      |
[8, 16)              494 |@@                                                  |
[16, 32)             209 |@                                                   |
[32, 64)              18 |                                                    |
[64, 128)             71 |                                                    |
[128, 256)             6 |                                                    |
[256, 512)           128 |                                                    |
[512, 1K)            249 |@                                                   |
[1K, 2K)             211 |@                                                   |
[2K, 4K)             112 |                                                    |
[4K, 8K)            1214 |@@@@@@                                              |
[8K, 16K)           9530 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)           751 |@@@@                                                |

@slice_time_histogram_us:
[8, 16)               15 |                                                    |
[16, 32)               7 |                                                    |
[32, 64)              52 |                                                    |
[64, 128)            228 |@                                                   |
[128, 256)           135 |                                                    |
[256, 512)          3523 |@@@@@@@@@@@@@@@@@@@@@@                              |
[512, 1K)           3708 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[1K, 2K)             571 |@@@                                                 |
[2K, 4K)             823 |@@@@@                                               |
[4K, 8K)            4363 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[8K, 16K)           8263 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)             8 |                                                    |

The most important data is in the averages.

  • The average (on cpu) slice time has gone down to 5517us. This is because the scheduler now has to let the active cpu eater workers take turns, and has to shorten the slice time to facilitate reasonable progress for each thread.
  • The total time in the on-cpu slices is 119595ms, or 119.5 seconds. This is consistent with 2 cpus and a runtime of 60 seconds (see the back-of-the-envelope sketch after this list).
  • The total off cpu time is 119501ms, or 119.5 seconds. This is consistent with 2 threads out of 4 having to wait for their turn on CPU at any time.
  • The histogram of off cpu times shows that a large portion has spent between 8 and 16ms waiting to get back on CPU. This is consistent with the large portion of the on cpu slices also being between 8 and 16ms.
  • The conclusion is that this run is very strictly CPU bound.
  • In this run, lowering the number of threads down to the number of CPUs will produce roughly the same throughput (total steps per millisecond).
  • Also in this run, increasing the number of threads will produce the same total throughput, until the number of threads gets so large that scheduling becomes inefficient, which will lower the throughput.
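
The totals above can be verified with a simple back-of-the-envelope calculation; this small sketch (my own, purely illustrative) shows the expected numbers for a purely CPU bound run:

fn main() {
    // assumptions: purely CPU bound work, fair scheduling, no other load
    let cpus = 2.0_f64;
    let threads = 4.0_f64;
    let runtime_s = 60.0_f64;

    // 2 CPUs running flat out for 60 s: 120 s of on-cpu time in total,
    // matching the measured ~119.6 s
    let total_on_cpu_s = cpus * runtime_s;
    // at any moment 2 of the 4 threads are waiting: 120 s of off-cpu time,
    // matching the measured ~119.5 s
    let total_off_cpu_s = (threads - cpus) * runtime_s;
    // every thread gets half a CPU, so per-thread throughput is roughly half
    // of the single-thread run (80 versus 162 steps/ms)
    let per_thread_cpu_share = cpus / threads;

    println!("expected total on-cpu : {} s", total_on_cpu_s);
    println!("expected total off-cpu: {} s", total_off_cpu_s);
    println!("expected per-thread cpu share: {}", per_thread_cpu_share);
}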

Conclusions

This means each thread (quite obviously) must be running on a cpu to perform work, just like a process, and is limited by the number of cpus and the speed at which each of the cpus can process the work.

Does this mean threads are not useful? No: there are many situations where threads are very useful, if only to prevent lots of processes from each building up their own page tables. Also, it's much easier to communicate between threads than between processes (at least in Rust; I assume this is consistent across programming languages).

Does this mean you should only ever use a number of threads limited by the number of CPUs? No: I have specifically created a situation where the thread processing is unlikely to block, and the processing threads never have a reason to voluntarily go off CPU, and thus always go off CPU involuntarily. In fact, if you measure the number of voluntary and involuntary context switches of the threads, you will find that there are no voluntary context switches, only involuntary ones, and that their number is equal to the number of off cpu occasions.
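
If you want to verify this yourself, the per-task context switch counters are exposed in /proc; a minimal sketch (a hypothetical helper, not part of the article's repositories) to print them for a given pid:

use std::env;
use std::fs;

fn main() -> std::io::Result<()> {
    // first argument: a pid, or a tid via /proc/<pid>/task/<tid>/status
    let pid = env::args().nth(1).expect("usage: ctxt <pid>");
    let status = fs::read_to_string(format!("/proc/{}/status", pid))?;
    for line in status.lines() {
        // the kernel maintains these two counters per task
        if line.starts_with("voluntary_ctxt_switches")
            || line.starts_with("nonvoluntary_ctxt_switches")
        {
            println!("{}", line);
        }
    }
    Ok(())
}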

Many, many real life operations require thread or process execution to wait/block on something, which will make the scheduler put the thread or process off-cpu. In such situations using multiple threads makes it possible to avoid performing high latency blocking operations, such as disk IO, serially. Because in blocking situations a thread will always be scheduled off-cpu, the total number of threads used can be higher than the number of CPUs, provided a consistent number of threads is in a blocking state (off cpu), so that a far lower number of threads is actually running on-cpu.

Ultimately, the number of CPUs limits the processing capability. Linux and most multi-user operating systems allow you to create as many (active) processes or threads as you like. If the number of active processes or threads gets higher than the number of CPUs, the scheduler will balance the on-cpu times by letting processes or threads take turns, decreasing the amount of work per process or thread by reducing the on-cpu time and increasing the waiting time before getting back on-cpu.

New processes or tasks will sync their vruntime (the unit the scheduler uses for prioritising tasks at the same Linux priority level) and then run, meaning the delay or waiting time that the activity of running processes and tasks causes for already running tasks equally applies to new processes or tasks. (I created task_life_histo3.bt, which includes the wake up time.)

If the number of active processes or threads is vastly higher than the number of CPUs, scheduling inefficiency will likely be introduced on top of the scheduling delay, lowering the actual per process or thread work even further.

If you wonder how YOUR program behaves after reading this article: the task_life_histo2.bt script can be used to measure on cpu and off cpu times of a process or thread by name, so you can measure yours.