Linux scheduling and using strace on Linux

I have been writing bpftrace scripts to look at how tasks run on Linux in detail. One of these scripts takes the task states of a given process and shows them as they change. When I ran the Linux strace (system call trace) executable against a process whose task states I was watching, I realised this gives a wonderful insight into what strace actually does to a task, and why it should be used with caution.

Tasks

The first point to note is the use of 'task' versus 'process'. Linux actually runs 'tasks', which are either processes, or threads when a process uses threading. The general name Linux uses for both is 'task'.

Linux scheduling

The way task management, alias scheduling, works is that once a task needs to run, it is given a time slot on a given 'CPU' by the Linux scheduler. A 'CPU' is what Linux perceives as a CPU: this can be an execution context that uses a CPU core exclusively, or an execution context that runs in a CPU thread (with SMT, simultaneous multithreading, such as Intel Hyper-Threading; the SPARC, POWER and MIPS architectures have also implemented it). Most ARM64/AARCH64 processors do not use SMT.

The essence is that without having been given a CPU time slot, a task cannot perform any work.

When using the PostgreSQL database, many backend processes can be waiting for input from their database client. This is implemented using the epoll_pwait system call, which takes the backend off CPU indefinitely, until data arrives on the socket that was created when the database client connected to the database. So if a backend is not active, it will not be on CPU. This is how a PostgreSQL database can serve huge numbers of inactive clients without a huge effect on CPU. The principle of having (lots of) processes or threads of which most are sleeping is very common for databases, and for many other server processes.
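You can see this waiting behaviour for yourself by tracing the epoll_pwait system call of an idle backend. This is a minimal sketch, assuming syscall tracepoints are available and using the backend PID 21241 from later in this post:

sudo bpftrace -e '
tracepoint:syscalls:sys_enter_epoll_pwait /pid == 21241/
{ printf("%d enters epoll_pwait: off cpu, sleeping\n", pid); }
tracepoint:syscalls:sys_exit_epoll_pwait /pid == 21241/
{ printf("%d returns from epoll_pwait: data arrived on the socket\n", pid); }'

An idle backend prints nothing at all: it stays inside epoll_pwait, off CPU, until the client sends something.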

But once the database backend receives a request via the socket, you might assume it starts running immediately. That is not actually what happens. What actually happens is that when data arrives, the kernel handles the interrupt and satisfies the waiting epoll_pwait call, which causes the kernel to set the task state to 'R' immediately (hence 'interrupt'). After that, it is up to the scheduler to assign the task a time slice or 'quantum'.

Needless to say, if a lot of tasks are set to 'R', they cannot all be assigned a time slice at the same time. The number of concurrently available time slices is limited to the number of CPUs visible to the operating system.

The reason this is important is that there will be a delay between getting the state 'R' and actually running in a time slice, and this delay will increase if there are a lot of tasks set to 'R'.
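A quick way to get a feel for whether there are more runnable tasks than CPUs is to compare the 'r' (runnable) column of vmstat with the number of CPUs:

$ nproc
$ vmstat 1

If the 'r' column is consistently higher than the number reported by nproc, tasks in state 'R' are waiting for a time slice.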

A common description of how the scheduler deals with this is the following, which is incorrect in Linux:

The scheduler assigns a numeric priority value to each task, lowering the priority value when a task gets to run in a time slice, and increasing it when a task cannot be assigned a time slice, so that every task will eventually be assigned a slice because its priority is increased while it is not running and decreased while it is running. (This is also what the nice command uses: it changes the base priority to make the scheduler favour or disfavour a task.)

In reality, the priority is a static value based on the scheduling class (SCHED_OTHER for standard processes) and the nice value. The default priority for user processes without a nice value set is 120, and it does not change. This means all regular user processes have the same priority. This is the base priority of a task.

The way tasks are switched and scheduled by the scheduler (CFS, the completely fair scheduler) uses the concept of vruntime (virtual runtime), which can be seen in /proc/PID/sched. The amount of vruntime determines which task is selected, which is further tuned by (kernel) scheduler parameters. The scheduling class influences the vruntime in order to enforce the priority of the scheduling classes together with runtime scheduling needs.
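These values can simply be read from /proc/PID/sched. For the backend used later in this post (PID 21241), the relevant fields can be picked out like this:

$ grep -E 'policy|prio|vruntime' /proc/21241/sched

This typically shows the policy field (0 is SCHED_OTHER), the prio field (120 for a regular task without a nice value set) and se.vruntime, the virtual runtime the scheduler uses to select the next task.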

Thread priorities: https://blogs.oracle.com/linux/post/task-priority

Kernel documentation: https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt

CFS: https://en.wikipedia.org/wiki/Completely_Fair_Scheduler

Please mind that the Linux scheduler has seen changes over time. CFS does not have a notion of processor ticks and fixed, static time slots (which is what classic processor scheduling consists of). Instead, it runs based on virtual runtime, and divides runtime proportionally among the runnable tasks. Also mind that it is easy to run into outdated documentation on the internet; for example, some of the scheduler settings have moved from /proc/sys/kernel to /sys/kernel/debug/sched.

The Linux scheduler is a logical layer of the kernel code, and as such executes its scheduling in the context of every task, because the scheduling needs to be performed when tasks are switched. There is no separate kernel thread executing the scheduling.
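One way to see this for yourself is to count sched_switch events per current task name: the event is accounted to the task that is giving up the CPU, because that is the context in which the scheduler code executes. A minimal sketch:

sudo bpftrace -e 'tracepoint:sched:sched_switch { @switches[comm] = count(); }'

After interrupting it with ctrl-C, the @switches map shows the number of context switches per task name.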

See linux scheduling in action

In order to understand the scheduling, I created a small bpftrace script to see the fundamental actions of waking (getting set to run), getting on CPU and getting off CPU. This script is called task_life.bt.

There are multiple ways to get this information, for example using perf sched, but I do believe it is beneficial for your understanding to create your own scripts, to understand how things work and to see some of the Linux kernel code that deals with it, which is where the tracepoints used by bpftrace are located.
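The task_life.bt script itself is not listed in this post, but the sketch below shows what such a script could look like, built on the sched_waking and sched_switch tracepoints. Please mind this is an approximation, not necessarily the exact script that produced the output below: the column formatting is an assumption, and decoding the numeric prev_state into names such as TASK_INTERRUPTIBLE is left out for brevity.

#!/usr/bin/env bpftrace
/*
 * Sketch of a task_life.bt-style script.
 * Usage: sudo ./task_life.bt <PID>
 * Output: [PID] S:<usec since previous event>:<usec on cpu>:<event>
 */
BEGIN
{
    @prev = nsecs;
    @oncpu = nsecs;
}

/* the task is woken: its state is set to runnable ('R') */
tracepoint:sched:sched_waking
/args->pid == $1/
{
    printf("[%d] S:%10llu:%10d:waking\n", args->pid, (nsecs - @prev) / 1000, 0);
    @prev = nsecs;
}

/* the scheduler switches the task onto a CPU: it starts running */
tracepoint:sched:sched_switch
/args->next_pid == $1/
{
    printf("[%d] S:%10llu:%10d:on cpu\n", args->next_pid, (nsecs - @prev) / 1000, 0);
    @prev = nsecs;
    @oncpu = nsecs;
}

/* the scheduler switches the task off the CPU: prev_state holds the state it leaves in */
tracepoint:sched:sched_switch
/args->prev_pid == $1/
{
    printf("[%d] S:%10d:%10llu:off cpu:state %d\n", args->prev_pid, 0, (nsecs - @oncpu) / 1000, args->prev_state);
    @prev = nsecs;
}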

Now let's see what this looks like. I have a PostgreSQL psql session connected to a database, and I attach to the PostgreSQL backend with the task_life.bt script.

To obtain the PID of the backend, execute:

select pg_backend_pid();

And then run the task_life.bt bpftrace script:

sudo ./task_life.bt 21241
Attaching 4 probes...

It will say it has attached 4 probes. Now let's see what the scheduling of process 21241 looks like. Go to the psql session and execute something extremely simple, such as select now();. This is what the task_life.bt output looks like. Or I should say: is likely to look like: the operating system/kernel is in charge of the scheduling, and could make slightly different decisions, such as moving the TASK_RUNNING task off CPU in favour of another task, and then putting it back on CPU later on:

[21241] S:2707555597:         0:waking
[21241] S:       132:         0:on cpu
[21241] S:         0:      4271:off cpu:TASK_INTERRUPTIBLE
  • The number in between the square brackets is the PID.

  • I use the letter 'S' as an indicator these are scheduling activities.

  • The second number, which is large with 'waking', is the time since the script started or since the previous scheduling event, until this scheduling event was triggered.

  • The third number, which is zero with 'waking' is the time spent in the scheduling event.

What this overview shows is that PID 21241 got woken after 2707555597 microseconds. As described earlier, all this does is set the task state to running; the task is not actually active or running yet. After setting the task state, the scheduler needs to assign the task a time slice.

After 132 microseconds, the task gets assigned a time slice and can start performing work. No time has been spent on anything yet: the waking and on cpu lines are just markers, which is why the third number is zero for these events.

After 4271 microseconds, the task is done and goes off CPU, and reverts back to the state TASK_INTERRUPTIBLE, which is the state that indicates it is sleeping.

Invoking the strace command

Okay. So what we have seen so far is that in simple cases a task will go through the stages of waking, on cpu and off cpu. Now let's keep the bpftrace script attached, and use another terminal to attach to the backend with the Linux strace command. The strace command output will show:

$ strace -p 21241
strace: Process 21241 attached
epoll_pwait(5,

And the bpftrace script task_life.bt will show:

[21241] S:  19964813:         0:waking
[21241] S:       205:         0:on cpu
[21241] S:         0:       325:off cpu:TASK_TRACED
[21241] S:        92:         0:waking
[21241] S:       167:         0:on cpu
[21241] S:         0:       334:off cpu:TASK_TRACED
[21241] S:       422:         0:waking
[21241] S:       394:         0:on cpu
[21241] S:         0:       282:off cpu:TASK_INTERRUPTIBLE

The strace command woke the process and put it in the TASK_TRACED state, which makes it sleep again; then it woke the process up again and put it in the TASK_TRACED state a second time; and then it woke the process up once more, after which it went back to sleep in TASK_INTERRUPTIBLE.
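The traced stops can also be seen from outside the bpftrace script: if you run ps repeatedly while the traced backend is doing work, you may catch the stop, in which case the state column shows a lowercase 't' ('tracing stop'), where an untraced sleeping backend shows 'S':

$ ps -o pid,stat,wchan -p 21241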

I knew strace is intrusive for a running process, but it turns out it fundamentally interferes with a task's execution, which means that if a system is very busy and showing issues, it might not be a good idea to "blindly" run strace against a problem process.

Now let's run the simple select now() again, and see what the output of task_life.bt looks like:

[21241] S: 407474752:         0:waking
[21241] S:       228:         0:on cpu
[21241] S:         0:       475:off cpu:TASK_TRACED
[21241] S:        60:         0:on cpu
[21241] S:         0:       229:off cpu:TASK_TRACED
[21241] S:       723:         0:waking
[21241] S:       388:         0:on cpu
[21241] S:         0:        64:off cpu:TASK_TRACED
[21241] S:       233:         0:waking
[21241] S:      1098:         0:waking
[21241] S:      1376:         0:on cpu
[21241] S:         0:       596:off cpu:TASK_TRACED
[21241] S:       742:         0:waking
[21241] S:       465:         0:on cpu
[21241] S:         0:       764:off cpu:TASK_TRACED
[21241] S:       141:         0:waking
[21241] S:        92:         0:on cpu
[21241] S:         0:        96:off cpu:TASK_TRACED
[21241] S:      1060:         0:waking
[21241] S:       714:         0:on cpu
[21241] S:         0:       470:off cpu:TASK_TRACED
[21241] S:       452:         0:waking
[21241] S:       978:         0:on cpu
[21241] S:         0:       505:off cpu:TASK_TRACED
[21241] S:       235:         0:waking
[21241] S:      1147:         0:on cpu
[21241] S:         0:       394:off cpu:TASK_INTERRUPTIBLE

Wow... when executing something with strace attached, the task is put in the TASK_TRACED state many times, effectively halting execution over and over. This fundamentally changes the execution, although it should be noted that outside of the TASK_TRACED state it will of course execute exactly the same executable code as without strace.

This is the way a debugger interacts with the execution of a process, so this is not fundamentally bad: it is an official way to interact with a process. What I am trying to say is that the performance impact of running strace might be (very) high, and if there are corruptions in the process, this additional work that the task is doing might have an effect on them.

perf trace

Is there an alternative? Yes. The Linux perf utility has the trace subcommand, which shows all the system calls that are executed, just as strace does by default, but without changing the execution of the task: the output of task_life.bt is identical with or without perf trace running against the task PID.
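Running it is very similar to running strace; for the backend in this post it would be:

$ sudo perf trace -p 21241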

Does this mean perf and bpftrace do not influence execution at all? Sadly, but actually quite obviously: no. In order to record and administer the information that perf and bpftrace request, extra work must be done, so it is possible that execution latencies change when using these tools. However, their influence is fundamentally lower than that of strace, which uses ptrace to halt the process.

Conclusion

strace uses the ptrace facility, which halts the task for short amounts of time. Today, on most Linux systems the perf utility and its facilities should be available, which provides functionality equal to strace (please mind I am not saying 'identical'). Therefore perf trace should be the first choice, not strace.

If you're wondering why strace causes so many TASK_TRACED states: strace stops the task twice for every system call: once when the system call is entered, and once when it returns.

I have not looked into the implementation of the strace-equivalent tools on other operating systems, such as tusc and truss.