About CPU usage and performance on Linux
Lots of people investigate CPU performance in the way the traditional Unix tools, such as sar, present it. That means that CPU usage for the different modes is shown as a percentage:
12:26:38 CPU %user %nice %system %iowait %steal %idle
12:26:39 all 0.00 0.00 0.00 0.00 0.00 100.00
12:26:40 all 0.00 0.00 0.00 0.00 0.00 100.00
12:26:41 all 0.00 0.00 0.00 0.00 0.00 100.00
This is the output of the default invocation of sar, which is what sar -u will provide. A more comprehensive version of these CPU statistics can be gathered with sar -u ALL:
12:50:29 CPU %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle
12:50:30 all 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:50:31 all 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:50:32 all 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
In a lot of cases, more modern tools simply present the same data in a more modern way, such as a graph that shows the usage over time, which does make it much easier to spot changes in CPU usage. Most likely they keep presenting these percentages because many people are used to this data.
Averages
All CPU figures for CPU modes (user, nice, etc.) are counters on Linux. The common way to turn counters into the statistic that they represent is to read the counter value, wait a certain amount of time, and then read the counter again. The amount of usage is the second counter value minus the first, and that amount is commonly divided by the amount of time between the two reads.
This means that the figure is an average of the usage for that period, not an exact number, unless the counter increased only once during the interval. Averages hide peaks.
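To make the counter arithmetic concrete, here is a minimal sketch (Python used purely as an illustration, not anything sar itself runs) that reads the aggregate CPU counters from /proc/stat twice and derives exactly such an average:

```python
#!/usr/bin/env python3
# Minimal sketch: derive an *average* CPU usage figure from the /proc/stat
# counters, the same way sar and friends do. Assumes Linux and the usual
# "cpu user nice system idle iowait irq softirq steal guest guest_nice"
# layout of the aggregate line; values are in USER_HZ ticks.
import os
import time

TICK = os.sysconf('SC_CLK_TCK')          # ticks per second, typically 100
FIELDS = ('user', 'nice', 'system', 'idle', 'iowait',
          'irq', 'softirq', 'steal', 'guest', 'guest_nice')

def read_cpu_counters():
    with open('/proc/stat') as f:
        parts = f.readline().split()      # the aggregate "cpu" line
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))

INTERVAL = 5.0
first = read_cpu_counters()
time.sleep(INTERVAL)                      # the measurement interval
second = read_cpu_counters()

for mode in FIELDS:
    delta_seconds = (second[mode] - first[mode]) / TICK
    # delta_seconds / INTERVAL is an average over the interval:
    # a 100% burst lasting 1 second looks identical to 20% for 5 seconds.
    print(f'{mode:>10}: {delta_seconds / INTERVAL:6.2f} CPU-seconds per second')
```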
Percentages for CPU time spent
An issue that I see is the use of percentages. Percentages obfuscate the actual amount of CPU used, by representing it as a percentage rather than as the actual amount of CPU time spent in a CPU mode.
In some cases this is fine, and it doesn't matter whether the usage is shown as time or as a percentage. An example is validating whether the CPU usage increases at a certain point in time; to see that, it doesn't matter whether it's represented as time or as a percentage.
However, there are some cases where this does matter.
One case where this is important is when virtualisation and oversubscription can play tricks with the time available to the virtual machine or container. The CPU mode %steal is supposed to show this, but only does so on Xen and Xen-derived virtualisation platforms; others generally do not implement %steal and thus might play tricks without it being noticed. With percentages, 100% is the sum of the time for all CPU modes, even if an amount of CPU time is silently stolen for any reason, and thus 100% represents a different amount of CPU time at different times. I am not sure how common this is, and with percentages you can't know, because no matter what trick is played, whatever is measured is always scaled to 100%.
Another case is that percentages make it hard to understand how much work is actually going on. On a single vCPU system, using 1 vCPU for a second shows as 100% activity, while the same use of 1 vCPU on an 8 vCPU system shows as 12.5%, and so on.
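To make that difference concrete, here is a small hypothetical helper that converts an "all CPUs" percentage back into CPU-seconds, which is only possible if you also know the vCPU count:

```python
# Hypothetical helper: convert an "all CPUs" percentage back into CPU time.
# The point: the same percentage means a different amount of CPU-seconds
# depending on the vCPU count, so the percentage alone tells you little.
def pct_to_cpu_seconds(pct_all_cpus: float, vcpus: int, interval: float = 1.0) -> float:
    """CPU-seconds consumed during `interval`, given a percentage over all vCPUs."""
    return pct_all_cpus / 100.0 * vcpus * interval

print(pct_to_cpu_seconds(100.0, vcpus=1))   # 1.0 -> one fully busy vCPU
print(pct_to_cpu_seconds(12.5, vcpus=8))    # 1.0 -> the *same* one busy vCPU
```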
Machine load
However, both CPU percentages and CPU time share the inherent problem that they can only express CPU pressure through the amount of idle time: a high amount of idle time means low CPU pressure and a low amount of idle time means higher CPU pressure. The underlying problem is that CPU pressure can only be indicated up to the point where no idle time is left.
Even when there is no idle time left, it is commonly still possible to make more processes or threads active, increasing the CPU pressure, but the CPU mode statistics can only indicate work and pressure up to 100%.
Purely from the CPU's perspective, 100%, or the number of (v)CPUs times 1 second per second, is the maximum that can be indicated. However, at the operating system level there might be many tasks waiting, which means more CPU pressure or load than can be seen from the perspective of the CPU.
The load statistics
Ah! But there are the load figures! You know, the load 1, 5 and 15 figures? Why am I talking about this 'problem' if the very thing that I describe as a problem already exists as a statistic?
There are several reasons for that. On Linux, the load figure is not the actual CPU load (it also counts tasks in uninterruptible sleep, such as tasks waiting on disk IO). But outside of that, and most importantly, the load figures are not representations of the actual current load, even if an accurate current load figure were used, because on Linux, and most other systems, the load figures are so-called 'exponentially-damped moving sums' of a five-second sampled average.
To put it in very simple terms: each load figure (1/5/15) is calculated from its previous value and moves towards the currently derived load, and thus almost certainly does not represent your actual current load, unless the system load is so consistent that the figure has had the time to move to the actual load. The 1/5/15 numbers indicate how fast each figure moves towards the currently derived load.
But you wouldn't need the load figure if the load were that consistent, because then you would likely already understand it and would not need it to be shown and monitored, right?
A lot of the information here is taken from: Brendan Gregg: Linux Load Averages: Solving the Mystery
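Roughly, the calculation behaves like the sketch below. This is a floating-point simplification of the kernel's fixed-point arithmetic, assuming the five-second sampling interval mentioned above:

```python
# Sketch of how an exponentially-damped moving average behaves; a simplification
# of the kernel's fixed-point load calculation, assuming samples every 5 seconds.
from math import exp

SAMPLE = 5                                   # seconds between samples
DECAY = {m: exp(-SAMPLE / (m * 60)) for m in (1, 5, 15)}

def update(load: dict, active_tasks: int) -> dict:
    """Move each load figure a little bit towards the currently measured load."""
    return {m: load[m] * d + active_tasks * (1 - d) for m, d in DECAY.items()}

# Start idle, then keep 3 tasks runnable: the figures only *approach* 3.
load = {1: 0.0, 5: 0.0, 15: 0.0}
for _ in range(12):                          # one minute of samples
    load = update(load, active_tasks=3)
print({m: round(v, 2) for m, v in load.items()})
# After a full minute the 1-minute figure is still only ~1.9, not 3.
```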
Scheduler statistics
A solution: scheduler statistics. Linux keeps statistics that express the actual current CPU runtime load in a way that allows an administrator to understand the pressure. This is done using the Linux task scheduler statistics.
The Linux kernel keeps statistics at the task scheduler level that express the amount of time spent running in a scheduler time slot or 'quantum', called node_schedstat_running_seconds_total, and the amount of time that a task spent being runnable but not yet scheduled, called node_schedstat_waiting_seconds_total.
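These counters are exposed by node_exporter's schedstat collector, which reads them from /proc/schedstat. A minimal sketch of reading the same fields directly, assuming schedstat version 15 or later (where the values are in nanoseconds):

```python
# Minimal sketch of where these counters come from: /proc/schedstat, which is
# what node_exporter's schedstat collector reads. Assuming schedstat version 15
# or later, the 7th and 8th value on each "cpuN" line are the cumulative running
# and waiting time in nanoseconds, counted since boot.
def read_schedstat():
    running_s, waiting_s = 0.0, 0.0
    with open('/proc/schedstat') as f:
        for line in f:
            parts = line.split()
            if parts and parts[0].startswith('cpu'):
                running_s += int(parts[7]) / 1e9   # node_schedstat_running_seconds_total
                waiting_s += int(parts[8]) / 1e9   # node_schedstat_waiting_seconds_total
    return running_s, waiting_s

# Cumulative since boot: take two samples and subtract, as with the /proc/stat
# counters earlier, to get running/waiting seconds per second.
print(read_schedstat())
```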
Schedstat running
The statistic node_schedstat_running_seconds_total is the time spent by tasks running in a CPU quantum. This is not a very exotic statistic, because it should be equal to the sum of all the running states of the CPU modes. Generally, that seems to be true, but sometimes the schedstat running time can be seen to differ from the sum of the CPU running states, probably because the CPU statistics and the scheduler statistics are measured at different layers in the kernel.
Schedstat waiting
The statistic node_schedstat_waiting_seconds_total is the time that a task spent between being switched to task state R (runnable) and getting scheduled in a quantum. This is largely equal to what has been called the 'run queue' on a lot of operating systems. It's important to note that this has nothing to do with application/executable waits such as waiting on futexes or IO waits. The schedstat waiting time is purely the time between the operating system intending a task (which is a process or a thread) to be running and the task actually getting to run in a quantum.
This means that, generally, idle time indicates that there is CPU time available. Once the idle time is gone, there are more tasks willing to run at a given point in time than there are CPU time slots, which means that tasks have to queue and wait before they can run in a quantum.
In systems with many active threads and all sorts of dependencies on network sockets, IO and file descriptors, I have found that the scheduler can get inefficient and start thrashing, resulting in idle time/unoccupied time slots that can easily be 5% to 10%, giving the impression that there is still CPU headroom available; but that is a different topic.
The important part is that the schedstat waiting time is a good indicator of the current actual CPU load.
Making CPU pressure more concrete
For measuring the figures I am using dsar, a utility that can show Linux statistics in the format that sar outputs them, as well as other formats, some of which are custom. Another property of dsar is that it uses the node_exporter http (Prometheus) endpoint to obtain the machine statistics, so you can get sar-like output from a remote machine, and it can show the statistics of multiple machines (by entering a comma-separated list of hostnames or IP addresses).
Measuring no CPU activity with sar -u
dsar -H localhost -P 9100 -o sar-u
hostname time CPU %usr %nice %sys %iowait %steal %idle
localhost:9100:metrics 15:41:25 all 1.03 0.00 0.52 0.00 0.00 98.45
localhost:9100:metrics 15:41:26 all 0.50 0.00 1.49 0.00 0.00 97.51
localhost:9100:metrics 15:41:27 all 0.00 0.00 1.51 0.00 0.00 98.49
localhost:9100:metrics 15:41:28 all 0.50 0.00 1.01 0.00 0.00 98.49
No CPU pressure shows a low CPU percentage for user mode (%usr) and system mode (%sys), and a high percentage for idle (%idle).
Measuring some CPU activity with sar -u
dsar -H localhost -P 9100 -o sar-u
hostname time CPU %usr %nice %sys %iowait %steal %idle
localhost:9100:metrics 15:43:32 all 48.48 0.00 1.52 0.00 0.00 50.00
localhost:9100:metrics 15:43:33 all 50.50 0.00 0.50 0.00 0.00 49.00
localhost:9100:metrics 15:43:34 all 50.25 0.00 0.00 0.00 0.00 49.75
localhost:9100:metrics 15:43:35 all 49.50 0.00 1.00 0.00 0.00 49.50
Here we see sar indicating 'activity': approximately 50% of user mode time. But how much activity is approximately 50% of user mode time?
Measuring CPU activity with timed CPU modes (custom cpu-all mode)
dsar -H localhost -P 9100 -o cpu-all
hostname time usr nice sys iowait steal irq soft guest gnice idle sch_run sch_wait
localhost:9100:metrics 19:12:44 1.00 0.00 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.99 0.87 0.02
localhost:9100:metrics 19:12:45 1.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.99 1.06 0.01
localhost:9100:metrics 19:12:46 0.98 0.00 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.99 1.06 0.01
localhost:9100:metrics 19:12:47 0.99 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.99 1.06 0.00
The cpu-all mode statistics are mostly identical to sar -u ALL, but showing time per second is specific to dsar cpu-all. This shows that the roughly 50% from sar -u is in fact around 1 second per second of user time, so 1 vCPU occupied in user mode. If you scroll sideways to the idle column, you see that there is 0.99 seconds per second of idle time 'left'. So there are two vCPUs in this system.
But the cpu-all option also includes the sch_run (scheduler runtime) and sch_wait (scheduler waiting time) statistics. As you can see, the scheduler runtime is close to 1 second per second too, and there is a tiny bit of scheduler waiting time. In Linux, a task gets set to state R, and then the scheduler independently needs to pick it up and schedule it; these are two independent actions that must both happen, so there will always be some time spent waiting.
The work that I introduced on this server is yes > /dev/null &, which creates a process that runs completely in user mode and can take 1 second of user mode time per second, which equals occupying a single vCPU. That is the maximum a single, non-threaded process can consume, because it can only run on a single vCPU at any given time.
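The same run/wait bookkeeping is also kept per task: /proc/<pid>/schedstat holds the cumulative on-CPU time, runqueue wait time and timeslice count for a single process or thread, so an individual 'yes' hog can be inspected directly. A sketch, assuming the kernel exposes per-task schedstats:

```python
# Sketch: the per-task view of the same bookkeeping. /proc/<pid>/schedstat
# holds three cumulative values for one task: nanoseconds spent on-CPU,
# nanoseconds spent runnable but waiting on a runqueue, and the number of
# timeslices it ran. Assumes the kernel exposes per-task schedstats.
import sys

def task_schedstat(pid: int):
    with open(f'/proc/{pid}/schedstat') as f:
        on_cpu_ns, wait_ns, timeslices = (int(x) for x in f.read().split())
    return on_cpu_ns / 1e9, wait_ns / 1e9, timeslices

run_s, wait_s, slices = task_schedstat(int(sys.argv[1]))
print(f'on-CPU: {run_s:.2f}s  runqueue wait: {wait_s:.2f}s  timeslices: {slices}')
```

Run it with the PID of one of the 'yes' processes as its argument.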
Measuring the activity with the load figures
To see how the load figures respond to this, it's very informative to look at them while a load is introduced for which you know what it should look like.
Because of how the load figures work, the load figures should be investigated right after the CPU load is introduced to see that they do not reflect the actual current load: since the load figures are made to move towards the measured runtime load, they will, for a consistent load, show the actual load after some time.
dsar -H localhost -P 9100 -o sar-q
hostname time runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
localhost:9100:metrics 11:20:17 0 0 0.05 0.01 0.00 0
localhost:9100:metrics 11:20:18 0 0 0.05 0.01 0.00 0
localhost:9100:metrics 11:20:19 0 0 0.12 0.03 0.01 0
localhost:9100:metrics 11:20:20 0 0 0.12 0.03 0.01 0
localhost:9100:metrics 11:20:21 0 0 0.12 0.03 0.01 0
hostname time runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
localhost:9100:metrics 11:20:22 0 0 0.12 0.03 0.01 0
localhost:9100:metrics 11:20:23 0 0 0.12 0.03 0.01 0
localhost:9100:metrics 11:20:24 0 0 0.19 0.05 0.01 0
localhost:9100:metrics 11:20:25 0 0 0.19 0.05 0.01 0
localhost:9100:metrics 11:20:26 0 0 0.19 0.05 0.01 0
The ldavg-1/5/15 values illustrate the load figures issue very well. The above measurements were taken right after the 'yes' CPU hog was started on the system. The ldavg-1 figure is the first statistic to respond, but as you can see, it simply moves towards the actual figure instead of showing the actual, current state.
I hope this makes it clear that if the load is varying, the load figure will never actually show the current (CPU) load state.
Measuring more CPU activity with sar -u
If more CPU activity is introduced, meaning another 'yes' CPU hog is started, it's obvious that the user mode time will increase. With sar -u on a 2 vCPU machine, this means it will get close to 100%:
(2 'yes' CPU hogs)
dsar -H localhost -P 9100 -o sar-u
hostname time CPU %usr %nice %sys %iowait %steal %idle
localhost:9100:metrics 11:38:10 all 100.00 0.00 0.00 0.00 0.00 0.00
localhost:9100:metrics 11:38:11 all 99.50 0.00 0.50 0.00 0.00 0.00
localhost:9100:metrics 11:38:12 all 99.50 0.00 0.50 0.00 0.00 0.00
localhost:9100:metrics 11:38:13 all 99.00 0.00 1.00 0.00 0.00 0.00
Now comes the main topic of this article: if yet another 'yes' CPU hog is started, which means 3 processes running on a 2 vCPU system, sar -u will show pretty much the same statistics as in the 2 'yes' CPU hogs situation:
(3 'yes' CPU hogs)
dsar -H localhost -P 9100 -o sar-u
hostname time CPU %usr %nice %sys %iowait %steal %idle
localhost:9100:metrics 11:39:47 all 99.00 0.00 1.00 0.00 0.00 0.00
localhost:9100:metrics 11:39:48 all 99.50 0.00 0.50 0.00 0.00 0.00
localhost:9100:metrics 11:39:49 all 99.00 0.00 1.00 0.00 0.00 0.00
localhost:9100:metrics 11:39:50 all 99.00 0.00 1.00 0.00 0.00 0.00
Measuring more CPU activity with cpu-all
If the above exercise of running CPU hogs is measured with timed CPU modes and the scheduler statistics, we can get an accurate overview of the current machine load:
(2 'yes' CPU hogs)
dsar -H localhost -P 9100 -o cpu-all
hostname time usr nice sys iowait steal irq soft guest gnice idle sch_run sch_wait
localhost:9100:metrics 11:44:40 2.01 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.01 0.02
localhost:9100:metrics 11:44:41 1.99 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.01
localhost:9100:metrics 11:44:42 1.97 0.00 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.01
localhost:9100:metrics 11:44:43 1.99 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.01 0.01
The most important things here are:
Around 2 seconds in user mode time per second, equalling 2 vCPUs.
Around 2 seconds of scheduler runtime per second, because of the 2 vCPUs.
A negligible amount of scheduler waiting time.
(3 'yes' CPU hogs)
dsar -H localhost -P 9100 -o cpu-all
hostname time usr nice sys iowait steal irq soft guest gnice idle sch_run sch_wait
localhost:9100:metrics 11:47:17 2.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.03 1.03
localhost:9100:metrics 11:47:18 1.99 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 1.01
localhost:9100:metrics 11:47:19 1.98 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 1.02
localhost:9100:metrics 11:47:20 1.97 0.00 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 1.01
I do believe that, with the introduction above, this output will not be a surprise:
Around 2 seconds in user mode time per second, equalling 2 vCPUs.
Around 2 seconds scheduler runtime per second, because of the 2 vCPUs.
Around 1 second of scheduler waiting time per second, because there are 3 runnable processes and the scheduler can only service 2 of them at a time (a quick arithmetic check follows below).
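A back-of-the-envelope check of that last point (an illustration, not dsar output):

```python
# Back-of-the-envelope check: with more runnable tasks than vCPUs, the CPU
# demand that does not fit shows up as scheduler waiting time.
vcpus = 2
hogs = 3                          # each 'yes' wants 1 CPU-second per second
demand = hogs * 1.0               # 3 CPU-seconds per second of demand
run = min(demand, vcpus)          # the scheduler can only hand out 2
wait = demand - run               # the remaining ~1 second/second is queued
print(run, wait)                  # 2.0 1.0 -- matching sch_run and sch_wait
```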
Conclusion
Lots of people use CPU percentages to understand how busy a system is, which is a very limited way of understanding system busyness, because a percentage doesn't tell you the quantity.
As long as there is idle time available, tasks (which are processes or threads) should generally be able to run in a CPU quantum/get scheduled immediately. Please mind this is a simplification and might not always be true.
Also, lots of people use the load 1/5/15 figures, which are a very limited way of understanding the actual CPU load, and which are even more misleading if used as a representation of the current system load, because they likely aren't: the load figures cannot react swiftly to a change in load.
Any number that is created from two measurements of a counter over time, to quantify something for which the counter changed more than once, is an average, and therefore a calculated indicator, not a fact.
The scheduler statistics, and specifically the scheduler wait statistic, are a good representation of the ability of the Linux task scheduler to run tasks at will, or to indicate load/run queue pressure.
PS1: Graphs
The dsar utility can optionally make graphs of some of the statistics, including CPU usage.
This is a view over time from a 2 vCPU system where I started a 'yes' hog every 5 seconds for 4 times, and then stopped them one by one.
The red line / total cpu shows the total amount of vCPUs.
The light green area shows the amount of user CPU time.
The yellow areas that rise above the user CPU time are the slight differences with what the scheduler sees as runtime.
Once two processes were active and a third was started, the queueing for CPU shows itself as scheduler wait time.
For the same period of time, here is a graphical representation of the load statistics:
The load figures do not indicate what actually has been taking place; instead, each load figure responds with its own eagerness, with the 1-minute load average responding the fastest.
Most of the graph shows it measuring a higher load and thus increasing the figures, and only some time after the load was gone (!) does it show a decrease. It seems impossible to me to understand the actual system load / CPU pressure from these figures; it is impossible to derive that there have been 4 active processes.
The Linux kernel source remarks about the usefulness of the load figure: 'Its a silly number but people think its important'.
PS2: PSI
Recent Linux kernels have a facility called 'PSI', which stands for 'pressure stall information'. The PSI statistics are created to express pressure for CPU, memory and IO.
There are some downsides to PSI: most distributions do not seem to enable PSI by default, so it has to be explicitly turned on. I performed some simple tests and concluded that it needs further study to fully understand how pressure introduced to a system shows itself in the PSI statistics for CPU, memory and IO; some simple tests did not show what I expected from the PSI statistics.
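For completeness, a minimal sketch of what reading the PSI CPU pressure file looks like, assuming a kernel with PSI enabled (4.20 or later; some distributions require booting with psi=1):

```python
# Sketch: reading the PSI CPU pressure file. Assumes a 4.20+ kernel with PSI
# enabled (some distributions require booting with psi=1). Each line holds
# running averages over 10/60/300 seconds plus a cumulative stall time in
# microseconds.
with open('/proc/pressure/cpu') as f:
    for line in f:
        kind, *fields = line.split()          # 'some' (and on newer kernels 'full')
        values = dict(field.split('=') for field in fields)
        print(kind, values)                   # e.g. some {'avg10': '0.00', ...}
```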