CPU checking

Posted on 02/04/

I'm going to release the check_linux_cpu check that we've been beta testing at CAPSiDE. I looked around in Nagios Exchange and none of the existing plugins scratched my itch, so I made a new one. What was wrong with the other plugins?

  • No performance data. At CAPSiDE we want all the plugins we use to output perfdata.
  • Calculation of the CPU usage. Read below to find out why.
  • Dependency on external utilities (mpstat, iostat, Net::SNMP, etc.)

How is the CPU usage percentage calculated in our plugin?

/proc/stat has the info needed to calculate the CPU usage. Every time you read it, it gives you the number of time slices each processor has spent on each kind of activity (computing in user space, computing in kernel (system) space, attending interrupts, etc.). But those counts are absolute: they accumulate from the moment the OS boots.
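To make that concrete, here is a minimal Python sketch that pulls the per-CPU counters out of /proc/stat-style text. It is not the plugin's code: the function name parse_cpu_lines and the sample text are made up for illustration. The only thing it assumes about the format is that CPU lines start with "cpu" followed by whitespace-separated integer counters.

```python
# Illustrative sample only; the numbers are invented, not real measurements.
SAMPLE_STAT = """\
cpu  8000 1000 2000 9000
cpu0 8000 1000 2000 9000
ctxt 67890
"""


def parse_cpu_lines(stat_text):
    """Return {label: [counter, ...]} for every line whose label starts with 'cpu'."""
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            counters[fields[0]] = [int(value) for value in fields[1:]]
    return counters


if __name__ == "__main__":
    # On a Linux box you could pass open("/proc/stat").read() instead.
    for label, values in parse_cpu_lines(SAMPLE_STAT).items():
        print(label, values)
```

The "cpu" line aggregates all processors, while "cpu0", "cpu1" and so on are the individual ones.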

So if you're curious about knowing what your processor has been doing, you just have to sum up all the time it has been doing something, and then calculate the proportion of time that it was doing what you're interested in.

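For example, let's suppose a /proc/stat that reports user, system, nice and idle time in each column (the real file puts nice before system and, on current kernels, has more columns, but the arithmetic is the same):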

cpu 8000 2000 1000 9000

8000 + 2000 + 1000 + 9000 = 20000 time slices doing things.

How much of that time was spent in user? 8000/20000 = 0.4
And in system? 2000/20000 = 0.1
In nice? 1000/20000 = 0.05
Idle? 9000/20000 = 0.45
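The same arithmetic as a small Python sketch, in case you want to play with the numbers. The function name cumulative_shares and the COLUMNS tuple are made up for illustration, and the column order follows the simplified example above, not the real file.

```python
# Column order of the simplified example in this post (an assumption,
# not the order used by the real /proc/stat).
COLUMNS = ("user", "system", "nice", "idle")


def cumulative_shares(counters):
    """Return the fraction of all counted time slices spent in each state."""
    total = sum(counters)
    if total == 0:
        # Nothing counted yet, so every share is zero.
        return {name: 0.0 for name in COLUMNS}
    return {name: value / total for name, value in zip(COLUMNS, counters)}


if __name__ == "__main__":
    for name, share in cumulative_shares([8000, 2000, 1000, 9000]).items():
        print(f"{name}: {share:.2f}")
```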

This information can be useful, but it is misleading if you monitor it, because it accounts for all the time since the computer was switched on. That means that if you have little activity at night, idle will gain weight. As a result, your user time can sit at 100% for a long while and the percentages will barely move.

Think of an obsessive person that notes down all the time he has spent on each of his activities. When you ask him "what have you been doing all your life?", he'll tend to respond: "sleeping" :D.

A more useful metric would be: "what have you been doing since the last time I asked you". He could tell you: "working on the presentation for tomorrow".

Well let's do the same with our CPU! Since the kernel doesn't have any interface to query what it's been doing since the last time we were interested, we'll have to ask twice:

measure 1: cpu 8000 2000 1000 9000
measure 2: cpu 9500 2500 1500 9500

What has the CPU been doing between measure 1 and measure 2?

9500 - 8000 = 1500 in user
2500 - 2000 = 500 in system
1500 - 1000 = 500 in nice
9500 - 9000 = 500 in idle

1500 + 500 + 500 + 500 = 3000 in total

So... it's been working on:
1500/3000 = 0.5 in user
500/3000 ≈ 0.17 in system
500/3000 ≈ 0.17 in nice
500/3000 ≈ 0.17 in idle
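Here is the two-reading calculation as a minimal Python sketch. It is an illustration of the arithmetic above, not the plugin's actual code: the name usage_between and the COLUMNS tuple are made up, and the column order is the simplified one from the example.

```python
# Column order of the simplified example in this post (an assumption,
# not the order used by the real /proc/stat).
COLUMNS = ("user", "system", "nice", "idle")


def usage_between(previous, current):
    """Return the share of time spent in each state between two readings."""
    deltas = [now - before for before, now in zip(previous, current)]
    total = sum(deltas)
    if total == 0:
        # No time slices elapsed between the two readings.
        return {name: 0.0 for name in COLUMNS}
    return {name: delta / total for name, delta in zip(COLUMNS, deltas)}


if __name__ == "__main__":
    first = [8000, 2000, 1000, 9000]
    second = [9500, 2500, 1500, 9500]
    for name, share in usage_between(first, second).items():
        print(f"{name}: {share:.2f}")
```

The guard for a zero total covers the case where both readings are identical, which would otherwise divide by zero.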

Some plugins calculate CPU usage over an interval of X seconds (mostly the ones that depend on external utilities). I don't think this is an accurate way to do the measurement either: a short sample only sees what is happening at that moment. The obsessive person will say either "I'm talking with you" or "I was doing the presentation", even though just before that he had been having a snack :D.

Execute top. Quickly look at the CPU usage. Does the first reading it displays seem familiar now? And the rest?

So... is the plugin fundamentally flawed in some way? Am I just plainly wrong? What do you think?