Sometimes I get these messages from monit, and I can't understand why they occur or what they mean. I checked munin and I do not see anything abnormal in CPU usage.

Code:
Date: Mon, 10 Mar 2014 13:05:27
Action: alert
Host: server1
Description: cpu wait usage of 99.9% matches resource limit [cpu wait usage>20.0%]

Code:
Date: Mon, 10 Mar 2014 13:06:29
Action: alert
Host: server1
Description: 'xxxxxx' cpu wait usage check succeeded [current cpu wait usage=0.0%]
Hi,
Did you check which process was eating the resource? Run top, then press Shift+F and choose m or n, which sorts the processes by usage. Then inspect the process with the highest CPU utilization using lsof -p <PID>.
Br// Srijan
I received the message in my inbox hours after the incident, so I had no chance to investigate the CPU usage live. The munin CPU graphs revealed nothing extraordinary: CPU usage was generally low. This is from /etc/monit/monitrc:

Code:
check system server1.surf-anonymous.info
    if loadavg (1min) > 70 then alert
    if loadavg (5min) > 40 then alert
    if memory usage > 75% then alert
    if swap usage > 25% then alert
    if cpu usage (user) > 70% then alert
    if cpu usage (system) > 30% then alert
    if cpu usage (wait) > 20% then alert

Does it help?
Hi,
This will not help. We cannot investigate an incident that has already occurred. I suggest you check the PID at the time the incident occurs. Use the iotop program to see which processes are waiting on hard disk reads and writes.
Br// Srijan
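To illustrate the suggestion above: on Linux, processes that are blocked waiting for disk I/O normally appear in state D (uninterruptible sleep) in ps output, so filtering on that state is a quick way to see what is contributing to CPU wait. The following is a minimal sketch, not a definitive tool; the function name blocked_procs is made up for this example.

```shell
#!/bin/sh
# blocked_procs (hypothetical helper name): reads "ps -eo pid,stat,comm"
# style output on stdin and keeps the header line plus every process whose
# STAT column starts with D (uninterruptible sleep, usually disk I/O).
blocked_procs() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Typical use while the incident is happening:
#   ps -eo pid,stat,comm | blocked_procs
#
# If iotop is installed (it usually needs root), batch mode prints only the
# processes that are actually doing I/O, with timestamps, for 5 samples:
#   iotop -b -o -t -n 5
```

Both commands only show the current state, so they are useful only while the CPU wait is actually high.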
From what I understand, the incident lasts no more than 1-2 minutes; see the times of the emails. Even if I were constantly online to catch something that happens once every 3 weeks, I would not have time to use iotop or anything else to extract something meaningful. All I can do is look at the logs and the munin charts. This is how we have troubleshot all our problems up to now.
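Since the incident is too short to catch by hand, one possible approach is to let the server record the evidence itself. The sketch below samples the aggregate cpu line of /proc/stat, computes the iowait share between two samples, and appends a list of D-state processes to a log whenever that share exceeds a threshold. The names iowait_pct and watch_iowait, the log path, and the 10-second interval are assumptions chosen for illustration; the 20% threshold simply mirrors the monit rule quoted earlier.

```shell
#!/bin/sh
# Illustrative defaults (assumptions, adjust as needed).
THRESHOLD=${THRESHOLD:-20}     # percent, mirrors "cpu usage (wait) > 20%"
INTERVAL=${INTERVAL:-10}       # seconds between samples
LOGFILE=${LOGFILE:-/var/log/iowait-snapshots.log}

# iowait_pct PREV CUR
# PREV and CUR are two samples of the aggregate "cpu" line from /proc/stat.
# Prints the percentage of CPU time spent in iowait between the two samples.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal columns
            iow = $6                               # iowait column
        }
        NR == 1 { t0 = total; w0 = iow }
        NR == 2 {
            d = total - t0
            if (d > 0) printf "%.1f\n", 100 * (iow - w0) / d
            else print "0.0"
        }'
}

# watch_iowait: sample forever and log a snapshot when iowait is high.
watch_iowait() {
    prev=$(head -n 1 /proc/stat)
    while sleep "$INTERVAL"; do
        cur=$(head -n 1 /proc/stat)
        pct=$(iowait_pct "$prev" "$cur")
        prev=$cur
        # awk performs the floating-point comparison; exit 0 means "above".
        if awk -v p="$pct" -v t="$THRESHOLD" 'BEGIN { exit !(p > t) }'; then
            {
                echo "=== $(date '+%Y-%m-%d %H:%M:%S') iowait=${pct}%"
                ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
            } >> "$LOGFILE"
        fi
    done
}
```

The script only defines the functions. To use it, source it and start watch_iowait in the background (for example from an init script), then read the log file after the next alert email arrives. A shorter interval catches briefer spikes at the cost of a little more overhead.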