monit: cpu wait usage of 99.9% matches resource limit [cpu wait usage>20.0%]

Discussion in 'ISPConfig 3 Priority Support' started by bobpit, Mar 10, 2014.

  1. bobpit

    bobpit Member

    Sometimes I get these messages from monit and I can't understand why they occur or what they mean. I checked munin and I do not see anything abnormal in cpu usage.


    Code:
    Date:        Mon, 10 Mar 2014 13:05:27
    Action:      alert
    Host:        server1
    Description: cpu wait usage of 99.9% matches resource limit [cpu wait usage>20.0%]

    Code:
    Date:        Mon, 10 Mar 2014 13:06:29
    Action:      alert
    Host:        server1
    Description: 'xxxxxx' cpu wait usage check succeeded [current cpu wait usage=0.0%]
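
The two alerts above are one minute apart and swing from 99.9% to 0.0%. The following is a minimal sketch of how such a figure can arise, assuming the wait percentage is derived from the difference between two readings of the aggregate `cpu` line in /proc/stat (the usual approach on Linux; whether monit does exactly this is an assumption). The function names and sample counters are hypothetical and not taken from monit.

```python
def parse_cpu_line(line):
    """Return the counters of the aggregate 'cpu' line from /proc/stat as ints.

    Field order on Linux: user nice system idle iowait irq softirq steal
    guest guest_nice (the last ones may be absent on old kernels).
    """
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight counters are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


# Illustrative counters only (not measured on the server in this thread):
# an otherwise idle machine with one burst of blocked I/O in the interval.
sample_before = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
sample_after = parse_cpu_line("cpu  110 0 55 810 1025 0 0 0 0 0")
print(iowait_percent(sample_before, sample_after))
```

Because the value is a share of one short interval, a brief I/O stall on a mostly idle machine can push it close to 100%. A graph built from multi-minute averages (Munin samples every five minutes in a default setup) can then flatten the same spike until it is hard to see.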
     
  2. srijan

    srijan New Member HowtoForge Supporter

    Hi

    Did you check which process was eating the resources? Run top, press Shift+F, then choose m or n to sort the processes according to usage.

    Further check

    lsof -p (PID of max CPU utilization)

    Br//
    Srijan
     
  3. bobpit

    bobpit Member

    I received the message in my inbox hours after the incident, so I had no chance to investigate the CPU usage while it was happening.

    The Munin CPU graphs revealed nothing extraordinary. CPU usage was generally low.

    This is from /etc/monit/monitrc:
    Code:
      check system server1.surf-anonymous.info
        if loadavg (1min) > 70 then alert
        if loadavg (5min) > 40 then alert
        if memory usage > 75% then alert
        if swap usage > 25% then alert
        if cpu usage (user) > 70% then alert
        if cpu usage (system) > 30% then alert
        if cpu usage (wait) > 20% then alert 
    Does it help?
     
  4. srijan

    srijan New Member HowtoForge Supporter

    Hi

    This will not help. We cannot investigate an incident that has already occurred.
    I suggest you check the PID at the time the incident occurs. Use the iotop program to see which processes are waiting on hard disk reads and writes.

    Br//
    Srijan
     
  5. bobpit

    bobpit Member

    From what I understand, the incident lasts no more than 1-2 minutes; see the times of the emails. Even if I were constantly online to catch something that happens once every 3 weeks, I would not have time to use iotop or anything else to extract something meaningful. All I can do is look at the logs and the Munin charts. This is how we have troubleshot every problem up to now.
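
Since the incident is over before anyone can log in, one option is to record the evidence automatically. The following is a rough sketch, assuming a Linux /proc filesystem: it appends to a log every process found in uninterruptible sleep (state D), which is the state that normally accounts for CPU wait time. The function names and the log path in the usage note are hypothetical.

```python
import os
import time


def parse_proc_stat(text):
    """Split the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx])
    comm = text[open_idx + 1:close_idx]
    state = text[close_idx + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_proc_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the file was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


def log_snapshot(log_path, proc_root="/proc"):
    """Append one timestamped line per blocked process to log_path."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        for pid, comm in blocked_processes(proc_root):
            log.write(f"{stamp} pid={pid} comm={comm} state=D\n")
```

Calling `log_snapshot("/var/log/iowait-blocked.log")` (a hypothetical path) once a minute from cron, or from a monit `exec` action attached to the wait rule, would leave a record to read alongside the alert emails. A one-minute interval can still miss a very short stall, so this is a starting point rather than a guaranteed capture.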
     
