Hello there, not sure this is the place to ask this, but if not, my apologies. Checking the resource graphs for my VPS provided by my vendor, I see high CPU usage spikes (>140% according to the graphs) two to three times daily, each lasting from about 45 minutes to over an hour (along with high disk activity, while network activity remains low). These spikes make all my sites inaccessible on every protocol (HTTP, SSH, FTP), forcing me to hard reboot the VPS to restore access. How would I go about detecting what's causing these spikes after a reboot?
Monitor what is happening on that host. There are various top-style commands that show what is using the most resources: htop, nettop, iotop, plus system statistics tools like dstat and vmstat. Maybe start with uptime, df -hT, and free -h --wide. Use an Internet search engine with
Code:
linux monitor resources
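If it helps, here is a minimal set of commands for a quick first look (standard tools; iotop may need installing, and package names can differ by distro):
Code:
# Load averages and uptime
uptime
# Memory and swap usage
free -h --wide
# Disk space per filesystem
df -hT
# Interactive process view, sorted by CPU (press q to quit)
top
# Per-process disk I/O, only processes actually doing I/O (needs the iotop package, run as root)
sudo iotop -o
# Rolling CPU / disk / swap statistics, one line per second
vmstat 1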
A VPS normally shares virtual CPUs unless the plan clearly specifies dedicated ones, which could be one reason for this, just a thought. If the spikes are coming from your own host, then as @Taleman said, you can monitor the host so you know what caused them.
Will those methods work after a reboot? I'm not able to log in via SSH to run those commands while the spikes are happening.
No. After a reboot you can read logs. The methods proposed so far are meant to monitor what is happening right now. It sounds like you need something that continuously monitors the host and writes what is happening to a log, so you can see afterwards what was going on. There are probably tools like this; use an Internet search engine to find them. If the problem is only that you cannot log in when the high load happens, I would keep a login session running and, when the spike comes, use that already logged-in session to issue commands.
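One common option is the sysstat package, which records system activity at regular intervals so you can read it back after the event. A rough sketch for a Debian/Ubuntu-style system (the enable step and paths can differ on other distros):
Code:
# Install the collector
sudo apt install sysstat
# On Debian/Ubuntu, set ENABLED="true" in this file to turn on periodic collection
sudo nano /etc/default/sysstat
sudo systemctl enable --now sysstat
# Later, after a spike: CPU usage history for today
sar -u
# Memory and swap history
sar -r
# Disk activity history
sar -d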
Munin should provide historical graphs. Or use monit to email you an alert on high CPU / load, e.g. CPU > 80% or load > number of vCPU cores. That should give you a chance to log in via SSH and run e.g. top before the box becomes completely unresponsive. My guess is the culprit is a bad/dodgy PHP script running on one of the sites, or ClamAV running a server scan. I'd currently be leaning towards ClamAV.
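For reference, a minimal monit rule along those lines might look like this (a sketch only: it assumes 4 vCPUs, that email alerting is already configured in monitrc, and the file name is hypothetical; adjust thresholds to your box):
Code:
# /etc/monit/conf.d/system (hypothetical file name)
check system myvps
    if loadavg (5min) > 4 then alert
    if cpu usage > 80% for 3 cycles then alert
    if memory usage > 90% then alert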
I've installed munin and it's reporting the following critical issue: Not entirely sure how to fix that.
You could just reboot the machine, stop all services, and see what happens. If the system then works correctly, start the services one by one, see which one causes the issue, and then investigate further why. I would reboot, stop all services running on the host (DB, FTP, web server, etc.), run htop, and check CPU/RAM usage. After that, start one of the services and check again, and so on... The kernel log might give you a hint too.
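On a systemd-based VPS that could look roughly like this (the service names are just examples, check what is actually running on yours first):
Code:
# See which services are currently running
systemctl list-units --type=service --state=running
# Stop likely suspects one by one (example names, yours may differ)
sudo systemctl stop apache2 mysql proftpd
# Watch resource usage while everything is stopped
htop
# Bring services back one at a time, checking after each
sudo systemctl start mysql
# Kernel messages from the previous boot, in case e.g. the OOM killer shows up
sudo journalctl -k -b -1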
The error is about bytes free, so you probably ran out of RAM (if the limits set there are appropriate for your system; otherwise it can also be a false alert). You could also check with the top command how much free RAM you have, and it might help to add a swap file if you don't have a swap partition. Or add/assign more RAM to the server.
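If you do end up needing a swap file, the usual recipe is roughly this (a sketch only; the size and path are arbitrary examples, and on some filesystems dd should be used instead of fallocate):
Code:
# Create a 2 GiB swap file (size is just an example)
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make it persistent across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab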
My memory usage averages 60% (of 4 GB total) and my swap averages 25% (of 8 GB total). I found this topic on the issue; apparently it's been this way for a long while. Do you think the suggested steps there are advisable?
It does not matter what the average is. What matters is whether the host runs out of memory. Are the load spikes still happening? How often? At the same time of day or day of week? The thread seems to say the error is spurious and due to a bad default configuration in Munin. I would ignore it and go back to finding the cause of the high loads. What do these commands show (paste the output in CODE tags, please):
Code:
uptime
free -h
df -hT /
So, I managed to log in during one of the spikes (had to wait a long while to get in) and found out it was a client's sites causing the CPU spikes, due to some custom PHP code they had uploaded. Disabling PHP on those sites has resolved the spikes in usage.
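For anyone landing here later, during a spike the offending processes can usually be spotted with something like this (generic commands, not specific to this setup):
Code:
# Processes sorted by CPU usage, highest first
ps -eo user,pid,%cpu,%mem,etime,cmd --sort=-%cpu | head -n 15
# Per-site PHP-FPM pools often run as separate users, so summing CPU per user can point at a site
ps -eo user,%cpu --no-headers | awk '{cpu[$1]+=$2} END {for (u in cpu) print cpu[u], u}' | sort -rn | head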