Server dies every night at 4:00

smartcall · Mar 11, 2007

Hi,

I have had this problem for two nights now. It started at 4:00 am on Saturday and happened again at 4:00 am today, Sunday.
I run FC6 with ISPConfig 2.2.9. I haven't made any changes to the system for months, and it is a production server with many sites.
The server dies completely. Only a hardware reboot fixes the problem. I can't identify what causes it.
It must be connected to the scripts that start at 4:00 am, but I don't know where to look. None of the usual log files show anything.
I also monitor the server with snmpd, and the graphs are normal. Nothing special. Just after 4:00 am there is no more data for the graphs, because the server is dead.

Please HELP. I need to resolve this before 4:00 am tomorrow.

djtremors · Mar 11, 2007

I'm using Fedora 5 and 6 too. I have no cron-related issues, though.

cron.daily runs at 4am and so does webalizer for ispconfig.

Code:
# cd /etc/cron.daily
# ls -l
-rwxr-xr-x 1 root root  577 Feb 27 00:40 000-delay.cron
-rwxr-xr-x 1 root root  379 Oct 30 18:37 0anacron
-rwxr-xr-x 1 root root 2936 Nov 29 00:16 beagle-crawl-system
-rwxr-xr-x 1 root root  118 Jan 25 01:06 cups
-rwxr-xr-x 1 root root  180 Feb  9 01:45 logrotate
-rwxr-xr-x 1 root root  418 Jan  9 20:56 makewhatis.cron
-rwxr-xr-x 1 root root  137 Nov 26 23:04 mlocate.cron
-rwxr-xr-x 1 root root 2181 Jun 21  2006 prelink
-rwxr-xr-x 1 root root  114 Sep  7  2006 rpm
-rwxr-xr-x 1 root root  290 Jul 13  2006 tmpwatch

Code:
# crontab -e
0 4 * * * /root/ispconfig/php/php /root/ispconfig/scripts/shell/webalizer.php &> /dev/null

You can try commenting these out or moving them elsewhere to see which one is causing it.
My suggestion is to check /var/log/cron and see what the last message was before the crash/hang.
Also, was the console sitting at a login prompt, or were there any kernel messages?
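
One way to find the offending job is to run the daily jobs one at a time through a small wrapper that records which job started last. This is only a sketch: the function name `run_jobs_logged` and any log path passed to it are made up for illustration, and it assumes the jobs are plain executables in a single directory, as in /etc/cron.daily. The `sync` after each log line is there so the entry has a chance of reaching the disk before a hard hang.

```shell
#!/bin/bash
# Hypothetical helper (not part of ISPConfig or Fedora): run every executable
# in a directory one at a time and log START/END markers for each job.
run_jobs_logged() {
    local dir="$1" log="$2" job rc
    for job in "$dir"/*; do
        # Skip anything that is not a regular, executable file.
        [ -f "$job" ] && [ -x "$job" ] || continue
        echo "$(date '+%F %T') START $job" >> "$log"
        sync    # flush the log line to disk before the job runs
        rc=0
        "$job" || rc=$?
        echo "$(date '+%F %T') END   $job (exit $rc)" >> "$log"
        sync
    done
}
```

Called as `run_jobs_logged /etc/cron.daily /root/cron-daily-trace.log` (log path assumed), a START line without a matching END line after the next hang would point at the job that was running.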

martinfst · Mar 11, 2007

As you say, a couple of scripts start at 4:00. Could it be a hardware memory problem? Or are you running out of memory in general (swap full)?
Code:
vmstat -s
might be useful. For a real hardware problem, you will have to run vendor-specific memory tests; often you need to boot from a diagnostics CD.
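
To see whether memory really runs out around 4:00, a snapshot can be logged regularly while the jobs run. The sketch below is an illustration only: `mem_snapshot` is a hypothetical helper that reads the standard `MemFree`, `SwapTotal` and `SwapFree` fields from /proc/meminfo (a different file can be passed as the first argument).

```shell
#!/bin/bash
# Hypothetical helper: print one line with free memory and used swap (in kB),
# read from /proc/meminfo or from a file given as the first argument.
mem_snapshot() {
    local src="${1:-/proc/meminfo}"
    awk '
        $1 == "MemFree:"   { free = $2 }
        $1 == "SwapTotal:" { stotal = $2 }
        $1 == "SwapFree:"  { sfree = $2 }
        END { printf "MemFree=%dkB SwapUsed=%dkB\n", free, stotal - sfree }
    ' "$src"
}
```

Run in a loop such as `while sleep 60; do echo "$(date '+%T') $(mem_snapshot)" >> /root/mem.log; sync; done` (log path assumed), this leaves a trail on disk that survives a hang.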

till · Mar 11, 2007

smartcall said:

Hi,

I have had this problem for two nights now. It started at 4:00 am on Saturday and happened again at 4:00 am today, Sunday.
I run FC6 with ISPConfig 2.2.9. I haven't made any changes to the system for months, and it is a production server with many sites.
The server dies completely. Only a hardware reboot fixes the problem. I can't identify what causes it.
It must be connected to the scripts that start at 4:00 am, but I don't know where to look. None of the usual log files show anything.
I also monitor the server with snmpd, and the graphs are normal. Nothing special. Just after 4:00 am there is no more data for the graphs, because the server is dead.

Please HELP. I need to resolve this before 4:00 am tomorrow.

If you don't find anything in the logs, then it's most likely hardware related. Several cron jobs run at 4 AM, which may put a higher load on your server; if there is, e.g., some bad RAM or a failing power supply, the server might die.

smartcall · Mar 11, 2007

This is the last thing I see in /var/log/cron.1:
Code:
Mar 11 04:00:01 ns1 crond[21239]: (root) CMD (/usr/bin/rdate -s ntp3.fau.de)
Mar 11 04:00:01 ns1 crond[21240]: (root) CMD (/root/ispconfig/php/php /root/ispconfig/scripts/shell/check_services.php &> /dev/null)
Mar 11 04:00:01 ns1 crond[21241]: (root) CMD (/root/ispconfig/php/php /root/ispconfig/scripts/shell/webalizer.php &> /dev/null)
And I have an Intel dual-core CPU.
All the scripts besides the webalizer one re-run after I boot the server, and nothing happens.
Could it be the webalizer script?

Thanks

dlpc · Mar 11, 2007

till said:

If you don't find anything in the logs, then it's most likely hardware related. Several cron jobs run at 4 AM, which may put a higher load on your server; if there is, e.g., some bad RAM or a failing power supply, the server might die.

It could be a heat problem; have a look at the CPU cooler.
I had the same problem here: cron started, the server shut down, and there was no entry in any log.
The CPU cooler was not running.

smartcall · Mar 11, 2007

The cooler is working. I'm currently running the webalizer script manually to see what's happening. It's taking a long time to finish, as I have more than 300 sites. But I don't see any significant load on the CPU.

martinfst · Mar 11, 2007

Any read/write errors on your disks? If it runs long, something is the bottleneck. That's either CPU, disk, or memory. Find out what's (over-)used and you will probably have an indication of where to look for a possible hardware problem.

smartcall · Mar 11, 2007

Webalizer has been running for an hour already. Memory, CPU and disks are OK. The reason it takes so long is the extremely big web.log files. Some sites on my server have web.log files over 150 MB. I'm monitoring it now while the webalizer script runs, and I don't see anything strange.
Code:
top - 15:48:13 up  5:48,  2 users,  load average: 0.15, 0.51, 0.65
Tasks: 151 total,   1 running, 150 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.9%us,  0.8%sy,  0.0%ni, 92.2%id,  3.9%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   2074448k total,  1992100k used,    82348k free,   194392k buffers
Swap:  2939868k total,        0k used,  2939868k free,  1500392k cached
I may move cron.daily to run later in the morning, so I can be there and look at the console output. Right now the screensaver prevents me from seeing the output.

martinfst · Mar 11, 2007

3.9%wa

It's waiting on data to be retrieved from disk. I'd suspect disk problems. On my system, 200 MB log files are processed within 10 minutes on a 2.8 GHz dual core.

smartcall · Mar 11, 2007

It finished and all is OK. But if you advise checking the disks, how can I do that?

martinfst · Mar 11, 2007

I can only think of (non-destructive) vendor diagnostics. And of course watch the log files for read/write errors. Do you have hardware RAID? Maybe the RAID controller can provide more info, but that's also vendor specific (to the RAID controller). Most of these require you to take your server offline for some period of time, and it's not certain that errors will be detected. You've probably done so already, but making backups may save your a... sometime in the future.

The problem is that these types of errors occur "randomly" and most of the time are not reproducible under controlled testing.
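
Watching the logs for read/write errors can be partly automated. The following is a rough sketch, not a complete check: `scan_disk_errors` is a hypothetical helper, and the patterns are a few kernel message fragments that commonly accompany failing disks; the list is an assumption and certainly not exhaustive. If smartmontools is installed, `smartctl -H /dev/sda` (device name assumed) additionally reports the drive's own health verdict.

```shell
#!/bin/bash
# Hypothetical helper: print lines from a log file that look like disk
# read/write errors. The pattern list is illustrative, not exhaustive.
# Exit status is that of grep: 0 if something matched, 1 if nothing did.
scan_disk_errors() {
    local log="${1:-/var/log/messages}"
    grep -Ei 'I/O error|medium error|DriveReady SeekComplete Error|ata[0-9]+.*(failed command|exception)' "$log"
}
```

Running `scan_disk_errors /var/log/messages` after each reboot shows whether the kernel logged anything disk related shortly before the hang.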

smartcall · Mar 11, 2007

I use RAID and mdadm. I believe mdadm would mail me if there were any errors.
I don't think it's an HDD error. I will move cron.daily to run at 11:00 am and will watch closely what happens.
I can't think of anything else.

Thanks.

martinfst · Mar 11, 2007

OK, that's software RAID. Yes, mdadm will report problems, but normally not through email. Check /var/log/messages and/or /var/log/syslog and/or /var/log/kern.log. I'm not familiar with FC6, so maybe it even uses different log files.
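
Besides the logs, the kernel's own view of the arrays is in /proc/mdstat. As a minimal sketch (the helper name `md_degraded` is made up), the function below looks for a status such as `[U_]`, which marks a missing member, instead of `[UU]`. Mail alerts from mdadm depend on `mdadm --monitor` running with a `MAILADDR` line in mdadm.conf, so it is worth checking both.

```shell
#!/bin/bash
# Hypothetical helper: print the status line and exit 0 if any md array shows
# a missing member, e.g. "[2/1] [U_]" instead of "[2/2] [UU]".
# Exits 1 (grep found nothing) when all arrays look complete.
md_degraded() {
    local src="${1:-/proc/mdstat}"
    grep -E '\[[U_]*_[U_]*\]' "$src"
}
```

For example, `md_degraded && echo "array degraded"` prints a warning only when a member is missing.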

smartcall · Mar 12, 2007

It happened again. Nothing in the logs. The last entries in the cron log are webalizer and the hourly parts. But when I run the webalizer script from the command line, nothing bad happens.
How can I edit the webalizer cron job? I don't see it in crontab.

Thanks.

martinfst · Mar 12, 2007

As 'root', use
Code:
crontab -e
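
If the goal is only to move one job to a different hour, the change can also be previewed before editing. The sketch below is an assumption-laden illustration: `move_job_hour` is a hypothetical helper that reads crontab text on standard input and rewrites the hour field (the second field) of every non-comment line matching a pattern.

```shell
#!/bin/bash
# Hypothetical helper: read crontab text on stdin and set the hour field of
# every non-comment line matching the given pattern to the given hour.
# Note: rewritten lines are re-joined with single spaces between fields.
move_job_hour() {
    local pattern="$1" hour="$2"
    awk -v pat="$pattern" -v hour="$hour" '
        # Leave comments and non-matching lines untouched.
        /^[[:space:]]*#/ || $0 !~ pat { print; next }
        { $2 = hour; print }
    '
}
```

For instance, `crontab -l | move_job_hour webalizer 11` prints root's crontab with the webalizer entry moved to 11:00, which can be reviewed and then entered with `crontab -e`.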

smartcall · Mar 12, 2007

Thanks.
I'll edit it to run at 11 am and see what happens. If the server dies then, this is it.
But I still can't believe it, because when I run it manually nothing goes wrong.

smartcall · Mar 12, 2007

All cron jobs finished after I moved them to the new execution time. All is OK.
No issues. STRANGE.
I am currently running the mprime torture test. Again, all is OK.

I remembered this extremely strange thing: I had a problem with one of my other servers. It was running Debian, and every morning at 4 the internet connection to it used to go down. I noticed that it had started after I put a mobile phone next to it. I removed the phone and the issue was gone.
Now this same phone was sitting near this server. I removed it again, and hopefully the problem will stop happening.
I will report. But don't put mobile phones near servers.

smartcall · Mar 12, 2007

I tested the CPU at full load of both cores for hours. This test also utilizes the memory. NO errors.
So the only thing that could be causing this is the cell phone.

djtremors · Mar 12, 2007

phone or no phone, bit strange that it's happening at 4am unless you're calling your servers for a 'booty call' at 4am SO DON'T CALL THEM AT 4AM

very strange.
