Server dies every night at 4:00

Discussion in 'General' started by smartcall, Mar 11, 2007.

  1. smartcall

    smartcall New Member

    Hi,

    I have had this issue for two nights now. It started at 4:00 am on Saturday and happened again at 4:00 am today, Sunday.
    I have FC6 with ISPConfig 2.2.9. I haven't made any changes to the system for months. And it is a production one with many sites.
    The server dies completely. Only a hardware reboot fixes the problem. I can't identify what causes it.
    It must be connected to the scripts that start to run at 4:00 am, but I don't know where to look. None of the common log files show anything.
    I also monitor the server with snmpd and the graphs are normal. Nothing special. Just after 4:00 am there is no more data for the graphs, because the server is dead.

    Please HELP. I need to resolve this before 4:00 am tomorrow.
     
  2. djtremors

    djtremors New Member

    I'm using Fedora 5 and 6 too, but I have no cron-related issues.

    cron.daily runs at 4am and so does webalizer for ispconfig.
    Code:
    # cd /etc/cron.daily
    # ls -l
    -rwxr-xr-x 1 root root  577 Feb 27 00:40 000-delay.cron
    -rwxr-xr-x 1 root root  379 Oct 30 18:37 0anacron
    -rwxr-xr-x 1 root root 2936 Nov 29 00:16 beagle-crawl-system
    -rwxr-xr-x 1 root root  118 Jan 25 01:06 cups
    -rwxr-xr-x 1 root root  180 Feb  9 01:45 logrotate
    -rwxr-xr-x 1 root root  418 Jan  9 20:56 makewhatis.cron
    -rwxr-xr-x 1 root root  137 Nov 26 23:04 mlocate.cron
    -rwxr-xr-x 1 root root 2181 Jun 21  2006 prelink
    -rwxr-xr-x 1 root root  114 Sep  7  2006 rpm
    -rwxr-xr-x 1 root root  290 Jul 13  2006 tmpwatch
    Code:
    # crontab -e
    0 4 * * * /root/ispconfig/php/php /root/ispconfig/scripts/shell/webalizer.php &> /dev/null
    You can try commenting these out or moving them away to see which one is causing it.
    I would check /var/log/cron to see what the last message was before the crash/hang.
    Also, was the console sitting at a login prompt, or were there any kernel messages?
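    One way to narrow it down, as a rough sketch: run the cron.daily jobs one at a time and log each start and finish, flushing to disk before every job, so that after a hang the last START line names the script that was running. The function name run_parts_logged and the log path are made up for illustration, and this is not how Fedora's own run-parts works.

    ```shell
    # Run every executable in a directory one at a time, recording each start
    # and finish in a log file. sync flushes the log to disk before the job
    # runs, so the last START line should survive a hard hang.
    run_parts_logged() {
        dir=$1
        log=$2
        for job in "$dir"/*; do
            # Skip anything that is not a regular, executable file.
            [ -f "$job" ] && [ -x "$job" ] || continue
            echo "$(date '+%b %d %H:%M:%S') START $job" >> "$log"
            sync
            "$job" > /dev/null 2>&1
            status=$?
            echo "$(date '+%b %d %H:%M:%S') END   $job (exit $status)" >> "$log"
            sync
        done
    }

    # Example call (hypothetical log path):
    # run_parts_logged /etc/cron.daily /root/cron-daily-trace.log
    ```

    After a hang, a START line without a matching END line points at the suspect script.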
     
  3. martinfst

    martinfst Member Moderator

    As you say, a couple of scripts are started at 4:00. Could it be a hardware memory problem? Or are you running out of memory in general (swap full)?
    Code:
    vmstat -s
    might be useful. For a real hardware problem, you will have to run vendor-specific memory tests; often you need to boot from a diagnostics CD.
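    To see whether memory or swap runs out just before the hang, something like the following could log a snapshot every minute. It is only a sketch: mem_snapshot and the log path are invented names, and it reads /proc/meminfo directly rather than parsing vmstat output.

    ```shell
    # Print one line with free memory and free swap (in kB), taken from a
    # file in /proc/meminfo format. The file argument defaults to /proc/meminfo.
    mem_snapshot() {
        awk '/^MemFree:/  { mem = $2 }
             /^SwapFree:/ { swap = $2 }
             END { printf "MemFree=%skB SwapFree=%skB\n", mem, swap }' "${1:-/proc/meminfo}"
    }

    # Example loop, started by hand shortly before 4:00 (hypothetical log path):
    # while true; do
    #     echo "$(date '+%H:%M:%S') $(mem_snapshot)" >> /root/mem-trace.log
    #     sync
    #     sleep 60
    # done
    ```

    If the last entries before the hang show SwapFree close to zero, memory exhaustion becomes the more likely explanation.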
     
  4. till

    till Super Moderator Staff Member ISPConfig Developer

    If you don't find anything in the logs, then it's most likely hardware related. Several cronjobs run at 4 AM, which may cause a higher load on your server; if there is, for example, bad RAM or a bad power supply, the server might die.
     
  5. smartcall

    smartcall New Member

    This is the last thing I see in /var/log/cron.1

    Code:
    Mar 11 04:00:01 ns1 crond[21239]: (root) CMD (/usr/bin/rdate -s ntp3.fau.de)
    Mar 11 04:00:01 ns1 crond[21240]: (root) CMD (/root/ispconfig/php/php /root/ispconfig/scripts/shell/check_services.php &> /dev/null)
    Mar 11 04:00:01 ns1 crond[21241]: (root) CMD (/root/ispconfig/php/php /root/ispconfig/scripts/shell/webalizer.php &> /dev/null)
    And I have an Intel Dual Core CPU.
    All the scripts except the webalizer one re-run after I boot the server, and nothing happens.
    Could it be the webalizer script?

    Thanks
     
  6. dlpc

    dlpc New Member

    Could be a heat problem; have a look at the CPU cooler.
    I had the same problem here: cron starts >> server shuts down, no entry in any log.
    The CPU cooler was not running :(
     
  7. smartcall

    smartcall New Member

    The cooler is working. I'm currently running the webalizer script manually to see what's happening. It's taking a long time to finish as I have more than 300 sites. But I don't see any significant load on the CPU.
     
  8. martinfst

    martinfst Member Moderator

    Any read/write errors on your disks? If it runs long, something is the bottleneck. That's either CPU, disk or memory. Find out what's (over-)used and you probably have an indication of where to look for a possible hardware problem.
     
  9. smartcall

    smartcall New Member

    Webalizer has been running for an hour already. Memory, CPU and disks are OK. The reason it takes so long is the extremely big web.log files. I have sites on my server with web.log files of over 150MB. But I'm monitoring it now while the webalizer script runs, and I don't see anything strange.

    Code:
    top - 15:48:13 up  5:48,  2 users,  load average: 0.15, 0.51, 0.65
    Tasks: 151 total,   1 running, 150 sleeping,   0 stopped,   0 zombie
    Cpu(s):  2.9%us,  0.8%sy,  0.0%ni, 92.2%id,  3.9%wa,  0.0%hi,  0.2%si,  0.0%st
    Mem:   2074448k total,  1992100k used,    82348k free,   194392k buffers
    Swap:  2939868k total,        0k used,  2939868k free,  1500392k cached
    
    
    I may move cron.daily to run later in the morning, so I can be there and look at the console output. Right now the screensaver prevents me from seeing the output.
     
  10. martinfst

    martinfst Member Moderator

    It's waiting on data to be retrieved from disk. I'd suspect disk problems. On my system, 200 MB log files are processed within 10 minutes on a 2.8 GHz dual core.
     
  11. smartcall

    smartcall New Member

    It finished and all is OK. But if you advise checking the disks, how would I do that?
     
  12. martinfst

    martinfst Member Moderator

    I can only think of (non-destructive) vendor diagnostics. And of course watch the log files for read/write errors. Do you have hardware RAID? Maybe the RAID controller can provide more info, but that's also vendor specific (to the RAID controller). Most of these require you to take your server offline for some period of time, and there's no guarantee errors will be detected. You've probably done so already, but making backups may save your a... sometime in the future.

    The problem is that these types of errors occur "randomly" and most of the time are not reproducible under controlled testing. :(
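
    For the log-file part, a rough filter such as this can pull out lines that look like kernel disk errors. The function name disk_errors is made up, and the patterns are an assumption about what such messages typically contain, so an empty result does not prove the disks are healthy.

    ```shell
    # Print log lines that look like disk read/write errors. The patterns
    # are a starting guess, not a complete list. The file argument defaults
    # to /var/log/messages.
    disk_errors() {
        grep -i -E 'I/O error|medium error' "${1:-/var/log/messages}"
    }

    # Example calls:
    # disk_errors
    # disk_errors /var/log/messages.1
    ```

    Checking the rotated log as well matters here, because the lines written just before a 4:00 hang may already sit in the previous file.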
     
  13. smartcall

    smartcall New Member

    I use RAID with mdadm. I believe mdadm would mail me if there were any errors.
    I don't think it's an HDD error. I will move cron.daily to execute at 11:00 am and will watch closely what happens.
    I can't think of anything else.

    Thanks.
     
  14. martinfst

    martinfst Member Moderator

    OK, that's software RAID. Yes, mdadm will report problems, but normally not through email. Check /var/log/messages and/or /var/log/syslog and/or /var/log/kern.log. I'm not familiar with FC6, so maybe they even use different log files.
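
    For software RAID the kernel also exposes the array state in /proc/mdstat, where a member shown as _ instead of U (for example [U_]) is missing or failed. A small sketch that lists such arrays follows; degraded_arrays is an invented helper name, and /dev/md0 in the comment is an assumed device name.

    ```shell
    # List md arrays whose status field (e.g. [U_]) contains an underscore,
    # meaning at least one member is missing or failed. Reads a file in
    # /proc/mdstat format; the argument defaults to /proc/mdstat.
    degraded_arrays() {
        awk '$2 == ":" && /^md/        { name = $1 }
             /\[[U_]*_[U_]*\]/          { print name }' "${1:-/proc/mdstat}"
    }

    # Example calls (device name is an assumption):
    # degraded_arrays
    # mdadm --detail /dev/md0
    ```

    No output means every array reports all members up. That only rules out a failed member; it says nothing about intermittent read errors.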
     
  15. smartcall

    smartcall New Member

    It happened again. Nothing in the logs. The last entries in the cron log are webalizer and the hourly parts. But when I run the webalizer script from the command line, nothing bad happens.
    How can I edit the webalizer cronjob? I don't see it in crontab.

    Thanks.
     
  16. martinfst

    martinfst Member Moderator

    As 'root', use
    Code:
    crontab -e
     
  17. smartcall

    smartcall New Member

    Thanks.
    I'll edit it to run at 11 am and see what happens. If it dies, then this is it.
    But I still can't believe it, because when I run it manually nothing goes wrong.
     
  18. smartcall

    smartcall New Member

    All cronjobs finished after I moved them to the new execution time. All is OK.
    No issues. STRANGE :confused:
    I am currently running the mprime torture test. Again, all is OK.

    I remembered this extremely strange thing: I had a problem with one of my other servers. It was running Debian, and every morning at 4 the internet connection to it used to go down. I noticed that it started after I put a mobile phone next to it. I removed the phone and the issue was gone.
    Now this same phone was near the other server. I removed it again and hopefully the problem will stop happening.
    I will report back. But don't put mobile phones near servers. ;)
     
  19. smartcall

    smartcall New Member

    I tested the CPU at full load on both cores for hours. This test also exercises the memory. NO errors.
    So the only thing that could be causing this is the cell phone.
     
  20. djtremors

    djtremors New Member

    phone or no phone, bit strange that it's happening at 4am unless you're calling your servers for a 'booty call' at 4am ;) SO DON'T CALL THEM AT 4AM :p

    very strange.
     
