Debian crashed several times, help needed

Discussion in 'Server Operation' started by djkoelkast, Jul 11, 2007.

  1. djkoelkast

    djkoelkast New Member

    Hi,

    I installed Debian 4.0 + ISPConfig through the perfect setup.
    It has crashed about 3 times in the past 24 hours and I don't have a clue how and why.

    The complete system freezes and the only way to recover is to turn the server off and on again, then it will boot fine.

    When it freezes, the hard disk LED is on every time.

    I really want to solve this but what am I looking for? What logs to check? What things to test?

    I installed the SMP kernel and it ran stable for about 2 weeks, then all of a sudden it started crashing every now and then. The only thing that changed is the CPU fan on one of the CPUs; it runs fine and keeps the CPU cool enough (I can touch the copper parts of the cooler and they are just warm, not hot).

    Please help me out.
     
  2. Ben

    Ben Active Member Moderator

    Can you see anything in /var/log/messages or dmesg?
    You could also look for hints by running a smartctl check on your hard disk (because you said the LED was lit when the PC crashed), or you could boot from a rescue CD (e.g. the Debian netinstaller or Knoppix) and run fsck.
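    A minimal sketch of the log check, not a definitive recipe. It assumes sysklogd is the logger and that it writes a line containing "syslogd ... restart" whenever it starts; the function name before_restart is made up for this example.

```shell
# Print the log lines written just before each syslogd start, i.e. the
# last things logged before a crash and reboot.
# $1 = log file, $2 = lines of context to show (default 10).
# Assumption: sysklogd marks a start with "syslogd ... restart".
# Log rotation restarts syslogd too, so not every hit is a crash.
before_restart() {
    grep -B "${2:-10}" 'syslogd.*restart' "$1"
}
```

    Run it as root, e.g. `before_restart /var/log/messages`, and repeat for /var/log/syslog and /var/log/kern.log if they exist. For the file system check, boot the rescue CD and run fsck only on an unmounted partition, e.g. `fsck -f /dev/hda1` (the partition name is an assumption, check yours first).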
     
  3. djkoelkast

    djkoelkast New Member

    In /var/log/messages, these are the last few lines before the crash:

    Code:
    Jul 10 22:23:40 server -- MARK --
    Jul 10 22:43:40 server -- MARK --
    Jul 10 23:03:40 server -- MARK --
    Jul 10 23:23:41 server -- MARK --
    Jul 10 23:43:41 server -- MARK --
    
    In /var/log/dmesg there is nothing special to see...

    What exactly is smartctl and how do I run it? I've never run into this kind of problem before, so I have no experience with it ;)
     
  4. Ben

    Ben Active Member Moderator

    You have to install it (in case you have not done so yet):
    aptitude install smartmontools

    then run e.g.
    smartctl -a /dev/hda
    smartctl --test=long /dev/hda
    etc.
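    As a rough sketch, the usual first round of SMART checks can be wrapped in one function. The name smart_check is made up for this example and /dev/hda is only a placeholder device; the smartctl options used (-i, -s on, -H, -A, -l error) are standard ones.

```shell
# Run the basic SMART checks on one drive (needs root).
# $1 = device, e.g. /dev/hda (placeholder, use the disk you suspect).
smart_check() {
    dev="$1"
    smartctl -i "$dev"        # identity, and whether SMART is available/enabled
    smartctl -s on "$dev"     # enable SMART if it is switched off
    smartctl -H "$dev"        # overall health verdict
    smartctl -A "$dev"        # vendor attribute table
    smartctl -l error "$dev"  # errors the drive itself has logged
}
```

    Call it as `smart_check /dev/hda`. The long self-test is left out on purpose because it takes hours; start it separately with `smartctl --test=long`.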
     
  5. djkoelkast

    djkoelkast New Member

    OK, so I installed it and enabled S.M.A.R.T. on this drive. It says:

    Code:
    === START OF ENABLE/DISABLE COMMANDS SECTION ===
    SMART Enabled.
    
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x00) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                 (10419) seconds.
    Offline data collection
    capabilities:                    (0x5b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        ( 174) minutes.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
      3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       518
      4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       9
      5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
      9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       231
     10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
    192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       18
    193 Load_Cycle_Count        0x0012   100   100   050    Old_age   Always       -       18
    194 Temperature_Celsius     0x0002   166   166   000    Old_age   Always       -       33 (Lifetime Min/Max 23/35)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]
    
    
    Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
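    For reading an attribute table like the one above, here is a small sketch that filters it down to the rows worth a second look. It assumes the ten-column layout shown in this listing; the function name smart_suspects and the choice of attribute IDs (5, 197, 198, 199) are this example's own, not something smartctl provides.

```shell
# Read `smartctl -A` output on stdin and print attributes that look suspicious:
#  - normalized VALUE at or below a non-zero THRESH, or
#  - a non-zero raw count for reallocated (5), pending (197),
#    offline-uncorrectable (198) or UDMA CRC error (199) attributes.
smart_suspects() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            id = $1; name = $2
            value = $4 + 0; thresh = $6 + 0; raw = $10 + 0
            if (thresh > 0 && value <= thresh)
                print name ": value " value " at or below threshold " thresh
            else if ((id == 5 || id == 197 || id == 198 || id == 199) && raw > 0)
                print name ": raw count " raw
        }'
}
```

    Usage: `smartctl -A /dev/hdb | smart_suspects`. Fed the table above it prints nothing, since every raw count it looks at is 0 and every VALUE is above its THRESH. That does not rule the disk out, it only means these attributes give no hint.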
    
    Code:
    
    server:~# smartctl --test=long /dev/hdb
    smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
    Home page is http://smartmontools.sourceforge.net/
    
    === START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
    Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
    Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
    Testing has begun.
    Please wait 174 minutes for test to complete.
    Test will complete after Wed Jul 11 19:19:30 2007
    
    Use smartctl -X to abort test.
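    Once the 174 minutes are over, the result has to be read back from the drive's self-test log. A sketch of that step follows; the function name selftest_ok is made up, and it assumes the newest log entry is the line starting with "# 1", as smartmontools 5.x prints it.

```shell
# Show the self-test log of a drive and report whether the most recent
# test finished cleanly.
# $1 = device (this thread tests /dev/hdb).
# Returns 0 only if the newest entry says "Completed without error";
# non-zero if it failed, is missing, or has not been logged yet.
selftest_ok() {
    log=$(smartctl -l selftest "$1")
    printf '%s\n' "$log"
    printf '%s\n' "$log" | grep '^# *1 ' | grep -q 'Completed without error'
}
```

    Usage: `selftest_ok /dev/hdb && echo "long test passed"`. If the test reports a read failure, the log also lists the LBA of the first error.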
    
    
     
