GPUs suddenly disappear - nvidia-smi/Boinc/BoincTasks

Discussion in 'Server Operation' started by danhansen@denmark, Aug 26, 2014.

  1. danhansen@denmark

    danhansen@denmark Member HowtoForge Supporter

    Hi,

    OS: Ubuntu Server 12.04.3
    Installed: CUDA5.5
    Installed: Boinc Client

    I installed Ubuntu Server 12.04 and CUDA 5.5 for number crunching with Boinc. I used the 12.04.3 update, since the 12.04.4 update doesn't work with CUDA!
    The system ran perfectly, using all 4 GPUs to crunch data. Suddenly, without anything being installed or updated, the GPUs are lost to Boinc!

    Please see this image:
    [IMG]

    Checking the hardware with this command shows:
    # lspci | grep -i nvidia
    Code:
    01:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 640 Rev. 2] (rev a1)
    01:00.1 Audio device: NVIDIA Corporation Device 0e0f (rev a1)
    02:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 640 Rev. 2] (rev a1)
    02:00.1 Audio device: NVIDIA Corporation Device 0e0f (rev a1)
    03:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 640 Rev. 2] (rev a1)
    03:00.1 Audio device: NVIDIA Corporation Device 0e0f (rev a1)
    04:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 640 Rev. 2] (rev a1)
    04:00.1 Audio device: NVIDIA Corporation Device 0e0f (rev a1)
    
    This command, which checks the attached GPUs, usually shows all 4 GPUs and their temperatures, but now nothing is found. The error message was something like this:
    # nvidia-smi -a |grep Gpu
    Code:
    NVIDIA: failed to load the NVIDIA kernel module.
    No signal to monitor either....
    There's no output to the monitor either, although the monitor itself works just fine. Putting an installation CD in the DVD drive and booting from it makes the signal to the monitor work again!?!?!? I'm totally lost! Could a "corrupt driver", or whatever it might be, cause this, so that you can't see anything on the monitor???
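
The error above says the NVIDIA kernel module failed to load, so a reasonable first check is whether the module is present in the running kernel. The following is a minimal shell sketch, assuming `lsmod` output is piped in on stdin. The function name `nvidia_module_loaded` is hypothetical, and nothing in this thread confirms that a missing module is the root cause.

```shell
# Hypothetical helper: succeeds (exit status 0) if the "nvidia" module
# appears in lsmod-style output read from stdin, fails otherwise.
# lsmod prints one module per line, with the module name in column 1.
nvidia_module_loaded() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}
```

On the affected machine this could be run as `lsmod | nvidia_module_loaded && echo loaded || echo missing`. If the module is missing, `dmesg | grep -i nvidia` may show why loading failed.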
     
    Last edited: Aug 26, 2014
  2. danhansen@denmark

    danhansen@denmark Member HowtoForge Supporter

    Dumped 12.04.3/CUDA5.5 and went along with 14.04.1/CUDA6.5 !!!

    Hi,


    Never mind this post! I went with Ubuntu Server 14.04.1 & CUDA v.6.5 and it works just fine. I had to change some minor things, but it's running, and it seems to be running pretty d... good ;)

    The command I usually use in my "WatchDog*" scripts to check the GPUs is
    # nvidia-smi -a | grep Gpu
    but this doesn't return any temperature anymore, only N/A.

    I modified the command to:
    # nvidia-smi -a | grep GPU
    and I'm able to use this output for the scripts. It's not the same, but it can be used. Does anybody know why this has changed???
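
For a watchdog script, grepping the human-readable `nvidia-smi -a` report can be fragile, because field labels may differ between driver versions (that this is what changed here is an assumption, not something confirmed in the thread). Below is a sketch of an alternative, assuming the installed `nvidia-smi` supports `--query-gpu=index,temperature.gpu --format=csv,noheader`. The function name `hot_gpus` and the 80 C limit are hypothetical.

```shell
# Hypothetical helper: print the index of every GPU whose temperature is at
# or above the limit given as the first argument (degrees Celsius).
# Reads "index, temperature" CSV lines on stdin, for example "0, 45".
# Lines whose temperature is not a plain number (such as N/A) are skipped.
hot_gpus() {
    limit_c="$1"
    awk -F', *' -v limit="$limit_c" '
        $2 ~ /^[0-9]+$/ && $2 + 0 >= limit + 0 { print $1 }
    '
}
```

In a watchdog script this might be called as `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader | hot_gpus 80`, acting on any index it prints.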
     
