Shell script & "nvidia-smi" - needs right command/flag!

Discussion in 'Programming/Scripts' started by danhansen@denmark, Jun 20, 2015.

  1. danhansen@denmark


    Hi friends,


    I've got a problem regarding a shell-script and the "nvidia-smi" command!

    I've made a script that acts as protection against CPU overheating on my Ubuntu Server 14.04.2. The script works nicely, but I need to make it work on my 4 GPUs as well.
    I'm pretty green when it comes to bash scripts, so I've been looking for commands which would make it easy for me to edit the script. I found and tested a lot of them, but none seem to give me the output I need! I'll show you the commands and their output below, and the script as well.

    What I need is a command which lists the GPUs the same way the "sensors" command from "lm-sensors" does, so that I can use "grep" to select a GPU and set the variable "newstr" (the two-digit temp.). I've been trying for a couple of days, but have had no luck, mostly because the commands "nvidia-smi -lso" and "nvidia-smi -lsa" don't exist anymore. I think they were experimental.

    Here are the commands I found and tested, with their output:

    This command shows the GPU socket number, which I could put into the string "str", but the problem is that the temp. is on the next line. I've been fiddling with the grep flag "-A 1" but haven't been able to work it into the script:
    Code:
    # nvidia-smi -q -d temperature | grep GPU
    Attached GPUs                       : 4
    GPU 0000:01:00.0
           GPU Current Temp            : 57 C
           GPU Shutdown Temp           : N/A
           GPU Slowdown Temp           : N/A
    GPU 0000:02:00.0
           GPU Current Temp            : 47 C
           GPU Shutdown Temp           : N/A
           GPU Slowdown Temp           : N/A
    GPU 0000:03:00.0
           GPU Current Temp            : 47 C
           GPU Shutdown Temp           : N/A
           GPU Slowdown Temp           : N/A
    GPU 0000:04:00.0
           GPU Current Temp            : 48 C
           GPU Shutdown Temp           : N/A
           GPU Slowdown Temp           : N/A

    This command shows the temp in the first line, but there's no GPU number!?
    Code:
    # nvidia-smi -q -d temperature | grep "GPU Current Temp"
           GPU Current Temp            : 58 C
           GPU Current Temp            : 47 C
           GPU Current Temp            : 47 C
           GPU Current Temp            : 48 C
    
    This command shows the temp. of the GPU you select, but there's still no output showing the GPU number/socket/ID!?
    Code:
    # nvidia-smi -q --gpu=0 | grep "GPU Current Temp"
    GPU Current Temp            : 59 C
    
    And this command shows the GPU number and the model on the same line!! But no temperature!!
    Code:
    # nvidia-smi -L
    GPU 0: GeForce GTX 750 Ti (UUID: GPU-9785c7c7-732f-1f51-..........)
    GPU 1: GeForce GTX 750 (UUID: GPU-b2b1a4a-4dca-0c7f-..........)
    GPU 2: GeForce GTX 750 (UUID: GPU-5e6b8efd-7531-777c-..........)
    GPU 3: GeForce GTX 750 Ti (UUID: GPU-5b2b1a2f-3635-2a1c-..........)
    
    And a command which shows the temp. of all 4 GPUs without anything else. But I still need the GPU number/socket/ID!?
    Code:
    # nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader
    58
    47
    47
    48
    

    What I'm wishing for! If I could get a command which produced output like this, I would be the happiest guy around:
    Code:
    GPU 0: GeForce GTX 750 Ti   GPU Current Temp            : 58 C
    GPU 1: GeForce GTX 750   GPU Current Temp            : 47 C
    GPU 2: GeForce GTX 750   GPU Current Temp            : 47 C
    GPU 3: GeForce GTX 750 Ti   GPU Current Temp            : 48 C
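
One possible way to get close to this layout, sketched under assumptions: the "--query-gpu" option used above also accepts the "index" and "name" fields, so index, model and temperature can be requested as one CSV row per GPU and reshaped with awk. The function name format_gpu_temps is made up for illustration, and it reads from stdin so it can be tried without a GPU.

```shell
# Reshape CSV rows of "index, name, temperature" into one line per GPU.
# format_gpu_temps is a hypothetical helper name, not part of nvidia-smi.
format_gpu_temps() {
    awk -F', ' '{ printf "GPU %s: %s   GPU Current Temp            : %s C\n", $1, $2, $3 }'
}

# On a machine with the NVIDIA driver installed (not executed here):
#   nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv,noheader | format_gpu_temps
```

Since each row then starts with "GPU <index>:", a single GPU can be selected with grep in the same way the script selects "Core $i:".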
    
    Here's the output of "sensors" from "lm-sensors". As you can see, the unit info and the temp. are on the same line:
    Code:
    # -----------------------------------------------------------
    # coretemp-isa-0000
    # Adapter: ISA adapter
    # Physical id 0:  +56.0°C  (high = +80.0°C, crit = +100.0°C)
    # Core 0:         +56.0°C  (high = +80.0°C, crit = +100.0°C)
    # Core 1:         +54.0°C  (high = +80.0°C, crit = +100.0°C)
    # Core 2:         +54.0°C  (high = +80.0°C, crit = +100.0°C)
    # Core 3:         +52.0°C  (high = +80.0°C, crit = +100.0°C)
    # -----------------------------------------------------------
    
    Here's the part of the script that needs changing. As mentioned at the top, this works using the "sensors" command from "lm-sensors". "lm-sensors" doesn't show the GPU temp. when running CUDA with the driver attached, so we need another command to list the GPUs and show their temps. If you know another way to fix my problem, please don't hesitate to show me:
    Code:
    [...]
    echo "JOB RUN AT $(date)"
    echo "======================================="
    
    echo ''
    echo 'CPU Warning Limit set to => '$1
    echo 'CPU Shutdown Limit set to => '$2
    echo ''
    echo ''
    
    sensors
    
    echo ''
    echo ''
    
    for i in 0 1 2 3
    do
    
     str=$(sensors | grep "Core $i:")
     newstr=${str:17:2}
    
     if [ "${newstr}" -ge "$1" ]
     then
       echo '===================================================================='         >>/home/......../logs/watchdogcputemp.log
       echo $(date)                                                                        >>/home/......../logs/watchdogcputemp.log
       echo ''                                                                             >>/home/......../logs/watchdogcputemp.log
       echo ' STATUS WARNING - NOTIFYING : TEMPERATURE CORE' $i 'EXCEEDED' $1 '=>' $newstr >>/home/......../logs/watchdogcputemp.log
       echo ' ACTION : EMAIL SENT'                                                         >>/home/......../logs/watchdogcputemp.log
       echo ''                                                                             >>/home/......../logs/watchdogcputemp.log
       echo '===================================================================='         >>/home/......../logs/watchdogcputemp.log
    
    # Status Warning Email Sending Code 
    # WatchdogCpuTemp Alert! Status Warning - Notifying!
    
    /usr/bin/msmtp -d --read-recipients </home/......../shellscripts/messages/watchdogcputempwarning.txt
    
       echo 'Email Sent.....'
     fi
    [...]
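
A minimal sketch of how the same warning check could look for the GPUs, assuming the temperatures arrive as CSV rows of "index, temperature" (for example from "nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader"). The function name check_gpu_temps and the message text are illustrative only; logging and the msmtp call from the script above are left out.

```shell
# Print a warning for every GPU whose temperature is at or above the limit.
# stdin: rows of "index, temperature"; $1: warning limit in degrees C.
# check_gpu_temps is a hypothetical helper name used for illustration.
check_gpu_temps() {
    limit=$1
    while IFS=', ' read -r idx temp; do
        case $temp in
            ''|*[!0-9]*) continue ;;   # skip empty or non-numeric readings
        esac
        if [ "$temp" -ge "$limit" ]; then
            echo "STATUS WARNING : TEMPERATURE GPU $idx EXCEEDED $limit => $temp"
        fi
    done
}

# On a machine with the NVIDIA driver installed (not executed here):
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader | check_gpu_temps "$1"
```

Reading the temperature as its own field avoids the fixed character offsets of "${str:17:2}", which only work while the column layout and the number of digits stay the same.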
    


    I hope there's a bash-script guru out there, ready to solve this issue.
    Have a nice weekend!

    Kind Regards,
    Dan Hansen
    Denmark

     
  2. danhansen@denmark


    Hi,

    No one here had a suggestion, but I've solved the problem with the help of a few guys on Ask Ubuntu. One of their suggestions solved the issue, at least for the moment.

    For others to use and learn from, here's how we got to the solution. My thanks to "Terdon":
    http://askubuntu.com/questions/638665/shell-script-nvidia-smi-needs-right-command-flag/641828#641828

    Here are the results on my Ubuntu Server 14.04.

    The first suggestion looks like this on my system:

    Code:
    # nvidia-smi -q -d temperature | awk '{if(/C$/){print last,$0};last=$0};'
      Temperature  GPU Current Temp  : 53 C
      Temperature  GPU Current Temp  : 45 C
      Temperature  GPU Current Temp  : 52 C
      Temperature  GPU Current Temp  : 51 C
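
The awk one-liner above prints each line ending in "C" together with the line just before it, and on this system that preceding line is the "Temperature" header rather than the GPU ID. A possible variant, sketched here with the made-up function name join_gpu_temps, remembers the most recent line starting with "GPU " instead. It assumes the report layout seen in this thread (a "GPU <bus id>" header followed by indented temperature lines).

```shell
# Join each "GPU Current Temp" line with the GPU header line it belongs to.
# join_gpu_temps is a hypothetical helper name; input is read from stdin.
join_gpu_temps() {
    awk '
        /^GPU / { gpu = $0 }          # remember the most recent GPU header line
        /GPU Current Temp/ {
            sub(/^[ \t]+/, "")        # drop the leading indentation
            print gpu "  " $0
        }
    '
}

# On a machine with the NVIDIA driver installed (not executed here):
#   nvidia-smi -q -d temperature | join_gpu_temps
```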

    And this one, which is just PERFECT, looks like this on my system:

    Code:
    # nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU
    GPU 0000:01:00.0  GPU Current Temp  : 53 C
    GPU 0000:02:00.0  GPU Current Temp  : 45 C
    GPU 0000:03:00.0  GPU Current Temp  : 52 C
    GPU 0000:04:00.0  GPU Current Temp  : 51 C

    Here I've got the GPU text to "grep" for in my script, I've got the GPU socket ID and, last but not least, I've got the temperature on the same line! Exactly what I asked for. I humbly bow ;)
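
For use in the script, the joined lines still have to be split into an ID and a number. A small sketch of one way to do that with sed, assuming lines shaped like the output above ("GPU <bus id>  GPU Current Temp  : <nn> C"); the function name gpu_id_and_temp is made up for illustration.

```shell
# Turn "GPU 0000:01:00.0  GPU Current Temp  : 53 C" into "0000:01:00.0 53".
# gpu_id_and_temp is a hypothetical helper name; lines that do not match are dropped.
gpu_id_and_temp() {
    sed -n 's/^GPU \([^ ]*\) .*: *\([0-9][0-9]*\) C$/\1 \2/p'
}

# Possible use inside the watchdog script (not executed here):
#   nvidia-smi -q -d temperature | grep GPU | perl -pe '/^GPU/ && s/\n//' | grep ^GPU |
#       gpu_id_and_temp | while read -r id temp; do
#           [ "$temp" -ge "$1" ] && echo "GPU $id is at $temp"
#       done
```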
     
