Nvidia-smi tips and tricks

From HPCWIKI
Revision as of 10:57, 27 August 2024 by Admin

Output example

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:D8:00.0 Off |                  Off |
| 30%   42C    P8              38W / 450W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Property name | Annotation | Meaning
Performance State | Perf | States range from P0 (maximum performance) to P12 (minimum performance).
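
The performance state can also be read without parsing the full table, because nvidia-smi offers a CSV query interface (`--query-gpu`). The following is a minimal sketch of one way to format that output. The function name `report_pstates` is hypothetical, and the function only reformats text piped into it, so it does not call nvidia-smi itself.

```shell
# Hypothetical helper: format lines such as "0, P8", i.e. the CSV printed by
#   nvidia-smi --query-gpu=index,pstate --format=csv,noheader
# into a short readable report. Reads stdin.
report_pstates() {
    while IFS=, read -r idx pstate; do
        # nvidia-smi puts a space after each comma; drop it.
        pstate=${pstate# }
        if [ -n "$idx" ]; then
            echo "GPU $idx: performance state $pstate"
        fi
    done
}

# Example usage on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,pstate --format=csv,noheader | report_pstates
```

For the idle RTX 4090 in the output example, the query would be expected to report P8, matching the Perf column.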

Disable or enable GPU

## Disable a GPU, where xx is the PCIe bus number reported by lspci
## (-pm 0 turns persistence mode off, -m 1 puts the device into drain state)
$ sudo nvidia-smi -i 0000:xx:00.0 -pm 0
$ sudo nvidia-smi drain -p 0000:xx:00.0 -m 1
## The device is still visible in lspci after running the commands above.

## Re-enable the GPU
$ sudo nvidia-smi drain -p 0000:xx:00.0 -m 0
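
The Bus-Id column in the output example shows an 8-digit PCI domain (00000000:D8:00.0), while the commands above use a 4-digit domain (0000:xx:00.0). The sketch below converts one form to the other. It assumes you want the same 4-digit form that the commands above use, and the function name `to_short_bus_id` is hypothetical.

```shell
# Hypothetical helper: shorten an nvidia-smi style bus ID
# (8-digit PCI domain, e.g. 00000000:D8:00.0) to the 4-digit-domain form.
# Only IDs whose first four domain digits are zero are shortened.
to_short_bus_id() {
    case "$1" in
        0000????:??:??.?) echo "${1#0000}" ;;  # drop the four leading zeros
        *)                echo "$1" ;;         # already short or unrecognized: pass through
    esac
}

# Example usage (requires the NVIDIA driver):
#   id=$(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader -i 0)
#   sudo nvidia-smi drain -p "$(to_short_bus_id "$id")" -m 1
```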

--gpus options

Using the --gpus option with Docker on Ubuntu

Without nvidia-container-toolkit, running docker with the --gpus option fails with the following error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Installing the nvidia-container-toolkit package resolves the issue:

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
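
After installing the package, a quick check helps confirm that the --gpus option now works. The sketch below classifies the error text of a failed run. The function name `diagnose_gpus_error` and its two result labels are hypothetical, and it matches only the error message quoted above.

```shell
# Hypothetical helper: classify the error text of a failed
# `docker run --gpus ...` invocation.
diagnose_gpus_error() {
    case "$1" in
        *"could not select device driver"*)
            echo "missing-toolkit" ;;   # nvidia-container-toolkit not installed or not picked up
        *)
            echo "other" ;;             # some unrelated failure
    esac
}

# Example usage on a Docker host (not run automatically here).
# Restarting the Docker daemon after installing the toolkit may be necessary:
#   sudo systemctl restart docker
#   err=$(docker run --rm --gpus all ubuntu nvidia-smi 2>&1 >/dev/null) \
#       || diagnose_gpus_error "$err"
```

If the container prints the usual nvidia-smi table instead, the GPU is reachable from inside Docker.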

Turn on / off ECC[1]

To turn off ECC

# nvidia-smi -g 0 --ecc-config=0
(repeat with -g x for each GPU ID)

To turn ECC back on

# nvidia-smi -g 0 --ecc-config=1
(repeat with -g x for each GPU ID)
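
According to the nvidia-smi man page, a changed ECC setting typically takes effect only after the next reboot, so the current and pending modes can differ. The sketch below lists GPUs with a pending change. The function name `ecc_pending_changes` is hypothetical, and it only processes CSV text piped into it.

```shell
# Hypothetical helper: read lines such as "0, Enabled, Disabled"
# (index, ecc.mode.current, ecc.mode.pending) from stdin and list the GPUs
# whose pending ECC mode differs from the current one.
ecc_pending_changes() {
    while IFS=, read -r idx current pending; do
        # Drop the space nvidia-smi prints after each comma.
        current=${current# }
        pending=${pending# }
        if [ -n "$idx" ] && [ "$current" != "$pending" ]; then
            echo "GPU $idx: $current -> $pending (pending)"
        fi
    done
}

# Example usage (requires the NVIDIA driver and an ECC-capable GPU):
#   nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending \
#       --format=csv,noheader | ecc_pending_changes
```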

To reset the ECC error counters[2]

# nvidia-smi -g 0 --reset-ecc-errors=TYPE
(TYPE is 0|VOLATILE or 1|AGGREGATE)

Reset GPU

# nvidia-smi -g 0 --gpu-reset
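
A reset requires that no application is using the GPU. The sketch below is one possible pre-check based on the list of compute process IDs. The function name `reset_precheck` is hypothetical, and it looks only at compute processes, so graphics clients such as an X server or other monitoring tools are not covered.

```shell
# Hypothetical helper: given the PID list printed by
#   nvidia-smi --query-compute-apps=pid --format=csv,noheader -i 0
# on stdin, report whether a reset can be attempted.
# Returns 0 when no compute process is listed, 1 otherwise.
reset_precheck() {
    count=0
    while read -r pid; do
        if [ -n "$pid" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 0 ]; then
        echo "no compute processes: reset can be attempted"
    else
        echo "$count compute process(es) still running: stop them first"
        return 1
    fi
}

# Example usage (requires root and the NVIDIA driver):
#   nvidia-smi --query-compute-apps=pid --format=csv,noheader -i 0 \
#       | reset_precheck && sudo nvidia-smi -i 0 --gpu-reset
```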

GPU mode

The mode of the GPU is established directly at power-on, from settings stored in the GPU’s non-volatile memory.

gpumodeswitch changes the mode of the GPU by updating the GPU’s non-volatile memory settings.


Compute mode is a configuration optimized for high-performance computing (HPC) applications. It can cause compatibility problems with operating systems and hypervisors when the GPU is used primarily as a graphics device.

Graphics mode

Compute mode

# nvidia-smi -g 0 -c <mode number>

Number | Mode | Meaning
0 | Default | The GPU can be shared by several jobs.
1 | Exclusive_Thread | Only one job may use the GPU, and only one thread of that job may run on it at a time (marked as deprecated in newer drivers).
2 | Prohibited | No job is allowed to run on the GPU.
3 | Exclusive_Process | Only one job may use the GPU, and only one process may run on it at a time.
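
The number-to-mode mapping in the table can be captured in a small lookup, which is handy in scripts that log or validate the mode before applying it. This is a minimal sketch and the function name `compute_mode_name` is hypothetical.

```shell
# Hypothetical helper: translate the numeric argument of `nvidia-smi -c`
# into the mode name from the table above.
# Returns 1 for any number outside the table.
compute_mode_name() {
    case "$1" in
        0) echo "Default" ;;
        1) echo "Exclusive_Thread" ;;
        2) echo "Prohibited" ;;
        3) echo "Exclusive_Process" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Example usage (requires root and the NVIDIA driver):
#   mode=3
#   echo "Setting GPU 0 to $(compute_mode_name "$mode")"
#   sudo nvidia-smi -i 0 -c "$mode"
#   nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader
```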

References