Nvidia-smi tips and tricks

From HPCWIKI

Output example

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:D8:00.0 Off |                  Off |
| 30%   42C    P8              38W / 450W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Property name | Annotation | Meaning
Performance State | Perf | Current performance state of the GPU. States range from P0 (maximum performance) to P12 (minimum performance).
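
The performance state can also be read in a script-friendly form instead of from the table output. The following is a minimal sketch, assuming the index and pstate fields of nvidia-smi --query-gpu; the helper name gpus_not_at_p0 is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list GPUs that are not in the P0 state.
# Reads "index, pstate" lines on stdin, as produced by
#   nvidia-smi --query-gpu=index,pstate --format=csv,noheader
gpus_not_at_p0() {
    awk -F', *' '$2 != "P0" { print $1, $2 }'
}

# Query the real tool only when it is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,pstate --format=csv,noheader | gpus_not_at_p0
fi
```

An idle GPU normally sits in a high-numbered state such as P8, so an empty result is only expected while every GPU is under load.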

Disable or enable GPU

## Disable the GPU, where xx is the PCIe bus number reported by lspci
## (persistence mode is turned off first, then the device is drained)
$ sudo nvidia-smi -i 0000:xx:00.0 -pm 0
$ sudo nvidia-smi drain -p 0000:xx:00.0 -m 1
## The device will still be visible with lspci after running the commands above.

## Enable the GPU
$ sudo nvidia-smi drain -p 0000:xx:00.0 -m 0
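
lspci prints addresses such as d8:00.0 without the PCI domain unless it is run with -D, while the commands above use the domain:bus:device.function form. The sketch below shows one way to convert between the two. It assumes the GPU sits in domain 0000, which is common on single-domain systems but not guaranteed, and full_bus_id is an illustrative name rather than an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: add the PCI domain to an lspci-style address.
# Assumption: the device is in domain 0000.
full_bus_id() {
    case "$1" in
        *:*:*) printf '%s\n' "$1" ;;        # domain already present
        *)     printf '0000:%s\n' "$1" ;;   # prepend the assumed domain
    esac
}

# Example use (not executed here):
#   sudo nvidia-smi drain -p "$(full_bus_id d8:00.0)" -m 1
```

Running lspci -D avoids the guess, because it prints the domain for every device.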

Docker --gpus option

Without nvidia-container-toolkit, running docker with the --gpus option fails with the following error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Installing the nvidia-container-toolkit package and then restarting the Docker daemon resolves the issue:

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
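
To confirm the fix, a throwaway container can be started with --gpus all. The sketch below distinguishes the error shown above from other failures. The ubuntu image is only an example, and both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the text is Docker's missing-GPU-driver error.
is_missing_gpu_driver_error() {
    printf '%s\n' "$1" | grep -q 'could not select device driver'
}

# Start a throwaway GPU container and report the outcome.
# The "ubuntu" image is an example choice, not a requirement.
check_docker_gpus() {
    if err=$(docker run --rm --gpus all ubuntu nvidia-smi 2>&1 >/dev/null); then
        echo "GPU containers work"
    elif is_missing_gpu_driver_error "$err"; then
        echo "nvidia-container-toolkit is missing or Docker was not restarted"
    else
        echo "docker failed for another reason: $err"
    fi
}
```

Call check_docker_gpus as a user who is allowed to talk to the Docker daemon.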

Turn on / off ECC[1]

To turn off ECC memory (the change takes effect after the next reboot)

# nvidia-smi -g 0 --ecc-config=0
(repeat with -g x for each GPU ID)

To turn ECC memory back on

# nvidia-smi -g 0 --ecc-config=1
(repeat with -g x for each GPU ID)
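
nvidia-smi reports both the current ECC mode and the pending one that applies after the next reboot, so a changed setting can be checked before rebooting. A minimal sketch, assuming the ecc.mode.current and ecc.mode.pending query fields; the helper name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list GPU indices whose pending ECC mode differs from
# the current one. Reads "index, current, pending" lines on stdin, as from
#   nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv,noheader
gpus_with_pending_ecc_change() {
    awk -F', *' '$2 != $3 { print $1 }'
}

# Query the real tool only when it is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending \
               --format=csv,noheader | gpus_with_pending_ecc_change
fi
```

GPUs without ECC support report the same placeholder value in both columns and are therefore not listed.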

To reset the ECC error counters[2]

# nvidia-smi -g 0 --reset-ecc-errors=TYPE
(TYPE is 0|VOLATILE or 1|AGGREGATE)

Reset GPU

# nvidia-smi -g 0 --gpu-reset
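
A reset requires that no application is using the GPU. The sketch below is one possible pre-check, assuming the pid field of --query-compute-apps. That query lists compute processes only, so graphics clients such as an X server are not covered. gpu_is_idle is an illustrative name.

```shell
#!/bin/sh
# Hypothetical helper: succeed when stdin contains no PIDs.
# Feed it the output of
#   nvidia-smi -i 0 --query-compute-apps=pid --format=csv,noheader
gpu_is_idle() {
    ! grep -q '[0-9]'
}

# Query the real tool only when it is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi -i 0 --query-compute-apps=pid --format=csv,noheader | gpu_is_idle; then
        echo "GPU 0 has no compute processes"
    else
        echo "GPU 0 is still in use"
    fi
fi
```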

GPU mode

The mode of the GPU is established directly at power-on, from settings stored in the GPU’s non-volatile memory.

gpumodeswitch changes the mode of the GPU by updating the GPU’s non-volatile memory settings.

Compute mode is a configuration optimized for high-performance computing (HPC) applications. It can cause compatibility problems with operating systems and hypervisors when the GPU is used primarily as a graphics device.

Graphics mode

Compute mode

# nvidia-smi -g 0 -c <mode number>

Number | Mode | Meaning
0 | Default | The GPU can be shared by several jobs at the same time.
1 | Exclusive_Thread | Only one job may run on the GPU, and only one thread of that job may use it at a time. Deprecated in recent drivers.
2 | Prohibited | No job is allowed to run on the GPU.
3 | Exclusive_Process | Only one job may run on the GPU, and only one process may use it at a time.
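
The numbers in the table can be translated to mode names in scripts. This is a small sketch of the mapping; compute_mode_name is a hypothetical helper, and the compute_mode query field in the trailing comment is assumed to be available in the installed driver.

```shell
#!/bin/sh
# Hypothetical helper: map a compute mode number to its name.
compute_mode_name() {
    case "$1" in
        0) echo "Default" ;;
        1) echo "Exclusive_Thread" ;;
        2) echo "Prohibited" ;;
        3) echo "Exclusive_Process" ;;
        *) echo "unknown compute mode: $1" >&2; return 1 ;;
    esac
}

# The mode currently in effect can be read back with:
#   nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader
```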

GPU parameter change

Enable or disable GPU's auto boost mode

$ sudo nvidia-smi --auto-boost-default=DISABLED


Set GPU clock speed

## Clocks are given as <memory clock>,<graphics clock> in MHz
$ sudo nvidia-smi --applications-clocks=2505,875
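
Only clock pairs that the GPU itself advertises can be passed to --applications-clocks. The following sketch checks a pair against that list, assuming the memory and graphics fields of --query-supported-clocks; is_supported_clock_pair is an illustrative name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the pair (memory MHz, graphics MHz) appears
# in the supported list read from stdin. The list is expected in the form
# "memory, graphics", one pair per line, as from
#   nvidia-smi -i 0 --query-supported-clocks=memory,graphics --format=csv,noheader,nounits
is_supported_clock_pair() {
    awk -F', *' -v m="$1" -v g="$2" \
        '$1 == m && $2 == g { found = 1 } END { exit !found }'
}

# Example use (not executed here):
#   nvidia-smi -i 0 --query-supported-clocks=memory,graphics \
#              --format=csv,noheader,nounits | is_supported_clock_pair 2505 875
```

The default clocks can be restored afterwards with sudo nvidia-smi --reset-applications-clocks.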

References