Nvidia GPU Tips and Tricks

From HPCWIKI

Latest revision as of 09:18, 10 March 2025

XID 62 or 119 error

In many observed cases, XID 62 and XID 119 errors are caused by an incompatible driver version. Install the GPU driver version that matches the CUDA version used by the customer's application.

# Remove the Nvidia GPU driver on Ubuntu
$ sudo apt-get remove --purge '^nvidia-.*'

# Find a suitable driver with apt search, or download one from Nvidia, then install and test it
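
To check whether the installed driver is new enough for a given CUDA application, a version comparison along the following lines may help. This is a minimal sketch, not an official procedure. The helper names version_ge and installed_driver_version are made up for illustration. GNU sort with -V is assumed (standard on Ubuntu). The required minimum driver version has to be looked up in Nvidia's CUDA release notes for the toolkit in use.

```shell
# Hypothetical helpers for illustration; not part of any Nvidia tool.

# Succeeds when version $1 is greater than or equal to version $2.
# Assumes GNU sort, which provides -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints the driver version reported for the first GPU.
# Needs a working nvidia-smi, so it fails if the driver is not loaded.
installed_driver_version() {
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
}
```

For example, `version_ge "$(installed_driver_version)" 525.60.13 || echo "driver too old"` prints a warning when the driver is older than the given minimum. The value 525.60.13 is only a placeholder here.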

Xid errors along with the potential causes for each[1]

XID | Nvidia GPU Failure | Linux Kernel message | Causes
Cause columns: HW Error, Driver Error, User App Error, System Memory Corruption, Bus Error, Thermal Issue, FB Corruption
1 Invalid or corrupted push buffer stream X X X X
2 Invalid or corrupted push buffer stream X X X X
3 Invalid or corrupted push buffer stream X X X X
4 Invalid or corrupted push buffer stream X X X X
4 GPU semaphore timeout X X X X X
5 Unused
6 Invalid or corrupted push buffer stream X X X X
7 Invalid or corrupted push buffer address X X X
8 GPU stopped processing X X X X
9 Driver error programming GPU X
10 Unused
11 Invalid or corrupted push buffer stream X X X X
12 Driver error handling GPU exception X
13 Graphics Engine Exception X X X X X X
14 Unused
15 Unused
16 Display engine hung X
17 Unused
18 Bus mastering disabled in PCI Config Space X
19 Display Engine error X
20 Invalid or corrupted Mpeg push buffer X X X X
21 Invalid or corrupted Motion Estimation push buffer X X X X
22 Invalid or corrupted Video Processor push buffer X X X X
23 Unused
24 GPU semaphore timeout X X X X X X
25 Invalid or illegal push buffer stream X X X X X
26 Framebuffer timeout X
27 Video processor exception X
28 Video processor exception X
29 Video processor exception X
30 GPU semaphore access error X
31 GPU memory page fault X X
32 Invalid or corrupted push buffer stream X X X X X
33 Internal micro-controller error X
34 Video processor exception X
35 Video processor exception X
36 Video processor exception X
37 Driver firmware error X X X
38 Driver firmware error X
39 Unused
40 Unused
41 Unused
42 Video processor exception X
43 GPU stopped processing X X
44 Graphics Engine fault during context switch X
45 Preemptive cleanup, due to previous errors. Most likely seen when running multiple CUDA applications and hitting a DBE (double-bit ECC error). The kernel usually logs "sched: RT throttling activated" before XID 45. X
46 GPU stopped processing X
47 Video processor exception X
48 Double Bit ECC Error X
49 Unused
50 Unused
51 Unused
52 Unused
53 Unused
54 Auxiliary power is not connected to the GPU board
55 Unused
56 Display Engine error X X
57 Error programming video memory interface X X X
58 Unstable video memory interface detected X X
58 EDC error – clarified in printout X
59 Internal micro-controller error (older drivers) X
60 Video processor exception X
61 Internal micro-controller breakpoint/warning (newer drivers)
62 Internal micro-controller halt (newer drivers) X X X
63 ECC page retirement or row remapping recording event X X X
64 ECC page retirement or row remapper recording failure X X
65 Video processor exception X X
66 Illegal access by driver X X
67 Illegal access by driver X X
68 NVDEC0 Exception X X
69 Graphics Engine class error X X
70 CE3: Unknown Error X X
71 CE4: Unknown Error X X
72 CE5: Unknown Error X X
73 NVENC2 Error X X
74 NVLINK Error X X X
75 CE6: Unknown Error X X
76 CE7: Unknown Error X X
77 CE8: Unknown Error X X
78 vGPU Start Error X
79 GPU has fallen off the bus X X X X X
80 Corrupted data sent to GPU X X X X X
81 VGA Subsystem Error X
82 NVJPG0 Error X X
83 NVDEC1 Error X X
84 NVDEC2 Error X X
85 CE9: Unknown Error X X
86 OFA Exception X X
87 Reserved
88 NVDEC3 Error X X
89 NVDEC4 Error X X
90 Reserved
91 Reserved
92 High single-bit ECC error rate X X
93 Non-fatal violation of provisioned InfoROM wear limit X X
94 Contained ECC error X X X
95 Uncontained ECC error X X X
96 NVDEC5 Error X X
97 NVDEC6 Error X X
98 NVDEC7 Error X X
99 NVJPG1 Error X X
100 NVJPG2 Error X X
101 NVJPG3 Error X X
102 NVJPG4 Error X X
103 NVJPG5 Error X X
104 NVJPG6 Error X X
105 NVJPG7 Error X X
106 SMBPBI Test Message X
107 SMBPBI Test Message Silent X
108-109 Reserved
110 Security Fault Error X
111 Display Bundle Error Event X X X
112 Display Supervisor Error X X
113 DP Link Training Error X X
114 Display Pipeline Underflow Error X X X
115 Display Core Channel Error X X
116 Display Window Channel Error X X
117 Display Cursor Channel Error X X
118 Display Pixel Pipeline Error X X
119 GSP RPC Timeout X X X X X X
120 GSP Error X X X X X X
121 Reserved
122 SPI PMU RPC Read Failure X X
123 SPI PMU RPC Write Failure X X
124 SPI PMU RPC Erase Failure X X
125 Inforom FS Failure X X
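
To find out which XIDs a machine has actually reported, the kernel log can be filtered for the driver's Xid messages and the numbers looked up in the table above. The following is a rough sketch under stated assumptions. It assumes the messages look like "NVRM: Xid (PCI:0000:3b:00): 79, ..." and the exact layout may differ between driver versions. The function names extract_xids, count_xids and describe_xid are invented for this example, and describe_xid covers only a handful of rows from the table.

```shell
# Hypothetical helpers for illustration; not part of the Nvidia driver.

# Reads kernel log text on stdin and prints "<device> <xid>" for each
# line that looks like "NVRM: Xid (<device>): <number>...".
extract_xids() {
    sed -n 's/.*NVRM: Xid (\([^)]*\)): \([0-9][0-9]*\).*/\1 \2/p'
}

# Reads extract_xids output on stdin and prints "<xid> <count>",
# sorted by XID number.
count_xids() {
    awk '{ seen[$2]++ } END { for (x in seen) print x, seen[x] }' | sort -n
}

# Maps a few XID numbers to the failure names listed in the table above.
describe_xid() {
    case "$1" in
        45)  echo "Preemptive cleanup, due to previous errors" ;;
        48)  echo "Double Bit ECC Error" ;;
        62)  echo "Internal micro-controller halt" ;;
        79)  echo "GPU has fallen off the bus" ;;
        119) echo "GSP RPC Timeout" ;;
        *)   echo "See the table above" ;;
    esac
}
```

Typical use would be `sudo dmesg | extract_xids | count_xids` or `journalctl -k | extract_xids | count_xids`, followed by `describe_xid 119` for any number of interest.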

References

1. https://docs.nvidia.com/deploy/xid-errors/index.html