Nvidia GPU Tips and Tricks

From HPCWIKI

The table below lists Xid errors along with the potential causes of each.[1]

Columns: XID; Nvidia GPU failure (Linux kernel message); Causes
Causes: HW Error, Driver Error, User App Error, System Memory Corruption, Bus Error, Thermal Issue, FB Corruption
1 Invalid or corrupted push buffer stream X X X X
2 Invalid or corrupted push buffer stream X X X X
3 Invalid or corrupted push buffer stream X X X X
4 Invalid or corrupted push buffer stream X X X X
4 GPU semaphore timeout X X X X X
5 Unused
6 Invalid or corrupted push buffer stream X X X X
7 Invalid or corrupted push buffer address X X X
8 GPU stopped processing X X X X
9 Driver error programming GPU X
10 Unused
11 Invalid or corrupted push buffer stream X X X X
12 Driver error handling GPU exception X
13 Graphics Engine Exception X X X X X X
14 Unused
15 Unused
16 Display engine hung X
17 Unused
18 Bus mastering disabled in PCI Config Space X
19 Display Engine error X
20 Invalid or corrupted Mpeg push buffer X X X X
21 Invalid or corrupted Motion Estimation push buffer X X X X
22 Invalid or corrupted Video Processor push buffer X X X X
23 Unused
24 GPU semaphore timeout X X X X X X
25 Invalid or illegal push buffer stream X X X X X
26 Framebuffer timeout X
27 Video processor exception X
28 Video processor exception X
29 Video processor exception X
30 GPU semaphore access error X
31 GPU memory page fault X X
32 Invalid or corrupted push buffer stream X X X X X
33 Internal micro-controller error X
34 Video processor exception X
35 Video processor exception X
36 Video processor exception X
37 Driver firmware error X X X
38 Driver firmware error X
39 Unused
40 Unused
41 Unused
42 Video processor exception X
43 GPU stopped processing X X
44 Graphics Engine fault during context switch X
45 Preemptive cleanup, due to previous errors. Most likely seen when running multiple CUDA applications and hitting a double-bit ECC error (DBE). The kernel usually logs "sched: RT throttling activated" before XID 45. X
46 GPU stopped processing X
47 Video processor exception X
48 Double Bit ECC Error X
49 Unused
50 Unused
51 Unused
52 Unused
53 Unused
54 Auxiliary power is not connected to the GPU board
55 Unused
56 Display Engine error X X
57 Error programming video memory interface X X X
58 Unstable video memory interface detected X X
58 EDC error – clarified in printout X
59 Internal micro-controller error (older drivers) X
60 Video processor exception X
61 Internal micro-controller breakpoint/warning (newer drivers)
62 Internal micro-controller halt (newer drivers) X X X
63 ECC page retirement or row remapping recording event X X X
64 ECC page retirement or row remapper recording failure X X
65 Video processor exception X X
66 Illegal access by driver X X
67 Illegal access by driver X X
68 NVDEC0 Exception X X
69 Graphics Engine class error X X
70 CE3: Unknown Error X X
71 CE4: Unknown Error X X
72 CE5: Unknown Error X X
73 NVENC2 Error X X
74 NVLINK Error X X X
75 CE6: Unknown Error X X
76 CE7: Unknown Error X X
77 CE8: Unknown Error X X
78 vGPU Start Error X
79 GPU has fallen off the bus X X X X X
80 Corrupted data sent to GPU X X X X X
81 VGA Subsystem Error X
82 NVJPG0 Error X X
83 NVDEC1 Error X X
84 NVDEC2 Error X X
85 CE9: Unknown Error X X
86 OFA Exception X X
87 Reserved
88 NVDEC3 Error X X
89 NVDEC4 Error X X
90 Reserved
91 Reserved
92 High single-bit ECC error rate X X
93 Non-fatal violation of provisioned InfoROM wear limit X X
94 Contained ECC error X X X
95 Uncontained ECC error X X X
96 NVDEC5 Error X X
97 NVDEC6 Error X X
98 NVDEC7 Error X X
99 NVJPG1 Error X X
100 NVJPG2 Error X X
101 NVJPG3 Error X X
102 NVJPG4 Error X X
103 NVJPG5 Error X X
104 NVJPG6 Error X X
105 NVJPG7 Error X X
106 SMBPBI Test Message X
107 SMBPBI Test Message Silent X
108-109 Reserved
110 Security Fault Error X
111 Display Bundle Error Event X X X
112 Display Supervisor Error X X
113 DP Link Training Error X X
114 Display Pipeline Underflow Error X X X
115 Display Core Channel Error X X
116 Display Window Channel Error X X
117 Display Cursor Channel Error X X
118 Display Pixel Pipeline Error X X
119 GSP RPC Timeout X X X X X X
120 GSP Error X X X X X X
121 Reserved
122 SPI PMU RPC Read Failure X X
123 SPI PMU RPC Write Failure X X
124 SPI PMU RPC Erase Failure X X
125 InfoROM FS Failure X X
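
To use the table above when triaging a node, it can help to pull the Xid number out of each kernel log line and look up its description. The following is a minimal Python sketch, not an official tool. It assumes the driver reports Xids in the form "NVRM: Xid (PCI:0000:3b:00): 79, ..." (with the "PCI:" prefix absent on older drivers). The names XID_PATTERN, XID_DESCRIPTIONS and find_xid_events are hypothetical, and the sample log lines are made up for illustration. Only a few rows of the table are copied into the lookup.

```python
import re

# Assumed message shape (not taken from this page), e.g.:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
# The "PCI:" prefix is treated as optional.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<device>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
)

# A small excerpt of the table above; extend it with the rows you care about.
XID_DESCRIPTIONS = {
    13: "Graphics Engine Exception",
    31: "GPU memory page fault",
    45: "Preemptive cleanup, due to previous errors",
    48: "Double Bit ECC Error",
    79: "GPU has fallen off the bus",
}


def find_xid_events(log_lines):
    """Return a (device, xid, description) tuple for each line with an Xid report."""
    events = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match is None:
            continue  # not an Xid line
        xid = int(match.group("xid"))
        description = XID_DESCRIPTIONS.get(xid, "not in excerpt; see table")
        events.append((match.group("device"), xid, description))
    return events


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus.",
        "[1234.6] sched: RT throttling activated",
        "[1240.0] NVRM: Xid (0000:03:00): 45, Ch 00000010",
    ]
    for device, xid, description in find_xid_events(sample):
        print(f"{device} Xid {xid}: {description}")
```

In practice the input lines would come from the kernel log (for example the output of dmesg or journalctl -k) rather than a hard-coded list. Keeping the lookup as a plain dictionary makes it easy to add rows as new Xid codes appear.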

References