AER (Advanced Error Reporting)
NVMe AER Issues
There are many community reports of AER errors at boot on various systems [1], with the following error or a similar one:
Bad TLP associated with device xxxx:xx:x
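To see which devices such messages point at, the kernel log can be scanned for lines that mention AER and tallied by PCI address. The following is only a minimal sketch: the function name and the sample lines are made up for illustration and were not captured from a real system, and the exact wording of AER messages varies between kernel versions.

```python
import re
from collections import Counter

# PCI address in domain:bus:device.function form, e.g. 0000:01:00.0
PCI_ADDR = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")

def count_aer_devices(log_lines):
    """Count, per PCI address, the log lines that mention AER."""
    counts = Counter()
    for line in log_lines:
        if "AER" not in line:
            continue
        # set() so an address repeated within one line is counted once
        for addr in set(PCI_ADDR.findall(line)):
            counts[addr.lower()] += 1
    return counts

# Hypothetical sample lines, written for this example only.
sample = [
    "pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0",
    "nvme 0000:02:00.0: AER: Bad TLP",
    "usb 1-1: new high-speed USB device",
]

if __name__ == "__main__":
    for addr, n in count_aer_devices(sample).most_common():
        print(addr, n)
```

In practice the lines would come from the output of dmesg or journalctl -k rather than from a hard-coded list. An address that dominates the tally is the first place to look.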
In this particular case, that means something goes wrong when the PCIe controller uses memory-mapped access (described under pci=nommconf below) to reach the configuration space of a particular device.
It may be a hardware bug in the device, in the PCIe root controller on the motherboard, in the specific interaction of those two, or something else.
Kernel param, pci=noaer
The pci=noaer parameter tells the kernel not to use AER, so these errors are no longer reported. Without it, each error report goes into a log file and raises a time-consuming interrupt request (IRQ) to the central processor. A rapid flow of error reports could thus flood the drive with log writes and clog NVMe bandwidth, slowing or even halting bootup. Note that the parameter only silences the reports; whatever causes the errors is still there.
The nvme 0000:xx:xx.x prefix on an AER message identifies the error as coming from the NVMe M.2 device's connection to the PCIe bus.
So the NVMe drive itself may be healthy, but there could be trouble brewing in the PCIe subsystem.
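After editing the boot loader configuration it is worth confirming that the parameter actually reached the running kernel. The sketch below parses a kernel command line and collects everything passed through pci=, which takes a comma-separated list of options. The helper names are hypothetical; /proc/cmdline is the standard place where Linux exposes the command line it was booted with.

```python
def pci_options(cmdline):
    """Return the set of options given via pci=... on a kernel command line."""
    options = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= takes a comma-separated list of options
            options.update(opt for opt in token[len("pci="):].split(",") if opt)
    return options

def running_cmdline(path="/proc/cmdline"):
    """Read the command line of the currently running kernel."""
    with open(path) as f:
        return f.read()

if __name__ == "__main__":
    opts = pci_options(running_cmdline())
    print("pci=noaer active:", "noaer" in opts)
```

If the check prints False after a reboot, the boot loader change most likely did not take effect, for example because the configuration was not regenerated.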
Kernel param, pci=nommconf[2]
The kernel option pci=nommconf disables Memory-Mapped PCI Configuration Space, which has been available in Linux since kernel 2.6. Very roughly, every PCI device has an area that describes the device (which you can see with lspci -vv), and the original method of accessing this area goes through I/O ports, while PCIe allows this space to be mapped into memory for simpler access.
With pci=nommconf, the configuration space of all devices is accessed in the original way, and changing the access method works around the AER problem.
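The configuration space described above can also be inspected directly, since Linux exposes each device's copy as a binary file named config under /sys/bus/pci/devices/. The sketch below decodes only the first two fields, the 16-bit vendor and device IDs stored little-endian at offsets 0x00 and 0x02. The function names are hypothetical and the address in the usage line is a placeholder, not a real device.

```python
import struct

def parse_config_ids(config_bytes):
    """Return (vendor_id, device_id) from the start of a PCI config space dump."""
    if len(config_bytes) < 4:
        raise ValueError("need at least 4 bytes of configuration space")
    # Offsets 0x00 and 0x02 hold two little-endian 16-bit fields.
    vendor_id, device_id = struct.unpack_from("<HH", config_bytes, 0)
    return vendor_id, device_id

def read_config(addr):
    """Read the raw configuration space of the device at a PCI address."""
    with open(f"/sys/bus/pci/devices/{addr}/config", "rb") as f:
        return f.read()

if __name__ == "__main__":
    # "0000:01:00.0" is a placeholder; substitute an address from lspci.
    vendor, device = parse_config_ids(read_config("0000:01:00.0"))
    print(f"vendor={vendor:04x} device={device:04x}")
```

The script reads the same file whichever access method the kernel uses underneath, so it offers a quick check that devices remain visible after booting with pci=nommconf.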