AER (Advanced Error Reporting)
AER (Advanced Error Reporting) is an optional PCI Express feature that allows more detailed reporting and control of errors than the basic error reporting scheme. AER errors are categorized as either correctable or uncorrectable. [1]
- A correctable error is recovered by the PCI Express protocol without the need for software intervention and without any risk of data loss.
- An uncorrectable error can be either fatal or non-fatal.
  - A fatal, uncorrectable error causes the link to become unreliable.
  - A non-fatal, uncorrectable error results in an unreliable transaction.
The AER driver in the Linux kernel drives the reporting of (and recovery from) these events.
- In the case of a correctable event, the AER driver simply logs a message that the event was encountered and recovered by hardware.
- Should a device experience an uncorrectable error, the AER driver invokes the appropriate recovery routines in the device driver that controls the affected device. Device drivers register these recovery routines when they are initialized, and the routines can be used, for example, to recover the link after a fatal error.
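The correctable, non-fatal, and fatal categories above can also be inspected per device from user space. The following is a minimal sketch, assuming a kernel (4.19 or later) that exposes the `aer_dev_correctable`, `aer_dev_nonfatal`, and `aer_dev_fatal` files under `/sys/bus/pci/devices/<address>/`. The helper names `parse_aer_counters` and `read_device_aer` are hypothetical, and the sample counter values are made up for illustration.

```python
from pathlib import Path

# Per-device AER counter files (assumed present; they are missing on older
# kernels or on devices without AER support).
AER_STAT_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")


def parse_aer_counters(text):
    """Parse 'Name count' lines into a dict, skipping malformed lines."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_device_aer(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return {file name: counters} for one device, e.g. bdf='0000:01:00.0'."""
    result = {}
    for name in AER_STAT_FILES:
        path = Path(sysfs_root) / bdf / name
        if path.is_file():
            result[name] = parse_aer_counters(path.read_text())
    return result


if __name__ == "__main__":
    # Illustrative sample only; real values come from the sysfs files.
    sample = "RxErr 0\nBadTLP 12\nBadDLLP 3\nTOTAL_ERR_COR 15\n"
    print(parse_aer_counters(sample))
```

A steadily growing `BadTLP` count in `aer_dev_correctable` would point at the kind of issue described in the next section.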
Bad TLP Issues
Bad TLP errors involving PCIe devices such as M.2 drives and GPUs have been reported and discussed on many platforms and forums. [2][3][4][5]
NVMe AER issue
There are many community reports of AER errors at boot on various systems [6], with the following message or a similar one:
Bad TLP associated with device xxxx:xx:x
In this particular case, it means that something goes wrong when the PCIe controller uses memory-mapped configuration access (described under pci=nommconf below) to reach the configuration space of a particular device.
It may be a hardware bug in the device, in the PCIe root controller on the motherboard, in the specific interaction of those two, or something else.
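To see which device and which severity dominate, the kernel log can be summarized. The sketch below is an assumption-laden example: it assumes log lines of the form `<driver> <address>: PCIe Bus Error: severity=..., type=...`, which is the shape commonly quoted in the reports cited above but may differ between kernel versions. The function name `summarize_bus_errors` and the sample text are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, for example:
#   pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
BUS_ERROR = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)


def summarize_bus_errors(log_text):
    """Count PCIe bus errors per (device address, severity, layer)."""
    counts = Counter()
    for match in BUS_ERROR.finditer(log_text):
        counts[(match["bdf"], match["severity"], match["layer"])] += 1
    return counts


if __name__ == "__main__":
    # Illustrative sample, not taken from a real system.
    sample = (
        "pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5\n"
        "pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, "
        "type=Data Link Layer, (Receiver ID)\n"
        "pcieport 0000:00:1c.5:    [ 6] BadTLP\n"
    )
    for key, count in summarize_bus_errors(sample).items():
        print(key, count)
```

In practice the input could be the saved output of `dmesg` or `journalctl -k`. Many corrected errors from a single address suggest a problem on that specific link rather than a system-wide fault.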
Kernel param, pci=noaer [7]
The pci=noaer parameter disables the kernel's use of PCIe Advanced Error Reporting, so AER errors are no longer reported. With AER enabled, each error report is written to a log file and raises a time-consuming interrupt request (IRQ) that the CPU must service. A rapid stream of error reports can therefore flood the log on the drive and consume NVMe bandwidth, slowing or even halting boot. Note that the parameter only silences the reports; it does not remove the underlying cause.
An AER message prefixed with nvme 0000:xx:xx.x identifies the error as coming from the NVMe M.2 device's connection to the PCIe bus.
So the NVMe drive itself may be healthy, but there could be trouble brewing in the PCIe subsystem.
Kernel param, pcie_aspm=off [8]
Forcibly enable or disable PCIe Active State Power Management (ASPM):
- off : Disable ASPM.
- force : Enable ASPM even on devices that claim not to support it. (WARNING: Forcing ASPM on may cause system lockups)
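Before and after changing this parameter, it can help to check which ASPM policy the kernel reports. The following is a minimal sketch, assuming the policy is exposed at `/sys/module/pcie_aspm/parameters/policy` with the active entry in square brackets (for example `[default] performance powersave powersupersave`). The file may be missing on kernels built without ASPM support, and the helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumed location of the ASPM policy list on Linux.
POLICY_FILE = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_aspm_policy(path=POLICY_FILE):
    """Read the policy file; return None when it cannot be read."""
    try:
        return active_aspm_policy(Path(path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("Active ASPM policy:", read_aspm_policy())
```

This only reports the global policy; per-link ASPM state is shown in the LnkCtl lines of `lspci -vv`.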
Kernel param, pci=nommconf [9]
The kernel option pci=nommconf disables Memory-Mapped PCI Configuration Space (MMCONFIG), which has been available in Linux since kernel 2.6. Very roughly, every PCI device has an area that describes the device (which you can see with lspci -vv). The original method of accessing this area goes through I/O ports, while PCIe allows the space to be mapped into memory for simpler access.
With pci=nommconf, the configuration space of all devices is accessed in the original way, and changing the access method works around the AER problem.
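To confirm which of the three parameters discussed here are actually in effect on a running system, the kernel command line can be inspected. The sketch below assumes the command line is readable from `/proc/cmdline` and that `pci=` may carry a comma-separated option list (such as `pci=noaer,nommconf`). The function name `active_workarounds` is hypothetical.

```python
import shlex

# The three workaround parameters covered in this article.
WORKAROUNDS = ("pci=noaer", "pcie_aspm=off", "pci=nommconf")


def active_workarounds(cmdline):
    """Return the workaround parameters present on a kernel command line."""
    found = set()
    for token in shlex.split(cmdline):
        key, _, value = token.partition("=")
        if key == "pci":
            # pci= accepts several comma-separated options in one token.
            found.update(f"pci={opt}" for opt in value.split(","))
        else:
            found.add(token)
    return [w for w in WORKAROUNDS if w in found]


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            print(active_workarounds(handle.read()))
    except OSError:
        print("Could not read /proc/cmdline")
```

An empty list means none of the workarounds is active, which is the expected state on a system that has not been modified.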
Could not boot from M.2 due to Bad TLP in Recovery Mode[10]
A well-known workaround is:
- Add "pcie_aspm=off" to the GRUB_CMDLINE_LINUX parameter in the /etc/default/grub file.
- Run sudo update-grub.
- Reboot.
If pcie_aspm=off does not help, try pci=noaer in the same way, and keep it permanently if it works.
The Ubuntu team is aware of this and tracks it as a Linux kernel bug. Read more: Bug_track_ubuntu_PCIe bus error
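The edit to /etc/default/grub can be scripted so that it is repeatable and does not add the same parameter twice. This is a minimal sketch that works on the file's text only and assumes a double-quoted `GRUB_CMDLINE_LINUX="..."` assignment. The function name `add_kernel_param` is hypothetical, and the sample file content is illustrative.

```python
import re


def add_kernel_param(grub_text, param, key="GRUB_CMDLINE_LINUX"):
    """Return grub_text with param added to key's value (no duplicates).

    Only a double-quoted assignment (KEY="...") is recognized; if none is
    found, a new assignment is appended at the end of the text.
    """
    pattern = re.compile(rf'^({re.escape(key)}=")([^"]*)(")', re.MULTILINE)

    def repl(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, count = pattern.subn(repl, grub_text)
    if count == 0:
        new_text = grub_text.rstrip("\n") + f'\n{key}="{param}"\n'
    return new_text


if __name__ == "__main__":
    # Illustrative file content, not read from disk.
    sample = (
        "GRUB_DEFAULT=0\n"
        'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
        'GRUB_CMDLINE_LINUX=""\n'
    )
    print(add_kernel_param(sample, "pcie_aspm=off"))
```

The result should be reviewed and written back with root privileges (keeping a backup of the original file), followed by sudo update-grub and a reboot as in the steps above.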
References
- [1] https://www.plda.com/pcie-glossary/aer
- [2] https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/237130/gtx-1080-throwing-bad-tlp-pcie-bus-errors/
- [3] https://www.google.com/search?q=bad+TLP+AER+on+boot
- [4] https://rog-forum.asus.com/t5/promotions-general-discussions/pcie-bus-error-bad-tlp-bad-dllp/m-p/842800
- [5] https://askubuntu.com/questions/1209597/os-is-not-loading-due-to-bad-tlp-in-recovery-mode
- [6] https://forums.linuxmint.com/viewtopic.php?t=380602
- [7] https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html
- [8] https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html
- [9] https://unix.stackexchange.com/questions/327730/what-causes-this-pcieport-00000003-0-pcie-bus-error-aer-bad-tlp
- [10] https://askubuntu.com/questions/771899/pcie-bus-error-severity-corrected