AER (Advanced Error Reporting): Difference between revisions

Revision as of 08:53, 3 April 2023

AER (Advanced Error Reporting) is an optional PCI Express feature that provides more detailed error reporting and finer control over errors than the baseline error reporting scheme. AER errors are categorized as either correctable or uncorrectable. ^[1]

A correctable error is recovered by the PCI Express protocol itself, without software intervention and without any risk of data loss.
An uncorrectable error can be either fatal or non-fatal:
- A fatal, uncorrectable error causes the link to become unreliable.
- A non-fatal, uncorrectable error results in an unreliable transaction, while the link itself remains reliable.

The AER driver in the Linux kernel drives the reporting of (and recovery from) these events. In the case of a correctable event, the AER driver simply logs a message that the event was encountered and recovered by hardware. Device drivers can be instrumented to register recovery routines when they are initialized. Should a device experience an uncorrectable error, the AER driver will invoke the appropriate recovery routines in the device driver that controls the affected device. These routines can be used to recover the link for a fatal error, for example.
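The dispatch described above can be modelled in a few lines of Python. This is only a toy sketch of the control flow, not kernel code: the class `ToyAerDriver`, the function `nvme_recovery`, and the strings they return are all hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Callable, Dict, List


class Severity(Enum):
    CORRECTABLE = "correctable"
    NONFATAL = "uncorrectable, non-fatal"
    FATAL = "uncorrectable, fatal"


class ToyAerDriver:
    """Hypothetical stand-in for the kernel AER driver (illustration only)."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Callable[[Severity], str]] = {}
        self.log: List[str] = []

    def register(self, device: str, handler: Callable[[Severity], str]) -> None:
        # A device driver registers its recovery routine when it initializes.
        self.handlers[device] = handler

    def report(self, device: str, severity: Severity) -> str:
        if severity is Severity.CORRECTABLE:
            # Hardware has already recovered; the AER driver only logs the event.
            self.log.append(f"{device}: corrected by hardware")
            return "logged"
        handler = self.handlers.get(device)
        if handler is None:
            self.log.append(f"{device}: {severity.value}, no recovery routine")
            return "unhandled"
        # Uncorrectable error: hand over to the device driver's recovery routine.
        return handler(severity)


def nvme_recovery(severity: Severity) -> str:
    # Assumed policy for this sketch: reset the link on a fatal error,
    # otherwise only deal with the failed transaction.
    return "reset link" if severity is Severity.FATAL else "retry transaction"
```

In this model, a correctable event never reaches the device driver, which mirrors the description above: only uncorrectable errors trigger a registered recovery routine.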

NVMe AER Issues

There are many community reports of AER errors at boot on various systems, ^[2] with the following message or a similar one:

Bad TLP associated with device xxxx:xx:x

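A Bad TLP is a correctable Data Link Layer error: a received Transaction Layer Packet (TLP) failed its integrity check. In the reported cases where pci=nommconf (described below) makes the message go away, something goes wrong when the PCIe controller uses memory-mapped access to reach the configuration space of a particular device.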

It may be a hardware bug in the device, in the PCIe root controller on the motherboard, in the specific interaction of those two, or something else.
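To see which device and which layer the reports point at, the kernel log can be filtered for AER lines. The sketch below assumes a line layout of the form `pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, ...`; the exact wording differs between kernel versions, so the regular expression and the helper name `parse_aer_line` are assumptions to adapt, not a fixed format.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<driver> <domain:bus:dev.fn>: PCIe Bus Error: severity=..., type=..."
AER_LINE = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<device>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)


def parse_aer_line(line: str) -> Optional[Dict[str, str]]:
    """Return driver, device, severity and layer, or None if the line does not match."""
    match = AER_LINE.search(line)
    return match.groupdict() if match else None


def is_corrected(record: Dict[str, str]) -> bool:
    # Corrected errors were recovered by hardware and only logged.
    return record["severity"].startswith("Corrected")
```

Feeding the output of `dmesg` or `journalctl -k` through `parse_aer_line` line by line and counting the results per `device` gives a quick picture of whether one port or one drive is responsible for most of the reports.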

Kernel parameter pci=noaer ^[3]

The pci=noaer parameter disables the kernel's use of AER, so these errors are no longer reported. Without it, every error report is written to the log, and each one raises an interrupt request (IRQ) that the CPU has to service. A rapid stream of error reports can therefore flood the log on the drive and consume NVMe bandwidth, slowing or even stalling boot. Note that the parameter only suppresses the reports; it does not fix the underlying cause.

An AER message prefixed with nvme 0000:xx:xx.x identifies the error as coming from the NVMe (M.2) device's connection to the PCIe bus.

So the NVMe drive itself may be healthy, but there could be trouble brewing in the PCIe subsystem.

Kernel parameter pcie_aspm=off ^[4]

The pcie_aspm parameter forcibly enables or disables PCIe Active State Power Management (ASPM):

- off: Disable ASPM.
- force: Enable ASPM even on devices that claim not to support it. (WARNING: forcing ASPM on may cause system lockups.)
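Before changing the boot parameter, it can help to check which ASPM policy the running kernel is using. The sketch below assumes the policy is exposed at `/sys/module/pcie_aspm/parameters/policy` as a space-separated list with the active entry in square brackets (for example `default performance [powersave] powersupersave`); the helper names are hypothetical, and what the file shows on a system booted with pcie_aspm=off is not covered here.

```python
from pathlib import Path
from typing import Optional

# Assumed sysfs location of the ASPM policy on the running kernel.
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_policy(text: str) -> Optional[str]:
    """Return the bracketed (active) entry from the policy string, if any."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_active_policy(path: Path = ASPM_POLICY_PATH) -> Optional[str]:
    """Read the policy file; return None if it is missing or unreadable."""
    try:
        return active_policy(path.read_text())
    except OSError:
        return None
```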

Kernel parameter pci=nommconf ^[5]

The kernel option pci=nommconf disables memory-mapped PCI configuration space access (MMCONFIG), which Linux has supported since kernel 2.6. Very roughly, every PCI device has a configuration area that describes the device (you can see it with lspci -vv). The original method of accessing this area goes through I/O ports, while PCIe allows the space to be mapped into memory for simpler access.

With pci=nommconf, the configuration space of all devices is accessed in the original way, and on affected systems changing the access method works around the AER problem.
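After adding one of these parameters to the boot loader configuration, it is worth confirming that the running kernel actually received it. The following minimal sketch parses a kernel command line string such as the contents of `/proc/cmdline`; the function name `pci_boot_options` is hypothetical. It relies on the fact that options for `pci=` can be combined in one comma-separated list, for example `pci=noaer,nommconf`.

```python
from typing import Dict, Set


def pci_boot_options(cmdline: str) -> Dict[str, Set[str]]:
    """Collect the values given to pci= and pcie_aspm= on a kernel command line."""
    found: Dict[str, Set[str]] = {"pci": set(), "pcie_aspm": set()}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in found:
            # pci= takes a comma-separated list, e.g. pci=noaer,nommconf
            found[key].update(v for v in value.split(",") if v)
    return found
```

A typical use would be `pci_boot_options(open("/proc/cmdline").read())`, then checking whether `"noaer"` or `"nommconf"` is in the `"pci"` set.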

Reference

[1] https://www.plda.com/pcie-glossary/aer

[2] https://forums.linuxmint.com/viewtopic.php?t=380602

[3] https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html

[4] https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html

[5] https://unix.stackexchange.com/questions/327730/what-causes-this-pcieport-00000003-0-pcie-bus-error-aer-bad-tlp


