AER (Advanced Error Reporting)
Latest revision as of 11:24, 24 September 2023
AER (Advanced Error Reporting) is an optional PCI Express feature that allows for more enhanced reporting and control of errors than the basic error reporting scheme. AER errors are categorized as either correctable or uncorrectable. [1]
- A correctable error is recovered by the PCI Express protocol without the need for software intervention and without any risk of data loss.
- An uncorrectable error can be either fatal or non-fatal.
  - A fatal, uncorrectable error causes the link to become unreliable.
  - A non-fatal, uncorrectable error results in an unreliable transaction.
The AER driver in the Linux kernel drives the reporting of (and recovery from) these events.
- In the case of a correctable event, the AER driver simply logs a message that the event was encountered and recovered by hardware. Device drivers can be instrumented to register recovery routines when they are initialized.
- Should a device experience an uncorrectable error, the AER driver will invoke the appropriate recovery routines in the device driver that controls the affected device. These routines can be used to recover the link for a fatal error, for example.
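The three severities described above usually show up in the kernel log as a `severity=` field. The following is a minimal sketch that maps such a line to one of the categories. The message shape and the helper name `classify_aer` are assumptions for illustration; the exact wording differs between kernel versions, so the pattern accepts both the "Corrected" and "Correctable" spellings.

```python
import re

# Assumed message shape, modelled on typical dmesg AER output.
# Wording varies by kernel version, so both spellings are accepted.
SEVERITY_RE = re.compile(
    r"severity=(Correct(?:ed|able)|Uncorrect(?:ed|able) \((?:Non-Fatal|Fatal)\))"
)


def classify_aer(line):
    """Return 'correctable', 'non-fatal', 'fatal', or None if no severity field is found."""
    match = SEVERITY_RE.search(line)
    if match is None:
        return None
    severity = match.group(1)
    if severity.startswith("Correct"):
        return "correctable"
    # "(Non-Fatal)" does not contain the substring "(Fatal)", so this test is unambiguous.
    return "fatal" if "(Fatal)" in severity else "non-fatal"


# Hypothetical sample line; the device address is a placeholder.
sample = ("pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, "
          "type=Data Link Layer, (Receiver ID)")
print(classify_aer(sample))
```

A correctable result only means the hardware recovered; a steady stream of them can still be worth investigating.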
Bad TLP Issues
Bad TLP errors on PCIe devices such as M.2 drives and GPUs have been reported and discussed on a variety of platforms. [2][3][4][5]
NVMe AER issue: there are many community reports of AER errors at boot on various systems [6], with the following error or a similar one:
Bad TLP associated with device xxxx:xx:x
This means that, in this particular case, something goes wrong when the PCIe controller uses memory-mapped access (see pci=nommconf) to reach the configuration space of a particular device.
It may be a hardware bug in the device, in the PCIe root controller on the motherboard, in the specific interaction of those two, or something else.
Kernel param, pci=noaer [7]
The pci=noaer directive disables PCIe Advanced Error Reporting in the kernel, so AER errors are no longer reported. Without it, each error report goes into a log file and sends a time-consuming interrupt request (IRQ) to the central processor. A rapid flow of error reports could thus flood the drive with log writes and clog NVMe bandwidth, slowing or even halting bootup.
An AER message prefixed with nvme 0000:xx:xx.x identifies the error as coming from the NVMe M.2 connection to the PCIe bus.
So the NVMe drive itself may be healthy, but there could be trouble brewing in the PCIe subsystem.
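To judge whether error reports are arriving fast enough to matter, it can help to count them per device. The sketch below tallies "PCIe Bus Error" lines from captured log text. The line shape (driver name, then domain:bus:device.function) and the function name `count_bus_errors` are assumptions for illustration, and the sample log is made up.

```python
import re
from collections import Counter

# Assumed line shape: "<driver> <domain:bus:dev.fn>: ... PCIe Bus Error ..."
DEVICE_RE = re.compile(
    r"(\S+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): .*PCIe Bus Error"
)


def count_bus_errors(log_text):
    """Count PCIe Bus Error lines per (driver, device address) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DEVICE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Made-up sample log; timestamps and addresses are placeholders.
sample_log = """\
[   12.000001] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[   12.000050] nvme 0000:01:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[   12.000300] pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[   12.000400] usb 1-1: new high-speed USB device number 2 using xhci_hcd
"""

for (driver, address), n in count_bus_errors(sample_log).most_common():
    print(driver, address, n)
```

In practice the text would come from `dmesg` or the journal; a count that keeps climbing for one address points at that device or its slot.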
Kernel param, pcie_aspm=off [8]
The pcie_aspm parameter forcibly enables or disables PCIe Active State Power Management (ASPM):
- off : Disable ASPM.
- force : Enable ASPM even on devices that claim not to support it. (WARNING: Forcing ASPM on may cause system lockups)
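Before changing the parameter, it can be useful to see which ASPM policy is currently active. On many Linux systems the file /sys/module/pcie_aspm/parameters/policy lists the available policies with the active one in square brackets. The sketch below parses that format; whether the file exists on a given system, and the helper name `active_aspm_policy`, are assumptions.

```python
import os
import re

# Path assumed to exist only on kernels built with ASPM support.
POLICY_PATH = "/sys/module/pcie_aspm/parameters/policy"


def active_aspm_policy(policy_text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


# Example content in the assumed format.
example = "[default] performance powersave powersupersave"
print(active_aspm_policy(example))

# Read the real file only if it is present on this machine.
if os.path.exists(POLICY_PATH):
    with open(POLICY_PATH) as handle:
        print(active_aspm_policy(handle.read()))
```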
Kernel param, pci=nommconf [9]
The kernel option pci=nommconf disables Memory-Mapped PCI Configuration Space, which has been available in Linux since kernel 2.6. Very roughly, every PCI device has an area that describes the device (which you can see with lspci -vv). The original method of accessing this area goes through I/O ports, while PCIe allows this space to be mapped to memory for simpler access.
With pci=nommconf, the configuration space of all devices is accessed in the original way, and changing the access method works around the AER problem.
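To confirm which of these options the running kernel was actually booted with, the command line in /proc/cmdline can be inspected. The following is a small sketch that extracts the `pci=` and `pcie_aspm=` settings from a command-line string; `pci=` accepts a comma-separated list, so the values are split. The function name `parse_pci_options` and the sample string are illustrative assumptions.

```python
def parse_pci_options(cmdline):
    """Extract pci= options (comma-separated) and the pcie_aspm= value from a kernel command line."""
    options = {"pci": [], "pcie_aspm": None}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # flag without a value, e.g. "quiet"
        if key == "pci":
            options["pci"].extend(v for v in value.split(",") if v)
        elif key == "pcie_aspm":
            options["pcie_aspm"] = value
    return options


# Made-up command line; on a live system, read the text from /proc/cmdline instead.
sample_cmdline = "BOOT_IMAGE=/vmlinuz root=/dev/nvme0n1p2 ro quiet pci=noaer,nommconf pcie_aspm=off"
parsed = parse_pci_options(sample_cmdline)
print(parsed["pci"], parsed["pcie_aspm"])
```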
Could not boot from M.2 due to Bad TLP in Recovery Mode[10]
The well-known workaround is:
- Add pcie_aspm=off to the GRUB_CMDLINE_LINUX parameter in /etc/default/grub.
- Run sudo update-grub.
- Reboot.
If pcie_aspm=off does not help, try pci=noaer in the same way, and keep it permanently if it works.
Edit /etc/default/grub, for example:
# AER workaround: pci=noaer pcie_aspm=off
# Force-enable ECC reporting on AMD EPYC: amd64_edac.ecc_enable_override=1
GRUB_CMDLINE_LINUX_DEFAULT="pci=noaer pcie_aspm=off amd64_edac.ecc_enable_override=1"
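Editing the GRUB line by hand is easy to get wrong when parameters are already present. As a rough sketch, the helper below appends parameters to the quoted value of a GRUB variable only if they are missing, working on the file contents as a string. The function name `add_kernel_params` is hypothetical, the sample text is made up, and the sketch assumes the variable is assigned on one line with double quotes.

```python
import re


def add_kernel_params(grub_text, params, var="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with each missing parameter appended to var's quoted value."""
    pattern = re.compile(r'^(%s=")([^"]*)(")' % re.escape(var), re.MULTILINE)

    def merge(match):
        current = match.group(2).split()
        for param in params:
            if param not in current:
                current.append(param)
        return match.group(1) + " ".join(current) + match.group(3)

    new_text, replaced = pattern.subn(merge, grub_text)
    if replaced == 0:
        raise ValueError("variable %s not found" % var)
    return new_text


# Made-up file contents for illustration.
sample_grub = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
updated = add_kernel_params(sample_grub, ["pcie_aspm=off", "pci=noaer"])
print(updated)
```

Applying the function twice gives the same result as applying it once. The result still has to be written back to /etc/default/grub and followed by sudo update-grub to take effect.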
The Ubuntu team is aware of the issue and tracks it as a Linux kernel bug. Read more: Ubuntu bug tracker, PCIe bus error (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173)
Reference
- [1] https://www.plda.com/pcie-glossary/aer
- [2] https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/237130/gtx-1080-throwing-bad-tlp-pcie-bus-errors/
- [3] https://www.google.com/search?q=bad+TLP+AER+on+boot&client=firefox-b-d&biw=1216&bih=673&ei=4RcqZKHiI8OsseMPhM6k4Ao&ved=0ahUKEwjh1pfXtIz-AhVDVmwGHQQnCaw4ChDh1QMIDg&uact=5&oq=bad+TLP+AER+on+boot&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIHCCEQoAEQCjIHCCEQoAEQCjIHCCEQoAEQCjIHCCEQoAEQCjoKCAAQRxDWBBCwAzoGCAAQCBAeSgQIQRgAUOMJWLkzYJc0aAFwAXgAgAGXAYgB5wiSAQMwLjiYAQCgAQHIAQrAAQE&sclient=gws-wiz-serp
- [4] https://rog-forum.asus.com/t5/promotions-general-discussions/pcie-bus-error-bad-tlp-bad-dllp/m-p/842800
- [5] https://askubuntu.com/questions/1209597/os-is-not-loading-due-to-bad-tlp-in-recovery-mode
- [6] https://forums.linuxmint.com/viewtopic.php?t=380602
- [7] https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html
- [8] https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html
- [9] https://unix.stackexchange.com/questions/327730/what-causes-this-pcieport-00000003-0-pcie-bus-error-aer-bad-tlp
- [10] https://askubuntu.com/questions/771899/pcie-bus-error-severity-corrected