Advanced Error Reporting (AER) is a capability defined by the PCI Express specification that allows PCI errors to be reported and, for some of those errors, recovered from. AER support in Linux was implemented concurrently with EEH support; this post gives a high-level summary of AER and explains some differences between AER and EEH. I previously discussed the differences between EEH and PCI error handling on HP-UX.
AER errors are categorized as either correctable or uncorrectable. A correctable error is recovered by the PCI Express protocol itself, without software intervention and without any risk of data loss. An uncorrectable error can be either fatal or non-fatal. A non-fatal uncorrectable error makes the affected transaction unreliable while the link itself remains usable, whereas a fatal uncorrectable error makes the link itself unreliable.
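To make the three categories concrete, here is a minimal sketch in Python. It is only a restatement of the paragraph above as a lookup; the names `Severity` and `describe` are my own, not anything from the kernel or the PCI Express specification.

```python
from enum import Enum


class Severity(Enum):
    """The three AER error categories described in the text (names are illustrative)."""
    CORRECTABLE = "correctable"
    NONFATAL = "uncorrectable, non-fatal"
    FATAL = "uncorrectable, fatal"


def describe(severity: Severity) -> dict:
    """Summarize what each category implies for software, the transaction, and the link."""
    return {
        # Only correctable errors are fully handled by the PCI Express protocol.
        "needs_software_recovery": severity is not Severity.CORRECTABLE,
        # Any uncorrectable error leaves the affected transaction unreliable.
        "transaction_reliable": severity is Severity.CORRECTABLE,
        # Only a fatal error makes the link itself unreliable.
        "link_reliable": severity is not Severity.FATAL,
    }
```

For example, `describe(Severity.NONFATAL)` reports that software recovery is needed and the transaction is unreliable, but the link is still usable.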
The AER driver in the Linux kernel drives the reporting of (and recovery from) these events. In the case of a correctable event, the AER driver simply logs a message that the event was encountered and recovered by hardware. Device drivers can be instrumented to register recovery routines when they are initialized. Should a device experience an uncorrectable error, the AER driver will invoke the appropriate recovery routines in the device driver that controls the affected device. These routines can be used to recover the link for a fatal error, for example.
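The flow described above can be sketched as a small user-space model. This is not kernel code. The callback names `error_detected`, `slot_reset`, and `resume` echo the kernel's PCI error recovery callbacks, but the classes, the string result values, and the simplified sequencing are assumptions made purely for illustration.

```python
class ErrorHandlers:
    """Hypothetical stand-in for the recovery routines a driver registers."""

    def __init__(self):
        self.calls = []  # records which callbacks were invoked, in order

    def error_detected(self, severity):
        self.calls.append("error_detected")
        # Assumption: this toy driver asks for a reset only on fatal errors.
        return "need_reset" if severity == "fatal" else "can_recover"

    def slot_reset(self):
        self.calls.append("slot_reset")
        return "recovered"

    def resume(self):
        self.calls.append("resume")


class AerCoreModel:
    """Toy model of the AER driver: log correctable errors, dispatch the rest."""

    def __init__(self):
        self.handlers = {}
        self.log = []

    def register(self, device, handlers):
        # Corresponds to a driver registering recovery routines at initialization.
        self.handlers[device] = handlers

    def report(self, device, severity):
        if severity == "correctable":
            # Hardware already recovered; software only logs the event.
            self.log.append(f"{device}: correctable error, recovered by hardware")
            return "logged"
        handlers = self.handlers.get(device)
        if handlers is None:
            self.log.append(f"{device}: uncorrectable error, no recovery routines")
            return "unhandled"
        result = handlers.error_detected(severity)
        if result == "need_reset":
            # A real implementation would reset the link before this callback.
            result = handlers.slot_reset()
        if result in ("can_recover", "recovered"):
            handlers.resume()
            return "recovered"
        return "failed"
```

In this model a correctable error never reaches the driver, while a fatal error walks through `error_detected`, `slot_reset`, and `resume` in that order.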
So, how does this differ from EEH on the Power architecture? First, on Power (System p and System i), EEH encapsulates AER, such that AER events are exposed to the operating system as EEH events. AER and EEH both use the PCI error recovery infrastructure in the Linux kernel described above, so a device driver need only be instrumented once to obtain the advantages of both. The callbacks added to a device driver are called in response to an EEH event if the driver is used on a Power system, and in response to an AER event if it is used on other systems.
The primary difference is that with EEH, the PCI slot is frozen in response to a detected error; the affected device may not perform I/O until recovery is performed. (I use the term “slot” loosely; that statement applies to onboard devices as well.) There is no concept of an “unreliable transaction”, as the transaction does not occur, and no new transactions will occur until the slot is recovered.
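The freeze behavior can be illustrated with another small Python model. Again, this is a sketch under my own assumptions: `EehSlotModel` and `SlotFrozenError` are hypothetical names, and raising an exception is just a convenient way to show that no transaction completes while the slot is frozen. It does not reflect how the hardware or the kernel actually signals the condition.

```python
class SlotFrozenError(Exception):
    """Raised by this model when I/O is attempted on a frozen slot."""


class EehSlotModel:
    """Toy model of a slot that is frozen on error until recovery completes."""

    def __init__(self):
        self.frozen = False
        self.completed = []  # transactions that actually took place

    def detect_error(self):
        # With EEH, detecting an error freezes the slot immediately.
        self.frozen = True

    def submit(self, transaction):
        if self.frozen:
            # The transaction does not occur at all, so it cannot be "unreliable".
            raise SlotFrozenError("slot is frozen; recover it before doing I/O")
        self.completed.append(transaction)

    def recover(self):
        # Recovery unfreezes the slot and allows new transactions again.
        self.frozen = False
```

Submitting a transaction between `detect_error()` and `recover()` fails outright in this model, and `completed` only ever contains transactions issued while the slot was healthy.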