AER: Advanced Error Reporting

AER is a capability provided by the PCI Express specification which allows for reporting of PCI errors and recovery from some of those errors.  AER support in Linux was implemented concurrently with EEH support; this post will give a high-level summary of AER and explain some differences between AER and EEH.  I previously discussed the differences between EEH and PCI error handling on HP-UX.

AER errors are categorized as either correctable or uncorrectable.  A correctable error is recovered by the PCI Express protocol without the need for software intervention, and without any risk of data loss.  An uncorrectable error can be either fatal or non-fatal.  A non-fatal uncorrectable error results in an unreliable transaction, while a fatal uncorrectable error causes the link to become unreliable.

The AER driver in the Linux kernel drives the reporting of (and recovery from) these events.  In the case of a correctable event, the AER driver simply logs a message that the event was encountered and recovered by hardware.  Device drivers can be instrumented to register recovery routines when they are initialized.  Should a device experience an uncorrectable error, the AER driver will invoke the appropriate recovery routines in the device driver that controls the affected device.  These routines can be used to recover the link for a fatal error, for example.

So, how does this differ from EEH on the Power architecture?  First, on Power (System p and System i), EEH encapsulates AER, such that AER events are exposed to the operating system as EEH events.  AER and EEH both use the (above-described) PCI error recovery infrastructure in the Linux kernel, meaning that a device driver need only be instrumented once to obtain the advantages of both; the callbacks that are added within a device driver will be called in response to an EEH event if the driver is used on a Power system, and in response to an AER event if the driver is used on other systems.

The primary difference is that with EEH, the PCI slot is frozen in response to a detected error; the affected  device may not perform I/O until recovery is performed.  (I use the term “slot” loosely; that statement applies to onboard devices as well.)  There is no concept of an “unreliable transaction”, as the transaction does not occur, and no new transactions will occur until the slot is recovered.

Memory Recovery on Power

On Power systems, there is a hierarchy of protection methods to guard against progressively rarer (and more catastrophic) types of memory failures.  There are a lot of terms used to describe these memory protection methods, so I thought I’d write a post to explain the various methods and how they relate to one another.

ECC (error correcting code) provides the first level of memory protection; it is capable of correcting single-bit errors, and of detecting (but not correcting) double-bit errors.

While memory is idle, memory scrubbing corrects soft single-bit errors in the background.  These single-bit errors would be corrected by ECC anyways when the memory is read, but memory scrubbing vastly reduces the quantity of double-bit errors (which ECC cannot correct) by catching and fixing the errors earlier.

If a defined threshold of errors is reached on a single bit, that memory line is considered faulty and is dynamically reassigned to a spare memory chip on the module via bit steering.  If all the bits on this spare memory chip are already used up when the error threshold is reached, the system’s service processor will generate a deferred maintenance request to indicate that the memory module should be replaced in the next available scheduled maintenance window.

Bits from multiple memory chips (on the same module) are scattered across four separate ECC words, a design which is called (logically enough) bit scattering.  This allows the system to handle simultaneous errors from multiple arrays on the same memory chip, because there will be no more than one faulty bit per ECC word.  Recovery in this scenario is handled via a process called Chipkill.  Because memory has been written across multiple chips on the module (similar to striping on disk arrays), the memory controller is capable of reconstructing the data from the killed chip.

If memory scrubbing identifies a hard error (soft errors are recovered by rewriting the correct data back to the memory location), the OS is notified of the failure so that it can remove the associated page.  This is called dynamic page deallocation, is currently supported on AIX 5L and i5/OS, and will soon be supported in Linux.  Dynamic page deallocation protects against an alignment of failed cells in two separate memory modules.

A more catastrophic memory failure can result in the unavailability of an entire row or column in the array, or an entire chip.  Redundant bit-steering protects against an alignment of the failed memory cells with any future failures.

Interestingly, elevation has a substantial effect on the frequency of soft error rates.  An elevation of 5,000 feet results in a 3- to 5-fold increase in soft errors (in comparison to sea level).  An elevation of 30,000 feet results in a 100-fold increase, due to the substantial increase in cosmic radiation.