AER: Advanced Error Reporting

AER is a capability provided by the PCI Express specification which allows for reporting of PCI errors and recovery from some of those errors.  AER support in Linux was implemented concurrently with EEH support; this post will give a high-level summary of AER and explain some differences between AER and EEH.  I previously discussed the differences between EEH and PCI error handling on HP-UX.

AER errors are categorized as either correctable or uncorrectable.  A correctable error is recovered by the PCI Express protocol without the need for software intervention, and without any risk of data loss.  An uncorrectable error can be either fatal or non-fatal.  A non-fatal uncorrectable error results in an unreliable transaction, while a fatal uncorrectable error causes the link to become unreliable.

The AER driver in the Linux kernel drives the reporting of (and recovery from) these events.  In the case of a correctable event, the AER driver simply logs a message that the event was encountered and recovered by hardware.  Device drivers can be instrumented to register recovery routines when they are initialized.  Should a device experience an uncorrectable error, the AER driver will invoke the appropriate recovery routines in the device driver that controls the affected device.  These routines can be used to recover the link for a fatal error, for example.

So, how does this differ from EEH on the Power architecture?  First, on Power (System p and System i), EEH encapsulates AER, such that AER events are exposed to the operating system as EEH events.  AER and EEH both use the (above-described) PCI error recovery infrastructure in the Linux kernel, meaning that a device driver need only be instrumented once to obtain the advantages of both; the callbacks that are added within a device driver will be called in response to an EEH event if the driver is used on a Power system, and in response to an AER event if the driver is used on other systems.

The primary difference is that with EEH, the PCI slot is frozen in response to a detected error; the affected  device may not perform I/O until recovery is performed.  (I use the term “slot” loosely; that statement applies to onboard devices as well.)  There is no concept of an “unreliable transaction”, as the transaction does not occur, and no new transactions will occur until the slot is recovered.

PCI error recovery: HP versus Linux on POWER

I wrote a post discussing PCI error recovery (via EEH) on Linux on POWER a few months ago [1], but I did not take the opportunity to compare it to other PCI error recovery methods at the time.  I’ve found some documentation on HP’s PCI error recovery since then, so I thought I’d post this article as a follow-on.

PCI error recovery on HP systems requires the installation of a feature called PCI Advanced Error Handling.  Notably, this feature is only available for HP-UX; recovery from PCI errors cannot be done with Linux on HP systems.  Installing the feature results in the PCI slots shifting to a “soft fail” mode. If a PCI error occurs on a slot in that soft fail mode, the slot will be frozen from performing any other I/O. However, recovery from this frozen state is not automatic; it must be effected by hand (using the olrad command; I believe OLRAD is an acronym for On-Line Repair/Add/Delete) [2].  Conversely, PCI error recovery on Linux on POWER is seamless, and requires no user intervention:  the frozen slot is detected on the next read operation, and the device is immediately reinitialized and made available for use.

Interestingly, there are two other limitations of PCI Error Handling on HP-UX.  If there is only a single path configured to storage devices, failover features like HP’s Serviceguard may not detect the loss of connectivity, which is necessary for them to perform a failover operation [3].  This is not an issue with PCI failures on Linux on POWER, because the device will be reinitialized immediately, with no need for a failover in order to wait for an administrative repair action to occur.  Secondly, if a new PCI adapter is added to the system, it will be initially set to the “hard fail” mode until it can be established that the driver is capable of handling the “soft fail” mode.  A machine check would occur if a PCI error occurred during this window, resulting in a system crash [3].  Such a gap does not exist in the Linux on POWER PCI error recovery implementation.

Hopefully I’ve been able to showcase the superior aspects of the EEH capabilities provided by the POWER platform for PCI error recovery; the fact that these capabilities are taken advantage of by both AIX and Linux makes the picture even better for POWER.

References:
[1] http://zombieprocess.wordpress.com/2007/09/16/pci-error-recovery-in-linux/
[2] http://h71028.www7.hp.com/ERC/downloads/c00767235.pdf
[3] http://docs.hp.com/en/5991-5308/index.html

Predictive Self Healing on Linux on POWER

Sun frequently touts their “predictive self-healing” implementation in Solaris 10. I wonder if that bullet point would be further down the list if they were familiar with the error detection, prediction, and correction capabilities of Linux on POWER platforms. In fact, the Linux on POWER implementation precedes the Solaris 10 implementation by at least a year (Solaris 10 was released in January 2005; SLES 8 had this solution for POWER in 2003, and RHEL 3 had it in 2004 at the latest).

I’ll take a moment to explain the superior aspects of the Linux on POWER implementation. The Solaris implementation consists of a number of diagnostics in the operating system that poll hardware devices for errors, and then perform notifications and/or recovery actions if a problem is detected. On POWER, hardware problem detection is largely done by the hypervisor and low-level firmware. That’s where it should be done; it means that the OS doesn’t even need to be booted for detection to occur, and doesn’t need to waste cycles polling. A huge number of devices are monitored this way: memory, CPUs, caches, fans, power supplies, VPD cards, voltage regulator modules, I/O subsystems, service processors, risers, even I/O drawers (and the fans, power supplies, etc. that those drawers may contain). PCI devices are also monitored; more details on that later.

If a failure (or impending failure) is detected, the hypervisor provides a report to every affected operating system installed on the system and to Hardware Management Consoles, if any are attached. On Linux partitions, the data is logged to the syslog and servicelog, and a number of actions may occur. Predictive CPU failures will cause the affected CPUs to be automatically removed via hotplug, so that the operating system may continue to run even after a catastrophic CPU or cache failure occurs. Severe thermal or voltage issues, and fan or power supply failures when redundant units aren’t available, will result in a shutdown to prevent hardware damage. In many cases, failures are automatically recovered by the hardware or firmware (for example, single- and double-bit memory errors are corrected via ECC, memory scrubbing, redundant bit-steering, and Chipkill), and the message to the OS is simply an FYI, or possibly an indication that the degraded device should be serviced at the administrator’s convenience. When a repair action is needed (device replacement, microcode updates, etc.), administrators are notified of the location code of the FRU and an indication of which repair procedure to follow (as documented in InfoCenter).

On a side note, the fact that this monitoring is done at such a low level means that self-healing on POWER platforms is completely OS agnostic; the reports are provided to Linux, AIX, and i5/OS partitions. The OS just has to know how to get out of the way. For that matter, there doesn’t even need to be an OS installed: the platform error log is viewable using the service processor, which is also capable of driving repair procedures. Conversely, if you are running something besides Solaris on Sun hardware, or if the error occurs during boot time, Sun’s “self-healing” feature is useless.

An OpenSolaris presentation that I found indicates that their Fault Management includes “improved resilience for all PCI I/O failures,” but is vague on details. I’d like to compare it to PCI Error Recovery/EEH on Linux on POWER, but it is difficult to do so without more information. It seems to be (again) an OS-only implementation, which almost certainly wouldn’t be able to match the functionality provided by POWER platforms. On POWER, the hardware and hypervisor again provide assistance by fencing off adapters the instant a problem is detected (to avoid the possibility of data corruption) and then notifying the operating system, which then directs the appropriate device drivers to restart the failed adapter.

Predictive Self-Healing always tops the list of Solaris 10 features (along with ZFS, Containers, and DTrace, which are reserved for other posts and/or other bloggers to discuss). Hopefully I’ve shown why it shouldn’t.

PCI error recovery in Linux

And so, with introductions out of the way, I thought I’d just leap in and explain a little bit about PCI error recovery in Linux, and how it works on System p and System i (formerly known as pSeries and iSeries).

IBM’s POWER-based systems have a feature called EEH (extended I/O error handling). Each PCI bridge has a bit of hardware associated with it that detects abnormal conditions, like parity errors and wild DMA accesses. When such an error is detected, the misbehaving device’s bus is frozen, meaning that the device becomes inaccessible by the operating system. Any writes to the device will be dropped, and any reads will result in all 1’s (0xFFF…). By itself, that’s only marginally useful; at least the partition doesn’t crash as a result of the error and, probably more importantly, at least no data corruption occurs due to the fault. The real advantage to EEH is that the system firmware provides interfaces that can be called to reset the bus and bring the device back online, without the need to restart the operating system.

In Linux, any time a read results in all 1’s, a firmware invocation is made by the kernel to determine if the device actually meant to return that value or if the bus was frozen due to a detected failure. (Empirical measurements show that devices almost never mean to return all 1’s, so there are not many false positives.) If the bus is indeed frozen, notification and recovery routines in the device driver are automatically invoked by the kernel to indicate that the device has experienced an error and to indicate when the bus is unfrozen (so that the driver can re-initialize the device).

If the device driver is not “PCI error recovery enabled” (i.e. does not provide routines to perform recovery), the kernel will attempt to perform a hotplug operation. The hotplug operation is typically successful, but recovery is much slower than it would be if the device driver were PCI error recovery enabled.

Yes, all of the support for PCI error recovery in Linux is upstream at this time. One of my co-workers, Linas Vepstas, wrote the lion’s share of the code. Several device drivers are instrumented with recovery routines, including e1000, ixgb, and ipr; those drivers serve as good examples so that others can also be instrumented. For more details, see Documentation/pci-error-recovery.txt in the kernel source.