PCI error recovery: HP versus Linux on POWER

I wrote a post discussing PCI error recovery (via EEH) on Linux on POWER a few months ago [1], but I did not take the opportunity to compare it to other PCI error recovery methods at the time.  I’ve found some documentation on HP’s PCI error recovery since then, so I thought I’d post this article as a follow-on.

PCI error recovery on HP systems requires the installation of a feature called PCI Advanced Error Handling.  Notably, this feature is only available for HP-UX; recovery from PCI errors cannot be done with Linux on HP systems.  Installing the feature results in the PCI slots shifting to a “soft fail” mode. If a PCI error occurs on a slot in that soft fail mode, the slot will be frozen from performing any other I/O. However, recovery from this frozen state is not automatic; it must be effected by hand (using the olrad command; I believe OLRAD is an acronym for On-Line Repair/Add/Delete) [2].  Conversely, PCI error recovery on Linux on POWER is seamless, and requires no user intervention:  the frozen slot is detected on the next read operation, and the device is immediately reinitialized and made available for use.

Interestingly, there are two other limitations of PCI Error Handling on HP-UX.  If there is only a single path configured to storage devices, failover features like HP’s Serviceguard may not detect the loss of connectivity, which is necessary for them to perform a failover operation [3].  This is not an issue with PCI failures on Linux on POWER, because the device will be reinitialized immediately, with no need for a failover in order to wait for an administrative repair action to occur.  Secondly, if a new PCI adapter is added to the system, it will be initially set to the “hard fail” mode until it can be established that the driver is capable of handling the “soft fail” mode.  A machine check would occur if a PCI error occurred during this window, resulting in a system crash [3].  Such a gap does not exist in the Linux on POWER PCI error recovery implementation.

Hopefully I’ve been able to showcase the superior aspects of the EEH capabilities provided by the POWER platform for PCI error recovery; the fact that these capabilities are taken advantage of by both AIX and Linux makes the picture even better for POWER.

References:
[1] https://zombieprocess.wordpress.com/2007/09/16/pci-error-recovery-in-linux/
[2] http://h71028.www7.hp.com/ERC/downloads/c00767235.pdf
[3] http://docs.hp.com/en/5991-5308/index.html


Leave a comment