Predictive Self Healing on Linux on POWER

Sun frequently touts their “predictive self-healing” implementation in Solaris 10. I wonder if that bullet point would be further down the list if they were familiar with the error detection, prediction, and correction capabilities of Linux on POWER platforms. In fact, the Linux on POWER implementation precedes the Solaris 10 implementation by at least a year (Solaris 10 was released in January 2005; SLES 8 had this solution for POWER in 2003, and RHEL 3 had it in 2004 at the latest).

I’ll take a moment to explain the superior aspects of the Linux on POWER implementation. The Solaris implementation consists of a number of diagnostics in the operating system that poll hardware devices for errors, and then perform notifications and/or recovery actions if a problem is detected. On POWER, hardware problem detection is largely done by the hypervisor and low-level firmware. That’s where it should be done; it means that the OS doesn’t even need to be booted for detection to occur, and doesn’t need to waste cycles polling. A huge number of devices are monitored this way: memory, CPUs, caches, fans, power supplies, VPD cards, voltage regulator modules, I/O subsystems, service processors, risers, even I/O drawers (and the fans, power supplies, etc. that those drawers may contain). PCI devices are also monitored; more details on that later.

If a failure (or impending failure) is detected, the hypervisor provides a report to every affected operating system installed on the system and to Hardware Management Consoles, if any are attached. On Linux partitions, the data is logged to the syslog and servicelog, and a number of actions may occur. Predictive CPU failures will cause the affected CPUs to be automatically removed via hotplug, so that the operating system may continue to run even after a catastrophic CPU or cache failure occurs. Severe thermal or voltage issues, and fan or power supply failures when redundant units aren’t available, will result in a shutdown to prevent hardware damage. In many cases, failures are automatically recovered by the hardware or firmware (for example, single- and double-bit memory errors are corrected via ECC, memory scrubbing, redundant bit-steering, and Chipkill), and the message to the OS is simply an FYI, or possibly an indication that the degraded device should be serviced at the administrator’s convenience. When a repair action is needed (device replacement, microcode updates, etc.), administrators are notified of the location code of the FRU and an indication of which repair procedure to follow (as documented in InfoCenter).

On a side note, the fact that this monitoring is done at such a low level means that self-healing on POWER platforms is completely OS agnostic; the reports are provided to Linux, AIX, and i5/OS partitions. The OS just has to know how to get out of the way. For that matter, there doesn’t even need to be an OS installed: the platform error log is viewable using the service processor, which is also capable of driving repair procedures. Conversely, if you are running something besides Solaris on Sun hardware, or if the error occurs during boot time, Sun’s “self-healing” feature is useless.

An OpenSolaris presentation that I found indicates that their Fault Management includes “improved resilience for all PCI I/O failures,” but is vague on details. I’d like to compare it to PCI Error Recovery/EEH on Linux on POWER, but it is difficult to do so without more information. It seems to be (again) an OS-only implementation, which almost certainly wouldn’t be able to match the functionality provided by POWER platforms. On POWER, the hardware and hypervisor again provide assistance by fencing off adapters the instant a problem is detected (to avoid the possibility of data corruption) and then notifying the operating system, which then directs the appropriate device drivers to restart the failed adapter.

Predictive Self-Healing always tops the list of Solaris 10 features (along with ZFS, Containers, and DTrace, which are reserved for other posts and/or other bloggers to discuss). Hopefully I’ve shown why it shouldn’t.