Predictive Self Healing on Linux on POWER

Sun frequently touts their “predictive self-healing” implementation in Solaris 10. I wonder if that bullet point would be further down the list if they were familiar with the error detection, prediction, and correction capabilities of Linux on POWER platforms. In fact, the Linux on POWER implementation precedes the Solaris 10 implementation by at least a year (Solaris 10 was released in January 2005; SLES 8 had this solution for POWER in 2003, and RHEL 3 had it in 2004 at the latest).

I’ll take a moment to explain the superior aspects of the Linux on POWER implementation. The Solaris implementation consists of a number of diagnostics in the operating system that poll hardware devices for errors, and then perform notifications and/or recovery actions if a problem is detected. On POWER, hardware problem detection is largely done by the hypervisor and low-level firmware. That’s where it should be done; it means that the OS doesn’t even need to be booted for detection to occur, and doesn’t need to waste cycles polling. A huge number of devices are monitored this way: memory, CPUs, caches, fans, power supplies, VPD cards, voltage regulator modules, I/O subsystems, service processors, risers, even I/O drawers (and the fans, power supplies, etc. that those drawers may contain). PCI devices are also monitored; more details on that later.

If a failure (or impending failure) is detected, the hypervisor provides a report to every affected operating system installed on the system and to Hardware Management Consoles, if any are attached. On Linux partitions, the data is logged to the syslog and servicelog, and a number of actions may occur. Predictive CPU failures will cause the affected CPUs to be automatically removed via hotplug, so that the operating system may continue to run even after a catastrophic CPU or cache failure occurs. Severe thermal or voltage issues, and fan or power supply failures when redundant units aren’t available, will result in a shutdown to prevent hardware damage. In many cases, failures are automatically recovered by the hardware or firmware (for example, single- and double-bit memory errors are corrected via ECC, memory scrubbing, redundant bit-steering, and Chipkill), and the message to the OS is simply an FYI, or possibly an indication that the degraded device should be serviced at the administrator’s convenience. When a repair action is needed (device replacement, microcode updates, etc.), administrators are notified of the location code of the FRU and an indication of which repair procedure to follow (as documented in InfoCenter).
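To make that flow concrete, here is a minimal Python sketch of the kind of policy described above: a report arrives, and the OS-side handler picks an action. This is an illustration only, not the actual Linux on POWER code. The names `Event`, `Action`, `Report`, and `decide`, and the location-code strings, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Event(Enum):
    """Hypothetical event classes, loosely following the paragraph above."""
    PREDICTIVE_CPU_FAILURE = auto()
    SEVERE_THERMAL_OR_VOLTAGE = auto()
    FAN_OR_POWER_SUPPLY_FAILURE = auto()
    CORRECTED_MEMORY_ERROR = auto()


class Action(Enum):
    LOG_ONLY = auto()      # already recovered by hardware/firmware; just an FYI
    OFFLINE_CPU = auto()   # remove the affected CPU via hotplug
    SHUTDOWN = auto()      # power down to prevent hardware damage


@dataclass(frozen=True)
class Report:
    event: Event
    location_code: str                  # FRU location (made-up values in the usage below)
    redundant_unit_available: bool = True


def decide(report: Report) -> Action:
    """Pick an action for a platform error report (assumed policy, not the real one)."""
    if report.event is Event.PREDICTIVE_CPU_FAILURE:
        return Action.OFFLINE_CPU
    if report.event is Event.SEVERE_THERMAL_OR_VOLTAGE:
        return Action.SHUTDOWN
    if (report.event is Event.FAN_OR_POWER_SUPPLY_FAILURE
            and not report.redundant_unit_available):
        return Action.SHUTDOWN
    # Everything else is logged, with the location code, for later service.
    return Action.LOG_ONLY
```

For example, `decide(Report(Event.FAN_OR_POWER_SUPPLY_FAILURE, "LOC-FAN-1", redundant_unit_available=False))` returns `Action.SHUTDOWN`, while the same failure with a redundant unit still working is only logged.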

On a side note, the fact that this monitoring is done at such a low level means that self-healing on POWER platforms is completely OS agnostic; the reports are provided to Linux, AIX, and i5/OS partitions. The OS just has to know how to get out of the way. For that matter, there doesn’t even need to be an OS installed: the platform error log is viewable using the service processor, which is also capable of driving repair procedures. Conversely, if you are running something besides Solaris on Sun hardware, or if the error occurs during boot time, Sun’s “self-healing” feature is useless.

An OpenSolaris presentation that I found indicates that their Fault Management includes “improved resilience for all PCI I/O failures,” but is vague on details. I’d like to compare it to PCI Error Recovery/EEH on Linux on POWER, but it is difficult to do so without more information. It seems to be (again) an OS-only implementation, which almost certainly wouldn’t be able to match the functionality provided by POWER platforms. On POWER, the hardware and hypervisor again provide assistance by fencing off adapters the instant a problem is detected (to avoid the possibility of data corruption) and then notifying the operating system, which then directs the appropriate device drivers to restart the failed adapter.
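The fence-then-notify-then-restart sequence can be sketched as a toy simulation in Python. It models only the ordering described above; it is not the kernel's EEH interface, and `Adapter`, `Driver`, `platform_detects_error`, `os_handles_notification`, and `demo` are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Adapter:
    name: str
    fenced: bool = False
    accepted: list = field(default_factory=list)

    def write(self, payload: bytes) -> bool:
        """Accept I/O only while the adapter is not fenced."""
        if self.fenced:
            return False
        self.accepted.append(payload)
        return True


@dataclass
class Driver:
    restarts: int = 0

    def restart(self, adapter: Adapter) -> None:
        """Reinitialise the adapter; clearing the fence stands in for the slot reset."""
        self.restarts += 1
        adapter.fenced = False


def platform_detects_error(adapter: Adapter) -> None:
    """Hardware/hypervisor step: fence the adapter as soon as a problem is seen."""
    adapter.fenced = True


def os_handles_notification(adapter: Adapter, driver: Driver) -> None:
    """OS step: direct the device driver to restart the failed adapter."""
    driver.restart(adapter)


def demo() -> list:
    adapter, driver = Adapter("example-adapter"), Driver()
    results = [adapter.write(b"before error")]         # normal operation
    platform_detects_error(adapter)
    results.append(adapter.write(b"during error"))     # refused while fenced
    os_handles_notification(adapter, driver)
    results.append(adapter.write(b"after recovery"))   # accepted again
    return results
```

The point of the ordering is that the write attempted while the adapter is fenced never reaches `accepted`, which is the toy equivalent of avoiding data corruption before the driver has had a chance to recover.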

Predictive Self-Healing always tops the list of Solaris 10 features (along with ZFS, Containers, and DTrace, which are reserved for other posts and/or other bloggers to discuss). Hopefully I’ve shown why it shouldn’t.

3 comments

  1. You argue that Solaris 10 has self-healing in the operating system, whereas POWER has it in hardware, and therefore Linux on POWER self-healing is better.

    That sounds good, but I have two questions:
    1) If you run Linux on non-POWER hardware, does Linux have self-healing? Or do you require POWER for self-healing? How does this affect me, as someone who uses Linux on x86? I cannot afford POWER hardware. :o( Why can’t you release cheaper POWER machines?

    2) I thought that the more expensive Sun servers also have self-healing mechanisms in hardware, just like all other bigger Unix servers do. Is that not true? Do Sun servers lack self-healing mechanisms, so that you need Solaris for that? So is it not a good idea to run Linux on heavier SPARC servers?

    • Thanks for the comment! On your first question: There is only very limited availability of self healing features for Linux on non-Power systems, enabled by installing servicelog. I believe that additional support is forthcoming, including better interactivity with the BMC (on systems so equipped) and with device drivers, but it is not as robust as the Power implementation. Essentially, the ability to detect and recover from hardware events can only ever be as rich as the platform hardware/firmware can support, and even systems with BMCs will not be able to detect or recover from the range of errors that a Power system can.

      On your second question: self-healing isn’t really an all-or-nothing concept; the higher-end Sun machines do have improvements over the lower-end machines, but even then, the vast bulk of error detection and recovery is left up to the operating system. For example, Solaris has a module called cpumem-diagnosis, which must be loaded in order to perform SPARC CPU and memory diagnostics. This level of diagnosis is automatically performed by the platform on Power systems, meaning that failures will not be ignored by the system if some OS module is not loaded (or even if the OS is not booted).

  2. Solaris supports PREDICTIVE SELF HEALING on both the x86 platform and the SPARC platform. On x86, it works with the Intel CPU and BIOS/BMC.
    On SPARC, it relies on the hypervisor within firmware, and on Service Processors too. In any case, the SPARC solution is similar to the IBM platform. The big difference is that it also works on x86.

    On the other hand, Solaris PREDICTIVE SELF HEALING is not designed only for hardware faults. It is also used for software failures, such as in ZFS.

    Linux doesn’t have a good error diagnosis and reporting framework, but Solaris has one through PREDICTIVE SELF HEALING.

