Servicelog Updates

The servicelog package has been updated to version 1.0.  This new version uses an SQLite database as a backend (instead of the Berkeley DB backend that the 0.x stream used).  The primary advantage of the SQLite relational backend is that the servicelog can be searched with standard SQL.  The --query flag to the servicelog command now takes an SQL WHERE clause as an argument.  For example, to view all open serviceable events, run:

/usr/bin/servicelog --query "serviceable=1 and closed=0"

To view all migrations that a logical partition has undergone:

/usr/bin/servicelog --query 'refcode="#MIGRATE"'
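
The WHERE clause can reference any of the event fields.  For example, assuming a numeric severity field matching the codes shown in servicelog event output (where 4 is WARNING), the following lists all open events of severity WARNING or greater:

/usr/bin/servicelog --query "severity>=4 and closed=0"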

The ability to register notification tools with servicelog, available in the 0.x stream, is still supported, with even more flexibility: now you can specify a query string for matching when registering a new notification tool.  When a new event is logged, the tool will only be invoked if the event matches the criteria specified in that query string.  For example, run the following command (as root) to cause a tool called /opt/foo/some_command to be automatically invoked just after a partition is migrated to a different system:

/usr/bin/servicelog_notify --add --command='/opt/foo/some_command' --match='refcode="#MIGRATE"'
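
To review what is currently registered, servicelog_notify can also list the registered tools along with their match strings (I believe the flag is --list; check the man page if your version differs):

/usr/bin/servicelog_notify --list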

Power Platform Diagnostics: Source Available

The package for performing Power platform diagnostics, ppc64-diag, has just been open sourced under the Eclipse Public License.  Much of what I discussed in my previous post about predictive self healing is implemented in this package (and in servicelog, which is already open source).

Here are some of the advantages provided by the ppc64-diag package:

  • retrieval of first-failure error data from platform-level components, such as memory, CPUs, caches, fans, power supplies, VPD cards, voltage regulator modules, I/O subsystems, service processors, risers, etc.
  • the ability to offline CPUs or logical memory blocks (LMBs) that are predicted to fail
  • notifications of EPOW events (environmental and power warnings), and initiation of shutdowns due to abnormal thermal and voltage conditions if no redundant fans or power supplies are available
  • monitoring of platform-level elements (fans, power supplies, riser cards, etc.) in external I/O enclosures
  • retrieval of dumps from platform components to assist in problem determination (for example, dump data from a failed service processor)

The ppc64-diag package is generally install-and-forget; any platform events that may occur are logged to servicelog, with indications of the event severity and whether the event is serviceable (i.e. requires service action) or not.  Additional relevant information is also logged to servicelog, such as a reference code, and the location code and part number of a failing device (obtained from lsvpd).  Tools may be registered with servicelog to be automatically notified when new events are logged.
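
As a hypothetical example (the /opt/foo/page_admin path is made up, but the flags are the same ones shown in the servicelog discussion above), the following would register a paging script to be run whenever a new serviceable event is logged:

/usr/bin/servicelog_notify --add --command='/opt/foo/page_admin' --match='serviceable=1'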

SystemTap Without Debug Info

Just a short post today to mention a new feature in SystemTap that I should have mentioned a while ago.  A primary barrier to the adoption of SystemTap has been the requirement that SystemTap have access to the DWARF debug information for the kernel and modules.  This is no longer the case; as of a few weeks ago, SystemTap can operate on systems that do not provide this debug info.  SystemTap users can trace function entries and returns (and report argument values with a little extra effort) even if no debug info is provided.  This currently works for i386 and x86_64, and powerpc support is being debugged.
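
The dwarfless support is exposed through kprobe-based probe points.  As a quick illustration (vfs_read is just an arbitrary kernel function chosen for this example), the following trace function entry and return without any debug info:

stap -e 'probe kprobe.function("vfs_read") { printf("vfs_read entered by %s\n", execname()) }'
stap -e 'probe kprobe.function("vfs_read").return { printf("vfs_read returned\n") }'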

For more information, refer to this page on the SystemTap wiki.  This, combined with user-space probing, provides substantial and much-needed improvements to the usability of SystemTap.

Linux Kernel Hotpatching via ksplice

There have been a few articles recently discussing ksplice, a mechanism for hotpatching a Linux kernel.  It is primarily geared towards applying security patches, which is a good thing:  it is expressly designed to address those patches that are most urgent to apply, and thus the most painful due to the sometimes short lead time.

The implementation of ksplice is interesting, and not much like any hotpatching design I’ve seen. The patched and unpatched kernels are built using the same compiler, and, effectively, the resulting binary files are diffed. The difference gets packaged into a module which, when installed, inserts jump instructions that divert execution away from the affected sections of kernel text.  All branches into the excised text also need to be redirected to the replacement text in the kernel module.

The main complaints that I have seen regarding kernel hotpatching are along the following lines:

  • If you are using load balancing appropriately, hotpatching isn’t necessary. (Also phrased as: If there is a system that is so critical, why don’t you have two?)  This sounds reasonable on the surface, but is actually somewhat nefarious.  The same argument could be made about any feature that improves system availability.  In return, I would ask: why do you need greater than 70% uptime?  After all, you can just keep adding load-balancing systems until you get the aggregate uptime that you want.  You are being hit twice with expenditure when you plan for downtime by adding systems rather than “adding nines”.  First, even if you aren’t missing any transactions, downtime inherently costs money because it results in administrative costs (i.e. someone needs to restart the system and possibly perform root cause analysis, filesystem checks, etc.).  Second, besides the cost of the other systems used to load balance, there is an ongoing expenditure for energy and cooling.
  • We have a lengthy QA process before deploying OS updates. This is not really a technical issue, but more of a “certification” issue. Presumably, the provider of the operating system distributes the hotfix, and has already vetted the fix to be applied concurrently.  Customers who have their own QA processes before deploying fixes can perform QA on hotpatches just as easily as they can on non-concurrent updates, so this is really a non-issue.

The crux of the matter is simple.  If the process of applying security patches becomes so trivial that the machine doesn’t even need to wait until a service window to be fixed, more machines are likely to be patched.

On the topic of security patches, though: one of the issues with hotpatching is that you can technically never be certain that the replaced text is not being executed.  Most hotpatching implementations (including ksplice) resolve that problem by simply leaving the original code.  To some extent, though, this problem reduces the utility of hotpatching for patching security vulnerabilities, because one can never be certain that no CPU is executing text in the section that is supposed to have been excised.  This can really only be resolved by patch review; if there is a possibility that a CPU can be spending time in the old text for some time after the hotpatch is applied, then the system should undergo a reboot instead.  This determination should be the responsibility of the OS vendor, who can provide either a hotpatch or a standard old-fashioned fix.

On a side note, AIX recently introduced a kernel hotpatching implementation (refer to section 2.3.15 in this Redbook for more information on the Concurrent AIX Update feature, which first appeared in AIX 6.1).

AER: Advanced Error Reporting

AER is a capability provided by the PCI Express specification which allows for reporting of PCI errors and recovery from some of those errors.  AER support in Linux was implemented concurrently with EEH support; this post will give a high-level summary of AER and explain some differences between AER and EEH.  I previously discussed the differences between EEH and PCI error handling on HP-UX.

AER errors are categorized as either correctable or uncorrectable.  A correctable error is recovered by the PCI Express protocol without the need for software intervention, and without any risk of data loss.  An uncorrectable error can be either fatal or non-fatal.  A non-fatal uncorrectable error results in an unreliable transaction, while a fatal uncorrectable error causes the link to become unreliable.
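
Incidentally, you can check whether a particular device advertises the AER capability by inspecting its extended capability list with lspci (the device address below is just a placeholder, and reading extended capabilities generally requires root):

lspci -vv -s 01:00.0 | grep 'Advanced Error Reporting'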

The AER driver in the Linux kernel drives the reporting of (and recovery from) these events.  In the case of a correctable event, the AER driver simply logs a message that the event was encountered and recovered by hardware.  Device drivers can be instrumented to register recovery routines when they are initialized.  Should a device experience an uncorrectable error, the AER driver will invoke the appropriate recovery routines in the device driver that controls the affected device.  These routines can be used to recover the link for a fatal error, for example.

So, how does this differ from EEH on the Power architecture?  First, on Power (System p and System i), EEH encapsulates AER, such that AER events are exposed to the operating system as EEH events.  AER and EEH both use the (above-described) PCI error recovery infrastructure in the Linux kernel, meaning that a device driver need only be instrumented once to obtain the advantages of both; the callbacks that are added within a device driver will be called in response to an EEH event if the driver is used on a Power system, and in response to an AER event if the driver is used on other systems.

The primary difference is that with EEH, the PCI slot is frozen in response to a detected error; the affected device may not perform I/O until recovery is performed.  (I use the term “slot” loosely; that statement applies to onboard devices as well.)  There is no concept of an “unreliable transaction”, as the transaction does not occur, and no new transactions will occur until the slot is recovered.

Memory Recovery on Power

On Power systems, there is a hierarchy of protection methods to guard against progressively rarer (and more catastrophic) types of memory failures.  There are a lot of terms used to describe these memory protection methods, so I thought I’d write a post to explain the various methods and how they relate to one another.

ECC (error correcting code) provides the first level of memory protection; it is capable of correcting single-bit errors, and of detecting (but not correcting) double-bit errors.

While memory is idle, memory scrubbing corrects soft single-bit errors in the background.  These single-bit errors would be corrected by ECC anyway when the memory is read, but memory scrubbing vastly reduces the quantity of double-bit errors (which ECC cannot correct) by catching and fixing the errors earlier.

If a defined threshold of errors is reached on a single bit, that memory line is considered faulty and is dynamically reassigned to a spare memory chip on the module via bit steering.  If all the bits on this spare memory chip are already used up when the error threshold is reached, the system’s service processor will generate a deferred maintenance request to indicate that the memory module should be replaced in the next available scheduled maintenance window.

Bits from multiple memory chips (on the same module) are scattered across four separate ECC words, a design which is called (logically enough) bit scattering.  This allows the system to handle simultaneous errors from multiple arrays on the same memory chip, because there will be no more than one faulty bit per ECC word.  Recovery in this scenario is handled via a process called Chipkill.  Because memory has been written across multiple chips on the module (similar to striping on disk arrays), the memory controller is capable of reconstructing the data from the killed chip.

If memory scrubbing identifies a hard error (soft errors are recovered by rewriting the correct data back to the memory location), the OS is notified of the failure so that it can remove the associated page.  This is called dynamic page deallocation; it is currently supported on AIX 5L and i5/OS, and will soon be supported in Linux.  Dynamic page deallocation protects against an alignment of failed cells in two separate memory modules.

A more catastrophic memory failure can result in the unavailability of an entire row or column in the array, or of an entire chip.  Redundant bit steering protects against an alignment of the failed memory cells with any future failures.

Interestingly, elevation has a substantial effect on soft error rates.  An elevation of 5,000 feet results in a 3- to 5-fold increase in soft errors (in comparison to sea level).  An elevation of 30,000 feet results in a 100-fold increase, due to the substantial increase in cosmic radiation.

Upcoming SystemTap Features

A few new SystemTap features that have been eagerly awaited were vetted in Fedora and are likely to start appearing in upcoming Red Hat releases:

  • A revamped security model that allows for probing by non-root users, without compromising security. An example of why this is useful is detailed in the linked readme.
  • Basic support for user-space probing. A probe can be placed at a location specified by PID and virtual address, and the normal range of tapset features is available for user-space probes (see the sketch after this list).
  • The crash utility (for analyzing dump data) can use the staplog extension to retrieve the SystemTap relay buffer from a kernel dump image.  Seems like a useful feature when tracking a bug that causes a kernel crash.
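
As a rough sketch of the user-space probing syntax (the PID and virtual address below are placeholders, and, if I recall correctly, probing a raw address requires guru mode):

stap -g -e 'probe process(1234).statement(0x0804841c).absolute { printf("probe point hit\n") }'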

If you are interested in more information, I previously wrote about a real-world use of SystemTap, and about how to use kernel markers in SystemTap. Of course, there are lots of other sample scripts on the SystemTap wiki.

PCI error recovery: HP versus Linux on POWER

I wrote a post discussing PCI error recovery (via EEH) on Linux on POWER a few months ago [1], but I did not take the opportunity to compare it to other PCI error recovery methods at the time.  I’ve found some documentation on HP’s PCI error recovery since then, so I thought I’d post this article as a follow-on.

PCI error recovery on HP systems requires the installation of a feature called PCI Advanced Error Handling.  Notably, this feature is only available for HP-UX; recovery from PCI errors cannot be done with Linux on HP systems.  Installing the feature results in the PCI slots shifting to a “soft fail” mode. If a PCI error occurs on a slot in that soft fail mode, the slot will be frozen from performing any other I/O. However, recovery from this frozen state is not automatic; it must be effected by hand (using the olrad command; I believe OLRAD is an acronym for On-Line Repair/Add/Delete) [2].  Conversely, PCI error recovery on Linux on POWER is seamless, and requires no user intervention:  the frozen slot is detected on the next read operation, and the device is immediately reinitialized and made available for use.

Interestingly, there are two other limitations of PCI Error Handling on HP-UX.  First, if there is only a single path configured to storage devices, failover features like HP’s Serviceguard may not detect the loss of connectivity, which is necessary for them to perform a failover operation [3].  This is not an issue with PCI failures on Linux on POWER, because the device will be reinitialized immediately, with no need for a failover while waiting for an administrative repair action.  Second, if a new PCI adapter is added to the system, it will initially be set to the “hard fail” mode until it can be established that the driver is capable of handling the “soft fail” mode.  A machine check would occur if a PCI error occurred during this window, resulting in a system crash [3].  Such a gap does not exist in the Linux on POWER PCI error recovery implementation.

Hopefully I’ve been able to showcase the superior aspects of the EEH capabilities provided by the POWER platform for PCI error recovery; the fact that these capabilities are taken advantage of by both AIX and Linux makes the picture even better for POWER.

References:
[1] https://zombieprocess.wordpress.com/2007/09/16/pci-error-recovery-in-linux/
[2] http://h71028.www7.hp.com/ERC/downloads/c00767235.pdf
[3] http://docs.hp.com/en/5991-5308/index.html

Linux on POWER Odds and Ends: Service Utility Roundup

There are a number of small utilities for Linux on POWER that come in handy for servicing or configuring your system. Here are a few from the powerpc-utils-papr package that you may find useful.

The set_poweron_time utility can be used to specify a time in the future when the system or partition should be powered on, if it happens to be off at that time. For example, if a partition should be automatically started 12 hours and 10 minutes from now, run the following command: set_poweron_time -d h12m10. If the partition is off when that time expires, it will restart.

The bootlist command is used to modify the order of boot devices from the command line. Boot lists on POWER are stored as Open Firmware device names, but bootlist allows you to specify logical device names (like “sda” or “eth0”) if you choose; the ofpathname utility is used by bootlist to convert between OF device names and logical device names (between “/vdevice/v-scsi@30000002/disk@8100000000000000” and “sda”, for example).
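
For example (these invocations assume the standard flags documented in the bootlist and ofpathname man pages):

bootlist -m normal -o
bootlist -m normal sda eth0
ofpathname sda
ofpathname -l /vdevice/v-scsi@30000002/disk@8100000000000000

The first command displays the current normal-mode boot list as logical device names; the second sets the boot order to try sda first, then fall back to a network boot from eth0; the last two translate between logical and Open Firmware device names.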

usysident is a tool for manipulating identification LEDs. These LEDs are used to help locate FRUs (field replaceable units), to ensure that the correct part is being replaced. LEDs are specified by their location code or logical device name, and can be in one of two states: either “normal” (off) or “identify” (blinking amber LED). Run usysident without any parameters to view the available LEDs; to flash the LED on eth0: usysident -d eth0 -s identify.

A related utility is usysattn; it’s used to turn off the system attention indicator, or to view the current state of that LED. The LED usually looks like an amber exclamation point located on the operator’s panel, as in the image below from a p520.

[Image: POWER5 op panel]

On a partitioned system, though, the system attention indicator will be illuminated if any of the partitions have activated it. This is because the system attention indicator reflects whether any of the partitions require attention. Refer to the Service Focal Point on the HMC or IVM to determine which partition is asking for attention.
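
As a quick sketch of usysattn usage (I’m recalling the -s flag from the man page, so verify it on your system):

usysattn                 # view the current state of the system attention indicator
usysattn -s normal       # turn the indicator off once the event has been handled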

serv_config is a very useful utility for modifying serviceability parameters. I talked a little about it in an earlier post, so refer to that entry for more details.

The uesensor command can be used to view the values of various thermal, voltage, and fan speed sensors on the system. Unfortunately, these sensors are only exposed on POWER4 systems and some blades; more recent systems will instead send an EPOW (environmental and power warning) event if any of the sensors are in danger of shifting out of the normal operating range. EPOW events are exposed in servicelog.
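
On a system that does expose the sensors, listing them looks something like the following (I’m recalling the -l flag from the man page, so double-check it):

uesensor -l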

All of these commands have man pages; take a look there if you need more details.

POWER Reference Codes

One nice advantage provided by POWER systems is the availability of structured and well-defined reference codes. Besides indicating errors or conditions that otherwise require attention, these codes are also used to indicate the progress of boots or dumps. If your system failed to boot for some reason, the last reference code on the operator’s panel (op panel) would provide a good clue as to what the system was doing just before the failure.

On Linux, besides appearing on the op panel, these reference codes are also found in events that are surfaced in servicelog. While servicelog contains a lot of details that are useful for servicing errors, more information can always be obtained by looking up the reference code.
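
Since the refcode is just another field in the servicelog database, all events bearing a particular reference code can be retrieved with the --query mechanism described earlier; for example, using the SRC from the sample event later in this post:

/usr/bin/servicelog --query 'refcode="B125E500"'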

There are a few kinds of reference codes; the key for decoding these refcodes is the IBM Hardware InfoCenter. I’ll briefly explain the three different types of reference codes (SRCs, SRNs, and menugoals) before showing how they are displayed in servicelog.

System Reference Codes

SRCs are sequences of alphanumeric characters (usually 8 — just enough to fit snugly on the display of the operator’s panel — but sometimes 6). They were first introduced on POWER5 systems, and exist on both System p and System i (formerly pSeries and iSeries). SRCs are documented in InfoCenter: “Service provider information”/”Reference codes”/”Using system reference codes”.

An example of an SRC used as a progress code is C7004091; that refcode indicates that the partition is in a standby state, and is waiting to be manually activated. If the partition is set to be activated automatically, the partition will not stop at this SRC, but will continue to the Open Firmware boot phase.

Linux does not generate SRCs as progress codes, but will generate some as error codes. Additionally, if you have a POWER5 or POWER6 system, events with SRCs may be written to servicelog to indicate platform-level errors.

Service Request Numbers

SRNs are an older format for progress and error codes. They are generated by diagnostics in AIX, and by the firmware on POWER4 (and earlier) systems. If the progress/error code has 5 digits, or has a ‘-’ character somewhere in it, it is an SRN. These are documented in InfoCenter: “Service provider information”/”Reference codes”/”Using service request numbers”.

As an example, the SRN 747-223 indicates that there was a “miscompare during the write/read of the memory I/O register.” Many SRNs point to a repair procedure called a MAP; in this case, the SRN points to MAP 0050, “SCSI bus problems”, which provides procedures for analyzing and repairing the problem.

Linux does not generate SRNs, but you may still see SRNs generated by older POWER platforms. They may also be generated if you boot the eServer Standalone Diagnostics CD to run device diagnostics.

Menugoals

Menugoals are reference codes that begin with a ‘#’ character. They are generated by diagnostics, and indicate procedures that can be performed by a system admin rather than by a trained service representative. Menugoals don’t typically indicate errors, but instead convey additional information about the state of the device being diagnosed. As an example, a menugoal might indicate that a tape drive requires cleaning.

Reference Codes in servicelog

Each event in servicelog has a refcode field, which will always contain a reference code (either an SRC, an SRN, or a menugoal). Here is a sample event from servicelog indicating a platform error reported by a POWER system:

PPC64 Platform Event:
Servicelog ID:      64
Event Timestamp:    Fri Dec 10 21:37:05 2004
Log Timestamp:      Wed Apr 18 00:19:12 2007
Severity:           4 (WARNING)
Version:            2
Serviceable Event:  Yes
Event Repaired:     No
Reference Code:     B125E500
Action Flags:       a800
Event Type:         224 - Platform Event
Kernel ID:          1000
Platform ID:        50929493
Creator ID:         E - Service Processor
Subsystem ID:       25 - Memory subsystem including external cache
RTAS Severity:      41 - Unrecoverable Error, bypassed with degraded performance
Event Subtype:      00 - Not applicable
Machine Type/Model: 9118-575
Machine Serial:     0SQIH47

Extended Reference Codes:
2: 030000f0  3: 28f00110  4: c13920ff  5: c1000000
6: 00811630  7: 00000001  8: 00d6000d  9: 00000000

Description:
Memory subsystem including external cache Informational (non-error) Event.
Refer to the system service documentation for more information.

<< Callout 1 >>
Priority            M
Type                16
Repair Event Key:   0
Procedure Id:       n/a
Location:           U787D.001.0481682-P2
FRU:                80P4180
Serial:             YH3016129997
CCIN:               260D

The error description provides some details concerning the failure, and the FRU callout indicates which part to repair in order to fix the problem. The refcode field contains an SRC, B125E500; looking that SRC up in InfoCenter shows the following details:

  • B1 indicates it was reported by the service processor
  • 25 indicates that it is an “external cache event or error reported by the service processor”
  • E500 indicates that it is a result of processor runtime diagnostics (PRD)

In addition to that, the InfoCenter entry for B125E500 indicates that this event is the result of a hardware failure. The FRU callout indicates which piece of hardware should be replaced to resolve the error.