Kernel Markers Wiki

I wrote a little bit about kernel markers before, but I’ve since found a wiki with some more information: http://sourceware.org/systemtap/wiki/UsingMarkers. Besides information on adding new markers to the kernel source and building kernels that include marker support, it also has information on using markers in SystemTap.
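
For context, a marker in the kernel source is just a trace_mark() call: a marker name plus a printf-style format string describing the arguments attached to it. A minimal sketch (the function and marker names here are hypothetical, chosen to match the probe below) might look like this:

#include <linux/marker.h>

/* Hypothetical kernel function instrumented with a marker. */
static int do_some_work(void *data, int count)
{
        /* "some_marker" is the marker name; the printf-style format string
         * describes the arguments, which become $arg1, $arg2, ... in a
         * SystemTap kernel.mark() probe. */
        trace_mark(some_marker, "data %p count %d", data, count);
        return 0;
}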

It certainly does seem to make SystemTap scripts easier to write:

probe kernel.mark("some_marker") {
        printf("some_marker hit: %p, %d\n", $arg1, $arg2)
}

The obvious first questions from someone sitting down to write a SystemTap script that uses markers: where are the markers, what are their names, and what arguments can you access through them? Perhaps the marker developers could write a tool that parses the kernel source and spits out a document providing the names and locations of the available markers, along with the arguments that each exposes.
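
Along those lines, a very rough sketch of such a tool (nothing official, just an illustration that only copes with simple, single-line trace_mark() calls) could be as small as this:

/* Rough sketch of such a cataloguing tool: scan the kernel source files
 * named on the command line for trace_mark() calls and print the file,
 * line number, marker name, and format string.  A real tool would need
 * to parse more carefully. */
#include <stdio.h>
#include <string.h>

static void scan_file(const char *path)
{
        char line[1024];
        int lineno = 0;
        FILE *f = fopen(path, "r");

        if (!f)
                return;

        while (fgets(line, sizeof(line), f)) {
                char name[256], fmt[512];
                char *p;

                lineno++;
                p = strstr(line, "trace_mark(");
                if (!p)
                        continue;
                p += strlen("trace_mark(");

                /* marker name up to the comma, then the quoted format string */
                if (sscanf(p, " %255[^,], \"%511[^\"]\"", name, fmt) == 2)
                        printf("%s:%d: %s (%s)\n", path, lineno, name, fmt);
        }
        fclose(f);
}

int main(int argc, char **argv)
{
        int i;

        for (i = 1; i < argc; i++)
                scan_file(argv[i]);
        return 0;
}

Run against the kernel tree, something like this would produce the list of marker names, their locations, and, via the format string, the basic type of each argument.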

The wiki page notes another failing with the current implementation: if a marker argument is a structure pointer, the struct type can’t be obtained from a SystemTap script, so the members of the struct are not easily accessible. Perhaps the documentation that results from post-processing the source could also provide the type of each argument. Just a suggestion.

POWER, Linux, and IBM Software Demonstrations

I’ve spent a little time recently looking through IBM’s DEMOcentral, a repository that collects and displays demonstrations of both software and hardware products, and found several of interest. These pre-recorded and sometimes interactive demos play in a browser window, are available in several languages, and range from high-level overviews to tutorials covering installation or specific features; if you’re interested in IBM systems, WebSphere, Tivoli, or any other IBM software technologies, it’s well worth your time to browse through the collection. Here’s a quick tour of what I found to be interesting.

Hardware Flyovers

Hardware “flyovers” are interactive demos that show IBM systems, inside and out, with annotations that pop up when you mouse over the various components. For example:

  • The JS21 flyover shows the blade from the front and back, as well as inside the cover.
  • The System p 570 flyover shows the system from the front (with or without the cover) and the back, and allows you to zoom in to view the detail of the processor books (other flyovers allow that as well, like the p5 550Q flyover). It also shows how to interconnect multiple systems to make an 8-, 12-, or 16-core system (select the “upgrade” graphic to see the interconnections).
  • For the big iron junkies, there are even System p5 590/595 and System z9 flyovers.

Unfortunately, I haven’t found any flyovers of POWER6 systems yet; I assume it’s just a matter of time.

Recorded Software Demos

There are a number of recorded demos that are of interest to users of IBM systems, describing things like the BladeCenter Management Module, IBM Director, PowerExecutive, and IBM Virtualization Manager. There are many more; look at the complete list of systems demos to see if any others interest you.

Besides the systems demos, there are also demos of many IBM software products. For example, here is a demo detailing how to install DB2 Express on Linux. In addition to DB2/Information Management, there are several demo collections covering topics like Workplace, SOA, WebSphere Portal, Rational and other software development tools, the OmniFind Yahoo! Edition, and even Lotus Notes and Sametime.

DLPAR Tools Open Sourced

The latest versions of the powerpc-utils and powerpc-utils-papr packages have been released; source tarballs are available at http://powerpc-utils.ozlabs.org.

In addition to a few minor bug fixes, there is a significant addition to the powerpc-utils-papr package: the newly open-sourced DLPAR (Dynamic Logical PARtitioning) tools, namely the drmgr and lsslot commands. Both commands were previously shipped from the IBM website in the proprietary rpa-dlpar and rpa-pci-hotplug packages. Their inclusion in powerpc-utils-papr means that DLPAR capabilities will be present at system install time, rather than requiring additional packages to be downloaded and installed to enable them on System p.

So, what do these fancy new tools do? Good question. The drmgr command enables users to dynamically (at runtime) add and remove I/O, processors, and memory. (Yes, memory removal is not currently supported on Linux for System p, but that will be changing soon.) The drmgr command is meant to be driven from the HMC/IVM rather than the command line, although it can be run by hand; this explains its slightly cryptic usage and its limitations when invoked directly.

The lsslot command is a command-line tool that lists all DLPAR-capable or hotplug-capable I/O, PHBs (PCI Host Bridges), processors, and memory slots on the system. Although its (unfortunate) name implies that it will list all slots on the system, it does not.

Hopefully the powerpc-utils and powerpc-utils-papr packages are familiar to you. If not, you may recognize the names they appear under in the various distros: ppc64-utils on RHEL, or just powerpc-utils on SuSE. Both of those distros combine the packages into one, whereas Gentoo ships them separately. Merging the packages is most likely a holdover from when they were the combined ppc64-utils package; the community asked that ppc64-utils be split into a set of tools generic to the POWER platform (powerpc-utils) and those specific to PAPR-based POWER platforms (powerpc-utils-papr).

Guest Bloggers

In the interest of letting voices other than mine be heard on occasion in this neck of the woods, I thought it would be interesting to have guests post to this blog to discuss their own projects and thoughts.  We’ll see how it goes!

The first guest blogger will be Nathan Fontenot, who will introduce DLPAR (Dynamic Logical Partitioning) for Linux on POWER.  Look out for his post, coming soon.

Predictive Self Healing on Linux on POWER

Sun frequently touts their “predictive self-healing” implementation in Solaris 10. I wonder if that bullet point would be further down the list if they were familiar with the error detection, prediction, and correction capabilities of Linux on POWER platforms. In fact, the Linux on POWER implementation precedes the Solaris 10 implementation by at least a year (Solaris 10 was released in January 2005; SLES 8 had this solution for POWER in 2003, and RHEL 3 had it in 2004 at the latest).

I’ll take a moment to explain the superior aspects of the Linux on POWER implementation. The Solaris implementation consists of a number of diagnostics in the operating system that poll hardware devices for errors, and then perform notifications and/or recovery actions if a problem is detected. On POWER, hardware problem detection is largely done by the hypervisor and low-level firmware. That’s where it should be done; it means that the OS doesn’t even need to be booted for detection to occur, and doesn’t need to waste cycles polling. A huge number of devices are monitored this way: memory, CPUs, caches, fans, power supplies, VPD cards, voltage regulator modules, I/O subsystems, service processors, risers, even I/O drawers (and the fans, power supplies, etc. that those drawers may contain). PCI devices are also monitored; more details on that later.

If a failure (or impending failure) is detected, the hypervisor provides a report to every affected operating system installed on the system and to Hardware Management Consoles, if any are attached. On Linux partitions, the data is logged to the syslog and servicelog, and a number of actions may occur. Predictive CPU failures will cause the affected CPUs to be automatically removed via hotplug, so that the operating system may continue to run even after a catastrophic CPU or cache failure occurs. Severe thermal or voltage issues, and fan or power supply failures when redundant units aren’t available, will result in a shutdown to prevent hardware damage. In many cases, failures are automatically recovered by the hardware or firmware (for example, single- and double-bit memory errors are corrected via ECC, memory scrubbing, redundant bit-steering, and Chipkill), and the message to the OS is simply an FYI, or possibly an indication that the degraded device should be serviced at the administrator’s convenience. When a repair action is needed (device replacement, microcode updates, etc.), administrators are notified of the location code of the FRU and of which repair procedure to follow (as documented in InfoCenter).

On a side note, the fact that this monitoring is done at such a low level means that self-healing on POWER platforms is completely OS agnostic; the reports are provided to Linux, AIX, and i5/OS partitions. The OS just has to know how to get out of the way. For that matter, there doesn’t even need to be an OS installed: the platform error log is viewable using the service processor, which is also capable of driving repair procedures. Conversely, if you are running something besides Solaris on Sun hardware, or if the error occurs during boot time, Sun’s “self-healing” feature is useless.

An OpenSolaris presentation that I found indicates that their Fault Management includes “improved resilience for all PCI I/O failures,” but is vague on details. I’d like to compare it to PCI Error Recovery/EEH on Linux on POWER, but it is difficult to do so without more information. It seems to be (again) an OS-only implementation, which almost certainly wouldn’t be able to match the functionality provided by POWER platforms. On POWER, the hardware and hypervisor again provide assistance by fencing off adapters the instant a problem is detected (to avoid the possibility of data corruption) and then notifying the operating system, which directs the appropriate device drivers to restart the failed adapter.
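
For the curious, the driver side of that recovery is expressed through the kernel’s PCI error recovery callbacks. This is only a skeletal sketch (the driver name is hypothetical, and the real work of quiescing I/O, reinitializing the adapter, and restarting queues is elided), but it shows the shape of the interface that EEH drives:

#include <linux/pci.h>

/* Hypothetical driver callbacks; a real driver would quiesce I/O here,
 * reinitialize the adapter after the reset, and restart its queues. */
static pci_ers_result_t foo_error_detected(struct pci_dev *pdev,
                                           enum pci_channel_state state)
{
        /* The platform has already fenced (frozen) the slot; stop
         * touching the hardware and ask for a reset. */
        return PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t foo_slot_reset(struct pci_dev *pdev)
{
        /* Re-enable and reinitialize the device after the reset. */
        return PCI_ERS_RESULT_RECOVERED;
}

static void foo_resume(struct pci_dev *pdev)
{
        /* Recovery is complete; normal operation may resume. */
}

static struct pci_error_handlers foo_err_handlers = {
        .error_detected = foo_error_detected,
        .slot_reset     = foo_slot_reset,
        .resume         = foo_resume,
};

/* A driver registers these via the .err_handler member of its struct pci_driver. */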

Predictive Self-Healing always tops the list of Solaris 10 features (along with ZFS, Containers, and DTrace, which are reserved for other posts and/or other bloggers to discuss). Hopefully I’ve shown why it shouldn’t.

The Problems with EDAC

I’ve been looking into EDAC (Error Detection and Correction) a bit recently, to see how it compares with the error detection that is native to IBM’s POWER machines (and to see if there are any features we can better exploit on POWER). If you aren’t familiar with EDAC, it’s a collection of modules for querying the internal registers on some devices (most notably memory modules and PCI devices) to gather failure statistics. To some extent, EDAC is also moving to take action based on those statistics.

Here is the first and most difficult problem: How do you know what number of errors is acceptable before a device should be replaced? It’s entirely acceptable for a device to experience occasional, sporadic failures; these are typically recovered by the hardware itself (parity or ECC within a memory module, for example), with only a minute effect on performance (if any). A repair action should only be taken when these problems become more common. A good metric in this case would be the number of errors within the past hour; if the error count exceeds a threshold, then the device should be replaced. That threshold is the nut of the problem, as it can vary wildly depending on the device.
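
To make the idea concrete, here is a rough userspace sketch of the kind of windowed check that would be needed. It assumes EDAC’s per-memory-controller ce_count file in sysfs, and it uses a completely arbitrary threshold, which is exactly the number that is hard to choose:

/* Minimal sketch: watch EDAC's corrected-error counter and complain if more
 * than a threshold's worth of errors accumulate within an hour.  The sysfs
 * path and the threshold are assumptions for illustration only. */
#include <stdio.h>
#include <unistd.h>

#define CE_COUNT_PATH     "/sys/devices/system/edac/mc/mc0/ce_count"
#define HOURLY_THRESHOLD  100     /* arbitrary; choosing this is the hard part */
#define POLL_SECONDS      60

static long read_ce_count(void)
{
        long count = -1;
        FILE *f = fopen(CE_COUNT_PATH, "r");

        if (f) {
                if (fscanf(f, "%ld", &count) != 1)
                        count = -1;
                fclose(f);
        }
        return count;
}

int main(void)
{
        long window_start = read_ce_count();
        int elapsed = 0;

        while (window_start >= 0) {
                long now;

                sleep(POLL_SECONDS);
                elapsed += POLL_SECONDS;

                now = read_ce_count();
                if (now < 0)
                        break;

                if (now - window_start > HOURLY_THRESHOLD)
                        printf("corrected-error rate exceeded threshold; "
                               "the DIMM may need to be replaced\n");

                if (elapsed >= 3600 || now - window_start > HOURLY_THRESHOLD) {
                        /* start a new one-hour window */
                        window_start = now;
                        elapsed = 0;
                }
        }
        return 0;
}

Even this trivial sketch begs the question: 100 errors per hour is a made-up number, and the right value differs from device to device.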

On POWER machines, the firmware takes care of thresholding these errors, and sends an event to the OS (including Linux) when the threshold is exceeded. The users don’t need to know how many errors have occurred; all they need to know is that the device at a specific location code is failing, and should be repaired.

Here’s another issue: EDAC polls, say once per second, and reads status registers on certain devices. It then clears the contents of those registers so that only new errors will be registered on the next poll. On many appropriately equipped enterprise systems, the service processor will also poll those same registers to perform predictive failure analysis (PFA). Unfortunately, if EDAC is running on the system and clearing the registers, the service processor will be unable to obtain an accurate count of errors for thresholding.

When it comes to PCI errors, POWER includes EEH support for both detection and seamless recovery. EEH seems to be vastly superior in this regard, as the system firmware will cause the bus to be immediately frozen (to ensure that erroneous data is not written or read), and the Linux kernel/driver will reset the device and bring it back online, frequently within the space of a second. I’m not sure how well EDAC plays with AER (Advanced Error Reporting, for PCI-Express systems); I’ll probably write about that when I learn more.

EDAC in its current form seems to be useful only for home users, whose systems are not equipped with service processors and who are wondering why their machines are suddenly misbehaving. I think it has promise for the enterprise space, but step one is to have it not stomp on any data gathering done by service processors, and step two is to provide information on when the error statistics become meaningful (thresholds).

Sample Real-World Use of SystemTap

SystemTap has sometimes been plagued with “solution in search of a problem” complaints, so it was interesting to run across an example of SystemTap being used to solve a real-world problem. A developer discovered an OOM (Out Of Memory) condition in the upstream kernel, but in the quest to obtain additional information regarding the issue, he found that the kernel refused to boot with additional debug printk()’s added to quicklist.c. The developer, being familiar with SystemTap, used the following script instead:

probe kernel.statement("quicklist_trim@mm/quicklist.c:56")
{
        printf(" q->nr_pages is %d, min_pages is %d ----> %s\n",
               $q->nr_pages, $min_pages, execname());
}

The SystemTap scripting language takes a little getting used to, but the intent here is clear: each time execution reaches line 56 of mm/quicklist.c (inside quicklist_trim()), print a message that displays some kernel data, along with the name of the process that triggered it.

This sort of usage is interesting for two reasons:

  1. Kernel developers understand that there are places in the kernel that cannot be debugged by the printk() method; kprobes (simplified via SystemTap) provide a method to extract debug data from many of those locations, and this is an example of that in action.
  2. When using SystemTap to gather debug data, the system doesn’t need to be rebooted with a new kernel. In this particular scenario, that didn’t matter so much, as the problem was discovered by a developer who could modify his own kernel and reboot the system as needed. However, if an issue is discovered on a production system, this allows for root cause analysis to proceed without the need to take the system out of production.