Servicelog Updates

The servicelog package has been updated to version 1.0. This new version uses an SQLite database as a backend (instead of the Berkeley DB backend that the 0.x stream used). The primary advantage of the SQLite relational backend is that the servicelog can be searched with standard SQL queries. The --query flag to the servicelog command now takes an SQL WHERE clause as an argument. For example, to view all open serviceable events, run:

/usr/bin/servicelog --query "serviceable=1 and closed=0"

To view all migrations that a logical partition has undergone:

/usr/bin/servicelog --query 'refcode="#MIGRATE"'

The ability to register notification tools with servicelog, available in the 0.x stream, is still supported, with even more flexibility: now you can specify a query string for matching when registering a new notification tool.  When a new event is logged, the tool will only be invoked if the event matches the criteria specified in that query string.  For example, run the following command (as root) to cause a tool called /opt/foo/some_command to be automatically invoked just after a partition is migrated to a different system:

/usr/bin/servicelog_notify --add --command='/opt/foo/some_command' --match='refcode="#MIGRATE"'
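
Depending on the version of servicelog installed, the servicelog_notify command may also provide options for reviewing and unregistering notification tools; check its --help output or man page for the exact flags. On systems where --list and --remove are supported, that might look like:

/usr/bin/servicelog_notify --list
/usr/bin/servicelog_notify --remove 1

Here, 1 stands in for the registration ID reported by the listing output.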

Power Platform Diagnostics: Source Available

The package for performing Power platform diagnostics, ppc64-diag, has just been open sourced under the Eclipse Public License.  Much of what I discussed in my previous post about predictive self healing is implemented in this package (and in servicelog, which is already open source).

Here are some of the advantages provided by the ppc64-diag package:

  • retrieval of first-failure error data from platform-level components, such as memory, CPUs, caches, fans, power supplies, VPD cards, voltage regulator modules, I/O subsystems, service processors, risers, etc.
  • the ability to offline CPUs or logical memory blocks (LMBs) that are predicted to fail
  • notifications of EPOW events (environmental and power warnings), and initiation of shutdowns due to abnormal thermal and voltage conditions if no redundant fans or power supplies are available
  • monitoring of platform-level elements (fans, power supplies, riser cards, etc.) in external I/O enclosures
  • retrieval of dumps from platform components to assist in problem determination (for example, dump data from a failed service processor)

The ppc64-diag package is generally install-and-forget; any platform events that may occur are logged to servicelog, with indications of the event severity and whether the event is serviceable (i.e. requires service action) or not.  Additional relevant information is also logged to servicelog, such as a reference code, and the location code and part number of a failing device (obtained from lsvpd).  Tools may be registered with servicelog to be automatically notified when new events are logged.
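
Putting the earlier pieces together, the query syntax and notification registration described above can be combined so that a tool of your own is run whenever ppc64-diag logs a serviceable event. For example (with /opt/foo/page_admin as a hypothetical script):

/usr/bin/servicelog_notify --add --command='/opt/foo/page_admin' --match='serviceable=1'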

DLPAR Tools Open Sourced

The latest versions of the powerpc-utils and powerpc-utils-papr packages have been released; source tarballs are available at http://powerpc-utils.ozlabs.org.

In addition to a few minor bug fixes, there is a significant addition to the powerpc-utils-papr package: the newly open-sourced DLPAR (Dynamic Logical PARtitioning) tools. These new tools are the drmgr and lsslot commands. Both of these commands were previously shipped from the IBM website in the (proprietary) rpa-dlpar and rpa-pci-hotplug packages. The inclusion of these tools in the powerpc-utils-papr package means that DLPAR capabilities will be present at system install time, rather than requiring additional packages to be downloaded and installed to enable this on System p.

So, what do these fancy new tools do? Good question. The drmgr command enables users to dynamically (at runtime) add and remove I/O, processors, and memory. (Yes, memory removal is not currently supported on Linux for System p, but that will be changing soon.) The drmgr command is meant to be driven from the HMC/IVM rather than the command line, although it can be invoked directly. This explains its slightly cryptic usage and its limitations when used directly.
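
As an illustration only (the exact options can vary between versions; run drmgr with no arguments to see the usage your build supports), adding a single processor to a partition from the command line might look like:

drmgr -c cpu -a -q 1

where -c selects the resource type, -a requests an add operation (as opposed to -r for remove), and -q gives the quantity of resources to act on.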

The lsslot command is a command-line tool that lists all DLPAR-capable or hotplug-capable I/O, PHBs (PCI Host Bridges), processors, and memory slots on the system. Although its (unfortunate) name implies that it will list all slots on the system, it does not.
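
For example (again, the supported options may differ slightly between package versions), listing the hotplug-capable PCI slots or the DLPAR-capable processors might look like:

lsslot -c pci
lsslot -c cpu

where -c selects the class of slot to display.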

Hopefully the powerpc-utils and powerpc-utils-papr packages are familiar to you. If not, you may recognize the names they appear under in the various distros, such as ppc64-utils on RHEL or just powerpc-utils on SuSE. Both of these distros combine the packages into one, whereas Gentoo ships them separately. Merging the packages is most likely a hold-over from when they were the combined ppc64-utils package. Community requests asked for the previous ppc64-utils package to be split into a set of tools generic to the POWER platform (powerpc-utils) and those specific to PAPR-based POWER platforms (powerpc-utils-papr).

Servicelog Source Available

Source code for the servicelog library and utilities is now available from the linux-diag project on SourceForge: http://linux-diag.sourceforge.net/servicelog.html. There is user-level documentation (PDF) for servicelog available on SourceForge as well.

Why is servicelog different from other logging mechanisms, such as syslog? It is intended to store only entries that are relevant to system service. It introduces the concept of a serviceable event: a single servicelog entry that contains enough information to identify a failure and to indicate how to repair it. This information will typically include:

  • a short description of the failure, including a reference code
  • identification of the physical location of a failing component (via location code, for example)
  • indication of severity of the failure and/or priority of the repair
  • pointers to documented procedures for repairing the failure (for example, PCI hotplug instructions for replacing a failing PCI adapter)

System management tools can register to be notified when new serviceable events are created (the Service Focal Point on the Hardware Management Console will be updated when a serviceable event is logged on a Linux partition on System p). When a failure is fixed (for example, a failed PCI adapter is replaced via a hotplug action), a repair action should be logged to servicelog, which will cause all of the relevant open serviceable events to be marked as “closed” (i.e., fixed). This will provide a complete history of all of the failures that have occurred on a system, as well as all of the repair actions that have taken place.

Servicelog is particularly useful with Linux on System p right now. The superior First Failure Data Capture (FFDC) facilities provided by System p will result in very informative servicelog entries to indicate a wide range of possible platform failures, and each reference code and repair procedure is documented in IBM’s eServer hardware InfoCenter.

Hardware Inventory with lsvpd

VPD, Vital Product Data, is information associated with system hardware that is useful for easing system configuration and service. The lsvpd package for Linux provides commands that can be used to retrieve an inventory of system hardware, along with the VPD associated with each device. The lsvpd package installs three commands: lsvpd (“list VPD”), lscfg (“list configuration”), and lsmcode (“list microcode”). The lscfg command produces the most human-readable output of the three; lsvpd and lsmcode produce output that is more easily consumed by scripts and applications.

The lsvpd package requires the libvpd library. The libvpd library can also be used to retrieve inventory data from within an application; in fact, that’s how lsvpd, lscfg, and lsmcode work.

Types of Vital Product Data

Running lscfg by itself will list each device, along with its location code. More detailed VPD for each device on that list can be obtained by running "lscfg -vl <device>". The following examples illustrate the type of data that can be retrieved with the lsvpd package:

# lscfg -vl eth0
  eth0             U787A.001.DNZ00Z5-P1-T5
                                         Port 1 - IBM 2 PORT 10/100/1000
                                         Base-TX PCI-X Adapter (14108902)

        Manufacturer................Intel Corporation
        Machine Type and Model......82546EB Gigabit Ethernet Controller
                                    (Copper)
        Network Address.............00096b6b0591
        Device Specific.(YL)........U787A.001.DNZ00Z5-P1-T5

The description, manufacturer, model number, MAC address, and location code of the eth0 device are all noted in the output. Here is another example, for a hard drive:

# lscfg -vl sda
  sda              U787A.001.DNZ00Z5-P1-T10-L8-L0
                                         16 Bit LVD SCSI Disk Drive (73400 MB)

        Manufacturer................IBM
        Machine Type and Model......ST373453LC
        FRU Number..................00P2685
        ROS Level and ID............43353141
        Serial Number...............0007EA3B
        EC Level....................H12094
        Part Number.................00P2684
        Device Specific.(Z0)........000003129F00013E
        Device Specific.(Z1)........0626C51A
        Device Specific.(Z2)........0002
        Device Specific.(Z3)........04112
        Device Specific.(Z4)........0001
        Device Specific.(Z5)........22
        Device Specific.(Z6)........H12094
        Device Specific.(YL)........U787A.001.DNZ00Z5-P1-T10-L8-L0

The location code, description, model, and manufacturer are all there, along with the FRU and part numbers (for ordering new parts), the serial number of the device, and its current microcode level (“ROS Level and ID”).

The -A flag to lsmcode will list all the microcode levels on the system, including the system firmware level:

# lsmcode -A
sys0!system:SF240_320 (t) SF220_051 (p) SF240_320 (t)|service:
sg6 1:255:255:255 !570B001.0FC93FFC0FC93FFC
sg5 1:0:4:0 sdd !HUS103036FL3800.0FC94004100698C86F7374
sg4 1:0:3:0 sdc !HUS103036FL3800.0FC942E40FC942E40620
sg3 0:255:255:255 !570B001.0FC940040FC940040FC93FF410193860
sg2 0:0:15:0 !VSBPD3E   U4SCSI.0FC9420C0FC9420C0620
sg1 0:0:5:0 sdb !ST336607LC.0FC9420C0FC9420C0620
sg0 0:0:3:0 sda !ST373453LC.0FC942040FC942040620

See my previous article on pSeries and System p firmware for a description of the dual firmware banks, and information on updating your system firmware level. Currently, device microcode must be updated using a microcode update utility specific to the device in question (iprutils for the onboard RAID SCSI HBAs on POWER5, for example).

Refreshing the VPD Database

Unfortunately, the data in the lsvpd database can become stale as devices are added or changed (via hotplug or DLPAR, for example). Running /usr/sbin/vpdupdate will cause the data to be refreshed. The developers of lsvpd are currently working on having vpdupdate run automatically in response to hotplug events.
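
For example, after adding an adapter via hotplug or DLPAR, the database can be rebuilt and the new device checked for with something like the following (eth1 here is just a stand-in for whatever name the new device receives):

# /usr/sbin/vpdupdate
# lscfg | grep eth1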

Other Tools for Hardware Inventory

Besides lsvpd, there are several other Linux tools that can assist with hardware inventory for system configuration or service:

  • HAL (Hardware Abstraction Layer): run hal-device for a list of devices
  • Open Firmware device tree (on Power): stored in /proc/device-tree
  • The sysfs filesystem (usually mounted on /sys)
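
For example, each of these sources can be browsed directly from the shell (the eth0 path below is just illustrative, and hal-device is present only if HAL is installed):

# hal-device | less
# ls /proc/device-tree
# ls /sys/class/net/eth0/device

hal-device prints every device HAL knows about, /proc/device-tree exposes the Open Firmware device tree as a directory hierarchy, and the sysfs attributes under /sys describe each device the kernel has discovered.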