Power Platform Diagnostics: Source Available

The package for performing Power platform diagnostics, ppc64-diag, has just been open sourced under the Eclipse Public License.  Much of what I discussed in my previous post about predictive self healing is implemented in this package (and in servicelog, which is already open source).

Here are some of the advantages provided by the ppc64-diag package:

  • retrieval of first-failure error data from platform-level components, such as memory, CPUs, caches, fans, power supplies, VPD cards, voltage regulator modules, I/O subsystems, service processors, risers, etc.
  • the ability to offline CPUs or logical memory blocks (LMBs) that are predicted to fail (a rough sketch of the underlying interface follows this list)
  • notifications of EPOW events (environmental and power warnings), and initiation of shutdowns due to abnormal thermal and voltage conditions if no redundant fans or power supplies are available
  • monitoring of platform-level elements (fans, power supplies, riser cards, etc.) in external I/O enclosures
  • retrieval of dumps from platform components to assist in problem determination (for example, dump data from a failed service processor)
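
Regarding the offlining of CPUs and LMBs mentioned above: here is a rough sketch of the sysfs interfaces involved.  ppc64-diag drives this automatically when a predictive failure event arrives, so the commands below are purely illustrative; the CPU and memory block numbers are made up, and the exact mechanism the package uses is an implementation detail.

# echo 0 > /sys/devices/system/cpu/cpu2/online                # take CPU 2 offline
# echo offline > /sys/devices/system/memory/memory10/state    # offline one LMB (on kernels with memory hot-remove support)
# echo 1 > /sys/devices/system/cpu/cpu2/online                # bring the CPU back online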

The ppc64-diag package is generally install-and-forget; any platform events that may occur are logged to servicelog, with indications of the event severity and whether the event is serviceable (i.e. requires service action) or not.  Additional relevant information is also logged to servicelog, such as a reference code, and the location code and part number of a failing device (obtained from lsvpd).  Tools may be registered with servicelog to be automatically notified when new events are logged.

Memory Recovery on Power

On Power systems, there is a hierarchy of protection methods to guard against progressively rarer (and more catastrophic) types of memory failures.  There are a lot of terms used to describe these memory protection methods, so I thought I’d write a post to explain the various methods and how they relate to one another.

ECC (error correcting code) provides the first level of memory protection; it is capable of correcting single-bit errors, and of detecting (but not correcting) double-bit errors.

While memory is idle, memory scrubbing corrects soft single-bit errors in the background.  These single-bit errors would be corrected by ECC anyway when the memory is read, but memory scrubbing vastly reduces the number of double-bit errors (which ECC cannot correct) by catching and fixing the single-bit errors earlier.

If a defined threshold of errors is reached on a single bit, that memory line is considered faulty and is dynamically reassigned to a spare memory chip on the module via bit steering.  If all the bits on this spare memory chip are already used up when the error threshold is reached, the system’s service processor will generate a deferred maintenance request to indicate that the memory module should be replaced in the next available scheduled maintenance window.

Bits from multiple memory chips (on the same module) are scattered across four separate ECC words, a design which is called (logically enough) bit scattering.  This allows the system to handle simultaneous errors from multiple arrays on the same memory chip, because there will be no more than one faulty bit per ECC word.  Recovery in this scenario is handled via a process called Chipkill.  Because memory has been written across multiple chips on the module (similar to striping on disk arrays), the memory controller is capable of reconstructing the data from the killed chip.

If memory scrubbing identifies a hard error (soft errors are recovered by rewriting the correct data back to the memory location), the OS is notified of the failure so that it can remove the associated page.  This is called dynamic page deallocation; it is currently supported on AIX 5L and i5/OS, and will soon be supported in Linux.  Dynamic page deallocation protects against an alignment of failed cells in two separate memory modules.

A more catastrophic memory failure can result in the unavailability of an entire row or column in the array, or of an entire chip.  Redundant bit steering protects against an alignment of those failed memory cells with any future failures.

Interestingly, elevation has a substantial effect on soft error rates.  An elevation of 5,000 feet results in a 3- to 5-fold increase in soft errors (in comparison to sea level).  An elevation of 30,000 feet results in a 100-fold increase, due to the substantial increase in cosmic radiation.

PCI error recovery: HP versus Linux on POWER

I wrote a post discussing PCI error recovery (via EEH) on Linux on POWER a few months ago [1], but I did not take the opportunity to compare it to other PCI error recovery methods at the time.  I’ve found some documentation on HP’s PCI error recovery since then, so I thought I’d post this article as a follow-on.

PCI error recovery on HP systems requires the installation of a feature called PCI Advanced Error Handling.  Notably, this feature is only available for HP-UX; recovery from PCI errors cannot be done with Linux on HP systems.  Installing the feature shifts the PCI slots into a “soft fail” mode.  If a PCI error occurs on a slot in soft fail mode, the slot is frozen and prevented from performing any further I/O.  However, recovery from this frozen state is not automatic; it must be performed by hand (using the olrad command; I believe OLRAD is an acronym for On-Line Repair/Add/Delete) [2].  By contrast, PCI error recovery on Linux on POWER is seamless and requires no user intervention: the frozen slot is detected on the next read operation, and the device is immediately reinitialized and made available for use.

Interestingly, there are two other limitations of PCI Advanced Error Handling on HP-UX.  First, if there is only a single path configured to storage devices, failover features such as HP’s Serviceguard may not detect the loss of connectivity, which they need in order to perform a failover operation [3].  This is not an issue with PCI failures on Linux on POWER, because the device is reinitialized immediately; there is no need to fail over while waiting for an administrative repair action.  Second, if a new PCI adapter is added to the system, it is initially set to “hard fail” mode until it can be established that the driver is capable of handling “soft fail” mode.  A PCI error during this window would cause a machine check, resulting in a system crash [3].  Such a gap does not exist in the Linux on POWER PCI error recovery implementation.

Hopefully I’ve been able to showcase the superior aspects of the EEH capabilities provided by the POWER platform for PCI error recovery; the fact that these capabilities are taken advantage of by both AIX and Linux makes the picture even better for POWER.

References:
[1] https://zombieprocess.wordpress.com/2007/09/16/pci-error-recovery-in-linux/
[2] http://h71028.www7.hp.com/ERC/downloads/c00767235.pdf
[3] http://docs.hp.com/en/5991-5308/index.html

Linux on POWER Odds and Ends: Service Utility Roundup

There are a number of small utilities for Linux on POWER that can come in handy when servicing or configuring your system. Here are a few utilities from the powerpc-utils-papr package that you may find useful.

The set_poweron_time utility can be used to specify a time in the future when the system or partition should be powered on, if it happens to be off at that time. For example, if a partition should be automatically started 12 hours and 10 minutes from now, run the following command: set_poweron_time -d h12m10. If the partition is off when that time arrives, it will be powered on.

The bootlist command is used to modify the order of boot devices from the command line. Boot lists on POWER are stored as Open Firmware device names, but bootlist allows you to specify logical device names (like “sda” or “eth0”) if you choose; the ofpathname utility is used by bootlist to convert between OF device names and logical device names (between “/vdevice/v-scsi@30000002/disk@8100000000000000” and “sda”, for example).
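
For example (I am going from memory on the flag names, so double-check the man pages before relying on them):

# ofpathname sda
/vdevice/v-scsi@30000002/disk@8100000000000000
# ofpathname -l /vdevice/v-scsi@30000002/disk@8100000000000000
sda
# bootlist -m normal -o              # display the current normal-mode boot list
# bootlist -m normal sda eth0        # boot from sda first, then try a network boot from eth0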

usysident is a tool for manipulating identification LEDs. These LEDs are used to help locate FRUs (field replaceable units), to ensure that the correct part is being replaced. LEDs are specified by their location code or logical device name, and can be in one of two states: either “normal” (off) or “identify” (blinking amber LED). Run usysident without any parameters to view the available LEDs; to flash the LED on eth0: usysident -d eth0 -s identify.
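
For instance, to locate the LEDs, flash the one associated with a failing adapter, and then turn it back off after the repair (the two states are the ones mentioned above):

# usysident                          # list all identification LEDs and their current states
# usysident -d eth0 -s identify      # set the LED for eth0 blinking
# usysident -d eth0 -s normal        # turn it back off once the part has been replaced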

A related utility is usysattn; it’s used to turn off the system attention indicator, or to view the current state of that LED. The LED usually looks like an amber exclamation point located on the operator’s panel, as in the image below from a p520.

[Image: POWER5 op panel]

On a partitioned system, though, the system attention indicator will be illuminated if any of the partitions has activated it; the indicator reflects whether any partition on the system requires attention. Refer to the Service Focal Point on the HMC or IVM to determine which partition is asking for attention.
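
My recollection is that usysattn follows the same pattern as usysident; treat the flag below as an assumption and check the man page:

# usysattn                           # display the current state of the system attention indicator
# usysattn -s normal                 # turn the indicator off once the events have been handled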

serv_config is a very useful utility for modifying serviceability parameters. I talked a little about it in an earlier post, so refer to that entry for more details.

The uesensor command can be used to view the values of various thermal, voltage, and fan speed sensors on the system. Unfortunately, these sensors are only exposed on POWER4 systems and some blades; more recent systems will instead send an EPOW (environmental and power warning) event if any of the sensors are in danger of shifting out of the normal operating range. EPOW events are exposed in servicelog.
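
On a system that does expose the sensors, the invocation is simple (the -l flag is from memory; see the man page for the full set of options):

# uesensor -l                        # list all sensors and their current values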

All of these commands have man pages; take a look there if you need more details.

POWER Reference Codes

One nice advantage provided by POWER systems is the availability of structured and well-defined reference codes. Besides indicating errors or conditions that otherwise require attention, these codes are also used to indicate the progress of boots or dumps. If your system failed to boot for some reason, the last reference code on the operator’s panel (op panel) would provide a good clue as to what the system was doing just before the failure.

On Linux, besides appearing on the op panel, these reference codes are also found in events that are surfaced in servicelog. While servicelog contains a lot of details that are useful for servicing errors, more information can always be obtained by looking up the reference code.

There are a few kinds of reference codes; the key for decoding these refcodes is the IBM Hardware InfoCenter. I’ll briefly explain the three different types of reference codes (SRCs, SRNs, and menugoals) before showing how they are displayed in servicelog.

System Reference Codes

SRCs are sequences of alphanumeric characters (usually 8 — just enough to fit snugly on the display of the operator’s panel — but sometimes 6). They were first introduced on POWER5 systems, and exist on both System p and System i (formerly pSeries and iSeries). SRCs are documented in InfoCenter: “Service provider information”/”Reference codes”/”Using system reference codes”.

An example of an SRC used as a progress code is C7004091; that refcode indicates that the partition is in a standby state, and is waiting to be manually activated. If the partition is set to be activated automatically, the partition will not stop at this SRC, but will continue to the Open Firmware boot phase.

Linux does not generate SRCs as progress codes, but will generate some as error codes. Additionally, if you have a POWER5 or POWER6 system, events with SRCs may be written to servicelog to indicate platform-level errors.

Service Request Numbers

SRNs are an older formatting method for progress or error codes. They are generated by diagnostics in AIX, and by the firmware on POWER4 (and earlier) systems. If the progress/error code has 5 digits, or has a ‘-’ character somewhere in it, it is an SRN. These are documented in InfoCenter: “Service provider information”/”Reference codes”/”Using service request numbers”.

As an example, the SRN 747-223 indicates that there was a “miscompare during the write/read of the memory I/O register.” Many SRNs point to a repair procedure called a MAP; in this case, the SRN points to MAP 0050, “SCSI bus problems”, which provides procedures for analyzing and repairing the problem.

Linux does not generate SRNs, but you may still see SRNs generated by older POWER platforms. They may also be generated if you boot the eServer Standalone Diagnostics CD to run device diagnostics.

Menugoals

Menugoals are reference codes that begin with a ‘#’ character. They are generated by diagnostics, and indicate procedures that can be performed by a system admin rather than by a trained service representative. Menugoals don’t typically indicate errors, but instead convey additional information about the state of the device being diagnosed. As an example, a menugoal might indicate that a tape drive requires cleaning.

Reference Codes in servicelog

Each event in servicelog has a refcode field, which will always contain a reference code (either an SRC, an SRN, or a menugoal). Here is a sample event from servicelog indicating a platform error reported by a POWER system:

PPC64 Platform Event:
Servicelog ID:      64
Event Timestamp:    Fri Dec 10 21:37:05 2004
Log Timestamp:      Wed Apr 18 00:19:12 2007
Severity:           4 (WARNING)
Version:            2
Serviceable Event:  Yes
Event Repaired:     No
Reference Code:     B125E500
Action Flags:       a800
Event Type:         224 - Platform Event
Kernel ID:          1000
Platform ID:        50929493
Creator ID:         E - Service Processor
Subsystem ID:       25 - Memory subsystem including external cache
RTAS Severity:      41 - Unrecoverable Error, bypassed with degraded performance
Event Subtype:      00 - Not applicable
Machine Type/Model: 9118-575
Machine Serial:     0SQIH47

Extended Reference Codes:
2: 030000f0  3: 28f00110  4: c13920ff  5: c1000000
6: 00811630  7: 00000001  8: 00d6000d  9: 00000000

Description:
Memory subsystem including external cache Informational (non-error) Event.
Refer to the system service documentation for more information.

<< Callout 1 >>
Priority            M
Type                16
Repair Event Key:   0
Procedure Id:       n/a
Location:           U787D.001.0481682-P2
FRU:                80P4180
Serial:             YH3016129997
CCIN:               260D

The error description provides some details concerning the failure, and the FRU callout indicates which part to repair in order to fix the problem. The refcode field contains an SRC, B125E500; looking that SRC up in InfoCenter shows the following details:

  • B1 indicates it was reported by the service processor
  • 25 indicates that it is an “external cache event or error reported by the service processor”
  • E500 indicates that it is a result of processor runtime diagnostics (PRD)

In addition to that, the InfoCenter entry for B125E500 indicates that this event is the result of a hardware failure. The FRU callout indicates which piece of hardware should be replaced to resolve the error.

POWER, Linux, and IBM Software Demonstrations

I’ve spent a little time recently looking through IBM’s DEMOcentral, a repository that collects and displays demonstrations of both software and hardware products, and found several of interest. These pre-recorded and sometimes interactive demos play in a browser window, are available in several languages, and range from high-level overviews to tutorials covering installation or specific features; if you’re interested in IBM systems, WebSphere, Tivoli, or any other IBM software technologies, it’s well worth your time to browse through the collection. Here’s a quick tour of what I found to be interesting.

Hardware Flyovers

Hardware “flyovers” are interactive demos that show IBM systems, inside and out, with annotations that pop up when you mouse over the various components. For example:

  • The JS21 flyover shows the blade from the front and back, as well as inside the cover.
  • The System p 570 flyover shows the system from the front (with or without the cover) and the back, and allows you to zoom in to view the detail of the processor books (other flyovers allow that as well, like the p5 550Q flyover). It also shows how to interconnect multiple systems to make an 8-, 12-, or 16-core system (select the “upgrade” graphic to see the interconnections).
  • For the big iron junkies, there are even System p5 590/595 and System z9 flyovers.

Unfortunately, I haven’t found any flyovers of POWER6 systems yet; I assume it’s just a matter of time.

Recorded Software Demos

There are a number of recorded demos that are of interest to users of IBM systems, describing things like the BladeCenter Management Module, IBM Director, PowerExecutive, and IBM Virtualization Manager. There are many more; look at the complete list of systems demos to see if any others interest you.

Besides the systems demos, there are also demos of many IBM software products. For example, here is a demo detailing how to install DB2 Express on Linux. In addition to DB2/Information Management, there are several demo collections that cover topics like Workplace, SOA, WebSphere Portal, Rational and other software development tools, the OmniFind Yahoo! Edition, and even Lotus Notes and Sametime.

DLPAR Tools Open Sourced

The latest versions of the powerpc-utils and powerpc-utils-papr packages have been released; source tarballs are available at http://powerpc-utils.ozlabs.org.

In addition to a few minor bug fixes, there is a significant addition to the powerpc-utils-papr package: the newly open sourced DLPAR (Dynamic Logical PARtitioning) tools. These new tools are the drmgr and lsslot commands. Both commands were previously shipped from the IBM website in the proprietary rpa-dlpar and rpa-pci-hotplug packages. Their inclusion in powerpc-utils-papr means that DLPAR capabilities will be present at system install time, rather than requiring the download and installation of additional packages to enable them on System p.

So, what do these fancy new tools do? Good question. The drmgr command enables users to dynamically (at runtime) add and remove I/O, processors, and memory. (Yes, memory remove is not currently supported on Linux for System p, but that will be changing soon.) The drmgr command is meant to be driven from the HMC/IVM rather than from the command line, although it can be run by hand; that explains its slightly cryptic usage and its limitations when invoked directly.
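
For the curious, hand-driven usage looks roughly like the following; the flags are from my memory of the man page and the quantities are made up, so treat this as a sketch rather than a recipe:

# drmgr -c cpu -r -q 1               # remove one processor from the partition
# drmgr -c mem -a -q 2               # add two logical memory blocks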

lsslot is a command-line tool that lists all DLPAR- or hotplug-capable I/O slots, PHBs (PCI Host Bridges), processors, and memory on the system. Although its (unfortunate) naming implies that it will list all slots on the system, it does not.
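
A few examples of what it will list (again, the -c values are from memory):

# lsslot -c pci                      # DLPAR/hotplug-capable PCI slots
# lsslot -c phb                      # PCI host bridges
# lsslot -c cpu                      # processors
# lsslot -c mem                      # memory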

Hopefully the powerpc-utils and powerpc-utils-papr packages are familiar to you. If not, you may recognize the names under which they appear in the various distros, such as ppc64-utils on RHEL or simply powerpc-utils on SuSE. Both of those distros combine the two packages into one, whereas Gentoo ships them separately. Merging the packages is most likely a hold-over from when they were the combined ppc64-utils package; the community asked for that package to be split into a set of tools generic to the POWER platform (powerpc-utils) and those specific to PAPR-based POWER platforms (powerpc-utils-papr).

Predictive Self Healing on Linux on POWER

Sun frequently touts their “predictive self-healing” implementation in Solaris 10. I wonder if that bullet point would be further down the list if they were familiar with the error detection, prediction, and correction capabilities of Linux on POWER platforms. In fact, the Linux on POWER implementation precedes the Solaris 10 implementation by at least a year (Solaris 10 was released in January 2005; SLES 8 had this solution for POWER in 2003, and RHEL 3 had it in 2004 at the latest).

I’ll take a moment to explain the superior aspects of the Linux on POWER implementation. The Solaris implementation consists of a number of diagnostics in the operating system that poll hardware devices for errors, and then perform notifications and/or recovery actions if a problem is detected. On POWER, hardware problem detection is largely done by the hypervisor and low-level firmware. That’s where it should be done; it means that the OS doesn’t even need to be booted for detection to occur, and doesn’t need to waste cycles polling. A huge number of devices are monitored this way: memory, CPUs, caches, fans, power supplies, VPD cards, voltage regulator modules, I/O subsystems, service processors, risers, even I/O drawers (and the fans, power supplies, etc. that those drawers may contain). PCI devices are also monitored; more details on that later.

If a failure (or impending failure) is detected, the hypervisor provides a report to every affected operating system installed on the system and to Hardware Management Consoles, if any are attached. On Linux partitions, the data is logged to the syslog and servicelog, and a number of actions may occur. Predictive CPU failures will cause the affected CPUs to be automatically removed via hotplug, so that the operating system may continue to run even after a catastrophic CPU or cache failure occurs. Severe thermal or voltage issues, and fan or power supply failures when redundant units aren’t available, will result in a shutdown to prevent hardware damage. In many cases, failures are automatically recovered by the hardware or firmware (for example, single- and double-bit memory errors are corrected via ECC, memory scrubbing, redundant bit-steering, and Chipkill), and the message to the OS is simply an FYI, or possibly an indication that the degraded device should be serviced at the administrator’s convenience. When a repair action is needed (device replacement, microcode updates, etc.), administrators are notified of the location code of the FRU and an indication of which repair procedure to follow (as documented in InfoCenter).

On a side note, the fact that this monitoring is done at such a low level means that self-healing on POWER platforms is completely OS agnostic; the reports are provided to Linux, AIX, and i5/OS partitions. The OS just has to know how to get out of the way. For that matter, there doesn’t even need to be an OS installed: the platform error log is viewable using the service processor, which is also capable of driving repair procedures. Conversely, if you are running something besides Solaris on Sun hardware, or if the error occurs during boot time, Sun’s “self-healing” feature is useless.

An OpenSolaris presentation that I found indicates that their Fault Management includes “improved resilience for all PCI I/O failures,” but is vague on details. I’d like to compare it to PCI Error Recovery/EEH on Linux on POWER, but it is difficult to do so without more information. It seems to be (again) an OS-only implementation, which almost certainly wouldn’t be able to match the functionality provided by POWER platforms. On POWER, the hardware and hypervisor again provide assistance by fencing off adapters the instant a problem is detected (to avoid the possibility of data corruption) and then notifying the operating system, which then directs the appropriate device drivers to restart the failed adapter.

Predictive Self-Healing always tops the list of Solaris 10 features (along with ZFS, Containers, and DTrace, which are reserved for other posts and/or other bloggers to discuss). Hopefully I’ve shown why it shouldn’t.

The Problems with EDAC

I’ve been looking into EDAC (Error Detection and Correction) a bit recently, to see how it compares with the error detection that is native to IBM’s POWER machines (and to see if there are any features we can better exploit on POWER). If you aren’t familiar with EDAC, it’s a collection of modules for querying the internal registers on some devices (most notably memory modules and PCI devices) to gather failure statistics. To some extent, EDAC is also moving to take action based on those statistics.

Here is the first and most difficult problem: How do you know what number of errors is acceptable before a device should be replaced? It’s entirely acceptable for a device to experience an occasional, sporadic failure; they are typically recovered by the hardware itself (parity or ECC within a memory module, for example), with only a minute effect on performance (if any). A repair action should only be taken when these problems become more common. A good metric in this case would be the number of errors within the past hour; if the error count exceeds a threshold, then the device should be replaced. That threshold is the nut of the problem, as it can vary wildly depending on the device.
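
To make the thresholding problem concrete, here is what the raw material looks like on an EDAC-enabled kernel. The counters are cumulative since boot, so a real policy would also need to track them over time to derive a rate, and any threshold you pick is exactly the guesswork described above:

# cat /sys/devices/system/edac/mc/mc0/ce_count       # corrected errors seen by memory controller 0
# cat /sys/devices/system/edac/mc/mc0/ue_count       # uncorrected errors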

On POWER machines, the firmware takes care of thresholding these errors, and sends an event to the OS (including Linux) when the threshold is exceeded. The users don’t need to know how many errors have occurred; all they need to know is that the device at a specific location code is failing, and should be repaired.

Here’s another issue: EDAC polls, say once per second, and reads status registers on certain devices. It then clears the contents of those registers so that only new errors will be registered on the next poll. On many appropriately equipped enterprise systems, the service processor will also poll those same registers to perform predictive failure analysis (PFA). Unfortunately, if EDAC is running on the system and clearing the registers, the service processor will be unable to obtain an accurate count of errors for thresholding.

When it comes to PCI errors, POWER includes EEH support for both detection and seamless recovery. EEH seems to be vastly superior in this regard, as the system firmware will cause the bus to be immediately frozen (to ensure that erroneous data is not written or read), and the Linux kernel/driver will reset the device and bring it back online, frequently within the space of a second. I’m not sure how well EDAC plays with AER (Advanced Error Reporting, for PCI-Express systems); I’ll probably write about that when I learn more.

EDAC in its current form seems to be only useful for home users, who are using systems that are not equipped with service processors and who are wondering why their system is suddenly misbehaving. I think it has promise for the enterprise space, but step one is to have it not stomp on any data gathering done by service processors, and step two is to provide information on when the error statistics become meaningful (thresholds).

Hardware Inventory with lsvpd

VPD, Vital Product Data, is information associated with system hardware that is useful for easing system configuration and service. The lsvpd package for Linux provides commands that can be used to retrieve an inventory of system hardware, along with the VPD associated with each device. The lsvpd package will install three commands: lsvpd (“list VPD”), lscfg (“list configuration”), and lsmcode (“list microcode”). The lscfg command is the human-readable command of the three; lsvpd and lsmcode provide output that is more easily read by scripts/applications.

The lsvpd package requires the libvpd library. The libvpd library can also be used to retrieve inventory data from within an application; in fact, that’s how lsvpd, lscfg, and lsmcode work.

Types of Vital Product Data

Running lscfg by itself will list each device, along with its location code. More detailed VPD for each device on that list can be obtained by running “lscfg -vl <device>”. The following examples illustrate the type of data that can be retrieved from the lsvpd package:

# lscfg -vl eth0
  eth0             U787A.001.DNZ00Z5-P1-T5
                                         Port 1 - IBM 2 PORT 10/100/1000
                                         Base-TX PCI-X Adapter (14108902)

        Manufacturer................Intel Corporation
        Machine Type and Model......82546EB Gigabit Ethernet Controller
                                    (Copper)
        Network Address.............00096b6b0591
        Device Specific.(YL)........U787A.001.DNZ00Z5-P1-T5

The description, manufacturer, model number, MAC address, and location code of the eth0 device are all noted in the output. Here is another example, for a hard drive:

# lscfg -vl sda
  sda              U787A.001.DNZ00Z5-P1-T10-L8-L0
                                         16 Bit LVD SCSI Disk Drive (73400 MB)

        Manufacturer................IBM
        Machine Type and Model......ST373453LC
        FRU Number..................00P2685
        ROS Level and ID............43353141
        Serial Number...............0007EA3B
        EC Level....................H12094
        Part Number.................00P2684
        Device Specific.(Z0)........000003129F00013E
        Device Specific.(Z1)........0626C51A
        Device Specific.(Z2)........0002
        Device Specific.(Z3)........04112
        Device Specific.(Z4)........0001
        Device Specific.(Z5)........22
        Device Specific.(Z6)........H12094
        Device Specific.(YL)........U787A.001.DNZ00Z5-P1-T10-L8-L0

The location code, description, model, and manufacturer are all there, along with the FRU and part numbers (for ordering new parts), the serial number of the device, and its current microcode level (“ROS Level and ID”).

The -A flag to lsmcode will list all the microcode levels on the system, including the system firmware level:

# lsmcode -A
sys0!system:SF240_320 (t) SF220_051 (p) SF240_320 (t)|service:
sg6 1:255:255:255 !570B001.0FC93FFC0FC93FFC
sg5 1:0:4:0 sdd !HUS103036FL3800.0FC94004100698C86F7374
sg4 1:0:3:0 sdc !HUS103036FL3800.0FC942E40FC942E40620
sg3 0:255:255:255 !570B001.0FC940040FC940040FC93FF410193860
sg2 0:0:15:0 !VSBPD3E   U4SCSI.0FC9420C0FC9420C0620
sg1 0:0:5:0 sdb !ST336607LC.0FC9420C0FC9420C0620
sg0 0:0:3:0 sda !ST373453LC.0FC942040FC942040620

See my previous article on pSeries and System p firmware for a description of the dual firmware banks, and information on updating your system firmware level. Currently, device microcode must be updated using a microcode update utility specific to the device in question (iprutils for the onboard RAID SCSI HBAs on POWER5, for example).

Refreshing the VPD Database

Unfortunately, the data in the lsvpd database can become stale as devices are added or changed (via hotplug or DLPAR, for example). Running /usr/sbin/vpdupdate will cause the data to be refreshed. The developers of lsvpd are currently working on having vpdupdate run automatically in response to hotplug events.
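
For example, after adding a new Ethernet adapter via DLPAR:

# vpdupdate                          # rebuild the VPD database
# lscfg | grep eth                   # confirm that the new adapter now shows up in the inventory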

Other Tools for Hardware Inventory

Besides lsvpd, there are several other Linux tools that can assist with hardware inventory for system configuration or service:

  • HAL (Hardware Abstraction Layer): run hal-device for a list of devices
  • Open Firmware device tree (on Power): stored in /proc/device-tree
  • The sysfs filesystem (usually mounted on /sys)
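
A couple of quick examples of the above:

# cat /proc/device-tree/model        # the machine type and model, straight from Open Firmware
# hal-device | less                  # browse the HAL device list
# ls /sys/devices/system/cpu         # the CPUs known to the kernel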