The package for performing Power platform diagnostics, ppc64-diag, has just been released as open source under the Eclipse Public License. Much of what I discussed in my previous post about predictive self-healing is implemented in this package (and in servicelog, which is already open source).
Here are some of the capabilities that the ppc64-diag package provides:
- retrieval of first-failure error data from platform-level components, such as memory, CPUs, caches, fans, power supplies, VPD cards, voltage regulator modules, I/O subsystems, service processors, risers, etc.
- the ability to offline CPUs or logical memory blocks (LMBs) that are predicted to fail
- notifications of EPOW events (environmental and power warnings), and initiation of shutdowns due to abnormal thermal and voltage conditions if no redundant fans or power supplies are available
- monitoring of platform-level elements (fans, power supplies, riser cards, etc.) in external I/O enclosures
- retrieval of dumps from platform components to assist in problem determination (for example, dump data from a failed service processor)
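
To give a feel for what offlining a CPU or memory block looks like from user space, here is a minimal sketch using the Linux kernel's generic sysfs hotplug files (`cpuN/online` and `memoryN/state` under `/sys/devices/system`). This is an illustration only, not necessarily the mechanism ppc64-diag itself uses; the function names are made up for this example, and how sysfs memory blocks correspond to LMBs depends on the platform.

```python
from pathlib import Path

# Standard location of the kernel's CPU and memory hotplug controls.
SYSFS_ROOT = Path("/sys/devices/system")


def cpu_online_path(cpu: int, root: Path = SYSFS_ROOT) -> Path:
    """Path of the sysfs file that controls whether a logical CPU is online."""
    return root / "cpu" / f"cpu{cpu}" / "online"


def memory_state_path(block: int, root: Path = SYSFS_ROOT) -> Path:
    """Path of the sysfs file that controls the state of a memory block."""
    return root / "memory" / f"memory{block}" / "state"


def is_cpu_online(cpu: int, root: Path = SYSFS_ROOT) -> bool:
    """Return True if the kernel reports the CPU as online."""
    return cpu_online_path(cpu, root).read_text().strip() == "1"


def offline_cpu(cpu: int, root: Path = SYSFS_ROOT) -> None:
    """Ask the kernel to take a logical CPU offline (needs root on a real system)."""
    cpu_online_path(cpu, root).write_text("0")


def offline_memory_block(block: int, root: Path = SYSFS_ROOT) -> None:
    """Ask the kernel to offline a memory block (needs root; may fail if in use)."""
    memory_state_path(block, root).write_text("offline")
```

On a real system these writes require root privileges, and the kernel may refuse the request (for example, if a memory block holds pages that cannot be migrated). The `root` parameter exists only so the helpers can be exercised against a scratch directory instead of the live `/sys` tree.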
The ppc64-diag package is generally install-and-forget: any platform events that occur are logged to servicelog, along with the event severity and whether the event is serviceable (i.e., requires a service action). Other relevant information is logged as well, such as a reference code and the location code and part number of the failing device (obtained from lsvpd). Tools can be registered with servicelog to be notified automatically when new events are logged.
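
As a rough sketch of how a registered tool might decide what to do with a new event, the snippet below models the fields mentioned above (severity, serviceable flag, reference code, location code, part number) and filters on them. The `PlatformEvent` class, the `Severity` levels, and the function names are all hypothetical stand-ins invented for this example; they are not servicelog's actual record layout or API, and the sample values are placeholders.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Hypothetical severity scale; servicelog defines its own levels."""
    INFO = 1
    WARNING = 2
    ERROR = 3
    FATAL = 4


@dataclass(frozen=True)
class PlatformEvent:
    """Hypothetical in-memory view of one logged platform event."""
    severity: Severity
    serviceable: bool
    refcode: str
    location: Optional[str] = None      # location code of the failing device
    part_number: Optional[str] = None   # part number, e.g. as reported by lsvpd


def should_notify(event: PlatformEvent,
                  min_severity: Severity = Severity.WARNING) -> bool:
    """Notify only for serviceable events at or above the chosen severity."""
    return event.serviceable and event.severity >= min_severity


def format_alert(event: PlatformEvent) -> str:
    """Build a one-line summary, skipping fields the event does not carry."""
    parts = [f"{event.severity.name}: refcode {event.refcode}"]
    if event.location:
        parts.append(f"location {event.location}")
    if event.part_number:
        parts.append(f"part {event.part_number}")
    return ", ".join(parts)


if __name__ == "__main__":
    # Placeholder values, not real reference or location codes.
    event = PlatformEvent(Severity.ERROR, True, "EXAMPLE01", "LOC-1", "PN-1")
    if should_notify(event):
        print(format_alert(event))
```

Filtering on the serviceable flag first keeps purely informational events from paging anyone, while the severity threshold lets each tool choose how noisy it wants to be.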