One advantage of POWER systems is their structured, well-defined reference codes. Besides flagging errors or other conditions that require attention, these codes also indicate the progress of boots and dumps. If your system fails to boot, the last reference code on the operator’s panel (op panel) is a good clue as to what the system was doing just before the failure.
On Linux, besides appearing on the op panel, these reference codes are also found in events that are surfaced in servicelog. While servicelog contains a lot of details that are useful for servicing errors, more information can always be obtained by looking up the reference code.
There are a few kinds of reference codes; the key for decoding these refcodes is the IBM Hardware InfoCenter. I’ll briefly explain the three different types of reference codes (SRCs, SRNs, and menugoals) before showing how they are displayed in servicelog.
System Reference Codes
SRCs are sequences of alphanumeric characters (usually 8, just enough to fit snugly on the display of the operator’s panel, but sometimes 6). They were first introduced on POWER5 systems, and exist on both System p and System i (formerly pSeries and iSeries). SRCs are documented in InfoCenter: “Service provider information”/“Reference codes”/“Using system reference codes”.
An example of an SRC used as a progress code is C7004091; that refcode indicates that the partition is in a standby state, and is waiting to be manually activated. If the partition is set to be activated automatically, the partition will not stop at this SRC, but will continue to the Open Firmware boot phase.
Linux does not generate SRCs as progress codes, but will generate some as error codes. Additionally, if you have a POWER5 or POWER6 system, events with SRCs may be written to servicelog to indicate platform-level errors.
Service Request Numbers
SRNs are an older formatting method for progress or error codes. They are generated by diagnostics in AIX, and by the firmware on POWER4 (and earlier) systems. If the progress/error code has 5 digits, or has a ‘-’ character somewhere in it, it is an SRN. These are documented in InfoCenter: “Service provider information”/“Reference codes”/“Using service request numbers”.
As an example, the SRN 747-223 indicates that there was a “miscompare during the write/read of the memory I/O register.” Many SRNs point to a repair procedure called a MAP; in this case, the SRN points to MAP 0050, “SCSI bus problems”, which provides procedures for analyzing and repairing the problem.
Linux does not generate SRNs, but you may still see SRNs generated by older POWER platforms. They may also be generated if you boot the eServer Standalone Diagnostics CD to run device diagnostics.
Menugoals
Menugoals are reference codes that begin with a ‘#’ character. They are generated by diagnostics, and indicate procedures that can be performed by a system admin rather than by a trained service representative. Menugoals don’t typically indicate errors, but instead convey additional information about the state of the device being diagnosed. As an example, a menugoal might indicate that a tape drive requires cleaning.
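The rules of thumb for telling the three kinds of reference codes apart can be sketched as a small classifier. This is only a minimal Python sketch, not part of servicelog or any IBM tool: the function name `classify_refcode` is hypothetical, and the checks encode nothing more than the heuristics described in this article (a leading '#' for menugoals, a '-' or exactly 5 digits for SRNs, 6 or 8 alphanumeric characters for SRCs).

```python
import re


def classify_refcode(code: str) -> str:
    """Guess the kind of reference code using the article's heuristics.

    Hypothetical helper for illustration only; real codes should still be
    looked up in InfoCenter.
    """
    code = code.strip()

    # Menugoals begin with a '#' character.
    if code.startswith("#"):
        return "menugoal"

    # SRNs either contain a '-' or consist of exactly 5 digits.
    if "-" in code or re.fullmatch(r"\d{5}", code):
        return "SRN"

    # SRCs are alphanumeric, usually 8 characters but sometimes 6.
    if re.fullmatch(r"[0-9A-Za-z]+", code) and len(code) in (6, 8):
        return "SRC"

    return "unknown"


if __name__ == "__main__":
    # "#000000" is a made-up placeholder, not a real menugoal.
    for sample in ("C7004091", "747-223", "#000000"):
        print(sample, "->", classify_refcode(sample))
```

The menugoal check runs first so that a '#' code containing a '-' is not mistaken for an SRN; anything that fits none of the rules is reported as "unknown" rather than guessed at.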
Reference Codes in servicelog
Each event in servicelog has a refcode field, which will always contain a reference code (either an SRC, an SRN, or a menugoal). Here is a sample event from servicelog indicating a platform error reported by a POWER system:
PPC64 Platform Event:
Servicelog ID: 64
Event Timestamp: Fri Dec 10 21:37:05 2004
Log Timestamp: Wed Apr 18 00:19:12 2007
Severity: 4 (WARNING)
Serviceable Event: Yes
Event Repaired: No
Reference Code: B125E500
Action Flags: a800
Event Type: 224 - Platform Event
Kernel ID: 1000
Platform ID: 50929493
Creator ID: E - Service Processor
Subsystem ID: 25 - Memory subsystem including external cache
RTAS Severity: 41 - Unrecoverable Error, bypassed with degraded performance
Event Subtype: 00 - Not applicable
Machine Type/Model: 9118-575
Machine Serial: 0SQIH47
Extended Reference Codes:
2: 030000f0 3: 28f00110 4: c13920ff 5: c1000000
6: 00811630 7: 00000001 8: 00d6000d 9: 00000000
Memory subsystem including external cache Informational (non-error) Event.
Refer to the system service documentation for more information.
<< Callout 1 >>
Repair Event Key: 0
Procedure Id: n/a
The error description provides some details concerning the failure, and the FRU callout indicates which part to repair in order to fix the problem. The refcode field contains an SRC, B125E500; looking that SRC up in InfoCenter shows the following details:
- B1 indicates it was reported by the service processor
- 25 indicates that it is an “external cache event or error reported by the service processor”
- E500 indicates that it is a result of processor runtime diagnostics (PRD)
In addition, the InfoCenter entry for B125E500 indicates that this event is the result of a hardware failure; the FRU callout identifies the piece of hardware to replace in order to resolve the error.
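The lookup above can be partly scripted. The following is a rough Python sketch that pulls the "Reference Code" field out of the text of a servicelog event and splits an 8-character SRC into the three pieces discussed for B125E500. The function names and dictionary keys are hypothetical, the sample text is a shortened copy of the event shown earlier, and the 2/2/4 split is taken from this one example only; other SRC families may be laid out differently, so InfoCenter remains the authority.

```python
import re

# A shortened copy of the sample event shown earlier in this article.
SAMPLE_EVENT = """\
Servicelog ID: 64
Severity: 4 (WARNING)
Reference Code: B125E500
Subsystem ID: 25 - Memory subsystem including external cache
"""


def extract_refcode(event_text):
    """Return the value of the 'Reference Code:' field, or None if absent."""
    match = re.search(r"^\s*Reference Code:\s*(\S+)", event_text, re.MULTILINE)
    return match.group(1) if match else None


def split_src(src):
    """Split an 8-character SRC as in the B125E500 example.

    Assumption: the first 2 characters identify the reporter, the next 2
    the subsystem, and the last 4 the detail. The key names are made up
    for this sketch.
    """
    if len(src) != 8:
        raise ValueError("expected an 8-character SRC, got %r" % src)
    return {"reporter": src[0:2], "subsystem": src[2:4], "detail": src[4:8]}


if __name__ == "__main__":
    refcode = extract_refcode(SAMPLE_EVENT)
    print(refcode)             # B125E500
    print(split_src(refcode))
```

For the sample event, the pieces are B1 (the service processor), 25 (the subsystem), and E500 (the detail), which are the three parts to look up in InfoCenter.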