I’ve been looking into EDAC (Error Detection and Correction) a bit recently, to see how it compares with the error detection that is native to IBM’s POWER machines (and to see if there are any features we can better exploit on POWER). If you aren’t familiar with EDAC, it’s a collection of modules for querying the internal registers on some devices (most notably memory modules and PCI devices) to gather failure statistics. To some extent, EDAC is also moving to take action based on those statistics.
Here is the first and most difficult problem: how do you know what number of errors is acceptable before a device should be replaced? It’s entirely acceptable for a device to experience an occasional, sporadic failure; such failures are typically recovered by the hardware itself (parity or ECC within a memory module, for example), with only a minute effect on performance (if any). A repair action should be taken only when these failures become more common. A good metric in this case would be the number of errors within the past hour; if the error count exceeds a threshold, then the device should be replaced. That threshold is the crux of the problem, as it can vary wildly depending on the device.
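To make the "errors within the past hour" metric concrete, here is a minimal sketch of a sliding-window counter in Python. The class name `ErrorWindow`, its methods, and the threshold value in the usage note are all hypothetical and chosen for illustration; this is not how EDAC or the POWER firmware actually implements thresholding, and picking a sensible threshold for a given device remains the hard part.

```python
from collections import deque


class ErrorWindow:
    """Tracks recovered errors seen within a trailing time window.

    Hypothetical illustration only; the threshold and window length
    are assumptions, not values taken from any real device.
    """

    def __init__(self, threshold, window_seconds=3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()  # one entry per error, oldest first

    def record(self, now, count=1):
        """Record `count` errors observed at time `now` (in seconds)."""
        for _ in range(count):
            self._timestamps.append(now)
        self._expire(now)

    def _expire(self, now):
        """Drop errors that have aged out of the window."""
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()

    def errors_in_window(self, now):
        """Return how many errors fall inside the window ending at `now`."""
        self._expire(now)
        return len(self._timestamps)

    def needs_replacement(self, now):
        """True once the windowed error count exceeds the threshold."""
        return self.errors_in_window(now) > self.threshold
```

With an assumed threshold of 3, a device that logs errors at 0, 10, 20 and 30 seconds is flagged after the fourth error, and is no longer flagged once the oldest error is more than an hour old. Occasional failures age out on their own, while a burst crosses the threshold.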
On POWER machines, the firmware takes care of thresholding these errors, and sends an event to the OS (including Linux) when the threshold is exceeded. The users don’t need to know how many errors have occurred; all they need to know is that the device at a specific location code is failing, and should be repaired.
Here’s another issue: EDAC polls, say once per second, and reads status registers on certain devices. It then clears the contents of those registers so that only new errors will be registered on the next poll. On many enterprise systems, a service processor also polls those same registers to perform predictive failure analysis (PFA). Unfortunately, if EDAC is running on the system and clearing the registers, the service processor will be unable to obtain an accurate count of errors for thresholding.
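The conflict is easy to see in a toy simulation. The sketch below models a single error register with two pollers. Everything in it is an assumption made for illustration: the names `ClearOnReadRegister` and `simulate`, the polling intervals, the choice to have the service processor also read and clear, and the ordering of the two pollers within a tick. Real hardware and firmware will differ in the details.

```python
class ClearOnReadRegister:
    """Toy model of a device error-count register (hypothetical)."""

    def __init__(self):
        self._count = 0

    def hardware_error(self, n=1):
        """The device logs `n` recovered errors."""
        self._count += n

    def read_and_clear(self):
        """Return the current count and reset it, as a poller would."""
        value = self._count
        self._count = 0
        return value


def simulate(errors_per_tick, edac_running, sp_interval=5):
    """Return (errors seen by EDAC, errors seen by the service processor).

    Assumptions: EDAC polls on every tick, the service processor polls
    on every `sp_interval`-th tick, and EDAC happens to poll first.
    """
    reg = ClearOnReadRegister()
    edac_total = 0
    sp_total = 0
    for tick, errors in enumerate(errors_per_tick):
        reg.hardware_error(errors)
        if edac_running:
            edac_total += reg.read_and_clear()
        if tick % sp_interval == sp_interval - 1:
            sp_total += reg.read_and_clear()
    return edac_total, sp_total
```

Under these assumptions, with one error per tick for ten ticks, `simulate([1] * 10, edac_running=True)` leaves the service processor with a count of zero, while `simulate([1] * 10, edac_running=False)` lets it see all ten. Whichever poller reads last sees only what the other left behind, so neither can threshold reliably.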
When it comes to PCI errors, POWER includes EEH (Enhanced Error Handling) support for both detection and seamless recovery. EEH seems to be vastly superior in this regard: the system firmware immediately freezes the bus (to ensure that erroneous data is not written or read), and the Linux kernel/driver then resets the device and brings it back online, frequently within the space of a second. I’m not sure how well EDAC plays with AER (Advanced Error Reporting, for PCI-Express systems); I’ll probably write about that when I learn more.
EDAC in its current form seems to be useful only for home users, whose systems are not equipped with service processors and who are wondering why their system is suddenly misbehaving. I think it has promise for the enterprise space, but step one is to have it not stomp on any data gathering done by service processors, and step two is to provide information on when the error statistics become meaningful (thresholds).