Memory Recovery on Power

On Power systems, there is a hierarchy of protection methods to guard against progressively rarer (and more catastrophic) types of memory failures.  There are a lot of terms used to describe these memory protection methods, so I thought I’d write a post to explain the various methods and how they relate to one another.

ECC (error correcting code) provides the first level of memory protection; it is capable of correcting single-bit errors, and of detecting (but not correcting) double-bit errors.

While memory is idle, memory scrubbing corrects soft single-bit errors in the background.  These single-bit errors would be corrected by ECC anyways when the memory is read, but memory scrubbing vastly reduces the quantity of double-bit errors (which ECC cannot correct) by catching and fixing the errors earlier.

If a defined threshold of errors is reached on a single bit, that memory line is considered faulty and is dynamically reassigned to a spare memory chip on the module via bit steering.  If all the bits on this spare memory chip are already used up when the error threshold is reached, the system’s service processor will generate a deferred maintenance request to indicate that the memory module should be replaced in the next available scheduled maintenance window.

Bits from multiple memory chips (on the same module) are scattered across four separate ECC words, a design which is called (logically enough) bit scattering.  This allows the system to handle simultaneous errors from multiple arrays on the same memory chip, because there will be no more than one faulty bit per ECC word.  Recovery in this scenario is handled via a process called Chipkill.  Because memory has been written across multiple chips on the module (similar to striping on disk arrays), the memory controller is capable of reconstructing the data from the killed chip.

If memory scrubbing identifies a hard error (soft errors are recovered by rewriting the correct data back to the memory location), the OS is notified of the failure so that it can remove the associated page.  This is called dynamic page deallocation, is currently supported on AIX 5L and i5/OS, and will soon be supported in Linux.  Dynamic page deallocation protects against an alignment of failed cells in two separate memory modules.

A more catastrophic memory failure can result in the unavailability of an entire row or column in the array, or an entire chip.  Redundant bit-steering protects against an alignment of the failed memory cells with any future failures.

Interestingly, elevation has a substantial effect on the frequency of soft error rates.  An elevation of 5,000 feet results in a 3- to 5-fold increase in soft errors (in comparison to sea level).  An elevation of 30,000 feet results in a 100-fold increase, due to the substantial increase in cosmic radiation.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s