Shortly before I started this blog, Dave Wilder, a co-worker of mine, submitted a patch to LKML to implement a trace layer. Ever since the Linux Trace Toolkit (LTT), an attempt to introduce static tracepoints into the Linux kernel, kernel tracing has been spotty at best. Back in “those days,” LTT received a substantial amount of opposition, possibly because of resistance to accepting features with the “enterprise” stigma. This stigma wasn’t entirely unfounded, given that enterprise features were frequently at odds with the needs of the bulk of the user base.
The introduction of kprobes has certainly been a substantial boon for those interested in inserting tracepoints, and it’s nice to see new technologies like this trace layer built on top of such a well-entrenched, though difficult to use, technology (the patch was at least somewhat inspired by SystemTap). Given that there seems to be recognition that the barrier to accepting instrumentation should be lowered, perhaps it is time for reasonably generic trace technology to be implemented and, more importantly, maintained.
And so, with introductions out of the way, I thought I’d just leap in and explain a little bit about PCI error recovery in Linux, and how it works on System p and System i (formerly known as pSeries and iSeries).
IBM’s POWER-based systems have a feature called EEH (extended I/O error handling). Each PCI bridge has a bit of hardware associated with it that detects abnormal conditions, like parity errors and wild DMA accesses. When such an error is detected, the misbehaving device’s bus is frozen, meaning that the device becomes inaccessible to the operating system. Any writes to the device will be dropped, and any reads will return all 1’s (0xFFF…). By itself, that’s only marginally useful, though at least the partition doesn’t crash as a result of the error and, probably more importantly, no data corruption occurs due to the fault. The real advantage of EEH is that the system firmware provides interfaces that can be called to reset the bus and bring the device back online, without the need to restart the operating system.
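To make the frozen-bus behavior concrete, here is a small user-space sketch in C. It is only a simulation: `sim_slot`, `sim_read32`, `sim_write32`, and `sim_reset` are hypothetical names invented for illustration, not kernel or firmware interfaces, and the idea that a reset returns the register to zero is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one device slot behind an EEH-capable bridge. */
struct sim_slot {
    bool frozen;    /* set when the bridge hardware detects an error */
    uint32_t reg;   /* a single device register, for illustration */
};

/* Writes to a frozen slot are silently dropped. */
void sim_write32(struct sim_slot *slot, uint32_t value)
{
    assert(slot != NULL);
    if (!slot->frozen)
        slot->reg = value;
}

/* Reads from a frozen slot return all 1's. */
uint32_t sim_read32(const struct sim_slot *slot)
{
    assert(slot != NULL);
    return slot->frozen ? UINT32_MAX : slot->reg;
}

/* Stand-in for the firmware-assisted reset that unfreezes the bus.
 * Assumption of this model: device state is lost, so the register
 * goes back to zero and the driver would have to re-initialize it. */
void sim_reset(struct sim_slot *slot)
{
    assert(slot != NULL);
    slot->frozen = false;
    slot->reg = 0;
}
```

Setting `frozen` by hand plays the role of the bridge hardware detecting a parity error; after that, every `sim_write32` is a no-op and every `sim_read32` returns `UINT32_MAX` until `sim_reset` is called.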
In Linux, any time a read returns all 1’s, the kernel makes a firmware call to determine whether the device actually meant to return that value or whether the bus was frozen due to a detected failure. (Empirical measurements show that devices almost never mean to return all 1’s, so there are not many false positives.) If the bus is indeed frozen, the kernel automatically invokes notification and recovery routines in the device driver, first to report that the device has experienced an error and then to signal that the bus has been unfrozen (so that the driver can re-initialize the device).
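The detection flow described above can be sketched as a user-space simulation in C. Everything here is hypothetical and named for illustration only: the `sim_*` types and functions stand in for the kernel, the firmware calls, and the driver callbacks, and do not correspond to real kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver callbacks; illustrative names, not the kernel API. */
struct sim_driver {
    void (*error_detected)(void *ctx);  /* bus is frozen: stop using the device */
    void (*resume)(void *ctx);          /* bus is unfrozen: re-initialize */
    void *ctx;
};

struct sim_device {
    bool frozen;
    uint32_t reg;
    struct sim_driver *drv;
};

/* Example driver state: counts how often each callback fired. */
struct sim_counts { int detected; int resumed; };
void sim_count_detected(void *ctx) { ((struct sim_counts *)ctx)->detected++; }
void sim_count_resumed(void *ctx) { ((struct sim_counts *)ctx)->resumed++; }

/* Stand-in for the firmware call that reports whether the bus is frozen. */
static bool sim_firmware_is_frozen(const struct sim_device *dev)
{
    return dev->frozen;
}

/* Stand-in for the firmware call that resets the bus (state is lost). */
static void sim_firmware_reset(struct sim_device *dev)
{
    dev->frozen = false;
    dev->reg = 0;
}

/* Read a register; on all 1's, ask "firmware" whether the value is genuine.
 * Returns true if a recovery pass was run. */
bool sim_checked_read32(struct sim_device *dev, uint32_t *out)
{
    assert(dev != NULL && out != NULL);
    *out = dev->frozen ? UINT32_MAX : dev->reg;

    if (*out != UINT32_MAX)
        return false;   /* common case: no firmware call needed */
    if (!sim_firmware_is_frozen(dev))
        return false;   /* false positive: the device meant all 1's */

    if (dev->drv != NULL && dev->drv->error_detected != NULL)
        dev->drv->error_detected(dev->drv->ctx);
    sim_firmware_reset(dev);
    if (dev->drv != NULL && dev->drv->resume != NULL)
        dev->drv->resume(dev->drv->ctx);
    return true;
}
```

The point of the design is that the expensive firmware query is only reached when a read comes back as all 1's, which is why a low false-positive rate matters.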
If the device driver is not “PCI error recovery enabled” (i.e. does not provide routines to perform recovery), the kernel will attempt to perform a hotplug operation. The hotplug operation is typically successful, but recovery is much slower than it would be if the device driver were PCI error recovery enabled.
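As a rough illustration of that fallback, the following C sketch picks a recovery path based on whether a driver supplies recovery routines. The `sim_*` names are hypothetical and the decision rule is a simplification assumed for this example; the real kernel logic is more involved.

```c
#include <assert.h>
#include <stddef.h>

enum sim_recovery_path {
    SIM_RECOVER_DRIVER,   /* driver callbacks handle the reset in place */
    SIM_RECOVER_HOTPLUG   /* device is removed and re-added: slower */
};

/* Hypothetical driver descriptor; NULL handlers mean the driver is
 * not "PCI error recovery enabled". */
struct sim_driver {
    const char *name;
    void (*error_detected)(void);
    void (*resume)(void);
};

/* Placeholder callback for an example recovery-enabled driver. */
void sim_noop(void) { }

/* Decide how a frozen device gets recovered. */
enum sim_recovery_path sim_choose_recovery(const struct sim_driver *drv)
{
    /* Invariant of this simulation only: a bound driver has a name. */
    assert(drv == NULL || drv->name != NULL);

    if (drv != NULL && drv->error_detected != NULL && drv->resume != NULL)
        return SIM_RECOVER_DRIVER;
    return SIM_RECOVER_HOTPLUG;
}
```

A driver with both callbacks set takes the fast in-place path; a driver with no callbacks, or a device with no driver at all, falls through to the hotplug path.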
Yes, all of the support for PCI error recovery in Linux is upstream at this time. One of my co-workers, Linas Vepstas, wrote the lion’s share of the code. Several device drivers are instrumented with recovery routines, including e1000, ixgb, and ipr; those drivers serve as good examples so that others can also be instrumented. For more details, see Documentation/pci-error-recovery.txt in the kernel source.
An all too common aphorism dictates that it is hard to say goodbye. But any author will tell you that the first paragraph, the first line, even the title are the hardest words to finalize. Many times, the title, the very first words read when you pick up a book, is the very last thing written. When writing, you are struck by how difficult it is to just dive in and write; the action needs to be immediate and important, so that the reader does not become bored. The tone of the entire narrative is set on page one, paragraph one, word one. Given those constraints, it is perhaps all the more miraculous that wonders like Gravity’s Rainbow, Ficciones, or Pale Fire ever get written.
Surely you’ve noticed by now that I’ve managed to squeeze out those first words by merely babbling about the intransigence of first words; whether I’ve succeeded in setting the tone is up for debate. Though I may lean by nature towards rampant verbosity, abstruse phraseology, and aimless diversions, this blog will primarily concern itself with Linux and open source, with the reliability, availability, and serviceability (RAS) of computing systems, and with the discussion about how those things (and others) relate to IBM’s POWER-based systems.