May « 2008 « Zombie Process

There have been a few articles recently discussing ksplice, a mechanism for hotpatching a Linux kernel. It is primarily geared towards applying security patches, which is a good thing: it is expressly designed to address those patches that are most urgent to apply, and thus the most painful due to the sometimes short lead time.

The implementation of ksplice is interesting, and not much like any hotpatching design I’ve seen. The patched and unpatched kernels are built with the same compiler and the resulting binaries are, in effect, diffed. The difference is packaged into a kernel module which, when loaded, inserts jumps that route execution around the affected sections of kernel text. Every branch into the excised text must also be redirected to the new text in the module.
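As a rough illustration of the build-and-diff idea described above, here is a toy model in Python. It is not ksplice's code: the function names (`sys_open`, `sys_read`), the byte strings, and the helpers `changed_functions`, `build_patch_module`, and `resolve` are all hypothetical, and a real implementation works on compiled object code rather than dictionaries.

```python
# Toy model only: each "kernel build" is a dict mapping a function name to
# its compiled bytes. All names and byte values here are made up.

def changed_functions(old_build, new_build):
    """Return the names of functions whose compiled bytes differ."""
    return sorted(
        name
        for name, code in new_build.items()
        if name in old_build and old_build[name] != code
    )


def build_patch_module(old_build, new_build):
    """Package only the replacement text for the functions that changed."""
    return {
        name: new_build[name]
        for name in changed_functions(old_build, new_build)
    }


def resolve(name, running_kernel, patch_module):
    """Model a jump at the old entry point: patched names lead to new text."""
    return patch_module.get(name, running_kernel[name])


# Two hypothetical builds that differ in a single function.
unpatched = {"sys_open": b"\x55\x89\xe5\x01", "sys_read": b"\x55\x89\xe5\x02"}
patched = {"sys_open": b"\x55\x89\xe5\x01", "sys_read": b"\x55\x89\xe5\x03"}

module = build_patch_module(unpatched, patched)
```

In this sketch the module carries only `sys_read`, so `resolve("sys_read", unpatched, module)` yields the new text while `sys_open` still resolves to the text already in the running kernel.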

The main complaints that I have seen regarding kernel hotpatching are along the following lines:

“If you are using load balancing appropriately, hotpatching isn’t necessary.” (Also phrased as: if a system is so critical, why don’t you have two?) This sounds reasonable on the surface, but it is actually somewhat misleading. The same argument could be made about any feature that improves system availability. In return, I would ask: why do you need greater than 70% uptime? After all, you can just keep adding load-balanced systems until you get the aggregate uptime that you want. Planning for downtime by adding systems rather than “adding nines” costs you twice. First, even if you aren’t missing any transactions, downtime inherently costs money because it incurs administrative costs (e.g. someone needs to restart the system and possibly perform root cause analysis, filesystem checks, etc.). Second, besides the purchase cost of the extra systems used to load balance, there is an ongoing expenditure for energy and cooling.

“We have a lengthy QA process before deploying OS updates.” This is less a technical issue than a “certification” issue. Presumably, the provider of the operating system distributes the hotfix and has already vetted it for concurrent application. Customers who run their own QA before deploying fixes can test hotpatches just as easily as they can non-concurrent updates, so this is really a non-issue.
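The load-balancing argument in the first complaint can be made concrete with a little arithmetic. The sketch below assumes that machines fail independently and that the service is up whenever at least one machine is up, which is optimistic for real deployments; the function names are my own.

```python
def aggregate_availability(per_node, nodes):
    """Chance that at least one of `nodes` independent machines is up."""
    if not 0.0 <= per_node <= 1.0 or nodes < 1:
        raise ValueError("per_node must be in [0, 1] and nodes must be >= 1")
    # The service is down only when every machine is down at once.
    return 1.0 - (1.0 - per_node) ** nodes


def nodes_needed(per_node, target):
    """Smallest number of machines whose aggregate availability meets target."""
    if not 0.0 < per_node < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("per_node and target must be strictly between 0 and 1")
    nodes = 1
    while aggregate_availability(per_node, nodes) < target:
        nodes += 1
    return nodes
```

Under these assumptions, `nodes_needed(0.70, 0.999)` comes out to six machines for three nines, while `nodes_needed(0.99, 0.999)` comes out to two. Every one of those extra machines still has to be bought, powered, cooled, and restarted when it goes down.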

The crux of the matter is simple. If the process of applying security patches becomes so trivial that the machine doesn’t even need to wait until a service window to be fixed, more machines are likely to be patched.

On the topic of security patches, though: one issue with hotpatching is that you can technically never be certain that the replaced text is not still being executed. Most hotpatching implementations (including ksplice) sidestep that problem by simply leaving the original code in place. To some extent, though, this reduces the utility of hotpatching for security vulnerabilities, because the vulnerable text may remain live on some CPU after the patch has been applied. This can really only be resolved by patch review: if there is a possibility that a CPU could spend time in the old text for some time after the hotpatch is applied, then the system should be rebooted instead. This determination should be the responsibility of the OS vendor, who can provide either a hotpatch or a standard, old-fashioned fix.
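To make "still executing the old text" concrete, here is a minimal sketch of what such a check looks like against a snapshot of code addresses held by each thread (its instruction pointer plus any return addresses on its stack). The thread names, addresses, and the helper `threads_in_old_text` are hypothetical, and this is not a description of how ksplice or any other implementation decides anything.

```python
def threads_in_old_text(thread_addresses, old_start, old_end):
    """Return names of threads holding an address in [old_start, old_end)."""
    return sorted(
        name
        for name, addresses in thread_addresses.items()
        if any(old_start <= addr < old_end for addr in addresses)
    )


# Hypothetical address range occupied by the replaced function.
OLD_START, OLD_END = 0x1000, 0x1100

# Hypothetical snapshot: thread name -> code addresses it currently holds.
snapshot = {
    "kworker/0": [0x2000, 0x2040],
    "sshd": [0x1010, 0x3000],  # 0x1010 lies inside the old text
    "init": [0x1100],          # the end of the range is exclusive
}

stragglers = threads_in_old_text(snapshot, OLD_START, OLD_END)
```

Here `stragglers` contains only `sshd`. A snapshot like this describes a single instant, so an empty result does not by itself prove that the old text will never be reached again, which is why the decision ultimately rests on reviewing the patch.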

On a side note, AIX recently introduced a kernel hotpatching implementation. (Refer to section 2.3.15 in this Redbook for more information on the Concurrent AIX Update feature, which first appeared in AIX 6.1.)

Linux Kernel Hotpatching via ksplice