Calculating Percent Identity from SAM Files

The Sequence Alignment/Map (SAM) file format is used for storing nucleotide sequence alignments.  It is not a simple format, but has managed to achieve relatively widespread use.  Popular aligners such as Bowtie and BWA make use of the SAM format for output, and the alignments produced from other tools such as the fragment recruitment tool FR-HIT are often converted to the SAM format as well.  Interestingly, though the record representing each alignment contains an enormous amount of information, the “percent identity” is not among this data.  This is a strange oversight, as the identity is a common attribute that should be easy to calculate, and is a necessity for fragment recruitment plotting, among other uses.

Percent identity is calculated by dividing the number of nucleotide matches in the alignment by the total number of nucleotides (matched or mismatched). The CIGAR field seems as if it would be useful in this endeavor, but turns out to be useless in terms of identity calculation.  In a CIGAR string, an op of M indicates an alignment match, which (according to the SAM format specification) “can be a sequence match or mismatch” (!).  This means that with 200bp Illumina reads, the alignments might include a CIGAR string like 200M, which does not indicate a 100% identity.

It turns out that there is an optional field, the “MD” field, that can instead be used for this calculation.  MD fields are supposed to be used to identify SNPs and indels, and so they include information about mismatches and deletions.  These MD fields may or may not be included in the SAM file produced by your aligner of choice; if they are not present, samtools can be used to postprocess a SAM-formatted file in order to add MD fields to the records:

samtools fillmd -S file.sam genomic.fna >

where genomic.fna is a FASTA-formatted file containing the reference sequence.  A record including an MD field may look like this:

SRR062334.404   0       gi|378696079|ref|NC_016809.1|   1804559 22
100S84M16S      *       0       0       TTAGGTGCATTTGAAAATACAACCGTCGGTAT
###################        AS:i:84 XN:i:0  XM:i:16 XO:i:0  XG:i:0  NM:i:
16 MD:Z:27A5G2T6G4T1A6T2C0A7A1A0A0C0G1G1A5 YT:Z:UU

The CIGAR string and MD fields are highlighted for clarity.  Note that the CIGAR string indicates that there are 84 matches/mismatches, surrounded by soft clipped nucleotides; you will find that the MD string identifies these 84 nucleotides.  The MD field is read as follows:  the first 27 nucleotides are matches; the next nucleotide is a mismatch (an A in the reference sequence); the next 5 are matches; the next is a mismatch (a G in the reference sequence); etc.  This example does not include any deletions; these would be identified with a ‘^’ character, so that the string 5^GT5 would represent 5 matches followed by two deleted nucleotides (missing GT in comparison to the reference) followed by 5 matches.

Based on this string, it is easy to write a parser to establish the quantity of matches and mismatches in the alignment.  Once these two values are determined, identity can be calculated by a simple formula:

identity = 100 * (matches / (matches + mismatches))

In the above example, there are 68 matched nucleotides in the overall string of 84, yielding 81% identity.


Tricks with vi: Hex Editing, Running sudo

I’ve been a faithful user of vi for many years, but I only recently learned these two useful tricks.

Trick 1: Hex Editing in vi

The first trick is switching vi to a sort of hex edit mode. To view the the contents of a (potentially binary) file in hexadecimal that you have already opened in vi, press ESC and then type “:%!xxd” (without the quotes). The contents of your vi session will now look like the output of the xxd hex dump tool:

0000000: 4c6f 7265 6d20 6970 7375 6d20 646f 6c6f Lorem ipsum dolo
0000010: 7220 7369 7420 616d 6574 2c20 636f 6e73 r sit amet, cons
0000020: 6563 7465 7475 7220 6164 6970 6973 6963 ectetur adipisic
0000030: 696e 6720 656c 6974 2c20 7365 6420 646f ing elit, sed do
0000040: 2065 6975 736d 6f64 2074 656d 706f 7220 eiusmod tempor
0000050: 696e 6369 6469 6475 6e74 2075 7420 6c61 incididunt ut la

I could now choose to (for example) choose to change the first character to a space by replacing the first byte, 4c, with 20:

0000000: 206f 7265 6d20 6970 7375 6d20 646f 6c6f Lorem ipsum dolo
0000010: 7220 7369 7420 616d 6574 2c20 636f 6e73 r sit amet, cons
0000020: 6563 7465 7475 7220 6164 6970 6973 6963 ectetur adipisic
0000030: 696e 6720 656c 6974 2c20 7365 6420 646f ing elit, sed do
0000040: 2065 6975 736d 6f64 2074 656d 706f 7220 eiusmod tempor
0000050: 696e 6369 6469 6475 6e74 2075 7420 6c61 incididunt ut la

Note that the text on the right is not automatically updated when you modify the hex. If you were to save the file at this point, it would be saved as it appears in the vi session (i.e. the contents of the file would look like the output of xxd). Revert back to the normal text edit mode with “:%!xxd -r”, and then save.

Trick 2: Running sudo Within vi to Edit a Read-Only File

There are many times when I open a file in vi and begin editing it without noticing the “Changing a readonly file” warning that appears. Then, after making numerous updates, I try to save my changes, only to get the “Can’t open file for writing” message… I forgot to run vi with sudo. One option is to exit vi without saving, run “sudo vi”, and then redo all my changes, which can be painful if there were a lot of updates. Another option: press ESC and then type “:w !sudo tee %”. Type your password, press ENTER when prompted to continue, and then press “L” to reload the now-modified file in the existing vi session. Your updates will have been written to disk. Note that you will need to follow this procedure every time you wish to write the file within that vi session.

Progress Bar Strategies for Java SWT

Continuing the series on developing applications in Java using SWT, I’d like to address the problem of implementing progress dialogs for long-running operations. The last time I was on the topic of SWT, I explained how I implemented a sash in my GUI.

This application processes potentially large datasets, which can be very time-consuming operations (anywhere from a few seconds to several minutes or more). The demo and early iterations of the project simply became unresponsive after the user chose a dataset, and remained unresponsive until the processing was complete. The obvious thing to do here is to implement a progress dialog that gives the user a visual indication that the application has not hung, and, ideally, provides the capability to cancel the operation.

This turned out to be a bit more difficult than I anticipated, so I thought I’d relate my experiences.

Phase 1: Modal Pop-up Dialog

This part is not difficult. To make things reusable, and to make the follow-on phases easier to implement, I created a DialogProgress class. The constructor accepts two strings (one for the dialog title, one for the text inside the dialog), and creates a new shell with a ProgressBar widget. I made it the ProgressBar widget private, and added an update(finished, outOf) method to DialogProgress for updating the widget. The dataset processing routine just calls update() each time it finishes processing an entry in the dataset (for example, invoke update(5, 10) to indicate that you just finished processing the 5th entry in a dataset with 10 entries).

Of course, now that you have a progress dialog up and going, the next obvious step is to stick a cancel button on it.

Phase 2: Allow Canceling

Here’s the catch: adding a button to the progress bar dialog means that something needs to be listening for the mouse click. The dataset processing, then, should move into a separate thread, so that the main process can enter the display.readAndDispatch() loop.

My initial thought was to have the main application start the thread and execute a wait(). Then the thread would signal its completion by invoking signal(). The problem is that the main application does not enter the display.readAndDispatch() loop because it is busy waiting… the new “cancel” button can’t be clicked.

The most elegant solution that I came up with was to provide the ability to register processingComplete and processingCanceled listeners with a dataset processing object. The main process instantiates a dataset processing object, registers a processingComplete listener and a processingCanceled listener with it, and then invokes the run() method of the object to start the thread. There is no need to wait anymore; the main process simply returns to the display.readAndDispatch() loop.

Now, if the user clicks the “cancel” button, the dataset processing object will invoke the processingCanceled listener. If the dataset processing finishes without interruption, the processingComplete listener will be invoked just before the thread exits, and the main process will know to update the UI with the results of the processing.

Phase 3: Eliminating the Dialog: Asynchronous Updating

I haven’t actually gotten to this point with this application yet, but I believe this will be the next iteration. I plan to eliminate the modal progress dialog, and move the progress bar to the status bar on the bottom of the main shell. This way, the dataset can be shown as it is loaded, meaning that the user can begin reviewing the first entries in the dataset while the later entries are still being loaded. I’ll post again if I ever get around to implementing this iteration!

Recording at 96 kHz: Requires USB 2.0 (duh…)

I recently installed Ubuntu Studio on a spare, many-years old system (a Pentium 3) to try out a new Edirol UA-25EX audio interface that I picked up. I plan to detail my experiences with Studio in a soon-to-come post, but there was one problem I ran into that warrants its own post.

After applying a patch for the device and rebuilding the kernel modules (again, topic for another post), I was easily able to record at 44.1 kHz. Then I ran the following command to record at 96 kHz:

mike@studio:~$ arecord -v -r 96000 -f cd -t wav -D plughw:UA25EX test.wav

After the expected verbose output, the arecord command immediately terminated with the following error message:

arecord: xrun:1090: read/write error, state = PREPARED

Hmm. Not very informative. I thought to run dmesg, though, which gave me more useful information:

[13469.727719] ALSA /home/mike/linux-ubuntu-modules-2.6.24-2.6.24/debian/build/build-rt/sound/alsa-driver/usb/usbaudio.c:864: cannot submit datapipe for urb 0, error -28: not enough bandwidth

Not enough bandwidth. It occurred to me at this point that this system is old enough to pre-date the common availability of USB 2.0. And then I was surprised to find that I didn’t know how to tell from the command line whether a system’s USB controllers are 1.1 or 2.0.

And so, to the point of this article: On a Linux system, how do you determine the versions of your USB controllers?

Simply run lspci -v | grep HCI. Any controllers that say “UHCI” or “OHCI” are USB 1.1; any that say “EHCI” are 2.0. Sure enough, there were only USB 1.1 controllers on the system. I picked up a PCI card with 5 USB 2.0 ports for $15, plugged the audio interface into that, and suddenly I could record at 96 kHz with no problem.

Using the Sash Widget in Java SWT

This is the first in what will be a series of posts detailing some of the problems I ran into when developing an SWT-based GUI in the Eclipse development environment.  I hadn’t developed a GUI in Java since Swing was the new hotness, so SWT was new to me.

Sash Layout Overview

Layout Overview

The main window design called for a toolbar across the top, a table which lists entries from a database, a composite under the table which provides details on the selected entry, and a status bar on the bottom.  It was easy enough to position the widgets using a GridLayout, but I wanted the user to be able to adjust the relative sizes of the table and detail composite, giving either more or less screen real estate to whichever of the two widgets is of the most interest at the moment.  I suspected that a sash was the way to do this, but the SWT snippet library, which is very useful in most cases, contained surprisingly little information on the use of sashes.  After some experimentation, here is what I learned.

Part of the problem was that I was using a GridLayout for the main shell.  Once I switched to using a FormLayout, the sash became trivial to implement.  (This article on layouts is spectacular, by the way.) With the GridLayout, I instantiated the widgets in order from the top to the bottom of the shell, but when I switched to a FormLayout I needed to change the order in which the widgets were created.

The first step when laying out the widgets is to position the sash:

    final Sash sash = new Sash(shell, SWT.BORDER | SWT.HORIZONTAL);
    formData = new FormData(); = new FormAttachment(30, 0);
    formData.left = new FormAttachment(0, 0);
    formData.right = new FormAttachment(100, 0);
    formData.height = 3;

When the application is started, this sash is positioned in such a way that 30% of the window space is provided for the toolbar and table at the top of the window, and 70% is provided for the detail composite and status bar at the bottom.

At this point, the general idea is to place the widgets from the top of the shell down, until you reach the sash.  Then you place widgets from the bottom of the shell up, again until you reach the sash.  In my case, I instantiated the toolbar first and attached it to the top of the main shell.  Then I created the table and attached the top of it to the bottom of the toolbar, and the bottom of it to the top of the sash.  The next step was to start at the bottom of the shell by attaching a composite for the status bar to the bottom of the main shell.  Then I attached the bottom of the detail composite to the top of the status bar composite, and the top of the detail composite to the bottom of the sash.

With that updated layout, the selection listener for the sash became relatively simple:

    sash.addListener(SWT.Selection, new Listener () {
        public void handleEvent(Event e) {
            sash.setBounds(e.x, e.y, e.width, e.height);

            FormData formData = new FormData();
   = new FormAttachment(0, e.y);
            formData.left = new FormAttachment(0, 0);
            formData.right = new FormAttachment(100, 0);
            formData.height = 3;

When the sash is moved, this listener causes its layout data to be replaced.  The top of the sash is set to the user-specified location, and the layout() method of the shell is invoked, which will cause the table and detail widgets to be resized to accommodate the new sash location (since they are attached to the sash).

Obviously, it’s easy to modify this layout strategy for vertical sashes.  It does get a bit more complex if you want to have multiple sashes in the same window, but even that is not too bad: if I had wanted a vertical sash in the detail composite, I could have simply used a form layout within that composite, building widgets from the left and right sides of the composite until they met the sash in the middle.

So there you go:  a quick summary of my experience with the elusive sash widget.

Where has Mike been?

I’ve been here all along!  Obviously, it’s been many moons since I’ve posted; I intend to remedy that post-haste.  I don’t work regularly on RAS anymore, so I had been feeling that my well of ideas for posts had been drying up.  But I’ve been involved with a number of new projects, and new projects should mean that there are new things to post about.

I don’t think I’ve written Java code since Swing and Ant were considered the new hotness.  I’m now finishing up a Java-based GUI that leverages SWT, and realized that there were a few challenges that I had to pound through the old-fashioned way because, even with the very useful SWT snippets and Javadocs, there was little advice to be found online.  So that’s where I’ll start back up:  over the next few weeks, I will be posting articles that discuss these challenges and how I solved them.

Servicelog Updates

The servicelog package has been updated to version 1.0.  This new version uses an sqlite database as a backend (instead of the Berkeley DB backend that the 0.x stream used).  The primary advantage to the sqlite relational database backend is that queries of the servicelog can be performed with standard SQL queries.  The –query flag to the servicelog command now takes an SQL WHERE clause as an argument.  For example, to view all open serviceable events, run:

/usr/bin/servicelog --query "serviceable=1 and closed=0"

To view all migrations that a logical partition has undergone:

/usr/bin/servicelog --query 'refcode="#MIGRATE"'

The ability to register notification tools with servicelog, available in the 0.x stream, is still supported, with even more flexibility: now you can specify a query string for matching when registering a new notification tool.  When a new event is logged, the tool will only be invoked if the event matches the criteria specified in that query string.  For example, run the following command (as root) to cause a tool called /opt/foo/some_command to be automatically invoked just after a partition is migrated to a different system:

/usr/bin/servicelog_notify --add --command='/opt/foo/some_command' --match='refcode="#MIGRATE"'

Power Platform Diagnostics: Source Available

The package for performing Power platform diagnostics, ppc64-diag, has just been open sourced under the Eclipse Public License.  Much of what I discussed in my previous post about predictive self healing is implemented in this package (and in servicelog, which is already open source).

Here are some of the advantages provided by the ppc64-diag package:

  • retrieval of first-failure error data from platform-level components, such as memory, CPUs, caches, fans, power supplies, VPD cards, voltage regulator modules, I/O subsystems, service processors, risers, etc.
  • the ability to offline CPUs or logical memory blocks (LMBs) that are predicted to fail
  • notifications of EPOW events (environmental and power warnings), and initiation of shutdowns due to abnormal thermal and voltage conditions if no redundant fans or power supplies are available
  • monitoring of platform-level elements (fans, power supplies, riser cards, etc.) in external I/O enclosures
  • retrieval of dumps from platform components to assist in problem determination (for example, dump data from a failed service processor)

The ppc64-diag package is generally install-and-forget; any platform events that may occur are logged to servicelog, with indications of the event severity and whether the event is serviceable (i.e. requires service action) or not.  Additional relevant information is also logged to servicelog, such as a reference code, and the location code and part number of a failing device (obtained from lsvpd).  Tools may be registered with servicelog to be automatically notified when new events are logged.

SystemTap Without Debug Info

Just a short post today to mention a new feature in SystemTap that I should have mentioned a while ago.  A primary barrier to the adoption of SystemTap has been the requirement that SystemTap have access to the DWARF debug information for the kernel and modules.  This is no longer the case; as of a few weeks ago, SystemTap can operate on systems that do not provide this debug info.  SystemTap users can trace function entries and returns (and report argument values with a little extra effort) even if no debug info is provided.  This currently works for i386 and x86_64, and powerpc support is being debugged.

For more information, refer to this page on the SystemTap wiki.  This, combined with user-space probing, provides substantial and much-needed improvements to the usability of SystemTap.

Good Guitar Lessons in Austin

I’m just going to take a minute to shamelessly advertise for The School of Feedback Guitar in Austin, TX. I started taking lessons there a few months ago, and I’m now utterly addicted to this infernal instrument. It’s amazing that I can use these QWERTY-seized hands to cause a guitar to make sounds even close to what I’m trying to get it to make. If you are considering guitar instruction in this city, even if you’ve never touched a guitar before (or any other instrument, for that matter), I don’t know that you could do better. Dave Wirth is the instructor; stop by the school’s website to drop him a line.