Calculating Percent Identity from SAM Files

The Sequence Alignment/Map (SAM) file format is used for storing nucleotide sequence alignments.  It is not a simple format, but has managed to achieve relatively widespread use.  Popular aligners such as Bowtie and BWA make use of the SAM format for output, and the alignments produced from other tools such as the fragment recruitment tool FR-HIT are often converted to the SAM format as well.  Interestingly, though the record representing each alignment contains an enormous amount of information, the “percent identity” is not among this data.  This is a strange oversight, as the identity is a common attribute that should be easy to calculate, and is a necessity for fragment recruitment plotting, among other uses.

Percent identity is calculated by dividing the number of nucleotide matches in the alignment by the total number of nucleotides (matched or mismatched). The CIGAR field seems as if it would be useful in this endeavor, but turns out to be useless in terms of identity calculation.  In a CIGAR string, an op of M indicates an alignment match, which (according to the SAM format specification) “can be a sequence match or mismatch” (!).  This means that with 200bp Illumina reads, the alignments might include a CIGAR string like 200M, which does not indicate a 100% identity.

It turns out that there is an optional field, the “MD” field, that can instead be used for this calculation.  MD fields are supposed to be used to identify SNPs and indels, and so they include information about mismatches and deletions.  These MD fields may or may not be included in the SAM file produced by your aligner of choice; if they are not present, samtools can be used to postprocess a SAM-formatted file in order to add MD fields to the records:

samtools fillmd -S file.sam genomic.fna > file.md.sam

where genomic.fna is a FASTA-formatted file containing the reference sequence.  A record including an MD field may look like this:


SRR062334.404   0       gi|378696079|ref|NC_016809.1|   1804559 22
100S84M16S      *       0       0       TTAGGTGCATTTGAAAATACAACCGTCGGTAT
TTCGCTTGTGATTGGTGGTTCATGTGCGGTTGCGATGTCAATATTATTGATTATTCTTGTGTGTCACGTTTA
TTAATTGTCCAAGCAAAGAATAAGATTGCAATGGCGCCTAACATAGCTTTAATACCTGCAATCCATGCGTTT
ACGTATTCTTGTTGAGAAACCACA        C:5CCC?=BBA:5AAD-DD>AAA=-7,?4+>:1;66:
)>@?C:??>69=@85B?-98**3<>:5??##################################.7>@;>6A?
A>:C?CAAA5?5A?;CD=;:DDDADDDDD?D,CADDA?5AA-AAA-AA5:--6C@@;?.@############
###################        AS:i:84 XN:i:0  XM:i:16 XO:i:0  XG:i:0  NM:i:
16 MD:Z:27A5G2T6G4T1A6T2C0A7A1A0A0C0G1G1A5 YT:Z:UU

The CIGAR string and MD fields are highlighted for clarity.  Note that the CIGAR string indicates that there are 84 matches/mismatches, surrounded by soft clipped nucleotides; you will find that the MD string identifies these 84 nucleotides.  The MD field is read as follows:  the first 27 nucleotides are matches; the next nucleotide is a mismatch (an A in the reference sequence); the next 5 are matches; the next is a mismatch (a G in the reference sequence); etc.  This example does not include any deletions; these would be identified with a ‘^’ character, so that the string 5^GT5 would represent 5 matches followed by two deleted nucleotides (missing GT in comparison to the reference) followed by 5 matches.

Based on this string, it is easy to write a parser to establish the quantity of matches and mismatches in the alignment.  Once these two values are determined, identity can be calculated by a simple formula:

identity = 100 * (matches / (matches + mismatches))

In the above example, there are 68 matched nucleotides in the overall string of 84, yielding 81% identity.