Hidden Markov Models in Python

I recently created a new GitHub repository for a Python module that I wrote to implement arbitrary HMMs: https://github.com/mstrosaker/hmm

A brief primer on HMMs

I think that HMMs are best described by an example.  We will look at this toy model:


The numbers represent the probabilities of transitioning between the various states, or of emitting certain symbols.  For example, state S1 has a 90% chance of transitioning back to itself; each time it is visited, there is a 50% chance that it emits a ‘1’, and a 50% chance that it emits a ‘2’.

Clearly, this model can be used to produce strings of 1s and 2s that fit its parameters.  But we can use it to solve a much more interesting question: given a string of 1s and 2s, which sequence of states most likely generated the string?  This is why it’s described as a hidden Markov model; the states that were responsible for emitting the various symbols are unknown, and we would like to establish which sequence of states is most likely to have produced the sequence of symbols.

Let’s look at what might have generated the string 222.  We can look at every possible combination of 3 states to establish which was most likely responsible for that string.

  • S1 – S1 – S1:  0.5 (initial probability of being in state 1) * 0.5 (probability of S1 emitting a 2) * 0.9 (probability of S1 transitioning to S1) * 0.5 (probability of S1 emitting a 2) * 0.9 (probability of S1 transitioning to S1) * 0.5 (probability of S1 emitting a 2) = 0.050625
  • S1 – S1 – S2:  0.5 * 0.5 * 0.9 * 0.5 * 0.1 * 0.75 = 0.0084375 (less likely than the previous sequence)
  • S1 – S2 – S1:  0.5 * 0.5 * 0.1 * 0.75 * 0.8 * 0.5 = 0.0075
  • S1 – S2 – S2:  0.5 * 0.5 * 0.1 * 0.75 * 0.2 * 0.75 = 0.0028125
  • S2 – S1 – S1:  0.5 * 0.75 * 0.8 * 0.5 * 0.9 * 0.5 = 0.0675
  • S2 – S1 – S2:  0.5 * 0.75 * 0.8 * 0.5 * 0.1 * 0.75 = 0.01125
  • S2 – S2 – S1:  0.5 * 0.75 * 0.2 * 0.75 * 0.8 * 0.5 = 0.0225
  • S2 – S2 – S2:  0.5 * 0.75 * 0.2 * 0.75 * 0.2 * 0.75 = 0.0084375

The string 222 was most likely generated by the sequence S2 – S1 – S1.  That may be a little surprising, since S2 is much more likely than S1 to emit 2s, but the transition probabilities show that the model is generally much more likely to be in state S1.

When the string of observations to explain is very long, this enumeration of every possible sequence of states becomes infeasible.  We had to enumerate 2^3 possible sequences for a string of length 3; if the string were of length 400, we would need to enumerate 2^400 (about 2.6 * 10^120) sequences.  Instead, we can employ a dynamic programming approach to make the problem tractable; the module that I wrote includes an implementation of the Viterbi algorithm for this purpose.

Why is this interesting?

A good example of the utility of HMMs is the annotation of genes in a genome, which is a very difficult problem in eukaryotic organisms.  A gene typically consists of a promoter region, numerous exons and introns with their associated splice sites, and a poly-A region, among others.  A model can be created to describe each of these regions.  The most likely sequence of states that explains a genome can then be calculated.  Each nucleotide in the genome would be annotated with a probable state, indicating whether it is likely to be part of an intron, exon, splice site, intergenic region, etc.

There are a wide variety of real-world applications for HMMs, such as in signal processing, or in identifying secondary structure elements in proteins, parts of speech in sentences, or components of musical scores.

The hmm Python module

With my Python module, the above model can be created with the following:

    import hmm
    s1 = hmm.state(
            'S1',            # name of the state
            0.5,             # probability of being the initial state
            { '1': 0.5,      # probability of emitting a '1' at each visit
              '2': 0.5 },    # probability of emitting a '2' at each visit
            { 'S1': 0.9,     # probability of transitioning to itself
              'S2': 0.1 })   # probability of transitioning to state 'S2'
    s2 = hmm.state('S2', 0.5,
            { '1': 0.25, '2': 0.75 },
            { 'S1': 0.8, 'S2': 0.2 })
    model = hmm.hmm(['1', '2'],  # all symbols that can be emitted
                    [s1, s2])    # all of the states in this HMM

All of the possible paths explaining 222 can be generated with:


Which would print the following:

    ('S2', 'S2', 'S2'): -2.073786
    ('S2', 'S2', 'S1'): -1.647817
    ('S2', 'S1', 'S2'): -1.948847
    ('S2', 'S1', 'S1'): -1.170696
    ('S1', 'S2', 'S2'): -2.550907
    ('S1', 'S2', 'S1'): -2.124939
    ('S1', 'S1', 'S2'): -2.073786
    ('S1', 'S1', 'S1'): -1.295635
    BEST: ('S2', 'S1', 'S1'): -1.170696

Note that the probabilities are log (base 10) transformed, as they will be very tiny numbers for long strings.

As stated above, don’t do that with long sequences!  It becomes intractable very quickly.  Instead, use the dynamic programming approach as follows:

    path, prob = model.viterbi_path('222')
    print path
    print prob

Which yields:

    ['S2', 'S1', 'S1']

Fragment Recruitment Plotting in Python

I recently created a repository on GitHub for an application that I wrote a year and a half ago to generate fragment recruitment plots; it is located at https://github.com/mstrosaker/fragrecruit .  These plots are used in genomic and metagenomic analysis to visualize how short reads align to one or more reference genomes.  The application is written in Python, using matplotlib to generate the plots.

One such plot is shown below.  This plot shows the reads from human keratinized gingiva microbiome samples that align against a 100000bp section of the genome of a strain of the bacteria Gemella haemolysans.  Each sample is represented by a different color.  The sample sizes are dramatically different (reflecting differing depth of sequencing among the samples), which makes some colors much more predominant.  My application adds vertical lines to delineate the boundaries of contigs in the reference genome, and shades alternate contigs; as far as I know, it is the only application that identifies contigs along the reference genome in this manner.  It is interesting to note that, in some cases, fragments may tend to cluster near the edges of contigs; I suspect this is often due to the presence of repetitive sequences at the ends of contigs.