Hidden Markov Models in Python

I recently created a new GitHub repository for a Python module that I wrote to implement arbitrary HMMs: https://github.com/mstrosaker/hmm

A brief primer on HMMs

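I think that HMMs are best described by an example.  We will look at this toy model, which has two states, S1 and S2, each equally likely (probability 0.5) to be the initial state:

  • S1:  emits ‘1’ with probability 0.5 and ‘2’ with probability 0.5; transitions to S1 with probability 0.9 and to S2 with probability 0.1
  • S2:  emits ‘1’ with probability 0.25 and ‘2’ with probability 0.75; transitions to S1 with probability 0.8 and to S2 with probability 0.2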


The numbers represent the probabilities of transitioning between the various states, or of emitting certain symbols.  For example, state S1 has a 90% chance of transitioning back to itself; each time it is visited, there is a 50% chance that it emits a ‘1’, and a 50% chance that it emits a ‘2’.

Clearly, this model can be used to produce strings of 1s and 2s that fit its parameters.  But we can also use it to answer a much more interesting question: given a string of 1s and 2s, which sequence of states most likely generated the string?  This is why it’s described as a hidden Markov model; the states that were responsible for emitting the various symbols are unknown, and we would like to establish which sequence of states is most likely to have produced the sequence of symbols.

Let’s look at what might have generated the string 222.  We can look at every possible combination of 3 states to establish which was most likely responsible for that string.

  • S1 – S1 – S1:  0.5 (initial probability of being in state 1) * 0.5 (probability of S1 emitting a 2) * 0.9 (probability of S1 transitioning to S1) * 0.5 (probability of S1 emitting a 2) * 0.9 (probability of S1 transitioning to S1) * 0.5 (probability of S1 emitting a 2) = 0.050625
  • S1 – S1 – S2:  0.5 * 0.5 * 0.9 * 0.5 * 0.1 * 0.75 = 0.0084375 (less likely than the previous sequence)
  • S1 – S2 – S1:  0.5 * 0.5 * 0.1 * 0.75 * 0.8 * 0.5 = 0.0075
  • S1 – S2 – S2:  0.5 * 0.5 * 0.1 * 0.75 * 0.2 * 0.75 = 0.0028125
  • S2 – S1 – S1:  0.5 * 0.75 * 0.8 * 0.5 * 0.9 * 0.5 = 0.0675
  • S2 – S1 – S2:  0.5 * 0.75 * 0.8 * 0.5 * 0.1 * 0.75 = 0.01125
  • S2 – S2 – S1:  0.5 * 0.75 * 0.2 * 0.75 * 0.8 * 0.5 = 0.0225
  • S2 – S2 – S2:  0.5 * 0.75 * 0.2 * 0.75 * 0.2 * 0.75 = 0.0084375

The string 222 was most likely generated by the sequence S2 – S1 – S1.  That may be a little surprising, since S2 is much more likely than S1 to emit 2s, but the transition probabilities show that the model is generally much more likely to be in state S1.
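
The enumeration above can be reproduced with a few lines of plain Python.  This is only a sketch that does not use the hmm module; the names initial, emit, trans, path_probability, and enumerate_paths are made up for illustration, and the numbers are the toy model's parameters.

```python
from itertools import product

# Parameters of the toy model (illustrative names, not the hmm module's API).
initial = {'S1': 0.5, 'S2': 0.5}
emit = {'S1': {'1': 0.5, '2': 0.5},
        'S2': {'1': 0.25, '2': 0.75}}
trans = {'S1': {'S1': 0.9, 'S2': 0.1},
         'S2': {'S1': 0.8, 'S2': 0.2}}

def path_probability(path, observed):
    """Joint probability of visiting `path` and emitting `observed`."""
    prob = initial[path[0]] * emit[path[0]][observed[0]]
    for prev, cur, symbol in zip(path, path[1:], observed[1:]):
        prob *= trans[prev][cur] * emit[cur][symbol]
    return prob

def enumerate_paths(observed):
    """Map every possible state sequence to its joint probability."""
    return {path: path_probability(path, observed)
            for path in product(sorted(initial), repeat=len(observed))}

scores = enumerate_paths('222')
best = max(scores, key=scores.get)
for path, prob in sorted(scores.items()):
    print(path, prob)
print('BEST:', best, scores[best])
```

The number of dictionary entries doubles with every additional symbol, which is exactly why this approach does not scale.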

When the string of observations to explain is very long, this enumeration of every possible sequence of states becomes infeasible.  We had to enumerate 2^3 possible sequences for a string of length 3; if the string were of length 400, we would need to enumerate 2^400 (about 2.6 * 10^120) sequences.  Instead, we can employ a dynamic programming approach to make the problem tractable; the module that I wrote includes an implementation of the Viterbi algorithm for this purpose.
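
To give a feel for the dynamic programming idea, here is a minimal, standalone sketch of the Viterbi algorithm in log space.  It is not the code from my module; the function name viterbi and the dictionaries initial, emit, and trans are assumptions made for this example, and it assumes that no probability in the model is zero (log10 of zero is undefined).

```python
import math

# Toy model parameters (illustrative names, not the hmm module's API).
initial = {'S1': 0.5, 'S2': 0.5}
emit = {'S1': {'1': 0.5, '2': 0.5},
        'S2': {'1': 0.25, '2': 0.75}}
trans = {'S1': {'S1': 0.9, 'S2': 0.1},
         'S2': {'S1': 0.8, 'S2': 0.2}}

def viterbi(observed):
    """Return (most likely state path, log10 of its joint probability)."""
    states = list(initial)
    # best[s]: log10 probability of the best path so far that ends in state s.
    best = {s: math.log10(initial[s] * emit[s][observed[0]]) for s in states}
    backpointers = []
    for symbol in observed[1:]:
        step, pointers = {}, {}
        for cur in states:
            # Pick the predecessor that maximizes the score of reaching `cur`.
            prev = max(states,
                       key=lambda p: best[p] + math.log10(trans[p][cur]))
            pointers[cur] = prev
            step[cur] = (best[prev] + math.log10(trans[prev][cur])
                         + math.log10(emit[cur][symbol]))
        best = step
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    last = max(states, key=best.get)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

path, logprob = viterbi('222')
print(path, logprob)
```

The work grows with the length of the string times the square of the number of states, rather than exponentially, because only the best path into each state is kept at every step.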

Why is this interesting?

A good example of the utility of HMMs is the annotation of genes in a genome, which is a very difficult problem in eukaryotic organisms.  A gene typically consists of a promoter region, numerous exons and introns with their associated splice sites, and a poly-A region, among others.  A model can be created to describe each of these regions.  The most likely sequence of states that explains a genome can then be calculated.  Each nucleotide in the genome would be annotated with a probable state, indicating whether it is likely to be part of an intron, exon, splice site, intergenic region, etc.

HMMs have a wide variety of real-world applications, such as signal processing and the identification of secondary structure elements in proteins, parts of speech in sentences, or components of musical scores.

The hmm Python module

With my Python module, the above model can be created with the following:

    import hmm
    s1 = hmm.state(
            'S1',            # name of the state
            0.5,             # probability of being the initial state
            { '1': 0.5,      # probability of emitting a '1' at each visit
              '2': 0.5 },    # probability of emitting a '2' at each visit
            { 'S1': 0.9,     # probability of transitioning to itself
              'S2': 0.1 })   # probability of transitioning to state 'S2'
    s2 = hmm.state('S2', 0.5,
            { '1': 0.25, '2': 0.75 },
            { 'S1': 0.8, 'S2': 0.2 })
    model = hmm.hmm(['1', '2'],  # all symbols that can be emitted
                    [s1, s2])    # all of the states in this HMM

All of the possible paths explaining 222 can be generated with:


Which would print the following:

    ('S2', 'S2', 'S2'): -2.073786
    ('S2', 'S2', 'S1'): -1.647817
    ('S2', 'S1', 'S2'): -1.948847
    ('S2', 'S1', 'S1'): -1.170696
    ('S1', 'S2', 'S2'): -2.550907
    ('S1', 'S2', 'S1'): -2.124939
    ('S1', 'S1', 'S2'): -2.073786
    ('S1', 'S1', 'S1'): -1.295635
    BEST: ('S2', 'S1', 'S1'): -1.170696

Note that the probabilities are log (base 10) transformed, as they will be very tiny numbers for long strings.
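
A quick illustration of why the log transform matters, using made-up numbers (a per-step probability of 0.5 and a string of 2000 symbols): the direct product underflows to zero in double-precision floating point, while the sum of logs remains an ordinary number.

```python
import math

p = 0.5     # a made-up per-step probability
n = 2000    # a made-up (long) observation length

direct = p ** n              # too small for a double; underflows to 0.0
logged = n * math.log10(p)   # the same quantity as a log10 value

print(direct)
print(logged)
```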

As stated above, don’t do that with long sequences!  It becomes intractable very quickly.  Instead, use the dynamic programming approach as follows:

    path, prob = model.viterbi_path('222')
    print(path)
    print(prob)

Which yields:

    ['S2', 'S1', 'S1']

Metrics for Multi-Class Classifiers: A Case Study

As I mentioned in my prior post, assessing classifiers can be a difficult task.  In this post, I’ll look at an example of a multi-class classification problem, and discuss good metrics for assessing the performance of tools designed to solve the problem.

Rather than separating samples into “positive” and “negative” categories, multi-class classifiers instead separate them into one of numerous (>2) potential categories.  An example of this in the field of bioinformatics is secondary structure prediction, in which each amino acid in a chain is predicted to be part of an α helix, part of an extended β strand, or random coil.  (Some tools may also attempt to predict other classes such as 3_10 helices, π helices, β turns, or bend regions, but we will limit this analysis to the three common categories.)

We will look at three secondary structure prediction tools for this analysis: GOR4, PHD, and PSIPRED.

The following graphic compares the predictions of each of the three tools against the actual, experimentally observed secondary structure of the protein called TFPIα (the alpha isoform of tissue factor pathway inhibitor):


This post does not claim to analyze the overall efficacy of these three tools; only their performance with TFPIα is assessed, for the purposes of studying metrics.

As we did with binary classifiers, we can construct confusion matrices for each of these three multi-class classifiers.  First, the matrix for GOR4:


For this protein, GOR4 shows a tendency to mispredict regions of coil as extended strands or helices.  There is also a pronounced tendency to mispredict strands as coil.

Now, the matrix for PHD:


In comparison to the prior matrix, the coil mispredictions are still present, though less pronounced.

Finally, the matrix for PSIPRED:


There is less tendency to mispredict coil residues as helices, but, when compared to PHD, more tendency to mispredict coil as extended strand.

How should these three tools be compared?  It seems obvious at a glance that the results from PHD and PSIPRED are better than those from GOR4.  For secondary structure tools, it is common to report Q scores; these are per-category accuracy scores, each computed as the number of residues correctly predicted in a given category divided by the number of residues that were observed in that category.  The following table shows the Q scores for the three categories, as well as an overall (total) Q score (the CEN column will be discussed momentarily):


When assessing secondary structure predictions, it is insufficient to look only at the Q_total score.  Why is that?  It is notable that TFPIα is 79% random coil.  A tool that blindly predicts ‘c’ for every residue would receive a very good Q_total score of 0.79, better than all three of these tools.  But that is an uninteresting prediction; we would need to consider all four of the scores when assessing these tools in order to correctly dismiss this hypothetical tool.
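
The blind predictor is easy to demonstrate.  The sketch below computes Q scores from a confusion matrix; the function name q_scores and the residue counts are hypothetical (a made-up 100-residue protein that is 79% coil, not the real TFPIα data), chosen only to mirror the argument above.

```python
classes = ['h', 'e', 'c']   # helix, extended strand, coil

def q_scores(matrix):
    """Q scores from matrix[observed][predicted] residue counts."""
    scores = {}
    for cls in classes:
        observed = sum(matrix[cls].values())
        # Fraction of the residues observed in this class that were predicted correctly.
        scores['Q_' + cls] = (matrix[cls][cls] / observed
                              if observed else float('nan'))
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[cls][cls] for cls in classes)
    scores['Q_total'] = correct / total
    return scores

# Hypothetical counts: every residue is predicted to be coil.
blind = {'h': {'h': 0, 'e': 0, 'c': 12},
         'e': {'h': 0, 'e': 0, 'c': 9},
         'c': {'h': 0, 'e': 0, 'c': 79}}

blind_scores = q_scores(blind)
print(blind_scores)
```

Q_total looks respectable here, but Q_h and Q_e are both zero, which is what exposes the prediction as useless.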

As I mentioned in the last post, it is often useful to have a single metric for comparing such classifiers.  Clearly, Q_total does not fit the bill, which corroborates my statement in the last post that “accuracy is a poor metric.”  One excellent metric for multi-class classifiers is “confusion entropy,” or CEN.  This technique accounts for every cell of the confusion matrix, measuring the entropy in the mispredictions in every row and in every column.  This provides a single number that is very representative of the overall predictive power.  The CEN scores range from 0 (indicating perfect classification) to 1 (indicating that every class was completely mispredicted evenly across all of the other possible classes).  The CEN calculation is described in Wei et al. (2010).  In this example, the CEN scores correlate well with the Q_total scores, but give the reviewer confidence that the failings of accuracy metrics are not an issue.
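
For the curious, here is a compact sketch of the CEN calculation as I understand it from Wei et al. (2010); treat it as an approximation of the paper's definition rather than a reference implementation.  The function name confusion_entropy and the example matrices are made up, and the sketch assumes at least two classes.

```python
import math

def confusion_entropy(matrix):
    """CEN of a square confusion matrix (list of rows), where
    matrix[i][j] counts samples of true class i predicted as class j."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    base = 2 * (n - 1)   # log base used to normalize the entropy
    cen = 0.0
    for j in range(n):
        # All counts in row j plus column j (the diagonal cell counts twice).
        involved = sum(matrix[j]) + sum(row[j] for row in matrix)
        if involved == 0:
            continue
        cen_j = 0.0
        for k in range(n):
            if k == j:
                continue
            # Mispredictions of class j as k, and of class k as j.
            for count in (matrix[j][k], matrix[k][j]):
                if count:
                    p = count / involved
                    cen_j -= p * math.log(p, base)
        # Weight each class by its share of the row and column totals.
        cen += (involved / (2 * total)) * cen_j
    return cen

perfect = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
worst = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(confusion_entropy(perfect))
print(confusion_entropy(worst))
```

The two made-up matrices correspond to the extremes described above: a purely diagonal matrix, and one in which every class is mispredicted evenly across the other classes.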

Another information theoretic approach, which accounts for “truth information completeness” and the “false information ratio,” is described in Holt et al. (2010).


  • Wei JM, Yuan XJ, Hu QH, and Wang SQ. 2010. A novel measure for evaluating classifiers. Expert Systems with Applications 37: 3799-3809.
  • Holt RS, Mastromarino PA, Kao EK, and Hurley MB. 2010. Information theoretic approach for performance evaluation of multi-class assignment systems. Proc. of SPIE 7697: 76970R/1-12.