Handling OTU Tables in Python

I have recently been working a lot with OTU tables in the BIOM (Biological Observation Matrix) format, as generated by QIIME and PIPITS. The format seems to be growing in popularity; I don’t recall running into such files in my bioinformatics studies just a few years back.

Tool suites like QIIME allow you to treat BIOM-formatted OTU tables as black boxes, keeping users insulated from their internals, but I wanted to examine them a bit in Python. They are HDF5-formatted files in recent versions of these tools. Installing the h5py package along with the ancillary tables package seemed like the way to go, but did not work for me: the tables package does not handle variable-length strings.
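
For the curious, a recent h5py on its own can at least open a BIOM 2.x file to peek at its raw layout. This is a minimal sketch, assuming a file that follows the BIOM 2.1 HDF5 spec; sample_otu_table.biom is the same file used below:

    import h5py

    # Peek at the raw HDF5 structure of a BIOM 2.x file.
    with h5py.File('sample_otu_table.biom', 'r') as f:
        print(dict(f.attrs))        # table-level attributes: format version, shape, ...
        print(list(f.keys()))       # the 'observation' and 'sample' groups
        print(f['sample/ids'][:5])  # first few sample IDs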

The easiest way to inspect OTU tables in Python is with the biom-format package, which interestingly uses the h5py package that I could not get to work.

For starters, installation:

pip install biom-format
pip install h5py
pip install pandas
pip install numpy

And now a tour through a sample OTU table. The biom-format documentation didn’t make it obvious how to open a BIOM file, but it turns out that it provides a load_table routine that accepts a filename. There is a ‘sample’ axis and an ‘observation’ axis, where the observations are the OTUs.

>>> from biom import load_table
>>> t = load_table('sample_otu_table.biom')
>>> print(t.ids())
['SA8' 'Bed' '26' '29' 'Food' 'CA4' '5' 'SA3' 'Water' '15' '22' 'SA15'
 'SA25' 'CA22' '1' 'TRI3' '24' '10' '14' 'PA36' '3' '7' '16' '21' 'PA17'
 'CA8' '25' 'PA60' 'PA29' 'PA71' 'PA53' 'PA79' 'PA64' 'PA46' 'PA6' 'TRI18'
 'TRI8' '8' 'CA17' 'PA1' '13' '27' '11' '28' '9' '4' '2' '18' '12' '23'
 'PA19' 'HydroGel' '6' '20' '30' '19' 'TRI24' '17']
>>> print(t.ids(axis='observation'))
['4371191' '4479944' '816470' ... '512148' '1109141' '3275744']
>>> print(t.data('SA8', axis='sample'))
[2. 0. 0. ... 0. 0. 0.]
>>> print(t.data('4371191', axis='observation'))
[2. 5. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

The taxonomy metadata annotations are easily available, allowing one to quickly access a list of all taxa present in the study.

>>> print(t.metadata('4371191', axis='observation')['taxonomy'])
['k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Pseudomonadales', 'f__Pseudomonadaceae', 'g__Pseudomonas', 's__']
>>> metadata_df = t.metadata_to_dataframe('observation')
>>> metadata_df.columns = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
>>> print(metadata_df.head())
             kingdom             phylum                   class  \
4371191  k__Bacteria  p__Proteobacteria  c__Gammaproteobacteria
4479944  k__Bacteria  p__Actinobacteria            c__MB-A2-108
816470   k__Bacteria      p__Firmicutes              c__Bacilli
842143   k__Bacteria  p__Actinobacteria       c__Actinobacteria
99682    k__Bacteria  p__Proteobacteria  c__Gammaproteobacteria

                      order                family            genus species
4371191  o__Pseudomonadales   f__Pseudomonadaceae   g__Pseudomonas     s__
4479944        o__0319-7L14                   f__              g__     s__
816470        o__Bacillales        f__Bacillaceae              g__     s__
842143   o__Actinomycetales  f__Cellulomonadaceae  g__Cellulomonas     s__
99682    o__Pseudomonadales   f__Pseudomonadaceae   g__Pseudomonas     s__

Methods are provided to collapse the table over particular levels of taxonomy, to normalize the data across samples to account for differences in sampling depth, and to perform many other useful operations. (This code was borrowed from the biom-format documentation.)

>>> phylum_idx = 1
>>> collapse_f = lambda id_, md: '; '.join(md['taxonomy'][:phylum_idx + 1])
>>> collapsed = t.collapse(collapse_f, axis='observation')
>>> print(collapsed.head())
# Constructed from biom file
#OTU ID SA8 Bed 26 29 Food
k__Bacteria; p__Proteobacteria 11.17291857273559 2.9359560841720036 0.5507776761207686 0.16468435498627632 6.343092406221409
k__Bacteria; p__Actinobacteria 0.412751677852349 102.79362416107382 3.9077181208053693 1.354026845637584 0.06040268456375839
k__Bacteria; p__Firmicutes 17.308791684711995 0.4538761368557817 2.1459506279774794 1.0415764400173235 0.11173668254655696
k__Bacteria; p__Verrucomicrobia 0.0 1.5 0.0 0.0 0.0
k__Bacteria; p__Bacteroidetes 0.006493506493506494 1.7402597402597402 0.21428571428571427 0.0 0.07792207792207792
>>> normed = t.norm(axis='sample', inplace=False)
>>> print(normed.head())
# Constructed from biom file
#OTU ID SA8 Bed 26 29 Food
4371191 3.813009990086174e-05 7.30823199251637e-05 9.994003597841295e-05 0.0002801905295601009 1.965177062453327e-05
4479944 0.0 2.9232927970065482e-05 0.0 0.0 0.0
816470 0.0 0.0 0.0 0.0 0.0
842143 0.0 0.00011693171188026193 0.0 0.0 0.0
99682 1.906504995043087e-05 0.0 0.0 0.0 0.0
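
After collapsing or normalizing, the result can be written back out for use with other tools. A minimal sketch, assuming the normed table from above; the output filename and generated-by string are illustrative:

    from biom.util import biom_open

    # Save the normalized table as a new BIOM 2.1 (HDF5) file.
    with biom_open('normed_otu_table.biom', 'w') as f:
        normed.to_hdf5(f, generated_by='otu-table-tour')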


Hidden Markov Models in Python

I recently created a new GitHub repository for a Python module that I wrote to implement arbitrary HMMs: https://github.com/mstrosaker/hmm

A brief primer on HMMs

I think that HMMs are best described by an example.  We will look at this toy model:

[Figure: a toy HMM with two states, S1 and S2, each with an initial probability of 0.5. S1 emits ‘1’ and ‘2’ with probability 0.5 each, transitions to itself with probability 0.9, and to S2 with probability 0.1. S2 emits ‘1’ with probability 0.25 and ‘2’ with probability 0.75, transitions to S1 with probability 0.8, and to itself with probability 0.2.]

The numbers represent the probabilities of transitioning between the various states, or of emitting certain symbols.  For example, state S1 has a 90% chance of transitioning back to itself; each time it is visited, there is a 50% chance that it emits a ‘1’, and a 50% chance that it emits a ‘2’.

Clearly, this model can be used to produce strings of 1s and 2s that fit its parameters.  But we can use it to solve a much more interesting question: given a string of 1s and 2s, which sequence of states most likely generated the string?  This is why it’s described as a hidden Markov model; the states that were responsible for emitting the various symbols are unknown, and we would like to establish which sequence of states is most likely to have produced the sequence of symbols.

Let’s look at what might have generated the string 222.  We can look at every possible combination of 3 states to establish which was most likely responsible for that string.

  • S1 – S1 – S1:  0.5 (initial probability of being in state 1) * 0.5 (probability of S1 emitting a 2) * 0.9 (probability of S1 transitioning to S1) * 0.5 (probability of S1 emitting a 2) * 0.9 (probability of S1 transitioning to S1) * 0.5 (probability of S1 emitting a 2) = 0.050625
  • S1 – S1 – S2:  0.5 * 0.5 * 0.9 * 0.5 * 0.1 * 0.75 = 0.0084375 (less likely than the previous sequence)
  • S1 – S2 – S1:  0.5 * 0.5 * 0.1 * 0.75 * 0.8 * 0.5 = 0.0075
  • S1 – S2 – S2:  0.5 * 0.5 * 0.1 * 0.75 * 0.2 * 0.75 = 0.0028125
  • S2 – S1 – S1:  0.5 * 0.75 * 0.8 * 0.5 * 0.9 * 0.5 = 0.0675
  • S2 – S1 – S2:  0.5 * 0.75 * 0.8 * 0.5 * 0.1 * 0.75 = 0.01125
  • S2 – S2 – S1:  0.5 * 0.75 * 0.2 * 0.75 * 0.8 * 0.5 = 0.0225
  • S2 – S2 – S2:  0.5 * 0.75 * 0.2 * 0.75 * 0.2 * 0.75 = 0.0084375

The string 222 was most likely generated by the sequence S2 – S1 – S1.  That may be a little surprising, since S2 is much more likely than S1 to emit 2s, but the transition probabilities show that the model is generally much more likely to be in state S1.

When the string of observations to explain is very long, this enumeration of every possible sequence of states becomes infeasible.  We had to enumerate 2^3 possible sequences for a string of length 3; if the string were of length 400, we would need to enumerate 2^400 (about 2.6 * 10^120) sequences.  Instead, we can employ a dynamic programming approach to make the problem tractable; the module that I wrote includes an implementation of the Viterbi algorithm for this purpose.
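
To make the dynamic programming concrete, here is a minimal standalone sketch of the Viterbi recurrence for this two-state model. It is not my module’s implementation, just the same idea in a few lines; the probabilities come from the figure above:

    import math

    states = ('S1', 'S2')
    init  = {'S1': 0.5, 'S2': 0.5}
    trans = {'S1': {'S1': 0.9, 'S2': 0.1},
             'S2': {'S1': 0.8, 'S2': 0.2}}
    emit  = {'S1': {'1': 0.5,  '2': 0.5},
             'S2': {'1': 0.25, '2': 0.75}}

    def viterbi(obs):
        # v[s]: log10 probability of the best path ending in state s
        v = {s: math.log10(init[s] * emit[s][obs[0]]) for s in states}
        backpointers = []
        for symbol in obs[1:]:
            prev, ptr, v = v, {}, {}
            for s in states:
                # Best predecessor of s (the emission term is common to all)
                p = max(states, key=lambda q: prev[q] + math.log10(trans[q][s]))
                v[s] = prev[p] + math.log10(trans[p][s] * emit[s][symbol])
                ptr[s] = p
            backpointers.append(ptr)
        # Trace the best path backwards from the most probable final state
        path = [max(states, key=v.get)]
        for ptr in reversed(backpointers):
            path.append(ptr[path[-1]])
        return path[::-1], v[path[0]]

    print(viterbi('222'))   # -> (['S2', 'S1', 'S1'], about -1.1707)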

Why is this interesting?

A good example of the utility of HMMs is the annotation of genes in a genome, which is a very difficult problem in eukaryotic organisms.  A gene typically consists of a promoter region, numerous exons and introns with their associated splice sites, and a poly-A region, among others.  A model can be created to describe each of these regions.  The most likely sequence of states that explains a genome can then be calculated.  Each nucleotide in the genome would be annotated with a probable state, indicating whether it is likely to be part of an intron, exon, splice site, intergenic region, etc.

There are a wide variety of real-world applications for HMMs, such as in signal processing, or in identifying secondary structure elements in proteins, parts of speech in sentences, or components of musical scores.

The hmm Python module

With my Python module, the above model can be created with the following:

    import hmm
    s1 = hmm.state(
            'S1',            # name of the state
            0.5,             # probability of being the initial state
            { '1': 0.5,      # probability of emitting a '1' at each visit
              '2': 0.5 },    # probability of emitting a '2' at each visit
            { 'S1': 0.9,     # probability of transitioning to itself
              'S2': 0.1 })   # probability of transitioning to state 'S2'
    s2 = hmm.state('S2', 0.5,
            { '1': 0.25, '2': 0.75 },
            { 'S1': 0.8, 'S2': 0.2 })
    model = hmm.hmm(['1', '2'],  # all symbols that can be emitted
                    [s1, s2])    # all of the states in this HMM

All of the possible paths explaining 222 can be generated with:

    model.enumerate('222')

Which would print the following:

    ('S2', 'S2', 'S2'): -2.073786
    ('S2', 'S2', 'S1'): -1.647817
    ('S2', 'S1', 'S2'): -1.948847
    ('S2', 'S1', 'S1'): -1.170696
    ('S1', 'S2', 'S2'): -2.550907
    ('S1', 'S2', 'S1'): -2.124939
    ('S1', 'S1', 'S2'): -2.073786
    ('S1', 'S1', 'S1'): -1.295635
    BEST: ('S2', 'S1', 'S1'): -1.170696

Note that the probabilities are log (base 10) transformed, as they will be very tiny numbers for long strings.
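
As a quick sanity check, applying the log transform to the hand-computed probability of 0.0675 for S2 – S1 – S1 above recovers the same value:

    import math
    print(round(math.log10(0.0675), 6))   # -1.170696, matching the BEST line above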

As stated above, don’t do that with long sequences!  It becomes intractable very quickly.  Instead, use the dynamic programming approach as follows:

    path, prob = model.viterbi_path('222')
    print(path)
    print(prob)

Which yields:

    ['S2', 'S1', 'S1']
    -1.17069622717

Fragment Recruitment Plotting in Python

I recently created a repository on GitHub for an application that I wrote a year and a half ago to generate fragment recruitment plots; it is located at https://github.com/mstrosaker/fragrecruit.  These plots are used in genomic and metagenomic analysis to visualize how short reads align to one or more reference genomes.  The application is written in Python, using matplotlib to generate the plots.

One such plot is shown below.  This plot shows the reads from human keratinized gingiva microbiome samples that align against a 100,000 bp section of the genome of a strain of the bacterium Gemella haemolysans.  Each sample is represented by a different color.  The sample sizes are dramatically different (reflecting differing depths of sequencing among the samples), which makes some colors far more prominent.  My application adds vertical lines to delineate the boundaries of contigs in the reference genome, and shades alternate contigs; as far as I know, it is the only application that identifies contigs along the reference genome in this manner.  It is interesting to note that, in some cases, fragments tend to cluster near the edges of contigs; I suspect this is often due to the presence of repetitive sequences at the ends of contigs.
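
The heart of such a plot is straightforward: each aligned read becomes a short horizontal segment at (reference position, percent identity), colored by its sample, with contig boundaries and shading layered underneath.  Below is a minimal matplotlib sketch of that idea; it is not the fragrecruit code itself, and the alignment data and contig coordinates are purely illustrative:

    import matplotlib.pyplot as plt

    # Illustrative alignments per sample: (start, end, percent identity)
    alignments = {
        'sample_A': [(812000, 812100, 97.2), (845500, 845600, 91.8)],
        'sample_B': [(823000, 823090, 95.1), (871200, 871300, 98.4)],
    }
    colors = {'sample_A': 'tab:blue', 'sample_B': 'tab:orange'}
    contig_bounds = [800000, 834000, 868000, 900000]   # illustrative contig edges

    fig, ax = plt.subplots()
    # Shade alternate contigs and draw their boundaries.
    for i in range(len(contig_bounds) - 1):
        if i % 2:
            ax.axvspan(contig_bounds[i], contig_bounds[i + 1], color='0.92')
        ax.axvline(contig_bounds[i], color='0.6', linewidth=0.5)
    # Each read is a short horizontal segment at its percent identity.
    for sample, reads in alignments.items():
        for start, end, pct_id in reads:
            ax.plot([start, end], [pct_id, pct_id], color=colors[sample], linewidth=2)
    ax.set_xlabel('position in reference genome (bp)')
    ax.set_ylabel('percent identity')
    plt.show()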

[Figure: fragment recruitment plot of the 800–900 kb region of the Gemella haemolysans reference genome, with reads colored by sample, contig boundaries marked, and alternate contigs shaded.]

[Figure: legend mapping each color to its sample.]