2.6 Machine learning in bioinformatics [edit]

2.6.1 Supervised v unsupervised classification [edit]

2.6.2 Training data, test data, and cross validation [edit]

2.6.3 scikit-learn [edit]

In this chapter we'll implement several machine learning classifiers so we can gain an in-depth understanding of how they work. In practice though, there are many mature machine learning libraries that you'd want to use. scikit-learn is a popular and well-documented Python library for machine learning which many bioinformatics researchers and software developers use in their work.

2.6.4 Defining the problem [edit]

We'll explore machine learning classifiers in the context of a familiar topic: taxonomic classification of 16S rRNA sequences. We previously explored this problem in Sequence Homology Searching, so it is likely worth spending a few minutes skimming that chapter if it's not fresh in your mind.

Briefly, the problem that we are going to address here is as follows. We have a query sequence ($q_i$) which is not taxonomically annotated (meaning we don't know the taxonomy of the organism whose genome it is found in), and a reference database ($R$) of taxonomically annotated sequences ($r_1, r_2, r_3, ... r_n$). We want to infer a taxonomic annotation for $q_i$. We'll again work with the Greengenes database, which we'll access using QIIME default reference project. Greengenes is a database of 16S rRNA gene sequences. (This should all sound very familiar - if not, I again suggest that you review Sequence Homology Searching.)

This time, instead of using sequence alignment to identify the most likely taxonomic origin of a sequence, we'll train classifiers by building kmer-based models of the 16S sequences of taxa in our reference database. We'll then run our query sequences through those models to identify the most likely taxonomic origin of each query sequence. Since we know the taxonomic origin of our query sequences in this case, we can evaluate the accuracy of our classifiers by seeing how often they return the known taxonomy assignment. If our training and testing approaches are well-designed, the performance on our tests will inform us of how accurate we can expect our classifier to be on data where the actual taxonomic origin is unknown.

Let's jump in...

2.6.5 Naive Bayes classifiers [edit]

The first classifier we'll explore is the popular and relatively simple Naive Bayes classifier. This classifier uses Bayes Theorem to determine the most likely label for an unknown input based on a probabilistic model it has constructed from training data. (The preceding text needs work.) The model that is constructed is based on user-defined features of the sequences. The most commonly used features for sequence classification tasks such as this is overlapping kmers.

We'll begin by importing some libraries that we'll use in this chapter, and then preparing our reference database and query sequences as we did previously.

In [1]:
%pylab inline

from IPython.core import page
page.page = print

import pandas as pd
import skbio
import numpy as np
import itertools
import collections
Populating the interactive namespace from numpy and matplotlib
In [2]:
import qiime_default_reference as qdr

# Load the taxonomic data
reference_taxonomy = {}
for e in open(qdr.get_reference_taxonomy()):
    seq_id, seq_tax = e.strip().split('\t')
    reference_taxonomy[seq_id] = seq_tax

# Load the reference sequences, and associate the taxonomic annotation with
# each as metadata
reference_db = []
for e in skbio.io.read(qdr.get_reference_sequences(), format='fasta', constructor=skbio.DNA):
    if e.has_degenerates():
        # For the purpose of this lesson, we're going to ignore sequences that contain
        # degenerate characters (i.e., characters other than A, C, G, or T)
        continue
    seq_tax = reference_taxonomy[e.metadata['id']]
    e.metadata['taxonomy'] = seq_tax
    reference_db.append(e)

print("%s sequences were loaded from the reference database." % len(reference_db))
88452 sequences were loaded from the reference database.
In [3]:
reference_db[0]
Out[3]:
DNA
-----------------------------------------------------------------------
Metadata:
    'description': ''
    'id': '1111883'
    'taxonomy': 'k__Bacteria; p__Gemmatimonadetes; c__Gemm-1; o__; f__;
                 g__; s__'
Stats:
    length: 1428
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 61.90%
-----------------------------------------------------------------------
0    GCTGGCGGCG TGCCTAACAC ATGTAAGTCG AACGGGACTG GGGGCAACTC CAGTTCAGTG
60   GCAGACGGGT GCGTAACACG TGAGCAACTT GTCCGACGGC GGGGGATAGC CGGCCCAACG
...
1320 GCCGCGGTGA ATACGTTCCC GGGCCTTGTA CACACCGCCC GTCACGCCAT GGAAGCCGGA
1380 GGGACCCGAA ACCGGTGGGC CAACCGCAAG GGGGCAGCCG TCTAAGGT
In [4]:
reference_db[-1]
Out[4]:
DNA
----------------------------------------------------------------------
Metadata:
    'description': ''
    'id': '4483258'
    'taxonomy': 'k__Archaea; p__Crenarchaeota; c__Thermoprotei;
                 o__Thermoproteales; f__Thermoproteaceae; g__; s__'
Stats:
    length: 2123
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 58.36%
----------------------------------------------------------------------
0    CTGGTTGATC CTGCCGGACC CGACCGCTAT CGGGGTGGGG CTTAGCCATG CGAGTCAAGC
60   GCCCCAGGGA CCCGCTGGGG TGCGGCGCAC GGCTCAGTAA CACGTGGCCA ACCTACCCTC
...
2040 ATAATCTCCT TATTGTCTGA TCCTTATGCA TTTTCCTTTG GCCCATCCCG TGAATACGCG
2100 CGGTGAATAC GTCCCTGCCC CTT

We'll select a random subset of the reference database to work with here.

In [5]:
reference_db = np.random.choice(reference_db, 500, replace=False)
print("%s sequences are present in the subsampled database." % len(reference_db))
500 sequences are present in the subsampled database.

The first thing our Naive Bayes classifier will need is the set of all possible words of length k. This will be dependent on the value of k and the characters in our alphabet (i.e., the characters that we should expect to find in the reference database). This set is referred to as W, and can be computed as follows. Given the following alphabet, how many kmers of length 2 are there (i.e., 2-mers)? How many 7-mers are there? How many 7-mers are there if there are twenty characters in our alphabet (as would be the case if we were working with protein sequences instead of DNA sequences)?

In [6]:
alphabet = skbio.DNA.nondegenerate_chars
k = 2

def compute_W(alphabet, k):
 return set(map(''.join, itertools.product(alphabet, repeat=k)))

W = compute_W(alphabet, k)
print('Alphabet contains the characters: %s' % ', '.join(alphabet))
print('For an alphabet size of %d, W contains %d length-%d kmers.' % (len(alphabet), len(W), k))
Alphabet contains the characters: T, G, A, C
For an alphabet size of 4, W contains 16 length-2 kmers.

scikit-bio provides methods for identifying all kmers in a skbio.DNA sequence object, and for computing the kmer frequencies. This information can be obtained for one of our reference sequences as follows:

In [7]:
kmers = reference_db[0].iter_kmers(k=k)
for kmer in kmers:
    print(kmer, end=' ')
AG GA AG GT TT TT TG GA AT TC CC CT TG GG GC CT TC CA AG GG GA AC CG GA AA AC CG GC CT TG GG GC CG GG GT TG GT TG GC CT TT TA AA AC CA AC CA AT TG GC CA AA AG GT TG GG GT TG GC CG GA AC CG GA AA AC CC CA AG GG GC CT TT TC CG GG GC CC CT TG GG GG GG GC CA AA AA AG GC CC CG GC CG GA AA AC CG GG GG GT TG GA AG GT TA AA AC CA AC CG GT TG GG GG GT TA AA AT TC CT TG GC CC CC CC CG GA AT TG GA AC CC CG GG GG GA AC CA AA AC CC CC CG GA AG GG GA AA AA AC CT TC CG GG GG GC CT TA AA AT TA AC CC CG GG GA AT TG GT TG GG GT TC CC CA AC CA AG GG GC CT TT TA AA AG GC CC CT TG GT TG GT TA AC CT TA AA AA AG GG GT TA AG GC CT TT TC CG GG GC CT TT TC CC CG GC CA AT TT TG GG GG GA AG GG GG GG GC CC CC CG GC CG GG GC CC CC CA AT TT TA AG GC CT TT TG GT TT TG GG GT TG GG GG GG GT TC CA AA AG GG GC CC CT TA AC CC CA AA AG GG GC CA AA AC CG GA AT TG GG GG GT TA AT TC CT TG GG GT TC CT TG GA AG GA AG GG GA AC CG GA AT TC CA AA AC CC CA AC CG GC CT TG GG GG GA AC CT TG GA AA AA AC CA AC CG GG GC CC CC CA AG GA AC CT TC CC CT TA AC CG GG GG GG GG GG GC CA AC CC CT TG GT TG GG GG GG GA AA AT TC CT TT TG GT TG GC CA AA AT TG GC CG GC CC CT TT TA AG GC CG GT TG GA AC CG GC CA AG GC CA AT TC CA AC CC CG GC CG GT TG GA AG GG GA AA AG GA AC CG GT TC CC CT TA AA AG GT TG GT TA AC CC CT TT TG GT TC CA AA AT TG GG GA AC CG GA AG GT TC CC CA AG GC CG GG GT TG GA AT TA AG GC CT TG GA AC CT TG GC CA AT TT TG GA AC CC CG GT TA AC CT TC CA AG GA AG GA AG GC CT TC CC CA AG GG GC CC CT TA AA AC CT TA AC CG GT TG GC CA AG GC CA AG GT TC CG GC CG GG GT TA AA AT TA AC CG GT TA AG GG GA AG GC CA AA AG GC CG GT TT TG GT TT TC CT TG GT TA AA AT TC CA AT TT TG GG GG GC CG GT TA AA AA AG GA AG GC CG GC CG GT TA AG GG GT TG GG GA AT TC CA AA AT TT TA AG GT TC CT TG GC CT TG GT TC CA AA AA AG GT TC CA AA AA AG GG GC CT TC CA AA AC CC CT TT TT TG GA AA AA AG GC CC CG GG GT TG GG GA AT TA AC CT TG GT TT TG GA AT TC CT TA AG GA AG GT TA AC CG GG GA AA AG GA AG GG GC CG GA AG GT TG GG GA AA AT TT TC CC CT TG GG GT TG GT TA AG GC CG GG GT TG GG GA AA AT TG GC CG GC CA AG GA AT TA AT TC CA AG GG GA AG GG GA AA AC CA AC CC CA AA AT TA AG GC CG GA AA AG GG GC CA AG GC CT TC CG GG GT TG GG GG GA AC CG GT TT TA AC CT TG GA AC CA AC CT TG GA AG GG GC CG GC CG GA AA AA AG GC CG GT TG GG GG GG GA AG GC CA AA AA AC CA AG GG GA AT TT TA AG GA AT TA AC CC CG GT TG GG GT TA AG GT TC CC CA AC CG GC CA AG GT TA AA AA AC CG GA AT TG GG GG GT TA AC CT TA AG GA AT TG GT TG GG GG GA AG GG GT TG GT TC CG GA AC CT TC CC CT TC CC CC CG GT TA AT TC CG GC CA AG GC CT TA AA AC CG GC CA AC CT TA AA AG GT TA AC CC CC CC CG GC CC CT TG GG GG GG GA AG GT TA AC CG GG GC CC CG GC CA AA AG GG GC CT TA AA AA AA AC CT TC CA AA AA AG GG GA AA AT TT TG GA AC CG GG GG GG GG GC CC CC CG GC CA AC CA AA AG GC CA AG GC CG GG GA AG GC CA AT TG GT TG GG GT TT TT TA AA AT TT TC CG GA AC CG GC CA AA AC CG GC CG GA AA AG GA AA AC CC CT TT TA AC CC CT TG GG GG GC CT TT TG GA AC CA AT TG GT TA AT TG GT TG GA AC CC CG GC CC CA AT TA AG GA AG GA AT TA AT TG GG GC CT TT TC CC CC CT TT TC CG GG GG GG GC CA AC CA AT TT TC CA AC CA AG GG GT TG GG GT TG GC CA AT TG GG GC CT TG GT TC CG GT TC CA AG GC CT TC CG GT TG GT TC CG GT TG GA AG GA AT TG GT TT TG GG GG GT TT TA AA AG GT TC CC CC CG GC CA AA AC CG GA AG GC CG GC CA AA AC CC CC CC CC CG GT TC CC CT TA AT TG GT TT TG GC CC CA AG GC CA AT TT TC CA AG GT TT TG GG GG GG GA AC CT TC CA AT TA AG GG GA AG GA AC CT TG GC CC CG GG GT TG GA AC CA AA AA AC CC CG GG GA AG GG GA AA AG GG GT TG GG GG GG GA AT TG GA AC CG GT TC CA AA AG GT TC CA AT TC CA AT TG GC CC CC CC CT TT TA AT TG GT TC CC CA AG GG GG GC CT TA AC CA AC CA AC CG GT TG GC CT TA AC CA AT TT TG GG GC CG GC CA AT TA AC CA AG GA AG GG GG GT TT TG GC CA AA AT TA AC CC CG GT TG GA AG GG GT TG GG GA AG GC CG GA AA AT TC CC CC CA AA AA AA AA AG GT TG GC CG GT TC CT TC CG GG GT TT TC CG GG GA AT TT TG GG GA AG GG GC CT TG GC CA AA AC CT TC CG GC CC CT TC CC CA AT TG GA AA AG GA AT TG GG GA AG GT TT TG GC CT TA AG GT TA AA AT TC CG GC CA AG GA AT TC CA AG GC CA AA AT TG GC CT TG GC CG GG GT TG GA AA AT TA AC CG GT TT TC CC CC CG GG GG GC CC CT TT TG GT TA AC CA AC CA AC CC CG GC CC CC CG GT TC CA AC CA AC CC CA AC CG GA AA AA AG GC CG GA AG GC CA AA AC CA AC CC CC CG GA AA AG GC CC CG GG GT TG GG GC CC CT TA AA AC CC CT TT TT TT TG GG GA AG GG GG GA AG GC CC CG GT TC CG
 GA AA AG GG GT TG GG GG GG GT TT TC CG GT TG GA AT TT TG GG GG GG GT TG GA AA AG GT TC CG GT TA AA AC CA AA AG GG GT TA 
In [8]:
print(reference_db[0].kmer_frequencies(k=k))
{'AG': 107, 'GA': 103, 'CT': 74, 'TC': 67, 'AC': 93, 'GC': 112, 'TG': 113, 'CA': 95, 'CC': 87, 'TA': 66, 'GT': 108, 'GG': 149, 'AA': 94, 'CG': 103, 'AT': 64, 'TT': 52}

This information can be convenient to store in a pandas Series object:

In [9]:
pd.Series(reference_db[0].kmer_frequencies(k=k), name=reference_db[0].metadata['id'])
Out[9]:
AA     94
AC     93
AG    107
AT     64
CA     95
CC     87
CG    103
CT     74
GA    103
GC    112
GG    149
GT    108
TA     66
TC     67
TG    113
TT     52
Name: 590365, dtype: int64

To train our taxonomic classifier, we next need to define a few things. First, at what level of taxonomic specificity do we want to classify our sequences? We should expect to achieve higher accuracy at less specific taxonomic levels such as phylum or class, but these are likely to be less informative biologically than more specific levels such as genus or species. Let's start classifying at the phylum level to keep our task simple, since we're working with a small subset of the reference database here. In Greengenes, phylum is the second level of the taxonomy.

Next, how long should our kmers be? We don't have a good idea of this to start with. The longer our kmers, the more likely they are to be specific to certain taxa, which is good because that will help with classification. However, if they get too long it becomes less likely that we'll observe those kmers in sequences that aren't represented in our database because the longer the sequence is the more likely we are to see variation across other organisms that are assigned to the same taxonomy. Based on some of my own work in this area, I'll start us out with 7-mers (i.e., kmers of length 7).

Finally, we'll need to know the value of W, defined above as the set of all possible kmers given our alphabet and the value of k.

As an exercise, I recommend exploring the impact of the value of k and taxonomic_level on the accuracy of our classifier after reading this chapter.

In [10]:
taxonomic_level = 2
k = 7
alphabet = skbio.DNA.nondegenerate_chars

Next, we'll compute a table of the per-sequence kmer counts for all kmers in W for all sequences in our reference database. We'll also store the taxonomic label of each of our reference sequences at our specified taxonomic level. We can store this information in a pandas DataFrame, and then view the first 25 rows of that table.

In [11]:
def get_taxon_at_level(taxon, level):
    taxon = [l.strip() for l in taxon.split(';')]
    return '; '.join(taxon[:level])

W = compute_W(alphabet, k)

per_sequence_kmer_counts = []
for reference_sequence in reference_db:
    taxon = get_taxon_at_level(reference_sequence.metadata['taxonomy'], taxonomic_level)
    kmer_counts = dict.fromkeys(W, 0)
    kmer_counts.update(reference_sequence.kmer_frequencies(k=k))
    per_sequence_kmer_counts.append(pd.Series(kmer_counts, name=taxon))

per_sequence_kmer_counts = pd.DataFrame(data=per_sequence_kmer_counts).fillna(0).T
per_sequence_kmer_counts[:25]
Out[11]:
k__Bacteria; p__Actinobacteria k__Bacteria; p__Firmicutes k__Bacteria; p__Bacteroidetes k__Bacteria; p__Actinobacteria k__Bacteria; p__Proteobacteria k__Bacteria; p__Firmicutes k__Bacteria; p__Cyanobacteria k__Bacteria; p__Firmicutes k__Bacteria; p__Bacteroidetes k__Bacteria; p__Firmicutes ... k__Bacteria; p__Actinobacteria k__Bacteria; p__Bacteroidetes k__Bacteria; p__Firmicutes k__Bacteria; p__Bacteroidetes k__Bacteria; p__Proteobacteria k__Bacteria; p__Gemmatimonadetes k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria k__Bacteria; p__Acidobacteria k__Bacteria; p__Firmicutes
AAAAAAA 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAAAC 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
AAAAAAG 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAAAT 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAACA 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAACC 0 0 0 0 0 0 0 1 1 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAACG 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAACT 0 0 0 0 0 1 0 0 1 0 ... 1 0 0 0 0 0 0 0 0 1
AAAAAGA 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAAGC 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
AAAAAGG 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAAGT 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAATA 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAATC 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAATG 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAAATT 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAACAA 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAACAC 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAACAG 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAACAT 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAACCA 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
AAAACCC 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAAACCG 0 1 0 0 0 0 0 1 0 0 ... 0 0 0 1 0 0 1 0 0 1
AAAACCT 0 0 0 0 0 0 0 1 1 0 ... 0 0 0 0 0 0 0 0 0 0
AAAACGA 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

25 rows × 500 columns

With this information, we'll next compute our "kmer probability table" (EXISTING NAME FOR THIS?). The content of this table will be the probability of observing each kmer in W given a taxon. This is computed based on a few values:

$N$ : The total number of sequences in the training set.

$n(w_i)$ : The number of total sequences containing kmer i.

$P_i$ : The probability of observing kmer i. Initially it might seem as though this would be computed as $n(w_i) / N$, but this neglects the possibility of that a kmer observed in a query sequence might not be represented in our reference database, so a small pseudocount is added to the numerator and denomenator.

$P(w_i | taxon)$ : The probability of observing a kmer given a taxon. Again, it would seem that this would be computed as the proportion of sequences in the taxon containing the kmer, but this would neglect that we'll likely observe kmers in our query sequences that are not represented in our reference database. As pseudocount is therefore added again to the numerator and denominator. This time the pseudocount in the numerator is scaled by how frequent the kmer is in the reference database as a whole: specifically, it is $P_i$.

Our "kmer probability table" is $P(w_i | taxon)$ computed for all kmers in W and all taxa represented in our reference database. We'll compute that and again look at the first 25 rows.

In [12]:
def compute_kmer_probability_table(per_sequence_kmer_counts):
    N = len(per_sequence_kmer_counts) # number of training sequences

    # number of sequences containing kmer wi
    n_wi = per_sequence_kmer_counts.astype(bool).sum(axis=1)
    n_wi.name = 'n(w_i)'

    # probabilities of observing each kmer
    Pi = (n_wi + 0.5) / (N + 1)
    Pi.name = 'P_i'

    # number of times each taxon appears in training set
    taxon_counts = collections.Counter(per_sequence_kmer_counts.columns)
    n_taxon_members_containing_kmer = per_sequence_kmer_counts.astype(bool).groupby(level=0, axis=1).sum()

    # probabilities of observing each kmer in each taxon
    p_wi_t = []
    for taxon, count in taxon_counts.items():
        p_wi_t.append(pd.Series((n_taxon_members_containing_kmer[taxon] + Pi) / (count + 1), name=taxon))

    return pd.DataFrame(p_wi_t).T
In [13]:
kmer_probability_table = compute_kmer_probability_table(per_sequence_kmer_counts)
In [14]:
kmer_probability_table[:25]
Out[14]:
k__Bacteria; p__Chlorobi k__Bacteria; p__Synergistetes k__Bacteria; p__LCP-89 k__Bacteria; p__TM6 k__Bacteria; p__BRC1 k__Bacteria; p__KSB3 k__Bacteria; p__Actinobacteria k__Bacteria; p__GOUTA4 k__Bacteria; p__WS3 k__Bacteria; p__Elusimicrobia ... k__Bacteria; p__Verrucomicrobia k__Bacteria; p__Firmicutes k__Bacteria; p__Cyanobacteria k__Bacteria; p__TM7 k__Bacteria; p__OP8 k__Bacteria; p__OP9 k__Archaea; p__[Parvarchaeota] k__Bacteria; p__OD1 k__Bacteria; p__Gemmatimonadetes k__Bacteria; p__Tenericutes
AAAAAAA 0.000381 0.000381 0.000381 0.000127 0.000381 0.000381 0.038491 0.000381 0.000254 0.000191 ... 0.000069 0.027977 0.083397 0.000381 0.000381 0.000381 0.000254 0.000254 0.000109 0.000109
AAAAAAC 0.001022 0.001022 0.001022 0.000341 0.001022 0.001022 0.038540 0.001022 0.334015 0.000511 ... 0.000186 0.034979 0.083504 0.001022 0.001022 0.001022 0.000682 0.000682 0.000292 0.000292
AAAAAAG 0.000687 0.000687 0.000687 0.000229 0.000687 0.000687 0.038514 0.000687 0.000458 0.000343 ... 0.091034 0.062947 0.083448 0.000687 0.000687 0.000687 0.000458 0.000458 0.000196 0.143053
AAAAAAT 0.000442 0.000442 0.000442 0.000147 0.000442 0.000442 0.000034 0.000442 0.000295 0.000221 ... 0.000080 0.027978 0.083407 0.000442 0.000442 0.000442 0.333628 0.000295 0.000126 0.285841
AAAAACA 0.000320 0.000320 0.000320 0.166773 0.000320 0.000320 0.000025 0.000320 0.000214 0.000160 ... 0.000058 0.034970 0.083387 0.000320 0.000320 0.000320 0.000214 0.000214 0.000092 0.000092
AAAAACC 0.002121 0.002121 0.002121 0.000707 0.002121 0.002121 0.115548 0.002121 0.334747 0.251060 ... 0.000386 0.076953 0.083687 0.002121 0.002121 0.002121 0.001414 0.001414 0.143463 0.000606
AAAAACG 0.000504 0.000504 0.000504 0.000168 0.000504 0.000504 0.000039 0.000504 0.000336 0.250252 ... 0.000092 0.027979 0.000084 0.000504 0.000504 0.000504 0.000336 0.000336 0.000144 0.000144
AAAAACT 0.001327 0.001327 0.001327 0.000442 0.001327 0.001327 0.038564 0.001327 0.000885 0.000664 ... 0.000241 0.244774 0.000221 0.001327 0.001327 0.001327 0.000885 0.000885 0.000379 0.000379
AAAAAGA 0.000534 0.500534 0.000534 0.000178 0.000534 0.000534 0.000041 0.000534 0.000356 0.000267 ... 0.091006 0.027979 0.166756 0.000534 0.000534 0.000534 0.000356 0.000356 0.000153 0.143010
AAAAAGC 0.002365 0.002365 0.002365 0.167455 0.002365 0.002365 0.346336 0.002365 0.334910 0.001182 ... 0.000430 0.090942 0.000394 0.002365 0.002365 0.002365 0.001577 0.001577 0.143533 0.000676
AAAAAGG 0.500595 0.000595 0.000595 0.000198 0.000595 0.000595 0.076969 0.000595 0.000397 0.000298 ... 0.000108 0.041966 0.000099 0.000595 0.000595 0.000595 0.000397 0.333730 0.000170 0.428741
AAAAAGT 0.000717 0.000717 0.000717 0.000239 0.500717 0.000717 0.038517 0.000717 0.000478 0.000359 ... 0.000130 0.027982 0.083453 0.000717 0.000717 0.000717 0.000478 0.000478 0.143062 0.285919
AAAAATA 0.001419 0.001419 0.001419 0.000473 0.001419 0.001419 0.000109 0.001419 0.000946 0.000709 ... 0.000258 0.279740 0.083570 0.001419 0.001419 0.001419 0.000946 0.000946 0.000405 0.143263
AAAAATC 0.000259 0.000259 0.000259 0.000086 0.000259 0.000259 0.000020 0.000259 0.000173 0.000130 ... 0.000047 0.020983 0.083377 0.000259 0.000259 0.000259 0.000173 0.000173 0.000074 0.000074
AAAAATG 0.000412 0.000412 0.000412 0.000137 0.000412 0.000412 0.000032 0.000412 0.000275 0.000206 ... 0.000075 0.055950 0.083402 0.000412 0.000412 0.000412 0.333608 0.000275 0.000118 0.000118
AAAAATT 0.000229 0.000229 0.000229 0.000076 0.000229 0.000229 0.000018 0.000229 0.000153 0.000114 ... 0.000042 0.000003 0.083371 0.000229 0.000229 0.000229 0.000153 0.333486 0.000065 0.285780
AAAACAA 0.000290 0.000290 0.000290 0.000097 0.000290 0.000290 0.000022 0.000290 0.000193 0.000145 ... 0.000053 0.034969 0.000048 0.000290 0.000290 0.000290 0.000193 0.000193 0.000083 0.000083
AAAACAC 0.000168 0.000168 0.000168 0.000056 0.000168 0.000168 0.038474 0.000168 0.000112 0.000084 ... 0.000031 0.013988 0.000028 0.000168 0.000168 0.000168 0.000112 0.000112 0.000048 0.000048
AAAACAG 0.000534 0.000534 0.000534 0.000178 0.000534 0.000534 0.000041 0.000534 0.000356 0.000267 ... 0.000097 0.055952 0.083422 0.000534 0.000534 0.000534 0.000356 0.000356 0.000153 0.143010
AAAACAT 0.000442 0.000442 0.000442 0.166814 0.000442 0.000442 0.038496 0.000442 0.000295 0.000221 ... 0.000080 0.020985 0.000074 0.500442 0.000442 0.000442 0.333628 0.000295 0.000126 0.000126
AAAACCA 0.000717 0.000717 0.000717 0.000239 0.000717 0.000717 0.038517 0.000717 0.333811 0.000359 ... 0.000130 0.041968 0.083453 0.000717 0.000717 0.000717 0.000478 0.000478 0.000205 0.000205
AAAACCC 0.000900 0.000900 0.500900 0.000300 0.000900 0.000900 0.076992 0.000900 0.667267 0.500450 ... 0.000164 0.083929 0.000150 0.000900 0.000900 0.000900 0.333933 0.000600 0.285971 0.000257
AAAACCG 0.502029 0.002029 0.502029 0.000676 0.002029 0.002029 0.000156 0.502029 0.001353 0.001015 ... 0.454914 0.097930 0.000338 0.002029 0.502029 0.002029 0.001353 0.001353 0.000580 0.000580
AAAACCT 0.001663 0.001663 0.001663 0.000554 0.001663 0.001663 0.153974 0.001663 0.001109 0.250832 ... 0.000302 0.069953 0.083611 0.001663 0.001663 0.001663 0.001109 0.001109 0.000475 0.000475
AAAACGA 0.000168 0.000168 0.000168 0.000056 0.000168 0.000168 0.000013 0.000168 0.000112 0.000084 ... 0.000031 0.000002 0.000028 0.000168 0.000168 0.000168 0.000112 0.000112 0.000048 0.000048

25 rows × 37 columns

With our kmer probability table we are now ready to classify unknown sequences. We'll begin by defining some query sequences. We'll pull these at random from our reference sequences, which means that some of the query sequences will be represented in our reference database and some won't be. This is the sitatuation that is typically encountered in practice. To simulate real-world 16S taxonomy classification tasks, we'll also trim out 200 bases of our reference sequences since (as of this writing) we typically don't obtain full-length 16S sequences from a DNA sequencing instrument.

In [15]:
queries = []
for e in skbio.io.read(qdr.get_reference_sequences(), format='fasta', constructor=skbio.DNA):
    if e.has_degenerates():
        # For the purpose of this lesson, we're going to ignore sequences that contain
        # degenerate characters (i.e., characters other than A, C, G, or T)
        continue
    e = e[100:300]
    queries.append(e)
# can't figure out why np.random.choice isn't working here...
np.random.shuffle(queries)
queries = queries[:50]
In [16]:
queries[0]
Out[16]:
DNA
---------------------------------------------------------------------
Metadata:
    'description': ''
    'id': '583572'
Stats:
    length: 200
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 53.00%
---------------------------------------------------------------------
0   TTCTTGTTTA ACTTAGTGGC GGACGGGTGA GTAACGCGTG AGCAACCTGC CTTACAGAGG
60  GGGACAACAG TTGGAAACGA CTGCTAATAC CGCATAATGT GGAGAGAGGA CATCCTTTTT
120 TCACCAAAGG AGCAATCCGC TGTAAGATGG GCTCGCGTCC GATTAGCCAG TTGGCGGGGT
180 AACGGCCCAC CAAAGCGACG

For a given query sequence, its taxonomy will be classified as follows. First, the set of all kmers will be extracted from the sequence. This is referred to as $V$. Then, for all taxa in the kmer probability table, the probability of observing the query sequence will be computed given that taxon: $P(query | taxon)$. This is computed as the product of all its kmer probabilities for the given taxon. (It should be clear based on this formula why it was necessary to add pseudocounts when computing our kmer probability table - if not, kmer probabilities of zero would result in a zero probability of the sequence being derived from that taxon at this step.)

After computing $P(query | taxon)$ for all taxa, the taxonomy assignment return is simply the one achieving the maximum probability. Here we'll classify a sequence and look at the resulting taxonomy assignment.

In [17]:
def classify_V(V, kmer_probability_table):
    P_S_t = [] # probability of the sequence given the taxon
    for taxon in kmer_probability_table:
        kmer_probabilities = kmer_probability_table[taxon]
        probability = 1.0
        for v_i in V:
            probability *= kmer_probabilities[v_i]
        P_S_t.append((probability, taxon))
    return max(P_S_t)[1], V

def classify_sequence(query_sequence, kmer_probability_table, k):
    V = list(map(str, query_sequence.iter_kmers(k=k)))
    return classify_V(V, kmer_probability_table)
In [18]:
taxon_assignment, V = classify_sequence(queries[0], kmer_probability_table, k)
print(taxon_assignment)
k__Bacteria; p__Firmicutes

Since we know the actual taxonomy assignment for this sequence, we can look that up in our reference database. Was your assignment correct? Try this with a few query sequences and keep track of how many times the classifier achieved the correct assignment.

In [19]:
get_taxon_at_level(reference_taxonomy[queries[0].metadata['id']], taxonomic_level)
Out[19]:
'k__Bacteria; p__Firmicutes'

Because the query and reference sequences that were working with were randomly selected from the full reference database, each time you run this notebook you should observe different results. Chances are however that if you run the above steps multiple times you'll get the wrong taxonomy assignment at least some of the time. Up to this point, we've left out an important piece of information: how confident should we be in our assignment, or in other workds, how dependent is our taxonomy assignment on our specific query? If there were slight differences in our query (e.g., because we observed a very closely related organism, such as one of the same species but a different strain, or because we sequenced a different region of the 16S sequence) would we obtain the same taxonomy assignment? If so, we should have higher confidence in our assignment. If not, we should have lower confidence in our assignment.

We can quantify confidence using an approach called bootstrapping. With a bootstrap approach, we'll get our taxonomy assignment as we did above, but then for some user-specified number of times, we'll create random subsets of V sampled with replacement (DEFINE THIS). We'll then assign taxonomy each random subset of V, and count the number of times the resulting taxonomy assignment is the same that we achieved when assigning taxonomy to V. The count divided by the number of iterations we've chosen to run will be our confidence value. If the assignments are often the same we'll have a high confidence value. If the assignments are often different, we'll have a low confidence value.

Let's now assign taxonomy and compute a confidence for that assignment.

In [20]:
def classify_sequence_with_confidence(sequence, kmer_probability_table, k,
                                      confidence_iterations=100):
    taxon, V = classify_sequence(sequence, kmer_probability_table, k)

    count_same_taxon = 0
    subsample_size = int(len(V) * 0.1)
    for i in range(confidence_iterations):
        subsample_V = np.random.choice(V, subsample_size, replace=True)
        subsample_taxon, _ = classify_V(subsample_V, kmer_probability_table)
        if taxon == subsample_taxon:
            count_same_taxon += 1
    confidence = count_same_taxon / confidence_iterations

    return (taxon, confidence)
In [21]:
taxon_assignment, confidence = classify_sequence_with_confidence(queries[0], kmer_probability_table, k)
print(taxon_assignment)
print(confidence)
k__Bacteria; p__Firmicutes
0.88

How did the computed confidence compare to the accuracy taxonomy assignment?

We don't have an a priori idea for what good versus bad confidence scores are, but we can use our reference database to explore that. We might want this information so we can come up with a confidence threshold, above which we would accept a taxonomy assignment and below which we might reject it. To explore this, let's compute taxonomy assignments and confidence for all of our query sequences and then see what the distributions of confidence scores look like for correct assignments and incorrect assignments.

In [22]:
correct_assignment_confidences = []
incorrect_assignment_confidences = []
summary = []

for query in queries:
    predicted_taxonomy, confidence = classify_sequence_with_confidence(query, kmer_probability_table, k)
    actual_taxonomy = get_taxon_at_level(reference_taxonomy[query.metadata['id']], taxonomic_level)
    if actual_taxonomy == predicted_taxonomy:
        correct_assignment_confidences.append(confidence)
    else:
        incorrect_assignment_confidences.append(confidence)

    summary.append([predicted_taxonomy, actual_taxonomy, confidence])
summary = pd.DataFrame(summary, columns=['Predicted taxonomy', 'Actual taxonomy', 'Confidence'])
In [23]:
import seaborn as sns

ax = sns.boxplot(data=[correct_assignment_confidences, incorrect_assignment_confidences])
ax = sns.swarmplot(data=[correct_assignment_confidences, incorrect_assignment_confidences], color="black")
_ = ax.set_xticklabels(['Correct assignments', 'Incorrect assignments'])
_ = ax.set_ylabel('Confidence')

What does this plot tell you about how well setting a confidence threshold is likely to work? If you never wanted to reject a correct assignment, how often would you accept an incorrect assignment? If you never wanted to accept an incorrect assignment, how often would you reject a correct assignment?

In [24]:
summary # maybe explore whether certain taxa are more frequently wrong than others...
Out[24]:
Predicted taxonomy Actual taxonomy Confidence
0 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.88
1 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.66
2 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.81
3 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.98
4 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.97
5 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.91
6 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.96
7 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.97
8 k__Bacteria; p__Actinobacteria k__Bacteria; p__Actinobacteria 0.62
9 k__Bacteria; p__Bacteroidetes k__Bacteria; p__Bacteroidetes 0.91
10 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.88
11 k__Bacteria; p__Bacteroidetes k__Bacteria; p__Proteobacteria 0.50
12 k__Bacteria; p__Proteobacteria k__Bacteria; p__Verrucomicrobia 0.37
13 k__Bacteria; p__Cyanobacteria k__Bacteria; p__Cyanobacteria 0.99
14 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.92
15 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.72
16 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 1.00
17 k__Bacteria; p__Proteobacteria k__Bacteria; p__Armatimonadetes 0.35
18 k__Bacteria; p__Bacteroidetes k__Bacteria; p__Bacteroidetes 0.97
19 k__Bacteria; p__Actinobacteria k__Bacteria; p__Actinobacteria 0.71
20 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.85
21 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 1.00
22 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.56
23 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.73
24 k__Bacteria; p__Proteobacteria k__Bacteria; p__Verrucomicrobia 0.36
25 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.72
26 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.71
27 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.92
28 k__Bacteria; p__Proteobacteria k__Bacteria; p__Proteobacteria 0.60
29 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.74
30 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 1.00
31 k__Bacteria; p__Bacteroidetes k__Bacteria; p__Bacteroidetes 0.71
32 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.94
33 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.67
34 k__Bacteria; p__Firmicutes k__Bacteria; p__Synergistetes 0.64
35 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.64
36 k__Bacteria; p__Bacteroidetes k__Bacteria; p__Bacteroidetes 0.94
37 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.86
38 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.84
39 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.76
40 k__Bacteria; p__Proteobacteria k__Bacteria; p__Elusimicrobia 0.38
41 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.80
42 k__Bacteria; p__Bacteroidetes k__Bacteria; p__Bacteroidetes 0.56
43 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.84
44 k__Bacteria; p__Bacteroidetes k__Bacteria; p__Bacteroidetes 0.78
45 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.77
46 k__Bacteria; p__Bacteroidetes k__Bacteria; p__Bacteroidetes 0.83
47 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 0.70
48 k__Bacteria; p__Proteobacteria k__Bacteria; p__Nitrospirae 0.65
49 k__Bacteria; p__Firmicutes k__Bacteria; p__Firmicutes 1.00

2.6.6 Random Forest classifiers [edit]

2.6.7 Neural networks and "deep learning" [edit]