4.2 Multiple sequence alignment exercises [edit]

4.2.1 Purpose [edit]

In this assignment you'll use multiple sequence alignment to reconstruct the phylogeny of a group of organisms based on their 16S rRNA sequences. This assignment builds on ideas from the previous assignment, in that in the last assignment you were identifying good primers to use for amplifying 16S from diverse organisms, and in this assignment we're using those sequences to group organisms by their relatedness. Because of the very large numbers of sequences that are commonly obtained in a modern DNA-sequencing-based experiment, grouping similar sequences and then working with representative sequences for each of those groups is common for computational efficiency. We'll be exploring these ideas in more detail through-out the next segments of the class.

From a bioinformatics standpoint, we usually start working with sequence in fasta format, very similar to the sequences in the cell below. See here for an explanation of the fasta format.

At this point, you should be feeling fairly comfortable interacting with the IPython Notebook. This assignment will give you additional practice while you explore the ideas mentioned above.

Continue to work with IPython Notebooks and interact with python code. Understand what multiple sequence alignment is used for, and the concept of grouping sequences into clusters of OTUs. Consider the possible drawbacks to these methods.

  • Read all of the cells containing text very carefully!
  • You may write code or use a text editor if you wish, however all of the tools necessary to answer the questions are present in this notebook.
  • Get help, that's what office hours are for!
  • You are allowed to discuss the assignment with other students, however your work needs to be your own. Using or looking at code or commands generated by another student is strictly prohibited. If you're in doubt over whether some type of interaction is acceptable for this assignment, ask.

4.2.4 Functions that you will need to complete the exercise. [edit]

Remember to learn about what a function does you can run:

help(name_of_function)

Try this with the functions below to see what they do.

In [1]:
%pylab inline
from __future__ import division

import skbio.io
from skbio import DNA

from iab.algorithms import progressive_msa_and_tree, iterative_msa_and_tree, kmer_distance, guide_tree_from_sequences
Populating the interactive namespace from numpy and matplotlib

The cell below contains the sequences that you will be working with throughout the assignment

In [2]:
from io import StringIO
seqs_16s = """>881726
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGATTCATCCTTCGGGATGGGTTAGCGGCGGACGGGTGAGTAACACGTAGGCAACCTGCCTGCAAGTCCGGGATAACTAACGGAAACGTTAGCTAATACCGGATACGCGGTTGGATCGCATGATCCGATCGGGAAAGACGGCGCAAGCTGCCACTTGTAGATGGGCCTGCGGCGCATTAGCTAGTTGGTGGGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGAGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGACGCAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTTCTCGGATCGTAAAGCTCTGTTGCCAGGGAAGAACGCTCGGGAGAGTAACTGCTCTCGAGGTGACGGTACCTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGATTAAGTTTGGTGTTTAAGCCCGGGGCTCAACCCCGGTTCGCACTGAAAACTGATCGACTTGAGTGTAGGAGAGGAAAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTATAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGCATGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTCAACACAGTAAGCATGCCGCCTGGGGAGTACGGTCGCAAGACTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCCCTGAATCCTCTAGAGATAGAGGCGGCCCTTCGGGGACAGGGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGATCGTAGTTGCCAGCACTTCGGGTGGGCACTCTAGGATGACTGCCGGTGACAAACCGGAGGAAGGCGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTACTACAATGGCCGGTACAACGGGCTGCGAAGCCGCGAGGTGGAGCCAATCCCAGAAAGCCGGTCTCAGTTCAGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGTCT
>793074
GAATGAACGCTGGCGGCGTGCTTAATAATGCAAGTCGAGCGCGTAGCAATACGAGCGGCGCACGGGTGCGTAACACGTAGGTCATCTGCCTCTAGGTCGGGGATAACTGCGGGAAACTGCAGCTAATACCCGATGATATCGAGAGATCAAAGCTTCGGTGCCTAGAGAGGAGCCTGCGGCTCATTAGCTAGTTGGTGGGGTAACGGCCTACCAAGGCCACGATGAGTAGCCGGCCTGAGAGGGCGATCGGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGGCAATGGGCGAAAGCCTGACCCAGCAACGCCGCGTGAGTGATGAAGCCTTTCGGGGTGTAAAGCTCTTTTGGCAGGGACGAATCAATGACGGTACCTGCGTAATAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGGGGGGGCAAGCGTTATTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGCTTCTTAAGTCGGGTGTTTAATGTCGGGGCTCAACTCCGGCGCTGCACTCGATACTGGGAGGCTAGAGTACTCGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTTAGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGAGAGTAACTGACGCTCAGAGCGCGAAAGCCAGGGGATCGAACGGGATTAGATACCCCGGTAGTCCTGGCTGTAAACGATGGGTACTAGATGTCGCCGGTATCAATCCCGGCGGTATCGTCGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGCTCGCAAGAGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGACTTGACATACCTCGGACCGGACCTAGAGATAGGACCTTCTCCCGTAAGGGAGCCGGGGATACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCCATCCCTAGTTGCCAGCGAGTCATGTCGGGAACTCTAGGGAGACTGCCGTTGATAAAACGAGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGTCCAGGGCTACACACGTGCTACAATGGCCACCACAAAGGGTCGCAATACCGTGAGGTGGAGCTAATCCCAAAAAGGTGGCCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGTCGGAATCGCTAGTAATCGCGGATCAGAACGCCGCGGTGAATACAGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAGAGCTGGTTGCGTTAGAAGTCGCCAGGCCAACCGCAAGGGGGCAGGCGCCGAATGCGTGATGAGTGATTGGGGT
>669210
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCGCGCCTAACACATGCAAGTCGAACGGACTAGCCCCTTCGGGGGCGAAGTTAGTGGCGAACGGGTGAGTAACGCGTAAGTAACCTGCCCCCGGGACTGGGATAACAGCTCGAAAGAGCCGCTAATACCGGATAATTGTTGCAACACTTAGGAGTTGTAACTAAAGAAGGCCTCTGTTTCAAGCTTTCACCTGGGGATGGGCTTGCGTCCCATTAGCTTGTTGGTGAGGTAACGGCTCACCAAGGAAACGATGGGTAGCCGGCCTGAGAGGGTGGTCGGTCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATCTTGCGCAATGGGCTAACGCCTGACGCAGCGACGCCGCGTGGACGATGAAGCTTTTCGGAGTGTAAAGTCCTTTCAGGAGGGAAGAAATGCCGGTAGTGTGAATAACACACCGGTTTGACGGTACCTCAAGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAACACGGAGGGGGCAAGCGTTGTTCGGAATCACTGGGCGTAAAGAGCGCGTAGGTGGTTGTGTAAGTCGGATGTGAAATCCCTCGGCTCAACCGAGGAACTGCGTTCGAAACTACATAGCTAGAGGGCAGGAGAGGAGAGCGGAATTCCCAGTGTAGCGGTGAAATGCGCAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCTCTCTGGACTGTTCCTGACACTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCGTAAACGATGGGCACTAGGTGTGGGGGGTGTCGATCCCCCCCGTGCCGCAGCTAACGCATTAAGTGCCCCGCCTGGGAAGTACGATCGCAAGGTTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCGGAATTTGACATGTTTCTGACGGCCTGCAGAAATGCAGGCTTCCCCTCGGGGCAGATACACAGGAGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGCCCTTAGTTGCCATCGGTTCGGCCGGGAACTCTAAGGGGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCAGTATGGCCTTTATGTTCCGGGCTACACACGTGCTACAATGGCTGGTACAAAGGGTCGCGATGCCGTGAGGTGGAGCCAATCCCAAAAAGCCAGTCTTAGTTCGGATTGGAGTCTGCAACTCGACTCTATGAAGCCGGAATCGCTAGTAATCGTGGATCAGCACGCCATGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAAAGTTGGTTGTACCAGAAGTCATTGGGCTAACCCTTTTGGGGGGCAGATGCCGAAGGTATGGTCAGCGATTGGGGTGAAGTCGTAACAAGGTAACC
>583705
ACGGGTGAGTAACGCGTATGCAACCTACCTCGGAAAAGGGGATGACTGGTGGAAACGGGGATTAATGCCCCCTAGGGTTGTTTCTCTGCCTGGGTGAGCCGTTACTATTGGAACCGATTGAGATGGCCATGTTGGTCATTTCCTGGTTGGTGAGGTTACCTCACACCAAGGCGACGATGACTACGGGGTCTAAAAGGATGGTCCCGCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGGGCAACCCTGAACCAGCCATGCCGCGTGAAGGAAGACGGCCCTATGGGTTGTAAACTTCTTTTATATGGGAATAAAGAGAGGTACGTGTACCTCAGTGAATGTACCATATGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAGGCGTTATCCGGATTTATTAGGTTTAAAGGGTGCGTAGGCGGGATACTAAGTCAGTGGTGAAAGTTTGCGGCTCAACCGTAAAATCGCCATTGATACTGGTATTCTTGAGTATACAGGAAGTAGGCGGAATGTGTAGTGTAGCGGTGAAATGCATAGATATTACACAGAACACCGATTGCGAAGGCAGCTTACTATAGTATAACTGACGCTGATGCACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGATTACTGGTTGTGCGCGATACACAGTGCGCGACTGAGCGAAAGCATTAAGTAATCCACCTGGGGAGTACGGCGGCAACGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCTGGGCTTAAATGTAGAGTGCATGGAGTGGAAACATTCCTTTCCTTCGGGACTCTTTACAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGGTTAAGTCCCATAACGAGCGCAACCCCTATCATTAGTTGCTAACAGGTCAAGCTGAGGACTCTAGCGAAACTGCCGGTGTAAACCGTGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTATGTCCAGGGCTACACACGTGTTACGATGGCCAGTACAAAGGGTAGCTACCTGGTGACAGGATGCTAATCTCAAAAGCTGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTTGGATTCGCTAGTAATCGTATATCAGCCATGATACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCAAACCATGGAAGCTGGGGGTACCTGAAGTACGTCACCGCAAGGAGCGTCCTAGGGTAAATCTAGTGACTGGGGTTAAGTGGTAACAAGGTAACC
>524860
AGAGTTTGATCCTGGCTCAGAACGAACGTTGGCGGCATGGATGAGGCATGCAAGTCGCGGGAATCCCCAGCAATGGGGGGAACCGGCGTAAGGGGCAGTAAGGCGTAGGTACCTACCCCCAGGTCCGGGATAGCCCGCCGAGAGGCGGGGTAATACCGGATGACCTCGGGAGAGCAAAGCTCCGGCGCCTGAGGCGGGGCCTACGTGATATTACCTAGTTGGCGGGGTAACGGCCCACCAAGGGGGAGATGTCTAGCGGGTGTGAGAGCACGACCCGCGCCACTCGCACTGAGACACTGGCGAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGCAACCCTGACCGAGCGATGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAAGCCGCAAGGCGGATCCATCCCTGGAGGAAGCTCGGGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGAGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGCTGCCGCGTCCGGGGTGAAATCCCACGGCTCAACCGTGGAACGGCCCCGGGTACGGGCGGCCTCGAGGGGGATAGGGGCGTGCGGAACTGTGGGTGGAGCGGTGAAATGCGTTGATATCCACAGGAACTCCGGTGGCGAAGGCGGCACGCTGGATCCTCTCTGACGCTGAGGCGCGGAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGATGAGAACTGGGTAGTAGCCCTGGCATGGGGTTACTGCCGCAGCCAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCTACGCGAAGAACCTTATCCTGGACTTGACATGTGCGAAAGCGCCAGCAGGTAGGACCCGGAAACGGGAACGAACGGTATCCAACCCGGAAGCTGGTACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCCTTGTTGCAACCCGAAAGGGGCACTCGAGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGTCCAGGGCTGCACACGTGCTACAATGGCGTGGACAGAGGGACGCGACTGCGCGAGCAGAAGCCGACCCCCGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACCCGCCTGCGTGAAGCCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGGGAGGGACGTCCGAAGTCGCCTCGCGGCGCCGAAGACGGACTTCCTGATTGGGACTAAGTCGTAACAAGGTAACC
>501793
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGGAGTGGTTGAAGGAGCTTGCTCTTTTGATCGCTTAGTGGCAGACGGGTGAGTAACACGTAGGCAACCTGGCTGTAAGACGGGGATAACTGGCGGAAACGTGAGCTAAAACCGGATGGTCGGCTTGAGGGCATCCTCGAGTCGGGAAAGGACGGAGCAATCTGTCGCTTACAGATGGGCCTGCGGCGCATTAGCTAGTTGGTAGGGTAACGGCCTACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGAACGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCGGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGTGATGAAGGTTTTCGGATCGTAAAGCTCTGTTGCCAGGGAAGAACGCCAGGGAGAGTAACTGCTCTCTGGGTGACGGTACCTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGTTTAAGTCTCATGTCTAAACCCCGGGGCTCAACCTCGGGGTGCATGGGAAACTGGGCGACTGGAGTGCATGTGAGGAAAGTGGAATTCCACGTGTAGCGGTGGAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAATGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTTAACACATTAAGCATTCCGCCTGGGGAGTACGGTCGCAAGACTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGTGGGGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCGATGAAACATGCAGAGATGTGTGCCCTCTTCGGAGCATTGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTAAGGTTAGTTGCCAGCAGGTGAAGCTGGGCACTCTAACATGACTGCCGGTGACAAACCGGAGGAAGGCGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTACTACAATGGCCAGTACAACGGGAAGCGAAGTGGCGACACGGAGCCAATCTTAGAAAGCTGGTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGTCT
>296752
AGAGTTTGATCTCTGGCTCAGAACAAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGGTAGGGGGCTTGCTCCCTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATACCTTTTGCTGGGGGATAGCTTGTGGAAACACAGGGTAATACCGCATACGATTGAGGCGGTTAGAGCGCTTCAATCAAAGCCTTGTATGGGGCGGCAGTTGAGTGGTCTGCGTACTATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCATGAGTGATGAAGGTCGAAAGATTGTAAAATTCTTTTTGAGAGTGATGAATAAAGTCGAGCAGTAATGCTCGGTGATGACGGTAACTTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGTGTAAAGGGCATGTAGGTGGTCTTGCAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCATAACTTAAGAATAACTGAGGCGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCGAGCAGATTATTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGACAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGAAATCTTACCTGGGTTTGACATTTAGTGGAATTGTATAGAGATATGCAAGGTACTTGTACCCGCTAAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTACTGCTAGTTACTAACAGGTTATGCTGAGGACTCTAGCGGAACTGCCGGTGACAAACCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGTAACAAAGTGATGCGAAATCGCAAGATGAAGCAAAACGCAGAAATGCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA
>293514
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGGTAGGGGGCTTGCTCCCTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATACCTTTTGCTGGGGGATAGCTTGTGGAAACACAGGGTAATACCGCATACGATTGAGGCGGTTAGAGCGCTTCAATCAAAGCCTTGTATGGGGCGGCAGTTGAGTGGTCTGCGTACTATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCGTGAGTGATGAAGGTCGAAAGATTGTAAAACTCTTTTGAGAGTGATGAATAAGTCGAGCAGTAATGCTCGGTGATGACGGTAACTTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGCGTAAAGGGCATGTAGGTGGTCTTGCAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCATAACTTAAGAATAACTGAGGCGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCGAGCAGATTATTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGTCAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAATCTTACCTGGGTTTGACATACACATTATCTTTGCAGAGATGTAAAGCGGGGGTAACCCCAATGTGAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGCCAGTTACTAACAAGTTAAGTTGAGGACTCTGGCGAAACTGCCGGTGACAAATCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGAGACAGAGTGATGCTAAGTCGCAAGATGGAGCAAAACGCAGAAATTCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA
>292553
AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGAATGAGGGGCTTGCTCCTTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATGTCGATGGCAGGGGGATAGCCAGTAGAAATATTGGGTAATACCGCGTATCCTTCTTGTTGTTAGAGGACAAGAAGAAAAGCCTTGTATGGGGCGGCTATTGAGTGGTCTGCGTACTATTAGTTTGTTGGTGGGGTAACGGCCTACCAAGACTATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCGTGAGTGATGAAGGTCGAAAGATTGTAAAACTCTTTTGAATATGATGAATAAGTCAAGCAGTAATGCTTGGCGATGACGGTAGTGTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGCGTAAAGGGCATGTAGGTGGTTTTGTAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCAGAACTAGAGTAACTGAGGTGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCAAGCAGATTACTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGTCAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGAAATCTTACCTGGGTTTGACATTTAGTGGAATTGTATAGAGATATGCAAGGTACTTGTACCCGCTAAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTACTGCTAGTTACTAACAGGTTATGCTGAGGACTCTAGCGGAACTGCCGGTGACAAACCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGTAACAAAGTGATGCGAAATCGCAAGATGAAGCAAAACGCAGAAATGCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACTAACGGCCCGTCA
>266495
AGTTTGATCCTGGCTCAAGATGAACGCTAGCGGCAGGCTTAACACATGCAAGTCAAAGGGCAACGGGGAGAGTGCTTGCACTCTCTGCCGGCGACTGGCGCACGGGTGAGTAACACTTATGCAGACACTGCCTTCCACAGGGCGGACAACCTCTCCCAAAGGGAGGCTAATCCCGCGTATATCCCTTGGGGGCATCCCCGGGGGAGGAAAGGATTACCGGTGTGCAGGATGGGCATGCGGCGCATTACGCAGTAGGCGGGGTAACGGCCCACCTAACCGACCATGCGTATGGGTTCTGAGAGGAAGGCCCCCCACACTGGTACTGAGACACTGACCAGACTCCTACTGGAGGCAGCAGTGAGGAACATTGGTCAATGGGCGGGAGCCTGAACCAGCAAACCCGCGTGAAGGAAGAAGGCGCCGAACGTCGTAAACTTCTTTTGTCCGGGATCAAAGGGCGCCACGTGTGGCGTTGTGAGTGTACCTGTAGAGAAAGCTTCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGTAGGGAGCGAGCGTTGTCCGGATTTATTGGGTGTAAAGGGCGCGTAGGTGGTCGGTTAAGTCAGGTGTGAAAGCTCGGGGCTCAACCCGGAGGATCCGCTGGAACTTTGGTGTCATGAGGCGCAGGAGAAGTAAAGTGGAATTCGTGGTGTAGCGGTGAAATGCATAGATATTGGGCGGAACTCCGGTGGCGAAGGCAGCGTTCTGGCGCGTGCCTGACGCTGAGGCGCGAAAGCGTGGGTATCGAACGGGATTAGATACCCCGGTAGTCCACGCAGTAAACGATGAATACTGGGTGTCGGACCCATAGAACGTTTGGGTGCGCGCAGCGAAAGCGATAAGCATTCCAAGTGGGGAGTACACCGGCAGTGATGAGACTCAAAGGAATCGACGGGGGTTCGCACAAGTGGAGGGATATGTGGTTTAATTAGACGATAAGTGAGGAACGTGACCCGGGTTCAACAGGGAGTCGACAGGGGCAGAGATTCCCTCTTCCACGGACGTCTTCCGAGGTGGGGCATGGTTGTCAGTCAGCTACGTGCCGTGAGGTGTCGGCTTAAGTGCCATAAGGTGTGCAACACGGGCAGACAGTTGCTAACGGGTAGAGCAGTGGAATGTGTAGTGATTGCAGGGGCAAGCCGCGAGGAAGGGGGGGATGATGTCAAATCAGCGCGGCCCTTAGGTCAGGGGTGACACACGTGCTGCAATGGCGGGGACAGAGGGATGTGAAGAGGCGACGTGGAGCGAACCCCAAAAACCCCGCCCCAGTTAGGATTGTAGTATGCAACCCGAATACATGAAGCCGGAATAGGTAGTAATCGCGGATCAGAATGCAGCGGTGAATAAGTTCCCGGCTCTAGCACACACCGCCCGTCA
>229854
GAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCATGCTTAACACATGCAAGTCGAACGGCAGCATGACTTAGCTTGCTAAGTTGATGGCGAGTGGCGAACGGGTGAGTAACGCGTAGGAATATGCCTTAAAGAGGGGGACAACTTGGGGAAACTCAAGCTAATACCGCATAAACTCTTCGGAGAAAAGCTGGGGACTTTCGAGCCTGGCGCTTTAAGATTAGCCTGCGTCCGATTAGCTAGTTGGTAGGGTAAAGGCCTACCAAGGCGACGATCAGTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTGAGGGTTGTAAAGCACTTTCAGTGGGGAGGAGGGTTTCCCGGTTAAGAGCTAGGGGCATTGGACGTTACCCACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCCGCGGTAATACGGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCCGTTAAAANGGTGCCTAAGGTGGTTTGGATNAGTTATGTGTTAAATTCCCTGGCGCCTCCACCCTGGNGCCAGGTCCATANTAAAAACTGTTAAACTCCGAAGTATGGGCACAAGGTAANTTGGAAANTTCCGGTGGTNANCCGNTGAAAATGCGCTTAGAGATNCGGGAAGGGACCACCCCAGTGGGGAAGGCGGCTACCTGGCCTAATAACTGACATTGAGGCACGAAAAGCGTGGGGAGCAACCAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGTCAACTAGCTGTNGGTTATATGAATATAATTAGTGGCGAAGCTAACGCGATAAGTTGACCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATNGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACCCTTGACATACAGTAAATCTTTCAGAGATGAGAGAGTGCCTTCGGGAATACTGATACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTATCTCTAGTTGCCAGCGAGTAATGTCGGGAACTCTAAAGAGACTGCCGGTGACAAACCGGAGGAAGGCGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTAGGGCTACACACGTGCTACAATGGCCGATACAGAGGGGCGCGAAGGAGCGATCTGGAGCAAATCTTATAAAGTCGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGCGAATCAGCATGTCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGCACCAGAAGTAGATAGTCTAACCGCAAGGGGGACGTTTACCACGGTGTGGTTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCG
>182569
AGAGTTTGATCCTGGCTCAGGATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGGTGTATCAATATATCTATGGCGACCAGCGCACCGGTGATGCACACCTCTCCTACCTGCCCCTTACTCCGGGATGATCTTTCTAAAAAAATATTACTACTCCATGGTATTACCGAAAAACGTCTTTTTGTTGTTTAAAAACTTCGATGGTGGAAGGTGATGCTTTCTATTATATACTTGGTGGGGTAACAGCCCACCACCTCAGCGATGAATAGGGGTTCTAATAAGAAGGTCCCCCCCATGGTAACTGGGCCCCGGTCCAAATTCTTCGGGAAGCCACCAGTGAGGATTATTGTTCAATGGCGGAGATTTTGACCCAGCCCAAGTAGCGTGAAGGATGACTGCTCCCATAGGTGGTAAACTTCTTTTATATGGGAATAAAGTGAGTCACGTGTGTCTTTTTGTATGTATCATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATTCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGTTTGTTAAGTCAGTGGTGAAAGTTTGGGGCTCAACCGTGAAATTGCATTTGATACTGGCGGTCTTGAGTGCAGTAGAGGTGGGCGGAATTTGTGGTGTAGCGGTGAAATGCTTAGATATCATGCAGAACTCCGATTGCGAAGGCAGCTCACCGGAGTGTATCTGACGTTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCCACACAGTAAAGAAGGAATATTGTCGTTGTGGGATCTCCATTAAGGGGTCAAGGGAAAGCATTAATTATTCCCCTGGGGGAGTAGTCCGCCAGAGGTGAAATTAAAAGAAATGGAGGGGGGCCGGCCCAAGGGAAGGACCATGTGGTTTAATTGGAGGATAGGGGAGGACCTTTCCCGGGGTTGAAAGTGCAAATGAATTATGGGGAGAGCCATTCCCTTCAAGGCATGAGAGAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGGTTAAGTCCCATAACGAGCGCAACCCTTATCTTCAGTTACTATCAGGTCAAGCTGAGCACTCTGGAGAGACTGCCGTTGTAAGATGAGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCTACACACGTGTTACAATGGGGGGTACAGAAGGCAGCTACCCAGCGACAGGATGCCAATCCCAAAAACCTATCTCAGTTCGGATTGAAGTCTGCAACCCGCCTTCGTGAAGTTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA
>1719550
TCCTGGCTCAGAACGAACGTTGGCGGCGTGGATTAGGCATGCAAGTCGCGCGAATCCCCGCAAGGGGGGAAGCGGCGTAAGGGGCAGTAAGGCGTGGGTACCTACCCGGGGGTCGGGGATAGCCCGTCGAGAGACGGGGTAATACCCGATGACGTGGAGACACCAAAGGTCCGCCGCCCTCGGCGGGGCCCACGTGATATTAGCTAGTTGGCGGGGTAACGGCCCACCAAGGCGGGGATGTCTAGCGGGTGTGAGAGCACGACCCGCGCCACTGGCACTGAGACACTGGCCAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGAAACCCTGACCGAGCGACGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAAGCCTTAACCGGGTGATCTATCCCTGGAGGAAGCACGGGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGTGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGTTGCCGCGTCCGGGGTGAAATCCCACGGCTCAACCGTGGGGCGGCCCCGGGTACGGGCAGCCTCGAGGAGAGTAGGGGCATGCGGAACTCTGGGTGGAGCGGTGAAATGCGTTGATATCCAGAGGAACTCCGGTGGCGAAGGCGGCATGCTGGACCCTTCCTGACGCTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGATGAGAACTAGGTAGCCGGCCGGACATGGGCTGGCTGCCGGAGCCAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGCCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCTACGCGAAGAACCTTATCCCGGGCTTGACATGTTCGAAAGAGGCTCGAAGTAGCCCGCGGAAACGTGGGGCCAACGGTATCCAGTCCGGAGCGAGCTACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCTCAGTTGCTTACTAGGACTCTGGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCCCGGGGCTGCACACGTGCTACAATGGCGTGGACAAAGAGACGCGAGCCCGCGAGGGGGAGCCAATCTCAGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACTCGCCTGCGTGAAGCCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGAGAGGGACGTCCGGAGTCGCCTTCACCGGTGCCGAAGACGGACTTCTTGATTGGGACTAAGTCGTAACAAGGTAACC
>1794723
TTAGAGTTTGATCCTGGCTCAGAACGAACGTTGGCGGCGTGGATTAGGCATGCAAGTCTCGCGAATCCCCGCAAGGGGGGAAGCGGCGTAAGGGGCAGTAAGGCGTGGGTAACCCACCCCGGGGCCCGGGATAGCCCGTCGAGAGACGGGGTAATACCGGGCGACGCAGCGTGCCGGCATCGGTGTGCTGCCAAAGGTCCGCCGCCCCGGGCGGGGCCCACGTGGTATTAGCTAGTTGGTGGGGTGACGGCCCACCAAGGCGGAGATGCCTAGCGGGTGTGAGAGCACGACCCGCGCCACTGGCACTGAGACACTGGCCAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGCAACCCTGACCGAGCGACGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAACGTCCCGCAAGGGGCCTGATCTATCCCTGGAGGAAGCACGAGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGTGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGCTGCCGAGTCCGGGGTGAAATCCTCCCGCTCAACGGGAGAACGGCCCCGGGTACTGGCGGCCTCGAGGCGGGTAGGGGCGTGCGGAACACTGGGTGGAGCGGTGAAATGCGTTGATATCCAGTGGAACTCCGGTGGCGAAGGCGGCACGCTGGACCCGTCTGACGCTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGTTGAGAACTAGGTAGTCGGCCGGACATGGGCTGACTGCCGGAGCGAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCAACGCGAAGAACCTTATCCCGGGCTTGACATGTGCGAAAGCGTCTGGGGGTACCCGCCGGAAACGGCCGGGGAAGGTATCCAGTCCTGAACCAGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCTCAGTTGCCAGCGGGTCACGCCGGGGACTCTGGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCCCGGGGCTGCACACGTGCTACAATGGCGTGGACAAAGGGGCGCGAACGCGCGAGCGGGAGCCGACCCCGGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACTCGCCTGCGTGAAGTCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGGGAGGGACGGCCGAAGTCGCGCCCCGCGCGCCGACGCCGGACTTCCCGATTGGGACTAAGTCGTAACAAGGTAACC
>1142181
CACGTGGGTCATTTGCCCCGAAGCCCGGGATAGCCCATGGAAACATGGATTAATACCGGATGTGGTTGGAGTACACAGGTGCTCCGTATTAAACGGTAGGTAGCAATACCTTCCGCTTCGGGATAAGCCCGCGGCCCATTAGCTAGTTGGTGGGGTAAGACCCAACCAAGGAGACAACCGGGAGCCGGACAGAAAGGGTGACGGCCACATTGGGACTGAGAAACGGCCCGATCCTACGGAGGCAGCAGTAAGAATCTTCCGCATGAACGAAGTCCGACCGAGCGACGCGCTGAGTGATGAAGGTGTTATGCATCGTAAAGCTCCTTCGGGGAGGAGAATAAGCATAGTCCAAAAGGCTATGTGATGACGACCCTCCCTAAAGAAGCCCCGGCTAATTACGTGCAGCAGCGCGGCAATACGTAAGGGGTAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGTGTGCAGGCGGGAGAGTAAGTTGGGGGTGAAATCTACGGGCCCAACCCGTAAACTGCCCTCAAAACTGCTTTTCTTGAGTGCAGGAGAGGAGACTGGAATTCCTAGTGTAGGAGTGAAATCTGTAGATATTAGGAAGAACACCGGTGGCGAAGGCGAGTCTCTGGCCTGACACTGACGCTGATACACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGTTGTGCACTAGATGTTGGGGGTGTCAATCCCCTCAGTGTCGCAGTTAACGCATTAAGTGCACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCAGGGCTTGACATACAGGTGCCGGGCTGTGAAAGCAGTCCTCTCTTCCGAGCGCCTGTACAGGTGTTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGTTAAGTCCCCCAACGAGCGCAACCCCTATTGTCTGTGCCATCATTAAGTTGGCACTCGAACGAAACTGCCGGTGATAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATATGGCCCTTAGGCCTGGCTACACGTGCTACAATGGACAGTACAAGAGTCGCAAGACCGAAAGGTGGACCATCCAAAAGCTGTCCTCAGTTCCGATTGAAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGCATCAGAATGGCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACCCGAGTTGGAAGTACTTGAAGTCGCTGATCTAACCTTCGGGAGGAAGGCGCCGATTGTACGTCTGATAAGGGGGGTGAAGTCGTAACAAGGTAACC
>2683209
CTGGCGGCGTGGTTTAGGCATGCAAGTCGAACGCGAAAGATTTACTTCGGTAAATTGAGTAGAGTGGCGAACGGGTGAGTAATACGTACGAATCTACCTTAAAGACAGGGATAGTCCCGGGAAACTGGGTTTAATACCTGATGGTATCCGGCTTTGCCGGATTAAAGACGGCCTCTATTTATAAGCTGTTACTTTTAGATGAGCGTGCGCTCCATTAGTTAGTTGGTAAGGTAAGAGCTTACCAAGGCGATGATGGATAGGCGTCCTTAACGGGTGGTCGCCCACACTGGGATTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTCGAGAATCGTCTACAATGAACGCAAGTTTGATAGTGCGACGCCGCGTGAATGAAGAAGCATTTCGGTGTGTAAAATTCTTTTATATAAGAACAGTGCATGTATGGTAAATAATTATACGTGAGAGATAGTACTATATGAATAAGCTCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGAGCAAGCGTTGTCCGGAATTACTAGGTGTAAAGGGTAAGTAGGCGGAAATTTAAGTCTCCGGTTAAATCTTCGGGCTCAACCCGAAATCTGCCTGAGATACTGGATTTCTAGAGTAAAGCAGATGAAGGCGGAATTCCTGGAGTAGCGGTGGAATGCGTAGATATCAGGAAGAACACCCATAGCGAAGGCAGCTTTCAATGCTATTACTGACGCTCAATTACGAAGGTGCGGGTATCGAACAGGATTAGATACCCTGGTAGTCCGCACAGTAAACGATATGTACTTGATATTGGATGTTGAAAATTCAGTGTCGTAGCTAACGCGTTAAGTACATCACCTGGGGACTAACGGCCGCAAGGTTAAAACTCAAAGGAATTGACGGGGGCCCACACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGAACTTGACATGCCGAGAATCCTGTAGAAATATGGGAGTGCCTTTTTTGGAGCTCGGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATCTTTAGTTGCTACCATTAAGTTGAGGACTCTAAAGAGACTGCCAGAGTACAAATCTGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGTCCTTATGTTCAGGGCTACACACGTGCTACAATGGTTGGAACAAAAGGCAGCGAAGGGGCGACCCGGAGCTAATCTCCAAACCCAATCTTAGTCCGGATTGCAGTCTGCAACTCGACTGCATGAAGTTGGAATCGCTAGTAATCGTGAGTCAGCATATCACGGTGAACATGTTCCTGGGCCTTGTACACACCGCCCGTCAAGTCAGCCGAATCGAGTGCACCCGAAGAAGGTGAGTTAATTAGACAGCTTTCGAAGGTGTGCTTGTAAGGGGGACTAAGTC
>2784824
AGTGGCGCACGGGTGAGTAACGCGTGGGTAACTTGCCTTTAAGTGAGGGATAACCCACTGAAAGGTGGACTAATACCTCATAAGACCACAGTGCTACGGCAGCGTGGTCAAAGGTGGCTTTATTAAAAGCTGCCGCTTGGAGAGAGACCCGCGTCCCATCAGCTTGTTGGTAAGGTAATGGCTTACCAAGGCCGAGACGGGTAGCTGGTCTGAGAGGATGGCCAGCCACACTGGAACTGAAACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATCTTGCGCAATGGGGGGAACCCTGACGCAGCAACGCCGCGTGAGTGAAGAAGGTCTTCGGGTCGTAAAGCCCTGTCGGGAGGGAAGAAACAGTTATGCATGAATAATGCATAACCTTGACGGTACCTCCNGAGGAAGCACCGGCCAACTCCGTGCCAGCAGCCGCGGTAAAACGGAGGGTGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGCGTGTAGGCGGATAGATAAGTCGAGTGTGAAAGCCCTCAGCTTAACTGAGGAAGTGCATTCGAAACTATCTTTCTTGGGTACGGAAGAGGGAAGTGGAATTCCCGGTGTAGGGGTGAAATCCGTAGATATCGGGAGGAATACCAGTGGCGAAGGCGACTTCCTGGACCGTCACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGGCACTAGGTGTATCTCGCTTAGCGGGATGTGCCGTAGCTAACGCATTAAGTGCCCCGCCTGGGGAGTACGGTCGCAAGACTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGTGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGTTTGACATGCCGAGAATCTGCCAGAAATGGTGGAGTGCCCCGTTAGGGGAACTCGGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCACCTTTAGTTGCCAGCATTAAGTTGGGCACTCTAAAGGGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGACGACGTCNAGTCCTCATGGCCTTTATACCCAGGGCTACACACGTGCTACAATGGCCAGTACAAAGGGCTGCAATCCCGCGAGGGGGAGCCAACCCCAAAAATCTGGTCTTAGTTCGGATTGGAGTCTGCAACTCGACTCCATAAAGGTGGAATCGCTAGTAATCGTGAATCAGCACGTCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAAAGTTGGCTGTACCAGAAGTTGCTGAGCTAACTCGCCTCGGCGGGAGGCAGGCACCTAAGGTGTGGTTGATGATTGGGGTGAAGT
>2941516
TTAGAGTTTGATCCTGGCTCAGGATGAACGCTAGCGATAGGCCTAACACATGCAAGTCGAGGGGTAACAGGGTAGCAATACCGCTGACGACCGGCAAATGGGTGAGTAACGCGTATGCAACCTACCGATAACAGTTGGATAGCTCCCTGAAAGGGGAATTAAACCGGCATGACACTATGAGATCGCCTGTTTTCATAGTTAAATATTTATAGGTTATTGATGGGCATGCGTGACATTAGCAAGTTGGTGAGGTAACGGCTCACCAATGCTACGATGTCTAGGGGTTCTGAGAGGAAGGTCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGACGGAAGTCTGAACCACCCACTTCGCGTGCAGGATGACTGCCCTATGGGTTGTAAACTGCTTTTATATAAGAGGAACAGTATTTATGTATAGATATTTGCCAGTATTATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGAGTTTAAAGGGTGAGTACGCGGTAGTATAAGTCAGCGGTGATAACTCGCAGCTCATCTGTAAGCTTGCCGTTGACACTGTATTACTTGACTTAACGTTGAGGTATGCTGAATGGGGGGGGGTTACCCGTTGAAATGCATTAATCAAAACAACAGACCACCCGATTTGCGGACGGCAGCAAAACTACACTGTCCACTGACGCTGATGCACAAAAGGCGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGATTACTGGTTGTTTGTGATACACTGCAAGTGACTGAGCGAAAGCACTAAGTAATCCACTTGGCGAGTACGTCGGCAACGATGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTCTAATTCGAGGCAACGCGAAGAACCTTACCCAGACTTGACATCTAGGAAAGGTCCTTGAAAGAGGATCGTGCCCGCAAGGGAATCCTAAGACAGGTGTTGCATGGCTGTCGTCAGCTCCTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGCTTACAGTTACCATCGGTTCGGCCGGGGACTCTGTAAGGACTGCCGCTGATAAAGCGAAGGAAGGCGGGGACGACGTCAAGCAATCACGGCCCTTACGTCTGGGGCTACACACGTGCTACAATGGCCGGTACAATGAGTCGCAAAACCGCGAGGTCAAGCTAATCTCAAAAAACCGGTCTCAGTTCGGATTGGAGTCTGCAACCCGACTTCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGCGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCA
>998428
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGAGTTGTTCCTTCGGGGACAGCTTAGCGGCGGACGGGTGAGTAACACGTAGGCAACCTGCCTGCAGGACCGGGATAACCCACGGAAACGTGAGCTAATACCGGATAGATGGTTCCCTCGCATGAGGGGATCAGGAAAGACGGGGCAACCTGTCACTTGTAGATGGGCCTGCGGCGCATTAGCTAGTTGGCGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGAACGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGAGGAAGGTCTTCGGATCGTAAAGCTCTGTTGCCAAGGAAGAACGCTTGGTGGAGTAACTGCCATCAAGGTGACGGTACTTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATGGGCGTAAGCGCGCGCAGCGGTTCTTTAAGTCTGAGGTTAAATGCAGGGCTCAACCTTGTAACGCCTTGGAAACTGGGGGACTGGAGTGTAGGAGAGGAAAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTATAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTTAACACAGTAAGCACTCCGCCTGGGGAGTACGCTCGCAAGAGTGAAACTCAAAGGAATTGACGGGGACCCGCACAGGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCGATGAAACGTCTAGAGATAGGCGCCCTCTTCGGAGCATTGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGAATTCAGTTGCCAGCACTTCGGGTGGGCACTCTGAATTGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCGTGCCCCTTATGACCTGGGCTACACACGTACTACAATGGTCGGTACAACGGGCAGCGAAGCCGCGAGGCGGAGCCAATCCTAGAAAAGCCGATCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGTCT
>4343117
AACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGCGTGCGCGGTTCACGAACTTGTACGTGGATGGGCGCACGGCGCAGGGGGGCGTAACACGTGGGCACTCTGCCCTCCGATGGGGAATACTCCCGCGAACCGGGGGCTAATACCGCATAACATTCCGAGGACTTGGGTTCTTGGATTCAAAGCAGTGATGCCTGTGAGGAGGAGCCCGCGCCCGATTAGCTAGTTGGTAGGGTAACGGCCTACCTCGGCAATGATCGGTAGCTGGTCTGAGAGGATAATCAGACACACTGCAACTGAAACGAGGCCCAGACTCCTACCGTAGGGAACGCTGGGGAATCTTGCCTTCTGGGCGAAAGCATGACCCAACGACGCCGCGTGGGGGATGAAGCTTTTGCTAGTGTAAACCCCTTTTCACTGGTAAGAATGCACGCAAGGGAGCGACAGTACCCTGGCAAGAAGCCCCGGCTAACTACGTGCCACCCGCCTCGGTAAGACCTAGGGGGCCAGCGTTGTTCGGAATTACTGGGTGTATAGGGTACTTATGCGGTGCGACAAGTTGGGAGTGAAATCTCTGGGCTTAACCCAGAGGCTGCTTCTCAAACTGCTATGCTTGATTGTGACAGAGGCTCTTGAAATTGCAGGAGTAGCGTTGAAATGCATGTATATCTGCAAGATCACCCGAGATATGGACGAACAGCTGGATCACAAGTGACGCTGAGGAACGAAAGCTACGCTGAGCGAACAGGATTATATACACTGGTAGTCCTAGCACTAAACGATCATGACTTGCGGTGACGACCGTTCGGACGTCTCCCGGAGCTAACGCGTTAAGTCCTGCACCTGGGGAGTACGGTCGCAGACTGGAAGTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAACATGTGGTTCAATTCGACGCTACGCGAGGAACCTTACCTGGTTCGAAATTCTTATGACCAGCTGTAGAATTACGGCTTTCCTTCAAGAGACATGAGTCTAGGCGCTCCATGGCTGTCGTCAGTTCGTTCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGCACGTAGTTACTACTCGCAAGAGAGGACTCTACGTGGACTGCTCCGGATAACGGAGAGGAAGGTGGGAATGACGTCAAGTCCGCATGGCCTTTATGTCCAGGGCTACACACGTGTTACAATGCAGGGTACAAACCGTTGCCAACCCGCGAGGGGGAGCTAATCGGATAAAACTGTGCTCAGTTCGGATTGCAGTCTGCAACTCGACTGCATGAAGCTGGAATCGCTAGTAATGGGGATCAGCTTGACGCCGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACATCACGAAAGTGAGCTCACCTAGAAGTCGCCACGCTAACCGCAAGGGGGCAGGCGCCCAAGGTATGACTCATGATTGGGGTG
>4353661
GGATGAACGCTAGCGGGAGGCTTAATACATGCAAGTCGAGGGTGAAGCTTTCTTCGGAAAGTGGAAACCGGCGAACGGGTGCGTAACGCGTACGCAACTTACCCCTTGCTGGAGAATAGCCCCGGGAAACTGGGATTAATGCTCCATGGTATGGTGAAATCGCATGATTTTATCATTAAAGGTTACGGCAAGGGATAGGCGTGCGTCCCATTAGCTTGTTGGTGAGGTAACGGCTCACCAATGCAAACGATGGGTAGCTGGTCTGAGAGGATGATCAGCCACACGGGCACTGAGACACGGGCCCGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGACGAAAGTCTGATCCAGCCATCCCGCGTGCAGGACGAATGCCCTATGGGTTGTAAACTGCTTTTCTAAGGAAAGAAATATCTCATTCATGAGGTGCTGACGGTACCTTAGGAATAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGCGGTATGATAAGTCAGTGGTGAAAGCCCGGGGCTCAACTCCGGAACTGCCGTTGATACTGTCATACTTGAGTCCAGTTGAGGTGGGCGGAATGATACATGTAGCGGTGAAATGCTTAGATATGTATCAGAACACCGACTGCGAAGGCAGCTCACTAAACTGGTACTGACGCTGAGGCACGAAAGCGTGGGTAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGCTAACTCGGTATGTGCGATATACTGTACGTGCCTGAGGGAAACCGTTAAGTTAGCCACCTGGGGAGTACGTTCGCAAGAATGAAACTCAAAGGAATTGACGGGGGTCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCTGGGCTCTAATGTACCACGCCCGACCCTGAAAGGGGTCTTCTTCTTCGGAAGCGGGGTACAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGCCAGCGGTCCGGCCGGGGACTCTAAGGAGACTGCCTTCGCAAGGAGTGAGGAAGGAGGGGACGACGTCAAATCATCATGGCCTTTATGCCCAGGGCTACACACGTGCTACAATGGTGAGGACAAAGGGCAGCCACTTAGCGATAAGGAGCAAATCCCAAAAACCTCACCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAACCGCTAGTAATCGCAGATCAGACATGCTGCGGTGAATACGTTCCCGGACCTTGTACACACCGCCCGTCAAGCCATGGAGCCGGGTGTACCTTAAGGCGATAACCGAAAGGAGTTGCCCAAGGTA"""
seqs_16s = StringIO(seqs_16s)

seqs_16s = [s[:50] for s in skbio.io.read(seqs_16s, constructor=DNA, format='fasta')]
seq_lookup = {seq.metadata['id'] : seq for seq in seqs_16s}


tax = """669210	k__Bacteria; p__; c__; o__; f__; g__; s__
881726	k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__
296752	k__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__
1794723	k__Bacteria; p__Planctomycetes; c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__
2941516	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Marinilabiaceae; g__; s__
793074	k__Bacteria; p__; c__; o__; f__; g__; s__
4353661	k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; o__[Saprospirales]; f__; g__; s__
292553	k__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__
2784824	k__Bacteria; p__Proteobacteria; c__Deltaproteobacteria; o__Syntrophobacterales; f__Syntrophaceae; g__; s__
1719550	k__Bacteria; p__Planctomycetes; c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__
182569	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
266495	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__S24-7; g__; s__
524860	k__Bacteria; p__Planctomycetes; c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__
293514	k__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__
2683209	k__Bacteria; p__WWE1; c__[Cloacamonae]; o__[Cloacamonales]; f__; g__; s__
501793	k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__
229854	k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Legionellales; f__Legionellaceae; g__Legionella; s__
583705	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__; g__; s__
1142181	k__Bacteria; p__Spirochaetes; c__GN05; o__SBYZ_6080; f__; g__; s__
998428	k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__
4343117	k__Bacteria; p__Acidobacteria; c__DA052; o__Ellin6513; f__; g__; s__"""

tax_lookup = dict([e.strip().split('\t') for e in tax.split('\n')])

4.2.5 Question 1 [edit]

What are the fraction of 3 base k-words unique to the following two sequences?

In [3]:
seq1 = DNA('AGCTAGCATCGATCGATCGATGCATGCAT')
seq2 = DNA('AGCTCGGCATCGAGGGCAGTCAATCGATCT')
In [4]:
help(kmer_distance)
Help on function kmer_distance in module iab.algorithms:

kmer_distance(sequence1, sequence2, k=3, overlap=True)
    Compute the kmer distance between a pair of sequences
    
    Parameters
    ----------
    sequence1 : skbio.Sequence
    sequence2 : skbio.Sequence
    k : int, optional
        The word length.
    overlapping : bool, optional
        Defines whether the k-words should be overlapping or not
        overlapping.
    
    Returns
    -------
    float
        Fraction of the set of k-mers from both sequence1 and
        sequence2 that are unique to either sequence1 or
        sequence2.
    
    Raises
    ------
    ValueError
        If k < 1.
    
    Notes
    -----
    k-mer counts are not incorporated in this distance metric.

In [5]:
# Compute the kmer distance in this cell

4.2.6 Question 2 [edit]

Display the guide tree for the sequences in the cell below.

In [6]:
query_seqs = [DNA("ACGATGACCAGTGCTACCAGT", metadata={'id': 's1'}),
              DNA("AACGATCGATCGATCGTGCTA", metadata={'id': 's2'}),
              DNA("AACGATCTGCTA", metadata={'id': 's3'}),
              DNA("CGATCGATGACATGCATG", metadata={'id': 's4'}),
              DNA("CGATCTGCAT", metadata={'id': 's5'})]
In [7]:
help(guide_tree_from_sequences)
Help on function guide_tree_from_sequences in module iab.algorithms:

guide_tree_from_sequences(sequences, metric=<function kmer_distance at 0x7f47f89b9598>, display_tree=False)
    Build a UPGMA tree by applying metric to sequences
    
    Parameters
    ----------
    sequences : list of skbio.Sequence objects (or subclasses)
      The sequences to be represented in the resulting guide tree.
    metric : function
      Function that returns a single distance value when given a pair of
      skbio.Sequence objects.
    display_tree : bool, optional
      Print the tree before returning.
    
    Returns
    -------
    skbio.TreeNode

In [8]:
# Display the guide tree in this cell.

4.2.7 Question 3 [edit]

What are the differences in the guide tree from Question 2, the tree that is generated after 1 iterations of iterative multiple sequence alignment, and the tree that is generated after 5 iterations of iterative multiple sequence alignment? Display the trees for both 1 and 5 iterations of iterative multiple sequence alignment.

In [9]:
help(iterative_msa_and_tree)
Help on function iterative_msa_and_tree in module iab.algorithms:

iterative_msa_and_tree(sequences, num_iterations, pairwise_aligner, metric=<function kmer_distance at 0x7f47f89b9598>, display_aln=False, display_tree=False)
    Perform progressive msa of sequences and build a UPGMA tree
    Parameters
    ----------
    sequences : skbio.SequenceCollection
       The sequences to be aligned.
    num_iterations : int
       The number of iterations of progressive multiple sequence alignment to
       perform. Must be greater than zero and less than five.
    pairwise_aligner : function
       Function that should be used to perform the pairwise alignments,
       for example skbio.alignment.global_pairwise_align_nucleotide. Must
       support skbio.Sequence objects or skbio.TabularMSA objects
       as input.
    metric : function, optional
      Function that returns a single distance value when given a pair of
      skbio.Sequence objects. This will be used to build a guide tree if one
      is not provided.
    display_aln : bool, optional
       Print the alignment before returning.
    display_tree : bool, optional
       Print the tree before returning.
    
    Returns
    -------
    skbio.alignment
    skbio.TreeNode

In [10]:
from skbio.alignment import global_pairwise_align_nucleotide
# add your command for 1 iterations of iterative multiple sequence alignment here
# hint: pass pairwise_aligner=global_pairwise_align_nucleotide
In [11]:
# add your command for 5 iterations of iterative multiple sequence alignment here
# hint: pass pairwise_aligner=global_pairwise_align_nucleotide

4.2.8 Question 4 [edit]

Generate and display a tree based on progressive alignment of the sequences from the second cell (the ones in the seqs_16s variable). This step can take about 10 minutes to complete.

In [12]:
help(progressive_msa_and_tree)
Help on function progressive_msa_and_tree in module iab.algorithms:

progressive_msa_and_tree(sequences, pairwise_aligner, metric=<function kmer_distance at 0x7f47f89b9598>, guide_tree=None, display_aln=False, display_tree=False)
    Perform progressive msa of sequences and build a UPGMA tree
    Parameters
    ----------
    sequences : skbio.SequenceCollection
        The sequences to be aligned.
    pairwise_aligner : function
        Function that should be used to perform the pairwise alignments,
        for example skbio.alignment.global_pairwise_align_nucleotide. Must
        support skbio.Sequence objects or skbio.TabularMSA objects
        as input.
    metric : function, optional
      Function that returns a single distance value when given a pair of
      skbio.Sequence objects. This will be used to build a guide tree if one
      is not provided.
    guide_tree : skbio.TreeNode, optional
        The tree that should be used to guide the alignment process.
    display_aln : bool, optional
        Print the alignment before returning.
    display_tree : bool, optional
        Print the tree before returning.
    
    Returns
    -------
    skbio.alignment
    skbio.TreeNode

In [13]:
# Add your command for progressive alignment and tree building here
# hint: pass pairwise_aligner=global_pairwise_align_nucleotide

4.2.9 Question 5 [edit]

Using the tree representing the sequences from question four as a guide, define clusters (i.e., groups) of sequences at 90% and 70% identity. There is not a single right answer for this, or a single method for grouping sequences. Go about this systematically, and describe the process that you're going through in a couple of paragraphs. These groups are usually referred to as operational taxonomic units, or OTUs, because they represent a hypothesis about the taxonomic relatedness of a group of sequences (which is a proxy for a hypothesis about the relatedness of the group of organisms containing those sequences in their genomes).

If you want to obtain a given sequence, you can now do so by looking up its identifier with seqs_16s.get_seq:

In [14]:
seq_lookup['4343117']
Out[14]:
DNA
--------------------------------------------------------
Metadata:
    'description': ''
    'id': '4343117'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 60.00%
--------------------------------------------------------
0 AACGAACGCT GGCGGCGTGC TTAACACATG CAAGTCGCGT GCGCGGTTCA
In [15]:
seq_lookup['4353661']
Out[15]:
DNA
--------------------------------------------------------
Metadata:
    'description': ''
    'id': '4353661'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 52.00%
--------------------------------------------------------
0 GGATGAACGC TAGCGGGAGG CTTAATACAT GCAAGTCGAG GGTGAAGCTT

To compute the pairwise identity for two sequences, use pairwise_percent_id as follows. IMPORTANT: While this calculation is fast for a pair of sequences (i.e., a pairwise calculation), running it for all pairs of sequences (e.g., in a nested for-loop) could take a couple of hours, depending on the hardware where it's being run. You only need to do pairwise comparisons to answer this question.

In [16]:
from skbio.alignment import global_pairwise_align_nucleotide

def pairwise_percent_id(seq1_id, seq2_id, seq_lookup):
    seq1 = seq_lookup[seq1_id]
    seq2 = seq_lookup[seq2_id]
    aln, _, _ = global_pairwise_align_nucleotide(seq1, seq2)
    return 1. - aln[0].distance(aln[1])
In [17]:
print(pairwise_percent_id('793074', '4353661', seq_lookup))
/opt/conda/envs/iab/lib/python3.5/site-packages/skbio/alignment/_pairwise.py:599: EfficiencyWarning: You're using skbio's python implementation of Needleman-Wunsch alignment. This is known to be very slow (e.g., thousands of times slower than a native C implementation). We'll be adding a faster version soon (see https://github.com/biocore/scikit-bio/issues/254 to track progress on this).
  "to track progress on this).", EfficiencyWarning)
0.7647058823529411
In [18]:
# Compute additional pairwise identities, as necessary, to answer this question here. Show all of your commands!

Discuss your results here.

4.2.10 Question 6 [edit]

Choose one representative sequence from each of the clusters you defined in question 5. Look these up in tax_lookup by their ids to get the taxonomy of each sequence, and include those in the results below. When you see a key that ends with __, that means that there is no known taxonomic assignment for that sequence at that level.

In [19]:
print(tax_lookup['4343117'])
k__Bacteria; p__Acidobacteria; c__DA052; o__Ellin6513; f__; g__; s__
In [20]:
print(tax_lookup['4353661'])
k__Bacteria; p__Bacteroidetes; c__[Saprospirae]; o__[Saprospirales]; f__; g__; s__
In [21]:
# Perform addition taxonomy look-ups here

Discuss your results here.

4.2.11 Question 7 [edit]

Is the taxonomy of the representative sequences consistent with phylogenetic tree you generated in question 4? For your 90% and 70% OTUs, list three taxa (e.g., at the phylum, class, or species level) that are monophyletic, if any, and three taxa that are not monophyletic, if any. Discuss two specific reasons why some taxa might appear to not be monophyletic based on your tree.

Discuss your results here.