{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# [4.2]() Multiple sequence alignment exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Table of Contents**\n", "0. [Purpose](#1)\n", "0. [Goals](#2)\n", "0. [Hints](#3)\n", "0. [Functions that you will need to complete the exercise.](#4)\n", "0. [Question 1](#5)\n", "0. [Question 2](#6)\n", "0. [Question 3](#7)\n", "0. [Question 4](#8)\n", "0. [Question 5](#9)\n", "0. [Question 6](#10)\n", "0. [Question 7](#11)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## [4.2.1](#1) Purpose" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In this assignment you'll use multiple sequence alignment to reconstruct the phylogeny of a group of organisms based on their 16S rRNA sequences. This assignment builds on the previous one: there you identified good primers for amplifying 16S from diverse organisms, and here we use those sequences to group organisms by their relatedness. Because modern DNA-sequencing-based experiments commonly yield very large numbers of sequences, it is common to group similar sequences and then work with a representative sequence for each group, which improves computational efficiency. We'll explore these ideas in more detail throughout the next segments of the class." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From a bioinformatics standpoint, we usually start with sequences in FASTA format, very similar to the sequences defined in a code cell below. See [here](http://www.bioinformatics.nl/tools/crab_fasta.html) for an explanation of the FASTA format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, you should be feeling fairly comfortable interacting with the IPython Notebook. 
This assignment will give you additional practice while you explore the ideas mentioned above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [4.2.2](#2) Goals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Continue to work with IPython Notebooks and interact with Python code. Understand what multiple sequence alignment is used for, and the concept of grouping sequences into operational taxonomic units (OTUs). Consider the possible drawbacks of these methods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [4.2.3](#3) Hints" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "* Read all of the cells containing text very carefully!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* You may write code or use a text editor if you wish; however, all of the tools necessary to answer the questions are present in this notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Get help: that's what office hours are for!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* You are allowed to discuss the assignment with other students; however, your work needs to be your own. Using or looking at code or commands generated by another student is strictly prohibited. If you're in doubt about whether some type of interaction is acceptable for this assignment, ask." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [4.2.4](#4) Functions that you will need to complete the exercise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Remember, to learn what a function does you can run:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`help(name_of_function)`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try this with the functions below to see what they do." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%pylab inline\n", "from __future__ import division\n", "\n", "import skbio.io\n", "from skbio import DNA\n", "\n", "from iab.algorithms import progressive_msa_and_tree, iterative_msa_and_tree, kmer_distance, guide_tree_from_sequences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below contains the sequences that you will be working with throughout the assignment" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from io import StringIO\n", "seqs_16s = \"\"\">881726\n", "GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGATTCATCCTTCGGGATGGGTTAGCGGCGGACGGGTGAGTAACACGTAGGCAACCTGCCTGCAAGTCCGGGATAACTAACGGAAACGTTAGCTAATACCGGATACGCGGTTGGATCGCATGATCCGATCGGGAAAGACGGCGCAAGCTGCCACTTGTAGATGGGCCTGCGGCGCATTAGCTAGTTGGTGGGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGAGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGACGCAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTTCTCGGATCGTAAAGCTCTGTTGCCAGGGAAGAACGCTCGGGAGAGTAACTGCTCTCGAGGTGACGGTACCTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGATTAAGTTTGGTGTTTAAGCCCGGGGCTCAACCCCGGTTCGCACTGAAAACTGATCGACTTGAGTGTAGGAGAGGAAAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTATAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGCATGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTCAACACAGTAAGCATGCCGCCTGGGGAGTACGGTCGCAAGACTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCCCTGAATCCTCTAGAGATAGAGGCGGCCCTTCGGGGACAGGGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGATCGTAGTTGCCAGCACTTCGGGTGGGCACTCTAGGATGACTGCCGGTGACAAACCGGAGGAAGGCGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTACTACAATGGCCGGTACAACGGGCTGCGAAGCCGCGAGGTGGAGCCAATCCCAGAAAGCCGGTCTCAGTTCAGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGG
TCT\n", ">793074\n", "GAATGAACGCTGGCGGCGTGCTTAATAATGCAAGTCGAGCGCGTAGCAATACGAGCGGCGCACGGGTGCGTAACACGTAGGTCATCTGCCTCTAGGTCGGGGATAACTGCGGGAAACTGCAGCTAATACCCGATGATATCGAGAGATCAAAGCTTCGGTGCCTAGAGAGGAGCCTGCGGCTCATTAGCTAGTTGGTGGGGTAACGGCCTACCAAGGCCACGATGAGTAGCCGGCCTGAGAGGGCGATCGGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGGCAATGGGCGAAAGCCTGACCCAGCAACGCCGCGTGAGTGATGAAGCCTTTCGGGGTGTAAAGCTCTTTTGGCAGGGACGAATCAATGACGGTACCTGCGTAATAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGGGGGGGCAAGCGTTATTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGCTTCTTAAGTCGGGTGTTTAATGTCGGGGCTCAACTCCGGCGCTGCACTCGATACTGGGAGGCTAGAGTACTCGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTTAGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGAGAGTAACTGACGCTCAGAGCGCGAAAGCCAGGGGATCGAACGGGATTAGATACCCCGGTAGTCCTGGCTGTAAACGATGGGTACTAGATGTCGCCGGTATCAATCCCGGCGGTATCGTCGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGCTCGCAAGAGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGACTTGACATACCTCGGACCGGACCTAGAGATAGGACCTTCTCCCGTAAGGGAGCCGGGGATACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCCATCCCTAGTTGCCAGCGAGTCATGTCGGGAACTCTAGGGAGACTGCCGTTGATAAAACGAGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGTCCAGGGCTACACACGTGCTACAATGGCCACCACAAAGGGTCGCAATACCGTGAGGTGGAGCTAATCCCAAAAAGGTGGCCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGTCGGAATCGCTAGTAATCGCGGATCAGAACGCCGCGGTGAATACAGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAGAGCTGGTTGCGTTAGAAGTCGCCAGGCCAACCGCAAGGGGGCAGGCGCCGAATGCGTGATGAGTGATTGGGGT\n", ">669210\n", 
"AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCGCGCCTAACACATGCAAGTCGAACGGACTAGCCCCTTCGGGGGCGAAGTTAGTGGCGAACGGGTGAGTAACGCGTAAGTAACCTGCCCCCGGGACTGGGATAACAGCTCGAAAGAGCCGCTAATACCGGATAATTGTTGCAACACTTAGGAGTTGTAACTAAAGAAGGCCTCTGTTTCAAGCTTTCACCTGGGGATGGGCTTGCGTCCCATTAGCTTGTTGGTGAGGTAACGGCTCACCAAGGAAACGATGGGTAGCCGGCCTGAGAGGGTGGTCGGTCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATCTTGCGCAATGGGCTAACGCCTGACGCAGCGACGCCGCGTGGACGATGAAGCTTTTCGGAGTGTAAAGTCCTTTCAGGAGGGAAGAAATGCCGGTAGTGTGAATAACACACCGGTTTGACGGTACCTCAAGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAACACGGAGGGGGCAAGCGTTGTTCGGAATCACTGGGCGTAAAGAGCGCGTAGGTGGTTGTGTAAGTCGGATGTGAAATCCCTCGGCTCAACCGAGGAACTGCGTTCGAAACTACATAGCTAGAGGGCAGGAGAGGAGAGCGGAATTCCCAGTGTAGCGGTGAAATGCGCAGATATTGGGAAGAACACCGGTGGCGAAGGCGGCTCTCTGGACTGTTCCTGACACTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCGTAAACGATGGGCACTAGGTGTGGGGGGTGTCGATCCCCCCCGTGCCGCAGCTAACGCATTAAGTGCCCCGCCTGGGAAGTACGATCGCAAGGTTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCGGAATTTGACATGTTTCTGACGGCCTGCAGAAATGCAGGCTTCCCCTCGGGGCAGATACACAGGAGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGCCCTTAGTTGCCATCGGTTCGGCCGGGAACTCTAAGGGGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCAGTATGGCCTTTATGTTCCGGGCTACACACGTGCTACAATGGCTGGTACAAAGGGTCGCGATGCCGTGAGGTGGAGCCAATCCCAAAAAGCCAGTCTTAGTTCGGATTGGAGTCTGCAACTCGACTCTATGAAGCCGGAATCGCTAGTAATCGTGGATCAGCACGCCATGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAAAGTTGGTTGTACCAGAAGTCATTGGGCTAACCCTTTTGGGGGGCAGATGCCGAAGGTATGGTCAGCGATTGGGGTGAAGTCGTAACAAGGTAACC\n", ">583705\n", 
"ACGGGTGAGTAACGCGTATGCAACCTACCTCGGAAAAGGGGATGACTGGTGGAAACGGGGATTAATGCCCCCTAGGGTTGTTTCTCTGCCTGGGTGAGCCGTTACTATTGGAACCGATTGAGATGGCCATGTTGGTCATTTCCTGGTTGGTGAGGTTACCTCACACCAAGGCGACGATGACTACGGGGTCTAAAAGGATGGTCCCGCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGGGCAACCCTGAACCAGCCATGCCGCGTGAAGGAAGACGGCCCTATGGGTTGTAAACTTCTTTTATATGGGAATAAAGAGAGGTACGTGTACCTCAGTGAATGTACCATATGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAGGCGTTATCCGGATTTATTAGGTTTAAAGGGTGCGTAGGCGGGATACTAAGTCAGTGGTGAAAGTTTGCGGCTCAACCGTAAAATCGCCATTGATACTGGTATTCTTGAGTATACAGGAAGTAGGCGGAATGTGTAGTGTAGCGGTGAAATGCATAGATATTACACAGAACACCGATTGCGAAGGCAGCTTACTATAGTATAACTGACGCTGATGCACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGATTACTGGTTGTGCGCGATACACAGTGCGCGACTGAGCGAAAGCATTAAGTAATCCACCTGGGGAGTACGGCGGCAACGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCTGGGCTTAAATGTAGAGTGCATGGAGTGGAAACATTCCTTTCCTTCGGGACTCTTTACAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGGTTAAGTCCCATAACGAGCGCAACCCCTATCATTAGTTGCTAACAGGTCAAGCTGAGGACTCTAGCGAAACTGCCGGTGTAAACCGTGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTATGTCCAGGGCTACACACGTGTTACGATGGCCAGTACAAAGGGTAGCTACCTGGTGACAGGATGCTAATCTCAAAAGCTGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTTGGATTCGCTAGTAATCGTATATCAGCCATGATACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCAAACCATGGAAGCTGGGGGTACCTGAAGTACGTCACCGCAAGGAGCGTCCTAGGGTAAATCTAGTGACTGGGGTTAAGTGGTAACAAGGTAACC\n", ">524860\n", 
"AGAGTTTGATCCTGGCTCAGAACGAACGTTGGCGGCATGGATGAGGCATGCAAGTCGCGGGAATCCCCAGCAATGGGGGGAACCGGCGTAAGGGGCAGTAAGGCGTAGGTACCTACCCCCAGGTCCGGGATAGCCCGCCGAGAGGCGGGGTAATACCGGATGACCTCGGGAGAGCAAAGCTCCGGCGCCTGAGGCGGGGCCTACGTGATATTACCTAGTTGGCGGGGTAACGGCCCACCAAGGGGGAGATGTCTAGCGGGTGTGAGAGCACGACCCGCGCCACTCGCACTGAGACACTGGCGAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGCAACCCTGACCGAGCGATGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAAGCCGCAAGGCGGATCCATCCCTGGAGGAAGCTCGGGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGAGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGCTGCCGCGTCCGGGGTGAAATCCCACGGCTCAACCGTGGAACGGCCCCGGGTACGGGCGGCCTCGAGGGGGATAGGGGCGTGCGGAACTGTGGGTGGAGCGGTGAAATGCGTTGATATCCACAGGAACTCCGGTGGCGAAGGCGGCACGCTGGATCCTCTCTGACGCTGAGGCGCGGAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGATGAGAACTGGGTAGTAGCCCTGGCATGGGGTTACTGCCGCAGCCAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCTACGCGAAGAACCTTATCCTGGACTTGACATGTGCGAAAGCGCCAGCAGGTAGGACCCGGAAACGGGAACGAACGGTATCCAACCCGGAAGCTGGTACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCCTTGTTGCAACCCGAAAGGGGCACTCGAGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGTCCAGGGCTGCACACGTGCTACAATGGCGTGGACAGAGGGACGCGACTGCGCGAGCAGAAGCCGACCCCCGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACCCGCCTGCGTGAAGCCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGGGAGGGACGTCCGAAGTCGCCTCGCGGCGCCGAAGACGGACTTCCTGATTGGGACTAAGTCGTAACAAGGTAACC\n", ">501793\n", 
"GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGGAGTGGTTGAAGGAGCTTGCTCTTTTGATCGCTTAGTGGCAGACGGGTGAGTAACACGTAGGCAACCTGGCTGTAAGACGGGGATAACTGGCGGAAACGTGAGCTAAAACCGGATGGTCGGCTTGAGGGCATCCTCGAGTCGGGAAAGGACGGAGCAATCTGTCGCTTACAGATGGGCCTGCGGCGCATTAGCTAGTTGGTAGGGTAACGGCCTACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGAACGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCGGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGTGATGAAGGTTTTCGGATCGTAAAGCTCTGTTGCCAGGGAAGAACGCCAGGGAGAGTAACTGCTCTCTGGGTGACGGTACCTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGCGGTCGTTTAAGTCTCATGTCTAAACCCCGGGGCTCAACCTCGGGGTGCATGGGAAACTGGGCGACTGGAGTGCATGTGAGGAAAGTGGAATTCCACGTGTAGCGGTGGAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAATGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTTAACACATTAAGCATTCCGCCTGGGGAGTACGGTCGCAAGACTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGTGGGGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCGATGAAACATGCAGAGATGTGTGCCCTCTTCGGAGCATTGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTAAGGTTAGTTGCCAGCAGGTGAAGCTGGGCACTCTAACATGACTGCCGGTGACAAACCGGAGGAAGGCGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTACTACAATGGCCAGTACAACGGGAAGCGAAGTGGCGACACGGAGCCAATCTTAGAAAGCTGGTCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGTCT\n", ">296752\n", 
"AGAGTTTGATCTCTGGCTCAGAACAAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGGTAGGGGGCTTGCTCCCTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATACCTTTTGCTGGGGGATAGCTTGTGGAAACACAGGGTAATACCGCATACGATTGAGGCGGTTAGAGCGCTTCAATCAAAGCCTTGTATGGGGCGGCAGTTGAGTGGTCTGCGTACTATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCATGAGTGATGAAGGTCGAAAGATTGTAAAATTCTTTTTGAGAGTGATGAATAAAGTCGAGCAGTAATGCTCGGTGATGACGGTAACTTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGTGTAAAGGGCATGTAGGTGGTCTTGCAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCATAACTTAAGAATAACTGAGGCGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCGAGCAGATTATTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGACAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGAAATCTTACCTGGGTTTGACATTTAGTGGAATTGTATAGAGATATGCAAGGTACTTGTACCCGCTAAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTACTGCTAGTTACTAACAGGTTATGCTGAGGACTCTAGCGGAACTGCCGGTGACAAACCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGTAACAAAGTGATGCGAAATCGCAAGATGAAGCAAAACGCAGAAATGCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA\n", ">293514\n", 
"AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGGTAGGGGGCTTGCTCCCTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATACCTTTTGCTGGGGGATAGCTTGTGGAAACACAGGGTAATACCGCATACGATTGAGGCGGTTAGAGCGCTTCAATCAAAGCCTTGTATGGGGCGGCAGTTGAGTGGTCTGCGTACTATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCGTGAGTGATGAAGGTCGAAAGATTGTAAAACTCTTTTGAGAGTGATGAATAAGTCGAGCAGTAATGCTCGGTGATGACGGTAACTTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGCGTAAAGGGCATGTAGGTGGTCTTGCAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCATAACTTAAGAATAACTGAGGCGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCGAGCAGATTATTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGTCAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAATCTTACCTGGGTTTGACATACACATTATCTTTGCAGAGATGTAAAGCGGGGGTAACCCCAATGTGAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGCCAGTTACTAACAAGTTAAGTTGAGGACTCTGGCGAAACTGCCGGTGACAAATCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGAGACAGAGTGATGCTAAGTCGCAAGATGGAGCAAAACGCAGAAATTCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA\n", ">292553\n", 
"AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGTGCGTCTTAAGCATGCAAGTCGAGCGATGAATGAGGGGCTTGCTCCTTATTCATAGCGGCGGACTGGTGAGTAACGCGTAGATGACATGTCGATGGCAGGGGGATAGCCAGTAGAAATATTGGGTAATACCGCGTATCCTTCTTGTTGTTAGAGGACAAGAAGAAAAGCCTTGTATGGGGCGGCTATTGAGTGGTCTGCGTACTATTAGTTTGTTGGTGGGGTAACGGCCTACCAAGACTATGATAGTTATCCGGCCTGAGAGGGTGAACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGCTAAGAATATTCCGCAATGGGCGAAAGCCTGACGGAGCGACACCGCGTGAGTGATGAAGGTCGAAAGATTGTAAAACTCTTTTGAATATGATGAATAAGTCAAGCAGTAATGCTTGGCGATGACGGTAGTGTTTGAATAAGGGGTGGCTAATTACGTGCCAGCAGCCGCGGTAACACGTAAGCCCCAAGCGTTGTTCGGAATTATTGGGCGTAAAGGGCATGTAGGTGGTTTTGTAAGCTTGATGTGAAATCTTACAGCTTAACTGTAAAACTGCATTGAGAACTGCAGAACTAGAGTAACTGAGGTGCAACTGGAATTCCAGGTGTAGGGGTGAAATCTGTAGATATCTGGAAGAACACCAATGGCGAAGGCAAGTTGCAAGCAGATTACTGACACTGAGGTGCGAAGGTGCGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCGCACAGTCAACGATGTACACTGGGCGTCTGGCTTTATGCTGGGTGCCGTAGTAAACGCGATAAGTGTACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGAAATCTTACCTGGGTTTGACATTTAGTGGAATTGTATAGAGATATGCAAGGTACTTGTACCCGCTAAACAGGTGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTACTGCTAGTTACTAACAGGTTATGCTGAGGACTCTAGCGGAACTGCCGGTGACAAACCGGAGGAAGATGGGGATGACGTCAAGTCATCATGGCCCTTATGTCCAGGGCAACACACGTGCTACAATGGTTGTAACAAAGTGATGCGAAATCGCAAGATGAAGCAAAACGCAGAAATGCAATCGTAGTTCGGATTGGAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCATATCAGCACGATGCGGTGAATACGTTCCCGGGCCTTGCACTAACGGCCCGTCA\n", ">266495\n", 
"AGTTTGATCCTGGCTCAAGATGAACGCTAGCGGCAGGCTTAACACATGCAAGTCAAAGGGCAACGGGGAGAGTGCTTGCACTCTCTGCCGGCGACTGGCGCACGGGTGAGTAACACTTATGCAGACACTGCCTTCCACAGGGCGGACAACCTCTCCCAAAGGGAGGCTAATCCCGCGTATATCCCTTGGGGGCATCCCCGGGGGAGGAAAGGATTACCGGTGTGCAGGATGGGCATGCGGCGCATTACGCAGTAGGCGGGGTAACGGCCCACCTAACCGACCATGCGTATGGGTTCTGAGAGGAAGGCCCCCCACACTGGTACTGAGACACTGACCAGACTCCTACTGGAGGCAGCAGTGAGGAACATTGGTCAATGGGCGGGAGCCTGAACCAGCAAACCCGCGTGAAGGAAGAAGGCGCCGAACGTCGTAAACTTCTTTTGTCCGGGATCAAAGGGCGCCACGTGTGGCGTTGTGAGTGTACCTGTAGAGAAAGCTTCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGTAGGGAGCGAGCGTTGTCCGGATTTATTGGGTGTAAAGGGCGCGTAGGTGGTCGGTTAAGTCAGGTGTGAAAGCTCGGGGCTCAACCCGGAGGATCCGCTGGAACTTTGGTGTCATGAGGCGCAGGAGAAGTAAAGTGGAATTCGTGGTGTAGCGGTGAAATGCATAGATATTGGGCGGAACTCCGGTGGCGAAGGCAGCGTTCTGGCGCGTGCCTGACGCTGAGGCGCGAAAGCGTGGGTATCGAACGGGATTAGATACCCCGGTAGTCCACGCAGTAAACGATGAATACTGGGTGTCGGACCCATAGAACGTTTGGGTGCGCGCAGCGAAAGCGATAAGCATTCCAAGTGGGGAGTACACCGGCAGTGATGAGACTCAAAGGAATCGACGGGGGTTCGCACAAGTGGAGGGATATGTGGTTTAATTAGACGATAAGTGAGGAACGTGACCCGGGTTCAACAGGGAGTCGACAGGGGCAGAGATTCCCTCTTCCACGGACGTCTTCCGAGGTGGGGCATGGTTGTCAGTCAGCTACGTGCCGTGAGGTGTCGGCTTAAGTGCCATAAGGTGTGCAACACGGGCAGACAGTTGCTAACGGGTAGAGCAGTGGAATGTGTAGTGATTGCAGGGGCAAGCCGCGAGGAAGGGGGGGATGATGTCAAATCAGCGCGGCCCTTAGGTCAGGGGTGACACACGTGCTGCAATGGCGGGGACAGAGGGATGTGAAGAGGCGACGTGGAGCGAACCCCAAAAACCCCGCCCCAGTTAGGATTGTAGTATGCAACCCGAATACATGAAGCCGGAATAGGTAGTAATCGCGGATCAGAATGCAGCGGTGAATAAGTTCCCGGCTCTAGCACACACCGCCCGTCA\n", ">229854\n", 
"GAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCATGCTTAACACATGCAAGTCGAACGGCAGCATGACTTAGCTTGCTAAGTTGATGGCGAGTGGCGAACGGGTGAGTAACGCGTAGGAATATGCCTTAAAGAGGGGGACAACTTGGGGAAACTCAAGCTAATACCGCATAAACTCTTCGGAGAAAAGCTGGGGACTTTCGAGCCTGGCGCTTTAAGATTAGCCTGCGTCCGATTAGCTAGTTGGTAGGGTAAAGGCCTACCAAGGCGACGATCAGTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTGAGGGTTGTAAAGCACTTTCAGTGGGGAGGAGGGTTTCCCGGTTAAGAGCTAGGGGCATTGGACGTTACCCACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCCGCGGTAATACGGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCCGTTAAAANGGTGCCTAAGGTGGTTTGGATNAGTTATGTGTTAAATTCCCTGGCGCCTCCACCCTGGNGCCAGGTCCATANTAAAAACTGTTAAACTCCGAAGTATGGGCACAAGGTAANTTGGAAANTTCCGGTGGTNANCCGNTGAAAATGCGCTTAGAGATNCGGGAAGGGACCACCCCAGTGGGGAAGGCGGCTACCTGGCCTAATAACTGACATTGAGGCACGAAAAGCGTGGGGAGCAACCAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGTCAACTAGCTGTNGGTTATATGAATATAATTAGTGGCGAAGCTAACGCGATAAGTTGACCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATNGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACCCTTGACATACAGTAAATCTTTCAGAGATGAGAGAGTGCCTTCGGGAATACTGATACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTATCTCTAGTTGCCAGCGAGTAATGTCGGGAACTCTAAAGAGACTGCCGGTGACAAACCGGAGGAAGGCGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTAGGGCTACACACGTGCTACAATGGCCGATACAGAGGGGCGCGAAGGAGCGATCTGGAGCAAATCTTATAAAGTCGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGCGAATCAGCATGTCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGCACCAGAAGTAGATAGTCTAACCGCAAGGGGGACGTTTACCACGGTGTGGTTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCG\n", ">182569\n", 
"AGAGTTTGATCCTGGCTCAGGATGAACGCTAGCTACAGGCTTAACACATGCAAGTCGAGGGGCAGCATGGTGTATCAATATATCTATGGCGACCAGCGCACCGGTGATGCACACCTCTCCTACCTGCCCCTTACTCCGGGATGATCTTTCTAAAAAAATATTACTACTCCATGGTATTACCGAAAAACGTCTTTTTGTTGTTTAAAAACTTCGATGGTGGAAGGTGATGCTTTCTATTATATACTTGGTGGGGTAACAGCCCACCACCTCAGCGATGAATAGGGGTTCTAATAAGAAGGTCCCCCCCATGGTAACTGGGCCCCGGTCCAAATTCTTCGGGAAGCCACCAGTGAGGATTATTGTTCAATGGCGGAGATTTTGACCCAGCCCAAGTAGCGTGAAGGATGACTGCTCCCATAGGTGGTAAACTTCTTTTATATGGGAATAAAGTGAGTCACGTGTGTCTTTTTGTATGTATCATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATTCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGTTTGTTAAGTCAGTGGTGAAAGTTTGGGGCTCAACCGTGAAATTGCATTTGATACTGGCGGTCTTGAGTGCAGTAGAGGTGGGCGGAATTTGTGGTGTAGCGGTGAAATGCTTAGATATCATGCAGAACTCCGATTGCGAAGGCAGCTCACCGGAGTGTATCTGACGTTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCCACACAGTAAAGAAGGAATATTGTCGTTGTGGGATCTCCATTAAGGGGTCAAGGGAAAGCATTAATTATTCCCCTGGGGGAGTAGTCCGCCAGAGGTGAAATTAAAAGAAATGGAGGGGGGCCGGCCCAAGGGAAGGACCATGTGGTTTAATTGGAGGATAGGGGAGGACCTTTCCCGGGGTTGAAAGTGCAAATGAATTATGGGGAGAGCCATTCCCTTCAAGGCATGAGAGAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTCGGGTTAAGTCCCATAACGAGCGCAACCCTTATCTTCAGTTACTATCAGGTCAAGCTGAGCACTCTGGAGAGACTGCCGTTGTAAGATGAGAGGAAGGTGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCTACACACGTGTTACAATGGGGGGTACAGAAGGCAGCTACCCAGCGACAGGATGCCAATCCCAAAAACCTATCTCAGTTCGGATTGAAGTCTGCAACCCGCCTTCGTGAAGTTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCA\n", ">1719550\n", 
"TCCTGGCTCAGAACGAACGTTGGCGGCGTGGATTAGGCATGCAAGTCGCGCGAATCCCCGCAAGGGGGGAAGCGGCGTAAGGGGCAGTAAGGCGTGGGTACCTACCCGGGGGTCGGGGATAGCCCGTCGAGAGACGGGGTAATACCCGATGACGTGGAGACACCAAAGGTCCGCCGCCCTCGGCGGGGCCCACGTGATATTAGCTAGTTGGCGGGGTAACGGCCCACCAAGGCGGGGATGTCTAGCGGGTGTGAGAGCACGACCCGCGCCACTGGCACTGAGACACTGGCCAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGAAACCCTGACCGAGCGACGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAAGCCTTAACCGGGTGATCTATCCCTGGAGGAAGCACGGGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGTGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGTTGCCGCGTCCGGGGTGAAATCCCACGGCTCAACCGTGGGGCGGCCCCGGGTACGGGCAGCCTCGAGGAGAGTAGGGGCATGCGGAACTCTGGGTGGAGCGGTGAAATGCGTTGATATCCAGAGGAACTCCGGTGGCGAAGGCGGCATGCTGGACCCTTCCTGACGCTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGATGAGAACTAGGTAGCCGGCCGGACATGGGCTGGCTGCCGGAGCCAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGCCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCTACGCGAAGAACCTTATCCCGGGCTTGACATGTTCGAAAGAGGCTCGAAGTAGCCCGCGGAAACGTGGGGCCAACGGTATCCAGTCCGGAGCGAGCTACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCTCAGTTGCTTACTAGGACTCTGGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCCCGGGGCTGCACACGTGCTACAATGGCGTGGACAAAGAGACGCGAGCCCGCGAGGGGGAGCCAATCTCAGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACTCGCCTGCGTGAAGCCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGAGAGGGACGTCCGGAGTCGCCTTCACCGGTGCCGAAGACGGACTTCTTGATTGGGACTAAGTCGTAACAAGGTAACC\n", ">1794723\n", 
"TTAGAGTTTGATCCTGGCTCAGAACGAACGTTGGCGGCGTGGATTAGGCATGCAAGTCTCGCGAATCCCCGCAAGGGGGGAAGCGGCGTAAGGGGCAGTAAGGCGTGGGTAACCCACCCCGGGGCCCGGGATAGCCCGTCGAGAGACGGGGTAATACCGGGCGACGCAGCGTGCCGGCATCGGTGTGCTGCCAAAGGTCCGCCGCCCCGGGCGGGGCCCACGTGGTATTAGCTAGTTGGTGGGGTGACGGCCCACCAAGGCGGAGATGCCTAGCGGGTGTGAGAGCACGACCCGCGCCACTGGCACTGAGACACTGGCCAGACACCTACGGGTGGCTGCAGTCGAGGATCTTCGGCAATGGGGGCAACCCTGACCGAGCGACGCCGCGTGGGCGACGAAGGCCTTCGGGTTGTAAAGCCCTGTCGAGGGGGAGAAACGTCCCGCAAGGGGCCTGATCTATCCCTGGAGGAAGCACGAGCTAAGTTCGTGCCAGCAGCCGCGGTAAGACGAACCGTGCGAACGTTGTTCGGAATCACTGGGCTTAAAGGGCGCGTAGGCGGGCTGCCGAGTCCGGGGTGAAATCCTCCCGCTCAACGGGAGAACGGCCCCGGGTACTGGCGGCCTCGAGGCGGGTAGGGGCGTGCGGAACACTGGGTGGAGCGGTGAAATGCGTTGATATCCAGTGGAACTCCGGTGGCGAAGGCGGCACGCTGGACCCGTCTGACGCTGAGGCGCGAAAGCCAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTGGCCCTAAACGTTGAGAACTAGGTAGTCGGCCGGACATGGGCTGACTGCCGGAGCGAAAGTGCTAAGTTCTCCGCCTGGGGAGTATGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCTCACACAAGCGGTGGAGCATGTGGCTTAATTCGAGGCAACGCGAAGAACCTTATCCCGGGCTTGACATGTGCGAAAGCGTCTGGGGGTACCCGCCGGAAACGGCCGGGGAAGGTATCCAGTCCTGAACCAGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTCGGGTTAAGTCCCATAACGAGCGAAACCCTTACCCTCAGTTGCCAGCGGGTCACGCCGGGGACTCTGGGGGGACTGCCGGTGTCAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGCCCGGGGCTGCACACGTGCTACAATGGCGTGGACAAAGGGGCGCGAACGCGCGAGCGGGAGCCGACCCCGGAAAGCACGCCCCAGTTCAGATCGCAGGCTGCAACTCGCCTGCGTGAAGTCGGAATCGCTAGTAATCGCGGGTCAGCAACACCGCGGTGAATGTGTTCCTGAGCCTTGTACACACCGCCCGTCAAGCCACGAAAGGGAGGGACGGCCGAAGTCGCGCCCCGCGCGCCGACGCCGGACTTCCCGATTGGGACTAAGTCGTAACAAGGTAACC\n", ">1142181\n", 
"CACGTGGGTCATTTGCCCCGAAGCCCGGGATAGCCCATGGAAACATGGATTAATACCGGATGTGGTTGGAGTACACAGGTGCTCCGTATTAAACGGTAGGTAGCAATACCTTCCGCTTCGGGATAAGCCCGCGGCCCATTAGCTAGTTGGTGGGGTAAGACCCAACCAAGGAGACAACCGGGAGCCGGACAGAAAGGGTGACGGCCACATTGGGACTGAGAAACGGCCCGATCCTACGGAGGCAGCAGTAAGAATCTTCCGCATGAACGAAGTCCGACCGAGCGACGCGCTGAGTGATGAAGGTGTTATGCATCGTAAAGCTCCTTCGGGGAGGAGAATAAGCATAGTCCAAAAGGCTATGTGATGACGACCCTCCCTAAAGAAGCCCCGGCTAATTACGTGCAGCAGCGCGGCAATACGTAAGGGGTAAGCGTTGTTCGGAATTACTGGGCGTAAAGGGTGTGCAGGCGGGAGAGTAAGTTGGGGGTGAAATCTACGGGCCCAACCCGTAAACTGCCCTCAAAACTGCTTTTCTTGAGTGCAGGAGAGGAGACTGGAATTCCTAGTGTAGGAGTGAAATCTGTAGATATTAGGAAGAACACCGGTGGCGAAGGCGAGTCTCTGGCCTGACACTGACGCTGATACACGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGTTGTGCACTAGATGTTGGGGGTGTCAATCCCCTCAGTGTCGCAGTTAACGCATTAAGTGCACCGCCTGGGGAGTATGCTCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCAGGGCTTGACATACAGGTGCCGGGCTGTGAAAGCAGTCCTCTCTTCCGAGCGCCTGTACAGGTGTTGCACGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGTTAAGTCCCCCAACGAGCGCAACCCCTATTGTCTGTGCCATCATTAAGTTGGCACTCGAACGAAACTGCCGGTGATAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATATGGCCCTTAGGCCTGGCTACACGTGCTACAATGGACAGTACAAGAGTCGCAAGACCGAAAGGTGGACCATCCAAAAGCTGTCCTCAGTTCCGATTGAAGTCTGAAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGCATCAGAATGGCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACCCGAGTTGGAAGTACTTGAAGTCGCTGATCTAACCTTCGGGAGGAAGGCGCCGATTGTACGTCTGATAAGGGGGGTGAAGTCGTAACAAGGTAACC\n", ">2683209\n", 
"CTGGCGGCGTGGTTTAGGCATGCAAGTCGAACGCGAAAGATTTACTTCGGTAAATTGAGTAGAGTGGCGAACGGGTGAGTAATACGTACGAATCTACCTTAAAGACAGGGATAGTCCCGGGAAACTGGGTTTAATACCTGATGGTATCCGGCTTTGCCGGATTAAAGACGGCCTCTATTTATAAGCTGTTACTTTTAGATGAGCGTGCGCTCCATTAGTTAGTTGGTAAGGTAAGAGCTTACCAAGGCGATGATGGATAGGCGTCCTTAACGGGTGGTCGCCCACACTGGGATTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTCGAGAATCGTCTACAATGAACGCAAGTTTGATAGTGCGACGCCGCGTGAATGAAGAAGCATTTCGGTGTGTAAAATTCTTTTATATAAGAACAGTGCATGTATGGTAAATAATTATACGTGAGAGATAGTACTATATGAATAAGCTCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGAGCAAGCGTTGTCCGGAATTACTAGGTGTAAAGGGTAAGTAGGCGGAAATTTAAGTCTCCGGTTAAATCTTCGGGCTCAACCCGAAATCTGCCTGAGATACTGGATTTCTAGAGTAAAGCAGATGAAGGCGGAATTCCTGGAGTAGCGGTGGAATGCGTAGATATCAGGAAGAACACCCATAGCGAAGGCAGCTTTCAATGCTATTACTGACGCTCAATTACGAAGGTGCGGGTATCGAACAGGATTAGATACCCTGGTAGTCCGCACAGTAAACGATATGTACTTGATATTGGATGTTGAAAATTCAGTGTCGTAGCTAACGCGTTAAGTACATCACCTGGGGACTAACGGCCGCAAGGTTAAAACTCAAAGGAATTGACGGGGGCCCACACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGAACTTGACATGCCGAGAATCCTGTAGAAATATGGGAGTGCCTTTTTTGGAGCTCGGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATCTTTAGTTGCTACCATTAAGTTGAGGACTCTAAAGAGACTGCCAGAGTACAAATCTGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGTCCTTATGTTCAGGGCTACACACGTGCTACAATGGTTGGAACAAAAGGCAGCGAAGGGGCGACCCGGAGCTAATCTCCAAACCCAATCTTAGTCCGGATTGCAGTCTGCAACTCGACTGCATGAAGTTGGAATCGCTAGTAATCGTGAGTCAGCATATCACGGTGAACATGTTCCTGGGCCTTGTACACACCGCCCGTCAAGTCAGCCGAATCGAGTGCACCCGAAGAAGGTGAGTTAATTAGACAGCTTTCGAAGGTGTGCTTGTAAGGGGGACTAAGTC\n", ">2784824\n", 
"AGTGGCGCACGGGTGAGTAACGCGTGGGTAACTTGCCTTTAAGTGAGGGATAACCCACTGAAAGGTGGACTAATACCTCATAAGACCACAGTGCTACGGCAGCGTGGTCAAAGGTGGCTTTATTAAAAGCTGCCGCTTGGAGAGAGACCCGCGTCCCATCAGCTTGTTGGTAAGGTAATGGCTTACCAAGGCCGAGACGGGTAGCTGGTCTGAGAGGATGGCCAGCCACACTGGAACTGAAACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATCTTGCGCAATGGGGGGAACCCTGACGCAGCAACGCCGCGTGAGTGAAGAAGGTCTTCGGGTCGTAAAGCCCTGTCGGGAGGGAAGAAACAGTTATGCATGAATAATGCATAACCTTGACGGTACCTCCNGAGGAAGCACCGGCCAACTCCGTGCCAGCAGCCGCGGTAAAACGGAGGGTGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGCGTGTAGGCGGATAGATAAGTCGAGTGTGAAAGCCCTCAGCTTAACTGAGGAAGTGCATTCGAAACTATCTTTCTTGGGTACGGAAGAGGGAAGTGGAATTCCCGGTGTAGGGGTGAAATCCGTAGATATCGGGAGGAATACCAGTGGCGAAGGCGACTTCCTGGACCGTCACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGGCACTAGGTGTATCTCGCTTAGCGGGATGTGCCGTAGCTAACGCATTAAGTGCCCCGCCTGGGGAGTACGGTCGCAAGACTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGTGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGTTTGACATGCCGAGAATCTGCCAGAAATGGTGGAGTGCCCCGTTAGGGGAACTCGGACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCACCTTTAGTTGCCAGCATTAAGTTGGGCACTCTAAAGGGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGACGACGTCNAGTCCTCATGGCCTTTATACCCAGGGCTACACACGTGCTACAATGGCCAGTACAAAGGGCTGCAATCCCGCGAGGGGGAGCCAACCCCAAAAATCTGGTCTTAGTTCGGATTGGAGTCTGCAACTCGACTCCATAAAGGTGGAATCGCTAGTAATCGTGAATCAGCACGTCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAAAGTTGGCTGTACCAGAAGTTGCTGAGCTAACTCGCCTCGGCGGGAGGCAGGCACCTAAGGTGTGGTTGATGATTGGGGTGAAGT\n", ">2941516\n", 
"TTAGAGTTTGATCCTGGCTCAGGATGAACGCTAGCGATAGGCCTAACACATGCAAGTCGAGGGGTAACAGGGTAGCAATACCGCTGACGACCGGCAAATGGGTGAGTAACGCGTATGCAACCTACCGATAACAGTTGGATAGCTCCCTGAAAGGGGAATTAAACCGGCATGACACTATGAGATCGCCTGTTTTCATAGTTAAATATTTATAGGTTATTGATGGGCATGCGTGACATTAGCAAGTTGGTGAGGTAACGGCTCACCAATGCTACGATGTCTAGGGGTTCTGAGAGGAAGGTCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGACGGAAGTCTGAACCACCCACTTCGCGTGCAGGATGACTGCCCTATGGGTTGTAAACTGCTTTTATATAAGAGGAACAGTATTTATGTATAGATATTTGCCAGTATTATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGAGTTTAAAGGGTGAGTACGCGGTAGTATAAGTCAGCGGTGATAACTCGCAGCTCATCTGTAAGCTTGCCGTTGACACTGTATTACTTGACTTAACGTTGAGGTATGCTGAATGGGGGGGGGTTACCCGTTGAAATGCATTAATCAAAACAACAGACCACCCGATTTGCGGACGGCAGCAAAACTACACTGTCCACTGACGCTGATGCACAAAAGGCGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGATTACTGGTTGTTTGTGATACACTGCAAGTGACTGAGCGAAAGCACTAAGTAATCCACTTGGCGAGTACGTCGGCAACGATGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTCTAATTCGAGGCAACGCGAAGAACCTTACCCAGACTTGACATCTAGGAAAGGTCCTTGAAAGAGGATCGTGCCCGCAAGGGAATCCTAAGACAGGTGTTGCATGGCTGTCGTCAGCTCCTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGCTTACAGTTACCATCGGTTCGGCCGGGGACTCTGTAAGGACTGCCGCTGATAAAGCGAAGGAAGGCGGGGACGACGTCAAGCAATCACGGCCCTTACGTCTGGGGCTACACACGTGCTACAATGGCCGGTACAATGAGTCGCAAAACCGCGAGGTCAAGCTAATCTCAAAAAACCGGTCTCAGTTCGGATTGGAGTCTGCAACCCGACTTCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGCGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCA\n", ">998428\n", 
"GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGGAGTTGTTCCTTCGGGGACAGCTTAGCGGCGGACGGGTGAGTAACACGTAGGCAACCTGCCTGCAGGACCGGGATAACCCACGGAAACGTGAGCTAATACCGGATAGATGGTTCCCTCGCATGAGGGGATCAGGAAAGACGGGGCAACCTGTCACTTGTAGATGGGCCTGCGGCGCATTAGCTAGTTGGCGAGGTAACGGCTCACCAAGGCGACGATGCGTAGCCGACCTGAGAGGGTGAACGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGAGGAAGGTCTTCGGATCGTAAAGCTCTGTTGCCAAGGAAGAACGCTTGGTGGAGTAACTGCCATCAAGGTGACGGTACTTGAGAAGAAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTGTCCGGAATTATGGGCGTAAGCGCGCGCAGCGGTTCTTTAAGTCTGAGGTTAAATGCAGGGCTCAACCTTGTAACGCCTTGGAAACTGGGGGACTGGAGTGTAGGAGAGGAAAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGCCTATAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGGTGTTAGGGGTTTCGATACCCTTGGTGCCGAAGTTAACACAGTAAGCACTCCGCCTGGGGAGTACGCTCGCAAGAGTGAAACTCAAAGGAATTGACGGGGACCCGCACAGGCAGTGGAGTATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCGATGAAACGTCTAGAGATAGGCGCCCTCTTCGGAGCATTGGAGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGAATTCAGTTGCCAGCACTTCGGGTGGGCACTCTGAATTGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCGTGCCCCTTATGACCTGGGCTACACACGTACTACAATGGTCGGTACAACGGGCAGCGAAGCCGCGAGGCGGAGCCAATCCTAGAAAAGCCGATCTCAGTTCGGATTGCAGGCTGCAACTCGCCTGCATGAAGTCGGAATTGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGTCT\n", ">4343117\n", 
"AACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGCGTGCGCGGTTCACGAACTTGTACGTGGATGGGCGCACGGCGCAGGGGGGCGTAACACGTGGGCACTCTGCCCTCCGATGGGGAATACTCCCGCGAACCGGGGGCTAATACCGCATAACATTCCGAGGACTTGGGTTCTTGGATTCAAAGCAGTGATGCCTGTGAGGAGGAGCCCGCGCCCGATTAGCTAGTTGGTAGGGTAACGGCCTACCTCGGCAATGATCGGTAGCTGGTCTGAGAGGATAATCAGACACACTGCAACTGAAACGAGGCCCAGACTCCTACCGTAGGGAACGCTGGGGAATCTTGCCTTCTGGGCGAAAGCATGACCCAACGACGCCGCGTGGGGGATGAAGCTTTTGCTAGTGTAAACCCCTTTTCACTGGTAAGAATGCACGCAAGGGAGCGACAGTACCCTGGCAAGAAGCCCCGGCTAACTACGTGCCACCCGCCTCGGTAAGACCTAGGGGGCCAGCGTTGTTCGGAATTACTGGGTGTATAGGGTACTTATGCGGTGCGACAAGTTGGGAGTGAAATCTCTGGGCTTAACCCAGAGGCTGCTTCTCAAACTGCTATGCTTGATTGTGACAGAGGCTCTTGAAATTGCAGGAGTAGCGTTGAAATGCATGTATATCTGCAAGATCACCCGAGATATGGACGAACAGCTGGATCACAAGTGACGCTGAGGAACGAAAGCTACGCTGAGCGAACAGGATTATATACACTGGTAGTCCTAGCACTAAACGATCATGACTTGCGGTGACGACCGTTCGGACGTCTCCCGGAGCTAACGCGTTAAGTCCTGCACCTGGGGAGTACGGTCGCAGACTGGAAGTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAACATGTGGTTCAATTCGACGCTACGCGAGGAACCTTACCTGGTTCGAAATTCTTATGACCAGCTGTAGAATTACGGCTTTCCTTCAAGAGACATGAGTCTAGGCGCTCCATGGCTGTCGTCAGTTCGTTCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGCACGTAGTTACTACTCGCAAGAGAGGACTCTACGTGGACTGCTCCGGATAACGGAGAGGAAGGTGGGAATGACGTCAAGTCCGCATGGCCTTTATGTCCAGGGCTACACACGTGTTACAATGCAGGGTACAAACCGTTGCCAACCCGCGAGGGGGAGCTAATCGGATAAAACTGTGCTCAGTTCGGATTGCAGTCTGCAACTCGACTGCATGAAGCTGGAATCGCTAGTAATGGGGATCAGCTTGACGCCGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACATCACGAAAGTGAGCTCACCTAGAAGTCGCCACGCTAACCGCAAGGGGGCAGGCGCCCAAGGTATGACTCATGATTGGGGTG\n", ">4353661\n", 
"GGATGAACGCTAGCGGGAGGCTTAATACATGCAAGTCGAGGGTGAAGCTTTCTTCGGAAAGTGGAAACCGGCGAACGGGTGCGTAACGCGTACGCAACTTACCCCTTGCTGGAGAATAGCCCCGGGAAACTGGGATTAATGCTCCATGGTATGGTGAAATCGCATGATTTTATCATTAAAGGTTACGGCAAGGGATAGGCGTGCGTCCCATTAGCTTGTTGGTGAGGTAACGGCTCACCAATGCAAACGATGGGTAGCTGGTCTGAGAGGATGATCAGCCACACGGGCACTGAGACACGGGCCCGACTCCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGACGAAAGTCTGATCCAGCCATCCCGCGTGCAGGACGAATGCCCTATGGGTTGTAAACTGCTTTTCTAAGGAAAGAAATATCTCATTCATGAGGTGCTGACGGTACCTTAGGAATAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTATCCGGATTCACTGGGTTTAAAGGGTGCGTAGGCGGTATGATAAGTCAGTGGTGAAAGCCCGGGGCTCAACTCCGGAACTGCCGTTGATACTGTCATACTTGAGTCCAGTTGAGGTGGGCGGAATGATACATGTAGCGGTGAAATGCTTAGATATGTATCAGAACACCGACTGCGAAGGCAGCTCACTAAACTGGTACTGACGCTGAGGCACGAAAGCGTGGGTAGCGAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGCTAACTCGGTATGTGCGATATACTGTACGTGCCTGAGGGAAACCGTTAAGTTAGCCACCTGGGGAGTACGTTCGCAAGAATGAAACTCAAAGGAATTGACGGGGGTCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCTGGGCTCTAATGTACCACGCCCGACCCTGAAAGGGGTCTTCTTCTTCGGAAGCGGGGTACAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGCCAGCGGTCCGGCCGGGGACTCTAAGGAGACTGCCTTCGCAAGGAGTGAGGAAGGAGGGGACGACGTCAAATCATCATGGCCTTTATGCCCAGGGCTACACACGTGCTACAATGGTGAGGACAAAGGGCAGCCACTTAGCGATAAGGAGCAAATCCCAAAAACCTCACCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAACCGCTAGTAATCGCAGATCAGACATGCTGCGGTGAATACGTTCCCGGACCTTGTACACACCGCCCGTCAAGCCATGGAGCCGGGTGTACCTTAAGGCGATAACCGAAAGGAGTTGCCCAAGGTA\"\"\"\n", "seqs_16s = StringIO(seqs_16s)\n", "\n", "seqs_16s = [s[:50] for s in skbio.io.read(seqs_16s, constructor=DNA, format='fasta')]\n", "seq_lookup = {seq.metadata['id'] : seq for seq in seqs_16s}\n", "\n", "\n", "tax = \"\"\"669210\tk__Bacteria; p__; c__; o__; f__; g__; s__\n", "881726\tk__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__\n", "296752\tk__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__\n", "1794723\tk__Bacteria; p__Planctomycetes; 
c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__\n", "2941516\tk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Marinilabiaceae; g__; s__\n", "793074\tk__Bacteria; p__; c__; o__; f__; g__; s__\n", "4353661\tk__Bacteria; p__Bacteroidetes; c__[Saprospirae]; o__[Saprospirales]; f__; g__; s__\n", "292553\tk__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__\n", "2784824\tk__Bacteria; p__Proteobacteria; c__Deltaproteobacteria; o__Syntrophobacterales; f__Syntrophaceae; g__; s__\n", "1719550\tk__Bacteria; p__Planctomycetes; c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__\n", "182569\tk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__\n", "266495\tk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__S24-7; g__; s__\n", "524860\tk__Bacteria; p__Planctomycetes; c__Planctomycetia; o__Gemmatales; f__Gemmataceae; g__Gemmata; s__\n", "293514\tk__Bacteria; p__Spirochaetes; c__Spirochaetes; o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__\n", "2683209\tk__Bacteria; p__WWE1; c__[Cloacamonae]; o__[Cloacamonales]; f__; g__; s__\n", "501793\tk__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__\n", "229854\tk__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Legionellales; f__Legionellaceae; g__Legionella; s__\n", "583705\tk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__; g__; s__\n", "1142181\tk__Bacteria; p__Spirochaetes; c__GN05; o__SBYZ_6080; f__; g__; s__\n", "998428\tk__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__Paenibacillaceae; g__Paenibacillus; s__\n", "4343117\tk__Bacteria; p__Acidobacteria; c__DA052; o__Ellin6513; f__; g__; s__\"\"\"\n", "\n", "tax_lookup = dict([e.strip().split('\\t') for e in tax.split('\\n')])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [4.2.5](#5) Question 1" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "\n", "What are the fraction of 3 base k-words unique to the following two sequences?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "seq1 = DNA('AGCTAGCATCGATCGATCGATGCATGCAT')\n", "seq2 = DNA('AGCTCGGCATCGAGGGCAGTCAATCGATCT')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "help(kmer_distance)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Compute the kmer distance in this cell" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [4.2.6](#6) Question 2 [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Display the guide tree for the sequences in the cell below." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "query_seqs = [DNA(\"ACGATGACCAGTGCTACCAGT\", metadata={'id': 's1'}),\n", " DNA(\"AACGATCGATCGATCGTGCTA\", metadata={'id': 's2'}),\n", " DNA(\"AACGATCTGCTA\", metadata={'id': 's3'}),\n", " DNA(\"CGATCGATGACATGCATG\", metadata={'id': 's4'}),\n", " DNA(\"CGATCTGCAT\", metadata={'id': 's5'})]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "help(guide_tree_from_sequences)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Display the guide tree in this cell." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [4.2.7](#7) Question 3 [edit]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "What are the differences in the guide tree from *Question 2*, the tree that is generated after 1 iterations of iterative multiple sequence alignment, and the tree that is generated after 5 iterations of iterative multiple sequence alignment? Display the trees for both 1 and 5 iterations of iterative multiple sequence alignment." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "help(iterative_msa_and_tree)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from skbio.alignment import global_pairwise_align_nucleotide\n", "# add your command for 1 iteration of iterative multiple sequence alignment here\n", "# hint: pass pairwise_aligner=global_pairwise_align_nucleotide" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# add your command for 5 iterations of iterative multiple sequence alignment here\n", "# hint: pass pairwise_aligner=global_pairwise_align_nucleotide" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [4.2.8](#8) Question 4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Generate and display a tree based on progressive alignment of the sequences from the second cell (the ones in the ``seqs_16s`` variable). This step can take about 10 minutes to complete." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "help(progressive_msa_and_tree)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Add your command for progressive alignment and tree building here\n", "# hint: pass pairwise_aligner=global_pairwise_align_nucleotide" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [4.2.9](#9) Question 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Using the tree representing the sequences from Question 4 as a guide, define clusters (i.e., groups) of sequences at 90% and 70% identity. There is not a single right answer for this, or a single method for grouping sequences. Go about this systematically, and describe the process that you're going through in a couple of paragraphs. 
These groups are usually referred to as operational taxonomic units, or OTUs, because they represent a hypothesis about the taxonomic relatedness of a group of sequences (which is a proxy for a hypothesis about the relatedness of the group of organisms containing those sequences in their genomes)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to obtain a given sequence, you can do so by looking up its identifier in `seq_lookup`:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "seq_lookup['4343117']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "seq_lookup['4353661']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To compute the pairwise identity for two sequences, use `pairwise_percent_id` as follows. Note that it returns the identity as a fraction between 0 and 1, so 0.90 corresponds to 90% identity. IMPORTANT: While this calculation is fast for a single pair of sequences, running it for all pairs of sequences (e.g., in a nested for-loop) could take a couple of hours, depending on the hardware where it's being run. You only need to compare selected pairs of sequences, not all pairs, to answer this question." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from skbio.alignment import global_pairwise_align_nucleotide\n", "\n", "def pairwise_percent_id(seq1_id, seq2_id, seq_lookup):\n", "    seq1 = seq_lookup[seq1_id]\n", "    seq2 = seq_lookup[seq2_id]\n", "    aln, _, _ = global_pairwise_align_nucleotide(seq1, seq2)\n", "    # identity = 1 - fraction of aligned positions that differ\n", "    return 1. - aln[0].distance(aln[1])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "print(pairwise_percent_id('793074', '4353661', seq_lookup))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Compute additional pairwise identities, as necessary, to answer this question here. Show all of your commands!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Discuss your results here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [4.2.10](#10) Question 6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Choose one representative sequence from each of the clusters you defined in Question 5. Look these up in `tax_lookup` by their ids to get the taxonomy of each sequence, and include those in the results below. When a level in the taxonomy string consists only of its prefix (e.g., ``g__`` or ``s__``), that means that there is no known taxonomic assignment for that sequence at that level." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "print(tax_lookup['4343117'])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "print(tax_lookup['4353661'])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Perform additional taxonomy look-ups here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Discuss your results here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [4.2.11](#11) Question 7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Is the taxonomy of the representative sequences consistent with the phylogenetic tree you generated in Question 4? For your 90% and 70% OTUs, list three taxa (e.g., at the phylum, class, or species level) that are monophyletic, if any, and three taxa that are not monophyletic, if any. Discuss two specific reasons why some taxa might appear to not be monophyletic based on your tree." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Discuss your results here." ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 2 }