2 Fundamentals

Table of Contents

  1. Pairwise sequence alignment
    1. What is a sequence alignment?
    2. A simple procedure for aligning a pair of sequences
      1. Step 1: Create a blank matrix where the rows and columns represent the positions in the sequences.
      2. Step 2: Add values to the cells in the matrix.
      3. Step 3: Identify the longest diagonals.
      4. Step 4: Transcribe some of the possible alignments that arise from this process.
      5. Why this simple procedure is too simplistic
    3. Differential scoring of matches and mismatches
    4. A better approach for global pairwise alignment using the Needleman-Wunsch algorithm
      1. Stepwise Needleman-Wunsch alignment
        1. Step 1: Create blank matrices.
        2. Step 2: Compute $F$ and $T$.
        3. Step 3: Transcribe the alignment.
      2. Automating Needleman-Wunsch alignment with Python
      3. A note on computing $F$ and $T$
    5. Global versus local alignment
    6. Smith-Waterman local sequence alignment
      1. Step 1: Create blank matrices.
      2. Step 2: Compute $F$ and $T$.
      3. Step 3: Transcribe the alignment.
      4. Automating Smith-Waterman alignment with Python
    7. Differential scoring of gaps
    8. How long does pairwise sequence alignment take?
      1. Comparing implementations of Smith-Waterman
      2. Analyzing Smith-Waterman run time as a function of sequence length
      3. Conclusions on the scalability of pairwise sequence alignment with Smith-Waterman
  2. Sequence homology searching
    1. Defining the problem
    2. Loading annotated sequences
    3. Defining the problem
    4. A complete homology search function
    5. Reducing the runtime for database searches
    6. Heuristic algorithms
      1. Random reference sequence selection
      2. Composition-based reference sequence collection
        1. GC content
        2. kmer content
        3. Further optimizing composition-based approaches by pre-computing reference database information
    7. Determining the statistical significance of a pairwise alignment
      1. Metrics of alignment quality
      2. False positives, false negatives, p-values, and alpha
      3. Interpreting alignment scores in context
      4. Exploring the limit of detection of sequence homology searches
  3. Generalized dynamic programming for multiple sequence alignment
    1. Progressive alignment
      1. Building the guide tree
      2. Generalization of Needleman-Wunsch (with affine gap scoring) for progressive multiple sequence alignment
      3. Putting it all together: progressive multiple sequence alignment
    2. Progressive alignment versus iterative alignment
  4. Phylogenetic reconstruction
    1. Why build phylogenies?
    2. How phylogenies are reconstructed
    3. Some terminology
    4. Simulating evolution
      1. A cautionary word about simulations
    5. Visualizing trees with ete3
    6. Distance-based approaches to phylogenetic reconstruction
      1. Distances and distance matrices
      2. Alignment-free distances between sequences
      3. Alignment-based distances between sequences
      4. Jukes-Cantor correction of observed distances between sequences
      5. Phylogenetic reconstruction with UPGMA
        1. Applying UPGMA from SciPy
        2. Understanding the name
      6. Phylogenetic reconstruction with neighbor-joining
      7. Limitations of distance-based approaches
    7. Bootstrap analysis
    8. Parsimony-based approaches to phylogenetic reconstruction
      1. How many possible phylogenies are there for a given collection of sequences?
    9. Statistical approaches to phylogenetic reconstruction
      1. Bayesian methods
      2. Maximum likelihood methods
    10. Rooted versus unrooted trees
    11. Acknowledgements
  5. Sequence mapping and clustering
    1. De novo clustering of sequences by similarity
      1. Furthest neighbor clustering
      2. Nearest neighbor clustering
      3. Centroid clustering
      4. Three different definitions of OTUs
    2. Comparing properties of our clustering algorithms
    3. Reference-based clustering to assist with parallelization