**Table of Contents**

- Why build phylogenies?
- How phylogenies are reconstructed
- Some terminology
- Simulating evolution
- Visualizing trees with ete3
- Distance-based approaches to phylogenetic reconstruction
- Distances and distance matrices
- Alignment-free distances between sequences
- Alignment-based distances between sequences
- Jukes-Cantor correction of observed distances between sequences
- Phylogenetic reconstruction with UPGMA
- Phylogenetic reconstruction with neighbor-joining
- Limitations of distance-based approaches

- Bootstrap analysis
- Parsimony-based approaches to phylogenetic reconstruction
- Statistical approaches to phylogenetic reconstruction
- Rooted versus unrooted trees
- Acknowledgements

*tips* in the tree. You can explore an interactive version of the three-domain tree presented in Figure 1b online, through the Interactive Tree of Life project.

<table border=0>

</td> </td> </tr><td colspan=2>
**Figure 1a**: Evolutionary tree presented by Charles Darwin in *On the Origin of Species*. **Figure 1b**: A hypothesis of evolutionary relationships between the three domains of life. This image was created by the Norman Pace Laboratory.
</td>
</tr>
</table>

*Homo sapiens*, by studying features of our closest relatives, both extant (still existing) organisms, such as the *Pan* (chimpanzees) and *Gorilla* genera, and extinct (no longer living) species, including *Homo neanderthalensis*, *Homo erectus*, *Homo habilus*, and many species in the *Australopithecus* genus. (The Smithsonian Museum's Human Origins Initiative is an excellent resource for learning more about the fascinating subject of human evolution.) In this time, we've also improved our understanding of the deeper branches in the tree of life. For example, Woese and Fox (1977) used phylogenetic reconstruction to first illustrate that the "prokaryotes" really represented two ancient lineages which they called the eubacteria and the archaebacteria, which ultimately led to the proposal of a "three domain" tree of life composed of the three deep branching linages, the archaea, the bacteria, and the eucarya (Woese, Kandler and Wheelis (1990)).

*On the Origin of Species* was the phylogenetic tree presented in Figure 1a.

*individuals* represented at the tips of our trees don't necessarily have to be organisms though. In another important application of phylogenetic trees, we can study the evolution of genes, which can help us gain a deeper understanding of gene function. Specifically, we can learn about families of related genes. A classic example of this is the globin family, which includes the proteins hemoglobin and myoglobin, molecules that can reversibly bind oxygen (meaning they can bind to it, and then let go of it). You've probably heard of hemoglobin (if not globins in general), as this molecule binds to oxygen where it is present in high concentration (such as in your lung) and releases it where it is present in low concentration (such as in the bicep, where it is ultimately used to power your arm). Hemoglobin and myoglobin are paralogs, meaning that they are related by a gene duplication and subsequent divergence. If you were to compare an unknown globin sequence to either of these you could detect homology, but a tree such as the one present in Figure 2, would help you understand the type of homologous relationship (i.e., whether it was orthology or paralogy).

*features* of organisms, and inferring the evolutionary distance between those organisms based on the similarity of their features. The features that are compared can be nearly anything that is observable, either from extant organisms or fossilized representatives of extinct organisms.

*phylogenomic* studies have further supported the idea that orb-weaving is an ancient monophyletic trait, and have provided much finer scale information on the evolution of spiders. Supported by these data, researchers hypothesize that the loss of orb-weaving might not be that surprising. While it does provide an effective means of catching flying insects, many insects which are potential prey for spiders don't fly. Further, orb webs may attract predators of spiders, as they are easily observable signals of where a spider can be found.

*Terminal nodes or tips* typically represent extant organisms, also frequently called operational taxonomic units or OTUs. OTU is a generic way of referring to a grouping of organisms (such as a species, a genus, or a phylum), without specifically identifying what that grouping is.

*Internal nodes* in a phylogenetic tree represent hypothetical ancestors. We postulate their existence but often don't have direct evidence. The *root node* is the internal node from which all other nodes in the tree descend. This is often referred to as the *last common ancestor (LCA)* of the OTUs represented in the tree. In a universal tree of life, the LCA is often referred to as *LUCA*, the *last universal common ancestor*. All nodes in the tree can be referred to as OTUs.

*Branches* connect the nodes in the tree, and generally represent time or some amount of evolutionary change between the OTUs. The specific meaning of the branches will be dependent on the method that was used to build the phylogenetic tree.

*clade* in a tree refers to some node (either internal or terminal) and all nodes descending from it (i.e., moving away from the root toward the tips).

We'll use all of these terms below as we begin to explore phylogenetic trees.

**Table of Contents**

`evolve_generations`

. This will take a starting sequence and the number of generations that we want to simulate. It will also take the probability that we want a substitution mutation to occur at each position in the starting sequence, and the probability that we want either an insertion or deletion mutation to occur at each position. In each generation, every sequence will spawn two new sequences, randomly incurring mutations at the prescribed rates. This effectively simulates a clonal process of reproduction, a form of asexual reproduction common in single cellular organisms, where a parent cell divides into two cells, each containing a copy of the parent's genome with some errors introduced.

Let's inspect this code and then run our simulation beginning with a random sequence.

In [1]:

```
%pylab inline
import numpy as np
import seaborn as sns
import random
from IPython.core import page
page.page = print
```

In [2]:

```
from iab.algorithms import evolve_sequence
%psource evolve_sequence
```

In [3]:

```
from iab.algorithms import evolve_generation
%psource evolve_generation
```

In [4]:

```
from iab.algorithms import evolve_generations
%psource evolve_generations
```

`evolve_generations`

, we'll pass the parameter `verbose=True`

. This will tell the function to print out some information throughout the process. This will let us inspect our evolutionary process: something that is impossible to do with real sequences.

In [5]:

```
from iab.algorithms import random_sequence
import skbio
sequence = random_sequence(skbio.DNA, 50)
```

In [6]:

```
sequences = evolve_generations(sequence, generations=3, substitution_probability=0.1, indel_probability=0.05,
increased_rate_probability=0.1, verbose=True)
```

`1`

or `2`

. Notice that all sequence identifiers start with `0`

, the identifier of the last common ancestor (or our starting sequence) of all of the sequences. These identifiers will help us interpret whether the phylogenies that we reconstruct accurately represent the evolutionary relationships between the sequences.

In [7]:

```
print(len(sequences))
```

In [8]:

```
sequences[0]
```

Out[8]:

In [9]:

```
sequences[-1]
```

Out[9]:

`indel_probability=0.0`

. This means that we won't need to align our selected sequences before constructing trees from them (because with only substitution mutations occurring, homologous positions always remain aligned). Multiple sequence alignment is currently a runtime bottleneck in IAB, so this will let us run this notebook much faster. If you'd like to model insertion/deletions, you can increase the `indel_probability`

, say to `0.005`

. If you do that, the sequences will be aligned for you in the next cell, but it may take around 30 minutes to run.

In [10]:

```
from skbio.alignment import global_pairwise_align_nucleotide
from iab.algorithms import progressive_msa
from functools import partial
indel_probability = 0.0
sequences = evolve_generations(sequence, generations=10, substitution_probability=0.03,
indel_probability=0.0, increased_rate_probability=0.1, verbose=False)
sequences = random.sample(sequences, 25)
if indel_probability == 0:
sequences_aligned = sequences
else:
gpa = partial(global_pairwise_align_nucleotide, penalize_terminal_gaps=True)
sequences_aligned = progressive_msa(sequences, pairwise_aligner=gpa)
```

`TreeStyle`

which is used to define how the trees we visualize will look. If you'd like to experiment with other views, you can modify the code in this cell according to the ete3 documentation. If you come up with a nicer style, I'd be interested in seeing that in a pull request. You can post screenshots to IAB issue #213 before creating a pull request so I can see what the new style looks like.

In [11]:

```
import ete3
ts = ete3.TreeStyle()
ts.show_leaf_name = True
ts.scale = 250
ts.branch_vertical_margin = 15
```

`TreeStyle`

to a random tree as follows. Any changes that you make to the `TreeStyle`

above will impact this tree and all following trees in this chapter. Experiment with this - it's fun!

In [12]:

```
t = ete3.Tree()
t.populate(10)
t.render("%%inline", tree_style=ts)
```

Out[12]:

**Table of Contents**

- Distances and distance matrices
- Alignment-free distances between sequences
- Alignment-based distances between sequences
- Jukes-Cantor correction of observed distances between sequences
- Phylogenetic reconstruction with UPGMA
- Phylogenetic reconstruction with neighbor-joining
- Limitations of distance-based approaches

*distance*, and introducing the concept of a distance matrix.

*distance* between a pair of objects is a measure of their dissimilarity. There isn't one single definition of the distance between two objects. For example, if your two objects are the cities Flagstaff, Arizona and Boulder, Colorado you might measure the distance between them as length of shortest line that connects them. This would be the *Euclidean distance* between the two cities. If you're trying to travel from one city to another however, that might not be the most relevant distance measure. Instead, you might be interested in something like their *Manhatten distance*, which in this would be more similar to a measure of the shortest route that you could travel between the two cities using the Interstate Highway system. You can see how this is different than the shortest line connecting the two cities.

*distance* if it meets these four criteria for all $x$ and $y$:

- $d(x,y) \geq 0$ (non-negativity)
- $d(x,y) = 0\ if\ and\ only\ if\ x = y$ (identity of indiscernibles)
- $d(x,y) = d(y,x)$ (symmetry)
- $d(x,z) \leq d(x,y) + d(y,z)$ (triangle inequality)

*distance matrix* which contains all of those values. These distance matrices are so common in bioinformatics that scikit-bio defines a DistanceMatrix object that provides a convenient interface for working with these data. We can create one of these objects as follows:

In [13]:

```
from skbio import DistanceMatrix
dm = DistanceMatrix([[0.0, 1.0, 2.0],
[1.0, 0.0, 3.0],
[2.0, 3.0, 0.0]],
ids=['a', 'b', 'c'])
```

In [14]:

```
print(dm['a', 'b'])
print(dm['b', 'c'])
```

In [15]:

```
_ = dm.plot(cmap='Greens')
```

*symmetric* (if you flip the upper triangle over the diagonal, the values are the same as those in the lower triangle), *hollow* (the diagonal is all zeros), and all values are greater than or equal to zero. Which of the conditions listed above results in each of these features of distance matrices?

*alignment-free* distance between sequences is the k-mer distance that we worked with in *Sequence Homology Searching*.

`skbio.DistanceMatrix`

object.

In [16]:

```
from iab.algorithms import kmer_distance
kmer_dm = DistanceMatrix.from_iterable(sequences, metric=kmer_distance, key='id')
_ = kmer_dm.plot(cmap='Greens', title='3mer distances between sequences')
```

In [17]:

```
from skbio.sequence.distance import hamming
hamming_dm = DistanceMatrix.from_iterable(sequences_aligned, metric=hamming, key='id')
_ = hamming_dm.plot(cmap='Greens', title='Hamming distances between sequences')
```

`A`

to `C`

. Then, in the next generation $g + 1$, the same position $p$ of sequence $S1$ undergoes a substitution from `C`

to `T`

. Because we can only inspect the modern-day sequences, not their ancestors, it looks like position $p$ has had a single substitution event. Similarly, if in generation $g + 1$ position $p$ changed from `C`

back to `A`

(a *back substitution*), we would observe zero substitution mutations at that position even though two had occurred.

*Jukes-Cantor correction* is typically applied to the Hamming distances between the sequences. Where $p$ is the Hamming distance, the corrected genetic distance is computed as $d = -\frac{3}{4} \ln(1 - \frac{4}{3}p)$. The derivation of this formula is beyond the scope of this text (you can find it in Inferring Phylogeny by Felsenstein), but it is based on the Jukes-Cantor (JC69) nucleotide substitution model.

In [18]:

```
def jc_correction(p):
return (-3/4) * np.log(1 - (4*p/3))
```

In [19]:

```
distances = np.arange(0, 0.70, 0.05)
jc_corrected_distances = list(map(jc_correction, distances))
ax = sns.pointplot(distances, jc_corrected_distances)
ax.set_xlabel('Hamming distance')
ax.set_ylabel('JC-corrected distance')
ax.set_xlim(0)
ax.set_ylim(0)
ax
```

Out[19]:

In [20]:

```
def jc_correct_dm(dm):
result = np.zeros(dm.shape)
for i in range(dm.shape[0]):
for j in range(i):
result[i,j] = result[j,i] = jc_correction(dm[i,j])
return skbio.DistanceMatrix(result, ids=dm.ids)
jc_corrected_hamming_dm = jc_correct_dm(hamming_dm)
```

In [21]:

```
print(hamming_dm[0])
print(jc_corrected_hamming_dm[0])
```

In [22]:

```
_ = jc_corrected_hamming_dm.plot(cmap='Greens', title='JC-corrected Hamming distances between sequences')
```

**Table of Contents**

*Unweighted Pair-Group Method with Arithmetic mean*. While that name sounds complex, it's actually a straightforward algorithm, which is why we're starting with it. After we work through the algorithm, we'll come back to the name as it'll make more sense then.

**it's application in phylogenetics is usually restricted to building preliminary trees to "guide" the process of multiple sequence alignment. The reason for this is that it's fast, but it makes some assumptions that don't work well for inferring relationships between organisms, which we'll discuss after working through the algorithm.**

UPGMA starts with a distance matrix, and works through the following steps to create a tree.

**Step 1**: Find the smallest non-zero distance in the matrix and define a clade containing only those members. Draw that clade, and set the total length of the branch connecting the tips to the distance between the tips. The distance between each tip and the node connecting them should be half of the distance between the tips.

**Step 2**: Create a new distance matrix with an entry representing the new clade created in step 1.

**Step 3**: Calculate the distance matrix entries for the new clade as the mean distance from each of the tips of the new clade to all other tips in the *original* distance matrix.

**Step 4**: If there is only one distance (below or above the diagonal) in the distance matrix, use it to connect the remaining unconnected clades, and stop. Otherwise repeat step 1.

In [23]:

```
_data = np.array([[ 0., 4., 2., 5., 6.],
[ 4., 0., 3., 6., 5.],
[ 2., 3., 0., 3., 4.],
[ 5., 6., 3., 0., 1.],
[ 6., 5., 4., 1., 0.]])
_ids = ['s1', 's2', 's3', 's4', 's5']
master_upgma_dm = skbio.DistanceMatrix(_data, _ids)
print(master_upgma_dm)
```

`s4`

and `s5`

. So, we'll draw that clade and set each branch length to half of the distance between them.

`s4`

and `s5`

are now represented by a single clade which we'll call `(s4, s5)`

. This notation indicates that the corresponding distances are to both `s4`

and `s5`

.

In [24]:

```
iter1_ids = ['s1', 's2', 's3', '(s4, s5)']
iter1_dm = [[0.0, 4.0, 2.0, None],
[4.0, 0.0, 3.0, None],
[2.0, 3.0, 0.0, None],
[None, None, None, None]]
```

`s1`

and `(s4, s5)`

is the mean of the distance between `s1`

and `s4`

and `s1`

and `s5`

:

In [25]:

```
import numpy as np
s1_s4s5 = np.mean([master_upgma_dm['s1', 's4'], master_upgma_dm['s1', 's5']])
print(s1_s4s5)
```

`s2`

and `(s4, s5)`

is the mean of the distance between `s2`

and `s4`

and `s2`

and `s5`

:

In [26]:

```
s2_s4s5 = np.mean([master_upgma_dm['s2', 's4'], master_upgma_dm['s2', 's5']])
print(s2_s4s5)
```

`s3`

and `(s4, s5)`

is the mean of the distance between `s3`

and `s4`

and the distance between `s3`

and `s5`

:

In [27]:

```
s3_s4s5 = np.mean([master_upgma_dm['s3', 's4'], master_upgma_dm['s3', 's5']])
print(s3_s4s5)
```

In [28]:

```
iter1_dm = [[0.0, 4.0, 2.0, s1_s4s5],
[4.0, 0.0, 3.0, s2_s4s5],
[2.0, 3.0, 0.0, s3_s4s5],
[s1_s4s5, s2_s4s5, s3_s4s5, 0.0]]
iter1_dm = DistanceMatrix(iter1_dm, iter1_ids)
print(iter1_dm)
```

`s1`

and `s3`

. So, we'll draw that clade and set each branch length to half of that distance.

`s1`

and `s3`

are now represented by a single clade, `(s1, s3)`

.

In [29]:

```
iter2_ids = ['(s1, s3)', 's2', '(s4, s5)']
iter2_dm = [[None, None, None],
[None, 0.0, 5.5],
[None, 5.5, 0.0]]
```

`s2`

is the mean of two values, but the distance between our new clade and the clade defined in iteration 1 is the mean of four values. Why is that?

In [30]:

```
s2_s1s3 = np.mean([master_upgma_dm[1][0], master_upgma_dm[1][2]])
s3s4_s1s3 = np.mean([master_upgma_dm[0][3], master_upgma_dm[0][4], master_upgma_dm[2][3], master_upgma_dm[2][4]])
```

We can now fill in all of the distances in our iteration 2 distance matrix.

In [31]:

```
iter2_dm = [[0.0, s2_s1s3, s3s4_s1s3],
[s2_s1s3, 0.0, 5.5],
[s3s4_s1s3, 5.5, 0.0]]
iter2_dm = DistanceMatrix(iter2_dm, iter2_ids)
print(iter2_dm)
```

`(s1, s3)`

and `s2`

. So, we'll draw that clade and set each branch length to half of the distance.

`(s1, s3)`

and the sequence `s2`

are now represented by a single clade, `((s1, s3), s2)`

.

In [32]:

```
iter3_ids = ['((s1, s3), s2)', '(s4, s5)']
iter3_dm = [[None, None],
[None, 0.0]]
```

In [33]:

```
s1s2s3_s4s5 = np.mean([master_upgma_dm[0][3], master_upgma_dm[0][4],
master_upgma_dm[2][3], master_upgma_dm[2][4],
master_upgma_dm[1][3], master_upgma_dm[1][4]])
```

We fill this value into our iteration 3 distance matrix.

In [34]:

```
iter3_dm = [[0.0, s1s2s3_s4s5],
[s1s2s3_s4s5, 0.0]]
iter3_dm = DistanceMatrix(iter3_dm, iter3_ids)
print(iter3_dm)
```

`((s1, s3), s2)`

and `(s4, s5)`

, which will give us our final UPGMA tree.

*wrapper function* that will give this an interface that is convenient to work with. If you'd like to see what the wrapper function is doing, using the `psource`

IPython magic function as we have in other places in the text.

`TreeStyle`

being used here) doesn't have biological meaning, it's purely a visualization component. You can rotate the branches descending from any node in the tree freely.

In [35]:

```
from iab.algorithms import tree_from_distance_matrix
kmer_tree = tree_from_distance_matrix(kmer_dm, metric='upgma')
ete3.Tree(str(kmer_tree), format=1).render("%%inline", tree_style=ts)
```

Out[35]:

In [36]:

```
jc_corrected_hamming_tree = tree_from_distance_matrix(jc_correct_dm(hamming_dm), metric='upgma')
ete3.Tree(str(jc_corrected_hamming_tree), format=1).render("%%inline", tree_style=ts)
```

Out[36]:

*Unweighted Pair Group metric with Arithmetic mean*. The *Unweighted* term indicates that all tip-to-tip distances contribute equally to each average that is computed (no weighted averages are being computed). The *Pair Group* term implies that all internal nodes, including the root node, will be strictly bifurcating, or descent to exactly two other nodes (either internal or terminal). The *Arithmetic mean* term implies that distances to each clade are the mean of distances to all members of that clade.

*The Phylogenetic Handbook*, by Lemey, Salemi, and Vandamme for discussion of this topic. You can also refer to the scikit-bio implementation of Neighbor Joining, which will be used here (the source code is linked from that page).

*ultrametric*. This is not likely to be the case in the real world, as different lineages in the tree might be undergoing different selective pressures, leading to different rates of evolution. Neighboring joining is a distance-based phylogenetic reconstruction approach that does not assume ultrametricity.

In [37]:

```
nj_tree = tree_from_distance_matrix(jc_correct_dm(hamming_dm), metric='nj')
ete3.Tree(str(nj_tree), format=1).render("%%inline", tree_style=ts)
```

Out[37]:

This section is currently a placeholder. You can track progress on this section through issue #119.

*The Phylogenetic Handbook*, by Lemey, Salemi, and Vandamme for discussion of this topic.

*The Phylogenetic Handbook*, by Lemey, Salemi, and Vandamme for discussion of this topic.

**Table of Contents**

*The Phylogenetic Handbook*, by Lemey, Salemi, and Vandamme for discussion of this topic.

*The Phylogenetic Handbook*, by Lemey, Salemi, and Vandamme for discussion of this topic.

*rooted tree*, which means that it includes an assumption about the last common ancestor of all sequences represented in the tree.

**unrooted tree**, like the following, doesn't include an assumption about the last common ancestor of all sequences:

The material in this section was compiled while consulting the following sources:

- The Phylogenetic Handbook (Lemey, Salemi, Vandamme)
- Inferring Phylogeny (Felsenstein)
- Richard Edwards's teaching website