**Table of Contents**

- What is a sequence alignment?
- A simple procedure for aligning a pair of sequences
- Step 1: Create a blank matrix where the rows and columns represent the positions in the sequences.
- Step 2: Add values to the cells in the matrix.
- Step 3: Identify the longest diagonals.
- Step 4: Transcribe some of the possible alignments that arise from this process.
- Why this simple procedure is too simplistic

- Differential scoring of matches and mismatches
- A better approach for global pairwise alignment using the Needleman-Wunsch algorithm
- Global versus local alignment
- Smith-Waterman local sequence alignment
- Differential scoring of gaps
- How long does pairwise sequence alignment take?

*pairwise sequence alignment*, the standard approach for determining sequence similarity.

`r1`

and `r2`

(*r* is for *reference*) and `q1`

(*q* is for *query*) - and you want to know whether `q1`

is more similar to `r1`

or `r2`

. On the surface, it seems like you could just count the number of positions where they differ (i.e., compute the Hamming distance between them) to figure this out. Here's what this would look like.

In [1]:

```
%pylab inline
from __future__ import division, print_function
import numpy as np
from IPython.core.display import HTML
from IPython.core import page
page.page = print
```

In [2]:

```
from scipy.spatial.distance import hamming
from skbio import DNA
r1 = DNA("ACCCAGGTTAACGGTGACCAGGTACCAGAAGGGTACCAGGTAGGACACACGGGGATTAA")
r2 = DNA("ACCGAGGTTAACGGTGACCAGGTACCAGAAGGGTACCAGGTAGGAGACACGGCGATTAA")
q1 = DNA("TTCCAGGTAAACGGTGACCAGGTACCAGTTGCGTTTGTTGTAGGAGACACGGGGACCCA")
%psource hamming
```

In [3]:

```
print(hamming(r1, q1))
print(hamming(r2, q1))
```

`q1`

has a smaller distance to `r1`

than it does to `r2`

, so `q1`

is more similar to `r1`

than `r2`

. But it's not always that simple.

*substitution events* have occurred, meaning one DNA base was substituted with another. Let's define `q2`

, which is the same as `q1`

except that a single base has been deleted at the beginning of the sequence, and a single base has been inserted at the end of the sequence.

In [4]:

```
q2 = DNA("TCCAGGTAAACGGTGACCAGGTACCAGTTGCGTTTGTTGTAGGAGACACGGGGACCCAT")
print(hamming(r1, q2))
```

`q2`

has shifted that sequence relative to `r1`

, which resulted in many of the bases "downstream" of the deleted base being different. However the sequences do still seem fairly similar, so perhaps this relatively large distance isn't biologically justified.

`q2`

. Let's define `q3`

, where we use a `-`

character to indicate a deletion with respect to `r1`

. This results in what seems like a more reasonable distance between the two sequences:

In [5]:

```
q3 = DNA("-TCCAGGTAAACGGTGACCAGGTACCAGTTGCGTTTGTTGTAGGAGACACGGGGACCCA")
print(hamming(r1, q3))
```

`r1`

and `q3`

. In other words, we've **aligned** positions to maximize the similarity of the two sequences, using the `-`

to fill in spaces where one character is missing with respect to the other sequence. We refer to `-`

characters in aligned sequences as **gap characters**, or gaps.

The *alignment* is of these two sequences is clear if we print them out, one on top of the other:

In [6]:

```
print(r1)
print(q3)
```

`-`

character, and about 25% *substitutions* of one base for another.

**mutations**. Some types of mutation events that can occur are:

**substitutions**, where one base (or amino acid, in protein sequences) is replaced with another;**insertions**, where one or more contiguous bases are inserted into a sequence;- and
**deletions**, where one or more contiguous bases are deleted from a sequence.

(Other types of mutation events can occur, but we're going to focus on these for now.)

**homologs** of one another, or homologous sequences. On a piece of paper, make a hypothesis about which of these types of mutation events occurred where over our hypothetical evolution of these sequences.

**The goal of pairwise sequence alignment is, given two sequences, to generate a hypothesis about which sequence positions derived from a common ancestral sequence position.** In practice, we develop this hypothesis by aligning the sequences to one another inserting gaps as necessary, in a way that maximizes their similarity. This is a **maximum parsimony** approach (an application of Occam's razor), where we assume that the simplest explanation (the one involving the fewest or least extreme mutation events) is the most likely.

*hypothesis* about the evolutionary events that took place, a sequence alignment that you might get from a computer program such as BLAST is also only a hypothesis. Which do you think is the most likely alignment of these sequences (note that there may not be a single best answer)?

**indel** used to refer to these either an insertion or deletion events.

**Table of Contents**

- Step 1: Create a blank matrix where the rows and columns represent the positions in the sequences.
- Step 2: Add values to the cells in the matrix.
- Step 3: Identify the longest diagonals.
- Step 4: Transcribe some of the possible alignments that arise from this process.
- Why this simple procedure is too simplistic

Lets define two sequences, `seq1`

and `seq2`

, and develop an approach for aligning them.

In [7]:

```
seq1 = DNA("ACCGGTGGAACCGGTAACACCCAC")
seq2 = DNA("ACCGGTAACCGGTTAACACCCAC")
```

`show_table`

to display a table that we're going to use to develop our alignment. Once a function has been imported, you can view the source code for that function. This will be useful as we begin to explore some of the algorithms that are in use throughout these notebooks. You should spend time reading the source code examples in this book until you're sure that you understand what's happening, especially if your goal is to develop bioinformatics software. Reading other people's code is a good way to improve your own.

Here's how you'd import a function and then view its source code:

In [8]:

```
from iab.algorithms import show_F
```

In [9]:

```
%psource show_F
```

Now let's look at how to align these sequences.

We'll create this matrix and initialize it with all zeros as follows:

In [10]:

```
num_rows = len(seq2)
num_cols = len(seq1)
data = np.zeros(shape=(num_rows, num_cols), dtype=np.int)
HTML(show_F(seq1, seq2, data))
```

Out[10]:

`show_table`

hide the zero values.

In [11]:

```
for row_number, row_character in enumerate(seq2):
for col_number, col_character in enumerate(seq1):
if row_character == col_character:
data[row_number, col_number] = 1
HTML(show_F(seq1, seq2, data, hide_zeros=True))
```

Out[11]:

*diagonals*. Diagonals indicate segments of the two sequences that are identical and uninterrupted by mismatched characters (substitution events) or indel events.

We can identify the longest diagonals as follows:

In [12]:

```
# create a copy of our data matrix to work with, so we
# leave the original untouched.
summed_data = data.copy()
# iterate over the cells in our data matrix, starting in
# the second row and second column
for i in range(1, summed_data.shape[0]):
for j in range(1, summed_data.shape[1]):
# if the value in the current cell is greater than zero
# (i.e., the characters at the corresponding pair of
# sequence positions are the same), add the value from the
# cell that is diagonally up and to the left.
if summed_data[i, j] > 0:
summed_data[i, j] += summed_data[i-1][j-1]
# Identify the longest diagonal
print("The longest diagonal is %d characters long." % summed_data.max())
HTML(show_F(seq1, seq2, summed_data, hide_zeros=True))
```

Out[12]:

Here are two possible alignments:

Alignment 1 (score: 19)

```
ACCGGTGGAACCGG-TAACACCCAC
ACCGGT--AACCGGTTAACACCCAC
```

Alignment 2 (score: 8)

```
ACCGGTGGAACCGGTAACACCCAC
ACCGGT--------TAACACCCAC
```

**As an exercise**, go back to where we defined `seq1`

and `seq2`

and re-define one or both of those as other sequences. Execute the code through here and see how the matrices change.

- We're scoring all matches as 1 and all mismatches as 0. This suggests that all matches are equally likely, and all mismatches are equally unlikely. What's a more biologically meaningful way to do this (think about protein sequences here)?
- Similarly, every gap that is introduced results in the same penalty being incurred. Based on what we know about how insertion/deletion events occur, what do you think is a more biologically meaningful way to do this?

*residues*, for short) to one another, instead of nucleotides.

*Molecule of the Month* series. Spend a couple of minutes reading about it here.

**substitution matrix**, which defines the score associated with substitution of one amino acid for another. A widely used substitution matrix is referred to as BLOSUM 50. Let's take a look at this matrix:

In [13]:

```
from iab.algorithms import blosum50, show_substitution_matrix
aas = list(blosum50.keys())
aas.sort()
data = []
for aa1 in aas:
row = []
for aa2 in aas:
row.append(blosum50[aa1][aa2])
data.append(row)
aa_labels = ''.join(aas)
HTML(show_substitution_matrix(aa_labels, data))
```

Out[13]:

You can look up individual substitution scores as follows:

In [14]:

```
print(blosum50['A']['G'])
print(blosum50['G']['A'])
print(blosum50['W']['K'])
print(blosum50['A']['A'])
print(blosum50['W']['W'])
```

*A Model of Evolutionary Change in Proteins.* Atlas of Protein Sequence and Structure) and by Henikoff and Henikoff in the early 1990s. Briefly, these matrices are often defined empirically, by aligning sequences manually or through automated systems, and counting how frequent certain substitutions are. This is a good article on the source of the widely used substitution matrices by Sean Eddy. We'll work with BLOSUM 50 here for the remainder of this chapter.

*Needleman-Wunsch alignment*. This performs what is known as *global alignment*, meaning that both sequences are aligned from their first residue (or base) through their last residue (or base). We'll contrast this later in this chapter with local alignment.

In [15]:

```
from skbio import Protein
seq1 = Protein("HEAGAWGHEE")
seq2 = Protein("PAWHEAE")
```

As we discussed earlier in this chapter, a pair of sequences can be aligned in different ways. Needleman-Wunsch provides the best alignment, as defined by its score. Here we'll compute two new matrices that together allow us to determine the highest alignment score given the sequences and the substitution matrix, and to transcribe the aligned sequences. These matrices are

- the
*dynamic programming matrix*, or $F$ - and the
*traceback matrix*, or $T$.

$F$ and $T$ are defined at the same time.

*state* that is independent of the first residue of the sequences that is important for our algorithm). $F$ keeps track of the best score of the alignment through the corresponding pair of positions, if the alignment were to terminate at that pair of positions.

Prior to initialization, $F$ and $T$ would look like the following.

In [16]:

```
num_rows = len(seq2) + 1
num_cols = len(seq1) + 1
F = np.zeros(shape=(num_rows, num_cols), dtype=np.int)
HTML(show_F(seq1, seq2, F))
```

Out[16]:

In [17]:

```
from iab.algorithms import show_T
T = np.full(shape=(num_rows, num_cols), fill_value=" ", dtype=np.str)
HTML(show_T(seq1, seq2, T))
```

Out[17]:

*gap penalty*. This is a constant value that is subtracted from the score of the alignment every time a gap character has to be introduced to align the sequences. We'll use a constant value of $d=8$ (it's positive because we subtract it) for now, and explore its use more shortly. $i$ is the row number in $F$, and $j$ is the column number in $F$.

$$
\begin{align}
& F(0, 0) = 0\\
& F(i, 0) = F(i-1, 0) - d\\
& F(0, j) = F(0, j-1) - d\\
\end{align}
$$

Initializing $F$ would result in the following.

In [18]:

```
d = 8
F[0][0] = 0
for i in range(1, num_rows):
F[i][0] = F[i-1][0] - d
for j in range(1, num_cols):
F[0][j] = F[0][j-1] - d
HTML(show_F(seq1, seq2, F))
```

Out[18]:

Initializing $T$ would result in the following.

In [19]:

```
T[0][0] = "•"
for i in range(1, num_rows):
T[i][0] = "↑"
for j in range(1, num_cols):
T[0][j] = "←"
HTML(show_T(seq1, seq2, T))
```

Out[19]:

`seq1`

and `seq2`

.

$$
F(i, j) = max \left(\begin{align}
& F(i-1, j-1) + s(c_i, c_j)\\
& F(i-1, j) - d\\
& F(i, j-1) - d
\end{align}\right)
$$

In [20]:

```
from iab.algorithms import format_dynamic_programming_matrix, format_traceback_matrix
from skbio.alignment._pairwise import _compute_score_and_traceback_matrices
%psource _compute_score_and_traceback_matrices
```

`seq1`

and `seq2`

to compute the dynamic programming and traceback matrices.

In [21]:

```
from skbio.sequence import Protein
from skbio.alignment import TabularMSA
seq1 = TabularMSA([seq1])
seq2 = TabularMSA([seq2])
nw_matrix, traceback_matrix = _compute_score_and_traceback_matrices(
seq1, seq2, 8, 8, blosum50)
HTML(show_F(seq1[0], seq2[0], nw_matrix))
```

Out[21]:

In [22]:

```
HTML(show_T(seq1[0], seq2[0], traceback_matrix))
```

Out[22]:

- Every time we encounter a vertical arrow, we consume a character from sequence 2 (the vertical sequence) and add a gap to sequence 1.
- Every time we encounter a horizontal arrow, we consume a character from sequence 1 (the horizontal sequence) and add a gap to sequence 2.
- Every time we encounter a diagonal arrow, we consume a character from sequence 1 and sequence 2.
- When we encounter a bullet, we've reached the end of the alignment so we're done.

In [23]:

```
from skbio.alignment._pairwise import _traceback
%psource _traceback
```

In [24]:

```
aln1, aln2, score, _, _ = _traceback(traceback_matrix,nw_matrix,seq1,seq2, nw_matrix.shape[0]-1, nw_matrix.shape[1]-1)
print(aln1[0])
print(aln2[0])
print(score)
```

*Application Programmer Interface*, or *API* for a function. Defining APIs is a bit of an art and a bit of a science, and there are great APIs and horrible APIs. API definition is hard, and it's something that you get better at with practice. Spending time thinking about APIs is important for developers, as it's how your users will interact with your code. There is a lot of good code out there that no one uses because it has a bad API.

In [25]:

```
from skbio.alignment import global_pairwise_align
%psource global_pairwise_align
```

In [26]:

```
aln, score, _ = global_pairwise_align(Protein("HEAGAWGHEE"), Protein("PAWHEAE"), 8, 8, blosum50, penalize_terminal_gaps=True)
print(aln)
print(score)
```

*global alignment*, meaning we align both sequences from their beginning through their end. This has some important specific applications: for example, if we have two full-length protein sequences, and we have a crystal structure for one of them, we can use global alignment to give us a direct mapping between all positions in both sequences.

In [27]:

```
from skbio import Protein
seq1 = Protein("HEAGAWGHEE")
seq2 = Protein("PAWHEAE")
```

In [28]:

```
num_rows = len(seq2) + 1
num_cols = len(seq1) + 1
F = np.zeros(shape=(num_rows, num_cols), dtype=np.int)
HTML(show_F(seq1, seq2, F))
```

Out[28]:

In [29]:

```
from iab.algorithms import show_T
T = np.full(shape=(num_rows, num_cols), fill_value=" ", dtype=np.str)
HTML(show_T(seq1, seq2, T))
```

Out[29]:

$$
\begin{align}
& F(0, 0) = 0\\
& F(i, 0) = 0\\
& F(0, j) = 0
\end{align}
$$

Initializing $F$ would therefore result in the following.

In [30]:

```
d = 8
F[0][0] = 0
for i in range(1, num_rows):
F[i][0] = 0
for j in range(1, num_cols):
F[0][j] = 0
HTML(show_F(seq1, seq2, F))
```

Out[30]:

In [31]:

```
T[0][0] = "•"
for i in range(1, num_rows):
T[i][0] = "•"
for j in range(1, num_cols):
T[0][j] = "•"
HTML(show_T(seq1, seq2, T))
```

Out[31]:

$$
F(i, j) = max \left(\begin{align}
& 0\\
& F(i-1, j-1) + s(c_i, c_j)\\
& F(i-1, j) - d\\
& F(i, j-1) - d)
\end{align}\right)
$$

In [32]:

```
from skbio.alignment._pairwise import _init_matrices_sw
seq1 = TabularMSA([seq1])
seq2 = TabularMSA([seq2])
sw_matrix, traceback_matrix = _compute_score_and_traceback_matrices(
seq1, seq2, 8, 8, blosum50, new_alignment_score=0.0,
init_matrices_f=_init_matrices_sw)
HTML(show_F(seq1[0], seq2[0], sw_matrix))
```

Out[32]:

In [33]:

```
HTML(show_T(seq1[0], seq2[0], traceback_matrix))
```

Out[33]:

In [34]:

```
max_value = 0.0
max_i = 0
max_j = 0
for i in range(sw_matrix.shape[0]):
for j in range(sw_matrix.shape[1]):
if sw_matrix[i, j] > max_value:
max_i, max_j = i, j
max_value = sw_matrix[i, j]
aln1, aln2, score, start_a1, start_a2 = _traceback(traceback_matrix, sw_matrix, seq1, seq2, max_i, max_j)
print(aln1[0])
print(aln2[0])
print(score)
```

*convenience function*, which will allow us to provide the required input and just get our aligned sequences back.

In [35]:

```
from skbio.alignment import local_pairwise_align
%psource local_pairwise_align
```

*convenience function* one step further, and wrap `local_pairwise_align`

and `global_pairwise_align`

up in a more general `align`

function, which takes a boolean parameter (i.e., `True`

or `False`

) indicating where we want a local or global alignment.

In [36]:

```
def align(sequence1, sequence2, gap_penalty, substitution_matrix, local):
if local:
return local_pairwise_align(sequence1, sequence2, gap_penalty, gap_penalty, substitution_matrix)
else:
return global_pairwise_align(sequence1, sequence2, gap_penalty, gap_penalty, substitution_matrix)
```

In [37]:

```
aln, score, _ = align(Protein('HEAGAWGHEE'), Protein('PAWHEAE'), 8, blosum50, True)
print(aln)
print(score)
```

In [38]:

```
aln, score, _ = align(Protein('HEAGAWGHEE'), Protein('PAWHEAE'), 8, blosum50, False)
print(aln)
print(score)
```

*will* be able to carry out those steps (a metric of your understanding of the algorithm) as long as you put a bit of effort into performing those steps.

*affine gap scoring*.

*gap extension penalty* if the value in $T$ that the new arrow will point to is the same type of arrow. Otherwise, we should incur the *gap open penalty*. If we represent our gap open penalty as $d^0$, and our gap extend penalty as $d^e$, our scoring scheme would look now like the following:

$$
F(i, j) = max \left(\begin{align}
& 0\\
& F(i-1, j-1) + s(c_i, c_j)\\
& \left\{\begin{array}{l l} F(i-1, j) - d^e \quad \text{if $T(i-1, j)$ is ↑}\\ F(i-1, j) - d^o \quad \text{if $T(i-1, j)$ is not ↑} \end{array} \right\} \\
& \left\{\begin{array}{l l} F(i, j-1) - d^e \quad \text{if $T(i, j-1)$ is ←}\\ F(i, j-1) - d^o \quad \text{if $T(i, j-1)$ is not ←} \end{array} \right\}
\end{align}\right)
$$

Take a look at how the scores differ with these additions.

In [39]:

```
seq1 = TabularMSA([Protein("HEAGAWGHEE")])
seq2 = TabularMSA([Protein("PAWHEAE")])
sw_matrix, traceback_matrix = _compute_score_and_traceback_matrices(seq1, seq2, 8, 1, blosum50)
HTML(show_F(seq1[0], seq2[0], sw_matrix))
```

Out[39]:

In [40]:

```
HTML(show_T(seq1[0], seq2[0], traceback_matrix))
```

Out[40]:

`gap_open_penalty`

and `gap_extend_penalty`

, which we can see by calling `help`

on the function. So, we can use those functions to explore sequence alignment with affine gap scoring.

In [41]:

```
help(global_pairwise_align)
```

`seq1`

to be slightly different than what I have above. Notice how we get different alignments when we use affine gap penalties (i.e., `gap_extend_penalty`

is not equal to `gap_open_penalty`

) versus equal gap open and gap extend penalties.

In [42]:

```
seq1 = TabularMSA([Protein("HEAGAWGFHEE")])
seq2 = TabularMSA([Protein("PAWHEAE")])
```

In [43]:

```
aln, score, _ = global_pairwise_align(seq1, seq2, 8, 8, blosum50)
print(aln)
print(score)
```

In [44]:

```
aln, score, _ = global_pairwise_align(seq1, seq2, 8, 1, blosum50)
print(aln)
print(score)
```

`help`

function - that's essential for learning how to use a function.

*applied* bioinformatics, and two of the practical considerations we need to think about when developing algorithms and applications is how long they'll take to run, and how much system memory (or RAM) they'll require. Both of these can be limiting factors for applications that require sequence alignments, so a lot of effort is spent understanding how to optimize sequence alignment.

`timeit`

. This allows us to conveniently run a given command many times and reports the average time it takes to run. We'll use this to see how long local alignment takes to run. Note that we don't care about getting the actual alignment back for the moment. We just want the runtime in seconds.

*benchmark* the runtime of the scikit-bio `local_pairwise_align_nucleotide`

function. This specifically performs nucleotide alignment, and is implemented in Python.

In [45]:

```
from skbio.alignment import local_pairwise_align_nucleotide
seq1 = DNA("GGTCTTCGCTAGGCTTTCATCGGGTTCGGCATCTACTCTGAGTTACTACG")
seq2 = DNA("GGTCTTCAGGCTTTCATCGGGAACGGCATCTCTGAGTTACTACC")
%timeit local_pairwise_align_nucleotide(seq1, seq2, gap_open_penalty=8, gap_extend_penalty=1)
```

In [46]:

```
from skbio.alignment import local_pairwise_align_ssw
%timeit local_pairwise_align_ssw(seq1, seq2)
```

`local_pairwise_align_ssw`

function is much faster for performing alignment than `local_pairwise_align_nucleotide`

(be sure to compare the units of each run time!). This is because `local_pairwise_align_ssw`

is a much more efficient implementation of Smith-Waterman alignment, and it additionally applies some cool tricks that allow it to not compute all of the values in $F$ and $T$, but still get the right answer most of the time. This is referred to as a *heuristic* approach to optimizing an algorithm. We'll spend some time defining and comparing heuristics in the Database Searching chapter, but to get you to start thinking about it, what you care about is how much a heuristic reduces the run time of an algorithm, and how often it gives you the same answer as the full algorithm. Take a minute to compare the results of the two functions we just ran:

In [47]:

```
msa, _, _ = local_pairwise_align_nucleotide(seq1, seq2, gap_open_penalty=8, gap_extend_penalty=1)
msa
```

Out[47]:

In [48]:

```
msa, _, _ = local_pairwise_align_ssw(seq1, seq2)
msa
```

Out[48]:

How do the results look?

`local_pairwise_align_ssw`

is generally producing comparable results to `local_pairwise_align_nucleotide`

, and you can clearly see that it's running much faster. So, we'll use that implementation in this text when we need to perform fast local alignments.

`random`

module to get random pairs of sequences.

In [49]:

```
import numpy as np
def random_sequence(moltype, length):
result = []
# Our "alphabet" here will consist of the standard characters in a
# molecules alphabet.
alphabet = list(moltype.nondegenerate_chars)
for e in range(length):
result.append(np.random.choice(alphabet))
return moltype(''.join(result))
```

In [50]:

```
print(random_sequence(DNA, 10))
print(random_sequence(DNA, 10))
print(random_sequence(DNA, 25))
print(random_sequence(DNA, 50))
```

`timeit`

module (which the `%timeit`

magic function is based on).

In [51]:

```
import timeit
times = []
seq_lengths = range(5000,110000,20000)
def get_time_function(seq_length):
def f():
seq1 = random_sequence(DNA, seq_length)
seq2 = random_sequence(DNA, seq_length)
local_pairwise_align_ssw(seq1, seq2)
return f
for seq_length in seq_lengths:
times.append(min(timeit.Timer(get_time_function(seq_length)).repeat(repeat=3, number=3)))
```

If we look at the run times, we can see that they are increasing with increasing sequence lengths:

In [52]:

```
import pandas as pd
runtimes = pd.DataFrame(data=np.asarray([seq_lengths, times]).T, columns=["Sequence length", "Runtime (s)"] )
runtimes
```

Out[52]:

That's probably to be expected, but what we care about now is *how* the runtimes are increasing as a function of sequence length. Is the relationship between runtime and sequence length:

- linear: $runtime \approx constant \times sequence\ length$
- quadratic: $runtime \approx constant \times {sequence\ length}^2$
- exponential: $runtime \approx {constant}^{sequence\ length}$
- or something else?

In [53]:

```
import seaborn as sns
ax = sns.regplot(x="Sequence length", y="Runtime (s)", data=runtimes, fit_reg=False)
ax.set_xlim(0)
ax.set_ylim(0)
ax
```

Out[53]:

`timeit`

), when we double our sequence length we have four times as many cells to compute.

In [54]:

```
# if we could split this process over more processors (four, for example)
# that would effectively reduce the runtime by 1/4
parallel_runtimes = pd.DataFrame(data=np.asarray([seq_lengths, [t/4 for t in times]]).T, columns=["Sequence length", "Runtime (s)"] )
parallel_runtimes
ax = sns.regplot(x="Sequence length", y="Runtime (s)", data=parallel_runtimes, fit_reg=False)
ax.set_xlim(0)
ax.set_ylim(0)
ax
```

Out[54]:

*computational complexity* (or how its runtime scales as a function of its input size). You can explore the computational complexity of different types of algorithms in the Big-O Cheat Sheet, though it's a fairly advanced introduction to the topic (and one that's usually covered in the second or third year for Computer Science majors).