1.1 Reading An Introduction to Applied Bioinformatics

Bioinformatics, as I see it, is the application of the tools of computer science (things like programming languages, algorithms, and databases) to address biological problems (for example, inferring the evolutionary relationship between a group of organisms based on fragments of their genomes, or understanding if or how the community of microorganisms that live in my gut changes if I modify my diet). Bioinformatics is a rapidly growing field, largely in response to the vast increase in the quantity of data that biologists now grapple with. Students from varied disciplines (e.g., biology, computer science, statistics, and biochemistry) and stages of their educational careers (undergraduate, graduate, or postdoctoral) are becoming interested in bioinformatics.

An Introduction to Applied Bioinformatics, or IAB, is an open source, interactive bioinformatics text. It introduces readers to the core concepts of bioinformatics in the context of their implementation and application to real-world problems and data. IAB is closely tied to the scikit-bio Python package, which provides production-ready implementations of core bioinformatics algorithms and data structures. Readers therefore learn the concepts in the context of tools they can use to develop their own bioinformatics software and pipelines, enabling them to rapidly get started on their own projects. While some theory is discussed, the focus of IAB is on what readers need to know to be effective, practicing bioinformaticians.

IAB is interactive, being based on IPython Notebooks, which can be installed on a reader's computer or viewed statically online. As readers are learning a concept, for example, pairwise sequence alignment, they are presented with its scikit-bio implementation directly in the text. scikit-bio code is well annotated (adhering to the PEP 8 and numpydoc conventions), so readers can use it to assist with their understanding of the concept. And, because IAB is presented as an IPython Notebook, readers can execute the code directly in the text. For example, when learning pairwise alignment, readers can align sequences provided in IAB (or their own sequences) and modify parameters (or even the algorithm itself) to see how changes affect the resulting alignments.

IAB is completely open access, with all software being BSD-licensed, and all text being licensed under Creative Commons Attribution-NonCommercial-ShareAlike (i.e., CC BY-NC-SA 4.0). All development and publication is coordinated under public revision control on GitHub.

IAB is also an electronic-only resource. There are currently no plans to commercialize it or to create a print version. This means that, unlike printed bioinformatics texts, which are generally out of date before the ink dries, IAB can be updated as the field changes.

The life cycle of IAB is more like a software package than a book. There will be development and release versions of IAB, where the release versions are more polished but won't always contain the latest content, and the development versions will contain all of the latest materials, but won't necessarily be copy-edited and polished.

We are in the process of developing a project status page that will detail the plans for IAB. This will include the full table of contents, and the stage you can expect each chapter to have reached at different times. You can track progress of this on IAB #97.

My goal for IAB is for it to make bioinformatics as accessible as possible to students from varied backgrounds, and to get more people into this hugely exciting field. I'm very interested in hearing from readers and instructors who are using IAB, so get in touch if you have corrections, suggestions for how to improve the content, or any other thoughts or comments on the text. In the spirit of openness, I'd prefer to be contacted via the IAB issue tracker. I'll respond to direct e-mail as well, but I'm always backlogged (just ask my students), so e-mail responses are likely to be slower.

I hope you find IAB useful, and that you enjoy reading it!

1.1.1 Who should read IAB?

IAB is written for scientists, software developers, and students interested in understanding and applying bioinformatics methods, and ultimately in developing their own bioinformatics analysis pipelines or software.

IAB was initially developed for an undergraduate course cross-listed in computer science and biology with no prerequisites. It therefore assumes little background in biology or computer science; however, some basic background is very helpful. For example, an understanding of the roles of and relationship between DNA and protein in a cell, and the ability to read and follow well-annotated Python code, are both helpful (but not necessary) to get started.

In the Getting started with Biology and Computer Science sections below I provide some suggestions for other texts that will help you to get started.

1.1.2 How to read IAB

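There are two ways to read An Introduction To Applied Bioinformatics: you can view the notebooks statically online, or you can install them on your own computer and run them interactively.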

In both cases you have the option to read the latest development version (i.e., master of the GitHub repository) or the latest release version. The latest development version will have all of the most recent content, but since it hasn't yet been officially released, some aspects may be less polished or buggy. The latest release version won't necessarily have the latest content, but it should be more polished and less buggy. If you're reading IAB on your own, I recommend working with the latest development version. If you're teaching a class that uses IAB, you probably should use the latest release version.

IAB is split into four different sections: Getting started, Fundamentals, Applications, and Wrapping up. You should start reading IAB by working through the Getting started and Fundamentals chapters in order. You should then read the Applications chapters and Wrapping up in any order, based on your own interest.

1.1.3 Using the IPython Notebook

IAB is built using the IPython Notebook, an interactive HTML-based computing environment. The main source for information about the IPython Notebook is the IPython Notebook website. You can find information there on how to use the IPython Notebook, and also on how to set up and run an IPython Notebook server (for example, if you'd like to make one available to your students).

Most of the code that is used in IAB comes from the scikit-bio package, or from other Python scientific computing tools. You can access these in the same way that you would in a Python script. For example:

In [1]:
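# __future__ imports must come before any other statement in the cell
from __future__ import print_function
import skbio
from IPython.core import page
# send pager output (e.g., from %psource) to print so it appears inline
page.page = print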

We can then access functions, variables, and classes from these modules.

In [2]:
*                                                    *
               _ _    _ _          _     _
              (_) |  (_) |        | |   (_)
      ___  ___ _| | ___| |_ ______| |__  _  ___
     / __|/ __| | |/ / | __|______| '_ \| |/ _ \
     \__ \ (__| |   <| | |_       | |_) | | (_) |
     |___/\___|_|_|\_\_|\__|      |_.__/|_|\___/

*                                                    *

                   \  Amoebozoa
                    \ /
                     *    Euryarchaeota
                      \     |_ Crenarchaeota
                       \   *
                        \ /
                    / \
                   /   \
        Proteobacteria  \

We'll inspect a lot of source code in IAB as we explore bioinformatics algorithms. If you're ever interested in seeing the source code for some functionality that we're using, you can do that using IPython's psource magic.

In [3]:
from skbio.alignment import TabularMSA
%psource TabularMSA.conservation
    def conservation(self, metric='inverse_shannon_uncertainty',
                     degenerate_mode='error', gap_mode='nan'):
        """Apply metric to compute conservation for all alignment positions

        Parameters
        ----------
        metric : {'inverse_shannon_uncertainty'}, optional
            Metric that should be applied for computing conservation. Resulting
            values should be larger when a position is more conserved.
        degenerate_mode : {'nan', 'error'}, optional
            Mode for handling positions with degenerate characters. If
            ``"nan"``, positions with degenerate characters will be assigned a
            conservation score of ``np.nan``. If ``"error"``, an
            error will be raised if one or more degenerate characters are
            present.
        gap_mode : {'nan', 'ignore', 'error', 'include'}, optional
            Mode for handling positions with gap characters. If ``"nan"``,
            positions with gaps will be assigned a conservation score of
            ``np.nan``. If ``"ignore"``, positions with gaps will be filtered
            to remove gaps before ``metric`` is applied. If ``"error"``, an
            error will be raised if one or more gap characters are present. If
            ``"include"``, conservation will be computed on alignment positions
            with gaps included. In this case, it is up to the metric to ensure
            that gaps are handled as they should be or to raise an error if
            gaps are not supported by that metric.

        Returns
        -------
        np.array of floats
            Values resulting from the application of ``metric`` to each
            position in the alignment.

        Raises
        ------
        ValueError
            If an unknown ``metric``, ``degenerate_mode`` or ``gap_mode`` is
            provided.
        ValueError
            If any degenerate characters are present in the alignment when
            ``degenerate_mode`` is ``"error"``.
        ValueError
            If any gaps are present in the alignment when ``gap_mode`` is
            ``"error"``.

        Notes
        -----
        Users should be careful interpreting results when
        ``gap_mode = "include"`` as the results may be misleading. For example,
        as pointed out in [1]_, a protein alignment position composed of 90%
        gaps and 10% tryptophans would score as more highly conserved than a
        position composed of alanine and glycine in equal frequencies with the
        ``"inverse_shannon_uncertainty"`` metric.

        ``gap_mode = "include"`` will result in all gap characters being
        recoded to ``TabularMSA.dtype.default_gap_char``. Because no
        conservation metrics that we are aware of consider different gap
        characters differently (e.g., none of the metrics described in [1]_),
        they are all treated the same within this method.

        The ``inverse_shannon_uncertainty`` metric is simply one minus
        Shannon's uncertainty metric. This method uses the inverse of Shannon's
        uncertainty so that larger values imply higher conservation. Shannon's
        uncertainty is also referred to as Shannon's entropy, but when making
        computations from symbols, as is done here, "uncertainty" is the
        preferred term ([2]_).

        References
        ----------
        .. [1] Valdar WS. Scoring residue conservation. Proteins. (2002)
        .. [2] Schneider T. Pitfalls in information theory (website, ca. 2015).

        """

        if gap_mode not in {'nan', 'error', 'include', 'ignore'}:
            raise ValueError("Unknown gap_mode provided: %s" % gap_mode)

        if degenerate_mode not in {'nan', 'error'}:
            raise ValueError("Unknown degenerate_mode provided: %s" %
                             degenerate_mode)

        if metric not in {'inverse_shannon_uncertainty'}:
            raise ValueError("Unknown metric provided: %s" %
                             metric)

        if self.shape[0] == 0:
            # handle empty alignment to avoid error on lookup of character sets
            return np.array([])

        # Since the only currently allowed metric is
        # inverse_shannon_uncertainty, and we already know that a valid metric
        # was provided, we just define metric_f here. When additional metrics
        # are supported, this will be handled differently (e.g., via a lookup
        # or if/elif/else).
        metric_f = self._build_inverse_shannon_uncertainty_f(
                        gap_mode == 'include')

        result = []
        for p in self.iter_positions(ignore_metadata=True):
            cons = None
            # cast p to self.dtype for access to gap/degenerate related
            # functionality
            pos_seq = self.dtype(p)

            # handle degenerate characters if present
            if pos_seq.has_degenerates():
                if degenerate_mode == 'nan':
                    cons = np.nan
                else:  # degenerate_mode == 'error' is the only choice left
                    degenerate_chars = pos_seq[pos_seq.degenerates()]
                    raise ValueError("Conservation is undefined for positions "
                                     "with degenerate characters. The "
                                     "following degenerate characters were "
                                     "observed: %s." % degenerate_chars)

            # handle gap characters if present
            if pos_seq.has_gaps():
                if gap_mode == 'nan':
                    cons = np.nan
                elif gap_mode == 'error':
                    raise ValueError("Gap characters present in alignment.")
                elif gap_mode == 'ignore':
                    pos_seq = pos_seq.degap()
                else:  # gap_mode == 'include' is the only choice left
                    # Recode all gap characters with pos_seq.default_gap_char.
                    pos_seq = pos_seq.replace(pos_seq.gaps(),
                                              pos_seq.default_gap_char)

            if cons is None:
                cons = metric_f(pos_seq)

            result.append(cons)

        return np.array(result)
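
The psource magic is only available inside IPython. In a plain Python script, the standard library's inspect module offers a rough equivalent. The sketch below is not part of IAB: it uses textwrap.dedent as a stand-in target so that it runs without scikit-bio installed, and it assumes the target is implemented in Python (source text is generally not available for compiled functions).

```python
import inspect
import textwrap

# Fetch the source text of a pure-Python function from the standard library.
# textwrap.dedent is only a stand-in here; any Python-implemented function,
# method, or class could be passed instead.
source = inspect.getsource(textwrap.dedent)

# Show just the first line (the function signature) to keep the output short.
print(source.splitlines()[0])
```

If scikit-bio is installed, passing TabularMSA.conservation to inspect.getsource should give text similar to the psource output above.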

The documentation for scikit-bio is also very extensive (though the package itself is still in early development). For example, you can view the documentation for the TabularMSA object in the scikit-bio API documentation. These documents will be invaluable for learning how to use the objects.
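
To make the docstring's description of the inverse_shannon_uncertainty metric more concrete, here is a minimal standalone sketch of the idea. It is not scikit-bio's implementation. The function name, the alphabet_size parameter, and the choice of the alphabet size as the logarithm base (so that scores fall between 0 and 1) are assumptions made for illustration, and the code assumes Python 3. The two hypothetical columns reproduce the caveat from the docstring's notes: when gaps are counted as a character, a column of 90% gaps and 10% tryptophan scores higher than one with alanine and glycine in equal frequencies.

```python
import math
from collections import Counter


def inverse_shannon_uncertainty(column, alphabet_size):
    """Return 1 - H(column), with H computed in log base `alphabet_size`.

    Illustrative sketch only; `alphabet_size` is a hypothetical parameter.
    """
    counts = Counter(column)
    total = len(column)
    uncertainty = 0.0
    for count in counts.values():
        freq = count / total
        uncertainty -= freq * math.log(freq, alphabet_size)
    return 1.0 - uncertainty


# Hypothetical protein alignment columns. With gaps included, the alphabet
# has 21 symbols: 20 amino acids plus one gap character.
mostly_gaps = "-" * 9 + "W"   # 90% gaps, 10% tryptophan
ala_gly = "A" * 5 + "G" * 5   # alanine and glycine in equal frequencies

gap_score = inverse_shannon_uncertainty(mostly_gaps, alphabet_size=21)
ala_gly_score = inverse_shannon_uncertainty(ala_gly, alphabet_size=21)
print(round(gap_score, 3), round(ala_gly_score, 3))
```

A column containing a single repeated character has zero uncertainty and therefore scores 1.0, the maximum, which is why larger values imply higher conservation.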

1.1.4 Reading list

Getting started with Biology

If you're new to biology, these are some books and resources that will help you get started.

  • The NIH Bookshelf A lot of free biology texts, some obviously better than others.
  • Brock Biology of Microorganisms by Michael T. Madigan, John M. Martinko, David Stahl, David P. Clark. One of the best textbooks on microbiology. This is also fairly advanced, but if you're interested in microbial ecology or other aspects of microbiology it will likely be extremely useful.

Getting started with Computer Science and programming

If you're new to Computer Science and programming, these are some books and resources that will help you get started.

  • Software Carpentry Online resources for learning scientific computing skills, and regular in-person workshops all over the world. Taking a Software Carpentry workshop will pay off for biology students interested in a career in research.
  • Practical Computing for Biologists by Steven Haddock and Casey Dunn. A great introduction to many computational skills that are required of modern biologists. I highly recommend this book to all Biology undergraduate and graduate students.
  • The Pragmatic Programmer by Andrew Hunt and David Thomas. A more advanced book on becoming a better programmer. This book is excellent, and I highly recommend it for anyone developing bioinformatics software. You should know how to program and have done some software development before jumping into this.

These are some books that I've enjoyed, that have also helped me think about biological systems. These are generally written for a more popular audience, so should be accessible to any readers of An Introduction to Applied Bioinformatics.

  • Ever Since Darwin by Stephen Jay Gould. This is the first book in a series of collections of short essays.

1.1.5 Need help?

If you're having issues getting An Introduction to Applied Bioinformatics running on your computer, or you have corrections or suggestions on the content, you should get in touch through the IAB issue tracker. This will generally be much faster than e-mailing the author directly, as there are multiple people who monitor the issue tracker. It also helps us manage our technical support load if we can consolidate all requests and responses in one place.

1.1.6 Contributing to IAB

If you're interested in contributing content or features to IAB, you should start by reviewing CONTRIBUTING.md, which provides guidelines on how to get involved.

1.1.7 About the author

My name is Greg Caporaso. I'm the primary author of An Introduction to Applied Bioinformatics, but there are other contributors and I hope that list will grow.

I have degrees in Computer Science (B.S., University of Colorado, 2001) and Biochemistry (B.A., University of Colorado, 2004; Ph.D., University of Colorado, 2009). Following my formal training, I joined the Rob Knight Laboratory, then at the University of Colorado, for approximately 2 years as a post-doctoral scholar. In 2011, I joined the faculty at Northern Arizona University (NAU), where I'm an Assistant Professor in the Biological Sciences department. I teach one course per year in bioinformatics for graduate and undergraduate students of Biology and Computer Science. I also run a research lab in the Center for Microbial Genetics and Genomics, which is focused on developing bioinformatics software and studying the human microbiome.

I'm not the world expert on the topics that I present in IAB, but I have a passion for bioinformatics, open source software, writing, and education. When I'm learning a new bioinformatics concept, for example an algorithm like pairwise alignment or a statistical technique like Monte Carlo simulation, implementing it is usually the best way for me to wrap my head around it. This led me to start developing IAB, as I found that my implementations helped my students learn the concepts too. I think that one of my strongest skills is the ability to break complex ideas into accessible components. I do this well for bioinformatics because I remember (and still regularly experience) the challenges of learning it, so can relate to newcomers in the field.

I'm very active in open source bioinformatics software development. I am most widely known for my involvement in the development of the QIIME software package, and more recently for leading the development of scikit-bio. I am also involved in many other bioinformatics software projects (see my GitHub page). IAB is one of the projects that I'm currently most excited about, so I truly hope that it's as useful for you as it is fun for me.

For updates on IAB and various other things, you should follow me on Twitter.

1.1.8 Acknowledgements

An Introduction to Applied Bioinformatics is funded in part by the Alfred P. Sloan Foundation. Initial prototyping was funded by Arizona's Technology and Research Initiative Fund. The style of the project was inspired by Bayesian Methods for Hackers.

See the repository's contributors page for information on who has contributed to the project.