Hypothesis-Bio
Hypothesis-Bio is a Hypothesis extension for property-based testing of bioinformatic software.
Automates the testing process to validate the correctness of bioinformatics tools by generating a wide range of test cases beyond human testers. Finds and returns the minimal error test case that causes an exception.
Features
This module provides a Hypothesis strategy for generating biological data formats. This can be used to efficiently and thoroughly test your code.
Currently supports DNA, RNA, protein, CDS, k-mers, FASTA, & FASTQ formats.
Quick Start
Basic Example
So what exactly does Hypothesis-Bio do? Let's look at some example code that calculates GC-content:
def gc_content(seq):
return (seq.count("G") + seq.count("C")) / len(seq)
(Can you spot the bug in the code?)
Now let's use Hypothesis-Bio to find the bug. To do so, we specify a property about our code that we expect to hold true over all examples. In this case, GC-content is a percentage, so we know it should always be between 0 and 1. We can encode that requirement into a test:
from hypothesis import given
from hypothesis_bio import dna
@given(dna())
def test_gc_content(seq):
assert 0 <= gc_content(seq) <= 1
When we run the test (by calling test_gc_content
), we get the following output:
Falsifying example: test_gc_content(seq='')
ZeroDivisionError: division by zero
Aha!
When given an empty sequence, our simple gc_content
calculator raises an error.
This simple example shows the power of property-based testing.
Instead of hard coding inputs and output examples, we can let Hypothesis-Bio do the hard work for us.
Another Example
We saw that Hypothesis-Bio can catch simple bugs like a division by zero error, but it can do so much more than that. Let's consider another function that translates from DNA to protein:
genetic_code = {
"ATA": "I", "ATC": "I", "ATT": "I", "ATG": "M", "ACA": "T", "ACC": "T", "ACG": "T", "ACT": "T",
"AAC": "N", "AAT": "N", "AAA": "K", "AAG": "K", "AGC": "S", "AGT": "S", "AGA": "R", "AGG": "R",
"CTA": "L", "CTC": "L", "CTG": "L", "CTT": "L", "CCA": "P", "CCC": "P", "CCG": "P", "CCT": "P",
"CAC": "H", "CAT": "H", "CAA": "Q", "CAG": "Q", "CGA": "R", "CGC": "R", "CGG": "R", "CGT": "R",
"GTA": "V", "GTC": "V", "GTG": "V", "GTT": "V", "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",
"GAC": "D", "GAT": "D", "GAA": "E", "GAG": "E", "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",
"TCA": "S", "TCC": "S", "TCG": "S", "TCT": "S", "TTC": "F", "TTT": "F", "TTA": "L", "TTG": "L",
"TAC": "Y", "TAT": "Y", "TAA": "*", "TAG": "*", "TGC": "C", "TGT": "C", "TGA": "*", "TGG": "W",
}
def translate(dna):
protein = ""
for codon_start_index in range(0, len(dna), 3):
codon = dna[codon_start_index : codon_start_index + 3]
protein += genetic_code[codon]
return protein
This looks pretty good, right? (Hint: nope! Can you find all the bugs?) For our testing code, we can rely on the property that a DNA sequence's protein is always a third the length of DNA sequence (since three DNA bases are used to code for each amino acid in the protein):
from hypothesis import given
from hypothesis_bio import dna
@given(dna())
def test_translate(seq):
assert len(translate(seq)) == len(seq) / 3
When we run it, we get the following error:
Falsifying example: test_translate(seq='A')
KeyError: 'A'
It turns out that our translation function never actually checked to ensure that the DNA sequence was a coding sequence.
If the sequence isn't at least three letters long, there's no way to convert it into a protein.
We should fix our function, but to see just what Hypothesis-Bio can do, we'll tell it the minimum length DNA sequence we want via the min_size
argument:
@given(dna(min_size=3))
def test_translate(seq):
assert len(translate(seq)) == len(seq) / 3
Now we get this error:
Falsifying example: test_translate(seq='AA-')
KeyError: 'AA-'
Whoops, we forgot to take gap characters into account!
Note that Hypothesis didn't just find any example that raised a bug, it found the smallest falsifying example.
Again, while we should fix the translate
function, let's just ignore the issue to see what else Hypothesis will find:
@given(dna(min_size=3, allow_gaps=False))
def test_translate(seq):
assert len(translate(seq)) == len(seq) / 3
Now we get:
Falsifying example: test_translate(seq='AAB')
KeyError: 'AAB'
It turns out we also forgot the ambiguous nucleotides as well. What else can we find if we ignore ambiguous nucleotides?
@given(dna(min_size=3, allow_gaps=False, allow_ambiguous=False))
def test_translate(seq):
assert len(translate(seq)) == len(seq) / 3
Now we get:
Falsifying example: test_translate(seq='AAa')
KeyError: 'AAa'
We also forgot to handle lowercase characters!
By passing the argument uppercase_only=True
to dna
, we can tell Hypothesis-Bio to only generate uppercase DNA sequences:
@given(dna(min_size=3, allow_gaps=False, allow_ambiguous=False, uppercase_only=True))
def test_translate(seq):
assert len(translate(seq)) == len(seq) / 3
And now we get:
Falsifying example: test_translate(seq='AAAA')
KeyError: 'A'
We now see another bug, in which a sequence whose length isn't divisible by 3 will result in a KeyError since there'll be a partial codon. Gaps and ambiguous bases and lowercase letters, oh my! Thankfully, Hypothesis-Bio will generate all of these weird edge cases so you don't manually have to.
Installation
Hypothesis-Bio will be available from PyPI via:
pip install hypothesis-bio
And Conda using:
conda install -c [CHANNEL GOES HERE] hypothesis-bio
Documentation
The documentation for Hypothesis-Bio is available here.
Citation
If you use Hypothesis-Bio, please cite it as:
Hypothesis-Bio. https://github.com/Lab41/hypothesis-bio
or, for BibTeX:
@misc{hypothesis_bio,
author = {Benjamin Lee and Reva Shenwai and Zongyi Ha and Michael B. Hall and Vaastav Anand},
title = {{Hypothesis-Bio}},
publisher = {GitHub},
url = {https://github.com/Lab41/hypothesis-bio}
}