Based on materials by Mike Jackson, Katy Huff, Rachel Slaybaugh, Patrick Fuller and Anthony Scopatz. With special thanks to Gordon Webster, the Digital Biologist, for kindly allowing use of his Python DNA function.
Software testing is exercising and confirming expected behaviours and results from code. It allows us to check that our code does what we expect it to do.
It also gives us the confidence to make changes, for example to fix bugs, add new features, or optimise and refactor our code…all without introducing bugs. Nothing is worse than fixing a bug only to introduce a new one.
Tests also help us remember what all the parts of our code do. If we are working on a large project over three years and end up with hundreds of functions, it may be hard to remember what each function does in detail. If we have a test that checks all of a function's functionality, we can look at the test to remember what it's supposed to do.
Testing allows us, and others, to trust our code and trust it enough to answer in the affirmative to at least a few of the following questions:
As a cautionary tale, consider Ariane 5, which reused software from Ariane 4. Ariane 5's new and improved engines caused the code to produce a buffer overflow…and Ariane 5 blew up! So, some forgotten tests led to millions of pounds down the drain and some very red faces.
Or, consider Geoffrey Chang, who had to retract three papers from Science due to a flipped sign! Or, McKitrick and Michaels' Climate Research 26(2) 2004 paper, which drew the attention of blogger Tim Lambert, who noted a problem that led to their subsequent erratum.
Do this too regularly and people may not trust our research, which could affect our chances for collaborations, publications or funding.
But if this is not compelling, consider that, if nothing else, writing tests is an investment of time that saves us time in the future.
Before we test our code, it can be very productive to get a colleague to look at it for us…why?
What we know about software development - code reviews work
Fagan (1976) discovered that a rigorous inspection can remove 60-90% of errors before the first test is run. M.E., Fagan (1976). Design and Code inspections to reduce errors in program development. IBM Systems Journal 15 (3): pp. 182-211.
What we know about software development - code reviews should be about 60 minutes long
Cohen (2006) discovered that all the value of a code review comes within the first hour, after which reviewers can become exhausted and the issues they find become ever more trivial. J. Cohen (2006). Best Kept Secrets of Peer Code Review. SmartBear, 2006. ISBN-10: 1599160676. ISBN-13: 978-1599160672.
The first step toward getting the right answers from our programs is to assume that mistakes will happen and to guard against them. This is called defensive programming, and the most common way to do it is to add assertions to our code so that it checks itself as it runs. An assertion is simply a statement that something must be true at a certain point in a program. When Python sees one, it checks the assertion's condition. If it's true, Python does nothing, but if it's false, Python halts the program immediately and prints the error message provided. For example, this piece of code halts as soon as the loop encounters a negative value:
numbers = [1.5, 2.3, 0.7, -0.001, 4.4]
total = 0.0
for n in numbers:
    assert n >= 0.0, 'Data should only contain positive values'
    total += n
print 'total is:', total
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-19-33d87ea29ae4> in <module>()
      2 total = 0.0
      3 for n in numbers:
----> 4     assert n >= 0.0, 'Data should only contain positive values'
      5     total += n
      6 print 'total is:', total

AssertionError: Data should only contain positive values
Programs like the Firefox browser are full of assertions: 10-20% of the code they contain is there to check that the other 80-90% is working correctly. Broadly speaking, assertions fall into three categories:

- A precondition is something that must be true at the start of a function in order for it to work correctly.
- A postcondition is something that the function guarantees is true when it finishes.
- An invariant is something that is always true at a particular point inside a piece of code.
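The rectangle example below shows preconditions and postconditions at work. As a minimal sketch of an invariant (the running_totals function here is our own illustration, not part of the lesson's running example), we can assert a condition that must hold on every pass through a loop:

def running_totals(values):
    '''Return the list of running totals of a list of non-negative numbers.'''
    totals = []
    current = 0.0
    for v in values:
        current += v
        # Invariant: with non-negative inputs, the running total never decreases.
        assert not totals or current >= totals[-1], 'Running total decreased'
        totals.append(current)
    return totals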
To see preconditions and postconditions in action, suppose we are representing rectangles using a tuple of four coordinates (x0, y0, x1, y1). In order to do some calculations, we need to normalize the rectangle so that it is at the origin and 1.0 units long on its longest axis. This function does that, but checks that its input is correctly formatted and that its result makes sense:
def normalize_rectangle(rect):
    '''Normalizes a rectangle so that it is at the origin and 1.0 units long on its longest axis.'''
    assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
    x0, y0, x1, y1 = rect
    assert x0 < x1, 'Invalid X coordinates'
    assert y0 < y1, 'Invalid Y coordinates'

    dx = x1 - x0
    dy = y1 - y0
    if dx > dy:
        scaled = float(dx) / dy
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'

    return (0, 0, upper_x, upper_y)
The preconditions on lines 3, 5, and 6 catch invalid inputs:
print normalize_rectangle( (0.0, 1.0, 2.0) ) # missing the fourth coordinate
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-21-3a97b1dcab70> in <module>()
----> 1 print normalize_rectangle( (0.0, 1.0, 2.0) ) # missing the fourth coordinate

<ipython-input-20-408dc39f3915> in normalize_rectangle(rect)
      1 def normalize_rectangle(rect):
      2     '''Normalizes a rectangle so that it is at the origin and 1.0 units long on its longest axis.'''
----> 3     assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
      4     x0, y0, x1, y1 = rect
      5     assert x0 < x1, 'Invalid X coordinates'

AssertionError: Rectangles must contain 4 coordinates
print normalize_rectangle( (4.0, 2.0, 1.0, 5.0) ) # X axis inverted
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-22-f05ae7878a45> in <module>()
----> 1 print normalize_rectangle( (4.0, 2.0, 1.0, 5.0) ) # X axis inverted

<ipython-input-20-408dc39f3915> in normalize_rectangle(rect)
      3     assert len(rect) == 4, 'Rectangles must contain 4 coordinates'
      4     x0, y0, x1, y1 = rect
----> 5     assert x0 < x1, 'Invalid X coordinates'
      6     assert y0 < y1, 'Invalid Y coordinates'
      7 

AssertionError: Invalid X coordinates
The post-conditions help us catch bugs by telling us when our calculations cannot have been correct. For example, if we normalize a rectangle that is taller than it is wide, everything seems OK:
print normalize_rectangle( (0.0, 0.0, 1.0, 5.0) )
(0, 0, 0.2, 1.0)
but if we normalize one that's wider than it is tall, the assertion is triggered:
print normalize_rectangle( (0.0, 0.0, 5.0, 1.0) )
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-24-5f0ef7954aeb> in <module>()
----> 1 print normalize_rectangle( (0.0, 0.0, 5.0, 1.0) )

<ipython-input-20-408dc39f3915> in normalize_rectangle(rect)
     16 
     17     assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
---> 18     assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'
     19 
     20     return (0, 0, upper_x, upper_y)

AssertionError: Calculated upper Y coordinate invalid
Re-reading our function, we realize that line 11 should divide dy by dx rather than dx by dy. (You can display line numbers by typing Ctrl-M, then L.) If we had left out the assertion at the end of the function, we would have created and returned something that had the right shape as a valid answer, but wasn't. Detecting and debugging that would almost certainly have taken more time in the long run than writing the assertion.
But assertions aren't just about catching errors: they also help people understand programs. Each assertion gives the person reading the program a chance to check (consciously or otherwise) that their understanding matches what the code is doing.
Most good programmers follow two rules when adding assertions to their code. The first is, "fail early, fail often". The greater the distance between when and where an error occurs and when it's noticed, the harder the error will be to debug, so good code catches mistakes as early as possible.
The second rule is, "turn bugs into assertions or tests". If you made a mistake in a piece of code, the odds are good that you have made other mistakes nearby, or will make the same mistake (or a related one) the next time you change it. Writing assertions to check that you haven't regressed (i.e., haven't re-introduced an old problem) can save a lot of time in the long run, and helps to warn people who are reading the code (including your future self) that this bit is tricky.
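For instance, once we have found and fixed a bug like the one in normalize_rectangle, we can pin down the offending input as a test. This is a minimal sketch (the test function name is our own), and it assumes the division bug discussed above has been fixed:

def test_wide_rectangle_regression():
    # This wide rectangle once triggered the dx/dy division bug;
    # keeping it as a test stops the bug from silently coming back.
    assert normalize_rectangle((0.0, 0.0, 5.0, 1.0)) == (0, 0, 1.0, 0.2)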
Suppose you are writing a function called average that calculates the average of the numbers in a list. What pre-conditions and post-conditions would you write for it? Compare your answer to your neighbor's: can you think of a function that will pass your tests but not theirs, or vice versa?

Assertions help us catch errors in our code, but things can go wrong for other reasons, like missing or badly-formatted files. Most modern programming languages allow programmers to use exceptions to separate what the program should do if everything goes right from what it should do if something goes wrong. Doing this makes both cases easier to read and understand.
For example, here's a small piece of code that tries to read parameters and a grid from two separate files, and reports an error if either goes wrong:
try:
    params = read_params(param_file)
    grid = read_grid(grid_file)
except:
    log.error('Failed to read input file(s)')
    sys.exit(ERROR)
We join the normal case and the error-handling code using the keywords try and except. These work together like if and else: the statements under the try are what should happen if everything works, while the statements under the except are what the program should do if something goes wrong.

We have actually seen exceptions before without knowing it, since by default, when an exception occurs, Python prints it out and halts our program. For example, trying to open a nonexistent file triggers a type of exception called an IOError, while an out-of-bounds index to a list triggers an IndexError:
open('nonexistent-file.txt', 'r')
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-13-58cbde3dd63c> in <module>()
----> 1 open('nonexistent-file.txt', 'r')

IOError: [Errno 2] No such file or directory: 'nonexistent-file.txt'
values = [0, 1, 2]
print values[999]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-14-7fed13afc650> in <module>()
      1 values = [0, 1, 2]
----> 2 print values[999]

IndexError: list index out of range
We can use try and except to deal with these errors ourselves if we don't want the program simply to fall over:
try:
    reader = open('nonexistent-file.txt', 'r')
except IOError:
    print 'Whoops!'
Whoops!
When Python executes this code, it runs the statement inside the try. If that works, it skips over the except block without running it. If an exception occurs inside the try block, though, Python compares the type of the exception to the type specified by the except. If they match, it executes the code in the except block. IOError is the particular kind of exception Python raises when there is a problem related to input and output, such as files not existing or the program not having the permissions it needs to read them.

We can put as many lines of code in a try block as we want, just as we can put many statements under an if. We can also handle several different kinds of errors afterward. For example, here's some code to calculate the entropy at each point in a grid:
try:
    params = read_params(param_file)
    grid = read_grid(grid_file)
    entropy = lee_entropy(params, grid)
    write_entropy(entropy_file, entropy)
except IOError:
    report_error_and_exit('IO error')
except ArithmeticError:
    report_error_and_exit('Arithmetic error')
Python tries to run the four functions inside the try as normal. If an error occurs in any of them, Python immediately jumps down and tries to find an except of the corresponding type: if the exception is an IOError, Python jumps into the first error handler, while if it's an ArithmeticError, Python jumps into the second handler instead. It will only execute one of these, just as it will only execute one branch of a series of if/elif/else statements.

This layout has made the code easier to read, but we've lost something important: the message printed out by the IOError branch doesn't tell us which file caused the problem. We can do better if we capture and hang on to the object that Python creates to record information about the error:
try:
    params = read_params(param_file)
    grid = read_grid(grid_file)
    entropy = lee_entropy(params, grid)
    write_entropy(entropy_file, entropy)
except IOError as err:
    report_error_and_exit('Cannot read/write ' + err.filename)
except ArithmeticError as err:
    report_error_and_exit(err.message)
If something goes wrong in the try, Python creates an exception object, fills it with information, and assigns it to the variable err. (There's nothing special about this variable name; we can use anything we want.) Exactly what information is recorded depends on what kind of error occurred; Python's documentation describes the properties of each type of error in detail, but we can always just print the exception object. In the case of an I/O error, we print out the name of the file that caused the problem. And in the case of an arithmetic error, printing out the message embedded in the exception object is what Python would have done anyway.
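For example, re-using the nonexistent file from earlier, printing the caught exception object gives the same readable summary that Python itself would have printed:

try:
    open('nonexistent-file.txt', 'r')
except IOError as err:
    # str(err) gives a human-readable description of the error.
    print str(err)

which prints [Errno 2] No such file or directory: 'nonexistent-file.txt'.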
So much for how exceptions work: how should they be used? Some programmers use try and except to give their programs default behaviors. For example, if this code can't read the grid file that the user has asked for, it creates a default grid instead:

try:
    grid = read_grid(grid_file)
except IOError:
    grid = default_grid()

Other programmers would explicitly test for the grid file, and use if and else for control flow:

if file_exists(grid_file):
    grid = read_grid(grid_file)
else:
    grid = default_grid()
It's mostly a matter of taste, but we prefer the second style. As a rule, exceptions should only be used to handle exceptional cases. If the program knows how to fall back to a default grid, that's not an unexpected event. Using if and else instead of try and except sends different signals to anyone reading our code, even if they do the same thing.
Novices often ask another question about exception handling style as well, but before we address it, there's something in our example that you might not have noticed. Exceptions can actually be thrown a long way: they don't have to be handled immediately. Take another look at this code:
try:
    params = read_params(param_file)
    grid = read_grid(grid_file)
    entropy = lee_entropy(params, grid)
    write_entropy(entropy_file, entropy)
except IOError as err:
    report_error_and_exit('Cannot read/write ' + err.filename)
except ArithmeticError as err:
    report_error_and_exit(err.message)
The four lines in the try block are all function calls. They might catch and handle exceptions themselves, but if an exception occurs in one of them that isn't handled internally, Python looks in the calling code for a matching except. If it doesn't find one there, it looks in that function's caller, and so on. If we get all the way back to the main program without finding an exception handler, Python's default behavior is to print an error message like the ones we've been seeing all along.
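Here is a minimal sketch of that propagation (the function names are our own): the exception raised in inner passes straight through middle, which has no handler, and is only caught at the top level:

def inner():
    raise IOError('Problem in inner')

def middle():
    # No try/except here: the exception passes straight through.
    return inner()

try:
    middle()
except IOError as err:
    print 'Caught at top level:', str(err)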
This behavior is the origin of the rule "Throw Low, Catch High." There are many places in our program where an error might occur. There are only a few, though, where errors can sensibly be handled. For example, a linear algebra library doesn't know whether it's being called directly from the Python interpreter, or whether it's being used as a component in a larger program. In the latter case, the library doesn't know if the program that's calling it is being run from the command line or from a GUI. The library therefore shouldn't try to handle or report errors itself, because it has no way of knowing what the right way to do this is. It should instead just raise an exception, and let its caller figure out how best to handle it.
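As a hedged sketch of this division of labour (again, the names are our own), the low-level function only raises, while the top level, which knows it is talking to a command-line user, does the reporting:

def invert_scale(value):
    # Low level: we don't know who is calling us, so just raise.
    if value == 0:
        raise ArithmeticError('Cannot invert zero')
    return 1.0 / value

# High level: the main program is the right place to report the error.
try:
    print invert_scale(0)
except ArithmeticError as err:
    print 'Error:', str(err)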
Finally, we can raise exceptions ourselves if we want to. In fact, we should do this, since it's the standard way in Python to signal that something has gone wrong. Here, for example, is a function that reads a grid and checks its consistency:
def read_grid(grid_file):
    '''Read grid, checking consistency.'''
    data = read_raw_data(grid_file)
    if not grid_consistent(data):
        raise Exception('Inconsistent grid: ' + grid_file)
    result = normalize_grid(data)
    return result
The raise statement creates a new exception with a meaningful error message. Since read_grid itself doesn't contain a try/except block, this exception will always be thrown up and out of the function, to be caught and handled by whoever is calling read_grid.

We can define new types of exceptions if we want to. And we should, so that errors in our code can be distinguished from errors in other people's code. However, this involves classes and objects, which is outside the scope of these lessons.
try:
    for number in [-1, 0, 1]:
        print 1.0/number
except ZeroDivisionError:
    print 'whoops'
In the file dna.py we have a Python dictionary that stores the molecular weights of the 4 standard DNA nucleotides, A, T, C and G,
NUCLEOTIDES = {'A':131.2, 'T':304.2, 'C':289.2, 'G':329.2}
and a Python function that takes a DNA sequence as input and returns its molecular weight, which is the sum of the weights for each nucleotide in the sequence,
def calculate_weight(sequence):
    """
    Calculate the molecular weight of a DNA sequence.
    @param sequence: DNA sequence expressed as an upper-case string.
    @return molecular weight.
    """
    weight = 0.0
    for ch in sequence:
        weight += NUCLEOTIDES[ch]
    return weight
We can calculate the molecular weight of a sequence by,
weight = calculate_weight('GATGCTGTGGATAA')
print weight
We can add a test to our code as follows,
def calculate_weight(sequence):
    """
    Calculate the molecular weight of a DNA sequence.
    @param sequence: DNA sequence expressed as an upper-case string.
    @return molecular weight.
    """
    weight = 0.0
    try:
        for ch in sequence:
            weight += NUCLEOTIDES[ch]
        return weight
    except TypeError:
        print 'The input is not a sequence e.g. a string or list'
If the input is not a string, or a list of characters, then the for...in statement will raise an exception, which is caught by the except block. For example,
print calculate_weight(123)
This is a runtime test. It alerts the user to exceptional behavior in the code. Often, exceptions are related to functions that depend on input that is unknown at compile time. Such tests make our code robust and allow it to behave gracefully: they anticipate problematic values and handle them.
Often, we want to pass such errors to other points in our program rather than just print a message and continue. So, for example we could do,
except TypeError:
    raise ValueError('The input is not a sequence e.g. a string or list')
which raises a new exception, with a more meaningful message. If writing a complex application, our user interface could then present this to the user e.g. as a dialog box.
Runtime tests don't test our function's behaviour or whether it's implemented correctly. So, we can add some tests,
print "A is ", calculate_weight('A')
print "G is ", calculate_weight('G')
print "GA is ", calculate_weight('GA')
But we'd have to visually inspect the results to see they are as expected. So, let's have the computer do that for us, making our lives easier and saving us time in checking,
assert calculate_weight('A') == 131.2
assert calculate_weight('G') == 329.2
assert calculate_weight('GA') == 460.4
assert checks whether a condition is true and, if not, raises an exception. We explicitly list the expected weights in each statement. But, by doing this, there is a risk that we mistype one. A good design principle is to define constant values in one place only. As we have already defined them in NUCLEOTIDES, we can just refer to that,
assert calculate_weight('A') == NUCLEOTIDES['A']
assert calculate_weight('G') == NUCLEOTIDES['G']
assert calculate_weight('GA') == NUCLEOTIDES['G'] + NUCLEOTIDES['A']
But this isn't very modular, and modularity is a good design principle, so let's define some test functions,
def test_a():
    assert calculate_weight('A') == NUCLEOTIDES['A']

def test_g():
    assert calculate_weight('G') == NUCLEOTIDES['G']

def test_ga():
    assert calculate_weight('GA') == NUCLEOTIDES['G'] + NUCLEOTIDES['A']

test_a()
test_g()
test_ga()
And, rather than have our tests and code in the same file, let's separate them out. So, let's create
$ nano test_dna.py
Our function and NUCLEOTIDES data are in dna.py, and to refer to them in test_dna.py we need to import them. We can do this as,
from dna import calculate_weight
from dna import NUCLEOTIDES
Then we can add all our test functions and function calls to this file. And run the tests,
$ python test_dna.py
###nose - a Python test framework

nose is a test framework for Python that will automatically find, run and report on tests written in Python. It is an example of what has been termed an xUnit test framework, perhaps the most famous being JUnit for Java.

To use nose, we write test functions, as we've been doing, with the prefix test_ and put these in files, likewise prefixed by test_. The prefixes Test-, Test_ and test- can also be used.
Typically, a test function sets up some inputs and the associated expected outputs, runs the function or component being tested on those inputs to get actual outputs, and checks that the actual outputs match the expected ones. So, we could rewrite test_a as the more verbose, but equivalent,
def test_a():
    expected = NUCLEOTIDES['A']
    actual = calculate_weight('A')
    assert expected == actual
Python assert allows us to check,
assert should_be_true()
assert not should_not_be_true()
nose defines additional functions which can be used to check for a rich range of conditions e.g.,
from nose.tools import *
assert_equal(a, b)
assert_almost_equal(a, b, 3)
assert_true(a)
assert_false(a)
assert_raises(exception, func, *args, **kwargs)
...
assert_raises is used where we want to test that an exception is raised if, for example, we give a function a bad input.
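For example, here is a minimal sketch (the get_item helper is our own) checking that an out-of-bounds list index raises the IndexError we saw earlier:

from nose.tools import assert_raises

def get_item(values, index):
    return values[index]

# Passes, because values[999] raises IndexError.
assert_raises(IndexError, get_item, [0, 1, 2], 999)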
To run nose for our tests, we can do,
$ nosetests test_dna.py
Each . corresponds to a successful test. And to prove nose is finding our tests, let's remove the function calls from test_dna.py and try again,
$ nosetests test_dna.py
nosetests can output an "xUnit" test report,
$ nosetests --with-xunit test_dna.py
$ cat nosetests.xml
This is a standard format that is supported by a number of xUnit frameworks, and it can be converted to HTML and presented online.
###Write some more tests
Let's spend a few minutes coming up with some more tests for calculate_weight. Consider: what inputs haven't we tested calculate_weight with yet? Examples of tests we could add include,
calculate_weight('T')
calculate_weight('C')
calculate_weight('TC')
calculate_weight(123)
The latter requires us to check whether an exception was raised, which we can do as follows:
try:
    calculate_weight(123)
    assert False
except ValueError:
    assert True
This is like catching a runtime error. If an exception is raised then our test passes (assert True); if no exception is raised, it fails (assert False). Alternatively, we can use assert_raises from nose,
from nose.tools import assert_raises
def test_123():
    assert_raises(ValueError, calculate_weight, 123)
The assert fails if the named exception is not raised.
One other test we could do is calculate_weight('GATCX'), for which we can add another runtime test,

...
except KeyError:
    raise ValueError('The input is not a sequence of G,T,C,A')
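Putting both runtime tests together, calculate_weight might now look like the following sketch, which raises a ValueError for both kinds of bad input:

def calculate_weight(sequence):
    """
    Calculate the molecular weight of a DNA sequence.
    @param sequence: DNA sequence expressed as an upper-case string.
    @return molecular weight.
    """
    weight = 0.0
    try:
        for ch in sequence:
            weight += NUCLEOTIDES[ch]
        return weight
    except TypeError:
        # Input is not iterable e.g. a number.
        raise ValueError('The input is not a sequence e.g. a string or list')
    except KeyError:
        # Input contains a character that is not a nucleotide.
        raise ValueError('The input is not a sequence of G,T,C,A')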
The example we've looked at is based on one function. Suppose we have a complex legacy code of 10000s of lines which takes many input files and produces many output files. Exactly the same approach can be used as above: we run our code on a set of input files and check whether the output files match what we'd expect.
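As a hedged sketch, such an end-to-end check might compare a freshly generated output file against a reference copy that we know to be good (the file names here are hypothetical):

def test_against_reference_output():
    # 'reference-output.txt' is a known-good output kept under version
    # control; 'current-output.txt' is what the code has just produced.
    expected = open('reference-output.txt').read()
    actual = open('current-output.txt').read()
    assert expected == actual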
This was the approach taken by EPCC and the Colon Cancer Genetics Group (CCGG) of the MRC Human Genetics Unit at the Western General as part of an Oncology project to optimise and parallelise a FORTRAN genetics code.
The Muon Ionization Cooling Experiment (MICE) have a large number of tests written in Python. They use Jenkins, a continuous integration server, to build their code and trigger the running of the tests, which are then published online.
###When 1 + 1 = 2.0000001
Computers don't do floating point arithmetic too well. This can make simple tests for the equality of two floating point values problematic due to imprecision in the values being compared.
$ python
>>> expected = 1 + 1
>>> actual = 2.0000001
>>> assert expected == actual
We can get round this by comparing to within a given threshold, or delta. For example, we may consider expected and actual to be equal if abs(expected - actual) < 0.000000000001.
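For example, a minimal hand-rolled check (the tolerance of 0.001 here is arbitrary; choose one to suit your application):

expected = 1 + 1
actual = 2.0000001
delta = 0.001
# The two values differ by about 1e-07, well within our chosen tolerance.
assert abs(expected - actual) < delta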
Test frameworks such as nose often provide functions to handle this for us. For example, to test that two numbers are equal when rounded to a given number of decimal places,
$ python
>>> from nose.tools import assert_almost_equal
>>> assert_almost_equal(expected, actual, 0)
>>> assert_almost_equal(expected, actual, 1)
>>> assert_almost_equal(expected, actual, 3)
>>> assert_almost_equal(expected, actual, 6)
>>> assert_almost_equal(expected, actual, 7)
...
AssertionError: 2 != 2.0000000999999998 within 7 places
What do we consider to be a suitable threshold for equality? That is application-specific - for some domains we might be happy to round to the nearest whole number, for others we may want to be far, far more accurate.
###When should we test?
We should test early, test often, and re-test whenever our code changes. But when should we finish writing tests? How much is enough?
What we know about software development - we can't test everything
"It is nearly impossible to test software at the level of 100 percent of its logic paths", fact 32 in R. L. Glass (2002) Facts and Fallacies of Software Engineering (PDF).
We can't test everything, but that's no excuse for testing nothing! How much to test is something to be learned by experience, so think of it as analogous to proofreading a paper, over and over, before sending it to a conference. If you find bugs when you use your code, you did too little, so consider what you might have done differently and how to address this next time.
Tests, like code, should ideally be reviewed by a colleague, which helps avoid tests that pass when they should fail, tests that fail when they should pass, and tests that don't actually check anything.
For example,
def test_critical_correctness():
    # TODO - will complete this tomorrow!
    pass
Yes, tests like this do occur on projects!
Traditionally, we'd write our code, then write the tests. Test driven development (TDD), proposed by Kent Beck, is a philosophy that turns this on its head: we write code by writing the tests first, then write the code to make the tests pass. If a new feature is needed, another test is written and the code is expanded to meet this new use case. This continues until the code does what is needed. This can be summarised as red-green-refactor:

- Red: write a failing test for the behaviour we want.
- Green: write just enough code to make the test pass.
- Refactor: tidy up the code, re-running the tests to check that they still pass.
By writing tests first, we're forced to think about what our code should do. In contrast, in writing our code then tests, we risk testing what the code actually does, rather than what it should do.
TDD operates on the YAGNI principle (You Ain't Gonna Need It) to avoid developing code for which there is no need.
###TDD of a DNA complement function
Given a DNA sequence consisting of A, C, T and G, we can create its complementary DNA, cDNA, by applying a mapping to each nucleotide in turn: A maps to T, C maps to G, T maps to A, and G maps to C. For example, given DNA strand GTCA, the cDNA is CAGT.
So, let's write a complement function that creates the cDNA strand, given a DNA strand in a string. We'll use TDD, so to start, let's create a file test_cdna.py and add a test,
from cdna import complement

def test_complement_a():
    assert complement('A') == 'T'
And let's run the test,
$ nosetests test_cdna.py
Which fails as we have no function! So, let's create a file cdna.py. Our initial function to get the tests to pass could be,
def complement(sequence):
    return 'T'
This is simplistic, but the test passes. Now let's add another test,
def test_complement_c():
    assert complement('C') == 'G'
To get both our tests to pass, we can change our function to be,
def complement(sequence):
    if (sequence == 'A'):
        return 'T'
    else:
        return 'G'
Now, add some more tests. Don't worry about updating complement just now.

Let's discuss the tests you've come up with.

Now update complement to make your tests pass. You may want to reuse some of the logic of calculate_weight!
When we're done, not only do we have a working function, we also have a set of tests. There's no risk of us leaving the tests "till later" and then never having time to write them.
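For reference, here is one possible final version of complement, following the same dictionary-lookup pattern as calculate_weight. This is just a sketch: your own TDD session may well have produced something different.

# Map from each nucleotide to its complementary nucleotide.
COMPLEMENTS = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}

def complement(sequence):
    '''Create the cDNA strand, given a DNA strand in a string.'''
    cdna = ''
    for ch in sequence:
        cdna += COMPLEMENTS[ch]
    return cdna

With this version, complement('GTCA') returns 'CAGT', matching the example above.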