Home Fellowship Teaching Raspberry Pi projects Little Minion Computer Durham CS

EPSRC Postdoctoral Fellowship

Randomised algorithms and approximation in phylogenetics.

A poster summarising some of my recent work during the fellowship:

Algorithms and computational complexity lie at the heart of the fundamental question of determining which problems in computer science we can or cannot hope to solve in a reasonable amount of time. In recent years this has been nowhere highlighted more clearly than in the rapidly developing areas of bio-informatics and computational biology. The advent of techniques for gathering huge volumes of genetic data at a reasonable cost has lead to hopes of understanding many deep problems in biology. However, it lies in the sphere of theoretical computer science to answer the question of which of these problems will be solvable, and which will turn out to be NP-hard, which would suggest it is impossible to compute an exact solution efficiently.

Phylogenetics is the reconstruction and analysis of phylogenetic (evolutionary) trees and networks based on inherited characteristics. In evolutionary biology, phylogenetic trees are used to represent the ancestral history of a collection of present-day species. Creating a such a "tree of life" has been a primary goal of systematic biology since Charles Darwin's first sketch of an evolutionary tree in 1837, and is now the focus of a global academic effort. Phylogenetics has also proved to be important in the study of mutating diseases; recent work reconstructing the phylogeny of HIV has helped trace the origins of the disease.

The broad aim of this project was to develop algorithms and randomised approximation schemes which will be beneficial to biologists working in the field of phylogenetics, as well as to devise new techniques for analysing such algorithms, which will be of independent interest in theoretical computer science.

Some places I visited during the fellowship:
Novemeber 2006: School of Computing Sciences, University of East Anglia, UK.
2 January - 27 March 2007: Department of Mathematics and Statistics, University of Canterbury, NZ.
September - December 2007: At the Phylogenetics programme at the Newton Institute, Cambridge, U.K.
March 2008: At the Combinatorics and Statistical Mechanics programme at the Newton Institute.
11 May - 16 May 2008: Dagstuhl seminar: Design and Analysis of Randomized and Approximation Algorithms.
Spring 2009: Department of Mathematics and Statistics, University of Canterbury, NZ.

Research Topics
The three specific topics which were the immediate focus of the project are below. Papers generated while working on the project can be found in my publications list.
A) Approximation schemes for hybridisation number and SPR distance.
B) Random generation of consistent phylogenetic trees.
C) Metastable states of Markov chains and convergence diagnostics.

A) One of the main tools used to understand reticulation events in evolutionary biology is an operation called (rooted) subtree prune and regraft (rSPR). Hybridisation events are relatively rare, so a fundamental problem is: given a collection of phylogenetic trees that correctly represent the tree-like evolution of different parts of various species genomes, what is the smallest number of reticulation events required so that all of the trees are simultaneously `displayed' by a single hybrid phylogeny. This `hybrid number' is closely related to the rSPR distance. For two trees, the hybrid number is the minimum number of rSPR operations required to transform one tree into the other with no temporal inconsistency, i.e. the implied digraph representing the hybrid phylogeny is acyclic. Computing either the rSPR distance or hybrid number exactly is NP-hard; approximating either problem is APX-hard. The specific goals of this topic are to:
i) find an approximation scheme for the hybrid number of a pair of phylogenetic trees;
ii) close the gap between inapproximability results and approximation schemes for the rSPR distance.

B) A phylogenetic tree that displays a given set of rooted subtrees is said to be consistent. Computing the number of consistent trees is #P-hard. However a strong approximation (FPRAS) could be possible using a MCMC algorithm. A natural Markov chain (branch rotation, or SPR based chains) converges to the uniform distribution on the set of consistent trees. The primary goals of this topic are to:
i) establish upper or lower bounds on the mixing time of this chain;
ii) study alternative Markov chains with statespace the set of consistent trees.

C) Due to the difficulty of proving rapid mixing, and stringent conditions required for other non-quantitative convergence tests (for example geometric ergodicity), many convergence diagnostics have become widely used to test whether an observed simulation has reached stationarity. Such diagnostics can falsely indicate convergence if the Markov chain enters a metastable state. This topic will focus on what conditions are required for metastable states to exist, and how well popular convergence diagnostics can detect such metastable states. In particular,
i) to identify metastable states in Markov chains;
ii) determine whether the chains in topic B i), ii) give rise to metastable states;
iii) and whether typical metastable states be easily indentified by targeted diagnostics.