Efficient Computation of Close Upper and Lower Bounds on
the Minimum Number of Recombinations in Biological Sequence Evolution
Yun S. Song, Yufeng Wu, Dan Gusfield
UC Davis
ISMB 2005
Meiotic Recombination (single-crossover)
Prefix Suffix
Recombination is one of the principal evolutionary forces responsible for shaping genetic variation within species.
Estimating the frequency and the location of recombination is central to modern-day genetics.
(e.g. disease association mapping)
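[Figure: two parental sequences with sites 1 to L cross over at breakpoint b; the recombinant joins a prefix of one parent to a suffix of the other.]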
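s1 = 0 0
s2 = 0 1
s3 = 1 0
s4 = 1 0

s1 = 0 0
s2 = 0 1
s3 = 1 0
s4 = 1 1
All four possible gametic types.
[Figure: trees rooted at 00 with mutations at sites 1 and 2; leaves 10, 10, 01, 00 for the first set and 10, 11, 01, 00 for the second.]
Assumption: at most one mutation per site.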
SNP Sequences
Possible gametic types:
{ 00, 01, 10, 11 }
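The gametic-type check can be sketched in a few lines. Under the one-mutation-per-site assumption, a pair of sites that shows all four gametic types cannot be explained by mutation alone. The function names and the encoding of sequences as 0/1 strings are illustrative assumptions, not part of the talk.

```python
from itertools import combinations

def has_all_four_gametes(rows, i, j):
    """True if sites i and j together show 00, 01, 10 and 11 (binary data only)."""
    return len({(r[i], r[j]) for r in rows}) == 4

def incompatible_pairs(rows):
    """All pairs of sites that show all four gametic types."""
    n_sites = len(rows[0])
    return [(i, j) for i, j in combinations(range(n_sites), 2)
            if has_all_four_gametes(rows, i, j)]

print(incompatible_pairs(["00", "01", "10", "10"]))  # []
print(incompatible_pairs(["00", "01", "10", "11"]))  # [(0, 1)]
```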
[Figure: recombination between 10 and 01 producing 11. Legend: Recombination, Mutation.]
Given a set M of sequences, what is the minimum number Rmin(M) of recombinations needed for constructing evolutionary histories that explain M?
Minimizing Recombinations
Kreitman’s data from the adh locus of D. melanogaster (1983):
M = [figure: data matrix with rows 1 to 11]
Minimization is NP-hard (Wang et al. 2000, Semple 2004).
Bounds on the Minimum Number Rmin(M) of Recombinations
M, a set of sequences
L(M) ≤ Rmin(M) ≤ U(M)
L(M): Lower bound. There are many methods.
Rmin(M): Minimum. No efficient method.
U(M): Upper bound. Novel.
Our Contribution:
Efficient, practical algorithms for computing lower and upper bounds on Rmin(M).
Key idea: If L(M) = U(M), then we know Rmin(M).
Empirical observation: L(M) = U(M) frequently for a surprisingly large range of data.
The Composite Method (Myers & Griffiths 2003)
1. Given a set of intervals, and
2. for each interval I, a number L(I),
Composite Problem: Find the minimum number of vertical lines so that every interval I intersects at least L(I) vertical lines.
[Figure: intervals over the columns of M, each labeled with its number L(I).]
Let L(I) be a “local” recombination lower bound for I.
The composite recombination lower bound on Rmin(M) is given by a solution to the composite problem.
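The composite problem can be solved by a simple left-to-right recurrence, sketched here in Python. This sketch assumes that several vertical lines may be placed between the same pair of adjacent sites, and it treats any interval without a stated bound as having L(I) = 0. The function name, the 0-based site indexing, and the dictionary encoding of intervals are illustrative assumptions.

```python
def composite_bound(n_sites, local_bounds):
    """Minimum number of lines so that every interval (j, k) meets >= L(I) lines.

    local_bounds maps (j, k) with 0 <= j < k < n_sites to the local bound
    L(I) for the interval spanning sites j..k. prefix[k] is the smallest
    number of lines that must lie to the left of site k.
    """
    prefix = [0] * n_sites
    for k in range(1, n_sites):
        prefix[k] = max(prefix[j] + local_bounds.get((j, k), 0)
                        for j in range(k))
    return prefix[-1] if prefix else 0

# Two overlapping intervals, each needing one line: one shared line suffices.
print(composite_bound(4, {(0, 2): 1, (1, 3): 1}))  # 1
# A long interval can demand more than its sub-intervals combined.
print(composite_bound(3, {(0, 1): 1, (1, 2): 1, (0, 2): 3}))  # 3
```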
Optimal Haplotype Bound as L(I)
S = a subset of columns in I.
Haplotype Bound h(S) = (Number of distinct rows restricted to S) − (Number of distinct columns in S) − 1.
Optimal Haplotype Bound Opt(I) = Maximum value of h(S) over all subsets S of columns in interval I.
[Figure: an example matrix over interval I, restricted to two different column subsets S.]
h(S) = 4 − 2 − 1 = 1
h(S) = 6 − 3 − 1 = 2
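The definitions of h(S) and Opt(I) translate directly into code. The sketch below enumerates every subset of columns, so it is exponential in the width of the interval and only meant for tiny examples. The function names, the 0/1 string encoding, and the two small test matrices are illustrative assumptions; the matrices are not the ones shown in the figure.

```python
from itertools import combinations

def haplotype_bound(rows, cols):
    """h(S) = (#distinct rows restricted to S) - (#distinct columns in S) - 1."""
    distinct_rows = {tuple(r[c] for c in cols) for r in rows}
    distinct_cols = {tuple(r[c] for r in rows) for c in cols}
    return len(distinct_rows) - len(distinct_cols) - 1

def optimal_haplotype_bound(rows, interval):
    """Opt(I): maximum of h(S) over all non-empty subsets S of the interval."""
    best = 0  # a recombination lower bound is never below 0
    for size in range(1, len(interval) + 1):
        for cols in combinations(interval, size):
            best = max(best, haplotype_bound(rows, cols))
    return best

rows4 = ["00", "01", "10", "11"]
rows6 = ["000", "001", "010", "011", "100", "101"]
print(haplotype_bound(rows4, (0, 1)))             # 4 - 2 - 1 = 1
print(haplotype_bound(rows6, (0, 1, 2)))          # 6 - 3 - 1 = 2
print(optimal_haplotype_bound(rows6, range(3)))   # 2
```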
Myers & Griffiths: For every interval I, restrict the maximum size (s) of S and the maximum distance (w) between the leftmost and the rightmost columns in S.
[Figure: interval I with |S| ≤ s and d ≤ w, where d is the distance between the leftmost and rightmost columns in S.]
Implemented in RecMin, along with the composite method.
Computing this optimal haplotype bound Opt(I) is NP-hard. (Bafna & Bansal 2005)
RecMin is a tremendous improvement over previous practical lower bound methods. But,
1. The user is instructed to experiment with parameters s and w until the bound does not change.
2. That does not guarantee that the bound could not be improved by further increasing the parameters.
(Implemented in HapBound)
1. No parameters.
2. Much faster than RecMin.
3. Implements additional ideas that produce lower bounds even better than the optimal haplotype bound.
How to derive sharper bounds? In the composite method, check if each local bound L(I) is in fact equal to Rmin(I),
and if not, increase L(I) by one. (S and M options)
We cast the problem as a classic set cover problem that can be formulated as an ILP problem, with 1 variable per column and 1 inequality per pair of rows.
We can compute the optimal haplotype bound exactly.
Can use either GNU ILP Solver or CPLEX.
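One way to read "1 variable per column and 1 inequality per pair of rows" is the following: after removing duplicate rows, choose the fewest columns such that every pair of rows differs in at least one chosen column, then report (number of rows) − (number of chosen columns) − 1. This matches the maximum of h(S) because adding a column that separates two merged rows never lowers h(S). The sketch below builds one constraint per pair of rows and, as a stand-in for an ILP solver, finds the smallest cover by exhaustive search. The function names and the brute-force search are assumptions for illustration; this is not HapBound's implementation.

```python
from itertools import combinations

def set_cover_constraints(rows):
    """One constraint per pair of distinct rows: the columns where they differ."""
    distinct = list(dict.fromkeys(rows))  # drop duplicate rows, keep order
    width = len(distinct[0])
    constraints = [{c for c in range(width) if a[c] != b[c]}
                   for a, b in combinations(distinct, 2)]
    return distinct, constraints

def min_separating_columns(rows):
    """Smallest column set hitting every constraint (brute force, not an ILP)."""
    distinct, constraints = set_cover_constraints(rows)
    width = len(distinct[0])
    for size in range(width + 1):
        for chosen in combinations(range(width), size):
            picked = set(chosen)
            if all(picked & cons for cons in constraints):
                return len(distinct), chosen

def haplotype_bound_via_set_cover(rows):
    n_rows, chosen = min_separating_columns(rows)
    return max(0, n_rows - len(chosen) - 1)

print(haplotype_bound_via_set_cover(["000", "011", "101", "110"]))  # 4 - 2 - 1 = 1
print(haplotype_bound_via_set_cover(
    ["000", "001", "010", "011", "100", "101"]))                    # 6 - 3 - 1 = 2
```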
RecMin vs. HapBound on the human LPL data (Nickerson et al., 1998)
88 Sequences, 48 sites

Program                        Lower Bound   Time
RecMin -s 8 -w 12 (default)    59            3 sec
RecMin -s 25 -w 25             75            7944 sec
RecMin -s 48 -w 48             No result     5 days
HapBound                       75            31 sec
HapBound -S                    78            1643 sec
[Figure: genealogy with leaves ordered 1, 9, 7, 10, 3, 6, 5, 2, 8, 4. Legend: Mutation, Recombination.]
Upper Bound on Rmin(M)
Branch and Bound construction of genealogies backwards in time (using an alternating series of coalescent, mutation, and recombination events).
B&B uses recombination lower bounds and randomization.
Implemented in SHRUB (Simulated History Recombination Upper Bound)
SHRUB constructs genealogies that can be viewed using an open source program.
Each genealogy contains U(M) recombination vertices.
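The backwards-in-time view can be illustrated with the two recombination-free steps only. The sketch below repeatedly removes columns whose rarer state occurs in at most one row (undoing a mutation) and merges identical rows (a coalescent event). If a single row remains, no recombination is needed; otherwise the data are stuck and a recombination event would have to be chosen next. This is not SHRUB's branch and bound: the recombination step, the lower bounds, and the randomization are all omitted, and the function name and string encoding are illustrative assumptions.

```python
def reduce_without_recombination(rows):
    """Apply mutation-removal and coalescence steps until nothing changes."""
    rows = list(dict.fromkeys(rows))  # coalesce identical sequences
    changed = True
    while changed:
        changed = False
        width = len(rows[0]) if rows else 0
        # Keep only columns where both states occur in at least two rows.
        keep = [c for c in range(width)
                if min(sum(r[c] == "1" for r in rows),
                       sum(r[c] == "0" for r in rows)) >= 2]
        if len(keep) < width:
            rows = ["".join(r[c] for c in keep) for r in rows]
            changed = True
        merged = list(dict.fromkeys(rows))
        if len(merged) < len(rows):
            rows = merged
            changed = True
    return rows

print(len(reduce_without_recombination(["00", "01", "10", "10"])))  # 1
print(len(reduce_without_recombination(["00", "01", "10", "11"])))  # 4
```

In the second call all four gametic types are present, so neither step applies and the reduction stops with four sequences left.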
Kreitman’s ADH data (1983): 11 alleles of the alcohol dehydrogenase locus of Drosophila melanogaster (43 sites).
Only one previously implemented method computes Rmin(M) exactly (Song and Hein, 2003).
That method took about 1.5 GB of memory and 30 minutes of CPU time to find Rmin(M) = 7.
We tried 9 different implemented lower bound methods, aside from HapBound. They all produced either 5 or 6.
Time
Both HapBound (with -M option) and SHRUB produced 7 and took only a fraction of a second to analyze this data set.
L(M) = U(M) = 7 ⇒ Rmin(M) = 7.
An evolutionary history, found by SHRUB, with 7 recombination events.
It corresponds to the most parsimonious history.
The Human LPL Data (Nickerson et al., 1998)
(88 Sequences, 48 sites)
[Figure: bounds for the LPL data. Labels: composite optimal haplotype bounds; our new lower bound (HapBound -S -M); upper bound (SHRUB).]
(We ignored insertions/deletions, unphased sites, and sites with missing data.)
Match frequency for simulated data
ρ = scaled recombination rate. θ = scaled mutation rate.
Frequency of having L(M) = U(M)
Used Hudson’s MS to generate 1000 simulated datasets for each pair of θ and ρ.
For < 5, our lower and upper bounds match over 90% of the time.
This is significant progress: no other existing method can find Rmin(M) for more than 9 sequences after some data reduction.
n = number of sequences
Software
HapBound and SHRUB can be found at
wwwcsif.cs.ucdavis.edu/~gusfield/