![Page 1: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/1.jpg)
Making the most of DArT data for phylogenetic inference
Barbara Holland &
Michael Woodhams (Maths & Physics)
Dorothy Steane (Plant Science)
Vincent Moulton (Computational Biology)
![Page 2: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/2.jpg)
Generating DArTs (Diversity Array Technologies)
1. Collect DNA from reference individuals
2. Digest with one 6bp rare cutter (CTGCAG) and one 4bp frequent cutter (TCGA)
3. Only fragments with two rare ends are amplified and retained
4. Create a microarray with these fragments (~2-3% of the genome)
5. Analyse phylogenetic samples by digesting them with the same cutters and running them against the microarray (DNA-DNA hybridisation).
Each fragment is scored 1 (present) or 0 (absent) *
* This is in math fantasy land – in real life you also get ?s
![Page 3: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/3.jpg)
Properties of DArT data
Data is binary (fragments are present or absent, 1/0)
A random set of fragments from across the genome.
Fragments are much more likely to be lost in parallel than gained in parallel
Data exhibit an ascertainment bias: We can observe only the fragments on the chip. These fragments were derived from a small set of reference taxa.
![Page 4: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/4.jpg)
The model
Fragment evolution can be modeled as a stochastic Dollo process, i.e. gained once but lost potentially many times
Parallel gains are forbidden
Fragments are lost at a constant rate r (memoryless)
Chance of loss over time t is 1-exp(-rt)
![Page 5: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/5.jpg)
![Page 6: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/6.jpg)
Hamming Horror
1111111111111111
1111111100000000
1100110000000000
1111000000000000
1000100000000000
1100000000000000
Ref
D
C
B
$D(\mathrm{Ref},B) = 12/16 = (12+0)/(4+0+12+0)$
$D(C,D) = 2/16 = (1+1)/(1+13+1+1)$
Hamming distance: $D = (n_{10}+n_{01})/(n_{11}+n_{00}+n_{10}+n_{01})$
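The two values on this slide can be checked directly from the 0/1 rows. The sketch below is only illustrative: the assignment of rows to taxa (the all-ones row as Ref, `1111000000000000` as B, and the two rows with two 1s as C and D) is inferred from the counts quoted on the slide, not stated in the transcript, and the function name `hamming` is a hypothetical helper.

```python
def hamming(a: str, b: str) -> float:
    """Proportion of sites at which two equal-length 0/1 strings differ."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    n10 = sum(x == "1" and y == "0" for x, y in zip(a, b))
    n01 = sum(x == "0" and y == "1" for x, y in zip(a, b))
    return (n10 + n01) / len(a)

# Row-to-taxon assignment is an assumption inferred from the slide's counts.
ref = "1111111111111111"
b = "1111000000000000"
c = "1000100000000000"
d = "1100000000000000"

print(hamming(ref, b))  # 0.75, i.e. 12/16
print(hamming(c, d))    # 0.125, i.e. 2/16
```

Under this reading the reference looks far from B simply because it carries every fragment on the chip, while C and D look close because they share many absences.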
![Page 7: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/7.jpg)
Hamming simulation
Tree based on Hamming distances using A as the reference taxon
Underlying tree used in simulation
![Page 8: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/8.jpg)
A distance correction is required
[Figure: tree relating the reference taxon R to taxa A and B]
• Let $n_{00}$ be the number of fragments absent at both A and B
• Let $n_{01}$ be the number of fragments absent at A and present at B
• Let $n_{10}$ be the number of fragments present at A and absent at B
• Let $n_{11}$ be the number of fragments present at both A and B
![Page 9: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/9.jpg)
A distance correction is required
Michael Woodhams's key observation was that, due to the Dollo nature of the process, any fragment that is present at the reference taxon R and at taxon A must also be present at the internal node X.
[Figure: tree with reference taxon R, taxa A and B, and internal node X]
![Page 10: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/10.jpg)
A distance correction is required
Recall, chance of survival over time t is exp(-rt)
d(X,B) = -log[probability fragment survives from X to B]
Anything present at A is known to be present at X
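=> $d(X,B) = -\log[n_{11}/(n_{11}+n_{10})]$
$d(A,B) = d(A,X) + d(X,B) = -\log[n_{11}/(n_{11}+n_{01})] - \log[n_{11}/(n_{11}+n_{10})]$
$= \log[(1+n_{01}/n_{11})(1+n_{10}/n_{11})]$
[Figure: tree with reference taxon R, taxa A and B, and internal node X]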
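As a rough sketch of the corrected distance derived on this slide, the function below adds the two path segments d(A,X) and d(X,B) from the three counts. The name `dart_distance` and the optional `log` argument are illustrative choices, not part of the talk; the argument is there because the worked examples later in the deck give values that match base-2 logarithms.

```python
import math

def dart_distance(n11: int, n10: int, n01: int, log=math.log) -> float:
    """d(A,B) = -log[n11/(n11+n01)] - log[n11/(n11+n10)]."""
    if n11 <= 0:
        raise ValueError("distance is undefined when A and B share no fragments")
    d_ax = -log(n11 / (n11 + n01))  # fragments at B are known to be at X
    d_xb = -log(n11 / (n11 + n10))  # fragments at A are known to be at X
    return d_ax + d_xb

print(dart_distance(2, 6, 2))                 # natural log
print(dart_distance(2, 6, 2, log=math.log2))  # base 2
```

Note that n00 never enters the formula, so fragments missing from both taxa do not pull them together the way they do under the Hamming distance.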
![Page 11: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/11.jpg)
A zoo of distances
Hamming: $d_H = (n_{01}+n_{10})/(n_{11}+n_{10}+n_{01}+n_{00})$
Log Det: $d_{LD} = \log[\det[D]] - 0.5\sum_k(\log(C_k)+\log(R_k))$
Jaccard: $d_J = (n_{01}+n_{10})/(n_{11}+n_{10}+n_{01})$
Log Jaccard: $d_{LJ} = -\log(1-d_J) = -\log[n_{11}/(n_{11}+n_{10}+n_{01})]$
HS: $d_{HS} = -\log[2n_{11}/(2n_{11}+n_{10}+n_{01})]$
Nei Li: $F = 2n_{11}/(2n_{11}+n_{10}+n_{01})$; $F = Q^2/(2-Q)$; $d_{NL} = -\log(Q)$
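Most of the distances listed on this slide are simple functions of the four counts, so they can be sketched in a few lines. The function names are hypothetical. Log Det is left out because D, C_k and R_k are not defined in the transcript. For Nei Li, Q is obtained here by solving F = Q^2/(2-Q) as a quadratic and taking the positive root, which is one possible reading of the slide.

```python
import math

def d_hamming(n11, n10, n01, n00):
    return (n01 + n10) / (n11 + n10 + n01 + n00)

def d_jaccard(n11, n10, n01):
    return (n01 + n10) / (n11 + n10 + n01)

def d_log_jaccard(n11, n10, n01):
    return -math.log(n11 / (n11 + n10 + n01))

def d_hs(n11, n10, n01):
    return -math.log(2 * n11 / (2 * n11 + n10 + n01))

def d_nei_li(n11, n10, n01):
    f = 2 * n11 / (2 * n11 + n10 + n01)
    # F = Q^2 / (2 - Q)  <=>  Q^2 + F*Q - 2F = 0; keep the positive root.
    q = (-f + math.sqrt(f * f + 8 * f)) / 2
    return -math.log(q)

counts = dict(n11=2, n10=6, n01=2)
for name, fn in [("Jaccard", d_jaccard), ("Log Jaccard", d_log_jaccard),
                 ("HS", d_hs), ("Nei Li", d_nei_li)]:
    print(name, fn(**counts))
```

Only the Hamming distance uses n00; the others ignore shared absences.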
![Page 12: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/12.jpg)
Simulations
Random (Yule) topology, edge lengths chosen from a uniform distribution, 0.05 < l < 0.40
Yule tree, subject to minimum edge length 0.01
![Page 13: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/13.jpg)
Simulation details
• Choose an arbitrary node to start the process at. At this node, the number of DArT fragments is taken from a Poisson distribution with mean M. (We use the result from HS 2004 that a stochastic Dollo process is independent of the root).
• Propagate outward from the start point along tree edges, so that each new node acquires some new DArT fragments and inherits some of those from its parent.
• If the edge length is l, then the probability of a given fragment present in the parent still being present at the end of the edge is exp(-l).
• The number of new fragments in the child but not the parent is Poisson distributed with mean (1-exp(-l))M.
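The four steps above can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the authors' code: the tree is passed as a hypothetical dictionary mapping node pairs to edge lengths, fragments are represented by integer IDs, and the Poisson sampler counts unit-rate arrivals because the Python standard library has no Poisson generator.

```python
import itertools
import math
import random
from collections import deque

def poisson(rng, mean):
    """Sample a Poisson count by counting unit-rate arrivals in [0, mean)."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_dollo(edges, start, M, seed=0):
    """Return {node: set of fragment IDs} for a tree given as {(u, v): length}."""
    rng = random.Random(seed)
    neighbours = {}
    for (u, v), length in edges.items():
        neighbours.setdefault(u, []).append((v, length))
        neighbours.setdefault(v, []).append((u, length))
    ids = itertools.count()
    # Start node: Poisson(M) fragments.
    fragments = {start: {next(ids) for _ in range(poisson(rng, M))}}
    queue = deque([start])
    while queue:  # propagate outward along tree edges
        parent = queue.popleft()
        for child, length in neighbours.get(parent, []):
            if child in fragments:
                continue
            keep = math.exp(-length)  # survival probability over the edge
            inherited = {f for f in fragments[parent] if rng.random() < keep}
            gained = {next(ids) for _ in range(poisson(rng, (1.0 - keep) * M))}
            fragments[child] = inherited | gained
            queue.append(child)
    return fragments

tree = {("a", "b"): 0.2, ("b", "c"): 0.3}
sim = simulate_dollo(tree, "a", M=50, seed=1)
print({node: len(frags) for node, frags in sim.items()})
```

Because every fragment gets a fresh ID when it is gained, a fragment shared by two nodes is necessarily present along the whole path between them, which is the Dollo property the distance correction relies on.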
![Page 14: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/14.jpg)
Simulations
Selection of Reference Taxa
[Figure: five example trees with reference taxa labelled R, S, T, U, V]
One ref, included
Two refs, included
One ref, excluded
Two refs, excluded
All taxa are refs
![Page 15: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/15.jpg)
Simulations
All taxa are references, 9 taxa.
![Page 16: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/16.jpg)
Simulations
Single reference, excluded, 9 taxa. Single reference, included, 9 taxa.
![Page 17: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/17.jpg)
Simulations (distance matrix -> tree by FastME)
![Page 18: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/18.jpg)
Simulations
![Page 19: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/19.jpg)
Simulations
![Page 20: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/20.jpg)
Multiple References
[Figure: presence/absence matrix for taxa R, S, A and B, with sites colour-coded]
If R were the only reference, we'd only see the coloured sites: $n_{10}=6$, $n_{01}=2$, $n_{11}=2$, so $d(A,B) = -\log_2(2/4) - \log_2(2/8) = 3$ (logarithms to base 2).
![Page 21: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/21.jpg)
Multiple References
[Figure: presence/absence matrix for taxa R, S, A and B, with sites colour-coded]
With R and S as references: $n_{10}=7$, $n_{01}=7$, $n_{11}=3$, so $d(A,B) = -2\log_2(3/10) = 3.474$ (logarithms to base 2).
![Page 22: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/22.jpg)
Generalizing the DArT Distance
• The DArT distance does less well when there is more than one reference taxon.
• Define $d_{RDa}(A,B;R)$ = DArT distance between A and B calculated only from sites that are 1 at R.
• Then $d_{GD}(A,B)$ is a weighted average:
$d_{GD}(A,B) = \left(\sum_R d_{RDa}(A,B;R)\sqrt{n_R}\right) / \left(\sum_R \sqrt{n_R}\right)$
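The weighted average can be sketched as follows. Two points are assumptions rather than statements from the slide: n_R is taken to be the number of sites scored 1 at R, and the set of taxa summed over is passed in explicitly as `conditioning`. The function names and the dictionary of 0/1 strings are hypothetical.

```python
import math

def restricted_dart(a: str, b: str, r: str):
    """DArT distance between a and b using only sites scored 1 at r.

    Returns (distance, n_r), where n_r counts the sites that are 1 at r
    (an assumed reading of the slide's n_R).
    """
    n11 = n10 = n01 = n_r = 0
    for x, y, z in zip(a, b, r):
        if z != "1":
            continue
        n_r += 1
        if x == "1" and y == "1":
            n11 += 1
        elif x == "1":
            n10 += 1
        elif y == "1":
            n01 += 1
    if n11 == 0:
        raise ValueError("no shared fragments among the sites that are 1 at r")
    return math.log((1 + n01 / n11) * (1 + n10 / n11)), n_r

def generalised_dart(scores, a, b, conditioning):
    """Average of restricted DArT distances, weighted by sqrt(n_r)."""
    num = den = 0.0
    for r in conditioning:
        d, n_r = restricted_dart(scores[a], scores[b], scores[r])
        weight = math.sqrt(n_r)
        num += weight * d
        den += weight
    return num / den

scores = {"R": "1111", "T": "1100", "A": "1110", "B": "1101"}
print(generalised_dart(scores, "A", "B", ["R", "T"]))
```

Raising an error when n11 is zero is a design choice made for this sketch; a real analysis would need to decide how to treat conditioning taxa that leave A and B with no shared fragments.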
![Page 23: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/23.jpg)
Partitioned DArT distances (under construction)
• When the reference taxa are known (typically the case)
• And it's also known which fragments come from which reference taxon (not always the case)
• You can define a partitioned DArT distance that takes a weighted average of the DArT distance for each partition.
![Page 24: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/24.jpg)
Simulations
All taxa are references, 9 taxa.
![Page 25: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/25.jpg)
Simulations
![Page 26: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/26.jpg)
DArT, Generalized DArT and HS tree (FastME)
94 Eucalypt taxa, 8 reference taxa
![Page 27: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/27.jpg)
Norwich
• Why does the Generalised DArT distance perform so well when the reference taxa are included and so poorly when they are not?
![Page 28: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/28.jpg)
Single reference
[Figure: tree with reference taxon R and taxa A, B, C, D; edges labelled $p_r$, $p_a$, $p_{bcd}$, $p_b$, $p_{cd}$, $p_c$, $p_d$]
Pattern probabilities can be computed by rooting the tree at the reference taxon and then only considering loss of fragments.
E.g. the probability of seeing the pattern R=1, A=0, B=0, C=1, D=1 is
$(1-p_r)\,p_a\,(1-p_{bcd})\,p_b\,(1-p_{cd})(1-p_c)(1-p_d)$
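The product above can be reproduced with a small recursion over the tree rooted at R. This sketch reads each p as the probability of losing the fragment along that edge, which is how the slide's product is built. The internal node names `x`, `y`, `z` and the function name are hypothetical.

```python
import math

def pattern_probability(children, pattern, node, present=True):
    """Probability of the observed 0/1 pattern below `node`, given whether
    the fragment is present at `node`. Only losses are allowed (Dollo)."""
    if node in pattern and bool(pattern[node]) != present:
        return 0.0
    prob = 1.0
    for child, loss in children.get(node, []):
        if present:
            prob *= ((1.0 - loss) * pattern_probability(children, pattern, child, True)
                     + loss * pattern_probability(children, pattern, child, False))
        else:
            # An absent fragment cannot be regained further down the tree.
            prob *= pattern_probability(children, pattern, child, False)
    return prob

p = 0.01  # every edge gets the same loss probability in this example
tree = {
    "R": [("x", p)],            # edge p_r
    "x": [("A", p), ("y", p)],  # edges p_a, p_bcd
    "y": [("B", p), ("z", p)],  # edges p_b, p_cd
    "z": [("C", p), ("D", p)],  # edges p_c, p_d
}
pattern = {"R": 1, "A": 0, "B": 0, "C": 1, "D": 1}
print(pattern_probability(tree, pattern, "R"))
print((1 - p) ** 5 * p ** 2)  # the slide's product with all p equal
```

The recursion sums over the unobserved states at internal nodes, but for this pattern only one history has non-zero probability, so it collapses to the single product on the slide.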
![Page 29: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/29.jpg)
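Reference unknown
[Figure: tree with taxa R, A, B, C, D; edges labelled $p_r$, $p_a$, $p_{bcd}$, $p_b$, $p_{cd}$, $p_c$, $p_d$]
Set all edge probabilities to 0.01
$D(A,C) = \log[(1+n_{01}/n_{11})(1+n_{10}/n_{11})]$

| | R | B | D |
|---|---|---|---|
| $n_{01}/n_{11}$ | 0.010 | 0.010 | 0.010 |
| $n_{10}/n_{11}$ | 0.031 | 0.020 | 0.010 |
| $D(A,C)$ | 0.040 | 0.030 | 0.020 |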
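One reading of this table is that R is the true reference and each column restricts the calculation to sites that are 1 at the named taxon. That interpretation is an assumption, but the sketch below, which enumerates every loss history on the tree with all edge probabilities set to 0.01, gives values that agree with the table to three decimals. The internal node names and function names are hypothetical.

```python
import math
from itertools import product

# Tree rooted at the reference R; x, y, z are hypothetical internal node names.
PARENT = {"x": "R", "A": "x", "y": "x", "B": "y", "z": "y", "C": "z", "D": "z"}
ORDER = ["x", "A", "y", "B", "z", "C", "D"]  # every parent precedes its children
TAXA = ["R", "A", "B", "C", "D"]

def pattern_distribution(loss=0.01):
    """Exact distribution of leaf patterns for fragments present at R."""
    dist = {}
    for survives in product([True, False], repeat=len(ORDER)):
        present = {"R": True}
        prob = 1.0
        for node, ok in zip(ORDER, survives):
            present[node] = present[PARENT[node]] and ok
            prob *= (1.0 - loss) if ok else loss
        key = tuple(present[t] for t in TAXA)
        dist[key] = dist.get(key, 0.0) + prob
    return dist

def conditioned_distance(dist, a, b, cond):
    """DArT distance between a and b using only patterns that are 1 at cond."""
    ia, ib, ic = (TAXA.index(t) for t in (a, b, cond))
    n11 = sum(p for k, p in dist.items() if k[ic] and k[ia] and k[ib])
    n10 = sum(p for k, p in dist.items() if k[ic] and k[ia] and not k[ib])
    n01 = sum(p for k, p in dist.items() if k[ic] and not k[ia] and k[ib])
    return math.log((1 + n01 / n11) * (1 + n10 / n11))

dist = pattern_distribution()
for cond in ("R", "B", "D"):
    print(cond, round(conditioned_distance(dist, "A", "C", cond), 3))
```

Under this reading, conditioning on a taxon other than the true reference shortens the estimated distance, because fragments seen at both R and the conditioning taxon are guaranteed to have survived along the path between them.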
![Page 30: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/30.jpg)
Multiple reference taxa
[Figure: tree with reference taxa R and S and taxa A, B, C; edges labelled $p_r$, $p_a$, $p_{bcs}$, $p_b$, $p_{cs}$, $p_c$, $p_s$]
In the multiple reference setting you also have to consider gain of fragments down any edge that is above a reference taxon.
E.g. the probability of seeing the pattern R=0, A=0, B=0, C=1, S=1 has 4 terms:
$p_r\,p_a\,(1-p_{bcs})\,p_b\,(1-p_{cs})(1-p_c)(1-p_s)$
$+\; p_{bcs}\,p_b\,(1-p_{cs})(1-p_c)(1-p_s)$
$+\; p_{cs}\,(1-p_c)(1-p_s)$
$+\; p_s$
* need to renormalise probabilities
![Page 31: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/31.jpg)
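[Figure: tree with reference taxa R and S and taxa A, B; edges labelled $p_r$, $p_a$, $p_{bs}$, $p_b$, $p_s$]
Set all edge probabilities to 0.01
$D(A,B) = \log[(1+n_{01}/n_{11})(1+n_{10}/n_{11})]$

| | R | S | R or S |
|---|---|---|---|
| $n_{01}/n_{11}$ | 0.010 | 0.020 | 0.020 |
| $n_{10}/n_{11}$ | 0.020 | 0.010 | 0.020 |
| $D(A,B)$ | 0.030 | 0.030 | 0.040 |

| | R | S | R or S |
|---|---|---|---|
| $n_{00}$ | 0.0102 | 0.0102 | 0.0203 |
| $n_{01}$ | 0.0097 | 0.0195 | 0.0196 |
| $n_{10}$ | 0.0195 | 0.0097 | 0.0196 |
| $n_{11}$ | 0.9606 | 0.9606 | 0.9702 |
| Total | 1.0000 | 1.0000 | 1.0297 |

* need to renormalise probabilities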
![Page 32: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/32.jpg)
Future ideas
• The small examples we worked through in Norwich suggest two new ideas to be tested by simulation
• In the case of unknown references, compute D(X,Y|R) for each R and take the max.
• In the case of known references, a modification to the Generalised DArT that only averages over the references
![Page 33: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/33.jpg)
Future work - hybridisation
![Page 34: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/34.jpg)
Links to other people's work
• Gene content evolution with HGT, aka controlling ancestral genome obesity (Tal Dagan, Bill Martin)
• Language evolution with borrowing (Geoff Nicholls, Russell Gray)
![Page 35: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/35.jpg)
BIG Thanks to Torsten and Shiju
![Page 36: Making the most of DArT data for phylogenetic inference Barbara Holland & Michael Woodhams (Maths & Physics) Dorothy Steane (Plant Science) Vincent Moulton](https://reader035.vdocuments.us/reader035/viewer/2022062518/56649eec5503460f94bfdc39/html5/thumbnails/36.jpg)
http://www.maths.utas.edu.au/phylomania/phylomania2011.htm