comparing phylogenetic and statistical classification methods for dna barcoding frederic austerlitz,...

Comparing phylogenetic and statistical classification methods

for DNA barcoding

Frederic Austerlitz, Olivier David, Brigitte Schaeffer, Sisi Ye, Michel

Veuille & Catherine Laredo

Testing the assignation methods

• Individual to be tested:– X1 : ATATGTACCTAGTA

– X2 : TTATCTACCTAGAA

• Reference sample :– 1_1: ATATGTACGTAGTA

– 1_2: ATATCTACGAAGTA

– 1_3: ATATCTACTAAGTA

– 2_1: ATATGTACGTAGTT

– 2_2: ATATGTACGAAGTT

– 2_3: ATTTCTACTAAGTT

Species 1

Species 2

Phylogenetic methods

1_1 1_3 2_1 1_2 2_2 2_3X1

• Unanimity rule: X1 classified as ambiguous, X2 classified as species 1

• Majority rule: X1 and X2 classified as belonging to species 1

• Two methods tested: neighbor joining and maximum likelihood (PhyML, Guindon and Gascuel 2003)

X2

Node 0 Start with all the individualsto reach:

« individuals in each leave belong to the same species »

A: 5 B: 3 C: 8

CART (Classification And Regression Tree)

• Builds a classification tree from the reference sample

(Breiman et al., 1984, 1996)

Node 1

A: 5 B: 0 C: 4

Node 2

A: 0 B: 3 C: 4

Leave 1

A: 5 B: 0 C: 0

Leave 2

A: 0 B: 0 C: 4

Leave 3

A: 0 B: 0 C: 4

Leave 1

A: 0 B: 3 C: 0

node t = subset of individualsp (j | t) = relative proportion of individuals of class j in node t

JJ

1,...,

1 maximum

1,0,...,0,...,0,...,0,1 minimum for

Computing the impurity of the nodes

Impurity criterion at node t:

I(t) = - ∑j p(j|t) log p(j|t) entropy

I(t) = 1 - ∑j p(j|t)² Gini index

))(),...,(),...,1(()( tJptjptptI

PL + PR = 1

t

tLtR

s PRPL s SS: set of variables

t: set of individuals

Finding the best split

ΔI(s*,t) = max { ΔI(s,t), s S }

ΔI(s,t) = I(t) - pL I(tL) - pR I(tR)

Rule to select a splitting candidate:

Decrease in impurity:

s* selected as

Stop splitting rule: e.g. threshold β

max { ΔI(s,t), s S } < β

Example:

I(t) = I(node0) = - ∑j p(j|t) log p(j|t)At node 0:

I(node0) = - [3/10 × log(3/10) + 3/10 × log(3/10) + 4/10 × log(4/10)]

3 species, 10 individuals, 4 variables

I(node0) = 1.0889

Species x1 x2 x3 x4 A a g c g A a g c g A a a a g B t t a g B t t a g B t a c g C a c a g C g c a g C a c a g C a c c g

Node 0A: 3/10, B: 3/10, C: 4/10

I(t) = - ∑j p(j|t) log p(j|t)

Splitting according to x1


At node L: Ix1(nodeL) = - [0 + 3/3× log(3/3) + 0] = 0

Ix1(nodeR) = - [3/7× log(3/7) + 0 + 4/7× log(4/7)]At node R:

Ix1(nodeR) = 0.6829

A: 0, B: 3/3, C: 0 A: 3/7, B: 0, C: 4/7

Node 0A: 3/10, B: 3/10, C: 4/10

x1= t

Node L Node R

PR= 7/10PL= 3/10

ΔI(x1,t) = I(node0) - PL *Ix1(nodeL) - PR *Ix1(nodeR)

ΔI(x1,t) = 1.0889 - 0.3 × 0 - 0.7 × 0.6829 = 0.6109


At node L:

Ix2(nodeR) = - [0 + 0 + 4/4*log(4/4)] = 0


Ix2(nodeL) = - [3/6*log(3/6) + 3/6*log(3/6) +0]

At node R:

Node 0A: 3/10, B: 3/10, C: 4/10

x2 = a,g,t

Node L Node RA: 3/6, B: 3/6, C: 0 A: 0, B: 0, C: 4/4

PR= 4/10PL= 6/10


ΔI(x2,t) = 1.0889 - 0.6 *0.6931 - 0.4*0 = 0.6730

Ix2(nodeL) = 0.6931



Ix3(nodeR) = - [2/4*log(2/4) + 1/4*log(1/4) + 1/4*log(1/4)] = 1.040


At node L: Ix3(nodeL) = - [1/6*log(1/6) + 2/6*log(2/6) + 2/6*log(2/6)] = 1.031

At node R:

Node 0A: 3/10, B: 3/10, C: 4/10

x3 = a

Node L Node RA: 1/6, B: 2/6, C: 1/6 A: 2/4, B: 1/4, C: 1/4

PR= 4/10PL= 6/10


ΔI(x3,t) = 1.0889 - 0.6 *1.031 - 0.4*1.040 = 0.0662



ΔI(x2,t) = 0.6730

ΔI(s*,t) = max ΔI(s,t)s S

ΔI(x1,t) = 0.6109

ΔI(x3,t) = 0.0662

x4 no division

x2 is selected

Choosing the best split

Criterion: entropy

Software: RPackage: rpart

Implementation

Bagging or bootstrap aggregation

N bootstrap samples from original data

N classification trees

N assignements for a new individual

Majority rule → class of the new individual

Simulation method

• one haploid population that splits into two (or more), T generations in the past.

• The ancestral population and the two new populations of constant size N.

• Sequences with mutation rate parameter of interest = 2N• Simulations performed with simcoal 2.1.2 (Excoffier et al)

T

Evaluation of the different classification methods

• We simulate n +1 individuals in each species.

• n individuals are considered as the reference samples, and the last one as the individual to test.

• Using repeated simulations, we compute the proportion of cases in which each test individual is correctly assigned to its species.

T

Parameters assumed for the simulation study

• = 3 (e.g. Litoria) or 30 (e.g. Astraptes)

• Reference sample size n = 3, 5, 10, 25

• Effective population size: N = 1000

• Separation time T = N/10, N/2, N, 5N or 10N

• Number of newly founded populations: from 2 to 5 In all cases, all populations assumed to be founded

simultaneously.

Comparison between phylogenetic methods

0.0%

20.0%

40.0%

60.0%

80.0%

100.0%

0 10 20 30

reference sample size

go

od

ass

ign

men

t ra

te

neighbor joining

PhyML

• 2 populations • Separation time = 500, = 3.

Effect of the number of populations

• = 3

0.00%

20.00%

40.00%

60.00%

80.00%

100.00%

2 3 4 5

Number of populations

go

od

ass

igm

ent

rate

phylo

cart

bagging

• = 30

0.00%

20.00%

40.00%

60.00%

80.00%

100.00%

2 3 4 5

Number of populations

go

od

ass

igm

ent

rate

phylo

cart

bagging

(Separation time: 500 generations, Reference sample size = 10)

Effect of the separation time

• = 3

• = 30

(4 populations, Reference sample size = 10)

0.00%

20.00%

40.00%

60.00%

80.00%

100.00%

100 1000 10000

Separation time

go

od

ass

igm

ent

rate

phylo

CART

Bagging

0.00%

20.00%

40.00%

60.00%

80.00%

100.00%

100 1000 10000

Separation time

go

od

ass

igm

ent

rate

phylo

CART

Bagging

Effect of the size of the reference sample

• = 3

• = 30

(4 populations, Separation time = 500)

0.00%

20.00%

40.00%

60.00%

80.00%

100.00%

0 5 10 15 20 25

Reference sample size

go

od

ass

igm

ent

rate

phylo

CART

Bagging

0.00%

20.00%

40.00%

60.00%

80.00%

100.00%

0 5 10 15 20 25

Reference sample size

go

od

ass

igm

ent

rate

phylo

CART

Bagging

Adding nuclear genes

• We considered the case where polymorphism of nuclear genes are also available.

• We assumed that– these genes were independent.– they all have the same (= 4N) value, equal

to the value for the cytoplasmic genes.– They do not show intragenic recombination or

they show it at a rate equal to the mutation rate (c = , i.e. 4Nc = ).


• Phylogenetic method• 2 populations• = 3, separation time = 500, reference sample size = 10

0%

20%

40%

60%

80%

100%

0 1 2 3 4 5

number of nuclear loci

go

od

ass

ign

men

t ra

te

without recombination no cyto

with recombination no cyto

without recombination + cyto

with recombination + cyto


• Phylogenetic method• = 30, separation time = 500, reference sample size = 10

0%

20%

40%

60%

80%

100%

0 1 2 3 4 5

number of nuclear loci

go

od

ass

ign

men

t ra

te

without recombination no cyto

with recombination no cyto

without recombination + cyto

with recombination + cyto

Application to real data

• Litoria (Schneider et al, 1998, Mol. Ecol. 7, 487–498).

– 4 species– Average sample size: 43.75– average = 1.54

92%

93%

94%

95%

96%

97%

98%

99%

100%

0 2 4 6 8 10 12


go

od

ass

ign

men

t ra

te

phylo

CART

Application to real data

• Astraptes (Hebert et al 2004. Proc. Natl Acad. Sci. USA 101,14812–14817)

– 12 species– Average sample size: 38.8– average = 23.5

90%

91%

92%

93%

94%

95%

96%

97%

98%

99%

100%

3 4 5 6 7 8 9 10

Reference Sample size

Go

od

ass

ign

men

t ra

te

phylo

CART

Application to real data• Cowries (Meyer et al (2005) PLoS Biol 3: e422)

– 357 taxa (species/subspecies)– Average sample size: 5.7– average = 2.93

90%91%92%

93%94%95%96%97%98%

99%100%101%

0 5 10 15 20 25 30


go

od

ass

ign

men

t ra

te

phylo

CART

Conclusions• Regarding phylogenetic methods, the maximum

likelihood method performs better than the neighbor joining.

• CART performs better than phylogenetic methods for poorly informative data (low value)

• but not for more polymorphic data (high value)

• Adding nuclear loci can help, but at a quite high cost.

• Recombination improves the phylogenetic method for low values (Ongoing work for CART).

Perspectives

• Developing a statistical method to put a confidence level on a given assignation.

• Evaluating other classification methods (learning methods)

comparing phylogenetic and statistical classification methods for dna barcoding frederic austerlitz,...

Documents

t node

node r p r

x2 nodel p r

inode0 p

x1 nodel p r

node t maximum minimum

subset of individuals

variables t