optimizing pooling strategies for the massive next-generation sequencing of viral samples

25
Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2 , Alex Zelikovsky 2 , Ion Mandoiu 3 and Yury Khudyakov 1 1 Centers for Disease Control and Prevention, Atlanta, GA 2 Georgia State University, Atlanta, GA 3 University of Connecticut, Storrs, CT

Upload: erol

Post on 23-Feb-2016

33 views

Category:

Documents


0 download

DESCRIPTION

Optimizing pooling strategies for the massive next-generation sequencing of viral samples. Pavel Skums 1 Joint work with Olga Glebova 2 , Alex Zelikovsky 2 , Ion Mandoiu 3 and Yury Khudyakov 1. 1 Centers for Disease Control and Prevention, Atlanta, GA - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Optimizing pooling strategies for the massive next-generation sequencing of viral

samples

Pavel Skums1

Joint work withOlga Glebova2, Alex Zelikovsky2, Ion Mandoiu3 and

Yury Khudyakov1

1Centers for Disease Control and Prevention, Atlanta, GA2Georgia State University, Atlanta, GA3University of Connecticut, Storrs, CT

Page 2: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

1. Massive NGS of viral samples

2. Optimal pooling design problem

3. Algorithm and results

Outline

Page 3: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

NGS in epidemiology

Molecular surveillance

Prediction of the epidemics progress

Transmission networks

Epidemiological parameters

Vaccination strategies

Page 4: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

NGS in epidemiologyA large-scale molecular surveillance requires sequencing ofunprecedentedly large sets of viral samples.

NGS of tens of thousands of samples is highly cost- and labor-intensive.

Example: sequencing 100K samples using 454 senior system with 50 MIDs doing one sequencing run per day

Cost: 5000*(100 000)/50 = 10 000 000$Time: (100 000)/50 = 2000 days 5.5 years

Page 5: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Optimal Pooling Design ProblemGoal: a framework for identification of viral sequences from large number of samples using the smallest possible number of NGS runs.

Idea: for n samples generate m pools (i.e. mixtures of samples) with m << n in such a way that every sample is uniquely identified by the pools to which it belongs.

Page 6: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Optimal Pooling Design ProblemE

Example.

8 samples: 1,2,3,4,5,6,7,8Sequencing each sample separately: 8 runs

4 pools: M1 = {1,2,3,4} M2 = {5,6,7,8} M3 = {1,2,5,6} M4 = {1,3,5,7}

Sequencing pools: 4 runs

Page 7: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Optimal Pooling Design Problem4 pools: M1 = {1,2,3,4} M2 = {5,6,7,8} M3 = {1,2,5,6} M4 = {1,3,5,7}

{1} = M1M3M4

{2} = (M1M3) \ M4

……………………………………{8} = (M2 \ M3) \ M4

Page 8: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Optimal Pooling Design ProblemProblem 1 (Optimal Pooling Design Problem).

Given: a set of samples S = {S1,...,Sn}

Find: a set of pools P = {P1,…,Pm} , Pk S for k=1,…,m such that 1)P1…Pm = S 2) for every Si,SjS there exists Pk P such that |Pk{Si,Sj}| = 1 3) m is minimal

Theorem1. There exists a solution of Problem 1 with m = log(n) + 1

(Pk separates Si and Sj)

Page 9: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Optimal Pooling Design ProblemAdditional conditions for the problem

Condition Reasons

Each pool contains at most k samples

|Pj| ≤ k for j = 1,…,m

• number of reads which could be obtained by each NGS technology is bounded

• if large number of samples are mixed in one pool, some of them may be lost due to a PCR bias

Page 10: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Optimal Pooling Design ProblemAdditional conditions for the problem

Condition Reasons

Each pool contains at most k samples

|Pj| ≤ k for j = 1,…,m

• number of reads which could be obtained by each NGS technology is bounded

• if large number of samples are mixed in one pool, some of them may be lost due to a PCR bias

Each sample belongs to at least l pools

|{j : Si Pj}| ≥ l for i = 1,…,n

• to ensure sufficient coverage for sequences of each sample

Page 11: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Optimal Pooling Design ProblemAdditional conditions for the problem

Condition Reasons

Each pool contains at most k samples

|Pj| ≤ k for j = 1,…,m

• number of reads which could be obtained by each NGS technology is bounded

• if large number of samples are mixed in one pool, some of them may be lost due to a PCR bias

Each sample belongs to at least l pools

|{j : Si Pj}| ≥ l for i = 1,…,n

• to ensure sufficient coverage for sequences of each sample

Some pairs of samples should not be put into a same pool

• Some samples may intersect (if they belong to the same transmission cluster)

Page 12: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Optimal Pooling Design Problem

Condition ReasonsEach pool Pj should be a clique of the graph G(S)

• Some samples may intersect (if they belong to the same transmission cluster)

A graph G(S) = (V,E), whereV = SSiSjE if and only if there is a confidence that the samples Si and Sj do not intersect.

Page 13: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Optimal Pooling Design ProblemProblem 2 (Minimum Clique Test Set Problem).

Given: a graph G=G(S), natural numbers k,l

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V(G) belongs to at least l cliques from P 3) for every u,vV(G) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

Page 14: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Optimal Pooling Design ProblemMinimum Test Set Problem (Garey, Johnson)

Given: collection Q={Q1,…,Qn} of subsets of a finite set S

Find: a subcollection P = {P1,…,Pm}Q such that 1) for every si,sjS there exists Pr P such that |Pr{si,sj}| = 1 2) m is minimal

Page 15: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Problem reformulationsMinimum Clique Test Set Problem

Given: a graph G=G(S), natural numbers k,l

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V(G) belongs to at least l cliques from P 3) for every u,vV(G) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

Only some pairs of vertices should be separatedA graph H with V(H)=V(G), uvE(H) if and only if u and v should be separated

Page 16: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Problem reformulationsMinimum Clique Test Set Problem

Given: a graphs G and H on the same set of vertices V, natural numbers k,l

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V(G) belongs to at least l cliques from P 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimalOnly some pairs of vertices should be separatedA graph H with V(H)=V(G), uvE(H) if and only if u and v should be separated

Page 17: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Problem reformulationsMinimum Clique Test Set Problem

Given: a graphs G and H on the same set of vertices V, natural numbers k,l

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V belongs to at least l cliques from P 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

Replace each vertex vV(G) with l pairwise non-adjacent copies

Page 18: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Problem reformulationsMinimum Clique Test Set Problem

Given: a graphs G and H on the same set of vertices V, natural number k

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) P1…Pm = V(G) 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

Replace each vertex vV(G) with l pairwise non-adjacent copies

Page 19: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Problem reformulationsMinimum Clique Test Set Problem

Given: a graphs G and H on the same set of vertices V, natural number k

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) P1…Pm = V 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

For every uV add a vertex xu

and an edge uxuE(H) 1 2

3

Page 20: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Problem reformulationsMinimum Clique Test Set Problem

Given: a graphs G and H on the same set of vertices V, natural number k

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) P1…Pm = V 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

For every uV add a vertex xu

and an edge uxuE(H) 1 2

3

x1x2

x3

Page 21: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Problem reformulationsMinimum Clique Test Set Problem

Given: a graphs G and H on the same set of vertices, natural number k

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

For every uV add a vertex xu

and an edge uxuE(H) 1 2

3

x1x2

x3

Page 22: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Heuristic algorithmInput: a graphs G and H on the same set of vertices V, natural number k

P = WHILE CP C V OR E(H) find maximum cut (A,B) in H (using local search); for every aA put w(a) = # of neighbors of a from B in H; for every bB put w(b) = # of neighbors of b from A in H; find maximum clique C1 with |C1|≤k in a subgraph G[A] with weights w; find maximum clique C2 with |C2|≤k in a subgraph G[B] with weights w; C := argmax{w(C1),w(C2)} ; P:= P {C}; E(H) := E(H) \ {uv : uC, vV\C}ENDWHILE

Page 23: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Algorithm results1. All samples are unrelated (i.e. G is complete graph)

# samples # generated pools4 38 4

16 532 664 7

128 8256 9512 10

1024 112048 12

Page 24: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Algorithm results2. Some samples are related (G is a random graph with the edge probability p)

10 30 50 70 90110

130150

170190

210230

250270

290320

360400

440480

550650

750850

9500

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

p=0.5p=0.75p=1

# of samples

#poo

ls/#

sam

ples

Page 25: Optimizing pooling strategies for the  massive next-generation  sequencing of viral samples

Thank you!