optimizing pooling strategies for the massive next-generation sequencing of viral samples

Optimizing pooling strategies for the massive next-generation sequencing of viral

samples

Pavel Skums1

Joint work withOlga Glebova2, Alex Zelikovsky2, Ion Mandoiu3 and

Yury Khudyakov1

1Centers for Disease Control and Prevention, Atlanta, GA2Georgia State University, Atlanta, GA3University of Connecticut, Storrs, CT

1. Massive NGS of viral samples

2. Optimal pooling design problem

3. Algorithm and results

Outline

NGS in epidemiology

Molecular surveillance

Prediction of the epidemics progress

Transmission networks

Epidemiological parameters

Vaccination strategies

NGS in epidemiologyA large-scale molecular surveillance requires sequencing ofunprecedentedly large sets of viral samples.

NGS of tens of thousands of samples is highly cost- and labor-intensive.

Example: sequencing 100K samples using 454 senior system with 50 MIDs doing one sequencing run per day

Cost: 5000*(100 000)/50 = 10 000 000$Time: (100 000)/50 = 2000 days 5.5 years

Optimal Pooling Design ProblemGoal: a framework for identification of viral sequences from large number of samples using the smallest possible number of NGS runs.

Idea: for n samples generate m pools (i.e. mixtures of samples) with m << n in such a way that every sample is uniquely identified by the pools to which it belongs.

Optimal Pooling Design ProblemE

Example.

8 samples: 1,2,3,4,5,6,7,8Sequencing each sample separately: 8 runs

4 pools: M1 = {1,2,3,4} M2 = {5,6,7,8} M3 = {1,2,5,6} M4 = {1,3,5,7}

Sequencing pools: 4 runs

Optimal Pooling Design Problem4 pools: M1 = {1,2,3,4} M2 = {5,6,7,8} M3 = {1,2,5,6} M4 = {1,3,5,7}

{1} = M1M3M4

{2} = (M1M3) \ M4

……………………………………{8} = (M2 \ M3) \ M4

Optimal Pooling Design ProblemProblem 1 (Optimal Pooling Design Problem).

Given: a set of samples S = {S1,...,Sn}

Find: a set of pools P = {P1,…,Pm} , Pk S for k=1,…,m such that 1)P1…Pm = S 2) for every Si,SjS there exists Pk P such that |Pk{Si,Sj}| = 1 3) m is minimal

Theorem1. There exists a solution of Problem 1 with m = log(n) + 1

(Pk separates Si and Sj)

Optimal Pooling Design ProblemAdditional conditions for the problem

Condition Reasons

Each pool contains at most k samples

|Pj| ≤ k for j = 1,…,m

• number of reads which could be obtained by each NGS technology is bounded

• if large number of samples are mixed in one pool, some of them may be lost due to a PCR bias


Condition Reasons


|Pj| ≤ k for j = 1,…,m



Each sample belongs to at least l pools

|{j : Si Pj}| ≥ l for i = 1,…,n

• to ensure sufficient coverage for sequences of each sample


Condition Reasons


|Pj| ≤ k for j = 1,…,m



Each sample belongs to at least l pools

|{j : Si Pj}| ≥ l for i = 1,…,n

• to ensure sufficient coverage for sequences of each sample

Some pairs of samples should not be put into a same pool

• Some samples may intersect (if they belong to the same transmission cluster)

Optimal Pooling Design Problem

Condition ReasonsEach pool Pj should be a clique of the graph G(S)

• Some samples may intersect (if they belong to the same transmission cluster)

A graph G(S) = (V,E), whereV = SSiSjE if and only if there is a confidence that the samples Si and Sj do not intersect.

Optimal Pooling Design ProblemProblem 2 (Minimum Clique Test Set Problem).

Given: a graph G=G(S), natural numbers k,l

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V(G) belongs to at least l cliques from P 3) for every u,vV(G) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

Optimal Pooling Design ProblemMinimum Test Set Problem (Garey, Johnson)

Given: collection Q={Q1,…,Qn} of subsets of a finite set S

Find: a subcollection P = {P1,…,Pm}Q such that 1) for every si,sjS there exists Pr P such that |Pr{si,sj}| = 1 2) m is minimal

Problem reformulationsMinimum Clique Test Set Problem

Given: a graph G=G(S), natural numbers k,l

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V(G) belongs to at least l cliques from P 3) for every u,vV(G) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

Only some pairs of vertices should be separatedA graph H with V(H)=V(G), uvE(H) if and only if u and v should be separated


Given: a graphs G and H on the same set of vertices V, natural numbers k,l

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V(G) belongs to at least l cliques from P 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimalOnly some pairs of vertices should be separatedA graph H with V(H)=V(G), uvE(H) if and only if u and v should be separated


Given: a graphs G and H on the same set of vertices V, natural numbers k,l

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V belongs to at least l cliques from P 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

Replace each vertex vV(G) with l pairwise non-adjacent copies


Given: a graphs G and H on the same set of vertices V, natural number k

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) P1…Pm = V(G) 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

Replace each vertex vV(G) with l pairwise non-adjacent copies



Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) P1…Pm = V 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal

For every uV add a vertex xu

and an edge uxuE(H) 1 2

3



Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) P1…Pm = V 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal



3

x1x2

x3


Given: a graphs G and H on the same set of vertices, natural number k

Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal



3

x1x2

x3

Heuristic algorithmInput: a graphs G and H on the same set of vertices V, natural number k

P = WHILE CP C V OR E(H) find maximum cut (A,B) in H (using local search); for every aA put w(a) = # of neighbors of a from B in H; for every bB put w(b) = # of neighbors of b from A in H; find maximum clique C1 with |C1|≤k in a subgraph G[A] with weights w; find maximum clique C2 with |C2|≤k in a subgraph G[B] with weights w; C := argmax{w(C1),w(C2)} ; P:= P {C}; E(H) := E(H) \ {uv : uC, vV\C}ENDWHILE

Algorithm results1. All samples are unrelated (i.e. G is complete graph)

# samples # generated pools4 38 4

16 532 664 7

128 8256 9512 10

1024 112048 12

Algorithm results2. Some samples are related (G is a random graph with the edge probability p)

10 30 50 70 90110

130150

170190

210230

250270

290320

360400

440480

550650

750850

9500

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

p=0.5p=0.75p=1

# of samples

#poo

ls/#

sam

ples

Thank you!

optimizing pooling strategies for the massive next-generation sequencing of viral samples

Documents

large number of samples

n samples

mixtures of samples

thousands of samples

massive ngs of viral

pooling strategies

day cost

alex zelikovsky2