optimizing pooling strategies for the massive next-generation sequencing of viral samples pavel...
TRANSCRIPT
![Page 1: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/1.jpg)
Optimizing pooling strategies for the massive next-generation sequencing of
viral samples
Pavel Skums1
Joint work withOlga Glebova2, Alex Zelikovsky2, Ion Mandoiu3 and
Yury Khudyakov1
1Centers for Disease Control and Prevention, Atlanta, GA2Georgia State University, Atlanta, GA3University of Connecticut, Storrs, CT
![Page 2: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/2.jpg)
1. Massive NGS of viral samples
2. Optimal pooling design problem
3. Algorithm and results
Outline
![Page 3: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/3.jpg)
NGS in epidemiology
Molecular surveillance
Prediction of the epidemics progress
Transmission networks
Epidemiological parameters
Vaccination strategies
![Page 4: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/4.jpg)
NGS in epidemiologyA large-scale molecular surveillance requires sequencing ofunprecedentedly large sets of viral samples.
NGS of tens of thousands of samples is highly cost- and labor-intensive.
Example: sequencing 100K samples using 454 senior system with 50 MIDs doing one sequencing run per day
Cost: 5000*(100 000)/50 = 10 000 000$Time: (100 000)/50 = 2000 days 5.5 years
![Page 5: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/5.jpg)
Optimal Pooling Design Problem
Goal: a framework for identification of viral sequences from large number of samples using the smallest possible number of NGS runs.
Idea: for n samples generate m pools (i.e. mixtures of samples) with m << n in such a way that every sample is uniquely identified by the pools to which it belongs.
![Page 6: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/6.jpg)
Optimal Pooling Design ProblemE
Example.
8 samples: 1,2,3,4,5,6,7,8Sequencing each sample separately: 8 runs
4 pools: M1 = {1,2,3,4} M2 = {5,6,7,8} M3 = {1,2,5,6} M4 = {1,3,5,7}
Sequencing pools: 4 runs
![Page 7: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/7.jpg)
Optimal Pooling Design Problem4 pools: M1 = {1,2,3,4} M2 = {5,6,7,8} M3 = {1,2,5,6} M4 = {1,3,5,7}
{1} = M1M3M4
{2} = (M1M3) \ M4
……………………………………{8} = (M2 \ M3) \ M4
![Page 8: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/8.jpg)
Optimal Pooling Design ProblemProblem 1 (Optimal Pooling Design Problem).
Given: a set of samples S = {S1,...,Sn}
Find: a set of pools P = {P1,…,Pm} , Pk S for k=1,…,m such that 1)P1…Pm = S 2) for every Si,SjS there exists Pk P such that |Pk{Si,Sj}| = 1 3) m is minimal
Theorem1. There exists a solution of Problem 1 with m = log(n) + 1
(Pk separates Si and Sj)
![Page 9: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/9.jpg)
Optimal Pooling Design ProblemAdditional conditions for the problem
Condition Reasons
Each pool contains at most k samples
|Pj| ≤ k for j = 1,…,m
• number of reads which could be obtained by each NGS technology is bounded
• if large number of samples are mixed in one pool, some of them may be lost due to a PCR bias
![Page 10: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/10.jpg)
Optimal Pooling Design ProblemAdditional conditions for the problem
Condition Reasons
Each pool contains at most k samples
|Pj| ≤ k for j = 1,…,m
• number of reads which could be obtained by each NGS technology is bounded
• if large number of samples are mixed in one pool, some of them may be lost due to a PCR bias
Each sample belongs to at least l pools
|{j : Si Pj}| ≥ l for i = 1,…,n
• to ensure sufficient coverage for sequences of each sample
![Page 11: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/11.jpg)
Optimal Pooling Design ProblemAdditional conditions for the problem
Condition Reasons
Each pool contains at most k samples
|Pj| ≤ k for j = 1,…,m
• number of reads which could be obtained by each NGS technology is bounded
• if large number of samples are mixed in one pool, some of them may be lost due to a PCR bias
Each sample belongs to at least l pools
|{j : Si Pj}| ≥ l for i = 1,…,n
• to ensure sufficient coverage for sequences of each sample
Some pairs of samples should not be put into a same pool
• Some samples may intersect (if they belong to the same transmission cluster)
![Page 12: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/12.jpg)
Optimal Pooling Design Problem
Condition ReasonsEach pool Pj should be a clique of the graph G(S)
• Some samples may intersect (if they belong to the same transmission cluster)
A graph G(S) = (V,E), whereV = SSiSjE if and only if there is a confidence that the samples Si and Sj do not intersect.
![Page 13: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/13.jpg)
Optimal Pooling Design ProblemProblem 2 (Minimum Clique Test Set Problem).
Given: a graph G=G(S), natural numbers k,l
Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V(G) belongs to at least l cliques from P 3) for every u,vV(G) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal
![Page 14: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/14.jpg)
Optimal Pooling Design ProblemMinimum Test Set Problem (Garey, Johnson)
Given: collection Q={Q1,…,Qn} of subsets of a finite set S
Find: a subcollection P = {P1,…,Pm}Q such that 1) for every si,sjS there exists Pr P such that |Pr{si,sj}| = 1 2) m is minimal
![Page 15: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/15.jpg)
Problem reformulationsMinimum Clique Test Set Problem
Given: a graph G=G(S), natural numbers k,l
Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V(G) belongs to at least l cliques from P 3) for every u,vV(G) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal
Only some pairs of vertices should be separatedA graph H with V(H)=V(G), uvE(H) if and only if u and v should be separated
![Page 16: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/16.jpg)
Problem reformulationsMinimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices V, natural numbers k,l
Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V(G) belongs to at least l cliques from P 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimalOnly some pairs of vertices should be separatedA graph H with V(H)=V(G), uvE(H) if and only if u and v should be separated
![Page 17: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/17.jpg)
Problem reformulationsMinimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices V, natural numbers k,l
Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) every vertex v V belongs to at least l cliques from P 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal
Replace each vertex vV(G) with l pairwise non-adjacent copies
![Page 18: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/18.jpg)
Problem reformulationsMinimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices V, natural number k
Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) P1…Pm = V(G) 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal
Replace each vertex vV(G) with l pairwise non-adjacent copies
![Page 19: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/19.jpg)
Problem reformulationsMinimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices V, natural number k
Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) P1…Pm = V 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal
For every uV add a vertex xu
and an edge uxuE(H) 1 2
3
![Page 20: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/20.jpg)
Problem reformulationsMinimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices V, natural number k
Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 2) P1…Pm = V 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal
For every uV add a vertex xu
and an edge uxuE(H) 1 2
3
x1x2
x3
![Page 21: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/21.jpg)
Problem reformulationsMinimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices, natural number k
Find: a set of cliques P = {P1,…,Pm} of G such that 1) |Pi| ≤ k for every i=1,…,m 3) for every uvE(H) there exists Pi P such that |Pi{u,v}| = 1 4) m is minimal
For every uV add a vertex xu
and an edge uxuE(H) 1 2
3
x1x2
x3
![Page 22: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/22.jpg)
Heuristic algorithmInput: a graphs G and H on the same set of vertices V, natural number k
P = WHILE CP C V OR E(H) find maximum cut (A,B) in H (using local search); for every aA put w(a) = # of neighbors of a from B in H; for every bB put w(b) = # of neighbors of b from A in H; find maximum clique C1 with |C1|≤k in a subgraph G[A] with weights w; find maximum clique C2 with |C2|≤k in a subgraph G[B] with weights w; C := argmax{w(C1),w(C2)} ; P:= P {C}; E(H) := E(H) \ {uv : uC, vV\C}ENDWHILE
![Page 23: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/23.jpg)
Algorithm results1. All samples are unrelated (i.e. G is complete graph)
# samples # generated pools4 38 4
16 532 664 7
128 8256 9512 10
1024 112048 12
![Page 24: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/24.jpg)
Algorithm results2. Some samples are related (G is a random graph with the edge probability p)
10 30 50 70 90110
130150
170190
210230
250270
290320
360400
440480
550650
750850
9500
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
p=0.5p=0.75p=1
# of samples
#poo
ls/#
sam
ples
![Page 25: Optimizing pooling strategies for the massive next-generation sequencing of viral samples Pavel Skums 1 Joint work with Olga Glebova 2, Alex Zelikovsky](https://reader035.vdocuments.us/reader035/viewer/2022062516/56649e055503460f94af1ef4/html5/thumbnails/25.jpg)
Thank you!