quasispecies assembly using network flows alex zelikovsky georgia state university joint work with...

30
Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya Georgia State University David Campo Centers for Disease Control Yury Khudyakov Centers for Disease Control Piotr Berman Pennsylvania State University Ion Mandoiu University of Connecticut

Upload: hilda-powers

Post on 04-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Quasispecies Assembly Using Network Flows

Alex Zelikovsky

Georgia State University

Joint work with

Kelly Westbrooks Georgia State UniversityIrina Astrovskaya Georgia State UniversityDavid Campo Centers for Disease ControlYury Khudyakov Centers for Disease ControlPiotr Berman Pennsylvania State UniversityIon Mandoiu University of Connecticut

Page 2: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Outline

454 Sequencing of Virus Genome

Quasispecies Assembly

The Read Graph

Network Flow Formulations

Phasing Flow Problem

Maximum Likelihood

Simulation Results

Page 3: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

HCV Quasispecies

HCV is a small, enveloped, positive sense single strand RNA virus that is responsible for Hepatitis C infection.

Over the course of infection, the mutations made in replication are passed down to descendants, eventually producing a family of related variants of the ancestral genome referred as quasispecies.

Due to HCV's high mutation rate, in time the quasispecies in an infected person can become very diverse.

A better understanding of HCV quasispecies diversity could potentially lead to new treatments.

The ultimate objective of our work is to develop a method of computationally inferring the quasispecies sequences in a HCV-infected individual.

Page 4: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

454 Sequencing

454 Sequencing is one promising technology that may prove useful for quasispecies sequencing. It is a massively-parallel pyrosequencing system developed the by biotechnology firm 454 Life Sciences for genome sequencing.

The system fragments the source genetic material to be sequenced into pieces called reads. Then, each read is sequenced and the original genome is reconstructed via software.

Since this system was originally designed to sequence genetic material from a single organism, the software assembles all of the reads to a single genome.

In order to use 454 for quasispecies sequencing, new methods and software are needed to correctly infer the sequences of the quasispecies present in an infected person and their population frequencies directly from 454 read data.

Page 5: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Problem Formulations

Given a collection of 454 reads taken from a quasispecies population of unknown size and distribution, reconstruct the quasispecies population, i.e. the sequences and tehir frequencies.

OriginalQuasispecies

454 Reads ReconstructedQuasispecies

Page 6: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Parsimonious/Min Cost Quasispecies Assembly

Given a set of reads, Find the minimum number set of quasispecies covering all reads

Given a set of reads with costs on read overlaps, Find the minimum number set of quasispecies covering all reads

The cost of the assembly should be inversely connected to the likelihood that the assembly is the correct one.

OriginalQuasispecies

454 Reads

Reconstruction 2Cost: C

2

Reconstruction 1Cost: C

1

If C1 < C

2, then favor Reconstruction 1 over Reconstruction 2

Page 7: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Read Alignment

Before beginning assembly, first find the genome offset of the read. We assume that the consensus sequence for the particular strain of HCV that the quasispecies came from is available to us.

We simply align each read to the consensus sequence, choosing the offset that yields the best alignment (i.e. lowest Hamming distance).

Because HCV quasispecies don't contain repeats as long as a 454 read, the alignment is both fast and extremely accurate.

GUCUCAUCGGAACAGCAAAACACUUGCCCCGAACGCUAGCGGUUGGGGUACUAUUCAAUGGCUGUAG AACAGCAAAACACUUGCUCCGAACGCUAGCGGUUGGGGAACUAU

Page 8: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

The Read Graph

The data structure: a directed acyclic graph that contains every possible quasispecies reconstruction.

An aligned read can be contained within another aligned read.

Find the subset of reads that are not contained within any other read

We call these reads “superreads” “subreads” = everything else

The superreads are vertices in the read graph.

Page 9: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Edges of the Read Graph

Put an edge between read X and read Y if X overlaps Y in the alignment some suffix of X = some prefix of Y

UGGACUAGAUGUGGUGGGUGCUCUCCGGAAUACCUUGGUGGCGGGU GAUGUGGUGGGUGCUCUCCGGAAUACCUUGGUGGCGGGUUAGAGA GGGUGCUCUCCGGAAUACCUUGGUGGCGGGUUAGAGAGAAGAGAGCA CUCCGGAAUACCUUGGUGGCGGGUUAGAGAGAAGAGAGCAAGUGUCA AUACCUUGGUGGCGGGUUAGAGAGAAGAGAGCAAGUGUCAACGCCUA

Page 10: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Quasispecies in the Read Graph

Then, we add two extra vertices: a new vertex with outgoing edges to all vertices with indegree 0 (the “source”) and a new vertex with incoming edges from all vertices with outdegree 0 (the “sink”)

The ReadGraph

Source Sink

Any path from the source to the sink represents a potential sequencein the quasispecies population!

Page 11: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Transitive Reduction

Edge u w logically follows from edges u v and v w Drop edge u w from consideration – no information, any quasispecies

sequence containing u and w will also have v

The transitive reduction of a graph = smallest subgraph that maintains all reachability relationships

The graph is partially closed – the transitive reduction found in O(δ|E|), where δ is the read degree

Page 12: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Estimating Read Frequencies

In general, superreads may be contained in several quasispecies sequences.

Thus, each superread has associated with its frequency = the sum over the quasispecies of the population frequencies of quasispecies that contains the superread.

Although the true read frequencies are unknown to us, we may estimate them by counting the number of subreads contained within each superread.

By definition, the read frequency of the source and sink vertices are 0.

Page 13: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Probability of a True Overlap

Given N reads over Q sequences, each read with L possible starting positions, the probability that a position is b

u for some

read u is N/(LQ).

Let (u, v) be an edge in transitive reduction. The probability of b

v-b

u > Δ is proportional to exp(Δ N/(LQ)).

Probability of an edge from the source or to sink is 0.

GUGGGGGCAGCGGACGUAUGC GACGUAUGCAGAACUCUAGGCA

bu

bv

Δ

Page 14: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Network Flow Through Vertices

Replacing the vertex for read r with two vertices r_b and r_e and the edge (r_b, r_e)

Page 15: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Networks Flows

Observe that the true quasispecies sequences in the read graph can be represented as a flow:

2

2

1

1 1 11

2

1

11 2

1 1

2

1

23

Each vertex has a frequency proportional to the number and frequencies of thequasispecies that contains it's associated superread. When we solve the flow, wedemand that each vertex has a inflow passing into it >= its frequency.

Page 16: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Min Cost Flow

We define the cost of a flow in the following manner:

The flow cost of an edge is that edge's cost multiplied by the amount of flow that traverses the edge

The cost of a flow through the graph is the sum of all of the edge flow costs.

Out of all possible flows that go through all of the vertices in the graph, we seek to find the flow with the minimum cost.

After solving min cost flow for the graph, all of the edges that have flow > 0 are assumed to participate in true quasispecies. The remaining edges can be dropped from the graph.

Page 17: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

LP for Min Cost Flow

Although there are fast combinatorial algorithms for solving min cost flow, we opted to solve the flow using a linear program.

For each edge e, create a real-valued, nonnegative variable fe to represent the flow across that edge.

The ReadGraph SinkSource

Page 18: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Linear Program for Min Cost Quasispecies Assembly

Objective:Minimize the sum of cost(e) * fe over all edges e in the read graph.

Subject to: For all vertices v:

The sum of fe over incoming edges to v equals the sum of fe over outgoing edges from v.

The sum of fe over incoming edges to v is greater or equal to the frequency of v.

Page 19: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Splitting Flow in Quasispecies

Five quasispecies share a common long segment [a,b] and differ on the left and the right in value of a SNP. The resulted graph with network flow have multiple feasible solutions.

b-a > the read length

l a b r

Multi set L Multi set RA

C

AACCT

TTACT

T

T

A

C

f=2

f=2f=3

f=1

f=1

f=1

Page 20: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Quasispecies Matching Problem

Given two multisets of haplotypes

L on the segment [l, b] and R on the segment [a, r]

such that all haplotypes are indistinguishable on a common segment [a, b] (l < a < b < r), |L| = |R|,

Find the matching between multisets L and R such that concatenation of the matched haplotypes correspond to the original quasispecies.

Page 21: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Decomposing the flow into paths

General strategy:➡Find a source-to-sink path with positive flow f ➡Subtract f from all of the edge flows in the path

How to find paths?➡Pick the shortest path “most likely quasispecies”⇒➡Pick the maximum bandwidth path “most frequent ⇒

quasispecies”

1→3→5→6 is the shortest path1→2→4→6 is the maximum bandwidth path

Page 22: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Finding Max Bandwidth Paths

A single iteration of the Bellman-Ford algorithm gives an efficient method for finding the maximum bandwidth path from the source to the sink:

Initialize: For each vertex I

W(i) ← 0 p(i) ← 0

For the source s W(s) ← +infinity

Relax: For each edge (i,j) in order      *** (i,j)< (k,l) if i<k or i=k & j<l ***

if W(j) < min { W(i), cap(i,j) }       *** cap(i,j) capacity of (i,j) W(j) ← min { W(i), cap(i,j) } p(j) ← I

Return path p(sink), p(p(sink)),..., source=0

Finding minimum cost paths is simple: just grow a shortest-path tree from the source using costs for weights.

Page 23: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Maximum Likelihood Choice

• After path decomposition, we have a set of candidate quasispecies sequences, but we don’t know what their frequencies are.

• Given a set of candidate quasispecies and observed reads

• Expectation Maximization: alternates between 2 steps until convergence: Expectation (E) and Maximization (M)

E Step: Calculates the expected likelihood by including the current estimate of the latent variables

M step: Computes maximum likelihood estimates of parameters by maximizing the expected likelihood found in the previous E step

Page 24: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

EM Implementation

Create a bipartite graph:Left side: quasispeciesRight side: superreadsPut an edge if quasispecies contains read

Keep 3 sets of numbers:For each qsp q, keep its estimated frequency fq

For each superread r, keep its frequency nr

For each (q, r) edge, keep pqr

E step: Compute pqr = nr · fq / Σ fq for every edgeM step: Compute fq = Σ pqr / Σ nr for every qsp

Page 25: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Validation

Real quasispecies population, simulated readsReal data: 44 sequences from E1E2 region of

HCV3 populations consisting of 10 sequences each:

Uniformly distributed frequencies (the “uniform” population)

Geometrically distributed frequencies (the “geometric” population)

Highly skewed distribution (the “skewed” population)

Page 26: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Instance Generation

Inputs: A quasispecies population Q, n = number of reads desired n, read length mean μ and variance σ2

S ←∅While |S| < number of reads desired

Randomly select a quasispecies q of length lq according to the population frequency distribution

Generate a read length lr using normal distribution (μ, σ2)Generate an offset o using uniform distribution on [0, lq - lr]Extract a substring of length lr starting at position o from

quasispecies q and add it to SReturn: S

Page 27: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Quality MeasuresPercentage of correctly predicted sequences:

Takes into account frequenciesSymmetric difference between multisets

“Switching error”Generalization of “switching error” from the haplotype

phasing communityAverage number of times each path corresponding to a

predicted quasispecies switches between paths in the read graph corresponding to real quasispecies.

Page 28: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Initial ResultsRead Length

Number of Reads

% Correctly Predicted

Uniform

400 40000 8%

400 50000 4%

500 40000 29%

500 50000 38%

Geometric

400 40000 21%

400 50000 11%

500 40000 50%

500 50000 66%

Skewed

400 40000 46%

400 50000 10%

500 40000 82%

500 50000 84%

Page 29: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Results – Geometric Instances

8000

1200

0

1600

0

2000

0

2400

0

2800

0

3200

0

3600

0

4000

0

4400

0

4800

0

5200

0

5600

0

6000

0

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

Average Error

350450550650750

Number of Reads

Ave

rag

e E

rro

r

8000

1200

0

1600

0

2000

0

2400

0

2800

0

3200

0

3600

0

4000

0

4400

0

4800

0

5200

0

5600

0

6000

0

0

0.2

0.4

0.6

0.8

1

1.2

% Correctly Predicted

350450550650750

Number of Reads

% C

orr

ect

ly P

red

icte

d

Page 30: Quasispecies Assembly Using Network Flows Alex Zelikovsky Georgia State University Joint work with Kelly Westbrooks Georgia State University Irina Astrovskaya

Results – Uniform Instances

8000

1200

0

1600

0

2000

0

2400

0

2800

0

3200

0

3600

0

4000

0

4400

0

4800

0

5200

0

5600

0

6000

0

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

Average Error

350450550650750

Number of Reads

Ave

rag

e E

rro

r

8000

1200

0

1600

0

2000

0

2400

0

2800

0

3200

0

3600

0

4000

0

4400

0

4800

0

5200

0

5600

0

6000

0

0

0.2

0.4

0.6

0.8

1

1.2

% Correctly Predicted

350450550650750

Number of Reads

% C

orr

ect

ly P

red

icte

d