dynamic programming: one algorithmic key to many biological locks

Dynamic programming: one algorithmic key to many biological locks. Mikhail Gelfand, RTCB, IITP RAS and FBB, MSU, 2010-2011


BIOINFORMATICS FOR BIOLOGISTS, Pavel Pevzner and Ron Shamir, eds. (Cambridge University Press, 2011)

Page 1: Dynamic programming: one algorithmic key to many biological locks

Dynamic programming: one algorithmic key to many biological locks

Mikhail Gelfand
RTCB, IITP RAS and FBB, MSU

2010-2011

Page 2: Dynamic programming: one algorithmic key to many biological locks

BIOINFORMATICS FOR BIOLOGISTS
Pavel Pevzner and Ron Shamir, eds.
(Cambridge University Press, 2011)

Ch. 4. DYNAMIC PROGRAMMING: ONE ALGORITHMIC KEY FOR MANY BIOLOGICAL LOCKS

Mikhail Gelfand
Research and Training Center “Bioinformatics” of the Institute for Information Transmission Problems, RAS
and Faculty of Bioengineering and Bioinformatics, M.V. Lomonosov Moscow State University

Page 3: Dynamic programming: one algorithmic key to many biological locks

Alignment

Three (of many) alignments of two sequences. Plus denotes a match; dot, a mismatch; minus, a gap. (a) Two matches, five mismatches; (b) three matches, one mismatch, two gaps of size three (six indels, that is, one-nucleotide insertions/deletions); (c) four matches, two gaps of size three (six indels).

Page 4: Dynamic programming: one algorithmic key to many biological locks

The number of alignments is large

# of alignments of two sequences of length N ~ $(1+\sqrt{2})^{2N+1} / \sqrt{N}$

at N = 1000, # ≈ $10^{767}$

# of elementary particles in the Universe ≈ $10^{80}$

at N = 100, # ≈ $10^{76}$

assume 1 operation per alignment, $10^{12}$ operations per second

=> need $10^{57}$ years

=> we cannot consider them one by one
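To get a feeling for these numbers, the count can be computed exactly for small N. The sketch below assumes that an alignment is counted as a path through a grid with three arc types (match/mismatch, deletion in either sequence); other counting conventions change the constant but not the exponential growth. The function name `count_alignments` is made up for this illustration.

```python
from math import log10

def count_alignments(m: int, n: int) -> int:
    """Count paths from the top-left to the bottom-right corner of an
    (m+1) x (n+1) grid with right, down and diagonal arcs."""
    row = [1] * (n + 1)          # top row: a single path (gaps only) to each vertex
    for _ in range(m):
        new_row = [1]            # left column: a single path as well
        for j in range(1, n + 1):
            # a vertex is reached from the left, from above, or diagonally
            new_row.append(new_row[j - 1] + row[j] + row[j - 1])
        row = new_row
    return row[n]

for n in (1, 2, 3, 10, 100):
    c = count_alignments(n, n)
    print(f"N = {n}: about 10^{log10(c):.1f} alignments")
```

Note that counting the alignments is itself a small dynamic programming computation: each grid vertex is visited once.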

Page 5: Dynamic programming: one algorithmic key to many biological locks

Gene recognition

• Segmentation of a genomic fragment into protein-coding and non-coding regions
• based on differences in statistical properties of these regions
• difficult in eukaryotes due to the existence of introns, non-coding regions within genes

Page 6: Dynamic programming: one algorithmic key to many biological locks

Toy example

How many operations are needed to calculate

$\sum_{i=1}^{m} \sum_{j=1}^{n} x_i y_j =$

$= x_1 y_1 + x_1 y_2 + \dots + x_1 y_n +$

$+ x_2 y_1 + x_2 y_2 + \dots + x_2 y_n +$

$+ \dots +$

$+ x_m y_1 + x_m y_2 + \dots + x_m y_n$

Naïve answer: mn multiplications and mn–1 additions

Page 7: Dynamic programming: one algorithmic key to many biological locks

but rewrite as…

$(x_1 + x_2 + \dots + x_m) \cdot (y_1 + y_2 + \dots + y_n) = \sum_{i=1}^{m} x_i \cdot \sum_{j=1}^{n} y_j$

and it becomes m+n–2 additions and just 1 multiplication
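The two ways of computing the same sum can be compared by counting operations explicitly. This is a minimal sketch; the function names and the counters are introduced only for illustration.

```python
def naive_sum(xs, ys):
    """Sum of all products x_i * y_j, term by term.
    Returns (value, multiplications, additions)."""
    total, mults, adds = None, 0, 0
    for x in xs:
        for y in ys:
            term = x * y
            mults += 1
            if total is None:
                total = term      # the first term needs no addition
            else:
                total += term
                adds += 1
    return total, mults, adds

def factored_sum(xs, ys):
    """Same value computed as (sum of xs) * (sum of ys)."""
    adds = 0
    sx = xs[0]
    for x in xs[1:]:
        sx += x
        adds += 1
    sy = ys[0]
    for y in ys[1:]:
        sy += y
        adds += 1
    return sx * sy, 1, adds

xs, ys = [1, 2, 3], [4, 5]
print(naive_sum(xs, ys))      # value, m*n multiplications, m*n - 1 additions
print(factored_sum(xs, ys))   # same value, 1 multiplication, m + n - 2 additions
```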

Page 8: Dynamic programming: one algorithmic key to many biological locks

Quiz

How many multiplications do we need to calculate

$x_1^{y_1} \cdot x_1^{y_2} \cdot \ldots \cdot x_1^{y_n} \cdot x_2^{y_1} \cdot x_2^{y_2} \cdot \ldots \cdot x_2^{y_n} \cdot \ldots \cdot x_m^{y_1} \cdot x_m^{y_2} \cdot \ldots \cdot x_m^{y_n} = \prod_{i=1}^{m} \prod_{j=1}^{n} x_i^{y_j}$

if we are

(a) naïve?

(b) sophisticated?

(c) What if in addition to multiplication, we have an operation “taking to the power”?

(d) What if we may perform not only multiplication, but also addition?

Page 9: Dynamic programming: one algorithmic key to many biological locks

Lesson

Restructuring the order of calculations using properties of the data may sharply decrease the number of operations

Page 10: Dynamic programming: one algorithmic key to many biological locks

Graphs

Vertices

Arcs – directed pairs of vertices

(Figure: example graphs; one contains cycles, one has multiple sources and sinks.)

Page 11: Dynamic programming: one algorithmic key to many biological locks

“bad” graphs and not graphs

• multiple arcs
• loop
• multiple components
• not a graph (hanging arc)
• undirected graph

Page 12: Dynamic programming: one algorithmic key to many biological locks

Sources, sinks, paths, cycles

Source is a vertex that is not an end vertex for any arc.

Sink is a vertex that is not a start vertex for any arc.

Walk p of length N is an ordered set of N arcs $p = (a_1, \dots, a_N)$ such that the end vertex of arc $a_n = (b_n, e_n)$ coincides with the start vertex of arc $a_{n+1}$, $e_n = b_{n+1}$, for all n = 1, …, N–1. In a graph without loops and multiple arcs, each walk may also be defined as an ordered set of vertices $p = (v_1, \dots, v_{N+1})$ such that for each pair of adjacent vertices $v_n$, $v_{n+1}$ there is an arc $a_n = (v_n, v_{n+1})$, n = 1, …, N.

Path is a walk in which no arc is passed twice.

Cycle is a path in which the end vertex of the last arc $a_N$ coincides with the start vertex of the first arc $a_1$, $e_N = b_1$.

Acyclic graph contains no cycles.
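The definitions of walk, path and cycle translate almost literally into code. The sketch below assumes a graph without multiple arcs, so that an arc is identified by its (start vertex, end vertex) pair; the function names are chosen for this illustration only.

```python
def is_walk(arcs):
    """Each arc must start where the previous arc ended."""
    return all(arcs[k][1] == arcs[k + 1][0] for k in range(len(arcs) - 1))

def is_path(arcs):
    """A walk in which no arc is passed twice."""
    return is_walk(arcs) and len(set(arcs)) == len(arcs)

def is_cycle(arcs):
    """A path whose last arc ends at the start vertex of its first arc."""
    return bool(arcs) and is_path(arcs) and arcs[-1][1] == arcs[0][0]

print(is_path([("A", "B"), ("B", "C")]))               # a path, not a cycle
print(is_cycle([("A", "B"), ("B", "C"), ("C", "A")]))  # returns to A
print(is_path([("A", "B"), ("B", "A"), ("A", "B")]))   # arc (A, B) is repeated
```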

Page 13: Dynamic programming: one algorithmic key to many biological locks

Quiz

(a) Draw all acyclic connected oriented graphs with three vertices (up to vertex labels).

(b) How many oriented graphs will there be if we label vertices with symbols A, B and C?

(c) Prove that in an acyclic graph there is at least one source and at least one sink.

(d) Draw sinks and sources in the graphs of (a).

Page 14: Dynamic programming: one algorithmic key to many biological locks

Problem

Consider an acyclic graph with one source and one sink. Assign each arc a number called a weight. For a given path, its path score is defined as the sum of the weights of its arcs.

Given a weighted acyclic graph, find the highest scoring path from the source to the sink.

Page 15: Dynamic programming: one algorithmic key to many biological locks

Observation

If two subpaths P and Q end at the same vertex v, and the score of P is larger than the score of Q, then for all pairs of paths P* and Q* that start with P and Q, respectively, and coincide after v, the score of P* is higher than the score of Q*.

Hence, we do not need to consider all paths, as it is sufficient to construct the highest scoring subpath from the source to each vertex, finishing at the sink.
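The observation follows in one line from the additivity of path scores. In the sketch below, R is a symbol introduced here (not in the slides) for the common continuation of P* and Q* after the vertex v, and S denotes the path score:

```latex
S(P^{*}) = S(P) + S(R) > S(Q) + S(R) = S(Q^{*})
```

So whatever happens after v, the worse subpath Q can never be part of the best path, and only the best subpath ending at v needs to be remembered.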

Page 16: Dynamic programming: one algorithmic key to many biological locks

Let’s do it for this graph

(Figure: the example weighted acyclic graph, with a weight on each arc.)

Page 17: Dynamic programming: one algorithmic key to many biological locks

(Figure: the example graph after Step 1 and Step 2 of the forward process.)

Page 18: Dynamic programming: one algorithmic key to many biological locks

(Figure: the example graph after Step 3 and Step 4 of the forward process.)

Page 19: Dynamic programming: one algorithmic key to many biological locks

(Figure: the example graph after Step 5 and Step 6 of the forward process.)

Page 20: Dynamic programming: one algorithmic key to many biological locks

(Figure: the example graph after Step 7 and Step 8 of the forward process.)

Page 21: Dynamic programming: one algorithmic key to many biological locks

(Figure: the example graph after Step 9 and after backtracing; the best score at the sink is 20.)

Page 22: Dynamic programming: one algorithmic key to many biological locks

Quiz

At what steps did we have more than one vertex with all incoming arcs processed?

Page 23: Dynamic programming: one algorithmic key to many biological locks

Algorithm

Data types and definitions:
  vertices: v, u, Source, Sink;
  arcs: (v,u), a;
  start vertex of arc a: B(a);
  weight of arc (v,u): W(v,u);
  path: BestPath; // defined as a set of arcs
  the highest score of subpath ending at v: S(v);
  the highest score of subpath ending at u and coming through (v,u): T(v,u);
  the last arc of the highest scoring subpath ending at u: L(u);

Page 24: Dynamic programming: one algorithmic key to many biological locks

Initialize:
  for each vertex v: S(v) := minus_infinity;
  S(Source) := 0. // the empty subpath ending at the source has score 0
Forward process:
  while there are unprocessed vertices:
    v := arbitrary unprocessed vertex with all incoming arcs processed;
    for each arc (v,u): // consider all arcs starting at v
      T(v,u) := S(v)+W(v,u);
      if T(v,u)>S(u) // subpath coming through v is better than the current best subpath ending at u
      then: // update the data for u
        S(u) := T(v,u);
        L(u) := (v,u);
      endif;
      mark (v,u) as processed_arc;
    endfor;
    mark v as processed_vertex;
  endwhile.
Backtracing:
  BestPath := empty_set; // initialize
  v := Sink; // go from the sink backwards by marked arcs
  until v=Source:
    Add L(v) to BestPath; // add the last arc of the best path ending at the current vertex
    v := B(L(v)); // go to the start vertex of this arc
  enduntil.
Output BestPath.
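The pseudocode can be turned into a short runnable program. The following is a sketch under stated assumptions, not the author's implementation: the graph is given as a dictionary from arcs (v, u) to weights, there is exactly one source and one sink, and the sink is reachable from the source. The small test graph and its weights are invented for illustration and are not the graph from the slides.

```python
from collections import defaultdict

def best_path(arcs, source, sink):
    """arcs: dict mapping (v, u) -> weight of arc (v, u).
    Returns (best score, list of arcs of the best path)."""
    out_arcs = defaultdict(list)       # arcs starting at each vertex
    pending_in = defaultdict(int)      # number of unprocessed incoming arcs
    vertices = {source, sink}
    for v, u in arcs:
        out_arcs[v].append(u)
        pending_in[u] += 1
        vertices.update((v, u))

    S = {v: float("-inf") for v in vertices}   # best subpath score per vertex
    S[source] = 0
    L = {}                                     # last arc of the best subpath

    # Forward process: take any vertex whose incoming arcs are all processed.
    ready = [v for v in vertices if pending_in[v] == 0]
    while ready:
        v = ready.pop()
        for u in out_arcs[v]:
            T = S[v] + arcs[(v, u)]
            if T > S[u]:                       # coming through v is better
                S[u] = T
                L[u] = (v, u)
            pending_in[u] -= 1                 # arc (v, u) is now processed
            if pending_in[u] == 0:
                ready.append(u)

    # Backtracing: follow the stored last arcs from the sink to the source.
    path, v = [], sink
    while v != source:
        path.append(L[v])
        v = L[v][0]
    path.reverse()
    return S[sink], path

toy = {("s", "a"): 2, ("s", "b"): 4, ("a", "b"): 3, ("a", "t"): 5, ("b", "t"): 1}
print(best_path(toy, "s", "t"))
```

Each arc is examined exactly once in the forward process, which is where the linear run time comes from.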

Page 25: Dynamic programming: one algorithmic key to many biological locks

The number of operations

The limiting procedure is processing vertices and adding arcs to paths, and we consider each arc only once

Hence the number of operations is linear in the number of arcs A: the run time of the algorithm is O(A)

Page 26: Dynamic programming: one algorithmic key to many biological locks

Greedy algorithm

Start at the source and select the highest-weighted arc at each step.

It does not work: 13 < 20.

(Figure: the example graph; the greedy path scores 13, whereas the highest scoring path scores 20.)
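The failure of the greedy strategy is easy to reproduce on a tiny graph. The sketch below uses an invented four-vertex graph (not the graph from the slides) and compares the greedy score with the true optimum found by trying all paths; it assumes every vertex except the sink has at least one outgoing arc.

```python
def greedy_score(out_arcs, source, sink):
    """Follow the highest-weighted outgoing arc at every step."""
    v, score = source, 0
    while v != sink:
        u, w = max(out_arcs[v], key=lambda arc: arc[1])
        score += w
        v = u
    return score

def optimal_score(out_arcs, v, sink):
    """Exhaustive search over all paths from v to the sink (fine for toy graphs)."""
    if v == sink:
        return 0
    return max(w + optimal_score(out_arcs, u, sink) for u, w in out_arcs[v])

# vertex -> list of (end vertex, weight); weights are made up for illustration
toy = {
    "s": [("a", 2), ("b", 4)],
    "a": [("t", 5), ("b", 3)],
    "b": [("t", 1)],
}
print(greedy_score(toy, "s", "t"), optimal_score(toy, "s", "t"))
```

The greedy walk is lured by the heavy first arc to b and then has only a weak arc left, while the best path accepts a lighter first arc.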

Page 27: Dynamic programming: one algorithmic key to many biological locks

Quiz

(a) Construct the simplest possible graph in which the greedy algorithm yields the highest scoring path.

(b) Construct a graph with three vertices in which the greedy algorithm does not yield the highest scoring path.

(c) Construct a graph with three vertices in which the greedy algorithm does yield the highest scoring path.

(d) Assign new weights to the arcs of the above graph so that the greedy algorithm will yield the highest scoring path.

Page 28: Dynamic programming: one algorithmic key to many biological locks

Quiz cont’d

(e) Write an algorithm for construction of the path with the maximum number of arcs and apply it to the above graph. Hint: do not change the algorithm, set proper arc weights.

(f) Modify the maximum score algorithm so as to construct the path with the minimal score and find this path for the above graph.

(g) Provide a greedy algorithm for finding the path of minimal score in a graph, and apply it to the above graph.

(h) For the above graph, find the path with the minimal number of arcs.

Page 29: Dynamic programming: one algorithmic key to many biological locks

Lesson

The generic dynamic programming algorithm may be applied to different problems. The common feature of these problems is that each one can be decomposed into an ordered set of smaller subproblems, and to solve a more complex subproblem one needs to know only the solutions of the simpler ones, but not the entire set of possibilities.

Page 30: Dynamic programming: one algorithmic key to many biological locks

Note

There exist path optimisation problems that cannot be solved efficiently by dynamic programming.

Traveling salesman problem. Given a non-oriented graph with weighted arcs, we need to construct the lowest scoring path passing through all the vertices (the salesman needs to visit all cities with travel time between the cities given by the arc weights, while spending the least amount of time traveling).

All cities need to be visited in a single trip => NP-hard problem (its decision version is NP-complete).

No efficient algorithms are known. Most computer scientists believe that for all NP-complete problems the number of operations required to provide an optimal solution grows faster than any polynomial in the problem size.

Page 31: Dynamic programming: one algorithmic key to many biological locks

Alignment

Given two symbol sequences (nucleotides or amino acids) of lengths M and N, set a correspondence between these sequences so that some symbols are set in pairs, matching or mismatching, whereas other symbols are ignored (deleted). The order of corresponding symbols in the subsequences should coincide.

The alignment score is the sum of match premiums r per matching pair minus the sum of mismatch penalties p per mismatching pair and deletion penalties q per ignored symbol.

The goal is to construct the highest scoring alignment.
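The score definition can be checked on a concrete alignment. The sketch below assumes the alignment is written as two equal-length strings with "-" marking an ignored (deleted) symbol, and that no column contains two gaps; the example sequences and the parameter values are invented for illustration.

```python
def alignment_score(top, bottom, r, p, q):
    """Score of a given alignment: +r per match, -p per mismatch,
    -q per deleted symbol (a column containing '-')."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= q
        elif a == b:
            score += r
        else:
            score -= p
    return score

# 5 matches, 1 mismatch, 2 deleted symbols
print(alignment_score("GA-TTACA", "GCATT-CA", r=10, p=3, q=2))
```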

Page 32: Dynamic programming: one algorithmic key to many biological locks

Quiz

What are the scores of the alignments?

Page 33: Dynamic programming: one algorithmic key to many biological locks

Reduction to the optimal path problem

Construct a graph.

Vertices correspond to pairs of positions (endpoints of partial alignments).

Outgoing arcs (for each vertex) are of three types:
• match (weight r) or mismatch (weight (–p)); total M∙N arcs
• deletion in the 1st sequence (weight (–q)); total M∙(N+1) arcs
• deletion in the 2nd sequence (weight (–q)); total (M+1)∙N arcs
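Because the alignment graph is a regular grid, the generic highest-scoring-path computation reduces to filling a table row by row. The following is a minimal sketch of that reduction with the weights r, –p and –q from the text; the function name and the example sequences are assumptions made for illustration.

```python
def global_alignment_score(s, t, r, p, q):
    """Highest path score from the top-left to the bottom-right vertex
    of the alignment graph of sequences s and t."""
    M, N = len(s), len(t)
    # S[i][j]: best score of aligning the first i symbols of s
    # with the first j symbols of t
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        S[i][0] = -q * i                  # only deletion arcs along the border
    for j in range(1, N + 1):
        S[0][j] = -q * j
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            diagonal = S[i - 1][j - 1] + (r if s[i - 1] == t[j - 1] else -p)
            S[i][j] = max(diagonal,           # match or mismatch arc
                          S[i - 1][j] - q,    # symbol of s is deleted
                          S[i][j - 1] - q)    # symbol of t is deleted
    return S[M][N]

print(global_alignment_score("AC", "AGC", r=10, p=3, q=2))
```

The table has (M+1)∙(N+1) cells and each cell looks at three incoming arcs, so the work is proportional to the number of arcs.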

Page 34: Dynamic programming: one algorithmic key to many biological locks

Alignment graph

(Figure: the alignment graph of the two example sequences; one sequence labels the columns and the other labels the rows.)

Page 35: Dynamic programming: one algorithmic key to many biological locks

Alignment graph with weights

(Figure: the alignment graph of the two example sequences with arc labels: r on match arcs, p on mismatch arcs, q on deletion arcs.)

Page 36: Dynamic programming: one algorithmic key to many biological locks

Paths for the three alignments

(Figure: the alignment graph with the three paths corresponding to alignments (a), (b) and (c).)

Page 37: Dynamic programming: one algorithmic key to many biological locks

Variants

• Hanging-end alignment (genome assembly)
– zero-weight arcs from the source to the top and left “perimeter” and from the right and bottom perimeter to the sink
• Local alignment
– zero-weight arcs from the source to all internal vertices and from internal vertices to the sink
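For local alignment, the zero-weight arcs have a simple effect on the table-filling computation: every vertex may start a fresh path with score 0, and the path may end at any vertex. The sketch below implements this in the usual Smith–Waterman style; treating it as equivalent to the zero-weight-arc construction is the assumption here, and the names and sequences are illustrative.

```python
def local_alignment_score(s, t, r, p, q):
    """Best score of an alignment between a fragment of s and a fragment of t."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diagonal = prev[j - 1] + (r if s[i - 1] == t[j - 1] else -p)
            # 0 stands for the zero-weight arc coming from the source
            cur[j] = max(0, diagonal, prev[j] - q, cur[j - 1] - q)
            # taking the maximum stands for the zero-weight arc to the sink
            best = max(best, cur[j])
        prev = cur
    return best

print(local_alignment_score("TTACGTT", "GGACGGG", r=10, p=3, q=2))
```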

Page 38: Dynamic programming: one algorithmic key to many biological locks

Weights

• Amino-acid substitution weight matrices
– evolutionary
  • PAM (sure alignment of closely related proteins, take matrix to the power)
  • BLOCKS (alignment of conserved regions in distantly related proteins)
– based on physical and chemical properties of residues
• Deletion penalty
– affine penalties (opening and extension penalties)
• Structural alignment as the gold standard

Page 39: Dynamic programming: one algorithmic key to many biological locks

Quiz

For the above alignments, assuming match premium r=10, what combinations of mismatch and deletion penalties would yield optimal alignments (a), (b), and (c)?

Page 40: Dynamic programming: one algorithmic key to many biological locks

Multiple alignment

• triple alignment: cubic graph
– etc.
• for K sequences of length N requires $O(N^K)$ operations
• soon becomes unworkable
• progressive alignment
– all pairwise alignments, distance matrices
– guide tree
– alignment of partial alignments

Page 41: Dynamic programming: one algorithmic key to many biological locks

Lesson

Weights matter. The same graph with differently assigned arc weights will yield different types of alignment.

Page 42: Dynamic programming: one algorithmic key to many biological locks

Gene recognition

Define a gene as a sequence fragment consisting of exons and introns.

The boundaries between them are donor sites (between exons and introns) and acceptor sites (between introns and exons).

Each exon and intron is assigned a weight, measuring coding affinity (respectively, non-coding affinity) of its sequence.

The gene’s score is the sum of weights of constituent exons and introns.

The goal is, given a sequence and a set of candidate donor and acceptor sites, to construct the highest-scoring exon–intron structure for a gene.

Page 43: Dynamic programming: one algorithmic key to many biological locks

Construct a graph

actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga

actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga

(a)

(b)

Page 44: Dynamic programming: one algorithmic key to many biological locks

Complexity

Assume even distribution of sites (leave out details)

=> O(L) vertices, $O(L^2)$ arcs

Can we do better?

Page 45: Dynamic programming: one algorithmic key to many biological locks

It makes sense to assume that the segment weights are additive (we assume that for exons anyhow). Then we have just O(L) arcs

actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga

actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga

(a)

(b)

(a)

(b)
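With additive weights, the score can be accumulated position by position, so the work grows linearly with the sequence length. The sketch below is one possible reading of this idea and rests on several assumptions that are not in the slides: each position has a hypothetical coding weight and non-coding weight, the fragment starts and ends inside an exon, and a switch of state is allowed only at a candidate donor site (exon to intron) or acceptor site (intron to exon).

```python
def best_structure_score(coding, noncoding, donors, acceptors):
    """coding[k], noncoding[k]: per-position weights (hypothetical inputs).
    donors, acceptors: sets of positions where an intron / exon may begin.
    Returns the best total score of an exon-intron structure."""
    exon, intron = 0.0, float("-inf")   # best scores of prefixes ending in each state
    for k in range(len(coding)):
        enter_exon, enter_intron = exon, intron
        if k in donors:                 # an intron may start at position k
            enter_intron = max(intron, exon)
        if k in acceptors:              # an exon may start at position k
            enter_exon = max(exon, intron)
        exon = enter_exon + coding[k]
        intron = enter_intron + noncoding[k]
    return exon                         # the fragment must end in an exon

coding = [1, 1, -2, -2, 1, 1]
noncoding = [-1, -1, 2, 2, -1, -1]
print(best_structure_score(coding, noncoding, donors={2}, acceptors={4}))
```

Each position contributes a constant number of comparisons, in line with the O(L) arcs of the graph with additive weights.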

Page 46: Dynamic programming: one algorithmic key to many biological locks

Quiz

There are two paths in the segment graph that describe exon–intron structures not represented in the exon–intron graph. What are they? What arcs need to be added to the exon–intron graph to represent these structures?

Page 47: Dynamic programming: one algorithmic key to many biological locks

Lesson

Structure matters. The same problem may be represented by different graphs, and the conceptually simplest representation is not necessarily the most efficient one.

Page 48: Dynamic programming: one algorithmic key to many biological locks

Return to the toy problem

calculate

the standard trick would not work because

x∙z + y∙z = (x + y) ∙ z (before) holds, but

(x+z) ∙ (y+z) = x∙y + z generally does not.

Quiz. When does (x+z) ∙ (y+z) = x∙y + z hold?

Page 49: Dynamic programming: one algorithmic key to many biological locks

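DP, generic statement. 1. Path weights

Let $\otimes$ be the operation of calculating the path score S given arc weights W. We require associativity: $(a \otimes b) \otimes c = a \otimes (b \otimes c)$.

Hence we can simply write $a \otimes b \otimes c$.

The path weight (former $S(P) = \sum_{a \in P} W(a)$) becomes $S(P) = \bigotimes_{a \in P} W(a)$.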

Page 50: Dynamic programming: one algorithmic key to many biological locks

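DP, generic statement. 2. Graph score

Let Ψ be the set of all paths. Define an associative, commutative operation $\oplus$ of combining paths: $a \oplus b = b \oplus a$ and $(a \oplus b) \oplus c = a \oplus (b \oplus c)$.

The graph score is defined as $\Omega = \bigoplus_{P \in \Psi} S(P)$ (for the optimal path problem, $\oplus = \max$).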

Page 51: Dynamic programming: one algorithmic key to many biological locks

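DP, generic statement. 3. Transitivity

To use dynamic programming, we need the distributive law $(a \oplus b) \otimes c = (a \otimes c) \oplus (b \otimes c)$ and $c \otimes (a \oplus b) = (c \otimes a) \oplus (c \otimes b)$.

This is a generalization of the property used for calculating the optimal path: max (x + z, y + z) = max (x, y) + z.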
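The generic statement can be sketched as one routine parameterised by the two operations. In the code below, `otimes` combines arc weights along a path and `oplus` combines alternative paths; these parameter names, the neutral elements `zero` and `one`, and the toy graph are assumptions introduced for illustration. The vertices must be supplied in an order in which every arc goes from an earlier to a later vertex.

```python
import operator

def graph_score(arcs, order, source, sink, oplus, otimes, zero, one):
    """arcs: list of (v, u, weight); order: vertices in topological order.
    zero is neutral for oplus, one is neutral for otimes."""
    incoming = {v: [] for v in order}
    for v, u, w in arcs:
        incoming[u].append((v, w))
    score = {v: zero for v in order}
    score[source] = one
    for u in order:
        for v, w in incoming[u]:
            # distributivity lets us extend the combined score of all
            # subpaths ending at v by the arc (v, u) in a single step
            score[u] = oplus(score[u], otimes(score[v], w))
    return score[sink]

arcs = [("s", "a", 2), ("s", "b", 4), ("a", "b", 3), ("a", "t", 5), ("b", "t", 1)]
order = ["s", "a", "b", "t"]
ninf = float("-inf")
print(graph_score(arcs, order, "s", "t", max, operator.add, ninf, 0))        # best path score
print(graph_score(arcs, order, "s", "t", operator.add, operator.mul, 0, 1))  # sum of products
ones = [(v, u, 1) for v, u, _ in arcs]
print(graph_score(ones, order, "s", "t", operator.add, operator.mul, 0, 1))  # number of paths
```

The same loop serves all three problems; only the pair of operations changes.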

Page 52: Dynamic programming: one algorithmic key to many biological locks

DP, algorithm

Page 53: Dynamic programming: one algorithmic key to many biological locks

Problem (physics of polymers)

Linear polymer chain of L+1 monomers k = 0, …, L.

Each monomer assumes one of N states σ(k) ∈ {σi | i = 1, …, N}.

Energy of interactions between adjacent monomers is defined by an N×N matrix ξ(σi,σj) (measured in kT units).

Chain conformation P is defined by the states of the monomers {σ(0), σ(1), …, σ(L)}.

Exponent of energy: S(P) = exp (–E(P)) = ∏k=1…L exp (–ξ(σ(k–1),σ(k))).

Ψ is the set of all conformations.

Calculate the partition function of the set of all conformations Ω = ∑P∈Ψ S(P).

Page 54: Dynamic programming: one algorithmic key to many biological locks

Graph construction and reduction to DP

Vertices correspond to monomer states, so that their number is (L+1)∙N+2 (two additional vertices are the source and the sink, corresponding to the virtual start and end of the chain).

Arcs link vertices corresponding to adjacent monomers.

Arc weights are the exponents of the interaction energies, exp (–ξ(σi,σj)).

Paths through this graph exactly correspond to the chain conformations.

The path operation ⊗ is ordinary multiplication, and the combining operation ⊕ is addition.

The path score is the product of arc weights.

The total graph score is the sum of these products.

Standard DP solves the problem.
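A small numerical check makes the reduction concrete. The sketch below computes the partition function layer by layer and compares it with direct enumeration of all conformations. It assumes arc weights exp(–ξ) between adjacent monomers and weight 1 on the arcs from the source and to the sink; the energy matrix is invented for illustration.

```python
from itertools import product
from math import exp, isclose

def partition_function_dp(xi, L):
    """xi[i][j]: interaction energy of adjacent monomers in states i and j.
    Processes the chain one monomer at a time."""
    N = len(xi)
    Z = [1.0] * N                     # monomer 0: every state has weight 1
    for _ in range(L):
        # sum over the state i of the previous monomer for each new state j
        Z = [sum(Z[i] * exp(-xi[i][j]) for i in range(N)) for j in range(N)]
    return sum(Z)

def partition_function_brute(xi, L):
    """Direct sum over all conformations of L+1 monomers."""
    N = len(xi)
    total = 0.0
    for conf in product(range(N), repeat=L + 1):
        energy = sum(xi[conf[k - 1]][conf[k]] for k in range(1, L + 1))
        total += exp(-energy)
    return total

xi = [[0.0, 1.0], [1.0, 0.5]]
print(partition_function_dp(xi, 4), partition_function_brute(xi, 4))
```

Both functions return the same value up to rounding, but only the first one avoids listing the conformations.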

Page 55: Dynamic programming: one algorithmic key to many biological locks

Quiz

(a) How many operations shall we need?

(b) How many operations shall we need if we calculate the partition function directly?

(c) Provide an algorithm for calculating the number of paths in a graph. Hint: invent suitable arc weights and reduce to the previous problem.

(d) What will Ω be if both ⊗ and ⊕ are the operation of taking the maximum?

Page 56: Dynamic programming: one algorithmic key to many biological locks

Problem

Calculate the minimum energy and the number of conformations with the minimum energy.

Arc weights are pairs [1, ξ], with ξ as defined previously.

Path scores are pairs [n, ε], where ε is the energy, and n is the number of conformations having this energy.

When two systems are combined, the resulting energy is the sum of the systems’ energies, whereas the number of states is the product of the numbers of states. Hence dynamic programming with the operations

[n1, ε1] ⊗ [n2, ε2] = [n1∙n2, ε1 + ε2]

[n1, ε1] ⊕ [n2, ε2] = [n1, ε1] if ε1 < ε2; [n2, ε2] if ε1 > ε2; [n1 + n2, ε1] if ε1 = ε2

solves the problem.
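The pair-valued scores can be tried out directly. In the sketch below, combining along a path multiplies the counts and adds the energies, as stated in the text; the rule for combining alternatives (keep the lower energy, add the counts on a tie) is an assumption spelled out here. Integer energies are used in the examples so that the equality test on energies is exact.

```python
def otimes(a, b):
    """Extend a path: counts multiply, energies add."""
    return (a[0] * b[0], a[1] + b[1])

def oplus(a, b):
    """Combine alternatives: keep the lower energy, add counts on a tie."""
    if a is None:
        return b
    if a[1] < b[1]:
        return a
    if b[1] < a[1]:
        return b
    return (a[0] + b[0], a[1])

def min_energy_and_count(xi, L):
    """Returns (number of minimum-energy conformations, minimum energy)
    for a chain of L+1 monomers with interaction energies xi[i][j]."""
    N = len(xi)
    best = [(1, 0)] * N                       # monomer 0: one way, zero energy
    for _ in range(L):
        new = []
        for j in range(N):
            acc = None
            for i in range(N):
                acc = oplus(acc, otimes(best[i], (1, xi[i][j])))
            new.append(acc)
        best = new
    total = None
    for pair in best:
        total = oplus(total, pair)
    return total

print(min_energy_and_count([[0, 1], [1, 0]], 3))   # equal neighbours cost nothing
```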

Page 57: Dynamic programming: one algorithmic key to many biological locks

Lesson

Generalizations are useful

Page 58: Dynamic programming: one algorithmic key to many biological locks

Note

Not all problems that can be solved by dynamic programming have a simple graph representation. For example, reconstruction of the secondary structure of an RNA molecule given its sequence can be decomposed into simpler, embedded problems and can be solved by a variant of the dynamic programming algorithm, but in the language of this paragraph it requires slightly more complicated objects called hypergraphs.

Page 59: Dynamic programming: one algorithmic key to many biological locks

Thank you

• Mikhail Roytberg
• Andrei Mironov
• Anatoly Rubinov
• Pavel Pevzner

Page 60: Dynamic programming: one algorithmic key to many biological locks