1 CS 482/682, Spring 2014 Week 4

Week 4

Topics for this week:
•  Doing global alignment in linear space
•  Heuristic global alignment
•  Different anchor finding techniques

The big ideas of this week:

Global alignment heuristics work via forcing certain sites to be matched in an alignment.

There are huge tradeoffs in global heuristic alignment. Few are adequately understood.

Where are we?

We’ve seen what alignment scores are.

We’ve seen how to compute them, and how to find local alignments quickly, using seed techniques.

The most interesting recent work in this area was done right here, at Waterloo.

What if we want to find optimal alignments using a small amount of space?

Or what if we want to compute global alignments quickly?

That’s the subject of today’s lecture.

Back to global alignment

Recall from before: for sequences S and T, fill in a dynamic programming matrix.

•  Size of matrix: |S|+1 by |T|+1.
•  That takes Θ(nm) space to store (n = |S|, m = |T|). If the sequences are 1M x 1M, this seems like a bad idea: we need about a trillion DP cells.

[But 1 trillion operations? That’s a problem, but less of a big deal; maybe a few minutes.]

Can we avoid using this massive amount of space, as we did with local alignment?

Two topics, then

First: finding optimal global alignments in O(n+m) space, Θ(nm) time.

Next: heuristic global alignment in close to linear space, close to linear time.

The first result is much more theoretical; the second is, well, useful.

We won’t go into as much detail with heuristic global alignment as in previous years…

Linear space optimal alignment

How do we find the optimal alignment in linear space? (We saw this in Week 1.)

We’ll restrict to linear gap penalties (remember: that’s the one where we use only one matrix to compute the score, not three).

First question: how to find the score of the optimal alignment?

Remember: M(i,j) = score of the optimal alignment of S[1…i] to T[1…j].

We’ll limit to the simplest scoring scheme, where match scores +1 and mismatch and gap score -1.

Finding the score, linear space

M(i,j) = max( M(i-1,j-1) + 1, M(i,j-1) - 1, M(i-1,j) - 1 ) if s_i = t_j, and

M(i,j) = max( M(i-1,j-1), M(i,j-1), M(i-1,j) ) - 1 if s_i ≠ t_j.

By the time I’m looking at row i of this table, I’ll never look at any row before i-1 ever again.

So I don’t need it… This means I only need enough memory to store 2 rows.

At the end, the score of the optimal global alignment is M(n,m).
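
The two-row idea can be sketched in a few lines of Python. This is only an illustration under the scoring scheme from the previous slide (match +1, mismatch and gap -1); the function name and parameter names are made up for this example.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Score of the optimal global alignment of s and t, keeping only two rows."""
    prev = [j * gap for j in range(len(t) + 1)]   # row 0: prefixes of T against gaps
    for i in range(1, len(s) + 1):
        cur = [i * gap] + [0] * len(t)            # column 0: prefix of S against gaps
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,       # s_i aligned to t_j
                         cur[j - 1] + gap,        # t_j aligned to a gap
                         prev[j] + gap)           # s_i aligned to a gap
        prev = cur                                # row i-1 is never needed again
    return prev[-1]                               # this is M(n, m)
```

Memory use is proportional to |T| only, but nothing here remembers which choices led to the final score.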

But what alignment gave rise to that?

A remarkable trick…

We know how much the optimal alignment scores. Can we monitor how we got that alignment?

That optimal alignment is an alignment of S[1…n/2] to part of T, followed by S[n/2+1…n] to the rest of T.

Can we find the division point: the place in T up to which the first half of S is matched?

This would only require one row’s worth of information, maybe.

Thought!

Suppose S and T are both 1000 letters long.

Suppose I told you that the best alignment of S to T was actually the best alignment of s1…s500 to t1…t400 followed by the best alignment of s501…s1000 to t401…t1000.

Maybe then I could use divide and conquer, and have two smaller problems.

But I have to remember what s500 was matched with, even when I get to the end of the sequence S.

Trick for alignment

We know that the best alignment of S and T is:

•  The best alignment of S[1…n/2] to some prefix of T, T[1…k], followed by

•  The best alignment of S[n/2+1…n] to the remainder of T, T[k+1…m].

What if we could find the value of k in O(nm) time and O(n+m) space?

Then, we’d have two smaller problems to find the alignment of, one of size n/2 x k and one of size n/2 x (m-k).

Total size would be (n/2)k + (n/2)(m-k) = nm/2, half of what it was before, and in each case n+m would be smaller than before, so we’d still use the right amount of space.

Trick for alignment, cont’d

Here’s the alignment algorithm, then. For sequences of length n and m:
1)  If nm is smaller than a constant U, just compute the optimal alignment.
2)  Otherwise, find the value of k, the size of the part of T aligned to S[1…n/2], in O(nm) time.
3)  Align S[1…n/2] to T[1…k] and S[n/2+1…n] to T[k+1…m] using this algorithm, and paste them together.

Runtime will be T(n,m) = O(1) if nm is small enough, and

T(n,m) = T(n/2,m-k) + T(n/2,k) + O(nm) otherwise.

This is O(nm) overall: the work in one phase is half of the previous phase, so the total is proportional to nm(1 + 1/2 + 1/4 + …) ≤ 2nm.

How to find the midpoint?

Keep two rows of alignment scores, to keep the score values accurate. (Each row depends only on the previous row.)

For rows past the middle of the DP matrix, keep track of which position in row n/2 the path to a point (i,j) in the DP matrix went through.

This is exactly equivalent to knowing, in the optimal alignment of S[1…i] and T[1…j], how much of T was aligned to the first n/2 letters of S.

And it only requires two rows of pointers, as well.
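
Here is a minimal Python sketch of the whole divide-and-conquer scheme, assuming the simple +1/-1/-1 scoring. All names (`find_split`, `linear_space_align`, `full_align`, `score_of`, the threshold `small`) are invented for this illustration. `find_split` keeps two rows of scores and two rows of pointers, where each pointer records the column at which the best path to that cell left row n/2.

```python
MATCH, MISMATCH, GAP = 1, -1, -1

def sub(a, b):
    return MATCH if a == b else MISMATCH

def full_align(s, t):
    """Standard full-matrix DP with traceback; used only for the small base cases."""
    n, m = len(s), len(t)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * GAP
    for j in range(1, m + 1):
        M[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i-1][j-1] + sub(s[i-1], t[j-1]),
                          M[i][j-1] + GAP, M[i-1][j] + GAP)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i-1][j-1] + sub(s[i-1], t[j-1]):
            a.append(s[i-1]); b.append(t[j-1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i-1][j] + GAP:
            a.append(s[i-1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j-1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))

def find_split(s, t):
    """Column k at which an optimal path leaves row n//2; two rows of scores and pointers."""
    n, m = len(s), len(t)
    mid = n // 2
    score = [j * GAP for j in range(m + 1)]
    origin = list(range(m + 1))            # only meaningful from row mid onward
    for i in range(1, n + 1):
        new_score = [i * GAP] + [0] * m
        new_origin = [0] * (m + 1)         # a path down column 0 crosses row mid at column 0
        for j in range(1, m + 1):
            diag = score[j-1] + sub(s[i-1], t[j-1])
            up = score[j] + GAP
            left = new_score[j-1] + GAP
            new_score[j] = max(diag, up, left)
            if new_score[j] == diag:       # inherit the pointer of the chosen predecessor
                new_origin[j] = origin[j-1]
            elif new_score[j] == up:
                new_origin[j] = origin[j]
            else:
                new_origin[j] = new_origin[j-1]
        if i == mid:
            new_origin = list(range(m + 1))  # on row mid, each cell is its own crossing point
        score, origin = new_score, new_origin
    return origin[m]

def linear_space_align(s, t, small=16):
    # Base case: tiny problems, or fewer than two rows left to split.
    if len(s) < 2 or len(s) * len(t) <= small:
        return full_align(s, t)
    mid, k = len(s) // 2, find_split(s, t)
    a1, b1 = linear_space_align(s[:mid], t[:k], small)
    a2, b2 = linear_space_align(s[mid:], t[k:], small)
    return a1 + a2, b1 + b2

def score_of(a, b):
    """Score of an already-aligned pair of gapped strings."""
    return sum(GAP if "-" in (x, y) else sub(x, y) for x, y in zip(a, b))
```

For example, `linear_space_align("GATTACA", "GCATGCU", small=4)` returns two gapped strings of equal length whose score matches the full-matrix optimum. String slices are used for brevity; an index-based version would avoid the copies.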

Making it practical

This algorithm is a pain in the neck to implement, especially for more complicated alignment scoring schemes.

(In particular, the details of what happens at the boundary between the two halves have to be coordinated in complicated ways.)

But it’s vital when the two sequences are long: it’s not possible otherwise to align them, because of memory demands.

A much more important one

We’ve seen a lot about how to make local alignment easier, faster, etc.

What about global alignment?

Heuristic global alignment in close to linear space, close to linear time
•  Here are two long homologous chunks of chromosomes. How do they relate?

Obviously, we don’t want to compute the O(nm) global alignment.

How can we short-circuit this?

One quick caveat

I have forgotten to teach this before, and I want to stress it this time.

The optimal alignment might not actually be the right one!

This is actually crucial.

Remember: the optimal alignment corresponds to the best hypothesis for the sequence, given the model of evolution we’re optimizing.

It may not be the right one…

Typical approach: anchors

What if we know the alignment must go through certain parts of the DP matrix?

Then we can shrink the problem and only fill in part of the alignment matrix, just as for local alignment.

But how to find those “required” parts?

Many approaches, lots of current research (including by me…)

But the basic idea is to find regions that match so well, they’d better be in the alignment.

Simple example

Suppose that n=m=1000.

If we fill in the entire matrix, it’s got 1 million cells.

Suppose we’re sure that s100 matches with t200 and s700 matches with t600.

Then we’ll build:
•  The 100x200 matrix corresponding to S[1…100] vs T[1…200]
•  The 600x400 matrix corresponding to S[101…700] vs T[201…600]
•  The 300x400 matrix corresponding to S[701…1000] vs T[601…1000].

That’s a total of 20,000+240,000+120,000 = 380,000 cells.

So we’ve saved 62% of the space!
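
The cell count above is easy to check mechanically. The helper below is a hypothetical illustration: it takes anchors as 1-based (i, j) pairs meaning "s_i is matched with t_j", assumes they are sorted and increasing in both coordinates, and uses the same block convention as the slide (each block ends at its anchor).

```python
def anchored_cells(n, m, anchors):
    """Number of DP cells filled when the alignment is forced through the given anchors."""
    total, prev_i, prev_j = 0, 0, 0
    # Treat the lower-right corner (n, m) as a final anchor that closes the last block.
    for i, j in list(anchors) + [(n, m)]:
        total += (i - prev_i) * (j - prev_j)
        prev_i, prev_j = i, j
    return total

cells = anchored_cells(1000, 1000, [(100, 200), (700, 600)])
saved = 1 - cells / (1000 * 1000)
```

Here `cells` is 380,000 and `saved` is 0.62, matching the numbers on the slide.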

We might be wrong!

Remember: we’re trying to globally align S and T, using the cue of places we think are homologous between S and T.

But, they might be wrong.

Say that s100 matches t50 in reality, not t200 as we assumed. Then our alignment is going to be awful: the first segment will be wrong, and so will the second.

We’ll see a picture.

We’re running a heuristic: it may not find the optimal alignment.

We really should try to minimize this happening. (And remember: optimal≠right.)

Three questions, then…

What makes a good anchor?

How do I find all of the good anchors?

How do I pick the right anchors if I have a choice?

Traditionally, this has been where in CS 482 we’ve spent a week or two on suffix trees.

We will spend some time, but we’ll get into this first.

The key issue, aside from finding the anchors fast, is not getting bad ones!

Large-scale alignment

Global alignment is “anchored” by regions that are assumed likely to be in the true alignment.

Lots of possible definitions of the anchors:

•  BLAST hits
•  BLAST alignments
•  …

Another reasonable idea: long regions of exact matches.

For this, we’ll use suffix trees.

Let’s do a simple case, though

Given: S and T, of lengths n and m as usual.

What if we try to keep all places where there is a BLASTN hit (seed length k) between S and T as possible anchors?

That’s easy to find: we can find BLASTN hits very efficiently.

Example: k=4

S: CTAGTAGTAGTTTAGTAGTAG
T: CTAGTTACTAATAGTGGTAG

We can force the alignment to obey the anchors.
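
As a rough sketch of how such hits can be collected, the snippet below treats a "BLASTN hit" as nothing more than an exact shared k-mer and indexes S in a hash table. The function name and the 0-based (position in S, position in T) output format are assumptions made for this example, not BLAST's actual interface.

```python
from collections import defaultdict

def kmer_hits(s, t, k):
    """All (i, j) with s[i:i+k] == t[j:j+k], found via a hash table of the k-mers of s."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)
    hits = []
    for j in range(len(t) - k + 1):
        for i in index.get(t[j:j + k], []):
            hits.append((i, j))
    return sorted(hits)

hits = kmer_hits("CTAGTAGTAGTTTAGTAGTAG", "CTAGTTACTAATAGTGGTAG", 4)
```

On the example sequences, `hits` contains (0, 0) for the shared CTAG at the start, (12, 11) for a shared TAGT, and (17, 16) for the shared GTAG at the end, along with other hits that cannot all be used together.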

Anchoring, continued

S: CTAGTAGTAGTTTAGTAGTAG
T: CTAGTTACTAATAGTGGTAG

We align the regions between the anchors, using the global alignment algorithm we saw in Weeks 1 and 2.

What does this get us? Much less work!

We find the anchors in linear time.

The two internal alignments are an 8x7 and a 1x1 alignment.

That’s a lot less work than the initial 21x20 alignment!

In a perfect world…

How good could this be?

Suppose we get a anchors. (To make life easier, assume that the anchors themselves are very small…)

It takes linear time to find the anchors.

The a anchors divide each sequence into a+1 regions.

If the regions in S each have length n/(a+1), and the regions in T each have length m/(a+1), then each dynamic programming matrix fills nm/(a+1)^2 space.

There are a+1 of them.

Total amount of DP: nm/(a+1).

That’s a factor of a+1 less work. This is the big win for anchoring.

What are the potential problems?

The chosen anchors might not be homologous, as we discussed before.

The possible anchors might not be mutually agreeable.

Example: k = 4

S = AGTATGCTGAGTATGAGTA
T = AGTAAGTTGGGTGTGAGTA

AGTA alone gives six BLAST hits: it appears three times in S, twice in T.

What do we do? We can’t use all of these! They disagree; either the first AGTA in S is homologous with the first, or the second, or neither, of them in T.

But certainly not both!

This is more common.

Again, k = 4. S: CTAGTAGTAGTTTAGTAGTAG

T: GTAGTTACTAATAGTGCTAG

We can only have one of these possible anchors: they have to go left-to-right in the sequences!

Two possible anchors conflict when they can’t both be used in a single alignment.

We’re going to have lots of potential anchor conflicts. What to do?

Let’s step back

Again, the purpose of anchoring:
•  Find a very good alignment, very efficiently.

These two goals are definitely in conflict.

To find a very good alignment, we know a slow algorithm.

We can find a garbage alignment very easily.

We must navigate this tradeoff so that the alignment is still very good.

That’s quite hard.

A general framework

Here’s the general routine that has been used since around 1998 for this approach.

(Earliest program, I think: OWEN, by Mikhail Roytberg and co-authors)

•  From S and T, find a set of possible anchors.

•  Find a large set of non-conflicting anchors from the initial set. Maybe optimize something sensible when finding them.

•  Align the regions between the chosen anchors, using the standard dynamic programming algorithm.

There are few exceptions.

More about anchors

Possible anchors take a lot of forms:
•  BLAST hits
•  BLAT hits (which allow a small number of mismatches)
•  BLAST alignments (which even allow for indels)

But they better be very unlikely to be wrong!

In fact, this means that BLAST hits are unwise: the sequence AAAAAAAAAAAAAA happens much more often than you would expect. We might want to remove any BLAST hits that correspond to repetitive sequences.

More interesting anchors

A different kind of anchors: unique matches

Sequences of length k found exactly one time in S and exactly one time in T.

These may be safer anchors than BLAST hits of sequences that appear a ton of times in S and in T.

(Of course, they could still be wrong, because the same unique sequence could arise in S and in T, even in non-homologous positions!!)

We can still find these quite fast: throw out entries in the S hash table with more than 1 entry, and tag entries of S that are hit twice.
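
One way to realize this filtering, sketched with Python's `collections.Counter` in place of a hand-built hash table (the function name and the 0-based (position in S, position in T) output are assumptions for this example):

```python
from collections import Counter

def unique_kmer_matches(s, t, k):
    """(i, j) pairs where the k-mer s[i:i+k] == t[j:j+k] occurs exactly once in each sequence."""
    kmers_s = [s[i:i + k] for i in range(len(s) - k + 1)]
    kmers_t = [t[j:j + k] for j in range(len(t) - k + 1)]
    count_s, count_t = Counter(kmers_s), Counter(kmers_t)
    # Keep only k-mers that are unique in S, remembering where each one sits.
    pos_s = {w: i for i, w in enumerate(kmers_s) if count_s[w] == 1}
    # A k-mer of T is an anchor candidate only if it is also unique in T.
    return [(pos_s[w], j) for j, w in enumerate(kmers_t)
            if count_t[w] == 1 and w in pos_s]
```

On the earlier example with S = AGTATGCTGAGTATGAGTA and T = AGTAAGTTGGGTGTGAGTA and k = 4, this returns an empty list: every shared 4-mer, AGTA included, is repeated in S.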

Why unique matches?

Mostly, because sequence isn’t random noise.

Repetitive sequence elements will confuse any program that is trying to figure out how to anchor sequences together.

Unique matches are pretty strongly biased away from being repetitive, for obvious reasons.

But things aren’t necessarily great:

S: …ACG…CGT…ACGT…
T: …………….ACGT………

If k = 3, the matches here aren’t unique.

That’s sort of unhelpful, isn’t it? The ACGT might be the sort of thing we would want as an anchor.

New type of anchor

This suggests a new kind of anchor: a unique match that isn’t necessarily of fixed length k.

Unique match: the sequence is found exactly once in S and once in T.

Maximal match: the characters on both sides of the match are different in S and T, so the match cannot be extended in either direction.

S: ……GCAGTAGT…
T: ……TCAGTAGG….

Maximal unique match: a substring of both S and T, found exactly once in each, but with different flanking characters.

Again, no fixed value for the lengths.
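
To make the definition concrete, here is a deliberately naive brute-force finder. It is not the suffix tree method, and it is far too slow for real sequences; it is meant only to pin down what counts as a maximal unique match. The names `naive_mums`, `count_occurrences` and the `min_len` cutoff are invented for this sketch.

```python
def count_occurrences(text, w):
    """Number of (possibly overlapping) occurrences of w in text."""
    count, start = 0, text.find(w)
    while start != -1:
        count += 1
        start = text.find(w, start + 1)
    return count

def naive_mums(s, t, min_len=1):
    """Brute-force maximal unique matches as (start in s, start in t, length), 0-based."""
    mums = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue                      # could be extended to the left: not maximal
            length = 0                        # extend to the right as far as possible
            while (i + length < len(s) and j + length < len(t)
                   and s[i + length] == t[j + length]):
                length += 1
            w = s[i:i + length]
            if (length >= min_len and count_occurrences(s, w) == 1
                    and count_occurrences(t, w) == 1):
                mums.append((i, j, length))
    return mums
```

On the fragment from the slide, `naive_mums("GCAGTAGT", "TCAGTAGG", min_len=3)` finds the single match CAGTAG, reported as (1, 1, 6).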

Distribution of anchor lengths

The neat thing about this idea:
•  Some possible anchors are short (and may be likely to be wrong)
•  Others are potentially quite large (big chunks of an exon, say).

There is still the risk of repetitive sequence, just with small numbers of mismatches.

But in general, this has the possibility of being pretty handy.

How can we find all of these?

It’s not immediately obvious

We can do this sloppily by using hash tables, and somehow blobbing together the hash table hits.

That’s really crummy.

Instead, what we’re going to do is to use a structure called a suffix tree, and we’ll talk about how to find all of the maximal unique matches in linear time, next week.

For now, please assume that we have them, and we’ll go on.

(Again, also, remember that we could have used local alignments as possible anchors…)

Picking anchors

Phase 1: Find possible anchors. But some of them conflict!

Phase 2: Pick a set of anchors from our pool of possibilities.

This is hard, because we might get it wrong, and then the alignment is garbage.

Here’s the general goal:
•  Pick anchors to reduce work and avoid making horrible mistakes.

In lecture, we’re going to just focus on the first of these; see HW 3 to think about the 2nd.

Avoid work

Anchors are short alignments (maybe ungapped). So let’s describe an anchor by four co-ordinates: StartS, EndS, StartT, EndT.

We can have both of two anchors, A1 and A2, in an alignment, with A1 coming first, exactly when EndS(A1) < StartS(A2) and EndT(A1) < StartT(A2).

That seems to describe a pretty simple sort of relationship, which we could also show geometrically.

(I’ll do that on the board.)

Now, what avoids work best?

Having the most anchors is a reasonable goal.

One thing to realize: we can cast this as a graph optimization problem!

Given: a graph, where nodes correspond to possible anchors; a directed edge between two nodes if the first possible anchor could precede the second.

Find: the maximum-length path in the graph.

(This turns out to be a little easier if you have a node for the upper-left corner and for the lower-right corner.)

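You saw shortest paths (Dijkstra) in CS 341. Because this graph is acyclic, the analogous longest-path computation works: process the nodes in topological order, keeping the longest path that ends at each one.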
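
A minimal sketch of the "most anchors" objective, written as a quadratic dynamic program over anchors sorted by StartS rather than as a call to any particular graph routine. Anchors are assumed to be tuples (StartS, EndS, StartT, EndT); the function names are made up for this example.

```python
def can_precede(a1, a2):
    """True when anchor a1 can come before a2 in one alignment."""
    return a1[1] < a2[0] and a1[3] < a2[2]   # EndS(a1) < StartS(a2) and EndT(a1) < StartT(a2)

def longest_chain(anchors):
    """Largest set of mutually compatible anchors, returned in left-to-right order."""
    order = sorted(anchors)                  # sorted by StartS: a topological order of the graph
    if not order:
        return []
    best = [1] * len(order)                  # best[i]: longest chain ending at order[i]
    back = [-1] * len(order)                 # back[i]: previous anchor on that chain
    for i, a in enumerate(order):
        for j in range(i):
            if can_precede(order[j], a) and best[j] + 1 > best[i]:
                best[i], back[i] = best[j] + 1, j
    i = max(range(len(order)), key=best.__getitem__)
    chain = []
    while i != -1:                           # walk the back-pointers from the best endpoint
        chain.append(order[i])
        i = back[i]
    return chain[::-1]
```

With the six AGTA hits from the earlier example, written as (s, s+3, t, t+3) for s in {0, 9, 15} and t in {0, 15}, the longest chain has two anchors. Giving each anchor a weight instead of counting 1 would only change the `best[j] + 1` update.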

We could have other rules

We could incorporate probabilistic stuff into it (minimize probability of bad anchors, maximize expected score, …)

We could try to have some anchors (longer ones, better conserved ones, rarer ones, …) have higher score, and incorporate that.

There’s a ton of different possible ways to pick anchors.

Amazingly, this is not a well-studied area. I have no idea why not.

Next: Align between them

Once we’ve divided up the sequences by anchors, we must align the interspersed regions.

Sometimes, they’re small (100x100, or 200x300). Then, we can use our existing algorithms.

Other times, they may be huge. (10,000 x 2000)

What to do in those intervals?

Well, maybe we should treat those as an opportunity to do a heuristic global alignment…

Recursion can be appropriate, perhaps with changes in the parameters for the anchoring.

Again, what’s good, what isn’t?

What’s good here is that we get an alignment of S and T.

If we anchored sensibly, we may even be aligning homologous DNA.

What’s bad is that a bad anchor is a lot more costly than a local alignment with no seed hit for BLAST: it screws up the entire alignment.

Again, see your homework.

Wrap up: Week 4

Heuristic global sequence alignment
•  Goal is same as for heuristic local alignment: shorter runtimes
•  But the algorithms may make garbage alignments!

The key issue: picking good anchors
•  Lots of possible candidates: BLAST seeds, BLAST alignments, unique matches, maximal unique matches …
•  Then, must chain to find good ones.
