local alignments seq x: seq y:. local alignment what’s local? –allow only parts of the sequence...

25
Local alignments Seq X: Seq Y:

Post on 18-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Local alignments

Seq X:

Seq Y:

Page 2: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Local alignment

What’s local?– Allow only parts of the sequence to

match– Results in High Scoring Segments– Locally maximal: can not make it better

by trimming/extending the alignment

Seq X:

Seq Y:

Page 3: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Local alignmentSeq X:

Seq Y:

Why local?– Parts of sequence diverge faster

evolutionary pressure does not put constraints on the whole sequence

– Proteins have modular constructionsharing domains between sequences

Page 4: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Global local alignment

Take the old, good equation

Look at the result of the global alignment (on the right)

- s e q u e

-

s

e

q

CAGCACTTGGATTCTCG-CA-C-----GATTCGT-G

a) global align

b) retrieve the result

c) score along the result

align.pos.

Page 5: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Local alignment – breaking the alignment A recipe

– Just don’t let the score go below 0

– Start the new alignment when it happens

– Where is the result in the matrix?

Before: After:

align.pos.

q

e

s

0-

euqes-

q

e

0s

0-

euqes-

align.pos.

align.pos.

Page 6: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Local alignment – the equation

Init the matrix with 0’s Read the maximal value from

anywhere in the matrix Find the result with backtracking

M[i, j] =M[i, j-1] – g

M[i-1, j] – g

M[i-1, j-1] + score(X[i],Y[j])

max

0Great contribution to science!

q

e

s

-

euqes-

Page 7: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Amino acid substitution matrices

Page 8: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Point Accepted Mutations distance and matrices

Accepted by natural selection– not lethal– not silent

Def.: S1 and S2 are PAM 1 distant if

on avg. there was one mutation per 100 aa

Q.: If the seqs are PAM 8 distant, how many residues may be diffent?

Page 9: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

PAM matrix Created from “easy”alignments

– pairwise– gapless– 85% id

f proline − frequency of occurence of proline

f proline ,valine − frequency of substitution

proline with valine , for PAM 1

i.e. f a ,b =∑

aligns

count a ,b

∑aligns

∑c ,d

count c ,d

M − symmetric matr ix ,

i.e. M=[f a ,a f a ,b

f b ,a f b ,b ]

Page 10: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

PAM matrix How to calculate M for PAM 2

distance?– Take more distant sequences– or extrapolate...

M 2=[ f a ,a f a ,b

f b ,a f b ,b ]2

=[ f a ,a f a , a f a ,b f b ,a f a ,a f a ,b f a , b f b ,b

⋯ ⋯ ]

Page 11: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

AlignmentsBLAST - Basic Local Alignment

Search Tool

Page 12: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

The amount of genetic information in organisms

http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980313.html

Largest genome: amoeba Chaos chaos (200x human genome) http://www.lawrence.edu/dept/biology/animal/

Page 13: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Sequence searching - challenges Exponential growth of

databases

Page 14: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Sequence searching – definition

Task:

– Query: short, new sequence (~1000 letters)

– Database (searching space): very many sequences

– Goal: find seqs homologous to the query

Page 15: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Sequence searching – definition

We want:

– fast tool– primarily a filter: most sequences will

be unrelated to the query – fine-tune the alignment later

Page 16: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

What is BLAST Basic Local Alignment Search Tool Bad news: it is only a heuristics

– Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees.Perkins, DN (1981) The Mind's Best Work

Basic idea:– High scoring segments have well

conserved (almost identical) part– As well conserved part are identified,

extend it to the real alignment

q

e

s

-

euqes-

Page 17: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

What means well conserved for BLAST? BLAST works with k-words (words of length

k)

– k is a parameter– different for DNA (>10) and proteins

(2..4) word w1 is T-similar to w2 if the sum of pair

scores is at least T (e.g. T=12)

Similar 3-wordsW1: R K PW2: R R PScore: 9 –1 7 sum = 15

Page 18: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

BLAST algorithm3 basic steps

1)Preprocess the query: extract all the k-words

2)Scan for T-similar matches in database

3)Extend them to alignments

1)Preprocess2)Scan3)Extend

Page 19: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

BLAST, Step 1: Preprocess the query

Take the query (e.g. LVNRKPVVP) Chop it into overlapping k-words (k=3 in this

case) Query: LVNRKPVVPWord1: LVNWord2: VNRWord3: NRK…

For each word find all similar words (scoring at least T)

E.g. for RKP the following 3-words are similar:QKP KKP RQP REP RRP RKP

1)Preprocess2)Scan3)Extend

Page 20: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Finite state machine

AC*T|GGC

abstract machine constant amount of memory

(states) used in computation and languages recognizes regular expressions

– cp dmt*.pdf /home/john

1)Preprocess2)Scan3)Extend

Page 21: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

BLAST, Step 2: Find ¨exact¨ matches with scanning Use all the T-similar k-words to build the Finite

State Machine Scan for exact matches

...VLQKPLKKPPLVKRQPCCEVVRKPLVKVIRCLA...

QKPKKPRQPREPRRPRKP...

movement

1)Preprocess2)Scan3)Extend

Page 22: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

BLAST, Step 3: Extending ¨exact¨ matches Having the list of exact matches we extend

alignment in both directions

Query: L V N R K P V V PT-similar: R R PSubject: G V C R R P L K CScore: -3 4 -3 5 2 7 1 -2 -3

…till the sum of scores drops below some level X (e.g. X=-100) from the best known- what with gaps?

1)Preprocess2)Scan3)Extend

Page 23: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Gapped BLAST (now standard)

gapped local alignments are computed: much, much, much slower

therefore: modified “Hit criteria”

1)Preprocess2)Scan3)Extend

Page 24: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

Hit criteria Extends the alignment only if there are

close two hits on the same diagonal

– sensitivity would drop without lowering T– reduces extensions (90% time is spend

on extensions) Gapped local alignments are computed

– increased sensitivity allows us raise T– raising T speeds up the search

1)Preprocess2)Scan3)Extend

dbpos

querypos

close

hit, s

ame d

iag

Page 25: Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally

BLAST flavours blastp: protein query, protein db blastn: DNA query, DNA db blastx: DNA query, protein db

– in all reading frames. Used to find potential translation products of an unknown nucleotide sequence.

tblastn: protein query, DNA db

– database dynamically translated in all reading frames.

tblastx: DNA query, DNA db

– all translations of query against all translations of db