Post on 18-Oct-2020


Intro to Alignment Algorithms: Global and Local

Algorithmic Functions of Computational Biology

Professor Istrail

Sequence Comparison

Biomolecular sequences
□ DNA sequences (strings over the 4-letter alphabet {A, C, G, T})
□ RNA sequences (strings over the 4-letter alphabet {A, C, G, U})
□ Protein sequences (strings over the 20-letter alphabet of amino acids)

Sequence similarity helps in the discovery of genes, and the prediction of structure and function of proteins.


The Basic Similarity Analysis Algorithm

Global Similarity

• Scoring Schemes

• Edit Graphs

• Alignment = Path in the Edit Graph

• The Principle of Optimality

• The Dynamic Programming Algorithm

• The Traceback

The Sequence Alignment Problem

Input: two sequences over the same alphabet and a scoring scheme

Output: an alignment of the two sequences of maximum score

Example:
□ GCGCATTTGAGCGA
□ TGCGTTAGGGTGACCA

A possible alignment:

-GCGCATTTGAGCGA--
TGCG--TTAGGGTGACC

Each column of an alignment is a match, a mismatch, or an indel.


CSCI2820 - Class 4

Mismatch, Deletion, Insertion

[Figure: aligned sequences with a mismatch, a deletion (in template), and an insertion (in template) marked]

TCAGGGGGCTATTAGTCCTCCGATAA
TCAGGGGGCTATTAGTCC-CCGATAA
TCAGGGGG-CTATTAGTCCCCCCGATAA

Consider two sequences

$X = x_1 x_2 \ldots x_m$
$Y = y_1 y_2 \ldots y_n$

over the alphabet $\Sigma = \{A, C, G, T\}$; each $x_i$ and $y_j$ belongs to $\Sigma$.

Scoring Schemes

Unit-score δ

    A C G T -
A   1 0 0 0 0
C   0 1 0 0 0
G   0 0 1 0 0
T   0 0 0 1 0
-   0 0 0 0 0
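As a minimal sketch, the unit-score table can be written as a small Python function. The names `unit_score`, `GAP`, and `table` are illustrative and do not come from the slides; the sketch assumes that any pair involving the gap symbol scores 0, which is how the table reads.

```python
GAP = "-"
ALPHABET = "ACGT"


def unit_score(x: str, y: str) -> int:
    """Unit-score delta: 1 when two letters are identical, 0 otherwise.

    Any pair that involves the gap symbol scores 0 (assumed from the table).
    """
    if x == GAP or y == GAP:
        return 0
    return 1 if x == y else 0


# Rebuild the 5x5 table, one row per symbol, in the order A C G T -.
symbols = ALPHABET + GAP
table = {x: [unit_score(x, y) for y in symbols] for x in symbols}

print("  " + " ".join(symbols))
for x in symbols:
    print(x, *table[x])
```

Writing δ as a function rather than a hard-coded matrix makes it easy to swap in another scoring scheme later without touching the code that uses it.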

Alignment

ACG
|||
AGG

A is aligned with A, C is aligned with G, and G is aligned with G.

Score (unit-score scheme) = $\delta(A,A) + \delta(C,G) + \delta(G,G) = 1 + 0 + 1 = 2$

Gaps

Without gaps:

ACATGGAAT
ACAGGAAAT
SCORE 7

AAAGGG
GGGAAA
SCORE 0

OPTIMAL ALIGNMENTS (with gaps):

ACATGG-AAT
ACA-GGAAAT
SCORE 8

---AAAGGG
GGGAAA---
SCORE 3

“-” is the gap symbol
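The four scores on this slide can be checked with a short sketch that sums a per-column score over two aligned rows. It assumes the unit-score scheme (1 for identical letters, 0 for mismatches and for gaps); `alignment_score` and `unit_score` are illustrative names, not part of the slides.

```python
def unit_score(x: str, y: str, gap: str = "-") -> int:
    """1 for two identical letters, 0 for a mismatch or any pair with a gap."""
    if x == gap or y == gap:
        return 0
    return 1 if x == y else 0


def alignment_score(top: str, bottom: str, delta=unit_score) -> int:
    """Sum delta over the columns of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return sum(delta(a, b) for a, b in zip(top, bottom))


examples = [
    ("ACATGGAAT", "ACAGGAAAT"),    # no gaps
    ("ACATGG-AAT", "ACA-GGAAAT"),  # one gap in each row
    ("AAAGGG", "GGGAAA"),          # no gaps
    ("---AAAGGG", "GGGAAA---"),    # shifted by three positions
]
scores = [alignment_score(top, bottom) for top, bottom in examples]
print(scores)  # [7, 8, 0, 3]
```

Passing `delta` as a parameter keeps the summation independent of the scoring scheme, so the same function works with a substitution matrix.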

$\delta(x, y)$ = the score for aligning x with y

$\delta(-, y)$ = the score for aligning - with y

$\delta(x, -)$ = the score for aligning x with -

Alignment

A-CG-G
ATCGTG

Score

$\delta(A,A) + \delta(-,T) + \delta(C,C) + \delta(G,G) + \delta(-,T) + \delta(G,G)$

THE SUM OF THE SCORES OF THE PAIRWISE ALIGNED SYMBOLS


Margaret Dayhoff & PAM Similarity Matrices

ARTEMIS Summer 2008


Dr. Margaret Oakley Dayhoff The Mother & Father of Bioinformatics


Scoring Scheme: Dayhoff PAM scoring matrices

Partial alignment for Monkey and Trout somatotropin proteins:

PTIPLSRLFDNAMLRAHRLHQ
SAIENQRLFNIAVSRVQHLHL

δ (first rows of the matrix):

  - A R N D C Q E G H I L K M F P S T W Y V
- -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8 -8
A -8 3 -3 0 0 -3 -1 0 1 -3 -1 -3 -2 -2 -4 1 1 1 -7 -4 0
R ...
N ...
D ...


Scoring Functions

Scoring function = a sum of terms, one for each pair of aligned residues and one for each gap

The meaning = log of the relative likelihood that the sequences are related, compared to being unrelated

Identities and conservative substitutions are Positive terms

Non-conservative substitutions are Negative terms

Mutations= Substitutions, Insertions, Deletions
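One common way to make the "log of the relative likelihood" concrete is the log-odds form below. The symbols are assumptions introduced here, not taken from the slides: $p_{ab}$ is the probability of seeing residues a and b aligned in related sequences, $q_a$ and $q_b$ are their background frequencies, and positions are treated as independent.

```latex
s(a, b) = \log \frac{p_{ab}}{q_a \, q_b},
\qquad
\text{Score} = \sum_{k} s(a_k, b_k)
```

Under this reading, a term is positive when the pair occurs more often in related sequences than chance predicts ($p_{ab} > q_a q_b$), which matches identities and conservative substitutions, and negative otherwise.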


Global alignment problem

• Input: Sequences X and Y of length m and n respectively and a similarity matrix

• Output: An optimal global alignment of X and Y
  – A global alignment requires that all bases in both sequences be aligned

Local alignment problem

• Input: Sequences X and Y of length m and n respectively and a similarity matrix

• Output: An optimal local alignment of X and Y
  – A local alignment does not require using all bases of either sequence

• Applicable when looking for subsequences of similarity

The Edit Graph

Suppose that we want to align AGT with AT

We are going to construct a graph in which alignments of the two sequences correspond to paths from the Begin node to the End node.

This is the Edit Graph

[Figure: a grid of nodes with columns numbered 0 to 3 (the sequence AGT) and rows numbered 0 to 2 (the sequence AT)]

AGT has length 3; AT has length 2.

The Edit Graph has (3+1)*(2+1) = 12 nodes.

[Figure: the same grid, with A, G, T labeling the columns, A, T labeling the rows, and the Begin and End nodes marked]

AGT indexes the columns, and AT indexes the rows of this “table”.

[Figure: the grid with directed edges drawn between neighboring nodes, from Begin toward End]

The graph is directed. The nodes (i,j) will hold values.


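Directed edges get as labels pairs of aligned letters.

[Figure: the Edit Graph for AGT and AT with every edge labeled. Horizontal edges carry a letter of AGT over a gap (A/-, G/-, T/-), vertical edges carry a gap over a letter of AT (-/A, -/T), and diagonal edges carry a pair of letters (A/A, G/A, T/A, A/T, G/T, T/T).]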
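The construction of the Edit Graph can be sketched in a few lines of Python. This is an illustration under stated assumptions: nodes are written as (row, column) pairs with AGT along the columns and AT along the rows, edge labels are (letter of AGT, letter of AT) pairs with "-" for a gap, and `build_edit_graph` is a hypothetical helper name.

```python
def build_edit_graph(col_seq: str, row_seq: str, gap: str = "-"):
    """Return the nodes and labeled directed edges of the edit graph.

    Nodes are (row, column) pairs. Each edge maps (source, target) to a
    pair (letter from col_seq or gap, letter from row_seq or gap).
    """
    n_rows, n_cols = len(row_seq) + 1, len(col_seq) + 1
    nodes = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    edges = {}
    for r, c in nodes:
        if c < len(col_seq):  # horizontal: a letter of col_seq over a gap
            edges[(r, c), (r, c + 1)] = (col_seq[c], gap)
        if r < len(row_seq):  # vertical: a gap over a letter of row_seq
            edges[(r, c), (r + 1, c)] = (gap, row_seq[r])
        if c < len(col_seq) and r < len(row_seq):  # diagonal: two letters
            edges[(r, c), (r + 1, c + 1)] = (col_seq[c], row_seq[r])
    return nodes, edges


nodes, edges = build_edit_graph("AGT", "AT")
print(len(nodes), "nodes,", len(edges), "edges")
```

For AGT and AT this yields the (3+1)*(2+1) = 12 nodes of the grid, with 9 horizontal, 8 vertical, and 6 diagonal edges.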

Alignment = Path in the Edit Graph

[Figure: the labeled Edit Graph with the path from Begin to End that spells the alignment below highlighted]

AGT
A-T

Every path from Begin to End corresponds to an alignment

Every alignment corresponds to a path between Begin and End
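The correspondence between paths and alignments can be illustrated by enumerating every Begin-to-End path for a small example. The sketch below is a brute-force illustration only (the number of paths grows very quickly with sequence length); `all_alignments` is a hypothetical name.

```python
def all_alignments(x: str, y: str, gap: str = "-"):
    """List every alignment of x and y as a (row for x, row for y) pair.

    Each recursive choice is one edge type of the edit graph:
    a diagonal edge, an edge that consumes only x, or one that consumes only y.
    """
    if not x and not y:
        return [("", "")]
    results = []
    if x and y:  # diagonal edge: pair the two leading letters
        results += [(x[0] + a, y[0] + b) for a, b in all_alignments(x[1:], y[1:])]
    if x:        # leading letter of x over a gap
        results += [(x[0] + a, gap + b) for a, b in all_alignments(x[1:], y)]
    if y:        # gap over the leading letter of y
        results += [(gap + a, y[0] + b) for a, b in all_alignments(x, y[1:])]
    return results


alignments = all_alignments("AGT", "AT")
print(len(alignments))                 # 25 paths from Begin to End
print(("AGT", "A-T") in alignments)    # True
```

Even this tiny graph has 25 distinct paths, which is why the slides turn to dynamic programming instead of enumeration.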


The Principle of Optimality

The optimal answer to a problem is expressed in terms of the optimal answers to its sub-problems.


Dynamic Programming

Part 1: Compute first the optimal alignment score

Part 2: Construct optimal alignment

We are looking for the optimal alignment = maximal score path in the Edit Graph from the Begin vertex to the End vertex

Given: Two sequences X and Y

Find: An optimal alignment of X with Y

The DP Matrix S(i,j)

[Figure: the grid for AGT (columns 0 to 3) and AT (rows 0 to 2), with the entries S(1,0) and S(2,1) marked as examples]

The DP Matrix

Matrix S = [S(i,j)]

S(i,j) = the score of the maximum-score path from the Begin vertex to the vertex (i,j)

The optimal path to (i,j) must pass through one of the vertices

(i-1,j-1)

(i-1,j)

(i,j-1)

[Figure: the vertex (i,j) with its three predecessors (i-1,j-1), (i-1,j), and (i,j-1), and the optimal path to (i,j) entering through one of them]

Optimal path through (i-1,j)

Optimal path to (i-1,j) + $(x_i, -)$: the last step consumes $x_i$ and pairs it with a gap.

Score: $S(i-1, j) + \delta(x_i, -)$

Optimal path through (i-1,j-1)

Optimal path to (i-1,j-1) + $(x_i, y_j)$: the last step pairs $x_i$ with $y_j$.

Score: $S(i-1, j-1) + \delta(x_i, y_j)$

Optimal path through (i,j-1)

Optimal path to (i,j-1) + $(-, y_j)$: the last step consumes $y_j$ and pairs it with a gap.

Score: $S(i, j-1) + \delta(-, y_j)$

The Basic ALGORITHM

$$
S(i,j) = \max
\begin{cases}
S(i-1, j-1) + \delta(x_i, y_j) \\
S(i-1, j) + \delta(x_i, -) \\
S(i, j-1) + \delta(-, y_j)
\end{cases}
$$
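The recurrence above can be sketched as a table-filling loop. The slides do not spell out the boundary values, so this sketch assumes S(0,0) = 0 and accumulates gap scores along the first row and column; `global_score_matrix` and `unit_score` are illustrative names, and the unit-score scheme is used only as an example δ.

```python
def unit_score(x: str, y: str, gap: str = "-") -> int:
    """Example delta: 1 for identical letters, 0 for mismatches and gaps."""
    if x == gap or y == gap:
        return 0
    return 1 if x == y else 0


def global_score_matrix(x: str, y: str, delta=unit_score, gap: str = "-"):
    """Fill S, where S[i][j] is the best score of aligning x[:i] with y[:j]."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Assumed boundary conditions: only gap steps are possible on the borders.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + delta(x[i - 1], gap)
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + delta(gap, y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + delta(x[i - 1], y[j - 1]),  # align x_i with y_j
                S[i - 1][j] + delta(x[i - 1], gap),           # x_i against a gap
                S[i][j - 1] + delta(gap, y[j - 1]),           # gap against y_j
            )
    return S


S = global_score_matrix("AT", "AGT")
for row in S:
    print(row)
```

Here x = AT indexes the rows and y = AGT indexes the columns, matching the layout of the Edit Graph; the bottom-right entry is the optimal global score. The table has (m+1)*(n+1) entries and each takes constant time to fill.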

Optimal Alignment and Traceback

DP matrix for AGT (columns) and AT (rows) under the unit-score scheme:

      A G T
    0 0 0 0
A   0 1 1 1
T   0 1 1 2

Optimal Alignment (score 2):

AGT
A-T
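Part 2 of the dynamic program, constructing the alignment, can be sketched as a walk back from the End vertex that at each step picks a predecessor whose score explains the current entry. This is one possible traceback, not necessarily the one used in the course: the tie-breaking order (diagonal, then up, then left) and the boundary values are assumptions, and `global_align` is a hypothetical name.

```python
def unit_score(x: str, y: str, gap: str = "-") -> int:
    """Example delta: 1 for identical letters, 0 for mismatches and gaps."""
    if x == gap or y == gap:
        return 0
    return 1 if x == y else 0


def global_align(x: str, y: str, delta=unit_score, gap: str = "-"):
    """Return (optimal score, aligned row for x, aligned row for y)."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + delta(x[i - 1], gap)
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + delta(gap, y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + delta(x[i - 1], y[j - 1]),
                          S[i - 1][j] + delta(x[i - 1], gap),
                          S[i][j - 1] + delta(gap, y[j - 1]))
    # Traceback: collect columns from End back to Begin, then reverse them.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + delta(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + delta(x[i - 1], gap):
            ax.append(x[i - 1]); ay.append(gap); i -= 1
        else:
            ax.append(gap); ay.append(y[j - 1]); j -= 1
    return S[m][n], "".join(reversed(ax)), "".join(reversed(ay))


score, row_at, row_agt = global_align("AT", "AGT")
print(score)    # 2
print(row_agt)  # AGT
print(row_at)   # A-T
```

When several predecessors explain the same entry, there is more than one optimal alignment; a different tie-breaking order would return a different, equally good one.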

The Basic ALGORITHM: Local Similarity

$$
S(i,j) = \max
\begin{cases}
0 \\
S(i-1, j-1) + \delta(x_i, y_j) \\
S(i-1, j) + \delta(x_i, -) \\
S(i, j-1) + \delta(-, y_j)
\end{cases}
$$

We add the 0 term to the maximum; the other three terms are the same as in the global recurrence.
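A minimal sketch of the local variant follows. The added 0 lets an alignment start anywhere, the best score is taken over the whole table rather than at the End vertex, and the traceback stops at the first 0 entry. The scoring scheme here (match +1, mismatch -1, gap -1) is an assumption chosen for illustration, since local alignment needs negative scores to be meaningful; `local_align` and `simple_score` are hypothetical names.

```python
def simple_score(x: str, y: str, gap: str = "-") -> int:
    """Assumed scheme: +1 match, -1 mismatch, -1 for a letter against a gap."""
    if x == gap or y == gap:
        return -1
    return 1 if x == y else -1


def local_align(x: str, y: str, delta=simple_score, gap: str = "-"):
    """Return (best local score, aligned piece of x, aligned piece of y)."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # borders stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + delta(x[i - 1], y[j - 1]),
                          S[i - 1][j] + delta(x[i - 1], gap),
                          S[i][j - 1] + delta(gap, y[j - 1]))
            if S[i][j] > best:  # keep the first cell that reaches the maximum
                best, best_pos = S[i][j], (i, j)
    # Traceback from the best cell until a 0 entry is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + delta(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + delta(x[i - 1], gap):
            ax.append(x[i - 1]); ay.append(gap); i -= 1
        else:
            ax.append(gap); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


best, piece_x, piece_y = local_align("AAAGGG", "GGGAAA")
print(best, piece_x, piece_y)  # 3 AAA AAA
```

For AAAGGG and GGGAAA the sketch finds the shared run AAA with score 3; the run GGG scores the same, and which one is reported depends on the tie-breaking rule.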

Protein global alignment

X = hlsek
Y = nlsak

• X and Y are subsequences of the BRCA2 (early onset) protein in human and chimpanzee

• Global alignments are used when the two sequences being compared represent a similar biological sequence

Margaret Dayhoff’s PAM 100 similarity matrix (partial)


A N E H L K S *

A 4 -1 0 -3 -3 -3 1 -9

N -1 5 1 2 -4 1 1 -9

E 0 1 5 -1 -5 -1 -1 -9

H -3 2 -1 7 -3 -2 -2 -9

L -3 -4 -5 -3 6 -4 -4 -9

K -3 1 -1 -2 -4 5 -1 -9

S 1 1 -1 -2 -4 -1 4 -9

* -9 -9 -9 -9 -9 -9 -9 1

DP matrix for X = hlsek (columns) and Y = nlsak (rows):

       h   l   s   e   k
    0  -9 -18 -27 -36 -45
n  -9   2  -7 -16 -25 -34
l -18  -7   8  -1 -10 -19
s -27 -16  -1  12   3  -6
a -36 -25 -10   3  12   3
k -45 -34 -19  -6   3  17
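The matrix for hlsek and nlsak can be reproduced with a short sketch that applies the global recurrence to the partial PAM 100 table. It assumes that the `*` row and column of the table give the score for aligning a residue with a gap (-9), and that the borders accumulate that gap score; `PAM100`, `delta`, and `global_matrix` are illustrative names.

```python
LETTERS = "ANEHLKS*"
ROWS = [
    [4, -1, 0, -3, -3, -3, 1, -9],     # A
    [-1, 5, 1, 2, -4, 1, 1, -9],       # N
    [0, 1, 5, -1, -5, -1, -1, -9],     # E
    [-3, 2, -1, 7, -3, -2, -2, -9],    # H
    [-3, -4, -5, -3, 6, -4, -4, -9],   # L
    [-3, 1, -1, -2, -4, 5, -1, -9],    # K
    [1, 1, -1, -2, -4, -1, 4, -9],     # S
    [-9, -9, -9, -9, -9, -9, -9, 1],   # * (assumed to stand for the gap)
]
PAM100 = {(a, b): ROWS[i][j]
          for i, a in enumerate(LETTERS) for j, b in enumerate(LETTERS)}
GAP = "*"


def delta(a: str, b: str) -> int:
    """Look up a pair in the partial table; sequences may be lowercase."""
    return PAM100[(a.upper(), b.upper())]


def global_matrix(row_seq: str, col_seq: str):
    """Fill the global DP matrix with row_seq on rows and col_seq on columns."""
    m, n = len(row_seq), len(col_seq)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + delta(row_seq[i - 1], GAP)
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + delta(GAP, col_seq[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + delta(row_seq[i - 1], col_seq[j - 1]),
                          S[i - 1][j] + delta(row_seq[i - 1], GAP),
                          S[i][j - 1] + delta(GAP, col_seq[j - 1]))
    return S


S = global_matrix("nlsak", "hlsek")
for label, row in zip(" nlsak", S):
    print(label, row)
```

The bottom-right entry, 17, is the optimal global score; it is reached along the main diagonal, that is, by aligning the two sequences position by position without gaps.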
