Pairwise alignments Introduction Why do alignments? Definitions Scoring alignments Alignment methods Significance of alignments

Upload: russell-randall

Post on 11-Jan-2016


Page 1: Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment

Pairwise alignments

Introduction Why do alignments? Definitions

Scoring alignments Alignment methods Significance of alignments

Definitions

An alignment is a mutual arrangement of sequences that exhibits where the sequences are similar and where they differ.

An optimal alignment is one that exhibits the most correspondences and the fewest differences. It is the alignment with the highest score. It may or may not be biologically meaningful.

Why do alignments?

Sequence alignment is useful for discovering structural, functional and evolutionary information in biological sequences.

How to measure the similarity

Three kinds of changes can occur at any given position within a sequence:

• Mutation
• Insertion
• Deletion

Insertions and deletions (together called indels) have been found to occur in nature at a significantly lower frequency than mutations.

Alignment: 2 row representation

Given 2 DNA sequences v and w:

v : A T G T T A T    (m = 7)
w : A T C G T A C    (n = 7)

Aligned without gaps, v and w have 4 matches and 3 mismatches.

Aligned with gaps:

letters of v:  A T - G T T A T -
letters of w:  A T C G T - A - C

5 matches, 2 insertions, 2 deletions

Aligning DNA Sequences

V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)

[Figure: an alignment of V and W in which each column is labeled as a match, a mismatch, an insertion or a deletion; insertions and deletions are indels.]

4 matches, 1 mismatch, 2 insertions, 3 deletions

Scoring Matrices for Aligning DNA Sequences

Transition: a substitution in which a purine (A/G) is replaced by another purine (A/G) or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).

Transversion: a substitution in which a purine (A/G) is replaced by a pyrimidine (C/T), or vice versa.

Identity matrix

     A   T   C   G
A    1   0   0   0
T    0   1   0   0
C    0   0   1   0
G    0   0   0   1

BLAST matrix

     A   T   C   G
A    5  -4  -4  -4
T   -4   5  -4  -4
C   -4  -4   5  -4
G   -4  -4  -4   5

Transition-Transversion matrix

     A   T   C   G
A    1  -5  -5  -1
T   -5   1  -1  -5
C   -5  -1   1  -5
G   -1  -5  -5   1
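The three matrices differ only in the values assigned to matches, transitions and transversions, so they can be generated from one rule. The sketch below is one possible way to do that; the names `is_transition` and `dna_matrix` are made up for this illustration and do not come from any library.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(x, y):
    """True when x and y differ but are both purines or both pyrimidines."""
    pair = {x, y}
    return x != y and (pair <= PURINES or pair <= PYRIMIDINES)


def dna_matrix(match, transition, transversion, alphabet="ATCG"):
    """Build a symmetric DNA scoring matrix as a dict keyed by (x, y)."""
    matrix = {}
    for x in alphabet:
        for y in alphabet:
            if x == y:
                matrix[x, y] = match
            elif is_transition(x, y):
                matrix[x, y] = transition
            else:
                matrix[x, y] = transversion
    return matrix


# The three matrices from the slide, expressed through the same rule.
identity = dna_matrix(match=1, transition=0, transversion=0)
blast = dna_matrix(match=5, transition=-4, transversion=-4)
ts_tv = dna_matrix(match=1, transition=-1, transversion=-5)

print(ts_tv["A", "G"], ts_tv["A", "C"])
```

Storing the matrix as a dictionary keyed by letter pairs keeps lookups readable; a 4 x 4 array indexed by letter position would work equally well.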

Scoring a Sequence Alignment

Match score: +1
Mismatch score: +0
Gap penalty: -1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (-1)

Score = 18 + 0 - 7 = +11
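The tally above can be reproduced mechanically by walking over the columns of the two aligned rows. This is a minimal sketch; `score_alignment` is a name chosen for this example, and gaps are assumed to be written as "-".

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=-1):
    """Score two aligned rows column by column and count each column type."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    counts = {"matches": 0, "mismatches": 0, "gaps": 0}
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            counts["gaps"] += 1
        elif a == b:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    score = (counts["matches"] * match
             + counts["mismatches"] * mismatch
             + counts["gaps"] * gap)
    return score, counts


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
score, counts = score_alignment(top, bottom)
print(score, counts)  # 11 {'matches': 18, 'mismatches': 2, 'gaps': 7}
```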

Aligning protein sequences

FFDGGLQMQMLKDKFPMEGGQKDPKQRI

Amino Acid Substitution Matrices

PAM (point accepted mutation): based on global alignments [evolutionary model]

BLOSUM (blocks substitution matrix): based on local alignments [similarity among conserved sequences]

Part of PAM 250 Matrix

     C   S   T   P   A   G
C   12
S    0   2
T   -2   1   3
P   -3   1   0   6
A   -2   1   1   1   2
G   -3   1   0  -1   1   5

\[
\text{Log-odds} = \log\left(\frac{\text{chance to see pair in homologous proteins}}{\text{chance to see pair in unrelated proteins by chance}}\right)
\]
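As a rough illustration of how a log-odds entry behaves, the sketch below turns two probabilities into an integer score. The base-10 logarithm and the scale factor of 10 are assumptions made here for the example (the slide does not state them), and the probabilities are invented.

```python
import math


def log_odds(p_homologous, p_random, scale=10.0):
    """Scaled base-10 log-odds score, rounded to the nearest integer.

    The scale factor and the logarithm base are assumptions of this sketch.
    """
    return round(scale * math.log10(p_homologous / p_random))


# Hypothetical frequencies, chosen only to show the sign of the score.
print(log_odds(0.02, 0.01))   # pair seen twice as often as by chance: positive
print(log_odds(0.005, 0.01))  # pair seen half as often as by chance: negative
print(log_odds(0.01, 0.01))   # pair seen exactly as often as by chance: zero
```

A positive entry therefore marks a substitution that is observed more often in related proteins than chance would predict, and a negative entry marks one that is observed less often.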

PAM 250 Matrix

     C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   12
S    0   2
T   -2   1   3
P   -3   1   0   6
A   -2   1   1   1   2
G   -3   1   0  -1   1   5
N   -4   1   0  -1   0   0   2
D   -5   0   0  -1   0   1   2   4
E   -5   0   0  -1   0   0   1   3   4
Q   -5  -1  -1   0   0  -1   1   2   2   4
H   -3  -1  -1   0  -1  -2   2   1   1   3   6
R   -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
K   -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M   -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I   -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L   -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
V   -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F   -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y    0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W   -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17

Scoring Matrix: Example

     A   R   N   K
A    5  -2  -1  -1
R    -   7  -1   3
N    -   -   7   0
K    -   -   -   6

• Notice that although R and K are different amino acids, they have a positive score.

• Why?

They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

AKRANR
KAAANK

-1 + (-1) + (-2) + 5 + 7 + 3 = 11
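The sum for AKRANR against KAAANK can be checked with a small lookup. In this sketch only the upper triangle of the example matrix is stored, as on the slide; `pair_score` and `ungapped_score` are illustrative names, not part of any package.

```python
# Upper triangle of the example matrix; the matrix is symmetric.
EXAMPLE_MATRIX = {
    ("A", "A"): 5, ("A", "R"): -2, ("A", "N"): -1, ("A", "K"): -1,
    ("R", "R"): 7, ("R", "N"): -1, ("R", "K"): 3,
    ("N", "N"): 7, ("N", "K"): 0,
    ("K", "K"): 6,
}


def pair_score(x, y, matrix=EXAMPLE_MATRIX):
    """Look up a symmetric score that is stored for one ordering of the pair."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def ungapped_score(seq1, seq2, matrix=EXAMPLE_MATRIX):
    """Sum the substitution scores of two equal-length, gap-free sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y, matrix) for x, y in zip(seq1, seq2))


print(ungapped_score("AKRANR", "KAAANK"))  # 11
```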

Sequence Alignment Problem

T C A T G
C A T T G

Elements of Dynamic Programming

The dynamic programming method is used to solve optimization problems whose optimal solutions depend on the optimal solutions to their subproblems. It involves:

• Characterizing the structure of the optimal solutions
• Recursively defining the score of an optimal solution in terms of the scores of optimal solutions of subproblems
• Computing the solution in a bottom-up fashion
• Tracing back the optimal solution

Dynamic Programming

Consider two sequences: AAAT and AGC.

To find the optimal solution when T is aligned with C, we have to find the best alignment between AAA and AG.

The best solution depends on the best solutions of the subproblems.

Dynamic Programming

Consider two sequences: AAAT and AGC.

In general, to find the optimal solution we have to find the best alignment between AAA and AG (T aligned with C), between AAA and AGC (T aligned with a space), or between AAAT and AG (C aligned with a space).

The best solution depends on the best solutions of the subproblems.

Dynamic Programming

Optimal solutions for the subproblems have to be found recursively.

Let n be the size of sequence s = AAAT and m be the size of sequence t = AGC.

Consider subproblems: matching the prefixes of s and t.

s has n+1 possible prefixes, including the empty string.
t has m+1 possible prefixes, including the empty string.

Dynamic Programming

We would like to match s[1…i] and t[1…j]. There are three options:

• Align s[1…i] with t[1…j-1] and match a space with t[j]
• Align s[1…i-1] with t[1…j-1] and match s[i] with t[j]
• Align s[1…i-1] with t[1…j] and match a space with s[i]

Similarity between s and t:

\[
\mathrm{Score}(s[1 \ldots i], t[1 \ldots j]) = \max
\begin{cases}
\mathrm{Score}(s[1 \ldots i], t[1 \ldots j-1]) + \text{gap penalty} \\
\mathrm{Score}(s[1 \ldots i-1], t[1 \ldots j-1]) + \mathrm{score}(s[i], t[j]) \\
\mathrm{Score}(s[1 \ldots i-1], t[1 \ldots j]) + \text{gap penalty}
\end{cases}
\]
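The recurrence can be transcribed almost literally as a memoized recursive function. This is a sketch rather than an efficient implementation: the scoring values (match = 1, mismatch = -1, gap = -2) are borrowed from the worked example later in the deck, and the base cases assume that aligning a prefix with the empty string costs one gap penalty per letter.

```python
from functools import lru_cache


def global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Top-down evaluation of Score(s[1..i], t[1..j]) with memoization."""

    @lru_cache(maxsize=None)
    def score(i, j):
        if i == 0:              # empty prefix of s: j letters of t face spaces
            return j * gap
        if j == 0:              # empty prefix of t: i letters of s face spaces
            return i * gap
        pair = match if s[i - 1] == t[j - 1] else mismatch
        return max(
            score(i, j - 1) + gap,       # space aligned with t[j]
            score(i - 1, j - 1) + pair,  # s[i] aligned with t[j]
            score(i - 1, j) + gap,       # space aligned with s[i]
        )

    return score(len(s), len(t))


print(global_score("AAAT", "AGC"))
```

Memoization matters here: without the cache the same (i, j) subproblem would be recomputed exponentially many times, whereas with it each of the (n+1)(m+1) prefix pairs is solved once.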

Definitions

Global alignment: the Needleman-Wunsch algorithm (1970) maximizes the number of matches between the sequences along their entire length.

Local alignment: the Smith-Waterman algorithm (1981) is a modification of the dynamic programming algorithm that gives the highest-scoring local match between two sequences.

Example

Let gap = -2, match = 1, mismatch = -1.

        empty    A    A    A    C
empty       0   -2   -4   -6   -8
A          -2    1   -1   -3   -5
G          -4   -1    0   -2   -4
C          -6   -3   -2   -1   -1

AAAC
A-GC

Complexity: O(mn)
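A bottom-up version of the same computation, with a traceback, might look like the following sketch. The function name `needleman_wunsch` is chosen for this example, and the traceback tries the diagonal move first, which is an arbitrary tie-breaking rule; when several alignments share the optimal score it may return a different one than the slide shows.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    # table[i][j] holds the best score for aligning s[:i] with t[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to the top-left corner.
    out_s, out_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if s[i - 1] == t[j - 1] else mismatch
            if table[i][j] == table[i - 1][j - 1] + pair:
                out_s.append(s[i - 1])
                out_t.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + gap:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return table[n][m], "".join(reversed(out_s)), "".join(reversed(out_t))


score, aligned_s, aligned_t = needleman_wunsch("AAAC", "AGC")
print(score)
print(aligned_s)
print(aligned_t)
```

Filling the table takes one constant-time step per cell, which is where the O(mn) running time comes from.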

Gap Penalty

Scoring Indels: Naive Approach

A fixed penalty σ is given to every indel: -σ for 1 indel, -2σ for 2 consecutive indels, -3σ for 3 consecutive indels, etc.

This can be too severe a penalty for a series of 100 consecutive indels.

Affine Gap Penalties

In nature, a series of k indels often comes as a single event rather than a series of k single-nucleotide events.

Normal scoring would give the same score to both of these alignments:

• k indels in one contiguous run: this is more likely.
• k indels scattered along the alignment: this is less likely.

Gap Penalty

The gap opening penalty defines the cost for opening a gap in one of the sequences. The penalty must be tuned to the scoring matrix in use.

The gap extension penalty is an extra penalty proportional to the length of the gap. The gap extension penalty is usually lower than the gap opening penalty.

Optimal penalties vary from sequence to sequence, and finding the most adequate value is a matter of empirical trial and error.

Accounting for Gaps

Gap: a contiguous sequence of spaces in one of the rows.

Score for a gap of length x is -(ρ + σx), where x > 0 is the length of the gap, ρ > 0 is the penalty for introducing a gap (gap opening penalty) and σ is the gap extension penalty.

ρ will be large relative to σ because you do not want to add too much of a penalty for extending the gap.

Affine Gap Penalties

Gap penalties:

• -ρ-σ when there is 1 indel
• -ρ-2σ when there are 2 indels
• -ρ-3σ when there are 3 indels, etc.
• -ρ-x·σ in general (-gap opening - x gap extensions)

Somewhat reduced penalties (as compared to naïve scoring) are given to runs of horizontal and vertical edges.
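To see how the two schemes diverge as a gap grows, the sketch below evaluates both formulas for a few gap lengths. The parameter values (σ = 2 for the naive scheme, ρ = 10 and σ = 1 for the affine scheme) are hypothetical and chosen only to make the difference visible.

```python
def linear_gap_score(length, sigma):
    """Naive scoring: every indel in the run costs sigma."""
    return -sigma * length


def affine_gap_score(length, rho, sigma):
    """Affine scoring: one opening cost rho plus sigma per indel in the run."""
    return -(rho + sigma * length) if length > 0 else 0


# Hypothetical parameters for illustration only.
for length in (1, 3, 10, 100):
    naive = linear_gap_score(length, sigma=2)
    affine = affine_gap_score(length, rho=10, sigma=1)
    print(length, naive, affine)
```

With these values a single indel is more expensive under the affine scheme, while a run of 100 indels is far cheaper, which is the behaviour the affine model is meant to capture.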

Affine Gap Penalties and Edit Graph

To reflect affine gap penalties we have to add “long” horizontal and vertical edges to the edit graph. Each such edge of length x should have weight

-ρ - x·σ

Adding “Affine Penalty” Edges to the Edit Graph

There are many such edges!

Adding them to the graph increases the running time of the alignment algorithm by a factor of n (where n is the length of the sequence).

So the complexity increases from O(n^2) to O(n^3).
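The extra factor of n is easy to see in code: with long edges, every cell has to look back along its entire row and column instead of at three neighbours. The following sketch implements that direct approach for global alignment; the function name and the match/mismatch defaults are assumptions of this example, and no attempt is made to avoid the cubic cost.

```python
def affine_global_score(s, t, rho, sigma, match=1, mismatch=-1):
    """Global alignment score where a gap of length x costs rho + sigma * x.

    Every cell scans all long vertical and horizontal edges that end in it,
    so filling the table takes cubic time for sequences of similar length.
    """
    n, m = len(s), len(t)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                pair = match if s[i - 1] == t[j - 1] else mismatch
                candidates.append(best[i - 1][j - 1] + pair)
            for x in range(1, i + 1):   # long edge: x letters of s face spaces
                candidates.append(best[i - x][j] - (rho + sigma * x))
            for x in range(1, j + 1):   # long edge: x letters of t face spaces
                candidates.append(best[i][j - x] - (rho + sigma * x))
            best[i][j] = max(candidates)
    return best[n][m]


# Hypothetical sequences and penalties, for illustration only.
print(affine_global_score("ATAGC", "ATATTGC", rho=2, sigma=1))
```

Setting rho to 0 turns every long edge into a sum of unit edges, so the function then returns the same score as the naive fixed-penalty scheme.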

Optimal alignment

Optimal alignment: one that exhibits the most correspondences. It is the alignment with the highest score. It may or may not be biologically meaningful.

Global alignment: the Needleman-Wunsch algorithm (1970) maximizes the number of matches between the sequences along their entire length.

Local alignment: the Smith-Waterman algorithm (1981) gives the highest-scoring local match between two sequences.

Local Alignment

Problem first formulated: Smith and Waterman (1981)

Problem: Find an optimal alignment between a substring of s and a substring of t

Algorithm: a variant of the basic algorithm for global alignment

Motivation

• Searching for unknown domains or motifs within proteins from different families
• Proteins encoded from Homeobox genes (only conserved in 1 region called the Homeo domain, 60 amino acids long)
• Identifying active sites of enzymes
• Comparing long stretches of anonymous DNA
• Querying databases where the query word is much smaller than the sequences in the database
• Analyzing repeated elements within a single sequence

Local Alignment

Let n be the size of sequence s = GATCACCT and m be the size of sequence t = GATACCC.

Consider subproblems: matching the suffixes of s and t.

s has n+1 possible suffixes, including the empty string.
t has m+1 possible suffixes, including the empty string.

DP for Local Alignment

Match the suffixes of s[1…i] and t[1…j]:

• Align suffixes of s[1…i] with t[1…j-1] & match a space with t[j]
• Align suffixes of s[1…i-1] with t[1…j-1] & match s[i] with t[j]
• Align suffixes of s[1…i-1] with t[1…j] & match a space with s[i]

\[
\mathrm{Score}(s[1 \ldots i], t[1 \ldots j]) = \max
\begin{cases}
\mathrm{Score}(s[1 \ldots i], t[1 \ldots j-1]) + \text{gap penalty} \\
\mathrm{Score}(s[1 \ldots i-1], t[1 \ldots j-1]) + \mathrm{score}(s[i], t[j]) \\
\mathrm{Score}(s[1 \ldots i-1], t[1 \ldots j]) + \text{gap penalty} \\
0
\end{cases}
\]

S_ij: the highest score for an alignment between a suffix of s[1…i] and a suffix of t[1…j], that is, for a local alignment ending at positions i and j. The empty alignment scores 0.
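The only change from the global recurrence is the extra 0 in the maximum and the zero-filled first row and column. A sketch of the table fill, using the scoring values of the deck's local alignment example (match = 1, mismatch = -1, gap = -2) and a function name invented for this illustration:

```python
def local_score_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Fill the local alignment table; rows follow t, columns follow s."""
    rows, cols = len(t), len(s)
    # First row and first column stay at 0: an empty alignment is always allowed.
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            pair = match if t[i - 1] == s[j - 1] else mismatch
            table[i][j] = max(
                0,
                table[i - 1][j - 1] + pair,
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
            )
    return table


table = local_score_matrix("GATCACCT", "GATACCC")
best = max(max(row) for row in table)
for row in table:
    print(row)
print("best local score:", best)
```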

Local Alignment

Let gap = -2, match = 1, mismatch = -1.

s = GATCACCT
t = GATACCC

        empty   G   A   T   C   A   C   C   T
empty       0   0   0   0   0   0   0   0   0
G           0   1   0   0   0   0   0   0   0
A           0   0   2   0   0   1   0   0   0
T           0   0   0   3   1   0   0   0   1
A           0   0   1   1   2   2   0   0   0
C           0   0   0   0   2   1   3   1   0
C           0   0   0   0   1   1   2   4   2
C           0   0   0   0   1   0   2   3   3

GATCACCT
GAT_ACCC

Local Alignment

Let gap = -2, match = 1, mismatch = -1.

ACACACTA
AGCACACA

    -   A   C   A   C   A   C   T   A
-   0   0   0   0   0   0   0   0   0
A   0   1   0   1   0   1   0   0   1
G   0   0   0   0   0   0   0   0   0
C   0   0   1   0   1   0   1   0   0
A   0   1   0   2   0   2   0   0   1
C   0   0   2   0   3   1   3   1   0
A   0   1   0   3   1   4   2   2   2
C   0   0   2   1   4   2   5   3   1
A   0   1   0   3   2   5   3   4   4

Smith & Waterman

• Place each sequence along one axis
• Place score 0 at the up-left corner
• Fill in 1st row & column with 0s
• Fill in the matrix with the max value of 4 possible values:
  0
  Vertical move: Score + gap penalty
  Horizontal move: Score + gap penalty
  Diagonal move: Score + match/mismatch score
• The optimal alignment score is the max in the matrix
• To reconstruct the optimal alignment, trace back where the MAX at each step came from, stop when a zero is hit
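The procedure above, including the traceback that stops at the first zero, can be sketched as follows. The function name `smith_waterman` is chosen for this example; ties are broken by taking the first maximal cell found and by preferring the diagonal move, which are arbitrary choices of this sketch.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one best local alignment."""
    rows, cols = len(s), len(t)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            table[i][j] = max(
                0,
                table[i - 1][j - 1] + pair,  # diagonal move
                table[i - 1][j] + gap,       # vertical move
                table[i][j - 1] + gap,       # horizontal move
            )
            if table[i][j] > best:
                best, best_pos = table[i][j], (i, j)
    # Trace back from the maximal cell until a zero is hit.
    out_s, out_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and table[i][j] > 0:
        pair = match if s[i - 1] == t[j - 1] else mismatch
        if table[i][j] == table[i - 1][j - 1] + pair:
            out_s.append(s[i - 1])
            out_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif table[i][j] == table[i - 1][j] + gap:
            out_s.append(s[i - 1])
            out_t.append("-")
            i -= 1
        else:
            out_s.append("-")
            out_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))


print(smith_waterman("GATCACCT", "GATACCC"))  # (4, 'GATCACC', 'GAT-ACC')
```

Unlike the global traceback, this one starts at the largest entry rather than the corner, so the returned rows cover only the aligned substrings.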

Semi-global Alignment

Example:

CAGCA-CTTGGATTCTCGG
---CAGCGTGG--------

CAGCACTTGGATTCTCGG
CAGC----G--T----GG

We like the first alignment much better. In semiglobal comparison, we score the alignments ignoring some of the end spaces.

Global Alignment

Example:

AAACCC
A--CCC

Prefer to see:

AAACCC
--ACCC

Do not want to penalize the end spaces

        empty    A    A    A    C    C    C
empty       0   -2   -4   -6   -8  -10  -12
A          -2    1   -1   -3   -5   -7   -9
C          -4   -1    0   -2   -2   -4   -6
C          -6   -3   -2   -1   -1   -1   -3
C          -8   -5   -4   -3    0    0    0

SemiGlobal Alignment

Example:

s = AAACCC
t = --ACCC

        empty    A    A    A    C    C    C
empty       0    0    0    0    0    0    0
A          -2    1    1    1   -1   -1   -1
C          -4   -1    0    0    2    0    0
C          -6   -3   -2   -1    1    3    1
C          -8   -5   -4   -3    0    2    4

SemiGlobal Alignment

Example:

s = AAACCCG
t = --ACCC-

        empty    A    A    A    C    C    C    G
empty       0    0    0    0    0    0    0    0
A          -2    1    1    1   -1   -1   -1   -1
C          -4   -1    0    0    2    0    0   -2
C          -6   -3   -2   -1    1    3    1   -1
C          -8   -5   -4   -3    0    2    4    2

SemiGlobal Alignment

Summary of end space charging procedures (with the 1st sequence, s, along the columns of the matrix and the 2nd sequence, t, along the rows, as in the examples above):

Place where spaces are not penalized    Action
Beginning of 2nd sequence               Initialize 1st row with zeros
End of 2nd sequence                     Look for max in last row
Beginning of 1st sequence               Initialize 1st column with zeros
End of 1st sequence                     Look for max in last column
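The four actions can be combined into one scoring routine with a switch for each end. The sketch below assumes the layout of the example matrices (s along the columns, t along the rows) and the same scoring values as the examples; the function name and the `free_*` flag names are invented for this illustration.

```python
def semiglobal_score(s, t, free_start_s=False, free_end_s=False,
                     free_start_t=False, free_end_t=False,
                     match=1, mismatch=-1, gap=-2):
    """Alignment score where spaces at the selected sequence ends are free.

    s runs along the columns and t along the rows. A free_* flag means that
    spaces placed at that end of the named sequence are not penalized.
    """
    rows, cols = len(t), len(s)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for j in range(1, cols + 1):
        # First row: letters of s facing spaces placed before t begins.
        table[0][j] = 0 if free_start_t else j * gap
    for i in range(1, rows + 1):
        # First column: letters of t facing spaces placed before s begins.
        table[i][0] = 0 if free_start_s else i * gap
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            pair = match if t[i - 1] == s[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
            )
    candidates = [table[rows][cols]]
    if free_end_t:
        candidates.extend(table[rows])                 # max in the last row
    if free_end_s:
        candidates.extend(row[cols] for row in table)  # max in the last column
    return max(candidates)


print(semiglobal_score("AAACCC", "ACCC"))                     # global score
print(semiglobal_score("AAACCC", "ACCC", free_start_t=True))  # leading spaces in t free
print(semiglobal_score("AAACCCG", "ACCC", free_start_t=True, free_end_t=True))
```

With all four flags off the routine reduces to ordinary global alignment, which makes it easy to compare the two scores for the same pair of sequences.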

Global vs Local

Demo

R

R: http://www.r-project.org/
R IDE: http://rstudio.org/
R manual: http://cran.r-project.org/doc/manuals/R-intro.pdf
R & IDE downloads: http://cran.case.edu/ http://rstudio.org/download/
Quick R: http://www.statmethods.net/

R Demo

Bioconductor: http://www.bioconductor.org/install/

# Install Bioconductor (newer Bioconductor releases use BiocManager::install() instead of biocLite())
source("http://bioconductor.org/biocLite.R")
biocLite()

library(Biostrings)  # load library

# Align each pattern against the subject
pairwiseAlignment(pattern = c("succeed", "precede"),
                  subject = "supersede")

?pairwiseAlignment   # open the help page