Hanyang Univ. Introduction to Computation & Pairwise Alignment Eunok Paek [email protected]


Post on 06-Jun-2020


Page 1: Introduction to Computation & Pairwise Alignmentarchi.snu.ac.kr/courses/under/13_spring_computer_concept/... · 2019-07-12 · Introduction to Computation & Pairwise Alignment Eunok

Hanyang Univ.

Introduction to Computation & Pairwise Alignment

Eunok [email protected]


Pan-Fried Fish with Spicy Dipping Sauce

This spicy fish dish is quick to prepare and cooks in about 8 minutes.

Ingredients:
½ c mayonnaise
½ t salt
½ t cayenne pepper
¼ t ground black pepper
2 T lemon juice
2 eggs, beaten
4 white fish fillets (6 oz.)
1 c bread crumbs
3 T vegetable oil

Directions:
In a small bowl whisk together mayonnaise, cayenne pepper and lemon juice; set aside.

Season fish fillets with salt and pepper to taste. Dip in beaten egg and coat evenly with bread crumbs. Heat a large, nonstick skillet over medium-high heat. Add oil and when hot, but not smoking, saute fish until golden brown and thoroughly cooked, about 4 minutes per side. Serve warm with reserved spicy dipping sauce.

Algorithm – what you already know about programming


Algorithm – what you already know about programming

Recipes have to be refined

- A new recipe is rarely right on the first attempt.
- Modifications are made as necessary.
- Trying the recipe on the intended audience may yield further modifications.
- The recipe can be adapted for new ingredients.

Writing a program is a lot like writing a recipe.


Algorithm – Definition

An algorithm is a finite set of precise instructions for performing a computation or for solving a problem

Example: find a maximum value in a finite sequence of integers

1. Set the temporary maximum equal to the first integer in the sequence.

2. Compare the next integer in the sequence to the temporary maximum, and if it is larger than the temporary maximum, set the temporary maximum equal to this integer.

3. Repeat the previous step if there are more integers in the sequence.

4. Stop when there are no integers left in the sequence. The temporary maximum at this point is the largest integer in the sequence.
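The four steps above can be sketched in Python as follows. This is a minimal illustration rather than part of the original slides; the name `find_max` and the error raised for an empty sequence are assumptions.

```python
def find_max(values):
    """Return the largest integer in a non-empty sequence."""
    if not values:
        raise ValueError("sequence must be non-empty")
    temp_max = values[0]          # step 1: start with the first integer
    for v in values[1:]:          # steps 2-3: visit each remaining integer
        if v > temp_max:
            temp_max = v          # keep the larger value
    return temp_max               # step 4: nothing left to compare


print(find_max([3, 17, 5, 11]))
```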


Algorithm – Pseudo code

function max (a1, a2, …, an: integer)
    max = a1;
    for i = 2 to n
        if max < ai then max = ai;
    return max;

Annotations: function name; arguments and datatype; assignment (variable = value); repetition (iteration); function value.


Algorithm – Pseudo code

function binary search (x: integer, a1, a2, …, an: increasing integers)
    i = 1;
    j = n;
    while i < j
    begin
        m = ⌊(i + j) / 2⌋;
        if x > am then i = m + 1
        else j = m;
    end
    if x = ai then location = i
    else location = 0;
    return location;

repetition
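A runnable Python sketch of the binary search pseudocode is given here for comparison. It keeps the slide's 1-based `location` convention (0 means "not found"); the function name and the handling of an empty list are assumptions added for the example.

```python
def binary_search(x, a):
    """Return the 1-based position of x in the increasing list a, or 0 if absent."""
    if not a:
        return 0
    i, j = 1, len(a)
    while i < j:
        m = (i + j) // 2          # floor of the midpoint
        if x > a[m - 1]:          # a[m - 1] is the slide's a_m
            i = m + 1
        else:
            j = m
    return i if a[i - 1] == x else 0


print(binary_search(7, [1, 3, 5, 7, 9, 11]))
```

Each pass halves the interval between `i` and `j`, so the loop runs about log2(n) times.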


Algorithm – Pseudo code

function n_Choose_k (n, k: integers)
    return Factorial(n) / (Factorial(n – k) * Factorial(k));

function Factorial (n: integer)
    temp = 1;
    for i = 2 to n
        temp = temp * i;
    return temp;

calling another function


Algorithm – Recursion

function fibonacci (n: nonnegative integer)
    if n = 0 then return 0
    else if n = 1 then return 1
    else return fibonacci(n-1) + fibonacci(n-2);

recursive call

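[Figure: call tree of fibonacci(4), in which F2 is evaluated twice and F1 three times, shown next to a diagram in which each of F0, F1, F2, F3, F4 appears only once]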


Algorithm – Iteration & Memory

function fibonacci (n: nonnegative integer)
    if n = 0 then return 0
    else
    begin
        fn_2 = 0;
        fn_1 = 1;
        for i = 1 to n-1
        begin
            fn = fn_1 + fn_2;
            fn_2 = fn_1;
            fn_1 = fn;
        end
    end
    return fn_1;

what if n = 1?
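The recursive and iterative versions can be compared with a short Python sketch. The function names are chosen for this example. For n = 1 the loop body never runs, so the iterative version returns the initial value of `fn_1`, which is 1.

```python
def fib_recursive(n):
    """Direct translation of the recursive definition (exponential time)."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Keep only the last two values (linear time, constant memory)."""
    if n == 0:
        return 0
    fn_2, fn_1 = 0, 1
    for _ in range(1, n):         # i = 1 .. n-1; runs zero times when n == 1
        fn_2, fn_1 = fn_1, fn_1 + fn_2
    return fn_1


print([fib_iterative(n) for n in range(8)])
```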


Computation – Running Time

• Two ways to measure the relative efficiency of an algorithm
– Mathematical analysis
– Empirical analysis

• Mathematical analysis of the running time
– Running time is measured by the number of “basic steps” (e.g., the number of Python statements) that the algorithm executes.
– Running time is described as a function of the input size $n$, written $T(n)$.

• We are usually interested in the worst-case running time or the average-case running time.


Example: $T(n) = 13n^3 + 42n^2 + 2n\log n + 4n$

As n grows larger, $n^3$ is MUCH larger than $n^2$, $n\log n$, and $n$, so it dominates $T(n)$

The constant factor 13 can be ignored since it is affected by the compiler used or machine speed, etc.

• The running time grows “roughly on the order of $n^3$”

• Notationally, $T(n) = O(n^3)$
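A quick numerical check makes the dominance of the cubic term visible. In this sketch the logarithm is taken to base 2, which is an assumption (the slide does not state the base, and the base does not affect the conclusion).

```python
import math


def T(n):
    """Running time from the example; log base 2 is assumed."""
    return 13 * n**3 + 42 * n**2 + 2 * n * math.log2(n) + 4 * n


# The ratio T(n) / n^3 approaches the constant factor 13 as n grows.
for n in (10, 100, 1000, 10000):
    print(n, T(n) / n**3)
```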

Computation – Big-Oh(O) Notation


[Figure: g(n) plotted against n (5 to 20) for $2^n$, $n^3/2$, $5n^2$ and $100n$, with g(n) ranging up to 3000]

Computation – Complexity


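Function: approximate values

| n | 10 | 100 | 1000 |
|---|---|---|---|
| n log n | 33 | 664 | 9966 |
| n^3 | 1,000 | 1,000,000 | 10^9 |
| 10^6 n^8 | 10^14 | 10^22 | 10^30 |
| 2^n | 1024 | 1.27×10^30 | 1.07×10^301 |
| n^(log n) | 2099 | 1.93×10^13 | 7.89×10^29 |
| n! | 3,628,800 | 10^158 | 4×10^2567 |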

Computation – Complexity


Computation – Complexity

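| Function | Size of Instance Solved in One Day | Size of Instance Solved in a Computer 10 Times Faster |
|---|---|---|
| n | 10^12 | 10^13 |
| n log n | 0.948×10^11 | 0.87×10^12 |
| n^2 | 10^6 | 3.16×10^6 |
| n^3 | 10^4 | 2.15×10^4 |
| 10^8 n^4 | 10 | 18 |
| 2^n | 40 | 43 |
| 10^n | 12 | 13 |
| n^(log n) | 79 | 95 |
| n! | 14 | 15 |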


Take Home Message

• There can be many ways to solve the same problem.
• Running time can often be estimated mathematically, as a function of the input size n.
• What matters is the “order of growth” in computational time.


A C C T G A G – A G
A C G T G – G C A G

70% identical

Annotations: mismatch, indel

Sequence Alignment


Eye of the tiger

* In 1994 Walter Gehring et al. (Univ. of Basel) turned the gene “eyeless” on in various places in Drosophila melanogaster

* Result: eyes formed in multiple places

* ‘eyeless’ is a master regulatory gene that controls approximately 2000 other genes

* Switching ‘eyeless’ on induces formation of an eye

Sequence Alignment


Eyeless Drosophila

Sequence Alignment


Sequence Alignment


Sequence Alignment – Homeoboxes & Master regulatory genes


HOMEO BOX

A homeobox is a DNA sequence found within genes that are involved in the regulation of development (morphogenesis) of animals, fungi and plants.

Sequence Alignment – Homeoboxes & Master regulatory genes


Sequence alignment is important for:

* prediction of function
* database searching
* gene finding
* sequence divergence
* sequence assembly

Sequence Alignment


Growth of GenBank and WGS


Pairwise Alignment

• Dot matrix
• Dynamic programming
– Needleman-Wunsch: optimal global alignment
– Smith-Waterman: optimal local alignment


Pairwise Alignment

Types of Sequence Alignment

• Dot matrix
• Number of sequences
– pairwise alignment: compare two sequences
– multiple alignment: compare multiple sequences
• Portion of sequences aligned
– global alignment: align sequences over their entire lengths
– local alignment: find the longest/best subsequence pairs that give maximum similarity
• Algorithmic approach
– optimal methods: Needleman-Wunsch, Smith-Waterman
– heuristic methods: FASTA, BLAST


Dot Matrix

• A visual depiction of the relationship between 2 sequences
• Reveals insertion/deletion
• Finds direct or inverted repeats
• Steps
– create a 2D matrix
– one sequence along the top
– the other along the left side
– for each cell of the matrix, place a dot if the two corresponding residues match

Pairwise Alignment – Dot Matrix


Pairwise Alignment – Dot Matrix

Running Time of Dot Matrix

• Lengths of sequences: m, n
• $O(mn)$


Pairwise Alignment – Dot Matrix

DNA sequences protein sequences


Pairwise Alignment – Dot Matrix

Random Matches in Dot Matrix

• When comparing DNA sequences, random matches occur with probability 1/4

• When comparing protein sequences, 1/20

• Thus, for comparisons of protein coding DNA sequences, we should translate them to amino acids first


Pairwise Alignment – Dot Matrix

To Reduce Random Noise in Dot Matrix

• Specify a window size, w

• Take w residues from each of the two sequences

• Among the w pairs of residues, count how many pairs are matches

• Specify a stringency: place a dot only if the number of matches in the window reaches this threshold
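The window and stringency idea can be sketched in Python as below. This is one possible convention, not necessarily the one used to draw the slides: the window is assumed to start at cell (i, j) and run down the diagonal, and it is truncated at the matrix edges. The function name and parameters are chosen for this example.

```python
def dot_matrix(top, left, window=1, stringency=1):
    """Count matches in a diagonal window starting at each cell.

    A cell keeps its count only if it reaches `stringency`, otherwise 0.
    With window=1 and stringency=1 this is the simple dot matrix.
    """
    rows = []
    for i in range(len(left)):
        row = []
        for j in range(len(top)):
            # Number of residue pairs available on the diagonal from (i, j).
            span = min(window, len(left) - i, len(top) - j)
            matches = sum(left[i + k] == top[j + k] for k in range(span))
            row.append(matches if matches >= stringency else 0)
        rows.append(row)
    return rows


top, left = "PVILEPMMKVTIEMP", "PVILEPIMRVEVTTP"
for residue, row in zip(left, dot_matrix(top, left, window=3, stringency=2)):
    print(residue, " ".join(str(v) if v else "." for v in row))
```

Filling the matrix visits every cell once, so the work is proportional to m × n for a fixed window size.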


Pairwise Alignment – Dot Matrix

Simple dot matrix, Window size 1

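  P V I L E P M M K V T I E M P
P 1 . . . . 1 . . . . . . . . 1
V . 1 . . . . . . . 1 . . . . .
I . . 1 . . . . . . . . 1 . . .
L . . . 1 . . . . . . . . . . .
E . . . . 1 . . . . . . . 1 . .
P 1 . . . . 1 . . . . . . . . 1
I . . 1 . . . . . . . . 1 . . .
M . . . . . . 1 1 . . . . . 1 .
R . . . . . . . . . . . . . . .
V . 1 . . . . . . . 1 . . . . .
E . . . . 1 . . . . . . . 1 . .
V . 1 . . . . . . . 1 . . . . .
T . . . . . . . . . . 1 . . . .
T . . . . . . . . . . 1 . . . .
P 1 . . . . 1 . . . . . . . . 1

(1 = the two residues match, . = no match)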


Pairwise Alignment – Dot Matrix

Window size is 3

P V I L E P M M K V T I E M P

P 3 1 1 1 1

V 3 1 1

I 3 1 1 1 1

L 3 1 1 1

E 1 2 1 1 1

P 1 1 1 2 1 1 1

I 1 1 1 1 1

M 1 2 1

R 1 1 1 1 1 1

V 1 1 1 1 1 1

E 1 1 2 1

V 1 1 2

T 1 1 1 1

T 1 1 2 2 1

P 1 1 1 1 1 1 1 3


Pairwise Alignment – Dot Matrix

Window size is 3; Stringency is 2

P V I L E P M M K V T I E M P

P 3

V 3

I 3

L 3

E 2

P 2

I

M 2

R

V

E 2

V 2

T

T 2 2

P 3


Pairwise Alignment – Dot Matrix

DNA Sequences

single residue identity 16 out of 23 identical


Pairwise Alignment – Dot Matrix

Protein Sequences

single residue identity 6 out of 23 identical


Pairwise Alignment – Dot Matrix

Insertion/Deletion, Inversion


Pairwise Alignment – Dot Matrix

ABCDEFGEFGHIJKLMNO

[Figure panels: tandem duplication compared to no duplication; tandem duplication compared to self]


Pairwise Alignment – Dot Matrix

What Is This?

5’ GGCGG 3’

Palindrome

(Intrastrand)


Pairwise Alignment – Global Alignment

Optimal Alignment

• Consider two sequences, both of length n

• If no gaps are allowed, there is only one alignment, which is optimal

• If n gaps are allowed, there are

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$

possible alignments

• How to find the optimal ones?
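To get a feel for how fast this count grows, the exact binomial coefficient can be compared with the Stirling-type estimate. The sketch below uses `math.comb` from the Python standard library (Python 3.8 or later); the function names are chosen for this example.

```python
import math


def count_alignments(n):
    """Exact value of (2n choose n)."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (2, 10, 50, 100):
    exact = count_alignments(n)
    print(n, exact, exact / stirling_estimate(n))
```

Even for two sequences of length 100 the count is far too large to enumerate, which motivates dynamic programming.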


First, Define Optimality

• Scoring scheme

– a scoring matrix and

– gap penalties

• Examples of scoring schemes

– amino acids: PAM250, or BLOSUM62; -13 for gap opening, -2 for gap extension

– nucleotides: the matrix to the right; -8 for gap opening, -6 for gap extension

A C G T

A 2 -7 -6 -7

C -7 2 -7 -6

G -6 -7 2 -7

T -7 -6 -7 2

Pairwise Alignment – Global Alignment


Pairwise Alignment – Global Alignment

Intuition of Dynamic Programming

If we already have the optimal solution to:

XY
AB

then we know the next pair of characters will either be:

XYZ   or   XY-   or   XYZ
ABC        ABC        AB-

(where “-” indicates a gap).

So we can extend the match by determining which of these has the highest score.


Pairwise Alignment – Global Alignment

Recursive Definition of Dynamic Programming

• Notations:

– $F(i,j)$: the accumulated score of aligning $x_1, x_2, \dots, x_i$ to $y_1, \dots, y_j$

– $s(x,y)$: the score of matching residue x to residue y, from the scoring matrix

– $\gamma(k)$: the penalty for a gap of length k

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j), \\ F(k,j) + \gamma(i-k), & k = 0, \dots, i-1, \\ F(i,k) + \gamma(j-k), & k = 0, \dots, j-1. \end{cases}$$


Pairwise Alignment – Global Alignment

Illustration of Dynamic Programming

[Figure: dynamic programming matrix with X, Y, Z along the top and U, V, W down the left side]


Pairwise Alignment – Global Alignment

Dynamic Programming: Units of Operations

| | Y1 | Y2 | Y3 | Y4 | … | Yn | total |
|---|---|---|---|---|---|---|---|
| X1 | 1 | 1 | 1 | 1 | … | 1 | n |
| X2 | 1 | 3 | 4 | 5 | … | n+1 | (n+4)(n-1)/2 + 1 = (n^2+3n-4)/2 + 1 |
| X3 | 1 | 4 | 5 | 6 | … | n+2 | (n+6)(n-1)/2 + 1 = (n^2+5n-6)/2 + 1 |
| X4 | 1 | 5 | 6 | 7 | … | n+3 | (n+8)(n-1)/2 + 1 = (n^2+7n-8)/2 + 1 |
| … | | | | | | | |
| Xn | 1 | n+1 | n+2 | n+3 | … | 2n-1 | (n+2n)(n-1)/2 + 1 = (n^2+(2n-1)n-2n)/2 + 1 |

Total: [n^2(n-1) + n(n+1)(n-1) - (n+2)(n-1)]/2 + 2n - 1 = [2n^3 - 2n^2 - 2n + 2]/2 + 2n - 1 = n^3 - n^2 + n

$O(n^3)$ units of operations


Pairwise Alignment – Global Alignment

The Needleman-Wunsch Algorithm

• The method described in the previous slides is the Needleman-Wunsch (1970) algorithm

• It computes the optimal global alignment between two sequences

• The optimality is defined in terms of a scoring scheme (a scoring matrix plus gap penalties)

• The running time is $O(n^3)$
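The cubic running time comes from the inner loops over k in the recurrence. The sketch below fills the score matrix for an arbitrary gap function; it only returns the optimal score and omits traceback. The function name, the `score` and `gap` callables, and the example scoring values are assumptions made for illustration.

```python
def nw_general_gap(x, y, score, gap):
    """Optimal global alignment score with an arbitrary gap function.

    score(a, b) is the substitution score; gap(k) is the (negative)
    score of a gap of length k. Runs in cubic time.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = gap(i)
    for j in range(1, m + 1):
        F[0][j] = gap(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = F[i - 1][j - 1] + score(x[i - 1], y[j - 1])
            for k in range(i):                    # gap of length i-k in y
                best = max(best, F[k][j] + gap(i - k))
            for k in range(j):                    # gap of length j-k in x
                best = max(best, F[i][k] + gap(j - k))
            F[i][j] = best
    return F[n][m]


match_score = lambda a, b: 1 if a == b else -1
print(nw_general_gap("AGCT", "ACGT", match_score, lambda k: -k))
```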


Pairwise Alignment – Global Alignment

Needleman-Wunsch Implementation Details

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j), \\ F(k,j) + \gamma(i-k), & k = 0, \dots, i-1, \\ F(i,k) + \gamma(j-k), & k = 0, \dots, j-1. \end{cases}$$

• At each cell of the matrix, keep track of how the maximum is arrived at

• After the entire matrix is filled, do a traceback from the bottom right corner to the top left corner

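[Figure: traceback path through a matrix labelled 0, 1, 2, …, 8, 9 along the top and A, B, C, …, I, J down the side, giving the alignment:]

01-23456789
ABCDEFG-HIJ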


Pairwise Alignment – Global Alignment

Gap Penalties

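• Above, the function of gap penalties can take any form

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j), \\ F(k,j) + \gamma(i-k), & k = 0, \dots, i-1, \\ F(i,k) + \gamma(j-k), & k = 0, \dots, j-1. \end{cases}$$

• Below, using a simple gap penalty (-d for each gap position), we can speed up the alignment algorithm

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j), \\ F(i-1,j) - d, \\ F(i,j-1) - d. \end{cases}$$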


Pairwise Alignment – Global Alignment

Illustration of Gotoh’s Algorithm

| | | X | Y | Z |
|---|---|---|---|---|
| | 0 | -d | -2d | -3d |
| U | -d | | | |
| V | -2d | | | |
| W | -3d | | | |


Pairwise Alignment – Global Alignment

Example: match 1, mismatch -1, gap -1

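| | | A | C | G | T |
|---|---|---|---|---|---|
| | 0 | -1 | -2 | -3 | -4 |
| A | -1 | 1 | 0 | -1 | -2 |
| G | -2 | 0 | 0 | 1 | 0 |
| C | -3 | -1 | 1 | 0 | 0 |
| T | -4 | -2 | 0 | 0 | 1 |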
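The matrix in this example can be reproduced with a short Python sketch of the simple gap penalty recurrence, followed by a traceback from the bottom right corner. The function name, the tie-breaking order in the traceback (diagonal, then up, then left) and the returned tuple are choices made for this example.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment with a per-position gap score; x labels rows, y columns."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Traceback: re-derive which neighbour produced each cell.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, ax, ay = needleman_wunsch("AGCT", "ACGT")
for row in F:
    print(row)
print(ax)
print(ay)
```

Other optimal alignments with the same score may exist; a different tie-breaking order would return one of those instead.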


Pairwise Alignment – Global Alignment

Gotoh’s Algorithm: Units of Operations

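• $O(n^2)$ units of operations to fill the matrix

• $O(n)$ units to trace back

| | | Y1 | Y2 | Y3 | Y4 | … | Yn | total |
|---|---|---|---|---|---|---|---|---|
| | 1 | 1 | 1 | 1 | 1 | … | 1 | n+1 |
| X1 | 1 | 3 | 3 | 3 | 3 | … | 3 | 3n+1 |
| X2 | 1 | 3 | 3 | 3 | 3 | … | 3 | 3n+1 |
| X3 | 1 | 3 | 3 | 3 | 3 | … | 3 | 3n+1 |
| X4 | 1 | 3 | 3 | 3 | 3 | … | 3 | 3n+1 |
| … | | | | | | | | |
| Xn | 1 | 3 | 3 | 3 | 3 | … | 3 | 3n+1 |

Total: 3n^2 + 2n + 1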


Pairwise Alignment – Global Alignment

Affine Gap Penalties

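• -d for gap opening

• -e for gap extension

• $\gamma(k) = -d - e(k-1)$

• Running time is still $O(n^2)$

• Described in Gotoh (1982)

• Optimal global alignment

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j), \\ R(i-1,j-1) + s(x_i, y_j), \\ C(i-1,j-1) + s(x_i, y_j). \end{cases}$$

$$R(i,j) = \max \begin{cases} F(i-1,j) - d, \\ R(i-1,j) - e. \end{cases}$$

$$C(i,j) = \max \begin{cases} F(i,j-1) - d, \\ C(i,j-1) - e. \end{cases}$$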


Pairwise Alignment – Local Alignment

Smith-Waterman

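• Running time is $O(n^2)$

• Described in Smith and Waterman (1981)

• Optimal local alignment

• Traceback is different

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j), \\ R(i-1,j-1) + s(x_i, y_j), \\ C(i-1,j-1) + s(x_i, y_j), \\ 0. \end{cases}$$

$$R(i,j) = \max \begin{cases} F(i-1,j) - d, \\ R(i-1,j) - e. \end{cases}$$

$$C(i,j) = \max \begin{cases} F(i,j-1) - d, \\ C(i,j-1) - e. \end{cases}$$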


Pairwise Alignment – Local Alignment

Global versus Local Alignments

Global:

LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA

Local:

-------TGKG--------
        |||
-------AGKG--------


Pairwise Alignment – Local Alignment

Smith-Waterman Traceback

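| | | H | E | A | G | A | W | G | H | E | E |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A | 0 | 0 | 0 | 5 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| W | 0 | 0 | 0 | 0 | 2 | 0 | 20 | 12 | 4 | 0 | 0 |
| H | 0 | 10 | 2 | 0 | 0 | 0 | 12 | 18 | 22 | 14 | 6 |
| E | 0 | 2 | 16 | 8 | 0 | 0 | 4 | 10 | 18 | 28 | 20 |
| A | 0 | 0 | 8 | 21 | 13 | 5 | 0 | 4 | 10 | 20 | 27 |
| E | 0 | 0 | 6 | 13 | 18 | 12 | 4 | 0 | 4 | 16 | 26 |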
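The difference in traceback can be sketched as follows: start from the highest-scoring cell anywhere in the matrix and stop at the first cell with score 0. For brevity this sketch uses a single per-position gap score and simple match/mismatch values rather than the affine penalties and substitution matrix behind the table above; the scoring values and function name are assumptions for illustration, so it does not reproduce that table.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-2):
    """Local alignment with a per-position gap score (illustrative scoring)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback from the best cell until a zero is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("AAACGTTT", "GGACGCC"))
```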


Pairwise Alignment – Significance of Alignment

Probability of Random Alignments

• Suppose we have a tetrahedron-shaped die whose four faces are labeled with A, C, G, T.

• Throw the die twice, and record the labels facing down.

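• Probability of getting one specific identical pair (e.g., A and A): ¼*¼.

• There are 4 possible identical pairs, so the probability of any identical pair is 4*¼*¼ = ¼.

• Probability of 6 identical pairs in a row: (1/4)^6 = 2.4E-4.

• Probability of getting a mismatch: 1 – ¼ = ¾.

• Probability of 6 mismatched pairs in a row: (3/4)^6 = 0.178.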


Pairwise Alignment – Significance of Alignment

If A, C, G, T are not of Equal Proportions

• Probability of drawing an identical pair is given by:

$$p = p_A^2 + p_C^2 + p_G^2 + p_T^2$$

where $p_x$ is the proportion of nucleotide x

• Probability of drawing a mismatch is $1 - p$
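As a small worked example, the match probability can be computed from a nucleotide composition. The function name and the skewed composition below are made up for illustration.

```python
def match_probability(freqs):
    """Probability that two independently drawn nucleotides are identical."""
    return sum(p * p for p in freqs.values())


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}   # hypothetical composition

for name, freqs in (("uniform", uniform), ("AT-rich", at_rich)):
    p = match_probability(freqs)
    print(name, p, 1 - p)
```

A skewed composition raises the chance of random matches above ¼, so chance identity is more likely in compositionally biased sequences.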


Pairwise Alignment – Significance of Alignment

Longest Run of Heads in Coin Toss

HTTHHHTHHTHHHTTTHHHHHHHTTTHHT

• Probability of head is p. We are looking at a sequence of length n.

• At a random position, the probability of seeing a run of 5 heads is $p^5$

• There are n – 4 such positions, so the expected frequency of observing such a run is $p^5 (n - 4)$. In general, $p^K (n - (K - 1))$.

• (Erdős–Rényi law, 1970) For large n, $K = \log_{1/p} n$.

• Expected length of the longest run of heads:

– If p = 0.5, after 100 tosses, the longest run is about $\log_2 100 \approx 6.64$
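The two quantities on this slide can be checked with a few lines of Python: the observed longest run in the toss sequence shown above, and the large-n estimate $\log_{1/p} n$. The function names are chosen for this example.

```python
import math


def longest_run(seq, symbol="H"):
    """Length of the longest consecutive run of `symbol` in seq."""
    best = current = 0
    for c in seq:
        current = current + 1 if c == symbol else 0
        best = max(best, current)
    return best


def expected_longest_run(n, p):
    """Large-n estimate: log base 1/p of n."""
    return math.log(n) / math.log(1 / p)


tosses = "HTTHHHTHHTHHHTTTHHHHHHHTTTHHT"
print(longest_run(tosses), expected_longest_run(len(tosses), 0.5))
print(expected_longest_run(100, 0.5))
```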


Pairwise Alignment – Significance of Alignment

M: Longest Run in Random Alignment

• Sequence lengths: m, n
• p: probability of match
• q: 1 – p
• γ: Euler's constant (the Euler–Mascheroni constant), approximately 0.577
• $E(M) \approx \log_{1/p}(mn) + \log_{1/p}(q) + \gamma \log_{1/p}(e) - \tfrac{1}{2}$, for large m, n
• If a local alignment is longer than E(M), then it is significant
• How significant?


Pairwise Alignment – Significance of Alignment

Significance of Local Alignment

• In biological experiments, after a set of values of an entity is obtained, we usually calculate the mean and variance
– Assume data follows the normal distribution
– The mean and variance are of interest
– For example, is the mean not equal to zero at the significance level of 0.05?

• This is not what we want in local alignment
– We want the significance of the highest scores
– not the mean score


Pairwise Alignment – Significance of Alignment

Distribution of Scores

• The scores of a pair of sequences are compared to those of two random sequences of the same length and composition

• The distribution of random sequence scores follows the Gumbel extreme value distribution

• Similar to the normal distribution, with a positively skewed tail
– The score must be greater than expected from a normal distribution to achieve the same level of significance


Pairwise Alignment – Significance of Alignment

Normal Distribution versus Extreme Value Distribution

[Figure: normal and extreme value probability densities plotted for x from -4 to 4, with the vertical axis from 0.0 to 0.4]

Extreme value distribution:

$$y = \exp(-x - e^{-x})$$

Normal distribution:

$$y = \frac{\exp(-x^2/2)}{\sqrt{2\pi}}$$
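The heavier right tail of the extreme value distribution can be seen numerically. The sketch below evaluates both densities from the formulas above, plus the upper tail probabilities: the Gumbel tail follows from its cumulative distribution exp(-exp(-x)), and the normal tail uses `math.erfc` from the standard library. Function names are chosen for this example.

```python
import math


def gumbel_pdf(x):
    return math.exp(-x - math.exp(-x))


def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def gumbel_tail(x):
    """P(X > x) for the standard extreme value (Gumbel) distribution."""
    return 1 - math.exp(-math.exp(-x))


def normal_tail(x):
    """P(X > x) for the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2))


for x in (0, 1, 2, 3, 4):
    print(x, gumbel_pdf(x), normal_pdf(x), gumbel_tail(x), normal_tail(x))
```

At x = 3 the extreme value tail is many times larger than the normal tail, which is why a score needs to be higher to reach the same significance level.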


Pairwise Alignment – Substitution Matrices

DNA PAM1 Matrix

• PAM1 corresponds to 1% mutations, 99% conservation.

• Assume 4 nucleotides are present at equal frequencies

• Assume all mutations from any nucleotide to any other are equally likely

• A uniform model M

A C G T

A 0.99 0.0033 0.0033 0.0033

C 0.0033 0.99 0.0033 0.0033

G 0.0033 0.0033 0.99 0.0033

T 0.0033 0.0033 0.0033 0.99


Pairwise Alignment – Substitution Matrices

Transitions and Transversions

• Purines: A and G

• Pyrimidines: C and T

• Transitions: more often
– purine to purine
– pyrimidine to pyrimidine

• Transversions: less often
– from purine to pyrimidine
– from pyrimidine to purine


Pairwise Alignment – Substitution Matrices

Another DNA PAM1 Matrix

• Assume 4 nucleotides are present at equal frequencies

• Assume transitions are 3 times more frequent than transversions

• A biased model

A C G T

A 0.99 0.002 0.006 0.002

C 0.002 0.99 0.002 0.006

G 0.006 0.002 0.99 0.002

T 0.002 0.006 0.002 0.99


Pairwise Alignment – Substitution Matrices

The Meaning of the Score of an Alignment

• Assume ACGT is aligned to CCGT
• Given a model (matrix) M
• Want: odds ratio
– Pr(A↔C) · Pr(C↔C) · Pr(G↔G) · Pr(T↔T) given the model
   • $(P_A \cdot M_{AC})(P_C \cdot M_{CC})(P_G \cdot M_{GG})(P_T \cdot M_{TT})$
– Divided by
– Pr(A↔C) · Pr(C↔C) · Pr(G↔G) · Pr(T↔T) happened by chance
   • $(P_A \cdot P_C)(P_C \cdot P_C)(P_G \cdot P_G)(P_T \cdot P_T)$

• Compute:
– Let $S_{XY} = \log_2\left(P_X M_{XY} / (P_X P_Y)\right)$
– $S = S_{AC} + S_{CC} + S_{GG} + S_{TT}$, the log odds ratio
– $2^S$ is what we want (odds ratio)


Pairwise Alignment – Substitution Matrices

From PAM1 Mutation Probability Matrix to PAM1 Log Odds Ratio Matrix

A C G T

A 0.99 0.0033 0.0033 0.0033

C 0.0033 0.99 0.0033 0.0033

G 0.0033 0.0033 0.99 0.0033

T 0.0033 0.0033 0.0033 0.99

A C G T

A 2 -6 -6 -6

C -6 2 -6 -6

G -6 -6 2 -6

T -6 -6 -6 2
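The conversion from a mutation probability matrix to a log odds matrix can be sketched in Python. With equal nucleotide frequencies, $S_{XY} = \log_2(M_{XY} / P_Y)$, rounded to the nearest integer. The helper names and the dictionary layout are assumptions made for this example; the same function applied to the biased model gives -5 for transitions and -7 for transversions.

```python
import math

NUCLEOTIDES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def pam1_matrix(transition, transversion, same=0.99):
    """Build a PAM1-style mutation probability matrix as nested dicts."""
    return {
        x: {
            y: same if x == y else (transition if (x, y) in TRANSITIONS else transversion)
            for y in NUCLEOTIDES
        }
        for x in NUCLEOTIDES
    }


def log_odds_matrix(M, freqs):
    """S_XY = log2(P_X * M_XY / (P_X * P_Y)) = log2(M_XY / P_Y), rounded."""
    return {x: {y: round(math.log2(M[x][y] / freqs[y])) for y in M[x]} for x in M}


freqs = {x: 0.25 for x in NUCLEOTIDES}
uniform = log_odds_matrix(pam1_matrix(0.0033, 0.0033), freqs)
biased = log_odds_matrix(pam1_matrix(0.006, 0.002), freqs)
for x in NUCLEOTIDES:
    print(x, [uniform[x][y] for y in NUCLEOTIDES], [biased[x][y] for y in NUCLEOTIDES])
```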


Pairwise Alignment – Substitution Matrices

From Another PAM1 Mutation Probability Matrix to PAM1 Log Odds Ratio Matrix

A C G T

A 0.99 0.002 0.006 0.002

C 0.002 0.99 0.002 0.006

G 0.006 0.002 0.99 0.002

T 0.002 0.006 0.002 0.99

A C G T

A 2 -7 -5 -7

C -7 2 -7 -5

G -5 -7 2 -7

T -7 -5 -7 2


Pairwise Alignment – Substitution Matrices

From PAM1 to PAM2

• PAM2 = PAM1 * PAM1 = (PAM1)^2

• PAM2(A→C):
– PAM1(A→A)*PAM1(A→C) + PAM1(A→C)*PAM1(C→C) + PAM1(A→G)*PAM1(G→C) + PAM1(A→T)*PAM1(T→C)

• Markov process: the probability of change from nucleotide A to nucleotide C is the same, regardless of previous changes at the site or the position of the site in the sequence
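The matrix product behind PAM2 can be written out with plain Python lists, using the uniform DNA PAM1 matrix from the earlier slide (rows and columns in the order A, C, G, T). The helper names `mat_mul` and `mat_power` are chosen for this example.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_power(M, k):
    """Raise M to a positive integer power by repeated multiplication."""
    result = M
    for _ in range(k - 1):
        result = mat_mul(result, M)
    return result


# Uniform DNA PAM1 model: 0.99 on the diagonal, 0.0033 elsewhere.
pam1 = [[0.99 if i == j else 0.0033 for j in range(4)] for i in range(4)]
pam2 = mat_power(pam1, 2)

# PAM2(A→C) is the sum of the four two-step paths from A to C.
print(pam2[0][1])
```

The same `mat_power` call with k = 250 would give a PAM250-style matrix for this toy model.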


Pairwise Alignment – Substitution Matrices

Amino Acid PAM Matrices

• Percent Accepted Mutation
• Dayhoff (1978), 1572 changes in 71 families of proteins, at least 85% similar
• For each amino acid, count 20 numbers
• For example, how many F (phenylalanine) stay the same, how many change to the other 19 amino acids
• Normalize: divide each of these 20 numbers by (sum of 20 numbers)
• PAM1: 1% probability of change


Pairwise Alignment – Substitution Matrices

The Column/Row of F in PAM1

• F to A: 0.0002
• F to R: 0.0001
• F to N: 0.0001
• F to D: 0.0000
• F to C: 0.0000
• F to Q: 0.0000
• F to E: 0.0000
• F to G: 0.0001
• F to H: 0.0002
• F to I: 0.0007
• F to L: 0.0013
• F to K: 0.0000
• F to M: 0.0001
• F to F: 0.9946
• F to P: 0.0001
• F to S: 0.0003
• F to T: 0.0001
• F to W: 0.0001
• F to Y: 0.0021
• F to V: 0.0001


Pairwise Alignment – Substitution Matrices

Compute PAM250

• PAM2 = PAM1 * PAM1 = (PAM1)^2

• PAM250 = (PAM1)^250

• Convert to log odds:
– PAM250(F→Y) = 0.15
– Divide by the frequency of F, 0.04
– 0.15/0.04 = 3.75
– log10(3.75) = 0.57
– Similarly for Y→F: log10(0.2/0.03) = 0.82

• So PAM250(F→Y) = 10*(0.57+0.82)/2 ≈ 7
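The arithmetic on this slide can be reproduced directly. The sketch below takes the four numbers quoted above as inputs; the function name and parameter names are chosen for this example.

```python
import math


def symmetric_log_odds(p_xy, f_x, p_yx, f_y, scale=10):
    """Average the two directional log10 odds and scale, as on the slide."""
    forward = math.log10(p_xy / f_x)     # e.g. F→Y: log10(0.15 / 0.04)
    backward = math.log10(p_yx / f_y)    # e.g. Y→F: log10(0.2 / 0.03)
    return forward, backward, scale * (forward + backward) / 2


forward, backward, score = symmetric_log_odds(0.15, 0.04, 0.2, 0.03)
print(round(forward, 2), round(backward, 2), round(score))
```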


Pairwise Alignment – Substitution Matrices

BLOSUM

• BLOcks of amino acid SUbstitution Matrices

• Start with highly-conserved patterns (blocks) in a large set of closely related proteins

• Use the likelihood of substitutions found in those sequences to create a substitution probability matrix

• BLOSUM-n means that the sequences used were n% alike

• BLOSUM62 is “standard”