bioinformatics (4) sequence analysis. figure na1: common & simple dna2: the last 5000...

25
Bioinformatics (4) Sequence Analysis

Upload: irvin-esterbrook

Post on 31-Mar-2015

222 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Bioinformatics (4)

Sequence Analysis

Page 2: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

figure

NA1: Common & simple

DNA2: the last 5000 generations

Sequence Similarity andHomology

Page 3: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Alignments & Scores

Global (e.g. haplotype) ACCACACA ::xx::x: ACACCATAScore= 5(+1) + 3(-1) = 2

Suffix (shotgun assembly) ACCACACA ::: ACACCATAScore= 3(+1) =3

Local (motif) ACCACACA ::::ACACCATAScore= 4(+1) = 4

Page 4: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

"Hardness" of (multi-) sequence alignment

Align 2 sequences of length N allowing gaps.

ACCAC-ACA ACCACACA ::x::x:x: :xxxxxx: AC-ACCATA , A-----CACCATA , etc.

2N gap positions, gap lengths of 0 to N each: A naïve algorithm might scale by O(N2N).For N= 3x109 this is rather large. Now, what about k>2 sequences? or rearrangements other than gaps?

Page 5: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Increasingly complex (accurate) searches

Exact (stringsearch) CGCGRegular expression (PrositeSearch) CGN{0-9}CG = CGAACG

Substitution matrix (BlastN) CGCG ~= CACG Profile matrix (PSI-blast) CGc(g/a) ~ = CACG

Gaps (Gap-Blast) CGCG ~= CGAACGDynamic Programming (NW, SM) CGCG ~= CAGACG

Page 6: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Pearson WR Protein Sci 1995 Jun;4(6):1145-60 Comparison of methods for searching protein sequence databases. Methods Enzymol 1996;266:227-58 Effective protein sequence comparison.

Algorithm: FASTA, Blastp, BlitzSubstitution matrix:PAM120, PAM250, BLOSUM50, BLOSUM62Database: PIR, SWISS-PROT, GenPept

Comparisons of homology scores

Page 7: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Scoring matrix based on large set of distantly related blocks: Blosum62

10 1 6 6 4 8 2 6 6 9 2 4 4 4 5 6 6 7 1 3 % A C D E F G H I K L M N P Q R S T V W Y8 0 -4 -2 -4 0 -4 -2 -2 -2 -2 -4 -2 -2 -2 2 0 0 -6 -4 A

18 -6 -8 -4 -6 -6 -2 -6 -2 -2 -6 -6 -6 -6 -2 -2 -2 -4 -4 C12 4 -6 -2 -2 -6 -2 -8 -6 2 -2 0 -4 0 -2 -6 -8 -6 D

10 -6 -4 0 -6 2 -6 -4 0 -2 4 0 0 -2 -4 -6 -4 E12 -6 -2 0 -6 0 0 -6 -8 -6 -6 -4 -4 -2 2 6 F

12 -4 -8 -4 -8 -6 0 -4 -4 -4 0 -4 -6 -4 -6 G16 -6 -2 -6 -4 2 -4 0 0 -2 -4 -6 -4 4 H

8 -6 4 2 -6 -6 -6 -6 -4 -2 6 -6 -2 I10 -4 -2 0 -2 2 4 0 -2 -4 -6 -4 K

8 4 -6 -6 -4 -4 -4 -2 2 -4 -2 L10 -4 -4 0 -2 -2 -2 2 -2 -2 M

12 -4 0 0 2 0 -6 -8 -4 N14 -2 -4 -2 -2 -4 -8 -6 P

10 2 0 -2 -4 -4 -2 Q10 -2 -2 -6 -6 -4 R

8 2 -4 -6 -4 S10 0 -4 -4 T

8 -6 -2 V22 4 W

14 Y

Page 8: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Scoring Functions and Alignments

• Scoring function: (match) = +1; or substitution matrix (mismatch) = -1; " (indel) = -2; (other) = 0.

• Alignment score: sum of columns.

• Optimal alignment: maximum score.

Page 9: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Calculating Alignment Scores

ATGA A-TGA(1) :xx: (2) : : : ACTA ACT-A

.111111,1)(

.112121)2(

.01111)1(

Score indel if

Score

Score

A

AG

T

T

CA

A

A

A

T

G

C

T

A

A

Page 10: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

What is dynamic programming?

A dynamic programming algorithm solves every subsubproblems just once and then saves its answer in a table, avoiding the work of recomputing the answer every time the subsubproblem is encountered.

-- Cormen et al. "Introduction to Algorithms", The MIT Press.

Page 11: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Pairwise sequence alignment by the dynamic programming algorithm. The algorithm involves finding the optimal path in the

path matrix. (a), which is equivalent to searching the optimal solution in the search tree (b).

(a) Path Matrix (b) Search Tree

A I M S

A

M

O

S

Alignment AIM-S A-MOS Pruning by an optimization function

X X

. . . . .

. . . . . . . . .

Page 12: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Methods for computing the optimal score in the dynamic programming algorithm (a ) the gap penalty is a constant.

(b) the gap penalty is a linear function of the gap length.

(a) (b)Di, j-l

d

Di-1, j

Di-1, j-1 Di-1, j-1

Di-1, j

Di, j-l

d

ws(i), t(j)

Di,j

Di, j(2)b

ws(i), t(j)

Di,j(1)

Di,j(3)

b

Page 13: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Recursion of Optimal Global Alignments

.

;

;

max

A

As

A

As

A

As

A

As

vuv

us

ACT

ATG

ACT

ATG

ACT

ATG

ACT

ATG

. and of scorealignment global optimal :

Page 14: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Concepts of global and local optimality in the pairwise sequence alignment. The distinction is made as to how the

initial values are assigned to the path matrix.

(a) Global vs. Global (b) Local vs. Global

0 0 0 . . . . . . 0

.

.

.

.

0

0 0 . . . . . . 0

.

.

.

.

0

0 0 . . . . . . 0

X

(c) Local vs. Local

Page 15: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Recursion of Optimal Local Alignments

.

. and of scorealignment local optimal :

0

;...

...

;...

...

;...

...

max...

...

21

121

121

121

121

21

21

21

i

j

i

j

i

j

i

jj

i

j

i

u

vvv

uuus

v

u

vvv

uuus

vvvv

uuus

vvv

uuus

vuv

us

Page 16: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

The dynamic programming algorithm can be applied to limited areas, rather than to the entire matrix, after rapidly searching the

diagonals that contain candidate markers.

n1

mm

n +m -1

j

11

i

l

l

Page 17: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Computing Row-by-RowA C T A

0 min min min minA min 1 -1 -3 -5T minG min A min

A C T A0 min min min min

A min 1 -1 -3 -5T min -1 0 0 -2G min A min

A C T A0 min min min min

A min 1 -1 -3 -5T min -1 0 0 -2G min -3 -2 -1 -1A min

A C T A0 min min min min

A min 1 -1 -3 -5T min -1 0 0 -2G min -3 -2 -1 -1A min -5 -4 -3 0

Page 18: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Traceback Optimal Global Alignment

A C T A0 min min min min

A min 1 -1 -3 -5T min -1 0 0 -2G min -3 -2 -1 -1A min -5 -4 -3 0

ACTA

ATGA::

Page 19: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Local and Global Alignments

A C C A C A C A0 m m m m m m m m

A m 1 -1 -3 -5 -7 -9 -11 -13

C m -1 2 0 -2 -4 -6 -8 -10

A m -3 0 1 1 -1 -3 -5 -7

C m -5 -2 1 0 2 0 -2 -4

C m -7 -4 -1 0 1 1 1 -1

A m -9 -6 -3 0 -1 2 0 2

T m -11 -8 -5 -2 -1 0 1 0

A m -13 -10 -7 -4 -1 0 -1 2

A C C A C A C A0 0 0 0 0 0 0 0 0

A 0 1 0 0 1 0 1 0 1

C 0 0 2 1 0 2 0 2 0

A 0 1 0 1 2 0 3 1 3

C 0 0 2 1 0 3 1 4 2

C 0 0 0 3 2 1 2 2 3

A 0 1 0 1 4 2 2 1 3

T 0 0 0 0 2 3 1 1 1

A 0 1 0 0 1 1 4 2 2

Page 20: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Time and Space Complexity of Computing Alignments

For two sequences u=u1u2…un and v=v1v2…vm, finding theoptimal alignment takes O(mn) time and O(mn) space.

An O(1)-time operation: one comparison, threemultiplication steps, computing an entry in the alignmenttable…

An O(1)-space memory: one byte, a data structure of twofloating points, an entry in the alignment table…

Page 21: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

The order of computing matrix elements in the path matrix, which is suitable for (a) sequential processing and (b) parallel processing.

(I, j -1)

(i, j)

(i +1, j-1)

(i +1, j )

(i -1, j -1)

(i -1, j )

(a)

(i, j -2)

(i, j -1)

(i, j)

(i+1, j -2)

(i +1, j -1)(i -1, j -1)

(i -1, j )

(b)

Page 22: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Time and Space Problems

• Comparing two one-megabase genomes.

• Space:– An entry: 4 bytes;– Table: 4 * 10^6 * 10^6 = 4 G bytes memory.

• Time:– 1000 MHz CPU: 1M entries/second;– 10^12 entries: 1M seconds = 10 days.

Page 23: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

A Multiple Alignment of Immunoglobulins

VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG--

Page 24: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

A multiple alignment <=> Dynamic programming on a hyperlattice

From G. Fullen, 1996.

Page 25: Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology

Computing a Node on Hyperlattice

NANSNV

s

NA-N

NVs

V

S

A

NANS-N

s

-NNS-N

s

-N-N

NVs

NNN

s

AS-

A--

-NNSNV

-S-

NA-N

NV--V

NANS-N

-SV

NA-N-N

A-V

-NNS-N

AS-

-N-N

NVASV

NNN

NANSNV

s

s

s

s

s

s

s

s max