bioinformatics (4) sequence analysis. figure na1: common & simple dna2: the last 5000...

Bioinformatics (4)

Sequence Analysis

figure

NA1: Common & simple

DNA2: the last 5000 generations

Sequence Similarity andHomology

Alignments & Scores

Global (e.g. haplotype) ACCACACA ::xx::x: ACACCATAScore= 5(+1) + 3(-1) = 2

Suffix (shotgun assembly) ACCACACA ::: ACACCATAScore= 3(+1) =3

Local (motif) ACCACACA ::::ACACCATAScore= 4(+1) = 4

"Hardness" of (multi-) sequence alignment

Align 2 sequences of length N allowing gaps.

ACCAC-ACA ACCACACA ::x::x:x: :xxxxxx: AC-ACCATA , A-----CACCATA , etc.

2N gap positions, gap lengths of 0 to N each: A naïve algorithm might scale by O(N2N).For N= 3x109 this is rather large. Now, what about k>2 sequences? or rearrangements other than gaps?

Increasingly complex (accurate) searches

Exact (stringsearch) CGCGRegular expression (PrositeSearch) CGN{0-9}CG = CGAACG

Substitution matrix (BlastN) CGCG ~= CACG Profile matrix (PSI-blast) CGc(g/a) ~ = CACG

Gaps (Gap-Blast) CGCG ~= CGAACGDynamic Programming (NW, SM) CGCG ~= CAGACG

Pearson WR Protein Sci 1995 Jun;4(6):1145-60 Comparison of methods for searching protein sequence databases. Methods Enzymol 1996;266:227-58 Effective protein sequence comparison.

Algorithm: FASTA, Blastp, BlitzSubstitution matrix:PAM120, PAM250, BLOSUM50, BLOSUM62Database: PIR, SWISS-PROT, GenPept

Comparisons of homology scores

Scoring matrix based on large set of distantly related blocks: Blosum62

10 1 6 6 4 8 2 6 6 9 2 4 4 4 5 6 6 7 1 3 % A C D E F G H I K L M N P Q R S T V W Y8 0 -4 -2 -4 0 -4 -2 -2 -2 -2 -4 -2 -2 -2 2 0 0 -6 -4 A

18 -6 -8 -4 -6 -6 -2 -6 -2 -2 -6 -6 -6 -6 -2 -2 -2 -4 -4 C12 4 -6 -2 -2 -6 -2 -8 -6 2 -2 0 -4 0 -2 -6 -8 -6 D

10 -6 -4 0 -6 2 -6 -4 0 -2 4 0 0 -2 -4 -6 -4 E12 -6 -2 0 -6 0 0 -6 -8 -6 -6 -4 -4 -2 2 6 F

12 -4 -8 -4 -8 -6 0 -4 -4 -4 0 -4 -6 -4 -6 G16 -6 -2 -6 -4 2 -4 0 0 -2 -4 -6 -4 4 H

8 -6 4 2 -6 -6 -6 -6 -4 -2 6 -6 -2 I10 -4 -2 0 -2 2 4 0 -2 -4 -6 -4 K

8 4 -6 -6 -4 -4 -4 -2 2 -4 -2 L10 -4 -4 0 -2 -2 -2 2 -2 -2 M

12 -4 0 0 2 0 -6 -8 -4 N14 -2 -4 -2 -2 -4 -8 -6 P

10 2 0 -2 -4 -4 -2 Q10 -2 -2 -6 -6 -4 R

8 2 -4 -6 -4 S10 0 -4 -4 T

8 -6 -2 V22 4 W

Scoring Functions and Alignments

• Scoring function: (match) = +1; or substitution matrix (mismatch) = -1; " (indel) = -2; (other) = 0.

• Alignment score: sum of columns.

• Optimal alignment: maximum score.

Calculating Alignment Scores

ATGA A-TGA(1) :xx: (2) : : : ACTA ACT-A

.111111,1)(

.112121)2(

.01111)1(

Score indel if

What is dynamic programming?

A dynamic programming algorithm solves every subsubproblems just once and then saves its answer in a table, avoiding the work of recomputing the answer every time the subsubproblem is encountered.

-- Cormen et al. "Introduction to Algorithms", The MIT Press.

Pairwise sequence alignment by the dynamic programming algorithm. The algorithm involves finding the optimal path in the

path matrix. (a), which is equivalent to searching the optimal solution in the search tree (b).

(a) Path Matrix (b) Search Tree

A I M S

Alignment AIM-S A-MOS Pruning by an optimization function

. . . . .

. . . . . . . . .

Methods for computing the optimal score in the dynamic programming algorithm (a ) the gap penalty is a constant.

(b) the gap penalty is a linear function of the gap length.

(a) (b)Di, j-l

Di-1, j

Di-1, j-1 Di-1, j-1

Di-1, j

Di, j-l

ws(i), t(j)

Di, j(2)b

ws(i), t(j)

Di,j(1)

Di,j(3)

Recursion of Optimal Global Alignments

. and of scorealignment global optimal :

Concepts of global and local optimality in the pairwise sequence alignment. The distinction is made as to how the

initial values are assigned to the path matrix.

(a) Global vs. Global (b) Local vs. Global

0 0 0 . . . . . . 0

0 0 . . . . . . 0

(c) Local vs. Local

Recursion of Optimal Local Alignments

. and of scorealignment local optimal :

max...

The dynamic programming algorithm can be applied to limited areas, rather than to the entire matrix, after rapidly searching the

diagonals that contain candidate markers.

n +m -1

Computing Row-by-RowA C T A

0 min min min minA min 1 -1 -3 -5T minG min A min

A C T A0 min min min min

A min 1 -1 -3 -5T min -1 0 0 -2G min A min

A min 1 -1 -3 -5T min -1 0 0 -2G min -3 -2 -1 -1A min

A min 1 -1 -3 -5T min -1 0 0 -2G min -3 -2 -1 -1A min -5 -4 -3 0

Traceback Optimal Global Alignment

A min 1 -1 -3 -5T min -1 0 0 -2G min -3 -2 -1 -1A min -5 -4 -3 0

ATGA::

Local and Global Alignments

A C C A C A C A0 m m m m m m m m

A m 1 -1 -3 -5 -7 -9 -11 -13

C m -1 2 0 -2 -4 -6 -8 -10

A m -3 0 1 1 -1 -3 -5 -7

C m -5 -2 1 0 2 0 -2 -4

C m -7 -4 -1 0 1 1 1 -1

A m -9 -6 -3 0 -1 2 0 2

T m -11 -8 -5 -2 -1 0 1 0

A m -13 -10 -7 -4 -1 0 -1 2

A C C A C A C A0 0 0 0 0 0 0 0 0

A 0 1 0 0 1 0 1 0 1

C 0 0 2 1 0 2 0 2 0

A 0 1 0 1 2 0 3 1 3

C 0 0 2 1 0 3 1 4 2

C 0 0 0 3 2 1 2 2 3

A 0 1 0 1 4 2 2 1 3

T 0 0 0 0 2 3 1 1 1

A 0 1 0 0 1 1 4 2 2

Time and Space Complexity of Computing Alignments

For two sequences u=u1u2…un and v=v1v2…vm, finding theoptimal alignment takes O(mn) time and O(mn) space.

An O(1)-time operation: one comparison, threemultiplication steps, computing an entry in the alignmenttable…

An O(1)-space memory: one byte, a data structure of twofloating points, an entry in the alignment table…

The order of computing matrix elements in the path matrix, which is suitable for (a) sequential processing and (b) parallel processing.

(I, j -1)

(i, j)

(i +1, j-1)

(i +1, j )

(i -1, j -1)

(i -1, j )

(i, j -2)

(i, j -1)

(i, j)

(i+1, j -2)

(i +1, j -1)(i -1, j -1)

(i -1, j )

Time and Space Problems

• Comparing two one-megabase genomes.

• Space:– An entry: 4 bytes;– Table: 4 * 10^6 * 10^6 = 4 G bytes memory.

• Time:– 1000 MHz CPU: 1M entries/second;– 10^12 entries: 1M seconds = 10 days.

A Multiple Alignment of Immunoglobulins

VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-- ATLVCLISDFYPGA--VTVAWKADS-- AALGCLVKDYFPEP--VTVSWNSG--- VSLTCLVKGFYPSD--IAVEWESNG--

A multiple alignment <=> Dynamic programming on a hyperlattice

From G. Fullen, 1996.

Computing a Node on Hyperlattice

NANSNV

NANS-N

-NNS-N

-NNSNV

NANS-N

NA-N-N

-NNS-N

NANSNV

bioinformatics (4) sequence analysis. figure na1: common & simple dna2: the last 5000...

homology slide

blosum62 slide

sequence analysis slide

cagacg slide

cacg gaps gapblast cgcg

protein sequence databases

n gap positions

sequences of length

Documents

ultra-slim body picking sensor na1-pk5 series na1-5...

sequence sequence segment real sequence tonal sequence...

na1 na2 na3 na4 那

na1 l in t standards tech ii r.i.c. i...rtrttiuruitoojilhu...

1 dna2: last week's take home lessons zcomparing types of...

nucleolytic processing of aberrant replication ... et... ·...

sequence models scenarios scenarios sequence diagram...

{[na1([mu]-h2o)na2]2[(c2o4)2cr([mu]-oh)2cr(c2o4)2].h2o}n

study abroad na1

release 0.72.1 nick catalano, community contributors...

balticgrid-ii project and na1

na1 adjectives...

problem salesman the - computer action...

biochemical analysis of human dna2 · 2015. 7. 29. ·...

increased expression of dna2 was linked to poor prognosis

convolutional sequence to sequence learning

na1) - national university of ireland,...

blm–dna2–rpa–mrn and exo1–blm–rpa–mrn constitute...

milk na1 info

significance of the dissociation of dna2 by flap...