lecture 1: introduction - university of california, riversidejiang/cs215/compbiol-dragonstar.pdf ·...

Lecture 1: Introduction

Agenda:

• Welcome to CS 238 —Algorithmic Techniques in Computational Biology

• Official course information

• Course description

• Announcements

• Basic concepts in molecular biology

1


Official course information

• Grading weights:

– 50% assignments (3-4)

– 50% participation, presentation, and course report.

• Text and reference books:

– T. Jiang, Y. Xu and M. Zhang (co-eds), Current Topics inComputational Biology, MIT Press, 2002. (co-publishedby Tsinghua University Press in China)

– D. Gusfield, Algorithms for Strings, Trees, and Sequences:Computer Science and Computational Biology, CambridgePress, 1997.

– D. Krane and M. Raymer, Fundamental Concepts of Bioin-formatics, Benjamin Cummings, 2003.

– P. Pevzner, Computational Molecular Biology: An Algo-rithmic Approach, 2000, the MIT Press.

– M. Waterman, Introduction to Computational Biology:Maps, Sequences and Genomes, Chapman and Hall, 1995.

– These notes (typeset by Guohui Lin and Tao Jiang).

• Instructor: Tao Jiang

– Surge Building 330, x82991, [email protected]

– Office hours: Tuesday & Thursday 3-4pm

2


Course overview

• Topics covered:

– Biological background introduction

– My research topics

– Sequence homology search and comparison

– Sequence structural comparison

– string matching algorithms and suffix tree

– Genome rearrangement

– Protein structure and function prediction

– Phylogenetic reconstruction

* DNA sequencing, sequence assembly

* Physical/restriction mapping

* Prediction of regulatiory elements

• Announcements:

– Assignment #1 available soon.

– Suggestions for topics of class presentation will be pro-vided in a few weeks.

– Finalize your presentation topic before mid quarter.

3


The Secrets of Life (also in Krane and Raymer, 2003)

(A mathematician’s introduction to molecular biology)

• Basic concepts: DNA, RNA, enzymes

• DNA (deoxyribonucleic acid)

– discovered by James Watson and Francis Crick, 40years ago

– DNA discovery =⇒ modern genetic engineering

– at breathtaking speed

• astonishing scientific & practical implications

1. DNA sequence → the evolutionary record of life

2. genes for human insulin→ bacteria → protein (inexpensive large amount & pure)

3. farm animal & corps→ healthier & more desirable products

4. diagnosis for viral disease (e.g. AIDS)sensitive & reliable

• molecular biology: the character of the field is changingexperimental science −→ theoretical sciencedatabases of DNA and protein sequences

• the mathematical sciences — mathematics, statistics, andcomputational science are playing an important role

4


The Composition of a Living Organism

• Protein, Nucleic Acid, Lipid, Carbohydrate

• The human makes 100,000 proteins

– enzymes — digestion of food

– structural molecular — hair, skin

– transporter of substances — hemoglobin

– transporter of information — receptor in surfaceof cells

• Protein do the work of the cell!a linear chain of amino acids (20 types)

5


Classical Genetics

Geneticists study mutant organisms that differ in one component.

Gregor Mendel ’s experiments on pure breeding peas — 1865

Trait (phenotype): round © or wrinkled⊗

F0 pure breeding © or⊗

cross to generate F1 © (100%)cross F1 with

⊗© (50%)

⊗(50%)

cross F1 with F1 © (75%)⊗

(25%)

genotype:

F0 © AA⊗

aaF1 © Aa (100%)F2 © AA (25%) © Aa (50%)

⊗aa (25%)

Mendel Hypothesis (gene)

• each organism inherits two copies of a gene

• genes occur in alternative forms — alleles, with differentbiological functions

• pure breeding plants carried two copies of identical alleles(AA, aa) and are called homozygotes.the F1 plants have Aa and are heterozygotes.

• roundness (A) dominates over wrinkledness (a)

• inheritance is random.

6


Biochemistry

• Role of biochemistry:

molecular biology

biochemistry (protein)

genetics (gene)

function1

PPPPPPPq

@@@R

:

• What is biochemistry?Goal: Purifying and characterizing the chemicalcomponents responsible for carrying out a partic-ular function.

• What is a biochemist doing?Devises an array for measuring an ”activity” andthen tries successive fractionation procedures toisolate a pure fraction having the activity.

7


Molecular Biology

Scientists realized that Genes encode enzymes / pro-teins.

What are genes made of? — DNA

Watson & Crick, 1953

• Double helix

• Nucleotides: Adenine, Guanine, Cytosine, Thymine

• Base pairs: A–T, C–G

• Replication of DNA: the two strands unwind andeach serves as a template

• DNA polymerase: carries out replication

8


DNA (Deoxyribonucleic Acid)

The DNA sequence is . . .GTGAT. . .

It may encode the message black hair.

9


RNA (Ribonucleic Acid)

The RNA sequence is GCGGAUUGCUCC. . .

10


Protein

The protein sequence is KVLLAAALIAGSVFFLLLPG. . .

The elements are called amino acids.

11


Synthesizing a Protein (in prokaryote)

• A gene is expressed with the transcription of theDNA sequence into a messenger RNA (mRNA).

This is carried out by RNA polymerase.

An RNA is made of A, U, C, G, and is single stranded.

• Each triple of A/U/C/G encodes an amino acid,called a codon.

• The mRNA is then translated into a protein by theribosome.

5′′ 3′′UCCUACUUGUUCCGA

12


Pattern Matching in Nature

Amino acid

DNA codons

AA AA AA AA AA AA AA AA AA AA

AA AA AA AA AA AA AA AA AA AAA

gccgctgcggca

R

cgccgtcggcgaaggaga

D

gacgat

N

aacaat

C

tgctgt

E

gaggaa

Q

cagcaa

G

ggcggtggggga

H

caccat

I

atcattata

@@

@@stop

tgatagtaa

AA AA AA AA AA AA AA AA AA AA

AA AA AA AA AA AA AA AA AA AAL

ctccttctgctattgtta

K

aagaaa

AA

AAstart&%'$

M

atg

F

ttcttt

P

ccccctccgcca

S

tcctcttcgtcaagcagt

T

accactacgaca

W

tgg

Y

tactat

V

gtcgttgtggta

5′′ 3′′

u c g a u g c a u u a u g c a a u c a a a g u c u u u g g c u a a u c a? ?

NH2-M- H- Y- A- I- K- V- F- G- COOH

13


Areas of Computational Biology

1. DNA sequencing — determine DNA sequences

• shotgun sequencing

• sequencing by hybridization

• sequencing by mass spectrometry (for proteinfragments)

2. Physical mapping — low resolution information aboutgenome

• STS content mapping

• restriction mapping

• radiation hybridization mapping

• optical mapping

• fluorescent in situ hybridization (FISH)

3. Sequence analysis — extract functional/evolutionaryinformation from DNA/protein sequences

• sequence comparison

• gene recognition

• regulatory signal analysis

• sequence databases

14


Areas of Computational Biology – continued

4 Comparative genomics, phylogenetic analysis — re-covering history of evolution

5 Protein folding, structural biology — determine sec-ondary and tertiary structures of protein sequences

• crystallography

• NMR data

• energy minimization

• protein threading

6 Genetic linkage — relative locations of genes

• SNP analaysis

• haplotyping

7 Gene expression arrays and DNA microarrays —functional genomics

• How do genes relate to functions?

• How do genes relate to diseases?

• When are genes expressed and co-expressed?

15


Areas of Computational Biology – continued

8 Gene networks, metabolism pathways

• How do genes relate to / control each other?

9 Genetic drug discovery, combinatorial chemistry

• Identify drug targets from molecular information(sequence, 3D structure, etc.)

16


Sequence Assembly

17


Physical mapping - a restriction (MCD) map

18


Sequence Alignment

19


Gene regulation

20


Phylogeny (evolutionary tree) reconstruction

Sumatran Orangutan

Orangutan

Chimpanzee

Pygmy Chimpanzee

Human

Gorilla

Gibbon

Cow

Blue Whale

Finback Whale

Horse

White Rhine

Grey Seal

Barber Seal

Cat

House Mouse

Ratrodents: a branch

ferungulates

primates

extinct species

• what data? (mitochondrial genomes)

• which method?

21


gap p24 (2390bp) and pol (705bp) nucleotide sequences combined.

FIV14SIVsykSIVmndgb1SIVstmSIVmneSIVmm239SIVsmmh4SIVsmm9HIV2rodHIV2stHIV2nihzHIV2d194HIV2benHIV2uc1HIV2d205SIVagm677SIVagm155SIVagm3SIVagmtyoHIVmvp5180HIV1ant70HIV1malHIV1rfHIV1laiHIV1jrcsfHIV1ndkHIV1eliSIVcpz

2

3

57

38

41

10

13

7

7

3837

66

20

82

sB

77

167

69

17

113

sA

107

49

20

102

18

25

?

A phylogeny of immunodeficiency viruses [D. Mindell, 1994]

• organism / species: DNA / RNA / protein sequence

• distance between sequences: estimate the time of evolutionon a molecular clock

• objective function: minimize the distance

22


Protein Structure

This is the ras protein. Knowing the 3-D structure of this important

molecular switch governing cell growth may enable interventions to

shut off this switch in cancer cells.

23


Inference of haplotypes from genotypes

Haplotype information is useful in fine gene mapping, association

studies, etc.

24


Gene Expression Array

A 9000 gene sequence verified mouse microarray.

25


Gene network

26


Gene network

27


Functional genomics

Courtesy U.S. Department of Energy Genomes to Life program:

DOEGenomesToLife.org

28


Drug Discovery

The cellular targets (or receptors) of many drugs used for medical

treatment are proteins. By binding to the receptor, drugs either

enhance or inhibit its activity.

29


Drug Discovery

30

Research Projects in Jiang’s Lab

1. Efficient haplotyping on a pedigree

• polynomial-time algorithm for 0-recombinant data• missing alleles imputation using LD information

2. Dynamics of gene evolution on plant genomes

• comparison of 12 chloroplast genomes• analysis of Myb families on Arabidopsis and ricegenomes

3. Search for TFBS in the human genome

• an “optimized” markov chain algorithm• prediction + in vitro validation doubled knownHNF4α binding sequences

4. Classification of microbial communities by oligonu-cleotide fingerprinting on DNA arrays

• design of probe sets• a discrete approach to clustering

5. Sequence annotation and comparison

• comparison of RNA structures• prediction of operons in Synechococcus Sp.

6. Analysis of barley EST database

• design of popular and unique oligo probes

7. automatic NMR spectral peak assignment

31

Lecture 4: Pairwise Sequence Alignment

Agenda:

• Dynamic programming concept, review

• Edit distance between two sequences

• Global alignment

• Optimal alignment in linear space

Reading:

• Ref[JXZ, 2002]: Pages 45 – 69

• Ref[Gusfield, 1997]: Pages 209 – 258

• Ref[CLRS, 2001]: Pages 331 – 356

32


Why sequence comparison:

• Genes diverging from a common ancestor are homologous.

• Homologous genes perform the same or similar functions.

• Homologous gene sequences are similar.

– On average, human and mouse homologous gene se-quences are 85% similar.

• Similarity:

– How to define it

– How to compute it

– twin definition — distance.

33


Matrix-chain multiplication (from CRLS):

• Input: matrices A1, A2, . . ., An with dimensions d0×d1, d1×d2,. . ., dn−1 × dn, respectively.

• Output: an order in which matrices should be multiplied suchthat the product A1 × A2 × . . . × An is computed using theminimum number of scalar multiplications.

• Example: n = 4 and (d0, d1, . . . , dn) = (5,2,6,4,3)

• Possible orders with different number of scalar multiplications:

((A1 ×A2)×A3)×A4 5× 2× 6 + 5× 6× 4 + 5× 4× 3 = 240(A1 × (A2 ×A3))×A4 5× 2× 4 + 2× 6× 4 + 5× 4× 3 = 148(A1 ×A2)× (A3 ×A4) 5× 2× 6 + 5× 6× 3 + 6× 4× 3 = 222A1 × ((A2 ×A3)×A4) 5× 2× 3 + 2× 6× 4 + 2× 4× 3 = 102A1 × (A2 × (A3 ×A4)) 5× 2× 3 + 2× 6× 3 + 6× 4× 3 = 138

• Number of orders in Ω((4− ε)n) !!!

– Cannot afford exhaustive enumeration.

• Try recursion?

– M(i, j) — the minimum number of scalar multiplicationsneeded to compute product Ai ×Ai+1 × . . .×Aj (i ≤ j)

–

M(i, j) =

0, if i = jmini≤k<jM(i, k) + M(k + 1, j) + di−1dkdj, if i < j

– M(1, n) ∈ Ω(3n)

34


Matrix-chain multiplication (from CS 218):

• Observations:

– A lot of re-computation in the recursion approach!

Avoid them!!!

– Carefully define the order of computation for M(i, j),store them somewhere.

– When needed, get the value, rather than re-computation.

• Build a table of dimension n× n to store M(i, j)

• Order of computation: increasing (j − i)

• This is Dynamic Programming

• Essence of dynamic programming

– Optimal substructures

– Overlapping subproblems

– Number of subproblems/substructures is polynomial !!!

– Given substructures, assembling an optimal structure canbe done in polynomial time !!!

35


Defining distance — String edit:

Problem: Given two strings S1 and S2, edit S1 into S2 with theminimum number of edit operations:

• Replace a letter a with b

• Insert a letter a

• Delete a letter b

Example:

S1 = contextfreeS2 = test-file

context freetest-file

edit cost: 111001010110 = 7

The minimum number of operations is called the edit distancebetween S1 and S2.

Theorem: String edit can be done in O(n1n2) time.

Examine the last edit operation

what about the first edit operation

36


Edit distance by Dynamic Programming:

• Satisfying DP requirements?

– Computing edit distance between S1[1..n1] and S2[1..n2]reduces to

– computing edit distance between S1[1..n1−1] and S2[1..n2]

– computing edit distance between S1[1..n1] and S2[1..n2 −1], and

– computing edit distance between S1[1..n1−1] and S2[1..n2−1]

• Define a generalized form of the problem with a few param-eters:

D[i, j] — edit distance between a1a2 . . . ai and b1b2 . . . bj, 0 ≤i ≤ n1 and 0 ≤ j ≤ n2.

• Derive a recurrence relation:

Consider the last (rightmost) edit operation for a1a2 . . . ai →b1b2 . . . bj:

– insertion

– deletion

– match or replace

D[i, j] = min

D[i− 1, j] + 1D[i, j − 1] + 1

D[i− 1, j − 1] +

0 if ai = bj1 otherwise

37


Edit distance by Dynamic Programming (cont’d):

• Need to order the entries D[i, j] to be computed.

• Need to initialize some entries to start the computation:

D[i,0] = i, D[0, j] = j

• Use a table to store the entry values.

– S1 = a1a2 . . . an1

– S2 = b1b2 . . . bn2

– DP table: D[i, j] — records the edit distance betweena1a2 . . . ai and b1b2 . . . bj, for any pair 0 ≤ i ≤ n1 and 0 ≤j ≤ n2.

• The recurrence relation for D:

D[i, j] = min

D[i− 1, j] + 1D[i, j − 1] + 1

D[i− 1, j − 1] +


• An example,

f r e e0 1 2 3 4

f 1i 2l 3e 4

38





D[i,0] = i, D[0, j] = j


– S1 = a1a2 . . . an1

– S2 = b1b2 . . . bn2



D[i, j] = min

D[i− 1, j] + 1D[i, j − 1] + 1

D[i− 1, j − 1] +


• An example,

f r e e0 1 2 3 4

f 1 0i 2l 3e 4

39





D[i,0] = i, D[0, j] = j


– S1 = a1a2 . . . an1

– S2 = b1b2 . . . bn2



D[i, j] = min

D[i− 1, j] + 1D[i, j − 1] + 1

D[i− 1, j − 1] +


• An example,

f r e e0 1 2 3 4

f 1 0 1 2 3i 2l 3e 4

40





D[i,0] = i, D[0, j] = j


– S1 = a1a2 . . . an1

– S2 = b1b2 . . . bn2



D[i, j] = min

D[i− 1, j] + 1D[i, j − 1] + 1

D[i− 1, j − 1] +


• An example,

f r e e0 1 2 3 4

f 1 0 1 2 3i 2 1 1 2 3l 3 2 2 2 3e 4 3 3 2 2

41





D[i,0] = i, D[0, j] = j


– S1 = a1a2 . . . an1

– S2 = b1b2 . . . bn2



D[i, j] = min

D[i− 1, j] + 1D[i, j − 1] + 1

D[i− 1, j − 1] +


• An example,

f r e e0 1 2 3 4

f 1 0 1 2 3i 2 1 1 2 3l 3 2 2 2 3e 4 3 3 2 2

42





D[i,0] = i, D[0, j] = j


– S1 = a1a2 . . . an1

– S2 = b1b2 . . . bn2



D[i, j] = min

D[i− 1, j] + 1D[i, j − 1] + 1

D[i− 1, j − 1] +


• An example,

f r e e0 1 2 3 4

f 1 0 1 2 3i 2 1 1 2 3l 3 2 2 2 3e 4 3 3 2 2

Complexity O(n1n2) time and space.

43


Pseudocode for Algorithm Edit(a1a2 . . . an1, b1b2 . . . bn2)

int D[0..n1,0..n2]int P [1..n1,1..n2]

for i← 0 to n1 doD[i,0]← i

for j ← 0 to n2 doD[0, j]← j

for i← 1 to n1 dofor j ← 1 to n2 do

if ai = bj thend← 0

elsed← 1

D[i, j]← D[i− 1, j] + 1P [i, j]← (0,−1)if D[i− 1, j] < D[i, j − 1] then

D[i, j]← D[i− 1, j] + 1P [i, j]← (−1,0)

if D[i− 1, j − 1] + d < D[i, j] thenD[i, j]← D[i− 1, j − 1] + dP [i, j]← (−1,−1)

Notes:

• Edit distance only: O(n) space

• Backtracing the path from entry [n1, n2] to entry [0,0] in arrayP to find an optimal series of edit operations.

• Next: reducing time & space complexity

44


Speeding up string edit by preprocessing:

Assumptions:

• Sequences S1 = a1a2 . . . an1 and S2 = b1b2 . . . bn2 are froma fixed alphabet Σ with |Σ| = k.

• n1 ≤ n2.

• Let ` = 12logk n2 − 1.

Preprocessing: The Idea

• Compute an (` + 1) × (` + 1) matrix Di,j for each pairaiai+1 . . . ai+` and bjbj+1 . . . bj+`.

• Di,j can be computed in O((` + 1)2) = O(log2k n2) time.

• How many distinct Di,j matrices?

Since |Σ| = k, there are (k`+1)2 = n2 of them.

Conclusion:

• Setting ` = 12log3k n2 − 1.

• Preprocessing each pair on two vectors over −1,0,1:O((k2(`+1) log2

k n2)× 32(`+1)) = O(n2 log2 n2).

• For each of the (n1n2

`2) entries, filling in 2(` + 1) values

from the corresponding Di,j matrix:

O( n1n2

logn2).

• Preprocessing is minor and thus in total O( n1n2

logn2) time.

• Two keys:

– How many Di,j matrices are there?

– Why preprocessing on those vectors only?

45


Linear space algorithm for string edit:

Ideas: Edit distance takes O(n) space.

(Recall: optimal substructure?)

Edit(a1a2 . . . an1, b1b2 . . . bn2) =

Edit(a1a2 . . . an12, b1b2 . . . bj)+Edit(an1

2+1an1

2+2 . . . an1, bj+1bj+2 . . . bn2),

for some j.

an1

. . .an1

2

. . .a1

b1 . . . bj . . . bn2

t

Recursively solve for (a1a2 · · · an12, b1b2 · · · bj) and (an1

2+1 · · · an1−1an1, bj+1 · · · bn2−1bn2).

Determining j:

• Compute Edit(a1a2 . . . an12, b1b2 . . . bj) for all j;

• Compute Edit(an1an1−1 . . . an12+1, bn2bn2−1 . . . bj+1) for all j;

• Find the j that minimizes

Edit(a1a2 . . . an12, b1b2 . . . bj)+Edit(an1an1−1 . . . an1

2+1, bn2bn2−1 . . . bj+1).

46


Linear space algorithm for string edit (cont’d):

Conclusion:

• Divide-and-conquer;

• Compute Edit(a1a2 . . . an12, b1b2 . . . bn2);

Compute Edit(an1an1−1 . . . an12+1, bn2bn2−1 . . . b1);

• Linear space for storing all necessary entries;

• Space complexity:

S(n2) = max2n2, S(j) + S(n2 − j) ∈ O(n2).

• Time complexity:

T0(n1, n2) = n1n2 — normal running time

T (n1, n2) = 2× T0(n1

2, n2) + T (n1

2, j) + T (n1

2, n2 − j)

= n1n2 + 2× (T ′(n1

4, j) + T ′(n1

4, n2 − j)) + T (n1

4, . . .

= n1n2 + n1

2n2 + n1

4n2 + . . .

≤ 2n1n2∈ O(n1n2).

• Alignment construction?

concatenating.

47


General string edit:

• Strings S1 and S2 over a fixed alphabet Σ

• Replacing letter a with b has a cost σ(a, b), where a, b ∈ Σ∪

• The minimum cost of series of operations transforming S1

into S2 is the edit distance between S1 and S2.

• A series of operations is equivalent to an alignment of thesequences:

– inserting spaces ( ) into or at the ends of sequences

– putting one resultant sequence on top of the other, suchthat every letter or space in one sequence is opposite toa unique letter or a unique space in the other sequence

– each edit operation corresponds to a column of the align-ment

– alignment cost is the sum of costs of columns

• A series of edit operations of the minimum cost correspondsto an optimal (pairwise) sequence (global) alignment.

• Global alignment — similarity over the whole sequences.

48

Lecture 5: Pairwise Sequence Alignment &

Sequence Homology Search

Agenda:

• Global & local pairwise alignments

• Pairwise alignments with affine gap penalty

• The BLAST

• Pattern Hunter

• Database search demonstration

Reading:

• Ref[JXZ, 2002]: Pages 45 – 69

• Ref[Gusfield, 1997]: Pages 209 – 258

• S. F. Altschul et al. Basic local alignment search tool. Journalof Molecular Biology. (1990) 215, 403 – 410.

• S. F. Altschul et al. Gapped BLAST and PSI-BLAST: anew generation of protein database search programs. NucleicAcids Research. (1997) 25, 3389 – 3402.

• B. Ma et al. Pattern hunter: faster and more sensitive ho-mology search. Bioinformatics. (2002) 18, 440 – 445.

• Webpage: http://www.ncbi.nlm.nih.gov/BLAST/

49


Given two sequences, form a correspondence betweenthe letters by inserting spaces if necessary.

ATGCTCGCCATACA ATGCTCGCCATA CAACCTCGTCATGCCA AC CTCGTCATGCCA

cost: 011000010001100 = 5

Assume a similarity score/distance cost function s.

s | - A C G T s | - A C G T--+---------- --+------------------------- | 0 1 1 1 1 - | 0 2.3 2.3 2.3 2.3A | 1 0 1 1 1 A | 2.3 0 1.7 1 1.7C | 1 1 0 1 1 C | 2.3 1.7 0 1.7 1G | 1 1 1 0 1 G | 2.3 1 1.7 0 1.7T | 1 1 1 1 0 T | 2.3 1.7 1 1.7 0

The score/cost reflects the probability of mutations,e.g. s(a, b) = − log prob(a, b).

The score/cost of the alignment is the total score/costof all columns.

The goal is to maximize the score or to minimize thecost.

The same DP algorithm for String Edit works for (global)

sequence alignment in O(mn) time and space.

50


Score schemes for sequence alignment:

• The score scheme tells how to calculate the score of thealignment.

• The score scheme can measure either distance or similarity.

• Various types of score schemes have been investigated:

– columns independent to each other:

∗ letter-independent (edit distance, BLAST)

∗ letter-dependent (PAM250, BLOSUM62, PsiBLAST)

– gap penalties: what is a gap? why gap?

ATCCGGTGGTCTCGCCAT CAACCC CTCGTCATGGTGGTGCA

cost: 0100 g(6) 00001000 g(7) 00

∗ general: g(i) denotes the score of a gap of length i

O(n1n2(n1 + n2)) [Waterman et al., 1976]

∗ concave:

O(n1n2(logn1 + logn2)) [Miller & Myers, 1988; Epp-stein et al., 1989]

∗ affine: g(i) = gopen + i · gext

O(n1n2) [Gotoh, 1982]

51


The PAM250 scoring matrix:

X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Z 0 0 1 3 -5 3 3 -1 2 -2 -3 0 -2 -5 0 0 -1 -6 -4 -2 2 3

B 0 -1 2 3 -4 1 2 0 1 -2 -3 1 -2 -5 -1 0 0 -5 -3 -2 2

V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4

Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10

W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17

T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3

S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2

P 1 0 -1 -1 -3 0 -1 -1 0 -2 -3 -1 -2 -5 6

F -4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9

M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6

K -1 3 1 0 -5 1 0 -2 0 -2 -3 5

L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6

I -1 -2 -2 -2 -2 -2 -2 -3 -2 5

H -1 2 2 1 -3 3 1 -2 6

G 1 -3 0 1 -3 -1 0 5

E 0 -1 1 3 -5 2 4

Q 0 1 1 2 -5 4

C -2 -4 -4 -5 12

D 0 -1 2 4

N 0 0 2

R -2 6

A 2

A R N D C Q E G H I L K M F P S T W Y V B Z X

B: Aspartic D or Asparagine N;

Z: Glutamic E or Glutamine Q; and X: any undetermined aminoacid.

52


Longest common subsequence (LCS):

Definitions:

• A sequence/string S = a1a2 . . . an

• A subsequence is any ai1ai2 . . . aik where

1 ≤ i1 < i2 < . . . < ik ≤ n.

• a1a2 . . . an is called a supersequence of ai1ai2 . . . aik.

Warning: Difference between subsequence and substring

Problem: Given two sequences S1 = a1a2 . . . an1 and S2 = b1b2 . . . bn2,find a longest subsequence of both S1 and S2.

Example:S1 = 0 1 0 1 0 1 0 0 1S2 = 1 0 0 1 0 1 0 0 0 1

LCS = 0 1 0 1 0 0 0 1

LCS as Sequence Alignment: Similarity score scheme: matchgets 1, mismatch and indel get 0.

The maximum score alignment is equivalent to an LCS.

State of the arts:

• Can be computed in O(n1n2) time (in linear space).

• Open: the expected length of an LCS of random se-quences of length n?

53


Affine gap penalty (g(i) = q + i · r)

Tables:

S[i, j] is the maximum score of all alignments of a1a2 . . . ai andb1b2 . . . bj.

D[i, j] is the maximum score of all alignments of a1a2 . . . ai andb1b2 . . . bj that end with a deletion gap (in B) .

I[i, j] is the maximum score of all alignments of a1a2 . . . ai andb1b2 . . . bj that end with an insertion gap (in A).

Recurrences:

S[i, j] =

0 if i = 0 and j = 0D[i,0] if i > 0 and j = 0I[0, j] if i = 0 and j > 0

max

S[i− 1, j − 1] + σ(ai, bj)D[i, j]I[i, j]

if i > 0 and j > 0

D[i, j] =

S[0, j]− q if i = 0 and j ≥ 0D[i− 1,0]− r if i > 0 and j = 0

max

D[i− 1, j]− rS[i− 1, j]− q − r

if i > 0 and j > 0

I[i, j] =

S[i,0]− q if i ≥ 0 and j = 0I[0, j − 1]− r if i = 0 and j > 0

max

I[i, j − 1]− rS[i, j − 1]− q − r

if i > 0 and j > 0

Linear-Space Algorithm:

An exercise.

54


Local sequence alignment:

Find one substring of S1 and one substring of S2 to have the max-imum (sequence) similarity.

For example:

ATAAGGTTGCTGAGCGCCCCCTTCCTATTGATA

ATAGGTTGCTAGCGC TTGCTGA ATACCCCTTCCTATTGATA TTCCT A ATA

• Local sequence alignment can be done in O(n1n2) time usingDynamic Programming as well.

Key observation: allowing re-start

[Smith & Waterman, 1981]

• k-best non-intersecting local alignments can be found in

O(n1n2 + kn2) time.

[Waterman & Eggert, 1987]

• The space can be reduced to O(n2) as well.

[Huang & Miller, 1991; Chao & Miller, 1994]

55


The maximum likelihood approach:

• An alignment describes a scenario of mutations.

A C C T C G T C A T G C C A

A T G C T C G C C A T A C A

• Scores are the negative log of probabilities.

• Summing up scores is equivalent to multiplying probabilitiesof mutations.

• The cost of an alignment is the probability of a particular pathof mutations.

• The maximum likelihood approach is to consider all possiblepaths of mutations.

– Let L[i, j] = Pr(a1a2 . . . ai → b1b2 . . . bj).

– L[i, j] =∑

Prinsert · L[i− 1, j]Prdelete · L[i, j − 1]Pr(ai → bj) · L[i− 1, j − 1]

– We need find parameters Prinsert, Prdelete, Pr(ai → bj) tomaximize the L[n1, n2] = Pr(S1 → S2).

– This is a non-linear programming problem and heuristicmethods like Simplex and E(nergy)M(inimization) are of-ten used.

56


Sequences annotations and comparison of annotated se-quences

1. Functional information

TBS SerSerAlaSer SerLeuAsn******* ############ #########

cgccctataaaacccagc...agctctgctagcacctgagcctcaat

2. Structural information

alpha beta coil******** ######### %%%%%%%%%%

VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF...

3. Physical contact/bond

CAGCGUCACACCCGCGGGGUAAACGCU

' $' $' $

CAUGCCGG

CUA

CAC

CGCGG

CG

GCUAAA

4. And many more.

57


Alignment of DNA Sequences with both Coding andNon-Coding Regions

1. A Genomic sequence is a DNA sequence consistingof both coding and noncoding regions.

TBS SerSerAlaSer SerLeuAsn******* ############ #########

cgccctataaaacccagc...agctctgctagcacctgagcctcaat

How do we compare genomic sequences?

2. The sequences are usually broken up into segments,which are compared separately.

3. For coding regions, classical methods align eitherDNA sequences based on nucleotide evolution orprotein sequences based on amino acid evolution.

Since amino acids evolve slower than nucleotides,alignments of protein sequences are usually morereliable.

But nucleotide changes are also important!

4. Can we align coding DNA to also reflect the evolu-tion of its protein?

58


Combined DNA and protein sequence alignment:

• Amino acids are coded by triplets of nucleotides (codon); thusactual mutations happen at the DNA level.

• Alignments of protein sequences are reliable (amino acid evo-lution is slower than nucleotide evolution).

• Nucleotide changes are also important.

How do we align (coding) DNA sequences to reflect changes atboth DNA and protein levels?

AATACG

Thr

AsnAATAsn

:

XXXXXXXz

AAGLys

ACTThr

XXXXXXXz

: ACGThr

1T→G

1A→C

1A→C

0T→G

T TGTTGCTC

LeuLeu

LeuTTGLeu

:

XXXXXXXz

TtgcTGLeuLeu

TTCPhe

PPPPPPq

: TtgcTCLeuLeu

1indel

0G→C

1G→C

2indel

Alignment cost depends on order of events!

59


The notion of codon alignment [Hein, 1994]:Align DNA to reflect nucleotide changes; but also consider the costof induced amino acid changes.

Alignment Model:• cd(a, b): cost of substituting base b for base a.• cp(e1e2e3, f1f2f3): cost of substituting the amino acid coded

by f1f2f3 for the amino acid coded by e1e2e3.• gd(i): cost of inserting (or deleting) a block of i nu-

cleotides.We consider only indels of lengths divisible by 3, i.e. noframe shifts.

• gp(i): cost of inserting (or deleting) a block of i aminoacids.

• Let g(i) = gd(3i)+gp(i). (affine gap: g(i) = gopen + i ·gext)

An alignment can be partitioned into codon alignments at codonboundaries of both sequences.

The codon alignments are independent of each other.

Operation Costs:Mutation

For e1e2e3 → f1e2e3,costmut = cd(e1, f1) + cp(e1e2e3, f1e2e3).

InsertionFor e1e2e3 → e1e2e3f1...f3i,

costins = g(i).For e1e2e3 → e1f1...f3ie2e3,

costins = g(i) + min

cp(e1e2e3, e1f1f2),cp(e1e2e3, f3ie2e3).

DeletionFor e1e2e3...e3i+1e3i+2e3i+3 → e1e2e3,

costdel = g(i).For e1e2e3...e3i+1e3i+2e3i+3 → e1e3i+2e3i+3,

costdel = g(i) + min

cp(e1e2e3, e1e3i+2e3i+3),cp(e3i+1e3i+2e3i+3, e1e3i+2e3i+3).

60


Codon alignment:

• An alignment can be partitioned into codon alignments atcodon boundaries of both sequences.

• The codon alignments are independent of each other.

• There are 11 types of codon alignment.

• Each codon alignment involves at most 5 operations and thusthe optimal can be computed easily.

Summary:

1. An O(n21n

22) algorithm [Hein, 1994]

2. An O(n1n2) heuristic algorithm [Hein-Støvlbæk, 1994]

3. An O(n1n2) time algorithm assuming affine gaps [Hua-Jiang, 1987], which requires a table of size roughly 16644n1n2.

4. An O(n1n2) time algorithm assuming affine gaps [Pedersen-Lyngsø-Hein, 1998], which requires a table of size roughly15476n1n2 (may be reduced to 800n1n2).

5. An O(n1n2) time algorithm assuming affine gaps for aslightly simplified model (context-free), requiring a tableof size only 292n1n2 [Hua-Jiang-Wu, 1998].

61


The 11 types of codon alignment:

62

Lecture 5: Sequence Homology Search

A practical consideration:

Genomic Database Search. Given a query sequence Q = a1a2 . . . an,find all sequences in the database that are similar to Q.

A database may contain millions of sequences, totaling in billionsof bases.

The quadratic time DP algorithm is NOT fast enough!

Ideas in BLAST [Altschul et al., 1990]

• Screen out all sequences which don’t share a commonsubstring of length w with Q.

• Often w = 12 for DNA and w = 4 for protein.

• Consider n − w + 1 substrings aiai+1 . . . ai+w−1, for i =1,2, . . . , n− w + 1.

• How to find the exact occurences of these substrings isthe topic of exact string matching (later).

• It is an approximate searching algorithm.

Ideas in PatternHutter [Li et al., 2002]

• Looking for appearances of length m substrings with atleast w matches.

63


BLAST:

• Generalized from local alignment

• Maximal segment pair (MSP)

highest scoring pair of identical lengths segments chosen from2 sequences

Generalize to non-identical (score cutoff S)

• Word pair

a segment pair of fixed length w (score cutoff T ) — 12 forDNA, 4 for protein

Extending word pair to a possible MSP with score ≥ S

• MSP vs. WP — T ↓, chance of MSP containing a WP ↑number of WP hits ↑running time ↑

• 3 algorithmic steps in BLAST:

1. compiling a list of high-scoring words (two ways: ,)

2. scanning the database for hits (two ways: ,)

3. extending hits (heuristics: time vs. accuracy tradeoff)

• Implementation details:

– database compression

– removing repititive words

– allowing gaps (in the extension step)

64


PsiBLAST:

• In the original BLAST, the extension step accounts more than90% of execution time.

Try to reduce the number of extensions performed

• Observation: length of HSP is much larger than word-length

Therefore, multiple hits are very possible (same diagonal)

Idea: Do the extension only when

There are two hits at distance within A (pre-specified).

Some commonly used: w = 3 ∼ 4, T = 11 ∼ 13, A = 37 ∼ 40


– hit generation — use of position-specific score matrix

– hit extension (E-value, affine gap penalty, etc.)

– allowing gaps (only when the score is high enough)

65


PatternHunter:

• A dilemma — increasing seed size decreases sensitivity whiledecreasing seed size slows down computation

— large seeds lose distant homology while small seeds createstwo many random hits (which slow down the computation)

• Expect less to get more — use of nonconsecutive w lettersas seeds

The expected number of hits of a weight W , length w modelwithin a length m region of similarity p (0 ≤ p ≤ 1), is (m −w + 1)pW .

• For W = 11, the most sensitive model:

111010010100110111 (sensitivity 0.467122 for certain p)


– hit generation

– hit extension

– allowing gaps (in the extension step)

66


Notes:

• In any of the above homology search model, we look for eitherexact or approximate occurrences of some word/seed/substring.

• Gaps are NOT allowed in the occurrences

• This is the Exact/Approximate String Matching problem.

Subject of some future lectures ...

67

Lecture 6: Multiple Sequence Alignment

Agenda:

• Straightforward extension from pairwise algorithm

• Hardness results review

• Approximation algorithm concepts, review

• Approximate MSA

Reading:

• Ref[JXZ, 2002]: Pages 71 – 110

• RefGusfield, 1997]: Pages 332 – 369

• Ref[CLRS, 2001]: Pages 966 – 1054

68


The MSA problem:

• Input:

A set of k sequences together with a score scheme σ:

S1: TACCCCGGGCCCCTTTGAGCAS2: TCCTGGGCCAACCTTAAGCGS3: CACCCGGGCCAGCTTTTAAGCGS4: TACCCCGACCAACTTTAACT

• Output:

An alignment achieving the maximum score (w.r.t. σ):

S1: TACCCCGGGCC--CCTTTGAGCAS2: T-CCT-GGGCCAACCTT-AAGCGS3: CACCC-GGGCCAGCTTTTAAGCGS4: TACCCCGA-CCAACTTT-AA-CT

• Applications:

1. Identification of highly conserved regions in homologoussequences.

E.g. the case of Cystic Fibrosis (Science, 1989).

2. Reconstruction of evolutionary trees.

A typical reconstruction algorithm begins with multiplesequence alignment.

3. Applications in computer science.

E.g. syntactical analysis of HTML tables and analysis ofelectronic chain mail.

69


Identification of the Cystic Fibrosis Gene:

Cloning and Characterization of Complementary DNA:

• Official Gene Symbol:

CFTR

• Name of Gene Product:

cystic fibrosis transmembrane conductance regulator

• Locus:

7q31.2

The CFTR gene is found in region q31-q32 on the long (q)arm of human chromosome 7.

• Size:

The CFTR gene’s coding regions (exons) are scattered over250,000 base pairs (250 kb) of genomic DNA. The 27 exonstranslated into the CFTR protein are interspersed with largesegments of noncoding DNA (introns). During transcription,introns are spliced out and exons are pieced together to forma 6100-bp mRNA transcript that is translated into the 1480amino acid sequence of CFTR protein.

• Protein Function:

The normal CFTR protein is a chloride channel protein foundin membranes of cells that line passageways of the lungs, pan-creas, colon, and genitourinary tract. CFTR is also involvedin the regulation of other transport pathways. Defective ver-sions of this protein, caused by CFTR gene mutations, causecystic fibrosis (CF) and congenital bilateral aplasia of the vasdeferens (CBAVD).

70


Identification of the Cystic Fibrosis Gene (cont’d):

Cloning and Characterization of Complementary DNA:

• mRNA sequence (NM 000492 to access NCBI)

• protein sequence (NP 000483 to access NCBI)

• Common disease:

About 70% of mutations observed in CF patients result fromdeletion of three base pairs in CFTR’s nucleotide sequence.This deletion causes loss of the amino acid phenylalanine lo-cated at position 508 in the protein (this mutation is referredto as deltaF508 CFTR).

Homozygous patients have severe symptoms of cystic fibrosis(e.g. breathing difficulty).

• Identification of the cystic fibrosis gene:

Done by Riordan et al. from Hospital for Sick Children (Toronto)

“Identification of the Cystic Fibrosis Gene:Cloning and Characterization of Complementary DNA.”

Science, volume 245: 1066-73 (1989 September).

• Using multiple sequence alignment to detect the mutationsat the DNA-level.

71


A multiple alignment between CFTR genes and

other membrane-bound proteins.

72


Typical reconstruction of evolutionary tree:

1. Input: a set of molecular sequences. E.g. some globin genesof human, mouse, etc.

2. Compute a multiple alignment of the sequences.

121 131 141trema TCCT--CAAG A--TA---TT TG--AGATTGmouse-alpha AT-A--TGGA GCTGAAGCCC TGGAAAG---midge CT-T--CAAG GCTGA---TC CATCAAT---cow-myo GG-CCATGGG CAGGAGGTCC TCATCAG---mouse-beta AG-T--TGGA GGAGAAACTC TGGGAAG---human-myo GG-CCATGGG CAGGAAGTCC TCATCAG---vitreoscilla TTAT--AAAA ACTTG---TT TGCCAAA---duck-beta CT-G--TGGA GCTGAGGCCC TGGCCAG---oak TCTT--AAAG A--TA---TT TG--AGATCG

3. Analyze each column to infer pairwise distance.

4. Construct an evolutionary tree.

dt

vitreoscilla

QQQdd

ttrema

@@toak

QQQQQdt

midge

QQQdd

tcow-myo

HHHthuman-myo

QQQQQQQd

t

mouse-alpha

QQQd

tduck-beta

QQQtmouse-beta

73


Models of MSA:

• An alignment:

S1: TACCCCGGGCC--CCTTTGAGCAS2: T-CCT-GGGCCAACCTT-AAGCGS3: CACCC-GGGCCAGCTTTTAAGCGS4: TACCCCGA-CCAACTTT-AA-CT

S′i denotes the original sequence Si with spaces inserted.

• Each column j has a cost: c(S′1[j], S′2[j], . . . , S

′k[j]).

• Cost of alignment is the sum of costs of columns.

• The goal is to minimize the cost.

• Example cost (or score) functions:

– Longest Common Subsequence (LCS):

c(a1, a2, . . . , ak) =

1 if a1 = a2 = . . . = ak0 otherwise

– Maximum Weight Trace (MWT):

c(a1, a2, . . . , ak) = −∑a∈Σ

(k∑

i=1

I(a, ai)

)2

,

where

I(a, ai) =

1 if a = ai0 otherwise

74


Models of MSA (cont’d):

• Models of cost/score schemes:

Given a pairwise cost scheme s(a, b),

• 1. Sum-of-Pairs (SP) cost:

c(a1, a2, . . . , ak) =∑i6=j

s(ai, aj)

2. Consensus cost:

c(a1, a2, . . . , ak) = mina

∑i

s(a, ai)

3. Tree cost: w.r.t. a given evolutionary tree T ,

c(a1, a2, . . . , ak) = mininternal node assignment

∑(a,b)∈T

s(a, b)

TACGGGCC TTG ----+|o---+

T CAGGCCC A ----+ ||----o

CA GGG CCTTA ----+ ||o---+

TACAA CCTTTG ----+

The evaluation of tree-cost of a column:

75

*TACGGGCC TTG G----+

|b_1--+T CAGGCCC A A----+ |

|--b_3CA GGG CCTTA A----+ |

|b_2--+TACAA CCTTTG G----+

tree-cost of column *:c(G, A, A, G) = minb1,b2,b3

. s(G, b1) + s(A, b1) + s(A, b2) + s(G, b2)

. + s(b1, b3) + s(b2, b3)

. = s(G, G) + s(A, G) + s(A, G) + s(G, G)

. + s(G, G) + s(G, G)

. = 2(assume s(a, b) = 0 if a = b, or 1 otherwise)

This can be done by dynamic programming in lineartime. (also called the small parsimony problem)

Usually assume cost s(a, b) is a distance metric.


An exact algorithm — Dynamic Programming:

• Idea:

Extending the Dynamic Programming algorithm for pairwisesequence alignment to k dimensions.

• Table entry computation (recurrence):

d(i1, i2, . . . , ik) denotes the cost of an optimal alignment for kprefixes S1[1..i1], S2[1..i2], . . ., Sk[ik], where 0 ≤ ij ≤ nj.

d(i1, i2, . . . , ik) = min

d(i′1, i′2, . . . , i

′k) + s(i′1 + 1, i′2 + 1, . . . , i′k + 1)

where i′j = ij or ij−1 and s(i′1+1, i′2+1, . . . , i′k+1) is the cost of

the last column in the alignment containing k letters/spaces,one from each given prefix.

The minimization is taken over 2k − 1 entries (recall the casek = 2).

• Boundary entry computation:

An entry d(i1, i2, . . . , ik) is boundary if at least one of its indicesij is 0.

• Running time and space requirement:

If s(i′1 + 1, i′2 + 1, . . . , i′k + 1) is known,

– There are (n1 + 1)× (n2 + 1)× . . .× (nk + 1) entries.

– Each needs to scan 2k − 1 values.

– Space: O(nk)

– Time: O(2knk)

76


Dynamic Programming recurrences:

• SP alignment:

Boundaries:

d(0, i2, . . . , ik) = d(i2, . . . , ik) +

k∑i=2

ij∑j=1

s(Si[j],−),

Column score:

s(a1, a2, . . . , ak) =∑i6=j

s(ai, aj).

General recurrence:

d(i1, i2, . . . , ik)= min

ij−1≤i′j≤ij ,∑

ji′j<∑

jij

d(i′1, i

′2, . . . , i

′k)

+ c(S1[i′1 + 1], S2[i′2 + 1], . . . , Sk[i′k + 1])

,

where Sj[ij + 1] = −.

• Consensus alignment:

An exercise !

• Tree alignment:

An exercise !

77


Approximate SP alignment:

• Sum-of-all-Pairs: c(S1, S2, . . . , Sk) =∑

i6=js(Si, Sj).

• The exact algorithm running time O(nk),

which is NOT polynomial when k is also a variable !!!

• In general, the problem is NP-hard — no polynomial timealgorithm exists unless P = NP

• This is NOT the end of story.

We still want to solve the problem to some extent

— sacrifice the accuracy

— approximately good (performance guaranteed) alignmentsin a short (polynomial) time

• Observations:

– Fix index i, computing c(Si) =∑

i6=js(Si, Sj) is easy.

– Computing mini c(Si) is then easy too.

– Using the alignment A produced in this way as the ap-proximate SP alignment.

78


Analysis of the approximation algorithm:

• Running time:

O(k × k × n2)

• Performance analysis:

tSitSi+1 tS1

tt

Si−1

tSk

tS2

t tt

t tt

HHHH

H

HHHHH

A — alignment computed

A∗ — an optimal alignment

– For fixed i, in A∗,∑

j 6=is(Si, Sj) ≥ c(Si).

– Thus c(A∗) ≥ 12

∑ic(Si) ≥ k

2·mini c(Si).

Or equivalently, mini c(Si) ≤ 2kc(A∗).

– Therefore,

c(A) =∑

j 6=`s(Sj, S`)

≤ (k − 1)∑

j 6=is(Si, Sj)

= (k − 1)c(Si)≤ (2− 2

k)c(A∗).

– c(A) ≤ (2− 2k)c(A∗)

• Simulations showing that the algorithm performs much betterin practice:

– On 19 random sequences, c(A) ≤ 1.02c(A∗).

– On homologous sequences, c(A) ≤ 1.16c(A∗).

79


Approximation algorithm (review):

• Runs in polynomial time (in the input instance size)

• Produces a solution within certain range of the optimum

• For a minimization (maximization) problem, an α-approximationalgorithm produces a solution, for any instance, with scoreless than or equal to α (1

α, respectively) times the score of an

optimal solution.

α is called the performance ratio of the approximation.

• The above is a 2-approximation algorithm for the MSA withSP score.

80


Approximate SP alignment (2):

• The above approximation algorithm tries to find the best se-quence to be used as a center sequence.

It has performance ratio 2− 2k.

• The center sequence has the property that it has the minimumsum of pairwise distances to other sequences.

The graph (tree) connecting the center to all other sequencesis called a 2-star, and the algorithm the 2-star algorithm.

• Generalizing to 3-star (disjoint triangles glued at the center):

– Now each “edge” from the center involves 3 sequences.

Need a partition of the other sequences into pairs, whichshould be optimal regarding the chosen center sequence.

tSitSi+1 tS1

tt

Si−1 t

t tSk

tS2

t tt

t tt

HHHH

H

HHHHH

tSitSi+1 tS1

tt

Si−1 t

t tSk

tS2

t tt

t tt

HHHH

H

HHHHH

– For fixed index i, an optimal partition of the other se-quences and the associated optimal alignment can becomputed by maximum weighted matching.

Weight function is defined by:

w(Sj, S`) = minA

(dA(Sj, S`) + dA(Si, Sj) + dA(Si, S`))

81


Approximate SP alignment (2) (cont’d):

• Maximum weighted matching can be computed efficiently

• An optimal center sequence Si and its associated optimalalignment A can thus be computed efficiently.

• (1992) The 3-star algorithm runs in time O(n3k3 + k5) andachieves performance ratio 2− 3

k.

• Analysis of 3-star algorithm:

– Running time

An exercise

– Performance

An exercise

82


Approximate SP Alignment (3):

• Consider `-stars:

tt t

tPP

PPPP

QQQ

ttPPPPPP

t

QQQ

tt

BBBBBBt

JJJtt t ttt

Si

S1

S2

S3

Thm. (1992)The multiple sequence alignment induced by an opti-mal `-star costs at most (2− `

k)c(A∗).

Problem: Can’t find an optimal `-star when ` > 3, because 3 dimensionalmatching is already NP-hard.

Solution: balanced set of `-stars.

Thm. (1994)The best `-star in a balanced set also induces an align-ment with cost at most (2− `

k)c(A∗).

• Running time: O(k · n` · |B|) = O(k3 · n`).

83


Balanced set of `-stars:

• Definition:

The edge set connecting to the `-starts covers each edge ofthe complete graph the same number of times.

• Example:

K5 (k = 5):

There are in total 30 possible 3-stars (` = 3).

• A balanced set of 3-stars — covers each edge 3 times:

84


Approximate consensus alignment:

• Consensus sequence:

– Steiner consensus sequence S — definition without MSA

∗ S need not be from the set of given set of sequences

∗ It is generally hard to compute S

∗ Assuming the pairwise alignment score scheme satisfiestriangle inequality,

The 2-star sequence is an approximation, with perfor-mance ratio (2− 2

k).

– Definition based on MSA:

For each column in the MSA, which letter minimizes thecolumn cost?

Concatenating the consensus letters into a sequence —consensus sequence

• Two definitions equivalent to each other

• NP-hard

• 2-approximation algorithm

Can you improve it ?

85


Approximate tree alignment:

• Tree alignment is NP-hard, even when the score scheme sat-isfies the triangle inequality.

• Generalized Tree alignment (when the tree is not given aheadof time) is MAX SNP-hard.

Which means, no PTAS !!! unless P = NP .

86


Agenda:

• PTAS for Tree Alignment

• Special cases of MSA

• PTAS for some special cases

Reading:

• Ref[JXZ, 2002]: Pages 71 – 110

• M. Li et al. Near optimal multiple alignment withina band in polynomial time. ACM STOC 2000.425–434. Portland, Oregon.

• C. Tang et al. Constrained multiple sequence align-ment tool development and its application to RNasefamily alignment. IEEE CSB 2002. 127–137. Stan-ford, California.

87


Approximate tree alignment:

• Tree alignment is NP-hard, even when the score scheme sat-isfies the triangle inequality.

• Generalized Tree alignment (when the tree is not given aheadof time) is MAX SNP-hard.

Which means, no PTAS !!! unless P = NP .

• Reformulating tree alignment:

Sequence reconstruction: Given a rooted evolutionary tree Twith leaves labeled by sequences, construct a sequence foreach interior node to minimize the cost of the tree.

– An example, f

@@@@@@f

AAAAAAA

f

AAAAAAAf f f f

r1

r2 r3

S1 S2 S3 S4

Sequences and score scheme:

S1 = ACTGS2 = ATCGS3 = GCCAS4 = GTTA

A C G T -A 0 1 1 1 1C 1 0 1 1 1G 1 1 0 1 1T 1 1 1 0 1- 1 1 1 1 -

88


Approximate tree alignment (cont’d):

• Cost of edge (x, y): the optimal pairwise alignment cost ofthe sequence labels at x and y

String Edit, computable in O(n2) time

• Cost of (fully) labeled tree: total cost of edges

• This is a variant of fix-topology Steiner tree problem.

• Equivalence of the formulations:f

@@@@@@f

AAAAAAA

f

AAAAAAAf f f f

r1

r2 r3

S1 S2 S3 S4

ACTA

ACTCG GCTA

ACTG ATCG GCCA GTTA

The induced MSA:

S1: ACT-G S1: ACT-Gr2: ACTCGS2: A-TCG S2: A-TCGr2: ACTCGS3: GCC-A S3: GCC-Ar3: GCT-AS4: GTT-A S4: GTT-Ar3: GCT-Ar1: ACT-Ar2: ACTCG

• Cost of optimally labeled tree = cost of optimal tree align-ment

89


Tree alignment vs. Steiner tree:

• Steiner Minimal Tree (SMT):

Given a set S of terminals in some space, find the shortestnetwork connecting S.

• For our generalized tree alignment problem, the space is DNAsequences with edit distance.

Corollary. The generalized tree alignment problem can be approximatedwith ratio ∼ 1.5.

• Tree alignment is SMT with a fixed topology !

Corollary. SMT with a fixed topology in any metric has a PTAS.

• PTAS: polynomial time approximation scheme — approxi-mate arbitrarily close to the optimum

90


Approximate tree alignment (2):

• Lifted tree:

A fully labeled tree in which every parent sequence equals tosome child sequence.

• For example, f

@@@@@@f

AAAAAAA

f

AAAAAAAf f f f

r1

r2 r3

S1 S2 S3 S4

GCCA

ATCG GCCA

ACTG ATCG GCCA GTTA

Theorem. For some lifted tree T `, c(T `) ≤ 2× c(T ∗).

T ∗ is an optimal labeled tree.

Proof. By averaging over all lifted trees.

Theorem. We can compute an optimal T ` in O(k3 + k2n2) time.

Theorem. Tree alignment has a PTAS that achieves ratio 1 + 21+t

in

O(kdn2t−1+1) time, where d is the depth of the tree.

t 2 3 4ratio 1.67 1.50 1.40

running time O(kdn3) O(kdn5) O(kdn9)

91


A sketch of analysis of 2-approximation:

• T ∗ — an optimal labeled tree (not necessarily lifted)

For each v ∈ T ∗, `(v) — the descendant leaf closest to v

WARNing: some constraints on choosing `(v)

• T ` — obtained by lifting sequence of `(v) to v

ACTA

@@@@@@

ACTCG

AAAAAAA

GCTA

AAAAAAA

ACTG ATCG GCCA GTTA

T ∗

GCCA

@@@@@@

ACTG

AAAAAAA

GCCA

AAAAAAA

ACTG ATCG GCCA GTTA

T `

• Observation: c(T `) is at most the total distance between everypair of adjacent leaves in T ∗.

With suitable flipping of the subtrees !!!

Theorem. c(T `) ≤ 2× c(T ∗).

92


Computing an optimal lifted tree:

• v — internal node

• Tv — subtree rooted at v

• S(v) — subset of sequences corresponding to leaves in Tv

• Define for each v and s ∈ S(v), D[v, s] is the cost of an optimallifted tree for Tv with v being labeled by s.

vs D[v, s]

PPPPPPPPPPPv v vs1 s s3

AAAAAAA

D[v1, s1]

AAAAAAA

D[v2, s]

AAAAAAA

D[v3, s3]

• Recurrence:

For each i 6= 2, find an si ∈ S(vi) to minimize D[vi, si]+d(s, si).

D[v, s] = D[v2, s] +∑i6=2

(D[vi, si] + d(s, si))

• Running time is O(k2n2 + k3)

Can be improved to O(kn× depth) by uniform lifting (next).

93


Uniform lifting tree:

• Given a phylogeny, the order of the sequences are fixed

• During the lifting, the lifting decisions (Left or Right) areidentical at each level of the tree

f

HHHHHHHf

@@@@

f

@@@@f

AAAA

f

AAAA

f

AAAA

f

AAAAv v v v v v v v

s1 s2 s3 s4 s5 s6 s7 s8

94


Uniform lifting tree:

• Given a phylogeny, the order of the sequences are fixed

• During the lifting, the lifting decisions (Left or Right) areidentical at each level of the tree

R fs6

HHHHHHHL fs2

@@@@

fs6

@@@@R fs2

AAAA

fs4

AAAA

fs6

AAAA

fs8

AAAAv v v v v v v v

s1 s2 s3 s4 s5 s6 s7 s8

The uniform lifting choice vector is V = (R, L, R)

Theorem. For an average uniformly lifted tree T u`,

c(T u`) ≤ 2c(T ∗).

Prove it.

Theorem. We can compute an optimal uniform T u` in O(kd+kdn2) time,where d is the depth of T .

Dynamic programming.

95


PTAS for Tree alignment:

Theorem. Tree alignment has a PTAS that achieves ratio 1 + 21+t

in

O(kdn2t−1+1) time.

• Idea: from spanning to Steiner

Keeping the lifted sequences at some nodes while reconstruct-ing Steiner sequences at the others

• For example,

GCCA

@@@@@@

ACTG

AAAAAAA

GCCA

AAAAAAA

ACTG ATCG GCCA GTTA

T `

GCCA

@@@@@@

ACCG

AAAAAAA

GCCA

AAAAAAA

ACTG ATCG GCCA GTTA

T ′

96


PTAS for Tree alignment (cont’d):

• Key techniques:

– Careful partitioning strategy

– Local optimization

– Dynamic programming

– Uniform lifting

• Outline:

1. Partition the tree into overlapping components, each withr boundary nodes

2. Uniform lifting to boundary nodes

Local optimization on each component

Dynamic programming

3. Consider a set of partitions (by shifting) so that everynode has equal chance of being a boundary node

The partitions yield a good approximation on average

4. The the best result of all partitions

97


Extending the 2-approximation to a PTAS:

• General idea: lifting + local optimization

Assuming ∆-ary trees

• depth-t component: a subtree with depth t

• Local optimization: optimize a depth-t component with fixedlabels at leaves and root

S

S1 S2 S`−1 S` Sk−1 Sk Sh−1 Sh

ZZZZZZZZZZZ

@@@@@@

AAAAAA

AAAAAA

@@@@@@

AAAAAA

AAAAAA

EEEEEE

EEEEEE

EEEEEE

EEEEEE

EEEEEE

EEEEEE

EEEEEE

EEEEEE

h = ∆t

• This requires optimizing ∆ depth-(t − 1) components withfixed labels at leaves and parent.

• Each takes M(∆, t− 1, n) = O(n∆t−1+1) time.

98


Extending the 2-approximation to a PTAS:

• The partition Pi, 0 ≤ i < t (assuming binary tree)

– Fix a uniform lifting choice V

– The top component is a binary tree of height i

– All nodes on the lifting path of head/leaf of a componentare also heads

– There are t × 2d different components, where d is theheight of the tree

– These components put all nodes of the tree on boundarieswith roughly equal frequency

L-type components:vf

CCCCv v

v

fJJJv

f

CCCCv v

v

fJJJf

f

CCCCv v

CCCCv v

t = 2r = 3

t = 3r = 4

t = 3r = 5

R-type components:

99


Extending the 2-approximation to a PTAS (cont’d):

• A partition example: t = 3, r = 4, V = (R, R, R)v

fHHH

HHHHvfJJJv

f

CCCCv v

CCCCv v

QQQQvJJJv

f

CCCCv v

CCCCv v

fJJJv

f

CCCCv v

CCCCv v

QQQQvJJJv

f

CCCCv v

CCCCv v

P0

f

vHHHH

HHHvfJJJv

f

CCCCv v

CCCCv v

QQQQvJJJv

f

CCCCv v

CCCCv v

fJJJv

f

CCCCv v

CCCCv v

QQQQvJJJv

f

CCCCv v

CCCCv v

P1

f

fHHHH

HHHvvJJJv

f

CCCCv v

CCCCv v

QQQQvJJJv

f

CCCCv v

CCCCv v

vJJJv

f

CCCCv v

CCCCv v

QQQQvJJJv

f

CCCCv v

CCCCv v

P2

100


Extending the 2-approximation to a PTAS (cont’d):

• With respect to each partition Pi, the best depth-t componenttree is denoted by T i

• Compute T i by optimally lifting leaves to the boundaries andlocally optimize the other nodes

For Pi, levels i, i + t, i + 2t, . . ., form the boundaries

– For each boundary node v and lifted sequence s,

D[v, s] — cost of an optimal depth-t component subtreewith root label s

– D[v, s] can be computed iteratively from bottom to top(dynamic programming)

Reduced to the optimization of tree of depth-(t− 1)

– The size of local optimization is r = ∆t−1 + 1

– Running time:

O(k∆t−1+2 ×M(∆, t− 1, n)) = O(k∆t−1+2n∆t−1+1),

Lemma.∑t−1

i=0c(T i) ≤ (t + 3)c(T ∗).

Corollary. For some i, c(T i) ≤ (1 + 3t)c(T ∗).

Key observation: Let T be a full binary tree. For each leaf` ∈ L(T ), there is a mapping π` from I(T ) to L(T ) such that

– ∀v ∈ I(T ), π`(v) is a descendant of v

– The paths from nodes v to their images π`(v) are disjoint

– ∃ unused path from root to `

101


PTAS implementation and experimentation:

• TAAR — Tree Alignment and Reconstruction

• DNAPARS (in PHYLIP)

– Arrange the sequences (species) in some order

– Construct a tree for the first 3 sequences

– For each of the rest sequences do

Add the sequence to the current tree to achieve the min-imum cost

(It assumes a given MSA)

– Assumption on MSA can be removed if the cost is mea-sured using tree alignment

102


MSA in practice:

There are many multiple alignment programs available either com-mercially or as freewares, such as

CLUSTAL W, DFALIGN, MULTAL, MST, ESEE, . . .

It seems all are based on the same idea: clustering and progressivealignment.

The Generic Algorithm

1. Find the pair of the most similar sequences.

2. Optimally align the pair.

3. Record the alignment as either a profile or a consensussequence, and substitute it for the pair.

4. Repeat until only one profile/sequence is left.

5. Recover the multiple alignment from the final profile or byaligning each sequence with the final consensus sequence.

Comments

1. Quite efficient.

2. No clear objective function.

3. Some also use the evolutionary tree as the guide tree.

103


CLUSTAL W — king of progressive alignment:

Step 1: Simple pairwise alignments and distance

• Quick approximate or dynamic programming (k-tuple match)

• score = 1 - percent identity - gap

Step 2: Compute guide tree using neighbor joining (NJ)

• Branch lengths

• Rooted at ”mid-point”

• Weight of sequence =∑

e∈Pw(e)

# of sequences sharing e

Step 3: Progressive alignment

• traverse the tree in post-order

• align a pair of alignments using dynamic programming(Smith-Waterman, linear-space)

• existing gaps are frozen

• column score = average score between each pair of residuesfrom different alignments

• space results in 0 (worst case)

104

Lecture 9: RNA Sequence and Structure

Alignment

Agenda:

• Three computation models

• Algorithms & approximations

Reading:

• Ref[JXZ, 2002]: chapter 14.

• T. Jiang et al. A general edit distance betweenRNA structures. Journal of Computational Biology.(2002) 9, 371–388.

• G. Lin et al. The longest common subsequenceproblem for sequences with nested arc annotations.ICALP 2001. LNCS 2076, pp. 444–455.

• K. Zhang et al. Computing similarity between RNAstructures. CPM 1999. LNCS 1645. pp. 281–293.

• V. Bafna et al. Computing similarity between RNAstrings. CPM 1995. LNCS 937. pp. 1–16.

105

Lecture 9: RNA Sequence and Structure Alignment

Backgrounds:

• RNA performs a wide range of functions (HIV)

• Presumption: a preserved function corresponds to a preservedstructure

• RNA structures:

– primary — sequence of nucleotides

– secondary & tertiary — bonded pair of two complemen-tary bases

– secondary — base pairs

– tertiary — base pair & spatially close pair of nucleotidesweakly bonded

– one base involves in at most one base pair

– secondary structure non-crossing / nested

• Representing a bond between a pair by drawing an arc con-necting them

arc-annotated sequence

A tRNA and its associated annotated sequence:

CAGCGUCACACCCGCGGGGUAAACGCU

' $' $' $

CAUGCCGG

CUA

CAC

CGCGG

CG

GCUAAA

106


Comparison taking sequence & structure information:

• Intended: the measure of similarity becomes more accurate...

• Possible mutations: looking at the structures

– nucleotide mutations (replacement, mismatch, substitu-tion, ...)

if that base is involved in a base pair, then the bond coulddisappear — depending on the other base change

– single nucleotide indel

if that base is involved in a base pair, then the bonddisappears — more costly ?

– a base pair indel

– a base pair mutation (the bond maintains, e.g. A-U → C-G— more costly than single base mutation?

– In short, mutation operations have to be carefully defined...

• Three existing comparison models (sequence + secondarystructure):

– More focus on secondary structure units

Less on base pair mutations

Even less on single nucleotide mutations

– No secondary structure units

Separate base pair mutations from single base mutations

– No secondary structure units

Distinguish but not really separate the two sets of muta-tions

107


Tree edit distance:

• Using the planar secondary structure itself.

108


Tree (rooted, ordered) representation of theabove RNA secondary structure.

Tree edit compares two trees taking into ac-count secondary structure changes, base pairmutations, and single base mutations

109


If a base pair is not matched with another base pair,then delete it:

AA**U

GA*-U

-AA**U

GA-*-U

AA**U

GA*-U

-AA**U

G-A*U-

• Secondary structure vs. tertiary structure — solvable

Algorithm is similar to some to be described later ...

• Tertiary structure vs. tertiary structure — NP-hard

Approximations ???

The above (tree edit) model seems a bit too

strong. We consider a more relaxed model in

next page.

110


If a base pair is not matched with another base pair,then at least one of its bases should be deleted:

AA**U

GA*-U

--AA**U

GA--*-U

AA**U

GA*-U

-AA**U

G-A*-U

• Levels of restrictions on arcs:

1. no sharing of end bases

2. no crossing

3. no nesting

4. no arcs

LAPCS: Longest Arc-Preserving Common Subsequence Problem:

Given two arc-annotated sequences, find a longest commonsubsequence c1c2 . . . ck such that (ci, cj) is an arc in one se-quence if and only if it is an arc in the other.

1. unlimited — no restriction,

2. crossing — restriction 1,

3. nested — restrictions 1 & 2,

4. chain — restrictions 1, 2, & 3,

5. plain — restriction 4.

111


An example pair of arc-annotated sequences. The first

common subsequence ACGUUCCGGAU is arc-preserving

but the second common subsequence ACGUUUA is not.

112


Best-to-date results for LAPCS(·, ·):

plain chain nested crossing unlimitedunlimited NP-hard [Evans, 1999]

inapprox. within ratio nε, for ε ∈ (0, 14)

crossing NP-hard [Evans, 1999]MAX SNP-hard2-approx.

nested O(nm3) NP-hard2-approx.

chain O(nm) [Evans, 1999]plain O(nm)

Special Cases for LAPCS(nested, nested):

• unary

• c-diagonal, c-fragmented

Results:

• still NP-hard; 43-approximable

• still NP-hard; admit PTAS

113


LAPCS(crossing, plain) is NP-hard:

• General procedure of proof of NP-hardness of problem A:

– Find some appropriate known NP-hard problem B

– For every instance I of B, construct an instance J of A

– Prove that I is a yes-instance iff J is

• For LAPCS(crossing, plain):

– An appropriate known NP-hard problem is Maximum In-dependent Set of vertices in Cubic graphs (MIS-cubic):

Given a simple cubic graph G = (V, E), where |V | = n,and an integer k, is there a subset S ⊂ V such that

1. no two vertices in S are adjacent in G,

2. |S| ≥ k?

– Reduction from MIS-cubic:

∗ S1 = (AAAACCGGG)n

∗ S2 = (AAAAGGGCC)n

∗ an edge between vi and vj corresponds to an arc con-necting a G from the ith segment and a G from thejth segment

∗ making sure that every G is used exactly once

∗ set k′ = k + 6n

– Prove that G has an independent set of size at least k iffthe constructed LAPCS(crossing, plain) has an APCS oflength at least k′.

114


2-Approximation for LAPCS(crossing, crossing):

Given (S1, P1) and (S2, P2),

1. Compute a classical LCS for sequences S1 and S2, denoted S;

2. If there is exactly one arc that connects two bases in S, con-nect these two bases by an edge — this constructs a graphG on bases in S;

3. Delete the minimum number of vertices (bases in S) from Gto make the remaining graph trivial

— equivalently, find an MIS for G;

4. Take the vertices remained as the approximate subsequenceT .

5. Properties:

• G has the maximum degree (at most) 2.

• An MIS of G can be computed easily.

• The size of MIS is at least half the number of vertices.

• Therefore, |T | ≥ 12|S|.

6. Conclusion:

|T | ≥ 12|S| ≥ 1

2|S∗|

115


LAPCS(crossing, plain) is MAX-SNP-hard:

• General procedure of proof of MAX SNP-hardness of problemA:

– Find some appropriate known MAX SNP-hard problem B

– For every instance I of B, construct an instance J of A

– Show OPT (J) ≤ a×OPT (I)

– Prove that an approximate solution S to I implies anapproximate solution T to J such that

∗ T can be obtained from S in polynomial time

∗ OPT (I)− |S| ≤ b× (OPT (J)− |T |)namely, the error rate guaranteed

• Reduction from MIS for cubic graphs:

– S1 = (AAAACCGGG)n

– S2 = (AAAAGGGCC)n

– an edge between vi and vj corresponds to an arc con-necting a G from the ith segment and a G from the jthsegment

– making sure that every G is used exactly once

• Prove that vi is in the independent set if and only if AAAAGGGfrom the ith segment is in the common subsequence (other-wise AAAACC).

• LAPCS(crossing, plain) = MIS(G) + 6n ≤ 25 ·MIS(G).

• |k −MIS(G)| ≤ |k′ − LAPCS(crossing,plain)|.

116


Dynamic Programming for LAPCS(nested, plain):

• The key ideas:

– Notice that at most one base for every base pair can bein (any) LAPCS

– Ordinary DP reduces the computation to neighboring en-tries, which is not true anymore

– Consider the two bases at the same time. How ???

∗ the nested arc structure enables us to do this ...

∗ for every pair of indices (i, i′), such that no arc canhave one base inside, while the other outside interval[i, i′]:

compute how (S1[i, i′], P1[i, i′]) is aligned with S2[j, j′]

117


Dynamic Programming for LAPCS(nested, plain) (cont’d):

Step 1: Given (S1, P1) and (S2, ∅), DP (i, i′; j, j′) records the lengthof LAPCS of S1[i, i′] and S2[j, j′], where no arc has exactlyone base inside [i, i′].

Phase 1: if i′ is free,

DP (i, i′; j, j′) = max

DP (i, i′ − 1; j, j′ − 1) + χ(S1[i′], S2[j′]),DP (i, i′ − 1; j, j′),DP (i, i′; j, j′ − 1);

otherwise,

DP (i, i′; j, j′) = maxj≤j′′≤j′

DP (i, u(i′)l − 1; j, j′′1) + DP (u(i′)l, i

′; j′′, j′)

.

Phase 2: (i1, j1) ∈ P1,

DP (i1, j1; j, j′) = max

DP (i1 + 1, j1 − 1; j + 1, j′) + χ(S1[i1], S2[j]),DP (i1 + 1, j1 − 1; j, j′ − 1) + χ(S1[j1], S2[j′]),DP (i1 + 1, j1 − 1; j, j′),DP (i1, j1; j, j′ − 1),DP (i1, j1; j + 1, j′);

Step 2: if i′ is free,

DP (1, i′; j, j′) = max

DP (1, i′ − 1; j, j′ − 1) + χ(S1[i′], S2[j′]),DP (1, i′ − 1; j, j′),DP (1, i′; j, j′ − 1);

otherwise,

DP (1, i′; j, j′) = maxj≤j′′≤j′

DP (1, u(i′)l − 1; j, j′′1) + DP (u(i′)l, i

′; j′′, j′)

.

DP (1, n1; 1, n2) gives the length of the annotated LAPCS. Simple

back-tracing gives the subsequence.

118

Lecture 10: Sequence & Structure Alignment

Agenda:

• More algorithms & approximations for LAPCS

• Models on protein structure comparison

Reading:

• T. Jiang et al. A general edit distance betweenRNA structures. Journal of Computational Biology.(2002) 9, 371–388.

• G. Lin et al. The longest common subsequenceproblem for sequences with nested arc annotations.ICALP 2001. LNCS 2076, pp. 444–455.

• D. Goldman et al. Algorithmic aspects of proteinstructure similarity. IEEE FOCS 1999. pp, 512–521.

119


LAPCS(nested, nested) is NP-hard:

• Similar idea as in the MAX-SNP-hardness proof of LAPCS(crossing,plain), with carefully found problem: MIS on Planar CubicBridgeless Graphs.

– MIS on planar cubic bridgeless graphs is NP-hard.

– A planar cubic bridgeless graph has a perfect matching,which can be computed in linear time.

– Planar Cubic Bridgeless Graphs are subhamiltonian.

A graph is subhamiltonian if it is a subgraph of a Hamil-tonian planar graph.

– There is a linear time algorithm which, for any planarcubic bridgeless graph, finds a Hamiltonian supergraph ofmaximum degree at most 5 and a 2-page book embeddingin each page every vertex has degree at least 1 (and thusat most 2).

• This time,

– S1 = (CCCCDBAB)nCCCC, n — number of vertices

– S2 = (CCCCBDBA)nCCCC

– an edge between vi and vj corresponds to an arc con-necting a B from the ith segment and a B from the jthsegment

– in S1, if vi has degree 1, then an arc connecting last Cand D and an arc connecting B and A; in S2, if vi hasdegree 1, then an arc connecting next C and A and anarc connecting B and D; making sure that every B is usedexactly once.

120


LAPCS(nested, nested) is NP-hard (cont’d):

• An example reduction from K4 (which is planar, cubic, bridge-less)

sv1

sv4

sv2 sv3

QQQQQQ

TTTTTTTTT

v v v vv1 v2 v3 v4

' $' $' $& %& %& %

cccccccc

cccccccc

cccccccc

cccccccc

cccccccc

b b b bd d d db b b ba a a ad d d db b b ba a a ab b b b

' $' $

& %& %& %

• Prove that vi is in the independent set if and only if thetwo B from the ith segment are in the common subsequence(otherwise D is in the common subsequence).

• Therefore, there is an MIS of size k if and only if there is anAPCS of length 4n + 4 + 2k + (n− k) = 5n + 4 + k.

• This is not a MAX-SNP-hardness proof !!!

Why ?

because MIS on planar cubic bridgeless graphs admits a PTAS

121


More NP-hardness results:

• Observations:

– the matches are within segments

– the possible matches are of forms

〈2i− 1,2i− 1〉, 〈2i− 1,2i〉, 〈2i,2i− 1〉, and 〈2i,2i〉

– we are limiting ourselves to

∗ cut S1 and S2 into fragments of length 2

∗ the possible matches are required to be inside the frag-ments

∗ called 2-fragmented LAPCS(nested, nested)

∗ it is already NP-hard

• More hardness results:

– `-fragmented LAPCS(nested, nested) is NP-hard, ` ≥ 2

– c-diagonal LAPCS(nested, nested) is NP-hard, c ≥ 1

• Some positive results:

– 1-fragmented LAPCS(nested, nested) is in P

– 0-diagonal LAPCS(nested, nested) is in P

122


PTAS for `-fragmented LAPCS(nested, nested):

• Observation:

– It is closely related to MIS on planar (cubic bridgeless)graphs.

– MIS on planar graphs admits a PTAS.

• Key ideas:

– Inside each fragment, there are at most `2 possible matches,of which some are in fact conflicting each other

– Make every possible match into a vertex

– Conflicting matches are connected via an edge

– To compute an MIS

– Problem: graph is NOT necessarily planar ...

inside and long-range

• More ideas:

– Use a super-vertex to represent the subset of verticesinside a fragment

– Merge multiple edges, if any, into one (super-)edge

– Because of the nested arc structures, the new graph isplanar

123


PTAS for `-fragmented LAPCS(nested, nested) (cont’d):

• Further ideas: for any integer k,

– Compute a k-cover U1, U2, . . . , Uk for this planar graph

(polynomial time)

– For each subset Ui, which is (k−1)-outerplanar, computea tree decomposition of width at most 3k − 4

(polynomial time)

– Substitute back the subset of vertices for every super-vertex

– This substitution increases the tree width to at most (3k−4)`2

– Compute an MIS for Ui

(polynomial time)

– Output the maximum MIS among the k ones computed

– It is of size at least k−1k

of the optimum

Theorem: `-fragmented LAPCS(nested, nested) admits a PTAS (` ≥ 2)

(it runs in O(2(3k−4)c2k3c5(|S1| + |S2|)) time and outputs an

APCS of length at least k−1k

times the optimum).

Corollary: c-fragmented LAPCS(nested, nested) admits a PTAS, c ≥ 1.

124


A 43-approximation for Unary LAPCS(nested, nested):

• Unary LAPCS(nested, nested) is still NP-hard ...

Proof similar (can you prove it ???)

• Needs PTAS, or better approximations:

Given unary (S1, P1) and (S2, P2),

• Key ideas in the 43-approximation:

Count how many arc-matches in an LAPCS?

Count how many base matches in an LAPCS are not involvedin any arc-match?

left bases, right bases, of some arcs in P1?

What do they mean ???

Definition. Left-Priority subsequence:

T is a subsequence of S1 and (i1, i2) ∈ P1, T cannot containS1[i2] unless it contains S1[i1].

Definition. Lep-LAPCS:

Left-Priority Longest Arc-Preserving Common Subsequence.

125


A 43-approximation for Unary LAPCS(nested, nested)

(cont’d):

• Compute a Lep-LAPCS T1 in O(n31n2) time. Denote the

length by `1.

– How ?

– While a base-pair is cut, leave its left base, remove itsright base

Since left base has the priority ...

– Please fill in the details here

• Compute a Rip-LAPCS T2 in O(n31n2) time. Denote the length

by `2.

• Compute a longest APCS T0 in O(n1n2) time, which does notcontain any arc-match. Denote the length by `0.

– How ?

– Remove the right bases from all the base pairs ...

– Then compute the classical LCS ...

• Can prove

max`1, `2, `0 ≥3

4`∗,

where `∗ is the length of an LAPCS.

126


Protein structure alignment:

• Measure of similarity:

– root-mean-square distance

the Euclidean distance between Cα atoms in the twoaligned amino acids

minimize the sum / maximum

– similarity of distance matrices

compute the distance matrices for proteins

align the most similar local sub-matrices (hexa-peptides)

– scores based on secondary structure alignment

– scores based on hydrogen bonding pattern

– . . .

– maximize the aligned contact

• Usually NP-hard (or even harder)

• Heuristics, Monte Carlo simulations, etc.

127


Contact map:

• Construction:

– Linearly order the amino acids

– Compute the Euclidean distance between every pair ofamino acids

– Connect by an edge for a pair of distance less than pre-defined threshold

– Hei, ... arc-annotated sequence, again

• Properties:

– One amino acid could be close to a few other amino acids(not sequentially adjacent)

– a few — 0 – 4, usually

– (some assume that such pair should be at least 4 positionsapart — does it make the problem easier ???)

• Computational problem — Contact Map Overlap (CMO):

– two contact maps (n, E) and (m, F )

– find two subsets S ⊆ 1,2, . . . , n and T ⊆ 1,2, . . . , m∗ |S| = |T |

∗ there is an order-preserving bijection f : S −→ T

∗ |(u, v) ∈ E : u, v ∈ S, (f(u), f(v)) ∈ T| is maximized

– Hei, ... unary, again

128


129


130


Contact map (2):

Theorem The CMO problem is MAX SNP-hard, even when both con-tact maps have maximum degree 1.

well, ... not surprising :-)

• Special cases:

– self-avoiding walk on the 2D grid

w wwwwwwww w w ww ww ww wwww

– stack: no-crossing (e.g. RNA secondary structures)

– queue: not-nested unless sharing an endpoint

– staircase:

sets of crossing edges

no two edges from different sets cross, but can share anendpoint

Theorem A queue can be decomposed into 2 staircases.

131


Contact map (3):

Theorem For two degree-2 contact maps, of which one is either a stackor staircase, the CMO problem can be solved in O(n3m3) time.

What you can expect ??? Dynamic programming (check pa-per for more details)

Corollary All known RNA tertiary structures (except for one) can bedecomposed into 2 degree-1 stacks.

The CMO problem on RNA structures is NP-hard. Nonethe-less, it can be approximated within ratio 2. (CMO for RNAsecondary structures can be solved in O(n3m3) time.)

• Augmented staircase: staircase + stack

such that for every staircase edge e and every stack edge f :

– e and f disjoint

– e and f share an endpoint

– e contains f

Theorem For two degree-2 contact maps, of which one is an augmentedstaircase, the CMO problem can be solved in O(n3m3) time.

132


Contact map (4):

Theorem A self-avoiding walk can be decomposed into 2 stacks and 1queue.

Corollary A self-avoiding walk has maximum degree-2 except the headand tail nodes. Therefore, the CMO problem on self-avoidingwalks can be approximated within ratio 4.

Theorem A self-avoiding walk can be decomposed into 1 stack and 2augmented staircases.

Corollary The CMO problem on self-avoiding walks can be approxi-mated within ratio 3.

Research problems:

• Designing better approximations

PTAS, or prove the MAX SNP-hardness (on self-avoidingwalks)

• Prove the CMO problem is NP-hard on queues

Or design an efficient algorithm

133

Lecture 12: Exact String Matching

Agenda:

• Fundamental preprocessing

• 1st algorithm

• 2nd algorithm — Boyer-Moore algorithm

Reading:

• Ref[Gusfield, 1997]: pages 1–23

134

Lecture 12: Exact string matching

Genomic database search:

Given a query sequence Q = a1a2 . . . an, find all sequences in thedatabase that are similar to Q.

A database may contain millions of sequences, totaling in billionsof bases.

The quadratic time DP algorithm is NOT fast enough!

Ideas in BLAST: [Altschul et al., 1990]

• Screen out all sequences which don’t share a commonsubstring of length w with Q.

• Often w = 11 for DNA and w = 4 for protein.

• Consider n − w + 1 substrings aiai+1 . . . ai+w−1 of Q, fori = 1,2, . . . , n− w + 1.

• This becomes a multiple keyword search problem.

Ideas in PatternHutter: [Li et al., 2002]

• Looking for appearances of length m substrings with atleast w matches.

• This becomes an approximate multiple keyword searchproblem.

135


Multiple keyword search:

Problem Given words W1, W2, . . . , Wk, and text T , decide if any Wi

appears in T .

Theorem The problem can be solved in O(|T |+∑k

i=1|Wi|) time.

Example

W1 = p i n k p i gW2 = p i gW3 = p i n p o i n tW4 = p a r k

T = p i n k p i n k p i g . . .

A kind of tree:

@@

@@

@@

@@

@@@@

@@

136


Definitions:

• string:

A string S is an ordered list of characters written contiguouslyfrom left to right.

• substring:

For any string S, S[i..j] is the contiguous substring of S thatstarts at position i and ends at position j.

It is empty if i > j.

• prefix:

A substring S[1..j]

• suffix:

A substring S[i..|S|]

• S[i]:

The ith character

• proper substring, proper prefix, proper suffix

• match, mismatch

• The problem:

Given a string P called pattern and a long string T called text,the Exact Matching problem is to find all occurrences of Pin T .

137


Naive method and its speedup:

• Naive method:

– align the left end of P with the left end of T

– compare letters of P and T from left to right, until

– either a mismatch is found (not an occurrence)

– or P is exhausted (an occurrence)

– shift P one position to the right

– restart the comparison from the left end of P

– repeat this process till the right end of P shifts past theright end of T

• Running time analysis:

n = |P |m = |T |The worst case number of comparisons is n × (m − n + 1)(Θ(nm))

A worst case: P = aaa and T = aaaaaaaaaaaaaaaaaaaaaa

• Not useful as in current database there are trillions of letters(billions of sequences), typically when P is long ...

• Speed it up !!!

138


Naive method and its speedup (2):

• Where to speed up?

a a a a a a a a a a a a a a a a a a a a a a

a a a

a a a

. . . . . .

a a a

1. shift P more than one letter, when mismatch occurs

but never shift so far as to miss an occurrence

2. after shifting, skip over parts of P to reduce comparisons

• An example: P = abxyabxz, T = xabxyabxyabxz

x a b x y a b x y a b x z

a b x y a b x z

139


Naive method and its speedup (3):

• An example: P = abxyabxz, T = xabxyabxyabxz

Shifting more than one letter


a b x y a b x z

Skipping over parts of P to reduce comparisons


a b x y a b x z

140


Fundamental preprocessing:

• Can be on pattern P (query sequence) or text T (database)

• Given a string S and a position i > 1

Zi(S) — length of longest common prefix of S and S[i..|S|]

• Example, S = abxyabxz

a b x y a b x z

141


Fundamental preprocessing (2):

• Can be on pattern P (query sequence) or text T (database)

• Given a string S and a position i > 1

Zi(S) — length of longest common prefix of S and S[i..|S|]

• Example, S = abxyabxz

a b x y a b x z

Z2 = 0

Z3 = 0

Z4 = 0

Z5 = 3

Z6 = 0

Z7 = 0

Z8 = 0

• Intention:

– Concatenate P and T , inserted by an extra letter $:

S = P$T

– Every i, Zi ≤ |P |

– Every i > |P |+ 1 and Zi = |P |records the occurrence of P in T

• Question: running time to compute all the Zi’s ???

The naive method runs in Θ((n + m)2) time ...

142



• Goal: linear time to compute all the Zi’s ...

• Z-box: for Zi > 0, it is box starting at i and ending at i+Zi−1

• ri — largest j + Zj − 1 over all 1 < j ≤ i such that Zj > 0

Zì ì rii

ì — the j with Z-box ending at ri (tie breaks arbitrarily)

• Computing Zk:

– given Zi for all 1 < i ≤ k − 1

– r = rk−1

– ` = `k−1

1. k > r: compute Zk explicitly (updating accordingly r and` if Zk > 0)

2. k ≤ r: k is in the Z-box starting at ` (equivalently sub-string S[`..r])

therefore, S[k] = S[k − ` + 1], S[k + 1] = S[k − ` + 2], . . .,S[r] = S[Z`]

in other words, Zk ≥ max Zk−`+1, r − k + 1(a) Zk−`+1 < r − k + 1: Zk = Zr−k+1 and r, ` remain un-

changed

(b) Zk−`+1 ≥ r− k +1: Zk ≥ r− k +1 and start comparisonbetween S[r + 1] and S[r − k + 2] until a mismatch isfound (updating r and ` accordingly if Zk > r − k + 1)

143



• Conclusions:

1. Zk is correctly computed

2. there are a constant number of operations besides com-parisons for each k

– |S| iterations

– whenever a mismatch occurs, the iteration terminates

– whenever a match occurs, r is increased

3. in total at most |S| mismatches and at most |S| matches

4. running time Θ(|S|)

5. Space complexity also Θ(|S|)

Theorem: There is a Θ(|S|)-time Θ(|S|)-space algorithm which com-putes Zi for all 1 < i ≤ |S|.

Corollary: There is a Θ(n + m)-time Θ(n + m)-space algorithm whichfinds all the occurrences of P in T , where n = |P | and m = |T |.

• Notes on the algorithm:

– alphabet-independent

– space requirement can be reduced to Θ(n)

why ??? how ???

– not well suited for multiple patterns searching ...

– strictly linear — every letter in T has to be compared atleast once

• more algorithms to be introduced / designed ...

144


An example:

• P = abxabxab, T = dababxababxabxab

• Case 1:

d a a a b x a b a b x a b x a b

a b x a b x a b

a b x a b x a b

• Case 2:

d a a a b x a b a b x a b x a b

a b x a b x a b

a b x a b x a b

• Notes:

– right-to-left comparison

– won’t miss any occurrence

– some T [i] won’t be compared — achieving sublinear time

(in the above example, T [2] and T [1])

145


The Boyer-Moore algorithm:

• Left-to-right shifting (like the naive algorithm)

• Rule 1: Right-to-left comparison

• Rule 2: Bad character rule

– for each x ∈ Σ, R(x) denotes the right-most occurrenceof x in P (0 if doesn’t appear)

– when a mismatch occurs, T [k] against P [i], shift P rightby max 1, i−R(T [k]) places

this makes T [k] against P [R(T [k])], if to the left

– |Σ| space to store R-values

– not saving anything if R(T [k]) to the right

Extended bad character rule: try the rightmost (but tothe left) occurrence of T [k]

require n space

• Rule 3: Good suffix rule

– when a mismatch occurs, T [k] against P [i]

– find the rightmost occurrence of P [(i + 1)..n] in P suchthat

the letter to the left differs P [i]

– shift P right such that this occurrence of P [(i + 1)..n] isagainst T [(k + 1)..(n + k − i)]

– if there is no occurrence of P [(i + 1)..n]:

find the longest prefix of P matches a suffix of P [(i+1)..n]

shift P right such that this prefix is against the corre-sponding suffix

146

Lecture 14: Exact String Matching

Agenda:

• 3rd algorithm — Knuth-Morris-Pratt algorithm

• Applications

Reading:


147


The Knuth-Morris-Pratt algorithm:

• History:

– best known

– not the method of choice, inferior in practice

– nonetheless can be simply explained and proved

– basis of Aho-Corasick algorithm for multiple pattern search

• Example,

∗

a b x y a b x z w

a b x y a b x z w

– (back to) left-to-right comparison rule

– shift P more places without missing any occurrence

• Definition: spi(P ),2 ≤ i ≤ n (sp1(P ) = 0)

the length of longest proper suffix of P [1..i] that matches aprefix of P

a b x y a b x z w

sp1 = sp2 = sp3 = sp4 = 0sp5 = 1, sp6 = 2, sp7 = 3sp8 = 0, sp9 = 0

148


The Knuth-Morris-Pratt algorithm (2):

• Definition: sp′i(P ),2 ≤ i ≤ n (sp1(P ) = 0)

the length of longest proper suffix of P [1..i] that matches aprefix of P ,

with the additional condition that character P [i + 1] differsfrom P [sp′i + 1].

a b x y a b x z w

Obviously, sp′i ≤ spi for any i ...

sp′1 = sp′2 = sp′3 = sp′4 = 0sp′5 = 0sp′6 = 0sp′7 = 3sp′8 = 0sp′9 = 0

• Shifting rule: shift P to the right (i− sp′i) spaces (i = n if anoccurrence)

• Effect:k∗

a b x y a b x z w

(i + 1)

a b x y a b x z w

(sp′i + 1)

T [(k − sp′i)..(k − 1)] matches P [1..sp′i]

and thus can skip sp′i comparisons ...

149



• Correctness

• Zj — length of the longest common prefix of P and P [j..n]

sp′i — length of the longest proper suffix of P [1..i] that matchesa prefix of P , with the additional condition that characterP [i + 1] differs from P [sp′i + 1]

x y

6

i

6

sp′i

?

j

?

Zj

?

j′

?

Zj′

Therefore,

sp′i = maxjZj | Zj = i− j + 1

• Preprocessing sp′i’s in linear time Θ(n)

150



• Running time:

– in total s phases (of comparison/shift), s ≤ m

– every 2 consecutive phases overlap one letter (the mis-matched one) from T

– Therefore, in total m + s ≤ 2m comparisons

Question: any letter from T is skipped for comparison?

• A real-time algorithm (def):

any letter is compared at most a constant times

• KMP is NOT a real-time algorithm

Question: can you provide an example?

• Converting KMP into a real-time algorithm — more detailedpreprocessing

Exercise: figure out the details.

151


The Knuth-Morris-Pratt algorithm (5)

— a way to multiple pattern search:

The initial preprocessing idea:

• Preprocessing to compute spi values

spi — length of the longest proper suffix of P [1..i] that matchesa prefix of P

sp′i — length of the longest proper suffix of P [1..i] that matchesa prefix of P , with the additional condition that characterP [i + 1] differs from P [sp′i + 1]

sp′i =

spi, if P [spi + 1] 6= P [i + 1]sp′spi, otherwise

x y

6

i

6

spi

• Computing spi+1:

y

6

i

6

spi

152



— the initial preprocessing


spi+1 =

spi + 1, if P [spi + 1] = P [i + 1]spspi+ 1, if P [spspi+ 1] = P [i + 1]spspspi+ 1, if P [spspspi+ 1] = P [i + 1]. . . ,

• An example,

a b x a b q a b x a b r a b x a b q a b x a b x

6

k + 1

6

k

153





spi+1 =


• An example,


6

k + 1

6

k

6

spk

6

spspk

6

spspspk

?

spk+1

' $' $#

154





spi+1 =


• An example,


6

k + 1

6

k

6

spk

6

spspk

6

spspspk

?

spk+1

' $' $# ' $

• Preprocessing can be done in linear time

Proof.

155


Keyword tree:

• Given a set of patterns P = P1, P2, . . . , Pk, the keyword treeK is a tree:

– rooted, directed

– each edge is labeled with one letter

– edges coming out of a node have distinct labels

– every pattern Pi is spelled out

– every leaf maps to some pattern

• A keyword tree containing one keyword

P1 = abxabqabxabrabxabqabxabx:

HHHH

HHHHHH

HHHHHH

HHHHHH

HHHHHHH

HHHHHH

x x x x x x x x x x x x x x x x x x x x x x x x x

ab

xa

bq

ab

xa

br

ab

xa

bq

ab

xa

bx

1

156


Keyword tree (2):

• A keyword tree containing multiple keywords:

P1 = p i n k p i gP2 = p i gP3 = p i n p o i n tP4 = p a r k

~

~

@@@~

~~

~

@@@~ ~

@@@~

~

~~

~@@@~@@@~@@@~@@@~

• A keyword tree can be constructed in linear time (in the sumof lengths of patters).

— provided the alphabet is fixed (and thus of constant size)

157


Multiple pattern matching problem:

• Given a set of pattern P = P1, P2, . . . , Pk and

a text (database) T

— find all the occurrences of all the patterns ...

• n — sum of the lengths of patterns

m — length of text

• Previous results imply a search algorithm in Θ(n + km) time

there are algorithms running in Θ(n + m + `) time, where

` — total number of occurrences of all the patterns

• Using keyword tree of P ...

• An idea to speed up — from KMP

HHHHHH

HHHHHH

HHHHHH

HHHHHH

HHHHHH

HHHHH

x x x x x x x x x x x x x x x x x x x x x x x x x

ab

xa

bq

ab

xa

br

ab

xa

bq

ab

xa

bx

1

$$

$ $

158


Multiple pattern matching problem (2):

• The keyword tree preprocessing:


~

~

@@@~

~~

~

@@@~ ~

@@@~

~

~~

~@@@~@@@~@@@~@@@~

• lp(v) — the length of the longest proper suffix of string L(v)that is a prefix of some pattern in P

• lp(v) for all the v’s in K can be computed in linear time —Θ(n)

• failure links v → nv

159


Aho-Corasick algorithm:

• Create the keyword tree

• Computer the lp(v) and nv for every v in the keyword tree

• During the searching

– every occurrence is reported Θ(`)

– whenever a mismatch, T shifted lp(v) spaces to the left

– whenever a match, T [i] never compared again

• Running time Θ(m + `)

• Therefore, in total Θ(n + m + `) time

160


Aho-Corasick algorithm (2):

• The keyword tree:


~

~

@@@~

~~

~

@@@~ ~

@@@~

~

~~

~@@@~@@@~@@@~@@@~

• T = pinkpinkpig

161


Exact string matching applications:

• Sequence-tagged-sites

• Exact string matching with wild cards

• Two-dimensional exact matching

• Regular expression pattern matching

Reading: Ref[Gusfield, 1997], pages 61–66

162

Lecture 15: Suffix Tree

Agenda:

• Introduction

• Construction

• Applications

Reading:


163


Introduction:

• Given a finite alphabet Σ, a string S of length m

E.g., S = abxabc

• Suffix tree of S:


– edges labeled by (non-empty) substrings of S

— could be a single letter

– edges coming out of a node start with distinct letters

– exactly m leaves

– leaf i spells out suffix S[i..m]

• Keyword tree for a set of patterns:


– edges labeled with letters

– edges coming out of a node have distinct labels

– every pattern is spelled out

– every leaf maps to some pattern

164


Introduction (2):

• Given a finite alphabet Σ, a string S of length m

E.g., S = xabxac

• Suffix tree of S:


– edges labeled by (non-empty) substrings of S

— could be a single letter

– edges coming out of a node start with distinct letters

– exactly m leaves

– leaf i spells out suffix S[i..m]

1@@@@@@@@@@@@@

2

3

6

c

zAAAAA

xa bxac

c

4

za

bxac

c 5

bx

ac

• Have to assume spm−i+1(S[i..m]) = 0 ... Why ???

Solution: if spm−i+1(S[i..m]) 6= 0, append to it an extra letter$ /∈ Σ ...

165


First construction:

• Assuming now spm−i+1(S[i..m]) = 0

• Make every suffix as a pattern Pi = S[i..m]

• Apply the linear time keyword tree construction algorithm

• Concatenate “paths” into “edges”

• Running time:

linear in the sum of the lengths of patterns —∑m

i=1|Pi|

=m(m + 1)

2

— a Θ(m2) construction algorithm :-(

• Next goal: design a linear time algorithm Θ(m)

166


Why suffix tree — 1st application:

• Suppose in Θ(m) time we can build the suffix tree for text T

• Given any pattern P — at any time

– match letters of P along the suffix tree

– . . . until

– either no more matches are possible

P doesn’t occur anywhere in T

– or P is exhausted

the labels of the leaves inside the subtree under the lastmatching edge are the starting positions of the occur-rences

• Conclusion:

– Exact string matching done in Θ(m + n + `) time

– Exact multiple strings matching done in Θ(m+n+`) time

` — number of occurrences

Other applications:

• Multiple keyword search

• Longest repeating substring

• Longest common substring of two (or more) strings

167


Ukkonen’s linear time construction:

• Implicit suffix tree for S$ — from the suffix tree

1. remove every copy of $

2. remove every edge without labels

3. remove the degree-2 internal node (except the root)

need to concatenate the edge labels ...

E.g., S = xabxa

1@@@@@@@@@@@

2

3

6

$

yAAAA

xa bxa$

$4

ya

bxa$

$5

bx

a$

1@@@@@@@@@@@

2

3

xabxa

ab

xa

bx

a

• Tm contains almost all the information about the suffix tree

168


Ukkonen’s linear time construction (2):

• Ti — implicit suffix tree for S[1..i]$

• Construct Ti incrementally

– from Ti to Ti+1

– need to add S[i + 1] to every suffix of S[1..i] and theempty string

there are i + 1 of them ...

– append S[i + 1] to suffix S[j..i] — becoming a suffix ofS[1..(i + 1)]

j = 1,2, . . . , i, i + 1

1. if S[j..i] ends at a leaf, append S[i + 1] to the corre-sponding edge label

2. if S[j..i] ends at an internal node

(a) if there is an edge out of the node with label beginswith S[i + 1], done

(b) otherwise add an edge and a leaf, with edge label‘S[i + 1]’

3. if S[j..i] doesn’t end at any node

(a) if the succeeding letter in the edge label is ‘S[i+1]’,done

(b) otherwise add an internal node right after (break intotwo edges) and an edge adjacent with a leaf. labelfor the new edge is ‘S[i + 1]’

• Straightforward implementation Θ(m3)

169


Ukkonen’s linear time construction (3) — speedup:

• How?

• Key thing: locate the ending position of S[j..i] in Ti

S[1..5] = axaxb, try adding S[6] = a:

12

3

@@@@4

5

ax axbx

ax

b

b

b

b@@

@@I

– j = 1 — easy (use a pointer pointing to the longest pathin Ti)

append a to the edge label (entering the leaf spells outS[1..i])

– denote the leaf edge as (v,1)

1. if v is the root: j = 2 is done straightforwardly

2. if v isn’t the root:

there is another node, denoted s(v), such that if root-to-v spells out S[1..`] then root-to-s(v) spells out S[2..`]

3. when we have the information on s(v), continue searchfrom it

not necessarily from the root again ...

170



• S[1..5] = axaxb, try adding S[6] = a:

12

3

@@@@4

5

ax axbx

ax

b

b

b

b

$'

?@

@@@I

In this example, save a few comparisons between x and x ...

• Notes:

– need to record s(v) if a new node v is created

– it doesn’t give a “faster” algorithm

• Trick 1 — skip/count

In this example, we know that a is appended 3 letters afterS[1..`] = ax

Therefore, if we can record for every edge, the length of itslabel ...

171



• Analysis of trick 1:

– every node has a depth — # of nodes on the path fromroot

– the depth of v is at most one greater than the depth ofs(v)

– constructing Ti+1 from Ti:

∗ decrease node depth at most 2m

∗ could increase a lot but bounded by 3m — why?

since the maximum depth is m

so construction done in O(m) time

Applying Trick 1 gives an O(m2) time suffix tree building al-gorithm

• One observation to overcome the Θ(m2)-barrier

total number of letters in the edge labels could reach Θ(m2)...

• An alternate way to represent labels

using position intervals [start, end] to represent label S[start..end]

172

Lecture 17: Genome Rearrangements

Agenda:

• Introduction to genome-level mutations

• Reversals (or Inversions)

• Signed Reversals

Reading:



• Hannenhalli & Pevzner. Transforming cabbage intoturnip: polynomial algorithm for sorting signed per-mutations by reversals. Journal of the ACM, 46(1999),1–27.

173


Gene-level mutations vs genome-level mutations

• Gene-level mutations:

– single nucleotide/amino acid substitution

– single nucleotide/amino acid inser-/dele-tion (space)

– block of nucleotides/amino acids inser-/dele-tion (gap)

• Gene-level mutations reflect:

– how does a single gene evolve over the time

– how are a family of genes related to each other

• Genome-level (chromosome-level) mutations:

– duplication with modification

– reversal

– transposition

– translocation (special cases: fusion and fission)

• Genome-level (chromosome-level) mutations reflect:

– evolution of the whole genome (might not be seen at thegene-level comparisons)

– how do species diverge and relate to each other

– deeper evolutionary relationship

174


Genome-level mutations: an example

Complete mitochondrial genomes:

gene name label

ND2 1COX1 2COX2 3ATP8 4ATP6 5COX3 6ND3 7ND5 8ND4 9ND4L 10ND6 11CYTB 12ND1 13

Anopheles gambiae (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)Apis mellifera ligustica (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)Artemia franciscana (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)Drosophila yakuba (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)Balaenoptera musculus (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 11, 12)Balaenoptera physalus (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 11, 12)Bos taurus (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 11, 12)Cyprinus carpio (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 11, 12)Didelphis virginiana (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 11, 12)Halichoerus grypus (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 11, 12)Mus musculus (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 11, 12)Phoca vitulina (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 11, 12)Xenopus laevis (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 11, 12)Gallus gallus (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 12, 11)Strongylocentrotus purpuratus (13, 1, 2, 10, 3, 4, 5, 6, 7, 9, 8, 11, 12)Paracentrotus lividus (13, 1, 2, 10, 3, 4, 5, 6, 7, 9, 8, 11, 12)Petromyzon marinus (12, 13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 11)

175


The general problem:

• Assumptions:

– a same gene found from both species

– ignore the gene sequence difference — identical

– ignore multiple copies — use one copy

– label genes using integers

For example,

Anopheles gambiae (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)Gallus gallus (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 12, 11)

• Question:

– how were these two genomes evolved from some commonancestral genome?

(which also contains the same set of genes)

– equivalently, how was one genome evolved from the other?

• A reduction:

– assuming one species genome is the identity permutation...

– how was a permutation evolved from the identity permu-tation?

• Measurement:

Applying the parsimony rule:

Nature usually takes the path with least resistance, so thescenario with the least amount of genome-level changes islikely to be the most probable one,

176


The concrete technical problems:

• genome rearrangements by reversals only

• genome rearrangements by transpositions only

• genome rearrangements by reversals and transpositions

• genome rearrangements by all mutations

• signed genome rearrangements by reversals only

• signed genome rearrangements by transpositions only

• signed genome rearrangements by reversals and transpositions

• etc.

177


Genome rearrangements by reversals:

• Formal description of the problem:

– π = (π1, π2, . . . , πi−1, πi, πi+1, . . . , πj−1, πj, πj+1, . . . , πn) a per-mutation on (1,2, . . . , n)

– a reversal r(i, j), 1 ≤ i < j ≤ n, on π makes π into

π′ = (π1, π2, . . . , πi−1, πj, πj−1, . . . , πi+1, πi , πj+1, . . . , πn)

– given any permutation, find a shortest series of reversalsthat transforms π into the identity permutation I.

• An example,


(13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 12, 11) + R(9,11) −→(13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 11) + R(12,13) −→(13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) + R(2,13) −→(13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1) + R(1,13) −→(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)

Question 1: are 4 reversals really necessary?

Question 2: is it always possible to transform one permutationinto identity by reversals only?

178


Genome rearrangements by reversals (2):

• Question 2: is it always possible to transform one permutationinto identity by reversals only?

Answer: yes.

• Question 1: are 4 reversals really necessary?

Let us see.

• Definition: a breakpoint in permutation π occurs betweennumbers πi and πi+1, for 1 ≤ i < n, if and only if

|πi − πi+1| 6= 1.

There is a breakpoint at the front of the permutation if π1 6=1; and there is a breakpoint at the end of the permutation ifπn 6= n.

— Another way to define breakpoints is to add π0 = 0 andπn+1 = n + 1 to get a permutation on n + 2 numbers.

• Key observation:

– a permutation without breakpoints if and only if it is theidentity

– new goal: reduce the number of breakpoints to 0

– every reversal reduces this number by at most 2

Theorem Let d(π) denote the number of breakpoints in π. The number

of reversals to transform π into the identity is at leastd(π)

2.

179



• The example (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 12, 11):

The number of breakpoints in it is 5. So, it requires at least3 reversals.

• Idea in designing an algorithm:

Try to find a segment that, by reversing it we can reduce thenumber of breakpoints!!!

– definition: a strip is a maximal segment containing nobreakpoints inside

– the example (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 12, 11)contains 4 strips: (13), (1, 2, 3, 4, 5, 6, 7), (10, 9, 8),(12, 11)

– definitions: decreasing / increasing strip

– reversal intervals should contain a full strip — otherwisewill be creating new breakpoints

(can be shown true precisely — can you ???)

Lemma Whenever there is a decreasing strip, there is a reversalwhich decreases d(π) by at least 1. Furthermore, thereversal can be determined in linear time.

Proof.

– for the example (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 12, 11),the reversal is R(9,11)

180



• Designing an algorithm:

– need to consider the case where there is no decreasingstrips ...

easy —

when there is none, simply inverse an increasing strip (oflength ≥ 2)

– pseudocode:

Alg1π

while (d(π) > 0) doif (there is a decreasing strip) then

find the reversal to reduce d(π) by at least 1else

inverse one increasing strip of length at least 2

– analysis:

1. two consecutive reversals reduce d(π) by at least 1

2. at most 2d(π) reversals transform π into identity

3. at least we needd(π)

2reversals

Conclusion. Alg1 is a 4-approximation.

• Genome rearrangements by reversals is NP-hard.

• We answered Question 2 affirmatively:

Is it always possible to transform one permutation into identityby reversals only?

181



• Improve the above design:

idea 1: can we guarantee to reduce d(π) and, at the same time,leave a decreasing strip?

idea 2: if we can’t, can we guarantee to reduce d(π) by 2?

— such that on average, every reversal reduces d(π) byat least 1

• Search for the desired reversal:

– assuming there is a decreasing strip ...

– examine the smallest number, say πi, in any decreasingstrip

— try to remove the breakpoint involves πi and πi − 1

and leave a decreasing strip — if successful, we are done

– otherwise examine the largest number, say πj, in any de-creasing strip

— try to remove the breakpoint involves πj and πj + 1

and leave a decreasing strip — if successful, we are done

– otherwise, we claim that

1

6

j − 1

6

j

6

i

6

i + 1

6

n

6

0

6

n + 1

6

πi − 1

?

πj + 1

?

– therefore, reversal R(j, i) reduces d(π) by 2

182



• The above is a 2-approximation algorithm.

• This is a look-ahead style algorithm:

Whenever you execute one step, you also prepare for the nextstep ...

• More carefully,

There is a 1.75-approximation algorithm for genome rear-rangements by reversals.

• When genes have directions — signed genome rearrange-ments by reversals

Good news: polynomial time solvable (Hannenhalli & Pevzner)!

The speed has been improved to almost linear-time.

• A closely related problem: Breakpoint graph decomposition

– 1.5-approximation

– 1.46...-approximation

– 1.42...-approximation

– 1.375-approximation

183


Signed genome rearrangements by reversals (7):

Transforming the mitochondrial DNA of turnip into thatof cabbage by reversals.

184

Lecture 18: Genome Rearrangement

Agenda:

• More on reversals

• Transpositions

• Reversal & transpositions

Reading:

• Ref[JXZ, 2002]: Chap 6

• Ref[Pevzner, 2000]: pages 175–228

• A. Caprara. Sorting permutations by reversals andEulerian cycle decompositions. SIAM Journal onDiscrete Mathematics. 12(1999), 91–110.

• V. Bafna et al. Sorting permutations by transposi-tions. Proceedings of SODA’95. Pages 614–623.

• Q. Gu et al. A 2-approximation algorithm for genomerearrangements by reversals and transpositions. The-oretical Computer Science. 210(1999), 327–339.

185


Genome rearrangements by reversals:


– π = (π1, π2, . . . , πi−1, πi, πi+1, . . . , πj−1, πj, πj+1, . . . , πn) a per-mutation on (1,2, . . . , n)

– a reversal r(i, j), 1 ≤ i < j ≤ n, on π makes π into

π′ = (π1, π2, . . . , πi−1, πj, πj−1, . . . , πi+1, πi , πj+1, . . . , πn)

– given any permutation, find a shortest series of rever-sals that transforms π into the identity permutation I =(1,2, . . . , n− 1, n).

• An example,


(13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 12, 11) + R(9,11) −→(13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 11) + R(12,13) −→(13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) + R(2,13) −→(13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1) + R(1,13) −→(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)


Still trying to answer ...

Question 2: is it always possible to transform one permutationinto identity by reversals only?

Yes.

186


Breakpoint graph:

• Given a permutation π = (π1, π2, . . . , πn), extend it to haveπ0 = 0 and πn+1 = n + 1

• Definition: a breakpoint in permutation π occurs betweennumbers πi and πi+1, for 0 ≤ i ≤ n, if and only if

|πi − πi+1| 6= 1.

• Key observation:

– a permutation without breakpoints if and only if it is theidentity

– new goal: reduce the number of breakpoints to 0

– every reversal reduces this number by at most 2

Theorem Let b(π) denote the number of breakpoints in π. Let d(π)denote the minimum number of reversals to transform π intoI. Then,

d(π) ≥b(π)

2.

Theorem Genome rearrangement by reversals admits a 2-approximationalgorithm.

• Breakpoint graph G(π) of π:

– n + 2 vertices 0,1,2, . . . , n, n + 1

– a black edge connecting πi and πi+1 for every i : 0 ≤ i ≤ n

– a gray edge connecting πi and πj if |πi − πj| = 1

π = (0,13,1,2,3,4,5,6,7,10,9,8,12,11,14) and n = 13:

y y y y y y y y y y y y y y y187


Breakpoint graph (2):


π = (0,13,1,2,3,4,5,6,7,10,9,8,12,11,14) and n = 13:

y y y y y y y y y y y y y y y0 13 1 2 3 4 5 6 7 10 9 8 12 11 14

' $' $' $' $

• Properties:

– every vertex is incident with the same number of blackand gray edges

– therefore, exists an Eulerian alternating cycle

– therefore, exists an alternating cycle decomposition

– c(π) — the maximum number of cycles in any alternatingcycle decomposition

• For the above example, besides the 1-cycles (number of blackedges therein), we have one more alternating cycle:

y y y y y y y y y y y y y y y0 13 1 2 3 4 5 6 7 10 9 8 12 11 14

' $' $' $' $

Can show this is a maximum alternating cycle decomposition−→ c(π) = 10

188


Breakpoint graph (3):


π = (0,13,1,2,3,4,5,6,7,10,9,8,12,11,14) and n = 13:

has c(π) = 10.

Lemma. For every permutation π, apply one reversal to transform itinto π′. Then, |c(π)− c(π′)| ≤ 1.

Proof. By distinguishing the cases where the two black edgesinvolved are in one cycle or two cycles in a maximum alter-nating cycle decomposition.

• In other words: a reversal increases c(π) by at most 1.

• Observation: c(I) = n + 1.

Theorem. Let d(π) denote the minimum number of reversals to trans-form π into I. Then,

d(π) ≥ (n + 1)− c(π).

Note: (n + 1)− c(π) ≥b(π)

2— why ???

• For π = (0,13,1,2,3,4,5,6,7,10,9,8,12,11,14), c(π) = 10.

Therefore, d(π) ≥ 14− 10 = 4. So,


Yes. And thus that is an optimal series of reversals.

189


Further notes on genome rearrangement by reversals:

• For almost all biological instances, d(π) = (n + 1)− c(π)

– also called Sorting by Reversals (SBR)

– NP-hard and Max SNP-hard

– the best-of-the-art approximation ratio 1.375

(claimed, prior to this the best is 1.5)

• An independent research problem Breakpoint Graph Decom-position (BGD)

– (n!− (n− 5)!) instances out of n!: SBR = BGD

– NP-hard and Max SNP-hard

– the best-of-the-art approximation ratio is 1.4193

190


Genome rearrangements by transpositions:


– π = (π1, . . . , πi−1, πi, . . . , πj−1, πj, . . . , πk−1, πk, . . . , πn) a per-mutation on (1,2, . . . , n)

– a transposition t(i, j, k), 1 ≤ i < j < k ≤ n+1, on π makesπ into

π′ = (π1, . . . , πi−1, πj, . . . , πk−1 , πi, . . . , πj−1 , πk, . . . , πn)

– given any permutation, find a shortest series of transpo-sitions that transforms π into the identity permutationI = (1,2, . . . , n− 1, n).

• Questions:

1. Can we always transform one permutation into identityby transpositions only?

Yes. How ???

2. How to find a shortest series of transpositions ???

• The example (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 12, 11):

• Breakpoint:

There is one breakpoint if πi+1 − πi 6= 1, for 0 ≤ i ≤ n.

Note: different to the definition for reversal only

191


Genome rearrangements by transpositions (2):

• Idea in designing an algorithm:

Try to find a segment and a position so that, by transposingthe segment to that position, we can reduce the number ofbreakpoints!!!

– a strip is a maximal segment containing no breakpointsinside (therefore, increasing)

– the example (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 12, 11)contains 7 strips: (13), (1, 2, 3, 4, 5, 6, 7), (10), (9),(8), (12), (11)

– transposition segments should contain a full strip and theinserting position should be outside of any strip

— otherwise will be creating new breakpoints

(can be shown true precisely — can you ???)

Lemma Every transposition reduces the number of breakpointsb(π) by at most 3.

So, one lower bound isb(π)

3.

• The example permutation (13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 12,11) can be transformed into the identity via 4 transpositions...

Its lower bound is⌈

73

⌉= 3.

Question: is it optimal ???

192


Genome rearrangements by transpositions (3):

• Breakpoint graph G(π) of π the same as in the reversals onlycase:

π = (0,13,1,2,3,4,5,6,7,10,9,8,12,11,14) and n = 13:

has c(π) = 10.

Lemma. For every permutation π, apply one transposition to transformit into π′. Then, |c(π)− c(π′)| = 0 or 2.

Proof. By distinguishing the cases where the two black edgesinvolved are in one cycle or two cycles in a maximum alter-nating cycle decomposition.

• In other words: a transposition can increase c(π) by at most2.

• Observation: c(I) = n + 1.

Theorem. Let d(π) denote the minimum number of transpositions totransform π into I. Then,

d(π) ≥(n + 1)− c(π)

2.

• For π = (0,13,1,2,3,4,5,6,7,10,9,8,12,11,14), c(π) = 10.

Therefore, d(π) ≥ 14−102

= 2. So,

Question: Is 4 optimal?

Sorry, we are unable to answer ...

193


Further notes on genome rearrangement by transposi-tions:

• Not known yet if it is NP-hard ...

• Approximation algorithms

– using the above theorem on the lower bound, there is a1.75-approximation

– improved to 1.5

• There is NO signed genome rearrangement by transpositions...

Why ???

You can never flip the sign of a gene by transpositions only.

194


Unsigned genome rearrangement by reversals and trans-positions:

• For example permutation:

(13, 1, 2, 3, 4, 5, 6, 7, 10, 9, 8, 12, 11) + t(9,12,14) −→(13, 1, 2, 3, 4, 5, 6, 7, 12, 11, 10, 9, 8) + t(1,2,9) −→(1, 2, 3, 4, 5, 6, 7, 13, 12, 11, 10, 9, 8) + r(8,13) −→(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)

• It’s hard to have precise definition breakpoint, although it isstill a breakpoint if

|πi − πi+1| 6= 1.

So, b(π) ≥ 5.

• Can still have the breakpoint graph:

– c(π)

• Lower bounds:

– every operation decreases the number of the above specialbreakpoints, b0(π), by ≤ 3

– every operation increases c(π) by ≤ 2

– every operation increases co(π) by ≤ 2, where co(π) is themaximum number of odd cycles in any alternating cycledecomposition

Therefore,

d(π) ≥

b0(π)

3,n + 1− co(π)

2

.

• So, at least we have a 1.5-approximation algorithm ...

• Is it hard ???

195


Signed genome rearrangement by reversals and trans-positions:

• Transforming signed into unsigned by:

+i −→ (2i− 1,2i)

−i −→ (2i,2i− 1)

and each of the pairs is not allowed broken

• There is a breakpoint if

|πi − πi+1| 6= 1.

In breakpoint graph, there is a black edge iff there is a break-point; there is a gray edge iff the numbers differ by 1 andtheir positions are not adjacent.

Note: different from the previous definitions

• Alternating cycles, and c(π)

• New goal: eliminating breakpoints and cycles

b(I) = 0 and c(I) = 0

• Lower bounds:

– every operation decreases b(π) by ≤ 3

– every operation decreasing b(π) by 3 also decreases c(π)by 1

those 3 black edges belong to a 3-cycle

– can show: every operation decreases (b(π)− c(π)) by ≤ 2

Therefore,

d(π) ≥b(π)− c(π)

2.

196


Signed genome rearrangement by reversals and trans-positions (2):

• By carefully examination, we will be able to find a series of atmost (b(π)− c(π)) operations which eliminate all breakpoints(and thus all cycles) in π.

Theorem. SBRT admits a 2-approximation algorithm.

• Further notes:

– believed to be NP-hard, no proof yet ...

– 2-approximation is the best guarantee so far

• One more operation — reversal transposition:

– π = (π1, . . . , πi−1, πi, . . . , πj−1, πj, . . . , πk−1, πk, . . . , πn) a per-mutation on (1,2, . . . , n)

– a reversal transposition rt(i, j, k), 1 ≤ i < j < k ≤ n + 1,on π makes π into

π′ = (π1, . . . , πi−1, πj, . . . , πk−1 , πj−1, . . . , πi , πk, . . . , πn)

• Genome rearrangement by reversals, transpositions, and re-versal transpositions

– hardness open

– the best-of-the-art: 1.75-approximation algorithm

• Transposition regarded as an exchangement

Another operation: double reversal rr(i, j, k):

π = (π1, . . . , πi−1, πi, . . . , πj−1, πj, . . . , πk−1, πk, . . . , πn) →π′ = (π1, . . . , πi−1, πj−1, . . . , πi , πk−1, . . . , πj , πk, . . . , πn)

197

Lecture 20: Protein Structure Prediction

Agenda:

• Overview of protein structure/function prediction

• Sequence homology (BLAST, PsiBLAST)

• Structure homology (fold recognition by threading)

Reading:

• Ref[JXZ, 2002]: chapters 16-18.

• D. Baker and A. Sali. Protein structure predictionand structural genomics. Science. (2001) 294:5,93–96.

• S. F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database searchprograms. Nucleic Acids Research. (1997) 25,3389 – 3402.

• Y. Xu et al. An efficient computational method forglobally optimal threading. Journal of Computa-tional Biology. (1998) 5, 597–614.

198


A protein (3D or tertiary) structure:

See www.cmbi.kun.nl/gvteach/alg/infopages/proteins.shtml for adetailed atomic decomposition.

Secondary structure involves α-helix, β-sheet, and β-

turn.

199


Structure prediction:

• Big reasons for prediction:

– a huge number of proteins (e.g., 300,000 human proteins)

– a small number of structures been experimentally deter-mined

accurate (high resolution)

– experimental structure determination is very expensiveand slow

– not all protein structure necessarily determined experi-mentally

• Roughly, 3 different approaches for prediction:

– NMR spectroscopy and X-ray crystallography

– comparative modeling

– De novo structure prediction

• 4 steps in structure modeling (including comparative modelingand threading):

– find related known structures — templates

– compute sequence-template alignment

– build structure model

– assess the structure model

200


NMR and X-ray crystallography:

• They are both experimental methods using knowledge frombiology, physics, chemistry, biochemistry, and mathematics

• NMR spectroscopy

– protein purification

– spectra generation

– peak picking

– peak assignment

– structural information extraction

– structure calculation

Requires nearly complete and accurate peak assignment, whichis hard

• X-ray crystallography

– crystal preparation

– X-ray diffraction collection

– structure calculation

Requires high quality crystal, which is hard too

• None of them is a high-throughput technology

• NMR can provide the structure in near physiological condition

201


Structure modeling:

• 4 steps:

– find related known structures — templates

– compute sequence-template alignment

– build structure model

– assess the structure model

• Comparative modeling:

– templates can be found by

1. sequence comparison methods such as PSI-BLAST

2. structure comparison methods

– compute a sequence-structure alignment minimizing the‘energy’:

1. using sequence alignment (homology search)

2. threading (next lecture)

– produce all-atom model

1. cores, loops, side-chains

2. approximate positions of the conserved atoms

202


De novo prediction:

• No templates, predict from scratch

• Assumption: native structure is at the global free energy min-imum

• Therefore, search limited to the conformational space withlow global free energy

Accuracy:

approach req. seq. identity model accuracyNMR, X-ray - 1.0A

sequence > 50% 1.5Acomparative threading > 30% 3.5A

threading < 30% high errorde novo insignificant 4-8A

Applications:

• understand the functions/structure

• function repair, drug design

• etc.

203


Protein function prediction:

• Through structure prediction

• Indirectly through structure prediction

– protein backbone structure prediction

via NMR, X-ray

– threading to predict function

– sequence similarity — homology search

– key functional region identification

1. de novo structure prediction

2. function motif recognition / learning

E.g., the basic EF-hand consists of two perpendicular10 to 12 residue alpha helices with a 12-residue loopregion between, forming a single calcium-binding site(helix-loop-helix). Calcium ions interact with residuescontained within the loop region. Each of the 12residues in the loop region is important for calciumcoordination.

3. structure motif recognition / learning

204


Sequence homology search tool — Psi-BLAST:

• Position-Specific-Iterated BLAST

• “http://www.ncbi.nlm.nih.gov/BLAST/”

• One protein sequence:

>tetrahymena01: 378aa

MILFKKLLIQ KKVNYLSRLL MIARHILKLQ NLAKTTPFFR FSEFNETESSFNHNENRPQR VQKPRFKKVA IILSGCGVYD GSEVTEVVSL MVHLNKSHVSFQQEINTFYS FLELIKNFKT QFQNIFLNQN RCFAPNQDQL HVVNHITGETTTETRNVLVE SARIARGEVK DITQLKGEDY QAVLLPGGFG AAKNLSDYAVNGTNFTVNSE VERVLREFHS QKKPIGAMCI SPLILAKVLQ NDNLNHHGRVDNQDTPNKCW PHSEAINAAE TLGAQHFQRN ADRFQVDFEN KIVTAPAMMFNGFTKFSTTF DGAGHLVNII DQIVQGNSIK LDTIGEKKHE NFDGERRPRGQYNRNNNNRR PRENNNENHE QHHHHEQQ

• Items need to check:

– Psi-BLAST vs. BLASTp

– input protein sequence format: FASTA

– database selection

– score scheme

– # iterations

– score, Z-score, E-value

– sequence identity

– PSSM generating and using

205


Protein threading:

• Knowledge-based force field– statistics on a database of known proteins– building a force field– collecting a set of possible conformations for peptides of

length L– measuring the energy for query protein (of length L) in

each conformation– ranking the energy (the lowest on the top)

• Details:– s — discrete distance between two atoms c and d– c and d — atoms from two amino acids a and b, at dis-

tance k– the energy associated is

Eac,bdk (s) = −kT ln

[1

1 + σmab

+σmab

1 + σmab

gac,bdk (s)

gc,dk (s)

],where

mab — relative frequency of amino acid a and b at distancekgac,bd

k (s) — relative frequency of atoms c and d at distancesgc,d

k (s) — relative frequency of atoms c and d at distances averaged over all amino acid pairsσ — weight of observation pair a and bk — Boltzmann’s constantT = 293

– energy in one conformation is:

E(S, C) =∑

ij

Eac,bdk (sij),where

S — query sequenceC — conformationi, j — indices of atoms c and d, respectively

206


Protein threading (2):

• Lattice Monte Carlo model

– global minimum (native structure) state is known in themodel

– global minimum state on the potential surface

– folding starts from random-coil state to a random semi-compact globule

– folding from semi-globule state to the native fold

– complexity: 1016 → 1010 → 103

– calculate the energy for every possible state and outputthe minimum

207



• Comparative protein modeling (Modeller):

– compose a database of proteins with known 3D structure

– align the query sequence with “related” known 3D struc-tures

– derive the distance and dihedral angle constraints on thequery sequence

– combine these constraints and energy terms into an ob-jective function

– optimize the objective function, for example by

1. non-linear optimization

2. molecular dynamics + simulated annealing

– evaluate the fold

• Two steps of details

– constraint deriving:

1. conditional probability summarized from the database

2. using least-square to fit into the histogram

– optimization:

1. typically there are thousands of constraints

2. can model atoms from the sidechains as well

• Note: alignment is critical to the success of prediction

208



• Template/Sequence profile (GenThreader):

– database consists of unique protein chains (templates)from PDB

– run each template on a much larger database to constructa profile

– pairwise align query sequence with a profile

– evaluate the alignment using potentials:

1. pairwise potential:

Eac,bdk (s) = −kT ln

[1

1 + σmab

+σmab

1 + σmab

gac,bdk (s)

gc,dk (s)

](you have seen this before)

2. solvation potential:

Easol(r) = −RT ln

(fa(r)

f(r)

),where

r — degree of residue a burial

fa(r) — frequency of occurrence amino acid a withburial r

f(r) — frequency of occurrence all amino acids withburial r

– evaluation through a neural network:

1. input layer: pairwise energy, solvation energy, align-ment score, alignment length, template length, querysequence length

2. hidden layer

3. output layer: templates related, templates unrelated

209

Lecture 21: Protein Fold Recognition

Agenda:

• Protein threading — PROSPECT

Reading:

• Ref[JXZ, 2002]: chapters 17-18.

• Y. Xu et al. An efficient computational method forglobally optimal threading. Journal of Computa-tional Biology. (1998) 5, 597–614.

• Y. Xu & D. Xu. Protein threading using PROSPECT:design and evaluation. Proteins: Structure, Func-tion, and Genetics. (2000) 40, 343–354.

210


Protein threading (5) — PROSPECT:

• Main idea: divide-and-conquer

• Details:

– again, thread query protein into template folds

– find the fold achieving minimum free energy

– computational recognition of native fold

– can be used to construct the structure

– mainly focus on backbone structure for function predic-tion

• Factors: — what are counted into the free energy ?

– singleton fitness (into the local environment)

– pairwise interaction (non-adjacent close residues)

– gaps

• How to compute:

– sequence-to-template alignment

– assume additive sum of energy terms

– no gaps inside cores

– divide-and-conquer

211


Protein threading — PROSPECT (2):

• A schematic view:

template

cores

sequence

f f f f f f f f f f f f

' $' $' $' $' $interactions

6

aligned portion

BBBM

gap

Here, (secondary) core = α-helix or β-sheet.

• The goal:

To compute an optimal partition of the query sequence suchthat the corresponding sequence-to-template alignment achievesthe minimum free energy.

212



• Singleton fitness:

Represents a residue’s local preference of the secondary struc-ture (helix, sheet, loop) and the solvent environment (theextent of exposure to solvent).

• Energy term:

esingle(aa, ss, sol) = −kT log(

N(aa, ss, sol)

NE(aa, ss, sol)

),where

aa — amino acid type

ss — secondary structure type

sol — extent of exposure to solvent

k — Boltzmann’s constant

T — temperature 295

N(aa, ss, sol) — # amino acid of type aa in secondary struc-ture ss with extent sol of exposure to solvent, counted fromthe database

NE(aa, ss, sol) — expected # amino acid of type aa in sec-ondary structure ss with extent sol of exposure to solvent,counted from the database:

NE(aa, ss, sol) =N(aa)×N(ss)×N(sol)

N2

There are 3 types of solvent accessibility: buried (< 9.5%), in-termediate (< 48.6%), and exposed (≥ 48.6%) — calculatedby ACCESS

213



• Pairwise interaction:

Measures the pairwise potential between spatially close non-adjacent residues.

• Energy term:

epair(aa1, aa2) = −kT log(

M(aa1, aa2)

ME(aa1, aa2)

),where

aa1, aa2 — amino acid types

M(aa1, aa2) — # of pairs of amino acid types aa1 and aa2 inthe database that interact

ME(aa1, aa2) — expected # of pairs of amino acid types aa1

and aa2 in the database that interact:

ME(aa1, aa2) =M(aa1)×M(aa2)

M.

• Spatially close:

– for two non-adjacent residues

– the distance between their β carbon (Cβ) atoms is ≤ D

– D usually ranges from 7 A to 15 A

214



• Mutation energy:

Describes the compatibility of substituting one amino acidtype by another during the evolution.

• Energy term:

Use PAM250

X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0Z 0 0 1 3 -5 3 3 -1 2 -2 -3 0 -2 -5 0 0 -1 -6 -4 -2 2 3B 0 -1 2 3 -4 1 2 0 1 -2 -3 1 -2 -5 -1 0 0 -5 -3 -2 2V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10W-6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -517T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2P 1 0 -1 -1 -3 0 -1 -1 0 -2 -3 -1 -2 -5 6F -4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6K -1 3 1 0 -5 1 0 -2 0 -2 -3 5L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6I -1 -2 -2 -2 -2 -2 -2 -3 -2 5H -1 2 2 1 -3 3 1 -2 6G 1 -3 0 1 -3 -1 0 5E 0 -1 1 3 -5 2 4Q 0 1 1 2 -5 4C -2 -4 -4 -512D 0 -1 2 4N 0 0 2R -2 6A 2

A R N D C Q E G H I L K M F P S T W Y V B Z X

215



• Gap penalty

Gap length is the length difference between a loop in thetemplate and its aligned (loop) region in the query sequence.

• Affine gap penalty:

egap = 10.8 + 0.6× (gap length− 1).

• Total energy:

Etotal = c1 × Esingle + c2 × Epair + c3 × Emutation + c4 × Egap

Need to learn coefficients ci: How ???

216



• The threading algorithm:

The base step — how to alignment a subsequence to a core

• Cutting a core out of the template creates a set of half-links

• Links represent the pairwise interaction

• For one half-link, need to get the information:

the amino acid that the half-link should connect to

• When computing the alignment, impose a constraint that thehalf-link and its the other part can be connected together —

— meaning that both half-links should be assigned a sameamino acid

• The reason we only need to record amino acid type (not itsresidual position) is:

Positional constraints are counted into the gap penalty ...

• Gaps are restricted to the loop area (and the tails of cores)

• Try all (in total linear number of) the possibilities and get theoptimal

f f f f f f f f f f f f f fv v v v v v v v v v v v v v v v v v v

217



• The threading algorithm:

Analysis of the base step, suppose

– core (tp, tq)

– half-link set X

– subsequence (si, sj) — j − i ≥ q − p

For every feasible assignment A for half-link set X, try all thepossible cutting point k : i ≤ k ≤ j − (q − p)

Θ((

(j − i)− (q − p))(q − p)

)• The general recurrence:

When cutting a link, one half is open

— can be assigned with any amino acid and the pairwiseinteraction energy is calculated;

The other half is closed

— again can be assigned any amino acid, but the energy is+∞ if it doesn’t agree with the residue that the link links to

Therefore, these force the link to link two residues with pair-wise energy counted exactly once.

• Complexity:

– need to examine every possible cutting point — divide

– need to create the assignments for links cut during thedivide stage

– C — maximum # links being cut over all possible posi-tions

Θ(mn20C)

218



• Overall running complexity Θ(mn20C), so far ...

• Careful dividing strategy to minimize C — preprocessing tem-plates

• Another observation for speed up:

No gaps inside the cores

• If two links linking to tp and tr within a same core, then theymay produce up to Θ((j − i)), instead of Θ(202), differentpairs of half-links.

−→ merge them into one (half-)link, if both are open (orclosed).

• Need careful dividing strategy to minimize this new parameterC for every template ...

Can be done by Dynamic Programming (exponential time inthe template length)

– preprocessing

– once get the dividing strategy, use it forever in the futurethreading

– for most of the templates, C ≤ 4

while some might be as high as 8

– C increases with distance threshold D increases

• Overall running time now: Θ(mn× n1.5C)

219



• Database construction:

– collect more than 2,000 templates from FSSP/PDB (alsoSCOP library)

– secondary structure information

– solvent accessibility information

– preprocessing to compute the optimal dividing strategyfor every template

• Accepts additional information to assist threading

– PSSM

– secondary structure preference

– solvent accessibility preference

• Execution details:

– excluding pairwise interaction to speed up first-round thread-ing

then enter the second round detail threading using pair-wise interaction

– output sequence-to-template alignment

– assess the alignment through a neural network (confi-dence)

– alignment score, fitness score for every energy term, Z-score, compact factor, structure construction using tem-plate structure(s)

220

Lecture 23: Phylogeny Reconstruction

Agenda:

• Maximum parsimony

• Perfect phylogeny

• Neighbor-joining

Reading:



221


A phylogeny (also called evolutionary tree):

Sumatran Orangutan

Orangutan

Chimpanzee

Pygmy Chimpanzee

Human

Gorilla

Gibbon

Cow

Blue Whale

Finback Whale

Horse

White Rhine

Grey Seal

Barber Seal

Cat

House Mouse

Ratrodents: a branch

ferungulates

primates

extinct species

• What data?

Mitochondrial genomes

• Which method?

222


An example - Out of Africa Theory:

223


Phylogenetic analysis from Cann et al., 1987, Nature,based on restriction mapping of mitochondrial DNAs(mtDNAs) of 147 people.

Similar results have been obtained using polymorphismson Y chromosomes.

224


Factors in phylogeny reconstruction:

• Data:

– character data — morphological features, aligned sequencecharacters, polymorphic genetic features

– molecular sequence — DNA, RNA, protein

– distance — estimated (evolutionary) distance betweenTAXA based on some model of evolution

• Methods:

– parsimony, compatibility, distance-based, maximum like-lihood, quartet, phylogenetic root

• Rooted vs unrooted phylogeny:

A

B C

D

E

@@@@

@@

@@C

A B

D E

HHHH

HH

HHHH@@

HHHH

• Weighted vs unweighted phylogeny:

Weight gives the evolutionary distance between two TAXA

225


Parsimony:

Input usually character data:

species site1 2 3 4 5 6

Aardvark C A G G T ABison C A G A C A

Chimp C G G G T ADog T G C A C T

Elephant T G C G T A

• How to evaluate a given tree? — minimum number of muta-tions

• How to search the tree space?

Small Parsimony Problem: given the phylogeny with TAXA ar-ranged at the leaves, compute the internal nodes to minimize thetotal number of mutations.

C C C T T

ZZZZZZZZZZ

@@@

@@@@@

1 mutation at site 1.

AardvarkBison Chimp

DogElephant

ZZZZZZZZZZZ

@

@@

@@@@@@

8 mutations at 6 sites.

226


Fitch’s algorithm for Small Parsimony:

1. Goal: to minimize the total number of mutations.

2. Consider sites separately.

3. For each internal node i and state s, c(i, s) is the minimumnumber of changes required for the subtree rooted at i, if i isgiven the state s;

4. Recurrence relation:

If j and k are the children of i,

c(i, s)

= minp,q∈A,C,G,T(c(j, p) + c(k, q) + s(s, p) + s(s, q))

= minp∈A,C,G,T(c(j, p) + s(s, p)) + minq∈A,C,G,T(c(j, q) + s(s, q))

5. Traverse the tree in post order or bottom up to computec(·, ·).

Time: O(|T | · |Σ|)

227


The (Large) Parsimony Problem:

• We know how to evaluate a given tree ...

• How to find a tree with the minimum cost?

For 20 leaves, there are 1021 binary trees, 1024 rooted trees...

• NP-hard

228


The (Large) Parsimony Problem (2):

• Heuristics and approximations:

1. branch-and-bound:

– consider trees of increasing size;

– abort an extension if cost already exceeds current best.

@@@

@@@

\\\

@@@

T0

T1

T2

T3T4

T5

2. local search (neighborhoods)

– nearest neighbor interchange (NNI)

HHH

B

HHH@@

A

HHH

@@

D

HHH

C

=⇒

HHH

C

HHH@@

A

HHH

@@

D

HHH

B

– subtree transfer

AAAAAAAAAAA

AAAAT

=⇒

AAAAAAAAAAA

AAT

229


The (Large) Parsimony Problem (3): A greedy method

• Species are represented by DNA sequences

• Assume a given multiple alignment −→ character data

PHYLIP – DNAPARS

1. Arrange the sequences in some order.

2. Construct a phylogeny for the first two sequences.

3. repeat until all sequences are added:

Add the sequence at the head of the order to the currentphylogeny to achieve parsimony.

vS1

vS2

AAAAAAA

vS1

vS2

AAAAAAA

AAAA v

S3

vS1

vS2

AAAAAAA

AAAAvS3

vS1

vS2

AAAAAAA

vS3

or or

· · · 230


Problem with the parsimony method: statistical incon-sistency

231


Perfect phylogeny:

• The parsimony problem taking binary state characters

• Requires at most 1 change for each character/site.

1 2 3 4 5

A 1 1 0 0 0B 0 0 1 0 0C 0 0 0 0 1D 0 1 0 0 0E 0 0 0 1 0

AAA

AAAAAA

AAAAAAAAAAA

C E B D A

1

2

345

Definition: For any character i, let Oi be the set of taxa with state1 at character i.

Theorem: A perfect phylogeny exists for the input, iff for anyi 6= j, either Oi ∩Oj = ∅, or Oi ⊆ Oj, or Oj ⊆ Oi.

Proof. “only if”: Trivial.

“if”: For any character i incurring a state change, define abipartition (A, B), and draw an edge

A B

Repeat this process for other characters. The property ofthe data guarantees that sets A and B will be subdividedby further bipartitions. (If there are any unresolved nodes,resolve them arbitrarily.)

Generalizations To more than 2 states, see [Gusfield, 1997].

232


Distance-based phylogeny reconstruction (1):

• The problem: for every pair of taxa i and j, define a distanced(i, j). Construct an edge weighted tree such that the pathjoining i and j has weight d(i, j).

• An example,

A B C D EA 0B 0.3 0C 0.7 0.8 0D 1.1 1.2 0.6 0E 0.9 1.0 0.4 0.4 0

A

B C

D

E

@@@@

@@@@

0.1

0.2

0.5 0.2

0.1

0.1

0.3

• Getting distance — Jukes/Cantor’s distance model

1. input data: ungapped alignment of DNA sequences

2. how: treat each site as an independent random variable

3. assumption: every nucleotide has equal probability of mu-tating one of the three other nucleotides

4. parameter: common point mutation rate m — expectednumber of mutations per unit time. Hence, the expectednumber of mutations within time ∆t is m∆t.

5. Jukes/Cantor’s model is Markovian:

1−∆t

1−∆t

1−∆t

1−∆t

C

A

T

G

m3∆t m

3∆t

m3∆t

m3∆t

233



• Jukes/Cantor’s distance model:

– assume m∆t ≤ 0.75, starting from state i, the probabilityof ending up in a state j 6= i is:

Pr(qj|qi, t) =1

4

(1− (1−

4m∆t

3)

t∆t

).

Letting ∆t→ 0,

Pr(qj|qi, t) =1

4

(1− e−

43mt)

.

– hence, the probability of at least 1 mutation is

Pr(mutate|t) =3

4

(1− e−

43mt)

,

since the site are independent and identical processes(iid).

– fraction of sites changed (letting f = 34

(1− e−

43mt)):

mt = −3

4log(1−

4

3f) −−− mutation distance

• Kimura 2-parameter:

Replace m with transition and transversion rates.

• HKY/Felsenstein:

Unequal probability for individual nucleotide.

234



• Distance data d(i, j) is additive if there is an edge weightedtree fully agreeing with the distances.

• (Buneman) Four Point Condition:

For any taxa A, B, C, D,

max

d(A, B) + d(C, D)d(A, C) + d(B, D)d(A, D) + d(B, C)

= median

d(A, B) + d(C, D)d(A, C) + d(B, D)d(A, D) + d(B, C)

Theorem A distance data is additive if and only if it satisfies the four

point condition.

Proof.

1. “only if”: Trivial.

2. “if”: For any taxa A, B, C, D, the condition uniquely de-termines,

B x3

@@

A x2x1

@@

Dx5

Cx4

(for example)xi may be negative!

2x1 = d(A, C) + d(B, D)− d(A, B)− d(C, D),x2 + x3 = d(A, B),x4 + x5 = d(C, D),x1 + x2 + x4 = d(A, C),x1 + x3 + x5 = d(B, D),x1 + x2 + x5 = d(A, D),x1 + x3 + x4 = d(B, C).

redundant by 4 point condition

235



• The Theorem implies an O(n4) time algorithm. (How?)

• But, there are O(n2) time ones, such as Neighbor-Joining(NJ) to be discussed.

• What if data is not additive / perfect?

– Edwards & Cavalli-Sforza try to find a weighted tree tominimize ∑

i,j

(Dij − dij)2,

Dij — observed distance,dij — implied by tree.

– Fitch & Margoliash try to minimize∑i,j

(Dij − dij)2

D2ij

.

236


Distance-based phylogeny reconstruction (4)

— Neighbor-Joining (NJ) by Saitou & Nei, 1987:Given d(i, j),1 ≤ i, j ≤ n.

1. For each taxa i, compute net divergence

ri =

n∑j=1

d(i, j).

2. Compute rate-corrected distance

Mij = d(i, j)−ri + rj

n− 2.

3. Let i and j be the taxa minimizing Mij.

Introduce a new taxa u, such that

j

ik· · ·

· · ·

u

sju

siuXXXXXX

PPPPPP

d(k, u)

siu = d(i,j)2

+ ri−rj

2n−4sju = d(i, j)− siu

d(k, u) = d(i,k)+d(j,k)−d(i,j)2

4. If more than 2 edges unweighted, go to STEP 2.Otherwise let sku = d(k, u).

Notes on NJ:

• Perfect on additive distance data.

• Very popular heuristic in practice.

• Theoretical analysis by K. Atteson.

237


Agenda:

• Maximum likelihood

• Quartet and quartet-based

Reading:

• Ref[JXZ, 2003]: pages 111 – 134

238


Maximum Likelihood Methods:

Find a tree with maximum probability of generating data (se-quences).

• Model:

A branching process describing how taxa/species diverge overevolutionary time, and a sampling model (ignored).

max L(T |S, M) = Pr(S|T, M) (Baysian)M : Jukes/Cantor, Kimura, HKY

• For example,

T =

t t ttt

AAAA

AAAAAAA

A B C

x

y

t1 t2

t3t4

– ti: mutation distance (m∆t)

– L =∑

x

∑yP(y)·P(x|y, t3)·P(A|x, t1)·P(B|x, t2)·P(C|y, t4)

– In a stochastic process, P(y) is the equilibrium probabilityof y.

– The sum can be made nested.

• Maximum Likelihood gives confidence level

... but time consuming.

239


Quartet-based phylogeny reconstruction (1)

• A quartet q is a set of 4 distinct species in a tree T ; thesubtree of T linking the species in q is the quartet topologyfor q relative to T .

@@@@

@@@@

@@@@

A

B

C D

E

F

D

CC

BB

A

@@

@@

@@

@@

@@

@@

E

AE

FE

F

Assuming bifurcate events ...

• Each tree has a unique set of associated quartet topologies(Buneman, 1971)

• Reconstruct tree by inferring topologies of all possible quar-tets and then combining these topologies.

240



Combining inferred quartet topologies is not easy in practice.

1. O(n4) quartets (huge data redundancy).

2. All existing phylogeny reconstruction methods sometimes in-fer incorrect quartet topologies (quartet errors).

3. Quartet recombination methods are usually sensitive to quar-tet errors.

4. It is NP-hard to find the largest set of compatible quartettopologies, even if the input set of quartet topologies is com-plete. [Steel, 1992; Berry et al., 1999]

5. Heuristics such as quartet puzzling [SH96] and Q∗ [BG97],etc. don’t have guaranteed performance.

An alternative approach: Look for efficient algorithms that recon-struct the tree correctly in presence of bounded number of quarteterrors in particular regions of the tree.

Such algorithms essentially “clean” errors in a given set of quartet

topologies, and are hence called quartet cleaning algorithms.

241



A motivation: Felsenstein Zone

Maximum parsimony is one of the most popular methods in phy-logeny reconstruction. But it is known to be statistically inconsis-tent.

@@

@@@@@@@

B

A C

D

@@

@@@@@@

B

A C

D

However, when the tree to be reconstructed is large, we can afford

to infer a few (strange shaped) quartet topologies incorrectly. Data

redundancy allow us to clean these errors.

242



ra

DDDDDDrb

EEEEEEErc

JJJJrd

AAAre

CCCCCrf

true phylogeny T

-evolution

abcdef

sequence data?

quartettopologyinferencemethod

a

b

c

d

r rr r@@

@@

a

c

b

e

b bb b@@

@@

a

b

c

f

r rr r@@

@@

a

c

d

e

r rr r@@

@@

a

f

b

d

b bb b@@

@@

a

b

e

f

r rr r@@

@@

a

c

d

e

r rr r@@

@@

a

c

d

f

r rr r@@

@@

a

f

c

e

b bb b@@

@@

a

d

e

f

r rr r@@

@@

b

e

c

d

b bb b@@

@@

b

c

d

f

r rr r@@

@@

b

c

e

f

r rr r@@

@@

b

d

e

f

r rr r@@

@@

c

d

e

f

r rr r@@

@@

set ofinferredquartet

topologies

cleaning

?

quartetrecombinationmethod

a

b

e

fc d

@@@

@@@

inferred phylogeny topology

determine rootassign edge weights

ra

DDDDDDrb

EEEEEEErc

JJJJrd

AAAre

CCCCCrf

inferred phylogeny

243


Quartet cleaning (1)

• Quartet error bounds may be stated relative to quartet errorsacross either vertices or edges:

@@@@

@@@@

@@@@

A

B

C D

E

F

a b1 2

3

c

d

D

CC

BB

A

@@

@@

@@

3

2

1,2

b, d

b, c

a, b, c

@@

@@

@@

E

AE

FE

F

Definition: A quartet error q is across a vertex (an edge) in a tree T ifthe joining path of the quartet topology associated with q inT includes that vertex (edge).

• Quartet error bounds may either be local (hold in individualparts of the tree) or global (hold in all parts of the tree).

• 4 types of cleaning algorithms:

global, local + edge, vertex cleaning

• Local quartet cleaning preferable —

– all parts of tree may not satisfy a quartet error bound

– can still reconstruct part of the tree

244


Quartet cleaning (2)

1. Global Edge Cleaning

• Every edge (A, B) has at most (|A|−1)(|B|−1)2

quartet errors.

• O(n4) time.

• Error bound is optimal.

2. Local Edge Cleaning

• Recovers an edge (A, B) if the number of quartet errors

across (A, B) is at most (|A|−1)(|B|−1)2

.

• O(n5) time.

• Unresolved tree.

3. Hyper-Cleaning(m)

• Returns all bipartition (A, B) with

|Q(A,B) −Q| ≤ m(|A| − 1)(|B| − 1)

2.

• The bipartitions may not be all compatible.

• O(n7 · f(m)) time.

245


Quartet cleaning (3) — the limit

• Consider the following two trees T and T ′.

AAAA

@@@

r

r

AAAA

X Y

e

a b

T

AAAA

@@@

r

r

AAAA

X Y

b a

T ′

• Observe that QT and QT ′ differ by quartet topologies of theform ax|by, where x ∈ X and y ∈ Y . Hence,

|QT −QT ′| = (|X| − 1)(|Y | − 1).

• Therefore, no algorithm can detect more than (|X|−1)(|Y |−1)2

errors across edge e.

246


Quartet cleaning (4) — local edge cleaning

Theorem There exists an O(n5) time algorithm that correctly recoversall edges e of the tree satisfying

|Q(Ae,Be) −Q| ≤(|Ae| − 1)(|Be| − 1)

2.

• Proof.

Let S = s1, s2, . . . , sn, Sk = s1, s2, . . . , sk, and Qk the subsetof Q induced by Sk. The idea is to recursively compute sets:

BPxy(Qk) = (X, Y )|x ∈ X, y ∈ Y , and there is at most onequartet error across (X, Y ) involving both x and y.

BPx(Qk) = (X, Y )|x ∈ X and there are at most (|Y |−1)2

quarteterrors across (X, Y ) involving x.

BP (Qk) = (X, Y )| there are at most (|X|−1)(|Y |−1)2

quartet er-rors across (X, Y ).

X Yx y"!# "!# r r rr t r r r r rr r r rr t r r

Observe that the algorithm may leave some of edges of thetree unresolved.

247


Quartet cleaning (5) — hypercleaning

• The local edge cleaning algorithm returns the set of biparti-

tions (A, B) incurring at most (|A|−1)(|B|−1)2

quartet errors in Q.These bipartitions are compatible with each other.

• What about bipartitions incurring more than (|A|−1)(|B|−1)2

quar-tet errors?

Theorem There exists an O(n7f(m)) time algorithm that computes allbipartitions (A, B) satisfying:

|Q(A,B) −Q| ≤ m(|A| − 1)(|B| − 1)

2.

– These bipartitions may not be compatible with each other!

– The maximum subset of compatible quartets (hard).

– Greedy searching.

• f(m) = 4m2(1 + 2m)4m (in practice, m ≤ 5).

248


Quartet cleaning (6) — global edge cleaning

1. Quartet error bound must hold for every edge in the tree.

2. For each edge inducing a bipartition (A, B) of S, we state thebound relative to (A, B).

Theorem There exists a polynomial time algorithm with bound α·|A|·|B|(α ≤ 1

2).

Theorem The following algorithm computes the correct tree if every

edge of the tree has at most (|A|−1)(|B|−1)2

quartet errors in Q.

The algorithm can be implemented to run in O(n4) time, i.e.,linear time.

The error bound is optimal.

• Algorithm Global-Clean(S, Q):

1. Let R := S.

2. For every pair of rooted subtrees T1 and T2 in R

(a) Let A denote the leaf labels of T1 and T2.

(b) If |Q(A, S −A)−Q| < (|A|−1)(|S−A|−1)2

then

i. Create a new tree T ′ with T1 and T2 as its subtrees.

ii. Let R := R− T1, T2 ∪ T ′.

3. Repeat step 2 until |R| = 3.

4. Connect the three subtrees in R and output the resulting(unrooted) tree.

249


Quartet cleaning (7) — global edge cleaning (cont’d)

AAAA

@@@

AAAA

A B

e

A′ B′

T

AAAA

AAAA

@@@

AAAA

```

AAAA

A B

e′

A′ B′

T ′

AAAA

AAAA

````

• Suppose the algorithm incorrectly joins subtrees T1 and T2

with label sets A and B. This implies that in the true tree T ,there is an edge e with bipartition (A ∪ A′, B ∪ B′), |A′| ≥ 1,|B′| ≥ 1, which separates T1 and T2 in T . Similarly, there is anedge e′ in the reconstructed tree with bipartition (A∪B, A′∪B′).

Let Q1 = Q(A ∪A′, B ∪B′) and Q2 = Q(A ∪B, A′ ∪B′) denotethe sets of quartets across e and e′, respectively.

• Consider the symmetric difference of Q1 and Q2 — |Q1⊕Q2|.Observe the following:

– |Q1 ⊕Q2| = |A||B||A′||B′|.

– |Q1 ⊕Q2| ≤ |Q1 −Q|+ |Q2 −Q|.

• As each edge in T satisfies the quartet-error bound,

|Q1 −Q| <((|A|+ |A′|)− 1)((|B|+ |B′|)− 1)

2.

However, as T1 and T2 were joined by the algorithm,

|Q2 −Q| <((|A|+ |B|)− 1)((|A′|+ |B′|)− 1)

2.

250


Quartet cleaning (8) — global edge cleaning (cont’d)

• The above two items imply

|A||B||A′||B′| <((|A|+|A′|)−1)((|B|+|B′|)−1)

2

+ ((|A|+|B|)−1)((|A′|+|B′|)−1)2

,

which is false for any |A|, |B|, |A′|, |B′| ≥ 1.

• Error bound (|A|−1)(|B|−1)2

is optimal.

• Intuitions behind the Proof:

– Trees “separated” by minimum number of differences inassociated sets of quartet topologies.

– If quartet error bounds hold relative to true tree, can-not accumulate enough quartet topologies to reconstructdifferent tree under the cleaning algorithm.

251


Quartet cleaning (9) — simulation study

• Quartet cleaning is computationally expensive. Is it reallyworth? If so, under what assumption?

• Two questions:

1. How often do quartet errors occur?

2. How often can quartet errors be cleaned, i.e., how often isthe number of quartet errors less than the bounds requiredby known algorithms?

• Examine quartet errors in simulated datasets:

– Approach: Generate phylogeny; generate DNA sequencedataset for phylogeny; infer quartet topologies relativeto sequence dataset; determine quartet errors relative tophylogeny.

– Vary evolutionary rates on edges of phylogenies and lengthsof sequences in dataset.

– Quartets inferred using 3 popular methods: MaximumParsimony, Neighbor Joining, and Ordinal Quartet [Kear-ney, 1997].

252


Quartet cleaning (10) — open problems

1. Improve the time efficiency of local edge cleaning and hyper-cleaning algorithms.

2. Combine cleaning with maximum likelihood reconstruction ofquartet topologies.

3. Test the algorithms on real datasets.

4. Various ways of extracting a tree from the bipartitions foundby hypercleaning.

5. Extend the work to incomplete set of quartet topologies.

253

lecture 1: introduction - university of california, riversidejiang/cs215/compbiol-dragonstar.pdf ·...

Documents