Composition Alignment. Gary Benson, Departments of Computer Science and Biology, Boston University


Post on 21-Dec-2015

TRANSCRIPT

Page 1: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Composition Alignment

Gary Benson
Departments of Computer Science and Biology
Boston University

Page 3: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Outline of Talk

1. Sequence composition and composition match
2. Composition alignment algorithm
3. Composition match scoring functions
4. Growth of local composition alignment scores
5. Limiting the length of a composition match
6. Biological examples

Page 4: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Goal

Identify features in DNA sequences that are not accurately described by position specific patterns.

A position specific pattern, P, has the form:

$P = p_1 p_2 p_3 \ldots p_k$

where $p_i$ is either a single specific character or a choice (weighted or unweighted) of characters.

In DNA there are features that are characterized by composition rather than by position specific patterns.

Page 5: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Sequence Composition

Composition is a vector quantity describing the frequency of occurrence of each alphabet letter in a particular string.

Let S be a string over Σ. Then,

$C(S) = (f_{\sigma_1}, f_{\sigma_2}, f_{\sigma_3}, \ldots, f_{\sigma_{|\Sigma|}})$

is the composition of S, where $f_{\sigma_i}$ is the fraction of the characters in S that are $\sigma_i$.

Page 6: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Composition Example

S = ACTGTACCTGGCGCTATT

C(S) = ( 0.17, 0.28, 0.22, 0.33 ), with the components ordered A, C, G, T.

Note that the order of letters is irrelevant as it has no effect on the composition.
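The composition of the example string can be checked with a few lines of Python. This is a minimal sketch, not code from the talk; the function name `composition` and the fixed alphabet order A, C, G, T are assumptions made for illustration, and the string is assumed to be non-empty.

```python
from collections import Counter

def composition(s, alphabet="ACGT"):
    """Return the fraction of s made up by each letter, in alphabet order.

    Hypothetical helper for illustration; assumes s is non-empty.
    """
    counts = Counter(s)
    return tuple(counts[letter] / len(s) for letter in alphabet)

S = "ACTGTACCTGGCGCTATT"
# Counts are A=3, C=5, G=4, T=6 out of 18 characters.
print(tuple(round(f, 2) for f in composition(S)))  # (0.17, 0.28, 0.22, 0.33)
```

Because only counts enter the result, any rearrangement of S gives the same tuple.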

Page 7: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Composition and Sequence Features

• Isochores – Multi-megabase, specifically GC-rich or GC-poor. GC-rich isochores have greater gene density.

• CpG Islands – Several hundred nucleotides, rich in the dinucleotide CG, which is underrepresented in eukaryotic genomes. Methylation of the cytosine (C) in these dinucleotides affects gene expression.

• Protein binding regions – Tens of nucleotides; dinucleotide composition contributes to DNA flexibility, allowing the helix to change shape during protein binding.

Page 8: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Composition Match

We hope to identify common features in sequences using a new alignment algorithm. The main new idea is the use of composition matching.

Two strings, S and T, have a composition match if their lengths are equal and C(S) = C(T).

For example, S and T below have a composition match:

S = ACTGTACCTGGCGCTATT

T = AAACCCCCGGGGTTTTTT
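The definition translates directly into a comparison of letter counts. The sketch below is illustrative only; `has_composition_match` is a hypothetical name, and comparing integer counts (rather than fractions) is a choice made here to avoid floating-point comparison. For strings of equal length the two tests are equivalent.

```python
from collections import Counter

def has_composition_match(s, t):
    """True if s and t have equal length and identical letter counts."""
    return len(s) == len(t) and Counter(s) == Counter(t)

S = "ACTGTACCTGGCGCTATT"
T = "AAACCCCCGGGGTTTTTT"
print(has_composition_match(S, T))  # True: both contain 3 A, 5 C, 4 G, 6 T
```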

Page 9: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Composition Alignment Problem

Given: Two sequences, S and T, of lengths m and n, over an alphabet Σ, and a scoring function cm(s, t) for the score of a composition match between substrings s and t.

Find: The best scoring alignment (global or local) of S with T such that the allowed scoring options include composition match between substrings of S and T as well as the standard options of 1) single character match, 2) single character mismatch, 3) insertion and deletion.

Page 10: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Example of composition alignment

S = AACGTCTTTGAGCTC

T = AGCCTGACTGCCTA

Alignment

AACGTCTTTGAGCTC
| |<->  | <--->
AGCCTGACT-GCCTA

Page 11: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Related Work

• Alignment allowing adjacent letter swap.
  $O(nm)$, Lowrance and Wagner (1975)

• All swapped matchings of a pattern in a text.
  $O(nm^{1/3} \log m \log |\Sigma|)$, Amir, Aumann, Landau, Lewenstein, Lewenstein (2000)
  $O(n \log m \log |\Sigma|)$, Amir, Cole, Hariharan, Lewenstein, Porat (2001)

• Composition naming
  $O(n \log m \log |\Sigma|)$, Amir, Apostolico, Landau, Satta (2003)

Page 12: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Composition Alignment using Dynamic Programming

Given two sequences, S and T, the best alignment of the prefix strings

$S[1, i] = s_1 \ldots s_i$

$T[1, j] = t_1 \ldots t_j$

ends in one of four ways:

1. mismatch,
2. insertion,
3. deletion, or
4. composition match

Page 14: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Ways an Alignment Can End

mismatch

S: C G T
T: C G A

insertion or deletion

S: C A T
T: C A -

S: C A –
T: C A A

composition match

X: C G T A C
Y: C G C T A

Note that the suffixes will have a length l where 1 ≤ l ≤ min(i, j, limit)

Page 15: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Time Complexity

Computing the optimal composition alignment with dynamic programming is similar to standard alignment, except for the composition match scoring option. The overall time complexity is

O(nmZ)

where Z is the time required per (i, j) pair to find the best length l for the composition match.
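The recurrence can be sketched as a straightforward global dynamic program in which each cell tries every suffix length up to the limit, so Z here is proportional to the limit. This is an illustrative sketch, not the talk's implementation: the function name, the mismatch and indel penalties of -1, and the convention that a single-character match is scored as a composition match of length 1 are all assumptions.

```python
from collections import Counter

def composition_alignment_score(S, T, cm, limit, mismatch=-1, indel=-1):
    """Global composition alignment score (naive sketch).

    cm(l) scores a composition match of length l; mismatch and indel are
    hypothetical penalty values chosen for illustration.
    """
    m, n = len(S), len(T)
    # A[i][j] = best score aligning the prefixes S[:i] and T[:j]
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = i * indel
    for j in range(1, n + 1):
        A[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Options 1-3: mismatch, deletion, insertion.
            best = max(A[i - 1][j - 1] + mismatch,
                       A[i - 1][j] + indel,
                       A[i][j - 1] + indel)
            # Option 4: suffixes of length l with equal letter counts.
            diff = Counter()
            for l in range(1, min(i, j, limit) + 1):
                diff[S[i - l]] += 1
                diff[T[j - l]] -= 1
                if not any(diff.values()):  # all count differences are zero
                    best = max(best, A[i - l][j - l] + cm(l))
            A[i][j] = best
    return A[m][n]

S = "AACGTCTTTGAGCTC"
T = "AGCCTGACTGCCTA"
print(composition_alignment_score(S, T, cm=lambda l: l, limit=5))
```

With limit = 1 the composition match option reduces to single-character matching, so the same function then computes an ordinary global alignment score under these penalties.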

Page 16: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Computing length of the shortest composition match

Our goal here is to start with two strings, S and T, of equal length, and for each prefix pair S[1, k], T[1, k], find the length of the shortest suffixes that have a composition match.

Page 17: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

For example, let

S = AACGTCTTTGAGCT

T = AGCCTGACTGCCTA

k                              0  1  2  3  4  5  6
Shortest suffix match length   0  1  0  1  0  1  3

The table states that for k = 6, the shortest suffixes which have a composition match have length = 3 (the suffixes GTC and CTG):

S = AAC GTC ...

T = AGC CTG ...
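The table entries can be reproduced by brute force: for each prefix length k, grow the suffix one character at a time and stop at the first length where the letter counts agree. This is a quadratic sketch for checking small examples only; the function name is hypothetical, and a value of 0 means no suffix of the prefix pair has a composition match.

```python
from collections import Counter

def shortest_suffix_match_lengths(S, T):
    """For k = 0..len(S), length of the shortest matching suffixes of S[:k], T[:k]."""
    assert len(S) == len(T)
    result = [0]  # k = 0: empty prefixes, nothing to match
    for k in range(1, len(S) + 1):
        diff = Counter()
        length = 0
        for l in range(1, k + 1):
            diff[S[k - l]] += 1
            diff[T[k - l]] -= 1
            if not any(diff.values()):  # suffixes of length l have equal counts
                length = l
                break
        result.append(length)
    return result

S = "AACGTCTTTGAGCT"
T = "AGCCTGACTGCCTA"
print(shortest_suffix_match_lengths(S, T)[:7])  # [0, 1, 0, 1, 0, 1, 3]
```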

Page 18: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Composition difference

We find the matching suffix lengths using composition difference, a vector quantity for two strings x and y:

$CD(x, y) = (c_{\sigma_1}, \ldots, c_{\sigma_{|\Sigma|}})$

where $c_{\sigma_i}$ is the difference between the number of times $\sigma_i$ occurs in x and in y.

Page 19: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Using composition difference

Key observation: two identical composition differences at prefix lengths k and g indicate a composition match of length k – g.

Page 20: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Sorting to find shortest composition matches

Sort on composition difference using stable sort. Adjacent tuples with the same composition difference identify shortest composition matches.
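The sorting step might look like the sketch below. It builds one (composition difference, k) tuple per prefix length, sorts stably on the difference so that ties stay in increasing k, and reads the shortest match ending at k from the tuple immediately before it. The function name and the fixed DNA alphabet are assumptions, and Python's built-in comparison sort stands in for whatever sort the talk used, so this sketch does not reproduce the stated running time.

```python
def shortest_matches_by_sorting(S, T, alphabet="ACGT"):
    """Shortest composition match length ending at each prefix length k."""
    assert len(S) == len(T)
    index = {letter: i for i, letter in enumerate(alphabet)}
    cd = [0] * len(alphabet)
    tuples = [(tuple(cd), 0)]  # composition difference of the empty prefixes
    for k in range(1, len(S) + 1):
        cd[index[S[k - 1]]] += 1
        cd[index[T[k - 1]]] -= 1
        tuples.append((tuple(cd), k))
    # list.sort is stable, so equal differences remain ordered by k.
    tuples.sort(key=lambda t: t[0])
    lengths = [0] * (len(S) + 1)
    for (cd_prev, g), (cd_cur, k) in zip(tuples, tuples[1:]):
        if cd_prev == cd_cur:
            # g is the closest earlier prefix with the same difference,
            # so S[g:k] and T[g:k] form the shortest match ending at k.
            lengths[k] = k - g
    return lengths

S = "AACGTCTTTGAGCT"
T = "AGCCTGACTGCCTA"
print(shortest_matches_by_sorting(S, T)[:7])  # [0, 1, 0, 1, 0, 1, 3]
```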

Page 21: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Time complexity for composition matches

$O(nm|\Sigma|)$ to find the shortest composition match lengths for all index pairs of two strings of length n and m.

In our work, |Σ| is a small constant (4 for DNA, 16 for dinucleotides). For larger alphabets, the method of Amir, Apostolico, Landau and Satta (2003) can be used.

Page 22: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Composition match scoring functions

We have explored:

Functions based on match length, k:

• Function 1: cm(k) = ck
• Function 2: cm(k) = c√k

where c is a constant.

Functions based on substring composition:

• Function 4: cm(C, B, k) = ck · H(C, B)

where H is the relative entropy function, C is the composition of the matching substrings and B is a background composition.
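The three functions listed above can be written out as follows. This is a sketch under stated assumptions: the function names are hypothetical, c defaults to 1, and the relative entropy is taken as the usual sum of C_i · log(C_i / B_i) with base-2 logarithms, a base the slides do not specify. The background B is assumed to be positive wherever C is positive.

```python
import math

def cm_linear(k, c=1.0):
    """Function 1: cm(k) = c * k."""
    return c * k

def cm_sqrt(k, c=1.0):
    """Function 2: cm(k) = c * sqrt(k)."""
    return c * math.sqrt(k)

def relative_entropy(C, B):
    """H(C, B) = sum of C_i * log2(C_i / B_i); terms with C_i = 0 contribute 0."""
    return sum(p * math.log2(p / q) for p, q in zip(C, B) if p > 0)

def cm_entropy(C, B, k, c=1.0):
    """Function 4: cm(C, B, k) = c * k * H(C, B)."""
    return c * k * relative_entropy(C, B)

uniform = (0.25, 0.25, 0.25, 0.25)
print(cm_entropy(uniform, uniform, 10))    # 0.0: background-like matches score nothing
print(cm_entropy((1, 0, 0, 0), uniform, 3))  # 6.0: 3 * log2(4)
```

Under Function 4, a match whose composition equals the background earns no score, while a strongly skewed composition is rewarded in proportion to its length.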

Page 23: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Additive and subadditive scoring functions

The functions based on length are additive or subadditive:

cm(i + j) ≤ cm(i) + cm(j)

Lemma: For additive or subadditive composition match scoring functions, any best scoring alignment is equivalent in score to an alignment which contains only shortest composition matches.

Theorem: Composition alignment with additive or subadditive match scoring functions and finite alphabet has time complexity O(nm).
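A short check of why the two length-based functions satisfy this inequality, assuming the constant c is non-negative (the slides do not state its sign) and i, j are positive lengths:

```latex
% Function 1 is additive:
cm(i + j) = c(i + j) = ci + cj = cm(i) + cm(j).

% Function 2 is subadditive, since
(\sqrt{i} + \sqrt{j})^2 = i + j + 2\sqrt{ij} \;\ge\; i + j
\quad\Longrightarrow\quad
\sqrt{i + j} \le \sqrt{i} + \sqrt{j}
\quad\Longrightarrow\quad
cm(i + j) = c\sqrt{i + j} \le c\sqrt{i} + c\sqrt{j} = cm(i) + cm(j).
```

In both cases, splitting a long composition match into consecutive shorter ones never lowers the score, which is the property the lemma relies on.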

Page 24: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

The limit parameter

Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful.

The limit parameter is an upper bound on the length l of the longest single composition match, used to prevent excessive matching.

Sequence length = 100, randomly generated

limit                          1     2      5      10
DNA (all letters p = 0.25)     25    33.7   44.4   51

Page 25: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Growth of local alignment score: Function 1

[Figure: Average Local Composition Alignment Scores: DNA Sequences, Function 1. x-axis: Sequence Length (100 to 1000); y-axis: Score (0 to 120); series: Limit = 2, Limit = 3, Limit = 4.]

Page 26: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Global score as a predictor of local parameter suitability: Function 1

[Figure: Average Global Composition Alignment Scores: DNA Sequences, Function 1. x-axis: Sequence Length (100 to 900); y-axis: Score (-400 to 100); series: Limit = 2, Limit = 3, Limit = 4, Limit = 5.]

Page 27: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Growth of local alignment score: Function 2

[Figure: Average Local Composition Alignment Scores: DNA Sequences, Function 2. x-axis: Sequence Length (100 to 1000); y-axis: Score (0 to 100); series labeled 50, 30, 20, 10, 6.]

Page 28: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Global score as a predictor of local parameter suitability: Function 2

[Figure: Global Composition Alignment Scores: DNA Sequences, Function 2. x-axis: Sequence Length (0 to 900); y-axis: Score (-200 to 0); series labeled 10, 20, 30, 50.]

Page 29: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Limit values for DNA

• Function 1: cm(k) = ck: Limit ≤ 3.

• Function 2: cm(k) = c√k: Limit ≤ 10.

• Function 4: cm(C, B, k) = ck · H(C, B): Limit ≤ 50.

Page 30: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Biological examples

Composition alignment was tested on a set of 1796 promoter sequences from the Eukaryotic Promoter Database. Each sequence is 600 nucleotides long, 500 bases upstream and 100 downstream of the transcription initiation site.

Two local alignment scores were produced using function 1, W using composition alignment and S using standard alignment. The examples shown have statistically significant W with W ≥ 3 · S to exclude good standard alignments.

Page 31: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Example 1

Composition alignment and standard alignment of the same two promoters. Standard alignment is not statistically significant. Sequences are characteristic of CpG islands.

Composition Alignment:

GCCCGCCCGCCGCGCTCCCGCCCGCCGCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC
<->||||<>|<>||<>| ||||<>||<> |<-> |||||| <>|<> ||||<><> |<>| ||<->||
CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGGCCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC

Standard Alignment:

CGCCGCCGCCG
CGCCGCCGCCG

Page 32: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Example 2

Composition alignment of two promoter sequences. Composition changes at vertical line.

Compositions, with components ordered A, C, G, T:

Left: (0.01, 0.61, 0.30, 0.08)
Right: (0.19, 0.16, 0.56, 0.09)

GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GCTCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG
<->|<><>|||| <>|||||| ||<->|<>||||| <>|||| |||| || ||<->   |  |<><>|<-> | |<>|<>|<>||||<-><->|
CCGCGCGCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGGGGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG

Page 33: Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University

Conclusion

We

• define a new alignment problem based on composition matching and test several scoring functions

• show how to find all-pairs shortest composition match lengths in linear time per pair for a fixed alphabet

• show that alignment using scoring functions based on sequence length requires finding only shortest composition matches

• give biological examples where composition alignment finds statistically (and functionally) significant sequence similarity in the absence of significant standard alignments