composition alignment gary benson departments of computer science and biology boston university

Composition Alignment

Gary Benson
Departments of Computer Science and Biology
Boston University

Outline of Talk

1. Sequence composition and composition match
2. Composition alignment algorithm
3. Composition match scoring functions
4. Growth of local composition alignment scores
5. Limiting the length of a composition match
6. Biological examples

Goal

Identify features in DNA sequences that are not accurately described by position specific patterns.

A position specific pattern, P, has the form:

$P = p_1 p_2 p_3 \dots p_k$

where $p_i$ is either a single specific character or a choice (weighted or unweighted) of characters.

In DNA there are features that are characterized by composition rather than by position specific patterns.

Sequence Composition

Composition is a vector quantity describing the frequency of occurrence of each alphabet letter in a particular string.

Let S be a string over Σ. Then,

$C(S) = (f_{\sigma_1}, f_{\sigma_2}, f_{\sigma_3}, \dots, f_{\sigma_{|\Sigma|}})$

is the composition of S, where $f_{\sigma_i}$ is the fraction of the characters in S that are $\sigma_i$.

Composition Example

S = ACTGTACCTGGCGCTATT

         A     C     G     T
C(S) = ( 0.17, 0.28, 0.22, 0.33 )

Note that the order of letters is irrelevant, as it has no effect on the composition.
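The definition can be checked directly on the example string. The sketch below is only illustrative: the function name `composition` and the fixed alphabet order "ACGT" are assumptions made for this example, and the string is assumed to be non-empty.

```python
from collections import Counter

def composition(s, alphabet="ACGT"):
    """Return the fraction of characters in s equal to each alphabet letter.

    Assumes s is non-empty; the alphabet order (A, C, G, T) is a choice made
    for this sketch.
    """
    counts = Counter(s)
    return tuple(counts[a] / len(s) for a in alphabet)

S = "ACTGTACCTGGCGCTATT"
# Counts are A=3, C=5, G=4, T=6 out of 18 characters.
print([round(f, 2) for f in composition(S)])  # [0.17, 0.28, 0.22, 0.33]
```

Because only counts are used, any permutation of S yields the same vector.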

Composition and Sequence Features

• Isochores: multi-megabase regions, specifically GC-rich or GC-poor. GC-rich isochores have greater gene density.

• CpG islands: several hundred nucleotides, rich in the dinucleotide CG, which is underrepresented in eukaryotic genomes. Methylation of the cytosine (C) in these dinucleotides affects gene expression.

• Protein binding regions: tens of nucleotides. Dinucleotide composition contributes to DNA flexibility, allowing the helix to change shape during protein binding.

Composition Match

We hope to identify common features in sequences using a new alignment algorithm. The main new idea is the use of composition matching.

Two strings, S and T, have a composition match if their lengths are equal and C(S) = C(T).

For example, S and T below have a composition match:

S = ACTGTACCTGGCGCTATT
T = AAACCCCCGGGGTTTTTT
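A composition match can be tested by comparing letter counts, since two strings of equal length have equal compositions exactly when their counts agree. This is a minimal sketch; the name `is_composition_match` is hypothetical.

```python
from collections import Counter

def is_composition_match(s, t):
    """True when s and t have equal length and identical letter counts."""
    return len(s) == len(t) and Counter(s) == Counter(t)

S = "ACTGTACCTGGCGCTATT"
T = "AAACCCCCGGGGTTTTTT"
print(is_composition_match(S, T))        # True
print(is_composition_match("ACG", "ACT"))  # False
```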

Composition Alignment Problem

Given: Two sequences, S and T, of lengths m and n, over an alphabet Σ, and a scoring function cm(s, t) for the score of a composition match between substrings s and t.

Find: The best scoring alignment (global or local) of S with T such that the allowed scoring options include composition match between substrings of S and T, as well as the standard options of 1) single character match, 2) single character mismatch, 3) insertion and deletion.

Example of composition alignment

S = AACGTCTTTGAGCTC
T = AGCCTGACTGCCTA

Alignment

AACGTCTTTGAGCTC
| |<->  | <--->
AGCCTGACT-GCCTA

Related Work

• Alignment allowing adjacent letter swap.
  $O(nm)$, Lowrance and Wagner (1975)

• All swapped matchings of a pattern in a text.
  $O(n m^{1/3} \log m \log |\Sigma|)$, Amir, Aumann, Landau, Lewenstein, Lewenstein (2000)
  $O(n \log m \log |\Sigma|)$, Amir, Cole, Hariharan, Lewenstein, Porat (2001)

• Composition naming
  $O(n \log m \log |\Sigma|)$, Amir, Apostolico, Landau, Satta (2003)

Composition Alignment using Dynamic Programming

Given two sequences, S and T, the best alignment of the prefix strings

$S[1, i] = s_1 \dots s_i$
$T[1, j] = t_1 \dots t_j$

ends in one of four ways:

1. mismatch,
2. insertion,
3. deletion, or
4. composition match

Ways an Alignment Can End

mismatch
S: C G T
T: C G A

insertion or deletion
S: C A T
T: C A -

S: C A -
T: C A A

composition match
X: C G T A C
Y: C G C T A

Note that the suffixes will have a length l, where

1 ≤ l ≤ min(i, j, limit)

Time Complexity

Computing the optimal composition alignment with dynamic programming is similar to standard alignment, except for the composition match scoring option. The overall time complexity is

O(nmZ)

where Z is the time required per (i, j) pair to find the best length l for the composition match.
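The recurrence can be sketched as a straightforward global-alignment table in which each cell also tries every suffix length l up to the limit. This is a naive illustration with Z = O(limit), not the talk's optimized method. The function name, the mismatch and gap penalties of -1, and the convention that a single-character match is a composition match of length 1 scored by cm(1) are all assumptions made for the example.

```python
def composition_alignment_score(s, t, cm, mismatch=-1.0, gap=-1.0, limit=3):
    """Global composition alignment score (naive sketch, O(n*m*limit) time).

    cm(l) scores a composition match of length l; penalties are illustrative.
    """
    m, n = len(s), len(t)
    score = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Insertion and deletion.
            best = max(score[i - 1][j] + gap, score[i][j - 1] + gap)
            # Single-character mismatch.
            if s[i - 1] != t[j - 1]:
                best = max(best, score[i - 1][j - 1] + mismatch)
            # Composition match of the length-l suffixes, for each allowed l.
            diff = {}      # letter -> count in s-suffix minus count in t-suffix
            nonzero = 0    # number of letters whose difference is not zero
            for l in range(1, min(i, j, limit) + 1):
                for ch, delta in ((s[i - l], 1), (t[j - l], -1)):
                    old = diff.get(ch, 0)
                    new = old + delta
                    diff[ch] = new
                    nonzero += (new != 0) - (old != 0)
                if nonzero == 0:
                    best = max(best, score[i - l][j - l] + cm(l))
            score[i][j] = best
    return score[m][n]

# "AC" vs "CA": with limit=1 the best option is gap + match + gap = -1;
# with limit=2 the two strings form one composition match scored cm(2) = 2.
print(composition_alignment_score("AC", "CA", cm=lambda k: k, limit=1))  # -1.0
print(composition_alignment_score("AC", "CA", cm=lambda k: k, limit=2))  # 2.0
```

The running difference `diff` is updated in constant time as the suffix grows by one character, so the inner loop costs O(limit) per cell rather than O(limit squared).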

Computing length of the shortest composition match

Our goal here is to start with two strings, S and T, of equal length, and for each prefix pair S[1, k], T[1, k], find the length of the shortest suffixes that have a composition match.

k                              0  1  2  3  4  5  6
Shortest suffix match length   0  1  0  1  0  1  3

For example, let

S = AACGTCTTTGAGCT
T = AGCCTGACTGCCTA

The table states that for k = 6, the shortest suffixes which have a composition match have length = 3 (shown in brackets):

S = AAC[GTC]...
T = AGC[CTG]...

Composition difference

We find the matching suffix lengths using composition difference, a vector quantity for two strings x and y:

$CD(x, y) = (c_{\sigma_1}, \dots, c_{\sigma_{|\Sigma|}})$

where $c_{\sigma_i}$ is the difference between the number of times $\sigma_i$ occurs in x and in y.

Using composition difference

Key observation: two identical composition differences at prefix lengths k and g (g < k) indicate a composition match of length k - g.
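The key observation can be illustrated by listing the composition difference of every prefix pair of the example strings. This is a sketch only: the function name and the (A, C, G, T) ordering of the vector are assumptions for the example.

```python
def prefix_composition_differences(s, t, alphabet="ACGT"):
    """Return CD(S[1,k], T[1,k]) for k = 0..len(s), as count tuples."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    index = {a: i for i, a in enumerate(alphabet)}
    diff = [0] * len(alphabet)
    out = [tuple(diff)]            # k = 0: empty prefixes
    for a, b in zip(s, t):
        diff[index[a]] += 1        # letter gained by the prefix of s
        diff[index[b]] -= 1        # letter gained by the prefix of t
        out.append(tuple(diff))
    return out

cds = prefix_composition_differences("AACGTCTTTGAGCT", "AGCCTGACTGCCTA")
# The difference at k = 6 equals the one at g = 3, so the suffixes of
# length 6 - 3 = 3 (GTC and CTG) have a composition match.
print(cds[3], cds[6])  # (1, 0, -1, 0) (1, 0, -1, 0)
```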

Sorting to find shortest composition matches

Sort on composition difference using stable sort. Adjacent tuples with the same composition difference identify shortest composition matches.
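A sketch of the sorting step for one pair of equal-length strings follows. It relies on Python's `sorted` being stable, so prefix lengths with equal composition differences stay in increasing order. The function name and alphabet ordering are assumptions made for this example.

```python
def shortest_match_lengths(s, t, alphabet="ACGT"):
    """For k = 0..len(s), length of the shortest composition-matching
    suffixes of S[1,k] and T[1,k] (0 when there is none)."""
    index = {a: i for i, a in enumerate(alphabet)}
    diff = [0] * len(alphabet)
    cds = [tuple(diff)]
    for a, b in zip(s, t):
        diff[index[a]] += 1
        diff[index[b]] -= 1
        cds.append(tuple(diff))
    # Stable sort of prefix lengths by composition difference.
    order = sorted(range(len(cds)), key=lambda k: cds[k])
    lengths = [0] * len(cds)
    for g, k in zip(order, order[1:]):
        if cds[g] == cds[k]:       # adjacent and equal: g is the nearest
            lengths[k] = k - g     # earlier prefix with the same difference
    return lengths

print(shortest_match_lengths("AACGTCTTTGAGCT", "AGCCTGACTGCCTA")[:7])
# [0, 1, 0, 1, 0, 1, 3]
```

The first seven values reproduce the table for k = 0 to 6. Handling all (i, j) index pairs of two sequences would mean repeating this along each diagonal of the alignment matrix.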

Time complexity for composition matches

O(nm|Σ|) to find the shortest composition match lengths for all index pairs of two strings of length n and m.

In our work, |Σ| is a small constant (4 for DNA, 16 for dinucleotides). For larger alphabets, the method of Amir, Apostolico, Landau and Satta (2003) can be used.

Composition match scoring functions

We have explored:

Functions based on match length, k:

• Function 1: cm(k) = ck
• Function 2: cm(k) = c√k

where c is a constant.

Functions based on substring composition:

• Function 4: cm(C, B, k) = ck · H(C, B)

where H is the relative entropy function, C is the composition of the matching substrings and B is a background composition.
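The three functions listed here can be written down as follows. This is a hedged sketch: the function names are hypothetical, c defaults to 1, and the relative entropy is taken in bits (log base 2) with terms for zero-frequency letters treated as 0. The slide does not state the log base, so that choice is an assumption.

```python
import math

def cm_linear(k, c=1.0):
    """Function 1: cm(k) = c * k."""
    return c * k

def cm_sqrt(k, c=1.0):
    """Function 2: cm(k) = c * sqrt(k)."""
    return c * math.sqrt(k)

def relative_entropy(C, B):
    """H(C, B) = sum_i C_i * log2(C_i / B_i); terms with C_i = 0 contribute 0.

    Assumes every B_i is positive wherever C_i is positive.
    """
    return sum(p * math.log2(p / q) for p, q in zip(C, B) if p > 0)

def cm_entropy(C, B, k, c=1.0):
    """Function 4: cm(C, B, k) = c * k * H(C, B)."""
    return c * k * relative_entropy(C, B)

uniform = (0.25, 0.25, 0.25, 0.25)
print(cm_linear(3), cm_sqrt(4))               # 3.0 2.0
print(cm_entropy((1, 0, 0, 0), uniform, 3))   # 6.0
print(cm_entropy(uniform, uniform, 3))        # 0.0
```

Under Function 4, a match whose composition equals the background scores nothing, while a match with a strongly skewed composition scores more per character.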

Additive and subadditive scoring functions

The functions based on length are additive or subadditive:

cm(i + j) ≤ cm(i) + cm(j)

Lemma: For additive or subadditive composition match scoring functions, any best scoring alignment is equivalent in score to an alignment which contains only shortest composition matches.

Theorem: Composition alignment with additive or subadditive match scoring functions and finite alphabet has time complexity O(nm).
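One way to see why the lemma is plausible is the following sketch of an argument, which is not taken from the talk. Here CD denotes the count-based composition difference, 0 denotes the zero vector, and l and l' are names introduced for this illustration.

```latex
Suppose an alignment uses a composition match of length $l$ ending at $(i, j)$,
and the shortest composition match ending at $(i, j)$ has length $l' < l$. Then
\[
CD\bigl(S[i-l+1,\, i],\; T[j-l+1,\, j]\bigr) = 0
\quad\text{and}\quad
CD\bigl(S[i-l'+1,\, i],\; T[j-l'+1,\, j]\bigr) = 0 ,
\]
and subtracting the second from the first gives
\[
CD\bigl(S[i-l+1,\, i-l'],\; T[j-l+1,\, j-l']\bigr) = 0 ,
\]
that is, the remaining $l - l'$ characters also form a composition match.
By subadditivity,
\[
cm(l) \le cm(l') + cm(l - l') ,
\]
so splitting the match does not lower the score. Repeating the argument on the
piece of length $l - l'$ leaves only shortest composition matches.
```

Since only shortest matches need to be considered, each (i, j) cell has a single candidate length, which is consistent with the O(nm) bound in the theorem once the shortest match lengths are known.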

The limit parameter

Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful.

The limit parameter is an upper bound on the length l of the longest single composition match, used to prevent excessive matching.

Sequence length = 100, randomly generated

limit                         1     2     5     10
DNA (all letters p = 0.25)    25    33.7  44.4  51

Growth of local alignment score: Function 1

[Figure: Average Local Composition Alignment Scores, DNA Sequences, Function 1. Horizontal axis: Sequence Length (100 to 1000). Curves: Limit = 2, Limit = 3, Limit = 4.]

Global score as a predictor of local parameter suitability: Function 1

[Figure: Average Global Composition Alignment Scores, DNA Sequences, Function 1. Horizontal axis: Sequence Length (100 to 900). Curves: Limit = 2, Limit = 3, Limit = 4, Limit = 5.]

Growth of local alignment score: Function 2

[Figure: Average Local Composition Alignment Scores, DNA Sequences, Function 2. Horizontal axis: Sequence Length (100 to 1000).]

Global score as a predictor of local parameter suitability: Function 2

[Figure: Global Composition Alignment Scores, DNA Sequences, Function 2. Horizontal axis: Sequence Length (0 to 900).]

Limit values for DNA

• Function 1: cm(k) = ck: Limit ≤ 3.
• Function 2: cm(k) = c√k: Limit ≤ 10.
• Function 4: cm(C, B, k) = ck · H(C, B): Limit ≤ 50.

Biological examples

Composition alignment was tested on a set of 1796 promoter sequences from the Eukaryotic Promoter Database. Each sequence is 600 nucleotides long, 500 bases upstream and 100 downstream of the transcription initiation site.

Two local alignment scores were produced using function 1: W using composition alignment and S using standard alignment. The examples shown have statistically significant W with W ≥ 3 · S to exclude good standard alignments.

Example 1

Composition alignment and standard alignment of the same two promoters. Standard alignment is not statistically significant. Sequences are characteristic of CpG islands.

Composition Alignment:

GCCCGCCCGCCGCGCTCCCGCCCGCCGCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC
<->||||<>|<>||<>| ||||<>||<> |<-> |||||| <>|<> ||||<><> |<>| ||<->||
CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGGCCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC

Standard Alignment:

CGCCGCCGCCG
CGCCGCCGCCG

Example 2

Composition alignment of two promoter sequences. The composition changes partway along the alignment (marked by a vertical line on the original slide).

         A     C     G     T
Left:  ( 0.01, 0.61, 0.30, 0.08 )
Right: ( 0.19, 0.16, 0.56, 0.09 )

GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GCTCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG
<->|<><>|||| <>|||||| ||<->|<>||||| <>|||| |||| || ||<->   |  |<><>|<-> | |<>|<>|<>||||<-><->|
CCGCGCGCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGGGGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG

Conclusion

• Define a new alignment problem based on composition matching and test several scoring functions.

• Show how to find all-pairs shortest composition match lengths in linear time per pair for a fixed alphabet.

• Show that alignment using scoring functions based on sequence length only requires finding shortest composition matches.

• Give biological examples where composition alignment finds statistically (and functionally) significant sequence similarity in the absence of significant standard alignments.
