composition alignment gary benson departments of computer science and biology boston university
Post on 21-Dec-2015
Composition Alignment
Gary Benson
Departments of Computer Science and Biology
Boston University
Outline of Talk
1. Sequence composition and composition match
2. Composition alignment algorithm
3. Composition match scoring functions
4. Growth of local composition alignment scores
5. Limiting the length of a composition match
6. Biological examples
Goal
Identify features in DNA sequences that are not accurately described by position specific patterns.
A position specific pattern, P, has the form:
$P = p_1 p_2 p_3 \ldots p_k$
where $p_i$ is either a single specific character or a choice (weighted or unweighted) of characters.
In DNA there are features that are characterized by composition rather than by position specific patterns.
Sequence Composition
Composition is a vector quantity describing the frequency of occurrence of each alphabet letter in a particular string.
Let S be a string over Σ. Then,
$C(S) = (f_{\sigma_1}, f_{\sigma_2}, f_{\sigma_3}, \ldots, f_{\sigma_{|\Sigma|}})$
is the composition of S, where $f_{\sigma_i}$ is the fraction of the characters in S that are $\sigma_i$.
Composition Example
S = ACTGTACCTGGCGCTATT
C(S) = (0.17, 0.28, 0.22, 0.33), with components in the order A, C, G, T
Note that the order of letters is irrelevant as it has no effect on the composition.
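The composition vector in this example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `composition` and the fixed alphabet order "ACGT" are illustrative assumptions, not part of the talk.

```python
from collections import Counter

def composition(s, alphabet="ACGT"):
    """Return the fraction of characters in s equal to each alphabet letter."""
    if not s:
        raise ValueError("composition is undefined for an empty string")
    counts = Counter(s)
    # Counter returns 0 for letters that never occur in s.
    return tuple(counts[ch] / len(s) for ch in alphabet)

S = "ACTGTACCTGGCGCTATT"
print(tuple(round(f, 2) for f in composition(S)))
```

Because only letter counts are used, any permutation of S yields the same vector.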
Composition and Sequence Features
• Isochores – Multi-megabase, specifically GC-rich or GC-poor. GC-rich isochores have greater gene density.
• CpG Islands – Several hundred nucleotides, rich in the dinucleotide CG, which is underrepresented in eukaryotic genomes. Methylation of the cytosine (C) in these dinucleotides affects gene expression.
• Protein binding regions – Tens of nucleotides; dinucleotide composition contributes to DNA flexibility, allowing the helix to change shape during protein binding.
Composition Match
We hope to identify common features in sequences using a new alignment algorithm. The main new idea is the use of composition matching.
Two strings, S and T, have a composition match if their lengths are equal and C(S) = C(T).
For example, S and T below have a composition match:
S = ACTGTACCTGGCGCTATT
T = AAACCCCCGGGGTTTTTT
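The definition can be checked directly. The sketch below compares letter counts rather than fractions, which is equivalent when the lengths are equal and avoids floating-point comparison; the function name `is_composition_match` is an illustrative assumption.

```python
from collections import Counter

def is_composition_match(s, t):
    """True if s and t have equal length and identical letter counts."""
    return len(s) == len(t) and Counter(s) == Counter(t)

S = "ACTGTACCTGGCGCTATT"
T = "AAACCCCCGGGGTTTTTT"
print(is_composition_match(S, T))
```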
Composition Alignment Problem
Given: Two sequences, S and T, of lengths m and n, over an alphabet Σ, and a scoring function cm(s, t) for the score of a composition match between substrings s and t.
Find: The best scoring alignment (global or local) of S with T such that the allowed scoring options include composition match between substrings of S and T as well as the standard options of 1) single character match, 2) single character mismatch, 3) insertion and deletion.
Example of composition alignment
S = AACGTCTTTGAGCTC
T = AGCCTGACTGCCTA
Alignment
AACGTCTTTGAGCTC
| |<->  | <--->
AGCCTGACT-GCCTA
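The alignment notation can be sanity-checked mechanically. The sketch below assumes a reading of the slide's notation that the talk does not spell out: `|` marks a single character match, a `<...>` span marks a composition match between the substrings above and below it, and a blank marks a mismatch or gap. The function name `check_alignment` and the three strings are a reconstruction of this example for illustration.

```python
from collections import Counter

def check_alignment(top, marks, bottom):
    """Verify that every '|' and every '<...>' span in marks is justified."""
    assert len(top) == len(marks) == len(bottom)
    i = 0
    while i < len(marks):
        if marks[i] == "|":
            # Single character match: both rows must carry the same letter.
            if top[i] != bottom[i]:
                return False
            i += 1
        elif marks[i] == "<":
            # Composition match: letter counts over the span must agree.
            j = marks.index(">", i)
            if Counter(top[i:j + 1]) != Counter(bottom[i:j + 1]):
                return False
            i = j + 1
        else:
            i += 1  # blank column: mismatch or gap, nothing to verify
    return True

print(check_alignment("AACGTCTTTGAGCTC",
                      "| |<->  | <--->",
                      "AGCCTGACT-GCCTA"))
```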
Related Work
• Alignment allowing adjacent letter swap.
  $O(nm)$, Lowrance and Wagner (1975)
• All swapped matchings of a pattern in a text.
  $O(nm^{1/3} \log m \log|\Sigma|)$, Amir, Aumann, Landau, Lewenstein, Lewenstein (2000)
  $O(n \log m \log|\Sigma|)$, Amir, Cole, Hariharan, Lewenstein, Porat (2001)
• Composition naming
  $O(n \log m \log|\Sigma|)$, Amir, Apostolico, Landau, Satta (2003)
Composition Alignment using Dynamic Programming
Given two sequences, S and T, the best alignment of the prefix strings
$S[1, i] = s_1 \ldots s_i$
$T[1, j] = t_1 \ldots t_j$
ends in one of four ways:
1. mismatch,
2. insertion,
3. deletion, or
4. composition match
Ways an Alignment Can End
mismatch
S: C G T
T: C G A
insertion or deletion
S: C A T
T: C A -
S: C A -
T: C A A
composition match
X: C G T A C
Y: C G C T A
Note that the suffixes will have a length l where
1 ≤ l ≤ min(i, j, limit)
Time Complexity
Computing the optimal composition alignment with dynamic programming is similar to standard alignment, except for the composition match scoring option. The overall time complexity is
O(nmZ)
where Z is the time required per (i, j) pair to find the best length l for the composition match.
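The recurrence implied by the four ending cases can be sketched as a straightforward global alignment in Python. This is an illustration under stated assumptions, not the talk's implementation: the function name, the mismatch and gap penalties of -1, and the treatment of a single character match as a composition match of length 1 are all choices made here. The inner loop over l is the Z factor in O(nmZ); it tries every suffix length up to min(i, j, limit).

```python
def composition_align_score(s, t, cm, limit, mismatch=-1.0, gap=-1.0):
    """Global composition alignment score by naive dynamic programming.

    cm(l) scores a composition match of length l (hypothetical signature).
    """
    m, n = len(s), len(t)
    W = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        W[i][0] = i * gap
    for j in range(1, n + 1):
        W[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Cases 2 and 3: insertion or deletion.
            best = max(W[i - 1][j] + gap, W[i][j - 1] + gap)
            # Case 1: single character mismatch.
            if s[i - 1] != t[j - 1]:
                best = max(best, W[i - 1][j - 1] + mismatch)
            # Case 4: composition match of the last l characters.
            diff = {}  # nonzero entries of the running composition difference
            for l in range(1, min(i, j, limit) + 1):
                a, b = s[i - l], t[j - l]
                diff[a] = diff.get(a, 0) + 1
                if diff[a] == 0:
                    del diff[a]
                diff[b] = diff.get(b, 0) - 1
                if diff[b] == 0:
                    del diff[b]
                if not diff:  # suffixes of length l have equal composition
                    best = max(best, W[i - l][j - l] + cm(l))
            W[i][j] = best
    return W[m][n]

print(composition_align_score("AC", "CA", cm=lambda l: float(l), limit=2))
```

With limit = 1 the same call can only use single character matches, so the scrambled pair AC and CA scores lower than with limit = 2.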
Computing length of the shortest composition match
Our goal here is to start with two strings, S and T, of equal length, and for each prefix pair S[1, k], T[1, k], find the length of the shortest suffixes that have a composition match.
For example, let
S = AACGTCTTTGAGCT
T = AGCCTGACTGCCTA
k                              0  1  2  3  4  5  6
Shortest suffix match length   0  1  0  1  0  1  3
The table states that for k = 6, the shortest suffixes which have a composition match have length = 3 (GTC in S and CTG in T):
S = AACGTC...
T = AGCCTG...
Composition difference
We find the matching suffix lengths using composition difference, a vector quantity for two strings x and y:
$CD(x, y) = (c_{\sigma_1}, \ldots, c_{\sigma_{|\Sigma|}})$
where $c_{\sigma_i}$ is the difference between the number of times $\sigma_i$ occurs in x and in y.
Using composition difference
Key observation: two identical composition differences at prefix lengths k and g indicate a composition match of length k - g.
Sorting to find shortest composition matches
Sort on composition difference using stable sort. Adjacent tuples with the same composition difference identify shortest composition matches.
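The sorting idea can be sketched as follows for two equal-length strings. The tuple layout (composition difference, prefix length), the function name, and the use of Python's built-in stable sort are assumptions made for illustration; the talk does not specify these details.

```python
def shortest_suffix_matches(s, t, alphabet="ACGT"):
    """For each prefix length k, return the length of the shortest suffixes of
    s[:k] and t[:k] that have a composition match (0 if there is none)."""
    assert len(s) == len(t)
    index = {ch: a for a, ch in enumerate(alphabet)}
    diff = [0] * len(alphabet)
    tuples = [(tuple(diff), 0)]          # composition difference at k = 0
    for k, (a, b) in enumerate(zip(s, t), start=1):
        diff[index[a]] += 1
        diff[index[b]] -= 1
        tuples.append((tuple(diff), k))
    # Stable sort on the difference only: equal differences stay ordered by k.
    tuples.sort(key=lambda item: item[0])
    result = [0] * (len(s) + 1)
    for (cd_prev, g), (cd_cur, k) in zip(tuples, tuples[1:]):
        if cd_prev == cd_cur:
            result[k] = k - g            # nearest earlier prefix with same CD
    return result

S = "AACGTCTTTGAGCT"
T = "AGCCTGACTGCCTA"
print(shortest_suffix_matches(S, T)[:7])
```

The first seven entries correspond to k = 0 through 6 in the table above.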
Time complexity for composition matches
$O(nm|\Sigma|)$ to find the shortest composition match lengths for all index pairs of two strings of lengths n and m.
In our work, |Σ| is a small constant (4 for DNA, 16 for dinucleotides). For larger alphabets, the method of Amir, Apostolico, Landau and Satta (2003) can be used.
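One way to reach this bound is to process each diagonal of the (i, j) grid separately, since a composition match ending at (i, j) always starts on the same diagonal. The sketch below is an assumption-laden illustration: it swaps the stable sort for a dictionary of last-seen composition differences, and the function name and table layout are hypothetical.

```python
def all_pairs_shortest_match(s, t, alphabet="ACGT"):
    """L[i][j] = length of the shortest composition match between a suffix of
    s[:i] and a suffix of t[:j], or 0 if none exists."""
    m, n = len(s), len(t)
    index = {ch: a for a, ch in enumerate(alphabet)}
    L = [[0] * (n + 1) for _ in range(m + 1)]
    # Every diagonal starts on the left column or the top row of the grid.
    starts = [(i0, 0) for i0 in range(m)] + [(0, j0) for j0 in range(1, n)]
    for i0, j0 in starts:
        diff = [0] * len(alphabet)
        last_seen = {tuple(diff): 0}     # difference -> latest step it occurred
        for k in range(1, min(m - i0, n - j0) + 1):
            diff[index[s[i0 + k - 1]]] += 1
            diff[index[t[j0 + k - 1]]] -= 1
            key = tuple(diff)            # O(|alphabet|) work per cell
            if key in last_seen:
                L[i0 + k][j0 + k] = k - last_seen[key]
            last_seen[key] = k
    return L

L = all_pairs_shortest_match("AACGTCTTTGAGCT", "AGCCTGACTGCCTA")
print([L[k][k] for k in range(7)])
```

Each cell costs time proportional to the alphabet size, which is constant for DNA.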
Composition match scoring functions
We have explored:
Functions based on match length, k:
• Function 1: cm(k) = ck
• Function 2: cm(k) = c√k
where c is a constant.
Functions based on substring composition:
• Function 4: cm(C, B, k) = ck · H(C, B)
where H is the relative entropy function, C is the composition of the matching substrings and B is a background composition.
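The three scoring functions listed above can be written down directly. This sketch makes assumptions the slide leaves open: relative entropy is taken as H(C, B) = Σ C_i log2(C_i / B_i) measured in bits, letters with zero frequency in C contribute nothing, and the function names and the default c = 1 are illustrative.

```python
import math

def cm_linear(k, c=1.0):
    """Function 1: score grows linearly with match length."""
    return c * k

def cm_sqrt(k, c=1.0):
    """Function 2: score grows with the square root of match length."""
    return c * math.sqrt(k)

def relative_entropy(C, B):
    """Relative entropy of composition C against background B, in bits
    (log base 2 is an assumption)."""
    return sum(p * math.log2(p / q) for p, q in zip(C, B) if p > 0)

def cm_entropy(C, B, k, c=1.0):
    """Function 4: length-weighted relative entropy of the match composition."""
    return c * k * relative_entropy(C, B)

uniform = (0.25, 0.25, 0.25, 0.25)
print(cm_linear(4), cm_sqrt(4), cm_entropy((0.0, 0.5, 0.5, 0.0), uniform, 4))
```

Under Function 4 a match whose composition equals the background scores zero, so only compositionally unusual matches are rewarded.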
Additive and subadditive scoring functions
The functions based on length are additive or subadditive:
cm(i + j) ≤ cm(i) + cm(j)
Lemma: For additive or subadditive composition match scoring functions, any best scoring alignment is equivalent in score to an alignment which contains only shortest composition matches.
Theorem: Composition alignment with additive or subadditive match scoring functions and finite alphabet has time complexity O(nm).
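As a short check of the inequality for the two length-based functions, assuming the constant c is non-negative (the slides do not state its sign) and i, j ≥ 1:

```latex
\begin{align*}
\text{Function 1: } & cm(i+j) = c(i+j) = ci + cj = cm(i) + cm(j) \\
\text{Function 2: } & (\sqrt{i} + \sqrt{j})^2 = i + j + 2\sqrt{ij} \;\ge\; i + j
  \;\Longrightarrow\; \sqrt{i+j} \le \sqrt{i} + \sqrt{j} \\
& \Longrightarrow\; cm(i+j) = c\sqrt{i+j} \;\le\; c\sqrt{i} + c\sqrt{j} = cm(i) + cm(j)
\end{align*}
```

Function 1 is therefore additive and Function 2 is subadditive, so splitting a long composition match into shorter ones never lowers the score under either function.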
The limit parameter
Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful.
The limit parameter is an upper bound on the length l of the longest single composition match, used to prevent excessive matching.
Sequence length = 100, randomly generated
limit                        1     2     5     10
DNA (all letters p = 0.25)   25    33.7  44.4  51
Growth of local alignment score: Function 1
[Figure: Average Local Composition Alignment Scores: DNA Sequences, Function 1. Score versus Sequence Length (100 to 1000), with curves for Limit = 2, Limit = 3 and Limit = 4.]
Global score as a predictor of local parameter suitability: Function 1
[Figure: Average Global Composition Alignment Scores: DNA Sequences, Function 1. Score versus Sequence Length (100 to 900), with curves for Limit = 2, Limit = 3, Limit = 4 and Limit = 5.]
Growth of local alignment score: Function 2
[Figure: Average Local Composition Alignment Scores: DNA Sequences, Function 2. Score versus Sequence Length (100 to 1000), with curves labeled 50, 30, 20, 10 and 6.]
Global score as a predictor of local parameter suitability: Function 2
[Figure: Global Composition Alignment Scores: DNA Sequences, Function 2. Score versus Sequence Length (0 to 900), with curves labeled 10, 20, 30 and 50.]
Limit values for DNA
• Function 1: cm(k) = ck: Limit ≤ 3.
• Function 2: cm(k) = c√k: Limit ≤ 10.
• Function 4: cm(C, B, k) = ck · H(C, B): Limit ≤ 50.
Biological examples
Composition alignment was tested on a set of 1796 promoter sequences from the Eukaryotic Promoter Database. Each sequence is 600 nucleotides long, 500 bases upstream and 100 downstream of the transcription initiation site.
Two local alignment scores were produced using function 1, W using composition alignment and S using standard alignment. The examples shown have statistically significant W with W ≥ 3 · S to exclude good standard alignments.
Example 1
Composition alignment and standard alignment of the same two promoters. Standard alignment is not statistically significant. Sequences are characteristic of CpG islands.
Composition Alignment:
GCCCGCCCGCCGCGCTCCCGCCCGCCGCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC
<->||||<>|<>||<>| ||||<>||<> |<-> |||||| <>|<> ||||<><> |<>| ||<->||
CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGGCCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC
Standard Alignment:
CGCCGCCGCCG
CGCCGCCGCCG
Example 2
Composition alignment of two promoter sequences. Composition changes at vertical line.
Compositions, with components in the order A, C, G, T:
Left: (0.01, 0.61, 0.30, 0.08)
Right: (0.19, 0.16, 0.56, 0.09)
GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GCTCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG
<->|<><>|||| <>|||||| ||<->|<>||||| <>|||| |||| || ||<->   |  |<><>|<-> | |<>|<>|<>||||<-><->|
CCGCGCGCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGGGGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG
Conclusion
We
• define a new alignment problem based on composition matching and test several scoring functions
• show how to find all-pairs shortest composition match lengths in linear time per pair for a fixed alphabet
• show that alignment using scoring functions based on sequence length only requires finding shortest composition matches
• give biological examples where composition alignment finds statistically (and functionally) significant sequence similarity in the absence of significant standard alignments