Estimating seed sensitivity on homogeneous alignments
BIBE 2004 Taichung - May 20th, 2004
Gregory Kucherov (1), Laurent Noé (1), Yann Ponty (2)
(1) LORIA, Nancy; (2) LRI, Paris, France
2
Seed paradigm (FASTA, BLAST, PatternHunter, YASS, …)
Start with small, conserved and easily detected fragments (seeds).
Then extend the seeds and build possible alignments.
[Figure: dot plot of two sequences showing the detected seeds and the detected alignment]
3
Spaced Seed Model [Ma & al. 02]
Seed pattern: ###--#-##
‘#’: obligatory match position
‘-’: joker position (“don’t care” position)
Weight: 6 [number of #]
Span: 9 [number of all symbols]
Example (the seed ###--#-## is slid along the alignment):
ATCAGTGCAATGCTCAAGA
|||||:||:||||:|||||
ATCAGCGCGATGCGCAAGA
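The sliding of the seed along the example alignment can be reproduced in a few lines. This is only an illustrative sketch: the function name `seed_hits` and the use of the '|' / ':' match line as input are assumptions made here, not something taken from the slides.

```python
def seed_hits(seed: str, match_line: str) -> list:
    """Return the 0-based offsets at which the seed matches.

    seed       : '#' = obligatory match position, '-' = joker position
    match_line : '|' = match, ':' = mismatch (one symbol per alignment column)
    """
    required = [j for j, c in enumerate(seed) if c == "#"]
    hits = []
    for start in range(len(match_line) - len(seed) + 1):
        # The seed hits if every '#' position falls on a match column.
        if all(match_line[start + j] == "|" for j in required):
            hits.append(start)
    return hits


seq1 = "ATCAGTGCAATGCTCAAGA"
seq2 = "ATCAGCGCGATGCGCAAGA"
match_line = "".join("|" if a == b else ":" for a, b in zip(seq1, seq2))

print(match_line)                          # |||||:||:||||:|||||
print(seed_hits("###--#-##", match_line))  # [2, 9, 10]
```

On this alignment the seed of weight 6 and span 9 hits at three offsets, while a contiguous seed of the same weight (`######`) would not hit at all, since the longest run of matches has length 5.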
4
How to describe Selectivity and Sensitivity
Selectivity
Determined by the seed weight: the number of random occurrences is ~ 4^(-weight).
Sensitivity
Probability for the seed to detect an interesting similarity.
To be specified:
• What set of similarities do we want to detect?
• What is the probability of each similarity?
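To give a feeling for the selectivity estimate 4^(-weight), the sketch below evaluates it for a few weights. It assumes independent, uniformly distributed nucleotides and ignores border effects; the function name and the sequence lengths are hypothetical choices made for the illustration.

```python
def expected_random_hits(weight: int, len1: int, len2: int) -> float:
    """Rough number of chance seed hits between two random DNA sequences.

    Assumption: i.i.d. uniform nucleotides, so each obligatory position
    matches by chance with probability 1/4, and each of the len1 * len2
    position pairs is a hit with probability 4^(-weight).
    """
    return len1 * len2 * 4.0 ** (-weight)


for w in (6, 9, 11):
    print(w, expected_random_hits(w, 10_000, 10_000))
```

Each extra obligatory position divides the expected number of random hits by 4, which is why the weight alone drives selectivity in this simple model.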
5
What is a good seed? Sensitivity/Selectivity balance
Seed of relatively large weight:
– Few random seed matches (high selectivity)
– Possible loss of similarities (low sensitivity)
Seed of relatively small weight:
– Detects almost all possible similarities (high sensitivity)
– Many random seed matches (low selectivity)
6
Similarity: notation
Ungapped similarities only (no indels)
CTACGATGAGCTGCT
|||:||:||||:|||
CTATGACGAGCGGCT
All matches are equiprobable, all mismatches are equiprobable (simplification)
A similarity is therefore represented as a binary word, here 111011011110111 (1 for a match, 0 for a mismatch).
7
Similarities to be detected [Buhler et al 2003, Brejova et al 2003, Choi et al 2003, Keich et al 2004, Ma et al 2001, 2003, …]
The set:
– all strings in {0, 1}^n (for a given n)
The probabilities:
– Bernoulli model;
– Markov models;
8
Similarities to be detected [Buhler et al 2003, Brejova et al 2003, Choi et al 2003, Keich et al 2004, Ma et al 2001, 2003, …]
Advantage:
– natural probability model;
– DP algorithms to compute sensitivity;
Disadvantage:
– uninteresting similarities are included in the set.
9
Similarities to be detected and their probabilities: our approach
Only “true” similarities are to be considered
Scoring scheme: each match contributes +r, each mismatch contributes -p
CTACGATGAGCTGCT
|||:||:||||:|||
CTATGACGAGCGGCT
+r+r+r-p+r+r-p+r+r+r+r-p+r+r+r
Score = 12r – 3p
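The scoring of the example alignment can be checked mechanically. The sketch below is illustrative: the helper names `binary_word` and `score`, and the encoding 1 = match / 0 = mismatch, are assumptions; p is passed as a positive penalty, as in "Score = 12r – 3p".

```python
def binary_word(seq1: str, seq2: str) -> str:
    """Encode an ungapped alignment as '1' (match) / '0' (mismatch)."""
    if len(seq1) != len(seq2):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return "".join("1" if a == b else "0" for a, b in zip(seq1, seq2))


def score(word: str, r: int, p: int) -> int:
    """Sum +r for every match and -p for every mismatch."""
    return sum(r if c == "1" else -p for c in word)


word = binary_word("CTACGATGAGCTGCT", "CTATGACGAGCGGCT")
print(word)                    # 111011011110111
print(score(word, r=1, p=3))   # 12*1 - 3*3 = 3
```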
10
Similarities to be detected and their probabilities: our approach
Homogeneous similarities:
– do not contain a sub-alignment of higher score (cf. Maximum Scoring Pairs),
– i.e. all prefixes and suffixes of the similarity have non-negative score.
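The homogeneity condition is easy to test on a binary word. The sketch below is a minimal illustration, with hypothetical function names and the assumed encoding 1 = match / 0 = mismatch (p given as a positive penalty). It checks the prefix/suffix condition directly, and also includes a brute-force search for a better sub-alignment; for words of non-negative total score the two tests agree.

```python
def prefix_scores(word: str, r: int, p: int) -> list:
    """Scores of all prefixes, starting with the empty prefix (score 0)."""
    scores, y = [0], 0
    for c in word:
        y += r if c == "1" else -p
        scores.append(y)
    return scores


def is_homogeneous(word: str, r: int, p: int) -> bool:
    """All prefixes and all suffixes have non-negative score.

    A suffix score is total - prefix score, so both conditions together
    mean that every prefix score lies between 0 and the total score.
    """
    ys = prefix_scores(word, r, p)
    total = ys[-1]
    return all(0 <= y <= total for y in ys)


def has_better_subalignment(word: str, r: int, p: int) -> bool:
    """Brute force: does some contiguous segment score more than the whole?"""
    ys = prefix_scores(word, r, p)
    total = ys[-1]
    return any(ys[j] - ys[i] > total
               for i in range(len(ys)) for j in range(i + 1, len(ys)))


word = "111011011110111"
print(is_homogeneous(word, r=1, p=2))   # True
print(is_homogeneous(word, r=1, p=3))   # False: a prefix of length 7 scores -1
```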
11
Homogeneous similarities
[Figure: prefix score plotted along the alignment for a homogeneous alignment. Homogeneous similarities lie entirely inside the shaded area.]
12
Non-homogeneous similarities
[Figures: score plotted along the alignment for non-homogeneous similarities, with panels labelled “Negative suffix”, “Prefix of higher score”, “Suffix of higher score” and “Negative prefix”.]
13
Our Model
The set: homogeneous similarities of given length n and given score S.
The probabilities: all similarities of the set have the same probability.
14
Problem Statement
Given:
1) a seed of weight w and span l,
2) an integer scoring scheme {r, p},
3) similarity length n,
4) score S.
Compute:
the probability for the seed to match a random homogeneous similarity of length n and score S.
15
Computation of Seed Sensitivity (homogeneous case)
To be computed:
the probability for a seed to detect a homogeneous similarity, i.e. the number of homogeneous similarities detected by the seed divided by the number Nhom of all homogeneous similarities.
Two steps of computation:
– Preprocessing: counting the number Nhom of all homogeneous similarities of given length n and score S.
  • DP algorithm. Space and time complexity: …
– Seed sensitivity measure: counting the number of homogeneous similarities detected by the seed.
  • DP algorithm (similar to Keich03).
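For very small n, both counts can be obtained by plain enumeration, which is handy as a reference when checking a DP implementation. The sketch below is such a brute-force check and not the DP algorithm of the slides; it runs in time exponential in n, and the function name, the argument order and the convention that p is a positive penalty are assumptions.

```python
from itertools import product


def brute_force_counts(seed: str, n: int, S: int, r: int, p: int):
    """Return (n_detected, n_hom) by enumerating all 2^n binary words.

    A word is kept when its final score is S and every prefix score stays
    in [0, S] (homogeneity). It is detected when the seed hits somewhere.
    """
    required = [j for j, c in enumerate(seed) if c == "#"]
    n_hom = n_detected = 0
    for word in product((0, 1), repeat=n):
        y, inside = 0, True
        for bit in word:
            y += r if bit else -p
            if not 0 <= y <= S:
                inside = False
                break
        if not inside or y != S:
            continue
        n_hom += 1
        if any(all(word[i + j] for j in required)
               for i in range(n - len(seed) + 1)):
            n_detected += 1
    return n_detected, n_hom


detected, total = brute_force_counts("##-#", n=12, S=4, r=1, p=1)
if total:
    print(detected, total, detected / total)  # sensitivity = detected / total
```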
16
Number of Homogeneous Similarities: Reduction to Graph Path problem
Vertices: {(k, y)}, where k is the length of a similarity and y is its score.
Vertex (k, y) corresponds to the set of all similarities of score y and length k.
2 edges from each (k, y):
– (k+1, y+r) for a match at position (k+1);
– (k+1, y-p) for a mismatch at position (k+1).
Homogeneous similarities ↔ paths from (0,0) to (n,S) inside the n x S grid
[Figure: grid with axes “Alignment” and “Score”, showing a path from (0,0) through (k,y) to (n,S)]
17
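Number of Homogeneous Similarities: Recursive equation
Score S fixed: count the possible paths from (0,0) to (n,S).
Let D(y, k) be the number of paths from (k, y) to (n, S) that stay inside the grid. Then
D(y, k) = D(y+r, k+1) + D(y-p, k+1)
taking into account border effects: D(y, k) = 0 whenever y < 0 or y > S, and D(y, n) = 1 if y = S, 0 otherwise.
[Figure: grid showing a path from (0,0) through (k,y) to (n,S)]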
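The recursion translates almost literally into a memoized function. The sketch below is one possible reading of the slide, under the assumptions that D(y, k) counts the paths from (k, y) to (n, S), that p is a positive penalty, and that paths leaving the band 0 ≤ y ≤ S are discarded; the name `count_homogeneous` is hypothetical.

```python
from functools import lru_cache


def count_homogeneous(n: int, S: int, r: int, p: int) -> int:
    """Number of homogeneous similarities of length n and score S."""

    @lru_cache(maxsize=None)
    def D(y: int, k: int) -> int:
        # Border effects: a path that leaves the n x S grid is not counted.
        if y < 0 or y > S:
            return 0
        # End of the similarity: only paths ending exactly at (n, S) count.
        if k == n:
            return 1 if y == S else 0
        # One edge for a match (+r), one edge for a mismatch (-p).
        return D(y + r, k + 1) + D(y - p, k + 1)

    return D(0, 0)


print(count_homogeneous(3, 1, 1, 1))    # 1  (only the word 101)
print(count_homogeneous(20, 16, 1, 3))  # 14 (one mismatch, at positions 4..17)
```

With memoization each pair (y, k) is evaluated once, so the number of evaluated states is bounded by (n + 1) times the number of reachable scores in [0, S].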
18
Time and space complexity
Space Complexity
Time Complexity
19
Computer experiments: Homogeneous vs. All similarities
Compare the sensitivity of seeds on both models
– Fixed score S according to the scoring scheme (match: +1; mismatch: -3, i.e. r = 1, p = 3)
– Similarity length varies from 20 to 120
– Two sets of similarities:
  (1) all similarities of given length and score;
  (2) only homogeneous similarities of given length and score.
Comparison plots
– x axis: similarity length
– y axis: sensitivity (probability that the seed matches a similarity)
20
Experiments (score 16)
Contiguous seed (weight 11): ###########
21
Experiments (score 16)
Spaced seed (weight 11): ###-#--#-#--##-###
22
Optimal seeds
Optimal seeds are different
Optimal seeds are the same
23
Summary
We have proposed
– a new definition of seed sensitivity based on the notion of homogeneous similarity;
– a DP algorithm to compute the sensitivity of a given seed.
The sensitivity of a seed on homogeneous similarities is usually substantially larger than on all similarities.
A seed that is optimal on homogeneous similarities may not be optimal on all similarities, and vice versa.
24
Extensions
Combining homogeneity constraint with properties of DNA sequences
Distinguishing different mismatches (transitions/transversions): YASS http://www.loria.fr/projects/YASS/
Estimating seed sensitivity on homogeneous alignments
Gregory Kucherov (1), Laurent Noé (1), Yann Ponty (2)
(1) LORIA (Laboratoire lorrain de recherche en informatique et ses applications), Nancy, France
(2) LRI (Laboratoire de recherche en informatique), Paris, France
26
Collaborators
Thanks!
Mikhail Roytberg (Institute of Mathematical Problems of Biology, Russia) for his comments and helpful discussions during the preparation of this work.
Alain Denise (Laboratoire de Recherche en Informatique, France) for his help on culminating paths.
27
Thank you for your attention!
28
29
Number of Detected Homogeneous Similarities: Reduction to Graph Path problem (after Keich03)
Vertices: {(k, y, t, u)}, where
– k is the length of a similarity and y is its score;
– 1 ≤ t ≤ l;
– u is a binary word of length l-w.
Vertex (k, y, t, u) corresponds to the set H(k, y, t, u) of homogeneous similarities of length k and score y, where
– t is the length of the maximal prefix of the seed matching the end of a word v from H(k, y, t, u);
– u is the word made of the symbols of v that correspond to the joker positions within this match;
– t and u are the same for all similarities from H(k, y, t, u).
30
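Number of Detected Homogeneous Similarities
Seed: ###--#-##   Scoring scheme: r = +1; p = 2 (a mismatch scores -2)
Current vertex (k, y): k = 17, y = 8, t = 8, u = …
Match at position k+1: k’ = k + 1, y’ = y + r, t’ = t + 1, u’ = u
[Figure: the seed ###--#-## aligned with the end of the similarity at (k, y), with the matched prefix of length t and the joker word u marked]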
31
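Number of Detected Homogeneous Similarities
Seed: ###--#-##   Scoring scheme: r = +1; p = 2 (a mismatch scores -2)
After a match: k’ = 17 + 1, y’ = 8 + r, t’ = t + 1, u’ = u
New vertex (k’, y’): k’ = 18, y’ = 9, t’ = 9, u’ = …
[Figure: the seed ###--#-## aligned with the end of the similarity at (k’, y’), with the matched prefix of length t+1 and the joker word u marked]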
32
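Number of Detected Homogeneous Similarities
Seed: ###--#-##   Scoring scheme: r = +1; p = 2 (a mismatch scores -2)
Current vertex (k, y): k = 17, y = 8, t = 8, u = …
Mismatch at position k+1: k’ = 17 + 1, y’ = 8 - p, t’ = ?, u’ = ?
[Figure: the seed ###--#-## aligned with the end of the similarity at (k, y), with the matched prefix of length t and the joker word u marked]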
33
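Number of Detected Homogeneous Similarities
Seed: ###--#-##   Scoring scheme: r = +1; p = 2 (a mismatch scores -2)
After a mismatch: k’ = 17 + 1, y’ = 8 - p, t’ = ?, u’ = ?
New vertex (k’, y’): k’ = 18, y’ = 6, t’ = 5, u’ = …
(t’, u’) = F(t, u)
[Figure: the seed ###--#-## realigned with the end of the similarity at (k’, y’), with the new matched prefix of length t’ and the new joker word u’ marked]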
34
Number of Detected Homogeneous Similarities: Reduction to Graph Path problem (after Keich03)
Edges: 2 edges from “each” (k, y, t, u):
– (k+1, y+r, t+1, u) for a match at the (k+1)-th position;
– (k+1, y-p, t’, u’) for a mismatch at the (k+1)-th position;
t’ and u’ can be pre-computed.
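The counting of detected similarities can be prototyped with a simpler, less compact state than (t, u). The sketch below is a substitute technique, not the construction of the slides: instead of the matched seed prefix t and joker word u, it remembers the last l symbols of the similarity as a bit window (up to 2^l windows per score) and a flag telling whether the seed has already hit. The function name is hypothetical and p is assumed to be a positive penalty.

```python
def count_detected(seed: str, n: int, S: int, r: int, p: int):
    """Return (n_detected, n_hom) for homogeneous similarities of length n, score S."""
    span = len(seed)
    # Bit (span-1-j) of the window holds the symbol under seed position j.
    need = sum(1 << (span - 1 - j) for j, c in enumerate(seed) if c == "#")
    mask = (1 << span) - 1
    # State: (prefix score, last `span` symbols as bits, seed already hit?)
    states = {(0, 0, False): 1}
    for k in range(1, n + 1):
        nxt = {}
        for (y, win, hit), cnt in states.items():
            for bit, dy in ((1, r), (0, -p)):
                y2 = y + dy
                if not 0 <= y2 <= S:      # leave the grid: not homogeneous
                    continue
                win2 = ((win << 1) | bit) & mask
                hit2 = hit or (k >= span and (win2 & need) == need)
                key = (y2, win2, hit2)
                nxt[key] = nxt.get(key, 0) + cnt
        states = nxt
    n_hom = sum(c for (y, _, _), c in states.items() if y == S)
    n_det = sum(c for (y, _, hit), c in states.items() if y == S and hit)
    return n_det, n_hom


detected, total = count_detected("###--#-##", n=30, S=12, r=1, p=2)
if total:
    print(detected / total)   # sensitivity on homogeneous similarities
```

The (t, u) encoding of the slides plays the same role as the bit window here, but keeps only the information needed to continue a partial seed match, which gives far fewer states for long seeds.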
35
Experiments (score 32)
Contiguous seed (weight 11): ###########
36
Experiments (score 32)
Spaced seed (weight 11): ###-#--#-#--##-###
37
Example 1
Seed: ###-###
[Figure: score plotted along the alignment, with the detected alignment marked]
38
Example 2
Seed: ###-###
[Figure: score plotted along the alignment, with the detected alignment marked]
39
Example 3
Seed: ###-###
[Figure: score plotted along the alignment, with the detected alignment marked]