SISAP’08 – 20080411
Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding
Ahmet Sacan and I. Hakki Toroslu
email: [ahmet,toroslu]@ceng.metu.edu.tr
Computer Engineering Department, Middle East Technical University
Ankara, TURKEY
Outline
• Background
  – Sequence Alignment
  – Blast
• Embedding Subsequences
  – Fastmap, LMDS
  – Analysis of parameters to achieve stable and accurate mapping
• Indexing Subsequences
Sequence Similarity Search
• Sequence similarity search is at the heart of bioinformatics research
  – Similarity information allows structural, functional, and evolutionary inferences
Sequence Alignment
• Goal: maximize the “alignment score”
• Score of aligning two residues:
  – Substitution matrix
• Optimal solution: Dynamic Programming
  – Global: Needleman-Wunsch (1970)
  – Local: Smith-Waterman (1981)
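To make the dynamic-programming idea concrete, here is a minimal sketch of the Smith-Waterman recurrence that returns only the best local alignment score. The linear gap penalty and the toy `simple_score` function (match +2, mismatch -1) are assumptions for illustration; a real search would plug in a substitution matrix.

```python
def smith_waterman_score(a, b, score, gap=-2):
    """Best local alignment score of sequences a and b.

    `score(x, y)` is the residue-pair score; `gap` is an assumed
    linear gap penalty (not taken from the slides).
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                0,                                        # start a new local alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def simple_score(x, y):
    """Toy scoring scheme standing in for a substitution matrix."""
    return 2 if x == y else -1


print(smith_waterman_score("ACDEF", "XXCDEYY", simple_score))
```

Dropping the `0` option from the `max` and initialising the borders with gap penalties would turn this into the global (Needleman-Wunsch) variant.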
Blast (Basic Local Alignment Search Tool)
• Popular tool for similarity search in sequence databases
1) Generate “k-tuples” (“k-mers”, “words”) from the query
  • CDEFG → CDE, DEF, EFG
  • CDE → ADE, CDC, CCE, CDE, …
2) Find (exact) matching k-tuples in the database
3) For each candidate sequence, extend the k-tuple match in both directions.
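Steps 1 and 2 above can be sketched as follows. This is an illustration of the seeding idea only, not BLAST's actual implementation: the function names, the brute-force neighborhood enumeration, and the scoring function passed in by the caller are all assumptions, and the extension phase (step 3) is left out.

```python
from collections import defaultdict
from itertools import product


def kmers(seq, k):
    """All overlapping k-tuples of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighborhood(word, alphabet, score, threshold):
    """Words of the same length whose position-wise score against
    `word` reaches `threshold` (brute force over alphabet^k)."""
    return [
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if sum(score(x, y) for x, y in zip(word, cand)) >= threshold
    ]


def build_index(database, k):
    """Map each k-tuple to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos, word in enumerate(kmers(seq, k)):
            index[word].append((seq_id, pos))
    return index


def find_seeds(query, index, k, alphabet, score, threshold):
    """Exact lookups of every neighborhood word of every query k-tuple."""
    seeds = []
    for qpos, word in enumerate(kmers(query, k)):
        for neighbor in neighborhood(word, alphabet, score, threshold):
            for seq_id, dpos in index.get(neighbor, []):
                seeds.append((qpos, seq_id, dpos))
    return seeds


def toy_score(x, y):
    """Hypothetical scoring: match +2, mismatch -1."""
    return 2 if x == y else -1


index = build_index(["AADEA", "CCCCC"], k=3)
print(find_seeds("CDEFG", index, 3, "ACDEFG", toy_score, threshold=3))
```

The neighborhood enumeration costs σ^k score evaluations per query word, which is one reason this style of seeding is tied to small k.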
Time-accuracy trade-off
• Challenge:
  – Allow flexible matching for larger words in reasonable time
[Figure: word length k on an axis running 1, 2, 3, 4, …, 11]
• Small k: too many k-tuple hits to process; slows down the extension phase
• Large k: few or no k-tuple hits; fast execution, but exact k-tuple matching is not sensitive (too many false negatives)
• Proteins (20^3 tuples), DNA (4^11 tuples)
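The number of distinct k-tuples over an alphabet of size σ is σ^k, which gives 20^3 for protein 3-tuples and 4^11 for DNA 11-tuples. A minimal sketch of that arithmetic (the function name is made up for illustration):

```python
def num_possible_tuples(alphabet_size, k):
    """Number of distinct k-tuples over an alphabet of the given size."""
    return alphabet_size ** k


print(num_possible_tuples(20, 3))   # protein 3-tuples
print(num_possible_tuples(4, 11))   # DNA 11-tuples
```

Because the count grows exponentially in k, enumerating or exactly matching every neighbor of a long word quickly becomes impractical, which is the trade-off this slide describes.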
Raising the bar for k
1. Map k-tuples to a vector space
  • Mapping cannot be perfect, thus “approximate results”
2. Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples
Mapping k-tuples
• Requirements:
  – Need to support out-of-sample extension
  – Speed
• Candidate methods:
  – Fastmap (Faloutsos, 1995)
  – Landmark MDS (de Silva, 2003)
Fastmap
1. Select two pivots
  • Distant pivots heuristic
2. Obtain projection using the cosine law
3. Project objects to a new hyperplane
4. Repeat
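The four steps above can be sketched as follows. This is a compact illustration rather than the authors' code: the single-pass pivot heuristic, the clamping of negative residual distances, and all names are assumptions.

```python
import math


def fastmap(objects, dist, dims):
    """Embed `objects` into `dims` coordinates using only `dist`.

    Returns (coords, pivots), where pivots[h] is the index pair used
    for axis h.
    """
    n = len(objects)
    coords = [[0.0] * dims for _ in range(n)]
    pivots = []

    def residual_sq(i, j, h):
        # Squared distance left after projecting out the first h axes.
        d2 = dist(objects[i], objects[j]) ** 2
        for t in range(h):
            d2 -= (coords[i][t] - coords[j][t]) ** 2
        # Clamp: the residual can turn negative when the original
        # distance is not Euclidean (an assumed workaround).
        return max(d2, 0.0)

    for h in range(dims):
        # Distant pivots heuristic: farthest object from an arbitrary
        # start, then the farthest object from that one.
        a = 0
        b = max(range(n), key=lambda j: residual_sq(a, j, h))
        a = max(range(n), key=lambda j: residual_sq(b, j, h))
        dab2 = residual_sq(a, b, h)
        if dab2 == 0.0:
            break  # nothing left to project
        pivots.append((a, b))
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine law: position of object i along the line a-b.
            coords[i][h] = (
                residual_sq(a, i, h) + dab2 - residual_sq(b, i, h)
            ) / (2 * dab)
    return coords, pivots


points = [(0, 0), (4, 0), (0, 3), (4, 3)]
coords, pivots = fastmap(points, math.dist, 2)
print(coords)
```

A query object is embedded the same way, using only its distances to the stored pivots for each axis.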
Fastmap
• Fast! O(Nd)
  – N: number of data points
  – d: target dimensionality
• For a query, only the distances to the set of pivots need to be calculated
• Unstable (esp. if the original space is non-Euclidean)
Landmark MDS
1. Select n landmarks (pivots)
2. Embed landmarks using classical MDS
3. For the remaining objects, apply distance-based triangulation based on distances to landmarks
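A minimal pure-Python sketch of steps 2 and 3 follows, with the landmarks assumed to be already chosen. The eigen-solver (power iteration with deflation) is a stand-in chosen to avoid external libraries and assumes the d dominant eigenvalues are positive; a production version would more likely call a linear algebra package. All function names are illustrative.

```python
import math


def top_eigenpairs(B, d, iters=500):
    """Top-d eigenpairs of symmetric B via power iteration with deflation."""
    n = len(B)
    B = [row[:] for row in B]  # work on a copy
    pairs = []
    for _axis in range(d):
        v = [1.0 + 0.1 * i for i in range(n)]  # deterministic start vector
        for _step in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate.
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate the component just found
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return pairs


def landmark_mds(landmarks, dist, d):
    """Return (landmark coordinates, embed function for new objects)."""
    n = len(landmarks)
    sq = [[dist(a, b) ** 2 for b in landmarks] for a in landmarks]
    col_mean = [sum(row) / n for row in sq]  # sq is symmetric
    total_mean = sum(col_mean) / n
    # Classical MDS: double-center the squared distance matrix.
    B = [[-0.5 * (sq[i][j] - col_mean[i] - col_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(B, d)
    # math.sqrt raises ValueError if an eigenvalue estimate is negative.
    coords = [[math.sqrt(lam) * v[i] for lam, v in pairs] for i in range(n)]

    def embed(obj):
        # Distance-based triangulation from squared distances to landmarks.
        delta = [dist(obj, lm) ** 2 for lm in landmarks]
        return [
            -0.5 * sum(v[j] * (delta[j] - col_mean[j]) for j in range(n))
            / math.sqrt(lam)
            for lam, v in pairs
        ]

    return coords, embed


landmarks = [(0, 0), (4, 0), (0, 3), (4, 3)]
coords, embed = landmark_mds(landmarks, math.dist, 2)
print(embed((1, 1)))
```

Embedding a new object needs only its n distances to the landmarks, which is what makes out-of-sample extension cheap at query time.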
Landmark MDS
• Provides stable results
• Good selection of landmarks is critical.
  – LMDSrandom
  – LMDSmaxmin
    • Add new landmarks that maximize the minimum distance to already selected landmarks
  – LMDSfastmap
    • Use the same landmarks as found by Fastmap
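The maxmin strategy can be sketched as a greedy loop. The choice of the first landmark (index 0 by default) and the function name are assumptions; the slides do not say how the first landmark is seeded.

```python
def maxmin_landmarks(objects, dist, n_landmarks, first=0):
    """Greedy maxmin selection; returns indices into `objects`."""
    chosen = [first]
    # min_dist[i]: distance from object i to its nearest chosen landmark.
    min_dist = [dist(obj, objects[first]) for obj in objects]
    while len(chosen) < n_landmarks:
        # Pick the object farthest from all landmarks chosen so far.
        nxt = max(range(len(objects)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, obj in enumerate(objects):
            min_dist[i] = min(min_dist[i], dist(obj, objects[nxt]))
    return chosen


values = [0, 1, 2, 10, 11, 5]
print(maxmin_landmarks(values, lambda x, y: abs(x - y), 3))
```

Each added landmark costs one pass over the data, so selecting n landmarks from N objects takes O(nN) distance computations.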
Evaluation
• Synthetic datasets
  – Randomly generate k-tuples for a given k and alphabet size σ
• Real dataset
  – Yeast proteins benchmark (σ=20)
  – 6,341 proteins, 2.9 million residues
  – 103 query proteins, 38-884 residues
• Weighted Hamming distance
• CB-EUC substitution matrix (Sacan, 2007)
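As a sketch of the distance used here, a weighted Hamming distance can be read as the sum of per-position residue dissimilarities between two equal-length k-tuples. That reading, and the `identity_cost` function below, are assumptions for illustration; the CB-EUC values are not reproduced here, and with the identity matrix the measure reduces to the plain Hamming distance.

```python
def weighted_hamming(x, y, cost):
    """Sum of per-position costs between two equal-length k-tuples."""
    if len(x) != len(y):
        raise ValueError("k-tuples must have equal length")
    return sum(cost(a, b) for a, b in zip(x, y))


def identity_cost(a, b):
    """Identity-matrix dissimilarity: 0 for equal residues, 1 otherwise."""
    return 0 if a == b else 1


print(weighted_hamming("CDEFG", "CDCFA", identity_cost))
```

Swapping `identity_cost` for a lookup into a residue dissimilarity matrix gives the weighted variant without changing the rest of the code.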
• Sammon’s metric stress
• Breaking point dimensionality
[Figure: x-axis: target dimensionality (d); k=5, synthetic dataset, identity matrix]
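For reference, Sammon's stress in its standard form is E = (1 / Σ δ_ij) · Σ (δ_ij − d_ij)² / δ_ij, summed over pairs i < j, where δ_ij is the original distance and d_ij the distance after embedding. The sketch below assumes that standard definition is the one meant here; skipping pairs with zero original distance is an added convention to avoid division by zero.

```python
def sammon_stress(orig, emb):
    """Sammon's stress between two square pairwise-distance matrices."""
    n = len(orig)
    total = 0.0  # sum of original distances
    err = 0.0    # sum of normalized squared errors
    for i in range(n):
        for j in range(i + 1, n):
            if orig[i][j] > 0:  # skip identical objects
                total += orig[i][j]
                err += (orig[i][j] - emb[i][j]) ** 2 / orig[i][j]
    return err / total if total > 0 else 0.0


original = [[0, 2], [2, 0]]
embedded = [[0, 1], [1, 0]]
print(sammon_stress(original, embedded))
```

A stress of 0 means every pairwise distance is preserved exactly; larger values indicate more distortion.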
Subsequence length (k) and alphabet size (σ)
Number of landmarks
k=5, d=7, synthetic dataset, identity matrix
Approximate k-tuple search performance
• Find all k-tuples within a specified radius from a query k-tuple
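A range query over embedded k-tuples can be sketched with a linear scan standing in for the spatial index (an R-tree or similar would return the same set faster). The `recall` helper shows one plausible way to compare approximate hits against the exact answer; it is an assumption, not necessarily the accuracy measure used in the evaluation.

```python
import math


def range_query(points, query_point, radius):
    """Indices of embedded points within `radius` of `query_point`.

    A linear scan is used here in place of a spatial access method.
    """
    return {
        i for i, p in enumerate(points)
        if math.dist(p, query_point) <= radius
    }


def recall(approx_hits, true_hits):
    """Fraction of the true hits that the approximate search recovered."""
    if not true_hits:
        return 1.0
    return len(approx_hits & true_hits) / len(true_hits)


embedded = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
hits = range_query(embedded, (0.0, 0.0), 1.0)
print(sorted(hits))
```

Because the embedding distorts distances, hits found in the vector space may miss some true neighbors or include extra ones, which is why the results are approximate.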
k=6, d=8, real dataset, CB-EUC matrix
Homology search
k=6, d=8, real dataset, CB-EUC matrix
Search time
search radius=7, database size=100,000
Conclusion
• Applied an embedding-based approach to approximate sequence similarity search for the first time
• Significant time improvements with negligible degradation in accuracy
• Achieved more stable embedding with combined pivot selection strategy
• Defined intrinsic Euclidean dimensionality of the dataset