SISAP’08 – 20080411

Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding

Ahmet Sacan and I. Hakki Toroslu
email: [ahmet,toroslu]@ceng.metu.edu.tr

Computer Engineering Department, Middle East Technical University

Ankara, TURKEY

Outline

• Background
– Sequence Alignment
– Blast
• Embedding Subsequences
– Fastmap, LMDS
– Analysis of parameters to achieve stable and accurate mapping
• Indexing Subsequences

Sequence Similarity Search

• Sequence similarity search is at the heart of bioinformatics research
– Similarity information allows structural, functional, and evolutionary inferences

Sequence Alignment

• Goal: maximize “alignment score”
• Score of aligning two residues:
– Substitution matrix
• Optimal solution: Dynamic Programming
– Global: Needleman-Wunsch (1970)
– Local: Smith-Waterman (1981)
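
The local dynamic program named above can be sketched as follows. This is a minimal illustration, assuming a toy match/mismatch score in place of a real substitution matrix and a linear gap penalty; the function name and the default parameter values are illustrative and do not come from the slides.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    # prev is DP row i-1 and curr is row i; column 0 stays at 0.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Toy scoring (assumption): a real search would look this up
            # in a substitution matrix.
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] aligned to a gap
                curr[j - 1] + gap,   # b[j-1] aligned to a gap
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The global (Needleman-Wunsch) variant differs mainly in dropping the floor at 0, initializing the first row and column with accumulated gap penalties, and reading the score from the last cell.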

Blast (Basic Local Alignment Search Tool)

• Popular tool for similarity search in sequence databases

1) Generate “k-tuples” (“k-mers”, “words”) from query
• CDEFG → CDE, DEF, EFG
• CDE → ADE, CDC, CCE, CDE, …
2) Find (exact) matching k-tuples in the database
3) For each candidate sequence, extend the k-tuple match in both directions.
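
Steps 1 and 2 above can be sketched with a simple in-memory index. This is an illustration only, not Blast's actual data structures: the function names and the dictionary-based database are assumptions, and the neighborhood expansion of each word (CDE to ADE, CDC, ...) and the extension phase of step 3 are left out.

```python
from collections import defaultdict

def k_tuples(seq, k):
    """All overlapping length-k words of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(database, k):
    """Map each k-tuple to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for offset, word in enumerate(k_tuples(seq, k)):
            index[word].append((seq_id, offset))
    return index

def seed_hits(query, index, k):
    """Exact k-tuple matches as (query offset, sequence id, database offset)."""
    return [(q_off, seq_id, d_off)
            for q_off, word in enumerate(k_tuples(query, k))
            for seq_id, d_off in index.get(word, [])]
```

Each returned hit is a seed that the extension phase would then grow in both directions.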

Time-accuracy trade-off

• Challenge:
– Allow flexible matching for larger words at reasonable time

[Figure: axis of word length k: 1 2 3 4 … 11; labels: Proteins (20^3 tuples), DNA (4^11 tuples)]

• Small k:
– Too many k-tuple hits to process
– Slows down the extension phase
• Large k:
– Few or no k-tuple hits
– Fast execution
– Exact k-tuple matching not sensitive
– Too many false negatives

Raising the bar for k

1. Map k-tuples to a vector space
• Mapping cannot be perfect, thus “approximate results”
2. Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples

Mapping k-tuples

• Requirements:
– Need to support out-of-sample extension
– Speed
• Candidate methods:
– Fastmap (Faloutsos, 1995)
– Landmark MDS (de Silva, 2003)

Fastmap

1. Select two pivots
• Distant pivots heuristic
2. Obtain projection using cosine law
3. Project objects to new hyperplane
4. Repeat
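
The four steps above can be sketched as follows. This is a minimal version under stated assumptions: the pivot search starts from object 0 rather than a random object, negative residual squared distances are clamped to zero, and distances are recomputed on demand instead of cached. The function and variable names are illustrative.

```python
import math

def fastmap(objects, dist, dims):
    """Embed objects into `dims` coordinates using only pairwise distances."""
    n = len(objects)
    coords = [[0.0] * dims for _ in range(n)]

    def residual_sq(i, j, axis):
        # Squared distance left after removing the first `axis` coordinates,
        # i.e. the distance measured in the current hyperplane.
        d2 = dist(objects[i], objects[j]) ** 2
        for a in range(axis):
            d2 -= (coords[i][a] - coords[j][a]) ** 2
        return max(d2, 0.0)  # clamp: non-Euclidean inputs can go negative

    for axis in range(dims):
        # Distant-pivots heuristic: from a start object, hop to the farthest
        # object, then to the object farthest from that one.
        pa = 0
        pb = max(range(n), key=lambda j: residual_sq(pa, j, axis))
        pa = max(range(n), key=lambda j: residual_sq(pb, j, axis))
        dab2 = residual_sq(pa, pb, axis)
        if dab2 == 0.0:
            break  # no distance left to explain; remaining axes stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine law: projection of object i onto the line pa -> pb.
            coords[i][axis] = (residual_sq(pa, i, axis) + dab2
                               - residual_sq(pb, i, axis)) / (2 * dab)
    return coords
```

A query object is embedded the same way, using only its distances to the stored pivot pairs. The clamp in `residual_sq` is one place where a non-Euclidean input distorts the result.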

Fastmap

• Fast! O(Nd)
– N: number of data points
– d: target dimensionality

• For query, need only to calculate distances to set of pivots

• Unstable (esp. if original space is non-Euclidean)

Landmark MDS

1. Select n landmarks (pivots)
2. Embed landmarks using classical MDS
3. For the remaining objects, apply distance-based triangulation based on distances to landmarks
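
Steps 2 and 3 above can be sketched with NumPy, following the usual Landmark MDS formulation (classical MDS on the squared landmark distances, then a linear triangulation map for every other object). The function names, the eigenvalue threshold, and the choice to raise an error when too few positive eigenvalues exist are assumptions made for this sketch.

```python
import numpy as np

def lmds_fit(landmark_sq_dists, dims):
    """Classical MDS on the landmarks.

    Returns landmark coordinates (n x dims), the triangulation operator
    (dims x n), and the mean squared distance to each landmark.
    """
    delta = np.asarray(landmark_sq_dists, dtype=float)
    n = delta.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ delta @ centering   # double-centered inner products
    eigvals, eigvecs = np.linalg.eigh(gram)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dims]        # indices of the largest `dims`
    lam, vecs = eigvals[top], eigvecs[:, top]
    if np.any(lam <= 1e-12):
        raise ValueError("fewer than `dims` positive eigenvalues; lower `dims`")
    coords = vecs * np.sqrt(lam)                  # landmark embedding
    pinv_t = (vecs / np.sqrt(lam)).T              # maps distances to coordinates
    return coords, pinv_t, delta.mean(axis=0)

def lmds_embed(sq_dists_to_landmarks, pinv_t, mean_sq):
    """Triangulate one object from its squared distances to the landmarks."""
    delta = np.asarray(sq_dists_to_landmarks, dtype=float)
    return -0.5 * pinv_t @ (delta - mean_sq)
```

Embedding a query therefore costs n distance computations plus one small matrix-vector product, which is what makes out-of-sample extension cheap.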

Landmark MDS

• Provides stable results

• Good selection of landmarks is critical.
– LMDSrandom
– LMDSmaxmin
• Add new landmarks that maximize the minimum distance to already selected landmarks
– LMDSfastmap
• Use the same landmarks as found by Fastmap
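
The maxmin strategy above can be sketched as a greedy loop. The function name and the default choice of the first landmark (object 0) are assumptions for illustration; the slides do not say how the first landmark is picked.

```python
def maxmin_landmarks(objects, dist, n_landmarks, first=0):
    """Greedy maxmin: each new landmark maximizes its minimum distance
    to the landmarks already chosen. Returns a list of object indices."""
    chosen = [first]
    # min_dist[i] = distance from object i to its nearest chosen landmark.
    min_dist = [dist(objects[first], obj) for obj in objects]
    while len(chosen) < min(n_landmarks, len(objects)):
        nxt = max(range(len(objects)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, obj in enumerate(objects):
            min_dist[i] = min(min_dist[i], dist(objects[nxt], obj))
    return chosen
```

Keeping `min_dist` up to date means each new landmark costs one pass over the data rather than a comparison against every earlier landmark.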

Evaluation

• Synthetic datasets
– Randomly generate k-tuples for a given k and alphabet size σ
• Real dataset
– Yeast proteins benchmark (σ=20)
– 6,341 proteins, 2.9 million residues
– 103 query proteins, 38-884 residues
• Weighted Hamming distance
• CB-EUC substitution matrix (Sacan, 2007)
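
A weighted Hamming distance between two k-tuples can be sketched as a position-wise sum of residue costs. The nested-dictionary cost table is a hypothetical stand-in; the actual CB-EUC values are not given in the slides. With no table, the function falls back to the plain Hamming distance, which corresponds to the identity-matrix setting used for the synthetic data.

```python
def weighted_hamming(u, v, cost=None):
    """Sum of per-position substitution costs between equal-length words.

    cost[a][b] is the (hypothetical) cost of residue a against residue b;
    with cost=None every mismatch costs 1, i.e. plain Hamming distance.
    """
    if len(u) != len(v):
        raise ValueError("k-tuples must have the same length")
    if cost is None:
        return sum(1.0 for a, b in zip(u, v) if a != b)
    return sum(cost[a][b] for a, b in zip(u, v))
```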

• Sammon’s metric stress:
• Breaking point dimensionality

[Figure: x-axis: target dimensionality (d); k=5, synthetic dataset, identity matrix]
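
The stress formula itself did not survive on this slide. As a point of reference, the sketch below computes the standard Sammon stress: the sum over pairs of (original minus embedded distance) squared divided by the original distance, normalized by the sum of original distances. The normalization used in the talk may differ, and the function name is illustrative.

```python
import math
from itertools import combinations

def sammon_stress(objects, coords, dist):
    """Standard Sammon stress of an embedding (0 means distances are preserved)."""
    num = 0.0
    den = 0.0
    for i, j in combinations(range(len(objects)), 2):
        d_orig = dist(objects[i], objects[j])
        if d_orig == 0:
            continue  # identical objects contribute nothing and would divide by zero
        d_emb = math.dist(coords[i], coords[j])  # Euclidean distance in target space
        num += (d_orig - d_emb) ** 2 / d_orig
        den += d_orig
    return num / den if den else 0.0
```

Plotting this value against the target dimensionality d gives a curve of the kind the figure describes.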

Subsequence length (k) and alphabet size (σ)

Number of landmarks

k=5, d=7, synthetic dataset, identity matrix

Approximate k-tuple search performance

• Find all k-tuples within a specified radius from a query k-tuple

k=6, d=8, real dataset, CB-EUC matrix
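
The radius query above can be sketched as a filter in the embedded space followed by a check with the original distance. Two assumptions are made here: a linear scan stands in for the spatial access method (R-tree, X-tree), and the refinement step with the true distance is added for illustration and may not match the evaluated pipeline. All names are illustrative.

```python
import math

def range_search(query, query_vec, database, vectors, radius, dist):
    """Return database k-tuples within `radius` of `query`.

    database[i] is a k-tuple and vectors[i] its embedded coordinates.
    """
    hits = []
    for word, vec in zip(database, vectors):
        # Cheap filter: Euclidean distance between embedded points.
        if math.dist(query_vec, vec) <= radius:
            # Refinement: confirm with the distance on the original k-tuples.
            if dist(query, word) <= radius:
                hits.append(word)
    return hits
```

Because the embedding is not exact, a true neighbor whose embedded distance exceeds the radius is missed by the filter, which is why the results are approximate.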

Homology search

k=6, d=8, real dataset, CB-EUC matrix

Search time

search radius=7, database size=100,000

Conclusion

• Applied an embedding-based approach to approximate sequence similarity search for the first time

• Significant time improvements with negligible degradation in accuracy

• Achieved more stable embedding with combined pivot selection strategy

• Defined intrinsic Euclidean dimensionality of the dataset
