TRANSCRIPT
1
Efficient Algorithms for Substring Near
Neighbor Problem
Alexandr Andoni
Piotr Indyk
MIT
2
What’s SNN?
SNN ≈ Text Indexing with mismatches

Text Indexing:
Construct a data structure on a text T[1..n], s.t. given a query P[1..m], it finds the occurrences of P in T.

Text indexing with mismatches:
Given P, find the substrings of T that are equal to P except in ≤ R characters.

Motivation: e.g., computational biology (BLAST)

T = GAGTAACTCAATA
P = AGTA
3
Outline
- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background: Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks
4
Approach (Or, why SNN?)
SNN = a near neighbor problem in the Hamming metric with m dimensions:
Construct a data structure on D = {all substrings of T of length m}, s.t. given P, it finds a point in D that is at distance ≤ R from P.
Then use a NN data structure for the Hamming metric.

T = GAGTAACTCAATA
D = {GAGT, AGTA, GTAA, …, AATA}
P = AGTA
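The reduction can be sketched with a naive exhaustive search over D (for illustration only; the talk replaces this with an LSH-based data structure):

```python
# A minimal sketch of the reduction: D is the set of all length-m
# substrings of T, and SNN asks for a member of D within Hamming
# distance R of the pattern P.

def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def substrings(T, m):
    """All substrings of T of length m (the point set D)."""
    return [T[i:i + m] for i in range(len(T) - m + 1)]

def naive_snn(T, P, R):
    """Return every substring of T within Hamming distance R of P."""
    return [s for s in substrings(T, len(P)) if hamming(s, P) <= R]
```

On the slide's example, naive_snn("GAGTAACTCAATA", "AGTA", 1) finds both the exact match AGTA and the one-mismatch occurrence AATA.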
5
Approximate NN
The exact NN problem seems hard (i.e., hard without exponential space or O(n) query time).
Approximate NN is easier. Defined for approximation c = 1 + ε:
OK to report a point at distance ≤ cR (when there is a point at distance ≤ R).

                Query            Space
[KOR98, IM98]   poly(log n, m)   n^{O(1/ε²)}
LSH [IM98]      n^{1/c} + m      n^{1+1/c}

[Figure: a query point q with balls of radius R and cR around it]
6
Our contribution
Problem: NN needs m in advance; we would have to construct a data structure for each m ≤ M.
Here: an approximate SNN data structure for unknown m, without degradation in space or query time.

Our algorithm for SNN, based on LSH:
- Supports patterns of length m ≤ M
- Optimal* space: n^{1+1/c}
- Optimal* query time: n^{1/c}
- Slightly worse preprocessing time if c > 3

(* Optimal w.r.t. LSH, modulo subpolynomial factors)

Also extends to ℓ₁.
7
Outline
- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background: Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks
8
Locality-Sensitive Hashing
Based on a family of hash functions {g}. For points P[1..m], Q[1..m]:
- If dist(P,Q) ≤ R, then Pr_g[g(P) = g(Q)] = "medium"
- If dist(P,Q) > cR, then Pr_g[g(P) = g(Q)] = "low"

Idea: construct L hash tables with random g_1, g_2, …, g_L.
For a query P, look at the buckets g_1(P), g_2(P), …, g_L(P).
Space: L·n. Query time: L.
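A minimal sketch of this bucket scheme (helper names are illustrative; any concrete LSH family can supply the functions g_i):

```python
from collections import defaultdict

# Minimal LSH bucket scheme: one hash table per function g_i; a query
# probes exactly one bucket in each of the L tables.

def build_lsh_index(points, hash_fns):
    """Build one table per g_i, mapping g_i(p) -> list of points."""
    tables = []
    for g in hash_fns:
        table = defaultdict(list)
        for p in points:
            table[g(p)].append(p)
        tables.append(table)
    return tables

def lsh_candidates(tables, hash_fns, q):
    """Collect every point colliding with q in some bucket g_i(q)."""
    candidates = []
    for table, g in zip(tables, hash_fns):
        candidates.extend(table[g(q)])
    return candidates
```

The candidates are then checked against the distance threshold; the analysis bounds how many far points collide with q.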
9
LSH for Hamming
Hash function g: projection on k random coordinates.
E.g.: g_1("AGTA") = "AA" (k = 2)

L = #hash tables = n^{1/c}
k = |log n / log(1 − cR/m)| < m · log n

T = GAGTAACTCAATA
D = {GAGT, AGTA, GTAA, …, AATA}
HT_1: GT -> GAGT; AA -> AGTA, AATA; GA -> GTAA; …
P = AGTA, R = 1
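The projection functions can be sketched as follows (the sampling details are illustrative; m and k are as on the slide):

```python
import random

# Hamming-LSH hash function: project a length-m string onto k random
# coordinates. With k chosen as on the slide and L = n^{1/c} such
# functions, the stated space/query bounds follow.

def make_projection(m, k, rng):
    """Return g(s) = s restricted to k random coordinates of [0, m)."""
    coords = sorted(rng.sample(range(m), k))
    return lambda s: "".join(s[j] for j in coords)
```

For instance, if the sampled coordinates happen to be {0, 3}, then g("AGTA") = "AA" and g("GAGT") = "GT", matching the buckets on the slide.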
10
Outline
- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background: Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks
11
Unknown m
Bad news: k depends on m! Distinct m ⇒ distinct hash tables.

E.g., for m = 3:
T = GAGTAACTCAATA
D = {GAG, AGT, …, ACT, …}
HT_1: GG -> GAG; AT -> AGT, ACT, …
P = AGT, R = 1
g_1("AGT") = "AT"
12
Solution
Let's just reuse the same data structure for all m:
g("AGTA") = "AA"; on "AGT" we have to guess the last char:
g("AGT?") = "A?"
Like in [exact] text indexing…

T = GAGTAACTCAATA
D = {GAGT, AGTA, …, ACTC, …}
HT_1: GT -> GAGT; AA -> AGTA, AATA; GA -> GTAA; AC -> ACTC; …
P = AGT, R = 1
13
Tries*!
Replace HT_1 with a trie on g_1(suffixes).
Stop the search when outside P.
Same analysis!

T = GAGTAACTCAATA
D = {GAGT, AGTA, …, ACTC, …}
HT_1: GT -> GAGT; AA -> AGTA, AATA; GA -> GTAA; AC -> ACTC; …
P = AGT, R = 1
[Figure: the trie over g_1(suffixes), branching on the projected characters (A/G, then A/C, …), with the corresponding suffixes AGTA…, ACTC…, AATA… at the leaves]
* Tries have been used with LSH before in [MS02], but in a different context
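A sketch of the trie replacement (names are illustrative). Each projected suffix g_1(suffix) is inserted with its start position; a query descends along g_1(P) and, once the next projected coordinate falls outside P, reports everything under the current node:

```python
# Trie over the projected suffixes g_1(suffix). Every node records the
# suffix start positions in its subtree, so stopping early ("outside P")
# just means returning the current node's list.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.starts = []  # start positions of all suffixes below this node

def trie_insert(root, key, start):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
        node.starts.append(start)

def trie_search(root, key):
    """Descend along key as far as it goes; return the candidates there."""
    node = root
    for ch in key:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.starts
```

For P = "AGT" and a projection on coordinates {0, 3}, only coordinate 0 lies inside P, so the search key is just "A" and the candidates are all suffixes whose projection starts with A.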
14
Resulting performance

Space: n^{1+1/c} (using compressed tries, one trie takes n space). Optimal!
Query time: n^{1/c} · m (m = length of P). Not [yet] really optimal: originally one could do dimensionality reduction. Can improve to n^{1/c} + m · n^{o(1)}.
Preprocessing time: n^{1+1/c} · M (M = max m). Not optimal (optimal = n^{1+1/c}). Can improve to n^{1+1/c} + M^{1/3} · n^{1+o(1)}; optimal for c < 3.
15
Outline
- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background: Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks
16
Better query & preprocessing
Redesign LSH to improve query and preprocessing:
- Query: n^{1/c} · m → n^{1/c} + m · n^{o(1)}
- Preprocessing: n^{1+1/c} · M → n^{1+1/c} + n^{1+o(1)} · M

Idea for the new LSH:
Use the same number of hash tables/tries (L = n^{1/c}), but use "less randomness" in choosing the hash functions g_1, g_2, …, g_L, s.t. each g_i looks random, but the g's are not independent.
17
New LSH scheme
Old scheme: choose L hash functions g_i; each g_i = projection on k random coordinates.
New scheme: construct the L functions g_i from a smaller number of "base" hash functions.
- A "base" hash function = projection on k/2 random coordinates
- {g_i, i = 1..L} = all pairs of "base" hash functions
- Need only ~L^{1/2} "base" hash functions!
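A sketch of the pairing construction (helper names are illustrative):

```python
import itertools
import random

# Reduced-randomness LSH: draw w ~ sqrt(L) "base" projections on k/2
# random coordinates each, and let the g_i be all pairs <u_a, u_b>,
# giving L = w*(w-1)/2 functions from only w independent draws.

def make_base(m, k_half, rng):
    """A base function: projection on k/2 random coordinates."""
    coords = sorted(rng.sample(range(m), k_half))
    return lambda s: "".join(s[j] for j in coords)

def make_paired_lsh(m, k, w, seed=0):
    """All pairs of w base functions, each pair concatenated into one g."""
    rng = random.Random(seed)
    bases = [make_base(m, k // 2, rng) for _ in range(w)]
    return [lambda s, u=u, v=v: u(s) + v(s)
            for u, v in itertools.combinations(bases, 2)]
```

With k = 4 and w = 4 base functions this yields L = C(4, 2) = 6 functions g_i, exactly the example on the next slide.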
18
Example
k = 4; #base functions w = 4; L = (w choose 2) = (4 choose 2) = 6

Base functions u_1, u_2, u_3, u_4 (each a projection on k/2 = 2 coordinates);
g_1 = <u_1, u_2>, g_2 = <u_1, u_3>, g_3 = <u_1, u_4>, …

[Figure: each u_i drawn as a mask over the coordinates; each g_i as the union of two masks]
19
Saving time
We can save time since there are fewer "base" hash functions.
E.g., computing fingerprints: we want FP(g_i(P)) for i = 1..L, where
FP(g_i(P)) = (Σ_j P[j] · χ_j^i · 2^j) mod prime
(χ_j^i indicating whether g_i selects coordinate j).

Old way: would take L · m time for the L functions g.
New way: takes L^{1/2} · m time for the L^{1/2} functions u_i; then only L time to combine FP(u(P)) into FP(g(P)): if g = <u_1, u_2>, then FP(g(P)) = (FP(u_1(P)) + FP(u_2(P))) mod prime.
Total: L + L^{1/2} · m.
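The additive combination works because a fingerprint is a sum over the selected coordinates, and the two base functions of g select disjoint coordinate sets; a sketch (the prime is illustrative):

```python
# Fingerprint of a projection: sum P[j] * 2^j over the selected
# coordinates j, mod a prime. For g = <u1, u2> with disjoint coordinate
# sets, FP(g(P)) = (FP(u1(P)) + FP(u2(P))) mod prime, so each of the L
# combined fingerprints costs O(1) after the ~sqrt(L) base fingerprints.

PRIME = (1 << 61) - 1  # an illustrative Mersenne prime

def fingerprint(P, coords):
    """FP of P restricted to the given coordinate set."""
    return sum(ord(P[j]) * pow(2, j, PRIME) for j in coords) % PRIME
```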
20
Better query & preprocessing (2)

E.g., for the query: use fingerprints to leap faster in the trie. Yields time n^{1/c} + n^{1/(2c)} · m (since L = n^{1/c}).
To get n^{1/c} + n^{o(1)} · m, generalize: g = a tuple of t base functions; a base function = projection on k/t random coordinates.
Other details are similar to the fingerprints.
21
Better preprocessing (3)
For preprocessing, we can get n^{1+1/c} + n^{1+o(1)} · M; in fact, n^{1+1/c} + n^{1+o(1)} · M^{1/3}:
We can construct a trie in n · M^{1/3} time (instead of n · M), using FFT, etc.
22
Outline
- General approach
  - View: Near Neighbor problem in Hamming metric
  - Focus: reducing space
- Background: Locality-Sensitive Hashing (LSH)
- Solution = LSH + Tries
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks
23
Conclusions
Problem: Substring Near Neighbor (a.k.a. text indexing with mismatches).
Approach: view as NN in m-dimensional Hamming; use LSH.
Challenge: variable-length patterns without degradation in performance.
Solution: space/query optimal (w.r.t. LSH); preprocessing optimal (w.r.t. LSH) for c < 3.
24
Extensions
Extends to ℓ₁. Nontrivial, since it needs quite different LSH functions.
Preprocessing slightly worse: n^{1+1/c} + n^{1+o(1)} · M^{2/3},
using the "Less-than-matching" problem [Amir-Farach'95].
25
Remarks
Other approaches? Or, why LSH for SNN?
Since a better SNN would give a better NN…
and LSH is the "best" known algorithm for high-dimensional NN (using reasonable space).
26
Thanks!