TRANSCRIPT
Ryan O’Donnell (CMU, IAS)
joint work with
Yi Wu (CMU, IBM), Yuan Zhou (CMU)
Locality Sensitive Hashing [Indyk–Motwani ’98]
h : objects → sketches
H : family of hash functions h s.t.
“similar” objects collide w/ high prob.
“dissimilar” objects collide w/ low prob.
Abbreviated history
Broder ’97, Altavista
A = 0 1 1 1 0 0 1 0 0
B = 1 1 1 0 0 0 1 0 1
(coordinate i: is word i present? for words 1 … d)
Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|
Invented simple H s.t. Pr_h[h(A) = h(B)] = J(A, B).
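A minimal sketch of the minhash construction, assuming documents are represented as nonempty sets of word-indices (names and parameters here are illustrative, not from the talk):

```python
import random

def random_minhash(d, seed=None):
    """Sample one h from the minhash family: pick a random permutation
    pi of the d word-indices; h(A) is the earliest index of A under pi.
    (A is assumed to be a nonempty subset of range(d).)"""
    rng = random.Random(seed)
    pi = list(range(d))
    rng.shuffle(pi)
    return lambda A: next(i for i in pi if i in A)

# The pi-first element of A ∪ B is uniform on A ∪ B, and h(A) = h(B)
# exactly when that element lies in A ∩ B, so
#     Pr_h[h(A) = h(B)] = |A ∩ B| / |A ∪ B| = J(A, B).
A, B = {1, 2, 3, 6}, {0, 1, 2, 6, 8}
hits = 0
for _ in range(100_000):
    h = random_minhash(9)
    hits += h(A) == h(B)
print(hits / 100_000, len(A & B) / len(A | B))   # both ≈ 0.5
```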
Indyk–Motwani ’98 (cf. Gionis–I–M ’98)
Defined LSH.
Invented very simple H good for
{0,1}^d under Hamming distance.
Showed good LSH implies good
nearest-neighbor-search data structs.
Charikar ’02, STOC
Proposed alternate H (“simhash”, via
random hyperplanes) for angular (cosine) similarity.
Many papers about LSH
Practice:
• Free code base [AI’04]
• Sequence comparison in bioinformatics
• Association-rule finding in data mining
• Collaborative filtering
• Clustering nouns by meaning in NLP
• Pose estimation in vision
• …

Theory:
[Broder ’97], [Indyk–Motwani ’98], [Gionis–Indyk–Motwani ’98], [Charikar ’02], [Datar–Immorlica–Indyk–Mirrokni ’04], [Motwani–Naor–Panigrahy ’06], [Andoni–Indyk ’06], [Terasawa–Tanaka ’07], [Andoni–Indyk ’08, CACM], [Neylon ’10]
Given: (X, dist), r > 0, c > 1
(distance space; “radius”; “approx factor”)
Goal: Family H of functions X → S (S can be any finite set)
s.t. ∀ x, y ∈ X:
  dist(x, y) ≤ r  ⇒ Pr_{h∈H}[h(x) = h(y)] ≥ p
  dist(x, y) ≥ cr ⇒ Pr_{h∈H}[h(x) = h(y)] ≤ q
Want p large relative to q: p ≥ q^{.5}? ≥ q^{.25}? ≥ q^{.1}?
In general p ≥ q^ρ, i.e. ρ = ln(1/p)/ln(1/q); the smaller ρ, the better.
Theorem [IM’98, GIM’98]:
Given LSH family for (X, dist),
can solve “(r, cr)-near-neighbor search”
for n points with data structure of
  size: O(n^{1+ρ})
  query time: Õ(n^ρ) hash fcn evals.
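Roughly where these bounds come from, as a hedged sketch of the standard reduction (not the theorem's exact parameters): concatenate k = log_{1/q}(n) hash functions so far points rarely share a bucket, and keep L ≈ n^ρ independent tables so a near point shares a bucket in at least one table with constant probability. The class below is illustrative; `family` is an assumed sampler for the LSH family.

```python
import math

class NearNeighborSketch:
    """Sketch of the [IM'98, GIM'98] reduction from an LSH family with
    collision probabilities p (near) and q (far).  Illustrative only."""

    def __init__(self, points, family, p, q):
        n = len(points)
        k = max(1, round(math.log(n) / math.log(1 / q)))  # q^k ≈ 1/n
        rho = math.log(1 / p) / math.log(1 / q)
        L = math.ceil(n ** rho)      # one table succeeds w.p. ≈ p^k = n^{-rho}
        self.tables = []
        for _ in range(L):           # total space: L·n ≈ n^{1+rho} entries
            g = [family() for _ in range(k)]          # concatenated hash
            table = {}
            for pt in points:
                table.setdefault(tuple(h(pt) for h in g), []).append(pt)
            self.tables.append((g, table))

    def candidates(self, x):
        # L·k ≈ Õ(n^rho) hash evaluations; caller verifies dist(x, ·) ≤ cr.
        for g, table in self.tables:
            yield from table.get(tuple(h(x) for h in g), [])
```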
Example
X = {0,1}^d, dist = Hamming; r = ϵd, c = 5.
Decide: dist ≤ ϵd, or ≥ 5ϵd?
x = 0 1 1 1 0 0 1 0 0
y = 1 1 1 0 0 0 1 0 1
[IM’98]: H = { h_1, h_2, …, h_d }, h_i(x) = x_i
“output a random coord.”
Analysis:
  dist(x, y) ≤ ϵd  ⇒ Pr[h(x) = h(y)] ≥ 1 − ϵ  = p
  dist(x, y) ≥ 5ϵd ⇒ Pr[h(x) = h(y)] ≤ 1 − 5ϵ = q
(1 − 5ϵ)^{1/5} ≤ 1 − ϵ. ∴ ρ = ln(1/p)/ln(1/q) ≤ 1/5.
In general, bit-sampling achieves ρ ≤ 1/c, ∀ c (∀ r).
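Numerically, with (say) ϵ = 0.05, a quick check of the claimed exponent (the value of ϵ is illustrative):

```python
import math

eps, c = 0.05, 5
p = 1 - eps        # near pairs:  Pr[h(x) = h(y)] >= 1 - eps
q = 1 - c * eps    # far pairs:   Pr[h(x) = h(y)] <= 1 - 5*eps
rho = math.log(1 / p) / math.log(1 / q)
print(rho, 1 / c)  # 0.178... <= 0.2, i.e. rho <= 1/c
```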
“Optimal” upper bound
({0, 1}^d, Ham), r > 0, c > 1.
S ≝ {0, 1}^d ∪ {✔},  H ≝ { h_ab : dist(a, b) ≤ r }
h_ab(x) = ✔ if x = a or x = b
          x otherwise
Near pairs collide with positive probability (p = 1/|H|), far pairs never (q = 0),
so ρ = ln(1/p)/ln(1/q) = 0: “better” than ρ = 0.5 > 0.1 > 0.01 > 0.0001 > …
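A brute-force check of this degenerate family for small d (a sketch; ✔ is modeled as the string "CHECK"):

```python
import itertools

d, r = 8, 2
ham = lambda u, v: sum(a != b for a, b in zip(u, v))
cube = list(itertools.product([0, 1], repeat=d))

# H = { h_ab : dist(a, b) <= r },  h_ab(x) = CHECK if x in {a, b} else x
H = [(a, b) for a in cube for b in cube if a < b and ham(a, b) <= r]
h = lambda ab, x: "CHECK" if x in ab else x

x      = (0,) * d
y_near = (1, 1) + (0,) * (d - 2)   # dist 2 <= r:  collide only under h_{x,y}
y_far  = (1,) * d                  # dist 8 >= cr: never collide

p = sum(h(ab, x) == h(ab, y_near) for ab in H) / len(H)
q = sum(h(ab, x) == h(ab, y_far)  for ab in H) / len(H)
print(p, q)   # p = 1/|H| > 0 (but 2^{-Theta(d)}-tiny),  q = 0  =>  rho = 0
```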
Wait, what?
[IM’98, GIM’98] Theorem:
Given LSH family for (X, dist), can solve “(r, cr)-near-neighbor search”
for n points with data structure of
  size: Õ(n^{1+ρ})
  query time: Õ(n^ρ) hash fcn evals.
The catch: for the family above, p and q are 0 or 2^{−Θ(d)}, and the reduction is only meaningful when q is not tiny.
More results
For ℝ^d with ℓ_p-distance: ρ ≤ 1/c^p, when p = 1 [IM’98], 0 < p < 1 [DIIM’04], p = 2 [AI’06].
For Jaccard similarity: ρ ≤ 1/c [Bro’97].
For {0,1}^d with Hamming distance: ρ ≥ 0.462/c − o_d(1) (assuming q ≥ 2^{−o(d)}) [MNP’06]
(extends immediately to ℓ_p-distance).
Our Theorem
For {0,1}^d with Hamming distance (∃ r s.t.):
  ρ ≥ 1/c − o_d(1)  (assuming q ≥ 2^{−o(d)})
(extends immediately to ℓ_p-distance).
Proof also yields ρ ≥ 1/c for Jaccard.
Proof: “Noise stability is log-convex.”
More precisely: a definition, and two lemmas.
Fix any function h : {0,1}^d → S.
Pick x ∈ {0,1}^d at random:   x = 0 1 1 1 0 0 1 0 0,  h(x) = s
Run a continuous-time (lazy) random walk from x for time τ, obtaining y:
                              y = 0 0 1 1 0 0 1 1 0,  h(y) = s′
def:  K_h(τ) ≝ Pr[h(x) = h(y)]  (the noise stability of h at time τ).
Lemma 1: For x →_τ y,  dist(x, y) ≈ (τ/2)·d  w.v.h.p., when τ ≪ 1.
Lemma 2: K_h(τ) is a log-convex function of τ  (for any h).
From which the proof of ρ ≥ 1/c follows easily.
[Figure: K_h(τ) decreasing from K_h(0) = 1 as τ grows.]
Continuous-Time Random Walk
An “Exponential(1) alarm clock” repeatedly:
— waits Exponential(1) seconds,
— dings.
(Reminder: T ~ Expon(1) means Pr[T > u] = e^{−u}.)
In the C.T.R.W. on {0,1}^d, each coordinate gets
its own independent alarm clock.
When the ith clock dings, coordinate i is rerandomized.
x = 0 1 1 1 0 0 1 0 0 1
y = 0 1 0 1 0 0 1 0 1 1   (after running for time τ)
Pr[coord. i never updated] = Pr[Exp(1) > τ] = e^{−τ}
∴ Pr[x_i ≠ y_i] = ½(1 − e^{−τ}),  independently across i
⇒ Lemma 1: dist(x, y) ≈ ½(1 − e^{−τ})·d ≈ (τ/2)·d  w.v.h.p., for τ ≪ 1.
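A small simulation of the walk, checking both the per-coordinate formula and Lemma 1 (the test function h below is an arbitrary illustrative choice):

```python
import math, random

def walk(x, tau):
    """Time-tau C.T.R.W.: coordinate i is rerandomized iff its
    Exponential(1) clock dinged in [0, tau], i.e. with probability
    1 - e^{-tau}, independently across coordinates."""
    p_upd = 1 - math.exp(-tau)
    return [random.randint(0, 1) if random.random() < p_upd else xi
            for xi in x]

d, tau, trials = 100, 0.1, 200_000
h = lambda z: z[0]                     # arbitrary test function
dist_sum = coll = 0
for _ in range(trials):
    x = [random.randint(0, 1) for _ in range(d)]
    y = walk(x, tau)
    dist_sum += sum(a != b for a, b in zip(x, y))
    coll += h(x) == h(y)

print(dist_sum / trials, (1 - math.exp(-tau)) / 2 * d)  # Lemma 1: ~ (tau/2)d
print(coll / trials, (1 + math.exp(-tau)) / 2)          # K_h(tau) for this h
```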
Lemma 2: K_h(τ) is a log-convex function of τ.
Remark: True for any reversible C.T.M.C.
Recall: For f : {0,1}^d → ℝ,
  E[f(x) f(y)] = Σ_{T ⊆ [d]} e^{−τ|T|} f̂(T)²   (x uniform, y the time-τ walk from x).
Given hash function h : {0,1}^d → S,
for each s ∈ S, introduce
  h_s : {0,1}^d → {0,1},  h_s(x) = 1{h(x)=s}.
Proof of Lemma 2:
  K_h(τ) = Σ_{s∈S} E[h_s(x) h_s(y)] = Σ_{s∈S} Σ_{T ⊆ [d]} e^{−τ|T|} ĥ_s(T)²,
a non-neg. lin. comb. of the log-convex functions τ ↦ e^{−τ|T|}, hence log-convex. ∎
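For small d, the Fourier identity behind this proof can be verified by brute force (a sketch; the hash function below is an arbitrary illustrative choice):

```python
import itertools, math

d, tau = 4, 0.3
cube = list(itertools.product([0, 1], repeat=d))
subsets = [T for k in range(d + 1) for T in itertools.combinations(range(d), k)]
chi = lambda T, x: (-1) ** sum(x[i] for i in T)      # Fourier character

def fhat(f):
    """Fourier coefficients f-hat(T) = E_x[f(x) chi_T(x)]."""
    return {T: sum(f(x) * chi(T, x) for x in cube) / len(cube) for T in subsets}

h = lambda x: int(x[0] + x[1] >= 1)                  # arbitrary hash to S = {0, 1}

# Fourier side:  K_h(tau) = sum_s sum_T e^{-tau|T|} (h_s-hat(T))^2
K_fourier = sum(math.exp(-tau * len(T)) * c * c
                for s in {h(x) for x in cube}
                for T, c in fhat(lambda x, s=s: float(h(x) == s)).items())

# Direct side:  Pr[h(x) = h(y)], where Pr[x_i = y_i] = (1 + e^{-tau})/2 indep.
p_eq = (1 + math.exp(-tau)) / 2
K_direct = sum((1 / len(cube))
               * math.prod(p_eq if a == b else 1 - p_eq for a, b in zip(x, y))
               for x in cube for y in cube if h(x) == h(y))

print(K_fourier, K_direct)   # agree up to floating-point error
```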
Recap:
Lemma 1: For x →_τ y, dist(x, y) ≈ (τ/2)·d w.v.h.p. (τ ≪ 1).
Lemma 2: K_h(τ) is a log-convex function of τ.
Theorem: LSH for {0,1}^d requires ρ ≥ 1/c − o_d(1).
Proof: Say H is an LSH family for {0,1}^d
with params r, (c − o(1))·r, p = q^ρ, q.

def:  K_H(τ) ≝ avg_{h∈H} K_h(τ) = Pr_{h∈H, x →_τ y}[h(x) = h(y)].
(Non-neg. lin. comb. of log-convex fcns,
∴ K_H(τ) is also log-convex.)

Choose ϵ so that, w.v.h.p., x →_ϵ y has dist(x, y) ≈ r (Lemma 1);
then x →_cϵ y has dist(x, y) ≈ cr. Hence:
  K_H(ϵ) ≳ q^ρ
  K_H(cϵ) ≲ q   (in truth, q + 2^{−Θ(d)}; we assume q not tiny)
  K_H(0) = 1,  i.e. ln K_H(0) = 0.

Now apply log-convexity to ln K_H(τ), writing ϵ = (1 − 1/c)·0 + (1/c)·cϵ:
  ρ ln q ≤ ln K_H(ϵ) ≤ (1 − 1/c)·ln K_H(0) + (1/c)·ln K_H(cϵ) ≤ (1/c)·ln q.
[Figure: convex ln K_H(τ) passing through (0, 0), with value ≥ ρ ln q at ϵ and ≤ ln q at cϵ.]
Dividing by ln q < 0 flips the inequality:
∴ ρ ≥ 1/c.
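As a sanity check of the chord argument, the bit-sampling family from earlier has the closed form K_H(τ) = (1 + e^{−τ})/2, which is log-convex, and the resulting ρ sits just above 1/c (a numerical sketch; ϵ and c are illustrative):

```python
import math

K = lambda t: (1 + math.exp(-t)) / 2   # K_H for the [IM'98] bit-sampling family

# Log-convexity on a grid: ln K convex  <=>  K(t-s)·K(t+s) >= K(t)^2
assert all(K(t - s) * K(t + s) >= K(t) ** 2
           for t in (0.1, 0.5, 1.0, 2.0) for s in (0.05, 0.1))

# Chord bound with p = K(eps), q = K(c·eps):
eps, c = 0.1, 5
rho = math.log(K(eps)) / math.log(K(c * eps))   # = ln(1/p) / ln(1/q)
print(rho, 1 / c)   # 0.2225... >= 0.2, tending to 1/c as eps -> 0
```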
Super-tedious, super-straightforward details:
• Make Lemma 1 precise. (Chernoff)
• Make the “≈”s and “≳”s precise. (Taylor)
• Choose ϵ = ϵ(c, q, d) very carefully.

Theorem: ρ ≥ 1/c − o_d(1).
Meaningful iff q ≥ 2^{−o(d)}; i.e., q not tiny.