Embedding and Similarity Search for Point Sets under Translation

Minkyoung Cho and David M. Mount, University of Maryland, SoCG 2008


TRANSCRIPT

Page 1:

Embedding and Similarity Search for Point Sets under Translation

Minkyoung Cho and David M. Mount, University of Maryland

SoCG 2008

Page 2: Point Pattern Matching

Given two point sets P and Q, find a subset Q' ⊆ Q that minimizes Dist(P, Q') = min_t dist(tP, Q'), where t ranges over a class of geometric transformations (e.g., translation, rotation, ...).

[Figure: example point sets P and Q]

Page 3: Point Pattern Similarity Search

A collection of point sets S = {P1, P2, ..., PN} has been preprocessed. Given a query set Q, find the (approximate) nearest Pi with respect to a given distance function and transformation group.

[Figure: the query Q and the collection S = {P1, P2, ..., PN}]

Page 4: Results

Method | Transformation | Space | Index | Note
Geometric Hashing [Wolfson & Rigoutsos 97] | Translation, Rotation, Affine, ... | $O(N n^{k+1})$ (k: frame size) | YES | Space complexity
EMD into Euclidean space [Indyk & Thaper 03] | None | $O(N n)$ | YES | Embedding EMD to L1
EMD under transformation sets [Cohen & Guibas 99] | Scaling, Translation | $O(N n)$ | NO | Brute-force, heuristic
Ours | Translation | $O(N n \log^2 n)$ | YES | Embedding SD to L1

EMD: Earth Mover's Distance. SD: Symmetric Difference Distance.

Page 5: Problem Definition

Point Pattern Similarity Searching:

• Distance Measure: Symmetric Difference Distance

• Error Model: Outliers (but no noise)

• Transformation: Translation

• Restriction: Coordinates are integers

[Figure: Venn diagram of P and Q with the regions P \ Q, Q \ P, and P Δ Q]

Example: P = {p1, p2, p3, p4} and Q = {p1, p2, p5, p6}. Then P Δ Q = {p3, p4} ∪ {p5, p6} and |P Δ Q| = 4.

Example under translation: P = {0, 12, 14, 23, 35, 54, 59, 64} and Q = {15, 17, 20, 26, 38, 57, 65, 67}. With t = 3, Q − t = {12, 14, 17, 23, 35, 54, 62, 64}, which shares the six points {12, 14, 23, 35, 54, 64} with P.
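The translated example on this slide can be checked by brute force. The sketch below is only an illustration for small 1-dimensional integer sets, not the method of the talk; the function name `sym_diff_under_translation` is made up for this example. It relies on the observation that only translations mapping some point of P onto some point of Q can produce a match.

```python
def sym_diff_under_translation(P, Q):
    """Return the minimum over integer t of |(P + t) sym-diff Q| for 1-D integer sets."""
    P, Q = set(P), set(Q)
    best = len(P) + len(Q)  # value when no point matches at all
    # Only translations that send some p onto some q can create a match.
    for t in {q - p for p in P for q in Q}:
        shifted = {p + t for p in P}
        best = min(best, len(shifted ^ Q))
    return best


P = [0, 12, 14, 23, 35, 54, 59, 64]
Q = [15, 17, 20, 26, 38, 57, 65, 67]
print(sym_diff_under_translation(P, Q))  # 4, attained at t = 3
```

The brute force tries up to |P|·|Q| translations and is only meant to make the distance definition concrete.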

Page 6: Motivation: Sources of Complexity

• The complexity comes from the combination of translation and outliers; each alone is easy.

• Translation only: translate each point set so that its leftmost point is at the origin; matching is then trivial.

• Outliers only: reduce to nearest-neighbor search in the Hamming cube (by hashing or random sampling).
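The translation-only case can be sketched in a few lines. This is a minimal illustration for 1-dimensional integer sets; the helper name `canonical` is an assumption of this example. It also shows why the trick stops working once outliers are present: an extra point to the left of the pattern changes the anchor.

```python
def canonical(P):
    """Shift a 1-D integer point set so that its leftmost point sits at the origin."""
    lo = min(P)
    return sorted(p - lo for p in P)


A = [3, 6, 10, 14, 22]
B = [p + 7 for p in A]  # the same pattern, translated by 7
print(canonical(A) == canonical(B))          # True: exact match after anchoring

# One outlier to the left of the pattern moves the anchor and breaks the match.
print(canonical(A) == canonical([1] + B))    # False
```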

Page 7: Intuition

[Figure: each point set P1, P2, P3, P4, ..., PN and the query Q is mapped by f into a metric space]

Page 8: Embedding: Basic Definitions

Given metric spaces (X, d) and (X', d'), a map f: X → X' is called an embedding.

The contraction of f is the maximum factor by which distances are shrunk, i.e.,

$\max_{x,y \in X} \frac{d(x,y)}{d'(f(x),f(y))}$

The expansion or stretch of f is the maximum factor by which distances are stretched:

$\max_{x,y \in X} \frac{d'(f(x),f(y))}{d(x,y)}$

The distortion of f is the product of the contraction and expansion.
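For a finite set of points these three quantities can be computed directly from the definitions. The sketch below is illustrative only; the function name `distortion` and the example map are assumptions of this example. It shows that a map which merely scales all distances has distortion 1.

```python
from itertools import combinations


def distortion(points, d, d_prime, f):
    """Return (contraction, expansion, distortion) of f over a finite list of distinct points."""
    contraction = expansion = 0.0
    for x, y in combinations(points, 2):
        before, after = d(x, y), d_prime(f(x), f(y))
        contraction = max(contraction, before / after)  # how much distances shrink
        expansion = max(expansion, after / before)      # how much distances stretch
    return contraction, expansion, contraction * expansion


# Hypothetical example: send x on the line to (x, x) in the plane under L1.
line = lambda x, y: abs(x - y)
l1 = lambda a, b: sum(abs(u - v) for u, v in zip(a, b))
print(distortion([0, 1, 3], line, l1, lambda x: (x, x)))  # (0.5, 2.0, 1.0)
```

The sketch assumes f is injective on the given points, so that no image distance is zero.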

Page 9: Main Result: Preliminaries

• Main result: There exists a randomized embedding that maps point sets under symmetric difference with respect to translation into the metric space L1 with distortion $O(\log^2 n)$.

• Assumptions:
  – Each point set has at most n elements and lies in dimension d.
  – Coordinates are integers of magnitude polynomial in n.

• Distance function: symmetric difference with respect to translation,

$\langle P \Delta Q \rangle = \min_t |(P + t) \Delta Q|$

• Target metric: L1,

$\|x - y\|_1 = \sum_{i=1}^{d} |x_i - y_i|, \quad x, y \in \mathbb{R}^d$

Page 10: Outline of Algorithm

1. Transform d-dimensional points into 1-dimensional points. (Distortion: 1)

2. Reduce the domain size using a linear hash function. (Distortion: O(1))

3. Make the representation invariant under translation. (Distortion: $O(\log^2 n)$)

4. Reduce the target domain size using a universal hash function. (Distortion: O(1))

[Figure: the pipeline on an example. The set {3, 6, 10, 14, 22} becomes a bit vector of length O(n log n), then a collection of probe bit patterns {101010, ..., 010100, ..., 11101}, then a vector of counts.]

Page 11: Translation Invariant

[Figure: the bit vector for P, of length s, is read with ρ = 4 probes, giving the bit patterns {1101, 0001, 0000, 0010, 1100, 1010, ...}]

Page 12: Intuition

[Figure: the bit vectors hP and hQ, each of length s, and the probe patterns read from them]

hP = 1 0000 0 01 0 1 0

hQ = 1 1000 0 10 0 1 0

Φ2P = {10, 01, 00, 10, 01, 00, 10, 00, 00, 01, 00}

Φ2Q = {10, 00, 01, 00, 11, 00, 10, 01, 00, 11, 00}

Φ4P = {1101, 0000, 0010, 1100, 0000, 0001, 1000, 0010, 0101, 0000, 0010}

Φ4Q = {1011, 0100, 0010, 0101, 1000, 0011, 1100, 0010, 0100, 1001, 0000}

If one of the probes hits a mismatched position, the generated bit patterns may differ.

The probability that some probe hits a mismatched position increases with the probe size.
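A rough calculation supports the last statement. Assume, purely for illustration, that the two bit vectors differ in m of their s positions and that the ρ probe positions are drawn independently and uniformly at random (the slides do not state the sampling scheme, and the numbers in the comments are made up):

```latex
\Pr[\text{some probe hits a mismatch}] = 1 - \left(1 - \frac{m}{s}\right)^{\rho}
% illustrative numbers: s = 11, m = 3
% \rho = 2:  1 - (8/11)^2 = 57/121 \approx 0.47
% \rho = 4:  1 - (8/11)^4 = 10545/14641 \approx 0.72
```

Under these assumptions the probability is strictly increasing in ρ whenever 0 < m < s.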

Page 13: Relationship between ρ (probe size) and δ*

[Figure: the distance of invariants, $\Phi_\rho P \,\Delta\, \Phi_\rho Q$, plotted as the probe size ρ increases, with curves labeled "Upper bound" and "Expectation" and a region labeled "Unknown"; marked levels include δ*/O(ln s) and δ*]

δ: estimated distance. δ*: original distance.

Page 14: Embedding

[Figure: the distance of invariants, $\Phi_\rho P \,\Delta\, \Phi_\rho Q$, against the probe sizes $2^0, 2^1, 2^2, \ldots, 2^L, \ldots, 2^H, \ldots, 2^{\log 2n} = 2n$, with the levels δ*/O(log s) and δ* marked]

[Equations: the expected distance $E\|\Psi P - \Psi Q\|_1$ is written as a sum over the levels $i = 0, \ldots, \log 2n$; the sum is split at level H into $\sum_{i=0}^{H-1}$ and $\sum_{i=H}^{\log 2n}$, and the result is bounded by $O(\log n)\,\delta^*$]

δ: estimated distance. δ*: original distance.

Page 15: Build Time

The expensive operations are building the invariant and hashing over a large domain.

Building the invariant: (# of probes) × (# of translations). Trivially: O(s) · s = O(n log n) · O(n log n) = $O(n^2 \log^2 n)$.

Universal hash function: (# of elements) × (matrix operation) = (# of elements) × (input size) × (output size). Trivially: O(s) · O(s) · O(log s) = $O(s^2 \log s)$ = $O(n^2 \log^3 n)$.

We can improve this to $O(n \log^3 n)$ if we merge the two operations. Surprise!!!

Page 16: Merge Two Operations

[Figure: the bit vector P (length s), the rows r_0, ..., r_{log s} of the matrix H, and the outputs y_0, y_1, y_2, ..., y_{s-1} obtained as Conv(f(r_0), P)]

A convolution can be computed in O(n log n) time, where n is the size of the array.

Page 17: Main Result: Formal Statement

Given a failure probability β, there exists a randomized embedding from a point set P into a vector ΨP of dimension $O(n \log^2 n \, \log(1/\beta))$ such that for any P, Q:

(i) $\|\Psi P - \Psi Q\|_1 \le 2 \log n \cdot \langle P \Delta Q \rangle$

(ii) $\|\Psi P - \Psi Q\|_1 \ge \frac{1}{17 \log n} \langle P \Delta Q \rangle$ with probability at least $1 - \beta$.

This embedding can be computed in time $O(n \log^4 n \, \log(1/\beta))$.
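The two bounds connect to the distortion claimed earlier in the talk. Reading (i) as an upper bound of 2 log n · ⟨PΔQ⟩ and (ii) as a lower bound of ⟨PΔQ⟩ / (17 log n), which is an interpretation of the slide rather than a quotation, the expansion is at most 2 log n and the contraction is at most 17 log n, so:

```latex
\underbrace{2 \log n}_{\text{expansion}} \cdot \underbrace{17 \log n}_{\text{contraction}}
  = 34 \log^2 n = O(\log^2 n)
```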

Page 18: Open Problems

• Q1. Can we improve the distortion bound (currently $O(\log^2 n)$)? Cormode & Muthukrishnan show how to embed strings under edit distance with moves into L1 with O(log n log* n) distortion.

• Q2. Can we derandomize the algorithm? Cormode & Muthukrishnan's algorithm is deterministic.

• Q3. Can we improve the space/time complexities?

Page 19: Other Extensions

• Q1. Can we support a distance measure that is robust to noisy data (e.g., Hausdorff distance)?

• Q2. Can we handle other transformation groups?
  - integer scaling?
  - integer scaling + translation?
  - affine transformations over finite vector spaces?

Point Pattern Similarity Searching:

• Distance Measure: Symmetric Difference Distance

• Error Model: Outliers (but no noise)

• Transformation: Translation

• Restriction: Coordinates are integral

Page 20:

Thank You!

Page 21: Translation Invariant

P = {3, 6, 10, 14, 22}

h(x) = x mod s (e.g., s = 11)

hP = 1 0000 0 01 0 1 1 (bit vector of length s)

ρ = 4

ΦρP as bit patterns: {1101, 0001, 0000, 0010, 1100, 1010, ...}

ΦρP as integers: {13, 0, 2, 12, 1, ..., 10}

h'(x): (for simplicity, x mod 10)

[Figure: histogram of the values h'(ΦρP) over the bins 0 1 2 3 4 5 6 7 8 9, starting from all zeros (0 0000 0 00 0 0) and ending with the counts 2 0001 2 01 0 0]
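The construction on this slide can be sketched as follows. This is a minimal illustration, assuming the probe positions are drawn uniformly at random and that the invariant is the histogram of the bit patterns seen over all s cyclic shifts; the function names `hash_bits` and `invariant` are made up for this example, and the second hash h' is left out.

```python
import random
from collections import Counter


def hash_bits(P, s):
    """Bit vector of length s with a 1 at position x mod s for every x in P."""
    bits = [0] * s
    for x in P:
        bits[x % s] = 1
    return bits


def invariant(bits, probes):
    """Histogram of the bit patterns read at the probe positions over all s cyclic shifts."""
    s = len(bits)
    hist = Counter()
    for t in range(s):
        pattern = tuple(bits[(pos + t) % s] for pos in probes)
        hist[pattern] += 1
    return hist


s, rho = 11, 4
rng = random.Random(0)  # fixed seed, for reproducibility only
probes = [rng.randrange(s) for _ in range(rho)]  # assumed sampling scheme

P = [3, 6, 10, 14, 22]
print(hash_bits(P, s))  # ones at positions 0, 3, 6, 10 (3 and 14 collide)

shifted = [x + 5 for x in P]
print(invariant(hash_bits(P, s), probes) == invariant(hash_bits(shifted, s), probes))  # True
```

Translating P by t cyclically shifts the hashed bit vector, and the histogram ranges over every cyclic shift, so it does not depend on t.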

Page 22: Trial 1: Geometric Hashing for Translation

• Naïve version:
  - Space complexity is $O(N n^2)$ since the frame size is 1.
  - With outliers in a query, the number of queries increases.

• Adaptive version: to reduce the space complexity, store only c transformed sets; the number of queries then increases.

• Outliers may lead to false matches, and thus increase the probability of false positives.

Page 23: Geometric Hashing with Outliers (delete)

Depending on the number of outliers $r$ and the frame size $k$, the number of queries needed to get a correct result increases.

Method 1. Pr[choose a valid frame set] = (1 − r/n)^k

Method 2. (r + 1) different trials (deterministic)

Method 3. Pigeonhole principle: Pr[choose a valid frame set] = 1 − r/(n/k)

[Grimson & Huttenlocher 90]: Outliers lead to false matches and increase the probability of false positives.

Page 24: d-Dimension → 1-Dimension

Let u be the maximum coordinate value of any point. Then we can map a d-dimensional point set to a 1-dimensional point set with coordinates of size at most $(3u)^d$ without changing the symmetric difference distance under translation.

[Figure: a 2-dimensional grid with rows 0 1 0 1 0, 0 0 1 0 0, and 0 1 0 0 0 and corners (1,1) and (5,3) is laid out row by row on a line with gaps between the rows; marked positions 1 and 35 and the intervals [6,15] and [21,30]]
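One way to realize such a map is to read the coordinates as digits in base 3u. The sketch below is an assumption-laden illustration, not necessarily the exact map of the talk: it assumes nonnegative integer coordinates of value at most u, and the names `flatten` and `min_sym_diff` are made up. It checks on a small example that the brute-force distance under translation is the same before and after flattening.

```python
def flatten(point, base):
    """Map an integer point (x_1, ..., x_d) to sum_i x_i * base**(i-1)."""
    return sum(x * base**i for i, x in enumerate(point))


def min_sym_diff(P, Q):
    """Brute-force min over translations t of |(P + t) sym-diff Q|; points are integer tuples."""
    P, Q = set(P), set(Q)
    best = len(P) + len(Q)
    for p0 in P:
        for q0 in Q:
            t = tuple(b - a for a, b in zip(p0, q0))  # translation sending p0 to q0
            moved = {tuple(a + c for a, c in zip(p, t)) for p in P}
            best = min(best, len(moved ^ Q))
    return best


P = [(1, 1), (5, 3), (2, 4), (0, 2)]
Q = [(3, 2), (7, 4), (4, 5), (6, 0)]
u = 7          # largest coordinate value in either set
base = 3 * u   # assumed digit base

P1 = [(flatten(p, base),) for p in P]  # 1-D points as 1-tuples
Q1 = [(flatten(q, base),) for q in Q]
print(min_sym_diff(P, Q), min_sym_diff(P1, Q1))  # 2 2
```

The base only needs to exceed 2u for this argument, since the difference of two useful translations has coordinates of magnitude at most 2u; 3u leaves slack.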

Page 25: # of Primes & Collision Prob.

• Collision probability: h(x) = x mod s, where s is a prime number in Θ(n log n) chosen uniformly at random.

For x ≠ y, Pr[h(x) = h(y)] = Pr[(x mod s) = (y mod s)] = Pr[(x − y) mod s = 0]. Since x, y ∈ $\mathbb{Z}_{n^c}$, we have |x − y| < $n^c$, so at most c primes from this range can divide x − y. Hence

Pr[h(x) = h(y)] < c/(# of primes) = O(1/n)

• Prime Number Theorem: there are Θ(m/log m) prime numbers between 1 and m.
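The counting argument can be made concrete on small numbers. The sketch below is illustrative only: the range [100, 200] and the values of x and y are made up, and trial division stands in for whatever prime generation one would really use. It computes the exact collision probability for a uniformly random prime modulus in a range.

```python
def primes_in(lo, hi):
    """All primes p with lo <= p <= hi, by trial division (fine for small ranges)."""
    def is_prime(m):
        if m < 2:
            return False
        i = 2
        while i * i <= m:
            if m % i == 0:
                return False
            i += 1
        return True
    return [m for m in range(lo, hi + 1) if is_prime(m)]


def collision_probability(x, y, lo, hi):
    """Pr over a uniformly random prime s in [lo, hi] that x mod s == y mod s."""
    primes = primes_in(lo, hi)
    bad = [s for s in primes if (x - y) % s == 0]  # primes dividing x - y
    return len(bad) / len(primes)


# Hypothetical numbers: x - y = 2 * 3 * 101 * 103 = 62418.
print(collision_probability(62518, 100, 100, 200))  # 2 of the 21 primes collide
```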

Page 26: Distance Distortion by Hashing

We can achieve O(1) distortion using a hash function whose collision probability is O(1/n).

Note that collisions can only contract the distance, never expand it.

Page 27: Linear Hash Function (X)

• h(x) = x mod s, where s is a prime number in Θ(n log n)

• Linearity: h(x + t) = (h(x) + h(t)) mod s, so a translation of P becomes a cyclic shift of the hashed bit vector, and ΦρP = Φρ(P + t).

P = {3, 6, 10, 14, 22}

[Figure: the bit vector 1 0000 0 01 0 1 1 of length s]

Page 28: Distance Distortion by Hashing (X)

We can achieve O(1) distortion using a hash function whose collision probability is O(1/n).

Note that collisions can only contract the distance, never expand it.

Page 29: Universal Hash Function for large domain

Since the maximum probe size is O(n log n), the input domain of the hash function has size $2^{O(n \log n)}$. However, it contains only Θ(n log n) elements.

• H: $2^s \to 2^k$, i.e., from s-bit strings to k-bit strings

H(x) = R x + b (mod 2, componentwise), where R is a random k × s bit matrix and b is a random k-bit vector.

• Time complexity: computing one value takes O(k s) = O((log n) · n log n) = $O(n \log^2 n)$. For all s (= O(n log n)) elements, the time is $O(n^2 \log^3 n)$.
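A direct implementation of this hash family makes the O(k s) cost per value visible. This is a minimal sketch with made-up sizes (s = 11, k = 4) and a fixed seed; the helper name `make_hash` is an assumption of this example.

```python
import random


def make_hash(s, k, seed=0):
    """Return (H, b) where H(x) = R x + b over GF(2), R a random k x s bit matrix."""
    rng = random.Random(seed)
    R = [[rng.randrange(2) for _ in range(s)] for _ in range(k)]
    b = [rng.randrange(2) for _ in range(k)]

    def H(x):
        # Each output bit is one row-times-vector product: O(s) work, O(k s) in total.
        return tuple((sum(r_j & x_j for r_j, x_j in zip(row, x)) + b_i) % 2
                     for row, b_i in zip(R, b))

    return H, b


H, b = make_hash(s=11, k=4)
x = (1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1)
print(H(x))  # a 4-bit tuple; the value depends on the seed
```

Because H is affine over GF(2), H(x) ⊕ H(y) ⊕ H(0) = H(x ⊕ y), which is a convenient sanity check for an implementation.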

Page 30: Relationship between ρ and δ*

[Figure: the distance of invariants, $\Phi_\rho P \,\Delta\, \Phi_\rho Q$, plotted against the probe size ρ, with curves labeled "Upper bound" and "Expectation" and a region labeled "Unknown"; marked levels include δ*/O(ln s) and δ*]

δ is the guessed distance. δ* is the optimal distance.

Page 31: Effect of Hash Functions

[Figure: the plot of $\Phi_\rho P \,\Delta\, \Phi_\rho Q$ against the probe size ρ, with the levels δ*/O(log s) and δ* marked, annotated with the effects of the hash functions h and h']

Page 32: Merge Two Operations using FFT & Convolution

П = random_probe(ρ, s)

For t = 1, ..., s: x(t) = (hP + t)[П]   // make an invariant

For t = 1, ..., s: x'(t) = H x(t) + b (mod (2, 2, ..., 2)); ΦρP[x'(t)]++   // H: O(log s) × ρ matrix

Time complexity: O(s) · O(matrix multiplication) = O(s) · O(s log s)

------------------------------------------------------------------------

H = [r1, r2, ..., r_O(log s)]'   // ri: a binary row vector

H x(t) = [r1 x(t), r2 x(t), r3 x(t), ..., r_O(log s) x(t)]'

ri x(t) = ri (hP + t)[П] = (hP + t)[П ri]

[ri x(0), ri x(1), ..., ri x(s)] = fliplr(hP) * [П ri]   // * denotes convolution

Time complexity: O(log s) · O(convolution) = O(log s) · O(s log s)
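The second half of this slide can be sketched as follows: for one row r of H, the products r·x(t) for all shifts t form a cyclic correlation of the bit vector with the row scattered onto the probe positions, so they can be obtained from a single convolution. The code is an illustration under assumptions, not the authors' implementation: the small recursive FFT is a stand-in for a library routine, and the names `cyclic_convolve` and `row_times_all_shifts` are made up.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized inverse); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + w, even[k] - w
    return out


def cyclic_convolve(a, b):
    """Cyclic convolution of two equal-length integer lists via a zero-padded FFT."""
    s = len(a)
    size = 1
    while size < 2 * s:
        size *= 2
    fa = fft(list(a) + [0] * (size - s))
    fb = fft(list(b) + [0] * (size - s))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    linear = [round((v / size).real) for v in prod]      # linear convolution
    return [linear[t] + linear[t + s] for t in range(s)]  # fold back to length s


def row_times_all_shifts(bits, probes, row):
    """Return [row . x(t) mod 2 for t in range(s)], with x(t)[j] = bits[(probes[j] + t) % s]."""
    s = len(bits)
    weights = [0] * s
    for pos, r in zip(probes, row):
        weights[pos] += r  # scatter the row onto the probe positions
    reversed_w = [weights[-k % s] for k in range(s)]  # reversal turns correlation into convolution
    return [v % 2 for v in cyclic_convolve(reversed_w, bits)]


bits = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
probes, row = [2, 5, 5, 9], [1, 1, 0, 1]
print(row_times_all_shifts(bits, probes, row))
```

Repeating this for each of the O(log s) rows of H gives all hash values with O(log s) convolutions, which is the source of the O(log s) · O(s log s) bound on the slide.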

Page 33: Build Time

Step | Trivial running time | Ours
d-dimension → 1-dimension | O(dn) | O(dn)
Linear hashing | O(n) | O(n)
Invariant under translation | O(n^2 log^2 n) | O(n log^3 n), shared with the next row
Universal hashing (due to the domain size, we need to use matrix multiplication) | O(n^2 log^4 n) | O(n log^3 n), shared with the previous row