
1

Embedding and Similarity Search for Point Sets under Translation

Minkyoung Cho and David M. Mount, University of Maryland

SoCG 2008

2

Point Pattern Matching

Given two point sets P and Q, find a subset Q' ⊆ Q that minimizes Dist(P, Q') = min_t dist(tP, Q'), where t is a geometric transformation (e.g., translation, rotation, …).

[Figure: example point sets P and Q]

3

Point Pattern Similarity Search

A collection of point sets S = {P1, P2, …, PN} has been preprocessed. Given a query set Q, find the (approximate) nearest Pi with respect to a distance function and a transformation group.

[Figure: a query set Q compared against the collection S = {P1, P2, …, PN}]

4

Results

Method | Transformation | Space | Index | Note
Geometric Hashing [Wolfson & Rigoutsos 97] | Translation, Rotation, Affine, … | O(N n^(k+1)) (k: frame size) | YES | Space complexity
EMD into Euclidean space [Indyk & Thaper 03] | None | O(N n) | YES | Embedding EMD to L1
EMD under transformation sets [Cohen & Guibas 99] | Scaling, Translation | O(N n) | NO | Brute-force, heuristic
Ours | Translation | O(N n log^2 n) | YES | Embedding SD to L1

EMD: Earth Mover's Distance

SD: Symmetric Difference Distance

5

Problem Definition

Point Pattern Similarity Searching:

• Distance Measure: Symmetric Difference Distance

• Error Model: Outliers (but No Noise)

• Transformation: Translation

• Restriction: Coordinates are integers

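[Figure: Venn diagram of P and Q with regions P\Q, Q\P, and P Δ Q]

P = {p1,p2,p3,p4}, Q = {p1,p2,p5,p6}

P Δ Q = {p3,p4} ∪ {p5,p6}, |P Δ Q| = 4

Example under translation:

P = {0,12,14,23,35,54,59,64}

Q = {15,17,20,26,38,57,65,67}

With t = 3, Q − t = {12,14,17,23,35,54,62,64}, which shares the six points {12,14,23,35,54,64} with P.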
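The translation example above can be checked by brute force. The following is a minimal sketch, assuming 1-dimensional integer point sets without duplicates; the function name `sd_under_translation` is illustrative and does not come from the talk. It relies on the observation that any translation matching at least one pair of points has the form q − p. It is a quadratic-time baseline, not the embedding developed in the talk.

```python
from collections import Counter

def sd_under_translation(P, Q):
    """Return (min over t of |(P + t) Δ Q|, a best t) for 1-D integer point sets."""
    P, Q = set(P), set(Q)
    # Every useful translation aligns some p with some q, so t = q - p.
    votes = Counter(q - p for p in P for q in Q)
    if not votes:  # one of the sets is empty, nothing can match
        return len(P) + len(Q), 0
    # Most votes wins; ties are broken toward the smallest |t|.
    best_t, matches = max(votes.items(), key=lambda kv: (kv[1], -abs(kv[0])))
    # Each matched pair removes one point from each side of the difference.
    return len(P) + len(Q) - 2 * matches, best_t

P = [0, 12, 14, 23, 35, 54, 59, 64]
Q = [15, 17, 20, 26, 38, 57, 65, 67]
print(sd_under_translation(P, Q))  # (4, 3)
```

Six of the eight points align at t = 3, so two outliers remain on each side and the distance is 4.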

6

Motivation: Sources of Complexity

• Combination of Translation + Outliers

• Translation Only: translate each point set so that its leftmost point lies at the origin; matching is then trivial.

• Outliers Only: reduce to nearest neighbor search in the Hamming cube (by hashing or random sampling).
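The translation-only case can be sketched in a few lines. This is a minimal illustration for 1-dimensional integer point sets; the names `normalize` and `equal_under_translation` are hypothetical. The last call shows why the trick stops working once outliers are present: a single extra leftmost point changes the anchor.

```python
def normalize(points):
    """Translate a 1-D point set so that its leftmost point sits at the origin."""
    lo = min(points)
    return frozenset(p - lo for p in points)

def equal_under_translation(P, Q):
    """True if Q is exactly a translated copy of P (no outliers allowed)."""
    return normalize(P) == normalize(Q)

print(equal_under_translation([3, 6, 10], [15, 18, 22]))      # True
print(equal_under_translation([3, 6, 10], [14, 15, 18, 22]))  # False
```

In the second call three points still match under t = 12, but the outlier 14 becomes the leftmost point and the canonical forms no longer agree.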

7

Intuition

[Figure: an embedding f maps the query Q and each point set P1, P2, P3, P4, …, PN into a metric space]

8

Embedding: Basic Definitions

Given metric spaces (X, d) and (X', d'), a map f: X → X' is called an embedding.

The contraction of f is the maximum factor by which distances are shrunk, i.e.,

\[ \max_{x,y \in X} \frac{d(x,y)}{d'(f(x), f(y))} \]

The expansion or stretch of f is the maximum factor by which distances are stretched:

\[ \max_{x,y \in X} \frac{d'(f(x), f(y))}{d(x,y)} \]

The distortion of f is the product of the contraction and expansion.
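For a finite point set these three quantities can be computed directly from the definitions. The sketch below assumes distinct points and an injective map, so that no distance is zero; the function name `distortion` and the toy map are illustrative only.

```python
from itertools import combinations

def distortion(points, f, d, d_prime):
    """Return (contraction, expansion, distortion) of f over a finite point set."""
    contraction = expansion = 0.0
    for x, y in combinations(points, 2):
        before = d(x, y)
        after = d_prime(f(x), f(y))
        contraction = max(contraction, before / after)  # how much f shrinks
        expansion = max(expansion, after / before)      # how much f stretches
    return contraction, expansion, contraction * expansion

# Toy example on the real line: 0 -> 0, 1 -> 2, 3 -> 3.
line = lambda a, b: abs(a - b)
image = {0: 0, 1: 2, 3: 3}
print(distortion([0, 1, 3], image.get, line, line))  # (2.0, 2.0, 4.0)
```

The pair (1, 3) is shrunk from distance 2 to 1 and the pair (0, 1) is stretched from 1 to 2, so the distortion is 4. A uniform scaling such as x → 2x has contraction 0.5 and expansion 2, hence distortion 1.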

9

Main Result: Preliminaries

• Main result: There exists a randomized embedding that maps point sets under symmetric difference with respect to translation into the metric space L1 with distortion O(log^2 n).

• Assumptions:
  – Each point set has at most n elements and lies in dimension d.
  – Coordinates are integers of magnitude polynomial in n.

• Distance Function: Symmetric difference with respect to translation

\[ \langle P \Delta Q \rangle = \min_{t} \, |(P + t) \Delta Q| \]

• Target Metric: L1

\[ \|x - y\|_1 = \sum_{i=1}^{d} |x_i - y_i|, \quad x, y \in \mathbb{R}^d \]

10

Outline of Algorithm

1. Transform d-dimensional points into 1-dimensional points. (Distortion: 1)

2. Reduce the domain size using a linear hash function. (Distortion: O(1))

3. Make the representation invariant under translation. (Distortion: O(log^2 n))

4. Reduce the target domain size using a universal hash function. (Distortion: O(1))

[Figure: running example. The point set {3,6,10,14,22} is hashed to a bit vector of length O(n log n), turned into a collection of bit patterns {101010, ..., 010100, …, 11101}, and finally hashed into a vector of counts.]

11

Translation Invariant

[Figure: a probe of ρ = 4 positions is read at every shift of the length-s bit vector of P, producing the bit patterns {1101, 0001, 0000, 0010, 1100, 1010, …}]

12

Intuition

[Figure: the hashed bit vectors hP and hQ, each of length s, and the bit patterns read from them by probes of size 2 and 4]

Φ2P = {10,01,00,10,01,00,10,00,00,01,00}

Φ2Q = {10,00,01,00,11,00,10,01,00,11,00}

Φ4P = {1101,0000,0010,1100,0000,0001,1000,0010,0101,0000,0010}

Φ4Q = {1011,0100,0010,0101,1000,0011,1100,0010,0100,1001,0000}

If one of the probes hits a mismatched position, the generated bit patterns may differ.

The probability that some probe hits a mismatched position increases with the probe size.

13

Relationship between ρ (probe size) and δ*

[Figure: plot of the distance of invariants |ΦρP Δ ΦρQ| as the probe size ρ increases, with curves marked "Upper bound" and "Expectation" and a region marked "Unknown". Axis labels include δ*, δ*/O(ln s), and s/2^i.]

δ: estimated distance

δ*: original distance

14

Embedding

[Figure: the distance of invariants |ΦρP Δ ΦρQ| sampled at the scales 2^0, 2^1, 2^2, …, 2^L, …, 2^H, …, 2^(log 2n) = 2n, annotated with δ* and O(log s) factors. Summing the contributions over the scales i = 0, …, log 2n bounds ‖ΨP − ΨQ‖_1 by O(log n)·δ*.]

δ: estimated distance

δ*: original distance

15

Build Time

The expensive operations are building the invariant and hashing over a large domain.

Building the invariant: (# of probes) * (# of translations)
Trivial: O(s) * s = O(n log n) * O(n log n) = O(n^2 log^2 n)

Universal hash function: (# of elements) * (matrix operation) = (# of elements) * (input size) * (output size)
Trivial: O(s) * O(s) * O(log s) = O(s^2 log s) = O(n^2 log^3 n)

We can improve this to O(n log^3 n) if we merge the two operations. Surprise!

16

Merge Two Operations

[Figure: each row r0, …, r_(log s) of the hash matrix H is convolved with the length-s bit vector of P, producing the outputs y0, y1, y2, …, y_(s−1) for all s translations at once.]

Convolution can be computed in O(n log n) time, where n is the size of the array.

17

Main Result: Formal Statement

Given failure probability β, there exists a randomized embedding from a point set P into a vector ΨP of dimension O(n log^2 n log(1/β)) such that for any P, Q

\[ \text{(i)} \quad \|\Psi P - \Psi Q\|_1 \le 2 \log n \, \langle P \Delta Q \rangle \]

\[ \text{(ii)} \quad \|\Psi P - \Psi Q\|_1 \ge \frac{1}{17 \log n} \, \langle P \Delta Q \rangle \quad \text{with probability at least } 1 - \beta \]

This embedding can be computed in time O(n log^4 n log(1/β)).
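The two bounds combine into the O(log^2 n) distortion quoted earlier. The short derivation below assumes that (i) bounds the expansion from above and that (ii) limits the contraction; these inequality directions are inferred from the stated distortion and are not spelled out on the slide.

```latex
\[
\underbrace{2 \log n}_{\text{expansion}} \cdot
\underbrace{17 \log n}_{\text{contraction}}
= 34 \log^2 n = O(\log^2 n)
\]
```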

18

Open Problems

• Q1. Can we improve the distortion bound? It is currently O(log^2 n). Cormode & Muthukrishnan show how to embed a string under edit distance with moves into L1 with O(log n log* n) distortion.

• Q2. Can we derandomize the algorithm? Cormode & Muthukrishnan's algorithm is deterministic.

• Q3. Can we improve the space/time complexities?

19

Other Extensions

• Q1. Can we support a distance measure that is robust to noisy data (e.g., Hausdorff distance)?

• Q2. Can we handle other transformation groups?
  - integer scaling?
  - integer scaling + translation?
  - affine transformations over finite vector spaces?

Point Pattern Similarity Searching:

• Distance Measure: Symmetric Difference Distance

• Error Model: Outliers (but No Noise)

• Transformation: Translation

• Restriction: Coordinates are integral

20

Thank You!

21

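Translation Invariant

P = {3,6,10,14,22}

h(x) = x mod s (e.g., s = 11)

hP = 1 0 0 1 0 0 1 0 0 0 1 (positions 0 to s − 1; bits 0, 3, 6, and 10 are set)

ρ = 4

Bit patterns read by the probe: {1101, 0001, 0000, 0010, 1100, 1010, …}

As integers: {13,0,2,12,1,…,10}

h'(x): (for simplicity, x mod 10)

[Figure: a count vector indexed 0 1 2 3 4 5 6 7 8 9, initially all zeros, in which entry h'(x) is incremented for each pattern x; the resulting counts form ΦρP.]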
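The construction on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the probe offsets `(0, 1, 3, 7)` are hypothetical (the slide only gives ρ = 4), the function names are not from the talk, and the second hash h' is omitted so that patterns are counted directly. Because h(x) = x mod s turns a translation of P into a cyclic shift of the bit vector, collecting the probe pattern at every cyclic shift yields the same multiset for P and for P + t.

```python
from collections import Counter

def hash_bits(P, s):
    """Bit vector of length s with bit (p mod s) set for every p in P."""
    bits = [0] * s
    for p in P:
        bits[p % s] = 1
    return bits

def probe_invariant(P, s, probe):
    """Multiset of bit patterns seen by the probe over all s cyclic shifts."""
    bits = hash_bits(P, s)
    patterns = Counter()
    for t in range(s):
        # Read the probe positions, shifted cyclically by t.
        pattern = tuple(bits[(t + off) % s] for off in probe)
        patterns[pattern] += 1
    return patterns

P = [3, 6, 10, 14, 22]
probe = (0, 1, 3, 7)  # hypothetical probe positions, rho = 4
shifted = [p + 5 for p in P]
print(hash_bits(P, 11))  # [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
print(probe_invariant(P, 11, probe) == probe_invariant(shifted, 11, probe))  # True
```

Note that 3 and 14 collide under h, which is why the bit vector has four set bits for five points.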

22

Trial 1: Geometric Hashing for Translation

• Naïve Version:
  - Space complexity is O(N n^2), since the frame size is 1.
  - With outliers in a query, the number of queries increases.

• Adaptive Version: To reduce the space complexity we can store only c transformed sets, but then the number of queries increases.

• Outliers may lead to false matches, and thus increase the probability of false positives.

23

Geometric Hashing with Outliers (delete)

Depending on the number of outliers $r$ and the frame size $k$, more queries are needed to get a correct result.

Method 1. Pr[choose a valid frame set] = (1 − r/n)^k

Method 2. (r + 1) different trials (deterministic)

Method 3. Pigeonhole principle: Pr[choose a valid frame set] = 1 − r/(n/k)

[Grimson & Huttenlocher 90]: Outliers lead to false matches and increase the probability of false positives.
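To get a feel for the two probability formulas, they can be evaluated for concrete values. The sketch below simply plugs numbers into the expressions from the slide; the function names and the sample values n = 100, r = 10, k = 2 are illustrative assumptions.

```python
def valid_frame_probability(n, r, k):
    """Method 1: (1 - r/n)^k, all k frame points avoid the r outliers."""
    return (1 - r / n) ** k

def pigeonhole_probability(n, r, k):
    """Method 3: 1 - r/(n/k), at most r of the n/k disjoint frames are spoiled."""
    return 1 - r / (n / k)

n, r, k = 100, 10, 2
p1 = valid_frame_probability(n, r, k)
p3 = pigeonhole_probability(n, r, k)
print(round(p1, 4), round(p3, 4))  # 0.81 0.8
print(round(1 / p1, 2))            # expected number of random trials for method 1
```

Under method 2, the same setting would need r + 1 = 11 deterministic trials.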

24

d-Dimension → 1-Dimension

Let u be the maximum coordinate value of any point. Then we can map a d-dimensional point set to a 1-dimensional point set with coordinates of size at most (3u)^d, without changing the symmetric difference distance under translation.

[Figure: a 2-dimensional grid is laid out row by row on a line, with empty padding after each row. In the example, (1,1) maps to 1 and (5,3) maps to 35, and the intervals [6,15] and [21,30] are padding.]
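One way to realize this mapping is a row-major layout with stride 3u. The sketch below assumes 1-based coordinates in 1..u and u = 5; this layout is inferred from the example values (1,1) → 1 and (5,3) → 35 and is not stated explicitly on the slide. The names `flatten` and `min_sym_diff` are illustrative. The padding keeps coordinate differences from spilling into neighboring rows, so the same pairs of points align before and after flattening.

```python
from collections import Counter

def flatten(point, u):
    """Map a d-dimensional point with coordinates in 1..u to one integer."""
    stride = 3 * u  # assumed row length: u cells plus 2u cells of padding
    return 1 + sum((c - 1) * stride**i for i, c in enumerate(point))

def min_sym_diff(P, Q, diff):
    """Brute-force symmetric difference under translation (distinct points)."""
    votes = Counter(diff(q, p) for p in P for q in Q)
    best = max(votes.values(), default=0)
    return len(P) + len(Q) - 2 * best

u = 5
P = [(1, 1), (5, 3), (2, 4)]
Q = [(2, 2), (3, 5), (5, 1)]
d2 = min_sym_diff(P, Q, lambda q, p: tuple(a - b for a, b in zip(q, p)))
d1 = min_sym_diff([flatten(p, u) for p in P],
                  [flatten(q, u) for q in Q],
                  lambda q, p: q - p)
print(flatten((1, 1), u), flatten((5, 3), u))  # 1 35
print(d2, d1)                                  # 2 2
```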

25

# of Primes & Collision Prob.

• Collision Probability: h(x) = x mod s, where s is a prime number in Θ(n log n) chosen uniformly at random.

For x ≠ y:
Pr[h(x) = h(y)] = Pr[(x mod s) = (y mod s)] = Pr[(x − y) mod s = 0]

Since x, y ∈ Z_(n^c), we have |x − y| < n^c, so fewer than c primes of size at least n can divide x − y. Therefore

Pr[h(x) = h(y)] < c/(# of primes) = O(1/n)

• Prime Number Theorem: There are Θ(m / log m) prime numbers between 1 and m, which gives Θ(n) primes of size Θ(n log n).
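The counting argument can be made concrete on a small range. The sketch below lists the primes in an interval and picks out those under which two given values collide; the helper names and the sample numbers are illustrative assumptions, and the tiny interval stands in for the Θ(n log n) range of the slide.

```python
def primes_in_range(lo, hi):
    """All primes p with lo <= p <= hi, by the sieve of Eratosthenes (hi >= 1)."""
    sieve = [True] * (hi + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(hi**0.5) + 1):
        if sieve[i]:
            for j in range(i * i, hi + 1, i):
                sieve[j] = False
    return [p for p in range(max(lo, 2), hi + 1) if sieve[p]]

def colliding_primes(x, y, lo, hi):
    """Primes s in [lo, hi] for which x mod s == y mod s."""
    return [s for s in primes_in_range(lo, hi) if (x - y) % s == 0]

candidates = primes_in_range(5, 50)
bad = colliding_primes(1000, 10, 5, 50)
print(len(candidates), bad)  # 13 [5, 11]
```

Here x − y = 990 = 2 · 3^2 · 5 · 11, so only 2 of the 13 candidate primes cause a collision, and a uniformly random choice of s collides with probability 2/13.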

26

Distance Distortion by Hashing

We can achieve O(1) distortion with a hash function whose collision probability is O(1/n).

Note that collisions can only contract the distance.

27

Linear Hash Function (X)

• h(x) = x mod s, where s is a prime number in Θ(n log n)

• Linearity: h(x + t) = (h(x) + h(t)) mod s, so a translation of P becomes a cyclic shift of the hashed bit vector, and ΦρP = Φρ(P + t).

[Figure: P = {3,6,10,14,22} hashed to a bit vector of length s]

28

Distance Distortion by Hashing (X)

We can achieve O(1) distortion with a hash function whose collision probability is O(1/n).

Note that collisions can only contract the distance.

29

Universal Hash Function for large domain

Since the maximum probe size is O(n log n), the input domain of the hash function has size 2^(O(n log n)). However, it contains only Θ(n log n) elements.

• H: {0,1}^s → {0,1}^k

H(x) = R x + b (mod (2,2,…,2)), where R is a random k × s binary matrix and b is a random k-bit vector.

• Time Complexity: Computing one value takes O(k s) = O((log n) · n log n) = O(n log^2 n). For all s (= O(n log n)) elements, the time is O(n^2 log^3 n).
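A hash of this form can be sketched with Python integers used as bit vectors. This is a minimal illustration, assuming s-bit inputs packed into an int; the name `make_hash` and the seeded generator are choices made for the example, not details from the talk. Each output bit is the parity of the input masked by one random row of R, which is the O(k s) per-value cost mentioned above.

```python
import random

def make_hash(s, k, rng):
    """Return H(x) = R x + b over GF(2), with R a random k x s bit matrix."""
    rows = [rng.getrandbits(s) for _ in range(k)]  # rows of R as s-bit masks
    b = rng.getrandbits(k)                         # random k-bit offset
    def H(x):
        out = 0
        for i, row in enumerate(rows):
            parity = bin(row & x).count("1") & 1   # inner product mod 2
            out |= parity << i
        return out ^ b                             # add b mod 2
    return H

H = make_hash(s=16, k=4, rng=random.Random(0))
x, y = 0b1011000011110000, 0b0000111100001111
# Affine over GF(2): H(x) + H(y) + H(0) equals H(x + y).
print(H(x) ^ H(y) ^ H(0) == H(x ^ y))  # True
```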

30

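Relationship between ρ and δ*

[Figure: plot of the distance of invariants |ΦρP Δ ΦρQ| against the probe size ρ, with curves marked "Upper bound" and "Expectation" and a region marked "Unknown". Axis labels include δ*, δ*/O(ln s), and s/2^i.]

δ is a guessed distance

δ* is the optimal distance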

31

Effect of Hash Functions

[Figure: the plot of |ΦρP Δ ΦρQ| against ρ, annotated with the effect of the hash functions h and h'. Axis labels include δ* and δ*/O(log s).]

32

Merge Two Operations using FFT & Convolution

П = random_probe(ρ, s)

For t = 1, …, s: x(t) = (hP + t)[П] // make an invariant

For t = 1, …, s: x'(t) = H x(t) + b (mod (2,2,2,…,2)); ΦρP[x'(t)]++ // H: O(log s) × ρ matrix

Time Complexity: O(s) * O(matrix multiplication) = O(s) * O(s log s)

------------------------------------------------------------------------

H = [r1, r2, …, r_O(log s)]' // ri: a binary row vector

H x(t) = [r1 x(t), r2 x(t), r3 x(t), …, r_O(log s) x(t)]'

ri x(t) = ri (hP + t)[П] = (hP + t)[П ri]

[ri x(0), ri x(1), …, ri x(s)] = fliplr(hP) ∗ [П ri] // ∗: convolution

Time Complexity: O(log s) * O(convolution) = O(log s) * O(s log s)
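The identity behind the speed-up is that one output bit of the hash, evaluated at all s shifts, is a circular correlation between the bit vector hP and a 0/1 mask of the probe positions selected by one row ri. The sketch below checks that identity against the naive double loop, using a small pure-Python FFT so that it runs without external libraries. The mask `w`, the function names, and the zero-padding scheme are assumptions for illustration; a production version would likely call an optimized FFT library.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + w, even[k] - w
    return out

def correlate_fft(bits, w):
    """y[t] = sum_j w[j] * bits[(t + j) mod s] for all t, in O(s log s)."""
    s, n = len(bits), 1
    while n < 2 * s:           # room for the full linear convolution
        n *= 2
    w_rev = [w[(-m) % s] for m in range(s)]  # flip turns correlation into convolution
    fa = fft(bits + [0] * (n - s))
    fb = fft(w_rev + [0] * (n - s))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    lin = [round((v / n).real) for v in prod]
    return [lin[t] + lin[t + s] for t in range(s)]  # fold back to length s

def correlate_naive(bits, w):
    s = len(bits)
    return [sum(w[j] * bits[(t + j) % s] for j in range(s)) for t in range(s)]

bits = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]   # hP for P = {3,6,10,14,22}, s = 11
w = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]      # hypothetical mask: probe positions kept by one row
print(correlate_fft(bits, w) == correlate_naive(bits, w))  # True
print([y % 2 for y in correlate_fft(bits, w)])             # one hash bit for every shift
```

Repeating this for each of the O(log s) rows gives the O(log s) * O(s log s) total stated above.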

33

Build Time

Step | Trivial running time | Ours
d-dimension → 1-dimension | O(dn) | O(dn)
Linear Hashing | O(n) | O(n)
Invariant under Translation | O(n^2 log^2 n) | O(n log^3 n), for this step and Universal Hashing combined
Universal Hashing (due to the domain size, we need to use matrix multiplication) | O(n^2 log^4 n) | (included in the row above)
