
Page 1: SAXually Explicit Images: Finding Unusual Shapes

Time Series Data Mining Group

Li Wei, Eamonn Keogh, Xiaopeng Xi

Computer Science & Engineering Dept., University of California – Riverside

Appears as a Google tech talk; search Google for “Keogh SAXually”.

http://video.google.com/videoplay?docid=6642985254445857159

Page 2: Outline

• Shape Representations and Distance Measures

• Shape Discords (i.e., unusual shapes)

• Algorithm
– Shape Discords Discovery Framework
– Approximating the Optimal Ordering

• Empirical Evaluation

• Conclusion

Page 3: Shape Datasets

[Figure: example shape datasets. Labels: Butterflies, Skulls, Fruit fly wings, Leaves, Sea animals, Nematodes, Arrowheads, Petroglyphs, Lizards]

Page 4: Shape Representations

[Figure: a shape and its 1D time series representation; horizontal axis from 0 to 1200]

We can convert shapes into a 1D signal. By doing this we remove information about scale and offset. But we must deal with rotation in our algorithms …

There are three ways to be rotation invariant: Landmarking, Rotation Invariant Features, Brute Force Rotation Alignment…
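The slide does not spell out how a shape becomes a 1D signal. One common construction, assumed here, is the centroid distance profile: the distance from the shape's centroid to each point on its outline, resampled to a fixed length and z-normalized. The function name `centroid_distance_series` and the parameter `n` are hypothetical. The sketch illustrates why scale and offset drop out of the representation while rotation does not (rotating the shape only shifts the starting point of the series).

```python
import numpy as np

def centroid_distance_series(contour, n=128):
    """Convert an ordered (x, y) outline into a z-normalized 1D series.

    Assumption: the centroid is approximated by the mean of the outline
    points, and the outline is treated as a closed curve.
    """
    contour = np.asarray(contour, dtype=float)
    centroid = contour.mean(axis=0)
    # Distance from the centroid to every outline point, in outline order.
    dist = np.linalg.norm(contour - centroid, axis=1)
    # Resample to n points, interpolating cyclically around the outline.
    positions = np.linspace(0, len(dist), n, endpoint=False)
    resampled = np.interp(positions, np.arange(len(dist)), dist, period=len(dist))
    # Z-normalization removes the effect of scale; subtracting the centroid
    # already removed the effect of offset.
    return (resampled - resampled.mean()) / resampled.std()

# Usage: an ellipse and a scaled, shifted copy give the same series.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
ellipse = np.column_stack([3 * np.cos(t), np.sin(t)])
series_a = centroid_distance_series(ellipse, n=64)
series_b = centroid_distance_series(5 * ellipse + np.array([10.0, -4.0]), n=64)
```

A perfectly circular outline has a constant profile, so the z-normalization would divide by zero; a real implementation would need to guard against that case.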

Page 5: Landmarking*: Find the One “True” Rotation

[Figure: primate skull outlines labeled A, B, and C (labels: Orangutan; Northern Gray-Necked Owl Monkey; Owl Monkey (species unknown)), compared under Generic Landmark Alignment and under Best Rotation Alignment]

• Domain-Specific Landmarking: Find some fixed point in your domain, e.g., the nose on a face, the stem of a leaf, the tail of a fish …

• Generic Landmarking: Find the major axis of the shape and use that as the canonical alignment.

• Problem: It does not work in many cases.
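As a rough illustration of generic landmarking, the sketch below estimates the major axis of an outline as the principal eigenvector of the covariance of its points and rotates the shape so that this axis lies along the x direction. This is one possible reading of "major axis", not necessarily the method used in the cited work; the function name `align_to_major_axis` is hypothetical.

```python
import numpy as np

def align_to_major_axis(points):
    """Rotate 2D outline points so their principal axis lies along x."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    # Eigen-decomposition of the 2x2 covariance; eigh returns eigenvalues
    # in ascending order, so the last eigenvector is the major axis.
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    major = eigvecs[:, -1]
    angle = np.arctan2(major[1], major[0])
    # Rotation by -angle maps the major axis onto the x axis.
    c, s = np.cos(-angle), np.sin(-angle)
    rotation = np.array([[c, -s], [s, c]])
    return centered @ rotation.T

# Usage: an ellipse rotated by 40 degrees is brought back to the x axis.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
ellipse = np.column_stack([4 * np.cos(t), np.sin(t)])
theta = np.deg2rad(40)
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
aligned = align_to_major_axis(ellipse @ R.T + np.array([2.0, -1.0]))
```

The eigenvector's sign is arbitrary, so the result is only determined up to a 180-degree flip, and for nearly round shapes the major axis is unstable. Both effects are consistent with the slide's remark that this approach fails in many cases.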

* Xie, J. and Heng, P. Shape Modeling Using Automatic Landmarking. MICCAI 2005.

Page 6: Rotation Invariant Features*

• Possible features include: ratio of perimeter to area, fractal measures, elongatedness, circularity, min/max/mean curvature, entropy, perimeter of convex hull, and histograms.

• Problem: In throwing away rotation information, some useful information is inevitably thrown away as well.
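To make one of the listed features concrete, here is a minimal sketch of the perimeter-to-area ratio for a simple polygon, using the shoelace formula for the area. The helper names `perimeter_to_area_ratio` and `rotate` are hypothetical. The usage shows that the value is unchanged by rotation.

```python
import math

def perimeter_to_area_ratio(polygon):
    """Perimeter divided by area for a simple polygon given as (x, y) vertices."""
    n = len(polygon)
    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        perimeter += math.hypot(x2 - x1, y2 - y1)
        twice_area += x1 * y2 - x2 * y1  # shoelace term
    return perimeter / (abs(twice_area) / 2.0)

def rotate(polygon, degrees):
    """Rotate every vertex about the origin by the given angle."""
    a = math.radians(degrees)
    return [(x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))
            for x, y in polygon]

# Usage: the unit square has perimeter 4 and area 1 at any orientation.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
ratio_upright = perimeter_to_area_ratio(square)
ratio_rotated = perimeter_to_area_ratio(rotate(square, 37))
```

A single number like this discards almost everything about the outline, which is the information loss the slide warns about. The raw ratio also changes with scale, so it would have to be computed on size-normalized shapes.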

[Figure: primate skulls (Red Howler Monkey, Mantled Howler Monkey, Orangutan (juvenile), Borneo Orangutan, Orangutan) compared under two representations: a histogram of the distances between two randomly chosen points on the perimeter of the shape, and the 1D centroid representation]

* Cardone, A., Gupta, S. K., and Karnik, M. A Survey of Shape Similarity Assessment Algorithms for Product Design and Manufacturing Applications. ASME Journal, 2003

Page 7: Brute Force Rotation Alignment

[Figure: time series labeled C1 through C13]

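• Idea: Achieve true rotation invariance by exhaustive brute force search over all possible rotations.

• Rotation Matrix: Given a time series C of length n, its possible rotations constitute a rotation matrix $\mathbf{C}$ of size n by n:

$$\mathbf{C} = \begin{bmatrix} c_1 & c_2 & \cdots & c_{n-1} & c_n \\ c_2 & \cdots & c_{n-1} & c_n & c_1 \\ \vdots & & & & \vdots \\ c_n & c_1 & c_2 & \cdots & c_{n-1} \end{bmatrix}$$

• Rotation Invariant Euclidean Distance (RED), where $C_j$ is the j-th row of $\mathbf{C}$:

$$RED(Q, C) = \min_{1 \le j \le n} ED(Q, C_j)$$

• Problem: High computational cost.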

We have forcefully shown that this is the right representation; see our VLDB 2006 paper.
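The rotation matrix and RED can be sketched directly from their definitions. This is an unoptimized illustration, assuming both series have the same length; the function names `rotation_matrix` and `red` are hypothetical. It costs O(n²) per distance, which is the computational problem the slide points to.

```python
import numpy as np

def rotation_matrix(c):
    """All n circular left shifts of series c, one per row."""
    c = np.asarray(c, dtype=float)
    return np.stack([np.roll(c, -j) for j in range(len(c))])

def red(q, c):
    """Rotation invariant Euclidean distance: min over rotations of c."""
    q = np.asarray(q, dtype=float)
    rotations = rotation_matrix(c)
    # Euclidean distance from q to every row, then the minimum.
    return float(np.sqrt(((rotations - q) ** 2).sum(axis=1)).min())

# Usage: a series and a rotated copy of it are at distance zero.
c = [1.0, 2.0, 3.0, 4.0, 5.0]
q = np.roll(c, 2)
d = red(q, c)
```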

Page 8: Shape Discord

• The shape that is least similar to other shapes in a dataset (or has the largest distance to its nearest match)

[Figure: examples of shape discords. Labels: SQUID Dataset (subset), 1st Discord; Specimen 20773, 1st Discord; 1st Discord (Castroville Cornertang); 2nd Discord (Martindale point)]

Page 9: Brute Force Shape Discord Discovery

Algorithm [dist, index] = BruteForce_Search(S)
best_so_far_dist = 0
best_so_far_index = NaN
For p = 1 to |S|
    nearest_neighbor_dist = infinity
    For q = 1 to |S|
        If p != q
            If Dist(Cp, Cq) < nearest_neighbor_dist
                nearest_neighbor_dist = Dist(Cp, Cq)
            End
        End
    End
    If nearest_neighbor_dist > best_so_far_dist
        best_so_far_dist = nearest_neighbor_dist
        best_so_far_index = p
    End
End
Return [best_so_far_dist, best_so_far_index]

For each shape in the dataset (row)

Find the distance to its nearest neighbor (column)

Check whether it is a better candidate for discord

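The brute force search translates almost line for line into Python. In this sketch the distance function is passed in as a parameter, so any measure (for example a rotation invariant one) can be plugged in; the name `brute_force_search` and the use of 0-based indices are choices made here, not part of the slide.

```python
import math

def brute_force_search(shapes, dist):
    """Return (discord distance, discord index) by checking every pair."""
    best_so_far_dist = 0.0
    best_so_far_index = None
    for p, cp in enumerate(shapes):
        # Distance from shape p to its nearest neighbor.
        nearest_neighbor_dist = math.inf
        for q, cq in enumerate(shapes):
            if p != q:
                nearest_neighbor_dist = min(nearest_neighbor_dist, dist(cp, cq))
        # The discord is the shape whose nearest neighbor is farthest away.
        if nearest_neighbor_dist > best_so_far_dist:
            best_so_far_dist = nearest_neighbor_dist
            best_so_far_index = p
    return best_so_far_dist, best_so_far_index

# Usage with toy one-dimensional "shapes": 10 is far from everything else.
result = brute_force_search([0, 1, 2, 10], lambda a, b: abs(a - b))
```

The two nested loops make |S|·(|S| − 1) distance calls, which is the baseline the later slides try to beat.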

Page 10: Observations from Brute Force Algorithm

Pairwise distances between six shapes, with each shape's nearest-neighbor distance (nn_dist) and the running best-so-far discord distance (bsf_dist):

Shape     1      2      3      4      5      6    nn_dist   comments
1         -    19.1    5.9   29.3   19.5   18.4     5.9     bsf_dist = 5.9
2       19.1     -    10.1   29.0    2.4    3.0     2.4     2.4 < 5.9
3        5.9   10.1     -    28.1    4.1    8.4     4.1     4.1 < 5.9
4       29.3   29.0   28.1     -    26.7   28.8    26.7     bsf_dist = 26.7
5       19.5    2.4    4.1   26.7     -     3.4     2.4     2.4 < 26.7
6       18.4    3.0    8.4   28.8    3.4     -      3.0     3.0 < 26.7

[Figure: the same distance matrix shown in three panels: Brute Force, Early Abandon, Magic]

Order Matters!

Page 11: Heuristic Shape Discord Discovery

Algorithm [dist, index] = Heuristic_Search(S, Outer, Inner)
best_so_far_dist = 0
best_so_far_index = NaN
For each index p given by heuristic Outer
    nearest_neighbor_dist = infinity
    For each index q given by heuristic Inner
        If p != q
            If Dist(Cp, Cq) < best_so_far_dist
                break
            End
            If Dist(Cp, Cq) < nearest_neighbor_dist
                nearest_neighbor_dist = Dist(Cp, Cq)
            End
        End
    End
    If nearest_neighbor_dist > best_so_far_dist
        best_so_far_dist = nearest_neighbor_dist
        best_so_far_index = p
    End
End
Return [best_so_far_dist, best_so_far_index]

Consider discord candidate in Outer order

Visit other shapes in Inner order

Apply early abandoning
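The following sketch runs the heuristic search on the six-shape distance matrix from the "Observations from Brute Force Algorithm" slide and counts distance calls. Two things are assumptions made here: the function signature (shapes are referred to by 0-based index, `inner` is a function of the candidate), and the explicit `abandoned` flag. Read literally, the pseudocode would let a candidate that was abandoned by `break` still pass the final test, so the sketch assumes the intent is to discard it. The "ideal" ordering at the end is hand-picked for this matrix, in the spirit of the Magic panel.

```python
import math

# Pairwise distances copied from the six-shape example (diagonal unused).
DIST = [
    [0.0, 19.1, 5.9, 29.3, 19.5, 18.4],
    [19.1, 0.0, 10.1, 29.0, 2.4, 3.0],
    [5.9, 10.1, 0.0, 28.1, 4.1, 8.4],
    [29.3, 29.0, 28.1, 0.0, 26.7, 28.8],
    [19.5, 2.4, 4.1, 26.7, 0.0, 3.4],
    [18.4, 3.0, 8.4, 28.8, 3.4, 0.0],
]

def heuristic_search(dist, outer, inner):
    """Return (discord distance, discord index, number of distance calls)."""
    best_so_far_dist, best_so_far_index, calls = 0.0, None, 0
    for p in outer:
        nearest_neighbor_dist = math.inf
        abandoned = False
        for q in inner(p):
            if p == q:
                continue
            d = dist(p, q)
            calls += 1
            if d < best_so_far_dist:
                # p has a neighbor closer than the best discord so far,
                # so p cannot be the discord: stop looking at it.
                abandoned = True
                break
            nearest_neighbor_dist = min(nearest_neighbor_dist, d)
        if not abandoned and nearest_neighbor_dist > best_so_far_dist:
            best_so_far_dist, best_so_far_index = nearest_neighbor_dist, p
    return best_so_far_dist, best_so_far_index, calls

def lookup(p, q):
    return DIST[p][q]

# Early abandoning with the natural order 0..5 in both loops.
natural = heuristic_search(lookup, range(6), lambda p: range(6))

# Hand-picked ideal order: the true discord first in the outer loop,
# nearest neighbors first in the inner loop.
ideal = heuristic_search(
    lookup,
    [3, 0, 1, 2, 4, 5],
    lambda p: sorted(range(6), key=lambda q: DIST[p][q]),
)
```

On this matrix brute force needs 30 distance calls, early abandoning in natural order needs 20, and the ideal ordering needs 10, while all three report the same discord (shape 4 in the slide's 1-based numbering).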

Page 12: Observations from Heuristic Algorithm

(Algorithm Heuristic_Search, repeated from the previous page.)

Observation 1
• We do not need a perfect outer ordering.
• It is enough that, among the first few shapes examined, there is at least one that has a large distance to its nearest neighbor.

We want the early-abandoning test in the inner loop, Dist(Cp, Cq) < best_so_far_dist, to be true as often as possible!

Observation 2
• We do not need a perfect inner ordering.
• It is enough that, among the first few shapes examined, there is at least one whose distance to the candidate is less than the current value of the best_so_far_dist variable.

Page 13: Approximating the Optimal Ordering

• Step 1: Symbolize the time series.
• Step 2: Use locality-sensitive hashing to estimate similarity between shapes.
• Step 3: Generate heuristics for the outer and inner loops.

• Keep in mind:
– The outer heuristic (invoked only once) can take at most O(m) to calculate.
– The inner heuristic (invoked m times) can take at most O(1) to calculate.

Page 14: SAX: Symbolic Aggregate approXimation

[Figure: a time series converted to the SAX word baabccbc]

• Lower bounds Euclidean distance
• Achieves dimensionality reduction
• There are now well over 100 SAX papers; see www.cs.ucr.edu/~eamonn/SAX.htm
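For readers unfamiliar with SAX, a minimal sketch of the usual construction follows: z-normalize the series, average it over equal-width segments (piecewise aggregate approximation), and map each segment mean to a letter using breakpoints that split the standard normal distribution into equiprobable regions. The function name `sax_word` is hypothetical, and the sketch assumes the series length is a multiple of the word length.

```python
import numpy as np
from statistics import NormalDist

def sax_word(series, word_length, alphabet_size):
    """Convert a numeric series into a SAX word over 'a', 'b', 'c', ..."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()
    # Piecewise aggregate approximation: the mean of each equal-width segment.
    # Assumes len(series) is divisible by word_length.
    segment_means = x.reshape(word_length, -1).mean(axis=1)
    # Breakpoints cut the standard normal into equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / alphabet_size)
                   for i in range(1, alphabet_size)]
    symbols = np.searchsorted(breakpoints, segment_means)
    return "".join(chr(ord("a") + int(s)) for s in symbols)

# Usage: a steadily rising series walks through the alphabet.
word = sax_word(np.arange(8), word_length=4, alphabet_size=4)
```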

Page 15: Locality-sensitive Hash Function*

• Consider a string s of length w over an alphabet Σ and k indices i_1, …, i_k chosen uniformly at random from the set {1, …, w}. The locality-sensitive hash function f is defined as

$$f(s) = s[i_1], s[i_2], \ldots, s[i_k]$$

• For example, [Figure: f applied to example strings over the letters a and d]

• Property
– Strings similar to each other are more likely to be hashed to the same value.

* Indyk, P., Motwani, R., Raghavan, P., and Vempala, S. Locality-Preserving Hashing in Multidimensional Spaces. STOC 1997.
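The hash function defined on this slide amounts to reading off k randomly chosen positions of the string. A small sketch, with hypothetical names (`make_lsh`, `seed`) and two assumptions: positions are 0-based, and they are drawn with replacement, since the slide does not say which.

```python
import random

def make_lsh(w, k, seed=0):
    """Pick k random positions in [0, w) and return (hash function, positions)."""
    rng = random.Random(seed)
    # Assumption: indices are drawn independently, so repeats are possible.
    indices = [rng.randrange(w) for _ in range(k)]

    def f(s):
        # The hash value is the string's symbols at the chosen positions.
        return "".join(s[i] for i in indices)

    return f, indices

# Usage: hash two SAX words of length 4 with the same function.
f, indices = make_lsh(w=4, k=2, seed=1)
value_a = f("adad")
value_b = f("daca")
```

Two strings receive the same hash value exactly when they agree at every chosen position, so strings that agree in more positions collide more often over repeated random draws.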

Page 16

[Figure: two images, A) and B), with their time series representations (horizontal axis from 0 to 1200) and their SAX words, adad and daca]

Because of rotations, similar shapes may not be hashed to the same value.

Page 17: Rotation Invariant Locality-sensitive Hash Function

• Consider a string s of length w over an alphabet Σ and k indices i_1, …, i_k chosen uniformly at random from the set {1, …, w}. The rotation invariant locality-sensitive hash function f′ is defined as

$$f'(s) = \{\, p[i_1], p[i_2], \ldots, p[i_k] \mid p \in LSHIFTS(s) \,\}$$

where LSHIFTS(s) is the set of all possible left shifts of string s.

Images   SAX Words   Shifts                   LSH Values
A)       adad        adad, dada               aa, dd
B)       daca        daca, acad, cada, adac   dc, aa, cd
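The rotation invariant hash can be sketched by hashing every left shift of the word and collecting the distinct results. The function names are hypothetical, and the positions used in the usage example (first and third symbol, written 0-based as 0 and 2) are an assumption, chosen because they reproduce the LSH values shown for the words adad and daca.

```python
def lshifts(s):
    """The set of all circular left shifts of string s."""
    return {s[j:] + s[:j] for j in range(len(s))}

def rotation_invariant_lsh(s, indices):
    """Hash every left shift of s at the given positions; return the set."""
    return {"".join(p[i] for i in indices) for p in lshifts(s)}

# Usage: the two example words from the slide, hashed at positions 0 and 2.
values_a = rotation_invariant_lsh("adad", (0, 2))
values_b = rotation_invariant_lsh("daca", (0, 2))
shared = values_a & values_b
```

Because a rotated copy of a word has the same set of left shifts, it always produces the same set of hash values, and two different words collide when their sets overlap, as these two do on the value aa.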

Page 18: Generating Heuristics

[Figure: the Array of m SAX words (daca, acbd, caab, adad, …, bada); the Shifts of the words for Image 1 / Time Series 1 (daca, acad, cada, adac) and Image 4 / Time Series 4 (adad, dada); the Buckets (aa, ab, ac, ba, bd, ca, cd, db, dc, dd) listing the shapes hashed to each value; and the resulting Collision Matrix over shapes 1 … m]

• Outer order: Examine shapes in ascending order of the largest number of collisions each shape has with any other shape.

• Inner order: When candidate shape i is considered in the outer loop, the inner loop examines the other shapes in descending order of the number of collisions they have with shape i.
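A rough sketch of how the collision matrix and the two orderings could be computed from a list of SAX words. The details are assumptions made for illustration: the number of hashing rounds, drawing positions without replacement, and counting one collision per shared bucket per round are not specified on the slide, and all function names are hypothetical. The sketch also makes no attempt to meet the O(m) and O(1) time bounds stated earlier; it uses plain sorting for clarity.

```python
import random
from itertools import combinations

def collision_matrix(words, k, iterations, seed=0):
    """Count how often each pair of words shares a rotation invariant hash bucket."""
    rng = random.Random(seed)
    m, w = len(words), len(words[0])
    counts = [[0] * m for _ in range(m)]
    for _ in range(iterations):
        indices = rng.sample(range(w), k)  # assumed: positions without replacement
        buckets = {}
        for shape_id, word in enumerate(words):
            for j in range(w):  # hash every left shift of the word
                shifted = word[j:] + word[:j]
                key = "".join(shifted[i] for i in indices)
                buckets.setdefault(key, set()).add(shape_id)
        for members in buckets.values():
            for a, b in combinations(sorted(members), 2):
                counts[a][b] += 1
                counts[b][a] += 1
    return counts

def outer_order(counts):
    """Shapes in ascending order of their largest collision count with another shape."""
    m = len(counts)
    return sorted(range(m), key=lambda i: max(counts[i][j] for j in range(m) if j != i))

def inner_order(counts, i):
    """Other shapes in descending order of their collision count with shape i."""
    return sorted((j for j in range(len(counts)) if j != i), key=lambda j: -counts[i][j])

# Usage: two rotations of the same word plus one word with no letters in common.
counts = collision_matrix(["adad", "dada", "bcbc"], k=2, iterations=5)
outer = outer_order(counts)
inner = inner_order(counts, 0)
```

In the usage example the odd word out never collides with the others, so it comes first in the outer order as the most promising discord candidate, while the inner order for the first word visits its rotated copy first.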

Page 19: The Utility of Shape Discords

[Figure: discords found in several datasets, with time series shown on a horizontal axis from 0 to 900. Labels: Heliconius melpomene (The Postman); Heliconius erato (Red Passion Flower Butterfly); 1st Discord (Dacrocyte); shapes A to G, grouped as A, D, E / B, C, F / G; shapes A, B, C; 1st Discord]

Page 20: The Utility of Heuristic Ordered Search

• Datasets
– Homogeneous: 10,000 projectile points
– Heterogeneous: 5,844 objects

• Measurement
– Number of distance function calls made by each approach / number of distance function calls made by brute force

[Figure: two plots, titled Projectile Points and Heterogeneous Dataset. Vertical axis: 0 to 1. Horizontal axis: Number of Time Series in database (m), from 50 to 10,000 for Projectile Points and from 50 to 5,844 for the Heterogeneous Dataset. Curves: Brute Force, Random, Our method]

Just using early abandoning (which is an original idea in this context) is 3 or 4 orders of magnitude faster; the Magic heuristic is a further order of magnitude faster.

Page 21: Conclusion & Future Work

• We define shape discords.
• We introduce a heuristic-based algorithm to efficiently find discords and demonstrate its utility in various domains.

• Future Work
– Investigate image discords using not only shape but also texture and color
– Conduct field studies of shape discord discovery in anthropology and archeology

Page 22: Thank you!

Questions?