Indexing and Data Mining in Multimedia Databases

Christos Faloutsos

CMU www.cs.cmu.edu/~christos

U. of Alberta, 2001 C. Faloutsos 2

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search

• New tools for Data Mining: Fractals

• Conclusions

• Resources

Problem

Given a large collection of (multimedia) records, find similar/interesting things, i.e.:

• Allow fast, approximate queries, and

• Find rules/patterns

Sample queries

• Similarity search

– Find pairs of branches with similar sales patterns

– Find medical cases similar to Smith's

– Find pairs of sensor series that move in sync

Sample queries – cont’d

• Rule discovery

– Clusters (of patients; of customers; ...)

– Forecasting (total sales for next year?)

– Outliers (e.g., fraud detection)

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search

• New tools for Data Mining: Fractals

• Conclusions

• Resources

Indexing - Multimedia

Problem:

• given a set of (multimedia) objects,

• find the ones similar to a desired query object (quickly!)

[Figure: three plots of $price vs. day (1 to 365)]

distance function: by expert

‘GEMINI’ - Pictorially

[Figure: sequences S1 ... Sn (day 1 to 365) mapped to feature points F(S1) ... F(Sn); feature axes: e.g., avg and e.g., std]

off-the-shelf S.A.M.s (Spatial Access Methods)

‘GEMINI’

• fast; ‘correct’ (=no false dismissals)

• used for

– images (e.g., QBIC) (2x, 10x faster)

– shapes (27x faster)

– video (e.g., InforMedia)

– time sequences ([Rafiei+Mendelzon], ++)
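The filter-and-refine idea behind ‘GEMINI’ can be sketched as below. This is a minimal sketch under stated assumptions, not the original implementation: it assumes Euclidean distance between equal-length sequences, it uses the first k coefficients of an orthonormal DFT as features (an assumption, chosen because that feature distance never exceeds the true distance, which is what rules out false dismissals), and a plain scan stands in for the spatial access method. The names `features` and `range_query` are hypothetical.

```python
import numpy as np

def features(seq, k=3):
    """Map a sequence to 2k numbers: the first k orthonormal DFT coefficients."""
    coeffs = np.fft.fft(np.asarray(seq, dtype=float), norm="ortho")[:k]
    return np.concatenate([coeffs.real, coeffs.imag])

def range_query(database, query, eps, k=3):
    """Return indices of sequences within Euclidean distance eps of the query."""
    query = np.asarray(query, dtype=float)
    q_feat = features(query, k)
    hits = []
    for idx, seq in enumerate(database):
        seq = np.asarray(seq, dtype=float)
        # Filter step: the feature distance lower-bounds the true distance,
        # so anything farther than eps in feature space cannot qualify.
        # (A real system would answer this step with a spatial index.)
        if np.linalg.norm(features(seq, k) - q_feat) > eps:
            continue
        # Refine step: discard false alarms with the exact distance.
        if np.linalg.norm(seq - query) <= eps:
            hits.append(idx)
    return hits
```

The filter step may let false alarms through, but the refine step removes them, so the answer set matches a full scan with exact distances.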

Remaining issues

• how to extract features automatically?

• how to merge similarity scores from different media?

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search

– Visualization: FastMap

– Relevance feedback: FALCON

• Data Mining / Fractals

• Conclusions

FastMap

      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0

[Figure: the five objects drawn as points, with annotated distances ~1 and ~100; ‘??’]

FastMap

• Multi-dimensional scaling (MDS) can do that, but in O(N**2) time

• We want a linear algorithm: FastMap [SIGMOD95]
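A minimal sketch of the FastMap projection step follows, as an illustration rather than the published code. It assumes the distances are given as a full N x N matrix, which is convenient for a toy example but costs O(N**2) memory; the algorithm itself only needs the distances from each object to the two pivots chosen per axis. The function name `fastmap` and the simple two-hop pivot heuristic are assumptions of this sketch.

```python
import numpy as np

def fastmap(dist, k=2):
    """Embed N objects in k dimensions from an (N, N) matrix of distances."""
    d2 = np.asarray(dist, dtype=float) ** 2  # squared distances, shrunk per axis
    n = d2.shape[0]
    coords = np.zeros((n, k))
    for axis in range(k):
        # Pivot heuristic: from an arbitrary object, hop to the farthest one,
        # then to the object farthest from that.
        b = int(np.argmax(d2[0]))
        a = int(np.argmax(d2[b]))
        dab2 = d2[a, b]
        if dab2 <= 0:  # every remaining distance is zero
            break
        # Cosine-law projection of every object on the line through a and b.
        x = (d2[a] + dab2 - d2[b]) / (2.0 * np.sqrt(dab2))
        coords[:, axis] = x
        # Squared distances on the hyperplane perpendicular to that line
        # (clamped at zero in case the input is not exactly Euclidean).
        d2 = np.maximum(d2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return coords
```

On a distance matrix with two tight groups that are far apart, the first coordinate alone should already separate the groups.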

Applications: time sequences

• given n co-evolving time sequences

• visualize them + find rules [ICDE00]

[Figure: rate vs. time for HKD, JPY, DEM]

Applications - financial

• currency exchange rates [ICDE00]

[Figure: plot with labels USD(t), USD(t-5), FRF, GBP, JPY, HKD]

Applications - financial

• currency exchange rates [ICDE00]

[Figure: plot with labels USD, HKD, JPY, FRF, DEM, GBP, USD(t), USD(t-5)]

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search

– Visualization: FastMap

– Relevance feedback: FALCON

• Data Mining / Fractals

• Conclusions

Merging similarity scores

• e.g., video: text, color, motion, audio

– weights change with the query!

• solution 1: user specifies weights

• solution 2: user gives examples

– and we ‘learn’ what he/she wants: rel. feedback (Rocchio, MARS, MindReader)

– but: how about disjunctive queries?

DEMO

server demo

‘FALCON’

[Figure: two groups of shapes, labelled ‘Inverted Vs’ and ‘Vs’]

Trader wants only ‘unstable’ stocks

‘FALCON’

[Figure: two groups of shapes, labelled ‘Inverted Vs’ and ‘Vs’]

average: is flat!

“Single query point” methods

[Figure: Rocchio: ‘+’ points and a single ‘x’ in the (avg, std) feature plane]

“Single query point” methods

[Figure: three panels (Rocchio, MARS, MindReader), each with ‘+’ points and one ‘x’]

The averaging effect in action...

Main idea: FALCON Contours

[Figure: ‘+’ points with contours; axes: feature1 (e.g., avg) and feature2 (e.g., std)]

[Wu+, vldb2000]

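A: Aggregate Dissimilarity

$$D_G(x) = \left( \frac{1}{k} \sum_{i=1}^{k} d(x, g_i)^{\alpha} \right)^{1/\alpha}$$

$\alpha$: parameter (~ -5 ~ ‘soft OR’)

[Figure: ‘+’ points including g1 and g2, and a candidate point x]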
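The ‘soft OR’ behaviour of the aggregate dissimilarity can be tried out with a few lines of code. This sketch assumes the aggregate is the power mean of the distances from a candidate x to the ‘good’ examples g_i, with exponent alpha = -5 as on the slide, and Euclidean distance as d; the function name `aggregate_dissimilarity` is hypothetical.

```python
import numpy as np

def aggregate_dissimilarity(x, good_points, alpha=-5.0):
    """Power mean (exponent alpha) of the distances from x to each good point."""
    g = np.asarray(good_points, dtype=float)
    d = np.linalg.norm(g - np.asarray(x, dtype=float), axis=1)
    if alpha < 0 and np.any(d == 0):
        # x coincides with a good example: a negative exponent would divide
        # by zero, and the limit of the power mean is 0.
        return 0.0
    return float(np.mean(d ** alpha) ** (1.0 / alpha))
```

With a negative alpha the smallest distance dominates, so a point close to any one good example scores low, while the midpoint between two distant examples (where a single averaged query point would sit) scores high. That is what lets the contours wrap around disjoint groups.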

FALCON

• converges quickly (~5 iterations)

• good precision/recall

• is fast (can use off-the-shelf ‘spatial/metric access methods’)

Conclusions for indexing + visualization

• GEMINI: fast indexing, exploiting off-the-shelf SAMs

• FastMap: automatic feature extraction in O(N) time

• FALCON: relevance feedback for disjunctive queries

Outline

Goal: ‘Find similar / interesting things’

• Problem - Applications

• Indexing - similarity search

• New tools for Data Mining: Fractals

• Conclusions

• Resources

Data mining & fractals – Road map

• Motivation – problems / case study

• Definition of fractals and power laws

• Solutions to posed problems

• More examples

Problem #1 - spatial d.m.

Galaxies (Sloan Digital Sky Survey w/ B. Nichol) - ‘spiral’ and ‘elliptical’ galaxies

(stores & households; healthy & ill subjects)

- patterns? (not Gaussian; not uniform)

- attraction/repulsion?

- separability??

Problem#2: dim. reduction

• given attributes x1, ... xn

– possibly, non-linearly correlated

• drop the useless ones

(Q: why?

A: to avoid the ‘dimensionality curse’)

[Figure: scatter plot with axes engine size and mpg]

Answer:

• Fractals / self-similarities / power laws

What is a fractal?

= self-similar point set, e.g., Sierpinski triangle:

...zero area;

infinite length!

Definitions (cont’d)

• Paradox: Infinite perimeter ; Zero area!

• ‘dimensionality’: between 1 and 2

• actually: Log(3)/Log(2) = 1.58… (long story)

Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

E.g.: #cylinders; miles / gallon

 x   y
 5   1
 4   2
 3   3
 2   4

Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

• A: nn ( <= r ) ~ r^1

Intrinsic (‘fractal’) dimension

• Q: fractal dimension of a line?

• A: nn ( <= r ) ~ r^1

• Q: fd of a plane?

• A: nn ( <= r ) ~ r^2

fd == slope of ( log(nn) vs log(r) )

Sierpinski triangle

[Figure: log(#pairs within <= r) vs. log(r), == ‘correlation integral’; slope 1.58]
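The slope of the correlation integral can be estimated numerically. The sketch below is an illustration under stated assumptions: it generates Sierpinski-triangle points with the ‘chaos game’ (repeatedly jumping halfway to a random corner), counts the pairs within each radius by brute force, and fits a line in log-log space. The function names, the number of points, and the radius range are all assumptions of this sketch; on such a sample the fitted slope should land in the neighbourhood of log(3)/log(2), though the exact value depends on the sample and the radii.

```python
import numpy as np

def sierpinski(n, seed=0):
    """Sample n points of the Sierpinski triangle with the chaos game."""
    rng = np.random.default_rng(seed)
    corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    p = np.array([0.0, 0.0])  # start on a corner, which belongs to the set
    out = np.empty((n, 2))
    for t in range(n):
        p = (p + corners[rng.integers(3)]) / 2.0  # jump halfway to a corner
        out[t] = p
    return out

def correlation_dimension(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r), by brute-force pair counts."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(pts), k=1)  # each unordered pair once
    sorted_d = np.sort(dist[i, j])
    # Every radius must capture at least one pair, or log(0) breaks the fit.
    counts = np.searchsorted(sorted_d, radii, side="right")
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

fd = correlation_dimension(sierpinski(1500), np.logspace(-2, -0.7, 10))
```

The brute-force pair count is quadratic in the number of points, which is acceptable for a demonstration but not for large datasets.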

Observations

self-similarity ->

• <=> fractals

• <=> scale-free

• <=> power-laws (y=x^a, F=C*r^(-2))

[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]

Road map

• Motivation – problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples

• Conclusions

Solution#1: spatial d.m.

Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])

• clusters?

• separable?

• attraction/repulsion?

• data ‘scrubbing’ – duplicates?

Solution#1: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r), with curves for spi-spi, spi-ell and ell-ell pairs]

- 1.8 slope

- plateau!

- repulsion!

Solution#1: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r), with curves for spi-spi, spi-ell and ell-ell pairs]

- 1.8 slope

- plateau!

- repulsion!

[w/ Seeger, Traina, Traina, SIGMOD00]
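The pair-count curves behind such a plot can be computed as sketched below. This is a brute-force illustration, not the method of the cited paper: it counts pairs within each radius for one point set (e.g., spi-spi) and across two point sets (e.g., spi-ell). The function names are hypothetical, and the quadratic cost only suits small samples.

```python
import numpy as np

def self_pair_counts(a, radii):
    """Count unordered pairs within one set at distance <= r, for each r."""
    a = np.asarray(a, dtype=float)
    diff = a[:, None, :] - a[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(a), k=1)  # each unordered pair once
    return np.searchsorted(np.sort(d[i, j]), radii, side="right")

def cross_pair_counts(a, b, radii):
    """Count pairs (one point from a, one from b) at distance <= r, for each r."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]
    d = np.sort(np.sqrt((diff ** 2).sum(axis=-1)).ravel())
    return np.searchsorted(d, radii, side="right")
```

If the cross-pair curve stays at zero for small radii while the self-pair curves do not, the two sets keep their distance from each other; plotting the logarithms of these counts against log(r) gives the three curves.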

spatial d.m.

[Figure: radii r1 and r2 marked]

Heuristic on choosing # of clusters

Solution#1: spatial d.m.

[Figure: log(#pairs within <= r) vs. log(r), with curves for spi-spi, spi-ell and ell-ell pairs]

- 1.8 slope

- plateau!

- repulsion!!

- duplicates

Problem #2: Dim. reduction

[Figure: three 2-d datasets, axes x and y from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike]

Solution:

• drop the attributes that don’t increase the ‘partial f.d.’ PFD

• dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
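The ‘drop what does not raise the PFD’ rule can be illustrated with a simplified forward pass. This sketch is not the algorithm of the cited paper: it estimates each fractal dimension as a brute-force correlation-integral slope, scans the attributes in their given order, and keeps one only if it raises the partial fractal dimension by more than a threshold. The names `fractal_dimension` and `select_attributes`, the threshold `min_gain`, and the radius range are assumptions of this sketch.

```python
import numpy as np

def fractal_dimension(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r) for a cloud of points."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(pts), k=1)  # each unordered pair once
    # Every radius must capture at least one pair, or log(0) breaks the fit.
    counts = np.searchsorted(np.sort(dist[i, j]), radii, side="right")
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

def select_attributes(data, radii, min_gain=0.3):
    """Keep the columns that raise the partial fractal dimension (PFD)."""
    data = np.asarray(data, dtype=float)
    kept, current_fd = [], 0.0
    for col in range(data.shape[1]):
        # PFD of the cloud projected on the kept columns plus this candidate.
        trial_fd = fractal_dimension(data[:, kept + [col]], radii)
        if trial_fd - current_fd > min_gain:
            kept.append(col)
            current_fd = trial_fd
    return kept

# Toy data: column 1 is a non-linear function of column 0; column 2 is independent.
rng = np.random.default_rng(0)
a, c = rng.random(1000), rng.random(1000)
kept = select_attributes(np.column_stack([a, a ** 2, c]), np.logspace(-2, -1, 8))
```

On this toy data the squared column adds no new degree of freedom, so it should be dropped even though its correlation with the first column is not linear.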

Problem #2: dim. reduction

[Figure: three 2-d datasets, axes x and y from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike; annotations: PFD~1, PFD~1, global FD=1, PFD=1, PFD=0, PFD=1]

Problem #2: dim. reduction

[Figure: three 2-d datasets, axes x and y from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike; annotations: PFD~1, PFD=1, global FD=1, PFD=1, PFD=0, PFD=1]

Notice: ‘max variance’ would fail here

Problem #2: dim. reduction

[Figure: three 2-d datasets, axes x and y from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike; annotations: PFD~1, PFD~1, global FD=1, PFD=1, PFD=0, PFD=1]

Notice: SVD would fail here

Currency dataset

       USD   HKD   JPY   …
Day1   1.62
Day2   1.58

self-similar?

[Figure: two log-log plots of S(r) vs. log(radii). Currency dataset: slope = 1.9807 (fd = 1.98). Eigenfaces dataset: slope = 4.2506 (fd = 4.25)]

FDR on the ‘currency’ dataset

[Figure: plot with vertical axis from 0.8 to 2 and horizontal axis ‘#Attributes considered’ (1 to 6); points labelled American Dollar, German Mark, British Pound, French Franc, Japanese Yen; reference line ‘if unif + indep.’]

FDR on the ‘currency’ dataset

[Figure: plot with vertical axis from 0.8 to 2 and horizontal axis ‘#Attributes considered’ (1 to 6); points labelled American Dollar, German Mark, British Pound, French Franc, Japanese Yen; reference line ‘if unif + indep.’]

• HKD: “useless”

• > 1.98 axes are needed

Road map

• Motivation – problems / case studies

• Definition of fractals and power laws

• Solutions to posed problems

• More examples

• Conclusions

App. : traffic

• disk traces: self-similar (also: web traffic; comm. errors; etc)

[Figure: #bytes vs. time]

More apps: Brain scans

• Oct-trees; brain-scans

[Figure: Log(#octants) vs. octree levels; 2.63 = fd]

More fractals:

• stock prices (LYCOS) - random walks: 1.5

[Figure: price plots labelled ‘1 year’ and ‘2 years’]

More fractals:

• coast-lines: 1.1-1.2 (up to 1.58)

Examples: MG county

• Montgomery County of MD (road end-points)

Examples: LB county

• Long Beach county of CA (road end-points)

More power laws: Zipf’s law

• Bible - rank vs frequency (log-log)

[Figure: log(freq) vs. log(rank); labelled points: “the”, “a”]
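A rank-frequency plot like this one reduces to a straight-line fit in log-log space. The sketch below is illustrative only: it counts word frequencies, ranks them from most to least frequent, and fits log10(freq) against log10(rank) by least squares; a slope near -1 is what Zipf's law predicts. The function name `rank_frequency_slope` is hypothetical, and any tokenized text can be passed in.

```python
from collections import Counter

import numpy as np

def rank_frequency_slope(words):
    """Fit log10(frequency) = slope * log10(rank) + intercept over all words."""
    freqs = np.array(sorted(Counter(words).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)  # rank 1 = most frequent word
    slope, intercept = np.polyfit(np.log10(ranks), np.log10(freqs), 1)
    return slope, intercept
```

The same fit applies to any of the rank-based power laws in this part of the talk; a least-squares line on log-log axes is a quick diagnostic rather than a rigorous test of a power law.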

More power laws

• Freq. distr. of first names; last names (Mandelbrot)

Internet

• Internet routers: how many neighbors within h hops?

[Figure label: U of Alberta]

Internet topology

• Internet routers: how many neighbors within h hops? [SIGCOMM 99]

Reachability function: number of neighbors within r hops, vs r (log-log).

[Figure: Mbone routers, 1995: log(#pairs) vs. log(hops); slope 2.8]

More power laws: areas – Korcak’s law

Scandinavian lakes

([icde99], w/ Proietti)

More power laws: areas – Korcak’s law

Scandinavian lakes area vs complementary cumulative count (log-log axes)

[Figure: log(count( >= area)) vs. log(area)]

Olympic medals:

[Figure: log(# medals) vs. log rank, with linear fit y = -0.9676x + 2.3054, R^2 = 0.9458]

More power laws

• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]

[Figure: two plots: log(count) vs. magnitude; amplitude vs. day]

Even more power laws:

• Income distribution (Pareto’s law);

• sales distributions;

• duration of UNIX jobs

• distribution of UNIX file sizes

• publication counts (Lotka’s law)

Even more power laws:

• web hit frequencies ([Huberman])

• hyper-link distribution [Barabasi], ++

Overall Conclusions:

‘Find similar/interesting things’ in multimedia databases

• Indexing: feature extraction (‘GEMINI’)

– automatic feature extraction: FastMap

– Relevance feedback: FALCON

Conclusions - cont’d

• New tools for Data Mining: Fractals/power laws:

– appear everywhere

– lead to skewed distributions (not Gaussian, Poisson, uniform, or independent)

– ‘correlation integral’ for separability/cluster detection

– PFD for dimensionality reduction

Conclusions - cont’d

– can model bursty time sequences (buffering/prefetching)

– selectivity estimation (‘how many neighbors within x km?’)

– dim. curse diagnosis (it’s the fractal dim. that matters! [ICDE2000])

Resources:

• Software and papers:

– http://www.cs.cmu.edu/~christos

– Fractal dimension (FracDim)

– Separability (sigmod 2000)

– Relevance feedback for query by content (FALCON – vldb 2000)
