Indexing and Data Mining in Multimedia Databases
Christos Faloutsos
CMU www.cs.cmu.edu/~christos
U. of Alberta, 2001 C. Faloutsos 2
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources
Problem
Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
• Allow fast, approximate queries, and
• Find rules/patterns
Sample queries
• Similarity search
– Find pairs of branches with similar sales patterns
– Find medical cases similar to Smith's
– Find pairs of sensor series that move in sync
Sample queries – cont’d
• Rule discovery
– Clusters (of patients; of customers; ...)
– Forecasting (total sales for next year?)
– Outliers (e.g., fraud detection)
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources
Indexing - Multimedia
Problem:
• given a set of (multimedia) objects,
• find the ones similar to a desirable query object (quickly!)
[Figure: three time sequences, $price vs. day (day 1 to 365)]
distance function: by expert
‘GEMINI’ - Pictorially
[Figure: each sequence S1 ... Sn (day 1 to 365) is mapped to a feature point F(S1) ... F(Sn), with features e.g., avg and std; the feature points are stored in off-the-shelf S.A.M.s (Spatial Access Methods)]
‘GEMINI’
• fast; ‘correct’ (= no false dismissals)
• used for
– images (e.g., QBIC) (2x, 10x faster)
– shapes (27x faster)
– video (e.g., InforMedia)
– time sequences ([Rafiei+Mendelzon], ++)
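The filter-and-refine idea behind GEMINI can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the expert's distance is plain Euclidean distance, uses (avg, std) as the two features, and replaces the spatial access method with a linear scan. All function names are hypothetical. For two length-n sequences, sqrt(n) times the distance between their (avg, std) points never exceeds their Euclidean distance, which is what rules out false dismissals.

```python
import math

def features(seq):
    """Map a sequence to the 2-d feature point (avg, std)."""
    n = len(seq)
    avg = sum(seq) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in seq) / n)  # population std
    return (avg, std)

def feature_distance(fa, fb, n):
    """Lower bound on the Euclidean distance of two length-n sequences."""
    return math.sqrt(n) * math.dist(fa, fb)

def range_query(database, query, eps):
    """Indices of sequences within eps of the query (no false dismissals)."""
    n = len(query)
    fq = features(query)
    # Filter step: cheap test in feature space. A real system would ask a
    # spatial access method (e.g., an R-tree) instead of scanning.
    candidates = [i for i, s in enumerate(database)
                  if feature_distance(features(s), fq, n) <= eps]
    # Refine step: discard false alarms using the actual distance.
    return [i for i in candidates if math.dist(database[i], query) <= eps]

# Toy data (made up for illustration).
database = [[1, 2, 3, 4], [1, 2, 3, 5], [10, 10, 10, 10], [4, 3, 2, 1]]
query = [1, 2, 3, 4]
hits = range_query(database, query, eps=1.5)
```

In the toy data the reversed sequence [4, 3, 2, 1] has the same (avg, std) as the query, so it passes the filter and is only rejected in the refine step: a false alarm, which costs time but not correctness.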
Remaining issues
• how to extract features automatically?
• how to merge similarity scores from different media?
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
– Visualization: FastMap
– Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions
FastMap
      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0
[Figure: the objects drawn as points, some ~1 apart and some ~100 apart; their coordinates are unknown (‘??’)]
FastMap
• Multi-dimensional scaling (MDS) can do that, but in O(N**2) time
• We want a linear algorithm: FastMap [SIGMOD95]
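A minimal sketch of one FastMap projection step, applied to the 5x5 distance matrix of the previous slide. It assumes the cosine-law projection and the ‘farthest object’ pivot heuristic described in the SIGMOD95 paper; the function names are hypothetical, and only the first coordinate is computed.

```python
def farthest_from(dist, i):
    """Index of the object farthest from object i."""
    return max(range(len(dist)), key=lambda j: dist[i][j])

def fastmap_axis(dist):
    """Project every object onto the line through two far-apart pivots."""
    # Pivot heuristic: start anywhere, jump to the farthest object, repeat once.
    b = farthest_from(dist, 0)
    a = farthest_from(dist, b)
    d_ab = dist[a][b]
    if d_ab == 0:
        return [0.0] * len(dist)  # all objects coincide
    # Cosine law: coordinate of object i along the line from pivot a to pivot b.
    return [(dist[a][i] ** 2 + d_ab ** 2 - dist[b][i] ** 2) / (2 * d_ab)
            for i in range(len(dist))]

D = [
    [0, 1, 1, 100, 100],
    [1, 0, 1, 100, 100],
    [1, 1, 0, 100, 100],
    [100, 100, 100, 0, 1],
    [100, 100, 100, 1, 0],
]
coords = fastmap_axis(D)
```

Each step touches every object a constant number of times, hence the linear cost. On this matrix the first three objects land near one end of the axis and the last two near the other. Further axes would repeat the step on the residual distances d'^2 = d^2 - (x_i - x_j)^2.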
Applications: time sequences
• given n co-evolving time sequences
• visualize them + find rules [ICDE00]
[Figure: exchange rate vs. time for HKD, JPY, DEM]
Applications - financial
• currency exchange rates [ICDE00]
[Figure: plot of the currency sequences; labeled points: USD(t), USD(t-5), HKD, JPY, FRF, DEM, GBP]
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
– Visualization: FastMap
– Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions
Merging similarity scores
• e.g., video: text, color, motion, audio
– weights change with the query!
• solution 1: user specifies weights
• solution 2: user gives examples
– and we ‘learn’ what he/she wants: relevance feedback (Rocchio, MARS, MindReader)
– but: how about disjunctive queries?
DEMO
server demo
‘FALCON’
[Figure: stock sequences labeled ‘Inverted Vs’ and ‘Vs’]
Trader wants only ‘unstable’ stocks
average: is flat!
“Single query point” methods
[Figure: three panels (Rocchio, MARS, MindReader) in (avg, std) feature space; ‘+’ marks the good examples, ‘x’ the single query point derived from them]
The averaging effect in action...
Main idea: FALCON Contours
[Figure: contours around the groups of ‘+’ (good) examples; axes: feature1 (e.g., avg), feature2 (e.g., std)]
[Wu+, vldb2000]
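A: Aggregate Dissimilarity
$$D_G(x) = \left( \frac{1}{k} \sum_{i=1}^{k} d(g_i, x)^{\alpha} \right)^{1/\alpha}$$
G = {g1, ..., gk}: the ‘good’ examples
α: parameter (~ -5 ~ ‘soft OR’)
[Figure: candidate point x and good examples g1, g2 among the ‘+’ points]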
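A minimal sketch of the aggregate dissimilarity: the distances from a candidate x to the good examples are combined by a power mean with exponent alpha. It assumes Euclidean distance for d and alpha = -5 as on the slide; the function name and the toy points are hypothetical.

```python
import math

def aggregate_dissimilarity(x, good, alpha=-5.0):
    """Power mean (exponent alpha) of the distances from x to the good examples."""
    if alpha == 0:
        raise ValueError("alpha must be nonzero")
    dists = [math.dist(x, g) for g in good]
    # With a negative exponent, a zero distance dominates: x is a good example.
    if alpha < 0 and any(d == 0 for d in dists):
        return 0.0
    mean_pow = sum(d ** alpha for d in dists) / len(dists)
    return mean_pow ** (1.0 / alpha)

good = [(0.0, 0.0), (10.0, 10.0)]               # two separate 'good' groups
near = aggregate_dissimilarity((0.5, 0.0), good)  # close to the first group
mid = aggregate_dissimilarity((5.0, 5.0), good)   # halfway between the groups
```

With a negative alpha the smallest distance dominates, so a point close to either group scores low (‘soft OR’), while the midpoint between the groups scores high. With alpha = 1 the measure is the plain average distance, and the midpoint scores at least as well as points near one group, which is the averaging effect of single-query-point methods.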
FALCON
• converges quickly (~5 iterations)
• good precision/recall
• is fast (can use off-the-shelf ‘spatial/metric access methods’)
Conclusions for indexing + visualization
• GEMINI: fast indexing, exploiting off-the-shelf SAMs
• FastMap: automatic feature extraction in O(N) time
• FALCON: relevance feedback for disjunctive queries
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources
Data mining & fractals – Road map
• Motivation – problems / case study
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
Problem #1 - spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol) - ‘spiral’ and ‘elliptical’ galaxies
(stores & households; healthy & ill subjects)
- patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability??
Problem #2: dim. reduction
• given attributes x1, ... xn
– possibly, non-linearly correlated
• drop the useless ones
(Q: why?
A: to avoid the ‘dimensionality curse’)
[Figure: scatter plot with axes ‘engine size’ and ‘mpg’]
Answer:
• Fractals / self-similarities / power laws
What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
...zero area;
infinite length!
Definitions (cont’d)
• Paradox: Infinite perimeter ; Zero area!
• ‘dimensionality’: between 1 and 2
• actually: Log(3)/Log(2) = 1.58… (long story)
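The ‘long story’ in short, assuming the standard similarity-dimension argument (not spelled out on the slide): a set that consists of N copies of itself, each scaled down by a factor s, has dimension D satisfying N = (1/s)^D. The Sierpinski triangle consists of N = 3 copies, each scaled by s = 1/2:

```latex
N = \left(\frac{1}{s}\right)^{D}
\quad\Longrightarrow\quad
D = \frac{\log N}{\log (1/s)} = \frac{\log 3}{\log 2} \approx 1.585
```

As a sanity check, a line segment (N = 2, s = 1/2) gives D = 1 and a filled square (N = 4, s = 1/2) gives D = 2.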
Intrinsic (‘fractal’) dimension
• Q: fractal dimension of a line?
x y
5 1
4 2
3 3
2 4
E.g.: #cylinders; miles / gallon
Intrinsic (‘fractal’) dimension
• Q: fractal dimension of a line?
• A: nn ( <= r ) ~ r^1 (nn = number of neighbors within distance r)
• Q: fd of a plane?
• A: nn ( <= r ) ~ r^2
fd == slope of ( log(nn) vs log(r) )
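A minimal sketch of estimating the fractal dimension as the slope of log(#pairs within r) vs. log(r). It assumes Euclidean distance, a brute-force O(N^2) pair count, and hand-picked radii; the function names and the synthetic data sets are hypothetical.

```python
import math

def pair_count(points, r):
    """Number of distinct pairs of points at distance <= r."""
    n = len(points)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if math.dist(points[i], points[j]) <= r)

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def fractal_dimension(points, radii):
    """Slope of log(#pairs within r) vs log(r) over the given radii."""
    counts = [pair_count(points, r) for r in radii]
    usable = [(r, c) for r, c in zip(radii, counts) if c > 0]  # log(0) is undefined
    return fit_slope([math.log(r) for r, _ in usable],
                     [math.log(c) for _, c in usable])

# Synthetic data: 200 points on a line segment, 400 points on a 20 x 20 grid.
line = [(i / 199, 0.0) for i in range(200)]
plane = [(i / 19, j / 19) for i in range(20) for j in range(20)]
fd_line = fractal_dimension(line, [2.5 / 199, 7.5 / 199, 22.5 / 199])
fd_plane = fractal_dimension(plane, [1.6 / 19, 3.2 / 19, 6.4 / 19])
```

The line should come out close to 1 and the grid somewhat below 2. The shortfall is a boundary effect: points near the edge of the square have fewer neighbors than interior points.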
Sierpinski triangle
[Figure: log(#pairs within <= r) vs. log(r), i.e., the ‘correlation integral’; slope = 1.58]
Observations
self-similarity ->
• <=> fractals
• <=> scale-free
• <=> power-laws (y=x^a, F=C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope = 1.58]
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
• Conclusions
Solution #1: spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])
• clusters?
• separable?
• attraction/repulsion?
• data ‘scrubbing’ – duplicates?
Solution #1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r), one curve each for spi-spi, spi-ell and ell-ell pairs]
- 1.8 slope
- plateau!
- repulsion!
[w/ Seeger, Traina, Traina, SIGMOD00]
spatial d.m.
[Figure: plot annotated with radii r1 and r2]
Heuristic on choosing # of clusters
Solution #1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r), one curve each for spi-spi, spi-ell and ell-ell pairs]
- 1.8 slope
- plateau!
- repulsion!!
- duplicates
Problem #2: Dim. reduction
[Figure: three 2-d point sets on axes x and y, each ranging from 0 to 1: (a) Quarter-circle, (b) Line, (c) Spike]
Solution:
• drop the attributes that don’t increase the ‘partial f.d.’ PFD
• dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
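A minimal sketch of attribute selection by partial fractal dimension. It is a simplified forward pass written for illustration, not the algorithm of the cited SBBD00 paper: attributes are visited in order and kept only if they raise the PFD of the kept set by more than a tolerance. The function names, the tolerance, the choice of radii (fixed fractions of the cloud's diameter) and the toy data are all assumptions.

```python
import math

def fractal_dim(points, fractions=(0.0125, 0.0375, 0.1125)):
    """Slope of log(#pairs within r) vs log(r); radii are fractions of the diameter."""
    n = len(points)
    dists = [math.dist(points[i], points[j])
             for i in range(n) for j in range(i + 1, n)]
    diameter = max(dists)
    if diameter == 0:
        return 0.0  # all points coincide: a single point has dimension 0
    xs, ys = [], []
    for f in fractions:
        count = sum(1 for d in dists if d <= f * diameter)
        if count > 0:  # skip radii with no pairs, since log(0) is undefined
            xs.append(math.log(f * diameter))
            ys.append(math.log(count))
    if len(xs) < 2:
        return 0.0  # crude fallback: too few usable radii to fit a slope
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def project(rows, attrs):
    """Keep only the listed attribute positions of every row."""
    return [tuple(row[a] for a in attrs) for row in rows]

def select_attributes(rows, tol=0.1):
    """Forward pass: keep an attribute only if it raises the partial f.d."""
    kept, current = [], 0.0
    for a in range(len(rows[0])):
        pfd = fractal_dim(project(rows, kept + [a]))
        if pfd > current + tol:
            kept.append(a)
            current = pfd
    return kept

# Toy data: attribute 1 is a linear function of attribute 0, attribute 2 is constant.
rows = [(t, 2 * t, 0.5) for t in (i / 199 for i in range(200))]
kept = select_attributes(rows)
```

On the toy data only the first attribute survives: the second lies on a line with the first and the third is constant, so neither raises the PFD. A correlated attribute need not be linearly related to be dropped, since only the dimension of the projected cloud matters.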
Problem #2: dim. reduction
[Figure: the three point sets (a) Quarter-circle, (b) Line, (c) Spike, annotated with the partial f.d. of their attributes; labels shown: PFD~1, PFD~1, global FD=1, PFD=1, PFD=0, PFD=1]
Notice: ‘max variance’ would fail here
Notice: SVD would fail here
Currency dataset
USD HKD JPY …
Day1 1.62
Day2 1.58
self-similar?
[Figure: two plots of S(r) vs. log(radii): Currency dataset, slope = 1.9807 (fd = 1.98); Eigenfaces dataset, slope = 4.2506 (fd = 4.25)]
FDR on the ‘currency’ dataset
[Figure: curve over #attributes considered (1 to 6), y-axis range 0.8 to 2; attributes labeled American Dollar, German Mark, British Pound, French Franc, Japanese Yen; reference line: ‘if unif + indep.’]
• HKD: “useless”
• > 1.98 axes are needed
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
• Conclusions
App. : traffic
• disk traces: self-similar (also: web traffic; comm. errors; etc)
[Figure: #bytes vs. time]
More apps: Brain scans
• Oct-trees; brain-scans
[Figure: log(#octants) vs. octree levels; slope 2.63 = fd]
More fractals:
• stock prices (LYCOS) - random walks: 1.5
[Figure: stock price over time; time-axis labels ‘1 year’, ‘2 years’]
More fractals:
• coast-lines: 1.1-1.2 (up to 1.58)
Examples: MG county
• Montgomery County of MD (road end-points)
Examples: LB county
• Long Beach county of CA (road end-points)
More power laws: Zipf’s law
• Bible - rank vs frequency (log-log)
[Figure: log(freq) vs. log(rank); labeled words: “the”, “a”]
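A minimal sketch of checking a rank-frequency power law: count the words, sort the frequencies, and fit a least-squares line in log-log space. The function name and the synthetic corpus (built so that frequency is roughly 1000/rank) are assumptions made for illustration. No real text is analyzed here.

```python
import math
from collections import Counter

def rank_frequency_slope(words):
    """Least-squares slope of log(frequency) vs log(rank)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Synthetic corpus: the word of rank k occurs about 1000 / k times.
words = [f"w{rank}" for rank in range(1, 51) for _ in range(1000 // rank)]
slope = rank_frequency_slope(words)
```

A slope near -1 is the classic Zipf behavior. On real text the fit is usually restricted to a range of ranks, since the head and the tail of the curve tend to bend.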
More power laws
• Freq. distr. of first names; last names (Mandelbrot)
Internet
• Internet routers: how many neighbors within h hops?
[Figure; label: U of Alberta]
Internet topology
• Internet routers: how many neighbors within h hops? [SIGCOMM 99]
Reachability function: number of neighbors within r hops, vs r (log-log).
[Figure: log(#pairs) vs. log(hops), Mbone routers, 1995; slope 2.8]
More power laws: areas – Korcak’s law
Scandinavian lakes
([icde99], w/ Proietti)
Scandinavian lakes area vs complementary cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]
Olympic medals:
[Figure: log(# medals) vs. log rank, with linear fit y = -0.9676x + 2.3054, R^2 = 0.9458]
More power laws
• Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Figure: log(count) vs. magnitude; amplitude vs. day]
Even more power laws:
• Income distribution (Pareto’s law);
• sales distributions;
• duration of UNIX jobs
• Distribution of UNIX file sizes
• publication counts (Lotka’s law)
Even more power laws:
• web hit frequencies ([Huberman])
• hyper-link distribution [Barabasi], ++
Overall Conclusions:
‘Find similar/interesting things’ in multimedia databases
• Indexing: feature extraction (‘GEMINI’)
– automatic feature extraction: FastMap
– Relevance feedback: FALCON
Conclusions - cont’d
• New tools for Data Mining: Fractals/power laws:
– appear everywhere
– lead to skewed distributions (as opposed to Gaussian, Poisson, uniformity, independence)
– ‘correlation integral’ for separability/cluster detection
– PFD for dimensionality reduction
Conclusions - cont’d
– can model bursty time sequences (buffering/prefetching)
– selectivity estimation (‘how many neighbors within x km?’)
– dim. curse diagnosis (it’s the fractal dim. that matters! [ICDE2000])
Resources:
• Software and papers:
– http://www.cs.cmu.edu/~christos
– Fractal dimension (FracDim)
– Separability (sigmod 2000)
– Relevance feedback for query by content (FALCON – vldb 2000)