OAK RIDGE NATIONAL LABORATORY
U. S. DEPARTMENT OF ENERGY
RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering
Lionel F. Lovett, II
Advisors: George Ostrouchov and Houssain Kettani
Computer Science and Mathematics Division
Oak Ridge National Laboratory
Summer 2005
Why Dimension Reduction?
Data Mining
Large Databases
Number of Items
Number of Attributes (high-dimensionality) with items
Visualization
Requires low dimensional views (2 or 3)
Structure discovery
Patterns
Clusters
Fast similarity searching
Images, video, documents, character recognition, face recognition, DNA sequences
Data Reduction
RobustMap Uses Distances and Mimics PCA Like FastMap
Faloutsos and Lin (1995) (FastMap)
Choose two very distant points as principal axis
Project onto orthogonal hyperplane
Repeat
Each axis O(n), given distances
Distances updated using cosine law as needed
Result is a mapping as well as the transformation to map new items
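The FastMap steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `fastmap`, its arguments, and the "jump to the farthest point twice" pivot heuristic are assumptions made for the example. For brevity the sketch updates the whole squared-distance matrix after each axis, which costs O(n^2); the O(n) per-axis cost quoted on the slide relies on updating distances only as needed.

```python
import numpy as np

def fastmap(dist, k, seed=0):
    """Map n objects to k coordinates using only an n x n distance matrix."""
    rng = np.random.default_rng(seed)
    d2 = np.asarray(dist, dtype=float) ** 2   # work with squared distances
    n = d2.shape[0]
    coords = np.zeros((n, k))
    pivots = []
    for axis in range(k):
        # Assumed pivot heuristic: start at a random object,
        # then jump to the farthest object twice.
        a = int(rng.integers(n))
        b = int(np.argmax(d2[a]))
        a = int(np.argmax(d2[b]))
        if d2[a, b] <= 0.0:
            break                             # no spread left to project
        # Cosine law: coordinate of every object along the axis a-b.
        x = (d2[a] + d2[a, b] - d2[b]) / (2.0 * np.sqrt(d2[a, b]))
        coords[:, axis] = x
        pivots.append((a, b))
        # Squared distances inside the hyperplane orthogonal to a-b
        # (clipped at zero to absorb rounding error).
        d2 = np.maximum(d2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return coords, pivots

# Usage on five hypothetical 2-d points: two axes recover their distances.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [5.0, 5.0], [1.0, 2.0]])
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
coords, pivots = fastmap(dist, k=2)
```

The returned pivot pairs are what would be needed to map a new item: its distances to each pivot pair give its coordinates through the same cosine-law formula.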
Projection to Pivot Axis and to Orthogonal Hyperplane
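Given pivot axis ab, FastMap computes coordinates along the axis and projections onto the orthogonal hyperplane.
[Figure: points y and z with their projections y' and z' onto the hyperplane orthogonal to pivot axis ab, and the projected distance d_{y',z'}; triangle a, b, y showing the coordinate c_y along the axis and the distances d_{a,y}, d_{b,y}, d_{a,b}.]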
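In the notation of this slide, the two computations are the standard cosine-law relations from Faloutsos and Lin (1995). The first gives the coordinate of y along the pivot axis; the second gives the distance between the projections of y and z onto the orthogonal hyperplane, where c_z is the coordinate of z computed the same way.

```latex
c_y = \frac{d_{a,y}^{2} + d_{a,b}^{2} - d_{b,y}^{2}}{2\, d_{a,b}},
\qquad
d_{y',z'}^{2} = d_{y,z}^{2} - (c_y - c_z)^{2}
```

The first relation follows from the cosine law in triangle a, b, y, since c_y is the length of the projection of ay onto ab. The second is the Pythagorean theorem applied to the segment yz and its component along the axis.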
Problems with FastMap?
OUTLIERS!
Outliers are points that lie far, on average, from the other members of their cluster.
When selecting pivot points based on distance, FastMap considers all the points of a dataset.
Because outliers are included, FastMap is not robust.
FastMap Pivot Pair: Choosing Outliers
Axis does not represent majority of data
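The effect can be reproduced with a small Python sketch. The data are hypothetical (one tight cluster plus two far-away points), and `farthest_pair` is an illustrative stand-in for a FastMap-style pivot heuristic, not code from the presentation.

```python
import numpy as np

# Hypothetical data: a tight cluster of 50 points plus two distant outliers.
rng = np.random.default_rng(1)
cluster = rng.normal(0.0, 1.0, size=(50, 2))
outliers = np.array([[40.0, 0.0], [0.0, -35.0]])
points = np.vstack([cluster, outliers])   # outliers are rows 50 and 51

def farthest_pair(points, start=0):
    """From `start`, jump to the farthest point twice and return the pair."""
    b = int(np.argmax(np.linalg.norm(points - points[start], axis=1)))
    a = int(np.argmax(np.linalg.norm(points - points[b], axis=1)))
    return a, b

pivot_pair = farthest_pair(points)
print(pivot_pair)   # both indices are the outlier rows, in some order
```

The resulting axis joins the two outliers, so the 50 cluster points that make up the bulk of the data have no influence on its direction.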
RobustMap: Clustering and Excluding Outliers
Compute n distances from random object
Take point of largest distance
Repeat
Clustering
Estimate distance distribution from two extreme points
Find probability of extreme points
Exclude most extreme cluster of low probability points
Finish projection using remaining points
Diagnostic histogram and cluster plots
Ratio Function
Uses only distances from pivots (2nd and 3rd)
Computes ratios: data fraction / probability of data
Looks for splits according to ratio threshold
Discards smaller portion.
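One possible reading of the ratio function is sketched below in Python. The slides do not give the details, so several choices here are assumptions: the distance distribution is modeled as a normal with robust location and scale (median and MAD), the only candidate split is the largest gap in the sorted distances, and the names `ratio_split` and `threshold` are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def ratio_split(dist, threshold=10.0):
    """Return a boolean mask of the points kept after a ratio-based split.

    dist: distances of all points from one pivot.
    """
    dist = np.asarray(dist, dtype=float)
    n = len(dist)
    keep = np.ones(n, dtype=bool)
    d = np.sort(dist)
    # Assumed robust estimate of the distance distribution: median and MAD.
    mu = np.median(d)
    sigma = 1.4826 * np.median(np.abs(d - mu))
    if n < 3 or sigma == 0.0:
        return keep
    # Assumed candidate split: the largest gap between sorted distances.
    i = int(np.argmax(np.diff(d)))
    cut = 0.5 * (d[i] + d[i + 1])
    # Ratio: fraction of data beyond the cut / probability of being there.
    data_fraction = (n - i - 1) / n
    probability = max(1.0 - normal_cdf(cut, mu, sigma), 1e-300)
    if data_fraction / probability > threshold:
        upper = dist > cut
        # Discard the smaller portion.
        keep = ~upper if upper.sum() <= n - upper.sum() else upper
    return keep

# Usage: 20 distances near 1.0 and three extreme distances near 10.
distances = np.concatenate([np.linspace(0.8, 1.2, 20), [10.0, 10.1, 10.2]])
kept = ratio_split(distances)
```

In this example the three extreme distances hold far more of the data than the fitted distribution allows, so the ratio exceeds the threshold and they are discarded before the projection is finished.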
Dataset Generator
Generates clustered data from a mixture of multivariate normal densities
There are five parameters
Number of dimensions
Number of clusters
Cluster variability
Cluster mixing proportions
Seed for random number generator
Other RobustMap parameters
Number of dimensions to extract
Quantile of trimmed max
Ratio threshold for outlying cluster extraction
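A generator with the five parameters listed above might look like the following Python sketch. It is an illustration under assumptions, not the generator used in this work: the function name, the spherical (isotropic) clusters, the uniform placement of cluster centers in [-10, 10], and the extra `n_items` argument are all choices made for the example.

```python
import numpy as np

def generate_clusters(n_items, n_dims, n_clusters, variability,
                      proportions, seed):
    """Draw n_items points from a mixture of spherical normal clusters."""
    rng = np.random.default_rng(seed)
    proportions = np.asarray(proportions, dtype=float)
    proportions = proportions / proportions.sum()   # mixing proportions
    # Assumed: cluster centers placed uniformly at random in a box.
    centers = rng.uniform(-10.0, 10.0, size=(n_clusters, n_dims))
    # Pick a cluster for each item, then add normal noise around its center.
    labels = rng.choice(n_clusters, size=n_items, p=proportions)
    noise = variability * rng.standard_normal((n_items, n_dims))
    return centers[labels] + noise, labels

# Usage: 200 points in 3 dimensions, 4 clusters, one of them small.
data, labels = generate_clusters(200, 3, 4, 0.5, [0.4, 0.3, 0.25, 0.05],
                                 seed=42)
```

Fixing the seed makes a run repeatable, which helps when comparing the effect of the RobustMap parameters on the same dataset.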
Results
RobustMap identifies and excludes outlying clusters
RobustMap performs dimension reduction
RobustMap exploits robust statistics
RobustMap exploits fast machine learning algorithms (runtime O(nk))
Decomposition of Climate Model Run Data with PCA (EOF)
135 year CCM3 run at T42 resolution
CO2 increase to 3x
Average Monthly Surface Temperature
1620 x 2500 matrix
(Putman, Drake, Ostrouchov, 2000)
[Figure: image vector decomposed as PC 1 + PC 2 + PC 3 + PC 4 + ... + PC 1620, with amplitude-in-time plots for PC 1 to PC 4.]
Concise 4-d summary of 135 year run
Winter warming more severe than summer warming
[Figure: the same decomposition with RobustMap components RM 1 + RM 2 + RM 3 + RM 4, with RobustMap amplitude-in-time plots; labeled "1000 x FASTER!".]
Concise 4-d summary of 135 year run
Winter warming more severe than summer warming
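For reference, the PCA (EOF) side of this comparison can be sketched with a singular value decomposition in Python. The layout is an assumption (rows as time steps, columns as grid cells), the function name `pca_summary` is hypothetical, and the random matrix below only stands in for the climate data, which are not reproduced here.

```python
import numpy as np

def pca_summary(matrix, k=4):
    """Return k leading patterns, their amplitudes in time, and the
    fraction of variance they explain."""
    matrix = np.asarray(matrix, dtype=float)
    centered = matrix - matrix.mean(axis=0)   # remove each column's time mean
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    amplitudes = u[:, :k] * s[:k]             # one time series per component
    patterns = vt[:k]                         # one spatial pattern per component
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return patterns, amplitudes, explained

# Usage on a small stand-in matrix (60 time steps x 25 grid cells).
rng = np.random.default_rng(0)
stand_in = rng.standard_normal((60, 25))
patterns, amplitudes, explained = pca_summary(stand_in, k=4)
```

Each column of `amplitudes` corresponds to one amplitude-in-time plot, and each row of `patterns` to one component image.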
Future Plans
Ratio: compute threshold from probability theory
Create loop for remaining clusters
Develop better probability theory for RobustMap
Add application context visualization
Applications
Searching
Images and multimedia databases
String databases (spelling, typing and OCR error correction)
Medical databases
Data Mining and Visualization
Medical databases (ECGs, X-rays, MRI brain scans)
Demographic Data
Time Series
Business, Commerce, and Financial Data
Climate, Astrophysics, Chemistry, and Biology Data
Questions?