o ak r idge n ational l aboratory u. s. d epartment of e nergy robustmap: a fast and robust...

13
OAK RIDGE NATIONAL LABORATORY U. S. DEPARTMENT OF ENERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F. Lovett, II Advisors: George Ostrouchov and Houssain Kettani Computer Science and Mathematics Division Oak Ridge National Laboratory Summer 2005

Upload: blanche-allison

Post on 28-Dec-2015

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering

Lionel F. Lovett, II

Advisors: George Ostrouchov and Houssain Kettani

Computer Science and Mathematics DivisionOak Ridge National Laboratory

Summer 2005

Page 2: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

2

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

Why Dimension Reduction? Data Mining

Large Databases Number of Items Number of Attributes (high-dimensionality) with

items

Visualization Requires low dimensional views (2 or 3) Structure discovery

Patterns Clusters

Fast similarity searching Images, video, documents, character recognition,

face recognition, DNA sequences Data Reduction

RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering

Page 3: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

3

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

RobustMap Uses Distances and Mimics PCA Like FastMap

RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering

Faloutsos and Lin (1995) (FastMap)

Choose two very distant points as principal axis

Project onto orthogonal hyperplane

Repeat

Each axis O(n), given distances

Distances updated using cosine law as needed

Result is a mapping as well as the transformation to map new items

Page 4: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

4

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

Projection to Pivot Axis and to Orthogonal Hyperplane

Given pivot axis ab FastMap computes coordinates along axis and projections onto the orthogonal hyperplane. b

a

y

z

y’ z’dy’,z’

RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering

a

y

cy da,y

db,y

da,b

b

Page 5: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

5

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

Problems with FastMap?

OUTLIERS!

Outliers are points that are not closest on average to the other members of their cluster.

When Selecting points based on distance,FastMap considers all the points of a dataset.

By including outliers, FastMap isn’t robust.

RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering

Page 6: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

6

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

FastMap Pivot Pair: Choosing Outliers

RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering

Axis does n

ot

represent majority

of

data

Page 7: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

7

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

RobustMap: Clustering and Excluding Outliers

RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering

Compute n distances from random object

Take point of largest distance

Repeat

Clustering

Estimate distance distribution from two extreme points

Find probability of extreme points

Exclude most extreme cluster of low probability points

Finish projection using remaining points

Diagnostic histogram and cluster plots

Ratio Function Uses only distances from

pivots (2nd and 3rd)

Computes ratios: data fraction / probability of data

Looks for splits according to ratio threshold

Discards smaller portion.

Page 8: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

8

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

Dataset Generator Generates clustered data from a mixture of

multivariate normal densities

There are five parameters Number of dimensions Number of clusters Cluster variability Cluster mixing proportions Seed for random number generator

Other RobustMap parameters Number of dimensions to extract Quantile of trimmed max Ratio threshold for outlying cluster extraction

RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering

Page 9: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

9

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

Results

RobustMap identifies and excludes outlying clusters

RobustMap performs dimension reduction

RobustMap exploits robust statistics

RobustMap exploits fast machine learning algorithms (runtime O(nk))

RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering

Page 10: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

10

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

. . .PC 1 PC 2 PC 3 PC 4+ + PC 1620+

135 year CCM3 run at T42 resolution

CO2 increase to 3xAverage Monthly Surface

Temperature1620 x 2500 matrix

(Putman, Drake, Ostrouchov, 2000)

Decomposition of Climate Model Run Datawith PCA (EOF)

Image vector

PC 1

PC 2

PC 3

PC 4

Concise 4-d summary of 135 year run

Winter warming more severe than summer warming

Amplitude-in-time plots

RM 1 RM 2 RM 3 RM 4+ +1000 x FASTER !

+

RM 1

RM 2

RM 3

RM 4

Concise 4-d summary of 135 year run

Winter warming more severe than summer warming

RobustMapAmplitude-in-time plots

Page 11: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

11

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

Future Plans

Ratio Compute threshold from probability theory

Create loop for remaining clusters

Develop better probability theory for RobustMap

Add application context visualization

RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering

Page 12: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

12

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

Applications

Searching Images and multimedia databases String databases (spelling, typing and OCR error

correction) Medical databases

Data Mining and Visualization Medical databases (ECGs, X-rays, MRI brain scans)

Demographic Data

Time Series Business, Commerce, and Financial Data Climate, Astrophysics, Chemistry, and Biology Data

RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering

Page 13: O AK R IDGE N ATIONAL L ABORATORY U. S. D EPARTMENT OF E NERGY RobustMap: A Fast and Robust Algorithm for Dimension Reduction and Clustering Lionel F

13

OAK RIDGE NATIONAL LABORATORYU. S. DEPARTMENT OF ENERGY

Questions?