OAK RIDGE NATIONAL LABORATORY U.S. DEPARTMENT OF ENERGY Multi-agent based High- Dimensional Cluster Analysis SciDAC SDM-ISIC Kickoff Meeting July 10-11, 2001 Nagiza Samatova & George Ostrouchov Computer Science and Mathematics Division Oak Ridge National Laboratory



Science-driven Bottlenecks
1) Data management and data mining algorithms: not scalable to petabytes of scientific data
2) Retrieving data subsets from storage systems: too slow, especially for tertiary storage
3) Transferring large datasets between sites is inefficient
4) Navigating between heterogeneous, distributed data sources: very user intensive
5) I/O techniques: too low access rate

Approaches:
To improve the transfer of large datasets
Major Focus:
• To implement effective high-bandwidth transfers (Randy Burris)
• To minimize the amount of data transferred


Minimizing the amount of scientific simulation data transfer – State of the Art

Data compression utilities (zip, compress, etc.):
• large overheads
• modest compression rates

Post-processing data analysis tools (like PCMDI):
• scientists must wait for the simulation to complete
• can use lots of CPU cycles on long-running simulations
• can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations

Simulation monitoring tools:
• interference with simulations
• lack of flexibility


Improvements through multi-level data minimization mechanisms

I. Simulation level
Data stream (not simulation) monitoring tools for:
• “Any-time” feedback to decide whether to terminate a simulation, restart with new parameters, or continue
• Filtering runs to decide whether to transfer to a central archive, keep locally, or delete

II. Comparative analysis level
Application-specific search engines for:
• Simulation data comparison, esp. against archived databases
• Distributed simulation data query, search, and retrieval

III. In-depth analysis level
Application-specific inference engines for:
• Inferring rules relating fragments in two or more simulation outputs
• New scientific discoveries


How we will address these needs

Our Approach:
Develop ASPECT (Adaptable Simulation Product Exploration via Clustering Toolkit) that includes:
• Dynamic first-look multivariate time series miner (Level I)
• Distributed time-series query, search, and retrieval engine (Level II)
• Time-series-based rules inference engine (Level III)

Our Strategy:
• Leverage existing work
• Expand our prior work
• Integrate with other SDM tasks
• Work closely with application scientists
• Develop ASPECT in an iterative fashion


Our work will be leveraging

Distributed Scientific Data Mining Research (Probe/MICS) [SOA+01a, SOA+01b]

Analysis of Large Scientific Datasets (LDRD/ORNL) [DFL+96, DFL+00, DFL+00]

Statistical Downscaling for Climate (LDRD/ORNL) [PDO00]


Distributed Scientific Data Mining Research (funded under Probe/MICS)

• Motivation
• Big picture
• SDM-ETC related effort
• Relevance to our task: Levels II and III
• Limitations w.r.t. our task: Enabling Technology research, not application-specific


Motivation for Scientific Data Mining Research under Probe

Existing data mining tools have limited applicability to the emerging scientific data sets that are:

• High-dimensional: Usual assumptions about homogeneity or ergodicity cannot be made. Need segmented dimension reduction methods.
• Massive (terabytes to petabytes): Existing methods do not scale in terms of time, storage, number of dimensions. Need scalable data analysis algorithms.
• Distributed (e.g., across computational grids, multiple files, disks, tapes): Existing methods work on a single, centralized dataset. Data transfer is prohibitive (high bandwidth, security/privacy concerns). Need distributed data analysis algorithms.
• Dynamic: Existing methods work with static datasets. Any changes require complete re-computation. Need dynamic (updating & downdating) techniques.


Our Approach – Distributed agents and peer-to-peer negotiation

Strategy
• Perform data mining in a distributed and recursive fashion with reasonable data transfer overheads

Key idea
• Generate local components using distributed agents
• Merge these components into a global system via peer-to-peer agents’ collaboration and negotiation

Requirements for Resulting System
• Qualitative comparability
• Computational complexity reduction
• Scalability
• Communication acceptability
• Flexibility (in the choice of a local algorithm)
• Visual representation sufficiency


Background: Hierarchical Clustering

[Figure: Dendrogram over items A, B, C, D, E]

Distance Matrix

      A     B     C     D     E
A   0.00  0.25  0.00  0.60  0.75
B   0.25  0.00  0.20  0.40  0.80
C   0.00  0.20  0.00  0.60  0.70
D   0.60  0.40  0.60  0.00  0.50
E   0.75  0.80  0.70  0.50  0.00

[Figure: Spanning Tree with Dissimilarity Measures over nodes A, B, C, D, E]
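As a rough illustration of how a dendrogram can be derived from a distance matrix, the sketch below runs single-linkage agglomerative clustering on the matrix values as printed on this slide. The single-linkage rule and the function name `single_linkage` are assumptions made for illustration; the slide does not say which linkage was used to draw its dendrogram.

```python
# Distance matrix as printed on the slide (labels A..E).
LABELS = ["A", "B", "C", "D", "E"]
DIST = [
    [0.00, 0.25, 0.00, 0.60, 0.75],
    [0.25, 0.00, 0.20, 0.40, 0.80],
    [0.00, 0.20, 0.00, 0.60, 0.70],
    [0.60, 0.40, 0.60, 0.00, 0.50],
    [0.75, 0.80, 0.70, 0.50, 0.00],
]


def single_linkage(labels, dist):
    """Return the merge history [(left_members, right_members, height), ...]."""
    clusters = [frozenset([i]) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage (assumed): distance between closest members.
                h = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((
            sorted(labels[i] for i in clusters[a]),
            sorted(labels[i] for i in clusters[b]),
            h,
        ))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


MERGES = single_linkage(LABELS, DIST)
for left, right, height in MERGES:
    print(left, "+", right, "at height", height)
```

Each entry of the merge history corresponds to one internal node of the dendrogram, with the height giving the dissimilarity at which the two sub-clusters join.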


SDM-ETC Tie-in: Distributed Hierarchical Clustering

Problem Description:
• Given: A data set with N d-dimensional data items distributed across multiple data sites
• Task: Determine a hierarchical decomposition of this dataset
• Application of Clustering: Database Management, Multi-dimensional indexing, Data Mining and…


RACHET: Distributed Clustering Algorithm

[Figure: Control flow of RACHET. Boxes shown: Generate local dendrograms (yielding distributed local dendrograms), Transmit local dendrograms (yielding centralized dendrograms), Merge local dendrograms (yielding the global dendrogram), and Reconstruct geometry for visualization (optional). A decision box "Improve Comparable Quality?" leads to "Increase k".]


Centroid Descriptive Statistics: summarized cluster representation

$c = (f_1, f_2, \ldots, f_d)$ is a cluster centroid of $N_c$ points $\{p_1, p_2, \ldots, p_{N_c}\} \subset R^d$

$DS(c) = (N_c, NORMSQ_c, R_c, SUM_c, MIN_c, MAX_c)$

• $N_c$ – number of data points in the cluster
• $NORMSQ_c = \sum_{j=1}^{d} f_j^2$ – square norm of centroid
• $R_c = \sqrt{\frac{1}{N_c} \sum_{i=1}^{N_c} (p_i - c)^2}$ – radius of the cluster
• $SUM_c = \sum_{j=1}^{d} f_j$ – sum of centroid components
• $MIN_c = N_c \min_{1 \le j \le d} f_j$ – minimum centroid component
• $MAX_c = N_c \max_{1 \le j \le d} f_j$ – maximum centroid component

Features:
• $O(N_c)$ vs. $O(d N_c)$ space cost
• Sufficient for efficiently calculating all measurements involved in making clustering decisions
• Sufficient for visualization

Question: How many statistical parameters are sufficient to make clustering decisions (merging or splitting clusters)?
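A minimal sketch of how such a summary could be computed from the raw points of one cluster. The function name `descriptive_stats` and the dictionary keys are hypothetical. The minimum and maximum components are stored unscaled here, which is an assumption: the slide's MIN and MAX entries may carry an extra factor of N_c.

```python
import math


def descriptive_stats(points):
    """Summarize a cluster of d-dimensional points by centroid statistics."""
    n = len(points)
    d = len(points[0])
    centroid = [sum(p[j] for p in points) / n for j in range(d)]
    # Mean squared distance of the points from the centroid.
    mean_sq_dist = sum(
        sum((p[j] - centroid[j]) ** 2 for j in range(d)) for p in points
    ) / n
    return {
        "n": n,                                   # number of points
        "normsq": sum(f * f for f in centroid),   # square norm of centroid
        "radius": math.sqrt(mean_sq_dist),        # cluster radius
        "sum": sum(centroid),                     # sum of centroid components
        "min": min(centroid),  # unscaled (assumption, see lead-in)
        "max": max(centroid),  # unscaled (assumption, see lead-in)
    }


CLUSTER = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
STATS = descriptive_stats(CLUSTER)
print(STATS)
```

The summary has a fixed number of entries regardless of how many points the cluster holds or how many dimensions they have, which is what makes it cheap to store and transmit.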


Updating Descriptive Statistics

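Let $DS(C_1)$ and $DS(C_2)$ be descriptive statistics of two clusters. Then the following statements hold for $DS(C_{new})$ of cluster $C_{new}$ formed by merging $C_1$ and $C_2$:

Merging Theorem:
• $N_{C_{new}} = N_1 + N_2$
• $NORMSQ_{C_{new}}$ is a function of $(N_1, N_2, NORMSQ_1, NORMSQ_2, d^2(C_1, C_2))$
• $R_{C_{new}}$ is a function of $(N_1, N_2, R_1, R_2, d^2(C_1, C_2))$
• $SUM_{C_{new}}$, $MIN_{C_{new}}$, $MAX_{C_{new}}$ are expressed through $(SUM_1, SUM_2)$, $(MIN_1, MIN_2)$, $(MAX_1, MAX_2)$ respectively

[Figure: clusters S1 and S2 with centroids C1 and C2, origin O, and the merged centroid $C_{new}$ with radius $R_{C_{new}}$]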
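The sketch below shows one way the count, square norm, and radius of a merged cluster can be obtained from the two summaries and the squared centroid distance alone, without revisiting the points. The closed forms in `merge_stats` are derived here from the definition of a weighted-mean centroid; they are an assumption about what the theorem's functions look like, not a transcription of the slide. The helper names are hypothetical.

```python
import math


def summarize(points):
    """Brute-force summary (count, centroid, square norm, radius) of a cluster."""
    n = len(points)
    d = len(points[0])
    c = [sum(p[j] for p in points) / n for j in range(d)]
    r2 = sum(sum((p[j] - c[j]) ** 2 for j in range(d)) for p in points) / n
    return {"n": n, "centroid": c, "normsq": sum(f * f for f in c),
            "radius": math.sqrt(r2)}


def merge_stats(s1, s2, d2):
    """Merge two summaries given d2, the squared distance between centroids."""
    n1, n2 = s1["n"], s2["n"]
    n = n1 + n2
    # Derived for this sketch: the merged centroid is the weighted mean.
    normsq = (n1 * s1["normsq"] + n2 * s2["normsq"]) / n - n1 * n2 * d2 / n ** 2
    r2 = (n1 * s1["radius"] ** 2 + n2 * s2["radius"] ** 2) / n + n1 * n2 * d2 / n ** 2
    return {"n": n, "normsq": normsq, "radius": math.sqrt(r2)}


C1_POINTS = [(0.0, 0.0), (2.0, 0.0)]
C2_POINTS = [(4.0, 3.0), (6.0, 5.0), (5.0, 4.0)]
S1, S2 = summarize(C1_POINTS), summarize(C2_POINTS)
D2 = sum((a - b) ** 2 for a, b in zip(S1["centroid"], S2["centroid"]))

MERGED = merge_stats(S1, S2, D2)
DIRECT = summarize(C1_POINTS + C2_POINTS)  # reference: recompute from all points
print(MERGED, DIRECT["normsq"], DIRECT["radius"])
```

Comparing `MERGED` with `DIRECT` checks that the update from summaries agrees with a full recomputation on this small example.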


Euclidean Distance Approximation

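Approximation (transmission cost $O(N)$):

$d^2_{approx}(c_1, c_2) = (d^2_{lower} + d^2_{upper}) / 2$

Squared Euclidean Distance (transmission cost $O(dN)$):

$d^2(c_1, c_2) = \sum_{j=1}^{d} (f_{1j} - f_{2j})^2 = \sum_{j=1}^{d} (f_{1j}^2 + f_{2j}^2 - 2 f_{1j} f_{2j})$

Lower and Upper Bounds:

$d^2_{upper}(c_1, c_2) = NORMSQ_1 + NORMSQ_2 - 2 \max\{ SUM_1 MIN_2 / N_2, SUM_2 MIN_1 / N_1 \}$

$d^2_{lower}(c_1, c_2) = NORMSQ_1 + NORMSQ_2 - 2 \min\{ SUM_1 MAX_2 / N_2, SUM_2 MAX_1 / N_1 \}$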
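To make the bounding idea concrete, the sketch below computes lower and upper bounds on the squared distance between two centroids using only each centroid's square norm, component sum, and extreme components. It assumes all centroid components are non-negative (the bounds on the cross term rely on this) and works with unscaled minimum and maximum components. Function and key names are hypothetical.

```python
def centroid_summary(c):
    """Per-centroid scalars; no individual components are kept."""
    return {"normsq": sum(f * f for f in c), "sum": sum(c),
            "min": min(c), "max": max(c)}


def distance_bounds(s1, s2):
    """Return (lower, upper, approx) for the squared Euclidean distance.

    Assumes non-negative components, so the dot product c1.c2 lies between
    sum(c1)*min(c2) and sum(c1)*max(c2), and symmetrically with roles swapped.
    """
    base = s1["normsq"] + s2["normsq"]
    upper = base - 2 * max(s1["sum"] * s2["min"], s2["sum"] * s1["min"])
    lower = base - 2 * min(s1["sum"] * s2["max"], s2["sum"] * s1["max"])
    # The lower bound can be negative; a caller could clamp it at zero.
    return lower, upper, (lower + upper) / 2


C1 = (1.0, 2.0, 3.0)
C2 = (3.0, 1.0, 2.0)
EXACT = sum((a - b) ** 2 for a, b in zip(C1, C2))
LOWER, UPPER, APPROX = distance_bounds(centroid_summary(C1), centroid_summary(C2))
print(LOWER, EXACT, UPPER, APPROX)
```

Only a handful of scalars per centroid are exchanged, in contrast to the d components needed for the exact distance.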


Performance Analysis: linear in time, space and transmission

RACHET, for $|S| \ll N$ and $k \ll N$:
• $Time_{total} = O(|S|^2) + O(|S| N)$
• $Space_{total} = O(|S|^2) + O(kN)$
• $Transmission_{total} = O(kN)$

Overall: $O(N)$


Analysis of Large Scientific Datasets

• Focus: Univariate time series data
• Applications: ARM, EEG
• Relevance to our task: Level III
• Limitations w.r.t. our task:
  No support of dynamic & distributed time series
  No support of multivariate time series


Local Models For Global Analysis and Comparison of Data Series

Strategy
• Segment series
• Model the usual to find the unusual

Key ideas
• Fit simple local models to segments
• Use parameters for global analysis and monitoring

Resulting system
• Detects specific events (targeted or unusual)
• Provides a global description of one or several data series
• Provides data reduction to parameters of local model


From Local Models to Annotated Time Series

1) Segment series (100 obs)
2) Fit simple local model (c0, c1, c2, ||e||, ||e||2)
3) Select extreme (10%)
4) Cluster extreme (4)
5) Map back to series
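The pipeline above can be sketched roughly as follows. Several choices are assumptions made for illustration only: the local model is taken to be a least-squares quadratic c0 + c1*t + c2*t^2 on time rescaled to [0, 1], "extreme" is taken to mean the largest residual norm, and the clustering step is omitted. All function names are hypothetical.

```python
import math


def fit_quadratic(y):
    """Least-squares fit y ~ c0 + c1*t + c2*t^2; returns (c0, c1, c2, resid_norm)."""
    n = len(y)
    t = [i / (n - 1) for i in range(n)]  # rescale time for conditioning
    # Normal equations as a 3x4 augmented matrix.
    m = [[sum(ti ** (r + k) for ti in t) for k in range(3)]
         + [sum(yi * ti ** r for yi, ti in zip(y, t))] for r in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    c = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        c[r] = (m[r][3] - sum(m[r][k] * c[k] for k in range(r + 1, 3))) / m[r][r]
    resid = math.sqrt(sum((yi - (c[0] + c[1] * ti + c[2] * ti * ti)) ** 2
                          for yi, ti in zip(y, t)))
    return c[0], c[1], c[2], resid


def annotate(series, seg_len=100, extreme_frac=0.10):
    """Return (start, end, params) for the segments with the largest residuals."""
    starts = range(0, len(series) - seg_len + 1, seg_len)
    params = [fit_quadratic(series[s:s + seg_len]) for s in starts]
    n_extreme = max(1, round(extreme_frac * len(params)))
    ranked = sorted(range(len(params)), key=lambda k: params[k][3], reverse=True)
    return [(k * seg_len, (k + 1) * seg_len, params[k])
            for k in sorted(ranked[:n_extreme])]


# Synthetic series: a smooth oscillation with a short disturbance near t = 730.
SERIES = [math.sin(2 * math.pi * t / 400) for t in range(2000)]
for t in range(730, 740):
    SERIES[t] += 5.0 if t % 2 == 0 else -5.0

EVENTS = annotate(SERIES)
for start, end, _ in EVENTS:
    print("unusual segment:", start, end)
```

Each segment is reduced to four numbers, and only the index ranges of the flagged segments are mapped back onto the original series.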


Statistical Downscaling for Climate

• Focus: Image time series
• Application: Climate
• Relevance to our task: Levels I and II
• Limitation w.r.t. our task: works as a post-processing tool


Climate Downscaling Contains Several Post-Processing Tools


Trend and Periodic Components Provide a Concise Description of Model Run

Filter periodic and trend components

Compute EOFs

Monitor model run
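As a small illustration of the "Compute EOFs" step, the sketch below extracts the leading empirical orthogonal function of a space-time field by power iteration on the spatial covariance of the anomalies. It assumes the field is given as a list of time steps, each a list of grid-point values, and it does not perform the periodic or trend filtering that precedes this step on the slide. The function name `leading_eof` is hypothetical.

```python
import math


def leading_eof(field, iters=200):
    """Return (eof, pcs): leading spatial pattern and its time coefficients."""
    n_t = len(field)
    n_x = len(field[0])
    # Anomalies: remove the time mean at every grid point.
    mean = [sum(row[j] for row in field) / n_t for j in range(n_x)]
    anom = [[row[j] - mean[j] for j in range(n_x)] for row in field]
    # Spatial covariance matrix of the anomalies.
    cov = [[sum(a[i] * a[j] for a in anom) / (n_t - 1) for j in range(n_x)]
           for i in range(n_x)]
    # Power iteration for the dominant eigenvector.
    v = [1.0] * n_x
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n_x)) for i in range(n_x)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Principal component: projection of each time step onto the EOF.
    pcs = [sum(a[j] * v[j] for j in range(n_x)) for a in anom]
    return v, pcs


# Synthetic field: one fixed spatial pattern with a time-varying amplitude.
PATTERN = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]
OFFSET = [10.0, 20.0, 30.0]
FIELD = [[OFFSET[j] + math.sin(0.3 * t) * PATTERN[j] for j in range(3)]
         for t in range(50)]

EOF, PCS = leading_eof(FIELD)
print(EOF)
```

The principal-component series `PCS` is the kind of low-dimensional quantity that could be tracked while a model run is in progress.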


Summary of where efforts are needed

Research:
• Multivariate time series datasets
• Dynamic versions of time series processing & analysis tools
• Application-specific distributed & dynamic clustering
• Application-specific rules inference algorithms

Implementations: ASPECT’s framework
• Simulation data monitoring engine:
  with pluggable user-driven data analysis modules
  with “any-time”, “real-time” operation, not post-processing
  with no or very little interference with simulation
• Simulation data query, search, & retrieval engine
• Simulation data rules inference engine

A lot of integration work…


Integration with other SDM-ETC tasks

[Figure: SDM-ETC tasks arranged by task area and level]

Task areas:
1) Storage and retrieval of very large datasets
2) Access optimization of distributed data
3) Data mining and discovery of access patterns
4) Distributed, heterogeneous data access
5) Agent technology

Levels:
a) Storage Level
b) File Level
c) Dataset Level
d) Dataset Federation Level

Tasks shown:
• Parallel I/O: improving parallel access from clusters (ANL, NWU)
• MPI I/O: implementation based on file-level hints (ANL, NWU)
• Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech)
• Knowledge-based federation of heterogeneous databases (SDSC)
• Low level API for grid I/O (ANL)
• Optimization of low-level data storage, retrieval and transport (ORNL)
• [Grid Enabling Technology]
• Analysis of application-level query patterns (LLNL, NWU)
• Optimizing shared access to tertiary storage (LBNL, ORNL)
• High-dimensional indexing techniques (LBNL)
• Enabling communication among tools and data (ORNL, NCSU)
• Multi-agent high-dimensional cluster analysis (ORNL)
• Adaptive file caching in a distributed system (LBNL)
• Dimension reduction and sampling (LLNL, LBNL)