Oak Ridge National Laboratory, U.S. Department of Energy: Multi-agent based High-Dimensional...
OAK RIDGE NATIONAL LABORATORY
U.S. DEPARTMENT OF ENERGY
Multi-agent based High-Dimensional Cluster Analysis
SciDAC SDM-ISIC Kickoff Meeting July 10-11, 2001
Nagiza Samatova & George Ostrouchov
Computer Science and Mathematics Division
Oak Ridge National Laboratory
SDM meeting, July 10-11, 2001
Science-driven Bottlenecks
1) Data management and data mining algorithms: not scalable to petabytes of scientific data
2) Retrieving data subsets from storage systems: too slow, especially for tertiary storage
3) Transferring large datasets between sites is inefficient
4) Navigating between heterogeneous, distributed data sources is very user intensive
5) I/O techniques: access rate too low
Major Focus: approaches to improve the transfer of large datasets
• To implement effective high-bandwidth transfers (Randy Burris)
• To minimize the amount of data transferred
Minimizing the amount of scientific simulation data transfer: State of the Art
Data compression utilities (zip, compress, etc.):
• large overheads
• modest compression rates
Post-processing data analysis tools (like PCMDI):
• scientists must wait for the simulation to complete
• can use lots of CPU cycles on long-running simulations
• can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations
Simulation monitoring tools:
• interference with simulations
• lack of flexibility
Improvements through multi-level data minimization mechanisms
I. Simulation level: data-stream (not simulation) monitoring tools for:
• "Any-time" feedback to decide whether to terminate a simulation, restart with new parameters, or continue
• Filtering runs to decide whether to transfer to a central archive, keep locally, or delete
II. Comparative analysis level: application-specific search engines for:
• Simulation data comparison, esp. against archived databases
• Distributed simulation data query, search, and retrieval
III. In-depth analysis level: application-specific inference engines for:
• Inferring rules relating fragments in two or more simulation outputs
• New scientific discoveries
How we will address these needs
Our Approach: Develop ASPECT (Adaptable Simulation Product Exploration via Clustering Toolkit) that includes:
• Dynamic first-look multivariate time series miner (Level I)
• Distributed time-series query, search, and retrieval engine (Level II)
• Time-series-based rules inference engine (Level III)
Our Strategy:
• Leverage existing work
• Expand our prior work
• Integrate with other SDM tasks
• Work closely with application scientists
• Develop ASPECT in an iterative fashion
Our work will be leveraging
Distributed Scientific Data Mining Research (Probe/MICS) [SOA+01a, SOA+01b]
Analysis of Large Scientific Datasets (LDRD/ORNL) [DFL+96, DFL+00, DFL+00]
Statistical Downscaling for Climate (LDRD/ORNL) [PDO00]
Distributed Scientific Data Mining Research (funded under Probe/MICS)
• Motivation
• Big picture
• SDM-ETC related effort
• Relevance to our task: Levels II and III
• Limitations w.r.t. our task: enabling-technology research, not application-specific
Motivation for Scientific Data Mining Research under Probe
Existing data mining tools have limited applicability to the emerging scientific data sets that are:
• High-dimensional: the usual assumptions about homogeneity or ergodicity cannot be made. Need segmented dimension reduction methods.
• Massive (terabytes to petabytes): existing methods do not scale in terms of time, storage, or number of dimensions. Need scalable data analysis algorithms.
• Distributed (e.g., across computational grids, multiple files, disks, tapes): existing methods work on a single, centralized dataset. Data transfer is prohibitive (high bandwidth, security/privacy concerns). Need distributed data analysis algorithms.
• Dynamic: existing methods work with static datasets. Any change requires complete re-computation. Need dynamic (updating & downdating) techniques.
Our Approach: Distributed agents and peer-to-peer negotiation
Strategy: perform data mining in a distributed and recursive fashion with reasonable data transfer overheads
Key idea:
• Generate local components using distributed agents
• Merge these components into a global system via peer-to-peer agents' collaboration and negotiation
Requirements for Resulting System:
• Qualitative comparability
• Computational complexity reduction
• Scalability
• Communication acceptability
• Flexibility (in the choice of a local algorithm)
• Visual representation sufficiency
Background: Hierarchical Clustering

Distance Matrix

|   | A    | B    | C    | D    | E    |
|---|------|------|------|------|------|
| A | 0.00 | 0.25 | 0.00 | 0.60 | 0.75 |
| B | 0.25 | 0.00 | 0.20 | 0.40 | 0.80 |
| C | 0.00 | 0.20 | 0.00 | 0.60 | 0.70 |
| D | 0.60 | 0.40 | 0.60 | 0.00 | 0.50 |
| E | 0.75 | 0.80 | 0.70 | 0.50 | 0.00 |

[Figure: Dendrogram over A, B, C, D, E with levels marked 40%, 60%, 75%]
[Figure: Spanning Tree with Dissimilarity Measures over nodes A, B, C, D, E]
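As a rough illustration of how a dendrogram is derived from a distance matrix, the sketch below runs agglomerative clustering on the five-item matrix from this slide. The slides do not say which linkage rule the figure uses; single linkage (minimum pairwise distance) is an assumption made here, and the names `LABELS`, `DIST`, and `single_linkage` are hypothetical.

```python
from itertools import combinations

LABELS = ["A", "B", "C", "D", "E"]
# Distance matrix as printed on the slide (symmetric, zero diagonal).
DIST = [
    [0.00, 0.25, 0.00, 0.60, 0.75],
    [0.25, 0.00, 0.20, 0.40, 0.80],
    [0.00, 0.20, 0.00, 0.60, 0.70],
    [0.60, 0.40, 0.60, 0.00, 0.50],
    [0.75, 0.80, 0.70, 0.50, 0.00],
]


def single_linkage(labels, dist):
    """Return the merge sequence (left members, right members, height).

    Assumption: single linkage, i.e. the distance between two clusters is
    the smallest distance between any pair of their members.
    """
    clusters = [frozenset([i]) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            height = min(dist[i][j] for i in a for j in b)
            if best is None or height < best[0]:
                best = (height, a, b)
        height, a, b = best
        # Replace the two closest clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(labels[i] for i in a),
                       sorted(labels[i] for i in b),
                       height))
    return merges


if __name__ == "__main__":
    for left, right, height in single_linkage(LABELS, DIST):
        print(left, "+", right, "at dissimilarity", height)
```

Each tuple in the returned list corresponds to one internal node of the dendrogram; the height is the dissimilarity at which the two branches join.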
SDM-ETC Tie-in: Distributed Hierarchical Clustering
Problem Description:
• Given: a data set with N d-dimensional data items distributed across multiple data sites
• Task: determine a hierarchical decomposition of this dataset
• Applications of clustering: database management, multi-dimensional indexing, data mining, and…
RACHET: Distributed Clustering Algorithm
Control flow of RACHET:
1) Generate local dendrograms (distributed dendrograms, one local dendrogram per data site)
2) Transmit local dendrograms
3) Merge local dendrograms (centralized dendrograms) into a global dendrogram
4) Improve comparable quality? If so, increase k
5) Reconstruct geometry for visualization (optional)
Centroid Descriptive Statistics: summarized cluster representation

Features:
• $O(dN_c)$ vs. $O(N_c)$ space cost
• Sufficient for efficiently calculating all measurements involved in making clustering decisions
• Sufficient for visualization

$c = (f_1, f_2, \ldots, f_d)$ is the centroid of a cluster of $N_c$ points $\{p_1, p_2, \ldots, p_{N_c}\} \subset R^d$.

$DS(c) = (N_c, NORMSQ_c, R_c, SUM_c, MIN_c, MAX_c)$

• $N_c$: number of data points in the cluster
• $NORMSQ_c = \sum_{j=1}^{d} f_j^2$: square norm of centroid
• $R_c = \sqrt{\frac{1}{N_c} \sum_{i=1}^{N_c} (p_i - c)^2}$: radius of the cluster
• $SUM_c = \sum_{j=1}^{d} f_j$: sum of centroid components
• $MIN_c = N_c \min_{1 \le j \le d} f_j$: minimum centroid component
• $MAX_c = N_c \max_{1 \le j \le d} f_j$: maximum centroid component
Question: How many statistical parameters are sufficient to make clustering decisions (merging or splitting clusters)?
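To make the six-number summary concrete, here is a minimal sketch that computes DS(c) for a small cluster. It assumes the reading of the slide in which the radius is the root-mean-square distance of the points to the centroid and MIN and MAX are scaled by the point count; the `DS` tuple and the `describe` function are hypothetical names, not part of any published toolkit.

```python
import math
from typing import NamedTuple


class DS(NamedTuple):
    n: int          # N_c: number of points in the cluster
    normsq: float   # NORMSQ_c: squared norm of the centroid
    radius: float   # R_c: RMS distance of points to the centroid (assumed)
    total: float    # SUM_c: sum of centroid components
    min_: float     # MIN_c: N_c times the smallest centroid component (assumed)
    max_: float     # MAX_c: N_c times the largest centroid component (assumed)


def describe(points):
    """Summarize a non-empty list of equal-length points as DS(c)."""
    n = len(points)
    d = len(points[0])
    centroid = [sum(p[j] for p in points) / n for j in range(d)]
    scatter = sum(sum((p[j] - centroid[j]) ** 2 for j in range(d)) for p in points)
    return DS(
        n=n,
        normsq=sum(f * f for f in centroid),
        radius=math.sqrt(scatter / n),
        total=sum(centroid),
        min_=n * min(centroid),
        max_=n * max(centroid),
    )


if __name__ == "__main__":
    # Four corners of a square: centroid (1, 1), so NORMSQ = 2 and SUM = 2.
    print(describe([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]))
```

The summary has a fixed size regardless of the number of points or dimensions, which is what makes it cheap to transmit between sites.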
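Updating Descriptive Statistics

Merging Theorem: Let $DS(C_1)$ and $DS(C_2)$ be descriptive statistics of two clusters. Then the following statements hold for $DS(C_{new})$ of the cluster $C_{new}$ formed by merging $C_1$ and $C_2$:
• $N_{C_{new}} = N_1 + N_2$
• $NORMSQ_{C_{new}}$ is a function of $(N_1, N_2, NORMSQ_1, NORMSQ_2, d^2(C_1, C_2))$
• $MIN_{C_{new}}$ is determined by $MIN_1$ and $MIN_2$
• $MAX_{C_{new}}$ is determined by $MAX_1$ and $MAX_2$
• $R_{C_{new}}$ is a function of $(N_1, N_2, R_1, R_2, d^2(C_1, C_2))$
• $SUM_{C_{new}}$ is determined by $SUM_1$ and $SUM_2$

[Figure: clusters $C_1$ and $C_2$ and the merged cluster $C_{new}$ with radius $R_{C_{new}}$; labels $S_1$, $S_2$, $O$]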
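The merging theorem states that the count, centroid norm, and radius of a merged cluster depend only on the two summaries and the squared distance between the two centroids. The sketch below uses closed forms derived here from ordinary centroid algebra (they are not quoted from the slides) and checks them against a brute-force computation over the raw points. It assumes the radius is the root-mean-square distance to the centroid; all function names are hypothetical.

```python
import math


def centroid(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rms_radius(points):
    c = centroid(points)
    return math.sqrt(sum(sqdist(p, c) for p in points) / len(points))


def merge_stats(n1, normsq1, r1, n2, normsq2, r2, d2):
    """Merge two summaries using only (N, NORMSQ, R) and d^2(C1, C2).

    Derived from c_new = (N1*c1 + N2*c2) / (N1 + N2); an illustrative
    derivation, not necessarily the exact form used by the authors.
    """
    n = n1 + n2
    cross = n1 * n2 * d2 / (n * n)
    normsq = (n1 * normsq1 + n2 * normsq2) / n - cross
    radius = math.sqrt((n1 * r1 ** 2 + n2 * r2 ** 2) / n + cross)
    return n, normsq, radius


if __name__ == "__main__":
    c1_pts = [(0.0, 0.0), (2.0, 0.0)]
    c2_pts = [(4.0, 4.0), (6.0, 4.0), (5.0, 7.0)]
    c1, c2 = centroid(c1_pts), centroid(c2_pts)
    merged = merge_stats(len(c1_pts), sqdist(c1, (0, 0)), rms_radius(c1_pts),
                         len(c2_pts), sqdist(c2, (0, 0)), rms_radius(c2_pts),
                         sqdist(c1, c2))
    both = c1_pts + c2_pts
    print("from summaries:", merged)
    print("from raw points:", (len(both), sqdist(centroid(both), (0, 0)), rms_radius(both)))
```

Both lines print the same triple up to floating-point rounding, which is the point of the theorem: the raw points never need to leave their site.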
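Euclidean Distance Approximation

Squared Euclidean Distance (transmission cost $O(dN)$):
$d^2(c_1, c_2) = \sum_{j=1}^{d} (f_{1j} - f_{2j})^2$
$d^2(c_1, c_2) = \sum_{j=1}^{d} (f_{1j}^2 + f_{2j}^2 - 2 f_{1j} f_{2j})$

Lower and Upper Bounds:
$d^2_{lower}(c_1, c_2) = NORMSQ_1 + NORMSQ_2 - 2 \min\{ MAX_1 SUM_2 / N_1,\ MAX_2 SUM_1 / N_2 \}$
$d^2_{upper}(c_1, c_2) = NORMSQ_1 + NORMSQ_2 - 2 \max\{ MIN_1 SUM_2 / N_1,\ MIN_2 SUM_1 / N_2 \}$

Approximation (transmission cost $O(N)$):
$d^2_{approx}(c_1, c_2) = \frac{d^2_{lower} + d^2_{upper}}{2}$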
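The bounds on this slide replace the inner product of two centroids by quantities available in the descriptive statistics. The sketch below is one reading of those bounds, under two assumptions that the slides do not state explicitly: all centroid components are non-negative, and MIN and MAX are the smallest and largest centroid components scaled by the point count. The `Stats` tuple and both functions are hypothetical names.

```python
from typing import NamedTuple


class Stats(NamedTuple):
    n: int          # N_c
    normsq: float   # NORMSQ_c
    total: float    # SUM_c
    min_: float     # MIN_c = N_c * min_j f_j (assumed reading)
    max_: float     # MAX_c = N_c * max_j f_j (assumed reading)


def stats_from_centroid(n, f):
    return Stats(n, sum(x * x for x in f), sum(f), n * min(f), n * max(f))


def distance_bounds(a, b):
    """Return (lower, upper, approx) for the squared centroid distance.

    Valid when every centroid component is non-negative (an assumption):
    then min(f1) * SUM_2 <= <c1, c2> <= max(f1) * SUM_2, and symmetrically.
    """
    inner_lo = max(a.min_ * b.total / a.n, b.min_ * a.total / b.n)
    inner_hi = min(a.max_ * b.total / a.n, b.max_ * a.total / b.n)
    base = a.normsq + b.normsq
    lower = base - 2.0 * inner_hi
    upper = base - 2.0 * inner_lo
    return lower, upper, (lower + upper) / 2.0


if __name__ == "__main__":
    f1, f2 = (1.0, 2.0, 3.0), (2.0, 2.0, 4.0)
    exact = sum((x - y) ** 2 for x, y in zip(f1, f2))
    lower, upper, approx = distance_bounds(stats_from_centroid(10, f1),
                                           stats_from_centroid(20, f2))
    print("exact:", exact, "lower:", lower, "upper:", upper, "approx:", approx)
```

Only five numbers per cluster enter `distance_bounds`, independent of the dimension d, which is the source of the reduced transmission cost. The lower bound can be negative for loose inputs; clamping it at zero would be a natural refinement but is not shown on the slide.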
Performance Analysis: linear in time, space and transmission
$Time_{total} = O(|S|^2) + O(|S| N)$
$Space_{total} = O(|S|^2) + O(kN)$
$Transmission_{total} = O(kN)$
With $|S| \ll N$ and $k \ll N$, RACHET is $O(N)$.
Analysis of Large Scientific Datasets
Focus: Univariate time series data
Applications: ARM, EEG
Relevance to our task: Level III
Limitations w.r.t. our task:
• No support of dynamic & distributed time series
• No support of multivariate time series
Local Models For Global Analysis and Comparison of Data Series
Strategy:
• Segment series
• Model the usual to find the unusual
Key ideas:
• Fit simple local models to segments
• Use parameters for global analysis and monitoring
Resulting system:
• Detects specific events (targeted or unusual)
• Provides a global description of one or several data series
• Provides data reduction to parameters of local model
From Local Models to Annotated Time Series
Segment series (100 obs)
Fit simple local model (c0, c1, c2, ||e||, ||e||2)
Select extreme (10%)
Cluster extreme (4)
Map back to series
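The steps above can be sketched as a small pipeline. The slides name the local-model parameters (c0, c1, c2 and residual norms) but do not define the model, so a quadratic least-squares fit per segment is assumed here, and "extreme" is taken to mean the largest residual norm. The clustering step is omitted, and all function names and defaults other than the 100-observation segment length and the 10% fraction are hypothetical.

```python
import math


def fit_quadratic(y):
    """Least-squares fit y ~ c0 + c1*t + c2*t^2 with t rescaled to [-1, 1].

    Returns (c0, c1, c2, residual_norm). Needs at least 3 observations.
    """
    n = len(y)
    t = [2.0 * i / (n - 1) - 1.0 for i in range(n)]
    # Augmented normal equations: sum(t^(r+s)) * c = sum(y * t^r).
    m = [[sum(ti ** (r + s) for ti in t) for s in range(3)]
         + [sum(yi * ti ** r for yi, ti in zip(y, t))] for r in range(3)]
    for col in range(3):  # Gaussian elimination with partial pivoting
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for k in range(col, 4):
                m[r][k] -= factor * m[col][k]
    c = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        c[r] = (m[r][3] - sum(m[r][k] * c[k] for k in range(r + 1, 3))) / m[r][r]
    resid = math.sqrt(sum((yi - (c[0] + c[1] * ti + c[2] * ti * ti)) ** 2
                          for yi, ti in zip(y, t)))
    return c[0], c[1], c[2], resid


def annotate(series, seg_len=100, extreme_frac=0.10):
    """Segment, fit, select extreme segments, and map them back to index ranges."""
    starts = range(0, len(series) - seg_len + 1, seg_len)
    params = [fit_quadratic(series[s:s + seg_len]) for s in starts]
    n_extreme = max(1, round(extreme_frac * len(params)))
    ranked = sorted(range(len(params)), key=lambda i: params[i][3], reverse=True)
    extreme = sorted(ranked[:n_extreme])
    return params, [(i * seg_len, (i + 1) * seg_len) for i in extreme]


if __name__ == "__main__":
    series = [0.01 * i for i in range(1000)]
    series[450] += 5.0  # inject one unusual event
    params, flagged = annotate(series)
    print("segments:", len(params), "flagged index ranges:", flagged)
```

In this sketch each 100-observation segment is reduced to four numbers, which reflects the data-reduction idea of the slide; clustering the selected parameter vectors into a few groups would then label the flagged ranges.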
Statistical Downscaling for Climate
Focus: Image time series
Application: Climate
Relevance to our task: Levels I and II
Limitation w.r.t. our task:
• Works as a post-processing tool
Climate Downscaling Contains Several Post-Processing Tools
Trend and Periodic Components Provide a Concise Description of Model Run
Filter periodic and trend components
Compute EOFs
Monitor model run
Summary of where efforts are needed
Research:
• Multivariate time series datasets
• Dynamic versions of time series processing & analysis tools
• Application-specific distributed & dynamic clustering
• Application-specific rules inference algorithms
Implementations: ASPECT's framework
• Simulation data monitoring engine: with pluggable user-driven data analysis modules; "any-time", "real-time" operation, not post-processing; with no or very little interference with simulation
• Simulation data query, search, & retrieval engine
• Simulation data rules inference engine
A lot of integration work…
Integration with other SDM-ETC tasks
Tasks:
1) Storage and retrieval of very large datasets
2) Access optimization of distributed data
3) Data mining and discovery of access patterns
4) Distributed, heterogeneous data access
5) Agent technology

Levels:
a) Storage Level
b) File Level
c) Dataset Level
d) Dataset Federation Level

• Parallel I/O: improving parallel access from clusters (ANL, NWU)
• MPI I/O: implementation based on file-level hints (ANL, NWU)
• Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech)
• Knowledge-based federation of heterogeneous databases (SDSC)
• Low level API for grid I/O (ANL)
• Optimization of low-level data storage, retrieval and transport (ORNL)
• [Grid Enabling Technology]
• Analysis of application-level query patterns (LLNL, NWU)
• Optimizing shared access to tertiary storage (LBNL, ORNL)
• High-dimensional indexing techniques (LBNL)
• Enabling communication among tools and data (ORNL, NCSU)
• Multi-agent high-dimensional cluster analysis (ORNL)
• Adaptive file caching in a distributed system (LBNL)
• Dimension reduction and sampling (LLNL, LBNL)