OAK RIDGE NATIONAL LABORATORY
U.S. DEPARTMENT OF ENERGY
SDM center
Nagiza Samatova & George Ostrouchov
Computer Science and Mathematics Division
Oak Ridge National Laboratory
http://www.csm.ornl.gov/
SDM All-Hands Meeting, September 11-13, 2002
ASPECT: Adaptable Simulation Product Exploration and Control Toolkit
Our Team
Students:
• Abu-Khzam, Faisal, Ph.D. – University of Tennessee, Knoxville
• Bauer, David, B.S. – Georgia Institute of Technology
• Hespen, Jennifer, Ph.D. – University of Tennessee, Knoxville
• Nair, Rajeet, M.S. – University of Illinois, Chicago
Postdocs:
• Park, Hooney, Ph.D.
Staff:
• Ostrouchov, George, Ph.D. – Principal Investigator
• Reed, Joel, M.S.
• Samatova, Nagiza, Ph.D. – Principal Investigator
• Watkins, Ian, B.S.
Our Collaborators
Application:
• David Erickson, Climate, ORNL
• John Drake, ORNL
• Tony Mezzacappa, Astrophysics, ORNL
Linear Algebra & Graph Theory:
• Gene Golub, Stanford University
• Mike Langston, UTK
Data Mining and Data Management:
• Rob Grossman, UIC
High Performance Computing:
• Alok Choudhary, Wei-keng Liao: NWU
• Bill Gropp, Rob Ross, Rajeev Thakur: ANL
Hardware and Software Infrastructure:
• Dan Million, ORNL
• Randy Burris, ORNL
Typical Simulation Exploration Scenarios
Driven by limitations of existing technologies
Post-processing Scenario:
1. Submit a long-running simulation job (weeks to months)
2. Periodically check the status (run the “tail -f” command on each machine)
3. Analyze the large simulation data set
Real-time Scenario:
1. Instrument the simulation code to visualize one or more fields
2. While the simulation job is running:
  • Monitor the selected field(s)
  • If monitoring is not possible, either stop the job, or continue running without monitoring and without the ability to view later what was skipped
3. To change the set of monitored fields, go to step 1
Analysis & Visualization of Simulation Product – State of the Art
Post-processing data analysis tools (like PCMDI):
• Scientists must wait for the simulation to complete
• Can use lots of CPU cycles on long-running simulations
• Can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations
Real-time simulation monitoring tools (like CUMULVS):
• Need simulation code instrumentation (e.g., calls to visualization libraries)
• Interference with the simulation run: taking a snapshot of the data can pause the simulation
• Computationally intensive data analysis task becomes part of the simulation
• Synchronous view of data and simulation run
• More control over the simulation
Some More Limitations…
Post-processing data analysis tools:
• Application-specific (PyClimate, mtaCDF, PCMDI tools, ncview)
  • Tools written for one application cannot be used for another
  • Usually written by experts in the application, not in the data analysis field
• Not user-friendly, usually script-driven (Python, IDL, GrADS)
• Support no more than a dozen simple data analysis algorithms
• Do not exist for some applications (astrophysics vs. climate)
• Are not designed as distributed systems
  • Distributed data sets must be centralized
  • Tools must be installed where the data is
Real-time simulation monitoring tools:
• Provide even simpler data analysis (usually focused on rendering of the data)
• Require good familiarity with the simulation code to make changes: NCAR folks develop climate simulation codes (PCM, CCSM) used world-wide
Improvements through ASPECT
Data stream monitoring tool, not a simulation monitoring tool
[Figure: ASPECT architecture. Labels: Simulation Data on Disks and Tapes (PROBE); ASPECT plug-in modules (FFT, ICA, Filters, D4, RACHET); Desktop with GUI Interface.]
ASPECT’s advantages:
• No simulation code instrumentation
• Single data, multiple views of data
• No interference w/ simulation
• Decoupled from the simulation
ASPECT’s drawbacks (e.g., unlike CUMULVS/ORNL):
• No computational steering
• No collaborative visualization
• No high performance visualization
“Run and Render” Simulation Cycle in SciDAC: Our Vision
[Figure: Run-and-render cycle. Labels: Terascale Supernova Explosion (TSI) Simulation, Computational Environment, Disks, Tapes, Data Management, Data Analysis, ASPECT, Application Scientist.]
PROBE for Storage & Analysis of Simulation Data:
• High-Dimensional
• Distributed
• Dynamic
• Massive
Goal:
• To develop ASPECT (Adaptable Simulation Product Exploration and Control Toolkit)
• Enable effective and efficient monitoring of data generated by long-running simulations through the GUI interface to a rich set of pluggable data analysis modules
Benefits:
• Potentially lead to new scientific discoveries
• Allow very efficient utilization of human and computer resources
Approaching the Goal through a Collaborative Set of Activities
• Interact with Application Scientists (T. Mezzacappa, R. Toedte, D. Erickson, J. Drake)
• Build a Workflow Environment (Probe)
• Application Data Analysis Research
• CS & Math Research driven by Applications
• ASPECT Design & Implementation
• Publications, Meetings & Presentations
• Learn Application Domain (problem, software)
• Data Preparation & Processing
Building a Workflow Environment
80% => 20% Paradigm in Probe’s Research & Application Driven Environment
From frustrations:
• Very limited resources
• General purpose software only
• Lack of interface with HPSS
• Homogeneous platform (e.g., Linux only)
To smooth operation:
Hardware Infrastructure:
• RS6000 S80: 6 processors, 2 GB memory, 1 TB IDE FibreChannel RAID
• 4-processor (1.4 GHz Xeon): 8 GB, 5*73 GB, FibreChannel HBA and GigE
• Two 2-processor (2.4 GHz Xeon): 2 GB, 5*73 GB, GigE, FibreChannel HBA
Software Infrastructure:
• Compilers (Fortran, C, Java)
• Data Analysis (R, Java-R, Ggobi)
• Visualization (ncview, GrADS)
• Data Formats (netCDF, HDF)
• Data Storage & Transfer (HPSS, hsi, pftp, GridFTP, MPI-IO, PVFS)
ASPECT Design and Implementation
ASPECT Infrastructure
Distributed End-to-End System
[Figure: Distributed end-to-end system. Labels: User; ASPECT GUI Client (XML Request Builder, Viz. Engine); Request and Data flows; ASPECT Server over HPSS/PVFS at Probe, Chiba City, and NERSC (Data I/O, Data Reduction, Data Preprocessing, Data Analysis, Data Restore); DataSpace Server with archival data at UIC and Probe.]
ASPECT GUI Infrastructure
Menu of Modules
Categories:
• Data Acquisition
• Data Filtering
• Data Analysis
• Visualization
[Figure: GUI workspace. Labels: Create Instance; Link Modules; NetCDF Reader; FFT (Filter Module); Visualization Module.]
XML Config File:
<modules>
  <module-set>
    <name> Data Acquisition </name>
    <module>
      <name> NetCDF Reader </name>
      <code> datamonitor.NetCDFReader </code>
    </module>
  </module-set>
  <module-set>
    <name> Data Filtering </name>
    <module>
      <name> Invert Filter </name>
      <code> datamonitor.Inverter </code>
    </module>
  </module-set>
</modules>
Functionality:
• Instantiate Modules
• Link Modules
• Synchronous Control
• Add Modules by XML
• XML-based Request Builder
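As a rough illustration of the “Add Modules by XML” idea, the sketch below reads a module config of the shape shown on this slide into a menu of categories and modules. It is written in Python for brevity and is not the ASPECT GUI code; the function name load_module_menu is hypothetical.

```python
import xml.etree.ElementTree as ET

# Config text copied from the slide's XML config file.
CONFIG = """
<modules>
  <module-set>
    <name> Data Acquisition </name>
    <module>
      <name> NetCDF Reader </name>
      <code> datamonitor.NetCDFReader </code>
    </module>
  </module-set>
  <module-set>
    <name> Data Filtering </name>
    <module>
      <name> Invert Filter </name>
      <code> datamonitor.Inverter </code>
    </module>
  </module-set>
</modules>
"""


def load_module_menu(xml_text):
    """Return {category: {module display name: class path}} (hypothetical helper)."""
    menu = {}
    root = ET.fromstring(xml_text)
    for module_set in root.findall("module-set"):
        # "name" matches only the direct child, i.e. the category name.
        category = module_set.findtext("name").strip()
        menu[category] = {
            module.findtext("name").strip(): module.findtext("code").strip()
            for module in module_set.findall("module")
        }
    return menu


menu = load_module_menu(CONFIG)
for category, modules in menu.items():
    print(category, "->", modules)
```

Adding a module then amounts to adding one more module element to the file; nothing in the loader needs to change.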
ASPECT Back-End Engine Overview
The GUI passes a string indicating the script to run, the variables to pass to the script, the names of the files (or groups of files) where those variables can be found, and other optional parameters.
The engine parses the string, reads all of the data into R-compatible objects (in memory), and then calls the script through R.
When R returns, the single returned object is broken up into its respective variables and written to a NetCDF file.
Pipeline:
1. GUI
2. Engine Front End (takes request from GUI, reads input into memory)
3. R Script (translates input to R function call)
4. R (performs calculations)
5. Engine Back End (converts R’s output to NetCDF file)
Interfacing with R: ASPECT provides a rich set of data analysis modules through R
http://www.r-project.org/
The open source R statistical package provides the generic computational back-end for the ASPECT engine. While R was designed to be mostly a stand-alone program, it does provide internal hooks in its libraries.
Using the same functions, macros, and syntax available to internal R code, the ASPECT engine creates R objects directly from the input data. These objects are then installed in the namespace of the R engine and used by the R wrapper scripts as if they were running in an ordinary R environment.
Status:
• Release under GPL on SourceForge, September 2002
• Includes about 30 algorithms
• A dozen can be added in a matter of a week
• Requested by DataSpace, UIC
• Joint effort w/ DataSpace
Scripts …
Using R script wrappers around the R functions allows a great deal of flexibility. Users can easily add their own functions without having to know the internals of the ASPECT engine. Most of the scripts, like the one below, simply translate the C input into the equivalent R function call.
  # Wrapper around R's sample(): v1 holds the values, v2 optional weights,
  # n1 the sample size (0 = use sample()'s default size),
  # n2 nonzero = sample with replacement.
  wsample <- function(x1, x2, v1, v2, n1, n2, c1, c2) {
    a <- if (n2 != 0) TRUE else FALSE
    q <- if (!is.null(v2)) (
           if (n1 != 0) sample(v1, size = n1, replace = a, prob = v2)
           else sample(v1, replace = a, prob = v2)
         ) else (
           if (n1 != 0) sample(v1, size = n1, replace = a)
           else sample(v1, replace = a)
         )
    list(Sample = q)
  }
The scripts can be as complicated or as simple as they need to be. The script below is perfectly valid.
  whello <- function(x1, x2, v1, v2, n1, n2, c1, c2) {
    print("Hello World")
  }
XML-based Description of Algorithms and Visualization Interfaces
<name> wsort </name>
<displayName> Sort </displayName>
<input>
  <variable>
    <type> vector </type>
    <name> data </name>
    <description> The input data </description>
  </variable>
  <variable>....
• Dynamically loaded XML descriptions of functions and menus provide user-expandable configuration details.
• Users can add comments, change default values, add multiple interfaces to a single function, and add interfaces for their own functions.
NetCDF/HDF Input/Output: ASPECT understands and uses scientific standard file formats
http://www.unidata.ucar.edu/packages/netcdf/
The open source NetCDF format is widely used to hold self-describing data. The output from the R engine is a single R object. Given the recursively defined list nature of R objects, this is no limitation.
In order to save a dynamic R object into a flat NetCDF file, the object must be carefully unwound, while preserving as much of the metadata (such as dimension names, the original source of the data, etc.) as possible in the NetCDF file.
Once the output file is written, it is ready to be used either for visualization or as the input to another function.
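The “unwinding” step can be pictured with a small Python sketch: a nested result (Python dicts standing in for a recursive R list) is flattened into uniquely named variables, which is the shape a flat file needs. The dotted naming scheme and the function name flatten are assumptions for illustration, not ASPECT’s actual conventions, and no NetCDF file is written here.

```python
def flatten(obj, prefix=""):
    """Flatten a nested dict into {dotted name: leaf value} (illustrative only)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-lists, carrying the parent name along as metadata.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# A made-up result object with two levels of nesting.
result = {
    "Sample": [3, 1, 2],
    "stats": {"mean": 2.0, "range": {"min": 1, "max": 3}},
}
flat = flatten(result)
for name, value in flat.items():
    print(name, "=", value)
```

Each flattened name would become one variable in the output file, so a later module can ask for stats.mean without knowing how the original object was nested.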
MPI-IO NetCDF
ASPECT supports parallel I/O w/ various data access patterns
(Collaboration with ANL (Bill Gropp, Rob Ross, Rajeev Thakur) and NWU (Alok Choudhary, Wei-keng Liao))
• Concatenate multiple files into a single file for a given set of variables
• Analyze multiple files with different data distribution patterns among processors (by blocks, by strided patterns, by entire files)
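The three distribution patterns named above (by blocks, by strided patterns, by entire files) can be sketched as index calculations per processor rank. This is plain Python with hypothetical helper names; it makes no MPI-IO calls and only shows which items each rank would read.

```python
def block_indices(n_items, n_procs, rank):
    """Contiguous block for this rank; the first ranks absorb any remainder."""
    base, extra = divmod(n_items, n_procs)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return list(range(start, start + size))


def strided_indices(n_items, n_procs, rank):
    """Every n_procs-th item, starting at this rank's offset."""
    return list(range(rank, n_items, n_procs))


def files_for_rank(files, n_procs, rank):
    """Whole files dealt out round-robin to ranks."""
    return files[rank::n_procs]


for rank in range(3):
    print(rank, block_indices(10, 3, rank), strided_indices(10, 3, rank))
print(files_for_rank(["a.nc", "b.nc", "c.nc", "d.nc", "e.nc"], 2, 0))
```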
Data Sampling
ASPECT handles large data sets
Types of Subsampling:
• Random subsampling
• Decimation
• Blocks
• Striding
Implementations:
• Standard netCDF
• MPI-IO netCDF
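A minimal Python sketch of the four subsampling types on a one-dimensional array follows. The distinction drawn here between decimation (keep every k-th value of the whole array) and striding (a start, step, and count, in the style of a netCDF hyperslab) is an assumption made for illustration; the function names are hypothetical.

```python
import random


def random_subsample(data, k, seed=0):
    """Pick k values at random (without replacement), keeping original order."""
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(data)), k))
    return [data[i] for i in chosen]


def decimate(data, factor):
    """Keep every factor-th value of the whole array."""
    return data[::factor]


def block(data, start, length):
    """Keep one contiguous block."""
    return data[start:start + length]


def stride(data, start, step, count):
    """Keep count values, beginning at start and stepping by step."""
    return data[start::step][:count]


data = list(range(10))
print(random_subsample(data, 4))
print(decimate(data, 3), block(data, 2, 3), stride(data, 1, 4, 2))
```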
Interfacing with DataSpace
ASPECT provides “hooks” to a Web of Scientific Data
(Collaboration with Bob Grossman at UIC)
http://www.dataspaceweb.net
The web today provides an infrastructure for working with distributed multimedia documents. DataSpace is an infrastructure for creating a web of data instead of documents.
• Very high throughput for moving data through DataSpace’s parallel network transport protocols (Psockets (TCP), Sabul (TCP, UDP))
• Ability to do comparative/correlation analysis between simulation and archived data
[Figure: DataSpace – Web of Data. ASPECT connects to DataSpace through PSockets/Sabul.]
UIC – Amsterdam throughput:
• Sabul – 540 Mb/s
• Psockets – 180 Mb/s
• Sockets – 10 Mb/s
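To put the UIC – Amsterdam rates in perspective, the sketch below estimates how long a 1 TB data set would take to move at each rate. It assumes that Mb/s means megabits per second, that 1 TB is 10^12 bytes, and that the quoted rate is sustained for the whole transfer; none of these is stated on the slide.

```python
def transfer_hours(n_bytes, mbit_per_s):
    """Hours needed to move n_bytes at a sustained rate given in megabits/s."""
    return n_bytes * 8 / (mbit_per_s * 1e6) / 3600


# Rates from the slide (interpreted as megabits per second).
RATES = {"Sabul": 540, "Psockets": 180, "Sockets": 10}
ONE_TB = 1e12  # bytes, decimal terabyte (assumption)

hours = {name: transfer_hours(ONE_TB, rate) for name, rate in RATES.items()}
for name, h in hours.items():
    print(f"{name}: {h:.1f} hours")
```

Under these assumptions the same terabyte takes roughly 4 hours over Sabul, 12 hours over Psockets, and more than 9 days over plain sockets.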
Summary of ASPECT’s Design & Implementation
• ASPECT is a Data Stream Monitoring Tool
• ASPECT offers the following features for efficient and effective simulation data analysis:
  • GUI interface to a rich set of pluggable data analysis modules
  • Uses the open source R statistical data analysis package as a computational back-end
  • Understands and uses the NetCDF/HDF scientific file formats
  • Uses dynamically loaded R scripts and XML descriptors for flexibility
  • Handles large sets of data through support for block selection, striding, sampling, data reduction, and distributed algorithms
  • Provides efficient I/O through the MPI-IO interface to NetCDF and HDF
  • Moves data efficiently through PSockets/Sabul
  • Supports a dataset view of the simulation, not only a collection of files
Distributed and Streamline Data Analysis Research
Simulation Data Sets are Massive & Growing Fast
Astrophysics Data per Run (Supernova Explosion):
• 1-D simulation: 2 GB
• 2-D simulation: 1 TB
• 3-D simulation: 50 TB
Most of this Data will NEVER Be Touched with the current trends in technology
• The amount of data stored online quadruples every 18 months, while processing power ‘only’ doubles every 18 months.
  • Unless the number of processors increases unrealistically rapidly, most of this data will never be touched.
• Storage device capacity doubles every 9 months, while memory capacity doubles every 18 months (Moore’s law).
  • Even if these growth rates converge, memory latency is and will remain the rate-limiting step in data-intensive computations.
• Operating systems struggle to handle files larger than a few GB.
  • OS constraints and memory capacity determine data set file size and fragmentation.
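The growth rates quoted above compound quickly. The following Python sketch works out how far data volume pulls ahead of processing power, and storage ahead of memory, if the quoted doubling and quadrupling periods are taken at face value and assumed to hold steadily; the function names are hypothetical.

```python
def growth(factor, period_months, months):
    """Growth multiple after `months` for a quantity that grows by `factor` every period."""
    return factor ** (months / period_months)


def data_per_processing(months):
    """Data quadruples every 18 months; processing power doubles every 18 months."""
    return growth(4, 18, months) / growth(2, 18, months)


def storage_per_memory(months):
    """Storage doubles every 9 months; memory doubles every 18 months."""
    return growth(2, 9, months) / growth(2, 18, months)


print(data_per_processing(72))  # gap after 6 years
print(storage_per_memory(36))   # gap after 3 years
```

After six years the stored data per unit of processing power has grown 16-fold, and after three years the storage-to-memory ratio has grown 4-fold.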
Massive Data Sets are Naturally Distributed BUT Effectively Immoveable (Skillicorn, 2001)
• Bandwidth is increasing, but not at the same rate as stored data
  • Some parts of the world have high available bandwidth, BUT there are enough bottlenecks that high effective bandwidth is unachievable across heterogeneous networks
• Latency for transmission at global distances is significant
  • Most of this latency is time-of-flight and so will not be reduced by technology
• Data has a property similar to inertia:
  • It is cheap to store and cheap to keep moving, but the transitions between these two states are expensive in time and hardware
• Legal and political restrictions
• Social restrictions
  • Data owners may allow access to their data, but only while retaining control of it
Computations MUST move to data, rather than data to computations
Simulation Data Sets are Dynamically Changing
• Scientific simulations (e.g., climate modeling and supernova explosion) typically run for at least one month and produce data sets on the order of one to ten terabytes per simulation.
• Effectively and efficiently analyzing these streams of data is a challenge:
  • Most existing methods work with static datasets. Any changes require complete re-computation.
Computations MUST be able to efficiently analyze streams of data while they are being produced, rather than wait until they are produced
[Figure: Stream of climate simulation data. New chunks arrive at t = t0, t1, t2 and are merged by incremental update via fusion.]
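The idea of an incremental update via fusion can be illustrated with the simplest possible statistic. In the Python sketch below, each chunk of the stream is reduced to a small summary (count, mean, sum of squared deviations), and summaries are fused without revisiting earlier data. This uses the standard pairwise update formula for mean and variance; it is an illustration of the principle, not one of the ASPECT algorithms.

```python
def summarize(chunk):
    """Reduce one chunk to (count, mean, sum of squared deviations from the mean)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def fuse(a, b):
    """Merge two summaries into the summary of the combined data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


# Two chunks arriving at t0 and t1 (made-up values).
state = summarize([1.0, 2.0, 3.0])
state = fuse(state, summarize([4.0, 5.0, 6.0, 7.0]))
n, mean, m2 = state
print(n, mean, m2 / (n - 1))  # count, running mean, running sample variance
```

Only the three-number summary is kept between time steps, so the cost of each update depends on the size of the new chunk, not on the length of the stream so far.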
Algorithms Fail for a Few Gigabytes of Data
Algorithm complexity vs. data size n:
Data size n    n             n log(n)      n^2
100 B          10^-10 sec.   10^-10 sec.   10^-8 sec.
10 KB          10^-8 sec.    10^-8 sec.    10^-4 sec.
1 MB           10^-6 sec.    10^-5 sec.    1 sec.
100 MB         10^-4 sec.    10^-3 sec.    3 hrs.
10 GB          10^-2 sec.    0.1 sec.      3 yrs.
Algorithmic Complexity:
• Calculate means: O(n)
• Calculate FFT: O(n log(n))
• Calculate SVD: O(r • c)
• Clustering algorithms: O(n^2)
For illustration, the chart assumes 10^-12 sec. calculation time per data point
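The chart entries can be reproduced with a few lines of Python. The sketch assumes 10^-12 seconds per elementary operation, as stated on the slide, and a base-10 logarithm for the n log(n) column, which is an assumption chosen because it matches the rounded figures shown.

```python
import math

SECONDS_PER_OP = 1e-12  # calculation time per data point, from the slide


def runtime_seconds(n, complexity):
    """Estimated run time for n data points under a given complexity class."""
    operations = {
        "n": n,
        "n log n": n * math.log10(n),  # base 10 is an assumption
        "n^2": n ** 2,
    }[complexity]
    return operations * SECONDS_PER_OP


for label, n in [("100 B", 1e2), ("10 KB", 1e4), ("1 MB", 1e6),
                 ("100 MB", 1e8), ("10 GB", 1e10)]:
    row = [runtime_seconds(n, c) for c in ("n", "n log n", "n^2")]
    print(label, ["%.1e s" % t for t in row])

years = runtime_seconds(1e10, "n^2") / (3600 * 24 * 365)
print("10 GB, n^2: %.1f years" % years)
```

The jump in the last column is the point of the slide: a quadratic algorithm that takes one second on 1 MB needs about three hours on 100 MB and about three years on 10 GB.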
RACHET: High Performance Framework for Distributed Cluster Analysis
Strategy:
• Perform data mining in a distributed fashion with reasonable data transfer overheads
Key idea:
• Compute local analyses using distributed agents
• Merge minimum info into a global analysis via peer-to-peer agents’ collaboration & negotiation
Benefits:
• NO need to centralize data
• Linear scalability with data size and with data dimensionality
RACHET costs (|S| << N):
• Time: O(|S|^2), O(|S| N)
• Space: O(|S|^2), O(N)
• Data transmission: O(N)
Linear Time Dimension Reduction for Streamline & Distributed Data
[Figure: Stream of simulation data. New chunks arrive at t = t0, t1, t2 and are merged by incremental update via fusion.]
[Figure: Ratio of monolithic vs. streamline results against the number of dimensions k (1 to 9); the ratio axis spans 0.92 to 1.04, with two series labeled “m/ t=2” and “m/ t=4”.]
Features:
• One-time communication
• Linear time for each chunk
• ~10% deviation from central version
• Based on FastMap
Status:
• C, MPI, MPI-IO based implementation of package
• Both one-time and iterative communication
• Integration into ASPECT is in progress
• Requested by DataSpace, UIC; P3 project (Ekow), LBL
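Since the method is described as based on FastMap, a sketch of one standard FastMap projection step may help. The Python below picks two distant pivot points with the usual heuristic and projects every point onto the line between them using only distances, which costs a number of distance evaluations linear in the number of points. This is the textbook single-machine step, not the streamline or distributed variant presented here; function names are hypothetical and Euclidean distance is assumed.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def choose_pivots(points):
    """Heuristic: farthest point from an arbitrary start, then farthest from that."""
    a = 0
    b = max(range(len(points)), key=lambda i: dist(points[a], points[i]))
    a = max(range(len(points)), key=lambda i: dist(points[b], points[i]))
    return a, b


def fastmap_coordinate(points):
    """One FastMap coordinate per point (cosine-law projection onto the pivot line)."""
    a, b = choose_pivots(points)
    d_ab = dist(points[a], points[b])  # assumes the points are not all identical
    return [
        (dist(points[a], p) ** 2 + d_ab ** 2 - dist(points[b], p) ** 2) / (2 * d_ab)
        for p in points
    ]


points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
coords = fastmap_coordinate(points)
print(coords)
```

For points that lie on a straight line, as in this example, the single coordinate recovers their positions along that line exactly; further coordinates are obtained in FastMap by repeating the step on the residual distances.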
Distributed Principal Components (PCA): Merging Information Rather Than Raw Data
• Global Principal Components: transmit information, not data
• Dynamic Principal Components: no need to keep all data
Method: Merge few local PCs and local means
Benefits:
• Little loss of information
• Much lower transmission costs: centralized O(np) vs. DPCA O(sp), s << n
• Computation cost: O(kp^2) vs. O(np^2)
[Figure: Performance of Distributed PCA vs. Monolithic PCA; ratio plotted against the number of data sets.]
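One way to picture “merge few local PCs and local means” is through the pooled-covariance identity: the global scatter is the sum of the within-site scatters plus a term for how far each site mean lies from the global mean. The Python sketch below rebuilds each within-site covariance from its principal components (eigenvalue, eigenvector pairs) and combines them. It is offered as an assumption about the general approach, not as the DPCA algorithm from the talk; the result is exact when all local PCs are sent and approximate when only the leading few are. The final eigen-decomposition of the merged matrix is omitted.

```python
def merge_local_pcas(summaries):
    """Combine per-site (n, mean, [(eigenvalue, eigenvector), ...]) summaries.

    Returns the global mean and global sample covariance matrix.
    """
    total = sum(n for n, _, _ in summaries)
    p = len(summaries[0][1])
    mean = [sum(n * m[j] for n, m, _ in summaries) / total for j in range(p)]
    scatter = [[0.0] * p for _ in range(p)]
    for n, m, pcs in summaries:
        d = [m[j] - mean[j] for j in range(p)]
        for r in range(p):
            for c in range(p):
                # Within-site scatter rebuilt from the transmitted PCs.
                within = (n - 1) * sum(lam * v[r] * v[c] for lam, v in pcs)
                # Between-site scatter from the shift of the local mean.
                scatter[r][c] += within + n * d[r] * d[c]
    cov = [[s / (total - 1) for s in row] for row in scatter]
    return mean, cov


# Site 1 holds (0,0),(2,0); site 2 holds (0,2),(0,4). Only summaries are sent.
site1 = (2, [1.0, 0.0], [(2.0, [1.0, 0.0])])
site2 = (2, [0.0, 3.0], [(2.0, [0.0, 1.0])])
mean, cov = merge_local_pcas([site1, site2])
print(mean, cov)
```

Each site transmits only its count, its mean, and a handful of eigenpairs, which is where the O(sp) versus O(np) transmission saving comes from.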
Data Understanding for Scientific Discovery
Data Analysis for Monitoring Simulations
• What do we monitor?
• Contrast between Supernova and Climate simulation data analysis
• Highlights from Astrophysics
• Wider implications on simulation data
• Data reduction and monitoring from reduced data
What Do We Monitor?
General Concepts
• Application-specific
• Comparative displays driven by data mining and exploratory data analysis
• Visual comparison in time is less effective than comparison side-by-side (Visual Display of Quantitative Information, Tufte)
[Figure: Entropy of 2-d (axisymmetric) Supernova Simulation.]
Evolving Display Shows Entropy Progression over Time
[Figure: Median along layer. Reduction with median, shown as radius vs. time.]
Specific Aspects of Simulation Can be Monitored
Entropy instability (range) over time
[Figure: Range along layer. Reduction with range (max – min), shown as radius vs. time.]
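The two reductions used in these displays, the median along a layer and the range (max – min) along a layer, are easy to state in code. The Python sketch below assumes one time step of the 2-d field is stored as field[r][a], with r the radius index and a the angle index; this layout and the function name reduce_layers are assumptions for illustration, and the values are made up.

```python
from statistics import median


def reduce_layers(field):
    """Reduce each radial layer of one time step to its median and its range."""
    medians = [median(layer) for layer in field]
    ranges = [max(layer) - min(layer) for layer in field]
    return medians, ranges


# One time step: two radial layers, three angles each.
field = [
    [1.0, 1.0, 1.0],  # inner layer: spherically symmetric
    [1.0, 2.0, 6.0],  # outer layer: an instability is developing
]
medians, ranges = reduce_layers(field)
print(medians, ranges)
```

Each time step contributes one column of medians and one column of ranges, so stacking the columns over time gives the radius vs. time displays; a nonzero range flags a departure from spherical symmetry in that layer.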
Shorten the Experimental Cycle with Run-and-Render Comparative Monitoring
[Figure: Archived Run and Active Run shown side by side, each as radius vs. time.]
Concise Views of a Supernova Simulation
• Displays must be application-specific, but some general concepts apply
• Need general data mining capability for flexibility in building displays
• New 2-d vs. 3-d comparison
• Views evolve through time
• Comparison with archived run possible
Three orthogonal views of entropy variation in a 400 time-step 2-d supernova simulation are shown with polar coordinates presented as Cartesian.
[Figure: Three panels with axes angle, radius, and time.]
Data Reduction for Multigrid Simulation
• Based on PCA of contiguous field blocks
• Exploits spatial correlation and adapts to complexity of spatial field
• Parameter controls selected % variation
• Field restoration with single matrix multiply
• Astrophysics supernova simulation: 16 to 200 times reduction per time step
• Outperforms subsampling 3 times for comparable MSE over all time steps
[Figure: Timestep 390.]
Spherical Symmetry Medians Conserved under PC Compression
[Figure: Original Data vs. 30x Compressed Data, each shown over time.]
Spherical Symmetry Instability Ranges Conserved under PC Compression
[Figure: Original Data vs. 30x Compressed Data, each shown as radius vs. time.]
Publications & Presentations
Conference
• Co-sponsored the Statistical Data Mining Conference, June 22-25, 2002, in Knoxville, jointly with the University of Tennessee Department of Statistics
• Organized an invited session on Distributed Data Mining at the conference
Publications FY 2002
• Y. M. Qu, G. Ostrouchov, N. F. Samatova, and G. A. Geist (2002). Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets. Workshop on High Performance Data Mining at the Second SIAM International Conference on Data Mining, p. 4-9.
• N. F. Samatova, G. Ostrouchov, A. Geist, and A. Melechko (2002). RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets. Distributed and Parallel Databases: An International Journal, Special Issue on Parallel and Distributed Data Mining, Volume 11, No. 2, March 2002.
• F. N. Abu-Khzam, N. F. Samatova, G. Ostrouchov, M. A. Langston, and A. G. Geist (2002). Distributed Dimension Reduction Algorithms for Widely Dispersed Data. Fourteenth IASTED International Conference on Parallel and Distributed Computing and Systems. Accepted.
• G. Ostrouchov and N. F. Samatova (2002). On FastMap and the Convex Hull of Multivariate Data. In preparation.
• J. Hespen, G. Ostrouchov, N. F. Samatova, and A. Mezzacappa (2002). Adaptive Data Reduction for Multigrid Simulation Output. In preparation.
Presentations FY 2002
Invited:
• G. Ostrouchov and N. F. Samatova. Multivariate Analysis of Massive Distributed Data Sets. Spring Research Conference on Statistics in Industry and Technology, May 20-22, 2002, Ann Arbor, Michigan.
• G. Ostrouchov and N. F. Samatova. Combining Distributed Local Principal Component Analyses into a Global Analysis. C. Warren Neel Conference on Statistical Data Mining and Knowledge Discovery, June 22-25, 2002, Knoxville, Tennessee.
• N. Samatova, G. A. Geist, and G. Ostrouchov. RACHET: Petascale Distributed Data Analysis Suite. SPEEDUP Workshop on Distributed Supercomputing Data Intensive Computing, March 4-6, 2002, Badehotel Bristol, Leukerbad, Valais, Switzerland.
Contributed:
• Y. M. Qu, G. Ostrouchov, N. F. Samatova, and G. A. Geist. Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets. Workshop on High Performance Data Mining at the Second SIAM International Conference on Data Mining, April 11-13, 2002, Washington, DC.
Local:
• N. Samatova and G. Ostrouchov. Large-Scale Analysis of Distributed Scientific Data. ORNL Weinberg Auditorium, July 11, 2002.
Thank You!