presented by scientific data management center nagiza f. samatova network and cluster computing...

11
Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

Upload: georgia-dawson

Post on 05-Jan-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

Presented by

Scientific Data Management Center

Nagiza F. SamatovaNetwork and Cluster Computing

Computer Sciences and Mathematics Division

Page 2: Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

2 Samatova_SDM_0611

Scientific Data Management Center

SciDAC Review, Issue 2, Fall 2006 Illustration: A. Tovey

Lead Institution: LBNL

PI: Arie Shoshani

Laboratories:ANL, ORNL, LBNL, LLNL, PNNL

Universities:NCSU, NWU, SDSC, UCD, U. Utah

Established 5 years ago (SciDAC-1)

Successfully re-competed for next 5 years (SciDAC-2)

Featured in Fall 2006 issue of SciDAC Review magazine

Page 3: Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

3 Samatova_SDM_0611

SDM InfrastructureUses three-layer organization of technologies

Scientific Process Automation (SPA)

Data Mining and Analysis (DMA)

Storage Efficient Access (SEA)

Operating system

Hardware (e.g., Cray XT3, IBM Blue/Gene L)

Operating system

Hardware (e.g., Cray XT3, IBM Blue/Gene L)

Integrated approach:• To provide a scientific

workflow capability

• To support data mining and analysis tools

• To accelerate storage and access to data

Benefits scientists by• Hiding underlying parallel

and indexing technology

• Permitting assembly of modules using workflow description tool

Goal: Reduce data management overhead

Page 4: Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

4 Samatova_SDM_0611

Automating scientific workflow in SPA enables mining of diverse biology databases

Scientific discovery is a multi-step process. SPA-Kepler workflow system automates and manages this process.

Illustration: A. Tovey

Tasks that required hours or days can now be completed in minutes, allowing biologists to spend their time saved on science

Contact: Terence Critchlow, LLNL ([email protected])

Page 5: Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

5 Samatova_SDM_0611

Data analysis for fusion plasma

Plot of orbits in cross-section of a fusion experiment shows different types of orbits, including circle-like “quasi-periodic orbits” and “island orbits.” Characterizing the topology of orbits is challenging, as experimental and simulation data are in the form of points rather than a continuous curve. We are successfully applying data mining techniques to this problem.

Feature selection techniques used to identify key parameters relevant to the presence of edge harmonic oscillations in the DIII-D tokomak.

Contact: Chandrika Kamath, LLNL ([email protected])

Err

or

0.16

0.18

0.2

0.22

0.24

0.26

0.28

0.3

0 5 10 15 20 25 30 35

Features

PCA FilterDistance

ChiSquareStump

Boosting

Page 6: Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

6 Samatova_SDM_0611

Searching and indexing with FastBit Gleaning insights about combustion simulation

About FastBit:

• Extremely fast search of large databases

• Outperforms commercial software

• Used by various applications: combustion, STAR, astrophysics visualization

Collaborators:

SNL: J. Chen, W. Doyle

NCSU: T. Echekki

Illustration: A. ToveyIllustration: A. Tovey

Finding & tracking of combustion flame frontsFinding & tracking of combustion flame fronts

Contact: John Wu, LBNL ([email protected])

Searching for regions that satisfy particular criteria is a challenge. FastBit efficiently finds regions of interest.

Page 7: Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

7 Samatova_SDM_0611

Searches in STAR experiment Finding one event out of a million

Illustration: A. Tovey

STAR scientists extract data of interest within 15 minutes. Accomplishing the same task before consumed weeks.

FastBit and Storage Resource Manager enable efficient mining of hundreds of terabytes of data generated by STAR detector

Contact: John Wu, LBNL ([email protected])

Page 8: Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

8 Samatova_SDM_0611

Parallel statistical computing with pRGoal: Provide scalable high-performance statistical data analysis framework to help scientists perform interactive analyses of produced data to extract knowledge

• Able to use existing high-level (i.e., R) code

• Requires minimal effort for parallelizing

• Offers identical interface

• Provides efficient and scalable performance

Contact: Nagiza Samatova, ORNL ([email protected])

Page 9: Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

9 Samatova_SDM_0611

Parallel input/outputScaling computational science

Multi-layer parallel I/O design:

Supports Parallel-netCDF library built on top of MPI-IO implementation calledROMIO, built in turn on top of Abstract Device Interface for I/O system, used to access parallel storage system.

Benefits to scientists:• Brings performance, productivity,

and portability

• Improves performance by order of magnitude

Orchestration of data transfers and speedy analyses depends on efficient systems for storage, access, and movement of data among modules.

Illustration: A. ToveyIllustration: A. Tovey

Contact: Rob Ross, ANL ([email protected])

Page 10: Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

10 Samatova_SDM_0611

Speeding data transfer with PnetCDF

P0P0 P1P1 P2P2 P3P3

netCDFnetCDF

Parallel File SystemParallel File System

Parallel netCDFParallel netCDF

P0P0 P1P1 P2P2 P3P3

Parallel File SystemParallel File System

Illustration: A. ToveyIllustration: A. Tovey

Rate of data transfer using HDF5 decreases when a particular problem is divided among more processors. In contrast, parallel version of netCDF improves because of low-overhead nature of PnetCDF and its tight coupling to MPI-IO.

Enables high performance parallel I/O to netCDFdata sets

Achieves up to 10-fold performance improvement over HDF5

Contact: Rob Ross, ANL ([email protected])

Inter-process communication

Page 11: Presented by Scientific Data Management Center Nagiza F. Samatova Network and Cluster Computing Computer Sciences and Mathematics Division

11 Samatova_SDM_0611

Contact

Nagiza SamatovaNetwork and Cluster ComputingComputer Science and Mathematics Division(865) [email protected]

11 Samatova_SDM_0611