presented by scientific data management center nagiza f. samatova oak ridge national laboratory arie...

12
Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory DOE Laboratories ANL: Rob Ross LBNL: Doron Rotem LLNL: Chandrika Kamath ORNL: Nagiza Samatova PNNL: Terence Critchlow Universities NCSU: Mladen Vouk NWU: Alok Choudhary UCD: Bertram Ludaescher SDSC: Ilkay Altintas UUtah: Steve Parker Co-Principal Investigators

Upload: june-horn

Post on 21-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

Presented by

Scientific Data Management Center

Nagiza F. SamatovaOak Ridge National Laboratory

Arie Shoshani (PI)Lawrence Berkeley National Laboratory

DOE LaboratoriesANL: Rob RossLBNL: Doron RotemLLNL: Chandrika KamathORNL: Nagiza SamatovaPNNL: Terence Critchlow

Jarek Nieplocha

UniversitiesNCSU: Mladen VoukNWU: Alok ChoudharyUCD: Bertram

LudaescherSDSC: Ilkay AltintasUUtah: Steve Parker

Co-Principal Investigators

Page 2: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

2 Samatova_SDMC_SC07

Scientific Data Management Center

SciDAC Review, Issue 2, Fall 2006 Illustration: A. Tovey

Lead Institution: LBNL

PI: Arie Shoshani

Laboratories:ANL, ORNL, LBNL, LLNL, PNNL

Universities:NCSU, NWU, SDSC, UCD, U. Utah

Established 5 years ago (SciDAC-1)

Successfully re-competed for next

5 years (SciDAC-2)

Featured in Fall 2006 issue of SciDAC Review magazine

Page 3: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

3 Samatova_SDMC_SC07

SDM infrastructureUses three-layer organization of technologies

Scientific Process Automation (SPA)

Data Mining and Analysis (DMA)

Storage Efficient Access (SEA)

Operating system

Hardware (e.g., Cray XT4, IBM Blue/Gene L)

Operating system

Hardware (e.g., Cray XT4, IBM Blue/Gene L)

Integrated approach:• To provide a scientific workflow

capability

• To support data mining and analysis tools

• To accelerate storage and access to data

Benefits scientists by• Hiding underlying parallel and

indexing technology

• Permitting assembly of modules using workflow description tool

Goal: Reduce data management overhead

Page 4: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

4 Samatova_SDMC_SC07

Automating scientific workflow in SPAEnables scientists to focus on science not process

Scientific discovery is a multi-step process. SPA-Kepler workflow system automates and manages this process.

Contact: Terence Critchlow, PNNL ([email protected])

Illustration: A. Tovey

Tasks that required hours or days can now be completed in minutes, allowing biologists to spend their time saved on science.

Dashboards provide improved interfaces. Dashboards provide improved interfaces.

Execution monitoring(provenance) provides near real-time status.

Execution monitoring(provenance) provides near real-time status.

Page 5: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

5 Samatova_SDMC_SC07

Data analysis for fusion plasma

Plot of orbits in cross-section of a fusion experiment shows different types of orbits, including circle-like “quasi-periodic orbits” and “island orbits.” Characterizing the topology of orbits is challenging, as experimental and simulation data are in the form of points rather than a continuous curve. We are successfully applying data mining techniques to this problem.

Feature selection techniques used to identify key parameters relevant to the presence of edge harmonic oscillations in the DIII-D tokomak.

Contact: Chandrika Kamath, LLNL ([email protected])

Err

or

0.16

0.18

0.2

0.22

0.24

0.26

0.28

0.3

0 5 10 15 20 25 30 35

Features

PCA FilterDistance

ChiSquareStump

Boosting

Page 6: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

6 Samatova_SDMC_SC07

Searching and indexing with FastBit Gleaning insights about combustion simulation

About FastBit:

• Extremely fast search of large databases

• Outperforms commercial software

• Used by various applications: combustion, STAR, astrophysics visualization

Collaborators:

SNL: J. Chen, W. Doyle

NCSU: T. EchekkiIllustration: A. ToveyIllustration: A. Tovey

Finding and tracking of combustion flame frontsFinding and tracking of combustion flame fronts

Contact: John Wu, LBNL ([email protected])

Searching for regions that satisfy particular criteria is a challenge. FastBit efficiently finds regions of interest.

Page 7: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

7 Samatova_SDMC_SC07

Data analysis based on dynamic histograms using FastBit

Example of finding the number of malicious network connections in a particular time window.

A histogram of number of connections to port 5554 of machine in LBNL IP address space (two-horizontal axes); vertical axis is time.

Two sets of scans are visible as two sheets.

Contact: John Wu, LBNL ([email protected])

Conditional histograms are common in data analysis. FastBit indexing facilitates real-time anomaly detection.

Page 8: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

8 Samatova_SDMC_SC07

Parallel input/outputScaling computational science

Multi-layer parallel I/O design:Supports Parallel-netCDF library built on top of MPI-IO implementation calledROMIO, built in turn on top of Abstract Device Interface for I/O system, used to access parallel storage system

Benefits to scientists:• Brings performance, productivity,

and portability

• Improves performance by order of magnitude

Operates on any parallel file system (e.g. GPFS, PVFS, PanFS, Lustre)

Orchestration of data transfers and speedy analyses depends on efficient systems for storage, access, and movement of data among modules.

Illustration: A. ToveyIllustration: A. Tovey

Contact: Rob Ross, ANL ([email protected])

Page 9: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

9 Samatova_SDMC_SC07

Goal: Provide scalable high-performance statistical data analysis framework to help scientists perform interactive analyses of produced data to extract knowledge

Contact: Nagiza Samatova, ORNL ([email protected])

Parallel statistical computing with pR

Able to use existing high-level (i.e., R) code Requires minimal effort for parallelizing Offers identical application and web interface Provides efficient and scalable performance Integrates with Kepler as front-end interface Enables sharing results with collaborators

Page 10: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

10 Samatova_SDMC_SC07

Speeding data transfer with PnetCDF

P0P0 P1P1 P2P2 P3P3

netCDFnetCDF

Parallel file systemParallel file system

Parallel netCDFParallel netCDF

P0P0 P1P1 P2P2 P3P3

Parallel file systemParallel file system

Enables high performance parallel I/O to netCDFdata sets.

Achieves up to 10-fold performance improvement over HDF5.

Contact: Rob Ross, ANL ([email protected])

Inter-process communication

Early performance testing showed PnetCDF outperformed HDF5 for some critical access patterns.

The HDF5 team has responded by improving its code for these patterns, and now these teams actively collaborate to better understand application needs and system characteristics, leading to I/O performance gains in both libraries.

Page 11: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

11 Samatova_SDMC_SC07

Active storage

Modern filesystems such as GPFS, Lustre, PVFS2 use general purpose servers with substantial CPU and memory resources.

Active Storage moves I/O-intensive tasks from the compute nodes to the storage nodes.

Main benefits: local I/O operations, very low network traffic (mainly

metadata-related), better overall system performance.

Active Storage has been ported to Lustre and PVFS2.

Active Storage enablesscientific applications to exploitunderutilized resources of storage nodes for computations involving data located in secondary storage.

MetadataServer (MDS)

AS ProcessingComponent

File.in

AS ProcessingComponent

File.inFile.outFile.out

AS Manager

ComputeNode

ComputeNode

ComputeNode

StorageNode0

StorageNodeN-1

MetadataServer (MDS)

Parallel Filesystem's Components

.....

Parallel Filesystem's Clients

.....

ActiveStorage

ComputeNode

Network Interconnect

MetadataServer (MDS)

AS ProcessingComponent

File.in File.in

File.outFile.out

AS Manager

ComputeNode

ComputeNode

ComputeNode

StorageNode0

StorageNode N-1

MetadataServer (MDS)

Parallel Filesystem's Components

.....

Parallel Filesystem's Clients

.....

ActiveStorage

ComputeNode

Network Interconnect

AS ProcessingComponent

Contact: Jarek Nieplocha, PNNL ([email protected])

Page 12: Presented by Scientific Data Management Center Nagiza F. Samatova Oak Ridge National Laboratory Arie Shoshani (PI) Lawrence Berkeley National Laboratory

12 Samatova_SDMC_SC07

Contacts

Arie ShoshaniPrincipal InvestigatorLawrence Berkeley National [email protected]

Terence CritchlowScientific Process Automation area leaderPacific Northwest National [email protected]

Nagiza SamatovaData Mining and Analysis area leaderOak Ridge National [email protected]

Rob RossStorage Efficient Access area leaderArgonne National [email protected]

12 Samatova_SDMC_SC07