Scientific Data Management Center (ISIC)


Page 1: Scientific Data Management Center (ISIC)

Scientific Data Management Center (ISIC)

http://sdmcenter.lbl.gov contains an extensive publication list

Page 2: Scientific Data Management Center (ISIC)

Scientific Data Management Center

Participating Institutions

Center PI: Arie Shoshani, LBNL

DOE laboratory co-PIs:
• Bill Gropp, Rob Ross (ANL)
• Arie Shoshani, Doron Rotem (LBNL)
• Terence Critchlow, Chandrika Kamath (LLNL)
• Nagiza Samatova, Andy White (ORNL)

University co-PIs:
• Mladen Vouk (North Carolina State)
• Alok Choudhary (Northwestern)
• Reagan Moore, Bertram Ludaescher (UC San Diego / SDSC)
• Calton Pu (Georgia Tech)

Page 3: Scientific Data Management Center (ISIC)

Phases of Scientific Exploration

Data Generation
• From large-scale simulations or experiments
• Fast data growth with computational power; examples:
  • HENP: 100 teraops and 10 petabytes by 2006
  • Climate: spatial resolution T42 (280 km) -> T85 (140 km) -> T170 (70 km); T42 is about 1 TB per 100-year run => growth by a factor of ~10-20

Problems
• Can't dump the data to storage fast enough – waste of compute resources
• Can't move terabytes of data over the WAN robustly – waste of the scientist's time
• Can't steer the simulation – waste of time and resources
• Need to reorganize and transform data – large data-intensive tasks slow progress

Page 4: Scientific Data Management Center (ISIC)

Phases of Scientific Exploration

Data Analysis
• Analysis of large data volumes – can't fit all the data in memory
• Problems:
  • Finding the relevant data – need efficient indexing
  • Cluster analysis – need linear scaling
  • Feature selection – need efficient high-dimensional analysis
  • Data heterogeneity – need to combine data from diverse sources
  • Streamlining analysis steps – the output of one step needs to match the input of the next
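These constraints are why analyses must stream over the data rather than load it whole. A minimal sketch of the idea, assuming a large on-disk array (the file name, dtype, and chunk size are illustrative, not part of the center's tools):

```python
# Sketch: streaming a statistic over data too large for memory.
# "data.bin" is a hypothetical flat file of float32 values.
import numpy as np

data = np.memmap("data.bin", dtype=np.float32, mode="r")
chunk = 10_000_000                    # elements per pass
total, count = 0.0, 0
for start in range(0, data.size, chunk):
    # Only one chunk is resident in memory at a time.
    block = np.asarray(data[start:start + chunk], dtype=np.float64)
    total += block.sum()
    count += block.size
print("mean:", total / count)
```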

Page 5: Scientific Data Management Center (ISIC)

Example Data Flow in TSI

[Diagram: TSI data flow – input data feeds a highly parallel compute stage whose output is ~500x500 files; these are aggregated to ~500 files (<2 to 10+ GB each) and moved over the Logistical Network (L-Bone) to the archive and to a data depot; local mass storage (14+ TB) receives data aggregated to one file (1+ TB each); a local 44-processor data cluster, where data sits on local nodes for weeks, runs viz software feeding a viz client and viz wall]

Courtesy: John Blondin

Page 6: Scientific Data Management Center (ISIC)

Goal: Reduce the Data Management Overhead

• Efficiency
  • Example: parallel I/O, indexing, matching storage structures to the application
• Effectiveness
  • Example: access data by attributes, not files; facilitate massive data movement
• New algorithms
  • Example: specialized PCA techniques to separate signals or to achieve better spatial data compression
• Enabling ad-hoc exploration of data
  • Example: an exploratory "run and render" capability to analyze and visualize simulation output while the code is running

Page 7: Scientific Data Management Center (ISIC)


Approach

Use an integrated framework that:

• Provides a scientific workflow capability

• Supports data mining and analysis tools

• Accelerates storage and access to data

Simplify data management tasks for the scientist

• Hide details of the underlying parallel and indexing technology

• Permit assembly of modules using a simple graphical workflow description tool

[Diagram: SDM Framework – the Scientific Process Automation layer, the Data Mining & Analysis layer, and the Storage Efficient Access layer sit between the scientific application and scientific understanding]

Page 8: Scientific Data Management Center (ISIC)

Technology Details by Layer

[Diagram: technologies by layer]
• Scientific Process Automation (SPA) layer: workflow management tools; web-wrapping tools
• Data Mining & Analysis (DMA) layer: efficient parallel visualization (pVTK); efficient indexing (Bitmap Index); data analysis tools (PCA, ICA); ASPECT integration framework
• Storage Efficient Access (SEA) layer: Parallel NetCDF software layer; Parallel Virtual File System; Storage Resource Manager (to HPSS); ROMIO MPI-IO system
• Hardware, OS, and MSS (HPSS)

Page 9: Scientific Data Management Center (ISIC)

Accomplishments: Storage Efficient Access (SEA)

Developed Parallel netCDF
• Enables high-performance parallel I/O to netCDF datasets
• Achieves up to a 10-fold performance improvement over HDF5

Enhanced ROMIO
• Provides MPI-IO access to PVFS
• Advanced parallel file system interfaces for more efficient access

Developed PVFS2
• Adds Myrinet GM and InfiniBand support, improved fault tolerance, and asynchronous I/O
• Offered by Dell and HP for clusters

Deployed an HPSS Storage Resource Manager (SRM) with PVFS
• Automatic access of HPSS files to PVFS through the MPI-IO library
• SRM is a middleware component

[Diagram: before, processes P0-P3 write through serial netCDF to the parallel file system; after, they write through Parallel netCDF directly to the parallel file system]

[Charts: Parallel Virtual File System enhancements and deployment; shared-memory communication; FLASH I/O benchmark performance (8x8x8 block sizes)]
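As a rough illustration of the "after" picture, here is a minimal sketch of a collective parallel write using the netCDF4-python bindings with mpi4py. The center's deliverable was the PnetCDF C library, so this wrapper, the file name, and the variable name are stand-ins:

```python
# Sketch: each MPI rank writes its disjoint slice of a shared
# netCDF variable collectively (run e.g. mpiexec -n 4 python ...).
# Assumes netCDF4-python built with parallel (MPI) support.
from mpi4py import MPI
import numpy as np
from netCDF4 import Dataset

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

nc = Dataset("output.nc", "w", parallel=True, comm=comm,
             info=MPI.Info())
nc.createDimension("x", nprocs * 100)
var = nc.createVariable("temperature", "f4", ("x",))
var.set_collective(True)              # collective I/O, as in PnetCDF

local = np.full(100, rank, dtype=np.float32)
var[rank * 100:(rank + 1) * 100] = local   # rank-local slice
nc.close()
```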

Page 10: Scientific Data Management Center (ISIC)

Robust Multi-file Replication

Problem: moving thousands of files robustly
• Takes many hours
• Needs error recovery from mass storage system failures and network failures
• Approach: use Storage Resource Managers (SRMs)

Problem: too slow
• Use parallel streams
• Use concurrent transfers
• Use large FTP windows
• Pre-stage files from MSS

[Diagram: a DataMover anywhere issues SRM-COPY for thousands of files; the target SRM at LBNL issues SRM-GET one file at a time to the source SRM at NCAR, which stages files from MSS into its disk cache; GridFTP GET (pull mode) moves each file over the network into the target disk cache, where the SRM performs writes and archives files]
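The recovery pattern can be sketched generically. This is not the SRM API – transfer_file below is a hypothetical stand-in for one SRM-GET plus GridFTP pull – but it shows per-file retry with backoff and a few transfers in flight at once:

```python
# Sketch: per-file retry with backoff plus a small pool of
# concurrent transfers, mirroring DataMover-style replication.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def transfer_file(name: str) -> None:
    """Hypothetical stand-in: stage from MSS, then pull via GridFTP."""
    ...

def transfer_with_retry(name: str, attempts: int = 3) -> str:
    for i in range(attempts):
        try:
            transfer_file(name)
            return name
        except OSError:               # MSS or network failure
            time.sleep(2 ** i)        # back off before retrying
    raise RuntimeError(f"giving up on {name}")

def replicate(files: list[str], concurrency: int = 4) -> None:
    # A few transfers in flight hide per-file latency, much like
    # SRM's concurrent streams and pre-staging.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(transfer_with_retry, f) for f in files]
        for done in as_completed(futures):
            print("replicated:", done.result())
```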

Page 11: Scientific Data Management Center (ISIC)

Accomplishments: Data Mining and Analysis (DMA)

Developed Parallel-VTK
• Efficient 2D/3D parallel scientific visualization for NetCDF and HDF files
• Built on top of PnetCDF

Developed a "region tracking" tool
• For exploring 2D/3D scientific databases
• Uses bitmap technology to identify regions based on multi-attribute conditions

Implemented an Independent Component Analysis (ICA) module
• Used for accurate signal separation
• Used for discovering key parameters that correlate with observed data

Developed highly effective data reduction
• Achieves a 15-fold reduction with a high level of accuracy
• Uses parallel Principal Component Analysis (PCA) technology

Developed ASPECT
• A framework that supports a rich set of pluggable data analysis tools, including all of the tools above
• A rich suite of statistical tools based on the R package

[Chart: PVTK serial vs. parallel writer (80 MB) – time in seconds vs. number of processors]

[Figures: the El Nino signal (red) and its estimate (blue) closely match; combustion region tracking]
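For intuition about the two analysis techniques, a minimal sketch using scikit-learn stand-ins; the center's implementations were parallel and domain-specific, and the shapes and signals here are synthetic:

```python
# Sketch: PCA for data reduction, ICA for signal separation,
# with scikit-learn standing in for the center's parallel codes.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)

# Data reduction: keep 10 of 150 components, a 15-fold reduction.
data = rng.normal(size=(1000, 150))
pca = PCA(n_components=10)
reduced = pca.fit_transform(data)          # shape (1000, 10)
restored = pca.inverse_transform(reduced)  # lossy reconstruction

# Signal separation: recover independent sources from mixtures
# (analogous to isolating an El Nino signal from climate data).
t = np.linspace(0, 8, 1000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
mixing = np.array([[1.0, 0.5], [0.5, 1.0]])
mixed = sources @ mixing.T
ica = FastICA(n_components=2, random_state=0)
estimated = ica.fit_transform(mixed)       # estimated sources
```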

Page 12: Scientific Data Management Center (ISIC)

ASPECT Analysis Environment

Analysis flow: select data -> data access -> correlate -> render -> display

Select (temp, pressure)
From astro-data
Where (step=101) and (entropy>1000);

• Take a sample of (temp, pressure)
• Visualize a scatter plot in Qt
• Run a pVTK filter
• Run an R analysis

[Diagram: in the Data Mining & Analysis layer, the Select Data, Take Sample, R Analysis, and pVTK tools exchange buffers via Read Data(buffer-name)/Write Data calls; Select Data uses the bitmap index – Use Bitmap(condition), Get variables(var-names, ranges) – in the Storage Efficient Access layer (Parallel NetCDF, PVFS, Bitmap Index selection), all above Hardware, OS, and MSS (HPSS)]
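A toy sketch of the bitmap-index idea behind the Select Data step: one bitmap per condition, combined with a bitwise AND, answers the multi-attribute query without scanning every record. NumPy boolean arrays stand in for the center's compressed bitmaps, and the attribute values are synthetic:

```python
# Sketch: evaluate
#   Select (temp, pressure) Where (step=101) and (entropy>1000)
# by ANDing per-condition bitmaps, as a bitmap index would.
import numpy as np

n = 1_000_000
rng = np.random.default_rng(1)
step = rng.integers(100, 103, size=n)
entropy = rng.normal(1000.0, 50.0, size=n)
temp = rng.normal(300.0, 10.0, size=n)
pressure = rng.normal(1.0, 0.1, size=n)

# A real index precomputes and compresses per-bin bitmaps;
# here they are built on the fly as boolean arrays.
bm_step = (step == 101)
bm_entropy = (entropy > 1000.0)

hits = bm_step & bm_entropy            # bitwise AND of bitmaps
result = temp[hits], pressure[hits]    # fetch only qualifying rows
print(f"{hits.sum()} of {n} records qualify")
```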

Page 13: Scientific Data Management Center (ISIC)

Accomplishments: Scientific Process Automation (SPA)

Unique requirements of scientific workflows
• Moving large volumes between modules – tightly-coupled, efficient data movement
• Specification of granularity-based iteration – e.g., in spatio-temporal simulations a time step is a "granule"
• Support for data transformation – complex data types (including file formats, e.g. netCDF, HDF)
• Dynamic steering of the workflow by the user – dynamic user examination of results

Developed a working scientific workflow system
• Automatic microarray analysis
• Uses web-wrapping tools developed by the center
• Uses the Kepler workflow engine; Kepler is an adaptation of the UC Berkeley tool Ptolemy
• Workflow steps are defined graphically; workflow results are presented to the user
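Kepler itself is a graphical Java/Ptolemy tool, but the underlying dataflow idea – each step consumes the previous step's output – fits in a few lines; the microarray-style step names below are hypothetical:

```python
# Sketch: a workflow as an ordered chain of steps whose output
# and input types line up; step names are hypothetical.
from typing import Callable

def fetch_data(sample_id: str) -> list[float]:
    return [0.1, 0.2, 0.3]     # stand-in for a web-wrapped data source

def normalize(values: list[float]) -> list[float]:
    total = sum(values)
    return [v / total for v in values]

def report(values: list[float]) -> str:
    return "normalized: " + ", ".join(f"{v:.2f}" for v in values)

steps: list[Callable] = [fetch_data, normalize, report]
result = "GSM1234"             # hypothetical microarray sample id
for step in steps:
    result = step(result)      # output of one step feeds the next
print(result)
```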

Page 14: Scientific Data Management Center (ISIC)


GUI for setting up and running workflows

Page 15: Scientific Data Management Center (ISIC)

Re-applying Technology

Technology                 | Initial Application | New Applications
---------------------------|---------------------|--------------------------
Parallel NetCDF            | Astrophysics        | Climate
Parallel VTK               | Astrophysics        | Climate
Compressed bitmaps         | HENP                | Combustion, Astrophysics
Storage Resource Managers  | HENP                | Astrophysics
Feature Selection          | Climate             | Fusion
Scientific Workflow        | Biology             | Astrophysics (planned)

SDM technology, developed for one application, can be effectively targeted at many other applications …

Page 16: Scientific Data Management Center (ISIC)

Broad Impact of the SDM Center…

Astrophysics: high-speed storage technology, parallel NetCDF, parallel VTK, and ASPECT integration software used for Terascale Supernova Initiative (TSI) and FLASH simulations
Tony Mezzacappa – ORNL, John Blondin – NCSU, Mike Zingale – U of Chicago, Mike Papka – ANL

Climate: high-speed storage technology, parallel NetCDF, and ICA technology used for climate modeling projects
Ben Santer – LLNL, John Drake – ORNL, John Michalakes – NCAR

Combustion: compressed bitmap indexing used for fast generation of flame regions and for tracking their progress over time
Wendy Koegler, Jacqueline Chen – Sandia Lab

[Figures: ASCI FLASH – parallel NetCDF; dimensionality reduction; region growing]

Page 17: Scientific Data Management Center (ISIC)

Broad Impact (cont.)

Biology: the Kepler workflow system and web-wrapping technology used for executing complex, highly repetitive workflow tasks for processing microarray data
Matt Coleman – LLNL

High Energy Physics: compressed bitmap indexing and Storage Resource Managers used for locating desired subsets of data (events) and automatically retrieving data from HPSS
Doug Olson – LBNL, Eric Hjort – LBNL, Jerome Lauret – BNL

Fusion: a combination of PCA and ICA technology used to identify the key parameters that are relevant to the presence of edge harmonic oscillations in a tokamak
Keith Burrell – General Atomics

[Figures: building a scientific workflow; dynamic monitoring of HPSS file transfers; identifying key parameters for the DIII-D tokamak]

Page 18: Scientific Data Management Center (ISIC)

Goals for Years 4-5

Fully develop the integrated SDM framework
• Implement the 3-layer framework on the SDM center facility
• Provide a way to select only the components needed
• Develop self-guiding web pages on the use of SDM components
• Use existing successful examples as guides

Generalize components for reuse
• Develop general interfaces between components in the layers
• Support loosely-coupled WSDL interfaces
• Support tightly-coupled components for efficient dataflow

Integrate operation of components in the framework
• Hide details from the user – automate parallel access and indexing
• Develop a reusable library of components that can be selected for use in the workflow system