Large-scale Data Management Challenges of Southern California Earthquake Center (SCEC). Philip J. Maechling (maechlin@usc.edu), Information Technology Architect, Southern California Earthquake Center. Research and Data Access and Preservation Summit, Phoenix, Arizona, 9 April 2010


Page 1: Large-scale Data Management Challenges of Southern California Earthquake Center (SCEC) Philip J. Maechling (maechlin@usc.edu) Information Technology Architect

Large-scale Data Management Challenges of Southern California Earthquake Center (SCEC)

Philip J. Maechling (maechlin@usc.edu)
Information Technology Architect

Southern California Earthquake Center
Research and Data Access and Preservation Summit

Phoenix, Arizona
9 April 2010

Page 2:

Interagency Working Group on Digital Data

(2009)

Page 3:
Page 4:

Consider the Digital Data Life Cycle

Can we Validate this Life Cycle Model against Digital Data Life Cycle Observations?

Page 5:

Digital Data Life Cycle Origination – Jan 2009

Page 6:

Digital Data Life Cycle Completion – Jan 2010

Page 7:
Page 8:

Notable Earthquakes in 2010

Page 9:

The SCEC Partnership

National Partners

International Partners

Core Institutions

Participating Institutions

Page 10:

SCEC Member Institutions (November 1, 2009)

Core Institutions (16)

California Institute of Technology; Columbia University; Harvard University; Massachusetts Institute of Technology; San Diego State University; Stanford University; U.S. Geological Survey, Golden; U.S. Geological Survey, Menlo Park; U.S. Geological Survey, Pasadena; University of California, Los Angeles; University of California, Riverside; University of California, San Diego; University of California, Santa Barbara; University of California, Santa Cruz; University of Nevada, Reno; University of Southern California (lead)

Participating Institutions (53)

Appalachian State University; Arizona State University; Berkeley Geochron Center; Boston University; Brown University; Cal-Poly, Pomona; Cal-State, Long Beach; Cal-State, Fullerton; Cal-State, Northridge; Cal-State, San Bernardino; California Geological Survey; Carnegie Mellon University; Case Western Reserve University; CICESE (Mexico); Cornell University; Disaster Prevention Research Institute, Kyoto University (Japan); ETH (Switzerland); Georgia Tech; Institute of Earth Sciences of Academia Sinica (Taiwan); Earthquake Research Institute, University of Tokyo (Japan); Indiana University; Institute of Geological and Nuclear Sciences (New Zealand); Jet Propulsion Laboratory; Los Alamos National Laboratory; Lawrence Livermore National Laboratory; National Taiwan University (Taiwan); National Central University (Taiwan); Ohio State University; Oregon State University; Pennsylvania State University; Princeton University; Purdue University; Texas A&M University; University of Arizona; UC, Berkeley; UC, Davis; UC, Irvine; University of British Columbia (Canada); University of Cincinnati; University of Colorado; University of Massachusetts; University of Miami; University of Missouri-Columbia; University of Oklahoma; University of Oregon; University of Texas-El Paso; University of Utah; University of Western Ontario (Canada); University of Wisconsin; University of Wyoming; URS Corporation; Utah State University; Woods Hole Oceanographic Institution

Page 11:

SCEC Earthquake System Models & Focus Groups

[Diagram labels: Ground Motion Prediction; Unified Structural Representation; Fault Models; Block Models; Deformation Models; Earthquake Rupture Forecasts; Seismic Hazard Products; Anelastic Structures; Attenuation Relationships; Earthquake Rupture Models; Ground Motion Simulations; Risk Mitigation Products; Crustal Deformation Modeling; Fault & Rupture Mechanics; Earthquake Forecasting & Prediction; Seismic Hazard & Risk Analysis; Tectonic Evolution & B.C.s; Lithospheric Architecture & Dynamics]

Southern California Earthquake Center

• Involves more than 600 experts at over 60 institutions worldwide

• Focuses on earthquake system science using Southern California as a natural laboratory

• Translates basic research into practical products for earthquake risk reduction, contributing to NEHRP

Page 12:

SCEC Leadership Teams

Board of Directors

Staff

Planning Committee

Page 13:

Earthquakes are system-level phenomena…

They emerge from complex, long-term interactions within active fault systems that are opaque, and thus are difficult to observe.

They cascade as chaotic chain reactions through the natural and built environments, and thus are difficult to predict.

[Figure: earthquake processes arranged on a timeline around the fault rupture origin time. Anticipation time axis: century, decade, year, month, week, day. Response time axis: 0, minute, hour, day, year, decade. Labeled processes: Tectonic loading; Stress accumulation; Slow slip transients; Stress transfer; Nucleation; Foreshocks; Fault rupture; Dynamic triggering; Seafloor deformation; Tsunami; Surface faulting; Seismic shaking; Aftershocks; Landslides; Liquefaction; Fires; Structural & nonstructural damage to built environment; Human casualties; Disease; Socioeconomic aftereffects]

Page 14:

Computational codes, structural models, and simulation results versioned with associated tests.

Development of new computational, data, and physical models.

Automated retrospective testing of forecast models using community defined validation problems.

Automated prospective performance evaluation of forecast models over time within collaborative forecast testing center.

External Seismic /Tsunami Models

Seismic Data Centers

HPC Resource Providers

Public and Governmental Forecasts

Engineering and Interdisciplinary Research

Collaborative Research Project

Individual Research Project

Real-time Earthquake Monitoring

Discovery and access to digital artifacts.

CME Platform and Data Management TAG

CME Platform and Data Administration System

Contribution and annotation of digital artifacts.

CME cyberinfrastructure supports a broad range of research computing with computational and data resources.

Programmable Interfaces

Page 15:

Future of solid earth computational science

Page 16:

Echo Cliffs PBR

Echo Cliffs PBR in the Santa Monica Mountains is >14m high and has a 3-4s free period. This rock withstood ground motions estimated at 0.2g and 12 cm/s during the Northridge earthquake. Such fragile geologic features give important constraints on PSHA.

Page 17:

Simulate Observed Earthquakes

Page 18:

Then, validate the simulation model by comparing simulation results against observational data recorded by seismic sensors.

(red: simulation results, black: observed data)

Page 19:

Simulate Potential Future Earthquakes

Page 20:

SAN DIEGO SUPERCOMPUTER CENTER, UCSD

SCEC Roadmap to Petascale Earthquake Computing

[Timeline axis: 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012]

TeraShake 1.x

ShakeOut 1.x

TeraShake 2.x

ShakeOut 2.x

Chino Hills 1.x

M8 1.x

M8 2.x

M8 3.1

96% parallel efficiency on 40K T.J. Watson BG/L cores.

BGW

First large wave propagation simulations of Mw7.7 earthquakes on the southern San Andreas with maximum frequency of 0.5Hz, run using kinematic source descriptions based on the Denali earthquake. 240 SDSC DataStar cores used; 53 TB of outputs, the largest simulation outputs recorded.

Simulations of Mw7.7 earthquakes in 2005-2006 using source descriptions generated by dynamic rupture simulations. The dynamic rupture simulations were based on Landers initial stress conditions and used 1024 NCSA TG cores.

Simulations of Mw7.8 earthquakes with max frequency of 1.0Hz, run using kinematic source descriptions based on geological observations. 1920 TACC Lonestar cores used.

Simulations of Mw7.8 earthquakes with max 1.0Hz using source descriptions generated by SGSN dynamic rupture simulations. The ShakeOut 2.x dynamic rupture simulations were constructed to produce final surface slip equivalent to the ShakeOut 1.x kinematic sources. 32K TACC Ranger cores used.

Comparison of simulated and recorded ground motions for the 2008 Mw5.4 Chino Hills earthquake. Two simulations were conducted using meshes extracted from the CMU eTree database for CVM4 and CVM-H. 64K NICS Kraken cores used.

Simulations of Mw8.0 scenario on SAF from the Salton Sea to Parkfield ('Wall-to-Wall'), up to 1.0Hz. The source description was generated by combining several Mw7.8 dynamic source descriptions ('ShakeOut-D'). 96K NICS Kraken cores used.

40-m spacing and 435 billion mesh points. M8 2.x to run on 230K NCCS Jaguar cores, the world's most powerful machine.

SciDAC OASCR Award

TeraGrid Viz Award

The most read article of year

15 million SUs awarded, largest NSF TG allocation

INCITE allocations

M8 3.2

New model under development to deal with complex geometry, topography and non-planar fault surfaces.

Big 10

Simulation of 9.0 Megaquake in Pacific Northwest

ShakeOut verification with 3 models

BG/L

TACC Ranger

ALCF BG/P

NICS Kraken

Wave propagation simulation based on improved source descriptions: dx=25m, Mw8.0, 2-Hz, 2,048 billion mesh points, 256x bigger than current runs

Dynamic rupture simulation, dx=5m (50 x 25 x 25km). Improve earthquake source descriptions by integrating more realistic friction laws into dynamic rupture simulations and computing at large scales including inner-scale of friction processes and outer-scale of large faults

Page 21:
Page 22:

SCEC: An NSF + USGS Research Center

Page 23:

Panel Questions

• What technical solutions exist that meet your academic project requirements?

• What requirements are unique to the academic environment?

• Are there common approaches for managing large-scale collections?

Page 24:

Simulation Results Versus Data

• Context of this workshop is Research Data Management.
– I would like to communicate characteristics of the data management needed to perform seismic hazard computational research.

• I will refer to our simulation results as “data”
– Some groups distinguish observational data from simulation results
– This distinction becomes more difficult as observation and simulation results are combined.

• For today’s presentation, I will focus on management of SCEC simulation results, which may include both observational data and simulation results.

Page 25:

SCEC Storage Volume by Type

Estimated SCEC Data Archives (Total Current Archives ~ 1.4 PB)

Page 26:

SCEC Storage Elements (Files,Rows) by Type

Estimated SCEC Data Archives (Total Current Archives ~ 100M files, 600M rows)

Page 27:

Consider the Digital Data Life Cycle

Estimated SCEC Simulation Archives in Terabytes by Storage Location

Page 28:

Goal:
• 1 Hz body waves
• Up to 0.5 Hz surface waves

Sources & Receivers:
• 150 three-component stations [Nr]
• 200 earthquakes [Ns]

Simulation parameters:
• 200 m, 1872 M mesh points
• 2 min time series, 12000 time steps

Costs:
• 2 TB per SWF
• 6 TB per RGT
• 2 hr per run
• 10.4 M CPU-hrs (650 runs, 3.6 months on 4000 cores)
• 400 - 600 TB
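The run count and CPU-hour figures on this slide can be reproduced with simple arithmetic. The sketch below assumes that each earthquake needs one run and each three-component station needs three runs (one per component), and that a month has 30 days. Neither assumption is stated on the slide, although both reproduce the quoted numbers.

```python
# Back-of-envelope check of the cost figures quoted on the slide.
N_EARTHQUAKES = 200          # Ns, from the slide
N_STATIONS = 150             # Nr, three-component stations, from the slide
COMPONENTS = 3               # assumption: one run per station component

# Assumption: one run per earthquake plus one run per station component.
runs = N_EARTHQUAKES + COMPONENTS * N_STATIONS

CORES = 4000                 # from the slide
MONTHS = 3.6                 # from the slide
HOURS_PER_MONTH = 30 * 24    # assumption: 30-day month

cpu_hours = MONTHS * HOURS_PER_MONTH * CORES

print(f"runs = {runs}")
print(f"CPU-hours = {cpu_hours / 1e6:.1f} M")
```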

Page 29:

Data Management Context for SCEC

• Academic research groups responding to NSF proposals. Aggressive, large-scale, collaborative with need for transformative, innovative, original research (bigger, larger, faster)

• Data management tools and processes managed by heavily burdened academic staff

Page 30:

Data Management Context for SCEC

• Academic research very cost sensitive for new technologies

• HPC capabilities largely based on integrating existing cyberinfrastructure (CI) (not new CI development)

• Largely based on use of other people's computers and storage systems (resulting in widely distributed archives)

Page 31:

Panel Questions

• What technical solutions exist that meet your academic project requirements?

• What requirements are unique to the academic environment?

• Are there common approaches for managing large-scale collections?

Page 32:

SCEC Milestone Capability Runs

Milestone Runs          TS1         TS2         DS2         SO1         SO2         CH50m       W2W-1       CH15m*      M8          W2W-3**
Machine                 SDSC        SDSC        NCSA        TACC        TACC        NICS        NICS        NICS        NCCS        NCSA
                        DataStar    DataStar    IA-64       LoneStar    Ranger      Kraken      Kraken      Kraken      Jaguar      Blue Waters
Outer scale (km)        600         600         299         600         600         180         800         183         810         800
Inner scale (m)         200         200         100         100         100         50          100         15          40          25
Max Frequency (Hz)      0.5         0.5         1.0         1.0         1.0         2.0         1.0         3.3         1.0         2.0
Min Surface Vel (m/s)   500         500         500         500         500         500         500         250         200         250
Mesh Points             1.8E+09     1.8E+09     9.6E+08     1.4E+10     1.4E+10     1.1E+10     3.1E+10     3.0E+11     4.4E+11     2.0E+12
Time Steps              22,768      22,768      13,637      45,456      50,000      80,000      60,346      100,000     120,000     320,000
Vel. Model Input (TB)   0.05        0.05        0.03        0.42        0.42        0.31        0.89        6.87        12.68       59.60
Storage w/o ckpt (TB)   53.0        10.0        9.5         0.5         0.5         1.9         0.3         66.4        39.9        400.0
Cores used              240         1,920       1,024       1,920       32,000      64,000      96,000      96,000      223,080     320K**
Wall-Clock-Time (hrs)   66.8        6.7         35.2        32.0        6.9         2.3         2.5         24          21.2        45**
Sustained TeraFlop/s    0.04        0.43        0.68        1.44        7.29        26.86       50.00       87.00       174.00      1,000**

* benchmarked, ** estimated
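The inner scale, maximum frequency, and minimum surface velocity rows appear to be linked by the usual grid-resolution rule of thumb for wave propagation codes: the shortest wavelength, v_min / f_max, should span a fixed number of grid points. The sketch below computes that ratio for each run in the table. Reading the "Inner" row as the grid spacing is an assumption, and the function name is made up for illustration.

```python
# Points per minimum wavelength implied by the milestone table.
# Values are copied from the table; treating the "Inner" row as the
# grid spacing is an assumption.
RUNS = {
    # name: (grid spacing in m, max frequency in Hz, min surface velocity in m/s)
    "TS1": (200, 0.5, 500),
    "TS2": (200, 0.5, 500),
    "DS2": (100, 1.0, 500),
    "SO1": (100, 1.0, 500),
    "SO2": (100, 1.0, 500),
    "CH50m": (50, 2.0, 500),
    "W2W-1": (100, 1.0, 500),
    "CH15m": (15, 3.3, 250),
    "M8": (40, 1.0, 200),
    "W2W-3": (25, 2.0, 250),
}


def points_per_wavelength(dx_m, f_max_hz, v_min_ms):
    """Grid points spanning the shortest wavelength v_min / f_max."""
    return v_min_ms / (f_max_hz * dx_m)


ppw = {name: points_per_wavelength(*values) for name, values in RUNS.items()}
for name, value in ppw.items():
    print(f"{name:8s} {value:.2f}")
```

Under that reading, every run in the table works out to about five grid points per minimum wavelength.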

Page 33:

Data Transfer, Archive and Management

(Zhou et al., CSO’10)

Input/output data transfer between SDSC disk/HPSS and Ranger disk at transfer rates up to 450 MB/s using Globus GridFTP

90k to 120k files per simulation, 150 TB generated on Ranger, organized as a separate sub-collection in iRODS

Direct data transfer using iRODS from Ranger to SDSC SAM-QFS at up to 177 MB/s using our data ingestion tool PIPUT

Sub-collections published through SCEC digital library (168 TB in size), integrated through SCEC portal into seismic-oriented interaction environments
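To put the quoted rates in perspective, the sketch below estimates how long it would take to move the 150 TB generated on Ranger at each of the two peak rates on the slide. It assumes decimal units (1 TB = 1e6 MB) and a sustained rate equal to the quoted maximum, so real transfers would likely take longer. The function name is illustrative.

```python
def transfer_days(size_tb, rate_mb_per_s):
    """Days needed to move size_tb terabytes at a sustained rate in MB/s.

    Assumes decimal units: 1 TB = 1e6 MB.
    """
    seconds = size_tb * 1e6 / rate_mb_per_s
    return seconds / 86400


for rate in (450, 177):  # peak rates quoted on the slide, in MB/s
    print(f"150 TB at {rate} MB/s: {transfer_days(150, rate):.1f} days")
```

At these rates the transfer takes several days, which is one reason data movement has to be planned as part of the run rather than after it.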

Page 34:

CyberShake Data Management Numbers

• CyberShake
– 8.5 TB staged in (~700k files) to TACC’s Ranger
– 2.1 TB staged out (~36k files) to SCEC storage
– 190 million jobs executed on the grid
– 750,000 files stored in RLS

CyberShake map

Page 35:

CyberShake Production Run - 2009

• Run from 4/16/09 to 6/10/09
• 223 sites
– Curve produced every 5.4 hrs
• 1207 hrs (92% uptime)
– 4,420 cores on average
– 14,540 peak (23% of Ranger)
• 192 million tasks
– 44 tasks/sec
– 3.8 million Condor jobs
• 192 million files
– 11 TB output, 165 TB temp
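Two of the derived figures on this slide follow directly from the totals. The short check below divides the run time by the number of sites and the task count by the run time in seconds; all inputs are taken from the slide.

```python
SITES = 223          # hazard curve sites, from the slide
WALL_HOURS = 1207    # hours of production running, from the slide
TASKS = 192e6        # total tasks, from the slide

hours_per_curve = WALL_HOURS / SITES
tasks_per_second = TASKS / (WALL_HOURS * 3600)

print(f"one curve every {hours_per_curve:.1f} hrs")
print(f"{tasks_per_second:.0f} tasks/sec")
```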

Page 36:

Challenge: Millions of tasks

• Automation is key
– Workflows with clustering
• Include all executions, staging, notification
– Job submission

• Data management
– Millions of data files
– Pegasus provides staging
– Automated checks
• Correct number of files
• NaN, zero-value checks
• MD5 checksums
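The three automated checks listed above can be sketched with the Python standard library. This is an illustration only, not the SCEC implementation: it assumes outputs are raw little-endian float32 files with a `.bin` extension, and the function names are hypothetical.

```python
import hashlib
import math
import struct
from pathlib import Path


def md5sum(path):
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_values(path):
    """Return a list of problems found in a raw float32 file (assumed format)."""
    data = Path(path).read_bytes()
    if not data or len(data) % 4:
        return ["empty or truncated file"]
    values = struct.unpack(f"<{len(data) // 4}f", data)
    problems = []
    if any(math.isnan(v) for v in values):
        problems.append("contains NaN")
    if all(v == 0.0 for v in values):
        problems.append("all values are zero")
    return problems


def check_run(directory, expected_count, expected_md5=None):
    """Apply the file-count, value, and checksum checks to one output directory."""
    report = {}
    files = sorted(Path(directory).glob("*.bin"))
    if len(files) != expected_count:
        report["<directory>"] = [
            f"expected {expected_count} files, found {len(files)}"
        ]
    for path in files:
        problems = check_values(path)
        wanted = (expected_md5 or {}).get(path.name)
        if wanted is not None and md5sum(path) != wanted:
            problems.append("MD5 mismatch")
        if problems:
            report[path.name] = problems
    return report
```

An empty report means every check passed. With millions of files, checks like these would normally run as workflow jobs next to the data rather than as a separate pass.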

Page 37:

What is a DAG workflow?

Jobs with dependencies organized in Directed Acyclic Graphs (DAGs)

A large number of similar DAGs makes up a workflow
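As a small illustration of the idea, the sketch below builds several similar DAGs and derives a valid execution order for each using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The job and site names are hypothetical. A production system would hand such graphs to a workflow engine rather than order them in a script.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def site_dag(site):
    """Build one small DAG; keys are jobs, values are the jobs they wait for."""
    return {
        f"{site}/stage_in": set(),
        f"{site}/simulate_x": {f"{site}/stage_in"},
        f"{site}/simulate_y": {f"{site}/stage_in"},
        f"{site}/combine": {f"{site}/simulate_x", f"{site}/simulate_y"},
        f"{site}/stage_out": {f"{site}/combine"},
    }


# A workflow made of many similar DAGs, one per (hypothetical) site.
workflow = {site: site_dag(site) for site in ("site_a", "site_b", "site_c")}

# static_order() yields each job only after all of its dependencies.
schedule = {
    site: list(TopologicalSorter(dag).static_order())
    for site, dag in workflow.items()
}

for site, jobs in schedule.items():
    print(site, "->", ", ".join(jobs))
```

The two `simulate` jobs have no dependency on each other, so a scheduler is free to run them in parallel. The acyclic property is what guarantees that a valid order exists.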


Page 38:

GlobusWORLD 2003: The Globus View of Data Architecture

Virtual data language
– Users define desired transformations
– Logical names for data and transformations

Virtual data catalog
– Stores information about transformations, derivations, logical inputs/outputs

Query tool
– Retrieves necessary transformations given a description of them
– Gives an abstract workflow

Pegasus
– Tool for executing abstract workflows on the grid

Virtual Data Toolkit (VDT): part of GriPhyN and iVDGL projects
– Includes existing technology (Globus, Condor) and experimental software (Chimera, Pegasus)

GriPhyN Virtual Data System

[Diagram labels: GriPhyN VDT (Replica Catalog, DAGman, Globus Toolkit, etc.); Data Grid Resources (distributed execution and data management); VDL API/CLI (manipulate derivations and transformations); Virtual Data Catalog (implements Chimera Virtual Data Schema); Virtual Data Applications; Virtual Data Language; XML; Chimera; Task Graphs (compute and data movement tasks, with dependencies)]
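The virtual data idea on this slide can be illustrated with a toy catalog. In the sketch below, each logical output name maps to the transformation that derives it and the logical inputs it needs, and a query walks the catalog to produce an abstract workflow for a requested product. This is not the Chimera or Pegasus API: every name here is hypothetical, and the catalog is assumed to be acyclic.

```python
# Toy "virtual data catalog": logical output -> (transformation, logical inputs).
# All names are hypothetical and chosen only for illustration.
CATALOG = {
    "hazard_curve": ("compute_curve", ["seismograms"]),
    "seismograms": ("synthesize", ["green_tensors", "rupture_set"]),
    "green_tensors": ("wave_propagation", ["velocity_mesh"]),
}


def abstract_workflow(target, catalog, have=()):
    """Return the steps needed to derive `target`, inputs before consumers.

    `have` lists logical names that already exist and need no derivation.
    Assumes the catalog contains no cycles.
    """
    steps, done = [], set(have)

    def visit(name):
        if name in done:
            return
        if name not in catalog:
            raise KeyError(f"no derivation or existing copy for {name!r}")
        transformation, inputs = catalog[name]
        for item in inputs:
            visit(item)
        steps.append((transformation, tuple(inputs), name))
        done.add(name)

    visit(target)
    return steps


plan = abstract_workflow(
    "hazard_curve", CATALOG, have={"velocity_mesh", "rupture_set"}
)
for transformation, inputs, output in plan:
    print(f"{transformation}({', '.join(inputs)}) -> {output}")
```

The plan is abstract in that it refers only to logical names. Mapping those names to physical files and compute sites is the job the slide assigns to Pegasus.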

Page 39:

Functional View of Grid Data Management

Location based on data attributes

Location of one or more physical replicas

State of grid resources, performance measurements and predictions

Metadata Service

Application

Replica Location Service

Information Services

Planner: Data location, replica selection, selection of compute and storage nodes

Security and Policy

Executor: Initiates data transfers and computations

Data Movement

Data Access

Compute Resources Storage Resources
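The planner's replica selection step can be sketched as a lookup followed by a ranking. The example below stands in for the replica location service with a dictionary from logical file names to physical URLs, and for the information services with a table of predicted transfer rates. The host names, URLs, and rates are all made up, and real services expose much richer interfaces than this.

```python
from urllib.parse import urlparse

# Hypothetical replica catalog: logical file name -> physical replica URLs.
REPLICAS = {
    "lfn://scec/example/seismogram_0001": [
        "gsiftp://storage-a.example.org/data/seismogram_0001",
        "gsiftp://storage-b.example.org/data/seismogram_0001",
    ],
}

# Hypothetical information-service snapshot: host -> predicted MB/s.
PREDICTED_MB_PER_S = {
    "storage-a.example.org": 40.0,
    "storage-b.example.org": 120.0,
}


def select_replica(logical_name, replicas, bandwidth):
    """Pick the physical replica on the host with the best predicted rate."""
    candidates = replicas.get(logical_name, [])
    if not candidates:
        raise LookupError(f"no replica registered for {logical_name}")
    return max(
        candidates,
        key=lambda url: bandwidth.get(urlparse(url).hostname, 0.0),
    )


best = select_replica(
    "lfn://scec/example/seismogram_0001", REPLICAS, PREDICTED_MB_PER_S
)
print(best)
```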

Page 40:

Panel Questions

• What technical solutions exist that meet your academic project requirements?

• What requirements are unique to the academic environment?

• Are there common approaches for managing large-scale collections?

Page 41:

Treat Simulation Data as Depreciating Asset

Simulation results differ from observational data:
- Tend to be larger
- Can (often) be recomputed
- Often decrease in value with time
- Have less well-defined metadata

Page 42:

Collaborate with Existing Data Center

Avoid re-inventing Data Management Centers:
- (Re)train observational data centers to manage simulation data
- Change the culture so deleting data is acceptable

Page 43:

Simulation Data as Depreciating Asset

Manage simulation results as a depreciating asset:
- Unique persistent IDs for all data sets
- Track cost to produce, and cost to re-generate, for every data set
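One way to picture this bookkeeping is a small record per data set that carries an identifier and both cost figures, plus a rule that compares storage cost with regeneration cost. The sketch below is a hypothetical illustration: the field names and cost units are assumptions, and a random UUID stands in for a properly registered persistent identifier.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SimulationDataset:
    """Bookkeeping record for one simulation data set (illustrative fields)."""

    name: str
    size_tb: float
    cost_to_produce_cpu_hours: float
    cost_to_regenerate_cpu_hours: float
    # Assumption: a random UUID stands in for a registered persistent ID.
    dataset_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def cheaper_to_regenerate(self, years, storage_cost_per_tb_year,
                              cost_per_cpu_hour):
        """True if recomputing once costs less than storing for `years`."""
        storage = self.size_tb * years * storage_cost_per_tb_year
        regenerate = self.cost_to_regenerate_cpu_hours * cost_per_cpu_hour
        return regenerate < storage


big = SimulationDataset("wavefield_volume", 100.0, 1e6, 1e4)
small = SimulationDataset("hazard_curves", 0.1, 1e6, 1e4)
for record in (big, small):
    verdict = record.cheaper_to_regenerate(
        years=5, storage_cost_per_tb_year=100.0, cost_per_cpu_hour=0.05
    )
    print(record.name, record.dataset_id, "regenerate" if verdict else "keep")
```

The prices in the usage lines are placeholders. The point is only that recording both costs for every data set makes a keep-or-delete decision computable.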

Page 44:

Simulation Data as Depreciating Asset

Responsibilities of researchers who want a lot of storage:

- Default storage lifetime is always limited
- Longer-term storage is based on community use, community value, and readiness for use by community
- Burden on researchers for long-term storage is more time adding metadata

Page 45:

Remove the Compute/Data Distinction

Compute models should always have associated verification and validation results and data sets should always have codes demonstrating access and usage.

Apply automated acceptance tests for all codes and access retrieval codes for all data sets.

Page 46:

Data Storage Entropy Resistance

Data sets will grow to fill storage
- We recognize the need to encourage efficient storage practices as routine

Page 47:

Data Storage Entropy Resistance

We are looking for data management tools that provide project management with tools to administer simulation results project-wide by providing information such as:

- Total project and user storage in use
- Time since last access for data
- Understanding of backups and replicas
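A minimal version of the first two reports can be produced by walking a directory tree with the Python standard library. The sketch below sums bytes per owner (as numeric user IDs) and records the longest time since any file was last accessed. The function name is hypothetical, and access times are only as reliable as the file system's atime settings.

```python
import os
import time
from collections import defaultdict


def storage_report(root, now=None):
    """Summarize bytes per owner and the longest time since last access."""
    now = time.time() if now is None else now
    bytes_by_owner = defaultdict(int)
    max_idle_days = 0.0
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            try:
                info = os.stat(os.path.join(folder, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            bytes_by_owner[info.st_uid] += info.st_size
            idle_days = (now - info.st_atime) / 86400
            max_idle_days = max(max_idle_days, idle_days)
    return {
        "total_bytes": sum(bytes_by_owner.values()),
        "bytes_by_owner": dict(bytes_by_owner),
        "max_days_since_access": max_idle_days,
    }
```

A project-wide tool would also need to cover archives held at remote centers and track replicas, which a local directory walk cannot see.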

Page 48:

Metadata Strategies

Development of simulation metadata led to extended effort with minimal value to geoscientists:

- Ontology development as a basis for metadata has not (yet?) shown significant value in the field.

- Difficulty based on need to anticipate all possible future uses.

Page 49:

Controlled Vocabulary Tools

Controlled vocabulary management based on community-based wiki systems with subjects and terms used as tags in simulation data descriptions:

- Need tools for converting wiki labels and entries to relational database entries

- Need smooth integration between relational database (storing metadata) and wiki system
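As a rough illustration of the kind of conversion tool meant here, the sketch below pulls category tags out of wiki page text and stores them as rows in a relational table using the standard-library `sqlite3` module. It assumes MediaWiki-style `[[Category:...]]` tags. The page titles, table name, and schema are hypothetical.

```python
import re
import sqlite3

# Assumption: vocabulary terms appear as MediaWiki-style category tags.
CATEGORY_TAG = re.compile(r"\[\[Category:([^\]|]+)")


def load_terms(connection, pages):
    """Store every (term, page) pair found in `pages` (title -> wiki text)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS vocabulary ("
        "term TEXT NOT NULL, page TEXT NOT NULL, UNIQUE(term, page))"
    )
    for title, text in pages.items():
        for term in CATEGORY_TAG.findall(text):
            connection.execute(
                "INSERT OR IGNORE INTO vocabulary (term, page) VALUES (?, ?)",
                (term.strip(), title),
            )
    connection.commit()


conn = sqlite3.connect(":memory:")
load_terms(conn, {
    "TeraShake": "Notes [[Category:Wave propagation]] [[Category:Kinematic source]]",
    "ShakeOut": "Notes [[Category:Wave propagation]]",
})
rows = conn.execute(
    "SELECT term, COUNT(*) FROM vocabulary GROUP BY term ORDER BY term"
).fetchall()
print(rows)
```

The `UNIQUE` constraint with `INSERT OR IGNORE` lets the loader be re-run after wiki edits without duplicating rows.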

Page 50:

Metadata Strategies

Current simulation metadata based on practical use cases:

- Metadata saved to support reproduction of data analysis described in publications.
- Metadata saved as needed to re-run simulation.
- Unanticipated future uses of simulation data often not supported.

Page 51:

End