simulation informatics


DESCRIPTION

A high-level vision for a new project on how to use MapReduce computers for simulating physical phenomena

TRANSCRIPT

Page 1: Simulation Informatics

SIMULATION INFORMATICS

A data-driven methodology for

synthesis from simulation results

David F. Gleich

John von Neumann Postdoctoral Fellow

Sandia National Laboratories

Livermore, CA

David Gleich (Sandia) 5/5/2011 1/18

Page 2: Simulation Informatics

A bit more background on me

Before

Undergrad: Harvey Mudd College (joint CS/Math)

Internships: Yahoo! and Microsoft

Graduate: Stanford (Computational and Mathematical Engineering). Thesis: Models and Algorithms for PageRank Sensitivity

Internships: Intel, joint Library of Congress/Stanford, Microsoft

Post-doc 1: Univ. of British Columbia

Now

Post-doc 2: 2010 John von Neumann fellow @ Sandia

In August

Tenure-track position in Purdue’s CS department


Page 3: Simulation Informatics

It’s time to ask:

What can science

learn from Google?

--Wired Magazine (2008)

http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

Page 4: Simulation Informatics

A tale of two computers …

Cielo (#10 in the Top 500)
Separate storage system
This platform is awesome for simulating physical systems.
Cost: ~$50 million

SNL/CA Nebula Cluster
2 TB/core storage
This platform is awesome for working with data.
Cost: ~$150k

Page 5: Simulation Informatics

Simulation: The Third Pillar of Science

21st Century Science in a nutshell

Experiments are not practical / feasible.

Simulate things instead.

But do we trust the simulations?

We’re trying!

Model Fidelity

Verification & Validation (V&V)

Uncertainty Quantification (UQ)

Insight and confidence require multiple runs.

Page 6: Simulation Informatics

The problem: Simulation ain’t cheap!

21.1st Century Science

in a nutshell?

Simulations are too expensive.

Let data provide a surrogate.

Input parameters s → time history of simulation f

The Database
s1 -> f1
s2 -> f2
…
sk -> fk

“We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.” --Wired (again)

Page 7: Simulation Informatics

Our CSRF project: Simulation Informatics

To enable analysts and engineers to hypothesize from inexpensive data computations instead of expensive HPC computations.

The People

Myself

Paul G. Constantine (2009 von Neumann)

Stefan Domino

Jeremy Templeton

Joe Ruthruff


Page 8: Simulation Informatics

The idea and vision: Store the runs!


[Diagram: Supercomputer → MapReduce cluster → Engineer]

Each multi-day HPC simulation generates gigabytes of data. A MapReduce cluster can hold hundreds or thousands of old simulations … enabling engineers to query and analyze months of simulation data for statistical studies and uncertainty quantification.

Page 9: Simulation Informatics

MapReduce

Originated at Google for indexing web pages and computing PageRank

The idea

Bring the computations to the data.

Express algorithms in data-local operations.

Implement one type of communication: shuffle.

Shuffle moves all data with the same key to the same reducer


Data scalable

Fault tolerance by design

[Diagram: input stored in triplicate; map output persisted to disk before the shuffle; reduce input/output on disk; stages: Maps → Shuffle → Reduce]

Page 10: Simulation Informatics

Mesh point variance in MapReduce


Three simulation runs for three time steps each.

Run 1 Run 2 Run 3

T=1 T=2 T=3 T=1 T=2 T=3 T=1 T=2 T=3


Page 11: Simulation Informatics

Mesh point variance in MapReduce

Three simulation runs for three time steps each (Run 1, Run 2, Run 3; T=1, 2, 3).

1. Each mapper outputs the mesh points with the same key.

2. Shuffle moves all values from the same mesh point to the same reducer.

3. Reducers just compute a numerical variance.

Bring the computations to the data!
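The three steps above can be sketched in a few lines of pure Python. This is a toy, single-process stand-in for the map/shuffle/reduce stages; the data, variable names, and per-point scalar values are illustrative, not the project's actual code.

```python
from collections import defaultdict

# Three simulated runs; each maps a mesh-point id to a value there.
# (Toy data; a real run would stream gigabytes of mesh output.)
runs = [
    {0: 1.0, 1: 2.0},   # run 1
    {0: 1.5, 1: 2.5},   # run 2
    {0: 2.0, 1: 3.0},   # run 3
]

def map_phase(run):
    # Each mapper emits (mesh_point_id, value) pairs; the same
    # mesh point gets the same key in every run.
    for point, value in run.items():
        yield point, value

def shuffle(mapped):
    # Shuffle groups all values with the same key onto one reducer.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_variance(values):
    # Reducer computes a (population) variance per mesh point.
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

mapped = [pair for run in runs for pair in map_phase(run)]
variances = {k: reduce_variance(vs) for k, vs in shuffle(mapped).items()}
print(variances)
```

Note that the reducer never sees a whole run, only the values for one mesh point: that is the sense in which the computation moves to the data.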

Page 12: Simulation Informatics

Hadoop???


Photo credit: New York Times (2009)
http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html

Hadoop was the name of Doug Cutting’s son’s stuffed elephant.

Hadoop: an open-source MapReduce system supported by Yahoo!


Page 13: Simulation Informatics

MapReduce and Surrogate Models

A surrogate model is a function that reproduces the output of a simulation and predicts its output at new parameter values.

[Diagram: The database of runs (s1 -> f1, s2 -> f2, …, sk -> fk) feeds an extraction step on the MapReduce cluster to build the surrogate; interpolating the surrogate on just one machine produces new samples (sa -> fa, sb -> fb, sc -> fc), which go back onto the MapReduce cluster.]
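As a toy illustration of the interpolation step, here is a one-dimensional piecewise-linear surrogate built from stored (s, f) pairs. The database values and function names are hypothetical, and a real surrogate would predict a whole time history rather than a scalar; this is only a sketch of the idea of sampling a cheap stand-in instead of rerunning the simulation.

```python
def build_surrogate(samples):
    # samples: list of (s, f) pairs pulled from old runs, where s is a
    # scalar input parameter and f a scalar output summary.
    pts = sorted(samples)

    def surrogate(s):
        # Clamp outside the sampled range, linearly interpolate inside it.
        if s <= pts[0][0]:
            return pts[0][1]
        if s >= pts[-1][0]:
            return pts[-1][1]
        for (s0, f0), (s1, f1) in zip(pts, pts[1:]):
            if s0 <= s <= s1:
                t = (s - s0) / (s1 - s0)
                return f0 + t * (f1 - f0)

    return surrogate

db = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0)]  # the database: s_i -> f_i
f_hat = build_surrogate(db)
print(f_hat(0.5))  # 2.0
```

Evaluating f_hat at a new parameter value costs microseconds on one machine, versus a multi-day HPC run for the true f.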

Page 14: Simulation Informatics

MapReduce + Simulations

Thermal Race problem (Joe Ruthruff and Jeremy Templeton)

Heating an unclassified NW model to simulate a fire.

Goal: make sure the weak link fails before the strong link.

Objective: determine where parameters are interesting.

Parameters: material properties.

1000-sample parameter study: 30 min per run, 700 GB of raw data, ~64 GB of data for MapReduce.

Our results: extraction 30 min, interpolation 5 min.

Page 15: Simulation Informatics

Data-driven Green’s functions


The simulation varies most in the corners of the parameter space.

These analyses will help understand which parameters matter,

and where to continue running the simulations.


Page 16: Simulation Informatics

What can science

learn from Google?


The importance of data.

For more about this, see a talk and paper by Peter Norvig:

“Internet-Scale Data Analysis” http://www.stanford.edu/group/mmds/slides2010/Norvig.pdf

Halevy et al. “The unreasonable effectiveness of data.” IEEE Intelligent Systems, 2009.

Page 17: Simulation Informatics

The future?


MapReduce computers embedded within supercomputers as the parallel file system and data analysis platform. The best of both worlds.

Any questions?

Our work: Laying the foundation for useful surrogate models that will make data-based simulation a compelling addition to the simulation revolution.