Simulation Informatics: Analyzing Large Datasets from Scientific Simulations


DESCRIPTION

A talk I gave at the Purdue CS&E Seminar Series.

TRANSCRIPT

Page 1: Simulation Informatics; Analyzing Large Scientific Datasets

Simulation Informatics: Analyzing Large Datasets from Scientific Simulations

DAVID F. GLEICH, PURDUE UNIVERSITY, COMPUTER SCIENCE DEPARTMENT

PAUL G. CONSTANTINE, STANFORD UNIVERSITY

JOE RUTHRUFF & JEREMY TEMPLETON, SANDIA NATIONAL LABS

CS&E Seminar David Gleich · Purdue 1

Page 2: Simulation Informatics; Analyzing Large Scientific Datasets

This talk is a story …

CS&E Seminar David Gleich · Purdue 2

Page 3: Simulation Informatics; Analyzing Large Scientific Datasets

How I learned to stop worrying and love the simulation!

CS&E Seminar David Gleich · Purdue 3

Page 4: Simulation Informatics; Analyzing Large Scientific Datasets

I asked … Can we do UQ on PageRank?

CS&E Seminar David Gleich · Purdue 4

Page 5: Simulation Informatics; Analyzing Large Scientific Datasets

Google's PageRank

[Figure: a small six-node web graph, nodes labeled 1-6.]

The model:
1. Follow edges uniformly at random with probability α, and
2. randomly jump with probability 1 − α; we'll assume everywhere is equally likely.

The places we find the surfer most often are important pages.

CS&E Seminar David Gleich · Purdue 5
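To make the random-surfer model concrete, here is a minimal numpy sketch (not from the talk) that solves the PageRank linear system (I − αP)x = (1 − α)v; the six-node adjacency matrix and α = 0.85 are illustrative assumptions, since the slide's graph is only a figure.

import numpy as np

# A hypothetical six-node web graph; this adjacency matrix is only an illustrative stand-in.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

P = (A / A.sum(axis=1, keepdims=True)).T   # column-stochastic transition matrix
alpha = 0.85                               # probability of following an edge
v = np.full(6, 1.0 / 6.0)                  # uniform jump: "everywhere is equally likely"

# The PageRank linear system: (I - alpha P) x = (1 - alpha) v.
x = np.linalg.solve(np.eye(6) - alpha * P, (1 - alpha) * v)
print(x)   # the largest entries mark the "important" pages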

Page 6: Simulation Informatics; Analyzing Large Scientific Datasets

Random alpha PageRank (RAPr)

Model PageRank as the random variable x(A) and look at E[x(A)] and Std[x(A)].

Note: "RAPr" as in "wrapper," not "rapper."

Explored in Constantine and Gleich, WAW 2007; and Constantine and Gleich, J. Internet Mathematics, 2011.

Random alpha PageRank, or PageRank meets UQ.

Which sensitivity?

(I − αP)x = (1 − α)v

Sensitivity to the links: examined and understood.
Sensitivity to the jump: examined, understood, and useful.
Sensitivity to α: less well understood.

CS&E Seminar David Gleich · Purdue 6
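As a concrete illustration of the RAPr idea (not the talk's implementation), here is a minimal Monte Carlo sketch: draw α from a Beta distribution, solve one PageRank system per sample, and report the empirical E[x(A)] and Std[x(A)]. The Beta parameters are assumptions, and P and v are the column-stochastic matrix and jump vector from the previous sketch.

import numpy as np

def pagerank(P, alpha, v):
    # Solve (I - alpha P) x = (1 - alpha) v for a column-stochastic P.
    return np.linalg.solve(np.eye(len(v)) - alpha * P, (1 - alpha) * v)

def rapr_monte_carlo(P, v, a=16.0, b=4.0, lo=0.0, hi=1.0, nsamples=1000, seed=0):
    # Monte Carlo estimate of E[x(A)] and Std[x(A)] for A ~ Beta(a, b) mapped to [lo, hi];
    # the distribution parameters here are assumed, not the talk's choices.
    rng = np.random.default_rng(seed)
    alphas = lo + (hi - lo) * rng.beta(a, b, size=nsamples)
    xs = np.array([pagerank(P, al, v) for al in alphas])   # one PageRank solve per sample
    return xs.mean(axis=0), xs.std(axis=0)

# Usage with the P and v from the previous sketch:
# mean_x, std_x = rapr_monte_carlo(P, v)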

Page 7: Simulation Informatics; Analyzing Large Scientific Datasets

Convergence theory

Method                            Conv. rate   Work required                   What is N?
Monte Carlo                       1/√N         N PageRank systems              number of samples from A
Path damping (without Std[x(A)])  r^(N+2)      N + 1 matrix-vector products    terms of the Neumann series
Gaussian quadrature               r^(2N)       N PageRank systems              number of quadrature points

Here a, b, l, and r are the parameters of the distribution A ∼ Beta(a, b, l, r): a Beta density with shape parameters a and b supported on [l, r].

Random alpha PageRank has a rigorous convergence theory.

CS&E Seminar David Gleich · Purdue 7
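To connect with the quadrature row of the table, here is a hedged sketch that uses Gauss-Jacobi nodes, which correspond to a Beta density after an affine map from [-1, 1] to [lo, hi]; each node costs one PageRank solve. The Beta parameters and the reuse of a pagerank() helper are assumptions, not the talk's code.

import numpy as np
from scipy.special import roots_jacobi

def pagerank(P, alpha, v):
    return np.linalg.solve(np.eye(len(v)) - alpha * P, (1 - alpha) * v)

def rapr_gauss_quadrature(P, v, a=16.0, b=4.0, lo=0.0, hi=1.0, npoints=8):
    # Estimate E[x(A)] for A ~ Beta(a, b) on [lo, hi] with N = npoints quadrature nodes.
    # Beta(a, b) on [0, 1] corresponds to the Jacobi weight (1-x)^(b-1) (1+x)^(a-1) on [-1, 1].
    nodes, weights = roots_jacobi(npoints, b - 1.0, a - 1.0)
    alphas = lo + (hi - lo) * (nodes + 1.0) / 2.0
    xs = np.array([pagerank(P, al, v) for al in alphas])   # N PageRank systems, as in the table
    return (weights[:, None] * xs).sum(axis=0) / weights.sum()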

Page 8: Simulation Informatics; Analyzing Large Scientific Datasets

Working with PageRank showed us how to treat UQ more generally …

CS&E Seminar David Gleich · Purdue 8

Page 9: Simulation Informatics; Analyzing Large Scientific Datasets

Constantine, Gleich, and Iaccarino. Spectral Methods for Parameterized Matrix Equations, SIMAX, 2010.

Constantine, Gleich, and Iaccarino. A factorization of the spectral Galerkin system for parameterized matrix equations: derivation and applications, SISC 2011.

We studied parameterized matrices:

A(s)x(s) = b(s)

a discretized PDE with explicit parameters s and a parameterized solution x(s). Sampling the parameter gives the decoupled systems

A(s_1)x(s_1) = b(s_1), …, A(s_N)x(s_N) = b(s_N),

or, with a spectral Galerkin approximation x_N, systems of the form A_N(s_1)x_N(s_1) = b_N(s_1). The papers show how to compute the Galerkin solution in a weakly intrusive manner.

CS&E Seminar David Gleich · Purdue 9
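As a minimal, non-intrusive illustration of working with a parameterized matrix equation (a sampling/collocation sketch, not the spectral Galerkin method of the papers), with a placeholder 1-D operator A(s) and right-hand side b(s):

import numpy as np

def A_of_s(s, n=50):
    # Placeholder parameterized matrix: a 1-D diffusion-like operator whose
    # coefficients depend on the scalar parameter s (illustrative only).
    main = (2.0 + s) * np.ones(n)
    off = -(1.0 + 0.5 * s) * np.ones(n - 1)
    return np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

def b_of_s(s, n=50):
    # Placeholder parameterized right-hand side.
    return (1.0 + 0.1 * s) * np.ones(n)

# Sample the parameter and solve one deterministic system per sample
# (a weakly intrusive use of an existing solver).
samples = np.linspace(-1.0, 1.0, 11)
solutions = np.array([np.linalg.solve(A_of_s(s), b_of_s(s)) for s in samples])
# solutions[j] approximates x(s_j); a surrogate for x(s) can be built from these.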

Page 10: Simulation Informatics; Analyzing Large Scientific Datasets

Simulation: The Third Pillar of Science. 21st-century science in a nutshell!

Experiments are not practical or feasible. Simulate things instead.

But do we trust the simulations? We're trying!

Model fidelity, verification & validation (V&V), uncertainty quantification (UQ).

CS&E Seminar David Gleich · Purdue 10

Page 11: Simulation Informatics; Analyzing Large Scientific Datasets

The message: insight and confidence require multiple runs.

CS&E Seminar David Gleich · Purdue 11

Page 12: Simulation Informatics; Analyzing Large Scientific Datasets

The problem: a simulation run ain't cheap!

CS&E Seminar David Gleich · Purdue 12

Page 13: Simulation Informatics; Analyzing Large Scientific Datasets

Another problem: it's very hard to "modify" current codes.

CS&E Seminar David Gleich · Purdue 13

Page 14: Simulation Informatics; Analyzing Large Scientific Datasets

Large-scale nonlinear, time-dependent heat transfer problem

10^5 nodes, 10^3 time steps, 30 minutes on 16 cores

Questions: What is the probability of failure? Which input values cause failure?

CS&E Seminar David Gleich · Purdue 14

Page 15: Simulation Informatics; Analyzing Large Scientific Datasets

It's time to ask, "What can science learn from Google?" - Wired Magazine (2008)

CS&E Seminar David Gleich · Purdue 15

Page 16: Simulation Informatics; Analyzing Large Scientific Datasets

21st Century Science in a nutshell?

Simulations are too expensive. Let data provide a surrogate.

"We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot." - Wired (again)

CS&E Seminar David Gleich · Purdue 16

Page 17: Simulation Informatics; Analyzing Large Scientific Datasets

Our approach: construct an interpolating reduced order model from a budget-constrained ensemble of runs for uncertainty and optimization studies.

CS&E Seminar David Gleich · Purdue 17

Page 18: Simulation Informatics; Analyzing Large Scientific Datasets

That is, we store the runs. [Figure: supercomputer → data computing cluster → engineer.]

Each multi-day HPC simulation generates gigabytes of data.

A data cluster can hold hundreds or thousands of old simulations …

… enabling engineers to query and analyze months of simulation data for statistical studies and uncertainty quantification.

and build the interpolant from the pre-computed data.

CS&E Seminar David Gleich · Purdue 18

Page 19: Simulation Informatics; Analyzing Large Scientific Datasets

Input "Parameters

Time history"of simulation

s f

The Database

s1 -> f1 s2 -> f2

sk -> fk

f(s) =

2

66666666666664

q(x1, t1, s)...

q(xn

, t1, s)q(x1, t2, s)

...q(x

n

, t2, s)...

q(xn

, t

k

, s)

3

77777777777775

A single simulation at one time step

X =⇥f(s1) f(s2) ... f(sp)

The database as a matrix

The

simula

tion

as a

vec

tor

CS&E Seminar David Gleich · Purdue 19
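A small sketch of how the database-as-a-matrix could be assembled, assuming each stored run provides its snapshots as an array q with q[t, i] holding the value at mesh point x_i and time step t; the function names are hypothetical.

import numpy as np

def simulation_as_vector(q):
    # Stack one run's snapshots into the long vector f(s) on the slide:
    # all mesh points at t_1, then all mesh points at t_2, and so on.
    return q.reshape(-1)   # row-major flatten gives time-major blocks of mesh values

def database_as_matrix(runs):
    # Columns are f(s_1), ..., f(s_p) for the stored runs.
    return np.column_stack([simulation_as_vector(q) for q in runs])

# Example with fake data: 3 runs, 4 time steps, 5 mesh points each.
runs = [np.random.rand(4, 5) for _ in range(3)]
X = database_as_matrix(runs)   # shape (20, 3)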

Page 20: Simulation Informatics; Analyzing Large Scientific Datasets

The interpolant

Motivation: let the data give you the basis, then find the right combination.

X = [ f(s_1)  f(s_2)  …  f(s_p) ]

f(s) ≈ Σ_{j=1}^{r} u_j α_j(s), where the u_j are the left singular vectors of X!

This idea was inspired by the success of other reduced order models like POD, and by Paul's residual-minimizing idea.

CS&E Seminar David Gleich · Purdue 20

Page 21: Simulation Informatics; Analyzing Large Scientific Datasets

Why the SVD? Let's study a simple case.

X = [ g(x_i, s_j) ] (an m × p matrix of samples over mesh points x_i and a general parameter s_j) = U Σ V^T,

g(x_i, s_j) = Σ_{ℓ=1}^{r} U_{i,ℓ} σ_ℓ V_{j,ℓ} = Σ_{ℓ=1}^{r} u_ℓ(x_i) σ_ℓ v_ℓ(s_j)

Split x and s, and treat each right singular vector as samples of an unknown basis function of the parameter:

g(x_i, s) = Σ_{ℓ=1}^{r} u_ℓ(x_i) σ_ℓ v_ℓ(s),   v_ℓ(s) ≈ Σ_{j=1}^{p} v_ℓ(s_j) φ_j^(ℓ)(s)

Interpolate v_ℓ any way you wish.

CS&E Seminar David Gleich · Purdue 21

Page 22: Simulation Informatics; Analyzing Large Scientific Datasets

Method summary

1. Compute the SVD of X.
2. Compute an interpolant of the right singular vectors.
3. Approximate a new value of f(s).

CS&E Seminar David Gleich · Purdue 22
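A minimal end-to-end sketch of these three steps, assuming a one-dimensional parameter and simple linear interpolation of the right singular vectors; the coefficients α_j(s) = σ_j v_j(s) follow the separation-of-variables argument on the previous slide, and the test data here is synthetic.

import numpy as np

def build_surrogate(X, s_train, rank):
    # SVD-based interpolating surrogate: f(s) ~ sum_j u_j * sigma_j * v_j(s).
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    U, sigma, Vt = U[:, :rank], sigma[:rank], Vt[:rank, :]

    def predict(s_new):
        # Interpolate each right singular vector at the new parameter value.
        # np.interp assumes a sorted 1-D parameter grid; any interpolant would do.
        v_new = np.array([np.interp(s_new, s_train, Vt[j, :]) for j in range(rank)])
        return U @ (sigma * v_new)

    return predict

# Usage with synthetic data: 200 outputs per run, 10 training runs on a 1-D parameter grid.
s_train = np.linspace(0.0, 1.0, 10)
X = np.array([np.sin(np.linspace(0.0, 6.0, 200) * (1.0 + s)) for s in s_train]).T
f_hat = build_surrogate(X, s_train, rank=5)
print(f_hat(0.37).shape)   # (200,)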

Page 23: Simulation Informatics; Analyzing Large Scientific Datasets

[Figure: a plotted quantity with two highlighted sections, labeled A and B.]

A quiz: which section would you rather try to interpolate, A or B?

CS&E Seminar David Gleich · Purdue 23

Page 24: Simulation Informatics; Analyzing Large Scientific Datasets

How predictable is a singular vector?

Folk theorem (O'Leary 2011): the singular vectors of a matrix of "smooth" data become more oscillatory as the index increases.

Implication: the gradient of the singular vectors increases as the index increases.

v_1(s), v_2(s), …, v_t(s): predictable.   v_{t+1}(s), …, v_r(s): unpredictable.

CS&E Seminar David Gleich · Purdue 24

Page 25: Simulation Informatics; Analyzing Large Scientific Datasets

A refined method with an error model

Don't even try to interpolate the unpredictable modes; model them as noise instead:

f(s) ≈ Σ_{j=1}^{t(s)} u_j α_j(s) + Σ_{j=t(s)+1}^{r} u_j σ_j η_j,   η_j ∼ N(0, 1)

Variance[f] = diag( Σ_{j=t(s)+1}^{r} σ_j² u_j u_j^T )

But now, how to choose t(s)?

CS&E Seminar David Gleich · Purdue 25
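A hedged sketch of the refined prediction: interpolate only the first t modes and report the variance contributed by the truncated modes, matching the two formulas above. Here t is passed in directly rather than chosen by the criterion on the next slide, and the 1-D parameter with linear interpolation is an assumption.

import numpy as np

def predict_with_error_model(U, sigma, Vt, s_train, s_new, t):
    # f(s) ~ sum_{j<=t} u_j alpha_j(s) plus zero-mean noise from the truncated modes.
    v_new = np.array([np.interp(s_new, s_train, Vt[j, :]) for j in range(t)])
    mean = U[:, :t] @ (sigma[:t] * v_new)
    # Variance of the unpredictable part: diag( sum_{j>t} sigma_j^2 u_j u_j^T ).
    variance = (U[:, t:] ** 2) @ (sigma[t:] ** 2)
    return mean, variance

# Usage with U, sigma, Vt from an untruncated SVD of the database X:
# mean, var = predict_with_error_model(U, sigma, Vt, s_train, 0.37, t=4)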

Page 26: Simulation Informatics; Analyzing Large Scientific Datasets

Our current approach to choosing the predictability

t(s) is the largest τ such that

(1/σ_1) Σ_{i=1}^{τ} σ_i |∂v_i/∂s| < threshold

CS&E Seminar David Gleich · Purdue 26
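A sketch of this criterion under the assumption of a one-dimensional parameter, so that ∂v_i/∂s can be approximated by finite differences of the sampled right singular vectors; the threshold value is an arbitrary placeholder.

import numpy as np

def choose_t(sigma, Vt, s_train, s_new, threshold=0.1):
    # Largest tau with (1/sigma_1) * sum_{i<=tau} sigma_i * |dv_i/ds at s_new| < threshold.
    t, total = 0, 0.0
    for i in range(len(sigma)):
        dv = np.gradient(Vt[i, :], s_train)       # finite-difference dv_i/ds on the sample grid
        dv_at_s = np.interp(s_new, s_train, dv)   # evaluate the derivative near s_new
        total += sigma[i] * abs(dv_at_s) / sigma[0]
        if total < threshold:
            t = i + 1
        else:
            break
    return t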

Page 27: Simulation Informatics; Analyzing Large Scientific Datasets

An experimental test case

A heat equation problem with two parameters that control the material properties.

CS&E Seminar David Gleich · Purdue 27

Page 28: Simulation Informatics; Analyzing Large Scientific Datasets

Experiments

A 20-point Latin hypercube sample.

CS&E Seminar David Gleich · Purdue 28

Page 29: Simulation Informatics; Analyzing Large Scientific Datasets

Our reduced order model vs. the truth

[Figure: the reduced order model compared with the true solution, highlighting where the error is the worst.]

CS&E Seminar David Gleich · Purdue 29

Page 30: Simulation Informatics; Analyzing Large Scientific Datasets

A Large Scale Example

Nonlinear heat transfer model: 80k nodes, 300 time steps, 104 basis runs, SVD of a 24M × 104 data matrix; 500x reduction in wall clock time (100x including the SVD).

CS&E Seminar David Gleich · Purdue 30

Page 31: Simulation Informatics; Analyzing Large Scientific Datasets

Tall-and-skinny QR (and SVD) on MapReduce

PART 2

CS&E Seminar David Gleich · Purdue 31

Page 32: Simulation Informatics; Analyzing Large Scientific Datasets

Quick review of QR

QR factorization: let A be a real m × n matrix with m ≥ n. Then A = QR, where Q is m × n and orthogonal (Q^T Q = I) and R is n × n and upper triangular.

Using QR for regression: the least-squares solution of min ‖Ax − b‖ is given by the solution of Rx = Q^T b.

QR is block normalization: "normalizing" a vector usually generalizes to computing the Q factor in the QR factorization of a block of columns.

[Figure: block picture of A = QR, with a zero block below R.]

CS&E Seminar David Gleich · Purdue 32
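A tiny numpy illustration of the regression use of QR: solve min ‖Ax − b‖ through Rx = Qᵀb; the random A and b are placeholders.

import numpy as np

A = np.random.rand(1000, 5)       # tall-and-skinny placeholder data
b = np.random.rand(1000)

Q, R = np.linalg.qr(A)            # A = QR: Q is 1000 x 5 with orthonormal columns, R is 5 x 5 upper triangular
x = np.linalg.solve(R, Q.T @ b)   # least-squares solution from R x = Q^T b

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # matches the direct least-squares solve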

Page 33: Simulation Informatics; Analyzing Large Scientific Datasets

Intro to MapReduce

Originated at Google for indexing web pages and computing PageRank.

The idea: bring the computations to the data. Express algorithms in data-local operations. Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer.

[Figure: MapReduce data flow from input splits through maps, shuffle, and reduces. Input is stored in triplicate; map output is persisted to disk before the shuffle; reduce input and output are on disk. Data scalable; fault tolerance by design.]

CS&E Seminar David Gleich · Purdue 33

Page 34: Simulation Informatics; Analyzing Large Scientific Datasets

Mesh point variance in MapReduce

[Figure: three simulation runs (Run 1, Run 2, Run 3), each with time steps T=1, T=2, T=3.]

CS&E Seminar David Gleich · Purdue 34

Page 35: Simulation Informatics; Analyzing Large Scientific Datasets

Mesh point variance in MapReduce

1. Each mapper outputs the mesh points with the same key.
2. The shuffle moves all values from the same mesh point to the same reducer.
3. The reducers just compute a numerical variance.

Bring the computations to the data! A sketch of this pattern follows below.

[Figure: the three runs (each with time steps T=1, T=2, T=3) feeding mappers, a shuffle, and reducers.]

CS&E Seminar David Gleich · Purdue 35
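Here is the sketch referred to above, written in the same hadoopy style as the TSQR code later in the talk; the record layout (key = (run id, mesh point id), value = that mesh point's values over time) is an assumption, not the talk's actual format.

import numpy, hadoopy

def mapper(key, value):
    # Assumed record layout: key = (run_id, mesh_id), value = list of values over time.
    run_id, mesh_id = key
    yield mesh_id, value                # re-key by mesh point so all runs meet at one reducer

def reducer(mesh_id, values):
    data = numpy.array(list(values))    # one row per run for this mesh point
    yield mesh_id, data.var(axis=0)     # variance across runs, one value per time step

if __name__ == '__main__':
    hadoopy.run(mapper, reducer)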

Page 36: Simulation Informatics; Analyzing Large Scientific Datasets

Communication-avoiding QR (Demmel et al. 2008): communication-avoiding TSQR.

First, do QR factorizations of each local matrix. Second, compute a QR factorization of the stack of new "R" factors.

Demmel et al., 2008. Communication-avoiding parallel and sequential QR.

CS&E Seminar David Gleich · Purdue 36

Page 37: Simulation Informatics; Analyzing Large Scientific Datasets

Serial QR factorizations (Demmel et al. 2008): fully serial TSQR.

Compute the QR factorization of the first block, read the next block, update the QR factorization, and so on.

Demmel et al., 2008. Communication-avoiding parallel and sequential QR.

CS&E Seminar David Gleich · Purdue 37

Page 38: Simulation Informatics; Analyzing Large Scientific Datasets

Tall-and-skinny matrix storage in MapReduce

The key is an arbitrary row id; the value is the array of entries for that row. Each submatrix A_1, A_2, A_3, A_4 is an input split.

CS&E Seminar David Gleich · Purdue 38

Page 39: Simulation Informatics; Analyzing Large Scientific Datasets

[Figure: MapReduce TSQR dataflow. Mapper 1 runs serial TSQR over blocks A1-A4 and emits R4; Mapper 2 runs serial TSQR over blocks A5-A8 and emits R8; Reducer 1 runs serial TSQR on the collected R factors and emits the final R.]

The algorithm
Data: rows of a matrix.
Map: QR factorization of rows.
Reduce: QR factorization of rows.

CS&E Seminar David Gleich · Purdue 39

Page 40: Simulation Informatics; Analyzing Large Scientific Datasets

Key limitations

It computes only R and not Q. We can get Q via Q = AR⁺ with another MapReduce iteration (we currently use this for computing the SVD), but that has dubious numerical stability; iterative refinement helps. We are working on better ways to compute Q (with Austin Benson, Jim Demmel). A small local sketch of the Q = AR⁺ step follows below.

CS&E Seminar David Gleich · Purdue 40
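A local numpy/scipy sketch of the Q = AR⁺ idea, written as a triangular solve (Q = AR⁻¹, since R is square and invertible for full-rank A); in the MapReduce setting this multiply is the extra iteration mentioned above, and the random matrix is a placeholder.

import numpy as np
from scipy.linalg import solve_triangular

A = np.random.rand(10000, 20)                    # placeholder tall-and-skinny matrix
R = np.linalg.qr(A, mode='r')                    # R alone, as the MapReduce TSQR produces it

Q = solve_triangular(R.T, A.T, lower=True).T     # Q = A R^{-1} via a triangular solve
print(np.allclose(Q.T @ Q, np.eye(20), atol=1e-6))   # check that Q has orthonormal columns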

Page 41: Simulation Informatics; Analyzing Large Scientific Datasets

Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        # Compute a QR factorization of the buffered rows and keep only R.
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # Reset the buffer and re-initialize it to the rows of R.
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)

CS&E Seminar David Gleich · Purdue 41

Page 42: Simulation Informatics; Analyzing Large Scientific Datasets

Lots of data? Add an iteration.

Too many maps? Add an iteration!

[Figure: two-iteration TSQR. In iteration 1, mappers 1-1 through 1-4 each run serial TSQR over their blocks of A and emit local R factors, which are shuffled to reducers 1-1 through 1-3 that run serial TSQR again. In iteration 2, an identity map and a single reducer (2-1) combine the remaining R factors into the final R.]

CS&E Seminar David Gleich · Purdue 42

Page 43: Simulation Informatics; Analyzing Large Scientific Datasets

Summary of parameters (mrtsqr)

Blocksize: how many rows to read before computing a QR factorization, expressed as a multiple of the number of columns (see the paper).
Splitsize: the size of each local matrix.
Reduction tree: the number of reducers and iterations to use.

[Figure: the TSQR dataflow annotated with the iterations of the reduction tree.]

CS&E Seminar David Gleich · Purdue 43

Page 44: Simulation Informatics; Analyzing Large Scientific Datasets

Varying splitsize and the reduction tree (synthetic data)

Cols.  Iters.  Split (MB)  Maps  Secs.
  50     1        64       8000   388
  50     1       256       2000   184
  50     1       512       1000   149
  50     2        64       8000   425
  50     2       256       2000   220
  50     2       512       1000   191
1000     1       512       1000   666
1000     2        64       6000   590
1000     2       256       2000   432
1000     2       512       1000   337

Increasing the split size improves performance (it accounts for Hadoop data movement).

Increasing the number of iterations helps for problems with many columns.

(1000 columns with a 64 MB split size overloaded the single reducer.)

CS&E Seminar David Gleich · Purdue 44

Page 45: Simulation Informatics; Analyzing Large Scientific Datasets

MapReduce TSQR summary

MapReduce is great for TSQR!
Data: a tall-and-skinny (TS) matrix, stored by rows.
Map: QR factorization of local rows.
Reduce: QR factorization of local rows.

Input: a 500,000,000-by-100 matrix. Each record is a 1-by-100 row. HDFS size: 423.3 GB.
Time to compute the norm of each column: 161 sec.
Time to compute R in qr(A): 387 sec.

On a 64-node Hadoop cluster with 4x2 TB disks, one Core i7-920, and 12 GB of RAM per node.

Demmel et al. showed that this construction works to compute a QR factorization with minimal communication.

CS&E Seminar David Gleich · Purdue 45

Page 46: Simulation Informatics; Analyzing Large Scientific Datasets

Our vision: to enable analysts and engineers to hypothesize from data computations instead of expensive HPC computations.

Paul G. Constantine

Sandia: Jeremy Templeton, Joe Ruthruff

… and you? …

CS&E Seminar David Gleich · Purdue 46