eScience Data Management, Bill Howe, PhD, eScience Institute: It’s not just size that matters, it’s what you can do with it


Page 1

eScience Data Management

Bill Howe, PhD
eScience Institute

It’s not just size that matters, it’s what you can do with it

Page 2

3/12/09 Bill Howe, eScience Institute 2

from eScience Rollout, 11/5/08

me

Page 3

My Background

BS, Industrial and Systems Engineering, GA Tech, 1999

Big 3 consulting with Deloitte, 99-00
• Residual guilt from call centers of consultants burning $50k/day

Independent consulting, 00-01
• Microsoft, Siebel, Schlumberger, Verizon

PhD, Computer Science, Portland State University, 2006 (via OGI)
• Dissertation: “GridFields: Model-Driven Data Manipulation in the Physical Sciences”, Advisor: David Maier

Postdoc and Data Architect, 06-08
• NSF Science and Technology Center for Coastal Margin Observation and Prediction (CMOP)

Page 4

All Science is becoming eScience

Old model: “Query the world” (data acquisition coupled to a specific hypothesis)
New model: “Download the world” (data acquired en masse, independent of hypotheses)
But: acquisition now outpaces analysis

• Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
• Medicine: ubiquitous digital records, MRI, ultrasound
• Oceanography: high-resolution models, cheap sensors, satellites
• Biology: automated PCR, high-throughput sequencing

“Increase Data Collection Exponentially in Less Time, with FlowCAM”

Empirical X → Analytical X → Computational X → X-informatics

Page 5

The long tail is getting fatter:

notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)

The Long Tail

[Figure: long-tail curve; axes labeled "data inventory" and "ordinal position"]

Researchers with growing data management challenges but limited resources for cyberinfrastructure

• No dedicated IT staff

• Overreliance on simple tools (e.g., spreadsheets)

[Figure labels: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), Ocean Modelers, Seismologists, Microbiologists, <Spreadsheet users>]

“The future is already here. It’s just not very evenly distributed.” -- William Gibson

Page 6

Heterogeneity also drives costs

[Figure: projects plotted by "# of bytes" against "# of data types"]

CERN (~15PB/year, particle interactions)

LSST (~100PB; images, objects)

PanSTARRS (~40PB; images, objects, trajectories)

OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)

SDSS (~100TB; images, objects)

Biologists (~10TB; sequences, alignments, annotations, BLAST hits, metadata, phylogenetic trees)

Page 7

Facets of Data Management

Web Services

Query Languages

Storage Management

Visualization; Workflow

Data Integration, Knowledge Extraction, Crawlers

Access Methods

Data Mining, Distributed Programming Models, Provenance

complexity-hiding interfaces

The DB maxim: push computation to the data
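
A minimal sketch of the maxim, assuming a hypothetical table of salinity readings in an in-memory SQLite database (the table, column names, and values are illustrative and not from the talk). The first function ships every matching row to the client and averages there; the second lets the database compute the aggregate, so only one number leaves it.

```python
import sqlite3


def make_db():
    """Build a small in-memory database with hypothetical salinity readings."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (station TEXT, salinity REAL)")
    rows = [("A", 30.0), ("A", 32.0), ("B", 10.0), ("B", 14.0)]
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return conn


def mean_client_side(conn, station):
    """Pull every matching row to the client, then average in Python."""
    cursor = conn.execute(
        "SELECT salinity FROM readings WHERE station = ?", (station,)
    )
    values = [row[0] for row in cursor]
    return sum(values) / len(values)


def mean_in_database(conn, station):
    """Push the aggregate to the data: a single value crosses the wire."""
    (avg,) = conn.execute(
        "SELECT AVG(salinity) FROM readings WHERE station = ?", (station,)
    ).fetchone()
    return avg


conn = make_db()
print(mean_client_side(conn, "A"), mean_in_database(conn, "A"))
```

Both functions return the same answer; the difference is how much data moves, which is what dominates once the table no longer fits on the client.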

Page 8

Example: Relational Databases

At IBM Almaden in the 60s and 70s, Codd worked out a formal basis for tabular data representation, organization, and access [Codd 70].

The early systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!

Now: $10B market, de facto standard for data management. SQL is “intergalactic dataspeak”

physical data independence

logical data independence
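
One way to picture the two kinds of independence, as a sketch only: the schema, view, and index names below are hypothetical, and SQLite stands in for any relational system. The application query is written once against the name `samples`; the underlying tables are then reorganized (logical change) and an index is added (physical change) without touching that query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Version 1 of the schema: one wide table, exposed through a view.
conn.execute(
    "CREATE TABLE samples_v1 (id INTEGER PRIMARY KEY, site TEXT, depth_m REAL)"
)
conn.execute("INSERT INTO samples_v1 VALUES (1, 'plume', 2.5)")
conn.execute("CREATE VIEW samples AS SELECT id, site, depth_m FROM samples_v1")

# The application only ever issues this query.
APP_QUERY = "SELECT site, depth_m FROM samples WHERE id = 1"
before = conn.execute(APP_QUERY).fetchone()

# Logical change: split the data into two tables and redefine the view.
conn.execute("DROP VIEW samples")
conn.execute("CREATE TABLE sites (site_id INTEGER PRIMARY KEY, site TEXT)")
conn.execute(
    "CREATE TABLE samples_v2 (id INTEGER PRIMARY KEY, site_id INTEGER, depth_m REAL)"
)
conn.execute("INSERT INTO sites VALUES (10, 'plume')")
conn.execute("INSERT INTO samples_v2 VALUES (1, 10, 2.5)")
conn.execute(
    """CREATE VIEW samples AS
       SELECT s.id, t.site, s.depth_m
       FROM samples_v2 AS s JOIN sites AS t ON s.site_id = t.site_id"""
)
after = conn.execute(APP_QUERY).fetchone()

# Physical change: an index alters how rows are found, not what is returned.
conn.execute("CREATE INDEX idx_samples_site ON samples_v2 (site_id)")
indexed = conn.execute(APP_QUERY).fetchone()

print(before, after, indexed)
```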

Page 9

Medium-Scale Data Management Toolbox

Relational Databases

Scientific Workflow Systems

Science “Mashups”

“Dataspace” systems

The “hammer” of data management

[Howe, Freire, Silva, et al. 2008]

[Howe, Green-Fishback, Maier, 2009]

[Howe, Maier, Rayner, Rucker 2008]

Page 10

Large-Scale Data Management Toolbox

Amazon S3

Dryad

MapReduce

Parallel programming via relational algebra plus type safety, monitoring, debugging (Michael Isard, Microsoft Research)

Parallel programming using functional programming abstractions (Google)

Howe, Freire, Silva: 2009 NSF CluE Award
Connolly, Gardner: 2009 NSF CluE Award

RDBMS-like features in the cloud
Note: cost effectiveness unclear for large datasets
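
The "functional programming abstractions" behind MapReduce can be sketched on a single machine. This is an illustration of the programming model only, not Google's implementation or any real framework's API; the function names and the (month, salinity) records are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply the user's mapper to each record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework would do between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def map_reduce(records, mapper, reducer):
    """Run map, shuffle, and reduce in sequence on one machine."""
    groups = shuffle(map_phase(records, mapper))
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Hypothetical input: (month, surface salinity in psu).
readings = [("Feb", 28.0), ("May", 12.0), ("Feb", 30.0), ("May", 16.0)]


def mapper(record):
    month, salinity = record
    yield month, (salinity, 1)  # carry a count so means can be combined


def reducer(a, b):
    return (a[0] + b[0], a[1] + b[1])  # associative: safe to apply in any order


totals = map_reduce(readings, mapper, reducer)
means = {month: total / count for month, (total, count) in totals.items()}
print(means)
```

The user supplies only `mapper` and `reducer`; because the reducer is associative, a real system is free to partition the groups across many machines.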

Page 11

Current Activities

Consulting: Armbrust Lab (next slide)

Research: MapReduce for Oceanographic Simulations (+ Visualization and Workflow)

Page 12

Consulting: Armbrust Lab

Initial Goal: Corral and inventory all relevant data
• SOLiD sequencer: potentially 0.5 TB / day, flat files
• Metadata: small relational DB + Rails/Django web app
• Data Products: visualizations, intermediate results
• Ad hoc scripts and programs

Initial Goal: Amplify programmer effort
• Change is constant: No “one size fits all” solution; ad hoc development is the norm
• Strategy: Teach biologists to “fish” (David Schruth’s R course)
• Strategy: Develop an infrastructure that enables and encourages reuse -- scientific workflow systems

key idea: these are data too

Page 13

Scientific Workflow Systems

Value proposition: More time on science, less time on code

How: By providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, efficiency

• Provenance
• Automatic task-parallelism
• Visual programming
• Caching
• Domain-specific toolkits

Many examples from eScience and DB communities: Trident (MSR), Taverna (Manchester), Kepler (UCSD), VisTrails (Utah), more
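
Two of the features listed above, provenance and caching, can be illustrated with a toy runner. This is a sketch under stated assumptions: the `Workflow` class and the three steps are hypothetical and do not reflect the API of Trident, Taverna, Kepler, or VisTrails.

```python
import hashlib
import json


class Workflow:
    """Toy workflow runner that caches step results and records provenance."""

    def __init__(self):
        self.cache = {}
        self.provenance = []

    def run(self, step, *inputs):
        # Key the cache on the step name and its (JSON-serializable) inputs.
        payload = json.dumps([step.__name__, inputs])
        key = hashlib.sha256(payload.encode()).hexdigest()
        hit = key in self.cache
        if not hit:
            self.cache[key] = step(*inputs)
        # Provenance: which step ran, on what, and whether it was recomputed.
        self.provenance.append(
            {"step": step.__name__, "inputs": inputs, "cached": hit}
        )
        return self.cache[key]


def load(n):
    return list(range(n))


def square(values):
    return [v * v for v in values]


def total(values):
    return sum(values)


wf = Workflow()
result = wf.run(total, wf.run(square, wf.run(load, 4)))
again = wf.run(total, wf.run(square, wf.run(load, 4)))  # served from the cache
print(result, [entry["cached"] for entry in wf.provenance])
```

The provenance log doubles as a record of how a result was produced, which is the reproducibility argument made on this slide.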

Page 14

Photo: The Trident Scientific Workflow Workbench for Oceanography, developed by Microsoft Research, demonstrated at Microsoft’s TechFest 2008.

http://www.microsoft.com/mscorp/tc/trident.mspx

Page 15

Screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah

Page 16

Screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah

Page 17

Bill Howe @ CMOP computes salt flux using GridFields

Erik Anderson @ Utah adds vector streamlines and adjusts opacity

Bill Howe @ CMOP adds an isosurface of salinity

Peter Lawson adds discussion of the scientific interpretation

source: VisTrails (Silva, Freire, Anderson) and GridFields (Howe)

Page 18

Strategy at Armbrust Lab

1. Develop a benchmark suite of workflow exemplars and use them to evaluate workflow offerings

2. “Let a hundred flowers blossom” -- deploy multiple solutions in practice to assess user uptake

3. “Pay as you go” -- evolve a toolkit rather than attempt a comprehensive, monolithic data management juggernaut.

Informed by two of Jim Gray’s Laws of Data Engineering:
• Start with “20 queries”
• Go from “working to working”

Page 19

NSF Award: Cluster Exploratory (CluE)

Partnership between NSF, IBM, Google

Data-intensive computing: “I/O farm”
• massive queries, not massive simulations
• “in ferro” experiments

To “Cloud-Enable” GridFields and VisTrails
• Goal: 10+-year climatologies at interactive speeds
• Requires turning over up to 25TB in < 5s
• Provenance, reproducibility, visualization: VisTrails
• Connect rich desktop experience to cloud query engine

Co-PIs from University of Utah: Claudio Silva and Juliana Freire
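
A back-of-the-envelope reading of the 25TB-in-5s target, as a sketch: the only figures taken from the slide are 25 TB and 5 s. The per-disk rate of 100 MB/s is an assumption chosen for illustration, and decimal units (1 TB = 10^12 bytes) are assumed throughout.

```python
import math

TB = 10**12  # decimal terabyte, assumed
MB = 10**6   # decimal megabyte, assumed


def required_bandwidth(bytes_to_scan, seconds):
    """Aggregate scan rate needed to touch every byte within the deadline."""
    return bytes_to_scan / seconds


def disks_needed(bandwidth, per_disk_rate):
    """Number of disks that must stream in parallel to reach that rate."""
    return math.ceil(bandwidth / per_disk_rate)


bandwidth = required_bandwidth(25 * TB, 5)   # bytes per second
disks = disks_needed(bandwidth, 100 * MB)    # 100 MB/s per disk is hypothetical

print(f"{bandwidth / TB:.0f} TB/s aggregate, about {disks} disks in parallel")
```

Under these assumptions the target implies roughly 5 TB/s of aggregate I/O, which is why the slide frames the problem as an “I/O farm” rather than a compute farm.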

Page 20

Amdahl’s Laws

Gene Amdahl (1965): Laws for a balanced system

i. Parallelism: max speedup is (S+P)/S, where S is the serial time and P the parallelizable time

ii. One bit of IO/sec per instruction/sec (BW)

iii. One byte of memory per one instruction/sec (MEM)

iv. One IO per 50,000 instructions (IO)

Modern multi-core systems move farther away from Amdahl’s Laws (Bell, Gray and Szalay 2006)

For a Blue Gene the BW=0.001, MEM=0.12.

For the JHU cluster BW=0.5, MEM=1.04
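
The laws above reduce to a few ratios, sketched here. The parallelism bound is written as (S + P) / S with S the serial time and P the parallelizable time; BW and MEM follow the definitions in items ii and iii. The example machine figures are hypothetical and are not the Blue Gene or JHU numbers.

```python
def max_speedup(serial, parallel):
    """Upper bound on speedup with unlimited processors: (S + P) / S."""
    return (serial + parallel) / serial


def speedup(serial, parallel, n):
    """Speedup on n processors when only the parallel part is divided."""
    return (serial + parallel) / (serial + parallel / n)


def amdahl_numbers(instr_per_sec, io_bits_per_sec, memory_bytes):
    """Return (BW, MEM); a balanced system has both close to 1."""
    bw = io_bits_per_sec / instr_per_sec   # bits of I/O per instruction
    mem = memory_bytes / instr_per_sec     # bytes of memory per instruction/sec
    return bw, mem


# A job that is 10% serial can never run more than 10x faster.
print(max_speedup(1, 9), speedup(1, 9, 9))

# Hypothetical machine: 1e9 instructions/s, 1e8 bits/s of I/O, 2e9 bytes of RAM.
print(amdahl_numbers(1e9, 1e8, 2e9))
```

A BW value far below 1, as in the second example, marks a system that is starved for I/O relative to its compute, which is the imbalance the slide attributes to modern multi-core machines.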

source: Alex Szalay, keynote, eScience 2008

Page 21

Climatology

[Figure (animation): Average Surface Salinity by Month, Columbia River Plume, 1999-2006. Panels: Feb, May. Map labels: Columbia River, Washington, Oregon. Color scale in psu.]

Page 22

[Figure: panels numbered 1 through 31, with a color scale in psu; one panel is marked (b)]

Page 23

Epilogue

We’re here to help!

SIG Wiki: https://sig.washington.edu/itsigs/SIG_eScience

eScience Blog: http://escience.washington.edu/blog/

eScience website: http://www.washington.edu/uwtech/escience.html

Page 25

eScience requirements are Fractal

William Gibson -- “The future is already here. It’s just not very evenly distributed.”

Page 26

[Diagram: eScience, with the labels High-Performance Computing, Data Management, Consulting, Online Collaboration Tools, CS Research]

Page 27

It’s what you can do with it

Relational database: SQL, plus UDTs and UDFs as needed

FASTA databases: Alignments, rarefaction curves, phylogenetic trees, filtering

MapReduce: Roll your own

Dryad: Relational algebra available; you can still roll your own if needed
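
As an illustration of “SQL, plus UDFs as needed”, the sketch below registers a Python function as a user-defined function in SQLite and uses it inside a query. The `gc_content` function, the `reads` table, and the sequences are hypothetical examples; `create_function` is the standard `sqlite3` call for registering a scalar UDF.

```python
import sqlite3


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence (None for empty input)."""
    if not seq:
        return None
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


conn = sqlite3.connect(":memory:")
# Register the Python function under the SQL name gc_content, taking 1 argument.
conn.create_function("gc_content", 1, gc_content)

conn.execute("CREATE TABLE reads (id TEXT, seq TEXT)")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?)",
    [("r1", "GGCC"), ("r2", "ATAT"), ("r3", "ATGC")],
)

# The domain-specific logic now composes with ordinary SQL filtering.
high_gc = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM reads WHERE gc_content(seq) >= 0.5 ORDER BY id"
    )
]
print(high_gc)
```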

Page 28

A data deluge in all fields

Acquisition eventually outpaces analysis
• Astronomy: SDSS, now LSST; PanSTARRS
• Biology: PCR, SOLiD sequencing
• Oceanography: high-resolution models, cheap sensors
• Marine Microbiology: flow cytometer

Empirical X → Analytical X → Computational X → X-informatics

“Increase Data Collection Exponentially in Less Time, with FlowCAM”

Page 29

[Diagram: eScience Research, with the labels High-Performance Computing, Data Management, Consulting, Online Collaboration, Community Building, Technology Transfer]

Page 30

Query Languages

• Organize and encapsulate access methods
• Raise the level of abstraction beyond GPLs
• Identify and exploit opportunities for algebraic optimization

What is algebraic optimization? Consider the expression x/z + y/z.
x/z + y/z = (x + y)/z, but the latter is less expensive since it involves only one division operation.

• Tables -- SQL
• XML -- XQuery, XPath
• RDF -- SPARQL
• Streams -- StreamSQL, CQL
• Meshes (e.g., Finite Element Sims) -- GridFields
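
The x/z + y/z example can be turned into a toy rewrite rule. This is a sketch of the idea only, assuming expressions are represented as nested tuples; it is not how any particular query optimizer is implemented, and the function names are hypothetical.

```python
# An expression is a variable name (str) or a tuple (op, left, right),
# where op is "+" or "/".


def count_divisions(expr):
    """Count division operators, used here as a crude cost measure."""
    if isinstance(expr, str):
        return 0
    op, left, right = expr
    return (1 if op == "/" else 0) + count_divisions(left) + count_divisions(right)


def factor_common_divisor(expr):
    """Rewrite a/c + b/c as (a + b)/c, working bottom-up."""
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    left = factor_common_divisor(left)
    right = factor_common_divisor(right)
    both_divisions = (
        op == "+"
        and not isinstance(left, str)
        and not isinstance(right, str)
        and left[0] == "/"
        and right[0] == "/"
    )
    if both_divisions and left[2] == right[2]:
        return ("/", ("+", left[1], right[1]), left[2])
    return (op, left, right)


def evaluate(expr, env):
    """Evaluate an expression given variable bindings in env."""
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    a, b = evaluate(left, env), evaluate(right, env)
    return a + b if op == "+" else a / b


original = ("+", ("/", "x", "z"), ("/", "y", "z"))
optimized = factor_common_divisor(original)
env = {"x": 6, "y": 2, "z": 4}

print(optimized, count_divisions(original), count_divisions(optimized))
print(evaluate(original, env), evaluate(optimized, env))
```

A query language makes this kind of rewrite possible because the system, not the programmer, sees the whole expression; the same principle lets a database reorder joins or push filters below them.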

Page 31

Example: Relational Databases (In Codd we Trust…)

At IBM Almaden in the 60s and 70s, Codd worked out a formal basis for working with tabular data [1].

The early relational systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!

[1] E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM 13(6), pp. 377-387, 1970

The Database Game: do the same thing as Codd, but with new data types: XML (trees), RDF (graphs), streams, DNA sequences, images, arrays, simulation results, etc.

Page 32

Gray’s Laws of Data Engineering

Jim Gray:
• Scientific computing is revolving around data
• Need scale-out solution for analysis
• Take the analysis to the data!
• Start with “20 queries”
• Go from “working to working”

DISSC: Data Intensive Scalable Scientific Computing

slide source: Alex Szalay, keynote, eScience 2008

Page 33

Data Management