berkeley water center early experience prototyping a science data server for environmental data deb...

27
Berkeley Water Center Early Experience Prototyping a Science Data Server for Environmental Data Deb Agarwal, LBL ([email protected]) Catharine van Ingen, MSFT ([email protected]) 20 September 2006

Post on 20-Dec-2015

217 views

Category:

Documents


3 download

TRANSCRIPT

Berkeley Water Center

Early Experience Prototyping a Science Data Server for Environmental Data

Deb Agarwal, LBL ([email protected])

Catharine van Ingen, MSFT ([email protected])

20 September 2006

Berkeley Water Center

Outline

Landscape Data archives and other sources Typical small group collaboration needs Examples using “Ameriflux”

Science Data Server Goals and ideal capabilities Approach Experiences with the current system

Next generation Next set of development efforts Research issues

Conclusion

Berkeley Water Center

Unprecedented Data Availability

Berkeley Water Center

Typical Data Flow Today

Large Data Archives

Local measurementsModels

Berkeley Water Center 6

Ameriflux Collaboration Overview

149 Sites across the AmericasEach site reports a minimum

of 22 common measurements.Communal science – each

principle investigator acts independently to prepare and publish data.

Data published to and archived at Oak Ridge.

Total data reported to date on the order of 150M half-hourly measurements.

http://public.ornl.gov/ameriflux/

Berkeley Water Center

microbesrootswoodleafnetstoragec RRRRPNEEFF 1. Applications of eddy covariance measurements, Part 1: Lecture on Analyzing and

Interpreting CO2 Flux Measurements, Dennis Baldocchi, CarboEurope Summer Course, 2006, Namur, Belgium (http://nature.berkeley.edu/biometlab/lectures/)

What A Tower Sees

Berkeley Water Center

Example Carbon-Climate Investigations

Net carbon exchange for the ecosystem Impact of climate change on the greening of

ecosystems Start of leaf growth Duration of photosynthesis

Effects of early spring on carbon uptake Role of ecosystem and latitude on carbon flux Effect of various pollution sources on carbon in

atmosphere and carbon balance

Berkeley Water Center

Measurements Are Not Simple or Complete

Gaps in the data Quiet nights Bird poop High winds ….

Difficult to make measurements Leaf area index Wood respiration Soil respiration …

Localized measurements – tower footprint Local investigator knowledge important PIs’ science goals are not uniform across the towers

Soils

Climate

Remote SensingExamples of Carbon-Climate Datasets

Observatory datasets

Spatially continuous datasets

Berkeley Water Center

Scientific Data Server

Large Data Archives

Local measurements

Berkeley Water Center

Scientific Data Server - Goals

Act as a local repository for data and metadata assembled by a small group of scientists from a wide variety of sources Simplify provenance by providing a common “safe deposit box”

for assembled data Interact simply with existing and emerging Internet portals for data

and metadata download, and, over time, upload Simplify data assembly by adding automation Simplify name space confusion by adding explicit decode

translation Support basic analyses across the entire dataset for both data

cleaning and science Simplify mundane data handling tasks Simplify quality checking and data selection by enabling data

browsing

Berkeley Water Center

Scientific Data Server - Non-Goals

Replace the large Internet data source sites The technology developed may be applicable, but the focus is

on the group collaboration scale and usability Very large datasets require different operational practices

Perform complex modeling and statistical analyses There are a lot of existing tools with established trust based on

long track records Only part of a full LIMS (laboratory information management

system) Develop a new standard schema or controlled vocabulary

Other work on these is progressing independently Due to the heterogeneity of the data, more than one such

standard seems likely to be relevant

Berkeley Water Center

Scientific Data Server - Workflows

Staging: adding data or metadata New downloaded or field measurements

added New derived measurements added

Editing: changing data or metadata Existing older measurements re-calibrated

or re-derived Data cleaning or other algorithm changes Gap filling

Sharing: making the latest acquired data available rapidly Even before all the checks have been

made Browsing new data before more detailed

analyses Private Analysis: Supporting individual

researchers (MyDB) Stable location for personal calibrations,

derivations, and other data transformations

Import/Export to analysis tools and models Curating: data versioning and provenance

Simple parent:child versioning to track collections of data used for specific uses

Large Data Archives

Local measurements

Berkeley Water Center

Scientific Data Server - Logical Overview

DataAccess

and Analysis

Tools

Latest DatasetDatabase

Latest DatasetCube

Staging Databases

and Cubes

Last Known Good Dataset(s)

Database

Older Dataset(s)Archive

Database

Last Known Good Dataset(s) Cubes

Private Data

Analysis Databases

and Cubes

Scientific Data Server

Analysis ToolsExcel, Matlab, SPlus, SAS,

ArcGIS

Simple web data plots and

tables

BigPlot data browsing

Computational Models

Flat file data import/export

Berkeley Water Center

Databases

All descriptive metadata and data held in relational databases Metadata is important too!

While separate databases are shown, the datasets may actually reside in a single database Mapping is transparent to the

scientist Separate databases used for

performance Unified databases used for

simplicity New metadata and data are staged with

a temporary database Minimal quality checks applied All name and unit conversions

Data may be exported to flat file, copied to a private MyDb database, directly accessed programmatically, or ?

Latest DatasetDatabase

Last Known Good Dataset(s)

Database

Older Dataset(s)Archive

Database

MyDbAnalysis

Database

Staging Database

Berkeley Water Center

Data Cubes

A data cube is a database specifically for data mining (OLAP) Initially developed for commercial

needs like tracking sales of Oreos and milk

Simple aggregations (sum, min, or max) can be pre-computed for speed

Additional calculations (median) can be computed dynamically

Both operate along dimensions such as time, site, or datumtype

Constructed from a relational database

A specialized query language (MDX) is used

Client tool integrations is evolving Excel PivotTables allow simple data

viewing More powerful charting with

Tableaux or ProClarity (commercial mining tools)

Latest DatasetCube

Last Known Good Dataset(s) Cubes

Staging Data Cube

MyDbAnalysis

Data Cubes

Berkeley Water Center

Browsing For Data AvailabilitySites Reporting Data Colored by Year

Ameriflux Data Availability : All Data

Bra

zil -

- T

apaj

os (

San

tare

m,K

mB

razi

l --

Tap

ajos

(S

anta

rem

,Km

Can

ada

- B

orea

s 18

50C

anad

a --

BO

RE

AS

NS

A -

193

0 bu

Can

ada

-- B

OR

EA

S N

SA

- 1

963

buC

anad

a --

BO

RE

AS

NS

A -

198

1 bu

Can

ada

-- B

OR

EA

S N

SA

- 1

989

buC

anad

a --

BO

RE

AS

NS

A -

199

8 bu

Can

ada

-- B

OR

EA

S N

SA

- O

ld B

laC

anad

a --

Brit

ish

Col

., C

ampb

eC

anad

a --

Let

hbrid

geU

SA

--

AK

Atq

asuk

, A

lask

aU

SA

--

AK

Bar

row

, A

lask

aU

SA

--

AK

Hap

py V

alle

y, A

lask

aU

SA

--

AK

Upa

d, A

lask

aU

SA

--

AZ

Aud

ubon

Res

earc

h R

anU

SA

--

CA

Blo

dget

t F

ores

t, C

alU

SA

--

CA

Sky

Oak

s, O

ld S

tand

,U

SA

--

CA

Sky

Oak

s, Y

oung

Sta

nU

SA

--

CA

Ton

zi R

anch

, C

alifo

rU

SA

--

CA

Vai

ra R

anch

, Io

ne,

CU

SA

--

CO

Niw

ot R

idge

For

est,

U

SA

--

CT

Gre

at M

ount

ain

For

esU

SA

--

FL

Flo

rida-

Ken

nedy

Spa

cU

SA

--

FL

Flo

rida-

Ken

nedy

Spa

cU

SA

--

FL

Sla

shpi

ne-A

ustin

Car

US

A -

- F

L S

lash

pine

-Don

alds

on,

US

A -

- F

L S

lash

pine

-Miz

e,cl

ear

US

A -

- F

L S

lash

pine

-Ray

onie

r,m

US

A -

- IL

Bon

dvill

e, I

llino

isU

SA

--

IN M

orga

n M

onro

e S

tate

U

SA

--

KS

Wal

nut

Riv

er W

ater

shU

SA

--

MA

Har

vard

For

est

EM

S T

US

A -

- M

A H

arva

rd F

ores

t he

mlo

US

A -

- M

A L

ittle

Pro

spec

t H

illU

SA

--

ME

How

land

For

est

(mai

nU

SA

--

MI

Syl

vani

a W

ilder

ness

U

SA

--

MI

Uni

v. o

f M

ich.

Bio

loU

SA

--

MO

Mis

sour

i Oza

rk S

iteU

SA

--

MS

Goo

dwin

Cre

ek,

Mis

siU

SA

--

MT

For

t P

eck,

Mon

tana

US

A -

- N

C D

uke

For

est

- lo

blol

US

A -

- N

C D

uke

For

est-

hard

woo

dU

SA

--

NE

Mea

d -

irrig

ated

con

US

A -

- N

E M

ead

- irr

igat

ed m

aiU

SA

--

NE

Mea

d -

rain

fed

mai

zeU

SA

--

OK

Litt

le W

ashi

ta W

ater

US

A -

- O

K P

onca

City

, O

klah

oma

US

A -

- O

K S

hidl

er,

Okl

ahom

aU

SA

--

OK

Sou

ther

n G

reat

Pla

inU

SA

--

OR

Met

oliu

s-fir

st y

oung

US

A -

- O

R M

etol

ius-

inte

rmed

iat

US

A -

- O

R M

etol

ius-

old

aged

po

US

A -

- S

D B

lack

Hill

s, S

outh

DU

SA

--

SD

Bro

okin

gs,

Sou

th D

akU

SA

--

TN

Wal

ker

Bra

nch

Wat

ers

US

A -

- W

A W

ind

Riv

er C

rane

Sit

US

A -

- W

I Lo

st C

reek

, W

isco

nsi

US

A -

- W

I P

ark

Fal

ls/W

LEF

, W

isU

SA

--

WI

Will

ow C

reek

, W

isco

nU

SA

--

WV

Can

aan

Val

ley,

Wes

t

2006

2005

2004

2003

2002

2001

2000

1999

1998

1997

1996

1995

1994

1993

1992

1991

Berkeley Water Center

Browsing For Data Availability Total Data Availability by Type Colored by Site

0

2,000,000

4,000,000

6,000,000

8,000,000

10,000,000

12,000,000

14,000,000

16,000,000

18,000,000

AP

AR

CO

2

DT

FC

FG

FH

2O

FP

AR

GP

P H

H2

O LE

Le

afw

etn

ess

NE

E

O3

Oth

er

PA

R

PR

EC

PR

ES

S

Rd

Rg

Rg

l

RH

Rn

Sa

SC

O2

SV

P

SW

C

TA

TA

U

Tb

ole

Td

ew

TS U

US

T

UW

VP

D

WD

WS

USA -- WV Canaan Valley, West USA -- WI Willow Creek, WisconUSA -- WI Park Falls/WLEF, WisUSA -- WI Lost Creek, WisconsiUSA -- WA Wind River Crane SitUSA -- TN Walker Branch WatersUSA -- SD Brookings, South DakUSA -- SD Black Hills, South DUSA -- OR Metolius-old aged poUSA -- OR Metolius-intermediatUSA -- OR Metolius-f irst youngUSA -- OK Southern Great PlainUSA -- OK Shidler, OklahomaUSA -- OK Ponca City, OklahomaUSA -- OK Little Washita WaterUSA -- NE Mead - rainfed maizeUSA -- NE Mead - irrigated maiUSA -- NE Mead - irrigated conUSA -- NC Duke Forest-hardw oodUSA -- NC Duke Forest - loblolUSA -- MT Fort Peck, MontanaUSA -- MS Goodw in Creek, MissiUSA -- MO Missouri Ozark SiteUSA -- MI Univ. of Mich. BioloUSA -- MI Sylvania Wilderness USA -- ME How land Forest (mainUSA -- MA Little Prospect HillUSA -- MA Harvard Forest hemloUSA -- MA Harvard Forest EMS TUSA -- KS Walnut River WatershUSA -- IN Morgan Monroe State USA -- IL Bondville, IllinoisUSA -- FL Slashpine-Rayonier,mUSA -- FL Slashpine-Mize,clearUSA -- FL Slashpine-Donaldson,USA -- FL Slashpine-Austin CarUSA -- FL Florida-Kennedy SpacUSA -- FL Florida-Kennedy SpacUSA -- CT Great Mountain ForesUSA -- CO Niw ot Ridge Forest, USA -- CA Vaira Ranch, Ione, CUSA -- CA Tonzi Ranch, CaliforUSA -- CA Sky Oaks, Young StanUSA -- CA Sky Oaks, Old Stand,USA -- CA Blodgett Forest, CalUSA -- AZ Audubon Research RanUSA -- AK Upad, AlaskaUSA -- AK Happy Valley, AlaskaUSA -- AK Barrow , AlaskaUSA -- AK Atqasuk, AlaskaCanada -- LethbridgeCanada -- British Col., CampbeCanada -- BOREAS NSA - Old BlaCanada -- BOREAS NSA - 1998 buCanada -- BOREAS NSA - 1989 buCanada -- BOREAS NSA - 1981 buCanada -- BOREAS NSA - 1963 buCanada -- BOREAS NSA - 1930 buCanada - Boreas 1850Brazil -- Tapajos (Santarem,KmBrazil -- Tapajos (Santarem,Km

Data type reporting is far from uniform across type

Berkeley Water Center

Browsing for Data Availability Total Data Availability by Site Colored by Type

0

1,000,000

2,000,000

3,000,000

4,000,000

5,000,000

6,000,000

7,000,000

Bra

zil -

- T

apaj

os (

San

tare

m,K

m

Bra

zil -

- T

apaj

os (

San

tare

m,K

m

Can

ada

- B

orea

s 18

50

Can

ada

-- B

OR

EA

S N

SA

- 1

930

bu

Can

ada

-- B

OR

EA

S N

SA

- 1

963

bu

Can

ada

-- B

OR

EA

S N

SA

- 1

981

bu

Can

ada

-- B

OR

EA

S N

SA

- 1

989

bu

Can

ada

-- B

OR

EA

S N

SA

- 1

998

bu

Can

ada

-- B

OR

EA

S N

SA

- O

ld B

la

Can

ada

-- B

ritis

h C

ol.,

Cam

pbe

Can

ada

-- L

ethb

ridge

US

A -

- A

K A

tqas

uk, A

lask

a

US

A -

- A

K B

arro

w, A

lask

a

US

A -

- A

K H

appy

Val

ley,

Ala

ska

US

A -

- A

K U

pad,

Ala

ska

US

A -

- A

Z A

udub

on R

esea

rch

Ran

US

A -

- C

A B

lodg

ett F

ores

t, C

al

US

A -

- C

A S

ky O

aks,

Old

Sta

nd,

US

A -

- C

A S

ky O

aks,

You

ng S

tan

US

A -

- C

A T

onzi

Ran

ch, C

alifo

r

US

A -

- C

A V

aira

Ran

ch, I

one,

C

US

A -

- C

O N

iwot

Rid

ge F

ores

t,

US

A -

- C

T G

reat

Mou

ntai

n F

ores

US

A -

- F

L F

lorid

a-K

enne

dy S

pac

US

A -

- F

L F

lorid

a-K

enne

dy S

pac

US

A -

- F

L S

lash

pine

-Aus

tin C

ar

US

A -

- F

L S

lash

pine

-Don

alds

on,

US

A -

- F

L S

lash

pine

-Miz

e,cl

ear

US

A -

- F

L S

lash

pine

-Ray

onie

r,m

US

A -

- IL

Bon

dville

, Illin

ois

US

A -

- IN

Mor

gan

Mon

roe

Sta

te

US

A -

- K

S W

alnu

t Riv

er W

ater

sh

US

A -

- M

A H

arva

rd F

ores

t EM

S T

US

A -

- M

A H

arva

rd F

ores

t hem

lo

US

A -

- M

A L

ittle

Pro

spec

t Hill

US

A -

- M

E H

owla

nd F

ores

t (m

ain

US

A -

- M

I Syl

vani

a W

ilder

ness

US

A -

- M

I Uni

v. o

f Mic

h. B

iolo

US

A -

- M

O M

isso

uri O

zark

Site

US

A -

- M

S G

oodw

in C

reek

, Mis

si

US

A -

- M

T F

ort P

eck,

Mon

tana

US

A -

- N

C D

uke

For

est -

lobl

ol

US

A -

- N

C D

uke

For

est-

hard

woo

d

US

A -

- N

E M

ead

- irr

igat

ed c

on

US

A -

- N

E M

ead

- irr

igat

ed m

ai

US

A -

- N

E M

ead

- ra

infe

d m

aize

US

A -

- O

K L

ittle

Was

hita

Wat

er

US

A -

- O

K P

onca

City

, Okl

ahom

a

US

A -

- O

K S

hidl

er, O

klah

oma

US

A -

- O

K S

outh

ern

Gre

at P

lain

US

A -

- O

R M

etol

ius-

first

you

ng

US

A -

- O

R M

etol

ius-

inte

rmed

iat

US

A -

- O

R M

etol

ius-

old

aged

po

US

A -

- S

D B

lack

Hills

, Sou

th D

US

A -

- S

D B

rook

ings

, Sou

th D

ak

US

A -

- T

N W

alke

r B

ranc

h W

ater

s

US

A -

- W

A W

ind

Riv

er C

rane

Sit

US

A -

- W

I Los

t Cre

ek, W

isco

nsi

US

A -

- W

I Par

k F

alls

/WLE

F, W

is

US

A -

- W

I Willo

w C

reek

, Wis

con

US

A -

- W

V C

anaa

n V

alle

y, W

est

APAR CO2 DT FC FG FH2O FPAR GPP H H2OLE Leafwetness NEE O3 Other PAR PREC PRESS Rd RgRgl RH Rn Sa SCO2 SVP SWC TA TAU TboleTdew TS U UST UW VPD WD WS

Sites report more data either because of longevity or specific research interests

Berkeley Water Center

Browsing for Data Quality

Real field data has both short term gaps and longer term outages The utility of the data depends

on the nature of the science being performed

Browsing data counts can give rapid insight into how the data can be used before more complex analyses are performed

0

1000

2000

3000

4000

5000

6000

1 2 3 4 5 6 7 8 9 10 11 12

55.86306 BOREAS NSA -1981 burn site

55.879002 BOREAS NSA -Old Black Spruce

55.90583 BOREAS NSA -1930 burn site

55.911671 BOREAS NSA -1963 burn site

55.916672 BOREAS NSA -1989 burn site

56.63583 BOREAS NSA -1998 burn site

69.133331 AK HappyValley

70.281471 AK Upad

70.496002 AK Atqasuk

Data often missing in the winter!

Soil Water Content (% Volume)

-10

-5

0

5

10

15

20

25

30

35

40

45

0 10 20 30 40 50 60

Tonzi Ranch

Vai

ra R

anch

0 (cm)

20 (cm)Measurements charted on axes are gaps

-15

-10

-5

0

5

10

15

20

25

30

20 30 40 50 60 70 80

Latitude

Deg

C

Average Temperature

What’s going on at higher latitudes? (It should be getting colder)

Data Count

Berkeley Water Center

Browsing for Data Quality

Real field data has unit and time scale conversion problems Sometimes easy to spot in

isolation Sometimes easier to spot when

comparing to other data Browsing data values can give

rapid insight into how the data can be used before more complex analyses are performed

Maximum Annual Air TemperatureGlobal Warming or Reporting in Fahrenheit?

Odd Microclimate Effects or Error in Time Reporting ?

Average Air Temperature Two Nearby Sites

Local time or GMT time?

Berkeley Water Center

Lessons Learned To Date

Metadata is as important as data Comparing sites of like vegetation, climate is as important

as latitude or other physical quantity Curate the two together

Controlled vocabularies are hard Humans like making up names and have a hard time

remembering 100+ names Assume a decode step in the staging pipeline

There are at least three database schema families and two cube construction approaches Everyone has a favorite Each has advantages and disadvantages Automate the maintenance and use the right one for the

right job Visual programming tools are great for prototyping

But debugging and maintenance can hit a wall It’s easy to overbuild – use when “good enough”

Data analysis and data cleaning are intertwined Data cleaning is always on-going Share the simple tools and visualizations

The saga continues at http://dsd.lbl.gov/BWC/amfluxblog/ and http://research.microsoft.com/~vaningen/BWC/BWC.htm

Berkeley Water Center

Near Term Futures

Improve current capabilities Assemble gap-filled and non-gap filled data sets Implement incremental data staging to enable speedy and

simple data editing by an actual scientist (rather than a programmer)

Implement expanded metadata handling to enable scientist to add site characteristics and sort sites on those expanded definitions

Add basic reporting capabilities for server-side browsing of data availability to speed and simplify locating “interesting” data

Apply Data Server capabilities to a different set of data with different (but related) science Considering either Russian River or Yosemite Valley

hydrological data Will be automating download from multiple different

national data sets Spatial (GIS) analyses more important Linkage with imagery data necessary for science

Berkeley Water Center

Longer Term Futures

Handling imagery and other remote sensing information Curating images is different from curating time series data Using both together enables new science and new insights Graphical selection and display of data

Support for user specified calculations within the database Support for direct connections to analysis and statistical

packages Linkage with models

Additional (emerging) data standards such as NetCDF Handling “just in time” data delivery and model result

curation Data mining subscription services Handling of a broader array of data types Support for workflow tools

Berkeley Water Center

Conclusions

Large data archives create the opportunity to Do science at the regional and global scale Combine data from multiple disciplines Perform historical trend analysis

Small scientific collaborations need help to Perform analyses using more data than they can

currently manage Enable data handling and versioning Store the currently needed data and metadata Browse the data for science

It’s the science, not the computer science Computer science research can certainly help

Berkeley Water Center

URLs

Berkeley Water Center (BWC)

http://esd.lbl.gov/BWC/ Microsoft Project at BWC

http://esd.lbl.gov/BWC/thrust_areas/mstci.html Ameriflux Project

http://esd.lbl.gov/BWC/thrust_areas/ameriflux.html

http://dsd.lbl.gov/BWC/amfluxblog/