Early Experience Prototyping a Science Data Server for Environmental Data
Deb Agarwal (LBL), Catharine van Ingen (MSFT)
25 October 2006


Page 1: Early Experience Prototyping a Science Data Server for Environmental Data Deb Agarwal (LBL) Catharine van Ingen (MSFT) 25 October 2006

Early Experience Prototyping a Science Data Server for Environmental Data

Deb Agarwal (LBL)
Catharine van Ingen (MSFT)

25 October 2006

Page 2

Outline
• Water and ecological data archives and other sources
• Typical small group collaboration needs
• Berkeley Water Center and Ameriflux collaboration
• Common problems

Page 3

Unprecedented Data Availability

Page 4

Example Carbon-Climate Datasets

[Figure labels: Soils; Climate; Remote Sensing; Observatory datasets; Spatially continuous datasets]

Page 5

Ameriflux Collaboration Overview
• 149 sites across the Americas
• Each site reports a minimum of 22 common measurements.
• Communal science: each principal investigator acts independently to prepare and publish data.
• Second-level data published to and archived at Oak Ridge.
• Total data reported to date on the order of 150M half-hourly measurements.
• http://public.ornl.gov/ameriflux/

[Figure labels: T AIR; T SOIL; Onset of photosynthesis]

Page 6

Typical Data Flow Today
• Prior to analysis, data and ancillary data must be assembled, checked, and cleaned
– Some of this is mundane (e.g., unit conversions)
– Some requires domain-specific knowledge, including instrumentation or location knowledge
– Ancillary data is often critical to understanding and using the data
• After all that, data are often misplaced, scattered, and even lost
– Provenance is in the mind of the beholder
– “Everybody knows” yet no one is sure

[Figure labels: Internet Data Archives; Local Measurements; Large Models; Legacy Sources]

Page 7

Improved Data Flow
• Local repository for data and ancillary data assembled by a small scientific collaboration from a wide variety of sources
– A common “safe deposit box”
– Versioned and logged to provide basic provenance
• Simple interactions with existing and emerging internet portals for data and ancillary data download, and, over time, upload
– Simplify data assembly by adding automation for tracking and data conversions

[Figure labels: Legacy Sources; Internet Data Archives; Local Measurements; Large Models]

Page 8

Data Curation Today
• Well-curated large government-operated sites
– Clear protocols for measurement updates, recalibrations, changes
– Emerging standards or long-standing practices for measurement naming and reported units
– http://waterdata.usgs.gov/nwis
• Somewhat curated smaller organization sites
– Best-effort use of common measurement naming and units
– As data sharing increases, “best” practices tend to emerge
– http://public.ornl.gov/ameriflux/
• Locator catalog sites
– Help locate similar data across websites
– http://www.cuahshi.org/hdas
• Everybody else
– Naming, units, and recalibrations unclear
– Moving to an ideal: http://www2.ncsu.edu/ncsu/CIL/WRRI/neuse.html

Page 9

Data Curation Challenges
• Cross-source and over-time rationalization
– Different naming and units conventions
– Distinguish derived and non-derived measurements: VPD computed from Rh
• Convert basic measurements to useful inputs for science
– Algorithms still evolving for smoothing (obviously?) data and gap-filling
– Archive tends to represent instrumentation; science tends to represent physical system
• Convert from basic science data to useful inputs for public policy
– $40K acre-foot for Central Valley irrigation water; ~80% of that is energy cost
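The derived-versus-measured distinction above ("VPD computed from Rh") can be illustrated with a short sketch. The slides do not say which formula the sites use; the Tetens approximation for saturation vapor pressure is assumed here, and the function names and the `derived_from` field are hypothetical. Inputs are air temperature in deg C and relative humidity in percent.

```python
import math


def saturation_vapor_pressure_kpa(air_temp_c: float) -> float:
    """Tetens approximation over water (an assumption, not from the slides).

    Temperature in deg C, result in kPa.
    """
    return 0.6108 * math.exp(17.27 * air_temp_c / (air_temp_c + 237.3))


def vapor_pressure_deficit_kpa(air_temp_c: float, rel_humidity_pct: float) -> float:
    """VPD = SVP(T) * (1 - RH/100), with RH given in percent."""
    if not 0.0 <= rel_humidity_pct <= 100.0:
        raise ValueError("relative humidity must be in percent, 0 to 100")
    svp = saturation_vapor_pressure_kpa(air_temp_c)
    return svp * (1.0 - rel_humidity_pct / 100.0)


def derived_record(air_temp_c: float, rel_humidity_pct: float) -> dict:
    """Tag the value as derived so it is never mistaken for a raw measurement."""
    return {
        "datumtype": "VPD",
        "value": vapor_pressure_deficit_kpa(air_temp_c, rel_humidity_pct),
        "units": "kPa",
        "derived_from": ("TA", "RH"),  # hypothetical provenance field
    }
```

Recording the inputs alongside the value is one way to keep a derived VPD distinguishable from a VPD reported directly by a site.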

[Figure: chart by site (sites in Brazil, Canada, and the USA), vertical axis 0 to 7,000,000. Legend of measurement types: APAR, CO2, DT, FC, FG, FH2O, FPAR, GPP, H, H2O, LE, Leafwetness, NEE, O3, Other, PAR, PREC, PRESS, Rd, Rg, Rgl, RH, Rn, Sa, SCO2, SVP, SWC, TA, TAU, Tbole, Tdew, TS, U, UST, UW, VPD, WD, WS]

[Figure: Average Air Temperature at Two Nearby Sites. Caption: Odd microclimate effects or error in time reporting?]

Page 10

Scientific Data Server Goals
• Act as a local repository for data and metadata assembled by a small group of scientists from a wide variety of sources
– Simplify provenance by providing a common “safe deposit box” for assembled data
• Interact simply with existing and emerging internet portals for data and metadata download, and, over time, upload
– Simplify data assembly by adding automation
– Simplify name space confusion by adding explicit decode translation
• Support basic analyses across the entire dataset for both data cleaning and science
– Simplify mundane data handling tasks
– Simplify quality checking and data selection by enabling data browsing

Page 11

Scientific Data Server Logical Overview

[Diagram labels]
Scientific Data Server
• Staging Databases and Cubes
• Latest Dataset Database
• Latest Dataset Cube
• Last Known Good Dataset(s) Database
• Last Known Good Dataset(s) Cubes
• Older Dataset(s) Archive Database
• Private Data Analysis Databases and Cubes
Data Access and Analysis Tools
• Analysis Tools: Excel, Matlab, SPlus, SAS, ArcGIS
• Simple web data plots and tables
• BigPlot data browsing
• Computational Models
• Flat file data import/export

Page 12

Data Staging Pipeline

[Diagram labels: Scheduled download from Website; Incremental Data Copy to Active Database; Basic Data Checks; Stage Data Decode; Convert to CSV Canonical Form; Load CSV files into Staging Database]

• Data can be downloaded from internet sites regularly
– Sometimes the only way to detect changed data is to compare with the data already archived
– The download is relatively cheap; the subsequent staging is expensive
• New or changed data discovered during staging
– Simple checksum before load
– Chunk checksum after decode
– Comparison query if requested
• Decode stage critical to handle the uncontrolled vocabularies
– Measurement type, location offset, quality indicators, units, derivation methods often encoded in column headers
• Incremental copy moves staged data to one or more sitesets
– Automated via siteset:site:source mapping
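The two checksum steps above can be sketched as follows. This is a minimal illustration and not the server's actual implementation: SHA-256, the 1000-row chunk size, and all function names are assumptions. A whole-file digest decides cheaply whether a download needs staging at all; per-chunk digests over the decoded rows then narrow down which part of the file changed.

```python
import hashlib
from typing import Iterable, List


def file_checksum(raw: bytes) -> str:
    """Simple checksum before load: identical downloads can skip staging."""
    return hashlib.sha256(raw).hexdigest()


def chunk_checksums(rows: Iterable[str], chunk_size: int = 1000) -> List[str]:
    """Chunk checksum after decode: one digest per block of decoded rows."""
    digests: List[str] = []
    hasher = hashlib.sha256()
    count = 0
    for row in rows:
        hasher.update(row.encode("utf-8"))
        hasher.update(b"\n")  # row separator so ["ab", "c"] differs from ["a", "bc"]
        count += 1
        if count == chunk_size:
            digests.append(hasher.hexdigest())
            hasher = hashlib.sha256()
            count = 0
    if count:  # final partial chunk
        digests.append(hasher.hexdigest())
    return digests


def changed_chunks(old: List[str], new: List[str]) -> List[int]:
    """Indexes of chunks that are new, missing, or different."""
    longest = max(len(old), len(new))
    return [
        i for i in range(longest)
        if i >= len(old) or i >= len(new) or old[i] != new[i]
    ]
```

Only the chunks reported by `changed_chunks` would then need the more expensive comparison query or incremental copy.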

Page 13

Column Decode Today

[Datumtype][repeat][_offset][_offset][extended datumtype][units]

• Datumtype: the short (<16 characters) name for the data.
– Example: TA, PREC, or LE.
• Repeat: an optional number indicating that multiple measurements were taken at the same site and offset.
– Example: TA2.
• [_offset][_offset]: major and minor part of the z offset.
– Example: SWC_10 (SWC at 10 cm) or TA_10_7 (TA at 10.7 m).
• Extended datumtype: any remaining column text.
– Example: “fir”, “E”, “sfc”, “wangrot”, “_cum”
• Units: measurement units.
– Example: w/m2, or deg C.

1243 unique column header strings now
– Roughly 70% of that due to offset or two extended datumtypes
Another ~100 arriving now
– Quality and algorithm derivation provenance
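A rough sketch of a decoder for the header pattern above follows. It rests on several assumptions not stated in the slides: datumtypes come from a known vocabulary (needed because names such as CO2 end in digits), units are separated from the name by the first space, and the names `KNOWN_DATUMTYPES`, `DecodedColumn`, and `decode_column` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabulary; a real deployment would load this from a curated table.
KNOWN_DATUMTYPES = {"TA", "TS", "TAU", "PREC", "LE", "SWC", "CO2", "H2O", "H", "VPD", "RH"}

# What may follow the datumtype: [repeat][_major][_minor][extended datumtype]
_REST = re.compile(
    r"^(?P<repeat>\d+)?(?:_(?P<major>\d+)(?:_(?P<minor>\d+))?)?(?P<extended>.*)$"
)


@dataclass
class DecodedColumn:
    datumtype: str
    repeat: Optional[int]
    offset: Optional[float]
    extended: str
    units: str


def decode_column(header: str) -> DecodedColumn:
    """Split one column header into its parts (assumed layout: 'NAME units')."""
    name, _, units = header.strip().partition(" ")
    # Longest known prefix wins so that "TAU" is not read as "TA" plus "U".
    candidates = [d for d in KNOWN_DATUMTYPES if name.upper().startswith(d)]
    if not candidates:
        raise ValueError(f"unknown datumtype in column header: {header!r}")
    datumtype = max(candidates, key=len)
    match = _REST.match(name[len(datumtype):])
    repeat = int(match["repeat"]) if match["repeat"] else None
    offset = None
    if match["major"] is not None:
        # Major and minor parts combine as major.minor, e.g. _10_7 is 10.7.
        offset = float(f"{match['major']}.{match['minor'] or '0'}")
    return DecodedColumn(datumtype, repeat, offset, match["extended"], units.strip())
```

For example, `decode_column("TA_10_7 deg C")` yields datumtype `TA`, offset `10.7`, and units `deg C`, while `decode_column("PREC_cum")` keeps `_cum` as the extended datumtype. Headers that match no known datumtype raise an error, so new vocabulary surfaces during staging instead of being loaded silently.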

Page 14

Browsing for Data Availability
Data Availability by Site

Measuring temperature is easy; deriving ecosystem production is problematic

[Figure: GPP Data Availability, by site and year (1990 to 2007)]

[Figure: TA Data Availability, by site and year (1990 to 2007)]

Page 15

Browsing for Data Applicability
• Real field data has both short-term gaps and longer-term outages due to instrument outages
– The utility of the data depends on the nature of the science being performed
– Browsing data counts can give rapid insight into how the data can be used before more complex analyses are performed

[Figure: Data Count, vertical axis 0 to 6000, horizontal axis 1 to 12. Series, labeled by latitude and site:
55.86306 BOREAS NSA - 1981 burn site
55.879002 BOREAS NSA - Old Black Spruce
55.90583 BOREAS NSA - 1930 burn site
55.911671 BOREAS NSA - 1963 burn site
55.916672 BOREAS NSA - 1989 burn site
56.63583 BOREAS NSA - 1998 burn site
69.133331 AK Happy Valley
70.281471 AK Upad
70.496002 AK Atqasuk
Caption: Data often missing in the winter!]

[Figure: Average Temperature (Deg C, -15 to 30) versus Latitude (20 to 80). Caption: What’s going on at higher latitudes? (It should be getting colder)]
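The kind of data-count browsing described on this slide can be approximated with a few lines of code. The sketch below is an assumption-laden illustration and not the BigPlot or cube implementation: records are taken to be (site, timestamp, value) tuples, missing values are represented as `None`, and `monthly_counts` is a hypothetical name.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, datetime, Optional[float]]  # (site, timestamp, value)


def monthly_counts(records: Iterable[Record]) -> Dict[str, Dict[int, int]]:
    """Count non-missing values per site and calendar month (1 to 12)."""
    counts: Counter = Counter()
    sites = set()
    for site, timestamp, value in records:
        sites.add(site)
        if value is not None:  # assumed missing-value marker
            counts[(site, timestamp.month)] += 1
    # Fill every month so that gaps show up as explicit zeros.
    return {
        site: {month: counts[(site, month)] for month in range(1, 13)}
        for site in sites
    }
```

A site whose winter months come back as zeros is immediately visible in such a table, before any averaging is attempted. That matters for a plot like average temperature by latitude, since an average over summer-only data will read warm.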

Page 16

Curation Learnings To Date
• Ancillary data is as important as data
– Comparing sites of like vegetation, climate as important as latitude or other physical quantity
– Only some are numeric, most are debated, some vary with time
– Curate the two together
• Controlled vocabularies are hard
– Humans like making up names and have a hard time remembering 100+ names
– Assume a decode step in the staging pipeline
• Data analysis and data cleaning are intertwined
– Data cleaning is always on-going
– Some measurements can be used as indicators of quality of other measurements
– Share the simple tools and visualizations

The saga continues at http://dsd.lbl.gov/BWC/amfluxblog/ and http://research.microsoft.com/~vaningen/BWC/BWC.htm

Page 17

Acknowledgements

Berkeley Water Center, University of California, Berkeley, Lawrence Berkeley Laboratory
• Deb Agarwal
• Monte Good
• Susan Hubbard
• James Hunt
• Matt Rodriguez
• Yoram Rubin

Microsoft
• Jim Gray
• Tony Hey
• Dan Fay
• Stuart Ozer
• SQL product team

Ameriflux Collaboration
• Dennis Baldocchi
• Beverly Law
• Gretchen Miller
• Tara Stiefl
• Mathias Goeckede
• Mattias Falk
• Tom Boden