
1

Where the Rubber Meets the Sky: Giving Access to Science Data

Talk at National Institute of Informatics, Tokyo, Japan

October 2005

Jim Gray, Microsoft Research

http://research.Microsoft.com/~Gray

Alex Szalay, Johns Hopkins University

2

• Abstract: I have been working with some astronomers for the last 6 years, trying to apply DB technology to science problems. These are some lessons I learned.

Paper: “Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science,” Jim Gray; Alexander S. Szalay; MSR-TR-2004-110, October 2004

3

New Science Paradigms

• Thousand years ago: science was empirical, describing natural phenomena
• Last few hundred years: a theoretical branch using models, generalizations
• Last few decades: a computational branch simulating complex phenomena
• Today: data exploration (eScience), unifying theory, experiment, and simulation using data management and statistics
– Data captured by instruments or generated by simulator
– Processed by software
– Scientist analyzes database / files

[Figure: example of a theoretical model, $\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$]

http://es.rice.edu/ES/humsoc/Galileo/Images/Astro/Instruments/hevelius_telescope.gif

4

The Big Picture

[Diagram: Experiments & Instruments, Simulations, Literature, and Other Archives each supply facts; questions go in and answers come out.]

The Big Problems

• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it?
• How to coexist with others?
• Data Query and Visualization tools
• Support/training
• Performance
– Execute queries in a minute
– Batch (big) query scheduling

5

Experiment Budgets ¼…½ Software

Software for
• Instrument scheduling
• Instrument control
• Data gathering
• Data reduction
• Database
• Analysis
• Visualization

Millions of lines of code

Repeated for experiment after experiment

Not much sharing or learning

Let’s work to change this

Identify generic tools
• Workflow schedulers
• Databases and libraries
• Analysis packages
• Visualizers
• …

6

Data Lifecycle

• Raw data → primary data → derived data
• Data has bugs:
– Instrument bugs
– Pipeline bugs
• Data comes in versions
– Later versions fix known bugs
– Just like software (indeed data is software)

• Can’t “un-publish” bad data.

[Diagram: instrument or simulator → Level 0 (raw) → pipeline, with other data → Level 1 (calibrated) → pipeline, with other data → Level 2 (derived)]

7

Data Inflation – Data Pyramid

Level 1A: grows X TB/year, ~0.4X TB/year compressed (level 1A in NASA terms)

Level 2: derived data products, each ~10x smaller, but there are many, so L2 ≈ L1

• Publish new edition each year
– Fixes bugs in data
– Must preserve old editions
– Creates data pyramid
• Store each edition
– 1, 2, 3, 4 … N ~ N² bytes
• Net: Data Inflation: L2 ≥ L1

[Figure: data pyramid over time, editions E1 to E4. One panel shows 4 editions of level 1A data (source data); the other shows 4 editions of 4 level 2 products.]

4 editions of level 2 derived data products. Note that each derived product is small, but they are numerous. This proliferation combined with the data pyramid implies that level 2 data more than doubles the total storage volume.
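The pyramid arithmetic can be sketched as follows. This is only an illustration under an assumption the slide does not state explicitly: edition k holds k years of data, with a constant growth of one unit per year. The function name `pyramid_total` is hypothetical.

```python
def pyramid_total(n_editions, yearly_growth=1.0):
    """Total storage when every edition 1..N is kept online.

    Assumes edition k contains k years of data, each year adding
    `yearly_growth` units (an illustrative simplification).
    """
    return sum(k * yearly_growth for k in range(1, n_editions + 1))


for n in (1, 2, 4, 8):
    latest_edition = n * 1.0
    total = pyramid_total(n)
    # The total is N(N+1)/2 units, so it grows like N^2.
    print(n, total, total / latest_edition)
```

Under this assumption the stored volume is N(N+1)/2 units, which is the ~N² growth the slide refers to.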

8

The Year 5 Problem

[Chart: Yearly Demand (in units of R) vs. Year, 0 to 10. Series: Naive Demand, Inflated Demand, Depreciated Inflated Demand.]

[Chart: Yearly Capital Cost (Marginal Capital Cost) vs. Year, 0 to 10.]

• Data arrives at R bytes/year
• New Storage & Processing
– Need to buy R units in year N
• Data inflation means ~N²R
– Need to buy NR units
• Depreciate over 3 years
– After year 3, need to buy N²R + (N-3)²R
• Moore’s law: 60%/year price decline
• Capital expense peaks at year 5
• See 6x Over-Power slide next
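A rough sketch of why the capital expense peaks, using the depreciated demand N²R + (N-3)²R from the bullets. Two things here are assumptions, not statements from the slide: the price decline is modelled as each year's unit price being 0.6 times the previous year's, and the year-1 price is normalised to 1. The function names are hypothetical.

```python
def yearly_purchase(year):
    """Units of R to buy in `year`: N^2, plus replacing gear bought 3 years earlier."""
    replaced = (year - 3) ** 2 if year > 3 else 0
    return year ** 2 + replaced


def capital_cost(year, price_ratio=0.6):
    """Relative cost: purchase volume times unit price (year-1 price = 1).

    price_ratio is an assumed reading of the slide's price decline.
    """
    return yearly_purchase(year) * price_ratio ** (year - 1)


costs = {year: capital_cost(year) for year in range(1, 11)}
peak_year = max(costs, key=costs.get)
print("peak year:", peak_year)
for year, cost in costs.items():
    print(year, yearly_purchase(year), round(cost, 2))
```

With the assumed ratio of 0.6 the peak falls in year 5, matching the slide. A steeper ratio (for example 0.4) moves the peak earlier, so the result depends on how the price decline is interpreted.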

9

6x Over-Power Ratio

• If you think you need X raw capacity, then you probably need 6X

• Reprocessing

• Backup copies

• Versions

• …

• Hardware is cheap; your time is precious.

PubDB 3.6TB

DR3C 2.4TB

DR2C 1.8TB

DR2M 1.8TB

DR2P 1.8TB

DR3M 2.4TB

DR3P 2.4TB
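One possible reading of the sizes listed above (an interpretation, not something the slide states) is that they are the copies kept online at once. Summing them and comparing against a single 2.4 TB DR3 copy gives a ratio close to the 6x rule of thumb:

```python
# Sizes in TB, copied from the list above.
sizes_tb = {
    "PubDB": 3.6,
    "DR3C": 2.4,
    "DR2C": 1.8,
    "DR2M": 1.8,
    "DR2P": 1.8,
    "DR3M": 2.4,
    "DR3P": 2.4,
}

total_tb = sum(sizes_tb.values())
# Choosing one DR3 copy as the "raw capacity" baseline is an assumption.
ratio = total_tb / sizes_tb["DR3C"]
print(f"total {total_tb:.1f} TB, {ratio:.2f}x one DR3 copy")
```

The total comes to about 16.2 TB, roughly 6.75 times one DR3 copy.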

10

Data Loading

• Data from outside
– Is full of bugs
– Is not in your format
• Advice
– Get it in a “Universal Format” (e.g. Unicode CSV)
– Create Blood-Brain barrier: quarantine in a “load database”
– Scrub the data
  • Cross check everything you can
  • Check data statistics for sanity
  • Reject or repair bad data
  • Generate detailed bug reports (needed to send rejection upstream)
– Expect to reload many times. Automate everything!

[Diagram: validation tests run on the load database]

• Test Uniqueness of Primary Keys: test the unique key in each table
• Test Foreign Keys: test for consistency of keys that link tables
• Test Cardinalities: test consistency of numbers of various quantities
• Test HTM IDs: test the Hierarchical Triangular Mesh IDs used for spatial indexing
• Test parent-child consistency: ensure that all parents and children are linked
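The primary-key and foreign-key tests described above could look roughly like this. It is a minimal sketch using SQLite from the Python standard library, not the loader's actual code; the table and column names (`field`, `obj`, `field_id`, `obj_id`) and the helper names are hypothetical.

```python
import sqlite3


def duplicate_keys(conn, table, key):
    """Return key values that occur more than once in `table`."""
    sql = (f"SELECT {key} FROM {table} "
           f"GROUP BY {key} HAVING COUNT(*) > 1")
    return [row[0] for row in conn.execute(sql)]


def orphan_keys(conn, child, fk, parent, pk):
    """Return child foreign-key values with no matching parent row."""
    sql = (f"SELECT DISTINCT c.{fk} FROM {child} AS c "
           f"LEFT JOIN {parent} AS p ON c.{fk} = p.{pk} "
           f"WHERE p.{pk} IS NULL")
    return [row[0] for row in conn.execute(sql)]


# A tiny in-memory "load database" with deliberately bad rows.
# No constraints are declared, because the data is quarantined as-is.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE field (field_id INTEGER);
    CREATE TABLE obj (obj_id INTEGER, field_id INTEGER);
    INSERT INTO field VALUES (1), (2), (2);
    INSERT INTO obj VALUES (10, 1), (11, 2), (12, 9);
""")

dups = duplicate_keys(conn, "field", "field_id")
orphans = orphan_keys(conn, "obj", "field_id", "field", "field_id")
print("duplicate keys:", dups)
print("orphan foreign keys:", orphans)
```

Running the checks as queries against a quarantine database, rather than relying on declared constraints, lets the loader report every bad row at once instead of stopping at the first failure.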

[Diagram: loader workflow]

LOAD
– EXP: Export
– CHK: Check CSV
– BLD: Build Task DBs
– SQL: Build SQL Schema
– VAL: Validate
– BCK: Backup
– DTC: Detach

PUBLISH
– PUB: Publish
– CLN: Cleanup

FINISH
– FIN

http://skyserver.pha.jhu.edu/admin/

11

Performance Prediction & Regression

• Database grows exponentially

• Set up response-time requirements
– For load
– For access

• Define a workload to measure each

• Run it regularly to detect anomalies

• SDSS uses
– One week to reload
– 20 queries with response of 10 sec to 10 min.
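A regression workload of this kind might be sketched as below. This is an illustrative harness, not the SDSS one: the queries, budgets, and the `obj` table are made up, and SQLite stands in for the real database.

```python
import sqlite3
import time


def run_workload(conn, workload):
    """Time each (name, sql, budget_seconds) entry.

    Returns a list of (name, elapsed_seconds, within_budget).
    """
    results = []
    for name, sql, budget_s in workload:
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        elapsed = time.perf_counter() - start
        results.append((name, elapsed, elapsed <= budget_s))
    return results


# Hypothetical test database and workload.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (obj_id INTEGER PRIMARY KEY, mag REAL)")
conn.executemany("INSERT INTO obj VALUES (?, ?)",
                 [(i, i % 25) for i in range(10_000)])

workload = [
    ("count_all", "SELECT COUNT(*) FROM obj", 10.0),
    ("bright_objects", "SELECT obj_id FROM obj WHERE mag < 5", 10.0),
]

report = run_workload(conn, workload)
for name, elapsed, ok in report:
    print(f"{name}: {elapsed:.4f}s {'ok' if ok else 'TOO SLOW'}")
```

Running the same workload on a schedule and keeping the timings makes anomalies visible as the database grows.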

12

Data Subsets For Science and Development

• Offer 1GB, 10GB, …, Full subsets

• Wonderful tool for you: design & debug
• Good tool for scientists
– Experiment on subset
– Not for needle in haystack, but good for global stats
• Challenge: How to make statistically valid subsets?
– Seems domain specific
– Seems problem specific
– But, there must be some general concepts.
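One generic way to cut nested subsets is to hash each object's key and keep the rows whose hash falls below a threshold. This is a common sampling trick offered as an illustration, not the approach the talk prescribes, and it does not by itself answer the statistical-validity question raised above. The function name `in_subset` is hypothetical.

```python
import hashlib


def in_subset(key, fraction):
    """Deterministically decide whether `key` belongs to a subset.

    The key is hashed to a number in [0, 1); keys below `fraction`
    are kept, so a smaller subset is always contained in a larger one.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


ids = range(100_000)
small = [i for i in ids if in_subset(i, 0.01)]    # about 1% of the rows
medium = [i for i in ids if in_subset(i, 0.10)]   # about 10% of the rows
print(len(small), len(medium))
```

Because the decision depends only on the key, related tables sampled on the same key stay consistent with each other, and rerunning the extraction yields the same subset.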

13

Data Curation Problem Statement

• Once published, scientific data needs to be available forever, so that the science can be reproduced/extended.
• What does that mean?
– Data can be characterized as
  • Primary data: could not be reproduced (NASA “level 0”)
  • Derived data: could be derived from primary data.
– Meta-data: how the data was collected/derived is primary
  • Must be preserved
  • Includes design docs, software, email, pubs, personal notes, teleconferences,

14

Schema (aka metadata)

• Everyone starts with the same schema: <stuff/>. Then they start arguing about semantics.
• Virtual Observatory: http://www.ivoa.net/
• Metadata based on Dublin Core: http://www.ivoa.net/Documents/latest/RM.html
• Universal Content Descriptors (UCD): http://vizier.u-strasbg.fr/doc/UCD.htx
– Captures quantitative concepts and their units
– Reduced from ~100,000 tables in literature to ~1,000 terms
• VOtable: a schema for answers to questions, http://www.us-vo.org/VOTable/
• Common Queries: Cone Search and Simple Image Access Protocol, SQL
• Registry: http://www.ivoa.net/Documents/latest/RMExp.html (still a work in progress)
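To make the Cone Search idea concrete: the query asks for all objects within a given angular radius of a sky position. The sketch below is a brute-force illustration of that computation only; it is not the IVOA protocol, and the catalog rows and function names are hypothetical.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return catalog rows (name, ra, dec) within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row[1], row[2]) <= radius_deg]


# Hypothetical catalog: (name, right ascension, declination) in degrees.
catalog = [("a", 10.0, 20.0), ("b", 10.1, 20.05), ("c", 50.0, -5.0)]
hits = cone_search(catalog, 10.0, 20.0, 0.5)
print([name for name, _, _ in hits])
```

A full scan like this is only workable for small catalogs; a large archive would use a spatial index to avoid testing every row.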



15

Archive Challenges

• Cost of administering storage:
– Presently 10x to 100x the hardware cost.
• Resist attack: geographic diversity
• At 1GBps it takes 12 days to move a PB
• Store it in two (or more) places online (on disk): a geo-plex
• Scrub it continuously (look for errors)
• On failure,
– use other copy until failure repaired,
– refresh lost copy from safe copy.
• Can organize the copies differently (e.g.: one by time, one by space)
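Two of the points above lend themselves to a short sketch: the transfer-time arithmetic, and scrubbing two copies against recorded checksums. Both are illustrations under stated assumptions: 1 GBps is read as 10^9 bytes per second and a PB as 10^15 bytes, and the in-memory dictionaries stand in for two storage sites. All names are hypothetical.

```python
import hashlib


def transfer_days(n_bytes, bytes_per_second):
    """Days needed to move n_bytes at a sustained rate."""
    return n_bytes / bytes_per_second / 86_400


def checksum(data):
    return hashlib.sha256(data).hexdigest()


def scrub(primary, replica, manifest):
    """Check every block in both copies against its recorded checksum.

    A bad block is refreshed from the other copy if that copy is good.
    Returns the ids of the blocks that were repaired.
    """
    repaired = []
    for block_id, expected in manifest.items():
        for bad, good in ((primary, replica), (replica, primary)):
            if checksum(bad.get(block_id, b"")) != expected:
                if checksum(good.get(block_id, b"")) == expected:
                    bad[block_id] = good[block_id]
                    repaired.append(block_id)
    return repaired


print(round(transfer_days(1e15, 1e9), 1), "days to move 1 PB at 1 GB/s")

primary = {"a": b"alpha", "b": b"beta"}
replica = dict(primary)
manifest = {block_id: checksum(data) for block_id, data in primary.items()}
primary["b"] = b"bXta"            # simulate silent corruption at one site
fixed = scrub(primary, replica, manifest)
print("repaired blocks:", fixed)
```

The arithmetic gives about 11.6 days, consistent with the 12 days quoted on the slide.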

16

References

http://SkyServer.SDSS.org/

http://research.microsoft.com/pubs/

http://research.microsoft.com/Gray/SDSS/ (download personal SkyServer)

Extending the SDSS Batch Query System to the National Virtual Observatory Grid, M. A. Nieto-Santisteban, W. O'Mullane, J. Gray, N. Li, T. Budavari, A. S. Szalay, A. R. Thakar, MSR-TR-2004-12, Feb. 2004

Scientific Data Federation, J. Gray, A. S. Szalay, The Grid 2: Blueprint for a New Computing Infrastructure, I. Foster, C. Kesselman, eds, Morgan Kaufmann, 2003, pp 95-108.

Data Mining the SDSS SkyServer Database, J. Gray, A. S. Szalay, A. Thakar, P. Kunszt, C. Stoughton, D. Slutz, J. vandenBerg, Distributed Data & Structures 4: Records of the 4th International Meeting, pp 189-210, W. Litwin, G. Levy (eds), Carleton Scientific 2003, ISBN 1-894145-13-5, also MSR-TR-2002-01, Jan. 2002

Petabyte Scale Data Mining: Dream or Reality?, Alexander S. Szalay; Jim Gray; Jan vandenBerg, SPIE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa, Hawaii, MSR-TR-2002-84

Online Scientific Data Curation, Publication, and Archiving, J. Gray; A. S. Szalay; A. R. Thakar; C. Stoughton; J. vandenBerg, SPIE Astronomy Telescopes and Instruments, 22-28 August 2002, Waikoloa, Hawaii, MSR-TR-2002-74

The World Wide Telescope: An Archetype for Online Science, J. Gray; A. Szalay, CACM, Vol. 45, No. 11, pp 50-54, Nov. 2002, MSR-TR-2002-75

The SDSS SkyServer: Public Access To The Sloan Digital Sky Server Data, A. S. Szalay, J. Gray, A. Thakar, P. Z. Kunszt, T. Malik, J. Raddick, C. Stoughton, J. vandenBerg, ACM SIGMOD 2002: 570-581, MSR-TR-2001-104

The World Wide Telescope, A. S. Szalay, J. Gray, Science, V.293 pp. 2037-2038, 14 Sept 2001, MSR-TR-2001-77

Designing & Mining Multi-Terabyte Astronomy Archives: Sloan Digital Sky Survey, A. Szalay, P. Kunszt, A. Thakar, J. Gray, D. Slutz, P. Kuntz, June 1999, ACM SIGMOD 2000, MSR-TR-99-30
