Public Access to Large Astronomical Datasets

Alex Szalay, Johns Hopkins
Jim Gray, Microsoft Research

Outline

• Trends
• The Sloan Digital Sky Survey
  – The "Cosmic Genome Project"
• The SDSS database design
• The World-Wide Telescope
  – Virtual Observatory: federating archives over the world
• Exploring Web Services
  – SkyQuery, Image Cutout

Living in an Exponential World

• Astronomers have a few hundred TB now
  – 1 pixel (byte) / sq arc second ~ 4 TB
  – Multi-spectral, temporal, … → 1 PB
• They mine it looking for
  – new (kinds of) objects, or more of the interesting ones (quasars)
  – density variations in 400-D space
  – correlations in 400-D space
• Data doubles every year
• Data is public after 1 year
• So, 50% of the data is public
• Some have private access to 5% more data
• So: 50% vs 55% access for everyone
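The 50% figure follows from the doubling argument, and a few lines of Python make the arithmetic explicit. This is a minimal sketch that assumes smooth exponential growth; the function name and its parameters are illustrative and do not come from the talk.

```python
def public_fraction(doubling_time_years: float = 1.0,
                    embargo_years: float = 1.0) -> float:
    """Fraction of all data collected so far that is already public.

    Assumption (illustrative): the archive holds 2**(t / doubling_time)
    units at time t, and everything older than the embargo is released.
    The data that existed one embargo period ago is then a fixed
    fraction of today's total.
    """
    return 2.0 ** (-embargo_years / doubling_time_years)


if __name__ == "__main__":
    # Slide's case: doubling every year, public after one year.
    print(f"public fraction: {public_fraction():.0%}")
```

Under the same assumption, a longer embargo or faster growth shrinks the public share; a two-year embargo with yearly doubling would leave a quarter of the data public.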

Science is hitting a wall

• FTP and GREP are not adequate
  – You can GREP 1 MB in a second
  – You can GREP 1 GB in a minute
  – You can GREP 1 TB in 2 days
  – You can GREP 1 PB in 3 years
• Oh!, and 1 PB ~ 10,000 disks
• At some point you need indices to limit search, and parallel data search and analysis
• This is where databases can help
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (= 1 $/GB)
• … 2 days and 1K$
• … 3 years and 1M$
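The GREP figures above can be turned into implied scan rates, which shows why indices and parallelism matter. The sketch below is a rough illustration only: decimal units (1 MB = 10**6 bytes), a 365-day year, the per-disk rate used in the usage note, and all function names are assumptions, not numbers or code from the talk.

```python
# Figures quoted on the slide: (label, size in bytes, elapsed seconds).
# Decimal units and a 365-day year are assumptions made for this sketch.
DAY = 86_400
YEAR = 365 * DAY

SLIDE_FIGURES = [
    ("1 MB in a second", 10**6, 1),
    ("1 GB in a minute", 10**9, 60),
    ("1 TB in 2 days", 10**12, 2 * DAY),
    ("1 PB in 3 years", 10**15, 3 * YEAR),
]


def implied_rate_mb_per_s(size_bytes: float, seconds: float) -> float:
    """Sequential scan rate implied by reading size_bytes in seconds."""
    return size_bytes / seconds / 1e6


def scan_seconds(size_bytes: float, rate_mb_per_s: float,
                 disks: int = 1, selectivity: float = 1.0) -> float:
    """Time to read a fraction `selectivity` of the data (what an index
    leaves to scan), spread evenly over `disks` disks read in parallel."""
    bytes_to_read = size_bytes * selectivity
    return bytes_to_read / (rate_mb_per_s * 1e6 * disks)


if __name__ == "__main__":
    for label, size, secs in SLIDE_FIGURES:
        print(f"{label}: ~{implied_rate_mb_per_s(size, secs):.1f} MB/s")
```

As a usage example, `scan_seconds(10**15, 10, disks=10_000)` gives 10,000 seconds, a little under three hours, for one full pass over 1 PB when the slide's 10,000 disks are read in parallel at a hypothetical 10 MB/s each. An index that cuts the scanned fraction (`selectivity`) reduces that further in proportion.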

Making Discoveries

• When and where are discoveries made?
  – Always at the edges and boundaries
  – Going deeper, using more colors…
• Metcalfe’s law
  – Utility of computer networks grows as the number of possible connections: O(N²)
• VO: Federation of N archives
  – Possibilities for new discoveries grow as O(N²)
• Current sky surveys have proven this
  – Very early discoveries from SDSS, 2MASS, DPOSS
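The O(N²) growth comes from counting pairs: N archives can be combined two at a time in N(N-1)/2 ways. The sketch below enumerates those pairs for the surveys named on the slide; the helper names are illustrative, and treating each pair as one "possible connection" is a simplifying assumption.

```python
from itertools import combinations


def possible_pairs(archives):
    """All unordered pairs of archives that could be cross-matched."""
    return list(combinations(archives, 2))


def pair_count(n: int) -> int:
    """Closed form for the number of unordered pairs: n(n-1)/2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    surveys = ["SDSS", "2MASS", "DPOSS"]
    for a, b in possible_pairs(surveys):
        print(f"{a} x {b}")
    # Doubling the number of archives roughly quadruples the pairings.
    print(pair_count(10), pair_count(20))
```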

Publishing Data

Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project www site
Curators     Libraries     Bigger Archives
Consumers    Scientists    Scientists

Changing Roles

• Exponential growth:
  – Projects last at least 3-5 years
  – Data sent upwards only at the end of the project
  – Data will never be centralized
• More responsibility on projects
  – Becoming Publishers and Curators
  – Larger fraction of budget spent on software
  – A lot of development is duplicated, wasted
  – All documentation is contained in the archive
• More standards are needed
  – Easier data interchange, fewer tools
• More templates are needed
  – Develop less software on your own

Emerging New Concepts

• Standardizing distributed data
  – Web Services, supported on all platforms
  – Custom configure remote data dynamically
  – XML: Extensible Markup Language
  – SOAP: Simple Object Access Protocol
  – WSDL: Web Services Description Language
• Standardizing distributed computing
  – Grid Services
  – Custom configure remote computing dynamically
  – Build your own remote computer, and discard
  – Virtual Data: new data sets on demand

Features of the SDSS

• Goal
  – Create the most detailed map of the Northern sky in 5 years
• 2.5m telescope, Apache Point, NM
  – 3 degree field of view
• ¼ of the whole sky
• Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
• Automated data reduction
  – 150 man-years of development
• Very high data volume
  – 40 TB of raw data
  – 5 TB processed catalogs
  – Data is public

The University of Chicago
Princeton University
The Johns Hopkins University
The University of Washington
New Mexico State University
Fermi National Accelerator Laboratory
US Naval Observatory
The Japanese Participation Group
The Institute for Advanced Study
Max Planck Inst, Heidelberg

Sloan Foundation, NSF, DOE, NASA


The Imaging Survey

• Continuous data rate of 8 Mbytes/sec
• Northern Galactic Cap
  – drift scan of 10,000 square degrees
  – 24k x 1M pixel “panoramic” images in 5 colors: broad-band filters (u, g, r, i, z)
  – exposure time: 55 sec
  – pixel size: 0.4 arcsec
  – astrometry: 60 mas
  – calibration: 2%
  – done only in best seeing (20 nights/year)
• Southern Galactic Cap
  – multiple scans (> 30 times) of the same stripe

The Spectroscopic Survey

• Expanding universe: redshift = distance
• SDSS Redshift Survey
  – 1 million galaxies
  – 100,000 quasars
  – 100,000 stars
• Two high throughput spectrographs
  – spectral range 3900-9200 Å
  – 640 spectra simultaneously
  – R=2000 resolution, 1.3 Å
• Features
  – Automated reduction of spectra
  – Very high sampling density and completeness

[Figure: elliptical galaxy]

Data Flow

• Pixel data collected by telescope
• Sent to Fermilab for processing
• Beowulf cluster produces catalog
• Loaded in a SQL database

Public Data Release

• June 2002: EDR
  – Early Data Release
• January 2003: DR1
  – Contains 30% of final data
  – 200 million photo objects
• 4 versions of the data
  – Target, best, runs, spectro
• Total catalog volume 1.7 TB
  – See Terascale sneakernet paper…
• Published releases served forever
  – EDR, DR1, DR2, …
  – Soon to include email archives, annotations
• O(N²): only possible because of Moore’s Law!

    EDR
    DR1  DR1
    DR2  DR2  DR2
    DR3  DR3  DR3  DR3

Why Is Astronomy Data Special?

• It has no commercial value
  – No privacy concerns
  – Can freely share results with others
  – Great for experimenting with algorithms
• It is real and well documented
  – High-dimensional (with confidence intervals)
  – Spatial
  – Temporal
• Diverse and distributed
  – Many different instruments from many different places and many different times
• The questions are interesting
• There is a lot of it (petabytes)

[Figure panels: IRAS 100 μm, ROSAT ~keV, DSS Optical, 2MASS 2 μm, IRAS 25 μm, NVSS 20cm, WENSS 92cm, GB 6cm]

Virtual Observatory

• Many new surveys are coming
  – SDSS is a dry run for the next ones
  – LSST will be 1 TB/night
• All the data will be on the Internet
  – But how? ftp, webservice…
• Data and apps will be associated with the instruments
  – Distributed world wide
  – Cross-indexed
  – Federation is a must, but how?
• Will be the best telescope in the world
  – World Wide Telescope

SkyQuery: Experimental Federation

• Federated 5 Web Services
  – Portal unifies 3 archives and a cutout service to visualize results
  – Fermilab/SDSS, JHU/FIRST, Caltech/2MASS Archives
  – Multi-survey spatial join and SQL select
  – Distributed query optimization (T. Malik, T. Budavari) in 6 weeks
  http://www.skyquery.net/
• Cutout web service: annotated SDSS images
  http://skyservice.pha.jhu.edu/sdsscutout/

    SELECT o.objId, o.ra, o.r, o.type, o.i, t.objId, t.j_m
    FROM SDSS:PhotoPrimary o,
         TWOMASS:PhotoPrimary t
    WHERE XMATCH(o,t) < 3.5
      AND AREA(181.3, -0.76, 6.5)
      AND o.type = 3
      AND o.i - t.j_m > 2
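To make the shape of such a cross-survey join concrete, here is a small Python sketch of a brute-force positional match followed by the type and color cuts from the query. It is a simplification under stated assumptions: it treats the match threshold as a plain angular separation in arcseconds, which may not be how SkyQuery's XMATCH defines its tolerance, it omits the AREA restriction, and the toy catalogs, field layout, and function names are all hypothetical.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600.0


def select_matches(sdss, twomass, max_sep_arcsec=3.5,
                   obj_type=3, min_color=2.0):
    """Brute-force O(N*M) join: pairs closer than max_sep_arcsec,
    keeping only SDSS objects of obj_type with i - j_m > min_color."""
    matches = []
    for o in sdss:
        if o["type"] != obj_type:
            continue
        for t in twomass:
            close = angular_separation_arcsec(
                o["ra"], o["dec"], t["ra"], t["dec"]) < max_sep_arcsec
            if close and o["i"] - t["j_m"] > min_color:
                matches.append((o["objId"], t["objId"]))
    return matches


# Hypothetical toy rows, invented for illustration only.
SDSS_TOY = [
    {"objId": 1, "ra": 181.3, "dec": -0.76, "type": 3, "i": 17.0},
    {"objId": 2, "ra": 181.4, "dec": -0.70, "type": 6, "i": 18.0},
]
TWOMASS_TOY = [
    {"objId": 10, "ra": 181.3, "dec": -0.76 + 1 / 3600, "j_m": 14.5},
    {"objId": 11, "ra": 182.0, "dec": 0.0, "j_m": 15.0},
]

if __name__ == "__main__":
    print(select_matches(SDSS_TOY, TWOMASS_TOY))
```

A federated service cannot afford the nested loop shown here on hundreds of millions of objects, which is why the slide stresses spatial joins and distributed query optimization; the sketch only illustrates what the query asks for, not how it is executed.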

Summary

• The data is public and largely self-documenting
  – Get your own copy!
• The SDSS database and web app are interesting
  – Data mining challenge
  – Data visualization challenge
  – Educational challenge
  – Web services "poster-child"
• Information at your fingertips
  – Students see the same data as professional astronomers
• More data coming
  – 1.7 TB+ public data by Jan 2003, 6 TB+ coming
• The World-Wide Telescope
  – Federating the astronomy archives is a CS challenge