
Page 1: Analyzing Large Data Sets in Astronomy

Alex Szalay, Jim Gray

Page 2: Patterns of Scientific Progress

Observational science: the scientist gathers data by direct observation, then analyzes the data.

Analytical science: the scientist builds an analytical model and makes predictions from it.

Computational science: the analytical model is simulated, validating the model and making predictions.

Data exploration science: data is captured by instruments or generated by a simulator, processed by software, and placed in a database or files; the scientist analyzes the database or files.

Gray and Szalay, Communications of the ACM (2002)

Page 3: Living in an Exponential World

Astronomers have a few hundred TB now; 1 pixel (byte) per square arc second is ~4 TB; multi-spectral, temporal, … → 1 PB.

They mine it looking for new (kinds of) objects, more of the interesting ones (quasars), density variations in 400-D space, and correlations in 400-D space.

Data doubles every year and becomes public after one year, so 50% of the data is public. The same trend appears in all sciences.
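A one-line check of the 50% figure, assuming strict annual doubling: if the cumulative volume is $V(t) = V_0\,2^{t}$, then the data that is at least one year old is $V(t-1) = V(t)/2$, so half the archive is public at any moment.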

Page 4: The Challenges

Data collection. Exponential data growth: distributed collections, soon petabytes.

Discovery and analysis. A new analysis paradigm: data federations; move the analysis to the data.

Publishing. A new publishing paradigm: scientists are publishers and curators.

Page 5: Making Discoveries

Where are discoveries made? At the edges and boundaries: going deeper, collecting more data, using more colors…

Metcalfe’s law: the utility of a computer network grows as the number of possible connections: O(N²).

Szalay’s data federation law: a federation of N archives has utility O(N²).

Possibilities for new discoveries grow as O(N²).
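The quadratic growth in both laws is just pair counting, sketched here: a federation of $N$ archives admits

$$\binom{N}{2} = \frac{N(N-1)}{2} = O(N^2)$$

distinct pairwise cross-correlations, so each new archive can be analyzed jointly against every existing one.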

Current sky surveys have proven this: very early discoveries from SDSS, 2MASS, DPOSS.

Page 6: Data Analysis Today

Download-and-grep (FTP and GREP) is not adequate: you can GREP 1 MB in a second, 1 GB in a minute, 1 TB in 2 days, and 1 PB in 3 years.
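A back-of-the-envelope sketch of that arithmetic; the sustained scan rate of ~10 MB/s is our assumption, chosen to match the slide's orders of magnitude:

```python
# Sequential-scan ("grep") time versus data size at an assumed scan rate.
RATE = 10e6  # bytes per second -- an assumed ~10 MB/s, not from the slide

for label, size in [("1 MB", 1e6), ("1 GB", 1e9), ("1 TB", 1e12), ("1 PB", 1e15)]:
    secs = size / RATE
    print(f"{label}: {secs:,.1f} s  = {secs / 86400:.3f} days  = {secs / 3.15e7:.2f} years")
```

At that rate a petabyte takes about three years to scan once, matching the slide's conclusion.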

Oh, and 1 PB is ~10,000 disks. At some point we need:

indices to limit the search

parallel data search and analysis

This is where databases can help

Next-generation technique: data exploration. Bring the analysis to the data!

Page 7: Next-Generation Data Analysis

Looking for needles in haystacks (the Higgs particle) and for the haystacks themselves (dark matter, dark energy).

Needles are easier than haystacks: global statistics have poor scaling. Correlation functions are N², likelihood techniques N³.

As data and computers grow at the same rate, we can only keep up with N log N algorithms.
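A sketch of why N log N is the practical ceiling, under the slide's assumption that data volume $N$ and compute power $C$ both double every year:

$$T(t) \propto \frac{N(t)^{\alpha}}{C(t)} = \frac{(N_0 2^t)^{\alpha}}{C_0 2^t} \propto 2^{(\alpha-1)t},$$

so an $O(N^{\alpha})$ method keeps a bounded runtime only for $\alpha \le 1$; an extra $\log N$ factor adds growth only linear in $t$, while $N^2$ doubles its runtime every year and $N^3$ quadruples it.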

A way out? Discard the notion of optimal (data is fuzzy, answers are approximate), and don't assume infinite computational resources or memory.

Requires combination of statistics & computer science

Page 8: Why Is Astronomy Special?

It has no commercial value: no privacy concerns, so results are freely shared with others. Great for experimenting with algorithms.

It is real and well documented: high-dimensional (with confidence intervals), spatial, temporal.

It is diverse and distributed: many different instruments from many different places and many different times.

The questions are interesting, and there is a lot of it (soon petabytes).

Page 9: The Virtual Observatory

Many new surveys are coming: SDSS is a dry run for the next ones; LSST will be 5 TB/night.

All the data will be on the Internet: ftp, web services…

Data and applications will be associated with the projects.

Distributed worldwide and cross-indexed: federation is a must.

It will be the best telescope in the world: the World Wide Telescope.

It finds the “needle in the haystack”: successful demonstrations in Jan ’03.

Page 10: Boundary Conditions

Dealing with the astronomy legacy: the FITS data format and existing software analysis systems.

Standards driven by evolving new technologies: exchange of rich and structured data (XML…), DB connectivity, web services, grid computing.

Application to the astronomy domain: data dictionaries (UCDs), data models, protocols, registries and resource/service discovery, provenance, data quality.

Page 11: Short History of the VO

Driven by exponential data growth. In the US it started with SDSS + GriPhyN; in Europe it started at CDS (Strasbourg). It continued with NVO + AVO, and is now the International Virtual Observatory Alliance.

Now in 14 countries, with total data holdings >200 TB.

Core services and standards adopted; getting ready for first deployment (mid-2004).

Page 12: Data Analysis - Optimal Statistics

Brute-force implementations of optimal statistics have poor scaling: correlation functions are N², likelihood techniques N³.

As data sizes grow at Moore’s-law rates, computers can only keep up with at most N log N algorithms. What goes?

The notion of optimal is in the sense of statistical errors: it assumes infinite computational resources, and that the only source of error is statistical. But there is also ‘cosmic variance’: we can only observe the Universe from one location (finite sample size).

Solutions require a combination of statistics and CS. New algorithms: not worse than N log N.

Page 13: Organization & Algorithms

Use clever data structures (trees, cubes): an up-front creation cost, but only N log N access cost, and a large speedup during the analysis. Tree-codes for correlations (A. Moore et al. 2001); data cubes for OLAP (all vendors). A sketch of the tree-code idea follows.
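A minimal sketch of the tree-code idea in Python, assuming SciPy's cKDTree and mock uniform data (illustrative, not the authors' code): the pair count within a radius, the core step of a two-point correlation function, runs in roughly N log N rather than the brute-force N² double loop.

```python
# Tree-accelerated pair counting, the core step of a two-point correlation
# function estimate, in ~N log N instead of the brute-force N^2 double loop.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(100_000, 3))  # mock galaxy positions

tree = cKDTree(points)                      # build cost ~N log N
# Ordered pairs (i, j), including i == j, with separation <= r:
pairs = tree.count_neighbors(tree, r=0.01)
print(pairs)  # compared against a random catalog, this yields xi(r)
```

The naive version visits all $N^2$ pairs; the tree prunes whole subvolumes whose bounding boxes lie farther apart than $r$.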

Fast, approximate heuristic algorithms: there is no need to be more accurate than cosmic variance. Fast CMB analysis by Szapudi et al. (2001): N log N instead of N³ means 1 day instead of 10 million years.

Take the cost of computation into account: a controlled level of accuracy, and the best result in a given time with our computing resources.

Page 14: Analysis and Databases

Much statistical analysis deals with:

creating uniform samples (data filtering)

assembling relevant subsets

estimating completeness (censoring bad data)

counting and building histograms

generating Monte Carlo subsets

likelihood calculations

hypothesis testing

Traditionally these are performed on files, but most of these tasks are much better done inside a database. Move Mohamed to the mountain, not the mountain to Mohamed: a sketch of the pattern follows.
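A minimal sketch of the in-database pattern using SQLite from Python; the photoobj table, its columns, and the mock magnitudes are all hypothetical stand-ins rather than a real survey schema. The filtering and the histogram both run inside the database, so only the binned counts come back.

```python
# A histogram computed inside the database: the query does the filtering and
# the binning, so ~20 rows of counts leave the engine instead of 10,000 rows.
import sqlite3
import numpy as np

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photoobj (objid INTEGER, r_mag REAL)")  # hypothetical schema
rng = np.random.default_rng(1)
conn.executemany(
    "INSERT INTO photoobj VALUES (?, ?)",
    [(i, float(m)) for i, m in enumerate(rng.normal(20.0, 1.5, 10_000))],
)

rows = conn.execute("""
    SELECT CAST(r_mag / 0.5 AS INTEGER) * 0.5 AS bin, COUNT(*) AS n
    FROM photoobj
    WHERE r_mag BETWEEN 15 AND 25      -- the sample cut, done in the engine
    GROUP BY bin
    ORDER BY bin
""").fetchall()

for bin_edge, n in rows:
    print(f"r ~ {bin_edge:4.1f}: {n}")
```

The same pattern scales from this toy table to a server-side archive: ship the query, not the data.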

Page 15: Cosmic Microwave Background

Szapudi et al 2002

Page 16: Data Exploration: A New Way of Doing Science

Primary access to data is through databases.

Exponential data growth means distributed data.

Publication comes before analysis.

With large data, move the analysis to where the data is.

Distributed computing: data federation.

New algorithms are needed.

The Virtual Observatory is a good example.

This is unavoidable, and emerging in all sciences!