Alex Szalay, Jim Gray
Analyzing Large Data Sets in Astronomy
Patterns of Scientific Progress
Observational Science: Scientist gathers data by direct observation
Scientist analyzes data
Analytical Science: Scientist builds analytical model
Makes predictions
Computational Science: Simulate analytical model
Validate model and make predictions
Data Exploration Science: Data captured by instruments, or data generated by simulator
Processed by software
Placed in a database / files
Scientist analyzes database / files
Gray and Szalay, Communications of the ACM (2002)
Living in an Exponential World
Astronomers have a few hundred TB now
1 pixel (byte) / sq arc second ~ 4 TB
Multi-spectral, temporal, … → 1 PB
They mine it looking for new (kinds of) objects, more of the interesting ones (quasars), density variations in 400-D space, and correlations in 400-D space
Data doubles every year and becomes public after 1 year
So roughly 50% of the data is public at any time (last year's total is half of this year's)
The same trend appears in all sciences
The Challenges
Data Collection
Discovery and Analysis
Publishing
Exponential data growth: distributed collections, soon Petabytes
New analysis paradigm: data federations, move analysis to the data
New publishing paradigm: scientists are publishers and curators
Making Discoveries
Where are discoveries made?
At the edges and boundaries
Going deeper, collecting more data, using more colors…
Metcalfe's law: the utility of a computer network grows as the number of possible connections: O(N²)
Szalay's data federation law: a federation of N archives has utility O(N²)
Possibilities for new discoveries grow as O(N²) (see the counting note below)
Current sky surveys have proven this
Very early discoveries from SDSS, 2MASS, DPOSS
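A brief note on where the O(N²) comes from (the standard Metcalfe-style counting argument; the worked numbers are mine, not from the talk):

```latex
% Distinct pairwise connections (or archive cross-matches) among N nodes:
\binom{N}{2} = \frac{N(N-1)}{2} \sim O(N^2)
% Example: 10 federated archives give 45 possible pairwise cross-correlations;
% 20 archives give 190 -- doubling N roughly quadruples the opportunities.
```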
Data Analysis Today
Download (FTP and GREP) is not adequate
You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years
Oh, and 1 PB is ~10,000 disks
At some point we need indices to limit the search, and parallel data search and analysis (a back-of-the-envelope check of these scan times follows below)
This is where databases can help
Next-generation technique: Data Exploration
Bring the analysis to the data!
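As a rough sanity check on those GREP times (the ~10 MB/s sequential scan rate below is an assumption chosen to match the slide's figures, roughly one early-2000s disk stream; it is not a number from the talk), a minimal Python sketch:

```python
# Back-of-the-envelope sequential-scan times at an assumed ~10 MB/s grep throughput.
SCAN_RATE = 10 * 1024**2  # bytes per second (assumed, single disk stream)

sizes = {"1 MB": 1024**2, "1 GB": 1024**3, "1 TB": 1024**4, "1 PB": 1024**5}
for label, size in sizes.items():
    seconds = size / SCAN_RATE
    print(f"{label}: {seconds:,.1f} s  (~{seconds / 86400:.1f} days)")

# Roughly: 1 MB ~ 0.1 s, 1 GB ~ 2 min, 1 TB ~ a day or two, 1 PB ~ 3.4 years --
# the same order of magnitude as the slide, which is why TB/PB-scale analysis
# needs indices to limit the search and parallel scans across many disks.
```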
Next-Generation Data Analysis
Looking for:
Needles in haystacks – the Higgs particle
Haystacks: dark matter, dark energy
Needles are easier than haystacks
Global statistics have poor scaling
Correlation functions are O(N²), likelihood techniques O(N³)
As data and computers grow at the same rate, we can only keep up with N log N (a short derivation follows below)
A way out? Discard the notion of optimal (data is fuzzy, answers are approximate)
Don't assume infinite computational resources or memory
Requires combination of statistics & computer science
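A short derivation behind that claim (my restatement of the slide's argument, assuming the data volume N and the available compute both double over the same period):

```latex
% Keeping up requires that doubling N at most doubles the cost: C(2N) \le 2\,C(N).
\frac{C(2N)}{C(N)} =
\begin{cases}
\dfrac{2N\log(2N)}{N\log N} \approx 2, & C(N) = N\log N \ \text{(just keeps up)} \\[1.5ex]
\dfrac{(2N)^2}{N^2} = 4, & C(N) = N^2 \ \text{(falls behind by a factor of 2 per doubling)} \\[1.5ex]
\dfrac{(2N)^3}{N^3} = 8, & C(N) = N^3 \ \text{(falls behind by a factor of 4 per doubling)}
\end{cases}
```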
Why Is Astronomy Special?
It has no commercial value
No privacy concerns; freely share results with others
Great for experimenting with algorithms
It is real and well documented
High-dimensional (with confidence intervals)
Spatial, temporal
Diverse and distributed
Many different instruments from many different places and many different times
The questions are interesting
There is a lot of it (soon Petabytes)
The Virtual Observatory
Many new surveys are coming
SDSS is a dry run for the next ones
LSST will be 5 TB/night
All the data will be on the Internet
ftp, web services…
Data and applications will be associated with the projects
Distributed world wide, cross-indexed
Federation is a must
Will be the best telescope in the world
World Wide Telescope
Finds the "needle in the haystack"
Successful demonstrations in Jan '03
Dealing with the astronomy legacy
FITS data format
Software analysis systems
Standards driven by evolving new technologies
Exchange of rich and structured data (XML…)
DB connectivity, Web Services, Grid computing
Boundary Conditions
Application to the astronomy domain:
Data dictionaries (UCDs)
Data models
Protocols
Registries and resource/service discovery
Provenance, data quality
Short History of the VO
Driven by exponential data growth
In the US it started with SDSS + GriPhyN
In Europe it started at CDS (Strasbourg)
Continued with NVO + AVO
Now: the International Virtual Observatory Alliance
Now in 14 countries
Total data holdings >200 TB
Core services and standards adopted
Getting ready for first deployment (mid-2004)
Data Analysis - Optimal Statistics
Brute-force algorithms for optimal statistics have poor scaling
Correlation functions are O(N²), likelihood techniques O(N³)
As data sizes grow at Moore's-law rates, computers can only keep up with at most N log N algorithms
What goes?
The notion of optimal is in the sense of statistical errors
It assumes infinite computational resources
It assumes that the only source of error is statistical
'Cosmic Variance': we can only observe the Universe from one location (finite sample size)
Solutions require a combination of Statistics and CS
New algorithms: not worse than N log N
Organization & Algorithms
Use of clever data structures (trees, cubes):
Up-front creation cost, but only N log N access cost
Large speedup during the analysis
Tree-codes for correlations (A. Moore et al 2001) – see the pair-counting sketch below
Data Cubes for OLAP (all vendors)
Fast, approximate heuristic algorithms
No need to be more accurate than cosmic variance
Fast CMB analysis by Szapudi et al (2001)
N log N instead of N³ ⇒ 1 day instead of 10 million years
Take the cost of computation into account
Controlled level of accuracy
Best result in a given time, given our computing resources
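A minimal sketch of the tree-code idea for two-point correlations, using SciPy's cKDTree rather than the Moore et al. (2001) code; the library choice, the toy data, and the simple natural estimator DD/RR − 1 are my assumptions, not details from the talk:

```python
# Tree-based pair counting for the two-point correlation function xi(r):
# building each k-d tree costs ~N log N, and dual-tree pair counting then
# avoids the brute-force O(N^2) loop over all pairs.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.random((20_000, 3))       # toy "galaxy" positions in a unit box
rand = rng.random((20_000, 3))       # unclustered random comparison catalogue

edges = np.linspace(0.01, 0.10, 10)  # pair-separation bin edges

tree_d = cKDTree(data)
tree_r = cKDTree(rand)

# Cumulative pair counts out to each edge, differenced into per-bin counts.
dd = np.diff(tree_d.count_neighbors(tree_d, edges))
rr = np.diff(tree_r.count_neighbors(tree_r, edges))

# Natural estimator xi = DD/RR - 1 (a full analysis would use Landy-Szalay,
# which also needs the data-random cross counts DR).
xi = dd / rr - 1
for lo, hi, x in zip(edges[:-1], edges[1:], xi):
    print(f"r in [{lo:.3f}, {hi:.3f}]: xi ~ {x:+.3f}")  # ~0 for this unclustered toy data
```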
Analysis and Databases
Much statistical analysis deals with:
Creating uniform samples – data filtering
Assembling relevant subsets
Estimating completeness
Censoring bad data
Counting and building histograms
Generating Monte Carlo subsets
Likelihood calculations
Hypothesis testing
Traditionally these are performed on files
Most of these tasks are much better done inside a database (see the sketch below)
Move Mohamed to the mountain, not the mountain to Mohamed
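To make "move the analysis to the data" concrete, a minimal sketch using Python's built-in sqlite3; the table name, column names, and magnitude cuts are invented for illustration, and a production archive would sit behind a full SQL server rather than SQLite:

```python
# Filtering, censoring, and histogramming done inside the database with one
# SQL query, instead of downloading the whole table and looping over rows.
import random
import sqlite3

random.seed(1)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE photo_obj (ra REAL, dec REAL, r_mag REAL, flags INTEGER)")
con.executemany(
    "INSERT INTO photo_obj VALUES (?, ?, ?, ?)",
    [(random.uniform(0, 360), random.uniform(-90, 90),
      random.gauss(20, 2), random.randint(0, 1)) for _ in range(100_000)],
)

# Uniform, censored sample + 0.5-mag histogram of r-band magnitudes,
# computed where the data lives; only (bin, count) pairs return to the client.
query = """
    SELECT CAST(r_mag / 0.5 AS INTEGER) * 0.5 AS bin, COUNT(*) AS n
    FROM photo_obj
    WHERE flags = 0 AND r_mag BETWEEN 14 AND 22   -- censor bad data, cut the sample
    GROUP BY bin ORDER BY bin
"""
for bin_lo, n in con.execute(query):
    print(f"r_mag in [{bin_lo:4.1f}, {bin_lo + 0.5:4.1f}): {n}")
```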
Cosmic Microwave Background
Szapudi et al 2002
Data Exploration: A New Way of Doing Science
Primary access to data is through databases
Exponential data growth – distributed data
Publication before analysis
Large data: move the analysis to where the data is
Distributed computing – data federation
New algorithms are needed
The Virtual Observatory is a good example
Unavoidable, and emerging in all sciences!