Analyzing Large Datasets in Astrophysics
Alexander Szalay
The Johns Hopkins University
Towards an International Virtual Observatory, Garching, 2002
(Living in an exponential world….)
Alex Szalay, Garching 2002 2
Outline
Collecting Data
  Exponential Growth
Making Discoveries
Publishing Data
VO: How will it work?
  Web Services
  Atomic vs Composite services
Distributed queries with SkyQuery
  Cross-Matching Algorithm
  SkyNode Web Services + Portal
Statistical Analysis of large data sets
The World is Exponential
Astrophysical data is growing exponentially
  Doubling every year (Moore’s Law+): both data sizes and number of data sets
Computational resources scale the same way
  Constant $$$ will keep up with the data
Main problem is the software component
  Currently components are not reused
  Software costs are an increasingly large fraction
  Aggregate costs are growing exponentially
Making Discoveries
When and where are discoveries made?
  Always at the edges and boundaries
  Going deeper, using more colors…
Metcalfe’s law
  Utility of computer networks grows as the number of possible connections: O(N^2)
VO: Federation of N archives
  Possibilities for new discoveries grow as O(N^2)
Current sky surveys have proven this
  Very early discoveries from SDSS, 2MASS, DPOSS
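The O(N^2) growth quoted above is the number of distinct archive pairs, N(N-1)/2. A minimal Python sketch follows; the survey names are used only as labels, and `pair_count` is a hypothetical helper, not part of any VO software.

```python
from itertools import combinations

def pair_count(n: int) -> int:
    """Number of distinct pairs among n archives: n(n-1)/2."""
    return n * (n - 1) // 2

# Labels only: the three surveys named on the slide.
archives = ["SDSS", "2MASS", "DPOSS"]

# Every pair is a potential cross-archive comparison.
pairs = list(combinations(archives, 2))

# Pair counts for a few federation sizes.
growth = {n: pair_count(n) for n in (3, 10, 20)}
```

Going from 10 to 20 archives doubles N but roughly quadruples the number of pairs.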
Publishing Data
Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project www site
Curators     Libraries     Bigger Archives
Consumers    Scientists    Scientists
Changing Roles
Exponential growth:
  Projects last at least 3-5 years
  Data sent upwards only at the end of the project
  Data will never be centralized
More responsibility on projects
  Becoming Publishers and Curators
  Larger fraction of budget spent on software
  A lot of development is duplicated and wasted
More standards are needed
  Easier data interchange, fewer tools
More templates are needed
  Develop less software on your own
Emerging New Concepts
Standardizing distributed data
  Web Services, supported on all platforms
  Custom configure remote data dynamically
  XML: Extensible Markup Language
  SOAP: Simple Object Access Protocol
  WSDL: Web Services Description Language
Standardizing distributed computing
  Grid Services
  Custom configure remote computing dynamically
  Build your own remote computer, and discard
  Virtual Data: new data sets on demand
NVO: How Will It Work?
Define commonly used 'atomic' services
Build higher level toolboxes/portals on top
We do not build 'everything for everybody'
Use the 90-10 rule:
  Define the standards and interfaces
  Build the framework
  Build the 10% of services that are used by 90% of users
  Let the users build the rest from the components
[Figure: # of users (0 to 1) plotted against # of services (0 to 0.5)]
Atomic Services
Metadata information about resources
  Waveband
  Sky coverage
  Translation of names to universal dictionary (UCD)
Simple search patterns on the resources
  Cone Search
  Image mosaic
  Unit conversions
Simple filtering, counting, histogramming
On-the-fly recalibrations
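As an illustration of what an atomic service such as Cone Search does, here is a minimal in-memory sketch: return every catalog row within a given angular radius of a sky position. The function names, the row layout and the three catalog rows are hypothetical; a real service would run this against a database with a spatial index rather than a Python list.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(catalog, ra0, dec0, radius_deg):
    """Return the rows of `catalog` lying within radius_deg of (ra0, dec0)."""
    return [
        row for row in catalog
        if angular_separation_deg(row["ra"], row["dec"], ra0, dec0) <= radius_deg
    ]

# Toy rows, invented for the example.
catalog = [
    {"id": 1, "ra": 181.30, "dec": -0.76},
    {"id": 2, "ra": 181.35, "dec": -0.70},
    {"id": 3, "ra": 190.00, "dec": 5.00},
]
hits = cone_search(catalog, 181.3, -0.76, 0.1)
```

The linear scan is only for clarity; the cost per query is proportional to the catalog size.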
Higher Level Services
Built on Atomic Services
Perform more complex tasks
Examples
  Automated resource discovery
  Cross-identifications
  Photometric redshifts
  Outlier detections
  Visualization facilities
Expectation:
  Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)
SkyQuery
Distributed Query tool using a set of services
Feasibility study, built in 6 weeks from scratch
  Tanu Malik (JHU CS grad student)
  Tamas Budavari (JHU astro postdoc)
Implemented in C# and .NET
Won 2nd prize of Microsoft XML Contest
Allows queries like:
  SELECT o.objId, o.r, o.type, t.objId
  FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
  WHERE XMATCH(o,t)<3.5
    AND AREA(181.3,-0.76,6.5)
    AND o.type=3 AND (o.i - t.m_j)>2
Architecture
Image cutout
SkyNode SDSS
SkyNode 2MASS
SkyNode FIRST
SkyQuery
Web Page
Cross-id Steps
Parse query
Get counts
Sort by counts
Make plan
Cross-match
  Recursively, from small to large
Select necessary attributes only
Return output
Insert cutout image

  SELECT o.objId, o.r, o.type, t.objId
  FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
  WHERE XMATCH(o,t)<3.5
    AND AREA(181.3,-0.76,6.5)
    AND (o.i - t.m_j) > 2 AND o.type=3
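The "get counts, sort by counts, cross-match recursively from small to large" steps can be sketched in a few lines of Python. This is a toy, in-memory illustration and not the SkyQuery implementation: all function names and the sample objects are hypothetical, the match test is a brute-force small-angle separation, and the tolerance is taken to be in arcseconds purely as an assumption (the slide does not state the units of the XMATCH threshold).

```python
import math

def sep_arcsec(a, b):
    """Small-angle separation in arcseconds; a and b hold ra/dec in degrees."""
    mean_dec = math.radians(0.5 * (a["dec"] + b["dec"]))
    dra = (a["ra"] - b["ra"]) * math.cos(mean_dec)
    ddec = a["dec"] - b["dec"]
    return 3600.0 * math.hypot(dra, ddec)

def make_plan(nodes):
    """Order archive names by object count, smallest first."""
    return sorted(nodes, key=lambda name: len(nodes[name]))

def recursive_cross_match(nodes, tol_arcsec):
    """Start from the smallest archive and extend matches one archive at a time."""
    plan = make_plan(nodes)
    partial = [(obj,) for obj in nodes[plan[0]]]
    for name in plan[1:]:
        extended = []
        for match in partial:
            for obj in nodes[name]:
                # Keep the candidate only if it is close to every object so far.
                if all(sep_arcsec(m, obj) <= tol_arcsec for m in match):
                    extended.append(match + (obj,))
        partial = extended  # only surviving tuples travel to the next archive
    return plan, partial

# Toy objects, invented for the example.
nodes = {
    "SDSS": [
        {"id": "s1", "ra": 181.3000, "dec": -0.7600},
        {"id": "s2", "ra": 181.3100, "dec": -0.7500},
        {"id": "s3", "ra": 181.3200, "dec": -0.7400},
    ],
    "TWOMASS": [
        {"id": "t1", "ra": 181.3002, "dec": -0.7601},
        {"id": "t2", "ra": 181.4000, "dec": -0.7000},
    ],
    "FIRST": [
        {"id": "f1", "ra": 181.3001, "dec": -0.7600},
    ],
}
plan, matches = recursive_cross_match(nodes, tol_arcsec=3.5)
```

Starting from the smallest archive keeps the intermediate result small, so less data has to be passed along at each step.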
Monte-Carlo Simulation
Comparing different algorithms for 3-way xid
  Transmit all the data
  Transmit after filtering
  Recursive cross-match
Surveys
  SDSS
  2MASS
  FIRST
Random variables:
  Sky Area (0..10 sqdeg)
  Selectivity of each subselect (0..1)
  Efficiency of join (0.5..2)
  Selectivity of common select (0..1)
[Figure: two histograms of log cost (x-axis -4 to 4, y-axis 0 to 2000)]
SkyNode
Metadata functions (SOAP)
  Info, Tables, Columns, Schema, Functions, Keysearch
Query functions (SOAP)
  Dataset Query(String sqlCmd)
  Dataset Xmatch(Dataset input, String sqlCmd, float eps)
Database: MS SQL Server
  Upload dataset
  Very fast spatial search engine (HTM-based)
  Crossmatch takes <3 ms/object over 15M objects in SDSS
  User defined functions and stored procedures
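To make the shape of this interface concrete, here is a toy stand-in written in Python. It only mirrors the function names on the slide (Tables, Columns, Query, Xmatch); everything else is an assumption: sqlite3 replaces MS SQL Server, a Dataset is modelled as a list of dicts, there is no SOAP layer or HTM index, and `eps` is treated as a radius in arcseconds because the slide does not give its units. `ToySkyNode` is a hypothetical class, not the actual SkyNode code.

```python
import math
import sqlite3
from typing import Dict, List

# Assumption: the slide's "Dataset" is modelled as a list of row dicts.
Dataset = List[Dict[str, object]]

class ToySkyNode:
    """In-memory stand-in for a SkyNode (sqlite3 instead of MS SQL Server)."""

    def __init__(self, rows):
        self.db = sqlite3.connect(":memory:")
        self.db.row_factory = sqlite3.Row
        self.db.execute("CREATE TABLE PhotoPrimary (objId INTEGER, ra REAL, dec REAL)")
        self.db.executemany("INSERT INTO PhotoPrimary VALUES (?, ?, ?)", rows)

    # Metadata functions
    def tables(self) -> List[str]:
        cur = self.db.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        return [r["name"] for r in cur]

    def columns(self, table: str) -> List[str]:
        # Toy code: the table name is interpolated directly, so trusted input only.
        cur = self.db.execute(f"PRAGMA table_info({table})")
        return [r["name"] for r in cur]

    # Query functions
    def query(self, sql_cmd: str) -> Dataset:
        return [dict(r) for r in self.db.execute(sql_cmd)]

    def xmatch(self, input_rows: Dataset, sql_cmd: str, eps: float) -> Dataset:
        """Pair input rows with local rows at most eps arcseconds apart.

        sql_cmd must select objId, ra and dec from the local table.
        """
        out = []
        for local in self.query(sql_cmd):
            for remote in input_rows:
                dra = (local["ra"] - remote["ra"]) * math.cos(math.radians(local["dec"]))
                sep = 3600.0 * math.hypot(dra, local["dec"] - remote["dec"])
                if sep <= eps:
                    out.append({"input_objId": remote["objId"], "objId": local["objId"]})
        return out

# Toy rows, invented for the example.
node = ToySkyNode([(1, 181.3000, -0.7600), (2, 181.5000, -0.5000)])
incoming = [{"objId": 99, "ra": 181.3001, "dec": -0.7600}]
matched = node.xmatch(incoming, "SELECT objId, ra, dec FROM PhotoPrimary", 3.5)
```

The point of the sketch is the calling pattern: a portal sends a partial result set plus a local query to each node, and gets back only the rows that match.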
Data Flow
SkyNode 1
SkyQuery
SkyNode 2
SkyNode 3
query
http://www.skyquery.net
Optimal Statistics
The examples for optimal statistics have poor scaling
  Correlation functions N^2, likelihood techniques N^3
As data sizes grow at Moore’s law, computers can only keep up with at most N log N algorithms
What goes?
  Notion of optimal is in the sense of statistical errors
  Assumes infinite computational resources
  Assumes that only source of error is statistical
  'Cosmic Variance': we can only observe the Universe from one location (finite sample size)
Solutions require combination of Statistics and CS
  New algorithms: not worse than N log N
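The scaling argument can be checked with a short calculation. The sketch below assumes, as the earlier slides state, that both the data size N and the available compute double every year; the starting size `n0` of one million objects and the helper name `relative_runtime` are arbitrary choices for illustration.

```python
import math

def relative_runtime(years, cost, n0=1.0e6):
    """Runtime after `years`, relative to today, if N and compute both double yearly."""
    growth = 2.0 ** years
    return cost(n0 * growth) / (cost(n0) * growth)

costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

# How much longer each class of algorithm takes after ten years of doubling.
after_10_years = {name: relative_runtime(10, f) for name, f in costs.items()}
```

Under these assumptions an N^2 method gets about a thousand times slower in ten years and an N^3 method about a million times slower, while an N log N method slows down by only a factor of order one.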
Clever Data Structures
Heavy use of tree structures:
  Up-front cost, but only N log N
  Large speedup later
  Tree-codes for correlations (A. Moore et al. 2001)
Fast, approximate heuristic algorithms
  No need to be more accurate than cosmic variance
  Fast CMB analysis by Szapudi et al. (2001)
    N log N instead of N^3 => 1 day instead of 10 million years
Take cost of computation into account
  Controlled level of accuracy
  Best result in a given time, given our computing resources
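To show how a tree pays off for pair statistics, here is a small pure-Python sketch that counts pairs of points closer than a radius r using a 2-d tree, next to the brute-force N^2 count it replaces. It is a simplified single-tree variant written for illustration, not the dual-tree codes cited above; all names are hypothetical and the points are random toy data in a flat unit square.

```python
import random

def build(points, depth=0):
    """Build a 2-d tree as nested tuples: (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def count_within(node, q, r):
    """Count tree points within distance r of q, skipping subtrees out of reach."""
    if node is None:
        return 0
    p, axis, left, right = node
    n = 1 if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= r * r else 0
    d = q[axis] - p[axis]
    if d <= r:    # left subtree holds coordinates <= p[axis]
        n += count_within(left, q, r)
    if d >= -r:   # right subtree holds coordinates >= p[axis]
        n += count_within(right, q, r)
    return n

def pair_count_tree(points, r):
    """Number of unordered pairs closer than r, using the tree for each point."""
    tree = build(points)  # the up-front cost
    total = sum(count_within(tree, q, r) for q in points)
    return (total - len(points)) // 2  # drop self-matches, halve double counts

def pair_count_brute(points, r):
    """Reference O(N^2) count over all unordered pairs."""
    return sum(
        1
        for i, a in enumerate(points)
        for b in points[i + 1:]
        if (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 <= r * r
    )

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(300)]
tree_pairs = pair_count_tree(pts, 0.05)
brute_pairs = pair_count_brute(pts, 0.05)
```

Both counts agree; the tree version does far fewer distance tests when r is small compared with the extent of the data, which is where the speedup comes from.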
Angular Clustering with Photo-z
w(θ) by Peebles and Groth:
  The first example of publishing and analyzing large data
Samples based on rest-frame quantities
Strictly volume limited samples
Largest angular correlation study to date
Very clear detection of luminosity and color dependence
Results consistent with 3D clustering
T. Budavari, A. Connolly, I. Csabai, I. Szapudi, A. Szalay, S. Dodelson, J. Frieman, R. Scranton, D. Johnston and the SDSS Collaboration
The Samples
343k  254k  185k  316k  280k  326k  185k  127k
-20 > Mr > -21: 1182k
-21 > Mr > -23: 931k
0.1 < z < 0.3, -20 > Mr: 2.2M
-21 > Mr > -22: 662k
-22 > Mr > -23: 269k
0.1 < z < 0.5, -21.4 > Mr: 3.1M
10 stripes: 10M
mr < 21: 15M
All: 50M
2800 square degrees in 10 stripes, data in custom DB
The Stripes
10 stripes over the SDSS area, covering about 2800 square degrees
About 20% lost due to bad seeing
Masks: seeing, bright stars, etc.
Images generated from query by web service
The Masks
Stripe 11 + masks
Masks are derived from the database
  Search and intersect extended objects with boundaries
The Analysis
eSpICE: I. Szapudi, S. Colombi and S. Prunet
Integrated with the database by T. Budavari
Extremely fast processing (N log N)
  1 stripe with about 1 million galaxies is processed in 3 mins
  Usual figure was 10 min for 10,000 galaxies => 70 days
Each stripe processed separately for each cut
2D angular correlation function computed
w(θ): average with rejection of pixels along the scan
  flat field vector causes mock correlations
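The "=> 70 days" figure follows if the usual method is assumed to scale as N^2 (the slide does not state the exponent; quadratic scaling is what reproduces its number). A short check, with `scaled_minutes` as a hypothetical helper:

```python
def scaled_minutes(ref_minutes, ref_n, n, exponent=2.0):
    """Extrapolate a runtime from ref_n objects to n objects, assuming N**exponent scaling."""
    return ref_minutes * (n / ref_n) ** exponent

# 10 min for 10,000 galaxies, extrapolated to one stripe of about 1 million galaxies.
minutes = scaled_minutes(10.0, 10_000, 1_000_000)
days = minutes / (60 * 24)

# Compared with the 3 minutes quoted for the N log N code.
speedup = minutes / 3.0
```

That gives 100,000 minutes, a little over 69 days, against 3 minutes per stripe with the fast code.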
Angular Correlations I.
Luminosity dependence: 3 cuts
  -20 > M > -21
  -21 > M > -22
  -22 > M > -23
Angular Correlations II.
Color Dependence
  4 bins by rest-frame SED type
Summary
Exponential data growth – distributed data
Web Services – hierarchical architecture
Use the 90-10 rule (maybe 80-20)
There are clever ways to federate datasets!
Statistical analyses do not follow Moore’s law
Need to revisit optimal statistics
Give interesting new tools into the hands of smart young people…
They will quickly turn them into cutting edge science
Virtual Observatory
Astronomy with an attitude…