Web Services for the Virtual Observatory
Alex Szalay, Tamas Budavari, Tanu Malik, Jim Gray, and Ani Thakar
SPIE, Hawaii, 2002
(Living in an exponential world...)
Alex Szalay, SPIE 2002 2
Outline
Collecting Data
Exponential Growth
Making Discoveries
Publishing Data
VO: How will it work?
Web Services
Atomic vs Composite services
Distributed queries with SkyQuery
Cross-Matching Algorithm
SkyNode Web Services + Portal
The World is Exponential
Astrophysical data is growing exponentially
  Doubling every year (Moore’s Law+): both data sizes and number of data sets
Computational resources scale the same way
  Constant $$$ will keep up with the data
Main problem is the software component
  Currently components are not reused
  Software costs are an increasingly larger fraction
  Aggregate costs are growing exponentially
Making Discoveries
When and where are discoveries made?
  Always at the edges and boundaries
  Going deeper, using more colors...
Metcalfe’s law
  Utility of computer networks grows as the number of possible connections: O(N^2)
VO: Federation of N archives
  Possibilities for new discoveries grow as O(N^2)
Current sky surveys have proven this
  Very early discoveries from SDSS, 2MASS, DPOSS
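As a rough illustration of the O(N^2) scaling: the number of distinct pairs among N federated archives is N(N-1)/2. The short Python sketch below counts them. The function name `pairwise_links` is made up for this example and does not come from the talk.

```python
def pairwise_links(n_archives: int) -> int:
    """Number of distinct archive pairs that could be combined: N(N-1)/2."""
    if n_archives < 0:
        raise ValueError("n_archives must be non-negative")
    return n_archives * (n_archives - 1) // 2


if __name__ == "__main__":
    # Doubling the number of archives roughly quadruples the number of pairs.
    for n in (5, 10, 20):
        print(n, pairwise_links(n))
```

Counting pairs only is a simplification; combinations of three or more archives would grow faster still.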
Publishing Data
Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project www site
Curators     Libraries     Bigger Archives
Consumers    Scientists    Scientists
Changing Roles
Exponential growth:
  Projects last at least 3-5 years
  Data sent upwards only at the end of the project
  Data will never be centralized
More responsibility on projects
  Becoming Publishers and Curators
  Larger fraction of budget spent on software
  Lot of development duplicated, wasted
More standards are needed
  Easier data interchange, fewer tools
More templates are needed
  Develop less software on your own
Emerging New Concepts
Standardizing distributed data
  Web Services, supported on all platforms
  Custom configure remote data dynamically
  XML: Extensible Markup Language
  SOAP: Simple Object Access Protocol
  WSDL: Web Services Description Language
Standardizing distributed computing
  Grid Services
  Custom configure remote computing dynamically
  Build your own remote computer, and discard
  Virtual Data: new data sets on demand
Shielding Users
Users do not want to deal with XML, they want their data
Users do not want to deal with configuring grid computing, they want results
SOAP: data appears in user memory, XML is invisible
SOAP call: just a remote procedure
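The idea that a SOAP call is "just a remote procedure" can be sketched with the Python standard library alone. The sketch below is not the SkyQuery code (which the talk says was written in C# and .NET). The service namespace `urn:example:skyservice`, the `Echo` method, and the `fake_transport` function are all hypothetical; the transport stands in for an HTTP POST so the example runs without a network.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "urn:example:skyservice"  # hypothetical service namespace


def build_request(method: str, params: dict) -> bytes:
    """Wrap a method name and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SVC_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SVC_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def parse_response(xml_bytes: bytes) -> dict:
    """Unpack the first element of the SOAP body into a dict of strings."""
    body = ET.fromstring(xml_bytes).find(f"{{{SOAP_NS}}}Body")
    result = body[0]
    return {child.tag.split("}")[-1]: child.text for child in result}


def fake_transport(request: bytes) -> bytes:
    """Stand-in for an HTTP POST: echoes the request parameters back."""
    call = ET.fromstring(request).find(f"{{{SOAP_NS}}}Body")[0]
    method = call.tag.split("}")[-1]
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    response = ET.SubElement(body, f"{{{SVC_NS}}}{method}Response")
    for child in call:
        ET.SubElement(response, child.tag).text = child.text
    return ET.tostring(envelope, encoding="utf-8")


class Proxy:
    """Turns attribute access into remote calls, so callers never touch XML."""

    def __init__(self, transport):
        self._transport = transport

    def __getattr__(self, method):
        def call(**params):
            request = build_request(method, params)
            return parse_response(self._transport(request))
        return call


if __name__ == "__main__":
    service = Proxy(fake_transport)
    # The caller sees an ordinary function call and an ordinary dict.
    print(service.Echo(ra=181.3, dec=-0.76))
```

In practice a toolkit would generate such a proxy from the service's WSDL description; the hand-written `Proxy` class here only illustrates the shielding.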
NVO: How Will It Work?
Define commonly used ‘atomic’ services
Build higher level toolboxes/portals on top
We do not build ‘everything for everybody’
Use the 90-10 rule:
  Define the standards and interfaces
  Build the framework
  Build the 10% of services that are used by 90% of users
  Let the users build the rest from the components
[Figure: # of users versus # of services]
Atomic Services
Metadata information about resources
  Waveband
  Sky coverage
  Translation of names to universal dictionary (UCD)
Simple search patterns on the resources
  Cone Search
  Image mosaic
  Unit conversions
Simple filtering, counting, histogramming
On-the-fly recalibrations
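A cone search, one of the atomic services listed above, returns the objects within a given angular radius of a sky position. The following is a minimal Python sketch over an in-memory list, assuming each row is a dict with `ra` and `dec` in degrees. The function names and row layout are assumptions for illustration, not an interface defined in the talk, and a real service would use a spatial index rather than a linear scan.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return the catalog rows within radius_deg of the position (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]


if __name__ == "__main__":
    # Made-up positions, for illustration only.
    catalog = [
        {"id": 1, "ra": 181.30, "dec": -0.76},
        {"id": 2, "ra": 181.35, "dec": -0.76},
        {"id": 3, "ra": 190.00, "dec": 5.00},
    ]
    for row in cone_search(catalog, ra=181.3, dec=-0.76, radius_deg=0.1):
        print(row["id"])
```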
Higher Level Services
Built on Atomic Services
Perform more complex tasks
Examples
  Automated resource discovery
  Cross-identifications
  Photometric redshifts
  Outlier detections
  Visualization facilities
Expectation:
  Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)
SkyQuery
Distributed Query tool using a set of services
Feasibility study, built in 6 weeks from scratch
  Tanu Malik (JHU CS grad student)
  Tamas Budavari (JHU astro postdoc)
Implemented in C# and .NET
Won 2nd prize of Microsoft XML Contest
Allows queries like:
  SELECT o.objId, o.r, o.type, t.objId
  FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
  WHERE XMATCH(o,t)<3.5
    AND AREA(181.3,-0.76,6.5)
    AND o.type=3 AND (o.i - t.m_j)>2
Architecture
[Diagram with components: Web Page, SkyQuery, Image cutout, SkyNode SDSS, SkyNode 2MASS, SkyNode FIRST]
Cross-id Steps
Parse query
Get counts
Sort by counts
Make plan
Cross-match
  Recursively, from small to large
Select necessary attributes only
Return output
Insert cutout image
  SELECT o.objId, o.r, o.type, t.objId
  FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
  WHERE XMATCH(o,t)<3.5
    AND AREA(181.3,-0.76,6.5)
    AND (o.i - t.m_j) > 2
    AND o.type=3
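The "get counts, sort by counts, cross-match recursively from small to large" steps can be sketched in Python as below. This is a toy under stated assumptions, not the SkyQuery algorithm itself: archives are in-memory lists of dicts with `ra` and `dec` in degrees, the match test is a plain small-angle separation cut in arcseconds (standing in for XMATCH, whose actual definition is not given on the slide), and RA wraparound is ignored. All function names are hypothetical.

```python
import math


def sep_arcsec(a, b):
    """Approximate small-angle separation in arcseconds between two rows."""
    mean_dec = math.radians((a["dec"] + b["dec"]) / 2)
    dra = (a["ra"] - b["ra"]) * math.cos(mean_dec)
    ddec = a["dec"] - b["dec"]
    return math.hypot(dra, ddec) * 3600.0


def make_plan(archives):
    """Get the object count of each archive and order archives smallest first."""
    counts = {name: len(rows) for name, rows in archives.items()}
    return sorted(counts, key=counts.get)


def cross_match(archives, plan, eps_arcsec):
    """Seed partial tuples from the smallest archive, then extend in plan order."""
    if not plan:
        return []
    first, rest = plan[0], plan[1:]
    tuples = [{first: row} for row in archives[first]]
    return _extend(archives, rest, tuples, eps_arcsec)


def _extend(archives, remaining, tuples, eps_arcsec):
    """Recursive step: match the current partial tuples against the next archive."""
    if not remaining or not tuples:
        return tuples
    name, rest = remaining[0], remaining[1:]
    extended = [
        {**partial, name: row}
        for partial in tuples
        for row in archives[name]
        # Keep a row only if it is close to every object already in the tuple.
        if all(sep_arcsec(row, member) < eps_arcsec for member in partial.values())
    ]
    return _extend(archives, rest, extended, eps_arcsec)


if __name__ == "__main__":
    # Made-up positions, for illustration only.
    archives = {
        "SDSS": [{"id": "s1", "ra": 181.3000, "dec": -0.7600},
                 {"id": "s2", "ra": 181.4000, "dec": -0.7000},
                 {"id": "s3", "ra": 181.5000, "dec": -0.8000}],
        "TWOMASS": [{"id": "t1", "ra": 181.3002, "dec": -0.7601},
                    {"id": "t2", "ra": 181.6000, "dec": -0.9000}],
        "FIRST": [{"id": "f1", "ra": 181.3001, "dec": -0.7599}],
    }
    plan = make_plan(archives)
    print(plan)
    for match in cross_match(archives, plan, eps_arcsec=3.5):
        print({name: row["id"] for name, row in match.items()})
```

Starting from the smallest archive keeps the list of partial tuples short, so less data has to travel to the larger archives at each step.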
Monte-Carlo Simulation
Comparing different algorithms for 3-way xid
  Transmit all the data
  Transmit after filtering
  Recursive cross-match
Surveys
  SDSS
  2MASS
  FIRST
Random variables:
  Sky Area (0..10 sqdeg)
  Selectivity of each subselect (0..1)
  Efficiency of join (0.5..2)
  Selectivity of common select (0..1)
[Figure: two panels, each with horizontal axis log cost (-4 to 4) and vertical axis 0 to 2000]
SkyNode
Metadata functions (SOAP)
  Info, Tables, Columns, Schema, Functions, Keysearch
Query functions (SOAP)
  Dataset Query(String sqlCmd)
  Dataset Xmatch(Dataset input, String sqlCmd, float eps)
Database: MS SQL Server
  Upload dataset
  Very fast spatial search engine (HTM-based)
  Crossmatch takes <3 ms/object over 15M in SDSS
  User defined functions and stored procedures
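To make the shape of the metadata and query functions concrete, here is a toy stand-in written in Python with the standard `sqlite3` module. It is only a sketch: the real SkyNode is described as a SOAP service on MS SQL Server, whereas `ToySkyNode`, its lowercase method names, and the four-column `PhotoPrimary` schema are assumptions made for this example. The Xmatch function and the HTM spatial index are left out.

```python
import sqlite3


class ToySkyNode:
    """In-memory stand-in exposing Tables, Columns and Query style calls."""

    def __init__(self, rows):
        # SQLite stands in for the archive database (assumption of this sketch).
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE PhotoPrimary "
            "(objId INTEGER PRIMARY KEY, ra REAL, dec REAL, type INTEGER)"
        )
        self._db.executemany("INSERT INTO PhotoPrimary VALUES (?, ?, ?, ?)", rows)

    def tables(self):
        """Metadata: names of the tables this node serves."""
        cur = self._db.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        return [name for (name,) in cur]

    def columns(self, table):
        """Metadata: column names of one table."""
        if table not in self.tables():
            raise KeyError(table)
        cur = self._db.execute(f'PRAGMA table_info("{table}")')
        return [row[1] for row in cur]  # second field of each row is the name

    def query(self, sql_cmd):
        """Query: run a SQL string and return the rows as a list of dicts."""
        cur = self._db.execute(sql_cmd)
        names = [d[0] for d in cur.description]
        return [dict(zip(names, row)) for row in cur]


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    node = ToySkyNode([(1, 181.30, -0.76, 3), (2, 181.40, -0.70, 6)])
    print(node.tables())
    print(node.columns("PhotoPrimary"))
    print(node.query("SELECT objId, ra FROM PhotoPrimary WHERE type = 3"))
```

Because every node answers the same small set of calls, a portal can discover the schema of each archive before it plans a distributed query.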
Data Flow
[Diagram: query flow among SkyQuery, SkyNode 1, SkyNode 2, and SkyNode 3]
http://www.skyquery.net
Other web services
Create density maps and masks for angular clustering
Deliver photometric redshifts from photometry data
Intersect pointed observations with surveys
Generate XSLT from script XML => SVG
Wrap legacy (Linux C) data mining applications as a web service
Create a C# class for the CFITSIO library
Archive Footprint
Footprint is a ‘fractal’
Result depends on context
  all sky, degree scale, pixel scale
Translate to web services
  Footprint() returns a single region that contains the archive
  Intersection(region, tolerance) takes a region and returns its intersection with the archive footprint
  Contains(point) returns yes/no (maybe fuzzy) if point is inside the archive footprint
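The three footprint calls can be sketched in Python with the simplest possible region, a single RA/Dec rectangle. Everything here is an assumption made for illustration: the `Box` type, the method names, the lack of RA wraparound handling, and in particular the reading of `tolerance` as a margin in degrees by which the footprint is grown before intersecting. A real footprint is far more intricate, as the 'fractal' remark suggests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned RA/Dec rectangle in degrees (no RA wraparound handling)."""
    ra_min: float
    ra_max: float
    dec_min: float
    dec_max: float


class ArchiveFootprint:
    """Toy archive whose footprint is a single box."""

    def __init__(self, box: Box):
        self._box = box

    def footprint(self) -> Box:
        """Single region that contains the archive."""
        return self._box

    def intersection(self, region: Box, tolerance: float = 0.0) -> Optional[Box]:
        """Overlap of region with the footprint grown by tolerance degrees."""
        b = self._box
        ra_min = max(region.ra_min, b.ra_min - tolerance)
        ra_max = min(region.ra_max, b.ra_max + tolerance)
        dec_min = max(region.dec_min, b.dec_min - tolerance)
        dec_max = min(region.dec_max, b.dec_max + tolerance)
        if ra_min > ra_max or dec_min > dec_max:
            return None  # no overlap
        return Box(ra_min, ra_max, dec_min, dec_max)

    def contains(self, ra: float, dec: float) -> bool:
        """Strict yes/no test; the fuzzy variant is not modelled here."""
        b = self._box
        return b.ra_min <= ra <= b.ra_max and b.dec_min <= dec <= b.dec_max


if __name__ == "__main__":
    # Made-up footprint, for illustration only.
    archive = ArchiveFootprint(Box(180.0, 185.0, -2.0, 2.0))
    print(archive.contains(181.3, -0.76))
    print(archive.intersection(Box(184.0, 186.0, 0.0, 3.0)))
```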
Summary
Exponential data growth – distributed data – federation needed
Projects now Publishers and Curators
Web Services – hierarchical architecture
Use the 90-10 rule (maybe 80-20)
There are clever ways to federate datasets!