Prototype Web Services Using SDSS DR1
Alex Szalay, Tamas Budavari, Sam Carlisle, Jim Gray, Vivek Haridas, Nolan Li, Tanu Malik,
Maria Nieto-Santisteban, Wil O’Mullane, Ani Thakar
NVO: How Will It Work?
• Define commonly used ‘core’ services
• Build higher level toolboxes/portals on top
• We do not build ‘everything for everybody’
• Use the 90-10 rule:
  – Define the standards and interfaces
  – Build the framework
  – Build the 10% of services that are used by 90%
  – Let the users build the rest from the components

[Figure: plot of # of users (axis 0 to 1) against # of services (axis 0 to 0.5)]
Using SDSS DR1
• SDSS DR1 (Data Release 1) is now publicly available
  http://skyserver.pha.jhu.edu/dr1/
• About 1TB of catalog data
• Using MS SQL Server 2000
• Complex schema (72 Tables)
• About 80 million photometric objects
• Two versions (TARGET/BEST)
• Automated documentation
• Raw data at FNAL file server with URL access
Loading DR1
• Automated table driven workflow system for loading
  – Included lots of verification code
  – Over 16K lines of SQL code
• Loading process was extremely painful
  – Lack of systems engineering for the pipelines
  – Poor testing (lots of foreign key mismatches)
  – Detected data bugs even a month ago
  – Most of the time spent on scrubbing data
  – Fixing corrupted files (RAID5 disk errors)
• Once data was clean, everything loaded in 3 days
• Neighbors calculation took about 10 hours
• Reorganization of data took about 1 week of experiments in partitioning/layouts
Reorganization
• Introduced partitions and filegroups
  – Photo, Tag, Neighbors, Spectro, Frame, Other, Profiles
• Keep partitions under 100GB
• Vertical partitioning – tried and abandoned
• Both partitioning and index build now table driven
  – Stored procedures to create/drop indices at various granularities
• Tremendous improvement in performance when doing this on a large memory machine (24GB)
• Also much better performance afterwards
Spatial Features
• Precomputed Neighbors
  – All objects within 30”
• Boundaries, Masks and Outlines
  – Stored as spatial polygons

Time Domain:
• Precomputed Match
  – All objects within 1”, observed at different times
  – Found duplicates due to telescope tracking errors
  – Manual fix, recorded in the database
• MatchHead
  – The first observation in the linked list is used as the unique id for the chain of observations of the same object
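The 30” Neighbors radius and the 1” Match radius are both cuts on angular separation. The following is a minimal sketch of such a cut, assuming positions are given as (RA, Dec) in degrees and using the haversine formula. The class and method names are hypothetical, and this is not the code that built the precomputed tables.

```java
/** Hypothetical helper for illustration; not part of the SDSS code base. */
final class AngularSeparation {
    private AngularSeparation() {}

    /** Angular distance in arcseconds between two positions given as (RA, Dec) in degrees. */
    static double arcsec(double ra1, double dec1, double ra2, double dec2) {
        double d1 = Math.toRadians(dec1);
        double d2 = Math.toRadians(dec2);
        double sinHalfDec = Math.sin((d2 - d1) / 2.0);
        double sinHalfRa = Math.sin(Math.toRadians(ra2 - ra1) / 2.0);
        // Haversine formula, which stays accurate for arcsecond-scale separations.
        double h = sinHalfDec * sinHalfDec
                + Math.cos(d1) * Math.cos(d2) * sinHalfRa * sinHalfRa;
        double theta = 2.0 * Math.asin(Math.min(1.0, Math.sqrt(h)));
        return Math.toDegrees(theta) * 3600.0;
    }

    /** True if the two positions lie within radiusArcsec of each other. */
    static boolean within(double ra1, double dec1, double ra2, double dec2,
                          double radiusArcsec) {
        return arcsec(ra1, dec1, ra2, dec2) <= radiusArcsec;
    }
}
```

With this helper, `within(..., 30.0)` corresponds to the Neighbors cut and `within(..., 1.0)` to the Match cut.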
Spatial Algorithms
• Updated HTM library
  – Automated depth for HTM_Cover
  – Output vertices
  – Simplify polygon
  – Boolean operations on regions
  – Part of VO data model (A. Rots)
• Zones
  – Much better performance for bulk neighbors at a fixed radius
• Footprint service in progress
  – Bool Contains(point)
  – Region Intersect(region)
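The idea behind zones can be sketched as follows: cut the sky into declination stripes of fixed height, so that a fixed-radius neighbor search only has to look at the few stripes that overlap the search band. This sketch assumes the zone number is floor((dec + 90) / height), ignores RA wraparound at 0/360, and uses a rough 1/cos(dec) RA prefilter that is adequate only for small radii away from the poles. The class `ZoneIndex` and its methods are hypothetical; in a database the inner loop would be an index range scan on (zone, ra) rather than a list walk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical zone index for illustration; not the SkyServer implementation. */
final class ZoneIndex {
    private final double zoneHeightDeg;
    private final Map<Integer, List<double[]>> zones = new HashMap<>();

    ZoneIndex(double zoneHeightDeg) {
        this.zoneHeightDeg = zoneHeightDeg;
    }

    /** Zone number of a declination: stripes counted upward from the south pole. */
    int zoneOf(double decDeg) {
        return (int) Math.floor((decDeg + 90.0) / zoneHeightDeg);
    }

    void add(double raDeg, double decDeg) {
        zones.computeIfAbsent(zoneOf(decDeg), z -> new ArrayList<>())
             .add(new double[] {raDeg, decDeg});
    }

    /** All stored points within radiusDeg of the position (RA wraparound ignored). */
    List<double[]> neighbors(double raDeg, double decDeg, double radiusDeg) {
        List<double[]> hits = new ArrayList<>();
        // Rough RA prefilter: the window widens by 1/cos(dec) toward the poles.
        double maxAbsDec = Math.min(89.9, Math.abs(decDeg) + radiusDeg);
        double raWindow = radiusDeg / Math.cos(Math.toRadians(maxAbsDec));
        // Only zones overlapping the declination band can hold neighbors.
        for (int z = zoneOf(decDeg - radiusDeg); z <= zoneOf(decDeg + radiusDeg); z++) {
            for (double[] p : zones.getOrDefault(z, Collections.emptyList())) {
                if (Math.abs(p[0] - raDeg) <= raWindow
                        && separationDeg(raDeg, decDeg, p[0], p[1]) <= radiusDeg) {
                    hits.add(p);
                }
            }
        }
        return hits;
    }

    /** Exact angular separation in degrees (haversine formula). */
    static double separationDeg(double ra1, double dec1, double ra2, double dec2) {
        double d1 = Math.toRadians(dec1);
        double d2 = Math.toRadians(dec2);
        double sinHalfDec = Math.sin((d2 - d1) / 2.0);
        double sinHalfRa = Math.sin(Math.toRadians(ra2 - ra1) / 2.0);
        double h = sinHalfDec * sinHalfDec
                + Math.cos(d1) * Math.cos(d2) * sinHalfRa * sinHalfRa;
        return Math.toDegrees(2.0 * Math.asin(Math.min(1.0, Math.sqrt(h))));
    }
}
```

Choosing the zone height equal to the search radius keeps the number of zones visited per object small, which is one plausible reading of why the approach suits bulk searches at a fixed radius.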
Web Services in Progress
• Registry
  – Harvesting and querying
• Data Delivery
  – Query driven Queue management
• Graphics and visualization
  – Query driven vs interactive
  – Show spatial objects (Chart/Navi/List)
• Footprint/intersect
  – It is a “fractal”
• Cross-matching
  – SkyQuery and SkyNode
  – Ferris-wheel
  – Distributed vs parallel
Registry: Easy Clients
Just use a SOAP toolkit (T. McGlynn & J. Lee have done a Perl client).
Easy in Java:

    java org.apache.axis.wsdl.WSDL2Java "http://skyservice.pha.jhu.edu/devel/registry/registry.asmx?wsdl"

• Gives set of Classes for accessing the service
• Gives Classes for the XML which is returned (i.e. SimpleResource)

Still need to write a client like:

    // RegistryLocator, RegistrySoap and ArrayOfSimpleResource are generated by WSDL2Java
    RegistryLocator loc = new RegistryLocator();
    RegistrySoap reg = loc.getRegistrySoap();
    ArrayOfSimpleResource reses = null;
    // args[0] carries the query string passed on the command line
    reses = reg.queryRegistry(args[0]);

http://skyservice.pha.jhu.edu/devel/registry/index.aspx
Generic Catalog Access
• After 2 years of SDSS EDR and 6 months of DR1 usage, access patterns start to emerge
  – Lots of small users, requiring instant response
  – 1/f distribution of request sizes (tail of the lognormal)
• How to make everybody happy?
• No clear business model…
• We need a separate interactive and batch server
• We also need access to full SQL with extensions
• Users want to access services via browsers
• Other services will need SOAP access
Data Formats
• Different data formats requested:
  – HTML, CSV, FITS binary, VOTABLE, XML, graphics
• Quick browsing and exploration
  – Small requests, need to be nicely rendered
  – Needs good random access performance
  – Also simple 2D scatter plots or density plots required
• Heavy duty statistical use
  – Aggregate functions on complex joins, lots of scans but small output, mostly want CSV
• Successive Data Filter
  – Multi-step non-indexed filtering of the whole database, mostly want FITS binary
Data Delivery
• Small requests (<100MB)
  – Putting data on the stream
• Medium requests (<1GB)
  – Use DIME attachments to SOAP messages
• Large requests (>1GB)
  – Save data in scratch area and use asynch delivery
  – Only practical for large/long queries
• Iterative requests
  – Save data in temp tables in user space
  – Let user manipulate via web browser
• Paradox: if we use web browser to submit, users want immediate response from batch-size queries
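The size tiers above amount to a small dispatch rule. A minimal sketch, assuming decimal megabytes and treating a result of exactly 1GB as large (the slide leaves that boundary open); the enum and class names are hypothetical and do not come from the CasService code.

```java
/** Hypothetical dispatcher for illustration; thresholds follow the tiers listed above. */
final class DeliveryPlanner {
    enum Mode { STREAM, DIME_ATTACHMENT, ASYNC_SCRATCH }

    // Assumption: decimal units (1 MB = 10^6 bytes).
    static final long MB = 1_000_000L;

    private DeliveryPlanner() {}

    /** Picks a delivery mode from the estimated result size in bytes. */
    static Mode choose(long resultBytes) {
        if (resultBytes < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        if (resultBytes < 100 * MB) {
            return Mode.STREAM;           // small: write directly to the response stream
        }
        if (resultBytes < 1000 * MB) {
            return Mode.DIME_ATTACHMENT;  // medium: attach to the SOAP message
        }
        return Mode.ASYNC_SCRATCH;        // large: save to scratch, deliver asynchronously
    }
}
```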
How To Provide a UserDB
• Goal: through several search/filter operations reduce data transfer to manageable sizes (1-100MB)
• Today: people download tens of millions of rows, and then do their next filtering on client side, using F77
• Could be much better done in the database
• But: users need to create/manage temporary tables
  – DOS attacks, fragmentation, who pays for it
  – Security, who can see my data (group access)?
  – Follow progress of long jobs
  – Who does the cleanup?
Query Management Service
• Enable fast, anonymous access to small requests
• Enable large queries, with ability to manage
• Enable creation of temporary tables in user space
• Create multiple ways to get query output
• Needs to support multiple mirrors/load balancing
• Do all this without logging in to Windows
• Need also support of machine clients

Web Service: http://skyservice.pha.jhu.edu/devel/CasJobs/

• Two request categories:
  – Quick
  – Batch
Queue Management
• Need to register batch ‘power users’
• Query output goes to ‘MyDB’
• Can be joined with source database
• Results are materialized from MyDB upon request
• Users can do:
  – Insert, Drop, Create, Select Into, Functions, Procedures
  – Publish their tables to a group area
• Data delivery via the CASService (C# WS)
  http://skyservice.pha.jhu.edu/devel/CasService/CasService.asmx
Graphics Tools
• Simple xy plots
  http://skyservice.pha.jhu.edu/nli/wplot/
• Density plot
  http://skyservice.pha.jhu.edu/devel/DensityMap/AllSkyView.aspx
  http://skyservice.pha.jhu.edu/devel/DensityMap/PlotQuery.aspx
• Chart/Navi/List
  http://skyservice.pha.jhu.edu/dr1/imgcutout/getjpeg.asmx
• Can be built into various applications
Archive Footprint
• Footprint is a ‘fractal’
• Result depends on context
  – all sky, degree scale, pixel scale
• Translate to web services
  – Footprint(): returns a single region that contains the archive
  – Intersection(region, tolerance): takes a region and returns its intersection with the archive footprint
  – Contains(point): returns yes/no (maybe fuzzy) if point is inside archive footprint
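A minimal sketch of the Contains(point) call, with the optional fuzziness modelled as an angular tolerance. It assumes the footprint is a single spherical cap (center plus radius), which is only a stand-in: the real footprint is described elsewhere in these slides as stored polygons. The class name, constructor, and tolerance parameter are hypothetical.

```java
/** Hypothetical single-cap footprint; a stand-in for a real polygonal region. */
final class CapFootprint {
    private final double[] center;   // unit vector of the cap center
    private final double radiusDeg;  // angular radius of the cap

    CapFootprint(double raDeg, double decDeg, double radiusDeg) {
        this.center = unitVector(raDeg, decDeg);
        this.radiusDeg = radiusDeg;
    }

    /** Converts (RA, Dec) in degrees to a Cartesian unit vector. */
    static double[] unitVector(double raDeg, double decDeg) {
        double ra = Math.toRadians(raDeg);
        double dec = Math.toRadians(decDeg);
        return new double[] {
            Math.cos(dec) * Math.cos(ra),
            Math.cos(dec) * Math.sin(ra),
            Math.sin(dec)
        };
    }

    /** True if the point lies inside the cap, widened by toleranceDeg for a fuzzy answer. */
    boolean contains(double raDeg, double decDeg, double toleranceDeg) {
        double[] p = unitVector(raDeg, decDeg);
        double dot = center[0] * p[0] + center[1] * p[1] + center[2] * p[2];
        // Clamp against rounding before taking the arc cosine.
        double sepDeg = Math.toDegrees(Math.acos(Math.max(-1.0, Math.min(1.0, dot))));
        return sepDeg <= radiusDeg + toleranceDeg;
    }
}
```

The context dependence noted above shows up here as the tolerance: an all-sky query could pass a tolerance of degrees, a pixel-scale query a tolerance near zero.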
Cross-Matching
• SkyQuery – SkyNode
• Currently lots of proprietary features
  – Data transmitted via .NET DataSet => VOTable
  – Query plan written in MS T-SQL => ADQL
  – Spatial operator restricted to a cone => VORegion
  – Made up metadata delivery => VORegistry
  – Data delivery in XML/HTML => VOTable
• Catalogs in the near future
  – SDSS DR1, FIRST, 2MASS, INT
  – POSS-1, GSC-2, HST, ROSAT, 2dF
  – GALEX, IRAS, PSCZ
Spatial Cross-Match
• For small area HTM is close to optimal, but needs more speed
• For all-sky surveys the zone algorithm is best
• Current heuristic is a linear chain of all nodes
• Easy to generalize to include precomputed neighbors
• But, for all sky queries very large number of random reads instead of sequential
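The linear chain can be pictured as a candidate list handed from node to node, shrinking at each step. The sketch below is a deliberately simplified assumption: each node keeps a candidate only if it has at least one counterpart within a fixed radius, matching is brute force, and the order of nodes is taken as given. The class `ChainCrossMatch` and its methods are hypothetical and do not reproduce the SkyQuery matching logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical linear-chain cross-match for illustration only. */
final class ChainCrossMatch {
    private ChainCrossMatch() {}

    /**
     * nodes[i][j] = {raDeg, decDeg} of object j in catalog i.
     * Returns the objects of nodes[0] that have a counterpart in every later node.
     */
    static List<double[]> run(double[][][] nodes, double radiusArcsec) {
        List<double[]> survivors = new ArrayList<>();
        if (nodes.length == 0) {
            return survivors;
        }
        for (double[] obj : nodes[0]) {
            survivors.add(obj);
        }
        // Each node in the chain filters the list it receives from the previous one.
        for (int i = 1; i < nodes.length; i++) {
            List<double[]> next = new ArrayList<>();
            for (double[] s : survivors) {
                if (hasCounterpart(s, nodes[i], radiusArcsec)) {
                    next.add(s);
                }
            }
            survivors = next;
        }
        return survivors;
    }

    private static boolean hasCounterpart(double[] obj, double[][] catalog,
                                          double radiusArcsec) {
        for (double[] c : catalog) {
            if (arcsec(obj[0], obj[1], c[0], c[1]) <= radiusArcsec) {
                return true;
            }
        }
        return false;
    }

    /** Angular separation in arcseconds (haversine formula). */
    static double arcsec(double ra1, double dec1, double ra2, double dec2) {
        double d1 = Math.toRadians(dec1);
        double d2 = Math.toRadians(dec2);
        double sinHalfDec = Math.sin((d2 - d1) / 2.0);
        double sinHalfRa = Math.sin(Math.toRadians(ra2 - ra1) / 2.0);
        double h = sinHalfDec * sinHalfDec
                + Math.cos(d1) * Math.cos(d2) * sinHalfRa * sinHalfRa;
        return Math.toDegrees(2.0 * Math.asin(Math.min(1.0, Math.sqrt(h)))) * 3600.0;
    }
}
```

Each per-candidate lookup here stands for a random read at the node, which is the cost the last bullet above points to for all-sky queries.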
Ferris-Wheel
• Sky split into buckets/zones
• All archives scan in sync
• Queries enter at bottom
• Results come back after full circle
• Only sequential access => buckets get into cache, then queries processed

[Figure: Ferris-wheel diagram, labels: Portal, SDSS]
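The Ferris-wheel idea can be simulated in a few lines: one cyclic scan reads each bucket once per rotation, every query currently on the wheel is evaluated against that bucket, and a query leaves after it has seen all buckets. This is a toy sketch under stated assumptions: a single archive, buckets held as in-memory arrays of values, at least one bucket, and queries that are simple range counts. `FerrisWheel`, `Rider`, `submit`, and `tick` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical simulation of a shared cyclic scan; names are illustrative only. */
final class FerrisWheel {
    /** A range query riding the wheel: counts values in [min, max]. */
    static final class Rider {
        final double min;
        final double max;
        int bucketsSeen = 0;
        int matches = 0;

        Rider(double min, double max) {
            this.min = min;
            this.max = max;
        }
    }

    private final double[][] buckets;  // assumed non-empty
    private final List<Rider> riders = new ArrayList<>();
    private int position = 0;

    FerrisWheel(double[][] buckets) {
        this.buckets = buckets;
    }

    /** A query may board at any time; it starts with the bucket read by the next tick. */
    Rider submit(double min, double max) {
        Rider r = new Rider(min, max);
        riders.add(r);
        return r;
    }

    /** Reads the next bucket once, applies it to every rider, returns riders that finished a full circle. */
    List<Rider> tick() {
        double[] bucket = buckets[position];  // one sequential read shared by all riders
        List<Rider> done = new ArrayList<>();
        for (Iterator<Rider> it = riders.iterator(); it.hasNext();) {
            Rider r = it.next();
            for (double v : bucket) {
                if (v >= r.min && v <= r.max) {
                    r.matches++;
                }
            }
            r.bucketsSeen++;
            if (r.bucketsSeen == buckets.length) {
                done.add(r);
                it.remove();
            }
        }
        position = (position + 1) % buckets.length;
        return done;
    }
}
```

The trade-off is visible in the simulation: every query waits a full rotation regardless of its size, but the number of bucket reads per rotation does not grow with the number of queries.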
Utilities
• FITSLIB 1.10
  C# library around the CFITSIO package
  http://www.cs.jhu.edu/~haridas/tech/Fits/
• MIRAGE
  Java wrapper around Mirage, can directly access the VORegistry, and ConeSearch
  http://skyservice.pha.jhu.edu/develop/vo/mirage/mirage.html
• HTM2.0
  Updated HTM library, conforming to the new Region specification
  http://www.sdss.jhu.edu/htm/
• ADQL
  Prototype service to convert back and forth between ADQL and SQL
  http://skyservice.pha.jhu.edu/vivek/msdev/AstroDql/ws/
  http://skyservice.pha.jhu.edu/vivek/msdev/AstroDql/ws/Archive.asmx
• SDSSQA
  Java application, emulating MS Query Analyzer
Summary
• Web Services have been remarkably easy to use
• Now different platforms are interoperable
• We have invested a lot of energy to develop various interface libraries (FITS, VOTable)
• Integrating graphics into web services was very easy
• Next:
  – Parallel queries
  – Finish query queue management
  – Upgrade SkyQuery
  – Bring in more archives
  – Ferris-Wheel experiment
  – On-demand database creation
  – 100TB parallel data access layer
http://skyservice.pha.jhu.edu/develop/