CASJOBS: A WORKFLOW ENVIRONMENT DESIGNED FOR LARGE SCIENTIFIC CATALOGS
Nolan Li, Johns Hopkins University
What is CASJobs
Terabytes of scientific data
Web-based system
Data distribution
Server-side analysis
Optimize user work patterns
Server-side user storage and programmability
Sloan Digital Sky Survey (SDSS)
Astronomical survey
Images (FITS) - 15.7 TB
Other data products (masks, JPEG images, etc.) (DAS, FITS format) - 26.8 TB
Catalogs (CAS, SQL database) - 18 TB
Data is public
Delivery?
Database
Bandwidth is expensive!
10 terabytes is big!
So database it (SkyServer)
Partial delivery
Move work to data
Scalability
Traffic++
Complexity++
Data++
So…
Cap execution time
Cap results
Build something else
[Chart: Monthly CAS Usage; series: Web Hits, SQL Queries; logarithmic y-axis from 1.E+04 to 1.E+07]
CASJobs
Catalog Archive Server Jobs
Server-side user storage and programmability: MyDB
Hardware abstraction and long-term query portability: Contexts
Complete, automatic query logging
Scalable performance: Controlled asynchronous query execution
Data sharing: Groups
http://casjobs.sdss.org/casjobs
MyDB
Server-side user database
Intermediate storage
Data import
User programmable

SELECT *
FROM DR4 a
WHERE a.objid = 38573498
   OR a.objid = 92837451
   OR a.objid = 20394833
   OR a.objid = 90284723

SELECT *
FROM DR4 a, MyDB.MyTable b
WHERE a.objid = b.objid
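
The two queries above contrast a hand-written list of object IDs with a join against an uploaded table. The following is a minimal sketch of that idea using Python's built-in sqlite3 module rather than the real CASJobs or SQL Server backend. The table names `catalog` and `mytable`, the `mag` column, and its values are hypothetical stand-ins for a catalog context and a MyDB table.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a toy catalog and an empty user table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE catalog (objid INTEGER PRIMARY KEY, mag REAL)")
    conn.executemany(
        "INSERT INTO catalog VALUES (?, ?)",
        [(38573498, 17.2), (92837451, 18.9), (20394833, 16.4),
         (90284723, 19.1), (11111111, 20.0)],  # made-up magnitudes
    )
    # Stand-in for a table imported into the user's MyDB.
    conn.execute("CREATE TABLE mytable (objid INTEGER PRIMARY KEY)")
    return conn


def select_by_or_list(conn, objids):
    """Select objects with one OR term per ID; the query grows with the list."""
    if not objids:
        return []
    where = " OR ".join("a.objid = ?" for _ in objids)
    sql = f"SELECT a.objid, a.mag FROM catalog a WHERE {where} ORDER BY a.objid"
    return conn.execute(sql, objids).fetchall()


def select_by_join(conn, objids):
    """Import the IDs into the user table, then join; the query text stays fixed."""
    conn.execute("DELETE FROM mytable")
    conn.executemany("INSERT INTO mytable VALUES (?)", [(o,) for o in objids])
    sql = ("SELECT a.objid, a.mag FROM catalog a, mytable b "
           "WHERE a.objid = b.objid ORDER BY a.objid")
    return conn.execute(sql).fetchall()
```

Both functions return the same rows; the join version keeps the ID list on the server as data instead of embedding it in the query text.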
Logging
Automatically log all user queries
Resubmit old queries
Reconstruct database objects
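
As an illustration of the logging ideas on this slide, here is a small sketch (not the CASJobs implementation) in which every submitted statement is recorded, an old statement can be resubmitted by its position in the log, and a fresh database can be rebuilt by replaying the log. The class and function names are hypothetical, and sqlite3 stands in for the real server.

```python
import sqlite3


class LoggedSession:
    """Toy session that records every statement it is asked to run."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.log = []  # statements in submission order

    def run(self, sql):
        # Log before executing so that failed statements are recorded too.
        self.log.append(sql)
        return self.db.execute(sql).fetchall()

    def resubmit(self, index):
        """Run a previously logged statement again (this is logged as well)."""
        return self.run(self.log[index])


def reconstruct(log):
    """Rebuild database objects by replaying a log into a fresh session."""
    fresh = LoggedSession()
    for sql in log:
        fresh.run(sql)
    return fresh
```

Replaying works here because the log holds everything that ever changed the database; a real system would also need to handle statements that depend on external state.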
Contexts
Databases are identified by their data, not their location
Queries are independent of hardware configuration
SELECT TOP 10 *
FROM [server].[catalog].[user].MyTable

SELECT TOP 10 *
FROM DR4.MyTable
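
One way to picture a context is as a lookup from a logical name to a physical location, applied when the query is submitted. The sketch below assumes a simple regex rewrite; the registry contents (`server1`, `BESTDR4`, `dbo`, and so on) are invented for illustration, and the source does not describe how CASJobs actually performs this mapping.

```python
import re

# Hypothetical registry: context name -> physical prefix. Moving a catalog to
# new hardware would change only this table, not the users' saved queries.
CONTEXTS = {
    "DR4": "[server1].[BESTDR4].[dbo]",
    "MyDB": "[server2].[mydb_user].[dbo]",
}


def resolve_contexts(sql, contexts):
    """Rewrite context.table references to fully qualified physical names."""

    def replace(match):
        name, table = match.group(1), match.group(2)
        prefix = contexts.get(name)
        if prefix is None:
            # Not a known context (for example a column alias such as a.objid).
            return match.group(0)
        return f"{prefix}.{table}"

    return re.sub(r"\b(\w+)\.(\w+)\b", replace, sql)
```

For example, `resolve_contexts("SELECT TOP 10 * FROM DR4.MyTable", CONTEXTS)` yields a query against `[server1].[BESTDR4].[dbo].MyTable`, while the text the user wrote stays the same if the registry changes.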
Quick Jobs
Executes right away
But not for very long
Restricted memory usage
For things like…
How many objects?
Table previews
Preliminary queries
System queries
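
A quick job runs immediately but under a time cap and a result cap. The sketch below shows one way to enforce both, using sqlite3's progress handler as a stand-in for whatever mechanism the real server uses. The function name and the default limits are assumptions, not CASJobs settings, and the memory restriction from the slide is not modeled.

```python
import sqlite3
import time


def run_quick(conn, sql, max_seconds=1.0, max_rows=1000):
    """Run sql synchronously; abort after max_seconds, return at most max_rows."""
    deadline = time.monotonic() + max_seconds

    def over_deadline():
        # A non-zero return value tells SQLite to abort the running statement.
        return 1 if time.monotonic() > deadline else 0

    # Check the deadline every 1000 SQLite virtual machine instructions.
    conn.set_progress_handler(over_deadline, 1000)
    try:
        return conn.execute(sql).fetchmany(max_rows)
    except sqlite3.OperationalError as exc:
        if time.monotonic() > deadline:
            raise TimeoutError(f"quick job exceeded {max_seconds} s") from exc
        raise  # some other error, such as a syntax error
    finally:
        # Remove the handler so later statements on this connection are uncapped.
        conn.set_progress_handler(None, 1000)
```

A query that exceeds the cap raises `TimeoutError`, which a front end could turn into a suggestion to resubmit the query as a long job.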
Long Jobs
Asynchronous
Less restricted execution time
Storage capped by MyDB size
For things like…
Heavy I/O
Heavy computation
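
Asynchronous execution can be sketched as a queue: submitting a job returns an ID at once, a background worker runs jobs one at a time, and the user checks the status later. This is a simplified illustration with hypothetical names; it uses an in-memory sqlite3 database and does not model the MyDB storage cap.

```python
import queue
import sqlite3
import threading


class LongJobQueue:
    """Toy asynchronous job queue with a single background worker."""

    def __init__(self):
        self.jobs = queue.Queue()
        self.status = {}   # job id -> "queued" | "running" | "finished" | "failed"
        self.results = {}  # job id -> rows, or an error message on failure
        self._next_id = 0
        self._lock = threading.Lock()
        threading.Thread(target=self._work, daemon=True).start()

    def submit(self, sql):
        """Queue a statement and return its job id without waiting for it."""
        with self._lock:
            job_id = self._next_id
            self._next_id += 1
            self.status[job_id] = "queued"
        self.jobs.put((job_id, sql))
        return job_id

    def _work(self):
        # The connection is created inside the worker thread that uses it.
        conn = sqlite3.connect(":memory:")
        while True:
            job_id, sql = self.jobs.get()
            self.status[job_id] = "running"
            try:
                self.results[job_id] = conn.execute(sql).fetchall()
                self.status[job_id] = "finished"
            except sqlite3.Error as exc:
                self.results[job_id] = str(exc)
                self.status[job_id] = "failed"
            finally:
                self.jobs.task_done()

    def wait_all(self):
        """Block until every submitted job has finished or failed."""
        self.jobs.join()
```

Because one worker drains the queue in order, the load on the database stays controlled no matter how many jobs users submit.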
Groups
Non-exclusive sets of CASJobs users
Share data
Keep more work at the data

SELECT *
FROM myGroup.otherUser.theirTable
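
The three-part name `myGroup.otherUser.theirTable` suggests an access check of the form "is the reader in the group, and has the owner shared the table with that group?". The sketch below models that check under this assumption; the class and method names are hypothetical, and the source does not describe the actual permission model.

```python
class GroupRegistry:
    """Toy model of non-exclusive user groups and tables shared with them."""

    def __init__(self):
        self.members = {}  # group name -> set of user names
        self.shared = {}   # (owner, table) -> set of group names

    def join(self, group, user):
        # A user may belong to any number of groups.
        self.members.setdefault(group, set()).add(user)

    def share(self, owner, table, group):
        """Publish one of the owner's tables to a group the owner belongs to."""
        if owner not in self.members.get(group, set()):
            raise PermissionError(f"{owner} is not a member of {group}")
        self.shared.setdefault((owner, table), set()).add(group)

    def can_read(self, user, group, owner, table):
        """True if user is in group and owner has shared table with group."""
        return (user in self.members.get(group, set())
                and group in self.shared.get((owner, table), set()))
```

A reader who passes this check can query the shared table in place, so the data does not have to be downloaded and re-uploaded.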
Hardware
Flexible configuration
1+ machine per context (non-exclusive)
1+ machine for MyDBs
Interface
Web site
Web services
Usage
> 2 million jobs
> 2200 users
Astro deployments
Galaxy Evolution Explorer (GALEX)
Palomar Quest
Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) [3]
Non-astro deployments
Ameriflux
Swiss Institute of Bioinformatics (ISB)
[Chart: Monthly CASJobs; x-axis: date, 8/29/2003 to 8/31/2008; y-axis: 0 to 250000]