Data-Intensive Science at Johns Hopkins University Institute for Data-Intensive Engineering and Science (IDIES) Johns Hopkins University Jordan Raddick


Page 1:

Data-Intensive Science at Johns Hopkins University

Institute for Data-Intensive Engineering and Science (IDIES)

Johns Hopkins University

Jordan Raddick

Page 2:

Case Study: Astronomy

• In the “old days” astronomers took photos.
• New instruments are digital (< 100 GB/night)
• Detectors are following Moore’s law.
• Data avalanche: data volume doubles every 2 years

Figure caption: Total area of 3m+ telescopes in the world in m², total number of CCD pixels in megapixels, as a function of time. Growth over 25 years is a factor of 30 in glass, 3000 in pixels.
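The growth factors in the caption can be turned into doubling times, which gives a quick check on the "doubles every 2 years" bullet. This is a minimal sketch that assumes steady exponential growth over the 25 years (the slide does not say so); the function name is illustrative.

```python
import math

def doubling_time_years(growth_factor: float, span_years: float) -> float:
    """Doubling time implied by a total growth factor over a span,
    assuming steady exponential growth (an assumption, not from the slide)."""
    return span_years * math.log(2) / math.log(growth_factor)

glass = doubling_time_years(30, 25)      # telescope collecting area
pixels = doubling_time_years(3000, 25)   # CCD pixel count

print(f"glass doubles every {glass:.1f} years")    # 5.1
print(f"pixels double every {pixels:.1f} years")   # 2.2
```

Under that assumption the pixel count doubles roughly every two years, in line with the bullet, while collecting area doubles only about every five years.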

Page 3:

Why astronomy?

• Because “it’s worthless” (Jim Gray)
  – No privacy restrictions
  – No intellectual property
  – No one wants to sell you data

• Because it’s intrinsically interesting

• And because there’s a lot of it

Page 4:

Astronomy Data

• Astronomers have a few Petabytes now
• Data doubles every 2 years
• Data is public after 2 years
• So, 50% of the data is public
• But… how do I get at that 50% of the data?
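The 50% figure follows from the two preceding bullets: if the total volume doubles every 2 years, then everything older than 2 years is half of what exists today. A short sketch of that arithmetic, assuming smooth exponential growth (the function name is illustrative):

```python
def public_fraction(doubling_years: float, embargo_years: float) -> float:
    """Fraction of all data ever taken that is older than the embargo,
    assuming total volume grows as 2 ** (t / doubling_years)."""
    return 2.0 ** (-embargo_years / doubling_years)

print(public_fraction(2, 2))  # 0.5
```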

Page 5:

Science is breaking

• You can…
  – search 1 MB in 1 second, transfer for < 1¢
  – search 1 GB in 1 minute, transfer for $1
  – search 1 TB in 2 days, transfer for $1,000
  – search 1 PB in 3 years, transfer for $1,000,000
  – …and 1 PB is 5,000 disks
• “Data avalanche” in all fields of science
  – New approach needed
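The search times above imply a sequential scan rate for each data size, and the disk count implies a per-disk capacity. The sketch below works these out, assuming decimal units (1 MB = 10**6 bytes) and 365-day years, neither of which is stated on the slide.

```python
# Decimal units are an assumption; the slide does not specify them.
SEARCHES = {
    "1 MB": (10**6, 1),                      # 1 second
    "1 GB": (10**9, 60),                     # 1 minute
    "1 TB": (10**12, 2 * 24 * 3600),         # 2 days
    "1 PB": (10**15, 3 * 365 * 24 * 3600),   # 3 years
}

def rate_mb_per_s(size_bytes: int, seconds: int) -> float:
    """Implied scan rate in MB/s for a full pass over the data."""
    return size_bytes / seconds / 10**6

for label, (size, secs) in SEARCHES.items():
    print(f"{label}: {rate_mb_per_s(size, secs):.1f} MB/s")

disk_gb = 10**15 / 5000 / 10**9
print(f"1 PB over 5,000 disks = {disk_gb:.0f} GB per disk")
```

The implied rates stay within roughly 1 to 17 MB/s across nine orders of magnitude in size, so a plain sequential search scales linearly with the data and becomes impractical at the petabyte level.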

Page 6:

Data Intensive Science Tools

• Databases instead of files (DBMSs)
  – Data integrity ensured
  – Optimized data access (DB indices)
  – Ability to define methods on data
  – Do the science inside the database!
  – Bring the analysis to the data, not vice-versa
• Scalable (parallel) data access
• High-speed transport protocols
  – Move the data rapidly when necessary
• Asynchronous Web services
  – Web browsers cannot handle data volumes
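To make "do the science inside the database" concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in for the DBMS discussed in this talk. The table, column names, and synthetic rows are all hypothetical. The point is that the filter uses an index and the aggregate is computed by the database, so only one result row travels back to the client.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE galaxy (id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)"
)

# Synthetic catalog rows (made-up values, for illustration only).
rows = [
    (i, (i * 7.3) % 360.0, (i % 180) - 90.0, 15.0 + (i % 50) / 10.0)
    for i in range(10_000)
]
conn.executemany("INSERT INTO galaxy VALUES (?, ?, ?, ?)", rows)

# An index on declination lets the range filter avoid a full table scan.
conn.execute("CREATE INDEX idx_galaxy_dec ON galaxy (dec)")

# The analysis runs where the data lives; one row comes back.
count, mean_mag = conn.execute(
    "SELECT COUNT(*), AVG(mag) FROM galaxy WHERE dec BETWEEN ? AND ?",
    (-1.0, 1.0),
).fetchone()
print(count, round(mean_mag, 2))
```

The alternative, downloading all 10,000 rows and averaging on the client, is the "bring the data to the analysis" pattern the slide argues against.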

Page 7:

Gray’s Laws of Data Engineering

• Jim Gray
  – Scientific computing is revolving around data
  – Need scale-out solution for analysis
  – Take the analysis to the data!
  – Start with “20 queries” use cases

Page 8:

Virtual Observatory

• Premise: Most data is (or could be) online
• So, the Internet is the world’s best telescope:
  – It has data on every part of the sky
  – In every measured spectral band: optical, x-ray, radio...
  – As deep as the best instruments (2 years ago).
  – It is up when you are up. The observing conditions are always great (no working at night, no clouds, no moons, no...).
  – It’s a smart telescope: links objects and data to literature on them.

Page 9:

Sloan Digital Sky Survey

• A map of the universe
• Telescope at Apache Point Observatory (Sunspot, NM)
  – 2.5 meter reflector
  – ~3 degree field of view
  – Drift scanning (telescope stationary, sky appears to move)

Page 10:

An Ambitious Survey

A thousand-fold increase in the amount of data!

• Info content > US Library of Congress
• Before SDSS, total number of galaxies with measured parameters ~ 200k
• With SDSS, we already have detailed parameters for over 200 Million galaxies!!

Page 11:

SDSS Data Overview

• Data Archive Server (DAS)
  – FITS files (raw data)
  – Images, spectra, corrected frames, atlas images, binned images, masks
  – Online form-based access
  – Rsync and wget file retrieval
• Catalog Archive Server (CAS)
  – Science parameters extracted to catalogs
  – Stuffed into relational DBMS (SQL Server)
  – Heavily indexed, optimized
  – Online access via SkyServer
  – Several levels of access, query tools
• SDSS Data Release
  – www.sdss.org
  – das.sdss.org/DRx-cgi-bin/DAS
  – cas.sdss.org
  – skyserver.sdss.org

Page 12:

The CAS Databases

• Processed data is stuffed into a commercial relational DBMS
  – Microsoft SQL Server 2000
• Allows fast exploration and analysis - Data Mining
• Two versions of the sky: Best and Target
  – Target is version of sky on which spectroscopic targets were chosen
  – Best is latest, greatest processing of the data
  – 2 DBs for each release, e.g. BestDR3 and TargDR3
• Heavily indexed to speed up access
  – HTM + DB Indices
• Short queries can run interactively.
• Long queries (> 1 hr) require a custom Batch Query System.

SDSS Pipelines

Page 13:

SkyServer Web Access

• Supports several levels of SQL access
  – Novice/casual users
    • Radial (cone) and Rectangular search
  – Intermediate/astro users
    • Imaging and Spectro Form Query
  – Expert
    • Free-form SQL, Object Crossid (upload RA/dec list)
    • CasJobs workbench environment (MyDB)
• Visual tools
  – ImageCutout service
    • Finding Chart, Navigate/browse images, image lists
  – Explore tool
    • Detailed info for each image object, including spectrum
• Downloadable interfaces
  – Emacs, Java tool (sdssQA), sqlcl (command-line)

http://cas.sdss.org/, http://skyserver.sdss.org/
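A radial (cone) search returns every object within a given angular distance of a sky position. The sketch below shows the idea as a brute-force filter over a tiny made-up catalog. It is not the SkyServer implementation, which according to the earlier slide relies on HTM and database indices rather than testing every row; the function names and sample coordinates are hypothetical.

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(objects, ra, dec, radius_arcmin):
    """Return the (name, ra, dec) tuples within radius_arcmin of (ra, dec)."""
    radius_deg = radius_arcmin / 60.0
    return [o for o in objects
            if angular_sep_deg(ra, dec, o[1], o[2]) <= radius_deg]

# Hypothetical catalog entries: (name, RA in degrees, dec in degrees).
catalog = [
    ("a", 180.00, 0.00),
    ("b", 180.01, 0.01),
    ("c", 181.00, 0.00),
    ("d", 179.99, -0.02),
]

hits = cone_search(catalog, ra=180.0, dec=0.0, radius_arcmin=2.0)
print([name for name, _, _ in hits])  # ['a', 'b', 'd']
```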

Page 14:

CAS Workload Management

• Query execution time follows power law
• Vast majority of queries under a minute
• Separate short and long queries
• Execute long queries in batch mode
• Short and long queues (short=1min, long=8hrs)
• Strictly limit time of query in a queue
• Provide user workspace DB – MyDB
  – Reduce network traffic from repeat and intermediate results
  – Allow sharing of query results between user groups

[Figure: log-scale plot with legend entries elapsed, cpu, and rows. Horizontal axis bins run from 0 to 524288 in powers of two; vertical axis runs from 1 to 1000000.]
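Strictly limiting the time a query may spend in a queue can be sketched with Python's built-in sqlite3 module, used here only as a stand-in for the production database. The helper name `run_with_limit` and the 0.05 second limit are illustrative; the slide's actual limits are 1 minute and 8 hours.

```python
import sqlite3
import time

def run_with_limit(conn, sql, limit_s):
    """Run sql and return its rows, or None if the time limit was exceeded."""
    deadline = time.monotonic() + limit_s
    # SQLite calls the handler every 1000 virtual-machine instructions;
    # a nonzero return value aborts the running query.
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 1000
    )
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError:
        # An aborted query surfaces as OperationalError. This sketch does
        # not distinguish it from other operational errors.
        return None
    finally:
        conn.set_progress_handler(None, 1000)  # remove the handler

conn = sqlite3.connect(":memory:")

quick = run_with_limit(conn, "SELECT 1 + 1", limit_s=1.0)

# A deliberately long-running query: count to one billion.
slow_sql = (
    "WITH RECURSIVE c(x) AS ("
    "SELECT 1 UNION ALL SELECT x + 1 FROM c WHERE x < 1000000000"
    ") SELECT COUNT(*) FROM c"
)
slow = run_with_limit(conn, slow_sql, limit_s=0.05)

print(quick, slow)
```

The short query returns its rows, while the long one is cut off at the limit and reports None, which is the behavior a short queue needs before handing the query over to batch mode.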

Page 15:

CasJobs and MyDB

• Batch Query Workbench for SDSS CAS
• Queries are queued for batch execution
  – Load balancing – queues on multiple servers
  – Limit of 2 simultaneous queries per server
• Short (1 minute) queue for immediate mode
  – Query aborted after 1 minute
• MyDB personal database
  – 1 GB (more on demand) SQL DB for each user
  – Long queries write to MyDB table by default
  – User can extract output (download) when ready
  – Ability to share MyDB tables with other users via group visibility
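The queueing rules above (several servers, at most 2 simultaneous queries on each) can be sketched as a simple dispatcher. This illustrates the policy only and is not the CasJobs code; the server names, job names, and least-loaded placement rule are assumptions.

```python
from collections import deque

MAX_PER_SERVER = 2  # limit stated on the slide

def dispatch(pending, running):
    """Start pending jobs on the least-loaded server until all are full.

    pending: deque of job names waiting in the queue
    running: dict mapping server name -> list of jobs running there
    Returns the list of (job, server) pairs started in this pass.
    """
    started = []
    while pending:
        server = min(running, key=lambda s: len(running[s]))
        if len(running[server]) >= MAX_PER_SERVER:
            break  # every server is at its limit; remaining jobs wait
        job = pending.popleft()
        running[server].append(job)
        started.append((job, server))
    return started

running = {"cas1": [], "cas2": []}           # hypothetical server names
pending = deque(["q1", "q2", "q3", "q4", "q5"])

started = dispatch(pending, running)
print(started)        # four jobs start, two per server
print(list(pending))  # ['q5'] keeps waiting
```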

Page 16:

SkyServer

• Entire database of the Sloan Digital Sky Survey
  – Free to anyone
  – 4 TB of data
  – 287 million sky objects
  – 1.2 million spectra

Page 17:

Why data to the public?

• All SDSS data available to general public
  – Commitment to data sharing
  – People can look at any star or galaxy
  – Inquiry learning known to be effective (learners answer a question themselves, with their own design)

Page 18:

Browse Data

Page 19:

Explore Data

Page 20:

Search for Data

Page 21:

From Access to Learning

• Not enough just to give public access to data
• How are they using it?
• Are they learning anything from it?
  – Do they understand what they’re doing and why?*

*Understanding = long-term memory, transfer to new situations (How People Learn)

Page 22:

Formal Education: Projects

Page 23:

SkyServer projects

• Three levels
  – Kids (K-8)
  – Basic (high school or Astro 101)
  – Advanced (skilled and motivated students)
• Research challenges
  – Independent research (open inquiry) with data
• Activities created by users

Page 24:

SkyServer projects

Page 25:

Use of educational projects

• 2 of top 20 IPs to hit entire site are K-12 districts
  – Orlando, FL & Los Angeles

[Figure: Traffic by Month. Hits and page views on a logarithmic axis (1.E+5 to 1.E+7), from 2001/4 through 2006/4.]

Page 26:

Citizen science: Galaxy Zoo

• Background: galaxies come in different shapes

• Classifying is hard for computers, easy for people

• So get the public to help!

spiral elliptical

Page 27:

Galaxy Zoo

Page 28:

Galaxy Zoo Discovery

• But if you mirror galaxies, it’s still 48/52!
• Counter-clockwise excess due to perception
• Unintentional psychology discovery!

clockwise (48%) counter-clockwise (52%)
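The mirror test separates a real excess on the sky from a bias in the observers: mirroring every image swaps clockwise and counter-clockwise, so a real excess would flip to 48% counter-clockwise, while a perceptual bias would leave the reported split unchanged. The toy model below illustrates that logic. The single bias parameter and its value of 0.04 are hypothetical, chosen only to reproduce the 48/52 split, and are not taken from the Galaxy Zoo analysis.

```python
def reported_ccw(true_ccw: float, flip_to_ccw: float) -> float:
    """Expected fraction of galaxies reported as counter-clockwise.

    true_ccw     fraction that really are counter-clockwise
    flip_to_ccw  probability that a clockwise galaxy is misreported as
                 counter-clockwise (hypothetical bias parameter)
    """
    return true_ccw + (1.0 - true_ccw) * flip_to_ccw

def mirrored(true_ccw: float) -> float:
    """Mirroring every image swaps the two senses of rotation."""
    return 1.0 - true_ccw

# Case A: the sky is balanced and the observers are biased.
a_original = reported_ccw(0.50, 0.04)
a_mirrored = reported_ccw(mirrored(0.50), 0.04)

# Case B: the sky has a real excess and the observers are unbiased.
b_original = reported_ccw(0.52, 0.0)
b_mirrored = reported_ccw(mirrored(0.52), 0.0)

print(f"biased observers: {a_original:.2f} -> {a_mirrored:.2f}")  # 0.52 -> 0.52
print(f"real excess:      {b_original:.2f} -> {b_mirrored:.2f}")  # 0.52 -> 0.48
```

Only case A matches the observation that the split stays at 48/52 after mirroring, which is why the slide attributes the excess to perception.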

Page 29:

Galaxy Zoo Discovery

• “Hanny’s Voorwerp” (“Voorwerp” = object)
• Strange blue thing found by Dutch teacher Hanny van Arkel
• Most likely “light echo” from a long-dead quasar
• Follow-up time on HST
• Galaxy Zoo great for finding the unexpected!

Page 30:

Galaxy Zoo and learning

• Even people who love science often don’t understand peer review, proposals, etc.

• Want to use Galaxy Zoo to show process of science

• As we write science papers, we share results with volunteers

Page 31:

Galaxy Zoo blog

Page 32:

Other Science: Sensors

• Small, self-contained devices to measure soil temperature, etc.

• Deployed around Maryland

• Study carbon cycle, nesting of turtles, etc.

• Wide range of applications

Page 33:

Other science: Turbulence

• Large-scale simulation of isotropic turbulence in fluid
  – Imagine shaking a box of water
• FORTRAN code
• 1024 x 1024 x 1024 mesh
• 10,240 timesteps
• Every 10th timestep written to SQL database
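The mesh size and timestep counts above fix how many snapshots are stored, and with an assumed record layout they give a rough size for the raw field data. The slide does not state what is stored per grid point; the sketch assumes three velocity components plus pressure as single-precision floats, and it ignores indexes and any other database overhead.

```python
MESH_POINTS = 1024 ** 3     # grid points per snapshot (from the slide)
TIMESTEPS = 10_240          # from the slide
STORE_EVERY = 10            # every 10th timestep is written (from the slide)

FIELDS_PER_POINT = 4        # assumption: 3 velocity components + pressure
BYTES_PER_VALUE = 4         # assumption: single-precision floats

snapshots = TIMESTEPS // STORE_EVERY
bytes_per_snapshot = MESH_POINTS * FIELDS_PER_POINT * BYTES_PER_VALUE
total_bytes = snapshots * bytes_per_snapshot

print(snapshots, "stored snapshots")                 # 1024
print(bytes_per_snapshot / 2**30, "GiB per snapshot")  # 16.0
print(total_bytes / 2**40, "TiB of raw field data")    # 16.0
```

Under these assumptions the stored data runs to tens of terabytes, which is the scale at which keeping the data on the server and sending queries to it becomes the practical choice.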

Page 34:

Other science: Turbulence

• Database accessible online through web services

• Can call database from FORTRAN or C code

• Like having a supercomputer simulation on your laptop!

Page 35:

Graywulf

• Data-intensive computing architecture
  – 40 worker nodes, 1440 MB/s each
  – 6 head nodes, 2100 MB/s each
  – DDR InfiniBand interconnect
• Deployed mid-2008
• Architecture for future projects
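The per-node figures above can be combined into an aggregate rate. This sketch assumes the rates add linearly across nodes, which is an idealisation; the 1 TB streaming estimate is illustrative and not a measured result.

```python
WORKERS, WORKER_MB_S = 40, 1440   # from the slide
HEADS, HEAD_MB_S = 6, 2100        # from the slide

worker_total = WORKERS * WORKER_MB_S
head_total = HEADS * HEAD_MB_S
aggregate = worker_total + head_total
print(worker_total, head_total, aggregate)  # 57600 12600 70200

# Idealised time to stream 1 TB (10**6 MB) across all worker nodes at once.
seconds_per_tb = 10**6 / worker_total
print(f"{seconds_per_tb:.1f} s per TB")  # 17.4 s per TB
```

Compared with the "1 TB in 2 days" figure earlier in the talk, this shows what parallel data access buys, at least on paper.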

Page 36:

Graywulf at SC08

Page 37:

Contact Information

Jordan Raddick
[email protected]