Science as a Service: How On-Demand Computing Can Accelerate Discovery
DESCRIPTION
My talk at ScienceCloud 2013 in NYC. Thanks to the organizers for the invitation to talk. This deck adds a bit of new material relative to previously posted talks, e.g., on Globus Genomics.
TRANSCRIPT
![Page 1: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/1.jpg)
computationinstitute.org
Science as a Service: How On-Demand Computing Can Accelerate Discovery
![Page 2: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/2.jpg)
A time of disruptive change
As evidenced by cost per human genome
![Page 3: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/3.jpg)
A time of disruptive change
As evidenced by cost per human genome
![Page 4: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/4.jpg)
But most labs have extremely limited resources
Heidorn: NSF grants in 2007
Grants under $350,000: 80% of awards, 50% of grant $$
[Figure: dollar axis from $1,000 to $1,000,000; horizontal axis from 2000 to 8000]
![Page 5: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/5.jpg)
Automation is required to apply more sophisticated methods to far more data
![Page 6: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/6.jpg)
Automation is required to apply more sophisticated methods to far more data
Outsourcing is needed to achieve economies of scale in the use of automated methods
![Page 7: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/7.jpg)
Building a discovery cloud
• Identify time-consuming activities amenable to automation and outsourcing
• Implement as high-quality, low-touch SaaS
• Leverage IaaS for reliability, economies of scale
• Extract common elements as research automation platform
Bonus question: Sustainability
Software as a service
Platform as a service
Infrastructure as a service
![Page 8: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/8.jpg)
Where does time go in research?
42%!!
![Page 9: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/9.jpg)
We aspire (initially) to create a great user experience for research data management
What would a “Dropbox for science” look like?
![Page 10: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/10.jpg)
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
…for BIG DATA
![Page 11: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/11.jpg)
[Diagram: Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, Mirror]
[Error callouts: Quota exceeded! Expired credentials! Network failed. Retry. Permission denied!]
It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA… but in reality it’s often very challenging
![Page 12: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/12.jpg)
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
…for BIG DATA
![Page 13: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/13.jpg)
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
• Collect • Move • Sync • Share
Capabilities delivered using Software-as-a-Service (SaaS) model
![Page 14: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/14.jpg)
Data Source
Data Destination
1. User initiates transfer request
2. Globus Online moves/syncs files
3. Globus Online notifies user
![Page 15: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/15.jpg)
Data Source
1. User A selects file(s) to share; selects user/group, sets share permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses shared file
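The sharing flow on this slide comes down to recording permissions against files that stay where they are. The following is a minimal sketch of that idea, not Globus Online's implementation: the `ShareRegistry` class, its method names, and the example path and user names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ShareRegistry:
    """Hypothetical registry: tracks who may access which path; files never move."""

    # Maps a path on the data source to {user: set of permissions}.
    shares: dict = field(default_factory=dict)

    def share(self, path, user, permissions):
        """Step 1: the owner grants `permissions` (e.g. {"read"}) on `path` to `user`."""
        self.shares.setdefault(path, {}).setdefault(user, set()).update(permissions)

    def can_access(self, path, user, permission="read"):
        """Step 3: check a request against the recorded permissions."""
        return permission in self.shares.get(path, {}).get(user, set())


registry = ShareRegistry()
registry.share("/data/run42/results.h5", "user_b", {"read"})
print(registry.can_access("/data/run42/results.h5", "user_b"))           # True
print(registry.can_access("/data/run42/results.h5", "user_b", "write"))  # False
print(registry.can_access("/data/run42/results.h5", "user_c"))           # False
```

Only the permission record is stored by the service in this sketch; the data itself stays on the data source, which is the point the slide makes about not copying files to cloud storage.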
![Page 16: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/16.jpg)
Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect” install
• 5-minute Globus Connect Multi User install
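The "reliability via transfer retries" item can be illustrated with a generic retry loop. This is a minimal sketch under stated assumptions, not how Globus Online actually retries: `transfer_with_retries`, `TransientTransferError`, `make_flaky_transfer`, and the exponential backoff schedule are hypothetical choices made for illustration.

```python
import time


class TransientTransferError(Exception):
    """Hypothetical error type for a recoverable failure (e.g. a network drop)."""


def transfer_with_retries(attempt_transfer, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call attempt_transfer() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts
    (exponential backoff). Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            attempt_transfer()
            return attempt
        except TransientTransferError:
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_transfer(failures_before_success):
    """Build a stand-in transfer that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def attempt():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientTransferError("network failed")

    return attempt


if __name__ == "__main__":
    delays = []  # record the waits instead of actually sleeping
    used = transfer_with_retries(make_flaky_transfer(2), sleep=delays.append)
    print(used, delays)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; a real client would leave the default in place.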
![Page 17: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/17.jpg)
Early adoption is encouraging
![Page 18: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/18.jpg)
Early adoption is encouraging
10,000 registered users; >100 daily
~18 PB moved; ~1B files
10x (or better) performance vs. scp
99.9% availability
Entirely hosted on Amazon
![Page 19: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/19.jpg)
![Page 20: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/20.jpg)
K. Heitmann (Argonne) moves 22 TB of cosmology data from LANL to ANL at 5 Gb/s
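For scale, moving 22 TB at 5 Gb/s takes roughly ten hours. The sketch below works through the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits per second) and a rate sustained for the whole transfer; both are assumptions, not statements from the slide.

```python
def transfer_hours(terabytes, gigabits_per_second):
    """Hours needed to move `terabytes` at a sustained `gigabits_per_second`."""
    bits = terabytes * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (gigabits_per_second * 1e9)     # bits divided by bits per second
    return seconds / 3600


print(round(transfer_hours(22, 5), 1))  # 9.8
```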
![Page 21: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/21.jpg)
B. Winjum (UCLA) moves 900K-file plasma physics datasets from UCLA to NERSC
![Page 22: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/22.jpg)
Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience
![Page 23: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/23.jpg)
Credit: Kerstin Kleese-van Dam
Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL
![Page 24: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/24.jpg)
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
…for BIG DATA
![Page 25: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/25.jpg)
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
…for BIG DATA
![Page 26: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/26.jpg)
Globus Online already does a lot
Globus Toolkit
Sharing Service
Transfer Service
Globus Nexus (Identity, Group, Profile)
Globus Online APIs
Globus Connect
![Page 27: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/27.jpg)
A platform for integration
![Page 28: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/28.jpg)
A platform for integration
![Page 29: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/29.jpg)
A platform for integration
![Page 30: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/30.jpg)
Data management SaaS (Globus) + Next-gen sequence analysis pipelines (Galaxy) + Cloud IaaS (Amazon) = Flexible, scalable, easy-to-use genomics analysis for all biologists
Globus Genomics
![Page 31: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/31.jpg)
Ravi Madduri, Bo Liu, Paul Davé, et al.
![Page 32: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/32.jpg)
RNA-Seq pipeline
Ravi Madduri, Bo Liu, Paul Davé, et al.
![Page 33: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/33.jpg)
Amazon pricing for Diffusion Tensor Imaging pipeline
[Figure: cost per subject ($), axis from 0 to 0.5, by instance type (m1.large, m1.xlarge, m3.xlarge, m3.2xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge) for On-Demand, Spot (Low), and Spot (High) pricing]
Credit: Kyle Chard
![Page 34: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/34.jpg)
We are adding capabilities
Globus Toolkit
Sharing Service
Transfer Service
Globus Nexus (Identity, Group, Profile)
Globus Online APIs
Globus Connect
![Page 35: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/35.jpg)
We are adding capabilities
Globus Toolkit
Sharing Service
Transfer Service
Dataset Services
Globus Nexus (Identity, Group, Profile)
Globus Online APIs
Globus Connect
![Page 36: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/36.jpg)
We are adding capabilities
• Ingest and publication
  – Imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, converts
• Cataloging
  – Virtual views of data based on user-defined and/or automatically extracted metadata
• Computation
  – Associate computational procedures, orchestrate application, catalog results, record provenance
![Page 37: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/37.jpg)
Looking deeply at how researchers use data
• A single research question often requires the integration of many data elements that are:
  – In different locations
  – In different formats (Excel, text, CDF, HDF, …)
  – Described in different ways
• Best grouping can vary during investigation
  – Longitudinal, vertical, cross-cutting
• But always needs to be operated on as a unit
  – Share, annotate, process, copy, archive, …
![Page 38: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/38.jpg)
How do we manage data today?
• Often, a curious mix of ad hoc methods
  – Organize in directories using file and directory naming conventions
  – Capture status in README files, spreadsheets, notebooks
• Time-consuming, complex, error prone
Why can’t we manage our data like we manage our pictures and music?
![Page 39: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/39.jpg)
![Page 40: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/40.jpg)
Introducing the dataset
• Group data based on use, not location
  – Logical grouping to organize, reorganize, search, and describe usage
• Tag with characteristics that reflect content …
  – Capture as much existing information as we can
• …or to reflect current status in investigation
  – Stage of processing, provenance, validation, …
• Share datasets for collaboration
  – Control access to data and metadata
• Operate on datasets as units
  – Copy, export, analyze, tag, archive, …
![Page 41: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/41.jpg)
Builds on catalog as a service
Approach
• Hosted user-defined catalogs
• Based on tag model <subject, name, value>
• Optional schema constraints
• Integrated with other Globus services
Three REST APIs
/query/
• Retrieve subjects
/tags/
• Create, delete, retrieve tags
/tagdef/
• Create, delete, retrieve tag definitions
Builds on USC Tagfiler project (C. Kesselman et al.)
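The tag model and the three endpoints can be pictured with a small in-memory stand-in. This is a sketch of the <subject, name, value> idea only, not the Globus catalog service or the Tagfiler API: the class `TagCatalog`, its method names, and the example file names are hypothetical, chosen to mirror /tagdef/, /tags/, and /query/.

```python
class TagCatalog:
    """Hypothetical in-memory stand-in for a catalog of <subject, name, value> tags."""

    def __init__(self):
        self.tagdefs = {}   # tag name -> expected Python type (optional schema constraint)
        self.tags = set()   # {(subject, name, value), ...}

    def define_tag(self, name, value_type):
        """Mirror of /tagdef/: create a tag definition."""
        self.tagdefs[name] = value_type

    def add_tag(self, subject, name, value):
        """Mirror of /tags/: create a tag, enforcing the definition if one exists."""
        expected = self.tagdefs.get(name)
        if expected is not None and not isinstance(value, expected):
            raise TypeError(f"tag {name!r} expects {expected.__name__}")
        self.tags.add((subject, name, value))

    def delete_tag(self, subject, name, value):
        """Mirror of /tags/: delete a tag if present."""
        self.tags.discard((subject, name, value))

    def query(self, **criteria):
        """Mirror of /query/: retrieve subjects whose tags match every name=value pair."""
        subjects = {s for s, _, _ in self.tags}
        return sorted(
            s for s in subjects
            if all((s, n, v) in self.tags for n, v in criteria.items())
        )


catalog = TagCatalog()
catalog.define_tag("stage", str)
catalog.add_tag("scan_001.h5", "stage", "raw")
catalog.add_tag("scan_002.h5", "stage", "validated")
catalog.add_tag("scan_002.h5", "format", "HDF")
print(catalog.query(stage="validated"))  # ['scan_002.h5']
```

Because a tag without a definition is accepted as-is (as with `format` above), the schema constraint stays optional, matching the "optional schema constraints" item on the slide.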
![Page 42: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/42.jpg)
Provide more capability for more people at lower cost by building a “Discovery Cloud”
Delivering “Science as a service”
Our vision for a 21st century discovery infrastructure
![Page 43: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/43.jpg)
It’s a time of great opportunity … to develop and apply Science aaS
Globus Nexus (Identity, Group, Profile)
…
Sharing Service
Transfer Service
Dataset Services
Globus Toolkit
Globus Online APIs
Globus Connect
![Page 44: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/44.jpg)
Thanks to great colleagues and collaborators
• Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago
• Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI
• Francesco de Carlo, Chris Jacobsen, and others at Argonne
![Page 45: Science as a Service: How On-Demand Computing can Accelerate Discovery](https://reader036.vdocuments.us/reader036/viewer/2022062703/554ea0cfb4c905977e8b45fe/html5/thumbnails/45.jpg)
Thank you to our sponsors!