re:invent 2013-foster-madduri
DESCRIPTION
Presentation from @ianfoster and @madduri at Amazon re:Invent
TRANSCRIPT
![Page 1: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/1.jpg)
Science as a Service
Ian Foster, The University of Chicago and Argonne National Laboratory
November 14, 2013
![Page 2: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/2.jpg)
A time of disruptive change
![Page 3: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/3.jpg)
A time of disruptive change
![Page 4: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/4.jpg)
Most labs have limited resources
Heidorn: NSF grants in 2007
< $350,000: 80% of awards, 50% of grant $$
[Chart: vertical axis from $1,000 to $1,000,000; horizontal axis from 2000 to 8000]
![Page 5: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/5.jpg)
Automation is required to apply more sophisticated methods to far more data
![Page 6: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/6.jpg)
Automation is required to apply more sophisticated methods to far more data
Outsourcing is needed to achieve economies of scale in the use of automated methods
![Page 7: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/7.jpg)
Building a discovery cloud
• Identify time-consuming activities amenable to automation and outsourcing
• Implement as high-quality, low-touch SaaS
• Leverage IaaS for reliability, economies of scale
• Extract common elements as research automation platform
Bonus question: Sustainability
Software as a service
Platform as a service
Infrastructure as a service
![Page 8: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/8.jpg)
We aspire (initially) to create a great user experience for research data management
What would a “Dropbox for science” look like?
![Page 9: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/9.jpg)
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
BIG DATA…for
![Page 10: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/10.jpg)
[Diagram: Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, and Mirror, with error callouts: "Quota exceeded!", "Expired credentials!", "Network failed. Retry.", "Permission denied!"]
It should be trivial to Collect, Move, Sync, Share, Analyze, Annotate, Publish, Search, Backup, & Archive BIG DATA… but in reality it’s often very challenging
![Page 11: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/11.jpg)
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
BIG DATA…for
![Page 12: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/12.jpg)
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
BIG DATA…for
• Move • Sync • Share
Capabilities delivered using Software-as-a-Service (SaaS) model
![Page 13: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/13.jpg)
[Diagram: Data Source, Data Destination]
1. User initiates transfer request
2. Globus Online moves/syncs files
3. Globus Online notifies user
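The three-step flow above can be sketched as a small in-memory model. This is only an illustration: `Endpoint`, `sync`, and `run_transfer` are hypothetical names, not the Globus Online API, and a real transfer runs between remote storage systems rather than Python dictionaries. The checksum comparison is likewise an assumption about how a "sync" might decide what to copy.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Hypothetical stand-in for a data endpoint: maps path -> file contents."""
    name: str
    files: dict = field(default_factory=dict)


def checksum(data: bytes) -> str:
    """Content fingerprint used to decide whether a file needs copying."""
    return hashlib.sha256(data).hexdigest()


def sync(source: Endpoint, destination: Endpoint) -> list:
    """Copy only files that are missing or different at the destination."""
    copied = []
    for path, data in sorted(source.files.items()):
        existing = destination.files.get(path)
        if existing is None or checksum(existing) != checksum(data):
            destination.files[path] = data
            copied.append(path)
    return copied


def run_transfer(source: Endpoint, destination: Endpoint, notify) -> list:
    """Step 1: the request arrives; step 2: move/sync; step 3: notify the user."""
    copied = sync(source, destination)
    notify(f"{len(copied)} file(s) copied from {source.name} to {destination.name}")
    return copied
```

Passing `notify` as a callable keeps the notification channel (email, web page, and so on) separate from the transfer logic; running the same request twice copies nothing the second time.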
![Page 14: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/14.jpg)
[Diagram: Data Source]
1. User A selects file(s) to share; selects user/group, sets share permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses shared file
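One way to picture "tracks shared files; no need to move files" is a registry of share records that point at data where it already lives. The sketch below is a hypothetical model, not the Globus sharing service: `Share`, `ShareRegistry`, and the permission strings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Share:
    """Hypothetical share record: a pointer to data that stays in place."""
    owner: str
    path: str
    grantee: str             # a user name or a group name
    permissions: frozenset   # e.g. {"read"} or {"read", "write"}


class ShareRegistry:
    """Tracks who may access which path; no file contents are stored here."""

    def __init__(self):
        self._shares = []

    def share(self, owner, path, grantee, permissions):
        self._shares.append(Share(owner, path, grantee, frozenset(permissions)))

    def can_access(self, user, path, permission, groups=()):
        # A user matches a share directly or through any group they belong to.
        identities = {user, *groups}
        return any(
            s.path == path and s.grantee in identities and permission in s.permissions
            for s in self._shares
        )
```

Because the registry holds only metadata, sharing a multi-terabyte dataset costs the same as sharing a single small file.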
![Page 15: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/15.jpg)
Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect” install
• 5-minute Globus Connect Multi User install
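The "reliability via transfer retries" item can be illustrated with a generic retry loop. The slides do not say how Globus Online schedules its retries, so the exponential backoff below is an assumption, and `TransientError` and `with_retries` are hypothetical names.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. a network drop)."""


def with_retries(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on TransientError wait base_delay * 2**attempt and retry.

    The last failure is re-raised once max_attempts calls have been made.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.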
![Page 16: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/16.jpg)
Early adoption is encouraging
![Page 17: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/17.jpg)
Early adoption is encouraging
>12,000 registered users; >150 daily
>27 PB moved; >1B files
10x (or better) performance vs. scp
99.9% availability
Entirely hosted on Amazon
![Page 18: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/18.jpg)
Amazon Web Services used
• EC2 for hosting Globus services
• ELB to use multiple availability zones for reliability and uptime
• SES and SNS to send notifications of transfer status
• S3 to store historical state
• PostgreSQL for active state
![Page 19: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/19.jpg)
K. Heitmann (Argonne) moves 22 TB of cosmology data from LANL to ANL at 5 Gb/s
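To put that rate in perspective, here is a back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits per second) and a sustained 5 Gb/s for the whole transfer, neither of which the slide states.

```python
def transfer_hours(terabytes: float, gigabits_per_second: float) -> float:
    """Idealized transfer time in hours, ignoring protocol overhead and stalls."""
    bits = terabytes * 1e12 * 8                    # bytes -> bits
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 3600


print(f"{transfer_hours(22, 5):.1f} hours")  # prints "9.8 hours"
```

Under these assumptions the 22 TB dataset takes roughly ten hours end to end.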
![Page 20: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/20.jpg)
B. Winjum (UCLA) moves 900K-file plasma physics datasets between UCLA and NERSC
![Page 21: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/21.jpg)
Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience
![Page 22: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/22.jpg)
Credit: Kerstin Kleese-van Dam
Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL
![Page 23: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/23.jpg)
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
BIG DATA…for
• Move • Sync • Share
Capabilities delivered using Software-as-a-Service (SaaS) model
![Page 24: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/24.jpg)
• Collect • Move • Sync • Share • Analyze
• Annotate • Publish • Search • Backup • Archive
BIG DATA…for
![Page 25: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/25.jpg)
Globus Online already does a lot
Globus Toolkit
Sharing Service
Transfer Service
Globus Nexus (Identity, Group, Profile)
Globus Online APIs
Globus Connect
![Page 26: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/26.jpg)
The identity challenge in science
• Research communities often need to
– Assign identities to their users
– Manage user profiles
– Organize users into groups for authorization
• Obstacles to high-quality implementations
– Complexity of associated security protocols
– Creation of identity silos
– Multiple credentials for users
– Reliability, availability, scalability, security
![Page 27: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/27.jpg)
Nexus provides four key capabilities
• Identity provisioning
– Create, manage Globus identities
• Identity hub
– Link with other identities; use to authenticate to services
• Group hub
– User-managed groups; groups can be used for authorization
• Profile management
– User-managed attributes; can use in group admission
Key points:
1) Outsource identity, group, profile management
2) REST API for flexible integration
3) Intuitive, customizable Web interfaces
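The identity hub and group hub ideas above can be sketched as follows. This is a hypothetical model, not the Nexus REST API: `IdentityHub` and its methods are invented here to show how linked identities and group membership might combine into an authorization check.

```python
class IdentityHub:
    """Hypothetical model: one primary identity, many linked external identities."""

    def __init__(self):
        self._linked = {}   # external identity -> primary identity
        self._groups = {}   # group name -> set of primary identities

    def link(self, primary, external):
        """Record that an external identity belongs to a primary identity."""
        self._linked[external] = primary

    def resolve(self, identity):
        """Map any linked identity to its primary; unlinked ones map to themselves."""
        return self._linked.get(identity, identity)

    def add_to_group(self, group, identity):
        self._groups.setdefault(group, set()).add(self.resolve(identity))

    def is_authorized(self, identity, required_group):
        """Authorize by group membership, whichever linked identity was used."""
        return self.resolve(identity) in self._groups.get(required_group, set())
```

A user who signs in with a campus credential linked to their primary identity receives the same group-based access as when signing in directly, which is one way to avoid the multiple-credentials problem described earlier.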
![Page 28: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/28.jpg)
Branded sites
Open Science Grid, University of Chicago, XSEDE
DOE kBase, Indiana University, University of Exeter
Globus Online, NERSC, NIH BIRN
![Page 29: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/29.jpg)
A platform for integration
![Page 30: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/30.jpg)
A platform for integration
![Page 31: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/31.jpg)
A platform for integration
![Page 32: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/32.jpg)
Data management SaaS (Globus) + Next-gen sequence analysis pipelines (Galaxy) + Cloud IaaS (Amazon) =
Flexible, scalable, easy-to-use genomics analysis for all biologists
Globus Genomics
![Page 33: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/33.jpg)
We are adding capabilities
Globus Toolkit
Sharing Service
Transfer Service
Globus Nexus (Identity, Group, Profile)
Globus Online APIs
Globus Connect
![Page 34: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/34.jpg)
We are adding capabilities
Globus Toolkit
Sharing Service
Transfer Service
Dataset Services
Globus Nexus (Identity, Group, Profile)
Globus Online APIs
Globus Connect
![Page 35: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/35.jpg)
We are adding capabilities
• Ingest and publication
– Imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, converts
• Cataloging
– Virtual views of data based on user-defined and/or automatically extracted metadata
• Computation
– Associate computational procedures, orchestrate application, catalog results, record provenance
![Page 36: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/36.jpg)
Next Gen Sequencing Analysis for Everyone – No IT Required
Ravi K Madduri, The University of Chicago and Argonne National Laboratory
November 14, 2013
![Page 37: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/37.jpg)
One slide to get your attention
![Page 38: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/38.jpg)
Outline
• Globus Vision
• Challenges in Sequencing Analysis
– Big Data Management
– Analysis at Scale
– Reproducibility
• Proposed Approach Using Globus Genomics
• Example Collaborations
• Q&A
![Page 39: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/39.jpg)
Globus Vision
Goal: Accelerate discovery and innovation worldwide by providing research IT as a service
Leverage software-as-a-service to:
– provide millions of researchers with unprecedented access to powerful tools for managing Big Data
– reduce research IT costs dramatically via economies of scale
“Civilization advances by extending the number of important operations which we can perform without thinking of them”
—Alfred North Whitehead , 1911
![Page 40: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/40.jpg)
Challenges in Sequencing Analysis
[Diagram: Sequencing Centers, Public Data, Storage, Seq Center, Research Lab, and Local Cluster/Cloud, linked by FTP, SCP, and HTTP transfers]
Data Movement and Access Challenges
• Data is distributed in different locations
• Research labs need access to the data for analysis
• Be able to share data with other researchers/collaborators
• Inefficient ways of data movement
• Data needs to be available on the local and distributed compute resources
– Local clusters, cloud, grid
Manual Data Analysis
How do we analyze this sequence data?
[Diagram: Fastq, Ref Genome, Alignment, Picard, GATK, Variant Calling; cycle of Install, Modify, (Re)Run Script]
• Manually move the data to the compute node
• Install all the tools required for the analysis
– BWA, Picard, GATK, filtering scripts, etc.
• Shell scripts to sequentially execute the tools
• Manually modify the scripts for any change
• Error prone, difficult to keep track, messy
• Difficult to maintain and transfer the knowledge
![Page 41: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/41.jpg)
Globus Genomics
[Diagram: Sequencing Centers, Public Data, Storage, Seq Center, Research Lab, and Local Cluster/Cloud, connected through Globus Genomics; FTP, SCP, HTTP, and other transfers shown alongside]
Data Management
Globus provides a
• High-performance
• Fault-tolerant
• Secure
file transfer service between all data endpoints
Data Analysis
Galaxy-based workflow management system
• Globus integrated within Galaxy
• Web-based UI
• Drag-and-drop workflow creation
• Easily modify workflows with new tools
[Diagram: Galaxy Data Libraries, Fastq, Ref Genome, Alignment, Picard, GATK, Variant Calling]
Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible
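A workflow system differs from a hand-edited shell script mainly in that steps and their dependencies are declared as data, and the run order is derived from them. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+). The dependency structure is assumed for illustration: the slide names these steps and tools, but the extracted text does not preserve how they connect, and this is not Galaxy's workflow format.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
WORKFLOW = {
    "alignment": {"fastq", "ref_genome"},
    "picard": {"alignment"},
    "variant_calling": {"picard", "ref_genome"},
}


def execution_order(workflow):
    """Return the steps in an order where every dependency comes first."""
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    for step in execution_order(WORKFLOW):
        print(step)
```

Adding a new tool means adding one entry to the mapping; the order is recomputed rather than edited by hand, which addresses the "manually modify the scripts for any change" problem from the previous slide.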
![Page 42: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/42.jpg)
Globus Genomics Architecture
Figure 2: Globus Genomics Architecture
![Page 43: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/43.jpg)
Globus Genomics Usage
350K Core hours in last 6 months
![Page 44: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/44.jpg)
![Page 45: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/45.jpg)
Globus Genomics
• Computational profiles for various analysis tools
• Resources can be provisioned on-demand with Amazon Web Services cloud-based infrastructure
• GlusterFS as a shared file system between head nodes and compute nodes
• Provisioned I/O on EBS
![Page 46: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/46.jpg)
Coming soon!
• Integration with Globus Catalog
– Better data discovery and metadata management
• Integration with Globus Sharing
– Easy and secure method to share large datasets with collaborators
• Integration with Amazon Glacier for data archiving
• Support for high throughput computational modalities through Apache Mesos
– MapReduce and MPI clusters
• Dynamic Storage Strategies using S3 and/or LVM-based shared file system
![Page 47: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/47.jpg)
Coming soon
Globus Climate, Globus Proteomics, CVRG, Materials
![Page 48: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/48.jpg)
Provide more capability for more people at lower cost by building a “Discovery Cloud”
Delivering “Science as a service”
Our vision for a 21st century discovery infrastructure
![Page 49: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/49.jpg)
Thank you to our sponsors
![Page 50: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/50.jpg)
For more information
• More information on Globus Genomics and to sign up: www.globus.org/genomics
• More information on Globus: www.globusonline.org
• Follow us on Twitter: @ianfoster, @madduri, @globusgenomics, @globusonline
![Page 51: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/51.jpg)
Thank you!
![Page 52: re:Invent 2013-foster-madduri](https://reader038.vdocuments.us/reader038/viewer/2022103000/554ea116b4c905977e8b4615/html5/thumbnails/52.jpg)
BDT 310