streamlined data sharing and analysis to accelerate cancer research
TRANSCRIPT
1
Streamlined data sharing and analysis
to accelerate cancer research
Ian Foster
The University of Chicago and Argonne National Laboratory
2
Thesis: We enhance sharing and analysis
by eliminating friction
3
1919 Motor Transport Corps convoy
Washington, D.C., to San Francisco
56 days, average speed of 9 km/h
2016: 41 hours by road, 5.5 hours by air
5
2 minutes by web
<1 second by API
6
Cloud: Outsourcing and automation
Software as a service: SaaS
Infrastructure as a service: IaaS
Platform as a service: PaaS
(web & mobile apps)
7
SaaS for science
Data challenges: Olopade Lab
Inherited hematological malignancies
Impact:
• Familial blood cancer syndromes are being included in the 2016 revision of the World Health Organization Classification of Hematological Malignancies; NCCN guidelines; European LeukemiaNet
• Identification of germline mutations is important for prevention/intervention and early diagnosis, and may change treatment (e.g., stem cell transplant from related donor w/o mutation or matched unrelated donor)
Background:
• Familial predisposition to blood cancers has not been widely appreciated, as it has been for some solid cancers
• Identifying the genes involved informs understanding of biology and may impact patient care (prevention, diagnosis and treatment)
Jane Churpek, MD; Lucy Godley, MD, PhD
Research highlights:
• With samples from >500 families, the team has identified novel germline mutations that predispose to familial myelodysplastic syndromes and leukemia
• These mutations are much more common than previously known
• Specific genes with identified mutations include RUNX1, ETV6, DDX41, ANKRD26
The RUNX1 International Sequencing Consortium (RISC)
Inherited hematological malignancies, Lucy Godley, UChicago
11
Notable areas of friction
• Moving data rapidly, securely, and reliably from lab to lab
• Accessing data at other labs
• Controlling who can access data
• Tracking what data is where
• Discovering available data within a rapidly growing haystack
• Computing at scale
• Complying with rules on personal health information
• Archive and backup
Diagram: Sequencing center, Compute facility, Personal Computer, Publication repository
Transfer
• Researcher initiates transfer request; or requested automatically by script, science gateway
• Globus transfers files reliably, securely
Share
• Researcher selects files to share, selects user or group, and sets access permissions
• Globus controls access to shared files on existing storage; no need to move files to cloud storage!
• Collaborator logs in to access shared files; no local account needed; download via Globus
Publish
• Researcher assembles data set; attaches metadata (Dublin Core, domain-specific)
• Curator reviews and approves; data set published on campus or other system
Discover
• Peers, collaborators search and discover datasets; transfer and share using Globus
Requirements
• Only Web browser required
• Use any storage system
• Access using any credential
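The share steps above (a researcher selects files, picks a user or group, and sets access permissions; a collaborator then reads the files without a local account) can be pictured as a small table of access rules kept alongside existing storage. The sketch below is a toy in-memory model, not the Globus API: `AccessRule`, `SharedFolder`, the paths, and the identity strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class AccessRule:
    principal: str    # hypothetical user or group identifier
    path: str         # folder prefix the rule covers, always ending in "/"
    permissions: str  # "r" for read-only, "rw" for read-write


@dataclass
class SharedFolder:
    """Toy model: files stay on existing storage; only rules are managed."""
    rules: List[AccessRule] = field(default_factory=list)

    def grant(self, principal: str, path: str, permissions: str) -> None:
        if permissions not in ("r", "rw"):
            raise ValueError("permissions must be 'r' or 'rw'")
        if not path.endswith("/"):
            path += "/"
        self.rules.append(AccessRule(principal, path, permissions))

    def can(self, principals: List[str], path: str, action: str) -> bool:
        """True if any rule for one of the caller's identities or groups
        covers the path and allows the action ('r' or 'w')."""
        for rule in self.rules:
            if (rule.principal in principals
                    and path.startswith(rule.path)
                    and action in rule.permissions):
                return True
        return False


# Hypothetical example: read-only access for a whole group.
share = SharedFolder()
share.grant("group:runx1-consortium", "/sequencing/run42", "r")
collaborator = ["user:collab@example.org", "group:runx1-consortium"]

print(share.can(collaborator, "/sequencing/run42/sample1.bam", "r"))  # True
print(share.can(collaborator, "/sequencing/run42/sample1.bam", "w"))  # False
print(share.can(collaborator, "/sequencing/other/sample.bam", "r"))   # False
```

The point of the model is that granting access only adds a rule; no file is copied or moved.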
23
How Globus adds value…
• Ease of use, consistent user interface across systems
• “Fire-and-forget” reliable file transfer
• Low-overhead external collaboration
• Secure access, multi-tier security model
• Maximized wide area network throughput
• Rapid deployment via standard packages
• Highly automatable: CLI, RESTful API
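As a rough illustration of what "fire-and-forget" reliable transfer implies, the sketch below retries each file on transient faults with exponential backoff and never re-sends a file that already arrived. It is a generic retry pattern under stated assumptions, not a description of how the Globus service is implemented; `transfer_all`, `copy_one`, and `flaky_copy` are hypothetical names.

```python
import time
from typing import Callable, Dict, Iterable


def transfer_all(files: Iterable[str],
                 copy_one: Callable[[str], None],
                 max_attempts: int = 5,
                 base_delay_s: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Copy each file, retrying transient failures with exponential backoff.

    Returns a status per file: "done" or "failed". A file that has been
    copied is not sent again, so retries resume rather than restart.
    """
    status = {name: "pending" for name in files}
    for name in status:
        for attempt in range(1, max_attempts + 1):
            try:
                copy_one(name)
            except OSError:
                if attempt == max_attempts:
                    status[name] = "failed"
                else:
                    # Wait 1x, 2x, 4x, ... the base delay before retrying.
                    sleep(base_delay_s * 2 ** (attempt - 1))
            else:
                status[name] = "done"
                break
    return status


# Simulated copier: "a.bam" fails twice before succeeding.
failures_left = {"a.bam": 2}


def flaky_copy(name: str) -> None:
    if failures_left.get(name, 0) > 0:
        failures_left[name] -= 1
        raise OSError("simulated network fault")


result = transfer_all(["a.bam", "b.bam"], flaky_copy, sleep=lambda s: None)
print(result)  # {'a.bam': 'done', 'b.bam': 'done'}
```

The `sleep` parameter is injectable only so the example runs instantly; a real caller would leave the default.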
24
25
26
27
28
29
Share with any identity
30
31
Storage connectors
Standard (POSIX)
• Linux
• Windows
• MacOS
• Lustre, GPFS, OrangeFS, ...
Premium
• HPSS
• HDFS
• Amazon S3
• Ceph RadosGW (S3 API)
• Spectra Logic BlackPearl
• Google Drive*
* Coming soon
32
Globus accelerates disk-to-disk throughput
Chart: Disk-to-Disk Throughput (Mbps) for scp, scp (w/HPN), sftp, GridFTP (1 stream), GridFTP (4 streams)
Source: ESnet (2016)
• Berkeley, CA to Argonne, IL (RTT: 53 ms, Capacity: 10 Gbps)
• scp is 24x slower than GridFTP
• >1 Gbps (125 MB/s) disk-to-disk requires RAID array
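The figures on this slide can be checked with a little arithmetic. The sketch below converts the quoted rate to megabytes per second and computes the bandwidth-delay product for the Berkeley to Argonne path. The bandwidth-delay product is a standard networking quantity that is not on the slide; it is included on the assumption that it helps explain why tools with small fixed buffers struggle on long, fast paths. The function names are illustrative.

```python
def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbps * 1000 / 8


def bandwidth_delay_product_mb(capacity_gbps: float, rtt_ms: float) -> float:
    """Megabytes that must be in flight to keep the path full."""
    bits_in_flight = capacity_gbps * 1e9 * rtt_ms / 1000
    return bits_in_flight / 8 / 1e6


# 1 Gbps is the 125 MB/s quoted on the slide.
print(gbps_to_mb_per_s(1))                 # 125.0

# 10 Gbps capacity with a 53 ms round trip: about 66 MB must be in flight.
print(bandwidth_delay_product_mb(10, 53))  # 66.25
```

A sender whose window is much smaller than that 66 MB cannot fill the link with a single stream, which is consistent with the gap between the single-stream and multi-stream results in the chart.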
34
35
36
Data Discovery
Globus is widely used
• 4 major services
• 13 national labs
• 190 PB transferred
• 10,000 active endpoints
• 20 billion files processed
• 10,000 active users
• 50,000 registered users
• 99.9% uptime
• 35+ institutional subscribers
• 1 PB largest single transfer to date
• 3 months longest continuously managed transfer
• 130 federated campus identities
38
Globus @ NIH
Charts: Terabytes per Month and Users per Month, by Year and Month (2015-03 to 2016-08)
39
Globus subscriptions for sustainability
• Standard subscription
  • Shared endpoints
  • Data publication
  • HTTPS support*
  • Management console
  • Usage reporting
  • Priority support
  • Application integration
• Branded Web Site
• Premium Storage Connectors
  • Amazon S3, Ceph, HPSS, Spectra, Google Drive*, …
• Alternate Identity Provider (InCommon is standard)
*Available late 2016
41
Representative subscribers
42
Cloud: Outsourcing and automation
Software as a service: SaaS
Infrastructure as a service: IaaS
Platform as a service: PaaS
(web & mobile apps)
PaaS for science
43
44
45
46
47
https://fasterdata.es.net/
Globus leverages Science DMZs
Prototypical research data portal
• Move portal storage into Science DMZ, with Globus endpoint
• Leave portal web server behind firewall
• Globus handles security and data heavy lifting
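One way to read the portal design above: the web server behind the firewall only sends a small control message, and the file contents move directly between endpoints in the Science DMZ. The sketch below builds such a message as JSON. It is an illustration under stated assumptions, not the actual Globus request schema; `build_transfer_request`, the field names, and the endpoint identifiers are hypothetical.

```python
import json
from typing import Dict, List, Tuple


def build_transfer_request(source_endpoint: str,
                           destination_endpoint: str,
                           paths: List[Tuple[str, str]]) -> Dict:
    """Build the small control message a portal might send over REST.

    Only endpoint identifiers and paths are included; file contents would
    move between the endpoints themselves, never through the web server.
    """
    if not paths:
        raise ValueError("at least one (source, destination) path pair is required")
    return {
        "source_endpoint": source_endpoint,
        "destination_endpoint": destination_endpoint,
        "items": [{"source_path": src, "destination_path": dst}
                  for src, dst in paths],
    }


request = build_transfer_request(
    "portal-endpoint-id",  # hypothetical ID of the portal endpoint
    "user-endpoint-id",    # hypothetical ID of the user's endpoint
    [("/datasets/run42/", "/home/researcher/run42/")],
)
body = json.dumps(request, indent=2)
print(body)
```

The request is a few hundred bytes regardless of how large the dataset is, which is why the web server can stay behind the firewall without becoming a bottleneck.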
48
Desktop
Globus Cloud
Firewall
Science DMZ
Globus Transfer Service
Portal Web Server (Client)
Globus Auth
Browser
User’s Endpoint (optional)
Portal Endpoint
Other Endpoints
HTTPS
GridFTP
REST Other Services
Globus Web Widgets
50
https://github.com/globus/globus-sample-data-portal
51
https://www.globusworld.org/tour/
Workflows can be easily defined and automated with integrated Galaxy Platform capabilities
Data movement is streamlined with integrated Globus transfer
Resources can be provisioned on-demand with Amazon Web Services cloud based infrastructure
Globus Genomics: Genomics analysis as a service
Ravi Madduri et al., University of Chicago
Globus Genomics use cases
A profile of inherited predisposition to breast cancer among Nigerian women
Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade
A case study for high throughput analysis of NGS data for translational research using Globus Genomics
D. Sulakhe, A. Rodriguez, K. Bhuvaneshwar, Y. Gusev, R. Madduri, L. Lacinski, U. Dave, I. Foster, S. Madhavan
Globus Genomics at a glance
• 30 institutions, groups
• 10s of millions of core hours
• 2 PBs raw sequence analyzed
• 1,500 analysis tools
• 10,000 genomes processed
• 50 workflows
• 99% uptime over the past two years
• 1 PB data generated
• 43 steps in longest pipeline
• 100s of species
• 75 largest user group
• 5 days longest running workflow
Cost-aware provisioning on cloud resources
55
1. Filter instance types with profiles
2. Determine price for each instance type across all availability zones
3. Rank potential requests
4. Make requests and monitor
5. Cancel or repurpose excess active requests once one is fulfilled
Can reduce costs by 95% or more!
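The five steps above can be sketched in a few functions. This is a simplified illustration, not the method of the cited paper: the instance names, zones, and prices are made-up placeholders rather than real cloud prices, and `filter_types`, `rank_requests`, and `provision` are hypothetical helpers. Monitoring is reduced to a single fulfillment check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Request = Tuple[float, str, str]  # (price per hour, instance type, zone)


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gb: float


def filter_types(types: List[InstanceType], min_vcpus: int,
                 min_memory_gb: float) -> List[InstanceType]:
    """Step 1: keep instance types that satisfy the job's profile."""
    return [t for t in types
            if t.vcpus >= min_vcpus and t.memory_gb >= min_memory_gb]


def rank_requests(types: List[InstanceType],
                  prices: Dict[Tuple[str, str], float]) -> List[Request]:
    """Steps 2-3: price every type in every zone, cheapest first."""
    candidates = [(prices[(name, zone)], name, zone)
                  for t in types
                  for (name, zone) in prices if name == t.name]
    return sorted(candidates)


def provision(ranked: List[Request],
              is_fulfilled: Callable[[Request], bool],
              max_requests: int = 3) -> Tuple[Optional[Request], List[Request]]:
    """Steps 4-5: place the top requests, keep the first one fulfilled,
    and cancel the others."""
    placed = ranked[:max_requests]
    winner = next((r for r in placed if is_fulfilled(r)), None)
    cancelled = [r for r in placed if r != winner]
    return winner, cancelled


# Placeholder catalog and prices, for illustration only.
types = [InstanceType("small", 2, 4), InstanceType("medium", 8, 32),
         InstanceType("large", 16, 64)]
prices = {("small", "zone-a"): 0.02, ("medium", "zone-a"): 0.12,
          ("medium", "zone-b"): 0.09, ("large", "zone-a"): 0.20,
          ("large", "zone-b"): 0.25}

eligible = filter_types(types, min_vcpus=8, min_memory_gb=16)
ranked = rank_requests(eligible, prices)
# Pretend only zone-a has capacity right now.
winner, cancelled = provision(ranked, is_fulfilled=lambda r: r[2] == "zone-a")
print(winner)     # (0.12, 'medium', 'zone-a')
print(cancelled)  # [(0.09, 'medium', 'zone-b'), (0.2, 'large', 'zone-a')]
```

Placing several ranked requests at once trades a little bookkeeping for a shorter wait: the cheapest option is tried, but the job does not stall if that zone has no capacity.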
R. Chard et al. Cost-aware cloud provisioning, 11th IEEE International Conference on e-Science (e-Science), 2015.
56
What’s coming soon: Richer endpoints
HTTPS access to endpoints
• Enhanced use of research storage:
  • Asynchronous, bulk transfer: GridFTP
  • Synchronous remote access: HTTPS
• Enhanced Globus web app
  • Browser-based upload/download
  • Inline file viewer
• Integration with clients, web apps
Collections
• Groupings of files that are to be treated as logical units
• Can be named and described
Data search
• Automated metadata harvesting
  • From Globus endpoints
  • Submitted via REST API
• Rich search capabilities
  • Free text, faceted, boosted
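To make the three search styles named on this slide concrete (free text, faceted, boosted), here is a minimal in-memory sketch. It is not the planned Globus search service: the `search` function, the record fields, and the sample records are hypothetical and exist only to show how the three ideas combine.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]


def search(records: List[Record], query: str,
           facets: Optional[Dict[str, str]] = None,
           boosts: Optional[Dict[str, float]] = None) -> List[str]:
    """Return record ids ranked by boosted term matches.

    facets: field -> required value (exact-match filter)
    boosts: field -> weight applied to matches found in that field
    """
    facets = facets or {}
    boosts = boosts or {}
    terms = query.lower().split()
    scored = []
    for rec in records:
        # Faceted search: drop records that miss any required value.
        if any(rec.get(f) != v for f, v in facets.items()):
            continue
        score = 0.0
        for field_name, text in rec.items():
            words = text.lower().split()
            hits = sum(words.count(t) for t in terms)  # free-text match
            score += hits * boosts.get(field_name, 1.0)  # boosting
        if score > 0:
            scored.append((score, rec["id"]))
    # Highest score first; ties broken by id for a stable order.
    return [rid for _, rid in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical metadata records.
records = [
    {"id": "ds1", "title": "RUNX1 germline variants",
     "description": "exome data from families", "type": "exome"},
    {"id": "ds2", "title": "Breast cancer cohort",
     "description": "RUNX1 and other genes screened", "type": "panel"},
    {"id": "ds3", "title": "RUNX1 expression",
     "description": "RNA levels", "type": "rnaseq"},
]

print(search(records, "runx1", boosts={"title": 3.0}))
# ['ds1', 'ds3', 'ds2']
print(search(records, "runx1", facets={"type": "exome"}))
# ['ds1']
```

Boosting the title field pushes datasets that name the gene in their title above one that only mentions it in the description.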
59
Thank you to our sponsors
U.S. Department of Energy
Thanks to: Rachana Ananthakrishnan, Kyle Chard, Ravi Madduri, Brigitte Raumann, Steve Tuecke, Vas Vasiliadis,
and others in the Globus team at the University of Chicago
60
Globus provides a new global-scale data fabric that can accelerate discovery by streamlining scientific data sharing and analysis
• Globus-enabled storage systems enable robust, secure access
• Globus cloud services implement transfer, sharing, publication, discovery, and other capabilities
This fabric is:
• Being applied in cancer research
• Spreading rapidly by word of mouth (scientists like it!)
• Widely deployed across universities and labs (thanks, NSF & DOE)
• On a path to sustainability based on subscriptions
• Being integrated into research infrastructures and applications
61
To accelerate impact in biomedicine:
• Integrate biomedical research facilities into the fabric
• Encourage subscriptions to address sustainability
• Provide HIPAA compliance for applications involving PHI
• Cultivate an ecosystem of data portals and applications that leverage the platform
• Continue to add capabilities
www.globus.org