streamlined data sharing and analysis to accelerate cancer research

Streamlined data sharing and analysis to accelerate cancer research, Ian Foster, The University of Chicago and Argonne National Laboratory

Upload: ian-foster

Post on 20-Feb-2017


TRANSCRIPT

Page 1: Streamlined data sharing and analysis to accelerate cancer research

Streamlined data sharing and analysis to accelerate cancer research

Ian Foster

The University of Chicago and Argonne National Laboratory

Page 2: Streamlined data sharing and analysis to accelerate cancer research

Thesis: We enhance sharing and analysis by eliminating friction

Page 3: Streamlined data sharing and analysis to accelerate cancer research

1919 Motor Transport Corps convoy

Washington, D.C., to San Francisco

56 days, average speed of 9 km/h

Page 4: Streamlined data sharing and analysis to accelerate cancer research

2016: 41 hours by road, 5.5 hours by air

Page 5: Streamlined data sharing and analysis to accelerate cancer research

2 minutes by web

<1 second by API

Page 6: Streamlined data sharing and analysis to accelerate cancer research

Cloud: Outsourcing and automation

Software as a service: SaaS

Infrastructure as a service: IaaS

Platform as a service: PaaS

(web & mobile apps)

Page 7: Streamlined data sharing and analysis to accelerate cancer research

Cloud: Outsourcing and automation

Software as a service: SaaS

Infrastructure as a service: IaaS

Platform as a service: PaaS

(web & mobile apps)

SaaS for science

Page 8: Streamlined data sharing and analysis to accelerate cancer research

Data challenges: Olopade Lab

Page 9: Streamlined data sharing and analysis to accelerate cancer research

Inherited hematological malignancies

Jane Churpek, MD; Lucy Godley, MD, PhD

Background:
• Familial predisposition to blood cancers has not been as widely appreciated as predisposition to some solid cancers
• Identifying the genes involved informs understanding of biology and may impact patient care (prevention, diagnosis, and treatment)

Research highlights:
• With samples from >500 families, the team has identified novel germline mutations that predispose to familial myelodysplastic syndromes and leukemia
• These mutations are much more common than previously known
• Specific genes with identified mutations include RUNX1, ETV6, DDX41, ANKRD26

Impact:
• Familial blood cancer syndromes are being included in the 2016 revision of the World Health Organization Classification of Hematological Malignancies; NCCN guidelines; European LeukemiaNet
• Identification of germline mutations is important for prevention/intervention and early diagnosis, and may change treatment (e.g., stem cell transplant from a related donor without the mutation or from a matched unrelated donor)

Page 10: Streamlined data sharing and analysis to accelerate cancer research

The RUNX1 International Sequencing Consortium (RISC)

Inherited hematological malignancies, Lucy Godley, UChicago

Page 11: Streamlined data sharing and analysis to accelerate cancer research

Notable areas of friction

• Moving data rapidly, securely, and reliably from lab to lab
• Accessing data at other labs
• Controlling who can access data
• Tracking what data is where
• Discovering available data within a rapidly growing haystack
• Computing at scale
• Complying with rules on personal health information
• Archive and backup

Page 12: Streamlined data sharing and analysis to accelerate cancer research
Page 13: Streamlined data sharing and analysis to accelerate cancer research

Sequencing center

Publication repository

Personal Computer

Compute facility

Page 22: Streamlined data sharing and analysis to accelerate cancer research

[Diagram: sequencing center, compute facility, personal computer, publication repository]

Transfer
1. Researcher initiates transfer request, or one is requested automatically by a script or science gateway
2. Globus transfers files reliably, securely

Share
3. Researcher selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Collaborator logs in to access shared files; no local account needed; download via Globus

Publish
6. Researcher assembles data set; attaches metadata (Dublin Core, domain-specific)
7. Curator reviews and approves; data set published on campus or other system

Discover
8. Peers, collaborators search and discover datasets; transfer and share using Globus

• Only Web browser required
• Use any storage system
• Access using any credential
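The sharing part of this workflow (the researcher grants a user or group access to selected files, access is enforced against the existing storage, and the collaborator reads the files in place) can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `SharedFolder`, `grant`, and `allowed` are hypothetical names invented for illustration, the principals and paths are made up, and nothing here is the Globus API.

```python
from dataclasses import dataclass, field
from posixpath import normpath


@dataclass
class SharedFolder:
    """Toy access-control list for files that stay on existing storage.

    Hypothetical model for illustration; not the Globus sharing API.
    """
    # Each rule is (path prefix, principal, permissions), where
    # permissions is "r" (read) or "rw" (read and write).
    rules: list = field(default_factory=list)

    def grant(self, path: str, principal: str, permissions: str) -> None:
        """The researcher picks files, a user or group, and permissions."""
        if permissions not in ("r", "rw"):
            raise ValueError("permissions must be 'r' or 'rw'")
        self.rules.append((normpath(path), principal, permissions))

    def allowed(self, path: str, identities: set, want: str) -> bool:
        """Check a request against the rules; the files never move."""
        target = normpath(path)  # collapses "..", so rules cannot be escaped
        for prefix, principal, permissions in self.rules:
            inside = target == prefix or target.startswith(prefix.rstrip("/") + "/")
            if inside and principal in identities and want in permissions:
                return True
        return False


folder = SharedFolder()
folder.grant("/families/batch1/", "group:consortium", "r")

# A collaborator is identified by any of their linked identities or groups.
collaborator = {"user:alice@example.org", "group:consortium"}
print(folder.allowed("/families/batch1/sample.bam", collaborator, "r"))
print(folder.allowed("/families/batch1/sample.bam", collaborator, "w"))
```

The design point the sketch tries to capture is that the rule table sits beside the storage rather than in front of a copy of it, which is why no data has to be moved to grant or revoke access.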

Page 23: Streamlined data sharing and analysis to accelerate cancer research

How Globus adds value…

• Ease of use, consistent user interface across systems
• “Fire-and-forget” reliable file transfer
• Low-overhead external collaboration
• Secure access, multi-tier security model
• Maximized wide area network throughput
• Rapid deployment via standard packages
• Highly automatable: CLI, RESTful API
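To make "fire-and-forget" concrete, the following sketch shows one way a transfer service could hide faults from the user: each file is retried until the checksum of what arrived matches the checksum of what was sent. It is an assumption-laden illustration, not a description of how Globus is implemented; `transfer_with_retries` and the `send` callable are hypothetical names, and the flaky link is simulated.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest used to verify that a file arrived intact."""
    return hashlib.sha256(data).hexdigest()


def transfer_with_retries(files: dict, send, max_attempts: int = 5) -> dict:
    """Copy every file, retrying until the received checksum matches.

    `files` maps a path to its bytes. `send(path, data)` is a hypothetical
    callable that returns the bytes that arrived (possibly corrupted) or
    raises ConnectionError. Returns the number of attempts used per file.
    """
    attempts = {}
    for path, data in files.items():
        expected = checksum(data)
        for attempt in range(1, max_attempts + 1):
            try:
                received = send(path, data)
            except ConnectionError:
                continue  # transient fault: try this file again
            if checksum(received) == expected:
                attempts[path] = attempt
                break
        else:
            # The loop ended without a verified copy.
            raise RuntimeError(f"gave up on {path} after {max_attempts} attempts")
    return attempts


# Simulated unreliable link: the first try drops, the second truncates.
calls = {}

def flaky_link(path, data):
    calls[path] = calls.get(path, 0) + 1
    if calls[path] == 1:
        raise ConnectionError("link dropped")
    if calls[path] == 2:
        return data[:-1]
    return data


print(transfer_with_retries({"a.bam": b"ACGT", "b.bam": b"TTGA"}, flaky_link))
```

The caller submits the whole batch once and only hears back when every file has been verified or a file has exhausted its attempts.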

Page 24: Streamlined data sharing and analysis to accelerate cancer research

24

Page 25: Streamlined data sharing and analysis to accelerate cancer research

25

Page 26: Streamlined data sharing and analysis to accelerate cancer research

26

Page 27: Streamlined data sharing and analysis to accelerate cancer research

27

Page 28: Streamlined data sharing and analysis to accelerate cancer research

28

Page 29: Streamlined data sharing and analysis to accelerate cancer research

29

Share with any identity

Page 30: Streamlined data sharing and analysis to accelerate cancer research

30

Page 31: Streamlined data sharing and analysis to accelerate cancer research

Storage connectors

Standard (POSIX): Linux, Windows, MacOS, Lustre, GPFS, OrangeFS, ...

Premium: HPSS, HDFS, Amazon S3, Ceph RadosGW (S3 API), Spectra Logic BlackPearl, Google Drive*

* Coming soon

Page 32: Streamlined data sharing and analysis to accelerate cancer research

Globus accelerates disk-to-disk throughput

[Bar chart: disk-to-disk throughput (Mbps, axis 0 to 9,000) for scp, scp (w/HPN), sftp, GridFTP (1 stream), and GridFTP (4 streams). Source: ESnet (2016)]

• Berkeley, CA to Argonne, IL (RTT: 53 ms, capacity: 10 Gbps)
• scp is 24x slower than GridFTP
• >1 Gbps (125 MB/s) disk-to-disk requires RAID array
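The arithmetic behind these figures is easy to check. The sketch below converts a link rate to megabytes per second, estimates how long a dataset takes at a given throughput, and computes the bandwidth-delay product of the path, that is, how much data must be in flight to keep it full. The function names, the 1 TB dataset size, and the 1 Gbps working rate are illustrative assumptions, not values read from the chart; only the 53 ms RTT, the 10 Gbps capacity, and the 24x factor come from the slide.

```python
def mbytes_per_second(bits_per_second: float) -> float:
    """Convert a rate in bits per second to megabytes per second."""
    return bits_per_second / 8 / 1e6


def transfer_hours(size_bytes: float, throughput_bps: float) -> float:
    """Hours needed to move size_bytes at a sustained throughput."""
    return size_bytes * 8 / throughput_bps / 3600


def bandwidth_delay_product_mb(capacity_bps: float, rtt_seconds: float) -> float:
    """Megabytes that must be in flight to fill a path of this capacity and RTT."""
    return capacity_bps * rtt_seconds / 8 / 1e6


one_gbps = 1e9
dataset = 1e12  # hypothetical 1 TB dataset

print(mbytes_per_second(one_gbps))                # the slide's 1 Gbps = 125 MB/s
print(transfer_hours(dataset, one_gbps))          # at an assumed 1 Gbps
print(transfer_hours(dataset, one_gbps / 24))     # at a rate 24x slower
print(bandwidth_delay_product_mb(10e9, 0.053))    # Berkeley to Argonne path
```

Under these assumptions the hypothetical 1 TB dataset takes a little over 2 hours at 1 Gbps and more than 2 days at one twenty-fourth of that rate, and the path needs roughly 66 MB in flight to stay full. A tool whose buffers are much smaller than that cannot saturate such a link no matter how fast the disks are, which is one plausible reason the single-stream tools lag in the chart.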

Page 33: Streamlined data sharing and analysis to accelerate cancer research
Page 34: Streamlined data sharing and analysis to accelerate cancer research

34

Page 35: Streamlined data sharing and analysis to accelerate cancer research

35

Page 36: Streamlined data sharing and analysis to accelerate cancer research

36

Data Discovery

Page 37: Streamlined data sharing and analysis to accelerate cancer research

Globus is widely used

• 4 major services
• 13 national labs
• 190 PB transferred
• 10,000 active endpoints
• 20 billion files processed
• 10,000 active users
• 50,000 registered users
• 99.9% uptime
• 35+ institutional subscribers
• 1 PB: largest single transfer to date
• 3 months: longest continuously managed transfer
• 130 federated campus identities

Page 38: Streamlined data sharing and analysis to accelerate cancer research

Globus @ NIH

[Two charts, March 2015 to August 2016: terabytes per month (axis 0 to 180) and users per month (axis 0 to 140)]

Page 39: Streamlined data sharing and analysis to accelerate cancer research

Globus subscriptions for sustainability

• Standard subscription
  • Shared endpoints
  • Data publication
  • HTTPS support*
  • Management console
  • Usage reporting
  • Priority support
  • Application integration
• Branded Web Site
• Premium Storage Connectors
  • Amazon S3, Ceph, HPSS, Spectra, Google Drive*, …
• Alternate Identity Provider (InCommon is standard)

*Available late 2016

Page 40: Streamlined data sharing and analysis to accelerate cancer research
Page 41: Streamlined data sharing and analysis to accelerate cancer research

Representative subscribers

Page 42: Streamlined data sharing and analysis to accelerate cancer research

Cloud: Outsourcing and automation

Software as a service: SaaS

Infrastructure as a service: IaaS

Platform as a service: PaaS

(web & mobile apps)

PaaS for science

Page 43: Streamlined data sharing and analysis to accelerate cancer research

43

Page 44: Streamlined data sharing and analysis to accelerate cancer research

44

Page 45: Streamlined data sharing and analysis to accelerate cancer research

45

Page 46: Streamlined data sharing and analysis to accelerate cancer research

46

Page 47: Streamlined data sharing and analysis to accelerate cancer research

47

https://fasterdata.es.net/

Globus leverages Science DMZs

Page 48: Streamlined data sharing and analysis to accelerate cancer research

Prototypical research data portal

• Move portal storage into Science DMZ, with Globus endpoint

• Leave portal web server behind firewall

• Globus handles security and data heavy lifting

[Architecture diagram. Labels: Desktop, Browser, User’s Endpoint (optional), Firewall, Portal Web Server (Client), Science DMZ, Portal Endpoint, Other Endpoints, Globus Cloud, Globus Auth, Globus Transfer Service, Other Services, Globus Web Widgets; connections labeled HTTPS, GridFTP, REST]

Page 49: Streamlined data sharing and analysis to accelerate cancer research
Page 50: Streamlined data sharing and analysis to accelerate cancer research

50

https://github.com/globus/globus-sample-data-portal

Page 51: Streamlined data sharing and analysis to accelerate cancer research

51

https://www.globusworld.org/tour/

Page 52: Streamlined data sharing and analysis to accelerate cancer research

Globus Genomics: Genomics analysis as a service

Ravi Madduri et al., University of Chicago

• Workflows can be easily defined and automated with integrated Galaxy Platform capabilities
• Data movement is streamlined with integrated Globus transfer
• Resources can be provisioned on demand with Amazon Web Services cloud-based infrastructure

Page 53: Streamlined data sharing and analysis to accelerate cancer research

Globus Genomics use cases

A profile of inherited predisposition to breast cancer among Nigerian women
Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade

A case study for high throughput analysis of NGS data for translational research using Globus Genomics
D. Sulakhe, A. Rodriguez, K. Bhuvaneshwar, Y. Gusev, R. Madduri, L. Lacinski, U. Dave, I. Foster, S. Madhavan

Page 54: Streamlined data sharing and analysis to accelerate cancer research

Globus Genomics at a glance

• 30 institutions, groups
• 10s of millions of core hours
• 2 PB of raw sequence analyzed
• 1,500 analysis tools
• 10,000 genomes processed
• 50 workflows
• 99% uptime over the past two years
• 1 PB of data generated
• 43 steps in longest pipeline
• 100s of species
• 75: largest user group
• 5 days: longest running workflow

Page 55: Streamlined data sharing and analysis to accelerate cancer research

Cost-aware provisioning on cloud resources


1. Filter instance types with profiles

2. Determine price for each instance type across all availability zones

3. Rank potential requests

4. Make requests and monitor

5. Cancel or repurpose excess active requests once one is fulfilled

Can reduce costs by 95% or more!
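The five steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm from the cited paper: the instance names, zones, and prices are made up, ranking by price per core is an assumed criterion, and `rank_requests`, `fulfil`, and the `try_request` callable are hypothetical names standing in for calls to a cloud provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cores: int
    memory_gb: int


def rank_requests(instance_types, prices, min_cores, min_memory_gb):
    """Steps 1 to 3: filter by job profile, price per zone, then rank.

    `prices` maps (instance name, availability zone) to an hourly price.
    Candidates are ranked by price per core (an assumed criterion).
    """
    candidates = []
    for itype in instance_types:
        if itype.cores < min_cores or itype.memory_gb < min_memory_gb:
            continue  # step 1: does not fit the job's profile
        for (name, zone), price in prices.items():  # step 2: every zone
            if name == itype.name:
                candidates.append((price / itype.cores, itype.name, zone))
    return sorted(candidates)  # step 3: cheapest per core first


def fulfil(ranked, try_request, fan_out=3):
    """Steps 4 and 5: submit the best few requests, keep the first that is
    fulfilled, and report the others as cancelled."""
    winner, cancelled = None, []
    for request in ranked[:fan_out]:
        if winner is None and try_request(request):
            winner = request
        else:
            cancelled.append(request)
    return winner, cancelled


types = [InstanceType("small", 2, 4), InstanceType("medium", 8, 32),
         InstanceType("large", 16, 64)]
prices = {("small", "zone-a"): 0.02, ("medium", "zone-a"): 0.16,
          ("medium", "zone-b"): 0.08, ("large", "zone-a"): 0.24,
          ("large", "zone-b"): 0.40}

ranked = rank_requests(types, prices, min_cores=8, min_memory_gb=16)
# Pretend that only zone-a currently has spare capacity.
winner, cancelled = fulfil(ranked, lambda request: request[2] == "zone-a")
print(winner, cancelled)
```

Fanning out several requests and cancelling the losers trades a little bookkeeping for a shorter wait, since the cheapest option is not always the one that gets fulfilled first.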


R. Chard et al. Cost-aware cloud provisioning, 11th IEEE International Conference on e-Science (e-Science), 2015.

Page 58: Streamlined data sharing and analysis to accelerate cancer research

What’s coming soon: Richer endpoints

[Diagram labels: GridFTP, HTTP]

HTTPS access to endpoints
• Enhanced use of research storage:
  • Asynchronous, bulk transfer: GridFTP
  • Synchronous remote access: HTTPS
• Enhanced Globus web app
  • Browser-based upload/download
  • Inline file viewer
• Integration with clients, web apps

Collections
• Groupings of files that are to be treated as logical units
• Can be named and described

Data search
• Automated metadata harvesting
  • From Globus endpoints
  • Submitted via REST API
• Rich search capabilities
  • Free text, faceted, boosted
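To illustrate what free-text, faceted, and boosted search over described collections could look like, here is a small self-contained sketch. It is not the planned Globus search service: the `search` function, its parameters, and the sample records are all hypothetical, and the field names (`title`, `creator`, `subject`, `description`) are simply borrowed from Dublin Core for the example.

```python
from collections import Counter


def search(records, text="", facets=None, boosts=None, count_by="subject"):
    """Rank metadata records for a query and count results per facet value.

    facets: exact-match filters, e.g. {"subject": "leukemia"}
    boosts: per-field weights for text matches, e.g. {"title": 3.0}
    """
    facets = facets or {}
    boosts = boosts or {}
    terms = text.lower().split()
    scored = []
    for record in records:
        # Faceted: drop records that miss any required field value.
        if any(record.get(name) != value for name, value in facets.items()):
            continue
        # Free text: count term matches in every field.
        # Boosted: matches in some fields weigh more than others.
        score = 0.0
        for name, value in record.items():
            words = str(value).lower().split()
            matches = sum(words.count(term) for term in terms)
            score += matches * boosts.get(name, 1.0)
        if score > 0 or not terms:
            scored.append((score, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    results = [record for _, record in scored]
    facet_counts = Counter(record.get(count_by) for record in results)
    return results, facet_counts


# Hypothetical collection descriptions, as might be harvested or submitted.
collections = [
    {"title": "RUNX1 germline variants", "creator": "lab-a",
     "subject": "leukemia", "description": "exome calls"},
    {"title": "Exome calls", "creator": "lab-b",
     "subject": "leukemia", "description": "RUNX1 and ETV6 families"},
    {"title": "Breast cancer panel", "creator": "lab-c",
     "subject": "breast cancer", "description": "targeted sequencing"},
]

results, counts = search(collections, "runx1", boosts={"title": 3.0})
print([record["title"] for record in results], dict(counts))
```

Boosting the title field moves the collection that names the gene in its title ahead of the one that only mentions it in its description, and the facet counts tell a user how the matches are distributed before they narrow the query.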

Page 59: Streamlined data sharing and analysis to accelerate cancer research

59

Thank you to our sponsors

U.S. Department of Energy

Thanks to: Rachana Ananthakrishnan, Kyle Chard, Ravi Madduri, Brigitte Raumann, Steve Tuecke, Vas Vasiliadis, and others in the Globus team at the University of Chicago

Page 60: Streamlined data sharing and analysis to accelerate cancer research

Globus provides a new global-scale data fabric that can accelerate discovery by streamlining scientific data sharing and analysis
• Globus-enabled storage systems enable robust, secure access
• Globus cloud services implement transfer, sharing, publication, discovery, and other capabilities

This fabric is:
• Being applied in cancer research
• Spreading rapidly by word of mouth (scientists like it!)
• Widely deployed across universities and labs (thanks, NSF & DOE)
• On a path to sustainability based on subscriptions
• Being integrated into research infrastructures and applications

Page 61: Streamlined data sharing and analysis to accelerate cancer research


To accelerate impact in biomedicine:

• Integrate biomedical research facilities into the fabric

• Encourage subscriptions to address sustainability

• Provide HIPAA compliance for applications involving PHI

• Cultivate an ecosystem of data portals and applications that leverage the platform

• Continue to add capabilities

www.globus.org