Accelerating data-intensive science by outsourcing the mundane

Post on 10-May-2015

Category:

Technology

DESCRIPTION

Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!). Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking about them." I propose that cloud computing can allow us to accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.

TRANSCRIPT

www.ci.anl.gov www.ci.uchicago.edu

Accelerating data-intensive science by outsourcing the mundane

Ian Foster

2

Alfred North Whitehead (1911)

Civilization advances by extending the number of important operations which we can perform without thinking about them

3

J.C.R. Licklider reflects on thinking (1960)

About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know

4

For example … (Licklider again)

At one point, it was necessary to compare six experimental determinations of a function relating speech-intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.

5

Research hasn’t changed much in 300 years

• Pose question
• Design experiment
• Collect data
• Analyze data
• Identify patterns
• Hypothesize explanation
• Test hypotheses
• Publish results

6

Discovery 1960: Data collection dominates

Janet Rowley: chromosome translocations and cancer

7

Discovery 2010: Data overflows

800,000,000,000 bases/day
30,000,000,000,000 bases/year

8

Meanwhile, we drown in administrivia

42%!!

The Federal Demonstration Partnership’s faculty burden survey

9

You can run a company from a coffee shop

10

Varieties of “* as a Service” (*aaS)

SaaS (Software): Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
PaaS (Platform)
IaaS (Infrastructure)

11

Varieties of * as a service (*aaS)

SaaS (Software): Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
PaaS (Platform)
IaaS (Infrastructure): Amazon, GoGrid, Microsoft, Flexiscale, …

12

Varieties of * as a service (*aaS)

SaaS (Software): Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
PaaS (Platform): Google, Microsoft, Amazon, …
IaaS (Infrastructure): Amazon, GoGrid, Microsoft, Flexiscale, …

13

Perform important tasks without thinking

• Web presence
• Email (hosted Exchange)
• Calendar
• Telephony (hosted VOIP)
• Human resources and payroll
• Accounting
• Customer relationship mgmt
• Data analytics
• Content distribution

IaaS

14

Perform important tasks without thinking

• Web presence
• Email (hosted Exchange)
• Calendar
• Telephony (hosted VOIP)
• Human resources and payroll
• Accounting
• Customer relationship mgmt
• Data analytics
• Content distribution

SaaS

IaaS

15

What about small and medium labs?

16

Research IT is a growing burden

Big projects can build sophisticated solutions to IT problems

Small labs and collaborations have problems with both

They need solutions, not toolkits—ideally outsourced solutions

17

Medium science: Dark Energy Survey

• Every night, they receive 100,000 files in Illinois

• They transmit these files to Texas for analysis (35 msec latency)

• Then move the results back to Illinois

• This whole process must run reliably & routinely

Image credit: Roger Smith/NOAO/AURA/NSF

Blanco 4m on Cerro Tololo
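
To see why 100,000 files and 35 msec of latency make a routine transfer harder than it sounds, here is a rough back-of-the-envelope sketch. It assumes (the slide does not say) that the 35 msec figure is a round-trip time and that a naive transfer tool spends one round trip per file on control traffic. The `concurrency` parameter is a hypothetical stand-in for keeping several files in flight at once.

```python
def latency_overhead_seconds(num_files, rtt_seconds, round_trips_per_file=1, concurrency=1):
    """Time spent purely waiting on the network, ignoring data transfer time.

    Assumptions (not from the slide): each file costs a fixed number of
    round trips, and `concurrency` files can be in flight at once.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return num_files * round_trips_per_file * rtt_seconds / concurrency


NUM_FILES = 100_000   # files received per night (from the slide)
RTT = 0.035           # 35 msec, assumed here to be a round-trip time

sequential = latency_overhead_seconds(NUM_FILES, RTT)
parallel = latency_overhead_seconds(NUM_FILES, RTT, concurrency=16)

print(f"one file at a time: {sequential / 60:.1f} minutes of waiting")
print(f"16 files in flight: {parallel / 60:.1f} minutes of waiting")
```

Under these assumptions, handling one file at a time costs close to an hour of pure waiting each night before any data moves, which is one reason per-file overhead matters for workloads with many small files.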

18

Open transfer sockets vs. time

[Image: Don Petravick, NCSA]

19

A new approach to research IT

Goal: Accelerate discovery and innovation worldwide by providing research IT as a service

Leverage software-as-a-service (SaaS) to
• provide millions of researchers with unprecedented access to powerful research tools, and
• enable a massive shortening of cycle times in time-consuming research processes

20

Time-consuming tasks in science

• Run experiments
• Collect data
• Manage data
• Move data
• Acquire computers
• Analyze data
• Run simulations
• Compare experiment with simulation
• Search the literature
• Communicate with colleagues
• Publish papers
• Find, configure, install relevant software
• Find, access, analyze relevant data
• Order supplies
• Write proposals
• Write reports
• …

22

Data movement can be surprisingly difficult

A → B

Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, determine expected performance, determine actual performance, identify, diagnose, and correct network misconfigurations, integrate with file systems, …
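
One of the chores named on this slide, detecting and responding to failures, can be illustrated with a minimal retry loop. This is only a sketch under stated assumptions, not how Globus Online is implemented: `attempt_transfer` is a hypothetical callable that raises `OSError` on a transient failure, and the exponential backoff schedule is an arbitrary choice for illustration.

```python
import time


def run_with_retries(attempt_transfer, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call attempt_transfer() until it succeeds or attempts run out.

    attempt_transfer is a hypothetical callable that raises OSError on a
    transient failure. The delay doubles after each failed attempt.
    Returns (result, number_of_attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_transfer(), attempt
        except OSError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated flaky transfer: fails twice, then succeeds.
failures_left = [2]


def flaky_transfer():
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise OSError("connection reset")
    return "done"


# Record the requested delays instead of actually sleeping.
delays = []
result, attempts = run_with_retries(flaky_transfer, sleep=delays.append)
print(result, attempts, delays)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to check. Every researcher who scripts their own transfers ends up writing some version of this loop, which is the kind of mundane work the talk proposes to outsource.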

23

Grid (aka federation) as a service

Globus Toolkit: Build the Grid
Components for building custom grid solutions
globustoolkit.org

Globus Online: Use the Grid
Cloud-hosted file transfer service
globusonline.org

24

Globus Online’s Web 2.0 architecture

Fire-and-forget data movement
Many files and lots of data
Credential management
Performance optimization
Expert operations and monitoring

Web interface

HTTP REST interface
POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>

Command line interface
ls alcf#dtn:/
scp alcf#dtn:/myfile \
    nersc#dtn:/myfile

GridFTP servers
FTP servers
High-performance data transfer nodes
Globus Connect on local computers
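
As a rough illustration of the HTTP REST interface above, the sketch below builds (but does not send) a POST request to the URL shown on the slide. The slide gives only `<transfer-doc>` as the request body, so the JSON field names used here are illustrative assumptions rather than the service's documented schema, and authentication is omitted entirely.

```python
import json
import urllib.request

# URL as shown on the slide.
TRANSFER_URL = "https://transfer.api.globusonline.org/v0.10/transfer"


def build_transfer_request(source_endpoint, source_path, dest_endpoint, dest_path):
    """Build (but do not send) a POST request carrying a transfer document.

    The JSON field names below are illustrative assumptions; the slide only
    shows "<transfer-doc>" and does not define its contents.
    """
    transfer_doc = {
        "source_endpoint": source_endpoint,
        "destination_endpoint": dest_endpoint,
        "items": [{"source_path": source_path, "destination_path": dest_path}],
    }
    body = json.dumps(transfer_doc).encode("utf-8")
    return urllib.request.Request(
        TRANSFER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Endpoint names and paths taken from the command-line example on the slide.
request = build_transfer_request("alcf#dtn", "/myfile", "nersc#dtn", "/myfile")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

The request mirrors the `scp alcf#dtn:/myfile nersc#dtn:/myfile` command-line example: the same transfer is described once as a document and submitted to the service, which then carries it out without the client staying connected.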

25

Globus Connect to/from your laptop

26

Almost always faster than other methods

[Chart: transfer rate in bytes/sec (log scale, 1E+03 to 1E+09) versus megabytes per file (0.001 to 1000); series: go, guc, scp, tuned guc; sites: Argonne, NERSC]

27

Monitoring provides deep visibility

29

Globus Online runs on the cloud

30

Data movers scale well on Amazon

31

SaaS facilitates troubleshooting

11 x 125 files, 200 MB each
11 users, 12 sites

32

Moving 586 Terabytes in two weeks
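
For a sense of scale, the sustained rate implied by this headline can be worked out directly. The sketch assumes decimal terabytes (1 TB = 10^12 bytes) and exactly 14 days; neither is stated on the slide.

```python
def average_rate_bytes_per_sec(total_bytes, days):
    """Average throughput needed to move total_bytes in the given number of days."""
    return total_bytes / (days * 24 * 60 * 60)


TOTAL_BYTES = 586 * 10**12   # 586 terabytes, assuming decimal units
DAYS = 14                    # "two weeks", assumed to be exactly 14 days

rate = average_rate_bytes_per_sec(TOTAL_BYTES, DAYS)
print(f"{rate / 1e6:.0f} MB/s sustained, or {rate * 8 / 1e9:.2f} Gbit/s")
```

Under those assumptions the average works out to roughly 480 MB/s, a little under 4 Gbit/s, held around the clock for the full two weeks.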

33

NSF XSEDE architecture incorporates Globus Toolkit and Globus Online

34

Next steps: Outsource additional activities

• Pose question
• Design experiment
• Collect data
• Analyze data
• Identify patterns
• Hypothesize explanation
• Test hypotheses
• Publish results

35

A use case for the next steps

• Medical image data is acquired at multiple sites
• Uploaded to a commercial cloud
• Quality control algorithms applied
• Anonymization procedures applied
• Metadata extracted and stored
• Access granted to clinical trial team
• Interactive access and analysis
• More metadata generated and stored
• Access granted to subset of data for education

36

Required building blocks

• Group management for data sharing
  – Scheduled September, 2011, for BIRN biomedical
• Metadata management
  – Create, update, query a hosted metadata catalog
• Data publication workflows
  – Data movement, naming, metadata operations, etc.
• Cloud storage access
  – And HTTP, WebDAV, SRM, iRODS, …
• Computation on shared data
  – E.g., via Galaxy workflow system

www.globusonline.org

37

38

Summary

• To accelerate discovery, automate the mundane

• Data-intensive computing is particularly full of mundane tasks

• Outsourcing complexity to SaaS providers is a promising route to automation

• Globus Online is an early experiment in SaaS for science

39

For more information

• Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.

• Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Globus Online: Radical Simplification of Data Movement via SaaS. Preprint CI-PP-05-0611, Computation Institute, 2011.

Thank you!

foster@anl.gov foster@uchicago.edu
