accelerating data-intensive science by outsourcing the mundane
DESCRIPTION
Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!) Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking about them." I propose that cloud computing can allow us to accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.
TRANSCRIPT
www.ci.anl.gov www.ci.uchicago.edu
Accelerating data-intensive science by outsourcing the mundane
Ian Foster
Alfred North Whitehead (1911)
Civilization advances by extending the number of important operations which we can perform without thinking about them
J.C.R. Licklider reflects on thinking (1960)
About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know
For example … (Licklider again)
At one point, it was necessary to compare six experimental determinations of a function relating speech-intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.
Research hasn’t changed much in 300 years
Pose question
Design experiment
Collect data
Analyze data
Identify patterns
Hypothesize explanation
Test hypotheses
Publish results
Discovery 1960: Data collection dominates
Janet Rowley: chromosome translocations and cancer
Discovery 2010: Data overflows
800,000,000,000 bases/day
30,000,000,000,000 bases/year
Meanwhile, we drown in administrivia
42%!!
The Federal Demonstration Partnership’s faculty burden survey
You can run a company from a coffee shop
Varieties of * as a service (*aaS)
SaaS (Software): Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
PaaS (Platform): Google, Microsoft, Amazon, …
IaaS (Infrastructure): Amazon, GoGrid, Microsoft, Flexiscale, …
Perform important tasks without thinking
Web presence
Email (hosted Exchange)
Calendar
Telephony (hosted VOIP)
Human resources and payroll
Accounting
Customer relationship mgmt
Data analytics
Content distribution
SaaS
IaaS
What about small and medium labs?
Research IT is a growing burden
Big projects can build sophisticated solutions to IT problems
Small labs and collaborations have problems with both
They need solutions, not toolkits—ideally outsourced solutions
Medium science: Dark Energy Survey
• Every night, they receive 100,000 files in Illinois
• They transmit these files to Texas for analysis (35 msec latency)
• Then move the results back to Illinois
• This whole process must run reliably & routinely
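A back-of-the-envelope sketch of why per-file latency matters at this scale. It assumes, purely for illustration, that the 35 msec figure is a round-trip time, that each file costs exactly one round trip of protocol overhead, and that bandwidth is not the bottleneck. The function name and the concurrency levels are hypothetical.

```python
# Rough estimate of latency overhead when moving many files.
# Assumptions (not from the slides): 35 ms is a round-trip time, each file
# costs exactly one round trip, and bandwidth is not the limiting factor.

FILES_PER_NIGHT = 100_000  # from the slide
LATENCY_S = 0.035          # 35 msec, from the slide


def latency_overhead_seconds(files: int, latency_s: float, concurrency: int = 1) -> float:
    """Time spent waiting on latency when `concurrency` files are in flight at once."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return files * latency_s / concurrency


if __name__ == "__main__":
    for streams in (1, 10, 100):  # hypothetical concurrency levels
        minutes = latency_overhead_seconds(FILES_PER_NIGHT, LATENCY_S, streams) / 60
        print(f"{streams:>3} concurrent transfers: {minutes:.1f} minutes of latency overhead")
```

Under these assumptions, strictly sequential transfers spend close to an hour per direction waiting on the network, which is one reason concurrency matters for a nightly workflow like this.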
Image credit: Roger Smith/NOAO/AURA/NSF
Blanco 4m on Cerro Tololo
Open transfer sockets vs. time
[Image: Don Petravick, NCSA]
A new approach to research IT
Goal: Accelerate discovery and innovation worldwide by providing research IT as a service
Leverage software-as-a-service (SaaS) to
• provide millions of researchers with unprecedented access to powerful research tools, and
• enable a massive shortening of cycle times in time-consuming research processes
Time-consuming tasks in science
• Run experiments
• Collect data
• Manage data
• Move data
• Acquire computers
• Analyze data
• Run simulations
• Compare experiment with simulation
• Search the literature
• Communicate with colleagues
• Publish papers
• Find, configure, install relevant software
• Find, access, analyze relevant data
• Order supplies
• Write proposals
• Write reports
• …
Data movement can be surprisingly difficult
[Diagram: endpoints A and B]
Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, determine expected performance, determine actual performance, identify, diagnose, and correct network misconfigurations, integrate with file systems, …
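One item on this slide, "detect and respond to failures", can be sketched as a retry loop with exponential backoff. This is a generic illustration and not how Globus Online is implemented. The function `transfer_with_retry`, its parameters, and the choice of `OSError` as the retryable failure are all assumptions made for the example.

```python
import time
from typing import Callable


def transfer_with_retry(
    transfer: Callable[[], None],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `transfer` until it succeeds and return the number of attempts used.

    Hypothetical helper: OSError stands in for any transient transfer failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            transfer()
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise  # give up and report the last failure
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_transfer() -> None:
        # Simulated transfer that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise OSError("simulated network failure")

    used = transfer_with_retry(flaky_transfer, base_delay_s=0.01)
    print(f"succeeded after {used} attempts")
```

A hosted service can run this kind of loop on the user's behalf, which is the "fire-and-forget" idea: the researcher submits the request once and does not babysit it.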
Grid (aka federation) as a service
Globus Toolkit: Build the Grid
Components for building custom grid solutions
globustoolkit.org
Globus Online: Use the Grid
Cloud-hosted file transfer service
globusonline.org
Globus Online’s Web 2.0 architecture
Fire-and-forget data movement
Many files and lots of data
Credential management
Performance optimization
Expert operations and monitoring
Web interface
HTTP REST interface
POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc>
Command line interface
ls alcf#dtn:/
scp alcf#dtn:/myfile \
    nersc#dtn:/myfile
GridFTP servers
FTP servers
High-performance data transfer nodes
Globus Connect on local computers
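A minimal sketch of what a client of the HTTP REST interface might look like, using only the Python standard library. The URL is the one shown on the slide; everything else is an assumption. In particular, the field names inside the transfer document (`source_endpoint`, `destination_endpoint`, `items`) are hypothetical, since the talk does not give the schema, and authentication is omitted. The request is only constructed, not sent.

```python
import json
import urllib.request

# URL taken from the slide.
TRANSFER_URL = "https://transfer.api.globusonline.org/v0.10/transfer"


def build_transfer_request(source: str, destination: str, path: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a transfer document."""
    # Hypothetical document layout: the real <transfer-doc> schema is not
    # given in the talk, so these field names are illustrative only.
    transfer_doc = {
        "source_endpoint": source,
        "destination_endpoint": destination,
        "items": [{"source_path": path, "destination_path": path}],
    }
    return urllib.request.Request(
        TRANSFER_URL,
        data=json.dumps(transfer_doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Mirrors the command-line example: copy /myfile from alcf#dtn to nersc#dtn.
    request = build_transfer_request("alcf#dtn", "nersc#dtn", "/myfile")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

The point of the sketch is the shape of the interaction: one small document describes the whole transfer, and the hosted service takes care of the rest.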
Globus Connect to/from your laptop
Almost always faster than other methods
[Figure: transfer rate in bytes/sec (1E+03 to 1E+09) vs. megabytes per file (0.001 to 1000); series: go, guc, scp, tuned guc; sites: Argonne, NERSC]
Monitoring provides deep visibility
Globus Online runs on the cloud
Data movers scale well on Amazon
SaaS facilitates troubleshooting
11 x 125 files, 200 MB each
11 users, 12 sites
Moving 586 Terabytes in two weeks
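To put that headline number in perspective, the average sustained rate can be worked out directly. This sketch assumes decimal terabytes (10^12 bytes), exactly 14 days, and continuous transfer over the whole period; none of these details are stated on the slide.

```python
# Average sustained rate implied by "586 Terabytes in two weeks".
# Assumptions (not from the slide): 1 TB = 10**12 bytes, two weeks = 14 days,
# and the transfer ran continuously for the whole period.

TERABYTE = 10**12
SECONDS_PER_DAY = 86_400


def average_rate_bytes_per_s(total_bytes: float, days: float) -> float:
    """Average throughput in bytes per second over the given number of days."""
    return total_bytes / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    rate = average_rate_bytes_per_s(586 * TERABYTE, 14)
    print(f"{rate / 1e6:.0f} MB/s, {rate * 8 / 1e9:.2f} Gbit/s")
```

Under these assumptions the average works out to roughly 480 MB/s, a little under 4 Gbit/s sustained for two weeks.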
NSF XSEDE architecture incorporates Globus Toolkit and Globus Online
XSEDE
Next steps: Outsource additional activities
Pose question
Design experiment
Collect data
Analyze data
Identify patterns
Hypothesize explanation
Test hypotheses
Publish results
A use case for the next steps
• Medical image data is acquired at multiple sites
• Uploaded to a commercial cloud
• Quality control algorithms applied
• Anonymization procedures applied
• Metadata extracted and stored
• Access granted to clinical trial team
• Interactive access and analysis
• More metadata generated and stored
• Access granted to subset of data for education
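The first several steps of this use case can be pictured as a pipeline of small stages applied to each image. The sketch below is only an illustration of that shape: the `ImageRecord` type, the stage functions, and the group name `clinical-trial-team` are all hypothetical, and each stage is a trivial stand-in for the real algorithm.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class ImageRecord:
    """Hypothetical record for one uploaded medical image."""
    image_id: str
    site: str
    patient_name: str
    passed_qc: bool = False
    metadata: Dict[str, str] = field(default_factory=dict)
    readers: Set[str] = field(default_factory=set)


def quality_control(rec: ImageRecord) -> ImageRecord:
    # Stand-in check: a real QC algorithm would inspect the image itself.
    rec.passed_qc = bool(rec.image_id)
    return rec


def anonymize(rec: ImageRecord) -> ImageRecord:
    # Stand-in anonymization: remove the identifying field.
    rec.patient_name = "REDACTED"
    return rec


def extract_metadata(rec: ImageRecord) -> ImageRecord:
    rec.metadata["site"] = rec.site
    return rec


def grant_access(rec: ImageRecord) -> ImageRecord:
    rec.readers.add("clinical-trial-team")  # hypothetical group name
    return rec


# Stages run in the order listed on the slide.
PIPELINE: List[Callable[[ImageRecord], ImageRecord]] = [
    quality_control,
    anonymize,
    extract_metadata,
    grant_access,
]


def process(rec: ImageRecord) -> ImageRecord:
    """Apply every pipeline stage to the record, in order."""
    for stage in PIPELINE:
        rec = stage(rec)
    return rec


if __name__ == "__main__":
    print(process(ImageRecord("img-001", "site-a", "Jane Doe")))
```

Expressing the steps as an ordered list of stages makes it easy to see which ones a hosted service could run without the lab writing or operating the plumbing.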
Required building blocks
• Group management for data sharing
  – Scheduled September, 2011, for BIRN biomedical
• Metadata management
  – Create, update, query a hosted metadata catalog
• Data publication workflows
  – Data movement, naming, metadata operations, etc.
• Cloud storage access
  – And HTTP, WebDAV, SRM, iRODS, …
• Computation on shared data
  – E.g., via Galaxy workflow system
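The "create, update, query" operations named under metadata management can be illustrated with a tiny in-memory catalog. This is a sketch of the three operations only, not the hosted service described in the talk. The class `MetadataCatalog`, its method signatures, and the sample attribute names are all assumptions.

```python
from typing import Any, Dict, List


class MetadataCatalog:
    """Hypothetical in-memory stand-in for a hosted metadata catalog."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, Any]] = {}

    def create(self, name: str, **attributes: Any) -> None:
        """Register a new dataset with its initial attributes."""
        if name in self._entries:
            raise KeyError(f"entry already exists: {name}")
        self._entries[name] = dict(attributes)

    def update(self, name: str, **attributes: Any) -> None:
        """Add or overwrite attributes on an existing dataset."""
        self._entries[name].update(attributes)

    def query(self, **criteria: Any) -> List[str]:
        """Return the sorted names of datasets whose attributes match all criteria."""
        return sorted(
            name
            for name, attrs in self._entries.items()
            if all(attrs.get(key) == value for key, value in criteria.items())
        )


if __name__ == "__main__":
    catalog = MetadataCatalog()
    catalog.create("scan-001", modality="MRI", site="site-a")
    catalog.create("scan-002", modality="CT", site="site-a")
    catalog.update("scan-002", anonymized=True)
    print(catalog.query(site="site-a"))
    print(catalog.query(modality="CT", anonymized=True))
```

A hosted version would put the same three operations behind a service interface, so that a small lab does not have to run its own database.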
www.globusonline.org
Summary
• To accelerate discovery, automate the mundane
• Data-intensive computing is particularly full of mundane tasks
• Outsourcing complexity to SaaS providers is a promising route to automation
• Globus Online is an early experiment in SaaS for science
For more information
• Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
• Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Globus Online: Radical Simplification of Data Movement via SaaS. Preprint CI-PP-05-0611, Computation Institute, 2011.