The Discovery Cloud: Accelerating Science via Outsourcing and Automation
DESCRIPTION
Director's Colloquium at Los Alamos National Laboratory, September 18, 2014. We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, current, and potential future of large-scale outsourcing and automation for science.
TRANSCRIPT
accelerating science via outsourcing and automation
Ian Foster Argonne National Laboratory and University of Chicago
ianfoster.org
The Discovery Cloud!
The discovery process, iterative and time-consuming: pose question → hypothesize explanation → design experiment → collect data → analyze data → identify patterns → test hypothesis → publish results.
We've got no money, so we've got to think
Ernest Rutherford
Civilization advances by extending the number of important operations which we can perform without thinking about them
Alfred North Whitehead (1911)
About 85% of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know
J. C. R. Licklider, 1960
Automation is required to apply more sophisticated methods at larger scales
Outsourcing is needed to achieve economies of scale in the use of automated methods
Outsourcing and automation: (1) The Grid
A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to computational capabilities
Foster and Kesselman, 1998
Higgs discovery “only possible because of the extraordinary achievements of … grid computing”—Rolf Heuer, CERN DG
10s of PB, 100s of institutions, 1000s of scientists, 100,000s of CPUs, billions of tasks
Outsourcing and automation: (2) The Cloud
Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction
NIST, 2011
Tripit exemplifies process automation.
Me: book flights, book hotel.
Tripit and other services, over time: record flights, suggest hotel, record hotel, get weather, prepare maps, share info, monitor prices, monitor flight.
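The Tripit pattern above is event-driven automation: one user action triggers a cascade of automated follow-ups. A minimal sketch of that idea, with illustrative event and service names (not Tripit's actual API):

```python
from typing import Callable

# event name -> automated follow-up actions registered for it
_handlers: dict[str, list[Callable[[dict], str]]] = {}

def on(event: str):
    """Register an automated follow-up action for an event."""
    def register(action: Callable[[dict], str]) -> Callable[[dict], str]:
        _handlers.setdefault(event, []).append(action)
        return action
    return register

def trigger(event: str, data: dict) -> list[str]:
    """Fire an event; every registered service reacts without user effort."""
    return [action(data) for action in _handlers.get(event, [])]

@on("flight_booked")
def record_flight(d: dict) -> str:
    return f"recorded flight {d['flight']}"

@on("flight_booked")
def suggest_hotel(d: dict) -> str:
    return f"suggested hotels near {d['dest']}"

# The user performs one action; the rest happens without thinking about it.
results = trigger("flight_booked", {"flight": "UA123", "dest": "LAX"})
```

The user-facing action and the automated reactions are decoupled, which is what lets new services (weather, maps, price monitoring) be added without changing the user's workflow.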
How the “business cloud” works
Platform services: database, analytics, application, deployment, workflow, queuing; auto-scaling, Domain Name Service, content distribution; Elastic MapReduce, streaming data analytics; email, messaging, transcoding; many more.
Infrastructure services: computing, storage, networking; elastic capacity; multiple availability zones.
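"Elastic capacity" means provisioning tracks load rather than peak demand. A toy scaling policy illustrating the idea (the function and its parameters are hypothetical, not any provider's auto-scaling API):

```python
def scale(queue_depth: int, per_worker: int = 10,
          min_workers: int = 1, max_workers: int = 100) -> int:
    """Workers needed for the current load, clamped to an allowed range."""
    needed = -(-queue_depth // per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))

# Idle load keeps a minimal footprint; bursts scale out, bounded by a cap.
idle, burst, flood = scale(0), scale(25), scale(5000)
```

The economic point of the slide is exactly this clamp: the lab pays for `needed` workers on demand instead of owning `max_workers` machines permanently.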
The Intelligence Cloud
Process automation for science
Run experiment → collect data → move data → check data → annotate data → share data → find similar data → link to literature → analyze data → publish data (repeated over time).
Automate and outsource: the Discovery Cloud
[Diagram: instruments (next-gen genome sequencer, telescope) feed staging and ingest; data flows through analysis to a community repository, archive, mirror, and registry.]
In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting
Globus research data management services
www.globus.org
“I need to easily, quickly, and reliably mirror [portions of] my data to other places.”
(Across: research computing HPC cluster, lab server, campus home filesystem, desktop workstation, personal laptop, XSEDE resource, public cloud.)
“I need to easily and securely share my data with colleagues.”
“I need to get data from a scientific instrument to my analysis server.”
(From: next-gen sequencer, light sheet microscope, MRI, Advanced Light Source.)
Globus transfer & sharing; identity & group management; data discovery & publication
25,000 users, 60 PB and 3B files transferred, 8,000 endpoints
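The value of a managed transfer service is "fire and forget": the user submits a transfer between two named endpoints and the service handles faults. A self-contained simulation of that behavior, with hypothetical names (the real interface is the Globus Transfer API at docs.globus.org, which this does not reproduce):

```python
import itertools

class TransferTask:
    """Toy model of a managed third-party transfer with automatic retry."""

    def __init__(self, src: str, dst: str, path: str, max_retries: int = 3):
        self.src, self.dst, self.path = src, dst, path
        self.max_retries = max_retries
        self.status = "ACTIVE"

    def run(self, send) -> str:
        """Attempt the transfer, retrying transient faults automatically."""
        for attempt in itertools.count(1):
            try:
                send(self.src, self.dst, self.path)
                self.status = "SUCCEEDED"
            except ConnectionError:
                if attempt >= self.max_retries:
                    self.status = "FAILED"
                else:
                    continue  # transient fault: retry without user action
            return self.status

# A flaky link that fails once, then succeeds: the user never notices.
_faults = iter([ConnectionError("connection reset")])
def flaky_send(src: str, dst: str, path: str) -> None:
    fault = next(_faults, None)
    if fault:
        raise fault

task = TransferTask("lab-server", "campus-hpc", "/data/run42.h5")
status = task.run(flaky_send)
```

Fault tolerance lives in the service, not in the researcher's scripts, which is the "outsourcing" half of the talk's thesis.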
The Globus Galaxies platform: science as a service
Globus Galaxies platform
Tool and workflow execution, publication, discovery, sharing;identity management; data management; task scheduling
Infrastructure services
EC2, EBS, S3, SNS, Spot, Route 53, Cloud Formation
Ematter materials science, FACE-IT, PDACS
Flexible, scalable, affordable genomics analysis for all biologists
Ravi Madduri, Paul Davé, Dina Sulakhe, Alex Rodriguez
Globus Genomics
[Diagram: data sources (sequencing centers, public data storage, local cluster/cloud, research labs) feed Globus Genomics.]
Globus provides a high-performance, fault-tolerant, secure file transfer service between all data endpoints.
[Diagram: data management (Galaxy data libraries) feeds data analysis: Fastq and reference genome → alignment (Picard) → variant calling (GATK).]
Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible
• Globus integrated within Galaxy
• Web-based UI
• Drag-and-drop workflow creation
• Easily modify workflows with new tools
Galaxy-based workflow management
[Diagram: data reaches Globus Genomics from these endpoints via FTP, SCP, HTTP, and other protocols.]
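The alignment-then-variant-calling pipeline described above is, structurally, a composition of tool steps. A sketch with stand-in functions (the real platform runs tools such as Picard and GATK inside Galaxy; these bodies are illustrative placeholders only):

```python
def align(fastq: str, ref_genome: str) -> str:
    """Stand-in for an alignment tool (e.g. Picard in the real pipeline)."""
    return f"bam({fastq}|{ref_genome})"

def call_variants(bam: str) -> str:
    """Stand-in for a variant caller (e.g. GATK in the real pipeline)."""
    return f"vcf({bam})"

def run_workflow(fastq: str, ref_genome: str) -> str:
    """Chain the steps, as a Galaxy workflow chains tools end to end."""
    return call_variants(align(fastq, ref_genome))

result = run_workflow("sample.fastq", "hg19.fa")
```

Because each step only consumes the previous step's output, the platform can schedule steps on whatever scalable compute is available, which is what the bullet "analytical tools are automatically run on the scalable compute resources" refers to.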
It’s proving popular
Dobyns Lab, Cox Lab, Volchenboum Lab, Olopade Lab, Nagarajan Lab
2.5 million core hours used in first six months of 2014
Costs are remarkably low. Pricing includes:
• Estimated compute
• Storage (one month)
• Globus Genomics platform usage
• Support
Data service as community resource: metagenomics.anl.gov, kbase.us
Linking simulation and experiment to study disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
[Diagram: a sample (La 60%, Sr 40%) yields experimental scattering; from the material composition, a simulated structure yields simulated scattering for comparison.]
Detect errors (secs–mins) → select experiments (mins–hours) → simulations driven by experiments (mins–days) → contribute to knowledge base.
Knowledge base: past experiments, simulations, literature, expert knowledge.
Knowledge-driven decision making; evolutionary optimization.
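The loop above closes when simulated scattering is compared against measurement and candidate structures are refined evolutionarily. A toy sketch of that accept-if-better loop; the mismatch function, the single structure parameter, and the mutation scheme are all illustrative stand-ins, not the Argonne workflow:

```python
import random

def mismatch(simulated: float, measured: float) -> float:
    """Toy stand-in for comparing simulated and experimental scattering."""
    return abs(simulated - measured)

def refine(measured: float, generations: int = 300, seed: int = 0) -> float:
    """Evolutionary refinement: mutate the best candidate, keep improvements."""
    rng = random.Random(seed)
    best = 0.0  # initial guess for the structure parameter
    for _ in range(generations):
        trial = best + rng.uniform(-0.5, 0.5)  # mutate current best
        if mismatch(trial, measured) < mismatch(best, measured):
            best = trial  # keep only candidates that fit the data better
    return best

# The refined parameter moves toward agreement with the measurement.
best = refine(measured=3.2)
```

The secs-to-days timescales on the slide are why this loop pays off: cheap error detection steers which expensive simulations and experiments run next.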
Integrate data movement, management, workflow, and computation to accelerate data-driven applications
New data, computational capabilities, and methods create opportunities and challenges.
Integrate statistics/machine learning to assess many models and calibrate them against 'all' relevant data.
New computer facilities enable on-demand computing and high-speed analysis of large quantities of data.
A lab-wide data architecture and facility
Immediate assessment of alignment quality in near-field high-energy diffraction microscopy
[Figures: results before and after.]
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
One APS data node: 125 destinations. Same node (1 Gbps link).
Accelerate discovery via automation and outsourcing
And at the same time:
– Enhance reproducibility
– Encourage entrepreneurial science
– Democratize access and contributions
– Enhance collaboration
The Discovery Cloud!
My work is supported by the U.S. Department of Energy.
Questions?