What Are Science Clouds?
DESCRIPTION
This is a talk I gave at Data Cloud 2013 on November 17, 2013, titled "What Is So Special About Science Clouds and Why Does It Matter?"

TRANSCRIPT
What Is So Special About Science Clouds and Why Does It Matter?
November 17, 2013
Robert L. Grossman University of Chicago Open Data Group
Open Cloud Consortium
Part 1. Clouds
In 2011, after several years and 15 drafts, NIST published a definition of cloud computing that is now the standard definition.
Essential Characteristics of a Cloud
1. Self-service
2. Scale
Self-Service
Scale
Cloud Deployment Models
• Public Clouds – Vendors offering cloud services, such as Amazon.
• Private Clouds – Run internally by a company or organization, such as the University of Chicago.
• Community Clouds – Run by a community of organizations (either formally or informally), such as the Open Cloud Consortium.
How do you measure compute capacity for science clouds?
TB? PB? EB? 100’s? 1,000’s? 10,000’s?
Think of a science cloud as large if you measure it in MW; for comparison, Facebook's Prineville Data Center is 30 MW.
Another way:
opencompute.org
What about automatic provisioning and infrastructure management?
This is not a cloud.
This is a cloud.
Commercial Cloud Service Provider (CSP): a 15 MW Data Center
• 100,000 servers, 1 PB DRAM, 100's of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer-facing portal
• Data center network, ~1 Tbps egress bandwidth
• 25 operators for a 15 MW commercial cloud
Rack / Container Test: the addition of racks or containers of cores and disks is automated and does not require changing the software stack, yet afterwards the capacity of the system has increased.
A requirement of a cloud computing infrastructure:
• At good cloud service providers, development and operations are integrated (DevOps).
• SRE/DevOps staff are considered key personnel.
• For many organizations, system administrators are just performing a service.
• It's considered a good practice to outsource the service to the lowest-cost provider.
Latency is Difficult
Essential Characteristics of a Cloud
1. Self-service
2. Scale
3. Infrastructure management and automation
4. Focus on DevOps
Part 2. Science Clouds
Some Examples of the Sizes of Datasets Produced by Instruments

Discipline        Duration   Size            # Devices
HEP - LHC         10 years   15 PB/year*     One
Astronomy - LSST  10 years   12 PB/year**    One
Genomics - NGS    2-4 years  0.5 TB/genome   1000's
*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
N.B. This is just the data produced by the instrument itself. The analysis of this data produces significantly more data.
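As a sanity check, the per-night and per-year figures above can be reconciled with some back-of-envelope arithmetic. This is a sketch; the figure of 300 observing nights per year for LSST is an assumption, not a number from the slide:

```python
# Back-of-envelope rates from the figures above.
# Assumption: roughly 300 observing nights per year for LSST.

lhc_pb_per_year = 15              # LHC at full capacity
lsst_tb_processed_per_night = 30  # LSST processed data per night

lhc_tb_per_day = lhc_pb_per_year * 1000 / 365
lsst_pb_processed_per_year = lsst_tb_processed_per_night * 300 / 1000

print(round(lhc_tb_per_day, 1))           # → 41.1 TB/day for the LHC
print(round(lsst_pb_processed_per_year))  # → 9 PB/year, the same order as
                                          #   the 12 PB/year in the table
```

Either way, a single instrument generates tens of terabytes per day, which is why the analysis infrastructure, not the instrument, becomes the bottleneck.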
Science Cloud Service Provider (Sci CSP)
Data scientist
Sci CSP services
What are some of the important differences between commercial CSPs and research-focused Sci CSPs?
Amazon Web Services (AWS)?
• Scale
• Simplicity of a credit card
• Wide variety of offerings

vs.

Community clouds, science clouds, etc.
• Lower cost (at medium & large scale)
• Some data is too important to be stored exclusively in a commercial cloud
• Computing over scientific data is a core competency
• Can support any required governance / security model

It is essential that community science clouds interoperate with public clouds.
Science Clouds vs. Commercial Clouds

                 Science Clouds                        Commercial Clouds
POV              Democratize access to data.           As long as you pay the bill;
                 Integrate data to make discoveries.   as long as the business model holds.
                 Long-term archive.
Data & Storage   In addition, data-intensive           Internet-style scale-out and
                 computing & HP storage                object-based storage
Flows            Large & small data flows              Lots of small web flows
Accounting       Essential                             Essential
Lock-in          Moving environments between           Lock-in is good
                 CSPs is essential
Interop          Critical, but difficult               Customers will drive to some degree
Essential Services for a Science CSP
• Support for data-intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
Datascope – Science Cloud Service Provider (Sci CSP)
Data scientist
Sci CSP services
Cloud Service Operations Center (CSOC)
Part 3. Open Science Data Cloud
The Long Tail of Data Science
A few large data science projects; many smaller data science projects.
• Very large projects (10's): dedicated infrastructure
• Medium to large projects (100's): shared community infrastructure; community-based science via Science as a Service
• Small projects from individual scientists (1000's): public infrastructure
Open Science Data Cloud
• Cores & disks (OpenStack, GlusterFS & Hadoop)
• Infrastructure automation & management (Yates)
• Compliance & security (OCM)
• Accounting & billing (Salesforce.com)
• Customer-facing portal (Tukey)
• Data center network, ~10-100 Gbps bandwidth
• 6 engineers to operate a 0.5 MW science cloud
Science Cloud Software & Services
• Virtual machine (VM) containing common applications & pipelines
• Tukey (OSDC portal & middleware, v0.2)
• Yates (infrastructure automation and management, v0.1)
• UDR / UDT for high performance data transport
• Interoperate with other clouds (upcoming) and proprietary systems (such as Globus Online)
The Open Science Data Cloud (OSDC) is a production 5 PB*, 7,500-core, wide-area 10G cloud.
www.opensciencedatacloud.org
*10 PB raw storage.
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing infrastructure to support medical and health care research: the Biomedical Commons Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
• Companies: Cisco, Yahoo!, Infoblox, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, LLNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA, LLNL, …
• International partners: AIST (Japan), U. Edinburgh, U. Amsterdam, …
Science Cloud
• Earth sciences
• Biological sciences
• Social sciences
• Digital humanities
• ACLs, groups, etc.

Biomedical Cloud
• Designed to hold Protected Health Information (PHI), e.g. genomic data, electronic medical records, etc. (HIPAA, FISMA)
What You Get with the OSDC
• Login with your university credentials via InCommon
• Launch virtual machines and virtual clusters, access large Hadoop clusters, etc.
• Access PB+ of open and protected data
• Manage files, collections of files, collections of collections
• Manage users, groups of users
• Manage accounts, sub-accounts
• Efficient transfer of large data (UDT, UDR)
Our Point of View
• We want to develop as little technology and software as possible; we want others to develop software and technology.
• We focus on providing researchers the ability to compute over large and very large datasets.
• We need open source solutions.
• We can interoperate with proprietary solutions.
• We are working to make interoperation with AWS seamless.
• Run lights out over multiple data centers connected with 10G (soon 100G) networks.
OSDC Cloud Services Operations Center (CSOC)
• The OSDC operates a Cloud Services Operations Center (CSOC).
• It is a CSOC focused on supporting science clouds for researchers.
• How quickly can we set up a rack?
• How efficiently can we operate a rack? (racks/admin)
• How few changes does our software stack and operations require when we add new racks?
OSDC Racks
2013 OSDC rack design:
• 1 PB / rack
• 1,150 cores / rack
Tukey
• Based in part on Horizon.
• We have factored out digital ID services, file sharing, and transport from the Bionimbus and Matsu projects.
Yates
• Automated installation of the OSDC software stack on a rack of computers.
• Based upon Chef.
• Version 0.1.
UDR
• UDT is a high performance network transport protocol.
• UDR = rsync + UDT.
• It is easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
• We are using it to distribute c. 1 PB from the OSDC.
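Because UDR keeps rsync's command-line interface, keeping a remote mirror synchronized looks like an ordinary rsync invocation with a udr prefix. This is only a sketch: the host and paths below are hypothetical, not real OSDC endpoints, and the script assembles and prints the command rather than executing it.

```shell
#!/bin/sh
# Sketch of a UDR sync (hypothetical host and paths). udr prefixes a normal
# rsync command: rsync's delta algorithm still decides what to transfer, but
# the bytes move over UDT rather than TCP, which sustains throughput on
# high-latency wide-area links.
SRC="/glusterfs/osdc/tcga/"
DST="[email protected]:/data/tcga/"
CMD="udr rsync -avP $SRC $DST"

# Print rather than execute, since udr may not be installed locally.
echo "$CMD"
```

A plain `rsync` with the same arguments would work over TCP; the point of the udr prefix is that nothing else about the administrator's workflow has to change.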
Bionimbus Protected Data Cloud
Analyzing Data from The Cancer Genome Atlas (TCGA)

Current Practice
1. Apply to dbGaP for access to data.
2. Hire staff, set up and operate a secure, compliant computing environment to manage 10-100+ TB of data.
3. Get the environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CGHub (takes days to weeks).
6. Begin analysis.

With the Protected Data Cloud (PDC)
1. Apply to dbGaP for access to data.
2. Use your existing NIH grant eRA credentials to log in to the PDC, select the data that you want to analyze and the pipelines that you want to use.
3. Begin analysis.
OCC Project Matsu: Clouds to Support Earth Science
matsu.opensciencedatacloud.org
Example: Open Cloud Consortium's Biomedical Commons Cloud (BCC)
A biomedical community cloud comprising a cloud for public data, a cloud for controlled genomic data, and a cloud for EMR, PHI and related data, shared by participants such as Medical Research Centers A, B and C, Hospital D, and Company E.
Part 4. Cloud Condos
Cyber Condo Model
• Research institutions today have access to high performance networks: 10G & 100G.
• They couldn't afford access to these networks from commercial providers.
• Over a decade ago, they got together to buy and light fiber.
• This changed how we do scientific research.
Cloud Condos
• The Open Cloud Consortium's Burnham Facility (in planning) follows a Cloud Condo model.
• This infrastructure provides a sustainable home for large commons of research data (and an infrastructure to compute over it).
• Please join us.
Some Data Commons Guidelines for the Next Five Years
• There is a societal benefit when research data is available in data commons operated by a not-for-profit (vs. sold exclusively as data products by commercial entities or only offered for download by the USG).
• Large data commons providers should peer.
• Data commons providers should develop standards for interoperating.
• Standards should not be developed ahead of open source reference implementations.
• We need a period of experimentation as we develop the best technology and practices.
• The details are hard (consent, publication, IDs, open vs. controlled access, sustainability, etc.).
Working with the OSDC: CSPs
• If you have a cloud, please interoperate it with the OSDC.
• Work with us to design and prototype standards so that science clouds and science data commons can interoperate:
– Data synchronization between two clouds
– APIs to access data
– RESTful queries
– Scattering queries, gathering the results
– Coordinated analysis
OSDC Software Ecosystem
A diagram of the OSDC software stack (Bionimbus, OpenStack, Hadoop, Tukey, R, UDT, GlusterFS) connecting to AWS, Globus Online, CSP A, Medical Research Center B, Hospital D, University E, and Startups F and G.
Working with the OSDC: Researchers
• Apply for an account and make a discovery
• Add data to the OSDC
• Add your software to the OSDC
• Suggest someone else's data to add
• Suggest someone else's software to add
Data Commons
A diagram of a data commons holding datasets such as EO-1, TCGA, 1000 Genomes, urban sciences data, EMR, census data, social sciences data, EarthCube data, and Bookworm, shared by CSP A, Medical Research Center B, Hospital D, University E, and Startups F and G.
Questions?
Thank You!
For more information:
• @bobgrossman
• You can find more information on my blog: rgrossman.com
• You can find more of my talks on slideshare.net/rgrossman
Center for Research Informatics
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The Bionimbus Protected Data Cloud is supported in part by NIH/NCI through NIH/SAIC Contract 13XS021 / HHSN261200800001E.
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• The OSDC is supported by a 5-year (2010-2016) PIRE award (OISE-1129076) to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher.
• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, NIH or other funders of this research.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at [email protected].
Please join us!
www.opensciencedatacloud.org
www.opencloudconsortium.org