open science data cloud (june 21, 2010)
DESCRIPTION
This is a talk that I gave at the ScienceCloud 2010 Workshop in Chicago on June 21, 2010.
TRANSCRIPT
Open Science Data Cloud
Robert Grossman
Open Cloud Consortium
Today is a good day to get involved with the Open Science Data Cloud.
Astronomical data
Biological data (Bionimbus)
Networking data
Image processing for disaster relief
Part 1: Basic Facts About the OSDC
Who are we?
• 501(c)(3) not-for-profit corporation
• Supports the development of standards, interoperability frameworks, and reference implementations.
• Manages testbeds: Open Cloud Testbed and Intercloud Testbed.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
• Develops benchmarks.
www.opencloudconsortium.org
OCC Members
• Companies: Aerospace, Booz Allen Hamilton, Cisco, InfoBlox, Open Data Group, Raytheon, Yahoo
• Universities: CalIT2, Johns Hopkins, MIT Lincoln Lab, Northwestern Univ., University of Illinois at Chicago, University of Chicago
• Government agencies: NASA
• Open Source Projects: Sector Project
Operates Clouds
• 500 nodes
• 3000 cores
• 1.5+ PB
• Four data centers
• 10 Gbps
• Target to refresh 1/3 each year.
• Open Cloud Testbed
• Open Science Data Cloud
• Intercloud Testbed
• Cloud-based Disaster Relief Services
Open Cloud Consortium Perspective
• Vendor neutral
• Open, interoperable architecture
• Experiment at scale
• Operate infrastructure at the scale of a small data center
• Long term point of view (think like a library, not a cloud service provider)
• Think public, private & hybrid clouds
What Are the Projects?
Project 1: Bionimbus
www.cistrack.org
Project 2: Bulk Download of the SDSS
Source    Destination   LLPR*   Link      Bandwidth
Chicago   Greenbelt     0.98    1 Gb/s    615 Mb/s
Chicago   Austin        0.83    10 Gb/s   8000 Mb/s
* LLPR = long distance to local performance ratio
• Sector LLPR varies between 0.61 and 0.98
Recent Sloan Digital Sky Survey (SDSS) data release is 14 TB in size.
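To put the bandwidth figures in context, here is a rough sketch of how long a 14 TB bulk download would take at the two measured rates, and how much of each link the transfer fills. It assumes decimal units (1 TB = 10^12 bytes, 1 Mb/s = 10^6 bits/s) and a sustained rate for the whole transfer; the function names are illustrative and not part of Sector.

```python
def transfer_hours(size_tb: float, rate_mbps: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained rate_mbps."""
    bits = size_tb * 1e12 * 8          # terabytes -> bits (decimal units)
    seconds = bits / (rate_mbps * 1e6)  # Mb/s -> bits per second
    return seconds / 3600


def link_utilization(rate_mbps: float, link_gbps: float) -> float:
    """Fraction of the link capacity used by the measured rate."""
    return rate_mbps / (link_gbps * 1000)


if __name__ == "__main__":
    for dest, link_gbps, rate_mbps in [("Greenbelt", 1, 615), ("Austin", 10, 8000)]:
        print(
            f"Chicago -> {dest}: "
            f"{transfer_hours(14, rate_mbps):.1f} h for 14 TB, "
            f"{link_utilization(rate_mbps, link_gbps):.0%} of the link"
        )
```

Under these assumptions the 1 Gb/s path needs roughly two days for the full release, while the 10 Gb/s path needs about four hours.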
Project 3: Image Processing in the Cloud
Step 1: Input to Mapper
Mapper Input Key: Bounding Box (minx = -135.0 miny = 45.0 maxx = -112.5 maxy = 67.5)
Mapper Input Value: [image on the original slide]
Step 2: Processing in Mapper
Mapper resizes and/or cuts up the original image into pieces to output Bounding Boxes.
Step 3: Mapper Output
Mapper Output Key: Bounding Box
Mapper Output Value: [image tile on the original slide]
+ Timestamp
[The slide repeats the output key/value pair for each of the eight tiles; the input pair and each output pair carry a "+ Timestamp" label.]
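The three mapper steps can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual code: the image is modeled as a list of pixel rows with row 0 at the northern edge, the split is a simple grid, the timestamp is carried in the output key, and the names `tile_mapper` and `splits` are hypothetical.

```python
from typing import Iterator, List, Tuple

BBox = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)
Image = List[List[int]]                   # pixel rows, row 0 = northern edge


def tile_mapper(
    bbox: BBox, image: Image, timestamp: str, splits: int = 2
) -> Iterator[Tuple[Tuple[BBox, str], Image]]:
    """Cut one image into splits x splits tiles, keyed by bounding box."""
    minx, miny, maxx, maxy = bbox
    rows, cols = len(image), len(image[0])
    dx = (maxx - minx) / splits  # tile width in degrees
    dy = (maxy - miny) / splits  # tile height in degrees
    for i in range(splits):      # tile row, counted from the north
        for j in range(splits):  # tile column, counted from the west
            r0, r1 = i * rows // splits, (i + 1) * rows // splits
            c0, c1 = j * cols // splits, (j + 1) * cols // splits
            tile = [row[c0:c1] for row in image[r0:r1]]
            tile_bbox = (
                minx + j * dx,
                maxy - (i + 1) * dy,
                minx + (j + 1) * dx,
                maxy - i * dy,
            )
            # Assumption: the timestamp travels with the bounding box in the key.
            yield (tile_bbox, timestamp), tile


if __name__ == "__main__":
    source = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
    for key, tile in tile_mapper((-135.0, 45.0, -112.5, 67.5), source, "2010-06-21"):
        print(key, tile)
```

Keying each tile by its bounding box lets a later reduce step group all tiles that cover the same area, for example to compare images taken at different times.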
Project 4: Anomalies in Network Data
What is the OSDC?
Hosted, managed, distributed facility to:
• Manage & archive your medium and large datasets
• Provide computational resources to analyze it
• Provide networking to share it with your colleagues and the public.
Long Term Goal
Build a (small) data center for science.
And preserve your data the same way that libraries preserve books &
museums preserve art.
Why do it?
Work on Stuff That Matters
Tim O’Reilly, Jan 11, 2009
1. Work on something that matters to you more than money [and, presumably, papers].
2. Create more value than you capture.
3. Take the long view.
What is similar?
Internet Archive
Wayback Machine
Part 2: Why Another Cloud Project?
[Figure: variety of analysis (Low, Med, Wide) plotted against data size (Small, Medium to Large, Very Large). Regions: No infrastructure, General infrastructure, Dedicated infrastructure. Labels: Scientist with laptop; Open Science Data Cloud; High energy physics, astronomy.]
[Figure: cycles and persistent data, each on a Small, Med, Large scale. Labels: Single workstations; Small to medium clusters; HPC; Large & spec. clusters; databases; data clouds.]
Who do you most trust to manage your data for 100 years?
Companies may not be here tomorrow.
Think of a not for profit with that mission.
Government agencies have a role, but not always easy to use.
Part 3: Technical Approach
Condominium Clouds
• In a condominium cloud, you buy your own rack or bunch of racks.
• The racks are managed and operated by the condominium association, in this case the OCC.
• If your rack is 120 TB, you get the rights to c. 40 TB of storage in the cloud. The rest is a shared resource.
• The Open Cloud Testbed is a condo cloud managed by the OCC.
Raywulf rack
Condo Clouds
Open source software stack: Hadoop, Sector, Eucalyptus, Nova, NoSQL DBs,
Data Migration
• Challenge: data migration.
• Solution: use Hadoop style replication.
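One way to read the replication idea: if every block is stored on several racks, retiring a rack does not require a bulk copy of its contents; the system only has to restore the replica count from the surviving copies. The sketch below illustrates that under assumptions that are not in the talk: one replica per rack, a replication factor of 3, and a least-loaded placement rule. The name `retire_rack` is hypothetical and this is not Hadoop's or Sector's actual placement logic.

```python
from typing import Dict, List, Set


def retire_rack(
    placement: Dict[str, Set[str]],
    retiring: str,
    racks: List[str],
    replication: int = 3,
) -> Dict[str, Set[str]]:
    """Return a new block -> racks map with `retiring` removed and replicas restored."""
    remaining = [r for r in racks if r != retiring]
    # Count the replicas each surviving rack already holds.
    load = {r: 0 for r in remaining}
    for holders in placement.values():
        for rack in holders:
            if rack in load:
                load[rack] += 1

    target_count = min(replication, len(remaining))
    new_placement: Dict[str, Set[str]] = {}
    for block, holders in placement.items():
        kept = set(holders) - {retiring}
        # Re-replicate from a surviving copy onto the least-loaded rack.
        while len(kept) < target_count:
            candidates = [r for r in remaining if r not in kept]
            target = min(candidates, key=lambda r: (load[r], r))
            kept.add(target)
            load[target] += 1
        new_placement[block] = kept
    return new_placement


if __name__ == "__main__":
    before = {
        "b1": {"r1", "r2", "r3"},
        "b2": {"r1", "r3", "r4"},
        "b3": {"r2", "r3", "r4"},
    }
    after = retire_rack(before, "r1", ["r1", "r2", "r3", "r4", "r5"])
    for block in sorted(after):
        print(block, sorted(after[block]))
```

Only the blocks that had a replica on the retired rack are touched, so the migration cost is proportional to what that rack held rather than to the whole archive.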
Operating Model
Year   New Racks   Total Racks   New Cap   Total Cap   Net New Cap
1      10          10            1.28      1.28        0
2      10          20            1.92      3.20        1.92
3      10          30            2.88      6.08        2.88
4      10          30            4.32      9.12        3.04
5      10          30            6.48      13.68       4.56
6      10          30            9.72      20.52       6.85
Operating model requires constant cap ex investment each year, for example 10 racks or $1M. (Cap in PB)
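The figures in the operating model are consistent with a simple rule, sketched below. Two parameters are inferred from the numbers rather than stated in the talk: the capacity bought each year grows by a factor of 1.5, and racks are retired after three years. The function name and parameters are illustrative.

```python
from typing import Dict, List


def operating_model(
    years: int = 6,
    racks_per_year: int = 10,
    first_year_cap_pb: float = 1.28,
    growth: float = 1.5,       # assumption: inferred from 1.28 -> 1.92 -> 2.88 ...
    lifetime_years: int = 3,   # assumption: inferred from total racks capping at 30
) -> List[Dict[str, float]]:
    """Capacity (PB) per year when a fixed number of racks is bought annually."""
    rows: List[Dict[str, float]] = []
    new_caps: List[float] = []
    prev_total = None
    for year in range(1, years + 1):
        new_cap = first_year_cap_pb * growth ** (year - 1)
        new_caps.append(new_cap)
        live = new_caps[-lifetime_years:]  # racks still in service
        total_cap = sum(live)
        # The slide lists 0 for year 1, so the first year is treated as the baseline.
        net_new = 0.0 if prev_total is None else total_cap - prev_total
        rows.append({
            "year": year,
            "new_racks": racks_per_year,
            "total_racks": racks_per_year * len(live),
            "new_cap": round(new_cap, 2),
            "total_cap": round(total_cap, 2),
            "net_new_cap": round(net_new, 2),
        })
        prev_total = total_cap
    return rows


if __name__ == "__main__":
    for row in operating_model():
        print(row)
```

With these assumptions the sketch reproduces the New Cap and Total Cap columns; for year 6 it gives a net new capacity of 6.84 PB where the slide shows 6.85.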
Retiring Equipment
• Challenge: Adding & removing racks.
• Solution: Support virtual networks, virtual data centers, etc.
We Have Several Ways of Defining Virtual Networks….
VN-Link
VLAN
VPNs
BGP
MPLS
OpenFlow
Open vSwitch
vSwitch
CloudSwitch
But No Vendor Neutral VN Standard That
• Scales to 100,000+ VMs
• Is supported by multiple vendors
• Spans multiple physical switches
• Supports VN mobility
• Provides strong isolation of VNs
• Is easy for VMs to join and leave VNs
• Includes management interfaces ….
OCC has a working group working on VN standards
Bridging the Gaps… A Small Step
Infrastructure as a Service
– Virtual Data Centers (VDC)
– Virtual Networks (VN)
– Virtual Machines (VM)
– Physical Resources
Platform as a Service
– Cloud Compute Services
– Data as a Service
Open Virtualization Format (OVF)
Open Cloud Computing Interface (OCCI)
SNIA Cloud Data Management Interface (CDMI)
Large Data Cloud Interoperability Framework
Metadata service linking IaaS and DaaS
Metadata service naming and linking entities in the IaaS layers
One Day We Hope to Peer
Open Science Data Cloud
More Challenges: Finding a Business Model That Works Long Term
• Challenge: raising a constant amount of funding each year.
• To date: talking to foundations.
Thank You
• For more information:
– www.opencloudconsortium.org
– rgrossman.com (for research papers, etc.)