CERN user story

DESCRIPTION

CERN, the European Organization for Nuclear Research, is one of the world's largest centres for scientific research. Its business is fundamental physics: finding out what the universe is made of and how it works. At CERN, accelerators such as the 27 km Large Hadron Collider are used to study the basic constituents of matter. This talk reviews the challenges of recording and analysing the 25 Petabytes/year produced by the experiments, and the investigations into how OpenStack could help deliver a more agile computing infrastructure.

TRANSCRIPT
Towards An Agile Infrastructure at CERN
OpenStack Conference, 6th October 2011
What is CERN?
• Conseil Européen pour la Recherche Nucléaire – aka European Laboratory for Particle Physics
• Between Geneva and the Jura mountains, straddling the Swiss-French border
• Founded in 1954 by an international treaty
• Our business is fundamental physics and how our universe works
Answering fundamental questions…
• How do we explain that particles have mass?
We have theories but need experimental evidence
• What is 96% of the universe made of? We can only see 4% of its estimated mass!
• Why isn’t there anti-matter in the universe?
Nature should be symmetric…
• What was the state of matter just after the “Big Bang”?
Travelling back to the earliest instants of the universe would help…
Community collaboration on an international scale
The Large Hadron Collider
LHC construction
The Large Hadron Collider (LHC) tunnel
Accumulating events in 2009-2011
Heavy Ion Collisions
Tier-0 (CERN):
• Data recording
• Initial data reconstruction
• Data distribution

Tier-1 (11 centres):
• Permanent storage
• Re-processing
• Analysis

Tier-2 (~200 centres):
• Simulation
• End-user analysis
• Data is recorded at CERN and the Tier-1s and analysed in the Worldwide LHC Computing Grid
• In a normal day, the grid provides 100,000 CPU-days executing 1 million jobs
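A quick back-of-the-envelope check of those figures, as a minimal Python sketch (the arithmetic is illustrative, not from the talk):

# Average job length implied by 100,000 CPU-days spread across 1 million jobs.
cpu_days = 100000
jobs = 1000000
print('average job: %.1f CPU-hours' % (cpu_days * 24.0 / jobs))  # ~2.4 CPU-hours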
• Data Centre by Numbers
– Hardware installation & retirement: ~7,000 hardware movements/year; ~1,800 disk failures/year
[Pie chart: CPU mix — Xeon L5520 33%, Xeon E5410 16%, Xeon E5345 14%, Xeon 5160 10%, Xeon L5420 8%, Xeon E5335 7%, Xeon E5405 6%, Xeon 3GHz 4%, Xeon 5150 2%]
[Pie chart: disk vendor mix — Western Digital 59%, Hitachi 23%, Seagate 15%, Fujitsu 3%, HP, Maxtor and Other each under 1%]
High Speed Routers (640 Mbps → 2.4 Tbps)    24
Ethernet Switches                           350
10 Gbps ports                               2,000
Switching Capacity                          4.8 Tbps
1 Gbps ports                                16,939
10 Gbps ports                               558
Racks                                       828
Servers                                     11,728
Processors                                  15,694
Cores                                       64,238
HEPSpec06                                   482,507
Disks                                       64,109
Raw disk capacity (TiB)                     63,289
Memory modules                              56,014
Memory capacity (TiB)                       158
RAID controllers                            3,749
Tape Drives                                 160
Tape Cartridges                             45,000
Tape slots                                  56,000
Tape Capacity (TiB)                         34,000
IT Power Consumption                        2,456 kW
Total Power Consumption                     3,890 kW
Our Environment
• Our users
– Experiments build on top of our infrastructure and services to deliver application frameworks for the 10,000 physicists
• Our custom user applications split into
– Raw data processing from the accelerator and export to the Worldwide LHC Computing Grid
– Analysis of physics data
– Simulation
• We also have standard large-organisation applications
– Payroll, Web, Mail, HR, …
Our Infrastructure
• Hardware is generally based on commodity, white-box servers
– Open tendering process based on SpecInt/CHF, CHF/Watt and GB/CHF (see the sketch after this list)
– Compute nodes typically dual processor, 2GB per core
– Bulk storage on 24x2TB disk storage-in-a-box with a RAID card
• Vast majority of servers run Scientific Linux, developed by Fermilab and CERN and based on Red Hat Enterprise Linux
– Focus is on stability in view of the number of centres on the WLCG
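A minimal Python sketch of the three tender figures of merit mentioned above; the candidate server's numbers are invented for illustration:

# Compute the procurement metrics for a hypothetical candidate server.
def figures_of_merit(specint, price_chf, power_w, disk_gb):
    return {
        'SpecInt/CHF': specint / price_chf,   # performance per franc
        'CHF/Watt': price_chf / power_w,      # cost per watt drawn
        'GB/CHF': disk_gb / price_chf,        # storage per franc
    }

print(figures_of_merit(specint=350.0, price_chf=2500.0, power_w=300.0, disk_gb=4000.0))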
Our Challenges – Compute
• Optimise CPU resources
– Maximise production lifetime of servers
– Schedule interventions such as hardware repairs and OS patching
– Match memory and core requirements per job (see the placement sketch after this list)
– Reduce CPUs waiting idle for I/O
• Conflicting software requirements
– Different experiments want different libraries
– Maintenance of old programs needs old OSes
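A minimal Python sketch of that matching problem, using naive first-fit placement of jobs onto hosts by core and memory requirements; all figures are invented for illustration:

# First-fit placement: each job goes to the first host with enough
# free cores and memory; otherwise it stays queued.
class Host(object):
    def __init__(self, name, cores, mem_gb):
        self.name, self.cores, self.mem_gb = name, cores, mem_gb

def place(job_cores, job_mem_gb, hosts):
    for h in hosts:
        if h.cores >= job_cores and h.mem_gb >= job_mem_gb:
            h.cores -= job_cores
            h.mem_gb -= job_mem_gb
            return h
    return None

hosts = [Host('batch-%03d' % i, cores=8, mem_gb=16) for i in range(4)]
for cores, mem in [(1, 2), (4, 8), (2, 2)]:
    h = place(cores, mem, hosts)
    print('job(%d cores, %d GB) -> %s' % (cores, mem, h.name if h else 'queued'))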
Our Challenges – variable demand
Our Challenges – Data storage
• 25PB/year to record
• >20 years retention
• 6GB/s average
• 25GB/s peaks
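A quick Python sanity check of these rates (illustrative arithmetic only). Recording 25PB/year needs under 1GB/s sustained; the 6GB/s average and 25GB/s peaks match the Tier-0 "average out" bandwidth figures given in the backup slides:

# Sustained recording rate and long-term archive growth implied above.
seconds_per_year = 365 * 24 * 3600                 # ~3.15e7 s
recorded_pb_per_year = 25.0
print('sustained recording: %.2f GB/s' % (recorded_pb_per_year * 1e6 / seconds_per_year))  # ~0.79
print('archive after 20 years: %d PB' % (recorded_pb_per_year * 20))                       # 500 PB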
Our Challenges – ‘minor’ other issues
• Power
– Living within a fixed envelope of 2.9MW available for the computer centre
• Cooling
– Only 6kW/m2 without using water-cooled racks (and no spare power)
• Space
– New capacity replaces old servers in the same racks (as density is low)
• Staff
– CERN staff headcount is fixed
• Budget
– CERN IT budget reflects member states’ contributions
Server Consolidation
[Chart: number of virtual machines, April 2010 to April 2011, growing towards ~1,800, split across Windows, Other Linux and Scientific Linux]
Batch Virtualisation
Infrastructure as a Service Studies
• CERN has been using virtualisation on a small scale since 2007
– Server consolidation with Microsoft System Center Virtual Machine Manager and Hyper-V
– Virtual batch compute farm using OpenNebula and Platform ISF on KVM
• We are investigating moving to a cloud service provider model for infrastructure at CERN
– Virtualisation consolidation across multiple sites
– Bulk storage / Dropbox / …
– Self-service
• Aims
– Improve efficiency
– Reduce operations effort
– Ease remote data centre support
– Enable cloud APIs
OpenStack Infrastructure as a Service Studies
• Current Focus
– Converge the current virtualisation services into a single IaaS
– Test Swift for bulk storage, compatibility with S3 tools and resilience on commodity hardware (a sketch of such a check follows below)
– Integrate OpenStack with CERN’s infrastructure such as LDAP and network databases
• Status
– Swift testbed (480TB) is being migrated to Diablo and expanded to 1PB with 10Ge networking
– 48 hypervisors running RHEL/KVM/Nova under test
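A minimal Python sketch of the kind of S3-compatibility check described above, using the boto library against Swift's S3 middleware; the endpoint, port and credentials are hypothetical placeholders:

# Round-trip a small object through Swift's S3-compatible API.
import boto
from boto.s3.connection import OrdinaryCallingFormat

conn = boto.connect_s3(
    aws_access_key_id='tenant:user',           # placeholder Swift credentials
    aws_secret_access_key='secret',
    host='swift.example.cern.ch',              # placeholder endpoint
    port=8080,
    is_secure=False,
    calling_format=OrdinaryCallingFormat())

bucket = conn.create_bucket('s3-compat-test')
key = bucket.new_key('probe.txt')
key.set_contents_from_string('hello from the S3 API')
assert key.get_contents_as_string() == 'hello from the S3 API'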
Areas where we struggled
• Networking configuration with Cactus
– Trying out the new Network-as-a-Service Quantum functions in Diablo
• Red Hat distribution base
– RPMs not yet in EPEL, but the Grid Dynamics RPMs helped
– Puppet manifests needed adapting, with multiple sources from OpenStack and Puppet Labs
• Currently only testing with KVM
– We’ll try Hyper-V once Diablo/Hyper-V support is fully in place
OpenStack investigations: next steps
• Homogeneous servers for both storage and batch?
[Pie chart: server usage by service — Batch 40%, Mass Storage 25%, Other 18%, Win Services 6%, VO Services 5%, Databases 4%, Grid Services 2%]
OpenStack investigations: next steps
• Scale testing with CERN’s toolchains to install and schedule 16,000 VMs (a boot-storm sketch follows below)
Previous test results performed with OpenNebula
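A minimal Python sketch of such a scale test, booting many small VMs through the Nova API with the Diablo-era python-novaclient; credentials, auth endpoint and image name are hypothetical placeholders:

# Boot a large number of identical small VMs against the Nova API.
from novaclient.v1_1 import client

nova = client.Client('user', 'password', 'project',
                     'http://keystone.example.cern.ch:5000/v2.0/')

image = nova.images.find(name='slc5-base')   # placeholder image name
flavor = nova.flavors.find(name='m1.tiny')

# A real test would throttle and batch these requests; the point here is
# simply to drive the scheduler and message queue under increasing load.
for i in range(16000):
    nova.servers.create('scale-test-%05d' % i, image, flavor)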
OpenStack investigations: next steps
• Investigate the commodity solutions for external volume storage
– Ceph
– Sheepdog
– Gluster
– ...
• Focus is on (see the live-migration sketch after this list)
– Reducing the performance impact of I/O with virtualisation
– Enabling widespread use of live migration
– Understanding the future storage classes and service definitions
– Supporting remote data centre use cases
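For the live-migration item, a minimal Python sketch with the libvirt bindings on KVM, the mechanism that shared volume storage would make routine; hostnames and the domain name are hypothetical placeholders:

# Live-migrate a running KVM guest between two hypervisors.
import libvirt

src = libvirt.open('qemu+ssh://source.example.cern.ch/system')
dst = libvirt.open('qemu+ssh://dest.example.cern.ch/system')

dom = src.lookupByName('test-vm-00001')

# VIR_MIGRATE_LIVE copies memory while the guest keeps running;
# with shared storage, no disk data has to move at all.
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)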
Areas of interest looking forward
• Nova and Glance
– Scheduling VMs near to the data they need
– Managing the queue of requests when there is “no credit card” and no free resources
– Orchestration of bare metal servers within OpenStack
• Swift
– High performance transfers through the proxies without encryption
– Long-term archiving to low-power disks or tape
• General
– Filling in the missing functions such as billing, availability and performance monitoring
Final Thoughts
• A small project to share documents at CERN in the ‘90s created the massive phenomenon that is today’s world wide web
• Open source
• Transparent governance
• Basis for innovation and competition
• Standard APIs where there is consensus
• Stable, production-ready solutions
• Vibrant eco-system
• There is a strong need for a similar solution in the Infrastructure-as-a-Service space
• The next year is going to be exciting for OpenStack as the project matures and faces the challenges of production deployments
References
CERN                                      http://public.web.cern.ch/public/
Scientific Linux                          http://www.scientificlinux.org/
Silent data corruption study              http://cern.ch/go/G7vL
HEPiX Working Group on virtualization     http://w3.hepix.org/virtualization/
Worldwide LHC Computing Grid              http://lcg.web.cern.ch/lcg/ and http://rtm.hep.ph.ic.ac.uk/
Jobs                                      http://cern.ch/jobs
Backup Slides
CERN’s tools
• The world’s most powerful accelerator: LHC
– A 27 km long tunnel filled with high-tech instruments
– Equipped with thousands of superconducting magnets
– Accelerates particles to energies never before obtained
– Produces particle collisions creating microscopic “big bangs”
• Very large sophisticated detectors
– Four experiments, each the size of a cathedral
– Hundred million measurement channels each
– Data acquisition systems treating Petabytes per second
• Top level computing to distribute and analyse the data
– A Computing Grid linking ~200 computer centres around the globe
– Sufficient computing power and storage to handle 25 Petabytes per year, making them available to thousands of physicists for analysis
Other non-LHC experiments at CERN
Superconducting magnets – October 2008
A faulty connection between two superconducting magnets led to the release of a large amount of helium into the LHC tunnel and forced the machine to shut down for repairs
CERN Computer Centre
Our Challenges – keeping up to date
CPU capacity at CERN during ‘80s and ‘90s
[Chart: CPU capacity at CERN by week (yyyyww), 1987 to 2000, rising from near zero to ~50,000 units; “LEP starts” annotated; series for installed capacity and capacity used]
Testbed Configuration for Nova / Swift
• 24 servers
• Single server configuration for both compute and storage
– Supermicro-based systems
– Intel Xeon CPU L5520 @ 2.27GHz
– 12GB memory
– 10Ge connectivity
– IPMI
Data Rates at Tier-0
Typical Tier-0 bandwidth:
Average in: 2 GB/s with peaks at 11.5 GB/s
Average out: 6 GB/s with peaks at 25 GB/s
Web Site Activity
[Chart: CERN websites access statistics — number of hits per period, November 2007 to February 2011, on a scale up to ~3 billion]
LHC first beam day (9 September 2008): 100 million hits to main CERN websites, 300 million hits in total
LHC first collisions (25 March 2010): 50 million hits to main CERN websites