TRANSCRIPT
Controls presentation to the External Panel on Risk
Pierre Charrue – BE/CO
LHC Risk Review, 6 March 2009
Outline
• Preamble
• The LHC Controls Infrastructure
• External Dependencies
• Redundancies
• Control Room Power Loss
• Conclusion
Preamble
• The Controls Infrastructure is designed to control the beams in the accelerators
• It is not designed to protect the machine nor to ensure personnel safety (see Machine Protection or Access Infrastructures)
LHC Controls Infrastructure
• The 3-tier architecture covers the hardware infrastructure and the software layers:
  – Resource Tier: VME crates, PC gateways and PLCs dealing with high-performance acquisition and real-time processing, plus the database where all the settings and configuration of all LHC devices exist
  – Server Tier: application servers, data servers, file servers, central timing
  – Client Tier: interactive consoles, fixed displays, GUI applications
• Communication to the equipment goes through the Controls Middleware (CMW)
[Diagram: 3-tier architecture – Client tier (Applications Layer), Server tier (Business Layer) and Resource tier (Hardware), linked by CMW, with the controls databases (DB)]
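To make the tier separation concrete, below is a minimal, purely illustrative Python sketch of a client-tier request travelling through the server tier down to the resource tier. The class names, the device name and the call flow are hypothetical; this is not the actual CMW API.

```python
# Illustrative sketch of the 3-tier flow (hypothetical names, not the real CMW API).

class ResourceTier:
    """Front-end layer: VME crates, PC gateways, PLCs close to the hardware."""
    def __init__(self):
        # Hypothetical device/property values held by a front end.
        self._devices = {("RQ4.LR1", "CURRENT"): 5420.0}

    def acquire(self, device: str, prop: str) -> float:
        # Real front ends perform the hardware acquisition here.
        return self._devices[(device, prop)]


class ServerTier:
    """Middle layer: application and data servers routing requests."""
    def __init__(self, resource: ResourceTier):
        self._resource = resource

    def get(self, device: str, prop: str) -> float:
        # In the real system this call travels over the Controls Middleware (CMW).
        return self._resource.acquire(device, prop)


class ClientTier:
    """Top layer: interactive consoles, fixed displays, GUI applications."""
    def __init__(self, server: ServerTier):
        self._server = server

    def display(self, device: str, prop: str) -> None:
        value = self._server.get(device, prop)
        print(f"{device}/{prop} = {value}")


if __name__ == "__main__":
    console = ClientTier(ServerTier(ResourceTier()))
    console.display("RQ4.LR1", "CURRENT")  # hypothetical device name
```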
The CCC and CCR
• Since January 2006, accelerator operation has been carried out from the CERN Control Centre (CCC) on the Prévessin site
• The CCC hosts around 100 consoles and around 300 screens
• The CCR is the rack room next to the CCC; it hosts more than 400 servers
External Dependencies
HARDWARE
• Electricity
• Cooling and Ventilation
• Network
• Oracle servers in IT

SOFTWARE
• Oracle
• IT Authentication
• Technical Network / General Purpose Network
External Hardware Dependencies: Electricity
• All Linux servers are HP ProLiants with dual power supplies
• They are cabled to two separate 230 V UPS sources
• High power consumption will drain the UPS batteries rapidly: 1 hour maximum autonomy
• Each ProLiant consumes an average of 250 W
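As a rough cross-check of the one-hour autonomy figure, the sketch below combines the numbers quoted above (around 400 servers at an average of 250 W each). The usable UPS battery energy is an assumed value, used only to illustrate the arithmetic.

```python
# Back-of-the-envelope UPS autonomy estimate (assumed battery capacity, for illustration only).
servers = 400                 # order of magnitude of servers hosted in the CCR
avg_power_w = 250             # average consumption per ProLiant quoted above
load_kw = servers * avg_power_w / 1000.0   # ~100 kW total load

ups_energy_kwh = 100.0        # ASSUMPTION: usable UPS battery energy, not a quoted figure

autonomy_min = ups_energy_kwh / load_kw * 60
print(f"Total load: {load_kw:.0f} kW")
print(f"Estimated autonomy: {autonomy_min:.0f} minutes")
```

With these assumptions the estimate comes out at about 60 minutes, consistent with the one-hour maximum autonomy stated above.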
External Hardware Dependencies: Cooling & Ventilation
• Feb 2009: upgrade of the air flow and cooling circuits in the CCR; the CCR vulnerability to cooling problems has been resolved
• In the event of loss of refrigeration, the CCR will overheat very quickly
• Monitoring with temperature sensors and alarms is in place to ensure rapid intervention by TI operators (see the sketch below)
• The CCR cooling state is monitored by the Technical Infrastructure Monitoring (TIM), with views that can show trends over the last 2 weeks
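The following is a minimal sketch of the kind of threshold check such temperature monitoring performs. The sensor interface, the threshold value and the alarm hook are hypothetical and do not reflect the actual TIM implementation.

```python
# Hypothetical temperature-threshold check, illustrating the monitoring principle only.
from typing import Callable

ALARM_THRESHOLD_C = 30.0   # assumed alarm threshold, not a TIM figure

def check_ccr_temperature(read_sensor: Callable[[], float],
                          raise_alarm: Callable[[str], None]) -> None:
    """Read one sensor and raise an alarm if the room is overheating."""
    temperature = read_sensor()
    if temperature > ALARM_THRESHOLD_C:
        raise_alarm(f"CCR temperature {temperature:.1f} C exceeds "
                    f"{ALARM_THRESHOLD_C} C: call TI operators")

if __name__ == "__main__":
    # Fake sensor and alarm sink, for demonstration only.
    check_ccr_temperature(read_sensor=lambda: 31.2,
                          raise_alarm=lambda msg: print("ALARM:", msg))
```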
External Hardware Dependencies: Technical Network
• Very reliable network topology
• Redundant network routes
• Redundant power supplies in routers and switches
External Software Dependencies: Online Databases
[Diagram: online database services (Controls Configuration, LSA Settings, E-Logbook, CESAR, HWC Measurements, Measurements, Logging) running on servers with 2 x quad-core 2.8 GHz CPUs and 8 GB RAM, backed by clustered NAS shelves of 14 x 300 GB SATA disks (11.4 TB usable) and 14 x 146 GB FC disks, plus an additional server for testing acting as a standby database for LSA]
• Service availability
  – The new infrastructure has high redundancy for high availability
  – Each service is deployed on a dedicated Oracle Real Application Cluster
  – The use of a standby database will be investigated, with the objective of reaching 100% uptime for LSA
• The Logging infrastructure can sustain a 24 h unavailability of the DB: data are kept in local buffers (see the sketch below)
• A 'golden' level of support, with intervention within 24 h
• Secure database access, granting specific privileges to dedicated DB accounts
• The LHC Controls infrastructure is highly DATA centric: all accelerator parameters & settings are stored in a DB located in B513
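A minimal sketch of the local-buffering idea referenced above: records are appended to an in-memory buffer while the database is unreachable and flushed once it comes back. The names and the retry policy are hypothetical; this is not the actual Logging service code.

```python
# Illustrative local buffer that tolerates temporary database unavailability.
from collections import deque
from typing import Any, Callable

class BufferedLogger:
    def __init__(self, write_to_db: Callable[[Any], None], max_items: int = 1_000_000):
        # write_to_db is a hypothetical DB write that raises ConnectionError while the DB is down.
        self._write_to_db = write_to_db
        self._buffer = deque(maxlen=max_items)   # bounded local buffer

    def log(self, record: Any) -> None:
        """Buffer locally, then try to flush everything pending."""
        self._buffer.append(record)
        self.flush()

    def flush(self) -> None:
        while self._buffer:
            record = self._buffer[0]
            try:
                self._write_to_db(record)
            except ConnectionError:
                return              # DB still unavailable: keep the data buffered
            self._buffer.popleft()  # only drop a record once it is safely stored
```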
External Software Dependencies: IT Authentication
• Needed online for Role Based Access Control (RBAC) and various web pages used by operators
• Not used for operational logins on Linux
• Windows caches recently used passwords
Redundancies: File & Application Servers
• Remote reboot and terminal server functionality built in
• Excellent power supply and fan redundancy, partial CPU redundancy, ECC memory
• Excellent disk redundancy, with automatic warnings in case of a disk failure
• Several backup methods:
  – ADSM backup towards IT
  – Daily or weekly rsync towards a storage place in Meyrin (see the sketch below)
  ▪ Data will be recovered in case of a catastrophic failure in the CCR
• We are able to restore a back end with destroyed disks in a few hours
• A 2nd PostMortem server is installed on the Meyrin site
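A sketch of what a daily rsync-style copy towards a remote storage place could look like, as mentioned above; the source and destination paths and the host name are purely hypothetical placeholders.

```python
# Hypothetical daily rsync backup job (paths and host are placeholders).
import subprocess

SOURCE = "/srv/controls-data/"                 # hypothetical local data area
DESTINATION = "backup-meyrin:/backup/ccr/"     # hypothetical Meyrin storage place

def run_daily_backup() -> None:
    """Mirror the source to the remote storage so it stays an exact copy."""
    subprocess.run(
        ["rsync", "-a", "--delete", SOURCE, DESTINATION],
        check=True,   # raise if rsync reports an error, so failures are noticed
    )

if __name__ == "__main__":
    run_daily_backup()
```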
Redundancies: Front Ends
• VMEs
  – Can survive a limited fan failure
  – Some VME systems have redundant power supplies
  – Otherwise no additional redundancy; remote reboot and terminal server are vital
• PLCs
  – Generally very reliable
  – Rarely have remote reboot because of the previous point
  ▪ some LHC alcove PLCs do have a remote reboot
Redundancy: Timing Distribution
• The LHC central timing consists of a Master, a Slave and a Gateway, using reflective memory and a hot-standby switch (see the sketch below)
• Timing is distributed over a dedicated network to the timing receivers (CTRx) in the front ends
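The following is a small, purely illustrative sketch of the hot-standby principle (a standby takes over when the master's heartbeat stops). It does not represent the actual central timing hardware or its reflective-memory mechanism; the timeout value is an assumption.

```python
# Illustrative master / hot-standby switch-over based on a missed heartbeat.
import time

HEARTBEAT_TIMEOUT_S = 3.0   # assumed value, for illustration only

class HotStandby:
    def __init__(self):
        self.active = "master"
        self.last_heartbeat = time.monotonic()

    def heartbeat_from_master(self) -> None:
        # Called whenever the master signals that it is alive.
        self.last_heartbeat = time.monotonic()

    def poll(self) -> str:
        """Switch to the slave if the master has gone silent for too long."""
        if (self.active == "master"
                and time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S):
            self.active = "slave"
        return self.active

if __name__ == "__main__":
    switch = HotStandby()
    print(switch.poll())   # still "master": the heartbeat is not yet overdue
```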
Network Security
• Isolation of the Technical Network from external access
• CNIC initiative to separate the General Purpose Network (GPN) from the Technical Network (TN)
• NO dependence on resources from the GPN for operating the machines
• Very few hosts from the GPN are allowed to access the TN
• Regular Technical Network security scans
Software Safety and Reliability
• High-level tools to diagnose and monitor the controls infrastructure (DIAMON and LASER)
  – Easy-to-use first-line diagnostics and tools to solve problems or to help decide about responsibilities for first-line intervention
• Protecting device access: the RBAC initiative
  – Device access is authorized based on RULES applied to ROLES given to specific USERS
• Protecting the Machine Critical Settings (e.g. BLM thresholds)
  ▪ Can only be changed by an authorized person
  ▪ Uses RBAC for authentication & authorization
  ▪ Signs the data with a unique signature to ensure critical parameters have not been tampered with since the last update (see the sketch below)
[Screenshot: DIAMON console – navigation tree, group view, monitoring tests, details and repair tools]
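A minimal sketch of the signing idea behind the Machine Critical Settings protection, using an HMAC over the setting value as a stand-in for the real signature scheme. The key handling, the payload format and the function and device names are hypothetical; this is not the actual MCS implementation.

```python
# Illustrative signing and verification of a critical setting (not the real MCS scheme).
import hashlib
import hmac
import json

SECRET_KEY = b"hypothetical-key-held-by-an-authorized-role"

def sign_setting(device: str, prop: str, value: float) -> str:
    payload = json.dumps({"device": device, "property": prop, "value": value},
                         sort_keys=True).encode()
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()

def verify_setting(device: str, prop: str, value: float, signature: str) -> bool:
    """Reject the setting if it was modified since it was last signed."""
    expected = sign_setting(device, prop, value)
    return hmac.compare_digest(expected, signature)

if __name__ == "__main__":
    sig = sign_setting("BLM.7L5", "THRESHOLD", 1.2e-3)            # hypothetical device
    print(verify_setting("BLM.7L5", "THRESHOLD", 1.2e-3, sig))    # True
    print(verify_setting("BLM.7L5", "THRESHOLD", 9.9e-3, sig))    # False (tampered value)
```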
Power Loss
• Power loss at any LHC site
  – No access to equipment from this site
  ▪ Machine Protection or OP will take action
• Power loss in the CCC/CCR
  – The CCC can sustain 1 hour on UPS
  – CCR cooling will be a problem
  – Some CCR servers will still be up if the 2nd power source is not affected
  – 10 minutes on UPS for EOD1; 1 hour on UPS for EOD2 and EOD9
CCR electricity distribution
Machine Protection
• The LHC machine itself is protected by a complete Machine Protection System, mainly based on hardware:
  – Beam Interlock System
  – Safe Machine Parameters system
  – Fast Magnet Current Change Monitors
  – Powering Interlock System
  – Warm Magnet Interlock System
  – Software Interlock System
• All devices in the PostMortem chain are protected for at least 15 minutes
• In addition, the source front ends can hold the data locally in case the network is the cause
• The CCR servers for PostMortem are on UPS for one hour
• A 2nd PostMortem mirror server is located on the Meyrin site
• The archives are stored on RAID servers, with 2 levels of backups: one on a backup server maintained by BE/CO on the Meyrin site, and one on an ADSM backup server maintained by IT in Building 513
Conclusion
• High dependence on electricity distribution, network, cooling and ventilation, and databases
• Emphasis of the Controls Infrastructure on:
  – Redundancy
  – Remote monitoring and diagnostics
  – Remote reset
  – Quick recovery in case of a major problem
• The controls infrastructure can sustain a power loss of between 10 and 60 minutes
• Special care is taken to secure PostMortem data (collection and archives)