![Page 1: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/1.jpg)
December 2005
Scaling Up PVSS
Phase II Test Results
Paul Burkimsher IT-CO
![Page 2: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/2.jpg)
Aim of the Scaling Up Project
Investigate functionality and performance of large PVSS systems
In Phase 1 we reassured ourselves that PVSS scales to support large systems
Provided detail rather than bland reassurances
![Page 3: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/3.jpg)
Phase 2: WYSIWYAF
Began with a questionnaire to you to establish your concerns
Eclectic list of “hot topics of the moment”
– Oracle Archiving
– Alerts
– Regular reconfiguration of channels (alerts and setpoints)
– Backup and restore
– Configuring all channels at startup
![Page 4: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/4.jpg)
Your requests (cont.)
– OPC performance
– Local DB cache
– Central Panel Repository
– Windows/Linux lurking limits
– System startup time (DPT distribution)
– Task Allocation
![Page 5: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/5.jpg)
Menu
From these requests, we initially picked out four for investigation:
– Task Allocation
– Backup of a running system
– Alerts
– Panel Repository
![Page 6: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/6.jpg)
Task Allocation
Recall that PVSS is manager based and any manager can be scattered to another machine (not just UIs).
[Diagram: PVSS manager architecture, showing the Event Manager (EV), Database Manager (DB), Control Manager (CTRL), API Manager (API), Drivers (D) and User Interface managers (UI) in Editor and Runtime modes.]
![Page 7: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/7.jpg)
Task Allocation
More than 20 different tests conducted to investigate the effect of moving managers around.
Results have been available on the web for some time (URLs at the end)
Results were surprising and went against our (& ETM’s!) assumptions of what would be “better”…
![Page 8: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/8.jpg)
What we measured…
A task allocation was deemed “better” if it supported a higher number of datapoint changes per second (“throughput”) than a system running entirely on a single processor.
We observed the number of changes per second that the system could support before one of the following became overloaded:
– CPU usage
– Memory usage
– Network traffic
– Disk traffic
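The criterion above can be sketched as a small search: step up the change rate and stop at the first rate where any one of the four resources passes an overload threshold. This is only an illustration. The function names, the 0.9 threshold and the shape of the measurement data are assumptions, not part of the test setup.

```python
from typing import Dict, List, Optional, Tuple

# The four resources watched during the tests.
RESOURCES = ("cpu", "memory", "network", "disk")


def max_supported_rate(
    samples: List[Tuple[int, Dict[str, float]]],
    threshold: float = 0.9,  # hypothetical overload level (fraction of capacity)
) -> Tuple[int, Optional[str]]:
    """Return (highest rate with no overloaded resource, first resource to overload).

    `samples` holds (changes_per_second, utilisation per resource in 0..1).
    The second element is None if nothing overloaded in the measured range.
    """
    best = 0
    for rate, usage in sorted(samples, key=lambda s: s[0]):
        overloaded = [r for r in RESOURCES if usage[r] >= threshold]
        if overloaded:
            return best, overloaded[0]
        best = rate
    return best, None


def is_better(candidate_rate: int, single_machine_rate: int) -> bool:
    """A task allocation is 'better' if it sustains a higher throughput."""
    return candidate_rate > single_machine_rate
```

With made-up measurements in which CPU utilisation is the first to cross the threshold, the function reports the last rate that was still sustainable and names `"cpu"` as the limiting resource.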
![Page 9: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/9.jpg)
What we saw…
As throughput increases on a typical PVSS system, the machine first becomes CPU bound.
The Event Manager (EM) is the task most in need of CPU.
We expected that scattering the EM away from the Data Manager (DM) would cause slow-down because of the high traffic between these tasks. WRONG!
![Page 10: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/10.jpg)
Scattering the EM
Despite the overhead of sending the traffic between the EM and the DM over the external network, scattering the EM increased throughput significantly (+75%).
![Page 11: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/11.jpg)
AES
The Alert-Event Screen (AES) is CPU-hungry.
Runs in a UI task which can be scattered.
Beware: Each additional AES not only increases the load on its own machine, but also increases the load on the EM to which it is connected.
![Page 12: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/12.jpg)
Recommendation
Execute as few AESs as possible outside the main control room.
When you are not actually looking at the AES, leave it in “stopped” mode. (Screen is not updated.)
![Page 13: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/13.jpg)
Scattering other managers
Can improve throughput, but not as spectacularly as when scattering the EM.
Moving the DM is useful, but more delicate (e.g. many Value Archive (VA) connections?)
![Page 14: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/14.jpg)
Absolute Performance
The average number of “changes per second” that can be supported depends on the nature of the traffic.
A steady data flow is easier to cope with.
Irregular bursts of rapid traffic tend to overflow the queues between the managers. (Queue lengths are configurable.)
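A toy model shows why the same average rate can be harmless when steady and harmful when bursty. The sketch below assumes a single bounded queue drained at a fixed rate per tick. The capacity, the service rate and the choice to count messages that do not fit are all illustrative assumptions. What PVSS itself does when a queue overflows is not modelled.

```python
from typing import List


def count_overflow(arrivals: List[int], service_per_tick: int, capacity: int) -> int:
    """Count messages that do not fit into a bounded queue.

    `arrivals[i]` messages arrive in tick i; up to `service_per_tick`
    messages are consumed at the end of each tick.
    """
    queued = 0
    overflowed = 0
    for n in arrivals:
        queued += n
        if queued > capacity:
            # Everything beyond the configured queue length is lost.
            overflowed += queued - capacity
            queued = capacity
        queued = max(0, queued - service_per_tick)
    return overflowed


# Both patterns average 100 messages per tick over 10 ticks.
STEADY = [100] * 10
BURSTY = [1000] + [0] * 9
```

With a hypothetical capacity of 500 and a service rate of 100 per tick, `count_overflow(STEADY, 100, 500)` loses nothing, while `count_overflow(BURSTY, 100, 500)` loses half of the burst. Raising `capacity` to 1000 absorbs the burst, which is the effect of configuring a longer queue.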
![Page 15: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/15.jpg)
Load Management
PVSS implements several Load Management schemes, e.g.
Alert screen update pauses during a brief avalanche
Alert screen switches into Stopped mode if the sustained number of alerts arriving is crazy
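The two behaviours described above can be pictured as a small state machine. The sketch below is not PVSS code: the thresholds `avalanche_rate` and `sustain_limit`, the one-second sampling and the mode names are assumptions chosen to illustrate the idea of pausing during a short avalanche and stopping when the avalanche persists.

```python
from typing import List


def alert_screen_modes(
    alerts_per_second: List[int],
    avalanche_rate: int = 100,  # hypothetical rate above which updates pause
    sustain_limit: int = 5,     # hypothetical number of busy seconds tolerated
) -> List[str]:
    """Return the screen mode ('running', 'paused' or 'stopped') for each second."""
    modes: List[str] = []
    busy_seconds = 0
    stopped = False
    for rate in alerts_per_second:
        if stopped:
            # Stopped mode is sticky: an operator has to restart the screen.
            modes.append("stopped")
        elif rate > avalanche_rate:
            busy_seconds += 1
            if busy_seconds > sustain_limit:
                stopped = True
                modes.append("stopped")
            else:
                modes.append("paused")
        else:
            busy_seconds = 0
            modes.append("running")
    return modes
```

A brief spike yields a couple of `"paused"` entries followed by `"running"`. A spike that outlasts `sustain_limit` ends in `"stopped"` and stays there.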
![Page 16: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/16.jpg)
Load Management - II
Load Shedding, where EM will cut the umbilical to rogue managers rather than be brought down itself.
I recommend that shift operators be taught to recognise the symptoms when they occur
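Load shedding can be illustrated with a few lines of code. The sketch assumes that each connected manager has a backlog of unprocessed messages and that a manager whose backlog exceeds a limit is disconnected. The manager names, the backlog figures and the limit are hypothetical, and the real decision logic inside the EM is not described in these slides.

```python
from typing import Dict, List, Tuple


def shed_rogue_managers(
    backlogs: Dict[str, int], limit: int
) -> Tuple[Dict[str, int], List[str]]:
    """Split managers into those kept and those disconnected.

    `backlogs` maps a manager name to its number of queued messages.
    Managers with a backlog above `limit` are treated as rogue.
    """
    kept = {name: n for name, n in backlogs.items() if n <= limit}
    disconnected = sorted(name for name, n in backlogs.items() if n > limit)
    return kept, disconnected
```

The list of disconnected managers is exactly what a shift operator would need to see, since a manager that has been cut off silently is one of the symptoms worth recognising.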
![Page 17: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/17.jpg)
Multiple CPUs
An alternative to scattering: Buy a dual processor!
2 CPUs are generally enough to satisfy even the hungry Event Manager
Our dual-CPUs became disk bound when we pushed them.
A tribute to the well-balanced design of modern PCs!
![Page 18: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/18.jpg)
RAM
Look at how much memory you are using.
Buy enough of it.
If you are worried about performance, paging is wasted effort!!
![Page 19: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/19.jpg)
Task summary
Give plenty of CPU capacity to the EM by:
– Buying a fast machine
– Scattering the EM
– Buying a dual CPU machine
![Page 20: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/20.jpg)
Menu
– Task Allocation
– Backup of a running system
– Alerts
– Panel Repository
![Page 21: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/21.jpg)
Backup
In the development systems nobody did backup.
PVSS backup is somewhat intricate.
Need for a set of recipes of backup instructions
![Page 22: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/22.jpg)
18-page Report
What needs backing up
What this means in PVSS
How to back it up
How to restore (rather important!)
Handout
![Page 23: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/23.jpg)
Four Parts
1) Executive Summary
2) Recipes
3) Detailed Background Description
4) Frequently Asked Questions about Backup.
(I’m not going to go through them, just let you know that they exist.)
![Page 24: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/24.jpg)
Menu
– Task Allocation
– Backup of a running system
– Alerts
– Panel Repository
![Page 25: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/25.jpg)
Alerts
PVSS 3.5 (due in 200x) will contain new functionality for summary alerts and alert provocation during ramping.
I did not do in-depth performance measurements on the existing system, beyond those I described to you in Phase 1 of S.U.P.
![Page 26: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/26.jpg)
At the request of one experiment though, we did investigate
“What is the load of an alert definition on a PVSS system?”
Results on the web (Test 38).
![Page 27: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/27.jpg)
Loads of Alert Definitions
We showed that it is safe to declare any number of alerts and even to activate them provided that the data values stay in range.
It is provocation of the warnings and alerts that incurs a significant CPU load.
![Page 28: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/28.jpg)
Memory load
Test 39 looked at memory usage of Alerts.
Requirement of 2.5KB per DPE alert.
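The per-alert figure makes it easy to estimate the memory needed for a planned number of alerts. The helper below is a back-of-envelope sketch: only the 2.5 KB figure comes from Test 39, while the function name, the assumption that the cost scales linearly and the convention of 1024 KB per MB are mine.

```python
KB_PER_DPE_ALERT = 2.5  # figure reported by Test 39


def alert_memory_mb(n_alerts: int, kb_per_alert: float = KB_PER_DPE_ALERT) -> float:
    """Estimate memory in MB for `n_alerts` DPE alerts, assuming linear scaling."""
    return n_alerts * kb_per_alert / 1024
```

For a made-up system with 100 000 DPE alerts this gives roughly 244 MB, which is worth checking against the installed RAM before the alerts are configured.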
![Page 29: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/29.jpg)
Menu
– Task Allocation
– Backup of a running system
– Alerts
– Panel Repository
![Page 30: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/30.jpg)
Panel Repository
Owing to staffing changes in the section, it was not possible to address this topic
![Page 31: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/31.jpg)
On the subject of panels…
During the tests I would have found it helpful to have a ready display of the interconnection status of the distributed systems.
I recommend that there is something showing this on the top-level display panel. (Even just a grid of red/green pixels showing connection status.) Lost connections should raise an alert.
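The suggested status grid could be as simple as the sketch below. It renders a text grid with one cell per system, using `G` for a live connection and `R` for a lost one, and returns the list of lost systems so that each can raise an alert. The system names, the grid width and the way the connection states are obtained are hypothetical. In a real panel the states would come from PVSS itself.

```python
from typing import Dict, List, Tuple


def status_grid(connected: Dict[str, bool], columns: int = 4) -> Tuple[str, List[str]]:
    """Return (text grid of G/R cells, names of systems with lost connections)."""
    names = sorted(connected)
    cells = ["G" if connected[name] else "R" for name in names]
    rows = [" ".join(cells[i:i + columns]) for i in range(0, len(cells), columns)]
    lost = [name for name in names if not connected[name]]
    return "\n".join(rows), lost
```

Systems are sorted by name so that a given system always occupies the same cell, which lets an operator spot a change at a glance.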
![Page 32: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/32.jpg)
Other questions
During the tests, I was approached by different experiments with other issues!
We agreed to investigate the following…
![Page 33: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/33.jpg)
PVSS Disturbance
With Alice we looked together at the effect of heavy external (unrelated) network traffic on PVSS.
Results written up as Tests 28 & 29.
Use 100Mbit with switches not hubs
Conclusion was that external traffic is not a problem
![Page 34: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/34.jpg)
Traffic Pattern
For Atlas we compared the CPU load demanded by:
– Changing 1 item N times vs
– Changing N items once each
Result: the same.
![Page 35: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/35.jpg)
Long Term Test (LTT)
With CMS’ machines (for the use of which we are very grateful!) we ran a long term test:
– Generated random data
– Recorded it and displayed it continuously on a trend
– Distributed system
Results
![Page 36: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/36.jpg)
LTT Results
The electricity supply at CERN is unreliable. You really do need a UPS.
The CERN campus AFS servers are relatively unreliable and should never be used in a production system!
The CERN network infrastructure is very reliable, but can break.
![Page 37: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/37.jpg)
Network Problem
One network break revealed that the CERN default Linux O/S settings actually prevent PVSS’s automatic recovery feature from accomplishing its goal.
Caching problem. Written up in 2 pages covering background, symptoms, explanation, how to fix it if it does happen to you and how to avoid it happening in the first place.
![Page 38: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/38.jpg)
“Side Effects” of SUP Project
Accumulated a large body of practical experience wrestling with PVSS.
Systematically recorded for your benefit.
Where?
![Page 39: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/39.jpg)
FAQs
FAQ pages on http://cern.ch/itcobe
Not restricted to today’s frequent questions but ones that we foresee will become frequent in the near future, e.g.
– My disk is nearly full! What can I do?
– My archive file is corrupt. What can I do?
Please spread the word, tell your friends…
![Page 40: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/40.jpg)
FAQ Categories
– Framework
– PVSS - Installation
– PVSS - Project Creation
– PVSS - Alerts (Alarms)
– PVSS - Import/Export
– PVSS - Archiving
– PVSS - Access Control
– PVSS - Backup-Restore
– PVSS - Cross Platform
– PVSS - Distributed Systems
– PVSS - Drivers
– PVSS - Excel Report
– PVSS - Folklore
– PVSS - Graphics
– PVSS - Linux specific
– PVSS - Messages
– PVSS - Miscellaneous
– PVSS - Printing
– PVSS - Programming
– PVSS - Production Systems
– PVSS - Run-time problems
– PVSS - Scattered Systems
– General Support Issues
![Page 41: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/41.jpg)
Folklore
What the FAQs don’t really address is the folklore that is built up in a close-knit team.
Often this information is unknown (or inaccessible) to outsiders.
![Page 42: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/42.jpg)
Folklore
Enter the Wiki…
– Web pages editable from inside a browser.
– Controls Wiki.
– Only CERN users can add (or change existing) content.
– Readable worldwide. (Is already used as a reference by non-HEP organisations!)
Folklore often embodies recommended ways of doing things. Do read it, and keep reading it…
…and edit it. It belongs to you!
![Page 43: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/43.jpg)
Example Recommendations in the Folklore
Assume one PVSS system per machine (Service restriction in Windows)
Place EM/DM on a different CPU to OPC client/servers (Protect EM against CPU overload from OPC; Freedom to move EM to Linux)
In a Summary (Group) alert, use a CHAR type (not a STRING type) DPE upon which to hang the summary alert. It's more efficient.
![Page 44: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/44.jpg)
Support Issues
Final Remark:
SUP has generated a fair number of support issues that have been followed up with ETM. “Bugs you didn’t know you nearly had”.
Significant contribution to the robustness of the PVSS systems.
![Page 45: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/45.jpg)
Summary
I do not claim to have answered all questions about building large systems.
– New questions come up frequently anyway.
We have shown that PVSS will scale to build large systems.
We have investigated the “hot topics of the moment” as defined by you.
![Page 46: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/46.jpg)
To read a summary of the salient points of the most recent tests, including a discussion of the observed “Emergent Behaviour” in large systems, see my ICALEPCS paper, “Scaling Up PVSS”.
We are now bringing this project to a close.
Thank you!
Any (more) questions?
![Page 48: December 2005 Scaling Up PVSS Phase II Test Results Paul Burkimsher IT-CO](https://reader035.vdocuments.us/reader035/viewer/2022062805/5697bfad1a28abf838c9c078/html5/thumbnails/48.jpg)
Reference Links
Scaling Up Home Page: http://cern.ch/itcobe/Projects/ScalingUpPVSS/welcome.html
IT-CO-BE FAQs: http://itcobe.web.cern.ch/itcobe/Services/Pvss/FAQ/
(T)Wiki: https://uimon.cern.ch/twiki/bin/view/Controls/PVSSFolkLore#PVSS_Folklore
ICALEPCS paper “Scaling Up PVSS”: http://elise.epfl.ch/pdf/P1_056.pdf