1005 cern-active mq-v2
DESCRIPTION
A presentation talking about how we use Apache ActiveMQ (fuse-message-broker) at CERN for monitoring the Large Hadron Collider.Presented at the FUSE community day - London 2010TRANSCRIPT
![Page 1: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/1.jpg)
James Casey
CERN IT Department
Grid Technologies Group
FUSE Community Day, London, 2010
Using ActiveMQ at CERN for the Large Hadron Collider
![Page 2: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/2.jpg)
Overview
• What we do at CERN
• Current ActiveMQ Usage– Monitoring a distributed infrastructure
• Lessons Learned
• Future ActiveMQ Usage– Building a generic messaging service
![Page 3: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/3.jpg)
LHC is a very large scientific instrument…
Lake GenevaLake Geneva
Large Hadron Collider27 km circumference
Large Hadron Collider27 km circumference
CMSCMS
ATLASATLAS
LHCbLHCb
ALICEALICE
![Page 4: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/4.jpg)
… based on advanced technology27 km of superconducting magnetscooled in superfluid helium at 1.9 K
![Page 5: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/5.jpg)
What are we looking for?
• To answer fundamental questions about the construction of the universe– Why have we got mass ? (Higgs Boson)– Search for a Grand Unified Theory
• Supersymmetry
– Dark Matter, Dark Energy– Antimatter/matter asymmetry
![Page 6: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/6.jpg)
This Requires…….
1. Accelerators : powerful machines that accelerate particles to extremely high energies and then bring them into collision with other particles
2. Detectors : gigantic instruments that record the resulting particles as they “stream” out from the point of collision.
3. Computers : to collect, store, distribute and analyse the vast amount of data produced by the detectors
4. People : Only a worldwide collaboration of thousands of scientists, engineers, technicians and support staff can design, build and operate the complex “machines”
![Page 7: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/7.jpg)
View of the ATLAS detector during construction
Length : ~ 46 m
Radius : ~ 12 m
Weight : ~ 7000 tons
~108 electronic channels
![Page 8: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/8.jpg)
A collision at LHC
8
Bunches, each containing 100 billion protons, cross 40 million times a second in the centre of each experiment
1 billion proton-proton interactions per second in ATLAS & CMS !
Large Numbers of collisions per event
~ 1000 tracks stream into the detector every 25 ns a large number of channels (~ 100 M ch) ~ 1 MB/25ns i.e. 40 TB/s !
![Page 9: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/9.jpg)
The Data Acquisition
9
Cannot possibly extract and record 40 TB/s. Essentially 2 stages of selection
- dedicated custom designed hardware processors 40 MHz 100 kHz
- then each ‘event’ sent to a free core in a farm of ~ 30k CPU-cores
100 kHz few 100 Hz
![Page 11: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/11.jpg)
First Beam day – 10 Sep. 2008
![Page 12: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/12.jpg)
![Page 13: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/13.jpg)
The LHC Computing Challenge
• Experiments will produce about 15 Million Gigabytes (15 PB) of data each year (about 20 million CDs!)
• LHC data analysis requires a computing power equivalent to ~100,000 of today's fastest PC processors (140MSi2K)
• Analysis carried out at more than 140 computing centres
• 12 large centres for primary data management: CERN (Tier-0) and eleven Tier-1s
• 38 federations of smaller Tier-2 centres
![Page 14: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/14.jpg)
Solution: the Grid
Use the Grid to unite computing resources of particle physics institutions around the world
The World Wide Web provides seamless access to information that is stored in many millions of different geographical locations
The Grid is an infrastructure that provides seamless access to computing power and data storage capacity distributed over the globe.It makes multiple computer centres look like a single system to the end-user.
![Page 15: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/15.jpg)
LHC Computing Grid project (WLCG)
• The grid is complex– Highly distributed– No central control– Lots of software in many languages
• Grid middleware SLOC – 1.7M Total– C++ 850K, C 550K, SH 160K, Java 115K, Python 50K, Perl 35K
• Experiment code e.g. ATLAS – C++ 7M SLOC
– Complex services dependencies
![Page 16: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/16.jpg)
My Problem - Monitoring the operational grid infrastructure
• Tools for Operations and Monitoring– Build and run monitoring infrastructure for WLCG– Operational tools for management of grid
infrastructures• Examples:
– Configuration database– Helpdesk/ ticketing– Monitoring– Availability reporting
• Early design decision: – Use messaging as an integration framework
![Page 17: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/17.jpg)
Open Source to the core
• Design and develop services for Open Science based on:– Open source software– Open protocols
• Funded by a series of EU Projects– EDG, EGEE, EGI.eu, EMI
• Backed by industry support• All our code is open source and freely available• Results published in Open Access journals
![Page 18: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/18.jpg)
Use Case– Availability Monitoring and Reporting
• Monitoring of reliability and availability of European distributed computing infrastructure
• Data must be reliable– Definitive source of availability and accounting
reports• Distributed operations model
– Grid implies ‘cross-administrative domain’• No root login !
– Global ticketing– Distributed operations dashboards
![Page 19: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/19.jpg)
Solution
• Distributed monitoring based on Nagios– Tied together with ActiveMQ
• Network of 4 brokers in 3 countries
– Linked to ticketing and alarm systems• Message level signing + encryption for verification of
identity
• Uses STOMP for all communication– Code in Perl & python
• Topics with Virtual Consumers – All persistent messages– Topic naming used for filtering and selection
![Page 20: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/20.jpg)
Architecture
![Page 21: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/21.jpg)
Component drilldown
![Page 22: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/22.jpg)
Current Status
• 16 national level Nagios servers– Will grow to ~40 in next 3 months
• Clients distributed across 40 countries• 315 sites• 5K services• 500,000 test results/day• 3 consumers of full data stream to database for
analysis and post processing• 40 distributed alarm dashboards with filtered
feeds
![Page 23: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/23.jpg)
Lessons (1)
• Just using STOMP is sub-optimal• Pros:
– Very simple– Good for lightweight clients in many languages
• Cons– Hard to write reliable long-lived clients
• No NACK, No heartbeat
– Ambiguities in the specification• Content-length and TextMessage• Content-encoding
• Not really broker independent in practice• Interested in contributing to STOMP 1.1/2.0
![Page 24: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/24.jpg)
Lessons (2)
• JMS Durable consumer suck– Fragile in Network of Brokers– Many problems fixed now by FUSE
• Virtual Topics solve the problem• Pros:
– Just like a queue• Can monitor queue length, purge
• Cons– Issues with selectors– Startup race conditions (solvable via config)
![Page 25: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/25.jpg)
Lessons (3)
• Network of brokers seem attractive• Pros:
– It’s all a cloud– Clients connect anywhere and it “just works”
• Cons:– It’s a very complicated area of code– Often you need to “ask the computer”
• Or a core ActiveMQ developer
• Trade off between resilience/scaling and complexity
![Page 26: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/26.jpg)
Lessons (4)
• Know the code– Most of it is very simple– Even for non-java developers
• If you keep away from “java-ish” stuff– JTA, XA, Spring
– Plugin architecture is very easy to work with• Most things can be implemented by a plugin• E.g. Monitoring, logging, restricting features,
AuthN/AuthZ
• Docs currently don’t explain everything– Especially the interactions between
plugins/features
![Page 27: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/27.jpg)
Lessons (5)
Stay in the ballpark• If it’s not in tests:
– Think twice about using the feature in that way…– Write a test for it !
• Examples– SSL and network connectors– Network of Brokers with odd topologies– STOMP/Openwire differences in feature support
![Page 28: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/28.jpg)
Nagios for ActiveMQ
• We use Nagios to monitor – Brokers– Producer/consumers
• Uses jmx4perl to reduce JVM load on Nagios machine– Exposes JMX information as JSON– Simple perl interface to write clients
• Generic nagios checks– Looking how to make more available for the
community
![Page 29: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/29.jpg)
Broker Monitoring
• Standard OS information– Filesystem full, processes running, socket counts,
open file counts• JMX for broker statistics
– Store usage, JVM stats, inactive durable subs, queues with pending messages
• JMX based scripts to manage brokers– Remove unwanted advisories– Purge queues with no consumers
![Page 30: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/30.jpg)
Virtual Topic monitoring
• Full testing of consumers from producers on all brokers in Network of Brokers
• Consumers instrumented to reply to test messages– Addressed to a single client-id on a topic– Send message to topic in Reply-To
• Nagios sends messages to all brokers for a topic– Checks they all come back– Useful to check that all brokers in network are
forwarding correctly
![Page 31: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/31.jpg)
Nagios broker status check
![Page 32: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/32.jpg)
To the future – a generic messaging service
• Many concurrent applications …– … in many languages …– … over the WAN …– … with little control over the users
• Not a typical messaging problem ?
![Page 33: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/33.jpg)
• Isolate clients from messaging via filesystem– Particularly in the WAN– Always assume messaging could be uncontactable
• Keeps “core” broker network small• And keeps complexity isolated• Allows all clients to use best language/protocol
to talk to messaging
Design thoughts – File Based Queue
![Page 34: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/34.jpg)
Design Thoughts– AMQP style remote messaging
• Queues bound to broker nodes
• IP-like routing sends messages to destinations
• Clients connect to specific instances– Better determinacy in
network– Easier to manage explicit
connections between brokers
![Page 35: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/35.jpg)
Summary
• ActiveMQ is a key technology choice for operating and monitoring the WLCG grid infrastructure
• It provides a scalable and adaptable platform for building a wide range of messaging based applications
• FUSE fits our model of open source software with industrial support
![Page 36: 1005 cern-active mq-v2](https://reader035.vdocuments.us/reader035/viewer/2022062417/54c688b14a795949318b4593/html5/thumbnails/36.jpg)
Thank you for your attention
Questions?