
GSI, Oct 2005 Hans G. Essel DAQ Control 1

H.G.Essel, J.Adamczewski, B.Kolb, M.Stockmeier

GSI, Oct 2005 Hans G. Essel DAQ Control 3

CMS: blueprint for clustered DAQ

DAQ staging

TTC   Timing, Trigger and Control
TPD   Trigger Primitive Data
aTTS  asynchronous Trigger Throttle System
D2S   Data to Surface
FRL   Frontend Readout Link
RU    Readout Unit
BU    Builder Unit
FU    Filter Unit
FFN   Filter Farm Network
EVM   Event Manager
RCN   Readout Control Network
BCN   Builder Control Network
DCN   Detector Control Network
DSN   DAQ Service Network

GSI, Oct 2005 Hans G. Essel DAQ Control 4

CMS DAQ: requirements

• Communication and Interoperability
  – Transmission and reception within and across subsystem boundaries, regardless of the protocols used
  – Addition of protocols without a need for modifications in the applications

• Device Access
  – Access to custom devices for configuration and readout
  – Access to local and remote devices (bus adapters) without the need for modifications in applications
  – Device allocation, sharing and concurrent access support

• Configuration, control and monitoring
  – Make parameters of built-in or user-defined types visible and allow their modification
  – Allow the coordination of application components (define their states and modes)
  – Allow the inspection of states and modes
  – Provide services for recording structured information

• Logging and error reporting
• Interface to persistent data stores (preferably without the need to adapt the applications)
• Publish all information to interested subscribers

• Maintainability and Portability
  – Allow portability across operating system and hardware platforms
  – Support access to data across multiple bus systems
  – Allow addition of new electronics without changes in user software
  – Provide memory management functionality to
    • improve robustness
    • give room for efficiency improvements
  – Application code shall be invariant with respect to its physical location and the network
  – Possibility to factor out re-usable building blocks

• Scalability
  – Overhead introduced by the software environment must be constant for each transmission operation and small with respect to the underlying communication hardware, in order not to introduce unpredictable behaviour
  – Allow applications to take advantage of additional resource availability

• Flexibility
  – Allow applications to use multiple communication channels concurrently
  – Addition of components must not decrease the system's capacity

GSI, Oct 2005 Hans G. Essel DAQ Control 5

CMS XDAQ

• XDAQ is a framework targeted at data processing clusters
  – Can be used for general-purpose applications
  – Has its origins in the I2O (Intelligent IO) specification

• The programming environment is designed as an executive
  – A program that runs on every host
  – User applications are C++ plug-ins
  – Plug-ins are dynamically downloaded into the executives
  – The executive provides functionality for
    • Memory management
    • Systems programming: queues, tasks, semaphores, timers
    • Communication: an asynchronous peer-to-peer communication model; incoming events (data, signals, …) are demultiplexed to callback functions of application components
    • Services for configuration, control and monitoring
    • Direct hardware access and manipulation services
    • Persistency services
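The demultiplexing idea above can be sketched in a few lines: plug-ins register callbacks for event types, and the executive dispatches each incoming event to its listeners. This is an illustrative sketch only; the class and method names are hypothetical, not the real XDAQ API.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of an XDAQ-style executive: application plug-ins
// register callbacks, and the executive demultiplexes incoming events
// (data, signals, ...) to them.
class Executive {
public:
    using Callback = std::function<void(const std::string& payload)>;

    // A plug-in registers interest in one event type.
    void subscribe(const std::string& eventType, Callback cb) {
        callbacks_[eventType].push_back(std::move(cb));
    }

    // Incoming events are demultiplexed to all registered callbacks.
    void dispatch(const std::string& eventType, const std::string& payload) {
        auto it = callbacks_.find(eventType);
        if (it == callbacks_.end()) return;   // no listener: drop the event
        for (auto& cb : it->second) cb(payload);
    }

private:
    std::map<std::string, std::vector<Callback>> callbacks_;
};
```

In the real framework the callbacks would live in dynamically loaded plug-in modules; here a lambda stands in for a plug-in's callback function.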

GSI, Oct 2005 Hans G. Essel DAQ Control 6

XDAQ Availability

Platform (OS)   CPU                 Description
Linux (RH)      x86                 Baseline implementation
Mac OS X        PPC G3, G4          no HAL, no raw Ethernet PT
Solaris         Sparc               no HAL, no raw Ethernet PT
VxWorks         PPC 603, Intel x86  no GM

http://cern.ch/xdaq

Current version: 1.1
Next releases: V 1.2 in October 2002 (Daqlets), V 1.3 in February 2003 (HAL inspector)

Change control: via SourceForge: http://sourceforge.net/projects/xdaq
Version control: CVS at CERN
License: BSD style

GSI, Oct 2005 Hans G. Essel DAQ Control 7

XDAQ: References

J. Gutleber, L. Orsini, "Software Architecture for Processing Clusters based on I2O", Cluster Computing, the Journal of Networks, Software and Applications, Baltzer Science Publishers, 5(1):55-64, 2002 (see http://cern.ch/gutleber for a draft version, or contact me)

The CMS collaboration, "CMS, The Trigger/DAQ Project", Chapter 9, "Online software infrastructure", CMS TDR-6.2, in print (contact me for a draft), also available at http://cmsdoc.cern.ch/cms/TDR/DAQ/

G. Antchev et al., "The CMS Event Builder Demonstrator and Results with Myrinet", Computer Physics Communications 2189, Elsevier Science North-Holland, 2001 (contact [email protected])

E. Barsotti, A. Booch, M. Bowden, "Effects of various event building techniques on data acquisition architectures", Fermilab note FERMILAB-CONF-90/61, USA, 1990.

GSI, Oct 2005 Hans G. Essel DAQ Control 8

XDAQ event driven communication

• Dynamically loaded application modules (from URL, from file)
• Inbound/outbound queue (pass frame pointers, zero-copy)
• Homogeneous frame format

Figure: on one computer, the executive framework demultiplexes incoming events to the listening application component, which implements a callback function foo(); a readout component generates a DMA completion event, and a peer transport receives messages from the network.
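The zero-copy point above can be made concrete: the inbound/outbound queues hand over frame *pointers*, so the payload is never copied between the peer transport and the application. The types below are illustrative stand-ins, not the real XDAQ frame classes.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Illustrative sketch of the zero-copy frame queue idea.
struct Frame {
    uint8_t* data;   // points into a (e.g. DMA-able) buffer pool
    size_t   length;
};

class FrameQueue {
public:
    void push(Frame* f) { q_.push_back(f); }   // enqueue the pointer only
    Frame* pop() {                             // dequeue the pointer only
        if (q_.empty()) return nullptr;
        Frame* f = q_.front();
        q_.pop_front();
        return f;
    }
    bool empty() const { return q_.empty(); }
private:
    std::deque<Frame*> q_;                     // no payload copies here
};
```

Ownership of the underlying buffer stays with the pool; producer and consumer only exchange pointers, which is what keeps the per-frame overhead constant.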

GSI, Oct 2005 Hans G. Essel DAQ Control 9

XDAQ: I2O peer operation for clusters

• Application component ↔ device
• Processing node ↔ IOP
• Controller node ↔ host

• Homogeneous communication
  – frameSend for local, remote and host targets
  – single addressing scheme (Tid)

• Application framework

Figure: two executives, each with an application, a messaging layer and a peer transport agent, exchange I2O message frames via peer transports; the numbered steps in the figure trace a frame from the sending application through its messaging layer and peer transport to the receiving executive.
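The single addressing scheme can be sketched as a routing table keyed by Tid: the sender calls one uniform frameSend and the messaging layer decides whether delivery is local or goes through a peer transport. All names here are hypothetical, chosen only to mirror the terms on the slide.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hedged sketch of Tid-based addressing: the caller never distinguishes
// local from remote targets; the messaging layer does.
using Tid = uint32_t;

struct Route {
    bool local;            // true: same executive; false: remote node
    std::string endpoint;  // e.g. host:port used by the peer transport
};

class MessagingLayer {
public:
    void addRoute(Tid tid, Route r) { routes_[tid] = r; }

    // Uniform send: one call for local, remote and host targets.
    std::string frameSend(Tid target, const std::string& /*frame*/) {
        auto it = routes_.find(target);
        if (it == routes_.end()) return "unroutable";
        return it->second.local
                   ? "delivered locally"
                   : "sent via peer transport to " + it->second.endpoint;
    }
private:
    std::map<Tid, Route> routes_;
};
```

The return strings stand in for the actual delivery paths; the point is that the application code is invariant with respect to the target's physical location, as the requirements slide demands.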

GSI, Oct 2005 Hans G. Essel DAQ Control 11

XDAQWin client

Daqlet window: Daqlets are Java applets that can be used to customize the configuration, control and monitoring of all components in the configuration tree.

Configuration tree: XML-based configuration of a XDAQ cluster.

GSI, Oct 2005 Hans G. Essel DAQ Control 13

XDAQ: component properties

Component properties: allows the inspection and modification of components' exported parameters.

GSI, Oct 2005 Hans G. Essel DAQ Control 15

BTeV: a 20 THz real-time system

• Input: 800 GB/s (2.5 MHz)
• Level 1
  – Lvl 1 processing: 190 µs at a crossing rate of 396 ns
  – 528 "8 GHz" G5 CPUs (factor of 50 event reduction)
  – high-performance interconnects
• Level 2/3
  – Lvl 2 processing: 5 ms (factor of 10 event reduction)
  – Lvl 3 processing: 135 ms (factor of 2 event reduction)
  – 1536 "12 GHz" CPUs, commodity networking
• Output: 200 MB/s (4 kHz) = 1–2 Petabytes/year

GSI, Oct 2005 Hans G. Essel DAQ Control 16

BTeV: The problem

• Monitoring, Fault Tolerance and Fault Mitigation are crucial

– In a cluster of this size, processes and daemons are constantly hanging/failing without warning or notice

• Software reliability depends on

– Physics detector-machine performance

– Program testing procedures, implementation, and design quality

– Behavior of the electronics (front-end and within the trigger)

• Hardware failures will occur!

– one to a few per week

• Given the very complex nature of this system where thousands of events are simultaneously and asynchronously cooking, issues of data integrity, robustness, and monitoring are critically important and have the capacity to cripple a design if not dealt with at the outset… BTeV [needs to] supply the necessary level of “self-awareness” in the trigger system.

Real Time Embedded System

GSI, Oct 2005 Hans G. Essel DAQ Control 17

BTeV: RTES goals

• High availability
  – Fault handling infrastructure capable of
    • accurately identifying problems (where, what, and why)
    • compensating for problems (shift the load, change thresholds)
    • automated recovery procedures (restart/reconfiguration)
    • accurate accounting
    • extensibility (capturing new detection/recovery procedures)
    • policy-driven monitoring and control

• Dynamic reconfiguration
  – adjust to potentially changing resources

• Faults must be detected/corrected ASAP
  – semi-autonomously, with as little human intervention as possible
  – distributed & hierarchical monitoring and control

• Life-cycle maintainability and evolvability
  – to deal with new algorithms, new hardware and new versions of the OS
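The policy-driven fault handling listed above can be sketched as a table of mitigation actions keyed by fault class, with escalation to an operator when no policy matches (the semi-autonomous case). Everything here is illustrative; the RTES toolkit generates such strategies from models rather than hand-coding them.

```cpp
#include <functional>
#include <map>
#include <string>

// Illustrative sketch of policy-driven fault mitigation.
class FaultManager {
public:
    using Action = std::function<std::string()>;

    // Policies are extensible: new detection/recovery procedures
    // are captured by registering new actions.
    void setPolicy(const std::string& faultClass, Action a) {
        policies_[faultClass] = std::move(a);
    }

    // Returns what was done; unknown faults escalate to a human.
    std::string handle(const std::string& faultClass) {
        auto it = policies_.find(faultClass);
        if (it == policies_.end()) return "escalate to operator";
        return it->second();
    }
private:
    std::map<std::string, Action> policies_;
};
```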

GSI, Oct 2005 Hans G. Essel DAQ Control 18

RTES deliverables

A hierarchical fault management system and toolkit:

– Model Integrated Computing
  • GME (Generic Modeling Environment) system modeling tools, plus application-specific "graphic languages" for modeling system configuration, messaging, fault behaviors, user interface, etc.

– ARMORs (Adaptive, Reconfigurable, and Mobile Objects for Reliability)
  • Robust framework for detection of and reaction to faults in processes

– VLAs (Very Lightweight Agents for limited resource environments)
  • To monitor/mitigate at every level: DSP, supervisory nodes, Linux farm, etc.

GSI, Oct 2005 Hans G. Essel DAQ Control 19

RTES Development

• The Real Time Embedded System Group
  – A collaboration of five institutions:
    • University of Illinois
    • University of Pittsburgh
    • Syracuse University
    • Vanderbilt University (PI)
    • Fermilab
• NSF ITR grant ACI-0121658
• Physicists and computer scientists/electrical engineers at BTeV institutions

GSI, Oct 2005 Hans G. Essel DAQ Control 20

RTES structure

Figure: the RTES structure spans a design-and-analysis side and a runtime side along a soft-to-hard real-time axis. Design and analysis: modeling of algorithms, fault behavior and resources; synthesis with feedback; analysis of performance, diagnosability and reliability; an experiment control interface; and reconfiguration closing the loop back to the models. Runtime: a global fault manager and a global operations manager supervise region fault and operations managers for the L2/3 farm (CISC/RISC) and the L1 farm (DSP); below them, local fault and operations managers run in ARMOR/Linux and ARMOR/DSP environments alongside the trigger algorithms, connected by logical control networks and a logical data network.

GSI, Oct 2005 Hans G. Essel DAQ Control 23

GME: data type modeling

• Modeling of Data Types and Structures

• Configure marshalling-demarshalling interfaces for communication

GSI, Oct 2005 Hans G. Essel DAQ Control 24

RTES: GME modeling environment

Fault handling
Process dataflow
Hardware configuration

GSI, Oct 2005 Hans G. Essel DAQ Control 26

RTES: GME fault mitigation modeling language (1)

• Configuration of ARMOR infrastructure (A)
• Modeling of fault mitigation strategies (B)
• Specification of communication flow (C)

(A, B and C label the corresponding regions of the figure.)

GSI, Oct 2005 Hans G. Essel DAQ Control 27

RTES: GME fault mitigation modeling language (2)

• The model translator generates fault-tolerant strategies and the communication flow strategy from FMML models
• Strategies are plugged into the ARMOR infrastructure as ARMOR elements
• The ARMOR infrastructure uses these custom elements to provide customized fault-tolerant protection to the application

Figure: the translator turns the behavior aspect of an FMML model (a state machine switching between NOMINAL and FAULT states) into generated C++ callback code; the resulting fault-tolerant custom element and communication custom element are loaded into the ARMOR microkernel.
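For reference, a compilable version of the NOMINAL/FAULT state-machine fragment visible in the figure. The time<100 condition comes from the (garbled) figure text; the recovery condition was unreadable, so the recovered predicate here is a placeholder assumption.

```cpp
// Cleaned-up version of the FMML behavior-aspect state machine.
enum class State { NOMINAL, FAULT };

State step(State cur, int time, bool recovered) {
    State next = cur;
    switch (cur) {
    case State::NOMINAL:
        if (time < 100) next = State::FAULT;   // fault condition from the figure
        break;
    case State::FAULT:
        if (recovered) next = State::NOMINAL;  // placeholder for the unreadable predicate
        break;
    }
    return next;
}
```

In the toolchain described above, code of this shape is not written by hand: the translator generates it from the graphical FMML model.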

GSI, Oct 2005 Hans G. Essel DAQ Control 30

ARMOR

• Adaptive Reconfigurable Mobile Objects of Reliability:
  – Multithreaded processes composed of replaceable building blocks
  – Provide error detection and recovery services to user applications

• A hierarchy of ARMOR processes forms the runtime environment:
  – System management, error detection, and error recovery services are distributed across ARMOR processes
  – The ARMOR runtime environment is itself self-checking

• 3-tiered ARMOR support of user applications:
  – Completely transparent and external support
  – Enhancement of standard libraries
  – Instrumentation with the ARMOR API

• ARMOR processes are designed to be reconfigurable:
  – Internal architecture structured around event-driven modules called elements
  – Elements provide the functionality of the runtime environment, error-detection capabilities, and recovery policies
  – Deployed ARMOR processes contain only the elements necessary for the required error detection and recovery services

• ARMOR processes are resilient to errors by leveraging multiple detection and recovery mechanisms:
  – Internal self-checking mechanisms to prevent failures from occurring and to limit error propagation
  – State protected through checkpointing
  – Detection of and recovery from errors

• The ARMOR runtime environment is fault-tolerant and scalable:
  – 1-node, 2-node, and N-node configurations
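The checkpointing that protects ARMOR state can be sketched as snapshot-and-rollback: element state is saved at a known-good point and restored after an error. This is a minimal single-process illustration; real ARMORs checkpoint across processes so state survives a crash.

```cpp
#include <map>
#include <string>

// Minimal sketch of checkpoint-protected state (illustrative names).
class CheckpointedState {
public:
    void set(const std::string& key, int value) { state_[key] = value; }
    int  get(const std::string& key) const {
        auto it = state_.find(key);
        return it == state_.end() ? 0 : it->second;
    }
    void checkpoint() { saved_ = state_; }   // snapshot at a known-good point
    void rollback()   { state_ = saved_; }   // recover after error detection
private:
    std::map<std::string, int> state_, saved_;
};
```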

GSI, Oct 2005 Hans G. Essel DAQ Control 31

ARMOR system: basic configuration

Heartbeat ARMOR: detects and recovers FTM failures.

Fault Tolerant Manager (FTM): highest-ranking manager in the system.

Daemons: detect ARMOR crash and hang failures.

ARMOR processes: provide a hierarchy of error detection and recovery; ARMORs are protected through checkpointing and internal self-checking.

Execution ARMOR: oversees an application process (e.g. the various trigger supervisors/monitors).

Figure: the FTM, the heartbeat ARMOR and the execution ARMOR each run alongside a daemon on nodes connected by the network; the execution ARMOR oversees the application process.
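The crash/hang detection performed by the daemons amounts to heartbeat monitoring: every watched process reports a periodic tick, and anything silent for too long is flagged for recovery. Names and thresholds below are illustrative assumptions, not the RTES implementation.

```cpp
#include <map>
#include <string>
#include <vector>

// Hedged sketch of daemon-style heartbeat monitoring.
class HeartbeatMonitor {
public:
    // A watched process reports that it is alive at time `now`.
    void beat(const std::string& proc, long now) { lastBeat_[proc] = now; }

    // Processes whose last heartbeat is older than `timeout` are
    // candidates for recovery (crash or hang).
    std::vector<std::string> suspects(long now, long timeout) const {
        std::vector<std::string> out;
        for (const auto& [proc, t] : lastBeat_)
            if (now - t > timeout) out.push_back(proc);
        return out;
    }
private:
    std::map<std::string, long> lastBeat_;
};
```

Note this detects hangs as well as crashes: a hung process stops beating even though it still exists, which is exactly the failure mode the slide calls out.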

GSI, Oct 2005 Hans G. Essel DAQ Control 32

EPICS overview

EPICS is a set of software components and tools to develop control systems.

The basic components are:

OPI (clients)

– Operator Interface. A UNIX- or Windows-based workstation that can run various EPICS tools (MEDM, ALH, OracleArchiver).

IOC (server)

– Input/Output Controller. Typically a VME/VXI-based chassis containing a Motorola 68xxx processor, various I/O modules, and VME modules that provide access to other I/O buses such as GPIB and CANbus.

LAN (communication)

– Local area network. The communication network that allows the IOCs and OPIs to communicate. EPICS provides a software component, Channel Access, which provides network-transparent communication between a Channel Access client and an arbitrary number of Channel Access servers.
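The network transparency of Channel Access boils down to name-based addressing: a client reads a process variable by its global name and never needs to know which IOC serves it. The sketch below illustrates that idea only; it is emphatically not the real Channel Access API, and all names are invented.

```cpp
#include <map>
#include <string>

// Illustrative sketch of the Channel Access *idea* (NOT the CA API):
// clients address process variables by name; the resolution to a
// serving IOC is hidden inside the library.
class ChannelAccess {
public:
    // An IOC publishes a process variable under a global name.
    void publish(const std::string& pv, double value, const std::string& ioc) {
        values_[pv]  = value;
        servers_[pv] = ioc;
    }
    // A client reads by name only; it never names the serving IOC.
    double get(const std::string& pv) const { return values_.at(pv); }
    // The resolution the client never sees.
    std::string servedBy(const std::string& pv) const { return servers_.at(pv); }
private:
    std::map<std::string, double>      values_;
    std::map<std::string, std::string> servers_;
};
```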

GSI, Oct 2005 Hans G. Essel DAQ Control 35

Hierarchy in a flat system

Figure: several IOCs, each running its own tasks, communicate with a client over the flat network.

• IOCs
  – one IOC per standard CPU (Linux, Lynx, VxWorks)
• Clients
  – on Linux (Windows)
• Agents
  – segment IOCs being also clients

Name space architecture!

GSI, Oct 2005 Hans G. Essel DAQ Control 36

Local communication (node)

Figure: within a node, an IOC task holds the segment status; commands and messages flow between tasks via intertask communication and shared memory; a task has a command thread, a working thread and a message thread.

• Commands are handled by threads
• Execution may be in the working thread
• The message thread may not be needed
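The command-thread/working-thread split above can be sketched with a small hand-off queue: the command thread posts commands and returns immediately, while the working thread takes and executes them, so slow execution never blocks command reception. Class and method names are illustrative.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Sketch of the command/working-thread hand-off described on the slide.
class CommandDispatcher {
public:
    // Called by the command thread: enqueue and return immediately.
    void post(const std::string& cmd) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(cmd);
        }
        cv_.notify_one();
    }

    // Called by the working thread: block until a command is available.
    std::string take() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        std::string c = q_.front();
        q_.pop();
        return c;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
};
```

The same structure explains why the message thread "may not be needed": if messages are only produced, a plain non-blocking post from the working thread suffices.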

GSI, Oct 2005 Hans G. Essel DAQ Control 37

MBS node and monitor IOC

Figure: an MBS node runs a dispatcher task that holds the segment status. External control sends text commands asynchronously through the IOC; text messages go asynchronously to a message server and, on request, to a status server.

GSI, Oct 2005 Hans G. Essel DAQ Control 38

Screen shot FOPI

GSI, Oct 2005 Hans G. Essel DAQ Control 39

Kind of conclusion

• RTES: very big and powerful, but not simply available!
  – Big collaboration
  – Fully modelled and simulated using GME
  – ARMORs for maximum fault tolerance and control

• XDAQ: much smaller; installed at GSI
  – Dynamic configurations (XML)
  – Fault tolerance?

• EPICS: from the accelerator controls community; installed at GSI
  – Maybe best known
  – No fault tolerance
  – Not very dynamic