TRANSCRIPT
January 25, 2018
Matt Graham, Giovanna Lehmann, Mark Convery, Ryan Herbst, Larry Ruckman, JJ Russell
DUNE FD DAQ: ATCA/RCE + FELIX Solution
Felix + ATCA RCE Overview & Responsibilities
2
[System diagram: Front Ends → WIBs → ATCA RCE Cluster → FELIX Cluster → Backend Computing and Trigger Farm; optics run up the shaft]
● Front end data passes through the WIB without concentration: electrical to optical conversion only
● The ATCA RCE Cluster provides filtering, feature extraction & SuperNova buffering (100 s); raw data received at the ATCA RCE RTM is passed directly to the DPMs
● 10 Gbps links between RCE and Felix (underground); buffer for trigger
● FELIX Cluster: Event Builder, Aggregator, L3 triggering
● Trigger Farm: receives trigger primitives, issues trigger decisions
● FE+WIB → RCE: all raw data into the RTM in a custom format (e.g. COLDATA); 8B/10B encoded (probably) at 1.28 Gbps
● RCE → FELIX: all raw data out of the RTM in a custom format (GBT, etc.) of multiplexed data over ~10-12 Gbps links
● FELIX → Backend Computing: triggered raw data over Ethernet on a switched network
● Trigger Path: RCE-extracted primitives go either RCE → FELIX → trigger farm on a separate stream, or directly RCE → trigger farm via Ethernet (shown)
● Lossy Buffer (not shown): RCE-extracted waveforms/time slices go to a lossy buffer, either through FELIX or directly from the RCE
3
Numerology (just CE, RCE, FELIX)
● “Cold” Electronics: 64 channels/COLDATA, 2 COLDATA/FEMB, 4 FEMB/WIB, 20 FEMB / 5 WIBs / APA
○ these are fixed and will never change
● RCE System: 4 DPMs/COB, 1* RCE/DPM, up to 14 COBs/shelf, 1 COB/APA (target)
○ *the current-gen DPM has 2 RCEs/DPM; see later slides
● FELIX System: 2 APAs/FELIX; 2 FELIX/PC (target)
● WIB → RCE Links: 16 links/WIB @ 1.28 Gbps, 80 links/COB/APA
○ assumes a passive WIB
● RCE → FELIX Links: (raw, uncompressed) 2× 10 Gbps links/DPM, 8 links/APA, 16 links/FELIX, 32 links/PC (see the sketch after this list)
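As a cross-check of these counts and the implied bandwidths, here is a minimal Python sketch (a sanity check only, not project code):

```python
# Sanity check of the CE/RCE/FELIX numerology above (illustrative only).
CH_PER_COLDATA, COLDATA_PER_FEMB = 64, 2
FEMB_PER_WIB, WIB_PER_APA = 4, 5

ch_per_femb = CH_PER_COLDATA * COLDATA_PER_FEMB            # 128 channels
ch_per_apa = ch_per_femb * FEMB_PER_WIB * WIB_PER_APA      # 2560 channels

# WIB -> RCE: 16 links/WIB at 1.28 Gbps, 8B/10B encoded
links_per_apa = 16 * WIB_PER_APA                            # 80 links/COB/APA
wib_payload_gbps = links_per_apa * 1.28 * 8 / 10            # ~81.9 Gbps usable

# Raw ADC bandwidth per APA: 2560 ch x 2 MHz x 12 bit
raw_apa_gbps = ch_per_apa * 2e6 * 12 / 1e9                  # 61.44 Gbps

# RCE -> FELIX: 8 x 10 Gbps links per APA (raw, uncompressed)
felix_uplink_gbps = 8 * 10                                   # 80 Gbps capacity

print(ch_per_apa, links_per_apa, round(wib_payload_gbps, 1),
      round(raw_apa_gbps, 2), felix_uplink_gbps)
```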
4
RTM Block Diagram
[Block diagram: ten SNAP12 transmitter/receiver modules plus an SFP+ and a QSFP on the RTM, connecting to the DTM and DPMs; the RTM provides the WIB connections (support for 80 links), the FELIX interface, and timing]
● Experience with high density fiber optic RTMs
● ReflexPhotonics SNAP12 transmitter/receiver supports 10.3125 Gbps per lane: http://reflexphotonics.com/embedded-transceivers/snap12/
Data Flow In ATCA RCEs
5
[Data flow diagram: Front Ends → Filtering → Feature Extraction → SuperNova Pre-Buffer → SuperNova Post-Buffer → Felix Uplink (GBT or PGP) → To Felix; a Timing Interface feeds the chain; Compression & Other Processing is shown as a proposed stage]
● The flexible architecture allows front ends to be allocated across RCEs as needed
○ Simply add more cards and move fibers
● Target is 640 channels per RCE (1× APA per COB) → 5 FEMBs/DPM
○ Numerology is important! 5 WIBs vs 4 DPMs/APA; multiplexing at the WIB (e.g. 2× FEMB links) reduces flexibility
RCE To Felix Uplink
6
● Multiple optical links will be routed between the ATCA RCE platform and the Felix nodes
○ DWDM utilized to maximize uplink bandwidth and provide redundancy
● The ATCA RCE platform will utilize its powerful interconnects to provide a data routing capability
○ Flexible configuration of which data goes to each Felix board
■ Allows the system to adapt to changing data needs (noise, etc.)
■ Some channels can be used for raw data from a subset of the detector
■ SuperNova readout lanes (slow trickle, post trigger)
■ Data can be re-routed to different fibers in the case of a fiber break or Felix board failure! (see the sketch after this list)
○ Link count can be scaled to match system needs
■ Fewer fibers when RCEs do the majority of data processing and event selection
■ More fibers when the computing cluster is needed for data processing
■ One or more fibers per DPM, one fiber per COB, or one fiber per crate
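As an illustration of the re-routing idea, a small Python sketch that reassigns a failed Felix board's fibers to a spare; the link map, node names, and function are hypothetical, not the actual RCE firmware interface:

```python
# Illustrative only: a toy link map showing how traffic from a failed Felix
# board could be reassigned to a spare via the ATCA RCE interconnect.
# Names ("felix0", "felix_spare") and the mapping structure are hypothetical.

link_map = {               # outbound RCE fiber -> destination Felix node
    "cob0/dpm0": "felix0",
    "cob0/dpm1": "felix0",
    "cob0/dpm2": "felix1",
    "cob0/dpm3": "felix1",
}
spares = ["felix_spare"]

def fail_over(link_map, failed, spares):
    """Reassign every fiber that pointed at `failed` to the first available spare."""
    if not spares:
        raise RuntimeError("no spare Felix node available")
    spare = spares.pop(0)
    for fiber, dest in link_map.items():
        if dest == failed:
            link_map[fiber] = spare
    return spare

fail_over(link_map, "felix0", spares)
print(link_map)   # cob0/dpm0 and cob0/dpm1 now route to felix_spare
```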
[Diagram: the RCE Cluster fans out over multiple fibers to several Felix nodes]
Example Data Routing
7
[Diagram: Detector Data enters several Processing RCEs; three Outbound RCEs forward the data over the ATCA RCE interconnect to the Felix boards]
Note: Processing RCEs can also serve as outbound RCEs
Example Data Routing: 2 Active & 1 Spare Felix
8
[Diagram: same topology as the previous slide, with data routed to two active Felix boards while a third Felix board is kept as a spare]
Note: Processing RCEs can also serve as outbound RCEs
Example Data Routing: Felix Failure Or Fiber Break
9
[Diagram: same topology, now with a fiber break / failed Felix board]
Note: Processing RCEs can also serve as outbound RCEs
The ATCA RCE cross connect routes data to the redundant Felix board after a fiber break or Felix board failure!
Example Data Routing: Load Adjustment
10
[Diagram: same topology, with traffic redistributed across the Felix boards]
Note: Processing RCEs can also serve as outbound RCEs
Redundant Felix boards can take on additional load thanks to the flexible data routing in the ATCA RCE!
11
Data Flow On FELIX PCs
Benefits Of Felix + ATCA RCE
12
● The ATCA RCE provides a powerful front end platform for data processing in FPGAs
○ RCE data processing features are covered in the upcoming slides
■ Filtering
■ Feature extraction
■ Neural network processing
○ The RCE could provide SuperNova data buffering (a minute, easily)
○ Proven packaging, cooling, interconnects, high reliability (incl. hot-swap redundant fans and power supplies)
■ Already used for other experiments (LSST, ATLAS CSC, ATLAS IBL, KOTO, etc.): mature design, low risk
● The ATCA RCE interconnect provides the ability to re-route data to Felix nodes on demand
○ Adjust processor load to match the amount of processing needed in the back end
○ Route data around failed uplink fibers
○ Move data from a failed Felix node (or host CPU) to a redundant element
● Felix provides a point-to-point path between the ATCA RCEs and the back end data processing
○ Better flow control model than Ethernet or TCP/UDP over long links
■ Both PGP and GBT provide proven flow control over long link distances
○ Transmitted frames stay at their native size instead of being chunked into small network transfers (Ethernet MTU; see the sketch after this list)
■ Felix has demonstrated high throughput with larger packet sizes
○ End-to-end data integrity
■ GBT and PGP both have data integrity checking in their transport protocols
■ Minimal error handling required in receiving nodes before the data processing layer
■ Test pattern capability over GBT or PGP links
● The back end processing model follows the classic Felix architecture
○ Receive data in the Felix node with a PCI-Express card
○ Back end data processing with CPUs and GPUs
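To make the framing point concrete, here is a small Python sketch; the 5 ms readout window, native frame size, and 1500-byte MTU are assumptions for illustration, not numbers from these slides:

```python
# Illustrative only: the readout window and frame sizes are assumptions.
CHANNELS = 640                   # one RCE's worth of channels
FS_HZ = 2_000_000                # 2 MHz ADC sampling
BITS = 12                        # bits per ADC sample
WINDOW_S = 0.005                 # assumed 5 ms readout window
MTU_BYTES = 1500                 # standard Ethernet MTU
NATIVE_FRAME_BYTES = 1_000_000   # assumed "native" frame size on GBT/PGP

payload_bytes = CHANNELS * FS_HZ * BITS * WINDOW_S / 8   # 9.6 MB for this window
mtu_packets = payload_bytes / MTU_BYTES                   # ~6400 MTU-sized transfers
native_frames = payload_bytes / NATIVE_FRAME_BYTES        # ~10 native frames

print(f"{payload_bytes/1e6:.1f} MB -> {native_frames:.0f} native frames "
      f"vs {mtu_packets:.0f} Ethernet-MTU packets")
```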
13
ATCA RCE Data Processing
Readout Overview
14
15
ATCA RCE Platform Clustering
• The RCE nodes are interconnected through Ethernet
• Each COB contains a low latency 10/40 Gbps Ethernet switch
- Cut-through latency < 300 ns
• The COB supports a full mesh 14-slot backplane
- Each COB has a direct 10 Gbps link to every other COB in a crate
- Any RCE in an ATCA shelf has a maximum of two switches between it and every other RCE
- 14 × 8 = 112 RCEs in a low latency cluster
• A reliable UDP protocol allows direct firmware-to-firmware data sharing (see the sketch after this list)
• Allows for low latency data sharing between nodes
- APA combining and edge channel data sharing
- Neural network data sharing
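A minimal plain-UDP sketch of node-to-node data sharing; this is not the COB firmware's reliable-UDP protocol, and the peer address, port, and payload layout are made up for illustration:

```python
# Illustrative only: plain UDP, not the COB firmware's reliable-UDP protocol.
# The peer address, port number, and "edge channel" payload layout are hypothetical.
import socket
import struct

PEER = ("192.168.2.11", 8192)   # hypothetical address of a neighbouring RCE

def share_edge_channels(samples):
    """Send a block of 12-bit edge-channel samples (stored as uint16) to a peer RCE."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    header = struct.pack("!HI", len(samples), 0)          # sample count + sequence number
    payload = struct.pack(f"!{len(samples)}H", *samples)  # one uint16 per sample
    sock.sendto(header + payload, PEER)
    sock.close()

share_edge_channels([0x0123, 0x0456, 0x0789])
```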
[Diagram: four COBs, each with four DPMs (DPM 0-3), an Ethernet switch, and a DTM, fully meshed across the backplane; one off-shelf link]
16
Oxford Design: Revision C01
● ZYNQ: XCZU15EG-1FFVB1156E
● PL DDR4: 8 GB on DPM
● PS DDR4: 8 GB on DPM
● M.2 NVMe: 512 GB on DPM
○ Located above the DPM's DDR ICs
● Dimensions: 85.09 mm x 110 mm
○ Increased by 1.27 mm for the NVMe
[Board photo with labels: XCZU15EG-1FFVB1156E, DDR4 ICs, M.2 NVMe, SD memory card, JTAG]
17
DPM Redesign for DUNE
● Oxford/SLAC collaboration
● Optimized for large memory buffering on the DPM
● Only 24 GT channels on this FPGA
○ 20 of 24 GTs for the FEBs:
■ 80 links/COB @ 1.28 Gbps (8B/10B)
○ 2 of 24 GTs for the ETH SW:
■ two separate 10 GbE links (10 Gbps/lane, 64B/66B) to the ETH SW
○ 2 of 24 GTs for the Felix:
■ 2 RX lanes and up to 22 TX lanes
● Able to support redundant Felix connections
■ 20 Gb/s @ 2 lanes (10 Gbps/lane, 64B/66B)
[Memory map labels: SuperNova Pre-Buffer, SuperNova Post-Buffer, Linux Kernel + SuperNova Pre-Buffer, Boot Memory]
Unused FEB TX lanes can be used to increase bandwidth to Felix
Backup Slides
19
ATCA Packaging for DUNE
● 1 APA = 2560 channels
● 1 APA per COB
○ 4 DPMs per COB
○ 640 channels per DPM
● 150 APAs for the entire system = 150 COBs
● Total rack space: 165U (see the sketch after this list)
○ 11× 14-slot ATCA crates
○ 15U per 14-slot ATCA crate
■ http://www.asis-pro.com/maxum-atca-systems/14-Slot-14U-MaXum-460
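A quick Python check of the packaging arithmetic above (illustrative only):

```python
# Sanity check of the packaging numbers (illustrative only).
import math

APAS = 150                       # APAs in the full detector
COBS = APAS                      # 1 COB per APA
SLOTS_PER_CRATE = 14
CRATE_HEIGHT_U = 15

crates = math.ceil(COBS / SLOTS_PER_CRATE)    # 11 crates
rack_space_u = crates * CRATE_HEIGHT_U        # 165U
print(crates, rack_space_u)
```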
20
ATCA Power/Cooling Estimates for DUNE
● COB max power: 300 W
○ ~100 W for the ETH SW
○ 36 W for the RTM (limited by a 3 A fuse)
○ 160 W for digital processing
■ 40 W per DPM
● Total max power: 45 kW (see the sketch after this list)
● Cooling via forced air (integrated into the ATCA platform)
● Power and thermal monitoring via the standard IPMI interface
● Example of an ATCA crate that supports 400 W per slot
○ http://www.asis-pro.com/maxum-atca-systems/14-Slot-14U-MaXum-460
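A quick Python check of the power budget (illustrative only):

```python
# Sanity check of the power budget (illustrative only).
ETH_SW_W, RTM_W, DPM_W, DPMS = 100, 36, 40, 4

cob_w = ETH_SW_W + RTM_W + DPM_W * DPMS      # ~296 W, quoted as 300 W max per COB
total_kw = 300 * 150 / 1000                   # 150 COBs at the 300 W maximum
print(cob_w, total_kw)                        # 296, 45.0 kW
```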
21
ATCA Costs for DUNE (Updated for quantity)
● RCE cost estimate:
○ Upgraded DPM + Flash: $2.5K
○ Upgraded COB: $4K
○ RTM: $2K
○ DTM: $1K
○ ~$17K/unit
● 14-slot ATCA crates, in quantity, 2019
○ ~$7K/unit
○ IPMI + shelf manager + 10GbE/40GbE backplane + fans + power supplies
● Total ATCA hardware cost: $2.7M (see the sketch after this list)
○ 11× ATCA crates
○ 150× RCE ATCA slots
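The per-slot and total figures are consistent if the DPM cost is counted four times per COB (an inference from the 4 DPMs/COB stated earlier, not spelled out on this slide); a quick check:

```python
# Cost roll-up (illustrative; assumes the $2.5K DPM figure applies per DPM, x4 per COB).
dpm_k, cob_k, rtm_k, dtm_k = 2.5, 4, 2, 1
slot_k = dpm_k * 4 + cob_k + rtm_k + dtm_k             # 17 -> "~$17K/unit"

crate_k, crates, slots = 7, 11, 150
total_m = (slot_k * slots + crate_k * crates) / 1000   # ~2.63 -> "~$2.7M"
print(slot_k, round(total_m, 2))
```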
22
Packaging And Architecture Thoughts
23
ATCA Components
Proven standard, built to be robust and reliable, and fully monitored
[Photo labels: air intake filter, intake fans, exit fans, power supply (DC or AC input), shelf manager, application card]
● Telecom standard designed for “5 nines” uptime
● Almost all components can be replaced in the field
● Redundancy is available if desired
○ N + 1 redundancy for power supplies
○ Redundant shelf managers
● The system is designed to handle one fan failure in each fan tray
○ The shelf manager generates an alarm to request fan tray replacement
24
ATCA Provides Management & Monitoring Features Required In Reliable & Maintainable DAQ Designs
[Diagram labels: shelf managers, Ethernet, console, power supplies, fan trays, EEPROMs, IPMC]
● ATCA uses IPMI for management purposes
○ Intelligent Platform Management Interface
● Manages and monitors all shelf based components
○ Power supply status and power
○ Shelf inlet and exit temperatures
○ Fan speed control and monitoring
○ Application card control and monitoring
● Redundant EEPROMs contain all shelf information
○ Shelf serial number, location and ID
○ Shelf manager IP/MAC address
● The application card hosts the IPMC
○ Intelligent Platform Management Controller
● The IPMC hosts all application card information in a local EEPROM
○ MAC addresses
○ Serial number, card type & revision
Supernova Buffering In Two Stages (Update)
● The pre-trigger buffer stores data in a ring buffer waiting for a supernova trigger (see the sketch after this list)
○ 640 channels per RCE (1× APA per COB)
○ 2 MHz ADC sampling rate
○ 12 bits per ADC
○ Raw bandwidth: 15.36 Gbps (1.92 GB/s)
■ 640 × 2 MHz × 12 b
○ Each DPM has 16 GB RAM:
■ 9.6 TB of DDR4 RAM for the whole system across 150× COBs
○ Total memory for supernova “pre-buffering”: 15 GB
■ PL 8 GB + PS 7 GB (1 GB for kernel & OS)
○ Without compression: 7.8 seconds of pre-trigger buffer
■ Assuming 12-bit packing to remove the 4-bit overhead of packing into bytes
● The post-trigger buffer stores data in a flash based SSD before the backend DAQ
○ The write sequence occurs once per supernova trigger: low write wearing over the experiment lifetime
○ Low bandwidth background readout post trigger: does not impact normal data taking
○ ~$180K for NVMe M.2 SSD buffering (150× COBs × 4 DPMs/COB × $300/NVMe)
○ 512 GB/DPM = 266 second post-trigger buffer
○ Samsung NVMe SSD 960 PRO: sequential write up to 2.1 GB/s
■ SSD write bandwidth matches well with 640 channels of uncompressed data
NOTE: NO compression factor is applied on this slide (only RAW bandwidths)
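A short Python check of the buffer-depth arithmetic above (illustrative only):

```python
# Sanity check of the supernova buffer depths (illustrative only).
CHANNELS, FS_HZ, BITS = 640, 2_000_000, 12           # per DPM (1 APA per COB, 4 DPMs)

rate_gbs = CHANNELS * FS_HZ * BITS / 8 / 1e9         # 1.92 GB/s raw, 12-bit packed
pre_buffer_s = 15 / rate_gbs                          # 15 GB DDR4 -> ~7.8 s
post_buffer_s = 512 / rate_gbs                        # 512 GB NVMe -> ~266 s
system_ddr_tb = 16 * 4 * 150 / 1000                   # 16 GB/DPM x 4 DPMs x 150 COBs = 9.6 TB

print(round(rate_gbs, 2), round(pre_buffer_s, 1), int(post_buffer_s), system_ddr_tb)
```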
Zynq Ultrascale+ and M.2 SSD Performance
● Benchmarked read/write bandwidth into a Samsung NVMe SSD 960 PRO with the ZYNQ PS PCIe root complex interface
● M.2 SSD mounted and formatted as an EXT4 hard drive
● Running on ArchLinux
● Measuring ~1.6 GB/s for reading/writing dummy data generated by the CPU
○ Limited by the Zynq's PCIe Gen2 x4 lane interface (theoretical limit: 2.0 GB/s)
■ Not limited by the M.2 SSD's controller
● Because the input bandwidth of 1.92 GB/s exceeds the 1.6 GB/s SSD write speed, we would be able to buffer for 37 seconds in DDR before 100% back pressure (see the sketch after this list)
● Some amount of compression is needed before the SSD to prevent bottlenecking at the SSD
● This is a very simple test with only one process
○ Need to stress test other interfaces in parallel with the SSD to confirm the rate is still 1.6 GB/s
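A simple Python sketch of the back-pressure estimate; the ~12 GB of usable DDR is an assumption that reproduces the 37 s quoted above, since the slide does not state the exact amount of DDR available for absorbing the excess:

```python
# Illustrative back-pressure estimate. The ~12 GB usable-DDR figure is an
# assumption that reproduces the slide's 37 s; the slide does not state it.
in_rate_gbs = 1.92      # raw input: 640 ch x 2 MHz x 12 bit
ssd_rate_gbs = 1.60     # measured SSD write bandwidth
usable_ddr_gb = 12      # assumed DDR available for absorbing the excess

fill_time_s = usable_ddr_gb / (in_rate_gbs - ssd_rate_gbs)   # ~37.5 s
print(round(fill_time_s, 1))
```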
26
Optional Compression
● Past development has shown that firmware compression can be costly in FPGA resources
● If compression is done in firmware, a minimal LUT footprint would be required
● With the high performance Zynq Ultrascale+, real-time software compression becomes a realistic option (see the sketch after the table below)
27
Algorithm                          kLUTs/DPM    kFFs/DPM    DSP48/DPM    RAM (Mb)/DPM
Arithmetic Probability Encoding    292 (86%)    120 (18%)   75 (<1%)     22.3 (38%)
Huffman                            143 (43%)    60 (9%)     75 (<1%)     22.3 (38%)

FPGA Resources for 640 channel per DPM compression
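For a feel of what the software path could look like, here is a minimal Python sketch of Huffman-coding a block of 12-bit ADC samples; it is purely illustrative (not the firmware or any DUNE code), and a real implementation would likely code baseline-subtracted differences rather than raw values:

```python
# Illustrative only: toy Huffman coder for 12-bit ADC samples, not DUNE code.
import heapq
from collections import Counter

def huffman_code(samples):
    """Build a Huffman code (symbol -> bitstring) for the given sample block."""
    freq = Counter(samples)
    # Heap items: (frequency, tie-breaker, [(symbol, partial code), ...])
    heap = [(f, i, [(sym, "")]) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                        # degenerate case: one distinct value
        return {heap[0][2][0][0]: "0"}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, lo = heapq.heappop(heap)
        f2, _, hi = heapq.heappop(heap)
        merged = [(s, "0" + c) for s, c in lo] + [(s, "1" + c) for s, c in hi]
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return dict(heap[0][2])

# A noisy-baseline block of 12-bit samples (hypothetical values).
block = [2048, 2049, 2047, 2048, 2050, 2048, 2049, 2048] * 64
code = huffman_code(block)
encoded_bits = sum(len(code[s]) for s in block)
print(f"{12 * len(block)} raw bits -> {encoded_bits} encoded bits")
```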
28
Waveform Extraction
● See slides from JJ Russell here: https://docs.google.com/presentation/d/1XufamuZOdFGkIlHZEw4N8nXMSUEbK9OlhQ9pcAGn4wk/edit?usp=sharing