TRANSCRIPT
January 25, 2018
Matt Graham, Giovanna Lehmann, Mark Convery, Ryan Herbst, Larry Ruckman, JJ Russell
DUNE FD DAQ: ATCA/RCE + FELIX Solution
Felix + ATCA RCE Overview & Responsibilities
2
[System diagram: Front Ends → WIBs → ATCA RCE Cluster → FELIX Cluster → Backend Computing and Trigger Farm; optics run up the shaft]
● Front end data passes through the WIB without concentration: electrical to optical conversion only
● The ATCA RCE Cluster provides filtering, feature extraction & SuperNova buffering (100 s); raw data received at the ATCA RCE RTM is passed directly to the DPMs
● 10 Gbps links between RCE and Felix (underground); buffer for trigger
● FELIX Cluster: Event Builder, Aggregator, L3 triggering
● Trigger Farm: receives trigger primitives, issues trigger decisions
● FE+WIB → RCE: all raw data into the RTM in a custom format (e.g. COLDATA); 8B/10B encoded (probably) at 1.28 Gbps
● RCE → FELIX: all raw data out of the RTM in a custom format (GBT, etc.) of multiplexed data over ~10-12 Gbps links
● FELIX → Backend Computing: triggered raw data over Ethernet on a switched network
● Trigger Path: RCE-extracted primitives go either RCE → FELIX → trigger farm on a separate stream, or directly RCE → trigger farm via Ethernet (shown)
● Lossy Buffer (not shown): RCE-extracted waveforms/time slices go to a lossy buffer, either through FELIX or directly from the RCE
3
Numerology (just CE, RCE, FELIX)
● “Cold” Electronics: 64 channels/COLDATA, 2 COLDATA/FEMB, 4 FEMB/WIB, 20 FEMB / 5 WIBs / APA
○ these are fixed and will never change
● RCE System: 4 DPMs/COB, 1* RCE/DPM, up to 14 COBs/shelf, 1 COB/APA (target)
○ *the current-gen DPM has 2 RCEs/DPM; see later slides
● FELIX System: 2 APAs/FELIX; 2 FELIX/PC (target)
● WIB → RCE Links: 16 links/WIB @ 1.28 Gbps, 80 links/COB/APA
○ assumes a passive WIB
● RCE → FELIX Links: (raw, uncompressed) 2× 10 Gbps links/DPM, 8 links/APA, 16 links/FELIX, 32 links/PC (see the sketch after this list)
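As a cross-check of these counts and the implied bandwidths, here is a minimal Python sketch (a sanity check only, not project code):

```python
# Sanity check of the CE/RCE/FELIX numerology above (illustrative only).
CH_PER_COLDATA, COLDATA_PER_FEMB = 64, 2
FEMB_PER_WIB, WIB_PER_APA = 4, 5

ch_per_femb = CH_PER_COLDATA * COLDATA_PER_FEMB            # 128 channels
ch_per_apa = ch_per_femb * FEMB_PER_WIB * WIB_PER_APA      # 2560 channels

# WIB -> RCE: 16 links/WIB at 1.28 Gbps, 8B/10B encoded
links_per_apa = 16 * WIB_PER_APA                            # 80 links/COB/APA
wib_payload_gbps = links_per_apa * 1.28 * 8 / 10            # ~81.9 Gbps usable

# Raw ADC bandwidth per APA: 2560 ch x 2 MHz x 12 bit
raw_apa_gbps = ch_per_apa * 2e6 * 12 / 1e9                  # 61.44 Gbps

# RCE -> FELIX: 8 x 10 Gbps links per APA (raw, uncompressed)
felix_uplink_gbps = 8 * 10                                   # 80 Gbps capacity

print(ch_per_apa, links_per_apa, round(wib_payload_gbps, 1),
      round(raw_apa_gbps, 2), felix_uplink_gbps)
```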
4
RTM Block Diagram
[Block diagram: ten SNAP12 transmitter/receiver modules plus an SFP+ and a QSFP on the RTM, connecting to the DTM and DPMs; the RTM provides the WIB connections (support for 80 links), the FELIX interface, and timing]
● Experience with high density fiber optic RTMs
● ReflexPhotonics SNAP12 transmitter/receiver supports 10.3125 Gbps per lane: http://reflexphotonics.com/embedded-transceivers/snap12/
Data Flow In ATCA RCEs
5
[Data flow diagram: Front Ends → Filtering → Feature Extraction → SuperNova Pre-Buffer → SuperNova Post-Buffer → Felix Uplink (GBT or PGP) → To Felix; a Timing Interface feeds the chain; Compression & Other Processing is shown as a proposed stage]
● The flexible architecture allows front ends to be allocated across RCEs as needed
○ Simply add more cards and move fibers
● Target is 640 channels per RCE (1× APA per COB) → 5 FEMBs/DPM
○ Numerology is important! 5 WIBs vs 4 DPMs/APA; multiplexing at the WIB (e.g. 2× FEMB links) reduces flexibility
RCE To Felix Uplink
6
● Multiple optical links will be routed between the ATCA RCE platform and the Felix nodes
○ DWDM utilized to maximize uplink bandwidth and provide redundancy
● The ATCA RCE platform will utilize its powerful interconnects to provide a data routing capability
○ Flexible configuration of which data goes to each Felix board
■ Allows the system to adapt to changing data needs (noise, etc.)
■ Some channels can be used for raw data from a subset of the detector
■ SuperNova readout lanes (slow trickle, post trigger)
■ Data can be re-routed to different fibers in the case of a fiber break or Felix board failure! (see the sketch after this list)
○ Link count can be scaled to match system needs
■ Fewer fibers when RCEs do the majority of data processing and event selection
■ More fibers when the computing cluster is needed for data processing
■ One or more fibers per DPM, one fiber per COB, or one fiber per crate
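As an illustration of the re-routing idea, a small Python sketch that reassigns a failed Felix board's fibers to a spare; the link map, node names, and function are hypothetical, not the actual RCE firmware interface:

```python
# Illustrative only: a toy link map showing how traffic from a failed Felix
# board could be reassigned to a spare via the ATCA RCE interconnect.
# Names ("felix0", "felix_spare") and the mapping structure are hypothetical.

link_map = {               # outbound RCE fiber -> destination Felix node
    "cob0/dpm0": "felix0",
    "cob0/dpm1": "felix0",
    "cob0/dpm2": "felix1",
    "cob0/dpm3": "felix1",
}
spares = ["felix_spare"]

def fail_over(link_map, failed, spares):
    """Reassign every fiber that pointed at `failed` to the first available spare."""
    if not spares:
        raise RuntimeError("no spare Felix node available")
    spare = spares.pop(0)
    for fiber, dest in link_map.items():
        if dest == failed:
            link_map[fiber] = spare
    return spare

fail_over(link_map, "felix0", spares)
print(link_map)   # cob0/dpm0 and cob0/dpm1 now route to felix_spare
```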
[Diagram: the RCE Cluster fans out over multiple fibers to several Felix nodes]
Example Data Routing
7
[Diagram: Detector Data enters several Processing RCEs; three Outbound RCEs forward the data over the ATCA RCE interconnect to the Felix boards]
Note: Processing RCEs can also serve as outbound RCEs
Example Data Routing: 2 Active & 1 Spare Felix
8
[Diagram: same topology as the previous slide, with data routed to two active Felix boards while a third Felix board is kept as a spare]
Note: Processing RCEs can also serve as outbound RCEs
Example Data Routing: Felix Failure Or Fiber Break
9
[Diagram: same topology, now with a fiber break / failed Felix board]
Note: Processing RCEs can also serve as outbound RCEs
The ATCA RCE cross connect routes data to the redundant Felix board after a fiber break or Felix board failure!
Example Data Routing: Load Adjustment
10
[Diagram: same topology, with traffic redistributed across the Felix boards]
Note: Processing RCEs can also serve as outbound RCEs
Redundant Felix boards can take on additional load thanks to the flexible data routing in the ATCA RCE!
11
Data Flow On FELIX PCs
Benefits Of Felix + ATCA RCE
12
● The ATCA RCE provides a powerful front end platform for data processing in FPGAs
○ RCE data processing features are covered in the upcoming slides
■ Filtering
■ Feature extraction
■ Neural network processing
○ The RCE could provide SuperNova data buffering (a minute, easily)
○ Proven packaging, cooling, interconnects, high reliability (incl. hot-swap redundant fans and power supplies)
■ Already used for other experiments (LSST, ATLAS CSC, ATLAS IBL, KOTO, etc.): mature design, low risk
● The ATCA RCE interconnect provides the ability to re-route data to Felix nodes on demand
○ Adjust processor load to match the amount of processing needed in the back end
○ Route data around failed uplink fibers
○ Move data from a failed Felix node (or host CPU) to a redundant element
● Felix provides a point-to-point path between the ATCA RCEs and the back end data processing
○ Better flow control model than Ethernet or TCP/UDP over long links
■ Both PGP and GBT provide proven flow control over long link distances
○ Transmitted frames stay at their native size instead of being chunked into small network transfers (Ethernet MTU; see the sketch after this list)
■ Felix has demonstrated high throughput with larger packet sizes
○ End-to-end data integrity
■ GBT and PGP both have data integrity checking in their transport protocols
■ Minimal error handling required in receiving nodes before the data processing layer
■ Test pattern capability over GBT or PGP links
● The back end processing model follows the classic Felix architecture
○ Receive data in the Felix node with a PCI-Express card
○ Back end data processing with CPUs and GPUs
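To make the framing point concrete, here is a small Python sketch; the 5 ms readout window, native frame size, and 1500-byte MTU are assumptions for illustration, not numbers from these slides:

```python
# Illustrative only: the readout window and frame sizes are assumptions.
CHANNELS = 640                   # one RCE's worth of channels
FS_HZ = 2_000_000                # 2 MHz ADC sampling
BITS = 12                        # bits per ADC sample
WINDOW_S = 0.005                 # assumed 5 ms readout window
MTU_BYTES = 1500                 # standard Ethernet MTU
NATIVE_FRAME_BYTES = 1_000_000   # assumed "native" frame size on GBT/PGP

payload_bytes = CHANNELS * FS_HZ * BITS * WINDOW_S / 8   # 9.6 MB for this window
mtu_packets = payload_bytes / MTU_BYTES                   # ~6400 MTU-sized transfers
native_frames = payload_bytes / NATIVE_FRAME_BYTES        # ~10 native frames

print(f"{payload_bytes/1e6:.1f} MB -> {native_frames:.0f} native frames "
      f"vs {mtu_packets:.0f} Ethernet-MTU packets")
```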
13
ATCA RCE Data Processing
Readout Overview
14
15
ATCA RCE Platform Clustering
• The RCE nodes are interconnected through Ethernet
• Each COB contains a low latency 10/40 Gbps Ethernet switch
- Cut-through latency < 300 ns
• The COB supports a full mesh 14-slot backplane
- Each COB has a direct 10 Gbps link to every other COB in a crate
- Any RCE in an ATCA shelf has a maximum of two switches between it and every other RCE
- 14 × 8 = 112 RCEs in a low latency cluster
• A reliable UDP protocol allows direct firmware-to-firmware data sharing (see the sketch after this list)
• Allows for low latency data sharing between nodes
- APA combining and edge channel data sharing
- Neural network data sharing
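A minimal plain-UDP sketch of node-to-node data sharing; this is not the COB firmware's reliable-UDP protocol, and the peer address, port, and payload layout are made up for illustration:

```python
# Illustrative only: plain UDP, not the COB firmware's reliable-UDP protocol.
# The peer address, port number, and "edge channel" payload layout are hypothetical.
import socket
import struct

PEER = ("192.168.2.11", 8192)   # hypothetical address of a neighbouring RCE

def share_edge_channels(samples):
    """Send a block of 12-bit edge-channel samples (stored as uint16) to a peer RCE."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    header = struct.pack("!HI", len(samples), 0)          # sample count + sequence number
    payload = struct.pack(f"!{len(samples)}H", *samples)  # one uint16 per sample
    sock.sendto(header + payload, PEER)
    sock.close()

share_edge_channels([0x0123, 0x0456, 0x0789])
```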
[Diagram: four COBs, each with four DPMs (DPM 0-3), an Ethernet switch, and a DTM, fully meshed across the backplane; one off-shelf link]
16
Oxford Design: Revision C01
● ZYNQ: XCZU15EG-1FFVB1156E
● PL DDR4: 8 GB on DPM
● PS DDR4: 8 GB on DPM
● M.2 NVMe: 512 GB on DPM
○ Located above the DPM's DDR ICs
● Dimensions: 85.09 mm x 110 mm
○ Increased by 1.27 mm for the NVMe
[Board photo with labels: XCZU15EG-1FFVB1156E, DDR4 ICs, M.2 NVMe, SD memory card, JTAG]
17
DPM Redesign for DUNE
● Oxford/SLAC collaboration
● Optimized for large memory buffering on the DPM
● Only 24 GT channels on this FPGA
○ 20 of 24 GTs for the FEBs:
■ 80 links/COB @ 1.28 Gbps (8B/10B)
○ 2 of 24 GTs for the ETH SW:
■ two separate 10 GbE links (10 Gbps/lane, 64B/66B) to the ETH SW
○ 2 of 24 GTs for the Felix:
■ 2 RX lanes and up to 22 TX lanes
● Able to support redundant Felix connections
■ 20 Gb/s @ 2 lanes (10 Gbps/lane, 64B/66B)
[Memory map labels: SuperNova Pre-Buffer, SuperNova Post-Buffer, Linux Kernel + SuperNova Pre-Buffer, Boot Memory]
Unused FEB TX lanes can be used to increase bandwidth to Felix
Backup Slides
19
ATCA Packaging for DUNE
● 1 APA = 2560 channels
● 1 APA per COB
○ 4 DPMs per COB
○ 640 channels per DPM
● 150 APAs for the entire system = 150 COBs
● Total rack space: 165U (see the sketch after this list)
○ 11× 14-slot ATCA crates
○ 15U per 14-slot ATCA crate
■ http://www.asis-pro.com/maxum-atca-systems/14-Slot-14U-MaXum-460
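A quick Python check of the packaging arithmetic above (illustrative only):

```python
# Sanity check of the packaging numbers (illustrative only).
import math

APAS = 150                       # APAs in the full detector
COBS = APAS                      # 1 COB per APA
SLOTS_PER_CRATE = 14
CRATE_HEIGHT_U = 15

crates = math.ceil(COBS / SLOTS_PER_CRATE)    # 11 crates
rack_space_u = crates * CRATE_HEIGHT_U        # 165U
print(crates, rack_space_u)
```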
20
ATCA Power/Cooling Estimates for DUNE
● COB max power: 300 W
○ ~100 W for the ETH SW
○ 36 W for the RTM (limited by a 3 A fuse)
○ 160 W for digital processing
■ 40 W per DPM
● Total max power: 45 kW (see the sketch after this list)
● Cooling via forced air (integrated into the ATCA platform)
● Power and thermal monitoring via the standard IPMI interface
● Example of an ATCA crate that supports 400 W per slot
○ http://www.asis-pro.com/maxum-atca-systems/14-Slot-14U-MaXum-460
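A quick Python check of the power budget (illustrative only):

```python
# Sanity check of the power budget (illustrative only).
ETH_SW_W, RTM_W, DPM_W, DPMS = 100, 36, 40, 4

cob_w = ETH_SW_W + RTM_W + DPM_W * DPMS      # ~296 W, quoted as 300 W max per COB
total_kw = 300 * 150 / 1000                   # 150 COBs at the 300 W maximum
print(cob_w, total_kw)                        # 296, 45.0 kW
```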
21
ATCA Costs for DUNE (Updated for quantity)
● RCE cost estimate:
○ Upgraded DPM + Flash: $2.5K
○ Upgraded COB: $4K
○ RTM: $2K
○ DTM: $1K
○ ~$17K/unit
● 14-slot ATCA crates, in quantity, 2019
○ ~$7K/unit
○ IPMI + shelf manager + 10GbE/40GbE backplane + fans + power supplies
● Total ATCA hardware cost: $2.7M (see the sketch after this list)
○ 11× ATCA crates
○ 150× RCE ATCA slots
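The per-slot and total figures are consistent if the DPM cost is counted four times per COB (an inference from the 4 DPMs/COB stated earlier, not spelled out on this slide); a quick check:

```python
# Cost roll-up (illustrative; assumes the $2.5K DPM figure applies per DPM, x4 per COB).
dpm_k, cob_k, rtm_k, dtm_k = 2.5, 4, 2, 1
slot_k = dpm_k * 4 + cob_k + rtm_k + dtm_k             # 17 -> "~$17K/unit"

crate_k, crates, slots = 7, 11, 150
total_m = (slot_k * slots + crate_k * crates) / 1000   # ~2.63 -> "~$2.7M"
print(slot_k, round(total_m, 2))
```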
22
Packaging And Architecture Thoughts
23
ATCA Components
Proven standard, built to be robust and reliable, and fully monitored
[Photo labels: air intake filter, intake fans, exit fans, power supply (DC or AC input), shelf manager, application card]
● Telecom standard designed for “5 nines” uptime
● Almost all components can be replaced in the field
● Redundancy is available if desired
○ N + 1 redundancy for power supplies
○ Redundant shelf managers
● The system is designed to handle one fan failure in each fan tray
○ The shelf manager generates an alarm to request fan tray replacement
24
ATCA Provides Management & Monitoring Features Required In Reliable & Maintainable DAQ Designs
[Diagram labels: shelf managers, Ethernet, console, power supplies, fan trays, EEPROMs, IPMC]
● ATCA uses IPMI for management purposes
○ Intelligent Platform Management Interface
● Manages and monitors all shelf based components
○ Power supply status and power
○ Shelf inlet and exit temperatures
○ Fan speed control and monitoring
○ Application card control and monitoring
● Redundant EEPROMs contain all shelf information
○ Shelf serial number, location and ID
○ Shelf manager IP/MAC address
● The application card hosts the IPMC
○ Intelligent Platform Management Controller
● The IPMC hosts all application card information in a local EEPROM
○ MAC addresses
○ Serial number, card type & revision
Supernova Buffering In Two Stages (Update)
● The pre-trigger buffer stores data in a ring buffer waiting for a supernova trigger (see the sketch after this list)
○ 640 channels per RCE (1× APA per COB)
○ 2 MHz ADC sampling rate
○ 12 bits per ADC
○ Raw bandwidth: 15.36 Gbps (1.92 GB/s)
■ 640 × 2 MHz × 12 b
○ Each DPM has 16 GB RAM:
■ 9.6 TB of DDR4 RAM for the whole system across 150× COBs
○ Total memory for supernova “pre-buffering”: 15 GB
■ PL 8 GB + PS 7 GB (1 GB for kernel & OS)
○ Without compression: 7.8 seconds of pre-trigger buffer
■ Assuming 12-bit packing to remove the 4-bit overhead of packing into bytes
● The post-trigger buffer stores data in a flash based SSD before the backend DAQ
○ The write sequence occurs once per supernova trigger: low write wearing over the experiment lifetime
○ Low bandwidth background readout post trigger: does not impact normal data taking
○ ~$180K for NVMe M.2 SSD buffering (150× COBs × 4 DPMs/COB × $300/NVMe)
○ 512 GB/DPM = 266 second post-trigger buffer
○ Samsung NVMe SSD 960 PRO: sequential write up to 2.1 GB/s
■ SSD write bandwidth matches well with 640 channels of uncompressed data
NOTE: NO compression factor is applied on this slide (only RAW bandwidths)
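A short Python check of the buffer-depth arithmetic above (illustrative only):

```python
# Sanity check of the supernova buffer depths (illustrative only).
CHANNELS, FS_HZ, BITS = 640, 2_000_000, 12           # per DPM (1 APA per COB, 4 DPMs)

rate_gbs = CHANNELS * FS_HZ * BITS / 8 / 1e9         # 1.92 GB/s raw, 12-bit packed
pre_buffer_s = 15 / rate_gbs                          # 15 GB DDR4 -> ~7.8 s
post_buffer_s = 512 / rate_gbs                        # 512 GB NVMe -> ~266 s
system_ddr_tb = 16 * 4 * 150 / 1000                   # 16 GB/DPM x 4 DPMs x 150 COBs = 9.6 TB

print(round(rate_gbs, 2), round(pre_buffer_s, 1), int(post_buffer_s), system_ddr_tb)
```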
Zynq Ultrascale+ and M.2 SSD Performance
● Benchmarked read/write bandwidth into a Samsung NVMe SSD 960 PRO with the ZYNQ PS PCIe root complex interface
● M.2 SSD mounted and formatted as an EXT4 hard drive
● Running on ArchLinux
● Measuring ~1.6 GB/s for reading/writing dummy data generated by the CPU
○ Limited by the Zynq's PCIe Gen2 x4 lane interface (theoretical limit: 2.0 GB/s)
■ Not limited by the M.2 SSD's controller
● Because the input bandwidth of 1.92 GB/s exceeds the 1.6 GB/s SSD write speed, we would be able to buffer for 37 seconds in DDR before 100% back pressure (see the sketch after this list)
● Some amount of compression is needed before the SSD to prevent bottlenecking at the SSD
● This is a very simple test with only one process
○ Need to stress test other interfaces in parallel with the SSD to confirm the rate is still 1.6 GB/s
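A simple Python sketch of the back-pressure estimate; the ~12 GB of usable DDR is an assumption that reproduces the 37 s quoted above, since the slide does not state the exact amount of DDR available for absorbing the excess:

```python
# Illustrative back-pressure estimate. The ~12 GB usable-DDR figure is an
# assumption that reproduces the slide's 37 s; the slide does not state it.
in_rate_gbs = 1.92      # raw input: 640 ch x 2 MHz x 12 bit
ssd_rate_gbs = 1.60     # measured SSD write bandwidth
usable_ddr_gb = 12      # assumed DDR available for absorbing the excess

fill_time_s = usable_ddr_gb / (in_rate_gbs - ssd_rate_gbs)   # ~37.5 s
print(round(fill_time_s, 1))
```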
26
Optional Compression
● Past development has shown that firmware compression can be costly in FPGA resources
● If compression is done in firmware, a minimal LUT footprint would be required
● With the high performance Zynq Ultrascale+, real-time software compression becomes a realistic option (see the sketch after the table below)
27
Algorithm                          kLUTs/DPM    kFFs/DPM    DSP48/DPM    RAM (Mb)/DPM
Arithmetic Probability Encoding    292 (86%)    120 (18%)   75 (<1%)     22.3 (38%)
Huffman                            143 (43%)    60 (9%)     75 (<1%)     22.3 (38%)

FPGA Resources for 640 channel per DPM compression
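For a feel of what the software path could look like, here is a minimal Python sketch of Huffman-coding a block of 12-bit ADC samples; it is purely illustrative (not the firmware or any DUNE code), and a real implementation would likely code baseline-subtracted differences rather than raw values:

```python
# Illustrative only: toy Huffman coder for 12-bit ADC samples, not DUNE code.
import heapq
from collections import Counter

def huffman_code(samples):
    """Build a Huffman code (symbol -> bitstring) for the given sample block."""
    freq = Counter(samples)
    # Heap items: (frequency, tie-breaker, [(symbol, partial code), ...])
    heap = [(f, i, [(sym, "")]) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                        # degenerate case: one distinct value
        return {heap[0][2][0][0]: "0"}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, lo = heapq.heappop(heap)
        f2, _, hi = heapq.heappop(heap)
        merged = [(s, "0" + c) for s, c in lo] + [(s, "1" + c) for s, c in hi]
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return dict(heap[0][2])

# A noisy-baseline block of 12-bit samples (hypothetical values).
block = [2048, 2049, 2047, 2048, 2050, 2048, 2049, 2048] * 64
code = huffman_code(block)
encoded_bits = sum(len(code[s]) for s in block)
print(f"{12 * len(block)} raw bits -> {encoded_bits} encoded bits")
```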
28
Waveform Extraction
● See slides from JJ Russell here: https://docs.google.com/presentation/d/1XufamuZOdFGkIlHZEw4N8nXMSUEbK9OlhQ9pcAGn4wk/edit?usp=sharing