
Data Acquisition at CBM (FAIR)

Walter F.J. Müller, GSI, Darmstadt, for the CBM Collaboration

IRTG Lecture Week 2007 Bergen, Norway, 11-15 April 2007


2

Outline

• FAIR (very briefly)
• CBM (briefly)
    observables
    setup
• FEE/DAQ/Trigger requirements/challenges
• 3 case studies
    n (cbm) -XYTER
    InfiniBand
    Cell and SIMDization


3

Facility for Antiproton and Ion Research

storage and cooler rings
• beams of rare isotopes
• e-A collider
• 10¹¹ stored and cooled antiprotons, 0.8 - 14.5 GeV

primary beams
• 5·10¹¹/s; 1.5-2 GeV/u; ²³⁸U²⁸⁺
• factor 100-1000 increased intensity
• 4×10¹³/s 90 GeV protons
• 2×10⁹/s ²³⁸U 35 GeV/u (Ni 45 GeV/u)

secondary beams
• rare isotopes 1.5 - 2 GeV/u; factor 10 000 increased intensity
• antiprotons 3(0) - 30 GeV

accelerator technical challenges
• rapidly cycling superconducting magnets
• high energy electron cooling
• dynamical vacuum, beam losses


4

Research Programs

Rare isotope beams: nuclear structure and nuclear astrophysics
• nuclear structure far off stability
• nucleosynthesis in stars and supernovae

Beams of antiprotons: hadron physics
• quark-confinement potential
• search for gluonic matter and hybrids
• hypernuclei

Nucleus-nucleus collisions: compressed baryonic matter
• baryonic matter at highest densities (neutron stars)
• phase transitions and critical endpoint
• in-medium properties of hadrons

Short-pulse heavy ion beams: plasma physics
• matter at high pressure, densities, and temperature
• fundamentals of nuclear fusion

Atomic physics, FLAIR, and applied research
• highly charged atoms
• low energy antiprotons
• radiobiology


5

Baseline Technical Report

Volume 1   Executive Summary
Volume 2   Accelerator and Scientific Infrastructure
Volume 3A  Experiment Proposals on QCD Physics
             3.1 CBM
Volume 3B  Experiment Proposals on QCD Physics
             3.2 PANDA, 3.3 PAX, 3.4 ASSIA
Volume 4   Experiment Proposals on Nuclear Structure & Astro Physics (NUSTAR)
             4.1 LEB-SuperFRS, 4.2 HISPEC/DESPEC, 4.3 MATS, 4.4 LASPEC, 4.5 R3B,
             4.6 ILIMA, 4.7 AIC, 4.8 ELISe, 4.9 EXL
Volume 5   Experiment Proposals on Atomic, Plasma & Applied Physics (APPA)
             5.1 SPARC, 5.2 HEDgeHOB, 5.3 WDM, 5.4 FLAIR, 5.5 BIOMAT
Volume 6   Civil Construction and Safety

Official project description: 6 volumes with more than 3500 pages and more than 2600 authors


6

Status

• Construction cost of the FAIR project: ~ 1 billion €
• International project: 25% from foreign partners
• German Federal Government has approved construction budget
• FAIR is in budget plan for the next 10 years
• MoU to build and operate FAIR signed by 14 states:
    AU CN DE ES FI FR GB GR IN IT PL RO RU SE
• FAIR Joint Core Team established
• Next steps (within 12 months):
    Joint declarations with partner states
    Conclude negotiations with partner states
    Signature of FAIR Convention and Final Act
    Formation of FAIR GmbH


7

Construction

Time lines: Start: 2008; Finish: 2015


8

CBM


9

CBM Physics case

Compressed Baryonic Matter @ FAIR: high μB, moderate T

searching for the landmarks of the QCD phase diagram
• first order deconfinement phase transition
• chiral phase transition
• QCD critical point

Investigate:
• A+A 10-45 AGeV
• p+A 10-90 GeV

Physics program complementary to ALICE


10

CBM Physics Topics and Observables

Equation of state at high ρB
• collective flow of hadrons
• particle production at threshold energies
    measure: D0, D±

Deconfinement phase transition at high ρB
• excitation function and flow of strangeness
    measure: K, Λ, Σ, Ξ, Ω
• excitation function and flow of charm
    measure: J/ψ, ψ', D0, D±, Λc
• sequential melting of J/ψ and ψ', charmonium suppression
    measure: J/ψ, ψ'

QCD critical point
• excitation function of event-by-event fluctuations
    measure: π, K

Onset of chiral symmetry restoration at high ρ
• in-medium modifications of hadrons
    measure: ρ, ω, φ → e+e- or μ+μ-

CBM Physics Book in preparation


11

CBM Detector Requirements

measure: π, K

measure: K, Λ, Σ, Ξ, Ω

measure: D0, D±, Ds, Λc

measure: J/ψ, ψ' → e+e- or μ+μ-


measure: γ

Hadron identification

Vertex detector

Good e/π separation

Good μ/π separation

Low cross sections

→ High rates

→ Selective Triggers

Hadrons

Leptons

Photons

Simulations indicate: J/ψ, ψ' better in μ+μ-; ρ, ω, φ better in e+e-

try to do both


12

CBM Layout (e-mode)

[Figure: CBM detector layout in electron mode; scale bars 1 m and 10 m; acceptance lines at 50 mrad and 500 mrad; labelled subsystems MVD, STS, RICH, TRD, TOF, ECAL, PSD]

STS   Silicon tracker
MVD   Micro vertex detector
RICH  Ring Imaging Cherenkov detector
TRD   Transition Rad. Detector
TOF   Time-of-flight (RPC)
ECAL  Photon arms
PSD   Projectile spectator detector


13

CBM Layout (μ-mode)

[Figure: CBM detector layout in muon mode; scale bars 1 m and 10 m; acceptance lines at 50 mrad and 500 mrad; labelled subsystems MVD, STS, MUCH, RICH, TRD, TOF, ECAL, PSD]

STS   Silicon tracker
MVD   Micro vertex detector
MUCH  Muon Chambers
TRD   Transition Rad. Detector
TOF   Time-of-flight (RPC)
ECAL  Photon arms
PSD   Projectile spectator detector


14

CBM Layout {ongoing tuning}

[Figure: CBM detector layout variant under study; scale bars 1 m and 10 m; acceptance lines at 50 mrad and 500 mrad; labelled subsystems MVD, STS, ITS, RICH, TRD, TOF, ECAL, PSD]

STS   Silicon tracker
MVD   Micro vertex detector
RICH  Ring Imaging Cherenkov detector
TRD   Transition Rad. Detector
ITS   Intermediate Tracker
TOF   Time-of-flight (RPC)
ECAL  Photon arms
PSD   Projectile spectator detector


15

The CBM Experiment

aim: optimize the setup to include both electron and muon ID (not necessarily simultaneously)

• tracking, momentum determination, vertex reconstruction: MVD + STS, radiation hard silicon pixel/strip detectors in a magnetic dipole field
• electron ID: RICH & TRD, π suppression 10⁴
• muon ID: absorber + detector sandwich; move out absorbers for hadron runs
• hadron ID: TOF (& RICH)
• photons, π0, μ: ECAL
• PSD for event characterization
• high speed DAQ and trigger → rare probes!


16

CBM and HADES 2005 → 2007

All you want to know about CBM:

Technical Status Report (400 p): http://www.gsi.de/documents/DOC-2005-Feb-447-1.pdf

CBM Progress Report 2006 (60 p): http://www.gsi.de/documents/DOC-2007-Mar-137-1.pdf


17

Meson Production in central Au+Au

W. Cassing, E. Bratkovskaya, A. Sibirtsev, Nucl. Phys. A 691 (2001) 745

10 MHz interaction rate needed for 10-15 A GeV

SIS300


18

CBM Trigger Requirements

measure: π, K

measure: K, Λ, Σ, Ξ, Ω

measure: D0, D±, Ds, Λc

measure: J/ψ, ψ' → e+e- or μ+μ-


measure: γ

Hadrons

Leptons

Photons

trigger <10 AGeV

trigger

trigger e+e-

offline

offline >10 AGeV

offline ?

offline for e+e-

trigger for μ+μ- ?

assume archive rate: few GB/sec, 20 kevents/sec

trigger on high pt e+e- pair

trigger on displaced vertex

drives FEE/DAQ architecture

trigger μ+μ-

μ identification


19

Open Charm Detection

Example: D0 → K-π+ (3.9%; cτ = 124.4 μm)
• reconstruct tracks
• find primary vertex
• find displaced tracks
• find secondary vertex

[Figure: target and first two planes of the vertex detector; labels: few 100 μm, 5 cm]

high selectivity because combinatorics is reduced


20

A Typical Au+Au Collision

Central Au+Au collision at 25 AGeV (URQMD + GEANT):

160 p, 170 n, 360 π-, 330 π+, 360 π0, 41 K+, 13 K-, 42 K0

up to 10⁷ Au+Au interactions/sec

10⁹ tracks/sec to reconstruct for first level event selection


21

CBM DAQ Requirements Profile

• D and J/ψ signals drive the rate capability requirements
• D signal drives FEE and DAQ/Trigger requirements
• Problem similar to B detection, like in LHCb or BTeV (rip)

Adopted approach:
• displaced vertex 'trigger' in first level, like in BTeV (rip)

Additional problem:
• DC beam → interactions at random times
    → time stamps with ns precision needed
    → explicit event association needed

Current design for FEE and DAQ/Trigger:
• Self-triggered FEE
• Data-push architecture
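
To make "interactions at random times" concrete, here is a minimal sketch that assumes Poisson-distributed interaction times at the 10 MHz average rate quoted in this talk. The Poisson assumption is an idealisation introduced here (a later slide notes that extracted synchrotron beams are notoriously non-Poissonian), and the function name is hypothetical.

```python
import math

def prob_next_within(dt_ns, rate_hz):
    """Probability that the next interaction follows within dt_ns,
    assuming Poisson-distributed interaction times (an idealisation)."""
    return 1.0 - math.exp(-rate_hz * dt_ns * 1e-9)

RATE_HZ = 1e7  # 10 MHz average interaction rate, as quoted in the talk

for dt in (1, 10, 100):
    p = prob_next_within(dt, RATE_HZ)
    print(f"next interaction within {dt:>3} ns: {p:.3f}")
```

Under this assumption roughly one interaction in ten is followed by another within 10 ns, which illustrates why hits need ns-level time stamps before they can be associated with events.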


22

Conventional FEE-DAQ-Trigger Layout

[Figure: block diagram with labels Detector, Cave, Shack, FEE, Buffer, L0 Trigger (f_bunch), L1 Trigger, L1 Accept, L2 Trigger, DAQ, Archive]

Annotations:
• Trigger primitives
• Especially instrumented detectors
• Dedicated connections
• Specialized trigger hardware
• Limited capacity
• Limited L1 trigger latency
• Modest bandwidth


23

Limits of Conventional Architecture

• Decision time for first level trigger limited
    typ. max. latency 4 μs for LHC
• Only especially instrumented detectors can contribute to first level trigger
• Large variety of very specific trigger hardware
• Not suitable for complex global triggers like secondary vertex search
• Limits future trigger development
• High development cost


24

The way out .. use Data Push Architecture

[Figure, build 1 of 3: the conventional layout (Detector, Cave, Shack, FEE, Buffer, L0/L1/L2 Trigger, L1 Accept, f_bunch, DAQ, Archive, with its annotations) overlaid with the new elements: f_clock, time distribution, FIFO buffer, high bandwidth, L1 Trigger, L2 Trigger]


25

The way out .. use Data Push Architecture

[Figure, build 2 of 3: data-push chain with labels Detector, Cave, Shack, FEE, f_clock, high bandwidth, Buffer, L1 Trigger, L2 Trigger, DAQ, Archive]


26

The way out .. use Data Push Architecture

[Figure, build 3 of 3: data-push chain with labels Detector, Cave, Shack, FEE, f_clock, high bandwidth, Buffer, L1 Select, L2 Select, DAQ, Archive]

• Self-triggered front-end: autonomous hit detection
• No dedicated trigger connectivity: all detectors can contribute to L1
• Large buffer depth available: system is throughput-limited and not latency-limited
• Use term: Event Selection


27

Front-End for Data Push Architecture

• Each channel detects autonomously all hits
• An absolute time stamp, precise to a fraction of the sampling period, is associated with each hit
• All hits are shipped to the next layer (usually concentrators)
• Association of hits with events is done later using time correlation

Typical parameters, with a few % occupancy and 10⁷/s interaction rate:
• some 100 kHz channel hit rate
• few MByte/sec per channel
• whole CBM detector: 1 TByte/sec
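
The typical parameters above follow from simple arithmetic. The sketch below assumes 3% occupancy and 8 bytes per hit; both values are illustrative choices made here, since the slide only says "a few %" and "few MByte/sec".

```python
def channel_hit_rate(interaction_rate_hz, occupancy):
    """Mean hit rate of one channel if each interaction fires it
    with probability `occupancy`."""
    return interaction_rate_hz * occupancy

def channel_data_rate(hit_rate_hz, bytes_per_hit):
    """Mean data rate of one channel in bytes per second."""
    return hit_rate_hz * bytes_per_hit

INTERACTION_RATE_HZ = 1e7   # from the slide
OCCUPANCY = 0.03            # assumption: "a few %"
BYTES_PER_HIT = 8           # assumption: time stamp + address + amplitude

hit_rate = channel_hit_rate(INTERACTION_RATE_HZ, OCCUPANCY)
data_rate = channel_data_rate(hit_rate, BYTES_PER_HIT)
print(f"hit rate per channel : {hit_rate / 1e3:.0f} kHz")
print(f"data rate per channel: {data_rate / 1e6:.1f} MByte/sec")
```

With these assumed inputs the result lands at a few 100 kHz and a few MByte/sec per channel, in line with the numbers on the slide.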


28

Typical Self-Triggered Front-End

• Average 10 MHz interaction rate
• Not periodic like in a collider
• On average 100 ns event spacing

[Figure: sampled signal, amplitude vs. time (time axis 0 to 30, amplitude ticks 50 and 100) with a threshold line; two hits found: a: 126, t: 5.6 and a: 114, t: 22.2]

• Use a sampling ADC on each detector channel, running with an appropriate clock
• Time is determined to a fraction of the sampling period

2005
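
One way to obtain a time "to a fraction of the sampling period" is to interpolate the leading-edge threshold crossing between two samples. The sketch below does exactly that and reports the peak sample as amplitude. The estimator is an assumption chosen for illustration; the slide does not say which estimator the front-end actually uses.

```python
def find_hits(samples, threshold):
    """Return a list of (amplitude, time) for each pulse above threshold.

    time: leading-edge threshold crossing in units of the sampling
          period, linearly interpolated between two samples.
    amplitude: largest sample of the pulse.
    """
    hits = []
    n = len(samples)
    i = 1
    while i < n:
        if samples[i - 1] < threshold <= samples[i]:
            # fraction of a sampling period at which the edge crosses
            frac = (threshold - samples[i - 1]) / (samples[i] - samples[i - 1])
            t = (i - 1) + frac
            # follow the pulse until it drops below threshold again
            peak = samples[i]
            j = i
            while j < n and samples[j] >= threshold:
                peak = max(peak, samples[j])
                j += 1
            hits.append((peak, t))
            i = j
        else:
            i += 1
    return hits

# Hypothetical sampled waveform with two pulses
waveform = [0, 0, 10, 30, 20, 5, 0, 0, 40, 10, 0]
print(find_hits(waveform, threshold=20))
```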


29

Toward Multi-Purpose FEE Chain

Detectors: Pad, GEMs, PMT, APDs

Chain: PreAmp → preFilter → ADC → digital Filter → Hit Finder → Backend & Driver

• preFilter: analog anti-aliasing filter
• ADC: sample rate 10-100 MHz; dyn. range 8...12 bit
• digital Filter: digital 'shaping'; 1/t tail cancellation; baseline restorer
• Hit Finder: hit parameter estimators (amplitude, time)
• Backend & Driver: clustering; buffering; link protocol

All potentially in one mixed-signal chip

2005


30

Self-triggered DAQ - View beyond HEP

Using a 'trigger' is considered natural in the HEP world
• large and complex events
• one wants to select events of interest

In other areas, using the same detector technologies, the reverse is true
• very simple events (single hits, 2-fold coincidence)
• no trigger possible (e.g. singles), or needed (e.g. all events relevant)
• raw statistics is the key factor

Examples:
• thermal neutron scattering
• PET scanners


31

Detection of thermal neutrons

• key observable is the θ-φ distribution of scattered neutrons
• to detect a neutron, use a converter
    157Gd:  157Gd + n → 158Gd + γ's + ce's   [σ = 255000 b; λn = 1.3 μm]
    10B:    10B + n → 7Li + α + ~2 MeV       [σ = 3838 b; λn = 20 μm]
• to determine θ-φ, combine the converter with a position sensitive detector

Example:

[Figure: 157Gd converter (2 μm) with two Si-strip detectors (300 μm each), one with strips in x direction and one with strips in y direction; the incoming neutron produces conversion electrons]

neutron hit ↔ coincidence of X and Y strip


32

The DETNI Project

DETNI - Detectors for Neutron Instrumentation

• Hahn Meitner Institut, Berlin
• GSI, Darmstadt
• Phys. Inst., Univ. Heidelberg
• Forschungszentrum Jülich
• AGH Univ. of Science and Tech., Krakow
• IFN-Polish Academy of Sciences, Krakow
• INFN Milano
• INFN Perugia

Mission: Develop and prototype three different advanced area sensitive detector systems, including read-out ASIC, within EU FP-6
• Gd / Si-Microstrip
• Gd-CsI / MSGC
• B / GEM (CASCADE)

Goals:
• very high rates (100 MHz)
• mm and sub-mm resolution
• highest possible detection efficiency

Read-out ASIC common for silicon and gas detectors


33

DETNI-A 157Gd/Si Microstrip Detector

• Eth (5σ)        10 keV
• ENC             550 e- (20 - 30 pF)
• cps (global)    2.5 × 10⁷
• cps/strip       7.5 × 10⁴
• δT (x/y)        4 ns
• Size            51 mm × 51 mm
• No. of strips   640
• Pitch           80 µm
• δx              50 - 100 µm
• ES              29 - 250 keV

Caterina Petrillo et al., Perugia and Milano
slide courtesy C.J. Schmidt


34

DETNI-A 157Gd/Si Detector Module

slide courtesy C.J. Schmidt

[Figure: detector module, 100 mm scale]

Goals
• 10⁸ n/sec in 100 cm²
• with 2 views, 2 hit/strip: 400 MHz strip hit rate
• with 5 Byte/hit: 2 GByte/sec data

Consequences
• 128 channel ASIC
• 20 chip/module
• 20 MHz/chip
• 100 MByte/sec per chip
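
The chain from goals to consequences is plain multiplication and division. The sketch below only re-derives the slide's numbers; the variable names are chosen here for illustration.

```python
NEUTRON_RATE = 100_000_000   # 10^8 n/sec on the module (slide)
VIEWS = 2                    # x and y view (slide)
HIT_FACTOR = 2               # the "2 hit/strip" factor on the slide
BYTES_PER_HIT = 5            # slide
CHIPS_PER_MODULE = 20        # slide

strip_hit_rate = NEUTRON_RATE * VIEWS * HIT_FACTOR       # hits/sec
module_data_rate = strip_hit_rate * BYTES_PER_HIT        # bytes/sec
chip_hit_rate = strip_hit_rate // CHIPS_PER_MODULE       # hits/sec per chip
chip_data_rate = module_data_rate // CHIPS_PER_MODULE    # bytes/sec per chip

print(f"strip hit rate  : {strip_hit_rate // 10**6} MHz")
print(f"module data rate: {module_data_rate // 10**9} GByte/sec")
print(f"per chip        : {chip_hit_rate // 10**6} MHz, "
      f"{chip_data_rate // 10**6} MByte/sec")
```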


35

DETNI-C CASCADE: Boron on GEM Multilayer

[Figure: detector cross-sections. Recoverable labels: i-butane at 20 mbar; grid; CsI extraction grid; columnar CsI (2 × ~1 µm); 157Gd (2 × 0.5 - 1.5 µm) on support foil as (n,e-) converter cathode; anodes and cathodes; 635 µm; 570 mm; FWHM (diffusion)]

slide courtesy C.J. Schmidt

• GEMs can be operated to be transparent for charges!
• they can be cascaded!
• Each one can carry two Boron layers
• Last one operated as amplifier
• Cumulate 5% single layer detection efficiency to give 50% for thermal neutrons (1.8 Å): need 10 cascaded GEM foils

Christian Schmidt et al., Heidelberg, Darmstadt

DETNI-B is Gd/CsI + MSGC
B. Gebauer et al., HMI


36

n-XYTER - The DETNI Readout ASIC

Name
• n-XYTER: Neutron - X, Y, Time and Energy ... R

Front-end
• 128 (32) channels, charge sensitive pre-amplifier, both polarities
• 30 pF detector capacitance, ENC 1000 e
• self-triggered, autonomous hit detection
• time stamping with 1 ns resolution (needed to correlate x-y views)

Readout
• energy (peak height) and time information for each hit
• data driven, de-randomizing, sparsifying readout
• 32 MHz average hit rate
    128 channel version (Si, GEM): ~ 250 kHz hit / channel
    32 channel version (MSGC): ~ 1 MHz hit / channel

Goal: Common readout ASIC for silicon strip and gas detectors


37

n-XYTER Architecture

[Figure: channel block diagram with PreAmp, fast shaper and slow shaper (peaking time: fast 20 ns, slow 140 ns), comparator with time-walk compensation, peak detect & hold, time stamp counter (1 ns step), digital FIFO, analog FIFO, token cell, token manager, output drivers; 32 MHz readout rate]

self-trigger: latch amplitude, latch time, store in FIFOs


38

Token Ring Readout

[Figure: token cells, each with a digital FIFO and an analog FIFO, connected in a ring with the token manager and the output drivers]

Token cell processes: on token, check for data; either initiate readout in the clock cycle or pass the token forward
• Token passes asynchronously from channel to channel in search of data
• Within one clock cycle the token could pass through all channels
    use 2-stage logic design to keep the logic path short and allow a scan of 128 channels in one clock cycle
• If the token encounters an occupied channel, data readout is initiated (1 clock cycle)
• After readout of one hit the token passes to the next occupied channel
• Token manager ensures that one and only one token is circulating
• Readout clock: 32 MHz
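
The readout rules above can be tried out in a small behavioural model. This is a sketch under simplifying assumptions made here (unlimited FIFO depth, hit times given in readout clock cycles, token starting at channel 0), not a description of the actual chip logic.

```python
from collections import deque

def token_ring_readout(hits, n_channels):
    """Simulate token ring readout.

    hits: iterable of (time, channel), time in readout clock cycles.
    Returns the hits as (time, channel) in the order they are read out.
    """
    arrivals = sorted(hits)
    fifos = [deque() for _ in range(n_channels)]
    out = []
    token = 0      # channel at which the token starts its search
    cycle = 0
    k = 0
    while k < len(arrivals) or any(fifos):
        # latch all hits that have occurred up to this cycle
        while k < len(arrivals) and arrivals[k][0] <= cycle:
            t, c = arrivals[k]
            fifos[c].append(t)
            k += 1
        # the token may pass all channels within one cycle and
        # stops at the first occupied one; one hit is read per cycle
        for step in range(n_channels):
            c = (token + step) % n_channels
            if fifos[c]:
                out.append((fifos[c].popleft(), c))
                token = (c + 1) % n_channels
                break
        cycle += 1
    return out

# Hypothetical hit pattern on a 12-channel ring
example_hits = [(0, 5), (0, 6), (0, 11), (1, 8), (1, 9), (2, 7), (3, 0)]
for t, c in token_ring_readout(example_hits, 12):
    print(f"t={t},c={c}")
```

The printed sequence is not sorted in time even though every channel FIFO is, which is the price paid for the fair, sparsifying scan.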


39

Token Ring: Architectural Pros & Cons

Pros
• handles sparsification
• provides, together with the FIFO, de-randomization
    100% of output bandwidth can be used for data
• automatic bandwidth focusing
    one (or few) channels can use all the bandwidth
• fair distribution of bandwidth
    most busy channel will lose data first in case of overload
    a hot channel will not block readout of the rest (only fills the bandwidth)

Cons
• data is not time sorted at output
    usually re-sorting is needed somewhere in the readout and processing chain


40

Token Ring at Work

[Animation over readout cycles T=0 to T=10: a ring of 12 channels (0 to 11); hits (h) arrive on several channels during T=0 to T=3 and wait in the channel FIFOs; in each cycle the active token moves on from its position in the last cycle to the next occupied channel and one hit is read out]

Output, in readout order:
t=0,c=5
t=0,c=6
t=2,c=7
t=1,c=8
t=1,c=9
t=0,c=11
t=3,c=0
t=1,c=1
t=3,c=2
t=2,c=8
t=3,c=2

Output not time ordered, cluster broken apart


52

Putting it together: n-XYTER ASIC

[Figure: chip block diagram with 128 channels, time stamp counter, token manager, output drivers, I2C interface, DACs, slow control]

outputs: 1 analog differential, 8 digital LVDS (4*32 MHz)

Note: there is also a 32 channel version for MSGC readout called MSGCROC


53

n-XYTER Submission 1

• CBM started active collaboration with DETNI project
• Submission of full chip in July 2006
• Used technology: AMS 0.35 μm CMOS with thick metal four
• 250 dice now available, shared between CBM and DETNI
• verification ongoing, no problems found
• n-XYTER will be
    basis for detector R&D in next 2-3 years
    basis for CBM-XYTER, a next generation chip


54

n-XYTER Status

The n-XYTER is quickly becoming the designated read-out solution for many CBM detector system prototypes:
• obvious cases:
    STS - Silicon strip detectors
    MUCH/TRD - GEM chambers
• plausible cases:
    MUCH Silicon Pad chambers
    RICH PMT
• potential cases:
    MUCH/TRD - MWPC chambers

Setting up n-XYTER read-out chain to support
• simple lab desk-top setups
• medium-size test beam configurations with multiple sub-systems


55

Basic n-XYTER Readout Chain

[Figure: Detector → FEB → ROC → ABB]
• Detector to FEB: bond or cable connection
• FEB (Front-End Board): up to 8 n-XYTER (1024 ch.) and ADC; tag data and ADC data out, clock and control in
• FEB to ROC: LVDS signal cable
• ROC (Read-Out Controller): FPGA with MGT and SFP
• ROC to ABB: 2.5 Gbps optical link
• ABB (Active Buffer Board): FPGA with SFP and MGTs; 1-4 lane PCIe interface


56

Scalable n-XYTER Readout Chain

[Figure: Detector → FEB → ROC → DCB → ABB]
• FEB (Front-End Board): n-XYTERs and ADC; tag data and ADC data out, clock and control in
• ROC (Read-Out Controller): FPGA with MGT and SFP
• DCB (Data Combiner Board): FPGA with several SFP/MGT ports; links to other ROCs and to the ABB


57

Some Configurations

Minimal configuration:
    Detector → FEB → ROC → ABB → PC

Expandable configuration:
    several chains (Detector → FEB → ROC) → DCB → ABB → PC


58

Real Hardware for ROC, ABB...

ROC
• Built at KIP
• many concepts from Alice DCS
• Virtex-4 based (FX, PPC)
• Idea: FPGA + Linux core design to be used in many apps. → SysCore

ABB
• Built in Mannheim
• PCI Express based
• Virtex-4 FX based

[Photo labels: SFP, PCIe, mezz. board connectors]

short commercial follows

59

Basic Components and Interfaces

Xilinx Virtex4 FPGA

320 up to 576 user I/Os

LAN interfaces

SD-Card connector

LAN, USB, JTAG programming capability via CPLD

RS232 interface

High Speed Serial Ports (MGTs)

DDR SDRAM

user definable I/O

Watchdog

slide courtesy D. Gottschalk

60

SysCore Features

(Remote) Configuration: via standard JTAG or select map configuration

via USB to JTAG bridge

via LAN Watchdog triggered

Radiation tolerant by fast configuration/reboot

Linux on FPGA

Fast Boot

All features together in one design

slide courtesy D. Gottschalk

61

Enhanced FPGA refresh technology for radiation tolerance

Mode 1: Initial configuration

Mode 2: Refresh of the configuration memory (SRAM) either: continuously overwriting with the correct configuration or: overwriting on demand (after error detection)

Mode 3: Error detection: Read back of the configuration memory Checking (compare or checksum) Virtex 4: internal Hamming functionality

Mode 4: Watchdog triggers start of the failsafe configuration if design fails.

slide courtesy D. Gottschalk

as was developed for the Alice DCS board


62

Towards CBM-XYTER

n-XYTER meets, surprisingly well, most of the CBM requirements

However, it's not radiation hard. WHY ??
• it was intended for a thermal neutron environment
• thermal neutrons do very little damage (only process is capture....)
• we'll test what the chip can withstand, but expectations are not high (0.35 μm !)

A new ASIC is therefore needed

Goals:
• Radiation-hard ( > 1 MRad, depends on MUCH detector layout)
    Consequence: new technology
• Lower power. n-XYTER is 1.5 W/chip
• Integrated ADC, pure digital interface
• Meet requirements of silicon and gas detectors, look also beyond CBM.
    Might well be again a family of chips with same architecture


63


                     CBM STS        CBM MuCh       PANDA STS         PANDA TPC
                     (CBM SiStrip)  (CBM GEM)      (PANDA SiStrip)   (PANDA GEM)
charge polarity      +/-            + or -         +/-               -
no. of channels      128            128            128               128
sparsification       yes            yes            yes               yes
self trigger         yes            yes            yes               yes
differential i/o     yes            yes            yes               yes
rate per channel     250 kHz        200 kHz        75 kHz            200 kHz
time stamp           yes, few ns    yes, few ns    yes, 2 to 20 ns   yes, 5 ns
energy r/o           yes, 8 bit     yes            yes, 10 bit       8 bit
rad. level           1 MRad         1 MRad         1 MRad            0.1 CMS STS
ch pitch             50 µm          100 - 200 µm   50 µm             100 - 200 µm
DC-bias, leakage     no             no             yes ?             no
power                high concern   no concern     3 mW              less concern
no. of chips         t.b.d.         t.b.d.         5000              1000

Rows with missing entries (column assignment not recoverable):
double hit res       100 ns, 100 ns, 200 ns
ch fifo depth        16, 16


64


CBM-XYTER development has started
• Bergen, Darmstadt, Heidelberg, Krakow, Mannheim, Moscow involved
• Will be based on experience with n-XYTER

Currently many pre-studies:
• Work out specifications ...
• Technology assessment
    Lots of experience with 0.18 μm UMC in the developers group
    Questions:
      Do we really need enclosed geometry transistors ?
      Future availability of 0.18 μm processes ? UMC or AMS ?
      What are advantages of 0.13 μm processes ? Can we afford them ?
• Pre-amp design
    best balance between detector thickness (affects sensor signal, capacitance but also radiation tolerance) and pre-amp power
• De-randomizer (FIFO) design
    How many stages ? Likely more than 4 needed.
• Work on building blocks, quite a few mini@SIC's and MMPW submitted

Also some conceptual and system-design questions still to be resolved
• Examples: 1. time sorting; 2. throttling


65

Self-Triggered FEE - Output Format I

Output of a single FEE chip:

    Time stamp   Channel address   other values (amplitudes, pulse shape)
    17           15                ...
    68           34                ...
    134          18                ...
    135          19                ...
    1234         33                ...

The time stamp counter has a finite number of bits! It will wrap! How to express time?

Note: CBM is a fixed target experiment. Long spills (~ 10 s).


66

Handle the infinite Time Axis

1. Subdivide time in epochs
2. Express a time relative to an epoch
3. Introduce epoch markers

[Figure: time axis divided into Epoch 1 to Epoch 4, separated by epoch markers; two hits shown with times (2, 137 ns) and (3, 314 ns)]

practical epoch length about 10 μs


67

Self-Triggered FEE - Output Format II

Output of a FEE chip is a list of hits and epoch markers. Each hit has a time stamp plus other information.

Record type: H = hit, M = epoch marker

    M 1
    H 17 15 ...
    H 68 34 ...
    H 134 18 ...
    H 135 19 ...
    H 1234 33 ...
    M 2
    M 3
    H 258 19 ...

The last record is a hit with effective timestamp (3, 258)
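
A reader of such a stream recovers the effective time stamp by remembering the last epoch marker seen. The sketch below assumes records are given as tuples `("M", epoch)` and `("H", time, channel)`; this tuple encoding is invented here for illustration and is not the actual wire format.

```python
def expand_timestamps(records):
    """Attach the current epoch to every hit.

    records: sequence of ("M", epoch) and ("H", time, channel) tuples.
    Returns a list of (epoch, time, channel).
    """
    epoch = None
    hits = []
    for rec in records:
        if rec[0] == "M":
            epoch = rec[1]
        elif rec[0] == "H":
            if epoch is None:
                raise ValueError("hit before first epoch marker")
            _, t, channel = rec
            hits.append((epoch, t, channel))
        else:
            raise ValueError(f"unknown record type {rec[0]!r}")
    return hits

# The record list shown on the slide
stream = [("M", 1), ("H", 17, 15), ("H", 68, 34), ("H", 134, 18),
          ("H", 135, 19), ("H", 1234, 33), ("M", 2), ("M", 3),
          ("H", 258, 19)]
print(expand_timestamps(stream)[-1])
```

The last hit comes out with epoch 3 and time 258, the effective time stamp quoted on the slide.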


68

Self-Triggered FEE - Concentrators

A concentrator merges the data streams and eliminates redundant epoch markers. Seems prudent to keep data always sorted in time.

Hit records are (H, time, address, ...).

Output of first FEE:
    M 1
    H 17 15 ...
    H 68 34 ...
    H 134 18 ...
    H 135 19 ...
    H 1234 33 ...
    M 2
    M 3
    H 258 19 ...

Output of second FEE:
    M 1
    H 18 2007 ...
    M 2
    H 589 2134 ...
    M 3
    H 258 2714 ...

Concentrator output:
    M 1
    H 17 15 ...
    H 18 2007 ...
    H 68 34 ...
    H 134 18 ...
    H 135 19 ...
    H 1234 33 ...
    M 2
    H 589 2134 ...
    M 3
    H 258 19 ...
    H 258 2714 ...

!! 2005 slide !! Where is the problem here?
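
The merge on this slide can be sketched with a standard k-way merge. The sketch assumes that every input stream is already sorted in time and starts with an epoch marker, and it uses an invented tuple encoding, `("M", epoch)` and `("H", time, address)`. The sorted-input assumption is precisely what a token ring front-end does not guarantee.

```python
import heapq

def _keyed(records):
    """Turn a record stream into sortable (epoch, time, address) keys.
    Markers get time -1 so they sort before the hits of their epoch."""
    epoch = None
    for rec in records:
        if rec[0] == "M":
            epoch = rec[1]
            yield (epoch, -1, -1)
        else:
            if epoch is None:
                raise ValueError("hit before first epoch marker")
            yield (epoch, rec[1], rec[2])

def concentrate(*streams):
    """Merge time-sorted streams, keeping one marker per epoch."""
    out = []
    last_marker = None
    for epoch, t, addr in heapq.merge(*(_keyed(s) for s in streams)):
        if t < 0:
            if epoch != last_marker:       # drop redundant markers
                out.append(("M", epoch))
                last_marker = epoch
        else:
            out.append(("H", t, addr))
    return out

fee1 = [("M", 1), ("H", 17, 15), ("H", 68, 34), ("H", 134, 18),
        ("H", 135, 19), ("H", 1234, 33), ("M", 2), ("M", 3), ("H", 258, 19)]
fee2 = [("M", 1), ("H", 18, 2007), ("M", 2), ("H", 589, 2134),
        ("M", 3), ("H", 258, 2714)]
for rec in concentrate(fee1, fee2):
    print(*rec)
```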


69

Where to re-sort data ?

• Token ring scheme produces locally unsorted data
    Big advantage of the token ring scheme is the fair distribution of bandwidth in case of local overload. The system is robust against hot channels etc.
• n-XYTER doesn't even produce epoch markers
    the reading stage needs a clock-cycle-precise replica of the time stamp counter to interpret the data correctly. That clearly only works if there are no additional elasticity buffers.
    some form of 'time stamp expansion' and epoch marking needed
    re-sort data early ? Or use a form of fuzzy epoch boundaries ?
• How to build concentrators ?
    conceptually easy if output bandwidth > sum of input bandwidth
    at least not feasible in early stages
      read-out ASIC is in fact the first concentrator stage
      total bandwidth will always be smaller than sum of channel bandwidth
• In other words: when and where to drop data in case of overload ?
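
If the disorder introduced by the token ring is bounded, an early re-sort needs only a small buffer. The sketch below assumes that no hit arrives more than `max_disorder` time units behind the newest time stamp seen so far; that bound is a hypothetical parameter introduced here, and in a real system it would depend on FIFO depths and load.

```python
import heapq

def resort(stream, max_disorder):
    """Yield (time, channel) hits in time order.

    Assumes each hit is at most `max_disorder` time units older than
    the newest hit seen before it; hits violating this come out late.
    """
    heap = []
    newest = None
    for hit in stream:
        heapq.heappush(heap, hit)
        newest = hit[0] if newest is None else max(newest, hit[0])
        # everything old enough can no longer be overtaken
        while heap and heap[0][0] <= newest - max_disorder:
            yield heapq.heappop(heap)
    while heap:                      # flush at end of stream
        yield heapq.heappop(heap)

# First seven hits of the token ring output shown earlier in the talk
unsorted_hits = [(0, 5), (0, 6), (2, 7), (1, 8), (1, 9), (0, 11), (3, 0)]
print(list(resort(unsorted_hits, max_disorder=2)))
```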


70

Think Big or Throttling ?

Conventional triggered systems handle overload gracefully
• there is some form of 'common' dead time
• in case of overload, whole events are discarded
• loosely speaking: one gets 100% of the data for 90% of the events

With a self-triggered front-end the converse might happen
• data is dropped in an uncorrelated fashion where FIFOs overfill
• loosely speaking: one gets 90% of the data of 100% of the events
• quite unpleasant perspective
    tracking systems might tolerate a few % data loss without major performance drop
    in other detectors, like an ECAL, this leads immediately to a loss of efficiency

What is the proper solution ?
• Build and operate the system with 'enough' bandwidth headroom ?
    Note: extracted beams from synchrotrons are notoriously non-Poissonian !
    Can that be handled with large-enough channel FIFOs alone ?
• Or introduce some form of 'global throttling', to drop data in a correlated fashion
    The time distribution system can easily distribute 'XOFF' and 'XON' messages
    Problem is to find an easy to evaluate throttle criterion.
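
Why uncorrelated loss is so unpleasant can be seen from a one-line estimate. The sketch assumes every hit is lost independently with the same probability, a simplification introduced here; the hit counts per event are illustrative values, not CBM numbers.

```python
def complete_event_fraction(hit_loss, hits_per_event):
    """Fraction of events that keep all their hits when each hit is
    dropped independently with probability `hit_loss`."""
    return (1.0 - hit_loss) ** hits_per_event

# 10% uncorrelated hit loss, for a few illustrative event sizes
for n_hits in (1, 5, 20, 100):
    frac = complete_event_fraction(0.10, n_hits)
    print(f"{n_hits:>3} hits/event: {frac:.1%} of events complete")
```

With a common dead time the same 10% loss removes 10% of the events and leaves the rest complete, which is the contrast drawn on this slide.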


71

CBM DAQ and Online Event Selection

• More than 50% of total data volume might be relevant for first level event selection
    MVD, STS, and TRD data used in first level event selection
    needed for D
    needed for J/ψ → μ+μ-
• Aim for simplicity
• Ansatz: do (almost) all processing after the build stage
• Simple two layer approach:
    1. event building
    2. event processing
• Other scenarios are possible, putting more emphasis on:
    do all processing as early as possible
    transfer data only when necessary


72

Logical Data Flow

[Figure: logical data flow with the following stages]
• Concentrators: multiplex channels to high-speed links
• Time distribution
• Buffers
• Build network
• Processing resources for first level event selection, structured in small farms
• Connection to 'high level' selection processing


73

Different View on Structure

[Figure: comparison of the CMS and CBM structures using the stages detector, FEE buffer, readout buffer, switch, processor farm, storage; L1 trigger and HLT marked]


74

Bandwidth Requirements

Data flow:

~ 1 TB/sec

1st level selection:

~ 10¹⁴ - 10¹⁵ operations/sec

Data flow:

few 10 GB/sec

to archive: few 1 GB/sec

Moore helps

Gilder helps

~ 100 Sub-Farms


75

Focus on BNet

Event Building


76

Fast Event Building Networks

Very tempting to look into InfiniBand
• used in many HPC clusters as interconnect
• offers large bandwidth at low CPU overhead

Available since some time
• SDR systems: 4 x 2.5 Gbps per link
    1 GByte/sec bandwidth per port and direction
• 288 port switches
    based on 24 port switch chips (288=24*12)
    non-blocking switch
    288 GByte/sec switching bandwidth
    modest cost: ~ 400 EUR/port

Perspectives
• DDR just became available
• QDR likely to come
    One 288 port QDR switch does 1 TByte/sec
    A few could do CBM
• network adapter (HCA) small and low power compared to 10 Gbit Ethernet
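
The bandwidth figures can be cross-checked with a few lines. The sketch assumes 8b/10b line encoding (8 payload bits per 10 line bits), which is how 4 x 2.5 Gbps becomes 1 GByte/sec, and takes QDR as four times the SDR lane rate; function names are chosen here for illustration.

```python
def port_bandwidth_gbyte(lanes, lane_gbps, payload_fraction=0.8):
    """Payload bandwidth of one port and direction in GByte/sec.
    payload_fraction=0.8 models 8b/10b encoding (assumption)."""
    return lanes * lane_gbps * payload_fraction / 8.0

def switch_bandwidth_gbyte(ports, per_port_gbyte):
    """Aggregate switching bandwidth of a non-blocking switch."""
    return ports * per_port_gbyte

sdr_port = port_bandwidth_gbyte(lanes=4, lane_gbps=2.5)
qdr_port = port_bandwidth_gbyte(lanes=4, lane_gbps=10.0)
print(f"SDR port  : {sdr_port:.1f} GByte/sec")
print(f"SDR switch: {switch_bandwidth_gbyte(288, sdr_port):.0f} GByte/sec")
print(f"QDR switch: {switch_bandwidth_gbyte(288, qdr_port):.0f} GByte/sec")
```

The QDR result of somewhat more than 1000 GByte/sec is the "1 TByte/sec" per switch quoted above.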


77

Why is InfiniBand fast & low overhead ?

In conventional network interconnects the data is copied at least once between user and kernel buffers
• This loads the CPU and costs memory bandwidth

The way out
• avoid the copies
• do a DMA transfer from local user buffer to remote user buffer
• buzz-word: zero-copy remote direct memory access (zero-copy RDMA)


78

Conventional Networking

[Figure: two hosts connected through a network switch; on each host application and library (User), driver (Kernel), network adapter (Hardware); data flow and control flow paths drawn]


79

Use Zero-Copy RDMA

[Figure: two hosts connected through a network switch; on each host application and library (User), driver (Kernel), network adapter (Hardware); data flow and control flow paths drawn]


80

How does Zero-copy RDMA work ?

User side requirements
• all buffers used for I/O must be
    locked in memory
    made known to the network adapter, which stores the virtual to physical mapping
• this setup involves the OS and a driver (expensive)

Network adapter requirements
• export two types of interfaces
    for kernel interactions
    for user interactions
• the interface for user interactions is
    memory mapped
    replicated for each connected process
    mapped into the user process address space

Chain of events for zero-copy RDMA
• user process writes request descriptor directly into the network adapter (mapped interface)
• adapter validates, builds scatter-gather list, and transfers to/from user address space

[Figure: application and library (User), driver (Kernel), network adapter (Hardware)]


81

... and Real World problems

Usually some application framework is used
• ROOT, XDAQ, ...

It usually has its own 'buffer management'

Remember:
• making a user buffer eligible for RDMA is quite expensive (locking, driver calls)
• thus create/delete of a buffer is expensive

A framework design with a very 'dynamic' handling of buffers, which often creates / deletes buffers, will not work well with RDMA.

Adapting the underlying buffer management in an existing framework can be quite cumbersome:
• ... basic execution logic problems
• methods may not be virtual ...
• successfully done to adapt XDAQ, the CMS DAQ framework, to uDAPL (J. Adamczewski, GSI)
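Since registration is the expensive part, an RDMA-friendly buffer management registers a fixed set of buffers once and recycles them. The following is a minimal sketch of that idea only, not the XDAQ or uDAPL implementation; the class `RegisteredBufferPool` and its `_register` method are hypothetical, and the registration step is replaced by a counter.

```python
from collections import deque


class RegisteredBufferPool:
    """Pool of buffers that are registered once and then recycled.

    `_register` is a hypothetical stand-in for the expensive step (locking
    pages, driver call); here it only counts how often it is invoked.
    """

    def __init__(self, count: int, size: int):
        self.registrations = 0
        self._free = deque(self._register(bytearray(size)) for _ in range(count))

    def _register(self, buf: bytearray) -> bytearray:
        self.registrations += 1  # a real system would pin memory and inform the adapter
        return buf

    def acquire(self) -> bytearray:
        if not self._free:
            raise RuntimeError("pool exhausted; caller must wait or apply back-pressure")
        return self._free.popleft()

    def release(self, buf: bytearray) -> None:
        self._free.append(buf)


pool = RegisteredBufferPool(count=4, size=32 * 1024)
for _ in range(1000):  # 1000 transfers, still only 4 registrations
    buf = pool.acquire()
    buf[0] = 1  # stand-in for filling the buffer and posting the transfer
    pool.release(buf)
print(pool.registrations)
```

A framework that allocates a fresh buffer per message would instead pay the registration cost 1000 times in this loop.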

[Figure: one host with User (application, framework, library), Kernel (driver) and Hardware (network adapter) layers.]


82

... and Real World throughput

Small InfiniBand test cluster at GSI
• 4 dual-dual Opteron servers
• Mellanox MHES18-XT HCA (PCIe)
  http://mellanox.com/products/infinihost_iii_lx_cards.php
• Mellanox MTS2400 24X 24 port switch
  http://mellanox.com/products/switch_silicon.php

Test case
• XDAQ peer transport via uDAPL (an RDMA access library for IB and iWARP)

Results
• for large (100 kB) buffers throughput approaches the IB limit of 1 GB/sec
• ~30 kB buffers needed to get 500 MB/sec

data by J. Adamczewski, GSI


83

Event Building

Is a classical all-to-all pattern
• susceptible to head of line blocking

[Figure: eight nodes attached to a network switch, with data collectors on one side and event selectors on the other. Fragment labels at the data collectors: 1.a 1.b 1.c 1.d 1.e 1.f 1.g 1.h. Fragment labels at the event selector side: 1.a 1.b 1.c 1.d 1.e 1.f.]


84

Event Building – Scheduling

Is a classical all-to-all pattern
• susceptible to head of line blocking
• some form of data flow orchestration needed (aka scheduled transfers)
• one of the simplest schemes is the barrel shifter

[Figure: data collectors and event selectors connected through a network switch.]
Data collectors:  8.a 7.b 6.c 5.d 4.e 3.f 2.g 1.h
Event selectors:  1.h 2.g 3.f 4.e 5.d 6.c 7.b 8.a


85


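[Figure: barrel shifter, next time step. Data collectors and event selectors connected through a network switch.]
Data collectors:  9.a 8.b 7.c 6.d 5.e 4.f 3.g 2.h
Event selectors:  9.a 2.h 3.g 4.f 5.e 6.d 7.c 8.b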


86


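[Figure: barrel shifter, following time step. Data collectors and event selectors connected through a network switch.]
Data collectors:  10.a 9.b 8.c 7.d 6.e 5.f 4.g 3.h
Event selectors:  9.b 10.a 3.h 4.g 5.f 6.e 7.d 8.c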
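The three barrel shifter frames can be reproduced with a short schedule generator. This is a sketch under one assumption read off the labels: in time step t, collector number i (a = 0, b = 1, ...) sends its fragment of event t - i, and event n is built on selector ((n - 1) mod N) + 1. The function names are invented for the example.

```python
def barrel_shift_step(step: int, n_nodes: int = 8):
    """Transfers in one time step of a barrel shifter schedule.

    Returns (collector, event, selector) tuples. Collector index 0 is 'a';
    events and selectors are numbered from 1 as in the figures.
    """
    transfers = []
    for collector in range(n_nodes):
        event = step - collector              # 'a' is one event ahead of 'b', and so on
        selector = (event - 1) % n_nodes + 1  # event n is built on selector ((n-1) mod N) + 1
        transfers.append((collector, event, selector))
    return transfers


def label(transfer):
    """Format a transfer as '<event>.<collector letter>'."""
    collector, event, _ = transfer
    return f"{event}.{'abcdefghijklmnopqrstuvwxyz'[collector]}"


for step in (8, 9, 10):
    by_selector = sorted(barrel_shift_step(step), key=lambda t: t[2])
    print(step, " ".join(label(t) for t in by_selector))
```

In every step each selector appears exactly once as a destination, so no two collectors compete for the same output port of the switch. That is the property that avoids head of line blocking, as long as all nodes stay in step.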


87

Event Building

Barrel-shift is in practice a too rigid scheme
• e.g. works only when processing always takes the same time

Questions are:
• How much scheduling is needed ? → Do chaotic transfers & many buffers work ?
• What is an 'optimal' scheme ? → Precise timing and sizing of each transfer
• What is a simple and robust scheme ? → Get close to optimal with simple means

Our 4 node mini-cluster is nice to develop software
Go to a larger cluster for real tests
• Done: 24 nodes at FZ Karlsruhe
• Later: >100 nodes at Paderborn cluster

First results from FZK tests in March 2007
• 23 nodes, Opterons with DDR InfiniBand HCAs and switches
• surprise: peer-to-peer bandwidth is 1160 MB/sec unidirectional and 730 MB/sec bidirectional
• memory or PCIe is apparently the limiting factor here


88

Event Building – Chaotic Transfers

First results from FZK tests in March 2007 (cont.)
• 23 nodes, 'chaotic round-robin'
• surprise: big buffers and many buffers isn't the best case
• one sees link level flow control and HCA request scheduling at work

Throughput in MB/sec (rows: buffer size, columns: queue length)

buffer size      2      8     32
2k             260    256    256
8k             720    627    580
32k            625    587    568
128k           582    572    567

Best throughput: 720 MB/sec/node, total 16.5 GB/sec

data by S. Linev, GSI


89

Event Building – Scheduled Transfers

First results from FZK tests in March 2007 (cont.)
• 23 nodes, strictly timed transfers
• buffer size and number now uncritical (big doesn't hurt, at least...)
• peak throughput same as before
• tests on a realistic size cluster needed

Throughput in MB/sec (rows: buffer size, columns: queue length)

buffer size      2      8     32
2k             255    271    272
8k             718    695    704
32k            699    698    696
128k           732    727      -

Best throughput: 718 MB/sec/node, total 16.5 GB/sec

data by S. Linev, GSI
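How uncritical the queue length becomes with scheduled transfers can be read off the two FZK tables by taking, for each buffer size, the gap between the best and the worst queue length. A small sketch follows, with the values copied from the tables and the missing entry kept as `None`. The helper `spread` is introduced here for illustration only.

```python
QUEUE_LENGTHS = (2, 8, 32)

# Throughput in MB/sec per node, copied from the two tables; None marks the missing entry.
CHAOTIC = {
    "2k": (260, 256, 256),
    "8k": (720, 627, 580),
    "32k": (625, 587, 568),
    "128k": (582, 572, 567),
}
SCHEDULED = {
    "2k": (255, 271, 272),
    "8k": (718, 695, 704),
    "32k": (699, 698, 696),
    "128k": (732, 727, None),
}


def spread(row):
    """Difference between the best and worst measured value in one table row."""
    values = [v for v in row if v is not None]
    return max(values) - min(values)


for size in CHAOTIC:
    print(size, spread(CHAOTIC[size]), spread(SCHEDULED[size]))

NODES = 23
print(720 * NODES / 1000)  # aggregate GB/sec for the best chaotic setting
```

For 8k buffers the spread is 140 MB/sec with chaotic transfers and 23 MB/sec with scheduled transfers. The last line gives 16.56 GB/sec, the "total 16.5 GB/sec" quoted for 23 nodes.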


90

Focus on PNet

Event Selection


91

Event Selection Processing

In CBM we'll have a tracking trigger
• certainly for open charm
  • requires reconstruction of tracks in STS of all events
  • search for displaced vertices
  • identification of open charm candidates
• possibly also for muon identification
  • again reconstruction of tracks in STS of all events
  • forward tracking through muon absorbers

So we need high throughput STS tracking

Two routes followed
• Cellular automaton / Kalman filter tracker
  • lots of floating point arithmetic
  • better performance (simply because cuts can be narrower)
• Hough tracker
  • algorithm 'bit-oriented' and parallelizable
  • can be implemented in programmable logic

Is it feasible to do CA/KF in L1 event selection ?
Does Hough have the required performance ?


92

Game Processors as Supercomputers ?

Slide from CHEP'04 Dave McQueeney

IBM CTO US Federal

2005


93

The Cell Processor

8 SPE: Synergistic Processing Elements, each with
• 256 kB local memory
• 128 x 128 bit registers
• 4 SP floating ops/cycle (SIMD)

PPE: 'normal' PowerPC CPU
• running Linux
• used to orchestrate the SPUs

Peak performance
• 32 single precision multiply/add per clock cycle
• runs at ~3 GHz
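The peak figure follows directly from the numbers on this slide. As a rough sketch, counting one multiply/add as two floating point operations (a common convention, assumed here and not stated on the slide):

```python
N_SPE = 8
SIMD_LANES = 4       # single precision values per 128 bit register
CLOCK_GHZ = 3.0      # "~3 GHz" on the slide
FLOPS_PER_MADD = 2   # assumption: a multiply/add counts as two floating point operations

madd_per_cycle = N_SPE * SIMD_LANES
peak_gflops = madd_per_cycle * FLOPS_PER_MADD * CLOCK_GHZ
print(madd_per_cycle, peak_gflops)
```

This gives 32 multiply/add per cycle and, under the stated convention, roughly 190 GFLOPS single precision for the SPEs alone.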


94

Cell is Motivation, not Target Architecture

What are the general aspects ?

SIMD (single instruction – multiple data)
• available in essentially every mainstream system
  • SSE in IA-32 compatible systems
  • Altivec in PPC systems
• implemented with 128 bit registers
  • 4 x single precision floating point
  • 2 x double precision floating point
  • also: 16 x 8 bit, 8 x 16 bit, 4 x 32 bit int
• is for example heavily used in video codecs

Question: Can the CA/KF be 'SIMDized' ?
Answer: Yes: propagate 4 tracks in parallel. Implement such that scalar and SSE, Altivec, and SPU targets are supported.

(don't show this slide when IBM folks are in the audience !!)
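One common way to support scalar and SIMD targets from a single source is to write the algorithm against a small numeric type and swap that type per target. The sketch below illustrates the structure only: `Vec4` is a hypothetical four-lane type emulated in plain Python (so it gives no speed-up), and `propagate_straight` is a made-up field-free propagation step, not the CBM code.

```python
class Vec4:
    """Four values handled together, emulating one 128 bit SIMD register."""

    def __init__(self, *lanes):
        assert len(lanes) == 4
        self.lanes = tuple(float(v) for v in lanes)

    @staticmethod
    def _lanes(other):
        # Broadcast a plain number to all four lanes
        return other.lanes if isinstance(other, Vec4) else (float(other),) * 4

    def __add__(self, other):
        return Vec4(*(a + b for a, b in zip(self.lanes, Vec4._lanes(other))))

    def __mul__(self, other):
        return Vec4(*(a * b for a, b in zip(self.lanes, Vec4._lanes(other))))

    __radd__ = __add__
    __rmul__ = __mul__


def propagate_straight(x, tx, dz):
    """Move position x along slope tx over distance dz (hypothetical field-free step)."""
    return x + tx * dz


# The same function body serves one track ...
print(propagate_straight(1.0, 0.5, 2.0))
# ... or four tracks packed into the lanes of a vector
print(propagate_straight(Vec4(1, 2, 3, 4), Vec4(0.5, 0, -1, 0.25), 2.0).lanes)
```

On a real target the `Vec4` operators would map to SSE, Altivec or SPU instructions, while the algorithm code stays unchanged.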


95


Use float, not double
• somewhat Cell special
  • Cell is tuned for single precision, double is a factor 10 slower
  • the next generation will fix this (HPC people only look at DP flops....)
• However: when using SIMD, float gives twice the operation count:
  • 4 x float per cycle
  • 2 x double per cycle

Question: Can the CA/KF run with single precision ?
Answer: Yes: work on algorithm to make it numerically stable


96


Code and data in local store, 256 kByte
• sounds special, is in fact general
• on normal PCs, the performance of an algorithm critically depends on memory locality. L1 or L2 cache misses are very expensive.
• a cache miss can easily cost 100 cycles

Consequence:
• random access in big tables is very expensive
• lookup tables are counter-productive
• several dozen floating point operations might be cheaper than a single cache miss

Question: Can the CA/KF fit in 256 kByte ?
Answer: Yes: Replace the field map by a parameterization
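The trade-off between a tabulated field map and a parameterization can be made concrete with a memory estimate. All numbers below (grid size, number of coefficients, the example polynomial) are hypothetical and chosen only to illustrate the orders of magnitude. The actual CBM field map and its parameterization are not given in the slides.

```python
BYTES_PER_FLOAT = 4
LOCAL_STORE_BYTES = 256 * 1024


def grid_map_bytes(nx, ny, nz, components=3):
    """Memory needed for a field map stored on a regular grid."""
    return nx * ny * nz * components * BYTES_PER_FLOAT


def polynomial_bytes(n_coefficients, components=3):
    """Memory needed when each field component is a polynomial."""
    return n_coefficients * components * BYTES_PER_FLOAT


def horner(coefficients, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... with one multiply/add per coefficient."""
    result = 0.0
    for c in reversed(coefficients):
        result = result * x + c
    return result


# Hypothetical numbers, for illustration only
print(grid_map_bytes(100, 100, 100) > LOCAL_STORE_BYTES)  # grid map does not fit
print(polynomial_bytes(10) < LOCAL_STORE_BYTES)           # parameterization fits easily
print(horner([1.0, 0.0, -0.5], 2.0))                      # evaluates 1 - 0.5 * 2^2
```

The polynomial costs a handful of multiply/add operations per evaluation and no table lookup, which matches the argument that arithmetic is cheaper than a cache miss.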

97

Kalman Filter for Track Fit

[Figure: an electron track (e-) crossing detector planes with measurements; the fit yields the track parameters and errors (r, C).]

slide courtesy I. Kisel


The Kalman Filter for Track Fit

[Figure: Kalman filter fit with annotations: arbitrary large errors; non-homogeneous magnetic field as large map; multiple scattering in material; small errors; weight for update.]

Problems flagged on the figure:
• >>> 256 KB of Local Store
• not enough accuracy in single precision
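One way in which "arbitrary large errors" and single precision can clash is shown below for a scalar Kalman update of the covariance. This is a textbook form chosen for illustration; the slides do not spell out the exact mechanism in the CBM fit. Single precision is emulated by rounding after every operation, and `C0` and `V` are hypothetical values.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def covariance_after_update(c, v, rnd=lambda x: x):
    """One scalar Kalman update of the covariance: C' = (1 - K) * C with K = C / (C + V).

    `rnd` is applied after every arithmetic operation to emulate the working precision.
    """
    k = rnd(c / rnd(c + v))  # gain, the 'weight for update'
    return rnd(rnd(1.0 - k) * c)


C0 = 1e10  # hypothetical 'arbitrary large' initial error
V = 1.0    # hypothetical measurement variance

print(covariance_after_update(C0, V))           # double precision: close to V
print(covariance_after_update(C0, V, rnd=f32))  # single precision: collapses to 0.0
```

In single precision C0 + V rounds back to C0, the gain becomes exactly 1 and the updated error vanishes, so later measurements would be ignored. Starting from a realistic initial estimate instead of an arbitrarily large error avoids this particular failure.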



Modifications of the Fitting Algorithm

• The initial track parameters are directly estimated from the input data.
• The propagation step is performed directly from measurement to measurement without intermediate steps.
• Matrix multiplications have been replaced by direct operations on only non-trivial matrix elements.
• Most loops have been unrolled in order to provide additional instructions for interleaving.
• All branches have been eliminated from the algorithm to avoid branch misprediction penalty.
• Calculations have been reordered for better use of the processor's pipeline.



SPE Statistics



Modifications of the Fitting Algorithm

[Figure with panels labelled Intel P4 and Cell.]



102

The CBM Collaboration

47 institutions, > 350 members

China: USTC Hefei; CCNU Wuhan
Croatia: University of Split; RBI, Zagreb
Cyprus: Nikosia Univ.
Czech Republic: Techn. Univ. Prague; CAS, Rez
France: IPHC Strasbourg
Germany: GSI, Darmstadt; FZ Dresden-Rossendorf; Univ. Heidelberg, Phys. Inst.; Univ. HD, Kirchhoff Inst.; Univ. Frankfurt; Univ. Kaiserslautern; Univ. Mannheim; Univ. Münster
Hungary: KFKI Budapest; Eötvös Univ. Budapest
India: IOP Bhubaneswar; Univ. Chandigarh; IIT Kharagpur; VECC Kolkata; SAHA Kolkata; Univ. Varanasi
Korea: Korea Univ. Seoul; Pusan National Univ.
Norway: Univ. Bergen
Poland: Silesia Univ. Katowice; Krakow Univ.; Nucl. Phys. Inst. Krakow; Warsaw Univ.
Portugal: LIP Coimbra
Romania: NIPNE Bucharest
Russia: LHE, JINR Dubna; LIT, JINR Dubna; LPP, JINR Dubna; PNPI Gatchina; ITEP Moscow; MEPHI Moscow; Kurchatov Inst. Moscow; SINP, Moscow State Univ.; Obninsk State Univ.; IHEP Protvino; KRI, St. Petersburg; St. Petersburg Polytec. U.; INR Troitzk
Ukraine: Shevchenko Univ., Kiev


103

The End

Thanks for your attention

We acknowledge the support of the European Community-Research Infrastructure Activity under the FP6 "Structuring the European Research Area" programme (HadronPhysics, contract number RII3-CT-2004-506078).

