magnus rentsch ersdal magnus.ersdal@uib
TRANSCRIPT
Magnus Rentsch Ersdal
TWEPP '19, SANTIAGO DE COMPOSTELA
Inner Tracking System (ITS) Upgrade
UNIVERSITY OF BERGEN
[Figures: inner barrel half-layers; ITS upgrade cutaway]
Readout Electronics
[Figure: block diagram of the readout chain. Labels recoverable from the diagram:]
• Stave with ALPIDE sensors, Readout Unit (RU), Power Unit (PU)
• On the RU: Main FPGA, Aux FPGA, flash memory, GBTx, SCA, CAN
• Common Readout Unit (CRU), Timing & Trigger System, Central Trigger Processor (CTP), Power System
• O2 First Level Processor (FLP), Event Processing Nodes (EPN), Detector Control System (DCS)
• Environments: Radiation Field; Atmospheric Radiation Environment
• Legend: trigger, control, data, power
• 8 to 28 ALPIDE connections per RU
• 8 RUs per CRU
• 2 CRUs per FLP
• GBT links (3.2 Gbps)
• CANBUS DCS (backup)
Readout Unit
[Figure: Readout Unit block diagram. Labels: stave with ALPIDE; transition board; Main FPGA; Flash FPGA; flash memory; 3x GBTx; GBT-SCA; CANBUS. Annotations: rad hard by design; rad tolerant config memory; low cross section per bit.]
Radiation environment
• ~4 orders of magnitude more than the normal radiation background
• [Figure annotation: Readout Units sit here]
• Design flux: 1 kHz/cm2
• Total Ionizing Dose (TID) and Non-Ionizing Energy Loss (NIEL) are low enough that they pose no concern
SEUs and CMOS circuits
• Single Event Upsets (SEU)
• SEU = charge deposited by an ionizing particle (characterized by its LET, linear energy transfer) changing the state of a node (bitflip)
• SEUs in configuration cell SRAM
Radiation challenges
• SEUs interrupt operations by:
– Upsets in configuration memory in SRAM FPGAs (main concern [1])
– Upsets in flash memory
– Upsets in registers / state-machines
• Potentially, a disruption of the clock / reset nets can stop all activity
on the FPGA
– Some space projects utilize anti-fuse devices, not an option in
our case.
– There is a potential for single event functional interrupts
[1] New Developments in Error Detection and Correction Strategies for Critical Applications, Melanie Berg, 2017
Mitigation, generally
• In our environment, we can ignore dose effects for our FPGAs because the TID will be low enough
– The FPGAs tolerate the expected doses
– We cannot ignore soft errors
• Mitigation techniques are applied to our FPGA designs
– Triple Modular Redundancy (TMR) on logic
– For protecting against configuration memory SEUs, this is not sufficient [1]
Readout Unit
Additional system component: flash FPGA, ProASIC3 (PA3), for increased radiation tolerance
SEU mitigation for the main FPGA
• In FPGA design: TMR (see poster* by M. Lupi)
• Scrubbing:
– "Scrubbing is the act of simultaneously writing into FPGA
configuration memory as the device’s functional logic area is
operating with the intent of correcting configuration memory bit
errors." [1]
– External scrubber that is radiation tolerant
– Flash FPGA configuration memory is rad-tolerant
*https://indico.cern.ch/event/799025/contributions/3486415/
Requirements for External Scrubber
• Initial configuration of the Xilinx UltraScale (XKCU, main FPGA) using a configuration stored in on-board flash memory
• Scrubbing of the XKCU configuration memory
• Configuration and scrubbing both operate on the SelectMAP bus
• Additional requirements:
– Scrubbing and initial configuration must be "fast enough"
• Scrubbing cycles should have a significantly higher frequency than the SEU rate; rule of thumb: 10x (Xilinx application note XAPP216*)
• Worst-case SEU rate: ~0.04 SEU/s per Readout Unit (~8/s for all 192 RUs)
– Radiation tolerant
– Efficient control interface
• Two I2C interfaces are available in hardware
– Efficient upload of files
*https://www.xilinx.com/support/documentation/application_notes/xapp216.pdf
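The "fast enough" requirement can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted on the slides (worst-case SEU rate, number of RUs, the 10x rule of thumb, and the 1.7 s scrubbing time from the "Key numbers" slide). The variable names are illustrative, and it assumes scrub cycles run back to back, which the slides do not state.

```python
# Numbers quoted on the slides; all names here are illustrative.
SEU_RATE_PER_RU = 0.04   # worst-case SEU/s per Readout Unit
NUM_RUS = 192            # Readout Units in the detector
RULE_OF_THUMB = 10       # scrub rate should be >= 10x the SEU rate
SCRUB_CYCLE_S = 1.7      # duration of one scrub cycle ("Key numbers" slide)

# SEU rate summed over all Readout Units.
total_seu_rate = SEU_RATE_PER_RU * NUM_RUS

# Slowest scrub rate (and longest period) that still meets the rule of thumb.
min_scrub_rate = RULE_OF_THUMB * SEU_RATE_PER_RU
max_scrub_period = 1.0 / min_scrub_rate

# Assumption: scrub cycles are issued back to back.
scrub_rate = 1.0 / SCRUB_CYCLE_S
margin = scrub_rate / SEU_RATE_PER_RU

print(f"SEU rate, all RUs:       {total_seu_rate:.2f} /s")
print(f"Longest allowed period:  {max_scrub_period:.2f} s")
print(f"Scrub rate / SEU rate:   {margin:.1f}x")
```

Under these assumptions a 1.7 s scrub cycle is comfortably inside the 2.5 s budget implied by the 10x rule.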
Flash FPGA Design
[Figure: block diagram of the Aux FPGA (PA3) design. Labels recoverable from the diagram, grouped here by apparent function:]
• Control: UART (master), I2C (master), Wishbone bus (8b data / 7b addr), register block, GPIO status, debug
• Configuration: Config ctrl, selectMAP interface to the Xilinx KUS, FIFO, CRC calc
• Flash: Flash Write Controller, Flash Read Controller, Flash interface, FIFO, 256B FIFO, ECC decoder, Samsung Flash (IC#1, IC#2)
• Clocks: Clk Ctrl, SysClk (40 MHz), Local Clk (160 MHz), Jitter Cleaner, Loss of lock cnt, LOCAL_CLK_C1B, LOCAL_CLK_C2B, LOCAL_CLK_LOL
• Reset: POR reset, SCA_GPIO Reset, Button_0 (debug), Areset, POR_conf
• External connections: SCA I2C_0, SCA I2C_5, SCA GPIO, GBTx pinheader
• TMR
Config and Scrubbing
[Figure: PA3 block diagram, same labels as on the "Flash FPGA Design" slide]
File upload
[Figure: PA3 block diagram, same labels as on the "Flash FPGA Design" slide]
Control
[Figure: PA3 block diagram, same labels as on the "Flash FPGA Design" slide]
Key numbers
• Initial configuration: 2 s (197 Mb)
• Scrubbing: 1.7 s (151 Mb)
• Writing to flash memory is done via scripts
– I2C: ~230 kb/s
– SWT* (Xilinx FIFO): ~4 Mb/s
• Resource utilization
– Logic cells: 79%
– RAM: 4 of 24
*Single Word Transaction, the slow-control protocol for the main FPGA
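To put the key numbers in perspective, the sketch below derives the implied SelectMAP throughput and the time needed to upload one configuration-sized file over each control path. It assumes "Mb" and "kb" mean 10^6 and 10^3 bits, and that an uploaded file is the same size as the 197 Mb initial configuration (ECC overhead and protocol gaps are ignored). The function names are illustrative.

```python
def throughput_mbit_s(size_mbit: float, seconds: float) -> float:
    """Average throughput in Mb/s for a transfer of size_mbit in seconds."""
    return size_mbit / seconds


def transfer_time_s(size_mbit: float, rate_mbit_s: float) -> float:
    """Time in seconds to move size_mbit at rate_mbit_s."""
    return size_mbit / rate_mbit_s


# Figures from the "Key numbers" slide.
CONFIG_MBIT, CONFIG_S = 197.0, 2.0
SCRUB_MBIT, SCRUB_S = 151.0, 1.7
I2C_MBIT_S = 0.230   # ~230 kb/s
SWT_MBIT_S = 4.0     # ~4 Mb/s

config_rate = throughput_mbit_s(CONFIG_MBIT, CONFIG_S)
scrub_rate = throughput_mbit_s(SCRUB_MBIT, SCRUB_S)
upload_i2c_s = transfer_time_s(CONFIG_MBIT, I2C_MBIT_S)
upload_swt_s = transfer_time_s(CONFIG_MBIT, SWT_MBIT_S)

print(f"Initial configuration: {config_rate:.1f} Mb/s")
print(f"Scrubbing:             {scrub_rate:.1f} Mb/s")
print(f"Upload via I2C:        {upload_i2c_s / 60:.1f} min")
print(f"Upload via SWT:        {upload_swt_s:.0f} s")
```

With these assumptions, uploading over SWT takes under a minute, while the I2C path takes on the order of a quarter of an hour per file.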
SEU mitigation in the PA3 design
• Local TMR on registers
– Recommended method for flash-based FPGAs [1]
– Needs 3x DFFs and some additional logic cells for voting
[Figure reproduced from [1]]
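The idea behind local TMR on registers can be illustrated with a small software model: three copies of each flip-flop receive the same input and a bitwise majority voter produces the output. This is a behavioural sketch only, not the PA3 HDL; the class and method names are hypothetical.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


class TmrRegister:
    """Software model of a register protected by local TMR."""

    def __init__(self, value: int = 0, width: int = 8):
        self.mask = (1 << width) - 1
        self.copies = [value & self.mask] * 3

    def clock(self, d: int) -> None:
        """Load the same input into all three flip-flop copies."""
        self.copies = [d & self.mask] * 3

    def upset(self, copy: int, bit: int) -> None:
        """Model an SEU: flip one bit in one of the three copies."""
        self.copies[copy] ^= 1 << bit

    def read(self) -> int:
        """Voted output seen by the rest of the design."""
        return majority(*self.copies)


reg = TmrRegister(0xA5)
reg.upset(copy=1, bit=3)       # a single upset is outvoted
print(hex(reg.read()))
reg.upset(copy=2, bit=3)       # a second upset on the same bit wins the vote
print(hex(reg.read()))
```

The second read shows the limit of the scheme: two upsets on the same bit in different copies, before the register is reloaded, defeat the voter.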
SEU mitigation in the Flash memory
• Scenario: writing a faulty configuration bit can theoretically stop the
Xilinx FPGA from functioning
• 1048/1024-bit Hamming error-correcting code (ECC), interleaved with the data before loading the flash (Python 3 software)
– Implementation of TN2908*
– GitLab CI creates and encodes the files on every commit
– Single-bit correction, double-bit detection; behaviour for more than 2 bitflips is undefined
• The flash device has two distinct chips inside the same package. Files are written to both in case of a critical error on one.
*https://www.micron.com/-/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2908_NAND_hamming_ECC_code.pdf
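A minimal sketch of an address-parity Hamming code with single-bit correction and double-bit detection over a 1024-bit block is shown below. It follows the general idea of pairing an "odd" and an "even" parity bit per address bit; it is not the project's encoder and does not reproduce the exact bit layout of the technical note. It produces 20 parity bits, whereas the slide's 1048-bit block leaves room for 24; how the remaining bits are used is not stated here. All names are illustrative.

```python
DATA_BYTES = 128                  # 1024 data bits per block
ADDR_BITS = 10                    # 2**10 = 1024 bit addresses
MASK = (1 << ADDR_BITS) - 1


def compute_ecc(block: bytes) -> int:
    """Return 20 parity bits: (odd parities << 10) | even parities."""
    if len(block) != DATA_BYTES:
        raise ValueError("block must be 128 bytes")
    odd = even = 0
    for i, byte in enumerate(block):
        for bit in range(8):
            if (byte >> bit) & 1:
                addr = i * 8 + bit
                odd ^= addr            # parity over bits whose address bit k is 1
                even ^= ~addr & MASK   # parity over bits whose address bit k is 0
    return (odd << ADDR_BITS) | even


def check_and_correct(block: bytes, stored_ecc: int):
    """Return (data, status) after comparing stored and recomputed ECC."""
    syndrome = compute_ecc(block) ^ stored_ecc
    if syndrome == 0:
        return bytes(block), "ok"
    s_odd, s_even = syndrome >> ADDR_BITS, syndrome & MASK
    if s_odd ^ s_even == MASK:
        # Every odd/even pair disagrees: single data-bit error at address s_odd.
        fixed = bytearray(block)
        fixed[s_odd // 8] ^= 1 << (s_odd % 8)
        return bytes(fixed), "corrected"
    if bin(syndrome).count("1") == 1:
        # Exactly one parity bit differs: the error is in the ECC itself.
        return bytes(block), "ecc_bit_error"
    return bytes(block), "uncorrectable"


block = bytes(range(DATA_BYTES))
ecc = compute_ecc(block)
damaged = bytearray(block)
damaged[5] ^= 0x10                       # one bitflip
print(check_and_correct(bytes(damaged), ecc)[1])
damaged[7] ^= 0x01                       # second bitflip in the same block
print(check_and_correct(bytes(damaged), ecc)[1])
```

As on the slide, three or more bitflips in one block are outside what this scheme guarantees and may be miscorrected.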
SEU mitigation in the Flash memory
• Based on irradiation campaigns, the SEU cross section in the flash memory is estimated at:
– (0 → 1): 10^-16 cm2/bit
– (1 → 0): 10^-21 cm2/bit
• A typical scrubbing file has a 1:20 ratio of ones to zeros
• A typical programming file has a 1:50 ratio of ones to zeros
– given no default values written to BRAM
• Because of this, the bits of the files are inverted before they are written to the flash memory
Weste, Harris: CMOS VLSI Design, p.127
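The benefit of storing the files inverted can be estimated by weighting the two cross sections by the fraction of stored zeros and ones. The sketch below takes the larger cross section to be the 0 → 1 transition, which is the reading under which inverting a mostly-zero file helps and which reproduces the 4.76E-18 cm2/bit figure quoted in the backup slide. The function names are illustrative.

```python
CS_0_TO_1 = 1e-16   # cm2/bit, a stored 0 flipping to 1
CS_1_TO_0 = 1e-21   # cm2/bit, a stored 1 flipping to 0


def effective_cross_section(ones: float, zeros: float) -> float:
    """Average per-bit cross section for a file with the given ones:zeros ratio."""
    total = ones + zeros
    return (zeros * CS_0_TO_1 + ones * CS_1_TO_0) / total


def invert(data: bytes) -> bytes:
    """Bitwise inversion applied before writing to flash (and again after reading)."""
    return bytes(b ^ 0xFF for b in data)


# Scrubbing file: 1:20 ratio of ones to zeros.
as_is = effective_cross_section(ones=1, zeros=20)
inverted = effective_cross_section(ones=20, zeros=1)

print(f"stored as-is:    {as_is:.2e} cm2/bit")
print(f"stored inverted: {inverted:.2e} cm2/bit")
print(f"improvement:     {as_is / inverted:.1f}x")
```

Because inversion is its own inverse, `invert(invert(data))` returns the original file, so the read path only needs the same XOR.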
SEU mitigation in the Flash memory
• Three measures have been implemented:
1. Storing the programming file inverted
2. Adding Hamming encoding of the bitstream
3. Storing two copies of all files in the flash memory
• This gives: P(fatal error) = P(double bitflip in the same ECC-encoded block in both copies of the file)
– P(fatal error) = 7E-26 during a 10 h spill
Additional feature for commissioning
and design qualification
• Fault injection
• A tool for tabletop "beam-testing"
• To be used for commissioning and design qualification only.
– This can be exploited to improve rad tolerance and add design
recovery routines.
Fault injection HW top level
• Select random number -> count down -> flip bit
• Injection rate is 14x the worst-case design SEU rate
[Figure: PA3 block diagram, same labels as on the "Flash FPGA Design" slide]
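The "select random number, count down, flip bit" loop can be modelled in software as follows. This is a behavioural sketch only: Python's `random.Random` stands in for the hardware PRBS, a `bytearray` stands in for the target memory, and the tick and countdown ranges are arbitrary illustration values, not the ones used in the PA3 design.

```python
import random


def inject_faults(memory: bytearray, n_ticks: int, max_countdown: int,
                  rng: random.Random) -> list:
    """Flip one random bit each time a random countdown expires.

    Returns a list of (tick, bit_index) for every injected flip.
    """
    flips = []
    countdown = rng.randrange(1, max_countdown + 1)   # select random number
    for tick in range(n_ticks):
        countdown -= 1                                # count down
        if countdown == 0:
            bit_index = rng.randrange(len(memory) * 8)
            memory[bit_index // 8] ^= 1 << (bit_index % 8)   # flip bit
            flips.append((tick, bit_index))
            countdown = rng.randrange(1, max_countdown + 1)
    return flips


memory = bytearray(64)                    # stand-in for the target memory
flips = inject_faults(memory, n_ticks=1000, max_countdown=50,
                      rng=random.Random(1))
print(f"{len(flips)} bitflips injected")
```

Seeding the generator makes a run repeatable, which is convenient when a particular flip needs to be replayed during design qualification.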
PRBS "random" functions
• Pseudorandom binary sequence (PRBS)
• Linear Feedback Shift Register (LFSR), 32 bits long
• Output scaled to fit the memory layout (4504 pages x 4096 bytes)
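A 32-bit LFSR and one possible way of scaling its output to the memory layout are sketched below. The feedback polynomial (x^32 + x^22 + x^2 + x + 1, a commonly tabulated maximal-length choice) and the multiply-and-shift scaling are assumptions for illustration; the slides do not state which polynomial or scaling the design uses.

```python
PAGES = 4504
PAGE_BYTES = 4096
TOTAL_BITS = PAGES * PAGE_BYTES * 8
TAPS_MASK = 0x80200003   # assumed polynomial: x^32 + x^22 + x^2 + x + 1


def lfsr32(seed: int):
    """Galois LFSR yielding successive 32-bit states."""
    state = seed & 0xFFFFFFFF
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= TAPS_MASK
        yield state


def scale_to_layout(value: int):
    """Map a 32-bit value to (page, byte, bit) by multiply-and-shift."""
    index = (value * TOTAL_BITS) >> 32        # 0 <= index < TOTAL_BITS
    page, rest = divmod(index, PAGE_BYTES * 8)
    byte, bit = divmod(rest, 8)
    return page, byte, bit


prbs = lfsr32(seed=1)
for _ in range(3):
    print(scale_to_layout(next(prbs)))
```

Successive LFSR states are shifted copies of each other, so consecutive addresses drawn this way are correlated; stepping the register several times between draws is one common way to reduce that.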
Status
• Design is verified and tested; all mandatory
features of the FPGA design are ready.
• Work in progress:
– Finalize fault injection
– Remote programming of ProASIC3
• Thank you
Probability of fatal error
• Combined cross section:
– CS_1:20 = 4.76E-18 cm2/bit
• Probability of a double bitflip in an ECC block in flash #0:
– P(double#0) ≈ (CS_1:20 * ECC_size * ECC_blocks)^2 = 1.61E-14
• Probability of a double bitflip in the same ECC block in flash #1:
– P(double#1 | double#0) ≈ P(double#0) / ECC_blocks = 6.33E-22
• Combined probability:
– P(double#1 ∩ double#0) = P(double#0) * P(double#1 | double#0) = 1E-35
• 7E-26 double bitflips in the same ECC block in both flash ICs during a 10 h run
• Important numbers:
– ECC block size: 1048 bits
– Number of ECC blocks on flash: 2.52E+07
– Est. flux, Run 3: 1 kHz/cm2
– Fluence, 10 h spill: 3.6E+07 cm-2
– Cross section (1 → 0): 1.0E-21 cm2/bit
– Cross section (0 → 1): 1.0E-16 cm2/bit
– Ratio 1:0, scrub file: 1:20
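The intermediate values can be recomputed directly from the formulas and inputs listed on this slide. The sketch below evaluates the three expressions as written; the results agree with the quoted values to within a few percent, presumably because the slide used unrounded inputs. How the slide scales the combined value to the quoted 7E-26 for a 10 h run is not stated, so that step is not reproduced here.

```python
# Inputs as listed on the slide.
CS = 4.76e-18        # combined cross section, cm2/bit
ECC_SIZE = 1048      # bits per ECC block
ECC_BLOCKS = 2.52e7  # ECC blocks on the flash

# P(double#0) ~ (CS * ECC_size * ECC_blocks)^2
p_double_0 = (CS * ECC_SIZE * ECC_BLOCKS) ** 2

# P(double#1 | double#0) ~ P(double#0) / ECC_blocks
p_double_1_given_0 = p_double_0 / ECC_BLOCKS

# P(double#1 and double#0) = P(double#0) * P(double#1 | double#0)
p_both = p_double_0 * p_double_1_given_0

print(f"P(double#0)            = {p_double_0:.2e}")
print(f"P(double#1 | double#0) = {p_double_1_given_0:.2e}")
print(f"P(both)                = {p_both:.2e}")
```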
ITS PLENARY MEETING 28TH FEB - 1ST MAR 2018
Resource usage & timing
                                Main Branch   Fault Injector
Core cells                      79%           94%
Block RAMs                      4 of 24       7 of 24
Sys_clk estimate (40 MHz req)   41.5 MHz      40.2 MHz
03/09/2019
How random is prbs