Magnus Rentsch Ersdal

magnus.ersdal@uib

TWEPP '19, Santiago de Compostela

Inner Tracking System (ITS) Upgrade

UNIVERSITY OF BERGEN

[Figures: inner barrel half-layers; ITS upgrade cutaway]

Readout Electronics

[Block diagram: readout chain. Labels recovered from the figure:]

• Regions: Radiation Field; Atmospheric Radiation Environment
• Detector side: Stave with ALPIDE chips; Readout Unit (RU); Power Unit (PU)
• On the RU: Main FPGA; Aux FPGA; Flash Mem; 3x GBTx; SCA; CAN
• Off-detector: Common Readout Unit (CRU); O2 First Level Processor (FLP); Event Processing Nodes (EPN); Central Trigger Processor (CTP); Timing & Trigger System; Power System; Detector Control System (DCS)
• 8 to 28 ALPIDE connections per RU; 8 RUs per CRU; 2 CRUs per FLP
• Link legend: TRIGGER, CONTROL, DATA, POWER; GBT links (3.2 Gbps); CANBUS DCS (backup)

Readout Unit

[Board diagram: Readout Unit and Transition Board. Labels: Stave with ALPIDE; Main FPGA; Flash FPGA; Flash Memory; 3x GBTx; GBT-SCA; CANBUS. Annotations: "Rad hard by design"; "Rad tolerant Config Memory"; "Low cross section per bit".]

Radiation environment

[Figure: location of the Readout Units, marked "Readout Units sit here"]

• ~4 orders of magnitude more than normal radiation background
• Design for 1 kHz/cm2
• Total Ionizing Dose (TID) and Non-Ionizing Energy Loss (NIEL) are such that they pose no concern

SEUs and CMOS circuits

• Single Event Upsets (SEU)
• SEU = charge deposited by an ionizing particle (quantified by its linear energy transfer, LET) changing the state of a node (bit flip)
• SEUs in configuration cell SRAM

Radiation challenges

• SEUs interrupt operations by:
– Upsets in configuration memory in SRAM FPGAs (main concern [1])
– Upsets in flash memory
– Upsets in registers / state machines
• Potentially, a disruption of the clock / reset nets can stop all activity on the FPGA
– Some space projects use anti-fuse devices; that is not an option in our case.
– There is a potential for single event functional interrupts

[1] Melanie Berg, "New Developments in Error Detection and Correction Strategies for Critical Applications", 2017

Mitigation, generally

• In our environment, we can ignore dose effects for our FPGAs because the TID will be low enough
– The devices tolerate the expected doses
– We cannot ignore soft errors
• Mitigation techniques are applied to our FPGA designs
– Triple Modular Redundancy (TMR) on logic
– For protecting against configuration memory SEUs, this is not sufficient [1]

Readout Unit

[Board diagram repeated from the previous "Readout Unit" slide]

Additional system components: flash FPGA, ProASIC3 (PA3), for increased radiation tolerance

SEU mitigation for the main FPGA

• In FPGA design: TMR (see poster* by M. Lupi)
• Scrubbing:
– "Scrubbing is the act of simultaneously writing into FPGA configuration memory as the device's functional logic area is operating with the intent of correcting configuration memory bit errors." [1]
– External scrubber that is radiation tolerant
– Flash FPGA configuration memory is rad-tolerant

*https://indico.cern.ch/event/799025/contributions/3486415/

Requirements for External Scrubber

• Initial configuration of the Xilinx UltraScale (XKCU, main FPGA) using a configuration stored in on-board flash memory
• Scrubbing of the XKCU configuration memory
• Configuration and scrubbing both operate on the SelectMAP bus
• Additional requirements:
– Scrubbing and initial configuration must be «fast enough»
    • Scrubbing cycles should have a significantly higher frequency than the SEU rate; rule of thumb: 10x (Xilinx application note xapp216*)
    • Worst-case SEU rate: ~0.04 SEU/s per Readout Unit (~8/s for all 192 RUs)
– Radiation tolerant
– Efficient control interface
    • Two I2C interfaces are available in hardware
– Efficient upload of files

*https://www.xilinx.com/support/documentation/application_notes/xapp216.pdf
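As a rough check of the «fast enough» requirement, the 10x rule of thumb can be turned into a budget for the scrubbing period. The sketch below uses only the worst-case rate and RU count quoted on this slide. The function name, and the reading of "10x" as ten complete scrub cycles per expected upset, are assumptions made for illustration.

```python
# Worked arithmetic for the scrubbing-rate requirement.
# Inputs are the numbers quoted on the slide; everything else is illustrative.

WORST_CASE_SEU_RATE = 0.04   # SEU/s per Readout Unit
RULE_OF_THUMB_FACTOR = 10    # scrub rate should be ~10x the SEU rate
N_READOUT_UNITS = 192


def max_scrub_period(seu_rate, factor=RULE_OF_THUMB_FACTOR):
    """Longest scrub cycle (seconds) that still gives `factor` cycles per expected SEU."""
    return 1.0 / (seu_rate * factor)


mean_time_between_seus = 1.0 / WORST_CASE_SEU_RATE        # seconds, per RU
scrub_budget = max_scrub_period(WORST_CASE_SEU_RATE)       # seconds, per RU
system_seu_rate = WORST_CASE_SEU_RATE * N_READOUT_UNITS    # SEU/s, all RUs

print(f"mean time between SEUs per RU: {mean_time_between_seus:.1f} s")
print(f"scrub period budget per RU:    {scrub_budget:.2f} s")
print(f"SEU rate over all RUs:         {system_seu_rate:.2f} /s")
```

Under these assumptions a full scrub cycle should take about 2.5 s or less per Readout Unit; the 1.7 s scrubbing time listed under "Key numbers" would fit inside that budget.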

Flash FPGA Design

[Block diagram: aux FPGA (PA3) design, placed between the Samsung flash and the Xilinx KUS SelectMAP port. Labels recovered from the figure: UART (master); I2C (master); Wishbone bus (8b data / 7b addr); register block; selectmap interface; config ctrl; GPIO status; clk ctrl; SysClk (40 MHz); Local Clk (160 MHz); SCA I2C_5; SCA I2C_0; SCA GPIO; TMR; debug; flash write controller; flash read controller; flash interface; FIFO; 256B FIFO; Xilinx KUS FIFO; GBTx pinheader; ECC decoder; CRC calc; reset block (POR reset, SCA_GPIO reset, Button_0 (debug), Areset, POR_conf); IC#1; IC#2; loss-of-lock counter; LOCAL_CLK_C2B; LOCAL_CLK_C1B; LOCAL_CLK_LOL; jitter cleaner.]

Config and Scrubbing

[Block diagram repeated from the "Flash FPGA Design" slide]

File upload

[Block diagram repeated from the "Flash FPGA Design" slide]

Control

[Block diagram repeated from the "Flash FPGA Design" slide]

Key numbers

• Initial configuration: 2 s (197 Mb)
• Scrubbing: 1.7 s (151 Mb)
• Writing to flash memory is done via scripts
– I2C: ~230 kb/s
– SWT* (Xilinx FIFO): ~4 Mb/s
• Resource utilization
– Logic cells: 79%
– RAM: 4 of 24

*Single Word Transaction, the slow-control protocol for the main FPGA
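To put these numbers side by side, the following sketch derives the implied SelectMAP throughput and the time needed to upload one configuration file over each path. It assumes decimal units (1 Mb = 10^6 bits, 1 kb = 10^3 bits), that the quoted rates are sustained, and that the uploaded file has the quoted 197 Mb size without extra ECC overhead; none of these assumptions is stated on the slide.

```python
# Back-of-the-envelope throughput figures from the "Key numbers" slide.
# Unit conventions and helper names are assumptions for illustration.

MEGABIT = 1_000_000
KILOBIT = 1_000


def throughput_mbps(size_megabits, seconds):
    """Average throughput in Mb/s for moving `size_megabits` in `seconds`."""
    return size_megabits / seconds


def upload_seconds(size_megabits, rate_bits_per_second):
    """Time in seconds to move `size_megabits` at a sustained bit rate."""
    return size_megabits * MEGABIT / rate_bits_per_second


config_rate = throughput_mbps(197, 2.0)            # initial configuration
scrub_rate = throughput_mbps(151, 1.7)             # one scrubbing cycle
i2c_upload = upload_seconds(197, 230 * KILOBIT)    # flash write via I2C
swt_upload = upload_seconds(197, 4 * MEGABIT)      # flash write via SWT

print(f"initial configuration: {config_rate:.1f} Mb/s")
print(f"scrubbing:             {scrub_rate:.1f} Mb/s")
print(f"upload via I2C:        {i2c_upload / 60:.1f} min")
print(f"upload via SWT:        {swt_upload:.0f} s")
```

Under these assumptions the I2C path takes on the order of a quarter of an hour per configuration file, while the SWT path takes under a minute.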

SEU mitigation in the PA3 design

• Local TMR on registers
– Recommended method for flash-based FPGAs [1]
– Needs 3x DFFs and some additional logic cells for voting

[Figure reproduced from [1]]
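The idea of local TMR on a register can be illustrated with a small behavioural model: three copies of the flip-flop contents, a bitwise majority voter on the output, and the voted value fed back on each clock. This is a Python sketch for illustration only, not the HDL used in the PA3 design; the class and method names are hypothetical.

```python
# Behavioural model of a triplicated register with a majority voter.
# Names (TmrRegister, upset, clock) are hypothetical, for illustration only.

def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


class TmrRegister:
    def __init__(self, value=0, width=8):
        self.mask = (1 << width) - 1
        self.copies = [value & self.mask] * 3   # the 3x DFFs

    def read(self):
        """Voted output: correct as long as at most one copy is wrong per bit."""
        return majority(*self.copies)

    def clock(self, next_value=None):
        """Load a new value, or re-load the voted value to repair an upset copy."""
        value = self.read() if next_value is None else next_value & self.mask
        self.copies = [value] * 3

    def upset(self, copy, bit):
        """Model an SEU: flip one bit in one of the three copies."""
        self.copies[copy] ^= 1 << bit


reg = TmrRegister(0xA5)
reg.upset(copy=1, bit=3)           # single upset is masked by the voter
print(hex(reg.read()))
reg.clock()                        # feedback restores all three copies
print([hex(c) for c in reg.copies])
```

A single upset is masked and then repaired on the next clock. Two upsets of the same bit in different copies before a repair would defeat the voter, which is why the voted value is fed back continuously in this model.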

SEU mitigation in the Flash memory

• Scenario: writing a faulty configuration bit can theoretically stop the Xilinx FPGA from functioning
• 1048/1024-bit Hamming error-correcting codes (ECC), interleaved with the data before loading the flash (Python 3 software)
– Implementation of TN2908*
– GitLab CI creates and encodes the files on every commit
– Single-bit correction, double-bit detection. The behaviour for more than 2 bit flips is undefined.
• The device has two distinct chips inside the same package. Both are written, in case of a critical error on one.

*https://www.micron.com/-/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2908_NAND_hamming_ECC_code.pdf
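The single-bit correction and double-bit detection behaviour can be illustrated with an address-parity Hamming scheme over a 1024-bit block. This is a minimal sketch of that class of code, not the project's encoder: the parity layout (10 "odd" plus 10 "even" parity bits), the function names and the status strings are assumptions, and the sketch does not reproduce the 1048-bit block format or the interleaving used on the flash.

```python
# Address-parity Hamming sketch: single-bit correction, double-bit detection.
# Layout and names are illustrative assumptions, not the project's format.

N_ADDR_BITS = 10
BLOCK_BITS = 1 << N_ADDR_BITS          # 1024 data bits per block
ADDR_MASK = BLOCK_BITS - 1


def parity_bits(data):
    """Return (odd, even) parity words for a list of 1024 bits.

    Bit k of `odd` is the XOR of all data bits whose index has bit k set;
    bit k of `even` is the XOR of all data bits whose index has bit k clear.
    """
    assert len(data) == BLOCK_BITS
    odd = even = 0
    for index, bit in enumerate(data):
        if bit:
            odd ^= index
            even ^= ~index & ADDR_MASK
    return odd, even


def check_and_correct(data, stored):
    """Compare recomputed parity with `stored`; return (status, data)."""
    odd, even = parity_bits(data)
    s_odd, s_even = odd ^ stored[0], even ^ stored[1]
    if s_odd == 0 and s_even == 0:
        return "ok", list(data)
    if s_odd ^ s_even == ADDR_MASK:
        # Every address bit is flagged exactly once: single data-bit error,
        # and s_odd is the index of the flipped bit.
        fixed = list(data)
        fixed[s_odd] ^= 1
        return "corrected", fixed
    if bin(s_odd).count("1") + bin(s_even).count("1") == 1:
        return "parity_bit_error", list(data)   # the data itself is intact
    return "uncorrectable", list(data)          # e.g. a double bit flip


block = [(i * 7 + i // 3) % 2 for i in range(BLOCK_BITS)]   # arbitrary test data
ecc = parity_bits(block)

single = list(block)
single[517] ^= 1
double = list(single)
double[3] ^= 1

print(check_and_correct(single, ecc)[0])
print(check_and_correct(double, ecc)[0])
```

In this sketch a double bit flip in the data is reported but not repaired, which is the case the second flash copy is meant to cover.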

SEU mitigation in the Flash memory

• Based on irradiation campaigns, the SEU cross section in the flash memory is estimated at:
– (0 → 1): 10^-16 cm2/bit
– (1 → 0): 10^-21 cm2/bit
• A typical scrubbing file has a 1:20 ratio of ones to zeros
• A typical programming file has a 1:50 ratio of ones to zeros
– given no default values written to BRAM
• Because 0 → 1 upsets dominate and the files consist mostly of zeros, the bits of the files are inverted before they are written to the flash memory

Weste, Harris: CMOS VLSI Design, p. 127
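The benefit of storing the files inverted can be checked with a short calculation of the average per-bit cross section, weighted by the fraction of zeros and ones in the stored file. The sketch uses the cross sections and the 1:20 ratio quoted on this slide; the helper names are hypothetical.

```python
# Effective per-bit cross section of a stored file, before and after inversion.
# Cross sections and the 1:20 ratio are from the slide; names are illustrative.

CS_0_TO_1 = 1e-16   # cm2/bit, stored 0 flips to 1
CS_1_TO_0 = 1e-21   # cm2/bit, stored 1 flips to 0


def effective_cross_section(ones, zeros):
    """Average cross section per stored bit for a given ones:zeros ratio."""
    total = ones + zeros
    return (zeros * CS_0_TO_1 + ones * CS_1_TO_0) / total


def invert(data: bytes) -> bytes:
    """Bitwise inversion of a file image, as applied before writing to flash."""
    return bytes(b ^ 0xFF for b in data)


as_is = effective_cross_section(ones=1, zeros=20)       # file stored unchanged
inverted = effective_cross_section(ones=20, zeros=1)    # file stored inverted

print(f"stored as-is:    {as_is:.2e} cm2/bit")
print(f"stored inverted: {inverted:.2e} cm2/bit")
print(f"improvement:     {as_is / inverted:.0f}x")
```

For a 1:20 scrubbing file the inverted image comes out at about 4.76E-18 cm2/bit, which agrees with the combined cross section quoted in the backup slide on the probability of fatal error.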

SEU mitigation in the Flash memory

• Three measures have been implemented:
1. Storing the programming file inverted
2. Adding Hamming encoding of the bitstream
3. Storing two copies of all the files in the flash memory
• This gives: P(fatal error) = P(double bit flip in the same ECC-encoded block in both copies of the file)
– P(fatal error) = 7E-26 during a 10 h spill

Additional feature for commissioning and design qualification

• Fault injection
• A tool for tabletop "beam-testing"
• To be used for commissioning and design qualification only.
– This can be exploited to improve rad tolerance and add design recovery routines.

Fault injection HW top level

• Select random number -> count down -> flip bit
• 14x faster rate than the worst-case design SEU rate

[Block diagram repeated from the "Flash FPGA Design" slide]

PRBS "random" functions

• Pseudorandom binary sequence (PRBS)
• Linear Feedback Shift Register (LFSR), 32 bits long
• Scaled to fit the memory layout (4504 pages x 4096 bytes)
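A 32-bit LFSR and one possible way of scaling its output to the memory layout can be sketched as follows. The slides do not give the feedback polynomial or the scaling method, so both are assumptions here: the taps are the commonly tabulated maximal-length set x^32 + x^22 + x^2 + x + 1, and the scaling is a plain modulo over the total number of bits, which is simple but slightly non-uniform.

```python
# 32-bit Galois LFSR and a mapping of its state to a (page, byte, bit) address.
# Polynomial choice and modulo scaling are assumptions, not taken from the design.

TAPS_MASK = 0x80200003          # x^32 + x^22 + x^2 + x + 1 (assumed taps)
PAGES = 4504                    # memory layout from the slide
PAGE_BYTES = 4096
TOTAL_BITS = PAGES * PAGE_BYTES * 8


def lfsr32_step(state):
    """Advance the LFSR by one step; `state` must be a non-zero 32-bit value."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= TAPS_MASK
    return state


def random_bit_address(state):
    """Scale a 32-bit LFSR state to a (page, byte, bit) address in the layout."""
    offset = state % TOTAL_BITS
    page, rest = divmod(offset, PAGE_BYTES * 8)
    byte, bit = divmod(rest, 8)
    return page, byte, bit


state = 0xACE1ACE1               # arbitrary non-zero seed
for _ in range(3):
    state = lfsr32_step(state)
    print(random_bit_address(state))
```

In a fault injector of the kind outlined on the "Fault injection HW top level" slide, a second draw from the same generator could load the countdown that sets the delay before the selected bit is flipped; that part is not modelled here.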

Status

• The design is verified and tested; all mandatory features of the FPGA design are ready.
• Work in progress:
– Finalize fault injection
– Remote programming of the ProASIC3
• Thank you

Probability of fatal error

• Combined cross section:
– CS1:20 = 4.76E-18 cm2/bit
• Probability of a double bit flip in an ECC block in flash #0:
– P(double#0) ≈ (CS1:20 * ECC_size * ECC_blocks)^2 = 1.61E-14
• Probability of a double bit flip in the same ECC block in flash #1:
– P(double#1 | double#0) ≈ P(double#0) / ECC_blocks = 6.33E-22
• Combined probability:
– P(double#1 ∩ double#0) = P(double#0) * P(double#1 | double#0) = 1E-35
• 7E-26 double bit flips in the same ECC block in both flash ICs during a 10 h run

• Important numbers:
– ECC block size: 1048 bits
– # ECC blocks on flash: 2.52E+07
– Est. flux Run 3: 1 kHz/cm2
– Fluence, 10 h spill: 3.6E+07 cm-2
– Cross section (1 → 0): 1.0E-21 cm2/bit
– Cross section (0 → 1): 1.0E-16 cm2/bit
– Ratio 1:0 scrub file: 1:20

ITS PLENARY MEETING 28TH FEB - 1ST MAR 2018
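The chain of estimates on this slide can be re-evaluated directly from the "important numbers". The sketch below follows the formulas as written on the slide and uses the rounded inputs given there, so the results differ from the quoted values by a few percent. It stops at the combined probability; the scaling to the quoted 7E-26 for a 10 h run is not spelled out on the slide and is not reproduced. Variable names are chosen for illustration.

```python
# Re-evaluation of the estimates on the "Probability of fatal error" slide,
# using the slide's formulas and its rounded inputs.

CS_COMBINED = 4.76e-18        # cm2/bit, combined cross section (CS1:20)
ECC_BLOCK_BITS = 1048         # bits per ECC block
ECC_BLOCKS = 2.52e7           # number of ECC blocks on the flash
FLUX = 1e3                    # particles / (cm2 s), i.e. 1 kHz/cm2
SPILL_SECONDS = 10 * 3600     # 10 h

fluence = FLUX * SPILL_SECONDS                                   # cm-2

p_double_0 = (CS_COMBINED * ECC_BLOCK_BITS * ECC_BLOCKS) ** 2    # flash #0
p_double_1_given_0 = p_double_0 / ECC_BLOCKS                     # same block, flash #1
p_both = p_double_0 * p_double_1_given_0                         # combined

print(f"fluence, 10 h spill:     {fluence:.2e} cm-2")
print(f"P(double#0):             {p_double_0:.2e}")
print(f"P(double#1 | double#0):  {p_double_1_given_0:.2e}")
print(f"P(double#1 and double#0): {p_both:.1e}")
```

With these inputs the fluence comes out at 3.6E+07 cm-2 and the combined probability at about 1E-35, in line with the slide.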

Resource usage & timing

                                  Main Branch    Fault Injector
Core cells                        79%            94%
Block RAMs                        4 of 24        7 of 24
Sys_clk estimate (40 MHz req.)    41.5 MHz       40.2 MHz

03/09/2019

How random is PRBS