Advanced Topics on FPGA Applications: Screen A

Wu, Jinyuan, Fermilab
IEEE NSS 2007 Refresher Course, Supplemental Materials, Oct. 2007

2

Doublet Matching, Hash Sorter



3

Hit Matching: Software vs. FPGA

Software | FPGA, Typical | FPGA, Resource Saving Approaches
O(n^2): for(){ for(){…} } | O(n)*O(N): Comparator Array | Hash Sorter, O(n)*O(N): in RAM
O(n^3): for(){ for(){ for(){…} } } | O(n)*O(N^2): CAM, Hough Trans. | Tiny Triplet Finder, O(n)*O(N*logN)
O(n^4): for(){ for(){ for(){ for(){…} } } } | |



4

Hash Sorter

Pass 1: Data in Group 1 are stored in the hash sorter bins based on key number K.

Pass 2: Data in Group 2 are fetched through and paired up with the corresponding Group 1 data with the same key number K.

[Figure: Group 1 and Group 2 entries, each a key K with data D, passing through the hash sorter bins indexed by K.]
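The two passes can be sketched in software as follows. This is only an illustration of the idea, not the FPGA implementation: the function name `hash_sorter_match`, the list-of-lists bins, and the `(key, data)` tuple layout are assumptions made for the example.

```python
def hash_sorter_match(group1, group2, n_bins):
    """Pair Group 2 items with Group 1 items that share the same key K."""
    # Pass 1: store Group 1 data in the bin addressed by its key.
    bins = [[] for _ in range(n_bins)]
    for key, data in group1:
        bins[key].append(data)

    # Pass 2: each Group 2 item looks only at the bin with its own key,
    # so it is never compared against Group 1 items with other keys.
    pairs = []
    for key, data in group2:
        for stored in bins[key]:
            pairs.append((key, stored, data))
    return pairs


group1 = [(3, "a"), (1, "b"), (3, "c")]
group2 = [(1, "x"), (2, "y"), (3, "z")]
pairs = hash_sorter_match(group1, group2, n_bins=4)
```

Each group is traversed once, so the time cost grows with the number of hits while the bins live in memory, which is the O(n)*O(N) "in RAM" trade-off listed for the hash sorter.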



5

Hash Sorter

[Figure: hash sorter structure, indexed by key number K.]



6

Hash Sorter Implementation

Pipelined structure: Single clock cycle push or pop

Single clock cycle fast reset



7

An Example of Track Recognition: Event

We explain the track recognition process using this 20-track example.



8

Tangent Angle Measurements

There are various techniques to measure the tangent angle of the track segment (or “doublet”, or “cluster”).

Sometimes extra “ghost” segments may exist.

The ghost segments may be resolved later in the track recognition process.



9

A Large Curvature Track

A soft track hits a large region.

A global algorithm is better suited.

The “high-pT” approximation, $r \approx 2R(\phi - \phi_0)$, is not valid globally.

An exact track equation is needed: $r = 2R\sin(\phi - \phi_0)$, or, in terms of the tangent angle $\alpha$, $r = 2R\sin\left(\frac{\alpha - \phi_0}{2}\right)$.

Measure the tangent angle.

Parameter: initial angle $\phi_0$

Parameter: radius of curvature $R$



10

An Example of Track Recognition: Clustering

[Figure: doublets in the ($\phi_0$, $c_0$) plane, before and after clustering.]

The doublets in clusters are grouped together.

The “ghost” doublets are gone.

The 9-bin scheme: for doublets on the seeding super layer in this bin, search for coincidences in these 9 bins.

The 4-bin scheme: for doublets on the seeding super layer in this bin, search for coincidences in these 4 bins.
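A software sketch of the 9-bin search, assuming each doublet has already been assigned a pair of integer bin indices `(i, j)` in the two-parameter plane. The function name, the dictionary used as the bin store, and the index layout are hypothetical choices for illustration; the slide does not specify them.

```python
from collections import defaultdict


def nine_bin_matches(seed_bins, other_bins):
    """Match each seeding-layer doublet with doublets in its 3x3 bin neighborhood."""
    # Fill pass: record which doublets from the other layers fall in each bin.
    filled = defaultdict(list)
    for index, bin_ij in enumerate(other_bins):
        filled[bin_ij].append(index)

    # Search pass: look only at the seed bin and its 8 neighbors.
    matches = []
    for seed_index, (i, j) in enumerate(seed_bins):
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for other_index in filled.get((i + di, j + dj), []):
                    matches.append((seed_index, other_index))
    return matches


matches = nine_bin_matches([(5, 5)], [(4, 6), (7, 5), (5, 5)])
```

A 4-bin variant would use a smaller neighborhood; which four bins are searched depends on the figure and is not reproduced here.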



11

FPGA Block Diagram

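Hash sorters for $\phi_0$

Hash sorters for $c_0$

$c_0 = \frac{25\,\mathrm{cm}}{R} = \frac{50\,\mathrm{cm}}{r}\sin\left(\frac{\alpha - \phi_0}{2}\right)$, where $\alpha$ is the measured tangent angle.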



12

Without Full Track Recognition

Two track parameters can be calculated for each doublet.

Useful trigger primitives can be found without full track recognition.

For example…

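$c_0 = \frac{25\,\mathrm{cm}}{R} = \frac{50\,\mathrm{cm}}{r}\sin\left(\frac{\alpha - \phi_0}{2}\right)$, where $\alpha$ is the measured tangent angle.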
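As a rough illustration of how two track parameters can be obtained from a single doublet, the sketch below assumes a circular track that passes through the origin, a hit at polar position (r, phi), and a measured tangent angle alpha at that hit. Under that geometry the initial angle is 2*phi - alpha and the radius follows from r = 2R sin(phi - phi0). The function name and argument layout are hypothetical.

```python
import math


def doublet_track_parameters(r, phi, alpha):
    """Return (phi0, R) for a circle through the origin.

    r, phi : polar coordinates of the doublet position
    alpha  : tangent angle of the track measured at that position
    Assumes phi != phi0, i.e. the track is not a straight line.
    """
    # Along a circle the tangent angle advances twice as fast as the polar angle.
    phi0 = 2.0 * phi - alpha
    # Chord length of the circle: r = 2 R sin(phi - phi0).
    radius = r / (2.0 * math.sin(phi - phi0))
    return phi0, radius


# Build a doublet on a known circle (R = 100, phi0 = 0.3) and recover the parameters.
true_radius, true_phi0, arc_angle = 100.0, 0.3, 0.4
r_hit = 2.0 * true_radius * math.sin(arc_angle / 2.0)
phi_hit = true_phi0 + arc_angle / 2.0
alpha_hit = true_phi0 + arc_angle
phi0, radius = doublet_track_parameters(r_hit, phi_hit, alpha_hit)
```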



13

Triplet Finding, Tiny Triplet Finder



14

Hit Matching: Software vs. FPGA

Software | FPGA, Typical | FPGA, Resource Saving Approaches
O(n^2): for(){ for(){…} } | O(n)*O(N): Comparator Array | Hash Sorter, O(n)*O(N): in RAM
O(n^3): for(){ for(){ for(){…} } } | O(n)*O(N^2): AM, CAM, Hough Trans. | Tiny Triplet Finder, O(n)*O(N*logN)
O(n^4): for(){ for(){ for(){ for(){…} } } } | |



15

Hits, Hit Data & Triplets

• Hit data come out of the detector planes in random order.

• Hit data from 3 planes generated by same particle tracks are organized together to form triplets.



16

TTF Operations Phase I: Filling Bit Arrays

Note: Flipped Bit Order

Physical Planes

Bit Array/Shifters

For any hit… Fill a corresponding logic cell.

• xA + xC = 2 xB

• xA = -xC + constant



17

TTF Operations Phase II: Making Match

For any center plane hit…

Logically shift the bit array.

Perform bit-wise AND in this range.

Triplet is found.

Physical Planes

Bit Array/Shifters
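The two phases can be mimicked in software with Python integers standing in for the bit arrays. This is a sketch under stated assumptions, not the FPGA design: it assumes straight tracks through equally spaced planes, so that xA + xC = 2 xB, integer hit positions in the range 0 to n_bins - 1, and a flipped bit order for plane C. The function name is hypothetical.

```python
def find_triplets(hits_a, hits_b, hits_c, n_bins):
    """Find (xA, xB, xC) with xA + xC = 2*xB using shift and bit-wise AND."""
    # Phase I: fill the bit arrays. Plane C is stored in flipped bit order,
    # so that a fixed xA + xC becomes a fixed offset between the two arrays.
    bits_a = 0
    bits_c_flipped = 0
    for xa in hits_a:
        bits_a |= 1 << xa
    for xc in hits_c:
        bits_c_flipped |= 1 << (n_bins - 1 - xc)
    mask = (1 << n_bins) - 1

    # Phase II: for each center-plane hit, shift the C array and AND with A.
    triplets = []
    for xb in hits_b:
        shift = 2 * xb - (n_bins - 1)
        if shift >= 0:
            shifted = (bits_c_flipped << shift) & mask
        else:
            shifted = bits_c_flipped >> -shift
        match = bits_a & shifted
        xa = 0
        while match:
            if match & 1:
                triplets.append((xa, xb, 2 * xb - xa))
            match >>= 1
            xa += 1
    return triplets


triplets = find_triplets([2, 5], [4, 6], [6, 7, 10], n_bins=16)
```

The inner `while` loop only decodes the matched bits; the coincidence test itself is a single shift and a single AND per center-plane hit.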



18

Tiny Triplet Finder: Reuse Coincidence Logic via Shifting Hit Patterns

C1

C2

C3

One set of coincidence logic is implemented.

For an arbitrary hit on C3, rotate, i.e., shift the hit patterns for C1 and C2 to search for coincidence.



19

Tiny Triplet Finder for Circular Tracks

[Figure: hit coordinates from C1 and C2 are scaled by R1/R3 and R2/R3 and loaded into two bit array shifters; bit-wise coincidence logic produces the triplet map, which is output to the decoder. Both axes run from 0 to 128 in steps of 16.]

1. Fill the C1 and C2 bit arrays. (n1 clock cycles)

2. Loop over C3 hits, shift bit arrays and check for coincidence. (n3 clock cycles)

Also works with more than 3 layers



20

Tiny? Yes, Tiny! – Logic Cell Usage:

AM, CAM, Hough Transform, etc.: O(N^2)

Tiny Triplet Finder: O(N*logN)



21

Complex Triplet Finding Problems



22

Options of Sequence Control



23

Micro-computing vs. Reconfigurable Computing

In a microprocessor, the users specify the program on fixed logic circuits.

In an FPGA, the users specify the logic circuits (as well as the program).

FPGA computing need not follow microprocessor architectures. (But useful experiences can be borrowed.)

The usefulness of FPGA reconfigurable computing is still to be fully appreciated.

Example: (100+3-4)*5+7 = ?

Data: 100, 3, 4, 5, 7

Control: LD, (-), (+), (*), (+)

[Figure: the CPU takes Data and Program; the FPGA takes Configuration, Data, and Program.]



24

ELMS: Enclosed Loop Micro-Sequencer

Loop & Return Logic + Stack

Conditional Branch Logic

[Block diagram labels: Program Counter, ROM (128 x 36 bits), A, Reset, CLK, Control Signals]

PC  Control Signals  Operation
00  000000000000000
01  001000100011010  LD R1, #n
02  000010001000000  LD R2, #addr_a
03  000000000000100  LD R3, #addr_X
04  000000010001000  LD R7, #0
05  000000000100001  BckA1 LD R4, (R2)
06  000100000010000  INC R2
07  000001000100000  LD R5, (R3)
08  000100010000001  INC R3
09  001001000100000  MUL R6, R4, R5
0a  000000010001000  EndA1 ADD R7, R7, R6
0b  000010000010000  DEC R1
0c  000000100000100  BRNZ BckA1

Special in ELMS: supports FOR loops at machine code level

PC+ROM is a good sequencer in FPGA.

Adding Conditional Branch Logic allows the program to loop back.

Loop & Return Logic + Stack is a special feature in ELMS that supports FOR loops at machine code level.

Allows jump back as in microprocessors
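The idea of a sequencer that handles FOR loops by itself, with a program counter, a ROM, and a loop stack but no ALU-driven DEC/BRNZ pair, can be sketched as below. The word format `(control, loop_count, loop_end)` is a hypothetical encoding chosen for this example; it is not the ELMS ROM format. The sketch assumes loop counts of at least 1 and that no two nested loops end on the same word.

```python
def run_sequencer(rom):
    """Step through the ROM and return the control words issued, in order.

    Each ROM word is a tuple (control, loop_count, loop_end):
      control    - control-signal pattern issued at this address
      loop_count - iteration count if a FOR loop starts here, else None
      loop_end   - True if this word is the last one of the innermost loop body
    """
    pc = 0
    stack = []      # one [start_address, iterations_left] entry per active loop
    issued = []
    while pc < len(rom):
        control, loop_count, loop_end = rom[pc]
        # Push a new loop only on first entry, not when jumping back to its start.
        entering = loop_count is not None and not (stack and stack[-1][0] == pc)
        if entering:
            stack.append([pc, loop_count])
        issued.append(control)
        if loop_end and stack:
            stack[-1][1] -= 1
            if stack[-1][1] > 0:
                pc = stack[-1][0]   # loop back, handled by the sequencer itself
                continue
            stack.pop()             # loop finished, fall through
        pc += 1
    return issued


rom = [
    ("INIT", None, False),
    ("A", 2, False),     # outer FOR loop, 2 iterations, starts here
    ("B", 3, True),      # inner FOR loop, 3 iterations, single-word body
    ("C", None, True),   # last word of the outer loop body
    ("DONE", None, False),
]
issued = run_sequencer(rom)
```

The data path never sees the loop counters, which is the point of keeping loop control inside the sequencer.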



25

Software: Using Spread Sheet as Compiler



26

What’s Good about ELMS: No ALU => Small Resource Usage

Princeton Architecture: Program & DATA Memory, Program Control, ALU

Harvard Architecture: Program Memory, Program Control, ALU, DATA Memory

Fermilab Architecture(?): Program Memory, Sequencer (ELMS), Data Processor, DATA Memory

The Princeton Architecture is more suitable at the system level, while the Harvard Architecture is better suited at the micro-structure level.

Regular microprocessors cannot run looped programs without an ALU.

The ALU takes a large amount of resources, yet it may not be efficiently utilized for data processing tasks in an FPGA.

The ELMS can run nested-loop programs without an ALU.

Further separation of program and data is therefore possible.

The ELMS is kept small.



27

Recursive Structure



28

The Digitizer Card for the Fermilab Beam Loss Monitor System

Beam loss input signals from ion chambers are integrated and digitized.

Sliding sums are accumulated and compared with pre-loaded thresholds.

Over-threshold conditions in several places cause a beam abort based on pre-defined settings.

Beam loss signals are filtered and “de-rippled” for display purposes.

The sequence is controlled by the “Seq128” block.

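[Block diagram: the Ion Chamber Input feeds an ADC (21 µs/sample) and RAM. The Immediate, Fast, Slow, and Very Slow Sliding Sums are each compared (A>B) with Threshold I, Threshold F, Threshold S, and Threshold V, and the comparator outputs drive the Abort Logic. CIC Sums feed the De-ripple Process. Seq128 controls the sequence.]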



29

Filter Functions

Sliding Sum:

$$s[m] = \sum_{k=m-(K-1)}^{m} 1 \cdot x[k]$$

Cascaded Integrator Comb (CIC) Sum of 2nd Order:

$$y[n] = \sum_{k=n-(2K-1)}^{n} h[k] \cdot x[k]$$

The CIC sum is a sliding sum of sliding sums.

The frequency response of the CIC sum is a $\mathrm{sinc}^2(x)$ function, which has 2nd-order zeros and better stop-band suppression.

[Plot: sinc(x) and sinc^2(x) versus x (frequency), for x from 0 to 20; first zero at 360 Hz; 21 µs/sample, 124 samples.]



30

Filter Implementation

Filter Type | Non-Recursive Implementation | Recursive Implementation
Finite Impulse Response (FIR) | Yes | Possible, Resource Friendly
Infinite Impulse Response (IIR) | NO | Yes

Recursive != IIR

[Diagram: sliding sum with input x[n] and output s[n]; the recursive form adds x[n] and subtracts x[n-K], i.e. s[n] = s[n-1] + x[n] - x[n-K].]

The non-recursive implementation needs 124 memory fetches and 124 additions, and more operations for longer sum lengths.

The recursive implementation needs 1 memory fetch and 2 add/sub operations, regardless of the sum length.
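The difference between the two implementations of a sliding sum can be seen in a short software sketch. The function names are hypothetical, the window is assumed to start empty (samples before the first one count as zero), and the CIC sum is formed, as described for the 2nd-order case, as a sliding sum of sliding sums.

```python
from collections import deque


def sliding_sum_direct(x, K):
    """Non-recursive form: re-add up to K samples for every output."""
    return [sum(x[max(0, n - K + 1): n + 1]) for n in range(len(x))]


def sliding_sum_recursive(x, K):
    """Recursive form: s[n] = s[n-1] + x[n] - x[n-K], one add and one subtract."""
    window = deque()    # plays the role of the sample memory
    s = 0
    out = []
    for sample in x:
        window.append(sample)
        s += sample
        if len(window) > K:
            s -= window.popleft()   # single fetch of the sample leaving the window
        out.append(s)
    return out


def cic_sum_second_order(x, K):
    """2nd-order CIC sum: a sliding sum of sliding sums."""
    return sliding_sum_recursive(sliding_sum_recursive(x, K), K)


samples = [1, 2, 3, 4, 5]
direct = sliding_sum_direct(samples, 3)
recursive = sliding_sum_recursive(samples, 3)
cic = cic_sum_second_order(samples, 3)
```

Both forms give the same finite impulse response; only the amount of work per output differs, and the recursive cost does not grow with K.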



31

BLM DC Process Sequencing

The processes of calculating sliding sums and CIC sums are fully sequenced.

The de-ripple processor is flat along the process path, but it operates sequentially for the 4 channels.

[Diagram, fully sequenced part: Sliding Sums 1 to 4 and the CIC sums. The terms x[n], -2x[n-K], and x[n-2K] are added to form u[n] and then y[n]; the delayed terms x[n-L], -2x[n-L-K], and x[n-L-2K] form u[n-L] and y[n-L].]

[Diagram, partially flat part: if |y[n]-y[n-L]| > MaxDY for the entire period, then PG++. WF is taken for PG=0 and PG=1, and the de-rippled output is DR = y[n]-(WF-WM). Other blocks: MaxDY, Decimation Counter.]



32

The End

Thanks



33

Resource Saving Tricks

Loop Reduction Tricks: The number of computations in a given task is reduced by (1) using fewer iterations in loops and/or (2) using fewer operations in each iteration.

Non-Loop Reduction Tricks: The number of computations in a given task is unchanged. The FPGA resource is saved by (1) reusing the resources multiple times via sequencing and/or (2) using transistor-saving resources such as RAM.



34

Resource Saving Tricks: Loop-Reduction

Multiplier-less (ML) Approaches

Recursive Implementation of FIR Filter

FFT: O(n)*O(log(N))

Tiny Triplet Finder: O(n)*O(N*log(N))

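[Diagrams: recursive sum structures with terms +s[n], -x[n-K], x[n] and +y[n], -s[n-K], x[n]; an FIR filter producing y[n] from taps *h1, *h2, ..., *h[K]; a multiplier (X) replaced by shift (<<) and add/subtract (+/-); the Tiny Triplet Finder with *R1/R3 and *R2/R3 scaling, two bit array shifters, and bit-wise coincidence logic.]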
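Multiplier-less approaches, in general, replace a multiplication by a known constant with shifts and additions. A minimal sketch of that general idea, assuming integer samples and a non-negative integer constant; the function name is hypothetical, and the slide does not say which multiplier-less scheme is meant.

```python
def shift_add_multiply(x, constant):
    """Multiply integer x by a non-negative integer constant using shifts and adds."""
    result = 0
    bit = 0
    while constant >> bit:
        # Add a shifted copy of x for every 1 bit in the constant.
        if (constant >> bit) & 1:
            result += x << bit
        bit += 1
    return result


product = shift_add_multiply(7, 10)   # 10 = 0b1010, so (7 << 3) + (7 << 1)
```

The number of additions equals the number of 1 bits in the constant, so constants with few 1 bits are cheap.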



35

Resource Saving Tricks: Non-Loop-Reduction

Sequencing

Using RAM: Hash Sorter/Histogram

[Diagram: operations OP1, OP2, OP3, OP4 with their initialization steps (Initialization; Initialization 1, 2, 3), drawn as repeated OP1 to OP4 chains.]
