Advanced Topics on FPGA ApplicationsScreen A
Wu, JinyuanFermilabIEEE NSS 2007 Refresher CourseSupplemental MaterialsOct, 2007
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
2
Doublet Matching,Hash Sorter
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
3
Hit MatchingSoftware FPGA
TypicalFPGA Resource Saving Approaches
O(n2)for(){ for(){…}}
O(n)*O(N)ComparatorArray
Hash SorterO(n)*O(N): in RAM
O(n3)for(){ for(){ for(){…} }}
O(n)*O(N2)CAM,Hugh Trans.
Tiny Triplet FinderO(n)*O(N*logN)
O(n4)for(){ for(){ for(){ for() {…}}}}
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
4
Hash Sorter
K
K
D
K
D
Pass 1: Data in Group 1 are
stored in the hash sorter bins based on key number K.
Pass 2: Data in Group 2 are
fetched though and paired up with corresponding Group 1 data with same key number K.
Group 1
Group 2
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
5
Hash Sorter
K
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
6
Hash Sorter Implementation
Pipelined structure: Single clock cycle push or pop
Single clock cycle fast reset
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
7
An Example of Track Recognition: Event
We explain the track recognition process using this 20-track example.
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
8
Tangent Angle Measurements There are various techniques to measure the tangent
angle of the track segment (or “doublet”, or “cluster”). Sometimes extra “ghost” segments may exist. The ghost segments may be resolved in track recognition
process later.
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
9
A Large Curvature Track A soft track hits large region.
A global algorithm is better suited. The “high-pT” approximation is not
valid globally. Exact track equation is needed.R
r
0
)sin(2 0 Rr
)sin(2
20
Rr
)(2 0 Rr
Parameter:Initial angle
Measure the tangent angle..
Parameter:Radius of curvature
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
10
An Example of Track Recognition: Clustering
0
c0
clustering
The doublets in clusters are grouped together.
The “ghost” doublets are gone.
The 9-bin scheme The 4-bin schemeFor doublets on the seeding super layer in this bin…
search for coincident in these 9 bins.
For doublets on the seeding super layer in this bin…
search for coincident in these 4 bins.
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
11
FPGA Block Diagram
Hash sorters for 0
Hash sorters for c0
)sin(50252
0
0
rcm
Rcmc
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
12
Without Full Track Recognition
Two track parameters can be calculated for each doublet.
Useful trigger primitives can be found without full track recognition.
For example…
)sin(50252
0
0
rcm
Rcmc
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
13
Triplet Finding,Tiny Triplet Finder
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
14
Hit MatchingSoftware FPGA
TypicalFPGA Resource Saving Approaches
O(n2)for(){ for(){…}}
O(n)*O(N)ComparatorArray
Hash SorterO(n)*O(N): in RAM
O(n3)for(){ for(){ for(){…} }}
O(n)*O(N2)AM, CAM,Hugh Trans.
Tiny Triplet FinderO(n)*O(N*logN)
O(n4)for(){ for(){ for(){ for() {…}}}}
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
15
Hits, Hit Data & Triplets
• Hit data come out of the detector planes in random order.
• Hit data from 3 planes generated by same particle tracks are organized together to form triplets.
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
16
TTF OperationsPhase I: Filling Bit Arrays
Note: Flipped Bit Order
Physical Planes
Bit Array/Shifters
For any hit… Fill a corresponding logic cell.
• xA+ xC = 2 xB
• xA= - xC + constant
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
17
TTF Operations Phase II: Making Match
For any center plane hit…
Logically shift the
bit array.
Perform bit-wise AND in this range.
Triplet is found.
Physical Planes
Bit Array/Shifters
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
18
Tiny Triplet FinderReuse Coincident Logic via Shifting Hit Patterns
C1
C2
C3
One set of coincident logic is implemented.
For an arbitrary hit on C3, rotate, i.e., shift the hit patterns for C1 and C2 to search for coincidence.
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
19
Tiny Triplet Finder for Circular Tracks
*R1/R3
*R2/R3Triplet Map Output To Decoder
Bit
Arr
ay
Shifter
Bit
Arr
ay
ShifterBit-wise Coincident Logic
0
16
32
48
64
80
96
112
128
0 16 32 48 64 80 96 112 128
1. Fill the C1 and C2 bit arrays. (n1 clock cycles)
2. Loop over C3 hits, shift bit arrays and check for coincidence. (n3 clock cycles)
Also works with more than 3 layers
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
20
Tiny? Yes, Tiny! – Logic Cell Usage:
AM, CAM, Hough Transform
etc., O(N2)Tiny Triplet FinderO(N*logN)
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
21
Complex Triplet Fining Problems
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
22
Options of Sequence Control
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
23
Micro-computing vs. Reconfigurable Computing
In microprocessor, the users specify program on fixed logic circuits. In FPGA, the users specify logic circuits (as well as program). The FPGA computing needs not to follow microprocessor architectures. (But useful
experiences can be borrowed.) The usefulness of FPGA reconfigurable computing is still to be fully appreciated.
(100+3-4)*5+7 =?
100
34
57Control:
Data: 100,3,4,5,7
LD (-) (+)(*)(+)
CPUFPGAData
ProgramConfiguration
DataProgram
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
24
ELMS– Enclosed Loop Micro-Sequencer
Loop & Return Logic + Stack
Conditional Branch Logic
ProgramCounter
ROM128x
36bits
AReset
CLK Con
trol S
igna
lsPC Control Signals Opration00 000000000000000 01 001000100011010 LD R1, #n02 000010001000000 LD R2, #addr_a03 000000000000100 LD R3, #addr_X04 000000010001000 LD R7, #005 000000000100001 BckA1 LD R4, (R2)06 000100000010000 INC R207 000001000100000 LD R5, (R3)08 000100010000001 INC R309 001001000100000 MUL R6, R4, R50a 000000010001000 EndA1 ADD R7, R7, R60b 000010000010000 DEC R10c 000000100000100 BRNZ BckA1
Special in ELMSSupports FOR loops at machine code level
PC+ROM is a good sequencer in FPGA.
Adding Conditional Branch Logic allows the program to loop back.
Loop & Return Logic + Stack is a special feature in ELMS that supports FOR loops at machine code level.
Allows jump back as in microprocessors
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
25
Software: Using Spread Sheet as Compiler
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
26
What’s Good about ELMSNo ALU => Small Resource Usage
ProgramDATA
Memory
PrincetonArchitecture
HarvardArchitecture
FermilabArchitecture(?)
ProgramControl
ALU
ProgramMemory
ProgramControl
ALUDATAMemory
ProgramMemory
Sequencer(ELMS)
Data Processor
DATAMemory
The Princeton Architecture is more suitable at system level while Harvard Architecture is better suited at micro-structure level.
Regular microprocessors cannot run looped program without an ALU.
The ALU takes large amount of resource while may not be efficiently utilized for data processing tasks in FPGA.
The ELMS can run nested loop program without an ALU.
Further separation of Program and data is therefore possible.
The ELMS is kept small.
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
27
Recursive Structure
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
28
The Digitizer Card for the Fermilab Beam Loss Monitor System
Beam loss input signals from ion chambers are integrated and digitized.
Sliding sums are accumulated and compared with pre-loaded thresholds.
Over threshold in several places causes beam abort based on pre-defined setting.
Beam loss signals are filtered and “de-rippled” for display purposes.
Sequence is controlled by “Seq128” block.
ADC21s/sample
RAM
FastSliding Sum
A>B
SlowSliding Sum
Very SlowSliding Sum
ImmediateSliding Sum Threshold I
AbortLogic
A>BThreshold F
A>BThreshold S
A>BThreshold V
CICSums
De-rippleProcess
Ion ChamberInput
Seq128
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
29
Filter Functions
SlidingSum
Cascaded IntegratorComb (CIC) Sum of 2nd Order
)1(
][1][Km
mk
kxms
)12(
][][][Kn
nk
kxkhny
The CIC sum is a sliding sum of sliding sums.
The frequency response of CIC sum is a sinc2(x) function that has 2nd order zeros and better stop band suppression. -0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 5 10 15 20x
sinc(x)sinc 2̂(x)
First Zero @ 360 Hz
Frequency
21s/sample124 samples
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
30
Filter Implementation
RecursiveImplementation
Recursive != IIRNon-RecursiveImplementation
Finite Impulse Respond (FIR)
Infinite Impulse Respond (IIR)
Possible
YesYes
NO
ResourceFriendly
x[n]
s[n]
+s[n]
-x[n-K]
x[n]
The non-recursive implementation needs: 124 memory fetches, 124 additions and more ops for longer sum lengths.
The recursive implementation needs: 1 memory fetch, 2 add/sub operations regardless sum length.
SlidingSum
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
31
BLM DC Process Sequencing
The processes of calculating sliding sums and CIC sums are fully sequenced. The de-ripple processor is flat for the process path. But it operates sequentially for 4
channels.
+ SlidingSum 1
(-)
+u[n]
-2x[n-K]
x[n]
+y[n]
x[n-2K] +u[n-L]
-2x[n-L-K]
+y[n-L]
x[n-L-2K]
x[n-L]
If |y[n]-y[n-L]|>MaxDY for entire period, then PG++.WF
PG=0WF
PG=1 PG
---
WF-WM DR=y[n]-(WF-WM)
MaxDYDecimation
Counter
+ SlidingSum 4
(-)SlidingSum 2
SlidingSum 3
Fully Sequencing
PartiallyFlat
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
32
The EndThanks
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
33
Resource Saving Tricks
Loop Reduction Tricks:The number of computations in a given task is reduced by (1) using fewer iterations in loops or/and (2) using fewer operations in each iteration.
Non-Loop Reduction Tricks:The number of computations in a given task is unchanged. The FPGA resource is saved by (1) reusing the resources multiple times via sequencing or/and (2) using transistor-saving resources such as RAM.
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
34
Resource Saving TricksLoop-Reduction
Multiplier-less (ML) Approaches
Recursive Implementation of FIR Filter
FFT: O(n)*O(log(N))
Tiny Triplet Finder: O(n)*O(N*log(N))
+s[n]
-x[n-K]
x[n]
+y[n]
-s[n-K]
x[n]
y[n]
*h1*h2
*h[K]
X
<<
+/-
*R1/R3*R2/R3
Bit
Arr
ay
Shifter
Bit
Arr
ay
ShifterBit-wise Coincident Logic
Oct. 2007, Wu Jinyuan, Fermilab
IEEE NSS Refresher Course, Supplemental Materials
35
Resource Saving TricksNon-Loop-Reduction
Sequencing: Using RAM: Hash Sorter/Histogram
OP1
Initialization
OP2 OP3 OP4
OP1 OP2 OP3 OP4
OP1 OP2 OP3 OP4
OP1 OP2 OP3 OP4
Initialization 1Initialization 2Initialization 3
OP1OP2OP3OP4
OP1OP2OP3OP4
OP1OP2OP3OP4
OP1OP2OP3OP4