A Scalable Pipelined Architecture for Biomimetic Vision Sensors
DANIEL LLAMOCCA AND BRIAN K. DEAN
Electrical and Computer Engineering Department,
Oakland University
September, 3rd, 2015
Outline Motivation
Contributions
Methodology
Results
Conclusions
Motivation Fly-inspired vision algorithms can outperform traditional
image processing algorithms in motion detection and in-flight obstacle tracking and interception. Applications include high-speed target tracking for unmanned
aerial and ground vehicles, structural monitoring. Dedicated hardware implementations are desired when the
large amounts of data are to be processed in parallel.
OpticalElectrical Interface
7
Opticalfront-end
7
ADCs
FilterLight
Adaptation
C1-7
Supportinghardware
FLY-INSPIRED VISION SENSOR
B B
BO
1 2
3 4 5
6 7
FO1-7
LA
O1
-7
Vision Sensor: Optical front-end
Optical electrical interface
Analog-to-digital converters
Supporting digital hardware for filtering and light adaptation.
Motivation Vision Sensor: Optical front-end
Plano-convex lens (12 mm diameter and 12 mm focal length) placed above seven photodiodes arranged in a hexagonal pattern
Hexagonal pattern approximates fly optical arrangement.
Photoreceptor response: overlapping Gaussians
Contributions Fully pipelined and Scalable Hardware
Implementation for the Biomimetic Sensor The fully-customizable fixed-point architecture allows
users to quickly modify design parameters (# of input bits, output format, # of bits per of iterations, # of bits of filters’ coefficients).
Fully-pipelined architecture is achieved by unrolling the IIR filter architecture.
Generic VHDL code validated on an FPGA The fully-parameterized RTL VHDL code is not tied to a
particular device or vendor. Design Space Exploration
The fully-parameterized VHDL code allows us to create a set of different hardware profiles by varying the design parameters. We can then explore trade-offs among design parameters, accuracy, resources, andexecution time.
Methodology Block Diagram: Data path uses fixed-point representation:
Input: [B B-1], Output/Intermediate Signals [BO BQ] Design Parameters: B, BO, BQ, NH (# of bits per filters’ coefficients)
60 Hz IIRNotch Filter
BC1
BO FO1
60 Hz IIRNotch Filter
BC2
BO FO2
60 Hz IIRNotch Filter
BC3
BO FO3
60 Hz IIRNotch Filter
BC4
BO FO4
60 Hz IIRNotch Filter
BC5
BO FO5
60 Hz IIRNotch Filter
BC6
BO FO6
60 Hz IIRNotch Filter
BC7
BO FO7
Average Unit
BO AVG
0.159 Hz Low-PassDA FIR Filter
N=24,L=6,symmetricE
BO
AVG_FILT
v v
...
RL_AVG + RL_FIR
...
RL_AVG + RL_FIR
...
RL_AVG + RL_FIR
...
RL_AVG + RL_FIR
...
RL_AVG + RL_FIR
...
RL_AVG + RL_FIR
...
RL_AVG + RL_FIR
E
E
-+
RL_FIRRL_AVG
RL_IIR
BO LAO7
-+
BO LAO1
-+
BO LAO2
-+
BO LAO3
-+
BO LAO4
-+
BO LAO5
-+
BO LAO6
v
v
NH
BQ
B
BO
Architecture:7 IIR filters,1 average unit,1 FIR filter,7 subtractors,7 chain registers(synchronization registers that allow for pipelining)
𝑅𝐿_𝐴𝑉𝐺 = 𝐵𝑂 + 2 log2𝑁𝑅𝐿_𝐹𝐼𝑅 = log2 𝐵𝑂 + 1 + log2𝑁/2𝐿 + 2𝑅𝐿_𝐼𝐼𝑅 = 7
Methodology IIR Filter (60 Hz Notch filter): fs=1 KHz, 2nd order IIR filter
Direct implementation: data dependencies prevent pipelining Look-ahead transformation: The 2nd order IIR filter is turned
into a 4th order IIR filter with no data dependencies.
𝐻 𝑧 =𝑏0+𝑏1𝑧
−1+𝑏2𝑧−2
1+𝑎1𝑧−1+𝑎2𝑧
−2 𝐻𝑃(𝑧) =𝑏𝑝0+𝑏𝑝1𝑧
−1+𝑏𝑝2𝑧−2+𝑏𝑝3𝑧
−3+𝑏𝑝4𝑧−4
1+𝑎𝑝2𝑧−2+𝑎𝑝4𝑧
−4
Scattered look-ahead decomposition with powers of 2: coefficients of the 4th order IIR filter avoid instability.
X_in B
+
NH
Addertree
bp(0)
B
bp(1)NH
bp(2)NH
bp(3)
bp(4)NH
NH
IIR FILTER
+Addertree
NH-ap(2)
NH-ap(4)
BO
FO
w[n]
y[n-4]y[n] y[n-2]
clock
w[n] w[0] w[1] w[2] w[3] w[4] w[5] w[6] w[7] w[8] w[9]
y[n]
y[n-1]
y[n-2]
y[n-3]
y[n-4]
y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8] y[9]
y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8]
y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7]
y[0] y[1] y[2] y[3] y[4] y[5] y[6]
y[0] y[1] y[2] y[3] y[4] y[5]
y[n] = w[n] - ap(2)*y[n-2] - ap(4)*y[n-4]
Assumption: The adder tree does not have delay units
Methodology IIR Filter (60 Hz Notch filter): fs=1 KHz, 2nd order IIR filter
Retiming: An actual adder tree usually includes register levels in order to increase the frequency of operation.
For example, a 3-input adder tree usually has 2 register levels. The delay breaks the pipeline of the previous figure.
Retiming is used here to address this issue: the delay units that create y[n-2] are embedded into the two register levels of the adder tree.
+
+
NH-ap(4)
NH-ap(2)
BO
FO
+
+
NH-ap(4)
NH-ap(2)
BO
FO
w[n]
w[n]
clock
w[n] w[0] w[1] w[2] w[3] w[4] w[5] w[6] w[7] w[8] w[9]
w[n-2]
y[n-2]
y[n-4]
w[0] w[1] w[2] w[3] w[4] w[5] w[6] w[7] w[8] w[9]
y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7]
y[0] y[1] y[2] y[3] y[4] y[5]
y[n] = w[n] - ap(2)*y[n-2] - ap(4)*y[n-4]
Assumption: The adder tree has two delay units
y[n]
-ap(2)*y[n-2] -ap(4)*y[n-4] + w[n-2]
y[n] y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8] y[9]
Methodology FIR Filter with Distributed
Arithmetic Efficient multiplier-less
implementation where coefficients are constant.
Cut-off frequency: 0.159 Hz. Stopband: -41dB 24-tap symmetric low-pass
filter Fully pipelined system with I/O
delay of RL_FIR. LUT input size = 6 Coefficients format:
[NH NH-1] Constant coefficients loaded as
a text file.
BO
0.159 Hz Low-PassDA FIR Filter
N=24,L=6,symmetricE
BO
v
RL_FIR
...
s[0]s[M-2]
s[0]:
s[L-1]:
X
x[0]
x[N-1] x[M+1]
x[M]x[M-1]x[1]
x[N-2]B
only whenN is even
B+1
E
E E E
E E E
E
sB
s0
+
Addertree
s[M-1]
...
...
-2B
20...
LUT L-to-LO
LUT L-to-LO
Filter Block 0
s[L]:
s[2L-1]:
...
Filter Block 1
s[M-L]:
s[M-1]:
...
...
Filter Block M/L-1
LO=NH+log2L
x[M]
x[M
-1]...
...
...
sB
s0
...
-2B
20...
LUT L-to-LO
LUT L-to-LO
...
sB
s0
-2B
20...
LUT L-to-LO
LUT L-to-LO
B+1 B+1
sB s1 s0...
sB s1 s0...
sB s1 s0...
+
+
+
Y
X Y
BO
BO
L
B
N
NH
M=N/2
Methodology Averaging unit
The seven outputs FOx (x=1..7) are averaged out by this block. This requires a 7-input pipelined adder tree and an array divider.
X(0) X(1)
+++
++
+
0
BY
+/-
Yo
ARRAY
DIVIDER
+/-
BY=BO+log2(N)
BY
log2(N+1)N=7
+/-
+/-
BY
BY
BY0BY
BO BO BO BO BO BO BO
BY BO
X(2) X(3) X(4) X(5) X(6)
v
E
BY
+lo
g2 (N
)
N
BOAVG
BY
Input format: [BO BQ] Output format: [BO BQ] I/O delay:𝑅𝐿_𝐴𝑉𝐺 = 𝐵𝑂 + 2 log2𝑁
The adder tree output requires log2𝑁 extra integer bits, but the divider gets rid of those bits, hence the output of the Average Unit only needs BO bits.
Accuracy can always be increased by incrementing the number of fractional bits the divider generates.
Methodology Experimental Setup:
Input signals: seven overlapping Gaussian-shaped signals (close match to the angular displacement response of the fly’s rhabdomers). 500 samples generated per channel, values quantized with 8, 10, and 12 bits per sample.
Design Space Exploration: Parameters: BO=16, NH =10,12,14,16, B=8,10,12, BQ=10,11,12,13,14
Accuracy measurement: PSNR Test 1: FPGA and software (MATLAB) implementation uses
the quantized input samples. This allows us to study the effect of the fixed-point architecture on accuracy.
Test 2: Only FPGA uses the quantized input samples. This allows us to study the effect of input quantization and the fixed-point architecture on accuracy.
Synthesis of VHDL code: Artix-7 XC7A100T FPGA
Results Input/Output Behavior
Case: B=12, [BO BQ] = [16 14], NH=16. There is not much visual difference if we change the parameters.
The output signals constitute the output of a primary signal path required for all image processing techniques.
0 0.2 0.4 0.6 0.8 1-0.2
0
0.2
0.4
0.6
0.8
1
1.2
Normalized Target Location
Norm
aliz
ed O
utp
ut V
olta
ge
0 0.2 0.4 0.6 0.8 1-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
Normalized Target Location
Norm
aliz
ed O
utp
ut V
olta
ge
C1
C6C3
C4 C5
C2
C7
LAO1
LAO6LAO3 LAO4
LAO5
LAO2LAO7
(a) (b)
Results Hardware Resource Utilization
The figure shows resources (in terms of 6-input LUTs and registers) for all the cases where [BO BQ] = [16 14]
The effect of BQ on resources is negligible and it is not shown.
For proper comparison, the DSP48E1s blocks were not used.
3500 4000 4500 5000 5500 6000
6000
8000
10000
12000
14000
LUTs
Regis
ters
B=8
B=10
B=12
NH=10
NH=12
NH=14
NH=16
Execution Time For BO=16, N=7, L=6, the I/O delay is given by:
𝑅𝐿_𝑆𝑌𝑆 = 𝑅𝐿_𝐼𝐼𝑅 + 𝑅𝐿_𝐴𝑉𝐺 + 𝑅𝐿_𝐹𝐼𝑅 = 36 cycles To compute NS samples per channel, we need 36+NS cycles. For 100 MHz, the execution time is: 36 + 𝑁𝑆 × 10−8𝑠𝑒𝑐𝑜𝑛𝑑𝑠
𝑇ℎ𝑟𝑜𝑢𝑔ℎ𝑝𝑢𝑡 =108
1+ 36𝑁𝑆
𝑠𝑎𝑚𝑝𝑙𝑒𝑠 𝑝𝑒𝑟 𝑐ℎ𝑎𝑛𝑛𝑒𝑙/𝑠𝑒𝑐𝑜𝑛𝑑
Results Accuracy (PSNR)
Test 1: FPGA and software implementations use the quantized input samples.
Results only shown for B=12, as the effect of input bit-width (B) is negligible. NH and BQ have the strongest effect on accuracy. 10 11 12 13 14
65
70
75
80
85
90
95
BQ
PS
NR
(dB
)
NH=10
NH=12
NH=14
NH=16
Results Accuracy (PSNR)
Test 2: Only FPGA uses the quantized input samples.
Accuracy-resource trade-offs are indicated.
Accuracy is heavily affected by B and BQ. NH has some effect.
Resources are affected by B and NH.
3500 4000 4500 5000 5500 6000
55
60
65
70
75
80
85
Slices
PS
NR
(dB
)
NH
NH
=1
0
NH
=10
NH
=1
0 BQBQ
BQ
NH
=1
4
10...1
4
10
11
12
1314
NH
=1
6
NH
=1
2
NH
=16
Results Application: Edge Detection
Edge feature extraction: Gradients are computed specific to the orientation
Conclusions A scalable and fully-pipelined fixed-point architecture was
implemented and successfully validated.
This works demonstrates the feasibility of incorporatingdigital hardware into the design of largely analogcompound vision sensors.
The next step is to implement a system with severalsensors on an FPGA that can adapt resources at run-timebased on user-generated or automatic constraints.
Current work consist on implementing selected imageprocessing algorithms based on the outputs LAOx (x=1..7)generated by the system.