fpga multipliers - univ-perp.fr · !a)le 30%,dsp 86%, b) le 52%,dsp 88%, c) le 63%,dsp 100% a...
TRANSCRIPT
![Page 1: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/1.jpg)
FPGA Multipliers
Bogdan PASCA
projet Arenaire, ENS-Lyon/INRIA/CNRS/Universite de Lyon, France
RAIM’11February 7-10, 2011
![Page 2: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/2.jpg)
Outline
Background & Context
Algorithmic techniques for reducing DSP count of large multipliersKaratsuba-Ofman algorithmNon-Standard tilingsSquarersTruncated multipliers
Conclusions
Bogdan PASCA FPGA Multipliers 1
![Page 3: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/3.jpg)
What’s an FPGA?
Field Programmable Gate Array
integrated circuit
has a regular architecture (hence array)
logic elements can be programmed to perform various functions
Bogdan PASCA FPGA Multipliers 2
![Page 4: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/4.jpg)
Modern FPGA Architecture
a set of configurable logic elements
on chip memory blocks
digital signal processing (DSP) blocks (including multipliers)
connected by a configurable wire network
all connected to outside world by I/O pins
Bogdan PASCA FPGA Multipliers 3
![Page 5: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/5.jpg)
Modern FPGA Architecture
RA
MR
AM
RA
MR
AM
a set of configurable logic elements
on chip memory blocks
digital signal processing (DSP) blocks (including multipliers)
connected by a configurable wire network
all connected to outside world by I/O pins
Bogdan PASCA FPGA Multipliers 3
![Page 6: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/6.jpg)
Modern FPGA Architecture
RA
MR
AM
RA
MR
AM
DS
PD
SP
DS
PD
SP
a set of configurable logic elements
on chip memory blocks
digital signal processing (DSP) blocks (including multipliers)
connected by a configurable wire network
all connected to outside world by I/O pins
Bogdan PASCA FPGA Multipliers 3
![Page 7: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/7.jpg)
Modern FPGA Architecture
RA
MR
AM
RA
MR
AM
DS
PD
SP
DS
PD
SP
a set of configurable logic elements
on chip memory blocks
digital signal processing (DSP) blocks (including multipliers)
connected by a configurable wire network
all connected to outside world by I/O pins
Bogdan PASCA FPGA Multipliers 3
![Page 8: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/8.jpg)
Modern FPGA Architecture
RA
MR
AM
RA
MR
AM
DS
PD
SP
DS
PD
SP
a set of configurable logic elements
on chip memory blocks
digital signal processing (DSP) blocks (including multipliers)
connected by a configurable wire network
all connected to outside world by I/O pins
Bogdan PASCA FPGA Multipliers 3
![Page 9: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/9.jpg)
Modern FPGA Architecture
RA
MR
AM
RA
MR
AM
DS
PD
SP
DS
PD
SP
LUT
a set of configurable logic elements
on chip memory blocks
digital signal processing (DSP) blocks (including multipliers)
connected by a configurable wire network
all connected to outside world by I/O pins
Bogdan PASCA FPGA Multipliers 3
![Page 10: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/10.jpg)
Modern FPGA Architecture
RA
MR
AM
RA
MR
AM
DS
PD
SP
DS
PD
SP
LUT
shift 17
18
18
a set of configurable logic elements
on chip memory blocks
digital signal processing (DSP) blocks (including multipliers)
connected by a configurable wire network
all connected to outside world by I/O pins
Bogdan PASCA FPGA Multipliers 3
![Page 11: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/11.jpg)
What can we compute?
LUT
LUT
LUT
LUT
LUT
LUT
LUT
LUT
LUTx0y0
y0
x1
y0
x2
y1
x0
x1y1
y1
x2 u2
u1
u0
l2
l1
l0 p0
p1
p2
p3
p4
x2x1x0×y1y0
l2 l1 l0+u2u1u0
p4p3p2p1p0
l0 = y0 ∧ x0
l1 = y0 ∧ x1
l2 = y0 ∧ x2
u0 = y1 ∧ x0
u1 = y1 ∧ x1
u2 = y1 ∧ x2
Bogdan PASCA FPGA Multipliers 4
![Page 12: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/12.jpg)
What can we compute?
LUT
LUT
LUT
LUT
LUT
LUT
LUT
LUT
LUTx0y0
y0
x1
y0
x2
y1
x0
x1y1
y1
x2 u2
u1
u0
l2
l1
l0 p0
p1
p2
p3
p4
x2x1x0×y1y0
l2 l1 l0+u2u1u0
p4p3p2p1p0
l0 = y0 ∧ x0
l1 = y0 ∧ x1
l2 = y0 ∧ x2
u0 = y1 ∧ x0
u1 = y1 ∧ x1
u2 = y1 ∧ x2
Bogdan PASCA FPGA Multipliers 4
![Page 13: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/13.jpg)
What can we compute?
LUT
LUT
LUT
LUT
LUT
LUT
LUT
LUT
LUTx0y0
y0
x1
y0
x2
y1
x0
x1y1
y1
x2 u2
u1
u0
l2
l1
l0 p0
p1
p2
p3
p4
x2x1x0×y1y0
l2 l1 l0+u2u1u0
p4p3p2p1p0
l0 = y0 ∧ x0
l1 = y0 ∧ x1
l2 = y0 ∧ x2
u0 = y1 ∧ x0
u1 = y1 ∧ x1
u2 = y1 ∧ x2
Bogdan PASCA FPGA Multipliers 4
![Page 14: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/14.jpg)
What can we compute?
LUT
LUT
LUT
LUT
LUT
LUTx0y0
y0
x1
y0
x2
y1
x0
x1y1
y1
x2 u2
u1
u0
l2
l1
l0 p0
p1
p2
p3
p4
FA
FA
FA
x2x1x0×y1y0
l2 l1 l0+u2u1u0
p4p3p2p1p0
l0 = y0 ∧ x0
l1 = y0 ∧ x1
l2 = y0 ∧ x2
u0 = y1 ∧ x0
u1 = y1 ∧ x1
u2 = y1 ∧ x2
Bogdan PASCA FPGA Multipliers 4
![Page 15: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/15.jpg)
Need of DSP blocks
Multiplication in logic is expensive
n × n bit ≈ n2︸︷︷︸partial products
+ n(n − 1)︸ ︷︷ ︸adder tree
LUTs
18× 18 bit ≈ 324LUT + 306LUT = 630LUTs
1 DSP block = 8 LEs (size on FPGA layout)
DSP blocks are a need in modern FPGAs
17 bit shift
17 bit shift
48
48
B
P
18
18
A
C
P
Bogdan PASCA FPGA Multipliers 5
![Page 16: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/16.jpg)
Need of DSP blocks
Multiplication in logic is expensive
n × n bit ≈ n2︸︷︷︸partial products
+ n(n − 1)︸ ︷︷ ︸adder tree
LUTs
18× 18 bit ≈ 324LUT + 306LUT = 630LUTs
1 DSP block = 8 LEs (size on FPGA layout)
DSP blocks are a need in modern FPGAs
17 bit shift
17 bit shift
48
48
B
P
18
18
A
C
P
Bogdan PASCA FPGA Multipliers 5
![Page 17: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/17.jpg)
DSP-Hungry Applications
FPGA floating point performance – a pencil and paper evaluation 1
→ DSP-blocks are a scarce resource for accelerating DP apps.Efficient reconfigurable design for pricing asian options 2
→ LUTs 46%, RAM 4%, DSP 100% (192)Implementation and evaluation of an arithmetic pipeline onFLOPS-2D: multi-FPGA system3
→ a)LE 30%, DSP 86%, b) LE 52%, DSP 88%, c) LE 63%, DSP 100%
A temporal coding hardware implementation for spiking neuralnetworks4
→ 16PE: LE 22%, RAM 3%, DSP 74% (100/136)
Four recipes for saving DSPs
1D. Strenski (HPCWire, 2007.)2Anson H.T. Tse, David B. Thomas, K. H. Tsoi, Wayne Luk (HEART’10)3H. Morisita, K. Inakagata, Y. Osana, N. Fujita, H. Amano (HEART’10)4Marco Nuno-Maganda, Cesar Torres-Huitzil (HEART’10)
Bogdan PASCA FPGA Multipliers 6
![Page 18: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/18.jpg)
DSP-Hungry Applications
FPGA floating point performance – a pencil and paper evaluation 1
→ DSP-blocks are a scarce resource for accelerating DP apps.Efficient reconfigurable design for pricing asian options 2
→ LUTs 46%, RAM 4%, DSP 100% (192)Implementation and evaluation of an arithmetic pipeline onFLOPS-2D: multi-FPGA system3
→ a)LE 30%, DSP 86%, b) LE 52%, DSP 88%, c) LE 63%, DSP 100%
A temporal coding hardware implementation for spiking neuralnetworks4
→ 16PE: LE 22%, RAM 3%, DSP 74% (100/136)
Four recipes for saving DSPs
1D. Strenski (HPCWire, 2007.)2Anson H.T. Tse, David B. Thomas, K. H. Tsoi, Wayne Luk (HEART’10)3H. Morisita, K. Inakagata, Y. Osana, N. Fujita, H. Amano (HEART’10)4Marco Nuno-Maganda, Cesar Torres-Huitzil (HEART’10)
Bogdan PASCA FPGA Multipliers 6
![Page 19: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/19.jpg)
DSP-Hungry Applications
FPGA floating point performance – a pencil and paper evaluation 1
→ DSP-blocks are a scarce resource for accelerating DP apps.Efficient reconfigurable design for pricing asian options 2
→ LUTs 46%, RAM 4%, DSP 100% (192)Implementation and evaluation of an arithmetic pipeline onFLOPS-2D: multi-FPGA system3
→ a)LE 30%, DSP 86%, b) LE 52%, DSP 88%, c) LE 63%, DSP 100%
A temporal coding hardware implementation for spiking neuralnetworks4
→ 16PE: LE 22%, RAM 3%, DSP 74% (100/136)
Four recipes for saving DSPs
1D. Strenski (HPCWire, 2007.)2Anson H.T. Tse, David B. Thomas, K. H. Tsoi, Wayne Luk (HEART’10)3H. Morisita, K. Inakagata, Y. Osana, N. Fujita, H. Amano (HEART’10)4Marco Nuno-Maganda, Cesar Torres-Huitzil (HEART’10)
Bogdan PASCA FPGA Multipliers 6
![Page 20: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/20.jpg)
DSP-Hungry Applications
FPGA floating point performance – a pencil and paper evaluation 1
→ DSP-blocks are a scarce resource for accelerating DP apps.Efficient reconfigurable design for pricing asian options 2
→ LUTs 46%, RAM 4%, DSP 100% (192)Implementation and evaluation of an arithmetic pipeline onFLOPS-2D: multi-FPGA system3
→ a)LE 30%, DSP 86%, b) LE 52%, DSP 88%, c) LE 63%, DSP 100%
A temporal coding hardware implementation for spiking neuralnetworks4
→ 16PE: LE 22%, RAM 3%, DSP 74% (100/136)
Four recipes for saving DSPs
1D. Strenski (HPCWire, 2007.)2Anson H.T. Tse, David B. Thomas, K. H. Tsoi, Wayne Luk (HEART’10)3H. Morisita, K. Inakagata, Y. Osana, N. Fujita, H. Amano (HEART’10)4Marco Nuno-Maganda, Cesar Torres-Huitzil (HEART’10)
Bogdan PASCA FPGA Multipliers 6
![Page 21: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/21.jpg)
DSP-Hungry Applications
FPGA floating point performance – a pencil and paper evaluation 1
→ DSP-blocks are a scarce resource for accelerating DP apps.Efficient reconfigurable design for pricing asian options 2
→ LUTs 46%, RAM 4%, DSP 100% (192)Implementation and evaluation of an arithmetic pipeline onFLOPS-2D: multi-FPGA system3
→ a)LE 30%, DSP 86%, b) LE 52%, DSP 88%, c) LE 63%, DSP 100%
A temporal coding hardware implementation for spiking neuralnetworks4
→ 16PE: LE 22%, RAM 3%, DSP 74% (100/136)
Four recipes for saving DSPs
1D. Strenski (HPCWire, 2007.)2Anson H.T. Tse, David B. Thomas, K. H. Tsoi, Wayne Luk (HEART’10)3H. Morisita, K. Inakagata, Y. Osana, N. Fujita, H. Amano (HEART’10)4Marco Nuno-Maganda, Cesar Torres-Huitzil (HEART’10)
Bogdan PASCA FPGA Multipliers 6
![Page 22: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/22.jpg)
Perceiving Multiplications Visually
XY
∑
classical binary multiplication
all sub-products can be properly located inside the diamond
rotate the diamond so to obtain a rectangle
Bogdan PASCA FPGA Multipliers 7
![Page 23: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/23.jpg)
Perceiving Multiplications Visually
∑Y2:0
X2:0
classical binary multiplication
all sub-products can be properly located inside the diamond
rotate the diamond so to obtain a rectangle
Bogdan PASCA FPGA Multipliers 7
![Page 24: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/24.jpg)
Perceiving Multiplications Visually
∑X5:3
Y5:3
classical binary multiplication
all sub-products can be properly located inside the diamond
rotate the diamond so to obtain a rectangle
Bogdan PASCA FPGA Multipliers 7
![Page 25: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/25.jpg)
Perceiving Multiplications Visually
∑X3:1
Y4:3
classical binary multiplication
all sub-products can be properly located inside the diamond
rotate the diamond so to obtain a rectangle
Bogdan PASCA FPGA Multipliers 7
![Page 26: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/26.jpg)
Perceiving Multiplications Visually
∑X3:1
Y4:3
classical binary multiplication
all sub-products can be properly located inside the diamond
rotate the diamond so to obtain a rectangle
Bogdan PASCA FPGA Multipliers 7
![Page 27: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/27.jpg)
Perceiving Multiplications Visually
005
5
3
X0X1
Y0
Y1
X0Y0
classical binary multiplication
all sub-products can be properly located inside the diamond
rotate the diamond so to obtain a rectangle
Bogdan PASCA FPGA Multipliers 7
![Page 28: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/28.jpg)
Perceiving Multiplications Visually
005
5
3
X0X1
Y0
Y1
X0Y0
+23+3X1Y1
+23X1Y0
+23X0Y1
XY =
classical binary multiplication
all sub-products can be properly located inside the diamond
rotate the diamond so to obtain a rectangle
Bogdan PASCA FPGA Multipliers 7
![Page 29: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/29.jpg)
Karatsuba-Ofman algorithm
trading multiplications for additions
Bogdan PASCA FPGA Multipliers 8
![Page 30: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/30.jpg)
The Karatsuba-Ofman algorithm
Basic principle for two way splitting
split X and Y into two chunks:
X = 2kX1 + X0 and Y = 2kY1 + Y0
computation goal: XY = 22kX1Y1 + 2k(X1Y0 + X0Y1) + X0Y0
precompute DX = X1 − X0 and DY = Y1 − Y0
make the observation: X1Y0 + X0Y1 = X1Y1 + X0Y0 − DXDY
XY requires only 3 DSP blocks (X1Y1,X0Y0,DXDY )
overhead: two k-bit and one 2k-bit subtraction
overhead � DSP-block emulation
Bogdan PASCA FPGA Multipliers 9
![Page 31: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/31.jpg)
The Karatsuba-Ofman algorithm
Basic principle for two way splitting
split X and Y into two chunks:
X = 2kX1 + X0 and Y = 2kY1 + Y0
computation goal: XY = 22kX1Y1 + 2k(X1Y0 + X0Y1) + X0Y0
precompute DX = X1 − X0 and DY = Y1 − Y0
make the observation: X1Y0 + X0Y1 = X1Y1 + X0Y0 − DXDY
XY requires only 3 DSP blocks (X1Y1,X0Y0,DXDY )
overhead: two k-bit and one 2k-bit subtraction
overhead � DSP-block emulation
Bogdan PASCA FPGA Multipliers 9
![Page 32: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/32.jpg)
The Karatsuba-Ofman algorithm
Basic principle for two way splitting
split X and Y into two chunks:
X = 2kX1 + X0 and Y = 2kY1 + Y0
computation goal: XY = 22kX1Y1 + 2k(X1Y0 + X0Y1) + X0Y0
precompute DX = X1 − X0 and DY = Y1 − Y0
make the observation: X1Y0 + X0Y1 = X1Y1 + X0Y0 − DXDY
XY requires only 3 DSP blocks (X1Y1,X0Y0,DXDY )
overhead: two k-bit and one 2k-bit subtraction
overhead � DSP-block emulation
Bogdan PASCA FPGA Multipliers 9
![Page 33: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/33.jpg)
The Karatsuba-Ofman algorithm
Basic principle for two way splitting
split X and Y into two chunks:
X = 2kX1 + X0 and Y = 2kY1 + Y0
computation goal: XY = 22kX1Y1 + 2k(X1Y0 + X0Y1) + X0Y0
precompute DX = X1 − X0 and DY = Y1 − Y0
make the observation: X1Y0 + X0Y1 = X1Y1 + X0Y0 − DXDY
XY requires only 3 DSP blocks (X1Y1,X0Y0,DXDY )
overhead: two k-bit and one 2k-bit subtraction
overhead � DSP-block emulation
Bogdan PASCA FPGA Multipliers 9
![Page 34: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/34.jpg)
The Karatsuba-Ofman algorithm
Basic principle for two way splitting
split X and Y into two chunks:
X = 2kX1 + X0 and Y = 2kY1 + Y0
computation goal: XY = 22kX1Y1 + 2k(X1Y0 + X0Y1) + X0Y0
precompute DX = X1 − X0 and DY = Y1 − Y0
make the observation: X1Y0 + X0Y1 = X1Y1 + X0Y0 − DXDY
XY requires only 3 DSP blocks (X1Y1,X0Y0,DXDY )
overhead: two k-bit and one 2k-bit subtraction
overhead � DSP-block emulation
Bogdan PASCA FPGA Multipliers 9
![Page 35: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/35.jpg)
The Karatsuba-Ofman algorithm
Basic principle for two way splitting
split X and Y into two chunks:
X = 2kX1 + X0 and Y = 2kY1 + Y0
computation goal: XY = 22kX1Y1 + 2k(X1Y0 + X0Y1) + X0Y0
precompute DX = X1 − X0 and DY = Y1 − Y0
make the observation: X1Y0 + X0Y1 = X1Y1 + X0Y0 − DXDY
XY requires only 3 DSP blocks (X1Y1,X0Y0,DXDY )
overhead: two k-bit and one 2k-bit subtraction
overhead � DSP-block emulation
Bogdan PASCA FPGA Multipliers 9
![Page 36: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/36.jpg)
The Karatsuba-Ofman algorithm
Basic principle for two way splitting
split X and Y into two chunks:
X = 2kX1 + X0 and Y = 2kY1 + Y0
computation goal: XY = 22kX1Y1 + 2k(X1Y0 + X0Y1) + X0Y0
precompute DX = X1 − X0 and DY = Y1 − Y0
make the observation: X1Y0 + X0Y1 = X1Y1 + X0Y0 − DXDY
XY requires only 3 DSP blocks (X1Y1,X0Y0,DXDY )
overhead: two k-bit and one 2k-bit subtraction
overhead � DSP-block emulation
Bogdan PASCA FPGA Multipliers 9
![Page 37: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/37.jpg)
Visual Interpretation
X0X1
Y1
Y0
Bogdan PASCA FPGA Multipliers 10
![Page 38: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/38.jpg)
Visual Interpretation
X0X1
Y1
Y0
X1X2
Y0
Y1
Y2
X0
Bogdan PASCA FPGA Multipliers 10
![Page 39: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/39.jpg)
Visual Interpretation
X0X1
Y1
Y0
X1X2
Y0
Y1
Y2
X0
Bogdan PASCA FPGA Multipliers 10
![Page 40: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/40.jpg)
Visual Interpretation
X0X1
Y1
Y0
X1X2
Y0
Y1
Y2
X0
Bogdan PASCA FPGA Multipliers 10
![Page 41: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/41.jpg)
Visual Interpretation
X0X1
Y1
Y0
X1X2
Y0
Y1
Y2
X0
Bogdan PASCA FPGA Multipliers 10
![Page 42: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/42.jpg)
Visual Interpretation
X0X1
Y1
Y0
X1X2
Y0
Y1
Y2
X0
Bogdan PASCA FPGA Multipliers 10
![Page 43: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/43.jpg)
Visual Interpretation
X0X1
Y1
Y0
X1X2
Y0
Y1
Y2
X0
Bogdan PASCA FPGA Multipliers 10
![Page 44: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/44.jpg)
Visual Interpretation
X0X1
Y1
Y0
X1X2
Y0
Y1
Y2
X0
Y0
Y1
Y2
X0
Y3
X1X2X3X0X1X2X3
Bogdan PASCA FPGA Multipliers 10
![Page 45: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/45.jpg)
Visual Interpretation
X0X1
Y1
Y0
X1X2
Y0
Y1
Y2
X0
Y0
Y1
Y2
X0
Y3
X1X2X3X0X1X2X3
Bogdan PASCA FPGA Multipliers 10
![Page 46: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/46.jpg)
Implementation
fairly trivial starting from the equation:
XY = 22kX1Y1 + 2k(X1Y1 + X0Y0 − DXDY ) + X0Y0
z
z
DSP48
17
17
17
17
18
18
Y0
X0
Y0
Y1
X0
X1
Y1
X1
36
34
34
X0Y0
X0Y0 − DXDY
X1Y1 + X0Y0 − DXDY
P6851X1Y1
34(16 : 0)
(33 : 17)
34x34bit multiplier using Virtex-4 DSP48
X1Y1 + X0Y0 − DXDY is implemented inside the DSPs
need to recover X1Y1 with a subtraction
Bogdan PASCA FPGA Multipliers 11
![Page 47: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/47.jpg)
Results
latency frequency (MHz). slices5 DSPs
LogiCore 6 447 26 4
LogiCore 3 176 34 4
K-O-2 3 317 95 3
Table: 34x34-bit multipliers on Virtex-4
trade-off 1DSPs (>630 Logic Elements) for 138 Logic Elements
latency frequency(MHz) slices DSPs
LogiCore 11 353 185 9
LogiCore 6 264 122 9
K-O-3 6 317 331 6
Table: 51x51 multipliers on Virtex-4
trade-off 3DSPs (>1890 Logic Elements) for 292 Logic Elements
5On Virtex4 devices 1 slice = 2 Logic ElementsBogdan PASCA FPGA Multipliers 12
![Page 48: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/48.jpg)
Non-Standard tilings
new multiplication algorithms
Bogdan PASCA FPGA Multipliers 13
![Page 49: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/49.jpg)
Non-standard tilings
optimize use of rectangular multipliers on Virtex5 (25x18 signed)
classical decomposition may produce suboptimal results
chunk size for X is 24chunk size for Y is 17
translate the operand decomposition into a tiling problem
Bogdan PASCA FPGA Multipliers 14
![Page 50: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/50.jpg)
Non-standard tilings
optimize use of rectangular multipliers on Virtex5 (25x18 signed)
classical decomposition may produce suboptimal results
chunk size for X is 24chunk size for Y is 17
translate the operand decomposition into a tiling problem
∑
Bogdan PASCA FPGA Multipliers 14
![Page 51: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/51.jpg)
Non-standard tilings
optimize use of rectangular multipliers on Virtex5 (25x18 signed)
classical decomposition may produce suboptimal results
chunk size for X is 24chunk size for Y is 17
translate the operand decomposition into a tiling problem
∑Y2:0
X2:0
Bogdan PASCA FPGA Multipliers 14
![Page 52: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/52.jpg)
Non-standard tilings
optimize use of rectangular multipliers on Virtex5 (25x18 signed)
classical decomposition may produce suboptimal results
chunk size for X is 24chunk size for Y is 17
translate the operand decomposition into a tiling problem
∑X5:3
Y5:3
Bogdan PASCA FPGA Multipliers 14
![Page 53: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/53.jpg)
Non-standard tilings
optimize use of rectangular multipliers on Virtex5 (25x18 signed)
classical decomposition may produce suboptimal results
chunk size for X is 24chunk size for Y is 17
translate the operand decomposition into a tiling problem
∑X3:1
Y4:3
Bogdan PASCA FPGA Multipliers 14
![Page 54: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/54.jpg)
Non-standard tilings
optimize use of rectangular multipliers on Virtex5 (25x18 signed)
classical decomposition may produce suboptimal results
chunk size for X is 24chunk size for Y is 17
translate the operand decomposition into a tiling problem
∑X3:1
Y4:3
Bogdan PASCA FPGA Multipliers 14
![Page 55: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/55.jpg)
Non-standard tilings
optimize use of rectangular multipliers on Virtex5 (25x18 signed)
classical decomposition may produce suboptimal results
chunk size for X is 24chunk size for Y is 17
translate the operand decomposition into a tiling problem
X0
05
5
1
3Y
23+1X3:1Y4:3
Bogdan PASCA FPGA Multipliers 14
![Page 56: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/56.jpg)
Non-standard tilings
optimize use of rectangular multipliers on Virtex5 (25x18 signed)
classical decomposition may produce suboptimal results
chunk size for X is 24chunk size for Y is 17
translate the operand decomposition into a tiling problem
X0
05
5
1
3Y
23+1X3:1Y4:3
+21+5X3:1Y5
+23X0Y5:3
+X3:0Y2:0
+24X5:4Y5:0
XY =
Bogdan PASCA FPGA Multipliers 14
![Page 57: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/57.jpg)
Tilings
Performing a 53× 53-bit multiplication on Virtex5
51
48
(a) standard tiling
00
16
33
163358
58
(b) Logicore tiling
34
0
0
24
41
58 34 17
41 24
17
M1
M2
M3M4M5
M6
M7M8
(c) proposed tiling
standard tiling ≡ classical decomposition (12 DSPs)
Logicore 11.1 tiling uses 10 DSPs (4 DSPs used as 17x17-bit)
our proposed tiling does it in 8 DSPs and a few LUTs
Bogdan PASCA FPGA Multipliers 15
![Page 58: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/58.jpg)
Tiling Architecture - 53x53bit
34
0
0
24
41
58 34 17
41 24
17
M1
M2
M3M4M5
M6
M7M8
XY = X0:23Y0:16 (M1)+ 217(X0:23Y17:33 (M2)+ 217(X0:16Y34:57 (M3)+ 217X17:33Y34:57)) (M4)+ 224(X24:40Y0:23 (M8)+ 217(X41:57Y0:23 (M7)+ 217(X34:57Y24:40 (M6)+ 217X34:57Y41:57))) (M5)+ 248X24:33Y24:33
X24:33Y24:33 (10x10 multiplier) probably best implemented in LUTs.
parenthesis makes best use of DSP48E internal adders (17-bitshifts)
Bogdan PASCA FPGA Multipliers 16
![Page 59: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/59.jpg)
Tiling Results
58x58 multipliers on Virtex-5 (5vlx50ff676-3)6
latency Freq. REGs LUTs DSPs
LogiCore 14 440 300 249 10
LogiCore 8 338 208 133 10
LogiCore 4 95 208 17 10
Tiling 4 366 247 388 8
Remarks
save 2 DSP48E for a few LUTs/REGs
huge latency save at a comparable frequency
good use of internal adders due to the 17-bit shifts
6Results for 53-bits are almost identicalBogdan PASCA FPGA Multipliers 17
![Page 60: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/60.jpg)
Squarers
simple methods to save resources
Bogdan PASCA FPGA Multipliers 18
![Page 61: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/61.jpg)
Squarers
appear in norms, statistical computations, polynomial evaluation...dedicated squarer saves as many DSP blocks as theKaratsuba-Ofman algorithm, but without its overhead∗.
Squaring with k = 17 on a Virtex-4
X02
X12
X0X1
X0X1
(2kX1 + X0)2 = 22kX 21 + 2 · 2kX1X0 + X 2
0
X02
X12
X0X1
X0X1
X0X2
X0X2
X1X2
X1X2X22
(22kX2 + 2kX1 + X0)2 = 24kX 22 + 22kX 2
1 + X 20
+ 2 · 23kX2X1
+ 2 · 22kX2X0
+ 2 · 2kX1X0
Bogdan PASCA FPGA Multipliers 19
![Page 62: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/62.jpg)
Squarers
appear in norms, statistical computations, polynomial evaluation...dedicated squarer saves as many DSP blocks as theKaratsuba-Ofman algorithm, but without its overhead∗.
Squaring with k = 17 on a Virtex-4
X02
X12
X0X1
X0X1
(2kX1 + X0)2 = 22kX 21 + 2 · 2kX1X0 + X 2
0
X02
X12
X0X1
X0X1
X0X2
X0X2
X1X2
X1X2X22
(22kX2 + 2kX1 + X0)2 = 24kX 22 + 22kX 2
1 + X 20
+ 2 · 23kX2X1
+ 2 · 22kX2X0
+ 2 · 2kX1X0
Bogdan PASCA FPGA Multipliers 19
![Page 63: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/63.jpg)
Squarers
appear in norms, statistical computations, polynomial evaluation...dedicated squarer saves as many DSP blocks as theKaratsuba-Ofman algorithm, but without its overhead∗.
Squaring with k = 17 on a Virtex-4
X02
X12
X0X1
X0X1
(2kX1 + X0)2 = 22kX 21 + 2 · 2kX1X0 + X 2
0
X02
X12
X0X1
X0X1
X0X2
X0X2
X1X2
X1X2X22
(22kX2 + 2kX1 + X0)2 = 24kX 22 + 22kX 2
1 + X 20
+ 2 · 23kX2X1
+ 2 · 22kX2X0
+ 2 · 2kX1X0
Bogdan PASCA FPGA Multipliers 19
![Page 64: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/64.jpg)
However ...
(2kX1 + X0)2 = 234X 21 + 218X1X0 + X 2
0
shifts of 0, 18, 34 the previous equation
the DSP48 of VirtexIV allow shifts of 17 so internal adders unused
Workaround for ≤ 33-bit multiplications
rewrite equation:
(217X1 + X0)2 = 234X 21 + 217(2X1)X0 + X 2
0
compute 2X1 by shifting X1 by one bit before inputing into DSP48block
Bogdan PASCA FPGA Multipliers 20
![Page 65: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/65.jpg)
However ...
(2kX1 + X0)2 = 234X 21 + 218X1X0 + X 2
0
shifts of 0, 18, 34 the previous equation
the DSP48 of VirtexIV allow shifts of 17 so internal adders unused
Workaround for ≤ 33-bit multiplications
rewrite equation:
(217X1 + X0)2 = 234X 21 + 217(2X1)X0 + X 2
0
compute 2X1 by shifting X1 by one bit before inputing into DSP48block
Bogdan PASCA FPGA Multipliers 20
![Page 66: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/66.jpg)
Results – 32-bit and 53-bit squarers on Virtex-4
latency frequency slices DSPs bits
LogiCore 6 489 59 432LogiCore 3 176 34 4
Squarer 3 317 18 3
LogiCore 18 380 279 1653LogiCore 7 176 207 16
Squarer 7 317 332 6
DSPs saved without any overhead
impressive 10 DSPs saved for double precision squarer
Bogdan PASCA FPGA Multipliers 21
![Page 67: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/67.jpg)
Squarers on Virtex5 using tilings
the tiling technique can be extended to squaring
36
53
17
0
M1
M2
M3 M6M5
M4
041 24 0
19
36
53
M1
M2
M3
M4M5
Issues
darker squares are computed twice thus need be removed.
thanks to symmetry diagonal multiplication of size n shouldconsume only n(n + 1)/2 LUTs instead of n2 .
Bogdan PASCA FPGA Multipliers 22
![Page 68: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/68.jpg)
Truncated multipliers
Bogdan PASCA FPGA Multipliers 23
![Page 69: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/69.jpg)
Truncated multipliers
Classical technique
reduce resources, delay, or power consumption
controlled accuracy degradation
×
∑BA
u
kd
n − k
v
remove some of the least-significant d columns
keep the error smaller than 2k
Bogdan PASCA FPGA Multipliers 24
![Page 70: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/70.jpg)
Truncated multipliers
Classical technique
reduce resources, delay, or power consumption
controlled accuracy degradation
×
∑BA
u
kd
n − k
v
remove some of the least-significant d columns
keep the error smaller than 2k
Bogdan PASCA FPGA Multipliers 24
![Page 71: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/71.jpg)
Truncated multipliers
Classical technique
reduce resources, delay, or power consumption
controlled accuracy degradation
×
∑BA
u
kd
n − k
v
remove some of the least-significant d columns
keep the error smaller than 2k
Bogdan PASCA FPGA Multipliers 24
![Page 72: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/72.jpg)
Error budget
×
∑BA
u
kd
n − k
v
Etotal = Eapprox + Eround ≤ 2k
Eround – caused by rounding the n − d-bit result to n − k bitsuse compensation bit to center the errorround to nearest bounds Eround ≤ 2k−1
Eapprox – caused by the truncation of the d columns{0 ≤ Eapprox ≤
∑di=1 i2i−1
Eapprox < 2k−1→ d = f (k)
Precision k Discarded (d)
Single 23 18Double 52 46
Quadruple 112 105
Bogdan PASCA FPGA Multipliers 25
![Page 73: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/73.jpg)
Error budget
×
∑BA
u
kd
n − k
v
Etotal = Eapprox + Eround ≤ 2k
Eround – caused by rounding the n − d-bit result to n − k bitsuse compensation bit to center the errorround to nearest bounds Eround ≤ 2k−1
Eapprox – caused by the truncation of the d columns{0 ≤ Eapprox ≤
∑di=1 i2i−1
Eapprox < 2k−1→ d = f (k)
Precision k Discarded (d)
Single 23 18Double 52 46
Quadruple 112 105
Bogdan PASCA FPGA Multipliers 25
![Page 74: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/74.jpg)
Tiling the truncated board
M2
M3 M1
k
d
M4
M2
M3 M1
k
d
M4 M2
M3 M1
k
d
Sol 1: tile and discard columns (save additions)
waste DSPs
Sol 2: use softcore multiplier (trade a DSP for logic)
Best : tile with softcore multipliers so that Eapprox ≤ 2k−1
use the extra precision for free
Bogdan PASCA FPGA Multipliers 26
![Page 75: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/75.jpg)
Tiling the truncated board
M2
M3 M1
k
d
M4 M2
M3 M1
k
d
M4
M2
M3 M1
k
d
Sol 1: tile and discard columns (save additions)
waste DSPs
Sol 2: use softcore multiplier (trade a DSP for logic)
Best : tile with softcore multipliers so that Eapprox ≤ 2k−1
use the extra precision for free
Bogdan PASCA FPGA Multipliers 26
![Page 76: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/76.jpg)
Tiling the truncated board
M2
M3 M1
k
d
M4 M2
M3 M1
k
d
M4 M2
M3 M1
k
d
Sol 1: tile and discard columns (save additions)
waste DSPs
Sol 2: use softcore multiplier (trade a DSP for logic)
Best : tile with softcore multipliers so that Eapprox ≤ 2k−1
use the extra precision for free
Bogdan PASCA FPGA Multipliers 26
![Page 77: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/77.jpg)
Reality Check – faithfully rounding
Mantissa Multipliers for SP,DP,QP, Virtex4 (left) and Virtex5(right)
FPGA Prec. Latency, Freq. Resources
Virtex5DP 6 cycles @ 414MHz 320LUT 302REG 5DSP
QP 20 cycles @ 334MHz 2497LUT 2321REG 19DSP
QP 14 cycles @ 245MHz 2249LUT 1576REG 19DSP
Virtex4DP 11 cycles @ 368MHz 358sl. 7DSP
QP 21 cycles @ 368MHz 1735sl. 26DSP
Virtex4DP reduce DSPs from 10 to 7 while also reducing slice countQP reduce DSPs from 49 to 26 at without any slice penalty
Virtex5DP reduce DSP from 6 to 5 for and roughly half the LUTs and REGsQP reduce DSP from 34 to 19 at a small increase in logic resources.
Bogdan PASCA FPGA Multipliers 27
![Page 78: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/78.jpg)
Another point of view
(wE ,wF )=accuracy
(wE ,wF + 1)correctly rounded faithfully rounded→ in FPGAs the extra bit comes for free∗
truncate multipliers when IEEE-754 compliance is not needed
function approximation by polynomial evaluation
log2(1 + x) (53-bit)
default 27 DSPsoptimized Horner 23 DSPs
optimized Horner + truncated multipliers 11* DSPs
Bogdan PASCA FPGA Multipliers 28
![Page 79: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/79.jpg)
Another point of view
(wE ,wF )=accuracy
(wE ,wF + 1)correctly rounded faithfully rounded→ in FPGAs the extra bit comes for free∗
truncate multipliers when IEEE-754 compliance is not needed
function approximation by polynomial evaluation
log2(1 + x) (53-bit)
default 27 DSPsoptimized Horner 23 DSPs
optimized Horner + truncated multipliers 11* DSPs
Bogdan PASCA FPGA Multipliers 28
![Page 80: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/80.jpg)
Conclusion
save DSPs by exploiting the flexibility of the FPGA
Karatsuba-Ofman reduces DSP cost at small price in logic elements
tiling techinques adapt better to asymmetric DSPs
dedicated squarers significantly reduce DSP count
control accuracy and save DSPs using truncated multipliers
Bogdan PASCA FPGA Multipliers 29
![Page 81: FPGA Multipliers - univ-perp.fr · !a)LE 30%,DSP 86%, b) LE 52%,DSP 88%, c) LE 63%,DSP 100% A temporal coding hardware implementation for spiking neural networks4!16PE: LE 22%, RAM](https://reader034.vdocuments.us/reader034/viewer/2022050315/5f773c574a7bdb757e7e55ce/html5/thumbnails/81.jpg)
Thank you for your attention !
http://flopoco.gforge.inria.fr/
Questions ?
Bogdan PASCA FPGA Multipliers 30