TRANSCRIPT
©2005, IT – Instituto de Telecomunicações. All rights reserved.
From Low-Architectural Expertise Up to High-Throughput Non-binary LDPC Decoders: Optimization Guidelines Using High-Level Synthesis
João Andrade1, Nithin George2, Kimon Karras3, David Novo1, Vitor Silva1, Paolo Ienne2, Gabriel Falcão1
1 University of Coimbra, PT; 2 EPFL, CH; 3 Xilinx Research Labs, IE
FPL 2015, London, UK, 1-4 Sept. 2015
Outline
The Challenge
The Problem: Non-binary LDPC decoding
Decoding architecture: HLS decoder design
Experimental Results
Conclusion
1 | FPL’15, London, UK
The Challenge
Design an efficient LDPC decoder fast
• RTL requires highly specialized knowledge
• Our background is GPUs, not hardware design
• Error-prone design space exploration (DSE)
• Extensive code refactoring for DSE
• High-level synthesis (HLS) has been around for years
• Fast time to market
Design an efficient LDPC decoder fast
• Adjust design-space decisions much faster w/o extensive refactoring
• Bitwidth for a particular SNR operation point
• Decoding schedule
• Decoding algorithm
• C/C++ code base can be used with Vivado HLS
• C/C++ supported
• Cycle-accurate simulation after C synthesis
• Code annotations (#pragma) or Tcl commands
Why?
• Power budgets of GPUs way above requirements
• Real-time operation is required
• High decoding throughputs
• Low latencies
The Problem
FEC in communication systems
• Belief propagation problem → LDPC codes decoding
• Non-binary LDPC codes can tackle
• Quantum-key distribution
• Erasure channel (burst)
• AWGN channel
• But they have very high (non-linear) numerical complexity
• Irregular data patterns and intensive access profile
Soft-Decoding Algorithms: Belief Propagation I
• LDPC decoding is a particular case of belief-propagation
[Figure: Tanner graph of the example non-binary LDPC code — variable nodes VN1-VN6 (dv = 2) exchange messages mvc(x)/mcv(x) with check nodes CN1-CN3 (dc = 4) over edges weighted by GF coefficients (1, α, α²); edge processing permutes/depermutes the messages and applies the Walsh-Hadamard Transform.]
• Messages circulate through a bipartite graph structure with computation applied at the node and edge level
Soft-Decoding Algorithms: Belief Propagation II
• The bipartite model can be employed to generalize other algorithms w/ the following constraints
• node-level functions → must produce/consume data coherently
• edge-level functions → produce/consume without restrictions
• By defining these kernels, different algorithms can be instantiated
• CN and VN → Hadamard products
• Edges permute/depermute
• Edges apply the Fast Walsh-Hadamard Transform
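The edge-level transform can be sketched as a plain in-place radix-2 FWHT over the q = 2^m entries of a pmf; this is a generic textbook butterfly, not the HLS-optimized kernel from the talk:

```c
#include <assert.h>

/* In-place fast Walsh-Hadamard transform over q = 2^m points, as applied on
 * the Tanner-graph edges so that check-node convolutions become element-wise
 * (Hadamard) products. Loop labels mirror the LOGGF/GF loops of the slides. */
void fwht(float *pmf, int m) {
    int q = 1 << m;
    for (int c = 0; c < m; c++) {       /* LOGGF: m butterfly stages */
        int h = 1 << c;
        for (int g = 0; g < q; g++) {   /* GF: sweep the 2^m entries */
            if ((g & h) == 0) {
                float a = pmf[g], b = pmf[g + h];
                pmf[g]     = a + b;     /* radix-2 butterfly */
                pmf[g + h] = a - b;
            }
        }
    }
}
```

Applying `fwht` twice returns the input scaled by q, which is how the inverse transform on the CN→VN path can be realized with the same kernel.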
HLS decoder: Mapping the LDPC Tanner graph
• We attempt an isomorphic transformation of the Tanner graph

H = | α   0   1   α   0   1 |
    | α²  α   0   1   1   0 |
    | 0   α   α²  0   α²  1 |
[Figure: the same Tanner graph, annotated with the C functions that implement each kernel — vnUpdate(), permute(), fwht(), cnUpdate(), depermute() — and the index_lut array that encodes the edge connectivity.]
• Therein, each node/edge-level kernel becomes its own C function, and nodes/edges become iterations within a loop structure
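A minimal sketch of that mapping: only the kernel names come from the slide; the bodies are stubs of our own, and the ordering shown is one plausible flooding schedule, not necessarily the talk's exact one:

```c
#include <assert.h>

/* Skeleton of one decoding iteration: each Tanner-graph kernel is its own
 * C function and each edge is one loop iteration. Real kernels operate on
 * the mv/mvc/mcv message arrays; here the bodies just count invocations. */
enum { EDGES = 12 };
static int calls;                             /* instrumentation only */

static void vnUpdate(int e)  { (void)e; calls++; } /* VN-level product    */
static void permute(int e)   { (void)e; calls++; } /* edge: GF rotation   */
static void fwht_edge(int e) { (void)e; calls++; } /* edge: WH transform  */
static void cnUpdate(int e)  { (void)e; calls++; } /* CN-level product    */
static void depermute(int e) { (void)e; calls++; } /* edge: inverse rot.  */

void decode_iteration(void) {
    for (int e = 0; e < EDGES; e++) vnUpdate(e);
    for (int e = 0; e < EDGES; e++) permute(e);
    for (int e = 0; e < EDGES; e++) fwht_edge(e);
    for (int e = 0; e < EDGES; e++) cnUpdate(e);
    for (int e = 0; e < EDGES; e++) fwht_edge(e); /* inverse via FWHT again */
    for (int e = 0; e < EDGES; e++) depermute(e);
}
```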
HLS decoder: Where to begin?
• There are several dimensions to non-binary LDPC decoding
(code related)
• N VNs and M CNs to process
• Each VN connects to dv CNs
• Each CN connects to dc VNs
(Galois field related)
• 2^m probabilities to compute per probability mass function (pmf)
HLS decoder: How to express computation?
• Suppose a GPU SIMT-architecture mindset

//flat loop unsuitable for Vivado HLS optimizations
for(int i = 0; i < edges*q*d_v; i++){
    int e = i/(d_v*q); //get edge id
    int g = i%q;       //get GF(q) element
    int t = (i/q)%d_v; //get d_v element
    computation();
}
• How is it any different from this?

//nested loop suitable for Vivado HLS optimizations
for(int e = 0; e < edges; e++)
    for(int g = 0; g < q; g++)
        for(int t = 0; t < d_v; t++)
            computation();
• Optimizations are hardly picked up by Vivado HLS in the former
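The two writing styles describe the same iteration space; a quick sketch checking that the flat-index arithmetic above visits every (e, g, t) tuple exactly once (the edges/q/d_v values are illustrative, not a real code's parameters):

```c
#include <string.h>

/* Verify that the SIMT-style flat-index decomposition covers the same
 * (e, g, t) iteration space as the nested loops: each tuple exactly once. */
enum { EDGES = 6, Q = 4, DV = 2 };

int check_coverage(void) {
    int visited[EDGES][Q][DV];
    memset(visited, 0, sizeof visited);

    /* the flat loop's index arithmetic, as on the slide */
    for (int i = 0; i < EDGES * Q * DV; i++) {
        int e = i / (DV * Q);
        int g = i % Q;
        int t = (i / Q) % DV;
        visited[e][g][t]++;
    }

    /* the nested loop trivially hits each tuple once; check the flat one did too */
    for (int e = 0; e < EDGES; e++)
        for (int g = 0; g < Q; g++)
            for (int t = 0; t < DV; t++)
                if (visited[e][g][t] != 1) return 0;
    return 1;
}
```

Functional equivalence is exactly why the refactoring is safe; the nested form simply exposes the loop bounds HLS needs for unrolling and pipelining.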
HLS decoder: Loop structures
[Figure: loop structures of the decoder kernels — vn_proc, permute, fwht, cn_proc, depermute — iterating between the VN->CN and CN->VN phases; messages mv, mvc, mcv stream from DRAM through prologue/epilogue stages into local BRAM copies l_mv, l_mvc, l_mcv; the BRAM arrays are partitioned in Solutions IV-VII, with 2 RW ports available per BRAM.]

• Loop trip counts
  • E: edges, or N×dv = M×dc
  • GF: 2^m
  • LOGGF: m
  • Dv/Dc: dv/dc
• Local BRAM copies are maintained
• Data streams from DRAM
HLS decoder: Loop structures
//nested loop structure of vn_proc
E:  for(int e = 0; e < edges; e++)
GF:     for(int g = 0; g < q; g++)
Dv:         for(int t = 0; t < d_v; t++)
                //computation follows
HLS decoder: Loop structures
E:        for(int e = 0; e < limit; e++){
GF_read:      for(int g = 0; g < GF; g++)
                  //load data into temporary buffer
GF_write:     for(int g = 0; g < GF; g++)
                  //permute and store back to memory
          }
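A hedged sketch of what the permute body might do for one edge: rotate the pmf entries through a precomputed lookup table (the index_lut of the Tanner-graph slide). The LUT contents below — the multiply-by-α map in GF(4) — are purely illustrative:

```c
enum { Q = 4 };  /* GF(2^2), illustrative field size */

/* Permute one edge's pmf: GF_read loads into a temporary buffer, GF_write
 * scatters it according to a per-edge LUT encoding multiplication by the
 * edge's GF(q) coefficient. Mirrors the E/GF_read/GF_write loop structure. */
void permute_edge(const float in[Q], float out[Q], const int lut[Q]) {
    float tmp[Q];
    for (int g = 0; g < Q; g++)      /* GF_read: load temporary buffer */
        tmp[g] = in[g];
    for (int g = 0; g < Q; g++)      /* GF_write: permute, store back  */
        out[lut[g]] = tmp[g];
}
```

With field elements indexed {0, 1, α, α²} = {0, 1, 2, 3}, multiplication by α maps 1→α, α→α², α²→1, giving lut = {0, 2, 3, 1}; the depermute stage would apply the inverse table.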
HLS decoder: Loop structures
E:        for(int e = 0; e < edges; e++){
G_read:       for(int g = 0; g < q; g++){
                  //load data into temporary array
              }
LogGF:        for(int c = 0; c < m; c++)
GF:               for(int g = 0; g < q; g++)
                      //perform Radix-2 computation
G_write:      for(int g = 0; g < q; g++){
                  //store data back to memory
              }
          }
HLS decoder solutions: Partitioning
• Scheduling analysis after C synthesis shows a lack of memory ports
• BRAMs are instantiated with dual-port control
• Further action is required
set_directive_resource -core RAM_T2P_BRAMset_directive_array_partition -type cyclic -factor 4 -dim 1
[Figure: factor-4 cyclic partitioning — the original arrays l_mv, l_mvc, l_mcv (elements 0, 1, 2, 3, 4, 5, …) are each split into four banks (l_mv_0…l_mv_3, etc.) holding elements {0, 4, 8, …}, {1, 5, 9, …}, {2, 6, 10, …}, {3, 7, 11, …}; partitioning makes 2×2^m RW ports available.]
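Cyclic partitioning by a factor P maps element i of the original array to word i/P of bank i%P, so P consecutive elements live in different BRAMs and can be touched in the same cycle. A quick sketch of that mapping (array name follows the slides, sizes are illustrative):

```c
enum { P = 4, WORDS = 8 };   /* factor-4 cyclic partition, as in the figure */

/* l_mv_bank[b][w] models the partitioned arrays l_mv_0..l_mv_3: element i
 * of the original l_mv lands in bank i % P at word i / P. */
static int l_mv_bank[P][WORDS];

void store(int i, int value) { l_mv_bank[i % P][i / P] = value; }
int  load(int i)             { return l_mv_bank[i % P][i / P]; }
```

This is the index arithmetic that `set_directive_array_partition -type cyclic` realizes in hardware; with factor 2^m, one full pmf of 2^m entries becomes accessible per cycle.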
HLS decoder: Optimization solution I
[Figure: the fwht kernel under Solutions II-VII — without array partitioning only 2 RW ports are available and the unrolled loops are not parallel (high II); with the arrays partitioned, 2×2^m RW ports allow parallel operation at low II; Solution VII replicates pipelined fwht instances across the E loop.]
• Solution I: base version w/o optimizations
HLS decoder: Optimization solution II
• Solution II: full unroll of inner loops LOGGF and GF

set_directive_unroll "*/LOGGF"
set_directive_unroll "*/GF"
HLS decoder: Optimization solution III
• Solution III: II + pipeline of outer loops E to II=1

set_directive_pipeline "*/E" -II=1
HLS decoder: Optimization solution IV
• Solution IV: I + cyclic partitioning of all BRAM arrays by a factor of 2^m

set_directive_array_partition -type cyclic -factor 2^m -dim 1 "decoder" l_buffer
HLS decoder: Optimization solution V
• Solution V: IV + full unroll of inner loops LOGGF and GF
HLS decoder: Optimization solution VI
• Solution VI: III + IV (unroll, pipeline, partition)
HLS decoder: Optimization solution VII
• Solution VII: IV + pipeline of inner loops LOGGF and GF to II=1

set_directive_pipeline "*/LOGGF" -II=1
set_directive_pipeline "*/GF" -II=1
HLS decoder
• Define fixed-point computation suitable for a target SNR/BER

#include <ap_fixed.h>
//data is stored in llr-type variables
//computation is performed in llr_-type variables

//use floating-point
typedef float llr;
typedef float llr_;

//use Q8.7 fixed-point
typedef ap_fixed< 8, 1, AP_RND_INF, AP_SAT > llr;
typedef ap_fixed< 16, 3, AP_RND_INF, AP_SAT > llr_;
Experimental results: Clock frequency and latency
[Figure: latency [cycles] (10^3-10^7) and clock frequency [MHz] (0-250) across Optimizations I-VII for each decoder.]
• Best clock frequency of operation obtained for Solution VI
• Lowest latency always achieved for Solution VI
• Solution III is a good compromise between VI and the remaining Solutions
• Solution VII's replication of pipelined loops is a poor design choice
• It is the most similar to the OpenCL strategy
Experimental results: FPGA utilization
[Figure: FPGA utilization across Solutions I-VI — partitioning (I→IV, II→V, III→VI) yields higher utilization with latency unchanged; unrolling and pipelining (I→II→III, IV→V→VI) yield lower latency with utilization unchanged.]
• Under 20% LUT utilization
• Multiple decoder instantiation possible
• What about the pin, clock and memory interfaces?
HLS host platform
• Originally an RTL project that can be assembled automatically via Tcl
• VC709 board target (693K logic cells)
[Figure: host platform on the FPGA — a memory interface to board DRAM 0/DRAM 1, two AXI4 interconnects, and HLS IP cores 1…K, each with its own BRAMs.]
• DMA via PCIe → 3K LUTs
• Two DRAM banks controlled (MIG)
• Two AXI4 interconnects can be configured for up to 16 HLS cores
• Data streams from DRAM bank 0 and to bank 1
• Each HLS core performs computation in its own "BRAM" space
Experimental results: Pareto exploration
[Figure: Pareto exploration — latency [μs] (10^0-10^5) vs. LUT utilization [%] (0-90) for GF(4), GF(8) and GF(16); Pareto-optimal points span from a single accelerator w/o DRAM controllers to the final decoder design w/ DRAM controllers and several accelerators instantiated; non-optimal points shown for reference.]
• The host HLS architecture and the multiple decoders elevate the LUT utilization to ∼80%
• {14, 5, 3} decoders for GF(2^2), GF(2^3) and GF(2^4)
Experimental results: Comparison with RTL decoders
Decoder              m  K   LUT [%]      FF   BRAM  DSP  Thr. [Mbit/s]  Clk [MHz]
This work            2  1   14           7    0.5   0.5  1.17           250
                     2  14  80           35   6     6    14.54          219
                     3  1   21           9    0.9   0.9  0.95           250
                     3  6   81           34   5     5    4.81           210
                     4  1   30           13   2     2    0.66           216
                     4  3   73           32   5     5    1.85           201
Zhang TCS-I'11       4  1   48 (Slices)  41   –     –    9.3            –
Emden ISTC'10        2  –   33.16        100  4     –    13.228         1.56
Spagnol SiPS'09      3  –   13           3    1     –    ≤4.7           99
Boutillon TCS-I'13   6  –   19           6    1     –    2.95           61
Andrade ICASSP'14    8  –   85 (LEs)     62   7     –    1.1            163
Scheiber ICECS'13    1  –   14 (Slices)  21   –     –    13.4           122

∗ Differences in technology nodes and FPGAs are not considered.
Comparison with previous HLS works
• Maxeler decoders allow for ∼1 Gbit/s decoding throughputs (Pratas, GlobalSIP'13; Andrade, ASAP'14)
• OpenCL (Altera) decoder peaks at hundreds of Kbit/s (Andrade, ICASSP'14)
FPGA utilization [%]

           GF(2^3)                   GF(2^3) (floating-point)
Util. [%]  I     IV    V     VI      I     IV    V     VI
LUTs       0.64  1.13  5.20  10.4    0.65  1.48  11.7  17.3
FF         0.28  0.53  2.52  3.94    0.29  0.51  3.30  7.56
DSP        0.06  0.06  0.89  0.89    0.06  0.06  1.78  1.78
BRAM       0.44  0.82  0.82  0.82    0.78  1.63  2.72  1.63
• Vivado HLS decoder reaches dozens of Mbit/s (Scheiber, ICECS'13)
Summary
• Code writing style counts
• The language is the same, the model is not
• Design optimizations come hand-in-hand with the code writing style
• Clearly defined bounds are better
• Optimizations can be double-edged swords
• We can achieve figures in the same ballpark as RTL
• Higher utilization
• Outlook
• When will platforms be generated automatically?
• When will the C programming models merge?
*(b++)=*(a++)*c;
What?
b[i]=a[i]*c;
Ah, yes!
Thank you. Questions are welcome.
What tool to use?
What HLS tool?
• How much control are we willing to give up?
• A lot? → OpenCL (C-based)
• Dataflow? → MaxCompiler (Java)
• Some? → LegUp, Vivado HLS (C/C++, SystemC)
• None? → Stick to RTL (Verilog, VHDL)
• Vivado HLS allows fine control over
• Loop scheduling → unroll, pipeline, merge, flatten
• AXI4 blocks → master/slave memory and stream interfaces
• Arbitrary bitwidth → fixed-point types supported
• No clock, no external memory interfaces, and no pin I/O layout