TRANSCRIPT
©2005, IT – Instituto de Telecomunicações. All rights reserved.
From Low-Architectural Expertise Up to High-Throughput Non-binary LDPC Decoders: Optimization Guidelines Using High-Level Synthesis
João Andrade1, Nithin George2, Kimon Karras3, David Novo1, Vitor Silva1, Paolo Ienne2, Gabriel Falcão1
1 University of Coimbra, PT; 2 EPFL, CH; 3 Xilinx Research Labs, IE
FPL 2015, London, UK, 1-4 Sept. 2015
Outline
The Challenge
The Problem: Non-binary LDPC decoding
Decoding architecture: HLS decoder design
Experimental Results
Conclusion
1 | FPL’15, London, UK
The Challenge
Design an efficient LDPC decoder fast
• RTL requires highly specialized knowledge
• Our background is GPUs, not hardware design
• Error-prone design space exploration (DSE)
• Extensive code refactoring for DSE
• High-level synthesis (HLS) has been around for years
• Fast time to market
Design an efficient LDPC decoder fast
• Adjust design-space decisions much faster w/o extensive refactoring
• Bitwidth for a particular SNR operation point
• Decoding schedule
• Decoding algorithm
• C/C++ code base can be used with Vivado HLS
• C/C++ supported
• Cycle-accurate simulation after C synthesis
• Code annotations (#pragma) or Tcl commands
Why?
• Power budgets of GPUs way above requirements
• Real-time operation is required
• High decoding throughputs
• Low latencies
The Problem
FEC in communication systems
• Belief propagation problem → LDPC codes decoding
• Non-binary LDPC codes can tackle
• Quantum-key distribution
• Erasure channel (burst)
• AWGN channel
• But they have very high (non-linear) numerical complexity
• Irregular data patterns and intensive access profile
Soft-Decoding Algorithms: Belief Propagation I
• LDPC decoding is a particular case of belief-propagation
[Figure: Tanner graph of the example non-binary LDPC code — variable nodes VN1-VN6 (dv = 2) exchange messages mvc(x)/mcv(x) with check nodes CN1-CN3 (dc = 4) over edges weighted by GF coefficients (1, α, α²); edge processing permutes/depermutes the messages and applies the Walsh-Hadamard Transform.]
• Messages circulate through a bipartite graph structure with computation applied at the node and edge level
Soft-Decoding Algorithms: Belief Propagation II
• The bipartite model can be employed to generalize other algorithms w/ the following constraints
• node-level functions → must produce/consume data coherently
• edge-level functions → produce/consume without restrictions
• By defining these kernels, different algorithms can be instantiated
• CN and VN → Hadamard products
• Edges permute/depermute
• Edges apply the Fast Walsh-Hadamard Transform
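The edge-level transform can be sketched as a plain in-place radix-2 FWHT over the q = 2^m entries of a pmf; this is a generic textbook butterfly, not the HLS-optimized kernel from the talk:

```c
#include <assert.h>

/* In-place fast Walsh-Hadamard transform over q = 2^m points, as applied on
 * the Tanner-graph edges so that check-node convolutions become element-wise
 * (Hadamard) products. Loop labels mirror the LOGGF/GF loops of the slides. */
void fwht(float *pmf, int m) {
    int q = 1 << m;
    for (int c = 0; c < m; c++) {       /* LOGGF: m butterfly stages */
        int h = 1 << c;
        for (int g = 0; g < q; g++) {   /* GF: sweep the 2^m entries */
            if ((g & h) == 0) {
                float a = pmf[g], b = pmf[g + h];
                pmf[g]     = a + b;     /* radix-2 butterfly */
                pmf[g + h] = a - b;
            }
        }
    }
}
```

Applying `fwht` twice returns the input scaled by q, which is how the inverse transform on the CN→VN path can be realized with the same kernel.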
HLS decoder: Mapping the LDPC Tanner graph
• We attempt an isomorphic transformation of the Tanner graph

H = | α   0   1   α   0   1 |
    | α²  α   0   1   1   0 |
    | 0   α   α²  0   α²  1 |
[Figure: the same Tanner graph, annotated with the C functions that implement each kernel — vnUpdate(), permute(), fwht(), cnUpdate(), depermute() — and the index_lut array that encodes the edge connectivity.]
• Therein, each node/edge-level kernel becomes its own C function, and nodes/edges become iterations within a loop structure
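A minimal sketch of that mapping: only the kernel names come from the slide; the bodies are stubs of our own, and the ordering shown is one plausible flooding schedule, not necessarily the talk's exact one:

```c
#include <assert.h>

/* Skeleton of one decoding iteration: each Tanner-graph kernel is its own
 * C function and each edge is one loop iteration. Real kernels operate on
 * the mv/mvc/mcv message arrays; here the bodies just count invocations. */
enum { EDGES = 12 };
static int calls;                             /* instrumentation only */

static void vnUpdate(int e)  { (void)e; calls++; } /* VN-level product    */
static void permute(int e)   { (void)e; calls++; } /* edge: GF rotation   */
static void fwht_edge(int e) { (void)e; calls++; } /* edge: WH transform  */
static void cnUpdate(int e)  { (void)e; calls++; } /* CN-level product    */
static void depermute(int e) { (void)e; calls++; } /* edge: inverse rot.  */

void decode_iteration(void) {
    for (int e = 0; e < EDGES; e++) vnUpdate(e);
    for (int e = 0; e < EDGES; e++) permute(e);
    for (int e = 0; e < EDGES; e++) fwht_edge(e);
    for (int e = 0; e < EDGES; e++) cnUpdate(e);
    for (int e = 0; e < EDGES; e++) fwht_edge(e); /* inverse via FWHT again */
    for (int e = 0; e < EDGES; e++) depermute(e);
}
```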
HLS decoder: Where to begin?
• There are several dimensions to non-binary LDPC decoding
(code related)
• N VNs and M CNs to process
• Each VN connects to dv CNs
• Each CN connects to dc VNs
(Galois field related)
• 2^m probabilities to compute per probability mass function (pmf)
HLS decoder: How to express computation?
• Suppose a GPU SIMT-architecture mindset

//flat loop unsuitable for Vivado HLS optimizations
for(int i = 0; i < edges*q*d_v; i++){
    int e = i/(d_v*q); //get edge id
    int g = i%q;       //get GF(q) element
    int t = (i/q)%d_v; //get d_v element
    computation();
}
• How is it any different from this?

//nested loop suitable for Vivado HLS optimizations
for(int e = 0; e < edges; e++)
    for(int g = 0; g < q; g++)
        for(int t = 0; t < d_v; t++)
            computation();
• Optimizations are hardly picked up by Vivado HLS in the former
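The two writing styles describe the same iteration space; a quick sketch checking that the flat-index arithmetic above visits every (e, g, t) tuple exactly once (the edges/q/d_v values are illustrative, not a real code's parameters):

```c
#include <string.h>

/* Verify that the SIMT-style flat-index decomposition covers the same
 * (e, g, t) iteration space as the nested loops: each tuple exactly once. */
enum { EDGES = 6, Q = 4, DV = 2 };

int check_coverage(void) {
    int visited[EDGES][Q][DV];
    memset(visited, 0, sizeof visited);

    /* the flat loop's index arithmetic, as on the slide */
    for (int i = 0; i < EDGES * Q * DV; i++) {
        int e = i / (DV * Q);
        int g = i % Q;
        int t = (i / Q) % DV;
        visited[e][g][t]++;
    }

    /* the nested loop trivially hits each tuple once; check the flat one did too */
    for (int e = 0; e < EDGES; e++)
        for (int g = 0; g < Q; g++)
            for (int t = 0; t < DV; t++)
                if (visited[e][g][t] != 1) return 0;
    return 1;
}
```

Functional equivalence is exactly why the refactoring is safe; the nested form simply exposes the loop bounds HLS needs for unrolling and pipelining.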
HLS decoder: Loop structures
[Figure: loop structures of the decoder kernels — vn_proc, permute, fwht, cn_proc, depermute — iterating between the VN->CN and CN->VN phases; messages mv, mvc, mcv stream from DRAM through prologue/epilogue stages into local BRAM copies l_mv, l_mvc, l_mcv; the BRAM arrays are partitioned in Solutions IV-VII, with 2 RW ports available per BRAM.]

• Loop trip counts
  • E: edges, or N×dv = M×dc
  • GF: 2^m
  • LOGGF: m
  • Dv/Dc: dv/dc
• Local BRAM copies are maintained
• Data streams from DRAM
HLS decoder: Loop structures
//nested loop structure of vn_proc
E:  for(int e = 0; e < edges; e++)
GF:     for(int g = 0; g < q; g++)
Dv:         for(int t = 0; t < d_v; t++)
                //computation follows
HLS decoder: Loop structures
E:        for(int e = 0; e < limit; e++){
GF_read:      for(int g = 0; g < GF; g++)
                  //load data into temporary buffer
GF_write:     for(int g = 0; g < GF; g++)
                  //permute and store back to memory
          }
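A hedged sketch of what the permute body might do for one edge: rotate the pmf entries through a precomputed lookup table (the index_lut of the Tanner-graph slide). The LUT contents below — the multiply-by-α map in GF(4) — are purely illustrative:

```c
enum { Q = 4 };  /* GF(2^2), illustrative field size */

/* Permute one edge's pmf: GF_read loads into a temporary buffer, GF_write
 * scatters it according to a per-edge LUT encoding multiplication by the
 * edge's GF(q) coefficient. Mirrors the E/GF_read/GF_write loop structure. */
void permute_edge(const float in[Q], float out[Q], const int lut[Q]) {
    float tmp[Q];
    for (int g = 0; g < Q; g++)      /* GF_read: load temporary buffer */
        tmp[g] = in[g];
    for (int g = 0; g < Q; g++)      /* GF_write: permute, store back  */
        out[lut[g]] = tmp[g];
}
```

With field elements indexed {0, 1, α, α²} = {0, 1, 2, 3}, multiplication by α maps 1→α, α→α², α²→1, giving lut = {0, 2, 3, 1}; the depermute stage would apply the inverse table.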
HLS decoder: Loop structures
E:        for(int e = 0; e < edges; e++){
G_read:       for(int g = 0; g < q; g++){
                  //load data into temporary array
              }
LogGF:        for(int c = 0; c < m; c++)
GF:               for(int g = 0; g < q; g++)
                      //perform Radix-2 computation
G_write:      for(int g = 0; g < q; g++){
                  //store data back to memory
              }
          }
HLS decoder solutions: Partitioning
• Scheduling analysis after C synthesis shows a lack of memory ports
• BRAMs are instantiated with dual-port control
• Further action is required
set_directive_resource -core RAM_T2P_BRAMset_directive_array_partition -type cyclic -factor 4 -dim 1
[Figure: factor-4 cyclic partitioning — the original arrays l_mv, l_mvc, l_mcv (elements 0, 1, 2, 3, 4, 5, …) are each split into four banks (l_mv_0…l_mv_3, etc.) holding elements {0, 4, 8, …}, {1, 5, 9, …}, {2, 6, 10, …}, {3, 7, 11, …}; partitioning makes 2×2^m RW ports available.]
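Cyclic partitioning by a factor P maps element i of the original array to word i/P of bank i%P, so P consecutive elements live in different BRAMs and can be touched in the same cycle. A quick sketch of that mapping (array name follows the slides, sizes are illustrative):

```c
enum { P = 4, WORDS = 8 };   /* factor-4 cyclic partition, as in the figure */

/* l_mv_bank[b][w] models the partitioned arrays l_mv_0..l_mv_3: element i
 * of the original l_mv lands in bank i % P at word i / P. */
static int l_mv_bank[P][WORDS];

void store(int i, int value) { l_mv_bank[i % P][i / P] = value; }
int  load(int i)             { return l_mv_bank[i % P][i / P]; }
```

This is the index arithmetic that `set_directive_array_partition -type cyclic` realizes in hardware; with factor 2^m, one full pmf of 2^m entries becomes accessible per cycle.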
HLS decoder: Optimization solution I
[Figure: the fwht kernel under Solutions II-VII — without array partitioning only 2 RW ports are available and the unrolled loops are not parallel (high II); with the arrays partitioned, 2×2^m RW ports allow parallel operation at low II; Solution VII replicates pipelined fwht instances across the E loop.]
• Solution I: base version w/o optimizations
HLS decoder: Optimization solution II
• Solution II: full unroll of inner loops LOGGF and GF

set_directive_unroll "*/LOGGF"
set_directive_unroll "*/GF"
HLS decoder: Optimization solution III
• Solution III: II + pipeline of outer loops E to II=1

set_directive_pipeline "*/E" -II=1
HLS decoder: Optimization solution IV
• Solution IV: I + cyclic partitioning of all BRAM arrays by a factor of 2^m

set_directive_array_partition -type cyclic -factor 2^m -dim 1 "decoder" l_buffer
HLS decoder: Optimization solution V
• Solution V: IV + full unroll of inner loops LOGGF and GF
HLS decoder: Optimization solution VI
• Solution VI: III + IV (unroll, pipeline, partition)
HLS decoder: Optimization solution VII
• Solution VII: IV + pipeline of inner loops LOGGF and GF to II=1

set_directive_pipeline "*/LOGGF" -II=1
set_directive_pipeline "*/GF" -II=1
HLS decoder
• Define fixed-point computation suitable for a target SNR/BER

#include <ap_fixed.h>
//data is stored in llr-type variables
//computation is performed in llr_-type variables

//use floating-point
typedef float llr;
typedef float llr_;

//use Q8.7 fixed-point
typedef ap_fixed< 8, 1, AP_RND_INF, AP_SAT > llr;
typedef ap_fixed< 16, 3, AP_RND_INF, AP_SAT > llr_;
Experimental results: Clock frequency and latency
[Figure: latency [cycles] (10^3-10^7) and clock frequency [MHz] (0-250) across Optimizations I-VII for each decoder.]
• Best clock frequency of operation obtained for Solution VI
• Lowest latency always achieved for Solution VI
• Solution III is a good compromise between VI and the remaining Solutions
• Solution VII's replication of pipelined loops is a poor design choice
• It is the most similar to the OpenCL strategy
Experimental results: FPGA utilization
[Figure: FPGA utilization across Solutions I-VI — partitioning (I→IV, II→V, III→VI) yields higher utilization with latency unchanged; unrolling and pipelining (I→II→III, IV→V→VI) yield lower latency with utilization unchanged.]
• Under 20% LUT utilization
• Multiple decoder instantiation possible
• What about the pin, clock and memory interfaces?
HLS host platform
• Originally an RTL project that can be assembled automatically via Tcl
• VC709 board target (693K logic cells)
[Figure: host platform on the FPGA — a memory interface to board DRAM 0/DRAM 1, two AXI4 interconnects, and HLS IP cores 1…K, each with its own BRAMs.]
• DMA via PCIe → 3K LUTs
• Two DRAM banks controlled (MIG)
• Two AXI4 interconnects can be configured for up to 16 HLS cores
• Data streams from DRAM bank 0 and to bank 1
• Each HLS core performs computation in its own "BRAM" space
Experimental results: Pareto exploration
[Figure: Pareto exploration — latency [μs] (10^0-10^5) vs. LUT utilization [%] (0-90) for GF(4), GF(8) and GF(16); Pareto-optimal points span from a single accelerator w/o DRAM controllers to the final decoder design w/ DRAM controllers and several accelerators instantiated; non-optimal points shown for reference.]
• The host HLS architecture and the multiple decoders elevate the LUT utilization to ∼80%
• {14, 5, 3} decoders for GF(2^2), GF(2^3) and GF(2^4)
Experimental results: Comparison with RTL decoders
Decoder              m  K   LUT [%]      FF   BRAM  DSP  Thr. [Mbit/s]  Clk [MHz]
This work            2  1   14           7    0.5   0.5  1.17           250
                     2  14  80           35   6     6    14.54          219
                     3  1   21           9    0.9   0.9  0.95           250
                     3  6   81           34   5     5    4.81           210
                     4  1   30           13   2     2    0.66           216
                     4  3   73           32   5     5    1.85           201
Zhang TCS-I'11       4  1   48 (Slices)  41   –     –    9.3            –
Emden ISTC'10        2  –   33.16        100  4     –    13.228         1.56
Spagnol SiPS'09      3  –   13           3    1     –    ≤4.7           99
Boutillon TCS-I'13   6  –   19           6    1     –    2.95           61
Andrade ICASSP'14    8  –   85 (LEs)     62   7     –    1.1            163
Scheiber ICECS'13    1  –   14 (Slices)  21   –     –    13.4           122

∗ Differences in technology nodes and FPGAs are not considered.
Comparison with previous HLS works
• Maxeler decoders allow for ∼1 Gbit/s decoding throughputs (Pratas, GlobalSIP'13; Andrade, ASAP'14)
• OpenCL (Altera) decoder peaks at hundreds of Kbit/s (Andrade, ICASSP'14)
FPGA utilization [%]

           GF(2^3)                   GF(2^3) (floating-point)
Util. [%]  I     IV    V     VI      I     IV    V     VI
LUTs       0.64  1.13  5.20  10.4    0.65  1.48  11.7  17.3
FF         0.28  0.53  2.52  3.94    0.29  0.51  3.30  7.56
DSP        0.06  0.06  0.89  0.89    0.06  0.06  1.78  1.78
BRAM       0.44  0.82  0.82  0.82    0.78  1.63  2.72  1.63
• Vivado HLS decoder reaches dozens of Mbit/s (Scheiber, ICECS'13)
Summary
• Code writing style counts
• The language is the same, the model is not
• Design optimizations come hand-in-hand with the code writing style
• Clearly defined bounds are better
• Optimizations can be double-edged swords
• We can achieve figures in the same ballpark as RTL
• Higher utilization
• Outlook
• When will platforms be generated automatically?
• When will the C programming models merge?
*(b++)=*(a++)*c;
What?
b[i]=a[i]*c;
Ah, yes!
Thank you. Questions are welcome.
What tool to use?
What HLS tool?
• How much control are we willing to give up?
• A lot? → OpenCL (C-based)
• Dataflow? → MaxCompiler (Java)
• Some? → LegUp, Vivado HLS (C/C++, SystemC)
• None? → Stick to RTL (Verilog, VHDL)
• Vivado HLS allows fine control over
• Loop scheduling → unroll, pipeline, merge, flatten
• AXI4 blocks → master/slave memory and stream interfaces
• Arbitrary bitwidth → fixed-point types supported
• No clock, no external memory interfaces, and no pin I/O layout