huyen pham thi, sabooh ajaz, and hanho leesoc.inha.ac.kr/images/isocc2014_nbldpc.pdf · huyen pham...

Efficient Min-Max Nonbinary LDPC Decoding on GPU

Huyen Pham Thi, Sabooh Ajaz, and Hanho Lee

Dept. of Information and Communication Engineering, Inha University, 402-751, Incheon, Korea

E-mail: [email protected]

Abstract

This paper presents a novel modified Min-Max algorithm (MMMA) and an efficient implementation of a nonbinary LDPC (NB-LDPC) decoder on a graphics processing unit (GPU) that achieves both flexibility and scalability. The MMMA for check node processing removes the Galois-field multiplications from the merger step and significantly reduces the decoding latency. The proposed MMMA provides better BER performance than the previous algorithm. The experimental results show that the GPU-based implementation of the proposed NB-LDPC decoder provides higher throughput than a CPU-based implementation and makes it feasible to measure the coding gain at BERs as low as $10^{-8}$.

Keywords- GPU; nonbinary LDPC; Min-Max decoding; CUDA

Introduction

Davey and MacKay [1] showed that NB-LDPC codes give a significant performance improvement when the code length is short or moderate. However, the decoding algorithms for NB-LDPC codes require complex computations and a large memory [2-4]. Developing new codes and novel low-complexity decoding algorithms for NB-LDPC codes requires a large number of extensive simulations. Due to the high complexity of NB-LDPC decoding algorithms, simulation on a CPU is extremely slow for high-order Galois fields GF(q).

A GPU provides massively parallel computation threads with a many-core architecture, which can accelerate the simulation of NB-LDPC decoding. However, implementing NB-LDPC decoding on a GPU is still very challenging. In this paper, a novel modified Min-Max algorithm and an efficient implementation of a parallel block-layered NB-LDPC decoder on a GPU are presented. The MMMA for check node processing removes the Galois-field multiplications from the merger step and significantly reduces the decoding latency.

Proposed Modified Min-Max Algorithm

In this section, a new modified Min-Max algorithm is provided, which removes the multiplications by nonzero elements of the H matrix from the merger step. The block-layered decoding algorithm is given in Algorithm 1. Min-Max decoding, implemented with the well-known forward-backward algorithm (FBA), is applied in check node processing [3].

Algorithm 1:

Initialization: $L_n(a) = \ln\left(\Pr(c_n = s_n \mid \text{channel}) / \Pr(c_n = a \mid \text{channel})\right)$; $L_n^{1,0}(a) = L_n(a)$; $R_{mn}^{0,l}(a) = 0$;

Iterations:

For (k=1; k <= Imax; k++)
  For (l=1; l <= L; l++)
    For (m=0; m < q-1; m++)

      Step 1: $\tilde{L}_{nm}^{k,l}(a) = L_n^{k,l-1}(a) - R_{mn}^{k-1,l}(a)$;

              $\tilde{L}_{nm}^{k,l} = \min_{a \in GF(q)} \left( \tilde{L}_{nm}^{k,l}(a) \right)$;

              $L_{nm}^{k,l}(a) = \tilde{L}_{nm}^{k,l}(a) - \tilde{L}_{nm}^{k,l}$;

      Step 2: $R_{mn}^{k,l}(a) = \min_{(a_{n'})_{n' \in N(m)} \in \gamma(m \mid a_n = a)} \left( \max_{n' \in N(m) \setminus \{n\}} L_{n'm}^{k,l}(a_{n'}) \right)$;

      Step 3: $L_n^{k,l}(a) = \tilde{L}_{nm}^{k,l}(a) + R_{mn}^{k,l}(a)$;

    End for
  End for

  Decision: $\tilde{c}_n = \arg\min_{a \in GF(q)} \left( L_n^{k,l}(a) \right)$;

End for
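As an illustration of the message arithmetic in Steps 1 and 3 and of the final decision, the following C++ sketch processes a single check-to-variable edge. It is a minimal sketch under stated assumptions, not the decoder used in this paper: the alias `Msg` and all function names are hypothetical, each message is assumed to hold one `float` per GF(q) symbol (smaller means more likely), and the C2V message produced by Step 2 is simply taken as an input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical message type: one reliability value per GF(q) symbol.
using Msg = std::vector<float>;

// Step 1 (first line): subtract the C2V message of the previous iteration
// from the current posterior to obtain the unnormalised V2C message.
Msg subtract_old_c2v(const Msg& posterior, const Msg& old_c2v) {
    assert(posterior.size() == old_c2v.size());
    Msg raw(posterior.size());
    for (std::size_t a = 0; a < posterior.size(); ++a)
        raw[a] = posterior[a] - old_c2v[a];
    return raw;
}

// Step 1 (second and third lines): shift the message so that its
// smallest entry becomes zero.
Msg normalise(const Msg& raw) {
    assert(!raw.empty());
    const float lowest = *std::min_element(raw.begin(), raw.end());
    Msg out(raw);
    for (float& x : out) x -= lowest;
    return out;
}

// Step 3: add the new C2V message to the unnormalised V2C message.
Msg add_new_c2v(const Msg& raw, const Msg& new_c2v) {
    assert(raw.size() == new_c2v.size());
    Msg posterior(raw.size());
    for (std::size_t a = 0; a < raw.size(); ++a)
        posterior[a] = raw[a] + new_c2v[a];
    return posterior;
}

// Decision: pick the symbol index with the smallest posterior value.
std::size_t decide(const Msg& posterior) {
    assert(!posterior.empty());
    return static_cast<std::size_t>(std::distance(
        posterior.begin(),
        std::min_element(posterior.begin(), posterior.end())));
}
```

In a layered schedule these helpers would be called once per edge of the current layer, with the output of `normalise` feeding the check node processing of Step 2.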

Forward metrics $(F_i)_{i=0,\dots,d_c-2}$ and backward metrics $(B_i)_{i=1,\dots,d_c-1}$ are calculated sequentially under the following conditional equation:

$a' + \alpha^{h_{v_i,c}} a'' = a$ (1)

where $h_{v_i,c}$ refers to the nonzero element in the H matrix. In the merger step, the merger messages are the C2V messages, which are used to update the a posteriori messages. When the check node degree is $d_c$, that is, after the forward and backward processing has finished, two of the merger vectors are available directly, because each of them is equal to one of the forward or backward metric vectors. The FB messages, which have already been multiplied by the nonzero elements of the H matrix during FB processing according to equation (1), are used directly to generate the other merger vectors. Therefore, a new conditional equation (2) and new merger metrics (4) are proposed for the merger step of the decoding algorithm as follows:

Conditional equation for merger step: $a' + a'' = a$ (2)

Merger metrics: (3)

(4)
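The forward-backward recursion and the merger step both rest on an elementary min-max operation over pairs of symbols that satisfy a linear constraint. The C++ sketch below shows one such operation for a constraint of the form a' + h·a'' = a, where h stands for a nonzero field element. Everything specific in it is an assumption made for illustration: the field GF(2^3), the primitive polynomial x^3 + x + 1, and the names `gf_mul` and `min_max_step` are hypothetical and are not taken from this paper.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

constexpr unsigned P = 3;          // assumed field GF(2^3) for illustration
constexpr unsigned Q = 1u << P;    // number of field elements
constexpr unsigned POLY = 0b1011;  // assumed primitive polynomial x^3 + x + 1

using Msg = std::vector<float>;    // one reliability value per symbol

// Multiplication in GF(2^3): shift-and-add with polynomial reduction.
unsigned gf_mul(unsigned a, unsigned b) {
    unsigned r = 0;
    while (b) {
        if (b & 1u) r ^= a;        // addition in GF(2^p) is XOR
        b >>= 1;
        a <<= 1;
        if (a & Q) a ^= POLY;      // reduce when the degree reaches P
    }
    return r;
}

// Elementary min-max step:
//   out(a) = min over all (a1, a2) with a1 + h*a2 = a of max(x(a1), y(a2)).
// With h = 1 the constraint becomes a1 + a2 = a, so no multiplication
// is needed in that case.
Msg min_max_step(const Msg& x, const Msg& y, unsigned h) {
    assert(x.size() == Q && y.size() == Q && h != 0 && h < Q);
    Msg out(Q, std::numeric_limits<float>::infinity());
    for (unsigned a1 = 0; a1 < Q; ++a1) {
        for (unsigned a2 = 0; a2 < Q; ++a2) {
            const unsigned a = a1 ^ gf_mul(h, a2);
            out[a] = std::min(out[a], std::max(x[a1], y[a2]));
        }
    }
    return out;
}
```

Calling `min_max_step(F, L, h)` with a nonzero coefficient `h` mimics a forward or backward update, while `min_max_step(F, B, 1)` corresponds to a merger under equation (2), where the multiplication disappears.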

- 266 - ISOCC2014978-1-4799-5127-7/$31.00 ⓒ2014 IEEE

Fig. 1. Data flow of parallel block-layered NB-LDPC decoding on CPU and GPU platforms.

[Plot: BER and FER (vertical axis, $10^{0}$ down to $10^{-10}$) versus Eb/No in dB (horizontal axis, 3.8 to 5). Curves: BER Imax=15, BER Imax=5, FER Imax=15, FER Imax=5.]

Fig. 2. BERs and FERs of a (744,651) NB-LDPC code over GF($2^5$) using GPU.

Block-Layered NB-LDPC Decoder on GPU

NVIDIA GPUs are powerful arithmetic engines capable of running thousands of lightweight threads in parallel, following a single-instruction, multiple-thread (SIMT) approach. Furthermore, the NB-LDPC decoding algorithm has a high ratio of computation to memory access.

Fig. 1 shows the data flow of the layered decoder, which implements the decoding on the CPU and GPU platforms. Each module on the GPU device corresponds to one CUDA kernel. The host CPU transfers data to and from the GPU device. Most of the computations are performed on the GPU, and all intermediate messages are stored in device memory to limit data transfers between host and device.

Experimental Results

The experimental setup used to evaluate the performance of the proposed NB-LDPC decoder consists of an NVIDIA GTX650Ti GPU with 768 CUDA cores and 1024 MB of GDDR5 device memory, and an Intel(R) Core(TM) i7-4770 CPU running at 3.4 GHz with 16 GB of RAM.

TABLE I
EXPERIMENTAL RESULTS

                                      Proposed     Proposed     [4]
Code                                  (744,651)    (744,651)    (744,651)
No. iterations                        15           15           15
Program                               C++          CUDA C       C++
Run time (ms)                         646          43.07        -
Coding gain at $10^{-5}$ BER (dB)     4.13         4.13         4.3
Coding gain at $10^{-8}$ BER (dB)     -            4.43         -

The simulations are performed over an AWGN channel with binary phase-shift keying (BPSK) modulation. The bit error rate (BER) and frame error rate (FER) performance for different numbers of iterations is shown in Fig. 2. The experimental results show that the proposed NB-LDPC decoder with 15 iterations obtains a 4.3-dB coding gain at $10^{-5}$ FER and a 4.13-dB coding gain at $10^{-5}$ BER, which is approximately 0.17 dB better at $10^{-5}$ BER than the result of Zhang and Cai for the same code length over GF($2^5$) [4]. Moreover, GPU acceleration makes it possible to reach the coding gain at a BER as low as $10^{-8}$ within hours instead of weeks of computation.
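For an AWGN simulation with BPSK such as the one described above, each Eb/No point has to be converted into a noise standard deviation. The C++ sketch below shows one common way to do this, assuming unit-energy BPSK symbols and the usual relation sigma^2 = 1 / (2 R Eb/No) with code rate R. The function name `awgn_sigma` is hypothetical, and the paper does not state how its noise samples were generated.

```cpp
#include <cassert>
#include <cmath>

// Noise standard deviation per real dimension for unit-energy BPSK,
// given Eb/No in dB and the code rate R (information bits per coded bit).
// Assumes sigma^2 = 1 / (2 * R * Eb/No) with Eb/No in linear scale.
double awgn_sigma(double ebno_db, double code_rate) {
    assert(code_rate > 0.0 && code_rate <= 1.0);
    const double ebno = std::pow(10.0, ebno_db / 10.0);  // dB to linear
    return std::sqrt(1.0 / (2.0 * code_rate * ebno));
}
```

For the (744,651) code the rate would be 651/744 = 0.875, so a call such as `awgn_sigma(4.13, 651.0 / 744.0)` gives the noise level for the 4.13 dB operating point under these assumptions.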

Table I shows the experimental results obtained on the CPU and the GPU. The execution times were measured with CPU timers. Decoding on the GPU takes 43.07 ms, compared with 646 ms on the CPU. The GPU-based implementation written in CUDA C therefore provides about 15 times higher throughput than the CPU-based implementation written in C++ for the long (744,651) code.

Conclusion

This paper presents a novel MMMA and an efficient implementation of a parallel block-layered NB-LDPC decoder on a GPU. Due to its inherently massive parallelism, an NB-LDPC decoder is more suitable for a GPU implementation than a decoder for binary LDPC codes. The proposed MMMA provides better BER performance than the previous algorithm. The experimental results show that the GPU-based implementation of the proposed NB-LDPC decoder provides higher throughput than a CPU-based implementation and makes it feasible to measure the coding gain at BERs as low as $10^{-8}$.

Acknowledgment

This work was supported by the IT R&D program of MOTIE/KEIT [10044092] and by the MSIP, Korea, under the ITRC support program (NIPA-2014-H0301-14-1042) supervised by the NIPA.

References

[1] M. C. Davey and D. MacKay, “Low density parity check codes over GF(q),” IEEE Communications Letters, vol. 2, no. 6, pp. 165-167, Jun. 1998.
[2] B. Zhou, et al., “Construction of nonbinary quasi-cyclic LDPC codes by arrays and array dispersions,” IEEE Trans. on Communications, vol. 57, no. 6, pp. 1652-1662, Jun. 2009.
[3] V. Savin, “Min-Max decoding for nonbinary LDPC codes,” in Proc. IEEE Int. Symp. Inf. Theory, Toronto, Canada, pp. 960-964, Jul. 2008.
[4] X. Zhang and F. Cai, “Efficient partial-parallel decoder architecture for quasi-cyclic nonbinary LDPC codes,” IEEE Transactions on Circuits and Systems I, vol. 58, no. 2, pp. 402-414, Feb. 2011.

