Autoregressive Belief Propagation for Decoding Block Codes

Eliya Nachmani and Lior Wolf
Tel Aviv University & Facebook AI Research

Email: {eliyan, wolf}@fb.com

Abstract—We revisit recent methods that employ graph neural networks for decoding error correcting codes and employ messages that are computed in an autoregressive manner. The outgoing messages of the variable nodes are conditioned not only on the incoming messages, but also on an estimation of the SNR, on the inferred codeword, and on two downstream computations: (i) an extended vector of parity check outcomes, (ii) the mismatch between the inferred codeword and the re-encoding of the information bits of this codeword. Unlike most learned methods in the field, our method violates the symmetry conditions that enable the other methods to train exclusively with the zero-word. Despite not having the luxury of training on a single word, and the inability to train on more than a small fraction of the relevant sample space, we demonstrate effective training. The new method obtains a bit error rate that outperforms the latest methods by a sizable margin.

I. INTRODUCTION

The majority of learned block code decoders follow the footsteps of traditional decoders, and add learned parameters to these. This provides solid foundations to the structure of the learned decoder, and the few cases in the literature in which decoders were designed de-novo, using generic neural networks, have led to weaker results.

In this work, we propose to enrich the decoders with an autoregressive element, which is common in many other machine learning domains. When decoding is done using belief propagation (BP), it is often the case that the decoded codeword emerges gradually through the iterations. This property is used, for example, to derive loss terms that are based not only on the final output, but also on the intermediate steps. In this work, we suggest to use the intermediate decoding as input to the learned networks.

Specifically, we propose to use a hypernetwork framework, in which conditioning is performed not only based on the computed BP messages, but also based on the current level of success in decoding. We, therefore, add inputs, such as the current decoding output, the parity check of this output, and a comparison of it to an ideal version, in which a codeword is obtained from the information bits.

For the purpose of performing the parity check, instead of using the original parity check matrix H, we employ, during the iterations of the method, an extended version. The extended version H′ is obtained by considering all pairwise combinations of the rows of H. This does not, of course, change the input to the decoder, nor does it change the structure of the Trellis graph that is used to design the network.

To further improve the results, we also suggest to estimate the SNR and to condition the network on an embedding of it. This allows the network to adapt to the level of noise observed for each specific input codeword.

Taken together, the new method shows an improved performance in comparison to the state of the art deep learning methods, on a diverse family of codes (BCH, LDPC, Polar). Specifically, we present an improvement of 0.5dB for LDPC and BCH codes, and 1.2dB for a Polar code. Our code is attached as supplementary.

II. RELATED WORK

Deep learning has been applied to various tasks in communication [1], [2], such as modulation [3], [4], equalization [5], [6], and MIMO detection [7], [8]. Deep learning techniques were also applied successfully to error correcting codes, for example, encoding [9], decoding [10]–[16] and even designing new codes [17] that outperform the state of the art codes for feedback channels [18], [19]. In [20], a fully connected neural network was employed for decoding short Polar codes of up to n = 16 bits. The results obtained are close to maximum a posteriori (MAP) decoding, which is the optimal performance. The main drawback is that this method works well for small codes, but cannot scale to larger block codes. Another line of work for decoding Polar codes partitions the polar encoding graph into sub-blocks, and decodes each sub-block separately [21]. In [22], an RNN decoder for convolutional and Turbo codes was introduced, which can match the performance of the classical BCJR decoder and the Viterbi decoder.

There are several methods for decoding relatively large block codes (n ≥ 100). The well known BP decoding algorithm was unfolded into a neural network, where the variable edges were equipped with learnable parameters [23]. A hardware friendly decoder of a similar nature was then introduced [24], in which the min-sum algorithm is employed. Both of these methods show an improvement over the baseline BP algorithm. Further improvements were obtained by [25]. First, each variable node was extended to a small neural network g, which transforms the decoder into a graph neural network. Second, the weights of each variable node's network g are determined by a hypernetwork [26] f, which provides the neural decoder with added adaptation.

In this work, we focus on decoding the following block codes – LDPC, BCH, and Polar, where the most relevant baselines, which we improve, are [25], [27]. This improvement is obtained mostly by introducing autoregressive signals. The method is related to the Dynamic Factor Graphs (DFG) [28], which improves the performance of a factor graph used for time-series analysis by conditioning the current state on the previous states. We also condition the hypernetwork on an estimated SNR. This is related to [29], which conditioned the classic BP algorithm on the output of minimum mean square error (MMSE) prediction for the task of LDPC decoding.

III. BACKGROUND

The BP methods, which were traditionally used for decoding block codes, have been augmented with learned weights, and, more recently, generalized to graph neural networks (GNN) [30] and even to graph hyper networks [25]. Our method relies on hyper networks as well, and the background below follows the naming conventions of [25].

A block code has k information bits and n output bits. The parity check matrix H that defines the code is a binary matrix of size (n−k)×n; the generator matrix G is a binary matrix of size k × n. The standard form of G has the structure [I_k | P], where I_k is the identity matrix of dimension k × k and P is a k × (n−k) binary matrix.
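As a concrete illustration of this notation, the following minimal numpy sketch builds a toy standard-form code (the specific P, and hence the code, is made up for illustration and is not one of the codes used in the paper) and verifies that every codeword satisfies the parity checks:

```python
import numpy as np

# Toy example with n = 6, k = 3 (illustrative only, not a code from the paper):
# a standard-form generator G = [I_k | P] and a matching parity check matrix
# H = [P^T | I_{n-k}], so that H c = 0 (mod 2) for every codeword c.
P = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])
k, r = P.shape
G = np.hstack([np.eye(k, dtype=int), P])
H = np.hstack([P.T, np.eye(r, dtype=int)])

info = np.array([1, 0, 1])          # k information bits
codeword = (info @ G) % 2           # length-n codeword
print((H @ codeword) % 2)           # -> [0 0 0], all parity checks satisfied
```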

The vanilla belief propagation algorithm operates on the Trellis graph, which is a directed graph that we view as a layered neural network. The input layer has n nodes corresponding to each input bit. The hidden layers are of two types, namely variable layers and check layers. These layers are interleaved, such that the layer index j is odd for variable layers, and even for check layers.

Each column of the matrix H is associated with one bit of the codeword and with dv variable nodes in each variable layer, where dv is the sum over this column. For notational convenience, we assume that H is regular, i.e., that the sum over columns (dv) is fixed. Therefore, each variable layer has E = dv · n variable processing units.

Similarly, assuming regularity also over the rows of H, the check layers are composed of E = (n − k) × dc check processing units, each associated with a parity check.

In the experiment section, we will show that our method also works in an irregular scenario, i.e., where each variable/check node has a different degree.

The messages propagate in the Trellis graph from a variable layer to a check layer iteratively. The input to the belief propagation algorithm is the log likelihood ratio (LLR) $\ell \in \mathbb{R}^n$ of each bit:

$$\ell_v = \log \frac{\Pr(c_v = 0 \mid y_v)}{\Pr(c_v = 1 \mid y_v)}, \qquad (1)$$

where $y_v$ is the received signal associated with the bit $c_v$ that we wish to recover, and $\ell_v$ is the corresponding log likelihood ratio.

Let $x^j$ be the messages vector of the belief propagation algorithm, of length E. The input layer expands the vector $\ell$ of n LLR values to the messages vector $x^j$ with E elements:

$$x^j = W_{in} \cdot \ell \qquad (2)$$

where $W_{in}$ is a binary matrix of dimension E × n, constructed from the parity check matrix H. Each row of $W_{in}$ is associated with a row i and a column j of H for which $H_{ij} = 1$, and is set to $H_i - \delta_j$, where $\delta_j$ is a sparse vector that contains 1 only in the j-th location, $H_i$ is the i-th row of H, and $H_{ij}$ is a single value of this matrix.
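The construction of $W_{in}$ can be made concrete with a short sketch; the function name and the dense-matrix representation are our own choices for illustration:

```python
import numpy as np

def build_w_in(H):
    """Build the E x n expansion matrix described after Eq. (2): one row per
    edge (i, j) with H[i, j] == 1, equal to the i-th row of H with the j-th
    entry zeroed out (H_i - delta_j)."""
    rows = []
    for i in range(H.shape[0]):
        for j in range(H.shape[1]):
            if H[i, j] == 1:
                row = H[i].copy()
                row[j] = 0          # subtract delta_j
                rows.append(row)
    return np.array(rows)           # shape (E, n), E = number of ones in H
```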

Each iteration j of the BP algorithm produces a message $x^j$. Each element e in the vector $x^j$ is given by:

$$x^j_e = x^j_{(c,v)} = \tanh\left(\frac{1}{2}\left(\ell_v + \sum_{e' \in N(v)\setminus\{(c,v)\}} x^{j-1}_{e'}\right)\right) \quad \text{if } j \text{ is odd} \qquad (3)$$

$$x^j_e = x^j_{(c,v)} = 2\,\mathrm{arctanh}\left(\prod_{e' \in N(c)\setminus\{(c,v)\}} x^{j-1}_{e'}\right) \quad \text{if } j \text{ is even} \qquad (4)$$

where $N(c) = \{(c, v) \mid H(c, v) = 1\}$ is the set of edges in which the check node c is associated with the variable node v in the Trellis graph. The messages propagate in L variable layers and L check layers until the final marginalization layer, 2L+1. This layer outputs the marginalization:

$$u^j_v = \ell_v + \sum_{e' \in N(v)} x^{j-1}_{e'} \qquad (5)$$

$$o^j_v = \sigma(u^j_v) \qquad (6)$$

where $\sigma$ is the sigmoid function. The final marginalization can also be written in matrix form as:

$$u^j = x^{j-1} W_{out} \qquad (7)$$

where $W_{out}$ is a binary matrix of dimensionality E × n. Every group of rows in $W_{out}$ is associated with a column j of H, which is denoted $H_j$. The group has as many rows as the number of ones in $H_j$, and each row is sparse and contains a single 1 element in a location i that matches a value of one in $H_j$ ($H_{ij} = 1$).
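For reference, the recursion of Eqs. (3)–(6) can be sketched directly as a plain (non-learned) sum-product decoder; this is our own minimal dictionary-based implementation, not the unrolled Trellis network used by the learned decoders:

```python
import numpy as np

def sum_product_bp(H, llr, n_iter=5):
    """Minimal sketch of Eqs. (3)-(6): sum-product BP over the edges (c, v)
    with H[c, v] == 1, returning the per-bit marginals o = sigmoid(u)."""
    m, n = H.shape
    edges = [(c, v) for c in range(m) for v in range(n) if H[c, v]]
    chk2var = {e: 0.0 for e in edges}                    # check-to-variable LLRs
    for _ in range(n_iter):
        # Odd (variable) layer, Eq. (3): tanh of half the extrinsic sum.
        var2chk = {(c, v): np.tanh(0.5 * (llr[v] + sum(
                       chk2var[(c2, v)] for c2 in range(m)
                       if H[c2, v] and c2 != c)))
                   for (c, v) in edges}
        # Even (check) layer, Eq. (4): 2*arctanh of the extrinsic product
        # (clipped for numerical stability).
        chk2var = {(c, v): 2.0 * np.arctanh(np.clip(np.prod(
                       [var2chk[(c, v2)] for v2 in range(n)
                        if H[c, v2] and v2 != v]), -0.999999, 0.999999))
                   for (c, v) in edges}
    # Marginalization, Eqs. (5)-(6).
    u = np.array([llr[v] + sum(chk2var[(c, v)] for c in range(m) if H[c, v])
                  for v in range(n)])
    return 1.0 / (1.0 + np.exp(-u))
```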

In [25], a learnable hypernet graph neural network is introduced. First, the vanilla belief propagation is transformed into a graph neural network, by replacing each variable node with a small learnable neural network g. Second, the primary network (a term from the hypernetwork literature) f is used to predict the weights of the neural network g. These two modifications take place by replacing Eq. 3 (odd j, where j is the iteration number) with:

$$x^j_e = x^j_{(c,v)} = g(\ell_v, x^{j-1}_{N(v,c)}, \theta^j_g), \qquad (8)$$

$$\theta^j_g = f(|x^{j-1}|, \theta_f), \qquad (9)$$

where $\theta^j_g$ and $\theta_f$ are the weights of the networks g and f, respectively. $x^j_{N(v,c)}$ is a vector of the elements from $x^j$ that correspond to the indices $N(v) \setminus \{(c, v)\}$, and has a length of $d_v - 1$. The hypernet scheme is shown to increase the adaptiveness of the network, which helps overcome errors during the decoding process.
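The following PyTorch sketch illustrates the shape of this scheme for a single edge: a network f maps the absolute incoming messages to the weights of a small two-layer g. The class name, layer sizes, and the omission of biases are our own illustrative assumptions, not the architecture of [25]:

```python
import torch
import torch.nn as nn

class EdgeHyperNet(nn.Module):
    """Sketch of Eqs. (8)-(9): f predicts the weights of a small per-edge g."""
    def __init__(self, d_v, hidden_g=16, hidden_f=128):
        super().__init__()
        self.in_g = d_v                                   # input to g: l_v and d_v - 1 messages
        self.hidden_g = hidden_g
        n_weights = self.in_g * hidden_g + hidden_g       # two weight matrices of g (no biases)
        self.f = nn.Sequential(nn.Linear(d_v - 1, hidden_f), nn.ReLU(),
                               nn.Linear(hidden_f, n_weights))

    def forward(self, lv, x_in):
        # x_in: the d_v - 1 incoming messages x^{j-1}_{N(v,c)} of one edge.
        theta = self.f(x_in.abs())                        # Eq. (9): f conditions on |x^{j-1}|
        w1 = theta[: self.in_g * self.hidden_g].view(self.hidden_g, self.in_g)
        w2 = theta[self.in_g * self.hidden_g:].view(1, self.hidden_g)
        h = torch.cat([lv.view(1), x_in])                 # g's input (l_v, x^{j-1}_{N(v,c)})
        return torch.tanh(w2 @ torch.tanh(w1 @ h))        # Eq. (8), tanh-MLP form of Eq. (23)
```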

Fig. 1. An overview of our method for a linear block code with n = 4, k = 2.

Training is performed as in [23] by observing the marginalization at each odd iteration and applying cross entropy. The marginalization at the odd layer j is

$$o^j_v = \sigma\left(\ell_v + \sum_{e' \in N(v)} \bar{w}_{e'} x^j_{e'}\right) \qquad (10)$$

where $\bar{w}_{e'}$ is a learnable weight matrix. The loss function is:

$$\mathcal{L} = -\frac{1}{n}\sum_{h=1}^{L}\sum_{v=1}^{n} c_v \log(o^{2h+1}_v) + (1 - c_v)\log(1 - o^{2h+1}_v) \qquad (11)$$

where $c_v$ are the ground truth bits. The problem setup assumes an additive white Gaussian noise (AWGN) channel with Binary Phase Shift Keying (BPSK) modulation.
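A minimal sketch of this multi-layer loss, assuming the decoder returns the list of odd-layer marginalizations $o^{2h+1}$ as tensors of per-bit probabilities (the function and argument names are placeholders of ours):

```python
import torch
import torch.nn.functional as F

def multiloss(odd_layer_outputs, codeword):
    """Sketch of Eq. (11): binary cross entropy of every odd-layer
    marginalization o^{2h+1} against the ground truth bits, summed over
    the L unrolled iterations (BCE already averages over the n bits)."""
    loss = torch.zeros(())
    for o in odd_layer_outputs:                 # one entry per odd layer, h = 1..L
        loss = loss + F.binary_cross_entropy(o, codeword.float())
    return loss
```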

IV. METHOD

We suggest adding four major components to the learnable belief propagation algorithm: (i) the autoregressive signal $a^j$ of the current estimation of the output variable, (ii) an autoregressive vector of parity check outcomes $e^j$, based on the parity check matrix and its pairwise row combinations, (iii) an autoregressive error vector $z^j$ based on the re-encoded codeword, and (iv) an embedding p of the Signal-To-Noise ratio, see Fig. 1. Specifically, we replace Eq. 9 with the following:

$$\theta^j_g = f(a^j, e^j, z^j, p, |x^{j-1}|, \theta_f). \qquad (12)$$

The first input, $a^j$, is a vector of length E, which contains the scaled projection of the output of the previous belief propagation iteration j − 1 after Binary Phase Shift Keying (BPSK) modulation, given by:

$$a^j = c^j \cdot (W_{out} \cdot s^j) \qquad (13)$$

where $c^j$ is a learnable scale variable, $W_{out}$ is defined after Eq. 7, and $s^j$ is the hard decision vector of length n:

$$s^j = \begin{cases} +1 & o^{j-1} > 0.5 \\ -1 & o^{j-1} \leq 0.5 \end{cases} \qquad (14)$$

where $o^j$ is defined in Eq. 6 and the binarization occurs per each vector element. Note that $W_{out}$ is used to project the hard decision vector into the same dimension as $|x^{j-1}|$.
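In numpy form, and treating the learnable scale $c^j$ as a plain float for illustration, Eqs. (13)–(14) amount to:

```python
import numpy as np

def autoregressive_input(o_prev, W_out, c_scale):
    """Sketch of Eqs. (13)-(14): hard-decide the previous marginalization
    o^{j-1} into +/-1 symbols and project it back to edge space with W_out,
    scaled by the (learnable) c^j, here passed in as a float."""
    s = np.where(o_prev > 0.5, 1.0, -1.0)   # Eq. (14), elementwise hard decision
    return c_scale * (W_out @ s)            # Eq. (13), a length-E vector
```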

The second input of f, $e^j$, contains information about violated parity checks. We consider an augmented parity check matrix $H'$, which extends the parity check matrix H and contains all pairwise row combinations of H. Let $d = \binom{n-k}{2}$; then the new parity check matrix $H'$ has a dimension of d × n, and is given by:

$$H'_{\alpha\beta} = H_\alpha \oplus H_\beta \qquad (15)$$

where $\oplus$ is the XOR operation (addition modulo 2), $H'_{\alpha\beta}$ is a row of $H'$ with the double index $\alpha\beta$, and $H_\alpha$ ($H_\beta$) is row $\alpha$ ($\beta$) of H. In this manner, the second input to f, $e^j$, is a binary vector with a length of d, and is given by:

$$e^j = H' \oplus s^j \qquad (16)$$
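A sketch of this computation is given below. Building $H'$ follows Eq. (15) directly; for Eq. (16) we use one plausible reading, namely the syndrome of the hard decision under $H'$ computed modulo 2, with an assumed mapping of the ±1 symbols back to bits (these interpretation choices are ours):

```python
import itertools
import numpy as np

def extended_parity_matrix(H):
    """Eq. (15): stack the XOR of every pair of rows of H, giving an H' with
    d = C(n-k, 2) rows."""
    return np.array([H[a] ^ H[b]
                     for a, b in itertools.combinations(range(H.shape[0]), 2)])

def parity_feedback(H_ext, s):
    """One reading of Eq. (16): the length-d syndrome of the hard decision s^j
    under H', modulo 2.  Assumes the BPSK convention +1 -> bit 0, -1 -> bit 1."""
    bits = (s < 0).astype(int)
    return (H_ext @ bits) % 2
```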

The third input of f, $z^j$, is an error measurement based on the re-encoded parity check bits of the autoregressed codeword. We suggest to re-encode the parity check bits by using the generator matrix G and the information symbols of the hard decision $s^j$. Since we use G in its standard form, we can regard $s^j$ as a concatenation of two vectors, $s^j = [s^j_{info}, s^j_{parity}]$, of size k and n − k, respectively, the first denoting the information bits and the second the parity check bits.

The re-encoded codeword $s'^j$ of length n is given by:

$$s'^j = s^j_{info} \oplus G \qquad (17)$$

In the same manner as above, we can regard $s'^j$ as a concatenation of two vectors $s'^j = [s'^j_{info}, s'^j_{parity}]$.

The error measurement $z^j$, a vector of length (n−k), is the mismatch between the autoregressed parity check bits $s^j_{parity}$ and the re-encoded ones $s'^j_{parity}$:

$$z^j = s^j_{parity} \oplus s'^j_{parity} \qquad (18)$$
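A sketch of Eqs. (17)–(18), interpreting the re-encoding as multiplication by the standard-form G modulo 2 and using the same assumed ±1-to-bit mapping as above (both interpretations are ours):

```python
import numpy as np

def reencode_mismatch(s, G, k):
    """Sketch of Eqs. (17)-(18): re-encode the information part of the hard
    decision s^j with the standard-form generator G = [I_k | P] and XOR the
    resulting parity bits with the decoded parity bits."""
    bits = (s < 0).astype(int)           # assumed mapping: +1 -> 0, -1 -> 1
    s_info, s_parity = bits[:k], bits[k:]
    s_reenc = (s_info @ G) % 2           # Eq. (17): re-encoded codeword s'^j
    return (s_parity + s_reenc[k:]) % 2  # Eq. (18): mismatch z^j of length n - k
```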

In order to compute the fourth component p, we estimate the SNR of the received signal y. Since the problem setup assumes an AWGN channel, the received signal is given by:

$$y = \frac{2}{\sigma_n^2}\,(s + \sigma_n \cdot n) \qquad (19)$$

where s is a vector of +1/−1 bit symbols, n is a normal Gaussian noise vector, and $\sigma_n$ is the noise standard deviation, which equals $\left(\sqrt{2 \cdot R \cdot 10^{\mathrm{SNR}/10}}\right)^{-1}$.

An estimation $\tilde{p}$ of the SNR in dB given y is therefore:

$$\tilde{p} = 10 \cdot \log_{10}\left(\frac{Var_\pm(y)}{8 \cdot R}\right) \qquad (20)$$

where $Var_\pm(y)$ is an estimation of the variance based on a hard decision on the received signal, i.e., by thresholding the vector y and partitioning it to obtain the positive elements $y^+$ and the negative elements $y^-$:

$$Var_\pm(y) = \frac{1}{2}\left(Var(y^+) + Var(y^-)\right) \qquad (21)$$

Valid SNR values are between one and I. The estimated value $\tilde{p}$ may be out of the range of valid SNR values and we, therefore, round and trim it, obtaining a corrected estimation $\bar{p} = \min(\max(0, \lfloor\tilde{p}\rceil), I)$, where I is the maximal SNR value and $\lfloor\cdot\rceil$ is the rounding operation.

p is an embedding of the estimated SNR value $\bar{p}$. Let i = 1, 2, ..., I be the estimated SNR values (in dB). Let $LUT_{snr} \in \mathbb{R}^{64 \times I}$ be the SNR look-up-table, where each column $p_i \in \mathbb{R}^{64}$ is the learned vector embedding of SNR i. Given an estimation $\bar{p}$, we simply set $p = p_{\bar{p}}$.
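Putting Eqs. (19)–(21) and the look-up step together, a minimal numpy sketch looks as follows; the function name, the handling of the clipped value 0, and the column indexing of the look-up table are our own assumptions:

```python
import numpy as np

def snr_embedding(y, R, lut, I=8):
    """Sketch of Eqs. (20)-(21) plus the embedding look-up: estimate the SNR
    from the spread of y around each hard-decision cluster, round and trim
    it, and return the corresponding learned column of lut (shape 64 x I)."""
    var_pm = 0.5 * (np.var(y[y > 0]) + np.var(y[y <= 0]))   # Eq. (21)
    p_tilde = 10.0 * np.log10(var_pm / (8.0 * R))           # Eq. (20)
    p_bar = int(min(max(0, round(p_tilde)), I))             # round and trim
    return lut[:, max(p_bar, 1) - 1]                        # column of SNR p_bar (1-indexed)
```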

A. Complexity Analysis

We next compare the complexity of the Neural BP decoder [23], the hypernetwork BP of [25], and our proposed method. The move from Neural BP [23] to hypernetwork BP [25] adds a complexity term of $O(L E n_{u,g} n_{u,f})$, where L is the number of iterations, E is the number of processing units, and $n_{u,g}$ and $n_{u,f}$ are the number of neurons in the g and f networks, respectively. Our proposed method adds on top of [25] the following number of multiplications: $O(L n (E + d + n_{u,f}))$, where n is the number of bits in the code and $d = \binom{n-k}{2}$. For example, decoding BCH(63,51) with our method adds 5% to the total number of operations (and gains an improvement of 0.5dB, see Sec. V). This is also validated in the actual runtime, as reported in Tab. III.
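As a small worked instance of the added term (under the assumption that BCH(63,51) has n = 63 and k = 51, so n − k = 12):

```latex
d = \binom{n-k}{2} = \binom{12}{2} = 66,
\qquad \text{so the extra cost per decoding is } O\!\left(L\, n\, (E + 66 + n_{u,f})\right).
```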

However, the performance of our autoregressive method for L = 5 iterations convincingly outperforms the method of [25] for L = 50 iterations. This 10× reduction in the number of iterations is much more substantial than an increase of 5% in the runtime per iteration.

B. Symmetry conditions

Decoding block codes with the message passing decoder that maintains the symmetry condition has the desired property that the error is independent of the transmitted codeword [31, Definition 4.83]. The direct implication is that the training set can contain only noisy variations of the zero codeword. Our method does not preserve the symmetry condition and, therefore, the training set should contain random codewords. The symmetry condition for a variable node at iteration j is given by:

$$\Psi\left(-\ell_v, -x^{j-1}_{N(v,c)}\right) = -\Psi\left(\ell_v, x^{j-1}_{N(v,c)}\right) \qquad (22)$$

where $\Psi$ is the computation in the variable node.

Assuming that the variable node calculation is given by Eq. (8) and Eq. (12), the proposed architecture does not satisfy the variable symmetry condition. The underlying reason is that g employs the anti-symmetric function tanh, but the inputs to this function are not sign invariant. Let $K = d_v - 1$ and $x^j_{N(v,c)} = (x^j_1, \ldots, x^j_K)$. In the proposed architecture, for any odd j > 0, $\Psi$ is given as

$$g(\ell_v, x^{j-1}_1, \ldots, x^{j-1}_K, \theta^j_g) = \tanh\left(W_p^\top \cdots \tanh\left(W_2^\top \tanh\left(W_1^\top (\ell_v, x^{j-1}_1, \ldots, x^{j-1}_K)\right)\right)\right) \qquad (23)$$

where p is the number of layers and the weights $W_1, \ldots, W_p$ constitute $\theta^j_g = f(a^j, e^j, z^j, p, |x^{j-1}|, \theta_f)$. For real valued weights $\theta^{lhs}_g$ and $\theta^{rhs}_g$, since tanh(x) is an odd function, for any real valued input, if $\theta^{lhs}_g = \theta^{rhs}_g$ then $g(\ell_v, x^{j-1}_1, \ldots, x^{j-1}_K, \theta^{lhs}_g) = -g(-\ell_v, -x^{j-1}_1, \ldots, -x^{j-1}_K, \theta^{rhs}_g)$. In our case, $\theta^{lhs}_g = f(a^j, e^j, z^j, p, |x^{j-1}|, \theta_f) \neq f(-a^j, -e^j, -z^j, -p, |-x^{j-1}|, \theta_f) = \theta^{rhs}_g$, since $a^j = c^j \cdot (W_{out} \cdot s^j) \neq -a^j = -c^j \cdot (W_{out} \cdot s^j) = c^j \cdot (W_{out} \cdot (-s^j))$. Similar arguments hold for the other terms, such as $e^j$ and $z^j$.
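The two halves of this argument can be checked numerically with a few lines (the weight shapes and random inputs below are illustrative): a fixed tanh-MLP of the form of Eq. (23) is odd in its inputs, while the conditioning terms $a^j$, $e^j$, $z^j$ of Eq. (12) do not flip sign with the input, which is what breaks the end-to-end symmetry:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))
g = lambda x: np.tanh(W2 @ np.tanh(W1 @ x))     # the form of Eq. (23), fixed weights

x = rng.normal(size=4)
print(np.allclose(g(-x), -g(x)))                # True: g alone satisfies Eq. (22)
# However, the weights are produced by f(a^j, e^j, z^j, p, |x^{j-1}|), and these
# conditioning inputs are not negated when the messages are negated, so the
# overall variable-node computation violates the symmetry condition.
```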

TABLE I
The negative natural logarithm of the bit error rate (BER) at three SNR values. Higher is better.

— after five iterations —
Code           |   BP            |   [27]           |  Hypernet        |   Ours
               |  4    5    6    |  4    5     6    |  4    5     6    |  4    5     6
Polar(64,32)   | 3.52 4.04 4.48  | 4.14 5.32  6.67  | 4.25 5.49  7.02  | 4.77 6.30   8.19
Polar(64,48)   | 4.15 4.68 5.31  | 4.77 6.12  7.84  | 4.91 6.48  8.41  | 5.25 6.96   9.00
Polar(128,64)  | 3.38 3.80 4.15  | 3.73 4.78  5.87  | 3.89 5.18  6.94  | 4.02 5.48   7.55
Polar(128,86)  | 3.80 4.19 4.62  | 4.37 5.71  7.19  | 4.57 6.18  8.27  | 4.81 6.57   9.04
Polar(128,96)  | 3.99 4.41 4.78  | 4.56 5.98  7.53  | 4.73 6.39  8.57  | 4.92 6.73   9.30
LDPC(49,24)    | 5.30 7.28 9.88  | 5.49 7.44  10.47 | 5.76 7.90  11.17 | 6.05 8.13  11.68
LDPC(121,60)   | 4.82 7.21 10.87 | 5.12 7.97  12.22 | 5.22 8.29  13.00 | 5.22 8.31  13.07
LDPC(121,70)   | 5.88 8.76 13.04 | 6.27 9.44  13.47 | 6.39 9.81  14.04 | 6.45 10.01 14.77
LDPC(121,80)   | 6.66 9.82 13.98 | 6.97 10.47 14.86 | 6.95 10.68 15.80 | 7.22 11.03 15.9
MacKay(96,48)  | 6.84 9.40 12.57 | 7.04 9.67  12.75 | 7.19 10.02 13.16 | 7.43 10.65 14.65
CCSDS(128,64)  | 6.55 9.65 13.78 | 6.82 10.15 13.96 | 6.99 10.57 15.27 | 7.25 10.99 16.36
BCH(31,16)     | 4.63 5.88 7.60  | 4.74 6.25  8.00  | 5.05 6.64  8.80  | 5.48 7.37   9.61
BCH(63,36)     | 3.72 4.65 5.66  | 3.94 5.27  6.97  | 3.96 5.35  7.20  | 4.33 5.94   8.21
BCH(63,45)     | 4.08 4.96 6.07  | 4.37 5.78  7.67  | 4.48 6.07  8.45  | 4.80 6.43   8.69
BCH(63,51)     | 4.34 5.29 6.35  | 4.54 5.98  7.73  | 4.64 6.08  8.16  | 4.95 6.69   9.18

— at convergence —
Polar(64,32)   | 4.26 5.38 6.50  | 4.22 5.59  7.30  | 4.59 6.10  7.69  | 5.57 7.43   9.82
Polar(64,48)   | 4.74 5.94 7.42  | 4.70 5.93  7.55  | 4.92 6.44  8.39  | 5.41 7.19   9.30
Polar(128,64)  | 4.10 5.11 6.15  | 4.19 5.79  7.88  | 4.52 6.12  8.25  | 4.84 6.78   9.30
Polar(128,86)  | 4.49 5.65 6.97  | 4.58 6.31  8.65  | 4.95 6.84  9.28  | 5.39 7.37  10.13
Polar(128,96)  | 4.61 5.79 7.08  | 4.63 6.31  8.54  | 4.94 6.76  9.09  | 5.27 7.44  10.2
LDPC(49,24)    | 6.23 8.19 11.72 | 6.05 8.34  11.80 | 6.23 8.54  11.95 | 6.58 9.39  12.39
BCH(63,36)     | 4.03 5.42 7.26  | 4.15 5.73  7.88  | 4.29 5.91  8.01  | 4.57 6.39   8.92
BCH(63,45)     | 4.36 5.55 7.26  | 4.49 6.01  8.20  | 4.64 6.27  8.51  | 4.97 6.90   9.41
BCH(63,51)     | 4.58 5.82 7.42  | 4.64 6.21  8.21  | 4.80 6.44  8.58  | 5.17 7.16   9.53

C. Training

Training is performed using the loss function in Eq. 11. Similarly to [25], the Taylor approximation to Eq. 4 is employed, in order to stabilize the training. Since the variable symmetry condition is violated, we train the model with random codewords. We use the Adam optimizer [32] for training, with a learning rate of 1e−4 for all block codes. Moreover, the graph neural network simulates L = 5 iterations of the BP algorithm, which is equal to 10 layers.

V. EXPERIMENTS

We trained our proposed architecture with three types of linear block codes: Low Density Parity Check (LDPC) codes [33]–[35], Polar codes [36]–[39] and Bose–Chaudhuri–Hocquenghem (BCH) codes [40]. All parity check matrices and generator matrices are taken from [41]. If the generator matrices do not have a standard form, we rearrange the columns of G to a standard form.

The training set contains generated random examples that are transmitted over an additive white Gaussian noise (AWGN) channel. Each batch contains multiple sets of Signal-To-Noise (SNR) values. The same hyperparameters are used for all types of codes. We use a batch size of 120 examples, and each batch contains 15 examples per SNR value of 1dB, 2dB, ..., 8dB. The order of the Taylor approximation to Eq. 4 was q = 1005. The network f has four layers with 128 neurons at each layer. The network g has two layers with 16 neurons at each layer.

TABLE II
Ablation analysis. The negative natural logarithm of the bit error rate (BER) at two SNR values for our model and five variants of it. Higher is better.

Code                                   | BCH(31,16)    | BCH(63,45)    | BCH(63,51)
Variant / SNR                          |  7      8     |  7      8     |  7      8
(i) Complete method                    | 11.94  14.50  | 11.48  14.08  | 12.78  16.13
(ii) No a^j                            | 11.61  13.23  | 10.89  13.54  | 11.58  14.74
(iii) No e^j                           | 11.63  13.05  | 10.92  13.44  | 11.51  14.23
(iv) No z^j                            | 11.32  13.14  | 10.96  13.62  | 11.42  14.17
(v) No p                               | 11.57  13.18  | 11.01  13.23  | 11.98  14.55
(vi) Training with zero codeword       |  2.50   1.88  |  2.78   3.46  |  5.95   6.12
Hypernet [25] trained with zero c.w.   | 10.69  13.05  | 11.04  14.14  | 11.07  13.54
Hypernet [25] trained with random c.w. | 10.47  13.01  | 10.13  12.89  | 10.23  12.44

TABLE III
Runtime per batch in msec.

Method | Train | Test
[23]   | 16.2  |  5.4
[25]   | 88.2  | 40.1
Ours   | 90.9  | 42.6
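To make the data generation described above concrete, the following sketch produces one such training batch of random codewords over an AWGN channel with BPSK modulation; the function name and the 0 → +1 symbol convention are our own assumptions:

```python
import numpy as np

def make_training_batch(G, R, snr_db_values=range(1, 9), per_snr=15, seed=None):
    """Sketch of the training data described above: random information words
    encoded with G, BPSK-modulated, and transmitted over AWGN at each SNR in
    1..8 dB (15 examples per SNR, i.e. a batch of 120).  Returns channel LLRs
    and the ground-truth codeword bits."""
    rng = np.random.default_rng(seed)
    k, n = G.shape
    llrs, bits = [], []
    for snr_db in snr_db_values:
        sigma = 1.0 / np.sqrt(2.0 * R * 10.0 ** (snr_db / 10.0))  # noise std, as in Eq. (19)
        info = rng.integers(0, 2, size=(per_snr, k))
        cw = (info @ G) % 2                        # random codewords, not the zero word
        x = 1.0 - 2.0 * cw                         # assumed BPSK mapping: 0 -> +1, 1 -> -1
        y = x + sigma * rng.normal(size=x.shape)   # AWGN channel
        llrs.append(2.0 * y / sigma ** 2)          # channel LLRs, cf. Eq. (1)
        bits.append(cw)
    return np.vstack(llrs), np.vstack(bits)
```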

The results are reported in Tab. I, and are provided for more BER values for some of the codes in Fig. 3. Fig. 3(a) displays the BER for the Polar(64,32) code, where our method achieves an improvement of 1.2dB over [25] for L = 50. For BCH codes, Fig. 3(b) depicts an improvement of 0.5dB for BCH(63,51).

Tab. I presents the negative natural logarithm of the Bit Error Rate (BER) for 15 block codes. After five iterations, our method obtains better results than BP, Learned BP [23], and the Hypernetwork BP [25]. The same improved results hold for larger L, i.e., at convergence of the algorithm.

In order to observe the contribution of the various autoregressive terms and the SNR conditioning, we ran an ablation study. We compare (i) our complete method, (ii) our method without the autoregressive term $a^j$, (iii) our method without the term $e^j$, (iv) our method without $z^j$, (v) our method without p, and (vi) our method when trained with noisy variations of the zero codeword only. The reduced methods in variants (ii-v) are identical to the complete method, except that one term is removed from Eq. 12.

As can be seen in Tab. II, the removal of each of the four novel terms is detrimental. As expected from our analysis, training with the zero codeword is not effective for our method. However, as shown in the same table, for the previous work that maintains the symmetry conditions, training with the zero codeword is slightly better than training with our training set.

To further understand the behavior of the autoregressive model, we present in Fig. 2(a) the TSNE visualization of the output of network f trained for the BCH(63,51) code. Each point represents a codeword, and we color each point with the number of bit errors at the output of the decoder. We can see a clear difference in the primary (dynamic) network g between codewords with significant errors (bottom right) in comparison to words without errors (top left). Fig. 2(b) shows the absolute value of the first layer of the network f for the same code. The network emphasizes $z^j$ (shown in red), which is the re-encoding mismatch in Eq. (18). The second most important part is $e^j$, which is the parity check from matrix $H'$.

Fig. 2. (a) TSNE of f output (b) Abs weight per input.

Fig. 3. BER for various values of SNR for various codes. (a) Polar(64,32) and (b) BCH(63,51).

VI. CONCLUSIONS

We propose two modifications to the learnable BP methods that currently provide the state of the art results in decoding block codes. First, we embed the estimated SNR, enabling the network to adapt to the level of the incoming noise. Second, we incorporate multiple autoregressive signals that are obtained from the intermediate output of the network. The usage of autoregressive signals for BP, and graph neural networks in general, can also be beneficial in other domains, such as image completion [42], stereo matching [43], image restoration [44], codes over graphs [45] and speech recognition [46]. For example, it would be interesting to see if feeding these networks with computed error and mismatch signals that arise from the intermediate per-iteration solutions can boost performance.

REFERENCES

[1] M. Lian, C. Häger, and H. D. Pfister, “What can machine learning teach us about communications?” in 2018 IEEE Information Theory Workshop (ITW). IEEE, 2018, pp. 1–5.
[2] N. Weinberger, “Learning additive noise channels: Generalization bounds and algorithms,” in 2020 IEEE International Symposium on Information Theory (ISIT). IEEE, 2020, pp. 2586–2591.
[3] S. Ramjee, S. Ju, D. Yang, X. Liu, A. E. Gamal, and Y. C. Eldar, “Ensemble wrapper subsampling for deep modulation classification,” arXiv preprint arXiv:2005.04586, 2020.
[4] ——, “Fast deep learning for automatic modulation classification,” arXiv preprint arXiv:1901.05850, 2019.
[5] A. Caciularu and D. Burshtein, “Blind channel equalization using variational autoencoders,” in 2018 IEEE International Conference on Communications Workshops (ICC Workshops). IEEE, 2018, pp. 1–6.
[6] ——, “Unsupervised linear and nonlinear channel equalization and decoding using variational autoencoders,” IEEE Transactions on Cognitive Communications and Networking, 2020.
[7] N. Samuel, T. Diskin, and A. Wiesel, “Learning to detect,” IEEE Transactions on Signal Processing, vol. 67, no. 10, pp. 2554–2564, 2019.
[8] N. Shlezinger, R. Fu, and Y. C. Eldar, “Deep soft interference cancellation for MIMO detection,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 8881–8885.
[9] Y. Jiang, H. Kim, H. Asnani, S. Kannan, S. Oh, and P. Viswanath, “Turbo autoencoder: Deep learning based channel codes for point-to-point communication channels,” in Advances in Neural Information Processing Systems, 2019, pp. 2754–2764.
[10] X. Xiao, B. Vasic, R. Tandon, and S. Lin, “Finite alphabet iterative decoding of LDPC codes with coarsely quantized neural networks,” in 2019 IEEE Global Communications Conference (GLOBECOM). IEEE, 2019, pp. 1–6.
[11] S. Dörner, M. Henninger, S. Cammerer, and S. t. Brink, “WGAN-based autoencoder training over-the-air,” arXiv preprint arXiv:2003.02744, 2020.
[12] A. Buchberger, C. Häger, H. D. Pfister, L. Schmalen, and A. G. i Amat, “Pruning and quantizing neural belief propagation decoders,” IEEE Journal on Selected Areas in Communications, 2020.
[13] A. Buchberger, C. Häger, H. D. Pfister, L. Schmalen et al., “Learned decimation for neural belief propagation decoders,” arXiv preprint arXiv:2011.02161, 2020.
[14] T. Raviv, N. Raviv, and Y. Be’ery, “Data-driven ensembles for deep and hard-decision hybrid decoding,” in 2020 IEEE International Symposium on Information Theory (ISIT). IEEE, 2020, pp. 321–326.
[15] T. Raviv, A. Schwartz, and Y. Be’ery, “Deep ensemble of weighted Viterbi decoders for tail-biting convolutional codes,” Entropy, vol. 23, no. 1, p. 93, 2021.
[16] F. Carpi, C. Häger, M. Martalò, R. Raheli, and H. D. Pfister, “Reinforcement learning for channel coding: Learned bit-flipping decoding,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2019, pp. 922–929.
[17] H. Kim, Y. Jiang, S. Kannan, S. Oh, and P. Viswanath, “Deepcode: Feedback codes via deep learning,” in Advances in Neural Information Processing Systems (NIPS), 2018, pp. 9436–9446.
[18] A. Ben-Yishai and O. Shayevitz, “The Gaussian channel with noisy feedback: improving reliability via interaction,” in 2015 IEEE International Symposium on Information Theory (ISIT). IEEE, 2015, pp. 2500–2504.
[19] S. Ginzach, N. Merhav, and I. Sason, “Random-coding error exponent of variable-length codes with a single-bit noiseless feedback,” in 2017 IEEE Information Theory Workshop (ITW). IEEE, 2017, pp. 584–588.
[20] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learning-based channel decoding,” in 2017 51st Annual Conference on Information Sciences and Systems (CISS). IEEE, 2017, pp. 1–6.
[21] S. Cammerer, T. Gruber, J. Hoydis, and S. ten Brink, “Scaling deep learning-based decoding of polar codes via partitioning,” in GLOBECOM 2017-2017 IEEE Global Communications Conference. IEEE, 2017, pp. 1–6.
[22] H. Kim, Y. Jiang, R. Rana, S. Kannan, S. Oh, and P. Viswanath, “Communication algorithms via deep learning,” in Sixth International Conference on Learning Representations (ICLR), 2018.
[23] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using deep learning,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2016, pp. 341–346.
[24] L. Lugosch and W. J. Gross, “Neural offset min-sum decoding,” in 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017, pp. 1361–1365.
[25] E. Nachmani and L. Wolf, “Hyper-graph-network decoders for block codes,” in Advances in Neural Information Processing Systems, 2019, pp. 2326–2336.
[26] D. Ha, A. Dai, and Q. V. Le, “Hypernetworks,” arXiv preprint arXiv:1609.09106, 2016.
[27] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be’ery, “Deep learning methods for improved decoding of linear codes,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 119–131, 2018.
[28] P. Mirowski and Y. LeCun, “Dynamic factor graphs for time series modeling,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2009, pp. 128–143.
[29] J. Goldberger and A. Leshem, “Pseudo prior belief propagation for densely connected discrete graphs,” in 2010 IEEE Information Theory Workshop on Information Theory (ITW 2010, Cairo). IEEE, 2010, pp. 1–5.
[30] V. G. Satorras and M. Welling, “Neural enhanced belief propagation on factor graphs,” arXiv preprint arXiv:2003.01998, 2020.
[31] T. Richardson and R. Urbanke, Modern coding theory. Cambridge University Press, 2008.
[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[33] R. Gallager, “Low-density parity-check codes,” IRE Transactions on Information Theory, vol. 8, no. 1, pp. 21–28, 1962.
[34] B. Shuval and I. Sason, “On the universality of LDPC code ensembles under belief propagation and ML decoding,” in 2010 IEEE 26th Convention of Electrical and Electronics Engineers in Israel. IEEE, 2010, pp. 000355–000359.
[35] E. Soljanin, N. Varnica, and P. Whiting, “Incremental redundancy hybrid ARQ with LDPC and raptor codes,” IEEE Trans. Inform. Theory, 2005.
[36] E. Arikan, “Channel polarization: A method for constructing capacity-achieving codes,” in 2008 IEEE International Symposium on Information Theory. IEEE, 2008, pp. 1173–1177.
[37] I. Tal, “On the construction of polar codes for channels with moderate input alphabet sizes,” IEEE Transactions on Information Theory, vol. 63, no. 3, pp. 1501–1509, 2017.
[38] S. Liu, Y. Hong, and E. Viterbo, “Polar codes for block fading channels,” in 2017 IEEE Wireless Communications and Networking Conference Workshops (WCNCW). IEEE, 2017, pp. 1–6.
[39] N. Goela, E. Abbe, and M. Gastpar, “Polar codes for broadcast channels,” IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 758–782, 2014.
[40] R. C. Bose and D. K. Ray-Chaudhuri, “On a class of error correcting binary group codes,” Information and Control, vol. 3, no. 1, pp. 68–79, 1960.
[41] M. Helmling, S. Scholl, F. Gensheimer, T. Dietz, K. Kraft, S. Ruzika, and N. Wehn, “Database of Channel Codes and ML Simulation Results,” www.uni-kl.de/channel-codes, 2019.
[42] N. Komodakis and G. Tziritas, “Image completion using efficient belief propagation via priority scheduling and dynamic pruning,” IEEE Transactions on Image Processing, vol. 16, no. 11, pp. 2649–2661, 2007.
[43] J. Sun, N.-N. Zheng, and H.-Y. Shum, “Stereo matching using belief propagation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 787–800, 2003.
[44] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient belief propagation for early vision,” International Journal of Computer Vision, vol. 70, no. 1, pp. 41–54, 2006.
[45] L. Yohananov and E. Yaakobi, “Codes for graph erasures,” IEEE Transactions on Information Theory, vol. 65, no. 9, pp. 5433–5453, 2019.
[46] J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Kristjansson, “Super-human multi-talker speech recognition: A graphical modeling approach,” Computer Speech & Language, vol. 24, no. 1, pp. 45–66, 2010.