
Low-Density Parity-Check Decoder

Architectures for Integrated Circuits and

Quantum Cryptography

by

Mario Milicevic

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering

University of Toronto

© Copyright 2017 by Mario Milicevic

Abstract

Low-Density Parity-Check Decoder Architectures for Integrated Circuits and Quantum Cryptography

Mario Milicevic

Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering

University of Toronto

2017

Forward error correction enables reliable one-way communication over noisy channels, by transmitting

redundant data along with the message in order to detect and resolve errors at the receiver. Low-density

parity-check (LDPC) codes achieve superior error-correction performance on Gaussian channels under

belief propagation decoding; however, their complex parity-check matrix structure introduces hardware

implementation challenges. This thesis explores how the quasi-cyclic structure of LDPC parity-check

matrices can be exploited in the design of low-power hardware architectures for multi-Gigabit/second

decoders realized in CMOS technology, as well as in the design and construction of multi-edge LDPC

codes for long-distance (beyond 100km) quantum cryptography over optical fiber.
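
As a minimal sketch of the quasi-cyclic structure referred to above, the following expands a small base matrix of cyclic shift indices into a binary parity-check matrix built from (q × q) cyclically-shifted identity matrices and all-zero matrices. The base matrix, the use of -1 to mark an all-zero block, and the right-shift convention are assumptions chosen for illustration; they are not taken from the IEEE 802.11ad matrices or from the codes designed in this thesis.

```python
def circulant(q, shift):
    """Return a q x q identity matrix cyclically shifted right by `shift` columns.

    By assumption, a negative shift denotes the all-zero block.
    """
    if shift < 0:
        return [[0] * q for _ in range(q)]
    return [[1 if col == (row + shift) % q else 0 for col in range(q)]
            for row in range(q)]


def expand_qc(base, q):
    """Expand a base matrix of shift indices into a binary parity-check matrix."""
    H = []
    for base_row in base:
        blocks = [circulant(q, s) for s in base_row]
        for r in range(q):
            # Concatenate row r of every block in this block row.
            H.append([bit for block in blocks for bit in block[r]])
    return H


# Hypothetical 2 x 3 base matrix with q = 5 (illustrative only).
BASE = [[0, 2, -1],
        [3, -1, 1]]
H = expand_qc(BASE, q=5)
```

Because every non-zero block is a shifted identity, each block is fully described by a single shift index, which is the property a decoder can exploit to replace general routing with fixed cyclic permutations.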

A frame-interleaved architecture is presented with a path-unrolled message-passing schedule to reduce

the complexity of routing interconnect in an integrated circuit decoder implementation. A proof-of-

concept silicon test chip was fabricated in the 28nm CMOS technology node. The LDPC decoder chip

supports the four codes presented in the IEEE 802.11ad standard, occupies an area of 3.41mm², and

achieves an energy efficiency of 15pJ/bit while delivering a maximum throughput of 6.78Gb/s, and

operating with a 202MHz clock at 0.9V supply. The test chip achieves the highest normalized energy

efficiency among published CMOS-based decoders for the IEEE 802.11ad standard.

A quasi-cyclic code construction technique is applied to a multi-edge LDPC code with block length of

10⁶ bits in order to reduce the latency of LDPC decoding in the key reconciliation step of long-distance

quantum key distribution. The GPU-based decoder achieves a maximum information throughput of

7.16Kb/s, and extends the current maximum transmission distance from 100km to 160km with a secret

key rate of 4.10 × 10⁻⁷ bits/pulse under 8-dimensional reconciliation. The GPU-based decoder delivers

up to 8.03× higher decoded information throughput over the upper bound on secret key rate for a

lossy optical channel, thus demonstrating that key reconciliation with LDPC codes is no longer a post-

processing bottleneck in quantum key distribution.
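
For orientation, the upper bound on secret key rate for a lossy channel mentioned above can be evaluated numerically. The sketch below uses the definition K_lim = −log2(1 − T) given in the caption of Fig. A.2 and the transmission loss α = 0.2 dB/km from the List of Symbols. The relation T = 10^(−αℓ/10) between transmittance and fiber distance ℓ is an assumed standard loss model, and the function names are hypothetical. The result is in bits/pulse; comparing it with a decoder throughput in bits/s additionally requires the pulse rate of the source, which is not specified here.

```python
import math

ALPHA_DB_PER_KM = 0.2  # transmission loss for single-mode fiber (List of Symbols)


def transmittance(distance_km, alpha_db_per_km=ALPHA_DB_PER_KM):
    """Channel transmittance T for a fiber of the given length (assumed loss model)."""
    return 10.0 ** (-alpha_db_per_km * distance_km / 10.0)


def k_lim(distance_km):
    """Upper bound on secret key rate, K_lim = -log2(1 - T), in bits/pulse."""
    t = transmittance(distance_km)
    # log1p keeps precision when T is very small at long distances.
    return -math.log1p(-t) / math.log(2.0)


if __name__ == "__main__":
    for distance in (100, 160):
        print(f"{distance} km: T = {transmittance(distance):.3e}, "
              f"K_lim = {k_lim(distance):.3e} bits/pulse")
```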

The contributions presented in this thesis can be applied to future research in the implementation of

silicon-based linear-program decoders for high-reliability channels, and single-chip solutions for quantum

key distribution containing integrated photonics and post-processing algorithms.


Dedicated in loving memory to my grandparents.


Acknowledgements

I would like to thank my family, friends, professors, and colleagues for their tremendous support during

my pursuit of a Ph.D. degree. Thanks to you, I have not been on this journey alone. I have had the

opportunity to explore new ideas and contribute to the state-of-the-art. Such opportunities are few and

far between. Looking back, I am happy I committed the time to do it, and would do it again in a

heartbeat.

I extend my sincerest gratitude to Professor Glenn Gulak for originally taking me on as a Masters

student, and encouraging me to pursue a Ph.D. degree. Your guidance and attention to detail contributed

tremendously to the direction and quality of my Ph.D. research. Thank you for opening the doors to so

many great opportunities, and for giving me the time to pursue long periods of “studio time” to focus

on my ideas and writing.

I would like to thank Professors Jason Anderson, Stark Draper, and Frank Kschischang from the

University of Toronto, and Professor Zhengya Zhang from the University of Michigan Ann Arbor for

serving on my thesis examination committee. Your insights and thoughtful questions have helped bring

clarity and rigour to this thesis.

Some of the best learning experiences during my Ph.D. have been through my collaborative research

on LDPC codes for QKD with Chen Feng and Lei Zhang, as well as hardware-based implementations of

ADMM-LP decoders with Mitch Wasson and Professor Stark Draper. It has been an absolute pleasure

to work with you. I would also like to thank Christian Weedbrook and Xingxing Xing for introducing

me to QKD and for your technical guidance.

I am grateful to the many faculty members and professional staff that I have had the pleasure of

knowing and working with since I started my undergraduate studies in 2006 in the Department of

Electrical and Computer Engineering at the University of Toronto. You have all contributed positively

to my experiences at the university. In particular, I wish to thank Professors Aleksandar Prodic, Ali

Sheikholeslami, Bruce Francis, David Johns, Khoman Phang, Micah Stickel, Paul Chow, Roman Genov,

Sorin Voinigescu, and Tony Chan Carusone. I also wish to thank Jennifer Rodrigues, Darlene Gorzo, and

Jayne Leake for their administrative assistance with my graduate studies and teaching assistantships.

Last, I wish to acknowledge Jeetendar Narsinghani for his guidance with ATE SoC testing.

I would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC)

and MaxLinear Inc., California, USA for supporting this research. At MaxLinear, I would like to thank

Curtis Ling and Tim Gallagher for their interest and continuous support in fabricating my proposed

LDPC decoder architecture in a state-of-the-art CMOS technology. I am sincerely grateful to Stephane

Laurent-Michel for taking me on as an intern in his communication systems team and for championing

my chip tapeout. To Isai Miranda, my chip tapeout would not have been possible without your

ongoing support in resolving DRC violations and additional metal-mask fixes. To Prasanna, thank you

for pushing me through the physical back-end design of my chip, and to Jack and Jilian for helping me

resolve tool-related issues so that I could simulate and synthesize my design. To Nitza, Kris, and Preeti,

thank you for providing me with support and test time on the ATE, and to Jay and Henry for your

help with packaging and wafer production. Finally, I would like to thank Didier Roland from Mentor

Graphics for helping me trailblaze the place-and-route tool flow for the physical back-end design of my

LDPC decoder chip. I would also like to thank Isai, Juan, Preeti, Paul, Michael, Akan, Srikar, Miles,

Thao, and Naman for your friendship during my internships at MaxLinear.


To my graduate student colleagues in BA5000, we have shared some fun and gruelling times together.

From long nights in the lab, coffee re-fill runs to Starbucks, and nights out at Sin and Redemption and

Prenup Pub, I have enjoyed working and socializing with all of you during our time together. Thank

you Alain, Alhassan, Alireza, Andy, Aynaz, Colin, Cliff, Dustin, Dawei, Farhad, Hemesh, Jeff, Joshua,

Kevin, Luke, Michal, Meysam, Nadeesha, Nasim, Neno, Ravi, Rocky, Rosanah, Sadegh, Safeen, Shayan,

Victor, Yue, and Zeynep. Grad school would not have been as enjoyable without my 7:00 AM morning

workouts at Hart House; thank you, Tarik, for being an awesome and punctual gym partner.

One of my favourite and most enjoyable experiences in grad school was leading the Saratoga student

volunteer team at the IEEE International Solid-State Circuits Conference each year in San Francisco. I

am very grateful to Laura Fujino and K.C. Smith for believing in Andrew Shorten and myself to lead

the team and ensure that the conference runs smoothly. Andrew, we had some massive fires at that

show, but we fooled them every time, dude. We had the time of our lives waking up at 5:00 AM and

going to sleep at 3:00 AM for a week straight every year. I wouldn’t have wanted it any other way.

To my many friends who partook in our adventures in San Francisco, thank you for being a part of

the team, namely: Bert, Dave, Danial, Daniel, Gairik, Gerard, Guy, Jasmina, Javid, Jin-Hee, Jingshu,

Jingxuan, Joy, Junmin, Ivan, Mike, Navid, Paul, Oleksey, Robert B., Robert H., Saba, Samira, Simon,

Stefan, Victor, Vince, Wahid, Weijia, Yingying, and Xander, and to the entourage: Alex, David, Karim,

Ricardo, Shahriar, Saman, and Trevor. Finally, to the two guys that helped us all keep our cool, a big

thanks to Mark and Snoopy from the production team.

Outside of the university, I would like to thank my many friends at Northern Karate Schools and

the National Yacht Club in Toronto for your endearing support of my academic pursuits, and for being

there to take my mind off work. I would also like to thank my volunteer colleagues and staff from the

IEEE for allowing me to pursue engaging leadership opportunities within a global community. To my

skiing posse, Peter, Dino, Moritz, Nora, Ozren, Taylor, and Chris, thanks for the epic powder sessions.

To my Toronto crew, Amir, Dorijan, Nikita, Nikola, Vasily, and Victor, it has always been good times.

Finally, and most importantly, I wish to extend my deepest gratitude and love to my family. To my

sister, Dana, thank you for always being there for me; your tasty baked treats have helped fuel a lot

of my work. To my girlfriend, Katrina, your unwavering love and commitment to seeing me finish this

thing has always given me the drive to work hard and find the answers; I will forever cherish our trips

to Pizza Libretto and Bellwoods Brewery in Toronto, and our many adventures around the world. To

my mom and dad, words cannot express how thankful I am for all the opportunities you have given me,

for raising my sister and me in Canada, for your countless sacrifices, and for always having warm

home-cooked meals whenever I came home. Most importantly, thank you for believing in me; this thesis

is for you.


Contents

List of Tables viii

List of Figures xii

List of Acronyms xv

List of Symbols xviii

1 Introduction 1

1.1 LDPC Decoders in Integrated Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 LDPC Decoding for Quantum Key Distribution . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Thesis Roadmap: Investigation of Two Application Areas . . . . . . . . . . . . . . . . . . 7

2 Background 12

2.1 Forward Error Correction with LDPC Codes . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2 LDPC Codes: A Class of Linear Block Codes . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 LDPC Decoding: Belief Propagation Algorithms . . . . . . . . . . . . . . . . . . . . . . . 13

2.4 Quasi-Cyclic LDPC Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.5 Silicon Integrated Circuits for LDPC Decoding . . . . . . . . . . . . . . . . . . . . . . . . 16

2.5.1 LDPC Decoder Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.5.2 Flooding and Layered Decoding Schedules . . . . . . . . . . . . . . . . . . . . . . . 17

2.5.3 Design Challenges: Message Permutation, Memory, and Power . . . . . . . . . . . 17

2.5.4 Explicit Check and Variable Node Processing Units . . . . . . . . . . . . . . . . . 18

2.6 LDPC Decoding in Quantum Key Distribution . . . . . . . . . . . . . . . . . . . . . . . . 19

2.6.1 Quantum Transmission and Sifting . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.6.2 Reconciliation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.6.3 Privacy Amplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.6.4 Maximizing Secret Key Rate with Collective Attacks . . . . . . . . . . . . . . . . . 26

2.6.5 Upper Bound on Secret Key Rate for a Lossy Channel . . . . . . . . . . . . . . . . 27

2.6.6 Frame Error Rate for Reverse Reconciliation . . . . . . . . . . . . . . . . . . . . . 27

2.6.7 Impact of Reconciliation Error and Efficiency on Secret Key Rate . . . . . . . . . 29

2.6.8 Secret Key Rate with Finite-Size Effects . . . . . . . . . . . . . . . . . . . . . . . . 30

2.7 Multi-Edge LDPC Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.7.1 General Design and Construction of LDPC Codes . . . . . . . . . . . . . . . . . . 30

2.7.2 Multi-Edge Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3 LDPC Decoder Architecture with Path-Unrolled Message Passing 33

3.1 Proposed LDPC Decoder Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.1.1 Hardware Mapping with Path-Unrolled Decoding Schedule . . . . . . . . . . . . . 34

3.1.2 Time-Distributed Piecewise Min-Sum Computation . . . . . . . . . . . . . . . . . . 36

3.1.3 Parity-Check Matrix Partitioning and Hardware Mapping . . . . . . . . . . . . . . 38

3.1.4 Column Slice Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.1.5 Pipelined Frame Interleaving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.1.6 Input/Output Frame Buffering for Continuous Decoding . . . . . . . . . . . . . . . 44

3.1.7 Combined CN+VN Processing Unit Architecture . . . . . . . . . . . . . . . . . . . 45

3.1.8 Early Termination with Coarse-Grained Clock Gating . . . . . . . . . . . . . . . . 48

3.1.9 Extendibility to Layered Decoding Schedule . . . . . . . . . . . . . . . . . . . . . . 51

3.2 Physical Silicon Chip Implementation and Results . . . . . . . . . . . . . . . . . . . . . . 52

3.2.1 Error-Correction Decoding Performance . . . . . . . . . . . . . . . . . . . . . . . . 52

3.2.2 Post-Silicon Power Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.2.3 Comparison with the State-of-the-Art . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4 Quasi-Cyclic Multi-Edge LDPC Codes for Quantum Cryptography 60

4.1 Construction of Quasi-Cyclic Multi-Edge LDPC Codes . . . . . . . . . . . . . . . . . . . . 61

4.2 Error-Correction Performance of Multi-Edge QC Codes . . . . . . . . . . . . . . . . . . . 62

4.3 Finite Secret Key Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.4 GPU-Accelerated LDPC Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.4.1 GPU-Based LDPC Decoder Implementation . . . . . . . . . . . . . . . . . . . . . . 69

4.4.2 Information Throughput Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.4.3 Comparison to Other CV-QKD Implementations . . . . . . . . . . . . . . . . . . . 76

4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5 Conclusion and Future Directions 80

5.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.2.1 Extendibility to Non-Quasi-Cyclic and Spatially-Coupled LDPC Codes . . . . . . 81

5.2.2 Linear-Program Decoding for High-SNR Channels . . . . . . . . . . . . . . . . . . 82

5.2.3 Decoder Architectures for Near-Threshold Voltage FinFET Operation . . . . . . . 82

5.2.4 Decoder Architectures for 3-Dimensional Integrated Circuits . . . . . . . . . . . . 83

A Supplementary Background on QKD 84

B Development, Simulation, and Testing Framework 88

C Supplementary QKD Results: Bit Error Rate and Finite Secret Key Rate 90

D LDPC Decoding for Reverse Reconciliation in CV- and DV-QKD 95

References 98


List of Tables

1.1 Comparison of LDPC decoder performance requirements for an IEEE 802.11ad wireless

SoC IP core vs. long-distance CV-QKD key reconciliation block . . . . . . . . . . . . . . . 10

3.1 Piecewise time-distributed reformulation of Min-Sum algorithm with flooding schedule for

single layer routing path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2 Column-slice messages highlighted in Fig. 3.5 . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.3 Memory specification in each column slice . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.4 Decoder performance at target BER = 10⁻⁶ with early termination (including idle cycles) 55

3.5 Percentage breakdown of post-silicon area and estimated power by decoder module at

target BER = 10⁻⁶ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.6 Comparison of LDPC decoder implementations for the IEEE 802.11ad standard . . . . . . 57

4.1 Designed rate 0.02 multi-edge LDPC codes . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.2 GPU-based LDPC decoding latency and error-correction performance for rate 0.02 multi-

edge codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.3 Overview of secret key rate and GPU throughput at maximum reconciliation distance

with rate 0.02 multi-edge codes and N_privacy = 10¹² bits . . . . . . . . . . . . . . . . . . . 74

4.4 GPU LDPC decoding comparison at SNR = 0.161 with d = 8 on BIAWGNC targeting

FER = 0.04 with rate 1/10 codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77


List of Figures

1.1 Wireless SoC showing LDPC decoder IP block within the physical layer baseband, and

auxiliary circuits and systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Throughput vs. power comparison of silicon-based LDPC decoders for multiple standards

with different throughput, latency, and error-correction performance requirements. The

plot legend indicates the decoding standard and block length for each implementation.

Annotated values in parentheses indicate the CMOS technology node, clock frequency,

and decoder core area for each implementation [19,25–36]. . . . . . . . . . . . . . . . . . . 4

1.3 Information transmission over untrusted quantum channel and authenticated public

channel between Alice and Bob for CV- and DV-QKD, with eavesdropper Eve. . . . . . . . . . 6

1.4 Throughput vs. distance of GPU-based LDPC decoders for CV- and DV-QKD. The

reported throughput is the raw GPU throughput without code- or error-rate scaling. For

CV-QKD implementations [12, 60, 64], the annotated values in parentheses indicate the

LDPC code block length n, the code rate R, the reconciliation efficiency β, and SNR

of the quantum channel. For DV-QKD implementations, the annotated values indicate

the block length n, code rate R, and QBER [48,65,66]. By convention in QKD, the SNR

is reported in linear units. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.5 FER vs. SNR for IEEE 802.11ad Wireless SoC and CV-QKD applications. . . . . . . . . 9

1.6 Data throughput with FER and code-rate scaling vs. block length for IEEE 802.11ad

wireless SoC and CV-QKD applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1 Simplified model of a BIAWGNC with LDPC encoding and decoding. . . . . . . . . . . . 12

2.2 Tanner graph and corresponding binary parity-check matrix with block length of n = 6

bits, k = 3 information bits, n = 6 variable nodes, and (n − k) = 3 check nodes. . . . . . . 13

2.3 Sample quasi-cyclic binary parity-check matrix for q = 5 constructed from uniformly-sized

(q × q), cyclically-shifted identity matrices and all-zero matrices. . . . . . . . . . . . . . . 16

2.4 Examples of three LDPC decoder architectures showing message-passing networks, and

configuration of CN and VN processing units. While exceptions exist, typically, fully-

parallel architectures instantiate n VNs and (n − k) CNs, partially-parallel architectures

instantiate a factor of q VNs and CNs, and serial architectures instantiate only 1 VN and

1 CN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.5 Explicitly defined processing units VN1, VN2, and VN4 connected to processing unit CN1

based on the Tanner graph in Fig. 2.2, with L_vc and m_cv messages indicated for decoding

iterations i and i + 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


2.6 CV-QKD model for secret key distillation with reverse reconciliation between Alice and

Bob over a private quantum channel and public classical channel. . . . . . . . . . . . . . . 21

2.7 Possible decoding scenarios and error detection techniques. . . . . . . . . . . . . . . . . . 28

2.8 LDPC frame components: message, CRC, and parity bits. . . . . . . . . . . . . . . . . . . 28

3.1 Simplified example of the proposed LDPC decoder architecture, based on: (a) sample

two-layer QC parity-check matrix and (b) Tanner graph with one decoding path

highlighted. The proposed layer routing patterns and arrangement of combined CN+VN

processing units in the systolic array architecture are shown in (c). The closed path given

by VN0−CN2−VN4−CN2−VN8−CN2−VN0 in (b) is unrolled in (c) such that CN2 is

absorbed into its connected VNs, resulting in the following unrolled path: VN0−VN4−VN8−VN0. 35

3.2 (a) The closed path through CN2 in the Tanner graph for one pass (phase) of decoding. (b)

The unrolled piecewise messages that are passed between combined CN+VN processing

units in successive columns of the architecture corresponding to the closed path highlighted

in (a). Here, t = 0 arbitrarily corresponds to the third column of T = 3 total columns. . . 38

3.3 IEEE 802.11ad QC parity-check matrices with hardware mapping for proposed

architecture [23]. The sub-matrix value indicates the cyclic permutation index. The four matrices

are derived from a single 8-layer base matrix by removing layers in higher-rate matrices,

or by removing cyclically-shifted submatrices in lower-rate matrices. . . . . . . . . . . . . 39

3.4 System block diagram for proposed architecture showing the global control unit, and the

datapath containing: column slices with combined CN+VN processing units and

memories, a hard-wired cyclic permutation network between each column slice, and pipeline

registers between column-slice pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.5 Column slice t comprised of CN+VN processing units, local memory, and wired

permutation networks between adjacent column slices. Pipeline registers are connected only to the

first column in a column-slice pair. Hard-wired interconnect does not contain multiplexing

logic. Hard-wired connections are specified by the parity-check matrix connectivity. The

operations in column slice t are computed in one clock cycle. . . . . . . . . . . . . . . . . 40

3.6 Pipelined frame interleaving pattern through column slices in the proposed architecture

over 16 clock cycles of one complete LDPC decoding iteration for IEEE 802.11ad. The

number in each bubble indicates the frame index. Frame 4 highlights the cyclic frame-

shifting property of the architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.7 Input/output frame buffering schedule, assuming a uniform decoding latency of 10

iterations with 16 clock cycles per iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.8 Combined CN+VN processing unit for time-distributed piecewise decoding, showing CN-

and VN-update phase logic, memory interfaces, and data permutation logic between

processing units in successive column slices. . . . . . . . . . . . . . . . . . . . . . . . . . . 46


3.9 Processing unit timing diagram for CN- and VN-update phases showing 3 independent

frame updates over 3 clock cycles. Each CN+VN processing unit updates a single frame

j in each clock cycle. All arrow-highlighted operations occur in each clock cycle, and

independently for each frame j, j ∈ {0, 1, . . . , 7}. Circled nodes B, C, D, E, and F

correspond to the connections shown in the column slice architecture in Fig. 3.5. The

following operations are highlighted. (a) Sign s_c(t), first minimum magnitude min1_c(t),

and second minimum magnitude min2_c(t) updates through column slice pair. (b) Parity

p_c(t) updates through column slice pair. (c) Independent L_vc and C_v updates in columns

t and t + 1. (d) Propagation of sign, first minimum magnitude, and second minimum

magnitude messages to next column-slice pair without updates in columns t and t + 1. 47

3.10 Probability distribution of decoding iterations for the four code rates of the IEEE 802.11ad

standard at FER of 10⁻², 10⁻³, and 10⁻⁴. . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.11 Multi-frame decoding with: (a) no early termination, (b) early termination with idle

cycles (discontinuous decoding), and (c) early termination without idle cycles (continuous

decoding). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.12 Sample frame termination pattern in frame-interleaved architecture. One iteration is

performed over 16 clock cycles. One clock cycle is required to update a frame in a column-

slice pair. Frames that have terminated are not updated in their current column-slice

pair. Column slices in which the current frame has terminated are disabled through

coarse-grained clock gating in each cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.13 Die micrograph with wirebonds shown in exposed package. . . . . . . . . . . . . . . . . . 52

3.14 FER and BER vs. SNR under Min-Sum decoding for all four IEEE 802.11ad codes on

BIAWGNC with maximum 10 decoding iterations. The channel SNR is normalized to

energy-per-bit as given by Eq. 1.2. Channel input LLRs are quantized to 5 bits for both

fixed-point and floating-point simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.15 Shmoo plots of measured chip showing functional test pass (P) and fail (F) results. . . . . 54

3.16 Measured power at nominal 0.9V supply and 202MHz clock rate, with and without early

termination, at five SNR Eb/N0 operating points for all four code rates. . . . . . . . . . . 54

3.17 Measured power at reduced core and memory voltage with clock-frequency scaling, for

the same operating points as in Fig. 3.16. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.1 Structure of designed parity-check matrices. . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.2 FER vs. SNR for Sum-Product decoding with d = 1 and d = 2 dimensional reconciliation

on BIAWGNC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.3 FER vs. SNR for Sum-Product decoding with d = 4 and d = 8 dimensional reconciliation

on BIAWGNC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.4 Probability of invalid decoding error vs. SNR for Sum-Product decoding with d = 1, 2, 4, 8

dimensional reconciliation on BIAWGNC. Probability of error is computed for invalid

messages that are correctly decoded but CRC fails. . . . . . . . . . . . . . . . . . . . . . . 65

4.5 FER vs. reconciliation efficiency for Sum-Product decoding with d = 1 and d = 8

dimensional reconciliation on BIAWGNC. FER values are derived from the FER vs. SNR

curves based on Eq. 2.15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.6 d = 1 dimensional reconciliation with N_privacy = 10¹² bits: finite secret key rate K_finite

vs. distance for collective attacks on BIAWGNC with Sum-Product decoding. . . . . . . . 67


4.7 d = 8 dimensional reconciliation with N_privacy = 10¹² bits: finite secret key rate K_finite


4.8 GPU implementation of LDPC decoder showing four multi-threaded compute kernels

and data flow from top to bottom for one decoding iteration. Coalesced memory access

patterns and message variables are indicated. Thread i is denoted by ti, where T in

kernels 1 and 3 represents the maximum number of connections between all CNs and

VNs, (n − k) in Kernel 2 is the number of CNs, and n in Kernel 4 is the number of VNs.

Early termination is not shown. All memory blocks shown in the figure are in Global

GPU Memory. The threads in each kernel use Shared GPU Memory to store intermediate

values during the execution of the kernel. . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.9 Measured information throughput K′_GPU vs. reconciliation efficiency for d = 1 and d = 8

dimensional reconciliation. Each measurement point corresponds to a particular SNR

operating point with a measured FER presented in Fig. 4.5. . . . . . . . . . . . . . . . . . 75

4.10 GPU information throughput K′_GPU of the q = 21 QC-LDPC code with d = 8 dimensional

reconciliation up to the maximum distance point for β ∈ {0.80, 0.89, 0.92, 0.95, 0.98, 0.99},

and upper bound on secret key rate for lossy channel K′_lim vs. distance. . . . . . . . . . . 75

A.1 Optimal V_A vs. transmission distance for maximum theoretical secret key rate, from

β = 0.8 to β = 0.99, based on the assumed physical operating parameters of the quantum

channel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

A.2 Maximum theoretical secret key rates vs. transmission distance. The maximum CV-

QKD key rate is defined by K_opt from β = 0.8 to β = 0.99 based on the optimal V_A. The

fundamental limit for a lossy channel is defined by K_lim = −log₂(1 − T). . . . . . . . . . . 87

B.1 Development, simulation, and testing framework. . . . . . . . . . . . . . . . . . . . . . . . 89

C.1 BER vs. SNR for Sum-Product decoding with d = 1 and d = 2 dimensional reconciliation

on BIAWGNC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

C.2 BER vs. SNR for Sum-Product decoding with d = 4 and d = 8 dimensional reconciliation

on BIAWGNC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

C.3 d = 1 dimensional reconciliation with N_privacy = 10¹⁰ bits: finite secret key rate K_finite












D.1 LDPC encoding and decoding system with BIAWGNC model. . . . . . . . . . . . . . . . . 96

D.2 LDPC encoding and decoding system with BSC model. . . . . . . . . . . . . . . . . . . . 97


List of Acronyms

ADC Analog-to-Digital Converter

ADMM Alternating Direction Method of Multipliers

ASIC Application-Specific Integrated Circuit

ASIP Application-Specific Instruction Set Processor

ATE Automated Test Equipment

AWGN Additive White Gaussian Noise

BER Bit Error Rate

BIAWGNC Binary-Input Additive White Gaussian Noise Channel

BP Belief Propagation

BSC Binary Symmetric Channel

CMOS Complementary Metal-Oxide-Semiconductor

CN Check Node

CRC Cyclic Redundancy Check

CV-QKD Continuous-Variable Quantum Key Distribution

DAC Digital-to-Analog Converter

DRAM Dynamic Random Access Memory

DV-QKD Discrete-Variable Quantum Key Distribution

DVB-S2X Digital Video Broadcasting Second Generation Satellite Extensions

eDRAM Embedded Dynamic Random Access Memory

eSRAM Embedded Static Random Access Memory

EDA Electronic Design Automation

ETSI European Telecommunications Standards Institute

FD-SOI Fully-Depleted Silicon-On-Insulator


FEC Forward Error Correction

FER Frame Error Rate

FIFO First-In First-Out (Memory)

FLOPS Floating Point Operations Per Second

FPGA Field-Programmable Gate Array

GG02 Grosshans-Grangier 2002 Protocol

GPU Graphics Processing Unit

HDL Hardware Description Language

IEEE Institute of Electrical and Electronics Engineers

I/O Input/Output

IP Intellectual Property

LDPC Low-Density Parity-Check

LLR Log-Likelihood Ratio

LNA Low-Noise Amplifier

LP Linear Program

MCS Modulation and Coding Scheme

MDI-QKD Measurement-Device-Independent Quantum Key Distribution

NTV Near-Threshold Voltage

PA Power Amplifier

PVT Process, Voltage, and Temperature

QC Quasi-Cyclic

QBER Quantum Bit Error Rate

QKD Quantum Key Distribution

QUESS Quantum Experiments at Space Scale

RA Repeat-Accumulate

RTL Register Transfer Level

SIMT Single-Instruction Multiple-Thread

SM Sign-Magnitude Number Format

SNR Signal-to-Noise Ratio

SoC System-on-Chip

SRAM Static Random Access Memory

TSV Through-Silicon Via

VN Variable Node

WiGig Wireless Gigabit Alliance

Wi-Fi Wi-Fi Alliance

WPAN Wireless Personal Area Network

List of Symbols

α Transmission loss (assumed to be 0.2dB/km for a single-mode optical fiber)

β Reconciliation efficiency

χBE Holevo bound on the information leaked to Eve

ℓ Optical fiber distance in kilometers

ε Excess channel noise expressed in shot noise units

η Homodyne detector efficiency

ℂ The set of complex numbers

ℍ The set of quaternions

𝕆 The set of octonions

ℝ The set of real numbers

Ĉ Decoded codeword estimate of length n bits

Ŝ Decoded message estimate of length k bits

C LDPC-encoded codeword of length n bits

H Binary parity-check matrix

M Bob’s classical message to Alice

N d-dimensional noise vector

R Received soft-decision vector of length n

S Binary message vector of length k bits

U d-dimensional vector comprised of (−1)^{C_i} components

X Alice’s correlated Gaussian sequence

Y Bob’s correlated Gaussian sequence

Z Gaussian noise distribution on the quantum channel

G A Tanner graph

Ω(x) Normalized variable node degree distribution

Ψ(x) Normalized check node degree distribution

σ² Channel noise variance

C(s) Shannon channel capacity for signal-to-noise ratio of s

d Reconciliation dimension

Eb/N0 Signal-to-noise ratio per bit

frep Light source pulse repetition rate

I(X;Y ) Mutual information between correlated sequences X and Y

I_i Identity matrix cyclically shifted to the right by i − 1

IAB Mutual information between Alice and Bob

k LDPC information length (bits)

Kfinite Finite secret key rate (bits/pulse)

K′finite Operating secret key rate (bits/second)

K′GPU GPU information throughput (bits/second)

K^raw_GPU Raw GPU throughput (bits/second)

Klim Upper bound on secret key rate for a lossy channel (bits/pulse)

K′lim Upper bound on secret key rate for a lossy channel with a source repetition rate frep (bits/second)

Kopt Maximum theoretical secret key rate for a CV-QKD system with one-way reverse reconciliation

Keff Effective secret key rate (bits/pulse)

Lvc Message from variable node v to check node c

Lv Updated log-likelihood ratio at variable node v

M(v) The set of all check nodes connected to variable node v

mcv Message from check node c to variable node v

min1c First minimum magnitude at check node c

min2c Second minimum magnitude at check node c

n LDPC block length (bits)

N(c) The set of all variable nodes connected to check node c

Nprivacy Block length for privacy amplification

Nquantum Number of symbols sent from Alice to Bob during quantum transmission

Pdetected error Probability of detected frame error

Pe Probability of frame error

Pundetected error Probability of undetected frame error

pc Exclusive-OR parity result at check node c

q Cyclic expansion factor of a quasi-cyclic parity check matrix

Qv Received channel log-likelihood ratio at variable node v

Rcode Code rate

s SNR of the quantum channel

sc Sign result at check node c

T Transmittance of an optical fiber quantum channel

Vel Electronic noise in shot noise units

VA Alice’s modulation variance

Chapter 1

Introduction

Error-correcting codes enable the reliable delivery of messages across unreliable channels in modern digi-

tal communication systems. The application of error-correcting codes in modern communication systems

enables today’s multi-Gb/s data rates, while also paving the way for new technologies to revolutionize

public network infrastructure. Richard Hamming first introduced error-correction codes in 1950 in order

to increase the rate of reliable communication in noisy channels [1], and as a means of approaching the

channel capacity limit defined by Claude Shannon in 1948 [2]. Over the past 70 years, error-correction

coding has been a rich area of research among information and coding theorists, while today, the en-

coding and decoding procedures present hardware implementation challenges for circuit designers due

to the limited power budgets available in modern systems-on-chip.

Error-correction coding is often referred to as forward error correction (FEC), where the sender uses

a code to encode data prior to transmission, such that the receiver can then reconstruct the original

data without having to request a repeat transmission of the data when an error is detected. FEC

enables low-latency data transmission across a multitude of noisy channels with application in mobile

networks, data storage, satellite and deep space communications, and the Internet. Some examples of

error-correcting codes include convolutional codes, Hamming codes, low-density parity-check (LDPC)

codes, polar codes, Raptor codes, Reed-Solomon codes, and turbo codes [3]. This thesis focuses on the

hardware-based decoding of LDPC codes in two application areas: (1) integrated circuit architectures for

silicon-based system-on-chip implementations, and (2) secret key reconciliation in long-distance quantum

cryptography.

LDPC codes have been widely adopted over the past 15 years for FEC in wireless, wireline, optical,

and non-volatile memory systems due to their near-Shannon limit error-correction performance and ab-

sence of patent licensing fees [4–6]. First introduced by Robert Gallager in 1962, LDPC codes were mostly

ignored up until the early 2000s due to their computationally-complex decoding nature [7,8]. Their recent

widespread adoption in modern standards such as IEEE 802.11ac (Wi-Fi) was predominantly enabled

through the introduction of hardware-friendly variants of the belief propagation (BP) decoding algo-

rithm [9], as well as increased research in the design of hardware-oriented codes and integrated circuit

decoder architectures capable of delivering multi-Gb/s information throughput at low power. Over the

past 10 years, LDPC codes have also been applied to secret key reconciliation in quantum key distribution

(QKD) systems in order to extend the secure distance and increase the speed of unconditionally-secure

communication between two remote parties, better known as Alice and Bob [10–12]. Only a limited

number of hardware-based LDPC decoders have been realized to date for QKD applications due to the

complexity of designing high-performance codes and high-speed decoders for the quantum channel.

The general motivation of this thesis is to investigate low-power LDPC decoder architectures for

integrated circuits in CMOS (complementary metal-oxide-semiconductor) technology, as well as decoding

acceleration techniques for key reconciliation in quantum cryptography. CMOS-based decoders target

low-power applications with multi-Gb/s throughput requirements. Such decoders typically perform 10 to 20

decoding iterations using LDPC codes with block lengths on the order of 10² to 10⁴ bits. This is

in contrast to the LDPC codes used for long-distance QKD, where block lengths on the order of 10⁶ bits

are required. Such decoders perform hundreds of decoding iterations and are not constrained by power,

but rather by the maximum decoded bit rate, which is limited by the number of non-zero elements in the

parity-check matrix. This thesis explores techniques for exploiting the intrinsic structure of parity-check

matrices that define the LDPC codes in each of the two application areas, in order to maximize scalability

and minimize integration complexity. This chapter first introduces the key challenges in implementing

high-speed LDPC decoders in both integrated circuits and quantum crypto-systems, and then presents

a roadmap for the remainder of the thesis.

1.1 LDPC Decoders in Integrated Circuits

CMOS technology miniaturization has played a prominent role in improving the energy efficiency of

silicon-based LDPC decoders, thanks to low-leakage devices and increasing transistor densities [13, 14].

However, the scalability of LDPC decoders remains primarily limited by the complexity of message-

passing interconnect between on-chip memory and parallel processing units [15,16]. The recent stagna-

tion of wired interconnect scaling beyond the 45nm CMOS node has introduced additional low-power

design challenges for multi-Gb/s decoders, as unstructured routing now largely dominates overall power

consumption due to longer interconnect delay [17,18].

System-on-chip (SoC) integration of LDPC decoder intellectual property (IP) is another key design

constraint. Most baseband receiver SoCs typically operate at clock frequencies around 200MHz to meet

timing constraints across process, voltage, and temperature (PVT) corners [19]. Since an LDPC decoder

must ultimately integrate within a larger SoC with a system-level bit error rate (BER) target, the clock

frequency should realistically be constrained to be around 200MHz, and the number of decoding itera-

tions should not be reduced beyond a threshold where error-correction performance starts to degrade.

However, several published pre-silicon and post-silicon implementations achieve multi-Gb/s throughput

with clock rates beyond 400MHz and/or with a reduced number of decoding iterations [20–22]. For

optimal cost, reliability, and testability, the decoder IP should also be implemented using standard

CMOS technology. Under these conditions, achieving multi-Gb/s decoding throughput is a challenge

using traditional architectural and scheduling techniques, especially for short block-length codes like

those defined in the IEEE 802.11ad (WiGig) and IEEE 802.15.3c (WPAN) standards [23,24]. Figure 1.1

presents a block diagram of a wireless SoC with the LDPC decoding block highlighted in the physical

layer baseband, while Fig. 1.2 visualizes the state of the art of silicon-based LDPC decoders for modern

communication standards in terms of power and decoded data throughput over multiple standards with

different LDPC code block lengths and performance requirements [19,25–36]. Figure 1.2 also illustrates

that the design space for integrated circuit implementation of LDPC decoders is multi-dimensional,

where in addition to power and throughput, constraints also include the CMOS technology node, core


area, and clock frequency.

The motivation of this thesis is to address the high power consumption of unstructured interconnect

in multi-Gb/s LDPC decoders by exploiting the structure of LDPC parity-check matrices to reduce

wiring complexity through new architectural techniques that scale to sub-10nm CMOS technology nodes.

Chapter 3 presents a new, frame-interleaved LDPC decoder architecture with a reformulated message-

passing schedule that reduces interconnect complexity and routing logic overhead, by exploiting the

spatial locality of stored messages in neighboring processing nodes. The architecture is scalable by

design, supports multiple code rates, and achieves multi-Gb/s throughput while operating at clock

rates below 200MHz. The IEEE 802.11ad standard for the 60GHz wireless millimeter-wave band is

used as a vehicle to demonstrate the application of the proposed architecture, due to its multi-Gb/s

throughput specification over four code rates, and quasi-cyclic LDPC parity-check matrix structure [23].

Quasi-cyclic codes are used to illustrate the general approach; however, the proposed architecture and

decoding schedule can also be extended to non-quasi-cyclic codes.
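
The quasi-cyclic structure mentioned above can be illustrated with a minimal sketch. It assumes the parity-check matrix is described by a small base matrix of cyclic shift values, where each non-negative entry expands into a q × q identity matrix cyclically shifted to the right and −1 marks an all-zero block. The function names and the toy base matrix are hypothetical and are not taken from the IEEE 802.11ad standard.

```python
def shifted_identity(q, shift):
    """Return a q x q identity matrix with each row cyclically shifted right by `shift`."""
    return [[1 if col == (row + shift) % q else 0 for col in range(q)]
            for row in range(q)]


def expand_qc(base, q):
    """Expand a base matrix of shift values into a binary parity-check matrix.

    Entries >= 0 become shifted identity blocks; -1 marks an all-zero block.
    """
    H = []
    for base_row in base:
        blocks = [shifted_identity(q, s) if s >= 0 else [[0] * q for _ in range(q)]
                  for s in base_row]
        # Concatenate row r of every block to form one full row of H.
        for r in range(q):
            H.append([bit for block in blocks for bit in block[r]])
    return H


# Hypothetical 2 x 4 base matrix with expansion factor q = 3 (toy values only).
toy_base = [[0, 1, -1, 2],
            [2, -1, 0, 1]]
H = expand_qc(toy_base, 3)
```

In this representation the number of ones in H equals q times the number of non-negative base entries, so a decoder only needs the shift values to locate every node connection.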

A proof-of-concept application-specific integrated circuit (ASIC) was fabricated in a 28nm CMOS

technology, and tested at-speed on an automated SoC tester. The LDPC decoder occupies an area of

3.41mm² and achieves a throughput of 6.78Gb/s with a maximum latency of 0.793µs at 10 decoding

iterations for all four code rates that share a uniform block length of 672 bits, while operating at a

0.9V supply and 202MHz clock. The decoder consumes between 104mW and 279mW of power at a

target BER of 10⁻⁶ for the rate 13/16 and rate 1/2 codes, respectively. This corresponds to energy

efficiencies between 15pJ/bit and 41pJ/bit, demonstrating that low-power performance is achievable

with a low clock rate in a standard bulk CMOS technology for maximum SoC integration capability.

The performance of this work in comparison to previously published silicon-based decoders is plotted in

Fig. 1.2, and discussed in greater detail in Chapter 3. It is noted here that the purpose of Fig. 1.2 is

not to distinguish this work, but rather to illustrate the multi-dimensional design space of silicon-based

LDPC decoders with varying constraints.
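
The energy efficiency figures quoted above follow from dividing the measured power by the decoded throughput. The short sketch below reproduces that arithmetic from the reported power and throughput values; the function name is an illustrative assumption.

```python
def energy_per_bit_pj(power_mw, throughput_gbps):
    """Energy per decoded bit in pJ/bit.

    1 mW / 1 Gb/s = 1e-3 W / 1e9 bit/s = 1e-12 J/bit = 1 pJ/bit,
    so the ratio of the two inputs is already expressed in pJ/bit.
    """
    return power_mw / throughput_gbps


# Reported operating points: 104 mW (rate 13/16) and 279 mW (rate 1/2) at 6.78 Gb/s.
best_case = energy_per_bit_pj(104.0, 6.78)
worst_case = energy_per_bit_pj(279.0, 6.78)
```

Rounded to the nearest integer, the two results correspond to the 15pJ/bit and 41pJ/bit values stated in the text.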

1.2 LDPC Decoding for Quantum Key Distribution

LDPC codes have recently shown great promise in a forward error-correction context in QKD, where

two remote parties, Alice and Bob, attempt to construct a symmetric secret key by communicating

over a private quantum channel and an authenticated classical public channel. However, the speed at

which Alice and Bob can exchange secret keys is currently limited by the computational complexity of

post-processing algorithms for key reconciliation.

Quantum key distribution, also referred to as quantum cryptography, offers unconditional security

between two remote parties that employ one-time pad encryption to encrypt and decrypt messages using

a shared secret key, even in the presence of an eavesdropper with infinite computing power and math-

ematical genius [37–40]. Unlike classical cryptography, quantum cryptography allows the two remote

parties, Alice and Bob, to detect the presence of an eavesdropper, Eve, while also providing future-proof

security against brute force, key distillation attacks that may be enabled through quantum comput-

ing [41]. Today’s public key exchange schemes such as Diffie-Hellman and encryption algorithms like

RSA respectively rely on the computational hardness of solving the discrete log problem and prime

factorization [42, 43]. Both of these problems, however, can be solved in polynomial time by applying

Shor’s algorithm on a quantum computer [44–46].


Figure 1.1: Wireless SoC showing LDPC decoder IP block within the physical layer baseband, and auxiliary circuits and systems. [Block diagram: analog and RF front end (LNA, PA, ADC, DAC); digital baseband with 200MHz clock (equalization and calibration, slicer and demapper, LDPC decoder, LDPC encoder, symbol mapper); medium access control.]

Figure 1.2: Throughput vs. power comparison of silicon-based LDPC decoders for multiple standards with different throughput, latency, and error-correction performance requirements. The plot legend indicates the decoding standard and block length for each implementation. Annotated values in parentheses indicate the CMOS technology node, clock frequency, and decoder core area for each implementation [19,25–36]. [Plot: decoded data throughput (Gb/s) vs. power (mW); this work is annotated (28nm, 202MHz, 3.41mm²).]


While quantum computing remains speculative, QKD systems have already been realized in several

commercial and research settings worldwide [47–50]. Figure 1.3 presents two different protocols for

generating a symmetric key over a quantum channel: (1) discrete-variable QKD (DV-QKD) where

Alice encodes her information in the polarization of single-photon states that she sends to Bob, or (2)

continuous-variable QKD (CV-QKD) where Alice encodes her information in the amplitude and phase

quadratures of coherent states [40]. In DV-QKD, Bob uses a single-photon detector to measure each

received quantum state, while in CV-QKD, Bob uses homodyne or heterodyne detection techniques

to measure the quadratures of light [40]. While DV-QKD has been experimentally demonstrated up

to a distance of 404km [51], the cryogenic temperatures required for single-photon detection at such

extreme distances present a challenge for widespread implementation [40]. CV-QKD systems on the

other hand can be implemented using standard, cost-effective detectors that are routinely deployed in

classical telecommunications equipment that operates at room temperature [40]. The majority of QKD

research focuses on applications over optical fiber, since quantum signals for both CV- and DV-QKD

can be multiplexed over classical telecommunications traffic in existing fiber-optical networks [52–54].

Nevertheless, there has also been recent progress in chip-based, free-space, and Earth-to-satellite QKD

applications [55–57]. It is noted here that quantum cryptography, i.e., QKD, differs from post-quantum

cryptography, which is an evolving area of research that studies public-key encryption algorithms that

are believed to be secure against an attack by a quantum computer [58]. The discussion of post-quantum

cryptography is beyond the scope of this thesis.

The motivation of this thesis is to address the two key challenges that remain in the practical

implementation of CV-QKD over optical fiber: (1) to extend the distance of secure communication

beyond 100km with protection against collective Gaussian attacks, and (2) to increase the computational

throughput of the key reconciliation (error correction) algorithm in the post-processing step such that

the maximum achievable secret key rate remains limited only by the fundamental physical parameters

of the optical equipment at long distances [12, 59, 60]. There are two limitations to the speed of key

reconciliation. The first is the secret key rate, which is fundamentally limited by the transmittance of

the lossy optical channel and is measured in bits/pulse [61]. The second is the rate of computational

throughput from the hardware implementation, measured in bits/second [60]. To compare the two rates,

we normalize the secret key rate to bits/second by choosing a realistic CV-QKD pulse sampling rate of

frep = 1MHz [59,62]. While secure QKD networks can be built using trusted and untrusted intermediate

nodes, the long-distance reconciliation problem is motivated by the following two key reasons: (1) each

intermediate node introduces additional vulnerability, and (2) implementing efficient quantum repeaters

remains a challenge [40]. Jouguet and Kunz-Jacques showed that Mbit/s error-correction decoding of

multi-edge LDPC codes is achievable for distances up to 80km [60], while Huang et al. recently showed

that the distance could be extended to 100km by controlling excess system noise [63]. This thesis explores

high-speed LDPC decoding for CV-QKD beyond 100km.
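
The two rates discussed above can be put on a common footing with a small sketch. It assumes the standard fiber loss model T = 10^(−αℓ/10), with α = 0.2dB/km as listed in the List of Symbols, and normalizes a key rate in bits/pulse to bits/second by multiplying by the pulse repetition rate frep = 1MHz. The function names and the example key rate are illustrative assumptions, not values from the thesis.

```python
def transmittance(distance_km, alpha_db_per_km=0.2):
    """Transmittance T of an optical fiber of the given length (standard loss model)."""
    return 10.0 ** (-alpha_db_per_km * distance_km / 10.0)


def key_rate_bits_per_second(rate_bits_per_pulse, f_rep_hz=1e6):
    """Normalize a secret key rate from bits/pulse to bits/second."""
    return rate_bits_per_pulse * f_rep_hz


t_100km = transmittance(100.0)                       # 20 dB of loss
example_rate = key_rate_bits_per_second(1e-6)        # hypothetical 1e-6 bits/pulse
```

At 100km the assumed loss model gives a transmittance of 1%, which illustrates why the secret key rate in bits/pulse falls so quickly with distance.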

A particular challenge in implementing long-distance CV-QKD is the low signal-to-noise ratio (SNR)

of the optical quantum channel, which typically operates below −15dB. At such low SNR, high-efficiency

key reconciliation can be achieved only using low-rate codes with large block lengths on the order of

10⁶ bits [67,68], where approximately 98% of the bits are redundant parity bits that must be discarded

after error-correction decoding. The reconciliation efficiency is a measure of how close the code operates

to the Shannon limit at a particular SNR. In order to maximize the secret key rate and reconciliation

distance, the error-correcting code must achieve a high reconciliation efficiency and high error-correction


Figure 1.3: Information transmission over untrusted quantum channel and authenticated public channel between Alice and Bob for CV- and DV-QKD, with eavesdropper Eve.

Figure 1.4: Throughput vs. distance of GPU-based LDPC decoders for CV- and DV-QKD. The reported throughput is the raw GPU throughput without code- or error-rate scaling. For CV-QKD implementations [12, 60, 64], the annotated values in parentheses indicate the LDPC code block length n, the code rate R, the reconciliation efficiency β, and SNR of the quantum channel. For DV-QKD implementations, the annotated values indicate the block length n, code rate R, and QBER [48, 65, 66]. By convention in QKD, the SNR is reported in linear units. [Plot: LDPC decoder throughput (Mb/s, logarithmic axis) vs. distance (km); this work is annotated (n=10⁶, R=0.02, β=0.99, SNR=0.0284).]


performance with low frame error rate (FER). Jouguet et al. previously explored multi-edge LDPC codes

for long-distance reconciliation due to their near-Shannon limit performance with low-rate codes; however,

such codes require hundreds of LDPC decoding iterations to achieve asymptotic error-correction

performance [11, 59, 60]. This is in contrast to LDPC codes employed in modern communication stan-

dards, such as IEEE 802.11ac (Wi-Fi) and ETSI DVB-S2X, where the target SNR is above 0dB and

block lengths range from 648 bits to 64,800 bits [9,69]. In these standards, the LDPC decoder typically

operates at 10 iterations to deliver Gbit/s decoding throughput [20, 31, 70]. Long block lengths allow

Alice and Bob to generate longer secret keys, which can be used to provide unconditional security by employing

the one-time pad encryption scheme. Shorter codes with block lengths of 10⁵ bits, for instance,

would not be suitable for low-SNR channels beyond 100km due to their less robust error-correction per-

formance [10,11]. In addition to long block-length codes, key reconciliation over multiple dimensions has

also been shown to improve error-correction performance of multi-edge codes at low SNR [59], thereby

increasing both the secret key rate and distance. However, the computational complexity and latency

of decoding random LDPC parity-check matrices with block lengths on the order of 10⁶ bits remains a

challenge. Figure 1.4 presents a comparison of LDPC decoding throughput versus distance for several

state-of-the-art CV- and DV-QKD implementations, illustrating that high-throughput reconciliation at

long distances is achievable only with large block-length codes that approach the Shannon limit with

more than 90% efficiency for CV-QKD or less than 10% quantum bit error rate (QBER) for DV-QKD.
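
The notion of reconciliation efficiency used in this discussion can be made concrete with a brief sketch. It assumes the commonly used definition β = Rcode / C(s), where C(s) = ½ log₂(1 + s) is the Shannon capacity of a real-valued Gaussian channel at linear SNR s; this definition is an assumption here and is not quoted from the thesis. The example values (rate 0.02, SNR 0.0284) are those annotated for this work in Fig. 1.4.

```python
import math


def shannon_capacity(snr_linear):
    """Capacity in bits per channel use of a real-valued Gaussian channel (assumed model)."""
    return 0.5 * math.log2(1.0 + snr_linear)


def reconciliation_efficiency(code_rate, snr_linear):
    """Ratio of the code rate to the channel capacity at the given linear SNR."""
    return code_rate / shannon_capacity(snr_linear)


# Operating point annotated for this work in Fig. 1.4: R = 0.02, SNR = 0.0284 (linear).
beta = reconciliation_efficiency(0.02, 0.0284)
```

Under these assumptions the computed efficiency is close to the β = 0.99 annotated in Fig. 1.4, which shows how little margin a rate 0.02 code leaves at this SNR.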

Chapter 4 introduces a new, quasi-cyclic (QC) code construction for multi-edge LDPC codes with

block lengths on the order of 10⁶ bits [71, 72]. Computational acceleration is achieved through an

optimized LDPC decoder design implemented on a state-of-the-art graphics processing unit (GPU).

When combined with an 8-dimensional reconciliation scheme, the LDPC decoder achieves a raw decoding

throughput of 1.72Mbit/s and an information throughput of 7.16Kbit/s using an NVIDIA GeForce

GTX 1080 GPU at a maximum distance of 160km with a secret key rate of 4.10×10⁻⁷ bits/pulse when

finite-size effects are considered. The performance of this work in comparison to previous GPU-based

decoders for QKD is plotted in Fig. 1.4, and discussed in greater detail in Chapter 4. This work extends

the previous maximum CV-QKD distance of 100km to 160km, while delivering between 1.07× and

8.03× higher decoded information throughput over the upper bound on the secret key rate for a lossy

channel [61]. These results show that LDPC decoding is no longer the computational bottleneck in

long-distance CV-QKD, and that the secret key rate remains limited only by the physical parameters of

the quantum channel and the latency of privacy amplification.

1.3 Thesis Roadmap: Investigation of Two Application Areas

This thesis examines decoder implementation techniques for two distinct application areas for LDPC

codes: (1) integrated circuits for baseband FEC systems in wireless SoCs, and (2) quantum cryptography.

The goals in each application are distinct. For integrated circuits, a multi-Gb/s decoder IP core should

integrate with existing blocks in an SoC and achieve low-power performance with acceptable BER.

The design should be scalable to future CMOS technology nodes where interconnect currently presents

integration complexity challenges. For long-distance QKD, the GPU-accelerated decoder should deliver

sufficient speedup over the secret key rate limits defined by the parameters of the quantum channel at

low SNR.

The operating SNR is the primary distinction between the two application areas presented in this


thesis. As illustrated in Fig. 1.5, the SNR for the wireless IEEE 802.11ad standard is around 6dB

and the LDPC decoder achieves an FER of approximately 10⁻⁴, whereas in long-distance CV-QKD,

the near Shannon-limit channel operates at an SNR of −15dB where the LDPC decoder can only

achieve an FER of approximately 8×10⁻¹. This key distinction in error-correction performance drives

algorithmic and LDPC code design considerations, such as selecting the most appropriate variant of the

belief propagation algorithm, the LDPC code block length, and the code rate. The power budget and

throughput requirements then drive the implementation platform consideration. Figure 1.6 shows the

distinction in information throughput with respect to the LDPC code block length for the CV-QKD and

Wireless SoC application areas after redundant bits and erroneously decoded frames are discarded.
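
As a rough illustration of how redundant bits and discarded frames reduce the delivered throughput, the sketch below scales a raw decoding throughput by the code rate and by the fraction of correctly decoded frames. The scaling formula is an assumption consistent with the description above rather than a definition quoted from the thesis, and the inputs are the raw throughputs, code rates, and approximate FER values reported in this chapter.

```python
def information_throughput(raw_bps, code_rate, fer):
    """Throughput after parity bits and erroneously decoded frames are discarded (assumed model)."""
    return raw_bps * code_rate * (1.0 - fer)


# IEEE 802.11ad ASIC: 6.78 Gb/s raw, rate 1/2, FER of about 1e-4.
wireless_bps = information_throughput(6.78e9, 0.5, 1e-4)

# Long-distance CV-QKD GPU decoder: 1.724 Mb/s raw, rate 0.02, FER of about 0.8.
cv_qkd_bps = information_throughput(1.724e6, 0.02, 0.8)
```

With the approximate FER of 0.8, the CV-QKD estimate lands near 6.9kbit/s, the same order as the 7.16kbit/s reported in this chapter, while the wireless estimate stays close to half of the raw throughput.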

The error-correction performance, block length, and information throughput requirements present

a unique set of decoder implementation challenges and considerations for each application. Table 1.1

compares the use case, binary parity-check matrix structure, LDPC code performance, and decoder

implementation performance for the two application areas investigated in this thesis. Although both

applications communicate over a Gaussian channel, the low SNR operating point of CV-QKD requires

belief propagation decoding via the Sum-Product algorithm with a maximum of 500 iterations, while the

relatively high SNR operating point of a wireless IEEE 802.11ad channel allows for reduced-complexity

decoding via the Min-Sum algorithm at a maximum of 10 decoding iterations. For the remainder of this

thesis, the SNR and the SNR per bit, Eb/N0, are defined as follows:

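\[ \mathrm{SNR} = \frac{1}{\sigma^2} = 2 R_{\mathrm{code}} \left( \frac{E_b}{N_0} \right) \tag{1.1} \]

\[ \frac{E_b}{N_0} = \frac{1}{2 \sigma^2 R_{\mathrm{code}}}, \tag{1.2} \]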

where σ² is the noise variance of a zero-mean Gaussian channel, and Rcode is the code rate, i.e., the ratio of the

number of information bits (those that remain after the redundant parity bits are discarded) to the

total length of the decoding block.
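
Equations (1.1) and (1.2) can be checked numerically with a short sketch. Working in decibels, Eq. (1.1) rearranges to Eb/N0 (dB) = SNR (dB) − 10 log₁₀(2 Rcode). The function names are illustrative assumptions, and the example uses the long-distance CV-QKD operating point quoted in this chapter (SNR of −15.47dB, Rcode = 0.02).

```python
import math


def snr_from_sigma2(sigma2):
    """Linear SNR of a zero-mean Gaussian channel with noise variance sigma2, per Eq. (1.1)."""
    return 1.0 / sigma2


def ebn0_db_from_snr_db(snr_db, code_rate):
    """Eq. (1.1) rearranged: Eb/N0 = SNR / (2 * Rcode), expressed in dB."""
    return snr_db - 10.0 * math.log10(2.0 * code_rate)


# Long-distance CV-QKD operating point: SNR = -15.47 dB, Rcode = 0.02.
cv_qkd_ebn0_db = ebn0_db_from_snr_db(-15.47, 0.02)
```

For this operating point the conversion gives approximately −1.49dB, because the very low code rate spreads each information bit over many channel uses.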

Long-distance CV-QKD beyond 100km requires 98% code redundancy with a very low code rate of

Rcode = 0.02 and long block length of 10⁶ bits in order to approach the Shannon limit at low SNR,

while wireless IEEE 802.11ad communication over 1-meter line-of-sight links requires only 50% code

redundancy with Rcode = 0.5 and a shorter block length of 672 bits. Since the decoding latency and

block length of IEEE 802.11ad are several orders of magnitude smaller than in long-distance CV-QKD,

the decoded information throughput is several orders of magnitude higher, in the Gigabit/s regime,

as opposed to Megabit/s for CV-QKD. Nevertheless, a silicon-based decoder implementation for IEEE

802.11ad has to achieve high energy efficiency, i.e., low power performance, in order to prolong battery

life when integrated in a wireless SoC fabricated in a standard CMOS technology for a mobile device.

Although highly-customizable ASICs provide excellent energy efficiency, the silicon implementation

of an LDPC decoder for long-distance CV-QKD with an LDPC code block length of 10⁶ bits would

require significant silicon die area, which may be prohibitively expensive to fabricate in a modern CMOS

technology node [13]. Moreover, ASICs suffer from fixed-point computational precision, limited memory,

and highly complex routing. While modern field-programmable gate arrays (FPGAs) offer floating-point

computational cores, the logic requirements of an LDPC decoder with a block length of 10⁶ bits may

exceed the maximum utilization of on-chip FPGA logic blocks and switch-based routing required to fully

place-and-route the design with strict timing constraints [15, 73]. GPUs on the other hand are a highly


Figure 1.5: FER vs. SNR for IEEE 802.11ad Wireless SoC and CV-QKD applications. [Plot: frame error rate (FER, logarithmic axis) vs. SNR (dB).]

Figure 1.6: Data throughput with FER and code-rate scaling vs. block length for IEEE 802.11ad wireless SoC and CV-QKD applications. [Plot: data throughput (bits/second) vs. block length (bits), both on logarithmic axes.]


suitable platform for LDPC decoder implementation in CV-QKD systems due to their low cost, and high

availability of on-chip memory, floating-point computational precision, and architectural flexibility, which

allows for shorter development time [74,75]. Since Alice and Bob are stationary and their communication

occurs over a fixed-length fiber-optic cable, the traditional optimization parameters of energy efficiency

and silicon chip area do not necessarily apply since the LDPC decoder does not need to assume an

integrated circuit form factor. Furthermore, GPUs seamlessly integrate into a post-processing computer

system, and provide increasing computational performance at low cost with each successive architecture

generation [76].

The LDPC decoder implementations presented in Chapters 3 and 4 of this thesis holistically consider

the constraints outlined in Table 1.1.

Table 1.1: Comparison of LDPC decoder performance requirements for an IEEE 802.11ad wireless SoC IP core vs. long-distance CV-QKD key reconciliation block

Specification                      | IEEE 802.11ad Wireless SoC IP         | Long-Distance CV-QKD

Use Case
  Transmission Medium              | Free Space                            | Optical Fiber
  Power Source                     | Battery                               | Grid
  Distance                         | 1m                                    | 100-200km
  Data Rate                        | 10Gb/s                                | 1Mb/s
  Channel Type                     | Gaussian                              | Gaussian

Binary Parity-Check Matrix
  Number of Rows                   | 336                                   | 987,840
  Number of Columns                | 672                                   | 1,008,000
  Number of Connections            | 1890                                  | 3,363,885

LDPC Code Performance
  Block Length (Bits)              | 672                                   | 1.008×10⁶
  Code Rate                        | 1/2                                   | 1/50
  Decoding Algorithm               | Min-Sum                               | Sum-Product
  Maximum Decoding Iterations      | 10                                    | 500
  SNR (dB)                         | 6.76                                  | -15.47
  SNR Per Bit Eb/N0 (dB)           | 5.5                                   | -1.49
  Target FER                       | 10⁻⁴                                  | 0.8

Decoder Implementation Performance
  Platform                         | ASIC                                  | GPU
  CMOS Technology Node             | 28nm                                  | 16nm
  Decoding Throughput (bit/s)      | 6.78×10⁹                              | 1.724×10⁶
  Information Throughput (bit/s)   | 3.39×10⁹                              | 7.16×10³
  Latency (µs)                     | 0.793                                 | 1296
  Power (W)                        | 0.104 (1)                             | 180 (2)
  Key Implementation Challenges    | Energy efficiency, SoC IP integration | Frame error rate, throughput
  Performance Bottleneck           | Interconnect wiring                   | Memory access latency

(1) Measured power of the test chip fabricated in this thesis in 28nm CMOS technology.
(2) Thermal design power of the NVIDIA GeForce GTX 1080 GPU.

As shown in Table 1.1, the key distinction between the two application areas from an LDPC decoder
perspective is the size of the binary parity-check matrix that defines the LDPC code. The largest
parity-check matrix defined in the IEEE 802.11ad standard has only 1890 node connections between iterative
processing groups, while the parity-check matrix for CV-QKD has over 3.3 million node connections.

The number of node connections directly affects the decoding latency, implementation complexity, power,

and throughput. This thesis explores techniques for exploiting the intrinsic structure of LDPC parity-

check matrices for the integrated circuit and QKD application areas in order to reduce decoding latency,

complexity, and power, while maximizing throughput. The ideas presented herein are scalable to codes

with longer block lengths, beyond those defined in Table 1.1, for both wireless and CV-QKD applications.
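The matrix dimensions in Table 1.1 can be cross-checked against the listed code rates with a few lines of arithmetic. The following is a minimal sketch, assuming each parity-check matrix has full row rank so that the design rate equals (columns − rows)/columns; the function names are illustrative and do not come from the thesis.

```python
from fractions import Fraction

# (rows, columns, non-zero entries) of each binary parity-check matrix, as listed in Table 1.1.
matrices = {
    "IEEE 802.11ad": (336, 672, 1890),
    "CV-QKD": (987_840, 1_008_000, 3_363_885),
}

def design_rate(rows, cols):
    """Design code rate k/n = (cols - rows) / cols, assuming H has full row rank."""
    return Fraction(cols - rows, cols)

def average_row_weight(rows, edges):
    """Average number of variable nodes attached to each check node."""
    return edges / rows

for name, (rows, cols, edges) in matrices.items():
    rate = design_rate(rows, cols)
    weight = average_row_weight(rows, edges)
    print(f"{name}: design rate {rate}, average row weight {weight:.3f}")
```

Under that full-rank assumption, the two design rates reduce to the 1/2 and 1/50 values given in the table.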

The implementation results presented in Chapters 3 and 4 of this thesis provide insights into possible

directions for future LDPC decoder implementations, by leveraging the computational acceleration,

integration, and scalability benefits offered by exploiting the structure of LDPC parity-check matrices.

The remainder of this thesis is organized as follows. Chapter 2 presents the background on LDPC

codes and QKD. Chapter 3 describes a new frame-interleaved LDPC decoder architecture with a path-

unrolled message-passing schedule for integrated circuit applications, and presents the measurement

results of the fabricated proof-of-concept silicon test chip. Chapter 4 introduces a quasi-cyclic parity-

check matrix construction and GPU-based decoder implementation for multi-edge LDPC codes with

application in key reconciliation for long-distance CV-QKD. Chapter 5 concludes the thesis and presents

some future research directions.

Chapter 2

Background

This chapter first introduces the fundamentals of LDPC codes, outlines the belief propagation decoding

algorithm, and describes the challenges in traditional LDPC decoder architectures that target CMOS

implementation. The chapter then explores the application of LDPC codes for secret key reconciliation

in QKD, by first presenting some preliminaries on QKD, and then introducing multi-edge codes for

reconciliation at low SNR. This chapter provides the background for the integrated circuit and long-

distance CV-QKD application areas discussed in Chapters 3 and 4, respectively. LDPC encoding is not

described in this thesis, since the complexity of encoding is relatively low compared to decoding.

2.1 Forward Error Correction with LDPC Codes

The forward error correction procedure is presented in Fig. 2.1 with a simplified channel model. A binary
message $S$ of length $k$ bits is first encoded using a known parity-check matrix $H$. The encoding produces a
noiseless codeword $C$ of length $n$ bits by appending $(n-k)$ computed parity bits to $S$. The LDPC-encoded
codeword $C$ is then transmitted over a binary-input additive white Gaussian noise channel (BIAWGNC)
with zero mean and noise variance $\sigma^2$, such that the received soft-decision vector $R$ is described by
$R_v = (-1)^{C_v} + N_v$ for $v = 1, 2, \ldots, n$, where $N_v \sim \mathcal{N}(0, \sigma^2)$ represents the Gaussian
noise. LDPC decoding is performed using the known parity-check matrix $H$ to produce an estimate $\hat{C}$
of the original transmitted codeword $C$, where each binary hard decision $\hat{C}_v \in \{0, 1\}$ for $v = 1, 2, \ldots, n$.
By discarding the $(n-k)$ parity bits from the frame, an estimate $\hat{S}$ of the original message $S$ is obtained.
LDPC decoding is successful if $\hat{C} = C$; otherwise, a frame error is said to have occurred, where one or
more bits in the frame are in error.
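The channel model above can be illustrated numerically. The following is a minimal sketch, assuming NumPy is available; the function names, the noise level, and the random stand-in frame (which is not produced by an actual LDPC encoder) are illustrative choices only.

```python
import numpy as np

def biawgn_channel(codeword, sigma, rng):
    """Map bits to +/-1 via (-1)^C and add zero-mean Gaussian noise with standard deviation sigma."""
    symbols = 1.0 - 2.0 * np.asarray(codeword)      # bit 0 -> +1, bit 1 -> -1
    return symbols + rng.normal(0.0, sigma, size=symbols.shape)

def hard_decision(received):
    """Threshold detector without any decoding: negative samples map to bit 1."""
    return (received < 0).astype(int)

rng = np.random.default_rng(0)                      # fixed seed, for repeatability only
codeword = rng.integers(0, 2, size=672)             # stand-in frame, not a true codeword
received = biawgn_channel(codeword, sigma=0.5, rng=rng)

bit_errors = int(np.sum(hard_decision(received) != codeword))
frame_error = bit_errors > 0                        # a frame error is one or more bit errors
print(bit_errors, frame_error)
```

The sketch applies a plain threshold to R; the LDPC decoder described in the following sections replaces that threshold with iterative soft-decision decoding.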

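[Figure 2.1 graphic: original message S → LDPC encoder → codeword C → AWGN channel → received vector R → LDPC decoder → Ĉ, Ŝ (decoded message); the parity-check matrix H is supplied to both the encoder and the decoder.]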

Figure 2.1: Simplified model of a BIAWGNC with LDPC encoding and decoding.


2.2 LDPC Codes: A Class of Linear Block Codes

LDPC codes are a class of linear block codes defined by a sparse parity-check matrix $H$ of size $(n-k) \times n$, $k \le n$, with block length $n$ and code rate $R_{\text{code}} = k/n$ [8, 77]. An equivalent definition of an
LDPC code is given by its Tanner graph, a bipartite graph, where the non-zero entries of the binary
parity-check matrix $H$ define the edge connections between independent vertex sets known as check
nodes (CNs) and variable nodes (VNs) [78]. As shown in Fig. 2.2, CNs and VNs correspond to the rows
and columns of $H$, respectively, and an edge between CN $c_i$ and VN $v_j$ belongs to the graph $G$ if and
only if $H(i, j) = 1$.
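The correspondence between the parity-check matrix and the Tanner graph can be made concrete with a short script. The following is a minimal sketch, assuming NumPy; the 3 × 6 matrix is a hypothetical example chosen for illustration and is not claimed to be the matrix of Fig. 2.2, and indices are zero-based rather than the one-based labels used in the text.

```python
import numpy as np

# Hypothetical 3 x 6 binary parity-check matrix, chosen only for illustration.
H = np.array([
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
])

def tanner_graph(H):
    """Return the VN indices attached to each CN (rows) and the CN indices attached to each VN (columns)."""
    check_neighbors = [np.flatnonzero(row).tolist() for row in H]
    variable_neighbors = [np.flatnonzero(col).tolist() for col in H.T]
    return check_neighbors, variable_neighbors

check_neighbors, variable_neighbors = tanner_graph(H)
for c, vs in enumerate(check_neighbors):
    print(f"CN {c} <-> VNs {vs}")
for v, cs in enumerate(variable_neighbors):
    print(f"VN {v} <-> CNs {cs}")
```

Each non-zero entry of H appears exactly once in each of the two adjacency lists, so the total edge count equals the number of ones in H.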

[Figure 2.2 graphic: Tanner graph with check nodes c1 to c3 and variable nodes v1 to v6, shown beside the corresponding (n − k) × n binary parity-check matrix H.]

Figure 2.2: Tanner graph and corresponding binary parity-check matrix with block length of n = 6 bits, k = 3 information bits, n = 6 variable nodes, and (n − k) = 3 check nodes.

2.3 LDPC Decoding: Belief Propagation Algorithms

LDPC decoding is performed using belief propagation, an iterative message-passing algorithm commonly

used to perform inference on graphical models such as factor graphs [79]. LDPC decoding attempts to

converge on a valid codeword by iteratively exchanging probabilistic updates between check and variable

nodes along the edges of the Tanner graph until the parity-check condition is satisfied, i.e., a valid

codeword has been found, or the maximum number of iterations is exhausted.

The Sum-Product algorithm is the most common variant of belief propagation [79], and is described in
Algorithm 1 with a flooding schedule where CNs and VNs pass message updates between their connected
neighbors once per iteration to generate a codeword estimate $\hat{C}$. In Algorithm 1, Step 1 prepares the
$Q_v$ log-likelihood ratio (LLR) input values at each VN $v$ based on the channel noise variance $\sigma^2$. All
VN-to-CN messages from VN $v$ are initialized to the received channel LLR $Q_v$ before the first
message-passing iteration. Steps 2 to 5 specify the message-passing interaction between the CNs and VNs until
the codeword syndrome defined by $\hat{C}H^\top$ is equal to zero, or the maximum predetermined number of
decoding iterations is reached. In Step 2, $m_{cv}^{(i)}$ is the message from CN $c$ to VN $v$ in iteration $i$, and
$\Phi(x) = \Phi^{-1}(x) = -\ln(\tanh(x/2))$. In Step 3, $L_{vc}^{(i)}$ is the message from VN $v$ to CN $c$, and $L_v^{(i)}$ is the
updated LLR belief of bit $v$ in the frame, whose decision is given by $\hat{C}_v^{(i)}$ in Step 4. In Step 2, the set
of VNs connected to CN $c$ is defined as $N(c) = \{v \mid v \in \{1, 2, \ldots, n\} \wedge H_{cv} = 1\}$, where the notation
$v' \in N(c) \setminus v$ refers to all VNs in the set $N(c)$ excluding VN $v$. Similarly, in Step 3, the set of CNs
connected to VN $v$ is defined as $M(v) = \{c \mid c \in \{1, 2, \ldots, n-k\} \wedge H_{cv} = 1\}$, where $c' \in M(v) \setminus c$ refers
to all CNs in the set $M(v)$ excluding CN $c$. The syndrome $\hat{C}H^\top = 0$ if the parity check $p_c = 0$ at each
CN $c$, where $p_c$ is the XOR of all hard decision $\hat{C}_v$ bits from VNs connected to CN $c$ in the set $N(c)$:

$$p_c \leftarrow \hat{C}_{N(c)_1} \oplus \hat{C}_{N(c)_2} \oplus \cdots \oplus \hat{C}_{N(c)_{|N(c)|}} = \bigoplus_{v \in N(c)} \hat{C}_v. \tag{2.1}$$

Algorithm 1 Sum-Product with flooding schedule

Input: $R$, $\sigma^2$; Output: $\hat{C}$

Step 1: LLR initialization at each VN, $v = 1, 2, \ldots, n$
    $Q_v \leftarrow \ln\left(\dfrac{P(R_v \mid C_v = 0)}{P(R_v \mid C_v = 1)}\right) = \dfrac{2R_v}{\sigma^2}$ for BIAWGNC
    $L_{vc}^{(i=0)} \leftarrow Q_v, \ \forall c \in M(v)$ for first iteration

for Iteration $i = 1$ to Max Iterations do

    Step 2: Check node update at each CN, $c = 1, 2, \ldots, n-k$ (CN-to-VN messages)
        $\mathrm{sgn}\big(m_{cv}^{(i)}\big) \leftarrow \prod_{v' \in N(c) \setminus v} \mathrm{sgn}\big(L_{v'c}^{(i-1)}\big)$
        $\big|m_{cv}^{(i)}\big| \leftarrow \Phi^{-1}\Big(\sum_{v' \in N(c) \setminus v} \Phi\big(\big|L_{v'c}^{(i-1)}\big|\big)\Big)$
        $m_{cv}^{(i)} \leftarrow \mathrm{sgn}\big(m_{cv}^{(i)}\big) \times \big|m_{cv}^{(i)}\big|$

    Step 3: Variable node update at each VN, $v = 1, 2, \ldots, n$ (VN-to-CN messages)
        $L_v^{(i)} \leftarrow Q_v + \sum_{c \in M(v)} m_{cv}^{(i)}$
        $L_{vc}^{(i)} \leftarrow Q_v + \sum_{c' \in M(v) \setminus c} m_{c'v}^{(i)} = L_v^{(i)} - m_{cv}^{(i)}$

    Step 4: Hard decision at each VN, $v = 1, 2, \ldots, n$
        $\hat{C}_v^{(i)} \leftarrow 0$ if $L_v^{(i)} \ge 0$; $1$ otherwise

    Step 5: Early termination (parity) check at each CN, $c = 1, 2, \ldots, n-k$
        if $\hat{C}H^\top = 0 \pmod 2$ then Terminate

end for
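The steps of Algorithm 1 can be sketched in software as follows. This is a minimal, unoptimized illustration assuming NumPy; the clipping inside `phi`, the (7,4) Hamming parity-check matrix (which is small but not low-density), and the channel LLR values are assumptions made for the example and are not taken from the thesis.

```python
import numpy as np

def phi(x):
    """Phi(x) = -ln(tanh(x/2)); the input is clipped for numerical safety (an assumption of this sketch)."""
    x = np.clip(x, 1e-12, 50.0)
    return -np.log(np.tanh(x / 2.0))

def sum_product_decode(H, Q, max_iters=50):
    """Flooding-schedule Sum-Product decoding following Steps 1 to 5 of Algorithm 1."""
    H = np.asarray(H)
    Q = np.asarray(Q, dtype=float)
    m, n = H.shape
    mask = H.astype(bool)
    L_vc = np.where(mask, Q, 0.0)            # Step 1: VN-to-CN messages start at the channel LLR
    c_hat = (Q < 0).astype(int)
    for _ in range(max_iters):
        # Step 2: check node update (CN-to-VN messages), excluding the target VN each time
        m_cv = np.zeros((m, n))
        for c in range(m):
            vs = np.flatnonzero(mask[c])
            for v in vs:
                others = vs[vs != v]
                sign = np.prod(np.sign(L_vc[c, others]))
                magnitude = phi(np.sum(phi(np.abs(L_vc[c, others]))))
                m_cv[c, v] = sign * magnitude
        # Step 3: variable node update (VN-to-CN messages)
        L_v = Q + m_cv.sum(axis=0)
        L_vc = np.where(mask, L_v - m_cv, 0.0)
        # Step 4: hard decision
        c_hat = (L_v < 0).astype(int)
        # Step 5: early termination when every parity check of Eq. (2.1) is zero
        if not np.any(H @ c_hat % 2):
            break
    return c_hat

# Illustrative (7,4) Hamming parity-check matrix and hypothetical channel LLRs Q_v = 2 R_v / sigma^2.
H = np.array([
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
])
Q = [2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0]     # all-zero codeword sent; bit 2 received unreliably
print(sum_product_decode(H, Q))
```

In this example the weak negative LLR at bit 2 is outweighed by the two check node messages it receives, so the decoder returns the all-zero codeword after the first iteration.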

The traditional Sum-Product algorithm achieves error-correction performance close to the theoretical

Shannon limit, however, it is not well-suited for hardware implementation due to the non-linearity of the

tanh(x) function [15]. The Min-Sum algorithm is often adopted as a suitable alternative in integrated

circuit decoder implementations as it does not require complex lookup tables and can be computed with

simple comparator circuits [15, 73, 77]. Algorithm 2 describes the Min-Sum decoding procedure with a

flooding schedule.
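The difference between the two check node updates can be seen on a single set of incoming messages. The following is a minimal sketch, assuming NumPy; the incoming magnitudes are hypothetical values, and the helper names are illustrative.

```python
import numpy as np

def phi(x):
    """Phi(x) = -ln(tanh(x/2)), defined for x > 0."""
    return -np.log(np.tanh(np.asarray(x, dtype=float) / 2.0))

def sum_product_magnitude(magnitudes):
    """Outgoing CN-to-VN magnitude under Sum-Product: Phi^-1 of the sum of Phi terms."""
    return float(phi(np.sum(phi(magnitudes))))

def min_sum_magnitude(magnitudes):
    """Outgoing CN-to-VN magnitude under Min-Sum: the smallest incoming magnitude."""
    return float(np.min(magnitudes))

incoming = [2.0, 0.5, 2.0]        # hypothetical |L_vc| values from the other connected VNs
sp = sum_product_magnitude(incoming)
ms = min_sum_magnitude(incoming)
print(sp, ms)
```

Because Φ is decreasing and self-inverse, the Min-Sum magnitude is never smaller than the Sum-Product magnitude for the same inputs, which is the overestimation that scaling and offset variants of Min-Sum are designed to counteract.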

The Min-Sum algorithm can be modified to include scaling factors and offset coefficients to improve

decoding performance, as in the case of the Normalized Min-Sum and Offset Min-Sum algorithms,

respectively [80]. However, these parameters are strongly dependent on the channel SNR, code rate, and

fixed-point message quantization in the LDPC decoder. In order to avoid BER performance degradation,

these parameters should be adjusted dynamically during runtime to compensate for changes in SNR or

code rate. Such enhancements are beyond the scope of this thesis.

Algorithm 2 Min-Sum with flooding schedule

Input: $R$, $\sigma^2$; Output: $\hat{C}$

Step 1: LLR initialization at each VN, $v = 1, 2, \ldots, n$
    $Q_v \leftarrow \ln\left(\dfrac{P(R_v \mid C_v = 0)}{P(R_v \mid C_v = 1)}\right) = \dfrac{2R_v}{\sigma^2}$ for BIAWGNC
    $L_{vc}^{(i=0)} \leftarrow Q_v, \ \forall c \in M(v)$ for first iteration

for Iteration $i = 1$ to Max Iterations do

    Step 2: Check node update at each CN, $c = 1, 2, \ldots, n-k$ (CN-to-VN messages)
        $\mathrm{sgn}\big(m_{cv}^{(i)}\big) \leftarrow \prod_{v' \in N(c) \setminus v} \mathrm{sgn}\big(L_{v'c}^{(i-1)}\big)$
        $\big|m_{cv}^{(i)}\big| \leftarrow \min_{v' \in N(c) \setminus v} \big|L_{v'c}^{(i-1)}\big|$
        $m_{cv}^{(i)} \leftarrow \mathrm{sgn}\big(m_{cv}^{(i)}\big) \times \big|m_{cv}^{(i)}\big|$

    Step 3: Variable node update at each VN, $v = 1, 2, \ldots, n$ (VN-to-CN messages)
        $L_v^{(i)} \leftarrow Q_v + \sum_{c \in M(v)} m_{cv}^{(i)}$
        $L_{vc}^{(i)} \leftarrow Q_v + \sum_{c' \in M(v) \setminus c} m_{c'v}^{(i)} = L_v^{(i)} - m_{cv}^{(i)}$

    Step 4: Hard decision at each VN, $v = 1, 2, \ldots, n$
        $\hat{C}_v^{(i)} \leftarrow 0$ if $L_v^{(i)} \ge 0$; $1$ otherwise

    Step 5: Early termination (parity) check at each CN, $c = 1, 2, \ldots, n-k$
        if $\hat{C}H^\top = 0 \pmod 2$ then Terminate

end for

The massive computational parallelism required to achieve high decoding throughput warrants the
integrated circuit implementation of LDPC decoders [81, 82]. Despite the prospect of near-capacity
error-correction performance, LDPC decoder hardware implementations must balance the trade-offs of
data throughput, power consumption, silicon chip area, and decoding latency, while operating within the
system-level constraints defined by the target BER, block length, and parity-check matrix connectivity
for multiple code rates. These trade-offs are controlled through architectural and algorithmic
scheduling considerations, which include the number of decoding iterations, message bit-width quantization,
partitioning of combinational logic and memory, and routing overhead complexity.

2.4 Quasi-Cyclic LDPC Codes

In a traditional LDPC decoder, CN and VN processing units iteratively exchange messages across an

interconnect network described by a Tanner graph, as shown in Fig. 2.2. Random parity-check matrices

introduce unstructured interconnect between the variable and check nodes, resulting in unordered mem-

ory access patterns and complex routing, which limit scalability in ASIC or FPGA implementations.

Architecture-aware codes were introduced to alleviate the hardware complexity in the design of both

LDPC encoders and decoders by imposing a highly-regular matrix structure with a sufficient degree of

randomness in the parity-check matrix to ensure adequate Euclidean distance between valid codewords

under maximum likelihood detection [83].

Quasi-cyclic (QC) LDPC codes are a popular class of such codes, where the parity-check matrix is

constructed from an array of q×q cyclically-shifted identity matrices or q×q zero matrices [72]. As shown

in Fig. 2.3, the tilings evenly divide the parity-check matrix H into n/q QC macro-columns and (n−k)/q

QC macro-rows. The expansion factor q in a QC parity-check matrix determines the trade-off between

decoder implementation complexity and error-correction performance. For a small expansion factor q,

the matrix exhibits a high degree of randomness, which improves error-correction performance, while

a large q reduces decoder complexity with some performance degradation. QC-LDPC codes provide


a highly-regular matrix structure, which can be exploited to simplify decoder architecture design and

implementation.
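The tiling described above can be reproduced by expanding a small base matrix of shift values. The following is a minimal sketch, assuming NumPy; the base matrix, the use of −1 to mark an all-zero tile, and the right-shift convention are assumptions of this example rather than details taken from the thesis.

```python
import numpy as np

def shifted_identity(q, shift):
    """q x q identity matrix with the 1 in each row cyclically moved `shift` columns to the right."""
    return np.roll(np.eye(q, dtype=int), shift, axis=1)

def expand_qc(base, q):
    """Expand a base matrix of shift values into a binary QC parity-check matrix.

    Entries >= 0 are cyclic shifts; -1 marks a q x q all-zero tile (convention assumed here).
    """
    blocks = [
        [np.zeros((q, q), dtype=int) if s < 0 else shifted_identity(q, s) for s in row]
        for row in base
    ]
    return np.block(blocks)

# Hypothetical base matrix with 2 macro-rows and 4 macro-columns.
base = [
    [1, -1, 3, 0],
    [-1, 2, 0, 4],
]
H = expand_qc(base, q=5)
print(H.shape)               # (2 * q) rows by (4 * q) columns
print(shifted_identity(5, 3))
```

Only the base matrix and q need to be stored to describe H, which is the regularity that QC-oriented decoder architectures exploit.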

[Figure 2.3 graphic: sample QC parity-check matrix with expansion factor q = 5, tiled into 6 QC macro-rows (6 × 5 check nodes) and 10 QC macro-columns (10 × 5 variable nodes). Each tile is either a q × q cyclically-shifted identity matrix (I1 to I5) or a q × q all-zero matrix. The columns comprise k information bits and (n − k) parity bits, for a block length of n bits (n variable nodes); the rows correspond to the (n − k) check nodes. Sample cyclically-shifted identity matrix for q = 5: I3 has rows 00010, 00001, 10000, 01000, 00100.]

Figure 2.3: Sample quasi-cyclic binary parity-check matrix for q = 5 constructed from uniformly-sized (q × q), cyclically-shifted identity matrices and all-zero matrices.

2.5 Silicon Integrated Circuits for LDPC Decoding

The silicon-based implementation of LDPC decoders was introduced in Chapter 1, with the motivation of

maximizing energy efficiency and throughput in modern CMOS technology nodes. This section outlines

the challenges of extending existing decoder architectures to modern CMOS technology nodes.

2.5.1 LDPC Decoder Architectures

As presented in Fig. 2.4, LDPC decoder architectures can be classified into the following three categories

based on their degree of computational parallelism: fully parallel, partially parallel, and serial.

Fully-parallel decoders achieve the highest throughput, but suffer from high power consumption

and large silicon area due to the large number of parallel processors and highly complex reconfigurable

Benes/Banyan switch-based networks or hard-wired interconnect [70,82,84].

Partially-parallel decoder architectures achieve higher energy- and area-efficiency at the expense of

lower throughput [83, 85]. Message updates are computed in a time-multiplexed approach by storing

intermediate updates in embedded memory and performing pipelined computations among partitioned

processing nodes whose level of parallelism is typically defined by the QC matrix expansion factor

q [83, 85]. Barrel shifters and large multiplexer trees are implemented to perform cyclic data rotation.


Partially-parallel architectures can further be classified as row-parallel or column-parallel depending on

the processing unit partitioning of the QC parity-check matrix.
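The role of the barrel shifter follows from the structure of the QC tiles: multiplying a group of q messages by a cyclically-shifted identity matrix is the same as rotating the group. The following is a minimal sketch, assuming NumPy; the message values and the shift direction convention are illustrative assumptions.

```python
import numpy as np

q, shift = 5, 3
# q x q cyclically-shifted identity tile (1s moved `shift` columns to the right).
P = np.roll(np.eye(q, dtype=int), shift, axis=1)

# Hypothetical message values, one per VN in a macro-column.
messages = np.array([10, 11, 12, 13, 14])

via_matrix = P @ messages                  # routing expressed as a matrix product
via_rotation = np.roll(messages, -shift)   # the same routing as a cyclic rotation

print(via_matrix, via_rotation)
```

A single rotation amount per tile therefore replaces q individual wire assignments, at the cost of the shifter and multiplexer logic noted above.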

Serial architectures implement a single processing unit with large memory, and are suitable only for

low-throughput applications where high latency is acceptable [86].

[Figure 2.4 graphic: three panels showing VN and CN processing units. Fully Parallel: a full row of VNs connected to a full row of CNs. Partially Parallel: a smaller group of VNs and CNs connected through blocks labelled Permutation and Memory. Serial: one VN and one CN connected through a Global Memory.]

Figure 2.4: Examples of three LDPC decoder architectures showing message-passing networks, and configuration of CN and VN processing units. While exceptions exist, typically, fully-parallel architectures instantiate n VNs and (n − k) CNs, partially-parallel architectures instantiate a factor of q VNs and CNs, and serial architectures instantiate only 1 VN and 1 CN.

2.5.2 Flooding and Layered Decoding Schedules

All LDPC decoders execute the belief propagation algorithm based on either a flooding or layered

message-passing schedule [82].

In the flooding schedule, CNs and VNs perform subsequent rounds of updates and pass messages

between their connected neighbors once per iteration [82]. The flooding schedule is amenable to both

fully-parallel and partially-parallel architectures. Flooding decoders achieve high throughput with low

latency at the expense of area and routing complexity [87].

The layered schedule was introduced to improve decoder convergence by performing intermediate

LLR updates in the VNs in each iteration [88]. The layered schedule is amenable to partially-parallel

architectures with QC-LDPC codes. While the layered schedule generally requires fewer iterations, the

overall decoding latency is generally longer as additional clock cycles are required to perform intermediate

LLR updates. Successive message updates can also lead to dynamic range clipping/saturation, which

reduces error-correction performance if scaling circuits are not employed. Moreover, the data dependency

among layers results in memory access contention [20], even with sliced message-passing and overlapped

decoding schedule techniques [89,90]. Layered decoders thus require higher clock rates to achieve multi-

Gb/s throughput, as well as more complex permutation network control. As such, they are not optimal

for multi-Gb/s decoding with the goal of IP integration in a target system operating at a 200MHz clock

rate.
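The intermediate LLR updates that distinguish the layered schedule from the flooding schedule can be sketched as follows. This is a minimal illustration assuming NumPy, Min-Sum check node updates, and one row of H per layer; a QC decoder would more typically group a whole macro-row into a layer, and the parity-check matrix and LLR values below are hypothetical.

```python
import numpy as np

def layered_min_sum_decode(H, Q, max_iters=20):
    """Layered Min-Sum decoding where each check node (row of H) is treated as one layer."""
    H = np.asarray(H)
    m, n = H.shape
    L_v = np.array(Q, dtype=float)           # running LLR belief per VN, updated after every layer
    m_cv = np.zeros((m, n))                  # most recent CN-to-VN messages
    c_hat = (L_v < 0).astype(int)
    for _ in range(max_iters):
        for c in range(m):
            vs = np.flatnonzero(H[c])
            L_vc = L_v[vs] - m_cv[c, vs]     # remove this CN's previous contribution
            for idx, v in enumerate(vs):
                others = np.delete(L_vc, idx)
                m_cv[c, v] = np.prod(np.sign(others)) * np.min(np.abs(others))
            L_v[vs] = L_vc + m_cv[c, vs]     # intermediate update, visible to the next layer
        c_hat = (L_v < 0).astype(int)
        if not np.any(H @ c_hat % 2):        # stop once all parity checks are satisfied
            break
    return c_hat

# Illustrative (7,4) Hamming parity-check matrix and hypothetical channel LLRs.
H = np.array([
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
])
Q = [2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0]
print(layered_min_sum_decode(H, Q))
```

The data dependency discussed above is visible in the inner loop: layer c cannot start until the `L_v` values written by layer c − 1 are available.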

2.5.3 Design Challenges: Message Permutation, Memory, and Power

Reducing interconnect congestion and minimizing reconfigurable routing logic overhead are key chal-

lenges in improving the energy efficiency of decoder implementations. Several variants of the split-row

technique have been shown to reduce silicon area and routing congestion in fully-parallel decoders [15],

however, they may be suitable only for single-rate codes with high row weights. The split-row decoding

algorithm splits each matrix row into multiple submatrices to perform localized semi-autonomous mini-

mum magnitude computations with a reduction of 0.3dB to 0.7dB in error-correction performance [15].


The half-row decoding schedule folds the traditional layered schedule computations in time to reduce

interconnect wiring [21], but suffers in throughput and introduces more complex data shuffling circuits.

Memory-based decoders enable frame-level pipelining to shorten routing wires and maximize hardware

utilization, but are similarly plagued with barrel shifting networks [91, 92]. Depending on the parity-

check matrix construction, row-merging techniques can be applied to improve decoding throughput [32],

but again suffer from reconfigurable routing logic overhead.

Since internal decoder memories are frequently updated, two refresh-free dynamic memory implemen-

tations have been proposed to improve the energy efficiency of decoder memory. Transistor-based, non-

refresh embedded dynamic random-access memory (eDRAM) can be used instead of embedded static

random-access memory (eSRAM) or flip-flop register files without special process requirements [31].

However, since eDRAM cells operate at a higher supply voltage than standard CMOS logic, an addi-

tional on-chip power grid may be required, thus further increasing SoC design complexity. In addition,

the retention time of eDRAM cells is reduced in newer technology nodes as device leakage becomes more

significant. Refresh-free dynamic standard-cell based memory that operates similar to domino logic can

reduce area overhead in comparison to static memory [93], however, the high threshold-voltage devices

required might affect timing closure and SoC integration complexity. Both the eDRAM and standard-cell

memory techniques are susceptible to PVT variations, which affect read/write timing.

Time-domain signal processing techniques have been explored to minimize power consumption by

reducing the critical path of the minimum magnitude calculation in the CN and summation operation

in the VN [36,94]. Time-domain processing uses digital-to-time converters to produce a staggered, time-

delayed set of inputs, which are then computed upon using simplified logic, where the output is then

converted back to the digital domain via a time-to-digital converter. However, such implementations

have limited feasibility in SoC integration due to back-end timing closure challenges as a result of

metastability and PVT variations in the delay chains and custom clock-tree networks.

While these approaches provide some reduction in overall energy consumption, the fundamental

architectures still rely on long, global, unstructured routing wires to connect processing units, which

continue to execute the same traditional flooding or layered decoding schedules.

2.5.4 Explicit Check and Variable Node Processing Units

The two-phase schedule implied by the belief propagation algorithm allows computations in the check

and variable update phases to be mapped directly to explicit CN and VN processing units. Figure 2.5

presents a dataflow diagram of Lvc and mcv messages between processing unit CN1 and its connected

VN processing units over two successive decoding iterations based on the Tanner graph connectivity in

Fig. 2.2. In each iteration, the explicit CN1 processing unit receives a unique Lvc message from each of

its connected VNs, and then calculates a unique mcv return message for each VN.

[Figure 2.5 graphic: dataflow between processing unit CN1 and processing units VN1, VN2, and VN4 over iterations i and i+1, with VN-to-CN messages L11, L21, L41 and CN-to-VN messages m11, m12, m14 shown for each iteration.]

Figure 2.5: Explicitly defined processing units VN1, VN2, and VN4 connected to processing unit CN1 based on the Tanner graph in Fig. 2.2, with $L_{vc}$ and $m_{cv}$ messages indicated for decoding iterations $i$ and $i+1$.

This canonical algorithm-to-hardware mapping is adopted in most traditional LDPC decoder
implementations, however, it requires all $L_{vc}$ and $m_{cv}$ messages to be routed between the two processing
groups through a congested interconnect network, and also introduces routing permutation logic in both
CN and VN processing units. As a result, the complexity of both the interconnect network and
permutation logic scales with block length. The $m_{cv}$ computation in Step 2 of Algorithm 2 can be simplified
by calculating only the first and second minimum magnitudes from received $L_{vc}$ messages based on the
following Min-Sum simplification:

$$m_{cv} \leftarrow \begin{cases} \big(s_c \times \mathrm{sgn}(L_{vc})\big) \times \mathrm{min2}_c, & \text{if } \mathrm{min1}_c = |L_{vc}| \\ \big(s_c \times \mathrm{sgn}(L_{vc})\big) \times \mathrm{min1}_c, & \text{otherwise} \end{cases} \tag{2.2}$$

$$s_c \leftarrow \prod_{v \in N(c)} \mathrm{sgn}(L_{vc}) \tag{2.3}$$

$$\mathrm{min1}_c \leftarrow \min_{v \in N(c)} |L_{vc}| \tag{2.4}$$

$$\mathrm{min2}_c \leftarrow \min_{v' \in N(c) \setminus v_{\mathrm{min1}}} |L_{v'c}|. \tag{2.5}$$

However, high-order compare-select trees are still required to compute mcv. Pipeline registers are typi-

cally added to ease the critical path timing constraints in the compare-select trees and CN-to-VN routing

interconnect, however, the scalability of the explicitly partitioned architecture remains limited. New ar-

chitectural and scheduling techniques are therefore required to alleviate the global routing and energy

efficiency challenges.
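The simplification in Eqs. (2.2) to (2.5) can be checked against the direct form of Step 2 in Algorithm 2. The following is a minimal sketch, assuming NumPy; the function names and incoming message values are illustrative, and sgn(0) is taken as +1, which is an assumption of this sketch.

```python
import numpy as np

def cn_update_two_min(L_vc):
    """Min-Sum CN update that keeps only the sign product, min1, and min2."""
    L_vc = np.asarray(L_vc, dtype=float)
    signs = np.where(L_vc < 0, -1.0, 1.0)
    mags = np.abs(L_vc)
    s_c = np.prod(signs)                              # Eq. (2.3)
    idx_min1 = int(np.argmin(mags))
    min1 = mags[idx_min1]                             # Eq. (2.4)
    min2 = np.min(np.delete(mags, idx_min1))          # Eq. (2.5)
    # The VN that supplied min1 receives min2; every other VN receives min1.
    out_mag = np.where(np.arange(len(mags)) == idx_min1, min2, min1)
    return s_c * signs * out_mag                      # Eq. (2.2)

def cn_update_direct(L_vc):
    """Reference update: explicitly exclude each VN's own message, as in Step 2 of Algorithm 2."""
    L_vc = np.asarray(L_vc, dtype=float)
    signs = np.where(L_vc < 0, -1.0, 1.0)
    out = np.empty_like(L_vc)
    for i in range(len(L_vc)):
        others = np.delete(np.arange(len(L_vc)), i)
        out[i] = np.prod(signs[others]) * np.min(np.abs(L_vc[others]))
    return out

incoming = [1.5, -0.4, 2.2, -3.0]                     # hypothetical L_vc messages at one CN
print(cn_update_two_min(incoming))
print(cn_update_direct(incoming))
```

Both functions return the same messages, while the first needs only one pass to find the two smallest magnitudes instead of a separate minimum search per outgoing edge.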

Up until this point, this chapter has provided a general background on LDPC codes, their parity-

check matrices, and their decoding algorithms. The previous section described some of the challenges in

implementing decoders in silicon, and motivated the need for a new, scalable architecture for low-power

applications. The remainder of this chapter focuses on the application of LDPC codes for secret key

reconciliation in long-distance CV-QKD. The four steps of the QKD protocol are first described, followed

by the definitions of error rate and secret key rate in a QKD context. Last, this chapter provides an

overview of multi-edge LDPC codes used in long-distance CV-QKD.

2.6 LDPC Decoding in Quantum Key Distribution

This section first provides a fundamental overview of QKD, and then presents the mathematical frame-

work for key reconciliation using LDPC codes over multiple dimensions. Multi-edge LDPC codes are

also introduced to provide context for the discussion in Chapter 4, which focuses on the computational

speedup of the error-correction (reconciliation) algorithm for CV-QKD.

In a QKD system, two remote parties, Alice and Bob, communicate over a private optical quantum

channel, as well as an authenticated classical public channel to generate a shared secret key in the presence

of an adversary or eavesdropper, Eve, who may have access to both channels [38]. The public channel


can be assumed to be the Internet, while the private quantum channel is intended for communication

only between Alice and Bob. Eve may attempt to perform a man-in-the-middle attack either on the

public channel via a replay attack, or on the private optical channel via beam-splitting. The security of

QKD stems from the no-cloning theorem of quantum mechanics, which states that any observation or

measurement of the quantum channel by Eve would disturb the coherent states transmitted from Alice

to Bob [10, 95]. Since Alice and Bob can calibrate their expected channel noise threshold for a fixed

fiber-optic transmission distance prior to being deployed in the field, any quantum measurement by Eve

would result in a channel noise increase, at which point, the reconciliation error rate would increase,

and Alice and Bob could choose to terminate their communication if they suspect a man-in-the-middle

attack [59]. A typical prepare-and-measure CV-QKD system is based on the Grosshans-Grangier 2002

(GG02) protocol [95], which defines the following four steps presented in Fig. 2.6: quantum transmission,

sifting, reconciliation, and privacy amplification. Fully secure QKD networks can be built by designating

intermediate trusted nodes [39,40], or through measurement-device-independent QKD (MDI-QKD) using

untrusted relay nodes in both CV- and DV-QKD [51,96,97]. MDI-QKD is beyond the scope of this thesis,

however, it does provide a viable solution to the quantum hacking problem by removing all detector side

channels [96].

2.6.1 Quantum Transmission and Sifting

To construct a secret key using the prepare-and-measure CV-QKD protocol, Alice first constructs a vector
$A$ consisting of $N_{\text{quantum}}$ coherent states, which she then transmits to Bob over a private optical fiber.
For each of Alice's transmitted states, Bob arbitrarily measures the amplitude or phase quadrature using
an unbiased homodyne detector to construct a vector $B$ of length $N_{\text{quantum}}$. The optical experimental

setup is beyond the scope of this thesis, thus experimental values from previously published works have

been used to characterize the quantum channel [10].

In the remaining post-processing steps of the QKD protocol, Alice and Bob communicate over an

authenticated classical public channel, which is assumed to be noiseless and error-free. Eve may have

access to this channel, however, her eavesdropping does not introduce additional errors [95]. Following

quantum transmission and measurement, Alice and Bob perform a sifting operation to construct two

correlated Gaussian sequences, X0 and Y0, based on the transmitted and measured states. In the

following reconciliation and privacy amplification steps, Alice and Bob apply error correction and hashing

techniques to build a secret key using their sifted sequences of correlated quadrature measurements.

A detailed discussion of the quantum transmission and sifting steps is provided in Appendix A.

2.6.2 Reconciliation

During information reconciliation, Alice and Bob perform the first step in building a unique secret key by:

(1) encoding a randomly-generated message using the sifted quadrature measurements, (2) transmitting

the encoded message over an authenticated classical channel, and then (3) applying an error-correction

scheme to decode the original message [10, 98]. In the direct reconciliation scheme, Alice generates and

transmits a random message to Bob, who then performs the error-correction decoding based on his

measured quadratures. However, previous works have shown that the transmission distance with direct

reconciliation is limited to about 15km [99–101], and is thus not suitable for long-distance CV-QKD

targeting transmission distances beyond 100km [11].


[Figure 2.6 graphic: Alice and Bob connected by a private optical quantum channel (noisy) and an authenticated classical public channel (noiseless), with Eve shown on both channels. Step 1, Quantum Transmission: pulsed light source, generate coherent states (modulation variance VA, repetition rate frep), optical fiber losses (T, ε), homodyne detector losses (η, Vel), randomly select and measure xB or pB quadrature. Step 2, Sifting: disclose selected quadratures, discard unused quadratures, giving X0 and Y0. Step 3, Reconciliation: generate random string S, LDPC encoding (C), compute public message M, compute received message R, LDPC decoding (Ŝ). Step 4, Privacy Amplification: universal hashing to produce the secret key, which is used for one-time pad encryption and decryption of plaintext messages.]

Figure 2.6: CV-QKD model for secret key distillation with reverse reconciliation between Alice and Bob over a private quantum channel and public classical channel.


The long distance problem drives the need for an alternate, robust scheme that is capable of operating

under the low-SNR conditions of the optical channel, even in the presence of excess noise introduced by

an eavesdropper. In the reverse reconciliation scheme, the direction of classical communication between

Alice and Bob is reversed. Reverse reconciliation achieves a higher secret key rate at longer distances in

comparison to direct reconciliation, however, powerful error-correction codes are still required to combat

the high channel noise at long distances without revealing unnecessary information to Eve during the

reconciliation process [11,59,98].

Two-way interactive error-correction protocols such as Cascade or Winnow are not practical for long-

distance QKD due to the large latency and communication overhead required to theoretically minimize

the information leakage to Eve [102–105]. In such interactive protocols, Alice and Bob perform the

error-correction procedure by iteratively exchanging update messages over the public channel until the

key is reconciled. Blind reconciliation using short block-length codes on the order of $10^3$ bits with low

interactivity was proposed to reduce decoding latency [65], however, the short block length is not suit-

able for error-correction at low SNR. Instead, one-way forward error-correction implemented using long

block-length codes with iterative soft-decision decoding is required to achieve efficient error-correction

at low SNR [67, 98]. Jouguet et al. recently showed that multi-edge LDPC codes combined with a

multi-dimensional reverse reconciliation scheme can achieve near-Shannon limit error-correction perfor-

mance at long distances [11, 59]. However, the computational complexity of LDPC decoding remains a

limitation to the maximum achievable secret key rate in a practical QKD implementation [60]. Chap-

ter 4 presents hardware-oriented optimization techniques to alleviate the time-intensive bottleneck of

LDPC decoding for long distance CV-QKD systems, while the remainder of this section outlines the

mathematical framework for long-distance reverse reconciliation.¹

Reconciliation at Long Distances

Strong error-correction schemes do not exist for systems with both a Gaussian input and Gaussian

channel, as in the case of CV-QKD. However, at low SNR, the maximum theoretical secret key rate

is less than 1 bit/pulse per channel use, and the Shannon limit of the additive white Gaussian noise

(AWGN) channel approaches the limit of a binary-input AWGN channel (BIAWGNC) [98]. This makes

binary codes highly suitable for error correction in the low-SNR regime [4,106], as opposed to non-binary

codes, which outperform binary codes on channels with more than 1 bit/symbol per channel use [107].

Since binary codewords can be encoded in the signs of Alice and Bob’s correlated sequences, X0 and

Y0, the reconciliation system can therefore be modelled as a BIAWGNC [11].
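The convergence of the two capacity limits at low SNR can be checked numerically. The following is a minimal sketch, assuming NumPy and a Gauss-Hermite quadrature of the standard BIAWGNC capacity expression $1 - \mathrm{E}[\log_2(1 + e^{-L})]$ with the LLR $L \sim \mathcal{N}(2s, 4s)$ for SNR $s = 1/\sigma^2$; the function names and quadrature order are illustrative choices.

```python
import numpy as np

def awgn_capacity(snr):
    """Shannon capacity of the real-valued AWGN channel, in bits per channel use."""
    return 0.5 * np.log2(1.0 + snr)

def biawgn_capacity(snr, order=64):
    """Capacity of the BIAWGNC with +/-1 inputs and noise variance 1/snr, in bits per channel use."""
    x, w = np.polynomial.hermite.hermgauss(order)     # nodes and weights for weight function exp(-x^2)
    mean, std = 2.0 * snr, 2.0 * np.sqrt(snr)         # LLR statistics given that +1 was sent
    llr = mean + np.sqrt(2.0) * std * x
    expectation = np.sum(w * np.logaddexp(0.0, -llr)) / np.sqrt(np.pi)
    return 1.0 - expectation / np.log(2.0)

snr = 10 ** (-15.47 / 10)                              # CV-QKD operating SNR listed in Table 1.1
c_awgn = awgn_capacity(snr)
c_bi = biawgn_capacity(snr)
print(c_awgn, c_bi, c_bi / c_awgn)
```

At this SNR the binary-input capacity is only marginally below the Gaussian-input capacity, and both sit just above the code rate of 1/50 listed in Table 1.1, whereas at an SNR of 0 dB the gap between the two limits is already noticeable.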

Reverse Reconciliation Algorithm for the BIAWGNC

A model of the BIAWGNC can be induced from the physical parameters that characterize the quantum

transmission [11]. The variance of the optical input signal is normalized based on Alice’s modulation

variance $V_A$, and captured in the form of a signal-to-noise ratio with respect to the optical fiber and

homodyne detector losses. Assuming that the BIAWGNC has zero mean and noise variance $\sigma_Z^2$, i.e.,

$Z \sim \mathcal{N}(0, \sigma_Z^2)$, the SNR can be expressed as $s = 1/\sigma_Z^2$. In order to perform key reconciliation, Alice

and Bob now construct two new correlated Gaussian sequences from their sifted correlated sequences

$X_0$ and $Y_0$ of length $N_{\text{quantum}}$. Alice and Bob first select a subset of $n$ elements from $X_0$ and $Y_0$,

where $n < N_{\text{quantum}}$. Here, $n$ is chosen to be equal to the LDPC code block length. Alice and Bob then

normalize their subset of $n$ elements by the modulation variance $V_A$, such that Alice and Bob now share

correlated Gaussian sequences $X$ and $Y$, each of length $n$, where $X \sim \mathcal{N}(0, 1)$, $Y \sim \mathcal{N}(0, 1 + \sigma_Z^2)$, and

the property $Y = X + Z$ holds [11].

¹ Chen Feng helped develop the mathematical preliminaries for reverse reconciliation.

Bob uses a quantum random number generator to generate a uniformly-distributed random binary

sequence $S$ of length $k$, where $S_i \in \{0, 1\}$. He then performs a computationally inexpensive LDPC

encoding operation to generate an LDPC codeword $C$ of length $n$, where $C_i \in \{0, 1\}$, by appending

$(n-k)$ redundant parity bits to $S$ based on a binary LDPC parity-check matrix $H$ that is also known to

Alice. Eve may also have access to $H$; however, the QKD security proof still holds since Eve is assumed

to have infinite mathematical genius. Bob prepares his classical message to Alice, $M$, by modulating

the signs of his correlated Gaussian sequence $Y$ with the LDPC codeword $C$, such that $M_i = (-1)^{C_i} Y_i$,

where $M_i \in \mathbb{R}$ and $Y_i \in \mathbb{R}$ for $i = 1, 2, \ldots, n$. The symmetry in the uniform distribution of Bob’s random

binary sequence S ensures that the transmission of M over the authenticated classical public channel

does not reveal any additional information to Eve [10].

Assuming error-free transmission over the classical channel, Alice attempts to recover Bob’s codeword

using her correlated Gaussian sequence X based on the following division operation:

\[ R_i = \frac{M_i}{X_i} = \frac{(-1)^{C_i} Y_i}{X_i} = \frac{(-1)^{C_i}(X_i + Z_i)}{X_i} = (-1)^{C_i} + (-1)^{C_i}\frac{Z_i}{X_i}, \tag{2.6} \]

for $i = 1, 2, \ldots, n$. Here, Alice observes a channel with binary input ($\pm 1$) and additive noise $(-1)^{C_i} Z_i / X_i$. In

this case, the division operation in the noise term represents a fading channel; however, since Alice knows

the value of each $X_i$, the norm of $X$ is revealed and the overall channel noise remains Gaussian with

zero mean and variance $\sigma_{N_i}^2 = \sigma_Z^2 / |X_i|^2$ for each $i = 1, 2, \ldots, n$ [11]. Alice then attempts to reconstruct

S by performing the computationally intensive Sum-Product belief propagation algorithm for LDPC

decoding, to remove the channel noise from her received vector R. Sum-Product decoding is preferred

for long-distance CV-QKD, as Min-Sum does not perform well at low SNR [108]. The LDPC decoding

algorithm requires the channel noise variance $\sigma_{N_i}^2$ to be known for each $i = 1, 2, \ldots, n$. By discarding the

$(n-k)$ parity bits from the decoded codeword, Alice can build an estimate $\hat{S}$ of Bob’s original binary

sequence S for further post-processing in the next privacy amplification step to asymptotically reduce

Eve’s knowledge about the secret key [95].
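The exchange described by Eq. 2.6 can be illustrated with a minimal sketch for the $d = 1$ case. The function name `reverse_reconcile_1d` is hypothetical, the LDPC encoder and Sum-Product decoder are omitted, and the log-likelihood ratio $2R_i/\sigma_{N_i}^2$ is the standard channel LLR for a $\pm 1$ input in Gaussian noise, assumed here rather than taken from the thesis.

```python
import random

def reverse_reconcile_1d(codeword, sigma_z, rng):
    """Return Alice's received values R_i and channel LLRs for one frame (Eq. 2.6)."""
    received, llrs = [], []
    for c in codeword:
        x = rng.gauss(0.0, 1.0)            # Alice's normalized sample X_i ~ N(0, 1)
        z = rng.gauss(0.0, sigma_z)        # channel noise Z_i ~ N(0, sigma_z^2)
        y = x + z                          # Bob's correlated sample Y_i = X_i + Z_i
        m = (-1) ** c * y                  # Bob's public message M_i = (-1)^{C_i} Y_i
        r = m / x                          # Alice's division step, R_i = M_i / X_i
        noise_var = sigma_z ** 2 / x ** 2  # sigma_{N_i}^2 = sigma_Z^2 / |X_i|^2
        received.append(r)
        llrs.append(2.0 * r / noise_var)   # assumed channel LLR; positive favours bit 0
    return received, llrs

# Illustrative frame: 16 random bits stand in for an LDPC codeword.
rng = random.Random(0)
codeword = [rng.randint(0, 1) for _ in range(16)]
received, llrs = reverse_reconcile_1d(codeword, sigma_z=0.5, rng=rng)
```

Each LLR is scaled by $X_i^2$, so symbols with a small $|X_i|$ contribute little reliability to the decoder, which is the fading behaviour noted above.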

Multi-Dimensional Reconciliation

Up until this point, the discussion has assumed a 1-dimensional reconciliation scheme in $\mathbb{R}$, with $\pm 1$ binary

inputs on the BIAWGNC. Leverrier et al. showed that the quantum transmission can be extended

to longer distances with proven security by employing multi-dimensional reconciliation schemes constructed

from spherical rotations in $\mathbb{R}^2$, $\mathbb{R}^4$, and $\mathbb{R}^8$, where the multiplication and division operators are

defined [67,98]. These spaces are commonly referred to as the set of complex numbers $\mathbb{C}$, the quaternions

$\mathbb{H}$, and the octonions $\mathbb{O}$, respectively. As shown in Eq. 2.6, the division and multiplication operations

must be defined for the reverse reconciliation procedure. By Hurwitz’s theorem of composition algebras,

normed division is only defined for four finite-dimensional algebras: the real numbers $\mathbb{R}$ ($\mathbb{R}^{d=1}$), the

complex numbers $\mathbb{C}$ ($\mathbb{R}^{d=2}$), the quaternions $\mathbb{H}$ ($\mathbb{R}^{d=4}$), and the octonions $\mathbb{O}$ ($\mathbb{R}^{d=8}$) [109]. Hence, the

remainder of this discussion considers only the $d = 1, 2, 4, 8$ dimensions.

The multi-dimensional approach is a further reformulation of the reduction of the physical Gaussian


channel to a BIAWGNC at low SNR. For $d$-dimensional reconciliation, $d \in \{1, 2, 4, 8\}$, each consecutive

group of $d$ quantum coherent-state transmissions from Alice to Bob can be mapped to the same

BIAWGNC. As a result, the channel noise variance among all $d$ channels is uniform. For the $d = 1$

case, each $R_i$ defined in Eq. 2.6 has a unique channel noise variance defined by $\sigma_{N_i}^2 = \sigma_Z^2 / |X_i|^2$ for

$i = 1, 2, \ldots, n$. For the $d = 2$ case, the reconciliation is performed over successive $(R_{2i-1}, R_{2i})$ pairs:

$(R_1, R_2), (R_3, R_4), \ldots, (R_{n-1}, R_n)$, which are constructed from the quadrature transmission of successive

$(M_{2i-1}, M_{2i})$ pairs for $i = 1, 2, \ldots, n/2$ in $\mathbb{R}^{d=2}$. Similar to $d = 1$, each received value still comprises

a $\pm 1$ binary input and a noise term, such that $R_{2i-1} = (-1)^{C_{2i-1}} + N_{2i-1}$ and $R_{2i} = (-1)^{C_{2i}} + N_{2i}$

for $i = 1, 2, \ldots, n/2$. While the real and imaginary noise components, $N_{2i-1}$ and $N_{2i}$, are not equal,

the variance of the channel noise is uniform over both dimensions, such that $\sigma_{N_{2i-1}}^2 = \sigma_{N_{2i}}^2$ for each

$(R_{2i-1}, R_{2i})$ pair. This can be extended to the $d = 4$ and $d = 8$ cases, where each $d$-tuple of successive $R_i$

values has a unique channel noise for each dimensional component, but the channel noise variance remains

equal over all $d$ dimensions. For example, for $d = 4$, each received 4-tuple, $(R_{4i-3}, R_{4i-2}, R_{4i-1}, R_{4i})$ for

$i = 1, 2, \ldots, n/4$, has a unique noise term for each of its four components, but the channel noise variance

over all four dimensions remains uniform.

The following derivation extends Alice’s message reconstruction calculation presented in Eq. 2.6 to

$d$-dimensional vector spaces, $d \in \{1, 2, 4, 8\}$, where the multiplication and division operators are defined.

The derivation of the channel noise for $d = 2, 4, 8$ is considerably more involved than for $d = 1$; however, the

procedure can be simplified by applying the distributive and cancellation properties that hold

for the complex, quaternion, and octonion algebras. Here, $R$, $M$, $X$, $Y$, and $Z$ are $d$-dimensional

vectors, and $U$ is the $d$-dimensional vector comprised of $(-1)^{C_i}$ components. For example, for $d = 2$,

$U = [(-1)^{C_{2i-1}}, (-1)^{C_{2i}}]$, while for $d = 4$, $U = [(-1)^{C_{4i-3}}, (-1)^{C_{4i-2}}, (-1)^{C_{4i-1}}, (-1)^{C_{4i}}]$. It follows

then that

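\[
\begin{aligned}
R &= M X^{-1} \\
  &= (UY) X^{-1} \\
  &= (U(X + Z)) X^{-1} \\
  &= (UX + UZ) X^{-1} && \text{by left-distributivity } a(b + c) = ab + ac \\
  &= UXX^{-1} + UZX^{-1} && \text{by right-distributivity } (b + c)a = ba + ca \\
  &= U + UZX^{-1} && \text{by right-cancellation } (ab)b^{-1} = a \\
  &= U + \frac{UZX^{*}}{\|X\|^{2}}.
\end{aligned}
\tag{2.7}
\]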

The received vector can be expressed as $R = U + N$, where the multi-dimensional noise for a

BIAWGNC is given by the term $N = (UZX^{*})/\|X\|^2$. The Cayley-Dickson construction can then be

applied to complete the derivation of the multi-dimensional noise $N$ for $d = 2, 4, 8$ [110]. Since the noise

is identically distributed in each dimension, $U$ can be assumed to be the all-zero codeword, i.e., $C_i = 0$

for all $i = 1, 2, \ldots, n$, to further simplify the derivation. For $d = 2$, the channel noise of both the real

and imaginary components can be expressed as $N_{2i-1} = a_i Z_{2i-1} + b_i Z_{2i}$ and $N_{2i} = a_i Z_{2i} - b_i Z_{2i-1}$,

where

\[ a_i = \frac{X_{2i-1} + X_{2i}}{X_{2i-1}^2 + X_{2i}^2} \quad \text{and} \quad b_i = \frac{X_{2i} - X_{2i-1}}{X_{2i-1}^2 + X_{2i}^2} \]

for $i = 1, 2, \ldots, n/2$. It follows then that the channel noise variance for $d = 2$ is given by

$\sigma_{N_{2i-1}}^2 = \sigma_{N_{2i}}^2 = (a_i^2 + b_i^2)\sigma_Z^2$. The noise derivation for $d = 4$ and

$d = 8$ is much longer and is not included in the thesis.
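As a sanity check on the $d = 2$ expressions, the following sketch compares the closed-form noise pair against a direct evaluation of $N = UZX^{*}/\|X\|^2$ using Python's built-in complex numbers. The function names are hypothetical, and the all-zero codeword is assumed so that $U = 1 + j$.

```python
import random

def noise_d2(x1, x2, z1, z2):
    """Closed-form (N_{2i-1}, N_{2i}) for d = 2 with the all-zero codeword."""
    norm_sq = x1 * x1 + x2 * x2
    a = (x1 + x2) / norm_sq
    b = (x2 - x1) / norm_sq
    return a * z1 + b * z2, a * z2 - b * z1

def noise_d2_complex(x1, x2, z1, z2):
    """Same noise evaluated directly as N = U Z X* / ||X||^2 over the complex numbers."""
    u = complex(1.0, 1.0)                 # (-1)^0 in both components
    x = complex(x1, x2)
    z = complex(z1, z2)
    n = u * z * x.conjugate() / abs(x) ** 2
    return n.real, n.imag

def noise_variance_d2(x1, x2, sigma_z):
    """(a^2 + b^2) * sigma_Z^2, which simplifies to 2 * sigma_Z^2 / ||X||^2."""
    norm_sq = x1 * x1 + x2 * x2
    a = (x1 + x2) / norm_sq
    b = (x2 - x1) / norm_sq
    return (a * a + b * b) * sigma_z ** 2

# Illustrative draw of one (X, Z) pair.
rng = random.Random(0)
x1, x2, z1, z2 = (rng.gauss(0.0, 1.0) for _ in range(4))
pair_closed_form = noise_d2(x1, x2, z1, z2)
pair_direct = noise_d2_complex(x1, x2, z1, z2)
```

Both components share the variance $(a_i^2 + b_i^2)\sigma_Z^2$, which depends only on $\|X\|^2$, consistent with the uniform variance across the two dimensions.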


Reconciliation Efficiency

The reverse reconciliation algorithm for the BIAWGNC can be reduced to an asymmetric Slepian-Wolf

source-coding problem with input M and side information X, where Alice and Bob observe correlated

Gaussian sequences $X$ and $Y$, respectively [104,111]. Since Alice must discard $(n-k)$ parity bits from the

linear block code after LDPC decoding, it follows then that the efficiency of the reverse reconciliation

algorithm is given by $\beta = R_{\text{code}} / I(X;Y)$, where $I(X;Y)$ is the mutual information between $X$ and $Y$, and

$R_{\text{code}}$ is the LDPC code rate defined as $R_{\text{code}} = k/n$ from the $n$-length codeword $C$ and $k$-length

random information string $S$ [11, 111]. The mutual information $I(X;Y)$ corresponds to the Shannon

capacity of the quantum channel, hence the reconciliation efficiency can be expressed more simply as:

\[ \beta = \frac{R_{\text{code}}}{C(s)} = \frac{R_{\text{code}}}{R_{\max}}, \tag{2.8} \]

where $C(s) = \frac{1}{2}\log_2(1 + s)$ is the Shannon capacity and $s$ is the SNR of the BIAWGNC. The Shannon

capacity defines the maximum achievable code rate $R_{\max}$ for a given SNR, and thus, the $\beta$-efficiency

characterizes how close the reconciliation algorithm operates to this fundamental limit [106].
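A short numerical sketch of Eq. 2.8 follows. The function names are hypothetical, and the operating point (a rate 0.02 code at an SNR of 0.029) is an illustrative input rather than a result reported in this section.

```python
import math

def shannon_capacity(snr):
    """C(s) = 1/2 * log2(1 + s), in bits per channel use."""
    return 0.5 * math.log2(1.0 + snr)

def reconciliation_efficiency(code_rate, snr):
    """beta = R_code / C(s), as in Eq. 2.8."""
    return code_rate / shannon_capacity(snr)

# Hypothetical operating point: rate 0.02 code at SNR 0.029.
beta = reconciliation_efficiency(0.02, 0.029)
```

For a fixed code rate, lowering the SNR lowers $C(s)$ and therefore raises $\beta$, up to the point where the code can no longer decode reliably.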

The reconciliation efficiency β plays a crucial role in the performance of CV-QKD. The β-efficiency

at a particular SNR operating point determines the code rate, and ultimately, the number of parity

bits discarded in each message. Assuming that the LDPC coding scheme has been optimized for a

particular SNR operating point such that the code rate Rcode is fixed, the reconciliation efficiency

then depends solely on the SNR of the quantum channel, which is a function of Alice’s coherent-state

modulation variance and the physical transmission losses in the optical fiber. Hence, for a fixed optical

transmission distance between Alice and Bob, the reconciliation efficiency can be optimized by tuning

Alice’s modulation variance VA, and designing an optimal error-correction scheme for a target SNR.

Chapter 4 explores how changes in the β-efficiency affect the reconciliation distance and maximum

achievable secret key rate.

Quantum Channel Capacity vs. Coding Channel Capacity

The reconciliation efficiency β and channel capacity C(s) defined in Eq. 2.8 are related to the overall ca-

pacity of the complete QKD system, which has an AWGN channel characterized by the optical quantum

losses and modulation variance. It is also possible to define a different efficiency and capacity related

to the channel coding problem presented in Eq. 2.6, where Alice observes a channel with binary input

($\pm 1$) and additive noise $(-1)^{C_i} Z_i / X_i$. It is important to note here that these two channel capacities and

efficiencies are different, but can easily be related. Up until this point, the key reconciliation problem

has been considered as a single problem; however, for clarity, it should be decomposed into two related

problems: (1) distilling a common message from correlated random sequences X and Y, and (2) chan-

nel coding for a binary-input fast fading channel with channel state information available only at the

decoder. The first problem is an information theory problem, and is independent of the second channel

coding problem.

The information theoretic problem attempts to distill the correlated Gaussian sequence Y, in the

presence of the quantum channel noise Z, as given by the expression Y = X + Z. This problem is

more formally known as “secret key agreement by public discussion from common information” [112].

The efficiency β and channel capacity C(s) defined in Eq. 2.8 are the efficiency and capacity related to


solving the information theoretic problem, where s represents the SNR on the optical quantum channel.

For clarity, let us redefine the overall QKD system efficiency as $\beta_{\text{AWGN}}$ and the capacity as $C_{\text{AWGN}}$.

In the channel coding problem, Alice attempts to recover an encoded codeword $C$ using error-correction

decoding techniques. In Eq. 2.6, the noise represents a fading channel where each $i$th symbol

has a unique SNR, characterized by its unique channel noise variance. For $d = 1$ dimensional reconciliation,

the coding channel noise variance is given by $\sigma_{N_i}^2 = \sigma_Z^2 / |X_i|^2$ for each $i = 1, 2, \ldots, n$, and thus the

coding (fading) channel has an ergodic capacity, which can be expressed as

\[ C_{\text{coding}} = \mathbb{E}\left[ \frac{1}{2} \log_2\left(1 + \frac{1}{\sigma_{N_i}^2}\right) \right]. \]

The ergodic capacity $C_{\text{coding}}$ can be computed by averaging this expression over the per-symbol SNR values $1/\sigma_{N_i}^2$ for $i = 1, 2, \ldots, n$.

It follows then that the channel coding efficiency is given by $\beta_{\text{coding}} = R_{\text{code}} / C_{\text{coding}}$. The overall QKD

system efficiency can then be expressed independent of the code rate as follows:

\[ \beta_{\text{AWGN}} = \beta_{\text{coding}} \frac{C_{\text{coding}}}{C_{\text{AWGN}}}. \tag{2.9} \]

The ergodic capacity of multi-dimensional reconciliation schemes d = 2, 4, 8 can be determined by

applying the same expression for $C_{\text{coding}}$. The expression for $\beta_{\text{AWGN}}$ in Eq. 2.9 holds for $d = 1, 2, 4, 8$

dimensional reconciliation. The remainder of this thesis considers only the overall QKD system efficiency

$\beta_{\text{AWGN}}$, which is herein denoted more simply as $\beta$.
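The relation between the two capacities can be explored with a small Monte Carlo sketch for $d = 1$. The function names, sample count, and $\sigma_Z = 1$ are illustrative assumptions; the estimator averages the capacity expression over random draws of $X_i \sim \mathcal{N}(0, 1)$.

```python
import math
import random

def awgn_capacity(sigma_z):
    """C_AWGN = 1/2 * log2(1 + s) with s = 1 / sigma_Z^2."""
    return 0.5 * math.log2(1.0 + 1.0 / sigma_z ** 2)

def ergodic_coding_capacity(sigma_z, num_samples, rng):
    """Monte Carlo estimate of C_coding = E[1/2 * log2(1 + 1/sigma_{N_i}^2)] for d = 1."""
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        snr_i = x * x / sigma_z ** 2        # 1 / sigma_{N_i}^2 = |X_i|^2 / sigma_Z^2
        total += 0.5 * math.log2(1.0 + snr_i)
    return total / num_samples

def overall_efficiency(code_rate, c_coding, c_awgn):
    """beta_AWGN = beta_coding * C_coding / C_AWGN, as in Eq. 2.9."""
    beta_coding = code_rate / c_coding
    return beta_coding * c_coding / c_awgn

# Illustrative values only.
c_awgn = awgn_capacity(1.0)
c_coding = ergodic_coding_capacity(1.0, 5000, random.Random(0))
beta_awgn = overall_efficiency(0.02, c_coding, c_awgn)
```

By Jensen's inequality the estimate of $C_{\text{coding}}$ should not exceed $C_{\text{AWGN}}$ when $\mathbb{E}[X_i^2] = 1$, so $\beta_{\text{coding}} \geq \beta_{\text{AWGN}}$ for the same code.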

2.6.3 Privacy Amplification

Since Eve may have collected sufficient information during her observations of the quantum and classical

channels, Alice and Bob asymptotically reduce Eve’s knowledge of the key by independently applying

a shared universal hashing function on a concatenated block of their independent binary strings $\hat{S}$ and

$S$, respectively [10, 95]. Each concatenated secret key block is $N_{\text{privacy}}$ bits in length. After hashing, Alice and Bob

can use the resulting symmetric key to encrypt and decrypt messages with perfect secrecy using the

one-time pad technique [39]. Additional details about the privacy amplification step are provided in

Appendix A.

2.6.4 Maximizing Secret Key Rate with Collective Attacks

Assuming perfect error-correction during the reconciliation step, the maximum theoretical secret key

rate for a CV-QKD system with one-way reverse reconciliation can be defined as

\[ K_{\text{opt}} = \beta I_{AB} - \chi_{BE} \quad \text{(bits/pulse)}, \tag{2.10} \]

where $I_{AB}$ is the mutual information between Alice and Bob, $\beta$ is the previously-defined reconciliation

efficiency, and $\chi_{BE}$ is the Holevo bound on the information leaked to Eve [10]. A complete derivation

of the secret key rate with collective attacks is provided in Appendix A. In order to maximize the secret

key rate Kopt for a particular β-efficiency, Alice’s modulation variance VA must be optimally tuned for

each quantum transmission distance to maximize the SNR on the BIAWGNC. A complete discussion on

the optimal choice of VA is provided in Appendix A.

The asymptotic limit on the secret key rate Kopt is based on ideal theoretical security models, and

does not consider the imperfections of a practical CV-QKD system, which might enable additional

side-channel attacks [113]. Such imperfections include the finite-size effects [114–116], excess electronic

and phase noise from uncalibrated optical equipment, as well as discretized Gaussian modulation with


finite bounds on the distribution and randomness [113]. Leverrier proved that CV-QKD with coherent

states provides composable security against collective attacks [117], however, extending the information-

theoretic security proofs from collective attacks to general attacks in the finite-size regime of CV-QKD

is currently an active area of research [116,118]. At the time of writing, the highest CV-QKD key rates

can be achieved using coherent states and homodyne detection with security against collective attacks

and some finite-size effects [59]. The motivation of this thesis is to show that the key reconciliation

(error correction) algorithm can be accelerated such that the throughput of LDPC decoding is higher

than the asymptotic secret key rate achievable using realistic quantum channel parameters and optical

equipment available today. The finite-size effects on secret key rate are considered later in this section,

while the other imperfections of a practical CV-QKD system are beyond the scope of this thesis.

The BIAWGNC model for long-distance CV-QKD under investigation in this thesis has also been

proven secure against collective attacks, thus the expression for the asymptotic secret key rate Kopt still

holds [11, 98]. At long distances, IAB and χBE are nearly equal, thus in order to maximize the secret

key rate, it would appear that the reconciliation efficiency β must also be maximized. However, this is

not necessarily true since Kopt only provides an expression for the maximum achievable secret key rate

and does not consider the speed of reconciliation, nor the uncorrectable errors. The frame error rate

(FER) of the reconciliation algorithm must also be considered.

2.6.5 Upper Bound on Secret Key Rate for a Lossy Channel

Pirandola et al. recently showed that there exists a general upper bound on the secret key rate for a lossy

channel [61]. This fundamental limit is determined by the transmittance T of the fiber-optic channel,

and is given by

\[ K_{\text{lim}} = -\log_2(1 - T) \quad \text{(bits/pulse)}. \tag{2.11} \]

The transmittance is defined as $T = 10^{-\alpha \ell / 10}$, where the distance $\ell$ is expressed in kilometers and the

standard loss of a single-mode fiber optic cable is assumed to be $\alpha = 0.2$ dB/km. The upper bound

versus distance is plotted in Fig. A.2 in Appendix A.
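Eq. 2.11 is simple to evaluate numerically. The sketch below uses hypothetical function names and the stated fiber loss of 0.2 dB/km as the default.

```python
import math

def transmittance(distance_km, alpha_db_per_km=0.2):
    """T = 10^(-alpha * l / 10) for a fiber of length l in kilometers."""
    return 10.0 ** (-alpha_db_per_km * distance_km / 10.0)

def key_rate_upper_bound(distance_km, alpha_db_per_km=0.2):
    """K_lim = -log2(1 - T) in bits/pulse (Eq. 2.11); requires distance_km > 0."""
    return -math.log2(1.0 - transmittance(distance_km, alpha_db_per_km))

# At 100 km the loss is 20 dB, so T = 0.01.
k_lim_100km = key_rate_upper_bound(100.0)
```

At long distances $T \ll 1$, so $K_{\text{lim}} \approx T / \ln 2$ and the bound falls by a factor of ten for every additional 50 km of fiber at 0.2 dB/km.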

2.6.6 Frame Error Rate for Reverse Reconciliation

In reverse reconciliation, Alice attempts to construct a decoded estimate $\hat{S}$ of Bob’s original message $S$

in order to perform privacy amplification and build a secret key. The tree diagram in Fig. 2.7 highlights

four possible decoding scenarios for generating $\hat{S}$ from Alice’s decoded codeword $\hat{C}$.

After LDPC decoding, Alice performs a parity check $\hat{C}H^{\top}$ to verify that her decoded codeword $\hat{C}$ is

valid. When the parity check fails, i.e., $\hat{C}H^{\top} \neq 0$, Alice knows that a decoding error has occurred and

the frame is discarded since it cannot be used to generate a secret key. However, when the parity check

passes, i.e., $\hat{C}H^{\top} = 0$, Alice knows that $\hat{C}$ is a valid codeword, but she does not yet know if $\hat{C}$ is

equal to Bob’s original encoded codeword $C$.

For any binary linear block code, the number of possible codewords is $2^k = 2^{nR_{\text{code}}}$. Thus, for codes

with a long block length $n$, the number of possible codewords grows exponentially, and it is possible

for the decoder to converge to a valid codeword where the decoded message is incorrect, i.e., $\hat{S} \neq S$.

In coding theory, this is referred to as an undetected error. This scenario is problematic for secret key

generation where both parties must share the same message after decoding in order to perform universal

hashing in the next privacy amplification step.


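[Figure 2.7 (decision tree):

Decoded codeword $\hat{C}$

  Parity check pass ($\hat{C}H^{\top} = 0$, $\hat{C}$ valid)

    CRC pass: $\hat{S} = S$ (no error) or $\hat{S} \neq S$ (undetected error)

    CRC fail: $\hat{S} \neq S$ (detected error)

  Parity check fail ($\hat{C}H^{\top} \neq 0$, $\hat{C}$ in error)

    Skip CRC: $\hat{S} \neq S$ (detected error)]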

Figure 2.7: Possible decoding scenarios and error detection techniques.

In order to detect decoding errors when $\hat{C}H^{\top} = 0$, a cyclic redundancy check (CRC) of Bob’s

original message $S$ can be transmitted as part of the frame, and then verified against the computed CRC

of Alice’s decoded message $\hat{S}$. Figure 2.8 presents the components of an LDPC-encoded frame, where

$k$ information bits are comprised of $(k - N_{\text{CRC}})$ message bits and $N_{\text{CRC}}$ CRC bits, followed by $(n-k)$

parity bits to be discarded after LDPC decoding. If the CRC results of $S$ and $\hat{S}$ are equal, then the

decoding is successful and $\hat{S}$ can be used to distill a secret key; otherwise, Alice knows that a decoding

error has occurred and $\hat{S}$ is discarded. The CRC needs to be performed only when the parity check

passes; otherwise, the frame is known to contain an error and the CRC is skipped. A truly undetected

error occurs when both the parity check and CRC pass, but $\hat{S} \neq S$.

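[Figure 2.8 (frame layout): a frame of $n$ bits consists of $k$ information bits, made up of $(k - N_{\text{CRC}})$ message bits followed by $N_{\text{CRC}}$ CRC bits, and then $(n - k)$ parity (redundancy) bits.]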

Figure 2.8: LDPC frame components: message, CRC, and parity bits.

A frame error is said to have occurred when $\hat{S} \neq S$, i.e., when the decoding fails to reproduce the

original message. Both detected and undetected errors contribute to the overall FER. The probability

of frame error is defined as follows:

\[ P_e = P_{\text{detected error}} + P_{\text{undetected error}}. \tag{2.12} \]

From Fig. 2.7, it follows then that the detected and undetected error probabilities are given as

\[ P_{\text{detected error}} = P(\hat{C}H^{\top} \neq 0) + P(\hat{C}H^{\top} = 0 \cap \text{CRC Fail}) \]

\[ P_{\text{undetected error}} = P(\hat{C}H^{\top} = 0 \cap \text{CRC Pass} \cap \hat{S} \neq S). \]

There exists a rare case not shown in Fig. 2.7, where the parity check passes and CRC fails, yet

$\hat{S} = S$. In this case, the error is in the CRC component of the frame. Although the decoded message


is correct, it will be discarded by the decoder due to the failed CRC check. As a result, there is a rare

chance that this frame will be lost and the secret key rate will be reduced. However, this case is not

considered by convention in communication theory [119].
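The decision tree of Fig. 2.7 can be sketched from Alice's point of view as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, $H$ is passed as a list of 0/1 rows, the CRC is the CRC-32 from Python's `zlib` module computed over one byte per bit for simplicity, and the frame layout follows Fig. 2.8 (message, CRC, parity).

```python
import zlib

def parity_check_passes(h_rows, codeword):
    """True when every row of H has even overlap with the codeword (C H^T = 0 over GF(2))."""
    return all(sum(h & c for h, c in zip(row, codeword)) % 2 == 0 for row in h_rows)

def crc32_bits(bits):
    """CRC-32 of a bit list (one byte per bit), returned as 32 bits, MSB first."""
    value = zlib.crc32(bytes(bits))
    return [(value >> shift) & 1 for shift in range(31, -1, -1)]

def classify_frame(h_rows, decoded, k, n_crc=32):
    """Classify a decoded frame using the parity check first, then the CRC."""
    if not parity_check_passes(h_rows, decoded):
        return "detected: parity fail"        # CRC is skipped
    message = decoded[:k - n_crc]
    crc = decoded[k - n_crc:k]
    if crc32_bits(message) != crc:
        return "detected: CRC fail"
    return "accepted"                         # no error, or a truly undetected error
```

Alice cannot distinguish the two outcomes behind "accepted" on her own, which is why the undetected error probability has to be characterized empirically.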

2.6.7 Impact of Reconciliation Error and Efficiency on Secret Key Rate

This thesis investigates the trade-offs in error-correction performance, reconciliation efficiency, reconcili-

ation distance, and secret key rate, by assuming that the physical parameters of the quantum channel are

fixed, and that Alice’s modulation variance VA has been optimally set for each transmission distance and

desired $\beta$-efficiency. In practice, the asymptotic secret key rate $K_{\text{opt}}$ is scaled by the FER since decoded

frames with known error cannot be used to generate a secret key and must therefore be discarded. As

such, the effective secret key rate of a practical CV-QKD system is given by

\[ K_{\text{eff}} = \left(1 - P_{\text{detected error}}\right)\left(\left(1 - P_{\text{undetected error}}\right)\beta I_{AB} - \chi_{BE}\right). \tag{2.13} \]

Alice and Bob can discard frames with detected error, while frames with undetected error further reduce

the mutual information $I_{AB}$ between Alice and Bob. In Chapter 4, it is empirically shown that

$P_{\text{undetected error}} = 0$ using a 32-bit CRC code, thus the total decoding FER can be expressed more simply

as $P_e = P_{\text{detected error}}$. This simplified expression for the FER is assumed for the remainder of this thesis,

and thus the effective secret key rate expression given by Eq. 2.13 can be reduced to

\[ K_{\text{eff}} = (1 - P_e)(\beta I_{AB} - \chi_{BE}). \tag{2.14} \]
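The two expressions for the effective secret key rate can be written as one function. The function name and all numerical inputs below are hypothetical and serve only to exercise the formula; they are not results from this thesis.

```python
def effective_key_rate(beta, i_ab, chi_be, p_detected, p_undetected=0.0):
    """K_eff from Eq. 2.13; with p_undetected = 0 it reduces to Eq. 2.14."""
    return (1.0 - p_detected) * ((1.0 - p_undetected) * beta * i_ab - chi_be)

# Hypothetical inputs in bits/pulse, with a 10% frame error rate.
k_eff = effective_key_rate(beta=0.95, i_ab=0.03, chi_be=0.025, p_detected=0.1)
```

Because $\beta I_{AB}$ and $\chi_{BE}$ are close at long distances, a small drop in $\beta$ can push the bracketed term to zero or below, whereas the FER only scales the result.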

Up until this point, the $\beta$-efficiency has been assumed to be independent of the reconciliation algorithm;

however, as shown in Eq. 2.13, the effective secret key rate $K_{\text{eff}}$ is dependent on both $\beta$ and FER.

Given the set of optimal $V_A$ values and assuming that the physical operating parameters of the quantum

channel remain constant, the BIAWGNC can be induced and described solely in terms of the

SNR at a particular distance with an effective secret key rate $K_{\text{eff}}$. As described further in Chapter 4,

there exists a trade-off between reconciliation distance and effective secret key rate, such that for a single

SNR, one of the following two operating conditions is possible: (1) long distance with a low secret key

rate, or (2) short distance with a high secret key rate. In fact, for a fixed LDPC code rate $R_{\text{code}}$, the

SNR depends only on the reconciliation efficiency and is independent of transmission distance. From

Eq. 2.8, the SNR of a BIAWGNC can be expressed as a function of $\beta$ such that

\[ s(\beta) = 2^{2R_{\text{code}}/\beta} - 1. \tag{2.15} \]
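Eq. 2.15 is the inverse of Eq. 2.8 for a fixed code rate, which the following sketch checks numerically. The function names and the sweep over $\beta$ values are hypothetical.

```python
import math

def snr_for_efficiency(code_rate, beta):
    """s(beta) = 2^(2 * R_code / beta) - 1, as in Eq. 2.15."""
    return 2.0 ** (2.0 * code_rate / beta) - 1.0

def efficiency_at_snr(code_rate, snr):
    """beta = R_code / (1/2 * log2(1 + s)), as in Eq. 2.8."""
    return code_rate / (0.5 * math.log2(1.0 + snr))

# Hypothetical sweep for a rate 0.02 code.
snr_table = {beta: snr_for_efficiency(0.02, beta) for beta in (0.90, 0.95, 0.99)}
```

A higher target efficiency maps to a lower operating SNR, so the code must decode closer to the Shannon limit to achieve it.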

From a code design perspective then, a rate Rcode LDPC code can be designed to achieve a target FER

at a particular SNR. Since Alice and Bob remain stationary once deployed in the field, their transmission

distance remains fixed, and thus an LDPC code, i.e., parity-check matrix H, can be designed independent

of other CV-QKD system parameters to achieve the maximum operating secret key rate over a range

of distances by providing the optimal trade-off between β and FER. The reverse reconciliation problem

can thus be reduced to the simpler model shown in Fig. 2.1 as a result of the BIAWGNC approximation

at low SNR.


2.6.8 Secret Key Rate with Finite-Size Effects

The security of the CV-QKD protocol must account for the finite length of the secret key, which is

generated via universal hashing in the privacy amplification step using a block of length $N_{\text{privacy}}$

bits. Alice constructs her privacy amplification block from her correctly decoded $\hat{S}$ messages, while

Bob constructs his privacy amplification block from his original corresponding $S$ messages. Due to the

finite block size, the secret key rate is reduced by an offset coefficient $\Delta(N_{\text{privacy}})$ and scaling coefficient

$N_{\text{privacy}}/N_{\text{quantum}}$, where $N_{\text{quantum}}$ is the number of symbols sent from Alice to Bob during the first

quantum transmission step. The secret key rate, accounting for finite-size effects, is given by

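\[ K_{\text{finite}} = \left(\frac{N_{\text{privacy}}}{N_{\text{quantum}}}\right)\left(1 - P_e\right)\left(\beta I_{AB} - \chi_{BE} - \Delta(N_{\text{privacy}})\right) \quad \text{(bits/pulse)}. \tag{2.16} \]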

Leverrier et al. showed that $N_{\text{quantum}}$ can be arbitrarily chosen as $N_{\text{quantum}} = 2N_{\text{privacy}}$ [114], and that

when $N_{\text{privacy}} > 10^4$, the finite-size offset factor $\Delta(N_{\text{privacy}})$ can be approximated as

\[ \Delta(N_{\text{privacy}}) \approx 7\sqrt{\frac{\log_2(2/\epsilon)}{N_{\text{privacy}}}}, \tag{2.17} \]

where a conservative choice for the security parameter is $\epsilon = 10^{-10}$ [114]. The LDPC block length $n$ is

not directly included in this expression; however, the LDPC block length does affect the reconciliation

efficiency $\beta$ and FER $P_e$. Chapter 4 presents a study of the optimal privacy amplification block size

Nprivacy for achieving maximum distance.
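Eqs. 2.16 and 2.17 can be combined in a short sketch. The function names are hypothetical, the default $N_{\text{quantum}} = 2N_{\text{privacy}}$ follows the choice stated above, and the remaining inputs are illustrative values rather than measured results.

```python
import math

def finite_size_offset(n_privacy, epsilon=1e-10):
    """Delta(N_privacy) ~ 7 * sqrt(log2(2 / epsilon) / N_privacy), Eq. 2.17."""
    return 7.0 * math.sqrt(math.log2(2.0 / epsilon) / n_privacy)

def finite_key_rate(beta, i_ab, chi_be, p_e, n_privacy, n_quantum=None, epsilon=1e-10):
    """K_finite from Eq. 2.16, in bits/pulse."""
    if n_quantum is None:
        n_quantum = 2 * n_privacy           # choice N_quantum = 2 * N_privacy
    offset = finite_size_offset(n_privacy, epsilon)
    return (n_privacy / n_quantum) * (1.0 - p_e) * (beta * i_ab - chi_be - offset)

# Hypothetical inputs with a privacy amplification block of 10^10 bits.
delta = finite_size_offset(10 ** 10)
k_finite = finite_key_rate(0.95, 0.03, 0.025, 0.1, 10 ** 10)
```

The offset shrinks only as $1/\sqrt{N_{\text{privacy}}}$, so very large privacy amplification blocks are needed before $\Delta$ becomes small relative to $\beta I_{AB} - \chi_{BE}$ at long distances.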

2.7 Multi-Edge LDPC Codes

This thesis builds on the previous work by Jouguet et al., who explored the application of low-rate multi-edge

LDPC codes for reverse reconciliation in long-distance CV-QKD on the BIAWGNC. This section

presents an overview of multi-edge LDPC codes, while new quasi-cyclic construction techniques for multi-

edge LDPC codes are presented in Chapter 4. Multi-edge LDPC codes, first introduced by Richardson

and Urbanke, provide two advantages over standard LDPC codes: (1) near-Shannon capacity error-

correction performance for low-rate codes, and (2) low error-floor performance for high-rate codes [71].

The latter is not a significant concern for long-distance CV-QKD where the reconciliation FER is on

the order of $10^{-1}$; however, the design of a high-performance low-rate code is crucial to achieving high

β-efficiency [11]. Since multi-edge codes can be described by a binary parity-check matrix, they have the

same computational decoding complexity as single edge-type codes. However, given their application

in low-SNR channels, the decoding latency of multi-edge codes is generally higher due to the increased

number of iterations required to converge to a valid codeword at low SNR. This section first briefly reviews

the general construction procedure for an LDPC code, and then explores the multi-edge framework in

more detail.

2.7.1 General Design and Construction of LDPC Codes

An LDPC code of length n can be specified by the number of variable and check nodes, and their

respective degree distributions. The number of edges connected to a vertex in the graph $G$ is called

the degree of the vertex. The degree distribution of $G$ is a pair of polynomials $\omega(x) = \sum_i \omega_i x^i$ and

$\psi(x) = \sum_i \psi_i x^i$, where $\omega_i$ and $\psi_i$ respectively denote the number of variable and check nodes of degree

$i$ in $G$. The performance of tree-like Tanner graphs can be analyzed using a technique called density

evolution [4]. As $n \to \infty$, the error-correction performance of Tanner graphs with the same degree

distribution is nearly identical [4]. Hence, the variable and check node degree distributions can be

normalized to $\Omega(x) = \sum_i \frac{\omega_i}{n} x^i$ and $\Psi(x) = \sum_i \frac{\psi_i}{n-k} x^i$, respectively. The design of binary LDPC codes of

rate $R_{\text{code}}$ and block length $n$ consists of a two-step process. First, find the normalized degree distribution

pair $(\Omega(x), \Psi(x))$ of rate $R_{\text{code}}$ with the best performance. Then, if $n$ is large, randomly sample a Tanner

graph $G$ that satisfies the degree distribution defined by $\omega(x)$ and $\psi(x)$ (up to rounding error), and find

the corresponding parity-check matrix $H$. The random Tanner graph sampling technique is non-trivial

in the design of low-rate codes that approach Shannon capacity at low SNR.
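To make the degree-distribution notation concrete, the sketch below reads $\omega_i$, $\psi_i$ and their normalized forms off a small parity-check matrix. The function name is hypothetical, the (7, 4) Hamming matrix is a toy example that is not used in this thesis, and the number of rows of $H$ is taken as $n - k$, which assumes $H$ has full row rank.

```python
from collections import Counter
from fractions import Fraction

def degree_distributions(h_rows):
    """Return (omega, psi, Omega, Psi) for a parity-check matrix given as 0/1 rows."""
    num_checks = len(h_rows)              # n - k, assuming H has full row rank
    n = len(h_rows[0])
    # omega[i]: number of variable nodes (columns) of degree i
    omega = Counter(sum(row[j] for row in h_rows) for j in range(n))
    # psi[i]: number of check nodes (rows) of degree i
    psi = Counter(sum(row) for row in h_rows)
    big_omega = {deg: Fraction(cnt, n) for deg, cnt in omega.items()}
    big_psi = {deg: Fraction(cnt, num_checks) for deg, cnt in psi.items()}
    return omega, psi, big_omega, big_psi

# Toy example: a (7, 4) Hamming parity-check matrix.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
omega, psi, big_omega, big_psi = degree_distributions(H)
```

The total edge count is the same whether it is summed over variable nodes or over check nodes, which is the constraint any sampled Tanner graph has to satisfy.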

2.7.2 Multi-Edge Framework

The multi-edge framework can be applied to both regular and irregular LDPC codes with uniform and

non-uniform vertex degree distributions, respectively, by introducing multiple edge types into the Tanner

graph specifying the code [71]. In a standard LDPC code, the polynomial degree distributions are limited

to a single edge type, such that all variable and check nodes are statistically interchangeable. In order

to improve performance, multi-edge LDPC codes extend the polynomial degree distributions to multiple

independent edge types with an additional edge-type matching condition [71].

To describe the design of multi-edge LDPC codes, let the potential connections of a variable or

check node be called its sockets. Let the vector $d = (d_1, d_2, \ldots, d_t)$ be a multi-edge node degree of

$t$ types. A node of degree $d$ has $d_1$ sockets of type 1, $d_2$ sockets of type 2, etc. When generating

a Tanner graph, only sockets of the same type can be connected by an edge of that type. Multi-edge

normalized degree distributions are straightforward generalizations based on multi-edge degrees,

\[ \Omega(x_1, x_2, \ldots, x_t) = \sum_{d} \Omega_d \, x_1^{d_1} x_2^{d_2} \cdots x_t^{d_t} \quad \text{and} \quad \Psi(x_1, x_2, \ldots, x_t) = \sum_{d} \Psi_d \, x_1^{d_1} x_2^{d_2} \cdots x_t^{d_t}, \]

where $\Omega_{d_1, d_2, \ldots, d_t}$ and $\Psi_{d_1, d_2, \ldots, d_t}$ are the respective fractions of variable and check nodes with $d_1$ edges of type 1, $d_2$ edges

of type 2, etc. The rate of a multi-edge LDPC code is then defined as $R_{\text{code}} = \Omega(\mathbf{1}) - \Psi(\mathbf{1})$, where $\mathbf{1}$

denotes the all-ones vector with implied length [71].

The multi-edge LDPC code used in this thesis is rate 0.02 with normalized degree distribution

\[ \Omega(x_1, x_2, x_3) = \frac{9}{400} x_1^{2} x_2^{57} x_3^{0} + \frac{7}{400} x_1^{3} x_2^{57} x_3^{0} + \frac{24}{25} x_1^{0} x_2^{0} x_3^{1} \tag{2.18} \]

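\[ \Psi(x_1, x_2, x_3) = \frac{17}{1600} x_1^{3} x_2^{0} x_3^{0} + \frac{3}{320} x_1^{7} x_2^{0} x_3^{0} + \frac{3}{5} x_1^{0} x_2^{2} x_3^{1} + \frac{9}{25} x_1^{0} x_2^{3} x_3^{1}. \tag{2.19} \]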

This degree distribution was designed by Jouguet et al. by modifying a rate 1/10 multi-edge degree

structure introduced by Richardson and Urbanke [11, 71]. For the BIAWGNC, the minimum SNR for

which the tree-like Tanner graph with this multi-edge degree distribution is error free is $2.863 \times 10^{-2}$, or

$-15.47$ dB [11].
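The rate and the edge-type matching condition of this degree distribution can be checked with exact rational arithmetic. In the sketch below the names are hypothetical, and the check-node coefficients are paired with the degrees so that the type-1 socket counts balance (17/1600 on $x_1^3$ and 3/320 on $x_1^7$); each term is stored as a coefficient with its exponent triple $(d_1, d_2, d_3)$.

```python
from fractions import Fraction as F

# (fraction of nodes, (d1, d2, d3)) for each term of the degree distribution
OMEGA = [
    (F(9, 400), (2, 57, 0)),
    (F(7, 400), (3, 57, 0)),
    (F(24, 25), (0, 0, 1)),
]
PSI = [
    (F(17, 1600), (3, 0, 0)),   # paired so that type-1 sockets balance
    (F(3, 320), (7, 0, 0)),
    (F(3, 5), (0, 2, 1)),
    (F(9, 25), (0, 3, 1)),
]

def code_rate(omega, psi):
    """R_code = Omega(1) - Psi(1): sum of variable fractions minus sum of check fractions."""
    return sum(c for c, _ in omega) - sum(c for c, _ in psi)

def sockets_per_type(dist, num_types=3):
    """Sockets of each edge type per variable node, contributed by one side of the graph."""
    return [sum(c * d[t] for c, d in dist) for t in range(num_types)]

rate = code_rate(OMEGA, PSI)
variable_sockets = sockets_per_type(OMEGA)
check_sockets = sockets_per_type(PSI)
```

Under these assumptions the rate evaluates to 1/50 and the variable-side and check-side socket counts agree for all three edge types, which is the condition a sampled Tanner graph must meet before edges of each type can be paired.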

The LDPC parity-check matrices in this thesis were generated by randomly sampling Tanner graphs

that satisfied the multi-edge degree distribution defined by ω(x) and ψ(x), and the edge-type matching

condition. The random sampling technique does not degrade code performance in this case, since the

operating FER is known to be high ($P_e \approx 10^{-1}$). At such high FER, the error-floor phenomenon is

not a significant concern as the code is strictly designed to operate in the waterfall region in order to


achieve high $\beta$-efficiency [120]. The rate 0.02 LDPC codes explored in this thesis target a block length of

$n = 1 \times 10^6$ bits in order to achieve near-Shannon capacity error-correction performance. As a result, the

parity-check matrix $H$ has dimensions $n - k = n(1 - R_{\text{code}}) = 9.8 \times 10^5$ by $n = 1 \times 10^6$. Due to the low

code rate and large block length, the random parity-check matrix construction introduces LDPC decoder

implementation complexity, which directly affects decoding latency and maximum achievable secret key

rate. The LDPC decoder implementation complexity for such a code can be reduced with minimal

degradation in error-correction performance by imposing a quasi-cyclic structure on the parity-check

matrix.

2.8 Summary

This chapter outlined the key challenges of implementing LDPC decoders in silicon, and presented the

mathematical foundation for LDPC decoding in CV-QKD. Chapters 3 and 4 extend these concepts to

new architectural and implementation techniques for both the integrated circuit and long-distance CV-

QKD application areas, respectively. Chapter 3 introduces a new frame-interleaved decoding architecture

targeting low-power, multi-Gb/s integrated circuit applications using short block-length codes for high

SNR channels. Chapter 4 presents a new multi-edge code construction technique to reduce the latency

of GPU-based decoding with long block-length codes for low SNR channels. Both Chapters 3 and 4

demonstrate techniques to exploit the intrinsic structure of LDPC parity-check matrices to improve

decoding performance.

Chapter 3

LDPC Decoder Architecture with

Path-Unrolled Message Passing

The renaissance of LDPC decoders in the early 2000s was primarily fueled by the benefits of Dennard

scaling: increasing transistor speeds and process shrink made high-performance silicon-based LDPC

decoders viable [121]. However, today, in the post-Dennard era, CMOS technology scaling offers dimin-

ishing returns in terms of performance. Today’s digital SoCs are ruled by energy efficiency – a metric

that requires new architectural techniques for the scalable implementation of low-power LDPC decoders.

Traditional LDPC decoder architectures are plagued with routing and message-permutation complex-

ity. Although interconnect scaling stagnated at the 45nm node, the 2016 IEEE International Roadmap

for Devices and Systems (IRDS) forecasts planar 2-dimensional transistor scaling to continue down to

the 10nm node, with 3-dimensional (3D) transistor technologies such as vertical-gate and monolithic-3D

expected to emerge over the next 10-to-15 years [13, 122–124]. This motivates the introduction of a

new low-power design paradigm for multi-Gb/s LDPC decoders based on dark silicon design principles,

where overall power consumption is reduced at the expense of increased silicon area by systematically

powering down inactive logic in order to minimize dynamic switching power [125].

This chapter presents a new LDPC decoder architecture, which addresses the global routing and

scalability problem using a reformulated message-passing schedule to achieve greater computational

parallelism at low clock rate. The proposed architecture exploits the intrinsic structure of the quasi-

cyclic (QC) parity-check matrix by splitting long routing wires into multiple shorter segments to reduce

interconnect delay. QC-LDPC codes are used as a vehicle to illustrate the general approach; however, the

proposed architecture and decoding schedule can also be extended to non-QC codes. The IEEE 802.11ad

standard for multi-Gb/s wireless systems is used as a case study to demonstrate the application of the

proposed architecture; however, the techniques described in this chapter are scalable to longer LDPC

codes for wireline and optical channels.

3.1 Proposed LDPC Decoder Architecture

In a traditional LDPC decoder, update messages are iteratively sent back and forth between CN and

VN processing groups with explicitly defined CN and VN processing units. The proposed decoder archi-

tecture partitions the global message-passing network into structured, local interconnect groups defined

between successive macro-columns of the QC parity-check matrix. A time-distributed decoding schedule

is introduced to exploit the spatial locality of update messages by combining the CN-phase and VN-phase

update logic into a single processing unit with deterministic memory access. The reformulation of the

Min-Sum flooding schedule leverages the QC interconnect structure to minimize routing congestion and

permutation logic overhead, while enabling frame-level parallelism for multi-rate, multi-Gb/s decoding.

3.1.1 Hardware Mapping with Path-Unrolled Decoding Schedule

The proposed decoding schedule unrolls the CN update in time by distributing the CN-to-VN mcv

message computation across all of the VNs that participate in the CN update. This is in contrast to

the traditional approach where all mcv updates for CN c are computed in a single CN processing unit.

Figure 3.1 presents an illustrative example of the interconnect routing patterns and combined processing

unit arrangement for the proposed LDPC decoder architecture. For the QC parity-check matrix defined

in Fig. 3.1(a), the corresponding Tanner graph is shown in Fig. 3.1(b), where messages exchanged between

connected CNs and VNs trace a closed path along the edges of the graph. As highlighted in Fig. 3.1(b),

one such path is defined by the following node sequence: VN0−CN2−VN4−CN2−VN8−CN2−VN0. By

unrolling this path, the intermediate CN2 node can be removed by absorbing and distributing the CN

update operation among its connected VNs. This modification does not alter the result of the Min-Sum

algorithm, but simply introduces a piecewise calculation of each CN-to-VN message defined in Step 2 of

Algorithm 2 (Chapter 2).

Figure 3.1(c) shows that combined processing nodes are arranged in column groups, which correspond

to the macro-columns of the QC parity-check matrix. The combined CN+VN processing units are

indexed according to their VN ordering. Each combined CN+VN processing unit contains the original

VN-update logic to compute Steps 3 and 4 of Algorithm 2, as well as intermediate CN-update logic for

the unrolled CN-to-VN mcv message computation. The interconnect structure between column groups

in the proposed architecture is defined by the cyclic permutation between the connected VNs in adjacent

columns. Each of the partitioned networks between successive column groups is hard wired, such that

all combined CN+VN processing units along each unrolled path are connected to a single CN in the

original Tanner graph. For example, the Tanner graph edges labelled in Fig. 3.1(b) trace the following

unrolled path in the Layer 0 Routing structure shown in Fig. 3.1(c): VN0−VN4−VN8−VN0.

The QC parity-check matrix construction guarantees that each VN is connected to at most one CN

in each layer (macro-row) of the matrix. Hence, each layer of the QC parity-check matrix requires

an independent routing layer in the proposed structure, as shown by the Layer 0 and Layer 1 routing

patterns in Fig. 3.1(c). This layer-parallel approach is a realization of the flooding schedule, and thus,

each combined CN+VN processing unit requires independent CN-update logic for each active processing

layer. For example, in Fig. 3.1(c), the node corresponding to VN0 contains CN-update logic for both

CN2 and CN4 paths in Layer 0 and Layer 1, respectively.

Bypass routing is required between combined processing node groups in layers where an all-zero sub-

matrix appears in the QC parity-check matrix, in order to ensure the continuity of the closed path in the

message-passing interconnect and to guarantee an equal number of column hops in the path traversal.

In the bypass case, the processing node in the successive column is simply included in the path, and

neither CN-update nor VN-update computations are performed. Bypass routing introduces an artificial

edge into the Tanner graph, such that the number of processing nodes in each closed path is equal.
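
The mapping from a QC base matrix to unrolled CN paths, including bypass hops, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the base matrix below is hypothetical (it is not the matrix of Fig. 3.1), a shift value s is assumed to place the 1 of local row r in local column (r + s) mod q, and a bypass hop is assumed to pass through the node with the same local index. The function name `unrolled_paths` is introduced only for this example.

```python
from typing import List, Optional

def unrolled_paths(base: List[List[Optional[int]]], q: int):
    """Return, per layer, the node sequence visited by each CN path.

    base[l][c] is the cyclic shift of the q x q sub-matrix in layer l and
    macro-column c, or None for an all-zero sub-matrix (bypass).
    Assumed convention: shift s puts the 1 of local row r in column (r + s) % q.
    """
    paths = []
    for shifts in base:
        layer_paths = []
        for r in range(q):  # one closed path per CN of the layer
            hops = []
            for col, s in enumerate(shifts):
                if s is None:
                    # Bypass hop: the node is on the path but does no CN/VN work.
                    hops.append((col * q + r, False))
                else:
                    hops.append((col * q + (r + s) % q, True))
            layer_paths.append(hops)
        paths.append(layer_paths)
    return paths

# Hypothetical 2-layer base matrix with 3 macro-columns and q = 3.
BASE = [[0, 1, 2],
        [1, None, 0]]
PATHS = unrolled_paths(BASE, q=3)

for layer, layer_paths in enumerate(PATHS):
    for cn, hops in enumerate(layer_paths):
        text = " -> ".join(
            f"VN{v}" if active else f"(VN{v} bypass)" for v, active in hops
        )
        print(f"layer {layer}, CN path {cn}: {text}")
```

Every path has the same number of hops regardless of how many all-zero sub-matrices the layer contains, which is the property that bypass routing is meant to guarantee.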

The proposed architecture can further be described as a systolic array with homogeneous CN+VN


Figure 3.1: Simplified example of the proposed LDPC decoder architecture, based on: (a) sample two-layer QC parity-check matrix and (b) Tanner graph with one decoding path highlighted. The proposed layer routing patterns and arrangement of combined CN+VN processing units in the systolic array architecture are shown in (c). The closed path given by VN0−CN2−VN4−CN2−VN8−CN2−VN0 in (b) is unrolled in (c) such that CN2 is absorbed into its connected VNs, resulting in the following unrolled path: VN0−VN4−VN8−VN0.


processing units connected to the corresponding VNs in neighboring macro-columns along the unrolled

CN path. Routing complexity is constrained between successive columns by serializing the large fan-

in/fan-out wiring to/from each CN. One decoding iteration requires two passes through the proposed

architecture. The first pass corresponds to the CN-update phase, and the second pass corresponds to

the VN-update phase. The following section describes the mathematical modification to the flooding

Min-Sum algorithm as a result of the path-unrolled message-passing schedule.

3.1.2 Time-Distributed Piecewise Min-Sum Computation

Each closed routing path is defined by the VNs connected to a single CN c, i.e., the VNs in the set

N(c). By absorbing the CN-update operation among all connected VNs, the traditional CN-to-VN mcv

and VN-to-CN Lvc messages defined in Algorithm 2 are not explicitly computed, nor routed through

the proposed structure. Instead, the Lvc and mcv computations are discretized through a reformulation

of the flooding Min-Sum algorithm. Table 3.1 presents one decoding iteration of the piecewise, time-

distributed Min-Sum schedule for a complete CN routing path traversal over T = n/q total columns.

In the first CN-update phase, the sign sc(t), first minimum magnitude min1c(t), and second minimum

magnitude min2c(t) for every CN c are updated sequentially at every column t over T processing-node

hops in T successive columns. Each combined processing unit stores its own Lvc value from the previous

iteration. In each iteration, when the path traversal hop arrives at a particular node in column t, the

internally stored Lvc value is used to update the intermediate sign and minimum magnitudes, which

are then sent to the next successive column t + 1. The path traversal then hops to the next connected

processing node in the successive column, and the updates continue until the last column T − 1 of the

structure is reached, at which point, the final sign sc(T − 1), first minimum magnitude min1c(T − 1),

and second minimum magnitude min2c(T − 1) for CN c are known.
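
The running update of the first pass can be sketched as follows. This is an illustrative Python model, not the hardware implementation: it assumes unnormalized Min-Sum, treats sgn(0) as +1, and uses 15 as the maximum LLR magnitude (the largest value of a 4-bit magnitude). The names `cn_pass` and `cn_direct` are hypothetical. The second function computes the same three quantities in one shot, as a conventional CN processing unit would, so that the two results can be compared.

```python
MAX_MAG = 15  # assumed saturation value of a 4-bit magnitude field

def sgn(x):
    """Sign with sgn(0) taken as +1 (an assumption of this sketch)."""
    return -1 if x < 0 else 1

def cn_pass(lvc_along_path):
    """Hop-by-hop update of (sign, min1, min2) over the T columns of one CN path."""
    sign, min1, min2 = 1, MAX_MAG, MAX_MAG
    for lvc in lvc_along_path:
        mag = min(abs(lvc), MAX_MAG)
        sign *= sgn(lvc)
        # New min1 is the smallest and new min2 the second smallest
        # of {|Lvc(t)|, min1(t-1), min2(t-1)}.
        min1, min2 = sorted((mag, min1, min2))[:2]
    return sign, min1, min2

def cn_direct(lvc_all):
    """Reference: the same quantities computed at once in a single CN unit."""
    mags = sorted(min(abs(x), MAX_MAG) for x in lvc_all) + [MAX_MAG]
    sign = 1
    for x in lvc_all:
        sign *= sgn(x)
    return sign, mags[0], mags[1]

EXAMPLE = [3, -7, 2, -5]  # stored Lvc values met along one path (made-up numbers)
print(cn_pass(EXAMPLE), cn_direct(EXAMPLE))
```

Initializing both minima to the maximum magnitude reproduces the t = 0 row of Table 3.1 without a special case, since the first hop then yields min1 = |Lvc(0)| and min2 = MAX_MAG.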

In the second VN-update phase, the sign and minimum magnitude values that were computed in the

first CN-update pass are not updated any further, but rather held constant and broadcast through the

CN path to all connected processing units over T successive hops. Each processing unit first computes

its own, unique CN-to-VN mcv message based on Eq. 2.2. The computed mcv value is then used to

calculate the intermediate LLR Lv, hard-decision bit Cv, and new VN-to-CN Lvc message according

to the expressions outlined in Steps 3 and 4 of Algorithm 2. The updated Lvc value is stored in the

processing unit’s memory to be used in the next iteration. A piecewise parity computation is performed

in order to eliminate the explicit parity check defined by Step 5 in Algorithm 2. The parity pc(t)

corresponding to CN c is updated sequentially along the closed path, such that the final parity across all

VNs connected to CN c is determined by the last column of the path traversal. The parity result from

the current iteration is then known immediately at the start of the CN-update phase (first pass) of the

next iteration.
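
The work done by one combined processing unit during the second pass can be sketched as below. The code is a hedged illustration: Eq. 2.2 and Algorithm 2 are not reproduced here, so the standard unnormalized Min-Sum rules are assumed (the CN-to-VN magnitude is min2 when the node's own |Lvc| equals min1, otherwise min1; Lv is the channel LLR plus all incoming mcv; the new Lvc excludes the corresponding mcv). Fixed-point saturation is omitted, and the function name `vn_update` is hypothetical.

```python
def vn_update(q_v, lvc_old, layer_msgs):
    """One combined CN+VN node in the VN-update pass.

    q_v        : channel LLR of this VN
    lvc_old    : stored VN-to-CN values from the previous iteration, one per layer
    layer_msgs : per layer, (parity_in, sign, min1, min2) held constant from the CN pass
    Returns the hard decision, the new Lvc values, and the outgoing layer messages.
    """
    mcv = []
    for l_old, (_, sign, min1, min2) in zip(lvc_old, layer_msgs):
        own_sign = -1 if l_old < 0 else 1
        # Exclude this VN's own contribution from the minimum.
        mag = min2 if abs(l_old) == min1 else min1
        # Multiplying by own_sign removes this VN's sign from the total product.
        mcv.append(sign * own_sign * mag)

    l_v = q_v + sum(mcv)              # intermediate LLR
    c_v = 1 if l_v < 0 else 0         # hard-decision bit
    lvc_new = [l_v - m for m in mcv]  # stored for the next iteration

    # Piecewise parity: XOR the local hard decision into each layer's running parity;
    # sign and minima are propagated unchanged.
    msgs_out = [(p ^ c_v, s, m1, m2) for (p, s, m1, m2) in layer_msgs]
    return c_v, lvc_new, msgs_out

# Made-up example: a degree-2 node with messages arriving on two layers.
print(vn_update(4, [2, -3], [(1, 1, 2, 3), (0, -1, 3, 6)]))
```

The running parity enters the first node of a path as 0 in this model, so that after the first hop it equals Cv(0), as in the Variable Update rows of Table 3.1.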

Figure 3.2 presents an illustration of the reformulated decoding procedure for one decoding iteration.

A single layer message for CN2 of the form pc(t), sc(t),min1c(t),min2c(t) is successively consumed and

updated in each column t of T total columns among the T combined CN+VN processing units along the

closed CN2 path. The final value of each of the four components at column T − 1 is given by Equations

2.1, 2.3, 2.4, and 2.5. VN-update logic is inactive during the CN-update pass, while CN-update logic is

inactive during the VN-update pass. In addition, column t = 0 does not necessarily correspond to the

absolute first column of the parity-check matrix, but rather refers to the starting column for a particular

layer message. Since the distributed update computations in each column are independent, multiple


Table 3.1: Piecewise time-distributed reformulation of Min-Sum algorithm with flooding schedule for single layer routing path

Iteration Phase: Check Update (Compute: sc(t), min1c(t), min2c(t))

Column t   Parity pc(t)   Sign sc(t)                 First Minimum Magnitude min1c(t)   Second Minimum Magnitude min2c(t)
0          −              sgn(Lvc(0))                |Lvc(0)|                           |MAX LLR MAGNITUDE|
1          −              sc(0) × sgn(Lvc(1))        min(|Lvc(1)|, min1c(0))            min(|Lvc(1)|, min1c(0), min2c(0) \ min1c(1))
. . .      . . .          . . .                      . . .                              . . .
t          −              sc(t−1) × sgn(Lvc(t))      min(|Lvc(t)|, min1c(t−1))          min(|Lvc(t)|, min1c(t−1), min2c(t−1) \ min1c(t))
. . .      . . .          . . .                      . . .                              . . .
T−1        −              sc(T−2) × sgn(Lvc(T−1))    min(|Lvc(T−1)|, min1c(T−2))        min(|Lvc(T−1)|, min1c(T−2), min2c(T−2) \ min1c(T−1))

Iteration Phase: Variable Update (Compute: pc(t); Propagate: sc(t), min1c(t), min2c(t))

Column t   Parity pc(t)          Sign sc(t)   First Minimum Magnitude min1c(t)   Second Minimum Magnitude min2c(t)
0          Cv(0)                 sc(T−1)      min1c(T−1)                         min2c(T−1)
1          pc(0) ⊕ Cv(1)         sc(T−1)      min1c(T−1)                         min2c(T−1)
. . .      . . .                 . . .        . . .                              . . .
t          pc(t−1) ⊕ Cv(t)       sc(T−1)      min1c(T−1)                         min2c(T−1)
. . .      . . .                 . . .        . . .                              . . .
T−1        pc(T−2) ⊕ Cv(T−1)     sc(T−1)      min1c(T−1)                         min2c(T−1)


Figure 3.2: (a) The closed path through CN2 in the Tanner graph for one pass (phase) of decoding. (b) The unrolled piecewise messages that are passed between combined CN+VN processing units in successive columns of the architecture corresponding to the closed path highlighted in (a). Here, t = 0 arbitrarily corresponds to the third column of T = 3 total columns.

frames can be interleaved in the proposed structure to ensure a constant workload over the uniformly

partitioned processing nodes to maximize hardware utilization and minimize idle logic. The pipelined

frame interleaving pattern is discussed in Section 3.1.5.

3.1.3 Parity-Check Matrix Partitioning and Hardware Mapping

This section describes how the QC-LDPC parity-check matrices for IEEE 802.11ad are partitioned and

mapped to the proposed architecture. The IEEE 802.11ad standard specifies a fixed block length of

672 bits for four code rates, and 24 modulation and coding scheme (MCS) modes with decoded (raw)

bit rates between 385Mb/s and 6.757Gb/s. The hardware mapping described in this section targets

the peak throughput modes of IEEE 802.11ad, and the LDPC decoder in this thesis is designed for the

maximum 6.757Gb/s throughput requirement with a maximum frame latency of 1µs for all four code

rates.

As shown in Fig. 3.3, the QC-LDPC parity-check matrix for each of the four code rates can be derived

from a single 8-layer base matrix, by decreasing the sparsity of higher-rate matrices by removing layers

and adding non-zero sub-matrices. Each of the 16 macro-columns of the QC-LDPC parity-check matrix


Figure 3.3: IEEE 802.11ad QC parity-check matrices with hardware mapping for proposed architecture [23]. The sub-matrix value indicates the cyclic permutation index. The four matrices are derived from a single 8-layer base matrix by removing layers in higher-rate matrices, or by removing cyclically-shifted submatrices in lower-rate matrices.


Figure 3.4: System block diagram for proposed architecture showing the global control unit, and the datapath containing: column slices with combined CN+VN processing units and memories, a hard-wired cyclic permutation network between each column slice, and pipeline registers between column-slice pairs.

Figure 3.5: Column slice t comprised of CN+VN processing units, local memory, and wired permutation networks between adjacent column slices. Pipeline registers are connected only to the first column in a column-slice pair. Hard-wired interconnect does not contain multiplexing logic. Hard-wired connections are specified by the parity-check matrix connectivity. The operations in column slice t are computed in one clock cycle.


maps directly to a single column slice in the proposed architecture shown in Fig. 3.4. Each column slice

contains the memories and combined CN+VN processor logic for one macro-column of the matrix. The

VN ordering in each column slice is consistent with the VN ordering in the matrix macro-column, such

that each column slice contains 42 combined CN+VN processing units. Adjacent columns are connected

through hard-wired routing in each of the 8 defined connection layers, which correspond to 8×42=336

CN paths. The connectivity between successive columns is specific to the parity-check matrix, i.e.,

the interconnect mapping between CN+VN processors in adjacent column slices is not any-to-any, but

rather, is mandated by the column-to-column connectivity defined by the parity-check matrix. Hence,

there is no multiplexer fanout between connected CN+VN processors in adjacent columns along the

same CN path. Depending on the code rate, only a subset of layers may be active. The rate 1/2 code

requires all 8×42=336 CN paths, while the rate 13/16 code requires only 3×42=126 CN paths. Inactive

layers/paths are disabled (turned off) through clock gating to eliminate unnecessary logic switching and

message passing. Each column in the four parity-check matrices shown in Fig. 3.3 has at most 4 active

CN connections, i.e., the maximum VN degree is 4, thus each combined processing unit requires at most

4 processing layers. Processing nodes in the last 4 columns require only 1, 2, or 3 processing layers, due

to the lower-triangular matrix construction.
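
The layer and path counts quoted above follow directly from the code rate, as the short calculation below illustrates. It assumes that each parity-check matrix has exactly n − k rows, so that the number of active connection layers is n(1 − R)/q; the function name is hypothetical.

```python
from fractions import Fraction

N, Q = 672, 42          # block length and expansion factor from the text
TOTAL_LAYERS = 8        # wired connection layers of the base matrix
RATES = [Fraction(1, 2), Fraction(5, 8), Fraction(3, 4), Fraction(13, 16)]

def connection_layers(rate):
    """Macro-rows (layers) of the QC matrix, assuming it has n - k rows."""
    parity_bits = N * (1 - rate)          # n - k, kept exact with Fraction
    assert parity_bits % Q == 0
    return int(parity_bits // Q)

LAYERS = {r: connection_layers(r) for r in RATES}
CN_PATHS = {r: LAYERS[r] * Q for r in RATES}

for r in RATES:
    print(f"rate {r}: {LAYERS[r]} active layers, {CN_PATHS[r]} CN paths, "
          f"{TOTAL_LAYERS - LAYERS[r]} wired layers clock-gated")
```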

The proposed architecture exploits the QC structure of the IEEE 802.11ad LDPC parity-check ma-

trices by constraining routing to local interconnect between adjacent columns. Since each of the four

parity-check matrices is derived from a single base matrix, multi-rate functionality is intrinsic to the

hard-wired cyclic permutation networks between adjacent columns. The same wiring is used between

successive columns for multiple code rates, thus eliminating the need for additional permutation and

control logic to switch between code rates. As the decoder switches from a low-rate code to a high-rate

code, e.g., from rate 1/2 to rate 3/4, the unused wired layers of the high-rate code are disabled. The

decoder continues to pass messages along the active processing layers, and hardware utilization in the

combined processing nodes remains at either 100% (for rate 1/2, 5/8, and 3/4 codes) or 75% (for the

rate 13/16 code) since the four rates have either 3 or 4 active processing layers.

3.1.4 Column Slice Architecture

Figure 3.5 presents the architecture of a single column slice, which contains local memories and processing

nodes for the VNs associated with that particular macro-column. Table 3.2 provides an overview of

the messages exchanged between the combined processing units and local memories in the column slice,

based on the labelling in Fig. 3.5. Table 3.3 provides a parametric overview of each column-slice memory.

These configurations are specific to the IEEE 802.11ad standard, and will vary depending on the CN

and VN connectivity defined by the parity-check matrix. All column-slice memories are simple dual-

port eSRAM register files with 1 read port and 1 write port. In each column slice, input LLRs are

buffered in to the LDPC decoder through the Qv memory write port, while output hard decisions are

buffered out through the Cv memory read port. Outgoing update messages in column slice t are of the

form pc(t), sc(t),min1c(t),min2c(t). These messages are computed based on incoming messages from

the previous column slice t − 1 and values stored in the column-slice memories. In both the CN- and

VN-update phases, outgoing update messages from column slice t are computed in one clock cycle.

The number of instantiated CN+VN processing units is always constant and equal to the expansion

factor q of the QC parity-check matrix. In this case, q = 42 processing units are instantiated in each

column slice, while the number of instantiated processing layers ranges from 1 to 4 depending on the


column slice index. Column slices corresponding to the last four macro-columns of the QC parity-check

matrix require fewer hardware resources (memory and routing) than the slices for the first twelve columns.

The fixed-point bit width of internal column-slice messages is chosen to be 5 bits in this implementation

as there is less than 0.01dB degradation in fixed-point error-rate performance of the Min-Sum algorithm

in comparison to floating point across all four code rates. The min1c(t) and min2c(t) messages are both

4 bits wide, while the pc(t) and sc(t) are 1 bit each.
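
To make the message format concrete, the sketch below packs one layer message into a 10-bit word and quantizes an LLR to 5 bits. Several details are assumptions made for illustration only: the field order within the word, the sign-magnitude representation (1 sign bit plus a 4-bit magnitude), and the quantization step size. None of the function names come from the implementation described here.

```python
def pack_layer_msg(parity, sign_bit, min1, min2):
    """Pack (p, s, min1, min2) into 1 + 1 + 4 + 4 = 10 bits (field order assumed)."""
    assert parity in (0, 1) and sign_bit in (0, 1)
    assert 0 <= min1 <= 15 and 0 <= min2 <= 15
    return (parity << 9) | (sign_bit << 8) | (min1 << 4) | min2

def unpack_layer_msg(word):
    """Inverse of pack_layer_msg."""
    return (word >> 9) & 1, (word >> 8) & 1, (word >> 4) & 0xF, word & 0xF

def quantize_llr(x, step=1.0):
    """Assumed 5-bit sign-magnitude quantizer: returns (sign bit, 4-bit magnitude)."""
    mag = min(int(round(abs(x) / step)), 15)   # saturate at the largest magnitude
    return (1 if x < 0 else 0), mag

word = pack_layer_msg(1, 0, 3, 15)
print(f"{word:010b}", unpack_layer_msg(word), quantize_llr(-20.3))

# Wires crossing one column boundary: 10 bits x 42 paths x 8 connection layers.
WIRES_PER_BOUNDARY = 10 * 42 * 8
print(WIRES_PER_BOUNDARY)
```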

Table 3.2: Column-slice messages highlighted in Fig. 3.5

Label   Message Description                                           Format
A, B    Input channel LLRs for VNs in current column slice t          5-bit message: Qv × 42 VNs
C       Incoming connection layer messages from column slice t−1      10-bit message: pc(t−1), sc(t−1), min1c(t−1), min2c(t−1) × 42 VNs × 8 connection layers
D       Computed intermediate VN-to-CN messages in column slice t     5-bit message: Lvc × 42 VNs × (1-to-4) processing layers
E       Outgoing connection layer messages from column t to t+1       10-bit message: pc(t), sc(t), min1c(t), min2c(t) × 42 VNs × 8 connection layers
F, G    Output hard decisions for VNs in current column slice t       1-bit decision: Cv × 42 VNs

Table 3.3: Memory specification in each column slice

Memory   Depth (Frames)   Data Bus Width (Bits)                               Total Size (Kb)   Column Slices
Qv       16               42 VNs × 5 bits/message = 210                       3.360             Columns 0-15
Cv       16               42 VNs × 1 bit/decision = 42                        0.672             Columns 0-15
Lvc      8                42 VNs × 4 layers∗ × 5 bits/message = 840           6.720             Columns 0-11
Lvc      8                42 VNs × 3 layers∗ × 5 bits/message = 630           5.040             Columns 12-13
Lvc      8                42 VNs × 2 layers∗ × 5 bits/message = 420           3.360             Column 14
Lvc      8                42 VNs × 1 layer∗ × 5 bits/message = 210            1.680             Column 15

∗ Active processing layers in combined CN+VN processing unit.
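
The entries of Table 3.3 can be reproduced as depth × bus width, with 1 Kb taken as 1000 bits. The following sketch recomputes them per column slice; the helper names are hypothetical, and the aggregate it prints is simply the sum of the per-slice sizes over all 16 slices.

```python
Q = 42          # VNs per column slice
LLR_BITS = 5    # fixed-point message width

def lvc_layers(col):
    """Processing layers per column slice, as listed in Table 3.3."""
    if col <= 11:
        return 4
    if col <= 13:
        return 3
    return 2 if col == 14 else 1

def slice_memories(col):
    """(depth in frames, bus width in bits) for each memory of one column slice."""
    return {
        "Qv": (16, Q * LLR_BITS),
        "Cv": (16, Q * 1),
        "Lvc": (8, Q * lvc_layers(col) * LLR_BITS),
    }

def size_bits(depth, width):
    return depth * width

for col in (0, 12, 14, 15):
    sizes = {name: size_bits(*dw) / 1000 for name, dw in slice_memories(col).items()}
    print(f"column slice {col}: {sizes} Kb")

TOTAL_BITS = sum(
    size_bits(*dw) for col in range(16) for dw in slice_memories(col).values()
)
print(f"all 16 column slices: {TOTAL_BITS} bits")
```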

Coarse-grained clock gating control logic is integrated in each column-slice pair to disable the pipeline

registers and local memories from switching when the current frame has successfully been decoded. Four

independent integrated clock gating cells are used in each column-slice pair to disable the input pipeline

register bank to column t, and the Qv, Cv, and Lvc memories in columns t and t + 1, as highlighted

by labels H, I, K, and L in Fig. 3.5, respectively. Label J in Fig. 3.5 shows the gated clock for a

pipeline register in the Lvc memory write path in the VN-update logic in each combined processing

unit. The clock gating pattern is identical for both columns t and t + 1 in a column-slice pair due to

the pipelined frame interleaving described next. When clock gating is active, there is no switching in


the combinational logic that comprises the CN+VN processing units, and thus, the pair of column slices

between two successive pipeline stages is effectively turned off in that clock cycle. This technique is part

of the early-termination strategy described later in this section.

The CN and VN logic in the combined CN+VN processor share the message-passing interface. The

CN logic is turned off during the VN-update phase, and vice versa. While there is an area penalty, there is

also a power saving benefit as inactive logic is completely turned off in the phase where it is inactive. The

latency specification of the IEEE 802.11ad standard is met with this single-phase operation. To reduce

latency even further, CN and VN logic can be run in parallel with two independent frames in each column

slice. However, this would introduce additional control complexity to track the phase of independent

frames, multiplexer overhead in the combined CN+VN processing units to multiplex between the two

anti-phase frames, and twice the routing between successive column slices to support message passing

for both CN and VN phases simultaneously. Phase-interlaced decoding was not implemented in this

thesis to avoid these specific issues.

3.1.5 Pipelined Frame Interleaving

The constant workload of the time-distributed decoding schedule enables message pipelining between

successive column-slice pairs, since the number of columns in the proposed architecture is fixed for all

code rates. As shown in Fig. 3.4, pipeline registers are placed after every two columns instead of after

every single column to satisfy the 1µs total decoding latency requirement of the IEEE 802.11ad standard,

assuming 10 decoding iterations and a 200MHz clock rate. As such, a single frame is shared by columns

t and t+ 1 in a column-slice pair. Since the computation in each column is independent and a constant

number of hops is required to traverse each closed path, 8 independent frames can be interleaved in the

structure without any memory access contention.
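
The choice of one pipeline register bank per two column slices can be checked with a short calculation. The sketch below assumes that every pipeline stage takes one clock cycle and that each pass (CN update, then VN update) visits every stage once, which matches the 8-cycle passes of this design; the function name is hypothetical, and the model ignores the fact that grouping more columns per stage lengthens the combinational path and would limit the clock rate.

```python
def decode_latency(columns, cols_per_stage, iterations, f_clk_hz):
    """Return (pipeline stages, decoding latency in seconds) for a register spacing."""
    stages = columns // cols_per_stage      # one clock cycle per stage (assumed)
    cycles_per_iteration = 2 * stages       # CN-update pass + VN-update pass
    return stages, iterations * cycles_per_iteration / f_clk_hz

for cols_per_stage in (1, 2):
    stages, latency = decode_latency(16, cols_per_stage, 10, 200e6)
    verdict = "meets" if latency <= 1e-6 else "exceeds"
    print(f"{cols_per_stage} column(s) per stage: {stages} stages, "
          f"{latency * 1e6:.2f} us, {verdict} the 1 us budget")
```

Under these assumptions, registering after every single column doubles the number of cycles per iteration and pushes the latency beyond the 1µs requirement at 200MHz, whereas registering after every second column stays within it.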

Figure 3.6 presents the frame interleaving schedule where 8 interleaved frames are cyclically pipelined

across 16 column slices. Both CN- and VN-update phases require 8 clock cycles to complete, thus a

total of 16 cycles is required to complete a single decoding iteration across all 8 interleaved frames. Since

each frame is independent, the frame-interleaved decoding schedule ensures full hardware utilization of

all processing units without any pipeline stall cycles.
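
A cyclic schedule of this kind can be modelled in a few lines. The rotation rule used below, frame = (cycle − pair) mod 8, is an assumption chosen to be consistent with a cyclically pipelined structure; the actual frame-to-pair assignment at cycle 0 is defined by Fig. 3.6. The names are hypothetical.

```python
PAIRS = 8     # column-slice pairs (16 column slices)
FRAMES = 8    # interleaved frames

def frame_in_pair(cycle, pair):
    """Frame index held by a column-slice pair in a given clock cycle (assumed rotation)."""
    return (cycle - pair) % FRAMES

# 16 clock cycles = one decoding iteration (CN pass followed by VN pass).
SCHEDULE = [[frame_in_pair(c, p) for p in range(PAIRS)] for c in range(16)]

for cycle, row in enumerate(SCHEDULE):
    phase = "CN" if cycle < 8 else "VN"
    print(f"cycle {cycle:2d} ({phase} pass of frame {row[0]} starts in pair 0): {row}")
```

In every cycle the eight pairs hold eight different frames, so no two pairs address the same frame at once, and each frame advances by exactly one pair per clock cycle.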

For a fixed number of iterations, the minimum decoding throughput and maximum latency of a

frame-interleaved LDPC decoder are given by the following two equations:

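$$\text{Throughput} = \frac{\text{Frames} \times \text{Block Length} \times f_{clk}}{\text{Iterations} \times \text{Cycles Per Iteration}} \quad \text{(bits/s)}$$

$$\text{Latency} = \frac{\text{Iterations} \times \text{Cycles Per Iteration}}{f_{clk}} \quad \text{(seconds)}.$$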

For a block length of 672 bits, the proposed LDPC decoder achieves a throughput of 6.78Gb/s and an

acceptable latency of 0.793µs with 8 interleaved frames, while operating at a clock frequency of 202MHz

with 10 decoding iterations. This performance satisfies the peak throughput (6.757Gb/s) and maximum

latency (1µs) requirements of the IEEE 802.11ad standard.
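
The two expressions can be evaluated directly for the operating point quoted above. The sketch below is a plain transcription of the formulas; the function names are hypothetical, and the 6.757Gb/s and 1µs limits are the requirements stated in Section 3.1.3.

```python
def throughput_bps(frames, block_length, f_clk_hz, iterations, cycles_per_iteration):
    """Decoded bits per second for a frame-interleaved decoder."""
    return frames * block_length * f_clk_hz / (iterations * cycles_per_iteration)

def latency_s(iterations, cycles_per_iteration, f_clk_hz):
    """Time to decode one batch of interleaved frames."""
    return iterations * cycles_per_iteration / f_clk_hz

TP = throughput_bps(frames=8, block_length=672, f_clk_hz=202e6,
                    iterations=10, cycles_per_iteration=16)
LAT = latency_s(iterations=10, cycles_per_iteration=16, f_clk_hz=202e6)

print(f"throughput = {TP / 1e9:.3f} Gb/s, latency = {LAT * 1e6:.3f} us")
print("throughput requirement met:", TP >= 6.757e9)
print("latency requirement met:", LAT <= 1e-6)
```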

Frame sequencing control is intrinsically embedded in the hard-wired interconnect between adjacent

column slices. The cyclic memory addressing pattern in each column slice results in conflict-free memory

access and eliminates the need for additional control overhead. In every clock cycle, the dual-ported

channel LLR, hard decision, and VN-to-CN message memories in column slices t and t + 1 share the


Figure 3.6: Pipelined frame interleaving pattern through column slices in the proposed architecture over 16 clock cycles of one complete LDPC decoding iteration for IEEE 802.11ad. The number in each bubble indicates the frame index. Frame 4 highlights the cyclic frame-shifting property of the architecture.

same read/write address, which corresponds to the index of the current frame in the column-slice pair.

The independent frame processing among column-slice pairs allows frames of different code rates to be

decoded simultaneously, achieving the same throughput with minimal bypass routing control overhead,

since the primary rate control mechanism is embedded in the hard-wired cyclic interconnect between

adjacent columns. The architecture presented in this thesis can thus be classified as both row-parallel

and column-parallel.
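
The conflict-free cyclic addressing can be illustrated with a small model. The sketch below assumes that frame j starts in column-slice pair j at cycle 0 and advances by one pair per clock cycle; the shift direction and the function name are assumptions made for illustration, not details taken from the implementation.

```python
NUM_FRAMES = 8     # interleaved frames
NUM_SLICES = 16    # column slices, grouped into 8 column-slice pairs


def frame_in_slice(column_slice, cycle):
    """Frame index (the shared read/write address) seen by a column slice in a given cycle."""
    pair = column_slice // 2              # slices t and t+1 form one column-slice pair
    return (pair - cycle) % NUM_FRAMES    # assumed: frames advance by one pair per cycle


for cycle in range(3):
    print(cycle, [frame_in_slice(t, cycle) for t in range(NUM_SLICES)])
```

In every cycle the 8 pairs hold 8 distinct frame indices, so no two pairs ever address the same frame, which is the property that removes the need for additional arbitration logic.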

3.1.6 Input/Output Frame Buffering for Continuous Decoding

The channel LLR and hard decision memories are dual-ported to enable input/output (I/O) frame

buffering such that the decoder runs continuously without any idle cycles. Figure 3.7 presents a timing

diagram of the pipelined I/O frame buffering schedule, highlighting the following three steps: loading

channel LLRs for the next 8 frames, decoding the current 8 frames, and reading out hard decisions for

the previous 8 frames. The I/O latency is masked by the compute latency of the decoder since the next

8 frames are buffered into the channel LLR Qv memory while the current 8 frames are being decoded.

Once the current 8 frames terminate, the decoder restarts the decoding process with the next 8 frames

already loaded in Qv memory. Decoded codewords in the hard decision Cv memory are buffered out

while the 8 new frames are decoded.


[Figure 3.7 graphic: timing diagram in slots of 160 clock cycles (10 iterations), overlapping the LOAD Qv, DECODE, and READ Ĉv steps for successive sets of 8 frames (frames 0 to 7, 8 to 15, 16 to 23, and 24 to 31).]

Figure 3.7: Input/output frame buffering schedule, assuming a uniform decoding latency of 10 iterations with 16 clock cycles per iteration.

As shown in Table 3.3, both the channel LLR Qv and hard decision Cv memories in a single column

slice have twice the address depth compared to the VN-to-CN Lvc memory. The Qv and Cv memories

have a depth of 16 addresses to accommodate the current 8 frames and the next 8 frames in the decoding

queue for all 42 processing nodes, while the Lvc memory stores only intermediate updates for the current

8 frames cycling through the decoder. In the column slice architecture presented in Fig. 3.5, label A

shows the input buffer LLR write port used to load the next 8 frames, label B shows the LLR read

port accessed during decoding of the current 8 frames, label F shows the hard decision write port of the

current 8 decoding frames, and label G shows the output hard decision read port of the 8 previously

decoded frames. This overlapped input/output frame buffering schedule enables continuous decoding

that does not interrupt the frame-interleaved decoding schedule, thus enabling multi-Gb/s throughput

with acceptable latency for the IEEE 802.11ad standard.
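
One way to picture the depth-16 Qv and Cv memories is as two banks of 8 addresses that swap roles after every set of frames. The text does not state how the 16 addresses are partitioned, so the bank split and the function below are hypothetical and only illustrate that the load and decode ports never touch the same address.

```python
FRAMES_PER_SET = 8


def qv_addresses(frame_set, frame):
    """Hypothetical ping-pong mapping for a depth-16 Qv (or Cv) memory."""
    decode_bank = frame_set % 2                         # bank holding the frames being decoded
    load_bank = 1 - decode_bank                         # other bank receives the next 8 frames
    read_addr = decode_bank * FRAMES_PER_SET + frame    # read port used during decoding
    write_addr = load_bank * FRAMES_PER_SET + frame     # write port used for input buffering
    return read_addr, write_addr


print([qv_addresses(0, f) for f in range(FRAMES_PER_SET)])
```

Under this assumption the address written while decoding set n is exactly the address read while decoding set n+1, which is what allows decoding to restart without idle cycles.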

3.1.7 Combined CN+VN Processing Unit Architecture

Figure 3.8 presents the architecture of the combined CN+VN processing unit, where CN- and VN-update

logic blocks share the memory and layer-message routing interfaces. The CN- and VN-update logic is

partitioned independently such that VN logic is disabled (turned off) during the CN phase, and vice

versa. One clock cycle is required to perform either the CN or VN update in column slice t. As such,

there is no pipelining in the combined CN+VN processing unit, aside from the re-timing register in the

Lvc memory write path.

In every clock cycle, the combined CN+VN processing unit in column t first receives an incoming layer

message pc(t−1), sc(t−1), min1c(t−1), min2c(t−1) from its connected processing node in the previous

column slice t−1. The CN and VN updates are performed using the elements of the incoming layer

message and the internally stored Lvc and Qv values. The updated pc(t), sc(t), min1c(t), min2c(t) layer

message is then transmitted to its connected processing unit in column t+1. In the CN phase, stored Lvc

values for each active processing layer are read in parallel from the Lvc memory. The magnitude of each

Lvc value is compared to the first and second minimum magnitude values min1c(t−1) and min2c(t−1)

of the incoming layer message, while the sign of each Lvc value is compared to the sign element sc(t−1)

of the incoming layer message. The updated sc(t), min1c(t), and min2c(t) values are transmitted to

the next column slice. In the first decoding iteration, the Lvc values are not initialized, and thus the

Qv LLR memory is read instead. In the VN phase, the Lvc message for each active processing layer,

hard decision Cv bit, and parity message element pc(t) are updated based on the computed minimum


[Figure 3.8 graphic: block diagram of the combined CN+VN processing unit. The CN update phase logic (compare and select of minimum magnitudes, sign update) and the VN update phase logic (mcv computation, Lv and Lvc update, hard decision Ĉv, parity update, early termination check) share the VN-to-CN Lvc intermediate message memory (sign-magnitude format), the channel LLR Qv memory (2's complement format), and the hard decision Ĉv memory. Layer messages (parity, sign, min1, min2 for each layer; 8 layers × 10 bits) are received from the previous column and the updated layer messages are sent to the next column; messages are processed for 1-to-4 active layers.]

Figure 3.8: Combined CN+VN processing unit for time-distributed piecewise decoding, showing CN- and VN-update phase logic, memory interfaces, and data permutation logic between processing units in successive column slices.


[Figure 3.9 graphic: per-frame timing (frame indices j−2 to j+1) of the operations in an even column t and an odd column t+1. 1st pass (check node update phase): read Lvc and update the messages sc, min1c, min2c sent to the next column. 2nd pass (variable node update phase): read Lvc and Qv, compute mcv, compute Lv, compute pc, update Lvc and write to memory, write Ĉv to memory, and perform the early-termination check using all pc(t−1) values.]

Figure 3.9: Processing unit timing diagram for CN- and VN-update phases showing 3 independent frame updates over 3 clock cycles. Each CN+VN processing unit updates a single frame j in each clock cycle. All arrow-highlighted operations occur in each clock cycle, and independently for each frame j, j ∈ {0, 1, ..., 7}. Circled nodes B, C, D, E, and F correspond to the connections shown in the column slice architecture in Fig. 3.5. The following operations are highlighted. (a) Sign sc(t), first minimum magnitude min1c(t), and second minimum magnitude min2c(t) updates through column slice pair. (b) Parity pc(t) updates through column slice pair. (c) Independent Lvc and Cv updates in columns t and t+1. (d) Propagation of sign, first minimum magnitude, and second minimum magnitude messages to next column-slice pair without updates in columns t and t+1.


sign and magnitude values in the previous CN phase. The parity pc(t) element of the layer message is

updated, and transmitted to the next column slice, along with the unmodified sc(t − 1), min1c(t − 1),

and min2c(t − 1) elements that were received from the previous column slice. The early-termination

parity check is performed in the first clock cycle of the CN-phase, starting from the second decoding

iteration, once the pc(t) parity bits in each layer message have been computed and have returned home

to their starting position t = 0 in the closed path traversal.
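
The piecewise CN update described above can be sketched as a running update of the layer message along the unrolled CN path. The sketch assumes a standard Min-Sum check node and an initial layer message with sign 0 and both minima set to the maximum 4-bit magnitude; the function names and the initialization are illustrative assumptions rather than a description of the fabricated logic.

```python
MAX_MAG = 15  # largest 4-bit magnitude (4'b1111)


def cn_step(sign_in, min1_in, min2_in, lvc_sign, lvc_mag):
    """Update one layer message (sign, min1, min2) with a single stored Lvc value."""
    sign_out = sign_in ^ lvc_sign          # running sign product, kept as an XOR
    if lvc_mag < min1_in:                  # new smallest magnitude
        return sign_out, lvc_mag, min1_in
    if lvc_mag < min2_in:                  # new second-smallest magnitude
        return sign_out, min1_in, lvc_mag
    return sign_out, min1_in, min2_in      # magnitudes unchanged


def cn_pass(lvc_values):
    """Traverse the CN path, consuming one (sign, magnitude) pair per column slice."""
    state = (0, MAX_MAG, MAX_MAG)
    for lvc_sign, lvc_mag in lvc_values:
        state = cn_step(*state, lvc_sign, lvc_mag)
    return state


print(cn_pass([(0, 5), (1, 3), (0, 7), (1, 3)]))
```

Each call to `cn_step` corresponds to the work done in one column slice in one clock cycle, so the full minimum search is spread over the path instead of being computed by a single compare-select tree.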

Both sign-magnitude (SM) and 2’s complement number formats are used in the combined processing

node logic. Each Lvc message is represented in a 5-bit SM format with 1 sign bit and 4 magnitude bits, in

order to enable direct comparison of the sign and magnitude with the incoming sc(t− 1), min1c(t− 1),

and min2c(t − 1) values. Each Qv LLR is represented in a 5-bit 2’s complement format in order to

avoid SM-to-2’s complement conversion prior to the addition operation in the VN phase. Similarly, the

intermediate mcv value in the VN phase is also represented by 5 bits in 2’s complement format. The hard

decision Cv corresponds to the most significant bit (MSB) of the updated LLR Lv, and is represented by

only 1 bit.
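
The relationship between the two number formats can be illustrated as follows. The sketch models values as Python integers; the saturation of −16, which has no 5-bit sign-magnitude equivalent, is an assumption, and the helper names are hypothetical.

```python
def twos_to_sm(value):
    """Convert a 5-bit 2's complement integer (-16..15) to (sign bit, 4-bit magnitude)."""
    sign = 1 if value < 0 else 0
    mag = min(abs(value), 15)   # -16 cannot be represented in SM; saturate (assumption)
    return sign, mag


def sm_to_twos(sign, mag):
    """Convert (sign bit, 4-bit magnitude) back to a signed integer."""
    return -mag if sign else mag


def hard_decision(lv):
    """Hard decision bit: the MSB (sign bit) of the 2's complement LLR Lv."""
    return 1 if lv < 0 else 0


print(twos_to_sm(-6), sm_to_twos(1, 6), hard_decision(-6))
```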

The combined processing node exploits the spatial locality of Lvc and Qv values stored in partitioned

column-slice memories through a deterministic access pattern during the CN- and VN-update phases.

The combined CN and VN logic minimizes routing complexity between processing units, while also

eliminating the need for additional data shifting or permutation logic. The time-distributed piecewise

decoding schedule also eliminates the complex, timing-constrained compare-select and XOR trees, which

are employed in traditional architectures to compute the minimum magnitudes, sign, and parity. In the

CN-update phase, a single XOR gate is required to calculate the sign sc(t), and a single compare-select

circuit is used to determine the minimum magnitude among the Lvc, min1c(t − 1), and min2c(t − 1)

magnitudes. Similarly, in the VN-update phase, the parity check computation is also reduced to a

sequential XOR update, and the mcv value is computed independently in each VN. These simplifications

relax the critical path timing constraints to enable pipelined frame interleaving.
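
A minimal sketch of the VN-phase arithmetic is given below, assuming standard Min-Sum message computation: the CN-to-VN magnitude is min2 when the VN itself supplied min1 and min1 otherwise, and the sign excludes the VN's own sign. The saturation range and all names are assumptions for illustration; parity handling is omitted here.

```python
def sat5(x):
    """Saturate to the symmetric range of a 5-bit sign-magnitude value (assumption)."""
    return max(-15, min(15, x))


def vn_step(qv, lvc_list, layer_msgs):
    """qv: channel LLR; lvc_list: signed Lvc per active layer;
    layer_msgs: (sign, min1, min2) per active layer from the preceding CN phase."""
    mcv = []
    for lvc, (s, min1, min2) in zip(lvc_list, layer_msgs):
        mag = min2 if abs(lvc) == min1 else min1   # exclude this VN's own magnitude
        sign = s ^ (1 if lvc < 0 else 0)           # exclude this VN's own sign
        mcv.append(-mag if sign else mag)
    lv = qv + sum(mcv)                             # updated LLR Lv
    cv = 1 if lv < 0 else 0                        # hard decision from the sign of Lv
    new_lvc = [sat5(lv - m) for m in mcv]          # VN-to-CN messages for the next iteration
    return new_lvc, cv


print(vn_step(2, [3, -5], [(1, 3, 4), (0, 2, 5)]))
```

Because each VN only needs the three-element layer message and its own stored values, the computation is local to the processing unit and requires no additional shifting or permutation of data.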

Figure 3.9 presents the timing diagram for each frame j that is decoded through two CN+VN

processing units in successive columns t and t+1 within a single pipeline stage over one clock cycle. The

timing diagram highlights the individual operations in the CN- and VN-update phases, as well as the

data dependency and message-passing sequence between processing units in successive columns t and

t+ 1. The uniform processing unit arrangement in each column slice ensures that all units perform the

same operation in each clock cycle.

Additional pipeline stages could be added in the combined CN+VN processing unit and in each

column slice. While this may reduce the number of buffers added to critical timing paths in design

synthesis and place-and-route stages, the complexity of control logic would increase, and there would

be a trade-off between the area saved on buffer elimination and the insertion of pipeline register cells

throughout the design. In addition, in order to maintain full hardware utilization, the depth of column

slice memories would also need to be increased to accommodate more interleaved frames in the pipeline.

In this design, only 8 frames are interleaved through the architecture in order to meet the latency

requirements with minimal area overhead and simple control logic.

3.1.8 Early Termination with Coarse-Grained Clock Gating

Early termination logic allows the decoder to terminate once the parity-check condition across all CNs

is satisfied, i.e., pc = 0 for every CN path as defined by Eq. 2.1. This reduces the overall power and


improves energy efficiency as the decoder does not have to execute the maximum number of iterations

since the decoded codeword C is valid.

[Figure 3.10 graphic: normalized histogram. x-axis: number of iterations to decoding convergence (0 to 10); y-axis: probability (0 to 0.5). Series: rate 1/2 at 4.1, 4.4, and 4.7dB; rate 5/8 at 4.1, 4.4, and 4.7dB; rate 3/4 at 4.4, 4.7, and 5.0dB; rate 13/16 at 4.7, 5.1, and 5.5dB, corresponding to FER < 10^-2, 10^-3, and 10^-4, respectively.]

Figure 3.10: Probability distribution of decoding iterations for the four code rates of the IEEE 802.11ad standard at FER of 10^-2, 10^-3, and 10^-4.

Figure 3.10 presents a normalized histogram showing the probability of performing i iterations,

i ∈ {1, 2, ..., 10}, before converging to a valid codeword. The iteration probability is presented for

all four code rates in the IEEE 802.11ad standard at SNR operating points that achieve frame error

rates of 10^-2, 10^-3, and 10^-4 in order to capture the average statistical performance. Figure 3.10

shows that 95% of frames terminate in 5 iterations or fewer, hence early termination provides a significant

power saving opportunity, especially for the higher SNR operating points where the majority of frames

terminate within 1-to-3 iterations. Early termination also reduces the overall decoding latency, since

fewer iterations are required to produce valid codewords. As such, higher throughput is achievable if

the decoder is configured to run continuously where the next set of frames begin decoding immediately

after all the frames in the first set have converged, as shown in Fig. 3.11.
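
The throughput gain from continuous decoding can be estimated with simple arithmetic. The sketch below assumes 16 clock cycles per iteration, 8 frames of 672 bits per set, and that a set occupies the decoder until its slowest frame converges; the iteration counts 7, 10, and 8 follow the example in Fig. 3.11, and the function name is illustrative.

```python
CYCLES_PER_ITERATION = 16
BITS_PER_SET = 8 * 672   # 8 interleaved frames of 672 bits


def average_throughput(iterations_per_set, clock_hz, continuous=True, max_iterations=10):
    """Average throughput in bit/s over several consecutive frame sets."""
    if continuous:
        # the next set starts as soon as the current set has converged
        cycles = sum(CYCLES_PER_ITERATION * it for it in iterations_per_set)
    else:
        # every set occupies a full slot of max_iterations, leaving idle cycles
        cycles = CYCLES_PER_ITERATION * max_iterations * len(iterations_per_set)
    return len(iterations_per_set) * BITS_PER_SET * clock_hz / cycles


sets = [7, 10, 8]
print(average_throughput(sets, 202e6, continuous=False) / 1e9)
print(average_throughput(sets, 202e6, continuous=True) / 1e9)
```

For this example the three sets finish in 400 instead of 480 clock cycles, so the average throughput rises by a factor of 1.2 at the same clock frequency.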

There is a fourth scenario, not illustrated in Fig. 3.11, where the decoder pipeline is continuously

filled until hundreds of frames have been decoded, at which point, the decoder is fully powered down

and remains off until the next round of frames is ready to begin decoding. This duty-cycling approach

would achieve the same throughput with lower latency compared to the decoding scenarios illustrated

in Fig. 3.11. However, this approach significantly increases control complexity in order to track the

termination pattern and iteration count of independently interleaved frames, and also requires a deep

first-in first-out (FIFO) buffer memory to store the input and output frames. This additional FIFO

memory would increase design area, unless it is already available in the larger SoC. This scenario was

not explored in this thesis due to the additional control complexity and silicon area constraints.

Similar to the CN update, the parity check is performed in the VN-update phase through a piecewise,


[Figure 3.11 graphic: decoding timeline for three sets of 8 frames (frames 0 to 7, 8 to 15, and 16 to 23). (a) Each set occupies 160 clock cycles (10 iterations). (b) The sets converge after 7, 8, and 10 iterations, but each still occupies a slot of 160 clock cycles. (c) The sets that converge after 7, 8, and 10 iterations occupy 112, 128, and 160 clock cycles, respectively, with no idle cycles.]

Figure 3.11: Multi-frame decoding with: (a) no early termination, (b) early termination with idle cycles (discontinuous decoding), and (c) early termination without idle cycles (continuous decoding).

time-distributed computation across all VNs along the unrolled CN path. The incoming pc(t − 1)

component of every layer message is XOR-ed with the hard decision Cv in each VN to produce the

updated parity value pc(t) in column slice t. The early termination check is then performed in every

iteration once the final parity result returns home to its starting column position t = 0. This corresponds

to the first cycle of the next CN-update phase. The early termination check needs to be performed only

in the first column of a column-slice pair, immediately after the input pipeline stage, since the parity

result returning home to the column-slice pair is unique to the frame and is valid for both columns.
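
The sequential parity computation and the resulting termination condition can be sketched as follows. Each inner list holds the hard decisions of the VNs visited along one unrolled CN path; the data layout and function names are assumptions made for illustration.

```python
from functools import reduce


def path_parity(hard_decisions):
    """XOR the hard decisions of all VNs on one CN path, one VN per column slice."""
    return reduce(lambda p, cv: p ^ cv, hard_decisions, 0)


def frame_terminated(paths):
    """A frame can terminate early when every CN path returns a parity of 0."""
    return all(path_parity(cvs) == 0 for cvs in paths)


print(frame_terminated([[1, 1, 0], [0, 0, 0]]))   # both parity checks satisfied
print(frame_terminated([[1, 0, 0], [0, 0, 0]]))   # first parity check fails
```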

The early termination check is performed independently for each interleaved frame in the architecture.

The global control unit aggregates the parity-check results from all column slices over the entire CN phase

to determine which frames have terminated. Coarse-grained clock gating is used to disable (turn off)

column slices in which the current frame is known to have terminated. Figure 3.12 presents a sample

frame termination pattern, which shows all 8 frames terminating within 5 iterations. Frames that have

terminated are not updated or cycled further since the input column pipeline registers and memories are

disabled. For completeness, Fig. 3.12 also captures the frame cycling pattern within a single iteration

to show the temporal position of each terminated frame among the active frames that have not yet

terminated. Each set of 8 frames will have a unique termination pattern; however, through

coarse-grained clock gating, the decoder minimizes dynamic power consumption by systematically turning off

logic until all frames have terminated.
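
A first-order view of the activity reduction is the fraction of frame-update slots that remain clocked until the last frame of a set terminates. The sketch below assumes ideal gating, in which a terminated frame consumes no further update slots, and ignores leakage and control overhead; it is an illustrative proxy, not a power model of the test chip.

```python
def active_fraction(iterations_per_frame):
    """Fraction of frame-update slots that stay clocked while a set is being decoded."""
    total_iterations = max(iterations_per_frame)    # the set ends with its slowest frame
    active_slots = sum(iterations_per_frame)        # each frame is clocked until it terminates
    return active_slots / (len(iterations_per_frame) * total_iterations)


# Hypothetical termination pattern for 8 frames, all converging within 5 iterations.
print(active_fraction([1, 2, 3, 4, 5, 5, 5, 5]))
```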


[Figure 3.12 graphic: for iterations 1 to 6, the frames (0 to 7) that are still active in each column-slice pair (columns 0 and 1 through columns 14 and 15), with terminated frames marked as column disabled (off) and non-terminated frames as column active (on). A second panel shows the frame cycling pattern within one iteration (only one phase shown) across column slices 0 to 15.]

Figure 3.12: Sample frame termination pattern in frame-interleaved architecture. One iteration is performed over 16 clock cycles. One clock cycle is required to update a frame in a column-slice pair. Frames that have terminated are not updated in their current column-slice pair. Column slices in which the current frame has terminated are disabled through coarse-grained clock gating in each cycle.

3.1.9 Extendibility to Layered Decoding Schedule

The proposed architecture can be extended to support a layered decoding schedule at the expense of

latency and increased control complexity. Chapter 2 provides a brief introduction and describes some of

the challenges of implementing a layered decoder.

In this architecture, a layered schedule would have a higher decoding latency since the CN update

phase would require several more passes. The VN update phase would still require only one pass;

however, the CN update phase would require the same number of passes as the maximum VN degree.

For example, for a QC matrix with 4 layers/macro-rows connected to each VN, a layered decoder would

require 4 passes through the structure just for the CN phase, and then 1 additional pass for the VN

phase, for a total of 5 passes per iteration. This is in contrast to the 2 passes per iteration that are

required with a flooding schedule, corresponding to a 2.5× increase in total worst case decoding latency

assuming the decoder does not terminate in fewer iterations.
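
The pass count in this argument reduces to a one-line calculation, sketched below with illustrative function names: a layered schedule needs one CN pass per connected layer plus one VN pass, while the flooding schedule needs two passes.

```python
def passes_per_iteration(max_vn_degree, layered):
    """Passes through the column-slice structure needed for one decoding iteration."""
    return max_vn_degree + 1 if layered else 2


def worst_case_latency_ratio(max_vn_degree):
    """Worst-case latency of a layered schedule relative to the flooding schedule."""
    return passes_per_iteration(max_vn_degree, True) / passes_per_iteration(max_vn_degree, False)


print(worst_case_latency_ratio(4))   # 4 layers connected to each VN
```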

The area of the decoder could be reduced with a layered schedule by sharing CN update logic among

all connected layers, however, additional multiplexers would be required in each combined CN+VN

processing unit to select the routing path for each intermediate update message based on the current

layer. This would further increase control complexity and latency. Thus, a layered decoding schedule

was not explored in this thesis. The proposed architecture with a flooding schedule achieves acceptable

latency for the IEEE 802.11ad standard, while minimizing control complexity.


The proposed frame-interleaved architecture addresses the challenges of designing a multi-Gb/s

LDPC decoder with low-complexity interconnect and multi-rate reconfigurability, by introducing a time-

distributed decoding schedule and combined CN+VN processing unit design. The following section

presents the physical implementation details and results of a proof-of-concept test chip.

3.2 Physical Silicon Chip Implementation and Results

An LDPC decoder test chip was synthesized, placed-and-routed, and fabricated in a 28nm CMOS technology

as a proof-of-concept of the proposed frame-interleaved architecture. The decoder core occupies

an area of 3.41mm² (3.20mm × 1.06mm), while the total die size including pads is 4.78mm² (3.36mm

× 1.42mm). The design contains 837K gates and 160Kb of eSRAM, which was generated using a

commercial memory compiler. The decoder supports all 4 code rates and 24 throughput modes of the

IEEE 802.11ad standard, while operating at a nominal 0.9V supply voltage and 202MHz clock.

Figure 3.13 presents a die micrograph of the test chip, which contains two decoupled power domains to

independently measure core logic and eSRAM power. This section presents an overview of the decoder’s

error-correction performance, area and power breakdown, and a comparison of this work to previously

published LDPC decoder implementations for the IEEE 802.11ad standard. The complete development,

simulation, and testing framework is presented in Appendix B.

[Figure 3.13 graphic: die micrograph with labelled regions: embedded SRAM, embedded SRAM I/O interface, LDPC decoder standard cell logic (column slices 0 to 15), and global control logic; core dimensions 3.20mm × 1.06mm.]

Figure 3.13: Die micrograph with wirebonds shown in exposed package.

3.2.1 Error-Correction Decoding Performance

Figure 3.14 presents the FER and BER performance of the four code rates for both fixed-point and

floating-point number representations with a maximum of 10 decoding iterations on the BIAWGNC

with Min-Sum decoding. The channel input LLRs are quantized to 5 bits for both floating-point and

fixed-point simulations, based on the assumption that channel LLRs are received from a 5-bit analog-

to-digital converter (ADC). The rate 1/2 and rate 5/8 curves have similar performance due to the input

LLR quantization.

3.2.2 Post-Silicon Power Measurements

The fabricated chip was tested on a Teradyne UltraFLEX-HD automated tester (ATE) at a room

temperature of 21°C. The chip contains two test modes for at-speed functional verification and power


[Figure 3.14 graphic: x-axis: Eb/N0 (dB), 2 to 6; y-axis: frame error rate (FER) and bit error rate (BER), 10^0 down to 10^-7. Curves for rate 1/2, rate 5/8, rate 3/4, and rate 13/16: FER and BER for floating point (5-bit input LLR) and fixed point (5-bit messages); maximum 10 iterations, 100 frame errors.]

Figure 3.14: FER and BER vs. SNR under Min-Sum decoding for all four IEEE 802.11ad codes on BIAWGNC with maximum 10 decoding iterations. The channel SNR is normalized to energy-per-bit as given by Eq. 1.2. Channel input LLRs are quantized to 5 bits for both fixed-point and floating-point simulations.

measurement. In the functional test mode, channel input LLRs are loaded through a shift-register based

I/O interface, the decoding is performed, and the output hard decision bits are shifted out. The captured

hard decision bits are compared to a set of golden vectors, whose expected values are predetermined

through C++ simulation of a fixed-point LDPC decoder with a floating-point BIAWGNC model. The

chip functionality is verified over 40 test cases: five SNR points in each of the four code rates, both with

and without early termination. Figure 3.15 presents two Shmoo plots that show the range of operating

voltages for both eSRAM and core logic, as well as decoding functionality up to 300MHz. In the power

test mode, decoded hard decision bits are not shifted out after each set of frames has terminated. Instead,

the decoder runs continuously such that once a set of frames terminates, the decoder immediately restarts

the decoding cycle without any idle period. The supply current is sampled over a 10ms interval to obtain

an accurate power measurement.

Figure 3.16 presents the average power measured over 10 typical-corner chips under nominal conditions.

The results show that with early termination, overall power can be reduced by up to 1.43× for the

rate 1/2 code at Eb/N0 = 4.6dB and 2.93× for the rate 13/16 code at Eb/N0 = 5.4dB, while satisfying

the maximum throughput specification of the IEEE 802.11ad standard. The five Eb/N0 SNR points

chosen for each code rate correspond to the five highest performance points in the waterfall region of

each error-rate curve in Fig. 3.14. The majority of the power is consumed by standard cell logic and

wired routing, while eSRAM memories consume between 17% and 32% of overall power. Figure 3.16

also shows that there is a linear decrease in power over the four code rates from rate 1/2 to rate 13/16

when early termination is enabled. At Eb/N0 = 5.4dB, the rate 13/16 code consumes 2.88× less power

than the rate 1/2 code at Eb/N0 = 4.6dB with early-termination decoding. This reduction in power


[Figure 3.15 graphic: two Shmoo plots of pass (P) and fail (F) results. (a) eSRAM VDD vs. core logic VDD with 202MHz clock; axes: VDD_MEM (V) and VDD_LOGIC (V). (b) Core logic VDD vs. clock frequency with eSRAM voltage at 0.9V; axes: VDD_LOGIC (V) and clock frequency (150MHz to 300MHz).]

Figure 3.15: Shmoo plots of measured chip showing functional test pass (P) and fail (F) results.

[Figure 3.16 graphic: bar chart of measured power (mW) at CLK = 202MHz for all four code rates. Legend: no early termination, with early termination; logic + routing (VDD_LOGIC = 0.9V), embedded SRAM (VDD_MEM = 0.9V). Bar labels in mW, without / with early termination:
Rate 1/2 (3.8, 4.0, 4.2, 4.4, 4.6dB): 460/393, 431/327, 419/302, 408/283, 400/279
Rate 5/8 (3.8, 4.0, 4.2, 4.4, 4.6dB): 402/291, 384/261, 375/224, 366/215, 359/176
Rate 3/4 (4.2, 4.4, 4.6, 4.8, 5.0dB): 362/195, 356/187, 349/157, 342/149, 336/112
Rate 13/16 (4.6, 4.8, 5.0, 5.2, 5.4dB): 331/150, 326/158, 316/144, 311/110, 305/104]

Figure 3.16: Measured power at nominal 0.9V supply and 202MHz clock rate, with and without early termination, at five SNR Eb/N0 operating points for all four code rates.

[Figure 3.17 graphic: bar chart of measured power (mW) with clock-frequency scaling. Legend: logic + routing (VDD_LOGIC = 0.79V), embedded SRAM (VDD_MEM = 0.63V). Bar labels in mW, paired per SNR point as in Fig. 3.16:
Rate 1/2 (CLK = 92MHz, MCS-10, 3.09Gb/s; 3.8, 4.0, 4.2, 4.4, 4.6dB): 161/139, 150/115, 145/106, 141/100, 138/98
Rate 5/8 (CLK = 155MHz, MCS-22, 5.20Gb/s; 3.8, 4.0, 4.2, 4.4, 4.6dB): 220/163, 209/146, 204/125, 198/121, 195/99
Rate 3/4 (CLK = 186MHz, MCS-23, 6.25Gb/s; 4.2, 4.4, 4.6, 4.8, 5.0dB): 251/137, 242/129, 236/112, 231/104, 226/83
Rate 13/16 (CLK = 202MHz, MCS-24, 6.78Gb/s; 4.6, 4.8, 5.0, 5.2, 5.4dB): 230/108, 226/114, 219/103, 215/80, 210/76]

Figure 3.17: Measured power at reduced core and memory voltage with clock-frequency scaling, for the same operating points as in Fig. 3.16.


with increasing code rate is attributed to the fact that the decoder terminates in fewer iterations, and

the higher SNR operating point enables more frequent clock gating as more frames terminate early.

Moreover, in the rate 13/16 case, the CN+VN processor logic that corresponds to an entire layer of

the QC parity-check matrix is disabled as there are only 3 active processing layers for the rate 13/16

matrix, as opposed to 4 active processing layers for the rate 1/2, rate 5/8, and rate 3/4 matrices shown

in Fig. 3.3.

Table 3.4: Decoder performance at target BER = 10^-6 with early termination (including idle cycles)

Code                                         Rate 1/2   Rate 5/8   Rate 3/4   Rate 13/16
Eb/N0 (dB)                                   4.6        4.6        5.0        5.4
Iterations (Max 10) ∗                        7          6          4          4

Nominal Conditions: VDD_LOGIC = 0.9V, VDD_MEM = 0.9V
Clock Frequency (MHz)                        202        202        202        202
Throughput (Gb/s) ∗                          6.78       6.78       6.78       6.78
Max Latency (µs) ∗                           0.793      0.793      0.793      0.793
Measured Power (mW)                          279        176        112        104
Energy Efficiency (pJ/bit) ∗                 41         26         16         15
Normalized Efficiency (pJ/bit/iteration) ∗   4.1        2.6        1.6        1.5

Low-Power Conditions: VDD_LOGIC = 0.79V, VDD_MEM = 0.63V
Clock Frequency (MHz)                        92         155        186        202
Throughput (Gb/s) ∗                          3.09       5.20       6.25       6.78
Max Latency (µs) ∗                           1.740      1.034      0.860      0.793
Measured Power (mW)                          98         99         83         76
Energy Efficiency (pJ/bit) ∗                 31         19         13         11
Normalized Efficiency (pJ/bit/iteration) ∗   3.1        1.9        1.3        1.1

∗ Throughput, Latency, and Efficiency calculations assume 10 decoding iterations, even though the decoder terminates and stops after the specified number of Iterations. This scenario corresponds to the multi-frame decoding highlighted in Fig. 3.11(b).

Additional power reduction is possible through clock-frequency and voltage scaling techniques. The

high performance MCS-10, MCS-22, MCS-23, and MCS-24 modes of the IEEE 802.11ad standard specify

data rates of 3.08Gb/s, 5.20Gb/s, 6.24Gb/s, and 6.76Gb/s for the rate 1/2, 5/8, 3/4, and 13/16 codes,

respectively. As shown in Fig. 3.17, with clock-frequency and voltage scaling, the LDPC decoder achieves

between 1.35× and 2.85× reduction in power over the nominal case, while satisfying the required data

rates. Table 3.4 highlights the power and energy efficiency of both nominal and low-power operating

modes, as well as the throughput and maximum latency assuming 10 decoding iterations for four SNR

points at a target BER of 10^-6. Through the multi-frame I/O buffering technique shown in Figures

3.7 and 3.11, the decoder can achieve higher throughput with lower latency by immediately starting

to decode the next set of frames if the current set of frames terminates early. In this case, the power

consumption is higher, however the energy efficiency per bit remains the same. This result is shown in

Table 3.6.
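The efficiency and power-reduction figures quoted above follow directly from the measured power and throughput in Table 3.4. The short sketch below redoes that arithmetic; the numeric values are copied from the table, while the function and variable names are illustrative only and do not come from the thesis.

```python
# Measured values copied from Table 3.4; all names here are illustrative.
RATES = ["1/2", "5/8", "3/4", "13/16"]
ITERATIONS_ASSUMED = 10  # the table's efficiency rows assume 10 iterations

NOMINAL = {"power_mw": [279, 176, 112, 104],
           "throughput_gbps": [6.78, 6.78, 6.78, 6.78]}
LOW_POWER = {"power_mw": [98, 99, 83, 76],
             "throughput_gbps": [3.09, 5.20, 6.25, 6.78]}


def energy_per_bit_pj(power_mw, throughput_gbps):
    """mW / (Gb/s) = 1e-3 J/s / 1e9 bit/s = 1e-12 J/bit, i.e. pJ/bit."""
    return power_mw / throughput_gbps


def summarize(mode):
    """Return (rate, pJ/bit, pJ/bit/iteration) for each code rate."""
    rows = []
    for rate, p, t in zip(RATES, mode["power_mw"], mode["throughput_gbps"]):
        e = energy_per_bit_pj(p, t)
        rows.append((rate, e, e / ITERATIONS_ASSUMED))
    return rows


def power_reduction():
    """Ratio of nominal to low-power measured power, per code rate."""
    return [n / l for n, l in zip(NOMINAL["power_mw"], LOW_POWER["power_mw"])]


if __name__ == "__main__":
    for label, mode in (("nominal", NOMINAL), ("low-power", LOW_POWER)):
        for rate, e, e_norm in summarize(mode):
            print(f"{label:9s} rate {rate:5s}: {e:5.1f} pJ/bit, "
                  f"{e_norm:4.2f} pJ/bit/iteration")
    print("power reduction:", [f"{r:.2f}x" for r in power_reduction()])
```

The ratios returned by `power_reduction()` span roughly 1.35× to 2.85×, matching the range stated in the text.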

Table 3.5 presents a percentage breakdown of the total decoder area by module, as well as the

estimated power consumed by each module for the high-performance SNR points of each of the four codes

at a target BER of 10^-6. The power estimates are derived from both the measured power, as well as power

estimates obtained from gate-level simulation of the synthesized design using actual toggle patterns for


each code rate. VN-to-CN message memories consume about 6× more power than channel LLR and hard

decision memories due to their higher activity during decoding. Column slice logic consumes about 1.5×

more power than all the remaining standard cell logic due to the large number of gates required to realize

the frame-interleaved, flooding Min-Sum decoder. The power consumption of control logic is negligible

in the frame-interleaved architecture since the majority of control logic is intrinsically embedded in

the path-unrolled interconnect. Routing, however, consumes the largest percentage of power in order

to cyclically shuffle high bit-width layer messages between successive column slices. While two power

domains provide insight into the ratio of core logic versus memory power consumption, a single power

domain should be used when integrating the IP in an SoC to minimize unused area overhead.

Table 3.5: Percentage breakdown of post-silicon area and estimated power by decoder module at target BER = 10^-6

| Decoder Module | Area | Power ∗, Rate 1/2 | Power ∗, Rate 5/8 | Power ∗, Rate 3/4 | Power ∗, Rate 13/16 |
|---|---|---|---|---|---|
| Total Core Area and Power | | | | | |
| Core Area | 3.41mm^2 | 279mW | 176mW | 112mW | 104mW |
| Embedded SRAM Memories | | | | | |
| VN-to-CN Messages Lvc | 7.74% | 17.36% | 15.90% | 15.01% | 15.02% |
| Channel LLR Qv | 2.48% | 1.95% | 2.25% | 2.13% | 3.20% |
| Hard Decision Cv | 0.63% | 0.58% | 0.58% | 0.54% | 0.69% |
| Standard Cell Logic | | | | | |
| Column Slices | 16.28% | 11.53% | 15.23% | 15.78% | 13.29% |
| Pipeline Registers | 2.85% | 7.30% | 9.61% | 9.71% | 8.16% |
| Buffers/Inverters | 2.75% | 0.06% | 0.08% | 0.11% | 0.04% |
| Decoder Control | 0.01% | 0.01% | 0.01% | 0.01% | 0.01% |
| Integrated Clock Gating Cells | 0.03% | 0.06% | 0.08% | 0.11% | 0.04% |
| Test Control and I/O Interface | 0.21% | 0.01% | 0.01% | 0.01% | 0.01% |
| Core Filler Cells and Wired Routing | | | | | |
| Core Filler Cells | 67.01% | N/A | N/A | N/A | N/A |
| Routing | N/A | 59.79% | 54.41% | 54.69% | 57.46% |

∗ Power with early termination enabled, measured at Eb/N0 = 4.6dB, 4.6dB, 5.0dB, and 5.4dB for rates 1/2, 5/8, 3/4, and 13/16, respectively, at nominal 0.9V supply and 202MHz clock. Estimates are derived from power measurements and gate-level simulation reports of the synthesized design.

3.2.3 Comparison with the State-of-the-Art

Table 3.6 compares this work to five recent decoder implementations for the IEEE 802.11ad standard.
All comparison works implement a partially-parallel architecture with a variant of the Min-Sum
decoding algorithm, and achieve similar BER performance over the four code rates. Several parity-check
matrix modifications are applied among the comparison works, yielding different permutation network
structures, which include barrel shifters, cyclic shift registers, and switch networks. This work reduces
routing overhead complexity by eliminating the need for message permutation logic due to the
partitioned routing networks between adjacent column slices. While this work occupies between 2.1× and
5.4× more unnormalized silicon area than the comparison works, the proposed frame-interleaved
architecture with a path-unrolled message-passing schedule achieves high energy efficiency, while maximizing
SoC integration capability using standard bulk CMOS technology and a low clock rate. Silicon area and
power are not normalized to a particular CMOS technology node in this comparison, as Dennard scaling
rules do not hold below the 65nm node due to the exponential growth of leakage current in newer nodes
and dark silicon design techniques. In Table 3.6, power is reported at the nominal supply voltage and
nominal clock frequency.

Table 3.6: Comparison of LDPC decoder implementations for the IEEE 802.11ad standard

| Specification | Weiner [32] | Park [31] | Ajaz [126] | Li [127] | Motozuka [128] | This Work |
|---|---|---|---|---|---|---|
| Publication | ISSCC 2014 | JSSC 2014 | APCCAS 2014 | ASSCC 2015 | GlobalSIP 2015 | 2017 |
| Implementation | ASIC | ASIC | Place and Route | ASIP | ASIC | ASIC |
| CMOS Technology Node | 28nm FD-SOI | 65nm | 65nm | 28nm | 40nm LP | 28nm |
| Core Area (mm^2) | 0.63 | 1.60 | 0.58 | 0.78 | 0.8 | 3.41 ∗ |
| Memory Type | Flip Flops | eDRAM | Flip Flops | N/A | Flip Flops | eSRAM |
| Total Memory (Kbits) | N/A | 33.6 | 7.875 | N/A | 12.096 | 160.272 |
| Pipeline Stages | 5 | 5 | 1 | 5 | 3 | 8 |
| Interleaved Frames | 2 | 2 | 1 | 1 | 1 | 8 |
| Supply Voltage (V) | 1.1 | 0.94 | 1.1 | 0.9 | 1.1 | 0.9 |
| Clock Frequency (MHz) | 260 | 360 | 400 | 470 | 220 | 202 |
| Block Length (Bits) | 672 | 672 | 672 | 672 | 672 | 672 |
| Decoding Schedule | Flooding | Flooding | Layered | Layered | Flooding | Flooding |
| Code Rate for Power Measurement | 1/2 | 1/2 | 1/2 | 1/2 | 13/16 | 1/2 (a); 13/16 (b) |
| Iterations | 3.75 | 10 | 7 | 2 | 7 | 10 (c); 7 (d); 10 (e); 4 (f) |
| Throughput (Gb/s) | 12.00 | 6.00 | 9.25 | 18.40 | 6.16 | 6.78 (c); 9.69 (d); 6.78 (e); 16.95 (f) |
| Latency (µs) | 0.112 | 0.224 | 0.073 | 0.037 | 0.109 | 0.793 (c); 0.555 (d); 0.793 (e); 0.317 (f) |
| Power (mW) | 180 | 373.6 | 272.9 | 166 | 203 | 279 (c); 399 (d); 104 (e); 260 (f) |
| Energy Efficiency (pJ/bit) | 15.00 | 62.27 | 29.50 | 9.02 | 32.95 | 41.15 (a); 15.34 (b) |
| Normalized Energy Efficiency (pJ/bit/iteration) | 4.00 | 6.23 | 4.21 | 4.51 | 4.71 | 4.12 (c); 5.88 (d); 1.53 (e); 3.84 (f) |
| Area Efficiency (Gb/s/mm^2) | 19.05 | 3.75 | 16.09 | 23.59 | 7.70 | 1.99 |
| Max Average Iterations (Flooding: 10, Layered: 5) | 10 | 10 | 5 | 5 | 10 | 10 |
| Latency at Max Iterations (µs) | 0.229 | 0.224 | 0.052 | 0.091 | 0.156 | 0.793 |
| Throughput at Max Iterations (Gb/s) | 4.50 | 6.00 | 12.95 | 7.36 | 4.31 | 6.78 |
| Decoding Algorithm | Offset Min Sum | Offset Min Sum | Min Sum | Min Sum | Min Sum | Min Sum |
| Message Quantization (Bits) | 5 | 5 | Lvc: 4, mcv: 2 | N/A | 5 | 5 |
| Multi Rate | Yes | No (Rate 1/2) | Yes | Yes | Yes | Yes |
| BER for Rate 1/2 Code | 10^-6 at 4.4dB | 10^-6 at 3.6dB | 10^-5 at 3.6dB | 10^-6 at 4.0dB | 10^-6 at 4.0dB | 10^-6 at 4.6dB |
| BER for Rate 5/8 Code | | N/A | 10^-5 at 4.0dB | 10^-6 at 4.0dB | N/A | 10^-6 at 4.6dB |
| BER for Rate 3/4 Code | | N/A | 10^-5 at 4.3dB | 10^-6 at 5.0dB | N/A | 10^-6 at 4.6dB |
| BER for Rate 13/16 Code | 10^-6 at 5.0dB | N/A | 10^-5 at 5.0dB | 10^-5 at 5.0dB | 10^-6 at 5.2dB | 10^-6 at 5.4dB |
| Partially-Parallel Architecture | Row-Parallel | Row-Parallel | Row-Parallel | Row-Parallel | Column-Parallel | Row/Column-Parallel |
| Permutation Network | Cyclic Shift Registers | Cyclic Shift Registers | Switch Network | Barrel Shift Network | Reduced-Complexity Barrel Shifters | Cyclically Hard-Wired Partitions |
| Parity-Check Matrix Modification | Row Re-Ordering | Row Merging | Column Re-Ordering | Column Permutation | N/A | Bypass Routing |

∗ Core area contains two power domains to independently measure eSRAM and logic power. A "production version" of the chip with only one power domain would occupy less core area.
(a) Power reported at Eb/N0 = 4.6dB and BER = 10^-6.
(b) Power reported at Eb/N0 = 5.4dB and BER = 10^-6.
(c) Decoder terminates early after 7 iterations, and remains idle for the remaining 3 iterations. This corresponds to the scenario in Fig. 3.11(b).
(d) Decoder terminates early after 7 iterations, and immediately begins decoding next set of frames without any idle cycles. This corresponds to the scenario in Fig. 3.11(c).
(e) Decoder terminates early after 4 iterations, and remains idle for the remaining 6 iterations. This corresponds to the scenario in Fig. 3.11(b).
(f) Decoder terminates early after 4 iterations, and immediately begins decoding next set of frames without any idle cycles. This corresponds to the scenario in Fig. 3.11(c).
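The throughput and latency entries for this work in Table 3.6 can be approximated with a simple cycle-count model. The sketch below is a rough check, not the author's timing model: the block length, number of interleaved frames, and clock frequency are taken from Table 3.6, but the value of 16 clock cycles per iteration is an assumption inferred from the reported 0.793µs latency at 10 iterations, and it reproduces the table only to within about 0.2%.

```python
FRAME_BITS = 672            # IEEE 802.11ad block length (Table 3.6)
INTERLEAVED_FRAMES = 8      # frames decoded together (Table 3.6)
CLOCK_HZ = 202e6            # nominal clock frequency (Table 3.6)
CYCLES_PER_ITERATION = 16   # ASSUMPTION: inferred from 0.793 us at 10 iterations


def latency_us(iterations):
    """Decoding latency for one set of interleaved frames, in microseconds."""
    return iterations * CYCLES_PER_ITERATION / CLOCK_HZ * 1e6


def throughput_gbps(iterations):
    """Decoded bits per second when a new frame set starts every `iterations`."""
    bits = FRAME_BITS * INTERLEAVED_FRAMES
    return bits / (latency_us(iterations) * 1e-6) / 1e9


if __name__ == "__main__":
    for its in (10, 7, 4):
        print(f"{its:2d} iterations: {latency_us(its):.3f} us, "
              f"{throughput_gbps(its):.2f} Gb/s")
```

With idle cycles (footnotes (c) and (e)) the decoder behaves as in the 10-iteration case; without idle cycles (footnotes (d) and (f)) the 7- and 4-iteration cases apply.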

Energy efficiency is the primary optimization metric for state-of-the-art decoders. Weiner et al.

present an ASIC implementation that achieves a normalized energy efficiency approximately equal to

this work for the rate 1/2 code using a fully-depleted silicon-on-insulator (FD-SOI) technology [32], which

is known to provide superior power performance over conventional bulk CMOS technology due to the

forward body bias that enables low-voltage operation with reduced leakage current [129]. Under worst-

case channel conditions with a maximum of 10 decoding iterations, the implementation by Weiner et al.

achieves a throughput of only 4.50Gb/s, which is below the maximum throughput specification for IEEE

802.11ad. Weiner et al. do not report the measured power for the rate 13/16 code.

Park et al. present an ASIC decoder for a single code rate with an energy efficiency approximately

1.5× lower than this work for the rate 1/2 code [31], likely due to the high clock frequency of 360MHz,

which would also present SoC integration challenges. This work achieves 1.5× higher normalized energy

efficiency than the work by Park et al. for the rate 1/2 code.

Ajaz and Lee report a pre-silicon place-and-route implementation also using a prohibitively high

clock rate of 400MHz. This work achieves approximately equal normalized energy efficiency for the rate

1/2 code, however, a fair comparison is not possible since the work by Ajaz and Lee is a pre-silicon

implementation, and power is not reported for the rate 13/16 code.

Li et al. introduce a new approach to high-throughput LDPC decoder design through a multi-core

application-specific instruction set processor (ASIP). While the reported energy efficiency is 4.56× higher

than this work for the rate 1/2 code [127], their reported power of 166mW is measured for only 2 decoding

iterations. This work achieves approximately equal normalized energy efficiency for the rate 1/2 code.

In addition, the high clock frequency of 470MHz and ASIP architecture may introduce SoC integration

challenges. Li et al. do not report the measured power for the rate 13/16 code.

Motozuka et al. introduce a new column-parallel architecture that uses multi-stage variable shifters

with low memory requirements [128], however, the ASIC implementation does not achieve more than

4.31Gb/s throughput under worst-case channel conditions. This work achieves 2.1× higher energy

efficiency and 3.08× higher normalized energy efficiency than the work by Motozuka et al. for the

rate 13/16 code. Motozuka et al. do not report the measured power for the rate 1/2 code.

The silicon area of the proposed architecture could be reduced by applying more optimal floorplanning

and chip layout techniques. The presented design contains additional area overhead in order to implement

two power domains to independently measure core logic and eSRAM power. The decoder core area could

be reduced by using a single power domain for both standard cell logic and eSRAM macros, as this would

eliminate particular wire routing and logic placement constraints.

The frame-interleaved LDPC decoder presented in this work achieves high energy efficiency, error-

correction performance, and SoC integration capability at the expense of high transistor area. The

presented decoder achieves similar normalized energy efficiency for the rate 1/2 code in comparison

to other published implementations, however, it achieves the highest normalized energy efficiency for

the rate 13/16 code at 1.53pJ/bit/iteration. As previously described, this work is scalable by design.


The architecture is scalable to future technology nodes as interconnect complexity is constrained and

localized between column slices. Moreover, since most QC-LDPC codes are constructed with expansion

factors between q = 10 and q = 100, the column-slice logic complexity of the proposed architecture

remains approximately equal, even for longer block-length codes. As such, the frame-interleaved decoder

architecture would provide low-power performance and high energy efficiency for longer block-length

codes in future technology nodes.

3.3 Summary

This chapter introduced a new partially-parallel LDPC decoder architecture that implements a path-

unrolled message-passing schedule with pipelined frame interleaving. The traditional flooding Min-Sum

algorithm is reformulated through a time-distributed computation over multiple processing units that

contain both check and variable node update logic. Message permutation overhead and unstructured

routing are minimized by exploiting the cyclic structure between adjacent macro-columns in the quasi-

cyclic parity-check matrix.

Despite the high silicon core area of 3.41mm^2, the decoder achieves an energy efficiency of 15pJ/bit

at 0.9V supply with a 202MHz clock rate, which is ideally suited for modern SoC integration. At a

maximum of 10 iterations, the decoder achieves a nominal throughput of 6.78Gb/s with a maximum

latency of 0.793µs for all four code rates of the IEEE 802.11ad standard. By trading off interconnect

complexity for high transistor area, the proposed architecture introduces a new design strategy for LDPC

decoders in sub-45nm CMOS technology nodes where interconnect scaling has stagnated.

The proposed architecture is scalable to longer block-length codes, since (1) the critical timing path

is not constrained by the expansion factor of the quasi-cyclic parity check matrix, (2) the complexity

of localized routing between successive column groups is bounded by the number of active processing

layers, and (3) the high bit-cell density of eSRAM compensates for additional overhead in processing

node logic. The architecture can also be reconfigured and extended to support multiple standards with

different parity-check matrices by including programmable shifters between successive column slices.

Furthermore, the architecture is not restricted to quasi-cyclic codes, but can rather be applied more

generally to random codes, or to codes that allow column-wise matrix partitioning as a way to enforce

structure. Further research in this area may lead to new low-power techniques for hardware-based LDPC

decoders.

Chapter 4

Quasi-Cyclic Multi-Edge LDPC Codes for Quantum Cryptography

The design of efficient reconciliation algorithms is one of the central challenges of long-distance CV-

QKD [11]. Early reconciliation algorithms failed to achieve efficiencies above 80% [130], while more

advanced algorithms that now achieve 95% efficiency suffer from computational complexity [59, 60].

LDPC codes are highly suitable for low-SNR reconciliation in CV-QKD due to their near-Shannon

limit error-correction performance, however, designing and constructing efficient LDPC codes with block

lengths on the order of 10^6 bits remains a challenge.

This chapter introduces a technique to reduce the complexity of multi-edge LDPC codes in order

to reduce overall decoding latency, which would ultimately provide a higher secret key rate. A quasi-

cyclic structure is imposed on the multi-edge parity-check matrix construction to enable computational

decoding speedup as a result of the highly parallelizable structure, which provides a simple mapping

to hardware [72, 83]. Previous independent works by Martinez-Mateo and Walenta have explored the

application of existing QC-LDPC codes from the IEEE 802.11n standard for DV-QKD, however, these

works were not able to demonstrate reliable reconciliation beyond 50km [65, 131]. While this distance

may have been a limitation of DV-QKD, the short block lengths of such existing QC-LDPC codes (on

the order of 10^3 bits) remain unsuitable for long-distance CV-QKD. Recently, Bai et al. theoretically

showed that rate 0.12 QC codes with block lengths of 10^6 bits can be constructed using progressive

edge growth techniques, or by applying a QC extension to random LDPC codes with block lengths

of 10^5 bits [132]. However, the reported QC codes target an SNR of -1dB, and are thus not suitable

for long-distance CV-QKD beyond 100km. At the time of writing, there has not been any reported

investigation of the construction of QC codes for multi-edge LDPC codes targeting low-SNR channels

below -15dB for long-distance CV-QKD. This thesis shows that by applying a structured QC-LDPC code

construction technique to the random multi-edge LDPC codes previously explored by Jouguet et al. for

long-distance CV-QKD [11], it is possible to construct codes that achieve sufficient error-correction

performance while enabling the acceleration of the computationally-intensive LDPC decoding algorithm

such that the reconciliation step is no longer the bottleneck for secret key distillation beyond 100km.

This thesis demonstrates the application of multi-edge QC-LDPC codes for long-distance CV-QKD

through the design of several rate 0.02 binary parity-check matrices with block lengths on the order of 10^6

bits. While a complete QKD system would offer multi-rate code programmability for various operating

channels, this thesis focuses on the design of a single, low-rate code for a large range of transmission

distances to fully study the effects of β-efficiency and FER on the maximum achievable secret key rate

and reconciliation distance. Some works have explored the use of rate-adaptive or repetition codes to

achieve high-efficiency decoding with multiple code rates [11], however, the exploration of multi-rate

code design for long-distance CV-QKD is beyond the scope of this thesis.

This chapter describes the construction of quasi-cyclic multi-edge codes, and presents the error-

correction performance and achievable secret key rates for multiple β-efficiencies beyond 100km using

specifically designed rate 0.02 codes¹. A GPU-based LDPC decoder implementation is also presented to

highlight the computational speedup that can be achieved using quasi-cyclic codes with respect to the

fundamental upper bound on secret key rate.

4.1 Construction of Quasi-Cyclic Multi-Edge LDPC Codes

While random LDPC codes have been shown to achieve near-Shannon capacity error-correction perfor-

mance under belief propagation decoding [5], the hardware-based implementation of decoders for random

codes is a challenge with large block lengths, especially on the order of 10^6 bits. The bottleneck stems

from the complex interconnect network between CN and VN processing units that execute the belief

propagation algorithm [15, 73]. This thesis extends the design of low-rate, multi-edge LDPC codes de-

scribed in Chapter 2 to QC codes in order to optimize decoding performance in hardware, by minimizing

latency and increasing throughput.

To design a multi-edge QC-LDPC code, repeat the random multi-edge sampling process using n/q
as the block length instead of n to obtain a base Tanner graph G_B. The base parity-check matrix H_B is
obtained from G_B by populating each non-zero entry by a random element of the set {1, 2, ..., q}. Let I_i
be the circulant permutation matrix obtained by cyclically shifting each row of the q × q identity matrix
to the right by i − 1. The QC parity-check matrix H is obtained from H_B by replacing each non-zero
entry of value i by I_i, and each zero entry by the q × q all-zeros matrix.
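The expansion step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the construction code used for the thesis: the multi-edge sampling that produces the base graph is not reproduced, so a small hand-written 0/1 `mask` stands in for it, and the function names `circulant`, `random_shifts`, and `expand` are hypothetical.

```python
import numpy as np


def circulant(i, q):
    """I_i: the q x q identity with every row cyclically shifted right by i - 1."""
    return np.roll(np.eye(q, dtype=np.uint8), i - 1, axis=1)


def random_shifts(mask, q, rng):
    """Populate the non-zero positions of a 0/1 base graph with values from {1..q}."""
    shifts = rng.integers(1, q + 1, size=mask.shape)
    return np.where(mask != 0, shifts, 0)


def expand(h_base, q):
    """Replace each entry i > 0 by I_i and each zero by the q x q all-zeros block."""
    rows, cols = h_base.shape
    h = np.zeros((rows * q, cols * q), dtype=np.uint8)
    for r, c in zip(*np.nonzero(h_base)):
        h[r * q:(r + 1) * q, c * q:(c + 1) * q] = circulant(int(h_base[r, c]), q)
    return h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mask = np.array([[1, 1, 0],     # toy base graph, NOT a multi-edge design
                     [0, 1, 1]])
    h_base = random_shifts(mask, q=4, rng=rng)
    print(h_base)
    print(expand(h_base, q=4))
```

Because every block is a permutation matrix, each row and column of H inherits the degree of the corresponding row and column of the base matrix, which is why the degree distribution is preserved by the expansion. A dense array is used here only for readability; a matrix with 10^6 columns would need a sparse representation.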

In this thesis, multi-edge QC-LDPC parity-check matrices of rate 0.02 were generated for expansion
factors q ∈ {21, 50, 100, 500, 1000}. Under belief propagation decoding, the error-correction performance
of the q ∈ {100, 500, 1000} QC codes was significantly worse in comparison to a random multi-edge code
with the same degree distribution. Thus, only the q = 21 and q = 50 QC codes are presented in the
remainder of this study. In order to maintain the same degree distributions, the block length for the
q = 21 code with rate R_code = 0.02 was adjusted to n = 1.008 × 10^6 bits. Similarly, the q = 50 code
has a block length of n = 1 × 10^6 bits and rate R_code = 0.01995. As described in Chapter 2, as n → ∞,
the error-correction performance of Tanner graphs with the same degree distribution is nearly identical.
Since a block length on the order of n = 10^6 bits approximates this asymptotic regime, any q = 21 or
q = 50 code is expected to have similar error-correction performance. Hence, only one q = 21 code and
one q = 50 code were constructed.
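The block lengths and rates quoted above are consistent with base matrices of integer size, which the following check illustrates. The values of k are not stated in the text; they are assumptions derived here as rate × n, and the sketch also assumes that H has exactly n − k rows. The function name `base_dimensions` is illustrative.

```python
from fractions import Fraction


def base_dimensions(n, q, k):
    """Return (base rows, base columns, code rate) for a QC code.

    Assumes H has n - k rows and that both n and k are multiples of q.
    """
    if n % q or k % q:
        raise ValueError("n and k must both be multiples of the expansion factor q")
    return (n - k) // q, n // q, Fraction(k, n)


if __name__ == "__main__":
    # k values are ASSUMED from rate * n; they are not given in the text.
    print(base_dimensions(1_008_000, 21, 20_160))   # q = 21, rate 0.02
    print(base_dimensions(1_000_000, 50, 19_950))   # q = 50, rate 0.01995
    print(1_000_000 % 21)                           # non-zero: n = 10^6 is not a multiple of 21
```

Under these assumptions the q = 50 code has 19601 × 50 = 980050 rows, in line with the roughly 9.8 × 10^5 rows shown for that matrix in Fig. 4.1(a).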

Figure 4.1 shows the structure of the parity-check matrices designed in this thesis. Both the non-QC
random and QC matrices have a similar structure, which contains a dense area of 1s or cyclic identity
matrices on the left, and a long diagonal of degree-1 VNs to the right. The starting point of the diagonal
is determined by the VN degree distribution of the multi-edge matrix. In the case of the QC codes, no
cyclic shifts are implemented along the diagonal, thus all submatrices are q × q I_1 identity matrices. This
matrix structure greatly improves the decoding speed as degree-1 VNs along the diagonal need to pass
VN-to-CN messages only in the first decoding iteration, while CN-to-VN messages need to be passed to
degree-1 VNs only if the early-termination condition is enabled. The degree-1 VNs along the diagonal
correspond to the majority (but not all) of the (n − k) parity bits that are discarded after decoding,
thus the VN update computation needs to be performed in these degree-1 VNs only if a decision needs
to be made when early termination is enabled. A small fraction of the (n − k) parity bits correspond
to VNs with more than one CN connection in the denser area of the matrix to the left of the diagonal.
These VNs must perform the VN update computation in each iteration along with the first k VNs, which
correspond to the k information bits of the block.

(a) Full q = 50 QC parity-check matrix structure with 1 × 10^6 columns and 9.8 × 10^5 rows. Empty space represents zeros.
(b) Zoom-in of top left corner of q = 50 QC matrix shown in (a). Each dot represents a 50 × 50 cyclic identity matrix.
Figure 4.1: Structure of designed parity-check matrices. (Axes: matrix column vs. matrix row.)

¹ Lei M. Zhang specifically constructed the QC multi-edge codes based on the degree distributions in Equations 2.18 and 2.19.

The parity component of H, i > k in H(j, i), is lower-triangular for both the non-QC random and

QC parity-check matrices designed in this study. An example of this type of construction is shown in

Fig. 2.3, and is also illustrated in Fig. 4.1. While the lower-triangular construction does not necessarily

impact decoding complexity or error-correction performance, it does simplify the LDPC encoding pro-

cedure, which can be performed via forward substitution if H is of this form. Further investigation of

LDPC encoding complexity for such large codes is beyond the scope of this thesis.
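Encoding by forward substitution can be illustrated on a toy matrix. The sketch assumes H = [A | T] with the k information bits first and a square parity part T that is lower-triangular with ones on its diagonal; the text states only that the parity component is lower-triangular, so the unit diagonal and the function name are assumptions made for this example.

```python
import numpy as np


def encode_lower_triangular(h, k, message):
    """Solve T p = A s (mod 2) row by row, where H = [A | T] and c = [s, p]."""
    h = np.asarray(h, dtype=np.int64)
    s = np.asarray(message, dtype=np.int64)
    m, n = h.shape
    a, t = h[:, :k], h[:, k:]
    # ASSUMPTION: T is square, lower-triangular, with a unit diagonal.
    if n - k != m or np.any(np.diag(t) != 1) or np.triu(t, 1).any():
        raise ValueError("parity part must be square unit lower-triangular")
    rhs = a.dot(s) % 2
    p = np.zeros(m, dtype=np.int64)
    for j in range(m):
        # Row j only involves parity bits 0..j-1, which are already known.
        p[j] = (rhs[j] + t[j, :j].dot(p[:j])) % 2
    return np.concatenate([s, p])


if __name__ == "__main__":
    h = np.array([[1, 0, 1, 1, 0, 0],
                  [0, 1, 1, 1, 1, 0],
                  [1, 1, 0, 0, 1, 1]])
    c = encode_lower_triangular(h, k=3, message=[1, 0, 1])
    print(c, h.dot(c) % 2)   # the syndrome should be all zeros
```

Each parity bit is obtained from one row of H, so the cost is proportional to the number of non-zero entries in H when a sparse representation is used.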

4.2 Error-Correction Performance of Multi-Edge QC Codes

The multi-edge LDPC codes designed in this thesis achieve similar FER performance on the BIAWGNC
compared to those developed by Jouguet et al. for long-distance CV-QKD with multi-dimensional
reconciliation [11]. Table 4.1 summarizes the parameters of the three codes designed in this thesis, and Figures
4.2 and 4.3 present their FER vs. SNR error-correction performance under Sum-Product decoding for
d = 1, 2, 4, 8 reconciliation dimensions. FER simulations were performed for the complete linear SNR
range corresponding to the range of efficiencies between β = 0.8 and β = 0.99, as defined by Eq. 2.15.
This range of β-efficiency values was chosen to illustrate the trade-off between distance and finite secret
key rate in the next section. For clarity, however, Figures 4.2 and 4.3 present the FER results only
for the SNR range corresponding to β-efficiencies between β = 0.88 and β = 0.99. The bit error rate
(BER) performance is not presented in Figures 4.2 and 4.3 since it is not of particular concern for key
reconciliation. Once Alice and Bob detect a frame error, the entire frame must be discarded since it
cannot be used to generate a symmetric secret key. For completeness, the BER results for the three codes
under investigation are presented in Appendix C.

Table 4.1: Designed rate 0.02 multi-edge LDPC codes

| Structure | Expansion Factor q | Code Rate R_code | Block Length n (Bits) |
|---|---|---|---|
| Random | N/A | 0.02 | 1 × 10^6 |
| Quasi-Cyclic | 21 | 0.02 | 1.008 × 10^6 |
| Quasi-Cyclic | 50 | 0.01995 | 1 × 10^6 |

Despite their identical degree distributions, the q = 50 QC code achieves the best overall FER

performance over d = 1, 2, 4, 8 dimensions in comparison to the random and q = 21 QC codes, due to its

slightly lower code rate of Rcode = 0.01995 versus Rcode = 0.02 for the random and q = 21 QC codes.

At low SNR where β is high, the q = 21 QC code also performs better than the random code over all

dimensions, likely due to the longer block length of n = 1.008 × 10^6 bits versus n = 10^6 bits for the

random and q = 50 QC codes. At higher SNR though, the random code achieves a lower error-floor

than the q = 21 QC code due to higher randomness in the parity-check matrix.

It was empirically found that the d = 2, d = 4, and d = 8 reconciliation schemes achieve approxi-

mately 0.04dB, 0.08dB, and 0.2dB of coding gain, respectively, over the d = 1 scheme in the waterfall

region for all three codes. As previously mentioned in Chapter 2, FER performance in the waterfall

region is of particular interest for long-distance CV-QKD since it corresponds to the high β-efficiency

region of operation at low SNR close to the Shannon limit. The error-floor region beyond the waterfall

is not of practical use in CV-QKD as it corresponds to the low β-efficiency region where transmission

distance is limited.

As previously discussed in Chapter 2, for any binary linear block code, the number of possible
codewords is 2^k = 2^(n R_code). In this case, when n = 1 × 10^6 bits and R_code = 0.02, the number of
possible valid codewords for the decoder to choose from is approximately 4 × 10^6020. In order to detect
invalid decoding errors when the parity check CH^T = 0 but Ŝ ≠ S, a 32-bit CRC code is included in
each LDPC frame. In this work, N_CRC = 32 bits were sufficient to detect all invalid decoded messages
without sacrificing information throughput. Having full control of the simulation environment, it was
also empirically found that P_undetected error = 0 using a 32-bit CRC code.
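The codeword count quoted above is too large to print directly, but its order of magnitude is easy to verify with logarithms. The helper names below are illustrative.

```python
import math


def codeword_count_log10(n, rate):
    """log10 of 2^k for a binary linear block code with k = n * rate."""
    k = round(n * rate)
    return k * math.log10(2)


def as_scientific(log10_value):
    """Split 10^x into (mantissa, exponent) with 1 <= mantissa < 10."""
    exponent = math.floor(log10_value)
    mantissa = 10 ** (log10_value - exponent)
    return mantissa, exponent


if __name__ == "__main__":
    mantissa, exponent = as_scientific(codeword_count_log10(1_000_000, 0.02))
    print(f"2^20000 is about {mantissa:.2f} x 10^{exponent}")
```

For n = 10^6 and R_code = 0.02 this gives k = 20000 and a count of roughly 3.98 × 10^6020, which rounds to the 4 × 10^6020 stated in the text.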

The probability of an invalid decoding error is given by

$$P\left(CH^{\top} = 0 \,\cap\, \text{CRC Fail} \,\cap\, \hat{S} \neq S\right) = \frac{\text{Number of CRC Errors}}{\text{Total Number of Frame Errors}}.$$
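The ratio above can be estimated in simulation by classifying every decoded frame. The sketch below is one possible bookkeeping scheme, not the thesis simulation code: it assumes the CRC is the CRC-32 provided by Python's `zlib` module (the text does not specify the polynomial), that the 4 CRC bytes are appended to the payload, and that a frame error is either a non-converged frame or a converged frame whose CRC fails. All function names are hypothetical.

```python
import zlib


def append_crc(payload: bytes) -> bytes:
    """Attach a 32-bit CRC (zlib's CRC-32, an assumption) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(frame: bytes) -> bool:
    payload, tag = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == tag


def classify(parity_satisfied: bool, decoded_frame: bytes) -> str:
    """Label one decoded frame."""
    if not parity_satisfied:
        return "not_converged"        # syndrome is non-zero
    # Parity check passes; only the CRC can reveal a wrong codeword.
    return "ok" if crc_ok(decoded_frame) else "invalid_decoding"


def invalid_decoding_probability(outcomes):
    """Number of CRC errors divided by the total number of frame errors."""
    frame_errors = [o for o in outcomes if o != "ok"]
    if not frame_errors:
        return 0.0
    return frame_errors.count("invalid_decoding") / len(frame_errors)


if __name__ == "__main__":
    good = append_crc(b"reconciled key material")
    bad = bytes([good[0] ^ 1]) + good[1:]     # flip one payload bit
    outcomes = [classify(True, good), classify(True, bad), classify(False, good)]
    print(outcomes, invalid_decoding_probability(outcomes))
```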

Figure 4.4 shows the probability of an invalid decoding error over the SNR range of interest for d =
1, 2, 4, 8 reconciliation dimensions on the BIAWGNC for the three LDPC codes designed in this thesis.
In general, the probability of invalid decoding increases as the SNR increases and becomes the main
source of frame error, particularly in the error-floor region as a result of the large block length and low
code rate. In the low-SNR region of operation for long-distance CV-QKD where the FER P_e ≈ 1, invalid
decoding convergence still contributes to nearly 10% of all frame errors. A concatenated higher-rate code
was not included as part of the message component to correct residual errors [10, 11].

Figure 4.2: FER vs. SNR for Sum-Product decoding with d = 1 and d = 2 dimensional reconciliation on BIAWGNC. (Axes: SNR from 0.028 to 0.032 vs. frame error rate from 10^-2 to 1. Simulation settings: maximum of 500 iterations, 100 frame errors, 32-bit floating point. Curves: random code, R = 0.02; QC code, q = 21, R = 0.02; QC code, q = 50, R = 0.01995; each for d = 1 and d = 2.)

Figure 4.3: FER vs. SNR for Sum-Product decoding with d = 4 and d = 8 dimensional reconciliation on BIAWGNC. (Axes: SNR from 0.028 to 0.032 vs. frame error rate from 10^-2 to 1.)

Up until this point, the performance of the reconciliation algorithm has been presented as a coding
theory problem, where an LDPC code was designed to achieve a particular FER at a given SNR
operating point. The SNR was considered as an abstraction of the BIAWGNC in order to demonstrate
fixed-rate code performance, independent of other CV-QKD system parameters such as modulation
variance, transmission distance, and physical losses. Assuming that the transmission distance and physical
parameters of the quantum channel are fixed, Alice's modulation variance can be optimally tuned such
that the effective secret key rate K_eff is then solely determined by the FER and β-efficiency of the
LDPC-decoding reconciliation algorithm.

Figure 4.4: Probability of invalid decoding error vs. SNR for Sum-Product decoding with d = 1, 2, 4, 8 dimensional reconciliation on BIAWGNC. Probability of error is computed for decoded messages that satisfy the parity check but fail the CRC. (Axes: SNR from 0.028 to 0.032 vs. probability of invalid decoding error from 0 to 1. Curves: random, QC q = 21, and QC q = 50 codes, each for d = 1, 2, 4, 8.)

Figure 4.5: FER vs. reconciliation efficiency for Sum-Product decoding with d = 1 and d = 8 dimensional reconciliation on BIAWGNC. FER values are derived from the FER vs. SNR curves based on Eq. 2.15. (Axes: reconciliation efficiency β from 0.8 to 1 vs. frame error rate from 10^-3 to 1. Curves: random LDPC code, rate 0.02; QC LDPC code, q = 21, rate 0.02; QC LDPC code, q = 50, rate 0.01995; each for d = 1 and d = 8.)

Figure 4.5 shows that for each fixed-rate LDPC code, there exists a unique FER-β pair, where each β

corresponds to a particular SNR operating point based on Eq. 2.15. While it may appear from Eq. 2.14

that maximizing β would produce a higher effective secret key rate, Fig. 4.5 shows that β and FER

are positively correlated, such that there exists an optimal trade-off between β and FER where Keff

is maximized for a fixed transmission distance. To achieve key reconciliation at long distances, the

operating point must be chosen in the waterfall region where β is high, despite the high FER.
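The mapping between β and the SNR operating point for a fixed-rate code can be made concrete with a short calculation. Eq. 2.15 is not reproduced in this section, so the sketch assumes it is the usual definition β = R_code / C(s), with C(s) = 0.5 log2(1 + s) the capacity of the real-valued AWGN channel; under that assumption the range β = 0.88 to 0.99 for R_code = 0.02 maps onto the SNR range plotted in Figures 4.2 and 4.3. Function names are illustrative.

```python
import math


def capacity(snr):
    """Shannon capacity of the real AWGN channel in bits per channel use."""
    return 0.5 * math.log2(1.0 + snr)


def beta_efficiency(rate, snr):
    """ASSUMED form of Eq. 2.15: code rate divided by channel capacity."""
    return rate / capacity(snr)


def snr_for_beta(rate, beta):
    """Invert the relation above: the SNR at which a fixed-rate code has efficiency beta."""
    return 2.0 ** (2.0 * rate / beta) - 1.0


if __name__ == "__main__":
    for beta in (0.80, 0.88, 0.95, 0.99):
        print(f"beta = {beta:.2f} -> SNR = {snr_for_beta(0.02, beta):.5f}")
```

A higher β therefore means operating the same code at a lower SNR, which is why the FER rises as β approaches 1.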

The results presented in this section showed that higher-dimension reconciliation schemes, namely

d = 4 and d = 8, extend code performance to lower SNR where the FER Pe > 0 and β → 1. As such, the

d = 8 scheme is most suitable for long-distance reconciliation. The next section examines the impact of

reconciliation dimension, β-efficiency, and FER on the finite secret key rate over a range of transmission

distances for the LDPC codes designed in this thesis.

4.3 Finite Secret Key Rate

This section extends the discussion of the effective secret key rate to include finite-size effects. Key

reconciliation for a particular β-efficiency is only achievable over a limited range of distances where the

finite secret key rate Kfinite > 0. In general, for a single FER-β pair, LDPC decoding can achieve either

(1) a high secret key rate at short distance, or (2) a low secret key rate at long distance. For long-distance

CV-QKD beyond 100km, key reconciliation is only achievable with high β-efficiency at the expense of

low secret key rate. This section provides an overview of the maximum achievable finite secret key rates

and reconciliation distances for the three LDPC codes designed in this thesis. Results are presented for

the d = 1 and d = 8 reconciliation dimensions in order to demonstrate the effectiveness of higher-order

dimensionality on reconciliation distance.

The range of transmission distances for each β is limited by the total noise between Alice and Bob.

From Eq. A.3 (Appendix A), the total noise can be expressed as a function of β, such that

χ′total(β) =Vopt

A (β)

s(β)− 1, (4.1)

where VoptA (β) is a vector of Alice’s optimal modulation variances for a particular β-efficiency from

Fig. A.1, and the SNR s(β) is given by Eq. 2.15 for a fixed-rate LDPC code. From the expression for the

total channel noise given by Eq. A.1, a set of transmission distance points for a particular β can then

be described by the vector

`′(β) =10

αlog10

(η(χ′total(β)− ε+ 1)

1 + Vel

), (4.2)

in order to compute the maximum finite secret key rate based on Eq. 2.16, where α = 0.2dB/km is

single-mode fiber transmission loss, η is Bob’s homodyne detector efficiency, ε is the excess channel noise

in shot noise units, and Vel is Bob’s added electronic noise in shot noise units.
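As a rough numerical illustration of Eqs. 4.1 and 4.2, the sketch below maps a modulation variance and an SNR to a total noise value, and a total noise value to a distance. The channel constants are the ones printed in the legends of Figures 4.6 and 4.7. The function names and the forward model `total_noise_at` are illustrative assumptions rather than code from this thesis: the forward model is simply Eq. 4.2 solved for the total noise, assuming a fiber transmittance of T = 10^(-αℓ/10).

```python
import math

# Channel parameters as printed in the legends of Figures 4.6 and 4.7.
ALPHA_DB_PER_KM = 0.2   # single-mode fiber loss
ETA = 0.606             # homodyne detector efficiency
EPSILON = 0.005         # excess channel noise (shot noise units)
V_EL = 0.041            # electronic noise (shot noise units)


def total_noise(v_a_opt, snr):
    """Eq. 4.1: total noise from the modulation variance and the SNR."""
    return v_a_opt / snr - 1.0


def distance_km(chi_total):
    """Eq. 4.2: transmission distance that yields the given total noise."""
    ratio = ETA * (chi_total - EPSILON + 1.0) / (1.0 + V_EL)
    return (10.0 / ALPHA_DB_PER_KM) * math.log10(ratio)


def total_noise_at(distance):
    """Assumed forward model: Eq. 4.2 solved for the total noise."""
    transmittance = 10.0 ** (-ALPHA_DB_PER_KM * distance / 10.0)
    return (1.0 + V_EL) / (ETA * transmittance) - 1.0 + EPSILON


if __name__ == "__main__":
    # Round trip at 100 km as a consistency check of the two expressions.
    chi = total_noise_at(100.0)
    print(chi, distance_km(chi))
```

In practice the inputs to `total_noise` would be the optimal variance from Fig. A.1 and the SNR from Eq. 2.15, neither of which is reproduced here.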


[Figure 4.6 plot. x-axis: Distance (km), 70 to 170. y-axis: Finite Secret Key Rate (bits/pulse), logarithmic, $10^{-8}$ to $10^{-2}$. Curves: β = 0.80, 0.83, 0.86, 0.89, 0.92, 0.95, 0.96 for the random LDPC code (rate 0.02), the q = 21 QC LDPC code (rate 0.02), and the q = 50 QC LDPC code (rate 0.01995). Parameters: ε = 0.005, vel = 0.041, η = 0.606, α = 0.2, Nprivacy = $10^{12}$.]

Figure 4.6: d = 1 dimensional reconciliation with Nprivacy = $10^{12}$ bits: finite secret key rate Kfinite vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.

[Figure 4.7 plot. x-axis: Distance (km), 70 to 170. y-axis: Finite Secret Key Rate (bits/pulse), logarithmic, $10^{-8}$ to $10^{-2}$. Curves: β = 0.80, 0.83, 0.86, 0.89, 0.92, 0.95, 0.98, 0.99 for the random LDPC code (rate 0.02), the q = 21 QC LDPC code (rate 0.02), and the q = 50 QC LDPC code (rate 0.01995). Parameters: ε = 0.005, vel = 0.041, η = 0.606, α = 0.2, Nprivacy = $10^{12}$.]

Figure 4.7: d = 8 dimensional reconciliation with Nprivacy = $10^{12}$ bits: finite secret key rate Kfinite vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.


Figures 4.6 and 4.7 present the finite secret key rate results for the three LDPC codes over the transmission distance range of interest with Nprivacy = $10^{12}$ bits based on the d = 1 and d = 8 reconciliation dimensions, respectively. Each β-efficiency curve in Figures 4.6 and 4.7 represents a FER-β pair where the FER and SNR are constant over the entire transmission distance range, while VA is optimally chosen to achieve the maximum secret key rate at each distance point. When β is high, the FER Pe → 1, and thus Kfinite → 0 as erroneous frames are discarded after decoding. As a result, the maximum reconciliation distance is limited by the error-correction performance of the LDPC code. Appendix C presents the finite secret key rate results for Nprivacy = $10^{12}$ bits with d = 2 and d = 4 reconciliation dimensions, as well as the finite secret key rate results with Nprivacy = $10^{10}$ bits for d = 1, 2, 4, 8 reconciliation. When Nprivacy = $10^{12}$ bits, the maximum transmission distance is extended by 18km over the result with Nprivacy = $10^{10}$ bits for d = 8 reconciliation with β = 0.99 efficiency. This demonstrates the importance of selecting a large block size for privacy amplification.

In each of the Nprivacy = $10^{10}$ and Nprivacy = $10^{12}$ cases, the three LDPC codes achieve similar finite secret key rates and reconciliation distances for β ≤ 0.92, since the codes are operating close to their respective error floors. However, for β > 0.92, the FER becomes a limiting factor to achieving a non-zero secret key rate. The d = 1 scheme achieves a maximum efficiency of β = 0.96, where the maximum distance is limited to 124km with Nprivacy = $10^{10}$ bits, and 132km with Nprivacy = $10^{12}$ bits. For β > 0.96, the FER Pe = 1, thus Kfinite = 0. The d = 8 scheme operates up to β = 0.99 efficiency, with a maximum distance of 142km with Nprivacy = $10^{10}$ bits, and 160km with Nprivacy = $10^{12}$ bits. Furthermore, the d = 8 scheme achieves higher secret key rates for all three LDPC codes at β = 0.95 and β = 0.96 in comparison to the d = 1 scheme, since the codes achieve a lower FER. The d = 2 and d = 4 schemes both achieve a maximum efficiency of β = 0.97, at 129km with Nprivacy = $10^{10}$ bits, and 138km with Nprivacy = $10^{12}$ bits.

The finite secret key rate Kfinite results presented in this section were normalized to the pulse rate,

without consideration of the light source repetition rate frep. By accounting for the repetition rate, the complete

operating secret key rate of the CV-QKD system can be defined as

$$K'_{\mathrm{finite}} = f_{\mathrm{rep}} K_{\mathrm{finite}} \quad \text{(bits/s)}. \tag{4.3}$$

The next section presents an overview of a GPU-based LDPC decoder implementation where the infor-

mation throughput for the three LDPC codes designed in this thesis is compared to the upper bound

on secret key rate at the maximum reconciliation distance points.

4.4 GPU-Accelerated LDPC Decoding

GPUs are a highly suitable platform for the implementation of LDPC decoders that target high infor-

mation throughput with long block-length codes. Computational acceleration of the belief propagation

algorithm is achieved by parallelizing the check and variable node update operations across thousands of

single-instruction multiple-thread (SIMT) cores, which provide floating-point precision, high-bandwidth

read/write access to on-chip memory, and intrinsic mathematical libraries for the logarithmic functions

of the Sum-Product algorithm [133–137].

This section provides an overview of the GPU-based LDPC decoder implementation in this thesis.


GPU throughput results are presented for the maximum CV-QKD distances under d = 1, 2, 4, 8 dimen-

sional reconciliation, and also compared to the maximum achievable secret key rates for reconciliation

efficiencies β > 0.85. Finally, the implementation is compared to previous work by Jouguet and Kunz-Jacques

for an LDPC code with block length of $2^{20}$ bits [60], as well as other non-LDPC codes. The

GPU decoding throughput results presented in this section quantitatively highlight the computational

speedup that can be achieved using quasi-cyclic LDPC codes for long-distance CV-QKD. The presented

results were measured using a single GPU, however, further computational speedup can be achieved

by concurrently decoding multiple frames using multiple GPUs. The complete simulation framework is

presented in Appendix B.

4.4.1 GPU-Based LDPC Decoder Implementation

The LDPC decoder was implemented on a single NVIDIA GeForce GTX 1080 (Pascal Architecture) GPU

with 2560 CUDA cores using the NVIDIA CUDA C++ application programming interface. Figure 4.8

shows the data flow for a single decoding iteration of the parallelized Sum-Product algorithm, which is

comprised of four multi-threaded compute kernels. Each kernel instantiates a different number of GPU

threads depending on the level of parallelism for the operation. The individual compute operations of

the Sum-Product algorithm are re-ordered to exploit the maximum amount of thread-level parallelism in

each kernel such that the latency per iteration is minimized. The overall throughput of the GPU-based

LDPC decoder is then determined by the number of iterations, latency per iteration, and block length.

The complexity of an LDPC decoder implementation stems from the highly-irregular interconnect

structure between CNs and VNs described by the code’s Tanner graph. For codes with short block

lengths, the permutation network complexity does not introduce significant GPU decoding latency [133–135];

however, for codes with block lengths on the order of $10^{6}$ bits, such as those designed in this thesis,

data permutation and message passing constitute between 25% and 50% of GPU runtime per decoding

iteration, as shown in Table 4.2. While arithmetic operations are relatively inexpensive on a GPU,

addressing global memory is very costly in terms of compute time. The most expensive GPU operation is

addressing unordered memory, i.e., accessing non-consecutive memory locations, as multiple transactions

are required to perform the unordered memory read or write, and all kernel threads must be stalled [134].

In contrast, coalesced memory addressing, i.e., accessing consecutive memory locations, can be

performed in a single transaction and allows for concurrent thread execution, which reduces the runtime

of the kernel. Furthermore, uncoalesced memory writes are more expensive than uncoalesced memory

reads. Thus, the throughput of a GPU-based decoder is highly dependent on memory access patterns,

i.e., the decoder is memory-bound as opposed to compute-bound.
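The cost gap between coalesced and unordered access can be illustrated with a toy transaction count. The sketch below is a simplified model, not a measurement of the GTX 1080: it assumes that 32 threads issue a request together, that memory is served in 128-byte segments, and that each thread accesses one 4-byte single-precision value. All three constants and the function name are assumptions made for illustration.

```python
WARP_SIZE = 32        # assumed number of threads issuing a request together
SEGMENT_BYTES = 128   # assumed size of one global-memory transaction
WORD_BYTES = 4        # one single-precision value per thread


def transactions_per_warp(indices):
    """Count the distinct memory segments touched by one group of threads."""
    segments = {(index * WORD_BYTES) // SEGMENT_BYTES for index in indices}
    return len(segments)


# Coalesced: thread t accesses element t (consecutive addresses).
coalesced = list(range(WARP_SIZE))

# Unordered: thread t accesses element 1000 * t (widely scattered addresses).
scattered = [1000 * t for t in range(WARP_SIZE)]

if __name__ == "__main__":
    print(transactions_per_warp(coalesced))   # one segment serves all threads
    print(transactions_per_warp(scattered))   # one segment per thread
```

Under these assumptions the scattered pattern needs one transaction per thread, which is the mechanism behind the message-passing kernels dominating the per-iteration runtime.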

The operations of the Sum-Product algorithm presented in Algorithm 1 (Chapter 2) were re-ordered

to avoid uncoalesced memory writes and to use the maximum amount of thread-level parallelism for

arithmetic computations. For example, the VN-to-CN message-passing permutation in Kernel 1 also

performs the Φ(·) computation from the next CN-update step in each thread. The CN-update Kernel

(2) does not fully compute the mcv messages from each CN to its connected VNs, but instead, the

final CN-to-VN mcv messages are computed in the CN-to-VN message-passing Kernel (3). Due to

the Tanner graph structure and data permutation nature of the LDPC decoder, uncoalesced memory

reads are still required when reading from edge memory in Kernel 1 and reading from CN memory in

Kernel 3. However, the latency of these operations is negligible compared to the overall latency of an

entire iteration. Fully-coalesced memory writes are enabled by the different ordering of connected edges


[Figure 4.8 diagram. Kernels, top to bottom: Kernel 1 (VN-to-CN Message Passing, threads t1 to tT), Kernel 2 (CN Update, threads t1 to tn-k), Kernel 3 (CN-to-VN Message Passing, threads t1 to tT), Kernel 4 (VN Update, threads t1 to tn). Memory blocks: Edge Memory (VN-to-CN Lvc Messages), Edge Memory (VN-to-CN Φ(|Lvc|) Messages), CN Memory (VN-to-CN Φ(|Lvc|) Messages), CN Memory (mc Intermediate Values), Edge Memory (CN-to-VN mcv Messages), VN Memory (CN-to-VN mcv Messages), VN Memory (Lv LLRs), VN Memory (Updated Lv LLRs). Each access is labelled as an aligned or unaligned memory read or write.]

Figure 4.8: GPU implementation of LDPC decoder showing four multi-threaded compute kernels and data flow from top to bottom for one decoding iteration. Coalesced memory access patterns and message variables are indicated. Thread i is denoted by ti, where T in kernels 1 and 3 represents the maximum number of connections between all CNs and VNs, (n − k) in Kernel 2 is the number of CNs, and n in Kernel 4 is the number of VNs. Early termination is not shown. All memory blocks shown in the figure are in Global GPU Memory. The threads in each kernel use Shared GPU Memory to store intermediate values during the execution of the kernel.


Table 4.2: GPU-based LDPC decoding latency and error-correction performance for rate 0.02 multi-edge codes

LDPC Code                              Random Multi-Edge   q = 21 QC Multi-Edge   q = 50 QC Multi-Edge
Block Length (Bits)                    1 × $10^{6}$        1.008 × $10^{6}$       1 × $10^{6}$
Code Rate                              0.02                0.02                   0.01995
Connections in Parity Matrix           3,337,494           160,185                66,747

Latency by Kernel with Percent Breakdown for 1 Decoding Iteration
Kernel 1 Runtime, VN-to-CN (ms)        1.773 (50.3%)       0.446 (34.4%)          0.391 (33.2%)
Kernel 2 Runtime, CN Update (ms)       0.197 (5.6%)        0.204 (15.7%)          0.198 (16.8%)
Kernel 3 Runtime, CN-to-VN (ms)        1.240 (35.1%)       0.317 (24.4%)          0.303 (25.7%)
Kernel 4 Runtime, VN Update (ms)       0.318 (9.0%)        0.331 (25.5%)          0.286 (24.3%)
Total Latency Per Iteration (ms)       3.528 (100.0%)      1.296 (100.0%)         1.177 (100.0%)

FER Performance and Decoding Throughput at β = 0.99 and d = 8
Max Iterations                         500                 500                    500
Average Iterations *                   470                 451                    470
FER                                    0.883               0.792                  0.883
KrawGPU Raw Throughput (Mb/s)          0.603               1.724                  1.807
K′GPU Information Throughput (Kb/s)    1.409               7.160                  4.207

* Early-termination check is enabled only after the number of decoding iterations is equal to the average number of iterations, which is determined empirically through FER simulation and stored in a lookup table.

in the VN-to-CN and CN-to-VN message-passing kernels (1 and 3). In the VN-to-CN message-passing

Kernel (1), the edge connectivity is ordered by consecutive VNs, while in the CN-to-VN message-passing

Kernel (3), the edges are ordered by consecutive CNs. Each CN-VN edge in the edge memory has a

unique index that is addressed by both message-passing kernels (1 and 3). Several additional memory

optimizations improve the overall GPU throughput. All of the memory blocks shown in Fig. 4.8 are in

global GPU memory, while shared GPU memory is used in each kernel thread to store local variables

and to avoid expensive global memory accesses. Texture caches are used to store frequently-accessed

static variables such as channel LLRs and the parity-check matrix. Prior to executing the GPU-based

decoder, the received channel LLRs are first transferred from the host to global GPU memory. Data is

kept on the GPU during decoder runtime, and the decoded codeword is transferred from global GPU

memory to the host after the decoder terminates.
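The kernel split described above can be sketched sequentially. The code below is a minimal single-threaded model, assuming the textbook Sum-Product formulation with Φ(x) = −log(tanh(x/2)); it is not the CUDA implementation of this thesis, and the function names, the clamping constants, and the `(cn, vn)` edge-list layout are assumptions. Each list comprehension or loop stands in for one kernel, with one GPU thread per list element, and `edge_orders` shows the two orderings of the same unique edge indices used by the message-passing kernels.

```python
import math


def phi(x):
    """Phi(x) = -log(tanh(x / 2)), clamped to avoid log(0) and saturation."""
    x = min(max(x, 1e-12), 50.0)
    return -math.log(math.tanh(x / 2.0))


def edge_orders(edges):
    """Edge indices sorted by VN (Kernel 1 order) and by CN (Kernel 3 order)."""
    ids = range(len(edges))
    by_vn = sorted(ids, key=lambda e: (edges[e][1], edges[e][0]))
    by_cn = sorted(ids, key=lambda e: (edges[e][0], edges[e][1]))
    return by_vn, by_cn


def decode_iteration(edges, n_cn, n_vn, l_ch, l_vc):
    """One iteration; edges[e] = (cn, vn), l_vc[e] is the VN-to-CN LLR on e."""
    n_e = len(edges)
    # Kernel 1 (one thread per edge): pass L_vc and pre-compute Phi(|L_vc|).
    mag = [phi(abs(l_vc[e])) for e in range(n_e)]
    sgn = [-1.0 if l_vc[e] < 0 else 1.0 for e in range(n_e)]
    # Kernel 2 (one thread per CN): intermediate sum and sign product only.
    cn_sum = [0.0] * n_cn
    cn_sgn = [1.0] * n_cn
    for e, (c, _) in enumerate(edges):
        cn_sum[c] += mag[e]
        cn_sgn[c] *= sgn[e]
    # Kernel 3 (one thread per edge): finish m_cv by excluding the own edge.
    m_cv = [cn_sgn[c] * sgn[e] * phi(cn_sum[c] - mag[e])
            for e, (c, _) in enumerate(edges)]
    # Kernel 4 (one thread per VN): update the LLRs and the next L_vc.
    l_v = list(l_ch)
    for e, (_, v) in enumerate(edges):
        l_v[v] += m_cv[e]
    next_l_vc = [l_v[v] - m_cv[e] for e, (_, v) in enumerate(edges)]
    return l_v, next_l_vc, m_cv


if __name__ == "__main__":
    # Single parity check over three bits; the weak third bit gets corrected.
    edges = [(0, 0), (0, 1), (0, 2)]
    l_ch = [2.0, 2.0, -0.5]
    l_v, _, _ = decode_iteration(edges, 1, 3, l_ch, list(l_ch))
    print(l_v)
```

Deferring the final Φ(·) to the per-edge pass keeps the per-CN step down to a sum and a sign product, which mirrors the division of work between Kernels 2 and 3 in the text.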

As shown in Fig. 4.8, message-passing kernels (1 and 3) instantiate up to T threads, where T is the

maximum number of edge connections between all CNs and VNs, Kernel 2 instantiates (n− k) threads

equal to the total number of CNs in the matrix, and Kernel 4 instantiates up to n threads equal to the

total number of VNs in the matrix. When early termination is enabled, T threads are required in kernels

1 and 3, and n threads are required in Kernel 4. However, when early termination is disabled, the number

of threads instantiated in kernels 1, 3, and 4 can be reduced due to the large number of degree-1 VNs


along the long diagonal in the parity-check matrices, as illustrated in Fig. 4.1. As previously described,

degree-1 VNs along the diagonal need to pass VN-to-CN messages only in the first decoding iteration,

while CN-to-VN messages need to be passed to degree-1 VNs only if the early-termination condition is

enabled. The degree-1 VNs along the diagonal correspond to the majority (but not all) of the (n − k)

parity bits that are discarded after decoding, thus the VN update computation needs to be performed in

these degree-1 VNs only when early termination is enabled. The message-passing Kernels (1 and 3) need

only to instantiate threads that correspond to the CN-VN connections to the left of the long diagonal

in the matrix structure shown in Fig. 4.1. Similarly, the VN-update Kernel (4) needs only to instantiate

threads that correspond to VNs to the left of the long diagonal. This reduction in the number of threads

provides a marginal speedup in each iteration.
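The thread-count reduction can be illustrated by counting edges and VNs from an edge list. The sketch below is an assumption-laden simplification: it treats every degree-1 VN as a discarded parity bit on the long diagonal (the text notes that these are the majority, not all, of the parity bits), and it ignores the first iteration, in which degree-1 VNs still send their VN-to-CN messages. The function name and dictionary keys are hypothetical.

```python
def thread_counts(edges, n_cn, n_vn, early_termination):
    """Threads per kernel for an edge list of (cn, vn) pairs."""
    if early_termination:
        # Every edge and every VN must be processed to form decision bits.
        return {"kernels_1_3": len(edges), "kernel_2": n_cn, "kernel_4": n_vn}
    degree = [0] * n_vn
    for _, vn in edges:
        degree[vn] += 1
    # Skip edges and VNs that belong to degree-1 VNs (assumed parity bits).
    active_edges = sum(1 for _, vn in edges if degree[vn] > 1)
    active_vns = sum(1 for d in degree if d > 1)
    return {"kernels_1_3": active_edges, "kernel_2": n_cn, "kernel_4": active_vns}


if __name__ == "__main__":
    # Two CNs, four VNs; VNs 2 and 3 have degree 1.
    edges = [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 3)]
    print(thread_counts(edges, 2, 4, early_termination=True))
    print(thread_counts(edges, 2, 4, early_termination=False))
```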

While not shown in Fig. 4.8, the early-termination check is implemented via multiple kernels that

perform a parallel reduction following the VN-to-CN message-passing Kernel (1) in order to compute

the parity at each CN. Additional computations and memory reads/writes are required in the message-

passing and VN-update kernels (1, 3, and 4). The following additional operations must be performed to

enable an early-termination check: send the decision bit from each VN to its connected CNs, send all

mcv messages from each CN to its connected VNs (including those corresponding to connections along

the long diagonal), and calculate the decision bit in each VN. To reduce overall decoding latency and

maximize throughput, the early-termination check is performed only after a fixed number of decoding

iterations. This fixed number of iterations corresponds to the average number of iterations required at

each SNR point, and is pre-determined empirically through FER simulation for each code. The decoder

uses a lookup table to decide after how many decoding iterations to enable the early-termination check

based on the current SNR.
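The deferred early-termination check can be sketched as a simple control loop. The code below is an illustration under stated assumptions: the lookup table is modelled as a dictionary keyed by SNR, the callbacks `run_iteration` and `parity_satisfied` stand in for the decoding kernels and the parallel parity reduction, and all names and table values are hypothetical.

```python
def decode_with_deferred_check(snr, et_start, max_iterations,
                               run_iteration, parity_satisfied):
    """Run decoding iterations; test parity only from et_start[snr] onward.

    Returns (iterations used, success flag, number of parity checks run).
    """
    checks = 0
    for iteration in range(1, max_iterations + 1):
        run_iteration()
        if iteration >= et_start[snr]:
            checks += 1
            if parity_satisfied():
                return iteration, True, checks
    return max_iterations, False, checks


if __name__ == "__main__":
    state = {"iterations": 0}

    def run_iteration():
        state["iterations"] += 1

    def parity_satisfied():
        # Stand-in: pretend the frame converges after 3 iterations.
        return state["iterations"] >= 3

    # Hypothetical table entry: start checking at iteration 5 for this SNR.
    print(decode_with_deferred_check(0.03, {0.03: 5}, 10,
                                     run_iteration, parity_satisfied))
```

The trade-off is visible in the example: a frame that converges before the table entry runs a few extra iterations, but no parity reductions are spent on the iterations where convergence is unlikely.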

A quasi-cyclic matrix structure reduces data permutation and memory access complexity by elimi-

nating random, unordered memory access patterns. In addition, QC codes require fewer memory lookups

for message passing since the parity-check matrix can be described with approximately q-times fewer

terms, where q is the expansion factor of the QC parity-check matrix, in comparison to a random matrix

for the same block length. Table 4.2 presents a breakdown of the latency of each GPU kernel for the

three LDPC codes designed in this thesis. While the CN and VN update kernels (2 and 4) have similar

runtime for both random and QC codes, QC codes achieve faster runtime in data permutation kernels (1

and 3) due to the approximately q-times fewer CN-VN edge connections in the parity-check matrix. Since

the parity-check matrices designed in this thesis are sparse, a compressed data structure is used to store

CN-VN edge connections to reduce memory read latency in the message-passing kernels.
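The q-fold saving in stored terms can be seen by expanding a small QC base matrix. The sketch below assumes one common convention, in which a base entry s ≥ 0 denotes a q × q identity matrix cyclically shifted by s columns and −1 denotes an all-zero block; the convention, the function name, and the toy base matrix are assumptions and are not taken from the codes designed in this thesis.

```python
def expand_qc(base, q):
    """Expand a base matrix of circulant shifts into a list of (cn, vn) edges."""
    edges = []
    for block_row, row in enumerate(base):
        for block_col, shift in enumerate(row):
            if shift < 0:
                continue  # all-zero q x q block
            for i in range(q):
                cn = block_row * q + i
                vn = block_col * q + (i + shift) % q
                edges.append((cn, vn))
    return edges


def stored_terms(base):
    """Number of shift values needed to describe the whole matrix."""
    return sum(1 for row in base for shift in row if shift >= 0)


if __name__ == "__main__":
    base = [[0, 1, -1],
            [2, -1, 0]]
    q = 4
    print(stored_terms(base), len(expand_qc(base, q)))
```

For the codes in Table 4.2 the same ratio appears in the connection counts: 3,337,494 / 160,185 ≈ 20.8 for q = 21 and 3,337,494 / 66,747 ≈ 50.0 for q = 50.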

Table 4.2 also highlights the respective error-correction performance and GPU throughput of the

three codes at the maximum β = 0.99 efficiency with d = 8 reconciliation. The raw GPU throughput

(including parity bits) is given by

$$K^{\mathrm{raw}}_{\mathrm{GPU}} = \frac{\text{Block Length}}{\text{Latency Per Iteration} \times \text{Iterations}} \quad \text{(bits/s)}. \tag{4.4}$$

The information throughput of the GPU decoder must be scaled by (1) the FER Pe to account for

discarded frames when decoding is unsuccessful, i.e., CRC does not pass or parity check fails, and (2)

the code rate Rcode to account for the parity bits that must be discarded after decoding. The average


GPU information throughput is then given by

$$K'_{\mathrm{GPU}} = K^{\mathrm{raw}}_{\mathrm{GPU}} \, R_{\mathrm{code}} \, (1 - P_e). \tag{4.5}$$

Thus, for any LDPC code, the GPU throughput is determined by the latency per iteration and the

number of decoding iterations. The latency per iteration depends on the LDPC code structure and the

number of memory lookups, while the FER is bound by the maximum number of iterations.
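Eqs. 4.4 and 4.5 can be checked numerically against Table 4.2. The sketch below uses the q = 21 QC code entries (block length 1.008 × 10^6 bits, 1.296 ms per iteration, 451 average iterations, rate 0.02, FER 0.792); the function names are illustrative. Using the average iteration count as the "Iterations" term is an inference from the arithmetic, since that choice reproduces the tabulated throughput.

```python
def raw_throughput_bps(block_length_bits, latency_per_iteration_s, iterations):
    """Eq. 4.4: decoded bits per second, parity bits included."""
    return block_length_bits / (latency_per_iteration_s * iterations)


def information_throughput_bps(raw_bps, code_rate, fer):
    """Eq. 4.5: scale by the code rate and by the fraction of frames kept."""
    return raw_bps * code_rate * (1.0 - fer)


if __name__ == "__main__":
    raw = raw_throughput_bps(1.008e6, 1.296e-3, 451)
    info = information_throughput_bps(raw, 0.02, 0.792)
    print(raw / 1e6, "Mb/s raw")            # Table 4.2 lists 1.724 Mb/s
    print(info / 1e3, "Kb/s information")   # Table 4.2 lists 7.160 Kb/s
```

The small difference in the last digits of the information throughput comes from working with the rounded table entries.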

Some GPU-based LDPC decoders use fixed-point number representations and/or frame-level parallelism to maximize computational speedup for codes with short block lengths (n < $10^{5}$ bits) in high-SNR regions above 0dB where the Min-Sum algorithm achieves sufficient error-correction performance [133–137]. This work, however, uses single-precision floating point to minimize FER with Sum-Product decoding at SNRs below -15dB. Due to the large block length (n = $10^{6}$ bits), all GPU threads are fully utilized, thus external (frame-level) parallelism does not provide additional speedup. Asynchronous data transfer to the GPU is another technique often employed to minimize overhead latency; however, it does not provide any significant performance boost here, as the Sum-Product computation dominates overall execution time due to the large number of iterations required for low-SNR decoding.

4.4.2 Information Throughput Results

Figure 4.9 presents the measured information throughput K ′GPU from the GPU decoder for all three

LDPC codes at each β-efficiency point, which corresponds to a unique SNR-FER point in Figures 4.2

and 4.3 for the d = 1 and d = 8 dimensional reconciliation cases, respectively. Table 4.3 compares

the performance of the rate 0.02 random and QC codes at the maximum achievable distance for each

reconciliation dimension, assuming a privacy amplification block size of Nprivacy = 1012 bits. The q = 21

and q = 50 QC codes designed in this thesis achieve approximately 3× higher raw decoding throughput

KrawGPU over the random code with d = 1, 2, 4, 8 dimensional reconciliation at the maximum distance

point for each β-efficiency. When scaled by the corresponding FER and code rate, the QC codes achieve

between 5.1× and 12.8× higher information throughput K ′GPU over the random code. Table 4.3 also

presents the operating secret key rate K ′finite defined by Eq. 4.3, and the fundamental secret key rate

limit Klim for a lossy channel defined by Eq. 2.11. Here, the fundamental limit is scaled by the light

source repetition rate frep, such that

$$K'_{\mathrm{lim}} = f_{\mathrm{rep}} K_{\mathrm{lim}}. \tag{4.6}$$

A realistic CV-QKD repetition rate of frep = 1MHz is assumed for the comparison [59, 62, 100]. For

distances beyond 130km, the operating secret key rate K ′finite is between 2176× and 57112× lower than

the fundamental limit K ′lim, with d = 8 and d = 1 dimensional reconciliation, respectively. The upper

bound versus distance is plotted in Fig. 4.10, along with the GPU-decoded information throughput for

the q = 21 QC code under d = 8 dimensional reconciliation. Figure 4.10 illustrates that the decoded

information throughput K ′GPU of the reconciliation algorithm is higher than the upper bound on secret

key rate K ′lim on a lossy channel with a 1MHz source from β = 0.8 to β = 0.99.

The rightmost column in Table 4.3 (K ′GPU/K′lim) presents the two key results of this work. First, it

shows that the GPU decoder can achieve between 1.07× and 8.03× higher information throughput K ′GPU

over the fundamental secret key rate limit K ′lim with a 1MHz source using QC-LDPC codes with d = 4

and d = 8 dimensional reconciliation. The 8.03× speedup is also highlighted in Fig. 4.10 at 160km with

β = 0.99. Since the decoder delivers an information throughput higher than the fundamental key rate


Table 4.3: Overview of secret key rate and GPU throughput at maximum reconciliation distance with rate 0.02 multi-edge codes and Nprivacy = $10^{12}$ bits

Columns: reconciliation dimension; maximum reconciliation efficiency; LDPC code; maximum distance (km); operating secret key rate K′finite at maximum distance with frep = 1MHz (bit/s); fundamental key rate limit K′lim at maximum distance with frep = 1MHz (Kbit/s); GPU raw throughput KrawGPU (Mbit/s); GPU information throughput K′GPU (Kbit/s); K′GPU speedup over K′lim (K′GPU/K′lim).

Dimension  Max. Efficiency  LDPC Code   Max. Distance  K′finite  K′lim     KrawGPU   K′GPU     K′GPU/K′lim
                                        (km)           (bit/s)   (Kbit/s)  (Mbit/s)  (Kbit/s)
d = 1      β = 0.960        Random      131.38         0.060     3.405     0.612     0.111     0.033×
d = 1      β = 0.960        QC, q = 21  131.38         0.119     3.405     1.887     0.686     0.202×
d = 1      β = 0.960        QC, q = 50  131.43         0.235     3.397     1.966     1.426     0.420×
d = 2      β = 0.970        Random      137.99         0.051     2.510     0.612     0.223     0.087×
d = 2      β = 0.970        QC, q = 21  137.99         0.203     2.510     1.856     2.700     1.076×
d = 2      β = 0.970        QC, q = 50  137.85         0.050     2.526     1.983     0.360     0.142×
d = 4      β = 0.970        Random      137.99         0.101     2.510     0.604     0.439     0.175×
d = 4      β = 0.970        QC, q = 21  137.99         0.302     2.510     1.818     3.938     1.569×
d = 4      β = 0.970        QC, q = 50  137.85         0.401     2.526     1.855     2.692     1.065×
d = 8      β = 0.990        Random      160.47         0.230     0.891     0.604     1.409     1.581×
d = 8      β = 0.990        QC, q = 21  160.47         0.410     0.891     1.724     7.160     8.033×
d = 8      β = 0.990        QC, q = 50  160.52         0.224     0.889     1.808     4.207     4.733×

[Figure 4.9 plot. x-axis: Reconciliation Efficiency (β), 0.85 to 1. y-axis: GPU Information Throughput (bits/s), logarithmic, $10^{2}$ to $10^{6}$. Curves: GPU throughput of the random, QC q = 21, and QC q = 50 codes, each for d = 1 and d = 8.]

Figure 4.9: Measured information throughput K′GPU vs. reconciliation efficiency for d = 1 and d = 8 dimensional reconciliation. Each measurement point corresponds to a particular SNR operating point with a measured FER presented in Fig. 4.5.

[Figure 4.10 plot. x-axis: Distance (km), 0 to 180. y-axis: Information Rate (bits/second), logarithmic, $10^{2}$ to $10^{7}$. Curves: upper bound on secret key rate for a lossy channel with frep = 1MHz, and GPU information throughput for the q = 21 QC code at β = 0.80, 0.89, 0.92, 0.95, 0.98, 0.99, with the 8× gap annotated. Parameters: ε = 0.005, vel = 0.041, η = 0.606, α = 0.2, Nprivacy = $10^{12}$.]

Figure 4.10: GPU information throughput K′GPU of the q = 21 QC-LDPC code with d = 8 dimensional reconciliation up to the maximum distance point for β ∈ {0.80, 0.89, 0.92, 0.95, 0.98, 0.99}, and upper bound on secret key rate for lossy channel K′lim vs. distance.


limit, it can be concluded that LDPC decoding is no longer the post-processing bottleneck in CV-QKD,

and thus, the secret key rate remains only limited by the physical parameters of the quantum channel.

The second result is that d = 1 and d = 2 dimensional reconciliation schemes are not well-suited for

long-distance CV-QKD since the K ′GPU speedup over K ′lim is less than 1×. In general, Table 4.3 shows

that QC codes achieve lower decoding latency than the random code at long distances, thereby making

them more suitable for reverse reconciliation at high β efficiencies.

The results presented in Table 4.3 and Fig. 4.10 assumed a light source repetition rate of frep =1MHz.

While a higher source repetition rate such as frep = 100MHz or frep = 1GHz would raise the fundamental

secret key rate limit K ′lim above the maximum GPU decoder throughput K ′GPU, it would still not

introduce a post-processing bottleneck for CV-QKD. The GPU decoder currently delivers an information

throughput K ′GPU between 1868× and 18790× higher than the operating secret key rate K ′finite with a

1MHz light source at the maximum distance points for d = 1, 2, 4, 8 dimensional reconciliation schemes

beyond 130km. Even with a source repetition rate of frep = 1GHz, the GPU information throughput

K ′GPU would still exceed the operating secret key rate K ′finite between 1.8× and 18.7× for distances

beyond 130km, assuming the same quantum channel parameters. Therefore, GPUs remain a viable

platform for the implementation of reconciliation algorithms for long-distance CV-QKD.
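The repetition-rate argument is a direct application of Eq. 4.3: the operating secret key rate scales linearly with frep, while the decoder throughput does not depend on it. The sketch below reproduces the margin for one row of Table 4.3 (d = 8, q = 21 QC code, K′GPU = 7.160 Kbit/s and K′finite = 0.410 bit/s at 1MHz, i.e. Kfinite = 4.10 × 10^-7 bits/pulse). The function names are illustrative assumptions.

```python
def operating_key_rate_bps(k_finite_bits_per_pulse, f_rep_hz):
    """Eq. 4.3: operating secret key rate in bits per second."""
    return f_rep_hz * k_finite_bits_per_pulse


def decoder_margin(k_gpu_bps, k_finite_bits_per_pulse, f_rep_hz):
    """Ratio of GPU information throughput to operating secret key rate."""
    return k_gpu_bps / operating_key_rate_bps(k_finite_bits_per_pulse, f_rep_hz)


if __name__ == "__main__":
    k_gpu = 7160.0       # bits/s, Table 4.3, d = 8, QC q = 21
    k_finite = 4.10e-7   # bits/pulse, from 0.410 bit/s at 1 MHz
    for f_rep in (1e6, 100e6, 1e9):
        print(f_rep, decoder_margin(k_gpu, k_finite, f_rep))
```

For this row the margin falls from roughly 1.7 × 10^4 at 1MHz to roughly 17 at 1GHz, which lies inside the 1.8× to 18.7× range quoted above.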

4.4.3 Comparison to Other CV-QKD Implementations

While QKD has been well-studied over the past 30 years, the exploration of long-distance CV-QKD is

still nascent, with very few published implementations in the low-SNR regime for optical transmission

distances beyond 100km. Hardware-based implementations of DV-QKD and short-distance CV-QKD

have previously been demonstrated using FPGAs and GPUs [65, 66, 131, 138], however, at the time of

writing, there is only one reported CV-QKD implementation designed to operate in the low-SNR regime

for long-distance reconciliation [60].

Jouguet and Kunz-Jacques reported a GPU-based LDPC decoder implementation that achieves

7.1Mb/s throughput at SNR = 0.161 (β = 0.93) on the BIAWGNC [60], for a random multi-edge

LDPC code with a block length of $2^{20}$ bits based on the rate 1/10 multi-edge code designed by Richardson

and Urbanke with an SNR threshold of 0.1556 [71]. For throughput comparison purposes, two

additional multi-edge codes with the same code rate, block length, and SNR threshold were constructed

in this thesis: a random code and a q = 512 QC code.²

Table 4.4 presents a performance comparison between the two designed rate 1/10 codes and the result

achieved by Jouguet and Kunz-Jacques at SNR = 0.161 on the BIAWGNC [60]. The two designed codes

achieve a FER of approximately 0.04 under the same decoding conditions as the comparison work with

d = 8 dimensional reconciliation. Similar to the results presented in Tables 4.2 and 4.3, the q = 512 QC

code achieves approximately 3× lower latency per iteration than the random rate 1/10 code designed in

this thesis. Rate 1/10 QC codes with expansion factors q ∈ {64, 128, 256} were also designed; however,

the q = 512 QC code achieved the lowest latency per iteration due to the lower number of required

memory accesses in the GPU message-passing kernels, as a result of the lower number of connections

in the QC parity-check matrix. While the designed rate 1/10 random code achieves a maximum raw

throughput of only 2.78Mb/s, the q = 512 QC code delivers a maximum raw throughput of 9.17Mb/s with

early termination enabled only in iterations greater than the average number of iterations, as determined

² Lei M. Zhang specifically constructed the rate 1/10 random and QC codes based on the degree distribution designed by Richardson and Urbanke [71].


Table 4.4: GPU LDPC decoding comparison at SNR = 0.161 with d = 8 on BIAWGNC targeting FER = 0.04 with rate 1/10 codes

Specification                          This Work (2016)                                              Jouguet and Kunz-Jacques 2014 [60]
Code Rate                              1/10                                                          1/10
Block Length (Bits)                    $2^{20}$                                                      $2^{20}$
SNR                                    0.161                                                         0.161

LDPC Code Structure                    Random Multi-Edge         q = 512 QC Multi-Edge               Random Multi-Edge
Connections in Parity Matrix           4,063,229                 7,932                               N/A
Early Termination                      No          Yes           No          Yes                     No
Max Iterations                         88          88            100         100                     100
Average Iterations                     88          78            100         78                      100
FER (1)                                0.04        0.04          0.0243      0.0243                  0.04
Latency Per Iteration (ms) (2)         4.73        4.84          1.28        1.47                    1.48
KrawGPU Raw Throughput (Mb/s)          2.52        2.78          8.21        9.17                    7.1
K′GPU Info. Throughput (Kb/s)          242         267           801         895                     682

GPU Model                              NVIDIA GeForce GTX 1080                                       AMD Radeon HD 7970
CMOS Technology                        16nm                                                          28nm
GPU Cores                              2560                                                          2048
GPU GFLOPS                             8228                                                          3789
GPU Memory Bus Width (Bits)            256                                                           384
GPU Memory Bandwidth (GB/s)            320                                                           264

(1) FER Pe corresponds to the probability of detected error, since Pundetected = 0 with 32-bit CRC. All three codes achieve a CV-QKD distance of 83.8km based on the quantum channel parameters assumed in this thesis.
(2) Latency per iteration is an average for the full decoding of a single frame, and also includes the data transfer latency between the CPU and GPU.

empirically through FER simulation. The q = 512 QC code achieves a 1.29× higher throughput than the

7.1Mb/s reported by Jouguet and Kunz-Jacques [60], further demonstrating that the QC code structure

offers computational speedup benefits for multi-edge codes operating in the high β-efficiency region at

low SNR. Although the comparison work is from 2014, both GPU models have a similar memory bus

width, which is the primary constraint that limits the latency per iteration. As previously discussed,

GPU decoder performance is bound by the memory access rate, and not the floating point operations

per second (FLOPS). Thus, a wider GPU memory bus allows for a higher memory access rate, which in

turn, reduces the decoding latency.
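The 1.29× figure follows from the raw-throughput row of Table 4.4, and the raw throughput of the q = 512 QC code with early termination can itself be checked against Eq. 4.4. The worked check below uses the rounded table entries, which is presumably why the recomputed value differs slightly from the tabulated 9.17Mb/s:

```latex
\begin{align*}
K^{\mathrm{raw}}_{\mathrm{GPU}}
  &\approx \frac{2^{20}\ \text{bits}}{1.47\ \text{ms} \times 78}
   \approx 9.15\ \text{Mb/s}, \\
\frac{9.17\ \text{Mb/s}}{7.1\ \text{Mb/s}}
  &\approx 1.29 .
\end{align*}
```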

Other types of error-correcting codes have been studied for application in the low-SNR regime of CV-

QKD, such as polar codes, repeat-accumulate (RA) codes, and Raptor codes. Polar codes require block

lengths on the order of $2^{27}$ bits to achieve comparable FER performance to the rate 1/10 multi-edge

LDPC codes designed in this thesis, however, they have been shown to achieve low decoding latency on


generic x86 CPUs due to their recursive decoding algorithm [60]. A polar-code performance comparison

is not available for the rate 0.02 multi-edge QC-LDPC codes designed in this thesis. Punctured and

extended low-rate RA codes have been constructed from ETSI DVB-S2 codes with block lengths of 64,800

bits to achieve β > 0.85 efficiency over a wide range of SNRs [139], however, their performance has not

been investigated beyond 70km and there is currently no hardware implementation to provide a sufficient

throughput comparison. Lastly, Raptor codes achieve high β-efficiency at low SNR and guarantee error-

free decoding (Pe = 0) by sending as many coded symbols as required by the receiver [140]. However,

their decoding latency may be a limitation to high-throughput reconciliation, and at the time of writing,

there is no known hardware implementation of Raptor codes for long-distance CV-QKD. The demand for

long-distance communication through applications such as CV-QKD motivates the need for continued

research in high-efficiency codes and their hardware realizations.

4.5 Summary

This chapter introduced quasi-cyclic multi-edge LDPC codes to accelerate the reconciliation step in

long-distance CV-QKD by means of a GPU-based decoder implementation and multi-dimensional rec-

onciliation schemes. With an 8-dimensional reconciliation scheme, the GPU-based decoder delivers an

information throughput up to 8.03× higher than the upper bound on secret key rate for a lossy channel

with a 1MHz source, thereby demonstrating that key reconciliation is no longer a computational bottle-

neck in long-distance CV-QKD. Furthermore, the low-rate LDPC codes extend the maximum distance

of CV-QKD from the previously achieved 100km to 160km based on the quantum channel and privacy

amplification parameters assumed in this thesis. LDPC codes with longer block lengths on the order of

n = $10^{7}$ or n = $10^{8}$ bits could also be designed to improve the error-correction performance and further

increase distance at the expense of decoding latency.

The LDPC codes and reconciliation techniques applied in this thesis can be extended to post-

processing algorithms in two areas that show promise for the future of QKD: (1) free-space QKD using

low-Earth orbit satellites as communication relays to extend the distance of secure communication be-

yond 200km without fiber-optic infrastructure, and (2) fully-integrated chip implementations [40]. Recent

works have experimentally demonstrated terrestrial free-space QKD for distances up to 143km [141,142],

while satellite-based QKD has been proposed as a practical near-term solution to achieving long-distance

QKD on a global scale [56, 143]. In August 2016, China launched the Quantum Experiments at Space

Scale (QUESS) satellite to generate secret keys between ground stations in Beijing and Vienna by trans-

mitting entangled photon pairs from an orbit altitude of 500km [144, 145]. Free-space fading channels

for satellite QKD typically operate at SNRs above 0dB [146], however, quasi-cyclic code construction

techniques can still be employed to achieve high secret key rates, while GPUs would allow for simple

integration with other satellite equipment for rapid prototyping, in contrast to ASIC- or FPGA-based

LDPC decoder implementations. This thesis presented the computational speedup achievable on a single

state-of-the-art GPU. Further acceleration can be achieved through architectural optimizations in the

design of a monolithic QKD chip that combines both optical and post-processing circuits. Photonic chips

have already been realized for QKD transmitters and receivers [40,57,147,148], and further integration

of post-processing algorithms would provide a considerable reduction in system size and power consump-

tion. A final key takeaway here is that the quasi-cyclic LDPC code construction and GPU architecture

techniques presented in this thesis can also be applied to forward error-correction implementations for


DV-QKD where reconciliation is performed over the binary symmetric channel (BSC) instead of the

BIAWGNC as in CV-QKD. The derivation of reverse reconciliation with LDPC decoding for DV-QKD

on the BSC is provided in Appendix D.

This chapter addressed the challenge of achieving high-speed, high-efficiency reconciliation for long-

distance CV-QKD over fiber-optic cable. In addition to extending information-theoretic security to

general attacks for finite key sizes, a major remaining hurdle to extending the secure transmission distance

in CV-QKD is the reduction of excess noise in the optical quantum channel. While recent techniques have

been demonstrated to control excess noise to within a tolerable limit [63], future work may also investigate

the security of CV-QKD in the presence of non-Gaussian noise sources, and in particular, the performance

of LDPC decoding at low SNR with non-Gaussian noise. GPU-based decoder implementations with

quasi-cyclic codes would provide a suitable platform for such investigations. Furthermore, reducing the

latency of privacy amplification for large block sizes on the order of Nprivacy ≥ 10^12 bits is necessary in

order to realize secret key exchange for distances beyond 100km.

Chapter 5

Conclusion and Future Directions

This thesis presented techniques for reducing design complexity in the implementation of LDPC decoders

for integrated circuits targeting high-performance wireless channels, and secret key reconciliation in

quantum cryptography over long-distance optical fiber. This thesis showed that it was possible to

leverage the quasi-cyclic structure of LDPC parity-check matrices to reduce decoder latency, complexity,

and power, while maximizing throughput in each of the two distinct application areas. In Chapter 3,

a new message-passing schedule was proposed for a frame-interleaved architecture in order to minimize

interconnect routing complexity and reduce overall power consumption in silicon-based decoders for

modern CMOS technology nodes. The fabricated test chip achieves record energy efficiency among

published ASIC decoders for the IEEE 802.11ad standard. In Chapter 4, a quasi-cyclic structure was

applied to a multi-edge LDPC code with block length of 10^6 bits in order to enable coalesced GPU

memory access patterns to reduce decoding latency for long-distance quantum cryptography. The error-

correction performance of the LDPC code extends the maximum CV-QKD transmission distance from

100km to 160km, while the GPU-accelerated decoder delivers an information throughput higher than

the upper bound on secret key rate for a lossy channel. While LDPC decoding is no longer the post-

processing bottleneck, other factors such as privacy amplification and parameter estimation reduce the

secret key rate below this upper bound. The record results presented in Chapters 3 and 4 were achieved

through combined algorithmic and architectural techniques by exploiting the quasi-cyclic structure of

LDPC parity-check matrices in both integrated circuit and quantum cryptography application areas.

The two final sections in this chapter provide a summary of the contributions presented in this thesis,

as well as a recommendation for future research areas based on these contributions.

5.1 Summary of Contributions

The three major contributions of this thesis can be summarized as follows:

• Developed a low-power, frame-interleaved architecture for LDPC decoders to reduce interconnect

complexity and improve scalability in modern CMOS technology nodes by modifying the Min-

Sum belief propagation algorithm and introducing pipelined frame-interleaving and clock gating

techniques that exploit the inherent structure of QC-LDPC codes. QC-LDPC codes were used as a

vehicle to illustrate a general approach, however, the proposed architecture and decoding schedule

can be extended to non-QC codes.


• Designed (algorithm, micro-architecture, RTL), synthesized, placed-and-routed, fabricated, and

tested a 4.78mm^2 proof-of-concept test chip in 28nm CMOS containing 837K gates, 160Kb total

eSRAM, 2 asynchronous clock domains, and 2 power domains, achieving 6.78Gb/s and 11pJ/bit

efficiency at 76mW with 202MHz clock for IEEE 802.11ad codes at a BER of 10^-6.

• Constructed quasi-cyclic multi-edge LDPC codes with block lengths of 10^6 bits, and implemented a

1.72Mb/s GPU-accelerated (CUDA C++) decoder to extend the secure distance of key reconcilia-

tion in CV-QKD from the previous 100km to 160km over fiber-optic cable using multi-dimensional

reconciliation schemes that achieve up to 8× higher throughput than the upper bound on secret

key rate for a lossy channel.

5.2 Future Directions

The frame-interleaved LDPC decoder architecture presented in Chapter 3 can be extended to a number of

active research areas. First, the architecture can be extended for non-quasi-cyclic and spatially-coupled

LDPC codes. Second, the architecture can be adopted for silicon implementation of new decoding algo-

rithms that achieve better error-correction performance than traditional belief propagation algorithms.

Third, the architecture can be adopted for near-threshold voltage operation to further reduce power

consumption. Finally, the architecture can be extended to a stacked-die implementation for codes with

longer block lengths, like those investigated in Chapter 4 for QKD. This section suggests some future

research directions based on the contributions presented in this thesis.

5.2.1 Extendibility to Non-Quasi-Cyclic and Spatially-Coupled LDPC Codes

As suggested in Chapter 3, the path-unrolled architecture is not restricted to QC codes, but can rather

be applied more generally to random codes, or to codes that allow for column partitioning as a way to

enforce structure. In the path-unrolled structure, global routing is eliminated, and instead, routing is

constrained between successive column-slice pairs. Thus, any LDPC code that can be represented as a

Tanner graph with two independent vertex sets of CNs and VNs can be implemented using the proposed

architecture with a path-unrolled message-passing schedule.

The architecture can be extended to support non-QC codes, as well as random codes. Consider the

example of a non-QC matrix with two CN connections to a VN in a single macro-layer, i.e., two ‘1’

elements in a sub-matrix column. Here, the combined CN+VN processing unit simply needs additional

CN-update logic to support the additional CN connected to the VN, while the VN-update logic remains

the same. Additional wired routing is required to and from the additional CN-update logic block,

however, the overall path-unrolled message-passing schedule remains the same. This example can be

extended to a random code, which can be partitioned into uniformly-defined macro-columns. Combined

CN+VN processing units would then only contain CN logic for the CNs connected to each VN. The

drawback of this implementation is that some CN+VN processors would have more internal CN-update

logic blocks, while other CN+VN processors would have fewer. Future work may explore more efficient

hardware mapping and partitioning of non-QC codes for the architecture introduced in this thesis.
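The imbalance described above can be made concrete with a short sketch. The Python below is a hypothetical illustration and not the mapping flow used in this thesis: it assumes the parity-check matrix is given as a list of 0/1 rows and that the designer picks a macro-layer height `z`. For each VN (column) it counts the '1' elements in each macro-layer, which is the number of CN-update logic blocks that VN's combined CN+VN processing unit would need for that layer. The function name `cn_blocks_per_vn` and the toy matrix are assumptions made for illustration.

```python
def cn_blocks_per_vn(h, z):
    """Count '1' elements per column within each macro-layer of height z.

    h is a parity-check matrix as a list of equal-length 0/1 rows (assumed
    format). Returns counts[layer][column]: how many CNs of that macro-layer
    connect to that VN. A QC matrix built from permutation sub-matrices
    yields only 0 or 1; larger values indicate extra CN-update logic.
    """
    if len(h) % z != 0:
        raise ValueError("number of rows must be a multiple of z")
    n_cols = len(h[0])
    counts = []
    for start in range(0, len(h), z):
        layer = h[start:start + z]
        counts.append([sum(row[col] for row in layer) for col in range(n_cols)])
    return counts


# Toy non-QC example (hypothetical): 4 CNs, 4 VNs, macro-layer height 2.
H = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],  # column 0 has two '1' elements in the first macro-layer
    [0, 1, 0, 1],
    [0, 0, 1, 1],  # column 3 has two '1' elements in the second macro-layer
]
counts = cn_blocks_per_vn(H, z=2)
print(counts)  # → [[2, 1, 1, 0], [0, 1, 1, 2]]
```

In this toy case the processors for columns 0 and 3 would each need two CN-update blocks in one macro-layer and none in the other, which is the kind of non-uniformity a better partitioning would try to even out.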

The frame-interleaved architecture can also support windowed decoding of spatially-coupled LDPC

codes [149]. Each decoding window of the spatially-coupled code can be mapped to an independent

column-slice pair. In this case, instead of decoding multiple frames, the decoder would slide the window


as data moves from one pipeline stage to the next. The systolic nature of the proposed architecture is

well-suited for this application.

5.2.2 Linear-Program Decoding for High-SNR Channels

Linear program (LP) decoding of binary linear block codes via the Alternating Direction Method of

Multipliers (ADMM) has recently demonstrated improved error-floor decoding performance over the BP

Sum-Product algorithm in high-SNR Gaussian channels [150]. This makes ADMM-LP attractive for

optical transport and storage applications. ADMM-LP frames error correction as a convex optimization

problem, in contrast to BP, which frames error correction as a problem of graphical inference. An

FPGA-based implementation of ADMM-LP decoding recently demonstrated FER performance within 0.5dB

of floating-point precision at Eb/N0 = 6.5dB with an FER of 10^-6 for the rate 13/16 code of the

IEEE 802.11ad standard [151]. However, the implementation achieves a throughput of only 13.16Mb/s.

The error-rate, power consumption, and throughput performance of the same code were presented for

the silicon-based decoder implementation in Chapter 3 of this thesis. The fabricated chip achieves

a throughput of 6.78Gb/s with an FER of 10^-4 at Eb/N0 = 5.6dB. A possible future research area

would be to investigate the implementation of a silicon-based ADMM-LP decoder by applying the

ADMM-LP computation kernels in the combined CN+VN processing units of the frame-interleaved

LDPC decoder architecture presented in Chapter 3. An ASIC implementation may provide several

orders of magnitude speedup such that ADMM-LP achieves Gigabit/s decoding throughput, just like

the BP Min-Sum decoder presented in Chapter 3. Furthermore, the early termination patterns at high

SNR would allow for extensive clock gating in the frame-interleaved decoder architecture, thus offering

the prospect of low power performance for a silicon-based ADMM-LP decoder.

5.2.3 Decoder Architectures for Near-Threshold Voltage FinFET Operation

Near-threshold voltage (NTV) circuit design techniques have recently shown promise in improving energy

efficiency and alleviating on-chip power hotspots at the expense of performance, by operating at the

point where switching and leakage power are equal [152]. Switching energy dominates at supply voltages

greater than the NTV operating point, while leakage energy dominates at supply voltages below the NTV

operating point. The frame-interleaved LDPC decoder architecture presented in Chapter 3 of this thesis

can be extended to the NTV region of operation due to the high level of computational parallelism, deep

pipelining, and low clock frequency [153]. However, a particular challenge at NTV is mitigating SRAM

failure, since device mismatches degrade cell stability during read/write operations [152]. This behavior

was also observed during measurements of the fabricated proof-of-concept test chip in this thesis, as

shown in the measurement Shmoo plots in Fig. 3.15. SRAM failure can be avoided by implementing an

independent power domain for eSRAM, such that the memory operates at a higher supply voltage than

the NTV logic, and level shifters are used at the memory periphery. Furthermore, NTV operation is

largely unexplored in FinFET devices. Current research suggests that FinFETs offer significant voltage-

scaling improvements over planar technologies [153]. Since wired interconnect is still a limitation in

modern FinFET technology nodes, the frame-interleaved decoder architecture presented in Chapter 3

can offer significant energy reduction benefits under NTV operation, especially given the higher transistor

area density in sub-10nm FinFET nodes.


5.2.4 Decoder Architectures for 3-Dimensional Integrated Circuits

The silicon fabrication of a frame-interleaved LDPC decoder for a code with block length of 10^6 bits

may not be feasible on a single die, however, 3D die-stacking techniques may enable an implementation

over multiple dies connected with through-silicon vias (TSVs) [154]. The piecewise time-distributed

decoding schedule applied in the frame-interleaved architecture would minimize the amount of message

passing between adjacent stacked dies, while the low clock frequency would enable TSV-based message-

passing without parasitic degradation [154]. The expansion factor of the quasi-cyclic matrices designed

for long-distance CV-QKD in Chapter 4 is on the same order of magnitude as the expansion factor of

matrices used in the frame-interleaved decoder implementation in Chapter 3. Despite the longer block

length of 10^6 bits, the wiring permutation complexity between adjacent column slices for a QKD code

in the frame-interleaved architecture would remain about the same as the implemented 672-bit IEEE

802.11ad codes. The decoding latency would increase due to the larger number of columns, however, the

latency can be reduced through architectural optimization techniques to eliminate the long diagonal in

the parity-check matrix. Thus, the frame-interleaved decoder architecture presented in Chapter 3 can

be extended to the quasi-cyclic multi-edge LDPC codes designed in Chapter 4 in order to further reduce

decoding latency for long-distance CV-QKD. Finally, the introduction of 3D technologies may enable

the monolithic integration of LDPC decoder post-processing circuits with integrated photonics to build

a single-chip QKD solution.

Appendix A

Supplementary Background on QKD

This appendix provides a complete discussion of the quantum transmission, sifting, and privacy amplifi-

cation steps of the QKD protocol introduced in Chapter 2, as well as a derivation of the secret key rate

for a CV-QKD system with collective attacks.

A.1 Quantum Transmission and Measurement

To construct a secret key using the prepare-and-measure CV-QKD protocol, Alice first transmits N_quantum coherent states to Bob over an optical fiber. Each coherent state is comprised of a pair of amplitude and phase quadrature operators, x and p, of the form |x + jp⟩, j = √(−1). Using a quantum random number generator, Alice prepares each coherent state by randomly selecting her x_A and p_A quadrature values according to a zero-mean Gaussian distribution with adjustable modulation variance σ_A^2 = V_A N_0, where N_0 represents the shot noise variance defined by the Heisenberg inequality ∆x∆p ≥ N_0 [10, 95]. Alice transmits her train of N_quantum coherent states to Bob by modulating a light source with a pulse repetition rate f_rep. She also records her selections of x_A and p_A for the next sifting step, by constructing a vector, A, of length 2N_quantum, from her N_quantum coherent state quadrature operator pairs (x_A, p_A), such that A_{2i−1} = x_{A,i} and A_{2i} = p_{A,i} for i = 1, 2, . . . , N_quantum. As such, A = (x_{A,1}, p_{A,1}, x_{A,2}, p_{A,2}, . . . , x_{A,N_quantum}, p_{A,N_quantum}) and A ∼ N(0, σ_A^2). Bob randomly selects and measures either the x or p quadrature for each incoming pulse using an unbiased homodyne detector. Bob constructs his own vector, B, of length N_quantum, comprised of the observed modulated quadrature measurements, where B_i ∈ {x_{B,i}, p_{B,i}} with equal probability. Despite the losses in the optical fiber, and the added noise from Bob’s and Eve’s detection equipment, the x_B and p_B quadrature measurements can still be used to distill a secret key following the sifting and reconciliation (error correction) steps.

Without considering the presence of the eavesdropper (Eve), the quantum transmission is subject to

path loss, excess noise in the single-mode fiber between Alice and Bob, the inefficiency of Bob’s homodyne

detection, as well as added electronic (thermal) noise [59]. In this thesis, the quantum channel was

characterized using previously published parameters [59]. The excess channel noise expressed in shot

noise units is assumed to be ε = 0.005, Bob’s added electronic noise in shot noise units is chosen

as Vel = 0.041, Bob’s homodyne detector efficiency is set to η = 0.606, and the single-mode fiber

transmission loss is assumed to be 0.2dB/km, such that the transmittance of the quantum channel is

given by T = 10^(−αℓ/10), where ℓ is the transmission distance in kilometers and α = 0.2dB/km. The total


noise between Alice and Bob is given by

$$\chi_{\text{total}} = \chi_{\text{line}} + \frac{\chi_{\text{hom}}}{T}, \tag{A.1}$$

where χ_line = (1/T − 1) + ε is the total channel added noise referred to the channel input, and χ_hom = (1 + V_el)/η − 1 is the noise introduced by the homodyne detector. The variance of Bob’s measurement is given by σ_B^2 = V_B N_0 = ηT(V + χ_total)N_0. Although the adversary (Eve) may have access to the quantum

channel, her presence is not considered in the channel characterization. Instead, the information leaked

to Eve will be considered in the secret key rate calculation [95].
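As a worked illustration of the channel model in this section, the following Python sketch evaluates the transmittance and the noise terms of Eq. (A.1) for the parameter values stated above. It is a minimal sketch rather than the simulation code used in this thesis; the function names are chosen here for illustration, and N_0 = 1 is assumed so that all variances are in shot noise units.

```python
# Channel parameters assumed in this appendix (shot noise units where applicable).
EPSILON = 0.005  # excess channel noise
V_EL = 0.041     # Bob's added electronic noise
ETA = 0.606      # homodyne detector efficiency
ALPHA = 0.2      # single-mode fiber loss in dB/km


def transmittance(distance_km, alpha=ALPHA):
    """T = 10^(-alpha * l / 10) for a fiber of length l kilometers."""
    return 10.0 ** (-alpha * distance_km / 10.0)


def chi_line(t, epsilon=EPSILON):
    """Channel-added noise referred to the channel input."""
    return (1.0 / t - 1.0) + epsilon


def chi_hom(v_el=V_EL, eta=ETA):
    """Noise introduced by the homodyne detector."""
    return (1.0 + v_el) / eta - 1.0


def chi_total(t):
    """Total noise between Alice and Bob, Eq. (A.1)."""
    return chi_line(t) + chi_hom() / t


def bob_variance(v, t, n0=1.0, eta=ETA):
    """sigma_B^2 = eta * T * (V + chi_total) * N0."""
    return eta * t * (v + chi_total(t)) * n0


if __name__ == "__main__":
    for distance in (50.0, 100.0, 160.0):
        t = transmittance(distance)
        print(distance, t, chi_total(t))
```

At 50km the transmittance is 0.1, so the detector noise term χ_hom/T is already ten times larger than χ_hom itself, which shows how quickly the total noise grows with distance.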

A.2 Sifting

Following the quantum transmission step, Alice’s original transmission vector A contains twice as many

elements as Bob’s measurement vector B. In the sifting step, Bob informs Alice via the classical public

channel which of the xB or pB quadratures he randomly selected for each of his Nquantum element

measurements, such that Alice may respectively discard her Nquantum unused xA and pA quadrature

values [95]. After sifting, Alice and Bob share correlated random sequences of length Nquantum, herein

defined as X_0 = (X_{0,1}, X_{0,2}, . . . , X_{0,N_quantum}) and Y_0 = (Y_{0,1}, Y_{0,2}, . . . , Y_{0,N_quantum}), respectively, where (X_{0,i}, Y_{0,i}), i = 1, 2, . . . , N_quantum, are independent and identically distributed realizations of some jointly Gaussian random variables (X_0, Y_0). For example, Alice and Bob may have the following random sequences after sifting: X_0 = (x_{A,1}, p_{A,2}, p_{A,3}, x_{A,4}, . . . , p_{A,N_quantum}) and Y_0 = (x_{B,1}, p_{B,2}, p_{B,3}, x_{B,4}, . . . , p_{B,N_quantum}).
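The sifting step can be sketched in a few lines of Python. This is an illustrative model only: it assumes N_0 = 1, uses a pseudo-random generator in place of the quantum random number generator, and does not model the channel. The function names and the 0/1 encoding of Bob's quadrature choice are assumptions made for this example.

```python
import random


def prepare_alice(n_quantum, v_a, rng):
    """Alice's vector A of length 2N: (x_1, p_1, x_2, p_2, ...), variance V_A."""
    sigma_a = v_a ** 0.5
    return [rng.gauss(0.0, sigma_a) for _ in range(2 * n_quantum)]


def bob_choices(n_quantum, rng):
    """0 selects the x quadrature, 1 selects the p quadrature, equally likely."""
    return [rng.randrange(2) for _ in range(n_quantum)]


def sift(a, choices):
    """For each pulse i, keep the quadrature of Alice's pair that Bob measured."""
    return [a[2 * i + c] for i, c in enumerate(choices)]


if __name__ == "__main__":
    rng = random.Random(2017)  # fixed seed for repeatability (illustration only)
    a = prepare_alice(8, v_a=4.0, rng=rng)
    choices = bob_choices(8, rng)
    x0 = sift(a, choices)
    print(len(a), len(x0))  # Alice discards half of her 2N values
```

Only Bob's quadrature choices cross the public channel in this step; the quadrature values themselves are never disclosed.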

A.3 Privacy Amplification

Alice first discards her erroneously decoded S messages, and informs Bob as to which messages she

discarded. Bob then discards his original S messages that correspond to the S messages that were

discarded by Alice. Alice concatenates all of her correctly decoded S messages to construct a long secret

key block of length Nprivacy = mk bits, where k is the length of the LDPC-decoded message S, and m

is some large non-zero integer. Bob also concatenates his corresponding S messages to construct a long

secret key, also of length Nprivacy bits. Alice and Bob then independently perform universal hashing on

their independent secret key blocks to reduce Eve’s knowledge of the key.

The speed of privacy amplification is an active area of research, with published results showing

maximum speeds of 100Mb/s for a block size of Nprivacy = 10^8 bits [155]. The computational complexity

of universal hashing can be reduced from O(n^2) to O(n log_2 n) by applying the fast Fourier transform

(FFT) or number-theoretic transform (NTT) on a Toeplitz matrix [156]. Estimation of security

parameters is also performed during privacy amplification using (Nquantum −Nprivacy) bits. A complete

discussion of parameter estimation and privacy amplification is beyond the scope of this work. Interested

readers should refer to [157] for further information.
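To make the hashing step concrete, the following Python sketch applies a Toeplitz matrix over GF(2) to a key block. It is a minimal illustration of Toeplitz-based universal hashing in its direct quadratic form, not the FFT- or NTT-accelerated variant cited above, and it is not taken from this thesis. The function name, the indexing convention for the seed, and the toy sizes are assumptions made for this example.

```python
import random


def toeplitz_hash(key_bits, seed_bits, out_len):
    """Multiply a Toeplitz matrix over GF(2) by key_bits (direct O(n*m) form).

    seed_bits defines the matrix: entry (i, j) is seed_bits[i - j + n - 1],
    so every diagonal is constant. Requires len(seed_bits) == n + out_len - 1.
    """
    n = len(key_bits)
    if len(seed_bits) != n + out_len - 1:
        raise ValueError("seed must have n + out_len - 1 bits")
    out = []
    for i in range(out_len):
        acc = 0
        for j in range(n):
            acc ^= seed_bits[i - j + n - 1] & key_bits[j]
        out.append(acc)
    return out


if __name__ == "__main__":
    rng = random.Random(7)  # stand-in for a shared random seed (illustration only)
    n, m = 32, 8            # toy sizes; real blocks are many orders of magnitude longer
    key = [rng.randrange(2) for _ in range(n)]
    seed = [rng.randrange(2) for _ in range(n + m - 1)]
    print(toeplitz_hash(key, seed, m))
```

Alice and Bob would apply the same seed to their own key blocks, so identical inputs give identical shortened keys.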

A.4 Maximizing Secret Key Rate with Collective Attacks

The primary metric that defines the performance of a QKD system is the maximum rate at which Alice

and Bob can securely generate and reconcile keys over a fixed-distance optical fiber in the presence of an


eavesdropper that has access to both the quantum and classical channels. The maximum secret key rate

must be proven secure against a collective Gaussian attack, the optimal man-in-the-middle attack,

where Eve first prepares an ancilla state to interact with each one of Alice’s coherent states during the

quantum transmission, and then listens to the public communication between Alice and Bob during

the reconciliation step in order to perform the optimal measurement on her collected ancillae to

reconstruct the classical messages transmitted by Bob [10]. Assuming perfect error-correction during

the reconciliation step, the maximum theoretical secret key rate for a CV-QKD system with one-way

reverse reconciliation can be defined as

$$K_{\text{opt}} = \beta I_{AB} - \chi_{BE} \quad \text{(bits/pulse)}, \tag{A.2}$$

where IAB is the mutual information between Alice and Bob, β is the previously defined reconciliation

efficiency, and χBE is the Holevo bound on the information leaked to Eve [10]. Here, IAB is equivalent

to the Shannon channel capacity, and is defined as

$$I_{AB} = \frac{1}{2}\log_2(1 + s) = \frac{1}{2}\log_2\left(\frac{V + \chi_{\text{total}}}{1 + \chi_{\text{total}}}\right), \tag{A.3}$$

where V = VA + 1, VA is Alice’s adjustable modulation variance, and χtotal is the total noise between

Alice and Bob. The Holevo bound is defined as

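$$\chi_{BE} = G\left(\frac{\lambda_1 - 1}{2}\right) + G\left(\frac{\lambda_2 - 1}{2}\right) - G\left(\frac{\lambda_3 - 1}{2}\right) - G\left(\frac{\lambda_4 - 1}{2}\right), \tag{A.4}$$

where G(x) = (x + 1) log_2(x + 1) − x log_2 x, and the eigenvalues λ_{1,2,3,4} are given by

$$\lambda_{1,2}^2 = \frac{1}{2}\left(A \pm \sqrt{A^2 - 4B}\right), \qquad \lambda_{3,4}^2 = \frac{1}{2}\left(C \pm \sqrt{C^2 - 4D}\right),$$

where

$$A = V^2(1 - 2T) + 2T + T^2(V + \chi_{\text{line}})^2$$

$$B = T^2(V\chi_{\text{line}} + 1)^2$$

$$C = \frac{V\sqrt{B} + T(V + \chi_{\text{line}}) + A\chi_{\text{hom}}}{T(V + \chi_{\text{total}})}$$

$$D = \sqrt{B}\,\frac{V + \sqrt{B}\,\chi_{\text{hom}}}{T(V + \chi_{\text{total}})}.$$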

Optimizing Alice’s modulation variance for each quantum transmission distance ensures a maximum

SNR on the BIAWGNC [11], and thus, a maximum achievable secret key rate Kopt for a particular

β-efficiency. Figure A.1 presents the optimal modulation variance VA as a function of β for quantum

transmission distances up to 180km, assuming perfect error-correction in the reconciliation step. Fig-

ure A.2 shows the corresponding maximum theoretical secret key rate Kopt for CV-QKD based on the

computed optimal VA at each distance, as well as the upper bound on secret key rate for a lossy channel

defined by Eq. 2.11.
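The key-rate expressions in this section can be evaluated numerically with a short script. The Python below is a minimal sketch under the channel parameters assumed in this appendix; it is not the code used to generate Figures A.1 and A.2, it does not search for the optimal V_A, and the function names and the example value V_A = 4 are assumptions made for illustration. The lossy-channel limit uses the expression K_lim = −log_2(1 − T) given in the caption of Figure A.2.

```python
import math

# Parameters assumed in this appendix.
EPSILON, V_EL, ETA, ALPHA = 0.005, 0.041, 0.606, 0.2


def g(x):
    """G(x) = (x+1)log2(x+1) - x log2(x), with G(0) = 0 by continuity."""
    if x <= 0.0:
        return 0.0
    return (x + 1.0) * math.log2(x + 1.0) - x * math.log2(x)


def key_rate(distance_km, v_a, beta):
    """Return (K_opt, I_AB, chi_BE) in bits/pulse from Eqs. (A.1) to (A.4)."""
    t = 10.0 ** (-ALPHA * distance_km / 10.0)
    x_line = (1.0 / t - 1.0) + EPSILON
    x_hom = (1.0 + V_EL) / ETA - 1.0
    x_tot = x_line + x_hom / t
    v = v_a + 1.0

    i_ab = 0.5 * math.log2((v + x_tot) / (1.0 + x_tot))

    a = v * v * (1.0 - 2.0 * t) + 2.0 * t + (t * (v + x_line)) ** 2
    b = (t * (v * x_line + 1.0)) ** 2
    root_b = math.sqrt(b)
    c = (v * root_b + t * (v + x_line) + a * x_hom) / (t * (v + x_tot))
    d = root_b * (v + root_b * x_hom) / (t * (v + x_tot))

    def eig_pair(p, q):
        # Solve lambda^4 - p*lambda^2 + q = 0; clamp tiny negative round-off.
        disc = math.sqrt(max(p * p - 4.0 * q, 0.0))
        return math.sqrt(0.5 * (p + disc)), math.sqrt(0.5 * (p - disc))

    l1, l2 = eig_pair(a, b)
    l3, l4 = eig_pair(c, d)
    chi_be = (g((l1 - 1.0) / 2.0) + g((l2 - 1.0) / 2.0)
              - g((l3 - 1.0) / 2.0) - g((l4 - 1.0) / 2.0))
    return beta * i_ab - chi_be, i_ab, chi_be


def lossy_channel_limit(distance_km):
    """K_lim = -log2(1 - T); valid for distance > 0."""
    t = 10.0 ** (-ALPHA * distance_km / 10.0)
    return -math.log2(1.0 - t)


if __name__ == "__main__":
    for distance in (25.0, 50.0, 100.0):
        k, i_ab, chi_be = key_rate(distance, v_a=4.0, beta=0.95)
        print(distance, k, i_ab, chi_be, lossy_channel_limit(distance))
```

Sweeping `v_a` for each distance and keeping the value that maximizes the first returned element would be one simple way to approximate the optimal modulation variance; a negative result means no secret key can be extracted for that choice of V_A and β.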


Figure A.1: Optimal VA (shot noise units) vs. transmission distance (km) for maximum theoretical secret key rate, from β = 0.8 to β = 0.99, based on the assumed physical operating parameters of the quantum channel (ε = 0.005, Vel = 0.041, η = 0.606, α = 0.2).

Figure A.2: Maximum theoretical secret key rates (bits/pulse) vs. transmission distance (km). The maximum CV-QKD key rate (FER = 0) is defined by Kopt from β = 0.8 to β = 0.99 based on the optimal VA. The fundamental limit for a lossy channel is defined by Klim = −log_2(1 − T).

Appendix B

Development, Simulation, and

Testing Framework

This appendix outlines the design stages required to realize both the silicon-based and GPU-based LDPC

decoders presented in this thesis. A list of electronic design automation (EDA) tools is provided below,

while Fig. B.1 provides a visual overview of the design flow.

Algorithm Design and Verification

• Microsoft Visual Studio with NVIDIA CUDA Framework

Architecture Implementation

• Cadence RTL Compiler

• Cadence Incisive Simulator

• Cadence SimVision Waveform Viewer

• Cadence Conformal Logic Equivalence Checker

• Synopsys SpyGlass Linting Tool

Physical Design

• Mentor Graphics Olympus-SoC Place-And-Route Tool

• Mentor Graphics Calibre Design Rule Check

• Cadence Tempus Static Timing Analysis Tool

Chip Measurement

• Source III VTRAN Test Vector Translation

• Teradyne ATE Test Suite


[Flow diagram with four stages: Algorithm Design and Verification, Architecture Implementation, Physical Design, and Chip Measurement. Reported outputs: LDPC code BER and FER results, GPU decoder throughput and latency results, and silicon decoder at-speed tests and power results.]

Figure B.1: Development, simulation, and testing framework.

Appendix C

Supplementary QKD Results: Bit

Error Rate and Finite Secret Key

Rate

This appendix presents the bit error rate (BER) performance of the three LDPC codes presented in Chap-

ter 4 under d = 1, 2, 4, 8 dimensional reconciliation, as well as the finite secret key rate results for privacy

amplification blocks of Nprivacy = 10^10 bits with d = 1, 2, 4, 8 reconciliation, and for Nprivacy = 10^12

bits with d = 2, 4 reconciliation to demonstrate the impact of block size on the maximum transmission

distance.

Figures C.1 and C.2 present the BER performance of the random, q = 21, and q = 50 codes on the

BIAWGNC with Sum-Product decoding under d = 1, 2 and d = 4, 8 reconciliation, respectively.

Figures C.3 to C.6 present the finite secret key rate results for a privacy amplification block of

Nprivacy = 10^10 with d = 1, 2, 4, 8 reconciliation. Figures C.7 and C.8 present the finite secret key rate

results for a privacy amplification block of Nprivacy = 10^12 with d = 2, 4 reconciliation. The d = 1 and

d = 8 results for Nprivacy = 10^12 bits are presented in Figures 4.6 and 4.7 in Chapter 4. The results

show that the distance is extended using longer privacy amplification blocks and higher reconciliation

dimensions. The maximum distance is achieved with d = 8 reconciliation and Nprivacy = 10^12 bits.


Figure C.1: BER vs. SNR for Sum-Product decoding with d = 1 and d = 2 dimensional reconciliation on BIAWGNC.

Figure C.2: BER vs. SNR for Sum-Product decoding with d = 4 and d = 8 dimensional reconciliation on BIAWGNC.


Figure C.3: d = 1 dimensional reconciliation with Nprivacy = 10^10 bits: finite secret key rate Kfinite vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.

[Figure: finite secret key rate (bits/pulse) vs. distance (km), Nprivacy = 10^10 bits, curves from β = 0.80 to β = 0.97; ε = 0.005, Vel = 0.041, η = 0.606, α = 0.2.]

[Figure: finite secret key rate (bits/pulse) vs. distance (km), Nprivacy = 10^10 bits, curves from β = 0.80 to β = 0.97; ε = 0.005, Vel = 0.041, η = 0.606, α = 0.2.]

[Figure: finite secret key rate (bits/pulse) vs. distance (km), Nprivacy = 10^10 bits, curves from β = 0.80 to β = 0.98; ε = 0.005, Vel = 0.041, η = 0.606, α = 0.2.]

[Figure: finite secret key rate (bits/pulse) vs. distance (km), Nprivacy = 10^12 bits, curves from β = 0.80 to β = 0.97; ε = 0.005, Vel = 0.041, η = 0.606, α = 0.2.]

[Figure: finite secret key rate (bits/pulse) vs. distance (km), Nprivacy = 10^12 bits, curves from β = 0.80 to β = 0.95; ε = 0.005, Vel = 0.041, η = 0.606, α = 0.2.]



Appendix D

LDPC Decoding for Reverse

Reconciliation in CV- and DV-QKD

This appendix examines the differences in reconciling secret keys using LDPC codes over the binary-input

additive white Gaussian noise channel (BIAWGNC) for CV-QKD, and the binary symmetric channel

(BSC) for DV-QKD. The reconciliation procedure is first presented for the BIAWGNC (as discussed in

the thesis for long-distance CV-QKD), and then the procedure is extended for the BSC. This appendix

shows that LDPC decoding is independent of the QKD system parameters once the log-likelihood ratio (LLR)

input to the decoder is calculated.

D.1 Alice and Bob’s Correlated Sequences

After the quantum transmission and sifting steps in CV- or DV-QKD protocols, Alice and Bob share

correlated sequences X and Y of length n, respectively. In CV-QKD, Alice’s X ∈ R^n and is normally

distributed over X ∼ N(0, 1). In DV-QKD, Alice’s X ∈ F_2^n and is uniformly distributed over F_2^n. The

distribution of Bob’s correlated sequence Y is determined by the channel model.

D.2 Reverse Reconciliation

In both CV- and DV-QKD, Bob uses a quantum random number generator to generate a

uniformly-distributed random binary sequence S of length k, where S_i ∈ {0, 1}. He then encodes S to generate an

LDPC codeword C of length n, where C_i ∈ {0, 1}, based on a binary LDPC parity-check matrix H that

is also known to Alice.

D.2.1 CV-QKD: Decoding on the BIAWGNC

In CV-QKD, Bob prepares his classical message to Alice, M, by modulating the signs of his Gaussian

sequence Y with the LDPC codeword, such that M_i = (−1)^{C_i} Y_i for i = 1, 2, . . . , n. The BIAWGNC is

described by Z ∼ N(0, σ_Z^2). Bob’s correlated sequence Y is Gaussian, and is given by Y = X + Z,

such that Y ∼ N(0, 1 + σ_Z^2). Alice attempts to recover Bob’s codeword C using her correlated Gaussian


sequence X based on the following division operation:

$$R_i = \frac{M_i}{X_i} = \frac{(-1)^{C_i} Y_i}{X_i} = \frac{(-1)^{C_i}(X_i + Z_i)}{X_i} = (-1)^{C_i} + (-1)^{C_i}\frac{Z_i}{X_i} \quad \text{for } i = 1, 2, \ldots, n. \tag{D.1}$$

Here, Alice observes a channel with binary input (±1) and additive noise $(-1)^{C_i} Z_i / X_i$. The division by $X_i$ in the noise term resembles a fading channel. However, since Alice knows the value of each $X_i$, the norm of X is revealed and the overall channel noise remains Gaussian with zero mean and variance $\sigma_{N_i}^2 = \sigma_Z^2 / |X_i|^2$ for each i = 1, 2, ..., n. Alice attempts to reconstruct Bob’s original binary sequence S by applying the Sum-Product belief propagation algorithm for LDPC decoding to build an estimate $\hat{S}$ for further post-processing in the subsequent privacy amplification step. LDPC decoding is successful if $\hat{S} = S$. Figure D.1 shows the setup for a BIAWGNC.
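
The division operation in Eq. (D.1) can be checked numerically with a short Python sketch. The block length, noise variance, and seed below are arbitrary assumed values, and the random bits in `C` are placeholders standing in for an actual LDPC codeword.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n = 8
sigma_z2 = 0.5                          # channel noise variance (assumed value)

X = rng.normal(0.0, 1.0, size=n)        # Alice's sequence, X_i ~ N(0, 1)
Z = rng.normal(0.0, np.sqrt(sigma_z2), size=n)
Y = X + Z                               # Bob's correlated sequence
C = rng.integers(0, 2, size=n)          # placeholder bits standing in for a codeword

signs = 1.0 - 2.0 * C                   # (-1)^C_i: bit 0 -> +1, bit 1 -> -1
M = signs * Y                           # Bob's classical message to Alice
R = M / X                               # Alice divides by her own samples

noise = signs * Z / X                   # additive noise term in Eq. (D.1)
sigma_n2 = sigma_z2 / np.abs(X) ** 2    # per-symbol noise variance seen by Alice
```

Each `R[i]` equals the transmitted sign plus the corresponding noise term, and symbols with small `|X[i]|` see a larger effective noise variance.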

[Figure: block diagram of the LDPC Encoder, BIAWGNC, and LDPC Decoder, with labels S, C, R, H, and SNR.]

Figure D.1: LDPC encoding and decoding system with BIAWGNC model.

The Sum-Product algorithm for LDPC decoding generates the estimate $\hat{S}$ based on the log-likelihood ratio (LLR) of each of its $R_i$ inputs for i = 1, 2, . . . , n. For a BIAWGNC, the noise variance $\sigma_Z^2$ is known, and each LLR is given by:

\[
\mathrm{LLR}(R_i) = \ln\left(\frac{P(R_i \mid C_i = 0)}{P(R_i \mid C_i = 1)}\right) = \frac{2R_i}{\sigma_{N_i}^2}, \quad \text{where } \sigma_{N_i}^2 = \frac{\sigma_Z^2}{|X_i|^2}. \tag{D.2}
\]
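
As a minimal sketch of Eq. (D.2), the Python snippet below computes the channel LLRs in closed form and cross-checks them against the log-ratio of the two Gaussian densities centred at +1 (for C_i = 0) and −1 (for C_i = 1). The function names and the sample values of R, X, and the noise variance are illustrative assumptions.

```python
import numpy as np

def llr_biawgnc(R, X, sigma_z2):
    """Channel LLRs per Eq. (D.2): 2 R_i / sigma_Ni^2 with sigma_Ni^2 = sigma_Z^2 / |X_i|^2."""
    R = np.asarray(R, dtype=float)
    X = np.asarray(X, dtype=float)
    sigma_n2 = sigma_z2 / np.abs(X) ** 2
    return 2.0 * R / sigma_n2

def llr_from_densities(R, X, sigma_z2):
    """Same quantity from Gaussian log-densities centred at +1 (C_i = 0) and -1 (C_i = 1)."""
    R = np.asarray(R, dtype=float)
    sigma_n2 = sigma_z2 / np.abs(np.asarray(X, dtype=float)) ** 2
    log_p0 = -(R - 1.0) ** 2 / (2.0 * sigma_n2)
    log_p1 = -(R + 1.0) ** 2 / (2.0 * sigma_n2)
    return log_p0 - log_p1              # normalisation constants cancel

R = np.array([0.9, -1.3, 0.2])          # example received values (assumed)
X = np.array([1.5, -0.4, 0.8])          # Alice's corresponding samples (assumed)
llr = llr_biawgnc(R, X, sigma_z2=0.5)
```

A positive LLR favours C_i = 0 and a negative LLR favours C_i = 1; the magnitude grows with |X_i|², reflecting the lower effective noise on those symbols.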

From here, the LDPC decoder operates independently of all other QKD system parameters, so its design and implementation (software or hardware) do not depend on them. In the thesis, a GPU-based LDPC decoder was implemented for speed. The same error-correction performance is achievable using a software-based decoder, albeit at a much lower speed.
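
To illustrate that the decoder needs only H and the channel LLRs, the following is a minimal flooding-schedule Sum-Product sketch in Python. It is not the GPU implementation described in the thesis: the function name `sum_product_decode`, the toy (7,4) Hamming matrix standing in for a sparse LDPC matrix, the clipping constants, and the example LLR values are all assumptions made for illustration.

```python
import numpy as np

def sum_product_decode(H, llr, max_iter=50):
    """Flooding-schedule sum-product decoding; returns hard-decision bits."""
    H = np.asarray(H, dtype=bool)
    llr = np.asarray(llr, dtype=float)
    m, n = H.shape
    v2c = np.where(H, llr, 0.0)          # variable-to-check messages start as channel LLRs
    hard = (llr < 0).astype(np.uint8)
    for _ in range(max_iter):
        # Check-node update: tanh rule over all other edges of each check.
        t = np.tanh(np.clip(v2c, -30.0, 30.0) / 2.0)
        c2v = np.zeros((m, n))
        for i in range(m):
            cols = np.flatnonzero(H[i])
            for j in cols:
                prod = np.prod(t[i, cols[cols != j]])
                prod = np.clip(prod, -0.999999, 0.999999)   # keep arctanh finite
                c2v[i, j] = 2.0 * np.arctanh(prod)
        # Variable-node update: total belief minus the incoming message on each edge.
        total = llr + c2v.sum(axis=0)
        v2c = np.where(H, total - c2v, 0.0)
        hard = (total < 0).astype(np.uint8)
        if not np.any((H.astype(int) @ hard) % 2):
            break                        # all parity checks satisfied
    return hard

# Toy parity-check matrix (assumption): (7,4) Hamming code, not a real LDPC code.
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
codeword = np.array([1, 0, 0, 0, 1, 1, 0], dtype=np.uint8)
llr = 2.0 * (1.0 - 2.0 * codeword)       # bit 0 -> +2, bit 1 -> -2
llr[3] = -1.0                            # one unreliable input with the wrong sign
decoded = sum_product_decode(H, llr)
```

The function signature contains no QKD quantities: the same routine accepts LLRs computed for either channel model.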

D.2.2 DV-QKD: Decoding on the BSC

In DV-QKD, Bob’s correlated sequence $Y \in \mathbb{F}_2^n$ is uniformly distributed over $\mathbb{F}_2^n$. The BSC is defined by a bit crossover error probability p, 0 < p < 1/2, such that Bob’s correlated binary sequence is described by Y = X + E, where + denotes binary addition and E represents the crossover error as follows:

\[
E_i = \begin{cases} 1, & \text{with probability } p \\ 0, & \text{with probability } 1 - p. \end{cases} \tag{D.3}
\]

Using the same codeword generation procedure as before, Bob encodes an LDPC codeword C from a

uniformly distributed random binary sequence S. In DV-QKD, Bob sends a message M = C + Y to

Alice. Alice attempts to recover Bob’s codeword C by adding her correlated binary sequence X to


the message M received from Bob, as per the following binary addition operation:

R = M + X = (C + Y) + X = C + (X + E) + X = C + E (D.4)

Alice’s received value R = C + E is then used to perform LDPC decoding to recover an estimate $\hat{S}$

of Bob’s original random binary sequence S. As in the previous BIAWGNC case, LDPC decoding is

successful if $\hat{S} = S$.
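
The binary additions in Eqs. (D.3) and (D.4) map directly onto bitwise XOR. The Python sketch below is illustrative only: the block length, crossover probability, and seed are assumed values, and the random bits in `C` are placeholders standing in for an actual LDPC codeword.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n, p = 16, 0.1                                    # block length and crossover probability (assumed)

X = rng.integers(0, 2, size=n, dtype=np.uint8)    # Alice's uniformly distributed bits
E = (rng.random(n) < p).astype(np.uint8)          # E_i = 1 with probability p, Eq. (D.3)
Y = X ^ E                                         # Bob's bits: Y = X + E over F_2
C = rng.integers(0, 2, size=n, dtype=np.uint8)    # placeholder bits standing in for a codeword

M = C ^ Y                                         # Bob's message, M = C + Y
R = M ^ X                                         # Alice's operation in Eq. (D.4)
```

Since X ^ X is all-zero, `R` reduces to `C ^ E`: the codeword corrupted only by the crossover errors.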

[Figure: block diagram of the LDPC Encoder, BSC with crossover error probability p, and LDPC Decoder, with labels S, C, R, and H.]

Figure D.2: LDPC encoding and decoding system with BSC model.

Fig. D.2 presents the decoding setup for the BSC case. From the LDPC decoder’s perspective, the

only difference is in the calculation of the input channel LLR based on the input R and the known

crossover probability p. In fact, this LLR calculation is even simpler than the LLR calculation for the

BIAWGNC in CV-QKD because the decoder does not need to know X. Depending on whether Ri = 0

or Ri = 1, the LLR for each Ri on the BSC is given by:

\[
\mathrm{LLR}(R_i = 0) = \ln\left(\frac{1 - p}{p}\right), \qquad \mathrm{LLR}(R_i = 1) = \ln\left(\frac{p}{1 - p}\right). \tag{D.5}
\]
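
Eq. (D.5) amounts to a fixed magnitude with a sign set by the received bit, which can be sketched in Python as follows. The function name `llr_bsc` and the example inputs are assumptions for illustration.

```python
import math
import numpy as np

def llr_bsc(R, p):
    """Channel LLRs per Eq. (D.5): +ln((1-p)/p) for R_i = 0, -ln((1-p)/p) for R_i = 1."""
    if not 0.0 < p < 0.5:
        raise ValueError("crossover probability must satisfy 0 < p < 1/2")
    magnitude = math.log((1.0 - p) / p)
    return (1.0 - 2.0 * np.asarray(R, dtype=float)) * magnitude

llr = llr_bsc([0, 1, 1, 0], p=0.1)       # magnitude is ln(9) for p = 0.1
```

Unlike the BIAWGNC case, every input has the same LLR magnitude, and Alice's sequence X does not enter the calculation.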

Once the LLR for each input is known, the same Sum-Product algorithm can be used for LDPC decoding.

Hence, the LDPC decoder implementation is independent of the channel.

D.2.3 Efficiency of Reconciliation with Multiple Dimensions

The efficiency of reverse reconciliation can be improved through multi-dimensional reconciliation techniques. However, multi-dimensional reconciliation is only applicable to the CV-QKD case over a BIAWGNC. For d-dimensional reconciliation, $d \in \{1, 2, 4, 8\}$, each consecutive group of d quantum coherent-state transmissions from Alice to Bob can be mapped to the same BIAWGNC, and thus the channel noise variance among all d virtual channels is uniform. Multi-dimensional reconciliation is not possible on the BSC because each bit is transmitted discretely and has its own crossover probability p.

References

[1] R. W. Hamming, “Error detecting and error correcting codes,” The Bell System Technical Journal,

vol. 29, no. 2, pp. 147–160, Apr. 1950.

[2] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal,

vol. 27, no. 3, pp. 379–423, Jul. 1948.

[3] S. Lin and D. J. Costello, Error Control Coding, Second Edition. Upper Saddle River, NJ, USA:

Prentice-Hall, Inc., 2004.

[4] T. Richardson, M. Shokrollahi, and R. Urbanke, “Design of capacity-approaching irregular low-

density parity-check codes,” Information Theory, IEEE Transactions on, vol. 47, no. 2, pp. 619–

637, Feb. 2001.

[5] S.-Y. Chung, J. Forney, G.D. et al., “On the design of low-density parity-check codes within 0.0045

dB of the Shannon limit,” Communications Letters, IEEE, vol. 5, no. 2, pp. 58–60, Feb. 2001.

[6] J. Kim and W. Sung, “Rate-0.96 LDPC Decoding VLSI for Soft-Decision Error Correction of

NAND Flash Memory,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on,

vol. 22, no. 5, pp. 1004–1015, May 2014.

[7] R. Gallager, “Low-density parity-check codes,” IRE Transactions on Information Theory, vol. 8,

no. 1, pp. 21–28, Jan. 1962.

[8] D. MacKay, “Good error-correcting codes based on very sparse matrices,” IEEE Transactions on

Information Theory, vol. 45, no. 2, pp. 399–431, Mar. 1999.

[9] “IEEE Standard for Information Technology - Part 11: Wireless LAN Medium Access Control

(MAC) and Physical Layer (PHY) Specifications,” IEEE Std 802.11ac-2013, pp. 1–425, Dec. 2013.

[10] J. Lodewyck, M. Bloch, R. Garcia-Patron, S. Fossier, E. Karpov, E. Diamanti, T. Debuisschert,

N. J. Cerf, R. Tualle-Brouri, S. W. McLaughlin, and P. Grangier, “Quantum key distribution

over 25km with an all-fiber continuous-variable system,” Phys. Rev. A, vol. 76, pp. 042 305–1 –

042 305–10, Oct. 2007.

[11] P. Jouguet, S. Kunz-Jacques, and A. Leverrier, “Long-distance continuous-variable quantum key

distribution with a Gaussian modulation,” Phys. Rev. A, vol. 84, pp. 062 317–1 – 062 317–7, Dec.

2011.

[12] D. Huang, D. Lin et al., “Continuous-variable quantum key distribution with 1 Mbps secure key

rate,” Opt. Express, vol. 23, no. 13, pp. 17 511–17 519, Jun. 2015.

98

References 99

[13] “International Roadmap for Devices and Systems (IRDS),” IEEE, 2016. [Online]. Available:

http://irds.ieee.org

[14] K. J. Kuhn, “Considerations for Ultimate CMOS Scaling,” IEEE Transactions on Electron Devices,

vol. 59, no. 7, pp. 1813–1828, Jul. 2012.

[15] T. Mohsenin, D. Truong, and B. Baas, “A Low-Complexity Message-Passing Algorithm for Re-

duced Routing Congestion in LDPC Decoders,” Circuits and Systems I: Regular Papers, IEEE

Transactions on, vol. 57, no. 5, pp. 1048–1061, May 2010.

[16] K. Cushon, S. Hemati et al., “High-throughput energy-efficient LDPC decoders using differential

binary message passing,” IEEE Transactions on Signal Processing, vol. 62, no. 3, pp. 619–631,

Feb. 2014.

[17] S. Rossnagel, R. Wisnieff, D. Edelstein, and T. Kuan, “Interconnect issues post 45nm,” Electron

Devices Meeting, 2005. IEDM Technical Digest. IEEE International, pp. 89–91, Dec. 2005.

[18] R. Brain, “Interconnect scaling: Challenges and opportunities,” 2016 IEEE International Electron

Devices Meeting (IEDM), pp. 9.3.1–9.3.4, Dec. 2016.

[19] K. Okada, K. Kondou et al., “Full Four-Channel 6.3-Gb/s 60-GHz CMOS Transceiver With Low-

Power Analog and Digital Baseband Circuitry,” Solid-State Circuits, IEEE Journal of, vol. 48,

no. 1, pp. 46–65, Jan. 2013.

[20] K. Zhang, X. Huang, and Z. Wang, “High-throughput layered decoder implementation for quasi-

cyclic LDPC codes,” Selected Areas in Communications, IEEE Journal on, vol. 27, no. 6, pp.

985–994, Aug. 2009.

[21] M. Li, F. Naessens et al., “An area and energy efficient half-row-paralleled layer LDPC decoder for

the 802.11ad standard,” Signal Processing Systems (SiPS), 2013 IEEE Workshop on, pp. 112–117,

Oct. 2013.

[22] Z. Chen, X. Peng et al., “A 6.72-Gb/s, 8pJ/bit/iteration WPAN LDPC decoder in 65nm CMOS,”

2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 87–88, Jan.

2013.

[23] “IEEE Standard for Information Technology - Part 11: Wireless LAN Medium Access Control

(MAC) and Physical Layer (PHY) Specifications,” IEEE Std 802.11ad-2012, pp. 1–628, Dec. 2012.

[24] “IEEE Standard for Information Technology - Part 15.3: Wireless Medium Access Control

(MAC) and Physical Layer (PHY) Specifications for High Rate Wireless Personal Area Networks

(WPANs),” IEEE Std 802.15.3c-2009, pp. 1–187, Oct. 2009.

[25] X.-Y. Shih, C.-Z. Zhan, and A. Y. Wu, “A 7.39mm2 76mW (1944, 972) LDPC decoder chip for

IEEE 802.11n applications,” 2008 IEEE Asian Solid-State Circuits Conference, pp. 301–304, Nov.

2008.

[26] A. Cevrero, Y. Leblebici, P. Ienne, and A. Burg, “A 5.35 mm2 10GBASE-T Ethernet LDPC

decoder chip in 90 nm CMOS,” 2010 IEEE Asian Solid-State Circuits Conference, pp. 1–4, Nov.

2010.

http://irds.ieee.org

References 100

[27] Z. Zhang, V. Anantharam, M. Wainwright, and B. Nikolic, “An Efficient 10GBASE-T Ethernet

LDPC Decoder Design With Low Error Floors,” Solid-State Circuits, IEEE Journal of, vol. 45,

no. 4, pp. 843–855, Apr. 2010.

[28] X. Peng, Z. Chen, X. Zhao, D. Zhou, and S. Goto, “A 115mW 1Gbps QC-LDPC decoder ASIC for

WiMAX in 65nm CMOS,” IEEE Asian Solid-State Circuits Conference 2011, pp. 317–320, Nov.

2011.

[29] B. Xiang, D. Bao, S. Huang, and X. Zeng, “An 847-955 Mb/s 342-397 mW Dual-Path Fully-

Overlapped QC-LDPC Decoder for WiMAX System in 0.13µm CMOS,” IEEE Journal of Solid-

State Circuits, vol. 46, no. 6, pp. 1416–1432, Jun. 2011.

[30] S. W. Yen, S. Y. Hung, C. L. Chen, H. C. Chang, S. J. Jou, and C. Y. Lee, “A 5.79-Gb/s

Energy-Efficient Multirate LDPC Codec Chip for IEEE 802.15.3c Applications,” IEEE Journal of

Solid-State Circuits, vol. 47, no. 9, pp. 2246–2257, Sep. 2012.

[31] Y. S. Park, D. Blaauw et al., “Low-Power High-Throughput LDPC Decoder Using Non-Refresh

Embedded DRAM,” Solid-State Circuits, IEEE Journal of, vol. 49, no. 3, pp. 783–794, Mar. 2014.

[32] M. Weiner, M. Blagojevic et al., “27.7 A scalable 1.5-to-6Gb/s 6.2-to-38.1mW LDPC decoder

for 60GHz wireless networks in 28nm UTBB FDSOI,” Solid-State Circuits Conference Digest of

Technical Papers (ISSCC), 2014 IEEE International, pp. 464–465, Feb. 2014.

[33] T. C. Ou, Z. Zhang, and M. C. Papaefthymiou, “A 934MHz 9Gb/s 3.2pJ/b/iteration charge-

recovery LDPC decoder with in-package inductors,” 2015 IEEE Asian Solid-State Circuits Con-

ference (A-SSCC), pp. 1–4, Nov. 2015.

[34] X. R. Lee, C. L. Chen, H. C. Chang, and C. Y. Lee, “A 7.92 Gb/s 437.2 mW Stochastic LDPC

Decoder Chip for IEEE 802.15.3c Applications,” IEEE Transactions on Circuits and Systems I:

Regular Papers, vol. 62, no. 2, pp. 507–516, Feb. 2015.

[35] C. L. Lin, R. J. Liu, C. L. Chen, H. C. Chang, and C. Y. Lee, “A 7.72 Gb/s LDPC-CC decoder

with overlapped architecture for pre-5G wireless communications,” 2016 IEEE Asian Solid-State

Circuits Conference (A-SSCC), pp. 337–340, Nov. 2016.

[36] M. R. Li, C. H. Yang, and Y. L. Ueng, “A 5.28-Gb/s LDPC Decoder With Time-Domain Signal

Processing for IEEE 802.15.3c Applications,” IEEE Journal of Solid-State Circuits, vol. 52, no. 2,

pp. 592–604, Feb. 2017.

[37] C. H. Bennett and G. Brassard, “Quantum cryptography: Public key distribution and coin toss-

ing,” Theoretical Computer Science, vol. 560, Part 1, pp. 7 – 11, Dec. 2014.

[38] N. Gisin, G. Ribordy, W. Tittel, and H. Zbinden, “Quantum cryptography,” Rev. Mod. Phys.,

vol. 74, pp. 145–195, Jan. 2002.

[39] R. Alleaume, C. Branciard, J. Bouda, T. Debuisschert, M. Dianati, N. Gisin, M. Godfrey, P. Grang-

ier, T. Lnger, N. Ltkenhaus et al., “Using quantum key distribution for cryptographic purposes:

A survey,” Theoretical Computer Science, vol. 560, Part 1, pp. 62–81, Dec. 2014.

References 101

[40] E. Diamanti, H.-K. Lo, B. Qi, and Z. Yuan, “Practical challenges in quantum key distribution,”

Npj Quantum Information, vol. 2, pp. 16 025–1 –16 025–12, Nov. 2016.

[41] J. D. Morris, M. R. Grimaila, D. D. Hodson, D. Jacques, and G. Baumgartner, “Chapter 9 - A

Survey of Quantum Key Distribution (QKD) Technologies,” Emerging Trends in ICT Security,

pp. 141–152, Nov. 2014.

[42] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-

key cryptosystems,” Communications of the ACM, vol. 21, pp. 120–126, Feb. 1978.

[43] C. Kollmitzer and M. Pivk, Applied Quantum Cryptography. Springer, Apr. 2010, vol. 797.

[44] P. W. Shor, “Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a

Quantum Computer,” SIAM Journal on Computing, vol. 26, no. 5, pp. 1484–1509, Oct. 1997.

[45] D. Adrian, K. Bhargavan, Z. Durumeric, P. Gaudry, M. Green, J. A. Halderman, N. Heninger,

D. Springall, E. Thome, L. Valenta, B. VanderSloot, E. Wustrow, S. Zanella-Beguelin, and P. Zim-

mermann, “Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice,” Proceedings of the

22Nd ACM SIGSAC Conference on Computer and Communications Security, pp. 5–17, Oct. 2015.

[46] H.-K. Lo, M. Curty, and K. Tamaki, “Secure quantum key distribution,” Nature Photonics, vol. 8,

no. 8, pp. 595–604, Jul. 2014.

[47] M. Peev, C. Pacher, R. Allaume, C. Barreiro, J. Bouda, W. Boxleitner, T. Debuisschert, E. Dia-

manti, M. Dianati, J. F. Dynes et al., “The SECOQC quantum key distribution network in Vi-

enna,” New Journal of Physics, vol. 11, no. 7, pp. 075 001–1 – 075 001–37, Jul. 2009.

[48] M. Sasaki, M. Fujiwara et al., “Field test of quantum key distribution in the Tokyo QKD Network,”

Opt. Express, vol. 19, no. 11, pp. 10 387–10 409, May 2011.

[49] P. Jouguet, S. Kunz-Jacques et al., “Field test of classical symmetric encryption with continuous

variables quantum key distribution,” Opt. Express, vol. 20, no. 13, pp. 14 030–14 041, Jun. 2012.

[50] S. Wang, W. Chen et al., “Field and long-term demonstration of a wide area quantum key distri-

bution network,” Opt. Express, vol. 22, no. 18, pp. 21 739–21 756, Sep. 2014.

[51] H.-L. Yin, T.-Y. Chen, Z.-W. Yu, H. Liu, L.-X. You, Y.-H. Zhou, S.-J. Chen, Y. Mao, M.-Q.

Huang, W.-J. Zhang, H. Chen, M. J. Li, D. Nolan, F. Zhou, X. Jiang, Z. Wang, Q. Zhang, X.-B.

Wang, and J.-W. Pan, “Measurement-device-independent quantum key distribution over a 404 km

optical fiber,” Phys. Rev. Lett., vol. 117, pp. 190 501–1 – 190 501–5, Nov. 2016.

[52] B. Qi, W. Zhu, L. Qian, and H.-K. Lo, “Feasibility of quantum key distribution through a dense

wavelength division multiplexing network,” New Journal of Physics, vol. 12, no. 10, pp. 103 042–1

– 103 042–17, Oct. 2010.

[53] K. A. Patel, J. F. Dynes, I. Choi, A. W. Sharpe, A. R. Dixon, Z. L. Yuan, R. V. Penty, and A. J.

Shields, “Coexistence of high-bit-rate quantum key distribution and data on optical fiber,” Phys.

Rev. X, vol. 2, pp. 041 010–1 – 041 010–8, Nov. 2012.

[54] R. Kumar, H. Qin, and R. Allaume, “Coexistence of continuous variable QKD with intense DWDM

classical channels,” New Journal of Physics, vol. 17, no. 4, pp. 043 027–1 – 043 027–4, Apr. 2015.

References 102

[55] G. Vest, M. Rau, L. Fuchs, G. Corrielli, H. Weier, S. Nauerth, A. Crespi, R. Osellame, and

H. Weinfurter, “Design and evaluation of a handheld quantum key distribution sender module,”

IEEE Journal of Selected Topics in Quantum Electronics, vol. 21, no. 3, pp. 131–137, May 2015.

[56] G. Vallone, D. Bacco, D. Dequal, S. Gaiarin, V. Luceri, G. Bianco, and P. Villoresi, “Experimental

satellite quantum communications,” Phys. Rev. Lett., vol. 115, pp. 040 502–1 – 040 502–5, Jul.

2015.

[57] P. Sibson, J. E. Kennard et al., “Integrated silicon photonics for high-speed quantum key distri-

bution,” Optica, vol. 4, no. 2, pp. 172–177, Feb. 2017.

[58] L. Chen, S. Jordan et al., “Report on post-quantum cryptography,” National Institute of Standards

and Technology Internal Report, vol. 8105, 2016.

[59] P. Jouguet, S. Kunz-Jacques, A. Leverrier, P. Grangier, and E. Diamanti, “Experimental demon-

stration of long-distance continuous-variable quantum key distribution,” Nature Photonics, vol. 7,

pp. 378–381, May 2013.

[60] P. Jouguet and S. Kunz-Jacques, “High performance error correction for quantum key distribution

using polar codes,” Quantum Inform. & Comp., vol. 14, no. 3, pp. 329–338, Mar. 2014.

[61] S. Pirandola, R. Laurenza, C. Ottaviani, and L. Banchi, “Fundamental limits of repeaterless

quantum communications,” Nature Communications, vol. 8, pp. 15 043–1 – 15 043–15, Apr. 2017.

[62] D. Huang, P. Huang, T. Wang, H. Li, Y. Zhou, and G. Zeng, “Continuous-variable quantum key

distribution based on a plug-and-play dual-phase-modulated coherent-states protocol,” Phys. Rev.

A, vol. 94, pp. 032 305–1 – 032 305–11, Sep. 2016.

[63] D. Huang, P. Huang, D. Lin, and G. Zeng, “Long-distance continuous-variable quantum key dis-

tribution by controlling excess noise,” Scientific Reports, vol. 6, pp. 19 201–1 – 19 201–6, Jan.

2016.

[64] C. Wang, D. Huang et al., “25 MHz clock continuous-variable quantum key distribution system

over 50 km fiber channel,” Scientific reports, vol. 5, pp. 14 607–1 – 14 607–8, Sep. 2015.

[65] J. Martinez-Mateo, D. Elkouss, and V. Martin, “Key reconciliation for high performance quantum

key distribution,” Scientific Reports, vol. 3, pp. 1576–1 – 1576–6, Apr. 2013.

[66] A. Dixon and H. Sato, “High speed and adaptable error correction for Megabit/s rate quantum

key distribution,” Scientific Reports, vol. 4, pp. 7275–1 – 7275–4, Dec. 2014.

[67] A. Leverrier and P. Grangier, “Unconditional security proof of long-distance continuous-variable

quantum key distribution with discrete modulation,” Physical Review Letters, vol. 102, no. 18, pp.

180 504–1 – 180 504–4, May 2009.

[68] A. Becir and M. Ridza Wahiddin, “Phase coherent states for enhancing the performance of contin-

uous variable quantum key distribution,” Journal of the Physical Society of Japan, vol. 81, no. 3,

pp. 034 005–1 – 034 005–9, Mar. 2012.

[69] “ETSI Standard 302 307-2 V1.1.1: Digital Video Broadcasting (DVB),” ETSI Std 302 307-2 V1.1.1

DVB-S2X 2014), pp. 1–139, Oct. 2014.

References 103

[70] A. Blanksby and C. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code

decoder,” Solid-State Circuits, IEEE Journal of, vol. 37, no. 3, pp. 404–412, Mar 2002.

[71] T. Richardson, R. Urbanke et al., “Multi-edge type LDPC codes,” Workshop honoring Prof. Bob

McEliece on his 60th birthday, California Institute of Technology, Pasadena, California, pp. 24–25,

May 2002.

[72] M. Fossorier, “Quasicyclic low-density parity-check codes from circulant permutation matrices,”

Information Theory, IEEE Transactions on, vol. 50, no. 8, pp. 1788–1793, Aug. 2004.

[73] S. Kim, G. E. Sobelman, and H. Lee, “A reduced-complexity architecture for LDPC layered de-

coding schemes,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19,

no. 6, pp. 1099–1103, Jun. 2011.

[74] G. Falcao, L. Sousa, and V. Silva, “Massively LDPC decoding on multicore architectures,” IEEE

Transactions on Parallel and Distributed Systems, vol. 22, no. 2, pp. 309–322, Feb. 2011.

[75] H. Ji, J. Cho, and W. Sung, “Massively parallel implementation of cyclic LDPC codes on a general

purpose graphics processing unit,” 2009 IEEE Workshop on Signal Processing Systems, pp. 285–

290, Oct. 2009.

[76] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, “GPUs and the Future of

Parallel Computing,” IEEE Micro, vol. 31, no. 5, pp. 7–17, Sep. 2011.

[77] M. Fossorier, M. Mihaljevic, and H. Imai, “Reduced complexity iterative decoding of low-density

parity check codes based on belief propagation,” Communications, IEEE Transactions on, vol. 47,

no. 5, pp. 673–680, May 1999.

[78] R. Tanner, “A recursive approach to low complexity codes,” IEEE Trans. Inf. Theory, vol. 27,

no. 5, pp. 533–547, Sep. 1981.

[79] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, “Factor graphs and the sum-product algorithm,”

IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.

[80] D. Oh and K. Parhi, “Optimally quantized offset min-sum algorithm for flexible LDPC decoder,”

Signals, Systems and Computers, 2008 42nd Asilomar Conference on, pp. 1886–1891, Oct. 2008.

[81] T. Mohsenin and B. Baas, “Trends and challenges in LDPC hardware decoders,” Signals, Systems

and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on, pp. 1273–

1277, Nov. 2009.

[82] C. Roth, A. Cevrero, C. Studer, Y. Leblebici, and A. Burg, “Area, throughput, and energy-

efficiency trade-offs in the VLSI implementation of LDPC decoders,” Circuits and Systems (IS-

CAS), 2011 IEEE International Symposium on, pp. 1772–1775, May 2011.

[83] M. Mansour and N. Shanbhag, “High-throughput LDPC decoders,” Very Large Scale Integration

(VLSI) Systems, IEEE Transactions on, vol. 11, no. 6, pp. 976–996, Dec. 2003.

[84] D. Oh and K. Parhi, “Low-Complexity Switch Network for Reconfigurable LDPC Decoders,” Very

Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 18, no. 1, pp. 85–94, Jan.

2010.

References 104

[85] D. Hocevar, “A reduced complexity decoder architecture via layered decoding of LDPC codes,”

Signal Processing Systems, 2004. SIPS 2004. IEEE Workshop on, pp. 107–112, Oct. 2004.

[86] T. Bhatt, V. Sundaramurthy, V. Stolpman, and D. McCain, “Pipelined block-serial decoder ar-

chitecture for structured LDPC codes,” Acoustics, Speech and Signal Processing, 2006. ICASSP

2006 Proceedings. 2006 IEEE International Conference on, vol. 4, pp. 225–228, May 2006.

[87] A. Darabiha, A. Carusone, and F. Kschischang, “Power Reduction Techniques for LDPC De-

coders,” Solid-State Circuits, IEEE Journal of, vol. 43, no. 8, pp. 1835–1845, Aug. 2008.

[88] Z. Wang and Z. Cui, “Low-Complexity High-Speed Decoder Design for Quasi-Cyclic LDPC Codes,”

Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 15, no. 1, pp. 104–114,

Jan. 2007.

[89] L. Liu and C.-J. Shi, “Sliced Message Passing: High Throughput Overlapped Decoding of High-

Rate Low-Density Parity-Check Codes,” Circuits and Systems I: Regular Papers, IEEE Transac-

tions on, vol. 55, no. 11, pp. 3697–3710, Dec. 2008.

[90] Y. Chen and K. Parhi, “Overlapped message passing for quasi-cyclic low-density parity check

codes,” Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 51, no. 6, pp. 1106–

1113, Jun. 2004.

[91] D. Bao, X. Chen et al., “A single-routing layered LDPC decoder for 10Gbase-T Ethernet in 130nm

CMOS,” 17th Asia and South Pacific Design Automation Conference, pp. 565–566, Jan. 2012.

[92] S. Kumawat, R. Shrestha et al., “High-throughput LDPC-decoder architecture using efficient com-

parison techniques and dynamic multi-frame processing schedule,” IEEE Transactions on Circuits

and Systems I: Regular Papers, vol. 62, no. 5, pp. 1421–1430, May 2015.

[93] P. Meinerzhagen, A. Bonetti, G. Karakonstantis, C. Roth, F. Giirkaynak, and A. Burg, “Refresh-

free dynamic standard-cell based memories: Application to a QC-LDPC decoder,” 2015 IEEE

International Symposium on Circuits and Systems (ISCAS), pp. 1426–1429, May 2015.

[94] D. Miyashita, R. Yamaki et al., “An LDPC decoder with time-domain analog and digital mixed-

signal processing,” IEEE Journal of Solid-State Circuits, vol. 49, no. 1, pp. 73–83, Jan. 2014.

[95] F. Grosshans and P. Grangier, “Continuous variable quantum cryptography using coherent states,”

Phys. Rev. Lett., vol. 88, pp. 057 902–1 – 057 902–4, Jan. 2002.

[96] H.-K. Lo, M. Curty, and B. Qi, “Measurement-device-independent quantum key distribution,”

Phys. Rev. Lett., vol. 108, pp. 130 503–1 – 130 503–5, Mar. 2012.

[97] S. Pirandola, C. Ottaviani, G. Spedalieri, C. Weedbrook, S. L. Braunstein, S. Lloyd, T. Gehring,

C. S. Jacobsen, and U. L. Andersen, “High-rate measurement-device-independent quantum cryp-

tography,” Nature Photonics, vol. 9, no. 6, pp. 397–402, May 2015.

[98] A. Leverrier, R. Alleaume, J. Boutros, G. Zemor, and P. Grangier, “Multidimensional reconciliation

for a continuous-variable quantum key distribution,” Phys. Rev. A, vol. 77, pp. 042 325–1 – 042 325–

8, Apr. 2008.

References 105

[99] C. Weedbrook, S. Pirandola, S. Lloyd, and T. C. Ralph, “Quantum cryptography approaching the

classical limit,” Phys. Rev. Lett., vol. 105, pp. 110 501–1 – 110 501–4, Sep. 2010.

[100] P. Jouguet, D. Elkouss, and S. Kunz-Jacques, “High-bit-rate continuous-variable quantum key

distribution,” Phys. Rev. A, vol. 90, pp. 042 329–1 – 042 329–8, Oct. 2014.

[101] T. Gehring, V. Handchen, J. Duhme, F. Furrer, T. Franz, C. Pacher, R. F. Werner, and R. Schn-

abel, “Implementation of continuous-variable quantum key distribution with composable and one-

sided-device-independent security against coherent attacks,” Nature Communications, vol. 6, pp.

8795–1 – 8795–7, Oct. 2015.

[102] F. Grosshans, G. V. Assche, J. Wenger, R. Brouri, N. J. Cerf, and P. Grangier, “Quantum key

distribution using Gaussian-modulated coherent states,” Nature, vol. 421, pp. 238–241, Jan. 2003.

[103] H. Yan, X. Peng, X. Lin, W. Jiang, T. Liu, and H. Guo, “Efficiency of Winnow Protocol in

Secret Key Reconciliation,” 2009 WRI World Congress on Computer Science and Information

Engineering, vol. 3, pp. 238–242, Mar. 2009.

[104] D. Elkouss, J. Martinez, D. Lancho, and V. Martin, “Rate compatible protocol for information

reconciliation: An application to QKD,” 2010 IEEE Information Theory Workshop on Information

Theory, pp. 1–5, Jan 2010.

[105] N. Benletaief, H. Rezig, and A. Bouallegue, “Toward Efficient Quantum Key Distribution Recon-

ciliation,” Journal of Quantum Information Science, vol. 4, no. 2, pp. 117–128, Jun. 2014.

[106] T. Richardson, M. Shokrollahi, and R. Urbanke, “Design of capacity-approaching irregular low-

density parity-check codes,” Information Theory, IEEE Transactions on, vol. 47, no. 2, pp. 619–

637, Feb. 2001.

[107] D. Declercq and M. Fossorier, “Decoding algorithms for nonbinary LDPC codes over GF(q),”

IEEE Transactions on Communications, vol. 55, no. 4, pp. 633–643, Apr. 2007.

[108] A. Anastasopoulos, “A comparison between the sum-product and the min-sum iterative detection

algorithms based on density evolution,” Global Telecommunications Conference, 2001. GLOBE-

COM ’01. IEEE, vol. 2, pp. 1021–1025, Nov. 2001.

[109] A. Hurwitz, “Ueber die Composition der quadratischen Formen von belibig vielen Variablen,”

Nachrichten von der Gesellschaft der Wissenschaften zu Gttingen, Mathematisch-Physikalische

Klasse, vol. 1898, pp. 309–316, Jul. 1898.

[110] J. C. Baez, “The octonions,” Bulletin of the American Mathematical Society, vol. 39, no. 2, pp.

145–205, Dec. 2001.

[111] M. Bloch, A. Thangaraj, S. W. McLaughlin, and J. M. Merolla, “LDPC-based secret key agreement

over the Gaussian wiretap channel,” 2006 IEEE International Symposium on Information Theory,

pp. 1179–1183, Jul. 2006.

[112] U. M. Maurer, “Secret key agreement by public discussion from common information,” IEEE

Transactions on Information Theory, vol. 39, no. 3, pp. 733–742, May 1993.

References 106

[113] P. Jouguet, S. Kunz-Jacques, E. Diamanti, and A. Leverrier, “Analysis of imperfections in practical

continuous-variable quantum key distribution,” Phys. Rev. A, vol. 86, p. 032309, Sep 2012.

[114] A. Leverrier, F. Grosshans, and P. Grangier, “Finite-size analysis of a continuous-variable quantum

key distribution,” Phys. Rev. A, vol. 81, pp. 062 343–1 – 062 343–11, Jun. 2010.

[115] M. Curty, F. Xu, W. Cui, C. C. W. Lim, K. Tamaki, and H.-K. Lo, “Finite-key analysis for

measurement-device-independent quantum key distribution,” Nature Communications, vol. 5, Apr.

2014.

[116] E. Diamanti and A. Leverrier, “Distributing secret keys with quantum continuous variables: prin-

ciple, security and implementations,” Entropy, vol. 17, no. 9, pp. 6072–6092, Aug. 2015.

[117] A. Leverrier, “Composable Security Proof for Continuous-Variable Quantum Key Distribution

with Coherent States,” Phys. Rev. Lett., vol. 114, pp. 070 501–1 – 070 501–5, Feb. 2015.

[118] V. C. Usenko and R. Filip, “Trusted noise in continuous-variable quantum key distribution: A

threat and a defense,” Entropy, vol. 18, no. 1, p. 20, Jan. 2016.

[119] H.-A. Loeliger, “On the basic averaging arguments for linear codes,” Communications and Cryp-

tography, vol. 276, pp. 251–261, 1994.

[120] T. J. Richardson, “Error floors of LDPC codes,” Proceedings of the annual Allerton conference on

communication, control, and computing, vol. 41, no. 3, pp. 1426–1435, Oct. 2003.

[121] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc, “Design of ion-

implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid-State Circuits,

vol. 9, no. 5, pp. 256–268, Oct. 1974.

[122] T. Huynh-Bao, J. Ryckaert, S. Sakhare, A. Mercha, D. Verkest, A. Thean, and P. Wambacq,

“Toward the 5nm technology: layout optimization and performance benchmark for logic/SRAMs

using lateral and vertical GAA FETs,” Proc. SPIE, vol. 9781, pp. 978 102 – 978 102–12, Mar. 2016.

[123] S. Y. Wu, C. Y. Lin et al., “Demonstration of a sub-0.03 um2 high density 6-t SRAM with scaled

bulk FinFETs for mobile SOC applications beyond 10nm node,” 2016 IEEE Symposium on VLSI

Technology, pp. 1–2, Jun. 2016.

[124] ——, “A 7nm CMOS platform technology featuring 4th generation FinFET transistors with a

0.027um2 high density 6-t SRAM cell for mobile SoC applications,” 2016 IEEE International

Electron Devices Meeting (IEDM), pp. 2.6.1–2.6.4, Dec. 2016.

[125] M. B. Taylor, “A landscape of the new dark silicon design regime,” IEEE Micro, vol. 33, no. 5,

pp. 8–19, Sep. 2013.

[126] S. Ajaz and H. Lee, “Multi-Gb/s multi-mode LDPC decoder architecture for IEEE 802.11ad stan-

dard,” Circuits and Systems (APCCAS), 2014 IEEE Asia Pacific Conference on, pp. 153–156,

Nov. 2014.

[127] M. Li, J. W. Weijers et al., “An energy efficient 18Gbps LDPC decoding processor for 802.11ad in

28nm CMOS,” 2015 IEEE Asian Solid-State Circuits Conference (A-SSCC), pp. 1–5, Nov. 2015.

References 107

[128] H. Motozuka, N. Yosoku et al., “A 6.16Gb/s 4.7pJ/bit/iteration LDPC decoder for IEEE 802.11ad

standard in 40nm LP-CMOS,” 2015 IEEE Global Conference on Signal and Information Processing

(GlobalSIP), pp. 1289–1292, Dec. 2015.

[129] J. P. Colinge, “Fully-depleted SOI CMOS for analog applications,” IEEE Transactions on Electron

Devices, vol. 45, no. 5, pp. 1010–1016, May 1998.

[130] G. V. Assche, J. Cardinal, and N. J. Cerf, “Reconciliation of a quantum-distributed Gaussian key,”

IEEE Transactions on Information Theory, vol. 50, no. 2, pp. 394–400, Feb. 2004.

[131] N. Walenta, A. Burg, D. Caselunghe, J. Constantin, N. Gisin, O. Guinnard, R. Houlmann,

P. Junod, B. Korzh, N. Kulesza et al., “A fast and versatile quantum key distribution system

with hardware key distillation and wavelength multiplexing,” New Journal of Physics, vol. 16,

no. 1, pp. 013 047–1 – 013 047–20, Jan. 2014.

[132] Z. Bai, S. Yang, and Y. Li, “High-efficiency reconciliation for continuous variable quantum key

distribution,” Japanese Journal of Applied Physics, vol. 56, no. 4, pp. 044 401–1 – 044 401–4, Mar.

2017.

[133] S. Kang and J. Moon, “Parallel LDPC decoder implementation on GPU based on unbalanced

memory coalescing,” 2012 IEEE International Conference on Communications (ICC), pp. 3692–

3697, Jun. 2012.

[134] G. Wang, M. Wu, B. Yin, and J. R. Cavallaro, “High throughput low latency LDPC decoding

on GPU for SDR systems,” Global Conference on Signal and Information Processing (GlobalSIP),

2013 IEEE, pp. 1258–1261, Dec. 2013.

[135] Y. Lin and W. Niu, “High Throughput LDPC Decoder on GPU,” IEEE Communications Letters,

vol. 18, no. 2, pp. 344–347, Feb. 2014.

[136] G. Wang, M. Wu, Y. Sun, and J. R. Cavallaro, “A massively parallel implementation of

QC-LDPC decoder on GPU,” 2011 IEEE 9th Symposium on Application Specific Processors (SASP),

pp. 82–85, Jun. 2011.

[137] B. L. Gal, C. Jego, and J. Crenne, “A High Throughput Efficient Approach for Decoding LDPC

Codes onto GPU Devices,” IEEE Embedded Systems Letters, vol. 6, no. 2, pp. 29–32, Jun. 2014.

[138] H. Zbinden, N. Walenta, O. Guinnard, R. Houlmann, C. L. C. Wen, B. Korzh, T. Lunghi, N. Gisin,

A. Burg, J. Constantin et al., “Continuous QKD and high speed data encryption,” Proc. SPIE,

vol. 8899, pp. 88 990P–1 – 88 990P–4, Oct. 2013.

[139] S. J. Johnson, V. A. Chandrasetty, and A. M. Lance, “Repeat-accumulate codes for

reconciliation in continuous variable quantum key distribution,” 2016 Australian Communications Theory

Workshop (AusCTW), pp. 18–23, Jan. 2016.

[140] M. Shirvanimoghaddam, S. J. Johnson, and A. M. Lance, “Design of Raptor codes in the low SNR

regime with applications in quantum key distribution,” 2016 IEEE International Conference on

Communications (ICC), pp. 1–6, May 2016.

[141] J. Yin, J.-G. Ren, H. Lu, Y. Cao, H.-L. Yong, Y.-P. Wu, C. Liu, S.-K. Liao, F. Zhou, Y. Jiang et al.,

“Quantum teleportation and entanglement distribution over 100-kilometre free-space channels,”

Nature, vol. 488, no. 7410, pp. 185–188, Aug. 2012.

[142] J. Handsteiner, D. Rauch, D. Bricher, T. Scheidl, and A. Zeilinger, “Quantum key distribution

at space scale,” 2015 IEEE International Conference on Space Optical Systems and Applications

(ICSOS), pp. 1–3, Oct. 2015.

[143] J.-P. Bourgoin, N. Gigov, B. L. Higgins, Z. Yan, E. Meyer-Scott, A. K. Khandani, N. Lütkenhaus,

and T. Jennewein, “Experimental quantum key distribution with simulated ground-to-satellite

photon losses and processing limitations,” Phys. Rev. A, vol. 92, pp. 052 339–1 – 052 339–12, Nov.

2015.

[144] E. Gibney, “Chinese satellite is one giant step for the quantum internet,” Nature, vol. 535, pp.

478–479, Jul. 2016.

[145] J. Yin, Y. Cao et al., “Satellite-based entanglement distribution over 1200 kilometers,” Science,

vol. 356, no. 6343, pp. 1140–1144, Jun. 2017.

[146] J. Rarity, P. Tapster, P. Gorman, and P. Knight, “Ground to satellite secure key exchange using

quantum cryptography,” New Journal of Physics, vol. 4, no. 1, pp. 82.1–82.21, Oct. 2002.

[147] C. Ma, W. D. Sacher, Z. Tang, J. C. Mikkelsen, Y. Yang, F. Xu, T. Thiessen, H.-K. Lo, and

J. K. S. Poon, “Silicon photonic transmitter for polarization-encoded quantum key distribution,”

Optica, vol. 3, no. 11, pp. 1274–1278, Nov. 2016.

[148] D. Bunandar, N. Harris, Z. Zhang, C. Lee, R. Ding, T. Baehr-Jones, M. Hochberg, J. Shapiro,

F. Wong, and D. Englund, “Cavity integrated quantum key distribution,” Sept. 2016, poster at

QCrypt 2016.

[149] D. G. M. Mitchell, M. Lentmaier, and D. J. Costello, “Spatially Coupled LDPC Codes Constructed

From Protographs,” IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 4866–4889, Sep.

2015.

[150] X. Liu and S. C. Draper, “The ADMM Penalized Decoder for LDPC Codes,” IEEE Transactions

on Information Theory, vol. 62, no. 6, pp. 2966–2984, Jun. 2016.

[151] M. Wasson, M. Milicevic, S. Draper, and G. Gulak, “Hardware-based linear programming

decoding via the alternating direction method of multipliers,” 2017 IEEE International Conference on

Acoustics, Speech and Signal Processing, Mar. 2017.

[152] V. De, S. Vangal, and R. Krishnamurthy, “Near Threshold Voltage (NTV) Computing: Computing

in the Dark Silicon Era,” IEEE Design & Test, vol. 34, no. 2, pp. 24–30, Apr. 2017.

[153] N. Pinckney, S. Jeloka, R. Dreslinski, T. Mudge, D. Sylvester, D. Blaauw, L. Shifren, B. Cline, and

S. Sinha, “Impact of FinFET on Near-Threshold Voltage Scalability,” IEEE Design & Test, vol. 34,

no. 2, pp. 31–38, Apr. 2017.

[154] S. K. Samal, D. Nayak, M. Ichihashi, S. Banna, and S. K. Lim, “Monolithic 3D IC vs. TSV-based

3D IC in 14nm FinFET technology,” 2016 IEEE SOI-3D-Subthreshold Microelectronics Technology

Unified Conference (S3S), pp. 1–2, Oct. 2016.

[155] R. Takahashi, Y. Tanizawa, and A. Dixon, “High-speed implementation of privacy amplification

in quantum key distribution,” Sept. 2016, poster at QCrypt 2016.

[156] M. Hayashi and T. Tsurumaru, “More efficient privacy amplification with less random seeds via

dual universal hash function,” IEEE Transactions on Information Theory, vol. 62, no. 4, pp.

2213–2232, Apr. 2016.

[157] R. Renner and R. König, “Universally composable privacy amplification against quantum

adversaries,” Theory of Cryptography Conference, pp. 407–425, Feb. 2005.
