Low-Density Parity-Check Decoder
Architectures for Integrated Circuits and
Quantum Cryptography
by
Mario Milicevic
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2017 by Mario Milicevic
Abstract
Low-Density Parity-Check Decoder Architectures for Integrated Circuits and Quantum Cryptography
Mario Milicevic
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2017
Forward error correction enables reliable one-way communication over noisy channels by transmitting
redundant data along with the message in order to detect and resolve errors at the receiver. Low-density
parity-check (LDPC) codes achieve superior error-correction performance on Gaussian channels under
belief propagation decoding; however, their complex parity-check matrix structure introduces hardware
implementation challenges. This thesis explores how the quasi-cyclic structure of LDPC parity-check
matrices can be exploited in the design of low-power hardware architectures for multi-Gigabit/second
decoders realized in CMOS technology, as well as in the design and construction of multi-edge LDPC
codes for long-distance (beyond 100 km) quantum cryptography over optical fiber.
A frame-interleaved architecture is presented with a path-unrolled message-passing schedule to reduce
the complexity of routing interconnect in an integrated circuit decoder implementation. A
proof-of-concept silicon test chip was fabricated in the 28 nm CMOS technology node. The LDPC decoder
chip supports the four codes presented in the IEEE 802.11ad standard, occupies an area of 3.41 mm², and
achieves an energy efficiency of 15 pJ/bit while delivering a maximum throughput of 6.78 Gb/s and
operating with a 202 MHz clock at a 0.9 V supply. The test chip achieves the highest normalized energy
efficiency among published CMOS-based decoders for the IEEE 802.11ad standard.
A quasi-cyclic code construction technique is applied to a multi-edge LDPC code with block length of
10⁶ bits in order to reduce the latency of LDPC decoding in the key reconciliation step of long-distance
quantum key distribution. The GPU-based decoder achieves a maximum information throughput of
7.16 Kb/s, and extends the current maximum transmission distance from 100 km to 160 km with a secret
key rate of 4.10 × 10⁻⁷ bits/pulse under 8-dimensional reconciliation. The GPU-based decoder delivers
up to 8.03× higher decoded information throughput than the upper bound on secret key rate for a
lossy optical channel, thus demonstrating that key reconciliation with LDPC codes is no longer a
post-processing bottleneck in quantum key distribution.
The contributions presented in this thesis can be applied to future research in the implementation of
silicon-based linear-program decoders for high-reliability channels, and single-chip solutions for quantum
key distribution containing integrated photonics and post-processing algorithms.
Dedicated in loving memory to my grandparents.
Acknowledgements
I would like to thank my family, friends, professors, and colleagues for their tremendous support during
my pursuit of a Ph.D. degree. Thanks to you, I have not been on this journey alone. I have had the
opportunity to explore new ideas and contribute to the state of the art. Such opportunities are few and
far between. Looking back, I am happy I committed the time to do it, and would do it again in a
heartbeat.
I extend my sincerest gratitude to Professor Glenn Gulak for originally taking me on as a Master's
student, and encouraging me to pursue a Ph.D. degree. Your guidance and attention to detail contributed
tremendously to the direction and quality of my Ph.D. research. Thank you for opening the doors to so
many great opportunities, and for giving me the time to pursue long periods of “studio time” to focus
on my ideas and writing.
I would like to thank Professors Jason Anderson, Stark Draper, and Frank Kschischang from the
University of Toronto, and Professor Zhengya Zhang from the University of Michigan Ann Arbor for
serving on my thesis examination committee. Your insights and thoughtful questions have helped bring
clarity and rigour to this thesis.
Some of the best learning experiences during my Ph.D. have been through my collaborative research
on LDPC codes for QKD with Chen Feng and Lei Zhang, as well as hardware-based implementations of
ADMM-LP decoders with Mitch Wasson and Professor Stark Draper. It has been an absolute pleasure
to work with you. I would also like to thank Christian Weedbrook and Xingxing Xing for introducing
me to QKD and for their technical guidance.
I am grateful to the many faculty members and professional staff that I have had the pleasure of
knowing and working with since I started my undergraduate studies in 2006 in the Department of
Electrical and Computer Engineering at the University of Toronto. You have all contributed positively
to my experiences at the university. In particular, I wish to thank Professors Aleksandar Prodic, Ali
Sheikholeslami, Bruce Francis, David Johns, Khoman Phang, Micah Stickel, Paul Chow, Roman Genov,
Sorin Voinigescu, and Tony Chan Carusone. I also wish to thank Jennifer Rodrigues, Darlene Gorzo, and
Jayne Leake for their administrative assistance with my graduate studies and teaching assistantships.
Last, I wish to acknowledge Jeetendar Narsinghani for his guidance with ATE SoC testing.
I would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC)
and MaxLinear Inc., California, USA for supporting this research. At MaxLinear, I would like to thank
Curtis Ling and Tim Gallagher for their interest and continuous support in fabricating my proposed
LDPC decoder architecture in a state-of-the-art CMOS technology. I am sincerely grateful to Stephane
Laurent-Michel for taking me on as an intern in his communication systems team and for championing
my chip tapeout. To Isai Miranda, my chip tapeout would not have been possible without your ongoing
support in resolving DRC violations and additional metal-mask fixes. To Prasanna, thank you
for pushing me through the physical back-end design of my chip, and to Jack and Jilian for helping me
resolve tool-related issues so that I could simulate and synthesize my design. To Nitza, Kris, and Preeti,
thank you for providing me with support and test time on the ATE, and to Jay and Henry for your
help with packaging and wafer production. Finally, I would like to thank Didier Roland from Mentor
Graphics for helping me trailblaze the place-and-route tool flow for the physical back-end design of my
LDPC decoder chip. I would also like to thank Isai, Juan, Preeti, Paul, Michael, Akan, Srikar, Miles,
Thao, and Naman for your friendship during my internships at MaxLinear.
To my graduate student colleagues in BA5000, we have shared some fun and gruelling times together.
From long nights in the lab and coffee refill runs to Starbucks to nights out at Sin and Redemption and
Prenup Pub, I have enjoyed working and socializing with all of you during our time together. Thank
you Alain, Alhassan, Alireza, Andy, Aynaz, Colin, Cliff, Dustin, Dawei, Farhad, Hemesh, Jeff, Joshua,
Kevin, Luke, Michal, Meysam, Nadeesha, Nasim, Neno, Ravi, Rocky, Rosanah, Sadegh, Safeen, Shayan,
Victor, Yue, and Zeynep. Grad school would not have been as enjoyable without my 7:00 AM morning
workouts at Hart House; thank you, Tarik, for being an awesome and punctual gym partner.
One of my favourite and most enjoyable experiences in grad school was leading the Saratoga student
volunteer team at the IEEE International Solid-State Circuits Conference each year in San Francisco. I
am very grateful to Laura Fujino and K.C. Smith for believing in Andrew Shorten and me to lead
the team and ensure that the conference runs smoothly. Andrew, we had some massive fires at that
show, but we fooled them every time, dude. We had the time of our lives waking up at 5:00 AM and
going to sleep at 3:00 AM for a week straight every year. I wouldn’t have wanted it any other way.
To my many friends who partook in our adventures in San Francisco, thank you for being a part of
the team, namely: Bert, Dave, Danial, Daniel, Gairik, Gerard, Guy, Jasmina, Javid, Jin-Hee, Jingshu,
Jingxuan, Joy, Junmin, Ivan, Mike, Navid, Paul, Oleksey, Robert B., Robert H., Saba, Samira, Simon,
Stefan, Victor, Vince, Wahid, Weijia, Yingying, and Xander, and to the entourage: Alex, David, Karim,
Ricardo, Shahriar, Saman, and Trevor. Finally, to the two guys that helped us all keep our cool, a big
thanks to Mark and Snoopy from the production team.
Outside of the university, I would like to thank my many friends at Northern Karate Schools and
the National Yacht Club in Toronto for your endearing support of my academic pursuits, and for being
there to take my mind off work. I would also like to thank my volunteer colleagues and staff from the
IEEE for allowing me to pursue engaging leadership opportunities within a global community. To my
skiing posse, Peter, Dino, Moritz, Nora, Ozren, Taylor, and Chris, thanks for the epic powder sessions.
To my Toronto crew, Amir, Dorijan, Nikita, Nikola, Vasily, and Victor, it has always been good times.
Finally, and most importantly, I wish to extend my deepest gratitude and love to my family. To my
sister, Dana, thank you for always being there for me; your tasty baked treats have helped fuel a lot
of my work. To my girlfriend, Katrina, your unwavering love and commitment to seeing me finish this
thing has always given me the drive to work hard and find the answers; I will forever cherish our trips
to Pizza Libretto and Bellwoods Brewery in Toronto, and our many adventures around the world. To
my mom and dad, words cannot express how thankful I am for all the opportunities you have given me,
for raising my sister and me in Canada, for your countless sacrifices, and for always having warm
home-cooked meals whenever I came home. Most importantly, thank you for believing in me; this thesis
is for you.
Contents
List of Tables
List of Figures
List of Acronyms
List of Symbols
1 Introduction
1.1 LDPC Decoders in Integrated Circuits
1.2 LDPC Decoding for Quantum Key Distribution
1.3 Thesis Roadmap: Investigation of Two Application Areas
2 Background
2.1 Forward Error Correction with LDPC Codes
2.2 LDPC Codes: A Class of Linear Block Codes
2.3 LDPC Decoding: Belief Propagation Algorithms
2.4 Quasi-Cyclic LDPC Codes
2.5 Silicon Integrated Circuits for LDPC Decoding
2.5.1 LDPC Decoder Architectures
2.5.2 Flooding and Layered Decoding Schedules
2.5.3 Design Challenges: Message Permutation, Memory, and Power
2.5.4 Explicit Check and Variable Node Processing Units
2.6 LDPC Decoding in Quantum Key Distribution
2.6.1 Quantum Transmission and Sifting
2.6.2 Reconciliation
2.6.3 Privacy Amplification
2.6.4 Maximizing Secret Key Rate with Collective Attacks
2.6.5 Upper Bound on Secret Key Rate for a Lossy Channel
2.6.6 Frame Error Rate for Reverse Reconciliation
2.6.7 Impact of Reconciliation Error and Efficiency on Secret Key Rate
2.6.8 Secret Key Rate with Finite-Size Effects
2.7 Multi-Edge LDPC Codes
2.7.1 General Design and Construction of LDPC Codes
2.7.2 Multi-Edge Framework
2.8 Summary
3 LDPC Decoder Architecture with Path-Unrolled Message Passing
3.1 Proposed LDPC Decoder Architecture
3.1.1 Hardware Mapping with Path-Unrolled Decoding Schedule
3.1.2 Time-Distributed Piecewise Min-Sum Computation
3.1.3 Parity-Check Matrix Partitioning and Hardware Mapping
3.1.4 Column Slice Architecture
3.1.5 Pipelined Frame Interleaving
3.1.6 Input/Output Frame Buffering for Continuous Decoding
3.1.7 Combined CN+VN Processing Unit Architecture
3.1.8 Early Termination with Coarse-Grained Clock Gating
3.1.9 Extendibility to Layered Decoding Schedule
3.2 Physical Silicon Chip Implementation and Results
3.2.1 Error-Correction Decoding Performance
3.2.2 Post-Silicon Power Measurements
3.2.3 Comparison with the State-of-the-Art
3.3 Summary
4 Quasi-Cyclic Multi-Edge LDPC Codes for Quantum Cryptography
4.1 Construction of Quasi-Cyclic Multi-Edge LDPC Codes
4.2 Error-Correction Performance of Multi-Edge QC Codes
4.3 Finite Secret Key Rate
4.4 GPU-Accelerated LDPC Decoding
4.4.1 GPU-Based LDPC Decoder Implementation
4.4.2 Information Throughput Results
4.4.3 Comparison to Other CV-QKD Implementations
4.5 Summary
5 Conclusion and Future Directions
5.1 Summary of Contributions
5.2 Future Directions
5.2.1 Extendibility to Non-Quasi-Cyclic and Spatially-Coupled LDPC Codes
5.2.2 Linear-Program Decoding for High-SNR Channels
5.2.3 Decoder Architectures for Near-Threshold Voltage FinFET Operation
5.2.4 Decoder Architectures for 3-Dimensional Integrated Circuits
A Supplementary Background on QKD
B Development, Simulation, and Testing Framework
C Supplementary QKD Results: Bit Error Rate and Finite Secret Key Rate
D LDPC Decoding for Reverse Reconciliation in CV- and DV-QKD
References
List of Tables
1.1 Comparison of LDPC decoder performance requirements for an IEEE 802.11ad wireless
SoC IP core vs. long-distance CV-QKD key reconciliation block
3.1 Piecewise time-distributed reformulation of Min-Sum algorithm with flooding schedule for
single layer routing path
3.2 Column-slice messages highlighted in Fig. 3.5
3.3 Memory specification in each column slice
3.4 Decoder performance at target BER = 10⁻⁶ with early termination (including idle cycles)
3.5 Percentage breakdown of post-silicon area and estimated power by decoder module at
target BER = 10⁻⁶
3.6 Comparison of LDPC decoder implementations for the IEEE 802.11ad standard
4.1 Designed rate 0.02 multi-edge LDPC codes
4.2 GPU-based LDPC decoding latency and error-correction performance for rate 0.02
multi-edge codes
4.3 Overview of secret key rate and GPU throughput at maximum reconciliation distance
with rate 0.02 multi-edge codes and Nprivacy = 10¹² bits
4.4 GPU LDPC decoding comparison at SNR = 0.161 with d = 8 on BIAWGNC targeting
FER = 0.04 with rate 1/10 codes
List of Figures
1.1 Wireless SoC showing LDPC decoder IP block within the physical layer baseband, and
auxiliary circuits and systems.
1.2 Throughput vs. power comparison of silicon-based LDPC decoders for multiple standards
with different throughput, latency, and error-correction performance requirements. The
plot legend indicates the decoding standard and block length for each implementation.
Annotated values in parentheses indicate the CMOS technology node, clock frequency,
and decoder core area for each implementation [19,25–36].
1.3 Information transmission over untrusted quantum channel and authenticated public
channel between Alice and Bob for CV- and DV-QKD, with eavesdropper Eve.
1.4 Throughput vs. distance of GPU-based LDPC decoders for CV- and DV-QKD. The
reported throughput is the raw GPU throughput without code- or error-rate scaling. For
CV-QKD implementations [12, 60, 64], the annotated values in parentheses indicate the
LDPC code block length n, the code rate R, the reconciliation efficiency β, and SNR
of the quantum channel. For DV-QKD implementations, the annotated values indicate
the block length n, code rate R, and QBER [48,65,66]. By convention in QKD, the SNR
is reported in linear units.
1.5 FER vs. SNR for IEEE 802.11ad Wireless SoC and CV-QKD applications.
1.6 Data throughput with FER and code-rate scaling vs. block length for IEEE 802.11ad
wireless SoC and CV-QKD applications.
2.1 Simplified model of a BIAWGNC with LDPC encoding and decoding.
2.2 Tanner graph and corresponding binary parity-check matrix with block length of n = 6
bits, k = 3 information bits, n = 6 variable nodes, and (n − k) = 3 check nodes.
2.3 Sample quasi-cyclic binary parity-check matrix for q = 5 constructed from uniformly-sized
(q × q), cyclically-shifted identity matrices and all-zero matrices.
2.4 Examples of three LDPC decoder architectures showing message-passing networks, and
configuration of CN and VN processing units. While exceptions exist, typically,
fully-parallel architectures instantiate n VNs and (n − k) CNs, partially-parallel architectures
instantiate a factor of q VNs and CNs, and serial architectures instantiate only 1 VN and
1 CN.
2.5 Explicitly defined processing units VN1, VN2, and VN4 connected to processing unit CN1
based on the Tanner graph in Fig. 2.2, with Lvc and mcv messages indicated for decoding
iterations i and i + 1.
2.6 CV-QKD model for secret key distillation with reverse reconciliation between Alice and
Bob over a private quantum channel and public classical channel.
2.7 Possible decoding scenarios and error detection techniques.
2.8 LDPC frame components: message, CRC, and parity bits.
3.1 Simplified example of the proposed LDPC decoder architecture, based on: (a) sample
two-layer QC parity-check matrix and (b) Tanner graph with one decoding path
highlighted. The proposed layer routing patterns and arrangement of combined CN+VN
processing units in the systolic array architecture are shown in (c). The closed path given
by VN0−CN2−VN4−CN2−VN8−CN2−VN0 in (b) is unrolled in (c) such that CN2 is
absorbed into its connected VNs, resulting in the following unrolled path: VN0−VN4−VN8−VN0.
3.2 (a) The closed path through CN2 in the Tanner graph for one pass (phase) of decoding. (b)
The unrolled piecewise messages that are passed between combined CN+VN processing
units in successive columns of the architecture corresponding to the closed path highlighted
in (a). Here, t = 0 arbitrarily corresponds to the third column of T = 3 total columns.
3.3 IEEE 802.11ad QC parity-check matrices with hardware mapping for proposed
architecture [23]. The sub-matrix value indicates the cyclic permutation index. The four matrices
are derived from a single 8-layer base matrix by removing layers in higher-rate matrices,
or by removing cyclically-shifted submatrices in lower-rate matrices.
3.4 System block diagram for proposed architecture showing the global control unit, and the
datapath containing: column slices with combined CN+VN processing units and
memories, a hard-wired cyclic permutation network between each column slice, and pipeline
registers between column-slice pairs.
3.5 Column slice t comprised of CN+VN processing units, local memory, and wired
permutation networks between adjacent column slices. Pipeline registers are connected only to the
first column in a column-slice pair. Hard-wired interconnect does not contain multiplexing
logic. Hard-wired connections are specified by the parity-check matrix connectivity. The
operations in column slice t are computed in one clock cycle.
3.6 Pipelined frame interleaving pattern through column slices in the proposed architecture
over 16 clock cycles of one complete LDPC decoding iteration for IEEE 802.11ad. The
number in each bubble indicates the frame index. Frame 4 highlights the cyclic
frame-shifting property of the architecture.
3.7 Input/output frame buffering schedule, assuming a uniform decoding latency of 10
iterations with 16 clock cycles per iteration.
3.8 Combined CN+VN processing unit for time-distributed piecewise decoding, showing CN-
and VN-update phase logic, memory interfaces, and data permutation logic between
processing units in successive column slices.
3.9 Processing unit timing diagram for CN- and VN-update phases showing 3 independent
frame updates over 3 clock cycles. Each CN+VN processing unit updates a single frame
j in each clock cycle. All arrow-highlighted operations occur in each clock cycle, and
independently for each frame j, j ∈ {0, 1, …, 7}. Circled nodes B, C, D, E, and F
correspond to the connections shown in the column slice architecture in Fig. 3.5. The
following operations are highlighted. (a) Sign sc(t), first minimum magnitude min1c(t),
and second minimum magnitude min2c(t) updates through column slice pair. (b) Parity
pc(t) updates through column slice pair. (c) Independent Lvc and Cv updates in columns
t and t + 1. (d) Propagation of sign, first minimum magnitude, and second minimum
magnitude messages to next column-slice pair without updates in columns t and t + 1.
3.10 Probability distribution of decoding iterations for the four code rates of the IEEE 802.11ad
standard at FER of 10⁻², 10⁻³, and 10⁻⁴.
3.11 Multi-frame decoding with: (a) no early termination, (b) early termination with idle
cycles (discontinuous decoding), and (c) early termination without idle cycles (continuous
decoding).
3.12 Sample frame termination pattern in frame-interleaved architecture. One iteration is
performed over 16 clock cycles. One clock cycle is required to update a frame in a
column-slice pair. Frames that have terminated are not updated in their current column-slice
pair. Column slices in which the current frame has terminated are disabled through
coarse-grained clock gating in each cycle.
3.13 Die micrograph with wirebonds shown in exposed package.
3.14 FER and BER vs. SNR under Min-Sum decoding for all four IEEE 802.11ad codes on
BIAWGNC with maximum 10 decoding iterations. The channel SNR is normalized to
energy-per-bit as given by Eq. 1.2. Channel input LLRs are quantized to 5 bits for both
fixed-point and floating-point simulations.
3.15 Shmoo plots of measured chip showing functional test pass (P) and fail (F) results.
3.16 Measured power at nominal 0.9 V supply and 202 MHz clock rate, with and without early
termination, at five SNR Eb/N0 operating points for all four code rates.
3.17 Measured power at reduced core and memory voltage with clock-frequency scaling, for
the same operating points as in Fig. 3.16.
4.1 Structure of designed parity-check matrices.
4.2 FER vs. SNR for Sum-Product decoding with d = 1 and d = 2 dimensional reconciliation
on BIAWGNC.
4.3 FER vs. SNR for Sum-Product decoding with d = 4 and d = 8 dimensional reconciliation
on BIAWGNC.
4.4 Probability of invalid decoding error vs. SNR for Sum-Product decoding with d = 1, 2, 4, 8
dimensional reconciliation on BIAWGNC. Probability of error is computed for invalid
messages that are correctly decoded but CRC fails.
4.5 FER vs. reconciliation efficiency for Sum-Product decoding with d = 1 and d = 8
dimensional reconciliation on BIAWGNC. FER values are derived from the FER vs. SNR
curves based on Eq. 2.15.
4.6 d = 1 dimensional reconciliation with Nprivacy = 10¹² bits: finite secret key rate Kfinite
vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
4.7 d = 8 dimensional reconciliation with Nprivacy = 10¹² bits: finite secret key rate Kfinite
vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
4.8 GPU implementation of LDPC decoder showing four multi-threaded compute kernels
and data flow from top to bottom for one decoding iteration. Coalesced memory access
patterns and message variables are indicated. Thread i is denoted by ti, where T in
kernels 1 and 3 represents the maximum number of connections between all CNs and
VNs, (n − k) in Kernel 2 is the number of CNs, and n in Kernel 4 is the number of VNs.
Early termination is not shown. All memory blocks shown in the figure are in Global
GPU Memory. The threads in each kernel use Shared GPU Memory to store intermediate
values during the execution of the kernel.
4.9 Measured information throughput K′GPU vs. reconciliation efficiency for d = 1 and d = 8
dimensional reconciliation. Each measurement point corresponds to a particular SNR
operating point with a measured FER presented in Fig. 4.5.
4.10 GPU information throughput K′GPU of the q = 21 QC-LDPC code with d = 8 dimensional
reconciliation up to the maximum distance point for β ∈ {0.80, 0.89, 0.92, 0.95, 0.98, 0.99},
and upper bound on secret key rate for lossy channel K′lim vs. distance.
A.1 Optimal VA vs. transmission distance for maximum theoretical secret key rate, from
β = 0.8 to β = 0.99, based on the assumed physical operating parameters of the quantum
channel.
A.2 Maximum theoretical secret key rates vs. transmission distance. The maximum
CV-QKD key rate is defined by Kopt from β = 0.8 to β = 0.99 based on the optimal VA. The
fundamental limit for a lossy channel is defined by Klim = −log2(1 − T).
B.1 Development, simulation, and testing framework.
C.1 BER vs. SNR for Sum-Product decoding with d = 1 and d = 2 dimensional reconciliation
on BIAWGNC.
C.2 BER vs. SNR for Sum-Product decoding with d = 4 and d = 8 dimensional reconciliation
on BIAWGNC.
C.3 d = 1 dimensional reconciliation with Nprivacy = 10¹⁰ bits: finite secret key rate Kfinite
vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
C.4 d = 2 dimensional reconciliation with Nprivacy = 10¹⁰ bits: finite secret key rate Kfinite
vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
C.5 d = 4 dimensional reconciliation with Nprivacy = 10¹⁰ bits: finite secret key rate Kfinite
vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
C.6 d = 8 dimensional reconciliation with Nprivacy = 10¹⁰ bits: finite secret key rate Kfinite
vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
C.7 d = 2 dimensional reconciliation with Nprivacy = 10¹² bits: finite secret key rate Kfinite
vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
C.8 d = 4 dimensional reconciliation with Nprivacy = 10¹² bits: finite secret key rate Kfinite
vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
D.1 LDPC encoding and decoding system with BIAWGNC model.
D.2 LDPC encoding and decoding system with BSC model.
List of Acronyms
ADC Analog-to-Digital Converter
ADMM Alternating Direction Method of Multipliers
ASIC Application-Specific Integrated Circuit
ASIP Application-Specific Instruction Set Processor
ATE Automated Test Equipment
AWGN Additive White Gaussian Noise
BER Bit Error Rate
BIAWGNC Binary-Input Additive White Gaussian Noise Channel
BP Belief Propagation
BSC Binary Symmetric Channel
CMOS Complementary Metal-Oxide-Semiconductor
CN Check Node
CRC Cyclic Redundancy Check
CV-QKD Continuous-Variable Quantum Key Distribution
DAC Digital-to-Analog Converter
DRAM Dynamic Random Access Memory
DV-QKD Discrete-Variable Quantum Key Distribution
DVB-S2X Digital Video Broadcasting Second Generation Satellite Extensions
eDRAM Embedded Dynamic Random Access Memory
eSRAM Embedded Static Random Access Memory
EDA Electronic Design Automation
ETSI European Telecommunications Standards Institute
FD-SOI Fully-Depleted Silicon-On-Insulator
FEC Forward Error Correction
FER Frame Error Rate
FIFO First-In First-Out (Memory)
FLOPS Floating Point Operations Per Second
FPGA Field-Programmable Gate Array
GG02 Grosshans-Grangier 2002 Protocol
GPU Graphics Processing Unit
HDL Hardware Description Language
IEEE Institute of Electrical and Electronics Engineers
I/O Input/Output
IP Intellectual Property
LDPC Low-Density Parity-Check
LLR Log-Likelihood Ratio
LNA Low-Noise Amplifier
LP Linear Program
MCS Modulation and Coding Scheme
MDI-QKD Measurement-Device-Independent Quantum Key Distribution
NTV Near-Threshold Voltage
PA Power Amplifier
PVT Process, Voltage, and Temperature
QC Quasi-Cyclic
QBER Quantum Bit Error Rate
QKD Quantum Key Distribution
QUESS Quantum Experiments at Space Scale
RA Repeat-Accumulate
RTL Register Transfer Level
SIMT Single-Instruction Multiple-Thread
SM Sign-Magnitude Number Format
SNR Signal-to-Noise Ratio
SoC System-on-Chip
SRAM Static Random Access Memory
TSV Through-Silicon Via
VN Variable Node
WiGig Wireless Gigabit Alliance
Wi-Fi Wi-Fi Alliance
WPAN Wireless Personal Area Network
List of Symbols
α Transmission loss (assumed to be 0.2dB/km for a single-mode optical fiber)
β Reconciliation efficiency
χBE Holevo bound on the information leaked to Eve
ℓ Optical fiber distance in kilometers
ε Excess channel noise expressed in shot noise units
η Homodyne detector efficiency
ℂ The set of complex numbers
ℍ The set of quaternions
𝕆 The set of octonions
ℝ The set of real numbers
Ĉ Decoded codeword estimate of length n bits
Ŝ Decoded message estimate of length k bits
C LDPC-encoded codeword of length n bits
H Binary parity-check matrix
M Bob’s classical message to Alice
N d-dimensional noise vector
R Received soft-decision vector of length n
S Binary message vector of length k bits
U d-dimensional vector comprised of (−1)^{C_i} components
X Alice’s correlated Gaussian sequence
Y Bob’s correlated Gaussian sequence
Z Gaussian noise distribution on the quantum channel
G A Tanner graph
Ω(x) Normalized variable node degree distribution
Ψ(x) Normalized check node degree distribution
σ² Channel noise variance
C(s) Shannon channel capacity for signal-to-noise ratio of s
d Reconciliation dimension
Eb/N0 Signal-to-noise ratio per bit
frep Light source pulse repetition rate
I(X;Y ) Mutual information between correlated sequences X and Y
Ii Identity matrix cyclically shifted to the right by i − 1
IAB Mutual information between Alice and Bob
k LDPC information length (bits)
Kfinite Finite secret key rate (bits/pulse)
K′finite Operating secret key rate (bits/second)
K′GPU GPU information throughput (bits/second)
KrawGPU Raw GPU throughput (bits/second)
Klim Upper bound on secret key rate for a lossy channel (bits/pulse)
K′lim Upper bound on secret key rate for a lossy channel with a source repetition rate frep (bits/second)
Kopt Maximum theoretical secret key rate for a CV-QKD system with one-way reverse reconciliation
Keff Effective secret key rate (bits/pulse)
Lvc Message from variable node v to check node c
Lv Updated log-likelihood ratio at variable node v
M(v) The set of all check nodes connected to variable node v
mcv Message from check node c to variable node v
min1c First minimum magnitude at check node c
min2c Second minimum magnitude at check node c
n LDPC block length (bits)
N(c) The set of all variable nodes connected to check node c
Nprivacy Block length for privacy amplification
Nquantum Number of symbols sent from Alice to Bob during quantum transmission
Pdetected error Probability of detected frame error
Pe Probability of frame error
Pundetected error Probability of undetected frame error
pc Exclusive-OR parity result at check node c
q Cyclic expansion factor of a quasi-cyclic parity check matrix
Qv Received channel log-likelihood ratio at variable node v
Rcode Code rate
s SNR of the quantum channel
sc Sign result at check node c
T Transmittance of an optical fiber quantum channel
Vel Electronic noise in shot noise units
VA Alice’s modulation variance
Chapter 1
Introduction
Error-correcting codes enable the reliable delivery of messages across unreliable channels in modern digital communication systems. They underpin today’s multi-Gb/s data rates, while also paving the way for new technologies to revolutionize
public network infrastructure. Richard Hamming first introduced error-correction codes in 1950 in order
to increase the rate of reliable communication in noisy channels [1], and as a means of approaching the
channel capacity limit defined by Claude Shannon in 1948 [2]. Over the past 70 years, error-correction
coding has been a rich area of research among information and coding theorists, while today, the en-
coding and decoding procedures present hardware implementation challenges for circuit designers due
to the limited power budgets available in modern systems-on-chip.
Error-correction coding is often referred to as forward error correction (FEC), where the sender uses
a code to encode data prior to transmission, such that the receiver can then reconstruct the original
data without having to request a repeat transmission of the data when an error is detected. FEC
enables low-latency data transmission across a multitude of noisy channels with application in mobile
networks, data storage, satellite and deep space communications, and the Internet. Some examples of
error-correcting codes include convolutional codes, Hamming codes, low-density parity-check (LDPC)
codes, polar codes, Raptor codes, Reed-Solomon codes, and turbo codes [3]. This thesis focuses on the
hardware-based decoding of LDPC codes in two application areas: (1) integrated circuit architectures for
silicon-based system-on-chip implementations, and (2) secret key reconciliation in long-distance quantum
cryptography.
LDPC codes have been widely adopted over the past 15 years for FEC in wireless, wireline, optical,
and non-volatile memory systems due to their near-Shannon limit error-correction performance and ab-
sence of patent licensing fees [4–6]. First introduced by Robert Gallager in 1962, LDPC codes were mostly
ignored up until the early 2000s due to their computationally-complex decoding nature [7,8]. Their recent
widespread adoption in modern standards such as IEEE 802.11ac (Wi-Fi) was predominantly enabled
through the introduction of hardware-friendly variants of the belief propagation (BP) decoding algo-
rithm [9], as well as increased research in the design of hardware-oriented codes and integrated circuit
decoder architectures capable of delivering multi-Gb/s information throughput at low power. Over the
past 10 years, LDPC codes have also been applied to secret key reconciliation in quantum key distribution
(QKD) systems in order to extend the secure distance and increase the speed of unconditionally-secure
communication between two remote parties, better known as Alice and Bob [10–12]. Only a limited
number of hardware-based LDPC decoders have been realized to date for QKD applications due to the
complexity of designing high-performance codes and high-speed decoders for the quantum channel.
The general motivation of this thesis is to investigate low-power LDPC decoder architectures for
integrated circuits in CMOS (complementary metal-oxide-semiconductor) technology, as well as decoding
acceleration techniques for key reconciliation in quantum cryptography. CMOS-based decoders target
low-power applications with multi-Gb/s throughput requirements. Such decoders typically perform 10 to 20
decoding iterations using LDPC codes with block lengths on the order of 10² to 10⁴ bits. This is
in contrast to the LDPC codes used for long-distance QKD, where block lengths on the order of 10⁶ bits
are required. Such decoders perform hundreds of decoding iterations and are not constrained by power,
but rather by the maximum decoded bit rate, which is limited by the number of non-zero elements in the
parity-check matrix. This thesis explores techniques for exploiting the intrinsic structure of parity-check
matrices that define the LDPC codes in each of the two application areas, in order to maximize scalability
and minimize integration complexity. This chapter first introduces the key challenges in implementing
high-speed LDPC decoders in both integrated circuits and quantum crypto-systems, and then presents
a roadmap for the remainder of the thesis.
1.1 LDPC Decoders in Integrated Circuits
CMOS technology miniaturization has played a prominent role in improving the energy efficiency of
silicon-based LDPC decoders, thanks to low-leakage devices and increasing transistor densities [13, 14].
However, the scalability of LDPC decoders remains primarily limited by the complexity of message-
passing interconnect between on-chip memory and parallel processing units [15,16]. The recent stagna-
tion of wired interconnect scaling beyond the 45nm CMOS node has introduced additional low-power
design challenges for multi-Gb/s decoders, as unstructured routing now largely dominates overall power
consumption due to longer interconnect delay [17,18].
System-on-chip (SoC) integration of LDPC decoder intellectual property (IP) is another key design
constraint. Most baseband receiver SoCs typically operate at clock frequencies around 200MHz to meet
timing constraints across process, voltage, and temperature (PVT) corners [19]. Since an LDPC decoder
must ultimately integrate within a larger SoC with a system-level bit error rate (BER) target, the clock
frequency should realistically be constrained to be around 200MHz, and the number of decoding itera-
tions should not be reduced beyond a threshold where error-correction performance starts to degrade.
However, several published pre-silicon and post-silicon implementations achieve multi-Gb/s throughput
with clock rates beyond 400MHz and/or with a reduced number of decoding iterations [20–22]. For
optimal cost, reliability, and testability, the decoder IP should also be implemented using standard
CMOS technology. Under these conditions, achieving multi-Gb/s decoding throughput is a challenge
using traditional architectural and scheduling techniques, especially for short block-length codes like
those defined in the IEEE 802.11ad (WiGig) and IEEE 802.15.3c (WPAN) standards [23,24]. Figure 1.1
presents a block diagram of a wireless SoC with the LDPC decoding block highlighted in the physical
layer baseband, while Fig. 1.2 visualizes the state of the art of silicon-based LDPC decoders for modern
communication standards in terms of power and decoded data throughput over multiple standards with
different LDPC code block lengths and performance requirements [19,25–36]. Figure 1.2 also illustrates
that the design space for integrated circuit implementation of LDPC decoders is multi-dimensional,
where in addition to power and throughput, constraints also include the CMOS technology node, core
area, and clock frequency.
The motivation of this thesis is to address the high power consumption of unstructured interconnect
in multi-Gb/s LDPC decoders by exploiting the structure of LDPC parity-check matrices to reduce
wiring complexity through new architectural techniques that scale to sub-10nm CMOS technology nodes.
Chapter 3 presents a new, frame-interleaved LDPC decoder architecture with a reformulated message-
passing schedule that reduces interconnect complexity and routing logic overhead, by exploiting the
spatial locality of stored messages in neighboring processing nodes. The architecture is scalable by
design, supports multiple code rates, and achieves multi-Gb/s throughput while operating at clock
rates below 200MHz. The IEEE 802.11ad standard for the 60GHz wireless millimeter-wave band is
used as a vehicle to demonstrate the application of the proposed architecture, due to its multi-Gb/s
throughput specification over four code rates, and quasi-cyclic LDPC parity-check matrix structure [23].
Quasi-cyclic codes are used to illustrate the general approach; however, the proposed architecture and
decoding schedule can also be extended to non-quasi-cyclic codes.
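To make the quasi-cyclic structure concrete, the following minimal Python sketch expands a small base matrix of shift values into a binary parity-check matrix built from cyclically shifted q × q identity sub-matrices. The base matrix, the expansion factor q = 4, and the function names are illustrative assumptions and are not the IEEE 802.11ad matrices. A shift value of −1 is assumed here to denote an all-zero sub-matrix, and the shift argument is the number of columns shifted (so a shift of 0 is the unshifted identity).

```python
def shifted_identity(q, shift):
    """Return a q x q identity matrix cyclically shifted right by `shift` columns."""
    return [[1 if col == (row + shift) % q else 0 for col in range(q)]
            for row in range(q)]


def zero_block(q):
    """Return a q x q all-zero sub-matrix."""
    return [[0] * q for _ in range(q)]


def expand_base_matrix(base, q):
    """Expand a base matrix of shift values into a binary QC parity-check matrix.

    Each non-negative entry becomes a shifted identity block; each negative
    entry (an assumed convention) becomes an all-zero block.
    """
    expanded = []
    for base_row in base:
        blocks = [zero_block(q) if s < 0 else shifted_identity(q, s)
                  for s in base_row]
        for r in range(q):
            # Concatenate row r of every block in this block-row.
            expanded.append([bit for block in blocks for bit in block[r]])
    return expanded


# Hypothetical 2 x 3 base matrix with expansion factor q = 4.
BASE = [[0, 1, -1],
        [2, -1, 3]]
Q = 4
H = expand_base_matrix(BASE, Q)

for row in H:
    print("".join(str(bit) for bit in row))
```

Because every non-zero block is a permutation of the identity, the full matrix is determined by one shift value per block, which is the kind of regularity a structured decoder can exploit when mapping messages to memory.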
A proof-of-concept application-specific integrated circuit (ASIC) was fabricated in a 28nm CMOS
technology, and tested at-speed on an automated SoC tester. The LDPC decoder occupies an area of
3.41mm² and achieves a throughput of 6.78Gb/s with a maximum latency of 0.793µs at 10 decoding
iterations for all four code rates that share a uniform block length of 672 bits, while operating at a
0.9V supply and 202MHz clock. The decoder consumes between 104mW and 279mW of power at a
target BER of 10⁻⁶ for the rate 13/16 and rate 1/2 codes, respectively. This corresponds to energy
efficiencies between 15pJ/bit and 41pJ/bit, demonstrating that low-power performance is achievable
with a low clock rate in a standard bulk CMOS technology for maximum SoC integration capability.
The performance of this work in comparison to previously published silicon-based decoders is plotted in
Fig. 1.2, and discussed in greater detail in Chapter 3. It is noted here that the purpose of Fig. 1.2 is
not to distinguish this work, but rather to illustrate the multi-dimensional design space of silicon-based
LDPC decoders with varying constraints.
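The energy-efficiency figures quoted above follow from dividing the measured power by the decoded throughput. The short Python sketch below reproduces that arithmetic from the numbers in this section; the function name is an illustrative assumption.

```python
def energy_per_bit_pj(power_w, throughput_bps):
    """Energy per decoded bit in picojoules: power (W) / throughput (bit/s)."""
    return power_w / throughput_bps * 1e12


THROUGHPUT_BPS = 6.78e9  # decoded throughput reported for all four code rates

# Measured power for the two extreme code rates reported in the text.
for label, power_w in (("rate 13/16", 0.104), ("rate 1/2", 0.279)):
    pj = energy_per_bit_pj(power_w, THROUGHPUT_BPS)
    print(f"{label}: {pj:.1f} pJ/bit")
```

The two results round to the 15pJ/bit and 41pJ/bit values stated in the paragraph.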
1.2 LDPC Decoding for Quantum Key Distribution
LDPC codes have recently shown great promise in a forward error-correction context in QKD, where
two remote parties, Alice and Bob, attempt to construct a symmetric secret key by communicating
over a private quantum channel and an authenticated classical public channel. However, the speed at
which Alice and Bob can exchange secret keys is currently limited by the computational complexity of
post-processing algorithms for key reconciliation.
Quantum key distribution, also referred to as quantum cryptography, offers unconditional security
between two remote parties that employ one-time pad encryption to encrypt and decrypt messages using
a shared secret key, even in the presence of an eavesdropper with infinite computing power and math-
ematical genius [37–40]. Unlike classical cryptography, quantum cryptography allows the two remote
parties, Alice and Bob, to detect the presence of an eavesdropper, Eve, while also providing future-proof
security against brute force, key distillation attacks that may be enabled through quantum comput-
ing [41]. Today’s public key exchange schemes such as Diffie-Hellman and encryption algorithms like
RSA respectively rely on the computational hardness of solving the discrete log problem and prime
factorization [42, 43]. Both of these problems, however, can be solved in polynomial time by applying
Shor’s algorithm on a quantum computer [44–46].
Figure 1.1: Wireless SoC showing LDPC decoder IP block within the physical layer baseband, and auxiliary circuits and systems.
[Plot: decoded data throughput (Gb/s) vs. power (mW).]
Figure 1.2: Throughput vs. power comparison of silicon-based LDPC decoders for multiple standards with different throughput, latency, and error-correction performance requirements. The plot legend indicates the decoding standard and block length for each implementation. Annotated values in parentheses indicate the CMOS technology node, clock frequency, and decoder core area for each implementation [19,25–36].
While quantum computing remains speculative, QKD systems have already been realized in several
commercial and research settings worldwide [47–50]. Figure 1.3 presents two different protocols for
generating a symmetric key over a quantum channel: (1) discrete-variable QKD (DV-QKD) where
Alice encodes her information in the polarization of single-photon states that she sends to Bob, or (2)
continuous-variable QKD (CV-QKD) where Alice encodes her information in the amplitude and phase
quadratures of coherent states [40]. In DV-QKD, Bob uses a single-photon detector to measure each
received quantum state, while in CV-QKD, Bob uses homodyne or heterodyne detection techniques
to measure the quadratures of light [40]. While DV-QKD has been experimentally demonstrated up
to a distance of 404km [51], the cryogenic temperatures required for single-photon detection at such
extreme distances present a challenge for widespread implementation [40]. CV-QKD systems on the
other hand can be implemented using standard, cost-effective detectors that are routinely deployed in
classical telecommunications equipment that operates at room temperature [40]. The majority of QKD
research focuses on applications over optical fiber, since quantum signals for both CV- and DV-QKD
can be multiplexed over classical telecommunications traffic in existing fiber-optical networks [52–54].
Nevertheless, there has also been recent progress in chip-based, free-space, and Earth-to-satellite QKD
applications [55–57]. It is noted here that quantum cryptography, i.e., QKD, differs from post-quantum
cryptography, which is an evolving area of research that studies public-key encryption algorithms that
are believed to be secure against an attack by a quantum computer [58]. The discussion of post-quantum
cryptography is beyond the scope of this thesis.
The motivation of this thesis is to address the two key challenges that remain in the practical
implementation of CV-QKD over optical fiber: (1) to extend the distance of secure communication
beyond 100km with protection against collective Gaussian attacks, and (2) to increase the computational
throughput of the key reconciliation (error correction) algorithm in the post-processing step such that
the maximum achievable secret key rate remains limited only by the fundamental physical parameters
of the optical equipment at long distances [12, 59, 60]. There are two limitations to the speed of key
reconciliation. The first is the secret key rate, which is fundamentally limited by the transmittance of
the lossy optical channel and is measured in bits/pulse [61]. The second is the rate of computational
throughput from the hardware implementation, measured in bits/second [60]. To compare the two rates,
we normalize the secret key rate to bits/second by choosing a realistic CV-QKD pulse sampling rate of
frep = 1MHz [59,62]. While secure QKD networks can be built using trusted and untrusted intermediate
nodes, the long-distance reconciliation problem is motivated by the following two key reasons: (1) each
intermediate node introduces additional vulnerability, and (2) implementing efficient quantum repeaters
remains a challenge [40]. Jouguet and Kunz-Jacques showed that Mbit/s error-correction decoding of
multi-edge LDPC codes is achievable for distances up to 80km [60], while Huang et al. recently showed
that the distance could be extended to 100km by controlling excess system noise [63]. This thesis explores
high-speed LDPC decoding for CV-QKD beyond 100km.
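The two quantities that set the scale of the problem above, the fiber transmittance and the key rate normalized to bits/second, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the transmittance is taken to follow the standard fiber-loss relation T = 10^(−αℓ/10) with α = 0.2dB/km from the List of Symbols, and the function names are hypothetical. The example key rate of 4.10×10⁻⁷ bits/pulse is the finite-size value reported later in this chapter for 160km.

```python
def transmittance(distance_km, alpha_db_per_km=0.2):
    """Fiber transmittance T for a given distance, assuming T = 10^(-alpha*l/10)."""
    return 10 ** (-alpha_db_per_km * distance_km / 10)


def key_rate_bits_per_second(key_rate_bits_per_pulse, f_rep_hz=1e6):
    """Normalize a secret key rate in bits/pulse to bits/second using f_rep."""
    return key_rate_bits_per_pulse * f_rep_hz


for distance_km in (80, 100, 160):
    print(f"{distance_km} km: T = {transmittance(distance_km):.2e}")

# Finite-size secret key rate reported for 160 km, normalized with f_rep = 1 MHz.
print(f"{key_rate_bits_per_second(4.10e-7):.2f} bits/second")
```

Since the transmittance falls by a factor of 10 for every 50km at 0.2dB/km, the key rate available per pulse shrinks rapidly with distance, which is why the decoder throughput is compared against the normalized key rate rather than against a fixed target.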
A particular challenge in implementing long-distance CV-QKD is the low signal-to-noise ratio (SNR)
of the optical quantum channel, which typically operates below −15dB. At such low SNR, high-efficiency
key reconciliation can be achieved only using low-rate codes with large block lengths on the order of
10⁶ bits [67,68], where approximately 98% of the bits are redundant parity bits that must be discarded
after error-correction decoding. The reconciliation efficiency is a measure of how close the code operates
to the Shannon limit at a particular SNR. In order to maximize the secret key rate and reconciliation
distance, the error-correcting code must achieve a high reconciliation efficiency and high error-correction
Figure 1.3: Information transmission over untrusted quantum channel and authenticated public channel between Alice and Bob for CV- and DV-QKD, with eavesdropper Eve.
[Plot: LDPC decoder throughput (Mb/s, logarithmic scale) vs. distance (km).]
Figure 1.4: Throughput vs. distance of GPU-based LDPC decoders for CV- and DV-QKD. The reported throughput is the raw GPU throughput without code- or error-rate scaling. For CV-QKD implementations [12, 60, 64], the annotated values in parentheses indicate the LDPC code block length n, the code rate R, the reconciliation efficiency β, and SNR of the quantum channel. For DV-QKD implementations, the annotated values indicate the block length n, code rate R, and QBER [48, 65, 66]. By convention in QKD, the SNR is reported in linear units.
performance with low frame error rate (FER). Jouguet et al. previously explored multi-edge LDPC codes
for long-distance reconciliation due to their near-Shannon limit performance with low-rate codes; however,
such codes require hundreds of LDPC decoding iterations to achieve asymptotic error-correction
performance [11, 59, 60]. This is in contrast to LDPC codes employed in modern communication stan-
dards, such as IEEE 802.11ac (Wi-Fi) and ETSI DVB-S2X, where the target SNR is above 0dB and
block lengths range from 648 bits to 64,800 bits [9,69]. In these standards, the LDPC decoder typically
operates at 10 iterations to deliver Gbit/s decoding throughput [20, 31, 70]. Long block lengths allow
Alice and Bob to generate longer secret keys, which can be used to provide unconditional security by employing
the one-time pad encryption scheme. Shorter codes with block lengths of 10⁵ bits, for instance,
would not be suitable for low-SNR channels beyond 100km due to their less robust error-correction per-
formance [10,11]. In addition to long block-length codes, key reconciliation over multiple dimensions has
also been shown to improve error-correction performance of multi-edge codes at low SNR [59], thereby
increasing both the secret key rate and distance. However, the computational complexity and latency
of decoding random LDPC parity-check matrices with block lengths on the order of 10⁶ bits remains a
challenge. Figure 1.4 presents a comparison of LDPC decoding throughput versus distance for several
state-of-the-art CV- and DV-QKD implementations, illustrating that high-throughput reconciliation at
long distances is achievable only with large block-length codes that approach the Shannon limit with
more than 90% efficiency for CV-QKD or less than 10% quantum bit error rate (QBER) for DV-QKD.
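As a rough numerical illustration of the reconciliation efficiency discussed above, the Python sketch below assumes the commonly used definition β = Rcode / C(s), with C(s) = ½·log₂(1 + s) for a real-valued Gaussian channel at linear SNR s. These definitions are not stated in this chapter and are assumptions of the sketch, as are the function names. The operating point (Rcode = 0.02 at −15.47dB) is taken from Table 1.1.

```python
import math


def snr_db_to_linear(snr_db):
    """Convert an SNR in dB to linear units (the convention used in QKD)."""
    return 10 ** (snr_db / 10)


def shannon_capacity(snr_linear):
    """Capacity in bits per channel use, assuming C(s) = 0.5 * log2(1 + s)."""
    return 0.5 * math.log2(1 + snr_linear)


def reconciliation_efficiency(code_rate, snr_linear):
    """Assumed definition: beta = code rate / Shannon capacity at this SNR."""
    return code_rate / shannon_capacity(snr_linear)


snr = snr_db_to_linear(-15.47)          # CV-QKD operating point from Table 1.1
beta = reconciliation_efficiency(0.02, snr)
print(f"linear SNR = {snr:.4f}, capacity = {shannon_capacity(snr):.5f}, beta = {beta:.3f}")
```

Under these assumptions the efficiency evaluates to roughly 0.99, which shows how little margin remains between the code rate and the capacity at this operating point.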
Chapter 4 introduces a new, quasi-cyclic (QC) code construction for multi-edge LDPC codes with
block lengths on the order of 10⁶ bits [71, 72]. Computational acceleration is achieved through an
optimized LDPC decoder design implemented on a state-of-the-art graphics processing unit (GPU).
When combined with an 8-dimensional reconciliation scheme, the LDPC decoder achieves a raw decoding
throughput of 1.72Mbit/s and an information throughput of 7.16Kbit/s using an NVIDIA GeForce
GTX 1080 GPU at a maximum distance of 160km with a secret key rate of 4.10×10⁻⁷ bits/pulse when
finite-size effects are considered. The performance of this work in comparison to previous GPU-based
decoders for QKD is plotted in Fig. 1.4, and discussed in greater detail in Chapter 4. This work extends
the previous maximum CV-QKD distance of 100km to 160km, while delivering between 1.07× and
8.03× higher decoded information throughput over the upper bound on the secret key rate for a lossy
channel [61]. These results show that LDPC decoding is no longer the computational bottleneck in
long-distance CV-QKD, and that the secret key rate remains limited only by the physical parameters of
the quantum channel and the latency of privacy amplification.
1.3 Thesis Roadmap: Investigation of Two Application Areas
This thesis examines decoder implementation techniques for two distinct application areas for LDPC
codes: (1) integrated circuits for baseband FEC systems in wireless SoCs, and (2) quantum cryptography.
The goals in each application are distinct. For integrated circuits, a multi-Gb/s decoder IP core should
integrate with existing blocks in an SoC and achieve low-power performance with acceptable BER.
The design should be scalable to future CMOS technology nodes where interconnect currently presents
integration complexity challenges. For long-distance QKD, the GPU-accelerated decoder should deliver
sufficient speedup over the secret key rate limits defined by the parameters of the quantum channel at
low SNR.
The operating SNR is the primary distinction between the two application areas presented in this
thesis. As illustrated in Fig. 1.5, the SNR for the wireless IEEE 802.11ad standard is around 6dB
and the LDPC decoder achieves an FER of approximately 10⁻⁴, whereas in long-distance CV-QKD,
the near Shannon-limit channel operates at an SNR of −15dB where the LDPC decoder can only
achieve an FER of approximately 8×10⁻¹. This key distinction in error-correction performance drives
algorithmic and LDPC code design considerations, such as selecting the most appropriate variant of the
belief propagation algorithm, the LDPC code block length, and the code rate. The power budget and
throughput requirements then drive the implementation platform consideration. Figure 1.6 shows the
distinction in information throughput with respect to the LDPC code block length for the CV-QKD and
Wireless SoC application areas after redundant bits and erroneously decoded frames are discarded.
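The scaling from raw decoding throughput to information throughput can be sketched as follows. This minimal Python example assumes that the information throughput is the raw throughput multiplied by the code rate and by the fraction of frames decoded without error, (1 − FER); the function name is hypothetical, and the input values are those listed in Table 1.1.

```python
def information_throughput(raw_bps, code_rate, fer):
    """Throughput after discarding parity bits and erroneously decoded frames."""
    return raw_bps * code_rate * (1.0 - fer)


# IEEE 802.11ad wireless SoC: rate 1/2 code at a target FER of 1e-4.
wireless_bps = information_throughput(6.78e9, 0.5, 1e-4)

# Long-distance CV-QKD: rate 1/50 code at a target FER of 0.8.
cv_qkd_bps = information_throughput(1.724e6, 0.02, 0.8)

print(f"Wireless SoC: {wireless_bps:.3e} bit/s")
print(f"CV-QKD:       {cv_qkd_bps:.3e} bit/s")
```

With these inputs the wireless result matches the 3.39×10⁹ bit/s listed in Table 1.1. The CV-QKD result lands slightly below the tabulated 7.16×10³ bit/s, which would correspond to an FER marginally lower than the rounded target of 0.8.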
The error-correction performance, block length, and information throughput requirements present
a unique set of decoder implementation challenges and considerations for each application. Table 1.1
compares the use case, binary parity-check matrix structure, LDPC code performance, and decoder
implementation performance for the two application areas investigated in this thesis. Although both
applications communicate over a Gaussian channel, the low SNR operating point of CV-QKD requires
belief propagation decoding via the Sum-Product algorithm with a maximum of 500 iterations, while the
relatively high SNR operating point of a wireless IEEE 802.11ad channel allows for reduced-complexity
decoding via the Min-Sum algorithm at a maximum of 10 decoding iterations. For the remainder of this
thesis, the SNR and the SNR per bit, Eb/N0, are defined as follows:
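\[ \mathrm{SNR} = \frac{1}{\sigma^2} = 2 R_{code} \left( \frac{E_b}{N_0} \right) \tag{1.1} \]
\[ \frac{E_b}{N_0} = \frac{1}{2 \sigma^2 R_{code}}, \tag{1.2} \]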
where σ² is the variance of a zero-mean Gaussian channel, and Rcode is the code rate, i.e., the ratio of the
number of information bits that remain after the redundant parity bits are discarded to the
total length of the decoding block.
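Equations (1.1) and (1.2) can be checked numerically. The Python sketch below converts an SNR in dB to the SNR per bit and to the noise variance, using only the two definitions above; the function names are illustrative assumptions. The CV-QKD operating point from Table 1.1 (−15.47dB at Rcode = 0.02) is used as the example.

```python
import math


def ebn0_db_from_snr_db(snr_db, code_rate):
    """Eb/N0 in dB from Eq. (1.1): SNR = 2 * Rcode * (Eb/N0)."""
    return snr_db - 10 * math.log10(2 * code_rate)


def noise_variance(snr_db):
    """Noise variance from Eq. (1.1): SNR = 1 / sigma^2 (SNR in linear units)."""
    return 1.0 / (10 ** (snr_db / 10))


snr_db = -15.47
code_rate = 0.02
print(f"Eb/N0   = {ebn0_db_from_snr_db(snr_db, code_rate):.2f} dB")
print(f"sigma^2 = {noise_variance(snr_db):.2f}")
```

For a rate-0.02 code the factor 2·Rcode is 0.04, so Eb/N0 sits about 14dB above the SNR, which reproduces the −1.49dB entry listed for CV-QKD in Table 1.1.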
Long-distance CV-QKD beyond 100km requires 98% code redundancy with a very low code rate of
Rcode = 0.02 and long block length of 10⁶ bits in order to approach the Shannon limit at low SNR,
while wireless IEEE 802.11ad communication over 1-meter line-of-sight links requires only 50% code
redundancy with Rcode = 0.5 and a shorter block length of 672 bits. Since the decoding latency and
block length of IEEE 802.11ad are several orders of magnitude smaller than in long-distance CV-QKD,
the decoded information throughput is several orders of magnitude higher, in the Gigabit/s regime,
as opposed to Megabit/s for CV-QKD. Nevertheless, a silicon-based decoder implementation for IEEE
802.11ad has to achieve high energy efficiency, i.e., low power performance, in order to prolong battery
life when integrated in a wireless SoC fabricated in a standard CMOS technology for a mobile device.
Although highly-customizable ASICs provide excellent energy efficiency, the silicon implementation
of an LDPC decoder for long-distance CV-QKD with an LDPC code block length of 10⁶ bits would
require significant silicon die area, which may be prohibitively expensive to fabricate in a modern CMOS
technology node [13]. Moreover, ASICs suffer from fixed-point computational precision, limited memory,
and highly complex routing. While modern field-programmable gate arrays (FPGAs) offer floating-point
computational cores, the logic requirements of an LDPC decoder with a block length of 10⁶ bits may
exceed the capacity of on-chip FPGA logic blocks and switch-based routing needed to fully
place-and-route the design with strict timing constraints [15, 73]. GPUs on the other hand are a highly
[Plot: frame error rate (FER, logarithmic scale) vs. SNR (dB), with one curve each for CV-QKD and Wireless SoC.]
Figure 1.5: FER vs. SNR for IEEE 802.11ad Wireless SoC and CV-QKD applications.
[Plot: data throughput (bits/second, logarithmic scale) vs. block length (bits, logarithmic scale), with one point each for Wireless SoC and CV-QKD.]
Figure 1.6: Data throughput with FER and code-rate scaling vs. block length for IEEE 802.11ad wireless SoC and CV-QKD applications.
suitable platform for LDPC decoder implementation in CV-QKD systems due to their low cost, and high
availability of on-chip memory, floating-point computational precision, and architectural flexibility, which
allows for shorter development time [74,75]. Since Alice and Bob are stationary and their communication
occurs over a fixed-length fiber-optic cable, the traditional optimization parameters of energy efficiency
and silicon chip area do not necessarily apply since the LDPC decoder does not need to assume an
integrated circuit form factor. Furthermore, GPUs seamlessly integrate into a post-processing computer
system, and provide increasing computational performance at low cost with each successive architecture
generation [76].
The LDPC decoder implementations presented in Chapters 3 and 4 of this thesis holistically consider
the constraints outlined in Table 1.1.
Table 1.1: Comparison of LDPC decoder performance requirements for an IEEE 802.11ad wireless SoC IP core vs. long-distance CV-QKD key reconciliation block

Specification                     IEEE 802.11ad Wireless SoC IP           Long-Distance CV-QKD
Use Case
  Transmission Medium             Free Space                              Optical Fiber
  Power Source                    Battery                                 Grid
  Distance                        1m                                      100-200km
  Data Rate                       10Gb/s                                  1Mb/s
  Channel Type                    Gaussian                                Gaussian
Binary Parity-Check Matrix
  Number of Rows                  336                                     987,840
  Number of Columns               672                                     1,008,000
  Number of Connections           1890                                    3,363,885
LDPC Code Performance
  Block Length (Bits)             672                                     1.008×10^6
  Code Rate                       1/2                                     1/50
  Decoding Algorithm              Min-Sum                                 Sum-Product
  Maximum Decoding Iterations     10                                      500
  SNR (dB)                        6.76                                    -15.47
  SNR Per Bit Eb/N0 (dB)          5.5                                     -1.49
  Target FER                      10^-4                                   0.8
Decoder Implementation Performance
  Platform                        ASIC                                    GPU
  CMOS Technology Node            28nm                                    16nm
  Decoding Throughput (bit/s)     6.78×10^9                               1.724×10^6
  Information Throughput (bit/s)  3.39×10^9                               7.16×10^3
  Latency (µs)                    0.793                                   1296
  Power (W)                       0.104 (1)                               180 (2)
  Key Implementation Challenges   Energy efficiency, SoC IP integration   Frame error rate, throughput
  Performance Bottleneck          Interconnect wiring                     Memory access latency

(1) Measured power of the test chip fabricated in this thesis in 28nm CMOS technology.
(2) Thermal design power of the NVIDIA GeForce GTX 1080 GPU.
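A few of the entries in Table 1.1 can be cross-checked with simple arithmetic. The sketch below is only an illustration using values copied from the table; the variable names are hypothetical, and the information throughput is computed from the code rate alone, ignoring any FER scaling.

```python
# Values copied from Table 1.1 (IEEE 802.11ad column unless noted otherwise).
rows, cols, edges = 336, 672, 1890
decoding_throughput = 6.78e9   # bit/s
code_rate = 1 / 2
power = 0.104                  # W, measured test chip power

# Fraction of non-zero entries in the parity-check matrix ("low density").
density = edges / (rows * cols)

# Information throughput from the code rate only (FER term ignored here).
info_throughput = decoding_throughput * code_rate

# Energy per decoded bit in joules.
energy_per_bit = power / decoding_throughput

# The CV-QKD matrix is far sparser, even though it has many more connections.
qkd_density = 3_363_885 / (987_840 * 1_008_000)

print(f"802.11ad matrix density: {density:.4%}")
print(f"CV-QKD matrix density:   {qkd_density:.6%}")
print(f"Information throughput:  {info_throughput:.3e} bit/s")
print(f"Energy per bit:          {energy_per_bit * 1e12:.1f} pJ/bit")
```

The energy per bit obtained this way is consistent with the 15pJ/bit figure quoted in the abstract.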
As shown in Table 1.1, the key distinction between the two application areas from an LDPC decoder
perspective is the size of the binary parity-check matrix that defines the LDPC code. The largest parity-
check matrix defined in the IEEE 802.11ad standard has only 1890 node connections between iterative
processing groups, while the parity-check matrix for CV-QKD has over 3.3 million node connections.
The number of node connections directly affects the decoding latency, implementation complexity, power,
and throughput. This thesis explores techniques for exploiting the intrinsic structure of LDPC parity-
check matrices for the integrated circuit and QKD application areas in order to reduce decoding latency,
complexity, and power, while maximizing throughput. The ideas presented herein are scalable to codes
with longer block lengths, beyond those defined in Table 1.1, for both wireless and CV-QKD applications.
The implementation results presented in Chapters 3 and 4 of this thesis provide insights into possible
directions for future LDPC decoder implementations, by leveraging the computational acceleration,
integration, and scalability benefits offered by exploiting the structure of LDPC parity-check matrices.
The remainder of this thesis is organized as follows. Chapter 2 presents the background on LDPC
codes and QKD. Chapter 3 describes a new frame-interleaved LDPC decoder architecture with a path-
unrolled message-passing schedule for integrated circuit applications, and presents the measurement
results of the fabricated proof-of-concept silicon test chip. Chapter 4 introduces a quasi-cyclic parity-
check matrix construction and GPU-based decoder implementation for multi-edge LDPC codes with
application in key reconciliation for long-distance CV-QKD. Chapter 5 concludes the thesis and presents
some future research directions.
Chapter 2
Background
This chapter first introduces the fundamentals of LDPC codes, outlines the belief propagation decoding
algorithm, and describes the challenges in traditional LDPC decoder architectures that target CMOS
implementation. The chapter then explores the application of LDPC codes for secret key reconciliation
in QKD, by first presenting some preliminaries on QKD, and then introducing multi-edge codes for
reconciliation at low SNR. This chapter provides the background for the integrated circuit and long-
distance CV-QKD application areas discussed in Chapters 3 and 4, respectively. LDPC encoding is not
described in this thesis, since the complexity of encoding is relatively low compared to decoding.
2.1 Forward Error Correction with LDPC Codes
The forward error correction procedure is presented in Fig. 2.1 with a simplified channel model. A binary
message $S$ of length $k$ bits is first encoded using a known parity-check matrix $H$. The encoding produces a
noiseless codeword $C$ of length $n$ bits by appending $(n-k)$ computed parity bits to $S$. The LDPC-encoded
codeword $C$ is then transmitted over a binary-input additive white Gaussian noise channel (BIAWGNC)
with zero mean and noise variance $\sigma^2$, such that the received soft-decision vector $R$ is described by
$R_v = (-1)^{C_v} + N_v$ for $v = 1, 2, \ldots, n$, where $N_v \sim \mathcal{N}(0, \sigma^2)$ represents the Gaussian
noise. LDPC decoding is performed using the known parity-check matrix $H$ to produce an estimate $\hat{C}$
of the original transmitted codeword $C$, where each binary hard decision $\hat{C}_v \in \{0, 1\}$ for $v = 1, 2, \ldots, n$.
By discarding the $(n-k)$ parity bits from the frame, an estimate $\hat{S}$ of the original message $S$ is obtained.
LDPC decoding is successful if $\hat{C} = C$; otherwise, a frame error is said to have occurred, where one or
more bits in the frame are in error.
Figure 2.1: Simplified model of a BIAWGNC with LDPC encoding and decoding. (Block diagram labels: Original Message S, LDPC Encoder (H), codeword C, AWGN Channel (SNR), received vector R, LDPC Decoder (H), estimates Ĉ and Ŝ, Decoded Message.)
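The channel model of this section can be sketched in a few lines of Python. This is a minimal illustration, not the simulation code used in this thesis: the function names, the fixed random seed, the example codeword, and the 6 dB operating point are all assumptions, and the relation SNR = 1/σ² is assumed for unit-energy symbols.

```python
import math
import random

def bpsk_awgn_llrs(codeword, sigma, rng):
    """Send bits over a BIAWGNC: R_v = (-1)^C_v + N_v, then form LLRs 2*R_v/sigma^2."""
    received = [(-1) ** c + rng.gauss(0.0, sigma) for c in codeword]
    llrs = [2.0 * r / sigma ** 2 for r in received]
    return received, llrs

def hard_decision(llrs):
    """A non-negative LLR maps to bit 0, a negative LLR maps to bit 1."""
    return [0 if l >= 0 else 1 for l in llrs]

rng = random.Random(1)                # fixed seed (assumption, for repeatability)
codeword = [0, 1, 1, 0, 1, 0]         # illustrative bits, not a real LDPC codeword
snr_db = 6.0                          # hypothetical operating point
sigma = math.sqrt(1.0 / 10 ** (snr_db / 10))   # assumes SNR = 1/sigma^2
received, llrs = bpsk_awgn_llrs(codeword, sigma, rng)
print(hard_decision(llrs))
```

The LLR values, rather than the hard decisions, are what the belief propagation decoder consumes as its input.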
2.2 LDPC Codes: A Class of Linear Block Codes
LDPC codes are a class of linear block codes defined by a sparse parity-check matrix $H$ of size
$(n-k) \times n$, $k \le n$, with block length $n$ and code rate $R_{\text{code}} = k/n$ [8, 77]. An equivalent definition of an
LDPC code is given by its Tanner graph, a bipartite graph in which the non-zero entries of the binary
parity-check matrix $H$ define the edge connections between independent vertex sets known as check
nodes (CNs) and variable nodes (VNs) [78]. As shown in Fig. 2.2, CNs and VNs correspond to the rows
and columns of $H$, respectively, and an edge between CN $c_i$ and VN $v_j$ belongs to the Tanner graph $G$ if and
only if $H(i, j) = 1$.
Figure 2.2: Tanner graph and corresponding binary parity-check matrix with block length of n = 6 bits, k = 3 information bits, n = 6 variable nodes, and (n − k) = 3 check nodes. (Panels: Tanner graph with check nodes c1 to c3 and variable nodes v1 to v6; binary parity-check matrix H with rows c1 to c3 and columns v1 to v6.)
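The correspondence between a parity-check matrix and its Tanner graph can be illustrated with a short Python sketch. The 3 × 6 matrix below is hypothetical and is not claimed to be the matrix of Fig. 2.2; only its first row is chosen so that c1 connects to v1, v2, and v4. Indices in the code are zero-based, and the function name is an assumption.

```python
# Hypothetical 3 x 6 parity-check matrix (rows = check nodes, columns = variable nodes).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
]

def tanner_neighbors(H):
    """Return the neighbor sets of the Tanner graph: N[c] (VNs of CN c), M[v] (CNs of VN v)."""
    n_checks, n_vars = len(H), len(H[0])
    N = {c: [v for v in range(n_vars) if H[c][v] == 1] for c in range(n_checks)}
    M = {v: [c for c in range(n_checks) if H[c][v] == 1] for v in range(n_vars)}
    return N, M

N, M = tanner_neighbors(H)
num_edges = sum(len(vs) for vs in N.values())
print("N(c):", N)
print("M(v):", M)
print("edges:", num_edges)
```

Each non-zero entry of H contributes exactly one edge, so the edge count equals the number of ones in the matrix.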
2.3 LDPC Decoding: Belief Propagation Algorithms
LDPC decoding is performed using belief propagation, an iterative message-passing algorithm commonly
used to perform inference on graphical models such as factor graphs [79]. LDPC decoding attempts to
converge on a valid codeword by iteratively exchanging probabilistic updates between check and variable
nodes along the edges of the Tanner graph until the parity-check condition is satisfied, i.e., a valid
codeword has been found, or the maximum number of iterations is exhausted.
The Sum-Product algorithm is the most common variant of belief propagation [79], and is described in
Algorithm 1 with a flooding schedule where CNs and VNs pass message updates between their connected
neighbors once per iteration to generate a codeword estimate $\hat{C}$. In Algorithm 1, Step 1 prepares the
$Q_v$ log-likelihood ratio (LLR) input values at each VN $v$ based on the channel noise variance $\sigma^2$. All
VN-to-CN messages from VN $v$ are initialized to the received channel LLR $Q_v$ before the first
message-passing iteration. Steps 2 to 5 specify the message-passing interaction between the CNs and VNs until
the codeword syndrome defined by $\hat{C}H^\top$ is equal to zero, or the maximum predetermined number of
decoding iterations is reached. In Step 2, $m_{cv}^{(i)}$ is the message from CN $c$ to VN $v$ in iteration $i$, and
$\Phi(x) = \Phi^{-1}(x) = -\ln(\tanh(x/2))$. In Step 3, $L_{vc}^{(i)}$ is the message from VN $v$ to CN $c$, and $L_v^{(i)}$ is the
updated LLR belief of bit $v$ in the frame, whose decision is given by $\hat{C}_v^{(i)}$ in Step 4. In Step 2, the set
of VNs connected to CN $c$ is defined as $N(c) = \{v \mid v \in \{1, 2, \ldots, n\} \wedge H_{cv} = 1\}$, where the notation
$v' \in N(c) \setminus v$ refers to all VNs in the set $N(c)$ excluding VN $v$. Similarly, in Step 3, the set of CNs
connected to VN $v$ is defined as $M(v) = \{c \mid c \in \{1, 2, \ldots, n-k\} \wedge H_{cv} = 1\}$, where $c' \in M(v) \setminus c$ refers
to all CNs in the set $M(v)$ excluding CN $c$. The syndrome $\hat{C}H^\top = 0$ if the parity check $p_c = 0$ at each
CN $c$, where $p_c$ is the XOR of all hard decision $\hat{C}_v$ bits from VNs connected to CN $c$ in the set $N(c)$:

$$p_c \leftarrow \hat{C}_{N(c)_1} \oplus \hat{C}_{N(c)_2} \oplus \cdots \oplus \hat{C}_{N(c)_{|N(c)|}} = \bigoplus_{v \in N(c)} \hat{C}_v. \tag{2.1}$$

Algorithm 1 Sum-Product with flooding schedule
Input: $R$, $\sigma^2$; Output: $\hat{C}$
Step 1: LLR initialization at each VN, $v = 1, 2, \ldots, n$
    $Q_v \leftarrow \ln\left(\frac{P(R_v \mid C_v = 0)}{P(R_v \mid C_v = 1)}\right) = \frac{2R_v}{\sigma^2}$ for BIAWGNC
    $L_{vc}^{(i=0)} \leftarrow Q_v, \ \forall c \in M(v)$ for first iteration
for Iteration $i = 1$ to Max Iterations do
    Step 2: Check node update at each CN, $c = 1, 2, \ldots, n-k$ (CN-to-VN messages)
        $\mathrm{sgn}(m_{cv}^{(i)}) \leftarrow \prod_{v' \in N(c) \setminus v} \mathrm{sgn}(L_{v'c}^{(i-1)})$
        $|m_{cv}^{(i)}| \leftarrow \Phi^{-1}\left(\sum_{v' \in N(c) \setminus v} \Phi\left(|L_{v'c}^{(i-1)}|\right)\right)$
        $m_{cv}^{(i)} \leftarrow \mathrm{sgn}(m_{cv}^{(i)}) \times |m_{cv}^{(i)}|$
    Step 3: Variable node update at each VN, $v = 1, 2, \ldots, n$ (VN-to-CN messages)
        $L_v^{(i)} \leftarrow Q_v + \sum_{c \in M(v)} m_{cv}^{(i)}$
        $L_{vc}^{(i)} \leftarrow Q_v + \sum_{c' \in M(v) \setminus c} m_{c'v}^{(i)} = L_v^{(i)} - m_{cv}^{(i)}$
    Step 4: Hard decision at each VN, $v = 1, 2, \ldots, n$
        $\hat{C}_v^{(i)} \leftarrow 0$ if $L_v^{(i)} \ge 0$; $1$ otherwise
    Step 5: Early termination (parity) check at each CN, $c = 1, 2, \ldots, n-k$
        if $\hat{C}H^\top = 0 \pmod 2$ then Terminate
end for
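The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal, unoptimized illustration rather than the decoder implemented in this thesis: the function names, the clamping range inside `phi`, the hypothetical 3 × 6 matrix, and the example LLR values are all assumptions. Indices are zero-based.

```python
import math

def phi(x):
    """Phi(x) = -ln(tanh(x/2)); the clamp is an assumption to avoid log(0) and overflow."""
    x = min(max(x, 1e-12), 50.0)
    return -math.log(math.tanh(x / 2.0))

def sum_product_decode(H, llr_in, max_iters=20):
    """Flooding-schedule Sum-Product decoding following Steps 1 to 5 of Algorithm 1."""
    m_rows, n = len(H), len(H[0])
    N = [[v for v in range(n) if H[c][v]] for c in range(m_rows)]
    M = [[c for c in range(m_rows) if H[c][v]] for v in range(n)]
    # Step 1: every VN-to-CN message starts at the channel LLR Q_v.
    L = {(v, c): llr_in[v] for v in range(n) for c in M[v]}
    c_hat = [0 if q >= 0 else 1 for q in llr_in]
    for _ in range(max_iters):
        # Step 2: check node update (CN-to-VN messages), excluding the target VN.
        m = {}
        for c in range(m_rows):
            for v in N[c]:
                sign, total = 1.0, 0.0
                for vp in N[c]:
                    if vp != v:
                        sign *= 1.0 if L[(vp, c)] >= 0 else -1.0
                        total += phi(abs(L[(vp, c)]))
                m[(c, v)] = sign * phi(total)
        # Step 3: variable node update (posterior LLR and VN-to-CN messages).
        post = [llr_in[v] + sum(m[(c, v)] for c in M[v]) for v in range(n)]
        for v in range(n):
            for c in M[v]:
                L[(v, c)] = post[v] - m[(c, v)]
        # Step 4: hard decision.
        c_hat = [0 if p >= 0 else 1 for p in post]
        # Step 5: early termination when every parity check is satisfied.
        if all(sum(c_hat[v] for v in N[c]) % 2 == 0 for c in range(m_rows)):
            break
    return c_hat

# Hypothetical example: codeword [1, 1, 0, 0, 1, 0] with one unreliable, wrong-sign LLR (bit 4).
H = [[1, 1, 0, 1, 0, 0], [0, 1, 1, 0, 1, 0], [1, 0, 0, 0, 1, 1]]
llrs = [-4.4, -3.6, 3.2, 4.8, 0.8, 3.6]
decoded = sum_product_decode(H, llrs)
print(decoded)
```

Dictionaries keyed by edge are used here for readability only; a practical decoder would store messages in arrays indexed by edge.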
The traditional Sum-Product algorithm achieves error-correction performance close to the theoretical
Shannon limit; however, it is not well-suited for hardware implementation due to the non-linearity of the
tanh(x) function [15]. The Min-Sum algorithm is often adopted as a suitable alternative in integrated
circuit decoder implementations as it does not require complex lookup tables and can be computed with
simple comparator circuits [15, 73, 77]. Algorithm 2 describes the Min-Sum decoding procedure with a
flooding schedule.
The Min-Sum algorithm can be modified to include scaling factors and offset coefficients to improve
decoding performance, as in the case of the Normalized Min-Sum and Offset Min-Sum algorithms,
respectively [80]. However, these parameters are strongly dependent on the channel SNR, code rate, and
fixed-point message quantization in the LDPC decoder. In order to avoid BER performance degradation,
these parameters should be adjusted dynamically during runtime to compensate for changes in SNR or
code rate. Such enhancements are beyond the scope of this thesis.
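The effect of the scaling and offset corrections on a single check node magnitude can be illustrated as follows. This sketch assumes the commonly used forms of the two corrections (multiplying the minimum by a factor, or subtracting an offset and clipping at zero); the function names, the input magnitudes, and the values α = 0.75 and β = 0.25 are hypothetical and are not parameters used in this thesis.

```python
import math

def phi(x):
    """Phi(x) = -ln(tanh(x/2)), as used in the Sum-Product check node update."""
    return -math.log(math.tanh(x / 2.0))

def sum_product_magnitude(mags):
    """Exact Sum-Product magnitude from the other VNs' message magnitudes."""
    return phi(sum(phi(m) for m in mags))

def normalized_min_sum(mags, alpha):
    """Normalized Min-Sum: scale the minimum magnitude by alpha (assumed 0 < alpha <= 1)."""
    return alpha * min(mags)

def offset_min_sum(mags, beta):
    """Offset Min-Sum: subtract beta from the minimum magnitude, clipping at zero."""
    return max(min(mags) - beta, 0.0)

mags = [0.8, 3.2, 1.5]                      # illustrative |L| values from the other VNs
exact = sum_product_magnitude(mags)
plain = min(mags)                           # unmodified Min-Sum
scaled = normalized_min_sum(mags, alpha=0.75)
shifted = offset_min_sum(mags, beta=0.25)
print(exact, plain, scaled, shifted)
```

The unmodified Min-Sum magnitude is never smaller than the Sum-Product magnitude, which is why both corrections reduce it.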
Algorithm 2 Min-Sum with flooding schedule
Input: $R$, $\sigma^2$; Output: $\hat{C}$
Step 1: LLR initialization at each VN, $v = 1, 2, \ldots, n$
    $Q_v \leftarrow \ln\left(\frac{P(R_v \mid C_v = 0)}{P(R_v \mid C_v = 1)}\right) = \frac{2R_v}{\sigma^2}$ for BIAWGNC
    $L_{vc}^{(i=0)} \leftarrow Q_v, \ \forall c \in M(v)$ for first iteration
for Iteration $i = 1$ to Max Iterations do
    Step 2: Check node update at each CN, $c = 1, 2, \ldots, n-k$ (CN-to-VN messages)
        $\mathrm{sgn}(m_{cv}^{(i)}) \leftarrow \prod_{v' \in N(c) \setminus v} \mathrm{sgn}(L_{v'c}^{(i-1)})$
        $|m_{cv}^{(i)}| \leftarrow \min_{v' \in N(c) \setminus v} |L_{v'c}^{(i-1)}|$
        $m_{cv}^{(i)} \leftarrow \mathrm{sgn}(m_{cv}^{(i)}) \times |m_{cv}^{(i)}|$
    Step 3: Variable node update at each VN, $v = 1, 2, \ldots, n$ (VN-to-CN messages)
        $L_v^{(i)} \leftarrow Q_v + \sum_{c \in M(v)} m_{cv}^{(i)}$
        $L_{vc}^{(i)} \leftarrow Q_v + \sum_{c' \in M(v) \setminus c} m_{c'v}^{(i)} = L_v^{(i)} - m_{cv}^{(i)}$
    Step 4: Hard decision at each VN, $v = 1, 2, \ldots, n$
        $\hat{C}_v^{(i)} \leftarrow 0$ if $L_v^{(i)} \ge 0$; $1$ otherwise
    Step 5: Early termination (parity) check at each CN, $c = 1, 2, \ldots, n-k$
        if $\hat{C}H^\top = 0 \pmod 2$ then Terminate
end for

The massive computational parallelism required to achieve high decoding throughput warrants the
integrated circuit implementation of LDPC decoders [81, 82]. Despite the prospect of near-capacity
error-correction performance, LDPC decoder hardware implementations must balance the trade-offs of
data throughput, power consumption, silicon chip area, and decoding latency, while operating within the
system-level constraints defined by the target BER, block length, and parity-check matrix connectivity
for multiple code rates. These trade-offs are controlled through architectural and algorithmic scheduling
considerations, which include the number of decoding iterations, message bit-width quantization,
partitioning of combinational logic and memory, and routing overhead complexity.
2.4 Quasi-Cyclic LDPC Codes
In a traditional LDPC decoder, CN and VN processing units iteratively exchange messages across an
interconnect network described by a Tanner graph, as shown in Fig. 2.2. Random parity-check matrices
introduce unstructured interconnect between the variable and check nodes, resulting in unordered mem-
ory access patterns and complex routing, which limit scalability in ASIC or FPGA implementations.
Architecture-aware codes were introduced to alleviate the hardware complexity in the design of both
LDPC encoders and decoders by imposing a highly-regular matrix structure with a sufficient degree of
randomness in the parity-check matrix to ensure adequate Euclidean distance between valid codewords
under maximum likelihood detection [83].
Quasi-cyclic (QC) LDPC codes are a popular class of such codes, where the parity-check matrix is
constructed from an array of q×q cyclically-shifted identity matrices or q×q zero matrices [72]. As shown
in Fig. 2.3, the tilings evenly divide the parity-check matrix H into n/q QC macro-columns and (n−k)/q
QC macro-rows. The expansion factor q in a QC parity-check matrix determines the trade-off between
decoder implementation complexity and error-correction performance. For a small expansion factor q,
the matrix exhibits a high degree of randomness, which improves error-correction performance, while
a large q reduces decoder complexity with some performance degradation. QC-LDPC codes provide
a highly-regular matrix structure, which can be exploited to simplify decoder architecture design and
implementation.
Figure 2.3: Sample quasi-cyclic binary parity-check matrix for q = 5 constructed from uniformly-sized (q × q), cyclically-shifted identity matrices and all-zero matrices. (Diagram labels: 6 QC macro-rows (6×5 check nodes); 10 QC macro-columns (10×5 variable nodes); expansion factor q = 5; k information bits and (n − k) parity bits across a block length of n bits (n variable nodes); blocks labelled I1 to I5 are q × q cyclically-shifted identity matrices, and the remaining blocks are q × q all-zero matrices. Sample cyclically-shifted identity matrix for q = 5: I3 = [00010; 00001; 10000; 01000; 00100].)
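The expansion of a QC base matrix into a full binary parity-check matrix can be sketched as follows. The shift convention (row r has its one in column (r + s) mod q) is inferred from the sample matrix I3 in Fig. 2.3; the function names and the small 2 × 4 base matrix are hypothetical, with `None` marking an all-zero block.

```python
def shifted_identity(q, s):
    """q x q identity matrix cyclically shifted by s columns: row r has a 1 in column (r + s) % q."""
    return [[1 if col == (row + s) % q else 0 for col in range(q)] for row in range(q)]

def zero_block(q):
    return [[0] * q for _ in range(q)]

def expand_qc(base, q):
    """Expand a base matrix of shift values (None = all-zero block) into a binary matrix H."""
    H = []
    for base_row in base:
        blocks = [zero_block(q) if s is None else shifted_identity(q, s) for s in base_row]
        for r in range(q):
            # Concatenate row r of every block in this macro-row.
            H.append([bit for block in blocks for bit in block[r]])
    return H

q = 5
base = [[3, None, 1, 0],      # hypothetical 2 x 4 base matrix of shift values
        [None, 2, 4, 1]]
H = expand_qc(base, q)
for row in shifted_identity(q, 3):
    print("".join(str(b) for b in row))
print(len(H), "x", len(H[0]))
```

Because only the shift values need to be stored, a decoder can describe the whole matrix with one small integer per q × q block.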
2.5 Silicon Integrated Circuits for LDPC Decoding
The silicon-based implementation of LDPC decoders was introduced in Chapter 1, with the motivation of
maximizing energy efficiency and throughput in modern CMOS technology nodes. This section outlines
the challenges of extending existing decoder architectures to modern CMOS technology nodes.
2.5.1 LDPC Decoder Architectures
As presented in Fig. 2.4, LDPC decoder architectures can be classified into the following three categories
based on their degree of computational parallelism: fully parallel, partially parallel, and serial.
Fully-parallel decoders achieve the highest throughput, but suffer from high power consumption
and large silicon area due to the large number of parallel processors and highly complex reconfigurable
Benes/Banyan switch-based networks or hard-wired interconnect [70,82,84].
Partially-parallel decoder architectures achieve higher energy- and area-efficiency at the expense of
lower throughput [83, 85]. Message updates are computed in a time-multiplexed approach by storing
intermediate updates in embedded memory and performing pipelined computations among partitioned
processing nodes whose level of parallelism is typically defined by the QC matrix expansion factor
q [83, 85]. Barrel shifters and large multiplexer trees are implemented to perform cyclic data rotation.
Partially-parallel architectures can further be classified as row-parallel or column-parallel depending on
the processing unit partitioning of the QC parity-check matrix.
Serial architectures implement a single processing unit with large memory, and are suitable only for
low-throughput applications where high latency is acceptable [86].
Figure 2.4: Examples of three LDPC decoder architectures showing message-passing networks, and configuration of CN and VN processing units. While exceptions exist, typically, fully-parallel architectures instantiate n VNs and (n − k) CNs, partially-parallel architectures instantiate a factor of q VNs and CNs, and serial architectures instantiate only 1 VN and 1 CN. (Panels: Fully Parallel; Partially Parallel, with Permutation Memory; Serial, with Global Memory.)
2.5.2 Flooding and Layered Decoding Schedules
All LDPC decoders execute the belief propagation algorithm based on either a flooding or layered
message-passing schedule [82].
In the flooding schedule, CNs and VNs perform subsequent rounds of updates and pass messages
between their connected neighbors once per iteration [82]. The flooding schedule is amenable to both
fully-parallel and partially-parallel architectures. Flooding decoders achieve high throughput with low
latency at the expense of area and routing complexity [87].
The layered schedule was introduced to improve decoder convergence by performing intermediate
LLR updates in the VNs in each iteration [88]. The layered schedule is amenable to partially-parallel
architectures with QC-LDPC codes. While the layered schedule generally requires fewer iterations, the
overall decoding latency is generally longer as additional clock cycles are required to perform intermediate
LLR updates. Successive message updates can also lead to dynamic range clipping/saturation, which
reduces error-correction performance if scaling circuits are not employed. Moreover, the data dependency
among layers results in memory access contention [20], even with sliced message-passing and overlapped
decoding schedule techniques [89,90]. Layered decoders thus require higher clock rates to achieve multi-
Gb/s throughput, as well as more complex permutation network control. As such, they are not optimal
for multi-Gb/s decoding with the goal of IP integration in a target system operating at a 200MHz clock
rate.
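The intermediate LLR updates that distinguish the layered schedule from the flooding schedule can be illustrated with a small software sketch. This is a generic layered decoder written for illustration only: it assumes that each check row forms one layer, that the Min-Sum check update is used, and that every check node has at least two neighbors. The function name, matrix, and LLR values are hypothetical.

```python
def layered_min_sum_decode(H, llr_in, max_iters=10):
    """Layered Min-Sum sketch: the LLR beliefs are updated after every layer (check row)."""
    m_rows, n = len(H), len(H[0])
    N = [[v for v in range(n) if H[c][v]] for c in range(m_rows)]
    post = list(llr_in)                                   # running LLR belief of each bit
    m = {(c, v): 0.0 for c in range(m_rows) for v in N[c]}
    for _ in range(max_iters):
        for c in range(m_rows):                           # process one layer at a time
            # Remove this layer's previous contribution from the beliefs.
            t = {v: post[v] - m[(c, v)] for v in N[c]}
            for v in N[c]:
                others = [t[u] for u in N[c] if u != v]
                sign = -1.0 if sum(x < 0 for x in others) % 2 else 1.0
                m[(c, v)] = sign * min(abs(x) for x in others)
                post[v] = t[v] + m[(c, v)]                # intermediate LLR update
        c_hat = [0 if p >= 0 else 1 for p in post]
        if all(sum(c_hat[v] for v in N[c]) % 2 == 0 for c in range(m_rows)):
            return c_hat
    return [0 if p >= 0 else 1 for p in post]

# Hypothetical example: codeword [1, 1, 0, 0, 1, 0] with a wrong-sign LLR on bit 4.
H = [[1, 1, 0, 1, 0, 0], [0, 1, 1, 0, 1, 0], [1, 0, 0, 0, 1, 1]]
llrs = [-4.4, -3.6, 3.2, 4.8, 0.8, 3.6]
decoded = layered_min_sum_decode(H, llrs)
print(decoded)
```

Each layer reads beliefs that earlier layers have already refined within the same iteration, which is the source of both the faster convergence and the data dependency between layers.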
2.5.3 Design Challenges: Message Permutation, Memory, and Power
Reducing interconnect congestion and minimizing reconfigurable routing logic overhead are key chal-
lenges in improving the energy efficiency of decoder implementations. Several variants of the split-row
technique have been shown to reduce silicon area and routing congestion in fully-parallel decoders [15],
however, they may be suitable only for single-rate codes with high row weights. The split-row decoding
algorithm splits each matrix row into multiple submatrices to perform localized semi-autonomous mini-
mum magnitude computations with a reduction of 0.3dB to 0.7dB in error-correction performance [15].
The half-row decoding schedule folds the traditional layered schedule computations in time to reduce
interconnect wiring [21], but suffers in throughput and introduces more complex data shuffling circuits.
Memory-based decoders enable frame-level pipelining to shorten routing wires and maximize hardware
utilization, but are similarly plagued with barrel shifting networks [91, 92]. Depending on the parity-
check matrix construction, row-merging techniques can be applied to improve decoding throughput [32],
but again suffer from reconfigurable routing logic overhead.
Since internal decoder memories are frequently updated, two refresh-free dynamic memory implemen-
tations have been proposed to improve the energy efficiency of decoder memory. Transistor-based, non-
refresh embedded dynamic random-access memory (eDRAM) can be used instead of embedded static
random-access memory (eSRAM) or flip-flop register files without special process requirements [31].
However, since eDRAM cells operate at a higher supply voltage than standard CMOS logic, an addi-
tional on-chip power grid may be required, thus further increasing SoC design complexity. In addition,
the retention time of eDRAM cells is reduced in newer technology nodes as device leakage becomes more
significant. Refresh-free dynamic standard-cell based memory that operates similar to domino logic can
reduce area overhead in comparison to static memory [93], however, the high threshold-voltage devices
required might affect timing closure and SoC integration complexity. Both the eDRAM and standard-cell
memory techniques are susceptible to PVT variations, which affect read/write timing.
Time-domain signal processing techniques have been explored to minimize power consumption by
reducing the critical path of the minimum magnitude calculation in the CN and summation operation
in the VN [36,94]. Time-domain processing uses digital-to-time converters to produce a staggered, time-
delayed set of inputs, which are then computed upon using simplified logic, where the output is then
converted back to the digital domain via a time-to-digital converter. However, such implementations
have limited feasibility in SoC integration due to back-end timing closure challenges as a result of
metastability and PVT variations in the delay chains and custom clock-tree networks.
While these approaches provide some reduction in overall energy consumption, the fundamental
architectures still rely on long, global, unstructured routing wires to connect processing units, which
continue to execute the same traditional flooding or layered decoding schedules.
2.5.4 Explicit Check and Variable Node Processing Units
The two-phase schedule implied by the belief propagation algorithm allows computations in the check
and variable update phases to be mapped directly to explicit CN and VN processing units. Figure 2.5
presents a dataflow diagram of Lvc and mcv messages between processing unit CN1 and its connected
VN processing units over two successive decoding iterations based on the Tanner graph connectivity in
Fig. 2.2. In each iteration, the explicit CN1 processing unit receives a unique Lvc message from each of
its connected VNs, and then calculates a unique mcv return message for each VN.
Figure 2.5: Explicitly defined processing units VN1, VN2, and VN4 connected to processing unit CN1 based on the Tanner graph in Fig. 2.2, with $L_{vc}$ and $m_{cv}$ messages indicated for decoding iterations $i$ and $i+1$.

This canonical algorithm-to-hardware mapping is adopted in most traditional LDPC decoder
implementations; however, it requires all $L_{vc}$ and $m_{cv}$ messages to be routed between the two processing
groups through a congested interconnect network, and also introduces routing permutation logic in both
CN and VN processing units. As a result, the complexity of both the interconnect network and
permutation logic scales with block length. The $m_{cv}$ computation in Step 2 of Algorithm 2 can be simplified
by calculating only the first and second minimum magnitudes from received $L_{vc}$ messages based on the
following Min-Sum simplification:

$$m_{cv} \leftarrow \begin{cases} (s_c \times \mathrm{sgn}(L_{vc})) \times \mathit{min2}_c, & \text{if } \mathit{min1}_c = |L_{vc}| \\ (s_c \times \mathrm{sgn}(L_{vc})) \times \mathit{min1}_c, & \text{otherwise} \end{cases} \tag{2.2}$$

$$s_c \leftarrow \prod_{v \in N(c)} \mathrm{sgn}(L_{vc}) \tag{2.3}$$

$$\mathit{min1}_c \leftarrow \min_{v \in N(c)} |L_{vc}| \tag{2.4}$$

$$\mathit{min2}_c \leftarrow \min_{v' \in N(c) \setminus v_{\mathit{min1}}} |L_{v'c}|. \tag{2.5}$$

However, high-order compare-select trees are still required to compute $m_{cv}$. Pipeline registers are
typically added to ease the critical path timing constraints in the compare-select trees and CN-to-VN routing
interconnect; however, the scalability of the explicitly partitioned architecture remains limited. New
architectural and scheduling techniques are therefore required to alleviate the global routing and energy
efficiency challenges.
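The first-and-second-minimum simplification of Eqs. 2.2 to 2.5 can be sketched in Python as follows. The function name and the example messages are hypothetical. Where Eq. 2.2 compares magnitudes, this sketch tracks the index of the VN that supplied the first minimum, which gives the same result when two magnitudes tie.

```python
import math

def cn_update_min1_min2(L_in):
    """L_in maps VN index -> L_vc for one check node; returns VN index -> m_cv (Eqs. 2.2-2.5)."""
    # Eq. 2.3: product of the signs of all incoming messages.
    s_c = 1
    for x in L_in.values():
        if x < 0:
            s_c = -s_c
    # Eqs. 2.4 and 2.5: first and second minimum magnitudes in a single pass.
    min1, min2, v_min1 = math.inf, math.inf, None
    for v, x in L_in.items():
        mag = abs(x)
        if mag < min1:
            min1, min2, v_min1 = mag, min1, v
        elif mag < min2:
            min2 = mag
    # Eq. 2.2: the VN that supplied min1 receives min2, all others receive min1.
    out = {}
    for v, x in L_in.items():
        sgn = -1 if x < 0 else 1
        out[v] = s_c * sgn * (min2 if v == v_min1 else min1)
    return out

def cn_update_direct(L_in):
    """Reference: direct evaluation of Step 2 of Algorithm 2 for comparison."""
    out = {}
    for v in L_in:
        others = [x for u, x in L_in.items() if u != v]
        sign = -1 if sum(x < 0 for x in others) % 2 else 1
        out[v] = sign * min(abs(x) for x in others)
    return out

msgs = {0: -4.4, 1: -3.6, 3: 4.8}      # illustrative L_vc values for a degree-3 check node
print(cn_update_min1_min2(msgs))
```

Multiplying the total sign $s_c$ by the sign of the VN's own message removes that message's contribution, since each sign is its own inverse.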
Up until this point, this chapter has provided a general background on LDPC codes, their parity-
check matrices, and their decoding algorithms. The previous section described some of the challenges in
implementing decoders in silicon, and motivated the need for a new, scalable architecture for low-power
applications. The remainder of this chapter focuses on the application of LDPC codes for secret key
reconciliation in long-distance CV-QKD. The four steps of the QKD protocol are first described, followed
by the definitions of error rate and secret key rate in a QKD context. Last, this chapter provides an
overview of multi-edge LDPC codes used in long-distance CV-QKD.
2.6 LDPC Decoding in Quantum Key Distribution
This section first provides a fundamental overview of QKD, and then presents the mathematical frame-
work for key reconciliation using LDPC codes over multiple dimensions. Multi-edge LDPC codes are
also introduced to provide context for the discussion in Chapter 4, which focuses on the computational
speedup of the error-correction (reconciliation) algorithm for CV-QKD.
In a QKD system, two remote parties, Alice and Bob, communicate over a private optical quantum
channel, as well as an authenticated classical public channel to generate a shared secret key in the presence
of an adversary or eavesdropper, Eve, who may have access to both channels [38]. The public channel
can be assumed to be the Internet, while the private quantum channel is intended for communication
only between Alice and Bob. Eve may attempt to perform a man-in-the-middle attack on either the
public channel via a replay attack, or the private optical channel via beam-splitting. The security of
QKD stems from the no-cloning theorem of quantum mechanics, which states that any observation or
measurement of the quantum channel by Eve would disturb the coherent states transmitted from Alice
to Bob [10, 95]. Since Alice and Bob can calibrate their expected channel noise threshold for a fixed
fiber-optic transmission distance prior to being deployed in the field, any quantum measurement by Eve
would result in a channel noise increase, at which point, the reconciliation error rate would increase,
and Alice and Bob could choose to terminate their communication if they suspect a man-in-the-middle
attack [59]. A typical prepare-and-measure CV-QKD system is based on the Grosshans-Grangier 2002
(GG02) protocol [95], which defines the following four steps presented in Fig. 2.6: quantum transmission,
sifting, reconciliation, and privacy amplification. Fully secure QKD networks can be built by designating
intermediate trusted nodes [39,40], or through measurement-device-independent QKD (MDI-QKD) using
untrusted relay nodes in both CV- and DV-QKD [51,96,97]. MDI-QKD is beyond the scope of this thesis,
however, it does provide a viable solution to the quantum hacking problem by removing all detector side
channels [96].
2.6.1 Quantum Transmission and Sifting
To construct a secret key using the prepare-and-measure CV-QKD protocol, Alice first constructs a vector
$A$ consisting of $N_{\text{quantum}}$ coherent states, which she then transmits to Bob over a private optical fiber.
For each of Alice’s transmitted states, Bob arbitrarily measures the amplitude or phase quadrature using
an unbiased homodyne detector to construct a vector $B$ of length $N_{\text{quantum}}$. The optical experimental
setup is beyond the scope of this thesis, thus experimental values from previously published works have
been used to characterize the quantum channel [10].
In the remaining post-processing steps of the QKD protocol, Alice and Bob communicate over an
authenticated classical public channel, which is assumed to be noiseless and error-free. Eve may have
access to this channel, however, her eavesdropping does not introduce additional errors [95]. Following
quantum transmission and measurement, Alice and Bob perform a sifting operation to construct two
correlated Gaussian sequences, X0 and Y0, based on the transmitted and measured states. In the
following reconciliation and privacy amplification steps, Alice and Bob apply error correction and hashing
techniques to build a secret key using their sifted sequences of correlated quadrature measurements.
A detailed discussion of the quantum transmission and sifting steps is provided in Appendix A.
2.6.2 Reconciliation
During information reconciliation, Alice and Bob perform the first step in building a unique secret key by:
(1) encoding a randomly-generated message using the sifted quadrature measurements, (2) transmitting
the encoded message over an authenticated classical channel, and then (3) applying an error-correction
scheme to decode the original message [10, 98]. In the direct reconciliation scheme, Alice generates and
transmits a random message to Bob, who then performs the error-correction decoding based on his
measured quadratures. However, previous works have shown that the transmission distance with direct
reconciliation is limited to about 15km [99–101], and is thus not suitable for long-distance CV-QKD
targeting transmission distances beyond 100km [11].
Figure 2.6: CV-QKD model for secret key distillation with reverse reconciliation between Alice and Bob over a private quantum channel and public classical channel. (Diagram steps: Step 1: Quantum Transmission over the noisy private optical quantum channel; Step 2: Sifting; Step 3: Reconciliation with LDPC encoding and decoding; Step 4: Privacy Amplification with universal hashing to produce the secret key used for one-time pad encryption and decryption. Steps 2 to 4 use the noiseless authenticated classical public channel. Labelled parameters: modulation variance VA, repetition rate frep, optical fiber losses (T, ε), homodyne detector losses (η, Vel).)
The long distance problem drives the need for an alternate, robust scheme that is capable of operating
under the low-SNR conditions of the optical channel, even in the presence of excess noise introduced by
an eavesdropper. In the reverse reconciliation scheme, the direction of classical communication between
Alice and Bob is reversed. Reverse reconciliation achieves a higher secret key rate at longer distances in
comparison to direct reconciliation, however, powerful error-correction codes are still required to combat
the high channel noise at long distances without revealing unnecessary information to Eve during the
reconciliation process [11,59,98].
Two-way interactive error-correction protocols such as Cascade or Winnow are not practical for
long-distance QKD due to the large latency and communication overhead required to theoretically minimize
the information leakage to Eve [102–105]. In such interactive protocols, Alice and Bob perform the
error-correction procedure by iteratively exchanging update messages over the public channel until the
key is reconciled. Blind reconciliation using short block-length codes on the order of $10^3$ bits with low
interactivity was proposed to reduce decoding latency [65]; however, the short block length is not
suitable for error correction at low SNR. Instead, one-way forward error correction implemented using long
block-length codes with iterative soft-decision decoding is required to achieve efficient error correction
at low SNR [67, 98]. Jouguet et al. recently showed that multi-edge LDPC codes combined with a
multi-dimensional reverse reconciliation scheme can achieve near-Shannon-limit error-correction
performance at long distances [11, 59]. However, the computational complexity of LDPC decoding remains a
limitation to the maximum achievable secret key rate in a practical QKD implementation [60].
Chapter 4 presents hardware-oriented optimization techniques to alleviate the time-intensive bottleneck of
LDPC decoding for long-distance CV-QKD systems, while the remainder of this section outlines the
mathematical framework for long-distance reverse reconciliation.¹
Reconciliation at Long Distances
Strong error-correction schemes do not exist for systems with both a Gaussian input and Gaussian
channel, as in the case of CV-QKD. However, at low SNR, the maximum theoretical secret key rate
is less than 1 bit/pulse per channel use, and the Shannon limit of the additive white Gaussian noise
(AWGN) channel approaches the limit of a binary-input AWGN channel (BIAWGNC) [98]. This makes
binary codes highly suitable for error correction in the low-SNR regime [4,106], as opposed to non-binary
codes, which outperform binary codes on channels with more than 1 bit/symbol per channel use [107].
Since binary codewords can be encoded in the signs of Alice and Bob’s correlated sequences, X0 and
Y0, the reconciliation system can therefore be modelled as a BIAWGNC [11].
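The convergence of the two capacity limits at low SNR can be checked numerically. The sketch below is illustrative only: it uses the standard expressions for the real AWGN capacity and for the BIAWGN capacity written as an expectation over the LLR distribution, evaluated by trapezoidal integration. The function names, step count, and integration span are assumptions, while the SNR and code rate are taken from the CV-QKD column of Table 1.1.

```python
import math

def awgn_capacity(snr):
    """Capacity of the real-valued AWGN channel in bits per channel use."""
    return 0.5 * math.log2(1.0 + snr)

def biawgn_capacity(snr, steps=4000, span=10.0):
    """BIAWGN capacity, C = 1 - E[log2(1 + exp(-L))], with LLR L ~ N(2*snr, 4*snr)."""
    mu, std = 2.0 * snr, 2.0 * math.sqrt(snr)
    h = 2.0 * span / steps
    total = 0.0
    for i in range(steps + 1):
        z = -span + i * h                       # standard normal integration variable
        w = 0.5 if i in (0, steps) else 1.0     # trapezoidal end-point weights
        pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += w * pdf * math.log2(1.0 + math.exp(-(mu + std * z)))
    return 1.0 - h * total

snr = 10 ** (-15.47 / 10)        # CV-QKD operating SNR from Table 1.1, linear scale
c_awgn = awgn_capacity(snr)
c_bi = biawgn_capacity(snr)
code_rate = 1 / 50               # CV-QKD code rate from Table 1.1
print(f"AWGN capacity:   {c_awgn:.6f} bit/use")
print(f"BIAWGN capacity: {c_bi:.6f} bit/use")
print(f"Rate / BIAWGN capacity: {code_rate / c_bi:.4f}")
```

At this operating point the two capacities are nearly indistinguishable, which is the reason a binary code loses very little relative to a Gaussian input.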
Reverse Reconciliation Algorithm for the BIAWGNC
A model of the BIAWGNC can be induced from the physical parameters that characterize the quantum
transmission [11]. The variance of the optical input signal is normalized based on Alice’s modulation
variance VA, and captured in the form of a signal-to-noise ratio with respect to the optical fiber and
homodyne detector losses. Assuming that the BIAWGNC has a zero mean and noise variance of $\sigma_Z^2$,
$Z \sim \mathcal{N}(0, \sigma_Z^2)$, the SNR can be expressed as $s = 1/\sigma_Z^2$. In order to perform key reconciliation, Alice
and Bob now construct two new correlated Gaussian sequences from their sifted correlated sequences
$X_0$ and $Y_0$ of length $N_{\mathrm{quantum}}$. Alice and Bob first select a subset of $n$ elements from $X_0$ and $Y_0$,
where $n < N_{\mathrm{quantum}}$. Here, $n$ is chosen to be equal to the LDPC code block length. Alice and Bob then
normalize their subset of $n$ elements by the modulation variance $V_A$, such that Alice and Bob now share
correlated Gaussian sequences X and Y, each of length $n$, where $X \sim \mathcal{N}(0, 1)$, $Y \sim \mathcal{N}(0, 1 + \sigma_Z^2)$, and
the property Y = X + Z holds [11].
¹ Chen Feng helped develop the mathematical preliminaries for reverse reconciliation.
Bob uses a quantum random number generator to generate a uniformly-distributed random binary
sequence S of length $k$, where $S_i \in \{0, 1\}$. He then performs a computationally inexpensive LDPC
encoding operation to generate an LDPC codeword C of length $n$, where $C_i \in \{0, 1\}$, by appending
$(n-k)$ redundant parity bits to S based on a binary LDPC parity-check matrix H that is also known to
Alice. Eve may also have access to H; however, the QKD security proof still holds since Eve is assumed
to have infinite mathematical genius. Bob prepares his classical message to Alice, M, by modulating
the signs of his correlated Gaussian sequence Y with the LDPC codeword C, such that $M_i = (-1)^{C_i} Y_i$,
where $M_i \in \mathbb{R}$ and $Y_i \in \mathbb{R}$ for $i = 1, 2, \ldots, n$. The symmetry in the uniform distribution of Bob’s random
binary sequence S ensures that the transmission of M over the authenticated classical public channel
does not reveal any additional information to Eve [10].
Assuming error-free transmission over the classical channel, Alice attempts to recover Bob’s codeword
using her correlated Gaussian sequence X based on the following division operation:
$$R_i = \frac{M_i}{X_i} = \frac{(-1)^{C_i} Y_i}{X_i} = \frac{(-1)^{C_i}(X_i + Z_i)}{X_i} = (-1)^{C_i} + (-1)^{C_i}\frac{Z_i}{X_i}, \tag{2.6}$$
for $i = 1, 2, \ldots, n$. Here, Alice observes a channel with binary input (±1) and additive noise $(-1)^{C_i} Z_i/X_i$. In
this case, the division operation in the noise term represents a fading channel, however, since Alice knows
the value of each $X_i$, the norm of X is revealed and the overall channel noise remains Gaussian with
zero mean and variance $\sigma_{N_i}^2 = \sigma_Z^2/|X_i|^2$ for each $i = 1, 2, \ldots, n$ [11]. Alice then attempts to reconstruct
S by performing the computationally intensive Sum-Product belief propagation algorithm for LDPC
decoding, to remove the channel noise from her received vector R. Sum-Product decoding is preferred
for long-distance CV-QKD, as Min-Sum does not perform well at low SNR [108]. The LDPC decoding
algorithm requires the channel noise variance $\sigma_{N_i}^2$ to be known for each $i = 1, 2, \ldots, n$. By discarding the
$(n-k)$ parity bits from the decoded codeword, Alice can build an estimate $\hat{S}$ of Bob’s original binary
sequence S for further post-processing in the next privacy amplification step to asymptotically reduce
Eve’s knowledge about the secret key [95].
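The d = 1 procedure can be traced with a short simulation. This is a sketch under stated assumptions rather than the thesis implementation: the codeword is replaced by random bits (no LDPC encoder is included), the function name is hypothetical, and the channel log-likelihood ratio $2R_i/\sigma_{N_i}^2$ is the standard BIAWGNC expression, which the text implies but does not write out.

```python
import math
import random

def reverse_reconciliation_d1(n, sigma_z2, seed=0):
    """Simulate Bob's message M and Alice's received vector R for d = 1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]                  # Alice: X ~ N(0, 1)
    z = [rng.gauss(0.0, math.sqrt(sigma_z2)) for _ in range(n)]  # channel noise Z
    y = [xi + zi for xi, zi in zip(x, z)]                        # Bob: Y = X + Z
    c = [rng.randint(0, 1) for _ in range(n)]                    # stand-in for codeword C
    m = [(-1) ** ci * yi for ci, yi in zip(c, y)]                # M_i = (-1)^C_i * Y_i
    r = [mi / xi for mi, xi in zip(m, x)]                        # R_i = M_i / X_i (Eq. 2.6)
    var_n = [sigma_z2 / (xi * xi) for xi in x]                   # per-symbol noise variance
    llr = [2.0 * ri / vi for ri, vi in zip(r, var_n)]            # assumed channel LLR
    return x, z, c, r, var_n, llr

# Hypothetical operating point: SNR s = 0.03, so sigma_Z^2 = 1 / 0.03.
x, z, c, r, var_n, llr = reverse_reconciliation_d1(8, sigma_z2=1.0 / 0.03)
for ci, ri, vi in zip(c, r, var_n):
    print(ci, round(ri, 3), round(vi, 3))
```

Each symbol sees a different noise variance, which is why the decoder must be given $\sigma_{N_i}^2$ per symbol rather than a single channel estimate.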
Multi-Dimensional Reconciliation
Up until this point, the discussion has assumed a 1-dimensional reconciliation scheme in $\mathbb{R}$, with ±1 binary
inputs on the BIAWGNC. Leverrier et al. showed that the quantum transmission can be extended
to longer distances with proven security by employing multi-dimensional reconciliation schemes constructed
from spherical rotations in $\mathbb{R}^2$, $\mathbb{R}^4$, and $\mathbb{R}^8$, where the multiplication and division operators are
defined [67,98]. These spaces are commonly referred to as the set of complex numbers $\mathbb{C}$, the quaternions
$\mathbb{H}$, and the octonions $\mathbb{O}$, respectively. As shown in Eq. 2.6, the division and multiplication operations
must be defined for the reverse reconciliation procedure. By Hurwitz’s theorem of composition algebras,
normed division is only defined for four finite-dimensional algebras: the real numbers $\mathbb{R}$ ($\mathbb{R}^{d=1}$), the
complex numbers $\mathbb{C}$ ($\mathbb{R}^{d=2}$), the quaternions $\mathbb{H}$ ($\mathbb{R}^{d=4}$), and the octonions $\mathbb{O}$ ($\mathbb{R}^{d=8}$) [109]. Hence, the
remainder of this discussion considers only the $d = 1, 2, 4, 8$ dimensions.
The multi-dimensional approach is a further reformulation of the reduction of the physical Gaussian
channel to a BIAWGNC at low SNR. For $d$-dimensional reconciliation, $d \in \{1, 2, 4, 8\}$, each consecutive
group of $d$ quantum coherent-state transmissions from Alice to Bob can be mapped to the same
BIAWGNC. As a result, the channel noise variance among all $d$ channels is uniform. For the $d = 1$
case, each $R_i$ defined in Eq. 2.6 has a unique channel noise variance defined by $\sigma_{N_i}^2 = \sigma_Z^2/|X_i|^2$ for
$i = 1, 2, \ldots, n$. For the $d = 2$ case, the reconciliation is performed over successive $(R_{2i-1}, R_{2i})$ pairs:
$(R_1, R_2), (R_3, R_4), \ldots, (R_{n-1}, R_n)$, which are constructed from the quadrature transmission of successive
$(M_{2i-1}, M_{2i})$ pairs for $i = 1, 2, \ldots, n/2$ in $\mathbb{R}^{d=2}$. Similar to $d = 1$, each $i$th received value is still comprised
of a ±1 binary input and a noise term, such that $R_{2i-1} = (-1)^{C_{2i-1}} + N_{2i-1}$ and $R_{2i} = (-1)^{C_{2i}} + N_{2i}$
for $i = 1, 2, \ldots, n/2$. While the real and imaginary noise components, $N_{2i-1}$ and $N_{2i}$, are not equal,
the variance of the channel noise is uniform over both dimensions, such that $\sigma_{N_{2i-1}}^2 = \sigma_{N_{2i}}^2$ for each
$(R_{2i-1}, R_{2i})$ pair. This can be extended to the $d = 4$ and $d = 8$ cases, where each $d$-tuple of successive $R_i$
values has a unique channel noise for each dimensional component, but the channel noise variance remains
equal over all $d$ dimensions. For example, for $d = 4$, each received 4-tuple, $(R_{4i-3}, R_{4i-2}, R_{4i-1}, R_{4i})$ for
$i = 1, 2, \ldots, n/4$, has a unique noise term for each of its four components, but the channel noise variance
over all four dimensions remains uniform.
The following derivation extends Alice’s message reconstruction calculation presented in Eq. 2.6 to
$d$-dimensional vector spaces, $d \in \{1, 2, 4, 8\}$, where the multiplication and division operators are defined.
The derivation of the channel noise for $d = 2, 4, 8$ is much more involved than for $d = 1$; however, the
procedure can be simplified by applying distributive and cancellation properties that hold true
for the complex, quaternion, and octonion algebras (the octonions are not associative, but they are
alternative, so the cancellation $(ab)b^{-1} = a$ still holds). Here, R, M, X, Y, and Z are $d$-dimensional
vectors, and U is the $d$-dimensional vector comprised of $(-1)^{C_i}$ components. For example, for $d = 2$,
$U = [(-1)^{C_{2i-1}}, (-1)^{C_{2i}}]$, while for $d = 4$, $U = [(-1)^{C_{4i-3}}, (-1)^{C_{4i-2}}, (-1)^{C_{4i-1}}, (-1)^{C_{4i}}]$. It follows
then that
$$
\begin{aligned}
R &= MX^{-1} \\
&= (UY)X^{-1} \\
&= (U(X + Z))X^{-1} \\
&= (UX + UZ)X^{-1} && \text{by left-distributivity } a(b + c) = ab + ac \\
&= UXX^{-1} + UZX^{-1} && \text{by right-distributivity } (b + c)a = ba + ca \\
&= U + UZX^{-1} && \text{by right-cancellation } abb^{-1} = a \\
&= U + \frac{UZX^{*}}{\|X\|^2}.
\end{aligned} \tag{2.7}
$$
The received vector can be expressed as R = U + N, where the multi-dimensional noise for a
BIAWGNC is given by the term $N = (UZX^{*})/\|X\|^2$. The Cayley-Dickson construction can then be
applied to complete the derivation of the multi-dimensional noise N for $d = 2, 4, 8$ [110]. Since the noise
is identically distributed in each dimension, U can be assumed to be the all-zero codeword, i.e., $C_i = 0$
for all $i = 1, 2, \ldots, n$, to further simplify the derivation. For $d = 2$, the channel noise of both the real
and imaginary components can be expressed as $N_{2i-1} = a_i Z_{2i-1} + b_i Z_{2i}$ and $N_{2i} = a_i Z_{2i} - b_i Z_{2i-1}$,
where
$$a_i = \frac{X_{2i-1} + X_{2i}}{X_{2i-1}^2 + X_{2i}^2} \quad \text{and} \quad b_i = \frac{X_{2i} - X_{2i-1}}{X_{2i-1}^2 + X_{2i}^2}$$
for $i = 1, 2, \ldots, n/2$. It follows then that the channel
noise variance for $d = 2$ is given by $\sigma_{N_{2i-1}}^2 = \sigma_{N_{2i}}^2 = (a_i^2 + b_i^2)\sigma_Z^2$. The noise derivation for $d = 4$ and
$d = 8$ is much longer and is not included in the thesis.
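The d = 2 closed form can be cross-checked against a direct complex-number evaluation of the noise term in Eq. 2.7. The sketch below is illustrative only: it assumes the all-zero codeword, so that U has components (+1, +1) and is represented by the complex number 1 + 1j, and it maps the odd-indexed component to the real part and the even-indexed component to the imaginary part. The function names are hypothetical.

```python
import random

def noise_d2_closed_form(x1, x2, z1, z2):
    """Noise components from the a_i, b_i expressions (x1 = X_{2i-1}, x2 = X_{2i})."""
    norm2 = x1 * x1 + x2 * x2
    a = (x1 + x2) / norm2
    b = (x2 - x1) / norm2
    # Third value is the factor multiplying sigma_Z^2 in the noise variance.
    return a * z1 + b * z2, a * z2 - b * z1, a * a + b * b

def noise_d2_complex(x1, x2, z1, z2):
    """Noise N = U Z X* / ||X||^2 evaluated with complex arithmetic."""
    u = complex(1.0, 1.0)   # all-zero codeword: both components equal +1
    x = complex(x1, x2)
    z = complex(z1, z2)
    n = u * z * x.conjugate() / abs(x) ** 2
    return n.real, n.imag

rng = random.Random(7)
x1, x2, z1, z2 = (rng.gauss(0.0, 1.0) for _ in range(4))
print(noise_d2_closed_form(x1, x2, z1, z2)[:2])
print(noise_d2_complex(x1, x2, z1, z2))
```

Expanding $a_i^2 + b_i^2$ by hand gives $2/(X_{2i-1}^2 + X_{2i}^2)$, so the shared variance of the pair depends only on the norm of the corresponding X pair.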
Reconciliation Efficiency
The reverse reconciliation algorithm for the BIAWGNC can be reduced to an asymmetric Slepian-Wolf
source-coding problem with input M and side information X, where Alice and Bob observe correlated
Gaussian sequences X and Y, respectively [104,111]. Since Alice must discard (n−k) parity bits from the
linear block code after LDPC decoding, it follows then that the efficiency of the reverse reconciliation
algorithm is given by $\beta = R_{\mathrm{code}}/I(X;Y)$, where $I(X;Y)$ is the mutual information between X and Y, and
$R_{\mathrm{code}}$ is the LDPC code rate defined as $R_{\mathrm{code}} = k/n$ from the $n$-length codeword C and $k$-length
random information string S [11, 111]. The mutual information $I(X;Y)$ corresponds to the Shannon
capacity of the quantum channel, hence the reconciliation efficiency can be expressed more simply as:
$$\beta = \frac{R_{\mathrm{code}}}{C(s)} = \frac{R_{\mathrm{code}}}{R_{\max}}, \tag{2.8}$$
where $C(s) = \frac{1}{2}\log_2(1 + s)$ is the Shannon capacity and $s$ is the SNR of the BIAWGNC. The Shannon
capacity defines the maximum achievable code rate Rmax for a given SNR, and thus, the β-efficiency
characterizes how close the reconciliation algorithm operates to this fundamental limit [106].
The reconciliation efficiency β plays a crucial role in the performance of CV-QKD. The β-efficiency
at a particular SNR operating point determines the code rate, and ultimately, the number of parity
bits discarded in each message. Assuming that the LDPC coding scheme has been optimized for a
particular SNR operating point such that the code rate Rcode is fixed, the reconciliation efficiency
then depends solely on the SNR of the quantum channel, which is a function of Alice’s coherent-state
modulation variance and the physical transmission losses in the optical fiber. Hence, for a fixed optical
transmission distance between Alice and Bob, the reconciliation efficiency can be optimized by tuning
Alice’s modulation variance VA, and designing an optimal error-correction scheme for a target SNR.
Chapter 4 explores how changes in the β-efficiency affect the reconciliation distance and maximum
achievable secret key rate.
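Eq. 2.8 is simple enough to evaluate directly. The following sketch uses hypothetical function names, and the SNR value in the usage line is an arbitrary illustration, not an operating point reported in the thesis.

```python
import math

def shannon_capacity(s):
    """C(s) = 0.5 * log2(1 + s), in bits per channel use."""
    return 0.5 * math.log2(1.0 + s)

def reconciliation_efficiency(r_code, s):
    """beta = R_code / C(s), as in Eq. 2.8."""
    return r_code / shannon_capacity(s)

# Illustrative only: a rate 0.02 code operated at an assumed SNR of 0.03.
print(reconciliation_efficiency(0.02, 0.03))
```

For a fixed code rate, lowering the SNR at which the code still decodes raises β towards one, which is the sense in which β measures the distance to the Shannon limit.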
Quantum Channel Capacity vs. Coding Channel Capacity
The reconciliation efficiency β and channel capacity C(s) defined in Eq. 2.8 are related to the overall ca-
pacity of the complete QKD system, which has an AWGN channel characterized by the optical quantum
losses and modulation variance. It is also possible to define a different efficiency and capacity related
to the channel coding problem presented in Eq. 2.6, where Alice observes a channel with binary input
(±1) and additive noise $(-1)^{C_i} Z_i/X_i$. It is important to note here that these two channel capacities and
efficiencies are different, but can easily be related. Up until this point, the key reconciliation problem
has been considered as a single problem, however, for clarity, it should be decomposed into two related
problems: (1) distilling a common message from correlated random sequences X and Y, and (2) chan-
nel coding for a binary-input fast fading channel with channel state information available only at the
decoder. The first problem is an information theory problem, and is independent of the second channel
coding problem.
The information theoretic problem attempts to distill the correlated Gaussian sequence Y, in the
presence of the quantum channel noise Z, as given by the expression Y = X + Z. This problem is
more formally known as “secret key agreement by public discussion from common information” [112].
The efficiency β and channel capacity C(s) defined in Eq. 2.8 are the efficiency and capacity related to
solving the information theoretic problem, where s represents the SNR on the optical quantum channel.
For clarity, let us redefine the overall QKD system efficiency as $\beta_{\mathrm{AWGN}}$ and the capacity as $C_{\mathrm{AWGN}}$.
In the channel coding problem, Alice attempts to recover an encoded codeword C using error-correction
decoding techniques. In Eq. 2.6, the noise represents a fading channel where each $i$th symbol
has a unique SNR, characterized by its unique channel noise variance. For $d = 1$ dimensional reconciliation,
the coding channel noise variance is given by $\sigma_{N_i}^2 = \sigma_Z^2/|X_i|^2$ for each $i = 1, 2, \ldots, n$, and thus the
coding (fading) channel has an ergodic capacity, which can be expressed as
$C_{\mathrm{coding}} = E\left[\frac{1}{2}\log_2\left(1 + \frac{1}{\sigma_{N_i}^2}\right)\right]$.
The ergodic capacity $C_{\mathrm{coding}}$ can be computed by averaging the per-symbol capacity over the SNR values
$1/\sigma_{N_i}^2$ for $i = 1, 2, \ldots, n$.
It follows then that the channel coding efficiency is given by $\beta_{\mathrm{coding}} = R_{\mathrm{code}}/C_{\mathrm{coding}}$. The overall QKD
system efficiency can then be expressed independent of the code rate as follows:
$$\beta_{\mathrm{AWGN}} = \beta_{\mathrm{coding}}\frac{C_{\mathrm{coding}}}{C_{\mathrm{AWGN}}}. \tag{2.9}$$
The ergodic capacity of multi-dimensional reconciliation schemes $d = 2, 4, 8$ can be determined by
applying the same expression for $C_{\mathrm{coding}}$. The expression for $\beta_{\mathrm{AWGN}}$ in Eq. 2.9 holds for $d = 1, 2, 4, 8$
dimensional reconciliation. The remainder of this thesis considers only the overall QKD system efficiency
$\beta_{\mathrm{AWGN}}$, which is herein denoted more simply as $\beta$.
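The relation between the two efficiencies can be illustrated with a Monte Carlo estimate of the ergodic capacity for d = 1. This is a sketch under assumptions: the SNR, code rate, and sample count are arbitrary illustrative values, the function names are hypothetical, and the expectation is replaced by a sample average over simulated $X_i \sim \mathcal{N}(0, 1)$.

```python
import math
import random

def awgn_capacity(s):
    """C_AWGN = 0.5 * log2(1 + s)."""
    return 0.5 * math.log2(1.0 + s)

def ergodic_coding_capacity(x, sigma_z2):
    """Sample average of 0.5*log2(1 + 1/sigma_Ni^2), with sigma_Ni^2 = sigma_Z^2 / x_i^2."""
    return sum(0.5 * math.log2(1.0 + xi * xi / sigma_z2) for xi in x) / len(x)

rng = random.Random(1)
s = 0.03                                   # assumed SNR of the quantum channel
x = [rng.gauss(0.0, 1.0) for _ in range(100000)]
c_coding = ergodic_coding_capacity(x, 1.0 / s)
c_awgn = awgn_capacity(s)

r_code = 0.02                              # assumed code rate
beta_coding = r_code / c_coding
beta_awgn = beta_coding * c_coding / c_awgn   # Eq. 2.9
print(c_coding, c_awgn, beta_coding, beta_awgn)
```

Because the logarithm is concave, the averaged per-symbol capacity cannot exceed the capacity evaluated at the average SNR, so in this sketch $\beta_{\mathrm{coding}}$ comes out slightly larger than $\beta_{\mathrm{AWGN}}$ for the same code rate.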
2.6.3 Privacy Amplification
Since Eve may have collected sufficient information during her observations of the quantum and classical
channels, Alice and Bob asymptotically reduce Eve’s knowledge of the key by independently applying
a shared universal hashing function on a concatenated block of their independent binary strings $\hat{S}$ and
$S$ [10, 95]. Each concatenated secret key block is $N_{\mathrm{privacy}}$ bits in length. After hashing, Alice and Bob
can use the resulting symmetric key to encrypt and decrypt messages with perfect secrecy using the
one-time pad technique [39]. Additional details about the privacy amplification step are provided in
Appendix A.
2.6.4 Maximizing Secret Key Rate with Collective Attacks
Assuming perfect error-correction during the reconciliation step, the maximum theoretical secret key
rate for a CV-QKD system with one-way reverse reconciliation can be defined as
$$K_{\mathrm{opt}} = \beta I_{AB} - \chi_{BE} \quad \text{(bits/pulse)}, \tag{2.10}$$
where $I_{AB}$ is the mutual information between Alice and Bob, $\beta$ is the previously-defined reconciliation
efficiency, and $\chi_{BE}$ is the Holevo bound on the information leaked to Eve [10]. A complete derivation
of the secret key rate with collective attacks is provided in Appendix A. In order to maximize the secret
key rate Kopt for a particular β-efficiency, Alice’s modulation variance VA must be optimally tuned for
each quantum transmission distance to maximize the SNR on the BIAWGNC. A complete discussion on
the optimal choice of VA is provided in Appendix A.
The asymptotic limit on the secret key rate Kopt is based on ideal theoretical security models, and
does not consider the imperfections of a practical CV-QKD system, which might enable additional
side-channel attacks [113]. Such imperfections include the finite-size effects [114–116], excess electronic
and phase noise from uncalibrated optical equipment, as well as discretized Gaussian modulation with
finite bounds on the distribution and randomness [113]. Leverrier proved that CV-QKD with coherent
states provides composable security against collective attacks [117], however, extending the information-
theoretic security proofs from collective attacks to general attacks in the finite-size regime of CV-QKD
is currently an active area of research [116,118]. At the time of writing, the highest CV-QKD key rates
can be achieved using coherent states and homodyne detection with security against collective attacks
and some finite-size effects [59]. The motivation of this thesis is to show that the key reconciliation
(error correction) algorithm can be accelerated such that the throughput of LDPC decoding is higher
than the asymptotic secret key rate achievable using realistic quantum channel parameters and optical
equipment available today. The finite-size effects on secret key rate are considered later in this section,
while the other imperfections of a practical CV-QKD system are beyond the scope of this thesis.
The BIAWGNC model for long-distance CV-QKD under investigation in this thesis has also been
proven secure against collective attacks, thus the expression for the asymptotic secret key rate Kopt still
holds [11, 98]. At long distances, IAB and χBE are nearly equal, thus in order to maximize the secret
key rate, it would appear that the reconciliation efficiency β must also be maximized. However, this is
not necessarily true since Kopt only provides an expression for the maximum achievable secret key rate
and does not consider the speed of reconciliation, nor the uncorrectable errors. The frame error rate
(FER) of the reconciliation algorithm must also be considered.
2.6.5 Upper Bound on Secret Key Rate for a Lossy Channel
Pirandola et al. recently showed that there exists a general upper bound on the secret key rate for a lossy
channel [61]. This fundamental limit is determined by the transmittance T of the fiber-optic channel,
and is given by
$$K_{\mathrm{lim}} = -\log_2(1 - T) \quad \text{(bits/pulse)}. \tag{2.11}$$
The transmittance is defined as $T = 10^{-\alpha \ell/10}$, where the distance $\ell$ is expressed in kilometers and the
standard loss of a single-mode fiber optic cable is assumed to be $\alpha = 0.2$ dB/km. The upper bound
versus distance is plotted in Fig. A.2 in Appendix A.
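The bound in Eq. 2.11 can be tabulated against distance with a few lines of code. This is a minimal sketch with hypothetical function names; it uses the 0.2 dB/km loss stated above as the default and is only defined for distances greater than zero, since T = 1 makes the logarithm diverge.

```python
import math

def transmittance(distance_km, alpha_db_per_km=0.2):
    """T = 10^(-alpha * l / 10) for a fiber of length l in kilometers."""
    return 10.0 ** (-alpha_db_per_km * distance_km / 10.0)

def key_rate_upper_bound(distance_km, alpha_db_per_km=0.2):
    """K_lim = -log2(1 - T) in bits/pulse; requires distance_km > 0."""
    t = transmittance(distance_km, alpha_db_per_km)
    # log1p keeps precision when T is very small at long distances.
    return -math.log1p(-t) / math.log(2.0)

for distance in (25, 50, 100, 150):
    print(distance, transmittance(distance), key_rate_upper_bound(distance))
```

At 100 km the transmittance is $10^{-2}$, so the bound is roughly 0.0145 bits/pulse; for small T it scales approximately as $T/\ln 2$.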
2.6.6 Frame Error Rate for Reverse Reconciliation
In reverse reconciliation, Alice attempts to construct a decoded estimate $\hat{S}$ of Bob’s original message S
in order to perform privacy amplification and build a secret key. The tree diagram in Fig. 2.7 highlights
four possible decoding scenarios for generating $\hat{S}$ from Alice’s decoded codeword $\hat{C}$.
After LDPC decoding, Alice performs a parity check $\hat{C}H^{\top}$ to verify that her decoded codeword $\hat{C}$ is
valid. When the parity check fails, i.e., $\hat{C}H^{\top} \neq 0$, Alice knows that a decoding error has occurred and
the frame is discarded since it cannot be used to generate a secret key. When the parity check
passes, i.e., $\hat{C}H^{\top} = 0$, Alice knows that $\hat{C}$ is a valid codeword; however, she does not yet know if $\hat{C}$ is
equal to Bob’s original encoded codeword C.
For any binary linear block code, the number of possible codewords is $2^k = 2^{nR_{\mathrm{code}}}$. Thus, for codes
with a long block length $n$, the number of possible codewords grows exponentially, and it is possible
for the decoder to converge to a valid codeword where the decoded message is incorrect, i.e., $\hat{S} \neq S$.
In coding theory, this is referred to as an undetected error. This scenario is problematic for secret key
generation where both parties must share the same message after decoding in order to perform universal
hashing in the next privacy amplification step.
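[Figure 2.7 tree diagram: the decoded codeword $\hat{C}$ either fails the parity check ($\hat{C}H^{\top} \neq 0$, $\hat{C}$ in error,
CRC skipped, detected error with $\hat{S} \neq S$) or passes it ($\hat{C}H^{\top} = 0$, $\hat{C}$ valid). After a parity check pass,
a CRC fail gives a detected error ($\hat{S} \neq S$), while a CRC pass gives either no error ($\hat{S} = S$) or an
undetected error ($\hat{S} \neq S$).]
Figure 2.7: Possible decoding scenarios and error detection techniques.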
In order to detect invalid decoding errors when $\hat{C}H^{\top} = 0$, a cyclic redundancy check (CRC) of Bob’s
original message S can be transmitted as part of the frame, and then verified against the computed CRC
of Alice’s decoded message $\hat{S}$. Figure 2.8 presents the components of an LDPC-encoded frame, where
$k$ information bits are comprised of $(k - N_{\mathrm{CRC}})$ message bits and $N_{\mathrm{CRC}}$ CRC bits, followed by $(n - k)$
parity bits to be discarded after LDPC decoding. If the CRC results of S and $\hat{S}$ are equal, then the
decoding is successful and $\hat{S}$ can be used to distill a secret key, otherwise Alice knows that a decoding
error has occurred and $\hat{S}$ is discarded. The CRC needs to be performed only when the parity check
passes, otherwise the frame is known to contain an error and the CRC is skipped. A truly undetected
error occurs when both the parity check and CRC pass, but $\hat{S} \neq S$.
[Figure 2.8 frame layout, total length $n$: message ($k - N_{\mathrm{CRC}}$ bits), CRC ($N_{\mathrm{CRC}}$ bits), parity/redundancy ($n - k$ bits).]
Figure 2.8: LDPC frame components: message, CRC, and parity bits.
A frame error is said to have occurred when $\hat{S} \neq S$, i.e., when the decoding fails to reproduce the
original message. Both detected and undetected errors contribute to the overall FER. The probability
of frame error is defined as follows:
$$P_e = P_{\text{detected error}} + P_{\text{undetected error}}. \tag{2.12}$$
From Fig. 2.7, it follows then that the detected and undetected error probabilities are given as
$$P_{\text{detected error}} = P(\hat{C}H^{\top} \neq 0) + P(\hat{C}H^{\top} = 0 \cap \text{CRC Fail})$$
$$P_{\text{undetected error}} = P(\hat{C}H^{\top} = 0 \cap \text{CRC Pass} \cap \hat{S} \neq S).$$
There exists a rare case not shown in Fig. 2.7, where the parity check passes and CRC fails, yet
$\hat{S} = S$. In this case, the error is in the CRC component of the frame. Although the decoded message
is correct, it will be discarded by the decoder due to the failed CRC check. As a result, there is a rare
chance that this frame will be lost and the secret key rate will be reduced. However, this case is not
considered by convention in communication theory [119].
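The decision tree of Fig. 2.7 can be sketched as a small classifier. Everything below is an illustrative assumption rather than the thesis implementation: the parity-check matrix is represented as a list of rows of bit positions, the CRC is the 32-bit CRC from Python's `zlib` computed over one byte per bit for simplicity, and the toy code is a single overall parity check instead of an LDPC code.

```python
import zlib

N_CRC = 32  # CRC length in bits

def crc32_bits(bits):
    """CRC-32 of a bit list, returned as a list of 32 bits (one input bit per byte)."""
    crc = zlib.crc32(bytes(bits))
    return [(crc >> i) & 1 for i in range(N_CRC)]

def parity_check_passes(codeword, check_rows):
    """True when every check (a list of bit positions) sums to zero modulo 2."""
    return all(sum(codeword[j] for j in row) % 2 == 0 for row in check_rows)

def classify_frame(decoded, check_rows, k):
    """Apply the parity check first and run the CRC only when it passes."""
    if not parity_check_passes(decoded, check_rows):
        return "detected error (parity check failed, CRC skipped)"
    message, crc = decoded[:k - N_CRC], decoded[k - N_CRC:k]
    if crc32_bits(message) != crc:
        return "detected error (CRC failed)"
    return "accepted"  # either no error or an undetected error

# Toy frame: 8 message bits, 32 CRC bits, and one overall parity bit.
message = [1, 0, 1, 1, 0, 0, 1, 0]
info = message + crc32_bits(message)
k = len(info)
frame = info + [sum(info) % 2]
check_rows = [list(range(len(frame)))]

print(classify_frame(frame, check_rows, k))
```

Flipping one bit of the toy frame trips the parity check, while flipping two message bits passes the parity check and is caught by the CRC instead, which mirrors the two detected-error branches of the tree.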
2.6.7 Impact of Reconciliation Error and Efficiency on Secret Key Rate
This thesis investigates the trade-offs in error-correction performance, reconciliation efficiency, reconcili-
ation distance, and secret key rate, by assuming that the physical parameters of the quantum channel are
fixed, and that Alice’s modulation variance VA has been optimally set for each transmission distance and
desired β-efficiency. In practice, the asymptotic secret key rate $K_{\mathrm{opt}}$ is scaled by the FER since decoded
frames with known error cannot be used to generate a secret key and must therefore be discarded. As
such, the effective secret key rate of a practical CV-QKD system is given by
$$K_{\mathrm{eff}} = \left(1 - P_{\text{detected error}}\right)\left((1 - P_{\text{undetected error}})\beta I_{AB} - \chi_{BE}\right). \tag{2.13}$$
Alice and Bob can discard frames with detected error, while frames with undetected error further reduce
the mutual information $I_{AB}$ between Alice and Bob. In Chapter 4, it is empirically shown that
$P_{\text{undetected error}} = 0$ using a 32-bit CRC code, thus the total decoding FER can be expressed more simply
as $P_e = P_{\text{detected error}}$. This simplified expression for the FER is assumed for the remainder of this thesis,
and thus the effective secret key rate expression given by Eq. 2.13 can be reduced to
$$K_{\mathrm{eff}} = (1 - P_e)(\beta I_{AB} - \chi_{BE}). \tag{2.14}$$
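A direct transcription of Eq. 2.13 makes the reduction to Eq. 2.14 easy to see. The function name is hypothetical and the numerical inputs are placeholders chosen only to exercise the formula; they are not values from the thesis.

```python
def effective_key_rate(beta, i_ab, chi_be, p_detected, p_undetected=0.0):
    """K_eff from Eq. 2.13; with p_undetected = 0 this reduces to Eq. 2.14."""
    return (1.0 - p_detected) * ((1.0 - p_undetected) * beta * i_ab - chi_be)

# Placeholder inputs: beta = 0.95, I_AB = 0.03, chi_BE = 0.025, FER = 0.1.
print(effective_key_rate(0.95, 0.03, 0.025, 0.1))
print(effective_key_rate(0.95, 0.03, 0.025, 0.1, p_undetected=0.01))
```

Because $\beta I_{AB}$ and $\chi_{BE}$ are close in magnitude at long distances, even a small undetected-error probability inside the inner bracket removes a noticeable share of the rate, whereas the detected-error probability only scales the result.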
Up until this point, the β-efficiency has been assumed to be independent of the reconciliation algo-
rithm, however, as shown in Eq. 2.13, the effective secret key rate Keff is dependent on both β and FER.
Given the set of optimal VA values and assuming that the physical operating parameters of the quantum
channel remain constant, the BIAWGNC can be induced and described solely in terms of the
SNR at a particular distance with an effective secret key rate Keff. As described further in Chapter 4,
there exists a trade-off between reconciliation distance and effective secret key rate, such that for a single
SNR, one of the following two operating conditions is possible: (1) long distance with a low secret key
rate, or (2) short distance with a high secret key rate. In fact, for a fixed LDPC code rate Rcode, the
SNR depends only on the reconciliation efficiency and is independent of transmission distance. From
Eq. 2.8, the SNR of a BIAWGNC can be expressed as a function of β such that
$$s(\beta) = 2^{2R_{\mathrm{code}}/\beta} - 1. \tag{2.15}$$
From a code design perspective then, a rate Rcode LDPC code can be designed to achieve a target FER
at a particular SNR. Since Alice and Bob remain stationary once deployed in the field, their transmission
distance remains fixed, and thus an LDPC code, i.e., parity-check matrix H, can be designed independent
of other CV-QKD system parameters to achieve the maximum operating secret key rate over a range
of distances by providing the optimal trade-off between β and FER. The reverse reconciliation problem
can thus be reduced to the simpler model shown in Fig. 2.1 as a result of the BIAWGNC approximation
at low SNR.
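Eq. 2.15 is the inverse of Eq. 2.8, which the short sketch below makes explicit. The function name is hypothetical, and the rate and efficiency values in the loop are illustrative targets rather than results from the thesis.

```python
import math

def snr_for_efficiency(r_code, beta):
    """s(beta) = 2^(2 * R_code / beta) - 1, as in Eq. 2.15."""
    return 2.0 ** (2.0 * r_code / beta) - 1.0

# Illustrative targets for a rate 0.02 code.
for beta in (0.90, 0.95, 0.99):
    s = snr_for_efficiency(0.02, beta)
    print(beta, s, 10.0 * math.log10(s))
```

A higher target efficiency corresponds to a lower operating SNR for the same code rate, so the code must decode closer to capacity, typically at the cost of a higher FER.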
2.6.8 Secret Key Rate with Finite-Size Effects
The security of the CV-QKD protocol must account for the finite length of the secret key, which is
generated via universal hashing in the privacy amplification step using a block of length $N_{\mathrm{privacy}}$
bits. Alice constructs her privacy amplification block from her correctly decoded $\hat{S}$ messages, while
Bob constructs his privacy amplification block from his original corresponding S messages. Due to the
finite block size, the secret key rate is reduced by an offset coefficient $\Delta(N_{\mathrm{privacy}})$ and scaling coefficient
$N_{\mathrm{privacy}}/N_{\mathrm{quantum}}$, where $N_{\mathrm{quantum}}$ is the number of symbols sent from Alice to Bob during the first
quantum transmission step. The secret key rate, accounting for finite-size effects, is given by
$$K_{\mathrm{finite}} = \left(\frac{N_{\mathrm{privacy}}}{N_{\mathrm{quantum}}}\right)\left(1 - P_e\right)\left(\beta I_{AB} - \chi_{BE} - \Delta(N_{\mathrm{privacy}})\right) \quad \text{(bits/pulse)}. \tag{2.16}$$
Leverrier et al. showed that $N_{\mathrm{quantum}}$ can be arbitrarily chosen as $N_{\mathrm{quantum}} = 2N_{\mathrm{privacy}}$ [114], and that
when $N_{\mathrm{privacy}} > 10^4$, the finite-size offset factor $\Delta(N_{\mathrm{privacy}})$ can be approximated as
$$\Delta(N_{\mathrm{privacy}}) \approx 7\sqrt{\frac{\log_2(2/\epsilon)}{N_{\mathrm{privacy}}}}, \tag{2.17}$$
where a conservative choice for the security parameter is $\epsilon = 10^{-10}$ [114]. The LDPC block length $n$ is
not directly included in this expression, however, the LDPC block length does affect the reconciliation
efficiency β and FER Pe. Chapter 4 presents a study of the optimal privacy amplification block size
Nprivacy for achieving maximum distance.
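Eqs. 2.16 and 2.17 can be combined into a short helper to see how the block size enters the rate. This is a sketch with hypothetical function names; the default $N_{\mathrm{quantum}} = 2N_{\mathrm{privacy}}$ and $\epsilon = 10^{-10}$ follow the text above, while the remaining inputs in the usage loop are placeholders.

```python
import math

def finite_size_offset(n_privacy, epsilon=1e-10):
    """Delta(N_privacy) from Eq. 2.17 (stated for N_privacy > 10^4)."""
    return 7.0 * math.sqrt(math.log2(2.0 / epsilon) / n_privacy)

def finite_key_rate(beta, i_ab, chi_be, p_e, n_privacy, n_quantum=None, epsilon=1e-10):
    """K_finite from Eq. 2.16 in bits/pulse."""
    if n_quantum is None:
        n_quantum = 2 * n_privacy  # choice attributed to Leverrier et al. in the text
    delta = finite_size_offset(n_privacy, epsilon)
    return (n_privacy / n_quantum) * (1.0 - p_e) * (beta * i_ab - chi_be - delta)

# Placeholder inputs: beta = 0.95, I_AB = 0.03, chi_BE = 0.025, FER = 0.1.
for n_privacy in (10**8, 10**10, 10**12):
    print(n_privacy, finite_size_offset(n_privacy),
          finite_key_rate(0.95, 0.03, 0.025, 0.1, n_privacy))
```

The offset shrinks only as the inverse square root of the block size, so with these placeholder inputs the smaller blocks give a negative rate, meaning no key can be extracted; this is why very large privacy amplification blocks matter when $\beta I_{AB} - \chi_{BE}$ is small.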
2.7 Multi-Edge LDPC Codes
This thesis builds on the previous work by Jouguet et al., who explored the application of low-rate multi-edge
LDPC codes for reverse reconciliation in long-distance CV-QKD on the BIAWGNC. This section
presents an overview of multi-edge LDPC codes, while new quasi-cyclic construction techniques for multi-edge
LDPC codes are presented in Chapter 4. Multi-edge LDPC codes, first introduced by Richardson
and Urbanke, provide two advantages over standard LDPC codes: (1) near-Shannon capacity error-correction
performance for low-rate codes, and (2) low error-floor performance for high-rate codes [71].
The latter is not a significant concern for long-distance CV-QKD where the reconciliation FER is on
the order of $10^{-1}$; however, the design of a high-performance low-rate code is crucial to achieving high
β-efficiency [11]. Since multi-edge codes can be described by a binary parity-check matrix, they have the
same computational decoding complexity as single edge-type codes. However, given their application
in low-SNR channels, the decoding latency of multi-edge codes is generally higher due to the increased
number of iterations required to converge to a valid codeword at low SNR. This section first briefly reviews
the general construction procedure for an LDPC code, and then explores the multi-edge framework in
more detail.
2.7.1 General Design and Construction of LDPC Codes
An LDPC code of length $n$ can be specified by the number of variable and check nodes, and their
respective degree distributions. The number of edges connected to a vertex in the graph G is called
the degree of the vertex. The degree distribution of G is a pair of polynomials $\omega(x) = \sum_i \omega_i x^i$ and
$\psi(x) = \sum_i \psi_i x^i$, where $\omega_i$ and $\psi_i$ respectively denote the number of variable and check nodes of degree
$i$ in G. The performance of tree-like Tanner graphs can be analyzed using a technique called density
evolution [4]. As $n \to \infty$, the error-correction performance of Tanner graphs with the same degree
distribution is nearly identical [4]. Hence, the variable and check node degree distributions can be
normalized to $\Omega(x) = \sum_i \frac{\omega_i}{n} x^i$ and $\Psi(x) = \sum_i \frac{\psi_i}{n-k} x^i$, respectively. The design of binary LDPC codes of
rate $R_{\mathrm{code}}$ and block length $n$ consists of a two-step process. First, find the normalized degree distribution
pair $(\Omega(x), \Psi(x))$ of rate $R_{\mathrm{code}}$ with the best performance. Then, if $n$ is large, randomly sample a Tanner
graph G that satisfies the degree distribution defined by $\omega(x)$ and $\psi(x)$ (up to rounding error), and find
the corresponding parity-check matrix H. The random Tanner graph sampling technique is non-trivial
in the design of low-rate codes that approach Shannon capacity at low SNR.
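The degree-distribution definitions can be made concrete by extracting them from a small parity-check matrix. The sketch below is illustrative: the 3 x 6 matrix is a toy example, not a code from the thesis, the function name is hypothetical, and $n - k$ is taken to be the number of rows of H, which assumes H has full row rank.

```python
from collections import Counter
from fractions import Fraction

def degree_distributions(h):
    """Return (omega, psi, Omega, Psi) for a binary parity-check matrix h.

    omega[i] and psi[i] count the variable and check nodes of degree i;
    Omega and Psi are the same counts normalized by n and by the number of rows.
    """
    n_checks, n_vars = len(h), len(h[0])
    var_degrees = [sum(row[j] for row in h) for j in range(n_vars)]  # column weights
    chk_degrees = [sum(row) for row in h]                            # row weights
    omega, psi = Counter(var_degrees), Counter(chk_degrees)
    big_omega = {d: Fraction(c, n_vars) for d, c in omega.items()}
    big_psi = {d: Fraction(c, n_checks) for d, c in psi.items()}
    return omega, psi, big_omega, big_psi

# Toy parity-check matrix with n = 6 variable nodes and 3 check nodes.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
omega, psi, big_omega, big_psi = degree_distributions(H)
print(dict(omega), dict(psi), big_omega, big_psi)
```

Both sides of the Tanner graph must account for the same edges, so $\sum_i i\,\omega_i = \sum_i i\,\psi_i$; this is a convenient sanity check when sampling a graph from a target distribution.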
2.7.2 Multi-Edge Framework
The multi-edge framework can be applied to both regular and irregular LDPC codes with uniform and
non-uniform vertex degree distributions, respectively, by introducing multiple edge types into the Tanner
graph specifying the code [71]. In a standard LDPC code, the polynomial degree distributions are limited
to a single edge type, such that all variable and check nodes are statistically interchangeable. In order
to improve performance, multi-edge LDPC codes extend the polynomial degree distributions to multiple
independent edge types with an additional edge-type matching condition [71].
To describe the design of multi-edge LDPC codes, let the potential connections of a variable or
check node be called its sockets. Let the vector $d = (d_1, d_2, \ldots, d_t)$ be a multi-edge node degree of
$t$ types. A node of degree $d$ has $d_1$ sockets of type 1, $d_2$ sockets of type 2, etc. When generating
a Tanner graph, only sockets of the same type can be connected by an edge of that type. Multi-edge
normalized degree distributions are straightforward generalizations based on multi-edge degrees,
$$\Omega(x_1, x_2, \ldots, x_t) = \sum_{d} \Omega_{d}\, x_1^{d_1} x_2^{d_2} \cdots x_t^{d_t} \quad \text{and} \quad \Psi(x_1, x_2, \ldots, x_t) = \sum_{d} \Psi_{d}\, x_1^{d_1} x_2^{d_2} \cdots x_t^{d_t},$$
where $\Omega_{d_1,d_2,\ldots,d_t}$
and $\Psi_{d_1,d_2,\ldots,d_t}$ are the respective fractions of variable and check nodes with $d_1$ edges of type 1, $d_2$ edges
of type 2, etc. The rate of a multi-edge LDPC code is then defined as $R_{\mathrm{code}} = \Omega(\mathbf{1}) - \Psi(\mathbf{1})$, where $\mathbf{1}$
denotes the all-ones vector with implied length [71].
The multi-edge LDPC code used in this thesis is rate 0.02 with normalized degree distribution
$$\Omega(x_1, x_2, x_3) = \frac{9}{400} x_1^{2} x_2^{57} x_3^{0} + \frac{7}{400} x_1^{3} x_2^{57} x_3^{0} + \frac{24}{25} x_1^{0} x_2^{0} x_3^{1} \tag{2.18}$$
$$\Psi(x_1, x_2, x_3) = \frac{3}{320} x_1^{3} x_2^{0} x_3^{0} + \frac{17}{1600} x_1^{7} x_2^{0} x_3^{0} + \frac{3}{5} x_1^{0} x_2^{2} x_3^{1} + \frac{9}{25} x_1^{0} x_2^{3} x_3^{1}. \tag{2.19}$$
This degree distribution was designed by Jouguet et al. by modifying a rate 1/10 multi-edge degree
structure introduced by Richardson and Urbanke [11, 71]. For the BIAWGNC, the minimum SNR for
which the tree-like Tanner graph with this multi-edge degree distribution is error free is $2.863 \times 10^{-2}$ or
−15.47 dB [11].
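The stated rate can be confirmed from the coefficients of Eqs. 2.18 and 2.19 using exact rational arithmetic. This is a small check, not part of the thesis: the dictionaries simply transcribe the coefficients as printed, keyed by the exponent tuple $(d_1, d_2, d_3)$, and the function name is hypothetical.

```python
from fractions import Fraction as F

# Coefficients of Eq. 2.18 and Eq. 2.19, keyed by the exponents (d1, d2, d3).
OMEGA = {(2, 57, 0): F(9, 400), (3, 57, 0): F(7, 400), (0, 0, 1): F(24, 25)}
PSI = {(3, 0, 0): F(3, 320), (7, 0, 0): F(17, 1600),
       (0, 2, 1): F(3, 5), (0, 3, 1): F(9, 25)}

def evaluate_at_ones(distribution):
    """Evaluate the polynomial at x1 = x2 = x3 = 1, i.e. sum its coefficients."""
    return sum(distribution.values())

rate = evaluate_at_ones(OMEGA) - evaluate_at_ones(PSI)
print(evaluate_at_ones(OMEGA), evaluate_at_ones(PSI), rate)
```

The variable-node fractions sum to one and the check-node fractions sum to 49/50, so $R_{\mathrm{code}} = \Omega(\mathbf{1}) - \Psi(\mathbf{1}) = 1/50 = 0.02$, in agreement with the stated rate.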
The LDPC parity-check matrices in this thesis were generated by randomly sampling Tanner graphs
that satisfied the multi-edge degree distribution defined by ω(x) and ψ(x), and the edge-type matching
condition. The random sampling technique does not degrade code performance in this case, since the
operating FER is known to be high ($P_e \approx 10^{-1}$). At such high FER, the error-floor phenomenon is
not a significant concern as the code is strictly designed to operate in the waterfall region in order to
achieve high β-efficiency [120]. The rate 0.02 LDPC codes explored in this thesis target a block length of
$n = 1 \times 10^{6}$ bits in order to achieve near-Shannon capacity error-correction performance. As a result, the
parity-check matrix H has dimensions $n - k = n(1 - R_{\mathrm{code}}) = 9.8 \times 10^{5}$ by $n = 1 \times 10^{6}$. Due to the low
code rate and large block length, the random parity-check matrix construction introduces LDPC decoder
implementation complexity, which directly affects decoding latency and maximum achievable secret key
rate. The LDPC decoder implementation complexity for such a code can be reduced with minimal
degradation in error-correction performance by imposing a quasi-cyclic structure to the parity-check
matrix.
2.8 Summary
This chapter outlined the key challenges of implementing LDPC decoders in silicon, and presented the
mathematical foundation for LDPC decoding in CV-QKD. Chapters 3 and 4 extend these concepts to
new architectural and implementation techniques for both the integrated circuit and long-distance CV-
QKD application areas, respectively. Chapter 3 introduces a new frame-interleaved decoding architecture
targeting low-power, multi-Gb/s integrated circuit applications using short block-length codes for high
SNR channels. Chapter 4 presents a new multi-edge code construction technique to reduce the latency
of GPU-based decoding with long block-length codes for low SNR channels. Both Chapters 3 and 4
demonstrate techniques to exploit the intrinsic structure of LDPC parity-check matrices to improve
decoding performance.
Chapter 3
LDPC Decoder Architecture with
Path-Unrolled Message Passing
The renaissance of LDPC decoders in the early 2000s was primarily fueled by the benefits of Dennard
scaling: increasing transistor speeds and process shrink made high-performance silicon-based LDPC
decoders viable [121]. In the post-Dennard era, however, CMOS technology scaling offers diminishing
returns in terms of performance. Today's digital SoCs are ruled by energy efficiency, a metric
that requires new architectural techniques for the scalable implementation of low-power LDPC decoders.
Traditional LDPC decoder architectures are plagued with routing and message-permutation complex-
ity. Although interconnect scaling stagnated at the 45nm node, the 2016 IEEE International Roadmap
for Devices and Systems (IRDS) forecasts planar 2-dimensional transistor scaling to continue down to
the 10nm node, with 3-dimensional (3D) transistor technologies such as vertical-gate and monolithic-3D
expected to emerge over the next 10-to-15 years [13, 122–124]. This motivates the introduction of a
new low-power design paradigm for multi-Gb/s LDPC decoders based on dark silicon design principles,
where overall power consumption is reduced at the expense of increased silicon area by systematically
powering down inactive logic in order to minimize dynamic switching power [125].
This chapter presents a new LDPC decoder architecture, which addresses the global routing and
scalability problem using a reformulated message-passing schedule to achieve greater computational
parallelism at low clock rate. The proposed architecture exploits the intrinsic structure of the quasi-
cyclic (QC) parity-check matrix by splitting long routing wires into multiple shorter segments to reduce
interconnect delay. QC-LDPC codes are used as a vehicle to illustrate the general approach; however, the
proposed architecture and decoding schedule can also be extended to non-QC codes. The IEEE 802.11ad
standard for multi-Gb/s wireless systems is used as a case study to demonstrate the application of the
proposed architecture, but the techniques described in this chapter also scale to longer LDPC
codes for wireline and optical channels.
3.1 Proposed LDPC Decoder Architecture
In a traditional LDPC decoder, update messages are iteratively sent back and forth between CN and
VN processing groups with explicitly defined CN and VN processing units. The proposed decoder archi-
tecture partitions the global message-passing network into structured, local interconnect groups defined
between successive macro-columns of the QC parity-check matrix. A time-distributed decoding schedule
is introduced to exploit the spatial locality of update messages by combining the CN-phase and VN-phase
update logic into a single processing unit with deterministic memory access. The reformulation of the
Min-Sum flooding schedule leverages the QC interconnect structure to minimize routing congestion and
permutation logic overhead, while enabling frame-level parallelism for multi-rate, multi-Gb/s decoding.
3.1.1 Hardware Mapping with Path-Unrolled Decoding Schedule
The proposed decoding schedule unrolls the CN update in time by distributing the CN-to-VN mcv
message computation across all of the VNs that participate in the CN update. This is in contrast to
the traditional approach where all mcv updates for CN c are computed in a single CN processing unit.
Figure 3.1 presents an illustrative example of the interconnect routing patterns and combined processing
unit arrangement for the proposed LDPC decoder architecture. For the QC parity-check matrix defined
in Fig. 3.1(a), the corresponding Tanner graph is shown in Fig. 3.1(b), where messages exchanged between
connected CNs and VNs trace a closed path along the edges of the graph. As highlighted in Fig. 3.1(b),
one such path is defined by the following node sequence: VN0−CN2−VN4−CN2−VN8−CN2−VN0. By
unrolling this path, the intermediate CN2 node can be removed by absorbing and distributing the CN
update operation among its connected VNs. This modification does not alter the result of the Min-Sum
algorithm, but simply introduces a piecewise calculation of each CN-to-VN message defined in Step 2 of
Algorithm 2 (Chapter 2).
Figure 3.1(c) shows that combined processing nodes are arranged in column groups, which correspond
to the macro-columns of the QC parity-check matrix. The combined CN+VN processing units are
indexed according to their VN ordering. Each combined CN+VN processing unit contains the original
VN-update logic to compute Steps 3 and 4 of Algorithm 2, as well as intermediate CN-update logic for
the unrolled CN-to-VN mcv message computation. The interconnect structure between column groups
in the proposed architecture is defined by the cyclic permutation between the connected VNs in adjacent
columns. Each of the partitioned networks between successive column groups is hard wired, such that
all combined CN+VN processing units along each unrolled path are connected to a single CN in the
original Tanner graph. For example, the Tanner graph edges labelled in Fig. 3.1(b) trace the following
unrolled path in the Layer 0 Routing structure shown in Fig. 3.1(c): VN0−VN4−VN8−VN0.
The QC parity-check matrix construction guarantees that each VN is connected to at most one CN
in each layer (macro-row) of the matrix. Hence, each layer of the QC parity-check matrix requires
an independent routing layer in the proposed structure, as shown by the Layer 0 and Layer 1 routing
patterns in Fig. 3.1(c). This layer-parallel approach is a realization of the flooding schedule, and thus,
each combined CN+VN processing unit requires independent CN-update logic for each active processing
layer. For example, in Fig. 3.1(c), the node corresponding to VN0 contains CN-update logic for both
CN2 and CN4 paths in Layer 0 and Layer 1, respectively.
Bypass routing is required between combined processing node groups in layers where an all-zero sub-
matrix appears in the QC parity-check matrix, in order to ensure the continuity of the closed path in the
message-passing interconnect and to guarantee an equal number of column hops in the path traversal.
In the bypass case, the processing node in the successive column is simply included in the path, and
neither CN-update nor VN-update computations are performed. Bypass routing introduces an artificial
edge into the Tanner graph, such that the number of processing nodes in each closed path is equal.
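The path construction and the bypass rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base matrix below is a made-up two-layer example (not the matrix of Fig. 3.1), the value -1 is used here to mark an all-zero sub-matrix, and the shift convention (local row r connects to local column (r + s) mod q) is assumed rather than taken from the thesis.

```python
Q = 3  # expansion factor of the toy example (hypothetical)

# Hypothetical 2-layer, 3-macro-column base matrix of cyclic shifts.
# -1 marks an all-zero sub-matrix. This is NOT the matrix of Fig. 3.1.
BASE = [
    [0, 1, 2],
    [2, -1, 0],
]

def unrolled_path(layer, row, base=BASE, q=Q):
    """VN visited in each macro-column along the unrolled path of one CN.

    Assumed shift convention: in a sub-matrix with shift s, local row r
    is connected to local column (r + s) % q. A bypass hop (all-zero
    sub-matrix) is reported as None, so every path has one entry per
    macro-column and therefore the same number of hops.
    """
    path = []
    for col, shift in enumerate(base[layer]):
        if shift < 0:
            path.append(None)  # bypass: neither CN nor VN update here
        else:
            path.append(col * q + (row + shift) % q)
    return path

def all_paths(base=BASE, q=Q):
    """Map each CN index to its unrolled path."""
    return {layer * q + row: unrolled_path(layer, row, base, q)
            for layer in range(len(base)) for row in range(q)}

for cn, path in all_paths().items():
    print(f"CN{cn}: {path}")
```

Because each non-zero sub-matrix is a cyclically shifted identity, every path picks exactly one VN per macro-column, which is what allows the CN update to be distributed over a fixed number of column hops.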
[Figure 3.1: (a) quasi-cyclic parity-check matrix with expansion factor q = 3 and its expanded binary parity-check matrix H, including an all-zero sub-matrix that requires bypass routing; (b) Tanner graph with VN0 to VN8 and CN0 to CN5, organized in Layer 0 and Layer 1 and Columns 0 to 2; (c) Layer 0 and Layer 1 routing between combined CN+VN processing units in Columns 0 to 2 through hard-wired cyclic permutation networks, with bypass routing.]
Figure 3.1: Simplified example of the proposed LDPC decoder architecture, based on: (a) sample two-layer QC parity-check matrix and (b) Tanner graph with one decoding path highlighted. The proposed layer routing patterns and arrangement of combined CN+VN processing units in the systolic array architecture are shown in (c). The closed path given by VN0−CN2−VN4−CN2−VN8−CN2−VN0 in (b) is unrolled in (c) such that CN2 is absorbed into its connected VNs, resulting in the following unrolled path: VN0−VN4−VN8−VN0.

The proposed architecture can further be described as a systolic array with homogeneous CN+VN
processing units connected to the corresponding VNs in neighboring macro-columns along the unrolled
CN path. Routing complexity is constrained between successive columns by serializing the large fan-
in/fan-out wiring to/from each CN. One decoding iteration requires two passes through the proposed
architecture. The first pass corresponds to the CN-update phase, and the second pass corresponds to
the VN-update phase. The following section describes the mathematical modification to the flooding
Min-Sum algorithm as a result of the path-unrolled message-passing schedule.
3.1.2 Time-Distributed Piecewise Min-Sum Computation
Each closed routing path is defined by the VNs connected to a single CN c, i.e., the VNs in the set
N(c). By absorbing the CN-update operation among all connected VNs, the traditional CN-to-VN mcv
and VN-to-CN Lvc messages defined in Algorithm 2 are not explicitly computed, nor routed through
the proposed structure. Instead, the Lvc and mcv computations are discretized through a reformulation
of the flooding Min-Sum algorithm. Table 3.1 presents one decoding iteration of the piecewise, time-
distributed Min-Sum schedule for a complete CN routing path traversal over T = n/q total columns.
In the first CN-update phase, the sign sc(t), first minimum magnitude min1c(t), and second minimum
magnitude min2c(t) for every CN c are updated sequentially at every column t over T processing-node
hops in T successive columns. Each combined processing unit stores its own Lvc value from the previous
iteration. In each iteration, when the path traversal hop arrives at a particular node in column t, the
internally stored Lvc value is used to update the intermediate sign and minimum magnitudes, which
are then sent to the next successive column t + 1. The path traversal then hops to the next connected
processing node in the successive column, and the updates continue until the last column T − 1 of the
structure is reached, at which point, the final sign sc(T − 1), first minimum magnitude min1c(T − 1),
and second minimum magnitude min2c(T − 1) for CN c are known.
In the second VN-update phase, the sign and minimum magnitude values that were computed in the
first CN-update pass are not updated any further, but rather held constant and broadcast through the
CN path to all connected processing units over T successive hops. Each processing unit first computes
its own, unique CN-to-VN mcv message based on Eq. 2.2. The computed mcv value is then used to
calculate the intermediate LLR Lv, hard-decision bit Cv, and new VN-to-CN Lvc message according
to the expressions outlined in Steps 3 and 4 of Algorithm 2. The updated Lvc value is stored in the
processing unit’s memory to be used in the next iteration. A piecewise parity computation is performed
in order to eliminate the explicit parity check defined by Step 5 in Algorithm 2. The parity pc(t)
corresponding to CN c is updated sequentially along the closed path, such that the final parity across all
VNs connected to CN c is determined by the last column of the path traversal. The parity result from
the current iteration is then known immediately at the start of the CN-update phase (first pass) of the
next iteration.
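The two passes described above can be sketched for a single CN path as follows. This is a minimal sketch, not the thesis implementation: it assumes the plain Min-Sum rule without normalization or offset (Eq. 2.2 is not reproduced here), the constant `MAX_MAG` and all function names are hypothetical, and the hard-decision bits are passed in directly, whereas in the actual schedule each C_v(t) is derived from the updated LLR in the same hop.

```python
MAX_MAG = 15  # hypothetical saturation value for a 4-bit magnitude

def sgn(x):
    """Sign as +1 or -1 (zero is treated as positive)."""
    return -1 if x < 0 else 1

def cn_pass(lvc):
    """First pass: accumulate sign, min1, min2 hop by hop along one CN path."""
    s, min1, min2 = sgn(lvc[0]), abs(lvc[0]), MAX_MAG
    for x in lvc[1:]:
        s *= sgn(x)
        # keep the two smallest values of {|x|, min1, min2}
        min1, min2 = sorted((abs(x), min1, min2))[:2]
    return s, min1, min2

def vn_pass(lvc, hard_bits, s, min1, min2):
    """Second pass: each hop derives its own m_cv and updates the parity."""
    mcv, parity = [], 0
    for x, bit in zip(lvc, hard_bits):
        mag = min2 if abs(x) == min1 else min1  # exclude own magnitude
        mcv.append(s * sgn(x) * mag)            # remove own sign from the product
        parity ^= bit                           # p_c(t) = p_c(t-1) XOR C_v(t)
    return mcv, parity

def reference_mcv(lvc):
    """Direct Min-Sum check-node update over all other VNs, for comparison."""
    out = []
    for i in range(len(lvc)):
        others = lvc[:i] + lvc[i + 1:]
        sign = 1
        for y in others:
            sign *= sgn(y)
        out.append(sign * min(abs(y) for y in others))
    return out

lvc = [3, -5, 2, -7]
s, m1, m2 = cn_pass(lvc)
mcv, parity = vn_pass(lvc, [0, 1, 0, 1], s, m1, m2)
print(mcv, reference_mcv(lvc), parity)
```

The comparison against `reference_mcv` illustrates the point made earlier in the text: distributing the CN update over the path changes when the values are computed, not what they are.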
Figure 3.2 presents an illustration of the reformulated decoding procedure for one decoding iteration.
A single layer message for CN2 of the form pc(t), sc(t),min1c(t),min2c(t) is successively consumed and
updated in each column t of T total columns among the T combined CN+VN processing units along the
closed CN2 path. The final value of each of the four components at column T − 1 is given by Equations
2.1, 2.3, 2.4, and 2.5. VN-update logic is inactive during the CN-update pass, while CN-update logic is
inactive during the VN-update pass. In addition, column t = 0 does not necessarily correspond to the
absolute first column of the parity-check matrix, but rather refers to the starting column for a particular
layer message. Since the distributed update computations in each column are independent, multiple
frames can be interleaved in the proposed structure to ensure a constant workload over the uniformly
partitioned processing nodes to maximize hardware utilization and minimize idle logic. The pipelined
frame interleaving pattern is discussed later in this chapter.

Table 3.1: Piecewise time-distributed reformulation of Min-Sum algorithm with flooding schedule for single layer routing path

| Iteration Phase | Column $t$ | Parity $p_c(t)$ | Sign $s_c(t)$ | First Minimum Magnitude $\mathrm{min1}_c(t)$ | Second Minimum Magnitude $\mathrm{min2}_c(t)$ |
|---|---|---|---|---|---|
| Check Update (compute $s_c(t)$, $\mathrm{min1}_c(t)$, $\mathrm{min2}_c(t)$) | $0$ | − | $\mathrm{sgn}(L_{vc}(0))$ | $\lvert L_{vc}(0)\rvert$ | $\lvert\text{MAX LLR MAGNITUDE}\rvert$ |
| | $1$ | − | $s_c(0) \times \mathrm{sgn}(L_{vc}(1))$ | $\min(\lvert L_{vc}(1)\rvert, \mathrm{min1}_c(0))$ | $\min(\lvert L_{vc}(1)\rvert, \mathrm{min1}_c(0), \mathrm{min2}_c(0) \setminus \mathrm{min1}_c(1))$ |
| | ... | ... | ... | ... | ... |
| | $t$ | − | $s_c(t-1) \times \mathrm{sgn}(L_{vc}(t))$ | $\min(\lvert L_{vc}(t)\rvert, \mathrm{min1}_c(t-1))$ | $\min(\lvert L_{vc}(t)\rvert, \mathrm{min1}_c(t-1), \mathrm{min2}_c(t-1) \setminus \mathrm{min1}_c(t))$ |
| | ... | ... | ... | ... | ... |
| | $T-1$ | − | $s_c(T-2) \times \mathrm{sgn}(L_{vc}(T-1))$ | $\min(\lvert L_{vc}(T-1)\rvert, \mathrm{min1}_c(T-2))$ | $\min(\lvert L_{vc}(T-1)\rvert, \mathrm{min1}_c(T-2), \mathrm{min2}_c(T-2) \setminus \mathrm{min1}_c(T-1))$ |
| Variable Update (compute $p_c(t)$; propagate $s_c(t)$, $\mathrm{min1}_c(t)$, $\mathrm{min2}_c(t)$) | $0$ | $C_v(0)$ | $s_c(T-1)$ | $\mathrm{min1}_c(T-1)$ | $\mathrm{min2}_c(T-1)$ |
| | $1$ | $p_c(0) \oplus C_v(1)$ | $s_c(T-1)$ | $\mathrm{min1}_c(T-1)$ | $\mathrm{min2}_c(T-1)$ |
| | ... | ... | ... | ... | ... |
| | $t$ | $p_c(t-1) \oplus C_v(t)$ | $s_c(T-1)$ | $\mathrm{min1}_c(T-1)$ | $\mathrm{min2}_c(T-1)$ |
| | ... | ... | ... | ... | ... |
| | $T-1$ | $p_c(T-2) \oplus C_v(T-1)$ | $s_c(T-1)$ | $\mathrm{min1}_c(T-1)$ | $\mathrm{min2}_c(T-1)$ |

[Figure 3.2: (a) Tanner graph with the closed path through CN2 marked in order (1, 2, 3), starting and ending at VN8; (b) combined CN+VN processing units in columns t = 0, 1, 2. In the first pass (CN update, VN logic inactive) the layer message −, s_c(t), min1_c(t), min2_c(t) is passed for t = 0, 1, 2. In the second pass (VN update, CN logic inactive) the layer message p_c(t), s_c(t=2), min1_c(t=2), min2_c(t=2) is passed for t = 0, 1, 2.]
Figure 3.2: (a) The closed path through CN2 in the Tanner graph for one pass (phase) of decoding. (b) The unrolled piecewise messages that are passed between combined CN+VN processing units in successive columns of the architecture corresponding to the closed path highlighted in (a). Here, t = 0 arbitrarily corresponds to the third column of T = 3 total columns.
3.1.3 Parity-Check Matrix Partitioning and Hardware Mapping
This section describes how the QC-LDPC parity-check matrices for IEEE 802.11ad are partitioned and
mapped to the proposed architecture. The IEEE 802.11ad standard specifies a fixed block length of
672 bits for four code rates, and 24 modulation and coding scheme (MCS) modes with decoded (raw)
bit rates between 385Mb/s and 6.757Gb/s. The hardware mapping described in this section targets
the peak throughput modes of IEEE 802.11ad, and the LDPC decoder in this thesis is designed for the
maximum 6.757Gb/s throughput requirement with a maximum frame latency of 1µs for all four code
rates.
As shown in Fig. 3.3, the QC-LDPC parity-check matrix for each of the four code rates can be derived
from a single 8-layer base matrix, by decreasing the sparsity of higher-rate matrices by removing layers
and adding non-zero sub-matrices.

[Figure 3.3: base-matrix diagrams for rates 1/2, 5/8, 3/4, and 13/16, each with 16 macro-columns (1 column group pair = 2 QC macro-columns). Legend: # = active routing, where the number is the cyclic permutation index of a 42x42 cyclic identity matrix; B = bypass routing for an all-zero sub-matrix; a separate marker denotes inactive routing. Annotations: rate 1/2 has 8 connection layers and at most 4 active processing layers per column; rate 5/8 has 6 connection layers, at most 4 active processing layers per column, and 2 inactive wired layers; rate 3/4 has 4 connection layers, at most 4 active processing layers per column, and 4 inactive wired layers; rate 13/16 has 3 connection layers, at most 3 active processing layers per column, and 5 inactive wired layers.]
Figure 3.3: IEEE 802.11ad QC parity-check matrices with hardware mapping for proposed architecture [23]. The sub-matrix value indicates the cyclic permutation index. The four matrices are derived from a single 8-layer base matrix by removing layers in higher-rate matrices, or by removing cyclically-shifted submatrices in lower-rate matrices.

[Figure 3.4: block diagram. A global control unit (frame input/output buffering, rate selection, bypass enabling, frame decoding termination) drives a datapath of column slices 0 to 15. Each column slice contains CN+VN processing units, a VN-to-CN message memory, a hard decision memory, and a channel LLR memory. Hard-wired cyclic permutation networks connect Col 0 → Col 1, Col 1 → Col 2, and so on up to Col 15 → Col 0, and pipeline registers separate the column-slice pairs.]
Figure 3.4: System block diagram for proposed architecture showing the global control unit, and the datapath containing: column slices with combined CN+VN processing units and memories, a hard-wired cyclic permutation network between each column slice, and pipeline registers between column-slice pairs.

[Figure 3.5: column slice t with q = 42 processing nodes. Blocks: channel LLR (Qv) memory, hard decision (Ĉv) memory, VN-to-CN (Lvc) message memory, combined CN+VN processing unit with check node update logic and variable node update logic over 1-to-4 processing layers, pipeline registers, coarse-grained clock gating logic, hard-wired cyclic permutation networks from the previous column slice t−1 and to the next column slice t+1, soft decision LLR inputs from the I/O interface, and hard decision output bits to the I/O interface. Signals are labelled A to L.]
Figure 3.5: Column slice t comprised of CN+VN processing units, local memory, and wired permutation networks between adjacent column slices. Pipeline registers are connected only to the first column in a column-slice pair. Hard-wired interconnect does not contain multiplexing logic. Hard-wired connections are specified by the parity-check matrix connectivity. The operations in column slice t are computed in one clock cycle.

Each of the 16 macro-columns of the QC-LDPC parity-check matrix
maps directly to a single column slice in the proposed architecture shown in Fig. 3.4. Each column slice
contains the memories and combined CN+VN processor logic for one macro-column of the matrix. The
VN ordering in each column slice is consistent with the VN ordering in the matrix macro-column, such
that each column slice contains 42 combined CN+VN processing units. Adjacent columns are connected
through hard-wired routing in each of the 8 defined connection layers, which correspond to 8×42=336
CN paths. The connectivity between successive columns is specific to the parity-check matrix, i.e.,
the interconnect mapping between CN+VN processors in adjacent column slices is not any-to-any, but
rather, is mandated by the column-to-column connectivity defined by the parity-check matrix. Hence,
there is no multiplexer fanout between connected CN+VN processors in adjacent columns along the
same CN path. Depending on the code rate, only a subset of layers may be active. The rate 1/2 code
requires all 8×42=336 CN paths, while the rate 13/16 code requires only 3×42=126 CN paths. Inactive
layers/paths are disabled (turned off) through clock gating to eliminate unnecessary logic switching and
message passing. Each column in the four parity-check matrices shown in Fig. 3.3 has at most 4 active
CN connections, i.e., the maximum VN degree is 4, thus each combined processing unit requires at most
4 processing layers. Processing nodes in the last 4 columns require only 1, 2, or 3 processing layers, due
to the lower-triangular matrix construction.
The proposed architecture exploits the QC structure of the IEEE 802.11ad LDPC parity-check ma-
trices by constraining routing to local interconnect between adjacent columns. Since each of the four
parity-check matrices is derived from a single base matrix, multi-rate functionality is intrinsic to the
hard-wired cyclic permutation networks between adjacent columns. The same wiring is used between
successive columns for multiple code rates, thus eliminating the need for additional permutation and
control logic to switch between code rates. As the decoder switches from a low-rate code to a high-rate
code, e.g., from rate 1/2 to rate 3/4, the unused wired layers of the high-rate code are disabled. The
decoder continues to pass messages along the active processing layers, and hardware utilization in the
combined processing nodes remains at either 100% (for rate 1/2, 5/8, and 3/4 codes) or 75% (for the
rate 13/16 code) since the four rates have either 3 or 4 active processing layers.
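The per-rate figures quoted in this section follow from simple arithmetic on the layer counts annotated in Fig. 3.3. The sketch below tabulates them; the dictionary layout and function names are illustrative assumptions, while the layer counts themselves are taken from the text and figure annotations.

```python
Q = 42                      # expansion factor for IEEE 802.11ad
TOTAL_LAYERS = 8            # wired connection layers in the base matrix
MAX_PROCESSING_LAYERS = 4   # processing layers instantiated per CN+VN unit

# rate -> (connection layers, max active processing layers per column)
RATES = {
    "1/2":   (8, 4),
    "5/8":   (6, 4),
    "3/4":   (4, 4),
    "13/16": (3, 3),
}

def cn_paths(rate):
    """Number of active CN paths: one per row of each active layer."""
    return RATES[rate][0] * Q

def inactive_layers(rate):
    """Wired layers that are clock-gated for this rate."""
    return TOTAL_LAYERS - RATES[rate][0]

def utilization(rate):
    """Fraction of instantiated processing layers that are active."""
    return RATES[rate][1] / MAX_PROCESSING_LAYERS

for rate in RATES:
    print(rate, cn_paths(rate), inactive_layers(rate), f"{utilization(rate):.0%}")
```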
3.1.4 Column Slice Architecture
Figure 3.5 presents the architecture of a single column slice, which contains local memories and processing
nodes for the VNs associated with that particular macro-column. Table 3.2 provides an overview of
the messages exchanged between the combined processing units and local memories in the column slice,
based on the labelling in Fig. 3.5. Table 3.3 provides a parametric overview of each column-slice memory.
These configurations are specific to the IEEE 802.11ad standard, and will vary depending on the CN
and VN connectivity defined by the parity-check matrix. All column-slice memories are simple dual-
port eSRAM register files with 1 read port and 1 write port. In each column slice, input LLRs are
buffered in to the LDPC decoder through the Qv memory write port, while output hard decisions are
buffered out through the Cv memory read port. Outgoing update messages in column slice t are of the
form pc(t), sc(t),min1c(t),min2c(t). These messages are computed based on incoming messages from
the previous column slice t − 1 and values stored in the column-slice memories. In both the CN- and
VN-update phases, outgoing update messages from column slice t are computed in one clock cycle.
The number of instantiated CN+VN processing units is always constant and equal to the expansion
factor q of the QC parity-check matrix. In this case, q = 42 processing units are instantiated in each
column slice, while the number of instantiated processing layers ranges from 1 to 4 depending on the
column slice index. Column slices corresponding to the last four macro-columns of the QC parity-check
matrix require fewer hardware resources (memory and routing) than the slices for the first twelve columns.
The fixed-point bit width of internal column-slice messages is chosen to be 5 bits in this implementation
as there is less than 0.01dB degradation in fixed-point error-rate performance of the Min-Sum algorithm
in comparison to floating point across all four code rates. The min1c(t) and min2c(t) messages are both
4 bits wide, while the pc(t) and sc(t) are 1 bit each.
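To make the 10-bit layer message concrete, the following sketch packs the four fields into a single word and unpacks them again. The field order (parity, sign, min1, min2 from most to least significant bit) is an assumption made for illustration; the thesis text only gives the field widths.

```python
def pack_layer_message(parity, sign_bit, min1, min2):
    """Pack (p_c, s_c, min1_c, min2_c) into one 10-bit word.

    Assumed layout: [9] parity, [8] sign, [7:4] min1, [3:0] min2.
    """
    if parity not in (0, 1) or sign_bit not in (0, 1):
        raise ValueError("parity and sign are 1-bit fields")
    if not (0 <= min1 <= 15 and 0 <= min2 <= 15):
        raise ValueError("min1 and min2 are 4-bit magnitudes")
    return (parity << 9) | (sign_bit << 8) | (min1 << 4) | min2

def unpack_layer_message(word):
    """Inverse of pack_layer_message."""
    return (word >> 9) & 1, (word >> 8) & 1, (word >> 4) & 0xF, word & 0xF

word = pack_layer_message(1, 0, 3, 12)
print(f"{word:010b}", unpack_layer_message(word))
```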
Table 3.2: Column-slice messages highlighted in Fig. 3.5

| Label | Message Description | Format |
|---|---|---|
| A, B | Input channel LLRs for VNs in current column slice t | 5-bit message: Qv × 42 VNs |
| C | Incoming connection layer messages from column slice t−1 | 10-bit message: pc(t−1), sc(t−1), min1c(t−1), min2c(t−1) × 42 VNs × 8 connection layers |
| D | Computed intermediate VN-to-CN messages in column slice t | 5-bit message: Lvc × 42 VNs × (1-to-4) processing layers |
| E | Outgoing connection layer messages from column t to t+1 | 10-bit message: pc(t), sc(t), min1c(t), min2c(t) × 42 VNs × 8 connection layers |
| F, G | Output hard decisions for VNs in current column slice t | 1-bit decision: Cv × 42 VNs |
Table 3.3: Memory specification in each column slice

| Memory | Columns | Depth (Frames) | Data Bus Width (Bits) | Total Size (Kb) |
|---|---|---|---|---|
| Qv | 0-15 | 16 | 42 VNs × 5 bits/message = 210 | 3.360 |
| Cv | 0-15 | 16 | 42 VNs × 1 bit/decision = 42 | 0.672 |
| Lvc | 0-11 | 8 | 42 VNs × 4 layers∗ × 5 bits/message = 840 | 6.720 |
| Lvc | 12-13 | 8 | 42 VNs × 3 layers∗ × 5 bits/message = 630 | 5.040 |
| Lvc | 14 | 8 | 42 VNs × 2 layers∗ × 5 bits/message = 420 | 3.360 |
| Lvc | 15 | 8 | 42 VNs × 1 layer∗ × 5 bits/message = 210 | 1.680 |

∗ Active processing layers in combined CN+VN processing unit.
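The entries of Table 3.3 can be reproduced from the memory depth and the per-VN word width. The sketch below does this under the assumption that 1 Kb is taken as 1000 bits, which is what the totals in the table imply; the helper names are hypothetical.

```python
Q = 42          # VNs (processing units) per column slice
LLR_BITS = 5    # fixed-point message width

def lvc_layers(column):
    """Active processing layers for column slice 0..15, per Table 3.3."""
    if column <= 11:
        return 4
    if column <= 13:
        return 3
    return 2 if column == 14 else 1

def memory_size(depth_frames, bits_per_vn):
    """Return (bus width in bits, total size in Kb), with 1 Kb = 1000 bits."""
    width = Q * bits_per_vn
    return width, depth_frames * width / 1000

print("Qv :", memory_size(16, LLR_BITS))
print("Cv :", memory_size(16, 1))
for col in (0, 12, 14, 15):
    print(f"Lvc, column {col}:", memory_size(8, lvc_layers(col) * LLR_BITS))
```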
Coarse-grained clock gating control logic is integrated in each column-slice pair to disable the pipeline
registers and local memories from switching when the current frame has successfully been decoded. Four
independent integrated clock gating cells are used in each column-slice pair to disable the input pipeline
register bank to column t, and the Qv, Cv, and Lvc memories in columns t and t + 1, as highlighted
by labels H, I, K, and L in Fig. 3.5, respectively. Label J in Fig. 3.5 shows the gated clock for a
pipeline register in the Lvc memory write path in the VN-update logic in each combined processing
unit. The clock gating pattern is identical for both columns t and t + 1 in a column-slice pair due to
the pipelined frame interleaving described next. When clock gating is active, there is no switching in
the combinational logic that comprises the CN+VN processing units, and thus, the pair of column slices
between two successive pipeline stages is effectively turned off in that clock cycle. This technique is part
of the early-termination strategy described later in this section.
The CN and VN logic in the combined CN+VN processor share the message-passing interface. The
CN logic is turned off during the VN-update phase, and vice versa. This incurs an area penalty, but it
saves power because each logic block is completely turned off during the phase in which it is not used. The
latency specification of the IEEE 802.11ad standard is met with this single-phase operation. To reduce
latency even further, CN and VN logic can be run in parallel with two independent frames in each column
slice. However, this would introduce additional control complexity to track the phase of independent
frames, multiplexer overhead in the combined CN+VN processing units to multiplex between the two
anti-phase frames, and twice the routing between successive column slices to support message passing
for both CN and VN phases simultaneously. Phase-interlaced decoding was not implemented in this
thesis to avoid these specific issues.
3.1.5 Pipelined Frame Interleaving
The constant workload of the time-distributed decoding schedule enables message pipelining between
successive column-slice pairs, since the number of columns in the proposed architecture is fixed for all
code rates. As shown in Fig. 3.4, pipeline registers are placed after every two columns instead of after
every single column to satisfy the 1µs total decoding latency requirement of the IEEE 802.11ad standard,
assuming 10 decoding iterations and a 200MHz clock rate. As such, a single frame is shared by columns
t and t+ 1 in a column-slice pair. Since the computation in each column is independent and a constant
number of hops is required to traverse each closed path, 8 independent frames can be interleaved in the
structure without any memory access contention.
Figure 3.6 presents the frame interleaving schedule where 8 interleaved frames are cyclically pipelined
across 16 column slices. Both CN- and VN-update phases require 8 clock cycles to complete, thus a
total of 16 cycles is required to complete a single decoding iteration across all 8 interleaved frames. Since
each frame is independent, the frame-interleaved decoding schedule ensures full hardware utilization of
all processing units without any pipeline stall cycles.
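The interleaving pattern can be expressed as a simple modular index. The rule used below (frame f starts in column-slice pair f at cycle 0 and advances by one pair per clock cycle) is inferred from the pattern shown in Fig. 3.6 and should be read as an assumption; the function names are illustrative.

```python
NUM_PAIRS = 8  # 16 column slices grouped into 8 column-slice pairs

def frame_in_pair(cycle, pair):
    """Index of the frame held by column-slice pair `pair` at `cycle`."""
    return (pair - cycle) % NUM_PAIRS

def phase(cycle):
    """CN-update pass for the first 8 cycles of an iteration, then VN-update."""
    return "CN" if cycle % (2 * NUM_PAIRS) < NUM_PAIRS else "VN"

for cycle in range(2 * NUM_PAIRS):
    row = [frame_in_pair(cycle, p) for p in range(NUM_PAIRS)]
    print(f"cycle {cycle:2d} ({phase(cycle)}-update): {row}")
```

Every row of the printed schedule is a permutation of the 8 frame indices, which is the property that gives full utilization and conflict-free memory access: no two column-slice pairs ever work on the same frame in the same cycle.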
For a fixed number of iterations, the minimum decoding throughput and maximum latency of a
frame-interleaved LDPC decoder are given by the following two equations:
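$$\text{Throughput} = \frac{\text{Frames} \times \text{Block Length} \times f_{clk}}{\text{Iterations} \times \text{Cycles Per Iteration}} \quad \text{(bits/s)}$$

$$\text{Latency} = \frac{\text{Iterations} \times \text{Cycles Per Iteration}}{f_{clk}} \quad \text{(seconds)}.$$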
For a block length of 672 bits, the proposed LDPC decoder achieves a throughput of 6.78Gb/s and a
latency of 0.793µs with 8 interleaved frames, while operating at a clock frequency of 202MHz
with 10 decoding iterations. This performance satisfies the 6.757Gb/s peak throughput and 1µs maximum
frame latency targets set for the IEEE 802.11ad standard.
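The two expressions can be evaluated directly for the operating point quoted above. The following is a minimal sketch with illustrative function names; with a nominal 202MHz clock it gives roughly 6.79Gb/s and 0.79µs, in line with the quoted figures.

```python
def throughput_bps(frames, block_length, f_clk, iterations, cycles_per_iter):
    """Decoded bits per second for a frame-interleaved decoder."""
    return frames * block_length * f_clk / (iterations * cycles_per_iter)

def latency_s(f_clk, iterations, cycles_per_iter):
    """Time for one frame to complete all iterations, in seconds."""
    return iterations * cycles_per_iter / f_clk

tp = throughput_bps(frames=8, block_length=672, f_clk=202e6,
                    iterations=10, cycles_per_iter=16)
lat = latency_s(f_clk=202e6, iterations=10, cycles_per_iter=16)
print(f"throughput = {tp / 1e9:.3f} Gb/s, latency = {lat * 1e6:.3f} us")
```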
Frame sequencing control is intrinsically embedded in the hard-wired interconnect between adjacent
column slices. The cyclic memory addressing pattern in each column slice results in conflict-free memory
access and eliminates the need for additional control overhead.

[Figure 3.6: frame-to-column-slice schedule over clock cycles 0 to 15. Cycles 0 to 7 form the first pass (CN-update phase), in which the layer messages (parity, sign, min1, min2) are updated; cycles 8 to 15 form the second pass (VN-update phase), in which the layer messages are propagated. In cycle 0, Frame 0 occupies Col 0 and Col 1 (VN 0 to VN 83), Frame 1 occupies Col 2 and Col 3 (VN 84 to VN 167), and so on up to Frame 7 in Col 14 and Col 15 (VN 588 to VN 671). Each frame then shifts cyclically by one column-slice pair per clock cycle.]
Figure 3.6: Pipelined frame interleaving pattern through column slices in the proposed architecture over 16 clock cycles of one complete LDPC decoding iteration for IEEE 802.11ad. The number in each bubble indicates the frame index. Frame 4 highlights the cyclic frame-shifting property of the architecture.

In every clock cycle, the dual-ported
channel LLR, hard decision, and VN-to-CN message memories in column slices t and t + 1 share the
same read/write address, which corresponds to the index of the current frame in the column-slice pair.
The independent frame processing among column-slice pairs allows frames of different code rates to be
decoded simultaneously, achieving the same throughput with minimal bypass routing control overhead,
since the primary rate control mechanism is embedded in the hard-wired cyclic interconnect between
adjacent columns. The architecture presented in this thesis can thus be classified as both row-parallel
and column-parallel.
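The cyclic frame-shifting pattern can be sketched in a few lines of Python. This is an illustrative model only: the function names are hypothetical, and the mapping assumes that frame k starts in column-slice pair k at clock cycle 0 and advances by one pair per clock cycle, which is one reading of Fig. 3.6 rather than a statement of the implemented control logic.

```python
NUM_FRAMES = 8          # frames interleaved through the decoder
NUM_PAIRS = 8           # 16 column slices grouped into 8 column-slice pairs
CYCLES_PER_PASS = 8     # one pass (CN or VN) visits every pair once


def frame_in_pair(pair: int, cycle: int) -> int:
    """Frame index processed by column-slice pair `pair` in clock cycle `cycle`.

    Assumption: frame k starts in pair k at cycle 0 and moves to the next
    pair (cyclically) on every clock edge. The returned index also serves
    as the shared memory address within the pair.
    """
    return (pair - cycle) % NUM_FRAMES


def schedule(num_cycles: int = 2 * CYCLES_PER_PASS) -> list:
    """Rows are clock cycles, columns are column-slice pairs."""
    return [[frame_in_pair(p, c) for p in range(NUM_PAIRS)]
            for c in range(num_cycles)]


if __name__ == "__main__":
    for cycle, row in enumerate(schedule()):
        phase = "CN" if cycle < CYCLES_PER_PASS else "VN"
        print(f"cycle {cycle:2d} ({phase}): {row}")
```

Under this assumption every row of the schedule is a permutation of the 8 frame indices, so no two column-slice pairs touch the same frame in the same cycle, which is the conflict-free property described above.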
3.1.6 Input/Output Frame Buffering for Continuous Decoding
The channel LLR and hard decision memories are dual-ported to enable input/output (I/O) frame
buffering such that the decoder runs continuously without any idle cycles. Figure 3.7 presents a timing
diagram of the pipelined I/O frame buffering schedule, highlighting the following three steps: loading
channel LLRs for the next 8 frames, decoding the current 8 frames, and reading out hard decisions for
the previous 8 frames. The I/O latency is masked by the compute latency of the decoder since the next
8 frames are buffered into the channel LLR Qv memory while the current 8 frames are being decoded.
Once the current 8 frames terminate, the decoder restarts the decoding process with the next 8 frames
already loaded in Qv memory. Decoded codewords in the hard decision Cv memory are buffered out
while the 8 new frames are decoded.
Figure 3.7: Input/output frame buffering schedule, assuming a uniform decoding latency of 10 iterations with 16 clock cycles per iteration. Each LOAD Qv, DECODE, and READ Ĉv stage spans 160 clock cycles for one set of 8 frames.
As shown in Table 3.3, both the channel LLR Qv and hard decision Cv memories in a single column
slice have twice the address depth compared to the VN-to-CN Lvc memory. The Qv and Cv memories
have a depth of 16 addresses to accommodate the current 8 frames and the next 8 frames in the decoding
queue for all 42 processing nodes, while the Lvc memory stores only intermediate updates for the current
8 frames cycling through the decoder. In the column slice architecture presented in Fig. 3.5, label A
shows the input buffer LLR write port used to load the next 8 frames, label B shows the LLR read
port accessed during decoding of the current 8 frames, label F shows the hard decision write port of the
current 8 decoding frames, and label G shows the output hard decision read port of the 8 previously
decoded frames. This overlapped input/output frame buffering schedule enables continuous decoding
that does not interrupt the frame-interleaved decoding schedule, thus enabling multi-Gb/s throughput
with acceptable latency for the IEEE 802.11ad standard.
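The overlapped load/decode/read schedule can be modelled as a three-stage pipeline over 160-cycle slots. The sketch below is a simplified illustration: the `bank` and `address` helpers assume a ping-pong split of the 16-deep Qv and Cv memories into two banks of 8 addresses, which is a hypothetical mapping and is not taken from Table 3.3.

```python
from typing import Dict, Optional

CYCLES_PER_BATCH = 160   # 10 iterations x 16 clock cycles (uniform latency)
FRAMES_PER_BATCH = 8


def stages_in_slot(slot: int) -> Dict[str, Optional[int]]:
    """Batch index occupying each stage during one 160-cycle slot.

    Batch b is loaded in slot b, decoded in slot b + 1, and read out in
    slot b + 2; None means the stage is still empty during start-up.
    """
    def batch(offset: int) -> Optional[int]:
        b = slot - offset
        return b if b >= 0 else None

    return {"load": batch(0), "decode": batch(1), "read": batch(2)}


def bank(batch_index: int) -> int:
    """Hypothetical ping-pong bank (0 or 1) of the 16-deep Qv/Cv memories."""
    return batch_index % 2


def address(batch_index: int, frame: int) -> int:
    """Hypothetical memory address: 8 addresses per bank, one per frame."""
    return bank(batch_index) * FRAMES_PER_BATCH + frame


if __name__ == "__main__":
    for slot in range(4):
        print(slot, stages_in_slot(slot))
```

In any slot, the batch being loaded and the batch being decoded fall in different banks, as do the batch being decoded and the batch being read out, so the two ports of each memory never address the same frame.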
3.1.7 Combined CN+VN Processing Unit Architecture
Figure 3.8 presents the architecture of the combined CN+VN processing unit, where CN- and VN-update
logic blocks share the memory and layer-message routing interfaces. The CN- and VN-update logic is
partitioned independently such that VN logic is disabled (turned off) during the CN phase, and vice
versa. One clock cycle is required to perform either the CN or VN update in column slice t. As such,
there is no pipelining in the combined CN+VN processing unit, aside from the re-timing register in the
Lvc memory write path.
In every clock cycle, the combined CN+VN processing unit in column t first receives an incoming layer
message pc(t−1), sc(t−1), min1c(t−1), min2c(t−1) from its connected processing node in the previous
column slice t−1. The CN and VN updates are performed using the elements of the incoming layer
message and the internally stored Lvc and Qv values. The updated pc(t), sc(t), min1c(t), min2c(t) layer
message is then transmitted to its connected processing unit in column t+1. In the CN phase, stored Lvc
values for each active processing layer are read in parallel from the Lvc memory. The magnitude of each
Lvc value is compared to the first and second minimum magnitude values min1c(t−1) and min2c(t−1)
of the incoming layer message, while the sign of each Lvc value is compared to the sign element sc(t−1)
of the incoming layer message. The updated sc(t), min1c(t), and min2c(t) values are transmitted to
the next column slice. In the first decoding iteration, the Lvc values are not initialized, and thus the
Qv LLR memory is read instead. In the VN phase, the Lvc message for each active processing layer,
hard decision Cv bit, and parity message element pc(t) are updated based on the computed minimum
Figure 3.8: Combined CN+VN processing unit for time-distributed piecewise decoding, showing CN- and VN-update phase logic, memory interfaces, and data permutation logic between processing units in successive column slices.
Figure 3.9: Processing unit timing diagram for CN- and VN-update phases showing 3 independent frame updates over 3 clock cycles. Each CN+VN processing unit updates a single frame j in each clock cycle. All arrow-highlighted operations occur in each clock cycle, and independently for each frame j, j ∈ {0, 1, ..., 7}. Circled nodes B, C, D, E, and F correspond to the connections shown in the column slice architecture in Fig. 3.5. The following operations are highlighted. (a) Sign sc(t), first minimum magnitude min1c(t), and second minimum magnitude min2c(t) updates through column slice pair. (b) Parity pc(t) updates through column slice pair. (c) Independent Lvc and Cv updates in columns t and t+1. (d) Propagation of sign, first minimum magnitude, and second minimum magnitude messages to next column-slice pair without updates in columns t and t+1.
sign and magnitude values in the previous CN phase. The parity pc(t) element of the layer message is
updated, and transmitted to the next column slice, along with the unmodified sc(t − 1), min1c(t − 1),
and min2c(t − 1) elements that were received from the previous column slice. The early-termination
parity check is performed in the first clock cycle of the CN phase, starting from the second decoding
iteration, once the pc(t) parity bits in each layer message have been computed and have returned home
to their starting position t = 0 in the closed path traversal.
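The VN-phase update of a single variable node can be sketched as follows. This is a minimal model under stated assumptions: it uses plain Min-Sum without offset or normalization, signed Python integers in place of the hardware number formats, and saturation of the stored Lvc magnitude to 4 bits. The function and argument names are hypothetical.

```python
MAX_MAG = 15  # 4-bit magnitude saturation (assumed, cf. 4'b1111 in Fig. 3.8)


def saturate(value: int) -> int:
    """Clamp a signed value to the 4-bit magnitude range."""
    return max(-MAX_MAG, min(MAX_MAG, value))


def vn_update(qv, lvc, layer_msgs):
    """One VN-phase update for a single variable node.

    qv         -- channel LLR (signed integer)
    lvc        -- stored VN-to-CN messages, one signed integer per layer
    layer_msgs -- (parity, sign, min1, min2) per layer, where sign, min1,
                  and min2 are the results of the preceding CN phase
    Returns (new_lvc, hard_decision, new_layer_msgs).
    """
    mcv = []
    for msg, (_, sign, min1, min2) in zip(lvc, layer_msgs):
        own_sign = 1 if msg < 0 else 0
        # Exclude this VN's own contribution: use min2 if it supplied min1.
        mag = min2 if abs(msg) == min1 else min1
        # Product of the other signs = total sign XOR own sign.
        mcv.append(-mag if (sign ^ own_sign) else mag)

    lv = qv + sum(mcv)                      # updated LLR
    hard_decision = 1 if lv < 0 else 0      # sign bit (MSB) of Lv
    new_lvc = [saturate(lv - m) for m in mcv]
    # Only the parity element changes; sign, min1, and min2 pass through.
    new_msgs = [(parity ^ hard_decision, sign, min1, min2)
                for (parity, sign, min1, min2) in layer_msgs]
    return new_lvc, hard_decision, new_msgs


if __name__ == "__main__":
    print(vn_update(3, [3, -2], [(0, 1, 2, 3), (1, 0, 2, 5)]))
```

Note that the sign and minimum-magnitude elements are forwarded unmodified, matching the description above, while each parity element is XOR-ed with the new hard decision.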
Both sign-magnitude (SM) and 2’s complement number formats are used in the combined processing
node logic. Each Lvc message is represented in a 5-bit SM format with 1 sign bit and 4 magnitude bits, in
order to enable direct comparison of the sign and magnitude with the incoming sc(t− 1), min1c(t− 1),
and min2c(t − 1) values. Each Qv LLR is represented in a 5-bit 2’s complement format in order to
avoid SM-to-2’s complement conversion prior to the addition operation in the VN phase. Similarly, the
intermediate mcv value in the VN phase is also represented by 5 bits in 2’s complement format. The hard
decision Cv corresponds to the most significant bit (MSB) of the updated LLR Lv, and is represented by
only 1 bit.
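The two number formats can be illustrated with a pair of conversion helpers. This sketch assumes that the 2's complement value −16, which has no 5-bit sign-magnitude equivalent, saturates to magnitude 15; the text does not specify how the hardware handles this case, and the function names are hypothetical.

```python
BITS = 5
MAG_BITS = BITS - 1
MAX_MAG = (1 << MAG_BITS) - 1   # 15


def twos_to_sm(value: int) -> int:
    """Signed integer -> 5-bit sign-magnitude word (sign in the MSB)."""
    sign = 1 if value < 0 else 0
    mag = min(abs(value), MAX_MAG)      # assumed: -16 saturates to 15
    return (sign << MAG_BITS) | mag


def sm_to_twos(word: int) -> int:
    """5-bit sign-magnitude word -> signed integer."""
    mag = word & MAX_MAG
    return -mag if (word >> MAG_BITS) & 1 else mag


def hard_decision(lv: int) -> int:
    """Sign bit (MSB) of the 2's complement LLR: 1 if negative, else 0."""
    return 1 if lv < 0 else 0


if __name__ == "__main__":
    for v in (7, -3, -16):
        word = twos_to_sm(v)
        print(v, format(word, "05b"), sm_to_twos(word))
```

Storing Lvc in sign-magnitude form exposes the sign bit and magnitude directly to the XOR and compare-select logic, while keeping Qv in 2's complement form lets it feed the VN adder without conversion.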
The combined processing node exploits the spatial locality of Lvc and Qv values stored in partitioned
column-slice memories through a deterministic access pattern during the CN- and VN-update phases.
The combined CN and VN logic minimizes routing complexity between processing units, while also
eliminating the need for additional data shifting or permutation logic. The time-distributed piecewise
decoding schedule also eliminates the complex, timing-constrained compare-select and XOR trees, which
are employed in traditional architectures to compute the minimum magnitudes, sign, and parity. In the
CN-update phase, a single XOR gate is required to calculate the sign sc(t), and a single compare-select
circuit is used to determine the minimum magnitude among the Lvc, min1c(t − 1), and min2c(t − 1)
magnitudes. Similarly, in the VN-update phase, the parity check computation is also reduced to a
sequential XOR update, and the mcv value is computed independently in each VN. These simplifications
relax the critical path timing constraints to enable pipelined frame interleaving.
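The sequential CN update along one unrolled path can be sketched as a running reduction in which each step performs one XOR and one compare-select. The model below is illustrative: it assumes that the layer message is initialized with sign 0 and both minima set to the maximum 4-bit magnitude, and the function names are hypothetical.

```python
MAX_MAG = 15  # assumed initial value of min1 and min2 (maximum 4-bit magnitude)


def cn_step(sign, min1, min2, lvc):
    """Update (sign, min1, min2) with one VN-to-CN message."""
    mag = min(abs(lvc), MAX_MAG)
    sign ^= 1 if lvc < 0 else 0         # single XOR per step
    if mag < min1:                      # new smallest magnitude
        min1, min2 = mag, min1
    elif mag < min2:                    # new second-smallest magnitude
        min2 = mag
    return sign, min1, min2


def cn_traverse(messages):
    """Run the sequential update over all VNs on one unrolled CN path."""
    state = (0, MAX_MAG, MAX_MAG)
    for lvc in messages:
        state = cn_step(*state, lvc)
    return state


if __name__ == "__main__":
    print(cn_traverse([5, -2, 7, -3, 2]))
```

After the last step, the state holds the same overall sign and two smallest magnitudes that a compare-select tree and XOR tree would produce in a single cycle, but each step involves only one message.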
Figure 3.9 presents the timing diagram for each frame j that is decoded through two CN+VN
processing units in successive columns t and t+1 within a single pipeline stage over one clock cycle. The
timing diagram highlights the individual operations in the CN- and VN-update phases, as well as the
data dependency and message-passing sequence between processing units in successive columns t and
t+ 1. The uniform processing unit arrangement in each column slice ensures that all units perform the
same operation in each clock cycle.
Additional pipeline stages could be added in the combined CN+VN processing unit and in each
column slice. While this may reduce the number of buffers added to critical timing paths in the design
synthesis and place-and-route stages, the complexity of control logic would increase, and there would
be a trade-off between the area saved on buffer elimination and the insertion of pipeline register cells
throughout the design. In addition, in order to maintain full hardware utilization, the depth of column
slice memories would also need to be increased to accommodate more interleaved frames in the pipeline.
In this design, only 8 frames are interleaved through the architecture in order to meet the latency
requirements with minimal area overhead and simple control logic.
3.1.8 Early Termination with Coarse-Grained Clock Gating
Early termination logic allows the decoder to terminate once the parity-check condition across all CNs
is satisfied, i.e., pc = 0 for every CN path as defined by Eq. 2.1. This reduces overall power and
improves energy efficiency, since the decoder does not have to execute the maximum number of iterations
once the decoded codeword C is known to be valid.
Figure 3.10: Probability distribution of decoding iterations for the four code rates of the IEEE 802.11ad standard at FER of 10^-2, 10^-3, and 10^-4. Axes: number of iterations to decoding convergence (0 to 10) versus probability (0 to 0.5). Operating points for FER below 10^-2, 10^-3, and 10^-4, respectively: rate 1/2 and rate 5/8 at 4.1dB, 4.4dB, and 4.7dB; rate 3/4 at 4.4dB, 4.7dB, and 5.0dB; rate 13/16 at 4.7dB, 5.1dB, and 5.5dB.
Figure 3.10 presents a normalized histogram showing the probability of performing i iterations,
i ∈ {1, 2, ..., 10}, before converging to a valid codeword. The iteration probability is presented for
all four code rates in the IEEE 802.11ad standard at SNR operating points that achieve frame error
rates of 10^-2, 10^-3, and 10^-4 in order to capture the average statistical performance. Figure 3.10
shows that 95% of frames terminate in 5 iterations or fewer; hence, early termination provides a significant
power-saving opportunity, especially at the higher SNR operating points, where the majority of frames
terminate within 1 to 3 iterations. Early termination also reduces the overall decoding latency, since
fewer iterations are required to produce valid codewords. As such, higher throughput is achievable if
the decoder is configured to run continuously, such that the next set of frames begins decoding immediately
after all the frames in the preceding set have converged, as shown in Fig. 3.11.
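The throughput gain from continuous decoding can be estimated with a short calculation. The sketch below assumes 672-bit codewords (the IEEE 802.11ad codeword length), 8 frames per set, and 16 clock cycles per iteration, and it uses iteration counts of 7, 10, and 8 for three successive sets purely as an illustration. The function names are hypothetical.

```python
CYCLES_PER_ITERATION = 16
MAX_ITERATIONS = 10
FRAMES_PER_BATCH = 8
BITS_PER_FRAME = 672          # IEEE 802.11ad codeword length


def total_cycles(iterations_per_batch, continuous):
    """Clock cycles needed to decode a sequence of 8-frame sets.

    continuous=True  -- the next set starts as soon as the current one ends
    continuous=False -- every set occupies a full 10-iteration slot
    """
    if continuous:
        return sum(i * CYCLES_PER_ITERATION for i in iterations_per_batch)
    return len(iterations_per_batch) * MAX_ITERATIONS * CYCLES_PER_ITERATION


def throughput_gbps(iterations_per_batch, clock_hz, continuous):
    """Coded-bit throughput in Gb/s for the given iteration counts."""
    bits = len(iterations_per_batch) * FRAMES_PER_BATCH * BITS_PER_FRAME
    cycles = total_cycles(iterations_per_batch, continuous)
    return bits * clock_hz / cycles / 1e9


if __name__ == "__main__":
    iters = [7, 10, 8]
    print(total_cycles(iters, False), throughput_gbps(iters, 202e6, False))
    print(total_cycles(iters, True), throughput_gbps(iters, 202e6, True))
```

With fixed 160-cycle slots and a 202MHz clock, this model gives about 6.79Gb/s, in line with the maximum throughput reported for the test chip; removing the idle cycles shortens the three sets from 480 to 400 clock cycles in this example.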
There is a fourth scenario, not illustrated in Fig. 3.11, where the decoder pipeline is continuously
filled until hundreds of frames have been decoded, at which point, the decoder is fully powered down
and remains off until the next round of frames is ready to begin decoding. This duty-cycling approach
would achieve the same throughput with lower latency compared to the decoding scenarios illustrated
in Fig. 3.11. However, this approach significantly increases control complexity in order to track the
termination pattern and iteration count of independently interleaved frames, and also requires a deep
first-in first-out (FIFO) buffer memory to store the input and output frames. This additional FIFO
memory would increase design area, unless it is already available in the larger SoC. This scenario was
not explored in this thesis due to the additional control complexity and silicon area constraints.
Similar to the CN update, the parity check is performed in the VN-update phase through a piecewise,
Figure 3.11: Multi-frame decoding with: (a) no early termination, (b) early termination with idle cycles (discontinuous decoding), and (c) early termination without idle cycles (continuous decoding).
time-distributed computation across all VNs along the unrolled CN path. The incoming pc(t − 1)
component of every layer message is XOR-ed with the hard decision Cv in each VN to produce the
updated parity value pc(t) in column slice t. The early termination check is then performed in every
iteration once the final parity result returns home to its starting column position t = 0. This corresponds
to the first cycle of the next CN-update phase. The early termination check needs to be performed only
in the first column of a column-slice pair, immediately after the input pipeline stage, since the parity
result returning home to the column-slice pair is unique to the frame and is valid for both columns.
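The piecewise parity computation reduces to a running XOR along each unrolled CN path. The following sketch is a behavioural illustration with hypothetical function names; it checks one frame at a time and does not model the pipeline timing.

```python
def parity_traverse(hard_decisions):
    """Accumulate pc(t) = pc(t-1) XOR Cv along one unrolled CN path."""
    parity = 0
    for bit in hard_decisions:
        parity ^= bit
    return parity


def frame_terminated(cn_paths):
    """True if every CN path of the frame returns home with parity 0.

    cn_paths -- for each CN, the hard decisions of its connected VNs.
    """
    return all(parity_traverse(bits) == 0 for bits in cn_paths)


if __name__ == "__main__":
    print(frame_terminated([[1, 1, 0], [0, 0, 0]]))   # all checks satisfied
    print(frame_terminated([[1, 0, 0], [0, 0, 0]]))   # first check fails
```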
The early termination check is performed independently for each interleaved frame in the architecture.
The global control unit aggregates the parity-check results from all column slices over the entire CN phase
to determine which frames have terminated. Coarse-grained clock gating is used to disable (turn off)
column slices in which the current frame is known to have terminated. Figure 3.12 presents a sample
frame termination pattern, which shows all 8 frames terminating within 5 iterations. Frames that have
terminated are not updated or cycled further since the input column pipeline registers and memories are
disabled. For completeness, Fig. 3.12 also captures the frame cycling pattern within a single iteration
to show the temporal position of each terminated frame among the active frames that have not yet
terminated. Each set of 8 frames will have a unique termination pattern; however, through
coarse-grained clock gating, the decoder minimizes dynamic power consumption by systematically turning off
logic until all frames have terminated.
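The effect of coarse-grained clock gating on activity can be sketched with a simple enable function. This model assumes that column-slice pair k processes frame (k − cycle) mod 8 in each clock cycle, which is one reading of the frame cycling pattern and not a description of the actual control logic; the function names are hypothetical.

```python
NUM_PAIRS = 8
NUM_FRAMES = 8
CYCLES_PER_ITERATION = 16


def pair_enabled(pair, cycle, terminated):
    """True if the column-slice pair should be clocked in this cycle."""
    frame = (pair - cycle) % NUM_FRAMES   # assumed interleaving pattern
    return frame not in terminated


def active_fraction(terminated):
    """Fraction of pair-cycles left active during one iteration."""
    active = sum(pair_enabled(p, c, terminated)
                 for c in range(CYCLES_PER_ITERATION)
                 for p in range(NUM_PAIRS))
    return active / (NUM_PAIRS * CYCLES_PER_ITERATION)


if __name__ == "__main__":
    print(active_fraction(set()))        # no frame terminated yet
    print(active_fraction({0, 3}))       # frames 0 and 3 terminated
```

Because each frame occupies exactly one column-slice pair per cycle in this model, every terminated frame removes one eighth of the clocked pair-cycles, regardless of which frame it is.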
Figure 3.12: Sample frame termination pattern in frame-interleaved architecture. One iteration is performed over 16 clock cycles. One clock cycle is required to update a frame in a column-slice pair. Frames that have terminated are not updated in their current column-slice pair. Column slices in which the current frame has terminated are disabled through coarse-grained clock gating in each cycle.
3.1.9 Extendibility to Layered Decoding Schedule
The proposed architecture can be extended to support a layered decoding schedule at the expense of
latency and increased control complexity. Chapter 2 provides a brief introduction and describes some of
the challenges of implementing a layered decoder.
In this architecture, a layered schedule would have a higher decoding latency since the CN-update
phase would require several more passes. The VN-update phase would still require only one pass;
however, the CN-update phase would require the same number of passes as the maximum VN degree.
For example, for a QC matrix with 4 layers/macro-rows connected to each VN, a layered decoder would
require 4 passes through the structure just for the CN phase, and then 1 additional pass for the VN
phase, for a total of 5 passes per iteration. This is in contrast to the 2 passes per iteration that are
required with a flooding schedule, corresponding to a 2.5× increase in total worst-case decoding latency,
assuming the decoder does not terminate in fewer iterations.
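The pass count in this comparison follows directly from the maximum VN degree, as the short sketch below shows. The function names are hypothetical, and the model counts passes only; it ignores any change in the number of iterations needed to converge.

```python
def passes_per_iteration(max_vn_degree, layered):
    """Passes through the column slices needed for one decoding iteration."""
    cn_passes = max_vn_degree if layered else 1
    vn_passes = 1
    return cn_passes + vn_passes


def latency_ratio(max_vn_degree):
    """Worst-case latency of a layered schedule relative to flooding."""
    return (passes_per_iteration(max_vn_degree, True)
            / passes_per_iteration(max_vn_degree, False))


if __name__ == "__main__":
    print(passes_per_iteration(4, True), passes_per_iteration(4, False))
    print(latency_ratio(4))
```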
The area of the decoder could be reduced with a layered schedule by sharing CN-update logic among
all connected layers; however, additional multiplexers would be required in each combined CN+VN
processing unit to select the routing path for each intermediate update message based on the current
layer. This would further increase control complexity and latency. Thus, a layered decoding schedule
was not explored in this thesis. The proposed architecture with a flooding schedule achieves acceptable
latency for the IEEE 802.11ad standard, while minimizing control complexity.
The proposed frame-interleaved architecture addresses the challenges of designing a multi-Gb/s
LDPC decoder with low-complexity interconnect and multi-rate reconfigurability, by introducing a time-
distributed decoding schedule and combined CN+VN processing unit design. The following section
presents the physical implementation details and results of a proof-of-concept test chip.
3.2 Physical Silicon Chip Implementation and Results
An LDPC decoder test chip was synthesized, placed and routed, and fabricated in a 28nm CMOS technology
as a proof-of-concept of the proposed frame-interleaved architecture. The decoder core occupies
an area of 3.41mm² (3.20mm × 1.06mm), while the total die size including pads is 4.78mm² (3.36mm
× 1.42mm). The design contains 837K gates and 160Kb of eSRAM, which was generated using a
commercial memory compiler. The decoder supports all 4 code rates and 24 throughput modes of the
IEEE 802.11ad standard, while operating at a nominal 0.9V supply voltage and 202MHz clock. Figure 3.13
presents a die micrograph of the test chip, which contains two decoupled power domains to
independently measure core logic and eSRAM power. This section presents an overview of the decoder’s
error-correction performance, area and power breakdown, and a comparison of this work to previously
published LDPC decoder implementations for the IEEE 802.11ad standard. The complete development,
simulation, and testing framework is presented in Appendix B.
Figure 3.13: Die micrograph with wirebonds shown in exposed package. The 3.20mm × 1.06mm core contains column slices 0 to 15, each with embedded SRAM and standard cell logic, together with the global control logic and the I/O interface.
3.2.1 Error-Correction Decoding Performance
Figure 3.14 presents the FER and BER performance of the four code rates for both fixed-point and
floating-point number representations with a maximum of 10 decoding iterations on the BIAWGNC
with Min-Sum decoding. The channel input LLRs are quantized to 5 bits for both floating-point and
fixed-point simulations, based on the assumption that channel LLRs are received from a 5-bit analog-
to-digital converter (ADC). The rate 1/2 and rate 5/8 curves have similar performance due to the input
LLR quantization.
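As an illustration of 5-bit input quantization, the sketch below implements a uniform saturating quantizer. The step size and the use of the full 2's complement range are assumptions made for this example; the text does not specify the scaling applied to the channel LLRs, and the function name is hypothetical.

```python
def quantize_llr(llr, step=1.0, bits=5):
    """Uniform saturating quantizer to a signed `bits`-bit integer.

    step -- hypothetical LLR value represented by one quantization level
    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    max_level = (1 << (bits - 1)) - 1      # +15 for 5 bits
    min_level = -(1 << (bits - 1))         # -16 for 5 bits
    level = round(llr / step)
    return max(min_level, min(max_level, level))


if __name__ == "__main__":
    for value in (0.4, 3.2, -2.6, 100.0, -100.0):
        print(value, quantize_llr(value))
```

Large-magnitude LLRs saturate at the range limits, so two channel observations that differ only in how far they exceed the representable range produce identical decoder inputs.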
3.2.2 Post-Silicon Power Measurements
The fabricated chip was tested on a Teradyne UltraFLEX-HD automated test equipment (ATE) system at a room
temperature of 21°C. The chip contains two test modes for at-speed functional verification and power
Figure 3.14: FER and BER vs. SNR under Min-Sum decoding for all four IEEE 802.11ad codes on BIAWGNC with maximum 10 decoding iterations. The channel SNR is normalized to energy-per-bit as given by Eq. 1.2. Channel input LLRs are quantized to 5 bits for both fixed-point and floating-point simulations. Each point is simulated to 100 frame errors.
measurement. In the functional test mode, channel input LLRs are loaded through a shift-register based
I/O interface, the decoding is performed, and the output hard decision bits are shifted out. The captured
hard decision bits are compared to a set of golden vectors, whose expected values are predetermined
through C++ simulation of a fixed-point LDPC decoder with a floating-point BIAWGNC model. The
chip functionality is verified over 40 test cases: five SNR points in each of the four code rates, both with
and without early termination. Figure 3.15 presents two Shmoo plots that show the range of operating
voltages for both eSRAM and core logic, as well as decoding functionality up to 300MHz. In the power
test mode, decoded hard decision bits are not shifted out after each set of frames has terminated. Instead,
the decoder runs continuously such that once a set of frames terminates, the decoder immediately restarts
the decoding cycle without any idle period. The supply current is sampled over a 10ms interval to obtain
an accurate power measurement.
Figure 3.16 presents the average power measured over 10 typical-corner chips under nominal condi-
tions. The results show that with early termination, overall power can be reduced by up to 1.43× for the
rate 1/2 code at Eb/N0 = 4.6dB and 2.93× for the rate 13/16 code at Eb/N0 = 5.4dB, while satisfying
the maximum throughput specification of the IEEE 802.11ad standard. The five Eb/N0 SNR points
chosen for each code rate correspond to the five highest performance points in the waterfall region of
each error-rate curve in Fig. 3.14. The majority of the power is consumed by standard cell logic and
wired routing, while eSRAM memories consume between 17% and 32% of overall power. Figure 3.16
also shows that there is a linear decrease in power over the four code rates from rate 1/2 to rate 13/16
when early termination is enabled. At Eb/N0 = 5.4dB, the rate 13/16 code consumes 2.68× less power
than the rate 1/2 code at Eb/N0 = 4.6dB with early-termination decoding. This reduction in power
Figure 3.15: Shmoo plots of measured chip showing functional test pass (P) and fail (F) results. (a) eSRAM VDD vs. core logic VDD with 202MHz clock. (b) Core logic VDD vs. clock frequency with eSRAM voltage at 0.9V.
Figure 3.16: Measured power at nominal 0.9V supply and 202MHz clock rate, with and without early termination, at five SNR Eb/N0 operating points for all four code rates. Operating points: 3.8dB to 4.6dB for rates 1/2 and 5/8, 4.2dB to 5.0dB for rate 3/4, and 4.6dB to 5.4dB for rate 13/16, in 0.2dB steps.
Figure 3.17: Measured power at reduced core and memory voltage with clock-frequency scaling, for the same operating points as in Fig. 3.16. Operating conditions: VDD_LOGIC = 0.79V and VDD_MEM = 0.63V, with rate 1/2 at 92MHz (MCS-10, 3.09Gb/s), rate 5/8 at 155MHz (MCS-22, 5.20Gb/s), rate 3/4 at 186MHz (MCS-23, 6.25Gb/s), and rate 13/16 at 202MHz (MCS-24, 6.78Gb/s).
with increasing code rate is attributed to the fact that the decoder terminates in fewer iterations, and
the higher SNR operating point enables more frequent clock gating as more frames terminate early.
Moreover, in the rate 13/16 case, the CN+VN processor logic that corresponds to an entire layer of
the QC parity-check matrix is disabled as there are only 3 active processing layers for the rate 13/16
matrix, as opposed to 4 active processing layers for the rate 1/2, rate 5/8, and rate 3/4 matrices shown
in Fig. 3.3.
Table 3.4: Decoder performance at target BER = 10^-6 with early termination (including idle cycles)
Code                                         Rate 1/2   Rate 5/8   Rate 3/4   Rate 13/16
Eb/N0 (dB)                                   4.6        4.6        5.0        5.4
Iterations (Max 10) *                        7          6          4          4
Nominal Conditions: VDD_LOGIC=0.9V, VDD_MEM=0.9V
Clock Frequency (MHz)                        202        202        202        202
Throughput (Gb/s) *                          6.78       6.78       6.78       6.78
Max Latency (µs) *                           0.793      0.793      0.793      0.793
Measured Power (mW)                          279        176        112        104
Energy Efficiency (pJ/bit) *                 41         26         16         15
Normalized Efficiency (pJ/bit/iteration) *   4.1        2.6        1.6        1.5
Low-Power Conditions: VDD_LOGIC=0.79V, VDD_MEM=0.63V
Clock Frequency (MHz)                        92         155        186        202
Throughput (Gb/s) *                          3.09       5.20       6.25       6.78
Max Latency (µs) *                           1.740      1.034      0.860      0.793
Measured Power (mW)                          98         99         83         76
Energy Efficiency (pJ/bit) *                 31         19         13         11
Normalized Efficiency (pJ/bit/iteration) *   3.1        1.9        1.3        1.1
* Throughput, Latency, and Efficiency calculations assume 10 decoding iterations, even though the decoder terminates and stops after the specified number of iterations. This scenario corresponds to the multi-frame decoding highlighted in Fig. 3.11(b).
Additional power reduction is possible through clock-frequency and voltage scaling techniques. The
high-performance MCS-10, MCS-22, MCS-23, and MCS-24 modes of the IEEE 802.11ad standard specify
data rates of 3.08Gb/s, 5.20Gb/s, 6.24Gb/s, and 6.76Gb/s for the rate 1/2, 5/8, 3/4, and 13/16 codes,
respectively. As shown in Fig. 3.17, with clock-frequency and voltage scaling, the LDPC decoder achieves
between 1.35× and 2.85× reduction in power over the nominal case, while satisfying the required data
rates. Table 3.4 highlights the power and energy efficiency of both nominal and low-power operating
modes, as well as the throughput and maximum latency assuming 10 decoding iterations for four SNR
points at a target BER of 10^-6. Through the multi-frame I/O buffering technique shown in Figures
3.7 and 3.11, the decoder can achieve higher throughput with lower latency by immediately starting
to decode the next set of frames if the current set of frames terminates early. In this case, the power
consumption is higher; however, the energy efficiency per bit remains the same. This result is shown in
Table 3.6.
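The throughput, latency, and energy-efficiency entries of Table 3.4 can be approximately reproduced from the clock frequency and measured power. The sketch below assumes 8 interleaved frames of 672 coded bits decoded in 160 clock cycles (10 iterations of 16 cycles); this formula is inferred from the architecture description rather than quoted from the text, and the function names are hypothetical.

```python
FRAMES = 8
BITS_PER_FRAME = 672             # IEEE 802.11ad codeword length
CYCLES = 160                     # 10 iterations x 16 clock cycles


def throughput_gbps(clock_mhz):
    """Coded-bit throughput in Gb/s for a fixed 10-iteration schedule."""
    return FRAMES * BITS_PER_FRAME * clock_mhz * 1e6 / CYCLES / 1e9


def max_latency_us(clock_mhz):
    """Time to decode one set of 8 frames, in microseconds."""
    return CYCLES / clock_mhz


def energy_pj_per_bit(power_mw, clock_mhz):
    """Energy per bit: mW divided by Gb/s gives pJ/bit."""
    return power_mw / throughput_gbps(clock_mhz)


if __name__ == "__main__":
    nominal = {"1/2": (202, 279), "5/8": (202, 176),
               "3/4": (202, 112), "13/16": (202, 104)}
    low_power = {"1/2": (92, 98), "5/8": (155, 99),
                 "3/4": (186, 83), "13/16": (202, 76)}
    for rate in nominal:
        f_lp, p_lp = low_power[rate]
        print(rate,
              round(throughput_gbps(f_lp), 2),
              round(max_latency_us(f_lp), 3),
              round(energy_pj_per_bit(p_lp, f_lp), 1),
              round(nominal[rate][1] / p_lp, 2))   # power reduction ratio
```

The computed values agree with the tabulated ones to within about 1%, and the ratios of nominal to low-power measured power span roughly 1.35× (rate 3/4) to 2.85× (rate 1/2), consistent with the range quoted above.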
Table 3.5 presents a percentage breakdown of the total decoder area by module, as well as the
estimated power consumed by each module for the high-performance SNR points of each of the four codes
at a target BER of 10^-6. The power estimates are derived from both the measured power and power
estimates obtained from gate-level simulation of the synthesized design using actual toggle patterns for
each code rate. VN-to-CN message memories consume about 6× more power than channel LLR and hard
decision memories due to their higher activity during decoding. Column slice logic consumes about 1.5×
more power than all the remaining standard cell logic due to the large number of gates required to realize
the frame-interleaved, flooding Min-Sum decoder. The power consumption of control logic is negligible
in the frame-interleaved architecture since the majority of control logic is intrinsically embedded in
the path-unrolled interconnect. Routing, however, consumes the largest percentage of power in order
to cyclically shuffle high bit-width layer messages between successive column slices. While two power
domains provide insight into the ratio of core logic versus memory power consumption, a single power
domain should be used when integrating the IP in an SoC to minimize unused area overhead.
Table 3.5: Percentage breakdown of post-silicon area and estimated power by decoder module at target BER = $10^{-6}$
Decoder Module | Area | Power with Early Termination Enabled ∗: Rate 1/2 | Rate 5/8 | Rate 3/4 | Rate 13/16
Total Core Area and Power
Core Area | 3.41mm$^2$ | 279mW | 176mW | 112mW | 104mW
Embedded SRAM Memories
VN-to-CN Messages $L_{vc}$ | 7.74% | 17.36% | 15.90% | 15.01% | 15.02%
Channel LLR $Q_v$ | 2.48% | 1.95% | 2.25% | 2.13% | 3.20%
Hard Decision $C_v$ | 0.63% | 0.58% | 0.58% | 0.54% | 0.69%
Standard Cell Logic
Column Slices | 16.28% | 11.53% | 15.23% | 15.78% | 13.29%
Pipeline Registers | 2.85% | 7.30% | 9.61% | 9.71% | 8.16%
Buffers/Inverters | 2.75% | 0.06% | 0.08% | 0.11% | 0.04%
Decoder Control | 0.01% | 0.01% | 0.01% | 0.01% | 0.01%
Integrated Clock Gating Cells | 0.03% | 0.06% | 0.08% | 0.11% | 0.04%
Test Control and I/O Interface | 0.21% | 0.01% | 0.01% | 0.01% | 0.01%
Core Filler Cells and Wired Routing
Core Filler Cells | 67.01% | N/A | N/A | N/A | N/A
Routing | N/A | 59.79% | 54.41% | 54.69% | 57.46%
∗ Power measured at Eb/N0 = 4.6dB, 4.6dB, 5.0dB, and 5.4dB for rates 1/2, 5/8, 3/4, and 13/16, respectively, at nominal 0.9V supply and 202MHz clock. Estimates are derived from power measurements and gate-level simulation reports of the synthesized design.
3.2.3 Comparison with the State-of-the-Art
Table 3.6 compares this work to five recent decoder implementations for the IEEE 802.11ad standard.
All comparison works implement a partially-parallel architecture with a variant of the Min-Sum de-
coding algorithm, and achieve similar BER performance over the four code rates. Several parity-check
matrix modifications are applied among the comparison works, yielding different permutation network
structures, which include barrel shifters, cyclic shift registers, and switch networks. This work reduces
routing overhead complexity by eliminating the need for message permutation logic due to the parti-
tioned routing networks between adjacent column slices. While this work occupies between 2.1× and
5.4× more unnormalized silicon area than the comparison works, the proposed frame-interleaved
architecture with a path-unrolled message-passing schedule achieves high energy efficiency, while maximizing
SoC integration capability using standard bulk CMOS technology and a low clock rate. Silicon area and
power are not normalized to a particular CMOS technology node in this comparison, as Dennard scaling
rules do not hold below the 65nm node due to the exponential growth of leakage current in newer nodes
and dark silicon design techniques. In Table 3.6, power is reported at the nominal supply voltage and
nominal clock frequency.
Table 3.6: Comparison of LDPC decoder implementations for the IEEE 802.11ad standard
Specification | Weiner [32], ISSCC 2014 | Park [31], JSSC 2014 | Ajaz [126], APCCAS 2014 | Li [127], ASSCC 2015 | Motozuka [128], GlobalSIP 2015 | This Work, 2017
Implementation | ASIC | ASIC | Place and Route | ASIP | ASIC | ASIC
CMOS Technology Node | 28nm FD-SOI | 65nm | 65nm | 28nm | 40nm LP | 28nm
Core Area (mm$^2$) | 0.63 | 1.60 | 0.58 | 0.78 | 0.8 | 3.41 ∗
Memory Type | Flip Flops | eDRAM | Flip Flops | N/A | Flip Flops | eSRAM
Total Memory (Kbits) | N/A | 33.6 | 7.875 | N/A | 12.096 | 160.272
Pipeline Stages | 5 | 5 | 1 | 5 | 3 | 8
Interleaved Frames | 2 | 2 | 1 | 1 | 1 | 8
Supply Voltage (V) | 1.1 | 0.94 | 1.1 | 0.9 | 1.1 | 0.9
Clock Frequency (MHz) | 260 | 360 | 400 | 470 | 220 | 202
Block Length (Bits) | 672 | 672 | 672 | 672 | 672 | 672
Decoding Schedule | Flooding | Flooding | Layered | Layered | Flooding | Flooding
Code Rate for Power Measurement | 1/2 | 1/2 | 1/2 | 1/2 | 13/16 | 1/2 (a); 13/16 (b)
Iterations | 3.75 | 10 | 7 | 2 | 7 | 10 (c); 7 (d); 10 (e); 4 (f)
Throughput (Gb/s) | 12.00 | 6.00 | 9.25 | 18.40 | 6.16 | 6.78 (c); 9.69 (d); 6.78 (e); 16.95 (f)
Latency (µs) | 0.112 | 0.224 | 0.073 | 0.037 | 0.109 | 0.793 (c); 0.555 (d); 0.793 (e); 0.317 (f)
Power (mW) | 180 | 373.6 | 272.9 | 166 | 203 | 279 (c); 399 (d); 104 (e); 260 (f)
Energy Efficiency (pJ/bit) | 15.00 | 62.27 | 29.50 | 9.02 | 32.95 | 41.15 (a); 15.34 (b)
Normalized Energy Efficiency (pJ/bit/iteration) | 4.00 | 6.23 | 4.21 | 4.51 | 4.71 | 4.12 (c); 5.88 (d); 1.53 (e); 3.84 (f)
Area Efficiency (Gb/s/mm$^2$) | 19.05 | 3.75 | 16.09 | 23.59 | 7.70 | 1.99
Max Average Iterations (Flooding: 10, Layered: 5) | 10 | 10 | 5 | 5 | 10 | 10
Latency at Max Iterations (µs) | 0.229 | 0.224 | 0.052 | 0.091 | 0.156 | 0.793
Throughput at Max Iterations (Gb/s) | 4.50 | 6.00 | 12.95 | 7.36 | 4.31 | 6.78
Decoding Algorithm | Offset Min Sum | Offset Min Sum | Min Sum | Min Sum | Min Sum | Min Sum
Message Quantization (Bits) | 5 | 5 | $L_{vc}$: 4, $m_{cv}$: 2 | N/A | 5 | 5
Multi Rate | Yes | No (Rate 1/2) | Yes | Yes | Yes | Yes
BER for Rate 1/2 Code | $10^{-6}$ at 4.4dB | $10^{-6}$ at 3.6dB | $10^{-5}$ at 3.6dB | $10^{-6}$ at 4.0dB | $10^{-6}$ at 4.0dB | $10^{-6}$ at 4.6dB
BER for Rate 5/8 Code | $10^{-6}$ at 4.7dB | N/A | $10^{-5}$ at 4.0dB | $10^{-6}$ at 4.0dB | N/A | $10^{-6}$ at 4.6dB
BER for Rate 3/4 Code | $10^{-6}$ at 4.5dB | N/A | $10^{-5}$ at 4.3dB | $10^{-6}$ at 5.0dB | N/A | $10^{-6}$ at 4.6dB
BER for Rate 13/16 Code | $10^{-6}$ at 5.0dB | N/A | $10^{-5}$ at 5.0dB | $10^{-5}$ at 5.0dB | $10^{-6}$ at 5.2dB | $10^{-6}$ at 5.4dB
Partially-Parallel Architecture | Row-Parallel | Row-Parallel | Row-Parallel | Row-Parallel | Column-Parallel | Row/Column-Parallel
Permutation Network | Cyclic Shift Registers | Cyclic Shift Registers | Switch Network | Barrel Shift Network | Reduced-Complexity Barrel Shifters | Cyclically Hard-Wired Partitions
Parity-Check Matrix Modification | Row Re-Ordering | Row Merging | Column Re-Ordering | Column Permutation | N/A | Bypass Routing
∗ Core area contains two power domains to independently measure eSRAM and logic power. A “production version” of the chip with only one power domain would occupy less core area.
(a) Power reported at Eb/N0 = 4.6dB and BER = $10^{-6}$.
(b) Power reported at Eb/N0 = 5.4dB and BER = $10^{-6}$.
(c) Decoder terminates early after 7 iterations, and remains idle for the remaining 3 iterations. This corresponds to the scenario in Fig. 3.11(b).
(d) Decoder terminates early after 7 iterations, and immediately begins decoding next set of frames without any idle cycles. This corresponds to the scenario in Fig. 3.11(c).
(e) Decoder terminates early after 4 iterations, and remains idle for the remaining 6 iterations. This corresponds to the scenario in Fig. 3.11(b).
(f) Decoder terminates early after 4 iterations, and immediately begins decoding next set of frames without any idle cycles. This corresponds to the scenario in Fig. 3.11(c).
Energy efficiency is the primary optimization metric for state-of-the-art decoders. Weiner et al.
present an ASIC implementation that achieves a normalized energy efficiency approximately equal to
this work for the rate 1/2 code using a fully-depleted silicon-on-insulator (FD-SOI) technology [32], which
is known to provide superior power performance over conventional bulk CMOS technology due to the
forward body bias that enables low-voltage operation with reduced leakage current [129]. Under worst-
case channel conditions with a maximum of 10 decoding iterations, the implementation by Weiner et al.
achieves a throughput of only 4.50Gb/s, which is below the maximum throughput specification for IEEE
802.11ad. Weiner et al. do not report the measured power for the rate 13/16 code.
Park et al. present an ASIC decoder for a single code rate with an energy efficiency approximately
1.5× lower than this work for the rate 1/2 code [31], likely due to the high clock frequency of 360MHz,
which would also present SoC integration challenges. This work achieves 1.5× higher normalized energy
efficiency than the work by Park et al. for the rate 1/2 code.
Ajaz and Lee report a pre-silicon place-and-route implementation also using a prohibitively high
clock rate of 400MHz. This work achieves approximately equal normalized energy efficiency for the rate
1/2 code, however, a fair comparison is not possible since the work by Ajaz and Lee is a pre-silicon
implementation, and power is not reported for the rate 13/16 code.
Li et al. introduce a new approach to high-throughput LDPC decoder design through a multi-core
application-specific instruction set processor (ASIP). While the reported energy efficiency is 4.56× higher
than this work for the rate 1/2 code [127], their reported power of 166mW is measured for only 2 decoding
iterations. This work achieves approximately equal normalized energy efficiency for the rate 1/2 code.
In addition, the high clock frequency of 470MHz and ASIP architecture may introduce SoC integration
challenges. Li et al. do not report the measured power for the rate 13/16 code.
Motozuka et al. introduce a new column-parallel architecture that uses multi-stage variable shifters
with low memory requirements [128], however, the ASIC implementation does not achieve more than
4.31Gb/s throughput under worst-case channel conditions. This work achieves 2.1× higher energy
efficiency and 3.08× higher normalized energy efficiency than the work by Motozuka et al. for the
rate 13/16 code. Motozuka et al. do not report the measured power for the rate 1/2 code.
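The ratios quoted in the comparison above can be checked directly from the power, throughput, and iteration entries of Table 3.6. The following is a minimal sketch: the helper names are hypothetical, and the only relationships assumed are that energy per bit is power divided by throughput (mW per Gb/s equals pJ/bit) and that the normalized figure divides this by the iteration count.

```python
# Values transcribed from Table 3.6 (power in mW, throughput in Gb/s).
# mW divided by Gb/s gives pJ/bit directly.


def energy_efficiency(power_mw, throughput_gbps):
    """Energy per decoded bit in pJ/bit."""
    return power_mw / throughput_gbps


def normalized_energy_efficiency(power_mw, throughput_gbps, iterations):
    """Energy per decoded bit per iteration in pJ/bit/iteration."""
    return energy_efficiency(power_mw, throughput_gbps) / iterations


this_work_r12 = energy_efficiency(279, 6.78)      # rate 1/2, 10 iterations
this_work_r1316 = energy_efficiency(104, 6.78)    # rate 13/16, 10 iterations
park = energy_efficiency(373.6, 6.00)             # rate 1/2
li = energy_efficiency(166, 18.40)                # rate 1/2
motozuka = energy_efficiency(203, 6.16)           # rate 13/16

print(f"Park vs. this work (rate 1/2):       {park / this_work_r12:.2f}x")
print(f"This work vs. Li (rate 1/2):         {this_work_r12 / li:.2f}x")
print(f"Motozuka vs. this work (rate 13/16): {motozuka / this_work_r1316:.2f}x")

norm_motozuka = normalized_energy_efficiency(203, 6.16, 7)
norm_this_work = normalized_energy_efficiency(104, 6.78, 10)
print(f"Normalized, Motozuka vs. this work:  {norm_motozuka / norm_this_work:.2f}x")
```

Small differences from the figures quoted in the text are expected, since the text works from table entries that were already rounded to two decimal places.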
The silicon area of the proposed architecture could be reduced by applying more optimal floorplanning
and chip layout techniques. The presented design contains additional area overhead in order to implement
two power domains to independently measure core logic and eSRAM power. The decoder core area could
be reduced by using a single power domain for both standard cell logic and eSRAM macros, as this would
eliminate particular wire routing and logic placement constraints.
The frame-interleaved LDPC decoder presented in this work achieves high energy efficiency, error-
correction performance, and SoC integration capability at the expense of high transistor area. The
presented decoder achieves similar normalized energy efficiency for the rate 1/2 code in comparison
to other published implementations, however, it achieves the highest normalized energy efficiency for
the rate 13/16 code at 1.53pJ/bit/iteration. As previously described, this work is scalable by design.
The architecture is scalable to future technology nodes as interconnect complexity is constrained and
localized between column slices. Moreover, since most QC-LDPC codes are constructed with expansion
factors between q = 10 and q = 100, the column-slice logic complexity of the proposed architecture
remains approximately equal, even for longer block-length codes. As such, the frame-interleaved decoder
architecture would provide low-power performance and high energy efficiency for longer block-length
codes in future technology nodes.
3.3 Summary
This chapter introduced a new partially-parallel LDPC decoder architecture that implements a path-
unrolled message-passing schedule with pipelined frame interleaving. The traditional flooding Min-Sum
algorithm is reformulated through a time-distributed computation over multiple processing units that
contain both check and variable node update logic. Message permutation overhead and unstructured
routing are minimized by exploiting the cyclic structure between adjacent macro-columns in the quasi-
cyclic parity-check matrix.
Despite the high silicon core area of 3.41mm$^2$, the decoder achieves an energy efficiency of 15pJ/bit
at 0.9V supply with a 202MHz clock rate, which is ideally suited for modern SoC integration. At a
maximum of 10 iterations, the decoder achieves a nominal throughput of 6.78Gb/s with a maximum
latency of 0.793µs for all four code rates of the IEEE 802.11ad standard. By trading off interconnect
complexity for high transistor area, the proposed architecture introduces a new design strategy for LDPC
decoders in sub-45nm CMOS technology nodes where interconnect scaling has stagnated.
The proposed architecture is scalable to longer block-length codes, since (1) the critical timing path
is not constrained by the expansion factor of the quasi-cyclic parity check matrix, (2) the complexity
of localized routing between successive column groups is bounded by the number of active processing
layers, and (3) the high bit-cell density of eSRAM compensates for additional overhead in processing
node logic. The architecture can also be reconfigured and extended to support multiple standards with
different parity-check matrices by including programmable shifters between successive column slices.
Furthermore, the architecture is not restricted to quasi-cyclic codes, but can rather be applied more
generally to random codes, or to codes that allow column-wise matrix partitioning as a way to enforce
structure. Further research in this area may lead to new low-power techniques for hardware-based LDPC
decoders.
Chapter 4
Quasi-Cyclic Multi-Edge LDPC Codes for Quantum Cryptography
The design of efficient reconciliation algorithms is one of the central challenges of long-distance CV-
QKD [11]. Early reconciliation algorithms failed to achieve efficiencies above 80% [130], while more
advanced algorithms that now achieve 95% efficiency suffer from computational complexity [59, 60].
LDPC codes are highly suitable for low-SNR reconciliation in CV-QKD due to their near-Shannon
limit error-correction performance; however, designing and constructing efficient LDPC codes with block
lengths on the order of $10^6$ bits remains a challenge.
This chapter introduces a technique to reduce the complexity of multi-edge LDPC codes in order
to reduce overall decoding latency, which would ultimately provide a higher secret key rate. A quasi-
cyclic structure is imposed on the multi-edge parity-check matrix construction to enable computational
decoding speedup as a result of the highly parallelizable structure, which provides a simple mapping
to hardware [72, 83]. Previous independent works by Martinez-Mateo and Walenta have explored the
application of existing QC-LDPC codes from the IEEE 802.11n standard for DV-QKD, however, these
works were not able to demonstrate reliable reconciliation beyond 50km [65, 131]. While this distance
may have been a limitation of DV-QKD, the short block lengths of such existing QC-LDPC codes (on
the order of $10^3$ bits) remain unsuitable for long-distance CV-QKD. Recently, Bai et al. theoretically
showed that rate 0.12 QC codes with block lengths of $10^6$ bits can be constructed using progressive
edge growth techniques, or by applying a QC extension to random LDPC codes with block lengths
of $10^5$ bits [132]. However, the reported QC codes target an SNR of -1dB, and are thus not suitable
for long-distance CV-QKD beyond 100km. At the time of writing, there has not been any reported
investigation of the construction of QC codes for multi-edge LDPC codes targeting low-SNR channels
below -15dB for long-distance CV-QKD. This thesis shows that by applying a structured QC-LDPC code
construction technique to the random multi-edge LDPC codes previously explored by Jouguet et al. for
long-distance CV-QKD [11], it is possible to construct codes that achieve sufficient error-correction
performance while enabling the acceleration of the computationally-intensive LDPC decoding algorithm
such that the reconciliation step is no longer the bottleneck for secret key distillation beyond 100km.
This thesis demonstrates the application of multi-edge QC-LDPC codes for long-distance CV-QKD
through the design of several rate 0.02 binary parity-check matrices with block lengths on the order of $10^6$
bits. While a complete QKD system would offer multi-rate code programmability for various operating
channels, this thesis focuses on the design of a single, low-rate code for a large range of transmission
distances to fully study the effects of β-efficiency and FER on the maximum achievable secret key rate
and reconciliation distance. Some works have explored the use of rate-adaptive or repetition codes to
achieve high-efficiency decoding with multiple code rates [11], however, the exploration of multi-rate
code design for long-distance CV-QKD is beyond the scope of this thesis.
This chapter describes the construction of quasi-cyclic multi-edge codes, and presents the error-
correction performance and achievable secret key rates for multiple β-efficiencies beyond 100km using
specifically designed rate 0.02 codes¹. A GPU-based LDPC decoder implementation is also presented to
highlight the computational speedup that can be achieved using quasi-cyclic codes with respect to the
fundamental upper bound on secret key rate.
4.1 Construction of Quasi-Cyclic Multi-Edge LDPC Codes
While random LDPC codes have been shown to achieve near-Shannon capacity error-correction perfor-
mance under belief propagation decoding [5], the hardware-based implementation of decoders for random
codes is a challenge with large block lengths, especially on the order of $10^6$ bits. The bottleneck stems
from the complex interconnect network between CN and VN processing units that execute the belief
propagation algorithm [15, 73]. This thesis extends the design of low-rate, multi-edge LDPC codes de-
scribed in Chapter 2 to QC codes in order to optimize decoding performance in hardware, by minimizing
latency and increasing throughput.
To design a multi-edge QC-LDPC code, repeat the random multi-edge sampling process using n/q
as the block length instead of n to obtain a base Tanner graph $G_B$. The base parity-check matrix $H_B$ is
obtained from $G_B$ by populating each non-zero entry by a random element of the set {1, 2, ..., q}. Let $I_i$
be the circulant permutation matrix obtained by cyclically shifting each row of the q × q identity matrix
to the right by i − 1. The QC parity-check matrix H is obtained from $H_B$ by replacing each non-zero
entry of value i by $I_i$, and each zero entry by the q × q all-zeros matrix.
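The expansion step described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the base matrix values are hypothetical, the function names are not from the thesis, and dense lists are used only for readability (a matrix with a block length on the order of $10^6$ bits would need a sparse representation).

```python
def circulant_permutation(q, i):
    """Return I_i as a list of rows: the q x q identity shifted right by i - 1."""
    shift = (i - 1) % q
    return [[1 if col == (row + shift) % q else 0 for col in range(q)]
            for row in range(q)]


def expand_base_matrix(base, q):
    """Expand a base matrix with entries in {0, 1, ..., q} into a QC matrix.

    A zero entry becomes the q x q all-zeros block; an entry i >= 1 becomes I_i.
    """
    rows, cols = len(base), len(base[0])
    H = [[0] * (cols * q) for _ in range(rows * q)]
    for r in range(rows):
        for c in range(cols):
            i = base[r][c]
            if i > 0:
                block = circulant_permutation(q, i)
                for br in range(q):
                    # Copy one row of the circulant block into place.
                    H[r * q + br][c * q:(c + 1) * q] = block[br]
    return H


# Toy base matrix (hypothetical values chosen only for illustration), q = 3.
HB = [[1, 2, 0],
      [0, 3, 1]]
H = expand_base_matrix(HB, q=3)
for row in H:
    print("".join(str(bit) for bit in row))
```

Each row of the expanded matrix has the same weight as the corresponding row of the base matrix, which is why the expansion preserves the degree distribution of the base graph.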
In this thesis, multi-edge QC-LDPC parity-check matrices of rate 0.02 were generated for expansion
factors q ∈ {21, 50, 100, 500, 1000}. Under belief propagation decoding, the error-correction performance
of the q ∈ {100, 500, 1000} QC codes was significantly worse in comparison to a random multi-edge code
with the same degree distribution. Thus, only the q = 21 and q = 50 QC codes are presented in the
remainder of this study. In order to maintain the same degree distributions, the block length for the
q = 21 code with rate $R_{\text{code}} = 0.02$ was adjusted to $n = 1.008 \times 10^6$ bits. Similarly, the q = 50 code
has a block length of $n = 1 \times 10^6$ bits and rate $R_{\text{code}} = 0.01995$. As described in Chapter 2, as n → ∞,
the error-correction performance of Tanner graphs with the same degree distribution is nearly identical.
Since a block length on the order of $n = 10^6$ bits approximates this asymptotic regime, any q = 21 or
q = 50 code is expected to have similar error-correction performance. Hence, only one q = 21 code and
one q = 50 code were constructed.
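The block lengths and rates quoted above are consistent with a simple divisibility argument, sketched below. The helper is hypothetical, and it assumes the design rate equals 1 − (base rows)/(base columns), which holds only when the parity-check matrix has full rank. Whether divisibility was the sole reason for the adjustments is not stated in this section.

```python
def base_dimensions(n, q, rate):
    """Return (base columns, base rows) for a QC code, or None if q does not divide n."""
    if n % q != 0:
        return None
    base_cols = n // q
    # Assumed design rate: R = 1 - rows / cols, so rows = cols * (1 - R).
    base_rows = round(base_cols * (1 - rate))
    return base_cols, base_rows


print(base_dimensions(1_000_000, 21, 0.02))     # 21 does not divide 10^6
print(base_dimensions(1_008_000, 21, 0.02))     # adjusted block length for q = 21
print(base_dimensions(1_000_000, 50, 0.01995))  # q = 50 code
```

For the q = 50 code, the resulting number of base rows multiplied by 50 gives roughly 9.8 × 10^5 parity checks, in line with the matrix dimensions stated in the caption of Figure 4.1.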
Figure 4.1 shows the structure of the parity-check matrices designed in this thesis. Both the non-QC
random and QC matrices have a similar structure, which contains a dense area of 1s or cyclic identity
matrices on the left, and a long diagonal of degree-1 VNs to the right. The starting point of the diagonal
is determined by the VN degree distribution of the multi-edge matrix. In the case of the QC codes, no
cyclic shifts are implemented along the diagonal, thus all submatrices are q × q $I_1$ identity matrices. This
matrix structure greatly improves the decoding speed as degree-1 VNs along the diagonal need to pass
VN-to-CN messages only in the first decoding iteration, while CN-to-VN messages need to be passed to
degree-1 VNs only if the early-termination condition is enabled. The degree-1 VNs along the diagonal
correspond to the majority (but not all) of the (n − k) parity bits that are discarded after decoding,
thus the VN update computation needs to be performed in these degree-1 VNs only if a decision needs
to be made when early termination is enabled. A small fraction of the (n − k) parity bits correspond
to VNs with more than one CN connection in the denser area of the matrix to the left of the diagonal.
These VNs must perform the VN update computation in each iteration along with the first k VNs, which
correspond to the k information bits of the block.
Figure 4.1: Structure of designed parity-check matrices. (a) Full q = 50 QC parity-check matrix structure with $1 \times 10^6$ columns and $9.8 \times 10^5$ rows; empty space represents zeros. (b) Zoom-in of top left corner of the q = 50 QC matrix shown in (a); each dot represents a 50 × 50 cyclic identity matrix. Axes: matrix column (horizontal) vs. matrix row (vertical).
¹ Lei M. Zhang specifically constructed the QC multi-edge codes based on the degree distributions in Equations 2.18 and 2.19.
The parity component of H, i > k in H(j, i), is lower-triangular for both the non-QC random and
QC parity-check matrices designed in this study. An example of this type of construction is shown in
Fig. 2.3, and is also illustrated in Fig. 4.1. While the lower-triangular construction does not necessarily
impact decoding complexity or error-correction performance, it does simplify the LDPC encoding pro-
cedure, which can be performed via forward substitution if H is of this form. Further investigation of
LDPC encoding complexity for such large codes is beyond the scope of this thesis.
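Forward substitution over GF(2) for a lower-triangular parity part can be sketched as follows. This is a minimal illustration, not the encoder used in the thesis: it assumes H = [A | T] with T lower-triangular and every diagonal entry of T equal to 1, and it uses a dense toy matrix whose values are hypothetical.

```python
def encode_lower_triangular(H, message):
    """Encode over GF(2) when the last (n - k) columns of H are lower-triangular.

    H is a list of (n - k) rows of length n; message holds the k information bits.
    Assumes every diagonal entry of the parity part equals 1 (illustrative assumption).
    """
    m = len(H)                 # number of parity checks, n - k
    k = len(message)
    parity = []
    for j in range(m):
        row = H[j]
        # Contribution of the information bits to check j.
        acc = sum(row[i] & message[i] for i in range(k)) % 2
        # Contribution of the parity bits already solved (columns k .. k + j - 1).
        acc ^= sum(row[k + t] & parity[t] for t in range(j)) % 2
        # The diagonal entry is 1, so parity bit j must cancel the accumulated sum.
        parity.append(acc)
    return list(message) + parity


# Toy matrix with k = 2 information bits and n - k = 3 parity bits.
H_toy = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [1, 0, 0, 1, 1]]
codeword = encode_lower_triangular(H_toy, [1, 0])
print(codeword)
```

Each parity bit depends only on the information bits and on parity bits computed earlier, so the encoder makes a single pass over the rows of H.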
4.2 Error-Correction Performance of Multi-Edge QC Codes
The multi-edge LDPC codes designed in this thesis achieve similar FER performance on the BIAWGNC
compared to those developed by Jouguet et al. for long-distance CV-QKD with multi-dimensional recon-
ciliation [11]. Table 4.1 summarizes the parameters of the three codes designed in this thesis, and Figures
4.2 and 4.3 present their FER vs. SNR error-correction performance under Sum-Product decoding for
d = 1, 2, 4, 8 reconciliation dimensions. FER simulations were performed for the complete linear SNR
range corresponding to the range of efficiencies between β = 0.8 and β = 0.99, as defined by Eq. 2.15.
This range of β-efficiency values was chosen to illustrate the trade-off between distance and finite secret
key rate in the next section. For clarity, however, Figures 4.2 and 4.3 present the FER results only
for the SNR range corresponding to β-efficiencies between β = 0.88 and β = 0.99. The bit error rate
(BER) performance is not presented in Figures 4.2 and 4.3 since it is not of particular concern for key
reconciliation. Once Alice and Bob detect a frame error, the entire frame must be discarded since it
cannot be used to generate a symmetric secret key. For completeness, the BER results for the three codes
under investigation are presented in Appendix C.
Table 4.1: Designed rate 0.02 multi-edge LDPC codes
Structure | Expansion Factor q | Code Rate $R_{\text{code}}$ | Block Length n (Bits)
Random | N/A | 0.02 | $1 \times 10^6$
Quasi-Cyclic | 21 | 0.02 | $1.008 \times 10^6$
Quasi-Cyclic | 50 | 0.01995 | $1 \times 10^6$
Despite their identical degree distributions, the q = 50 QC code achieves the best overall FER
performance over d = 1, 2, 4, 8 dimensions in comparison to the random and q = 21 QC codes, due to its
slightly lower code rate of Rcode = 0.01995 versus Rcode = 0.02 for the random and q = 21 QC codes.
At low SNR where β is high, the q = 21 QC code also performs better than the random code over all
dimensions, likely due to the longer block length of $n = 1.008 \times 10^6$ bits versus $n = 10^6$ bits for the
random and q = 50 QC codes. At higher SNR though, the random code achieves a lower error-floor
than the q = 21 QC code due to higher randomness in the parity-check matrix.
It was empirically found that the d = 2, d = 4, and d = 8 reconciliation schemes achieve approxi-
mately 0.04dB, 0.08dB, and 0.2dB of coding gain, respectively, over the d = 1 scheme in the waterfall
region for all three codes. As previously mentioned in Chapter 2, FER performance in the waterfall
region is of particular interest for long-distance CV-QKD since it corresponds to the high β-efficiency
region of operation at low SNR close to the Shannon limit. The error-floor region beyond the waterfall
is not of practical use in CV-QKD as it corresponds to the low β-efficiency region where transmission
distance is limited.
As previously discussed in Chapter 2, for any binary linear block code, the number of possible
codewords is $2^k = 2^{nR_{\text{code}}}$. In this case, when $n = 1 \times 10^6$ bits and $R_{\text{code}} = 0.02$, the number of
possible valid codewords for the decoder to choose from is approximately $4 \times 10^{6020}$. In order to detect
invalid decoding errors when the parity check $CH^\top = 0$ but $\hat{S} \neq S$, a 32-bit CRC code is included in
each LDPC frame. In this work, $N_{\text{CRC}} = 32$ bits were sufficient to detect all invalid decoded messages
without sacrificing information throughput. Having full control of the simulation environment, it was
also empirically found that $P_{\text{undetected error}} = 0$ using a 32-bit CRC code.
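The size of the codeword space quoted above follows from a short logarithm calculation, sketched here with the block length and rate stated in the text. Working in log10 avoids forming the 6021-digit integer explicitly.

```python
import math

n = 1_000_000          # block length in bits
rate = 0.02            # code rate
k = round(n * rate)    # information bits per frame

log10_count = k * math.log10(2)            # log10 of 2^k
exponent = math.floor(log10_count)
mantissa = 10 ** (log10_count - exponent)
print(f"2^{k} is approximately {mantissa:.2f} x 10^{exponent}")
```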
The probability of an invalid decoding error is given by
$$P(CH^\top = 0 \cap \text{CRC Fail} \cap \hat{S} \neq S) = \frac{\text{Number of CRC Errors}}{\text{Total Number of Frame Errors}}.$$
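The CRC-based detection of invalid convergence and the ratio defined above can be sketched as follows. The CRC polynomial used in the thesis is not specified in this section, so zlib's CRC-32 stands in as an assumption, and all function names are hypothetical.

```python
import zlib


def attach_crc(payload: bytes) -> bytes:
    """Append a 32-bit CRC to the payload (zlib's CRC-32 is an assumed stand-in)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(frame: bytes) -> bool:
    """Return True if the trailing 4 bytes match the CRC of the preceding payload."""
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


def invalid_decoding_probability(crc_errors: int, frame_errors: int) -> float:
    """Ratio of CRC-detected invalid convergences to all frame errors."""
    return crc_errors / frame_errors if frame_errors else 0.0


frame = attach_crc(b"key material")
# Emulate a decoder that converged to a different valid codeword:
# the parity check would pass, but the message bits differ from what was sent.
wrong = bytes([frame[0] ^ 0x01]) + frame[1:]
print(crc_ok(frame), crc_ok(wrong))
```

In a simulation, a frame whose syndrome is zero but whose CRC fails would be counted as a CRC error, and the ratio is taken over all frame errors at a given SNR point.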
Figure 4.4 shows the probability of an invalid decoding error over the SNR range of interest for d =
1, 2, 4, 8 reconciliation dimensions on the BIAWGNC for the three LDPC codes designed in this thesis.
In general, the probability of invalid decoding increases as the SNR increases and becomes the main
source of frame error, particularly in the error-floor region as a result of the large block length and low
code rate. In the low-SNR region of operation for long-distance CV-QKD where the FER $P_e \approx 1$, invalid
decoding convergence still contributes to nearly 10% of all frame errors. A concatenated higher-rate code
was not included as part of the message component to correct residual errors [10,11].
Figure 4.2: FER vs. SNR for Sum-Product decoding with d = 1 and d = 2 dimensional reconciliation on BIAWGNC. Curves: random (R = 0.02), QC q = 21 (R = 0.02), and QC q = 50 (R = 0.01995) codes; maximum of 500 iterations, 100 frame errors, 32-bit floating point. Axes: SNR from 0.028 to 0.032 (horizontal), FER from $10^{-2}$ to 1 (vertical).
Figure 4.3: FER vs. SNR for Sum-Product decoding with d = 4 and d = 8 dimensional reconciliation on BIAWGNC. Curves: random (R = 0.02), QC q = 21 (R = 0.02), and QC q = 50 (R = 0.01995) codes; maximum of 500 iterations, 100 frame errors, 32-bit floating point. Axes: SNR from 0.028 to 0.032 (horizontal), FER from $10^{-2}$ to 1 (vertical).
Up until this point, the performance of the reconciliation algorithm has been presented as a coding
theory problem, where an LDPC code was designed to achieve a particular FER at a given SNR operating
point. The SNR was considered as an abstraction of the BIAWGNC in order to demonstrate
fixed-rate code performance, independent of other CV-QKD system parameters such as modulation variance,
transmission distance, and physical losses. Assuming that the transmission distance and physical
parameters of the quantum channel are fixed, Alice’s modulation variance can be optimally tuned such
that the effective secret key rate $K_{\text{eff}}$ is then solely determined by the FER and β-efficiency of the
LDPC-decoding reconciliation algorithm.
Figure 4.4: Probability of invalid decoding error vs. SNR for Sum-Product decoding with d = 1, 2, 4, 8 dimensional reconciliation on BIAWGNC. Probability of error is computed for invalid messages that are correctly decoded but CRC fails. Curves: random, QC q = 21, and QC q = 50 codes for each dimension. Axes: SNR from 0.028 to 0.032 (horizontal), probability of invalid decoding error from 0 to 1 (vertical).
Figure 4.5: FER vs. reconciliation efficiency for Sum-Product decoding with d = 1 and d = 8 dimensional reconciliation on BIAWGNC. FER values are derived from the FER vs. SNR curves based on Eq. 2.15. Curves: random (rate 0.02), QC q = 21 (rate 0.02), and QC q = 50 (rate 0.01995) codes. Axes: reconciliation efficiency β from 0.8 to 1 (horizontal), FER from $10^{-3}$ to 1 (vertical).
Figure 4.5 shows that for each fixed-rate LDPC code, there exists a unique FER-β pair, where each β
corresponds to a particular SNR operating point based on Eq. 2.15. While it may appear from Eq. 2.14
that maximizing β would produce a higher effective secret key rate, Fig. 4.5 shows that β and FER
are positively correlated, such that there exists an optimal trade-off between β and FER where Keff
is maximized for a fixed transmission distance. To achieve key reconciliation at long distances, the
operating point must be chosen in the waterfall region where β is high, despite the high FER.
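The mapping between β and the SNR operating point can be sketched numerically. Eq. 2.15 is not reproduced in this section, so the code assumes the common definition of reconciliation efficiency, β = R_code / C(s) with C(s) = ½ log2(1 + s) for the BIAWGNC. The function names are hypothetical.

```python
import math


def snr_from_beta(beta, rate):
    """SNR at which a code of the given rate operates with efficiency beta."""
    capacity = rate / beta                 # bits per channel use
    return 2 ** (2 * capacity) - 1         # invert C = 0.5 * log2(1 + s)


def beta_from_snr(snr, rate):
    """Reconciliation efficiency of a fixed-rate code at the given linear SNR."""
    return rate / (0.5 * math.log2(1 + snr))


for beta in (0.88, 0.95, 0.99):
    s = snr_from_beta(beta, rate=0.02)
    print(f"beta = {beta:.2f}: SNR = {s:.5f} ({10 * math.log10(s):.2f} dB)")
```

Under this assumption, the β range of 0.88 to 0.99 for a rate 0.02 code maps onto the linear SNR range covered by the horizontal axes of Figures 4.2 and 4.3, and a higher β corresponds to a lower SNR.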
The results presented in this section showed that higher-dimension reconciliation schemes, namely
d = 4 and d = 8, extend code performance to lower SNR where the FER Pe > 0 and β → 1. As such, the
d = 8 scheme is most suitable for long-distance reconciliation. The next section examines the impact of
reconciliation dimension, β-efficiency, and FER on the finite secret key rate over a range of transmission
distances for the LDPC codes designed in this thesis.
4.3 Finite Secret Key Rate
This section extends the discussion of the effective secret key rate to include finite-size effects. Key
reconciliation for a particular β-efficiency is only achievable over a limited range of distances where the
finite secret key rate Kfinite > 0. In general, for a single FER-β pair, LDPC decoding can achieve either
(1) a high secret key rate at short distance, or (2) a low secret key rate at long distance. For long-distance
CV-QKD beyond 100km, key reconciliation is only achievable with high β-efficiency at the expense of
low secret key rate. This section provides an overview of the maximum achievable finite secret key rates
and reconciliation distances for the three LDPC codes designed in this thesis. Results are presented for
the d = 1 and d = 8 reconciliation dimensions in order to demonstrate the effectiveness of higher-order
dimensionality on reconciliation distance.
The range of transmission distances for each β is limited by the total noise between Alice and Bob.
From Eq. A.3 (Appendix A), the total noise can be expressed as a function of β, such that
$$\chi'_{\text{total}}(\beta) = \frac{V_A^{\text{opt}}(\beta)}{s(\beta)} - 1, \tag{4.1}$$
where $V_A^{\text{opt}}(\beta)$ is a vector of Alice’s optimal modulation variances for a particular β-efficiency from
Fig. A.1, and the SNR s(β) is given by Eq. 2.15 for a fixed-rate LDPC code. From the expression for the
total channel noise given by Eq. A.1, a set of transmission distance points for a particular β can then
be described by the vector
$$\ell'(\beta) = \frac{10}{\alpha} \log_{10}\left(\frac{\eta\,(\chi'_{\text{total}}(\beta) - \varepsilon + 1)}{1 + V_{\text{el}}}\right), \tag{4.2}$$
in order to compute the maximum finite secret key rate based on Eq. 2.16, where α = 0.2dB/km is
single-mode fiber transmission loss, η is Bob’s homodyne detector efficiency, ε is the excess channel noise
in shot noise units, and $V_{\text{el}}$ is Bob’s added electronic noise in shot noise units.
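Equations 4.1 and 4.2 can be evaluated with a short script. The sketch below uses the channel parameters that appear in the legends of Figures 4.6 and 4.7. The modulation variance passed in is a hypothetical placeholder, since the optimal values come from Fig. A.1, and the β-to-SNR conversion assumes β = R_code / (½ log2(1 + s)) as a stand-in for Eq. 2.15.

```python
import math

ALPHA_DB_PER_KM = 0.2   # single-mode fiber loss
ETA = 0.606             # homodyne detector efficiency
EPSILON = 0.005         # excess channel noise (shot noise units)
V_EL = 0.041            # added electronic noise (shot noise units)


def total_noise(v_a, snr):
    """Eq. 4.1: total noise for modulation variance v_a at the given SNR."""
    return v_a / snr - 1.0


def distance_km(chi_total):
    """Eq. 4.2: transmission distance corresponding to a total noise value."""
    ratio = ETA * (chi_total - EPSILON + 1.0) / (1.0 + V_EL)
    return (10.0 / ALPHA_DB_PER_KM) * math.log10(ratio)


# Hypothetical operating point: beta = 0.99 with a rate 0.02 code,
# and an illustrative (not optimized) modulation variance of 4.0.
snr = 2 ** (2 * 0.02 / 0.99) - 1
print(f"{distance_km(total_noise(4.0, snr)):.1f} km")
```

Sweeping the modulation variance over the optimal values for a given β would trace out the set of distance points ℓ′(β) at which the finite secret key rate is then evaluated.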
Figure 4.6: d = 1 dimensional reconciliation with $N_{\text{privacy}} = 10^{12}$ bits: finite secret key rate $K_{\text{finite}}$ vs. distance for collective attacks on BIAWGNC with Sum-Product decoding. Plot parameters: ε = 0.005, $v_{\text{el}}$ = 0.041, η = 0.606, α = 0.2. Curves for β = 0.80, 0.83, 0.86, 0.89, 0.92, 0.95, and 0.96 for the random (rate 0.02), QC q = 21 (rate 0.02), and QC q = 50 (rate 0.01995) codes. Axes: distance from 70km to 170km (horizontal), finite secret key rate in bits/pulse from $10^{-8}$ to $10^{-2}$ (vertical).
Figure 4.7: d = 8 dimensional reconciliation with $N_{\text{privacy}} = 10^{12}$ bits: finite secret key rate $K_{\text{finite}}$ vs. distance for collective attacks on BIAWGNC with Sum-Product decoding. Plot parameters: ε = 0.005, $v_{\text{el}}$ = 0.041, η = 0.606, α = 0.2. Curves for β = 0.80, 0.83, 0.86, 0.89, 0.92, 0.95, 0.98, and 0.99 for the random (rate 0.02), QC q = 21 (rate 0.02), and QC q = 50 (rate 0.01995) codes. Axes: distance from 70km to 170km (horizontal), finite secret key rate in bits/pulse from $10^{-8}$ to $10^{-2}$ (vertical).
Figures 4.6 and 4.7 present the finite secret key rate results for the three LDPC codes over the
transmission distance range of interest with Nprivacy = 10^12 bits based on the d = 1 and d = 8
reconciliation dimensions, respectively. Each β-efficiency curve in Figures 4.6 and 4.7 represents a
FER-β pair where the FER and SNR are constant over the entire transmission distance range, while VA is
optimally chosen to achieve the maximum secret key rate at each distance point. When β is high, the
FER Pe → 1, and thus Kfinite → 0 as erroneous frames are discarded after decoding. As a result, the
maximum reconciliation distance is limited by the error-correction performance of the LDPC code.
Appendix C presents the finite secret key rate results for Nprivacy = 10^12 bits with d = 2 and d = 4
reconciliation dimensions, as well as the finite secret key rate results with Nprivacy = 10^10 bits for
d = 1, 2, 4, 8 reconciliation. When Nprivacy = 10^12 bits, the maximum transmission distance is
extended by 18km over the result with Nprivacy = 10^10 bits for d = 8 reconciliation with β = 0.99
efficiency. This demonstrates the importance of selecting a large block size for privacy amplification.
In each of the Nprivacy = 10^10 and Nprivacy = 10^12 cases, the three LDPC codes achieve similar
finite secret key rates and reconciliation distances for β ≤ 0.92, since the codes are operating close to
their respective error floors. However, for β > 0.92, the FER becomes a limiting factor to achieving
a non-zero secret key rate. The d = 1 scheme achieves a maximum efficiency of β = 0.96, where the
maximum distance is limited to 124km with Nprivacy = 10^10 bits, and 132km with Nprivacy = 10^12 bits.
For β > 0.96, the FER Pe = 1, thus Kfinite = 0. The d = 8 scheme operates up to β = 0.99 efficiency,
with a maximum distance of 142km with Nprivacy = 10^10 bits, and 160km with Nprivacy = 10^12 bits.
Furthermore, the d = 8 scheme achieves higher secret key rates for all three LDPC codes at β = 0.95
and β = 0.96 in comparison to the d = 1 scheme since the codes achieve a lower FER with d = 8. The d = 2
and d = 4 schemes both achieve a maximum efficiency of β = 0.97, at 129km with Nprivacy = 10^10 bits,
and 138km with Nprivacy = 10^12 bits.
The finite secret key rate Kfinite results presented in this section were normalized to the pulse rate
(bits/pulse), without consideration of the light source repetition rate frep. By considering the
repetition rate, the complete operating secret key rate of the CV-QKD system can be defined as
$$K'_{\text{finite}} = f_{\text{rep}} K_{\text{finite}} \quad \text{(bits/s)}. \tag{4.3}$$
The next section presents an overview of a GPU-based LDPC decoder implementation where the infor-
mation throughput for the three LDPC codes designed in this thesis is compared to the upper bound
on secret key rate at the maximum reconciliation distance points.
4.4 GPU-Accelerated LDPC Decoding
GPUs are a highly suitable platform for the implementation of LDPC decoders that target high infor-
mation throughput with long block-length codes. Computational acceleration of the belief propagation
algorithm is achieved by parallelizing the check and variable node update operations across thousands of
single-instruction multiple-thread (SIMT) cores, which provide floating-point precision, high-bandwidth
read/write access to on-chip memory, and intrinsic mathematical libraries for the logarithmic functions
of the Sum-Product algorithm [133–137].
This section provides an overview of the GPU-based LDPC decoder implementation in this thesis.
GPU throughput results are presented for the maximum CV-QKD distances under d = 1, 2, 4, 8 dimen-
sional reconciliation, and also compared to the maximum achievable secret key rates for reconciliation
efficiencies β > 0.85. Finally, the implementation is compared to previous work by Jouguet and
Kunz-Jacques for an LDPC code with block length of 2^20 bits [60], as well as other non-LDPC codes. The
GPU decoding throughput results presented in this section quantitatively highlight the computational
speedup that can be achieved using quasi-cyclic LDPC codes for long-distance CV-QKD. The presented
results were measured using a single GPU; however, further computational speedup can be achieved
by concurrently decoding multiple frames using multiple GPUs. The complete simulation framework is
presented in Appendix B.
4.4.1 GPU-Based LDPC Decoder Implementation
The LDPC decoder was implemented on a single NVIDIA GeForce GTX 1080 (Pascal Architecture) GPU
with 2560 CUDA cores using the NVIDIA CUDA C++ application programming interface. Figure 4.8
shows the data flow for a single decoding iteration of the parallelized Sum-Product algorithm, which is
composed of four multi-threaded compute kernels. Each kernel instantiates a different number of GPU
threads depending on the level of parallelism for the operation. The individual compute operations of
the Sum-Product algorithm are re-ordered to exploit the maximum amount of thread-level parallelism in
each kernel such that the latency per iteration is minimized. The overall throughput of the GPU-based
LDPC decoder is then determined by the number of iterations, latency per iteration, and block length.
The complexity of an LDPC decoder implementation stems from the highly-irregular interconnect
structure between CNs and VNs described by the code’s Tanner graph. For codes with short block
lengths, the permutation network complexity does not introduce significant GPU decoding latency
[133–135]; however, for codes with block lengths on the order of 10^6 bits, such as those designed in
this thesis, each message-passing (data permutation) kernel accounts for roughly 25% to 50% of GPU
runtime per decoding iteration, as shown in Table 4.2. While arithmetic operations are relatively
inexpensive on a GPU, addressing global memory is very costly in terms of compute time. The most
expensive GPU operation is addressing unordered memory, i.e., accessing non-consecutive memory
locations, as multiple transactions are required to perform the unordered memory read or write, and all
kernel threads must be stalled [134]. In contrast, coalesced memory addressing, i.e., accessing
consecutive memory locations, can be performed in a single transaction and allows for concurrent thread
execution, which reduces the runtime of the kernel. Furthermore, uncoalesced memory writes are more
expensive than uncoalesced memory reads. Thus, the throughput of a GPU-based decoder is highly
dependent on memory access patterns, i.e., the decoder is memory-bound as opposed to compute-bound.
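The cost difference between coalesced and uncoalesced access can be illustrated with a deliberately simplified model that counts how many aligned memory segments one group of threads touches. The 32-thread warp and 128-byte segment size are assumptions typical of NVIDIA GPUs rather than figures taken from the thesis, and real hardware behaviour also depends on caching and on the specific architecture.

```python
WARP_SIZE = 32        # assumed number of threads that issue a memory request together
SEGMENT_BYTES = 128   # assumed size of one aligned memory transaction
WORD_BYTES = 4        # single-precision float


def count_transactions(byte_addresses, segment_bytes=SEGMENT_BYTES):
    """Count the distinct aligned segments touched by one warp's addresses."""
    return len({addr // segment_bytes for addr in byte_addresses})


# Thread t reads element t: consecutive addresses within one segment.
coalesced = [t * WORD_BYTES for t in range(WARP_SIZE)]

# Thread t reads element 32 * t: every address lands in a different segment.
strided = [t * 32 * WORD_BYTES for t in range(WARP_SIZE)]

if __name__ == "__main__":
    print(count_transactions(coalesced))  # 1
    print(count_transactions(strided))    # 32
```

Under this model the strided pattern needs 32 transactions where the coalesced pattern needs one, which is the effect the message-passing kernels are arranged to avoid on the write side.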
The operations of the Sum-Product algorithm presented in Algorithm 1 (Chapter 2) were re-ordered
to avoid uncoalesced memory writes and to use the maximum amount of thread-level parallelism for
arithmetic computations. For example, the VN-to-CN message-passing permutation in Kernel 1 also
performs the Φ(·) computation from the next CN-update step in each thread. The CN-update Kernel
(2) does not fully compute the mcv messages from each CN to its connected VNs, but instead, the
final CN-to-VN mcv messages are computed in the CN-to-VN message-passing Kernel (3). Due to
the Tanner graph structure and data permutation nature of the LDPC decoder, uncoalesced memory
reads are still required when reading from edge memory in Kernel 1 and reading from CN memory in
Kernel 3. However, the latency of these operations is negligible compared to the overall latency of an
entire iteration.

[Figure 4.8 diagram: data flow from top to bottom through Kernel 1 (VN-to-CN Message Passing), Kernel 2 (CN Update), Kernel 3 (CN-to-VN Message Passing), and Kernel 4 (VN Update). Kernels 1 and 3 use threads t1 to tT, Kernel 2 uses threads t1 to t(n-k), and Kernel 4 uses threads t1 to tn. Each memory read and write is marked as aligned or unaligned; all writes are aligned, and two reads are marked as unaligned. Memory blocks labeled in the diagram: Edge Memory (VN-to-CN Lvc Messages), Edge Memory (VN-to-CN Φ(|Lvc|) Messages), CN Memory (VN-to-CN Φ(|Lvc|) Messages), CN Memory (mc Intermediate Values), Edge Memory (CN-to-VN mcv Messages), VN Memory (CN-to-VN mcv Messages), VN Memory (Lv LLRs), VN Memory (Updated Lv LLRs).]
Figure 4.8: GPU implementation of LDPC decoder showing four multi-threaded compute kernels and data flow from top to bottom for one decoding iteration. Coalesced memory access patterns and message variables are indicated. Thread i is denoted by ti, where T in kernels 1 and 3 represents the maximum number of connections between all CNs and VNs, (n − k) in Kernel 2 is the number of CNs, and n in Kernel 4 is the number of VNs. Early termination is not shown. All memory blocks shown in the figure are in Global GPU Memory. The threads in each kernel use Shared GPU Memory to store intermediate values during the execution of the kernel.

Table 4.2: GPU-based LDPC decoding latency and error-correction performance for rate 0.02 multi-edge codes

| LDPC Code | Random Multi-Edge | q = 21 QC Multi-Edge | q = 50 QC Multi-Edge |
|---|---|---|---|
| Block Length (Bits) | 1 × 10^6 | 1.008 × 10^6 | 1 × 10^6 |
| Code Rate | 0.02 | 0.02 | 0.01995 |
| Connections in Parity Matrix | 3,337,494 | 160,185 | 66,747 |
| Latency by Kernel with Percent Breakdown for 1 Decoding Iteration | | | |
| Kernel 1 Runtime, VN-to-CN (ms) | 1.773 (50.3%) | 0.446 (34.4%) | 0.391 (33.2%) |
| Kernel 2 Runtime, CN Update (ms) | 0.197 (5.6%) | 0.204 (15.7%) | 0.198 (16.8%) |
| Kernel 3 Runtime, CN-to-VN (ms) | 1.240 (35.1%) | 0.317 (24.4%) | 0.303 (25.7%) |
| Kernel 4 Runtime, VN Update (ms) | 0.318 (9.0%) | 0.331 (25.5%) | 0.286 (24.3%) |
| Total Latency Per Iteration (ms) | 3.528 (100.0%) | 1.296 (100.0%) | 1.177 (100.0%) |
| FER Performance and Decoding Throughput at β = 0.99 and d = 8 | | | |
| Max Iterations | 500 | 500 | 500 |
| Average Iterations ∗ | 470 | 451 | 470 |
| FER | 0.883 | 0.792 | 0.883 |
| K^raw_GPU Raw Throughput (Mb/s) | 0.603 | 1.724 | 1.807 |
| K′GPU Information Throughput (Kb/s) | 1.409 | 7.160 | 4.207 |

∗ Early-termination check is enabled only after the number of decoding iterations is equal to the average number of iterations, which is determined empirically through FER simulation and stored in a lookup table.

Fully-coalesced memory writes are enabled by the different ordering of connected edges
in the VN-to-CN and CN-to-VN message-passing kernels (1 and 3). In the VN-to-CN message-passing
Kernel (1), the edge connectivity is ordered by consecutive VNs, while in the CN-to-VN message-passing
Kernel (3), the edges are ordered by consecutive CNs. Each CN-VN edge in the edge memory has a
unique index that is addressed by both message-passing kernels (1 and 3). Several additional memory
optimizations improve the overall GPU throughput. All of the memory blocks shown in Fig. 4.8 are in
global GPU memory, while shared GPU memory is used in each kernel thread to store local variables
and to avoid expensive global memory accesses. Texture caches are used to store frequently-accessed
static variables such as channel LLRs and the parity-check matrix. Prior to executing the GPU-based
decoder, the received channel LLRs are first transferred from the host to global GPU memory. Data is
kept on the GPU during decoder runtime, and the decoded codeword is transferred from global GPU
memory to the host after the decoder terminates.
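The two ideas above, keeping writes contiguous by gathering through a permutation and deferring part of the check-node computation to the message-passing step, can be sketched sequentially in plain Python. This is an illustrative interpretation only: the toy graph, the function names, the assignment of orderings to steps, and the definition Φ(x) = −log(tanh(x/2)) (the usual Sum-Product form, assumed here because Algorithm 1 is not reproduced in this section) are all assumptions and may differ from the thesis's actual memory layout.

```python
import math


def phi(x):
    """Assumed form Phi(x) = -log(tanh(x/2)), clamped to avoid log(0)."""
    x = min(max(x, 1e-12), 40.0)
    return -math.log(math.tanh(x / 2.0))


def gather(src, read_index, fn=lambda x: x):
    """'Thread' i reads src[read_index[i]] (scattered) and writes slot i (contiguous)."""
    return [fn(src[j]) for j in read_index]


# Hypothetical toy Tanner graph as (check node, variable node) pairs.
EDGES = [(0, 0), (0, 1), (0, 3), (1, 1), (1, 2), (1, 4), (2, 0), (2, 2), (2, 5)]

by_vn = sorted(EDGES, key=lambda e: (e[1], e[0]))   # edges grouped by VN
by_cn = sorted(EDGES, key=lambda e: (e[0], e[1]))   # edges grouped by CN
slot_by_vn = {e: i for i, e in enumerate(by_vn)}
slot_by_cn = {e: i for i, e in enumerate(by_cn)}
read_for_cn_order = [slot_by_vn[e] for e in by_cn]  # gather addresses, VN to CN order
read_for_vn_order = [slot_by_cn[e] for e in by_vn]  # gather addresses, CN to VN order


def check_to_variable(l_vc_by_vn):
    """Compute CN-to-VN messages from VN-to-CN LLRs stored in VN order."""
    # Message passing: gather into CN order and apply phi in the same step.
    mags = gather(l_vc_by_vn, read_for_cn_order, lambda l: phi(abs(l)))
    signs = gather(l_vc_by_vn, read_for_cn_order, lambda l: -1.0 if l < 0 else 1.0)
    # CN update: one intermediate total per check node, not one message per edge.
    n_cn = 1 + max(c for c, _ in EDGES)
    total = [0.0] * n_cn
    parity = [1.0] * n_cn
    for (c, _), m, s in zip(by_cn, mags, signs):
        total[c] += m
        parity[c] *= s
    # Final messages: remove each edge's own contribution, then apply phi again.
    m_cv_by_cn = [parity[c] * s * phi(total[c] - m)
                  for (c, _), m, s in zip(by_cn, mags, signs)]
    # Message passing back: gather into VN order with contiguous writes.
    return gather(m_cv_by_cn, read_for_vn_order)
```

Each gather reads from scattered addresses but writes slot i from "thread" i, which mirrors the coalesced-write, uncoalesced-read trade-off described in the text. The per-node totals play the role of intermediate check-node values from which the per-edge messages are finished later.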
As shown in Fig. 4.8, message-passing kernels (1 and 3) instantiate up to T threads, where T is the
maximum number of edge connections between all CNs and VNs, Kernel 2 instantiates (n− k) threads
equal to the total number of CNs in the matrix, and Kernel 4 instantiates up to n threads equal to the
total number of VNs in the matrix. When early termination is enabled, T threads are required in kernels
1 and 3, and n threads are required in Kernel 4. However, when early termination is disabled, the number
of threads instantiated in kernels 1, 3, and 4 can be reduced due to the large number of degree-1 VNs
along the long diagonal in the parity-check matrices, as illustrated in Fig. 4.1. As previously described,
degree-1 VNs along the diagonal need to pass VN-to-CN messages only in the first decoding iteration,
while CN-to-VN messages need to be passed to degree-1 VNs only if the early-termination condition is
enabled. The degree-1 VNs along the diagonal correspond to the majority (but not all) of the (n − k)
parity bits that are discarded after decoding, thus the VN update computation needs to be performed in
these degree-1 VNs only when early termination is enabled. The message-passing Kernels (1 and 3) need
only to instantiate threads that correspond to the CN-VN connections to the left of the long diagonal
in the matrix structure shown in Fig. 4.1. Similarly, the VN-update Kernel (4) needs only to instantiate
threads that correspond to VNs to the left of the long diagonal. This reduction in the number of threads
provides a marginal speedup in each iteration.
While not shown in Fig. 4.8, the early-termination check is implemented via multiple kernels that
perform a parallel reduction following the VN-to-CN message-passing Kernel (1) in order to compute
the parity at each CN. Additional computations and memory reads/writes are required in the message-
passing and VN-update kernels (1, 3, and 4). The following additional operations must be performed to
enable an early-termination check: send the decision bit from each VN to its connected CNs, send all
mcv messages from each CN to its connected VNs (including those corresponding to connections along
the long diagonal), and calculate the decision bit in each VN. To reduce overall decoding latency and
maximize throughput, the early-termination check is performed only after a fixed number of decoding
iterations. This fixed number of iterations corresponds to the average number of iterations required at
each SNR point, and is pre-determined empirically through FER simulation for each code. The decoder
uses a lookup table to decide after how many decoding iterations to enable the early-termination check
based on the current SNR.
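The delayed early-termination policy described above can be sketched as follows. The lookup-table contents, the nearest-SNR selection rule, and the run_iteration callback are hypothetical placeholders. The sketch performs the parity check sequentially, whereas the thesis implements it as a parallel reduction across multiple kernels.

```python
def enable_iteration(snr_db, lookup):
    """Pick the iteration count stored for the tabulated SNR nearest to snr_db."""
    nearest = min(lookup, key=lambda s: abs(s - snr_db))
    return lookup[nearest]


def parity_satisfied(cn_neighbors, hard_bits):
    """True if every check node sees an even number of ones."""
    return all(sum(hard_bits[v] for v in vs) % 2 == 0 for vs in cn_neighbors)


def decode(run_iteration, cn_neighbors, max_iters, enable_after):
    """Run up to max_iters iterations, checking parity only from enable_after on.

    run_iteration(it) is a hypothetical callback that performs one decoding
    iteration and returns the current hard-decision bits.
    """
    bits = None
    for it in range(1, max_iters + 1):
        bits = run_iteration(it)
        if it >= enable_after and parity_satisfied(cn_neighbors, bits):
            return bits, it
    return bits, max_iters


if __name__ == "__main__":
    # Hypothetical table: SNR (dB) -> average iterations from FER simulation.
    table = {-15.5: 470, -14.0: 120}
    print(enable_iteration(-15.4, table))  # 470
```

Skipping the check in early iterations avoids the extra decision-bit traffic and reduction kernels during the part of decoding where convergence is unlikely.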
A quasi-cyclic matrix structure reduces data permutation and memory access complexity by elimi-
nating random, unordered memory access patterns. In addition, QC codes require fewer memory lookups
for message passing since the parity-check matrix can be described with approximately q-times fewer
terms, where q is the expansion factor of the QC parity-check matrix, in comparison to a random matrix
for the same block length. Table 4.2 presents a breakdown of the latency of each GPU kernel for the
three LDPC codes designed in this thesis. While the CN and VN update kernels (2 and 4) have similar
runtime for both random and QC codes, QC codes achieve faster runtime in data permutation kernels (1
and 3) due to the approximately q-times fewer CN-VN edge connections in the parity-check matrix. Since
the parity-check matrices designed in this thesis are sparse, a compressed data structure is used to store
CN-VN edge connections to reduce memory read latency in the message-passing kernels.
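To make the "approximately q-times fewer terms" argument concrete, the sketch below expands a small base matrix of circulant shifts into its full edge list. The base matrix, the use of -1 to mark an all-zero block, and the shift convention (identity matrix cyclically shifted by the stored value) are assumptions for illustration. The thesis's own compressed data structure is not specified in this section.

```python
def expand_qc(base, q):
    """Expand a base matrix of circulant shifts into (row, col) edges.

    base[i][j] is a shift in 0..q-1, or -1 for an all-zero q x q block.
    """
    edges = []
    for bi, row in enumerate(base):
        for bj, shift in enumerate(row):
            if shift < 0:
                continue
            for r in range(q):
                edges.append((bi * q + r, bj * q + (r + shift) % q))
    return edges


def stored_terms(base):
    """Number of entries needed to describe the QC matrix."""
    return sum(1 for row in base for shift in row if shift >= 0)


if __name__ == "__main__":
    base = [[0, 2, -1],
            [1, -1, 0]]
    q = 4
    edges = expand_qc(base, q)
    print(stored_terms(base), len(edges))  # 4 16
```

Only the shift values need to be looked up. The q edges of each circulant block follow from the shift by modular arithmetic, which is why the description is q times smaller than an explicit edge list of the same expanded matrix.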
Table 4.2 also highlights the respective error-correction performance and GPU throughput of the
three codes at the maximum β = 0.99 efficiency with d = 8 reconciliation. The raw GPU throughput
(including parity bits) is given by
$$K^{\text{raw}}_{\text{GPU}} = \frac{\text{Block Length}}{\text{Latency Per Iteration} \times \text{Iterations}} \quad \text{(bits/s)}. \tag{4.4}$$
The information throughput of the GPU decoder must be scaled by (1) the factor (1 − Pe), where Pe is
the FER, to account for discarded frames when decoding is unsuccessful, i.e., CRC does not pass or
parity check fails, and (2) the code rate Rcode to account for the parity bits that must be discarded
after decoding. The average GPU information throughput is then given by
$$K'_{\text{GPU}} = K^{\text{raw}}_{\text{GPU}} \, R_{\text{code}} \, (1 - P_e). \tag{4.5}$$
Thus, for any LDPC code, the GPU throughput is determined by the latency per iteration and the
number of decoding iterations. The latency per iteration depends on the LDPC code structure and the
number of memory lookups, while the FER is bounded by the maximum number of iterations.
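Eqs. 4.4 and 4.5 can be checked against the q = 21 QC code column of Table 4.2 with a few lines of arithmetic. The function names are illustrative, and small differences from the tabulated throughput are expected because the table inputs are rounded.

```python
def raw_throughput(block_length_bits, latency_per_iter_s, iterations):
    """Eq. 4.4: raw decoder throughput in bits/s, parity bits included."""
    return block_length_bits / (latency_per_iter_s * iterations)


def info_throughput(raw_bps, code_rate, fer):
    """Eq. 4.5: throughput of successfully decoded information bits in bits/s."""
    return raw_bps * code_rate * (1.0 - fer)


if __name__ == "__main__":
    # q = 21 QC code at beta = 0.99, d = 8 (values from Table 4.2).
    raw = raw_throughput(1.008e6, 1.296e-3, 451)
    info = info_throughput(raw, 0.02, 0.792)
    print(raw / 1e6)   # about 1.72 Mb/s
    print(info / 1e3)  # about 7.2 Kb/s
```

The calculation shows why the information throughput is three orders of magnitude below the raw throughput at this operating point: the code rate of 0.02 and the frame success rate of roughly 0.21 multiply together.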
Some GPU-based LDPC decoders use fixed-point number representations and/or frame-level
parallelism to maximize computational speedup for codes with short block lengths (n < 10^5 bits) in
high-SNR regions above 0dB where the Min-Sum algorithm achieves sufficient error-correction
performance [133–137]. This work, however, uses single-precision floating point to minimize FER with
Sum-Product decoding at SNRs below -15dB. Due to the large block length (n = 10^6 bits), all GPU
threads are fully utilized, thus external (frame-level) parallelism does not provide additional speedup.
Asynchronous data transfer to the GPU is another technique often employed to minimize overhead
latency; however, this does not provide any significant performance boost as the Sum-Product computation
dominates overall execution time due to the large number of iterations required for low-SNR decoding.
4.4.2 Information Throughput Results
Figure 4.9 presents the measured information throughput K ′GPU from the GPU decoder for all three
LDPC codes at each β-efficiency point, which corresponds to a unique SNR-FER point in Figures 4.2
and 4.3 for the d = 1 and d = 8 dimensional reconciliation cases, respectively. Table 4.3 compares
the performance of the rate 0.02 random and QC codes at the maximum achievable distance for each
reconciliation dimension, assuming a privacy amplification block size of Nprivacy = 10^12 bits. The q = 21
and q = 50 QC codes designed in this thesis achieve approximately 3× higher raw decoding throughput
K^raw_GPU than the random code with d = 1, 2, 4, 8 dimensional reconciliation at the maximum distance
point for each β-efficiency. When scaled by the corresponding FER and code rate, the better-performing
QC code at each reconciliation dimension achieves between 5.1× and 12.8× higher information throughput
K′GPU than the random code. Table 4.3 also presents the operating secret key rate K′finite defined by
Eq. 4.3, and the fundamental secret key rate limit Klim for a lossy channel defined by Eq. 2.11. Here,
the fundamental limit is scaled by the light source repetition rate frep, such that
$$K'_{\text{lim}} = f_{\text{rep}} K_{\text{lim}}. \tag{4.6}$$
A realistic CV-QKD repetition rate of frep = 1MHz is assumed for the comparison [59, 62, 100]. For
distances beyond 130km, the operating secret key rate K ′finite is between 2176× and 57112× lower than
the fundamental limit K ′lim, with d = 8 and d = 1 dimensional reconciliation, respectively. The upper
bound versus distance is plotted in Fig. 4.10, along with the GPU-decoded information throughput for
the q = 21 QC code under d = 8 dimensional reconciliation. Figure 4.10 illustrates that the decoded
information throughput K ′GPU of the reconciliation algorithm is higher than the upper bound on secret
key rate K ′lim on a lossy channel with a 1MHz source from β = 0.8 to β = 0.99.
The rightmost column in Table 4.3 (K ′GPU/K′lim) presents the two key results of this work. First, it
shows that the GPU decoder can achieve between 1.07× and 8.03× higher information throughput K ′GPU
over the fundamental secret key rate limit K ′lim with a 1MHz source using QC-LDPC codes with d = 4
and d = 8 dimensional reconciliation. The 8.03× speedup is also highlighted in Fig. 4.10 at 160km with
β = 0.99. Since the decoder delivers an information throughput higher than the fundamental key rate
limit, it can be concluded that LDPC decoding is no longer the post-processing bottleneck in CV-QKD,
and thus, the secret key rate remains only limited by the physical parameters of the quantum channel.
The second result is that d = 1 and d = 2 dimensional reconciliation schemes are not well-suited for
long-distance CV-QKD since the K′GPU speedup over K′lim is less than 1× in every case except the
q = 21 QC code with d = 2 (1.076×). In general, Table 4.3 shows that QC codes achieve lower decoding
latency than the random code at long distances, thereby making them more suitable for reverse
reconciliation at high β efficiencies.

Table 4.3: Overview of secret key rate and GPU throughput at maximum reconciliation distance with rate 0.02 multi-edge codes and Nprivacy = 10^12 bits

| Reconciliation Dimension | Maximum Reconciliation Efficiency | LDPC Code | Maximum Distance (km) | Operating Secret Key Rate K′finite at Max Distance with frep = 1MHz (bit/s) | Fundamental Key Rate Limit K′lim at Max Distance with frep = 1MHz (Kbit/s) | GPU Raw Throughput K^raw_GPU (Mbit/s) | GPU Info. Throughput K′GPU (Kbit/s) | K′GPU Speedup Over K′lim (K′GPU/K′lim) |
|---|---|---|---|---|---|---|---|---|
| d = 1 | β = 0.960 | Random | 131.38 | 0.060 | 3.405 | 0.612 | 0.111 | 0.033× |
| d = 1 | β = 0.960 | QC, q = 21 | 131.38 | 0.119 | 3.405 | 1.887 | 0.686 | 0.202× |
| d = 1 | β = 0.960 | QC, q = 50 | 131.43 | 0.235 | 3.397 | 1.966 | 1.426 | 0.420× |
| d = 2 | β = 0.970 | Random | 137.99 | 0.051 | 2.510 | 0.612 | 0.223 | 0.087× |
| d = 2 | β = 0.970 | QC, q = 21 | 137.99 | 0.203 | 2.510 | 1.856 | 2.700 | 1.076× |
| d = 2 | β = 0.970 | QC, q = 50 | 137.85 | 0.050 | 2.526 | 1.983 | 0.360 | 0.142× |
| d = 4 | β = 0.970 | Random | 137.99 | 0.101 | 2.510 | 0.604 | 0.439 | 0.175× |
| d = 4 | β = 0.970 | QC, q = 21 | 137.99 | 0.302 | 2.510 | 1.818 | 3.938 | 1.569× |
| d = 4 | β = 0.970 | QC, q = 50 | 137.85 | 0.401 | 2.526 | 1.855 | 2.692 | 1.065× |
| d = 8 | β = 0.990 | Random | 160.47 | 0.230 | 0.891 | 0.604 | 1.409 | 1.581× |
| d = 8 | β = 0.990 | QC, q = 21 | 160.47 | 0.410 | 0.891 | 1.724 | 7.160 | 8.033× |
| d = 8 | β = 0.990 | QC, q = 50 | 160.52 | 0.224 | 0.889 | 1.808 | 4.207 | 4.733× |

[Figure 4.9 plot: GPU information throughput (bits/s, log scale from 10^2 to 10^6) vs. reconciliation efficiency β (0.85 to 1), with curves for the random, QC q = 21, and QC q = 50 codes under d = 1 and d = 8 reconciliation.]
Figure 4.9: Measured information throughput K′GPU vs. reconciliation efficiency for d = 1 and d = 8 dimensional reconciliation. Each measurement point corresponds to a particular SNR operating point with a measured FER presented in Fig. 4.5.

[Figure 4.10 plot: information rate (bits/second, log scale from 10^2 to 10^7) vs. distance (0 to 180 km), showing the upper bound on secret key rate for a lossy channel with frep = 1MHz and the GPU information throughput for the q = 21 QC code at β = 0.80, 0.89, 0.92, 0.95, 0.98, 0.99, with an 8× gap annotated at the longest distance. Parameters: ε = 0.005, vel = 0.041, η = 0.606, α = 0.2, Nprivacy = 10^12.]
Figure 4.10: GPU information throughput K′GPU of the q = 21 QC-LDPC code with d = 8 dimensional reconciliation up to the maximum distance point for β ∈ {0.80, 0.89, 0.92, 0.95, 0.98, 0.99}, and upper bound on secret key rate for lossy channel K′lim vs. distance.
The results presented in Table 4.3 and Fig. 4.10 assumed a light source repetition rate of frep = 1MHz.
While a higher source repetition rate such as frep = 100MHz or frep = 1GHz would raise the fundamental
secret key rate limit K′lim above the maximum GPU decoder throughput K′GPU, it would still not
introduce a post-processing bottleneck for CV-QKD. The GPU decoder currently delivers an information
throughput K′GPU between 1868× and 18790× higher than the operating secret key rate K′finite with a
1MHz light source at the maximum distance points for d = 1, 2, 4, 8 dimensional reconciliation schemes
beyond 130km. Even with a source repetition rate of frep = 1GHz, the GPU information throughput
K′GPU would still exceed the operating secret key rate K′finite by a factor of 1.8× to 18.7× for
distances beyond 130km, assuming the same quantum channel parameters. Therefore, GPUs remain a viable
platform for the implementation of reconciliation algorithms for long-distance CV-QKD.
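The ratios quoted in this section follow directly from the Table 4.3 entries. The sketch below recomputes two of them and shows how they change with the repetition rate, under the assumption (implicit in the text) that the decoder throughput K′GPU is independent of frep while K′finite and K′lim scale linearly with it (Eqs. 4.3 and 4.6). The function name is illustrative, and the results differ slightly from the quoted figures because the table values are rounded.

```python
def throughput_ratios(k_finite_bps, k_lim_kbps, k_gpu_kbps, f_rep_hz=1e6):
    """Return (K'GPU / K'lim, K'GPU / K'finite) for a given repetition rate.

    The key-rate inputs are the Table 4.3 values, which assume f_rep = 1 MHz.
    """
    scale = f_rep_hz / 1e6  # key rates scale linearly with f_rep
    gpu_over_lim = k_gpu_kbps / (k_lim_kbps * scale)
    gpu_over_finite = (k_gpu_kbps * 1e3) / (k_finite_bps * scale)
    return gpu_over_lim, gpu_over_finite


if __name__ == "__main__":
    # d = 8, q = 21 QC code at 160.47 km (Table 4.3).
    print(throughput_ratios(0.410, 0.891, 7.160))
    # d = 8, q = 50 QC code with a 1 GHz source.
    print(throughput_ratios(0.224, 0.889, 4.207, f_rep_hz=1e9))
```

At 1GHz the decoder no longer exceeds the fundamental limit, but it still exceeds the operating key rate, which is the quantity that determines whether reconciliation is the bottleneck.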
4.4.3 Comparison to Other CV-QKD Implementations
While QKD has been well-studied over the past 30 years, the exploration of long-distance CV-QKD is
still nascent, with very few published implementations in the low-SNR regime for optical transmission
distances beyond 100km. Hardware-based implementations of DV-QKD and short-distance CV-QKD
have previously been demonstrated using FPGAs and GPUs [65, 66, 131, 138]; however, at the time of
writing, there is only one reported CV-QKD implementation designed to operate in the low-SNR regime
for long-distance reconciliation [60].
Jouguet and Kunz-Jacques reported a GPU-based LDPC decoder implementation that achieves
7.1Mb/s throughput at SNR = 0.161 (β = 0.93) on the BIAWGNC [60], for a random multi-edge
LDPC code with a block length of 2^20 bits based on the rate 1/10 multi-edge code designed by
Richardson and Urbanke with an SNR threshold of 0.1556 [71]. For throughput comparison purposes, two
additional multi-edge codes with the same code rate, block length, and SNR threshold were constructed
in this thesis: a random code and a q = 512 QC code (see footnote 2).
Table 4.4 presents a performance comparison between the two designed rate 1/10 codes and the result
achieved by Jouguet and Kunz-Jacques at SNR = 0.161 on the BIAWGNC [60]. The two designed codes
achieve a FER of approximately 0.04 under the same decoding conditions as the comparison work with
d = 8 dimensional reconciliation. Similar to the results presented in Tables 4.2 and 4.3, the q = 512 QC
code achieves approximately 3× lower latency per iteration than the random rate 1/10 code designed in
this thesis. Rate 1/10 QC codes with expansion factors q ∈ {64, 128, 256} were also designed; however,
the q = 512 QC code achieved the lowest latency per iteration due to the lower number of required
memory accesses in the GPU message-passing kernels, as a result of the lower number of connections
in the QC parity-check matrix. While the designed rate 1/10 random code achieves a maximum raw
throughput of only 2.78Mb/s, the q = 512 QC code delivers a maximum raw throughput of 9.17Mb/s with
early termination enabled only in iterations greater than the average number of iterations, as determined
empirically through FER simulation. The q = 512 QC code achieves a 1.29× higher throughput than the
7.1Mb/s reported by Jouguet and Kunz-Jacques [60], further demonstrating that the QC code structure
offers computational speedup benefits for multi-edge codes operating in the high β-efficiency region at
low SNR. Although the comparison work is from 2014, both GPU models have a similar memory bus
width, which is the primary constraint that limits the latency per iteration. As previously discussed,
GPU decoder performance is bound by the memory access rate, and not the floating point operations
per second (FLOPS). Thus, a wider GPU memory bus allows for a higher memory access rate, which in
turn, reduces the decoding latency.

Footnote 2: Lei M. Zhang specifically constructed the rate 1/10 random and QC codes based on the degree distribution designed by Richardson and Urbanke [71].

Table 4.4: GPU LDPC decoding comparison at SNR = 0.161 with d = 8 on BIAWGNC targeting FER = 0.04 with rate 1/10 codes

| Specification | This Work (2016): Random Multi-Edge | This Work (2016): Random Multi-Edge | This Work (2016): q = 512 QC Multi-Edge | This Work (2016): q = 512 QC Multi-Edge | Jouguet and Kunz-Jacques (2014) [60]: Random Multi-Edge |
|---|---|---|---|---|---|
| Code Rate | 1/10 | 1/10 | 1/10 | 1/10 | 1/10 |
| Block Length (Bits) | 2^20 | 2^20 | 2^20 | 2^20 | 2^20 |
| SNR | 0.161 | 0.161 | 0.161 | 0.161 | 0.161 |
| Connections in Parity Matrix | 4,063,229 | 4,063,229 | 7,932 | 7,932 | N/A |
| Early Termination | No | Yes | No | Yes | No |
| Max Iterations | 88 | 88 | 100 | 100 | 100 |
| Average Iterations | 88 | 78 | 100 | 78 | 100 |
| FER (1) | 0.04 | 0.04 | 0.0243 | 0.0243 | 0.04 |
| Latency Per Iteration (ms) (2) | 4.73 | 4.84 | 1.28 | 1.47 | 1.48 |
| K^raw_GPU GPU Raw Throughput (Mb/s) | 2.52 | 2.78 | 8.21 | 9.17 | 7.1 |
| K′GPU GPU Info. Throughput (Kb/s) | 242 | 267 | 801 | 895 | 682 |
| GPU Model | NVIDIA GeForce GTX 1080 | NVIDIA GeForce GTX 1080 | NVIDIA GeForce GTX 1080 | NVIDIA GeForce GTX 1080 | AMD Radeon HD 7970 |
| CMOS Technology | 16nm | 16nm | 16nm | 16nm | 28nm |
| GPU Cores | 2560 | 2560 | 2560 | 2560 | 2048 |
| GPU GFLOPS | 8228 | 8228 | 8228 | 8228 | 3789 |
| GPU Memory Bus Width (Bits) | 256 | 256 | 256 | 256 | 384 |
| GPU Memory Bandwidth (GB/s) | 320 | 320 | 320 | 320 | 264 |

(1) FER Pe corresponds to the probability of detected error, since Pundetected = 0 with 32-bit CRC. All three codes achieve a CV-QKD distance of 83.8km based on the quantum channel parameters assumed in this thesis.
(2) Latency per iteration is an average for the full decoding of a single frame, and also includes the data transfer latency between the CPU and GPU.
Other types of error-correcting codes have been studied for application in the low-SNR regime of CV-
QKD, such as polar codes, repeat-accumulate (RA) codes, and Raptor codes. Polar codes require block
lengths on the order of 2^27 bits to achieve comparable FER performance to the rate 1/10 multi-edge
LDPC codes designed in this thesis; however, they have been shown to achieve low decoding latency on
generic x86 CPUs due to their recursive decoding algorithm [60]. A polar-code performance comparison
is not available for the rate 0.02 multi-edge QC-LDPC codes designed in this thesis. Punctured and
extended low-rate RA codes have been constructed from ETSI DVB-S2 codes with block lengths of 64,800
bits to achieve β > 0.85 efficiency over a wide range of SNRs [139], however, their performance has not
been investigated beyond 70km and there is currently no hardware implementation to provide a sufficient
throughput comparison. Lastly, Raptor codes achieve high β-efficiency at low SNR and guarantee error-
free decoding (Pe = 0) by sending as many coded symbols as required by the receiver [140]. However,
their decoding latency may be a limitation to high-throughput reconciliation, and at the time of writing,
there is no known hardware implementation of Raptor codes for long-distance CV-QKD. The demand for
long-distance communication through applications such as CV-QKD motivates the need for continued
research in high-efficiency codes and their hardware realizations.
4.5 Summary
This chapter introduced quasi-cyclic multi-edge LDPC codes to accelerate the reconciliation step in
long-distance CV-QKD by means of a GPU-based decoder implementation and multi-dimensional rec-
onciliation schemes. With an 8-dimensional reconciliation scheme, the GPU-based decoder delivers an
information throughput up to 8.03× higher than the upper bound on secret key rate for a lossy channel
with a 1MHz source, thereby demonstrating that key reconciliation is no longer a computational bottle-
neck in long-distance CV-QKD. Furthermore, the low-rate LDPC codes extend the maximum distance
of CV-QKD from the previously achieved 100km to 160km based on the quantum channel and privacy
amplification parameters assumed in this thesis. LDPC codes with longer block lengths on the order of
n = 10^7 or n = 10^8 bits could also be designed to improve the error-correction performance and further
increase distance at the expense of decoding latency.
The LDPC codes and reconciliation techniques applied in this thesis can be extended to post-
processing algorithms in two areas that show promise for the future of QKD: (1) free-space QKD using
low-Earth orbit satellites as communication relays to extend the distance of secure communication be-
yond 200km without fiber-optic infrastructure, and (2) fully-integrated chip implementations [40]. Recent
works have experimentally demonstrated terrestrial free-space QKD for distances up to 143km [141,142],
while satellite-based QKD has been proposed as a practical near-term solution to achieving long-distance
QKD on a global scale [56, 143]. In August 2016, China launched the Quantum Experiments at Space
Scale (QUESS) satellite to generate secret keys between ground stations in Beijing and Vienna by trans-
mitting entangled photon pairs from an orbit altitude of 500km [144, 145]. Free-space fading channels
for satellite QKD typically operate at SNRs above 0dB [146], however, quasi-cyclic code construction
techniques can still be employed to achieve high secret key rates, while GPUs would allow for simple
integration with other satellite equipment for rapid prototyping, in contrast to ASIC- or FPGA-based
LDPC decoder implementations. This thesis presented the computational speedup achievable on a single
state-of-the-art GPU. Further acceleration can be achieved through architectural optimizations in the
design of a monolithic QKD chip that combines both optical and post-processing circuits. Photonic chips
have already been realized for QKD transmitters and receivers [40,57,147,148], and further integration
of post-processing algorithms would provide a considerable reduction in system size and power consump-
tion. A final key takeaway here is that the quasi-cyclic LDPC code construction and GPU architecture
techniques presented in this thesis can also be applied to forward error-correction implementations for
DV-QKD where reconciliation is performed over the binary symmetric channel (BSC) instead of the
BIAWGNC as in CV-QKD. The derivation of reverse reconciliation with LDPC decoding for DV-QKD
on the BSC is provided in Appendix D.
This chapter addressed the challenge of achieving high-speed, high-efficiency reconciliation for long-
distance CV-QKD over fiber-optic cable. In addition to extending information-theoretic security to
general attacks for finite key sizes, a major remaining hurdle to extending the secure transmission distance
in CV-QKD is the reduction of excess noise in the optical quantum channel. While recent techniques have
been demonstrated to control excess noise to within a tolerable limit [63], future work may also investigate
the security of CV-QKD in the presence of non-Gaussian noise sources, and in particular, the performance
of LDPC decoding at low SNR with non-Gaussian noise. GPU-based decoder implementations with
quasi-cyclic codes would provide a suitable platform for such investigations. Furthermore, reducing the
latency of privacy amplification for large block sizes on the order of $N_{\text{privacy}} \geq 10^{12}$ bits is necessary in
order to realize secret key exchange for distances beyond 100km.
Chapter 5
Conclusion and Future Directions
This thesis presented techniques for reducing design complexity in the implementation of LDPC decoders
for integrated circuits targeting high-performance wireless channels, and secret key reconciliation in
quantum cryptography over long-distance optical fiber. This thesis showed that it was possible to
leverage the quasi-cyclic structure of LDPC parity-check matrices to reduce decoder latency, complexity,
and power, while maximizing throughput in each of the two distinct application areas. In Chapter 3,
a new message-passing schedule was proposed for a frame-interleaved architecture in order to minimize
interconnect routing complexity and reduce overall power consumption in silicon-based decoders for
modern CMOS technology nodes. The fabricated test chip achieves record energy efficiency among
published ASIC decoders for the IEEE 802.11ad standard. In Chapter 4, a quasi-cyclic structure was
applied to a multi-edge LDPC code with block length of $10^{6}$ bits in order to enable coalesced GPU
memory access patterns to reduce decoding latency for long-distance quantum cryptography. The error-
correction performance of the LDPC code extends the maximum CV-QKD transmission distance from
100km to 160km, while the GPU-accelerated decoder delivers an information throughput higher than
the upper bound on secret key rate for a lossy channel. While LDPC decoding is no longer the post-
processing bottleneck, other factors such as privacy amplification and parameter estimation reduce the
secret key rate below this upper bound. The record results presented in Chapters 3 and 4 were achieved
through combined algorithmic and architectural techniques by exploiting the quasi-cyclic structure of
LDPC parity-check matrices in both integrated circuit and quantum cryptography application areas.
The two final sections in this chapter provide a summary of the contributions presented in this thesis,
as well as a recommendation for future research areas based on these contributions.
5.1 Summary of Contributions
The three major contributions of this thesis can be summarized as follows:
• Developed a low-power, frame-interleaved architecture for LDPC decoders to reduce interconnect
complexity and improve scalability in modern CMOS technology nodes by modifying the Min-
Sum belief propagation algorithm and introducing pipelined frame-interleaving and clock gating
techniques that exploit the inherent structure of QC-LDPC codes. QC-LDPC codes were used as a
vehicle to illustrate a general approach, however, the proposed architecture and decoding schedule
can be extended to non-QC codes.
• Designed (algorithm, micro-architecture, RTL), synthesized, placed-and-routed, fabricated, and
tested a 4.78mm² proof-of-concept test chip in 28nm CMOS containing 837K gates, 160Kb total
eSRAM, 2 asynchronous clock domains, and 2 power domains, achieving 6.78Gb/s and 11pJ/bit
efficiency at 76mW with a 202MHz clock for IEEE 802.11ad codes at a BER of $10^{-6}$.
• Constructed quasi-cyclic multi-edge LDPC codes with block lengths of $10^{6}$ bits, and implemented a
1.72Mb/s GPU-accelerated (CUDA C++) decoder to extend the secure distance of key reconcilia-
tion in CV-QKD from the previous 100km to 160km over fiber-optic cable using multi-dimensional
reconciliation schemes that achieve up to 8× higher throughput than the upper bound on secret
key rate for a lossy channel.
5.2 Future Directions
The frame-interleaved LDPC decoder architecture presented in Chapter 3 can be extended to a number of
active research areas. First, the architecture can be extended for non-quasi-cyclic and spatially-coupled
LDPC codes. Second, the architecture can be adopted for silicon implementation of new decoding algo-
rithms that achieve better error-correction performance than traditional belief propagation algorithms.
Third, the architecture can be adopted for near-threshold voltage operation to further reduce power
consumption. Finally, the architecture can be extended to a stacked-die implementation for codes with
longer block lengths, like those investigated in Chapter 4 for QKD. This section suggests some future
research directions based on the contributions presented in this thesis.
5.2.1 Extendibility to Non-Quasi-Cyclic and Spatially-Coupled LDPC Codes
As suggested in Chapter 3, the path-unrolled architecture is not restricted to QC codes, but can rather
be applied more generally to random codes, or to codes that allow for column partitioning as a way to
enforce structure. In the path-unrolled structure, global routing is eliminated, and instead, routing is
constrained between successive column-slice pairs. Thus, any LDPC code that can be represented as a
Tanner graph with two independent vertex sets of CNs and VNs can be implemented using the proposed
architecture with a path-unrolled message-passing schedule.
The architecture can be extended to support non-QC codes, as well as random codes. Consider the
example of a non-QC matrix with two CN connections to a VN in a single macro-layer, i.e., two ‘1’
elements in a sub-matrix column. Here, the combined CN+VN processing unit simply needs additional
CN-update logic to support the additional CN connected to the VN, while the VN-update logic remains
the same. Additional wired routing is required to and from the additional CN-update logic block,
however, the overall path-unrolled message-passing schedule remains the same. This example can be
extended to a random code, which can be partitioned into uniformly-defined macro-columns. Combined
CN+VN processing units would then only contain CN logic for the CNs connected to each VN. The
drawback of this implementation is that some CN+VN processors would have more internal CN-update
logic blocks, while other CN+VN processors would have fewer. Future work may explore better
hardware mapping and partitioning of non-QC codes for the architecture introduced in this thesis.
The frame-interleaved architecture can also support windowed decoding of spatially-coupled LDPC
codes [149]. Each decoding window of the spatially-coupled code can be mapped to an independent
column-slice pair. In this case, instead of decoding multiple frames, the decoder would slide the window
as data moves from one pipeline stage to the next. The systolic nature of the proposed architecture is
well-suited for this application.
5.2.2 Linear-Program Decoding for High-SNR Channels
Linear program (LP) decoding of binary linear block codes via the Alternating Direction Method of
Multipliers (ADMM) has recently demonstrated improved error-floor decoding performance over the BP
Sum-Product algorithm in high-SNR Gaussian channels [150]. This makes ADMM-LP attractive for
optical transport and storage applications. ADMM-LP frames error correction as a convex optimization
problem, in contrast to BP, which frames error correction as a problem of graphical inference. An FPGA-
based implementation of ADMM-LP decoding recently demonstrated FER performance within 0.5dB
of floating-point precision at $E_b/N_0$ = 6.5dB with an FER of $10^{-6}$ for the rate 13/16 code of the
IEEE 802.11ad standard [151]. However, the implementation achieves a throughput of only 13.16Mb/s.
The error-rate, power consumption, and throughput performance of the same code were presented for
the silicon-based decoder implementation in Chapter 3 of this thesis. The fabricated chip achieves
a throughput of 6.78Gb/s with an FER of $10^{-4}$ at $E_b/N_0$ = 5.6dB. A possible future research area
would be to investigate the implementation of a silicon-based ADMM-LP decoder by applying the
ADMM-LP computation kernels in the combined CN+VN processing units of the frame-interleaved
LDPC decoder architecture presented in Chapter 3. An ASIC implementation may provide several
orders of magnitude speedup such that ADMM-LP achieves Gigabit/s decoding throughput, just like
the BP Min-Sum decoder presented in Chapter 3. Furthermore, the early termination patterns at high
SNR would allow for extensive clock gating in the frame-interleaved decoder architecture, thus offering
the prospect of low power performance for a silicon-based ADMM-LP decoder.
5.2.3 Decoder Architectures for Near-Threshold Voltage FinFET Operation
Near-threshold voltage (NTV) circuit design techniques have recently shown promise in improving energy
efficiency and alleviating on-chip power hotspots at the expense of performance, by operating at the
point where switching and leakage power are equal [152]. Switching energy dominates at supply voltages
greater than the NTV operating point, while leakage energy dominates at supply voltages below the NTV
operating point. The frame-interleaved LDPC decoder architecture presented in Chapter 3 of this thesis
can be extended to the NTV region of operation due to the high level of computational parallelism, deep
pipelining, and low clock frequency [153]. However, a particular challenge at NTV is mitigating SRAM
failure, since device mismatches degrade cell stability during read/write operations [152]. This behavior
was also observed during measurements of the fabricated proof-of-concept test chip in this thesis, as
shown in the measurement Shmoo plots in Fig. 3.15. SRAM failure can be avoided by implementing an
independent power domain for eSRAM, such that the memory operates at a higher supply voltage than
the NTV logic, and level shifters are used at the memory periphery. Furthermore, NTV operation is
largely unexplored in FinFET devices. Current research suggests that FinFETs offer significant voltage-
scaling improvements over planar technologies [153]. Since wired interconnect is still a limitation in
modern FinFET technology nodes, the frame-interleaved decoder architecture presented in Chapter 3
can offer significant energy reduction benefits under NTV operation, especially given the higher transistor
area density in sub-10nm FinFET nodes.
5.2.4 Decoder Architectures for 3-Dimensional Integrated Circuits
The silicon fabrication of a frame-interleaved LDPC decoder for a code with block length of $10^{6}$ bits
may not be feasible on a single die, however, 3D die-stacking techniques may enable an implementation
over multiple dies connected with through-silicon vias (TSVs) [154]. The piecewise time-distributed
decoding schedule applied in the frame-interleaved architecture would minimize the amount of message
passing between adjacent stacked dies, while the low clock frequency would enable TSV-based message-
passing without parasitic degradation [154]. The expansion factor of the quasi-cyclic matrices designed
for long-distance CV-QKD in Chapter 4 is on the same order of magnitude as the expansion factor of
matrices used in the frame-interleaved decoder implementation in Chapter 3. Despite the longer block
length of $10^{6}$ bits, the wiring permutation complexity between adjacent column slices for a QKD code
in the frame-interleaved architecture would remain about the same as the implemented 672-bit IEEE
802.11ad codes. The decoding latency would increase due to the larger number of columns, however, the
latency can be reduced through architectural optimization techniques to eliminate the long diagonal in
the parity-check matrix. Thus, the frame-interleaved decoder architecture presented in Chapter 3 can
be extended to the quasi-cyclic multi-edge LDPC codes designed in Chapter 4 in order to further reduce
decoding latency for long-distance CV-QKD. Finally, the introduction of 3D technologies may enable
the monolithic integration of LDPC decoder post-processing circuits with integrated photonics to build
a single-chip QKD solution.
Appendix A
Supplementary Background on QKD
This appendix provides a complete discussion of the quantum transmission, sifting, and privacy amplifi-
cation steps of the QKD protocol introduced in Chapter 2, as well as a derivation of the secret key rate
for a CV-QKD system with collective attacks.
A.1 Quantum Transmission and Measurement
To construct a secret key using the prepare-and-measure CV-QKD protocol, Alice first transmits
$N_{\text{quantum}}$ coherent states to Bob over an optical fiber. Each coherent state is comprised of a pair of
amplitude and phase quadrature operators, $x$ and $p$, of the form $|x + jp\rangle$, $j = \sqrt{-1}$. Using a quantum
random number generator, Alice prepares each coherent state by randomly selecting her $x_A$ and $p_A$
quadrature values according to a zero-mean Gaussian distribution with adjustable modulation variance
$\sigma_A^2 = V_A N_0$, where $N_0$ represents the shot noise variance defined by the Heisenberg inequality
$\Delta x \Delta p \geq N_0$ [10, 95]. Alice transmits her train of $N_{\text{quantum}}$ coherent states to Bob by modulating a
light source with a pulse repetition rate $f_{\text{rep}}$. She also records her selections of $x_A$ and $p_A$ for the
next sifting step, by constructing a vector, $A$, of length $2N_{\text{quantum}}$, from her $N_{\text{quantum}}$ coherent state
quadrature operator pairs $(x_A, p_A)$, such that $A_{2i-1} = x_{A_i}$ and $A_{2i} = p_{A_i}$ for $i = 1, 2, \ldots, N_{\text{quantum}}$.
As such, $A = (x_{A_1}, p_{A_1}, x_{A_2}, p_{A_2}, \ldots, x_{A_{N_{\text{quantum}}}}, p_{A_{N_{\text{quantum}}}})$ and $A \sim \mathcal{N}(0, \sigma_A^2)$. Bob randomly selects and
measures either the $x$ or $p$ quadrature for each incoming pulse using an unbiased homodyne detector.
Bob constructs his own vector, $B$, of length $N_{\text{quantum}}$, comprised of the observed modulated quadrature
measurements, where $B_i \in \{x_{B_i}, p_{B_i}\}$ with equal probability. Despite the losses in the optical fiber, and
the added noise from Bob's and Eve's detection equipment, the $x_B$ and $p_B$ quadrature measurements
can still be used to distill a secret key following the sifting and reconciliation (error correction) steps.
Without considering the presence of the eavesdropper (Eve), the quantum transmission is subject to
path loss, excess noise in the single-mode fiber between Alice and Bob, the inefficiency of Bob’s homodyne
detection, as well as added electronic (thermal) noise [59]. In this thesis, the quantum channel was
characterized using previously published parameters [59]. The excess channel noise expressed in shot
noise units is assumed to be ε = 0.005, Bob’s added electronic noise in shot noise units is chosen
as Vel = 0.041, Bob’s homodyne detector efficiency is set to η = 0.606, and the single-mode fiber
transmission loss is assumed to be 0.2dB/km, such that the transmittance of the quantum channel is
given by $T = 10^{-\alpha \ell / 10}$, where $\ell$ is the transmission distance in kilometers and $\alpha$ = 0.2dB/km. The total
noise between Alice and Bob is given by
$$\chi_{\text{total}} = \chi_{\text{line}} + \frac{\chi_{\text{hom}}}{T}, \tag{A.1}$$
where $\chi_{\text{line}} = \left(\frac{1}{T} - 1\right) + \varepsilon$ is the total channel added noise referred to the channel input, and
$\chi_{\text{hom}} = \frac{1 + V_{\text{el}}}{\eta} - 1$ is the noise introduced by the homodyne detector. The variance of Bob's measurement is given
by $\sigma_B^2 = V_B N_0 = \eta T (V + \chi_{\text{total}}) N_0$. Although the adversary (Eve) may have access to the quantum
channel, her presence is not considered in the channel characterization. Instead, the information leaked
to Eve will be considered in the secret key rate calculation [95].
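
The noise terms of Eq. A.1 can be evaluated numerically with a few lines of Python. The following is a minimal sketch that uses the channel parameters quoted above as defaults; the function name `channel_noise` and its argument names are illustrative assumptions, not part of the thesis software.

```python
def channel_noise(distance_km, eps=0.005, v_el=0.041, eta=0.606, alpha=0.2):
    """Evaluate Eq. A.1 (all noise terms in shot-noise units).

    Returns (T, chi_line, chi_hom, chi_total). The function and argument
    names are hypothetical; the default values are the channel parameters
    assumed in this appendix.
    """
    # Transmittance of the fiber: T = 10^(-alpha * l / 10)
    T = 10 ** (-alpha * distance_km / 10)
    # Channel-added noise referred to the channel input
    chi_line = (1 / T - 1) + eps
    # Noise introduced by the homodyne detector
    chi_hom = (1 + v_el) / eta - 1
    # Total noise between Alice and Bob (Eq. A.1)
    chi_total = chi_line + chi_hom / T
    return T, chi_line, chi_hom, chi_total


if __name__ == "__main__":
    for km in (10, 50, 100, 160):
        T, chi_line, chi_hom, chi_total = channel_noise(km)
        print(f"{km:4d} km: T={T:.4g}  chi_line={chi_line:.4g}  "
              f"chi_hom={chi_hom:.4g}  chi_total={chi_total:.4g}")
```

Because $\chi_{\text{hom}}$ is divided by $T$, the detector noise referred to the channel input grows with distance even though the detector itself does not change.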
A.2 Sifting
Following the quantum transmission step, Alice’s original transmission vector A contains twice as many
elements as Bob’s measurement vector B. In the sifting step, Bob informs Alice via the classical public
channel which of the xB or pB quadratures he randomly selected for each of his Nquantum element
measurements, such that Alice may respectively discard her Nquantum unused xA and pA quadrature
values [95]. After sifting, Alice and Bob share correlated random sequences of length $N_{\text{quantum}}$, herein
defined as $X_0 = (X_{0_1}, X_{0_2}, \ldots, X_{0_{N_{\text{quantum}}}})$ and $Y_0 = (Y_{0_1}, Y_{0_2}, \ldots, Y_{0_{N_{\text{quantum}}}})$, respectively, where
$(X_{0_i}, Y_{0_i})$, $i = 1, 2, \ldots, N_{\text{quantum}}$, are independent and identically distributed realizations of some jointly
Gaussian random variables $(X_0, Y_0)$. For example, Alice and Bob may have the following random
sequences after sifting: $X_0 = (x_{A_1}, p_{A_2}, p_{A_3}, x_{A_4}, \ldots, p_{A_{N_{\text{quantum}}}})$ and $Y_0 = (x_{B_1}, p_{B_2}, p_{B_3}, x_{B_4}, \ldots, p_{B_{N_{\text{quantum}}}})$.
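
The index bookkeeping of the sifting step can be sketched in Python as follows. This is only an illustration of Alice's side under simplifying assumptions: $N_0 = 1$, Bob's noisy measurements are not modeled, Python's pseudo-random generator stands in for a quantum random number generator, and the names `transmit_and_sift`, `choices`, and `x0` are hypothetical.

```python
import random


def transmit_and_sift(n_quantum, v_a=4.0, seed=0):
    """Sketch of Alice's record keeping for transmission and sifting.

    Assumptions (not from the thesis): N0 = 1, a seeded PRNG replaces the
    quantum random number generator, and Bob's measurement is not modeled.
    Indices are 0-based, so a[2*i] = x_A and a[2*i + 1] = p_A for pulse i.
    """
    rng = random.Random(seed)
    sigma_a = v_a ** 0.5  # standard deviation for variance V_A * N0

    # Alice's vector A of length 2 * N_quantum: (x_A1, p_A1, x_A2, p_A2, ...)
    a = [rng.gauss(0.0, sigma_a) for _ in range(2 * n_quantum)]

    # Bob's random quadrature choice per pulse: 0 selects x, 1 selects p
    choices = [rng.randrange(2) for _ in range(n_quantum)]

    # Sifting: Alice keeps only the quadrature Bob announced for each pulse
    x0 = [a[2 * i + c] for i, c in enumerate(choices)]
    return a, choices, x0


if __name__ == "__main__":
    a, choices, x0 = transmit_and_sift(8)
    print("Bob measured:", "".join("xp"[c] for c in choices))
    print("Alice keeps :", [round(v, 3) for v in x0])
```

After sifting, Alice holds exactly one value per pulse, which matches the length of Bob's measurement vector.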
A.3 Privacy Amplification
Alice first discards her erroneously decoded S messages, and informs Bob as to which messages she
discarded. Bob then discards his original S messages that correspond to the S messages that were
discarded by Alice. Alice concatenates all of her correctly decoded S messages to construct a long secret
key block of length Nprivacy = mk bits, where k is the length of the LDPC-decoded message S, and m
is some large non-zero integer. Bob also concatenates his corresponding S messages to construct a long
secret key, also of length Nprivacy bits. Alice and Bob then independently perform universal hashing on
their independent secret key blocks to reduce Eve’s knowledge of the key.
The speed of privacy amplification is an active area of research, with published results showing
maximum speeds of 100Mb/s for a block size of $N_{\text{privacy}} = 10^{8}$ bits [155]. The computational complexity
of universal hashing can be reduced from $O(n^2)$ to $O(n \log_2 n)$ by applying the fast Fourier transform
(FFT) or number theoretic transform (NTT) on a Toeplitz matrix [156]. Estimation of security
parameters is also performed during privacy amplification using $(N_{\text{quantum}} - N_{\text{privacy}})$ bits. A complete
discussion of parameter estimation and privacy amplification is beyond the scope of this work. Interested
readers should refer to [157] for further information.
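
To illustrate why a Toeplitz matrix admits a fast transform, the sketch below computes a Toeplitz hash over GF(2) in two ways: directly, with $O(n^2)$-style nested loops, and as a single convolution of the seed bits with the key bits. The convolution form is the structure that FFT- or NTT-based methods accelerate. This is a toy example under stated assumptions: it uses Python's big-integer multiplication as a stand-in for an FFT/NTT, and the function names and the seed layout `T[i][j] = seed_bits[i - j + n - 1]` are illustrative choices, not the method of [156].

```python
import random


def toeplitz_hash_direct(seed_bits, key_bits, out_len):
    """Multiply an out_len x n Toeplitz matrix by key_bits over GF(2).

    The matrix is defined by n + out_len - 1 seed bits with
    T[i][j] = seed_bits[i - j + n - 1] (an assumed layout).
    """
    n = len(key_bits)
    assert len(seed_bits) == n + out_len - 1
    out = []
    for i in range(out_len):
        acc = 0
        for j in range(n):
            acc ^= seed_bits[i - j + n - 1] & key_bits[j]
        out.append(acc)
    return out


def _pack(bits, width):
    """Place each bit in its own field of `width` bits inside one integer."""
    value = 0
    for k, b in enumerate(bits):
        value |= b << (width * k)
    return value


def toeplitz_hash_convolution(seed_bits, key_bits, out_len):
    """Same hash, computed as one convolution of seed_bits with key_bits.

    Each convolution coefficient is at most n, so fields of n.bit_length()
    bits never overflow into each other. Output bit i is coefficient
    i + n - 1 of the convolution, reduced modulo 2.
    """
    n = len(key_bits)
    width = n.bit_length()
    product = _pack(seed_bits, width) * _pack(key_bits, width)
    mask = (1 << width) - 1
    return [((product >> (width * (i + n - 1))) & mask) & 1
            for i in range(out_len)]


if __name__ == "__main__":
    rng = random.Random(0)
    n, m = 32, 8
    seed_bits = [rng.randrange(2) for _ in range(n + m - 1)]
    key_bits = [rng.randrange(2) for _ in range(n)]
    print(toeplitz_hash_direct(seed_bits, key_bits, m))
    print(toeplitz_hash_convolution(seed_bits, key_bits, m))
```

Both functions return the same output bits; only the second exposes the convolution that a fast transform would evaluate in $O(n \log_2 n)$ operations.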
A.4 Maximizing Secret Key Rate with Collective Attacks
The primary metric that defines the performance of a QKD system is the maximum rate at which Alice
and Bob can securely generate and reconcile keys over a fixed-distance optical fiber in the presence of an
eavesdropper that has access to both the quantum and classical channels. The maximum secret key rate
must be proven secure against a collective Gaussian attack, the optimal man-in-the-middle attack,
where Eve first prepares an ancilla state to interact with each one of Alice's coherent states during the
quantum transmission, and then listens to the public communication between Alice and Bob during
the reconciliation step in order to perform the optimal measurement on her collected ancillae to
reconstruct the classical messages transmitted by Bob [10]. Assuming perfect error-correction during
the reconciliation step, the maximum theoretical secret key rate for a CV-QKD system with one-way
reverse reconciliation can be defined as
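$$K_{\text{opt}} = \beta I_{AB} - \chi_{BE} \quad \text{(bits/pulse)}, \tag{A.2}$$
where $I_{AB}$ is the mutual information between Alice and Bob, $\beta$ is the previously defined reconciliation
efficiency, and $\chi_{BE}$ is the Holevo bound on the information leaked to Eve [10]. Here, $I_{AB}$ is equivalent
to the Shannon channel capacity, and is defined as
$$I_{AB} = \frac{1}{2}\log_2(1 + s) = \frac{1}{2}\log_2\left(\frac{V + \chi_{\text{total}}}{1 + \chi_{\text{total}}}\right), \tag{A.3}$$
where $V = V_A + 1$, $V_A$ is Alice's adjustable modulation variance, and $\chi_{\text{total}}$ is the total noise between
Alice and Bob. The Holevo bound is defined as
$$\chi_{BE} = G\left(\frac{\lambda_1 - 1}{2}\right) + G\left(\frac{\lambda_2 - 1}{2}\right) - G\left(\frac{\lambda_3 - 1}{2}\right) - G\left(\frac{\lambda_4 - 1}{2}\right), \tag{A.4}$$
where $G(x) = (x + 1)\log_2(x + 1) - x\log_2 x$, and the eigenvalues $\lambda_{1,2,3,4}$ are given by
$$\lambda_{1,2}^2 = \frac{1}{2}\left(A \pm \sqrt{A^2 - 4B}\right), \qquad \lambda_{3,4}^2 = \frac{1}{2}\left(C \pm \sqrt{C^2 - 4D}\right),$$
where
$$\begin{aligned}
A &= V^2(1 - 2T) + 2T + T^2(V + \chi_{\text{line}})^2 \\
B &= T^2(V\chi_{\text{line}} + 1)^2 \\
C &= \frac{V\sqrt{B} + T(V + \chi_{\text{line}}) + A\chi_{\text{hom}}}{T(V + \chi_{\text{total}})} \\
D &= \sqrt{B}\,\frac{V + \sqrt{B}\chi_{\text{hom}}}{T(V + \chi_{\text{total}})}.
\end{aligned}$$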
Optimizing Alice’s modulation variance for each quantum transmission distance ensures a maximum
SNR on the BIAWGNC [11], and thus, a maximum achievable secret key rate Kopt for a particular
β-efficiency. Figure A.1 presents the optimal modulation variance VA as a function of β for quantum
transmission distances up to 180km, assuming perfect error-correction in the reconciliation step. Fig-
ure A.2 shows the corresponding maximum theoretical secret key rate Kopt for CV-QKD based on the
computed optimal VA at each distance, as well as the upper bound on secret key rate for a lossy channel
defined by Eq. 2.11.
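
Equations A.1 to A.4 can be combined into a short numerical routine. The sketch below evaluates $K_{\text{opt}}$ for a given distance, modulation variance $V_A$, and reconciliation efficiency $\beta$, using the channel parameters assumed in this appendix as defaults. The function names `g` and `secret_key_rate` are illustrative, the clamping of small negative arguments is a numerical safeguard added here, and the search for the optimal $V_A$ is not shown.

```python
import math


def g(x):
    """G(x) = (x + 1) log2(x + 1) - x log2(x), with G(0) taken as 0."""
    x = max(x, 0.0)  # guard against tiny negative values from rounding
    if x == 0.0:
        return 0.0
    return (x + 1) * math.log2(x + 1) - x * math.log2(x)


def secret_key_rate(distance_km, v_a, beta,
                    eps=0.005, v_el=0.041, eta=0.606, alpha=0.2):
    """Evaluate K_opt = beta * I_AB - chi_BE (Eqs. A.1 to A.4), bits/pulse."""
    # Channel model (Eq. A.1)
    T = 10 ** (-alpha * distance_km / 10)
    chi_line = (1 / T - 1) + eps
    chi_hom = (1 + v_el) / eta - 1
    chi_total = chi_line + chi_hom / T
    V = v_a + 1

    # Mutual information between Alice and Bob (Eq. A.3)
    i_ab = 0.5 * math.log2((V + chi_total) / (1 + chi_total))

    # Coefficients A, B, C, D of the eigenvalue equations
    A = V ** 2 * (1 - 2 * T) + 2 * T + T ** 2 * (V + chi_line) ** 2
    B = T ** 2 * (V * chi_line + 1) ** 2
    sqrt_b = math.sqrt(B)
    denom = T * (V + chi_total)
    C = (V * sqrt_b + T * (V + chi_line) + A * chi_hom) / denom
    D = sqrt_b * (V + sqrt_b * chi_hom) / denom

    def eigen_pair(s, p):
        # lambda^2 = (s +/- sqrt(s^2 - 4p)) / 2
        root = math.sqrt(max(s * s - 4 * p, 0.0))
        return math.sqrt((s + root) / 2), math.sqrt(max((s - root) / 2, 0.0))

    lam1, lam2 = eigen_pair(A, B)
    lam3, lam4 = eigen_pair(C, D)

    # Holevo bound on the information leaked to Eve (Eq. A.4)
    chi_be = (g((lam1 - 1) / 2) + g((lam2 - 1) / 2)
              - g((lam3 - 1) / 2) - g((lam4 - 1) / 2))
    return beta * i_ab - chi_be


if __name__ == "__main__":
    for km in (10, 25, 50):
        print(km, "km:", secret_key_rate(km, v_a=4.0, beta=0.95))
```

A negative return value means that no secret key can be distilled for that combination of distance, $V_A$, and $\beta$, which is why $V_A$ must be re-optimized at each distance.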
[Figure A.1 plot: optimal $V_A$ (shot noise units, 0 to 20) vs. distance (km, 0 to 180), with one curve for each of β = 0.8, 0.83, 0.86, 0.89, 0.92, 0.95, 0.98, and 0.99; channel parameters ε = 0.005, $V_{\text{el}}$ = 0.041, η = 0.606, α = 0.2.]
Figure A.1: Optimal $V_A$ vs. transmission distance for maximum theoretical secret key rate, from β = 0.8 to β = 0.99, based on the assumed physical operating parameters of the quantum channel.
[Figure A.2 plot: maximum secret key rate (bits/pulse, $10^{-4}$ to $10^{1}$) vs. distance (km, 0 to 180), showing the fundamental limit of the lossy channel and the maximum CV-QKD key rate (FER = 0) for β = 0.8, 0.83, 0.86, 0.89, 0.92, 0.95, 0.98, and 0.99; channel parameters ε = 0.005, $V_{\text{el}}$ = 0.041, η = 0.606, α = 0.2.]
Figure A.2: Maximum theoretical secret key rates vs. transmission distance. The maximum CV-QKD key rate is defined by $K_{\text{opt}}$ from β = 0.8 to β = 0.99 based on the optimal $V_A$. The fundamental limit for a lossy channel is defined by $K_{\text{lim}} = -\log_2(1 - T)$.
Appendix B
Development, Simulation, and
Testing Framework
This appendix outlines the design stages required to realize both the silicon-based and GPU-based LDPC
decoders presented in this thesis. A list of electronic design automation (EDA) tools is provided below,
while Fig. B.1 provides a visual overview of the design flow.
Algorithm Design and Verification
• Microsoft Visual Studio with NVIDIA CUDA Framework
Architecture Implementation
• Cadence RTL Compiler
• Cadence Incisive Simulator
• Cadence SimVision Waveform Viewer
• Cadence Conformal Logic Equivalence Checker
• Synopsys SpyGlass Linting Tool
Physical Design
• Mentor Graphics Olympus-SoC Place-And-Route Tool
• Mentor Graphics Calibre Design Rule Check
• Cadence Tempus Static Timing Analysis Tool
Chip Measurement
• Source III VTRAN Test Vector Translation
• Teradyne ATE Test Suite
[Figure B.1 flow diagram. Stage labels: Algorithm Design and Verification, Architecture Implementation, Physical Design, Chip Measurement. Blocks: C++ Software Environment (Random Message Generator, LDPC Encoder, Gaussian Noise Generator, Floating-Point LDPC Decoder, Fixed-Point LDPC Decoder, Error Statistics Monitor); NVIDIA CUDA GPU-Based LDPC Decoder; Architectural Model of Frame-Interleaved LDPC Decoder; RTL HDL Model; Functional Verification; Test Vectors; RTL Synthesis; Gate-Level Simulation; Chip Floorplanning, Power Domain Construction, I/O Interface Design; Place-And-Route; Timing Sign-Off; Design Rule Checking; Tape-Out / Chip Fabrication; Chip Packaging; Test Vector Translation; ATE Functional Test and Power Measurement. Outputs: LDPC Code BER and FER Results; GPU Decoder Throughput and Latency Results; Silicon Decoder At-Speed Tests and Power Results.]
Figure B.1: Development, simulation, and testing framework.
Appendix C
Supplementary QKD Results: Bit
Error Rate and Finite Secret Key
Rate
This appendix presents the bit error rate (BER) performance of the three LDPC codes presented in
Chapter 4 under d = 1, 2, 4, 8 dimensional reconciliation, as well as the finite secret key rate results for privacy
amplification blocks of $N_{\text{privacy}} = 10^{10}$ bits with d = 1, 2, 4, 8 reconciliation, and for $N_{\text{privacy}} = 10^{12}$
bits with d = 2, 4 reconciliation to demonstrate the impact of block size on the maximum transmission
distance.
Figures C.1 and C.2 present the BER performance of the random, q = 21, and q = 50 codes on the
BIAWGNC with Sum-Product decoding under d = 1, 2 and d = 4, 8 reconciliation, respectively.
Figures C.3 to C.6 present the finite secret key rate results for a privacy amplification block of
$N_{\text{privacy}} = 10^{10}$ bits with d = 1, 2, 4, 8 reconciliation. Figures C.7 and C.8 present the finite secret key rate
results for a privacy amplification block of $N_{\text{privacy}} = 10^{12}$ bits with d = 2, 4 reconciliation. The d = 1 and
d = 8 results for $N_{\text{privacy}} = 10^{12}$ bits are presented in Figures 4.6 and 4.7 in Chapter 4. The results
show that the distance is extended using longer privacy amplification blocks and higher reconciliation
dimensions. The maximum distance is achieved with d = 8 reconciliation and $N_{\text{privacy}} = 10^{12}$ bits.
[Figure C.1 plot: bit error rate (BER, $10^{-7}$ to $10^{-2}$) vs. SNR (0.028 to 0.032); maximum 500 iterations, 100 frame errors, 32-bit floating point. Curves for d = 1 and d = 2: random code (R = 0.02), QC code with q = 21 (R = 0.02), and QC code with q = 50 (R = 0.01995).]
Figure C.1: BER vs. SNR for Sum-Product decoding with d = 1 and d = 2 dimensional reconciliation on BIAWGNC.
[Figure C.2 plot: bit error rate (BER, $10^{-8}$ to $10^{-2}$) vs. SNR (0.028 to 0.032); maximum 500 iterations, 100 frame errors, 32-bit floating point. Curves for d = 4 and d = 8: random code (R = 0.02), QC code with q = 21 (R = 0.02), and QC code with q = 50 (R = 0.01995).]
Figure C.2: BER vs. SNR for Sum-Product decoding with d = 4 and d = 8 dimensional reconciliation on BIAWGNC.
[Figure C.3 plot: finite secret key rate (bits/pulse, $10^{-8}$ to $10^{-2}$) vs. distance (km, 70 to 170); ε = 0.005, $V_{\text{el}}$ = 0.041, η = 0.606, α = 0.2, $N_{\text{privacy}} = 10^{10}$. Operating points labeled β = 0.80, 0.83, 0.86, 0.89, 0.92, 0.95, and 0.96 for the random LDPC code (rate 0.02), the QC LDPC code with q = 21 (rate 0.02), and the QC LDPC code with q = 50 (rate 0.01995).]
Figure C.3: d = 1 dimensional reconciliation with $N_{\text{privacy}} = 10^{10}$ bits: finite secret key rate $K_{\text{finite}}$ vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
[Figure C.4 plot: finite secret key rate (bits/pulse, $10^{-8}$ to $10^{-2}$) vs. distance (km, 70 to 170); ε = 0.005, $V_{\text{el}}$ = 0.041, η = 0.606, α = 0.2, $N_{\text{privacy}} = 10^{10}$. Operating points labeled β = 0.80, 0.83, 0.86, 0.89, 0.92, 0.95, and 0.97 for the random LDPC code (rate 0.02), the QC LDPC code with q = 21 (rate 0.02), and the QC LDPC code with q = 50 (rate 0.01995).]
Figure C.4: d = 2 dimensional reconciliation with $N_{\text{privacy}} = 10^{10}$ bits: finite secret key rate $K_{\text{finite}}$ vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
[Figure C.5 plot: finite secret key rate (bits/pulse, $10^{-8}$ to $10^{-2}$) vs. distance (km, 70 to 170); ε = 0.005, $V_{\text{el}}$ = 0.041, η = 0.606, α = 0.2, $N_{\text{privacy}} = 10^{10}$. Operating points labeled β = 0.80, 0.83, 0.86, 0.89, 0.92, 0.95, and 0.97 for the random LDPC code (rate 0.02), the QC LDPC code with q = 21 (rate 0.02), and the QC LDPC code with q = 50 (rate 0.01995).]
Figure C.5: d = 4 dimensional reconciliation with $N_{\text{privacy}} = 10^{10}$ bits: finite secret key rate $K_{\text{finite}}$ vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
[Figure C.6 plot: finite secret key rate (bits/pulse, $10^{-8}$ to $10^{-2}$) vs. distance (km, 70 to 170); ε = 0.005, $V_{\text{el}}$ = 0.041, η = 0.606, α = 0.2, $N_{\text{privacy}} = 10^{10}$. Operating points labeled β = 0.80, 0.83, 0.86, 0.89, 0.92, 0.95, 0.98, and 0.99 for the random LDPC code (rate 0.02), the QC LDPC code with q = 21 (rate 0.02), and the QC LDPC code with q = 50 (rate 0.01995).]
Figure C.6: d = 8 dimensional reconciliation with $N_{\text{privacy}} = 10^{10}$ bits: finite secret key rate $K_{\text{finite}}$ vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
[Figure C.7 plot: finite secret key rate (bits/pulse, $10^{-8}$ to $10^{-2}$) vs. distance (km, 70 to 170); ε = 0.005, $V_{\text{el}}$ = 0.041, η = 0.606, α = 0.2, $N_{\text{privacy}} = 10^{12}$. Operating points labeled β = 0.80, 0.83, 0.86, 0.89, 0.92, 0.95, and 0.97 for the random LDPC code (rate 0.02), the QC LDPC code with q = 21 (rate 0.02), and the QC LDPC code with q = 50 (rate 0.01995).]
Figure C.7: d = 2 dimensional reconciliation with $N_{\text{privacy}} = 10^{12}$ bits: finite secret key rate $K_{\text{finite}}$ vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
[Figure C.8 plot: finite secret key rate (bits/pulse, $10^{-8}$ to $10^{-2}$) vs. distance (km, 70 to 170); ε = 0.005, $V_{\text{el}}$ = 0.041, η = 0.606, α = 0.2, $N_{\text{privacy}} = 10^{12}$. Operating points labeled β = 0.80, 0.83, 0.86, 0.89, 0.92, 0.95, and 0.97 for the random LDPC code (rate 0.02), the QC LDPC code with q = 21 (rate 0.02), and the QC LDPC code with q = 50 (rate 0.01995).]
Figure C.8: d = 4 dimensional reconciliation with $N_{\text{privacy}} = 10^{12}$ bits: finite secret key rate $K_{\text{finite}}$ vs. distance for collective attacks on BIAWGNC with Sum-Product decoding.
Appendix D
LDPC Decoding for Reverse
Reconciliation in CV- and DV-QKD
This appendix examines the differences in reconciling secret keys using LDPC codes over the binary-input
additive white Gaussian noise channel (BIAWGNC) for CV-QKD, and the binary symmetric channel
(BSC) for DV-QKD. The reconciliation procedure is first presented for the BIAWGNC (as discussed in
the thesis for long-distance CV-QKD), and then the procedure is extended for the BSC. This appendix
shows that LDPC decoding is independent of the QKD system parameters once the log-likelihood (LLR)
input to the decoder is calculated.
D.1 Alice and Bob’s Correlated Sequences
After the quantum transmission and sifting steps in CV- or DV-QKD protocols, Alice and Bob share
correlated sequences X and Y of length n, respectively. In CV-QKD, Alice's $X \in \mathbb{R}^n$ and is normally
distributed over $X \sim \mathcal{N}(0, 1)$. In DV-QKD, Alice's $X \in \mathbb{F}_2^n$ and is uniformly distributed over $\mathbb{F}_2^n$. The
distribution of Bob's correlated Gaussian sequence Y is determined by the channel model.
D.2 Reverse Reconciliation
In both CV- and DV-QKD, Bob uses a quantum random number generator to generate a uniformly
distributed random binary sequence S of length k, where S_i ∈ {0, 1}. He then encodes S to generate an
LDPC codeword C of length n, where C_i ∈ {0, 1}, based on a binary LDPC parity-check matrix H that
is also known to Alice.
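The encoding step can be illustrated with a minimal sketch. The thesis does not specify the encoder here, so the example assumes a hypothetical toy parity-check matrix in systematic form H = [A | I] with k = 3 and n = 6, and a seeded pseudo-random generator standing in for Bob's quantum random number generator. Practical reconciliation codes are far longer and sparser.

```python
import random

# Hypothetical toy parity-check matrix H = [A | I] (3 checks, n = 6, k = 3).
# This is an illustrative assumption, not a code from the thesis.
A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]
H = [row + [1 if j == i else 0 for j in range(3)] for i, row in enumerate(A)]

def encode(s):
    """Systematic encoding C = [S | P] with P = A*S over GF(2), so that H*C = 0."""
    parity = [sum(a * b for a, b in zip(row, s)) % 2 for row in A]
    return s + parity

def syndrome(c):
    """Compute H*C over GF(2); an all-zero result means C is a valid codeword."""
    return [sum(h * b for h, b in zip(row, c)) % 2 for row in H]

rng = random.Random(0)  # stand-in for Bob's quantum random number generator
S = [rng.randint(0, 1) for _ in range(3)]  # random binary sequence of length k
C = encode(S)                              # codeword of length n
print(S, C, syndrome(C))
```

Because Alice knows the same H, she can check any candidate codeword with the same syndrome computation.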
D.2.1 CV-QKD: Decoding on the BIAWGNC
In CV-QKD, Bob prepares his classical message to Alice, M, by modulating the signs of his Gaussian
sequence Y with the LDPC codeword, such that M_i = (−1)^{C_i} Y_i for i = 1, 2, . . . , n. The BIAWGNC is
described by Z ∼ N(0, σ_Z^2). Bob’s correlated sequence Y is Gaussian, and is given by Y = X + Z,
such that Y ∼ N(0, 1 + σ_Z^2). Alice attempts to recover Bob’s codeword C using her correlated Gaussian
sequence X based on the following division operation:
\[
R_i = \frac{M_i}{X_i} = \frac{(-1)^{C_i} Y_i}{X_i} = \frac{(-1)^{C_i}(X_i + Z_i)}{X_i} = (-1)^{C_i} + (-1)^{C_i}\frac{Z_i}{X_i} \quad \text{for } i = 1, 2, \ldots, n. \tag{D.1}
\]
Here, Alice observes a channel with binary input (±1) and additive noise (−1)^{C_i} Z_i/X_i. The
division in the noise term represents a fading channel. However, since Alice knows the value
of each X_i, the norm of X is revealed and the overall channel noise remains Gaussian with zero mean
and variance σ_{N_i}^2 = σ_Z^2/|X_i|^2 for each i = 1, 2, . . . , n. Alice attempts to reconstruct Bob’s original binary
sequence S by applying the Sum-Product belief propagation algorithm for LDPC decoding to build
an estimate Ŝ for further post-processing in the subsequent privacy amplification step. LDPC decoding is
successful if Ŝ = S. Figure D.1 shows the setup for a BIAWGNC.
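The division step in (D.1) can be checked numerically. The following is a minimal sketch, assuming a hypothetical noise standard deviation sigma_z = 0.5 and an arbitrary 5-bit vector standing in for the codeword C. The helper names are illustrative and do not come from the thesis.

```python
import math
import random

def bob_message(codeword, ys):
    """Bob modulates the signs of Y with the codeword: M_i = (-1)^{C_i} * Y_i."""
    return [(-1) ** c * y for c, y in zip(codeword, ys)]

def alice_divide(ms, xs):
    """Alice divides the received message by her own values: R_i = M_i / X_i."""
    return [m / x for m, x in zip(ms, xs)]

rng = random.Random(1)
sigma_z = 0.5                 # assumed channel noise standard deviation
C = [0, 1, 1, 0, 1]           # arbitrary bits standing in for an LDPC codeword

X = [rng.gauss(0.0, 1.0) for _ in C]      # Alice: X_i ~ N(0, 1)
Z = [rng.gauss(0.0, sigma_z) for _ in C]  # channel: Z_i ~ N(0, sigma_z^2)
Y = [x + z for x, z in zip(X, Z)]         # Bob: Y = X + Z

M = bob_message(C, Y)
R = alice_divide(M, X)

# Right-hand side of (D.1): a +/-1 symbol plus noise (-1)^{C_i} * Z_i / X_i
expected = [(-1) ** c * (1.0 + z / x) for c, x, z in zip(C, X, Z)]

# Per-symbol noise variance seen by Alice: sigma_z^2 / |X_i|^2
noise_var = [sigma_z ** 2 / abs(x) ** 2 for x in X]

for r, e, v in zip(R, expected, noise_var):
    print(f"R = {r:+.4f}  expected = {e:+.4f}  noise variance = {v:.4f}")
```

Symbols with small |X_i| see a large effective noise variance, which is why the per-symbol variance must be carried into the LLR calculation.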
[Block diagram with the blocks LDPC Encoder, BIAWGNC, and LDPC Decoder, and the labels S, C, R, H, σ_z^2, and SNR.]
Figure D.1: LDPC encoding and decoding system with BIAWGNC model.
The Sum-Product algorithm for LDPC decoding generates a codeword estimate Ŝ based on the
log-likelihood ratio (LLR) of each of its inputs R_i for i = 1, 2, . . . , n. For a BIAWGNC, the noise variance
σ_Z^2 is known, and each LLR is given by:
\[
\mathrm{LLR}(R_i) = \ln\left(\frac{P(R_i \mid C_i = 0)}{P(R_i \mid C_i = 1)}\right) = \frac{2R_i}{\sigma_{N_i}^2}, \quad \text{where } \sigma_{N_i}^2 = \frac{\sigma_Z^2}{|X_i|^2}. \tag{D.2}
\]
From here, the LDPC decoder performs the decoding procedure independently of any other QKD system
parameters. As such, the design and implementation of the LDPC decoder, whether in software or
hardware, does not depend on the QKD system. In the thesis, a GPU-based LDPC decoder was implemented
to increase decoding speed. The same error-correction performance is achievable using a software-based
decoder, albeit at a much lower speed.
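The closed form in (D.2) can be compared against the likelihood-ratio definition. The sketch below assumes, as in (D.1), that R_i given C_i = 0 is Gaussian with mean +1 and that R_i given C_i = 1 is Gaussian with mean −1, both with variance σ_Z^2/|X_i|^2. The function names and numeric inputs are illustrative assumptions.

```python
import math

def biawgnc_llr(r, x, sigma_z):
    """Closed-form channel LLR from (D.2): 2*R_i / (sigma_z^2 / |X_i|^2)."""
    sigma_n2 = sigma_z ** 2 / abs(x) ** 2
    return 2.0 * r / sigma_n2

def gaussian_pdf(v, mean, var):
    """Density of N(mean, var) evaluated at v."""
    return math.exp(-(v - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def llr_from_definition(r, x, sigma_z):
    """ln( P(R_i | C_i = 0) / P(R_i | C_i = 1) ) with means +1 and -1."""
    var = sigma_z ** 2 / abs(x) ** 2
    return math.log(gaussian_pdf(r, 1.0, var) / gaussian_pdf(r, -1.0, var))

# Arbitrary illustrative values for R_i, X_i and sigma_z
r, x, sigma_z = 0.3, 1.5, 0.8
print(biawgnc_llr(r, x, sigma_z), llr_from_definition(r, x, sigma_z))

# A positive LLR favours C_i = 0, a negative LLR favours C_i = 1
hard_decision = 0 if biawgnc_llr(r, x, sigma_z) >= 0 else 1
```

Only these LLR values enter the Sum-Product decoder; the values X_i and σ_Z are not needed afterwards.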
D.2.2 DV-QKD: Decoding on the BSC
In DV-QKD, Bob’s correlated sequence Y ∈ F_2^n is uniformly distributed over F_2^n. The BSC is defined
by a bit crossover error probability p, 0 < p < 1/2, such that Bob’s correlated binary sequence is
described by Y = X + E, where + denotes binary addition and E represents the crossover error as
follows:
\[
E_i = \begin{cases} 1, & \text{with probability } p \\ 0, & \text{with probability } 1 - p. \end{cases} \tag{D.3}
\]
Using the same codeword generation procedure as before, Bob encodes an LDPC codeword C from a
uniformly distributed random binary sequence S. In DV-QKD, Bob sends a message M = C + Y to
Alice. Alice attempts to recover Bob’s codeword C by adding her correlated binary sequence X to
Bob’s message M, as received, as per the following binary addition operation:
\[
R = M + X = (C + Y) + X = C + (X + E) + X = C + E \tag{D.4}
\]
Alice’s received value R = C + E is then used to perform LDPC decoding to recover an estimate Ŝ
of Bob’s original random binary sequence S. As in the previous BIAWGNC case, LDPC decoding is
successful if Ŝ = S.
[Block diagram with the blocks LDPC Encoder, BSC, and LDPC Decoder, and the labels S, C, R, H, and crossover error probability p.]
Figure D.2: LDPC encoding and decoding system with BSC model.
Figure D.2 presents the decoding setup for the BSC case. From the LDPC decoder’s perspective, the
only difference is in the calculation of the input channel LLR based on the input R and the known
crossover probability p. In fact, this LLR calculation is even simpler than the LLR calculation for the
BIAWGNC in CV-QKD because the decoder does not need to know X. Depending on whether R_i = 0
or R_i = 1, the LLR for each R_i on the BSC is given by:
\[
\mathrm{LLR}(R_i = 0) = \ln\left(\frac{1 - p}{p}\right), \qquad \mathrm{LLR}(R_i = 1) = \ln\left(\frac{p}{1 - p}\right) \tag{D.5}
\]
Once the LLR for each input is known, the same Sum-Product algorithm can be used for LDPC decoding.
Hence, apart from the channel LLR calculation, the LDPC decoder implementation is independent of the channel.
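The BSC procedure in (D.3) to (D.5) can be sketched in a few lines. The example assumes a hypothetical crossover probability p = 0.1, a sequence length of 8, and arbitrary bits standing in for the codeword C; a seeded pseudo-random generator replaces the physical randomness. Binary addition is implemented as XOR.

```python
import math
import random

def xor_bits(a, b):
    """Element-wise binary addition (XOR) of two bit lists."""
    return [u ^ v for u, v in zip(a, b)]

def bsc_llr(r_bit, p):
    """Channel LLR from (D.5): ln((1-p)/p) if R_i = 0, ln(p/(1-p)) if R_i = 1."""
    llr_zero = math.log((1.0 - p) / p)
    return llr_zero if r_bit == 0 else -llr_zero

rng = random.Random(2)
n, p = 8, 0.1                                        # assumed length and crossover probability

X = [rng.randint(0, 1) for _ in range(n)]            # Alice's sifted bits
E = [1 if rng.random() < p else 0 for _ in range(n)] # crossover errors, as in (D.3)
Y = xor_bits(X, E)                                   # Bob's correlated bits: Y = X + E
C = [rng.randint(0, 1) for _ in range(n)]            # arbitrary bits standing in for a codeword

M = xor_bits(C, Y)                                   # Bob's message: M = C + Y
R = xor_bits(M, X)                                   # Alice's operation in (D.4): R = M + X

llrs = [bsc_llr(r, p) for r in R]                    # inputs to the Sum-Product decoder
print(R, xor_bits(C, E))
```

The two printed lists coincide, which reflects R = C + E: Alice's own sequence X cancels and only the crossover errors remain on the codeword.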
D.2.3 Efficiency of Reconciliation with Multiple Dimensions
The efficiency of reverse reconciliation can be improved through multi-dimensional reconciliation
techniques; however, multi-dimensional reconciliation is only applicable to the CV-QKD case over a
BIAWGNC. For d-dimensional reconciliation, d ∈ {1, 2, 4, 8}, each consecutive group of d quantum
coherent-state transmissions from Alice to Bob can be mapped to the same BIAWGNC, and thus, the channel
noise variance among all d virtual channels is uniform. Multi-dimensional reconciliation is not possible
on the BSC because each bit is transmitted discretely and has its own crossover probability p.
References
[1] R. W. Hamming, “Error detecting and error correcting codes,” The Bell System Technical Journal,
vol. 29, no. 2, pp. 147–160, Apr. 1950.
[2] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal,
vol. 27, no. 3, pp. 379–423, Jul. 1948.
[3] S. Lin and D. J. Costello, Error Control Coding, Second Edition. Upper Saddle River, NJ, USA:
Prentice-Hall, Inc., 2004.
[4] T. Richardson, M. Shokrollahi, and R. Urbanke, “Design of capacity-approaching irregular low-
density parity-check codes,” Information Theory, IEEE Transactions on, vol. 47, no. 2, pp. 619–
637, Feb. 2001.
[5] S.-Y. Chung, G. D. Forney, Jr. et al., “On the design of low-density parity-check codes within 0.0045
dB of the Shannon limit,” Communications Letters, IEEE, vol. 5, no. 2, pp. 58–60, Feb. 2001.
[6] J. Kim and W. Sung, “Rate-0.96 LDPC Decoding VLSI for Soft-Decision Error Correction of
NAND Flash Memory,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on,
vol. 22, no. 5, pp. 1004–1015, May 2014.
[7] R. Gallager, “Low-density parity-check codes,” IRE Transactions on Information Theory, vol. 8,
no. 1, pp. 21–28, Jan. 1962.
[8] D. MacKay, “Good error-correcting codes based on very sparse matrices,” IEEE Transactions on
Information Theory, vol. 45, no. 2, pp. 399–431, Mar. 1999.
[9] “IEEE Standard for Information Technology - Part 11: Wireless LAN Medium Access Control
(MAC) and Physical Layer (PHY) Specifications,” IEEE Std 802.11ac-2013, pp. 1–425, Dec. 2013.
[10] J. Lodewyck, M. Bloch, R. Garcia-Patron, S. Fossier, E. Karpov, E. Diamanti, T. Debuisschert,
N. J. Cerf, R. Tualle-Brouri, S. W. McLaughlin, and P. Grangier, “Quantum key distribution
over 25km with an all-fiber continuous-variable system,” Phys. Rev. A, vol. 76, pp. 042 305–1 –
042 305–10, Oct. 2007.
[11] P. Jouguet, S. Kunz-Jacques, and A. Leverrier, “Long-distance continuous-variable quantum key
distribution with a Gaussian modulation,” Phys. Rev. A, vol. 84, pp. 062 317–1 – 062 317–7, Dec.
2011.
[12] D. Huang, D. Lin et al., “Continuous-variable quantum key distribution with 1 Mbps secure key
rate,” Opt. Express, vol. 23, no. 13, pp. 17 511–17 519, Jun. 2015.
[13] “International Roadmap for Devices and Systems (IRDS),” IEEE, 2016. [Online]. Available:
http://irds.ieee.org
[14] K. J. Kuhn, “Considerations for Ultimate CMOS Scaling,” IEEE Transactions on Electron Devices,
vol. 59, no. 7, pp. 1813–1828, Jul. 2012.
[15] T. Mohsenin, D. Truong, and B. Baas, “A Low-Complexity Message-Passing Algorithm for Re-
duced Routing Congestion in LDPC Decoders,” Circuits and Systems I: Regular Papers, IEEE
Transactions on, vol. 57, no. 5, pp. 1048–1061, May 2010.
[16] K. Cushon, S. Hemati et al., “High-throughput energy-efficient LDPC decoders using differential
binary message passing,” IEEE Transactions on Signal Processing, vol. 62, no. 3, pp. 619–631,
Feb. 2014.
[17] S. Rossnagel, R. Wisnieff, D. Edelstein, and T. Kuan, “Interconnect issues post 45nm,” Electron
Devices Meeting, 2005. IEDM Technical Digest. IEEE International, pp. 89–91, Dec. 2005.
[18] R. Brain, “Interconnect scaling: Challenges and opportunities,” 2016 IEEE International Electron
Devices Meeting (IEDM), pp. 9.3.1–9.3.4, Dec. 2016.
[19] K. Okada, K. Kondou et al., “Full Four-Channel 6.3-Gb/s 60-GHz CMOS Transceiver With Low-
Power Analog and Digital Baseband Circuitry,” Solid-State Circuits, IEEE Journal of, vol. 48,
no. 1, pp. 46–65, Jan. 2013.
[20] K. Zhang, X. Huang, and Z. Wang, “High-throughput layered decoder implementation for quasi-
cyclic LDPC codes,” Selected Areas in Communications, IEEE Journal on, vol. 27, no. 6, pp.
985–994, Aug. 2009.
[21] M. Li, F. Naessens et al., “An area and energy efficient half-row-paralleled layer LDPC decoder for
the 802.11ad standard,” Signal Processing Systems (SiPS), 2013 IEEE Workshop on, pp. 112–117,
Oct. 2013.
[22] Z. Chen, X. Peng et al., “A 6.72-Gb/s, 8pJ/bit/iteration WPAN LDPC decoder in 65nm CMOS,”
2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 87–88, Jan.
2013.
[23] “IEEE Standard for Information Technology - Part 11: Wireless LAN Medium Access Control
(MAC) and Physical Layer (PHY) Specifications,” IEEE Std 802.11ad-2012, pp. 1–628, Dec. 2012.
[24] “IEEE Standard for Information Technology - Part 15.3: Wireless Medium Access Control
(MAC) and Physical Layer (PHY) Specifications for High Rate Wireless Personal Area Networks
(WPANs),” IEEE Std 802.15.3c-2009, pp. 1–187, Oct. 2009.
[25] X.-Y. Shih, C.-Z. Zhan, and A. Y. Wu, “A 7.39mm2 76mW (1944, 972) LDPC decoder chip for
IEEE 802.11n applications,” 2008 IEEE Asian Solid-State Circuits Conference, pp. 301–304, Nov.
2008.
[26] A. Cevrero, Y. Leblebici, P. Ienne, and A. Burg, “A 5.35 mm2 10GBASE-T Ethernet LDPC
decoder chip in 90 nm CMOS,” 2010 IEEE Asian Solid-State Circuits Conference, pp. 1–4, Nov.
2010.
[27] Z. Zhang, V. Anantharam, M. Wainwright, and B. Nikolic, “An Efficient 10GBASE-T Ethernet
LDPC Decoder Design With Low Error Floors,” Solid-State Circuits, IEEE Journal of, vol. 45,
no. 4, pp. 843–855, Apr. 2010.
[28] X. Peng, Z. Chen, X. Zhao, D. Zhou, and S. Goto, “A 115mW 1Gbps QC-LDPC decoder ASIC for
WiMAX in 65nm CMOS,” IEEE Asian Solid-State Circuits Conference 2011, pp. 317–320, Nov.
2011.
[29] B. Xiang, D. Bao, S. Huang, and X. Zeng, “An 847-955 Mb/s 342-397 mW Dual-Path Fully-
Overlapped QC-LDPC Decoder for WiMAX System in 0.13µm CMOS,” IEEE Journal of Solid-
State Circuits, vol. 46, no. 6, pp. 1416–1432, Jun. 2011.
[30] S. W. Yen, S. Y. Hung, C. L. Chen, H. C. Chang, S. J. Jou, and C. Y. Lee, “A 5.79-Gb/s
Energy-Efficient Multirate LDPC Codec Chip for IEEE 802.15.3c Applications,” IEEE Journal of
Solid-State Circuits, vol. 47, no. 9, pp. 2246–2257, Sep. 2012.
[31] Y. S. Park, D. Blaauw et al., “Low-Power High-Throughput LDPC Decoder Using Non-Refresh
Embedded DRAM,” Solid-State Circuits, IEEE Journal of, vol. 49, no. 3, pp. 783–794, Mar. 2014.
[32] M. Weiner, M. Blagojevic et al., “27.7 A scalable 1.5-to-6Gb/s 6.2-to-38.1mW LDPC decoder
for 60GHz wireless networks in 28nm UTBB FDSOI,” Solid-State Circuits Conference Digest of
Technical Papers (ISSCC), 2014 IEEE International, pp. 464–465, Feb. 2014.
[33] T. C. Ou, Z. Zhang, and M. C. Papaefthymiou, “A 934MHz 9Gb/s 3.2pJ/b/iteration charge-
recovery LDPC decoder with in-package inductors,” 2015 IEEE Asian Solid-State Circuits Con-
ference (A-SSCC), pp. 1–4, Nov. 2015.
[34] X. R. Lee, C. L. Chen, H. C. Chang, and C. Y. Lee, “A 7.92 Gb/s 437.2 mW Stochastic LDPC
Decoder Chip for IEEE 802.15.3c Applications,” IEEE Transactions on Circuits and Systems I:
Regular Papers, vol. 62, no. 2, pp. 507–516, Feb. 2015.
[35] C. L. Lin, R. J. Liu, C. L. Chen, H. C. Chang, and C. Y. Lee, “A 7.72 Gb/s LDPC-CC decoder
with overlapped architecture for pre-5G wireless communications,” 2016 IEEE Asian Solid-State
Circuits Conference (A-SSCC), pp. 337–340, Nov. 2016.
[36] M. R. Li, C. H. Yang, and Y. L. Ueng, “A 5.28-Gb/s LDPC Decoder With Time-Domain Signal
Processing for IEEE 802.15.3c Applications,” IEEE Journal of Solid-State Circuits, vol. 52, no. 2,
pp. 592–604, Feb. 2017.
[37] C. H. Bennett and G. Brassard, “Quantum cryptography: Public key distribution and coin toss-
ing,” Theoretical Computer Science, vol. 560, Part 1, pp. 7 – 11, Dec. 2014.
[38] N. Gisin, G. Ribordy, W. Tittel, and H. Zbinden, “Quantum cryptography,” Rev. Mod. Phys.,
vol. 74, pp. 145–195, Jan. 2002.
[39] R. Alleaume, C. Branciard, J. Bouda, T. Debuisschert, M. Dianati, N. Gisin, M. Godfrey,
P. Grangier, T. Länger, N. Lütkenhaus et al., “Using quantum key distribution for cryptographic purposes:
A survey,” Theoretical Computer Science, vol. 560, Part 1, pp. 62–81, Dec. 2014.
[40] E. Diamanti, H.-K. Lo, B. Qi, and Z. Yuan, “Practical challenges in quantum key distribution,”
Npj Quantum Information, vol. 2, pp. 16 025–1 –16 025–12, Nov. 2016.
[41] J. D. Morris, M. R. Grimaila, D. D. Hodson, D. Jacques, and G. Baumgartner, “Chapter 9 - A
Survey of Quantum Key Distribution (QKD) Technologies,” Emerging Trends in ICT Security,
pp. 141–152, Nov. 2014.
[42] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-
key cryptosystems,” Communications of the ACM, vol. 21, pp. 120–126, Feb. 1978.
[43] C. Kollmitzer and M. Pivk, Applied Quantum Cryptography. Springer, Apr. 2010, vol. 797.
[44] P. W. Shor, “Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a
Quantum Computer,” SIAM Journal on Computing, vol. 26, no. 5, pp. 1484–1509, Oct. 1997.
[45] D. Adrian, K. Bhargavan, Z. Durumeric, P. Gaudry, M. Green, J. A. Halderman, N. Heninger,
D. Springall, E. Thome, L. Valenta, B. VanderSloot, E. Wustrow, S. Zanella-Beguelin, and P. Zim-
mermann, “Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice,” Proceedings of the
22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 5–17, Oct. 2015.
[46] H.-K. Lo, M. Curty, and K. Tamaki, “Secure quantum key distribution,” Nature Photonics, vol. 8,
no. 8, pp. 595–604, Jul. 2014.
[47] M. Peev, C. Pacher, R. Alleaume, C. Barreiro, J. Bouda, W. Boxleitner, T. Debuisschert,
E. Diamanti, M. Dianati, J. F. Dynes et al., “The SECOQC quantum key distribution network in
Vienna,” New Journal of Physics, vol. 11, no. 7, pp. 075 001–1 – 075 001–37, Jul. 2009.
[48] M. Sasaki, M. Fujiwara et al., “Field test of quantum key distribution in the Tokyo QKD Network,”
Opt. Express, vol. 19, no. 11, pp. 10 387–10 409, May 2011.
[49] P. Jouguet, S. Kunz-Jacques et al., “Field test of classical symmetric encryption with continuous
variables quantum key distribution,” Opt. Express, vol. 20, no. 13, pp. 14 030–14 041, Jun. 2012.
[50] S. Wang, W. Chen et al., “Field and long-term demonstration of a wide area quantum key distri-
bution network,” Opt. Express, vol. 22, no. 18, pp. 21 739–21 756, Sep. 2014.
[51] H.-L. Yin, T.-Y. Chen, Z.-W. Yu, H. Liu, L.-X. You, Y.-H. Zhou, S.-J. Chen, Y. Mao, M.-Q.
Huang, W.-J. Zhang, H. Chen, M. J. Li, D. Nolan, F. Zhou, X. Jiang, Z. Wang, Q. Zhang, X.-B.
Wang, and J.-W. Pan, “Measurement-device-independent quantum key distribution over a 404 km
optical fiber,” Phys. Rev. Lett., vol. 117, pp. 190 501–1 – 190 501–5, Nov. 2016.
[52] B. Qi, W. Zhu, L. Qian, and H.-K. Lo, “Feasibility of quantum key distribution through a dense
wavelength division multiplexing network,” New Journal of Physics, vol. 12, no. 10, pp. 103 042–1
– 103 042–17, Oct. 2010.
[53] K. A. Patel, J. F. Dynes, I. Choi, A. W. Sharpe, A. R. Dixon, Z. L. Yuan, R. V. Penty, and A. J.
Shields, “Coexistence of high-bit-rate quantum key distribution and data on optical fiber,” Phys.
Rev. X, vol. 2, pp. 041 010–1 – 041 010–8, Nov. 2012.
[54] R. Kumar, H. Qin, and R. Alleaume, “Coexistence of continuous variable QKD with intense DWDM
classical channels,” New Journal of Physics, vol. 17, no. 4, pp. 043 027–1 – 043 027–4, Apr. 2015.
[55] G. Vest, M. Rau, L. Fuchs, G. Corrielli, H. Weier, S. Nauerth, A. Crespi, R. Osellame, and
H. Weinfurter, “Design and evaluation of a handheld quantum key distribution sender module,”
IEEE Journal of Selected Topics in Quantum Electronics, vol. 21, no. 3, pp. 131–137, May 2015.
[56] G. Vallone, D. Bacco, D. Dequal, S. Gaiarin, V. Luceri, G. Bianco, and P. Villoresi, “Experimental
satellite quantum communications,” Phys. Rev. Lett., vol. 115, pp. 040 502–1 – 040 502–5, Jul.
2015.
[57] P. Sibson, J. E. Kennard et al., “Integrated silicon photonics for high-speed quantum key distri-
bution,” Optica, vol. 4, no. 2, pp. 172–177, Feb. 2017.
[58] L. Chen, S. Jordan et al., “Report on post-quantum cryptography,” National Institute of Standards
and Technology Internal Report, vol. 8105, 2016.
[59] P. Jouguet, S. Kunz-Jacques, A. Leverrier, P. Grangier, and E. Diamanti, “Experimental demon-
stration of long-distance continuous-variable quantum key distribution,” Nature Photonics, vol. 7,
pp. 378–381, May 2013.
[60] P. Jouguet and S. Kunz-Jacques, “High performance error correction for quantum key distribution
using polar codes,” Quantum Inform. & Comp., vol. 14, no. 3, pp. 329–338, Mar. 2014.
[61] S. Pirandola, R. Laurenza, C. Ottaviani, and L. Banchi, “Fundamental limits of repeaterless
quantum communications,” Nature Communications, vol. 8, pp. 15 043–1 – 15 043–15, Apr. 2017.
[62] D. Huang, P. Huang, T. Wang, H. Li, Y. Zhou, and G. Zeng, “Continuous-variable quantum key
distribution based on a plug-and-play dual-phase-modulated coherent-states protocol,” Phys. Rev.
A, vol. 94, pp. 032 305–1 – 032 305–11, Sep. 2016.
[63] D. Huang, P. Huang, D. Lin, and G. Zeng, “Long-distance continuous-variable quantum key dis-
tribution by controlling excess noise,” Scientific Reports, vol. 6, pp. 19 201–1 – 19 201–6, Jan.
2016.
[64] C. Wang, D. Huang et al., “25 MHz clock continuous-variable quantum key distribution system
over 50 km fiber channel,” Scientific Reports, vol. 5, pp. 14 607–1 – 14 607–8, Sep. 2015.
[65] J. Martinez-Mateo, D. Elkouss, and V. Martin, “Key reconciliation for high performance quantum
key distribution,” Scientific Reports, vol. 3, pp. 1576–1 – 1576–6, Apr. 2013.
[66] A. Dixon and H. Sato, “High speed and adaptable error correction for Megabit/s rate quantum
key distribution,” Scientific Reports, vol. 4, pp. 7275–1 – 7275–4, Dec. 2014.
[67] A. Leverrier and P. Grangier, “Unconditional security proof of long-distance continuous-variable
quantum key distribution with discrete modulation,” Physical Review Letters, vol. 102, no. 18, pp.
180 504–1 – 180 504–4, May 2009.
[68] A. Becir and M. Ridza Wahiddin, “Phase coherent states for enhancing the performance of contin-
uous variable quantum key distribution,” Journal of the Physical Society of Japan, vol. 81, no. 3,
pp. 034 005–1 – 034 005–9, Mar. 2012.
[69] “ETSI Standard 302 307-2 V1.1.1: Digital Video Broadcasting (DVB),” ETSI Std 302 307-2 V1.1.1
(DVB-S2X 2014), pp. 1–139, Oct. 2014.
[70] A. Blanksby and C. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code
decoder,” Solid-State Circuits, IEEE Journal of, vol. 37, no. 3, pp. 404–412, Mar 2002.
[71] T. Richardson, R. Urbanke et al., “Multi-edge type LDPC codes,” Workshop honoring Prof. Bob
McEliece on his 60th birthday, California Institute of Technology, Pasadena, California, pp. 24–25,
May 2002.
[72] M. Fossorier, “Quasicyclic low-density parity-check codes from circulant permutation matrices,”
Information Theory, IEEE Transactions on, vol. 50, no. 8, pp. 1788–1793, Aug. 2004.
[73] S. Kim, G. E. Sobelman, and H. Lee, “A reduced-complexity architecture for LDPC layered de-
coding schemes,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19,
no. 6, pp. 1099–1103, Jun. 2011.
[74] G. Falcao, L. Sousa, and V. Silva, “Massively LDPC decoding on multicore architectures,” IEEE
Transactions on Parallel and Distributed Systems, vol. 22, no. 2, pp. 309–322, Feb. 2011.
[75] H. Ji, J. Cho, and W. Sung, “Massively parallel implementation of cyclic LDPC codes on a general
purpose graphics processing unit,” 2009 IEEE Workshop on Signal Processing Systems, pp. 285–
290, Oct. 2009.
[76] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, “GPUs and the Future of
Parallel Computing,” IEEE Micro, vol. 31, no. 5, pp. 7–17, Sep. 2011.
[77] M. Fossorier, M. Mihaljevic, and H. Imai, “Reduced complexity iterative decoding of low-density
parity check codes based on belief propagation,” Communications, IEEE Transactions on, vol. 47,
no. 5, pp. 673–680, May 1999.
[78] R. Tanner, “A recursive approach to low complexity codes,” IEEE Trans. Inf. Theory, vol. 27,
no. 5, pp. 533–547, Sep. 1981.
[79] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, “Factor graphs and the sum-product algorithm,”
IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.
[80] D. Oh and K. Parhi, “Optimally quantized offset min-sum algorithm for flexible LDPC decoder,”
Signals, Systems and Computers, 2008 42nd Asilomar Conference on, pp. 1886–1891, Oct. 2008.
[81] T. Mohsenin and B. Baas, “Trends and challenges in LDPC hardware decoders,” Signals, Systems
and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on, pp. 1273–
1277, Nov. 2009.
[82] C. Roth, A. Cevrero, C. Studer, Y. Leblebici, and A. Burg, “Area, throughput, and energy-
efficiency trade-offs in the VLSI implementation of LDPC decoders,” Circuits and Systems (IS-
CAS), 2011 IEEE International Symposium on, pp. 1772–1775, May 2011.
[83] M. Mansour and N. Shanbhag, “High-throughput LDPC decoders,” Very Large Scale Integration
(VLSI) Systems, IEEE Transactions on, vol. 11, no. 6, pp. 976–996, Dec. 2003.
[84] D. Oh and K. Parhi, “Low-Complexity Switch Network for Reconfigurable LDPC Decoders,” Very
Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 18, no. 1, pp. 85–94, Jan.
2010.
[85] D. Hocevar, “A reduced complexity decoder architecture via layered decoding of LDPC codes,”
Signal Processing Systems, 2004. SIPS 2004. IEEE Workshop on, pp. 107–112, Oct. 2004.
[86] T. Bhatt, V. Sundaramurthy, V. Stolpman, and D. McCain, “Pipelined block-serial decoder ar-
chitecture for structured LDPC codes,” Acoustics, Speech and Signal Processing, 2006. ICASSP
2006 Proceedings. 2006 IEEE International Conference on, vol. 4, pp. 225–228, May 2006.
[87] A. Darabiha, A. Carusone, and F. Kschischang, “Power Reduction Techniques for LDPC De-
coders,” Solid-State Circuits, IEEE Journal of, vol. 43, no. 8, pp. 1835–1845, Aug. 2008.
[88] Z. Wang and Z. Cui, “Low-Complexity High-Speed Decoder Design for Quasi-Cyclic LDPC Codes,”
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 15, no. 1, pp. 104–114,
Jan. 2007.
[89] L. Liu and C.-J. Shi, “Sliced Message Passing: High Throughput Overlapped Decoding of High-
Rate Low-Density Parity-Check Codes,” Circuits and Systems I: Regular Papers, IEEE Transac-
tions on, vol. 55, no. 11, pp. 3697–3710, Dec. 2008.
[90] Y. Chen and K. Parhi, “Overlapped message passing for quasi-cyclic low-density parity check
codes,” Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 51, no. 6, pp. 1106–
1113, Jun. 2004.
[91] D. Bao, X. Chen et al., “A single-routing layered LDPC decoder for 10Gbase-T Ethernet in 130nm
CMOS,” 17th Asia and South Pacific Design Automation Conference, pp. 565–566, Jan. 2012.
[92] S. Kumawat, R. Shrestha et al., “High-throughput LDPC-decoder architecture using efficient com-
parison techniques and dynamic multi-frame processing schedule,” IEEE Transactions on Circuits
and Systems I: Regular Papers, vol. 62, no. 5, pp. 1421–1430, May 2015.
[93] P. Meinerzhagen, A. Bonetti, G. Karakonstantis, C. Roth, F. Gürkaynak, and A. Burg,
“Refresh-free dynamic standard-cell based memories: Application to a QC-LDPC decoder,” 2015 IEEE
International Symposium on Circuits and Systems (ISCAS), pp. 1426–1429, May 2015.
[94] D. Miyashita, R. Yamaki et al., “An LDPC decoder with time-domain analog and digital mixed-
signal processing,” IEEE Journal of Solid-State Circuits, vol. 49, no. 1, pp. 73–83, Jan. 2014.
[95] F. Grosshans and P. Grangier, “Continuous variable quantum cryptography using coherent states,”
Phys. Rev. Lett., vol. 88, pp. 057 902–1 – 057 902–4, Jan. 2002.
[96] H.-K. Lo, M. Curty, and B. Qi, “Measurement-device-independent quantum key distribution,”
Phys. Rev. Lett., vol. 108, pp. 130 503–1 – 130 503–5, Mar. 2012.
[97] S. Pirandola, C. Ottaviani, G. Spedalieri, C. Weedbrook, S. L. Braunstein, S. Lloyd, T. Gehring,
C. S. Jacobsen, and U. L. Andersen, “High-rate measurement-device-independent quantum cryp-
tography,” Nature Photonics, vol. 9, no. 6, pp. 397–402, May 2015.
[98] A. Leverrier, R. Alleaume, J. Boutros, G. Zemor, and P. Grangier, “Multidimensional reconciliation
for a continuous-variable quantum key distribution,” Phys. Rev. A, vol. 77, pp. 042 325–1 – 042 325–
8, Apr. 2008.
[99] C. Weedbrook, S. Pirandola, S. Lloyd, and T. C. Ralph, “Quantum cryptography approaching the
classical limit,” Phys. Rev. Lett., vol. 105, pp. 110 501–1 – 110 501–4, Sep. 2010.
[100] P. Jouguet, D. Elkouss, and S. Kunz-Jacques, “High-bit-rate continuous-variable quantum key
distribution,” Phys. Rev. A, vol. 90, pp. 042 329–1 – 042 329–8, Oct. 2014.
[101] T. Gehring, V. Händchen, J. Duhme, F. Furrer, T. Franz, C. Pacher, R. F. Werner, and
R. Schnabel, “Implementation of continuous-variable quantum key distribution with composable and
one-sided-device-independent security against coherent attacks,” Nature Communications, vol. 6, pp.
8795–1 – 8795–7, Oct. 2015.
[102] F. Grosshans, G. V. Assche, J. Wenger, R. Brouri, N. J. Cerf, and P. Grangier, “Quantum key
distribution using Gaussian-modulated coherent states,” Nature, vol. 421, pp. 238–241, Jan. 2003.
[103] H. Yan, X. Peng, X. Lin, W. Jiang, T. Liu, and H. Guo, “Efficiency of Winnow Protocol in
Secret Key Reconciliation,” 2009 WRI World Congress on Computer Science and Information
Engineering, vol. 3, pp. 238–242, Mar. 2009.
[104] D. Elkouss, J. Martinez, D. Lancho, and V. Martin, “Rate compatible protocol for information
reconciliation: An application to QKD,” 2010 IEEE Information Theory Workshop on Information
Theory, pp. 1–5, Jan 2010.
[105] N. Benletaief, H. Rezig, and A. Bouallegue, “Toward Efficient Quantum Key Distribution Recon-
ciliation,” Journal of Quantum Information Science, vol. 4, no. 2, pp. 117–128, Jun. 2014.
[106] T. Richardson, M. Shokrollahi, and R. Urbanke, “Design of capacity-approaching irregular low-
density parity-check codes,” Information Theory, IEEE Transactions on, vol. 47, no. 2, pp. 619–
637, Feb. 2001.
[107] D. Declercq and M. Fossorier, “Decoding algorithms for nonbinary LDPC codes over GF(q),”
IEEE Transactions on Communications, vol. 55, no. 4, pp. 633–643, Apr. 2007.
[108] A. Anastasopoulos, “A comparison between the sum-product and the min-sum iterative detection
algorithms based on density evolution,” Global Telecommunications Conference, 2001. GLOBE-
COM ’01. IEEE, vol. 2, pp. 1021–1025, Nov. 2001.
[109] A. Hurwitz, “Ueber die Composition der quadratischen Formen von beliebig vielen Variablen,”
Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische
Klasse, vol. 1898, pp. 309–316, Jul. 1898.
[110] J. C. Baez, “The octonions,” Bulletin of the American Mathematical Society, vol. 39, no. 2, pp.
145–205, Dec. 2001.
[111] M. Bloch, A. Thangaraj, S. W. McLaughlin, and J. M. Merolla, “LDPC-based secret key agreement
over the Gaussian wiretap channel,” 2006 IEEE International Symposium on Information Theory,
pp. 1179–1183, Jul. 2006.
[112] U. M. Maurer, “Secret key agreement by public discussion from common information,” IEEE
Transactions on Information Theory, vol. 39, no. 3, pp. 733–742, May 1993.
[113] P. Jouguet, S. Kunz-Jacques, E. Diamanti, and A. Leverrier, “Analysis of imperfections in practical
continuous-variable quantum key distribution,” Phys. Rev. A, vol. 86, p. 032309, Sep 2012.
[114] A. Leverrier, F. Grosshans, and P. Grangier, “Finite-size analysis of a continuous-variable quantum
key distribution,” Phys. Rev. A, vol. 81, pp. 062 343–1 – 062 343–11, Jun. 2010.
[115] M. Curty, F. Xu, W. Cui, C. C. W. Lim, K. Tamaki, and H.-K. Lo, “Finite-key analysis for
measurement-device-independent quantum key distribution,” Nature Communications, vol. 5, Apr.
2014.
[116] E. Diamanti and A. Leverrier, “Distributing secret keys with quantum continuous variables: prin-
ciple, security and implementations,” Entropy, vol. 17, no. 9, pp. 6072–6092, Aug. 2015.
[117] A. Leverrier, “Composable Security Proof for Continuous-Variable Quantum Key Distribution
with Coherent States,” Phys. Rev. Lett., vol. 114, pp. 070 501–1 – 070 501–5, Feb. 2015.
[118] V. C. Usenko and R. Filip, “Trusted noise in continuous-variable quantum key distribution: A
threat and a defense,” Entropy, vol. 18, no. 1, p. 20, Jan. 2016.
[119] H.-A. Loeliger, “On the basic averaging arguments for linear codes,” Communications and Cryp-
tography, vol. 276, pp. 251–261, 1994.
[120] T. J. Richardson, “Error floors of LDPC codes,” Proceedings of the annual Allerton conference on
communication, control, and computing, vol. 41, no. 3, pp. 1426–1435, Oct. 2003.
[121] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc, “Design of ion-
implanted MOSFETs with very small physical dimensions,” IEEE Journal of Solid-State Circuits,
vol. 9, no. 5, pp. 256–268, Oct. 1974.
[122] T. Huynh-Bao, J. Ryckaert, S. Sakhare, A. Mercha, D. Verkest, A. Thean, and P. Wambacq,
“Toward the 5nm technology: layout optimization and performance benchmark for logic/SRAMs
using lateral and vertical GAA FETs,” Proc. SPIE, vol. 9781, pp. 978 102 – 978 102–12, Mar. 2016.
[123] S. Y. Wu, C. Y. Lin et al., “Demonstration of a sub-0.03 um2 high density 6-t SRAM with scaled
bulk FinFETs for mobile SOC applications beyond 10nm node,” 2016 IEEE Symposium on VLSI
Technology, pp. 1–2, Jun. 2016.
[124] ——, “A 7nm CMOS platform technology featuring 4th generation FinFET transistors with a
0.027um2 high density 6-t SRAM cell for mobile SoC applications,” 2016 IEEE International
Electron Devices Meeting (IEDM), pp. 2.6.1–2.6.4, Dec. 2016.
[125] M. B. Taylor, “A landscape of the new dark silicon design regime,” IEEE Micro, vol. 33, no. 5,
pp. 8–19, Sep. 2013.
[126] S. Ajaz and H. Lee, “Multi-Gb/s multi-mode LDPC decoder architecture for IEEE 802.11ad stan-
dard,” Circuits and Systems (APCCAS), 2014 IEEE Asia Pacific Conference on, pp. 153–156,
Nov. 2014.
[127] M. Li, J. W. Weijers et al., “An energy efficient 18Gbps LDPC decoding processor for 802.11ad in
28nm CMOS,” 2015 IEEE Asian Solid-State Circuits Conference (A-SSCC), pp. 1–5, Nov. 2015.
[128] H. Motozuka, N. Yosoku et al., “A 6.16Gb/s 4.7pJ/bit/iteration LDPC decoder for IEEE 802.11ad
standard in 40nm LP-CMOS,” 2015 IEEE Global Conference on Signal and Information Processing
(GlobalSIP), pp. 1289–1292, Dec. 2015.
[129] J. P. Colinge, “Fully-depleted SOI CMOS for analog applications,” IEEE Transactions on Electron
Devices, vol. 45, no. 5, pp. 1010–1016, May 1998.
[130] G. V. Assche, J. Cardinal, and N. J. Cerf, “Reconciliation of a quantum-distributed Gaussian key,”
IEEE Transactions on Information Theory, vol. 50, no. 2, pp. 394–400, Feb. 2004.
[131] N. Walenta, A. Burg, D. Caselunghe, J. Constantin, N. Gisin, O. Guinnard, R. Houlmann,
P. Junod, B. Korzh, N. Kulesza et al., “A fast and versatile quantum key distribution system
with hardware key distillation and wavelength multiplexing,” New Journal of Physics, vol. 16,
no. 1, pp. 013 047–1 – 013 047–20, Jan. 2014.
[132] Z. Bai, S. Yang, and Y. Li, “High-efficiency reconciliation for continuous variable quantum key
distribution,” Japanese Journal of Applied Physics, vol. 56, no. 4, pp. 044 401–1 – 044 401–4, Mar.
2017.
[133] S. Kang and J. Moon, “Parallel LDPC decoder implementation on GPU based on unbalanced
memory coalescing,” 2012 IEEE International Conference on Communications (ICC), pp. 3692–
3697, Jun. 2012.
[134] G. Wang, M. Wu, B. Yin, and J. R. Cavallaro, “High throughput low latency LDPC decoding
on GPU for SDR systems,” Global Conference on Signal and Information Processing (GlobalSIP),
2013 IEEE, pp. 1258–1261, Dec. 2013.
[135] Y. Lin and W. Niu, “High Throughput LDPC Decoder on GPU,” IEEE Communications Letters,
vol. 18, no. 2, pp. 344–347, Feb. 2014.
[136] G. Wang, M. Wu, Y. Sun, and J. R. Cavallaro, “A massively parallel implementation of QC-
LDPC decoder on GPU,” Application Specific Processors (SASP), 2011 IEEE 9th Symposium on,
pp. 82–85, Jun. 2011.
[137] B. Le Gal, C. Jego, and J. Crenne, “A High Throughput Efficient Approach for Decoding LDPC
Codes onto GPU Devices,” IEEE Embedded Systems Letters, vol. 6, no. 2, pp. 29–32, Jun. 2014.
[138] H. Zbinden, N. Walenta, O. Guinnard, R. Houlmann, C. L. C. Wen, B. Korzh, T. Lunghi, N. Gisin,
A. Burg, J. Constantin et al., “Continuous QKD and high speed data encryption,” Proc. SPIE,
vol. 8899, pp. 88990P–1 – 88990P–4, Oct. 2013.
[139] S. J. Johnson, V. A. Chandrasetty, and A. M. Lance, “Repeat-accumulate codes for reconciliation
in continuous variable quantum key distribution,” 2016 Australian Communications Theory
Workshop (AusCTW), pp. 18–23, Jan. 2016.
[140] M. Shirvanimoghaddam, S. J. Johnson, and A. M. Lance, “Design of Raptor codes in the low SNR
regime with applications in quantum key distribution,” 2016 IEEE International Conference on
Communications (ICC), pp. 1–6, May 2016.
[141] J. Yin, J.-G. Ren, H. Lu, Y. Cao, H.-L. Yong, Y.-P. Wu, C. Liu, S.-K. Liao, F. Zhou, Y. Jiang et al.,
“Quantum teleportation and entanglement distribution over 100-kilometre free-space channels,”
Nature, vol. 488, no. 7410, pp. 185–188, Aug. 2012.
[142] J. Handsteiner, D. Rauch, D. Bricher, T. Scheidl, and A. Zeilinger, “Quantum key distribution
at space scale,” 2015 IEEE International Conference on Space Optical Systems and Applications
(ICSOS), pp. 1–3, Oct. 2015.
[143] J.-P. Bourgoin, N. Gigov, B. L. Higgins, Z. Yan, E. Meyer-Scott, A. K. Khandani, N. Lütkenhaus,
and T. Jennewein, “Experimental quantum key distribution with simulated ground-to-satellite
photon losses and processing limitations,” Phys. Rev. A, vol. 92, pp. 052339–1 – 052339–12, Nov.
2015.
[144] E. Gibney, “Chinese satellite is one giant step for the quantum internet,” Nature, vol. 535, pp.
478–479, Jul. 2016.
[145] J. Yin, Y. Cao et al., “Satellite-based entanglement distribution over 1200 kilometers,” Science,
vol. 356, no. 6343, pp. 1140–1144, Jun. 2017.
[146] J. Rarity, P. Tapster, P. Gorman, and P. Knight, “Ground to satellite secure key exchange using
quantum cryptography,” New Journal of Physics, vol. 4, no. 1, pp. 82.1–82.21, Oct. 2002.
[147] C. Ma, W. D. Sacher, Z. Tang, J. C. Mikkelsen, Y. Yang, F. Xu, T. Thiessen, H.-K. Lo, and
J. K. S. Poon, “Silicon photonic transmitter for polarization-encoded quantum key distribution,”
Optica, vol. 3, no. 11, pp. 1274–1278, Nov. 2016.
[148] D. Bunandar, N. Harris, Z. Zhang, C. Lee, R. Ding, T. Baehr-Jones, M. Hochberg, J. Shapiro,
F. Wong, and D. Englund, “Cavity integrated quantum key distribution,” Sep. 2016, poster at
QCrypt 2016.
[149] D. G. M. Mitchell, M. Lentmaier, and D. J. Costello, “Spatially Coupled LDPC Codes Constructed
From Protographs,” IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 4866–4889, Sep.
2015.
[150] X. Liu and S. C. Draper, “The ADMM Penalized Decoder for LDPC Codes,” IEEE Transactions
on Information Theory, vol. 62, no. 6, pp. 2966–2984, Jun. 2016.
[151] M. Wasson, M. Milicevic, S. Draper, and G. Gulak, “Hardware-based linear programming decoding
via the alternating direction method of multipliers,” 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing, Mar. 2017.
[152] V. De, S. Vangal, and R. Krishnamurthy, “Near Threshold Voltage (NTV) Computing: Computing
in the Dark Silicon Era,” IEEE Design & Test, vol. 34, no. 2, pp. 24–30, Apr. 2017.
[153] N. Pinckney, S. Jeloka, R. Dreslinski, T. Mudge, D. Sylvester, D. Blaauw, L. Shifren, B. Cline, and
S. Sinha, “Impact of FinFET on Near-Threshold Voltage Scalability,” IEEE Design & Test, vol. 34,
no. 2, pp. 31–38, Apr. 2017.
[154] S. K. Samal, D. Nayak, M. Ichihashi, S. Banna, and S. K. Lim, “Monolithic 3D IC vs. TSV-based
3D IC in 14nm FinFET technology,” 2016 IEEE SOI-3D-Subthreshold Microelectronics Technology
Unified Conference (S3S), pp. 1–2, Oct. 2016.
[155] R. Takahashi, Y. Tanizawa, and A. Dixon, “High-speed implementation of privacy amplification
in quantum key distribution,” Sep. 2016, poster at QCrypt 2016.
[156] M. Hayashi and T. Tsurumaru, “More efficient privacy amplification with less random seeds via
dual universal hash function,” IEEE Transactions on Information Theory, vol. 62, no. 4, pp.
2213–2232, Apr. 2016.
[157] R. Renner and R. König, “Universally composable privacy amplification against quantum
adversaries,” Theory of Cryptography Conference, pp. 407–425, Feb. 2005.