
  • Probability, Entropy, and Adaptive

    Immune System Repertoires

    Zachary Michael Sethna

    A Dissertation

    Presented to the Faculty

    of Princeton University

    in Candidacy for the Degree

    of Doctor of Philosophy

    Recommended for Acceptance

    by the Department of

    Physics

    Adviser: Professor Curtis Callan

    September 2018

  • © Copyright by Zachary Michael Sethna, 2018.

    All rights reserved.

  • Abstract

    The adaptive immune system, composed of white blood cells called lymphocytes (B

    and T cells) that circulate in the lymph and blood, is a precision tool that tags

    and removes foreign peptides. Such peptides, also called antigens or epitopes, are

    identified by a specific binding to elements of a library or repertoire of unique proteins

    called receptors (e.g. antibodies or T cell receptors). A repertoire must be large and

    diverse enough so that at least one receptor will be able to recognize any pathogen

    epitope the organism is likely to encounter. This diversity is achieved by stochastic

    rearrangement of the germline DNA to create novel complementarity determining

    region 3 (CDR3) sequences in a process called V(D)J recombination.

    In this thesis we utilize previously developed generative models of V(D)J recombi-

    nation events, and infer the model parameters from large datasets of DNA sequences.

    The generation probability (Pgen) of a nucleotide or amino acid CDR3 is the sum

    of all model probabilities of V(D)J recombination events that generate the sequence.

    While previously it was only feasible to compute Pgen of nucleotide sequences, we

    introduce a novel dynamic programming algorithm that efficiently computes Pgen of

    amino acid sequences. We use this Pgen for several applications. First we examine

    how the diversity of a repertoire, characterized by the model entropy, scales with the

    number of insertions in the V(D)J process. This is used to describe the maturation

    of the T cell repertoire of mice from embryos to young adults. Next, we introduce a

    statistical model of hypermutation in B cells and infer the parameters from a human

    repertoire, providing a principled quantification of the biases in hypermutation rates.

    Lastly, we examine the statistics of the receptors shared amongst a cohort of more

    than 600 individual humans and show that the statistics and identities of so-called

    ‘public’ sequences are determined directly from Pgen.

    We highlight possible clinical applications and attempt to place this work in the

    context of a full theory of the adaptive immune system.


  • Acknowledgements

    I don’t have the words to express my thanks to my advisor Curt Callan. Curt has been

    a consummate advisor, providing support, advice, direction, and countless opportu-

    nities. I came into grad school with somewhat scattered interests, yet Curt showed

    me, by example, how to find a path forward through dedication, collaboration, and

    boundless curiosity. Curt has always been willing to entertain my crazy, inchoate

    ideas, and with only a few incisive questions give them shape (though it often takes

    me days to catch up and realize this). Curt, thank you for all of your time and effort,

    thank you for being my mentor. Thank you.

    I also thank my collaborators on both sides of the pond. I have learned so much

    from the insights and clarity of Aleksandra Walczak and Thierry Mora. Their ability

    to parse the underlying science, translate it into math, and then communicate this

    effectively is something I hope to one day be able to emulate. Yuval Elhanati has

    made my time here much more productive and enjoyable. Not only did Yuval provide

    crucial assistance with every step of the research, but he provided a sympathetic ear

    and was willing to talk about whatever the topic of the day was. Quentin Marcou is

    not only a wonderful collaborator, but a welcoming friend.

    Thanks to Ben Greenbaum and Vinod Balachandran for great discussions, data,

    and continuing collaboration.

    I would also like to thank Anand Murugan, whom I have never met, but whose

    code I’ve spent uncounted hours working with.

    Biophysics

    The professors in biophysics have been hugely influential on my perspective on science

    and life, and I would like to thank them. I must start by thanking Bill Bialek, and not

    only for being on my committee. His vision, instant understanding of any topic, and

    personality have made his conversations something to be sought after. I would like to


  • thank Bob Austin, not only for being a reader of this thesis, but for the many crazy

    conversations and a shared appreciation of scotch. I also want to thank Josh Shaevitz

    for efficiently cutting to the bone of any issue, Thomas Gregor for teaching me much

    during my time as a TA for ISC, and Ned Wingreen for somehow always knowing

    everything about any biological system. You all have made Princeton biophysics not

    only a superb place to do research, but a friendly and welcoming environment.

    The biophysics community also has had several postdocs and graduate students

    over the years that I would like to thank for teaching me much and making my

    time here so much fun. Andreas Mayer for great discussions on immunology. I’ve

    immensely enjoyed speculating about Information Geometry with Ben Machta. I’d

    also like to thank Leenoy Mushulam, Henry Mattingly, Dima Krotov, Ashley Linder,

    Ugne Klibaite, Ben Bratton, Gordon Berman, Michael Tikhonov, Xiaowen Chen,

    Guannan Liu, Mochi Liu, Alex Song, Sagar Setru, Mark Ioffe, and Jeff Nyugen.

    Physics

    The greater physics community has made Jadwin Hall a second home for these years.

    I’d like to thank Herman Verlinde for all of his work in organizing the grad program.

    A special thanks to Suzanne Staggs for being on my committee. Thanks to Jessica

    Heslin, Barbara Mooring, and Kate Brosowksy for the invaluable administrative as-

    sistance – without you we grad students would be helpless. Sumit Saluja has been a

    lifesaver with helping me get my code running on the server. Also, a shoutout to the

    softball team – especially the impressive Ed Groth.

    Friends

    Naturally, I must thank my fellow grad students who’ve been through the wringer with

    me and yet made my time here enjoyable. There are too many people to name, so

    undoubtedly I have accidentally forgotten some people: I must beg your forgiveness! I’d


  • like to thank Aitor Lewkowycz for the science, fun, keen insight, and advice. Aaron

    Levy for the innumerable discussions about life, politics, and science. Will Coulton

    for always being a good sport and a positive influence in every scenario. Josh Hard-

    enbrook for always calling me out when he thinks I am wrong. Dave Zajac for helping

    me ‘study’ for prelims with uncounted games of pool. Christian Jepsen for his impec-

    cable taste. Joaquin Turiaci and Debayan Mitra for the many fun nights of beer and

    foosball. Shai Chester for the fun and ridiculous stories, but NOT for any ‘help’ in my

    work. Farzan Beroz for the many philosophical and science discussions. Lauren

    McGough for the many discussions about stat mech, information theory, and life.

    Kenan Diab also understands the important things in a grad student’s life: softball,

    starcraft, MTG, and beer. Ilya Belopolski for doing many prelim problems together

    while DJ’ing with some select music. DJ Strouse for our annual run-ins at APS and

    the many good conversations about information theory and machine learning. Bin Xu

    for his always cheerful demeanor and great scientific discussions. Mallika Randeria

    for her friendship and advice. Tom Hazard, softball captain extraordinaire. Many

    thanks to Shawn Westerdale, Anne Gambrel, Guangyong Koh, Ed Young, Matt Her-

    nandez, Lee Gunderson, Sarthak Parikh, Grisha Tarnoplskiy, Vlad Kirilin, Matteo

    Ippolti, Luca Iliesiu, and Trithep Devakul. Thanks to everyone.

    Family

    Lastly, I must thank the whole of my family for being so supportive of me since

    before I can remember. I come from a unique family, filled with medical doctors and

    physicists, such that when I go home I am frequently grilled on my research. Coming

    from such a background, it is no surprise that I’ve effectively split the difference

    between physics and medicine in this thesis.

    It would be hard to overstate the influence my uncle, Jim Sethna, has had on

    me: I’ve quite literally followed in his footsteps in getting a PhD in physics from


  • Princeton. Thank you Uncle Jim for all of your advice, support, and even academic

    mentorship. I cannot tell you how much it means to me.

    My grandparents, Patarasp Sethna, Shirley Sethna, Marjory Sethna, Joshua

    Lynfield, and Yelva Lynfield, have always been examples to me, both in their

    achievements and morality. Sadly, not all of my grandparents will see me graduate;

    however, I am confident that all of them would be proud of and approve of my time here.

    I also thank my sisters, Julia and Sharon Sethna, for always providing a ready

    distraction when needed.

    Finally, I would like to thank my parents Ruth Lynfield and Michael Sethna, with-

    out whom not only would this thesis not have been possible but I never would have

    been in the position in the first place. Your love, support, direction, and parenting

    have got me to this point. Mom, your talents and commitment to helping people are

    inspiring. Your work in infectious diseases and epidemiology has clearly colored

    interests. And Dad, your elevation of science and logic above all else has shaped the

    way I think. You have frequently ‘joked’, that studying math, physics, and science is

    ‘holy work’ – a sentiment I certainly share. Thank you both for everything.


  • “The idea is like grass. It craves light, likes crowds, thrives

    on crossbreeding, grows better for being stepped on.”

    - Ursula K. Le Guin, The Dispossessed


  • Contents

    Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

    Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

    List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

    List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv

    1 Introduction 1

    1.1 Adaptive immune system . . . . . . . . . . . . . . . . . . . . . . . . . 1

    1.1.1 B cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.1.2 T cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.1.3 The DNA problem . . . . . . . . . . . . . . . . . . . . . . . . 4

    1.2 V(D)J recombination . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

    1.3 Repertoire sequencing and analysis . . . . . . . . . . . . . . . . . . . 7

    1.4 Organization of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    2 Generative Model 9

    2.1 V(D)J recombination models . . . . . . . . . . . . . . . . . . . . . . . 9

    2.1.1 VDJ generative model . . . . . . . . . . . . . . . . . . . . . . 11

    2.1.2 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . 11

    2.1.3 VJ generative model . . . . . . . . . . . . . . . . . . . . . . . 12

    2.1.4 Pgen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.2 Model Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13


  • 2.2.1 Entropy of Precomb . . . . . . . . . . . . . . . . . . . . . . . . 14

    2.2.2 Entropy of Pgen . . . . . . . . . . . . . . . . . . . . . . . . . . 18

    2.2.3 The Pgen distribution . . . . . . . . . . . . . . . . . . . . . . 18

    2.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

    2.3.1 Errors and Mismatches . . . . . . . . . . . . . . . . . . . . . 21

    2.3.2 Expectation Maximization algorithm . . . . . . . . . . . . . . 24

    2.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 27

    3 V(D)J recombination to sequences: Precomb → Pgen 28

    3.0.1 Probability Spaces (mathematical aside) . . . . . . . . . . . . 29

    3.1 Too many states! The free energy problem . . . . . . . . . . . . . . . 29

    3.2 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    3.3 OLGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

    3.3.1 Notation, 3′ and 5′ vectors . . . . . . . . . . . . . . . . . . . . 34

    3.3.2 VDJ recombination: V, M, D, N, and J . . . . . . . . . . . . 37

    3.3.3 VJ recombination: V, M, and J . . . . . . . . . . . . . . . . . 43

    3.3.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

    3.3.5 Comparison to existing methods . . . . . . . . . . . . . . . . . 46

    3.4 Some applications of OLGA computed Pgen . . . . . . . . . . . . . . . 48

    3.4.1 Pgen distributions and diversity . . . . . . . . . . . . . . . . . 48

    3.4.2 Generation probability of epitope-specific TCRs . . . . . . . . 49

    3.4.3 Predicting the frequencies . . . . . . . . . . . . . . . . . . . . 51

    3.4.4 Generation probability of sequence motifs . . . . . . . . . . . 53

    4 The repertoires ‘Of Mice and Men’ 55

    4.1 Of Mice... (mouse TRB) . . . . . . . . . . . . . . . . . . . . . . . . . 55

    4.1.1 Generative model . . . . . . . . . . . . . . . . . . . . . . . . . 57

    4.1.2 Changing insertion profile → Increasing diversity . . . . . . . 58


  • 4.1.3 Mixture mode . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    4.1.4 Toy model of mouse repertoire maturation . . . . . . . . . . . 64

    4.1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    4.2 ...and Men (human IGH) . . . . . . . . . . . . . . . . . . . . . . . . . 67

    4.2.1 Analysis approach . . . . . . . . . . . . . . . . . . . . . . . . 67

    4.2.2 Generative Model, Allele identification . . . . . . . . . . . . . 68

    4.2.3 Hypermutation . . . . . . . . . . . . . . . . . . . . . . . . . . 70

    4.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

    5 Sharing 74

    5.1 The Sharing Distribution . . . . . . . . . . . . . . . . . . . . . . . . 76

    5.1.1 Analytical calculation of the sharing distribution from the Pgen distribution . . . . . . . . . . . . . . 77

    5.1.2 Sharing modified by selection . . . . . . . . . . . . . . . . . . 81

    5.2 Extrapolation to full repertoires and beyond . . . . . . . . . . . . . . 83

    5.3 Predicting the publicness of sequences . . . . . . . . . . . . . . . . . 86

    5.3.1 Sharing and TCR generation probability . . . . . . . . . . . . 86

    5.3.2 PUBLIC: Classifier of public vs. private TCRs based on generation probability . . . . . . . . . . . . . 89

    5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

    6 Conclusion 93

    A Information Theory 96

    A.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

    A.2 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

    A.3 Kullback-Leibler divergence . . . . . . . . . . . . . . . . . . . . . . . 99

    B Probabilistic vs Deterministic inference 100


  • C Proof of Expectation Maximization algorithm 103

    D Mouse Appendix 105

    D.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

    D.2 Model parameters and validation . . . . . . . . . . . . . . . . . . . . 106

    E Human B cells Appendix 113

    E.1 Repertoire entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

    E.2 Inference of alleles and their chromosome distribution . . . . . . . . . 114

    E.3 Model parameters and validation . . . . . . . . . . . . . . . . . . . . 116

    F Sharing Appendix 122

    F.1 Sampling effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

    F.2 Monte Carlo simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 124

    F.2.1 Sequence data . . . . . . . . . . . . . . . . . . . . . . . . . . 124

    Bibliography 126


  • List of Tables

    3.1 Distance metrics for OLGA VDJ validation . . . . . . . . . . . . . . 45

    3.2 Time performance and scaling of possible methods. . . . . . . . . . . 47

    3.3 P^func_gen of TCR motifs . . . . . . . . . . . . . . . . . . . . . 54

    3.4 Pgen of invariant T cell (iNKT and MAIT cells) TRA motifs . . . . . 54

    4.1 Breakdown of B cell sequences and models . . . . . . . . . . . . . . . 67

    D.1 Mouse dataset summary . . . . . . . . . . . . . . . . . . . . . . . . . 106

    E.1 Heterozygous V allele information (Individual A) . . . . . . . . . . . 116

    E.2 Heterozygous D and J allele information (Individual A) . . . . . . . . 116

    F.1 Mice dataset sample sizes . . . . . . . . . . . . . . . . . . . . . . . . 125


  • List of Figures

    1.1 Schematic of VDJ recombination . . . . . . . . . . . . . . . . . . . . 5

    2.1 Distribution functions: P (−E = log Pgen) . . . . . . . . . . . . . . . . 19

    3.1 CDR3 indexing cartoon . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    3.2 Validation of OLGA VDJ algorithm . . . . . . . . . . . . . . . . . . . 44

    3.3 Validation of OLGA VJ algorithm . . . . . . . . . . . . . . . . . . . . 46

    3.4 Precomb and Pgen distributions . . . . . . . . . . . . . . . . . . . . . . 48

    3.5 Pgen of human TRB sequences for hepatitis C and influenza A epitopes. 50

    3.6 Pgen distributions for virus specific TRB sequences . . . . . . . . . . . 51

    3.7 Scatter of mean occurrence frequencies vs Pgen . . . . . . . . . . . . . 52

    4.1 Age-dependent insertion length distributions . . . . . . . . . . . . . . 56

    4.2 Sequence entropy for thymic repertoires . . . . . . . . . . . . . . . . . 59

    4.3 Repertoire maturation schematic . . . . . . . . . . . . . . . . . . . . 61

    4.4 Mean effective TdT level ᾱ and entropy vs age . . . . . . . . . . . . . 63

    4.5 Amount of mixing: variance of α vs age . . . . . . . . . . . . . . . . . 64

    4.6 Allele organization on chromosomes . . . . . . . . . . . . . . . . . . . 69

    4.7 Sequence dependence of somatic hypermutations . . . . . . . . . . . . 71

    5.1 Pipeline for computing the distribution of shared sequences . . . . . . 76

    5.2 Sharing distribution for 14 mice . . . . . . . . . . . . . . . . . . . . . 78

    5.3 Sharing distribution for 658 humans . . . . . . . . . . . . . . . . . . . 79


  • 5.4 Number of unique CDR3s in pooled repertoires . . . . . . . . . . . . 84

    5.5 Fraction of total repertoire composed of ‘public’ sequences . . . . . . 85

    5.6 Mouse Pgen distributions by sharing number . . . . . . . . . . . . . . 87

    5.7 Human Pgen distributions by sharing number . . . . . . . . . . . . . . 88

    5.8 PUBLIC classifier schematic . . . . . . . . . . . . . . . . . . . . . . . 89

    5.9 Performance of the PUBLIC classifier . . . . . . . . . . . . . . . . . . 90

    B.1 Probabilistic vs Deterministic marginal distributions . . . . . . . . . . 101

    D.1 Gene usages by mouse age . . . . . . . . . . . . . . . . . . . . . . . . 107

    D.2 Deletion profiles by mouse age . . . . . . . . . . . . . . . . . . . . . . 108

    D.3 Frequencies of non-templated insertions . . . . . . . . . . . . . . . . . 109

    D.4 Mouse model MI validation . . . . . . . . . . . . . . . . . . . . . . . 110

    D.5 Variation of V and J gene usage across biological replicates . . . . . . 111

    D.6 Variation of deletion profiles across biological replicates . . . . . . . . 112

    E.1 Entropy of B cell model . . . . . . . . . . . . . . . . . . . . . . . . . 113

    E.2 B cell gene usages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

    E.3 B cell deletion profiles . . . . . . . . . . . . . . . . . . . . . . . . . . 118

    E.4 B cell non-templated nucleotide frequencies . . . . . . . . . . . . . . . 119

    E.5 PinsVD and PinsDJ over replicates . . . . . . . . . . . . . . . . . . . . . 120

    E.6 B cell model MI validation . . . . . . . . . . . . . . . . . . . . . . . . 120

    E.7 B cell model insertion Markov model validation . . . . . . . . . . . . 121

    F.1 Downsampling in sharing analyses . . . . . . . . . . . . . . . . . . . . 123


  • Chapter 1

    Introduction

    1.1 Adaptive immune system

    The adaptive immune system evolved to provide animals with a precision tool to

    identify and remove anything ‘foreign’ to the animal. This is done by having a

    large library, or repertoire, of proteins called receptors that bind specifically to some

    small fragment of a protein called an epitope or antigen. This binding or affinity

    is determined by physical properties such as electrostatics, hydrophobicity, van der

    Waals forces, steric constraints, etc. By specificity we mean that this receptor will only

    bind to a very limited number of epitopes and have only limited affinity for other

    epitopes1. Crucially, this specificity allows the adaptive immune system to weed

    out any receptors that recognize self peptides, which would trigger an autoimmune

    response. However, this repertoire must be large and diverse enough to be able to

    identify any foreign peptide to ensure that microbes and cancerous cells are quickly

    identified and dealt with. In this thesis we will characterize just how staggeringly

    diverse these adaptive immune system repertoires are.

    In order to generate and regulate these receptors, the adaptive immune system

    has a special class of cells called lymphocytes, of which there are two main subtypes:

    1 Frequently the amount of ‘cross-reactivity’ is assumed to be negligible.


  • B cells and T cells. Each lymphocyte has a single receptor, of which it expresses many

    copies, in order to recognize epitopes. These lymphocyte receptors are protein com-

    plexes composed of two amino acid chains, a larger one and a smaller one. Each chain

    has largely conserved portions (in order to standardize the way the adaptive immune

    system uses these receptors) along with highly variable regions that provide the spe-

    cific binding to epitopes. The most highly variable region, and the one that largely

    determines the affinity of a receptor to an epitope, is called the complementarity-

    determining region 3 or CDR32. We will often be a little sloppy and refer to the

    ‘receptor’ and the CDR3 of a single chain interchangeably. Once a ‘naive’ lympho-

    cyte is activated by specifically binding to an epitope, it will proliferate and some of

    these cells will be archived as ‘memory’ cells to quickly reactivate and eliminate the

    antigen if the organism is ever exposed to it again.

    1.1.1 B cells

    B cells are lymphocytes that produce, and secrete, receptors called antibodies. Anti-

    bodies are composed of a heavy chain (IGH) and a light chain (IGL). These receptors

    can either be free in the plasma or expressed on the membrane of B cells3. These

    antibodies bind specifically to antigens. An antibody bound to an antigen serves as a

    tag for the rest of the immune system to attack the antigen. Furthermore, antibodies

    can directly neutralize microbes by binding to surface proteins and ‘gumming up’ their

    operation. Foreign peptides in solution can also be made to precipitate by antibodies

    coagulating many of the peptides together.

    2 There are two other variable loops, CDR1 and CDR2, that are determined by the V germline templates. As a result the variation of these loops is limited. While the CDR1 and CDR2 loops are important biologically, particularly for major histocompatibility complex (MHC) recognition by T cells, we focus exclusively on the CDR3 region in this thesis. Unlike the CDR1 and CDR2 loops, the CDR3 region spans the region of the receptor sequence where the DNA editing process called V(D)J recombination occurs (section 1.2). We define the boundaries of the CDR3 region to be the conserved amino acid residues cysteine (C) on the 5′ end and a phenylalanine (F) or tryptophan (W) on the 3′ end. These conserved residues are important to ensure the receptor folds and works properly.

    3 If expressed on a membrane, an antibody is frequently referred to as a B cell receptor (BCR). We are sometimes sloppy and will refer to antibodies in general as BCRs to parallel TCRs.


  • The amazing specificity of antibodies is generated through a process called hy-

    permutation [Teng and Papavasiliou, 2007]. Following the successful recognition of

    an antigen, a B cell proliferates and its receptor sequence undergoes random point

    mutations. These cells are then selected for affinity to the epitope. The result is an

    evolutionary process within a single individual, producing receptors with dramatically

    increased affinity to the epitope. We will present a quantitative model of hypermu-

    tation in chapter 4.

    1.1.2 T cells

    Although antibodies bind directly to epitopes in solution, T cells have their epitope

    recognition mediated by other cells. In animals with adaptive immune systems, cells

    display a protein complex called major histocompatibility complex (MHC) on their

    membrane. This protein complex can then be ‘loaded up’ with a peptide fragment by

    the cell, and a T cell receptor (TCR) can then recognize the peptide - MHC complex

    (pMHC)4. Cells load up the MHC complex with chopped up peptides internal to the

    cell, giving the T cell a snapshot of the current protein synthesis of the cell. This

    provides an excellent mechanism for the T cell to be able to identify if a cell was

    infected by a virus or has become cancerous. Also, if a cell is infected by a virus it is

    possible that peptides internal to the viral capsid (and thus not an accessible epitope

    to an antibody/BCR) could be loaded up into pMHC, providing additional epitopes

    for the adaptive immune system to tag.

    Similar to antibodies, TCRs are composed of two chains, an α chain (TRA) and

    a β chain (TRB). Ideally we would analyze the full receptor composed of TRA-TRB

    pairs, however it is hard experimentally to have high throughput sequencing that

    4 This is the interaction between cytotoxic or CD8+ T cells and the MHC I complex. There is an additional MHC complex (MHC II) that is expressed by a class of cells called antigen presenting cells (APC) that actively uptake and present peptides. There are also several other classes of T cells, which perform a variety of roles. For the purposes of this thesis we focus on CD8+ T cells and the MHC I complex.


  • accurately pairs TRA and TRB chains. Instead, many sequencing analyses focus on

    only one chain. For much of this thesis we will focus on TRBs in both humans and

    mice as the TRB chain is not only much more diverse than the TRA chain, it is also

    the chain that determines much of the receptor-epitope specificity.

    1.1.3 The DNA problem

    The massive diversity of receptors needed for a functioning repertoire poses a very

    interesting problem. These receptors are proteins, coded for by DNA sequences. Each

    unique receptor demands a unique DNA sequence. The number of unique receptors

    in a repertoire utterly dwarfs the number of coding genes in a genome. For example,

    a human TRB repertoire might have 10^8–10^10 unique receptors, whereas the number

    of coding genes in the human genome is estimated to be of the order of 10^4–10^5.

    Clearly the human genome cannot directly store the DNA sequences of every receptor

    in a repertoire. This prompts the question of how such a diversity of receptors can be

    generated from limited DNA.
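    The mismatch is stark enough that one line of arithmetic makes the point; this is a sketch using only the order-of-magnitude figures quoted above:

```python
# Order-of-magnitude figures quoted in the text above.
repertoire_low, repertoire_high = 1e8, 1e10   # unique receptors in a human TRB repertoire
genes_low, genes_high = 1e4, 1e5              # coding genes in the human genome

# Even the most conservative comparison (smallest repertoire estimate against
# the largest gene-count estimate) leaves a gap of three orders of magnitude,
# so the genome cannot devote one gene to each receptor.
conservative_gap = repertoire_low / genes_high
extreme_gap = repertoire_high / genes_low
print(conservative_gap)  # 1000.0
print(extreme_gap)       # 1000000.0
```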

    1.2 V(D)J recombination

    The solution to the apparent conundrum laid out in the previous section is a process

    called V(D)J recombination wherein the actual DNA sequences of developing B cells

    and T cells get recombined, generating novel genes that translate to unique CDR3

    amino acid sequences. While highly regulated, this process allows the adaptive im-

    mune repertoire to generate the necessary diversity to specifically recognize foreign

    antigens/epitopes. This discovery led to Susumu Tonegawa’s 1987 Nobel Prize in

    Medicine [Hozumi and Tonegawa, 1976]. The rest of the thesis will involve proba-

    bilistically modeling this V(D)J recombination.


  • Figure 1.1: Schematic of VDJ recombination

    [Figure not reproduced: diagram of the chromosomal arrangement of V, D, and J genes with their RSS regions, and the successive RAG cutting and TdT insertion steps.]

    Simplification of the stages of VDJ recombination for TRB. Shows the arrangement of example V, D, and J genes on the chromosome, along with the RSS regions (orange stripes). For the TRB gene locus, the D and J genes are arranged as above, which implies the topological constraint that D2 and J1-∗ genes are never jointly used. Non-templated nucleotides, indicated by N1 and N2, are inserted at the VD and DJ junctions by the TdT complex.

    V(D)J recombination has become an extremely well studied process over the past

    40 years and the critical enzymes have been identified and studied. Of particular

    interest to this thesis will be the enzymes recombination activating genes (RAG) 1

    and 2, and terminal deoxynucleotidyl transferase (TdT), both of which are uniquely

    expressed in lymphocytes. VDJ recombination leads to the generation of sequences

    that produce IGH and TRB chains, while VJ recombination produces IGL and TRA

    chains.

    Before recombination, the germline chromosome has two or three types of genetic

    templates: variable (V), diversity (D), and joining (J). For each type of template,

    there are multiple genes (e.g. there are 35 TRBV genes in mice) which are identi-


  • fied by immediately adjacent, highly stereotyped, 7-mer nucleotide sequences called

    recombination signal sequences (RSS). During VDJ recombination5, RAG enzymes

    bind specifically to the RSS of a J gene and of a D gene and make an incision that cuts

    out the intervening DNA. This cutting of the DNA can be messy, possibly deleting

    away parts of the D and J genes, or leaving some single stranded DNA hanging, which

    will get repaired by inserting in reverse complementary palindromic nucleotides. The

    D and J genes are then spliced together, possibly with non-templated nucleotide in-

    sertions from the TdT enzyme. A similar slicing and splicing process then happens

    at the V-D junction.

  • To strip away the biology, and make this clear on an abstract level: VDJ recombination

    acts by choosing a particular gene (a string of nucleotides) for each of the V, D,

    and J segments, deleting away some of the nucleotides of those genes (or inserting

    reverse palindromic nucleotides), and then inserting random nucleotides at the VD

    and DJ junctions as the sequence is spliced together to read (from 5 ′ to 3 ′ ) VDJ.

    This provides a new DNA sequence, where all of the edits (splicing, deleting, and

    inserting) correspond to the CDR3 region of the receptor.
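This abstract string-operation picture can be sketched in a few lines of code. The toy mini "genes" and the uniform random choices below are invented for illustration; the real process draws gene, deletion, and insertion choices from the inferred distributions described in Chapter 2, and real gene segments are much longer.

```python
import random

# Toy sketch of VDJ recombination as string operations.
# Gene segments and uniform choices are made up for illustration.
V_GENES = ["CAGTGCTACC", "CAGAGTTTGC"]
D_GENES = ["GGGACAGGGG"]
J_GENES = ["AACACTGAAG", "TTCGGGCCAG"]
NUCS = "ACGT"

def trim3(g, n):
    """Delete n nucleotides from the 3' end of a segment."""
    return g[:len(g) - n]

def trim5(g, n):
    """Delete n nucleotides from the 5' end of a segment."""
    return g[n:]

def recombine(rng):
    """Choose V, D, J, trim the junction-facing ends, insert random nucleotides."""
    v = trim3(rng.choice(V_GENES), rng.randint(0, 3))
    d = trim3(trim5(rng.choice(D_GENES), rng.randint(0, 2)), rng.randint(0, 2))
    j = trim5(rng.choice(J_GENES), rng.randint(0, 3))
    n1 = "".join(rng.choice(NUCS) for _ in range(rng.randint(0, 4)))  # VD insertions
    n2 = "".join(rng.choice(NUCS) for _ in range(rng.randint(0, 4)))  # DJ insertions
    return v + n1 + d + n2 + j  # read 5' to 3': V, N1, D, N2, J

rng = random.Random(0)
reads = [recombine(rng) for _ in range(5)]
print(reads)
```

Note that, exactly as in the text, nothing in this procedure checks reading frame or stop codons: a random fraction of the outputs would be nonproductive.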

    This V(D)J recombination process has no guarantee of success, or of producing

    a DNA sequence that can translate to a functional protein. As there are random

numbers of deletions and insertions, the DNA sequence may contain frame shifts or stop codons. If a V(D)J recombination event on a chromosome

    leads to a nonproductive sequence, the cell may try again on the second chromosome.

    If this second recombination leads to a functional receptor, the cell will have two

    rearranged chromosomes: one functional and expressed, and the nonfunctional one

    silenced by allelic exclusion. This fortunate quirk will prove crucial later in this thesis.

5 In VJ recombination, there is no D gene, and the V and J genes are directly spliced together.

Once a T cell or B cell has a functional receptor, some quality control occurs. The cell undergoes both positive selection (e.g. checking that a TCR interacts well with MHC) and negative selection (i.e. removing cells with high affinity to self epitopes). This somatic selection process is crucial both to ensure useful receptors and to prevent autoimmune responses, and it skews the repertoire on a statistical level.

    Models characterizing the statistics of this selection process have been introduced by

    my collaborators, particularly Yuval Elhanati [Elhanati et al., 2014], and are discussed

    in the papers that are referenced in chapter 4 [Elhanati et al., 2015, Sethna et al.,

    2017].

    1.3 Repertoire sequencing and analysis

    Advances in high throughput sequencing [Robins et al., 2010a] have allowed for large

    scale sequencing of lymphocytes in a blood or tissue sample: the sample is broken

    down, the DNA extracted, and specialized primers amplify the DNA sequence of the

    CDR3 region before sequencing. Such experiments are now becoming so routine that

    there is interest in using them for medical diagnostic and immunotherapy purposes.

Almost all of the data discussed in this thesis was sequenced using a protocol pioneered by Harlan Robins [Robins et al., 2010a], who has started a company, Adaptive

    Biotechnologies, to provide repertoire sequencing services.

These experiments can successfully sequence millions of cells (or more), producing datasets of ∼ 10^4–10^6 unique DNA sequences. The availability of datasets of

such size and quality allows for serious statistical analyses to quantify the underlying biology, as well as the possibility to explore more theoretical questions. Being

    physicists, the approach we will take in this thesis is to construct a statistical model,

    i.e. a parameterized probability distribution, of V(D)J recombination that reflects

    the underlying biological processes. These large datasets are then used to infer the

    model parameters. The model parameters will provide quantitative descriptions of the

    V(D)J recombination machinery, and the model itself provides a distribution of the

probability of generating any receptor (Pgen) that can be used to answer theoretical

    questions like characterizing the diversity of a repertoire.

    1.4 Organization of thesis

This thesis is broken into two main parts. The first covers chapters 2 and 3 and

    provides the mathematical framework for the rest of the thesis. The class of generative

    models used to analyze the generation probability (Pgen) of adaptive immune system

    repertoires (first introduced in Murugan et al. [2012]) is described, and the inference

    process, expectation maximization (EM), used to fit the model parameters is laid out.

    We also show how one of the main metrics we use, the entropy of a model, can be

    computed and broken down into different components. In addition, the computational

challenges associated with computing the Pgen of sequences are discussed, in particular the

    exponential explosion of the number of recombination events that generate amino acid

    CDR3 sequences. We then demonstrate the novel dynamic programming algorithm,

    OLGA [Sethna et al., 2018], that we developed to efficiently solve this problem and

make the computation of Pgen for amino acid CDR3 sequences not only tractable, but

    fast.

    The second part, spanning chapters 4 and 5, dives into the applications of the

    modeling framework defined in the first part. The first part of chapter 4 describes

    the work from Sethna et al. [2017] analyzing the maturation of mouse repertoires from

    embryo to young adult. The second half of chapter 4 lays out a model quantifying

    hypermutation in B cells [Elhanati et al., 2015]. Finally, chapter 5 demonstrates how

    Pgen explains the curious observation of so-called ‘public’ sequences.

Chapter 2

    Generative Model

    2.1 V(D)J recombination models

    The definition, selection, and inference of a generative model of V(D)J recombination

    is the foundation for all of the work that comes later. Such a generative model defines

    a probability measure over the state space of V(D)J recombination events, which can

    be extended to define probabilities of particular receptors or collections of receptors.

    We begin by introducing a general model framework by requiring that the model

    respects the biology of the V(D)J recombination process. To do this we define the

    state (sample) space of V(D)J recombination events by combinations of the stochastic

    events in the DNA splicing itself (i.e. gene choice, deletions/palindromic insertions,

and insertions). For example, we can describe the state (sample) space of VDJ

    recombination events as:

$$\Omega_e = \{(V, D, J, d_V, d_D, d'_D, d_J, \{m_i\}, \{n_i\})\} \tag{2.1}$$

where $V$, $D$, and $J$ are the gene choices; $d_V$, $d_D$ (5′/left), $d'_D$ (3′/right), and $d_J$ are deletions (including palindromic insertions); and $\{m_i\}$ and $\{n_i\}$ are the specific nucleotide sequences which are inserted at the VD and DJ junctions respectively1.

    This also allows us to define a fully general model family for the recombination event

    e ∈ Ωe:

$$P_{\mathrm{recomb}}(e) = P(V, d_V, \{m_i\}, d_D, D, d'_D, \{n_i\}, d_J, J) \tag{2.2}$$

    We cannot use the fully general model above, which defines a unique probability for

each combination of recombination events, due to the exponential explosion of parameters. The challenge is to construct sub-models which have few enough parameters

    to be inferred, yet still sufficiently describe the observed sequences. In general this is

    done by positing the independence and dependence of the various splicing events and

    then checking if the factorization captures the necessary correlations. The specific

    models used are factorized to reflect the spatial correlations along the chromosome.

    For VDJ recombination, these models assume that the V choice is independent

    of the D/J choice (the latter two being correlated by virtue of the order in which

    the genes are laid out on the chromosome, see Fig. 1.1), the deletion profiles depend

    only on the gene choice, and lastly that the insertions are independent of the genomic

    contributions and each other. There is still an exponential blowup of parameters

    unless a simpler (fewer parameters) model for the inserted sequences is introduced.

    We use a model that is a product of a length distribution and a dinucleotide Markov

    model. This model factorization and dinucleotide Markov model is first introduced

and validated in Murugan et al. [2012]; however, important exceptions to this factorization will be discussed in Chapter 4 in the contexts of mouse T cells [Sethna et al., 2017] and human B cells [Elhanati et al., 2015]. For VJ recombination, these models assume the V/J choice is correlated, the deletion profiles depend only on the gene choice, and lastly that the insertion region is independent of the genomic contribution.

1 The subscript index i is read from 5′ to 3′.

2.1.1 VDJ generative model

    The VDJ recombination model is defined as:

$$\begin{aligned}
P_{\mathrm{recomb}}(e) = {}& P_V(V)\,P_{DJ}(D,J)\,P_{\mathrm{delV}}(d_V|V)\,P_{\mathrm{delJ}}(d_J|J)\,P_{\mathrm{delD}}(d_D,d'_D|D) \\
&\times P_{\mathrm{insVD}}(\ell_{VD})\,p_0(m_1)\Big[\prod_{i=2}^{\ell_{VD}} S_{VD}(m_i|m_{i-1})\Big] \\
&\times P_{\mathrm{insDJ}}(\ell_{DJ})\,q_0(n_{\ell_{DJ}})\Big[\prod_{i=1}^{\ell_{DJ}-1} S_{DJ}(n_i|n_{i+1})\Big]
\end{aligned} \tag{2.3}$$

where the inserted nucleotide sequences $\{m_i\}$ and $\{n_i\}$ have lengths $\ell_{VD}$ and $\ell_{DJ}$ with insertion length distributions $P_{\mathrm{insVD}}(\ell_{VD})$ and $P_{\mathrm{insDJ}}(\ell_{DJ})$, $S_{VD}$ and $S_{DJ}$ are the respective dinucleotide Markov transition matrices, and finally, $p_0$ and $q_0$ are the nucleotide biases for the first insertion at each junction2. Note that an inserted sequence of length 0 (i.e. no insertions at a junction) is also allowed and has probability $P_{\mathrm{insVD}}(0)$ or $P_{\mathrm{insDJ}}(0)$ depending on the splicing junction.

    2.1.2 Model Validation

    As mentioned above, it is important to check that the factorization of the model

    structure is correct. To address this issue, we examine the correlations between

    various marginal variables of the model (i.e. the stochastic recombination events: V,

    delV, J, insVD, etc) by examining the mutual information of each pair.

    To determine if we have captured the correct correlations in the data, we compare

    the precise mutual information computed directly from the model, to the estimated

    mutual information determined by the expectation over the data (using the Treves-

    Panzeri correction [Treves et al., 1998] to account for finite sample size).

    The generative model has zero mutual information, by construction, between in-

    dependent marginal pairs, e.g. the number of VD insertions and the choice of J gene.

2 Note, we often make the further approximation that the insertion Markov model is at steady state, i.e. we set p0 and q0 to be the steady-state distributions of SVD and SDJ respectively.

Variables that correlate with each other either directly or indirectly, e.g. between D

    and J gene choice, or between D choice and number of D deletions may have non-zero

    mutual information. In order to quickly gauge if a model is consistent (or inconsis-

    tent) with the model factorization, we use plots like D.4, where the MI computed

    from the model is below the diagonal and the expectation over the data is above the

diagonal. If the plot is symmetric about the diagonal, then the model is self-consistent with the data. Indeed, the total missed mutual information is, to leading order,

    precisely the amount of information our factorized model missed due to its structure.

    To validate the dinucleotide Markov model for insertions, we compare the expected

    trinucleotide frequencies to the observed trinucleotide frequencies.

    We will perform these checks in Chapter 4 when we look at mouse T cells [Sethna

    et al., 2017] and human B cells [Elhanati et al., 2015].
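The kind of mutual-information check described above can be sketched as follows. The Treves-Panzeri estimator itself is more involved; here a first-order finite-sample bias correction in the same spirit is used as a stand-in, with made-up data for illustration.

```python
import numpy as np

def plugin_mi(x, y):
    """Plug-in (maximum likelihood) mutual information estimate, in bits."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / np.outer(px, py)[nz])).sum())

def corrected_mi(x, y):
    """Subtract a first-order finite-sample bias from the plug-in estimate."""
    n = len(x)
    r_x, r_y = len(set(x)), len(set(y))
    r_xy = len(set(zip(x, y)))
    bias = (r_xy - r_x - r_y + 1) / (2 * n * np.log(2))  # in bits
    return plugin_mi(x, y) - bias

rng = np.random.default_rng(0)
x = rng.integers(0, 4, 2000)
y = rng.integers(0, 4, 2000)        # independent of x, so the true MI is 0
print(plugin_mi(x, y), corrected_mi(x, y))
```

For independent marginals the plug-in estimate is biased upward by roughly the subtracted term, so the corrected value sits much nearer zero.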

    2.1.3 VJ generative model

    Analogous to the VDJ model, we define the model factorization for the generative

    model of VJ recombination. The primary distinction is that there is no D gene, nor

    is there an N2 insertion region (DJ junction). Also, as there is evidence of repeated

    splicing attempts for the TCRα chain, the V and J gene usages are allowed to be

correlated [Elhanati et al., 2016].

$$P_{\mathrm{recomb}}(e) = P_{VJ}(V,J)\,P_{\mathrm{delV}}(d_V|V)\,P_{\mathrm{delJ}}(d_J|J)\,P_{\mathrm{insVJ}}(\ell_{VJ})\,p_0(m_1)\Big[\prod_{i=2}^{\ell_{VJ}} S_{VJ}(m_i|m_{i-1})\Big] \tag{2.4}$$

    2.1.4 Pgen

    Our model, Precomb, defines a probability measure over the state (sample) space Ωe of

    recombination events. However, this model can, in theory, be extended to other state

(sample) spaces of much greater scientific and biological interest. In particular, we examine the state spaces of DNA nucleotide sequence reads, CDR3 nucleotide

    sequences, and CDR3 amino acid sequences (or collections/motifs of amino acid CDR3

    sequences). This is done by summing over all recombination events that generate one

    of the ‘coarse grained’ states to give the probability of generating a particular CDR3

    sequence or receptor.

$$P_{\mathrm{gen}}(\mathrm{seq}) = \sum_{e|\mathrm{seq}} P_{\mathrm{recomb}}(e) \tag{2.5}$$

This generation probability, or 'Pgen', of a sequence or receptor will be used continuously throughout this thesis. We will return to this idea of extending or 'coarse graining' the probability space in greater detail in Chapter 3.

    2.2 Model Entropy

    Before we introduce our method for inferring the model parameters, we first introduce

    a concept that we will return to repeatedly: the entropy of a model. One of the

    advantages of having a probabilistic model of V(D)J recombination is that we can

    use the (Shannon) entropy (Appendix A) of the distribution as a well defined measure

    of the ‘diversity’ of a repertoire. We examine the entropy of both Precomb and Pgen.

    First we show how to compute the entropy S(Precomb) directly from the model, and

    how it decomposes into contributions from the gene choice, the deletions, and the

    insertions. We also show explicitly how changing the insertion length distribution

    has an outsized impact on the entropy. Then we discuss how to approximate S(Pgen)

by Monte Carlo simulation. Throughout this section we do not fix the units in which the entropy is expressed; however, we will most frequently quote entropy in units of bits (i.e. with logarithms taken base 2, log2)3.

3 Personally, I think everything should be done in nats (log base e); however, for most people it is easier to parse bits (log base 2) or dits (log base 10).

2.2.1 Entropy of Precomb

    The entropy4 of a VDJ recombination model is:

$$H(P_{\mathrm{recomb}}) = -\langle \log P_{\mathrm{recomb}} \rangle_{\Omega_e} = -\big\langle \log\!\big(P_V P_{DJ} P_{\mathrm{delV}} P_{\mathrm{delJ}} P_{\mathrm{delD}} P_{\{m_i\}} P_{\{n_i\}}\big) \big\rangle_{\Omega_e} \tag{2.6}$$

    Now, we can break the total entropy expression into independent components, and

    compute the entropy of each of the components independently.

    Genes/Deletions entropic contribution

    The gene/deletion contributions are fairly straightforward to compute. Examining

    the V templates:

$$\begin{aligned}
H(P_V(V)\,P_{\mathrm{delV}}(d_V|V)) &= -\sum_{V,d_V} P_V(V)\,P_{\mathrm{delV}}(d_V|V)\big[\log P_V(V) + \log P_{\mathrm{delV}}(d_V|V)\big] \\
&= -\sum_{V} P_V(V)\log P_V(V) - \sum_{V,d_V} P_V(V)\,P_{\mathrm{delV}}(d_V|V)\log P_{\mathrm{delV}}(d_V|V) \\
&= H(P_V) + \sum_{V} P_V(V)\,H(P_{\mathrm{delV}}(d_V|V)) \\
&= H(P_V) + \langle H(P_{\mathrm{delV}})\rangle_V
\end{aligned} \tag{2.7}$$

In an analogous fashion we can determine $H(P_{DJ})$, $\langle H(P_{\mathrm{delD}})\rangle_D$, and $\langle H(P_{\mathrm{delJ}})\rangle_J$. We say that the entropy contribution from the choice of germline template is $H(P_V) + H(P_{DJ})$, while the deletion entropic contribution is $\langle H(P_{\mathrm{delV}})\rangle_V + \langle H(P_{\mathrm{delD}})\rangle_D + \langle H(P_{\mathrm{delJ}})\rangle_J$.

4 We indicate entropy by H, not S, in this section so as not to confuse notation with the dinucleotide transition matrices SVD and SDJ.
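The decomposition in Eq. 2.7 is easy to verify numerically; a minimal check, with small made-up distributions standing in for the inferred gene and deletion profiles:

```python
import numpy as np

# Check Eq. 2.7: H(P_V P_delV) = H(P_V) + <H(P_delV)>_V. All numbers invented.
P_V = np.array([0.5, 0.3, 0.2])              # P_V(V)
P_delV = np.array([[0.7, 0.2, 0.1],          # P_delV(d_V | V), one row per V
                   [0.4, 0.4, 0.2],
                   [0.1, 0.3, 0.6]])

def H(p):
    """Shannon entropy in bits of a probability array."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

joint = P_V[:, None] * P_delV                # P_V(V) P_delV(d_V|V)
lhs = H(joint)                               # entropy of the joint distribution
rhs = H(P_V) + sum(pv * H(row) for pv, row in zip(P_V, P_delV))
print(lhs, rhs)
```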

Insertion entropic contribution

    The entropy of the insertions is much trickier to compute as we will have to sum

    the Markov model probabilities over all possible insertion sequences. We drop the

    VD/DJ subscripts as the computations are identical.

$$\begin{aligned}
H(P_{\{m_i\}}) &= -\sum_{\{m_i\}} P_{\{m_i\}}(\{m_i\}) \log P_{\{m_i\}}(\{m_i\}) \\
&= -\sum_{\ell}\,\sum_{\{m_i\}|\ell} P_{\mathrm{ins}}(\ell)\, P_{\{m_i\}|\ell}(\{m_i\}) \big[\log P_{\mathrm{ins}}(\ell) + \log P_{\{m_i\}|\ell}(\{m_i\})\big] \\
&= -\sum_{\ell} P_{\mathrm{ins}}(\ell)\log P_{\mathrm{ins}}(\ell) - \sum_{\ell} P_{\mathrm{ins}}(\ell) \sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\}) \\
&= H(P_{\mathrm{ins}}) - \sum_{\ell} P_{\mathrm{ins}}(\ell) \sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\})
\end{aligned} \tag{2.8}$$

    where,

$$P_{\{m_i\}|\ell}(\{m_i\}) = p_0(m_1)\Big[\prod_{i=2}^{\ell} S(m_i|m_{i-1})\Big]. \tag{2.9}$$

    In order to make the dependence of this entropy on the average insertion length

    (〈`〉) more explicit we will make the approximation that the Markov model is at

    steady-state (i.e. p0 = pss, the steady-state distribution of S).

We will now prove inductively that for ℓ ≥ 1:

$$\begin{aligned}
H(P_{\{m_i\}|\ell}) &= -\sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\}) \\
&= H(p_{\mathrm{ss}}) - (\ell-1)\sum_m p_{\mathrm{ss}}(m)\sum_n S(n|m)\log S(n|m)
\end{aligned} \tag{2.10}$$

Initial Step: ℓ = 1

This is trivial, as $P_{\{m_i\}|\ell}(\{m_i\} = m) = p_0(m) = p_{\mathrm{ss}}(m)$, so by direct computation:

$$-\sum_{\{m_i\}|\ell=1} P_{\{m_i\}|\ell}(m)\log P_{\{m_i\}|\ell}(m) = -\sum_m p_{\mathrm{ss}}(m)\log p_{\mathrm{ss}}(m) = H(p_{\mathrm{ss}}) \tag{2.11}$$

    Inductive step

Assuming we have shown that Eq. 2.10 is true for ℓ ≤ k, we prove it holds for ℓ = k + 1.

$$\begin{aligned}
&-\sum_{\{m_i\}|\ell=k+1} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\}) \\
&= -\sum_{m_{k+1}}\,\sum_{\{m_{i\le k}\}} S(m_{k+1}|m_k)\, P_{\{m_i\}|k}(\{m_{i\le k}\}) \big[\log S(m_{k+1}|m_k) + \log P_{\{m_i\}|k}(\{m_{i\le k}\})\big] \\
&= H(P_{\{m_i\}|k}) - \sum_{m_{k+1}}\,\sum_{\{m_{i\le k}\}} S(m_{k+1}|m_k)\, P_{\{m_i\}|k}(\{m_{i\le k}\}) \log S(m_{k+1}|m_k) \\
&= H(P_{\{m_i\}|k}) - \sum_{m_{k+1}}\sum_{m_k} S(m_{k+1}|m_k)\log S(m_{k+1}|m_k) \sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\, P(\{m_{i\le k-1}\}|k-1)
\end{aligned} \tag{2.12}$$

Now, in order to do the summation in the second term, we make the observation that the conditional terms only depend on the last two nucleotides $m_{k+1}$ and $m_k$, so we would like to get the marginal distribution

$$p_k(m_k) = \sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\, P(\{m_{i\le k-1}\}|k-1) \tag{2.13}$$

But we recall our previous assumption that the Markov process is in its steady state, so the marginal distribution is the same as the steady-state distribution (i.e. $p_k = p_{\mathrm{ss}}$)5. Plugging this back in shows

$$\begin{aligned}
&-\sum_{m_{k+1}}\sum_{m_k} S(m_{k+1}|m_k)\log S(m_{k+1}|m_k) \sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\, P(\{m_{i\le k-1}\}|k-1) \\
&= -\sum_m p_{\mathrm{ss}}(m)\sum_n S(n|m)\log S(n|m)
\end{aligned} \tag{2.14}$$

    which shows the inductive step holds for k + 1 and completes the proof.

Putting everything together, the entropy from a single insertion junction is

$$H(P_{\mathrm{ins}}) + H(p_{\mathrm{ss}}) - (\langle\ell\rangle - 1)\sum_m p_{\mathrm{ss}}(m)\sum_n S(n|m)\log S(n|m) \tag{2.15}$$

    Note the dependence of this expression on the average number of insertions 〈`〉.

    We will return to this in chapter 4 when we see that the way a repertoire scales its

    diversity is by changing the insertion length distribution.
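The closed form of Eq. 2.15 can be checked against brute-force enumeration for a toy model. Here the alphabet has only two letters, the transition matrix and length distribution are made up, and $P_{\mathrm{ins}}(0)$ is set to zero so the zero-length special case does not complicate the comparison.

```python
import numpy as np
from itertools import product

# Numerical check of Eq. 2.15 for a toy two-letter insertion model.
S = np.array([[0.8, 0.2],           # S[m, n] = S(n | m); rows sum to 1
              [0.3, 0.7]])
P_ins = {1: 0.3, 2: 0.4, 3: 0.3}    # insertion-length distribution (no l = 0)

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(list(p), float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Steady state p_ss solves p_ss = p_ss S (for this S it is [0.6, 0.4]).
w, v = np.linalg.eig(S.T)
p_ss = np.real(v[:, np.argmin(abs(w - 1))])
p_ss /= p_ss.sum()

# Brute force: entropy of the full distribution over insertion strings.
probs = []
for l, pl in P_ins.items():
    for string in product(range(2), repeat=l):
        p = p_ss[string[0]]
        for a, b in zip(string, string[1:]):
            p *= S[a, b]
        probs.append(pl * p)
brute = H(probs)

# Closed form: H(P_ins) + H(p_ss) - (<l> - 1) sum_{m,n} p_ss(m) S(n|m) log2 S(n|m)
l_mean = sum(l * p for l, p in P_ins.items())
cross = (p_ss[:, None] * S * np.log2(S)).sum()
closed = H(P_ins.values()) + H(p_ss) - (l_mean - 1) * cross
print(brute, closed)
```

The two numbers agree to machine precision, and the closed form makes the linear dependence on the mean insertion length explicit.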

    Total entropy of Precomb

$$\begin{aligned}
H(P_{\mathrm{recomb}}) = {}& H(P_V) + H(P_{DJ}) + \langle H(P_{\mathrm{delV}})\rangle_V + \langle H(P_{\mathrm{delD}})\rangle_D + \langle H(P_{\mathrm{delJ}})\rangle_J \\
&+ H(P_{\mathrm{insVD}}) + H(p_{\mathrm{ss}}) - (\langle\ell_{VD}\rangle - 1)\sum_m p_{\mathrm{ss}}(m)\sum_n S_{VD}(n|m)\log S_{VD}(n|m) \\
&+ H(P_{\mathrm{insDJ}}) + H(q_{\mathrm{ss}}) - (\langle\ell_{DJ}\rangle - 1)\sum_m q_{\mathrm{ss}}(m)\sum_n S_{DJ}(n|m)\log S_{DJ}(n|m)
\end{aligned} \tag{2.16}$$

    5If we didn’t want to make the steady-state assumption, it is easy to see how using this marginaldistribution would change Eq. 2.10 to:H(P{mi}|`) = H(p0)−

    ∑`k=2

    ∑m pk(m)

    ∑n S(n|m) log(S(n|m))

2.2.2 Entropy of Pgen

    The probability distribution of Pgen no longer factorizes after the summation. As a

result we cannot break down the entropy into independent pieces. Instead we take a different tack and estimate the entropy of Pgen.

    We recall that the entropy of a distribution is just −〈logP 〉. This means that we

    can estimate the entropy of Pgen by taking the expectation value over Monte Carlo

    simulated sequences:

$$S(P_{\mathrm{gen}}) \approx -\langle \log P_{\mathrm{gen}}(s) \rangle_{s\in\mathrm{MC\ sample}} \tag{2.17}$$
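Eq. 2.17 in miniature: a toy four-state distribution, whose exact entropy is known, stands in for Pgen (whose support is far too large to enumerate in the real case).

```python
import numpy as np

# Monte Carlo entropy estimate: average -log2 of the probabilities of
# sampled states. The toy distribution below is invented for illustration.
rng = np.random.default_rng(1)
p = np.array([0.5, 0.25, 0.125, 0.125])
exact = float(-(p * np.log2(p)).sum())           # known exact entropy, 1.75 bits
draws = rng.choice(len(p), size=200_000, p=p)
estimate = float(-np.log2(p[draws]).mean())
print(exact, estimate)
```

The estimator is unbiased, with statistical error shrinking as the inverse square root of the number of Monte Carlo samples.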

    2.2.3 The Pgen distribution

    Another extremely effective way of visualizing the diversity of a repertoire is to exam-

    ine the probability density of the log Pgen of sequences. If a large number of sequences

    (or recombination events) are drawn from a model distribution (i.e. Monte Carlo sam-

    pling), they can be histogrammed by the log of their generation probabilities. If we

    define an energy as E ∼ − log Pgen, this distribution is the probability density P (−E),

    and is closely related to the density of states (a connection we will return to in chapter

    5). An example of one of these plots is shown for a human TRB model in Fig. 2.1,

    demonstrating the massive range of generation probabilities, spanning ∼20 orders of

magnitude. Another very useful aspect of these plots is that the mean of each distribution is the entropy of the distribution (up to a minus sign), indicated by the dotted lines in Fig. 2.1. We frequently use such plots as a way of characterizing

    the data visually. It is easy to see shifts to more or less entropic distributions, and

    to see any impact on the tails. Furthermore, these plots can be made from the data

    directly by histogramming their generation probabilities and the entropy of such a

    distribution will again be the mean6.

6 Please note, when using data sequences, the 'entropy' computed as the mean of the distribution is technically a cross entropy. For the non-productive sequences we largely focus on in this thesis this is a negligible distinction. However, for in-frame productive

Figure 2.1: Distribution functions: P(−E = log Pgen)

Shows the distribution of generation probabilities over 3 different state spaces of the same human TRB model, highlighting the 'coarse graining' of the model from recombination events, to nucleotide sequences, and finally to amino acid sequences/receptors. The dotted lines indicate the mean of each distribution, which is mathematically equivalent to the negative of the entropy of each distribution. The entropy of the distributions decreases as they get more coarse grained.

    2.3 Inference

    The data which is used to infer these models comes from high-throughput Illumina

    sequencing [Robins et al., 2010a] and is organized as a collection of DNA sequences

    of around 60-200 base pairs. We will want to infer the parameters of the generative

    model that most accurately reflect the sequences observed in the experiment. Without

a principled prior that significantly biases the distribution (note, the Jeffreys prior is remarkably flat for these generative models), the parameters are inferred by way of

sequences this is not an irrelevant concern, as the distributions are noticeably skewed towards higher generation probabilities due to somatic selection. See Elhanati et al. [2014] for a discussion of somatic selection and the statistical effects on the distribution. We are a little sloppy and always refer to this quantity as the entropy of the distribution, even if it is technically a cross entropy at times.

maximum likelihood estimation. Given a collection of observed DNA sequences S and

    a generative model determined by parameters θ ∈ Θ, we want to infer the estimated

    parameters θ̂:

$$\hat\theta = \arg\max_\theta L(\theta; S) = \arg\max_\theta p(S|\theta) = \arg\max_\theta \prod_{\mathrm{seq}\in S} P_{\mathrm{gen}}(\mathrm{seq}|\theta) \tag{2.18}$$

    as the sequences in S are assumed to be independently generated.

    In order to properly infer the parameters of a V(D)J model we must be careful to

    only use sequences that are statistically representative of the V(D)J recombination

    machinery itself and are not skewed by any selective process or somatic population

    dynamics. This is a real worry as not only could clonal expansion overrepresent

    specific sequences, but functional receptors are systematically biased away from the

    underlying V(D)J generative distribution due to their involvement in the immune

    system function (this is explored in Elhanati et al. [2014]). Fortunately, as discussed

    in section 1.2, V(D)J recombination does not always produce inframe, productive

    sequences with each recombination event. As a result, the DNA sequence datasets

    we analyze contain a significant fraction of sequences we know must be nonproduc-

    tive/nonfunctional because they are frame shifted (out of frame) or contain a stop

    codon. These sequences can never be expressed and therefore should experience no

selective pressures. Thus, to ensure a statistically unbiased sample, we filter our sample for only unique, nonproductive sequences. Filtering for unique sequences removes

    the influence of clonal dynamics and expansion, whereas filtering for nonproductive

    sequences removes any selection effects.

    The generative models described (Eq. 2.3, Eq. 2.4) are defined over the space

of recombination events, which are 'hidden' in the sense that there are many, many recombination events that can lead to a particular DNA sequence, and there is no way

    to determine which one actually occurred. In order to infer the parameters of such a

model, a classic iterative learning algorithm, expectation maximization (EM), is used

    which ensures that a local maximum in likelihood is achieved (proof in Appendix C).

    2.3.1 Errors and Mismatches

    Each recombination event e = (V,D, J, dV , dD, d′D, dJ , {mi}, {ni}) generates a specific

DNA sequence. However, it is possible that, when this gene was sequenced, the

    recorded nucleotides do not match up perfectly with the sequence generated by e.

    This mismatch could indicate a sequencing error in the experiment or, in the case of

    B cells, could be the result of hypermutations (this will be discussed in much greater

    detail in Section 4.2). We will need to account for such mismatches or errors in order

    to properly infer the parameters of the generative model. To do this we introduce

    an error/mismatch model. Formally, we define the observed probabilities, given an

    observed/measured sequence seqo as:

$$\begin{aligned}
P^o_{\mathrm{recomb}}(e, \mathrm{seq}_o) &= P_{\mathrm{recomb}}(e)\, P_{\mathrm{mis}}(\mathrm{seq}_o|e) \\
P^o_{\mathrm{gen}}(\mathrm{seq}_o) &= \sum_{e\in E} P^o_{\mathrm{recomb}}(e, \mathrm{seq}_o)
\end{aligned} \tag{2.19}$$

where Pmis(seqo|e) is the error/mismatch model whose parameters will be inferred

    during the EM inference. There are several Pmis(seqo|e) models used over the course

    of this work.

    No error model

    It is useful to first consider a model where no errors or mismatches are allowed. To

do this, define $P_{\mathrm{mis}}(\mathrm{seq}_o|e) = \mathbb{I}[e\ \text{generates}\ \mathrm{seq}_o]$. Then,

$$P^o_{\mathrm{recomb}}(e, \mathrm{seq}_o) = \begin{cases} P_{\mathrm{recomb}}(e) & \text{if } e \text{ generates } \mathrm{seq}_o \\ 0 & \text{otherwise} \end{cases} \tag{2.20}$$

  • and

$$P^o_{\mathrm{gen}}(\mathrm{seq}_o) = \sum_{e\in E} P^o_{\mathrm{recomb}}(e, \mathrm{seq}_o) = \sum_{e|\mathrm{seq}_o} P_{\mathrm{recomb}}(e) = P_{\mathrm{gen}}(\mathrm{seq}_o) \tag{2.21}$$

we see that we recover Pgen from Pogen.

    Flat error rate

    This model assumes that the probability of a mismatched nucleotide between the

    observed sequence seqo = {soi} and the sequence generated by recombination event e,

    seqe = {sei}, is a flat probability pm.

$$P_{\mathrm{mis}}(\mathrm{seq}_o|e) = \prod_i \big( p_m\, \mathbb{I}[s^o_i \neq s^e_i] + (1-p_m)\, \mathbb{I}[s^o_i = s^e_i] \big) \tag{2.22}$$

    Flat error rate, restricted to genomic templates

    In practice, it doesn’t make much sense to examine mismatches outside of the region

    of the sequence that is determined by a germline V, D, or J sequence. Define the set

    of positions, Posgene where the nucleotides {sei} come from a germline template and

    its complement, Posins, where the nucleotides come from non-templated insertions.

    We define a new error model that applies the flat error model to positions Posgene and

    the no error model to positions Posins:

$$P_{\mathrm{mis}}(\mathrm{seq}_o|e) = \begin{cases} 0, & \text{if } \exists\, i \in \mathrm{Pos}_{\mathrm{ins}} \text{ s.t. } s^o_i \neq s^e_i \\ \prod_{i\in \mathrm{Pos}_{\mathrm{gene}}} \big( p_m\, \mathbb{I}[s^o_i \neq s^e_i] + (1-p_m)\, \mathbb{I}[s^o_i = s^e_i] \big), & \text{otherwise} \end{cases} \tag{2.23}$$

    This is the model that is used most frequently and unless otherwise stated is the

    model that is used for inference purposes.
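This mismatch model is straightforward to express in code. The sequences and the templated/inserted position partition below are illustrative, not real data.

```python
# Flat error rate restricted to genomic templates (Eq. 2.23): a flat
# per-nucleotide error rate p_m on germline-templated positions, zero
# tolerance for mismatches at non-templated (inserted) positions.
def p_mis(seq_obs, seq_event, templated, p_m=0.01):
    """templated[i] is True where position i of seq_event is germline-templated."""
    assert len(seq_obs) == len(seq_event) == len(templated)
    prob = 1.0
    for so, se, is_gene in zip(seq_obs, seq_event, templated):
        if not is_gene:
            if so != se:
                return 0.0            # mismatch inside the insertion region
            continue                  # insertion positions contribute no factor
        prob *= p_m if so != se else 1.0 - p_m
    return prob

seq_e = "CATTGGTA"
mask = [True, True, True, False, False, True, True, True]  # N region: positions 3-4
print(p_mis("CATTGGTA", seq_e, mask))  # no mismatches
print(p_mis("CAGTGGTA", seq_e, mask))  # one mismatch at a templated position
print(p_mis("CATAGGTA", seq_e, mask))  # mismatch in the insertion region
```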

N-mer context dependent error model

    In order to study hypermutations in Section 4.2 we use a mismatch model where

    the mismatch rate is modulated depending on the 7-mer nucleotide sequence around

    the mismatch site. Here we define a general N-mer context model where there are

    independent energies at each site (i.e. a one point model).

$$p_h(i|\mathrm{seq}) = \frac{1}{Z}\, p_{\mathrm{bg}}\big(s_{i-\lfloor N/2\rfloor}, s_{i-\lfloor N/2\rfloor+1}, \ldots, s_{i+\lfloor N/2\rfloor}\big) \exp\!\Bigg[\sum_{k=-\lfloor N/2\rfloor}^{\lfloor N/2\rfloor} -E_k(s_{i+k})\Bigg] \tag{2.24}$$

where $p_{\mathrm{bg}}(\sigma)$ is the background frequency of the N-mer nucleotide sequence $\sigma$ and

    the proportionality constant Z is determined by matching the overall mismatch rate

(i.e. $\langle p_h\rangle = p_m$). As we have the freedom to define the zero of energy for each of the $E_k$'s, it is convenient to set $\sum_{\sigma\in\{A,C,G,T\}} E_k(\sigma) = 0$ to make it transparent if the nucleotide identity at position k in the N-mer makes a hypermutation mismatch more or less likely.
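A sketch of evaluating Eq. 2.24 for N = 3. The one-point energies and the (here trivial) background frequency are made up, and Z is fixed by matching the mean rate over a single illustrative sequence to an overall rate p_m.

```python
import numpy as np

# Toy N-mer context model (Eq. 2.24) with N = 3 and invented energies.
NUC = "ACGT"
N, half = 3, 1
rng = np.random.default_rng(0)
E = rng.normal(0.0, 0.5, size=(N, 4))
E -= E.mean(axis=1, keepdims=True)   # enforce sum_sigma E_k(sigma) = 0

def boltzmann_weight(seq, i, p_bg=1.0):
    """p_bg * exp(-sum_k E_k(s_{i+k})) for the N-mer centred on position i."""
    energy = sum(E[k + half, NUC.index(seq[i + k])] for k in range(-half, half + 1))
    return p_bg * np.exp(-energy)

seq = "ACGTACGTGG"
sites = range(half, len(seq) - half)             # positions with a full 3-mer context
raw = np.array([boltzmann_weight(seq, i) for i in sites])
p_m = 0.01
Z = raw.mean() / p_m                 # normalization matching <p_h> = p_m
p_h = raw / Z
print(p_h)
```

The zero-sum convention on each $E_k$ means a positive energy at position k marks a context that suppresses hypermutation there, and a negative one marks a context that enhances it.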

    One may also notice that we did not specify whether seq is seqo or seqe. Ideally we

    would want seq to be the sequence immediately before the hypermutation occurred

    (e.g. if we were constructing an evolutionary tree from hypermutations we should use

    the current node’s sequence as seq). However, for inference purposes this ambiguity

    is functionally irrelevant as choosing either seqo or seqe to be seq will result in a

    negligible difference.

    Again, we will want to restrict to mismatches with the germline sequences (to

    ensure we have identified a hypermutation), so we define:

$$P_{\mathrm{mis}}(\mathrm{seq}_o|e) = \begin{cases} 0, & \text{if } \exists\, i \in \mathrm{Pos}_{\mathrm{ins}} \text{ s.t. } s^o_i \neq s^e_i \\ \prod_{i\in \mathrm{Pos}_{\mathrm{gene}}} \big( p_h(i|\mathrm{seq})\, \mathbb{I}[s^o_i \neq s^e_i] + (1-p_h(i|\mathrm{seq}))\, \mathbb{I}[s^o_i = s^e_i] \big), & \text{otherwise} \end{cases} \tag{2.25}$$

2.3.2 Expectation Maximization algorithm

Expectation maximization is implemented by taking an initial guess (generally randomized) for the parameters and then iterating two different steps. The first step,

expectation, defines a function which is the expected log-likelihood over the distribution of data and hidden variables determined by the data and the current guess of the

    parameters. Explicitly, if θ′ is the current estimation of the parameters, we define:

$$Q(\theta|\theta') = \langle \log L(\theta; X, Z) \rangle_{Z|X,\theta'} \tag{2.26}$$

    Note, Q(θ|θ′) is still a function of some undetermined parameters θ. This leads

to the second step: maximization. To determine the next iteration's parameter estimate, we maximize this function:

$$\theta^{(i+1)} = \arg\max_\theta Q(\theta|\theta^{(i)}) \tag{2.27}$$

    Repeatedly iterating these steps will monotonically increase both Q and the full

    likelihood function (proof below). Let us be explicit in how this translates into the

    specific scenario of a VDJ generative model. Say we have (nonproductive) sequences

S, the set of possible recombination events $\Omega_e = \{(V, D, J, d_V, d_D, d'_D, d_J, \{m_i\}, \{n_i\})\}$,

    and the model structure from Eq. 2.3. Then θ is the collection of parameters defining

    PV, PDJ, PdelV, etc. The expectation step is defined as so:

$$Q(\theta|\theta') = \langle \log L(\theta; S, E) \rangle_{E|S,\theta'} = \sum_{\mathrm{seq}\in S}\sum_{e\in E} P^o_{\mathrm{recomb}}(e|\mathrm{seq}, \theta') \log P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta) \tag{2.28}$$

    Now,

$$P^o_{\mathrm{recomb}}(e|\mathrm{seq}, \theta') = \frac{P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{\sum_{e\in E} P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')} = \frac{P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{P^o_{\mathrm{gen}}(\mathrm{seq}|\theta')} \tag{2.29}$$

is the fractional contribution of the particular event to the total Pgen of that sequence.

Plugging $P^o_{\mathrm{recomb}}(e|\mathrm{seq}, \theta')$ back in and expanding $P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta)$, we get:

$$\begin{aligned}
Q(\theta|\theta') = \sum_{\mathrm{seq}\in S}\sum_{e\in E} &\frac{P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{P^o_{\mathrm{gen}}(\mathrm{seq}|\theta')} \times \Big[ \log P_V(V(e)) + \log P_{DJ}(D(e), J(e)) \\
&+ \log P_{\mathrm{delV}}(d_V(e)|V(e)) + \log P_{\mathrm{delD}}(d_D(e), d'_D(e)|D(e)) + \log P_{\mathrm{delJ}}(d_J(e)|J(e)) \\
&+ \log P_{\mathrm{insVD}}(\ell_{VD}(e)) + \log p_0(m_1(e)) + \sum_{i=2}^{\ell_{VD}} \log S_{VD}(m_i(e)|m_{i-1}(e)) \\
&+ \log P_{\mathrm{insDJ}}(\ell_{DJ}(e)) + \log q_0(n_{\ell_{DJ}}(e)) + \sum_{i=1}^{\ell_{DJ}-1} \log S_{DJ}(n_i(e)|n_{i+1}(e)) \\
&+ \log P_{\mathrm{mis}}(\mathrm{seq}|e) \Big]
\end{aligned} \tag{2.30}$$

We now need to evaluate $\arg\max_\theta Q(\theta|\theta')$. As the expansion breaks up into independent pieces, we can deal with them one at a time. First examine the parameters in $P_V$. We want to maximize $f(P_V) = Q(\theta|\theta')$ conditioned on $g(P_V) = \sum_V P_V(V) - 1 = 0$. Naturally, this is done with Lagrange multipliers ($\nabla f = \lambda \nabla g$). $\nabla f$ is readily computed:

$$\begin{aligned}
\frac{\partial f}{\partial P_V(V_i)} &= \frac{\partial Q(\theta|\theta')}{\partial P_V(V_i)} = \frac{\partial}{\partial P_V(V_i)} \sum_{\mathrm{seq}\in S}\sum_{e\in E} \frac{P_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{P_{\mathrm{gen}}(\mathrm{seq}|\theta')}\, \log P_V(V(e)) \\
&= \sum_{\mathrm{seq}\in S}\sum_{e\in E} \frac{P_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{P_{\mathrm{gen}}(\mathrm{seq}|\theta')}\, \frac{\mathbb{I}[V_i = V(e)]}{P_V(V_i)}
\end{aligned} \tag{2.31}$$

$\lambda \nabla g$ is even more straightforward:

$$\lambda \frac{\partial g}{\partial P_V(V_i)} = \lambda \frac{\partial}{\partial P_V(V_i)} \Big[\sum_V P_V(V) - 1\Big] = \lambda \tag{2.32}$$

    So,

$$P_V(V_i) = \frac{1}{\lambda} \sum_{\mathrm{seq}\in S}\sum_{e\in E} \frac{P_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{P_{\mathrm{gen}}(\mathrm{seq}|\theta')}\, \mathbb{I}[V_i = V(e)] \tag{2.33}$$


To solve for λ, plug back into our normalization condition (g(PV) = ∑_V PV(V) − 1 = 0):

    g(PV) = 0 = ∑_V PV(V) − 1
              = −1 + (1/λ) ∑_{seq∈S} ∑_{e∈E} [Precomb(e, seq|θ′) / Pgen(seq|θ′)] ∑_{Vi} I[Vi = V(e)]
              = −1 + (1/λ) ∑_{seq∈S} ∑_{e∈E} Precomb(e, seq|θ′) / Pgen(seq|θ′)
              = −1 + (1/λ) ∑_{seq∈S} Pgen(seq|θ′) / Pgen(seq|θ′)
              = −1 + (1/λ) ∑_{seq∈S} 1
              = −1 + |S|/λ
    ⇒ λ = |S|    (2.34)

Finally, this gives us the expression for the parameters of PV for the next iteration:

    PV(Vi) = (1/|S|) ∑_{seq∈S} ∑_{e∈E} [Precomb(e, seq|θ′) / Pgen(seq|θ′)] I[Vi = V(e)]    (2.35)

which is just the expectation of that marginal (V gene usage in this case) over the
data sequences, using the previous iteration's parameters. It is easy to show
that the remaining parameters are inferred in an analogous fashion, with the only
caveat that the derivation for conditional distributions requires a
normalization condition (and thus another Lagrange multiplier) for each variable
that the distribution is conditioned on (or the inference can be done as a joint distribution).
For example:

    g(PdelV|Vi) = 0 = ∑_{d′V} PdelV(d′V|Vi) − PV(Vi)    (2.36)


Also note that, as the insertion dinucleotide Markov models break up into a
similar form, their parameters are also inferred in an identical manner (only that
each recombination event e can contribute more than one term to the sum).
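The update rule of Eq. 2.35 is simple to implement once the posterior weights Precomb(e, seq|θ′)/Pgen(seq|θ′) are available. A minimal Python sketch of one M-step update of PV, assuming hypothetical helpers `events_for` (enumerating the events e|seq) and `P_recomb` (evaluating the event probability under the previous iteration's parameters):

```python
from collections import defaultdict

def em_update_PV(sequences, events_for, P_recomb, V_genes):
    """One M-step update of PV (Eq. 2.35): the posterior-weighted
    V gene usage averaged over the data sequences.

    events_for(seq)  -> hypothetical list of recombination events e|seq
    P_recomb(e, seq) -> P_recomb(e, seq | theta') under the previous iteration
    """
    PV_new = defaultdict(float)
    for seq in sequences:
        events = events_for(seq)
        # Pgen(seq | theta') is the sum over all events generating seq (Eq. 2.5)
        Pgen = sum(P_recomb(e, seq) for e in events)
        for e in events:
            # fractional contribution of event e to Pgen of this sequence (Eq. 2.29)
            PV_new[e['V']] += P_recomb(e, seq) / Pgen
    # the Lagrange multiplier works out to |S| (Eq. 2.34)
    return {V: PV_new[V] / len(sequences) for V in V_genes}
```

Each sequence contributes a total posterior weight of 1, so the updated marginal is automatically normalized; this is exactly the content of λ = |S|.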

    2.3.3 Implementation

Implementation of the EM algorithm for these V(D)J generative models is quite
tricky and requires a large amount of computational power. As model parameters
are learned from large datasets of ∼ 10^4–10^5 sequences, there is a premium on efficient,
parallelized code. Sequence alignment, efficient enumeration of recombination events,
and intelligent organization of data structures are only some of the challenges. The
story of developing software to infer these parameters belongs to others and so won't
be a focus of this thesis. However, I do want to take a moment to describe and
highlight the work done to make this difficult inference process possible.

My predecessor, Anand Murugan, was the first to code up and implement a VDJ
generative model of the form of Eq. 2.3, and this was the basis of the first paper
describing these V(D)J generative models [Murugan et al., 2012]. I later adapted
his MATLAB code to define and infer the models discussed in Chapter 4.
Despite the success of this MATLAB code, it requires some expertise to use, and
any changes to the model structure must be hard coded.

    Recently a collaborator, Quentin Marcou, developed a software package called

    IGoR (Inference and Generation Of Repertoires) in C++ [Marcou et al., 2018]. IGoR

is constructed in a way that allows the user to easily define the model structure (i.e.
the factorization) and runs smoothly and quickly. This software was used to infer the
models discussed and used in Chapters 3 and 5. IGoR is publicly available on GitHub:

    https://github.com/qmarcou/IGoR.


Chapter 3

V(D)J recombination to sequences: Precomb → Pgen

The previous chapter laid out how a generative V(D)J model can be constructed and
inferred. However, the generative model is defined over a state (sample) space of re-
combination events, Ωe, whereas the scientific interest is in the state (sample) space
of sequences or receptors (both nucleotide and amino acid), and biological/physical
effects can only take place at the level of the physical protein structure of the re-
ceptor, i.e. the amino acid sequence (or possibly some coarse-grained version of it).
As briefly discussed in 2.1.4, the V(D)J model does define the
probability of generating a particular nucleotide or amino acid sequence by summing
over all recombination events that generate the sequence. This was summarized in
Eq. 2.5, which we repeat here:

    Pgen(seq) = ∑_{e|seq} Precomb(e)    (3.1)

This summation over recombination events is, in some sense, ‘coarse graining’ the
state (sample) space, as we are aggregating many states (recombination events) into
a new state (a nucleotide or amino acid sequence).


3.0.1 Probability Spaces (mathematical aside)

Formally, this ‘coarse graining’ is just extending probability spaces. First we define
the sample space of recombination events (Ωe, with σ-algebra Be), the sample space of
nucleotide CDR3 sequences (Ωnt, with σ-algebra Bnt), and the sample space of amino
acid CDR3 sequences (Ωaa, with σ-algebra Baa). Note that as each recombination
event generates a specific nucleotide sequence through the physical process of V(D)J
recombination, we have the surjective map π_v(d)j : Ωe → Ωnt. Furthermore, as each
(in-frame) nucleotide sequence translates to an amino acid sequence, we can define the
translation mapping π_nt2aa : Ωnt → Ωaa (if we wished to be pedantic we could keep the
out-of-frame sequences in Ωaa to ensure that π_nt2aa is a function over the whole sample
space and to maintain the total measure of 1 over Ωaa). In this notation it is easy to see
that the mapping π_v(d)j extends the probability space of V(D)J recombination events,
(Ωe, Be, Precomb), to the probability space of nucleotide sequences, (Ωnt, Bnt, Pgen_nt),
while the mapping π_nt2aa extends the probability space of nucleotide sequences to the
probability space of amino acid sequences, (Ωaa, Baa, Pgen_aa). Our sloppy notation of
e|seq can now be understood as either π_v(d)j^{−1}(ntseq) or π_v(d)j^{−1} π_nt2aa^{−1}(aaseq).
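Concretely, extending a probability space along a surjective map is just aggregating probability over preimages. A toy Python sketch (all event names, sequences, and probabilities below are made up for illustration):

```python
from collections import defaultdict

# Toy event space: each 'recombination event' carries a probability and maps
# (surjectively) to a nucleotide sequence; nucleotide sequences in turn map
# to amino acid sequences. All values here are illustrative.
P_recomb = {'e1': 0.2, 'e2': 0.3, 'e3': 0.5}
pi_vdj   = {'e1': 'TGT', 'e2': 'TGT', 'e3': 'TGC'}   # events -> nt sequences
pi_nt2aa = {'TGT': 'C', 'TGC': 'C'}                  # nt -> aa (both code Cys)

def pushforward(P, mapping):
    """Extend a probability space along a surjective map by summing the
    probabilities of all preimages of each coarse-grained state."""
    Q = defaultdict(float)
    for state, p in P.items():
        Q[mapping[state]] += p
    return dict(Q)

Pgen_nt = pushforward(P_recomb, pi_vdj)    # probability over Omega_nt
Pgen_aa = pushforward(Pgen_nt, pi_nt2aa)   # probability over Omega_aa
```

The total measure of 1 is preserved at each coarse-graining step, as required.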

3.1 Too many states! The free energy problem

Despite Eq. 2.5's seeming simplicity, it can prove to be computationally very problem-
atic because of the number of recombination events that could generate a particular
sequence. This is the exact same problem that plagues much of statistical physics:
summing over all states to determine the partition function or a free energy can prove
to be computationally prohibitive if the only method of doing the summation is by
enumerating the states. Indeed, log(Pgen), a quantity we will look at repeatedly, can
even be thought of as a free energy. The reader may remember that this quantity, Pgen,
was required for the EM inference in the previous chapter (Sec. 2.3.2), so to do any sort
of inference, or to construct any sort of probabilistic model of V(D)J recombination,
the problem of enumerating all possible recombination events must be addressed.

In previous work, and in the inference procedures of Murugan et al. [2012] and
Marcou et al. [2018], the number of states to be summed over is controlled through
regularization. By regularization we mean that some procedure is used to
limit the number of recombination events considered to a manageable num-
ber. Fortunately, this is quite possible for nucleotide sequences. By only considering
gene templates V, (D), and J that have a sufficiently good alignment (e.g. Smith-
Waterman alignment), capping the number of deletions/insertions, and imposing cutoffs
on fractional probabilities and errors, it is feasible to reduce the number of recom-
bination events that correspond to a nucleotide sequence (i.e. the notation e|seq) to
the order of thousands or less. This makes it tractable, if still very computationally
intensive, to compute Pgen for nucleotide sequences. It must be noted that, for soft-
ware attempting to infer V(D)J models of arbitrary structure, this enumeration of
recombination events is very useful as it places no restrictions on the correlations the
model can consider.

However, this approach of exhaustive enumeration with some regularization is
computationally intractable for amino acid CDR3 sequences, let alone any kind of
coarse-grained alphabet of amino acids that might be more interesting functionally.
This can easily be seen from the fact that the number of possible nucleotide sequences
that translate to a particular amino acid sequence explodes exponentially with the
number of amino acids in the CDR3 region:

    |{σ s.t. nt2aa(σ) = a}| = ∏_{ai∈a} #codons(ai)    (3.2)

To put some perspective on these numbers, the average number of nucleotide
sequences that code for a mouse TRB CDR3 amino acid sequence is ∼ 2 billion,
and mouse TRB CDR3 sequences are significantly shorter than human TRB or
IGH. Even the heavily optimized and efficient IGoR software developed to do V(D)J
generative model inference [Marcou et al., 2018], which can compute the Pgen of
around 60 nucleotide sequences per CPU second, would take around 8500 CPU hours to
compute the Pgen of a single mouse TRB amino acid sequence. This is prohibitively
long if there is interest in analyzing repertoire datasets that can easily be of the order
of 10^5 unique sequences or larger. For this reason, much of the early work in this
field, and in this thesis, was restricted to the analysis of nucleotide sequences.
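The degeneracy count of Eq. 3.2 itself is trivial to evaluate; what is intractable is summing Precomb over all of those coding sequences. A short Python sketch (the example CDR3 is an arbitrary illustrative sequence; the degeneracies are those of the standard genetic code):

```python
# Number of nucleotide sequences coding for an amino acid CDR3 (Eq. 3.2):
# the product over positions of the codon degeneracy of each amino acid.
N_CODONS = {
    'A': 4, 'R': 6, 'N': 2, 'D': 2, 'C': 2, 'Q': 2, 'E': 2, 'G': 4,
    'H': 2, 'I': 3, 'L': 6, 'K': 2, 'M': 1, 'F': 2, 'P': 4, 'S': 6,
    'T': 4, 'W': 1, 'Y': 2, 'V': 4,
}

def n_coding_sequences(aa_seq):
    """Count the nucleotide sequences sigma with nt2aa(sigma) = aa_seq."""
    n = 1
    for aa in aa_seq:
        n *= N_CODONS[aa]
    return n

# Even a short, 13-residue CDR3 has millions of coding nucleotide sequences:
n_coding_sequences('CASSLGQAYEQYF')   # 1,769,472
```

The exponential growth with CDR3 length is what makes exhaustive enumeration hopeless at the amino acid level.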

While computing Pgen for amino acid sequences by way of enumerating recombi-
nation events is computationally intractable, this is not to say that the summation
is impossible. In this chapter we present a dynamic programming algorithm and
software, OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid se-
quences, available at https://github.com/zsethna/OLGA), that efficiently computes
Pgen not only for amino acid CDR3 sequences, but for in-frame nucleotide sequences as
well as sequences composed of coarse-grained/ambiguous amino acid alphabets and
motifs. Indeed, OLGA can sum over all possible recombination events of a mouse
TRB model in seconds (and can compute the Pgen of around 50 mouse TRB amino acid
sequences per CPU second). This work is detailed in the paper Sethna et al. [2018].
This algorithm does, however, require V(D)J generative models of the form of Eq. 2.3 or
2.4, and so loses the flexibility of being able to consider arbitrary model correlations.
The ability to compute Pgen at the amino acid and functional receptor level will
likely prove extremely useful, and we explore some example applications.

3.2 Dynamic Programming

OLGA is an algorithm that leverages ‘dynamic programming’ to avoid enumerating an
exponentially large number of states. Rather than give a formal definition of dynamic
programming, we show an example. Fortunately, physicists are already familiar with

one of the cleanest examples of dynamic programming, and one that truly shows the
computational effectiveness of such a technique: the discretized path integral. If we
have position x with N possible locations, discretized time t, and a Markov transition
matrix Rt(xi → xj) (which may depend on time), we can ask what is the probability
of starting at position x0 and ending at position xT at time T. If we define the
function

    Pt(x0, xi) = ∑_{x0, x(1), x(2), ..., x(t−1), xi} ∏_{t′=0}^{t−1} Rt′(x(t′) → x(t′+1))    (3.3)

we want PT(x0, xT). Now, one could list out all the paths that start at x0 and end
at xT, compute their weights, and sum. However, the number of paths increases
exponentially with t, so the computation time would explode exponentially as O(T ×
N^{T−1}) (T operations on each of N^{T−1} paths). Instead, it is computationally much
more efficient to sum up all the path weights to each position at each time step and
then update. In other words, we notice the recursion relation:

    Pt+1(x0, xi) = ∑_{x0, x(1), ..., x(t−1), x(t), xi} ∏_{t′=0}^{t} Rt′(x(t′) → x(t′+1))
                 = ∑_{x(t)} Rt(x(t) → xi) ∑_{x0, x(1), ..., x(t−1), x(t)} ∏_{t′=0}^{t−1} Rt′(x(t′) → x(t′+1))
                 = ∑_{x(t)} Rt(x(t) → xi) Pt(x0, x(t))    (3.4)

This can be written in vectorized notation by writing Pt(x0, x) as a column vector
with elements Pt(x0, xi):

    Pt+1(x0, x) = Rt Pt(x0, x)  ⇒  PT(x0, x) = RT−1 RT−2 · · · R1 R0 P0(x0, x)    (3.5)

where P0(x0, x) = I(x0). Thus, solving for PT(x0, xT) by using dynamic programming
requires only O(T × N^2) operations, a massive speedup from the O(T × N^{T−1}) op-
erations of the exhaustive enumeration of the paths. We have turned the summation
over all individual microstates (i.e. the paths) into a matrix expression with steps in
time. The algorithm, OLGA, that we developed to compute Pgen of nucleotide and
amino acid sequences from a generative model will analogously reduce the exponen-
tial blowup of exhaustive enumeration of recombination events down to polynomial
time by summing over matrix expressions based on positions in the sequence read.
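The recursion of Eqs. 3.4 and 3.5 is only a few lines of code. A minimal numpy sketch (the transition matrix used in the example is an arbitrary illustrative choice):

```python
import numpy as np

def path_probability(R_list, x0):
    """Distribution over end positions after T steps (Eq. 3.5), computed by
    the transfer-matrix recursion P_{t+1} = R_t P_t in O(T * N^2) operations
    rather than enumerating all N^(T-1) paths.

    R_list[t][j, i] = R_t(x_i -> x_j); columns should sum to 1.
    """
    N = R_list[0].shape[0]
    P = np.zeros(N)
    P[x0] = 1.0                  # P_0(x0, x) = I(x0)
    for R in R_list:
        P = R @ P                # one dynamic-programming update (Eq. 3.4)
    return P

# Example: 2 sites, a column-stochastic transition matrix, 5 time steps.
R = np.array([[0.9, 0.2],
              [0.1, 0.8]])
P_T = path_probability([R] * 5, x0=0)
assert np.isclose(P_T.sum(), 1.0)    # total probability is conserved
```

Summing the weights into each position at every step is exactly what turns the exponential path sum into a sequence of matrix-vector products.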

3.3 OLGA

We now describe how OLGA computes Eq. 2.5 without summing over exhaustively
enumerated recombination events, using dynamic programming. This algorithm re-
quires specific tailoring to the model structure, as the correlations have to be built
in explicitly, so the algorithm is slightly different for generative models of VDJ
(TCRβ/IGH, Eq. 2.3) and VJ (TCRα/IGL, Eq. 2.4) recombination. We will first
present the VDJ algorithm, and give the simpler algorithm for generative models of
VJ recombination afterwards.

Each recombination event implies an annotation of the amino acid CDR3 sequence,
(a1, . . . , aL), assigning a different origin to each nucleotide position (one of V, N1, D,
N2, or J, where N1 and N2 are the non-templated VD and DJ insertion segments,
respectively) that parses the sequence into 5 contiguous segments (see schematic in
Fig. 3.1).

The core principle of the method is to sum over the possible nucleotide locations of
the 4 boundaries between the 5 segments, x1, x2, x3, and x4, but in a recursive way
using matrix operations. This can be summarized in a compact matrix expression:

    Pgen(a1, . . . , aL) = ∑_{x1,x2,x3,x4} V_{x1} M^{x1}_{x2} ∑_D [ D(D)^{x2}_{x3} N^{x3}_{x4} J(D)^{x4} ]    (3.6)
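The structure of Eq. 3.6 can be illustrated with a toy scalar version, in which every vector/matrix entry is a plain number (in the real algorithm the weights for incomplete codons are small vectors/matrices over nucleotide identities; all values and gene names below are made up):

```python
import itertools

positions = range(7)      # toy boundary positions for a 6-nt CDR3
D_genes = ['D1', 'D2']

# Toy scalar weights standing in for V_{x1}, M^{x1}_{x2}, D(D)^{x2}_{x3},
# N^{x3}_{x4}, and J(D)^{x4} (illustrative constants, not a real model).
def V(x1):         return 0.10
def M(x1, x2):     return 0.05
def Dw(D, x2, x3): return 0.20
def Nw(x3, x4):    return 0.05
def J(D, x4):      return 0.30

def pgen_toy():
    """Nested boundary sum of Eq. 3.6 over x1 <= x2 <= x3 <= x4, with the
    sum over D germline choices done inside the boundary sum."""
    total = 0.0
    for x1, x2, x3, x4 in itertools.combinations_with_replacement(positions, 4):
        total += V(x1) * M(x1, x2) * sum(
            Dw(D, x2, x3) * Nw(x3, x4) * J(D, x4) for D in D_genes)
    return total
```

Note how the D-dependent factors are grouped so that the sum over D is performed inside the boundary sum, matching the bracket in Eq. 3.6.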


Figure 3.1: CDR3 indexing cartoon. Boxes correspond to nucleotides and are indexed
by integers. Each group of three boxes (identified by heavier boundary lines) corresponds
to an amino acid. The nucleotide positions x1, . . . , x4 identify the boundaries between
the different elements of the partition. The V, M, D(D), N, and J(D) matrices define
cumulated weights corresponding to each of the 5 elements.

However, to do this, we will need to define objects that accumulate the probabil-
ities of events to the left of a position x (i.e. up to x) and to the right of x (i.e. from
x + 1 on), which will require some notation.

3.3.1 Notation, 3′ and 5′ vectors

Suppose we have a CDR3 ‘amino acid’ sequence a = (a1, . . . , aL). By ‘amino acid’
sequence, we mean that each of the ‘amino acids’, ai, corresponds to some collection of
nucleotide triplets, or codons. We allow this mapping between ‘amino acids’, a, and
codons to be arbitrary at this point, and use the notation σ ∼ a if the codons in the
nucleotide sequence σ correspond to the codons allowed by the amino acid sequence
a. This allows us not only to recover the standard nucleotide translation map-
ping, π_nt2aa, when using the standard amino acid alphabet (e.g. TGTGCCAGCAGT
∼ π_nt2aa(TGTGCCAGCAGT) = CASS), but also provides a trivial extension to in-
clude in-frame nucleotide sequences (define an ‘amino acid’ symbol for each individual
codon) as well as coarser grained collections of amino acids. For example, all codons
that code for amino acids with a common chemical property, e.g. hydrophobicity or
charge, could be grouped into a single ‘amino acid’. In that formulation, (a1, . . . , aL)
would correspond to a sequence of symbols denoting that property. This could prove
to be very useful in constructing and assessing future coarse-grained models of recep-
tor-epitope affinities.

It will simplify the later expressions to be able to refer to a position x not only
by its nucleotide index, but by the corresponding amino acid index i, as well as by what
position x is in the codon reading from 5′ to 3′ (u) and what position x + 1 is in a
codon reading from 3′ to 5′ (u∗). This is shown graphically in Fig. 3.1. Explicitly, for
position xj:

    ij = ⌈xj/3⌉
    uj = xj − 3(ij − 1)
    u∗j = 3 − mod(uj, 3)    (3.7)
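Eq. 3.7 is easily checked in code; a one-function Python sketch:

```python
def codon_indices(x):
    """Amino acid index i_j, 5'->3' codon position u_j, and 3'->5' codon
    position u*_j for a (1-indexed) nucleotide position x_j, per Eq. 3.7."""
    i = -(-x // 3)            # ceil(x / 3)
    u = x - 3 * (i - 1)
    u_star = 3 - (u % 3)
    return i, u, u_star

# Position 11 (as in Fig. 3.1) is in amino acid 4, with u = 2 and u* = 1:
assert codon_indices(11) == (4, 2, 1)
# The only (u, u*) combinations are (1, 2), (2, 1), and (3, 3):
assert sorted({codon_indices(x)[1:] for x in range(1, 13)}) == [(1, 2), (2, 1), (3, 3)]
```

The three allowed (u, u∗) pairs are exactly the cases that determine the vector or scalar character of the 5′ and 3′ objects defined next.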

It is also crucial to introduce what we will call ‘5′ vectors’ and ‘3′ vectors’. A 5′ vector,
denoted with a subscript (e.g. Xx), accumulates weights for the sequence to the 5′ (left)
side of x (including the nucleotide position x), whereas a 3′ vector, denoted with a
superscript (e.g. Y^x), reflects the weights for the sequence to the 3′ (right) side of
x (excluding the nucleotide position x). Because we are dealing with amino acids,
which are encoded with codons made of 3 nucleotides, we need to keep track of
weights by the identity of the nucleotides at the beginning or the end of the codon.
This requires the definition of a 5′ vector (3′ vector) to depend on the value of u (u∗).

For the first nucleotide position in a codon, u = 1 (u∗ = 1), Xx (Y^x) must be
interpreted as a row (column) vector of 4 numbers indexed by σ = A, T, G, or C,
corresponding to the cumulated probability weight from the 5′/left (3′/right) side
that the nucleotide at position x (x + 1) takes value σ. If u = 2 (u∗ = 2), then Xx (Y^x)
is also a row (column) vector of 4 numbers indexed by nucleotide σ = A, T, G, or
C, but with a different interpretation: it corresponds to the cumulated probability
up to position x from the 5′/left side (from x + 1 from the 3′/right side), with the additional
constraint that the nucleotide at the last position in the codon, x + 1 (x), can take
value σ (the value is 0 otherwise). Lastly, if x (x + 1) is the last position in a codon,
i.e. u = 3 (u∗ = 3), the cumulative sequence terminates at the end of a codon and
we do not keep nucleotide information, so Xx (Y^x) is a scalar.

If we have a 5′ vector Xx that contains the accumulated weights up to position
x, and a 3′ vector Y^x that contains the weights from position x + 1 onwards, we will
want to ‘glue’ these sequence contributions together to get the total probability of
the sequence. This is indicated by the expression¹ XxY^x, which has a very convenient
structure. As the possible combinations of (u, u∗) are (1, 2), (2, 1), or (3, 3), we see that
the matrix multiplication XxY^x is one of two situations. First, if u = u∗ = 3, XxY^x is
just the scalar multiplication of the aggregate weights for the 5′ and 3′ sides. If (u, u∗)
= (1, 2) or (2, 1), then XxY^x is the dot product between a vector of weights indexed
by the nucleotides needed to complete the codon and the vector of weights indexed by the
completing nucleotide on the other side. In either case, the result is the total aggregate
weight of the sequence conditioned on the partition x, accurately reflecting the weight
of ‘gluing’ the possible sequences from the 5′/left side to the 3′/right side.

¹Please note that the resemblance of the expression XxY^x to a contraction over the position x in
Einstein notation should not be misinterpreted. The ‘contraction’ is over possible nucleotide identity
indices, not over the position index x.

This notion of sequence gluing also allows for the definition and interpretation of
matrices (e.g. R^x_y) with both 5′ and 3′ indices. A matrix R^x_y can be thought of as

‘gluing’ a new sequence segment (x to y) onto what an existing 5′ or 3′ vector describes.
For example:

    Xx R^x_y = Hy
    R^x_y Y^y = G^x    (3.8)

The matrix R^x_y can map from any value of u to any other (or any value of
u∗ to any other), and so has 9 possible combinations/interpretations based on the u
mapping, and can be a 4×4, 4×1, 1×4, or 1×1 matrix as a result.
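A minimal numpy sketch of this gluing, with all weights illustrative numbers: for (u, u∗) = (1, 2) or (2, 1) the two objects are length-4 vectors indexed by the nucleotide completing the codon and gluing is a dot product, while a matrix with both a 5′ and a 3′ index glues a new segment onto an existing vector as in Eq. 3.8:

```python
import numpy as np

X_x = np.array([0.1, 0.0, 0.2, 0.1])   # 5' vector, indexed by sigma = A, T, G, C
Y_x = np.array([0.5, 0.3, 0.0, 0.2])   # 3' vector, same indexing

# 'Gluing' across the codon boundary: a contraction over nucleotide
# identity, not over the position index x.
total_weight = X_x @ Y_x

# Gluing a new segment (x to y) with a matrix R^x_y (toy 4x4 weights):
R = np.full((4, 4), 0.25)
H_y = X_x @ R        # a new 5' vector at y   (X_x R^x_y = H_y)
G_x = R @ Y_x        # a new 3' vector at x   (R^x_y Y^y = G^x)
```

In the scalar case u = u∗ = 3 the same expressions reduce to ordinary multiplication of the aggregate weights.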

3.3.2 VDJ recombination: V, M, D, N, and J

Eq. 3.6 shows the summation over positions of a matrix expression, with the vec-
tors/matrices corresponding to the different VDJ contributions. The 5′ vector V_{x1} corre-
sponds to the cumulated probability of the V segment finishing at position x1; matrix
M^{x1}_{x2} is the probability of the VD insertion extending from x1 + 1 to x2; N^{x3}_{x4} is the
same for the DJ insertions; matrix D(D)^{x2}_{x3} corresponds to the weights of the D segment
extending from x2 + 1 to x3, conditioned on the D germline choice being D; the 3′ vector
J(D)^{x4} gives the weight of J segments starting at position x4 + 1, conditioned on the
D germline being D. This D dependency is necessary to account for the dependence
between the D and J germline segment choices [Murugan et al., 2012]. All the defined
vectors and matrices depend on the amino acid sequence (a1, . . . , aL), but we leave
this dependency implicit to avoid making the notation too cumbersome.

The entries of the vectors/matrices corresponding to the germline segments, V,
D(D), and J(D), can be calculated by simply summing over the probabilities of the
different germline segments compatible with the sequence (a1, . . . , aL), with conditions
on deletions to achieve the required segment length. The ∼ sign is generalized to
incomplete codons so that it returns a true value if there exists a codon completion
that agrees with the sequence a.

V contribution: V_{x1}

The 5′ vector V_{x1} aggregates the weights (PV and PdelV) from sequences originating
from the templated V genes, from the start of the CDR3 region up to position x1. As a
5′ vector, V_{x1} can be a 1×1 or 1×4 matrix depending on u1. sV is the sequence of the V
germline gene (read 5′ to 3′) from the conserved residue (generally the cysteine C) to
the end of the gene, plus the maximum number of reverse complementary palindromic
insertions appended to the 3′ end. lV is the length of sV.

    V_{x1}(σ) = ∑_V PV(V) PdelV(lV − x1|V) I(sV_{x1} = σ) I(sV_{1:x1} ∼ a_{1:i1})    if u1 = 1,
    V_{x1}(σ) = ∑_V PV(V) PdelV(lV − x1|V) I((sV_{1:x1}, σ) ∼ a_{1:i1})    if u1 = 2,
    V_{x1} = ∑_V PV(V) PdelV(lV − x1|V) I(sV_{1:x1} ∼ a_{1:i1})    if u1 = 3.    (3.9)

N1 contribution: M^{x1}_{x2}

This matrix includes the weights (PinsVD, p0, and ∏ SVD(mi|mi−1)) from the glu