
  • Probability, Entropy, and Adaptive

    Immune System Repertoires

    Zachary Michael Sethna

    A Dissertation

    Presented to the Faculty

    of Princeton University

    in Candidacy for the Degree

    of Doctor of Philosophy

    Recommended for Acceptance

    by the Department of

    Physics

    Adviser: Professor Curtis Callan

    September 2018

  • © Copyright by Zachary Michael Sethna, 2018.

    All rights reserved.

  • Abstract

    The adaptive immune system, composed of white blood cells called lymphocytes (B

    and T cells) that circulate in the lymph and blood, is a precision tool that tags

    and removes foreign peptides. Such peptides, also called antigens or epitopes, are

    identified by a specific binding to elements of a library or repertoire of unique proteins

    called receptors (e.g. antibodies or T cell receptors). A repertoire must be large and

    diverse enough so that at least one receptor will be able to recognize any pathogen

    epitope the organism is likely to encounter. This diversity is achieved by stochastic

    rearrangement of the germline DNA to create novel complementarity determining

    region 3 (CDR3) sequences in a process called V(D)J recombination.

    In this thesis we utilize previously developed generative models of V(D)J recombi-

    nation events, and infer the model parameters from large datasets of DNA sequences.

    The generation probability (Pgen) of a nucleotide or amino acid CDR3 is the sum

    of all model probabilities of V(D)J recombination events that generate the sequence.

    While previously it was only feasible to compute Pgen of nucleotide sequences, we

    introduce a novel dynamic programming algorithm that efficiently computes Pgen of

    amino acid sequences. We use this Pgen for several applications. First we examine

    how the diversity of a repertoire, characterized by the model entropy, scales with the

    number of insertions in the V(D)J process. This is used to describe the maturation

    of the T cell repertoire of mice from embryos to young adults. Next, we introduce a

    statistical model of hypermutation in B cells and infer the parameters from a human

    repertoire, providing a principled quantification of the biases in hypermutation rates.

    Lastly, we examine the statistics of the receptors shared amongst a cohort of more

    than 600 individual humans and show that the statistics and identities of so-called

    ‘public’ sequences are determined directly from Pgen.

    We highlight possible clinical applications and attempt to place this work in the

    context of a full theory of the adaptive immune system.


  • Acknowledgements

    I don’t have the words to express my thanks to my advisor Curt Callan. Curt has been

    a consummate advisor, providing support, advice, direction, and countless opportu-

    nities. I came into grad school with somewhat scattered interests, yet Curt showed

    me, by example, how to find a path forward through dedication, collaboration, and

    boundless curiosity. Curt has always been willing to entertain my crazy, inchoate

    ideas, and with only a few incisive questions give them shape (though it often takes

    me days to catch up and realize this). Curt, thank you for all of your time and effort,

    thank you for being my mentor. Thank you.

    I also thank my collaborators on both sides of the pond. I have learned so much

    from the insights and clarity of Aleksandra Walczak and Thierry Mora. Their ability

    to parse the underlying science, translate it into math, and then communicate this

    effectively is something I hope to one day be able to emulate. Yuval Elhanati has

    made my time here much more productive and enjoyable. Not only did Yuval provide

    crucial assistance with every step of the research, but he provided a sympathetic ear

    and was willing to talk about whatever the topic of the day was. Quentin Marcou is

    not only a wonderful collaborator, but a welcoming friend.

    Thanks to Ben Greenbaum and Vinod Balachandran for great discussions, data,

    and continuing collaboration.

    I would also like to thank Anand Murugan, whom I have never met, but whose

    code I’ve spent uncounted hours working with.

    Biophysics

    The professors in biophysics have been hugely influential on my perspective on science

    and life, and I would like to thank them. I must start by thanking Bill Bialek, and not

    only for being on my committee. His vision, instant understanding of any topic, and

    personality have made his conversations something to be sought after. I would like to


  • thank Bob Austin, not only for being a reader of this thesis, but for the many crazy

    conversations and a shared appreciation of scotch. I also want to thank Josh Shaevitz

    for efficiently cutting to the bone of any issue, Thomas Gregor for teaching me much

    during my time as a TA for ISC, and Ned Wingreen for somehow always knowing

    everything about any biological system. You all have made Princeton biophysics not

    only a superb place to do research, but a friendly and welcoming environment.

    The biophysics community also has had several postdocs and graduate students

    over the years that I would like to thank for teaching me much and making my

    time here so much fun. Andreas Mayer for great discussions on immunology. I’ve

    immensely enjoyed speculating about Information Geometry with Ben Machta. I’d

    also like to thank Leenoy Mushulam, Henry Mattingly, Dima Krotov, Ashley Linder,

    Ugne Klibaite, Ben Bratton, Gordon Berman, Michael Tikhonov, Xiaowen Chen,

    Guannan Liu, Mochi Liu, Alex Song, Sagar Setru, Mark Ioffe, and Jeff Nyugen.

    Physics

    The greater physics community has made Jadwin Hall a second home for these years.

    I’d like to thank Herman Verlinde for all of his work in organizing the grad program.

    A special thanks to Suzanne Staggs for being on my committee. Thanks to Jessica

    Heslin, Barbara Mooring, and Kate Brosowksy for the invaluable administrative as-

    sistance – without you we grad students would be helpless. Sumit Saluja has been a

    lifesaver with helping me get my code running on the server. Also, a shoutout to the

    softball team – especially the impressive Ed Groth.

    Friends

    Naturally, I must thank my fellow grad students who’ve been through the wringer with

    me and yet made my time here enjoyable. There are too many people to name, so

    undoubtedly I have accidentally forgotten some people: I must beg your forgiveness! I’d


  • like to thank Aitor Lewkowycz for the science, fun, keen insight, and advice. Aaron

    Levy for the innumerable discussions about life, politics, and science. Will Coulton

    for always being a good sport and a positive influence in every scenario. Josh Hard-

    enbrook for always calling me out when he thinks I am wrong. Dave Zajac for helping

    me ‘study’ for prelims with uncounted games of pool. Christian Jepsen for his impec-

    cable taste. Joaquin Turiaci and Debayan Mitra for the many fun nights of beer and

    foosball. Shai Chester for the fun and ridiculous stories, but NOT for any ‘help’ in my

    work. Farzan Beroz for the many philosophical and science discussions. Lauren

    McGough for the many discussions about stat mech, information theory, and life.

    Kenan Diab also understands the important things in a grad student’s life: softball,

    starcraft, MTG, and beer. Ilya Belopolski for doing many prelim problems together

    while DJ’ing with some select music. DJ Strouse for our annual run-ins at APS and

    the many good conversations about information theory and machine learning. Bin Xu

    for his always cheerful demeanor and great scientific discussions. Mallika Randeria

    for her friendship and advice. Tom Hazard, softball captain extraordinaire. Many

    thanks to Shawn Westerdale, Anne Gambrel, Guangyong Koh, Ed Young, Matt Her-

    nandez, Lee Gunderson, Sarthak Parikh, Grisha Tarnoplskiy, Vlad Kirilin, Matteo

    Ippolti, Luca Iliesiu, and Trithep Devakul. Thanks to everyone.

    Family

    Lastly, I must thank the whole of my family for being so supportive of me since

    before I can remember. I come from a unique family, filled with medical doctors and

    physicists, such that when I go home I am frequently grilled on my research. Coming

    from such a background, it is no surprise that I’ve effectively split the difference

    between physics and medicine in this thesis.

    It would be hard to overstate the influence my uncle, Jim Sethna, has had on

    me: I’ve quite literally followed in his footsteps in getting a PhD in physics from


  • Princeton. Thank you Uncle Jim for all of your advice, support, and even academic

    mentorship. I cannot tell you how much it means to me.

    My grandparents, Patarasp Sethna, Shirley Sethna, Marjory Sethna, Joshua

    Lynfield, and Yelva Lynfield, have always been examples to me, both in their

    achievements and morality. Sadly, not all of my grandparents will see me graduate;

    however, I am confident that all of them would be proud of and approve of my time here.

    I also thank my sisters, Julia and Sharon Sethna, for always providing a ready

    distraction when needed.

    Finally, I would like to thank my parents Ruth Lynfield and Michael Sethna, with-

    out whom not only would this thesis not have been possible but I never would have

    been in the position in the first place. Your love, support, direction, and parenting

    have got me to this point. Mom, your talents and commitment to helping people are

    inspiring. Your work in infectious diseases and epidemiology has clearly colored

    interests. And Dad, your elevation of science and logic above all else has shaped the

    way I think. You have frequently ‘joked’, that studying math, physics, and science is

    ‘holy work’ – a sentiment I certainly share. Thank you both for everything.


  • “The idea is like grass. It craves light, likes crowds, thrives

    on crossbreeding, grows better for being stepped on.”

    - Ursula K. Le Guin, The Dispossessed


  • Contents

    Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

    Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

    List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

    List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv

    1 Introduction 1

    1.1 Adaptive immune system . . . . . . . . . . . . . . . . . . . . . . . . . 1

    1.1.1 B cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.1.2 T cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    1.1.3 The DNA problem . . . . . . . . . . . . . . . . . . . . . . . . 4

    1.2 V(D)J recombination . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

    1.3 Repertoire sequencing and analysis . . . . . . . . . . . . . . . . . . . 7

    1.4 Organization of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    2 Generative Model 9

    2.1 V(D)J recombination models . . . . . . . . . . . . . . . . . . . . . . . 9

    2.1.1 VDJ generative model . . . . . . . . . . . . . . . . . . . . . . 11

    2.1.2 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . 11

    2.1.3 VJ generative model . . . . . . . . . . . . . . . . . . . . . . . 12

    2.1.4 Pgen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.2 Model Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13


  • 2.2.1 Entropy of Precomb . . . . . . . . . . . . . . . . . . . . . . . . 14

    2.2.2 Entropy of Pgen . . . . . . . . . . . . . . . . . . . . . . . . . . 18

    2.2.3 The Pgen distribution . . . . . . . . . . . . . . . . . . . . . . 18

    2.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

    2.3.1 Errors and Mismatches . . . . . . . . . . . . . . . . . . . . . 21

    2.3.2 Expectation Maximization algorithm . . . . . . . . . . . . . . 24

    2.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 27

    3 V(D)J recombination to sequences: Precomb → Pgen 28

    3.0.1 Probability Spaces (mathematical aside) . . . . . . . . . . . . 29

    3.1 Too many states! The free energy problem . . . . . . . . . . . . . . . 29

    3.2 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    3.3 OLGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

    3.3.1 Notation, 3′ and 5′ vectors . . . . . . . . . . . . . . . . . . . . 34

    3.3.2 VDJ recombination: V, M, D, N, and J . . . . . . . . . . . . 37

    3.3.3 VJ recombination: V, M, and J . . . . . . . . . . . . . . . . . 43

    3.3.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

    3.3.5 Comparison to existing methods . . . . . . . . . . . . . . . . . 46

    3.4 Some applications of OLGA computed Pgen . . . . . . . . . . . . . . . 48

    3.4.1 Pgen distributions and diversity . . . . . . . . . . . . . . . . . 48

    3.4.2 Generation probability of epitope-specific TCRs . . . . . . . . 49

    3.4.3 Predicting the frequencies . . . . . . . . . . . . . . . . . . . . 51

    3.4.4 Generation probability of sequence motifs . . . . . . . . . . . 53

    4 The repertoires ‘Of Mice and Men’ 55

    4.1 Of Mice... (mouse TRB) . . . . . . . . . . . . . . . . . . . . . . . . . 55

    4.1.1 Generative model . . . . . . . . . . . . . . . . . . . . . . . . . 57

    4.1.2 Changing insertion profile → Increasing diversity . . . . . . . 58


  • 4.1.3 Mixture mode . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    4.1.4 Toy model of mouse repertoire maturation . . . . . . . . . . . 64

    4.1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

    4.2 ...and Men (human IGH) . . . . . . . . . . . . . . . . . . . . . . . . . 67

    4.2.1 Analysis approach . . . . . . . . . . . . . . . . . . . . . . . . 67

    4.2.2 Generative Model, Allele identification . . . . . . . . . . . . . 68

    4.2.3 Hypermutation . . . . . . . . . . . . . . . . . . . . . . . . . . 70

    4.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

    5 Sharing 74

    5.1 The Sharing Distribution . . . . . . . . . . . . . . . . . . . . . . . . 76

    5.1.1 Analytical calculation of the sharing distribution from the Pgen distribution . . . . . . . . . . . . . . 77

    5.1.2 Sharing modified by selection . . . . . . . . . . . . . . . . . . 81

    5.2 Extrapolation to full repertoires and beyond . . . . . . . . . . . . . . 83

    5.3 Predicting the publicness of sequences . . . . . . . . . . . . . . . . . 86

    5.3.1 Sharing and TCR generation probability . . . . . . . . . . . . 86

    5.3.2 PUBLIC: Classifier of public vs. private TCRs based on generation probability . . . . . . . . . . . . . 89

    5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

    6 Conclusion 93

    A Information Theory 96

    A.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

    A.2 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

    A.3 Kullback-Leibler divergence . . . . . . . . . . . . . . . . . . . . . . . 99

    B Probabilistic vs Deterministic inference 100


  • C Proof of Expectation Maximization algorithm 103

    D Mouse Appendix 105

    D.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

    D.2 Model parameters and validation . . . . . . . . . . . . . . . . . . . . 106

    E Human B cells Appendix 113

    E.1 Repertoire entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

    E.2 Inference of alleles and their chromosome distribution . . . . . . . . . 114

    E.3 Model parameters and validation . . . . . . . . . . . . . . . . . . . . 116

    F Sharing Appendix 122

    F.1 Sampling effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

    F.2 Monte Carlo simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 124

    F.2.1 Sequence data . . . . . . . . . . . . . . . . . . . . . . . . . . 124

    Bibliography 126


  • List of Tables

    3.1 Distance metrics for OLGA VDJ validation . . . . . . . . . . . . . . 45

    3.2 Time performance and scaling of possible methods. . . . . . . . . . . 47

    3.3 P^func_gen of TCR motifs . . . . . . . . . . . . . . . . . . . . . 54

    3.4 Pgen of invariant T cell (iNKT and MAIT cells) TRA motifs . . . . . 54

    4.1 Breakdown of B cell sequences and models . . . . . . . . . . . . . . . 67

    D.1 Mouse dataset summary . . . . . . . . . . . . . . . . . . . . . . . . . 106

    E.1 Heterozygous V allele information (Individual A) . . . . . . . . . . . 116

    E.2 Heterozygous D and J allele information (Individual A) . . . . . . . . 116

    F.1 Mice dataset sample sizes . . . . . . . . . . . . . . . . . . . . . . . . 125


  • List of Figures

    1.1 Schematic of VDJ recombination . . . . . . . . . . . . . . . . . . . . 5

    2.1 Distribution functions: P (−E = log Pgen) . . . . . . . . . . . . . . . . 19

    3.1 CDR3 indexing cartoon . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    3.2 Validation of OLGA VDJ algorithm . . . . . . . . . . . . . . . . . . . 44

    3.3 Validation of OLGA VJ algorithm . . . . . . . . . . . . . . . . . . . . 46

    3.4 Precomb and Pgen distributions . . . . . . . . . . . . . . . . . . . . . . 48

    3.5 Pgen of human TRB sequences for hepatitis C and influenza A epitopes. 50

    3.6 Pgen distributions for virus specific TRB sequences . . . . . . . . . . . 51

    3.7 Scatter of mean occurrence frequencies vs Pgen . . . . . . . . . . . . . 52

    4.1 Age-dependent insertion length distributions . . . . . . . . . . . . . . 56

    4.2 Sequence entropy for thymic repertoires . . . . . . . . . . . . . . . . . 59

    4.3 Repertoire maturation schematic . . . . . . . . . . . . . . . . . . . . 61

    4.4 Mean effective TdT level ᾱ and entropy vs age . . . . . . . . . . . . . 63

    4.5 Amount of mixing: variance of α vs age . . . . . . . . . . . . . . . . . 64

    4.6 Allele organization on chromosomes . . . . . . . . . . . . . . . . . . . 69

    4.7 Sequence dependence of somatic hypermutations . . . . . . . . . . . . 71

    5.1 Pipeline for computing the distribution of shared sequences . . . . . . 76

    5.2 Sharing distribution for 14 mice . . . . . . . . . . . . . . . . . . . . . 78

    5.3 Sharing distribution for 658 humans . . . . . . . . . . . . . . . . . . . 79


  • 5.4 Number of unique CDR3s in pooled repertoires . . . . . . . . . . . . 84

    5.5 Fraction of total repertoire composed of ‘public’ sequences . . . . . . 85

    5.6 Mouse Pgen distributions by sharing number . . . . . . . . . . . . . . 87

    5.7 Human Pgen distributions by sharing number . . . . . . . . . . . . . . 88

    5.8 PUBLIC classifier schematic . . . . . . . . . . . . . . . . . . . . . . . 89

    5.9 Performance of the PUBLIC classifier . . . . . . . . . . . . . . . . . . 90

    B.1 Probabilistic vs Deterministic marginal distributions . . . . . . . . . . 101

    D.1 Gene usages by mouse age . . . . . . . . . . . . . . . . . . . . . . . . 107

    D.2 Deletion profiles by mouse age . . . . . . . . . . . . . . . . . . . . . . 108

    D.3 Frequencies of non-templated insertions . . . . . . . . . . . . . . . . . 109

    D.4 Mouse model MI validation . . . . . . . . . . . . . . . . . . . . . . . 110

    D.5 Variation of V and J gene usage across biological replicates . . . . . . 111

    D.6 Variation of deletion profiles across biological replicates . . . . . . . . 112

    E.1 Entropy of B cell model . . . . . . . . . . . . . . . . . . . . . . . . . 113

    E.2 B cell gene usages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

    E.3 B cell deletion profiles . . . . . . . . . . . . . . . . . . . . . . . . . . 118

    E.4 B cell non-templated nucleotide frequencies . . . . . . . . . . . . . . . 119

    E.5 PinsVD and PinsDJ over replicates . . . . . . . . . . . . . . . . . . . . . 120

    E.6 B cell model MI validation . . . . . . . . . . . . . . . . . . . . . . . . 120

    E.7 B cell model insertion Markov model validation . . . . . . . . . . . . 121

    F.1 Downsampling in sharing analyses . . . . . . . . . . . . . . . . . . . . 123


  • Chapter 1

    Introduction

    1.1 Adaptive immune system

    The adaptive immune system evolved to provide animals with a precision tool to

    identify and remove anything ‘foreign’ to the animal. This is done by having a

    large library, or repertoire, of proteins called receptors that bind specifically to some

    small fragment of a protein called an epitope or antigen. This binding or affinity

    is determined by physical properties such as electrostatics, hydrophobicity, van der

    Waals forces, steric constraints, etc. By specificity we mean that this receptor will only

    bind to a very limited number of epitopes and have only limited affinity for other

    epitopes1. Crucially, this specificity allows the adaptive immune system to weed

    out any receptors that recognize self peptides, which would trigger an autoimmune

    response. However, this repertoire must be large and diverse enough to be able to

    identify any foreign peptide to ensure that microbes and cancerous cells are quickly

    identified and dealt with. In this thesis we will characterize just how staggeringly

    diverse these adaptive immune system repertoires are.

    In order to generate and regulate these receptors, the adaptive immune system

    has a special class of cells called lymphocytes, of which there are two main subtypes:

    1 Frequently the amount of ‘cross-reactivity’ is assumed to be negligible.


  • B cells and T cells. Each lymphocyte has a single receptor, of which it expresses many

    copies, in order to recognize epitopes. These lymphocyte receptors are protein com-

    plexes composed of two amino acid chains, a larger one and a smaller one. Each chain

    has largely conserved portions (in order to standardize the way the adaptive immune

    system uses these receptors) along with highly variable regions that provide the spe-

    cific binding to epitopes. The most highly variable region, and the one that largely

    determines the affinity of a receptor to an epitope, is called the complementarity-

    determining region 3 or CDR32. We will often be a little sloppy and refer to the

    ‘receptor’ and the CDR3 of a single chain interchangeably. Once a ‘naive’ lympho-

    cyte is activated by specifically binding to an epitope, it will proliferate and some of

    these cells will be archived as ‘memory’ cells to quickly reactivate and eliminate the

    antigen if the organism is ever exposed to it again.

    1.1.1 B cells

    B cells are lymphocytes that produce, and secrete, receptors called antibodies. Anti-

    bodies are composed of a heavy chain (IGH) and a light chain (IGL). These receptors

    can either be free in the plasma or expressed on the membrane of B cells3. These

    antibodies bind specifically to antigens. An antibody bound to an antigen serves as a

    tag for the rest of the immune system to attack the antigen. Furthermore, antibodies

    can directly neutralize microbes by binding to surface proteins and ‘gumming up’ their

    operation. Foreign peptides in solution can also be made to precipitate by antibodies

    coagulating many of the peptides together.

    2 There are two other variable loops, CDR1 and CDR2, that are determined by the V germline templates. As a result the variation of these loops is limited. While the CDR1 and CDR2 loops are important biologically, particularly for major histocompatibility complex (MHC) recognition by T cells, we focus exclusively on the CDR3 region in this thesis. Unlike the CDR1 and CDR2 loops, the CDR3 region spans the region of the receptor sequence where the DNA editing process called V(D)J recombination occurs (section 1.2). We define the boundaries of the CDR3 region to be the conserved amino acid residues cysteine (C) on the 5′ end and a phenylalanine (F) or tryptophan (W) on the 3′ end. These conserved residues are important to ensure the receptor folds and works properly.

    3 If expressed on a membrane, an antibody is frequently referred to as a B cell receptor (BCR). We are sometimes sloppy and will refer to antibodies in general as BCRs to parallel TCRs.


  • The amazing specificity of antibodies is generated through a process called hy-

    permutation [Teng and Papavasiliou, 2007]. Following the successful recognition of

    an antigen, a B cell proliferates and its receptor sequence undergoes random point

    mutations. These cells are then selected for affinity to the epitope. The result is an

    evolutionary process within a single individual, producing receptors with dramatically

    increased affinity to the epitope. We will present a quantitative model of hypermu-

    tation in chapter 4.

    1.1.2 T cells

    Although antibodies bind directly to epitopes in solution, T cells have their epitope

    recognition mediated by other cells. In animals with adaptive immune systems, cells

    display a protein complex called major histocompatibility complex (MHC) on their

    membrane. This protein complex can then be ‘loaded up’ with a peptide fragment by

    the cell, and a T cell receptor (TCR) can then recognize the peptide - MHC complex

    (pMHC)4. Cells load up the MHC complex with chopped up peptides internal to the

    cell, giving the T cell a snapshot of the current protein synthesis of the cell. This

    provides an excellent mechanism for the T cell to be able to identify if a cell was

    infected by a virus or has become cancerous. Also, if a cell is infected by a virus it is

    possible that peptides internal to the viral capsid (and thus not an accessible epitope

    to an antibody/BCR) could be loaded up into pMHC, providing additional epitopes

    for the adaptive immune system to tag.

    Similar to antibodies, TCRs are composed of two chains, an α chain (TRA) and

    a β chain (TRB). Ideally we would analyze the full receptor composed of TRA-TRB

    pairs, however it is hard experimentally to have high throughput sequencing that

    4 This is the interaction between cytotoxic or CD8+ T cells and the MHC I complex. There is an additional MHC complex (MHC II) that is expressed by a class of cells called antigen presenting cells (APC) that actively uptake and present peptides. There are also several other classes of T cells, which perform a variety of roles. For the purposes of this thesis we focus on CD8+ T cells and the MHC I complex.


  • accurately pairs TRA and TRB chains. Instead, many sequencing analyses focus on

    only one chain. For much of this thesis we will focus on TRBs in both humans and

    mice as the TRB chain is not only much more diverse than the TRA chain, it is also

    the chain that determines much of the receptor-epitope specificity.

    1.1.3 The DNA problem

    The massive diversity of receptors needed for a functioning repertoire poses a very

    interesting problem. These receptors are proteins, coded for by DNA sequences. Each

    unique receptor demands a unique DNA sequence. The number of unique receptors

    in a repertoire utterly dwarfs the number of coding genes in a genome. For example,

    a human TRB repertoire might have 10^8–10^10 unique receptors, whereas the number

    of coding genes in the human genome is estimated to be of the order of 10^4–10^5.

    Clearly the human genome cannot directly store the DNA sequences of every receptor

    in a repertoire. This prompts the question of how such a diversity of receptors can be

    generated from limited DNA.
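    The mismatch is stark enough that one line of arithmetic makes the point; this is a sketch using only the order-of-magnitude figures quoted above:

```python
# Order-of-magnitude figures quoted in the text above.
repertoire_low, repertoire_high = 1e8, 1e10   # unique receptors in a human TRB repertoire
genes_low, genes_high = 1e4, 1e5              # coding genes in the human genome

# Even the most conservative comparison (smallest repertoire estimate against
# the largest gene-count estimate) leaves a gap of three orders of magnitude,
# so the genome cannot devote one gene to each receptor.
conservative_gap = repertoire_low / genes_high
extreme_gap = repertoire_high / genes_low
print(conservative_gap)  # 1000.0
print(extreme_gap)       # 1000000.0
```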

    1.2 V(D)J recombination

    The solution to the apparent conundrum laid out in the previous section is a process

    called V(D)J recombination wherein the actual DNA sequences of developing B cells

    and T cells get recombined, generating novel genes that translate to unique CDR3

    amino acid sequences. While highly regulated, this process allows the adaptive im-

    mune repertoire to generate the necessary diversity to specifically recognize foreign

    antigens/epitopes. This discovery led to Susumu Tonegawa’s 1987 Nobel Prize in

    Medicine [Hozumi and Tonegawa, 1976]. The rest of the thesis will involve proba-

    bilistically modeling this V(D)J recombination.


  • Figure 1.1: Schematic of VDJ recombination

    [Figure not reproduced: diagram of the chromosomal arrangement of V, D, and J genes with their RSS regions, and the successive RAG cutting and TdT insertion steps.]

    Simplification of the stages of VDJ recombination for TRB. Shows the arrangement of example V, D, and J genes on the chromosome, along with the RSS regions (orange stripes). For the TRB gene locus, the D and J genes are arranged as above, which implies the topological constraint that D2 and J1-∗ genes are never jointly used. Non-templated nucleotides, indicated by N1 and N2, are inserted at the VD and DJ junctions by the TdT complex.

    V(D)J recombination has become an extremely well studied process over the past

    40 years and the critical enzymes have been identified and studied. Of particular

    interest to this thesis will be the enzymes recombination activating genes (RAG) 1

    and 2, and terminal deoxynucleotidyl transferase (TdT), both of which are uniquely

    expressed in lymphocytes. VDJ recombination leads to the generation of sequences

    that produce IGH and TRB chains, while VJ recombination produces IGL and TRA

    chains.

    Before recombination, the germline chromosome has two or three types of genetic

    templates: variable (V), diversity (D), and joining (J). For each type of template,

    there are multiple genes (e.g. there are 35 TRBV genes in mice) which are identi-


  • fied by immediately adjacent, highly stereotyped, 7-mer nucleotide sequences called

    recombination signal sequences (RSS). During VDJ recombination5, RAG enzymes

    bind specifically to the RSS of a J gene and of a D gene and make an incision that cuts

    out the intervening DNA. This cutting of the DNA can be messy, possibly deleting

    away parts of the D and J genes, or leaving some single stranded DNA hanging, which

    will get repaired by inserting in reverse complementary palindromic nucleotides. The

    D and J genes are then spliced together, possibly with non-templated nucleotide in-

    sertions from the TdT enzyme. A similar slicing and splicing process then happens

    at the V-D junction.

  • To strip away the biology, and make this clear on an abstract level: VDJ recombination

    acts by choosing a particular gene (a string of nucleotides) for each of the V, D,

    and J segments, deleting away some of the nucleotides of those genes (or inserting

    reverse palindromic nucleotides), and then inserting random nucleotides at the VD

    and DJ junctions as the sequence is spliced together to read (from 5 ′ to 3 ′ ) VDJ.

    This provides a new DNA sequence, where all of the edits (splicing, deleting, and

    inserting) correspond to the CDR3 region of the receptor.
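This abstract string-operation picture can be sketched in a few lines of code. The toy mini "genes" and the uniform random choices below are invented for illustration; the real process draws gene, deletion, and insertion choices from the inferred distributions described in Chapter 2, and real gene segments are much longer.

```python
import random

# Toy sketch of VDJ recombination as string operations.
# Gene segments and uniform choices are made up for illustration.
V_GENES = ["CAGTGCTACC", "CAGAGTTTGC"]
D_GENES = ["GGGACAGGGG"]
J_GENES = ["AACACTGAAG", "TTCGGGCCAG"]
NUCS = "ACGT"

def trim3(g, n):
    """Delete n nucleotides from the 3' end of a segment."""
    return g[:len(g) - n]

def trim5(g, n):
    """Delete n nucleotides from the 5' end of a segment."""
    return g[n:]

def recombine(rng):
    """Choose V, D, J, trim the junction-facing ends, insert random nucleotides."""
    v = trim3(rng.choice(V_GENES), rng.randint(0, 3))
    d = trim3(trim5(rng.choice(D_GENES), rng.randint(0, 2)), rng.randint(0, 2))
    j = trim5(rng.choice(J_GENES), rng.randint(0, 3))
    n1 = "".join(rng.choice(NUCS) for _ in range(rng.randint(0, 4)))  # VD insertions
    n2 = "".join(rng.choice(NUCS) for _ in range(rng.randint(0, 4)))  # DJ insertions
    return v + n1 + d + n2 + j  # read 5' to 3': V, N1, D, N2, J

rng = random.Random(0)
reads = [recombine(rng) for _ in range(5)]
print(reads)
```

Note that, exactly as in the text, nothing in this procedure checks reading frame or stop codons: a random fraction of the outputs would be nonproductive.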

    This V(D)J recombination process has no guarantee of success, or of producing

    a DNA sequence that can translate to a functional protein. As there are random

numbers of deletions and insertions, the DNA sequence may contain frame shifts or stop codons. If a V(D)J recombination event on a chromosome

    leads to a nonproductive sequence, the cell may try again on the second chromosome.

    If this second recombination leads to a functional receptor, the cell will have two

    rearranged chromosomes: one functional and expressed, and the nonfunctional one

    silenced by allelic exclusion. This fortunate quirk will prove crucial later in this thesis.

5 In VJ recombination, there is no D gene, and the V and J genes are directly spliced together.

Once a T cell or B cell has a functional receptor, some quality control occurs. The cell undergoes both positive selection (e.g. checking that a TCR interacts well with MHC) and negative selection (i.e. removing cells with high affinity to self epitopes). This somatic selection process is crucial both to ensure useful receptors and to prevent autoimmune responses, and it skews the repertoire on a statistical level.

    Models characterizing the statistics of this selection process have been introduced by

    my collaborators, particularly Yuval Elhanati [Elhanati et al., 2014], and are discussed

    in the papers that are referenced in chapter 4 [Elhanati et al., 2015, Sethna et al.,

    2017].

    1.3 Repertoire sequencing and analysis

    Advances in high throughput sequencing [Robins et al., 2010a] have allowed for large

    scale sequencing of lymphocytes in a blood or tissue sample: the sample is broken

    down, the DNA extracted, and specialized primers amplify the DNA sequence of the

    CDR3 region before sequencing. Such experiments are now becoming so routine that

    there is interest in using them for medical diagnostic and immunotherapy purposes.

Almost all of the data discussed in this thesis was sequenced using a protocol pioneered by Harlan Robins [Robins et al., 2010a], who has started a company, Adaptive

    Biotechnologies, to provide repertoire sequencing services.

These experiments can successfully sequence millions of cells (or more), producing datasets of ∼ 10^4–10^6 unique DNA sequences. The availability of datasets of

such size and quality allows for serious statistical analyses to quantify the underlying biology, as well as the possibility to explore more theoretical questions. Being

    physicists, the approach we will take in this thesis is to construct a statistical model,

    i.e. a parameterized probability distribution, of V(D)J recombination that reflects

    the underlying biological processes. These large datasets are then used to infer the

    model parameters. The model parameters will provide quantitative descriptions of the

    V(D)J recombination machinery, and the model itself provides a distribution of the

probability of generating any receptor (Pgen) that can be used to answer theoretical

    questions like characterizing the diversity of a repertoire.

    1.4 Organization of thesis

This thesis is broken into two main parts. The first covers chapters 2 and 3 and

    provides the mathematical framework for the rest of the thesis. The class of generative

    models used to analyze the generation probability (Pgen) of adaptive immune system

    repertoires (first introduced in Murugan et al. [2012]) is described, and the inference

    process, expectation maximization (EM), used to fit the model parameters is laid out.

    We also show how one of the main metrics we use, the entropy of a model, can be

    computed and broken down into different components. In addition, the computational

challenges associated with computing the Pgen of sequences are discussed, in particular the

    exponential explosion of the number of recombination events that generate amino acid

    CDR3 sequences. We then demonstrate the novel dynamic programming algorithm,

    OLGA [Sethna et al., 2018], that we developed to efficiently solve this problem and

make the computation of Pgen for amino acid CDR3 sequences not only tractable, but

    fast.

    The second part, spanning chapters 4 and 5, dives into the applications of the

    modeling framework defined in the first part. The first part of chapter 4 describes

    the work from Sethna et al. [2017] analyzing the maturation of mouse repertoires from

    embryo to young adult. The second half of chapter 4 lays out a model quantifying

    hypermutation in B cells [Elhanati et al., 2015]. Finally, chapter 5 demonstrates how

    Pgen explains the curious observation of so-called ‘public’ sequences.

Chapter 2

    Generative Model

    2.1 V(D)J recombination models

    The definition, selection, and inference of a generative model of V(D)J recombination

    is the foundation for all of the work that comes later. Such a generative model defines

    a probability measure over the state space of V(D)J recombination events, which can

    be extended to define probabilities of particular receptors or collections of receptors.

    We begin by introducing a general model framework by requiring that the model

    respects the biology of the V(D)J recombination process. To do this we define the

    state (sample) space of V(D)J recombination events by combinations of the stochastic

    events in the DNA splicing itself (i.e. gene choice, deletions/palindromic insertions,

and insertions). For example, we can describe the state (sample) space of VDJ

    recombination events as:

$$\Omega_e = \{(V, D, J, d_V, d_D, d'_D, d_J, \{m_i\}, \{n_i\})\} \tag{2.1}$$

where $V$, $D$, and $J$ are the gene choices; $d_V$, $d_D$ (5′/left), $d'_D$ (3′/right), and $d_J$ are deletions (including palindromic insertions); and $\{m_i\}$ and $\{n_i\}$ are the specific nucleotide sequences which are inserted at the VD and DJ junctions respectively1.

    This also allows us to define a fully general model family for the recombination event

    e ∈ Ωe:

$$P_{\mathrm{recomb}}(e) = P(V, d_V, \{m_i\}, d_D, D, d'_D, \{n_i\}, d_J, J) \tag{2.2}$$

    We cannot use the fully general model above, which defines a unique probability for

each combination of recombination events, due to the exponential explosion of parameters. The challenge is to construct sub-models which have few enough parameters

    to be inferred, yet still sufficiently describe the observed sequences. In general this is

    done by positing the independence and dependence of the various splicing events and

    then checking if the factorization captures the necessary correlations. The specific

    models used are factorized to reflect the spatial correlations along the chromosome.

    For VDJ recombination, these models assume that the V choice is independent

    of the D/J choice (the latter two being correlated by virtue of the order in which

    the genes are laid out on the chromosome, see Fig. 1.1), the deletion profiles depend

    only on the gene choice, and lastly that the insertions are independent of the genomic

    contributions and each other. There is still an exponential blowup of parameters

    unless a simpler (fewer parameters) model for the inserted sequences is introduced.

    We use a model that is a product of a length distribution and a dinucleotide Markov

    model. This model factorization and dinucleotide Markov model is first introduced

and validated in Murugan et al. [2012]; however, important exceptions to this factorization will be discussed in Chapter 4 in the contexts of mouse T cells [Sethna et al., 2017] and human B cells [Elhanati et al., 2015]. For VJ recombination, these models assume the V/J choice is correlated, the deletion profiles depend only on the gene choice, and lastly that the insertion region is independent of the genomic contribution.

1 The subscript index i is read from 5′ to 3′.

2.1.1 VDJ generative model

    The VDJ recombination model is defined as:

$$\begin{aligned}
P_{\mathrm{recomb}}(e) = {}& P_V(V)\,P_{DJ}(D,J)\,P_{\mathrm{delV}}(d_V|V)\,P_{\mathrm{delJ}}(d_J|J)\,P_{\mathrm{delD}}(d_D,d'_D|D) \\
&\times P_{\mathrm{insVD}}(\ell_{VD})\,p_0(m_1)\Big[\prod_{i=2}^{\ell_{VD}} S_{VD}(m_i|m_{i-1})\Big] \\
&\times P_{\mathrm{insDJ}}(\ell_{DJ})\,q_0(n_{\ell_{DJ}})\Big[\prod_{i=1}^{\ell_{DJ}-1} S_{DJ}(n_i|n_{i+1})\Big]
\end{aligned} \tag{2.3}$$

where the inserted nucleotide sequences $\{m_i\}$ and $\{n_i\}$ have lengths $\ell_{VD}$ and $\ell_{DJ}$ with insertion length distributions $P_{\mathrm{insVD}}(\ell_{VD})$ and $P_{\mathrm{insDJ}}(\ell_{DJ})$, $S_{VD}$ and $S_{DJ}$ are the respective dinucleotide Markov transition matrices, and finally, $p_0$ and $q_0$ are the nucleotide biases for the first insertion at each junction2. Note that an inserted sequence of length 0 (i.e. no insertions at a junction) is also allowed and has probability $P_{\mathrm{insVD}}(0)$ or $P_{\mathrm{insDJ}}(0)$ depending on the splicing junction.

    2.1.2 Model Validation

    As mentioned above, it is important to check that the factorization of the model

    structure is correct. To address this issue, we examine the correlations between

    various marginal variables of the model (i.e. the stochastic recombination events: V,

    delV, J, insVD, etc) by examining the mutual information of each pair.

    To determine if we have captured the correct correlations in the data, we compare

    the precise mutual information computed directly from the model, to the estimated

    mutual information determined by the expectation over the data (using the Treves-

    Panzeri correction [Treves et al., 1998] to account for finite sample size).

    The generative model has zero mutual information, by construction, between in-

    dependent marginal pairs, e.g. the number of VD insertions and the choice of J gene.

2 Note, we often make the further approximation that the insertion Markov model is at steady state, i.e. we set p0 and q0 to be the steady-state distributions of SVD and SDJ respectively.

Variables that correlate with each other either directly or indirectly, e.g. between D

    and J gene choice, or between D choice and number of D deletions may have non-zero

    mutual information. In order to quickly gauge if a model is consistent (or inconsis-

    tent) with the model factorization, we use plots like D.4, where the MI computed

    from the model is below the diagonal and the expectation over the data is above the

diagonal. If the plot is symmetric about the diagonal, then the model is self-consistent with the data. Indeed, the total missed mutual information is, to leading order,

    precisely the amount of information our factorized model missed due to its structure.

    To validate the dinucleotide Markov model for insertions, we compare the expected

    trinucleotide frequencies to the observed trinucleotide frequencies.

    We will perform these checks in Chapter 4 when we look at mouse T cells [Sethna

    et al., 2017] and human B cells [Elhanati et al., 2015].
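The kind of mutual-information check described above can be sketched as follows. The Treves-Panzeri estimator itself is more involved; here a first-order finite-sample bias correction in the same spirit is used as a stand-in, with made-up data for illustration.

```python
import numpy as np

def plugin_mi(x, y):
    """Plug-in (maximum likelihood) mutual information estimate, in bits."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / np.outer(px, py)[nz])).sum())

def corrected_mi(x, y):
    """Subtract a first-order finite-sample bias from the plug-in estimate."""
    n = len(x)
    r_x, r_y = len(set(x)), len(set(y))
    r_xy = len(set(zip(x, y)))
    bias = (r_xy - r_x - r_y + 1) / (2 * n * np.log(2))  # in bits
    return plugin_mi(x, y) - bias

rng = np.random.default_rng(0)
x = rng.integers(0, 4, 2000)
y = rng.integers(0, 4, 2000)        # independent of x, so the true MI is 0
print(plugin_mi(x, y), corrected_mi(x, y))
```

For independent marginals the plug-in estimate is biased upward by roughly the subtracted term, so the corrected value sits much nearer zero.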

    2.1.3 VJ generative model

    Analogous to the VDJ model, we define the model factorization for the generative

    model of VJ recombination. The primary distinction is that there is no D gene, nor

    is there an N2 insertion region (DJ junction). Also, as there is evidence of repeated

    splicing attempts for the TCRα chain, the V and J gene usages are allowed to be

correlated [Elhanati et al., 2016].

$$P_{\mathrm{recomb}}(e) = P_{VJ}(V,J)\,P_{\mathrm{delV}}(d_V|V)\,P_{\mathrm{delJ}}(d_J|J)\,P_{\mathrm{insVJ}}(\ell_{VJ})\,p_0(m_1)\Big[\prod_{i=2}^{\ell_{VJ}} S_{VJ}(m_i|m_{i-1})\Big] \tag{2.4}$$

    2.1.4 Pgen

    Our model, Precomb, defines a probability measure over the state (sample) space Ωe of

    recombination events. However, this model can, in theory, be extended to other state

(sample) spaces of much greater scientific and biological interest. In particular, we examine the state spaces of DNA nucleotide sequence reads, CDR3 nucleotide

    sequences, and CDR3 amino acid sequences (or collections/motifs of amino acid CDR3

    sequences). This is done by summing over all recombination events that generate one

    of the ‘coarse grained’ states to give the probability of generating a particular CDR3

    sequence or receptor.

$$P_{\mathrm{gen}}(\mathrm{seq}) = \sum_{e|\mathrm{seq}} P_{\mathrm{recomb}}(e) \tag{2.5}$$

This generation probability, or 'Pgen', of a sequence or receptor will be used continuously throughout this thesis. We will return to this idea of extending or 'coarse graining' the probability space in greater detail in Chapter 3.

    2.2 Model Entropy

    Before we introduce our method for inferring the model parameters, we first introduce

    a concept that we will return to repeatedly: the entropy of a model. One of the

    advantages of having a probabilistic model of V(D)J recombination is that we can

    use the (Shannon) entropy (Appendix A) of the distribution as a well defined measure

    of the ‘diversity’ of a repertoire. We examine the entropy of both Precomb and Pgen.

    First we show how to compute the entropy S(Precomb) directly from the model, and

    how it decomposes into contributions from the gene choice, the deletions, and the

    insertions. We also show explicitly how changing the insertion length distribution

    has an outsized impact on the entropy. Then we discuss how to approximate S(Pgen)

by Monte Carlo simulation. Throughout this section we do not fix the units in which the entropy is expressed; however, we will most frequently quote entropy in units of bits (i.e. with logarithms taken base 2, log2)3.

3 Personally, I think everything should be done in nats (log base e); however, for most people it is easier to parse bits (log base 2) or dits (log base 10).

2.2.1 Entropy of Precomb

    The entropy4 of a VDJ recombination model is:

$$H(P_{\mathrm{recomb}}) = -\langle \log P_{\mathrm{recomb}} \rangle_{\Omega_e} = -\big\langle \log\!\big(P_V P_{DJ} P_{\mathrm{delV}} P_{\mathrm{delJ}} P_{\mathrm{delD}} P_{\{m_i\}} P_{\{n_i\}}\big) \big\rangle_{\Omega_e} \tag{2.6}$$

    Now, we can break the total entropy expression into independent components, and

    compute the entropy of each of the components independently.

    Genes/Deletions entropic contribution

    The gene/deletion contributions are fairly straightforward to compute. Examining

    the V templates:

$$\begin{aligned}
H(P_V(V)\,P_{\mathrm{delV}}(d_V|V)) &= -\sum_{V,d_V} P_V(V)\,P_{\mathrm{delV}}(d_V|V)\big[\log P_V(V) + \log P_{\mathrm{delV}}(d_V|V)\big] \\
&= -\sum_{V} P_V(V)\log P_V(V) - \sum_{V,d_V} P_V(V)\,P_{\mathrm{delV}}(d_V|V)\log P_{\mathrm{delV}}(d_V|V) \\
&= H(P_V) + \sum_{V} P_V(V)\,H(P_{\mathrm{delV}}(d_V|V)) \\
&= H(P_V) + \langle H(P_{\mathrm{delV}})\rangle_V
\end{aligned} \tag{2.7}$$

In an analogous fashion we can determine $H(P_{DJ})$, $\langle H(P_{\mathrm{delD}})\rangle_D$, and $\langle H(P_{\mathrm{delJ}})\rangle_J$. We say that the entropy contribution from the choice of germline template is $H(P_V) + H(P_{DJ})$, while the deletion entropic contribution is $\langle H(P_{\mathrm{delV}})\rangle_V + \langle H(P_{\mathrm{delD}})\rangle_D + \langle H(P_{\mathrm{delJ}})\rangle_J$.

4 We indicate entropy by H, not S, in this section so as not to confuse notation with the dinucleotide transition matrices SVD and SDJ.
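The decomposition in Eq. 2.7 is easy to verify numerically; a minimal check, with small made-up distributions standing in for the inferred gene and deletion profiles:

```python
import numpy as np

# Check Eq. 2.7: H(P_V P_delV) = H(P_V) + <H(P_delV)>_V. All numbers invented.
P_V = np.array([0.5, 0.3, 0.2])              # P_V(V)
P_delV = np.array([[0.7, 0.2, 0.1],          # P_delV(d_V | V), one row per V
                   [0.4, 0.4, 0.2],
                   [0.1, 0.3, 0.6]])

def H(p):
    """Shannon entropy in bits of a probability array."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

joint = P_V[:, None] * P_delV                # P_V(V) P_delV(d_V|V)
lhs = H(joint)                               # entropy of the joint distribution
rhs = H(P_V) + sum(pv * H(row) for pv, row in zip(P_V, P_delV))
print(lhs, rhs)
```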

Insertion entropic contribution

    The entropy of the insertions is much trickier to compute as we will have to sum

    the Markov model probabilities over all possible insertion sequences. We drop the

    VD/DJ subscripts as the computations are identical.

$$\begin{aligned}
H(P_{\{m_i\}}) &= -\sum_{\{m_i\}} P_{\{m_i\}}(\{m_i\}) \log P_{\{m_i\}}(\{m_i\}) \\
&= -\sum_{\ell}\,\sum_{\{m_i\}|\ell} P_{\mathrm{ins}}(\ell)\, P_{\{m_i\}|\ell}(\{m_i\}) \big[\log P_{\mathrm{ins}}(\ell) + \log P_{\{m_i\}|\ell}(\{m_i\})\big] \\
&= -\sum_{\ell} P_{\mathrm{ins}}(\ell)\log P_{\mathrm{ins}}(\ell) - \sum_{\ell} P_{\mathrm{ins}}(\ell) \sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\}) \\
&= H(P_{\mathrm{ins}}) - \sum_{\ell} P_{\mathrm{ins}}(\ell) \sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\})
\end{aligned} \tag{2.8}$$

    where,

$$P_{\{m_i\}|\ell}(\{m_i\}) = p_0(m_1)\Big[\prod_{i=2}^{\ell} S(m_i|m_{i-1})\Big]. \tag{2.9}$$

    In order to make the dependence of this entropy on the average insertion length

    (〈`〉) more explicit we will make the approximation that the Markov model is at

    steady-state (i.e. p0 = pss, the steady-state distribution of S).

We will now prove inductively that for ℓ ≥ 1:

$$\begin{aligned}
H(P_{\{m_i\}|\ell}) &= -\sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\}) \\
&= H(p_{\mathrm{ss}}) - (\ell-1)\sum_m p_{\mathrm{ss}}(m)\sum_n S(n|m)\log S(n|m)
\end{aligned} \tag{2.10}$$

Initial Step: ℓ = 1

This is trivial, as $P_{\{m_i\}|\ell}(\{m_i\} = m) = p_0(m) = p_{\mathrm{ss}}(m)$, so by direct computation:

$$-\sum_{\{m_i\}|\ell=1} P_{\{m_i\}|\ell}(m)\log P_{\{m_i\}|\ell}(m) = -\sum_m p_{\mathrm{ss}}(m)\log p_{\mathrm{ss}}(m) = H(p_{\mathrm{ss}}) \tag{2.11}$$

    Inductive step

Assuming we have shown that Eq. 2.10 is true for ℓ ≤ k, we prove it holds for ℓ = k + 1.

$$\begin{aligned}
&-\sum_{\{m_i\}|\ell=k+1} P_{\{m_i\}|\ell}(\{m_i\}) \log P_{\{m_i\}|\ell}(\{m_i\}) \\
&= -\sum_{m_{k+1}}\,\sum_{\{m_{i\le k}\}} S(m_{k+1}|m_k)\, P_{\{m_i\}|k}(\{m_{i\le k}\}) \big[\log S(m_{k+1}|m_k) + \log P_{\{m_i\}|k}(\{m_{i\le k}\})\big] \\
&= H(P_{\{m_i\}|k}) - \sum_{m_{k+1}}\,\sum_{\{m_{i\le k}\}} S(m_{k+1}|m_k)\, P_{\{m_i\}|k}(\{m_{i\le k}\}) \log S(m_{k+1}|m_k) \\
&= H(P_{\{m_i\}|k}) - \sum_{m_{k+1}}\sum_{m_k} S(m_{k+1}|m_k)\log S(m_{k+1}|m_k) \sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\, P(\{m_{i\le k-1}\}|k-1)
\end{aligned} \tag{2.12}$$

Now, in order to do the summation in the second term, we make the observation that the conditional terms only depend on the last two nucleotides $m_{k+1}$ and $m_k$, so we would like to get the marginal distribution

$$p_k(m_k) = \sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\, P(\{m_{i\le k-1}\}|k-1) \tag{2.13}$$

But we recall our previous assumption that the Markov process is in its steady state, so the marginal distribution is the same as the steady-state distribution (i.e. $p_k = p_{\mathrm{ss}}$)5. Plugging this back in shows

$$\begin{aligned}
&-\sum_{m_{k+1}}\sum_{m_k} S(m_{k+1}|m_k)\log S(m_{k+1}|m_k) \sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\, P(\{m_{i\le k-1}\}|k-1) \\
&= -\sum_m p_{\mathrm{ss}}(m)\sum_n S(n|m)\log S(n|m)
\end{aligned} \tag{2.14}$$

    which shows the inductive step holds for k + 1 and completes the proof.

Putting everything together, the entropy from a single insertion junction is

$$H(P_{\mathrm{ins}}) + H(p_{\mathrm{ss}}) - (\langle\ell\rangle - 1)\sum_m p_{\mathrm{ss}}(m)\sum_n S(n|m)\log S(n|m) \tag{2.15}$$

    Note the dependence of this expression on the average number of insertions 〈`〉.

    We will return to this in chapter 4 when we see that the way a repertoire scales its

    diversity is by changing the insertion length distribution.
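The closed form of Eq. 2.15 can be checked against brute-force enumeration for a toy model. Here the alphabet has only two letters, the transition matrix and length distribution are made up, and $P_{\mathrm{ins}}(0)$ is set to zero so the zero-length special case does not complicate the comparison.

```python
import numpy as np
from itertools import product

# Numerical check of Eq. 2.15 for a toy two-letter insertion model.
S = np.array([[0.8, 0.2],           # S[m, n] = S(n | m); rows sum to 1
              [0.3, 0.7]])
P_ins = {1: 0.3, 2: 0.4, 3: 0.3}    # insertion-length distribution (no l = 0)

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(list(p), float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Steady state p_ss solves p_ss = p_ss S (for this S it is [0.6, 0.4]).
w, v = np.linalg.eig(S.T)
p_ss = np.real(v[:, np.argmin(abs(w - 1))])
p_ss /= p_ss.sum()

# Brute force: entropy of the full distribution over insertion strings.
probs = []
for l, pl in P_ins.items():
    for string in product(range(2), repeat=l):
        p = p_ss[string[0]]
        for a, b in zip(string, string[1:]):
            p *= S[a, b]
        probs.append(pl * p)
brute = H(probs)

# Closed form: H(P_ins) + H(p_ss) - (<l> - 1) sum_{m,n} p_ss(m) S(n|m) log2 S(n|m)
l_mean = sum(l * p for l, p in P_ins.items())
cross = (p_ss[:, None] * S * np.log2(S)).sum()
closed = H(P_ins.values()) + H(p_ss) - (l_mean - 1) * cross
print(brute, closed)
```

The two numbers agree to machine precision, and the closed form makes the linear dependence on the mean insertion length explicit.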

    Total entropy of Precomb

$$\begin{aligned}
H(P_{\mathrm{recomb}}) = {}& H(P_V) + H(P_{DJ}) + \langle H(P_{\mathrm{delV}})\rangle_V + \langle H(P_{\mathrm{delD}})\rangle_D + \langle H(P_{\mathrm{delJ}})\rangle_J \\
&+ H(P_{\mathrm{insVD}}) + H(p_{\mathrm{ss}}) - (\langle\ell_{VD}\rangle - 1)\sum_m p_{\mathrm{ss}}(m)\sum_n S_{VD}(n|m)\log S_{VD}(n|m) \\
&+ H(P_{\mathrm{insDJ}}) + H(q_{\mathrm{ss}}) - (\langle\ell_{DJ}\rangle - 1)\sum_m q_{\mathrm{ss}}(m)\sum_n S_{DJ}(n|m)\log S_{DJ}(n|m)
\end{aligned} \tag{2.16}$$

    5If we didn’t want to make the steady-state assumption, it is easy to see how using this marginaldistribution would change Eq. 2.10 to:H(P{mi}|`) = H(p0)−

    ∑`k=2

    ∑m pk(m)

    ∑n S(n|m) log(S(n|m))

2.2.2 Entropy of Pgen

    The probability distribution of Pgen no longer factorizes after the summation. As a

result we cannot break down the entropy into independent pieces. Instead we take a different tack and estimate the entropy of Pgen.

    We recall that the entropy of a distribution is just −〈logP 〉. This means that we

    can estimate the entropy of Pgen by taking the expectation value over Monte Carlo

    simulated sequences:

$$S(P_{\mathrm{gen}}) \approx -\langle \log P_{\mathrm{gen}}(s) \rangle_{s\in\mathrm{MC\ sample}} \tag{2.17}$$
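Eq. 2.17 in miniature: a toy four-state distribution, whose exact entropy is known, stands in for Pgen (whose support is far too large to enumerate in the real case).

```python
import numpy as np

# Monte Carlo entropy estimate: average -log2 of the probabilities of
# sampled states. The toy distribution below is invented for illustration.
rng = np.random.default_rng(1)
p = np.array([0.5, 0.25, 0.125, 0.125])
exact = float(-(p * np.log2(p)).sum())           # known exact entropy, 1.75 bits
draws = rng.choice(len(p), size=200_000, p=p)
estimate = float(-np.log2(p[draws]).mean())
print(exact, estimate)
```

The estimator is unbiased, with statistical error shrinking as the inverse square root of the number of Monte Carlo samples.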

    2.2.3 The Pgen distribution

    Another extremely effective way of visualizing the diversity of a repertoire is to exam-

    ine the probability density of the log Pgen of sequences. If a large number of sequences

    (or recombination events) are drawn from a model distribution (i.e. Monte Carlo sam-

    pling), they can be histogrammed by the log of their generation probabilities. If we

    define an energy as E ∼ − log Pgen, this distribution is the probability density P (−E),

    and is closely related to the density of states (a connection we will return to in chapter

    5). An example of one of these plots is shown for a human TRB model in Fig. 2.1,

    demonstrating the massive range of generation probabilities, spanning ∼20 orders of

magnitude. Another very useful aspect of these plots is that the mean of each distribution is the entropy of the distribution (up to a minus sign), indicated by the dotted lines in Fig. 2.1. We frequently use such plots as a way of characterizing

    the data visually. It is easy to see shifts to more or less entropic distributions, and

    to see any impact on the tails. Furthermore, these plots can be made from the data

    directly by histogramming their generation probabilities and the entropy of such a

    distribution will again be the mean6.

6 Please note, when using data sequences, the 'entropy' computed as the mean of the distribution is technically a cross entropy. For the non-productive sequences we largely focus on in this thesis this is a negligible distinction. However, for in-frame productive

Figure 2.1: Distribution functions: P(−E = log Pgen)

Shows the distribution of generation probabilities over 3 different state spaces of the same human TRB model, highlighting the 'coarse graining' of the model from recombination events, to nucleotide sequences, and finally to amino acid sequences/receptors. The dotted lines indicate the mean of each distribution, which is mathematically equivalent to the negative of the entropy of each distribution. The entropy of the distributions decreases as they get more coarse grained.

    2.3 Inference

    The data which is used to infer these models comes from high-throughput Illumina

    sequencing [Robins et al., 2010a] and is organized as a collection of DNA sequences

    of around 60-200 base pairs. We will want to infer the parameters of the generative

    model that most accurately reflect the sequences observed in the experiment. Without

a principled prior that significantly biases the distribution (note, the Jeffreys prior is remarkably flat for these generative models), the parameters are inferred by way of

sequences this is not an irrelevant concern, as the distributions are noticeably skewed towards higher generation probabilities due to somatic selection. See Elhanati et al. [2014] for a discussion of somatic selection and the statistical effects on the distribution. We are a little sloppy and always refer to this quantity as the entropy of the distribution, even if it is technically a cross entropy at times.

maximum likelihood estimation. Given a collection of observed DNA sequences S and

    a generative model determined by parameters θ ∈ Θ, we want to infer the estimated

    parameters θ̂:

$$\hat\theta = \arg\max_\theta L(\theta; S) = \arg\max_\theta p(S|\theta) = \arg\max_\theta \prod_{\mathrm{seq}\in S} P_{\mathrm{gen}}(\mathrm{seq}|\theta) \tag{2.18}$$

    as the sequences in S are assumed to be independently generated.

    In order to properly infer the parameters of a V(D)J model we must be careful to

    only use sequences that are statistically representative of the V(D)J recombination

    machinery itself and are not skewed by any selective process or somatic population

    dynamics. This is a real worry as not only could clonal expansion overrepresent

    specific sequences, but functional receptors are systematically biased away from the

    underlying V(D)J generative distribution due to their involvement in the immune

    system function (this is explored in Elhanati et al. [2014]). Fortunately, as discussed

    in section 1.2, V(D)J recombination does not always produce inframe, productive

    sequences with each recombination event. As a result, the DNA sequence datasets

    we analyze contain a significant fraction of sequences we know must be nonproduc-

    tive/nonfunctional because they are frame shifted (out of frame) or contain a stop

    codon. These sequences can never be expressed and therefore should experience no

selective pressures. Thus, to ensure a statistically unbiased sample, we filter our sample for only unique, nonproductive sequences. Filtering for unique sequences removes

    the influence of clonal dynamics and expansion, whereas filtering for nonproductive

    sequences removes any selection effects.

    The generative models described (Eq. 2.3, Eq. 2.4) are defined over the space

of recombination events, which are 'hidden' in the sense that there are many, many recombination events that can lead to a particular DNA sequence, and there is no way

    to determine which one actually occurred. In order to infer the parameters of such a

model, a classic iterative learning algorithm, expectation maximization (EM), is used

    which ensures that a local maximum in likelihood is achieved (proof in Appendix C).

    2.3.1 Errors and Mismatches

    Each recombination event e = (V,D, J, dV , dD, d′D, dJ , {mi}, {ni}) generates a specific

DNA sequence. However, it is possible that, when this gene was sequenced, the

    recorded nucleotides do not match up perfectly with the sequence generated by e.

    This mismatch could indicate a sequencing error in the experiment or, in the case of

    B cells, could be the result of hypermutations (this will be discussed in much greater

    detail in Section 4.2). We will need to account for such mismatches or errors in order

    to properly infer the parameters of the generative model. To do this we introduce

    an error/mismatch model. Formally, we define the observed probabilities, given an

    observed/measured sequence seqo as:

$$\begin{aligned}
P^o_{\mathrm{recomb}}(e, \mathrm{seq}_o) &= P_{\mathrm{recomb}}(e)\, P_{\mathrm{mis}}(\mathrm{seq}_o|e) \\
P^o_{\mathrm{gen}}(\mathrm{seq}_o) &= \sum_{e\in E} P^o_{\mathrm{recomb}}(e, \mathrm{seq}_o)
\end{aligned} \tag{2.19}$$

where Pmis(seqo|e) is the error/mismatch model whose parameters will be inferred

    during the EM inference. There are several Pmis(seqo|e) models used over the course

    of this work.

    No error model

    It is useful to first consider a model where no errors or mismatches are allowed. To

do this, define $P_{\mathrm{mis}}(\mathrm{seq}_o|e) = \mathbb{I}[e\ \text{generates}\ \mathrm{seq}_o]$. Then,

$$P^o_{\mathrm{recomb}}(e, \mathrm{seq}_o) = \begin{cases} P_{\mathrm{recomb}}(e) & \text{if } e \text{ generates } \mathrm{seq}_o \\ 0 & \text{otherwise} \end{cases} \tag{2.20}$$

  • and

$$P^o_{\mathrm{gen}}(\mathrm{seq}_o) = \sum_{e\in E} P^o_{\mathrm{recomb}}(e, \mathrm{seq}_o) = \sum_{e|\mathrm{seq}_o} P_{\mathrm{recomb}}(e) = P_{\mathrm{gen}}(\mathrm{seq}_o) \tag{2.21}$$

we see that we recover Pgen from Pogen.

    Flat error rate

    This model assumes that the probability of a mismatched nucleotide between the

    observed sequence seqo = {soi} and the sequence generated by recombination event e,

    seqe = {sei}, is a flat probability pm.

$$P_{\mathrm{mis}}(\mathrm{seq}_o|e) = \prod_i \big( p_m\, \mathbb{I}[s^o_i \neq s^e_i] + (1-p_m)\, \mathbb{I}[s^o_i = s^e_i] \big) \tag{2.22}$$

    Flat error rate, restricted to genomic templates

    In practice, it doesn’t make much sense to examine mismatches outside of the region

    of the sequence that is determined by a germline V, D, or J sequence. Define the set

    of positions, Posgene where the nucleotides {sei} come from a germline template and

    its complement, Posins, where the nucleotides come from non-templated insertions.

    We define a new error model that applies the flat error model to positions Posgene and

    the no error model to positions Posins:

$$P_{\mathrm{mis}}(\mathrm{seq}_o|e) = \begin{cases} 0, & \text{if } \exists\, i \in \mathrm{Pos}_{\mathrm{ins}} \text{ s.t. } s^o_i \neq s^e_i \\ \prod_{i\in \mathrm{Pos}_{\mathrm{gene}}} \big( p_m\, \mathbb{I}[s^o_i \neq s^e_i] + (1-p_m)\, \mathbb{I}[s^o_i = s^e_i] \big), & \text{otherwise} \end{cases} \tag{2.23}$$

    This is the model that is used most frequently and unless otherwise stated is the

    model that is used for inference purposes.
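This mismatch model is straightforward to express in code. The sequences and the templated/inserted position partition below are illustrative, not real data.

```python
# Flat error rate restricted to genomic templates (Eq. 2.23): a flat
# per-nucleotide error rate p_m on germline-templated positions, zero
# tolerance for mismatches at non-templated (inserted) positions.
def p_mis(seq_obs, seq_event, templated, p_m=0.01):
    """templated[i] is True where position i of seq_event is germline-templated."""
    assert len(seq_obs) == len(seq_event) == len(templated)
    prob = 1.0
    for so, se, is_gene in zip(seq_obs, seq_event, templated):
        if not is_gene:
            if so != se:
                return 0.0            # mismatch inside the insertion region
            continue                  # insertion positions contribute no factor
        prob *= p_m if so != se else 1.0 - p_m
    return prob

seq_e = "CATTGGTA"
mask = [True, True, True, False, False, True, True, True]  # N region: positions 3-4
print(p_mis("CATTGGTA", seq_e, mask))  # no mismatches
print(p_mis("CAGTGGTA", seq_e, mask))  # one mismatch at a templated position
print(p_mis("CATAGGTA", seq_e, mask))  # mismatch in the insertion region
```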

N-mer context dependent error model

    In order to study hypermutations in Section 4.2 we use a mismatch model where

    the mismatch rate is modulated depending on the 7-mer nucleotide sequence around

    the mismatch site. Here we define a general N-mer context model where there are

    independent energies at each site (i.e. a one point model).

$$p_h(i|\mathrm{seq}) = \frac{1}{Z}\, p_{\mathrm{bg}}\big(s_{i-\lfloor N/2\rfloor}, s_{i-\lfloor N/2\rfloor+1}, \ldots, s_{i+\lfloor N/2\rfloor}\big) \exp\!\Bigg[\sum_{k=-\lfloor N/2\rfloor}^{\lfloor N/2\rfloor} -E_k(s_{i+k})\Bigg] \tag{2.24}$$

where $p_{\mathrm{bg}}(\sigma)$ is the background frequency of the N-mer nucleotide sequence $\sigma$ and

    the proportionality constant Z is determined by matching the overall mismatch rate

(i.e. $\langle p_h\rangle = p_m$). As we have the freedom to define the zero of energy for each of the $E_k$'s, it is convenient to set $\sum_{\sigma\in\{A,C,G,T\}} E_k(\sigma) = 0$ to make it transparent if the nucleotide identity at position k in the N-mer makes a hypermutation mismatch more or less likely.
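A sketch of evaluating Eq. 2.24 for N = 3. The one-point energies and the (here trivial) background frequency are made up, and Z is fixed by matching the mean rate over a single illustrative sequence to an overall rate p_m.

```python
import numpy as np

# Toy N-mer context model (Eq. 2.24) with N = 3 and invented energies.
NUC = "ACGT"
N, half = 3, 1
rng = np.random.default_rng(0)
E = rng.normal(0.0, 0.5, size=(N, 4))
E -= E.mean(axis=1, keepdims=True)   # enforce sum_sigma E_k(sigma) = 0

def boltzmann_weight(seq, i, p_bg=1.0):
    """p_bg * exp(-sum_k E_k(s_{i+k})) for the N-mer centred on position i."""
    energy = sum(E[k + half, NUC.index(seq[i + k])] for k in range(-half, half + 1))
    return p_bg * np.exp(-energy)

seq = "ACGTACGTGG"
sites = range(half, len(seq) - half)             # positions with a full 3-mer context
raw = np.array([boltzmann_weight(seq, i) for i in sites])
p_m = 0.01
Z = raw.mean() / p_m                 # normalization matching <p_h> = p_m
p_h = raw / Z
print(p_h)
```

The zero-sum convention on each $E_k$ means a positive energy at position k marks a context that suppresses hypermutation there, and a negative one marks a context that enhances it.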

    One may also notice that we did not specify whether seq is seqo or seqe. Ideally we

    would want seq to be the sequence immediately before the hypermutation occurred

    (e.g. if we were constructing an evolutionary tree from hypermutations we should use

    the current node’s sequence as seq). However, for inference purposes this ambiguity

    is functionally irrelevant as choosing either seqo or seqe to be seq will result in a

    negligible difference.

    Again, we will want to restrict to mismatches with the germline sequences (to

    ensure we have identified a hypermutation), so we define:

$$P_{\mathrm{mis}}(\mathrm{seq}_o|e) = \begin{cases} 0, & \text{if } \exists\, i \in \mathrm{Pos}_{\mathrm{ins}} \text{ s.t. } s^o_i \neq s^e_i \\ \prod_{i\in \mathrm{Pos}_{\mathrm{gene}}} \big( p_h(i|\mathrm{seq})\, \mathbb{I}[s^o_i \neq s^e_i] + (1-p_h(i|\mathrm{seq}))\, \mathbb{I}[s^o_i = s^e_i] \big), & \text{otherwise} \end{cases} \tag{2.25}$$

2.3.2 Expectation Maximization algorithm

Expectation maximization is implemented by taking an initial guess (generally randomized) for the parameters and then iterating two different steps. The first step,

expectation, defines a function which is the expected log-likelihood over the distribution of data and hidden variables determined by the data and the current guess of the

    parameters. Explicitly, if θ′ is the current estimation of the parameters, we define:

$$Q(\theta|\theta') = \langle \log L(\theta; X, Z) \rangle_{Z|X,\theta'} \tag{2.26}$$

    Note, Q(θ|θ′) is still a function of some undetermined parameters θ. This leads

to the second step: maximization. To determine the next iteration's parameter estimate, we maximize this function:

$$\theta^{(i+1)} = \arg\max_\theta Q(\theta|\theta^{(i)}) \tag{2.27}$$

    Repeatedly iterating these steps will monotonically increase both Q and the full

    likelihood function (proof below). Let us be explicit in how this translates into the

    specific scenario of a VDJ generative model. Say we have (nonproductive) sequences

S, the set of possible recombination events $\Omega_e = \{(V, D, J, d_V, d_D, d'_D, d_J, \{m_i\}, \{n_i\})\}$,

    and the model structure from Eq. 2.3. Then θ is the collection of parameters defining

    PV, PDJ, PdelV, etc. The expectation step is defined as so:

$$Q(\theta|\theta') = \langle \log L(\theta; S, E) \rangle_{E|S,\theta'} = \sum_{\mathrm{seq}\in S}\sum_{e\in E} P^o_{\mathrm{recomb}}(e|\mathrm{seq}, \theta') \log P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta) \tag{2.28}$$

    Now,

$$P^o_{\mathrm{recomb}}(e|\mathrm{seq}, \theta') = \frac{P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{\sum_{e\in E} P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')} = \frac{P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{P^o_{\mathrm{gen}}(\mathrm{seq}|\theta')} \tag{2.29}$$

is the fractional contribution of the particular event to the total Pgen of that sequence.

Plugging $P^o_{\mathrm{recomb}}(e|\mathrm{seq}, \theta')$ back in and expanding $P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta)$, we get:

$$\begin{aligned}
Q(\theta|\theta') = \sum_{\mathrm{seq}\in S}\sum_{e\in E} &\frac{P^o_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{P^o_{\mathrm{gen}}(\mathrm{seq}|\theta')} \times \Big[ \log P_V(V(e)) + \log P_{DJ}(D(e), J(e)) \\
&+ \log P_{\mathrm{delV}}(d_V(e)|V(e)) + \log P_{\mathrm{delD}}(d_D(e), d'_D(e)|D(e)) + \log P_{\mathrm{delJ}}(d_J(e)|J(e)) \\
&+ \log P_{\mathrm{insVD}}(\ell_{VD}(e)) + \log p_0(m_1(e)) + \sum_{i=2}^{\ell_{VD}} \log S_{VD}(m_i(e)|m_{i-1}(e)) \\
&+ \log P_{\mathrm{insDJ}}(\ell_{DJ}(e)) + \log q_0(n_{\ell_{DJ}}(e)) + \sum_{i=1}^{\ell_{DJ}-1} \log S_{DJ}(n_i(e)|n_{i+1}(e)) \\
&+ \log P_{\mathrm{mis}}(\mathrm{seq}|e) \Big]
\end{aligned} \tag{2.30}$$

We now need to evaluate $\arg\max_\theta Q(\theta|\theta')$. As the expansion breaks up into independent pieces, we can deal with them one at a time. First examine the parameters in $P_V$. We want to maximize $f(P_V) = Q(\theta|\theta')$ conditioned on $g(P_V) = \sum_V P_V(V) - 1 = 0$. Naturally, this is done with Lagrange multipliers ($\nabla f = \lambda \nabla g$). $\nabla f$ is readily computed:

$$\begin{aligned}
\frac{\partial f}{\partial P_V(V_i)} &= \frac{\partial Q(\theta|\theta')}{\partial P_V(V_i)} = \frac{\partial}{\partial P_V(V_i)} \sum_{\mathrm{seq}\in S}\sum_{e\in E} \frac{P_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{P_{\mathrm{gen}}(\mathrm{seq}|\theta')}\, \log P_V(V(e)) \\
&= \sum_{\mathrm{seq}\in S}\sum_{e\in E} \frac{P_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{P_{\mathrm{gen}}(\mathrm{seq}|\theta')}\, \frac{\mathbb{I}[V_i = V(e)]}{P_V(V_i)}
\end{aligned} \tag{2.31}$$

$\lambda \nabla g$ is even more straightforward:

$$\lambda \frac{\partial g}{\partial P_V(V_i)} = \lambda \frac{\partial}{\partial P_V(V_i)} \Big[\sum_V P_V(V) - 1\Big] = \lambda \tag{2.32}$$

    So,

$$P_V(V_i) = \frac{1}{\lambda} \sum_{\mathrm{seq}\in S}\sum_{e\in E} \frac{P_{\mathrm{recomb}}(e, \mathrm{seq}|\theta')}{P_{\mathrm{gen}}(\mathrm{seq}|\theta')}\, \mathbb{I}[V_i = V(e)] \tag{2.33}$$


To solve for λ, plug back into our normalization condition (g(PV) = ∑_V PV(V) − 1 = 0):

    g(PV) = 0 = ∑_V PV(V) − 1
              = −1 + (1/λ) ∑_{seq∈S} ∑_{e∈E} [Precomb(e, seq|θ′) / Pgen(seq|θ′)] ∑_{Vi} I[Vi = V(e)]
              = −1 + (1/λ) ∑_{seq∈S} ∑_{e∈E} Precomb(e, seq|θ′) / Pgen(seq|θ′)
              = −1 + (1/λ) ∑_{seq∈S} Pgen(seq|θ′) / Pgen(seq|θ′)
              = −1 + (1/λ) ∑_{seq∈S} 1
              = −1 + |S|/λ
    ⇒ λ = |S|    (2.34)

Finally, this gives us the expression for the parameters of PV for the next iteration:

    PV(Vi) = (1/|S|) ∑_{seq∈S} ∑_{e∈E} [Precomb(e, seq|θ′) / Pgen(seq|θ′)] I[Vi = V(e)]    (2.35)

which is just the expectation of that marginal (V gene usage in this case) over the
data sequences, using the previous iteration's parameters. It is easy to show
that the remaining parameters are inferred in an analogous fashion, with the only
caveat that the derivation for conditional distributions requires a
normalization condition (and thus another Lagrange multiplier) for each variable
that the distribution is conditioned on (or the inference can be done as a joint distribution).
For example:

    g(PdelV|Vi) = 0 = ∑_{d′V} PdelV(d′V|Vi) − PV(Vi)    (2.36)


Also note that, as the insertion dinucleotide Markov models break up into a
similar form, their parameters are also inferred in an identical manner (only that
each recombination event e can contribute more than one term to the sum).
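The update rule of Eq. 2.35 is simple to implement once the posterior weights Precomb(e, seq|θ′)/Pgen(seq|θ′) are available. A minimal Python sketch of one M-step update of PV, assuming hypothetical helpers `events_for` (enumerating the events e|seq) and `P_recomb` (evaluating the event probability under the previous iteration's parameters):

```python
from collections import defaultdict

def em_update_PV(sequences, events_for, P_recomb, V_genes):
    """One M-step update of PV (Eq. 2.35): the posterior-weighted
    V gene usage averaged over the data sequences.

    events_for(seq)  -> hypothetical list of recombination events e|seq
    P_recomb(e, seq) -> P_recomb(e, seq | theta') under the previous iteration
    """
    PV_new = defaultdict(float)
    for seq in sequences:
        events = events_for(seq)
        # Pgen(seq | theta') is the sum over all events generating seq (Eq. 2.5)
        Pgen = sum(P_recomb(e, seq) for e in events)
        for e in events:
            # fractional contribution of event e to Pgen of this sequence (Eq. 2.29)
            PV_new[e['V']] += P_recomb(e, seq) / Pgen
    # the Lagrange multiplier works out to |S| (Eq. 2.34)
    return {V: PV_new[V] / len(sequences) for V in V_genes}
```

Each sequence contributes a total posterior weight of 1, so the updated marginal is automatically normalized; this is exactly the content of λ = |S|.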

    2.3.3 Implementation

Implementation of the EM algorithm for these V(D)J generative models is quite
tricky and requires a large amount of computational power. As model parameters
are learned from large datasets of ∼ 10^4–10^5 sequences, there is a premium on efficient,
parallelized code. Sequence alignment, efficient enumeration of recombination events,
and intelligent organization of data structures are only some of the challenges. The
story of developing software to infer these parameters belongs to others and so won't
be a focus of this thesis. However, I do want to take a moment to describe and
highlight the work done to make this difficult inference process possible.

My predecessor, Anand Murugan, was the first to code up and implement a VDJ
generative model of the form of Eq. 2.3, and this was the basis of the first paper
describing these V(D)J generative models [Murugan et al., 2012]. I later adapted
his MATLAB code to define and infer the models discussed in Chapter 4.
Despite the success of this MATLAB code, it requires some expertise to use, and
any changes to the model structure must be hard coded.

    Recently a collaborator, Quentin Marcou, developed a software package called

    IGoR (Inference and Generation Of Repertoires) in C++ [Marcou et al., 2018]. IGoR

is constructed in a way that allows the user to easily define the model structure (i.e.
the factorization) and runs smoothly and quickly. This software was used to infer the
models discussed and used in Chapters 3 and 5. IGoR is publicly available on GitHub:

    https://github.com/qmarcou/IGoR.


Chapter 3

V(D)J recombination to sequences: Precomb → Pgen

The previous chapter laid out how a generative V(D)J model can be constructed and
inferred. However, the generative model is defined over a state (sample) space of re-
combination events, Ωe, whereas the scientific interest is in the state (sample) space
of sequences or receptors (both nucleotide and amino acid), and biological/physical
effects can only take place at the level of the physical protein structure of the re-
ceptor, i.e. the amino acid sequence (or possibly some coarse-grained version of it).
As briefly discussed in 2.1.4, the V(D)J model does define the
probability of generating a particular nucleotide or amino acid sequence by summing
over all recombination events that generate the sequence. This was summarized in
Eq. 2.5, which we repeat here:

    Pgen(seq) = ∑_{e|seq} Precomb(e)    (3.1)

This summation over recombination events is, in some sense, ‘coarse graining’ the
state (sample) space, as we are aggregating many states (recombination events) into
a new state (a nucleotide or amino acid sequence).


3.0.1 Probability Spaces (mathematical aside)

Formally, this ‘coarse graining’ is just extending probability spaces. First we define
the sample space of recombination events (Ωe, with σ-algebra Be), the sample space of
nucleotide CDR3 sequences (Ωnt, with σ-algebra Bnt), and the sample space of amino
acid CDR3 sequences (Ωaa, with σ-algebra Baa). Note that as each recombination
event generates a specific nucleotide sequence through the physical process of V(D)J
recombination, we have the surjective map π_v(d)j : Ωe → Ωnt. Furthermore, as each
(in-frame) nucleotide sequence translates to an amino acid sequence, we can define the
translation mapping π_nt2aa : Ωnt → Ωaa (if we wished to be pedantic we could keep the
out-of-frame sequences in Ωaa to ensure that π_nt2aa is a function over the whole sample
space and to maintain the total measure of 1 over Ωaa). In this notation it is easy to see
that the mapping π_v(d)j extends the probability space of V(D)J recombination events,
(Ωe, Be, Precomb), to the probability space of nucleotide sequences, (Ωnt, Bnt, Pgen_nt),
while the mapping π_nt2aa extends the probability space of nucleotide sequences to the
probability space of amino acid sequences, (Ωaa, Baa, Pgen_aa). Our sloppy notation of
e|seq can now be understood as either π_v(d)j^{−1}(ntseq) or π_v(d)j^{−1} π_nt2aa^{−1}(aaseq).
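Concretely, extending a probability space along a surjective map is just aggregating probability over preimages. A toy Python sketch (all event names, sequences, and probabilities below are made up for illustration):

```python
from collections import defaultdict

# Toy event space: each 'recombination event' carries a probability and maps
# (surjectively) to a nucleotide sequence; nucleotide sequences in turn map
# to amino acid sequences. All values here are illustrative.
P_recomb = {'e1': 0.2, 'e2': 0.3, 'e3': 0.5}
pi_vdj   = {'e1': 'TGT', 'e2': 'TGT', 'e3': 'TGC'}   # events -> nt sequences
pi_nt2aa = {'TGT': 'C', 'TGC': 'C'}                  # nt -> aa (both code Cys)

def pushforward(P, mapping):
    """Extend a probability space along a surjective map by summing the
    probabilities of all preimages of each coarse-grained state."""
    Q = defaultdict(float)
    for state, p in P.items():
        Q[mapping[state]] += p
    return dict(Q)

Pgen_nt = pushforward(P_recomb, pi_vdj)    # probability over Omega_nt
Pgen_aa = pushforward(Pgen_nt, pi_nt2aa)   # probability over Omega_aa
```

The total measure of 1 is preserved at each coarse-graining step, as required.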

3.1 Too many states! The free energy problem

Despite Eq. 2.5's seeming simplicity, it can prove to be computationally very problem-
atic because of the number of recombination events that could generate a particular
sequence. This is the exact same problem that plagues much of statistical physics:
summing over all states to determine the partition function or a free energy can prove
to be computationally prohibitive if the only method of doing the summation is by
enumerating the states. Indeed, log(Pgen), a quantity we will look at repeatedly, can
even be thought of as a free energy. The reader may remember that this quantity, Pgen,
was required for the EM inference in the previous chapter (Sec. 2.3.2), so to do any sort
of inference, or to construct any sort of probabilistic model of V(D)J recombination,
the problem of enumerating all possible recombination events must be addressed.

In previous work, and in the inference procedures of Murugan et al. [2012] and
Marcou et al. [2018], the number of states to be summed over is controlled through
regularization. By regularization we mean that some procedure is used to
limit the number of recombination events considered to a manageable num-
ber. Fortunately, this is quite possible for nucleotide sequences. By only considering
gene templates V, (D), and J that have a sufficiently good alignment (e.g. Smith-
Waterman alignment), capping the number of deletions/insertions, and imposing cutoffs
on fractional probabilities and errors, it is feasible to reduce the number of recom-
bination events that correspond to a nucleotide sequence (i.e. the notation e|seq) to
the order of thousands or less. This makes it tractable, if still very computationally
intensive, to compute Pgen for nucleotide sequences. It must be noted that, for soft-
ware attempting to infer V(D)J models of arbitrary structure, this enumeration of
recombination events is very useful as it places no restrictions on the correlations the
model can consider.

However, this approach of exhaustive enumeration with some regularization is
computationally intractable for amino acid CDR3 sequences, let alone any kind of
coarse-grained alphabet of amino acids that might be more interesting functionally.
This can easily be seen from the fact that the number of possible nucleotide sequences
that translate to a particular amino acid sequence explodes exponentially with the
number of amino acids in the CDR3 region:

    |{σ s.t. nt2aa(σ) = a}| = ∏_{ai∈a} #codons(ai)    (3.2)

To put some perspective on these numbers, the average number of nucleotide
sequences that code for a mouse TRB CDR3 amino acid sequence is ∼ 2 billion,
and mouse TRB CDR3 sequences are significantly shorter than human TRB or
IGH. Even the heavily optimized and efficient IGoR software developed to do V(D)J
generative model inference [Marcou et al., 2018], which can compute the Pgen of
around 60 nucleotide sequences per CPU second, would take around 8500 CPU hours to
compute the Pgen of a single mouse TRB amino acid sequence. This is prohibitively
long if there is interest in analyzing repertoire datasets that can easily be of the order
of 10^5 unique sequences or larger. For this reason, much of the early work in this
field, and in this thesis, was restricted to the analysis of nucleotide sequences.
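The degeneracy count of Eq. 3.2 itself is trivial to evaluate; what is intractable is summing Precomb over all of those coding sequences. A short Python sketch (the example CDR3 is an arbitrary illustrative sequence; the degeneracies are those of the standard genetic code):

```python
# Number of nucleotide sequences coding for an amino acid CDR3 (Eq. 3.2):
# the product over positions of the codon degeneracy of each amino acid.
N_CODONS = {
    'A': 4, 'R': 6, 'N': 2, 'D': 2, 'C': 2, 'Q': 2, 'E': 2, 'G': 4,
    'H': 2, 'I': 3, 'L': 6, 'K': 2, 'M': 1, 'F': 2, 'P': 4, 'S': 6,
    'T': 4, 'W': 1, 'Y': 2, 'V': 4,
}

def n_coding_sequences(aa_seq):
    """Count the nucleotide sequences sigma with nt2aa(sigma) = aa_seq."""
    n = 1
    for aa in aa_seq:
        n *= N_CODONS[aa]
    return n

# Even a short, 13-residue CDR3 has millions of coding nucleotide sequences:
n_coding_sequences('CASSLGQAYEQYF')   # 1,769,472
```

The exponential growth with CDR3 length is what makes exhaustive enumeration hopeless at the amino acid level.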

While computing Pgen for amino acid sequences by way of enumerating recombi-
nation events is computationally intractable, this is not to say that the summation
is impossible. In this chapter we present a dynamic programming algorithm and
software, OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid se-
quences, available at https://github.com/zsethna/OLGA), that efficiently computes
Pgen not only for amino acid CDR3 sequences, but for in-frame nucleotide sequences as
well as sequences composed of coarse-grained/ambiguous amino acid alphabets and
motifs. Indeed, OLGA can sum over all possible recombination events of a mouse
TRB model in seconds (and can compute the Pgen of around 50 mouse TRB amino acid
sequences per CPU second). This work is detailed in the paper Sethna et al. [2018].
This algorithm does, however, require V(D)J generative models of the form of Eq. 2.3 or
2.4, and so loses the flexibility of being able to consider arbitrary model correlations.
The ability to compute Pgen at the amino acid and functional receptor level will
likely prove extremely useful, and we explore some example applications.

3.2 Dynamic Programming

OLGA is an algorithm that leverages ‘dynamic programming’ to avoid enumerating an
exponentially large number of states. Rather than give a formal definition of dynamic
programming, we show an example. Fortunately, physicists are already familiar with

one of the cleanest examples of dynamic programming, and one that truly shows the
computational effectiveness of such a technique: the discretized path integral. If we
have position x with N possible locations, discretized time t, and a Markov transition
matrix Rt(xi → xj) (which may depend on time), we can ask what is the probability
of starting at position x0 and ending at position xT at time T. If we define the
function

    Pt(x0, xi) = ∑_{x0, x(1), x(2), ..., x(t−1), xi} ∏_{t′=0}^{t−1} Rt′(x(t′) → x(t′+1))    (3.3)

we want PT(x0, xT). Now, one could list out all the paths that start at x0 and end
at xT, compute their weights, and sum. However, the number of paths increases
exponentially with t, so the computation time would explode exponentially as O(T ×
N^{T−1}) (T operations on each of N^{T−1} paths). Instead, it is computationally much
more efficient to sum up all the path weights to each position at each time step and
then update. In other words, we notice the recursion relation:

    Pt+1(x0, xi) = ∑_{x0, x(1), ..., x(t−1), x(t), xi} ∏_{t′=0}^{t} Rt′(x(t′) → x(t′+1))
                 = ∑_{x(t)} Rt(x(t) → xi) ∑_{x0, x(1), ..., x(t−1), x(t)} ∏_{t′=0}^{t−1} Rt′(x(t′) → x(t′+1))
                 = ∑_{x(t)} Rt(x(t) → xi) Pt(x0, x(t))    (3.4)

This can be written in vectorized notation by writing Pt(x0, x) as a column vector
with elements Pt(x0, xi):

    Pt+1(x0, x) = Rt Pt(x0, x)  ⇒  PT(x0, x) = RT−1 RT−2 · · · R1 R0 P0(x0, x)    (3.5)

where P0(x0, x) = I(x0). Thus, solving for PT(x0, xT) by using dynamic programming
requires only O(T × N^2) operations, a massive speedup from the O(T × N^{T−1}) op-
erations of the exhaustive enumeration of the paths. We have turned the summation
over all individual microstates (i.e. the paths) into a matrix expression with steps in
time. The algorithm, OLGA, that we developed to compute Pgen of nucleotide and
amino acid sequences from a generative model will analogously reduce the exponen-
tial blowup of exhaustive enumeration of recombination events down to polynomial
time by summing over matrix expressions based on positions in the sequence read.
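The recursion of Eqs. 3.4 and 3.5 is only a few lines of code. A minimal numpy sketch (the transition matrix used in the example is an arbitrary illustrative choice):

```python
import numpy as np

def path_probability(R_list, x0):
    """Distribution over end positions after T steps (Eq. 3.5), computed by
    the transfer-matrix recursion P_{t+1} = R_t P_t in O(T * N^2) operations
    rather than enumerating all N^(T-1) paths.

    R_list[t][j, i] = R_t(x_i -> x_j); columns should sum to 1.
    """
    N = R_list[0].shape[0]
    P = np.zeros(N)
    P[x0] = 1.0                  # P_0(x0, x) = I(x0)
    for R in R_list:
        P = R @ P                # one dynamic-programming update (Eq. 3.4)
    return P

# Example: 2 sites, a column-stochastic transition matrix, 5 time steps.
R = np.array([[0.9, 0.2],
              [0.1, 0.8]])
P_T = path_probability([R] * 5, x0=0)
assert np.isclose(P_T.sum(), 1.0)    # total probability is conserved
```

Summing the weights into each position at every step is exactly what turns the exponential path sum into a sequence of matrix-vector products.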

3.3 OLGA

We now describe how OLGA computes Eq. 2.5 without summing over exhaustively
enumerated recombination events, using dynamic programming. This algorithm re-
quires specific tailoring to the model structure, as the correlations have to be built
in explicitly, so the algorithm is slightly different for generative models of VDJ
(TCRβ/IGH, Eq. 2.3) and VJ (TCRα/IGL, Eq. 2.4) recombination. We will first
present the VDJ algorithm, and give the simpler algorithm for generative models of
VJ recombination afterwards.

Each recombination event implies an annotation of the amino acid CDR3 sequence,
(a1, . . . , aL), assigning a different origin to each nucleotide position (one of V, N1, D,
N2, or J, where N1 and N2 are the non-templated VD and DJ insertion segments,
respectively) that parses the sequence into 5 contiguous segments (see schematic in
Fig. 3.1).

The core principle of the method is to sum over the possible nucleotide locations of
the 4 boundaries between the 5 segments, x1, x2, x3, and x4, but in a recursive way
using matrix operations. This can be summarized in a compact matrix expression:

    Pgen(a1, . . . , aL) = ∑_{x1,x2,x3,x4} V_{x1} M^{x1}_{x2} ∑_D [ D(D)^{x2}_{x3} N^{x3}_{x4} J(D)^{x4} ]    (3.6)
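The structure of Eq. 3.6 can be illustrated with a toy scalar version, in which every vector/matrix entry is a plain number (in the real algorithm the weights for incomplete codons are small vectors/matrices over nucleotide identities; all values and gene names below are made up):

```python
import itertools

positions = range(7)      # toy boundary positions for a 6-nt CDR3
D_genes = ['D1', 'D2']

# Toy scalar weights standing in for V_{x1}, M^{x1}_{x2}, D(D)^{x2}_{x3},
# N^{x3}_{x4}, and J(D)^{x4} (illustrative constants, not a real model).
def V(x1):         return 0.10
def M(x1, x2):     return 0.05
def Dw(D, x2, x3): return 0.20
def Nw(x3, x4):    return 0.05
def J(D, x4):      return 0.30

def pgen_toy():
    """Nested boundary sum of Eq. 3.6 over x1 <= x2 <= x3 <= x4, with the
    sum over D germline choices done inside the boundary sum."""
    total = 0.0
    for x1, x2, x3, x4 in itertools.combinations_with_replacement(positions, 4):
        total += V(x1) * M(x1, x2) * sum(
            Dw(D, x2, x3) * Nw(x3, x4) * J(D, x4) for D in D_genes)
    return total
```

Note how the D-dependent factors are grouped so that the sum over D is performed inside the boundary sum, matching the bracket in Eq. 3.6.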


Figure 3.1: CDR3 indexing cartoon. Boxes correspond to nucleotides and are indexed
by integers. Each group of three boxes (identified by heavier boundary lines) corresponds
to an amino acid. The nucleotide positions x1, . . . , x4 identify the boundaries between
the different elements of the partition. The V, M, D(D), N, and J(D) matrices define
cumulated weights corresponding to each of the 5 elements.

However, to do this, we will need to define objects that accumulate the probabil-
ities of events to the left of a position x (i.e. up to x) and to the right of x (i.e. from
x + 1 on), which will require some notation.

3.3.1 Notation, 3′ and 5′ vectors

Suppose we have a CDR3 ‘amino acid’ sequence a = (a1, . . . , aL). By ‘amino acid’
sequence, we mean that each of the ‘amino acids’, ai, corresponds to some collection of
nucleotide triplets, or codons. We allow this mapping between ‘amino acids’, a, and
codons to be arbitrary at this point, and use the notation σ ∼ a if the codons in the
nucleotide sequence σ correspond to the codons allowed by the amino acid sequence
a. This allows us not only to recover the standard nucleotide translation map-
ping, π_nt2aa, when using the standard amino acid alphabet (e.g. TGTGCCAGCAGT
∼ π_nt2aa(TGTGCCAGCAGT) = CASS), but also provides a trivial extension to in-
clude in-frame nucleotide sequences (define an ‘amino acid’ symbol for each individual
codon) as well as coarser grained collections of amino acids. For example, all codons
that code for amino acids with a common chemical property, e.g. hydrophobicity or
charge, could be grouped into a single ‘amino acid’. In that formulation, (a1, . . . , aL)
would correspond to a sequence of symbols denoting that property. This could prove
to be very useful in constructing and assessing future coarse-grained models of recep-
tor-epitope affinities.

It will simplify the later expressions to be able to refer to a position x not only
by its nucleotide index, but by the corresponding amino acid index i, as well as by what
position x is in the codon reading from 5′ to 3′ (u) and what position x + 1 is in a
codon reading from 3′ to 5′ (u∗). This is shown graphically in Fig. 3.1. Explicitly, for
position xj:

    ij = ⌈xj/3⌉
    uj = xj − 3(ij − 1)
    u∗j = 3 − mod(uj, 3)    (3.7)
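Eq. 3.7 is easily checked in code; a one-function Python sketch:

```python
def codon_indices(x):
    """Amino acid index i_j, 5'->3' codon position u_j, and 3'->5' codon
    position u*_j for a (1-indexed) nucleotide position x_j, per Eq. 3.7."""
    i = -(-x // 3)            # ceil(x / 3)
    u = x - 3 * (i - 1)
    u_star = 3 - (u % 3)
    return i, u, u_star

# Position 11 (as in Fig. 3.1) is in amino acid 4, with u = 2 and u* = 1:
assert codon_indices(11) == (4, 2, 1)
# The only (u, u*) combinations are (1, 2), (2, 1), and (3, 3):
assert sorted({codon_indices(x)[1:] for x in range(1, 13)}) == [(1, 2), (2, 1), (3, 3)]
```

The three allowed (u, u∗) pairs are exactly the cases that determine the vector or scalar character of the 5′ and 3′ objects defined next.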

It is also crucial to introduce what we will call ‘5′ vectors’ and ‘3′ vectors’. A 5′ vector,
denoted with a subscript (e.g. Xx), accumulates weights for the sequence to the 5′ (left)
side of x (including the nucleotide position x), whereas a 3′ vector, denoted with a
superscript (e.g. Y^x), reflects the weights for the sequence to the 3′ (right) side of
x (excluding the nucleotide position x). Because we are dealing with amino acids,
which are encoded with codons made of 3 nucleotides, we need to keep track of
weights by the identity of the nucleotides at the beginning or the end of the codon.
This requires the definition of a 5′ vector (3′ vector) to depend on the value of u (u∗).

For the first nucleotide position in a codon, u = 1 (u∗ = 1), Xx (Y^x) must be
interpreted as a row (column) vector of 4 numbers indexed by σ = A, T, G, or C,
corresponding to the cumulated probability weight from the 5′/left (3′/right) side
that the nucleotide at position x (x + 1) takes value σ. If u = 2 (u∗ = 2), then Xx (Y^x)
is also a row (column) vector of 4 numbers indexed by nucleotide σ = A, T, G, or
C, but with a different interpretation: it corresponds to the cumulated probability
up to position x from the 5′/left side (from x + 1 from the 3′/right side), with the additional
constraint that the nucleotide at the last position in the codon, x + 1 (x), can take
value σ (the value is 0 otherwise). Lastly, if x (x + 1) is the last position in a codon,
i.e. u = 3 (u∗ = 3), the cumulative sequence terminates at the end of a codon and
we do not keep nucleotide information, so Xx (Y^x) is a scalar.

If we have a 5′ vector Xx that contains the accumulated weights up to position
x, and a 3′ vector Y^x that contains the weights from position x + 1 onwards, we will
want to ‘glue’ these sequence contributions together to get the total probability of
the sequence. This is indicated by the expression¹ XxY^x, which has a very convenient
structure. As the possible combinations of (u, u∗) are (1, 2), (2, 1), or (3, 3), we see that
the matrix multiplication XxY^x is one of two situations. First, if u = u∗ = 3, XxY^x is
just the scalar multiplication of the aggregate weights for the 5′ and 3′ sides. If (u, u∗)
= (1, 2) or (2, 1), then XxY^x is the dot product between a vector of weights indexed
by the nucleotides needed to complete the codon and the vector of weights indexed by the
completing nucleotide on the other side. In either case, the result is the total aggregate
weight of the sequence conditioned on the partition x, accurately reflecting the weight
of ‘gluing’ the possible sequences from the 5′/left side to the 3′/right side.

¹Please note that the resemblance of the expression XxY^x to a contraction over the position x in
Einstein notation should not be misinterpreted. The ‘contraction’ is over possible nucleotide identity
indices, not over the position index x.

This notion of sequence gluing also allows for the definition and interpretation of
matrices (e.g. R^x_y) with both 5′ and 3′ indices. A matrix R^x_y can be thought of as

‘gluing’ a new sequence segment (x to y) onto what an existing 5′ or 3′ vector describes.
For example:

    Xx R^x_y = Hy
    R^x_y Y^y = G^x    (3.8)

The matrix R^x_y can map from any value of u to any other (or any value of
u∗ to any other), and so has 9 possible combinations/interpretations based on the u
mapping, and can be a 4×4, 4×1, 1×4, or 1×1 matrix as a result.
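A minimal numpy sketch of this gluing, with all weights illustrative numbers: for (u, u∗) = (1, 2) or (2, 1) the two objects are length-4 vectors indexed by the nucleotide completing the codon and gluing is a dot product, while a matrix with both a 5′ and a 3′ index glues a new segment onto an existing vector as in Eq. 3.8:

```python
import numpy as np

X_x = np.array([0.1, 0.0, 0.2, 0.1])   # 5' vector, indexed by sigma = A, T, G, C
Y_x = np.array([0.5, 0.3, 0.0, 0.2])   # 3' vector, same indexing

# 'Gluing' across the codon boundary: a contraction over nucleotide
# identity, not over the position index x.
total_weight = X_x @ Y_x

# Gluing a new segment (x to y) with a matrix R^x_y (toy 4x4 weights):
R = np.full((4, 4), 0.25)
H_y = X_x @ R        # a new 5' vector at y   (X_x R^x_y = H_y)
G_x = R @ Y_x        # a new 3' vector at x   (R^x_y Y^y = G^x)
```

In the scalar case u = u∗ = 3 the same expressions reduce to ordinary multiplication of the aggregate weights.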

3.3.2 VDJ recombination: V, M, D, N, and J

Eq. 3.6 shows the summation over positions of a matrix expression, with the vec-
tors/matrices corresponding to the different VDJ contributions. The 5′ vector V_{x1} corre-
sponds to the cumulated probability of the V segment finishing at position x1; matrix
M^{x1}_{x2} is the probability of the VD insertion extending from x1 + 1 to x2; N^{x3}_{x4} is the
same for the DJ insertions; matrix D(D)^{x2}_{x3} corresponds to the weights of the D segment
extending from x2 + 1 to x3, conditioned on the D germline choice being D; the 3′ vector
J(D)^{x4} gives the weight of J segments starting at position x4 + 1, conditioned on the
D germline being D. This D dependency is necessary to account for the dependence
between the D and J germline segment choices [Murugan et al., 2012]. All the defined
vectors and matrices depend on the amino acid sequence (a1, . . . , aL), but we leave
this dependency implicit to avoid making the notation too cumbersome.

The entries of the vectors/matrices corresponding to the germline segments, V,
D(D), and J(D), can be calculated by simply summing over the probabilities of the
different germline segments compatible with the sequence (a1, . . . , aL), with conditions
on deletions to achieve the required segment length. The ∼ sign is generalized to
incomplete codons so that it returns a true value if there exists a codon completion
that agrees with the sequence a.

V contribution: V_{x1}

The 5′ vector V_{x1} aggregates the weights (PV and PdelV) from sequences originating
from the templated V genes, from the start of the CDR3 region up to position x1. As a
5′ vector, V_{x1} can be a 1×1 or 1×4 matrix depending on u1. sV is the sequence of the V
germline gene (read 5′ to 3′) from the conserved residue (generally the cysteine C) to
the end of the gene, plus the maximum number of reverse complementary palindromic
insertions appended to the 3′ end. lV is the length of sV.

    V_{x1}(σ) = ∑_V PV(V) PdelV(lV − x1|V) I(sV_{x1} = σ) I(sV_{1:x1} ∼ a_{1:i1})    if u1 = 1,
    V_{x1}(σ) = ∑_V PV(V) PdelV(lV − x1|V) I((sV_{1:x1}, σ) ∼ a_{1:i1})    if u1 = 2,
    V_{x1} = ∑_V PV(V) PdelV(lV − x1|V) I(sV_{1:x1} ∼ a_{1:i1})    if u1 = 3.    (3.9)

N1 contribution: M^{x1}_{x2}

This matrix includes the weights (PinsVD, p0, and ∏ SVD(mi|mi−1)) from the glu