Using model-based statistical inference to learn about evolution
Frederick “Erick” Matsen
http://matsen.fredhutch.org/
@ematsen
My group develops mathematical and computational tools for model-based statistical inference on continuous and discrete mathematical objects motivated by evolutionary sequence analysis of microbes and the immune system.
What is model-based statistical inference?
Modern technology gives us the ability to observe in great detail.
But very detailed observation is not the same as understanding.
To understand we need to simplify and abstract.
What abstractions do we have at our disposal?
3
x
$x$ is useful and we love it dearly!
$x$ allows us to describe knowledge in an implicit way: if
$f(x) = y$
then we can work towards solving for $x$.
Alternatively, one might be interested in taking the average of $f(x)$ between two values $a$ and $b$.
Define $\int_a^b f(x)\,dx$ as area
$\frac{1}{b - a} \cdot \int_a^b f(x)\,dx$ is average
[Figure: $f(x)$ between $a$ and $b$, with the average on $(a, b)$ marked]
Variables allow us to solve
[Figure: diagram relating $x$ and $y$]
Problem 1: given $y$, solve for $x$.
Problem 2: predict if a 10% bigger charge will hit the castle. Say the answer to this is $\mathrm{hit}_{10}(x)$, such that $\mathrm{hit}_{10}(x)$ is 1 if that will make the cannonball hit the castle, and 0 otherwise.
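The two deterministic problems can be sketched in Python. This is only an illustration under assumptions not in the slides: the cannonball follows ideal drag-free projectile motion, a 10% bigger charge is taken to mean a 10% higher muzzle speed, and the castle occupies a hypothetical interval of distances. The names `distance`, `solve_angle`, and `hit10` are made up for this sketch.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def distance(angle_deg, speed):
    """f(x): range of an ideal projectile fired at angle x (assumed model)."""
    return speed ** 2 * math.sin(2 * math.radians(angle_deg)) / G

def solve_angle(y, speed):
    """Problem 1: given distance y, solve f(x) = y for the low-angle solution."""
    s = G * y / speed ** 2
    if not 0 <= s <= 1:
        raise ValueError("no angle reaches this distance at this speed")
    return math.degrees(math.asin(s)) / 2

def hit10(angle_deg, speed, castle_near, castle_far, boost=1.10):
    """Problem 2: 1 if the boosted shot lands on the castle, 0 otherwise.

    Assumption: a 10% bigger charge gives a 10% higher muzzle speed.
    """
    d = distance(angle_deg, speed * boost)
    return 1 if castle_near <= d <= castle_far else 0

# Hypothetical numbers: a shot observed to land at 80 m with speed 40 m/s,
# and a castle spanning 90 m to 100 m.
x = solve_angle(y=80.0, speed=40.0)
print(round(x, 2), hit10(x, 40.0, castle_near=90.0, castle_far=100.0))
```

Note that $f(x) = y$ has two solutions in this model (a low and a high angle); the sketch returns only the low one.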
Variables allow us to solve
… in a deterministic framework.
Life is a probabilistic process.
How do we abstract probabilistic quantities?
X
Random variables $X$ abstract variables
A random variable doesn’t have a fixed value: we have to “ask” it for a value.
Random variables are capricious, but they are well defined behind their stochastic exterior.
Random variable sampling determined by distributions
Sometimes discrete:
P(heads) = 0.51, P(tails) = 0.49
Sometimes continuous:
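In code, “asking” a random variable for a value amounts to drawing a sample from its distribution. The sketch below uses the coin probabilities from the slide for the discrete case; the continuous case uses a normal distribution with a hypothetical mean and standard deviation, since the slide's continuous example is not recoverable from the transcript.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def ask_coin():
    """Discrete random variable with P(heads) = 0.51, P(tails) = 0.49."""
    return "heads" if rng.random() < 0.51 else "tails"

def ask_angle(mean=30.0, sd=2.0):
    """Continuous random variable; the normal parameters are hypothetical."""
    return rng.gauss(mean, sd)

# Each call may give a different value, but long-run behaviour is fixed
# by the distribution.
flips = [ask_coin() for _ in range(100_000)]
freq_heads = flips.count("heads") / len(flips)

angles = [ask_angle() for _ in range(10_000)]
mean_angle = sum(angles) / len(angles)

print(freq_heads, mean_angle)
```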
Working with random variables $X$:
We can solve for $X$ in “equations” like $f(X) \sim Y$, obtaining expressions such as $P(X \mid Y)$; this is called inference.
We can also average with respect to $X$:
$\int f(X)\,dP(X \mid Y)$
where now we are averaging out with respect to a probability.
Probabilistic approach to prediction
[Figure: diagram relating $X$ and $Y$]
$Y$: horizontal distance traveled by a cannonball (random variable)
$X$: cannon angle (inferred random variable)
Problem 1: given observed distribution $Y$, infer distribution of $X$.
Problem 2: find probability that a 10% bigger charge will hit castle.
1. Solve $f(X) = Y$ to get $P(X \mid Y)$.
2. Integrate $\int \mathrm{hit}_{10}(X)\,dP(X \mid Y)$.
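The two steps can be sketched numerically. This is a minimal illustration, not the method used in the talk: it assumes ideal projectile motion, Gaussian measurement noise on the observed distances, a uniform prior over angles below 45 degrees (so that the range is monotone in the angle), and that a 10% bigger charge means a 10% higher muzzle speed. All numbers and function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def distance(angle_deg, speed):
    """Assumed forward model f: ideal projectile range for a given angle."""
    return speed ** 2 * math.sin(2 * math.radians(angle_deg)) / G

def posterior_over_angle(observed, speed, noise_sd, grid):
    """Step 1: approximate P(X | Y) on a grid of angles.

    Uniform prior over the grid; Gaussian noise on each observed distance.
    """
    log_w = []
    for x in grid:
        mu = distance(x, speed)
        log_w.append(sum(-0.5 * ((y - mu) / noise_sd) ** 2 for y in observed))
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]

def prob_hit10(grid, post, speed, near, far, boost=1.10):
    """Step 2: integrate hit10(X) against P(X | Y) as a weighted sum."""
    return sum(p for x, p in zip(grid, post)
               if near <= distance(x, speed * boost) <= far)

grid = [i * 0.05 for i in range(1, 900)]    # angles from 0.05 to 44.95 degrees
observed = [78.0, 81.5, 80.2, 79.1, 82.0]   # hypothetical landing distances (m)
post = posterior_over_angle(observed, speed=40.0, noise_sd=3.0, grid=grid)
p_hit = prob_hit10(grid, post, speed=40.0, near=90.0, far=100.0)
print(p_hit)
```

The answer to Problem 2 is now a probability rather than a 0 or 1, because uncertainty about the angle is carried through the calculation.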
Biological experiments are measurements with uncertainty
[Figure: $X$ and $Y$ illustrated with DNA sequences, expression level of certain genes, and risk]
Model-based statistical inference ✓
We can solve for $X$ in “equations” like $f(X) \sim Y$, inferring an unknown distribution for $X$ (what can we learn about the angle of the cannon).
We can push uncertainty through an analysis using integrals like
$\int_a^b f(X)\,dP(X \mid Y)$.
(We don’t care what the angle of the cannon is really, we just want to know with what probability the shot is going to hit the castle!)
Now, what is model-based statistical inference on discrete mathematical objects?
Motivation: we would like to decide whether an individual has been superinfected, i.e. infected with a second viral variant in a separate event
[Figure: single infection versus superinfection]
Integrate out phylogenetic uncertainty
[Figure: sequence data $Y$ (CATTCTTGTACG, GTTCGGCGAAGA, ...) and unknown $X$]
To decide superinfection, we would like to calculate
$\int_S f(X)\,dP(X \mid Y)$
where $X$ is now a phylogenetic-tree-valued random variable.
Time to count your blessings.
■ Real numbers are equipped with a total order. ($3 < 4$)
■ Real numbers are equipped with a simply-computed distance that is compatible with the total order. ($|7 - 3| = 4$)
■ Real numbers form a continuum. ($2.9 < 2.95 < 3$)
We can thus define the integral
$\int_a^b f(x)\,dx$ for real-valued $a$ and $b$, and likewise
$\int_a^b f(X)\,dP(X \mid Y)$
Integrating over phylogenetic trees?
Phylogenetic trees have discrete topologies; there is no canonical distance between them, nor a natural total order.
But we still want to do inference and integration in this setting!
[Figure: sequence alignment (ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ...)]
Joint work with postdoc Chris Whidden.
Notion of proximity of trees?
Subtree-prune-regraft (rSPR) definition
[Figure: a tree on leaves 1 to 6, with the subtree containing leaves 2 and 3 pruned and regrafted elsewhere]
These trees are then distance 1 apart.
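A rooted SPR move can be sketched on trees written as nested tuples. This is an illustrative toy, not the software used in the work: the tuple representation, the helper names, and the two example trees are assumptions (the example is only loosely modelled on the slide's six-leaf figure). Leaves are strings and every internal node is a pair of subtrees.

```python
def canon(tree):
    """Canonical form: order the two children of every internal node."""
    if isinstance(tree, str):
        return tree
    left, right = canon(tree[0]), canon(tree[1])
    return tuple(sorted((left, right), key=repr))

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node of the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree):
            yield from paths(child, prefix + (i,))

def get(tree, path):
    """Return the node reached by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, new):
    """Return a copy of tree with the node at path swapped for new."""
    if not path:
        return new
    children = list(tree)
    children[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(children)

def rspr_neighbors(tree):
    """All trees one rooted SPR move away from tree (in canonical form)."""
    tree = canon(tree)
    found = set()
    for p in paths(tree):
        if not p:
            continue  # the whole tree cannot be pruned from itself
        subtree = get(tree, p)
        sibling = get(tree, p[:-1])[1 - p[-1]]
        rest = replace(tree, p[:-1], sibling)  # prune and suppress the parent
        for q in paths(rest):                  # regraft above every node
            regrafted = replace(rest, q, (get(rest, q), subtree))
            found.add(canon(regrafted))
    found.discard(tree)  # regrafting in the original place is not a move
    return found

# Hypothetical example: move the subtree (2, 3) from beside leaf 1 to beside leaf 4.
t1 = ((("1", ("2", "3")), "4"), ("5", "6"))
t2 = (("1", (("2", "3"), "4")), ("5", "6"))
print(canon(t2) in rspr_neighbors(t1))
```

Enumerating neighbors this way is what turns the set of tree topologies into a graph: two trees share an edge exactly when one move converts one into the other.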
Tree graph connected by rSPR moves
Tree inference bounces around graph
Probability is # of visits to nodes
Subset to high probability nodes
Node size is proportional to posterior probability; color shows distance to the highest probability tree.
The top 4096 trees for a data set
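The idea that probability is estimated by the number of visits can be sketched with a Metropolis-Hastings random walk on a small graph. This is a toy under stated assumptions: the four-node graph and its unnormalized weights are invented stand-ins for trees and their posterior weights, and the proposal simply picks a uniformly random neighbor.

```python
import random
from collections import Counter

def random_walk_mcmc(neighbors, weight, start, steps, seed=0):
    """Estimate node probabilities as the fraction of steps spent at each node."""
    rng = random.Random(seed)
    current = start
    visits = Counter()
    for _ in range(steps):
        proposal = rng.choice(neighbors[current])
        # Metropolis-Hastings ratio; dividing by the degrees corrects for
        # nodes having different numbers of neighbors.
        ratio = (weight[proposal] / len(neighbors[proposal])) / (
            weight[current] / len(neighbors[current]))
        if rng.random() < ratio:
            current = proposal
        visits[current] += 1
    return {node: visits[node] / steps for node in neighbors}

# Hypothetical graph: nodes stand in for trees, edges for single moves.
neighbors = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B", "D"], "D": ["B", "C"]}
weight = {"A": 1.0, "B": 4.0, "C": 3.0, "D": 2.0}  # unnormalized posterior

estimate = random_walk_mcmc(neighbors, weight, start="A", steps=200_000)
exact = {node: w / sum(weight.values()) for node, w in weight.items()}
print(estimate)
print(exact)
```

On a graph this small the walk mixes quickly; the point of studying the graph's structure is that on real tree spaces it may not.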
Graph effects matter
For more details:
Chris Whidden and FM. Quantifying MCMC exploration of phylogenetic tree space. Systematic Biology 2015.
… so what do we know about this graph?
Is the tree graph positively curved?
Is it flat?
Is it negatively curved?
[Figure: curvature versus SPR distance, for imbalanced and balanced trees]
Model-based statistical inference on discrete and continuous mathematical objects ✓
When we perform inference on $f(X) \sim Y$, we can have $X$ be something continuous, discrete, or continuous and discrete.
Discrete-ness brings special challenges; graphs are helpful.
Next: use model-based statistical inference to learn about adaptive immunity
Joint with Trevor Bedford (VIDD), Connor McCoy (now at Google), Vladimir Minin (UW Statistics), and Duncan Ralph (postdoc).
Data from Harlan Robins (PHS/Adaptive).
Jenner’s 1796 vaccine
A revolutionary advance.
Where are we 200 years later?
[Cartoon caption: “Just invented vaccines. I rock. LOL”]
Vaccine trials still take a long time and are very costly.
Vaccines manipulate the adaptive immune system
Current practice for trials:
1. Stimulate immune system
2. Battle-test immune system via pathogen exposure
What can we learn from antibody-making B cells without battle-testing?
Antibodies bind antigens
B cell diversification process
[Figure: V, D, and J genes undergo VDJ rearrangement, including erosion and non-templated insertion, to give a naive B cell; antigen, somatic hypermutation, and affinity maturation then lead to an experienced B cell]
Overall goal: reconstruct process
[Figure: observed sequences (ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ...), with reality compared to inference]
Why reconstruct B cell lineages?
1. Vaccine design
This one is really good. How can we elicit it?
2. Vaccine assay
3. Evolutionary analysis to learn about underlying mechanisms
Goal 1: how are antibodies “drafted”?
[Figure: observed sequences (ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ...), with reality compared to rearrangement groups]
“Solve” $f(X) \sim Y$, where
■ $f$ is a statistical model of recombination and maturation
■ $X$ are parameters of that model (including clusters)
■ $Y$ are antibody repertoire sequences
VDJ annotation problem: from where did each nucleotide come?
[Figure labels: somatic hypermutation; sequencing primer; sequencing error; 3’ V deletion; VD insertion; 5’ D deletion; 3’ D deletion; 5’ J deletion; DJ insertion. Stages: biological process, sequencing, inference]
This is a key first step in BCR sequence analysis.
Rich probabilistic models work
[Figure: frequency versus Hamming distance, labelled HTTN, for partis (k=5), partis (k=1), ighutil, iHMMunealign, igblast, and imgt]
Integrate out annotation uncertainty for better clustering
Goal 2: how are antibodies “revised”?
Estimate per-residue level of natural selection $\omega = dN/dS$ on receptor sequences from healthy individuals.
■ Large $\omega$: diversifying sites
■ $\omega$ near 1: neutral sites
■ Small $\omega$: purifying sites
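The $dN/dS$ ratio rests on telling nonsynonymous codon changes (which alter the amino acid) from synonymous ones (which do not). A minimal sketch of that classification, using the standard genetic code, is below; the function name `classify` is made up for illustration, and a real $\omega$ estimate additionally requires a model of how often each kind of change is expected under neutrality.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify(codon_a, codon_b):
    """Label a codon change by whether it alters the encoded amino acid."""
    if codon_a == codon_b:
        return "identical"
    if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
        return "synonymous"
    return "nonsynonymous"

# The two codon pairs shown on the slide.
print(classify("CCA", "CCT"))  # both encode Pro
print(classify("ACC", "ATC"))  # Thr versus Ile
```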
[Figure: for selection, CCA and CCT (Pro, Pro) differ synonymously, while ACC and ATC (Thr, Ile) differ nonsynonymously; in antibodies, some mutations (AAC/AAG, GTG/GTC) are more likely and others less likely]
Solution: use “out-of-frame” sequences to determine neutral mutation rate.
[Figure labels: antigen; light chain; purifying, neutral, diversifying]
Conclusion
■ We like to “solve equations” like $f(X) \sim Y$, where $X$ and $Y$ are random variables.
■ We especially like the case when $Y$ is sequence data and $X$ is something weird.
■ We can use these tools to learn about B cell receptor sequence evolution.
Next steps: phylogenetics
■ Understand the impact of data on curvature
■ Extend work to other models of tree space
■ Use understanding to design biased proposals that don’t get stuck
■ Implement phylogenetic algorithms that can update trees given more sequences
■ Continue building community with phyloseminar.org and phylobabble.org
Next steps: B cells
■ Learn more about the mutation process in B cell maturation to better reconstruct ancestral sequences; evolutionary dynamics
■ Etiology of Burkitt’s lymphoma
■ Origin of protective antibodies; optimization of vaccination strategies
■ Watching immune repertoires evolve through time
Wish I had time to talk about
■ Evolution of innate immunity & viral antagonists; origin of SIVcpz
■ Founder HIV sequence identification for sieve analysis
■ Human microbiome
■ Simian foamy virus variation; innate immune defense
■ HIV superinfection
■ Drug resistance mutations
Thank you to my group members
Thank you to the Fred Hutch community
■ Brilliant students, postdocs, and staff scientist collaborators
■ Computational biology program, esp. “scouts” and Marty
■ Fantastic admin support: Sara, Melissa, and Anissa
■ Fantastic computing support: esp. Dirk, Carl, Erik, and Michael
■ fredhutch.io supporters: Katie P, Dan G, and Garnet
■ Patience with my meddling: Larry, Myra, Jon C