Applied bandits: Supporting health-related research
Audrey Durand
July 2, 2019
We are going to talk about
Bandits with structure → Neuroscience research
Application to microscopy imaging parameters
Bandits with contexts → Cancer research
Application to adaptive treatment allocation
Stochastic bandits
For each episode $t$:
• Select an action $k_t \in \{1, 2, \dots, K\}$
• Observe outcome $r_t \sim D(\mu_{k_t})$
Actions $1, 2, 3, \dots, K$ with unknown expected rewards $\mu_1, \mu_2, \mu_3, \dots, \mu_K$
Goal: Maximize expected rewards
$k^\star = \mathrm{argmax}_{k \in \{1, 2, \dots, K\}} \mu_k$
Minimize regret $R(T) = \sum_{t=1}^{T} (\mu_{k^\star} - \mu_{k_t})$
Exploration/Exploitation trade-off
Exploit: Potentially minimize regret
• $k_t = \mathrm{argmax}_{k \in \{1, 2, \dots, K\}} \hat{\mu}_k$
Explore: Gain information
Many strategies (a sketch of two of them follows):
• $\epsilon$-Greedy
• Optimism in the face of uncertainty (UCB)
• Thompson Sampling
• Best Empirical Sampled Average (BESA)
Theory shows sublinear regret under proper assumptions
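As a concrete illustration, here is a minimal sketch of two of these strategies on a synthetic Bernoulli bandit (the arm means, horizon, and $\epsilon$ are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 4, 1000
true_means = rng.uniform(size=K)  # unknown in a real application

# epsilon-Greedy: exploit the best empirical mean, explore uniformly w.p. epsilon
counts, sums = np.zeros(K), np.zeros(K)
epsilon = 0.1
for t in range(T):
    if rng.random() < epsilon or counts.min() == 0:
        k = int(rng.integers(K))                 # explore
    else:
        k = int(np.argmax(sums / counts))        # exploit
    r = rng.binomial(1, true_means[k])           # observe outcome r_t ~ D(mu_{k_t})
    counts[k] += 1
    sums[k] += r

# Thompson Sampling with a Beta(1, 1) prior on each Bernoulli arm:
# sample a mean from each posterior and play the argmax
alpha, beta = np.ones(K), np.ones(K)
for t in range(T):
    k = int(np.argmax(rng.beta(alpha, beta)))
    r = rng.binomial(1, true_means[k])
    alpha[k] += r
    beta[k] += 1 - r
```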
In practice
We cannot compute the regret $\mathbb{E}\left[\sum_{t=1}^{T} (\mu_{k^\star} - \mu_{k_t})\right]$:
• Instead we minimize cumulative bad events, e.g. system failures, fractures, patient dropout
• Or we maximize cumulative good events, e.g. clicks, minutes spent on website, lives saved
We need to face constraints and challenges specific to applications
[Figure: expected reward $\mathbb{E}[r]$ as a function of actions $x \in [0, 1]$]
Many actions → Structured bandits
Expected reward is a function of the action features
$f: \mathcal{X} \to \mathbb{R}$
For each episode $t$:
• Select an action $x_t \in \mathcal{X}$
• Obtain a reward $r_t \sim \mathcal{D}(f(x_t))$
Goal: Maximize rewards
Find $x^\star = \mathrm{argmax}_{x \in \mathcal{X}} f(x)$
Capture structure: Linear model
• Unknown $\theta \in \mathbb{R}^d$
• Mapping $\phi: \mathcal{X} \to \mathbb{R}^d$
• $f(x) = \langle \phi(x), \theta \rangle$
[Figure: actions in $\mathcal{X}$ mapped through $\phi(\cdot)$ to a feature space where the expected reward is linear]
Online function approximation
• Unknown $\theta \in \mathbb{R}^d$
• $f(x) = \langle \phi(x), \theta \rangle$
• $x^\star = \mathrm{argmax}_{x \in \mathcal{X}} \langle \phi(x), \theta \rangle$
For each episode $t$:
• Select an action $x_t \in \mathcal{X}$
• Observe outcome $y_t = f(x_t) + \xi_t$ with noise $\xi_t \sim \mathcal{N}(0, \sigma^2)$
Minimize $\mathbb{E}\left[\sum_{t=1}^{T} f(x^\star) - f(x_t)\right]$
[Figure: estimate of $f$ over the action space and the location of $x^\star$]
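A minimal sketch of this setup with regularized least squares on a made-up one-dimensional action space (the feature map and all constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam, sigma = 3, 1.0, 0.1
theta = rng.normal(size=d)            # unknown in a real application

def phi(x):
    # hypothetical feature map for a scalar action
    return np.array([1.0, x, x ** 2])

# Observe noisy outcomes y_t = <phi(x_t), theta> + xi_t at random actions
X = rng.uniform(size=20)
Phi = np.stack([phi(x) for x in X])
y = Phi @ theta + rng.normal(scale=sigma, size=len(X))

# Regularized least-squares estimate of theta
theta_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Greedy action over a candidate grid
grid = np.linspace(0, 1, 101)
x_best = grid[np.argmax([phi(x) @ theta_hat for x in grid])]
```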
Kernel regression
Recall $\phi: \mathcal{X} \to \mathbb{R}^d$, where $d$ can be very large!
• Kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle$
• Gaussian prior $\theta \sim \mathcal{N}_d(0, \Sigma)$ with $\Sigma = \frac{\sigma^2}{\lambda} I$ for $\lambda > 0$
$\mathbf{K}_n = [k(x_i, x_j)]_{1 \le i,j \le n}$ and $\mathbf{k}_n(x) = [k(x, x_i)]_{1 \le i \le n}$
$\mathbb{P}(f \mid x_1, \dots, x_n, y_1, \dots, y_n) \sim \mathcal{N}\left([f_n(x)]_{x \in \mathcal{X}}, [k_n(x, x')]_{x, x' \in \mathcal{X}}\right)$
$f_n(x) = \mathbf{k}_n(x)^\top (\mathbf{K}_n + \lambda I)^{-1} \mathbf{y}_n$
$k_n(x, x') = k(x, x') - \mathbf{k}_n(x)^\top (\mathbf{K}_n + \lambda I)^{-1} \mathbf{k}_n(x')$
For $\lambda = \sigma^2$ → Gaussian process (Rasmussen and Williams, 2006)
• Example: Pointwise posterior mean and standard deviation
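The posterior formulas above translate directly to code; a minimal numpy sketch with a squared-exponential kernel and toy data (the kernel choice and lengthscale are assumptions):

```python
import numpy as np

def rbf(a, b, ell=0.2):
    # Squared-exponential kernel k(x, x') for scalar inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def posterior(X, y, Xq, lam=0.01):
    # f_n(x)     = k_n(x)^T (K_n + lam I)^{-1} y_n
    # k_n(x, x') = k(x, x') - k_n(x)^T (K_n + lam I)^{-1} k_n(x')
    K = rbf(X, X) + lam * np.eye(len(X))
    kq = rbf(Xq, X)
    mean = kq @ np.linalg.solve(K, y)
    cov = rbf(Xq, Xq) - kq @ np.linalg.solve(K, kq.T)
    return mean, cov

rng = np.random.default_rng(2)
X = rng.uniform(size=10)                              # observed inputs
y = np.sin(6 * X) + rng.normal(scale=0.1, size=10)    # toy noisy outputs
Xq = np.linspace(0, 1, 101)                           # query locations
mean, cov = posterior(X, y, Xq)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))       # pointwise std deviation
```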
Streaming kernel regression
• Next input location $x_t$ is selected based on the $t - 1$ past observations
• Many bandit algorithm variants, e.g. Kernel UCB, Kernel TS, GP-UCB, GP-TS (a sketch below)
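A minimal sketch of one such variant, GP-UCB-style selection over a discretized action space (the objective, noise level, and exploration weight $\beta$ are illustrative assumptions):

```python
import numpy as np

def rbf(a, b, ell=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(3)
f = lambda x: np.sin(6 * x)          # stands in for the unknown objective
grid = np.linspace(0, 1, 101)        # candidate actions
lam, beta = 0.01, 2.0                # regularization and exploration weight
X = np.array([rng.uniform()])        # first action chosen arbitrarily
y = np.array([f(X[0]) + rng.normal(scale=0.1)])

for t in range(30):
    # Posterior from the past observations (the dataset grows each episode)
    K = rbf(X, X) + lam * np.eye(len(X))
    kq = rbf(grid, X)
    mean = kq @ np.linalg.solve(K, y)
    var = np.clip(1.0 - np.sum(kq * np.linalg.solve(K, kq.T).T, axis=1), 0.0, None)
    # Optimism in the face of uncertainty: play the upper confidence bound
    x_next = grid[int(np.argmax(mean + beta * np.sqrt(var)))]
    X = np.append(X, x_next)
    y = np.append(y, f(x_next) + rng.normal(scale=0.1))
```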
Let’s apply those bandits!
Optimizing super-resolution imaging parameters
Joint work with:
• Flavie Lavoie-Cardinal • Theresa Wiesner • Anthony Bilodeau • Paul De Koninck
• Louis-Émile Robitaille • Marc-André Gardner • Christian Gagné
D., Wiesner, Gardner, Robitaille, Bilodeau, Gagné, De Koninck, and Lavoie-Cardinal (Nature Communications, 2018)
Observing structures at the nanoscale (Hell and Wichmann, 1994)
Problem
Biology: The optimal parameters are not always the same
Typical strategy:
• Split samples in two groups A and B
• Find good parameters on group A
• Perform imaging task on group B
From abberior-instruments.com
Structured bandit problem
Find good parameters during the imaging task:
• Maximize the acquisition of useful images
• Minimize trials of poor parameters
[Diagram: loop between imaging parameters and feedback]
→ Identify the best parameters
→ Explore wisely
Optimizing image quality
Recall goal: Maximize the acquisition of images useful to researchers
[Diagram: Thompson Sampling maps imaging parameters to an image quality score]
Thompson Sampling for selecting imaging parameters
[Figure: objective vs. parameter after N = 3, 10, and 29 observations, showing the true function f, the posterior mean f_N, a sampled function f̃, the previous points, and the next location]
What is good image quality?
Avoiding images like these:
Getting more like these:
But imaging is a destructive process…
Trade-off: image quality vs. photobleaching
[Diagram: closed loop: parameter selection → images → online analysis → quality score and % bleach → outcome options]
Thompson Sampling for generating outcome options
• One kernel regression model $\hat{f}_i$ per objective $i$
• Sample one function $\tilde{f}_i$ per objective $i$
• Option $\tilde{f}(x)$ at parameter $x$: Concatenate $\tilde{f}_i(x)$ for all $i$ (see the sketch after the figure)
[Figure: sampled functions at N = 3 for the quality and photobleaching objectives; e.g. the option at parameter 0 is $\tilde{f}(0) = (0.75, 0.65)$]
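A minimal sketch of generating one outcome option, assuming a kernel regression model per objective and toy data for the quality and photobleaching scores:

```python
import numpy as np

def rbf(a, b, ell=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def sample_function(X, y, grid, rng, lam=0.01):
    # Draw one function from the kernel regression posterior on the grid
    K = rbf(X, X) + lam * np.eye(len(X))
    kq = rbf(grid, X)
    mean = kq @ np.linalg.solve(K, y)
    cov = rbf(grid, grid) - kq @ np.linalg.solve(K, kq.T)
    return rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(grid)))

rng = np.random.default_rng(4)
grid = np.linspace(0, 1, 50)          # candidate imaging parameters
X = rng.uniform(size=3)               # N = 3 parameters tried so far
y_quality = rng.uniform(size=3)       # toy quality scores
y_bleach = rng.uniform(size=3)        # toy photobleaching scores

# One sampled function per objective; the option at parameter grid[i]
# is the concatenation (f~_quality(grid[i]), f~_bleach(grid[i]))
f_q = sample_function(X, y_quality, grid, rng)
f_b = sample_function(X, y_bleach, grid, rng)
options = np.stack([f_q, f_b], axis=1)
```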
Presenting estimates to the expert
• Exploration/Exploitation in the cloud!
• Expert acts as an argmax on the preference function
[Figure: cloud of sampled options in the (image quality, photobleaching) plane]
Experiments on neuronal imaging
Three parameters (1000 configurations):
• Excitation laser power
• Depletion laser power
• Duration of imaging per pixel
Different imaging targets:
• Neuron: Rat neuron
• PC12: Rat tumor cell line
• HEK293: Human embryonic kidney cells
Acquire two STED images with ↑ 1st STED quality and ↓ photobleaching
Acquire good images and control photobleaching
[Figure: quality and photobleaching outcomes; an arrow marks a non-super-resolution acquisition]
Sublinear regret, as suggested by theory
[Figure: regret curves for the different imaging targets: Neuron (rat neuron), PC12 (rat tumor cell line), HEK293 (human embryonic kidney cells)]
Fully automated process
[Diagram: closed loop: parameter selection → images → online analysis → quality score and % bleach → outcome options]
Fully automated imaging
[Figure: automatically acquired image sequence; the bottom left corner is not super-resolution; arrows mark the beginning and the end of the optimization]
Towards the next application: Getting closer to the patient
Randomized trials
[Diagram: patients are randomized into Option A and Option B over time, followed by comparison & decision]
Adaptive trials
Dynamically adapt the study design based on previous observations
[Diagram: interim analyses classify the trial as effective, favorable, promising, unfavorable, or futile]
Writing down the setting…
Treatments: $1, 2, 3, \dots, K$
Probability of effectiveness: $\mu_1, \mu_2, \mu_3, \dots, \mu_K$
For each patient $t$:
• Select a treatment $k_t \in \{1, 2, \dots, K\}$
• Observe outcome $r_t \sim D(\mu_{k_t})$
This is stochastic bandits! (Thompson, 1933)
In the absence of a one-size-fits-all strategy
Treatments: $1, 2, 3, \dots, K$
Probability of effectiveness now depends on the context (e.g. two patient profiles A and B):
$\mu_{1,A}, \mu_{2,A}, \mu_{3,A}, \dots, \mu_{K,A}$ and $\mu_{1,B}, \mu_{2,B}, \mu_{3,B}, \dots, \mu_{K,B}$
You can treat them as independent bandit problems!
What happens if the number of contexts grows large?
Contextual bandits
Exploit the underlying structure of the context space
Reward = treatment effect
[Figure: expected reward $\mathbb{E}[r]$ vs. context for actions 1 to 4, with the values $f_1(0.6), f_2(0.6), f_3(0.6), f_4(0.6)$ highlighted at context 0.6]
Contextual bandits
Expected reward of action $k$ is a function $f_k$ of the context features
$f_k: \mathcal{S} \to \mathbb{R}$
For each episode $t$:
• Observe a context $s_t \sim \Pi$
• Select an action $k_t \in \{1, 2, \dots, K\}$
• Observe a reward $r_t \sim D(f_{k_t}(s_t))$
Goal: Maximize rewards
Find $k_t^\star = \mathrm{argmax}_{k \in \{1, 2, \dots, K\}} f_k(s_t)$
[Figure: reward functions of actions 1 to 4 over the context space]
Online function approximation
• Unknown $\theta_k \in \mathbb{R}^d$ for each action $k$
• $f_k(s) = \langle \phi(s), \theta_k \rangle$
• $k_t^\star = \mathrm{argmax}_{k \in \{1, 2, \dots, K\}} \langle \phi(s_t), \theta_k \rangle$
For each episode $t$:
• Observe a context $s_t \sim \Pi$
• Select an action $k_t \in \{1, 2, \dots, K\}$
• Observe reward $r_t = f_{k_t}(s_t) + \xi_t$ with noise $\xi_t \sim \mathcal{N}(0, \sigma^2)$
Minimize $\mathbb{E}\left[\sum_{t=1}^{T} f_{k_t^\star}(s_t) - f_{k_t}(s_t)\right]$
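A minimal sketch of this contextual setting with one linear model per action and Thompson-Sampling-style selection (the features, priors, and constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
K, d, lam, sigma, T = 4, 3, 1.0, 0.1, 500
theta = rng.normal(size=(K, d))          # unknown in a real application

def phi(s):
    # hypothetical feature map for a scalar context
    return np.array([1.0, s, s ** 2])

# Per-action Gaussian posteriors maintained via precision matrices
A = np.stack([lam * np.eye(d) for _ in range(K)])
b = np.zeros((K, d))
for t in range(T):
    s = rng.uniform()                    # observe context s_t ~ Pi
    x = phi(s)
    # Sample one theta_k per action from its posterior, play the argmax
    samples = [rng.multivariate_normal(np.linalg.solve(A[k], b[k]),
                                       sigma ** 2 * np.linalg.inv(A[k]))
               for k in range(K)]
    k = int(np.argmax([x @ th for th in samples]))
    r = x @ theta[k] + rng.normal(scale=sigma)
    A[k] += np.outer(x, x)               # update the chosen action's posterior
    b[k] += r * x
```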
Adaptive treatment allocation for mice trials
Joint work with:
• Georgios D. Mitsis • Joelle Pineau
• Charis Achilleos • Demetris Iacovides • Katerina Strati
D., Achilleos, Iacovides, Strati, Mitsis, and Pineau (MLHC 2018)
Data acquisition problem
• Mice with induced cancer tumours (squamous cell carcinoma)
• Treatment options: 5FU, Imiquimod, 5FU+Imiquimod, None
• Treatment allocation twice a week
• Disease stage measured as tumour volume
Which treatment should be allocated to patients with cancer given the stage of their disease?
Phase 1: Randomized allocation (only exploration)
• 6 mice
Processing a mouse:
• 2x/week:
– Measure volume of tumours
– If all tumours are below a critical level
» Randomly assign one of the four treatment options
– Otherwise terminate this animal
Phase 1: Randomized allocation (only exploration)
Result: 12 usable tumours
• 163 triplets (tumour volume, treatment, next tumour volume)
[Figure: histogram of the number of measurements per volume (mm³); tumour volume trajectories over roughly 100 days]
Phase 1: Randomized allocation (only exploration)
Exponential tumour growth → Few data collected for larger tumours
Phase 2: Adaptive trial (exploration/exploitation)
• 10 mice
• Select treatment 2x/week
Adaptive clinical trial:
• Do not fix the experiment design a priori
• Adapt treatment allocation based on previous observations
• Favor selection of better treatments
• Reduce exposure to less effective treatments
Contextual bandit problem
Improve treatment allocation online:
• Maximize the amount of acquired data
• Minimize trials of poor treatments
[Diagram: context = tumour volume before treatment; action = treatment; outcome = tumour volume after treatment]
→ Identify the best action given the context
→ Explore wisely
Alert: Contexts are not independent of actions!
Recall contextual bandits:
For each episode $t$:
• Observe a context $s_t \sim \Pi$
• Select an action $k_t \in \{1, 2, \dots, K\}$
• Obtain a reward $r_t \sim \mathcal{D}(f_{k_t}(s_t))$
Beware of traps!
Reward shaping
A natural reward definition could be $r_t = s_t - s_{t+1}$ (tumour volume reduction)
But then controlling the disease, i.e. maintaining the tumour volume constant, has the same value ($r_t = 0$) regardless of the tumour volume
What we used instead: $r_t = -s_{t+1}$ (see the toy comparison below)
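A toy illustration of the difference (the numbers are made up): with the reduction-based reward, keeping a small tumour constant and keeping a large tumour constant both earn 0, while the reward we used prefers the small one.

```python
def reward_reduction(s_t, s_next):
    return s_t - s_next    # tumour volume reduction

def reward_used(s_t, s_next):
    return -s_next         # negative resulting volume

# Maintaining a 10 mm^3 tumour vs. a 200 mm^3 tumour:
print(reward_reduction(10, 10), reward_reduction(200, 200))  # 0 0
print(reward_used(10, 10), reward_used(200, 200))            # -10 -200
```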
Exploration/Exploitation strategy
Best Empirical Sampled Average: BESA (Baransi et al., 2014)
• Fair comparison of empirical estimators
• Opportunities for actions to show how good they are
Example with two actions (binary outcomes):

Sample #   Action 1   Action 2
1          0          1
2          1          0
3          1          1
4          0
…          …
100        1

To compare the expected rewards $\hat{\mu}_1$ and $\hat{\mu}_2$ fairly, Action 1's 100 observations are subsampled down to Action 2's count before comparing empirical means (a sketch follows).
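A minimal sketch of the two-action BESA comparison described above, on toy Bernoulli outcomes (the tie-breaking rule toward the less-sampled action follows Baransi et al., 2014):

```python
import numpy as np

rng = np.random.default_rng(6)

def besa_choice(obs1, obs2):
    # Subsample the action with more observations down to the other's count,
    # then compare the empirical means of the two (sub)samples.
    n = min(len(obs1), len(obs2))
    m1 = rng.choice(obs1, size=n, replace=False).mean()
    m2 = rng.choice(obs2, size=n, replace=False).mean()
    if m1 == m2:
        return 0 if len(obs1) <= len(obs2) else 1   # tie: less-sampled action
    return 0 if m1 > m2 else 1

obs1 = rng.binomial(1, 0.6, size=100)   # Action 1: 100 observations
obs2 = rng.binomial(1, 0.7, size=3)     # Action 2: 3 observations
k = besa_choice(obs1, obs2)             # 0 -> Action 1, 1 -> Action 2
```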
GP BESA: Extension to contextual bandits
[Figure: treatment effect vs. tumour volume; true function f and posterior estimate f̃_N fitted on subsamples of N = 50, 20, 10, and 5 observations]
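A minimal sketch of the GP BESA idea as shown on this slide: subsample each action's history to a common size, fit a kernel regression model per action on its subsample, and play the action whose posterior mean is highest at the current context (the kernel and hyperparameters are assumptions):

```python
import numpy as np

def rbf(a, b, ell=30.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_mean(X, y, sq, lam=1.0):
    # Posterior mean of the treatment effect at context sq
    K = rbf(X, X) + lam * np.eye(len(X))
    return float(rbf(np.array([sq]), X) @ np.linalg.solve(K, y))

def gp_besa_choice(histories, s, rng):
    # histories[k] = (contexts, rewards) observed so far for action k
    n = min(len(X) for X, _ in histories)
    means = []
    for X, y in histories:
        idx = rng.choice(len(X), size=n, replace=False)
        means.append(gp_mean(X[idx], y[idx], s))
    return int(np.argmax(means))

rng = np.random.default_rng(7)
histories = [(rng.uniform(0, 300, size=100), rng.uniform(0, 300, size=100)),
             (rng.uniform(0, 300, size=5), rng.uniform(0, 300, size=5))]
k = gp_besa_choice(histories, s=75.0, rng=rng)
```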
Example of exploration
Action 1: 100 observations, Action 2: 5 observations
[Figure: treatment effect vs. tumour volume at context s_t; true functions, predictions from all data (A1 with N = 100 vs. A2 with N = 5), and predictions from a subsample (A1 subsampled to N = 5 vs. A2 with N = 5)]
Experimental setting
• 10 mice total
• Processing a group of mice (2-3 subjects)
– Twice a week:
» For each mouse in the group:
• Measure tumour → reward for last treatment
• Select treatment to assign now
– Until death/sacrifice of all mice in group
• Update algorithm with tuples of (volume, treatment, next volume)
• Start next group
Animals live longer
Algorithm updated after each group
[Figure: life duration (days) under None, Random, 5-FU, and GP BESA; life duration per group A, B, C, and D]
Evolution of tumour volumes
Slowing the exponential growth
[Figure: tumour volume (mm³) over days for groups A, B, C, and D, next to the phase 1 trajectories for comparison]
A better state space covering
Using data in a next phase:
• More information on the tumour growth process
• 40% more data points of volume > 70 mm³
[Figure: histogram of the number of measurements per volume (mm³)]
Evolution of the policy
[Figure: probability of selecting None, 5-FU, Imiquimod, or 5-FU + Imiquimod as a function of tumour volume (mm³), for groups A, B, C, and D]
Conclusion + Take-homes
• Bandits are a nice framework for theory, but they also have applications! ☺
• We often break theoretical guarantees in practice ☹
• How can we design algorithms that don't make unrealistic assumptions?
• Other aspects important in practice were not considered here, e.g.
– Fairness in exploration
– Safe exploration
Huge thanks again to my collaborators!
Questions?…and more