![Page 1: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/1.jpg)
Applied bandits: Supporting health-related research
Audrey Durand
July 2, 2019
![Page 2: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/2.jpg)
We are going to talk about
Bandits with structure → Neuroscience research
Application to microscopy imaging parameters
Bandits with contexts → Cancer research
Application to adaptive treatment allocation
![Page 3: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/3.jpg)
Stochastic bandits
For each episode $t$:
• Select an action $k_t \in \{1, 2, \dots, K\}$
• Observe outcome $r_t \sim D(\mu_{k_t})$
Actions: $1, 2, 3, \dots, K$
Expected rewards: $\mu_1, \mu_2, \mu_3, \dots, \mu_K$
![Page 4: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/4.jpg)
Stochastic bandits
Goal: Maximize expected rewards
$k^\star = \arg\max_{k \in \{1, 2, \dots, K\}} \mu_k$
Minimize $R(T) = \sum_{t=1}^{T} (\mu_{k^\star} - \mu_{k_t})$
![Page 5: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/5.jpg)
Exploration/Exploitation trade-off
Exploit: Potentially minimize regret
• $k_t = \arg\max_{k \in \{1, 2, \dots, K\}} \hat{\mu}_k$
Explore: Gain information
Many strategies:
• $\epsilon$-Greedy
• Optimism in the face of uncertainty (UCB)
• Thompson Sampling
• Best Empirical Sampled Average (BESA)
Theory shows sublinear regret under proper assumptions
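As an illustration of one strategy from this list, here is a minimal sketch of Thompson Sampling on a Bernoulli bandit. The Beta(1, 1) priors, the arm means, and the horizon are assumptions chosen for the example; they do not come from the slides. The pseudo-regret it accumulates is the sum of the gaps between the best mean and the mean of the arm played.

```python
import random

def thompson_sampling(means, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns (pull counts, pseudo-regret)."""
    rng = random.Random(seed)
    n_arms = len(means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    best_mean = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Sample one plausible mean per arm from its Beta posterior.
        samples = [rng.betavariate(successes[k] + 1, failures[k] + 1)
                   for k in range(n_arms)]
        k_t = max(range(n_arms), key=lambda k: samples[k])
        # Simulated Bernoulli outcome (the true means are hypothetical).
        reward = 1 if rng.random() < means[k_t] else 0
        successes[k_t] += reward
        failures[k_t] += 1 - reward
        pulls[k_t] += 1
        regret += best_mean - means[k_t]
    return pulls, regret

pulls, regret = thompson_sampling([0.2, 0.5, 0.7], horizon=2000)
print(pulls, round(regret, 1))
```

The pseudo-regret can only be computed here because the simulation knows the true means, which is exactly what is unavailable in a real application.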
![Page 6: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/6.jpg)
In practice
We cannot compute regret: $\mathbb{E}\sum_{t=1}^{T} (\mu_{k^\star} - \mu_{k_t})$
• Instead we minimize cumulative bad events, e.g. system failures, fractures, patient dropout
• Or we maximize cumulative good events, e.g. clicks, minutes spent on website, lives saved
![Page 7: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/7.jpg)
In practice
We need to face constraints and challenges specific to applications
![Page 8: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/8.jpg)
Many actions → Structured bandits
Expected reward is a function of the action features
$f: \mathcal{X} \mapsto \mathbb{R}$
For each episode $t$:
• Select an action $x_t \in \mathcal{X}$
• Obtain a reward $r_t \sim \mathcal{D}(f(x_t))$
[Figure: expected reward versus action, with actions in [0, 1]]
![Page 9: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/9.jpg)
Many actions → Structured bandits
Goal: Maximize rewards. Find $x^\star = \arg\max_{x \in \mathcal{X}} f(x)$
![Page 10: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/10.jpg)
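Capture structure: Linear model
• Unknown $\theta \in \mathbb{R}^d$
• Mapping $\phi: \mathcal{X} \mapsto \mathbb{R}^d$
• $f(x) = \langle \phi(x), \theta \rangle$
[Figure: the mapping $\phi(\cdot)$ from $\mathcal{X}$ to $\mathcal{K}$; expected reward versus action]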
![Page 11: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/11.jpg)
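Online function approximation
• Unknown $\theta \in \mathbb{R}^d$
• $f(x) = \langle \phi(x), \theta \rangle$
• $x^\star = \arg\max_{x \in \mathcal{X}} \langle \phi(x), \theta \rangle$
For each episode $t$:
• Select an action $x_t \in \mathcal{X}$
• Observe outcome $y_t = f(x_t) + \xi_t$ with noise $\xi_t \sim \mathcal{N}(0, \sigma^2)$
Minimize $\mathbb{E}\sum_{t=1}^{T} (f(x^\star) - f(x_t))$
[Figure: expected reward versus action, showing $f$ and its maximizer $x^\star$]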
![Page 12: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/12.jpg)
Kernel regression
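• Kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle$
• Gaussian prior $\theta \sim \mathcal{N}_d(0, \Sigma)$ with $\Sigma = \frac{\sigma^2}{\lambda} I$ for $\lambda > 0$
$\mathbf{K}_N = [k(x_i, x_j)]_{1 \le i,j \le N}$ and $\mathbf{k}_N(x) = (k(x, x_i))_{1 \le i \le N}$
$\mathbb{P}[f \mid x_1, \dots, x_N, y_1, \dots, y_N] \sim \mathcal{N}\big( (f_N(x))_{x \in \mathcal{X}}, (k_N(x, x'))_{x, x' \in \mathcal{X}} \big)$
$f_N(x) = \mathbf{k}_N(x)^\top (\mathbf{K}_N + \lambda I)^{-1} \mathbf{y}_N$
$k_N(x, x') = k(x, x') - \mathbf{k}_N(x)^\top (\mathbf{K}_N + \lambda I)^{-1} \mathbf{k}_N(x')$
$\phi: \mathcal{X} \mapsto \mathbb{R}^d$, where $d$ can be very large!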
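The posterior mean and covariance on this slide can be computed directly from kernel evaluations, without ever building the feature map. The following is a minimal NumPy sketch under stated assumptions: the RBF kernel, its lengthscale, the value of `lam`, and the toy data are all illustrative choices, since the slides do not name a specific kernel.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2):
    """Assumed kernel: k(x, x') = exp(-(x - x')^2 / (2 * lengthscale^2)), 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def kernel_posterior(x_train, y_train, x_query, lam=0.01, lengthscale=0.2):
    """Return the posterior mean f_N and covariance k_N on the query points."""
    K = rbf_kernel(x_train, x_train, lengthscale)    # K_N, shape (N, N)
    k_q = rbf_kernel(x_train, x_query, lengthscale)  # column j holds k_N(x_query[j])
    A = K + lam * np.eye(len(x_train))               # K_N + lambda * I
    mean = k_q.T @ np.linalg.solve(A, y_train)
    cov = rbf_kernel(x_query, x_query, lengthscale) - k_q.T @ np.linalg.solve(A, k_q)
    return mean, cov

# Toy data (hypothetical): three noiseless observations of a sine curve.
x_train = np.array([0.1, 0.4, 0.8])
y_train = np.sin(2 * np.pi * x_train)
x_query = np.linspace(0.0, 1.0, 5)
mean, cov = kernel_posterior(x_train, y_train, x_query)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # pointwise posterior standard deviation
print(np.round(mean, 3), np.round(std, 3))
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical-stability choice; the result is the same quantity as in the formulas above.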
![Page 13: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/13.jpg)
Kernel regression
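• Kernel $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$
• Gaussian prior $\theta \sim \mathcal{N}_d(0, \Sigma)$ with $\Sigma = \frac{\sigma^2}{\lambda} I$ for $\lambda > 0$
For $\lambda = \sigma^2$ → Gaussian Process (Rasmussen and Williams, 2006)
• Example: Pointwise posterior mean and standard deviation
$\phi: \mathcal{X} \mapsto \mathbb{R}^d$, where $d$ can be very large!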
![Page 14: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/14.jpg)
Streaming kernel regression
• Next input location 𝑥# is selected based on the 𝑡 − 1 past observations
• Many bandit algorithm variants, e.g. Kernel UCB, Kernel TS, GP-UCB, GP-TS
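To make the streaming loop concrete, here is a rough sketch of a kernel Thompson Sampling variant on a discretized action space: at each step it samples one function from the current posterior and plays its argmax. The RBF kernel, the grid, the noise level, and the objective `f` are assumptions for illustration only, and the regularization is set to the noise variance as in the Gaussian Process case.

```python
import numpy as np

def rbf(a, b, ls=0.1):
    """Assumed RBF kernel on 1-D inputs."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def kernel_ts(f, grid, horizon=40, noise=0.1, seed=0):
    """Select x_t from the t - 1 past observations by posterior sampling."""
    rng = np.random.default_rng(seed)
    lam = noise ** 2  # regularization equal to the noise variance
    xs, ys = [], []
    prior_cov = rbf(grid, grid)
    for _ in range(horizon):
        if xs:
            x_arr, y_arr = np.array(xs), np.array(ys)
            A = rbf(x_arr, x_arr) + lam * np.eye(len(xs))
            k_q = rbf(x_arr, grid)
            mean = k_q.T @ np.linalg.solve(A, y_arr)
            cov = prior_cov - k_q.T @ np.linalg.solve(A, k_q)
        else:
            mean, cov = np.zeros(len(grid)), prior_cov
        # Symmetrize and add jitter so sampling is numerically stable.
        cov = 0.5 * (cov + cov.T) + 1e-8 * np.eye(len(grid))
        sample = rng.multivariate_normal(mean, cov, check_valid="ignore")
        x_t = grid[int(np.argmax(sample))]
        xs.append(x_t)
        ys.append(f(x_t) + noise * rng.normal())
    return np.array(xs), np.array(ys)

# Hypothetical objective with a single peak at x = 0.3.
f = lambda x: np.exp(-(x - 0.3) ** 2 / 0.02)
grid = np.linspace(0.0, 1.0, 101)
xs, ys = kernel_ts(f, grid)
best_x = xs[int(np.argmax(ys))]
print(best_x)
```

Recomputing the posterior from scratch at every step keeps the sketch short; an incremental update of the linear system would be the natural optimization for longer horizons.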
![Page 15: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/15.jpg)
Let’s apply those bandits!
![Page 16: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/16.jpg)
Optimizing super-resolution imaging parameters
Joint work with
• Flavie Lavoie-Cardinal
• Theresa Wiesner
• Anthony Bilodeau
• Paul De Koninck
• Louis-Émile Robitaille
• Marc-André Gardner
• Christian Gagné
D., Wiesner, Gardner, Robitaille, Bilodeau, Gagné, De Koninck, and Lavoie-Cardinal (Nature Comm 2018)
![Page 17: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/17.jpg)
Observing structures at the nanoscale (Hell and Wichmann, 1994)
![Page 18: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/18.jpg)
Problem
Biology: The optimal parameters are not always the same
Typical strategy:
• Split samples into two groups A and B
• Find good parameters on group A
• Perform imaging task on group B
From abberior-instruments.com
![Page 19: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/19.jpg)
Structured bandit problem
Find good parameters during the imaging task
• Maximize the acquisition of useful images
• Minimize trials of poor parameters
→ Identify best parameters
→ Explore wisely
[Diagram: loop between imaging parameters and feedback]
![Page 20: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/20.jpg)
Optimizing image quality
Recall goal: Maximize the acquisition of images useful to researchers
[Diagram: Imaging parameters → Image → Quality score]
![Page 21: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/21.jpg)
[Diagram: Thompson Sampling → Imaging parameters → Image → Quality score. Plot: quality objective versus parameter at N = 3]
![Page 22: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/22.jpg)
Thompson Sampling for selecting imaging parameters
[Figure: quality objective versus parameter after N = 3, N = 10, and N = 29 observations. Legend: $f$, $f_N$, $\tilde{f}$, previous points, next location]
![Page 23: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/23.jpg)
What is good image quality?
Avoiding images like these:
Getting more like these:
![Page 24: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/24.jpg)
But imaging is a destructive process…
Trade-off between image quality and photobleaching
[Diagram: loop linking Outcome options, Parameter selection, Images, and Online analysis (Quality score, % bleach)]
![Page 25: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/25.jpg)
Thompson Sampling for generating outcome options
• One kernel regression model $\hat{f}_i$ per objective $i$
• Sample one function $\tilde{f}_i$ per objective $i$
• Option $\tilde{f}(x)$ at parameter $x$: Concatenate $\tilde{f}_i(x)$ for all $i$
[Figure: sampled objective versus parameter at N = 3, one panel for Quality and one for Photobleaching. Example option: $\tilde{f}(0) = (0.75, 0.65)$]
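The option-generation step can be sketched as follows: draw one function per objective from its own posterior, then stack the draws so that each parameter gets a vector of sampled outcomes. The posteriors here are toy zero-mean Gaussians with an assumed RBF covariance, standing in for fitted kernel regression models; `sample_options` is a hypothetical helper name.

```python
import numpy as np

def sample_options(posteriors, rng):
    """posteriors: list of (mean, cov) pairs, one per objective, on a shared grid.

    Returns an array of shape (n_parameters, n_objectives) where row j is the
    sampled option at parameter j, one entry per objective.
    """
    samples = [rng.multivariate_normal(mean, cov, check_valid="ignore")
               for mean, cov in posteriors]
    return np.stack(samples, axis=1)

# Toy stand-ins for the per-objective posteriors (assumptions, not fitted models).
grid = np.linspace(-1.0, 1.0, 21)
diff = grid[:, None] - grid[None, :]
toy_cov = np.exp(-0.5 * (diff / 0.3) ** 2) + 1e-8 * np.eye(len(grid))
posteriors = [
    (np.zeros(len(grid)), toy_cov),  # objective 1: quality
    (np.zeros(len(grid)), toy_cov),  # objective 2: photobleaching
]
options = sample_options(posteriors, np.random.default_rng(0))
print(options[10])  # sampled (quality, photobleaching) pair at the middle of the grid
```

Sampling each objective independently matches the "one model per objective" bullet; any correlation between objectives is ignored in this sketch.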
![Page 26: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/26.jpg)
Presenting estimates to the expert
• Exploration/Exploitation in the cloud!
• Expert acts as an argmax on the preference function
[Figure: options plotted with axes Image quality and Photobleaching]
![Page 27: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/27.jpg)
Experiments on neuronal imaging
Three parameters (1000 configurations):
• Excitation laser power
• Depletion laser power
• Duration of imaging per pixel
Different imaging targets:
• Neuron: Rat neuron
• PC12: Rat tumor cell line
• HEK293: Human embryonic kidney cells
Acquire two STED images with ↑ 1st STED quality and ↓ photobleaching
![Page 28: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/28.jpg)
Acquire good images and control photobleaching
[Figure: Quality and Photobleaching results; an arrow marks the image labelled "Not super-resolution"]
![Page 29: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/29.jpg)
Sublinear regret, as suggested by theory
Different imaging targets:
• Neuron: Rat neuron
• PC12: Rat tumor cell line
• HEK293: Human embryonic kidney cells
![Page 30: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/30.jpg)
Fully automated process
[Diagram: loop linking Outcome options, Parameter selection, Images, and Online analysis (Quality score, % bleach)]
![Page 31: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/31.jpg)
Fully automated imaging
Bottom left corner: Not super-resolution
Beginning of optimization →
End of optimization →
![Page 32: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/32.jpg)
Towards the next application: Getting closer to the patient
![Page 33: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/33.jpg)
Randomized trials
[Diagram: Randomization → Option A / Option B → (Time) → Comparison & Decision]
![Page 34: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/34.jpg)
Adaptive trials
Dynamically adapt the study design based on previous observations
Effective
Favorable
Promising
Unfavorable
Futile
![Page 35: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/35.jpg)
Writing down the setting…
For each patient $t$:
• Select a treatment $k_t \in \{1, 2, \dots, K\}$
• Observe outcome $r_t \sim D(\mu_{k_t})$
Treatments: $1, 2, 3, \dots, K$
Probability of effectiveness: $\mu_1, \mu_2, \mu_3, \dots, \mu_K$
This is stochastic bandits! (Thompson, 1933)
![Page 36: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/36.jpg)
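In the absence of a one-size-fits-all strategy
Treatments: $1, 2, 3, \dots, K$
Probability of effectiveness, one row per context:
Context $s$: $\mu_{1,s}, \mu_{2,s}, \mu_{3,s}, \dots, \mu_{K,s}$
Context $s'$: $\mu_{1,s'}, \mu_{2,s'}, \mu_{3,s'}, \dots, \mu_{K,s'}$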
![Page 37: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/37.jpg)
In the absence of a one-size-fits-all strategy
You can treat them as independent bandit problems!
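With a small number of discrete contexts, this amounts to keeping one set of bandit statistics per context. A minimal sketch, assuming binary outcomes and Beta-Bernoulli Thompson Sampling; the class name, the context labels `"a"` and `"b"`, and the effectiveness table are hypothetical.

```python
import random
from collections import defaultdict

class PerContextThompson:
    """One independent Beta-Bernoulli bandit per context."""

    def __init__(self, n_treatments, seed=0):
        self.n_treatments = n_treatments
        self.rng = random.Random(seed)
        # counts[context][k] = [successes, failures] for treatment k
        self.counts = defaultdict(lambda: [[0, 0] for _ in range(n_treatments)])

    def select(self, context):
        arms = self.counts[context]
        samples = [self.rng.betavariate(s + 1, f + 1) for s, f in arms]
        return max(range(self.n_treatments), key=lambda k: samples[k])

    def update(self, context, treatment, success):
        self.counts[context][treatment][0 if success else 1] += 1

# Hypothetical effectiveness table: the best treatment differs by context.
true_p = {"a": [0.8, 0.2], "b": [0.2, 0.8]}
env = random.Random(1)
policy = PerContextThompson(n_treatments=2)
for _ in range(1000):
    context = env.choice(["a", "b"])
    k = policy.select(context)
    policy.update(context, k, env.random() < true_p[context][k])

pulls = {c: [s + f for s, f in policy.counts[c]] for c in true_p}
print(pulls)
```

Nothing is shared between contexts here, so every new context starts from scratch; that is the cost this approach pays as the number of contexts grows.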
![Page 38: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/38.jpg)
In the absence of a one-size-fits-all strategy
What happens if the number of contexts grows large?
![Page 39: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/39.jpg)
Contextual bandits
Exploit the underlying structure on the context space
Reward = treatment effect
[Figure: expected reward versus context (0.0 to 1.0) for Actions 1 to 4; at context 0.6 the curves take the values $f_1(0.6)$, $f_2(0.6)$, $f_3(0.6)$, and $f_4(0.6)$]
![Page 40: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/40.jpg)
Contextual bandits
Expected reward of action $k$ is a function $f_k$ of the context features
$f_k: \mathcal{S} \mapsto \mathbb{R}$
For each episode $t$:
• Observe a context $s_t \sim \Pi$
• Select an action $k_t \in \{1, 2, \dots, K\}$
• Observe a reward $r_t \sim D(f_{k_t}(s_t))$
[Figure: expected reward versus context for Actions 1 to 4]
![Page 41: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/41.jpg)
Contextual bandits
Goal: Maximize rewards. Find $k_t^\star = \arg\max_{k \in \{1, 2, \dots, K\}} f_k(s_t)$
![Page 42: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/42.jpg)
Online function approximation
• Unknown $\theta_k \in \mathbb{R}^d$ for each action $k$
• $f_k(s) = \langle \phi(s), \theta_k \rangle$
• $k_t^\star = \arg\max_{k \in \{1, 2, \dots, K\}} \langle \phi(s_t), \theta_k \rangle$
For each episode $t$:
• Observe a context $s_t \sim \Pi$
• Select an action $k_t \in \{1, 2, \dots, K\}$
• Observe reward $r_t = f_{k_t}(s_t) + \xi_t$ with noise $\xi_t \sim \mathcal{N}(0, \sigma^2)$
Minimize $\mathbb{E}\sum_{t=1}^{T} (f_{k_t^\star}(s_t) - f_{k_t}(s_t))$
[Figure: expected reward versus context for Actions 1 to 4]
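One way to instantiate this setting is a Bayesian linear model per action combined with Thompson Sampling. The sketch below assumes the prior $\theta_k \sim \mathcal{N}(0, \frac{\sigma^2}{\lambda} I)$, an explicit two-dimensional feature map, and two hypothetical reward functions; it illustrates the setting and is not the algorithm used in the talk.

```python
import numpy as np

class LinearTS:
    """One Bayesian linear model per action, prior theta_k ~ N(0, (sigma^2 / lam) I)."""

    def __init__(self, n_actions, dim, lam=0.01, sigma=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.sigma = sigma
        self.A = [lam * np.eye(dim) for _ in range(n_actions)]  # lam * I + sum phi phi^T
        self.b = [np.zeros(dim) for _ in range(n_actions)]      # sum r * phi

    def posterior(self, k):
        mean = np.linalg.solve(self.A[k], self.b[k])
        cov = self.sigma ** 2 * np.linalg.inv(self.A[k])
        return mean, cov

    def select(self, phi):
        scores = []
        for k in range(len(self.A)):
            mean, cov = self.posterior(k)
            theta = self.rng.multivariate_normal(mean, cov)  # sampled theta_k
            scores.append(phi @ theta)
        return int(np.argmax(scores))

    def update(self, k, phi, reward):
        self.A[k] += np.outer(phi, phi)
        self.b[k] += reward * phi

def features(s):
    """Assumed feature map phi(s) = (1, s)."""
    return np.array([1.0, s])

# Hypothetical reward functions: action 0 is better for large s, action 1 for small s.
true_f = [lambda s: s, lambda s: 1.0 - s]
env = np.random.default_rng(1)
agent = LinearTS(n_actions=2, dim=2)
for _ in range(300):
    s = env.uniform(0.0, 1.0)
    phi = features(s)
    k = agent.select(phi)
    agent.update(k, phi, true_f[k](s) + 0.1 * env.normal())

def greedy(s):
    """Action with the highest posterior-mean reward at context s."""
    return int(np.argmax([features(s) @ agent.posterior(k)[0] for k in range(2)]))

print(greedy(0.9), greedy(0.1))
```

Because the structure over contexts is shared through $\phi$, observations at one context inform predictions at nearby contexts, unlike the one-bandit-per-context approach.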
![Page 43: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/43.jpg)
Adaptive treatment allocation for mice trials
Joint work with
• Georgios D. Mitsis
• Joelle Pineau
• Charis Achilleos
• Demetris Iacovides
• Katerina Strati
D., Achilleos, Iacovides, Strati, Mitsis, and Pineau (MLHC 2018)
![Page 44: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/44.jpg)
Data acquisition problem
• Mice with induced cancer tumours
• Treatment options: 5FU, Imiquimod, 5FU+Imiquimod, None
• Treatment allocation twice a week
Which treatment should be allocated to patients with cancer given the stage of their disease?
Squamous Cell Carcinoma
![Page 45: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/45.jpg)
Data acquisition problem
Tumour volume
![Page 46: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/46.jpg)
Phase 1: Randomized allocation (only exploration)
• 6 mice
Processing a mouse:
• 2x/week:
  – Measure volume of tumours
  – If all tumours are below a critical level
    » Randomly assign one of the four treatment options
  – Otherwise terminate this animal
![Page 47: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/47.jpg)
Phase 1: Randomized allocation (only exploration)
Result: 12 usable tumours
• 163 triplets (tumour volume, treatment, next tumour volume)
![Page 48: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/48.jpg)
Phase 1: Randomized allocation (only exploration)
Exponential tumour growth → Few data collected for larger tumours
[Figure: number of measurements versus tumour volume (mm³); tumour volume (mm³) versus number of days]
![Page 49: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/49.jpg)
Phase 2: Adaptive trial (exploration/exploitation)
• 10 mice
• Select treatment 2x/week
Adaptive Clinical Trial
• Do not fix the experiment design a priori
• Adapt treatment allocation based on previous observations
• Favor selection of better treatments
• Reduce exposure to less effective treatments
![Page 50: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/50.jpg)
Contextual bandit problem
Improve treatment allocation online
• Maximize amount of acquired data
• Minimize trials of poor treatments
→ Identify best action given context
→ Explore wisely
[Diagram: context = tumour volume before treatment; action = treatment; outcome = tumour volume after treatment]
![Page 51: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/51.jpg)
Alert: Contexts are not independent of actions!
Recall contextual bandits:
For each episode $t$:
• Observe a context $s_t \sim \Pi$
• Select an action $k_t \in \{1, 2, \dots, K\}$
• Obtain a reward $r_t \sim \mathcal{D}(f_{k_t}(s_t))$
Beware of traps!
![Page 52: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/52.jpg)
Reward shaping
Natural reward definition could be $r_t = s_t - s_{t+1}$ (tumour volume reduction)
![Page 53: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/53.jpg)
Reward shaping
Controlling the disease, i.e. keeping the tumour volume constant, has the same value regardless of the tumour volume
![Page 54: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/54.jpg)
Reward shaping
What we used instead: $r_t = -s_{t+1}$
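The difference between the two reward definitions shows up when the tumour volume stays constant. A small sketch with hypothetical volumes in mm³; the function names are made up for this example.

```python
def volume_reduction_reward(volume, next_volume):
    """Natural definition: r_t = s_t - s_{t+1}."""
    return volume - next_volume

def negative_next_volume_reward(volume, next_volume):
    """Shaped definition: r_t = -s_{t+1} (the current volume is ignored)."""
    return -next_volume

# Hypothetical (s_t, s_{t+1}) pairs where the tumour is held constant.
stable_small = (50.0, 50.0)
stable_large = (300.0, 300.0)

for name, reward in [("reduction", volume_reduction_reward),
                     ("negative next volume", negative_next_volume_reward)]:
    print(name, reward(*stable_small), reward(*stable_large))
```

Under the volume-reduction reward both stable cases score zero, whereas the shaped reward ranks holding a small tumour constant above holding a large one constant.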
![Page 55: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/55.jpg)
Exploration/Exploitation strategy
Best Empirical Sampled Average: BESA (Baransi et al., 2014)
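• Fair comparison of empirical estimators
• Opportunities for actions to show how good they are

| Sample # | Action 1 | Action 2 |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |
| 4 | 0 | |
| … | … | |
| 100 | 1 | |

Expected reward estimates: $\hat{\mu}_1$ (Action 1), $\hat{\mu}_2$ (Action 2)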
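The "fair comparison" idea can be sketched for two actions: subsample the longer reward history down to the size of the shorter one, then compare empirical means on equal footing. This is a simplified reading of BESA (Baransi et al., 2014); the tie-breaking rule (prefer the less-played action) and the handling of empty histories are assumptions of this sketch.

```python
import random

def besa_choose(history_a, history_b, rng):
    """Return 0 to play action A or 1 to play action B."""
    # Play any action that has never been tried.
    if not history_a:
        return 0
    if not history_b:
        return 1
    # Subsample without replacement so both means use the same number of rewards.
    n = min(len(history_a), len(history_b))
    sub_a = rng.sample(history_a, n)
    sub_b = rng.sample(history_b, n)
    mean_a = sum(sub_a) / n
    mean_b = sum(sub_b) / n
    if mean_a != mean_b:
        return 0 if mean_a > mean_b else 1
    # Tie: give the less-played action a chance to show how good it is.
    return 0 if len(history_a) <= len(history_b) else 1

rng = random.Random(0)
# Hypothetical histories: 100 rewards for action A, 3 rewards for action B.
history_a = [1, 0] * 50
history_b = [1, 0, 1]
print(besa_choose(history_a, history_b, rng))
```

Because only 3 of the 100 rewards of the well-explored action enter the comparison, its estimate is as noisy as that of the rarely played action, which is what leaves room for exploration.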
![Page 56: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/56.jpg)
![Page 57: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/57.jpg)
![Page 58: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/58.jpg)
![Page 59: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/59.jpg)
GP BESA: Extension to contextual bandits
[Figure: four panels of treatment effect versus tumour volume, showing $f$ and $\tilde{f}_N$ for N = 50, N = 20, N = 10, and N = 5]
![Page 60: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/60.jpg)
Example of exploration
Action 1: 100 observations
Action 2: 5 observations
[Figure: three panels of treatment effect versus tumour volume, with the current context $s_t$ marked. True functions: A1 and A2. Predictions (all data): A1 (N = 100) and A2 (N = 5). Predictions (subsample): A1 (N = 5) and A2 (N = 5)]
![Page 61: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/61.jpg)
Experimental setting
• 10 mice total
• Processing a group of mice (2-3 subjects)
  – Twice a week:
    » For each mouse in the group:
      • Measure tumour → reward for last treatment
      • Select treatment to assign now
  – Until death/sacrifice of all mice in group
• Update algorithm with tuples of (volume, treatment, next volume)
• Start next group
![Page 62: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/62.jpg)
Animals live longer
Algorithm updated after each group
[Figure: life duration (days) for None, Random, 5-FU, and GP BESA allocation, and for groups A, B, C, and D]
![Page 63: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/63.jpg)
Evolution of tumour volumes
Slowing the exponential growth
[Figure: tumour volume (mm³) versus days for Groups A, B, C, and D, shown next to the phase 1 curves (tumour volume versus number of days) for comparison]
![Page 64: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/64.jpg)
A better state space covering
Using data in a next phase
• More information on the tumor growth process
• 40% more data points of volume > 70 mm³
[Figure: number of measurements versus volume (mm³)]
![Page 65: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/65.jpg)
Evolution of the policy
[Figure: probability of selecting each treatment (None, 5-FU, Imiquimod, 5-FU + Imiquimod) versus tumour volume (mm³), one panel per group A, B, C, and D]
![Page 66: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/66.jpg)
Conclusion + Take homes
• Bandits are a nice framework for theory, but also have applications! :)
• We often break theoretical guarantees in practice :(
• How to design algorithms that don’t make unrealistic assumptions?
• Other aspects important in practice were not considered here, e.g.
  – Fairness in exploration
  – Safe exploration
![Page 67: Applied bandits: Supporting health-related researchQuality score Parameter selection Online analysis Images % bleach Thompson Sampling for generating outcome options • One kernel](https://reader035.vdocuments.us/reader035/viewer/2022081400/60aedc4ff75f59226b3428d7/html5/thumbnails/67.jpg)
Huge thanks again to my collaborators!
Questions?…and more