TRANSCRIPT
Solving Bernoulli Rank-One Bandits with Unimodal Thompson Sampling
Cindy Trinh, Emilie Kaufmann, Claire Vernade, Richard Combes
ALT 2020
Cindy Trinh Solving Bernoulli Rank-One Bandits
An example
Let us choose the design of a button for our website!
→ The user clicks if they are attracted both by shape i and color j.
→ The reward is the product of two latent Bernoulli variables B(u_i) and B(v_j).
The Bernoulli Rank-one bandit problem
Bernoulli Rank-one model (Katariya et al., 2017a,b):

u = (u_1, u_2, . . . , u_K) ∈ [0, 1]^K,  v = (v_1, v_2, . . . , v_L) ∈ [0, 1]^L
Pair (i, j) → Bernoulli distribution with mean µ_{ij} = u_i v_j
µ = u v^T is a rank-one matrix:

            j →
      µ = [ u_1 v_1  u_1 v_2  u_1 v_3  u_1 v_4 ]
    i     [ u_2 v_1  u_2 v_2  u_2 v_3  u_2 v_4 ]
    ↓     [ u_3 v_1  u_3 v_2  u_3 v_3  u_3 v_4 ]
          [ u_4 v_1  u_4 v_2  u_4 v_3  u_4 v_4 ]

Click probability for (i, j): µ_{i,j} = u_i × v_j
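The reward model above is easy to simulate; here is a minimal Python sketch (the instance values and the helper name `pull` are our own choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

u = np.array([0.75, 0.25, 0.25, 0.25])  # latent "shape" attraction u_i
v = np.array([0.75, 0.25, 0.25, 0.25])  # latent "color" attraction v_j

mu = np.outer(u, v)  # rank-one mean matrix: mu[i, j] = u_i * v_j

def pull(i, j):
    """Reward of arm (i, j): the product of two latent Bernoulli draws,
    which is itself Bernoulli with mean u_i * v_j."""
    return rng.binomial(1, u[i]) * rng.binomial(1, v[j])

assert np.linalg.matrix_rank(mu) == 1  # the mean matrix has rank one
```

Note that only the product of the two latent draws is observed, never the draws themselves.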
At each time step t = 1, . . . , T:
- the learner chooses an arm K(t) = (I(t), J(t)),
- and receives r(t) ∼ B(µ_{I(t),J(t)}), where µ_{i,j} = u_i v_j.
Goal: minimize the expected cumulative regret
R_µ(T, A) = ∑_{t=1}^{T} [ max_{(i,j) ∈ [K]×[L]} µ_{i,j} − E_µ[µ_{I(t),J(t)}] ]
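The regret definition translates directly into code; a hedged sketch (the function name `regret` and the toy arm sequences are ours):

```python
import numpy as np

u = np.array([0.75, 0.25, 0.25, 0.25])
v = np.array([0.75, 0.25, 0.25, 0.25])
mu = np.outer(u, v)
mu_star = mu.max()  # best click probability, here u_1 * v_1 = 0.5625

def regret(arm_sequence):
    """Expected cumulative regret: sum of the gaps mu* - mu_{I(t),J(t)}."""
    return sum(mu_star - mu[i, j] for (i, j) in arm_sequence)

# Always playing the best arm incurs zero regret; every suboptimal
# pull adds its mean gap to the total.
print(regret([(0, 0)] * 10))  # 0.0
print(regret([(1, 1)] * 2))   # 2 * (0.5625 - 0.0625) = 1.0
```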
Lower bound on the regret
Lower bound on the regret for the KL-armed bandit problem, i.e. treating all K·L pairs as independent arms (Lai and Robbins, 1985): for any uniformly efficient algorithm A,

lim inf_{T→∞} R_µ(A, T) / log(T) ≥ ∑_{(i,j) ∈ [K]×[L] \ {(i*,j*)}} (µ_{i*,j*} − µ_{i,j}) / kl(µ_{i,j}, µ_{i*,j*}),

with kl(x, y) = x ln(x/y) + (1 − x) ln((1 − x)/(1 − y)).

→ matched by the kl-UCB algorithm (Cappé et al., 2013)
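The binary KL divergence and the resulting lower-bound constant are straightforward to evaluate numerically; a minimal sketch (function names are ours, and the clamping `eps` is a numerical safeguard, not part of the definition):

```python
import math

def kl(x, y):
    """Binary KL divergence kl(x, y) between Bernoulli means x and y."""
    eps = 1e-12  # numerical safeguard near 0 and 1
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def lai_robbins_constant(mu):
    """Sum over all suboptimal entries of gap / kl(mu_ij, mu*):
    the constant in front of log(T) in the unstructured bound."""
    mu_star = max(max(row) for row in mu)
    return sum((mu_star - m) / kl(m, mu_star)
               for row in mu for m in row if m < mu_star)
```

For example, kl(0.25, 0.75) = 0.25 ln(1/3) + 0.75 ln(3) = 0.5 ln 3.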
Lower bound on the regret for Rank-one bandits (Katariya et al., 2017a):

lim inf_{T→∞} R_µ(A, T) / log(T) ≥ ∑_{i ∈ [K] \ {i*}} (µ_{i*,j*} − µ_{i,j*}) / kl(µ_{i,j*}, µ_{i*,j*}) + ∑_{j ∈ [L] \ {j*}} (µ_{i*,j*} − µ_{i*,j}) / kl(µ_{i*,j}, µ_{i*,j*})
→ An algorithm that exploits the rank-one structure well should sample arms outside the best row/column only o(log T) times.
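To see how much the rank-one structure shrinks the bound, the two constants can be compared on a toy instance; a hedged sketch (the instance values are our own):

```python
import math

def kl(x, y):
    """Binary KL divergence between Bernoulli means (interior values)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

u = [0.75, 0.25, 0.25, 0.25]
v = [0.75, 0.25, 0.25, 0.25]
K, L = len(u), len(v)
mu = [[ui * vj for vj in v] for ui in u]
i_star, j_star = 0, 0              # (u_1, v_1) = (0.75, 0.75) is best
mu_star = mu[i_star][j_star]

# Unstructured constant: every suboptimal pair (i, j) contributes.
c_full = sum((mu_star - mu[i][j]) / kl(mu[i][j], mu_star)
             for i in range(K) for j in range(L)
             if (i, j) != (i_star, j_star))

# Rank-one constant: only the best row and the best column contribute.
c_rank1 = (sum((mu_star - mu[i][j_star]) / kl(mu[i][j_star], mu_star)
               for i in range(K) if i != i_star)
           + sum((mu_star - mu[i_star][j]) / kl(mu[i_star][j], mu_star)
                 for j in range(L) if j != j_star))
```

The structured constant is strictly smaller, since the (K−1)(L−1) pairs outside the best row and column drop out.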
Current state-of-the-art
Theorem 1 of (Katariya et al., 2017a)
The regret of Rank1ElimKL satisfies
R(T) ≤ (160 / (µγ)) [ ∑_{i=1}^{K} 1/∆_i^U + ∑_{j=1}^{L} 1/∆_j^V ] log T + (6e + 82)(K + L)

where
µ = min{ (1/K) ∑_{i=1}^{K} u_i , (1/L) ∑_{j=1}^{L} v_j },
∆_i^U = u_{i*} − u_i + 1{i = i*} min_j (v_{j*} − v_j),
∆_j^V = v_{j*} − v_j + 1{j = j*} min_i (u_{i*} − u_i),
γ = max{ µ, 1 − max(u_1, . . . , u_K, v_1, . . . , v_L) }

⊕ Regret in O((µγ)^{−1}(K + L) log T) instead of O(KL log T)
⊖ Does not match the lower bound...
⊖ Empirically, Rank1ElimKL is not always better than kl-UCB
→ Is the lower bound achievable?
Yes, by resorting to Graphical Unimodal bandits.
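The instance-dependent quantities in the bound above can be computed explicitly; a hedged sketch of our reading of the definitions (in particular, we take the min over suboptimal indices, so the optimal row/column gets a nonzero gap):

```python
u = [0.75, 0.25, 0.25, 0.25]
v = [0.75, 0.25, 0.25, 0.25]
K, L = len(u), len(v)
i_star = max(range(K), key=lambda i: u[i])
j_star = max(range(L), key=lambda j: v[j])

# the quantities mu and gamma from the theorem
mu_bar = min(sum(u) / K, sum(v) / L)
gamma = max(mu_bar, 1 - max(u + v))

def gap_U(i):
    """Row gap Delta_i^U; for the optimal row the gap comes from the
    column side (our reading of the indicator term)."""
    base = u[i_star] - u[i]
    if i == i_star:
        base += min(v[j_star] - v[j] for j in range(L) if j != j_star)
    return base
```

On this instance mu_bar = gamma = 0.375, so the leading factor 160/(mu_bar * gamma) is already about 1138, which illustrates why the bound can be loose in practice.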
Outline
1 Rank-one bandits are Unimodal bandits
2 A new analysis of Unimodal Thompson Sampling
3 Experimental Results
Unimodal structure of the Rank-one model
Definition (Unimodal structure)
Let G = (V, E) be an undirected graph. A vector µ = (µ_k)_{k∈V} is unimodal with respect to G if:
- there exists a unique k* ∈ V such that µ_{k*} = max_{k∈V} µ_k;
- from any k ≠ k*, there exists an increasing path to the optimal arm, i.e. a path p = (k, k_2, . . . , k*) such that µ_k < µ_{k_2} < · · · < µ_{k*}.
Graph for the Rank-one model:
V = {1, . . . , K} × {1, . . . , L}, and ((i, j), (k, ℓ)) ∈ E iff i = k or j = ℓ.

(Illustration: the 4×4 matrix of means (u_i v_j); N(3, 3), the neighbors of arm (3, 3), are the other entries of row 3 and column 3.)
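Unimodality over this graph can be checked mechanically: greedy hill climbing along rows and columns must reach the unique maximizer from every starting arm. A hedged sketch (the instance and the helper name are ours):

```python
import numpy as np

u = np.array([0.75, 0.5, 0.25, 0.1])
v = np.array([0.8, 0.4, 0.3, 0.2])
mu = np.outer(u, v)
K, L = mu.shape
best = np.unravel_index(mu.argmax(), mu.shape)  # the unique maximizer

def has_increasing_path(i, j):
    """Greedy hill climbing along rows/columns; under unimodality it
    must reach the unique best arm through strictly increasing means."""
    while (i, j) != best:
        # neighbors: every other arm in the same row or column
        neighbors = ([(i, l) for l in range(L) if l != j] +
                     [(k, j) for k in range(K) if k != i])
        nxt = max(neighbors, key=lambda a: mu[a])
        if mu[nxt] <= mu[i, j]:
            return False  # stuck in a local maximum
        i, j = nxt
    return True

assert all(has_increasing_path(i, j) for i in range(K) for j in range(L))
```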
A valid increasing path from arm (3, 3) to the optimal arm (1, 1): p = ((3, 3), (1, 3), (1, 1)), since u_3 v_3 < u_1 v_3 < u_1 v_1 = u* v*.
LB for Unimodal and Rank-one bandits
By considering the previous graph, the lower bound for Unimodal bandits (Combes and Proutiere, 2014)...

lim inf_{T→∞} R_µ(A, T) / ln(T) ≥ ∑_{k ∈ N_G(k*)} (µ_{k*} − µ_k) / kl(µ_k, µ_{k*})

...coincides with the lower bound for Rank-one bandits:

= ∑_{i ∈ [K] \ {i*}} (µ_{i*,j*} − µ_{i,j*}) / kl(µ_{i,j*}, µ_{i*,j*}) + ∑_{j ∈ [L] \ {j*}} (µ_{i*,j*} − µ_{i*,j}) / kl(µ_{i*,j}, µ_{i*,j*})
→ Asymptotically optimal algorithms for Unimodal bandits are also asymptotically optimal for Rank-one bandits.
→ Existing algorithms for Graphical Unimodal bandits: OSUB (Combes and Proutiere, 2014) and UTS (Paladino et al., 2017).
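The coincidence of the two bounds boils down to N(k*) being exactly the rest of row i* and column j*; a small numerical check (instance values are ours):

```python
import math

def kl(x, y):
    """Binary KL divergence between Bernoulli means (interior values)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

u = [0.75, 0.25, 0.25, 0.25]
v = [0.75, 0.3, 0.3, 0.3]
mu = {(i, j): u[i] * v[j] for i in range(len(u)) for j in range(len(v))}
i_star, j_star = 0, 0
mu_star = mu[i_star, j_star]

# Neighbors of (i*, j*): every other arm in its row or column.
N_star = ([(i, j_star) for i in range(len(u)) if i != i_star] +
          [(i_star, j) for j in range(len(v)) if j != j_star])

unimodal_sum = sum((mu_star - mu[k]) / kl(mu[k], mu_star) for k in N_star)

rank_one_sum = (
    sum((mu_star - mu[i, j_star]) / kl(mu[i, j_star], mu_star)
        for i in range(len(u)) if i != i_star)
    + sum((mu_star - mu[i_star, j]) / kl(mu[i_star, j], mu_star)
          for j in range(len(v)) if j != j_star))

assert abs(unimodal_sum - rank_one_sum) < 1e-12
```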
Unimodal algorithms
At each time step:
- Compute the empirical leader L ← argmax_{k ∈ [K]} µ̂_k (the empirical means)
- Update the leader count ℓ_L ← ℓ_L + 1

OSUB (Combes and Proutiere, 2014):
  γ ← max degree of graph G
  A_{t+1} ← L if ℓ_L ≡ 0 [γ], otherwise argmax_{k ∈ N(L) ∪ {L}} u_k,
  where the u_k are kl-UCB indices: u_k = max{ q : N_k kl(µ̂_k, q) ≤ f(ℓ_L) }
  ⊕ Asymptotically optimal
  ⊖ Requires tuning of the confidence interval (f)

UTS (Paladino et al., 2017):
  γ ← degree of leader L
  A_{t+1} ← L if ℓ_L ≡ 0 [γ], otherwise argmax_{k ∈ N(L) ∪ {L}} θ_k,
  where the θ_k are Thompson samples: θ_k ∼ Beta(S_k + 1, N_k − S_k + 1)
  ⊕ Better empirical performance
  ⊖ Imprecise arguments in the analysis of (Paladino et al., 2017)
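OSUB's kl-UCB index max{ q : N_k kl(µ̂_k, q) ≤ f(ℓ_L) } can be computed by bisection, since kl(µ̂_k, ·) is increasing on [µ̂_k, 1]. A minimal sketch (function names and the iteration count are ours; the exploration function f is passed in as a plain budget):

```python
import math

def kl(x, y):
    """Binary KL divergence between Bernoulli means, clamped for safety."""
    eps = 1e-12
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def klucb_index(mu_hat, n_pulls, budget):
    """Largest q with n_pulls * kl(mu_hat, q) <= budget, by bisection
    (kl(mu_hat, .) is increasing on [mu_hat, 1])."""
    lo, hi = mu_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if n_pulls * kl(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# The index shrinks toward the empirical mean as the arm is pulled more.
assert klucb_index(0.5, 10, math.log(100)) > klucb_index(0.5, 1000, math.log(100))
```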
Unimodal Thompson Sampling
At each time step:
- Compute the empirical leader L ← argmax_{k ∈ [K]} µ̂_k
- Update the leader count ℓ_L ← ℓ_L + 1
- With exploration parameter γ ∈ N, γ ≥ 2, play
  A_{t+1} ← L if ℓ_L ≡ 0 [γ], otherwise argmax_{k ∈ N(L) ∪ {L}} θ_k,
  where the θ_k are Thompson samples: θ_k ∼ Beta(S_k + 1, N_k − S_k + 1)
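Following this pseudocode, a minimal UTS(γ) loop on the rank-one row/column graph might look as below. This is a hedged sketch, not the authors' implementation: variable names, tie-breaking, and the default mean 0.5 for unpulled arms are our choices.

```python
import numpy as np

rng = np.random.default_rng(1)

u = np.array([0.75, 0.25, 0.25, 0.25])
v = np.array([0.75, 0.25, 0.25, 0.25])
K, L = len(u), len(v)
arms = [(i, j) for i in range(K) for j in range(L)]
# neighbors in the rank-one graph: arms sharing a row or a column
neighbors = {a: [b for b in arms
                 if b != a and (b[0] == a[0] or b[1] == a[1])]
             for a in arms}

gamma = 2                             # leader-exploration parameter
S = dict.fromkeys(arms, 0)            # successes per arm
N = dict.fromkeys(arms, 0)            # pulls per arm
leader_count = dict.fromkeys(arms, 0)

def uts_round():
    # empirical leader (unpulled arms default to 0.5, our choice)
    leader = max(arms, key=lambda a: S[a] / N[a] if N[a] else 0.5)
    leader_count[leader] += 1
    if leader_count[leader] % gamma == 0:
        arm = leader                  # forced exploration of the leader
    else:
        # Thompson samples restricted to the leader's neighborhood
        theta = {a: rng.beta(S[a] + 1, N[a] - S[a] + 1)
                 for a in neighbors[leader] + [leader]}
        arm = max(theta, key=theta.get)
    i, j = arm
    r = rng.binomial(1, u[i]) * rng.binomial(1, v[j])
    S[arm] += r
    N[arm] += 1

for _ in range(2000):
    uts_round()
# the optimal arm (0, 0) should dominate the pull counts by now
```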
Regret upper bound for UTS
Theorem
Let µ be a unimodal bandit instance with respect to a graph G. We show that for all γ ≥ 2 and ε > 0,

R_µ(T, UTS(γ)) ≤ (1 + ε) ∑_{k ∈ N(k*)} (µ* − µ_k) / kl(µ_k, µ*) · log(T) + C(µ, γ, ε).
→ Matches the lower bound for the Unimodal bandit problem:

lim sup_{T→∞} R_µ(T, UTS(γ)) / log(T) ≤ ∑_{k ∈ N(k*)} (µ* − µ_k) / kl(µ_k, µ*)
→ ... for any exploration parameter γ ≥ 2
Sketch of proof for the Upper Bound of UTS
Outline of the proof:
R(T) = ∑_{k ≠ k*} ∆_k E[ ∑_{t=1}^{T} 1(K(t) = k) ]

     = ∑_{k ∈ N(k*)} ∆_k E[ ∑_{t=1}^{T} 1(K(t) = k, L(t) = k*) ]    } R_1(T)

     + ∑_{k ≠ k*} ∆_k E[ ∑_{t=1}^{T} 1(K(t) = k, L(t) ≠ k*) ]      } R_2(T)
R_1(T): the leader is the optimal arm k*
→ similar arguments to those for Thompson Sampling, restricted to N(k*)

R_2(T): the leader is a suboptimal arm
→ more sophisticated arguments
Sketch of proof for the Upper Bound of UTS
Denote by BN(k) = argmax_{ℓ ∈ N(k)} µ_ℓ the set of best neighbors of k.

R_2(T) ≤ ∑_{k ≠ k*} ∑_{t=1}^{T} P(L(t) = k)

       = ∑_{k ≠ k*} ∑_{t=1}^{T} P( L(t) = k, ∀k_2 ∈ BN(k), N_{k_2}(t) ≤ (ℓ_k(t))^b )
         [with Thompson Sampling it is unlikely that all optimal arms in the neighborhood of k are barely pulled]

       + ∑_{k ≠ k*} ∑_{t=1}^{T} P( L(t) = k, ∃k_2 ∈ BN(k), N_{k_2}(t) > (ℓ_k(t))^b )
         [leader exploration + k_2 is drawn often ⇒ k is unlikely to remain leader]
Experimental study of leader exploration parameter γ
→ UTS forces a play of the leader every γ-th time it has been the empirical leader.
Figure: Cumulative regret of UTS for γ ∈ {2, 5, 10, 15, 30, +∞}, K = L = 4, u = v = (0.75, 0.25, . . . , 0.25). Results averaged over 100 runs.
In our analysis, γ can be set to any arbitrary integer value ≥ 2
Empirically, forced exploration of the leader does not seem mandatory
But γ = 2 yields the best performance
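The γ-sweep can be reproduced in miniature; a hedged sketch (a compressed re-implementation with our own names; rewards are drawn directly as B(u_i v_j), which has the same distribution as the product of the two latent Bernoullis):

```python
import numpy as np

def uts_regret(gamma, T, seed=0):
    """Cumulative regret of a UTS sketch on a 4x4 rank-one instance.
    gamma=None disables forced leader exploration (the gamma = +inf case)."""
    rng = np.random.default_rng(seed)
    u = np.array([0.75, 0.25, 0.25, 0.25])
    mu = np.outer(u, u)  # u = v on this instance
    arms = [(i, j) for i in range(4) for j in range(4)]
    nbh = {a: [b for b in arms if b != a and (b[0] == a[0] or b[1] == a[1])]
           for a in arms}
    S = dict.fromkeys(arms, 0)
    N = dict.fromkeys(arms, 0)
    lead = dict.fromkeys(arms, 0)
    regret = 0.0
    for _ in range(T):
        # empirical leader (unpulled arms default to 0.5, our choice)
        leader = max(arms, key=lambda a: S[a] / N[a] if N[a] else 0.5)
        lead[leader] += 1
        if gamma is not None and lead[leader] % gamma == 0:
            arm = leader  # forced leader exploration
        else:
            theta = {a: rng.beta(S[a] + 1, N[a] - S[a] + 1)
                     for a in nbh[leader] + [leader]}
            arm = max(theta, key=theta.get)
        S[arm] += rng.binomial(1, mu[arm])
        N[arm] += 1
        regret += mu.max() - mu[arm]
    return regret
```

Comparing `uts_regret(2, T)` against `uts_regret(None, T)` over several seeds mimics the γ = 2 vs γ = +∞ curves of the figure, on a much smaller scale.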
Comparison with other algorithms
Figure: Cumulative regret of Rank1ElimKL, OSUB, UTS, KL-UCB on a 4×4 rank-one matrix with u = v = (0.75, 0.25, . . . , 0.25) (left). Regret in log-scale (right).
Experiments with larger matrices
Figure: Cumulative regret of Rank1ElimKL, OSUB, UTS, KL-UCB on a 128×128 rank-one matrix with u = v = (0.75, 0.25, . . . , 0.25) (left). Regret of OSUB and UTS in log-scale (right).
Summary
Our contributions:
Rank-one bandits are an instance of Unimodal bandits ⇒ we close the remaining theoretical gap
New analysis of UTS ⇒ UTS is asymptotically optimal, and γ can be set to any integer value ≥ 2
Experiments show outstanding performance of UTS
Future work:
Theoretical analysis for γ =∞?
Is leader exploration (γ = 2) helping in other structuredbandit problems?
Thank you for your attention!
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning, 2014.

Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Bernoulli rank-1 bandits for click feedback. In IJCAI, 2017a.

Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Stochastic rank-1 bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017b.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Stefano Paladino, Francesco Trovò, Marcello Restelli, and Nicola Gatti. Unimodal Thompson sampling for graph-structured arms. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.