TRANSCRIPT
Solving Bernoulli Rank-One Bandits with Unimodal Thompson Sampling
Cindy Trinh, Emilie Kaufmann, Claire Vernade, Richard Combes
ALT 2020
Cindy Trinh Solving Bernoulli Rank-One Bandits
An example
Let us choose the design of a button for our website!
→ The user clicks if they are attracted both by shape i and color j.
→ The reward is the product of two latent Bernoulli variables B(u_i) and B(v_j).
The Bernoulli Rank-one bandit problem
Bernoulli Rank-one model (Katariya et al., 2017a,b):

u = (u_1, u_2, . . . , u_K) ∈ [0, 1]^K,  v = (v_1, v_2, . . . , v_L) ∈ [0, 1]^L
Pair (i, j) → Bernoulli distribution with mean µ_{ij} = u_i v_j
µ = u v^T is a rank-one matrix:

            j →
      µ = [ u_1 v_1  u_1 v_2  u_1 v_3  u_1 v_4 ]
    i     [ u_2 v_1  u_2 v_2  u_2 v_3  u_2 v_4 ]
    ↓     [ u_3 v_1  u_3 v_2  u_3 v_3  u_3 v_4 ]
          [ u_4 v_1  u_4 v_2  u_4 v_3  u_4 v_4 ]

Click probability for (i, j): µ_{i,j} = u_i × v_j
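The reward model above is easy to simulate; here is a minimal Python sketch (the instance values and the helper name `pull` are our own choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

u = np.array([0.75, 0.25, 0.25, 0.25])  # latent "shape" attraction u_i
v = np.array([0.75, 0.25, 0.25, 0.25])  # latent "color" attraction v_j

mu = np.outer(u, v)  # rank-one mean matrix: mu[i, j] = u_i * v_j

def pull(i, j):
    """Reward of arm (i, j): the product of two latent Bernoulli draws,
    which is itself Bernoulli with mean u_i * v_j."""
    return rng.binomial(1, u[i]) * rng.binomial(1, v[j])

assert np.linalg.matrix_rank(mu) == 1  # the mean matrix has rank one
```

Note that only the product of the two latent draws is observed, never the draws themselves.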
At each time step t = 1, . . . , T:
- the learner chooses an arm K(t) = (I(t), J(t)),
- and receives r(t) ∼ B(µ_{I(t),J(t)}), where µ_{i,j} = u_i v_j.
Goal: minimize the expected cumulative regret
R_µ(T, A) = ∑_{t=1}^{T} [ max_{(i,j) ∈ [K]×[L]} µ_{i,j} − E_µ[µ_{I(t),J(t)}] ]
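The regret definition translates directly into code; a hedged sketch (the function name `regret` and the toy arm sequences are ours):

```python
import numpy as np

u = np.array([0.75, 0.25, 0.25, 0.25])
v = np.array([0.75, 0.25, 0.25, 0.25])
mu = np.outer(u, v)
mu_star = mu.max()  # best click probability, here u_1 * v_1 = 0.5625

def regret(arm_sequence):
    """Expected cumulative regret: sum of the gaps mu* - mu_{I(t),J(t)}."""
    return sum(mu_star - mu[i, j] for (i, j) in arm_sequence)

# Always playing the best arm incurs zero regret; every suboptimal
# pull adds its mean gap to the total.
print(regret([(0, 0)] * 10))  # 0.0
print(regret([(1, 1)] * 2))   # 2 * (0.5625 - 0.0625) = 1.0
```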
Lower bound on the regret
Lower bound on the regret for the KL-armed bandit problem, i.e. treating all K·L pairs as independent arms (Lai and Robbins, 1985): for any uniformly efficient algorithm A,

lim inf_{T→∞} R_µ(A, T) / log(T) ≥ ∑_{(i,j) ∈ [K]×[L] \ {(i*,j*)}} (µ_{i*,j*} − µ_{i,j}) / kl(µ_{i,j}, µ_{i*,j*}),

with kl(x, y) = x ln(x/y) + (1 − x) ln((1 − x)/(1 − y)).

→ matched by the kl-UCB algorithm (Cappé et al., 2013)
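The binary KL divergence and the resulting lower-bound constant are straightforward to evaluate numerically; a minimal sketch (function names are ours, and the clamping `eps` is a numerical safeguard, not part of the definition):

```python
import math

def kl(x, y):
    """Binary KL divergence kl(x, y) between Bernoulli means x and y."""
    eps = 1e-12  # numerical safeguard near 0 and 1
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def lai_robbins_constant(mu):
    """Sum over all suboptimal entries of gap / kl(mu_ij, mu*):
    the constant in front of log(T) in the unstructured bound."""
    mu_star = max(max(row) for row in mu)
    return sum((mu_star - m) / kl(m, mu_star)
               for row in mu for m in row if m < mu_star)
```

For example, kl(0.25, 0.75) = 0.25 ln(1/3) + 0.75 ln(3) = 0.5 ln 3.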
Lower bound on the regret for Rank-one bandits (Katariya et al., 2017a):

lim inf_{T→∞} R_µ(A, T) / log(T) ≥ ∑_{i ∈ [K] \ {i*}} (µ_{i*,j*} − µ_{i,j*}) / kl(µ_{i,j*}, µ_{i*,j*}) + ∑_{j ∈ [L] \ {j*}} (µ_{i*,j*} − µ_{i*,j}) / kl(µ_{i*,j}, µ_{i*,j*})
→ An algorithm that exploits the rank-one structure well should sample arms outside the best row/column only o(log T) times.
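To see how much the rank-one structure shrinks the bound, the two constants can be compared on a toy instance; a hedged sketch (the instance values are our own):

```python
import math

def kl(x, y):
    """Binary KL divergence between Bernoulli means (interior values)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

u = [0.75, 0.25, 0.25, 0.25]
v = [0.75, 0.25, 0.25, 0.25]
K, L = len(u), len(v)
mu = [[ui * vj for vj in v] for ui in u]
i_star, j_star = 0, 0              # (u_1, v_1) = (0.75, 0.75) is best
mu_star = mu[i_star][j_star]

# Unstructured constant: every suboptimal pair (i, j) contributes.
c_full = sum((mu_star - mu[i][j]) / kl(mu[i][j], mu_star)
             for i in range(K) for j in range(L)
             if (i, j) != (i_star, j_star))

# Rank-one constant: only the best row and the best column contribute.
c_rank1 = (sum((mu_star - mu[i][j_star]) / kl(mu[i][j_star], mu_star)
               for i in range(K) if i != i_star)
           + sum((mu_star - mu[i_star][j]) / kl(mu[i_star][j], mu_star)
                 for j in range(L) if j != j_star))
```

The structured constant is strictly smaller, since the (K−1)(L−1) pairs outside the best row and column drop out.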
Current state-of-the-art
Theorem 1 of (Katariya et al., 2017a)
The regret of Rank1ElimKL satisfies
R(T) ≤ (160 / (µγ)) [ ∑_{i=1}^{K} 1/∆_i^U + ∑_{j=1}^{L} 1/∆_j^V ] log T + (6e + 82)(K + L)

where
µ = min{ (1/K) ∑_{i=1}^{K} u_i , (1/L) ∑_{j=1}^{L} v_j },
∆_i^U = u_{i*} − u_i + 1{i = i*} min_j (v_{j*} − v_j),
∆_j^V = v_{j*} − v_j + 1{j = j*} min_i (u_{i*} − u_i),
γ = max{ µ, 1 − max(u_1, . . . , u_K, v_1, . . . , v_L) }

⊕ Regret in O((µγ)^{−1}(K + L) log T) instead of O(KL log T)
⊖ Does not match the lower bound...
⊖ Empirically, Rank1ElimKL is not always better than kl-UCB
→ Is the lower bound achievable?
Yes, by resorting to Graphical Unimodal bandits.
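The instance-dependent quantities in the bound above can be computed explicitly; a hedged sketch of our reading of the definitions (in particular, we take the min over suboptimal indices, so the optimal row/column gets a nonzero gap):

```python
u = [0.75, 0.25, 0.25, 0.25]
v = [0.75, 0.25, 0.25, 0.25]
K, L = len(u), len(v)
i_star = max(range(K), key=lambda i: u[i])
j_star = max(range(L), key=lambda j: v[j])

# the quantities mu and gamma from the theorem
mu_bar = min(sum(u) / K, sum(v) / L)
gamma = max(mu_bar, 1 - max(u + v))

def gap_U(i):
    """Row gap Delta_i^U; for the optimal row the gap comes from the
    column side (our reading of the indicator term)."""
    base = u[i_star] - u[i]
    if i == i_star:
        base += min(v[j_star] - v[j] for j in range(L) if j != j_star)
    return base
```

On this instance mu_bar = gamma = 0.375, so the leading factor 160/(mu_bar * gamma) is already about 1138, which illustrates why the bound can be loose in practice.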
Outline
1 Rank-one bandits are Unimodal bandits
2 A new analysis of Unimodal Thompson Sampling
3 Experimental Results
Unimodal structure of the Rank-one model
Definition (Unimodal structure)
Let G = (V, E) be an undirected graph. A vector µ = (µ_k)_{k∈V} is unimodal with respect to G if:
- there exists a unique k* ∈ V such that µ_{k*} = max_{k∈V} µ_k;
- from any k ≠ k*, there exists an increasing path to the optimal arm, i.e. a path p = (k, k_2, . . . , k*) such that µ_k < µ_{k_2} < · · · < µ_{k*}.
Graph for the Rank-one model:
V = {1, . . . , K} × {1, . . . , L}, and ((i, j), (k, ℓ)) ∈ E iff i = k or j = ℓ.

(Illustration: the 4×4 matrix of means (u_i v_j); N(3, 3), the neighbors of arm (3, 3), are the other entries of row 3 and column 3.)
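Unimodality over this graph can be checked mechanically: greedy hill climbing along rows and columns must reach the unique maximizer from every starting arm. A hedged sketch (the instance and the helper name are ours):

```python
import numpy as np

u = np.array([0.75, 0.5, 0.25, 0.1])
v = np.array([0.8, 0.4, 0.3, 0.2])
mu = np.outer(u, v)
K, L = mu.shape
best = np.unravel_index(mu.argmax(), mu.shape)  # the unique maximizer

def has_increasing_path(i, j):
    """Greedy hill climbing along rows/columns; under unimodality it
    must reach the unique best arm through strictly increasing means."""
    while (i, j) != best:
        # neighbors: every other arm in the same row or column
        neighbors = ([(i, l) for l in range(L) if l != j] +
                     [(k, j) for k in range(K) if k != i])
        nxt = max(neighbors, key=lambda a: mu[a])
        if mu[nxt] <= mu[i, j]:
            return False  # stuck in a local maximum
        i, j = nxt
    return True

assert all(has_increasing_path(i, j) for i in range(K) for j in range(L))
```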
A valid increasing path from arm (3, 3) to the optimal arm (1, 1): p = ((3, 3), (1, 3), (1, 1)), since u_3 v_3 < u_1 v_3 < u_1 v_1 = u* v*.
LB for Unimodal and Rank-one bandits
By considering the previous graph, the lower bound for Unimodal bandits (Combes and Proutiere, 2014)...

lim inf_{T→∞} R_µ(A, T) / ln(T) ≥ ∑_{k ∈ N_G(k*)} (µ_{k*} − µ_k) / kl(µ_k, µ_{k*})

...coincides with the lower bound for Rank-one bandits:

= ∑_{i ∈ [K] \ {i*}} (µ_{i*,j*} − µ_{i,j*}) / kl(µ_{i,j*}, µ_{i*,j*}) + ∑_{j ∈ [L] \ {j*}} (µ_{i*,j*} − µ_{i*,j}) / kl(µ_{i*,j}, µ_{i*,j*})
→ Asymptotically optimal algorithms for Unimodal bandits are also asymptotically optimal for Rank-one bandits.
→ Existing algorithms for Graphical Unimodal bandits: OSUB (Combes and Proutiere, 2014) and UTS (Paladino et al., 2017).
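The coincidence of the two bounds boils down to N(k*) being exactly the rest of row i* and column j*; a small numerical check (instance values are ours):

```python
import math

def kl(x, y):
    """Binary KL divergence between Bernoulli means (interior values)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

u = [0.75, 0.25, 0.25, 0.25]
v = [0.75, 0.3, 0.3, 0.3]
mu = {(i, j): u[i] * v[j] for i in range(len(u)) for j in range(len(v))}
i_star, j_star = 0, 0
mu_star = mu[i_star, j_star]

# Neighbors of (i*, j*): every other arm in its row or column.
N_star = ([(i, j_star) for i in range(len(u)) if i != i_star] +
          [(i_star, j) for j in range(len(v)) if j != j_star])

unimodal_sum = sum((mu_star - mu[k]) / kl(mu[k], mu_star) for k in N_star)

rank_one_sum = (
    sum((mu_star - mu[i, j_star]) / kl(mu[i, j_star], mu_star)
        for i in range(len(u)) if i != i_star)
    + sum((mu_star - mu[i_star, j]) / kl(mu[i_star, j], mu_star)
          for j in range(len(v)) if j != j_star))

assert abs(unimodal_sum - rank_one_sum) < 1e-12
```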
Unimodal algorithms
At each time step:
- Compute the empirical leader L ← argmax_{k ∈ [K]} µ̂_k (the empirical means)
- Update the leader count ℓ_L ← ℓ_L + 1

OSUB (Combes and Proutiere, 2014):
  γ ← max degree of graph G
  A_{t+1} ← L if ℓ_L ≡ 0 [γ], otherwise argmax_{k ∈ N(L) ∪ {L}} u_k,
  where the u_k are kl-UCB indices: u_k = max{ q : N_k kl(µ̂_k, q) ≤ f(ℓ_L) }
  ⊕ Asymptotically optimal
  ⊖ Requires tuning of the confidence interval (f)

UTS (Paladino et al., 2017):
  γ ← degree of leader L
  A_{t+1} ← L if ℓ_L ≡ 0 [γ], otherwise argmax_{k ∈ N(L) ∪ {L}} θ_k,
  where the θ_k are Thompson samples: θ_k ∼ Beta(S_k + 1, N_k − S_k + 1)
  ⊕ Better empirical performance
  ⊖ Imprecise arguments in the analysis of (Paladino et al., 2017)
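OSUB's kl-UCB index max{ q : N_k kl(µ̂_k, q) ≤ f(ℓ_L) } can be computed by bisection, since kl(µ̂_k, ·) is increasing on [µ̂_k, 1]. A minimal sketch (function names and the iteration count are ours; the exploration function f is passed in as a plain budget):

```python
import math

def kl(x, y):
    """Binary KL divergence between Bernoulli means, clamped for safety."""
    eps = 1e-12
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def klucb_index(mu_hat, n_pulls, budget):
    """Largest q with n_pulls * kl(mu_hat, q) <= budget, by bisection
    (kl(mu_hat, .) is increasing on [mu_hat, 1])."""
    lo, hi = mu_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if n_pulls * kl(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# The index shrinks toward the empirical mean as the arm is pulled more.
assert klucb_index(0.5, 10, math.log(100)) > klucb_index(0.5, 1000, math.log(100))
```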
Unimodal Thompson Sampling
At each time step:
- Compute the empirical leader L ← argmax_{k ∈ [K]} µ̂_k
- Update the leader count ℓ_L ← ℓ_L + 1
- With exploration parameter γ ∈ N, γ ≥ 2, play
  A_{t+1} ← L if ℓ_L ≡ 0 [γ], otherwise argmax_{k ∈ N(L) ∪ {L}} θ_k,
  where the θ_k are Thompson samples: θ_k ∼ Beta(S_k + 1, N_k − S_k + 1)
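Following this pseudocode, a minimal UTS(γ) loop on the rank-one row/column graph might look as below. This is a hedged sketch, not the authors' implementation: variable names, tie-breaking, and the default mean 0.5 for unpulled arms are our choices.

```python
import numpy as np

rng = np.random.default_rng(1)

u = np.array([0.75, 0.25, 0.25, 0.25])
v = np.array([0.75, 0.25, 0.25, 0.25])
K, L = len(u), len(v)
arms = [(i, j) for i in range(K) for j in range(L)]
# neighbors in the rank-one graph: arms sharing a row or a column
neighbors = {a: [b for b in arms
                 if b != a and (b[0] == a[0] or b[1] == a[1])]
             for a in arms}

gamma = 2                             # leader-exploration parameter
S = dict.fromkeys(arms, 0)            # successes per arm
N = dict.fromkeys(arms, 0)            # pulls per arm
leader_count = dict.fromkeys(arms, 0)

def uts_round():
    # empirical leader (unpulled arms default to 0.5, our choice)
    leader = max(arms, key=lambda a: S[a] / N[a] if N[a] else 0.5)
    leader_count[leader] += 1
    if leader_count[leader] % gamma == 0:
        arm = leader                  # forced exploration of the leader
    else:
        # Thompson samples restricted to the leader's neighborhood
        theta = {a: rng.beta(S[a] + 1, N[a] - S[a] + 1)
                 for a in neighbors[leader] + [leader]}
        arm = max(theta, key=theta.get)
    i, j = arm
    r = rng.binomial(1, u[i]) * rng.binomial(1, v[j])
    S[arm] += r
    N[arm] += 1

for _ in range(2000):
    uts_round()
# the optimal arm (0, 0) should dominate the pull counts by now
```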
Regret upper bound for UTS
Theorem
Let µ be a unimodal bandit instance with respect to a graph G. We show that for all γ ≥ 2 and ε > 0,

R_µ(T, UTS(γ)) ≤ (1 + ε) ∑_{k ∈ N(k*)} (µ* − µ_k) / kl(µ_k, µ*) · log(T) + C(µ, γ, ε).
→ Matches the lower bound for the Unimodal bandit problem:

lim sup_{T→∞} R_µ(T, UTS(γ)) / log(T) ≤ ∑_{k ∈ N(k*)} (µ* − µ_k) / kl(µ_k, µ*)
→ ... for any exploration parameter γ ≥ 2
Sketch of proof for the Upper Bound of UTS
Outline of the proof:
R(T) = ∑_{k ≠ k*} ∆_k E[ ∑_{t=1}^{T} 1(K(t) = k) ]

     = ∑_{k ∈ N(k*)} ∆_k E[ ∑_{t=1}^{T} 1(K(t) = k, L(t) = k*) ]    } R_1(T)

     + ∑_{k ≠ k*} ∆_k E[ ∑_{t=1}^{T} 1(K(t) = k, L(t) ≠ k*) ]      } R_2(T)
R_1(T): the leader is the optimal arm k*
→ similar arguments to those for Thompson Sampling, restricted to N(k*)

R_2(T): the leader is a suboptimal arm
→ more sophisticated arguments
Sketch of proof for the Upper Bound of UTS
Denote by BN(k) = argmax_{ℓ ∈ N(k)} µ_ℓ the set of best neighbors of k.

R_2(T) ≤ ∑_{k ≠ k*} ∑_{t=1}^{T} P(L(t) = k)

       = ∑_{k ≠ k*} ∑_{t=1}^{T} P( L(t) = k, ∀k_2 ∈ BN(k), N_{k_2}(t) ≤ (ℓ_k(t))^b )
         [with Thompson Sampling it is unlikely that all optimal arms in the neighborhood of k are barely pulled]

       + ∑_{k ≠ k*} ∑_{t=1}^{T} P( L(t) = k, ∃k_2 ∈ BN(k), N_{k_2}(t) > (ℓ_k(t))^b )
         [leader exploration + k_2 is drawn often ⇒ k is unlikely to remain leader]
Experimental study of leader exploration parameter γ
→ UTS forces a play of the leader every γ-th time it has been the empirical leader.
Figure: Cumulative regret of UTS for γ ∈ {2, 5, 10, 15, 30, +∞}, K = L = 4, u = v = (0.75, 0.25, . . . , 0.25). Results averaged over 100 runs.
In our analysis, γ can be set to any arbitrary integer value ≥ 2
Empirically, forced exploration of the leader does not seem mandatory
But γ = 2 yields the best performance
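The γ-sweep can be reproduced in miniature; a hedged sketch (a compressed re-implementation with our own names; rewards are drawn directly as B(u_i v_j), which has the same distribution as the product of the two latent Bernoullis):

```python
import numpy as np

def uts_regret(gamma, T, seed=0):
    """Cumulative regret of a UTS sketch on a 4x4 rank-one instance.
    gamma=None disables forced leader exploration (the gamma = +inf case)."""
    rng = np.random.default_rng(seed)
    u = np.array([0.75, 0.25, 0.25, 0.25])
    mu = np.outer(u, u)  # u = v on this instance
    arms = [(i, j) for i in range(4) for j in range(4)]
    nbh = {a: [b for b in arms if b != a and (b[0] == a[0] or b[1] == a[1])]
           for a in arms}
    S = dict.fromkeys(arms, 0)
    N = dict.fromkeys(arms, 0)
    lead = dict.fromkeys(arms, 0)
    regret = 0.0
    for _ in range(T):
        # empirical leader (unpulled arms default to 0.5, our choice)
        leader = max(arms, key=lambda a: S[a] / N[a] if N[a] else 0.5)
        lead[leader] += 1
        if gamma is not None and lead[leader] % gamma == 0:
            arm = leader  # forced leader exploration
        else:
            theta = {a: rng.beta(S[a] + 1, N[a] - S[a] + 1)
                     for a in nbh[leader] + [leader]}
            arm = max(theta, key=theta.get)
        S[arm] += rng.binomial(1, mu[arm])
        N[arm] += 1
        regret += mu.max() - mu[arm]
    return regret
```

Comparing `uts_regret(2, T)` against `uts_regret(None, T)` over several seeds mimics the γ = 2 vs γ = +∞ curves of the figure, on a much smaller scale.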
Comparison with other algorithms
Figure: Cumulative regret of Rank1ElimKL, OSUB, UTS, KL-UCB on a 4×4 rank-one matrix with u = v = (0.75, 0.25, . . . , 0.25) (left). Regret in log-scale (right).
Experiments with larger matrices
Figure: Cumulative regret of Rank1ElimKL, OSUB, UTS, KL-UCB on a 128×128 rank-one matrix with u = v = (0.75, 0.25, . . . , 0.25) (left). Regret of OSUB and UTS in log-scale (right).
Summary
Our contributions:
Rank-one bandits are an instance of Unimodal bandits ⇒ we close the remaining theoretical gap
New analysis of UTS ⇒ UTS is asymptotically optimal, and γ can be set to any integer value ≥ 2
Experiments show outstanding performance of UTS
Future work:
Theoretical analysis for γ =∞?
Is leader exploration (γ = 2) helping in other structuredbandit problems?
Thank you for your attention!
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, Gilles Stoltz, et al. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013.

Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning, 2014.

Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Bernoulli rank-1 bandits for click feedback. In IJCAI, 2017a.

Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Stochastic rank-1 bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017b.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Stefano Paladino, Francesco Trovò, Marcello Restelli, and Nicola Gatti. Unimodal Thompson sampling for graph-structured arms. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.