1/19
Random Forest for the Contextual Bandit Problem
Raphael Feraud, Robin Allesiardo, Tanguy Urvoy, Fabrice Clerot (AISTATS 2016)
Jungtaek Kim ([email protected])
Machine Learning Group, Department of Computer Science and Engineering, POSTECH,
77 Cheongam-ro, Nam-gu, Pohang-si 37673, Gyungsangbuk-do, Republic of Korea
Mar 28, 2017
2/19
Table of Contents
Preliminary
  Decision Tree
  Random Decision Forests
  Multi-Armed Bandit
  Contextual Bandit
Random Forest for the Contextual Bandit Problem
3/19
Preliminary
4/19
Decision Tree
- A decision tree is used for classification and regression.
- Each node j holds a set that is the union of its children's sets. In a binary tree, S_j = S_j^L ∪ S_j^R and S_j^L ∩ S_j^R = ∅ are satisfied.
- The parameters of each node's split function are estimated to optimize a metric that defines the node's objective function, until a stopping criterion is reached.
- Metrics
  - Gini impurity: GINI(t) = 1 − Σ_j (p(j|t))².
  - Information gain: I = H(S) − Σ_i (|S_i|/|S|) H(S_i).
- Stopping criteria
  - Maximum depth limit
  - Node set size limit
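The two metrics above can be sketched in a few lines; this is a minimal illustration of the formulas, not an implementation from the paper:

```python
import math
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: GINI(t) = 1 - sum_j p(j|t)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy H(S) = -sum_j p(j) log2 p(j)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """I = H(S) - sum_i (|S_i|/|S|) H(S_i) for a candidate split."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)
```

A pure node has zero impurity; a perfectly balanced binary node has Gini impurity 0.5 and entropy 1.0, and a split that separates the two classes exactly achieves the full information gain.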
5/19
Decision Tree
Figure 1: The training process (left) estimates the tree parameters to maximize f(S_j, S_j^L, S_j^R, θ) for each node j, and the testing process (right) routes an unseen data point to a leaf node and determines an output using a predictor p(c|v).
6/19
Random Decision Forests
- A random decision forest is an ensemble of randomly trained decision trees.
- Commonly, such ensembles turn weak learners into strong learners.
- Methods to build a randomized decision tree:
  - Random training dataset sampling
    - Bagging: C̃_bag(x) = MajorityVote{C(S*_b, x)}_{b=1}^B.
    - Random forests: a refinement of bagging.
    - Boosting: C(x) = sign[Σ_{m=1}^M α_m C_m(x)].
  - Randomized node optimization
- A leaf predictor outputs a prediction based on the distribution over the classes to which the test data reaching that leaf might belong.
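The two bagging ingredients, bootstrap sampling and majority voting, can be sketched as follows; the classifier objects passed to `bagging_predict` are hypothetical (each would be trained on its own bootstrap replicate S*_b):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw |data| examples with replacement: one bootstrap replicate S*_b."""
    return [rng.choice(data) for _ in data]

def bagging_predict(classifiers, x):
    """Combine B classifier outputs: C_bag(x) = MajorityVote{C(S*_b, x)}."""
    votes = [c(x) for c in classifiers]
    return Counter(votes).most_common(1)[0][0]
```

Random forests additionally randomize node optimization, so that individual trees are decorrelated beyond what resampling alone provides.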
7/19
Multi-Armed Bandit
8/19
Multi-Armed Bandit
- For K arms, r_{a,t} is a reward sampled from an unknown stochastic process, where a ∈ {1, ..., K} and t is the time step.
- At each step, an agent chooses an arm a_t ∈ {1, ..., K} and receives a reward r_t = r_{a_t,t}.
- A sequential decision is built as
  a_t = f_t(a_1, r_1, ..., a_{t−1}, r_{t−1}).
- The goal is to minimize the cumulative regret
  R_n = max_{i=1,...,K} E[Σ_{t=1}^n r_{i,t}] − E[Σ_{t=1}^n r_{a_t,t}].
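This setup can be illustrated with a simple ε-greedy agent on Bernoulli arms; this is one standard baseline policy, not the paper's algorithm, and `arm_means` is a hypothetical vector of reward probabilities unknown to the agent:

```python
import random

def epsilon_greedy_bandit(arm_means, n_steps, eps=0.1, seed=0):
    """Play a K-armed Bernoulli bandit with an epsilon-greedy policy.

    Returns the cumulative reward and the (empirical) cumulative regret
    n * max_i mu_i - sum_t r_{a_t,t}.
    """
    rng = random.Random(seed)
    K = len(arm_means)
    counts, values = [0] * K, [0.0] * K
    total_reward = 0.0
    for t in range(n_steps):
        if t < K or rng.random() < eps:
            a = rng.randrange(K)                       # explore
        else:
            a = max(range(K), key=values.__getitem__)  # exploit
        r = 1.0 if rng.random() < arm_means[a] else 0.0
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]       # incremental mean
        total_reward += r
    regret = n_steps * max(arm_means) - total_reward
    return total_reward, regret
```

The incremental-mean update keeps per-arm estimates without storing the reward history, so the policy runs in O(K) memory regardless of the horizon.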
9/19
Contextual Bandit
- The setting is otherwise identical to the conventional multi-armed bandit problem.
- A feature vector x_t summarizes information about both the user and the arm a_t.
- A common goal for the regret bound is to achieve O(√T).
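The difference from the plain bandit can be sketched by keeping one value table per discrete context; this is a minimal illustration, not the paper's method, and `draw_context` / `reward_prob` are hypothetical helpers standing in for the environment:

```python
import random
from collections import defaultdict

def contextual_eps_greedy(draw_context, reward_prob, K, n_steps, eps=0.1, seed=0):
    """Epsilon-greedy with per-(context, arm) reward estimates.

    `draw_context(rng)` yields a discrete context x_t, and `reward_prob(x, a)`
    is the Bernoulli mean of arm a under context x (both hypothetical).
    """
    rng = random.Random(seed)
    counts = defaultdict(int)
    values = defaultdict(float)
    total = 0.0
    for _ in range(n_steps):
        x = draw_context(rng)
        if rng.random() < eps:
            a = rng.randrange(K)                               # explore
        else:
            a = max(range(K), key=lambda i: values[(x, i)])    # exploit per context
        r = 1.0 if rng.random() < reward_prob(x, a) else 0.0
        counts[(x, a)] += 1
        values[(x, a)] += (r - values[(x, a)]) / counts[(x, a)]
        total += r
    return total
```

Because the best arm may differ from context to context, a context-blind bandit converges to a single compromise arm, while the per-context table can track each context's optimum; generalizing across many contextual variables is exactly where tree-structured policies come in.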
10/19
Random Forest for the ContextualBandit Problem
11/19
Random Forest for the Contextual Bandit Problem
- Based on an optimal decision stump, an online random forest algorithm for the contextual bandit problem is proposed.
- The decision stumps are recursively stacked in a random collection of decision trees.
- Its computational cost is O(LMDT), where L is the number of decision trees, D is the maximum tree depth, M is the number of contextual variables, and T is the time horizon.
12/19
Gentle Start
13/19
Random Forest for the Contextual Bandit Problem
- The algorithm is near optimal: the sample complexity depends only logarithmically on the number of contextual variables, and the computational cost of the proposed algorithm is linear with respect to the time horizon.
14/19
Procedure: Random Forest for the Contextual Bandit Problem
1. Variable selection
2. Action selection
3. Tree update
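The three steps above can be mirrored, very loosely, by a single online decision stump learned under bandit feedback; this is a heavily simplified illustration and not the paper's Bandit Forest. Contexts are uniform binary vectors in {0,1}^M, and `reward_prob` is a hypothetical helper giving each arm's Bernoulli mean:

```python
import random

def online_stump_bandit(M, K, reward_prob, n_steps, eps=0.1, seed=0):
    """One online decision stump under bandit feedback (illustration only)."""
    rng = random.Random(seed)
    # Per (variable m, arm a, observed bit v) reward statistics.
    counts = [[[0, 0] for _ in range(K)] for _ in range(M)]
    sums = [[[0.0, 0.0] for _ in range(K)] for _ in range(M)]
    total = 0.0

    def mean(m, a, v):
        return sums[m][a][v] / counts[m][a][v] if counts[m][a][v] else 0.0

    for _ in range(n_steps):
        x = [rng.randrange(2) for _ in range(M)]
        # 1. Variable selection: split on the variable whose two branches
        #    disagree most about the value of their best arm.
        def gap(m):
            return abs(max(mean(m, a, 0) for a in range(K))
                       - max(mean(m, a, 1) for a in range(K)))
        m_star = max(range(M), key=gap)
        v = x[m_star]
        # 2. Action selection: eps-greedy on the chosen branch's estimates.
        if rng.random() < eps:
            a = rng.randrange(K)
        else:
            a = max(range(K), key=lambda i: mean(m_star, i, v))
        r = 1.0 if rng.random() < reward_prob(x, a) else 0.0
        # 3. Update: refresh the statistics of every variable for the
        #    observed context bits and the played arm.
        for m in range(M):
            counts[m][a][x[m]] += 1
            sums[m][a][x[m]] += r
        total += r
    return total
```

In the paper's algorithm these stumps are stacked recursively into trees and replicated across a random forest of L trees, with elimination-style selection rather than the ε-greedy choice used here.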
15/19
Variable Selection
16/19
Action Selection
17/19
Decision Stump
18/19
θ-Optimal Greedy Tree
19/19
Bandit Forest