fighting biases with dynamic boosting - arxiv biases with dynamic boosting ... andrey gulin yandex...
TRANSCRIPT
CatBoost: unbiased boosting with categorical featuresLiudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev,
Anna Veronika Dorogush, Andrey Gulin
Yandex
Moscow, Russia
ABSTRACTThis paper presents the key algorithmic techniques behind Cat-
Boost, a state-of-the-art open-source gradient boosting toolkit.
Their combination leads to CatBoost outperforming other pub-
licly available boosting implementations in terms of quality on a
variety of datasets. Two critical algorithmic advances introduced in
CatBoost are the implementation of ordered boosting, a permutation-
driven alternative to the classic algorithm, and an innovative al-
gorithm for processing categorical features. Both techniques were
created to fight a prediction shift caused by a special kind of tar-
get leakage present in all currently existing implementations of
gradient boosting algorithms. In this paper, we provide a detailed
analysis of this problem and demonstrate that proposed algorithms
solve it effectively, leading to excellent empirical results.
CCS CONCEPTS•Computingmethodologies→Classification and regressiontrees; Boosting;
KEYWORDSGradient boosting, ordering principle, categorical features, decision
trees
1 INTRODUCTIONGradient boosting is a powerful machine-learning technique that
achieves state-of-the-art results in a variety of practical tasks. For
a number of years, it has remained the primary method for learn-
ing problems with heterogeneous features, noisy data, and com-
plex dependencies: web search, recommendation systems, weather
forecasting, and many others [5, 19, 21, 23]. Gradient boosting is
essentially a process of constructing an ensemble predictor by per-
forming gradient descent in a functional space. It is backed by solid
theoretical results that explain how strong predictors can be built by
iteratively combining weaker models (base predictors) in a greedy
manner [15].
The most popular implementations of gradient boosting use
decision trees as base predictors. It is convenient to use decision
trees for numerical features, but, in practice, many datasets include
categorical features, each having a discrete set of values that are not
comparable with each other (e.g., user ID, city, or ZIP Code). The
most commonly used practice [6, 18] for dealing with categorical
features in gradient boosting is converting them to comparable
numbers before training, e.g., by using one-hot encoding or cal-
culating target–based statistics. In this paper, we present a new
gradient boosting algorithm that successfully handles categorical
features.
As we show in this paper, all existing implementations of gradi-
ent boosting face the following statistical issue. A prediction model
F obtained after several steps of boosting relies on the targets of
all training examples. We demonstrate that this actually leads to
a shift of the distribution of F (xk ) | xk for a training example xkfrom the distribution of F (x) | x for a test example x. This finallyleads to a prediction shift of the learned model. We consider this
problem as a consequence of a specific kind of target leakage in
Section 4. A similar issue occurs for standard algorithms of prepro-
cessing categorical features. Namely, a target–based statistic can be
considered as a simple statistical model. Such models usually suffer
from the same problem of target leakage leading to their prediction
shift, as we discuss in detail in Section 3.
To solve both problems discussed above, we use an ordering prin-ciple. Namely, we propose a modification of a standard gradient
boosting procedure called ordered boosting which allows to avoid
target leakage. This technique is combined with a new algorithm
for processing categorical features. As a result, the new algorithm
outperforms the existing state-of-the-art implementations of gradi-
ent boosted decision trees — XGBoost [7] and LightGBM [14] — on
a diverse set of popular machine learning tasks (see Section 6 for de-
tails). The algorithm is called CatBoost (for “Categorical Boosting”)
and is available as an open-source library.1
2 BACKGROUNDAssume we observe a dataset of examples D = {(xk ,yk )}k=1..n ,where xk = (x1k , . . . ,x
mk ) is a random vector of m features and
yk ∈ R is a target, which can be either binary or a numerical
response. Examples (xk ,yk ) are independent and identically dis-
tributed according to some unknown distribution P(·, ·). The goal ofa learning task is to train a function F : Rm → R which minimizes
the expected loss L(F ) := EL(y, F (x)). Here L(·, ·) is a smooth loss
function and (x,y) is a test example sampled from P independently
of training set D.
A gradient boosting procedure [11] builds iteratively a sequence
of approximations F t : Rm → R, t = 0, 1, . . . in a greedy fashion.
Namely, F t is obtained from the previous approximation F t−1 inan additive manner:
F t = F t−1 + αht , (1)
where α is a step size and function ht : Rm → R (a base predictor)is chosen from a family of functions H in order to minimize the
expected loss:
ht = argminh∈HL(F t−1 + h) = argminh∈HEL(y, F t−1(x) + h(x)).(2)
The minimization problem is usually approached by the Newtonmethod using a second–order approximation of L(F t−1 + ht ) atF t−1 or is substituted with a (negative) gradient step. Both methods
are kinds of functional gradient descent [10, 17]. In particular, the
1https://github.com/catboost/catboost
arX
iv:1
706.
0951
6v2
[cs
.LG
] 2
3 A
pr 2
018
Age < 4
Pulse < 90
Age > 14
Y
N
N
Y
N Y
Figure 1: Example of a simple decision tree classifying pa-tients into healthy and sick.
gradient step ht is chosen in such a way that ht (x) approximates
−дt (x,y), where дt (x,y) := ∂L(y,s)∂s
��s=F t−1(x). Usually, the least-
squares approximation is used:
ht = argminh∈HE(−дt (x,y) − h(x)
)2
. (3)
CatBoost is an implementation of gradient boosting, which uses
binary decision trees as base predictors. A decision tree [4, 10, 20]is a model built by a recursive partition of the feature space Rm .
It is encoded by a labeled rooted tree whose nodes correspond to
regions in Rm . The root corresponds to the whole space Rm . The
space is divided into several disjoint regions according to the values
of some splitting attribute. This attribute is encoded as the label
of the root, and the regions are encoded as the root descendants,
which form the first level of the tree. The directed edges outgoing
the root encode the values of the attribute. The partition procedure
described above is then recursively applied to currently terminal
nodes until some stopping criteria (e.g., the maximal allowed depth)
is reached. A simple example of a decision tree is presented on
Figure 1.
A node with outgoing edges is called an internal node, all other
nodes are called leaves (a.k.a. terminal nodes). Attributes are usually
chosen from among binary variables that identify that some feature
xk exceeds some threshold t , that is, a = 1{xk>t } , where xk is
either numerical or binary feature, in the latter case t = 0.5.2
Finally, each leaf is assigned to a value, which is an estimate of
the response y in the corresponding region in the regression task or
the predicted class label in the case of classification problem. Test
examples are processed by navigating them from the root of the
tree down to a leaf, according to the outcome of the attributes along
the path. In this way, a decision tree h can be written as h(x) =∑Jj=1 bj1{x∈Rj } , where Rj are the disjoint regions corresponding
to the leaves of the tree.
2Alternatively, non-binary splits can be used, e.g., a region can be split according
to all values of a categorical feature. However, such splits, compared to binary ones,
would lead to either shallow trees (unable to capture complex dependences) or to more
complex trees with many terminal nodes (having weaker target statistics in each of
them). According to [4], the tree complexity has a crucial effect on the accuracy of the
model and less complex trees are less prone to overfitting.
In the case of regression task, splitting features and thresholds are
usually chosen using the least–squares splitting criterion, and the
values bj are computed as mean target values in the corresponding
leaves. Note that in gradient boosting, at each iteration, a tree is
constructed to approximate the negative gradient (see Equation (3)),
so it solves a regression problem.
3 CATEGORICAL FEATURESCategorical feature is a feature having a discrete set of values called
categories which are usually not comparable with each other. As
we discussed in the previous section, splitting criteria 1{xk>t }for binary decision trees are based on thresholds t , therefore theycan be directly applied only to numerical or binary features and
not to categorical ones. Therefore, a common practice is to con-
vert categorical features to numbers at the preprocessing step, i.e.,
each category for each example is substituted with one or several
numerical values.
Themostwidely used technique usually applied to low-cardinality
categorical features is one-hot encoding: the original feature is re-moved and, for each category, one adds a new binary variable that
identifies that category [18]. One-hot encoding is implemented in
CatBoost; however, it can be ineffective for many practical tasks. For
example, the number of values of a categorical feature can poten-
tially be of order n (e.g., for “user ID” feature). In this case, one-hot
encoding leads to a huge number of binary variables causing large
memory requirements and computational cost. More importantly,
each individual binary feature would cover only a small part of
examples, what makes it ineffective as a splitting attribute. For
example, assume that the task is to predict clicks on an ad and half
of the N users prefer to make clicks while the other half does not.
Then, in order to split these users according to one-hot encoded
user ID, we need at least N /2 splits what makes it impossible to
combine user ID with other sparse features within the same tree.
Target–based statistics. A more effective way to deal with cat-
egorical features, which is a main focus of this section, is to compute
some target–based statistic (TBS) that maps a categorical feature to
one numerical feature [6, 18]. Namely, given a categorical feature
x i , we substitute the category x ik of k-th training example with
some numerical value x ik , which is computed based on the training
dataset using some of the target values yj , j = 1..n. Usually, x ik is
an estimate of the expected target y conditioned by the category:
xik ≈ E(y |xi = x ik ). A splitting attribute based on this feature parti-
tions all categories into those with larger expected target values and
those with smaller ones. Such attributes are usually more effective
than ones using one-hot encoded features.
Greedy TBS. The straightforward approach to estimate the ex-
pected target is to take its average value over the whole training
dataset3[6, 18]:
x ik =
∑nj=1 1{x ij =x ik }
· yj∑nj=1 1{x ij =x ik }
. (4)
3For test examples, we may encounter new categories; some default value is usually
set for such categories.
2
The problem of such greedy approach is that x ik is computed using
yi , which is the target of xk . This target leakage can affect the gen-
eralization of the learned model, because TBS feature x i containsmore information about the target of xk than it will carry about
the target of a test example with the same input feature vector. In
other words, we observe a conditional shift [22]: the distribution of
x i |y differs for training and test examples. Let us present a simple
example showing this shift and problems caused by it. Assume that
i-th feature is categorical, all its values are unique, and for each
category a, we have P(y = 1|x i = a) = 0.5 for a classification task.
Then, in the training dataset, x ik = yk , so it is sufficient to make
only one split at t = 0.5 to perfectly classify all training examples.
However, such a classifier would make errors on the test set with
probability 0.5. If we have at least one meaningful feature in the
dataset, then we need to remove the i-th feature from the dataset to
allow the boosting algorithm to build a better model. So, we have
the following desired quality for TBS:
P1 E(x i |y = v) = E(x ik |yk = v), where (xk ,yk ) is the k-th train-ing example.
In our example above, E(x ik |yk ) = yk ∈ {0, 1} and E(xi |y) = 0.5.
A standard way to improve greedy approach is to add some prior
value P :
x ik =
∑nj=1 1{x ij =x ik }
· yj + a P∑nj=1 1{x ij =x ik }
+ a, (5)
where a > 0 is a parameter. The prior value is usually added to
reduce the noise obtained from low-frequency categories. A stan-
dard technique for calculating prior P is to take the average label
value in the dataset [18]. It is easy to see that having a non-zero
prior with a positive weight does not prevent the conditional shift
and this approach still violates P1. Considering the same example,
note that x ik =aP1+a if yk = 0 and
1+aP1+a otherwise. So, we again can
perfectly classify the training dataset by taking t = 0.5+aP1+a for i-th
feature.
There are several ways to overcome the problem discussed above,
all of them can be expressed by the following formula:
x ik =
∑xj ∈Dk
1{x ij =x ik }· yj + a P∑
xj ∈Dk1{x ij =x ik }
+ a. (6)
In other words, for each training example xk we compute the TBS
using a subset Dk ⊂ {x1, . . . , xn }.
Holdout TBS. A straightforward approach is to partition the train-
ing dataset randomly into two partsD = ˆD0⊔ ˆD1 and useDk = ˆD0
to calculate the TBS according to (6) andˆD1 to perform training
(see, e.g., [7]). For such holdout TBS equality in P1 holds, but this
approach significantly reduces the amount of data used to train
the model and also to calculate the statistics. So, it violates the
following desired quality:
P2 It is desirable for x ik to have a low variance.
Leave-one-out TBS. Another seemingly appealing technique is
leave-one-out, where Dk = D \ xk for training examples and
Dk = D for test examples.4The motivation behind this approach
is that we do not use the target of an example to compute its
features (as in holdout TBS), but reduce the variance. However,
this technique still violates P1 and to illustrate the problem, let us
consider a simple case, where we have a category a and x ik = a
for all examples. Let n+ be the number of examples with y = 1,
then x ik =n+−yk+a Pn−1+a . However, for a test example x , we have
x i = n++a Pn+a . So, again, one can perfectly classify the training
dataset by taking t =n+−0.5+a pn−1+a . Equation in P1 does not hold:
E(x ik |yk ) =n Ey−yk+a P
n−1+a and E(x i |y) = n Ey+a Pn+a .
Ordered TBS. CatBoost uses a more effective strategy, which is
inspired by online learning algorithms. The underlying ordering
principle is the central idea of the current research. In the online
learning problem (e.g., for the task of click prediction [1, 16]), nu-
merous training examples become available sequentially in time,
and the model is updated on the basis of each new example without
reconsideration of the previous ones. Clearly, the values of TBS on
a new example (e.g., click-through rates) used by the currently built
model depend only on the previously observed examples, since the
subsequent ones are unavailable. Usually, the TBS for each example
is computed on the basis of a fixed number K of preceding exam-
ples. Note that, for such sliding window TBS, if we assume that
the distribution P(·, ·) does not change in time, the requirement P1
holds, since the joint distribution of y and x i is the same for train-
ing and test examples. To adapt this idea to standard non-online
settings, we propose to introduce an artificial “time”, i.e., an order
of the training examples. Also, for each sample, we use all the avail-
able history instead of K preceding examples to compute its TBS.
Namely, we perform a random permutation σ of the dataset and
takeDk = {xj : σ (j) < σ (k)} in Equation (6) for a training example
xk and Dk = D for a test example. Note that, the obtained orderedTBS satisfies the requirement P1, and we also reduce the variance
of x ik (see P2) compared to sliding window TBS.
Note that, if we fix only one random permutation, then preced-
ing examples would have TBS with much higher variance than
subsequent ones, which is undesirable. Therefore, CatBoost uses
several different permutations for different iterations of gradient
boosting. We discuss this in detail in Section 5.
4 PREDICTION SHIFT AND ORDEREDBOOSTING
In this section, we discuss the problem of prediction shift in gradient
boosting algorithms. Like in the case of TBS, it is caused by a special
kind of target leakage. Our solution is called the ordered boostingand resembles the ordered TBS method.
4.1 Prediction shiftLet us go back to the gradient boosting procedure described in
Section 2. In practice, the expectation in (3) is unknown and is
4https://nycdatascience.com/blog/meetup/featured-talk-1-kaggle-data-scientist-
owen-zhang/
3
usually approximated using the same dataset D:
ht = argminh∈H1
n
n∑k=1
(−дt (xk ,yk ) − h(xk )
)2
. (7)
Now we describe and analyze the following chain of shifts:
(1) the conditional distribution of the gradient дt (xk ,yk ) | xk(accounting for randomness of the datasetD) is shifted from
that distribution on a test example дt (x,y) | x;(2) as a result, the base predictor ht defined by Equation (7) is
biased with respect to the solution of Equation (3);
(3) this, finally, affects the generalization ability of the trained
model F t .
As in the case of TBS, these problems are consequences of target
leakage. Indeed, gradients used at each step are estimated using
the target values of the same data points the current model F t−1
was built on. To illustrate the problem, consider the regression task
with the quadratic loss function L(y, y) = (y − y)2. In this case,
the negative gradient −дt (xk ,yk ) in Equation (7) is equivalent to
the residual function r t−1(xk ,yk ) := yk − F t−1(xk ).5 It is based
on the currently learned predictions F t−1(xk ) whose conditionaldistribution F t−1(xk ) | xk for a training example xk is shifted, in
general, from the distribution F t−1(x) | x for a test example x. We
call this a prediction shift.In the simplest case where the problem of prediction shift can
be observed, we havem = 2 features x1,x2 that are independentidentically distributed Bernoulli random variables with success
probability 1/2. We assume y = f ∗(x) = c1x1 + c2x2.Assume we follow the gradient boosting algorithm with the
decision trees of depth 1 (decision stumps), with step size α = 1,
and the number of iterations N = 2. That is, the trained model
F = F 2 is expressed as F 2 = h1 + h2. We can assume without
loss of generality that h1 is based on x1 and h2 is based on x2.The leaf values of hj are obtained according to Equation (7) based
on a dataset Dj = {(xj,k ,yj,k )}k=1..n . We are able to prove the
following statements:
Proposition 1. (1) Assume that two independent samplesD1
and D2 of size n are used to estimate h1 and h2, respectively,using Equation (7). Then ED1,D2
F 2(x) = f ∗(x)+O(1/2n ) forany x ∈ {0, 1}2.
(2) Assume that the same dataset D = D1 = D2 is used inEquation (7) for both h1 and h2. Then EDF 2(x) = f ∗(x) −1
n−1c2(x2 −1
2) +O(1/2n ).
This proposition means that the trained model F 2 is an unbiased
estimation of the true dependencey = f ∗(x), when we use indepen-
dent datasets at each gradient step.6At the same time, when we use
the same dataset at each step, we suffer from a bias − 1
n−1c2(x2 −1
2),
which is inversely proportional to the size of the dataset n. An-other important conclusion here is that the value of the bias can
be affected by the relation between x and y: in our example, it is
proportional to c2. We also note that the bias is caused by statistical
issues, and it is not related to any wrong assumptions underlying
5Here we removed the multiplier 2. This is equivalent to decreasing the step size α by
2 times, what does not matter for the current analysis.
6Up to an exponentially small term, which occurs for a technical reason.
the model. We track the chain of shifts in sketch of the proof below,
while the full proof of Proposition 1 is available in Appendix.
Sketch of the proof . Let us consider the second part of the proposi-
tion. Denote by ξst , s, t ∈ {0, 1}, the number of examples xk ∈ Dwith xk = (s, t). We have h1(s, t) = c1s +
c2ξs1ξs0+ξs1
. Its expectation
E(h1(x)) on a test example x equals c1x1 +
c22+ O(2−n ). At the
same time, the expectation E(h1(xk )) on a training example xk is
different and eqials (c1x1 + c22) − c2( 2x
2−1n ) +O(2−n ). That is, we
experience a prediction shift of h1. As a consequence, the expectedvalue of h2(x) is E(h2(x)) = c2(x2 − 1
2)(1 − 1
n−1 ) +O(2−n ) on a test
example x and E(h1(x)+h2(x)) = f ∗(x) − 1
n−1c2(x2 −1
2)+O(1/2n ).
□
Finally, let us stress again the connection between conditional
shift for TBS discussed in Section 3 (see P1) and prediction shift
in gradient boosting discussed in this section. At each step t of astandard boosting, we have a current model F t−1 built so far. The
conditional distribution of F t−1(xk ) | xk (and those of F t−1(xk ) |yk ) for a training example xk is shifted w.r.t. the same distribution
for an independent example from the distribution P(·, ·). The same
is applied to the distribution of residuals r t−1(xk ,yk ) | xk : we havea residual shift. These shifts are cased by target leakage, i.e., by
using targets yk on the previous steps of boosting. Greedy TBS x i
can be considered as a simple statistical model predicting the target
y and it suffers from a similar problem, conditional shift of x ik | yk ,caused by the target leakage, i.e., using yk to compute x ik .
Related work. The problems caused by the correlation between
the random deviations in the training dataset and the outputs of
the model F t−1 were considered previously in papers on boost-
ing [3, 12]. Based on the out-of-bag estimation first proposed in
his paper [2], Breiman proposed iterated bagging [3] which simul-
taneously constructs K models Fi , i = 1, . . . ,K , associated with
K independently bootstrapped subsamples Di . At t-th step of the
process, approximations F ti are grown from their predecessors F t−1ias follows. The current estimateMt
j at example j is obtained as the
average of the outputs of all models F t−1k such that j < Dk . The
term hti is built as a predictor of the residuals rtj := yj −Mt
j (targets
minus current estimates) on Di . Finally, the models are updated:
F tj := F t−1j + hti .
Unfortunately, the residuals r tj used in this procedure are not
unshifted, because each model F ti depends on each observation
(xj ,yj ) by construction. Indeed, althoughhtk does not useyj directly,
if j < Dk , it still usesMt−1j′ for j ′ ∈ Dk , which, in turn, can depend
on (xj ,yj ). The same is applicable to the idea of Friedman [12],
where subsampling of the dataset at each iteration was proposed.
4.2 Ordered boostingHere we propose a boosting algorithm which does not suffer from
the prediction shift problem described in Section 4.1. For simplicity,
we consider the case of a regression task where, at each step, the
base predictor ht (x) approximates the residual function r t−1(x,y).So, we aim at developing an algorithm with unshifted residuals.
Assuming access to an unlimited amount of training data, we
can easily construct such an algorithm. At each step of boosting,
4
1 2 3 654 7
𝑀5𝑡−1
𝑀6𝑡−1
𝑟𝑡 𝑥7, 𝑦7 = 𝑦7 −𝑀6𝑡−1(𝑥7)
8 9
Figure 2: Ordered boosting principle. For simplicity of visu-alization, on this figure we assume that sigma is an identitypermutation.
we sample a new dataset Dt independently and obtain unshifted
residuals by applying the current model to new training samples.
In practice, however, labeled data is typically limited. Assume
that we learn amodel with I trees. Tomake the residual r I−1(xk ,yk )unshifted, we need to have F I−1 trained without the example xk .Since we need unbiased residuals for all training examples, no ob-
servations may be used for training F I−1, which at first glance
makes the training process impossible. However, it is possible to
maintain a set of models differing by examples used for their train-
ing. Then, for calculating the residual on an example, we use a
model trained without it. At each step t of the learning process,
each model can be interpreted as an approximation of a virtual
model F t . In order to construct such a set of models, we consider
online learning algorithms again. To calculate the gradient for a
new example for updating the model, they use the current model
trained on the previous examples [1, 16], what provides unshifted
residual and unshifted gradient.
Thereby, we use the same trick with random ordering of exam-
ples, previously applied to TBS in Section 3. Let us assume that xkare ordered according to a random permutation σ . We maintain
n different modelsM1, . . . ,Mn such that the modelMi is learned
using only the first i samples in σ . At each step, in order to obtain
the residual for j-th sample, we use the modelMj−1 (see Figure 2).The resulting Algorithm 1 is called ordered boosting below. Unfor-
tunately, this algorithm is not feasible in most practical tasks due
to the need of maintaining n different models, which increase the
complexity and memory requirements by n times. In CatBoost, we
implemented a modification of this algorithm on the basis of the
gradient boosting algorithm with decision trees as base predictors
(GBDT). We describe our implementation in Section 5 in details.
4.3 Ordered boosting with categorical featuresIn Sections 3 and 4.2 we proposed to use some random permuta-
tions σcat and σboost of training examples for the TBS calculation
and for ordered boosting, respectively. Now, being combined in one
algorithm, should these two permutations be somehow dependent?
We argue that they should coincide. Otherwise, there exist exam-
ples xi and xj such that σboost (i) < σboost (j) and σcat (i) > σcat (j).
Algorithm 1: Ordered boosting
input : {(xk ,yk )}nk=1, I ;
1 σ ← random permutation of [1,n] ;2 Mi ← 0 for i = 1..n;
3 for t ← 1 to I do4 for i ← 1 to n do5 ri ← yi −Mσ (i)−1(i);6 for i ← 1 to n do7 M ← LearnModel((xj , r j ) : σ (j) ≤ i;
8 Mi ← Mi +M ;
9 returnMn
Then, the modelMσboost (j) is trained using TBS features of, in par-
ticular, example xi , which are calculated using yj . In general, it
may shift the prediction Mσboost (j)(xj ). To avoid such a shift, we
set σcat = σboost in CatBoost. In the case of the ordered boosting
(Algorithm 1) with sliding window TBS, it guarantees that the pre-
dictionMi (xi ) is not shifted for i = 1, . . . ,n, since, first, the targetyi was not used for training Mi (neither for the TBS calculation,
nor for the gradient estimation) and, second, the distribution of
TBS x i is the same for a training example and a test example with
the same value of feature x i .
5 PRACTICAL IMPLEMENTATIONIn this section, we describe an efficient implementation of ordered
boosting (Algorithm 1) with ordered TBS used in CatBoost.
Overview of the algorithm. In GBDT, the process of constructing
a decision tree can be divided into two phases: choosing the tree
structure (i.e., splitting attributes) and choosing the values bj inleaves [11]. CatBoost performs the second phase using the standard
GBDT scheme and has two modes for the first phase, Ordered and
Plain. The latter mode corresponds to a combination of the standard
GBDT algorithm with ordered TBS. The former mode presents an
efficient modification of Algorithm 1 using one tree structure shared
by all the models to be built.
At the start, CatBoost generates s + 1 independent random per-
mutations of the training dataset. The permutations σ1, . . . ,σs areused for constructing decision trees, while σ0 serves for choosingthe leaf values bj . Each permutation is used both for computing or-
dered TBS and for Ordered mode simultaneously. Since, for a given
permutation, both TBS and predictions used by ordered boosting
(Mσ (i)−1(i) in Algorithm 1) for examples with short history have a
high variance and thus may increase the variance of the final model
predictions, using several permutations allows us to reduce this
effect, what is confirmed by our experiments in Section 6.6.
A detailed description of CatBoost is provided in Algorithm 2.
Building a tree. CatBoost uses oblivious decision trees, where
the same splitting criterion is used across an entire level of the
tree [9, 13]. Such trees are balanced, less prone to overfitting, and
allow speeding up prediction significantly at testing time.
At each iteration t of the algorithm, we sample a random per-
mutation σr from {σ1, . . . ,σs } and construct a tree Tt on the basis
5
Algorithm 2: CatBoostinput : {(xi ,yi )}ni=1, I , α , L, s ,Mode
1 σi ← random permutation of [1,n] for i = 0..s;
2 Sr (i) ← 0 for r = 0..s , i = 1..n;
3 S ′r, j (i) ← 0 for r = 1..s , i = 1..n, j = 1..⌈log2n⌉;
4 for t ← 1 to I do5 дrad ← CalcGradient(L, S,y);6 дrad ′ ← CalcGradient(L, S ′,y);7 r ← random(1, s);8 Tt ← BuildTree(Mode,дradr ,дrad
′r ,σr , {xi }ni=1);
9 lea fr,i ← GetLeaf (xi ,Tt ,σr ) for r = 0..s, i = 1..n;
10 foreach leaf Rtj in Tt do11 btj ← −avg(дrad0(i) for i : lea fr,i = j);12 S, S ′ ← UM(Mode, lea f ,Tt , {btj }j , S, S
′,
дrad,дrad ′, {σr }sr=1);13 return F (x) = ∑I
t=1∑j α b
tj 1{GetLeaf (x,Tt ,ApplyMode)=j } ;
Function UpdateModel(UM)
input : Mode ,lea f , T ,{bj }j , S , S ′, дrad , дrad ′, {σr }sr=11 foreach leaf j in T do2 foreach i s.t. lea fr,i = j do3 S0(i) ← S0(i) + αbj ;4 for r ← 1 to s do5 foreach i s.t. lea fr,i = j do6 if Mode == Plain then7 Sr (i) ← Sr (i) − α avg(дradr (p) for
p : lea fr,p = j);8 else9 for l ← 1 to ⌈log
2n⌉ do
10 S ′r,l (i) ← S ′r,l (i) − α avg(дrad ′r,l (p) forp : lea fr,p = j, σr (p) < 2
l );11 Sr (i) ← S ′r, ⌊log
2σr (i)⌋ (i);
12 return S, S ′
of it (line 7 of Algorithm 2). In the Plain mode, it defines TBS to be
used in two aspects:
(1) As a model F t−1 in Equation (2), we use a model Sr (we
omit t for brevity) trained with ordered TBS based on the
permutation σr . These TBS are used to define the leaf each
example belongs to: the function GetLeaf (x,T ,σ ) servingfor this task calculates TBS for categorical features used inTon the basis of the original features x and the permutation σ .Each model Sr ′ , r
′ = 1..s , can be considered as an approxi-
mation of F t−1 on the basis of a given TBS. Note that the treeTt built on the basis of the model Sr , is used for updating all
the models Sr ′ , r′ = 1..s in line 12 of Algorithm 2.
Function BuildTree
input : Mode , дrad , дrad ′, σ , {xi }ni=11 T ← empty tree;
2 foreach step of top-down procedure do3 Form a set C of candidate splits;
4 foreach c ∈ C do5 Tc ← add split c to T ;
6 lea fi ← GetLeaf (xi ,Tc ,σ ) for i = 1..n;
7 foreach leaf j in Tc do8 foreach i : lea fi = j do9 if Mode == Plain then
10 ∆(i) ← avg(дrad(p) for p : lea fp = j);11 else12 ∆(i) ← avg(дrad ′⌊log
2σr (i)⌋ (p) for
p : lea fp = j,σ (p) < σ (i));
13 loss(Tc ) ← ||∆ − дrad | |214 T ← argminTc (loss(Tc ))15 return T
(2) In order to obtain a leaf value for evaluating a candidate split
(line 9 of Function BuildTree), we average the gradients ofall the examples in the leaf, where the gradient дradr (i) forthe example i (corresponding to дt (xi ,yi ) from Equation (3))
is calculated in the point of prediction Sr (i) based on the
permutation σr .
In the Plain mode, updating the models Sr ′ , r′ = 1..s (line 7 of
FunctionUpdateModel ) follows the standard gradient step: the leafvalue is the negative average of the gradients дradr ′ (in the leaf)
calculated on predictions of the model Sr ′ to be updated.
The setC of candidate splits for a tree is formed in line 2 of Func-
tion BuildTree by considering all the original numerical features
and TBS based on the categorical features.7
The difference of the Ordered mode is that each model Sr ′ , r′ =
1..s, is trained according to the ordering principle (Algorithm 1):
it is approximated by models S ′r ′, j , j = 1, . . . , ⌈log2n⌉, where the
model S ′r ′, j corresponds toM2j in Algorithm 1
8: we do not use ex-
amples l : σr ′(l) > 2jfor its training (unless it becomes impossible
due to the restriction of one tree structure shared by all the models).
For this purpose, while training these models, we always obtain
an individual leaf value for each example. At the candidate splits
evaluation step, the leaf value for the example i is obtained by the
gradient step (line 11 of Function BuildTree) with averaging the gra-dients of the preceding examples p; each gradientдrad ′⌊log
2σr (i)⌋ (p)
is obtained in the point of prediction S ′r, ⌊log2σr (i)⌋ (p). According
to line 10 of FunctionUpdateModel , this prediction is learned from
the part of these examples p′ : σr (p′) < 2⌊log
2σr (i)⌋
.
7The set C also contains TBS based on some feature combinations, chosen by the
algorithm described at the end of this section.
8The number of models to be maintained is reduced from n to ⌈log
2n ⌉ in comparison
with Algorithm 1 for the sake of efficiency.
6
Table 1: Computational complexity
Procedure Complexity for iteration t
CalcGradient O(s · n)BuildTree O(|C | · n)Calc leaf values btj O(n)UpdateModel O(s · n)Calc ordered TBS O(NT BSt · n)
Choosing leaf values. Given the tree Tt , the leaf value btj of
the final model F is calculated by the gradient step in line 10 of
Algorithm 2, where the gradient дrad0(i) for example i is calculatedin the point of prediction S0(i) obtained by applying the current
version of the final model F (i.e., with leaf values btj , what can be
seen from line 3 of Function UpdateModel) with TBS based on
the permutation σ0. When applying the final model F to a new
example (line 13 of Algorithm 2), using ApplyMode instead of a
permutation in the functionGetLeaf means, that TBS are calculated
by the greedy approach on the training data, as it was described in
Section 3.
Complexity.We present the computational complexity of dif-
ferent components of any of the two modes of CatBoost per one
iteration in Table 1. We first prove these asymptotics for the Or-
dered mode. For this purpose, we estimate the number Npred of
predictions Sr, j (i) to be maintained:
Npred < (s + 1) ·log
2n+1∑
j=12j < (s + 1) · 2log2 n+2 = 4(s + 1)n
Then, obviously, the complexity of CalcGradient is O(Npred ) =O(s · n). The complexity of leaf values calculation is O(n), sinceeach example i is included only in averaging operation in leaf lea fi .
Calculation of the ordered TBS for one categorical feature can
be performed sequentially in the order of the permutation by nadditive operations for calculation of n partial sums and n divi-
sion operations. Thus, the overall complexity of the procedure is
O(NT BSt · n), where NT BSt is the number of TBS which were not
calculated on the previous iterations. Since the leaf values ∆(i) cal-culated in the BuildTree procedure can be considered as ordered
TBS, where gradients play the role of targets, the complexity of this
procedure is O(|C | · n).Finally, in the UpdateModel procedure, we need to perform one
averaging operation for each l = 1, . . . , ⌈log2n⌉, and each gradient
дrad ′r,l is included in one averaging operation. Thus, the number
of operations is bounded by the number of the gradients дrad ′r,l ,which is equal to Npred = O(s · n).
To finish the proof, note that any component of the Plain mode
is not less efficient than the same one of the Ordered mode but,
at the same time, cannot be more efficient than corresponding
asymptotics from Table 1. Thus, our implementation of Ordered
boosting with decision trees has the same asymptotic complexity as
the standard GBDT with ordered TBS. In comparison with standard
GBDT, the latter slows down by the factor of s the procedures
CalcGradient ,UpdateModel and computation of TBS by any other
approach discussed in Section 3, as well as ordered TBS with s = 1.
Feature combinations. Let us describe another important im-
plementation detail of CatBoost: using feature combinations for
TBS. Tomotivate this, assume that the task is ad click prediction and
we have the following categorical features: user ID and ad topic. A
user might prefer ads on a given topic, however, when we substitute
user ID and ad topic by corresponding TBS, we loose this informa-
tion. TBS computed for a concatenation of these two features solves
this problem and gives a new powerful feature. However, the num-
ber of possible combinations grows exponentially with the number
of categorical features in the dataset and it is infeasible to consider
all of them in the algorithm. When constructing a new split for the
current tree, CatBoost considers combinations in a greedy way. No
combinations are considered for the first split in the tree. For the
next splits, CatBoost combines all combinations and categorical
features present in the current tree with all categorical features in
the dataset. Combination values are converted to TBS on the fly.
Feature combinations are used in Algorithm 2 in the follow-
ing way: the functionGetLeaf (x,T ,σ ) also calculates TBS for the
combinations of categorical features used in the tree T ; the set Cof candidate splits for a tree (line 2 of Function BuildTree) alsoincludes feature combinations, chosen by the above algorithm.
6 EXPERIMENTS6.1 Comparison with baselinesWe compare our algorithm with the most popular open-source
libraries — XGBoost and LightGBM — on several well-known ma-
chine learning tasks. The detailed description of the experimental
setup as well as dataset descriptions are published on our Github
together with the code of the experiment and a container with all
the used libraries, so that the results can be reproduced.9Additional
dataset used in this experiment is Epsilon10
(4 · 105 samples, 2000
features). We made categorical features preprocessing using the
statistics on random permutation of data. The parameter tunning
and training were performed on 4/5 of the data and the testing was
performed on the other 1/5.11
The training with selected parame-
ters was performed 5 times with different random seeds, we report
the average loss obtained (logloss and zero-one loss), the results
are presented in Table 2. For CatBoost we used Ordered mode in
this experiment.12
One can see that CatBoost outperforms other algorithms on all
considered datasets. In our Github repository you can also see that
CatBoost with default parameters outperforms tunned XGBoost on
all datasets and LightGBM on all but one dataset.
We also measured statistical significance of improvements pre-
sented in Table 2: except three datasets (KDD appetency, churn
and upselling) the improvements are statistically significant with
p-value≪ 0.01 measured by paired t-test.
7
Table 2: Comparison with baselines: logloss / zero-one loss, relative increase is presented in the brackets.
CatBoost LightGBM XGBoost
Adult 0.2695 / 0.1267 0.2760 (+2.4%) / 0.1291 (+1.9%) 0.2754 (+2.2%) / 0.1280 (+1.0%)
Amazon 0.1394 / 0.0442 0.1636 (+17%) / 0.0533 (+21%) 0.1633 (+17%) / 0.0532 (+21%)
Click prediction 0.3917 / 0.1561 0.3963 (+1.2%) / 0.1580 (+1.2%) 0.3962 (+1.2%) / 0.1581 (+1.2%)
Epsilon 0.2647 / 0.1086 0.2703 (+1.5%) / 0.114 (+4.1%) 0.2993 (+11%) / 0.1276 (+12%)
KDD appetency 0.0715 / 0.01768 0.0718 (+0.4%) / 0.01772 (+0.2%) 0.0718 (+0.4%) / 0.01780 (+0.7%)
KDD churn 0.2319 / 0.0719 0.2320 (+0.1%) / 0.0723 (+0.6%) 0.2331 (+0.5%) / 0.0730 (+1.6%)
KDD Internet 0.2089 / 0.0937 0.2231 (+6.8%) / 0.1017 (+8.6%) 0.2253 (+7.9%) / 0.1012 (+8.0%)
KDD upselling 0.1662 / 0.0490 0.1668 (+0.3%) / 0.0491 (+0.1%) 0.1663 (+0.04%) / 0.0492 (+0.3%)
Kick prediction 0.2855 / 0.0949 0.2957 (+3.5%) / 0.0991 (+4.4%) 0.2946 (+3.2%) / 0.0988 (+4.1%)
Table 3: Comparison of running times on Epsilon
time per tree
CatBoost Plain 1.1 sCatBoost Ordered 1.9 s
XGBoost 3.9 s
LightGBM 1.1 s
6.2 Running timesIt is quite hard to compare different boosting libraries in terms of
training speed. Every algorithm has a vast number of parameters
which affect training speed, quality and model size in a non-obvious
way. Different libraries have their unique quality/training speed
trade-off’s and they cannot be compared without domain knowl-
edge (e.g., is 0.5% of quality metric worth it to train model 3-4 times
slower?). Plus for each library it is possible to obtain almost the
same quality with different ensemble sizes and parameters. As a
result, one cannot compare libraries by time needed to obtain a
certain level of quality. As a result, we could give only some insights
of how fast our implementation could train a model of a fixed size.
We use Epsilon dataset and we measure mean tree construction
time one can achieve without using feature subsampling and/or
bagging by CatBoost (both Ordered and Plain modes), XGBoost
(we use histogram-based version, which is faster) and LightGBM.
For XGBoost and CatBoost we use default tree depth equal to 6, for
LightGBM we set leaves count to 64 to have comparable results. We
run all experiments on the same machine with Intel Xeon E3-12xx
2.6GHz, 16 cores, 64GB RAM and run all algorithms with 16 threads.
We set such learning rate that algorithms start to overfit approx-
imately after constructing about 7000 trees and measure average
time to train ensembles of 8000 trees. Mean tree construction time
is presented in Table 3. Note that CatBoost Plain and LightGBM
are the fastest ones followed by Ordered mode, which is about 1.7
times slower, which is expected.
Finally, let us note that CatBoost has a highly efficient GPU
implementation. The detailed description and comparison of the
9https://github.com/catboost/benchmarks/tree/master/quality_benchmarks
10https://www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets/binary.html
11For Epsilon we use default parameters instead of parameter tuning due to large
running time for all algorithms. We tune only the number of trees to avoid overfitting.
12The numbers for CatBoost in Table 2 may slightly differ from the corresponding
numbers in our Github repository, since we use another version of CatBoost with all
discussed features implemented.
Table 4: Plain mode: logloss, zero-one loss and their relativechange compared to Ordered mode.
Logloss Zero-one loss
Adult 0.2723 (+1.1%) 0.1265 (-0.1%)
Amazon 0.1385 (-0.6%) 0.0435 (-1.5%)
Click prediction 0.3915 (-0.05%) 0.1564 (+0.19%)
Epsilon 0.2663 (+0.6%) 0.1096 (+0.9%)
KDD appetency 0.0718 (+0.5%) 0.0179 (+1.5%)
Kdd churn 0.2317 (-0.06%) 0.0717 (-0.17%)
KDD internet 0.2170 (+3.9%) 0.0987 (+5.4%)
KDD upselling 0.1664 (+0.1%) 0.0492 (+0.4%)
Kick prediction 0.2850 (-0.2%) 0.0948 (-0.1%)
-5
0
5
10
15
20
25
0.001 0.01 0.1 1
rela
tive
logl
oss
for
Plai
n, %
fraction
AdultAmazon
Click predictionKDD appetency
KDD churnKDD internet
KDD upsellingKick prediction
Figure 3: Relative error of Plain mode compared to Orderedmode depending on the fraction of the dataset.
running times are beyond the scope of the current article, but these
experiments can be found on the corresponding Github page.13
6.3 Ordered and plain modesIn this section, we compare two essential modes of CatBoost: Plain
and Ordered. First, we compared their performance on all consid-
ered datasets, the results are presented in Table 4. It can be clearly
seen that Ordered mode is particularly useful on small datasets.
Indeed, the largest benefit from Ordered is observed on Adult and
Internet datasets, which are relatively small (less than 40K training
13https://github.com/catboost/benchmarks/tree/master/gpu_training
8
Table 5: Comparison of target–based statistics, relativechange in logloss / zero-one loss
Greedy Holdout Leave-one-out
Adult +1.1% / +0.79% +2.1 % / +2.0% +5.5% / +3.7%
Amazon +40% / +32% +8.3% / +8.3% +4.5% / +5.6%
Click prediction +13% / +6.7% +1.5% / +0.51% +2.7% / +0.90%
KDD appetency +24% / +0.68% +1.6% / -0.45% +8.5% / +0.68%
KDD churn +12% / +2.1% +0.87% / +1.3% +1.6% / +1.8%
KDD Internet +33% / +22% +2.6% / +1.8% +27% / +19%
KDD upselling +57% / +50% +1.6% / +0.85% +3.9% / +2.9%
Kick prediction +22% / +28% +1.3% / +0.32% +3.7% / +3.3%
examples), which supports our hypothesis that a higher bias nega-
tively affects the performance. Indeed, according to Proposition 1
and our reasoning in Section 4.1, bias is expected to be larger for
smaller datasets (however, it can also depend on other properties of
the dataset, e.g., on the dependency between features and target).
In order to further validate this hypothesis, we make the follow-
ing experiment: we train CatBoost in Ordered and Plain modes on
randomly filtered datasets and compare the obtained losses, see Fig-
ure 3. As we expected, for smaller datasets the relative performance
of Plain mode becomes worse. To save space, here and below we
present the results only for logloss since the figures for zero-one
loss are similar.
6.4 Analysis of target–based statisticsWe compare different TBSs by implementing them in CatBoost
instead of Ordered TBS, keeping all other algorithmic details the
same; the results can be found in Table 5. Here, to save space, we
present only relative increase in loss functions for each algorithm
compared to CatBoost in Ordered mode. Note that the ordered TBS
used in CatBoost significantly outperforms all other approaches.
Also, among baselines, holdout is the best for most of the datasets
since it does not suffer from conditional shift discussed in Section 3
(P1), so it is worse than CatBoost because of the larger variance of
TBS (P2). Leave-one-out is usually better than Greedy, but it can
be much worse on some datasets, e.g., on Adult. The reason is that
Greedy suffers from low-frequency categories, while Leave-one-out
suffers also from high-frequency ones, and on Adult all features
have high frequency.
Finally, let us note that in Table 5 we combine Ordered mode of
CatBoost with different TBSs. To generalize this results, we also
made similar experiment by combining different TBSs with Plain
mode, used in standard gradient boosting. The obtained results and
conclusions turned out to be very similar to the ones discussed
above.
6.5 Feature combinationsIn this section, we demonstrate the effect of feature combinations
discussed in Section 3. To do this, we vary the number of features
cmax allowed to be combined from 1 to 5, where 1 corresponds to
the absence of combinations. Figure 4 presents relative change in
logloss compared to cmax = 1. One can see that except some minor
variations the performance of CatBoost grows with cmax and the
-12
-10
-8
-6
-4
-2
0
2
1 2 3 4 5
rela
tive
chan
ge in
logl
oss,
%
complexity of combinations
AdultAmazon
Click predictionKDD appetency
KDD churnKDD internet
KDD upsellingKick prediction
Figure 4: Relative change in logloss for a given allowed com-plexity compared to the absence of combinations.
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1 2 3 4 5 6 7 8 9 10
rela
tive
chan
ge in
logl
oss,
%
number of permutations
AdultAmazon
Click predictionKDD appetency
KDD churnKDD internet
KDD upsellingKick prediction
Figure 5: Relative change in logloss for a given number ofpermutations s compared to s = 1.
largest improvement, which can reach 11%, is observed when we
change cmax from 1 to 2.
6.6 The effect of the number of permutationsHere we analyze the effect of the number of permutations s onthe performance of CatBoost. Figure 5 presents the relative im-
provement in logloss for each s compared to s = 1. Although we
noticed that the number of permutations has a smaller effect on
the algorithm performance than previously discussed options of
CatBoost, we see that logloss generally decreases with s and takings = 2 or s = 3 is good enough.
7 CONCLUSIONIn this paper, we identify and analyze the problem of prediction
shifts present in all existing implementations of gradient boosting.
We propose a general solution, ordered boosting with ordered TBS,
which solves the problem. This idea is implemented in CatBoost,
which is an open-source gradient boosting library. Empirical results
demonstrate that CatBoost outperform leading GBDT packages and
leads to state-of-the-art results on common benchmarks.
9
REFERENCES[1] Léon Bottou and Yann L Cun. 2004. Large scale online learning. In Advances in
neural information processing systems. 217–224.[2] Leo Breiman. 1996. Out-of-bag estimation. (1996).
[3] Leo Breiman. 2001. Using iterated bagging to debias regressions. MachineLearning 45, 3 (2001), 261–277.
[4] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984.
Classification and regression trees. CRC press.
[5] Rich Caruana and Alexandru Niculescu-Mizil. 2006. An empirical comparison of
supervised learning algorithms. In Proceedings of the 23rd international conferenceon Machine learning. ACM, 161–168.
[6] Bojan Cestnik et al. 1990. Estimating probabilities: a crucial task in machine
learning.. In ECAI, Vol. 90. 147–149.[7] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system.
In Proceedings of the 22Nd ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining. ACM, 785–794.
[8] Bruno De Finetti. 2017. Theory of probability: a critical introductory treatment.Vol. 6. John Wiley & Sons.
[9] Michal Ferov and Marek Modry. 2016. Enhancing LambdaMART Using Oblivious
Trees. arXiv preprint arXiv:1609.05610 (2016).[10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2000. Additive logistic
regression: a statistical view of boosting. The annals of statistics 28, 2 (2000),
337–407.
[11] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting
machine. Annals of statistics (2001), 1189–1232.[12] Jerome H Friedman. 2002. Stochastic gradient boosting. Computational Statistics
& Data Analysis 38, 4 (2002), 367–378.[13] Andrey Gulin, Igor Kuralenok, and Dmitry Pavlov. 2011. Winning The Transfer
Learning Track of Yahoo!’s Learning To Rank Challenge with YetiRank.. In Yahoo!Learning to Rank Challenge. 63–76.
[14] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma,
Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting
decision tree. In Advances in Neural Information Processing Systems. 3149–3157.[15] Michael Kearns and Leslie Valiant. 1994. Cryptographic limitations on learning
Boolean formulae and finite automata. Journal of the ACM (JACM) 41, 1 (1994),67–95.
[16] John Langford, Lihong Li, and Tong Zhang. 2009. Sparse online learning via
truncated gradient. Journal of Machine Learning Research 10, Mar (2009), 777–801.
[17] LlewMason, Jonathan Baxter, Peter L Bartlett, andMarcus R Frean. 2000. Boosting
algorithms as gradient descent. In Advances in neural information processingsystems. 512–518.
[18] Daniele Micci-Barreca. 2001. A preprocessing scheme for high-cardinality cat-
egorical attributes in classification and prediction problems. ACM SIGKDDExplorations Newsletter 3, 1 (2001), 27–32.
[19] Byron P Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon McGregor.
2005. Boosted decision trees as an alternative to artificial neural networks for
particle identification. Nuclear Instruments and Methods in Physics ResearchSection A: Accelerators, Spectrometers, Detectors and Associated Equipment 543, 2(2005), 577–584.
[20] Lior Rokach and Oded Maimon. 2005. Top–down induction of decision trees
classifiers — a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C(Applications and Reviews) 35, 4 (2005), 476–487.
[21] Qiang Wu, Christopher JC Burges, Krysta M Svore, and Jianfeng Gao. 2010.
Adapting boosting for information retrieval measures. Information Retrieval 13,3 (2010), 254–270.
[22] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. 2013.
Domain adaptation under target and conditional shift. In International Conferenceon Machine Learning. 819–827.
[23] Yanru Zhang and Ali Haghani. 2015. A gradient boosting method to improve
travel time prediction. Transportation Research Part C: Emerging Technologies 58(2015), 308–324.
APPENDIXProof of Proposition 1. Here we provide the full proof. Denote by ξst ,s, t ∈ {0, 1}, the number of examples xk ∈ D with xk = (s, t). Thevalue of the first stump h1 in the region {x1 = s} is the averagevalue of yk over examples fromD belonging to this region. That is,
h1(0, t) =∑nj=1 c21{x j=(0,1)}∑n
j=1 1{x 1
j =0}=
c2ξ01ξ00 + ξ01
,
h1(1, t) =
∑nj=1 c11{x 1
j =1}+ c21{x j=(1,1)}∑n
j=1 1{x 1
j =1}= c1 +
c2ξ11ξ10 + ξ11
.14
Summarizing, we obtain h1(s, t) = c1s + c2ξs1ξs0+ξs1
.
Prediction shift of h1Now we derive the expectation E(h1(x)) of prediction h1 for a testexample x = (s, t). Due to linearity of expectation,
E(h1(s, t)) = c1s + c2n∑
k=1
E(αsk ), (8)
where αsk =1{xk =(s,1)}∑nj=1 1{x1j =s }
. Note that all E(αsk ), k = 1, . . . ,n, are
equal, because the observations xk are exchangeable [8], soE(αsk ) =E(αs1) for k = 1, . . . ,n. The law of total probability implies that
E(αs1) = P(x1 = (s, 1)) · E(
1∑nj=1 1{x 1
j =s }| x1 = (s, 1)
). (9)
The first multiplier equals1
4. We derive the second one using the
following statement.
Lemma 1. E
(1∑n
j=1 1{x1j =s }| x1 = (s, t)
)= 2
n (1 −1
2n ).
Proof . The probability P(∑nj=1 1{x 1
j =s }= k + 1 | x1 = (s, t)) equals
Ckn−1/2n−1
for k = 0, . . . ,n − 1, since 1{x 1
1=s } = 1 when x1 = (s, t)
with probability 1 and
∑nj=2 1{x 1
j =s }is a binomial variable inde-
pendent of x1. Therefore, we have E( 1∑nj=1 1{x1j =s }
| x1 = (s, t)) =∑n−1k=0
1
k+1Ckn−1
2n−1 =
1
n2n−1∑nk=1C
kn =
1
n2n−1 (2n − 1) = 2
n (1 −1
2n ). □
Summarizing Equations 8, 9 and Lemma 1, we obtain
Proposition 2. We have E(h1(s, t)) = c1s + c22(1 − 1
2n ).
It means that the conditional expectation E(h1(x) | x = (s, t))on a test example x equals c1s +
c22+ O(2−n ), since x and h1
are independent. At the same time, the conditional expectation
E(h1(xl ) | xk = (s, t)) on a training example xl is different for anyl = 1, . . . ,n, because the model h1 is fitted to xl .
Proposition 3. The conditional expectation E(h1(xl ) | xl =(s, t)) equals (c1s + c2
2) − c2( 2t−1n ) +O(2−n ).
Proof . We derive the conditional expectation as c1s+∑nk=1 c2E(αsk |
xl = (s, t)). Lemma 1 implies that E(αsl | xl = (s, t)) = t 2n (1 −1
2n ).
For k , l , we have E(αk | xl = (s, t)) = 1
4E( 1∑n
j=1 1{x1j =s }| xl =
(s, t), xk = (s, 1)) = 1
2n (1 −1
n−1 +O(2−n )) due to Lemma 2 below.
Summarizing, we obtain E(h1(xl ) | xl = (s, t)) = c1s + c2(t 2n (1 −1
2n )+ (n − 1) 1
2n (1−1
n−1 ))+O(2−n ) = c1s +c22− c2( 2t−1n )+O(2−n ).
□
Lemma 2. The conditional expectation E( 1∑nj=1 1{x1j =s }
| x1 =
(s, t1), x2 = (s, t2)) equals 2
n (1 −1
n−1 +1
2n−1(n−1) ).
14Formally, there is a probability 1/2n event that ξ00 + ξ01 or ξ10 + ξ11 equals 0. For
these cases, we assume that0
0= 0 anywhere below.
10
Proof . First, we have by definition
E( 1∑nj=1 1{x 1
j =s }| x1 = (s, t1), x2 = (s, t2)) =
1
2n−2
n−2∑k=0
Ckn−21
k + 2,
since P(∑nj=1 1{x 1
j =s }= k + 2 | x1 = (s, t1), x2 = (s, t2)) equals
Ckn−2
2n−2 . Then we calculate the expectation as
1
2n−2
n−2∑k=0
Ckn−21
k + 2=
1
2n−2
n−2∑k=0
Ckn−2
(1
k + 1− 1
(k + 1)(k + 2)
)=
=1
2n−2
n−2∑k=0
(1
n − 1Ck+1n−1 −
1
n(n − 1)Ck+2n
)=
=1
2n−2
(1
n − 1 (2n−1 − 1) − 1
n(n − 1) (2n − n − 1)
)=
=2
n(1 − 1
n − 1 +1
2n−1(n − 1)
).
□
Bias of the model h1 + h2Proposition 3 shows that the values of the model h1 on training
examples are shifted with respect to the ones on test examples. The
next step is to show how this can lead to a bias of the trained model,
if we use the same dataset for building both h1 and h2. First, assume
we have an additional sample D2 = {xn+k }k=1..n for building h2.
Proposition 4. The expected value of h1(s, t) + h2(s, t) on a testexample x = (s, t) is f ∗(s, t) +O(1/2n ), if h2 is built using datasetD2.
Proof . First, we need to derive the expectation E(h2(s, t)) of h2 on a
test example x = (s, t). We have h2(s, t) is the average of the residu-als f ∗(xn+k ) −h1(xn+k ) over all k such that x2n+k = t . For example,
h2(s, 0) =c1ξ ′
10−ξ ′
00h1(0,0)−ξ ′
10h1(1,0)
ξ ′00+ξ ′
10
, where ξ ′st is the number of ex-
amples xn+k that are equal to (s, t), k = 1, . . . ,n. Further, we have
E(h2(s, 0)) = E(c1ξ ′10
ξ ′00+ξ ′
10
) − E(h1(0, 0)ξ ′00
ξ ′00+ξ ′
10
) − E(h1(1, 0)ξ ′10
ξ ′00+ξ ′
10
).Since h1 andD2 are independent, we can rewright the last equation
as
E(h2(s, 0)) = c1E(ξ ′10
ξ ′00+ ξ ′
10
) − E(h1(0, 0))E(ξ ′00
ξ ′00+ ξ ′
10
)−
− E(h1(1, 0))(ξ ′10
ξ ′00+ ξ ′
10
) = c11
2
− c22
1
2
− (c1 +c22
)12
+O(2−n ) =
− c22
+O(2−n ).
Analogously, we can obtain E(h2(s, 1)) = c22+O(2−n ). Summing
up, E(h2(s, t)) = c2t − c22+O(2−n ) and E(h1(s, t)+h2(s, t)) = c1s +
c2t +O(2−n ). □At last, we dirive the expected value of h1(s, t) + h2(s, t) for
situation when the same dataset D1 is used for both building h1and h2. We obtain a bias according to the following result.
Proposition 5. The expected value of h1(s, t) + h2(s, t) on a testexample x = (s, t) is f ∗(s, t) − 1
n−1c2(t −1
2) + O(1/2n ), if both h1
and h2 are built using the same dataset D1.
Proof . The residual f ∗(s, t)−h1(s, t) equals c2(t− ξs1ξs0+ξs1
). Therefore,the value of h2(s, t) is the average
1
ξ0t + ξ1t
(c2(t −
ξ01ξ00 + ξ01
)ξ0t + c2(t −ξ11
ξ10 + ξ11)ξ1t
),
which is equal to −c2(
ξ00ξ01(ξ00+ξ01)(ξ00+ξ10) +
ξ10ξ11(ξ10+ξ11)(ξ00+ξ10)
), when
t = 0 and is equal to c2(
ξ00ξ01(ξ00+ξ01)(ξ01+ξ11) +
ξ10ξ11(ξ10+ξ11)(ξ01+ξ11)
), when
t = 1. All four ratios are equal due to symmetries, and they are equal
to1
4(1− 1
n−1 )+O(2−n ) according to Lemma 3 below. Summarising,
we obtain E(h2(s, t)) = (2t −1) c22(1− 1
n−1 )+O(2−n ) and E(h1(s, t)+h2(s, t)) = f ∗(s, t) − c2 1
n−1 (t −1
2) +O(2−n ). □
Lemma 3. We have E( ξ00ξ01(ξ00+ξ01)(ξ01+ξ11) ) =
1
4(1 − 1
n−1 ) +O(2−n ).
Proof . First, linearity implies that
E( ξ00ξ01(ξ00 + ξ01)(ξ01 + ξ11)
) =∑i, jE(
1xi=(0,0),xj=(0,1)(ξ00 + ξ01)(ξ01 + ξ11)
)
Taking into account that all terms are equal, the expectation can be
written asn(n−1)
42
a, where
a := E( 1
(ξ00 + ξ01)(ξ01 + ξ11)| x1 = (0, 0), x2 = (0, 1)).
A key observation is that ξ00 + ξ01 and ξ01 + ξ11 are two inde-
pendent binomial variables: the former one is the number of ksuch that x1k = 0 and the latter one is the number of k such that
x2k = 1. Moreover, they (and also their inverses) are also condition-
ally independent given that first two observations of the Bernoulli
scheme are known: x1 = (0, 1), x2 = (0, 1). This conditional inde-pendence imply that a is the product of E( 1
ξ00+ξ01| x1 = (0, 0), x2 =
(0, 1)) and E( 1
ξ01+ξ11| x1 = (0, 0), x2 = (0, 1)). The first factor
equals2
n (1 −1
n−1 + O(2−n )) according to Lemma 2. The second
one equals2
n−1 (1 + O(2−n )) according to Lemma 1. At last, we
obtain E( ξ00ξ01(ξ00+ξ01)(ξ01+ξ11) ) =
n(n−1)42
4
n(n−1) (1 −1
n−1 ) + O(2−n ) =1
4(1 − 1
n−1 ) +O(2−n ) . □□
11