
CatBoost: unbiased boosting with categorical features

Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, Andrey Gulin

Yandex

Moscow, Russia

ABSTRACT

This paper presents the key algorithmic techniques behind CatBoost, a state-of-the-art open-source gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets. Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms. In this paper, we provide a detailed analysis of this problem and demonstrate that the proposed algorithms solve it effectively, leading to excellent empirical results.

CCS CONCEPTS

• Computing methodologies → Classification and regression trees; Boosting;

KEYWORDS

Gradient boosting, ordering principle, categorical features, decision trees

1 INTRODUCTION

Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a variety of practical tasks. For a number of years, it has remained the primary method for learning problems with heterogeneous features, noisy data, and complex dependencies: web search, recommendation systems, weather forecasting, and many others [5, 19, 21, 23]. Gradient boosting is essentially a process of constructing an ensemble predictor by performing gradient descent in a functional space. It is backed by solid theoretical results that explain how strong predictors can be built by iteratively combining weaker models (base predictors) in a greedy manner [15].

The most popular implementations of gradient boosting use decision trees as base predictors. It is convenient to use decision trees for numerical features, but, in practice, many datasets include categorical features, each having a discrete set of values that are not comparable with each other (e.g., user ID, city, or ZIP Code). The most commonly used practice [6, 18] for dealing with categorical features in gradient boosting is converting them to comparable numbers before training, e.g., by using one-hot encoding or calculating target-based statistics. In this paper, we present a new gradient boosting algorithm that successfully handles categorical features.

As we show in this paper, all existing implementations of gradient boosting face the following statistical issue. A prediction model $F$ obtained after several steps of boosting relies on the targets of all training examples. We demonstrate that this actually leads to a shift of the distribution of $F(x_k) \mid x_k$ for a training example $x_k$ from the distribution of $F(x) \mid x$ for a test example $x$. This finally leads to a prediction shift of the learned model. We consider this problem as a consequence of a specific kind of target leakage in Section 4. A similar issue occurs for standard algorithms of preprocessing categorical features. Namely, a target-based statistic can be considered as a simple statistical model. Such models usually suffer from the same problem of target leakage leading to their prediction shift, as we discuss in detail in Section 3.

To solve both problems discussed above, we use an ordering principle. Namely, we propose a modification of the standard gradient boosting procedure, called ordered boosting, which avoids target leakage. This technique is combined with a new algorithm for processing categorical features. As a result, the new algorithm outperforms the existing state-of-the-art implementations of gradient boosted decision trees, XGBoost [7] and LightGBM [14], on a diverse set of popular machine learning tasks (see Section 6 for details). The algorithm is called CatBoost (for “Categorical Boosting”) and is available as an open-source library (see footnote 1).

2 BACKGROUND

Assume we observe a dataset of examples $D = \{(x_k, y_k)\}_{k=1..n}$, where $x_k = (x_k^1, \ldots, x_k^m)$ is a random vector of $m$ features and $y_k \in \mathbb{R}$ is a target, which can be either binary or a numerical response. Examples $(x_k, y_k)$ are independent and identically distributed according to some unknown distribution $P(\cdot, \cdot)$. The goal of a learning task is to train a function $F : \mathbb{R}^m \to \mathbb{R}$ which minimizes the expected loss $\mathcal{L}(F) := \mathbb{E} L(y, F(x))$. Here $L(\cdot, \cdot)$ is a smooth loss function and $(x, y)$ is a test example sampled from $P$ independently of the training set $D$.

A gradient boosting procedure [11] iteratively builds a sequence of approximations $F^t : \mathbb{R}^m \to \mathbb{R}$, $t = 0, 1, \ldots$, in a greedy fashion. Namely, $F^t$ is obtained from the previous approximation $F^{t-1}$ in an additive manner:

$$F^t = F^{t-1} + \alpha h^t, \tag{1}$$

where $\alpha$ is a step size and the function $h^t : \mathbb{R}^m \to \mathbb{R}$ (a base predictor) is chosen from a family of functions $H$ in order to minimize the expected loss:

$$h^t = \arg\min_{h \in H} \mathcal{L}(F^{t-1} + h) = \arg\min_{h \in H} \mathbb{E} L(y, F^{t-1}(x) + h(x)). \tag{2}$$

The minimization problem is usually approached by the Newton method using a second-order approximation of $\mathcal{L}(F^{t-1} + h^t)$ at $F^{t-1}$ or is substituted with a (negative) gradient step. Both methods are kinds of functional gradient descent [10, 17]. In particular, the gradient step $h^t$ is chosen in such a way that $h^t(x)$ approximates $-g^t(x, y)$, where $g^t(x, y) := \frac{\partial L(y, s)}{\partial s}\Big|_{s = F^{t-1}(x)}$. Usually, the least-squares approximation is used:

$$h^t = \arg\min_{h \in H} \mathbb{E}\left(-g^t(x, y) - h(x)\right)^2. \tag{3}$$

1 https://github.com/catboost/catboost

arXiv:1706.09516v2 [cs.LG] 23 Apr 2018

Figure 1: Example of a simple decision tree classifying patients into healthy and sick. (The internal nodes of the tree test Age < 4, Pulse < 90, and Age > 14; the edges are labeled Y and N.)

CatBoost is an implementation of gradient boosting that uses binary decision trees as base predictors. A decision tree [4, 10, 20] is a model built by a recursive partition of the feature space $\mathbb{R}^m$. It is encoded by a labeled rooted tree whose nodes correspond to regions in $\mathbb{R}^m$. The root corresponds to the whole space $\mathbb{R}^m$. The space is divided into several disjoint regions according to the values of some splitting attribute. This attribute is encoded as the label of the root, and the regions are encoded as the root's descendants, which form the first level of the tree. The directed edges leaving the root encode the values of the attribute. The partition procedure described above is then recursively applied to the currently terminal nodes until some stopping criterion (e.g., the maximal allowed depth) is reached. A simple example of a decision tree is presented in Figure 1.

A node with outgoing edges is called an internal node; all other nodes are called leaves (a.k.a. terminal nodes). Attributes are usually chosen from among binary variables that identify that some feature $x^k$ exceeds some threshold $t$, that is, $a = \mathbb{1}_{\{x^k > t\}}$, where $x^k$ is either a numerical or a binary feature; in the latter case, $t = 0.5$ (see footnote 2).

Finally, each leaf is assigned a value, which is an estimate of the response $y$ in the corresponding region in the regression task or the predicted class label in the case of a classification problem. Test examples are processed by navigating them from the root of the tree down to a leaf, according to the outcome of the attributes along the path. In this way, a decision tree $h$ can be written as $h(x) = \sum_{j=1}^{J} b_j \mathbb{1}_{\{x \in R_j\}}$, where $R_j$ are the disjoint regions corresponding to the leaves of the tree.

2 Alternatively, non-binary splits can be used, e.g., a region can be split according

to all values of a categorical feature. However, such splits, compared to binary ones,

would lead to either shallow trees (unable to capture complex dependences) or to more

complex trees with many terminal nodes (having weaker target statistics in each of

them). According to [4], the tree complexity has a crucial effect on the accuracy of the

model and less complex trees are less prone to overfitting.

In the case of regression task, splitting features and thresholds are

usually chosen using the least–squares splitting criterion, and the

values bj are computed as mean target values in the corresponding

leaves. Note that in gradient boosting, at each iteration, a tree is

constructed to approximate the negative gradient (see Equation (3)),

so it solves a regression problem.
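The procedure of Equations (1) and (3) with tree base predictors can be illustrated by a minimal sketch. It assumes the loss $L(y, s) = (y - s)^2 / 2$ (so that the negative gradient is the residual $y - F^{t-1}(x)$), depth-1 trees (stumps), and an exhaustive search over thresholds; the function names `fit_stump`, `stump_predict`, `gradient_boost`, and `predict` are hypothetical and are not part of CatBoost.

```python
def fit_stump(X, target):
    """Least-squares stump: choose the (feature, threshold) pair whose two
    leaf means give the smallest squared error against `target`."""
    n, best = len(X), None
    for f in range(len(X[0])):
        # Dropping the largest value keeps both leaves non-empty.
        for t in sorted(set(row[f] for row in X))[:-1]:
            left = [target[k] for k in range(n) if X[k][f] <= t]
            right = [target[k] for k in range(n) if X[k][f] > t]
            b_left, b_right = sum(left) / len(left), sum(right) / len(right)
            err = (sum((v - b_left) ** 2 for v in left)
                   + sum((v - b_right) ** 2 for v in right))
            if best is None or err < best[0]:
                best = (err, f, t, b_left, b_right)
    if best is None:  # every feature is constant: fall back to a single leaf
        mean = sum(target) / n
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    f, t, b_left, b_right = stump
    return b_right if x[f] > t else b_left


def gradient_boost(X, y, n_iter=50, alpha=0.1):
    """F^t = F^{t-1} + alpha * h^t, with h^t fitted to the negative gradient."""
    F = [0.0] * len(y)
    stumps = []
    for _ in range(n_iter):
        # Negative gradient of (y - s)^2 / 2 at s = F^{t-1}(x): the residual.
        neg_grad = [yk - Fk for yk, Fk in zip(y, F)]
        h = fit_stump(X, neg_grad)
        F = [Fk + alpha * stump_predict(h, x) for Fk, x in zip(F, X)]
        stumps.append(h)
    return stumps


def predict(stumps, x, alpha=0.1):
    return alpha * sum(stump_predict(h, x) for h in stumps)
```

Note that each stump is fitted to residuals computed on the same examples it is trained on, which is the standard scheme that the rest of the paper analyzes.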

3 CATEGORICAL FEATURES

A categorical feature is a feature having a discrete set of values, called categories, which are usually not comparable with each other. As we discussed in the previous section, splitting criteria $\mathbb{1}_{\{x^k > t\}}$ for binary decision trees are based on thresholds $t$, so they can be directly applied only to numerical or binary features and not to categorical ones. Therefore, a common practice is to convert categorical features to numbers at the preprocessing step, i.e., each category for each example is substituted with one or several numerical values.

The most widely used technique, usually applied to low-cardinality categorical features, is one-hot encoding: the original feature is removed and, for each category, one adds a new binary variable that identifies that category [18]. One-hot encoding is implemented in CatBoost; however, it can be ineffective for many practical tasks. For example, the number of values of a categorical feature can potentially be of order $n$ (e.g., for a “user ID” feature). In this case, one-hot encoding leads to a huge number of binary variables, causing large memory requirements and computational cost. More importantly, each individual binary feature would cover only a small part of the examples, which makes it ineffective as a splitting attribute. For example, assume that the task is to predict clicks on an ad and half of the $N$ users prefer to make clicks while the other half does not. Then, in order to split these users according to one-hot encoded user ID, we need at least $N/2$ splits, which makes it impossible to combine user ID with other sparse features within the same tree.

Target-based statistics. A more effective way to deal with categorical features, which is the main focus of this section, is to compute some target-based statistic (TBS) that maps a categorical feature to one numerical feature [6, 18]. Namely, given a categorical feature $x^i$, we substitute the category $x_k^i$ of the $k$-th training example with some numerical value $\hat{x}_k^i$, which is computed based on the training dataset using some of the target values $y_j$, $j = 1..n$. Usually, $\hat{x}_k^i$ is an estimate of the expected target $y$ conditioned on the category: $\hat{x}_k^i \approx \mathbb{E}(y \mid x^i = x_k^i)$. A splitting attribute based on this feature partitions all categories into those with larger expected target values and those with smaller ones. Such attributes are usually more effective than ones using one-hot encoded features.

Greedy TBS. The straightforward approach to estimate the expected target is to take its average value over the whole training dataset [6, 18] (see footnote 3):

$$\hat{x}_k^i = \frac{\sum_{j=1}^{n} \mathbb{1}_{\{x_j^i = x_k^i\}} \cdot y_j}{\sum_{j=1}^{n} \mathbb{1}_{\{x_j^i = x_k^i\}}}. \tag{4}$$

3 For test examples, we may encounter new categories; some default value is usually set for such categories.


The problem with this greedy approach is that $\hat{x}_k^i$ is computed using $y_k$, which is the target of $x_k$. This target leakage can affect the generalization of the learned model, because the TBS feature $\hat{x}^i$ contains more information about the target of $x_k$ than it will carry about the target of a test example with the same input feature vector. In other words, we observe a conditional shift [22]: the distribution of $\hat{x}^i \mid y$ differs for training and test examples. Let us present a simple example showing this shift and the problems caused by it. Assume that the $i$-th feature is categorical, all its values are unique, and for each category $a$, we have $P(y = 1 \mid x^i = a) = 0.5$ for a classification task. Then, in the training dataset, $\hat{x}_k^i = y_k$, so it is sufficient to make only one split at $t = 0.5$ to perfectly classify all training examples. However, such a classifier would make errors on the test set with probability 0.5. If we have at least one meaningful feature in the dataset, then we need to remove the $i$-th feature from the dataset to allow the boosting algorithm to build a better model. So, we have the following desired property for TBS:

P1 $\mathbb{E}(\hat{x}^i \mid y = v) = \mathbb{E}(\hat{x}_k^i \mid y_k = v)$, where $(x_k, y_k)$ is the $k$-th training example.

In our example above, $\mathbb{E}(\hat{x}_k^i \mid y_k) = y_k \in \{0, 1\}$ and $\mathbb{E}(\hat{x}^i \mid y) = 0.5$.
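The unique-category example above can be reproduced with a short sketch. It assumes Equation (4) is applied to a synthetic dataset in which every category occurs exactly once and the targets are fair coin flips; the helper name `greedy_tbs` and the variable names are hypothetical.

```python
import random


def greedy_tbs(categories, targets):
    """Equation (4): mean target over all training examples sharing the category."""
    sums, counts = {}, {}
    for c, y in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return [sums[c] / counts[c] for c in categories]


random.seed(0)
n = 1000
categories = list(range(n))                          # every category is unique
targets = [random.randint(0, 1) for _ in range(n)]   # P(y = 1 | category) = 0.5

encoded = greedy_tbs(categories, targets)

# A single split at t = 0.5 on the encoded feature reproduces the training labels,
# because each encoded value is just the example's own target.
train_accuracy = sum((x > 0.5) == (y == 1) for x, y in zip(encoded, targets)) / n
```

Here the encoded feature equals the label on every training example, so the training accuracy of the single split is 1, while for an unseen category the statistic carries no information about the label.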

A standard way to improve the greedy approach is to add some prior value $P$:

$$\hat{x}_k^i = \frac{\sum_{j=1}^{n} \mathbb{1}_{\{x_j^i = x_k^i\}} \cdot y_j + aP}{\sum_{j=1}^{n} \mathbb{1}_{\{x_j^i = x_k^i\}} + a}, \tag{5}$$

where $a > 0$ is a parameter. The prior value is usually added to reduce the noise obtained from low-frequency categories. A standard technique for calculating the prior $P$ is to take the average label value in the dataset [18]. It is easy to see that having a non-zero prior with a positive weight does not prevent the conditional shift, and this approach still violates P1. Considering the same example, note that $\hat{x}_k^i = \frac{aP}{1+a}$ if $y_k = 0$ and $\hat{x}_k^i = \frac{1+aP}{1+a}$ otherwise. So, we can again perfectly classify the training dataset by taking $t = \frac{0.5+aP}{1+a}$ for the $i$-th feature.

There are several ways to overcome the problem discussed above; all of them can be expressed by the following formula:

$$\hat{x}_k^i = \frac{\sum_{x_j \in D_k} \mathbb{1}_{\{x_j^i = x_k^i\}} \cdot y_j + aP}{\sum_{x_j \in D_k} \mathbb{1}_{\{x_j^i = x_k^i\}} + a}. \tag{6}$$

In other words, for each training example $x_k$ we compute the TBS using a subset $D_k \subset \{x_1, \ldots, x_n\}$.

Holdout TBS. A straightforward approach is to partition the training dataset randomly into two parts $D = \hat{D}_0 \sqcup \hat{D}_1$ and use $D_k = \hat{D}_0$ to calculate the TBS according to (6) and $\hat{D}_1$ to perform training (see, e.g., [7]). For such a holdout TBS, the equality in P1 holds, but this approach significantly reduces the amount of data used both to train the model and to calculate the statistics. So, it violates the following desired property:

P2 It is desirable for $\hat{x}_k^i$ to have a low variance.

Leave-one-out TBS. Another seemingly appealing technique is leave-one-out, where $D_k = D \setminus x_k$ for training examples and $D_k = D$ for test examples (see footnote 4). The motivation behind this approach is that, as in holdout TBS, we do not use the target of an example to compute its features, while reducing the variance. However, this technique still violates P1. To illustrate the problem, let us consider a simple case where we have a category $a$ and $x_k^i = a$ for all examples. Let $n^+$ be the number of examples with $y = 1$; then $\hat{x}_k^i = \frac{n^+ - y_k + aP}{n - 1 + a}$. However, for a test example $x$, we have $\hat{x}^i = \frac{n^+ + aP}{n + a}$. So, again, one can perfectly classify the training dataset by taking $t = \frac{n^+ - 0.5 + aP}{n - 1 + a}$. The equality in P1 does not hold: $\mathbb{E}(\hat{x}_k^i \mid y_k) = \frac{n \mathbb{E}y - y_k + aP}{n - 1 + a}$ and $\mathbb{E}(\hat{x}^i \mid y) = \frac{n \mathbb{E}y + aP}{n + a}$.
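The single-category leave-one-out example can be checked numerically with a small sketch. It assumes binary targets, a prior weight $a = 1$, and a prior $P = 0.5$; the helper name `leave_one_out_tbs` and the toy target list are hypothetical.

```python
def leave_one_out_tbs(targets, a, prior):
    """Equation (6) with D_k = D \\ x_k when all examples share one category."""
    n, n_plus = len(targets), sum(targets)
    return [(n_plus - y + a * prior) / (n - 1 + a) for y in targets]


targets = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
a, prior = 1.0, 0.5
n, n_plus = len(targets), sum(targets)

encoded = leave_one_out_tbs(targets, a, prior)
threshold = (n_plus - 0.5 + a * prior) / (n - 1 + a)

# Examples with y = 1 lose their own target from the numerator and fall below
# the threshold; examples with y = 0 stay above it.
separated = all((x < threshold) == (y == 1) for x, y in zip(encoded, targets))
```

The two groups of encoded values differ by $1 / (n - 1 + a)$, so the threshold separates them even though the category itself is identical for all examples.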

Ordered TBS. CatBoost uses a more effective strategy, which is inspired by online learning algorithms. The underlying ordering principle is the central idea of the current research. In the online learning problem (e.g., for the task of click prediction [1, 16]), numerous training examples become available sequentially in time, and the model is updated on the basis of each new example without reconsideration of the previous ones. Clearly, the values of TBS on a new example (e.g., click-through rates) used by the currently built model depend only on the previously observed examples, since the subsequent ones are unavailable. Usually, the TBS for each example is computed on the basis of a fixed number $K$ of preceding examples. Note that, for such sliding window TBS, if we assume that the distribution $P(\cdot, \cdot)$ does not change in time, the requirement P1 holds, since the joint distribution of $y$ and $\hat{x}^i$ is the same for training and test examples. To adapt this idea to standard non-online settings, we propose to introduce an artificial “time”, i.e., an order of the training examples. Also, for each sample, we use all the available history instead of $K$ preceding examples to compute its TBS. Namely, we perform a random permutation $\sigma$ of the dataset and take $D_k = \{x_j : \sigma(j) < \sigma(k)\}$ in Equation (6) for a training example $x_k$ and $D_k = D$ for a test example. Note that the obtained ordered TBS satisfies the requirement P1, and we also reduce the variance of $\hat{x}_k^i$ (see P2) compared to sliding window TBS.

Note that, if we fix only one random permutation, then the examples that come early in the permutation would have TBS with much higher variance than later ones, which is undesirable. Therefore, CatBoost uses several different permutations for different iterations of gradient boosting. We discuss this in detail in Section 5.
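The ordered TBS for training examples can be sketched in a single pass over a random permutation. This is only an illustration of Equation (6) with $D_k = \{x_j : \sigma(j) < \sigma(k)\}$ for one categorical feature and one permutation; the function name `ordered_tbs` and its arguments are hypothetical and do not mirror the CatBoost API.

```python
import random


def ordered_tbs(categories, targets, a, prior, rng):
    """Encode each training example using only the examples that precede it
    in a random permutation (its own target is never used)."""
    n = len(categories)
    order = list(range(n))
    rng.shuffle(order)            # order[pos] = index of the example at position pos
    sums, counts = {}, {}
    encoded = [0.0] * n
    for k in order:
        c = categories[k]
        s, m = sums.get(c, 0.0), counts.get(c, 0)
        encoded[k] = (s + a * prior) / (m + a)   # statistics of the "history" only
        sums[c] = s + targets[k]                 # now reveal y_k to later examples
        counts[c] = m + 1
    return encoded


# With unique categories every example has an empty history,
# so each encoded value falls back to the prior.
unique_encoded = ordered_tbs(list(range(5)), [1, 0, 1, 1, 0], 1.0, 0.5, random.Random(0))
```

In contrast to the greedy statistic, changing the target of an example here never changes that example's own encoded value, only the values of examples placed after it in the permutation.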

4 PREDICTION SHIFT AND ORDERED BOOSTING

In this section, we discuss the problem of prediction shift in gradient boosting algorithms. As in the case of TBS, it is caused by a special kind of target leakage. Our solution is called ordered boosting and resembles the ordered TBS method.

4.1 Prediction shift

Let us go back to the gradient boosting procedure described in Section 2. In practice, the expectation in (3) is unknown and is usually approximated using the same dataset $D$:

$$h^t = \arg\min_{h \in H} \frac{1}{n} \sum_{k=1}^{n} \left(-g^t(x_k, y_k) - h(x_k)\right)^2. \tag{7}$$

4 https://nycdatascience.com/blog/meetup/featured-talk-1-kaggle-data-scientist-owen-zhang/

Now we describe and analyze the following chain of shifts:

(1) the conditional distribution of the gradient $g^t(x_k, y_k) \mid x_k$ (accounting for randomness of the dataset $D$) is shifted from that distribution on a test example $g^t(x, y) \mid x$;

(2) as a result, the base predictor $h^t$ defined by Equation (7) is biased with respect to the solution of Equation (3);

(3) this, finally, affects the generalization ability of the trained model $F^t$.

As in the case of TBS, these problems are consequences of target leakage. Indeed, gradients used at each step are estimated using the target values of the same data points the current model $F^{t-1}$ was built on. To illustrate the problem, consider the regression task with the quadratic loss function $L(y, \hat{y}) = (y - \hat{y})^2$. In this case, the negative gradient $-g^t(x_k, y_k)$ in Equation (7) is equivalent to the residual function $r^{t-1}(x_k, y_k) := y_k - F^{t-1}(x_k)$ (see footnote 5). It is based on the currently learned predictions $F^{t-1}(x_k)$, whose conditional distribution $F^{t-1}(x_k) \mid x_k$ for a training example $x_k$ is shifted, in general, from the distribution $F^{t-1}(x) \mid x$ for a test example $x$. We call this a prediction shift.

In the simplest case where the problem of prediction shift can be observed, we have $m = 2$ features $x^1, x^2$ that are independent identically distributed Bernoulli random variables with success probability $1/2$. We assume $y = f^*(x) = c_1 x^1 + c_2 x^2$.

Assume we follow the gradient boosting algorithm with decision trees of depth 1 (decision stumps), with step size $\alpha = 1$ and the number of iterations $N = 2$. That is, the trained model $F = F^2$ is expressed as $F^2 = h^1 + h^2$. We can assume without loss of generality that $h^1$ is based on $x^1$ and $h^2$ is based on $x^2$. The leaf values of $h^j$ are obtained according to Equation (7) based on a dataset $D_j = \{(x_{j,k}, y_{j,k})\}_{k=1..n}$. We are able to prove the following statements:

Proposition 1. (1) Assume that two independent samples $D_1$ and $D_2$ of size $n$ are used to estimate $h^1$ and $h^2$, respectively, using Equation (7). Then $\mathbb{E}_{D_1, D_2} F^2(x) = f^*(x) + O(1/2^n)$ for any $x \in \{0, 1\}^2$.

(2) Assume that the same dataset $D = D_1 = D_2$ is used in Equation (7) for both $h^1$ and $h^2$. Then $\mathbb{E}_D F^2(x) = f^*(x) - \frac{1}{n-1} c_2 \left(x^2 - \frac{1}{2}\right) + O(1/2^n)$.

This proposition means that the trained model $F^2$ is an unbiased estimate of the true dependence $y = f^*(x)$ when we use independent datasets at each gradient step (see footnote 6). At the same time, when we use the same dataset at each step, we suffer from a bias $-\frac{1}{n-1} c_2 \left(x^2 - \frac{1}{2}\right)$, which is inversely proportional to the size of the dataset $n$. Another important conclusion here is that the value of the bias can be affected by the relation between $x$ and $y$: in our example, it is proportional to $c_2$. We also note that the bias is caused by statistical issues, and it is not related to any wrong assumptions underlying the model. We track the chain of shifts in the sketch of the proof below, while the full proof of Proposition 1 is available in the Appendix.

5 Here we removed the multiplier 2. This is equivalent to decreasing the step size $\alpha$ by a factor of 2, which does not matter for the current analysis.

6 Up to an exponentially small term, which occurs for a technical reason.

Sketch of the proof. Let us consider the second part of the proposition. Denote by $\xi_{st}$, $s, t \in \{0, 1\}$, the number of examples $x_k \in D$ with $x_k = (s, t)$. We have $h^1(s, t) = c_1 s + \frac{c_2 \xi_{s1}}{\xi_{s0} + \xi_{s1}}$. Its expectation $\mathbb{E}(h^1(x))$ on a test example $x$ equals $c_1 x^1 + \frac{c_2}{2} + O(2^{-n})$. At the same time, the expectation $\mathbb{E}(h^1(x_k))$ on a training example $x_k$ is different and equals $\left(c_1 x^1 + \frac{c_2}{2}\right) - c_2 \left(\frac{2x^2 - 1}{n}\right) + O(2^{-n})$. That is, we experience a prediction shift of $h^1$. As a consequence, the expected value of $h^2(x)$ is $\mathbb{E}(h^2(x)) = c_2 \left(x^2 - \frac{1}{2}\right)\left(1 - \frac{1}{n-1}\right) + O(2^{-n})$ on a test example $x$, and $\mathbb{E}(h^1(x) + h^2(x)) = f^*(x) - \frac{1}{n-1} c_2 \left(x^2 - \frac{1}{2}\right) + O(1/2^n)$. □
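The two cases of Proposition 1 can also be explored with a small Monte Carlo sketch. It assumes the setting above (two Bernoulli(1/2) features, two stumps, step size 1) and, as an additional assumption of this sketch, assigns the value 0 to an empty leaf; all function names are hypothetical.

```python
import random


def fit_stump(feature, target):
    """Leaf values of a stump on a binary feature (Equation (7));
    an empty leaf gets the value 0 (an assumption of this sketch)."""
    values = []
    for v in (0, 1):
        leaf = [r for f, r in zip(feature, target) if f == v]
        values.append(sum(leaf) / len(leaf) if leaf else 0.0)
    return values


def sample(n, c1, c2, rng):
    X = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n)]
    y = [c1 * x1 + c2 * x2 for x1, x2 in X]
    return X, y


def train_f2(n, c1, c2, rng, same_dataset):
    """F^2 = h^1 + h^2, with h^1 split on x^1 and h^2 split on x^2."""
    X1, y1 = sample(n, c1, c2, rng)
    h1 = fit_stump([x[0] for x in X1], y1)
    X2, y2 = (X1, y1) if same_dataset else sample(n, c1, c2, rng)
    residual = [y - h1[x[0]] for x, y in zip(X2, y2)]
    h2 = fit_stump([x[1] for x in X2], residual)
    return lambda x: h1[x[0]] + h2[x[1]]


def estimated_bias(x, n, c1, c2, trials, same_dataset, seed=0):
    """Monte Carlo estimate of E F^2(x) - f*(x) at a fixed test point x."""
    rng = random.Random(seed)
    total = sum(train_f2(n, c1, c2, rng, same_dataset)(x) for _ in range(trials))
    return total / trials - (c1 * x[0] + c2 * x[1])
```

For instance, `estimated_bias((1, 1), 10, 1.0, 1.0, 20000, True)` should come out close to $-\frac{1}{n-1} c_2 (x^2 - \frac{1}{2}) = -1/18$, while the same call with `same_dataset=False` should be close to zero, up to sampling noise and the exponentially small term.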

Finally, let us stress again the connection between the conditional shift for TBS discussed in Section 3 (see P1) and the prediction shift in gradient boosting discussed in this section. At each step $t$ of standard boosting, we have a current model $F^{t-1}$ built so far. The conditional distribution of $F^{t-1}(x_k) \mid x_k$ (and that of $F^{t-1}(x_k) \mid y_k$) for a training example $x_k$ is shifted w.r.t. the same distribution for an independent example from the distribution $P(\cdot, \cdot)$. The same applies to the distribution of residuals $r^{t-1}(x_k, y_k) \mid x_k$: we have a residual shift. These shifts are caused by target leakage, i.e., by using the targets $y_k$ on the previous steps of boosting. Greedy TBS $\hat{x}^i$ can be considered as a simple statistical model predicting the target $y$, and it suffers from a similar problem, a conditional shift of $\hat{x}_k^i \mid y_k$, caused by the target leakage, i.e., using $y_k$ to compute $\hat{x}_k^i$.

Related work. The problems caused by the correlation between the random deviations in the training dataset and the outputs of the model $F^{t-1}$ were considered previously in papers on boosting [3, 12]. Based on the out-of-bag estimation first proposed in his paper [2], Breiman proposed iterated bagging [3], which simultaneously constructs $K$ models $F_i$, $i = 1, \ldots, K$, associated with $K$ independently bootstrapped subsamples $D_i$. At the $t$-th step of the process, approximations $F_i^t$ are grown from their predecessors $F_i^{t-1}$ as follows. The current estimate $M_j^t$ at example $j$ is obtained as the average of the outputs of all models $F_k^{t-1}$ such that $j \notin D_k$. The term $h_i^t$ is built as a predictor of the residuals $r_j^t := y_j - M_j^t$ (targets minus current estimates) on $D_i$. Finally, the models are updated: $F_j^t := F_j^{t-1} + h_i^t$.

Unfortunately, the residuals $r_j^t$ used in this procedure are not unshifted, because each model $F_i^t$ depends on each observation $(x_j, y_j)$ by construction. Indeed, although $h_k^t$ does not use $y_j$ directly if $j \notin D_k$, it still uses $M_{j'}^{t-1}$ for $j' \in D_k$, which, in turn, can depend on $(x_j, y_j)$. The same applies to the idea of Friedman [12], where subsampling of the dataset at each iteration was proposed.

4.2 Ordered boosting

Here we propose a boosting algorithm which does not suffer from the prediction shift problem described in Section 4.1. For simplicity, we consider the case of a regression task where, at each step, the base predictor $h^t(x)$ approximates the residual function $r^{t-1}(x, y)$. So, we aim at developing an algorithm with unshifted residuals.

Assuming access to an unlimited amount of training data, we can easily construct such an algorithm. At each step of boosting, we sample a new dataset $D_t$ independently and obtain unshifted residuals by applying the current model to new training samples.

Figure 2: Ordered boosting principle. For simplicity of visualization, in this figure we assume that $\sigma$ is the identity permutation. (The figure shows examples 1 to 9 in order, the models $M_5^{t-1}$ and $M_6^{t-1}$, and the residual $r^t(x_7, y_7) = y_7 - M_6^{t-1}(x_7)$.)

In practice, however, labeled data is typically limited. Assume that we learn a model with $I$ trees. To make the residual $r^{I-1}(x_k, y_k)$ unshifted, we need to have $F^{I-1}$ trained without the example $x_k$. Since we need unbiased residuals for all training examples, no observations may be used for training $F^{I-1}$, which at first glance makes the training process impossible. However, it is possible to maintain a set of models differing by the examples used for their training. Then, for calculating the residual on an example, we use a model trained without it. At each step $t$ of the learning process, each model can be interpreted as an approximation of a virtual model $F^t$. In order to construct such a set of models, we consider online learning algorithms again. To calculate the gradient for a new example for updating the model, they use the current model trained on the previous examples [1, 16], which provides an unshifted residual and an unshifted gradient.

Thereby, we use the same trick with random ordering of examples, previously applied to TBS in Section 3. Let us assume that $x_k$ are ordered according to a random permutation $\sigma$. We maintain $n$ different models $M_1, \ldots, M_n$ such that the model $M_i$ is learned using only the first $i$ samples in $\sigma$. At each step, in order to obtain the residual for the $j$-th sample, we use the model $M_{j-1}$ (see Figure 2). The resulting Algorithm 1 is called ordered boosting below. Unfortunately, this algorithm is not feasible in most practical tasks due to the need to maintain $n$ different models, which increases the complexity and memory requirements by a factor of $n$. In CatBoost, we implemented a modification of this algorithm on the basis of the gradient boosting algorithm with decision trees as base predictors (GBDT). We describe our implementation in detail in Section 5.

4.3 Ordered boosting with categorical features

In Sections 3 and 4.2 we proposed to use random permutations $\sigma_{cat}$ and $\sigma_{boost}$ of training examples for the TBS calculation and for ordered boosting, respectively. When the two are combined in one algorithm, should these permutations be somehow dependent? We argue that they should coincide. Otherwise, there exist examples $x_i$ and $x_j$ such that $\sigma_{boost}(i) < \sigma_{boost}(j)$ and $\sigma_{cat}(i) > \sigma_{cat}(j)$. Then, the model $M_{\sigma_{boost}(j)}$ is trained using TBS features of, in particular, example $x_i$, which are calculated using $y_j$. In general, it may shift the prediction $M_{\sigma_{boost}(j)}(x_j)$. To avoid such a shift, we set $\sigma_{cat} = \sigma_{boost}$ in CatBoost. In the case of ordered boosting (Algorithm 1) with sliding window TBS, this guarantees that the prediction $M_i(x_i)$ is not shifted for $i = 1, \ldots, n$, since, first, the target $y_i$ was not used for training $M_i$ (neither for the TBS calculation, nor for the gradient estimation) and, second, the distribution of the TBS $\hat{x}^i$ is the same for a training example and a test example with the same value of feature $x^i$.

Algorithm 1: Ordered boosting
input: {(x_k, y_k)}_{k=1}^{n}, I;
1 σ ← random permutation of [1, n];
2 M_i ← 0 for i = 1..n;
3 for t ← 1 to I do
4     for i ← 1 to n do
5         r_i ← y_i − M_{σ(i)−1}(i);
6     for i ← 1 to n do
7         M ← LearnModel((x_j, r_j) : σ(j) ≤ i);
8         M_i ← M_i + M;
9 return M_n
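Algorithm 1 can be turned into a small executable sketch for a one-dimensional regression problem. The base learner `learn_model` below (a stump with a fixed split at 0.5 whose leaf values are mean residuals) is a hypothetical stand-in for LearnModel, the model $M_0$ is taken to be the zero model, and, as in Algorithm 1, no step size is used. The sketch keeps all $n$ models, so it is only practical for tiny datasets.

```python
import random


def learn_model(xs, rs):
    """Hypothetical base learner: a stump on a 1-D feature split at 0.5,
    with leaf values equal to mean residuals (0 for an empty leaf)."""
    left = [r for x, r in zip(xs, rs) if x <= 0.5]
    right = [r for x, r in zip(xs, rs) if x > 0.5]
    b_left = sum(left) / len(left) if left else 0.0
    b_right = sum(right) / len(right) if right else 0.0
    return lambda x: b_right if x > 0.5 else b_left


def ordered_boosting(xs, ys, n_iter, rng):
    n = len(xs)
    sigma = list(range(1, n + 1))
    rng.shuffle(sigma)                   # sigma[k] = 1-based position of example k
    # models[i] plays the role of M_i (a list of base learners); models[0] is M_0 = 0.
    models = [[] for _ in range(n + 1)]

    def evaluate(i, x):
        return sum(h(x) for h in models[i])

    for _ in range(n_iter):
        # Lines 4-5: residual of example k from the model trained on its predecessors.
        r = [ys[k] - evaluate(sigma[k] - 1, xs[k]) for k in range(n)]
        # Lines 6-8: update every M_i using only the first i examples in sigma.
        for i in range(1, n + 1):
            idx = [j for j in range(n) if sigma[j] <= i]
            models[i].append(learn_model([xs[j] for j in idx], [r[j] for j in idx]))
    return lambda x: evaluate(n, x)      # line 9: return M_n
```

The key point the sketch makes visible is that the residual of example $k$ never comes from a model that has seen $y_k$, at the price of maintaining $n$ models and retraining each of them at every iteration.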

5 PRACTICAL IMPLEMENTATION

In this section, we describe an efficient implementation of ordered boosting (Algorithm 1) with ordered TBS used in CatBoost.

Overview of the algorithm. In GBDT, the process of constructing a decision tree can be divided into two phases: choosing the tree structure (i.e., splitting attributes) and choosing the values $b_j$ in leaves [11]. CatBoost performs the second phase using the standard GBDT scheme and has two modes for the first phase, Ordered and Plain. The latter mode corresponds to a combination of the standard GBDT algorithm with ordered TBS. The former mode presents an efficient modification of Algorithm 1 using one tree structure shared by all the models to be built.

At the start, CatBoost generates $s + 1$ independent random permutations of the training dataset. The permutations $\sigma_1, \ldots, \sigma_s$ are used for constructing decision trees, while $\sigma_0$ serves for choosing the leaf values $b_j$. Each permutation is used both for computing ordered TBS and for the Ordered mode simultaneously. For a given permutation, both the TBS and the predictions used by ordered boosting ($M_{\sigma(i)-1}(i)$ in Algorithm 1) have a high variance for examples with a short history, and thus may increase the variance of the final model predictions. Using several permutations allows us to reduce this effect, which is confirmed by our experiments in Section 6.6.

A detailed description of CatBoost is provided in Algorithm 2.

Building a tree. CatBoost uses oblivious decision trees, where

the same splitting criterion is used across an entire level of the

tree [9, 13]. Such trees are balanced, less prone to overfitting, and

allow speeding up prediction significantly at testing time.
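One reason oblivious trees are fast at prediction time is that, with one split per level, the leaf index can be assembled bit by bit. The following is a minimal sketch under the assumption that each level is described by a hypothetical `(feature_index, threshold)` pair and that leaf values are stored in a flat list of length $2^{depth}$; it does not reflect CatBoost's internal data layout.

```python
def oblivious_leaf_index(x, splits):
    """Each level contributes one bit: 1 if the level's feature exceeds its threshold."""
    index = 0
    for feature, threshold in splits:
        index = (index << 1) | int(x[feature] > threshold)
    return index


def oblivious_predict(x, splits, leaf_values):
    return leaf_values[oblivious_leaf_index(x, splits)]


# A depth-2 tree: level 1 tests feature 0 against 0.5, level 2 tests feature 1 against 2.0.
splits = [(0, 0.5), (1, 2.0)]
leaf_values = [10.0, 20.0, 30.0, 40.0]
```

For example, the point `[0.7, 1.0]` yields the bits 1 and 0, i.e., leaf index 2, and the point `[0.1, 3.0]` yields the bits 0 and 1, i.e., leaf index 1.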

At each iteration $t$ of the algorithm, we sample a random permutation $\sigma_r$ from $\{\sigma_1, \ldots, \sigma_s\}$ and construct a tree $T_t$ on the basis of it (line 7 of Algorithm 2). In the Plain mode, it defines the TBS to be used in two aspects:

(1) As a model $F^{t-1}$ in Equation (2), we use a model $S_r$ (we omit $t$ for brevity) trained with ordered TBS based on the permutation $\sigma_r$. These TBS are used to define the leaf each example belongs to: the function $GetLeaf(x, T, \sigma)$ serving for this task calculates TBS for categorical features used in $T$ on the basis of the original features $x$ and the permutation $\sigma$. Each model $S_{r'}$, $r' = 1..s$, can be considered as an approximation of $F^{t-1}$ on the basis of a given TBS. Note that the tree $T_t$, built on the basis of the model $S_r$, is used for updating all the models $S_{r'}$, $r' = 1..s$, in line 12 of Algorithm 2.

(2) In order to obtain a leaf value for evaluating a candidate split (line 9 of Function BuildTree), we average the gradients of all the examples in the leaf, where the gradient $grad_r(i)$ for the example $i$ (corresponding to $g^t(x_i, y_i)$ from Equation (3)) is calculated in the point of prediction $S_r(i)$ based on the permutation $\sigma_r$.

Algorithm 2: CatBoost
input: {(x_i, y_i)}_{i=1}^{n}, I, α, L, s, Mode
1 σ_i ← random permutation of [1, n] for i = 0..s;
2 S_r(i) ← 0 for r = 0..s, i = 1..n;
3 S'_{r,j}(i) ← 0 for r = 1..s, i = 1..n, j = 1..⌈log_2 n⌉;
4 for t ← 1 to I do
5     grad ← CalcGradient(L, S, y);
6     grad' ← CalcGradient(L, S', y);
7     r ← random(1, s);
8     T_t ← BuildTree(Mode, grad_r, grad'_r, σ_r, {x_i}_{i=1}^{n});
9     leaf_{r,i} ← GetLeaf(x_i, T_t, σ_r) for r = 0..s, i = 1..n;
10    foreach leaf R^t_j in T_t do
11        b^t_j ← −avg(grad_0(i) for i : leaf_{r,i} = j);
12    S, S' ← UM(Mode, leaf, T_t, {b^t_j}_j, S, S', grad, grad', {σ_r}_{r=1}^{s});
13 return F(x) = Σ_{t=1}^{I} Σ_j α b^t_j 1{GetLeaf(x, T_t, ApplyMode) = j};

Function UpdateModel (UM)
input: Mode, leaf, T, {b_j}_j, S, S', grad, grad', {σ_r}_{r=1}^{s}
1 foreach leaf j in T do
2     foreach i s.t. leaf_{r,i} = j do
3         S_0(i) ← S_0(i) + α b_j;
4     for r ← 1 to s do
5         foreach i s.t. leaf_{r,i} = j do
6             if Mode == Plain then
7                 S_r(i) ← S_r(i) − α avg(grad_r(p) for p : leaf_{r,p} = j);
8             else
9                 for l ← 1 to ⌈log_2 n⌉ do
10                    S'_{r,l}(i) ← S'_{r,l}(i) − α avg(grad'_{r,l}(p) for p : leaf_{r,p} = j, σ_r(p) < 2^l);
11                S_r(i) ← S'_{r,⌊log_2 σ_r(i)⌋}(i);
12 return S, S'

Function BuildTree
input: Mode, grad, grad', σ, {x_i}_{i=1}^{n}
1 T ← empty tree;
2 foreach step of top-down procedure do
3     Form a set C of candidate splits;
4     foreach c ∈ C do
5         T_c ← add split c to T;
6         leaf_i ← GetLeaf(x_i, T_c, σ) for i = 1..n;
7         foreach leaf j in T_c do
8             foreach i : leaf_i = j do
9                 if Mode == Plain then
10                    Δ(i) ← avg(grad(p) for p : leaf_p = j);
11                else
12                    Δ(i) ← avg(grad'_{⌊log_2 σ_r(i)⌋}(p) for p : leaf_p = j, σ(p) < σ(i));
13        loss(T_c) ← ||Δ − grad||_2
14    T ← argmin_{T_c}(loss(T_c))
15 return T

In the Plain mode, updating the models Sr ′ , r′ = 1..s (line 7 of

FunctionUpdateModel ) follows the standard gradient step: the leafvalue is the negative average of the gradients дradr ′ (in the leaf)

calculated on predictions of the model Sr ′ to be updated.

The setC of candidate splits for a tree is formed in line 2 of Func-

tion BuildTree by considering all the original numerical features

and TBS based on the categorical features.7

The difference of the Ordered mode is that each model $S_{r'}$, $r' = 1..s$, is trained according to the ordering principle (Algorithm 1): it is approximated by models $S'_{r',j}$, $j = 1, \ldots, \lceil \log_2 n \rceil$, where the model $S'_{r',j}$ corresponds to $M_{2^j}$ in Algorithm 1⁸: we do not use examples $l : \sigma_{r'}(l) > 2^j$ for its training (unless it becomes impossible due to the restriction of one tree structure shared by all the models). For this purpose, while training these models, we always obtain an individual leaf value for each example. At the candidate splits evaluation step, the leaf value for the example $i$ is obtained by the gradient step (line 11 of Function BuildTree) with averaging the gradients of the preceding examples $p$; each gradient $\mathrm{grad}'_{\lfloor \log_2 \sigma_r(i) \rfloor}(p)$ is obtained in the point of prediction $S'_{r, \lfloor \log_2 \sigma_r(i) \rfloor}(p)$. According to line 10 of Function UpdateModel, this prediction is learned from the part of these examples $p' : \sigma_r(p') < 2^{\lfloor \log_2 \sigma_r(i) \rfloor}$.
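The two ways of estimating a leaf value at the candidate-split evaluation step can be sketched as follows. This is a simplified illustration rather than CatBoost's implementation: it uses a single gradient array instead of the supporting gradients grad′ indexed by ⌊log_2 σ(i)⌋, the function names are hypothetical, and an example with no preceding examples in its leaf is assigned 0.0 by assumption.

```python
from collections import defaultdict


def plain_leaf_estimates(grad, leaf):
    """Plain mode: every example gets the average gradient of its whole leaf."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, j in zip(grad, leaf):
        sums[j] += g
        counts[j] += 1
    return [sums[j] / counts[j] for j in leaf]


def ordered_leaf_estimates(grad, leaf, sigma):
    """Ordered mode (simplified): example i only uses examples p in the same
    leaf with sigma[p] < sigma[i]."""
    n = len(grad)
    delta = [0.0] * n
    sums = defaultdict(float)
    counts = defaultdict(int)
    # Visit examples in the order given by the permutation.
    for i in sorted(range(n), key=lambda idx: sigma[idx]):
        j = leaf[i]
        if counts[j] > 0:
            delta[i] = sums[j] / counts[j]
        # Assumption: with no preceding example in the leaf, delta stays 0.0.
        sums[j] += grad[i]
        counts[j] += 1
    return delta


grad = [1.0, 3.0, 5.0, 2.0]
leaf = [0, 0, 0, 1]
sigma = [1, 2, 3, 4]
print(plain_leaf_estimates(grad, leaf))           # [3.0, 3.0, 3.0, 2.0]
print(ordered_leaf_estimates(grad, leaf, sigma))  # [0.0, 1.0, 2.0, 0.0]
```

In the Plain variant each example's own gradient enters its leaf estimate, whereas in the ordered variant it never does.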

7The set C also contains TBS based on some feature combinations, chosen by the

algorithm described at the end of this section.

8The number of models to be maintained is reduced from $n$ to $\lceil \log_2 n \rceil$ in comparison with Algorithm 1 for the sake of efficiency.


Table 1: Computational complexity

Procedure                 Complexity for iteration t
CalcGradient              O(s · n)
BuildTree                 O(|C| · n)
Calc leaf values b^t_j    O(n)
UpdateModel               O(s · n)
Calc ordered TBS          O(N_{TBS_t} · n)

Choosing leaf values. Given the tree $T_t$, the leaf value $b^t_j$ of the final model $F$ is calculated by the gradient step in line 10 of Algorithm 2, where the gradient $\mathrm{grad}_0(i)$ for example $i$ is calculated in the point of prediction $S_0(i)$ obtained by applying the current version of the final model $F$ (i.e., with leaf values $b^t_j$, as can be seen from line 3 of Function UpdateModel) with TBS based on the permutation $\sigma_0$. When applying the final model $F$ to a new example (line 13 of Algorithm 2), using ApplyMode instead of a permutation in the function GetLeaf means that TBS are calculated by the greedy approach on the training data, as described in Section 3.

Complexity. We present the computational complexity of the different components of both modes of CatBoost per iteration in Table 1. We first prove these asymptotics for the Ordered mode. For this purpose, we estimate the number $N_{pred}$ of predictions $S_{r,j}(i)$ to be maintained:

$$N_{pred} < (s + 1) \cdot \sum_{j=1}^{\log_2 n + 1} 2^j < (s + 1) \cdot 2^{\log_2 n + 2} = 4(s + 1)n.$$

Then, obviously, the complexity of CalcGradient is $O(N_{pred}) = O(s \cdot n)$. The complexity of leaf values calculation is $O(n)$, since each example $i$ is included only in the averaging operation in leaf $\mathrm{leaf}_i$.

Calculation of the ordered TBS for one categorical feature can be performed sequentially in the order of the permutation by $n$ additive operations for calculation of $n$ partial sums and $n$ division operations. Thus, the overall complexity of the procedure is $O(N_{TBS_t} \cdot n)$, where $N_{TBS_t}$ is the number of TBS which were not calculated on the previous iterations. Since the leaf values $\Delta(i)$ calculated in the BuildTree procedure can be considered as ordered TBS, where gradients play the role of targets, the complexity of this procedure is $O(|C| \cdot n)$.

Finally, in the UpdateModel procedure, we need to perform one averaging operation for each $l = 1, \ldots, \lceil \log_2 n \rceil$, and each gradient $\mathrm{grad}'_{r,l}$ is included in one averaging operation. Thus, the number of operations is bounded by the number of the gradients $\mathrm{grad}'_{r,l}$, which is equal to $N_{pred} = O(s \cdot n)$.
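The single sequential pass behind the ordered TBS computation can be sketched as below. This is a minimal illustration under stated assumptions: the permutation is passed as the visiting order of example indices, and the smoothing parameters `prior` and `prior_weight` as well as the function name are hypothetical choices made for the example, not values taken from this section.

```python
def ordered_target_statistics(categories, targets, order, prior=0.5, prior_weight=1.0):
    """Compute an ordered target-based statistic for one categorical feature.

    Each example only uses the targets of examples that precede it in `order`
    and share its category, so one pass with running sums and counts suffices.
    """
    sums = {}    # running sum of targets per category
    counts = {}  # running number of examples per category
    tbs = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        k = counts.get(cat, 0)
        # Assumed smoothing: the prior keeps the value defined when k == 0.
        tbs[idx] = (s + prior_weight * prior) / (k + prior_weight)
        # Only now reveal the current target to later examples.
        sums[cat] = s + targets[idx]
        counts[cat] = k + 1
    return tbs


values = ordered_target_statistics(["a", "b", "a", "a"], [1, 0, 0, 1], order=[0, 1, 2, 3])
print(values)  # [0.5, 0.5, 0.75, 0.5]
```

Each example costs one dictionary lookup, one division, and one update, which matches the count of $n$ additions and $n$ divisions per feature.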

To finish the proof, note that no component of the Plain mode is less efficient than the corresponding component of the Ordered mode, while, at the same time, none can be more efficient than the corresponding asymptotics from Table 1. Thus, our implementation of Ordered boosting with decision trees has the same asymptotic complexity as the standard GBDT with ordered TBS. Compared with standard GBDT that computes TBS by any other approach discussed in Section 3, or by ordered TBS with $s = 1$, it slows down the procedures CalcGradient, UpdateModel, and the computation of TBS by a factor of $s$.
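The chain of inequalities bounding $N_{pred}$ in the complexity argument can be checked numerically. The sketch below is only a sanity check; it assumes the upper summation limit is read as $\lfloor \log_2 n \rfloor + 1$ for integer $n$, and the function name is hypothetical.

```python
def prediction_count_bound(n, s):
    """Middle term of the bound: (s + 1) * sum_{j=1}^{m} 2^j, m = floor(log2 n) + 1."""
    m = n.bit_length()  # equals floor(log2 n) + 1 for n >= 1
    return (s + 1) * sum(2 ** j for j in range(1, m + 1))


for n in (1, 8, 1000, 400_000):
    for s in (1, 3):
        # The geometric sum stays strictly below 4 * (s + 1) * n.
        assert prediction_count_bound(n, s) < 4 * (s + 1) * n
```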

Feature combinations. Let us describe another important implementation detail of CatBoost: using feature combinations for TBS. To motivate this, assume that the task is ad click prediction and we have the following categorical features: user ID and ad topic. A user might prefer ads on a given topic; however, when we substitute user ID and ad topic by the corresponding TBS, we lose this information. TBS computed for a concatenation of these two features solves this problem and gives a new powerful feature. However, the number of possible combinations grows exponentially with the number of categorical features in the dataset, and it is infeasible to consider all of them in the algorithm. When constructing a new split for the current tree, CatBoost considers combinations in a greedy way. No combinations are considered for the first split in the tree. For the next splits, CatBoost combines all combinations and categorical features present in the current tree with all categorical features in the dataset. Combination values are converted to TBS on the fly.

Feature combinations are used in Algorithm 2 in the following way: the function GetLeaf$(x, T, \sigma)$ also calculates TBS for the combinations of categorical features used in the tree $T$; the set $C$ of candidate splits for a tree (line 2 of Function BuildTree) also includes feature combinations, chosen by the above algorithm.
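The greedy construction of combinations can be illustrated with a small sketch. It is not CatBoost's code: the function names, the dictionary-based row format, and the feature names `user`, `topic`, and `site` are assumptions made for the example.

```python
def combine_features(rows, names):
    """Concatenate the values of several categorical features into one key."""
    return [tuple(row[name] for name in names) for row in rows]


def candidate_combinations(in_tree, all_features):
    """Greedy step: pair every feature or combination already used in the tree
    with every categorical feature of the dataset."""
    candidates = set()
    for combo in in_tree:
        for feature in all_features:
            if feature not in combo:
                candidates.add(frozenset(combo) | {feature})
    return candidates


rows = [
    {"user": "u1", "topic": "sports", "site": "a"},
    {"user": "u2", "topic": "sports", "site": "b"},
]
# A combined categorical value, to be converted to a TBS like any other category.
print(combine_features(rows, ("user", "topic")))
# Candidates once a split on "user" is already in the tree.
print(candidate_combinations([("user",)], ["user", "topic", "site"]))
```

Because candidates are built only from what the current tree already uses, the number of combinations considered per split grows with the tree rather than exponentially with the number of features.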

6 EXPERIMENTS

6.1 Comparison with baselines

We compare our algorithm with the most popular open-source libraries, XGBoost and LightGBM, on several well-known machine learning tasks. The detailed description of the experimental setup as well as dataset descriptions are published on our GitHub together with the code of the experiment and a container with all the used libraries, so that the results can be reproduced.⁹ An additional dataset used in this experiment is Epsilon¹⁰ ($4 \cdot 10^5$ samples, 2000 features). We preprocessed categorical features using the statistics on a random permutation of the data. The parameter tuning and training were performed on 4/5 of the data and the testing was performed on the other 1/5.¹¹ The training with the selected parameters was performed 5 times with different random seeds; we report the average loss obtained (logloss and zero-one loss). The results are presented in Table 2. For CatBoost we used Ordered mode in this experiment.¹²

One can see that CatBoost outperforms other algorithms on all considered datasets. In our GitHub repository you can also see that CatBoost with default parameters outperforms tuned XGBoost on all datasets and LightGBM on all but one dataset.

We also measured statistical significance of the improvements presented in Table 2: except for three datasets (KDD appetency, churn and upselling), the improvements are statistically significant with p-value $\ll 0.01$ measured by paired t-test.


Table 2: Comparison with baselines: logloss / zero-one loss, relative increase is presented in the brackets.

CatBoost LightGBM XGBoost

Adult 0.2695 / 0.1267 0.2760 (+2.4%) / 0.1291 (+1.9%) 0.2754 (+2.2%) / 0.1280 (+1.0%)

Amazon 0.1394 / 0.0442 0.1636 (+17%) / 0.0533 (+21%) 0.1633 (+17%) / 0.0532 (+21%)

Click prediction 0.3917 / 0.1561 0.3963 (+1.2%) / 0.1580 (+1.2%) 0.3962 (+1.2%) / 0.1581 (+1.2%)

Epsilon 0.2647 / 0.1086 0.2703 (+1.5%) / 0.114 (+4.1%) 0.2993 (+11%) / 0.1276 (+12%)

KDD appetency 0.0715 / 0.01768 0.0718 (+0.4%) / 0.01772 (+0.2%) 0.0718 (+0.4%) / 0.01780 (+0.7%)

KDD churn 0.2319 / 0.0719 0.2320 (+0.1%) / 0.0723 (+0.6%) 0.2331 (+0.5%) / 0.0730 (+1.6%)

KDD Internet 0.2089 / 0.0937 0.2231 (+6.8%) / 0.1017 (+8.6%) 0.2253 (+7.9%) / 0.1012 (+8.0%)

KDD upselling 0.1662 / 0.0490 0.1668 (+0.3%) / 0.0491 (+0.1%) 0.1663 (+0.04%) / 0.0492 (+0.3%)

Kick prediction 0.2855 / 0.0949 0.2957 (+3.5%) / 0.0991 (+4.4%) 0.2946 (+3.2%) / 0.0988 (+4.1%)
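The bracketed values in Table 2 are relative increases of a baseline's loss over the CatBoost loss. A minimal sketch of that arithmetic is given below; the helper name is hypothetical, and because the table entries are rounded, recomputed percentages may differ slightly from the printed ones.

```python
def relative_increase_percent(reference, other):
    """Relative increase of `other` over `reference`, in percent."""
    return 100.0 * (other - reference) / reference


# Adult, logloss: CatBoost 0.2695 vs LightGBM 0.2760 (reported as +2.4%).
print(round(relative_increase_percent(0.2695, 0.2760), 1))
# Amazon, logloss: CatBoost 0.1394 vs LightGBM 0.1636 (reported as +17%).
print(round(relative_increase_percent(0.1394, 0.1636)))
```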

Table 3: Comparison of running times on Epsilon

                    time per tree
CatBoost Plain      1.1 s
CatBoost Ordered    1.9 s
XGBoost             3.9 s
LightGBM            1.1 s

6.2 Running times

It is quite hard to compare different boosting libraries in terms of training speed. Every algorithm has a vast number of parameters which affect training speed, quality and model size in a non-obvious way. Different libraries have their own quality/training speed trade-offs, and they cannot be compared without domain knowledge (e.g., is a 0.5% gain in the quality metric worth training the model 3-4 times slower?). In addition, for each library it is possible to obtain almost the same quality with different ensemble sizes and parameters. As a result, one cannot compare libraries by the time needed to obtain a certain level of quality, and we can give only some insight into how fast our implementation can train a model of a fixed size.

We use the Epsilon dataset and measure the mean tree construction time one can achieve without using feature subsampling and/or bagging by CatBoost (both Ordered and Plain modes), XGBoost (we use the histogram-based version, which is faster) and LightGBM. For XGBoost and CatBoost we use the default tree depth equal to 6; for LightGBM we set the leaves count to 64 to have comparable results. We run all experiments on the same machine with Intel Xeon E3-12xx 2.6GHz, 16 cores, 64GB RAM and run all algorithms with 16 threads.

We set the learning rate so that the algorithms start to overfit after constructing about 7000 trees and measure the average time to train ensembles of 8000 trees. The mean tree construction time is presented in Table 3. Note that CatBoost Plain and LightGBM are the fastest ones, followed by Ordered mode, which is about 1.7 times slower, as expected.

Finally, let us note that CatBoost has a highly efficient GPU

implementation. The detailed description and comparison of the

9https://github.com/catboost/benchmarks/tree/master/quality_benchmarks

10https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

11For Epsilon we use default parameters instead of parameter tuning due to large

running time for all algorithms. We tune only the number of trees to avoid overfitting.

12The numbers for CatBoost in Table 2 may slightly differ from the corresponding

numbers in our Github repository, since we use another version of CatBoost with all

discussed features implemented.

Table 4: Plain mode: logloss, zero-one loss and their relative change compared to Ordered mode.

                    Logloss            Zero-one loss
Adult               0.2723 (+1.1%)     0.1265 (-0.1%)
Amazon              0.1385 (-0.6%)     0.0435 (-1.5%)
Click prediction    0.3915 (-0.05%)    0.1564 (+0.19%)
Epsilon             0.2663 (+0.6%)     0.1096 (+0.9%)
KDD appetency       0.0718 (+0.5%)     0.0179 (+1.5%)
KDD churn           0.2317 (-0.06%)    0.0717 (-0.17%)
KDD Internet        0.2170 (+3.9%)     0.0987 (+5.4%)
KDD upselling       0.1664 (+0.1%)     0.0492 (+0.4%)
Kick prediction     0.2850 (-0.2%)     0.0948 (-0.1%)

[Figure 3 plot. x-axis: fraction (ticks at 0.001, 0.01, 0.1, 1); y-axis: relative logloss for Plain, % (ticks from -5 to 25); series: Adult, Amazon, Click prediction, KDD appetency, KDD churn, KDD internet, KDD upselling, Kick prediction.]

Figure 3: Relative error of Plain mode compared to Ordered mode depending on the fraction of the dataset.

running times are beyond the scope of the current article, but these

experiments can be found on the corresponding Github page.13

6.3 Ordered and plain modes

In this section, we compare two essential modes of CatBoost: Plain and Ordered. First, we compared their performance on all considered datasets; the results are presented in Table 4. It can be clearly seen that Ordered mode is particularly useful on small datasets. Indeed, the largest benefit from Ordered is observed on Adult and Internet datasets, which are relatively small (less than 40K training

13https://github.com/catboost/benchmarks/tree/master/gpu_training


Table 5: Comparison of target-based statistics, relative change in logloss / zero-one loss

Greedy Holdout Leave-one-out

Adult +1.1% / +0.79% +2.1 % / +2.0% +5.5% / +3.7%

Amazon +40% / +32% +8.3% / +8.3% +4.5% / +5.6%

Click prediction +13% / +6.7% +1.5% / +0.51% +2.7% / +0.90%

KDD appetency +24% / +0.68% +1.6% / -0.45% +8.5% / +0.68%

KDD churn +12% / +2.1% +0.87% / +1.3% +1.6% / +1.8%

KDD Internet +33% / +22% +2.6% / +1.8% +27% / +19%

KDD upselling +57% / +50% +1.6% / +0.85% +3.9% / +2.9%

Kick prediction +22% / +28% +1.3% / +0.32% +3.7% / +3.3%

examples), which supports our hypothesis that a higher bias nega-

tively affects the performance. Indeed, according to Proposition 1

and our reasoning in Section 4.1, bias is expected to be larger for

smaller datasets (however, it can also depend on other properties of

the dataset, e.g., on the dependency between features and target).

In order to further validate this hypothesis, we perform the following experiment: we train CatBoost in Ordered and Plain modes on randomly filtered datasets and compare the obtained losses, see Figure 3. As we expected, for smaller datasets the relative performance of Plain mode becomes worse. To save space, here and below we present the results only for logloss since the figures for zero-one loss are similar.

6.4 Analysis of target-based statistics

We compare different TBSs by implementing them in CatBoost instead of Ordered TBS, keeping all other algorithmic details the same; the results can be found in Table 5. Here, to save space, we present only the relative increase in loss functions for each algorithm compared to CatBoost in Ordered mode. Note that the ordered TBS used in CatBoost significantly outperforms all other approaches. Also, among the baselines, holdout is the best for most of the datasets since it does not suffer from the conditional shift discussed in Section 3 (P1); still, it is worse than CatBoost because of the larger variance of TBS (P2). Leave-one-out is usually better than Greedy, but it can be much worse on some datasets, e.g., on Adult. The reason is that Greedy suffers from low-frequency categories, while Leave-one-out also suffers from high-frequency ones, and on Adult all features have high frequency.

Finally, let us note that in Table 5 we combine Ordered mode of CatBoost with different TBSs. To generalize these results, we also ran a similar experiment combining different TBSs with Plain mode, used in standard gradient boosting. The obtained results and conclusions turned out to be very similar to the ones discussed above.
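To make the Greedy and Leave-one-out baselines concrete, the sketch below computes both statistics for one categorical feature. It is an illustration under assumptions: the smoothing parameters `prior` and `prior_weight` and the function names are hypothetical, and the holdout variant is omitted.

```python
def _category_totals(categories, targets):
    """Sum of targets and number of examples for each category."""
    sums, counts = {}, {}
    for cat, y in zip(categories, targets):
        sums[cat] = sums.get(cat, 0.0) + y
        counts[cat] = counts.get(cat, 0) + 1
    return sums, counts


def greedy_tbs(categories, targets, prior=0.5, prior_weight=1.0):
    """Greedy: smoothed mean target of the category over the whole training set."""
    sums, counts = _category_totals(categories, targets)
    return [(sums[c] + prior_weight * prior) / (counts[c] + prior_weight)
            for c in categories]


def leave_one_out_tbs(categories, targets, prior=0.5, prior_weight=1.0):
    """Leave-one-out: same mean, but excluding the example's own target."""
    sums, counts = _category_totals(categories, targets)
    return [(sums[c] - y + prior_weight * prior) / (counts[c] - 1 + prior_weight)
            for c, y in zip(categories, targets)]


categories = ["A", "A", "A", "A"]  # one high-frequency category
targets = [1, 0, 1, 1]
print(greedy_tbs(categories, targets))
print(leave_one_out_tbs(categories, targets))  # [0.625, 0.875, 0.625, 0.625]
```

In this toy example the leave-one-out value is a deterministic function of the example's own target (0.625 for targets equal to 1, 0.875 for the target equal to 0), even though the feature itself is constant.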

6.5 Feature combinations

In this section, we demonstrate the effect of feature combinations discussed in Section 3. To do this, we vary the number of features $c_{max}$ allowed to be combined from 1 to 5, where 1 corresponds to the absence of combinations. Figure 4 presents the relative change in logloss compared to $c_{max} = 1$. One can see that, except for some minor variations, the performance of CatBoost grows with $c_{max}$ and the

[Figure 4 plot. x-axis: complexity of combinations (1 to 5); y-axis: relative change in logloss, % (ticks from -12 to 2); legend entries recovered: Adult, Amazon.]

Figure 4: Relative change in logloss for a given allowed complexity compared to the absence of combinations.

[Figure 5 plot. x-axis: number of permutations (1 to 10); y-axis: relative change in logloss, % (ticks from -1 to 0.8); legend entries recovered: Adult, Amazon.]

Figure 5: Relative change in logloss for a given number of permutations $s$ compared to $s = 1$.

largest improvement, which can reach 11%, is observed when we

change cmax from 1 to 2.

6.6 The effect of the number of permutations

Here we analyze the effect of the number of permutations $s$ on the performance of CatBoost. Figure 5 presents the relative improvement in logloss for each $s$ compared to $s = 1$. Although the number of permutations has a smaller effect on the algorithm performance than the previously discussed options of CatBoost, we see that logloss generally decreases with $s$ and that taking $s = 2$ or $s = 3$ is good enough.

7 CONCLUSION

In this paper, we identify and analyze the problem of prediction shifts present in all existing implementations of gradient boosting. We propose a general solution, ordered boosting with ordered TBS, which solves the problem. This idea is implemented in CatBoost, which is an open-source gradient boosting library. Empirical results demonstrate that CatBoost outperforms leading GBDT packages and leads to state-of-the-art results on common benchmarks.


REFERENCES

[1] Léon Bottou and Yann Le Cun. 2004. Large scale online learning. In Advances in neural information processing systems. 217–224.

[2] Leo Breiman. 1996. Out-of-bag estimation. (1996).

[3] Leo Breiman. 2001. Using iterated bagging to debias regressions. Machine Learning 45, 3 (2001), 261–277.

[4] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and regression trees. CRC press.

[5] Rich Caruana and Alexandru Niculescu-Mizil. 2006. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd international conference on Machine learning. ACM, 161–168.

[6] Bojan Cestnik et al. 1990. Estimating probabilities: a crucial task in machine learning. In ECAI, Vol. 90. 147–149.

[7] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.

[8] Bruno De Finetti. 2017. Theory of probability: a critical introductory treatment. Vol. 6. John Wiley & Sons.

[9] Michal Ferov and Marek Modry. 2016. Enhancing LambdaMART Using Oblivious Trees. arXiv preprint arXiv:1609.05610 (2016).

[10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2000. Additive logistic regression: a statistical view of boosting. The annals of statistics 28, 2 (2000), 337–407.

[11] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232.

[12] Jerome H Friedman. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 4 (2002), 367–378.

[13] Andrey Gulin, Igor Kuralenok, and Dmitry Pavlov. 2011. Winning The Transfer Learning Track of Yahoo!'s Learning To Rank Challenge with YetiRank. In Yahoo! Learning to Rank Challenge. 63–76.

[14] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems. 3149–3157.

[15] Michael Kearns and Leslie Valiant. 1994. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM (JACM) 41, 1 (1994), 67–95.

[16] John Langford, Lihong Li, and Tong Zhang. 2009. Sparse online learning via truncated gradient. Journal of Machine Learning Research 10, Mar (2009), 777–801.

[17] Llew Mason, Jonathan Baxter, Peter L Bartlett, and Marcus R Frean. 2000. Boosting algorithms as gradient descent. In Advances in neural information processing systems. 512–518.

[18] Daniele Micci-Barreca. 2001. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter 3, 1 (2001), 27–32.

[19] Byron P Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon McGregor. 2005. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 543, 2 (2005), 577–584.

[20] Lior Rokach and Oded Maimon. 2005. Top-down induction of decision trees classifiers - a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 35, 4 (2005), 476–487.

[21] Qiang Wu, Christopher JC Burges, Krysta M Svore, and Jianfeng Gao. 2010. Adapting boosting for information retrieval measures. Information Retrieval 13, 3 (2010), 254–270.

[22] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. 2013. Domain adaptation under target and conditional shift. In International Conference on Machine Learning. 819–827.

[23] Yanru Zhang and Ali Haghani. 2015. A gradient boosting method to improve travel time prediction. Transportation Research Part C: Emerging Technologies 58 (2015), 308–324.

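APPENDIX

Proof of Proposition 1. Here we provide the full proof. Denote by $\xi_{st}$, $s, t \in \{0, 1\}$, the number of examples $x_k \in \mathcal{D}$ with $x_k = (s, t)$. The value of the first stump $h^1$ in the region $\{x^1 = s\}$ is the average value of $y_k$ over examples from $\mathcal{D}$ belonging to this region. That is,

$$h^1(0, t) = \frac{\sum_{j=1}^n c_2 \mathbb{1}_{\{x_j = (0,1)\}}}{\sum_{j=1}^n \mathbb{1}_{\{x_j^1 = 0\}}} = \frac{c_2 \xi_{01}}{\xi_{00} + \xi_{01}},$$

$$h^1(1, t) = \frac{\sum_{j=1}^n \left( c_1 \mathbb{1}_{\{x_j^1 = 1\}} + c_2 \mathbb{1}_{\{x_j = (1,1)\}} \right)}{\sum_{j=1}^n \mathbb{1}_{\{x_j^1 = 1\}}} = c_1 + \frac{c_2 \xi_{11}}{\xi_{10} + \xi_{11}}.^{14}$$

Summarizing, we obtain $h^1(s, t) = c_1 s + \frac{c_2 \xi_{s1}}{\xi_{s0} + \xi_{s1}}$.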

Prediction shift of $h^1$

Now we derive the expectation $E(h^1(x))$ of the prediction $h^1$ for a test example $x = (s, t)$. Due to linearity of expectation,

$$E(h^1(s, t)) = c_1 s + c_2 \sum_{k=1}^n E(\alpha_{sk}), \tag{8}$$

where $\alpha_{sk} = \frac{\mathbb{1}_{\{x_k = (s,1)\}}}{\sum_{j=1}^n \mathbb{1}_{\{x_j^1 = s\}}}$. Note that all $E(\alpha_{sk})$, $k = 1, \ldots, n$, are equal, because the observations $x_k$ are exchangeable [8], so $E(\alpha_{sk}) = E(\alpha_{s1})$ for $k = 1, \ldots, n$. The law of total probability implies that

$$E(\alpha_{s1}) = P(x_1 = (s, 1)) \cdot E\left( \frac{1}{\sum_{j=1}^n \mathbb{1}_{\{x_j^1 = s\}}} \,\middle|\, x_1 = (s, 1) \right). \tag{9}$$

The first multiplier equals $\frac{1}{4}$. We derive the second one using the following statement.

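Lemma 1. $E\left( \frac{1}{\sum_{j=1}^n \mathbb{1}_{\{x_j^1 = s\}}} \,\middle|\, x_1 = (s, t) \right) = \frac{2}{n}\left(1 - \frac{1}{2^n}\right)$.

Proof. The probability $P\left(\sum_{j=1}^n \mathbb{1}_{\{x_j^1 = s\}} = k + 1 \mid x_1 = (s, t)\right)$ equals $C_{n-1}^{k} / 2^{n-1}$ for $k = 0, \ldots, n - 1$, since $\mathbb{1}_{\{x_1^1 = s\}} = 1$ when $x_1 = (s, t)$ with probability 1 and $\sum_{j=2}^n \mathbb{1}_{\{x_j^1 = s\}}$ is a binomial variable independent of $x_1$. Therefore, we have

$$E\left( \frac{1}{\sum_{j=1}^n \mathbb{1}_{\{x_j^1 = s\}}} \,\middle|\, x_1 = (s, t) \right) = \sum_{k=0}^{n-1} \frac{1}{k+1} \cdot \frac{C_{n-1}^{k}}{2^{n-1}} = \frac{1}{n 2^{n-1}} \sum_{k=1}^{n} C_{n}^{k} = \frac{1}{n 2^{n-1}} (2^n - 1) = \frac{2}{n}\left(1 - \frac{1}{2^n}\right). \qquad \square$$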
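The identity in Lemma 1 can be checked exactly for small $n$ with rational arithmetic. The snippet below is only a numerical sanity check of the stated formula; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def lemma1_sum(n):
    """Left-hand side: sum over k of 1/(k+1) * C(n-1, k) / 2^(n-1)."""
    return sum(Fraction(comb(n - 1, k), (k + 1) * 2 ** (n - 1)) for k in range(n))


def lemma1_closed_form(n):
    """Right-hand side: (2/n) * (1 - 1/2^n)."""
    return Fraction(2, n) * (1 - Fraction(1, 2 ** n))


for n in range(1, 21):
    assert lemma1_sum(n) == lemma1_closed_form(n)
```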

Summarizing Equations 8, 9 and Lemma 1, we obtain

Proposition 2. We have $E(h^1(s, t)) = c_1 s + \frac{c_2}{2}\left(1 - \frac{1}{2^n}\right)$.

It means that the conditional expectation $E(h^1(x) \mid x = (s, t))$ on a test example $x$ equals $c_1 s + \frac{c_2}{2} + O(2^{-n})$, since $x$ and $h^1$ are independent. At the same time, the conditional expectation $E(h^1(x_l) \mid x_l = (s, t))$ on a training example $x_l$ is different for any $l = 1, \ldots, n$, because the model $h^1$ is fitted to $x_l$.

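Proposition 3. The conditional expectation $E(h^1(x_l) \mid x_l = (s, t))$ equals $\left(c_1 s + \frac{c_2}{2}\right) + c_2 \frac{2t - 1}{n} + O(2^{-n})$.

Proof. We derive the conditional expectation as $c_1 s + \sum_{k=1}^n c_2 E(\alpha_{sk} \mid x_l = (s, t))$. Lemma 1 implies that $E(\alpha_{sl} \mid x_l = (s, t)) = t \frac{2}{n}\left(1 - \frac{1}{2^n}\right)$. For $k \neq l$, we have

$$E(\alpha_{sk} \mid x_l = (s, t)) = \frac{1}{4} E\left( \frac{1}{\sum_{j=1}^n \mathbb{1}_{\{x_j^1 = s\}}} \,\middle|\, x_l = (s, t), x_k = (s, 1) \right) = \frac{1}{2n}\left(1 - \frac{1}{n-1} + O(2^{-n})\right)$$

due to Lemma 2 below. Summarizing, we obtain

$$E(h^1(x_l) \mid x_l = (s, t)) = c_1 s + c_2 \left( t \frac{2}{n}\left(1 - \frac{1}{2^n}\right) + (n - 1) \frac{1}{2n}\left(1 - \frac{1}{n-1}\right) \right) + O(2^{-n}) = c_1 s + \frac{c_2}{2} + c_2 \frac{2t - 1}{n} + O(2^{-n}),$$

where the last step uses $\frac{2t}{n} + \frac{n-2}{2n} = \frac{1}{2} + \frac{2t-1}{n}$. □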

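Lemma 2. The conditional expectation $E\left( \frac{1}{\sum_{j=1}^n \mathbb{1}_{\{x_j^1 = s\}}} \,\middle|\, x_1 = (s, t_1), x_2 = (s, t_2) \right)$ equals $\frac{2}{n}\left(1 - \frac{1}{n-1} + \frac{1}{2^{n-1}(n-1)}\right)$.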

14Formally, there is a probability $1/2^n$ event that $\xi_{00} + \xi_{01}$ or $\xi_{10} + \xi_{11}$ equals 0. For these cases, we assume that $\frac{0}{0} = 0$ anywhere below.


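Proof. First, we have by definition

$$E\left( \frac{1}{\sum_{j=1}^n \mathbb{1}_{\{x_j^1 = s\}}} \,\middle|\, x_1 = (s, t_1), x_2 = (s, t_2) \right) = \frac{1}{2^{n-2}} \sum_{k=0}^{n-2} C_{n-2}^{k} \frac{1}{k + 2},$$

since $P\left(\sum_{j=1}^n \mathbb{1}_{\{x_j^1 = s\}} = k + 2 \mid x_1 = (s, t_1), x_2 = (s, t_2)\right)$ equals $C_{n-2}^{k} / 2^{n-2}$. Then we calculate the expectation as

$$\begin{aligned}
\frac{1}{2^{n-2}} \sum_{k=0}^{n-2} C_{n-2}^{k} \frac{1}{k + 2}
&= \frac{1}{2^{n-2}} \sum_{k=0}^{n-2} C_{n-2}^{k} \left( \frac{1}{k + 1} - \frac{1}{(k + 1)(k + 2)} \right) \\
&= \frac{1}{2^{n-2}} \sum_{k=0}^{n-2} \left( \frac{1}{n - 1} C_{n-1}^{k+1} - \frac{1}{n(n - 1)} C_{n}^{k+2} \right) \\
&= \frac{1}{2^{n-2}} \left( \frac{1}{n - 1} (2^{n-1} - 1) - \frac{1}{n(n - 1)} (2^n - n - 1) \right) \\
&= \frac{2}{n} \left( 1 - \frac{1}{n - 1} + \frac{1}{2^{n-1}(n - 1)} \right).
\end{aligned}$$

□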
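As with Lemma 1, the closed form in Lemma 2 can be verified exactly for small $n$. The snippet below is a numerical sanity check only; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def lemma2_sum(n):
    """Left-hand side: (1 / 2^(n-2)) * sum over k of C(n-2, k) / (k + 2)."""
    return sum(Fraction(comb(n - 2, k), (k + 2) * 2 ** (n - 2)) for k in range(n - 1))


def lemma2_closed_form(n):
    """Right-hand side: (2/n) * (1 - 1/(n-1) + 1/(2^(n-1) * (n-1)))."""
    return Fraction(2, n) * (1 - Fraction(1, n - 1) + Fraction(1, 2 ** (n - 1) * (n - 1)))


for n in range(2, 21):
    assert lemma2_sum(n) == lemma2_closed_form(n)
```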

Bias of the model $h^1 + h^2$

Proposition 3 shows that the values of the model $h^1$ on training examples are shifted with respect to the ones on test examples. The next step is to show how this can lead to a bias of the trained model, if we use the same dataset for building both $h^1$ and $h^2$. First, assume we have an additional sample $\mathcal{D}_2 = \{x_{n+k}\}_{k=1..n}$ for building $h^2$.

Proposition 4. The expected value of $h^1(s, t) + h^2(s, t)$ on a test example $x = (s, t)$ is $f^*(s, t) + O(1/2^n)$, if $h^2$ is built using dataset $\mathcal{D}_2$.

Proof. First, we need to derive the expectation $E(h^2(s, t))$ of $h^2$ on a test example $x = (s, t)$. The value $h^2(s, t)$ is the average of the residuals $f^*(x_{n+k}) - h^1(x_{n+k})$ over all $k$ such that $x^2_{n+k} = t$. For example,

$$h^2(s, 0) = \frac{c_1 \xi'_{10} - \xi'_{00} h^1(0, 0) - \xi'_{10} h^1(1, 0)}{\xi'_{00} + \xi'_{10}},$$

where $\xi'_{st}$ is the number of examples $x_{n+k}$ that are equal to $(s, t)$, $k = 1, \ldots, n$. Further, we have

$$E(h^2(s, 0)) = E\left( \frac{c_1 \xi'_{10}}{\xi'_{00} + \xi'_{10}} \right) - E\left( \frac{h^1(0, 0) \xi'_{00}}{\xi'_{00} + \xi'_{10}} \right) - E\left( \frac{h^1(1, 0) \xi'_{10}}{\xi'_{00} + \xi'_{10}} \right).$$

Since $h^1$ and $\mathcal{D}_2$ are independent, we can rewrite the last equation as

$$\begin{aligned}
E(h^2(s, 0)) &= c_1 E\left( \frac{\xi'_{10}}{\xi'_{00} + \xi'_{10}} \right) - E(h^1(0, 0)) E\left( \frac{\xi'_{00}}{\xi'_{00} + \xi'_{10}} \right) - E(h^1(1, 0)) E\left( \frac{\xi'_{10}}{\xi'_{00} + \xi'_{10}} \right) \\
&= c_1 \frac{1}{2} - \frac{c_2}{2} \cdot \frac{1}{2} - \left( c_1 + \frac{c_2}{2} \right) \frac{1}{2} + O(2^{-n}) = -\frac{c_2}{2} + O(2^{-n}).
\end{aligned}$$

Analogously, we can obtain $E(h^2(s, 1)) = \frac{c_2}{2} + O(2^{-n})$. Summing up, $E(h^2(s, t)) = c_2 t - \frac{c_2}{2} + O(2^{-n})$ and $E(h^1(s, t) + h^2(s, t)) = c_1 s + c_2 t + O(2^{-n})$. □

At last, we derive the expected value of $h^1(s, t) + h^2(s, t)$ for the situation when the same dataset $\mathcal{D}_1$ is used for building both $h^1$ and $h^2$. We obtain a bias according to the following result.

Proposition 5. The expected value of $h^1(s, t) + h^2(s, t)$ on a test example $x = (s, t)$ is $f^*(s, t) - \frac{1}{n-1} c_2 \left(t - \frac{1}{2}\right) + O(1/2^n)$, if both $h^1$ and $h^2$ are built using the same dataset $\mathcal{D}_1$.

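Proof. The residual $f^*(s, t) - h^1(s, t)$ equals $c_2 \left(t - \frac{\xi_{s1}}{\xi_{s0} + \xi_{s1}}\right)$. Therefore, the value of $h^2(s, t)$ is the average

$$\frac{1}{\xi_{0t} + \xi_{1t}} \left( c_2 \left(t - \frac{\xi_{01}}{\xi_{00} + \xi_{01}}\right) \xi_{0t} + c_2 \left(t - \frac{\xi_{11}}{\xi_{10} + \xi_{11}}\right) \xi_{1t} \right),$$

which is equal to

$$-c_2 \left( \frac{\xi_{00} \xi_{01}}{(\xi_{00} + \xi_{01})(\xi_{00} + \xi_{10})} + \frac{\xi_{10} \xi_{11}}{(\xi_{10} + \xi_{11})(\xi_{00} + \xi_{10})} \right)$$

when $t = 0$ and is equal to

$$c_2 \left( \frac{\xi_{00} \xi_{01}}{(\xi_{00} + \xi_{01})(\xi_{01} + \xi_{11})} + \frac{\xi_{10} \xi_{11}}{(\xi_{10} + \xi_{11})(\xi_{01} + \xi_{11})} \right)$$

when $t = 1$. The expectations of all four ratios are equal due to symmetries, and they are equal to $\frac{1}{4}\left(1 - \frac{1}{n-1}\right) + O(2^{-n})$ according to Lemma 3 below. Summarizing, we obtain $E(h^2(s, t)) = (2t - 1) \frac{c_2}{2} \left(1 - \frac{1}{n-1}\right) + O(2^{-n})$ and $E(h^1(s, t) + h^2(s, t)) = f^*(s, t) - c_2 \frac{1}{n-1} \left(t - \frac{1}{2}\right) + O(2^{-n})$. □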

Lemma 3. We have $E\left( \frac{\xi_{00} \xi_{01}}{(\xi_{00} + \xi_{01})(\xi_{01} + \xi_{11})} \right) = \frac{1}{4}\left(1 - \frac{1}{n-1}\right) + O(2^{-n})$.

Proof. First, linearity implies that

$$E\left( \frac{\xi_{00} \xi_{01}}{(\xi_{00} + \xi_{01})(\xi_{01} + \xi_{11})} \right) = \sum_{i, j} E\left( \frac{\mathbb{1}_{\{x_i = (0,0),\, x_j = (0,1)\}}}{(\xi_{00} + \xi_{01})(\xi_{01} + \xi_{11})} \right).$$

Taking into account that all terms are equal, the expectation can be written as $\frac{n(n-1)}{4^2} a$, where

$$a := E\left( \frac{1}{(\xi_{00} + \xi_{01})(\xi_{01} + \xi_{11})} \,\middle|\, x_1 = (0, 0), x_2 = (0, 1) \right).$$

A key observation is that $\xi_{00} + \xi_{01}$ and $\xi_{01} + \xi_{11}$ are two independent binomial variables: the former one is the number of $k$ such that $x^1_k = 0$ and the latter one is the number of $k$ such that $x^2_k = 1$. Moreover, they (and also their inverses) are also conditionally independent given that the first two observations of the Bernoulli scheme are known: $x_1 = (0, 0)$, $x_2 = (0, 1)$. This conditional independence implies that $a$ is the product of $E\left( \frac{1}{\xi_{00} + \xi_{01}} \mid x_1 = (0, 0), x_2 = (0, 1) \right)$ and $E\left( \frac{1}{\xi_{01} + \xi_{11}} \mid x_1 = (0, 0), x_2 = (0, 1) \right)$. The first factor equals $\frac{2}{n}\left(1 - \frac{1}{n-1} + O(2^{-n})\right)$ according to Lemma 2. The second one equals $\frac{2}{n-1}\left(1 + O(2^{-n})\right)$ according to Lemma 1. At last, we obtain

$$E\left( \frac{\xi_{00} \xi_{01}}{(\xi_{00} + \xi_{01})(\xi_{01} + \xi_{11})} \right) = \frac{n(n-1)}{4^2} \cdot \frac{4}{n(n-1)} \left(1 - \frac{1}{n-1}\right) + O(2^{-n}) = \frac{1}{4}\left(1 - \frac{1}{n-1}\right) + O(2^{-n}). \qquad \square$$
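The bias stated in Proposition 5 can be examined by exact enumeration for a small dataset size. The sketch below is an illustrative check under assumptions: it enumerates all $4^n$ datasets of two independent fair binary features, fits both stumps on the same data with the convention $0/0 = 0$, and compares the exact bias with the leading term $-c_2 (t - \frac{1}{2}) / (n - 1)$; the function name and the choice $n = 6$, $c_1 = c_2 = 1$ are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_two_stump_prediction(n, c1, c2, s, t):
    """Exact E[h1(s, t) + h2(s, t)] when both stumps are fitted on the same
    dataset of n examples with two independent fair binary features."""
    points = [(0, 0), (0, 1), (1, 0), (1, 1)]
    total = Fraction(0)
    for data in product(points, repeat=n):
        # First stump: mean target in each region x^1 = a (0/0 is taken as 0).
        h1 = {}
        for a in (0, 1):
            ys = [c1 * x1 + c2 * x2 for x1, x2 in data if x1 == a]
            h1[a] = Fraction(sum(ys), len(ys)) if ys else Fraction(0)
        # Second stump: mean residual of the first stump in the region x^2 = t.
        res = [c1 * x1 + c2 * x2 - h1[x1] for x1, x2 in data if x2 == t]
        h2 = sum(res, Fraction(0)) / len(res) if res else Fraction(0)
        total += h1[s] + h2
    return total / 4 ** n


n, c1, c2, s = 6, 1, 1, 0
for t in (0, 1):
    bias = expected_two_stump_prediction(n, c1, c2, s, t) - (c1 * s + c2 * t)
    leading_term = Fraction(-c2 * (2 * t - 1), 2 * (n - 1))
    print(t, float(bias), float(leading_term))
```

The exact bias and the leading term differ only by the exponentially small $O(2^{-n})$ corrections, and the sign of the bias flips with $t$ as the proposition predicts.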
