
  • Dive into XGBoost

    张海鹏

    JiangNan University

    2017-10-20

  • Contents

    Mathematical Theory

    System Design

    Model Applications

    References

  • Mathematical Theory

    Gradient descent: a first-order Taylor expansion shows why stepping against the gradient decreases $f$.

    $f(x_{k+1}) = f(x_k + x_{k+1} - x_k) \approx f(x_k) + \nabla f(x_k)(x_{k+1} - x_k)$

    $f(x_{k+1}) < f(x_k) \Rightarrow \nabla f(x_k)(x_{k+1} - x_k) < 0$

    $x_{k+1} = x_k - \eta \nabla f(x_k)$

    $\nabla f(x_k) = 0 \Rightarrow x_{k+1} = x_k$
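    A tiny numeric sketch of these update rules (illustrative only; the function and step size are arbitrary choices):

    ```python
    # Minimize f(x) = (x - 3)^2 with the update x_{k+1} = x_k - eta * f'(x_k).
    def grad_f(x):
        return 2.0 * (x - 3.0)

    x, eta = 0.0, 0.1
    for _ in range(100):
        x = x - eta * grad_f(x)   # step against the gradient
    print(x)  # converges toward x* = 3, where grad f(x*) = 0 and the iterate stops moving
    ```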

  • Mathematical Theory

    Logistic regression as the running example: the loss couples the model $h$ to the data $D$ and labels $Y$.

    $f(x) = l(h(x, D), Y)$

    $l(h(x, D), Y) = \prod_{d \in D,\, y \in Y} h(x, d)^{y} \left(1 - h(x, d)\right)^{1-y}$

    $h(x, d) = \frac{1}{1 + e^{-x \cdot d}}$

    Expanding the loss of the growing ensemble $H(x_t) = \sum_{i=1}^{t} h(x_i)$ to first order:

    $l(H(x_{t+1})) = l(H(x_t) + h(x_{t+1})) \approx l(H(x_t)) + \nabla l(H(x_t))\, h(x_{t+1})$

  • Mathematical Theory

    A gradient step in function space tells us what the next weak learner should be:

    $H(x_{t+1}) = H(x_t) - \eta \nabla l(H(x_t)), \qquad H(x_t) = \sum_{i=1}^{t} h(x_i)$

    $h(x_{t+1}) = -\eta \nabla l(H(x_t))$

    so the new learner is fit to the gradient of the loss on the data:

    $h : D \longrightarrow \nabla l(H(x_t, D))$
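    A toy sketch of this functional-gradient view, assuming squared-error loss (whose negative gradient is just the residual) and sklearn regression trees standing in for $h$:

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

    eta, trees = 0.1, []
    H = np.zeros_like(y)                        # H(x_0): the initial ensemble
    for t in range(100):
        neg_grad = y - H                        # -dl/dH for l = (y - H)^2 / 2
        h = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad)  # h fits the negative gradient
        H += eta * h.predict(X)                 # H(x_{t+1}) = H(x_t) + eta * h
        trees.append(h)
    ```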

  • Mathematical Theory

    Model Formalization:

    $\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}$

    Training Objective:

    $L(\phi) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$

  • Mathematical Theory

    $L^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i) = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \sum_{i=1}^{t} \Omega(f_i)$

  • Mathematical Theory

    $L^{(t)} \simeq \sum_{i=1}^{n} \left[ l(y_i, \hat{y}^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \sum_{i=1}^{t} \Omega(f_i)$

    $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$

    Removing the constant terms leaves

    $\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$

  • Mathematical Theory

    A tree assigns each example to one of $T$ leaves via $q : \mathbb{R}^d \to \{1, 2, \dots, T\}$ and outputs that leaf's weight:

    $f_t(x) = w_{q(x)}$

    $\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$

    With $I_j = \{\, i \mid q(x_i) = j \,\}$ the set of examples falling into leaf $j$:

    $\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$

  • Mathematical Theory

    Setting the per-leaf derivative to zero gives the optimal leaf weight and the structure score of a tree $q$:

    $w_j^{*} = -\dfrac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$

    $\tilde{L}^{(t)}(q) = -\dfrac{1}{2} \sum_{j=1}^{T} \dfrac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$
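    These two formulas, plus the split gain they imply, in a few lines (helper names are illustrative; G and H denote the per-leaf sums of $g_i$ and $h_i$):

    ```python
    def leaf_weight(G, H, lam):
        # w_j* = -G_j / (H_j + lambda)
        return -G / (H + lam)

    def structure_score(G, H, lam):
        # per-leaf contribution -(1/2) * G^2 / (H + lambda)
        return -0.5 * G * G / (H + lam)

    def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
        # gain = score(parent) - (score(left) + score(right)) - gamma,
        # i.e. the reduction in L~ from splitting one leaf into two
        parent = structure_score(G_left + G_right, H_left + H_right, lam)
        children = (structure_score(G_left, H_left, lam)
                    + structure_score(G_right, H_right, lam))
        return parent - children - gamma
    ```

    For example, with G = -2, H = 3, lam = 1, leaf_weight gives w* = 0.5.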

  • Example

    Index | G  | H
    ----- | -- | --
    1     | g1 | h1
    2     | g2 | h2
    3     | g3 | h3
    4     | g4 | h4
    5     | g5 | h5

    [Figure: a candidate split on "age" routes examples 1-5 into Yes/No branches; each resulting leaf is scored from the summed g and h of the examples it receives.]

  • Tree Growing Algorithm

    [Figure: growing a tree by greedy splitting, e.g. a split A > 1 turns a single leaf (w1 = 5) into two leaves (w1 = 1, w2 = 9); the split is kept only if its gain is positive.]
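    A simplified sketch of the greedy search the figure illustrates: sort one feature once, then sweep the split points while maintaining prefix sums of g and h (function names are illustrative):

    ```python
    import numpy as np

    def best_split(x, g, h, lam=1.0, gamma=0.0):
        """Scan the split points of one feature in sorted order, tracking
        prefix sums of g and h; returns (best_gain, best_threshold)."""
        order = np.argsort(x)
        x, g, h = x[order], g[order], h[order]
        G, H = g.sum(), h.sum()
        GL = HL = 0.0
        best_gain, best_thr = -np.inf, None
        for i in range(len(x) - 1):
            GL += g[i]; HL += h[i]
            if x[i] == x[i + 1]:
                continue                     # cannot split between equal values
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                          - G**2 / (H + lam)) - gamma
            if gain > best_gain:
                best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2
        return best_gain, best_thr
    ```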

  • Missing Values Handling

    example | Age | Gender
    ------- | --- | ------
    X1      | ?   | male
    X2      | 15  | ?
    X3      | 25  | female

    [Figure: a split on "age" with Yes/No branches; examples whose split feature is missing (here X1 for Age) are routed down a learned default direction.]
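    A sketch of the sparsity-aware idea described in [5] (illustrative, not the library's implementation): enumerate splits over the non-missing values only, and for each candidate try routing the missing examples both left and right, keeping the better direction:

    ```python
    import numpy as np

    def best_split_sparse(x, g, h, lam=1.0, gamma=0.0):
        """Returns (best_gain, threshold, default direction for missing values)."""
        miss = np.isnan(x)
        Gm, Hm = g[miss].sum(), h[miss].sum()        # stats of the missing examples
        xs, gs, hs = x[~miss], g[~miss], h[~miss]
        order = np.argsort(xs)
        xs, gs, hs = xs[order], gs[order], hs[order]
        G, H = g.sum(), h.sum()                      # totals include missing examples
        score = lambda G_, H_: G_**2 / (H_ + lam)
        GL = HL = 0.0
        best = (-np.inf, None, None)
        for i in range(len(xs) - 1):
            GL += gs[i]; HL += hs[i]
            if xs[i] == xs[i + 1]:
                continue
            thr = (xs[i] + xs[i + 1]) / 2
            for direction, (Gl, Hl) in (("left", (GL + Gm, HL + Hm)),
                                        ("right", (GL, HL))):
                Gr, Hr = G - Gl, H - Hl
                gain = 0.5 * (score(Gl, Hl) + score(Gr, Hr) - score(G, H)) - gamma
                if gain > best[0]:
                    best = (gain, thr, direction)
        return best
    ```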

  • Alg-1

    [Figure: Algorithm 1 of the XGBoost paper [5]: exact greedy split finding.]

  • Alg-2

    [Figure: Algorithm 2 of the XGBoost paper [5]: approximate split finding over quantile-based candidate splits.]

  • Alg-3

    [Figure: Algorithm 3 of the XGBoost paper [5]: sparsity-aware split finding, which learns the default direction for missing values.]

  • Training Procedure

    Initialize the model parameters;

    set every training sample's default prediction to the same class value, and compute grad and hess;

    for each tree, from the first up to the configured number of trees:

      grad', hess' = (grad, hess) * sample_weight;

      row- and column-sample the original dataset to obtain the sampled dataset D;

      partition D into X = (the attribute set) and Y = (true value, predicted value, grad', hess');

      build a tree and get back the tree model curTree;

      update the predictions (learning rate * new predictions), grad, and hess;

      append the subtree to the root tree; print the training information.

    A condensed code sketch follows this list.
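    A minimal sketch of the loop above, assuming hypothetical helpers (grad_hess, sample_rows_cols, build_tree); it follows zhpmatrix/groot in spirit only:

    ```python
    import numpy as np

    def train(X, y, n_trees, eta, sample_weight, grad_hess, sample_rows_cols, build_tree):
        """grad_hess(y, pred) -> (grad, hess); sample_rows_cols(X) -> (row_idx, col_idx);
        build_tree(X, grad, hess, cols) fits one tree, searching splits only
        among the sampled columns."""
        pred = np.full(len(y), y.mean())           # same default prediction for every sample
        root = []                                  # the "root tree" the subtrees are appended to
        for t in range(n_trees):
            grad, hess = grad_hess(y, pred)        # recompute grad and hess
            grad, hess = grad * sample_weight, hess * sample_weight
            rows, cols = sample_rows_cols(X)       # row/column sampling -> dataset D
            cur_tree = build_tree(X[rows], grad[rows], hess[rows], cols)
            pred += eta * cur_tree.predict(X)      # learning rate * new predictions
            root.append(cur_tree)                  # append the subtree to the root tree
            print(f"round {t}: 1 tree added")      # print the training information
        return root
    ```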

  • Tree-Building Procedure

    Initialize the parameters and take the training data X, Y;

    check whether the current samples form a leaf node; if so, compute the leaf score and return the current TreeNode, which carries the leaf flag and the leaf score; otherwise continue;

    column-sample X to obtain X_selected;

    find the best attribute of X_selected, the best split threshold for it, the best gain, and the default split direction for missing values;

    if the best gain is negative, form a leaf node and return the current TreeNode;

    split X and build the left and right subtrees recursively;

    update the feature-importance counters;

    return a TreeNode that carries the left and right subtrees, the split attribute, the split threshold, and the split direction.

    A recursive sketch follows.
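    A recursive sketch of the procedure above, with hypothetical helpers find_best_split and col_sample:

    ```python
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TreeNode:
        is_leaf: bool = False
        score: float = 0.0                 # leaf score w* = -sum(g) / (sum(h) + lambda)
        left: Optional["TreeNode"] = None
        right: Optional["TreeNode"] = None
        feature: int = -1                  # split attribute (original column index)
        threshold: float = 0.0             # split threshold
        default_left: bool = True          # split direction for missing values

    def build_tree(X, grad, hess, depth, max_depth, lam, find_best_split, col_sample):
        def leaf():
            return TreeNode(is_leaf=True, score=-grad.sum() / (hess.sum() + lam))
        if depth >= max_depth or len(X) <= 1:      # current samples form a leaf
            return leaf()
        cols = col_sample(X)                        # column sampling -> X_selected
        feat, thr, gain, default_left = find_best_split(X[:, cols], grad, hess)
        if gain < 0:                                # negative best gain -> leaf
            return leaf()
        mask = X[:, cols[feat]] < thr               # split X (NaN compares False -> right)
        if mask.all() or not mask.any():
            return leaf()
        left = build_tree(X[mask], grad[mask], hess[mask],
                          depth + 1, max_depth, lam, find_best_split, col_sample)
        right = build_tree(X[~mask], grad[~mask], hess[~mask],
                           depth + 1, max_depth, lam, find_best_split, col_sample)
        return TreeNode(left=left, right=right, feature=cols[feat],
                        threshold=thr, default_left=default_left)
    ```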

  • Prediction Procedure

    Initialize the prediction pred to 0;

    for every subtree under the root tree:  // unlike a RandomForest, the outputs are summed, not averaged

      // this loop can be parallelized

      pred += learning rate * that tree's prediction;

    return the prediction pred.
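    The same steps as a sketch (tree objects as in the builder above):

    ```python
    import numpy as np

    def predict(root, X, eta):
        """Boosting sums eta-scaled tree outputs; a RandomForest would instead
        average independently trained trees. The loop is embarrassingly parallel."""
        pred = np.zeros(len(X))
        for tree in root:                    # each subtree under the root tree
            pred += eta * tree.predict(X)    # learning rate * tree prediction
        return pred
    ```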

  • System Design

    Column Block for Parallel Learning

      used by both the exact greedy and the approximate algorithm

    Cache-aware Access

      exact greedy: cache-aware prefetching

      approximate: choose a block size that balances cache locality against parallelization

    Blocks for Out-of-core Computation

      block compression, block sharding
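    A small sketch of the column-block idea (an illustrative layout, not XGBoost's actual data structure): each feature is pre-sorted once, and split finding scans the sorted order while fetching gradient statistics through stored row indices:

    ```python
    import numpy as np

    def make_column_blocks(X):
        """Pre-sort each feature once and keep, per column, the sorted values
        together with the original row indices (a CSC-like layout); split
        finding then scans in sorted order and never re-sorts."""
        blocks = []
        for j in range(X.shape[1]):
            order = np.argsort(X[:, j], kind="stable")
            blocks.append((X[order, j], order))    # (sorted values, row indices)
        return blocks

    # Split finding reads gradient statistics through the stored row indices:
    # for values, rows in make_column_blocks(X):
    #     g_scan = g[rows]   # non-contiguous reads: what cache-aware access targets
    ```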

  • Column-Block

    [Figure: in-memory data is partitioned into blocks BLK00, BLK01, BLK10, BLK11, each holding a slice of the samples (Samples00 through Samples11) with the features stored in CSC (compressed sparse column) format.]

  • Compressed Row Storage

    [Figure: compressed row storage layout.]

  • Cache-Aware

    [Figure: feature values (5, 1, 8, 3) are scanned in sorted order (1, 3, 5, 8), but each sample's gradient information is fetched by row index, so the reads are non-contiguous in memory.]

  • Out-of-core Computation

    [Figure: blocks BLK00 through BLK11, each holding 2^16 examples, are sharded across Disk00 and Disk01 and prefetched into in-memory buffers Buffer00 and Buffer01 while the CPU keeps computing.]

  • Model Applications

    Feature Importance

    Customized Metric / Loss Function

    Combination Features (stacking)
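    As one concrete instance of a customized metric and loss, a sketch using xgboost's low-level train API with its obj/feval hooks (the 2017-era API; squared error and MAE are stand-in choices):

    ```python
    import numpy as np
    import xgboost as xgb

    # Squared error written as a custom objective: per-example grad and hess.
    def squared_error_obj(preds, dtrain):
        labels = dtrain.get_label()
        grad = preds - labels        # first derivative of (1/2)(pred - label)^2
        hess = np.ones_like(preds)   # second derivative
        return grad, hess

    # A custom evaluation metric: returns (name, value).
    def mae_metric(preds, dtrain):
        return "mae", float(np.mean(np.abs(preds - dtrain.get_label())))

    X = np.random.rand(100, 5)
    y = X.sum(axis=1)
    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train({"max_depth": 2, "eta": 0.3}, dtrain,
                        num_boost_round=20, evals=[(dtrain, "train")],
                        obj=squared_error_obj, feval=mae_metric)
    ```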

  • References

    1. XGBoost on GitHub: https://github.com/dmlc/xgboost

    2. A tiny implementation of XGBoost: https://github.com/zhpmatrix/groot

    3. A GBDT implementation: https://github.com/liuzhiqiangruc/dml/tree/master/gbdt

    4. Tianqi Chen, Tong He. "Higgs Boson Discovery with Boosted Trees". JMLR Workshop, 2015.

    5. Tianqi Chen, Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System". KDD, 2016.

    6. Xinran He et al. "Practical Lessons from Predicting Clicks on Ads at Facebook".

    7. Reading the XGBoost source code: https://zhpmatrix.github.io/2017/03/15/xgboost-src-reading-2

  • Thanks! Any questions to discuss?