TRANSCRIPT
-
Dive into XGBoost
张海鹏 (zhpmatrix)
Jiangnan University
2017-10-20
1
-
Outline
Mathematical Theory
System Design
Model Applications
References
2
-
Mathematical Theory
3
f(x_{k+1}) = f(x_k + x_{k+1} - x_k) \approx f(x_k) + \nabla f(x_k)(x_{k+1} - x_k)
f(x_{k+1}) < f(x_k) \Rightarrow \nabla f(x_k)(x_{k+1} - x_k) < 0
x_{k+1} = x_k - \gamma \nabla f(x_k)
\nabla f(x_k) = 0 \Rightarrow x_{k+1} = x_k
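The update rule above can be checked with a tiny numeric example (the quadratic function and the step size below are my own illustrative choices, not from the slides):

```python
# Gradient descent on f(x) = (x - 3)^2, applying x_{k+1} = x_k - gamma * f'(x_k).

def grad_f(x):
    # f(x) = (x - 3)^2  =>  f'(x) = 2 * (x - 3)
    return 2.0 * (x - 3.0)

x = 0.0          # starting point
gamma = 0.1      # step size (learning rate)
for _ in range(100):
    x = x - gamma * grad_f(x)

# x converges toward the minimizer x* = 3, where grad_f(x*) = 0
```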
-
Mathematical Theory
4
f(x) = l(h(x, D), Y)
l(h(x, D), Y) = \prod_{d \in D,\, y \in Y} h(x, d)^{y} (1 - h(x, d))^{1-y}
h(x, d) = \frac{1}{1 + e^{-xd}}
l(H(x_{t+1})) = l(H(x_t) + h(x_{t+1})) \approx l(H(x_t)) + \nabla l(H(x_t))\, h(x_{t+1}), \quad H(x_t) = \sum_{i=1}^{t} h(x_i)
-
Mathematical Theory
5
H(x_{t+1}) = H(x_t) - \gamma \nabla l(H(x_t)), \quad H(x_t) = \sum_{i=1}^{t} h(x_i)
h(x_{t+1}) = -\gamma \nabla l(H(x_t))
h : D \mapsto -\nabla l(H(x_t, D)) \quad \text{(the new learner is fit to the negative gradient over } D\text{)}
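The functional gradient step above can be sketched numerically. Squared loss and a base "learner" that simply memorizes the pointwise negative gradient are my simplifications (a real GBDT would fit a tree to it):

```python
# Functional gradient descent: each new learner h_{t+1} is fit to the
# negative gradient of the loss at the current ensemble H(x_t).

y = [1.0, 2.0, 3.0, 4.0]        # targets
H = [0.0, 0.0, 0.0, 0.0]        # current ensemble predictions H(x_t)
gamma = 0.3                      # shrinkage / learning rate

for _ in range(50):
    # negative gradient of 1/2 * (H_i - y_i)^2 w.r.t. H_i is (y_i - H_i)
    neg_grad = [yi - hi for yi, hi in zip(y, H)]
    # h_{t+1} = -gamma * grad l(H(x_t)); add it to the ensemble
    H = [hi + gamma * g for hi, g in zip(H, neg_grad)]

# H approaches y as the number of rounds grows
```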
-
Mathematical Theory
Model Formalization:
\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}
Training Objective:
L(\phi) = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k)
6
-
Mathematical Theory
L^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i)
        = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \sum_{i=1}^{t} \Omega(f_i)
7
-
Mathematical Theory
L^{(t)} \simeq \sum_{i=1}^{n} \left[ l(y_i, \hat{y}^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \sum_{i=1}^{t} \Omega(f_i)
g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}), \quad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})
\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)
(remove constant terms!)
8
-
Mathematical Theory
q : \mathbb{R}^d \to \{1, 2, \dots, T\}, \quad f_t(x) = w_{q(x)}
\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
              = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T
9
-
Mathematical Theory
w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
\tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T
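A sketch of the optimal leaf weight w_j* and the structure score for a fixed tree structure. All g_i, h_i values, lambda, and gamma below are made-up numbers for illustration:

```python
# Per leaf j:  w_j* = -G_j / (H_j + lambda)
# Structure score: -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T

lam = 1.0                                   # L2 regularization (lambda)
gamma = 0.0                                 # per-leaf penalty (gamma)
leaves = [
    [(-1.2, 0.9), (-0.8, 1.1)],             # (g_i, h_i) pairs in leaf 1
    [(0.5, 1.0), (0.7, 0.8), (0.3, 1.2)],   # (g_i, h_i) pairs in leaf 2
]

weights = []
score = 0.0
for leaf in leaves:
    G = sum(g for g, _ in leaf)
    H = sum(h for _, h in leaf)
    weights.append(-G / (H + lam))          # optimal leaf weight w_j*
    score += -0.5 * G * G / (H + lam)
score += gamma * len(leaves)                # + gamma * T
```

A lower structure score means a better tree structure q for this round.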
10
-
Example
11
Index  G   H
1      g1  h1
2      g2  h2
3      g3  h3
4      g4  h4
5      g5  h5
(figure: a decision tree splitting on age, with Y/N branches partitioning the five instances)
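One way to use the structure score on a table like this: compare the score before and after a candidate split, which gives XGBoost's split gain. The concrete g_i, h_i numbers and the {1,2} vs. {3,4,5} split below are invented for illustration:

```python
# Gain = 1/2 * [ G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
#                - (G_L+G_R)^2/(H_L+H_R+lam) ] - gamma

g = [0.4, 0.6, -0.5, -0.3, -0.2]   # g1..g5 (hypothetical)
h = [1.0, 1.0, 1.0, 1.0, 1.0]      # h1..h5 (hypothetical)
lam, gamma = 1.0, 0.1

def structure(G, H):
    # single-leaf structure-score term G^2 / (H + lambda)
    return G * G / (H + lam)

GL, HL = sum(g[:2]), sum(h[:2])    # left child: instances 1, 2
GR, HR = sum(g[2:]), sum(h[2:])    # right child: instances 3, 4, 5
gain = 0.5 * (structure(GL, HL) + structure(GR, HR)
              - structure(GL + GR, HL + HR)) - gamma
```

A split is kept only when the gain is positive.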
-
Tree Growing Algorithm
(figure: candidate tree structures with splits on feature A, e.g. A > 1, and the resulting leaf weights w1, w2)
-
Missing Values Handling
example  Age  Gender
X1       ?    male
X2       15   ?
X3       25   female
(figure: a decision tree splitting on age, routing X3 and X2 along Y/N branches, with a default direction for the missing value of X1)
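The default-direction idea can be sketched as follows: instances with a missing value are tried in both directions, and the direction with the higher gain becomes the default. This mirrors the sparsity-aware split finding in the XGBoost paper; all g, h values, the threshold, and lambda are hypothetical:

```python
lam = 1.0

def score(G, H):
    # structure-score term G^2 / (H + lambda)
    return G * G / (H + lam)

def gain_with_default(present, missing, threshold):
    """present: list of (value, g, h); missing: list of (g, h)."""
    GL = sum(g for v, g, h in present if v < threshold)
    HL = sum(h for v, g, h in present if v < threshold)
    GR = sum(g for v, g, h in present if v >= threshold)
    HR = sum(h for v, g, h in present if v >= threshold)
    Gm = sum(g for g, h in missing)
    Hm = sum(h for g, h in missing)
    parent = score(GL + GR + Gm, HL + HR + Hm)
    # try sending the missing instances left, then right
    gain_left = 0.5 * (score(GL + Gm, HL + Hm) + score(GR, HR) - parent)
    gain_right = 0.5 * (score(GL, HL) + score(GR + Gm, HR + Hm) - parent)
    if gain_left >= gain_right:
        return gain_left, "left"
    return gain_right, "right"

# X2, X3 have age present; X1's age is missing (see the table above)
present = [(15.0, 0.5, 1.0), (25.0, -0.7, 1.0)]   # (age, g, h), hypothetical
missing = [(0.6, 1.0)]                             # (g, h) for X1
best_gain, default_dir = gain_with_default(present, missing, threshold=20.0)
```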
-
Alg-1: Exact Greedy Algorithm for Split Finding (from the XGBoost paper; figure omitted)
14
-
Alg-2: Approximate Algorithm for Split Finding (from the XGBoost paper; figure omitted)
15
-
Alg-3: Sparsity-aware Split Finding (from the XGBoost paper; figure omitted)
16
-
Training Process
Initialize model parameters;
Set every training sample's default prediction to the same value and compute grad and hess;
For each tree, from the first up to the configured number of trees:
grad', hess' = (grad, hess) * sample_weight;
Row- and column-sample the original dataset to obtain the sampled dataset D;
Partition D into X = (feature set) and Y = (true values, predictions, grad', hess');
Build a tree and return the tree model curTree;
Update the predictions (learning_rate * new predictions), grad, and hess;
Append the subtree to the root tree; print training information;
17
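The loop above can be sketched in Python. `StubTree` is a single-leaf stand-in for the real tree builder (curTree), logistic grad/hess are my assumed loss, and sampling/weighting steps are omitted:

```python
import math

def grad_hess(y_true, pred):
    # logistic loss: g = p - y, h = p * (1 - p), with p = sigmoid(pred)
    p = 1.0 / (1.0 + math.exp(-pred))
    return p - y_true, p * (1.0 - p)

class StubTree:
    """Single-leaf stand-in for curTree: w* = -sum(g) / (sum(h) + lam)."""
    def __init__(self, grads, hesses, lam=1.0):
        self.w = -sum(grads) / (sum(hesses) + lam)
    def predict(self, x):
        return self.w

def train(X, y, n_trees=20, eta=0.3):
    preds = [0.0] * len(y)          # same default prediction for every sample
    trees = []
    for _ in range(n_trees):        # from the first tree up to n_trees
        gs, hs = zip(*(grad_hess(yi, pi) for yi, pi in zip(y, preds)))
        tree = StubTree(gs, hs)     # "build tree, return curTree"
        trees.append(tree)          # append the subtree to the root tree
        # update predictions: learning_rate * new prediction
        preds = [p + eta * tree.predict(x) for p, x in zip(preds, X)]
    return trees, preds
```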
zhpmatrix
-
Tree-Building Process
18
Initialize parameters and fetch the training data X, Y;
Check whether the current training samples form a leaf node; if so, compute the leaf score and return the current TreeNode, which carries: leaf flag, leaf score. Otherwise, continue;
Column-sample X to obtain X_selected;
Find the best attribute of X_selected, its best split threshold, the best gain, and the default split direction for missing values;
If the best gain is negative, form a leaf node and return the current TreeNode;
Split X; build subtrees on X's left and right parts;
Update attribute-importance counters;
Return a TreeNode carrying: left/right subtrees, split attribute, split threshold, split direction;
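A simplified recursion following these steps, for a single numeric feature. Names such as `TreeNode` and `build_tree` mirror the description rather than any real code; column sampling, missing-value handling, importance counters, and the gamma penalty are omitted:

```python
class TreeNode:
    def __init__(self, is_leaf=False, score=0.0,
                 left=None, right=None, feature=None, threshold=None):
        self.is_leaf, self.score = is_leaf, score
        self.left, self.right = left, right
        self.feature, self.threshold = feature, threshold

def leaf_score(rows, lam=1.0):
    # w* = -G / (H + lambda)
    G = sum(g for _, g, _ in rows)
    H = sum(h for _, _, h in rows)
    return -G / (H + lam)

def build_tree(rows, depth=0, max_depth=3, lam=1.0):
    """rows: list of (x, g, h) with a single numeric feature x."""
    # stop: too deep or too few samples -> leaf node
    if depth >= max_depth or len(rows) < 2:
        return TreeNode(is_leaf=True, score=leaf_score(rows, lam))

    def sc(rs):
        G = sum(g for _, g, _ in rs)
        H = sum(h for _, _, h in rs)
        return G * G / (H + lam)

    # find the best threshold by structure-score gain
    best_gain, best_t = 0.0, None
    for t in sorted(set(x for x, _, _ in rows))[1:]:
        L = [r for r in rows if r[0] < t]
        R = [r for r in rows if r[0] >= t]
        gain = 0.5 * (sc(L) + sc(R) - sc(rows))
        if gain > best_gain:
            best_gain, best_t = gain, t
    if best_t is None:                  # no positive gain -> leaf node
        return TreeNode(is_leaf=True, score=leaf_score(rows, lam))

    # split X and build the left/right subtrees recursively
    L = [r for r in rows if r[0] < best_t]
    R = [r for r in rows if r[0] >= best_t]
    return TreeNode(left=build_tree(L, depth + 1, max_depth, lam),
                    right=build_tree(R, depth + 1, max_depth, lam),
                    feature=0, threshold=best_t)
```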
-
Prediction Process
19
Initialize the prediction pred to 0;
For every subtree in the root tree:  // difference from RandomForest
// can be parallelized
pred += learning_rate * each tree's prediction;
Return the prediction pred;
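As a sketch (the `eta` shrinkage value and the `predict(x)` interface on trees are assumptions). Unlike RandomForest, which averages independent trees, boosted trees are summed with shrinkage:

```python
def predict(trees, x, eta=0.3, base=0.0):
    pred = base                      # initialize prediction to 0 (or a base score)
    for tree in trees:               # for every subtree in the root tree
        pred += eta * tree.predict(x)
    return pred
```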
-
System Design
Column Block for Parallel Learning
exact greedy, approximate
Cache-aware Access
exact greedy: cache-aware prefetching
approximate: choose a block size that balances cache locality and parallelization
Blocks for Out-of-core Computation
block compression, block sharding
20
-
Column-Block
(figure: the feature matrix split into in-memory blocks BLK00–BLK11, each holding a subset of samples (Samples00–Samples11) for a subset of features, with every block stored in CSC format)
21
-
Compressed Row Storage
22
-
Cache-Aware
(figure: feature values are visited in sorted order, e.g. 5, 1, 8, 3 → 1, 3, 5, 8, but each sample's gradient statistics stay at their original positions, so gradient reads become non-contiguous)
23
-
Out-of-core Computation
(figure: blocks BLK00–BLK11, each holding 2^16 examples, sharded across Disk00 and Disk01; prefetch threads load them into Buffer00/Buffer01 in memory for the CPU)
-
Model Applications
Feature Importance
Customized Metric/Loss Function
Combination Features (stacking)
25
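For the customized loss item: XGBoost's Python API accepts a custom objective that returns per-instance (grad, hess); with the real library it is passed via the `obj` argument of `xgb.train`. Below is a logistic-loss objective written in pure Python so it is self-contained (the real callback receives a DMatrix and reads labels with `get_label()`):

```python
import math

def logregobj(preds, labels):
    """Logistic loss: grad = sigmoid(pred) - y, hess = p * (1 - p)."""
    grads, hesses = [], []
    for pred, y in zip(preds, labels):
        p = 1.0 / (1.0 + math.exp(-pred))
        grads.append(p - y)
        hesses.append(p * (1.0 - p))
    return grads, hesses
```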
-
References
1. XGBoost on GitHub: https://github.com/dmlc/xgboost
2. A tiny XGBoost implementation: https://github.com/zhpmatrix/groot
3. A GBDT implementation: https://github.com/liuzhiqiangruc/dml/tree/master/gbdt
4. Tianqi Chen, Tong He. "Higgs Boson Discovery with Boosted Trees". JMLR Workshop, 2015.
5. Tianqi Chen, Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System". KDD, 2016.
6. Xinran He et al. "Practical Lessons from Predicting Clicks on Ads at Facebook".
7. XGBoost source code reading notes: https://zhpmatrix.github.io/2017/03/15/xgboost-src-reading-2
26
-
Thanks! Any questions for discussion?
27