
  • Dive into XGBoost

    张海鹏

    JiangNan University

    2017-10-20

  • Contents

    Mathematical Theory

    System Design

    Model Applications

    References

  • Mathematical Theory

    Gradient descent: a first-order Taylor expansion shows why stepping against the gradient decreases $f$.

    $f(x_{k+1}) = f(x_k + x_{k+1} - x_k) \approx f(x_k) + \nabla f(x_k)(x_{k+1} - x_k)$

    $f(x_{k+1}) < f(x_k) \Rightarrow \nabla f(x_k)(x_{k+1} - x_k) < 0$

    $x_{k+1} = x_k - \eta \nabla f(x_k)$

    $\nabla f(x_k) = 0 \Rightarrow x_{k+1} = x_k$
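    A tiny numeric sketch of these update rules (illustrative only; the function and step size are arbitrary choices):

    ```python
    # Minimize f(x) = (x - 3)^2 with the update x_{k+1} = x_k - eta * f'(x_k).
    def grad_f(x):
        return 2.0 * (x - 3.0)

    x, eta = 0.0, 0.1
    for _ in range(100):
        x = x - eta * grad_f(x)   # step against the gradient
    print(x)  # converges toward x* = 3, where grad f(x*) = 0 and the iterate stops moving
    ```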

  • Mathematical Theory

    Logistic regression as the running example: the loss couples the model $h$ to the data $D$ and labels $Y$.

    $f(x) = l(h(x, D), Y)$

    $l(h(x, D), Y) = \prod_{d \in D,\, y \in Y} h(x, d)^{y} \left(1 - h(x, d)\right)^{1-y}$

    $h(x, d) = \frac{1}{1 + e^{-x \cdot d}}$

    Expanding the loss of the growing ensemble $H(x_t) = \sum_{i=1}^{t} h(x_i)$ to first order:

    $l(H(x_{t+1})) = l(H(x_t) + h(x_{t+1})) \approx l(H(x_t)) + \nabla l(H(x_t))\, h(x_{t+1})$

  • Mathematical Theory

    A gradient step in function space tells us what the next weak learner should be:

    $H(x_{t+1}) = H(x_t) - \eta \nabla l(H(x_t)), \qquad H(x_t) = \sum_{i=1}^{t} h(x_i)$

    $h(x_{t+1}) = -\eta \nabla l(H(x_t))$

    so the new learner is fit to the gradient of the loss on the data:

    $h : D \longrightarrow \nabla l(H(x_t, D))$
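    A toy sketch of this functional-gradient view, assuming squared-error loss (whose negative gradient is just the residual) and sklearn regression trees standing in for $h$:

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

    eta, trees = 0.1, []
    H = np.zeros_like(y)                        # H(x_0): the initial ensemble
    for t in range(100):
        neg_grad = y - H                        # -dl/dH for l = (y - H)^2 / 2
        h = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad)  # h fits the negative gradient
        H += eta * h.predict(X)                 # H(x_{t+1}) = H(x_t) + eta * h
        trees.append(h)
    ```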

  • Mathematical Theory

    Model Formalization:

    $\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}$

    Training Objective:

    $L(\phi) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$

  • Mathematical Theory

    $L^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i) = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \sum_{i=1}^{t} \Omega(f_i)$

  • Mathematical Theory

    $L^{(t)} \simeq \sum_{i=1}^{n} \left[ l(y_i, \hat{y}^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \sum_{i=1}^{t} \Omega(f_i)$

    $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$

    Removing the constant terms leaves

    $\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$

  • Mathematical Theory

    A tree assigns each example to one of $T$ leaves via $q : \mathbb{R}^d \to \{1, 2, \dots, T\}$ and outputs that leaf's weight:

    $f_t(x) = w_{q(x)}$

    $\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$

    With $I_j = \{\, i \mid q(x_i) = j \,\}$ the set of examples falling into leaf $j$:

    $\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$

  • Mathematical Theory

    Setting the per-leaf derivative to zero gives the optimal leaf weight and the structure score of a tree $q$:

    $w_j^{*} = -\dfrac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$

    $\tilde{L}^{(t)}(q) = -\dfrac{1}{2} \sum_{j=1}^{T} \dfrac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$
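    These two formulas, plus the split gain they imply, in a few lines (helper names are illustrative; G and H denote the per-leaf sums of $g_i$ and $h_i$):

    ```python
    def leaf_weight(G, H, lam):
        # w_j* = -G_j / (H_j + lambda)
        return -G / (H + lam)

    def structure_score(G, H, lam):
        # per-leaf contribution -(1/2) * G^2 / (H + lambda)
        return -0.5 * G * G / (H + lam)

    def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
        # gain = score(parent) - (score(left) + score(right)) - gamma,
        # i.e. the reduction in L~ from splitting one leaf into two
        parent = structure_score(G_left + G_right, H_left + H_right, lam)
        children = (structure_score(G_left, H_left, lam)
                    + structure_score(G_right, H_right, lam))
        return parent - children - gamma
    ```

    For example, with G = -2, H = 3, lam = 1, leaf_weight gives w* = 0.5.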

  • Example

    Index | G  | H
    ----- | -- | --
    1     | g1 | h1
    2     | g2 | h2
    3     | g3 | h3
    4     | g4 | h4
    5     | g5 | h5

    [Figure: a candidate split on "age" routes examples 1-5 into Yes/No branches; each resulting leaf is scored from the summed g and h of the examples it receives.]

  • Tree Growing Algorithm

    [Figure: growing a tree by greedy splitting, e.g. a split A > 1 turns a single leaf (w1 = 5) into two leaves (w1 = 1, w2 = 9); the split is kept only if its gain is positive.]
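    A simplified sketch of the greedy search the figure illustrates: sort one feature once, then sweep the split points while maintaining prefix sums of g and h (function names are illustrative):

    ```python
    import numpy as np

    def best_split(x, g, h, lam=1.0, gamma=0.0):
        """Scan the split points of one feature in sorted order, tracking
        prefix sums of g and h; returns (best_gain, best_threshold)."""
        order = np.argsort(x)
        x, g, h = x[order], g[order], h[order]
        G, H = g.sum(), h.sum()
        GL = HL = 0.0
        best_gain, best_thr = -np.inf, None
        for i in range(len(x) - 1):
            GL += g[i]; HL += h[i]
            if x[i] == x[i + 1]:
                continue                     # cannot split between equal values
            GR, HR = G - GL, H - HL
            gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                          - G**2 / (H + lam)) - gamma
            if gain > best_gain:
                best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2
        return best_gain, best_thr
    ```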

  • Missing Values Handling

    example | Age | Gender
    ------- | --- | ------
    X1      | ?   | male
    X2      | 15  | ?
    X3      | 25  | female

    [Figure: a split on "age" with Yes/No branches; examples whose split feature is missing (here X1 for Age) are routed down a learned default direction.]
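    A sketch of the sparsity-aware idea described in [5] (illustrative, not the library's implementation): enumerate splits over the non-missing values only, and for each candidate try routing the missing examples both left and right, keeping the better direction:

    ```python
    import numpy as np

    def best_split_sparse(x, g, h, lam=1.0, gamma=0.0):
        """Returns (best_gain, threshold, default direction for missing values)."""
        miss = np.isnan(x)
        Gm, Hm = g[miss].sum(), h[miss].sum()        # stats of the missing examples
        xs, gs, hs = x[~miss], g[~miss], h[~miss]
        order = np.argsort(xs)
        xs, gs, hs = xs[order], gs[order], hs[order]
        G, H = g.sum(), h.sum()                      # totals include missing examples
        score = lambda G_, H_: G_**2 / (H_ + lam)
        GL = HL = 0.0
        best = (-np.inf, None, None)
        for i in range(len(xs) - 1):
            GL += gs[i]; HL += hs[i]
            if xs[i] == xs[i + 1]:
                continue
            thr = (xs[i] + xs[i + 1]) / 2
            for direction, (Gl, Hl) in (("left", (GL + Gm, HL + Hm)),
                                        ("right", (GL, HL))):
                Gr, Hr = G - Gl, H - Hl
                gain = 0.5 * (score(Gl, Hl) + score(Gr, Hr) - score(G, H)) - gamma
                if gain > best[0]:
                    best = (gain, thr, direction)
        return best
    ```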

  • Alg-1

    [Figure: Algorithm 1 of the XGBoost paper [5]: exact greedy split finding.]

  • Alg-2

    [Figure: Algorithm 2 of the XGBoost paper [5]: approximate split finding over quantile-based candidate splits.]

  • Alg-3

    [Figure: Algorithm 3 of the XGBoost paper [5]: sparsity-aware split finding, which learns the default direction for missing values.]

  • Training Procedure

    Initialize the model parameters;

    set every training sample's default prediction to the same class value, and compute grad and hess;

    for each tree, from the first up to the configured number of trees:

      grad', hess' = (grad, hess) * sample_weight;

      row- and column-sample the original dataset to obtain the sampled dataset D;

      partition D into X = (the attribute set) and Y = (true value, predicted value, grad', hess');

      build a tree and get back the tree model curTree;

      update the predictions (learning rate * new predictions), grad, and hess;

      append the subtree to the root tree; print the training information.

    A condensed code sketch follows this list.
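    A minimal sketch of the loop above, assuming hypothetical helpers (grad_hess, sample_rows_cols, build_tree); it follows zhpmatrix/groot in spirit only:

    ```python
    import numpy as np

    def train(X, y, n_trees, eta, sample_weight, grad_hess, sample_rows_cols, build_tree):
        """grad_hess(y, pred) -> (grad, hess); sample_rows_cols(X) -> (row_idx, col_idx);
        build_tree(X, grad, hess, cols) fits one tree, searching splits only
        among the sampled columns."""
        pred = np.full(len(y), y.mean())           # same default prediction for every sample
        root = []                                  # the "root tree" the subtrees are appended to
        for t in range(n_trees):
            grad, hess = grad_hess(y, pred)        # recompute grad and hess
            grad, hess = grad * sample_weight, hess * sample_weight
            rows, cols = sample_rows_cols(X)       # row/column sampling -> dataset D
            cur_tree = build_tree(X[rows], grad[rows], hess[rows], cols)
            pred += eta * cur_tree.predict(X)      # learning rate * new predictions
            root.append(cur_tree)                  # append the subtree to the root tree
            print(f"round {t}: 1 tree added")      # print the training information
        return root
    ```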

  • Tree-Building Procedure

    Initialize the parameters and take the training data X, Y;

    check whether the current samples form a leaf node; if so, compute the leaf score and return the current TreeNode, which carries the leaf flag and the leaf score; otherwise continue;

    column-sample X to obtain X_selected;

    find the best attribute of X_selected, the best split threshold for it, the best gain, and the default split direction for missing values;

    if the best gain is negative, form a leaf node and return the current TreeNode;

    split X and build the left and right subtrees recursively;

    update the feature-importance counters;

    return a TreeNode that carries the left and right subtrees, the split attribute, the split threshold, and the split direction.

    A recursive sketch follows.
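    A recursive sketch of the procedure above, with hypothetical helpers find_best_split and col_sample:

    ```python
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TreeNode:
        is_leaf: bool = False
        score: float = 0.0                 # leaf score w* = -sum(g) / (sum(h) + lambda)
        left: Optional["TreeNode"] = None
        right: Optional["TreeNode"] = None
        feature: int = -1                  # split attribute (original column index)
        threshold: float = 0.0             # split threshold
        default_left: bool = True          # split direction for missing values

    def build_tree(X, grad, hess, depth, max_depth, lam, find_best_split, col_sample):
        def leaf():
            return TreeNode(is_leaf=True, score=-grad.sum() / (hess.sum() + lam))
        if depth >= max_depth or len(X) <= 1:      # current samples form a leaf
            return leaf()
        cols = col_sample(X)                        # column sampling -> X_selected
        feat, thr, gain, default_left = find_best_split(X[:, cols], grad, hess)
        if gain < 0:                                # negative best gain -> leaf
            return leaf()
        mask = X[:, cols[feat]] < thr               # split X (NaN compares False -> right)
        if mask.all() or not mask.any():
            return leaf()
        left = build_tree(X[mask], grad[mask], hess[mask],
                          depth + 1, max_depth, lam, find_best_split, col_sample)
        right = build_tree(X[~mask], grad[~mask], hess[~mask],
                           depth + 1, max_depth, lam, find_best_split, col_sample)
        return TreeNode(left=left, right=right, feature=cols[feat],
                        threshold=thr, default_left=default_left)
    ```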

  • Prediction Procedure

    Initialize the prediction pred to 0;

    for every subtree under the root tree:  // unlike a RandomForest, the outputs are summed, not averaged

      // this loop can be parallelized

      pred += learning rate * that tree's prediction;

    return the prediction pred.
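    The same steps as a sketch (tree objects as in the builder above):

    ```python
    import numpy as np

    def predict(root, X, eta):
        """Boosting sums eta-scaled tree outputs; a RandomForest would instead
        average independently trained trees. The loop is embarrassingly parallel."""
        pred = np.zeros(len(X))
        for tree in root:                    # each subtree under the root tree
            pred += eta * tree.predict(X)    # learning rate * tree prediction
        return pred
    ```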

  • System Design

    Column Block for Parallel Learning

      used by both the exact greedy and the approximate algorithm

    Cache-aware Access

      exact greedy: cache-aware prefetching

      approximate: choose a block size that balances cache locality against parallelization

    Blocks for Out-of-core Computation

      block compression, block sharding
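    A small sketch of the column-block idea (an illustrative layout, not XGBoost's actual data structure): each feature is pre-sorted once, and split finding scans the sorted order while fetching gradient statistics through stored row indices:

    ```python
    import numpy as np

    def make_column_blocks(X):
        """Pre-sort each feature once and keep, per column, the sorted values
        together with the original row indices (a CSC-like layout); split
        finding then scans in sorted order and never re-sorts."""
        blocks = []
        for j in range(X.shape[1]):
            order = np.argsort(X[:, j], kind="stable")
            blocks.append((X[order, j], order))    # (sorted values, row indices)
        return blocks

    # Split finding reads gradient statistics through the stored row indices:
    # for values, rows in make_column_blocks(X):
    #     g_scan = g[rows]   # non-contiguous reads: what cache-aware access targets
    ```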

  • Column-Block

    [Figure: in-memory data is partitioned into blocks BLK00, BLK01, BLK10, BLK11, each holding a slice of the samples (Samples00 through Samples11) with the features stored in CSC (compressed sparse column) format.]

  • Compressed Row Storage

    [Figure: compressed row storage layout.]

  • Cache-Aware

    [Figure: feature values (5, 1, 8, 3) are scanned in sorted order (1, 3, 5, 8), but each sample's gradient information is fetched by row index, so the reads are non-contiguous in memory.]

  • Out-of-core Computation

    [Figure: blocks BLK00 through BLK11, each holding 2^16 examples, are sharded across Disk00 and Disk01 and prefetched into in-memory buffers Buffer00 and Buffer01 while the CPU keeps computing.]

  • Model Applications

    Feature Importance

    Customized Metric / Loss Function

    Combination Features (stacking)
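    As one concrete instance of a customized metric and loss, a sketch using xgboost's low-level train API with its obj/feval hooks (the 2017-era API; squared error and MAE are stand-in choices):

    ```python
    import numpy as np
    import xgboost as xgb

    # Squared error written as a custom objective: per-example grad and hess.
    def squared_error_obj(preds, dtrain):
        labels = dtrain.get_label()
        grad = preds - labels        # first derivative of (1/2)(pred - label)^2
        hess = np.ones_like(preds)   # second derivative
        return grad, hess

    # A custom evaluation metric: returns (name, value).
    def mae_metric(preds, dtrain):
        return "mae", float(np.mean(np.abs(preds - dtrain.get_label())))

    X = np.random.rand(100, 5)
    y = X.sum(axis=1)
    dtrain = xgb.DMatrix(X, label=y)
    booster = xgb.train({"max_depth": 2, "eta": 0.3}, dtrain,
                        num_boost_round=20, evals=[(dtrain, "train")],
                        obj=squared_error_obj, feval=mae_metric)
    ```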

  • References

    1. XGBoost on GitHub: https://github.com/dmlc/xgboost

    2. A tiny implementation of XGBoost: https://github.com/zhpmatrix/groot

    3. A GBDT implementation: https://github.com/liuzhiqiangruc/dml/tree/master/gbdt

    4. Tianqi Chen, Tong He. "Higgs Boson Discovery with Boosted Trees". JMLR Workshop, 2015.

    5. Tianqi Chen, Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System". KDD, 2016.

    6. Xinran He et al. "Practical Lessons from Predicting Clicks on Ads at Facebook".

    7. Reading the XGBoost source code: https://zhpmatrix.github.io/2017/03/15/xgboost-src-reading-2

  • Thanks! Any questions to discuss?