
Deep Embedding Forest: Forest-based Serving with Deep Embedding Features

J. Zhu*, Y. Shan*, JC Mao*, D. Yu**, H. Rahmanian***, and Y. Zhang*

*Bing Ads of AI & Research Group, Microsoft Corp.
**MSR of AI & Research Group, Microsoft Corp.
***Department of Computer Science, UCSC

March 16, 2017

Abstract

Deep Neural Networks (DNN) have demonstrated superior ability to extract high-level embedding vectors from low-level features. Despite this success, serving time remains a bottleneck due to the expensive run-time computation of multiple layers of dense matrices. GPGPU, FPGA, or ASIC-based serving systems require additional hardware that is not in the mainstream design of most commercial applications. In contrast, tree or forest-based models are widely adopted because of their low serving cost, but they depend heavily on carefully engineered features. This work proposes a Deep Embedding Forest model that benefits from the best of both worlds. The model consists of a number of embedding layers and a forest/tree layer. The former maps high-dimensional (hundreds of thousands to millions) and heterogeneous low-level features to lower-dimensional (thousands) vectors, and the latter ensures fast serving.

Built on top of a representative DNN model called Deep Crossing [21], and two forest/tree-based models, XGBoost and LightGBM, a two-step Deep Embedding Forest algorithm is demonstrated to achieve on-par or slightly better performance than the DNN counterpart, with only a fraction of the serving time on conventional hardware. After comparison with a joint optimization algorithm called partial fuzzification, also proposed in this paper, it is concluded that the two-step Deep Embedding Forest achieves near-optimal performance. Experiments based on large-scale data sets (up to 1 billion samples) from a major sponsored search engine prove the efficacy of the proposed model.

    1 Introduction

Well-abstracted features are known to be crucial for developing good machine learning models, but feature engineering by humans usually takes a large amount of work and requires expert domain knowledge in a traditional machine learning process. DNNs have been used as a powerful machine learning tool in both industry and research for their ability to perform automatic feature engineering on various kinds of raw data, including but not limited to speech, text, or image sources, without requiring domain expertise [15, 20, 21, 8].

With the support of hardware acceleration platforms such as clusters of general-purpose graphics processing units (GPGPUs), field programmable gate arrays (FPGAs), or ASIC-based serving systems [4, 1, 14], DNNs are capable of training on billions of samples with scalability. However, DNNs are still expensive for online serving due to the fact that most commercial platforms are based on central processing units (CPUs), with limited applications of such acceleration hardware.


Tree-based models such as random forests (RFs) and gradient boosted decision trees (GBDTs), on the other hand, with their run-time efficiency and good performance, are currently popular production models in large-scale applications. However, the construction of strongly abstracted features that make the raw data meaningful for tree-based models requires in-depth domain knowledge and is often time-consuming.

This work proposes a hybrid model that carries the performance of DNNs over to run-time serving, with speed comparable to forest/tree-based models.

The paper is arranged in the following structure. Sec. 2 sets up the context of the application domain where the Deep Embedding Forest is developed. Sec. 3 gives examples of features in the context of sponsored search. Sec. 4 provides a formal statement of the problem by introducing the model architecture, components, and design principles. Sec. 5 describes the training methodology, which involves the initialization of the embedding layers, the training of the forest layer, and the joint optimization through a partial fuzzification algorithm. Sec. 6 presents experiment results with data sets of various sizes up to 1 billion samples. It also demonstrates the model's ability to work with different kinds of forest/tree-based models. Sec. 7 elaborates on the relationships of the Deep Embedding Forest model with a number of related works. Sec. 8 concludes the paper and points out future directions. The appendix provides the implementation details of the partial fuzzification algorithm.

    2 Sponsored Search

Deep Embedding Forest is discussed in the context of sponsored search of a major search engine. Readers can refer to [6] for an overview of this subject. In brief, sponsored search is responsible for showing ads alongside organic search results. There are three major agents in the ecosystem: the user, the advertiser, and the search platform. The goal of the platform is to show the user the advertisement that best matches the user's intent, which is expressed mainly through a specific query. Below are the concepts key to the discussion that follows.

Query: A text string a user types into the search box

Keyword: A text string related to a product, specified by an advertiser to match a user query

Title: The title of a sponsored advertisement (referred to as "an ad" hereafter), specified by an advertiser to capture a user's attention

Landing page: A product's web site a user reaches when the corresponding ad is clicked by the user

Match type: An option given to the advertiser on how closely the keyword should be matched by a user query, usually one of four kinds: exact, phrase, broad, and contextual

Campaign: A set of ads that share the same settings such as budget and location targeting, often used to organize products into categories

Impression: An instance of an ad being displayed to a user. An impression is usually logged with other information available at run-time

Click: An indication of whether an impression was clicked by a user. A click is usually logged with other information available at run-time

Click through rate: Total number of clicks over total number of impressions

Click Prediction: A critical model of the platform that predicts the likelihood that a user clicks on a given ad for a given query

Sponsored search is only one kind of machine learning application. However, given the richness of the problem space, the various types of features, and the sheer volume of data, we think our results can be generalized to other applications with similar scale.

    3 Feature Representation

This section provides example features used in the prediction models of sponsored search. The features in Table 1 are available during run-time when an ad is displayed (an impression). They are also available in offline logs for model training.

Feature name       Type       Dimension
Query              Text       49,292
Keyword            Text       49,292
Title              Text       49,292
MatchType          Category   4
CampaignID         ID         10,001
CampaignIDCount    Numerical  5

Table 1: Examples of heterogeneous and high dimensional features used in typical applications of sponsored search

Each feature is represented as a vector. Text features such as a query are converted into tri-letter grams with 49,292 dimensions as in [11]^1. Categorical input such as MatchType is represented by a one-hot vector, where exact match (see Sec. 2) is [1, 0, 0, 0], phrase match is [0, 1, 0, 0], and so on.

There are usually millions of campaigns in a sponsored search system. Instead of converting campaign ids into a one-hot vector with millions of dimensions, a pair of companion features is used. Specifically, CampaignID is a one-hot representation consisting only of the top 10,000 campaigns with the highest number of clicks. The 10,000th slot (index starts from 0) is saved for all the remaining campaigns. The other campaigns are covered by CampaignIDCount, which is a numerical feature that stores per-campaign statistics such as click through rate. Such features will be referred to as counting features in the following discussions. All the features introduced above are sparse features except the counting features.
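To make the representation concrete, here is a minimal Python sketch of how such feature vectors could be assembled. It is illustrative only: the tri-letter-gram indices are hashed rather than looked up in the fixed 49,292-entry vocabulary of [11], the 5-dimensional CampaignIDCount feature is reduced to a single statistic, and all function names are our own.

```python
import numpy as np

MATCH_TYPES = ["exact", "phrase", "broad", "contextual"]   # see Sec. 2
TRI_DIM = 49292            # tri-letter-gram dimension from Table 1
TOP_CAMPAIGNS = 10000      # top campaigns kept in the CampaignID one-hot

def match_type_one_hot(match_type):
    """One-hot encode MatchType, e.g. 'exact' -> [1, 0, 0, 0]."""
    v = np.zeros(len(MATCH_TYPES))
    v[MATCH_TYPES.index(match_type)] = 1.0
    return v

def tri_letter_gram(text):
    """Multi-hot tri-letter-gram vector. Indices are hashed here for
    illustration; the paper uses the fixed vocabulary of [11]."""
    v = np.zeros(TRI_DIM)
    for word in text.lower().split():
        padded = "#" + word + "#"                       # word-boundary markers
        for i in range(len(padded) - 2):
            v[hash(padded[i:i + 3]) % TRI_DIM] += 1.0   # counts, not just 0/1
    return v

def campaign_features(campaign_rank, campaign_ctr):
    """CampaignID one-hot over the top 10,000 campaigns (slot 10,000 catches
    the rest) plus a companion counting feature such as the campaign CTR."""
    one_hot = np.zeros(TOP_CAMPAIGNS + 1)
    one_hot[min(campaign_rank, TOP_CAMPAIGNS)] = 1.0
    return one_hot, np.array([campaign_ctr])            # Table 1 uses 5 statistics

query_vec = tri_letter_gram("red running shoes")
print(query_vec.shape, int(query_vec.sum()))            # (49292,) and the gram count
```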

    4 Problem Statement

The goal is to construct a model with the structure in Fig. 1 that consists of the feature inputs, the embedding layers, the stacking layer, the forest layer, and the objective function. The model will be referred to as DEF or the DEF model in the following discussions.

^1 Unlike one-hot vectors, tri-letter grams are usually multi-hot and have integer values larger than 1 on non-zero elements

[Figure 1: The Deep Embedding Forest model. Feature inputs (Feature #1, Feature #2, ..., Feature #n) are mapped by embedding layers (Embedding #1, ..., Embedding #n) into the stacking layer, which feeds the forest layer and the objective.]

    4.1 Model Components

DEF allows low-level features of different natures, including sparse one-hot or categorical features, and dense numerical features. These features can be extracted from text, speech, or image sources.

An embedding layer maps low-level features to a different feature space. It is a single layer of neural network with the general form:

    ỹj = g(Wjxj + bj), (1)

where j is the index to the individual features, xj ∈ R^{nj}, Wj is an mj × nj matrix, bj ∈ R^{mj} is the vector of the bias terms, ỹj is the embedding vector, and g(·) is a non-linear activation function such as ReLU, sigmoid, or tanh. When mj < nj, embedding is used to reduce the dimensionality of the input vector and construct high-level features.

The stacking layer concatenates the embedding features into one vector as follows:

    ỹ = [ỹ0, ỹ1, · · · , ỹK−1], (2)

where K is the number of individual features. Note that features with low dimensionality can be stacked without embedding. An example is Feature #2 in Fig. 1.
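A minimal numpy sketch of Equ. 1 and Equ. 2 follows. The layer sizes, the choice of ReLU for g(·), the random inputs, and all names are illustrative assumptions; the low-dimensional dense feature is stacked without embedding, as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def embed(x, W, b):
    """Equ. 1: a single-layer embedding y_j = g(W_j x_j + b_j)."""
    return relu(W @ x + b)

# Three sparse features embedded to 128 dimensions each (sizes are examples only).
dims_in, dim_embed = [49292, 49292, 4], 128
Ws = [rng.normal(0.0, 0.01, (dim_embed, n)) for n in dims_in]
bs = [np.zeros(dim_embed) for _ in dims_in]

xs = [rng.random(n) for n in dims_in]   # stand-ins for real sparse feature vectors
dense = rng.random(5)                   # a low-dimensional dense feature, not embedded

# Equ. 2: the stacking layer concatenates the embedded (and unembedded) features.
stacking = np.concatenate([embed(x, W, b) for x, W, b in zip(xs, Ws, bs)] + [dense])
print(stacking.shape)                   # (3 * 128 + 5,)
```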


The stacking vector is sent as input to the forest layer, which is represented as F(Ψ, Θ, Π), where Ψ defines the number of trees in the forest and the corresponding structure, Θ is the parameter set of the routing functions on the decision nodes, and Π is the parameter set of the distribution functions on the leaf nodes.

DEF allows objective functions of various types, including but not limited to classification, regression, and ranking.

    4.2 DEF Properties

When designed and implemented properly, DEF is expected to possess two major advantages enabled by the unique model structure.

The first is to minimize the effort of manual feature engineering. It is usually a challenge for forest/tree-based models to handle low-level features exceeding tens of thousands of dimensions. The embedding layers can comfortably operate on dimensions 10 or 100 times higher, and automatically generate high-level embedding features of a size manageable by the forest/tree-based models. The embedding layers are also trained together with the rest of the model using the same objective function. As a result, they are more adapted to the applications as compared with stand-alone embedding technologies such as Word2Vec [17].

The second is to minimize the run-time latency. Based on the structure in Fig. 1, the serving time per sample is determined by the embedding time T1 and the prediction^2 time T2. T1, which is the run-time of a single layer of neurons, makes up a small fraction of the total in a typical DNN with multiple deep layers. T1 is zero for dense numerical features when they are stacked without embedding. The complexity of embedding a sparse feature is O(n0̄ ne), where n0̄ is the number of non-zero elements, and ne is the dimension of the corresponding embedding vector. As an example, a sparse tri-letter gram [11] has around 50K dimensions but usually has n0̄ ≤ 100. For a typical ne between 128 and 256, the run-time cost of an embedding layer for a sparse feature is negligible.

^2 The word prediction in this context refers to a general scoring operation that is applicable to classification, regression, ranking, and so on

The prediction time T2 is a function of nt dt nf, where nt is the number of trees, dt is the average depth of the trees in the forest, and nf is the total number of feature dimensions the decision or routing function depends on at each internal (or non-leaf) node. DEF uses decision nodes that rely on only one feature dimension to ensure serving speed. T2 is then proportional to nt dt, which is independent of the dimensions of the stacking vector^3. This is much cheaper than a typical DNN with multiple layers of neurons.
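The sketch below illustrates why T2 scales with nt·dt when every internal node tests a single dimension of the stacking vector: serving touches one node per level per tree, regardless of how wide the stacking vector is. The array-based tree layout is our own illustrative choice, not the internal format of XGBoost or LightGBM.

```python
import numpy as np

class Tree:
    """A binary tree whose internal nodes each test one stacking dimension."""
    def __init__(self, feature, threshold, left, right, value):
        # Parallel arrays indexed by node id; left[i] < 0 marks a leaf.
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value

    def predict(self, y):
        node, steps = 0, 0
        while self.left[node] >= 0:     # one comparison per level: O(d_t) per tree
            go_left = y[self.feature[node]] <= self.threshold[node]
            node = self.left[node] if go_left else self.right[node]
            steps += 1
        return self.value[node], steps

# A depth-2 example: node 0 splits on dim 3, node 1 on dim 10; nodes 2-4 are leaves.
tree = Tree(feature=[3, 10, -1, -1, -1],
            threshold=[0.5, 0.1, 0.0, 0.0, 0.0],
            left=[1, 3, -1, -1, -1],
            right=[2, 4, -1, -1, -1],
            value=[0.0, 0.0, 0.2, 0.4, -0.3])

y = np.random.default_rng(1).random(512)           # a 512-dim stacking vector
forest = [tree] * 100                               # n_t = 100 identical toy trees
score = sum(t.predict(y)[0] for t in forest)        # ~ n_t * d_t comparisons in total
print(score)
```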

    5 Training Methodology

Training the DEF model requires optimization of the objective function w.r.t. {Wj}, {bj}, Ψ, Θ, and Π (see definitions in Sec. 4.1). It involves three steps detailed in the following sub-sections.

5.1 Initialize Embedding Layers with Deep Crossing

The first step is to initialize {Wj}, {bj} in Equ. 1 with the Deep Crossing model [21]^4. As can be seen from Fig. 2, the embedding layers and the stacking layer are exactly the same as in the DEF model in Fig. 1. The difference is that the forest layer is replaced by the layers inside the dotted rectangle. The multiple residual units are constructed from the residual unit, which is the basic building block of the Residual Net [7] that claimed the world record in the ImageNet contest. The use of residual units provides Deep Crossing with superior convergence properties and better control to avoid overfitting.

While other DNNs can be applied in the context of DEF, Deep Crossing offers some unique benefits that are especially suitable for this task. First of all, it is designed to enable large-scale applications.

^3 To be more precise, the type of DEF that possesses such a property is called a Two-Step DEF, as defined later in Sec. 5.4

^4 Code is available under https://github.com/Microsoft/CNTK/wiki/Deep-Crossing-on-CNTK


[Figure 2: The Deep Crossing model used to initialize the embedding layers of DEF. Feature inputs (Feature #1, Feature #2, ..., Feature #n) are mapped by embedding layers into the stacking layer, followed by multiple residual units, the scoring layer, and the objective.]

In [21], Deep Crossing was implemented on a multi-GPU platform powered by the open source tool called the Computational Network Toolkit (CNTK) [1]^5. The resulting system was able to train a Deep Crossing model with five layers of residual units (10 individual layers in total), with 2.2 billion samples of more than 200K dimensions. The model significantly exceeded the offline AUC of a click prediction model in a major sponsored search engine.

Another key benefit of Deep Crossing is the ability to handle low-level sparse features with high dimensions. In the above example, three tri-letter grams with 50K dimensions each were in the feature list. The feature interactions (or cross features) were automatically captured among the sparse features, and also w.r.t. other individual features such as CampaignID and CampaignIDCount in Table 1. The resulting embedding features are dense and of much lower dimensions, which fall into the comfort zone of forest/tree-based models.

With the presence of a general scoring layer, Deep Crossing also works with all the objective functions of DEF.

^5 http://www.cntk.ai

It should be pointed out that there is no difference in the training process between using Deep Crossing as a stand-alone model and using it for DEF initialization. In other words, the optimization is w.r.t. the embedding parameters, the parameters in the multiple residual units, and those in the scoring layer. The difference is in the usage of the model after the training completes. Unlike the stand-alone model, DEF uses only the embedding parameters for subsequent steps.

5.2 Initialize the Forest Layer with XGBoost and LightGBM

The embedding layers establish a forward function that maps the raw features into the stacking vector:

ỹi = G({xj}i; {W_j^0}, {b_j^0}),   (3)

where i is the index to the training samples. The initial values of the embedding parameters are denoted as {W_j^0} and {b_j^0}. The nonlinear operator G is the combination of g in Equ. 1 and the vector-stacking operator in Equ. 2.

The second step in constructing a DEF builds on top of this function to initialize the forest layer. This is accomplished by training a forest/tree-based model using the mapped sample and target pairs {(ti, ỹi)}, where ti is the target of the i-th training sample. Both XGBoost and LightGBM are applied to serve this purpose.

XGBoost [3]^6 is a gradient boosting machine (GBM) that is widely used by data scientists, and is behind many winning solutions of various machine learning challenges.

LightGBM [16]^7 is another open source GBM tool recently developed by Microsoft. It uses histogram-based algorithms to accelerate the training process and reduce memory consumption [18, 12], and also incorporates advanced network communication algorithms to optimize parallel learning.

The outcome of this step, whether produced by XGBoost or LightGBM, becomes the initial values of the forest parameters Ψ0, Θ0, and Π0.
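The two initialization steps can be condensed into the following sketch, which assumes the embedding forward function G of Equ. 3 is available as a frozen pass (faked here with a fixed random projection so the example runs) and uses LightGBM's scikit-learn wrapper. All names, hyper-parameters, and the toy data are illustrative; in the paper the embeddings come from a Deep Crossing model trained on CNTK, not from the stub shown here.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Step 1 (stub): a frozen embedding forward pass G(x) -> stacking vector.
# In the paper this is the trained Deep Crossing embedding; a fixed random
# projection stands in for it here purely so the example is self-contained.
W0 = rng.normal(0.0, 0.01, (128, 1000))
def stacking_vector(x):
    return np.maximum(W0 @ x, 0.0)      # Equ. 3 with g = ReLU

# Toy raw features and binary click labels.
X_raw = rng.random((5000, 1000))
t = rng.integers(0, 2, 5000)

# Step 2: train the forest layer on the mapped pairs {(t_i, y_i)}.
Y = np.stack([stacking_vector(x) for x in X_raw])
forest = lgb.LGBMClassifier(objective="binary", n_estimators=100, num_leaves=128)
forest.fit(Y, t)

# Serving a Two-Step DEF: one cheap embedding pass, then the forest prediction.
print(forest.predict_proba(Y[:2])[:, 1])
```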

^6 https://github.com/dmlc/xgboost
^7 https://github.com/Microsoft/LightGBM


5.3 Joint Optimization with Partial Fuzzification

The third step is a joint optimization that refines the parameters of the embedding layers {Wj} and {bj}, and the parameters of the forest layer Θ and Π. Note that the number and structure of the trees (i.e., Ψ) are kept unchanged. Ideally a joint optimization that solves these parameters holistically is preferred. Unfortunately, this is non-trivial, mostly due to the existence of the forest layer. Specifically, the search for the best structure and the corresponding decision and weight parameters in the forest layer usually relies on greedy approaches, which are not compatible with the gradient-based search for the neural-based embedding layers. To enable this, DEF has to overcome the hurdle of converting the refinement of the forest into a continuous optimization problem.

We start by looking into the pioneering work in [22, 19, 13], where the above problem is solved by fuzzifying the decision functions^8 of the internal nodes. Instead of making a binary decision on the r-th internal node, a fuzzy split function makes fuzzy decisions on each node. The probability of directing a sample to its left or right child node is determined by a sigmoidal function:

\mu_r^L(\tilde{y}) = \frac{1}{1 + e^{-c_r(v_r \cdot \tilde{y} - a_r)}},

\mu_r^R(\tilde{y}) = \frac{1}{1 + e^{c_r(v_r \cdot \tilde{y} - a_r)}} = 1 - \mu_r^L(\tilde{y}),   (4)

where r denotes the index of the internal nodes, v_r is the weight vector, a_r is the split value of the scalar variable v_r · ỹ, and c_r is the inverse width that determines the fuzzy decision range. Outside this range, the assignment of a sample to the child nodes is approximately reduced to a binary split. ỹ ≡ ỹi is the stacking vector of the i-th training sample. The index i is dropped here for simplicity. The functions μ_r^L and μ_r^R are defined for the left child and the right child (if it exists), respectively.

^8 Also referred to as the split function or routing function in the literature

The prediction of the target t is:

\bar{t} = \sum_{l \in \mathcal{L}} \mu_l(\tilde{y}) \pi_l,   (5)

where l is a leaf node in the set of all leaf nodes L, μ_l(·) is the probability of ỹ landing in l, and π_l is the corresponding prediction. Note that μ_l(·) is a function of all the {μ_r^{L,R}(ỹ)} in Equ. 4 along the path from the root to l.

A direct benefit of such fuzzification is that a continuous loss function Loss({ti, t̄i}) is differentiable w.r.t. {vr}, {cr}, {ar}, and {πl}. The downside, however, is the cost of the prediction, which requires the traversal of the entire tree, as indicated in Equ. 5. From Equ. 4, it can be seen that the split function depends on all dimensions of the stacking features. As has been discussed in Sec. 4.2, this is also computationally expensive.

The above approach will be referred to as full fuzzification hereafter, due to the fact that it requires a full traversal of the forest and depends on all feature dimensions.

DEF simplifies full fuzzification by having each internal node address only the one dimension of the stacking vector that was selected by the forest/tree model when conducting the joint optimization. The fuzzification on each node is simplified to the following:

\mu_r^L(\tilde{y}) = \frac{1}{1 + e^{-c_r(\tilde{y}_r - a_r)}},

\mu_r^R(\tilde{y}) = \frac{1}{1 + e^{c_r(\tilde{y}_r - a_r)}} = 1 - \mu_r^L(\tilde{y}),   (6)

where c_r is the inverse width, and a_r is the split value on the r-th node. These parameters are initialized by the split function that has been learned in Sec. 5.2. The binary decision is replaced by a fuzzy decision, especially for the samples that land within the fuzzy decision region. The joint optimization allows the parameters {Wj}, {bj}, Θ, and Π to evolve simultaneously to reach the model optimum.

As compared with the split function in Equ. 4, the linear transform of the stacking vector ỹ is removed. This is because each node is dedicated to only one dimension of the stacking vector, which is denoted as ỹ_r in Equ. 6. More specifically, ỹ_r is the feature value of the dimension selected by the r-th node based on Ψ0 and Θ0.


The prediction is the weighted average of all the leaf values of the tree, as shown in Equ. 5. Compared to full fuzzification, the time complexity is reduced by the length of the stacking vector, since the split on each node relies only on the dimension that is selected by the forest/tree model.
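The snippet below illustrates how the fuzzy split of Equ. 6 behaves on a single node: far from the split value a_r the routing is effectively binary, and only samples within roughly 1/c_r of a_r receive genuinely fuzzy assignments. The parameter values are arbitrary examples.

```python
import numpy as np

def mu_left(y_r, c_r, a_r):
    """Equ. 6: fuzzy probability of routing to the left child of node r."""
    return 1.0 / (1.0 + np.exp(-c_r * (y_r - a_r)))

a_r, c_r = 0.5, 20.0                          # split value and inverse width (examples)
for y_r in (0.0, 0.45, 0.5, 0.55, 1.0):
    print(f"y_r={y_r:4.2f}  P(left)={mu_left(y_r, c_r, a_r):5.3f}")
# Output approaches 0 or 1 away from a_r, and is close to 0.5 inside the
# fuzzy decision region, where the joint optimization has the most effect.
```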

5.4 Two-Step DEF vs. Three-Step DEF

There are two options for applying DEF. The first option, referred to hereafter as the Two-Step DEF, involves only the initialization steps in Sec. 5.1 and Sec. 5.2. The other option involves the full three steps, including the partial fuzzification described in the above section. This option will be referred to hereafter as the Three-Step DEF.

As mentioned in Sec. 4.2, the prediction time of the Two-Step DEF is proportional to nt dt, where nt is the number of trees, and dt is the average depth of the trees in the forest. For the Three-Step DEF, the partial fuzzification relies on information from all nodes. As a result, the prediction time T2 is a function of nt lt, where lt is the average number of nodes of the trees. The ratio of time complexity between the Three-Step DEF and the Two-Step DEF is lt/dt, which can grow rapidly as dt increases, since lt can be roughly exponential in terms of dt.
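As a rough back-of-the-envelope illustration (our own numbers, not reported in the paper), for balanced binary trees the node count grows exponentially with depth, so the Three-Step/Two-Step ratio lt/dt quickly becomes large:

```python
# For a balanced binary tree of depth d_t there are about 2**(d_t + 1) - 1 nodes,
# so the Three-Step / Two-Step cost ratio l_t / d_t grows roughly as 2**d_t / d_t.
for d_t in (4, 7, 10):
    l_t = 2 ** (d_t + 1) - 1
    print(f"d_t={d_t:2d}  l_t={l_t:5d}  ratio={l_t / d_t:7.1f}")
```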

    6 Experiment Results

This section reports results using click prediction data. The problem is to predict the probability of a click on an ad (see context in Sec. 2), given input strings from a query, a keyword, an ad title, and a vector of dense features with several hundred dimensions.

As explained in Sec. 3, query strings are converted into tri-letter grams, and so are the keyword and title strings. The dense features are mostly counting features similar to the CampaignIDCount feature in Table 1. The raw input features have around 150K dimensions in total, and are a mix of both sparse and dense features.

All models, including the Deep Crossing model, XGBoost, and LightGBM, share the same log loss function as the objective. The embedding vector is 128-dimensional for each tri-letter gram and for the dense features. This leads to a stacking vector of 512 dimensions. Deep Crossing uses two residual units^9, where the first is connected with the input stacking vector, and the second takes its output as input. The first residual unit has three layers with sizes of 512, 128, and 512, respectively. The second residual unit has similar dimensions, except that the middle layer has only 64 neurons.
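For reference, below is a numpy sketch of one such residual unit, following our reading of [21]: an affine layer with ReLU, a second affine layer, a skip connection that adds the input back, and a final ReLU. The weights and exact formulation are illustrative placeholders, not the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda v: np.maximum(v, 0.0)

def residual_unit(x, W1, b1, W2, b2):
    """One residual unit: affine + ReLU (512 -> 128), affine (128 -> 512),
    add the skip connection x, then a final ReLU."""
    h = relu(W1 @ x + b1)
    return relu(W2 @ h + b2 + x)

d_in, d_mid = 512, 128                  # layer sizes of the first unit in Sec. 6
W1, b1 = rng.normal(0.0, 0.01, (d_mid, d_in)), np.zeros(d_mid)
W2, b2 = rng.normal(0.0, 0.01, (d_in, d_mid)), np.zeros(d_in)

stacking = rng.random(d_in)             # stand-in for the 512-dim stacking vector
out = residual_unit(stacking, W1, b1, W2, b2)
print(out.shape)                        # (512,), same shape as the input
```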

We used a simplified version of the Deep Crossing model that outperformed the production model in offline experiments, as reported in [21]. Its log loss performance is slightly worse than that of the original model, but the model size is more suitable to demonstrate the efficacy of the DEF model. As can be seen in the following experiments, DEF can reduce prediction time significantly even when the Deep Crossing model is a simplified one.

Both XGBoost and LightGBM have parallel implementations. We will use both of them to demonstrate that DEF works with different kinds of forest/tree-based models. However, since parallel LightGBM is already set up in our environment, we will use it for most of the experiments.

In order to achieve a clean apple-to-apple comparison, the majority of the experiments are based on the prediction time using a single CPU processor, without relying on any of the SIMD instructions (referred to hereafter as the plain implementation).

6.1 Experiment with 3 Million Samples: Log Loss Comparison

For the first set of experiments, the training data has 3.59M (million) samples, and the test data has 3.64M samples. The Deep Crossing model was trained on 1 GPU and converged after 22 epochs. A stacking vector was generated for each sample based on the forward computation of the Deep Crossing model, and was used as the input to train the forest models, including both XGBoost and LightGBM.

^9 See Sec. 5.2 in the Deep Crossing paper [21] for more information about the residual unit


                    Deep Crossing   2-Step (XGBoost)   2-Step (LightGBM)   3-Step (LightGBM)
Relative log loss   100             99.96              99.94               99.81
Time (ms)           2.272           0.168              0.204               -

Table 2: Comparison of performance and prediction time per sample between Deep Crossing and DEF. The models are trained with 3.59M samples. Performance is measured by a relative log loss using Deep Crossing as the baseline. The prediction time per sample was measured using 1 CPU processor

The resulting models are the Two-Step DEFs. We also experimented with the partial fuzzification model initialized by LightGBM. In all the experiments hereafter, the performance will be reported in terms of relative log loss, defined as follows:

\gamma_r^{DEF} = \frac{\gamma^{DEF}}{\gamma^{DC}} \times 100,   (7)

where γ and γr are the actual and relative log losses, respectively, and DC represents the Deep Crossing model for simplicity. As with the actual log loss, a smaller relative log loss indicates better performance.

6.1.1 Two-Step DEF with XGBoost as the Forest Layer

XGBoost converged after 1108 iterations with a maximum depth of 7. The Two-Step DEF slightly outperformed Deep Crossing, as shown in Table 2.

6.1.2 Two-Step DEF with LightGBM as the Forest Layer

The LightGBM model converged after 678 iterations with a maximum number of leaves of 128. As shown in Table 2, the Two-Step DEF using LightGBM performed better than Deep Crossing in log loss. Note that an apple-to-apple performance comparison between XGBoost and LightGBM requires a different setting, which is of less interest in this experiment.

6.1.3 Three-Step DEF with Partial Fuzzification

Taking the LightGBM model as the initial forest layer, the joint optimization using partial fuzzification achieved slightly better accuracy with a relative log loss of 99.81, as shown in Table 2.

[Figure 3: Comparison of the scalability of the prediction run-time per sample on CPU processors among Deep Crossing, Two-Step DEF with XGBoost, and Two-Step DEF with LightGBM. Note that the axis of prediction time for DEF is on the left side, while the one for Deep Crossing is on the right side.]

6.2 Experiment with 3 Million Samples: Prediction Time Comparison

As shown in Table 2, the prediction times for Deep Crossing, Two-Step DEF with XGBoost, and Two-Step DEF with LightGBM are 2.272 ms, 0.168 ms, and 0.204 ms, respectively. The prediction time is measured at the per-sample level using one CPU processor, with all the I/O processes excluded.

We also measured the prediction time with different numbers of processors. As shown in Fig. 3, the prediction time for DEF with both XGBoost and LightGBM decreases as the number of processors increases. The prediction time for Deep Crossing, on the other hand, started to increase after reaching a minimum of 0.368 ms with five processors. This is expected because of its inherently sequential computation between consecutive layers. With more than five processors, the parallelism is outweighed by the communication overhead.

A caveat to call out is that both XGBoost and LightGBM use sample partitioning for parallelization in the current implementation of the predictor. This is in theory different from the per-sample prediction time that we want to measure. However, since the forest/tree-based models are perfectly parallel, we expect that the effect of sample partitioning will be a good approximation of partitioning with individual trees.

The Deep Crossing predictor in this experiment was implemented with the Intel MKL to utilize multiple processors. This implementation is faster than the plain implementation even when there is only one processor involved. As a result, the prediction times in this section cannot be compared with those in other experiments.

Note that the speedup was achieved with DEF performing better in terms of log loss. If the goal is to achieve on-par performance in practice, the number of iterations (trees) can be significantly reduced to further reduce the run-time latency.

Also note that we did not compare the prediction time of the Three-Step DEF in Table 2. This is because the Three-Step DEF runs on GPUs, as detailed in the Appendix. From the discussion in Sec. 5.4, it is clear that the Three-Step DEF is always slower than the Two-Step DEF if both are measured on a CPU processor.

6.3 Experiment with 60 Million Samples

The same experiment was applied to a dataset that contains 60M samples. The test data has 3.59M samples. The comparison of performance and speed is shown in Table 3. The Deep Crossing model converged after 14 epochs, and its log loss is normalized to 100.

                    Deep Crossing   2-Step DEF
Relative log loss   100             99.77
Time (ms)           2.272           0.162

Table 3: Comparison of performance and prediction time per sample between Deep Crossing and the Two-Step DEF using LightGBM, trained with 60M samples. Both predictions were conducted on 1 CPU processor. Performance is measured by relative log loss using Deep Crossing as the baseline

                    Deep Crossing   2-Step DEF
Relative log loss   100             100.55
Time (ms)           2.272           0.204

Table 4: The same as Table 3 but for the models trained with 1B samples

The stacking vector generated by the Deep Crossing model was then used to train LightGBM, which converged with a normalized log loss of 99.77 after 170 iterations. The LightGBM in this experiment used a maximum number of leaves of 2048. The prediction time per sample is 2.272 ms for the Deep Crossing model, and 0.162 ms for DEF. The joint optimization was not conducted because the training process took too long.

6.4 Experiment with 1 Billion Samples

This experiment applied DEF to a large sample set with 1B (billion) samples. A distributed LightGBM was used to train the forest layer on an 8-server cluster. We kept the test data the same as before. The Deep Crossing model was trained on a GPU cluster with 16 GPUs, and converged after 30 epochs. At the time this paper was submitted, the best LightGBM model had 70 trees with the maximum number of leaves equal to 16384. As shown in Table 4, the LightGBM model performed slightly worse than the Deep Crossing model. Regarding prediction speed, LightGBM outperformed Deep Crossing by more than 10 times.


                          3M      60M     1B
Deep Crossing             100     99.78   98.71
2-Step DEF (LightGBM)     99.94   99.56   99.24

Table 5: Comparison of the performance of Deep Crossing and Two-Step DEF with LightGBM in the 3M, 60M, and 1B experiments. Performance is measured by a relative log loss using Deep Crossing in the 3M experiment as the baseline

    6.5 The Effect of Scale

Table 5 compares the log loss performance of the Deep Crossing models and the DEF models as the number of samples increases. The baseline log loss is based on the Deep Crossing model trained with 3.59M samples. As can be seen from the table, the log loss decreases as the number of samples increases to 60M and 1B. The DEF models exhibit the same trend. It is important to point out that even though the DEF model trained with 1B samples performed worse than the Deep Crossing counterpart, it is still better than the Deep Crossing model trained with 60M samples.

    7 Related Work

Combining a decision forest/tree with a DNN has been studied before. Deep Neural Decision Forests [13] represent the latest result along this direction, where a decision forest is generalized and trained together with DNN layers using back propagation. It avoids the hard problem of explicitly finding the forest structure Ψ, but has to rely on decision nodes that are functions of all input dimensions. As a result, the run-time complexity is O(nt lt nf) instead of O(nt dt)^10. The work in [19] learns a forest first and maps it to a DNN to provide better initialization when training data is limited. The DNN is then mapped back to a forest to speed up run-time performance. In order to achieve O(nt dt) complexity, it has to use an approximation to constrain the number of input nodes (the features) to the decision nodes, which significantly deteriorates the accuracy.

^10 See Sec. 4.2 for definitions

In contrast, DEF without joint optimization achieves O(nt dt) run-time complexity with on-par or slightly better accuracy than the DNN counterpart.

The idea of pre-training components of the network separately as initialization (similar to Sec. 5.1) in order to construct high-level features has also been investigated for autoencoders and Deep Belief Networks (DBNs) [2, 23]. Nevertheless, this paper is fundamentally different from those previous works in two main aspects. First, the pre-training in autoencoders and DBNs is performed in an unsupervised fashion, as opposed to Sec. 5.1, in which the supervised Deep Crossing technique is used. Secondly, our work is mainly motivated by obtaining efficient prediction while maintaining high performance, whereas the principal motivation for pre-training in those previous papers is mostly to achieve high-level abstraction.

As discussed earlier, the inefficient serving time of DNNs is mainly because of the expensive run-time computation of multiple layers of dense matrices. Therefore, a natural idea to obtain efficiency in online prediction might be to convert DNNs into an (almost) equivalent shallow neural network once they are trained. This idea even sounds theoretically feasible due to the richness of neural networks with a single hidden layer, according to the Universal Approximation Theorem [10]. Nevertheless, the function modeled by a DNN is fairly complicated due to the deep embedding features. Thus, despite universality, a single-layer neural network becomes impractical, as it may need an astronomical number of neurons to model the desired function [5].

It is also worthwhile to mention the work in [9], which combines a boosted decision forest with a sparse linear classifier. It was observed that the combined model performs better than either of the models on its own. If this is the case, DEF can be easily extended to add a linear layer on top of the forest layer for better performance. The computational overhead would be negligible.

DEF is built on top of the Deep Crossing model [21], the XGBoost [3] model, and the LightGBM [16] model. While the deep layers in the Deep Crossing model are replaced with the forest/tree layer in the DEF model, they are critical in discovering the discriminative embedding features that make the construction of the forest/tree models an easier task.

    8 Conclusion and Future Work

DEF is an end-to-end machine learning solution, from training to serving, for large-scale applications using conventional hardware. Built on top of the latest advances in both DNN and forest/tree-based models, DEF handles low-level and high-dimensional heterogeneous features like a DNN, and serves the high-precision model like a decision tree.

DEF demonstrates that a simple two-step approach can produce results nearly as optimal as the more complex joint optimization through fuzzification. While DNNs with fuzzification have been reported to perform better than plain DNNs, DEF is the first approach we are aware of that outperforms the DNN counterpart without compromising the property of fast serving.

Also because of the two-step approach, DEF serves as a flexible framework that can be adapted to different application scenarios. In this paper, we demonstrated that DEF with XGBoost works as well as DEF with LightGBM. As a result, applications can choose the forest/tree-based models that work best in terms of availability and performance. For instance, we decided to use LightGBM (instead of XGBoost) for the large-scale experiments because we had it set up and running in a parallel environment.

DEF can work with other types of DNN-based embeddings. However, embedding approaches such as LSTMs and CNNs cost a significant amount of run-time computation that cannot be saved by the forest/tree layer. This is a direction we are looking into as future work.

    9 Acknowledgments

The authors would like to thank Jian Jiao (Bing Ads), Qiang Lou (Bing Ads), T. Ryan Hoens, Ningyi Xu (MSRA), Taifeng Wang (MSRA), and Guolin Ke (MSRA) for their support and discussions that benefited the development of DEF.

Appendix A Implementation of partial fuzzification in DEF

The joint optimization using partial fuzzification in DEF was developed as a forest node in CNTK so that it is compatible with the embedding and scoring layers of Deep Crossing. Both the forward and backward operations are implemented on GPU for acceleration.

    A.1 Initialization

The structure of the trees and the forest parameters {cr}, {ar} and {πl} are initialized with the trained forest/tree model. The embedding parameters {Wj}, {bj} are initialized by the trained Deep Crossing model.

    A.2 Forward Propagation

    The prediction of a sample on leaf l is defined as

t_l = \pi_l \prod_{r \in \Omega_l} \mu_r^{L,R}(\tilde{y}),   (8)

where r and l denote the indices of internal and leaf nodes respectively, and Ω_l represents the set of internal nodes along the route from the root to leaf l. The stacking vector ỹ is computed by Equ. 1. The terms μ_r^{L,R}(ỹ) are defined in Equ. 6. The selection of μ_r^L(ỹ) or μ_r^R(ỹ) depends on which child node r directs to along the path. The raw score t̄ is defined as:

\bar{t} = \sum_{l \in \mathcal{L}} t_l,   (9)

where L represents the set of all the leaf nodes in the forest. The raw score is then transformed into the prediction score via a sigmoidal function. The prediction score is a function of the split parameters {cr} and {ar}, the leaf parameters {πl}, and the stacking vector ỹ.
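A small numpy sketch of this forward pass (Equ. 8 and Equ. 9 followed by the sigmoidal transform) is shown below for a single depth-2 tree; the tree layout and parameter values are illustrative placeholders rather than values from a trained model.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(y, nodes, leaves):
    """Equ. 8-9: per-leaf contribution t_l = pi_l * prod(mu) along the path,
    raw score t_bar = sum_l t_l, then a sigmoidal prediction score.
    nodes[r] = (selected dim, c_r, a_r); leaves[l] = (pi_l, path), where the
    path lists (r, 'L' or 'R') pairs from the root down to the leaf."""
    t_bar = 0.0
    for pi_l, path in leaves:
        prob = 1.0
        for r, side in path:
            dim, c, a = nodes[r]
            mu_left = sigmoid(c * (y[dim] - a))          # Equ. 6
            prob *= mu_left if side == "L" else 1.0 - mu_left
        t_bar += pi_l * prob                             # Equ. 8, summed into Equ. 9
    return sigmoid(t_bar)

# A depth-2 tree: the root splits on dim 0, its left child on dim 2; three leaves.
nodes = {0: (0, 4.0, 0.3), 1: (2, 2.0, -0.1)}
leaves = [(0.7, [(0, "L"), (1, "L")]),
          (-0.2, [(0, "L"), (1, "R")]),
          (0.1, [(0, "R")])]

y = np.array([0.5, 0.0, -0.4])           # a tiny stand-in stacking vector
print(forward(y, nodes, leaves))
```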

    A.3 Backward Propagation

The backward propagation refines the fuzzification parameters {cr}, {ar} and {πl}, and the input stacking vector ỹ. Define the gradient of the objective function (denoted as O) with respect to the raw score t̄ as:

\delta_t = \frac{\partial O}{\partial \bar{t}}.   (10)

The updating equation for optimizing O w.r.t. the prediction parameter on leaf l (i.e., π_l) is:

\frac{\partial O}{\partial \pi_l} = \frac{\partial O}{\partial \bar{t}} \frac{\partial \bar{t}}{\partial t_l} \frac{\partial t_l}{\partial \pi_l} = \delta_t \prod_{r \in \Omega_l} \mu_r^{L,R}(\tilde{y}).   (11)

The gradient w.r.t. c_r on the r-th internal node can be expressed as:

\frac{\partial O}{\partial c_r} = \frac{\partial O}{\partial \bar{t}} \sum_{l \in \mathcal{L}_r} \frac{\partial \bar{t}}{\partial t_l} \frac{\partial t_l}{\partial c_r} = \delta_t \sum_{l \in \mathcal{L}_r} \pi_l \cdot \eta_l \prod_{p \in \Omega_l, p \neq r} \mu_p^{L,R}(\tilde{y}),   (12)

where p is the index of the internal nodes, and L_r is the set of the leaf nodes for which the paths to the root contain the internal node r. Depending on which child the r-th node heads to along the path from the root to leaf l, η_l is defined as:

\eta_l^L = \mu_r^L(\tilde{y})(1 - \mu_r^L(\tilde{y}))(\tilde{y}_r - a_r),

\eta_l^R = \mu_r^R(\tilde{y})(1 - \mu_r^R(\tilde{y}))(a_r - \tilde{y}_r).   (13)

The gradient of O w.r.t. a_r has a similar expression to that of c_r in Equ. 12:

\frac{\partial O}{\partial a_r} = \frac{\partial O}{\partial \bar{t}} \sum_{l \in \mathcal{L}_r} \frac{\partial \bar{t}}{\partial t_l} \frac{\partial t_l}{\partial a_r} = \delta_t \sum_{l \in \mathcal{L}_r} \pi_l \cdot \zeta_l \prod_{p \in \Omega_l, p \neq r} \mu_p^{L,R}(\tilde{y}),   (14)

where ζ_l is defined as:

\zeta_l^L = -c_r \cdot \mu_r^L(\tilde{y})(1 - \mu_r^L(\tilde{y})),

\zeta_l^R = c_r \cdot \mu_r^R(\tilde{y})(1 - \mu_r^R(\tilde{y})).   (15)

Here, once again, the selection of the expression for ζ_l is path-dependent. For the input vector ỹ on dimension k (denoted as ỹ_k), the gradient is written as:

\frac{\partial O}{\partial \tilde{y}_k} = \frac{\partial O}{\partial \bar{t}} \sum_{l \in \mathcal{L}_k} \frac{\partial \bar{t}}{\partial t_l} \frac{\partial t_l}{\partial \tilde{y}_k} = \delta_t \sum_{l \in \mathcal{L}_k} \pi_l \sum_{q \in \Omega_{l,k}} \xi_{l,q} \prod_{p \in \Omega_l, p \neq q} \mu_p^{L,R}(\tilde{y}),   (16)

where q is the index of the internal nodes, and L_k is the set of leaf nodes for which the routes from the root pass one or more internal nodes with a split feature on ỹ_k. The set Ω_{l,k} consists of the internal nodes that are dedicated to the k-th dimension of ỹ along the route to leaf l, and the set Ω_l is the same as in Equ. 8. The term ξ_{l,q} is defined as:

\xi_{l,q}^L = c_q \cdot \mu_q^L(\tilde{y})(1 - \mu_q^L(\tilde{y})),

\xi_{l,q}^R = -c_q \cdot \mu_q^R(\tilde{y})(1 - \mu_q^R(\tilde{y})).   (17)

As ỹ is a function of {Wj} and {bj}, the gradient of O with respect to these two sets of embedding parameters can be computed through the chain rule using ∂O/∂ỹ_k.
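As a sanity check of Equ. 11 (the same pattern extends to Equ. 12-17), the snippet below compares the analytic gradient of the raw score w.r.t. a leaf value π_l with a finite-difference estimate for a tree consisting of one internal node and two leaves. This is our own verification sketch; δ_t is 1 here because the "objective" is taken to be the raw score itself.

```python
import numpy as np

def raw_score(pi_left, pi_right, y_r, c_r, a_r):
    """Equ. 8-9 for a single internal node with two leaves."""
    mu_l = 1.0 / (1.0 + np.exp(-c_r * (y_r - a_r)))      # Equ. 6
    return pi_left * mu_l + pi_right * (1.0 - mu_l)

pi_l, pi_r, y_r, c_r, a_r = 0.4, -0.3, 0.2, 5.0, 0.5

# Equ. 11 with delta_t = 1: dO/dpi_left is the path probability mu_l.
analytic = 1.0 / (1.0 + np.exp(-c_r * (y_r - a_r)))

eps = 1e-6
numeric = (raw_score(pi_l + eps, pi_r, y_r, c_r, a_r)
           - raw_score(pi_l - eps, pi_r, y_r, c_r, a_r)) / (2 * eps)

print(analytic, numeric)   # the two estimates agree to several decimal places
```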

    References

[1] Agarwal, A., Akchurin, E., Basoglu, C., Chen, G., Cyphers, S., Droppo, J., Eversole, A., Guenter, B., Hillebrand, M., Hoens, T. R., Huang, X., Huang, Z., Ivanov, V., Kamenev, A., Kranen, P., Kuchaiev, O., Manousek, W., May, A., Mitra, B., Nano, O., Navarro, G., Orlov, A., Padmilac, M., Parthasarathi, H., Peng, B., Reznichenko, A., Seide, F., Seltzer, M. L., Slaney, M., Stolcke, A., Wang, H., Wang, Y., Yao, K., Yu, D., Zhang, Y., and Zweig, G. An introduction to computational networks and the computational network toolkit. Tech. rep., Microsoft Technical Report MSR-TR-2014-112, 2014.

[2] Bengio, Y., et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.

[3] Chen, T., and Guestrin, C. XGBoost: A scalable tree boosting system. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2016), ACM.

[4] Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).

[5] Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 4 (1989), 303–314.

[6] Edelman, B., Ostrovsky, M., and Schwarz, M. Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. Tech. rep., National Bureau of Economic Research, 2005.

[7] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015).

[8] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.

[9] He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., et al. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising (2014), ACM, pp. 1–9.

[10] Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989), 359–366.

[11] Huang, P., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. Learning deep structured semantic models for web search using clickthrough data. In Conference on Information and Knowledge Management (CIKM) (2013).

[12] Jin, R., and Agrawal, G. Communication and memory efficient parallel decision tree construction. In Proceedings of the 2003 SIAM International Conference on Data Mining (2003), SIAM, pp. 119–129.

[13] Kontschieder, P., Fiterau, M., Criminisi, A., and Rota Bulo, S. Deep neural decision forests. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 1467–1475.

[14] Lacey, G., Taylor, G. W., and Areibi, S. Deep learning on FPGAs: Past, present, and future. arXiv preprint arXiv:1602.04283 (2016).

[15] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521, 7553 (2015), 436–444.

[16] Meng, Q., Ke, G., Wang, T., Chen, W., Ye, Q., Ma, Z.-M., and Liu, T. A communication-efficient parallel algorithm for decision tree. In Advances in Neural Information Processing Systems (2016), pp. 1271–1279.

[17] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed representations of words and phrases and their compositionality. In Conference on Neural Information Processing Systems (NIPS) (2013).

[18] Ranka, S., and Singh, V. CLOUDS: A decision tree classifier for large datasets. In Proceedings of the 4th Knowledge Discovery and Data Mining Conference (1998), pp. 2–8.

[19] Richmond, D. L., Kainmueller, D., Yang, M., Myers, E. W., and Rother, C. Mapping stacked decision forests to deep and sparse convolutional neural networks for semantic segmentation. arXiv preprint arXiv:1507.07583 (2015).

[20] Seide, F., Li, G., Chen, X., and Yu, D. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on (2011), IEEE, pp. 24–29.

[21] Shan, Y., Hoens, T. R., Jiao, J., Wang, H., Yu, D., and Mao, J. Deep Crossing: Web-scale modeling without manually crafted combinatorial features. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2016), ACM.

[22] Suárez, A., and Lutsko, J. F. Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis and Machine Intelligence 21, 12 (1999), 1297–1311.

[23] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, Dec (2010), 3371–3408.

