Scaling Up Practical Learning Algorithms (lecture by Mikhail Bilenko)

Post on 10-May-2015








Scaling Up Practical Learning Algorithms

Misha Bilenko

ALMADA Summer School, Moscow 2013

Preliminaries: ML-in-four-slides

• ML: mapping observations to predictions that minimize error
• Predictor: $f: X \to Y$, $f \in F$
– $X$: observations, each consisting of features
  • Numbers, strings, attributes – typically mapped to vectors
– $Y$: predictions (labels); assume a true $y$ for a given $x$
  • Binary, numeric, ordinal, structured…
– $L(f(x), y)$: loss function that quantifies error
  • 0-1 loss: $1[f(x) \neq y]$; L1: $|f(x) - y|$; L2: $(f(x) - y)^2$
– $F$ is the function class from which a predictor is learned
  • E.g., “linear model (feature weights)” or “1000 decision trees with 32 leaves”

• Supervised learning: training set $\{(x_i, y_i)\}_{i=1}^{n}$
• Regularization: prevents overfitting to the training set: $\min_{f \in F} \sum_i L(f(x_i), y_i) + \lambda R(f)$

ML Examples

• Email spam
– $x$: header/body words, client- and server-side statistics (sender, recipient, etc.)
– $y$: binary label
– $L$: cost-sensitive: false positives (good mail in Junk) vs. false negatives (Inbox spam)

• Click prediction (ads, search results, recommendations, …)
– $x$: attributes of context (e.g., query), item (e.g., ad text), user (e.g., location)
– $y$: click probability
– $L$:

Key Prediction Models

• Important function classes
– Linear predictors (logistic regression, linear SVMs, …)
  • $f$ is a hyperplane (feature weights): $f(x) = w \cdot x$
– Tree ensembles (boosting, random forests)
  • $f$ is a set of each tree’s splits and leaf outputs
– Non-linear parametric predictors (neural nets, Bayes nets)
  • $f$ is sets of parameters (weights for hidden units, distribution parameters)
– Non-parametric predictors (k-NN, kernel SVMs, Gaussian Processes)
  • $f$ is a “remembered” subset of training examples with corresponding parameters

Learning: Training Predictors

• Two key algorithm patterns
– Iteratively updating $f$ to reduce $L$
  • Gradient descent (stochastic, coordinate/sub-gradient, quasi-Newton, …)
  • Boosting: each subsequent ensemble member reduces error (functional GD)
  • Active-subset SVM training: iterative improvement over support vectors
– Averaging multiple models to reduce variance
  • Bagging/random forests: models learned on subsets of data/features
  • Parameter mixtures: explicitly average weights from different data subsets

Big Learning: Large Datasets.. and Beyond

• (1) Large training sets: many examples iff accuracy is improved
• (2) Large models: many features, ensembles, “deep” nets
• (3) Model selection: hyper-parameter tuning, statistical significance
• (4) Fast inference: structured prediction (e.g., speech)

• Fundamental differences across settings
– Learning vs. inference, input complexity vs. model complexity
– Dataflow/computation and bottlenecks are highly algorithm- and task-specific
– Rest of this talk: practical algorithm nuggets for (1), (2)

Dealing with Large Training Sets (I): SGD

• Online learning: Stochastic Gradient Descent
– “averaged perceptron”, “Pegasos”, etc.

• For each example $(x_t, y_t)$: $w_{t+1} = w_t - \eta_t \nabla_w L(f(x_t), y_t)$
• E.g., with hinge loss and $f(x) = w \cdot x$:
– $w_{t+1} = w_t + \eta_t y_t x_t$ if error, no update otherwise.

• Algorithm is fundamentally iterative: no “clean” parallelization
– Literature: mini-batches, averaging, async updates, …

• …but why? Algorithm runs at disk I/O speed!
– Parallelize I/O. For truly enormous datasets, average parameters/models.

• RCV1: 1M documents (~1GB): <12s on this laptop!
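The update rule on this slide can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the function names, the sparse-dict feature encoding, the fixed step size `eta`, and the toy data are all assumptions, and "error" is interpreted here as a violated hinge margin ($y \, w \cdot x < 1$).

```python
import random

def sgd_hinge(examples, dim, epochs=5, eta=0.1, seed=0):
    """SGD for a linear model with hinge loss.

    examples: list of (x, y), x a sparse dict {feature_index: value}, y in {-1, +1}.
    eta is a fixed step size (an assumption; schedules such as Pegasos' differ).
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            margin = y * sum(w[j] * v for j, v in x.items())
            if margin < 1.0:
                # Hinge loss is active: step along y * x (the negative subgradient).
                for j, v in x.items():
                    w[j] += eta * y * v
            # Otherwise the subgradient is zero and w is left unchanged.
    return w

def predict(w, x):
    return 1 if sum(w[j] * v for j, v in x.items()) >= 0 else -1

# Toy, linearly separable data (hypothetical).
train = [({0: 1.0, 1: 2.0}, 1), ({0: 2.0, 1: 1.5}, 1),
         ({0: -1.0, 1: -1.0}, -1), ({0: -2.0, 1: -0.5}, -1)]
w = sgd_hinge(train, dim=2)
```

Each update touches only the non-zero features of one example, which is why a single pass is cheap enough to be bounded by how fast examples can be read.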

Dealing with Large Training Sets (II): L-BFGS

• Regularized logistic regression

• L-BFGS: batch quasi-Newton method using quadratic approximation
– Update is $w_{t+1} = w_t - \eta_t H_t^{-1} \nabla L(w_t)$, where $H_t$ approximates the Hessian of $L$
– Limited-memory trick: keep a buffer of recent $(\Delta w, \Delta \nabla L)$ pairs to approximate $H_t^{-1}$

• Parallelizes well on multi-core:
– Each core takes a batch of examples and computes gradient

• Multi-node
– Poor fit for MapReduce: global weight update, $(\Delta w, \Delta \nabla L)$ history, many iterations
– Alternative: ADMM (first-order but better convergence rate than SGD’s)
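The "each core takes a batch and computes gradient" pattern works because the batch gradient is a sum over examples, so partial gradients can be computed independently and added. The sketch below shows only that dataflow for regularized logistic regression; the function names, the squared-norm regularizer `lam * ||w||^2`, and the toy data are assumptions. It uses threads for brevity, which in CPython will not give a real multi-core speed-up for this pure-Python loop.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def shard_gradient(w, shard):
    """Gradient of sum_i log(1 + exp(-y_i * w.x_i)) over one shard of examples."""
    g = [0.0] * len(w)
    for x, y in shard:
        z = y * sum(wj * xj for wj, xj in zip(w, x))
        coeff = -y / (1.0 + math.exp(z))
        for j, xj in enumerate(x):
            g[j] += coeff * xj
    return g

def parallel_gradient(w, shards, lam=0.0):
    """Sum per-shard gradients, then add the gradient of lam * ||w||^2."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: shard_gradient(w, s), shards))
    total = [sum(col) for col in zip(*partials)]
    return [t + 2.0 * lam * wj for t, wj in zip(total, w)]

# Toy data (hypothetical): dense feature lists with labels in {-1, +1}.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1), ([-1.0, 0.5], -1)]
w = [0.1, -0.2]
g_sharded = parallel_gradient(w, [data[:2], data[2:]])
g_full = shard_gradient(w, data)
```

The summed gradient would then be handed to a quasi-Newton optimizer; only the gradient computation is split across cores, while the weight update itself stays global.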

Dealing with Large Datasets (II): Trees

• Rule-based prediction is natural and powerful (non-linear)
– Play outside: if no rain and not too hot, or if snowing but not windy.

• Trees hierarchically encode rule-based prediction
– Nodes test features and split
– Leaves produce predictions
– Regression trees: numeric outputs

• Ensembles combine tree predictions

[Figure: example decision trees whose nodes test features (e.g., Wind < 25) with true/false branches and numeric leaf outputs (e.g., 0.1, 0.6, 0.2; 0.01, 0.7); the ensemble adds up the trees' predictions.]
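To make "nodes test features, leaves produce predictions, ensembles combine trees" concrete, here is a small sketch. The tuple encoding of a tree, the feature names, thresholds, and leaf values are all hypothetical, chosen loosely after the "play outside" rule; the combiner is a plain sum, one of several possible choices.

```python
# A tree is either a leaf (a number) or a tuple (feature, threshold, left, right):
# follow `left` when x[feature] < threshold, otherwise `right`.

def tree_predict(node, x):
    """Walk from the root to a leaf and return the leaf's numeric output."""
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = left if x[feature] < threshold else right
    return node

def ensemble_predict(trees, x):
    """Additive combiner: sum the outputs of all trees."""
    return sum(tree_predict(t, x) for t in trees)

# Hypothetical trees: "no rain and not too hot" scores high in tree_a.
tree_a = ("rain", 0.5, ("temp", 30.0, 0.6, 0.1), 0.01)
tree_b = ("wind", 25.0, 0.2, 0.0)

x = {"rain": 0.0, "temp": 22.0, "wind": 10.0}
score = ensemble_predict([tree_a, tree_b], x)
```

Evaluating one tree costs at most its depth in comparisons, and trees can be evaluated independently of each other.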

Tree Ensemble Zoo

• Different models can define different types of:
– Combiner function: voting vs. weighting
– Leaf prediction models: constant vs. regression
– Split conditions: single vs. multiple features

• Examples (small biased sample, some are not tree-specific)
– Boosting: AdaBoost, LogitBoost, GBM/MART, BrownBoost, Transform Regression
– Random Forests: Random Subspaces, Bagging, Additive Groves, BagBoo
– Beyond regression and binary classification: RankBoost, abc-mart, GBRank, LambdaMART, MatrixNet

Tree Ensembles Are Rightfully Popular

• State-of-the-art accuracy: web, vision, CRM, bio, …

• Efficient at prediction time
– Multithread evaluation of individual trees; optimize/short-circuit

• Principled: extensively studied in statistics and learning theory

• Practical
– Naturally handle mixed, missing, (un)transformed data
– Feature selection embedded in algorithm
– Well-understood parameter sweeps
– Scalable to extremely large datasets: rest of this section

Naturally Parallel Tree Ensembles

• No interaction when learning individual trees
– Bagging: each tree trained on a bootstrap sample of data
– Random forests: bootstrap plus subsample features at each split
– For large datasets, local data replaces bootstrap -> embarrassingly parallel

[Figure: bagging tree construction calls Train() on a separate data sample per tree; random forest tree construction calls Split() on a feature subsample at each node.]
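Because the trees never interact, each one can be trained as an independent job. The sketch below illustrates this with one-split regression stumps as a stand-in for full trees; all names, the stump learner, and the toy data are assumptions. Since a stump has a single split, subsampling features per tree here coincides with subsampling per split.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fit_stump(rows, features):
    """Best single-split regression stump (least squares) over a feature subset."""
    best = None
    for j in features:
        for t in sorted({x[j] for x, _ in rows}):
            left = [y for x, y in rows if x[j] < t]
            right = [y for x, y in rows if x[j] >= t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(y for _, y in rows) / len(rows)
        return (0, float("-inf"), mean, mean)
    return best[1:]

def train_one(job):
    rows, n_features, k, seed = job
    rng = random.Random(seed)
    sample = [rng.choice(rows) for _ in rows]      # bootstrap sample of the data
    features = rng.sample(range(n_features), k)    # random feature subset
    return fit_stump(sample, features)

def bagged_stumps(rows, n_features, n_trees=20, k=1):
    jobs = [(rows, n_features, k, seed) for seed in range(n_trees)]
    with ThreadPoolExecutor() as pool:             # jobs share nothing but the input
        return list(pool.map(train_one, jobs))

def predict(stumps, x):
    votes = [lm if x[j] < t else rm for j, t, lm, rm in stumps]
    return sum(votes) / len(votes)

# Toy data (hypothetical): feature 0 determines the label, feature 1 is noise.
rows = [((float(i), float(i % 3)), 1.0 if i >= 5 else 0.0) for i in range(10)]
stumps = bagged_stumps(rows, n_features=2)
```

On a cluster, the bootstrap step could be dropped and each machine could train on its local partition, as the slide suggests.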

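Boosting: Iterative Tree Construction

“Best off-the-shelf classifier in the world” – Breiman

• Numerically: gradient descent in function space
– Each subsequent tree approximates a step in the $-\nabla_f L$ direction
– Recompute target labels: $y_i' = -\partial L(f(x_i), y_i) / \partial f(x_i)$
– Logistic loss $\log(1 + e^{-y f(x)})$, $y \in \{-1, +1\}$: $y_i' = y_i / (1 + e^{y_i f(x_i)})$
– Squared loss $\frac{1}{2}(y - f(x))^2$: $y_i' = y_i - f(x_i)$

• Reweight examples for each subsequent tree to focus on errors

[Figure: a chain of Train() calls, each tree trained after the previous one.]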
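A minimal sketch of the functional-gradient view for squared loss, where the recomputed targets are simply the residuals. One-split stumps on a single numeric feature stand in for trees; the names, the `shrinkage` parameter, and the toy data are assumptions, and the input is assumed to contain at least two distinct feature values.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on a single numeric feature."""
    best = None
    for t in sorted(set(xs))[1:]:          # skip the minimum so both sides are non-empty
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def boost(xs, ys, n_trees=50, shrinkage=0.5):
    preds = [0.0] * len(xs)
    ensemble = []
    for _ in range(n_trees):
        # Squared loss: the negative gradient w.r.t. f(x_i) is the residual y_i - f(x_i).
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        lm, rm = shrinkage * lm, shrinkage * rm   # take a partial step
        ensemble.append((t, lm, rm))
        preds = [p + (lm if x < t else rm) for p, x in zip(preds, xs)]
    return ensemble

def predict(ensemble, x):
    return sum(lm if x < t else rm for t, lm, rm in ensemble)

# Toy data (hypothetical): a step function with three levels.
xs = [0, 1, 2, 3, 4, 5]
ys = [0.0, 0.0, 1.0, 1.0, 3.0, 3.0]
ensemble = boost(xs, ys)
initial_sse = sum(y ** 2 for y in ys)
final_sse = sum((y - predict(ensemble, x)) ** 2 for x, y in zip(xs, ys))
```

Each round depends on the predictions of all previous rounds, so the rounds cannot run concurrently; any parallelism has to come from inside `fit_stump`.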

Efficient Tree Construction

• Boosting is iterative: scaling up = parallelizing tree construction

• For every node: pick best feature to split
– For every feature: pick best split-point
  • For every potential split-point: compute gain
    – For every example in current node, add its gain contribution for given split

• Key efficiency: limiting + ordering the set of considered split points
– Continuous features: discretize into bins, splits = bin boundaries
– Allows computing split values in a single pass over data


Binned Split Evaluation

• Each feature’s range is split into bins. Per-bin statistics are aggregated in a single pass

• For each tree node, a two-stage procedure
(1) Pass through dataset aggregating node-feature-bin statistics
(2) Select split among all (feature, bin) options

[Figure: a row of bins for one feature; per-bin statistics are accumulated, summed into totals, and used to compute the split gain at each bin boundary.]
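The two-stage procedure can be sketched for one node and one feature as follows. The equal-width binning, the per-bin statistics (count and target sum), and the squared-loss gain $S_L^2/n_L + S_R^2/n_R - S^2/n$ are assumptions made for illustration, not necessarily the statistics or gain formula used in the lecture; values are assumed to lie in `[lo, hi]`.

```python
def best_binned_split(values, targets, n_bins, lo, hi):
    """Stage 1 fills per-bin (count, sum) accumulators in one data pass;
    stage 2 scores every bin boundary from running totals."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for v, y in zip(values, targets):              # stage 1: single pass over data
        b = min(int((v - lo) / width), n_bins - 1)
        counts[b] += 1
        sums[b] += y

    n, s = sum(counts), sum(sums)                  # node totals
    best_gain, best_boundary = 0.0, None
    n_left, s_left = 0, 0.0
    for b in range(n_bins - 1):                    # stage 2: scan bin boundaries
        n_left += counts[b]
        s_left += sums[b]
        n_right, s_right = n - n_left, s - s_left
        if n_left == 0 or n_right == 0:
            continue
        # Reduction in squared error from splitting here (assumed gain definition).
        gain = s_left ** 2 / n_left + s_right ** 2 / n_right - s ** 2 / n
        if gain > best_gain:
            best_gain, best_boundary = gain, lo + (b + 1) * width
    return best_boundary, best_gain

# Toy data (hypothetical): targets jump from 0 to 1 between 4 and 6.
values = [1, 2, 3, 4, 6, 7, 8, 9]
targets = [0, 0, 0, 0, 1, 1, 1, 1]
boundary, gain = best_binned_split(values, targets, n_bins=5, lo=0, hi=10)
```

Stage 2 costs time proportional to the number of bins rather than the number of examples, so the data pass in stage 1 dominates.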





Tree Construction Visualized

• Observation 1: a single pass is sufficient per tree level
• Observation 2: data pass can iterate by-instance or by-feature
– Supports horizontally or vertically partitioned data





Data-Distributed Tree Construction

• Master
1. Send workers current model and set of nodes to expand
2. Wait to receive local split histograms from workers
3. Aggregate local split histograms, select best split for every node

• Worker
2a. Pass through local data, aggregating split histograms
2b. Send completed local histograms to master
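The master/worker exchange can be simulated in a single process to show why it works: histograms are additive, so summing the workers' local histograms gives the same result as one pass over all the data. This sketch covers one node and one feature; the function names, the (count, target sum) statistics, and the squared-loss gain are assumptions.

```python
def local_histogram(shard, n_bins, lo, hi):
    """Worker step 2a: per-bin [count, target sum] over the local rows."""
    width = (hi - lo) / n_bins
    hist = [[0, 0.0] for _ in range(n_bins)]
    for v, y in shard:
        b = min(int((v - lo) / width), n_bins - 1)
        hist[b][0] += 1
        hist[b][1] += y
    return hist

def merge_histograms(hists):
    """Master step 3, part one: element-wise sum of the workers' histograms."""
    merged = [[0, 0.0] for _ in hists[0]]
    for hist in hists:
        for b, (c, s) in enumerate(hist):
            merged[b][0] += c
            merged[b][1] += s
    return merged

def select_split(hist):
    """Master step 3, part two: best bin boundary by squared-loss gain."""
    n = sum(c for c, _ in hist)
    s = sum(t for _, t in hist)
    best, n_l, s_l = (0.0, None), 0, 0.0
    for b, (c, t) in enumerate(hist[:-1]):
        n_l, s_l = n_l + c, s_l + t
        if 0 < n_l < n:
            gain = s_l ** 2 / n_l + (s - s_l) ** 2 / (n - n_l) - s ** 2 / n
            if gain > best[0]:
                best = (gain, b)
    return best  # (gain, index of the last bin sent to the left child)

# Three hypothetical workers, each holding a horizontal slice of the rows.
shards = [[(1, 0), (6, 1), (3, 0)], [(2, 0), (7, 1)], [(4, 0), (8, 1), (9, 1)]]
merged = merge_histograms([local_histogram(s, 5, 0, 10) for s in shards])
gain, last_left_bin = select_split(merged)
```

Communication per node is proportional to features times bins, independent of how many rows each worker holds.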

Feature-Distributed Tree Construction

• Workers maintain per-instance index of current residuals and previous splits

• Master
1. Request workers to expand a set of nodes
2. Wait to receive best per-feature splits from workers
3. Select best feature-split for every node
4. Request best splits’ workers to broadcast per-instance assignments and residuals

• Worker
2a. Pass through all instances for local features, aggregating split histograms for each node
2b. Select local features’ best splits for each node, send to master
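With vertically partitioned data each worker sees all instances but only some features, so it can pick its own best split locally and send just that candidate to the master. The single-process sketch below covers one node; it scans exact thresholds instead of histograms for brevity, and the function names, feature names, squared-loss gain, and toy data are all assumptions.

```python
def best_split_for_feature(column, residuals):
    """Best threshold for one locally held feature, by squared-loss gain."""
    n, s = len(column), sum(residuals)
    best = (0.0, None)
    for t in sorted(set(column))[1:]:      # skip the minimum so both sides are non-empty
        left = [r for v, r in zip(column, residuals) if v < t]
        s_l, n_l = sum(left), len(left)
        gain = s_l ** 2 / n_l + (s - s_l) ** 2 / (n - n_l) - s ** 2 / n
        if gain > best[0]:
            best = (gain, t)
    return best

def worker_best(local_columns, residuals):
    """Worker steps 2a/2b: return (gain, feature, threshold) for the best local split."""
    cands = [best_split_for_feature(col, residuals) + (name,)
             for name, col in local_columns.items()]
    gain, t, name = max(cands, key=lambda c: c[0])
    return gain, name, t

def master_select(worker_results):
    """Master step 3: best feature-split across all workers."""
    return max(worker_results, key=lambda r: r[0])

# Two hypothetical workers holding different columns for the same four instances.
residuals = [1.0, 1.0, -1.0, -1.0]
worker_a = {"age": [1, 2, 3, 4]}
worker_b = {"clicks": [5, 1, 5, 1], "len": [3, 3, 4, 3]}

winner = master_select([worker_best(worker_a, residuals),
                        worker_best(worker_b, residuals)])
# Master step 4: the worker owning the winning feature broadcasts the assignment.
assignments = [v < winner[2] for v in worker_a[winner[1]]]
```

Only the winning worker can compute which instances go left or right, which is why the protocol needs the extra broadcast step.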

Learning with Many Features

• How many is “many”? At least billions.

• Exhibit A: English n-grams
– Unigrams: 13 million
– Bigrams: 315 million
– Trigrams: 977 million
– Fourgrams: 1.3 billion
– Fivegrams: 1.2 billion

• Exhibit B: search ads, 3 months
– User IDs: hundreds of millions
– Listing IDs: hundreds of millions
– Queries: tens to hundreds of millions
– User x Listing x Query: billions

• Can we scale up linear learners? Yes, but there are limits:
– Retraining: ideally real-time, definitely not more than a couple hours
– Modularity: ideally fit in memory, definitely decompose elastically

Towards infinite features: Feature hashing

• Basic idea: elastic, medium-dimensional projection
• Classic low-d projections: storage, cost, updates hard
• Solution: mapping defined by a hashing function
+ Effortless elasticity, sparsity preserved
- Compression is random (not driven by error reduction)

[Weinberger et al. 2009]; trick first described in [Moody 1989].
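A minimal sketch of the hashing trick: each string feature is mapped straight to a bucket index, so no dictionary of feature names has to be stored or kept in sync. The function name, the use of MD5 as the hash, the bucket count, and the example tokens are assumptions; the sign bit follows the signed-hash idea, which makes colliding features tend to cancel rather than add.

```python
import hashlib

def hashed_features(tokens, n_buckets=2 ** 20):
    """Map string features to a sparse {bucket_index: value} vector."""
    vec = {}
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "little")     # 64 bits of the hash
        index = h % n_buckets                        # bucket in the projected space
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0   # top bit chooses the sign
        vec[index] = vec.get(index, 0.0) + sign
    return vec

# Hypothetical categorical and cross features for one example.
v = hashed_features(["user=42", "query=shoes", "user=42|query=shoes"],
                    n_buckets=1024)
```

Changing `n_buckets` resizes the model without touching the feature pipeline, which is the "elasticity" noted above; the cost is that unrelated features may share a weight.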

Scaling up ML: Concluding Thoughts

• Learner parallelization is highly algorithm-dependent

• High-level parallelization (MapReduce)
– Less work but there is a convenience penalty
– Limits on communication and control can be algorithm-killing

• Low-level parallelization (multicore, GPUs, …)
– Harder to implement/debug
– Successes are architecture- and algorithm-specific: e.g., GPUs are great if matrix multiplication is the core operation (NNs)
– Typical trade-off: memory/IO latency/contention vs. update complexity
