Large Scale Learning
Héctor Corrada Bravo
University of Maryland, College Park, USA. CMSC 643: 2018-11-27
Large-scale Learning
Analyses we have done in class are for in-memory data:
- Datasets can be loaded into the memory of a single computing node.
- Database systems can execute SQL queries, which can be used for efficient learning of some models (e.g., decision trees) over data on disk.
- Operations are usually performed by a single computing node.
Large-scale Learning
In the 1990s, database systems that operate over multiple computing nodes became available.
- These were the basis of the first generation of large data warehousing.
- In the last decade, systems that manipulate data over multiple nodes have become standard.
Large-scale Learning
Basic observation: for very large datasets, many of the operations for aggregation and summarization, which also form the basis of many learning methods, can be parallelized.
Large-scale Learning
For example:
- partition observations and perform a transformation on each partition as a parallel process
- partition variables and perform a transformation on each variable as a parallel process
- for summarization (group_by and summarize), partition observations based on the group_by expression, then perform summarize on each partition
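As a single-machine sketch of the summarization pattern above (hypothetical rows; a thread pool stands in for separate computing nodes):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def summarize_partition(partition):
    """summarize: per-group sum within one partition."""
    sums = defaultdict(float)
    for key, value in partition:
        sums[key] += value
    return dict(sums)

def grouped_sum(rows, n_partitions=2):
    """Partition observations by the group_by key, summarize each
    partition in parallel, then merge the partial results."""
    # Route each row by its group key, so every group lands
    # entirely in one partition.
    partitions = [[] for _ in range(n_partitions)]
    for key, value in rows:
        partitions[hash(key) % n_partitions].append((key, value))
    # Summarize each partition as a parallel process.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(summarize_partition, partitions)
    # Groups are disjoint across partitions, so merging is a union.
    merged = {}
    for part in partials:
        merged.update(part)
    return merged
```

Because each group lives entirely in one partition, the merge step needs no further aggregation.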
Large-scale Learning
Efficiency of implementation of this type of parallelism depends on the underlying architecture:
shared memory vs. shared storage vs. shared nothing.
For massive datasets, shared nothing is usually preferred since fault tolerance is perhaps the most important consideration.
Map-reduce
MapReduce is an implementation idea for a shared-nothing architecture. It is based on:
- distributed storage
- data proximity (perform operations on data that is physically close)
- fault tolerance
Map-reduce
The basic computation paradigm is based on two operations:
- map: decide which parallel process (node) should operate on each observation
- reduce: perform an operation on a subset of observations in parallel
Map-reduce
The fundamental operations that we have learned very well in this class are nicely represented in this framework: the group_by clause corresponds to map, and the summarize function corresponds to reduce.
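A minimal single-machine sketch of that correspondence, computing a per-group mean (the record layout and function names here are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """map: emit a (group_by key, value) pair per observation;
    the key routes the observation to a reducer."""
    for group_key, value in records:
        yield (group_key, value)

def reduce_phase(pairs):
    """reduce: summarize (here, average) all values sharing a key."""
    pairs = sorted(pairs, key=itemgetter(0))   # stand-in for shuffle/sort
    result = {}
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [v for _, v in group]
        result[key] = sum(values) / len(values)
    return result
```

In a real MapReduce system the sort/shuffle step, not a local `sorted` call, is what routes each key's pairs to the same reducer.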
Map-reduce
MapReduce is most efficient when computations are organized in an acyclic graph:
data is moved from stable storage to a computing process, and the result is moved back to stable storage, without much concern for operation ordering.
This architecture provides runtime benefits due to flexible resource allocation and strong failure recovery.
However, existing implementations of MapReduce systems do not support interactive use, or workflows that are hard to represent as acyclic graphs.
Spark
A recent system based on the general MapReduce framework:
- designed for ultra-fast data analysis
- provides efficient support for interactive analysis (the kind we do in Jupyter)
- designed to support the iterative workflows needed by many Machine Learning algorithms
Spark
The basic data abstraction in Spark is the resilient distributed dataset (RDD).
Applications keep working sets of data in memory, which supports iterative algorithms and interactive workflows.
Spark
RDDs are:
1. immutable and partitioned collections of objects,
2. created by parallel transformations on data in stable storage (e.g., map, filter, group_by, join, ...),
3. cached for efficient reuse,
4. operated upon by actions defined on RDDs (count, reduce, collect, save, ...)
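In PySpark this reads as, e.g., `sc.parallelize(data).map(f).filter(g).collect()`. As a toy single-machine stand-in for the abstraction (the `LocalRDD` class is invented here for illustration; it mimics immutability, lazy transformations, and lineage replay, with no cluster behind it):

```python
class LocalRDD:
    """Toy RDD: an immutable record of lazy transformations
    over a collection (single-machine sketch, not Spark)."""

    def __init__(self, data, ops=()):
        self._data = data   # stand-in for data in stable storage
        self._ops = ops     # lineage: transformations to replay

    def map(self, f):
        # Transformations return a *new* RDD; nothing runs yet.
        return LocalRDD(self._data, self._ops + (("map", f),))

    def filter(self, p):
        return LocalRDD(self._data, self._ops + (("filter", p),))

    def _iterate(self):
        # Replay the lineage over the base data.
        items = iter(self._data)
        for kind, f in self._ops:
            items = map(f, items) if kind == "map" else filter(f, items)
        return items

    def collect(self):
        # Actions trigger actual computation.
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())
```

Because the lineage (base data plus recorded transformations) fully determines the contents, a lost partition can be recomputed rather than restored from a replica.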
Spark
Fault tolerance: RDDs maintain lineage, so partitions can be reconstructed upon failure.
Spark
The components of a Spark workflow:
Transformations: define new RDDs
https://spark.apache.org/docs/latest/programmingguide.html#transformations
Actions: return results to the driver program
https://spark.apache.org/docs/latest/programmingguide.html#actions
Spark
Spark was designed first for Java, with an interactive shell based on Scala. It has strong support in Python and increasing support in R (SparkR).
Spark programming guide: https://spark.apache.org/docs/latest/programmingguide.html
More info on the Python API: https://spark.apache.org/docs/0.9.1/pythonprogrammingguide.html
Stochastic Gradient Descent
Other learning methods we have seen, like regression and SVMs (or even PCA), are optimization problems.
We can design gradient-descent-based optimization algorithms that process data efficiently.
We will use linear regression as a case study of how this insight works.
Case Study
Let's use linear regression with one predictor and no intercept as a case study.
Given: Training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, with continuous response $y_i$ and single predictor $x_i$ for the $i$-th observation.
Do: Estimate parameter $w$ in the model $y = wx$ to solve

$$\min_w L(w) = \frac{1}{2} \sum_{i=1}^{n} (y_i - w x_i)^2$$
Case Study
Suppose we want to fit this model to the following (simulated) data:
Case Study
Our goal is then to find the value of $w$ that minimizes mean squared error. This corresponds to finding one of these many possible lines.
Case Study
Each of which has a specific error for this dataset:
Case Study
1. As we saw before in class, loss is minimized when the derivative of the loss function is 0.
2. And the derivative of the loss (with respect to $w$) at a given estimate of $w$ suggests new values of $w$ with smaller loss!
Case Study
Let's take a look at the derivative:

$$\frac{\partial}{\partial w} L(w) = \frac{\partial}{\partial w}\, \frac{1}{2} \sum_{i=1}^{n} (y_i - w x_i)^2 = \sum_{i=1}^{n} (y_i - w x_i)(-x_i)$$
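A quick finite-difference check of this derivative (a sketch; the data here are simulated with hypothetical values, not the dataset from the slides):

```python
import random

def loss(w, x, y):
    """L(w) = 1/2 * sum_i (y_i - w*x_i)^2"""
    return 0.5 * sum((yi - w * xi) ** 2 for xi, yi in zip(x, y))

def dloss_dw(w, x, y):
    """Analytic derivative: sum_i (y_i - w*x_i) * (-x_i)."""
    return sum((yi - w * xi) * (-xi) for xi, yi in zip(x, y))

# Simulated data from y = 2x + noise (illustrative).
random.seed(0)
x = [random.uniform(-1, 1) for _ in range(100)]
y = [2.0 * xi + random.gauss(0, 0.1) for xi in x]

# A central finite difference should match the analytic derivative.
w, eps = 0.5, 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
analytic = dloss_dw(w, x, y)
```

Since the loss is quadratic in $w$, the central difference agrees with the analytic derivative up to floating-point error.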
Gradient Descent
This is what motivates the gradient descent algorithm:
1. Initialize $w = \mathrm{normal}(0, 1)$
2. Repeat until convergence:
   - Set $w = w + \eta \sum_{i=1}^{n} (y_i - f(x_i)) x_i$
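The two steps above can be sketched directly in code (one-predictor case; the step size `eta` and fixed iteration count are illustrative choices standing in for "repeat until convergence"):

```python
import random

def batch_gradient_descent(x, y, eta=0.01, n_steps=200):
    """Fit y = w*x by batch gradient descent (sketch)."""
    random.seed(1)
    w = random.gauss(0, 1)              # 1. initialize w ~ normal(0, 1)
    for _ in range(n_steps):            # 2. repeat "until convergence"
        step = sum((yi - w * xi) * xi for xi, yi in zip(x, y))
        w = w + eta * step              # move opposite the derivative
    return w
```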
Gradient Descent
The basic idea is to move the current estimate of $w$ in the direction that minimizes loss the fastest.
Gradient Descent
Let's run GD and track what it does:
Gradient Descent
"Batch" gradient descent: take a step (update $w$) by calculating the derivative with respect to all $n$ observations in our dataset:

$$w = w + \eta \sum_{i=1}^{n} (y_i - f(x_i, w)) x_i$$

where $f(x_i) = w x_i$.
Gradient Descent
For multiple predictors (e.g., adding an intercept), this generalizes to the gradient:

$$w = w + \eta \sum_{i=1}^{n} (y_i - f(x_i, w)) x_i$$

where $f(x_i, w) = w_0 + w_1 x_{i1} + \cdots + w_p x_{ip}$.
Gradient Descent
Gradient descent falls within a family of optimization methods called first-order methods (first-order means they use derivatives only). These methods have properties amenable to use with very large datasets:
1. Inexpensive updates
2. "Stochastic" version can converge with few sweeps of the data
3. "Stochastic" version easily extends to streams
4. Easily parallelizable
Drawback: can take many steps before converging.
Stochastic Gradient Descent
Key idea: update parameters using the update equation one observation at a time:
1. Initialize $w = \mathrm{normal}(0, \sqrt{p})$
2. Repeat until convergence:
   - For $i = 1$ to $n$: Set $w = w + \eta (y_i - f(x_i, w)) x_i$
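The same one-predictor sketch, now updating $w$ after every single observation instead of after a full pass (again, `eta` and the epoch count are illustrative):

```python
import random

def sgd(x, y, eta=0.05, n_epochs=50):
    """Fit y = w*x by stochastic gradient descent (sketch)."""
    random.seed(2)
    w = random.gauss(0, 1)
    for _ in range(n_epochs):           # repeat "until convergence"
        for xi, yi in zip(x, y):        # for i = 1 to n
            w = w + eta * (yi - w * xi) * xi   # update from one point
    return w
```

Each update costs O(1) in the number of observations, which is what makes SGD attractive at scale.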
Stochastic Gradient Descent
Let's run this and see what it does:
Stochastic Gradient Descent
Why does SGD make sense?
For many problems we are minimizing a cost function of the type

$$\arg\min_f \frac{1}{n} \sum_i L(y_i, f_i) + \lambda R(f)$$

which in general has gradient

$$\frac{1}{n} \sum_i \nabla_f L(y_i, f_i) + \lambda \nabla_f R(f)$$
Stochastic Gradient Descent
In the gradient

$$\frac{1}{n} \sum_i \nabla_f L(y_i, f_i) + \lambda \nabla_f R(f)$$

the first term looks like an empirical estimate (average) of the gradient at $f_i$.
SGD then uses updates provided by a different estimate of the gradient, based on a single point:
- cheaper
- potentially unstable
Stochastic Gradient Descent
In practice:
- Mini-batches: use ~100 or so examples at a time to estimate gradients
- Shuffle the data order every pass (epoch)
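Both practical refinements can be folded into the earlier sketch (batch size and other hyperparameters here are illustrative, smaller than the ~100 suggested above only to keep the example tiny):

```python
import random

def minibatch_sgd(x, y, eta=0.05, batch_size=4, n_epochs=50):
    """Mini-batch SGD for y = w*x (sketch): shuffle each epoch,
    then update w from each small batch's summed gradient."""
    random.seed(3)
    w = random.gauss(0, 1)
    data = list(zip(x, y))
    for _ in range(n_epochs):
        random.shuffle(data)            # reshuffle every pass (epoch)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            step = sum((yi - w * xi) * xi for xi, yi in batch)
            w = w + eta * step          # one update per mini-batch
    return w
```

Mini-batches trade a little of SGD's cheapness for a less noisy gradient estimate.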
Stochastic Gradient Descent
SGD easily adapts to data streams, where we receive observations one at a time and assume they are not stored.
This setting falls in the general category of online learning.
Online learning is extremely useful in settings with massive datasets.
Stochastic Gradient Descent
Parallelizing gradient descent
Gradient descent algorithms are easily parallelizable:
- Split observations across computing units.
- For each step, compute the partial sum for each partition (map), then compute the final update (reduce):

$$w = w + \eta \sum_{\text{partition } q} \sum_{i \in q} (y_i - f(x_i, w)) x_i$$
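One such parallel step can be sketched as follows (one-predictor case; a thread pool stands in for separate computing units, and the function names are invented here):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(w, partition):
    """map: one worker's partial sum over its partition."""
    return sum((yi - w * xi) * xi for xi, yi in partition)

def parallel_gd_step(w, partitions, eta=0.01):
    """reduce: combine the per-partition partial sums into one
    gradient step."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(partial_gradient,
                            [w] * len(partitions), partitions)
    return w + eta * sum(partials)
```

Because the update is a sum over observations, splitting the sum across partitions changes nothing about the result, only where the arithmetic happens.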
Stochastic Gradient Descent
This observation has resulted in the implementation of systems for large-scale learning:
1. Vowpal Wabbit: implements a general framework of (sparse) stochastic gradient descent for many optimization problems
2. Spark MLlib: implements many learning algorithms using the Spark framework we saw previously
Sparse Regularization
For many ML algorithms, prediction-time efficiency is determined by the number of predictors used in the model.
Reducing the number of predictors can yield huge gains in efficiency in deployment.
The amount of memory used to make predictions is also typically governed by the number of features. (Note: this is not true of kernel methods like support vector machines, in which the dominant cost is the number of support vectors.)
Sparse Regularization
This is the idea behind sparse models, and in particular sparse regularizers.
A disadvantage of optimizing problems of the form

$$\sum_i L(y_i, w) + \lambda \|w\|^2$$

is that they tend to never produce weights that are exactly zero.
Sparse Regularization
Instead, minimize problems of the form

$$\sum_i L(y_i, w) + \lambda \|w\|_1$$

where $\|w\|_1 = \sum_j |w_j|$.
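To see why the $\ell_1$ penalty yields exact zeros while the squared penalty does not, compare the one-dimensional minimizers in closed form (a sketch added here, not from the slides; `soft_threshold` and `ridge_1d` are illustrative names):

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5*(z - w)**2 + lam*abs(w): shrinks z toward
    zero and snaps it to exactly 0 whenever abs(z) <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ridge_1d(z, lam):
    """Minimizer of 0.5*(z - w)**2 + lam*w**2: shrinks z but never
    reaches exactly zero for z != 0."""
    return z / (1.0 + 2.0 * lam)
```

The flat region of `soft_threshold` around zero is exactly where sparsity comes from: small coefficients are driven to 0, not merely near it.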
Sparse Regularization
This is a convex optimization problem.
We can use standard subgradient methods.
See CIML for further details.
Feature Hashing
For datasets with a large number of features/attributes, an idea based on hashing can also help.
Suppose we are building a model over $P$ features ($P$ very large). We use hashing to reduce the number of features to a smaller number $p$.
For each observation $x \in \mathbb{R}^P$, we transform it into an observation $\tilde{x} \in \mathbb{R}^p$.
We will use a hash function $h : \{1, \ldots, P\} \to \{1, \ldots, p\}$.
Feature Hashing
1. Initialize $\tilde{x} = \langle 0, 0, \ldots, 0 \rangle$
2. For each $j \in 1, \ldots, P$:
   - Hash $j$ to position $k = h(j)$
   - Update $\tilde{x}_k \leftarrow \tilde{x}_k + x_j$
3. Return $\tilde{x}$

We can think of this as a feature mapping

$$\phi(x)_k = \sum_{j \,\mid\, j \in h^{-1}(k)} x_j$$
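The procedure above is a few lines of code (a sketch; `zlib.crc32` of the feature index stands in for the hash function $h$):

```python
import zlib

def hash_features(x, p):
    """Feature hashing: fold a length-P vector x into p buckets,
    adding x_j into bucket k = h(j)."""
    xt = [0.0] * p                              # x~ = <0, 0, ..., 0>
    for j, xj in enumerate(x):
        k = zlib.crc32(str(j).encode()) % p     # k = h(j)
        xt[k] += xj                             # x~_k <- x~_k + x_j
    return xt
```

Note that hashing only redistributes mass: every $x_j$ lands in exactly one bucket, so the sum of the entries is preserved.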
Feature Hashing
To see what this does, we can look at the inner product between observations in the smaller feature space:

$$
\begin{aligned}
\phi(x)'\phi(z) &= \sum_k \Bigl[ \sum_{j \,\mid\, j \in h^{-1}(k)} x_j \Bigr] \Bigl[ \sum_{j' \,\mid\, j' \in h^{-1}(k)} z_{j'} \Bigr] \\
&= \sum_k \sum_{j, j' \,\mid\, j, j' \in h^{-1}(k)} x_j z_{j'} \\
&= \sum_j \sum_{j' \,\mid\, j' \in h^{-1}(h(j))} x_j z_{j'} \\
&= \sum_j x_j z_j + \sum_j \sum_{j' \neq j \,\mid\, j' \in h^{-1}(h(j))} x_j z_{j'} = x'z + \cdots
\end{aligned}
$$
Feature Hashing
So, we get the inner product in the original large dimension plus an extra quadratic term:

$$\phi(x)'\phi(z) = x'z + \sum_j \sum_{j' \neq j \,\mid\, j' \in h^{-1}(h(j))} x_j z_{j'}$$

- We might get lucky and get a useful interaction between two features.
- Nonetheless, the size of this sum is very small due to a property of hash functions: the expected value of the product is $\approx 0$.
Large-Scale Learning
- Database operations for out-of-memory datasets
- Parallelization on shared-nothing architectures (map-reduce)
- Spark as an MR framework also supporting iterative procedures (optimization)
- Stochastic gradient descent (many low-cost steps, easy to parallelize)
- Sparse models (build models with few features)
- Feature hashing (build models over a smaller number of features, retain inner products)