distributed machine learning and big dataacmsc/wbda2015/slides/sb/...machine learning and big data...

Distributed Machine Learning and Big Data

Sourangshu Bhattacharya

Dept. of Computer Science and Engineering,IIT Kharagpur.

http://cse.iitkgp.ac.in/~sourangshu/

August 21, 2015

Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 1 / 76

Outline

1 Machine Learning and Big DataSupport Vector MachinesStochastic Sub-gradient descent

2 Distributed OptimizationADMMConvergenceDistributed Loss MinimizationResultsDevelopment of ADMM

3 Applications and extensionsWeighted Parameter AveragingFully-distributed SVM

Machine Learning and Big Data

What is Big Data ?

6 Billion web queries per day.10 Billion display advertisements per day.30 Billion text ads per day.150 Million credit card transactions per day.100 Billion emails per day.

Machine Learning on Big Data

Classification - Spam / No Spam - 100B emails.Multi-label classification - image tagging - 14M images 10K tags.Regression - CTR estimation - 10B ad views.Ranking - web search - 6B queries.Recommendation - online shopping - 1.7B views in the US.

Classification example

Email spam classification.Features (ui)): Vector of counts of allwords.No. of Features (d): Words in vocabulary(∼ 100,000).No. of non-zero features: 100.No. of emails per day: 100 M.Size of training set using 30 days data: 6TB (assuming 20 B per data)Time taken to read the data once: 41.67hrs (at 20 MB per second)Solution: use multiple computers.

Big Data Paradigm

3V’s - Volume, Variety, Velocity.Distributed system.Chance of failure:Computers 1 10 100Chance of a failure in an hour 0.01 0.09 0.63

Communication efficiency - Data locality.Many systems: Hadoop, Spark, Graphlab, etc.Goal: Implement Machine Learning algorithms on Big datasystems.

Binary Classification Problem

A set of labeled datapoints (S) = (ui , vi), i = 1, . . . ,n, ui ∈ Rd

and vi ∈ +1,−1Linear Predictor function: v = sign(xT u)

Error function: E =∑n

i=1 1(vixT ui ≤ 0)

Logistic Regression

Probability of v is given by:

P(v |u,x) = σ(vxT u) =1

1 + e−vxT u

Learning problem is:Given dataset S, estimate x.Maximizing the regularized log likelihood:

x∗ = argminx

n∑i=1

log(1 + e−vi xT ui ) +λ

Convex Function

f is a Convex function: f (tx1 + (1− t)x2) ≤ tf (x1) + (1− t)f (x2)

Objective function for logistic regression is convex.

Convex Optimization

Convex optimization problem

minimizex f (x)

subject to: gi(x) ≤ 0, ∀i = 1, . . . , k

where:f , gi are convex functions.For convex optimization problems, local optima are also globaloptima.

Optimization Algorithm: Gradient Descent

Machine Learning and Big Data Support Vector Machines

Classification Problem

Separating hyperplane: xT u = 0Parallel hyperplanes (developing margin): xT u = ±1Margin (perpendicular distance between parallel hyperplanes): 2

‖x‖

Correct classification of training datapoints: vixT ui ≥ 1, ∀iAllowing error (slack), ξi : vixT ui ≥ 1− ξi , ∀iMax-margin formulation:

minx,ξ

12‖x‖2 + C

n∑i=1

subject to: vixT ui ≥ 1− ξi , ξi ≥ 0 ∀i = 1, . . . ,n

SVM: dual

Lagrangian:

xT x + Cn∑

ξi +n∑

αi(1− ξi − vixT ui) +n∑

µiξi

Dual problem: (x∗, α∗, µ∗) = maxα,µ minx L(x, α, µ)

For strictly convex problem, primal and dual solutions are same(Strong duality).KKT conditions:

x =n∑

αiviui

C = αi + µi

SVM: dual

The dual problem:

n∑i=1

αi −12

n,n∑i=1,j=1

αiαjvivjuTi uj

subject to: 0 ≤ αi ≤ C, ∀i

The dual is a quadratic programming problem in n variables.Can be solved even if kernel function, k(ui ,uj) = uT

i uj are given.Dimension agnostic.Many efficient algorithms exist for solving it, e.g. SMO (Platt99).Worst case complexity is O(n3), usually O(n2).

A more compact form: minx∑n

i=1 max(0,1− vixT ui) + λ‖x‖22Or: minx

∑ni=1 l(x,ui , vi) + λΩ(x)

Multi-class classification

There are m classes. vi ∈ 1, . . . ,mMost popular scheme: vi = argmaxv∈1,...,mx

Given example (ui , vi), xTvi

ui ≥ xTj ui ∀j ∈ 1, . . . ,m

Using a margin of at least 1, lossl(ui , vi) = maxj∈1,...,vi−1,vi +1,...,m0,1− (xT

viui − xT

j ui)Given dataset D, solve the problem

minx1,...,xm

∑i∈D

l(ui , vi) + λ

m∑j=1

‖xj‖2

This can be extended to many settings e.g. sequence labeling,learning to rank, etc.

General Learning Problems

Support Vector Machines:

n∑i=1

max0,1− vixT ui+ λ‖x‖22

Logistic Regression:

n∑i=1

log(1 + exp(−vixT ui)) + λ‖x‖22

General form:

n∑i=1

l(x,ui , vi) + λΩ(x)

l : loss function, Ω: regularizer.

Machine Learning and Big Data Stochastic Sub-gradient descent

Sub-gradient Descent

Sub-gradient for a non-differentiable convex function f at a pointx0 is a vector v such that:

f (x)− f (x0) ≥ vT (x− x0)

Randomly initialize x0

Iterate xk = xk−1 − tkg(xk−1), k = 1,2,3, . . . . Where g is asub-gradient of f .tk = 1√

xbest (k) = mini=1,...,k f (xk )

Stochastic Sub-gradient Descent

Convergence rate is: O( 1√k

Each iteration takes O(n) time.Reduce time by calculating the gradient using a subset ofexamples - stochastic subgradient.Inherently serial.Typical O( 1

ε2) behaviour.

Stochastic Sub-gradient Descent

Distributed Optimization

Distributed gradient descent

Divide the dataset into m parts. Each part is processed on onecomputer. Total m.There is one central computer.All computers can communicate with the central computer vianetwork.Define loss(x) =

∑mj=1

∑i∈Cj

li(x) +λΩ(x), where li(x) = l(x,ui , vi)

The gradient (in case of differentiable loss):

∇loss(x) =m∑

∇(∑i∈Cj

li(x)) + λΩ(x)

Compute ∇lj(x) =∑

i∈Cj∇li(x) on the j th computer. Communicate

to central computer.

Distributed Optimization

Distributed gradient descent

Compute ∇loss(x) =∑m

j=1∇lj(x) + Ω(x) at the central computer.

The gradient descent update: xk+1 = xk − α∇loss(x).α chosen by a line search algorithm (distributed).For non-differentiable loss functions, we can use distributedsub-gradient descent algorithm.

Slow for most practical problems.For achieving ε tolerance,

Gradient descent (Logistic regression): O(1/ε) iterations.Sub-gradient descent (Stochastic Sub-gradient descent): O( 1

ε2 )iterations.

Distributed Optimization ADMM

Alternating Direction Method of Multipliers

Problem

minimizex ,z f (x) + g(z)

subject to: Ax + Bz = c

AlgorithmIterate till convergence:

xk+1 = argminx f (x) + ρ2‖Ax + Bzk − c + uk‖22

zk+1 = argminzg(z) + ρ2‖Axk+1 + Bz − c + uk‖22

uk+1 = uk + Axk+1 + Bzk+1 − c

Stopping criteria

Stop when primal and dual residuals small:

‖r k‖2 ≤ εpri and ‖sk‖2 ≤ εdual

Hence, ‖r k‖2 → 0 and ‖sk‖2 → 0 as k →∞

Observations

x- update requires solving an optimization problem

f (x) +ρ

2‖Ax − v‖22

with, v = Bzk − c + uk

Similarly for z-update.Sometimes has a closed form.ADMM is a meta optimization algorithm.

Distributed Optimization Convergence

Convergence of ADMM

Assumption 1: Functions f : Rn → R and g : Rm → R are closed,proper and convex.

Same as assuming epif = (x , t) ∈ Rn × R|f (x) ≤ t is closed andconvex.

Assumption 2: The unaugmented Lagrangian L0(x , y , z) has asaddle point (x∗, z∗, y∗):

L0(x∗, z∗, y) ≤ L0(x∗, z∗, y∗) ≤ L0(x , z, y∗)

Distributed Optimization Convergence

Convergence of ADMM

Primal residual: r = Ax + Bz − cOptimal objective: p∗ = infx ,zf (x) + g(z)|Ax + Bz = cConvergence results:

Primal residual convergence: r k → 0 as k →∞Dual residual convergence: sk → 0 as k →∞Objective convergence: f (x) + g(z)→ p∗ as k →∞Dual variable convergence: yk → y∗ as k →∞

Distributed Optimization Distributed Loss Minimization

Decomposition

If f is separable:

f (x) = f1(x1) + · · ·+ fN(xN), x = (x1, . . . , xN)

A is conformably block separable; i.e. AT A is block diagonal.Then, x-update splits into N parallel updates of xi

Consensus Optimization

Problem:

f (x) =N∑

ADMM form:

minxi ,z

N∑i=1

fi(xi)

s.t. xi − z = 0, i = 1, . . . ,N

Augmented lagrangian:

Lρ(x1, . . . , xN , z, y) =N∑

(fi(xi) + yTi (xi − z) +

2‖xi − z‖22)

ADMM algorithm:

xk+1i = argminxi

(fi(xi) + ykTi (xi − zk ) +

2‖xi − zk‖22)

zk+1 =1N

N∑i=1

(xk+1i +

yk+1i = yk

i + ρ(xk+1i − zk+1)

Final solution is zk .

z-update can be written as:

zk+1 = xk+1 +1ρ

Averaging the y -updates:

yk+1 = yk + ρ(xk+1 − zk+1)

Substituting first into second: yk+1 = 0. Hence zk = xk .Revised algorithm:

xk+1i = argminxi

(fi(xi) + ykTi (xi − xk ) +

2‖xi − xk‖22)

yk+1i = yk

i + ρ(xk+1i − xk+1)

Final solution is zk .Sourangshu Bhattacharya (IITKGP) Distributed ML August 21, 2015 34 / 76

Distributed Loss minimization

Problem:

l(Ax − b) + r(x)

Partition A and b by rows:

,where, Ai ∈ Rmi×m and bi ∈ Rmi

ADMM formulation:

minxi ,z

N∑i=1

li(Aixi − bi) + r(z)

s.t.: xi − z = 0, i = 1, . . . ,N

Distributed Loss minimization

ADMM solution:

xk+1i = argminxi

(li(Aixi − bi) +ρ

2‖xi − zk + uk

i ‖22)

zk+1 = argminz(r(z) +Nρ2‖z − xk+1 + uk‖22)

uk+1i = uk

i + xk+1i − zk+1

Distributed Optimization Results

ADMM Results

Logistic Regression using the loss minimization formulation (Boyd etal.):

n∑i=1

ADMM Results

Logistic Regression using the loss minimization formulation (Boyd etal.):

n∑i=1

distributed machine learning and big dataacmsc/wbda2015/slides/sb/...machine learning and big data...

Documents

big data, big challenges, big - oracle

big data, big commerce, big challenge

big data ในภาครัฐ -...

big data + big ideas = big impact

cloudconf 2013 outubro / 2013 big compute, big net & big...

distributed deep learning framework over...

big o, big theta, big omega

big containers, big orchestration, big data · big...

big bang - tossb.com bang.pdf · big bang 19 big bang 1...

gamechangers: big data. big ideas. big impact

from big text to big knowledge - indian statistical...

big data big reward

big world - big challenges: can a big network help?

big eight big six big four

“x” companies – as “big oil”, “big pharma”,...

big bucks, big surprises, big questions: northeast ohio's

· for executive: box big data ussuiu lla:ansnnns1ðxnu...

big big apple

big data learning in prac - ku leuven kulak · big data:...

met big bezig? data and analytics presentati… · 2...