Distributed Machine Learning: A Brief Overview
Dan Alistarh, IST Austria


Page 1

Distributed Machine Learning: A Brief Overview

Dan Alistarh, IST Austria

Page 2

Background: The Machine Learning "Cambrian Explosion"

Key factors:
1. Large datasets: millions of labelled images, thousands of hours of speech.
2. Improved models and algorithms: deep neural networks with hundreds of layers and millions of parameters.
3. Efficient computation for machine learning: computational power for ML increased by ~100x since 2010 (Maxwell line to Volta); gains are almost stagnant in the latest generations (GPU: <1.8x, CPU: <1.3x); and computation times are extremely large anyway (days to weeks to months).

Go-to solution: distribute machine learning applications to multiple processors and nodes.

Page 3

The Problem

CSCS: Europe's top supercomputer (world #3)
- 4500+ GPU nodes, state-of-the-art interconnect

Task:
- Image classification (ResNet-152 on ImageNet)
- Single-node time (TensorFlow): 19 days
- 1024 nodes: 25 minutes (in theory)

Page 4

The Problem (continued)

[Chart: time to train the model (in days) versus number of GPU nodes (2, 4, 8, 16, 32, 64), with each bar split into computation and communication; the labels include 9.6, 5, 3.2, 3.1, 2.5, and 2.4 days.]

Page 5

The Problem (continued)

[Same chart, annotated: at the larger node counts, roughly 10% of the time is computation and 90% is communication and synchronization.]

Efficient distribution is still a non-trivial challenge for machine learning applications.

Page 6

Part 1: Basics

Page 7

Machine Learning in 1 Slide

A task (e.g., classification), data, and a model $x$ (e.g., a neural network or a linear model).

Objective: $\arg\min_{x} f(x)$, where
$$f(x) = \sum_{i=1}^{m} \mathrm{loss}(x, e_i)$$
over the data examples $e_i$, and the loss encodes a notion of "quality," e.g. squared distance.

Solved via an optimization procedure, e.g. stochastic gradient descent (SGD).

Page 8

Distributed Machine Learning in 1 Slide

Node 1 holds Dataset Partition 1 and a copy of the model $x$; Node 2 holds Dataset Partition 2 and its own copy of $x$. The key costs of the interaction are communication complexity and the degree of synchrony.

Objective: $\arg\min_{x} f(x) = f_1(x) + f_2(x)$, where
$$f_1(x) = \sum_{i=1}^{m/2} \ell(x, e_i), \qquad f_2(x) = \sum_{i=m/2+1}^{m} \ell(x, e_i).$$

This is the (somewhat standard) data-parallel paradigm, but there are also model-parallel or hybrid approaches.
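To make the decomposition concrete, here is a minimal NumPy sketch (not from the slides: the least-squares loss and all names are illustrative assumptions) showing that the per-partition gradients sum to the full gradient.

```python
import numpy as np

def gradient(x, examples, targets):
    """Gradient of sum_i (examples_i . x - targets_i)^2."""
    return 2 * examples.T @ (examples @ x - targets)

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))      # m = 100 examples, dimension 10
targets = rng.normal(size=100)
x = rng.normal(size=10)                # current model

# Split the dataset into two partitions, one per node.
d1, t1 = data[:50], targets[:50]
d2, t2 = data[50:], targets[50:]

g1 = gradient(x, d1, t1)               # computed by Node 1 (gradient of f_1)
g2 = gradient(x, d2, t2)               # computed by Node 2 (gradient of f_2)

# The partial gradients add up to the gradient of f = f_1 + f_2.
assert np.allclose(g1 + g2, gradient(x, data, targets))
```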

Page 9

The Optimization Procedure: Stochastic Gradient Descent

- Gradient descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t)$.
- Stochastic gradient descent (SGD): let $\tilde{g}(x_t)$ be the gradient at a randomly chosen data point, with $E[\tilde{g}(x_t)] = \nabla f(x_t)$, and assume the variance bound $E\big[\|\tilde{g}(x) - \nabla f(x)\|^2\big] \le \sigma^2$. The iteration becomes $x_{t+1} = x_t - \eta_t \tilde{g}(x_t)$.

Theorem [classic]: let $f$ be convex and $L$-smooth, and let $R^2 = \|x_0 - x^*\|^2$. If we run SGD for $T = \mathcal{O}(R^2 \sigma^2 / \varepsilon^2)$ iterations, then
$$E\Big[f\Big(\tfrac{1}{T}\textstyle\sum_{t=1}^{T} x_t\Big)\Big] - f(x^*) \le \varepsilon.$$
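As a concrete illustration of the SGD iteration above, here is a minimal sketch on a least-squares objective (the dataset, the step-size schedule, and all names are assumptions made for the example, not part of the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 10))        # examples e_i
targets = data @ rng.normal(size=10)      # a realizable linear target
m = len(data)

def objective(x):
    """f(x) = sum_i loss(x, e_i), with the squared loss."""
    return np.sum((data @ x - targets) ** 2)

def stochastic_gradient(x, i):
    """Gradient at one randomly chosen point, scaled by m so that
    E[g~(x)] = grad f(x) (unbiasedness)."""
    return 2 * m * (data[i] @ x - targets[i]) * data[i]

x = np.zeros(10)
for t in range(1, 5001):
    i = rng.integers(m)                    # pick a random data point
    eta = 2e-5 / np.sqrt(t)                # decreasing step size eta_t
    x -= eta * stochastic_gradient(x, i)   # x_{t+1} = x_t - eta_t * g~(x_t)

print("f before:", objective(np.zeros(10)), "f after:", objective(x))
```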

Page 10

A Compromise

Mini-batch SGD: let $\tilde{g}_B(x_t)$ be the stochastic gradient with respect to a set of $B$ randomly chosen points, so the iteration is $x_{t+1} = x_t - \eta_t \tilde{g}_B(x_t)$, with $E[\tilde{g}_B(x_t)] = \nabla f(x_t)$.

Why is this better?
- The variance $\sigma^2$ of $\tilde{g}_B(x_t)$ is reduced linearly in $B$ compared to $\tilde{g}(x_t)$.
- By the previous theorem, the algorithm converges in a factor of $B$ fewer iterations (in the convex case).

Note: convergence is less well understood for non-convex optimization objectives (e.g., neural nets). In that case, SGD is known to converge to a point where the gradient is zero (a stationary point).
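A mini-batch variant of the previous sketch (again an illustrative assumption, reusing the same hypothetical least-squares setup): averaging over $B$ random examples reduces the variance of the gradient estimate by a factor of $B$.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 10))
targets = data @ rng.normal(size=10)
m, B = len(data), 32

def minibatch_gradient(x, batch):
    """g~_B(x): average of B per-example gradients (scaled by m), still
    unbiased, but with variance reduced by a factor of B."""
    residuals = data[batch] @ x - targets[batch]
    return 2 * m * (residuals @ data[batch]) / B

x = np.zeros(10)
for t in range(1, 2001):
    batch = rng.integers(m, size=B)          # B randomly chosen points
    eta = 2e-5 / np.sqrt(t)
    x -= eta * minibatch_gradient(x, batch)  # x_{t+1} = x_t - eta_t * g~_B(x_t)
```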

Page 11

SGD Parallelization

Each of the $n$ nodes holds its own dataset partition and computes a local stochastic gradient $\tilde{g}_1, \tilde{g}_2, \ldots, \tilde{g}_n$; these are aggregated into
$$\tilde{g}_n = \frac{1}{n}\sum_{i} \tilde{g}_i,$$
a stochastic gradient with $n$ times lower variance.

Theory: by distributing, we can perform $P$ times more work per "clock step." Hence, we should converge $P$ times faster in terms of wall-clock time. Embarrassingly parallel?

Aggregation can be performed via:
- a master node (a "parameter server"),
- MPI All-Reduce ("decentralized"), or
- shared memory.

A minimal simulation of this averaging step is sketched below.
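The sketch below simulates the data-parallel averaging step in a single process (the partitioning, batch size, and the all-reduce-style mean are illustrative assumptions; a real system would use a parameter server or MPI/NCCL all-reduce for the aggregation).

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 10))
targets = data @ rng.normal(size=10)
n_nodes = 4
partitions = np.array_split(np.arange(len(data)), n_nodes)  # one shard per node

def local_gradient(x, part, batch_size=32):
    """Stochastic gradient computed by one node from its own partition."""
    idx = rng.choice(part, size=batch_size)
    residuals = data[idx] @ x - targets[idx]
    return 2 * len(data) * (residuals @ data[idx]) / batch_size

x = np.zeros(10)
for t in range(1, 1001):
    # Each node computes a local stochastic gradient on its partition...
    node_gradients = [local_gradient(x, part) for part in partitions]
    # ...and the results are averaged (the parameter-server / all-reduce step),
    # giving an estimate with n times lower variance.
    g = np.mean(node_gradients, axis=0)
    x -= (2e-5 / np.sqrt(t)) * g
```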

Page 12

The Practice: Training Very Large Models Efficiently

Vision:
- ImageNet: 1.3 million images
- ResNet-152 [He+15]: 152 layers, 60 million parameters
- Model/update size: approx. 250 MB

Speech:
- NIST 2000 Switchboard dataset: 2000 hours
- LACEA [Yu+16]: 22 LSTM (recurrent) layers, 65 million parameters (w/o language model)
- Model/update size: approx. 300 MB

He et al. (2015), "Deep Residual Learning for Image Recognition". Yu et al. (2016), "Deep convolutional neural networks with layer-wise context expansion and attention".

Page 13

Data-Parallel SGD
[Timeline: each node repeatedly computes its gradient, exchanges gradients, and updates the parameters, over Minibatch 1, Minibatch 2, Minibatch 3.]

Page 14

Data-Parallel SGD (bigger models)
[Same timeline, over Minibatch 1 and Minibatch 2: with bigger models, the exchange-gradient phase takes a larger share of each step.]

Page 15

Data-Parallel SGD (even bigger model)
[Same timeline, over Minibatch 1 and Minibatch 2: the exchange-gradient phase now dominates.]

Page 16

More Precisely: Two Major Costs
[Timeline: compute gradient, exchange gradient, update params, over Minibatch 1 and Minibatch 2.] The gradient exchange incurs two major costs: communication and synchronization.

Page 17

Part 2: Communication-Reduction Techniques

Page 18

Data-Parallel SGD (even bigger model), recap
[Timeline from Page 15: compute gradient, exchange gradient, update params, over Minibatch 1 and Minibatch 2.]

Page 19

Idea [Seide et al., 2014]: compress the gradients…
[Same timeline, with the exchange-gradient phase shrunk by compression.]

Page 20

1-Bit SGD Quantization [Microsoft Research, Seide et al. 2014]

Quantization function:
$$Q_i(v) = \begin{cases} \mathrm{avg}^{+} & \text{if } v_i \ge 0,\\ \mathrm{avg}^{-} & \text{otherwise,}\end{cases}$$
where $\mathrm{avg}^{+} = \mathrm{mean}(v_i \text{ for } i: v_i \ge 0)$ and $\mathrm{avg}^{-} = \mathrm{mean}(v_i \text{ for } i: v_i < 0)$.
Accumulate the quantization error locally, and apply it to the next gradient!

Example: $v = (-0.3, 0.1, 0.3, -0.1, 0)$ is quantized to the signs $(-, +, +, -, +)$ with $\mathrm{avg}^{+} = +0.2$ and $\mathrm{avg}^{-} = -0.2$.

Instead of $n$ 32-bit floats, we send $n$ sign bits plus the two averages: compression rate $\approx 32\times$.

Does not always converge!

Seide et al. (2014), "1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs".
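A sketch of the 1-bit quantizer with local error feedback, following the description above (illustrative NumPy code; the bit-packing and the exact details of Seide et al.'s implementation are omitted).

```python
import numpy as np

def one_bit_quantize(v):
    """Map each coordinate to avg+ (if v_i >= 0) or avg- (otherwise)."""
    pos = v >= 0
    avg_pos = v[pos].mean() if pos.any() else 0.0
    avg_neg = v[~pos].mean() if (~pos).any() else 0.0
    return np.where(pos, avg_pos, avg_neg)

error = None  # locally accumulated quantization error

def compress_with_error_feedback(gradient):
    """1-bit SGD: quantize (gradient + carried error); keep the residual
    locally and fold it into the next gradient."""
    global error
    if error is None:
        error = np.zeros_like(gradient)
    corrected = gradient + error
    quantized = one_bit_quantize(corrected)
    error = corrected - quantized        # error accumulated for the next step
    return quantized

g = np.array([-0.3, 0.1, 0.3, -0.1, 0.0])
print(compress_with_error_feedback(g))   # each coordinate becomes avg+ or avg-
```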

Page 21

Why This Shouldn't Work

Recall the SGD iteration $x_{t+1} = x_t - \eta_t \tilde{g}(x_t)$ with $E[\tilde{g}(x_t)] = \nabla f(x_t)$ and variance bound $E\big[\|\tilde{g}(x) - \nabla f(x)\|^2\big] \le \sigma^2$, and the classic theorem: for $f$ convex and $L$-smooth with $R^2 = \|x_0 - x^*\|^2$, running SGD for $T = \mathcal{O}(R^2\sigma^2/\varepsilon^2)$ iterations gives $E[f(\tfrac{1}{T}\sum_{t=1}^{T} x_t)] - f(x^*) \le \varepsilon$.

Now let $Q(\cdot)$ be the gradient quantization function, so the iteration becomes
$$x_{t+1} = x_t - \eta_t\, Q(\tilde{g}(x_t)).$$
In 1-bit SGD the quantized gradient is no longer unbiased, $E[Q(\tilde{g}(x_t))] \ne \nabla f(x_t)$, so the theorem no longer applies.

Page 22

Take One: Stochastic Quantization

Quantization function:
$$Q(v_i) = \|v\|_2 \cdot \mathrm{sgn}(v_i) \cdot \xi_i(v),$$
where $\xi_i(v) = 1$ with probability $|v_i|/\|v\|_2$, and $0$ otherwise, so each coordinate is mapped to $\pm\|v\|_2$ or $0$. (Example: for $v = (-0.3, 0.1, 0.3, -0.1, 0)$, $\|v\|_2 \approx 0.447$.)

Properties:
1. Unbiasedness: $E[Q(v_i)] = \|v\|_2 \cdot \mathrm{sgn}(v_i) \cdot |v_i|/\|v\|_2 = v_i$.
2. Second-moment (variance) bound: $E[\|Q(v)\|^2] \le \|v\|_2\,\|v\|_1 \le \sqrt{n}\,\|v\|_2^2$.
3. Sparsity: if $v$ has dimension $n$, then $E[\#\{\text{non-zeros in } Q(v)\}] = E[\sum_i \xi_i(v)] \le \|v\|_1/\|v\|_2 \le \sqrt{n}$.

Convergence: $E[Q(\tilde{g}(x_t))] = E[\tilde{g}(x_t)] = \nabla f(x_t)$, so SGD still converges; the added variance costs at most $\sqrt{n}$ times more iterations.
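A sketch of this stochastic quantizer (illustrative NumPy code; it returns the dequantized values rather than an actual bit encoding), together with a quick empirical check of unbiasedness.

```python
import numpy as np

def stochastic_quantize(v, rng):
    """Q(v_i) = ||v||_2 * sgn(v_i) * xi_i, with xi_i = 1 w.p. |v_i| / ||v||_2."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    xi = rng.random(v.shape) < np.abs(v) / norm   # Bernoulli(|v_i| / ||v||_2)
    return norm * np.sign(v) * xi

rng = np.random.default_rng(0)
v = np.array([-0.3, 0.1, 0.3, -0.1, 0.0])

# Unbiasedness: the empirical mean of Q(v) approaches v.
mean_q = np.mean([stochastic_quantize(v, rng) for _ in range(20000)], axis=0)
print(np.round(mean_q, 2))   # approximately [-0.3, 0.1, 0.3, -0.1, 0.0]
```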

Page 23

Compression

With the same quantization function $Q(v_i) = \|v\|_2 \cdot \mathrm{sgn}(v_i) \cdot \xi_i(v)$, we transmit the norm $\|v\|_2$ (one 32-bit float) plus the signs and locations of the (in expectation) $\approx \sqrt{n}$ non-zero coordinates, roughly $\sqrt{n}$ bits and integers.

Original: $32n$ bits. Compressed: $\approx 32 + \sqrt{n}\log n$ bits, i.e. compression $\approx \sqrt{n}/\log n$.

Moral: we're not too happy, since the $\sqrt{n}$ increase in the number of iterations offsets the $\approx \sqrt{n}/\log n$ compression.

Page 24

Take Two: QSGD [Alistarh, Grubic, Li, Tomioka, Vojnovic, NIPS 2017]

Quantization function:
$$Q(v_i; s) = \|v\|_2 \cdot \mathrm{sgn}(v_i) \cdot \xi_i(v, s),$$
where, with $\ell = \lfloor s\,|v_i|/\|v\|_2 \rfloor$,
$$\xi_i(v, s) = \begin{cases} \ell/s & \text{with probability } \ell + 1 - s\,|v_i|/\|v\|_2,\\ (\ell+1)/s & \text{otherwise.}\end{cases}$$
Here $s$ is a tuning parameter (the number of quantization levels); $s = 1$ reduces to the two-bit quantization function above.

Example: for $v = (-0.3, 0.1, 0.3, -0.1, 0)$, $\|v\|_2 \approx 0.447$; with $s = 3$ (levels $0.00, 0.33, 0.66, 1.00$, i.e. 2 bits per coordinate), each $|v_i|/\|v\|_2$ is randomly rounded to one of the two nearest levels, yielding for instance the level indices $(2, 0, 2, 1, 0)$ with signs $(-, +, +, -, +)$.
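A sketch of the QSGD quantizer with $s$ levels (illustrative code; real QSGD additionally entropy-codes the resulting integers, which is omitted here).

```python
import numpy as np

def qsgd_quantize(v, s, rng):
    """Q(v_i; s) = ||v||_2 * sgn(v_i) * xi_i(v, s): randomly round
    |v_i| / ||v||_2 to one of the two nearest levels in {0, 1/s, ..., 1}."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v), np.zeros(len(v), dtype=int)
    scaled = np.abs(v) / norm * s               # in [0, s]
    lower = np.floor(scaled)                    # level index l
    prob_up = scaled - lower                    # round up w.p. s|v_i|/||v||_2 - l
    levels = lower + (rng.random(v.shape) < prob_up)
    dequantized = norm * np.sign(v) * levels / s
    return dequantized, levels.astype(int)      # each level needs ~log2(s+1) bits

rng = np.random.default_rng(0)
v = np.array([-0.3, 0.1, 0.3, -0.1, 0.0])
deq, levels = qsgd_quantize(v, s=3, rng=rng)    # s = 3: 2 bits per coordinate
print(levels, np.round(deq, 3))
```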

Page 25

QSGD Properties

For $Q(v_i; s) = \|v\|_2 \cdot \mathrm{sgn}(v_i) \cdot \xi_i(v, s)$:
1. Unbiasedness: $E[Q(v_i; s)] = v_i$.
2. Sparsity: $E[\|Q(v; s)\|_0] \le s^2 + \sqrt{n}$.
3. Second-moment bound: $E[\|Q(v; s)\|_2^2] \le \big(1 + \min(n/s^2, \sqrt{n}/s)\big)\,\|v\|_2^2$ (the multiplier is only 2 for $s = \sqrt{n}$).

Page 26

Two Regimes

- Idea 1: only a few large integer values need to be encoded.
- Idea 2: use Elias recursive coding to encode these integers efficiently (a sketch of a simpler code from the same family follows below).

Theorem 1 (constant $s$): the expected bit length of the quantized gradient is $32 + (s^2 + \sqrt{n})\log n$.

Theorem 2 (large $s$): for $s = \sqrt{n}$, the expected bit length of the quantized gradient is $32 + 2.8\,n$, and the added variance is constant. (Original: $32n$ bits.)

Theorem [Tsitsiklis & Luo, '86]: given dimension $n$, the number of bits necessary to approximate the minimum within $\varepsilon$ is $\Omega\big(n(\log n + \log(1/\varepsilon))\big)$. This matches Theorem 2.
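The slides refer to Elias recursive coding; as a hedged illustration of the underlying idea (variable-length codes that spend few bits on small integers), here is the simpler Elias gamma code. This is not the exact code used in QSGD, just a member of the same family.

```python
def elias_gamma_encode(n):
    """Elias gamma code for a positive integer n: floor(log2 n) zeros,
    followed by the binary representation of n."""
    assert n >= 1
    binary = bin(n)[2:]                       # e.g. 5 -> '101'
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits):
    """Inverse of elias_gamma_encode for a single codeword."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

for n in (1, 2, 5, 17):
    code = elias_gamma_encode(n)
    print(n, "->", code, "->", elias_gamma_decode(code))
# Small integers (the common case after quantization) get short codewords.
```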

Page 27

Does It Actually Work?

Setup:
- Amazon EC2 p2.xlarge machines
- AlexNet model (60M params) x ImageNet dataset x 2 GPUs
- QSGD 4-bit quantization (s = 16)
- No additional hyperparameter tuning

[Time breakdown, SGD vs. QSGD on AlexNet: plain SGD spends roughly 60% of the time computing and 40% communicating; with QSGD, roughly 95% computing and 5% communicating.]

Page 28

Experiments: "Strong" Scaling
[Speedup plots; labels include 2.5x, 3.5x, 1.8x, and 1.3x.]

Page 29

Experiments: Accuracy

- 3-layer LSTM on CMU AN4 (speech), 2 GPU nodes: 2.5x speedup.
- ResNet-50 on ImageNet, 8 GPU nodes: accuracy change of -0.2% with 4-bit and +0.3% with 8-bit quantization.

Across all networks we tried, 4 bits are sufficient. (The QSGD report contains full numbers and comparisons.)

Page 30

Other Communication-Efficient Approaches

Quantization-based methods yield stable but limited gains in practice:
- usually < 32x compression, since it is just bit-width reduction;
- one can't do much better without large variance [QSGD, NIPS 17].

The "engineering" approach [NVIDIA NCCL]:
- increase network bandwidth, decrease network latency;
- new interconnects (NVIDIA, Cray), better protocols (NVIDIA).

The "sparsification" approach [Dryden et al., 2016; Aji et al., 2018]:
- send only the "important" components of each gradient, sorted by magnitude (a top-k sketch follows below);
- empirically gives much higher compression (up to 800x [Han et al., ICLR 2018]).

"Large-batch" approaches [Goyal et al., 2017; You et al., 2018]:
- run more computation locally before communicating (large "batches");
- need extremely careful parameter tuning in order to work without accuracy loss.
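A sketch of top-k gradient sparsification with local error accumulation (the error-feedback detail is common in the sparsification literature cited above, but the exact scheme varies by paper; all names here are illustrative).

```python
import numpy as np

def top_k_sparsify(gradient, error, k):
    """Send only the k largest-magnitude coordinates of (gradient + error);
    the rest is accumulated locally and added to the next gradient."""
    corrected = gradient + error
    idx = np.argpartition(np.abs(corrected), -k)[-k:]   # k largest |entries|
    sparse = np.zeros_like(corrected)
    sparse[idx] = corrected[idx]
    new_error = corrected - sparse                      # not sent this round
    return sparse, new_error

rng = np.random.default_rng(0)
g = rng.normal(size=1000)
error = np.zeros_like(g)
sparse, error = top_k_sparsify(g, error, k=10)   # send 10 of 1000 values (+ indices)
print(np.count_nonzero(sparse), "values sent;",
      "carried-over residual norm:", round(float(np.linalg.norm(error)), 2))
```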

Page 31

Roadmap

- Introduction
- Basics
  - Distributed Optimization and SGD
- Communication-Reduction
  - Stochastic Quantization and QSGD
- Asynchronous Training
  - Asynchronous SGD
  - Recent Work

Page 32

Two Major Costs (recap)
[Timeline: compute gradient, exchange gradient, update params, over Minibatch 1 and Minibatch 2; the exchange step costs communication and synchronization.]

Page 33

SGD Parallelization (recap)
[Each of the n dataset partitions produces a stochastic gradient $\tilde{g}$; the gradients are summed into $\tilde{g}_n$.]

Aggregation in shared memory: lock-based? Lock-free?

Page 34

SGD in Asynchronous Shared Memory
[Each of the n dataset partitions produces stochastic gradients $\tilde{g}$ that are applied directly to a shared model $x = (x_1, x_2, x_3, \ldots, x_d)$.]

- $P$ threads, adversarial scheduler.
- The model is updated using atomic operations (read, CAS / fetch-and-add).

Does SGD still converge under asynchronous (inconsistent) iterations?

Page 35

Modeling Asynchronous SGD
[Diagram: a thread reads the model $x$ coordinate by coordinate (axes: model dimension vs. iteration number), so different coordinates may reflect different iterations.]

Define $\tau$ = the maximum number of previous updates a scan may miss. Note that $\tau \le$ the maximum interval contention of an operation.

Page 36

Convergence Intuition

Legend: blue = original minibatch SGD; red dotted = delayed updates.

Adversary's power:
- delay a subset of the gradient updates;
- place the delayed updates so as to delay convergence to the optimum.

Adversary's limitation:
- $\tau$ is the maximum delay between when a step is generated and when it has to be applied.

Theorem [Recht et al., '11]: under analytic assumptions, asynchronous SGD still converges, but at a rate that is $O(\tau)$ times slower than serial SGD.
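A minimal lock-free, shared-memory SGD sketch in the spirit of Hogwild! (Python threads updating a shared NumPy vector without locks; this only illustrates the unsynchronized access pattern, since CPython's GIL prevents a real parallel speedup here, and the least-squares setup is an assumption for the example).

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(4000, 10))
targets = data @ rng.normal(size=10)
x = np.zeros(10)                     # shared model, updated without any locks

def worker(seed, steps=2000, eta=1e-3):
    local_rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = local_rng.integers(len(data))
        g = 2 * (data[i] @ x - targets[i]) * data[i]   # may read a stale x
        x[:] = x - eta * g           # unsynchronized (Hogwild-style) write

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("objective:", np.sum((data @ x - targets) ** 2))
```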

Page 37

Convergence of Asynchronous SGD ("Hogwild")

[Embedded: page 7 of Recht et al., "HOGWILD!" (NIPS 2011), including Figure 2, "Total CPU time versus number of threads for (a) RCV1, (b) Abdomen, and (c) DBLife," and the surrounding experimental discussion of SVM, matrix completion, and graph-cut benchmarks.]

Theorem [Recht et al., '11]: under analytic assumptions, asynchronous SGD still converges, but at a rate that is $O(\tau)$ times slower than serial SGD.

There is lots of follow-up work tightening the assumptions. The linear dependency on $\tau$ is tight in general, but can be reduced to a square-root dependence by simple modifications [PODC 18]. This is a worst-case bound: in practice, asynchronous SGD sometimes converges at the same rate as the serial version.

[Embedded: page 8 of the same paper, including Figure 3, "(a) Speedup for the three matrix completion problems with HOGWILD!. (b) The training error at the end of each epoch of SVM training on RCV1 for the averaging algorithm. (c) Speedup achieved over serial method for various levels of delays (measured in nanoseconds)," together with the discussion of slow gradients, the conclusions, and the acknowledgements.]

The theoretical gains come from the fact that the $\tau$ slowdown due to asynchrony is compensated by the speedup of $P$ due to parallelism.

More details in Nikola's talk on Wednesday morning!

Recht et al., "HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent", NIPS 2011.

Page 38

Asynchronous Approaches

The convex case:
- By now, lock-free is the standard implementation of SGD in shared memory.
- It exploits the fact that many large datasets are sparse, so conflicts are rare.
- The NUMA case is much less well understood.

The non-convex case:
- Requires careful hyperparameter tuning to work, and is less popular.
- Convergence of SGD in the non-convex case is less well understood, and very little is known analytically [Lian et al., NIPS 2015].

Lian et al., "Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization", NIPS 2015.

Page 39

Summary

Most medium-to-large-scale machine learning is distributed.

Communication-efficient and asynchronous learning techniques are fairly common, and are starting to have a sound theoretical basis.

Lots of exciting new questions!

Page 40

A Sample of Open Questions

What notions of consistency do distributed machine learning algorithms require in order to converge? At first sight, much weaker ones than the standard notions.

[Diagram: Node 1 and Node 2, holding Dataset Partitions 1 and 2, minimize $\arg\min_x f(x) = f_1(x) + f_2(x)$ while each node only maintains a noisy copy of the model, $x_i = x + \text{noise}$.]

Page 41

A Sample of Open Questions

Can distributed machine learning algorithms be Byzantine-resilient? Early work by [Su, Vaidya] and [Blanchard, El Mhamdi, Guerraoui, Stainer]. Non-trivial ideas are needed from both the ML and the distributed computing sides.

[Diagram: Node 1 and Node 2, holding Dataset Partitions 1 and 2 and each a copy of the model $x$, minimizing $\arg\min_x f(x) = f_1(x_1) + f_2(x_2)$.]

Page 42

A Sample of Open Questions

Can distributed machine learning algorithms be completely decentralized? Early work by e.g. [Lian et al., NIPS 2017], for SGD.

[Diagram: n dataset partitions, each producing stochastic gradients $\tilde{g}$, with no central coordinator.]