Project Fiddle
Fast & Efficient Infrastructure for
Distributed Deep Learning
Amar Phanishayee
with
Nikhil Devanur, Aaron Harlap, Animesh Jain,
Liang Luo, Deepak Narayanan, Jacob Nelson,
Gennady Pekhimenko, Vivek Seshadri, Jorgen Thelin,
Shivaram Venkataraman, Guanhua Wang
Big Data Systems – Gen AI
2
• Motivating Scenarios: image recognition and classification; translation; speech–text–speech
• Solutions: DistBelief
Deep Learning for X
3
• Image recognition, classification
• Segmentation
• Captioning
• Stereo
• Depth estimation
• Video action recognition
• Sentiment Analysis
• Speech-Text, Text-Speech, OCR
• Translation
• …
Training is time and resource intensive
• Long-running training jobs: days to multiple weeks
• Train, Validate, Repeat
• Performance and accuracy specs validated only by running the system
• Naïvely parallelizing work can be detrimental
• No luxury of limiting solution to one part of the stack
4
[Figure: workflow across a training cluster and a serving cluster – training and serving jobs are triggered by the data scientist; specialization]
[Chart: breakdown of time into communication vs. computation – GPU utilization is below 100% (as low as 47%)]
• Overcome memory bottleneck
• Restructure computation
• Faster interconnects need new protocols; rethink what we push to the network
6
Problem
Systematically speed up distributed learning tasks while eking out the most from the resources used
Goal: 100x more efficient training
Day-long training of a model in a 15-minute coffee break
Approach
Cut through layers of the system stack
and optimize all relevant parts
7
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist: Train larger networks on a single GPU by reducing memory footprint.
PipeDream: Efficiently scale training. New way to parallelize DNN computation.
Blink: Blazingly fast transfers over PCIe + NVLink + IB.
Hub: Fast parameter servers.
8
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist: Train larger networks on a single GPU by reducing memory footprint. Up to 2x compression ratio.
PipeDream: Efficiently scale training. New way to parallelize DNN computation. Up to 3x faster training.
Blink: Blazingly fast transfers over PCIe + NVLink + IB. Collective transfer primitives 8x faster than NCCL.
Hub: Fast parameter servers. Up to 3x speedup.
9
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist: Train larger networks on a single GPU by reducing memory footprint. Up to 2x compression ratio.
PipeDream: Efficiently scale training. New way to parallelize DNN computation. Up to 3x faster training.
Blink: Blazingly fast transfers over PCIe + NVLink + IB. Collective transfer primitives 9x faster than NCCL.
PSsst: Fast parameter servers. Up to 3x speedup.
10
Training
Examples: ⟨x1, y1⟩, ⟨x2, y2⟩, …
Learning Algorithm → Hypothesis $h(x, \theta) \approx y$
Learn $\theta$ in $g(\theta^T x)$ (or $g_\theta(x)$)
Cost over $m$ examples: $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} c(h(x_i, \theta), y_i)$
Gradient descent on $\nabla J(\theta)$: $\theta_i \leftarrow \theta_i - \alpha\,\frac{\partial J(\theta)}{\partial \theta_i}$
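As a rough illustration only (not from the talk), a minimal NumPy sketch of this training loop – hypothesis $h(x,\theta)=g(\theta^T x)$, cost $J(\theta)$, and the gradient-descent update – might look like the following; the sigmoid/cross-entropy choice is an assumption for concreteness.

```python
# Minimal sketch of the training loop described above (illustrative, not the
# project's code): hypothesis h(x, theta) = g(theta^T x), cost
# J(theta) = (1/m) * sum_i c(h(x_i, theta), y_i), and the update
# theta_i <- theta_i - alpha * dJ/dtheta_i.
import numpy as np

def sigmoid(z):                      # the nonlinearity g
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, epochs=100):
    m, d = X.shape                   # m examples <x_i, y_i>, d features
    theta = np.zeros(d)
    for _ in range(epochs):
        h = sigmoid(X @ theta)       # hypothesis for all examples
        grad = X.T @ (h - y) / m     # gradient of the cross-entropy cost
        theta -= alpha * grad        # gradient-descent step
    return theta
```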
Structure of DNN computation
11
[Figure: a DNN as a stack of layers with activations (feature maps) flowing between them toward the output labels (cat? dog?)]
Batch: run for all input items. Repeat over many “epochs”.
A neuron with inputs a, b, c, d computes $g(a \cdot \alpha + b \cdot \beta + c \cdot \gamma + d \cdot \delta)$, i.e. $g(\theta^T x)$.
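A tiny sketch of that single-neuron computation (illustrative values and names, assuming tanh as the nonlinearity g):

```python
# One neuron computes g(a*alpha + b*beta + c*gamma + d*delta), i.e. g(theta^T x)
# for inputs x = [a, b, c, d] and weights theta = [alpha, beta, gamma, delta].
import numpy as np

def neuron(x, theta, g=np.tanh):
    return g(np.dot(theta, x))       # weighted sum of inputs, then nonlinearity

print(neuron(np.array([1.0, 2.0, 3.0, 4.0]),      # a, b, c, d
             np.array([0.1, -0.2, 0.05, 0.3])))   # alpha, beta, gamma, delta
```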
Tasks, datasets, challenges
• 1989: MNIST
• 2001: Caltech 101
• 2007 – 2012: PASCAL Visual Object Classes
• 20 classes
• 2014 – … : ImageNet 1K
• Deng et al.
• Classification and Detection
13
Profile of memory consumption
14
[Figure: a DNN layer with Input Feature Map (X), Output Feature Map (Y), Input Gradient (dx), and Output Gradient (dy); dx = f(X, Y, dy)]
[Chart: CNTK profiling of AlexNet, NiN, Overfeat, VGG-16, and Inception v3 – feature maps are a major consumer of GPU memory, and a larger minibatch size can crash the run]
Profile of memory consumption
[Chart: the same CNTK profiling across AlexNet, NiN, Overfeat, VGG-16, and Inception v3 – the heavy hitters are the Relu->Pool and Relu/Pool->Conv feature maps]
Idea: Encode data structures when unused,
Decode on use
16
[Figure: timeline of layer Li's forward and backward pass – feature map X is produced in the forward pass and used again only in the backward pass (dx computed from X, Y, dy); X is encoded to a succinct form after its forward use and decoded back to full fidelity just before its backward use]
• Long temporal gap between the two uses of X
• Statically construct a usage schedule for all data structures
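A hypothetical sketch of this lifecycle (not CNTK's actual allocator; the class and method names are invented for illustration). The statically constructed usage schedule decides where the encode/decode calls are placed for each feature map:

```python
# Between a feature map's forward-pass use and its much later backward-pass use,
# keep only a succinct encoding; decode back to full fidelity right before the
# backward use. Illustrative sketch of the idea, not the Gist implementation.
class StashedFeatureMap:
    def __init__(self, encode, decode):
        self.encode, self.decode = encode, decode
        self.succinct = None

    def after_last_forward_use(self, X):
        self.succinct = self.encode(X)     # succinct form lives through the long gap
        # the full-fidelity buffer for X can now be returned to the allocator

    def before_backward_use(self):
        return self.decode(self.succinct)  # full fidelity again, only when needed
```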
Gist Architecture
17
Static
memory allocator
CNTK Runtime
(computation engine)
• Statically construct usage schedule
for all data structures
• Layer-specific encodings
• Lossless (Binarize, Sparse Storage/Dense Compute)
• Lossy (Delayed Precision Reduction)
Gist Lossless Encodings
18
For Relu->Pool (Relu backward propagation): Binarize – a 1-bit representation of the 32-bit values in Y, a 32x reduction.
For Relu/Pool->Conv: naively applying Binarize here can increase memory consumption!
Gist Lossless Encodings
19
For Relu->Pool (Relu backward propagation): Binarize – a 1-bit representation of the 32-bit values in Y.
For Relu/Pool->Conv: Sparse Storage, Dense Compute (sparse compute is computationally expensive).
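A minimal NumPy/SciPy sketch of what these two lossless encodings could look like, under the assumption that only the zero pattern of a ReLU output is needed for its backward pass; function names are illustrative, not CNTK code:

```python
# Hypothetical sketch of Gist's two lossless encodings.
import numpy as np
from scipy import sparse

# (1) Binarize, for Relu->Pool: ReLU backward only needs to know which outputs
# were zero, so stash Y as 1 bit per 32-bit value (32x smaller).
def binarize_encode(Y):
    return np.packbits(Y > 0), Y.shape

def binarize_decode(bits, shape):
    mask = np.unpackbits(bits)[: int(np.prod(shape))].reshape(shape)
    return mask.astype(np.float32)           # 1.0 where Y > 0, else 0.0

def relu_backward(dy, stashed):
    return dy * binarize_decode(*stashed)    # dx = dy where Y > 0

# (2) Sparse Storage, Dense Compute, for Relu/Pool->Conv: the stashed map is
# mostly zero, so store it sparsely, but expand to dense before the conv's
# backward compute (sparse compute itself would be too expensive).
def sparse_encode(Y2d):                      # expects a 2-D view of the map
    return sparse.csr_matrix(Y2d)

def sparse_decode(Ysp):
    return Ysp.toarray()
```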
Gist Lossy Encodings
• Used for all other feature maps
• Training with reduced precision (8/10/16 bits)
– done only when encoding/stashing values
• Forward pass uses full fidelity values
20
[Chart: precision required per feature map – 8 bits is enough for some, 10 bits for others; where 8/10 bits are not enough, 16 bits works out]
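A rough sketch of Delayed Precision Reduction under these assumptions: the forward pass keeps computing on full-fidelity fp32 values, and only the copy stashed for the backward pass is quantized (shown here at 8 bits). The names and the uniform-quantization scheme are illustrative, not Gist's exact encoding:

```python
# Stash-time quantization only; the forward pass never sees reduced precision.
import numpy as np

def dpr_encode(X):
    lo, hi = float(X.min()), float(X.max())
    scale = (hi - lo) / 255.0 or 1.0              # avoid div-by-zero on constant maps
    q = np.round((X - lo) / scale).astype(np.uint8)   # 8-bit stash, 4x smaller
    return q, lo, scale

def dpr_decode(q, lo, scale):
    return q.astype(np.float32) * scale + lo      # approximate values for backward
```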
Gist – Putting it all together
21
Memory Footprint Reduction
Up to 2x
compression ratio
Minimal runtime overhead
(1 - 7%) for same batch size
With just lossless,
we go faster at times
(memory bandwidth bound)
22
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist: Train larger networks on a single GPU by reducing memory footprint. Up to 2x compression ratio.
PipeDream: Efficiently scale training. New way to parallelize DNN computation. Up to 3x faster training.
Blink: Blazingly fast transfers over PCIe + NVLink + IB. Collective transfer primitives 9x faster than NCCL.
PSsst: Fast parameter servers. Up to 3x speedup.
Distributed Deep Learning
• Data Parallelism: Replicas run in parallel on different machines
• Occasionally exchange what they have learnt
• Naïve setup can hurt convergence, accuracy
• Total time = (Time per Epoch) * (#Epochs for given accuracy)
23
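An illustrative data-parallel step, sketching what “occasionally exchange what they have learnt” means in the simplest synchronous case (gradient averaging); this is not the project's code, and `grad_fn` is a placeholder:

```python
# Each replica computes a gradient on its own shard of the mini-batch, then the
# replicas exchange what they learnt by averaging gradients before every replica
# applies the same update.
import numpy as np

def data_parallel_step(w, shards, grad_fn, lr=0.01):
    grads = [grad_fn(w, X, y) for (X, y) in shards]   # one gradient per replica
    avg_grad = np.mean(grads, axis=0)                 # exchange / average
    return w - lr * avg_grad                          # identical update everywhere
```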
Manually tuned for performance of individual jobs
[Figure: data-parallel timeline – each worker alternates Compute and Communication phases; Hardware Efficiency governs the time per epoch, Statistical Efficiency the #epochs needed for a given accuracy]
Pipeline Parallelism in Fiddle
• Idea: Pipeline computation across machines
• For training tasks, optimize for throughput
• Each machine runs a subset of layers
– Compute one thing, data flows to compute task
– Better FLOPs due to cache locality
• Fits CPU, GPU, hybrid, heterogeneous clusters
24
[Figure: forward compute, forward communication, backward compute, and backward communication overlapped in the pipeline (x4)]
• No bubbles in pipeline
• How should we divvy work across machines?
Challenges
25
DNN → Run → Fiddle Profiler → Fiddle Optimizer
$r_i = \text{runtime}(f_i + b_i)$,  $t_i = \text{transfer-time}(i, i+1)$
$l_{ij} = \max\left\{ \sum_{k=i}^{j} r_k,\ t_j \right\}$
Given the # machines, the optimizer picks the partition $P$ of layers into contiguous stages that minimizes the slowest stage: $\min_P \max_{[i,j] \in P} l_{ij}$
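A small dynamic-programming sketch of this partitioning problem (without stage replication), assuming per-layer runtimes `r` and inter-layer transfer times `t` from the profiler; this is an illustrative reconstruction, not the Fiddle optimizer's code:

```python
# Split layers into contiguous stages so the slowest stage,
# l_ij = max(sum_{k=i..j} r_k, t_j), is as fast as possible.
from functools import lru_cache

def best_partition(r, t, machines):
    n = len(r)
    prefix = [0.0]
    for x in r:
        prefix.append(prefix[-1] + x)

    def stage_time(i, j):                       # layers i..j on one machine
        compute = prefix[j + 1] - prefix[i]
        transfer = t[j] if j + 1 < n else 0.0   # transfer to the next stage
        return max(compute, transfer)

    @lru_cache(maxsize=None)
    def solve(i, m):                            # place layers i..n-1 on m machines
        if m == 1:
            return stage_time(i, n - 1)
        best = float("inf")
        for j in range(i, n - m + 1):           # first stage is layers i..j
            best = min(best, max(stage_time(i, j), solve(j + 1, m - 1)))
        return best

    return solve(0, machines)                   # bottleneck stage time
```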
• No bubbles in pipeline
• How should we divvy work across machines?
Challenges
27
[Figure: an unbalanced split with stages at 0.5x, 1x, 2x, 0.5x – the pipeline runs at the speed of the slowest layer, leaving idle cycles on the other stages]
• No bubbles in pipeline
• How should we divvy work across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
Challenges
28
[Figure: three stages with relative load 1X, 1X, 2X]
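A minimal sketch of the static load-balancing rule from this slide; the routing function is the only logic, everything else is assumed context:

```python
# When a stage is replicated, mini-batch i is routed to replica
# (mini_batch_id % num_replicas), so the forward and backward passes of the same
# mini-batch always take the same path.
def route(mini_batch_id, num_replicas):
    return mini_batch_id % num_replicas

# e.g. with 2 replicas: mini-batches 0, 2, 4, ... go to replica 0 and
# 1, 3, 5, ... go to replica 1, for both the forward and the backward pass.
assert [route(i, 2) for i in range(6)] == [0, 1, 0, 1, 0, 1]
```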
Challenges
29
[Figure: after replicating the heavy stage, all stages run at an effective 1X]
• No bubbles in pipeline
• How should we divvy work across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
[Figure: a pipeline of S stages]
Challenges
30
• No bubbles in pipeline
• How should we divvy work across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
Challenges
31
[Figure: balanced stages, each at 1X]
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
[Figure: a pipeline of S stages]
Challenges
32
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
[Figure: a stage with a Forward Queue and a Backward Queue; mini-batch mbi arrives on the forward queue]
Challenges
33
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
[Figure: the stage's forward output for a mini-batch is sent on to the next level]
Challenges
34
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
[Figure: the gradient for an earlier mini-batch mbi-x arrives on the stage's backward queue]
Challenges
35
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
[Figure: the stage's backward output is sent back to the previous level]
Challenges
36
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
• Statistical Efficiency
• Affected by feature map and
model parameter versions
• Each incomplete mini-batch
needs to stash feature maps
between F and B passes
• Should admitted items see
the updates of the same set
of minibatches across levels
(not newer ones)?
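A hypothetical sketch of the steady-state schedule these slides describe: a stage holds a forward queue and a backward queue, alternates between forward and backward work, and stashes each in-flight mini-batch's feature maps between its two passes. The class and callback names are invented for illustration; this is not PipeDream's runtime:

```python
from collections import deque

class Stage:
    def __init__(self, forward_fn, backward_fn):
        self.forward_fn, self.backward_fn = forward_fn, backward_fn
        self.forward_q, self.backward_q = deque(), deque()
        self.stash = {}                          # mini-batch id -> feature maps

    def step(self):
        # Do backward work first when it is available; in steady state, with one
        # backward arriving per forward, this alternates between F and B.
        if self.backward_q:
            mb_id, grad_in = self.backward_q.popleft()
            acts = self.stash.pop(mb_id)         # reuse the stashed feature maps
            return ("backward", mb_id, self.backward_fn(acts, grad_in))
        if self.forward_q:
            mb_id, inputs = self.forward_q.popleft()
            acts = self.forward_fn(inputs)
            self.stash[mb_id] = acts             # keep until this mb's backward pass
            return ("forward", mb_id, acts)
        return None                              # idle: nothing admitted yet
```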
Valid Gradients Or Bust
PipeDream with weight stashing is critical for valid gradients
(use the same weights across F and B within every stage)
37
Minibatch SGD: $w^{(t+1)} = w^{(t)} - \mu \cdot \nabla f(w_1^{(t)}, w_2^{(t)}, \ldots, w_n^{(t)})$
PipeDream with weight stashing: $w^{(t+1)} = w^{(t)} - \mu \cdot \nabla f(w_1^{(t-n+1)}, w_2^{(t-n+2)}, \ldots, w_n^{(t)})$
Valid Gradients Or Bust
PipeDream with weight stashing is critical for valid gradients
(use the same weights across F and B within every stage)
38
Minibatch SGD: $w^{(t+1)} = w^{(t)} - \mu \cdot \nabla f(w_1^{(t)}, w_2^{(t)}, \ldots, w_n^{(t)})$
PipeDream with weight stashing: $w^{(t+1)} = w^{(t)} - \mu \cdot \nabla f(w_1^{(t-n+1)}, w_2^{(t-n+2)}, \ldots, w_n^{(t)})$
PipeDream with vertical sync maps to bounded staleness
(use the same weights across all stages)
$w^{(t+1)} = w^{(t)} - \mu \cdot \nabla f(w_1^{(t-n+1)}, w_2^{(t-n+1)}, \ldots, w_n^{(t-n+1)})$
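A minimal sketch of weight stashing, assuming one weight buffer per stage and illustrative method names (not PipeDream's memory manager): each mini-batch's backward pass reuses exactly the weight version its forward pass saw, which yields the staggered-but-consistent versions in the update rule above.

```python
class WeightStash:
    def __init__(self, initial_weights):
        self.current = initial_weights
        self.stash = {}                      # mini-batch id -> weights used forward

    def weights_for_forward(self, mb_id):
        self.stash[mb_id] = self.current     # stash the version this mb will use
        return self.current

    def weights_for_backward(self, mb_id):
        return self.stash.pop(mb_id)         # same version as the forward pass

    def apply_update(self, new_weights):
        self.current = new_weights           # later mini-batches see the update
```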
Challenges
39
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
• Statistical Efficiency
• Affected by feature map (FM)
and model parameter versions
• Each incomplete mini-batch
needs to stash feature maps
between F and B passes
• Should admitted items see the
updates of the same set of
minibatches across levels (not
newer ones)?
• Memory Manager
• Stashing: static memory buffer pool (FMs, weights, weight updates)
• Use the same weights across F/B within every stage (valid gradient)
• Vertical Sync
Stage Runtime Architecture
40
[Figure: stage runtime components – a Caffe compute thread that gets GPU memory for its output; a client-side cache with parameter versioning in front of a sharded parameter server (read model params, update model); an intermediate data manager (read intermediate data); Fw/Bw network receive and send threads; Fw/Bw copy-to-GPU and copy-to-CPU threads]
Encouraging results
42
Up to 3x
faster, more efficient
Effect of smaller minibatches on hardware efficiency and statistical efficiency in data-parallel runs
44
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist: Train larger networks on a single GPU by reducing memory footprint. Up to 2x compression ratio.
PipeDream: Efficiently scale training. New way to parallelize DNN computation. Up to 3x faster training.
Blink: Blazingly fast transfers over PCIe + NVLink + IB. Collective transfer primitives 8x faster than NCCL.
Hub: Fast parameter servers. Up to 3x speedup.
45
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist, PipeDream, Blink, Hub
Diversity in workloads, hardware, and frameworks.
TBD: Training Benchmark for DNNs