Project Fiddle
Fast & Efficient Infrastructure for
Distributed Deep Learning
Amar Phanishayee
with
Nikhil Devanur, Aaron Harlap, Animesh Jain,
Liang Luo, Deepak Narayanan, Jacob Nelson,
Gennady Pekhimenko, Vivek Seshadri, Jorgen Thelin,
Shivaram Venkataraman, Guanhua Wang
Big Data Systems – Gen AI
2
• Motivating Scenarios: image recognition and classification; translation; speech–text–speech
• Solutions: DistBelief
Deep Learning for X
3
• Image recognition, classification
• Segmentation
• Captioning
• Stereo
• Depth estimation
• Video action recognition
• Sentiment Analysis
• Speech-Text, Text-Speech, OCR
• Translation
• …
Training is time and resource intensive
• Long-running training jobs: days to multiple weeks
• Train, Validate, Repeat
• Performance and accuracy specs validated only by running the system
• Naïvely parallelizing work can be detrimental
• No luxury of limiting solution to one part of the stack
4
[Figure: workflow across a training cluster and a serving cluster – training and serving jobs are triggered by the data scientist; specialization]
[Chart: breakdown of time into communication vs. computation – GPU utilization is below 100% (as low as 47%)]
• Overcome memory bottleneck
• Restructure computation
• Faster interconnects need new protocols; rethink what we push to the network
6
Problem
Systematically speed up distributed learning tasks while eking out the most from the resources used
Goal: 100x more efficient training
Day-long training of a model in a 15-minute coffee break
Approach
Cut through layers of the system stack
and optimize all relevant parts
7
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist: Train larger networks on a single GPU by reducing memory footprint.
PipeDream: Efficiently scale training. New way to parallelize DNN computation.
Blink: Blazingly fast transfers over PCIe + NVLink + IB.
Hub: Fast parameter servers.
8
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist: Train larger networks on a single GPU by reducing memory footprint. Up to 2x compression ratio.
PipeDream: Efficiently scale training. New way to parallelize DNN computation. Up to 3x faster training.
Blink: Blazingly fast transfers over PCIe + NVLink + IB. Collective transfer primitives 8x faster than NCCL.
Hub: Fast parameter servers. Up to 3x speedup.
9
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist: Train larger networks on a single GPU by reducing memory footprint. Up to 2x compression ratio.
PipeDream: Efficiently scale training. New way to parallelize DNN computation. Up to 3x faster training.
Blink: Blazingly fast transfers over PCIe + NVLink + IB. Collective transfer primitives 9x faster than NCCL.
PSsst: Fast parameter servers. Up to 3x speedup.
10
Training
Examples: ⟨x1, y1⟩, ⟨x2, y2⟩, …
Learning Algorithm → Hypothesis $h(x, \theta) \approx y$
Learn $\theta$ in $g(\theta^T x)$ (or $g_\theta(x)$)
Cost over $m$ examples: $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} c(h(x_i, \theta), y_i)$
Gradient descent on $\nabla J(\theta)$: $\theta_i \leftarrow \theta_i - \alpha\,\frac{\partial J(\theta)}{\partial \theta_i}$
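As a rough illustration only (not from the talk), a minimal NumPy sketch of this training loop – hypothesis $h(x,\theta)=g(\theta^T x)$, cost $J(\theta)$, and the gradient-descent update – might look like the following; the sigmoid/cross-entropy choice is an assumption for concreteness.

```python
# Minimal sketch of the training loop described above (illustrative, not the
# project's code): hypothesis h(x, theta) = g(theta^T x), cost
# J(theta) = (1/m) * sum_i c(h(x_i, theta), y_i), and the update
# theta_i <- theta_i - alpha * dJ/dtheta_i.
import numpy as np

def sigmoid(z):                      # the nonlinearity g
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, epochs=100):
    m, d = X.shape                   # m examples <x_i, y_i>, d features
    theta = np.zeros(d)
    for _ in range(epochs):
        h = sigmoid(X @ theta)       # hypothesis for all examples
        grad = X.T @ (h - y) / m     # gradient of the cross-entropy cost
        theta -= alpha * grad        # gradient-descent step
    return theta
```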
Structure of DNN computation
11
[Figure: a DNN as a stack of layers with activations (feature maps) flowing between them toward the output labels (cat? dog?)]
Batch: run for all input items. Repeat over many “epochs”.
A neuron with inputs a, b, c, d computes $g(a \cdot \alpha + b \cdot \beta + c \cdot \gamma + d \cdot \delta)$, i.e. $g(\theta^T x)$.
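A tiny sketch of that single-neuron computation (illustrative values and names, assuming tanh as the nonlinearity g):

```python
# One neuron computes g(a*alpha + b*beta + c*gamma + d*delta), i.e. g(theta^T x)
# for inputs x = [a, b, c, d] and weights theta = [alpha, beta, gamma, delta].
import numpy as np

def neuron(x, theta, g=np.tanh):
    return g(np.dot(theta, x))       # weighted sum of inputs, then nonlinearity

print(neuron(np.array([1.0, 2.0, 3.0, 4.0]),      # a, b, c, d
             np.array([0.1, -0.2, 0.05, 0.3])))   # alpha, beta, gamma, delta
```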
Tasks, datasets, challenges
• 1989: MNIST
• 2001: Caltech 101
• 2007 – 2012: PASCAL Visual Object Classes
• 20 classes
• 2014 – … : ImageNet 1K
• Deng et al.
• Classification and Detection
13
Profile of memory consumption
14
[Figure: a DNN layer with Input Feature Map (X), Output Feature Map (Y), Input Gradient (dx), and Output Gradient (dy); dx = f(X, Y, dy)]
[Chart: CNTK profiling of AlexNet, NiN, Overfeat, VGG-16, and Inception v3 – feature maps are a major consumer of GPU memory, and a larger minibatch size can crash the run]
Profile of memory consumption
[Chart: the same CNTK profiling across AlexNet, NiN, Overfeat, VGG-16, and Inception v3 – the heavy hitters are the Relu->Pool and Relu/Pool->Conv feature maps]
Idea: Encode data structures when unused,
Decode on use
16
[Figure: timeline of layer Li's forward and backward pass – feature map X is produced in the forward pass and used again only in the backward pass (dx computed from X, Y, dy); X is encoded to a succinct form after its forward use and decoded back to full fidelity just before its backward use]
• Long temporal gap between the two uses of X
• Statically construct a usage schedule for all data structures
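A hypothetical sketch of this lifecycle (not CNTK's actual allocator; the class and method names are invented for illustration). The statically constructed usage schedule decides where the encode/decode calls are placed for each feature map:

```python
# Between a feature map's forward-pass use and its much later backward-pass use,
# keep only a succinct encoding; decode back to full fidelity right before the
# backward use. Illustrative sketch of the idea, not the Gist implementation.
class StashedFeatureMap:
    def __init__(self, encode, decode):
        self.encode, self.decode = encode, decode
        self.succinct = None

    def after_last_forward_use(self, X):
        self.succinct = self.encode(X)     # succinct form lives through the long gap
        # the full-fidelity buffer for X can now be returned to the allocator

    def before_backward_use(self):
        return self.decode(self.succinct)  # full fidelity again, only when needed
```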
Gist Architecture
17
Static
memory allocator
CNTK Runtime
(computation engine)
• Statically construct usage schedule
for all data structures
• Layer-specific encodings
• Lossless (Binarize, Sparse Storage/Dense Compute)
• Lossy (Delayed Precision Reduction)
Gist Lossless Encodings
18
For Relu->Pool (Relu backward propagation): Binarize – a 1-bit representation of the 32-bit values in Y, a 32x reduction.
For Relu/Pool->Conv: naively applying Binarize here can increase memory consumption!
Gist Lossless Encodings
19
For Relu->Pool (Relu backward propagation): Binarize – a 1-bit representation of the 32-bit values in Y.
For Relu/Pool->Conv: Sparse Storage, Dense Compute (sparse compute is computationally expensive).
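A minimal NumPy/SciPy sketch of what these two lossless encodings could look like, under the assumption that only the zero pattern of a ReLU output is needed for its backward pass; function names are illustrative, not CNTK code:

```python
# Hypothetical sketch of Gist's two lossless encodings.
import numpy as np
from scipy import sparse

# (1) Binarize, for Relu->Pool: ReLU backward only needs to know which outputs
# were zero, so stash Y as 1 bit per 32-bit value (32x smaller).
def binarize_encode(Y):
    return np.packbits(Y > 0), Y.shape

def binarize_decode(bits, shape):
    mask = np.unpackbits(bits)[: int(np.prod(shape))].reshape(shape)
    return mask.astype(np.float32)           # 1.0 where Y > 0, else 0.0

def relu_backward(dy, stashed):
    return dy * binarize_decode(*stashed)    # dx = dy where Y > 0

# (2) Sparse Storage, Dense Compute, for Relu/Pool->Conv: the stashed map is
# mostly zero, so store it sparsely, but expand to dense before the conv's
# backward compute (sparse compute itself would be too expensive).
def sparse_encode(Y2d):                      # expects a 2-D view of the map
    return sparse.csr_matrix(Y2d)

def sparse_decode(Ysp):
    return Ysp.toarray()
```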
Gist Lossy Encodings
• Used for all other feature maps
• Training with reduced precision (8/10/16 bits)
– done only when encoding/stashing values
• Forward pass uses full fidelity values
20
[Chart: precision required per feature map – 8 bits is enough for some, 10 bits for others; where 8/10 bits are not enough, 16 bits works out]
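A rough sketch of Delayed Precision Reduction under these assumptions: the forward pass keeps computing on full-fidelity fp32 values, and only the copy stashed for the backward pass is quantized (shown here at 8 bits). The names and the uniform-quantization scheme are illustrative, not Gist's exact encoding:

```python
# Stash-time quantization only; the forward pass never sees reduced precision.
import numpy as np

def dpr_encode(X):
    lo, hi = float(X.min()), float(X.max())
    scale = (hi - lo) / 255.0 or 1.0              # avoid div-by-zero on constant maps
    q = np.round((X - lo) / scale).astype(np.uint8)   # 8-bit stash, 4x smaller
    return q, lo, scale

def dpr_decode(q, lo, scale):
    return q.astype(np.float32) * scale + lo      # approximate values for backward
```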
Gist – Putting it all together
21
Memory Footprint Reduction
Up to 2x
compression ratio
Minimal runtime overhead
(1 - 7%) for same batch size
With just lossless,
we go faster at times
(memory bandwidth bound)
22
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist: Train larger networks on a single GPU by reducing memory footprint. Up to 2x compression ratio.
PipeDream: Efficiently scale training. New way to parallelize DNN computation. Up to 3x faster training.
Blink: Blazingly fast transfers over PCIe + NVLink + IB. Collective transfer primitives 9x faster than NCCL.
PSsst: Fast parameter servers. Up to 3x speedup.
Distributed Deep Learning
• Data Parallelism: Replicas run in parallel on different machines
• Occasionally exchange what they have learnt
• Naïve setup can hurt convergence, accuracy
• Total time = (Time per Epoch) * (#Epochs for given accuracy)
23
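An illustrative data-parallel step, sketching what “occasionally exchange what they have learnt” means in the simplest synchronous case (gradient averaging); this is not the project's code, and `grad_fn` is a placeholder:

```python
# Each replica computes a gradient on its own shard of the mini-batch, then the
# replicas exchange what they learnt by averaging gradients before every replica
# applies the same update.
import numpy as np

def data_parallel_step(w, shards, grad_fn, lr=0.01):
    grads = [grad_fn(w, X, y) for (X, y) in shards]   # one gradient per replica
    avg_grad = np.mean(grads, axis=0)                 # exchange / average
    return w - lr * avg_grad                          # identical update everywhere
```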
Manually tuned for performance of individual jobs
[Figure: data-parallel timeline – each worker alternates Compute and Communication phases; Hardware Efficiency governs the time per epoch, Statistical Efficiency the #epochs needed for a given accuracy]
Pipeline Parallelism in Fiddle
• Idea: Pipeline computation across machines
• For training tasks, optimize for throughput
• Each machine runs a subset of layers
– Compute one thing, data flows to compute task
– Better FLOPs due to cache locality
• Fits CPU, GPU, hybrid, heterogeneous clusters
24
[Figure: forward compute, forward communication, backward compute, and backward communication overlapped in the pipeline (x4)]
• No bubbles in pipeline
• How should we divvy work across machines?
Challenges
25
DNN → Run → Fiddle Profiler → Fiddle Optimizer
$r_i = \text{runtime}(f_i + b_i)$,  $t_i = \text{transfer-time}(i, i+1)$
$l_{ij} = \max\left\{ \sum_{k=i}^{j} r_k,\ t_j \right\}$
Given the # machines, the optimizer picks the partition $P$ of layers into contiguous stages that minimizes the slowest stage: $\min_P \max_{[i,j] \in P} l_{ij}$
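A small dynamic-programming sketch of this partitioning problem (without stage replication), assuming per-layer runtimes `r` and inter-layer transfer times `t` from the profiler; this is an illustrative reconstruction, not the Fiddle optimizer's code:

```python
# Split layers into contiguous stages so the slowest stage,
# l_ij = max(sum_{k=i..j} r_k, t_j), is as fast as possible.
from functools import lru_cache

def best_partition(r, t, machines):
    n = len(r)
    prefix = [0.0]
    for x in r:
        prefix.append(prefix[-1] + x)

    def stage_time(i, j):                       # layers i..j on one machine
        compute = prefix[j + 1] - prefix[i]
        transfer = t[j] if j + 1 < n else 0.0   # transfer to the next stage
        return max(compute, transfer)

    @lru_cache(maxsize=None)
    def solve(i, m):                            # place layers i..n-1 on m machines
        if m == 1:
            return stage_time(i, n - 1)
        best = float("inf")
        for j in range(i, n - m + 1):           # first stage is layers i..j
            best = min(best, max(stage_time(i, j), solve(j + 1, m - 1)))
        return best

    return solve(0, machines)                   # bottleneck stage time
```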
• No bubbles in pipeline
• How should we divvy work across machines?
Challenges
27
[Figure: an unbalanced split with stages at 0.5x, 1x, 2x, 0.5x – the pipeline runs at the speed of the slowest layer, leaving idle cycles on the other stages]
• No bubbles in pipeline
• How should we divvy work across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
Challenges
28
[Figure: three stages with relative load 1X, 1X, 2X]
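A minimal sketch of the static load-balancing rule from this slide; the routing function is the only logic, everything else is assumed context:

```python
# When a stage is replicated, mini-batch i is routed to replica
# (mini_batch_id % num_replicas), so the forward and backward passes of the same
# mini-batch always take the same path.
def route(mini_batch_id, num_replicas):
    return mini_batch_id % num_replicas

# e.g. with 2 replicas: mini-batches 0, 2, 4, ... go to replica 0 and
# 1, 3, 5, ... go to replica 1, for both the forward and the backward pass.
assert [route(i, 2) for i in range(6)] == [0, 1, 0, 1, 0, 1]
```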
Challenges
29
[Figure: after replicating the heavy stage, all stages run at an effective 1X]
• No bubbles in pipeline
• How should we divvy work across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
[Figure: a pipeline of S stages]
Challenges
30
• No bubbles in pipeline
• How should we divvy work across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
Challenges
31
[Figure: balanced stages, each at 1X]
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
[Figure: a pipeline of S stages]
Challenges
32
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
[Figure: a stage with a Forward Queue and a Backward Queue; mini-batch mbi arrives on the forward queue]
Challenges
33
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
[Figure: the stage's forward output for a mini-batch is sent on to the next level]
Challenges
34
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
[Figure: the gradient for an earlier mini-batch mbi-x arrives on the stage's backward queue]
Challenges
35
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
[Figure: the stage's backward output is sent back to the previous level]
Challenges
36
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
• Statistical Efficiency
• Affected by feature map and
model parameter versions
• Each incomplete mini-batch
needs to stash feature maps
between F and B passes
• Should admitted items see
the updates of the same set
of minibatches across levels
(not newer ones)?
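A hypothetical sketch of the steady-state schedule these slides describe: a stage holds a forward queue and a backward queue, alternates between forward and backward work, and stashes each in-flight mini-batch's feature maps between its two passes. The class and callback names are invented for illustration; this is not PipeDream's runtime:

```python
from collections import deque

class Stage:
    def __init__(self, forward_fn, backward_fn):
        self.forward_fn, self.backward_fn = forward_fn, backward_fn
        self.forward_q, self.backward_q = deque(), deque()
        self.stash = {}                          # mini-batch id -> feature maps

    def step(self):
        # Do backward work first when it is available; in steady state, with one
        # backward arriving per forward, this alternates between F and B.
        if self.backward_q:
            mb_id, grad_in = self.backward_q.popleft()
            acts = self.stash.pop(mb_id)         # reuse the stashed feature maps
            return ("backward", mb_id, self.backward_fn(acts, grad_in))
        if self.forward_q:
            mb_id, inputs = self.forward_q.popleft()
            acts = self.forward_fn(inputs)
            self.stash[mb_id] = acts             # keep until this mb's backward pass
            return ("forward", mb_id, acts)
        return None                              # idle: nothing admitted yet
```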
Valid Gradients Or Bust
PipeDream with weight stashing is critical for valid gradients
(use the same weights across F and B within every stage)
37
Minibatch SGD: $w^{(t+1)} = w^{(t)} - \mu \cdot \nabla f(w_1^{(t)}, w_2^{(t)}, \ldots, w_n^{(t)})$
PipeDream with weight stashing: $w^{(t+1)} = w^{(t)} - \mu \cdot \nabla f(w_1^{(t-n+1)}, w_2^{(t-n+2)}, \ldots, w_n^{(t)})$
Valid Gradients Or Bust
PipeDream with weight stashing is critical for valid gradients
(use the same weights across F and B within every stage)
38
Minibatch SGD: $w^{(t+1)} = w^{(t)} - \mu \cdot \nabla f(w_1^{(t)}, w_2^{(t)}, \ldots, w_n^{(t)})$
PipeDream with weight stashing: $w^{(t+1)} = w^{(t)} - \mu \cdot \nabla f(w_1^{(t-n+1)}, w_2^{(t-n+2)}, \ldots, w_n^{(t)})$
PipeDream with vertical sync maps to bounded staleness
(use the same weights across all stages)
$w^{(t+1)} = w^{(t)} - \mu \cdot \nabla f(w_1^{(t-n+1)}, w_2^{(t-n+1)}, \ldots, w_n^{(t-n+1)})$
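A minimal sketch of weight stashing, assuming one weight buffer per stage and illustrative method names (not PipeDream's memory manager): each mini-batch's backward pass reuses exactly the weight version its forward pass saw, which yields the staggered-but-consistent versions in the update rule above.

```python
class WeightStash:
    def __init__(self, initial_weights):
        self.current = initial_weights
        self.stash = {}                      # mini-batch id -> weights used forward

    def weights_for_forward(self, mb_id):
        self.stash[mb_id] = self.current     # stash the version this mb will use
        return self.current

    def weights_for_backward(self, mb_id):
        return self.stash.pop(mb_id)         # same version as the forward pass

    def apply_update(self, new_weights):
        self.current = new_weights           # later mini-batches see the update
```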
Challenges
39
• No bubbles in pipeline
• How should we divvy work
across machines?
• Replicate stages
(if needed)
– Static load balancing
mini-batch-id % stage-replica-id
– F/B paths consistent
• How many items to admit?
– Wild, S, less?
• In steady state:
Each new item learns from a
previous (new) mini-batch
• Stage alternates between F/B
• Statistical Efficiency
• Affected by feature map (FM)
and model parameter versions
• Each incomplete mini-batch
needs to stash feature maps
between F and B passes
• Should admitted items see the
updates of the same set of
minibatches across levels (not
newer ones)?
• Memory Manager
• Stashing: static memory buffer pool (FMs, weights, weight updates)
• Use the same weights across F/B within every stage (valid gradient)
• Vertical Sync
Stage Runtime Architecture
40
[Figure: stage runtime components – a Caffe compute thread that gets GPU memory for its output; a client-side cache with parameter versioning in front of a sharded parameter server (read model params, update model); an intermediate data manager (read intermediate data); Fw/Bw network receive and send threads; Fw/Bw copy-to-GPU and copy-to-CPU threads]
Encouraging results
42
Up to 3x
faster, more efficient
Effect of smaller minibatches on hardware efficiency and statistical efficiency in data-parallel runs
44
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist: Train larger networks on a single GPU by reducing memory footprint. Up to 2x compression ratio.
PipeDream: Efficiently scale training. New way to parallelize DNN computation. Up to 3x faster training.
Blink: Blazingly fast transfers over PCIe + NVLink + IB. Collective transfer primitives 8x faster than NCCL.
Hub: Fast parameter servers. Up to 3x speedup.
45
Single-Machine Training (Memory, Computation) / Multi-Machine Training (Interconnects)
Gist, PipeDream, Blink, Hub
Diversity in workloads, hardware, and frameworks.
TBD: Training Benchmark for DNNs