Toward Accelerating Deep Learning at Scale Using Specialized Hardware in the Datacenter


Post on 21-May-2020


Case study: CNN-based Image Classification (inference)


+ Excellent maintainability in datacenter

+ Maximum flexibility for all workloads

- Performance of CNNs/DNNs vastly slower than specialized HW


+ CNNs/DNNs that utilize GPUs or ASICs benefit significantly

- CNNs/DNNs cannot scale beyond limited pools

- Heterogeneity challenging for maintainability


+ Homogeneous

- Increased cost and power per server (particularly GPUs)

- Not economical for all applications in the datacenter (GPUs and ASICs)


+ Homogeneous

+ Low overhead in power and cost per server

+ Flexibility benefits many workloads?

- Lower peak performance than GPUs or ASICs on some workloads

Overtake through scale?

???


http://www.wired.com/2014/06/microsoft-fpga/


[Figure: four FPGAs shared across workloads. Labels: Deep Learning Service, Compression, Physics Engine, Web Search Pipeline]


• Altera Stratix V D5

• 172,600 ALMs, 2,014 M20Ks, 1,590 DSPs

• PCIe Gen 3 x8

• 8GB DDR3-1333

• Powered by PCIe slot

• Torus Network


• Two 8-core Xeon 2.1 GHz CPUs

• 64 GB DRAM

• 4 HDDs @ 2 TB, 2 SSDs @ 512 GB

• 10 Gb Ethernet

• No cable attachments to server

68 °C


[Figure: block diagram. Labels: Host, NIC, ASIC, FPGA, CPU, ToR]


[Figure: the Krizhevsky CNN, from INPUT to OUTPUT ("Dog")]

Input image (RGB): 224 x 224 x 3, 11 x 11 kernels with a stride of 4

3-D Convolution and Max Pooling: 55 x 55 x 96 (5 x 5 kernels, max pooling), 27 x 27 x 256 (3 x 3 kernels, max pooling), 13 x 13 x 384 (3 x 3 kernels), 13 x 13 x 384 (3 x 3 kernels), 13 x 13 x 256 (max pooling)

Dense Layers: 4096, 4096, 1000
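The spatial sizes in the figure (224, 55, 27, 13) can be reproduced with the usual output-size formula for convolution and pooling. The sketch below is only illustrative: the padding values and the 3 x 3, stride-2 pooling windows are assumptions borrowed from common implementations of this network, not values stated on the slide, and the function names are hypothetical.

```python
def out_size(n, k, stride=1, pad=0):
    """Output height/width for an n x n input and a k x k window."""
    return (n + 2 * pad - k) // stride + 1


def alexnet_spatial_sizes(n=224):
    """Spatial size after each conv/pool stage (paddings are assumed)."""
    sizes = []
    n = out_size(n, 11, stride=4, pad=2); sizes.append(n)  # conv, 11 x 11, stride 4
    n = out_size(n, 3, stride=2);         sizes.append(n)  # max pooling (assumed 3 x 3, stride 2)
    n = out_size(n, 5, pad=2);            sizes.append(n)  # conv, 5 x 5
    n = out_size(n, 3, stride=2);         sizes.append(n)  # max pooling
    n = out_size(n, 3, pad=1);            sizes.append(n)  # conv, 3 x 3
    n = out_size(n, 3, pad=1);            sizes.append(n)  # conv, 3 x 3
    n = out_size(n, 3, pad=1);            sizes.append(n)  # conv, 3 x 3
    n = out_size(n, 3, stride=2);         sizes.append(n)  # final max pooling
    return sizes


if __name__ == "__main__":
    print(alexnet_spatial_sizes())
```

Under these assumptions the sequence passes through 55, 27, and 13, matching the labels in the figure.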


[Figure: Input Feature Map (N x N x D), Convolution Output (H feature maps), Max Pooled Output (optional)]

Convolution between k x k x D kernel and region of Input Feature Map

Max value over p x p region

N = input height and width

k = kernel height and width

D = input depth

H = # feature maps

S = kernel stride

* N, k, H, and p may vary across layers
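As a rough illustration of the two operations defined above, the following NumPy sketch applies H kernels of size k x k x D to an N x N x D input with stride S, then takes the max over p x p regions. The function names, the absence of padding, and the non-overlapping pooling windows are assumptions made for brevity; they are not taken from the slides.

```python
import numpy as np


def conv_layer(x, w, S=1):
    """x: (N, N, D) input feature map; w: (H, k, k, D) kernels; S: kernel stride."""
    N = x.shape[0]
    H, k = w.shape[0], w.shape[1]
    out_n = (N - k) // S + 1  # no padding (assumption)
    y = np.zeros((out_n, out_n, H))
    for i in range(out_n):
        for j in range(out_n):
            # k x k x D region of the input feature map
            region = x[i * S:i * S + k, j * S:j * S + k, :]
            # Dot product of the region with each of the H kernels
            y[i, j, :] = np.tensordot(w, region, axes=([1, 2, 3], [0, 1, 2]))
    return y


def max_pool(y, p):
    """Max value over non-overlapping p x p regions (assumption: stride = p)."""
    n = y.shape[0] // p
    y = y[:n * p, :n * p, :]  # drop any ragged border
    return y.reshape(n, p, n, p, -1).max(axis=(1, 3))


if __name__ == "__main__":
    x = np.random.rand(8, 8, 3)      # N = 8, D = 3
    w = np.random.rand(4, 3, 3, 3)   # H = 4, k = 3
    print(max_pool(conv_layer(x, w, S=1), p=2).shape)
```

The explicit loops mirror the sliding-window picture on the slide; a production implementation would vectorize or offload this computation.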


[Figure (animation frames): Input, Kernel Weights, Output]


[Figure: accelerator architecture built around a 4 x 4 array of processing elements (PEs). Labels: Input Volume (x, y, z); Input Volume Segment 0, 1, ..., N-2, N-1; Input kernel weight 0, 1, ..., M-2, M-1; Output feature map; Broadcast; Scan chain; Address generation; Layer Controller; Top Controller; Layer Config.; Data re-distribution; Output Volume (x, y, z)]


[Figure: processing element datapath. Labels: multiplier (x) with inputs Wijk and Dijk; ACC; FF(ACC); MUX (with a 0 input); MUX; FF(SR); From south; Termination; To north]


Platform | Library/OS | ImageNet 1K Inference Throughput | Peak TFLOPs | Effective TFLOPs | Estimated Peak Power with Server | Estimated GOPs/J (assuming peak power)
16-core, 2-socket Xeon E5-2450, 2.1GHz | Caffe + Intel MKL, Ubuntu 14.04.1* | 53 images/s | 0.27T | 0.074T (27%) | ~225W | ~0.3
Arria 10 GX1150 | Windows Server 2012 | 369 images/s¹ | 1.366T | 0.51T (38%) | ~265W | ~1.9

¹ Dense layer time estimated

² https://github.com/soumith/convnet-benchmarks


Platform | Library/OS | ImageNet 1K Inference Throughput | Peak TFLOPs | Effective TFLOPs | Estimated Peak Power with Server | Estimated GOPs/J (assuming peak power)
16-core, 2-socket Xeon E5-2450, 2.1GHz | Caffe + Intel MKL, Ubuntu 14.04.1* | 53 images/s | 0.27T | 0.074T (27%) | ~225W | ~0.3
Arria 10 GX1150 | Windows Server 2012 | 369 images/s¹ | 1.366T | 0.51T (38%) | ~265W | ~1.9
NervanaSys-32 on NVIDIA Titan X | NervanaSys-32 on Ubuntu 14.04 | 4129 images/s² | 6.1T | 5.75T (94%) | ~475W | ~12.1

Includes server power; however, CPUs available to other jobs in the datacenter

¹ Dense layer time estimated

² https://github.com/soumith/convnet-benchmarks


Platform | Library/OS | ImageNet 1K Inference Throughput | Peak TFLOPs | Effective TFLOPs | Estimated Peak Power for CNN Computation | Estimated GOPs/J (assuming peak power)
16-core, 2-socket Xeon E5-2450, 2.1GHz | Caffe + Intel MKL, Ubuntu 14.04.1* | 53 images/s | 0.27T | 0.074T (27%) | ~225W | ~0.3
Arria 10 GX1150 | Windows Server 2012 | 369 images/s¹ | 1.366T | 0.51T (38%) | ~40W | ~12.8
NervanaSys-32 on NVIDIA Titan X | NervanaSys-32 on Ubuntu 14.04 | 4129 images/s² | 6.1T | 5.75T (94%) | ~250W | ~23.0

¹ Dense layer time estimated

² https://github.com/soumith/convnet-benchmarks

Under-utilized FPGA vs. highly tuned GPU-friendly workload
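The GOPs/J and utilization figures in the tables above are consistent with simple ratios: effective throughput divided by estimated peak power, and effective TFLOPs divided by peak TFLOPs. The sketch below reproduces that arithmetic with values copied from the tables. The formulas are inferred from the numbers rather than stated on the slides, and the function and variable names are illustrative.

```python
def gops_per_joule(effective_tflops, watts):
    """Effective GOPs per second divided by watts (J/s) gives GOPs per joule."""
    return effective_tflops * 1000.0 / watts


def utilization(effective_tflops, peak_tflops):
    """Fraction of peak throughput actually achieved."""
    return effective_tflops / peak_tflops


# (platform, peak TFLOPs, effective TFLOPs, watts with server, watts for CNN computation)
ROWS = [
    ("Xeon E5-2450 (2-socket)", 0.27, 0.074, 225.0, 225.0),
    ("Arria 10 GX1150", 1.366, 0.51, 265.0, 40.0),
    ("NVIDIA Titan X", 6.1, 5.75, 475.0, 250.0),
]

if __name__ == "__main__":
    for name, peak, eff, w_server, w_cnn in ROWS:
        print(
            f"{name}: {utilization(eff, peak):.0%} of peak, "
            f"{gops_per_joule(eff, w_server):.1f} GOPs/J with server, "
            f"{gops_per_joule(eff, w_cnn):.1f} GOPs/J for CNN computation"
        )
```

Small differences from the slide values are expected, since the table entries are themselves rounded estimates.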


Platform | Library/OS | ImageNet 1K Inference Throughput | Peak TFLOPs | Effective TFLOPs | Estimated Peak Power for CNN Computation | Estimated GOPs/J (assuming peak power)
16-core, 2-socket Xeon E5-2450, 2.1GHz | Caffe + Intel MKL, Ubuntu 14.04.1* | 53 images/s | 0.27T | 0.074T (27%) | ~225W | ~0.3
Arria 10 GX1150 | Windows Server 2012 | 369 images/s¹ | 1.366T | 0.51T (38%) | ~40W | 20.6
 | | ~880 images/s | | ~1.2T (89%) | | ~30.6
NervanaSys-32 on NVIDIA Titan X | NervanaSys-32 on Ubuntu 14.04 | 4129 images/s² | 6.1T | 5.75T (94%) | ~250W | ~23.0

¹ Dense layer time estimated

² https://github.com/soumith/convnet-benchmarks

Projected Results Assuming Floorplanning and Scaling Up PEs


Yes.
