Page 1

Low Power Convolutional Neural Networks on a Chip

Yu Wang, Lixue Xia, Tianqi Tang, Boxun Li, Song Yao, Ming Cheng, Huazhong Yang

Dept. of E.E., Tsinghua National Laboratory for Information Science and Technology (TNList)

Tsinghua University, Beijing, China
e-mail: [email protected]

Page 2

Outline

• Background
• FPGA Accelerator
• CNN on RRAM

Page 3

CNN: Application and Performance

• CNN: state of the art in visual recognition applications

– Tracking [UIUC2015]
– Vehicle and lane detection [Stanford2015]
– Pedestrian detection [arXiv2015]
– Google Translate app (July 2015)

Page 4

NN: Complexity

• AlexNet (a.k.a. CaffeNet) (2012)
• GoogLeNet (2015)

Computational complexity and energy consumption grow rapidly as networks chase better and better recognition accuracy.

Page 5

Energy Efficient Circuits and Systems

Energy Efficiency = Operations / Energy = OP/J = (OP/s) / W
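For example, with the numbers from the FPGA comparison on page 12, the XC7Z045 design delivers 136.97 GOP/s at 9.63 W, i.e. 136.97 / 9.63 ≈ 14.22 GOP/J.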

How to improve the energy efficiency?

(Figure: complexity vs. energy, with example systems ranging from micro to macro to aerospace scale.)

Page 6

Energy efficiency for non-computational workloads: cognitive and other related applications

(Diagram: the architecture spectrum from CPU to XPU/FPGA/ASIC to brain-like designs, trading delay and energy.)

• Architecture: from irrelevant to the semantics of the workload → relevant
• Architecture: from simple to complex

Page 7

Embedded GPU for Object Detection

• Pipelined Fast R-CNN on an embedded GPU
  – Software: algorithm selection & modification for low-power object detection
  – Hardware: two-stage pipeline on the NVIDIA TK1 embedded GPU (a rough sketch follows)
  – ~0.8 sec/image at 9.6 W
  – Champion of the 1st Low-Power Image Recognition Challenge (LPIRC)
  – Winner of the HIGHEST ACCURACY with low energy
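The slide does not say what the two pipeline stages are; as a rough illustration only, the sketch below assumes a CPU-side preparation stage (decode and region proposal) overlapped with a GPU-side CNN stage, so that frame n+1 is being prepared while frame n is being classified. The stage functions are hypothetical placeholders, not the actual LPIRC code.

```python
import queue
import threading

def stage1_prepare(frame):
    # Hypothetical CPU-side work: decode, resize, generate region proposals.
    return ("proposals", frame)

def stage2_classify(prepared):
    # Hypothetical GPU-side work: run the CNN on the proposed regions.
    return ("detections", prepared)

def pipeline(frames):
    q = queue.Queue(maxsize=2)          # small buffer decoupling the two stages
    results = []

    def producer():
        for f in frames:
            q.put(stage1_prepare(f))    # stage 1 of frame n+1 runs while ...
        q.put(None)                     # sentinel: no more frames

    def consumer():
        while (item := q.get()) is not None:
            results.append(stage2_classify(item))  # ... stage 2 of frame n runs here

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

print(len(pipeline(range(5))))          # -> 5
```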

Page 8

Outline

• Background
• FPGA Accelerator
• CNN on RRAM

Page 9

Architecture and Implementation Details

• Overall Architecture

(Block diagram: the Processing System (CPU, external memory, DMA, data & instruction bus) connects to the Programmable Logic, which holds the Computing Complex with multiple PEs, input and output buffers, a FIFO, a controller, and the configuration bus.)

• Processing System
  – Flexibility
  – CPU + DDR
  – Scheduling operations
  – Prepares data and instructions
  – Realizes the Softmax function
• Programmable Logic
  – Hardware acceleration
  – Computing Complex + on-chip buffers + Controller + DMA
  – A few complex PEs
• Achieves three levels of parallelism
  – Inter-output: multiple PEs
  – Intra-output
  – Operator-level
• 16-bit dynamic-precision data quantization (a minimal sketch follows this list)
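The slides do not spell out how the 16-bit dynamic-precision quantization works; as a minimal sketch, assuming "dynamic precision" means choosing a per-layer fractional bit-width (radix point) that minimizes quantization error under a fixed 16-bit word length, it could look like the Python below. The function names and the error metric are illustrative assumptions, not the accelerator's actual flow.

```python
import numpy as np

def quantize(x, word_len, frac_len):
    """Round x to fixed point with `word_len` total bits and `frac_len` fractional bits."""
    step = 2.0 ** (-frac_len)
    max_val = (2 ** (word_len - 1) - 1) * step   # symmetric signed range
    min_val = -(2 ** (word_len - 1)) * step
    return np.clip(np.round(x / step) * step, min_val, max_val)

def choose_frac_len(x, word_len=16, search=range(-8, 16)):
    """Pick the fractional length that minimizes quantization error for this layer's data."""
    errors = {fl: np.abs(x - quantize(x, word_len, fl)).sum() for fl in search}
    return min(errors, key=errors.get)

# Toy usage: each layer's weights get their own radix point.
layer_weights = [np.random.randn(64) * s for s in (0.05, 0.5, 4.0)]
for i, w in enumerate(layer_weights):
    print(f"layer {i}: fractional bits = {choose_frac_len(w)}")
```

The point of the per-layer choice is that weights and activations in different layers cover very different dynamic ranges, so a single fixed radix point would waste bits.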

Page 10

Architecture and Implementation Details

• Processing Engine Architecture

(PE diagram: the Input Buffer streams data, weights, and bias into the Convolver Complex; an Adder Tree sums the partial results; NL (ReLU) and Pool units follow; Bias Shift and Data Shift align the fixed-point data; intermediate data and final results flow to the Output Buffer under a Controller.)

• Achieves intra-output parallelism by placing multiple Convolvers
• Convolver: optimized for the 3×3 convolution operation
• Adder Tree: sums up the results of one convolution operation
• NL: supports the non-linear function (ReLU)
• Pool: supports max-pooling
• Bias Shift & Data Shift: support dynamic-precision fixed-point numbers
(A software-level sketch of this dataflow follows.)
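As a purely software-level sketch of the dataflow listed above (not the RTL, and with floating point standing in for the fixed-point arithmetic): several 3×3 convolvers work on different input channels, an adder tree sums their partial results together with a shifted bias, NL applies ReLU, and Pool applies 2×2 max-pooling.

```python
import numpy as np

def convolver_3x3(channel, kernel):
    """One convolver: valid 3x3 convolution over a single input channel."""
    H, W = channel.shape
    out = np.zeros((H - 2, W - 2))
    for r in range(H - 2):
        for c in range(W - 2):
            out[r, c] = np.sum(channel[r:r+3, c:c+3] * kernel)
    return out

def pe_forward(in_channels, kernels, bias, bias_shift=0):
    # Adder tree: sum the partial results of all convolvers, plus the shifted bias.
    partial = sum(convolver_3x3(ch, k) for ch, k in zip(in_channels, kernels))
    partial = partial + bias * (2.0 ** bias_shift)   # Bias Shift aligns the radix point
    relu = np.maximum(partial, 0)                    # NL: ReLU
    H, W = relu.shape
    pooled = relu[:H//2*2, :W//2*2].reshape(H//2, 2, W//2, 2).max(axis=(1, 3))  # Pool: 2x2 max
    return pooled

x = [np.random.randn(8, 8) for _ in range(3)]        # 3 input channels
w = [np.random.randn(3, 3) for _ in range(3)]        # one 3x3 kernel per channel
print(pe_forward(x, w, bias=0.1).shape)              # -> (3, 3)
```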

Page 11

Architecture and Implementation Details

• Line-buffer design
  – Optimized for the 3×3 Convolver
  – Supports operator-level parallelism
  (A behavioral sketch follows the figure placeholder below.)

(Line-buffer diagram: input data and weights fill a data buffer and a weight buffer; two line delays expose 9 data inputs in parallel, which are multiplied with 9 weight inputs and summed by an adder tree to produce the output data.)
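The line-buffer figure itself did not survive extraction; as a behavioral sketch only, assuming an image of width W streamed row by row and two full line delays, the buffer below emits one 3×3 window per step once it is primed. The window indexing is an assumption about how such a buffer is usually arranged, not the exact FPGA design.

```python
from collections import deque

def line_buffer_windows(pixels, width):
    """Stream pixels row-major; yield a 3x3 window whenever one is fully buffered."""
    buf = deque(maxlen=2 * width + 3)     # two line delays plus the current 3 pixels
    for i, p in enumerate(pixels):
        buf.append(p)
        col = i % width
        # A full window exists once two rows plus three pixels are buffered and the
        # current column is >= 2, so the window does not straddle a row boundary.
        if len(buf) == buf.maxlen and col >= 2:
            b = list(buf)
            yield [b[0:3], b[width:width + 3], b[2 * width:2 * width + 3]]

stream = list(range(25))                  # a 5x5 image streamed row by row
print(sum(1 for _ in line_buffer_windows(stream, width=5)))   # -> 9 windows
```

Because all nine window values are available at once, the nine multipliers and the adder tree can consume one window per cycle, which is the operator-level parallelism mentioned above.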

Page 12

Performance Comparison

• Performance and Energy Efficiency Comparison

                           Chakradhar 2010   Gokhale 2014    Zhang 2015       Ours              Ours
Platform                   Virtex 5 SX240t   Zynq XC7Z045    Virtex7 VX485t   Zynq XC7Z045      Zynq XC7Z020
Clock (MHz)                120               150             100              150               100
Bandwidth (GB/s)           -                 4.2             12.8             4.2               4.2
Quantization               48-bit fixed      16-bit fixed    32-bit float     16-bit fixed      8-bit fixed
Problem Complexity (GOP)   0.52              0.552           1.33             30.76             .1
Performance (GOP/s)        16                23.18           61.62            136.97 (overall)  19.2
                                                                              187.89 (conv)
Power (W)                  14                8               18.61            9.63              2
Power Efficiency (GOP/J)   1.14              2.90            3.31             14.22 (overall)   9.6
                                                                              19.50 (conv)

Page 13

Video Demonstration

• Youku link: http://v.youku.com/v_show/id_XMTQ5MTI3NTM0OA==.html#paction
• YouTube link: https://www.youtube.com/watch?v=m4e1SV89Dpg

Page 14

Outline

• Background
• FPGA Accelerator
• CNN on RRAM

Page 15

Energy Efficiency Limitation of CMOS

• Scaling up will not improve the energy efficiency

For the CNN task:
– CPU: 1.5 GOPS/W; FPGA: 14.2 GOPS/W
– DaDianNao: 350 GOPS/W (peak)
– Brain: 500,000 GOPS/W, still a >1000× gap
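(As a quick check of the gap: 500,000 / 350 ≈ 1,400, i.e. well over 1000×.)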

(Diagram: CMOS scaling down buys ~10× in energy efficiency and accelerators (XPU, FPGA, ASIC) roughly another ~100×, but a question mark remains over how to close the rest of the gap toward brain-like efficiency.)

Page 16

(Figure: an RRAM crossbar; input voltages V_i1 … V_ik drive the rows, each cell has conductance g_kj, and the columns produce output voltages V_o1 … V_oj.)

RRAM-based Computation

• The brain is NOT Boolean
• Emerging devices, such as RRAM, provide a promising way to build better brain-inspired circuits and systems

V_oj = Σ_k r_s · V_ik · g_kj, where the conductance g_kj encodes the matrix element M_kj (up to the scale factor r_s)

• Merges memory & compute: a matrix-vector product goes from O(n²) to O(n⁰) operations
• I&F and LPF neuron circuits
• Plasticity: configured with voltage/current
• High density
• Non-volatile
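As a behavioral sketch only, ignoring device non-idealities, wire resistance, and the actual programming scheme, the crossbar's analog matrix-vector product can be modeled by mapping each matrix entry to a conductance and reading each column's summed current through a sensing resistance. The conductance range and r_s below are made-up values for illustration; real designs also need paired columns or a reference column to represent signed weights.

```python
import numpy as np

G_MIN, G_MAX = 1e-6, 1e-4      # assumed programmable conductance range (siemens)
R_S = 1e3                      # assumed sensing resistance at each column (ohms)

def program_crossbar(M):
    """Map a matrix M with entries normalized to [0, 1] onto RRAM conductances."""
    return G_MIN + (G_MAX - G_MIN) * M

def crossbar_mvm(G, v_in):
    """Column current I_j = sum_k V_ik * g_kj; read out as V_oj = R_s * I_j."""
    return R_S * (v_in @ G)

M = np.random.rand(4, 3)       # 4 inputs x 3 outputs, entries already in [0, 1]
v = np.random.rand(4) * 0.2    # small read voltages
v_out = crossbar_mvm(program_crossbar(M), v)
print(v_out.shape)             # -> (3,)
```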

Page 17

RRAM Crossbar

• 1 RRAM cell ≈ 1 m-bit multiplier + 1 m-bit adder + 1 m-bit register (SRAM)
• 1 RRAM crossbar ≈ a matrix-vector multiplication ASIC
• Non-volatile; merges memory & compute; ~100× efficiency gains

Page 18

Our Preliminary Work

Work spans three levels: circuit, architecture, and application.

• Device fault [DATE'14 / ICCAD'16 submitted]
• Device control (RD/WR) [JCST'16]
• Interface with CPU [DAC'15]
• Interface between crossbars [DAC'16]
• Mapping & compilation [ASPDAC'15]
• Processing-in-memory [ISCA'16]
• Simulator [DATE'16]
• Self-training with RRAM [ASPDAC'14 / ICCAD'16 submitted]
• A series of RRAM-based NN systems (ANN, SNN) [TCAD'15 / DATE'15 / GLSVLSI'15 / ISLPED'13]

(Diagram: interface between a CPU and a memristor array for approximate computing.)

Two chips have been taped out!

Page 19

Structure of CNN

• A CNN consists of cascaded convolutional (Conv) layers and fully connected (FC) layers
• Conv layers account for most of the computation in a CNN (see the distribution below)

(Bar chart: distribution of computation (GOP) across CONV1-CONV5 and FC6-FC8 for CaffeNet, ZF, VGG11, VGG16, and VGG19; the convolutional layers dominate.)

Page 20

RRAM-based Convolution

• The function of a convolution kernel is also a vector-vector multiplication
• Multiple Conv kernels share the same input data
  – The kernels of a layer can therefore be regarded as one matrix-vector multiplication (a minimal sketch follows)

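As a minimal sketch of this mapping (the function names are illustrative): unroll every 3×3 input patch into a length-9 vector, stack the K kernels as the columns of a 9×K matrix, and one matrix-vector product per patch then produces all K output channels at once, which is exactly the shape of work a crossbar performs.

```python
import numpy as np

def conv_as_mvm(image, kernels):
    """kernels: (K, 3, 3). Each 3x3 patch becomes a length-9 vector; all K kernels
    form a 9xK matrix, so one matrix-vector product gives K outputs per location."""
    K = kernels.shape[0]
    W = kernels.reshape(K, 9).T               # 9 x K weight matrix (the crossbar contents)
    H, Wd = image.shape
    out = np.zeros((K, H - 2, Wd - 2))
    for r in range(H - 2):
        for c in range(Wd - 2):
            patch = image[r:r+3, c:c+3].reshape(9)   # shared input vector
            out[:, r, c] = patch @ W                 # one MVM -> all K kernels
    return out

img = np.random.randn(6, 6)
ker = np.random.randn(4, 3, 3)
print(conv_as_mvm(img, ker).shape)            # -> (4, 4, 4)
```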

Page 21

Convolutional Layer on RRAM

• Implement convolution kernels on RRAM
  – Store the kernel weights on the RRAM devices
  – Feed the input data to multiple kernels simultaneously
• Peripheral functions are implemented in CMOS
• We use a line buffer similar to the one in our FPGA design

(Figure: the functions of a Conv layer and the corresponding RRAM-based Conv layer.)

Page 22

Experimental Results

                            GPU            FPGA           RRAM           ASIC [ISSCC 2016]
Network                     VGG 16         VGG 16         VGG 16         AlexNet Conv
Problem Complexity (GOP)    30.76          30.76          30.76          5.32
Weight (MB)                 528            264            132            4.6
Data (MB)                   127            63             32             1.56
Precision                   32-bit float   16-bit fixed   8-bit fixed    16-bit fixed
Top-1 Accuracy (%)          68.10          68.02          66.58          -
Top-5 Accuracy (%)          88.00          87.94          87.38          -
Energy Efficiency (GOPS/W)  7.14           14.22          462.67         166

• Improves energy efficiency by more than 40× compared with the FPGA and GPU implementations

Page 23

Conclusion

• We implement large-scale CNNs on an embedded FPGA-based chip
• The RRAM crossbar provides an even more efficient way to implement the main computation of a CNN
  – We are designing an RRAM-based CNN chip to verify its energy-efficiency potential

Page 24

References

[GoogLeNet] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.

[AlexNet] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, 2012: 1097-1105.

[FPGA16] Qiu J, et al. Going deeper with embedded FPGA platform for convolutional neural network. To appear in: FPGA 2016.

[Chen ISSCC 2016] Chen Y H, Krishna T, Emer J, Sze V. 14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In: 2016 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, 2016: 262-263.

[Gokhale 2014] Gokhale V, Jin J, Dundar A, Martini B, Culurciello E. A 240 G-ops/s mobile coprocessor for deep neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, 2014: 696-701.

[Zhang 2015] Zhang C, Li P, Sun G, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015: 161-170.

[Chakradhar 2010] Chakradhar S, Sankaradas M, Jakkula V, et al. A dynamically configurable coprocessor for convolutional neural networks. In: ACM SIGARCH Computer Architecture News. ACM, 2010, 38(3): 247-257.

Page 25

https://nicsefc.ee.tsinghua.edu.cn/
