
Towards General-Purpose Neural Network Computing

Schuyler Eldridge (1), Amos Waterland (2), Margo Seltzer (2), Jonathan Appavoo (3), Ajay Joshi (1)

(1) Boston University Department of Electrical and Computer Engineering

(2) Harvard University School of Engineering and Applied Sciences

(3) Boston University Department of Computer Science

24th International Conference on Parallel Architectures and Compilation Techniques


Why Do We Care About Neural Networks?

- “Good” solutions for hard problems
- Capable of learning

Neural networks, again? The neural network hype cycle has been a bumpy ride. Modern, resurgent interest in neural networks is driven by:
- Big, real-world data sets
- “Free” availability of transistors
- Use of accelerators
- The need for continued performance improvements

[Figure: a feedforward neural network with an input layer, two hidden layers, a bias node, and an output layer]


Neural Network Computing is Hot (Again)

Existing approaches:
- Dedicated neural network/vector processors from the 1990s [1]
- Ongoing NPU work for approximate computing [2, 3, 4]
- High-performance deep neural network architectures [5, 6]

Neural networks as primitives: we treat neural networks as an application primitive.

[1] J. Wawrzynek et al., “Spert-II: A vector microprocessor system,” Computer, vol. 29, no. 3, pp. 79–86, Mar. 1996.

[2] H. Esmaeilzadeh et al., “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.

[3] R. St. Amant et al., “General-purpose code acceleration with limited-precision analog computation,” in ISCA, 2014.

[4] T. Moreau et al., “SNNAP: Approximate computing on programmable SoCs via neural acceleration,” in HPCA, 2015.

[5] T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ASPLOS, 2014.

[6] Z. Du et al., “ShiDianNao: Shifting vision processing closer to the sensor,” in ISCA, 2015.


Our Vision of the Future of Neural Network Computing

[Figure: approximate computing [1], automatic parallelization [2], and machine learning applications run as Processes 1..N on an operating system; a user/supervisor interface connects them to a multicontext/multithreaded NN accelerator, with each process backed by its own neural network (input layer, hidden layers, output layer)]

[1] H. Esmaeilzadeh et al., “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.

[2] A. Waterland et al., “ASC: Automatically scalable computation,” in ASPLOS, 2014.


Our Contributions Towards this Vision

X-FILES: Hardware/Software Extensions
Extensions for the Integration of Machine Learning in Everyday Systems
- A defined user and supervisor interface for neural networks
- This includes supervisor architectural state (hardware)

DANA: A Possible Multi-Transaction Accelerator
Dynamically Allocated Neural Network Accelerator
- An accelerator aligning with our multi-transaction vision

I apologize for the names: there is no association with files or filesystems, and X-FILES is plural (like extensions).


An Overview of X-FILES/DANA Hardware

[Figure: Cores 1..N, each with an L1 data cache, share an L2 cache and connect through a transaction queue to the X-FILES Arbiter (ASID register file, transaction table tracking ASID/TID/NNID/state, and an ASID-NNID table walker with the ASID-NNID Table Pointer and Num ASIDs registers) and to DANA (control, NN configuration cache with per-entry memories, PE table with processing elements PE-1..PE-N, and a register file)]

Components:
- General-purpose cores
- Transaction storage
- A backend accelerator that “executes” transactions
- Supervisor resources for memory safety
- A dedicated memory interface


At the User Level We Deal With “Transactions”

Neural network transactions: a transaction encapsulates a request by a process to compute the output of a specific neural network for a provided input.

User transaction API:
- newWriteRequest
- writeData
- readDataPoll

Identifiers:
- NNID: Neural Network ID
- TID: Transaction ID

[Figure: a core communicating with the X-FILES hardware arbiter]

Core/accelerator interface: we use the RoCC interface of the Rocket RISC-V microprocessor [1, 2].
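To make the flow concrete, here is a minimal C sketch of how a user process might drive these three calls. The slide names the calls but not their signatures, so the types and prototypes below (nnid_t, tid_t, the argument lists) are assumptions for illustration, standing in for the RoCC custom instructions the real interface would issue.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t nnid_t; /* Neural Network ID */
typedef uint32_t tid_t;  /* Transaction ID, assigned by the arbiter */

/* Hypothetical C wrappers around the RoCC custom instructions;
 * the actual signatures are not specified in the presentation. */
tid_t newWriteRequest(nnid_t nnid);                                /* open a transaction */
void  writeData(tid_t tid, const int32_t *in, size_t n, int last); /* stream inputs */
int   readDataPoll(tid_t tid, int32_t *out, size_t n);             /* 1 when outputs ready */

int run_inference(nnid_t net, const int32_t *in, size_t n_in,
                  int32_t *out, size_t n_out) {
  tid_t tid = newWriteRequest(net);       /* request the network identified by NNID */
  writeData(tid, in, n_in, /*last=*/1);   /* provide the full input vector */
  while (!readDataPoll(tid, out, n_out))  /* poll until the transaction completes */
    ;
  return 0;
}
```

The TID returned by the arbiter is what lets multiple outstanding transactions from the same process coexist on the accelerator.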

[1] A. Waterman et al., “The RISC-V instruction set manual, Volume I: User-level ISA, version 2.0,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2014-54, May 2014.

[2] A. Waterman et al., “The RISC-V instruction set manual, Volume II: Privileged architecture version 1.7,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2015-49, May 2015.


At the Supervisor Level We Deal With Address Spaces

Use cases:
- Single transaction
- Multiple transactions
- Sharing of networks
- Multiple networks

[Figure: applications run as Processes 1..N on the operating system; the user/supervisor interface connects them to the multicontext/multithreaded NN accelerator, with each process backed by its own neural network (input layer, hidden layers, output layer)]

We maintain the state of executing transactions. We group transactions into Address Spaces, identified by an OS-managed ASID.

Each ASID defines the set of accessible networks; networks can be shared transparently if the OS allows this.


An ASID–NNID Table Enables NNID Dereferencing

[Figure: the ASID-NNID Table Pointer and Num ASIDs register locate an array of per-ASID entries, each holding an *ASID-NNID pointer, an *IO Queue pointer, and a Num NNIDs count; each NNID dereferences to an NN Configuration (header, layers, neurons, weights), and each IO queue entry is a ring buffer with a status/header word and *Input/*Output pointers]

ASID–NNID Table:
- The OS establishes and maintains the ASID–NNID Table
- We assign ASIDs and NNIDs sequentially
- The ASID–NNID Table contains an optional asynchronous memory interface

A struct-level sketch of this layout follows.
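Here is a C-style sketch of the table the figure implies; the field names follow the figure, while the types, widths, and exact packing are assumptions for illustration, not the hardware's defined layout.

```c
#include <stdint.h>

struct io_queue {            /* ring buffer for the optional async memory interface */
  uint64_t status;           /* status/header word */
  uint64_t *input;           /* -> input ring buffer */
  uint64_t *output;          /* -> output ring buffer */
};

struct nnid_entry {
  void *config;              /* -> NN configuration (header, layers, neurons, weights) */
};

struct asid_entry {
  struct nnid_entry *nnids;  /* per-ASID array of network configurations */
  struct io_queue *io;       /* per-ASID IO queue */
  uint64_t num_nnids;        /* count of valid NNIDs for this ASID */
};

/* The supervisor points the hardware at the table base:
 *   ASID-NNID Table Pointer -> struct asid_entry table[num_asids];
 * NNID dereferencing is then: table[asid].nnids[nnid].config */
```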


A Compact Binary Neural Network Configuration

[Figure: the packed configuration layout: an Info block (binaryPoint, totalEdges, totalNeurons, totalLayers, weightsPtr); per-layer records (neuron0Ptr, neuronsInLayer, neuronsInNextLayer); per-neuron records (weight0Ptr, numberOfWeights, activationFunction, steepness, bias); and a dense array of per-neuron weights]

We condense the normal FANN neural network data structure: we use a reduced configuration from the Fast Artificial Neural Network (FANN) library [1] containing:
- Global information
- Per-layer information
- Per-neuron information
- Per-neuron weights

A C-level sketch of this format appears after the reference below.

[1] S. Nissen, “Implementation of a fast artificial neural network library (fann),” Department of Computer Science University ofCopenhagen (DIKU), Tech. Rep., 2003.
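The sketch below renders the figure's layout as C structs. The field names come from the figure; the widths, ordering, and use of offsets rather than raw pointers are assumptions for illustration, not the actual binary encoding.

```c
#include <stdint.h>

struct nn_info {             /* global information */
  uint32_t binaryPoint;      /* fixed-point binary point position */
  uint32_t totalEdges;
  uint32_t totalNeurons;
  uint32_t totalLayers;
  uint64_t weightsPtr;       /* offset of the packed weight array */
};

struct nn_layer {            /* per-layer information */
  uint64_t neuron0Ptr;       /* offset of this layer's first neuron record */
  uint32_t neuronsInLayer;
  uint32_t neuronsInNextLayer;
};

struct nn_neuron {           /* per-neuron information */
  uint64_t weight0Ptr;       /* offset of this neuron's first weight */
  uint32_t numberOfWeights;
  uint32_t activationFunction;
  int32_t  steepness;
  int32_t  bias;
};

/* Per-neuron weights follow as a dense fixed-point array:
 *   int32_t weights[totalEdges];
 * interpreted using binaryPoint from the info block. */
```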


DANA: An Example Multi-Transaction Accelerator

[Figure: DANA's internals: the X-FILES Arbiter feeds a transaction table; control coordinates an NN Configuration Cache (Entry-1/Entry-2 with Cache Memory-1/Cache Memory-2), per-transaction IO memory for NN Transaction-1 and NN Transaction-2, a register file, and a PE table with processing elements PE-1..PE-N]

Components:
- Control logic determines actions given transaction state
- Network configurations are stored in a Configuration Cache
- Per-transaction IO Memory stores inputs and outputs
- A Register File stores intermediate outputs
- Logical neurons are mapped to Processing Elements (see the scheduling sketch below)
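For intuition only, here is a toy C sketch of the kind of policy the control logic could apply when mapping logical neurons onto free processing elements. The real design is SystemVerilog RTL and its actual scheduling policy is not detailed on this slide; the table sizes and first-free-PE policy below are assumptions.

```c
#include <stdbool.h>

#define NUM_PES 4  /* assumed PE count for this sketch */

struct pe { bool busy; int tid; int neuron; };
struct transaction { int tid; int next_neuron; int total_neurons; bool done; };

static struct pe pe_table[NUM_PES];

/* One scheduling step: hand a transaction's next unassigned logical
 * neuron to the first idle PE. A PE would clear its busy bit when the
 * neuron's computation completes. */
bool assign_neuron(struct transaction *t) {
  if (t->done) return false;
  for (int i = 0; i < NUM_PES; i++) {
    if (!pe_table[i].busy) {
      pe_table[i] = (struct pe){ .busy = true, .tid = t->tid,
                                 .neuron = t->next_neuron++ };
      if (t->next_neuron == t->total_neurons) t->done = true;
      return true;
    }
  }
  return false; /* all PEs busy; retry next cycle */
}
```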


DANA: Single Transaction Execution

[Figure: a single transaction in flight: the transaction table holds one entry; its ASID/NNID selects a configuration in Cache Memory-1; the network (bias, input layer, hidden layer, output layer) executes on PE1-PE4, with inputs and outputs in per-transaction IO memory and intermediate values in the register file]


DANA: Multi-Transaction Execution

[Figure: two transactions (TID-1, TID-2) in flight: their ASID/NNID pairs select configurations in Cache Memory-1 and Cache Memory-2; both networks (bias, input, hidden, and output layers) interleave across PE1-PE4, with per-transaction inputs (I-1, I-2) and intermediate registers (R-1, R-2, R-3) kept separate in the IO memory and register file]


We Organize All Data in Blocks of Elements

[Figure: block layouts: a four-element block packs elements 1-4; an eight-element block packs elements 1-8]

Blocks for temporal locality:
- We exploit the temporal locality of neural network data
- Here, data refers to inputs or weights
- Larger block widths reduce inter-module communication
- Block width is an architectural parameter (see the sketch below)
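To see why wider blocks cut inter-module communication, here is a toy C sketch of a block-granularity multiply-accumulate, assuming a four-element block of fixed-point values: one block transfer replaces four scalar transfers between modules. This is an illustration of the idea, not the RTL datapath.

```c
#include <stdint.h>

#define ELEMENTS_PER_BLOCK 4  /* architectural parameter; 8 in the wider config */

typedef struct { int32_t elem[ELEMENTS_PER_BLOCK]; } block_t;

/* One PE step: accumulate a whole block of inputs against a whole
 * block of weights, moved between modules as single block_t units. */
int64_t mac_block(block_t in, block_t wt, int64_t acc) {
  for (int i = 0; i < ELEMENTS_PER_BLOCK; i++)
    acc += (int64_t)in.elem[i] * wt.elem[i];
  return acc;
}
```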


Evaluation Networks

Area                   Application    Configuration    Size
ASC [1]                3sum           85 × 16 × 85     large
                       collatz        40 × 16 × 40     large
                       ll             144 × 16 × 144   large
                       rsa            30 × 30 × 30     large
Approximate            blackscholes   6 × 8 × 8 × 1    small
Computing [2, 3, 4]    fft            1 × 4 × 4 × 2    small
                       inversek2j     2 × 8 × 2        small
                       jmeint         18 × 16 × 2      medium
                       jpeg           64 × 16 × 64     large
                       kmeans         6 × 16 × 16 × 1  medium
                       sobel          9 × 8 × 1        small
Physics [5]            edip           192 × 16 × 1     large

[1] A. Waterland et al., “ASC: Automatically scalable computation,” in ASPLOS, 2014.

[2] H. Esmaeilzadeh et al., “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.

[3] R. St. Amant et al., “General-purpose code acceleration with limited-precision analog computation,” in ISCA, 2014.

[4] T. Moreau et al., “SNNAP: Approximate computing on programmable SoCs via neural acceleration,” in HPCA, 2015.

[5] J. F. Justo et al., “Interatomic potential for silicon defects and disordered phases,” Physical Review B, vol. 58, pp. 2539–2550, Aug. 1998.


Evaluation Methodology

Implementation:
- X-FILES Arbiter and DANA implemented in SystemVerilog
- Free parameters include:
  - Elements per block
  - The number of Processing Elements
  - Internal table widths and storage sizes

Evaluation:
- We compute average power with Cadence SOC Encounter in a 45nm GlobalFoundries process
- We compute operating frequency using Cadence SOC Encounter
- We compute performance by running SystemVerilog testbenches at the computed operating frequency


Power and Performance

[Figure: average power (mW) and processing time (ns, log scale) versus number of processing elements, for 4 and 8 elements per block; power is broken down by component (processing elements, cache, register file, transaction table, control logic) and processing time is plotted per benchmark (inversek2j, fft, sobel, blackscholes, jmeint, kmeans, collatz, rsa, jpeg, edip, 3sum, ll)]


Single Transaction Throughput

[Figure: edges per cycle versus number of processing elements, for 4 and 8 elements per block, plotted per benchmark (inversek2j, fft, sobel, blackscholes, jmeint, kmeans, collatz, rsa, jpeg, edip, 3sum, ll)]


Multi-Transaction Throughput

[Figure: edges per cycle versus number of processing elements, for 4 and 8 elements per block, plotted per concurrent transaction pair (fft-fft, kmeans-fft, kmeans-kmeans, edip-fft, edip-kmeans, edip-edip)]


Multi-Transaction Speedup

[Figure: throughput speedup (−20% to 20%) versus number of processing elements (1–11), for 4 and 8 elements per block, plotted per concurrent transaction pair (fft-fft, kmeans-fft, kmeans-kmeans, edip-fft, edip-kmeans, edip-edip)]


Software Comparison

NN        Energy    Delay    EDP
3sum      7×        95×      664×
collatz   8×        106×     826×
ll        6×        88×      569×
rsa       6×        88×      566×
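As a sanity check on these figures, the energy-delay product improvement is simply the product of the energy and delay improvements; taking 3sum as an example, the small residual gap comes from rounding in the reported per-column factors:

\[
\text{EDP gain} = \text{energy gain} \times \text{delay gain}, \qquad 7 \times 95 = 665 \approx 664.
\]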

Methodology and comments:
- Comparison against a single core of an Intel SCC
- Performance and power computed using gem5 [1] and McPAT [2]

[1] N. Binkert et al., “The gem5 simulator,” SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.

[2] S. Li et al., “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in MICRO, 2009.


Summary and Acknowledgments

[Figure: the X-FILES/DANA system overview repeated: Cores 1..N with L1 data caches and a shared L2 cache connect through a transaction queue to the X-FILES Arbiter (ASID register file, transaction table with ASID/TID/NNID/state, ASID-NNID table walker) and to DANA (control, NN config cache, PE table, register file)]

This work was supported by the following:
- A NASA Space Technology Research Fellowship
- An NSF Graduate Research Fellowship
- NSF CAREER awards
- A Google Faculty Research Award
