TRANSCRIPT
Towards General-Purpose Neural Network Computing
Schuyler Eldridge1 Amos Waterland2 Margo Seltzer2
Jonathan Appavoo3 Ajay Joshi1
1Boston University Department of Electrical and Computer Engineering
2Harvard University School of Engineering and Applied Sciences
3Boston University Department of Computer Science
24th International Conference on Parallel Architectures and Compilation Techniques
PACT ’15 1/23
Why Do We Care About Neural Networks?
“Good” solutions for hard problems
Capable of learning
Neural networks, again?
The neural network hype cycle has been a bumpy ride
Modern, resurgent interest in neural networks is driven by:
Big, real-world data sets
“Free” availability of transistors
Use of accelerators
The need for continued performance improvements
[Figure: a feed-forward neural network with an input layer, hidden layers, a bias node, and an output layer]
PACT ’15 2/23
Neural Network Computing is Hot (Again)
Existing approaches
Dedicated neural network/vector processors from the 1990s [1]
Ongoing NPU work for approximate computing [2, 3, 4]
High-performance deep neural network architectures [5, 6]
Neural networks as primitives
We treat neural networks as an application primitive
[1] J. Wawrzynek et al., “Spert-II: A vector microprocessor system,” Computer, vol. 29, no. 3, pp. 79–86, Mar. 1996.
[2] H. Esmaeilzadeh et al., “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.
[3] R. St. Amant et al., “General-purpose code acceleration with limited-precision analog computation,” in ISCA, 2014.
[4] T. Moreau et al., “SNNAP: Approximate computing on programmable SoCs via neural acceleration,” in HPCA, 2015.
[5] T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ASPLOS, 2014.
[6] Z. Du et al., “ShiDianNao: Shifting vision processing closer to the sensor,” in ISCA, 2015.
PACT ’15 3/23
Our Vision of the Future of Neural Network Computing
[Figure: the intersection of Approximate Computing [1], Automatic Parallelization [2], and Machine Learning. Processes 1 through N sit above an operating system whose user/supervisor interface exposes a multicontext/multithreaded NN accelerator; each process uses one or more multilayer neural networks (input layer, hidden layers, output layer).]
[1] H. Esmaeilzadeh et al., “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.
[2] A. Waterland et al., “ASC: Automatically scalable computation,” in ASPLOS, 2014.
PACT ’15 4/23
Our Contributions Towards this Vision
X-FILES: Hardware/Software Extensions
Extensions for the Integration of Machine Learning in Everyday Systems
A defined user and supervisor interface for neural networks
This includes supervisor architectural state (hardware)
DANA: A Possible Multi-Transaction Accelerator
Dynamically Allocated Neural Network Accelerator
An accelerator aligning with our multi-transaction vision
I apologize for the names
There is no association with files or filesystems
X-FILES is plural (like extensions)
PACT ’15 5/23
An Overview of X-FILES/DANA Hardware
[Figure: X-FILES/DANA hardware overview. Cores 1 through N, each with an L1 data cache and a shared L2, connect to the X-FILES Arbiter, which holds a Transaction Queue, ASID registers, an ASID Register File, and a Transaction Table (ASID, TID, NNID, State). An ASID-NNID Table Walker uses the ASID-NNID Table Pointer and Num ASIDs registers. DANA comprises control logic, a PE Table (PE-1 through PE-N with Entry-1 through Entry-N), an NN Config Cache (entries backed by memories), and a Register File.]
Components
General-purpose cores
Transaction storage
A backend accelerator that “executes” transactions
Supervisor resources for memory safety
Dedicated memory interface
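As a hypothetical illustration of the transaction storage, the C struct below models one Transaction Table entry using the fields shown in the figure (ASID, TID, NNID, State); the field widths and state names are assumptions, not the authors' RTL.

```c
/* Hypothetical C model of one X-FILES Transaction Table entry, based on
 * the fields in the overview figure. Widths and states are assumptions. */
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    TX_UNALLOCATED, /* entry free */
    TX_LOADING,     /* NN configuration being fetched into the config cache */
    TX_EXECUTING,   /* processing elements evaluating layers */
    TX_DONE         /* outputs ready for the user to read */
} tx_state_t;

typedef struct {
    uint16_t   asid;  /* address space of the requesting process */
    uint16_t   tid;   /* transaction ID returned to the user */
    uint32_t   nnid;  /* which network in this address space to evaluate */
    tx_state_t state;
    bool       valid;
} tx_table_entry_t;
```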
PACT ’15 6/23
At the User Level We Deal With “Transactions”
Neural Network Transactions
A transaction encapsulates a request by a process to compute the output of a specific neural network for a provided input
User Transaction API:
newWriteRequest
writeData
readDataPoll
Identifiers
NNID: Neural Network ID
TID: Transaction ID
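A minimal sketch of how a process might drive this API; the three call names come from the slide, while the signatures and the polling convention are assumptions for illustration.

```c
/* Hypothetical user-level flow for one neural network transaction.
 * newWriteRequest, writeData, and readDataPoll are named on the slide;
 * their signatures here are assumptions. */
#include <stddef.h>
#include <stdint.h>

typedef uint32_t nnid_t; /* Neural Network ID: which network to run */
typedef uint32_t tid_t;  /* Transaction ID: handle for this request */

tid_t newWriteRequest(nnid_t nnid);                       /* start a transaction */
void  writeData(tid_t tid, const int32_t *in, size_t n);  /* send the inputs     */
int   readDataPoll(tid_t tid, int32_t *out, size_t n);    /* 0 when outputs ready */

int run_network(nnid_t nnid, const int32_t *in, size_t n_in,
                int32_t *out, size_t n_out) {
    tid_t tid = newWriteRequest(nnid); /* allocate a Transaction Table entry */
    writeData(tid, in, n_in);          /* stream the input vector */
    while (readDataPoll(tid, out, n_out) != 0)
        ;                              /* poll until the outputs are ready */
    return 0;
}
```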
[Figure: a core issuing transactions to the X-FILES hardware arbiter]
Core/Accelerator Interface
We use the RoCC interface of the Rocket RISC-V microprocessor [1, 2]
[1] A. Waterman et al., “The RISC-V instruction set manual, volume I: User-level ISA, version 2.0,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2014-54, May 2014.
[2] A. Waterman et al., “The RISC-V instruction set manual, volume II: Privileged architecture version 1.7,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2015-49, May 2015.
PACT ’15 7/23
At the Supervisor Level We Deal With Address Spaces
Use cases:
Single transaction
Multiple transactions
Sharing of networks
Multiple networks
[Figure: applications (Processes 1 through N) above the operating system's user/supervisor interface to the multicontext/multithreaded NN accelerator; each process owns one or more multilayer neural networks (input layer, hidden layers, output layer).]
We maintain the state of executing transactions
We group transactions into Address Spaces
Address Spaces are identified by an OS-managed ASID
Each ASID defines the set of accessible networks
Networks can be shared transparently if the OS allows this
PACT ’15 8/23
An ASID–NNID Table Enables NNID Dereferencing
[Figure: ASID-NNID Table layout. The ASID-NNID Table Pointer and Num ASIDs registers locate a table of per-ASID entries (0:, 1:, 2:, ...), each holding *ASID-NNID, *IO Queue, and Num NNIDs. The per-ASID NNID array points to *NN Configuration structures (header, layers, neurons, weights); the IO Queue holds a Status/Header word plus *Input and *Output ring buffers.]
ASID-NNID Table
The OS establishes and maintains the ASID-NNID Table
We assign ASIDs and NNIDs sequentially
The ASID-NNID Table contains an optional asynchronous memory interface
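A minimal C sketch of this two-level lookup, assuming field names and types the slide does not specify; the structure mirrors the figure above.

```c
/* Minimal sketch of the ASID-NNID Table, reconstructed from the slide.
 * Field names and types are assumptions, not the authors' definitions. */
#include <stddef.h>
#include <stdint.h>

typedef struct {            /* optional asynchronous memory interface */
    uint64_t  status;       /* status/header word */
    uint64_t *input;        /* input ring buffer */
    uint64_t *output;       /* output ring buffer */
} io_queue_t;

typedef struct {            /* one entry per ASID */
    void     **nn_configs;  /* pointers to NN configurations, indexed by NNID */
    io_queue_t *io_queue;
    size_t      num_nnids;  /* number of valid NNIDs in this address space */
} asid_entry_t;

typedef struct {            /* the table the OS establishes and maintains */
    asid_entry_t *entries;  /* indexed by ASID */
    size_t        num_asids;
} asid_nnid_table_t;

/* Dereferencing an NNID is a bounds-checked double lookup; the bounds
 * checks are where the supervisor-level memory safety comes from. */
static void *nnid_deref(const asid_nnid_table_t *t, size_t asid, size_t nnid) {
    if (asid >= t->num_asids || nnid >= t->entries[asid].num_nnids)
        return NULL;
    return t->entries[asid].nn_configs[nnid];
}
```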
PACT ’15 9/23
A Compact Binary Neural Network Configuration
[Figure: binary configuration layout. Info: binaryPoint, totalEdges, totalNeurons, totalLayers, weightsPtr. Layers: per-layer neuron0Ptr, neuronsInLayer, neuronsInNextLayer. Neurons: per-neuron weight0Ptr, numberOfWeights, activationFunction, steepness, bias. Weights: packed per-neuron weight arrays (neuron0-weight0, neuron0-weight1, ...).]
We condense the normal FANN neural network data structure
We use a reduced configuration from the Fast Artificial Neural Network (FANN) library [1] containing:
Global information
Per-layer information
Per-neuron information
Per-neuron weights
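A hedged C rendering of these four regions, using the field names from the figure; the exact types and packing are assumptions.

```c
/* Hypothetical C rendering of the condensed FANN-derived configuration.
 * Field names come from the figure; widths and layout are assumptions. */
#include <stdint.h>

typedef struct {            /* global information */
    uint16_t binaryPoint;   /* fixed-point binary point for all weights */
    uint16_t totalLayers;
    uint32_t totalNeurons;
    uint32_t totalEdges;
    uint32_t weightsPtr;    /* offset of the packed weights region */
} nn_info_t;

typedef struct {            /* per-layer information */
    uint32_t neuron0Ptr;    /* offset of this layer's first neuron record */
    uint16_t neuronsInLayer;
    uint16_t neuronsInNextLayer;
} nn_layer_t;

typedef struct {            /* per-neuron information */
    uint32_t weight0Ptr;    /* offset of this neuron's first weight */
    uint16_t numberOfWeights;
    uint8_t  activationFunction;
    uint8_t  steepness;
    int32_t  bias;
} nn_neuron_t;

/* Per-neuron weights follow as packed fixed-point values, interpreted
 * using binaryPoint from nn_info_t. */
```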
[1] S. Nissen, “Implementation of a fast artificial neural network library (fann),” Department of Computer Science University ofCopenhagen (DIKU), Tech. Rep., 2003.
PACT ’15 10/23
DANA: An Example Multi-Transaction Accelerator
[Figure: DANA block diagram. The X-FILES Arbiter and Transaction Table feed DANA's control logic; the datapath comprises an NN Configuration Cache (Entry-1 and Entry-2 backed by Cache Memory-1 and Cache Memory-2), per-transaction IO memory (NN Transaction-1 and NN Transaction-2 IO Memory), a Register File, and a PE Table (PE-1 through PE-N with Entry-1 through Entry-N).]
Components
Control logic determines actions given transaction state
Network configurations are stored in a Configuration Cache
Per-transaction IO Memory stores inputs and outputs
A Register File stores intermediate outputs
Logical neurons are mapped to Processing Elements
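As a speculative illustration of the last point, the sketch below assigns ready logical neurons to free PE Table entries; the names and the first-free policy are assumptions, not DANA's actual control logic.

```c
/* Speculative sketch of mapping logical neurons to Processing Elements.
 * All names and the allocation policy are illustrative assumptions. */
#include <stdbool.h>
#include <stddef.h>

#define NUM_PES 4

typedef struct {
    bool busy;
    int  tid;     /* transaction this PE is computing for */
    int  neuron;  /* logical neuron index being evaluated */
} pe_entry_t;

/* Returns the PE index assigned, or -1 if all PEs are busy and the
 * neuron must wait for a free slot. */
static int assign_neuron(pe_entry_t pes[NUM_PES], int tid, int neuron) {
    for (int i = 0; i < NUM_PES; i++) {
        if (!pes[i].busy) {
            pes[i] = (pe_entry_t){ .busy = true, .tid = tid, .neuron = neuron };
            return i;
        }
    }
    return -1;
}
```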
PACT ’15 11/23
DANA: Single Transaction Execution
[Figure: single-transaction execution. One transaction's ASID/NNID occupies Cache Memory-1 in the NN Configuration Cache; PE1 through PE4 in the PE Table evaluate the network (bias, input layer, hidden layer, output layer), with inputs and outputs in per-transaction IO memory and intermediate values in the Register File, under DANA's control.]
PACT ’15 12/23
DANA: Multi-Transaction Execution
[Figure: multi-transaction execution. Two transactions (TID-1, TID-2) with separate ASID/NNID entries in Cache Memory-1 and Cache Memory-2 share PE1 through PE4; per-transaction IO memory holds each transaction's inputs (I-1, I-2) and the Register File holds each transaction's intermediate results (R-1, R-2, R-3).]
PACT ’15 13/23
We Organize All Data in Blocks of Elements
[Figure: block layouts with 4 elements per block (element 4 down to element 1) and 8 elements per block (element 8 down to element 1).]
Blocks for temporal locality
We exploit the temporal locality of neural network data
Here, data refers to inputs or weights
Larger block widths reduce inter-module communication
Block width is an architectural parameter
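A minimal sketch of the block organization, assuming a fixed-point element type; only the 4- and 8-element widths come from the slides.

```c
/* Minimal sketch of block-organized data. The 4- and 8-element widths
 * are from the evaluation; the element type is an assumption. */
#include <stdint.h>

#define ELEMENTS_PER_BLOCK 4   /* architectural parameter: 4 or 8 evaluated */

typedef struct {
    /* Fixed-point inputs or weights move between modules one block at a
     * time, so a wider block means fewer transfers per layer. */
    int32_t element[ELEMENTS_PER_BLOCK];
} block_t;
```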
PACT ’15 14/23
Evaluation Networks
Area                    Application    Configuration     Size
ASC [1]                 3sum           85 × 16 × 85      large
                        collatz        40 × 16 × 40      large
                        ll             144 × 16 × 144    large
                        rsa            30 × 30 × 30      large
Approximate             blackscholes   6 × 8 × 8 × 1     small
Computing [2, 3, 4]     fft            1 × 4 × 4 × 2     small
                        inversek2j     2 × 8 × 2         small
                        jmeint         18 × 16 × 2       medium
                        jpeg           64 × 16 × 64      large
                        kmeans         6 × 16 × 16 × 1   medium
                        sobel          9 × 8 × 1         small
Physics [5]             edip           192 × 16 × 1      large
[1] A. Waterland et al., “ASC: Automatically scalable computation,” in ASPLOS, 2014.
[2] H. Esmaeilzadeh et al., “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.
[3] R. St. Amant et al., “General-purpose code acceleration with limited-precision analog computation,” in ISCA, 2014.
[4] T. Moreau et al., “SNNAP: Approximate computing on programmable SoCs via neural acceleration,” in HPCA, 2015.
[5] J. F. Justo et al., “Interatomic potential for silicon defects and disordered phases,” Physical Review B, vol. 58, pp. 2539–2550, Aug. 1998.
PACT ’15 15/23
Evaluation Methodology
Implementation
X-FILES Arbiter and DANA implemented in SystemVerilog
Free parameters include:
Elements per block
The number of Processing Elements
Internal table widths and storage sizes
Evaluation
We compute average power with Cadence SoC Encounter in a 45 nm GlobalFoundries process
We compute operating frequency using Cadence SoC Encounter
We compute performance by running SystemVerilog testbenches at the computed operating frequency
PACT ’15 16/23
Power and Performance
[Figure: two panels (4 and 8 elements per block) plotting average power (mW) and processing time (ns) against the number of processing elements; power is broken down into Processing Elements, Cache, Register File, Transaction Table, and Control Logic, and processing time is shown per benchmark (inversek2j, fft, sobel, blackscholes, jmeint, kmeans, collatz, rsa, jpeg, edip, 3sum, ll).]
PACT ’15 17/23
Single Transaction Throughput
[Figure: single-transaction throughput in edges per cycle versus the number of processing elements, for 4 and 8 elements per block, shown per benchmark (inversek2j, fft, sobel, blackscholes, jmeint, kmeans, collatz, rsa, jpeg, edip, 3sum, ll).]
PACT ’15 18/23
Multi-Transaction Throughput
[Figure: multi-transaction throughput in edges per cycle versus the number of processing elements, for 4 and 8 elements per block, for the transaction pairs fft-fft, kmeans-fft, kmeans-kmeans, edip-fft, edip-kmeans, and edip-edip.]
PACT ’15 19/23
Multi-Transaction Speedup
[Figure: throughput speedup (-20% to 20%) versus the number of processing elements (1 through 11), for 4 and 8 elements per block, for the pairs fft-fft, kmeans-fft, kmeans-kmeans, edip-fft, edip-kmeans, and edip-edip.]
PACT ’15 20/23
Software Comparison
NN        Energy   Delay   EDP
3sum      7×       95×     664×
collatz   8×       106×    826×
ll        6×       88×     569×
rsa       6×       88×     566×
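One consistency note (mine, not the slide's): the energy-delay product improvement is the product of the energy and delay improvements, and the table agrees once rounding of the per-column factors is taken into account, e.g. for 3sum, 7 × 95 = 665 ≈ 664×.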
Methodology and comments
Comparison against a single-core Intel SCC
Performance and power computed using gem5 [1] and McPAT [2]
[1] N. Binkert et al., “The gem5 simulator,” SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[2] S. Li et al., “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in MICRO, 2009.
PACT ’15 21/23
Summary and Acknowledgments
[Figure: the X-FILES/DANA hardware overview repeated: Cores 1 through N with L1 data caches and a shared L2, the X-FILES Arbiter (Transaction Queue, ASID registers, ASID Register File, and Transaction Table with ASID/TID/NNID/State), the ASID-NNID Table Walker with its table pointer and Num ASIDs registers, and DANA (PE Table, NN Config Cache, Register File, control).]
This work was supported by the following:
A NASA Space Technology Research Fellowship
An NSF Graduate Research Fellowship
NSF CAREER awards
A Google Faculty Research Award
PACT ’15 23/23