TRANSCRIPT
Towards General-Purpose Neural Network Computing
Schuyler Eldridge1 Amos Waterland2 Margo Seltzer2
Jonathan Appavoo3 Ajay Joshi1
1Boston University Department of Electrical and Computer Engineering
2Harvard University School of Engineering and Applied Sciences
3Boston University Department of Computer Science
24th International Conference on Parallel Architectures and Compilation Techniques
PACT ’15 1/23
Why Do We Care About Neural Networks?
“Good” solutions for hard problems
Capable of learning
Neural networks, again?
The neural network hype cycle has been a bumpy ride
Modern, resurgent interest in neural networks is driven by:
Big, real-world data sets
“Free” availability of transistors
Use of accelerators
The need for continued performance improvements
[Figure: a feed-forward neural network with an input layer, hidden layers, a bias node, and an output layer]
PACT ’15 2/23
Neural Network Computing is Hot (Again)
Existing approaches
Dedicated neural network/vector processors from the 1990s [1]
Ongoing NPU work for approximate computing [2, 3, 4]
High-performance deep neural network architectures [5, 6]
Neural networks as primitives
We treat neural networks as an application primitive
[1] J. Wawrzynek et al., “Spert-II: A vector microprocessor system,” Computer, vol. 29, no. 3, pp. 79–86, Mar. 1996.
[2] H. Esmaeilzadeh et al., “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.
[3] R. St. Amant et al., “General-purpose code acceleration with limited-precision analog computation,” in ISCA, 2014.
[4] T. Moreau et al., “SNNAP: Approximate computing on programmable SoCs via neural acceleration,” in HPCA, 2015.
[5] T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ASPLOS, 2014.
[6] Z. Du et al., “ShiDianNao: Shifting vision processing closer to the sensor,” in ISCA, 2015.
PACT ’15 3/23
Our Vision of the Future of Neural Network Computing
[Figure: the intersection of Approximate Computing [1], Automatic Parallelization [2], and Machine Learning. Processes 1 through N sit above an operating system whose user/supervisor interface exposes a multicontext/multithreaded NN accelerator; each process uses one or more multilayer neural networks (input layer, hidden layers, output layer).]
[1] H. Esmaeilzadeh et al., “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.
[2] A. Waterland et al., “ASC: Automatically scalable computation,” in ASPLOS, 2014.
PACT ’15 4/23
Our Contributions Towards this Vision
X-FILES: Hardware/Software Extensions
Extensions for the Integration of Machine Learning in Everyday Systems
A defined user and supervisor interface for neural networks
This includes supervisor architectural state (hardware)
DANA: A Possible Multi-Transaction Accelerator
Dynamically Allocated Neural Network Accelerator
An accelerator aligning with our multi-transaction vision
I apologize for the names
There is no association with files or filesystems
X-FILES is plural (like extensions)
PACT ’15 5/23
An Overview of X-FILES/DANA Hardware
[Figure: X-FILES/DANA hardware overview. Cores 1 through N, each with an L1 data cache and a shared L2, connect to the X-FILES Arbiter, which holds a Transaction Queue, ASID registers, an ASID Register File, and a Transaction Table (ASID, TID, NNID, State). An ASID-NNID Table Walker uses the ASID-NNID Table Pointer and Num ASIDs registers. DANA comprises control logic, a PE Table (PE-1 through PE-N with Entry-1 through Entry-N), an NN Config Cache (entries backed by memories), and a Register File.]
Components
General-purpose cores
Transaction storage
A backend accelerator that “executes” transactions
Supervisor resources for memory safety
Dedicated memory interface
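As a hypothetical illustration of the transaction storage, the C struct below models one Transaction Table entry using the fields shown in the figure (ASID, TID, NNID, State); the field widths and state names are assumptions, not the authors' RTL.

```c
/* Hypothetical C model of one X-FILES Transaction Table entry, based on
 * the fields in the overview figure. Widths and states are assumptions. */
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    TX_UNALLOCATED, /* entry free */
    TX_LOADING,     /* NN configuration being fetched into the config cache */
    TX_EXECUTING,   /* processing elements evaluating layers */
    TX_DONE         /* outputs ready for the user to read */
} tx_state_t;

typedef struct {
    uint16_t   asid;  /* address space of the requesting process */
    uint16_t   tid;   /* transaction ID returned to the user */
    uint32_t   nnid;  /* which network in this address space to evaluate */
    tx_state_t state;
    bool       valid;
} tx_table_entry_t;
```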
PACT ’15 6/23
At the User Level We Deal With “Transactions”
Neural Network Transactions
A transaction encapsulates a request by a process to compute the output of a specific neural network for a provided input
User Transaction API:
newWriteRequest
writeData
readDataPoll
Identifiers
NNID: Neural Network ID
TID: Transaction ID
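A minimal sketch of how a process might drive this API; the three call names come from the slide, while the signatures and the polling convention are assumptions for illustration.

```c
/* Hypothetical user-level flow for one neural network transaction.
 * newWriteRequest, writeData, and readDataPoll are named on the slide;
 * their signatures here are assumptions. */
#include <stddef.h>
#include <stdint.h>

typedef uint32_t nnid_t; /* Neural Network ID: which network to run */
typedef uint32_t tid_t;  /* Transaction ID: handle for this request */

tid_t newWriteRequest(nnid_t nnid);                       /* start a transaction */
void  writeData(tid_t tid, const int32_t *in, size_t n);  /* send the inputs     */
int   readDataPoll(tid_t tid, int32_t *out, size_t n);    /* 0 when outputs ready */

int run_network(nnid_t nnid, const int32_t *in, size_t n_in,
                int32_t *out, size_t n_out) {
    tid_t tid = newWriteRequest(nnid); /* allocate a Transaction Table entry */
    writeData(tid, in, n_in);          /* stream the input vector */
    while (readDataPoll(tid, out, n_out) != 0)
        ;                              /* poll until the outputs are ready */
    return 0;
}
```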
[Figure: a core issuing transactions to the X-FILES hardware arbiter]
Core/Accelerator Interface
We use the RoCC interface of the Rocket RISC-V microprocessor [1, 2]
[1] A. Waterman et al., “The RISC-V instruction set manual, volume I: User-level ISA, version 2.0,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2014-54, May 2014.
[2] A. Waterman et al., “The RISC-V instruction set manual, volume II: Privileged architecture version 1.7,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2015-49, May 2015.
PACT ’15 7/23
At the Supervisor Level We Deal With Address Spaces
Use cases:
Single transaction
Multiple transactions
Sharing of networks
Multiple networks
[Figure: applications (Processes 1 through N) above the operating system's user/supervisor interface to the multicontext/multithreaded NN accelerator; each process owns one or more multilayer neural networks (input layer, hidden layers, output layer).]
We maintain the state of executing transactions
We group transactions into Address Spaces
Address Spaces are identified by an OS-managed ASID
Each ASID defines the set of accessible networks
Networks can be shared transparently if the OS allows this
PACT ’15 8/23
An ASID–NNID Table Enables NNID Dereferencing
[Figure: ASID-NNID Table layout. The ASID-NNID Table Pointer and Num ASIDs registers locate a table of per-ASID entries (0:, 1:, 2:, ...), each holding *ASID-NNID, *IO Queue, and Num NNIDs. The per-ASID NNID array points to *NN Configuration structures (header, layers, neurons, weights); the IO Queue holds a Status/Header word plus *Input and *Output ring buffers.]
ASID-NNID Table
The OS establishes and maintains the ASID-NNID Table
We assign ASIDs and NNIDs sequentially
The ASID-NNID Table contains an optional asynchronous memory interface
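A minimal C sketch of this two-level lookup, assuming field names and types the slide does not specify; the structure mirrors the figure above.

```c
/* Minimal sketch of the ASID-NNID Table, reconstructed from the slide.
 * Field names and types are assumptions, not the authors' definitions. */
#include <stddef.h>
#include <stdint.h>

typedef struct {            /* optional asynchronous memory interface */
    uint64_t  status;       /* status/header word */
    uint64_t *input;        /* input ring buffer */
    uint64_t *output;       /* output ring buffer */
} io_queue_t;

typedef struct {            /* one entry per ASID */
    void     **nn_configs;  /* pointers to NN configurations, indexed by NNID */
    io_queue_t *io_queue;
    size_t      num_nnids;  /* number of valid NNIDs in this address space */
} asid_entry_t;

typedef struct {            /* the table the OS establishes and maintains */
    asid_entry_t *entries;  /* indexed by ASID */
    size_t        num_asids;
} asid_nnid_table_t;

/* Dereferencing an NNID is a bounds-checked double lookup; the bounds
 * checks are where the supervisor-level memory safety comes from. */
static void *nnid_deref(const asid_nnid_table_t *t, size_t asid, size_t nnid) {
    if (asid >= t->num_asids || nnid >= t->entries[asid].num_nnids)
        return NULL;
    return t->entries[asid].nn_configs[nnid];
}
```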
PACT ’15 9/23
A Compact Binary Neural Network Configuration
[Figure: binary configuration layout. Info: binaryPoint, totalEdges, totalNeurons, totalLayers, weightsPtr. Layers: per-layer neuron0Ptr, neuronsInLayer, neuronsInNextLayer. Neurons: per-neuron weight0Ptr, numberOfWeights, activationFunction, steepness, bias. Weights: packed per-neuron weight arrays (neuron0-weight0, neuron0-weight1, ...).]
We condense the normal FANN neural network data structure
We use a reduced configuration from the Fast Artificial Neural Network (FANN) library [1] containing:
Global information
Per-layer information
Per-neuron information
Per-neuron weights
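A hedged C rendering of these four regions, using the field names from the figure; the exact types and packing are assumptions.

```c
/* Hypothetical C rendering of the condensed FANN-derived configuration.
 * Field names come from the figure; widths and layout are assumptions. */
#include <stdint.h>

typedef struct {            /* global information */
    uint16_t binaryPoint;   /* fixed-point binary point for all weights */
    uint16_t totalLayers;
    uint32_t totalNeurons;
    uint32_t totalEdges;
    uint32_t weightsPtr;    /* offset of the packed weights region */
} nn_info_t;

typedef struct {            /* per-layer information */
    uint32_t neuron0Ptr;    /* offset of this layer's first neuron record */
    uint16_t neuronsInLayer;
    uint16_t neuronsInNextLayer;
} nn_layer_t;

typedef struct {            /* per-neuron information */
    uint32_t weight0Ptr;    /* offset of this neuron's first weight */
    uint16_t numberOfWeights;
    uint8_t  activationFunction;
    uint8_t  steepness;
    int32_t  bias;
} nn_neuron_t;

/* Per-neuron weights follow as packed fixed-point values, interpreted
 * using binaryPoint from nn_info_t. */
```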
[1] S. Nissen, “Implementation of a fast artificial neural network library (fann),” Department of Computer Science University ofCopenhagen (DIKU), Tech. Rep., 2003.
PACT ’15 10/23
DANA: An Example Multi-Transaction Accelerator
[Figure: DANA block diagram. The X-FILES Arbiter and Transaction Table feed DANA's control logic; the datapath comprises an NN Configuration Cache (Entry-1 and Entry-2 backed by Cache Memory-1 and Cache Memory-2), per-transaction IO memory (NN Transaction-1 and NN Transaction-2 IO Memory), a Register File, and a PE Table (PE-1 through PE-N with Entry-1 through Entry-N).]
Components
Control logic determines actions given transaction state
Network configurations are stored in a Configuration Cache
Per-transaction IO Memory stores inputs and outputs
A Register File stores intermediate outputs
Logical neurons are mapped to Processing Elements
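As a speculative illustration of the last point, the sketch below assigns ready logical neurons to free PE Table entries; the names and the first-free policy are assumptions, not DANA's actual control logic.

```c
/* Speculative sketch of mapping logical neurons to Processing Elements.
 * All names and the allocation policy are illustrative assumptions. */
#include <stdbool.h>
#include <stddef.h>

#define NUM_PES 4

typedef struct {
    bool busy;
    int  tid;     /* transaction this PE is computing for */
    int  neuron;  /* logical neuron index being evaluated */
} pe_entry_t;

/* Returns the PE index assigned, or -1 if all PEs are busy and the
 * neuron must wait for a free slot. */
static int assign_neuron(pe_entry_t pes[NUM_PES], int tid, int neuron) {
    for (int i = 0; i < NUM_PES; i++) {
        if (!pes[i].busy) {
            pes[i] = (pe_entry_t){ .busy = true, .tid = tid, .neuron = neuron };
            return i;
        }
    }
    return -1;
}
```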
PACT ’15 11/23
DANA: Single Transaction Execution
[Figure: single-transaction execution. One transaction's ASID/NNID occupies Cache Memory-1 in the NN Configuration Cache; PE1 through PE4 in the PE Table evaluate the network (bias, input layer, hidden layer, output layer), with inputs and outputs in per-transaction IO memory and intermediate values in the Register File, under DANA's control.]
PACT ’15 12/23
DANA: Multi-Transaction Execution
[Figure: multi-transaction execution. Two transactions (TID-1, TID-2) with separate ASID/NNID entries in Cache Memory-1 and Cache Memory-2 share PE1 through PE4; per-transaction IO memory holds each transaction's inputs (I-1, I-2) and the Register File holds each transaction's intermediate results (R-1, R-2, R-3).]
PACT ’15 13/23
We Organize All Data in Blocks of Elements
[Figure: block layouts with 4 elements per block (element 4 down to element 1) and 8 elements per block (element 8 down to element 1).]
Blocks for temporal locality
We exploit the temporal locality of neural network data
Here, data refers to inputs or weights
Larger block widths reduce inter-module communication
Block width is an architectural parameter
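A minimal sketch of the block organization, assuming a fixed-point element type; only the 4- and 8-element widths come from the slides.

```c
/* Minimal sketch of block-organized data. The 4- and 8-element widths
 * are from the evaluation; the element type is an assumption. */
#include <stdint.h>

#define ELEMENTS_PER_BLOCK 4   /* architectural parameter: 4 or 8 evaluated */

typedef struct {
    /* Fixed-point inputs or weights move between modules one block at a
     * time, so a wider block means fewer transfers per layer. */
    int32_t element[ELEMENTS_PER_BLOCK];
} block_t;
```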
PACT ’15 14/23
Evaluation Networks
Area                    Application    Configuration     Size
ASC [1]                 3sum           85 × 16 × 85      large
                        collatz        40 × 16 × 40      large
                        ll             144 × 16 × 144    large
                        rsa            30 × 30 × 30      large
Approximate             blackscholes   6 × 8 × 8 × 1     small
Computing [2, 3, 4]     fft            1 × 4 × 4 × 2     small
                        inversek2j     2 × 8 × 2         small
                        jmeint         18 × 16 × 2       medium
                        jpeg           64 × 16 × 64      large
                        kmeans         6 × 16 × 16 × 1   medium
                        sobel          9 × 8 × 1         small
Physics [5]             edip           192 × 16 × 1      large
[1] A. Waterland et al., “ASC: Automatically scalable computation,” in ASPLOS, 2014.
[2] H. Esmaeilzadeh et al., “Neural acceleration for general-purpose approximate programs,” in MICRO, 2012.
[3] R. St. Amant et al., “General-purpose code acceleration with limited-precision analog computation,” in ISCA, 2014.
[4] T. Moreau et al., “SNNAP: Approximate computing on programmable SoCs via neural acceleration,” in HPCA, 2015.
[5] J. F. Justo et al., “Interatomic potential for silicon defects and disordered phases,” Physical Review B, vol. 58, pp. 2539–2550, Aug. 1998.
PACT ’15 15/23
Evaluation Methodology
Implementation
X-FILES Arbiter and DANA implemented in SystemVerilog
Free parameters include:
Elements per block
The number of Processing Elements
Internal table widths and storage sizes
Evaluation
We compute average power with Cadence SoC Encounter in a 45 nm GlobalFoundries process
We compute operating frequency using Cadence SoC Encounter
We compute performance by running SystemVerilog testbenches at the computed operating frequency
PACT ’15 16/23
Power and Performance
[Figure: two panels (4 and 8 elements per block) plotting average power (mW) and processing time (ns) against the number of processing elements; power is broken down into Processing Elements, Cache, Register File, Transaction Table, and Control Logic, and processing time is shown per benchmark (inversek2j, fft, sobel, blackscholes, jmeint, kmeans, collatz, rsa, jpeg, edip, 3sum, ll).]
PACT ’15 17/23
Single Transaction Throughput
[Figure: single-transaction throughput in edges per cycle versus the number of processing elements, for 4 and 8 elements per block, shown per benchmark (inversek2j, fft, sobel, blackscholes, jmeint, kmeans, collatz, rsa, jpeg, edip, 3sum, ll).]
PACT ’15 18/23
Multi-Transaction Throughput
[Figure: multi-transaction throughput in edges per cycle versus the number of processing elements, for 4 and 8 elements per block, for the transaction pairs fft-fft, kmeans-fft, kmeans-kmeans, edip-fft, edip-kmeans, and edip-edip.]
PACT ’15 19/23
Multi-Transaction Speedup
[Figure: throughput speedup (-20% to 20%) versus the number of processing elements (1 through 11), for 4 and 8 elements per block, for the pairs fft-fft, kmeans-fft, kmeans-kmeans, edip-fft, edip-kmeans, and edip-edip.]
PACT ’15 20/23
Software Comparison
NN        Energy   Delay   EDP
3sum      7×       95×     664×
collatz   8×       106×    826×
ll        6×       88×     569×
rsa       6×       88×     566×
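One consistency note (mine, not the slide's): the energy-delay product improvement is the product of the energy and delay improvements, and the table agrees once rounding of the per-column factors is taken into account, e.g. for 3sum, 7 × 95 = 665 ≈ 664×.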
Methodology and comments
Comparison against a single-core Intel SCC
Performance and power computed using gem5 [1] and McPAT [2]
[1] N. Binkert et al., “The gem5 simulator,” SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.
[2] S. Li et al., “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in MICRO, 2009.
PACT ’15 21/23
Summary and Acknowledgments
[Figure: the X-FILES/DANA hardware overview repeated: Cores 1 through N with L1 data caches and a shared L2, the X-FILES Arbiter (Transaction Queue, ASID registers, ASID Register File, and Transaction Table with ASID/TID/NNID/State), the ASID-NNID Table Walker with its table pointer and Num ASIDs registers, and DANA (PE Table, NN Config Cache, Register File, control).]
This work was supported by the following:
A NASA Space Technology Research Fellowship
An NSF Graduate Research Fellowship
NSF CAREER awards
A Google Faculty Research Award
PACT ’15 23/23