![Page 1: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/1.jpg)
Scaling Datacenter Accelerators With Compute-Reuse Architectures
Adi Fuchs and David Wentzlaff
ISCA 2018, Session 5A. June 5, 2018, Los Angeles, CA
![Page 2: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/2.jpg)
Sources:
▪ "Cramming more components onto integrated circuits", G. E. Moore, Computer 1965
▪ "Next-Gen Power Solutions for Hyperscale Data Centers", DataCenter Knowledge 2016
![Page 3: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/3.jpg)
![Page 4: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/4.jpg)
?
![Page 5: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/5.jpg)
Sources:
▪ "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective", Hazelwood et al., HPCA 2018
▪ "Cloud TPU", Google, https://cloud.google.com/tpu/
▪ "FPGA Accelerated Computing Using AWS F1 Instances", David Pellerin, AWS Summit 2017
▪ "Microsoft unveils Project Brainwave for real-time AI", Doug Burger, https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
▪ "NVIDIA TESLA V100", NVIDIA, https://www.nvidia.com/en-us/data-center/tesla-v100/
![Page 6: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/6.jpg)
![Page 7: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/7.jpg)
Transistor scaling stops. Chip specialization runs out of steam.
What’s Next?
![Page 8: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/8.jpg)
Observation I: The Density of Emerging Memories Is Projected to Increase
ITRS Logic Roadmap
![Page 9: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/9.jpg)
Source: "Face recognition in unconstrained videos with matched background similarity", Wolf et al., CVPR 2011
Observation II: Datacenter Accelerators Perform Redundant Computations
▪ Temporal locality introduces redundancy in video encoders (recurrent blocks in white)
t=0 sec t=2 sec t=4 sec
![Page 10: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/10.jpg)
Recurrent blocks per frame: 0% at t=0 sec, 38% at t=2 sec, 61% at t=4 sec
![Page 11: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/11.jpg)
Observation II: Datacenter Accelerators Perform Redundant Computations
▪ Search-term commonality retrieves similar content
intercontinental downtown los angeles
Source: Google
![Page 12: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/12.jpg)
hotel in downtown los angeles near intercontinental
![Page 13: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/13.jpg)
![Page 14: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/14.jpg)
Source: Twitter
Observation II: Datacenter Accelerators Perform Redundant Computations
▪ Power laws suggest highly recurrent processing of popular content
![Page 15: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/15.jpg)
![Page 16: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/16.jpg)
Memoization: Tables store past computation outputs. Reuse outputs of recurring inputs instead of recomputing.
COREx: Compute-Reuse Architecture For Accelerators
[Figure: block diagram. Labels: Host Processors, Shared LLC / NoC, Acceleration Fabric, DMA Engine, Scratchpad Memory, Accelerator Core, Input Lookup; arrows: input, output, core result]
![Page 17: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/17.jpg)
[Figure: the same block diagram with Compute-Reuse Storage added. Labels: Host Processors, Shared LLC / NoC, Acceleration Fabric, DMA Engine, Scratchpad Memory, Accelerator Core, Input Lookup, Compute-Reuse Storage; arrows: input, lookup, hit, fetched result, core result, output]
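The memoization idea stated on this slide can be illustrated in software. The sketch below is only an analogy, not the COREx hardware: `MemoizedAccelerator` and the toy `sad` kernel are hypothetical names introduced here, and a plain Python dictionary stands in for the compute-reuse storage.

```python
from typing import Callable, Dict, Hashable


class MemoizedAccelerator:
    """Software analogy of memoization: a table of past outputs in front of a core."""

    def __init__(self, core: Callable[[Hashable], object]) -> None:
        self.core = core                          # stands in for the accelerator core
        self.table: Dict[Hashable, object] = {}   # stands in for compute-reuse storage
        self.hits = 0
        self.misses = 0

    def run(self, inp: Hashable) -> object:
        if inp in self.table:                     # recurring input: reuse the stored output
            self.hits += 1
            return self.table[inp]
        self.misses += 1
        result = self.core(inp)                   # new input: compute, then record the result
        self.table[inp] = result
        return result


def sad(blocks):
    """Toy kernel: sum of absolute differences between two equally sized blocks."""
    a, b = blocks
    return sum(abs(x - y) for x, y in zip(a, b))
```

Calling `run` twice with the same pair of blocks invokes `sad` once; the second call is served from the table and counted as a hit.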
![Page 18: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/18.jpg)
![Page 19: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/19.jpg)
Architectural Guidelines
[Figure: block diagram. Labels: Accelerator Core, Specialized Compute Lanes, Scratchpad, DMA Engine, General-Purpose CMP, Shared LLC]
![Page 20: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/20.jpg)
▪ Accelerator Memoization is Natural
o Little or no additional programming effort
o Built-in input-compute-output flow
[Figure: the same block diagram, annotated with the Input, Compute, and Output stages]
![Page 21: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/21.jpg)
▪ But Not Straightforward!
o High lookup costs
o Unnecessary accesses
o High access costs
▪ COREx Key Ideas:
o Hashing (reduce lookup costs)
o Lookup filtering (fewer accesses)
o Banking (reduce access costs)
![Page 22: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/22.jpg)
Goal: Extend Specialization with Workload-Specific Memoization
![Page 23: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/23.jpg)
Top Level Architecture
[Figure: top-level block diagram. Labels: Accelerator Core, Specialized Compute Lanes, Scratchpad, DMA Engine, General-Purpose CMP, Shared LLC, SoC Interconnect. Legend: Mem. Chip, Func. Block, Datapath, Control]
![Page 24: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/24.jpg)
▪ New Modules:
o Input Hashing Unit (IHU)
[Figure: top-level architecture, adding the IHU and a COREx Interconnect]
![Page 25: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/25.jpg)
▪ New Modules:
o Input Lookup Unit (ILU)
[Figure: top-level architecture, adding the ILU. New labels: Cache Ctrl., Hashes, Associative Cache]
![Page 26: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/26.jpg)
▪ New Modules:
o Computation History Table (CHT)
[Figure: top-level architecture, adding the CHT. New labels: RAM-Array Ctrl., RAM-Array Table, Fetch]
![Page 27: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/27.jpg)
[Figure: the same architecture, highlighting the Fetch and Match Input steps]
![Page 28: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/28.jpg)
[Figure: the same architecture, highlighting the Fetch and Use Output steps]
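The three modules and the Fetch, Match Input, and Use Output steps shown on these slides suggest a lookup flow along the following lines. This is a behavioral sketch under stated assumptions, not the authors' design: the class name `CorexSketch`, the 32-bit BLAKE2b hash, the use of a Python set for the ILU, and the overwrite-on-collision policy are all choices made here for illustration.

```python
import hashlib
from typing import Callable, Dict, Set, Tuple


class CorexSketch:
    """Behavioral model of an IHU -> ILU -> CHT lookup in front of an accelerator core."""

    def __init__(self, core: Callable[[bytes], int]) -> None:
        self.core = core
        self.ilu: Set[int] = set()                    # assumed: hashes known to have a CHT entry
        self.cht: Dict[int, Tuple[bytes, int]] = {}   # assumed: hash -> (stored input, stored output)
        self.cht_fetches = 0

    @staticmethod
    def ihu(data: bytes) -> int:
        # Input Hashing Unit: compress the input to a short key (hash choice is an assumption).
        digest = hashlib.blake2b(data, digest_size=4).digest()
        return int.from_bytes(digest, "little")

    def run(self, data: bytes) -> Tuple[int, bool]:
        key = self.ihu(data)
        if key in self.ilu:                           # lookup filter: touch the CHT only on a hash hit
            self.cht_fetches += 1                     # Fetch
            stored_input, stored_output = self.cht[key]
            if stored_input == data:                  # Match Input: guards against hash collisions
                return stored_output, True            # Use Output
        output = self.core(data)                      # miss or collision: run the accelerator core
        self.ilu.add(key)
        self.cht[key] = (data, output)                # record (or overwrite) the entry
        return output, False
```

The returned flag reports whether the output was reused. Filtering on the hash first means the larger table is accessed only when a hit is likely, which is the motivation the slides give for lookup filtering.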
![Page 29: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/29.jpg)
IHU = Input Hashing Unit. ILU = Input Lookup Unit. CHT = Computation History Table.
Building COREx
Case Study: Acceleration of Video Motion Estimation
▪ Optimization Goals:
o Runtime, Energy, and Energy-Delay Product (EDP)
▪ Baseline: highly-tuned accelerators
o Sweep space for design alternatives (Aladdin)
o Find optimal accelerator design for each goal
![Page 30: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/30.jpg)
![Page 31: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/31.jpg)
Runtime OPT: 5.8 µs
Energy OPT: 6.2 µJ
EDP OPT: 148.7 pJ·s
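As a quick sanity check on these three figures, recall that the energy-delay product is energy multiplied by runtime. Each optimum comes from a different design point, so the EDP-optimal value cannot fall below the product of the best energy and the best runtime. The short calculation below uses only the numbers on the slide; the variable names are introduced here.

```python
# Values read off the slide; each belongs to a different baseline design point.
runtime_opt_s = 5.8e-6     # Runtime OPT: 5.8 microseconds
energy_opt_j = 6.2e-6      # Energy OPT: 6.2 microjoules
edp_opt_js = 148.7e-12     # EDP OPT: 148.7 picojoule-seconds


def edp(energy_j: float, runtime_s: float) -> float:
    """Energy-delay product in joule-seconds."""
    return energy_j * runtime_s


# No single design can beat both individual optima at once, so this is a lower bound.
lower_bound_js = edp(energy_opt_j, runtime_opt_s)
gap = edp_opt_js / lower_bound_js
```

The bound works out to roughly 36 pJ·s, about a quarter of the reported EDP optimum, which is consistent with the fastest and the most energy-efficient baselines being different designs.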
![Page 32: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/32.jpg)
Building COREx
▪ Memoization-Layers Specialization
o Extract input traces and examine the hit and miss rates of different ILU/CHT sizes.
o Integrate accelerators with an emerging-memory-based ILU+CHT, and sweep the space of gains.
▪ Example: Resistive RAM based COREx
![Page 33: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/33.jpg)
o Energy Optimization: 56.6% Energy Saved. 64KB ILU, 8MB CHT
o EDP Optimization: 63.5% EDP Saved. 512KB ILU, 2GB CHT
o Runtime Optimization: 2.7x Speedup. 512KB ILU, 32GB CHT
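The first specialization step above, examining hit and miss rates for different table sizes over an extracted input trace, could be prototyped in software before any hardware modeling. The sketch below assumes an LRU replacement policy and measures capacity in entries rather than bytes; both are assumptions made here, and `hit_rate` and `sweep` are hypothetical helper names.

```python
from collections import OrderedDict
from typing import Dict, Hashable, Iterable, Sequence


def hit_rate(trace: Sequence[Hashable], capacity: int) -> float:
    """Fraction of trace entries found in an LRU table holding `capacity` entries."""
    table: "OrderedDict[Hashable, bool]" = OrderedDict()
    hits = 0
    for key in trace:
        if key in table:
            hits += 1
            table.move_to_end(key)            # mark as most recently used
        else:
            table[key] = True
            if len(table) > capacity:
                table.popitem(last=False)     # evict the least recently used entry
    return hits / len(trace) if trace else 0.0


def sweep(trace: Sequence[Hashable], capacities: Iterable[int]) -> Dict[int, float]:
    """Hit rate for each candidate table size."""
    return {c: hit_rate(trace, c) for c in capacities}
```

For the trace `[1, 2, 1, 3, 1, 2]`, `sweep(trace, [1, 2, 3])` gives hit rates of 0, 2/6, and 0.5. Under LRU the hit rate never decreases as capacity grows, so the sweep shows where extra capacity stops paying off.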
![Page 34: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/34.jpg)
Experimental Setup
Workloads

| Kernel | Domain | Use-Case | App Source | Input Source and Description |
| --- | --- | --- | --- | --- |
| DCT | Video Encoding | Video Server | x264 | YouTube Faces. 10 Videos, 10 Seconds, 24 FPS. |
| SAD | Video Encoding | Video Server | PARBOIL | YouTube Faces. 10 Videos, 10 Seconds, 24 FPS. |
| SNAPPY ("SNP") | Compression | Web-Server Traffic Compression | TailBench / Snappy-C | Wikipedia Abstracts. 13 Million Search Queries. |
| SSSP ("SSP") | Graph Processing | Maps Service: Shortest Walking Route | Internal | DIMACS NYC Streets, 10 Million Zipfian Transactions. |
| BFS | Graph Processing | Online Retail | MachSuite | Amazon Co-Purchasing, 10 Million Zipfian Transactions. |
| RBM | Machine Learning | Collaborative Filtering | CortexSuite | Netflix Prize: 10 Million Zipfian Transactions. |
![Page 35: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/35.jpg)
Temporal Redundancy
![Page 36: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/36.jpg)
Search Commonality
![Page 37: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/37.jpg)
Content Popularity (75%, 90%, 95% Recurrence)
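To make the content-popularity workloads concrete, the sketch below generates a Zipfian request stream and measures how many requests repeat an earlier one. It is an illustration only: the slides do not say how the 75%, 90%, and 95% recurrence levels were produced, so the skew parameter, the item count, and the definition of recurrence used here (fraction of requests already seen in the trace) are assumptions.

```python
import random
from typing import List, Sequence


def zipf_trace(num_items: int, length: int, skew: float, seed: int = 0) -> List[int]:
    """Draw `length` item ids where the item of rank r has weight 1 / r**skew."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** skew) for rank in range(1, num_items + 1)]
    return rng.choices(range(num_items), weights=weights, k=length)


def recurrence(trace: Sequence[int]) -> float:
    """Fraction of requests whose item already appeared earlier in the trace."""
    seen = set()
    repeats = 0
    for item in trace:
        if item in seen:
            repeats += 1
        else:
            seen.add(item)
    return repeats / len(trace) if trace else 0.0
```

Raising `skew` concentrates requests on fewer popular items and so raises the measured recurrence; tuning it is one plausible way to target a given recurrence level.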
![Page 38: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/38.jpg)
Methodology
o Evaluate ILU/CHT as ReRAM, STT-RAM, PCM, or Racetrack (Destiny)
o Integrate with highly-tuned accelerators (Aladdin)
![Page 39: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/39.jpg)
Results
▪ Runtime-OPT: Avg. 6.0-6.4x Speedup
o Negligible Differences Between Memories
![Page 40: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/40.jpg)
▪ EDP-OPT: Avg. 50%-68% Savings
o PCM/Racetrack: high write energy
o Gains are smaller for low-bias apps (frequent updates)
![Page 41: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/41.jpg)
▪ Energy-OPT: Avg. 22%-50% Savings
o PCM is not beneficial for 75%-bias SSSP/RBM
▪ General Trends:
o Large CHTs (MBs-TBs) for Speedup, Smaller (KBs-GBs) for EDP, Smallest (KBs-MBs) for Energy
![Page 42: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/42.jpg)
Conclusions
▪ Memoization is Fit for Accelerators
o Memoization-Ready Programming Environment+Interface
![Page 43: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/43.jpg)
▪ Memoization is Fit for Datacenters
o Temporal Redundancy, Search Commonality, Content Popularity
![Page 44: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/44.jpg)
▪ COREx Extends Hardware Specialization
o Memoization-layer specialization tailored for the workload
![Page 45: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/45.jpg)
▪ COREx Opens New Opportunities for Future Architectures
o Shift compute from non-scaling CMOS to still-scaling memories
![Page 46: Scaling Datacenter Accelerators With Compute-Reuse ...Next-Gen Power Solutions for Hyperscale Data enters, DataCenter Knowledge 2016 ... Scaling Datacenter Accelerators With Compute-Reuse](https://reader035.vdocuments.us/reader035/viewer/2022062402/5ec7def6bc89af77a97643d4/html5/thumbnails/46.jpg)
Scaling Datacenter Accelerators With Compute-Reuse Architectures
Adi Fuchs and David Wentzlaff