MemFlow: Memory-Driven Data Scheduling with
Datapath Co-design in Accelerators for Large-scale Applications
Jade Nie, Sharad Malik
This work is sponsored in part by SONIC and C-FAR, funded centers of STARnet, a Semiconductor Research Corporation (SRC) program sponsored by MARCO and DARPA
Demands of large-scale computation
Neural networks
AlexNet 240 MB, VGG16 552 MB, GoogLeNet 28 MB, SqueezeNet 4.8 MB
Eigenface: feature size = 500 × 500 = 250,000 → eigendecomposition of a 250,000 × 250,000 matrix
Accelerator design is memory-driven
Latency of a DRAM access: 10× SRAM, 20× function units
Energy consumption of a DRAM access: 100× SRAM, 1000× function units
Intel Core i7: 4–8 MB cache; Intel Core i9: 25 MB cache; Nvidia GeForce 1080 GTX: 4928 KB SRAM; TPU: 24 MB SRAM; FPGA: 10 MB
DRAM access is inefficient; on-chip SRAM is limited.
Accelerator architecture
Software-controlled scratch-pad SRAM
DRAM accesses:
- Compulsory accesses:
  - Load initial input
  - Store back final results
- Accesses due to limited SRAM size:
  - Load/store spilled intermediate results

Determining DRAM accesses: computation scale, SRAM size, data scheduling
Effect of scheduling - DRAM accesses
[Figure: blocked matrix-multiply example A * B = C with matrix dimensions 10 and 5.]
Scratch-pad size: 35 scalar data
DRAM accesses due to spilling: 30 (accumulating 10, 20, 30 as execution proceeds)
Effect of scheduling - DRAM accesses
[Figure: the same matrix-multiply example under a better schedule.]
Scratch-pad size: 35 scalar data
DRAM accesses due to spilling: 0
Global Optimization – NP-hard problem
Data dependency graph
[Figure: data dependency graph of ops and scalar data; values flow through registers or through the scratchpad.]
Variables:
- Starting cycle of each op

Constraints:
- Bound on scratchpad size
- Bound on read/write ports
- Number of function units

Objectives:
- Number of data spills
- Number of cycles

Optimization tools:
- Modern constraint solvers: SAT, SMT, ILP
- Simulated annealing

Not feasible:
- Complexity
- Irregular scheduling – hard to control

→ Optimization with regularity
MemFlow
• Tunable parameters from three levels
  • Level 1: block dimensions
  • Level 2: block scheduling
  • Level 3: parallelism of local computing
• Mathematical modeling
  • DRAM accesses, SRAM accesses, cycles
• Integrated optimization
Blocking – matrix multiply
Macro node: block computation A_Block * B_Block + C_Block → C_Block
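As a concrete reference, a minimal numpy sketch of this macro node; the function name and the assumption that all three blocks are already resident in the scratch-pad are illustrative, not the paper's implementation:

```python
import numpy as np

def matmul_macro_node(A_blk, B_blk, C_blk):
    """One macro node: C_blk <- A_blk @ B_blk + C_blk.
    All three blocks are assumed to already reside in the scratch-pad,
    so the macro node itself issues no DRAM accesses."""
    C_blk += A_blk @ B_blk
    return C_blk
```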
Blocking – LU factorization
Block computations:
- LU: A → L_lower · U_upper
- TRS: A · U_upper^{-1} → L
- LUCPL: L_lower^{-1} · A → U
- SUBGEMM: A − L · U → A

Overall: A → L · U, where L is a lower triangular matrix and U is an upper triangular matrix.
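To make the four block computations concrete, here is a hedged numpy sketch of a right-looking blocked LU assembled from them. The name-to-equation pairing follows the slide order, pivoting is omitted for clarity, and all function names are hypothetical:

```python
import numpy as np

def lu_nopivot(A):
    """LU macro node: factor one diagonal block, A -> L_lower * U_upper
    (no pivoting, so the block must be well-conditioned)."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float).copy()
    for k in range(n - 1):
        L[k+1:, k] = U[k+1:, k] / U[k, k]
        U[k+1:, :] -= np.outer(L[k+1:, k], U[k, :])
    return L, U

def blocked_lu(A, b):
    """Right-looking blocked LU built from the four block computations
    above (LU, TRS, LUCPL, SUBGEMM), with block size b."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L, U = np.zeros_like(A), np.zeros_like(A)
    for k in range(0, n, b):
        e = min(k + b, n)
        # LU:      A -> L_lower * U_upper          (diagonal block)
        L[k:e, k:e], U[k:e, k:e] = lu_nopivot(A[k:e, k:e])
        # TRS:     A * U_upper^{-1} -> L           (column panel)
        L[e:, k:e] = np.linalg.solve(U[k:e, k:e].T, A[e:, k:e].T).T
        # LUCPL:   L_lower^{-1} * A -> U           (row panel)
        U[k:e, e:] = np.linalg.solve(L[k:e, k:e], A[k:e, e:])
        # SUBGEMM: A - L * U -> A                  (trailing update)
        A[e:, e:] -= L[e:, k:e] @ U[k:e, e:]
    return L, U
```

For a diagonally dominant test matrix M, `blocked_lu(M, 32)` returns factors with `L @ U ≈ M`.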
Blocking – QR factorization
Block computations:
- QR: A → H(V_lower) · R_upper
- QRUpdateTr: A · R_upper^{-1} → H(V)
- QRCPL: H(V_lower) · A → R
- QRUpdate: H(V) · [R, A] → [R, A]

V = [v_1, v_2, …, v_k], H(V) = H_k · H_{k−1} ⋯ H_1, where H_i = I − v_i · v_i^T
Overall: A → Q · R, where Q is an orthogonal matrix and R is an upper triangular matrix.
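For reference, a minimal unblocked Householder QR in numpy. The textbook reflector is H = I − 2·v·v^T with unit v; the slide's H_i = I − v_i·v_i^T is the same transform once the factor 2 is folded into v_i's scaling. The function name is hypothetical:

```python
import numpy as np

def householder_qr(A):
    """Unblocked Householder QR: A -> Q * R."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        nv = np.linalg.norm(v)
        if nv == 0.0:          # column already zero below the diagonal
            continue
        v /= nv
        R[k:, :] -= 2.0 * np.outer(v, v @ R[k:, :])   # apply H_k to R
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)   # accumulate Q = H_1...H_k
    return Q, R
```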
Why blocking?
[Figure: DRAM supplying data blocks to a scratch-pad holding the A, B, and C blocks, which feed a customized datapath.]
- Increase data reuse for each array
- Balance data reuse between arrays
- Fully local computation: SRAM bandwidth is the key limitation
Level 1: block dimensions
[Figure: blocked matrices with block dimensions dim_i, dim_j, dim_l.]
Larger blocks are not always better!
Level 2: block scheduling
Six choices: i → j → l, i → l → j, j → i → l, j → l → i, l → i → j, l → j → i
List scheduling + regularity:
For i : 0 → num_blks
  For j : 0 → num_blks
    For l : 0 → num_blks
      execute macro node (i, j, l)
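A Python sketch of this Level 2 loop nest with the ordering as a parameter; `blk` and `schedule_blocks` are hypothetical stand-ins for the blk_dim parameters and the hardware scheduler, and matrix dimensions are assumed divisible by the block sizes:

```python
import numpy as np

def schedule_blocks(A, B, C, blk, order=("i", "j", "l")):
    """Run every matmul macro node of C += A @ B in the given loop
    ordering. blk maps index name -> block size."""
    n = {"i": A.shape[0], "l": A.shape[1], "j": B.shape[1]}
    counts = {k: n[k] // blk[k] for k in "ijl"}
    for a in range(counts[order[0]]):
        for b in range(counts[order[1]]):
            for c in range(counts[order[2]]):
                ix = dict(zip(order, (a, b, c)))
                i, j, l = (ix[k] * blk[k] for k in "ijl")
                # macro node (i, j, l): C_blk <- A_blk @ B_blk + C_blk
                C[i:i+blk["i"], j:j+blk["j"]] += (
                    A[i:i+blk["i"], l:l+blk["l"]] @ B[l:l+blk["l"], j:j+blk["j"]]
                )
    return C
```

All six orderings compute the same C; they differ only in which blocks stay live across consecutive macro nodes, and hence in spill traffic.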
Fully local computation
[Figure: scratch-pad holding the A, B, and C blocks, feeding a customized datapath.]
How to lay out the data blocks?
How to design the datapath? How to decide its parallelism?
Level 3: parallelism of local computation
[Figure: local computation of a matrix-multiply macro node.]
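One plausible reading of the parallelism parameters (named PA and PB in the recap below) is the number of words read per cycle from the A and B blocks into a PA × PB multiplier array. Under that assumption only, a rough cycle model for one macro node is:

```python
import math

def macro_node_cycles(dim_i, dim_j, dim_l, PA, PB):
    """Illustrative cycle estimate: per cycle, read PA words of the A
    block and PB words of the B block and update PA * PB partial sums
    of the C block. This is a hedged model, not the paper's exact one."""
    return math.ceil(dim_i / PA) * math.ceil(dim_j / PB) * dim_l
```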
Datapath sharing
[Figure: datapath sharing for Matmul, LU, and QR.]
MemFlow
• Tunable parameters from three levels
  • Level 1: block dimensions – blk_dimi, blk_dimj, blk_diml
  • Level 2: block scheduling – six loop orderings
  • Level 3: parallelism of local computing – PA, PB
• Mathematical modeling
  • DRAM accesses, SRAM accesses, cycles
• Integrated optimization
Modeling – DRAM accesses
[Figure: block reuse pattern in matrix multiply under loop ordering i → j → l. Time runs over the sequence of macro nodes; each C block (C00 … C22) is used for a contiguous run, while the A blocks (A00 … A22) and B blocks (B00 … B22) are reused periodically.]
Modeling – DRAM accesses
A group of alternately used blocks
When SRAM is full, evict a block based on the optimal replacement policy: replace the block whose next use is furthest away.
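This rule is Belady's optimal replacement. A small Python simulator over a block-access trace (the interface is hypothetical; trace entries are block IDs):

```python
def count_spill_loads(trace, capacity):
    """Simulate a scratch-pad holding `capacity` blocks, evicting the
    resident block whose next use is furthest away (Belady's optimal
    replacement); return the number of misses, i.e. blocks loaded
    (or reloaded) from DRAM."""
    resident, misses = set(), 0
    for t, blk in enumerate(trace):
        if blk in resident:
            continue
        misses += 1
        if len(resident) == capacity:
            def next_use(b):
                for u in range(t + 1, len(trace)):
                    if trace[u] == b:
                        return u
                return float("inf")   # never used again: evict first
            resident.discard(max(resident, key=next_use))
        resident.add(blk)
    return misses
```

For example, `count_spill_loads(["A0", "B0", "C0", "A0", "B1", "C0"], 3)` returns the loads a 3-block scratch-pad would incur on that trace.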
Modeling – DRAM accesses
#data evicted for each matrix = #reuse groups × #periods × #blocks evicted per period × block size
Each factor is determined by the input size, block dimensions, block ordering, and scratchpad partitioning.
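Written as code, with hypothetical argument names for the four factors:

```python
def spilled_words_per_matrix(n_reuse_groups, n_periods,
                             blocks_evicted_per_period, block_size):
    """Slide's spill model for one matrix:
    #data evicted = #reuse groups * #periods
                    * #blocks evicted per period * block size.
    Each factor is a function of input size, block dimensions,
    block ordering, and scratch-pad partitioning."""
    return (n_reuse_groups * n_periods
            * blocks_evicted_per_period * block_size)
```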
Mathematical Modeling
• DRAM accesses:
  • input size, block size, loop ordering, memory partitioning factor
• SRAM accesses:
  • input size, block dimension, SRAM partitioning
• Cycles:
  • input size, SRAM partitioning
MemFlow
• Tunable parameters from three levels
  • Level 1: block dimensions – blk_dimi, blk_dimj, blk_diml
  • Level 2: block scheduling – six loop orderings
  • Level 3: parallelism of local computing – PA, PB
• Mathematical modeling
  • DRAM accesses, SRAM accesses, cycles
• Integrated optimization
Integrated Optimization
• Parameters:
  • Block dimensions, loop ordering, memory partitioning
• Objectives:
  • Primary: weighted total of DRAM accesses and SRAM accesses
  • Secondary: cycles
• Optimization methods:
  • Gradient descent to the optimal order of magnitude
  • Parameter sweeping for the global optimum
  • Worst-case search space: O(N^1.5 · P · log P), where N is the input size and P the total number of SRAM ports
  • ~90 seconds for P = 16, N = 1000

Optimization is sensitive to input size, so it is rerun for each specific kernel and input size.
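A simplified sketch of the sweep portion of this search; `cost_model` is a hypothetical callable returning (weighted DRAM + SRAM accesses, cycles), so Python's tuple comparison makes cycles the secondary objective:

```python
from itertools import permutations, product

def integrated_optimization(dim_candidates, partition_candidates, cost_model):
    """Exhaustive parameter sweep over the three levels. The paper
    combines gradient descent (to reach the right magnitude) with a
    local sweep; this version only illustrates the search space."""
    best, best_cost = None, (float("inf"), float("inf"))
    for dims in product(dim_candidates, repeat=3):       # Level 1: block dims
        for order in permutations("ijl"):                # Level 2: loop ordering
            for part in partition_candidates:            # SRAM partitioning
                cost = cost_model(dims, order, part)
                if cost < best_cost:
                    best_cost, best = cost, (dims, order, part)
    return best, best_cost
```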
Results Comparison – cache reuse
Metric for evaluating data reuse: number of SRAM allocations
- Counts each time a new word is stored in SRAM
- For a cache, equals (read misses + write misses) × cache line size

L1 cache baseline: 32 KB, 8-way associative, 64 B cache line
MemFlow: 8 dual-port 4 KB banks (32 KB total)
Results Comparison – cache reuse
[Figure: SRAM allocations for MemFlow and the L1 cache.]
MM: 37%, due to spilling 75%
LU: 26%, due to spilling 76%
QR: 23%, due to spilling 55%
Results – GPU + cuBLAS
GPU: Nvidia GeForce 1080 GTX
• Total SRAM size: 4928 KB
  • L1 cache, shared memory, L2 cache
• cuBLAS library:
  • Matrix multiply: cublasSgemm
  • LU decomposition: cublasSgetrfBatched
  • QR decomposition: cublasSgeqrfBatched
MemFlow: 1232 dual-port 4 KB SRAM banks (4928 KB total)
Results – GPU + cuBLAS
[Figure: improvements in SRAM and DRAM accesses.]
Results – GPU + cuBLAS
[Figure: matrix-multiply data use distribution, GPU vs. MemFlow.]
Results – Vivado HLS
Datapath comparison with Vivado HLS
Future Work
• More efficient optimization methods
• Optimizing data scheduling for a flow of computation kernels
• Verilog validation