Linear Algebra Libraries used in SLAM
Myriad2
Linear Algebra Library
Introduction
• Cormac Brick, VP Software, Movidius
• Movidius:
– Silicon vendor, founded 2005
– Focus on low power compute / vision
– Myriad Development kit
  ● Toolchain: incl. LLVM 3.6 compiler, Eclipse
  ● Optimized Libraries: mvCV, mvISP, LAMA
  ● Computer vision solutions
  ● ISP solutions
Myriad 2100 Architecture
[Block diagram. Labeled blocks:]
• Main Bus; DDR Controller (128); Bridge; SW Controlled I/O Multiplexing
• RISC-RTOS: L1 32/32kB, L2 256KB
• RISC-RT: L1 4/4kB, L2 32kB
• ROM 128KB
• Stacked KGD 1-8Gbit LP-DDR2/3
• PLL & CPM
• MIPI D-PHY x12 lanes
• AMC Crossbar; Arbiter & 16:1 mux; L2 cache 512KB; 32x HW Mutex
• Inter-SHAVE Interconnect (ISI)
• SHAVE 0 to SHAVE 11
• 2MB CMX Memory Fabric, Multi-Ported RAM Subsystem
• Power: TBD independent power islands; AON
• Peripherals: I2S x4, UART x2, I2C x4, Eth2.5Gb, USB3, SDIO, CIF, LCD, JPEG
• 30x CV / ISP hardware accelerators
SHAVE Microarchitecture
[Block diagram of the SHAVE Processor. Labeled blocks:]
• VPU Core
• IRF 32x32-bit (18 ports); VRF 32x128-bit (12 ports)
• LSU0, LSU1 (Load-Store)
• PEU (Predication); BRU (Branch Unit); IAU (Integer Unit); SAU (Scalar Unit); VAU (Vector Unit); CMU (Compare Unit)
• IDC (Instr. Decode); DCU (Debug)
• 2 KB I-cache; 1 KB D-cache
• 2MB CMX (Connection MatriX) SRAM; 256kB 2-way L2 cache
• Ports and buses: 64-bit CMX Port; 3x 128-bit Ports; 128-bit AXI; 32-bit APB; SHAVE Bus
CV Application Observations
[Diagram labels: Kernels; complex Kernels; Pipelines; Applications; Reusability; Integration Effort]
Linear Algebra Evolution
MV LIB: LAMA Overview
• LAMA: Linear Algebra MYRIAD Acceleration
• Goals
– Accelerate most frequently used Linear Algebra Kernels
– A framework for rapid utilization of optimized Linear Algebra kernels
– Full source / ASM open to customers
[Diagram labels: Libflame (LAPACK) #; BLAS *; Levels: 0, 1, 1f, 1v, 1m, 2, 2; BASE; Control; Compatibility; Utilities; EIGEN]
*# Based on BLIS; UT Austin Flame Project
BLIS microkernels
[Directory listing.]
• Kernel directories: blis/kernels/1, blis/kernels/1f, blis/kernels/1m, blis/kernels/3
• Architecture subdirectories shown: /bgq, /c99, /loongson3a, /mic, /power7, /x86, /x86_64/core2-sse3, /x86_64/piledriver
BLIS GEMM Implementation notes
• Matrix A copied to CMX and all SHAVEs share the same copy.
• Matrix B copied to CMX, with double buffering.
• Matrix C copied to CMX, with triple buffering and write-back to DDR.
• Each SHAVE processes 16 (at most) micro-panels in a batch – trade off between speed and memory usage.
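The work split described in these notes can be sketched in plain Python. This is an illustration only, not the LAMA or BLIS code: it models one shared copy of A and batches of at most 16 column micro-panels per worker, and it leaves out the DMA transfers and the double/triple buffering entirely. The micro-panel width `NR = 4` and the round-robin assignment of batches to workers are assumptions made for the example; `make_batches` and `gemm_batched` are hypothetical names.

```python
# Sketch only: models the work split, not the real DMA or buffering scheme.
NUM_WORKERS = 12   # SHAVE count shown in the architecture slide
MAX_BATCH = 16     # at most 16 micro-panels per batch (from the slide)
NR = 4             # micro-panel width in columns: an assumed value


def make_batches(n_cols, nr=NR, max_batch=MAX_BATCH):
    """Split the columns of B/C into micro-panels, then group them into batches."""
    panels = [(j, min(j + nr, n_cols)) for j in range(0, n_cols, nr)]
    return [panels[i:i + max_batch] for i in range(0, len(panels), max_batch)]


def gemm_batched(A, B, num_workers=NUM_WORKERS):
    """Compute C = A * B with batches dealt round-robin to workers (assumed policy)."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    batches = make_batches(n)
    for worker in range(num_workers):
        # Every worker reads the same shared A; it owns only its batches of B and C.
        for batch in batches[worker::num_workers]:
            for j0, j1 in batch:
                for i in range(m):
                    for j in range(j0, j1):
                        C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
    return C
```

A larger batch means fewer hand-offs per worker but more buffer space held at once, which is the speed versus memory trade-off the last bullet refers to.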
BLIS GEMM implementation notes
• Implement double buffering for matrix A and allocate the proper slice only to the SHAVE that requires it.
• Move DMA function calls from LEON to each SHAVE.
• Implement DMA copy mechanism for general stride storage scheme.
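For context on the last bullet: in BLIS's general stride storage scheme, element (i, j) of a matrix lives at `i*rs + j*cs`, where `rs` is the row stride and `cs` the column stride. Row-major and column-major layouts are special cases. The sketch below shows a gather from such a layout into a contiguous buffer; it stands in for the DMA copy and is not Movidius code. The function name `pack_strided` and the `offset` parameter are hypothetical.

```python
def pack_strided(buf, m, n, rs, cs, offset=0):
    """Gather an m x n matrix stored with row stride rs and column stride cs
    into a contiguous row-major list (a software stand-in for a DMA gather)."""
    return [buf[offset + i * rs + j * cs] for i in range(m) for j in range(n)]


# The same 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] in two layouts:
row_major = [1, 2, 3, 4, 5, 6]   # rs = 3, cs = 1
col_major = [1, 4, 2, 5, 3, 6]   # rs = 1, cs = 2
packed_from_rows = pack_strided(row_major, 2, 3, rs=3, cs=1)
packed_from_cols = pack_strided(col_major, 2, 3, rs=1, cs=2)
```

Both calls produce the same packed buffer, so a kernel that consumes packed data does not need to know the caller's layout.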
Myriad2 SGEMM @ 300MHz 1-12 SHAVEs
40GFLOPS @ 600MHz @ 12 SHAVEs
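As a rough sanity check on the quoted figure, the arithmetic below converts 40 GFLOPS at 600MHz on 12 SHAVEs into floating-point operations per cycle per SHAVE, using the usual count of 2·m·n·k operations for a GEMM. The helper names are hypothetical, and the calculation assumes the 40 GFLOPS applies to all 12 SHAVEs running together.

```python
def gemm_flops(m, n, k):
    """Floating-point operations in C = A*B: one multiply and one add per term."""
    return 2 * m * n * k


def flops_per_cycle_per_core(gflops, clock_mhz, cores):
    """Convert an aggregate GFLOPS figure into FLOPs per clock cycle per core."""
    return gflops * 1e9 / (clock_mhz * 1e6 * cores)


rate = flops_per_cycle_per_core(40, 600, 12)   # about 5.6 FLOPs per cycle per SHAVE
seconds_512 = gemm_flops(512, 512, 512) / 40e9  # time for a 512^3 GEMM at that rate
```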
Myriad2 STRSM_LL 300MHz 1-12 SHAVEs
Least-Squares 10fps 1x SHAVE
System running at 504.00MHz
Leon running from CMX slice 7; stack: size=0x1A38, top=0x700edf00
Buffers:
  A[512][512] @0x70100000..0x701fffff ; A, then M=A'A, then L=M/L'
  b[512] @0x700ff800..0x700fffff ; b
  x[512] @0x700ff000..0x700ff7ff ; c=A'b, then y=L\c, then x=L'\y
Running a full lineq sequence...
. gemv_t_block( x @0x700ff000, A @0x70100000, b @0x700ff800, 512 );
. syrk_tn_l_block( A @0x70100000, tmp4cols @0x700fd000, 512 );
. potrf_ln_block( A @0x70100000, 512 );
. trsv_ln_block( x @0x700ff000, A @0x70100000, x @0x700ff000, 512 );
. trsv_lt_block( x @0x700ff000, A @0x70100000, x @0x700ff000, 512 );
All done.
Computation took:
 - gemv_t@opt : 0.3ms ( 201318cycles)
 - syrk_tn_l@opt : 59.3ms ( 29916308cycles)
 - potrf_ln@opt : 41.8ms ( 21077304cycles)
 - trsv_ln@opt : 0.7ms ( 369558cycles)
 - trsv_lt@opt : 0.7ms ( 360934cycles)
Total: 103.0ms (51925422cycles)
All done.
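The five calls in the log follow the normal-equations route to a least-squares solution: c = A'b, M = A'A, Cholesky factor L with L·L' = M, then y = L\c and x = L'\y. The sketch below reproduces that sequence in plain Python for readability. The function names echo the log but these are not the LAMA routines: the originals are blocked, work in place on the listed buffers, and their exact signatures are not shown in the slides.

```python
import math


def gemv_t(A, b):
    """c = A' b."""
    return [sum(A[i][j] * b[i] for i in range(len(A))) for j in range(len(A[0]))]


def syrk_tn(A):
    """M = A' A (full matrix here; the log's variant keeps the lower triangle)."""
    m, n = len(A), len(A[0])
    return [[sum(A[p][i] * A[p][j] for p in range(m)) for j in range(n)] for i in range(n)]


def potrf_l(M):
    """Cholesky factorization: lower-triangular L with L L' = M."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(M[j][j] - sum(L[j][p] ** 2 for p in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (M[i][j] - sum(L[i][p] * L[j][p] for p in range(j))) / L[j][j]
    return L


def trsv_ln(L, c):
    """Forward substitution: solve L y = c."""
    y = []
    for i in range(len(c)):
        y.append((c[i] - sum(L[i][p] * y[p] for p in range(i))) / L[i][i])
    return y


def trsv_lt(L, y):
    """Back substitution: solve L' x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[p][i] * x[p] for p in range(i + 1, n))) / L[i][i]
    return x


def least_squares(A, b):
    """Solve min ||A x - b|| through the normal equations A'A x = A'b."""
    c = gemv_t(A, b)
    L = potrf_l(syrk_tn(A))
    return trsv_lt(L, trsv_ln(L, c))
```

In the log, forming A'A (syrk) and factoring it (potrf) account for almost all of the 103.0ms, while the matrix-vector product and the two triangular solves together take under 2ms.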
Myriad2 IEEE Micro Jan/Feb 2015
Myriad2 IEEE Micro Mar/Apr 2015
Summary
• MA1100 / 2014
– Stream benchmark
– BLIS L3 by end of June 2014
– SpMV done
– Cholesky Factorization complete
– libFlame ported to LEON
• MA2xxx / 2015
– Tuning of BLIS, libFlame
– Further Mixed (lower) precision work
• Acknowledgments:
– http://excess-project.eu/
– www.lero.ie at Trinity College Dublin, Ireland