Linear algebra libraries used in SLAM
TRANSCRIPT
Myriad2
Linear Algebra Library
Introduction
• Cormac Brick, VP Software, Movidius
• Movidius:
– Silicon vendor, founded 2005
– Focus on low-power compute / vision
– Myriad Development Kit
● Toolchain: incl. LLVM 3.6 compiler, Eclipse
● Optimized libraries: mvCV, mvISP, LAMA
● Computer vision solutions
● ISP solutions
[Block diagram. Main bus: DDR controller (128), bridge, SW-controlled I/O multiplexing. Processors: RISC-RTOS (L1 32/32 kB, L2 256 KB) and RISC-RT (L1 4/4 kB, L2 32 kB). Memory: ROM 128 KB, stacked KGD 1-8 Gbit LP-DDR2/3, L2 cache 512 KB, 2 MB CMX memory fabric (multi-ported RAM subsystem). SHAVE 0 to SHAVE 11 connected by the Inter-SHAVE Interconnect (ISI), with 32x HW mutex. Other blocks: PLL & CPM, AMC crossbar, arbiter & 16:1 mux, AON, 30x CV / ISP hardware accelerators. Interfaces: MIPI D-PHY x12 lanes, I2S x4, UART x2, I2C x4, Eth 2.5Gb, USB3, SDIO, CIF, LCD, JPEG. Power: TBD independent power islands.]
Myriad 2100 Architecture
[Block diagram of the SHAVE processor. VPU core with IRF 32x32-bit (18 ports) and VRF 32x128-bit (12 ports). Functional units: LSU0 and LSU1 (load-store), PEU (predication), BRU (branch unit), IAU (integer unit), SAU (scalar unit), VAU (vector unit), CMU (compare unit), IDC (instruction decode), DCU (debug). Caches: 2 KB I-cache, 1 KB D-cache, 256 kB 2-way L2 cache. Memory and buses: 2 MB CMX (Connection MatriX) SRAM, 64-bit CMX port, 3x 128-bit ports, 128-bit AXI, 32-bit APB, SHAVE bus.]
SHAVE Microarchitecture
CV Application Observations
[Diagram. Levels: Kernels, Complex Kernels, Pipelines, Applications. Axes: Reusability, Integration Effort.]
Linear Algebra Evolution
• LAMA: Linear Algebra MYRIAD Acceleration
• Goals
– Accelerate the most frequently used linear algebra kernels
– A framework for rapid utilization of optimized linear algebra kernels
– Full source / ASM open to customers
MV LIB: LAMA Overview
[Diagram. Components: Libflame (LAPACK) #, EIGEN, BLAS *. BLAS levels: 0, 1, 1f, 1v, 1m, 2, 3. Supporting blocks: BASE, Control, Compatibility, Utilities.]
*# Based on BLIS; UT Austin FLAME Project
BLIS microkernels
[Directory listing: blis/kernels/1, blis/kernels/1f, blis/kernels/1m, blis/kernels/3, with per-architecture subdirectories including /bgq, /c99, /loongson3a, /mic, /power7, /x86, /x86_64/core2-sse3, /x86_64/piledriver.]
BLIS GEMM Implementation notes
• Matrix A copied to CMX and all SHAVEs share the same copy.
• Matrix B copied to CMX, with double buffering.
• Matrix C copied to CMX, with triple buffering and write-back to DDR.
• Each SHAVE processes at most 16 micro-panels in a batch (a trade-off between speed and memory usage).
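The buffering scheme in these notes can be illustrated with a small host-side sketch. It is not the LAMA or BLIS code: it is a sequential Python model in which `copy_panel` stands in for a DMA transfer from DDR to CMX, and `PANEL_WIDTH` is a hypothetical micro-panel width chosen only for illustration. The limit of 16 micro-panels per batch comes from the slide. Triple buffering of C and the distribution of work across SHAVEs are left out.

```python
from typing import Iterator, List, Tuple

Matrix = List[List[float]]

MAX_PANELS_PER_BATCH = 16  # from the slide: at most 16 micro-panels per batch
PANEL_WIDTH = 4            # hypothetical micro-panel width, in columns


def split_into_batches(n_cols: int,
                       panel_width: int = PANEL_WIDTH,
                       max_panels: int = MAX_PANELS_PER_BATCH) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) column ranges covering at most max_panels micro-panels each."""
    step = panel_width * max_panels
    for start in range(0, n_cols, step):
        yield start, min(start + step, n_cols)


def copy_panel(b: Matrix, start: int, stop: int) -> Matrix:
    """Stand-in for a DMA copy of B[:, start:stop] into a local (CMX-like) buffer."""
    return [row[start:stop] for row in b]


def gemm_double_buffered(a: Matrix, b: Matrix) -> Matrix:
    """Compute C = A * B, walking B in batches held in two ping-pong buffers."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    batches = list(split_into_batches(n))
    if not batches:
        return c
    buffers = [None, None]
    buffers[0] = copy_panel(b, *batches[0])  # fill the first buffer up front
    for idx, (start, stop) in enumerate(batches):
        current = buffers[idx % 2]
        if idx + 1 < len(batches):
            # On hardware this transfer would overlap with the compute below;
            # here it simply runs first.
            buffers[(idx + 1) % 2] = copy_panel(b, *batches[idx + 1])
        for r in range(m):
            for j in range(stop - start):
                c[r][start + j] = sum(a[r][p] * current[p][j] for p in range(k))
    return c
```

A larger batch means fewer transfers per column of C but a larger buffer footprint, which is the speed versus memory trade-off named in the last bullet.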
BLIS GEMM implementation notes
• Implement double buffering for matrix A and allocate the proper slice only to the SHAVE that requires it.
• Move DMA function calls from LEON to each SHAVE.
• Implement DMA copy mechanism for general stride storage scheme.
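As a rough illustration of what a copy for a general stride storage scheme has to do, the sketch below gathers a matrix whose element (i, j) lives at `offset + i * row_stride + j * col_stride` into a packed row-major buffer. The function name and parameters are hypothetical and do not correspond to the Myriad DMA API; a real implementation would issue DMA descriptors rather than loop over elements.

```python
from typing import List, Sequence


def copy_general_stride(src: Sequence[float], offset: int,
                        rows: int, cols: int,
                        row_stride: int, col_stride: int) -> List[float]:
    """Gather a rows x cols matrix stored with general strides into packed row-major order."""
    dst = [0.0] * (rows * cols)
    for i in range(rows):
        base = offset + i * row_stride
        for j in range(cols):
            dst[i * cols + j] = src[base + j * col_stride]
    return dst
```

Row-major storage is the special case `col_stride == 1`, and column-major storage is the special case `row_stride == 1`, so a single routine covers both as well as sub-blocks of a larger matrix.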
Myriad2 SGEMM @ 300 MHz, 1-12 SHAVEs
40 GFLOPS @ 600 MHz with 12 SHAVEs
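The headline figure can be put in per-core terms with a little arithmetic. The helper names below are hypothetical, and the FLOP count uses the usual convention of 2·m·n·k operations for an m x k by k x n product, which is an assumption about how the slide's figure was measured.

```python
def flops_per_cycle_per_core(gflops: float, clock_mhz: float, cores: int) -> float:
    """Average floating-point operations per clock cycle per core."""
    return gflops * 1e9 / (clock_mhz * 1e6 * cores)


def sgemm_flops(m: int, n: int, k: int) -> int:
    """FLOP count of C = A * B under the 2*m*n*k convention (one multiply and one add)."""
    return 2 * m * n * k


# 40 GFLOPS at 600 MHz on 12 SHAVEs is about 5.56 FLOPs per cycle per SHAVE.
rate = flops_per_cycle_per_core(40.0, 600.0, 12)
```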
Myriad2 STRSM_LL @ 300 MHz, 1-12 SHAVEs
Least-Squares 10fps 1x SHAVE
System running at 504.00MHz
Leon running from CMX slice 7; stack: size=0x1A38, top=0x700edf00
Buffers:
  A[512][512] @0x70100000..0x701fffff ; A, then M=A'A, then L=M/L'
  b[512] @0x700ff800..0x700fffff ; b
  x[512] @0x700ff000..0x700ff7ff ; c=A'b, then y=L\c, then x=L'\y
Running a full lineq sequence...
. gemv_t_block( x @0x700ff000, A @0x70100000, b @0x700ff800, 512 );
. syrk_tn_l_block( A @0x70100000, tmp4cols @0x700fd000, 512 );
. potrf_ln_block( A @0x70100000, 512 );
. trsv_ln_block( x @0x700ff000, A @0x70100000, x @0x700ff000, 512 );
. trsv_lt_block( x @0x700ff000, A @0x70100000, x @0x700ff000, 512 );
All done.
Computation took:
 - gemv_t@opt    :   0.3ms (  201318cycles)
 - syrk_tn_l@opt :  59.3ms (29916308cycles)
 - potrf_ln@opt  :  41.8ms (21077304cycles)
 - trsv_ln@opt   :   0.7ms (  369558cycles)
 - trsv_lt@opt   :   0.7ms (  360934cycles)
Total: 103.0ms (51925422cycles)
All done.
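The sequence in the log is a least-squares solve via the normal equations: c = A'b, M = A'A, Cholesky factor L with LL' = M, then y = L\c and x = L'\y. The sketch below mirrors those five steps in plain Python for small matrices. The function names echo the logged kernels but are hypothetical stand-ins, not the LAMA API, and nothing here is blocked or optimized.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def gemv_t(a: Matrix, b: Vector) -> Vector:
    """c = A' b."""
    rows, cols = len(a), len(a[0])
    return [sum(a[i][j] * b[i] for i in range(rows)) for j in range(cols)]


def syrk_tn_lower(a: Matrix) -> Matrix:
    """Lower triangle of M = A' A; the strict upper triangle is left at zero."""
    rows, cols = len(a), len(a[0])
    m = [[0.0] * cols for _ in range(cols)]
    for i in range(cols):
        for j in range(i + 1):
            m[i][j] = sum(a[r][i] * a[r][j] for r in range(rows))
    return m


def potrf_lower(m: Matrix) -> Matrix:
    """Cholesky factor L with L L' = M, reading only the lower triangle of M."""
    n = len(m)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = m[j][j] - sum(l[j][p] ** 2 for p in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            l[i][j] = (m[i][j] - sum(l[i][p] * l[j][p] for p in range(j))) / l[j][j]
    return l


def trsv_ln(l: Matrix, c: Vector) -> Vector:
    """Forward substitution: solve L y = c."""
    n = len(l)
    y = [0.0] * n
    for i in range(n):
        y[i] = (c[i] - sum(l[i][p] * y[p] for p in range(i))) / l[i][i]
    return y


def trsv_lt(l: Matrix, y: Vector) -> Vector:
    """Back substitution: solve L' x = y."""
    n = len(l)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[p][i] * x[p] for p in range(i + 1, n))) / l[i][i]
    return x


def least_squares(a: Matrix, b: Vector) -> Vector:
    """Solve min ||A x - b|| through the normal equations A'A x = A'b."""
    c = gemv_t(a, b)
    l = potrf_lower(syrk_tn_lower(a))
    return trsv_lt(l, trsv_ln(l, c))
```

The logged timings are consistent with each other: the five cycle counts sum to 51925422, and 51925422 cycles at 504 MHz is about 103.0 ms, so roughly ten such solves fit in one second. Nearly all of that time goes to forming A'A (syrk) and factoring it (potrf).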
Myriad2 IEEE Micro Jan/Feb 2015
Myriad2 IEEE Micro Mar/Apr 2015
Summary
• MA1100 / 2014
– Stream benchmark
– BLIS L3 by end of June 2014
– SpMV done
– Cholesky factorization complete
– libFlame ported to LEON
• MA2xxx / 2015
– Tuning of BLIS, libFlame
– Further mixed (lower) precision work
• Acknowledgments:
– http://excess-project.eu/
– www.lero.ie at Trinity College Dublin, Ireland