linear algebra libraries used in slam · stacked kgd 1-8gbit lp-ddr2/3 pll & cpm pll & cpm...

Myriad2

Linear Algebra Library

Introduction

• Cormac Brick, VP Software, Movidius
• Movidius:
  – Silicon vendor, founded 2005
  – Focus on low power compute / vision
  – Myriad Development kit
    ● Toolchain: incl. LLVM 3.6 compiler, Eclipse
    ● Optimized Libraries: mvCV, mvISP, LAMA
    ● Computer vision solutions
    ● ISP solutions

Myriad 2100 Architecture

[Figure: Myriad 2100 block diagram. Blocks shown:
• 12 SHAVE processors (SHAVE 0 to SHAVE 11) connected by the Inter-SHAVE Interconnect (ISI)
• 2MB CMX Memory Fabric (multi-ported RAM subsystem), L2 cache 512KB, arbiter & 16:1 mux, 32x HW mutex
• RISC-RTOS (L1 32/32kB, L2 256KB), RISC-RT (L1 4/4kB, L2 32kB), ROM 128KB
• AMC Crossbar and 30x CV / ISP hardware accelerators
• Main Bus (128), bridge, DDR controller, stacked KGD 1-8Gbit LP-DDR2/3
• PLL & CPM, AON, power: TBD independent power islands
• SW-controlled I/O multiplexing: MIPI D-PHY x12 lanes, I2S x4, UART x2, I2C x4, Eth2.5Gb, USB3, SDIO, CIF, LCD, JPEG]

SHAVE Microarchitecture

[Figure: SHAVE processor block diagram. Blocks shown:
• VPU core units: PEU (predication), BRU (branch unit), LSU0 and LSU1 (load-store), IAU (integer unit), SAU (scalar unit), VAU (vector unit), CMU (compare unit), IDC (instruction decode), DCU (debug)
• Register files: IRF 32x32-bit (18 ports), VRF 32x128-bit (12 ports)
• Memory: 2 KB I-cache, 1 KB D-cache, 2MB CMX (Connection MatriX) SRAM, 256kB 2-way L2 cache
• Ports and buses: 64-bit CMX ports, 3x 128-bit ports, 128-bit AXI, 32-bit APB, SHAVE bus]

CV Application Observations

[Figure: software layers (Kernels, Complex Kernels, Pipelines, Applications), labelled with Reusability and Integration Effort.]

Linear Algebra Evolution

• LAMA: Linear Algebra MYRIAD Acceleration
• Goals
  – Accelerate most frequently used Linear Algebra kernels
  – A framework for rapid utilization of optimized Linear Algebra kernels
  – Full source / ASM open to customers

MV LIB: LAMA Overview

[Figure: LAMA software stack. Blocks shown: EIGEN; Libflame (LAPACK) #; BLAS *; levels 0, 1, 1f, 1v, 1m, 2, 2; BASE, Control, Compatibility, Utilities.]

*# Based on BLIS; UT Austin Flame Project

BLIS microkernels

[Figure: directory layout of the BLIS kernel tree.
• Kernel directories: blis/kernels/1, blis/kernels/1f, blis/kernels/1m, blis/kernels/3
• Architecture subdirectories shown: /bgq, /c99, /loongson3a, /mic, /power7, /x86, /x86_64/core2-sse3, /x86_64/piledriver]

BLIS GEMM Implementation notes

• Matrix A copied to CMX and all SHAVEs share the same copy.
• Matrix B copied to CMX, with double buffering.
• Matrix C copied to CMX, with triple buffering and write-back to DDR.
• Each SHAVE processes 16 (at most) micro-panels in a batch: a trade-off between speed and memory usage.
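The double-buffering idea for matrix B can be sketched roughly as follows. This is a minimal, sequential illustration in plain Python, not the LAMA or BLIS code: `load_panel`, `matmul_panel`, `gemm_double_buffered` and the panel width `nb` are hypothetical names chosen here, and the "DMA" copy is just a list slice. On the real hardware the load of the next panel would run concurrently with the compute on the current one.

```python
def load_panel(B, j0, nb):
    """Stand-in for a DMA copy of columns j0..j0+nb-1 of B into fast memory."""
    return [row[j0:j0 + nb] for row in B]


def matmul_panel(A, B_panel):
    """Multiply A (m x k) by one column panel of B (k x nb)."""
    k = len(B_panel)
    nb = len(B_panel[0])
    return [[sum(A[i][p] * B_panel[p][j] for p in range(k)) for j in range(nb)]
            for i in range(len(A))]


def gemm_double_buffered(A, B, nb=2):
    """Compute C = A * B panel by panel, alternating between two B buffers."""
    m, n = len(A), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    starts = list(range(0, n, nb))
    if not starts:
        return C
    buffers = [None, None]
    buffers[0] = load_panel(B, starts[0], nb)  # prefetch the first panel
    for idx, j0 in enumerate(starts):
        cur = idx % 2
        # Fill the idle buffer with the next panel; hardware would overlap
        # this transfer with the multiplication below.
        if idx + 1 < len(starts):
            buffers[1 - cur] = load_panel(B, starts[idx + 1], nb)
        C_panel = matmul_panel(A, buffers[cur])
        for i in range(m):
            C[i][j0:j0 + len(C_panel[i])] = C_panel[i]
    return C
```

The two-slot `buffers` list is the whole trick: one slot is read by the compute step while the other is being filled, and the roles swap each iteration. Triple buffering for C would add a third slot so that write-back, compute and the next load can each own a buffer.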

BLIS GEMM implementation notes

• Implement double buffering for matrix A and allocate the proper slice only to the SHAVE that requires it.

• Move DMA function calls from LEON to each SHAVE.

• Implement DMA copy mechanism for general stride storage scheme.
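A general stride storage scheme addresses element (i, j) at `offset + i*rs + j*cs`, where `rs` and `cs` are the row and column strides. The sketch below gathers such a matrix into a contiguous row-major buffer, which is roughly what a strided copy has to do. It is an assumption-laden illustration in plain Python: `pack_general_stride` and its parameters are hypothetical names, and a real DMA engine would be programmed with descriptors rather than a loop.

```python
def pack_general_stride(buf, offset, m, n, rs, cs):
    """Gather an m x n matrix stored in buf with row stride rs and
    column stride cs into a contiguous row-major list."""
    return [buf[offset + i * rs + j * cs] for i in range(m) for j in range(n)]


# Row-major 3x4 storage uses rs=4, cs=1; column-major 4x3 storage uses rs=1, cs=4.
storage = list(range(12))
row_major_block = pack_general_stride(storage, 1, 2, 2, rs=4, cs=1)
col_major_block = pack_general_stride(storage, 0, 2, 2, rs=1, cs=4)
```

Row-major and column-major layouts are just the special cases `cs == 1` and `rs == 1`, so one routine covers both as well as sub-matrix views.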

Myriad2 SGEMM @ 300MHz, 1-12 SHAVEs
40 GFLOPS @ 600MHz @ 12 SHAVEs
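To put the 40 GFLOPS figure in per-core terms, the arithmetic can be sketched as below. The helper name `flops_per_cycle_per_core` is hypothetical, and the slide does not say whether 40 GFLOPS is a measured or projected number, so the result is only a back-of-the-envelope reading of the figures quoted above.

```python
def flops_per_cycle_per_core(gflops, clock_mhz, cores):
    """Average floating-point operations per clock cycle on each core."""
    return gflops * 1e9 / (clock_mhz * 1e6 * cores)


# Figures quoted on the slide: 40 GFLOPS at 600 MHz on 12 SHAVEs.
per_shave = flops_per_cycle_per_core(40, 600, 12)
```

With those inputs the value is about 5.6 single-precision operations per cycle per SHAVE.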

Myriad2 STRSM_LL @ 300MHz, 1-12 SHAVEs

Least-Squares 10fps 1x SHAVE

System running at 504.00MHz
Leon running from CMX slice 7; stack: size=0x1A38, top=0x700edf00

Buffers:
  A[512][512] @0x70100000..0x701fffff ; A, then M=A'A, then L=M/L'
  b[512]      @0x700ff800..0x700fffff ; b
  x[512]      @0x700ff000..0x700ff7ff ; c=A'b, then y=L\c, then x=L'\y

Running a full lineq sequence...
. gemv_t_block( x @0x700ff000, A @0x70100000, b @0x700ff800, 512 );
. syrk_tn_l_block( A @0x70100000, tmp4cols @0x700fd000, 512 );
. potrf_ln_block( A @0x70100000, 512 );
. trsv_ln_block( x @0x700ff000, A @0x70100000, x @0x700ff000, 512 );
. trsv_lt_block( x @0x700ff000, A @0x70100000, x @0x700ff000, 512 );
All done.

Computation took:
 - gemv_t@opt    :   0.3ms (   201318cycles)
 - syrk_tn_l@opt :  59.3ms ( 29916308cycles)
 - potrf_ln@opt  :  41.8ms ( 21077304cycles)
 - trsv_ln@opt   :   0.7ms (   369558cycles)
 - trsv_lt@opt   :   0.7ms (   360934cycles)
Total: 103.0ms (51925422cycles)
All done.
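The sequence in the log follows the normal-equations route to a least-squares solution: c = A'b, M = A'A, Cholesky factor L of M, then two triangular solves. A minimal pure-Python sketch of the same five steps is given below. The function names loosely echo the log but are stand-ins written for this illustration, not the LAMA routines, and the sketch allocates new lists instead of reusing the A and x buffers in place as the log does.

```python
import math


def gemv_t(A, b):
    """c = A' b."""
    return [sum(A[i][j] * b[i] for i in range(len(A))) for j in range(len(A[0]))]


def syrk_tn(A):
    """M = A' A (full symmetric result, for simplicity)."""
    n = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]


def potrf_lower(M):
    """Cholesky factorization M = L L', returning lower-triangular L."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = M[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)  # assumes M is symmetric positive definite
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (M[i][j] - s) / L[j][j]
    return L


def trsv_lower(L, c):
    """Forward substitution: solve L y = c."""
    y = []
    for i in range(len(L)):
        y.append((c[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def trsv_lower_t(L, y):
    """Back substitution: solve L' x = y."""
    n = len(L)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x


def least_squares(A, b):
    """Minimize ||A x - b|| via the normal equations A'A x = A'b."""
    c = gemv_t(A, b)
    L = potrf_lower(syrk_tn(A))
    return trsv_lower_t(L, trsv_lower(L, c))
```

For example, `least_squares([[1, 0], [1, 1], [1, 2]], [1, 2, 3])` fits the line y = 1 + t through three collinear points. In the timings above, the two O(n^3) steps (forming A'A and factorizing it) account for nearly all of the cycles, which is consistent with this structure.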

Myriad2 IEEE Micro Jan/Feb 2015

Myriad2 IEEE Micro Mar/Apr 2015

Summary

• MA1100 / 2014
  – Stream benchmark
  – BLIS L3 by end of June 2014
  – SpMV done
  – Cholesky Factorization complete
  – libFlame ported to LEON
• MA2xxx / 2015
  – Tuning of BLIS, libFlame
  – Further Mixed (lower) precision work
• Acknowledgments:
  – http://excess-project.eu/
  – www.lero.ie at Trinity College Dublin, Ireland
