Using RISC-V in high computing, ultra-low power, programmable circuits for inference on battery operated edge devices

Martin Croome, VP Business Development, GreenWaves Technologies

RISC-V Day in Shanghai, 30 June 2018

TRANSCRIPT

  • Using RISC-V in high computing, ultra-low power, programmable circuits for inference on battery operated edge devices

    Martin Croome, VP Business Development, GreenWaves Technologies

    RISC-V Day in Shanghai, 30 June 2018

  • What this talk is about?

    RISC-V Foundation

    The IoT pipe: NB-IoT, LTE-M, Sigfox, LoRa, etc. (B/day to kB/day)

    Battery operated sensors, rich sensor data:
    • 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s
    • 24-bit @ 50 kHz = 1.2 Mbit/s
    • Linear PCM = 1.4 Mbit/s

    Market demand:
    • Keyword spotting, beam forming, speech pre-processing
    • Vibration analysis, fault detection
    • Face detection, presence detection, counting, emotion detection

    30 June 2017
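The sensor data rates on this slide follow from simple arithmetic, which the sketch below reproduces. The function names are illustrative, and the Linear PCM case assumes CD-style audio (44.1 kHz, 16-bit, stereo), which the slide does not state. Note that 160x120 pixels at 8 bits and 10 fps works out to about 1.5 Mbit/s; the 4.6 Mbit/s figure matches 30 fps, or three 8-bit channels at 10 fps, so one of those is presumably what the slide intends.

```c
#include <assert.h>
#include <stdint.h>

/* Raw data rate of an uncompressed image stream, in bits per second. */
uint64_t image_bits_per_second(uint32_t width, uint32_t height,
                               uint32_t bits_per_pixel, uint32_t fps)
{
    assert(width > 0 && height > 0);
    return (uint64_t)width * height * bits_per_pixel * fps;
}

/* Raw data rate of a sampled signal (vibration, audio), in bits per second. */
uint64_t sampled_bits_per_second(uint32_t sample_rate_hz,
                                 uint32_t bits_per_sample,
                                 uint32_t channels)
{
    assert(channels > 0);
    return (uint64_t)sample_rate_hz * bits_per_sample * channels;
}
```

For example, `sampled_bits_per_second(50000, 24, 1)` gives 1 200 000 bit/s, the 1.2 Mbit/s quoted for the 24-bit, 50 kHz sensor.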

  • What this talk is about?

    Between the rich sensor data and the IoT pipe (B/day to kB/day):
    • CNN
    • SVM
    • Bayesian
    • Boosting
    • Cepstral analysis

  • What this talk is about?

    Issue: this needs far more MIPS than an MCU can deliver, but it has to stay within an MCU power envelope.

  • General Patterns for content understanding

    • Extract descriptors from raw data (usually highly parallel)
      • 2D: corners, blobs, HOG, DOG, …
      • 1D: LPC coefficients, cepstral coefficients, …

    • Use descriptors to classify data among representative families (also highly parallel)
      • Machine learning (CNN, SVM, Boost), Bayesian, …

  • GAP8: Ultra Low Power IoT Processor

    Architecture efficiency
    • Extended RISC-V ISA
    • Low contention shared memory, 8+1 core clustered architecture
    • Tight synchronization
    • CNN based pattern matching engine (HWCE)

    Performance
    • up to 12 GOPS
    • up to 0.4 GOPS @ 1 mW
    • up to 40 MOPS @ 300 uW
    • 3 uW stand-by power consumption

    HW features
    • Smart I/Os
    • Voltage regulator / DVFS
    • RTC
    • Secured execution
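As a rough check of what these operating points mean per operation, the sketch below converts throughput and power into energy per operation and GOPS per watt. The function names are illustrative, and the inputs are the "up to" values from the slide, so the results are best-case figures.

```c
#include <assert.h>

/* Energy per operation in picojoules, from operations per second and watts. */
double picojoules_per_op(double ops_per_second, double watts)
{
    assert(ops_per_second > 0.0);
    return watts / ops_per_second * 1e12;
}

/* Efficiency in GOPS per watt, from operations per second and watts. */
double gops_per_watt(double ops_per_second, double watts)
{
    assert(watts > 0.0);
    return ops_per_second / watts / 1e9;
}
```

With 0.4 GOPS at 1 mW this gives about 2.5 pJ per operation (400 GOPS/W); with 40 MOPS at 300 uW it gives about 7.5 pJ per operation.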

  • GAP8 hierarchical power architecture

    Quasi stand-by (uWs): monitoring
    • Smart I/Os
    • Voltage regulator & RTC
    • SRAM in retentive mode

    Low computing power (mWs): event qualification, protocol stack, system control
    • Extended RISC-V

    High computing power (10 to 50 mWs): data analysis & classification
    • Extended RISC-V
    • Efficient 8 core parallelization
    • HW synchronization
    • Shared instruction cache
    • CNN HW engine

    [Figure labels: "primary energy consumption", shown twice]

  • GAP8: Open Source Origin

    GAP8:
    • Best in class Instruction Set Architecture (ISA), UC Berkeley originated (RISC-V)
    • Open Source Computing Platform created by ETHZ and UniBo (PULP)
    • Engineered as Ultra-low power IoT Application Processor

  • SW development flow

    [Figure: GAP8 block diagram. FC clock & voltage domain: Fabric Controller with L1 and I$, L2 memory, ROM, debug, PMU, RTC, micro DMA and peripherals (LVDS, Serial I/Q, UART, SPI, I2C, I2S, CPI, HyperBus, GPIO / PWM). Cluster clock & voltage domain: cores 0 to 7, HWCE, shared L1 memory, shared instruction cache, logarithmic interconnect, cluster DMA, HW sync, debug.]

    • Identical cores: single GCC/GDB toolchain (including support for extended ISA)
    • CNN graph translators (TF2GAP8, ONNX2GAP8 in development)
    • Code generators for common algorithms (CNN layers, Matrix, FIR, FFT, HoG, MFCC, …)
    • GAP8 AutoTiler
      • Separates kernel parallelization / vectorization and data flow
      • Automatic code generation for data flow
    • OpenMP or Native API
    • GAPUINO development board
    • Classic MCU development
      • PULP OS, ARM™ Mbed, FreeRTOS, other OSs in development
      • Drivers
      • Cluster APIs

    Arm and Mbed are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere.
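To illustrate the OpenMP option, here is a minimal sketch of a kernel split across the eight cluster cores. It uses only standard OpenMP pragmas; the function name and the choice of a dot product are assumptions made for illustration and are not taken from the GAP8 SDK. Built without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_CORES 8 /* cluster size quoted in the slides */

/* Dot product of two 16-bit vectors, with iterations shared between cores. */
int32_t dot_product_s16(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;

    assert(a != NULL && b != NULL && n >= 0);

    /* Each thread accumulates a private partial sum; the reduction clause
       adds the partial sums together when the loop ends. */
    #pragma omp parallel for num_threads(CLUSTER_CORES) reduction(+:acc)
    for (int i = 0; i < n; i++) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}
```

A reduction keeps the cores from contending on one shared accumulator, which is the kind of pattern that fits a shared-memory cluster with hardware synchronization.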

  • Automated Memory Management

    Basic Kernels (usually seen as libraries)
    How to handle a parametric tile
    • Vectorization + parallelization
    • No assumption on where actual data are located

    User Kernels (can be grouped and organized as generators)
    Passing actual data to basic kernels and having data circulating between them
    • A multi-dimensional iteration space (2D, 3D, 4D) and a traversal order
    • Each argument is a sub-space of the iteration space and has actual dimensions, location (L2, external) and properties
    • Given a memory budget, the auto tiler "tiles" each argument and generates a fully pipelined implementation interleaving processing and data movements
    • Basic Kernels are inserted at defined locations in the iteration space (prologue, body, epilogue, …)
    • Generated tiles are passed to Basic Kernels
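The split between basic kernels and user kernels can be pictured with a hand-written sketch of the kind of code a tiler produces. Everything here is hypothetical: the names, the 256-byte budget and the use of `memcpy` in place of a DMA transfer are assumptions for illustration, and this is not the output of the GAP8 AutoTiler. On real hardware the fetch of the next tile would run asynchronously, overlapping with the computation on the current tile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define L1_BUDGET_BYTES 256 /* hypothetical memory budget for this argument */
#define TILE_ELEMS (L1_BUDGET_BYTES / 2 / sizeof(int16_t)) /* two buffers share it */

/* Basic kernel (hypothetical): handles one tile, wherever the full data lives. */
static int32_t sum_tile(const int16_t *tile, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += tile[i];
    }
    return acc;
}

/* Number of elements in the tile that starts at offset. */
static size_t tile_len(size_t offset, size_t total)
{
    size_t left = total - offset;
    return left < TILE_ELEMS ? left : TILE_ELEMS;
}

/* User kernel (hypothetical): 1D iteration space, traversed tile by tile. */
int32_t tiled_sum(const int16_t *l2_data, size_t total)
{
    int16_t l1[2][TILE_ELEMS]; /* ping-pong buffers standing in for L1 */
    int32_t acc = 0;
    int cur = 0;

    if (total == 0) {
        return 0;
    }
    assert(l2_data != NULL);

    /* Prologue: fetch the first tile. memcpy stands in for a DMA transfer. */
    memcpy(l1[cur], l2_data, tile_len(0, total) * sizeof(int16_t));

    for (size_t off = 0; off < total; off += TILE_ELEMS) {
        size_t n = tile_len(off, total);
        size_t next = off + n;

        /* Fetch the next tile into the other buffer before computing. */
        if (next < total) {
            memcpy(l1[cur ^ 1], l2_data + next,
                   tile_len(next, total) * sizeof(int16_t));
        }
        acc += sum_tile(l1[cur], n); /* body: basic kernel on the current tile */
        cur ^= 1;
    }
    return acc;
}
```

The basic kernel never sees the full array, only a tile and its length, which is what lets the same kernel be reused under different memory budgets.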

  • Automated Memory Management

    [Figure: Autotiler flow. Inputs: Basic Kernels, User Kernels, groups of User Kernels, Generators (C programs with calls to the Autotiler's Model API; C libraries). Tool: Autotiler library (constraints solver, C code generator), compiled and run on a PC. Output: C code for the target handling data movements and Basic Kernels dispatch on the cluster's cores.]

    #include "AutoTilerLib.h"
    #include "CNN_Generator.h"

    void Mnist()
    {
        CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_0", 5, 1, 32, 28, 28, 1);
        CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_1", 5, 32, 64, 12, 12, 1);
        CNN_TiledLinearLayer("LinearLayerRL_2", 64, 4, 4, 10, 1, 0, 0);
    }
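The numeric arguments in the MNIST model appear to chain layer sizes together. The slide does not document the argument order, so reading them as filter size, input feature maps, output feature maps, width and height is an assumption, as are stride 1 and no padding. Under that reading the sizes are self-consistent, as the helper functions below (hypothetical names, not part of the Autotiler API) show.

```c
#include <assert.h>

/* Output width (or height) of a KxK convolution, assuming stride 1 and no padding. */
int conv_out_size(int in_size, int kernel)
{
    assert(kernel > 0 && in_size >= kernel);
    return in_size - kernel + 1;
}

/* Output size of non-overlapping 2x2 pooling. */
int pool2x2_out_size(int in_size)
{
    return in_size / 2;
}

/* One convolution + ReLU + 2x2 pooling stage; ReLU does not change the size. */
int conv_relu_pool_out_size(int in_size, int kernel)
{
    return pool2x2_out_size(conv_out_size(in_size, kernel));
}
```

A 28x28 input through the first 5x5 stage gives 12x12, which matches the width and height passed to the second stage; that stage gives 4x4, so the linear layer sees 64 x 4 x 4 = 1024 inputs and produces 10 outputs.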

  • Algorithm Benchmarks

    Cycles per produced output, by number of cores:

    Application                                1       2       4       8    Notes
    1D FFT 1024 Radix4                      28.2    14.3     7.8     4.7
    2D FFT 256 x 256 Radix4                 78.9    41.9    22.6    13.3    0.88 MHz/Frame
    Byte 5x5 Conv                           18.5     9.3     4.7     2.2
    Short 5x5 Conv                          37.8    18.9     9.5     4.6
    Binary 5x5 Conv                         20.8    10.5     5.3     2.8
    Short MaxPool2x2                         8.2     4.2     2.1     1.1
    Short MatMult 32x32                     41.9    20.9    14.0     5.2
    Short 2048 to 1 Fully Connected       3112.0  1616.0   847.0   495.0
    CannyEdge                               99.5    50.9    26.2    12.7    VGA: 3.9 MHz/Frame
    AES-CTR 128b                            15.3     7.7     4.0     2.1    0.47 MHz per Mb/s
    64 Mel Coefficients                    542.7   299.4   176.7   101.3    10 ms slots: 0.64 MHz
    HoG, 8x8 Cells, 2x2 Blocks, 9 Bins      65.0    35.0    18.0     9.0    VGA: 2.76 MHz/Frame
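Two quantities can be derived from the benchmark figures: the speedup from adding cores, and the clock budget per frame. The sketch below uses illustrative function names. Reading "MHz/Frame" as millions of cycles per frame on 8 cores is an inference; it is consistent with the CannyEdge and HoG rows if VGA means 640x480 outputs.

```c
#include <assert.h>

/* Speedup of an n-core run over the single-core run. */
double speedup(double cycles_one_core, double cycles_n_cores)
{
    assert(cycles_n_cores > 0.0);
    return cycles_one_core / cycles_n_cores;
}

/* Fraction of ideal linear scaling achieved on the given number of cores. */
double parallel_efficiency(double cycles_one_core, double cycles_n_cores, int cores)
{
    assert(cores > 0);
    return speedup(cycles_one_core, cycles_n_cores) / cores;
}

/* Millions of cycles needed for one frame of width x height outputs. */
double mcycles_per_frame(double cycles_per_output, int width, int height)
{
    return cycles_per_output * width * height / 1e6;
}
```

For the 1D FFT, 28.2 cycles on one core against 4.7 on eight is a speedup of about 6, or 75% parallel efficiency. For CannyEdge, 12.7 cycles per output over 640x480 outputs is about 3.9 million cycles per frame.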

  • Algorithm Benchmarks

    [Figure: benchmark chart; the only surviving label is 7.1]

  • CNN based text recognition

    Trainable parameters: 421 263
    Neurons: 1 511 904

    33 ms per image

  • Dronet: Autonomous Drone

    Power envelope breakdown @ 165 MHz, 12 images/sec

  • Unique energy efficiency vs performance

    [Figure: energy efficiency versus computing power (100s of MOPS, several GOPS, TFLOPS). Regions: best in class ULP MCUs; high end low power MCUs, mid-range application processors; embedded vision processors, dedicated CNN processors. GAP8: uAs asleep, mWs awake, 10s of mWs. Annotation: 20X.]

    • Extended Instruction Set (ISA)
    • Efficient parallelization
    • Shared instruction cache
    • HW Convolution Engine
    • Ultra fast HW state changes

    Comparison of the latest optimized ARM CMSIS-CNN library versus a GAP8 implementation of an identical CNN graph trained on CIFAR-10 images. Source: ARM processors blog.

    Target      Clock       Time       Cycles        Active power
    STM32 F7    216 MHz     99.1 ms    21 400 000    60 mW
    GAP8 *      15.4 MHz    99.1 ms     1 500 000    3.7 mW
    GAP8 *      175 MHz     8.7 ms      1 500 000    70 mW
    GAP8 **     4.7 MHz     99.1 ms       460 000    0.8 mW

    Running on GAP8 cluster: * no Hardware Convolution Engine, ** with Hardware Convolution Engine.

    Annotations: 16 X reduction; 11 X; STM32 H7, 216 MHz, 40 nm.
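The rows of the CIFAR-10 comparison can be cross-checked with two relations: cycles are clock rate times run time, and energy per inference is active power times run time. The sketch below uses illustrative function names. The 16 X annotation is consistent with the power ratio at equal latency (60 mW against 3.7 mW) and the 11 X with the latency ratio (99.1 ms against 8.7 ms), although which rows the annotations refer to is an inference.

```c
#include <assert.h>

/* Energy for one inference in millijoules, from power in mW and time in ms. */
double energy_mj(double power_mw, double time_ms)
{
    assert(time_ms >= 0.0);
    return power_mw * time_ms / 1000.0; /* mW x ms = uJ, then uJ to mJ */
}

/* Cycle count for a run, from clock in MHz and time in ms. */
double cycles_from_clock(double clock_mhz, double time_ms)
{
    assert(clock_mhz >= 0.0);
    return clock_mhz * 1e6 * (time_ms / 1000.0);
}
```

On these figures the STM32 F7 spends about 5.9 mJ per inference, GAP8 without the HWCE about 0.37 mJ, and GAP8 with the HWCE about 0.08 mJ, roughly 75 times less than the STM32 F7.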

  • Unique energy efficiency vs performance

    HWCE: Boosted convolution

    @1.0 V, 50 MHz. Input: W=32, H=100      Conv 3x3    Conv 5x5
    SW time                                 129.7 us    332.1 us
    SW power                                12.58 mW    12.80 mW
    HWCE time                               69.2 us     60.8 us
    HWCE power                              4.95 mW     5.1 mW

    @1.0 V, 50 MHz. Input: W=32, H=100      Conv 3x3    Conv 5x5
    Speed gain                              1.87        5.46
    Power gain (intrinsic)                  2.54        2.51
    Power gain combined with speed gain     4.76        13.71
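The gain figures in the second table follow directly from the measurements in the first. The sketch below, with an illustrative struct and function names, shows the relation: the combined gain is the ratio of energies, which equals the speed gain multiplied by the intrinsic power gain.

```c
#include <assert.h>

/* One measured convolution run: execution time and average power. */
typedef struct {
    double time_us;
    double power_mw;
} conv_run;

/* How many times faster the HWCE run is than the software run. */
double speed_gain(conv_run sw, conv_run hw)
{
    assert(hw.time_us > 0.0);
    return sw.time_us / hw.time_us;
}

/* Ratio of average power, independent of run time. */
double power_gain(conv_run sw, conv_run hw)
{
    assert(hw.power_mw > 0.0);
    return sw.power_mw / hw.power_mw;
}

/* Ratio of energy per convolution (time x power). */
double energy_gain(conv_run sw, conv_run hw)
{
    return speed_gain(sw, hw) * power_gain(sw, hw);
}
```

For the 5x5 case, 332.1 us at 12.80 mW against 60.8 us at 5.1 mW gives a speed gain of about 5.46, a power gain of about 2.51 and a combined gain of about 13.71.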

  • Conclusion

    GAP8's extended RISC-V ISA and flexible, programmable architecture enable massive deployment of edge intelligence:
    • by dramatically reducing rich sensing device installation costs through true autonomy
    • by reducing solution cost with system-on-a-chip integration

    Built on top of 2 major HW open source initiatives

    Architectural innovation enabled by PULP, RISC-V and open source

  • Thank You!

  • Backup Slides

  • People Counting

  • Advanced Power Management

    MCU sleep mode (uW range)
    • Embedded DC/DC, low current
    • Real Time Clock 32 kHz only
    • L2 memory partially retentive

    MCU active mode (1 mW range)
    • Embedded DC/DC, high current
    • Voltage can dynamically change
    • One clock gen active, frequency can dynamically change
    • Systematic clock gating

    MCU + parallel processor active mode (10-40 mW range)
    • Embedded DC/DC, high current
    • Voltage can dynamically change
    • Two clock gen active, frequencies can dynamically change
    • Systematic clock gating

    Ultra fast switching time from one mode to another

    Ultra fast voltage and frequency change time

    Highly optimized system level power consumption

  • Source of Energy Efficiency?

    Data analysis & classification:
    • Extended RISC-V
    • Efficient 8 core parallelization
    • HW synchronization
    • Shared instruction cache
    • CNN HW engine

    Gain factors shown on the slide: 3-5x, 1.4x, 4x, 1.5x

    Overall, in practice on targeted algorithms, typically 20x

    [Figure: GAP8 block diagram. Cluster: eRISC-V cores, logarithmic interconnect, shared L1 memory, shared instruction cache, debug unit, DMA, CNN-HWE, HW sync. Outside the cluster: L2 memory, ROM, clock, debug, an eRISC-V core with I$ and L1, micro DMA and peripherals (LVDS, UART, SPI, I2S, I2C, // 10b, GPIOs, HyperBus).]

  • System Cost

    [Figure: system cost versus computing power (100s of MOPS, several GOPS, TFLOPS). Regions: best in class ULP MCUs; high end low power MCUs, mid-range application processors; embedded vision processors, dedicated CNN processors; GAP8. Annotation: 2-3X.]

    System-on-a-chip, high integration