scaling-up edge computing with pulp · 2integrated systems laboratory 1department of electrical,...

50
2 Integrated Systems Laboratory 1 Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core Platform for Micropower in-Sensor Analytics Southampton 19.01.2018 Davide Rossi 1 , Antonio Pullini 2 , Igor Loi 1 , Davide Schiavone 2 , Francesco Conti 1 , Florian Glaser 1 , Florian Zaruba 2 , Stefan Mach 2 , Giovanni Rovere 2 , Germain Haugou 2 , Manuele Rusci 1 , Alessandro Capotondi 1 , Giuseppe Tagliavini 1 , Daniele Palossi 2 , Andrea Marongiu 1,2 , Fabio Montagna 1 , Victor Javier Kartsch Morinigo 1 , Simone Benatti 1 , Lei Li 2 , Renzo Andri, Lukas Gavigelli, Eric Flamand 2 , Frank K. Gürkaynak 2 , Andeas Kurth, Pirmin Vogel, Alessandro Capotondi, Luca Benini 1,2

Upload: others

Post on 29-Mar-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

2Integrated Systems Laboratory

1Department of Electrical, Electronicand Information Engineering

Scaling-up Edge Computing with PULPA many-core Platform for Micropower in-Sensor Analytics

Southampton 19.01.2018

Davide Rossi1, Antonio Pullini2, Igor Loi1, Davide Schiavone2,Francesco Conti1, Florian Glaser1, Florian Zaruba2, StefanMach2, Giovanni Rovere2, Germain Haugou2, Manuele Rusci1,Alessandro Capotondi1, Giuseppe Tagliavini1, Daniele Palossi2,Andrea Marongiu1,2, Fabio Montagna1, Victor Javier KartschMorinigo1, Simone Benatti1, Lei Li2, Renzo Andri, LukasGavigelli, Eric Flamand2, Frank K. Gürkaynak2, Andeas Kurth,Pirmin Vogel, Alessandro Capotondi, Luca Benini1,2

Page 2: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

2Integrated Systems Laboratory

1Department of Electrical, Electronicand Information Engineering

CPS/IoT Hierarchical Processing

2

NVIDIA V100 815mm2 TSMC12nmApple A11 90mm2 TSMC10nm

Page 3: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

CPS/IoT Hierarchical Processing

3

NVIDIA V100 815mm2 TSMC12nmApple A11 90mm2 TSMC10nm

Page 4: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

NSP: A Quantitative View

4

Image

Voice/Sound

Inertial

Biometrics

Extremely compact output (single index, alarm, signature)

Recognition:

Speech:

Kalman:

SVM:

607 Kbps

INPUT BANDWIDTH

COMPUTATIONALDEMAND

OUTPUT BANDWIDTH

4.00 GOPS 10 bps

256 Kbps 100 MOPS 0.02 Kbps

2.4 Kbps 7.7 MOPS 0.02 Kbps

16 Kbps 150 MOPS 0.08 Kbps

[*Nilsson2014]

[*Benatti2014]

[*Ioffe2015]

[*VoiceControl]

Parallel worlkoad

COMPRESSION FACTOR

60700x

12800x

120x

200x

INCEPTION CNN

Page 5: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Does it Matter?

5

3x Cost reduction if data volume is reduced by 95%

Page 6: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

NSP on MCUs?

6

High performance MCUs

Low-Pow

er MC

Us

Courtesy of J Pineda, NXP + Updates

1pJ/OP InceptionV3 @2fps in 10mW

Page 7: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

|| 7

Parallel Ultra-Low Power

Page 8: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Outline

8

Near Threshold Multiprocessing Non-Von Neumann Accelerators Aggressive Approximation From Frame-based to Event-based Processing Outlook and Conclusion

Page 9: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Minimum Energy Operation

9

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2

Total EnergyLeakage EnergyDynamic Energy

0.55 0.55 0.55 0.55 0.6 0.7 0.8 0.9 1 1.1 1.2

Ener

gy/C

ycle

(nJ)

32nm CMOS, 25oC

4.7X

Logic Vcc / Memory Vcc (V)

Source: Vivek De, INTEL – Date 2013

Near-Threshold Computing (NTC): 1. Don’t waste energy pushing devices in strong inversion2. Recover performance with parallel execution3. Manage leakage+process/temperature variations in NT!!!

Page 10: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Near-Threshold Multiprocessing

10

. . . . .

4-stage RISC-V RV32IMC+

I$B0 I$Bk

DEMUX

L1 TCDM+T&SMB0 MBM

Shared L1 DataMem + Atomic Variables

DMA + HW SYNCH

Tightly Coupled DMAAnd Hardware Sychronizer

Periph+ExtM

N Cores PE0 PEN‐1

D. Rossi et al., "Energy-Efficient Near-Threshold ParallelComputing: The PULPv2 Cluster," in IEEE Micro, Sep./Oct. 2017.

Page 11: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Near-Threshold Multiprocessing

11

NT but parallel Max. Energy efficiency when Active + strong PM for (partial) idleness

1.. 8 PE-per-cluster, 1…32 clusters PGAS machine

PMCA-managed IOMMU

Page 12: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

ULP (NT) Bottleneck: Memory

12

“Standard” 6T SRAMs: High VDDMIN Bottleneck for energy efficiency >50% of energy can go here!!!

Near-Threshold SRAMs (8T) Lower VDDMIN Area/timing overhead (25%-50%) High active energy Low technology portability

Standard Cell Memories: Wide supply voltage range Lower read/write energy (2x - 4x) High technology portability Major area overhead 4x 2.7x

with controlled placement

2x-4x256x32 6T SRAMS vs. SCM

A. Teman et.al., ‘Power, Area, and Performance Optimization of Standard Cell Memory Arrays Through Controlled Placement’,

in ACM TDAES, May 2016

Page 13: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

I$: a Look Into ‘Real Life’ Applications

13

Issues:1) Area Overhead of SCMs (4Kb/core not affordable….)2) Capacity miss (with small caches)3) Jumps due to runtime (e.g. OpenMP, OpenCL) and other function calls

Applications on PULP

SHORT JUMP LOOP BASED APPLICATIONS

LONG JUMP APPLICATIONSLIBRARY BASED

Exixting ULP processors Latch based I$ REISC (ESSCIRC2011) 64b

Sleepwalker (ISSCC 2012) 128b

Bellevue (ISCAS 2014) 128b

Survey of State of The Art

SCM-BASED I$ IMPROVES EFFICIENCY BY ~2X ON SMALL BENCHMARKS, BUT…

Page 14: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Shared I$

14

Share instruction cache OK for data parallel execution model Not OK for task parallel execution model,

or very divergent parallel threads Architectures SP: single-port banks connected through

a read-only interconnect Pros: Low area overhead Cons: Timing pressure, contention

MP: Multi-ported banks Pros: High efficiency Cons: Area overhead (several ports)

Results Up to 40% better performance than

private I$ Up to 30% better energy efficiency Up to 20% better energy*area efficiency

I. Loi, et.al., "The Quest for Energy-Efficient I$ Design in Ultra-Low-Power Clustered Many-Cores," in IEEE Transactions on

Multi-Scale Computing Systems, In Press.

Page 15: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Synchronization & Events

Avoid busy waiting!Minimize sw synchro. overheadEfficient fine-grain parallelization 15

Page 16: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Cluster Event Unit

16

Page 17: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Energy-efficient Event Handling

Single instruction rules it all: address determines action read datum provides

corresponding information

evt_data = evt_read(addr);

instruction turns off core clock NO pipeline flush!

• trigger sw event,• trigger barrier,• try mutex lock,• read entry point,• auto buffer clear?• …

• event buffer content,• program entry point• mutex message• triggering core id• …

17

Page 18: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

|| 18

<32-bit precision SIMD2/4 opportunity

1. HW loops and Post modified LD/ST2. Bit manipulations3. Packed-SIMD ALU operations with dot product4. Rounding and Normalizazion5. Shuffle operations for vectors

Extending RISC-V for NSP

Small Power and Area overhead

RISC‐V  V1

V2V3

HW loopsPost modified Load/StoreMac

SIMD 2/4 + DotProduct + ShufflingBit manipulation unitLightweight fixed point

V2

V3

Baseline RISC-V RV32IMCV1

M. Gautschi et al., "Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices," in IEEE TVLSI, Oct. 2017.

Page 19: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Dot Product SIMD Processor datapath

Sum-of-dot-product units:8b version16b version1 cycle execution

19

Page 20: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Shuffle

7 Sum-of-dot-product 4 move 1 shuffle 3 lw/sw ~ 5 control instructions

20 instr. / output pixel Scalar version >100 instr. / output pixel

Convolution in registers 5x5 convolutional filter

20

Page 21: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

CNNs with ISA-Extensions

Convolution Performance on PULP with 4 cores

speedup: 3.9x

5x5 convolution in only 6.6 cycles/pixel

Energy gains: 4.2x

21

Page 22: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

void Radix4FFTKernelDIT_Vect(v2s *InOutA, v2s *InOutB, v2s *InOutC, v2s *InOutD, v2s W1, v2s W2, v2s W3)

{v2s B1, C1, D1;v2s A = *InOutA, B = *InOutB, C = *InOutC, D = *InOutD;

B1 = gap8_cplxmuls(B, W1); C1 = gap8_cplxmuls(C, W2); D1 = gap8_cplxmuls(D, W3);

*InOutA = ((A + C1) + (B1 + D1)) >> (v2s) {FFT4_SCALEDOWN, FFT4_SCALEDOWN};*InOutB = ((A C1) + gap8_sub2rotmj(B1, D1)) >> (v2s) {FFT4_SCALEDOWN, FFT4_SCALEDOWN};*InOutC = ((A + C1) (B1 + D1)) >> (v2s) {FFT4_SCALEDOWN, FFT4_SCALEDOWN};*InOutD = ((A C1) gap8_sub2rotmj(B1, D1)) >> (v2s) {FFT4_SCALEDOWN, FFT4_SCALEDOWN};

}

Original RiscV ISA: 80 cycles Enhanced ISA: 23 cycles

The work horse for radio, sound and vibration: FFT

FFT 256 FFT 1024 FFT 4096

1167 4842 22710

Number of Cycles running on 8 cores

22

Page 23: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Near threshold FDSOI technology

23

Body bias: Highly effective knob for power & variability management!

Page 24: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Selective, Fine Grained Body Biasing

24

The cluster is partitioned in separate clock gating and body bias regions

Body bias multiplexers (BBMUXes) control the well voltages of each region

A Power Management Unit (PMU) automatically manages transitions between the operating modes

Power modes of each region: Boost mode: active + FBB Normal mode: active + NO BB Idle mode: clock gated + NO BB (in LVT)

RBB (in RVT)

D. Rossi et. al., «A 60 GOPS/W, −1.8V to 0.9V body bias ULP cluster in 28nm UTBB FD-SOI technology»,

in Solid-State Electronics, 2016.

Page 25: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Body Bias-Based Performance Monitoring and Control

25

Goal: Compensate the effects of external

factors like ambient temperature Reduce PVT Margings at design time Improve Energy Efficiency

Mixed Hardware/Software control loop exploitingon-chip frequency measurement (PMB)

D. Rossi et al., "A Self-Aware Architecture for PVT Compensation and Power Nap in Near Threshold Processors,"

in IEEE Design & Test, Dec. 2017.

Page 26: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

The Evolution of the ‘Species’

26

PULPv1 PULPv2 PULPv3Status silicon proven Silicon proven post tape outTechnology FD-SOI 28nm

conventional-wellFD-SOI 28nm

flip-wellFD-SOI 28nm

conventional-wellVoltage range 0.45V - 1.2V 0.3V - 1.2V 0.5V - 0.7VBB range -1.8V - 0.9V 0.0V - 1.8V -1.8V - 0.9VMax freq. 475 MHz 1 GHz 200 MHzMax perf. 1.9 GOPS 4 GOPS 1.8 GOPSPeak en. eff. 60 GOPS/W 135 GOPS/W 385 GOPS/W

PULPv1 PULPv2 PULPv3# of cores 4 4 4L2 memory 16 kB 64 kB 128 kBTCDM 16kB SRAM 32kB SRAM

8kB SCM32kB SRAM16kB SCM

DVFS no yes yesI$ 4kB SRAM private 4kB SCM private 4kB SCM sharedDSP Extensions no no yesHW Synchronizer no no yes

Page 27: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

The Evolution of the ‘Species’

27

PULPv1 PULPv2 PULPv3Status silicon proven Silicon proven post tape outTechnology FD-SOI 28nm

conventional-wellFD-SOI 28nm

flip-wellFD-SOI 28nm

conventional-wellVoltage range 0.45V - 1.2V 0.3V - 1.2V 0.5V - 0.7VBB range -1.8V - 0.9V 0.0V - 1.8V -1.8V - 0.9VMax freq. 475 MHz 1 GHz 200 MHzMax perf. 1.9 GOPS 4 GOPS 1.8 GOPSPeak en. eff. 60 GOPS/W 135 GOPS/W 385 GOPS/W

PULPv1 PULPv2 PULPv3# of cores 4 4 4L2 memory 16 kB 64 kB 128 kBTCDM 16kB SRAM 32kB SRAM

8kB SCM32kB SRAM16kB SCM

DVFS no yes yesI$ 4kB SRAM private 4kB SCM private 4kB SCM sharedDSP Extensions no no yesHW Synchronizer no no yes

2.6pj/OP

Page 28: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

PULPv3 Demo BoardPULPv3 board featuring voltage regulator (down to 0.5V)Peltier element to heat-up / cool the samplesTEC controller to drive the peltier elementEmbecosm MAGEEC Energy Monitoring Shield for power measurementsJTAG Programmer for loading code on PULPv3Host laptop to show plots

>2x leakage reduction @ 70°C

Extended operating range for zero-margin design (i.e. signoff in typical corner)

28

Page 29: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Full application with PULP

Is PULP really needed, or is processing light enough for an ULP MCU?Case study: real-time seizure detection 23 EEG channels Implemented on PULPv3 6x speedup on 8 processors

Near threshold operation and efficient architecture

Parallelization and workload distribution+

Sub-mW operation (months of battery lifetime) Real-time even under msec constraints=

Room for ↑↑ ExG channel (>128) and sensor fusion

70x lower energy than Ambiq MCU

29

Page 30: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Not only NT: GAP8 (55nm TSMC) & Mr. Wolf (45nm TSMC)

PULP’s Commercial “big brother” TSMC55 (available in march)

L2

SPIM x 2

SoCTCDMBANK #0 

TCDMBANK #1

TCDMBANK #M‐1

SHARED I$

RISCV#0

RISCV#1

RISCV#7

BRIDGE

CLUSTER

TCDM INTERCONNECT

...

...CLUSTER

 BUS

PERIPH

ERAL

 INTE

RCONNEC

T

DC FIFO

DC FIFOID R.

AXI2MEM

SOC BU

S

ROM

SoC Ctrl

JTAG2AXI

JTAG

FLL Ctr

UART

GPIO

Top

DU DU DU

PAD M

UX

JTAGTAP

I2C x 2

I2S x 2

SOC AP

B

TMC

uDMA

SPI S

FLL x 2

DMA

CLUSTERPERIPHS

LVDS I/Q

AXI TO MEM+ MPU FILTER

FC SUBSYSTEM

L2 REFILLMASTER

UDMASLAVE

ROM REFILL MASTER

APB SLAVE

AXIMASTER

PWM x 4

10b // (CPI)

PP Im

PP Au

DC/DCRTC

PMU

BOR BOR TRC

Q-SPIM

AXI TO APB + MPU FILTER

Hyper-Bus

Serial I/Q

AQFN84 package

HWCE

30

Page 31: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Not only NT: GAP8 (55nm TSMC) & Mr. Wolf (45nm TSMC)

PULP’s Commercial “big brother” TSMC55 (available in march)

L2

SPIM x 2

SoCTCDMBANK #0 

TCDMBANK #1

TCDMBANK #M‐1

SHARED I$

RISCV#0

RISCV#1

RISCV#7

BRIDGE

CLUSTER

TCDM INTERCONNECT

...

...CLUSTER

 BUS

PERIPH

ERAL

 INTE

RCONNEC

T

DC FIFO

DC FIFOID R.

AXI2MEM

SOC BU

S

ROM

SoC Ctrl

JTAG2AXI

JTAG

FLL Ctr

UART

GPIO

Top

DU DU DU

PAD M

UX

JTAGTAP

I2C x 2

I2S x 2

SOC AP

B

TMC

uDMA

SPI S

FLL x 2

DMA

CLUSTERPERIPHS

LVDS I/Q

AXI TO MEM+ MPU FILTER

FC SUBSYSTEM

L2 REFILLMASTER

UDMASLAVE

ROM REFILL MASTER

APB SLAVE

AXIMASTER

PWM x 4

10b // (CPI)

PP Im

PP Au

DC/DCRTC

PMU

BOR BOR TRC

Q-SPIM

AXI TO APB + MPU FILTER

Hyper-Bus

Serial I/Q

AQFN84 package

HWCE

31

Page 32: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Outline

32

Near Threshold Multiprocessing Non-Von Neumann Accelerators Aggressive Approximation From Frame-based to Event-based Processing Outlook and Conclusion

Page 33: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Recovering More Silicon Efficiency

33

1 > 1003 6

CPU GPGPU HW IP

GOPS/W

Accelerator Gap

SW HWMixed

ThroughputComputing

General-purposeComputing

<1 pj/OP

Closing The Accelerator Efficiency Gap with Agile Customization

ULP parallelComputing

Page 34: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Heterogeneous PULP Cluster

34

Page 35: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Heterogeneous PULP Cluster

35

Page 36: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Heterogeneous PULP SoC: Fulmine

36

CTRL

Line Buffer +Scalable Precision

SOP

Fulmine: Hardware Convolutional Engine (HWCE) in the Cluster ~6.9 mm2 PULP chip, 1/2016 4 cores 1 HWCRYPT accelerator 1 HWCE for 3D conv layers 64kB L1, 192kB L2 QSPI (M/S), I2C, I2S, UART

~1514 kGE for the cluster 232 kGE for the CNN HWCE

Crypto Engine Secure Analytics

SHARED-MEM IF

F. Conti et al., "An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics,",in IEEE TCAS-I Sept. 2017.

Page 37: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Heterogeneous PULP CNN Performance

37

10

100

1000

10000

100000

1 core 4 coresvectorized

16bweights

8b weights 4b weights

SW HWCE

1

10

100

1000

10000

1 core 4 coresvectorized

16bweights

8b weights 4b weights

SW HWCE

Scaled to ST FD-SOI 28nm @ Vdd=0.6V, f=115MHz

PERFORMANCE ENERGY EFFICIENCY

13 GOPS3250 GOPS/W

218x

Cluster performance and energy efficiency on a 64x64 CNN layer (5x5 conv)

84x

[Mop

/s]

[Mop

/s/W

]

5x

0.3 pj/OP

Page 38: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

SoP-Unit Optimization

38

Image Mapping (3x3, 5x5, 7x7)

Equivalent for 7x7 SoP

ImageBank

FilterBank

1 MAC Op = 2 Op (1 Op for the “sign-reverse”, 1 Op for the add).

Page 39: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Origami, YodaNN vs. Human

39

Type Analog (bio)

Q2.9Precision

Q2.9Precision

Binary‐Weight 

Network human ResNet‐34 ResNet‐18 ResNet‐18

Top‐1 error [%]

21.53 30.7 39.2

Top‐5 error [%]

5.1 5.6 10.8 17.0

Hardware Brain Origami Origami YodaNN

Energy‐eff. [uJ/img]

100.000(*) 1086 543 31

The «energy-efficient AI» challenge (e.g. Human vs. IBM Watson)

Game over for humans also in energy-efficient vision?Not yet!

Numbers above assume on-chip storage (1MB binary weights)Object recognition is a very basic task!

*Pbrain = 10W, 10% of the brain used for vision, trained human @ 10img/sec

Page 40: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Outline

40

Near Threshold Multiprocessing Non-Von Neumann Accelerators Aggressive Approximation From Frame-based to Event-based Processing Outlook and Conclusion

Page 41: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Back to System-Level

41

Event-Driven Computation, which occurs only when relevant events are detected by the sensor

Event-based sensor interface tominimize IO energy (vs. Frame-basedinteface

Mixed-signal event triggering with an ULP imager with internal processing AMS capability

PULPv3GrainCam

Mixed-SignalEvent-based

Imager

DigitalParallel

Processor

Smart Visual Sensor idle most of the time (nothing interesting to see)

A Neuromorphic Approach for doing nothing VERY well

Page 42: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

GrainCam Readout

42

Readout modes: IDLE: readout the counter of asserted pixels ACTIVE: sending out the addresses of asserted

pixels (address-coded representation), according raster scan order

Event-based sensing: output frame data bandwidth depends on the external context-activity

Frame-based

{x0, y

0}

{x1, y

1}

{x2, y

2}

{x3, y

3}

{xN-1

, yN-1

}

Event-based

Ultra Low Power Consumption e.g. 10-20uW @10fps

Page 43: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Power Management

Graincam IDLE

PULP Deep Sleep

#events > thresholdSwitch to ACTIVE

ACTIVE

Deep Sleep10μW @10fps

7μW

READOUT

Wake‐upevent

Data Trasfer

Dataevent

Processing

10‐20μW @10fps

2.88 mW Active Power @ 0.55V , 81MHz 

Graincam PULP

M. Rusci, et. al., "A Sub-mW IoT-Endnode for Always-On Visual Monitoring and Smart Triggering," in IEEE IoT Journal, 2017.

uDMA interface

43

Page 44: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Outline

44

Near Threshold Multiprocessing Non-Von Neumann Accelerators Aggressive Approximation From Frame-based to Event-based Processing Outlook and Conclusion

Page 45: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

PULP: An Open Source Parallel Computing Platform

45

PULP Hardware and Software released under Solderpad License

Compiler Infrastructure

Processor & Hardware IPs

Virtualization Layer

Programming Model

Low-Power Silicon Technology

Started in 2013

Page 46: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Zero-riscy RV32-ICM

Micro-riscy RV32-CE

Ariane RV64-ICM

RI5CY RV32-ICMX SIMD HW loops Bit

manipulation

Fixed point

RI5CY + FPU RV32-

ICMFX

RISC-V cores available in PULP

Low Cost Core

Linux capable

Core

Core with DSP enhancements

Floating-point capable Core

32 bit 64 bit

11GK,18KG 40KG 200KG70KG

46

Page 47: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

PULP Open-Source Release and External Contributions

47

1 February 2016First release of PULPino, our single-core microcontroller

2

3

5

May 2016Toolchain and compiler for our RISC-V implementation (RI5CY), DSP extensions

August 2017PULPino updates, new cores Zero-riscyand Micro-riscy, FPU, toolchain updates

End of 2017PULPino v2 + SDK and Virtual Platform, Smart peripherals (uDMA ready), support for Verilator, new event unit

February 2018PULPissimo, ARIANE, PULP

1

2

4

September 2017Porting of ARM CMSIS to PULPinohttps://github.com/misaleh/CMSIS-DSP-PULPino

June 2017Porting of Verilator and BEEBS benchmarks to PULPinohttps://github.com/embecosm/ri5cy

December 2017STING: Open-Source Verification Environment for PULPinohttp://valtrix.in/programming/running-sting-on-pulpino

Releases Community Contributions

4

November 2017Numerous Bug fixes to RiscV in PULPinohttps://github.com/pulp-platform/riscv

3

Page 48: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Poseidon: enter 22FDX

QUENTIN KERBIN

HYPERDRIVE

First 22FDX test-chip of the labs (taped out Jan 12th) Goals:

Explore performance/energy efficiency: In the low-power domain LVT In the high-performance domain SLVT

Explore the voltage range of the technology (LVT + SLVT) Explore body biasing

Test SoCs: Quentin: ultra-low-power 32-bit MCU+XNOR Accelerator Kerbin: high performance (1GHz) 64-bit processor Hyperdrive: Binary Convolutional Networks Accelerator

Chip architecture: Signal pads are multiplexed among the three macros (only one

active at same time) Body bias pads shared among all macros Power supply pads are private for each SoC, with independent

logic, memory arrays and memory periphery supplies for accurate power measurements

Poseidon chip layout

48

Page 49: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

||

Conclusion

49

Near-sensor processing Energy efficiency requirements: pJ/OP and below Technology scaling alone is not doing the job for us Ultra-low power architecture and circuits are needed

CNNs- and heavy DSP functions can be squeezed into mW envelope Non-von-Neumann acceleration Very robust to low precision computations (deterministic and statistical) fJ/OP is in sight!

More than CNN is needed (e.g. non-linear functions, online optimizazion) Open Source HW & SW approach innovation ecosystem

Page 50: Scaling-up Edge Computing with PULP · 2Integrated Systems Laboratory 1Department of Electrical, Electronic and Information Engineering Scaling-up Edge Computing with PULP A many-core

|| 50

Thanks for your attention!

www.pulp-platform.org

The fun is just beginning...

28nm 28nm 28nm 65nm 65nm 65nm 65nm 65nm 65nm65nm

130nm 180nm28nm65nm180nm40nm65nm130nm130nm 180nm

http://asic.ethz.ch