
Page 1

http://www.c2s2.org

Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing

Wajahat Qadeer, Rehan Hameed, Ofer Shacham,

Preethi Venkatesan, Christos Kozyrakis, Mark Horowitz Stanford University

That’s me

Did the heavy lifting but could not come today

Page 2

Smile, you’re on camera

• By show of hands, who here has an (HD) camera on them?
• How many CPUs/GPUs are in the room?
• How many of those xPUs are used for image processing?

ISCA'13 [email protected] 2

Page 3

Imaging and video systems

• High computational requirements, low power budget
  • Stills: ~10M pixels x 10 frames per second
  • Video: ~2M pixels x 30 frames per second
  • ~400 math operations per pixel (just for the image acquisition)
• On CPU… not enough horsepower
• On GPU… too much power
• Typically use special-purpose custom HW
  • About 500X better performance, 500X lower energy than CPU
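The pixel rates and the per-pixel operation count above multiply out to a required throughput. A minimal sketch of that arithmetic, assuming the rough figures quoted on the slide; the function name `ops_per_second` is illustrative and not from the talk:

```c
#include <assert.h>
#include <stdint.h>

/* Operations per second needed to keep up with a stream of frames.
   The inputs are the approximate figures from the slide. */
uint64_t ops_per_second(uint64_t pixels_per_frame,
                        uint64_t frames_per_second,
                        uint64_t ops_per_pixel)
{
    assert(ops_per_pixel > 0);
    return pixels_per_frame * frames_per_second * ops_per_pixel;
}
```

With these inputs, stills need about 40 billion operations per second (10M x 10 x 400) and video about 24 billion (2M x 30 x 400).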

Page 4

Example: H.264 encoder on RISC vs. ASIC

• By coupling compute and storage closely together, ASICs are orders of magnitude more efficient in both performance and energy

Figure: energy (µJ, log scale from 100 to 10,000,000) of RISC vs. ASIC implementations of four H.264 sub-kernels: IME, FME, IP and CABAC. The gap is 2-3 orders of magnitude.*

* R. Hameed et al., Understanding Sources of Inefficiency in General-Purpose Chips. ISCA ’10

Page 5

We are solving the wrong problem!

• Yes, ASIC is 1000X more efficient than general purpose
• Yes, general purpose is more programmable than ASIC
• Yes, we can make each one marginally better
• But those are good answers to all the wrong questions!
• The right questions:
  • Why is the RISC energy so high?
  • What type of computation can we make efficient?
  • Can we make it just 100X better but keep it programmable?

Page 6

Anatomy of a RISC Instruction

Figure: energy breakdown of a single ADD instruction, 70 pJ in total.*
• I-Cache access: 25 pJ
• Register file access: 4 pJ
• Control overheads (instruction decode, sequencing, pipeline management, clocking, …)
• Energy of the 32-bit ADD itself ≈ 0.5 pJ

* Assuming a typical 32-bit embedded RISC in 45 nm technology @ 0.9 V

Page 7

Other instructions overhead

Figure: a five-instruction sequence (LD, LD, ADD, ST, BR). Each instruction pays the same 25 pJ, 4 pJ and control costs; the loads, the store and the branch around the ADD are marked as overhead instructions.

Page 8

D-Cache accesses overhead

Figure: the same five-instruction sequence (LD, LD, ADD, ST, BR). On top of the 25 pJ, 4 pJ and control costs of every instruction, three 25 pJ D-Cache accesses are marked as D-Cache access overheads.

Page 9

SIMD machines give some improvement

• SIMD units amortize overhead and improve performance
• Achieves 10X better energy and performance AND is programmable
• Can we do 100X and keep it programmable?

Figure: a scalar ADD and a SIMD ADD, each drawn with its I-Cache, RF and control costs.

Page 10

Energy efficiency in a programmable environment

Each memory access and instruction fetch must be amortized over hundreds of operations
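The amortization argument can be put into numbers using the two figures given earlier in the deck (70 pJ for one RISC instruction, about 0.5 pJ for the 32-bit add itself). The sketch below is only an illustration under a simplifying assumption that is not stated in the talk: the non-arithmetic part of the instruction energy is paid once per instruction, no matter how many operations the instruction carries. The name `energy_per_op_pj` is hypothetical.

```c
#include <assert.h>

#define INSTR_ENERGY_PJ 70.0   /* one RISC instruction, figure from the slides */
#define ADD_ENERGY_PJ    0.5   /* the 32-bit add itself, figure from the slides */

/* Energy per useful operation when one instruction carries n operations.
   Assumption: the overhead (fetch, register file, control) is paid once
   per instruction and does not grow with n. */
double energy_per_op_pj(int ops_per_instruction)
{
    assert(ops_per_instruction > 0);
    double overhead = INSTR_ENERGY_PJ - ADD_ENERGY_PJ;
    return overhead / ops_per_instruction + ADD_ENERGY_PJ;
}
```

Under this assumption, roughly 140 operations per instruction are needed before the per-operation energy drops to about 1 pJ, which is in line with the "hundreds of operations" on the slide.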

Page 11

What we want to see

Figure: instruction streams in which the I-Cache, Reg File, Control and D-Cache costs are shared by many OPs, instead of being paid separately for each LD, OP and ST.

• D-Cache accesses much narrower than functional path
• Many ops per instruction
• Many ALU instructions per LD/ST instruction

Page 12

Image processing looks like convolution

• Most of the computation is performed over (overlapping) stencils
• Looks like convolution:

$$(\mathrm{Img} \otimes f)[n, m] = \sum_{l=-c}^{c} \sum_{k=-c}^{c} \mathrm{Img}[k, l] \cdot f[n-k,\; m-l]$$

Figure: a stencil of input (In) pixels is multiplied by the coefficients and summed into one output (Out) pixel.
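As a point of reference, the convolution sum can be written as a plain nested loop. This is a minimal sketch, not code from the talk: it uses the equivalent form in which the sum runs over the filter taps (`out[n][m] = sum of img[n-k][m-l] * f[k][l]`), stores images row-major, and treats pixels outside the image as zero. The name `convolve2d` and the zero-padding choice are assumptions.

```c
#include <assert.h>

/* 2D convolution with a (2c+1) x (2c+1) filter.
   img and out are h x w, row-major; f is (2c+1) x (2c+1), row-major,
   with tap (k, l) stored at f[(k + c) * (2c + 1) + (l + c)].
   Taps that fall outside the image are skipped (zero padding). */
void convolve2d(const int *img, int h, int w,
                const int *f, int c, int *out)
{
    assert(img && f && out && h > 0 && w > 0 && c >= 0);
    int fw = 2 * c + 1;
    for (int n = 0; n < h; n++) {
        for (int m = 0; m < w; m++) {
            int acc = 0;
            for (int k = -c; k <= c; k++) {
                for (int l = -c; l <= c; l++) {
                    int y = n - k, x = m - l;
                    if (y < 0 || y >= h || x < 0 || x >= w)
                        continue;
                    acc += img[y * w + x] * f[(k + c) * fw + (l + c)];
                }
            }
            out[n * w + m] = acc;
        }
    }
}
```

Every output pixel re-reads most of the inputs its neighbors used, which is the overlap between stencils that the slide points at.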


Page 15

It does not have to be convolution

• It only looks like convolution:

$$(\mathrm{Img} \overset{CE}{\otimes} f)[n, m] = \operatorname*{Reduce}_{l=-c}^{c} \Big[ \operatorname*{Reduce}_{k=-c}^{c} \big[ \mathrm{map}(\mathrm{Img}[k, l],\; f[n-k,\; m-l]) \big] \Big]$$

Figure: input (In) pixels and coefficients pass through a “map” step, and a “reduce” step combines the results into one output (Out) pixel.
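The generalization from multiply-and-add to arbitrary map and reduce functions can be sketched in C with function pointers. This is an illustration only, under assumptions of its own: the window is anchored at its top-left corner, the coefficients are applied without flipping, and the caller guarantees the window lies inside the image. All names (`stencil_at`, `map_fn`, `reduce_fn` and the helper functions) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef int (*map_fn)(int pixel, int coeff);
typedef int (*reduce_fn)(int acc, int value);

int map_mul(int a, int b)      { return a * b; }
int map_absdiff(int a, int b)  { return abs(a - b); }
int reduce_add(int acc, int v) { return acc + v; }
int reduce_min(int acc, int v) { return v < acc ? v : acc; }

/* One output of a size x size stencil whose top-left corner is (y0, x0).
   img is row-major with row length w. The first mapped value seeds the
   reduction, so no identity element is needed for the reduce function. */
int stencil_at(const int *img, int w, int y0, int x0,
               const int *coeff, int size, map_fn map, reduce_fn reduce)
{
    assert(img && coeff && size > 0 && map && reduce);
    int acc = map(img[y0 * w + x0], coeff[0]);
    for (int i = 1; i < size * size; i++) {
        int y = y0 + i / size;
        int x = x0 + i % size;
        acc = reduce(acc, map(img[y * w + x], coeff[i]));
    }
    return acc;
}
```

Passing `map_mul` with `reduce_add` gives a filter tap sum, while `map_absdiff` with `reduce_add` gives a sum of absolute differences; only the two function arguments change.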

Page 16

Let’s look at some convolution-like workloads

• De-mosaic:
  • Adaptive color plane interpolation (ACPI)*: image gradients followed by a three-tap filter in the direction of smallest gradient.

* Y. Cheng et al., An adaptive color plane interpolation method based on edge detection. Journal of Electronics (China), 2007.

Page 17

Let’s look at more convolution-like workloads

• H.264 (high definition) video encoder:
  • IME: 2D sum of absolute differences (SAD)
  • FME: half-pixel interpolation, quarter-pixel interpolation, 2D SAD

Figure: encoder block diagram. Video Frames enter Inter Prediction (Integer Motion Estimation and Fractional Motion Estimation) and Intra Prediction, followed by the CABAC Entropy Encoder, which emits the Compressed Bit Stream. An annotation reads “90% of execution time is here”.

Page 18

The main computation behind H.264

• Trying to find the best match for a stencil within a small neighborhood

Figure: a block in the Current Frame is compared against a neighborhood in the Previous Frame.
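The search described above can be sketched as an exhaustive block match that scores every candidate position with a sum of absolute differences (SAD). This is a plain-C illustration, not the encoder's actual search strategy; the names `block_sad`, `best_match` and `match_t`, the square search range, and the policy of skipping candidates that fall outside the frame are all assumptions.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

typedef struct { int dy, dx, sad; } match_t;

/* SAD between the bs x bs block of cur at (cy, cx) and of ref at (ry, rx).
   Both frames are row-major with row length w. */
int block_sad(const unsigned char *cur, const unsigned char *ref, int w,
              int cy, int cx, int ry, int rx, int bs)
{
    int sad = 0;
    for (int y = 0; y < bs; y++)
        for (int x = 0; x < bs; x++)
            sad += abs((int)cur[(cy + y) * w + cx + x] -
                       (int)ref[(ry + y) * w + rx + x]);
    return sad;
}

/* Try every displacement in [-range, range] in both directions and keep
   the one with the smallest SAD. Candidates outside the frame are skipped. */
match_t best_match(const unsigned char *cur, const unsigned char *ref,
                   int h, int w, int cy, int cx, int bs, int range)
{
    assert(cur && ref && bs > 0 && range >= 0);
    match_t best = { 0, 0, INT_MAX };
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int ry = cy + dy, rx = cx + dx;
            if (ry < 0 || rx < 0 || ry + bs > h || rx + bs > w)
                continue;
            int sad = block_sad(cur, ref, w, cy, cx, ry, rx, bs);
            if (sad < best.sad) {
                best.dy = dy;
                best.dx = dx;
                best.sad = sad;
            }
        }
    }
    return best;
}
```

Neighboring candidates share almost all of their reference pixels, so the inner SAD is the same overlapping-stencil pattern as the convolution slides.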

Page 19

The convolution engine must support different ops

Kernel                  | Map      | Reduce    | Stencil Size | Data Flow
IME SAD                 | Abs Diff | Add       | 4x4          | 2D Convolution
FME ½ pixel up-sample   | Multiply | Add       | 6            | 1D Horizontal & vertical conv.
FME ¼ pixel up-sample   | Average  | None      | --           | 2D Matrix operation
SIFT Gaussian blur      | Multiply | Add       | 9, 13, 15    | 1D Horizontal & vertical conv.
SIFT DoG                | Subtract | None      | --           | 2D Matrix operation
SIFT Extrema            | Compare  | Logic AND | 3            | 1D Horizontal & vertical conv.
Demosaic interpolation  | Multiply | Complex   | 3            | 1D Horizontal & vertical conv.

Page 20

Convolution Engine: An architecture for convolution-like kernels

Figure: the engine datapath, from storage to result:
• 2D Regfile holding the coefficients, and 2D shift Regfile holding the stencil neighborhood (the diagram numbers rows 0 to 15 and columns 0 to 31)
• Wide 64-lane SIMD “map” unit (a row of ALUs)
• Flexible “reduce” step (arithmetic / logical reduction)

Page 21

Example from H.264’s motion estimation: Mapping Sum of Absolute Differences (SAD)

Figure: the engine datapath with current frame pixels in the 2D Regfile and reference frame pixels in the 2D shift Regfile, feeding the wide 64-lane SIMD “map” unit and the flexible “reduce” step (arithmetic / logical reduction).

Page 22

Map: the ALUs’ instruction is set to |a-b| (subtract, then ABS) on current and reference frame pixels.

Page 23

Reduce: a summation tree adds the absolute differences (Sum reduction). The reference frame pixels then shift left.

Page 24

The reference frame pixels have shifted left by one column: the columns now read 1, 2, …, 31, 0.

Page 25

The reference frame pixels have shifted left again: the columns now read 2, 3, …, 31, 0, 1.

Page 26

After the left shifts, the pixels shift up.

We performed 4K ops before the next load!

Page 27

The pixels shift up by one row.

Page 28

Load just one row of data, and the engine is ready for pixels to start shifting again.
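The two register movements in this walkthrough (rotate left by one pixel, then shift up and load one new row) can be modeled in a few lines of C. This is a software sketch only: the 16 x 32 size is read off the diagram's row and column numbering, the new row is assumed to enter at the bottom, and the names `shift_reg2d`, `sr_rotate_left` and `sr_shift_up_load` are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { SR_ROWS = 16, SR_COLS = 32 };   /* assumed from the diagram labels */

typedef struct { unsigned char px[SR_ROWS][SR_COLS]; } shift_reg2d;

/* Rotate every row left by one pixel; column 0 wraps to the last column. */
void sr_rotate_left(shift_reg2d *r)
{
    assert(r);
    for (int y = 0; y < SR_ROWS; y++) {
        unsigned char first = r->px[y][0];
        memmove(&r->px[y][0], &r->px[y][1], SR_COLS - 1);
        r->px[y][SR_COLS - 1] = first;
    }
}

/* Shift all rows up by one and load a new row at the bottom (assumed). */
void sr_shift_up_load(shift_reg2d *r, const unsigned char *new_row)
{
    assert(r && new_row);
    memmove(&r->px[0][0], &r->px[1][0], (SR_ROWS - 1) * SR_COLS);
    memcpy(&r->px[SR_ROWS - 1][0], new_row, SR_COLS);
}
```

In this model a whole sweep of horizontal positions needs no memory traffic at all, and moving down one row costs a single row load.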

Page 29

Our Convolution Engine as implemented

Figure: block diagram of the implemented engine. Labels in the diagram: “Map” (ALUs, 16 x 10-bit lanes, 10-bit pixels); Flexible “Reduce”; 2D Register and 2D Shift Register (18 entries, 16 wide), each with a 2D / Column Access IF; 1D Shift Register. Storage sizes labeled in the diagram:

40 x 10-bit

16x16x10-bit 16x36x10-bit

Further labels in the diagram: 1D Window Access IF; 16-wide Regfile; 16-way SIMD ALUs.

Get full implementation details in the paper:

•  How we accomplished complex reduce steps using a “fused instructions graph”

•  How we work on BIG stencils by combining multiple convolution slices

•  The details of the ISA for the engine

•  And so on, and so forth…

Page 30

Result #1: CE is user programmable in C!

    SET_CE_OPS(CE_ABSDIFF, CE_ADD);  // Set map & reduce funcs to abs-diff and add
    SET_CE_OPSIZE(16);               // Set convolution size 16x16

    // Load the 16x16 current macroblock into 2D coefficients register
    for (int i = 0; i < 16; i++) {
        LD_COEFF_REG_128(curMBPtr, i);  // Load 16 pixels to row i of coefficient register
        curMBPtr += imgWidth;
    }

    // Load the first 32x16 current reference window into 2D input register
    for (int i = 0; i < 16; i++) {
        LD_2D_REG_128(refPtr, 0, SHIFT_ENABLED);        // Load & shift-up 16 pixels to 2D Reg
        LD_2D_REG_128(refPtr + 16, 1, SHIFT_DISABLED);  // Load next 16 pixels
        refPtr += imgWidth;
    }

    // Calculate one row of SAD output
    for (int x = 0; x < 16; x++) {
        CONVOLVE_2D(ROTATE_LEFT, x);  // 16x16 2D convolution step and shift left
    }

    // Store 16 output SAD results
    ST_OUT_REG_128(outPtr);
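For comparison, here is a plain-C reference model of what that intrinsic sequence appears to compute, based only on its comments: 16 SAD values, one per horizontal offset of a 16x16 macroblock inside a 32x16 reference window. It is a sketch for checking results and counting work, not the engine's implementation; `sad_row_model` and the flat row-major layouts are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

enum { MB = 16, WIN_W = 32 };

/* cur: 16x16 macroblock, row-major. win: 16 rows x 32 columns, row-major.
   out[x] receives the SAD at horizontal offset x, for x = 0..15.
   Returns the number of abs-diff ("map") operations performed. */
long sad_row_model(const unsigned char *cur, const unsigned char *win,
                   unsigned *out)
{
    assert(cur && win && out);
    long map_ops = 0;
    for (int x = 0; x < MB; x++) {          /* one convolve step per offset */
        unsigned sad = 0;
        for (int r = 0; r < MB; r++) {
            for (int c = 0; c < MB; c++) {
                sad += (unsigned)abs((int)cur[r * MB + c] -
                                     (int)win[r * WIN_W + c + x]);
                map_ops++;
            }
        }
        out[x] = sad;
    }
    return map_ops;
}
```

The returned count is 16 x 256 = 4096 abs-diff operations per loaded window, which matches the "4K ops before the next load" remark in the SAD walkthrough.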

Page 31

Result #2: CE is 100X more energy efficient than RISC

• All variations were implemented as Tensilica extensions (TIE)

Figure: energy normalized to custom hardware (lower is better; log scale from 0.1 to 100) for SIFT-DoG, SIFT-Extrema, H.264-FME, H.264-IME and Demosaic. Three bars per kernel:
• SIMD: 8-lane 16-bit or 16-lane 8-bit SIMD; does not do “real time”
• Convolution Engine: programmable convolution engine
• Custom: fixed accelerator
Annotations on the chart mark gaps of ~10X and ~3X.

Page 32

Conclusions

• There are classes of computations for which we can build efficient hardware, and we typically build them as ASICs
• Image and video processing is ubiquitous and represents one of those classes, as its computation is convolution-like
• But when we restrict the domain, programmable engines that are two orders of magnitude better are also possible!
• Flexible specialized engines are not an oxymoron
  • Flexible convolution engine improves power & performance by ~100X
  • Only 2-3X worse off than a dedicated (not flexible) accelerator

Page 33

THANK YOU FOR LISTENING!

Page 34

BACKUP SLIDES…

Page 35

Energy dissipation in RISC machines

• Let’s do a breakdown of a typical RISC instruction
• Keep in mind (at 45 nm):
  • Addition is ~0.1 pJ for 8 bits (ASIC) or ~0.5 pJ for 32 bits (RISC)
  • Multiplication is ~0.2 pJ for 8 bits (ASIC) or ~3.1 pJ for 32 bits (RISC)
  • But a single RISC instruction is 70 pJ
• Need to see where the overhead is, and how we can mitigate it

Page 36

Processor Integration

• Specialized functional unit
• Adds about 30 instructions to the processor ISA
• The execution flow is controlled by the processor

Figure: block diagram of a Processor Core (32-bit ALU, Register File, Integer FU) next to the Convolution Engine. Labels in the diagram: Compute, Register Storage, Instruction Decode, Pipeline Management, Program Sequencing.

Page 37

Evaluating the Convolution Engine

• Applications
  • SIFT feature extraction
    • Often a basic step for computational photography algorithms:
      • HDR imaging
      • Panorama stitching
      • Smart zoom / Super resolution
      • Multi-frame noise reduction
      • Synthetic aperture
      • Augmented reality
      • Flash – No-Flash photography
      • Video de-shake
      • ……
  • H.264 encoder
    • Every video system has one

Page 38

Let’s look at some of the workloads

• De-mosaic:
  • Adaptive color plane interpolation (ACPI)*: image gradients followed by a three-tap filter in the direction of smallest gradient.

* Y. Cheng et al., An adaptive color plane interpolation method based on edge detection. Journal of Electronics (China), 2007.