Vector Processing as a Soft-core CPU Accelerator Jason Yu, Guy Lemieux, Chris Eagleston {jasony, lemieux, ceaglest}@ece.ubc.ca University of British Columbia Prepared for FPGA2008, Altera, and Xilinx February 26-28, 2008

Vector Processing as a Soft-core CPU Accelerator

Jason Yu, Guy Lemieux, Chris Eagleston

{jasony, lemieux, ceaglest}@ece.ubc.ca
University of British Columbia

Prepared for FPGA2008, Altera, and Xilinx
February 26-28, 2008

Motivation
- FPGAs for embedded processing
  - High performance, computationally intensive
  - Growing use of embedded processors on FPGAs
  - Nios/MicroBlaze too slow
- Faster performance
  - Faster Nios/MicroBlaze
  - Multiprocessor-on-FPGA
  - Custom hardware accelerator
  - Synthesized accelerator

Problems…
- Faster Nios/MicroBlaze not feasible
  - 2- or 4-way superscalar/VLIW register file maps inefficiently to FPGA
  - Superscalar: complex dependency checking
- Multiprocessor-on-FPGA complexity
  - Parallel programming and debugging
  - System design
  - Cache coherence, memory consistency
- Custom hardware accelerator cost
  - Need hardware engineer
  - Time-consuming to design and debug
  - 1 hardware accelerator per function

Possible Solutions…
- Automatically synthesized hardware accelerators
  - Change software -> regenerate and recompile RTL
  - Altera C2H
  - Xilinx CHiMPS
  - Mitrion Virtual Processor
  - CriticalBlue Cascade
- Soft vector processor
  - Change software -> same RTL, just recompile software
  - Purely software-based
  - Decouples hardware/software development teams

Advantages of Vector Processing
- Simple programming model
  - Short to long vector data parallelism
  - Regular, easy to accelerate
- Purely software-based
  - One hardware accelerator supports many applications
- Scalable performance and area

Contributions
- Configurable soft vector processor
  - Selectable performance/resource tradeoff
  - Area customization
- FPGA-specific enhancements
  - Partitioned register file
  - Vector reductions using MAC chain
  - Local vector datapath memory

Overview of Vector Processing

Acceleration with Vector Processing
- Organize data as long vectors
  - Data-level parallelism
- Vector instruction execution
  - Multiple vector lanes (SIMD)
  - Repeated SIMD operation over length of vector

[Figure: source vector registers feed the vector lanes, which write the destination vector register]

Scalar loop:

    for (i=0; i<NELEM; i++)
        a[i] = b[i] * c[i];

Vector instruction:

    vmult a, b, c
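To make the lane-by-lane execution concrete, here is a minimal behavioural sketch in C of how a single vmult might cover the whole loop. It is a software model only, not the processor's implementation: NLANE, MVL, vmult_model, and vector_multiply are hypothetical names and values chosen for this example, and counting one pass over the lanes as one cycle is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only; not tied to any real configuration. */
enum { NLANE = 4, MVL = 16 };

/* Models one vmult over vl elements (vl <= MVL).
   Each pass of the outer loop stands for one cycle in which every
   lane multiplies one element. Returns the number of passes. */
unsigned vmult_model(int32_t *a, const int32_t *b, const int32_t *c, size_t vl)
{
    assert(vl <= MVL);
    unsigned cycles = 0;
    for (size_t base = 0; base < vl; base += NLANE) {
        for (size_t lane = 0; lane < NLANE && base + lane < vl; lane++)
            a[base + lane] = b[base + lane] * c[base + lane];
        cycles++;
    }
    return cycles;
}

/* Strip-mines a loop of nelem iterations into vector instructions of
   at most MVL elements each. Returns the total modelled cycles. */
unsigned vector_multiply(int32_t *a, const int32_t *b, const int32_t *c, size_t nelem)
{
    unsigned cycles = 0;
    for (size_t i = 0; i < nelem; i += MVL) {
        size_t vl = (nelem - i < MVL) ? nelem - i : MVL;
        cycles += vmult_model(a + i, b + i, c + i, vl);
    }
    return cycles;
}
```

With these values, 20 elements split into one instruction of 16 elements (4 passes) and one of 4 elements (1 pass), compared with 20 iterations of the scalar loop.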

Compared to CPUs with SIMD Extensions
- Intel SSE2, PowerPC AltiVec, etc.
  - Short, fixed-length vectors (e.g., 4)
  - Single cycle per instruction
  - Many data pack/unpack instructions

[Figure: source SIMD registers feed a single SIMD unit, which writes the destination SIMD register]

Hybrid vector-SIMD vs Traditional Vector

    for (i=0; i<NELEM; i++) {
        C[i] = A[i] + B[i];
        E[i] = C[i] * D[i];
    }

[Figure: two panels, "Traditional vector processing" and "Hybrid vector-SIMD processing", showing the order in which elements 0-7 of C and E are computed]

Vector ISA Features
- Vector length (VL) register
- Conditional execution
  - Vector flag registers
- Vector addressing modes
  - Unit stride
  - Constant stride
  - Indexed offset

[Figure: Vector Merge Operation. Labels: source registers, flag register (0 1 0 0 1 0 1 0), merge, destination register]
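The merge operation and the three addressing modes can be sketched in plain C as follows. This is a hedged software model, not the processor's ISA: the function names are hypothetical, the stride is assumed to be counted in elements, and which source a set flag selects is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise merge under a flag register.
   Assumption: a set flag picks src1, a clear flag picks src2. */
void vmerge_model(int32_t *dst, const int32_t *src1, const int32_t *src2,
                  const uint8_t *flag, size_t vl)
{
    assert(dst && src1 && src2 && flag);
    for (size_t i = 0; i < vl; i++)
        dst[i] = flag[i] ? src1[i] : src2[i];
}

/* Unit stride: consecutive elements. */
void vload_unit(int32_t *vreg, const int32_t *base, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        vreg[i] = base[i];
}

/* Constant stride: every stride-th element (stride in elements). */
void vload_stride(int32_t *vreg, const int32_t *base, size_t stride, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        vreg[i] = base[i * stride];
}

/* Indexed offset: element i comes from base[index[i]]. */
void vload_indexed(int32_t *vreg, const int32_t *base, const uint32_t *index, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        vreg[i] = base[index[i]];
}
```

In each function the vl argument plays the role of the VL register: only the first vl elements are touched.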

Example: Simple 5x5 Median Filtering

Pseudocode (Bubble sort)

    Load the 25 pixel vectors P[0..24]
    For i=0 to 12 {
        minimum = P[i]
        For j=i to 24 {
            if (P[j] < minimum) {
                swap (minimum, P[j])
            }
        }
    }

- Slide "window" over after 1 median
- Repeated over entire image
  - Many windows

[Figure: 5x5 window over the image producing one output pixel]

Example: Simple 5x5 Median Filtering (continued)
- Bubble sort on vector registers
  - 25 rows -> 25 vector registers, "VL" pixels each
  - Vector flag register to mask execution
  - "VL" results at once!
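A minimal C sketch of the vectorized sort follows, assuming the 25 window pixels for each of VL output pixels are already loaded. The names median25_vector, WINDOW, and the value of VL are hypothetical. Two details differ from the pseudocode and are choices of this sketch: the inner loop starts at j = i + 1 (comparing P[i] with itself changes nothing), and the running minimum is written back to P[i] explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW 25   /* pixels in a 5x5 window */
#define VL 8        /* output pixels processed at once; illustrative value */

/* p[k][e] is pixel k of the window that belongs to output pixel e.
   Each row p[k] stands for one vector register of VL pixels.
   On return, med[e] holds the median of window e. p is modified. */
void median25_vector(uint8_t p[WINDOW][VL], uint8_t med[VL])
{
    uint8_t minimum[VL];
    uint8_t flag[VL];

    for (int i = 0; i <= 12; i++) {
        for (int e = 0; e < VL; e++)
            minimum[e] = p[i][e];

        for (int j = i + 1; j < WINDOW; j++) {
            /* Vector compare: sets one flag bit per element. */
            for (int e = 0; e < VL; e++)
                flag[e] = p[j][e] < minimum[e];

            /* Masked swap: only flagged elements exchange values. */
            for (int e = 0; e < VL; e++) {
                if (flag[e]) {
                    uint8_t t = minimum[e];
                    minimum[e] = p[j][e];
                    p[j][e] = t;
                }
            }
        }

        /* Write the i-th smallest value of every window back to row i. */
        for (int e = 0; e < VL; e++)
            p[i][e] = minimum[e];
    }

    /* After 13 passes, row 12 holds the median of each window. */
    for (int e = 0; e < VL; e++)
        med[e] = p[12][e];
}
```

Every loop over e corresponds to one vector instruction, so the instruction count is independent of VL while the number of medians produced per run grows with it.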

Soft Vector Processor Architecture

[Figure: soft vector processor block diagram, with these callouts]
- Nios II core
- Shared instruction memory (scalar / vector instructions)
- Shared scalar / vector memory interface
- Distributed vector register file
- Overlapped scalar / vector execution
- Configurable memory width
- Configurable number of lanes

[Figure: distributed vector register file, highlighting the elements that make up one vector register (e.g., v0)]

[Figure: vector lane datapath, with these callouts]
- Local vector datapath memory
- MAC chain
- Result to VLane 0

Vector Sum Reduction with MAC
- Sum reduction
  - R = Σ A[i] * B[i]
  - R = Σ A[i] (using B[i] = 1)
- Reduces VL elements in vector register to single number
- Two-instruction sequence:
  - vmac: multiply-accumulate to accumulators
  - vcczacc: compress copy and zero accumulators
- Side effect: can only reduce 18-bit inputs

[Figure: accumulate chain]
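The two-instruction sequence can be illustrated with a rough behavioural model in C. Everything about the hardware organisation here is an assumption made for the example: the lane count NLANE, the grouping of CHAIN adjacent lanes onto one accumulator, the mapping of element i to lane i mod NLANE, and the C function names vmac, vcczacc, and mac_reduce. The 18-bit input limit is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values; the lane-to-accumulator grouping is an assumption. */
enum { NLANE = 8, CHAIN = 4, NACC = NLANE / CHAIN, MVL = 32 };

static int64_t acc[NACC];   /* one accumulator per group of CHAIN lanes */

/* vmac model: element i runs in lane i % NLANE and its product is
   added to the accumulator shared by that lane's group. */
void vmac(const int32_t *a, const int32_t *b, size_t vl)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        acc[(i % NLANE) / CHAIN] += (int64_t)a[i] * b[i];
}

/* vcczacc model: copy the accumulators into the low elements of dst
   ("compress copy"), then zero them. */
void vcczacc(int32_t *dst)
{
    for (size_t k = 0; k < NACC; k++) {
        dst[k] = (int32_t)acc[k];
        acc[k] = 0;
    }
}

/* Reduces sum(a[i] * b[i]) to a single number by repeating the
   vmac / vcczacc pair until one partial sum is left. */
int32_t mac_reduce(const int32_t *a, const int32_t *b, size_t vl)
{
    int32_t v[MVL] = {0};
    int32_t ones[MVL];
    const int32_t *x = a, *y = b;
    size_t live = vl;               /* elements still to be combined */

    assert(vl <= MVL);
    for (size_t i = 0; i < MVL; i++)
        ones[i] = 1;

    do {
        vmac(x, y, live);
        vcczacc(v);
        /* Partial sums now occupy one slot per group of lanes used. */
        live = ((live < NLANE ? live : NLANE) + CHAIN - 1) / CHAIN;
        x = v;                      /* later rounds sum the partials, */
        y = ones;                   /* using B[i] = 1                 */
    } while (live > 1);

    return v[0];
}
```

Passing a vector of ones as b gives the plain sum reduction R = Σ A[i].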

Configurable Parameters
- Some configurable features
  - Number of vector lanes
  - Vector ALU width
  - Vector memory access granularity (8, 16, 32b)
  - Local memory size (or none)
- Strongly affect performance, area

Partial List of Configurable Parameters

(V4, V8, and V16M32 are soft vector processors.)

Parameter     Description                                    Typical       V4    V8    V16M32

Primary Parameters
NLane         Number of vector lanes                         4-128         4     8     16
MVL           Maximum vector length                          16-512        16    32    64
VPUW          Processor data width (bits)                    8, 16, 32     32    32    32
MemMinWidth   Minimum accessible data width in memory        8, 16, 32     8     8     32

Parameters for Optional Features
MultW         Multiplier width (bits, 0 is off)              0, 8, 16, 32  16    16    16
MACL          MAC chain length (0 is no MAC)                 0, 1, 2, 4    1     2     0
LMemN         Local memory number of words                   0-1024        256   256   0
LMemShare     Shared local memory address space within lane  On/Off        Off   Off   Off

Performance Results

Benchmarking
- 3 sample application kernels
  - 5x5 median filter
  - Motion estimation (full search block matching)
  - 128-bit AES encryption (MiBench)
- C code, 3 versions
  - Nios II
  - Nios II with inline vector assembly
  - Nios II with C2H accelerator

Methodology and Assumptions
- Compile C code with nios2-gcc
- Run time = Instructions * cycles-per-instruction / Fmax
- Nios II
  - Instruction: 1 cycle
  - Memory load: 1 cycle
- Nios II with vectors
  - Vector instruction: (VL / NLane) cycles
  - Vector load: 2 * (VL / NLane) + 2 cycles
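The cycle model above is simple enough to write down directly. The following C sketch is one way to do so; the function names are hypothetical, and rounding VL / NLane up to a whole number of cycles is an assumption, since the slide does not say what happens when VL is not a multiple of NLane.

```c
#include <assert.h>

/* Cycles for one vector instruction: VL / NLane.
   Assumption: a partial final group of lanes still costs a full cycle. */
unsigned vector_instr_cycles(unsigned vl, unsigned nlane)
{
    assert(nlane > 0);
    return (vl + nlane - 1) / nlane;
}

/* Cycles for one vector load: 2 * (VL / NLane) + 2. */
unsigned vector_load_cycles(unsigned vl, unsigned nlane)
{
    return 2 * vector_instr_cycles(vl, nlane) + 2;
}

/* Run time in seconds: total cycles divided by Fmax in Hz.
   Total cycles is the sum of the per-instruction costs. */
double run_time_seconds(unsigned long total_cycles, double fmax_hz)
{
    assert(fmax_hz > 0.0);
    return (double)total_cycles / fmax_hz;
}
```

For example, with VL = 64 and NLane = 16 the model gives 4 cycles per vector instruction and 10 cycles per vector load, against 64 cycles for 64 single-cycle scalar instructions.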

Altera C2H Compiler
- Nios II with C2H accelerator
  - Synthesizes HW accelerator from a C function
  - C memory reference = master port to that memory
  - Current limitations:
    - No automatic loop unrolling
    - Up to user to efficiently partition memory

[Figure: Nios II and C2H accelerators with master ports, connected through arbiters in the Avalon fabric to the memories]

C2H Methodology
- Compile application kernels with C2H compiler
  - Automatic pipelining and scheduling
  - Manually unroll loops
  - Manually "vectorize" C code
- Nios II with C2H accelerator
  - C2H compiler reports # of clock cycles
  - Includes memory arbitration overhead

C2H Example
- AES encryption round
  - Shift 4 32-bit words (by different amounts)
  - 4 table lookups
  - XOR results, XOR with key
- Acceleration steps
  1. Process multiple blocks in parallel (increase array sizes)
  2. Manually create 4 on-chip memories for 4 lookup tables

[Figure: AES round data flow on a 32-bit word]

[Figure: two design-flow charts shown side by side; box labels listed below]

Vector soft processor design flow
- Design vector algorithm
- Develop C code, vector assembly
- Compile source code, assemble with vector assembler
- Result meets requirements? (Yes / No)
- Synthesize system, place and route
- Configure soft processor parameters
- Download application to processor
- Determine FPGA resource budget

Hardware accelerator design flow
- Develop C code for Nios II
- Identify areas for HW acceleration
- Isolate sections to accelerate into C functions
- Analyze compilation estimates
- Result meets requirements? (Yes / No)
- Tune system architecture
- Apply optimizations to C source code
- Run C2H compiler
- Synthesize system, place and route

Callouts
- Hardware-aware transformations
- Software-only optimizations
- Programmers know how to do this!

Resource Utilization

[Figure: bar chart of normalized resource usage (relative to the smallest Stratix III, axis 0 to 1) in ALMs, M9Ks, and DSP elements for Nios II/s, C2H Median, C2H Motion, C2H AES, V4, V8, and V16]

Biggest Stratix III = 7x more resources

Note: These vector processors include a large local memory in each vector lane (an optional feature), hence the high M9K utilization. Removing it would save 60% of the M9Ks in V16.

Resource Utilization Estimates

Configuration             ALM     DSP Elements   M9K    Fmax (MHz)
Smallest Stratix III      19000   216            108    -
Nios II/s                 489     8              4      153
+ C2H Median filtering    825     8              4*     147
+ C2H Motion estimation   977     10             4*     135
+ C2H AES encryption      2480    8              6*     119
UTIIe                     324     0              3      193
+ V4                      5215    21             32     115
+ V8                      7011    34             53     114
+ V16                     10266   58             95     113

* C2H results are obtained from compiling to Stratix II, which uses M4K memories

Results: Clock Cycles

[Figure: bar chart of clock cycles normalized to an ideal Nios II (axis 0 to 1) for Nios II, C2H, V4, V8, and V16 on median filtering, motion estimation, and AES encryption]

Speedup vs Resource Utilization Summary

[Figure: speedup (axis 0 to 30) versus normalized area in number of ALMs (axis 0 to 30), with C2H and Vector series for median filtering, AES encryption, and motion estimation; labeled points include Nios II/s, V16, and V32]

Summary of Effort
- C2H accelerators
  1. "Vectorize" code for C2H: 1 day
  2. Extra-effort optimization: 1 day
  3. Place-and-route waiting: 1 hour
  - Each iteration = 1 day + P&R
- Vector soft processor
  1. Vector algorithm, write vector assembly: 2 days
  2. Revise vector algorithm: 0.5 day
  - Each iteration = 0.5 day + SW compile only

Lessons from Vector Processor Design
- Register files
  - 2-read, 1-write memory very common for CPUs
  - Multiple write ports for wide-issue processing
- Wide, flexible vector memory interface very costly
  - Memory crossbars: several multi-bit multiplexers
  - ~1/3 the resources of soft vector processor (128b, byte access)
- Stratix III specific
  - DSP shift chain can no longer dynamically select input
  - MAC chain is useful
    - Would like 32-bit MAC chain

Current Progress
- Development toolchain integration
  - Packaged as SOPC Builder component
  - No built-in debug core
    - Uses real Nios II processor to download code on to system
  - Inline vector assembly in Nios II IDE
- Future work
  - Compiler
  - Floating-point

Conclusion
- Vector processing maps well to FPGA
  - Many small memories, DSP blocks
  - Simple programming model
- Soft vector processor
  - Purely software-based acceleration
    - No hardware design / RTL recompile needed, just program
  - One hardware accelerator supports many applications
- Scalable performance and area
  - More vector lanes -> more performance for more area
  - Soft core parameters/features -> area customization

Conclusion
- FPGA-specific enhancements
  - Partitioned register file reduces resource utilization
  - MAC chain for efficient vector reduction
  - Local vector datapath memory
    - Table lookup operations

Download the processor now! http://www.ece.ubc.ca/~jasony/