
Vector Processing as a Soft-core CPU Accelerator

Jason Yu, Guy Lemieux, Chris Eagleston

{jasony, lemieux, ceaglest}@ece.ubc.ca
University of British Columbia

Prepared for FPGA2008, Altera, and Xilinx
February 26-28, 2008

Motivation

FPGAs for embedded processing
  High performance, computationally intensive
  Growing use of embedded processor on FPGA
  Nios/MicroBlaze too slow

Faster performance
  Faster Nios/MicroBlaze
  Multiprocessor-on-FPGA
  Custom hardware accelerator
  Synthesized accelerator

Problems…

Faster Nios/MicroBlaze not feasible
  2 or 4-way superscalar/VLIW register file maps inefficiently to FPGA
  Superscalar: complex dependency checking

Multiprocessor-on-FPGA complexity
  Parallel programming and debugging
  System design
  Cache coherence, memory consistency

Custom hardware accelerator cost
  Need hardware engineer
  Time-consuming to design and debug
  1 hardware accelerator per function

Possible Solutions…

Automatically synthesized hardware accelerators
  Change software -> regenerate & recompile RTL
  Altera C2H
  Xilinx CHiMPS
  Mitrion Virtual Processor
  CriticalBlue Cascade

Soft vector processor
  Change software -> same RTL, just recompile software
  Purely software-based
  Decouples hardware/software development teams

Advantages of Vector Processing

Simple programming model
  Short to long vector data parallelism
  Regular, easy to accelerate

Purely software-based
  One hardware accelerator supports many applications

Scalable performance and area

Contributions

Configurable soft vector processor
  Selectable performance/resource tradeoff
  Area customization

FPGA-specific enhancements
  Partitioned register file
  Vector reductions using MAC chain
  Local vector datapath memory

Overview of Vector Processing

Acceleration with Vector Processing

Organize data as long vectors
  Data-level parallelism

Vector instruction execution
  Multiple vector lanes (SIMD)
  Repeated SIMD operation over length of vector

[Figure: source vector registers feed the vector lanes, which write a destination vector register]

for (i=0; i<NELEM; i++)
    a[i] = b[i] * c[i];

vmult a, b, c
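A minimal C sketch of how one vector instruction such as vmult might be carried out on several lanes: each step handles up to NLANE elements at once, and the step repeats until the whole vector is covered. The names vmult_emulated and NLANE, and the lane count of 4, are illustrative assumptions and not part of the processor's ISA or toolchain.

```c
#include <stddef.h>
#include <stdint.h>

#define NLANE 4  /* assumed number of vector lanes (illustrative) */

/* Software model of "vmult a, b, c": a[i] = b[i] * c[i] for i < vl.
 * The outer loop models successive SIMD steps; the inner loop models
 * the NLANE lanes that would operate in parallel within one step. */
size_t vmult_emulated(int32_t *a, const int32_t *b, const int32_t *c, size_t vl)
{
    size_t steps = 0;
    for (size_t base = 0; base < vl; base += NLANE) {
        for (size_t lane = 0; lane < NLANE && base + lane < vl; lane++) {
            a[base + lane] = b[base + lane] * c[base + lane];
        }
        steps++;
    }
    return steps;  /* number of SIMD steps, i.e. ceil(vl / NLANE) */
}
```

With 4 lanes, an 8-element vector takes 2 steps, where the scalar loop would take 8 iterations.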

Compared to CPUs with SIMD Extensions

Intel SSE2, PowerPC Altivec, etc
  Short, fixed-length vectors (eg, 4)
  Single cycle per instruction
  Many data pack/unpack instructions

[Figure: source SIMD registers feed a SIMD unit, which writes a destination SIMD register]

Hybrid vector-SIMD vs Traditional Vector

for (i=0; i<NELEM; i++) {
    C[i] = A[i] + B[i];
    E[i] = C[i] * D[i];
}

[Figure: order in which elements 0-7 of C and E are computed under traditional vector processing and under hybrid vector-SIMD processing]

Vector ISA Features

Vector length (VL) register
Conditional execution
  Vector flag registers
Vector addressing modes
  Unit stride
  Constant stride
  Indexed offset

[Figure: Vector Merge Operation. A flag register (0 1 0 0 1 0 1 0) controls how the source registers are merged into the destination register]
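The merge operation and the addressing modes can be modeled in a few lines of C. This is a sketch under stated assumptions: the function names are hypothetical, and which source a set flag bit selects is a guess, since the slide only shows that a flag register steers the merge.

```c
#include <stddef.h>
#include <stdint.h>

/* Vector merge: per element, take src1 where the flag is set and src0
 * otherwise (the polarity of the flag is an assumption). */
void vmerge(int32_t *dst, const int32_t *src0, const int32_t *src1,
            const uint8_t *flag, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        dst[i] = flag[i] ? src1[i] : src0[i];
    }
}

/* Constant-stride load; stride == 1 is the unit-stride case.
 * The stride is counted in elements here, not bytes. */
void vload_strided(int32_t *dst, const int32_t *mem, ptrdiff_t stride, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        dst[i] = mem[(ptrdiff_t)i * stride];
    }
}

/* Indexed (gather) load: each element comes from its own offset. */
void vload_indexed(int32_t *dst, const int32_t *mem, const uint32_t *index, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        dst[i] = mem[index[i]];
    }
}
```

In all three functions the loop bound vl plays the role of the vector length (VL) register: it sets how many elements one instruction touches.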

Example: Simple 5x5 Median Filtering

Pseudocode (Bubble sort)

Load the 25 pixel vectors P[0..24]
For i=0 to 12 {
  minimum = P[i]
  For j=i to 24 {
    if (P[j] < minimum) {
      swap (minimum, P[j])
    }
  }
}

Slide “window” over after 1 median
Repeated over entire image
  Many windows

[Figure: filter window with the output pixel marked]

Example: Simple 5x5 Median Filtering

Bubble sort on vector registers
  25 rows -> 25 vector registers, “VL” pixels each
  Vector flag register to mask execution
  “VL” results at once!
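The vectorized sort can be modeled in plain C to show how the flag register replaces the branch. This is a sketch, not the authors' vector assembly: each inner loop over e stands for one vector instruction, and the names median5x5_vec, WINDOW, and the choice VL = 8 are assumptions made for illustration.

```c
#include <stdint.h>

#define WINDOW 25  /* pixels in a 5x5 window */
#define VL 8       /* illustrative vector length: windows processed at once */

/* P[k][e] holds pixel k of window e, so each P[k] models one vector
 * register. On return, median[e] is the median of window e.
 * P is modified in place, as in the pseudocode. */
void median5x5_vec(uint8_t P[WINDOW][VL], uint8_t median[VL])
{
    uint8_t minimum[VL];
    uint8_t flag[VL];

    for (int i = 0; i <= 12; i++) {
        for (int e = 0; e < VL; e++) {          /* vector copy */
            minimum[e] = P[i][e];
        }
        for (int j = i; j < WINDOW; j++) {
            for (int e = 0; e < VL; e++) {      /* vector compare sets flags */
                flag[e] = P[j][e] < minimum[e];
            }
            for (int e = 0; e < VL; e++) {      /* swap, masked by the flags */
                if (flag[e]) {
                    uint8_t t = minimum[e];
                    minimum[e] = P[j][e];
                    P[j][e] = t;
                }
            }
        }
    }
    /* After pass i = 12, minimum holds the 13th smallest of 25 values. */
    for (int e = 0; e < VL; e++) {
        median[e] = minimum[e];
    }
}
```

The control flow is identical for every window; only the flags differ per element, which is why one instruction stream can produce VL medians at once.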

Soft Vector Processor Architecture

[Figure: soft vector processor block diagram, with these labels]
  Nios II core
  Shared instruction memory (scalar / vector instructions)
  Shared scalar / vector memory interface
  Distributed vector register file
  Overlapped scalar / vector execution
  Configurable memory width
  Configurable number of lanes

[Figure: distributed vector register file, showing how the elements of one vector register (eg, v0) are placed in the lanes]

[Figure: vector lane datapath with local vector datapath memory and MAC chain; the MAC chain result goes to VLane 0]

Vector Sum Reduction with MAC

Sum reduction
  R = Σ A[i] * B[i]
  R = Σ A[i] (using B[i] = 1)
Reduces VL elements in vector register to single number
Two-instruction sequence:
  vmac: multiply-accumulate to accumulators
  vcczacc: compress copy and zero accumulators
Side effect: can only reduce 18-bit inputs

[Figure: accumulate chain]
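A rough C model of the two-instruction reduction may help. The exact semantics of vmac and vcczacc are not given on the slide, so everything below is an assumption: vmac_model adds products into a small bank of accumulators, vcczacc_model copies the accumulators out and clears them, and the final sum of the partial results is done in a scalar loop. NMAC and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NMAC 4  /* assumed number of accumulators (illustrative) */

static int64_t acc[NMAC];  /* accumulator state, initially zero */

/* Model of vmac: multiply element pairs and add into the accumulators. */
void vmac_model(const int32_t *a, const int32_t *b, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        acc[i % NMAC] += (int64_t)a[i] * b[i];
    }
}

/* Model of vcczacc: copy the accumulators into dst, then zero them.
 * Returns the number of values written. */
size_t vcczacc_model(int64_t *dst)
{
    for (size_t m = 0; m < NMAC; m++) {
        dst[m] = acc[m];
        acc[m] = 0;
    }
    return NMAC;
}

/* R = sum of a[i] * b[i]; pass b[i] = 1 for a plain sum of a. */
int64_t dot_reduce(const int32_t *a, const int32_t *b, size_t vl)
{
    int64_t partial[NMAC];
    int64_t r = 0;
    vmac_model(a, b, vl);
    size_t n = vcczacc_model(partial);
    for (size_t m = 0; m < n; m++) {
        r += partial[m];
    }
    return r;
}
```

Because the copy also zeroes the accumulators, back-to-back reductions do not need a separate clear step.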

Configurable Parameters

Some configurable features
  Number of vector lanes
  Vector ALU width
  Vector memory access granularity (8, 16, 32b)
  Local memory size (or none)

Strongly affect performance, area

Partial List of Configurable Parameters

                                                                              Soft vector processors
Parameter     Description                                     Typical        V4    V8    V16M32

Primary Parameters
NLane         Number of vector lanes                          4-128          4     8     16
MVL           Maximum vector length                           16-512         16    32    64
VPUW          Processor data width (bits)                     8, 16, 32      32    32    32
MemMinWidth   Minimum accessible data width in memory         8, 16, 32      8     8     32

Parameters for Optional Features
MultW         Multiplier width (bits, 0 is off)               0, 8, 16, 32   16    16    16
MACL          MAC chain length (0 is no MAC)                  0, 1, 2, 4     1     2     0
LMemN         Local memory number of words                    0-1024         256   256   0
LMemShare     Shared local memory address space within lane   On/Off         Off   Off   Off
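The three example configurations in the table can be written down as data. The struct layout and the names svp_config, SVP_PRESETS, and svp_elems_per_lane are hypothetical, chosen only to illustrate the parameter set; the real processor is configured through its own hardware parameters, not through this C structure.

```c
#include <stdint.h>

/* One soft vector processor configuration (field names follow the table). */
typedef struct {
    const char *name;
    uint16_t nlane;          /* NLane: number of vector lanes */
    uint16_t mvl;            /* MVL: maximum vector length */
    uint8_t  vpuw;           /* VPUW: processor data width in bits */
    uint8_t  mem_min_width;  /* MemMinWidth: minimum accessible width in bits */
    uint8_t  multw;          /* MultW: multiplier width in bits, 0 is off */
    uint8_t  macl;           /* MACL: MAC chain length, 0 is no MAC */
    uint16_t lmem_n;         /* LMemN: local memory words, 0 is none */
    uint8_t  lmem_share;     /* LMemShare: 1 = On, 0 = Off */
} svp_config;

/* Values copied from the V4, V8, and V16M32 columns of the table. */
static const svp_config SVP_PRESETS[3] = {
    { "V4",      4, 16, 32,  8, 16, 1, 256, 0 },
    { "V8",      8, 32, 32,  8, 16, 2, 256, 0 },
    { "V16M32", 16, 64, 32, 32, 16, 0,   0, 0 },
};

/* Elements of a full-length vector that each lane holds. */
static unsigned svp_elems_per_lane(const svp_config *c)
{
    return c->mvl / c->nlane;
}
```

For all three columns MVL / NLane comes to 4, so in these examples the maximum vector length grows in step with the lane count.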

Performance Results

Benchmarking

3 sample application kernels
  5x5 median filter
  Motion estimation (full search block matching)
  128-bit AES encryption (MiBench)

C code, 3 versions
  Nios II
  Nios II with inline vector assembly
  Nios II with C2H accelerator

Methodology and Assumptions

Compile C code with nios2-gcc

Run time = Instructions * cycles-per-instruction / Fmax

Nios II
  Instruction: 1 cycle
  Memory load: 1 cycle

Nios II with vectors
  Vector instruction: (VL / NLane) cycles
  Vector load: 2 * (VL / NLane) + 2 cycles
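The cycle model above is simple enough to state as code. The sketch below assumes that VL / NLane rounds up when VL is not a multiple of NLane, which the slide does not specify; the function names are illustrative.

```c
#include <stdint.h>

/* Cycles for one vector arithmetic instruction: VL / NLane.
 * Rounding up for partial groups is an assumption. */
uint32_t vector_instr_cycles(uint32_t vl, uint32_t nlane)
{
    return (vl + nlane - 1) / nlane;
}

/* Cycles for one vector load: 2 * (VL / NLane) + 2. */
uint32_t vector_load_cycles(uint32_t vl, uint32_t nlane)
{
    return 2 * vector_instr_cycles(vl, nlane) + 2;
}

/* Run time in seconds for a total cycle count at clock rate fmax_hz,
 * where total_cycles = sum over instructions of cycles-per-instruction. */
double run_time_seconds(uint64_t total_cycles, double fmax_hz)
{
    return (double)total_cycles / fmax_hz;
}
```

For example, with VL = 64 and NLane = 16 a vector instruction costs 4 cycles and a vector load costs 10, against 64 cycles for the same 64 elements at one scalar instruction each.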

Altera C2H Compiler

Nios II with C2H accelerator
  Synthesizes HW accelerator from a C function
  C memory reference = master port to that memory
  Current limitations:
    No automatic loop unrolling
    Up to user to efficiently partition memory

[Figure: Nios II and C2H accelerator master ports connected through arbiters in the Avalon fabric to the memories]

C2H Methodology

Compile application kernels with C2H compiler
  Automatic pipelining and scheduling
  Manually unroll loops
  Manually “vectorize” C code

Nios II with C2H accelerator
  C2H compiler reports # of clock cycles
  Includes memory arbitration overhead

C2H Example

AES encryption round
  Shift 4 32-bit words (by different amounts)
  4 table lookups
  XOR results, XOR with key

Acceleration steps
  1. Process multiple blocks in parallel (increase array sizes)
  2. Manually create 4 on-chip memories for 4 lookup tables

[Figure: 32-bit word]
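The round described above matches the common table-lookup formulation of AES, sketched here in C. The byte order, the table layout, and the function name are assumptions; the benchmark code from MiBench may arrange them differently. The four tables are passed in as separate arrays, mirroring the four on-chip memories.

```c
#include <stdint.h>

/* One AES encryption round in table-lookup form. Each output word takes
 * one byte from each of four state words (shifted by 24, 16, 8, and 0
 * bits), looks each byte up in its own 256-entry table, XORs the four
 * results, and XORs in the round key word. */
void aes_round_tables(const uint32_t *T0, const uint32_t *T1,
                      const uint32_t *T2, const uint32_t *T3,
                      const uint32_t s[4], const uint32_t rk[4],
                      uint32_t out[4])
{
    for (int c = 0; c < 4; c++) {
        out[c] = T0[s[c] >> 24]
               ^ T1[(s[(c + 1) & 3] >> 16) & 0xff]
               ^ T2[(s[(c + 2) & 3] >> 8) & 0xff]
               ^ T3[s[(c + 3) & 3] & 0xff]
               ^ rk[c];
    }
}
```

Keeping the four tables in separate memories lets all four lookups for one output word proceed at the same time instead of queuing on a single memory port.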

Vector soft processor design flow

[Figure: flowchart with these boxes]
  Design vector algorithm
  Develop C code, vector assembly
  Compile source code, assemble with vector assembler
  Result meets requirements? (Yes / No)
  Synthesize system, place and route
  Configure soft processor parameters
  Download application to processor
  Determine FPGA resource budget

Hardware accelerator design flow

[Figure: flowchart with these boxes]
  Develop C code for Nios II
  Identify areas for HW acceleration
  Isolate sections to accelerate into C functions
  Analyze compilation estimates
  Tune system architecture
  Apply optimizations to C source code
  Run C2H compiler
  Synthesize system, place and route

Annotations on the two flows
  Hardware-aware transformations
  Software-only optimizations
  Programmers know how to do this!

Resource Utilization

[Figure: bar chart of normalized resource usage (relative to the smallest Stratix III, axis 0 to 1) for Nios II/s, C2H Median, C2H Motion, C2H AES, V4, V8, and V16; series: ALM, M9K, DSP Elements]

Biggest Stratix III = 7x more resources

Note: These vector processors include a large local memory in each vector lane (an optional feature), hence the high M9K utilization. Removing it would save 60% of the M9K in V16.

Resource Utilization Estimates

                          ALM     DSP Elements   M9K   Fmax
Smallest Stratix III      19000   216            108   -
Nios II/s                 489     8              4     153
+ C2H Median filtering    825     8              4*    147
+ C2H Motion estimation   977     10             4*    135
+ C2H AES encryption      2480    8              6*    119
UTIIe                     324     0              3     193
+ V4                      5215    21             32    115
+ V8                      7011    34             53    114
+ V16                     10266   58             95    113

* C2H results are obtained from compiling to Stratix II; uses M4K memories

Results: Clock Cycles

[Figure: bar chart of normalized clock cycles (relative to ideal Nios II, axis 0 to 1) for Nios II, C2H, V4, V8, and V16; series: Median filtering, Motion estimation, AES encryption]

Speedup vs Resource Utilization Summary

[Figure: speedup (axis 0 to 30) versus normalized area in number of ALMs (axis 0 to 30) for the C2H and Vector designs on median filtering, AES encryption, and motion estimation; labeled points include Nios II/s, V16, and V32]

Summary of Effort

C2H accelerators
  1. “Vectorize” code for C2H: 1 day
  2. Extra-effort optimization: 1 day
  3. Place-and-route waiting: 1 hour
  Each iteration = 1 day + P&R

Vector soft processor
  1. Vector algorithm, write vector assembly: 2 days
  2. Revise vector algorithm: 0.5 day
  Each iteration = 0.5 day + SW compile only

Lessons from Vector Processor Design

Register files
  2-read, 1-write memory very common for CPUs
  Multiple write ports for wide-issue processing

Wide, flexible vector memory interface very costly
  Memory crossbars: several multi-bit multiplexers
  ~1/3 the resources of soft vector processor (128b, byte access)

Stratix III specific
  DSP shift chain can no longer dynamically select input
  MAC chain is useful
    Would like 32-bit MAC chain

Current Progress

Development toolchain integration
  Packaged as SOPC Builder component
  No built-in debug core
    Uses real Nios II processor to download code onto system
  Inline vector assembly in Nios II IDE

Future work
  Compiler
  Floating-point

Conclusion

Vector processing maps well to FPGA
  Many small memories, DSP blocks
  Simple programming model

Soft vector processor
  Purely software-based acceleration
    No hardware design / RTL recompile needed, just program
    One hardware accelerator supports many applications
  Scalable performance and area
    More vector lanes -> more performance for more area
    Soft core parameters/features -> area customization

Conclusion

FPGA-specific enhancements
  Partitioned register file reduces resource utilization
  MAC chain for efficient vector reduction
  Local vector datapath memory
    Table lookup operations

Download the processor now!
http://www.ece.ubc.ca/~jasony/
