
Page 1: Title

Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
WoSPS – Oct 26, 2008

Page 2: Soft Processors in FPGA Systems

[Figure: an FPGA system pairing a soft processor (programmed in C via a compiler) with custom logic (designed in HDL via CAD). The soft processor path is easier; the custom logic path is faster, smaller, and uses less power.]

Data-level parallelism → soft vector processors
Configurable – how can we make use of this?

Page 3: Vector Processing Primer

    // C code
    for (i = 0; i < 16; i++)
        b[i] += a[i];

    // Vectorized code
    set    vl,16
    vload  vr0,b
    vload  vr1,a
    vadd   vr0,vr0,vr1
    vstore vr0,b

Each vector instruction holds many units of independent operations.

[Figure: with 1 vector lane, the vadd steps through b[0]+=a[0] up to b[15]+=a[15] one element at a time.]

Page 4: Vector Processing Primer

[Figure: the same code on 16 vector lanes; all 16 element operations b[0]+=a[0] through b[15]+=a[15] of the vadd execute in parallel, giving a 16x speedup.]

Page 5: Sub-Linear Scalability

[Figure: bar chart of cycle performance relative to 1 lane for autcor, conven, ip_checksum, imgblend, and GMEAN, with 1, 2, 4, 8, and 16 lanes. The 16-lane speedups (data labels: 4.7, 8.0, 6.0, 5.2, 3.1) fall well short of 16x.]

Vector lanes are not being fully utilized.

Page 6: Where Are The Cycles Spent?

[Figure: fraction of total cycles spent in memory unit stalls and cache miss cycles for autcor, conven, ip_checksum, imgblend, and AVERAGE, at 16 lanes. Miss cycles account for 67% on average.]

2/3 of cycles are spent waiting on the memory unit, often from cache misses.

Page 7: Our Goals

1. Improve the memory system:
   - Better cache design
   - Hardware prefetching
2. Evaluate improvements for real:
   - Using a complete hardware design (in Verilog)
   - On real FPGA hardware (Stratix 1S80C6)
   - Running full benchmarks (EEMBC)
   - From off-chip memory (DDR-133MHz)

Page 8: Current Infrastructure

[Figure: tool flow. SOFTWARE: EEMBC C benchmarks are compiled with GCC and linked (ld) with vectorized assembly subroutines assembled by GNU as with vector support, producing an ELF binary that runs on the MINT instruction set simulator (scalar µP + VPU). HARDWARE: the Verilog design is simulated in Modelsim (RTL simulator) to obtain cycle counts and synthesized with Altera Quartus II v8.0 to obtain area and frequency. The simulator and RTL paths verify each other.]

Page 9: VESPA Architecture Design

[Figure: VESPA pipelines sharing an Icache and Dcache: a 3-stage scalar pipeline (decode, RF, ALU, MUX, WB), a 3-stage vector control pipeline (VCRF/VSRF, logic, VCWB/VSWB), and a 6-stage vector pipeline (decode/replicate, hazard check, VRRF, ALU with multiply & saturate, right shift, saturate, MUX, VRWB) with a memory unit.]

Supports integer and fixed-point operations, and predication
32-bit datapaths
Shared Dcache

Page 10: Memory System Design

[Figure: VESPA with 16 lanes connected through a vector memory crossbar to a shared 4KB Dcache with 16B lines, backed by DDR with a 9-cycle access latency; the scalar processor and vector coprocessor share the cache.]

vld.w (load 16 contiguous 32-bit words)

Page 11: Memory System Design

[Figure: the same system with a 16KB Dcache and 64B lines. The 4x wider line turns the vld.w of 16 contiguous 32-bit words (64B) into a single cache access instead of four.]

Reduced cache accesses + some prefetching (the arithmetic is sketched below).
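For concreteness, here is a minimal C sketch (not from the slides) of the line-count arithmetic: a line-aligned, unit-stride vector load of 16 32-bit words (64 bytes) spans four 16B lines but only one 64B line.

    #include <stdio.h>

    /* Cache lines touched by a line-aligned, unit-stride vector load. */
    static int lines_touched(int num_elems, int elem_bytes, int line_bytes) {
        int total_bytes = num_elems * elem_bytes;
        return (total_bytes + line_bytes - 1) / line_bytes;  /* ceiling */
    }

    int main(void) {
        printf("16B lines: %d accesses\n", lines_touched(16, 4, 16));  /* 4 */
        printf("64B lines: %d accesses\n", lines_touched(16, 4, 64));  /* 1 */
        return 0;
    }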

Page 12: Improving Cache Design

Vary the cache depth & cache line size:
- Using a parameterized design
- Cache line size: 16, 32, 64, 128 bytes
- Cache depth: 4, 8, 16, 32, 64 KB

Measure performance on 9 benchmarks:
- 6 from EEMBC, all executed in hardware

Measure area cost:
- Equate silicon area of all resources used
- Report in units of Equivalent LEs

Page 13: Cache Design Space – Performance (Wall Clock Time)

[Figure: speedup over the 4KB/16B baseline across cache depths (4KB–64KB) and line sizes (16B–128B). The best configurations approach 2x (data labels: 1.13, 1.37, 1.50, 1.55, 1.68, 1.77, 1.93), while clock frequency varies only between 122MHz and 129MHz.]

The best cache design almost doubles the performance of the original VESPA.
More pipelining/retiming could reduce the clock frequency penalty.
Cache line size matters more than cache depth (lots of streaming).

Page 14: Cache Design Space – Area

[Figure: area relative to the 4KB/16B baseline across the same design space, broken down into M4K and MRAM block RAM usage.]

A 64B (512-bit) cache line striped across 16-bit-wide M4K blocks (4096 bits each) requires 32 M4Ks, which already provides 16KB of storage.

System area almost doubled in the worst case.
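The M4K sizing arithmetic can be reproduced with a short C sketch (an illustration, assuming each M4K is used in its 16-bit-wide mode):

    #include <stdio.h>

    int main(void) {
        int line_bits  = 64 * 8;                 /* 64B line = 512 bits */
        int m4k_bits   = 4096;                   /* capacity of one M4K */
        int m4k_width  = 16;                     /* data width used here */
        int num_m4ks   = line_bits / m4k_width;  /* 512/16 = 32 M4Ks */
        int min_kbytes = num_m4ks * m4k_bits / 8 / 1024;
        printf("%d M4Ks => %dKB minimum cache storage\n",
               num_m4ks, min_kbytes);            /* 32 => 16KB */
        return 0;
    }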

Page 15: Cache Design Space – Area

[Figure: the same area chart, annotated with the two design guidelines below.]

a) Choose a depth that fills the block RAMs needed for the line size
b) Don't use MRAMs: big, few, and overkill

Page 16: Hardware Prefetching Example

[Figure: without prefetching, two successive vld.w instructions each miss in the Dcache and each pay the 9-cycle DDR penalty. With prefetching of 3 blocks, the first vld.w misses and pulls in the next 3 lines, so the second vld.w hits.]

Page 17: Hardware Data Prefetching

Advantages:
- Little area overhead
- Parallelizes memory fetching with computation
- Uses full memory bandwidth

Disadvantages:
- Cache pollution

We use sequential prefetching triggered on: a) any miss, or b) a sequential vector instruction miss (see the sketch below).
We measure performance/area using a 16KB dcache with 64B lines.
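A minimal C sketch of the sequential prefetcher's trigger logic (illustrative only; issue_fill is a hypothetical stand-in for the cache's line-fill request):

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical line-fill request, stubbed for illustration. */
    static void issue_fill(uint32_t line_addr) {
        printf("fill line 0x%08x\n", line_addr);
    }

    /* Sequential prefetching: fetch the missed line, then the next K lines.
     * Policy (a) triggers on any miss; policy (b) only when the miss came
     * from a sequential vector memory instruction. */
    static void on_miss(uint32_t addr, uint32_t line_bytes, int K,
                        int seq_vector_miss, int policy_b) {
        uint32_t line = addr & ~(line_bytes - 1); /* line_bytes: power of 2 */
        issue_fill(line);                         /* demand fetch */
        if (!policy_b || seq_vector_miss)
            for (int k = 1; k <= K; k++)
                issue_fill(line + (uint32_t)k * line_bytes);
    }

    int main(void) {
        on_miss(0x00001234, 64, 3, 1, 1);  /* 64B lines, prefetch 3 blocks */
        return 0;
    }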

Page 18: Prefetching K Blocks – Any Miss

[Figure: speedup vs no prefetching as the number of prefetched cache lines sweeps 0, 1, 3, 7, 15, 31, 63, for autcor, conven, viterb, fbital, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and GMEAN. Peak average speedup is 28%; the best benchmark reaches 2.2x; several benchmarks are not receptive.]

Only half the benchmarks are significantly sped up, with a max of 2.2x and an average of 28%.

Page 19: Prefetching Area Cost: Writeback Buffer

[Figure: when a prefetch of 3 blocks would evict dirty lines while the miss pays its 9-cycle DDR penalty, the dirty lines are moved into a writeback (WB) buffer.]

Two options (a sketch of the decision follows below):
- Deny the prefetch
- Buffer all dirty lines

Area cost is small:
- 1.6% of system area
- Mostly block RAMs, little logic
- No clock frequency impact
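A C sketch of the two options (an illustrative model with assumed names and buffer size; the real VESPA logic is in Verilog):

    #include <stdio.h>

    /* Tiny model of a writeback buffer holding dirty victim lines
     * displaced by prefetches (hypothetical capacity). */
    #define WB_CAP 4
    static unsigned wb_buf[WB_CAP];
    static int wb_count = 0;

    /* Returns 1 if the prefetch proceeds, 0 if denied. */
    static int try_prefetch(unsigned victim_line, int victim_dirty) {
        if (victim_dirty) {
            if (wb_count == WB_CAP)
                return 0;                     /* option 1: deny prefetch */
            wb_buf[wb_count++] = victim_line; /* option 2: buffer dirty line */
        }
        /* ...issue the line fill to DDR here (9-cycle access)... */
        return 1;
    }

    int main(void) {
        printf("prefetch %s\n",
               try_prefetch(0x1200, 1) ? "issued" : "denied");
        return 0;
    }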

Page 20: Any Miss vs Sequential Vector Miss

[Figure: average speedup vs number of cache lines prefetched (0 to 63) for the two trigger policies, "any cache miss" and "sequential vector only"; the two curves overlap.]

Collinear – nearly all misses in our benchmarks come from sequential vector accesses.

Page 21: Vector Length Prefetching

Previously: a constant number of cache lines prefetched
Now: prefetch a multiple of the vector length
- Only for sequential vector memory instructions
- E.g. a vector load of 32 elements
- Guarantees <= 1 miss per vector memory instruction

[Figure: a vld.w covering elements 0–31, annotated "fetch + prefetch 28*k".]
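A minimal C sketch (not from the slides) of the prefetch-amount computation, converting k times the vector length into cache lines:

    #include <stdio.h>

    /* Lines to fetch so that k*VL sequential elements are covered,
     * guaranteeing at most one miss per vector memory instruction. */
    static int vl_prefetch_lines(int vl, int k,
                                 int elem_bytes, int line_bytes) {
        int bytes = k * vl * elem_bytes;
        return (bytes + line_bytes - 1) / line_bytes;  /* ceiling */
    }

    int main(void) {
        /* vld.w of 32 words with 64B lines: 1*VL covers 128B = 2 lines. */
        printf("1*VL: %d lines\n", vl_prefetch_lines(32, 1, 4, 64));
        printf("8*VL: %d lines\n", vl_prefetch_lines(32, 8, 4, 64)); /* 16 */
        return 0;
    }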

Page 22: Vector Length Prefetching – Performance

[Figure: speedup vs prefetch amount (none, 1*VL, 2*VL, 4*VL, 8*VL, 16*VL, 32*VL) for the nine benchmarks and GMEAN. 1*VL already gains 21% on average with no cache pollution; the peak average is 29% at 8*VL; the best benchmark reaches 2.2x; some benchmarks are not receptive.]

1*VL prefetching provides good speedup without tuning; 8*VL is best.

Page 23: Overall Memory System Performance

[Figure: fraction of total cycles spent in memory unit stalls and miss cycles for three configurations: 16-byte line (4KB cache), 64-byte line (16KB cache), and 64-byte line + prefetching. Memory unit stall cycles fall from 67% to 48% to 31%; miss cycles fall to 4%.]

The wider line + prefetching reduces memory unit stall cycles significantly.
The wider line + prefetching eliminates all but 4% of miss cycles.

Page 24: Improved Scalability

[Figure: cycle performance relative to 1 lane for autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and GMEAN with 1, 2, 4, 8, and 16 lanes, after the memory system improvements.]

Previous: 3–8x range, average of 5x for 16 lanes
Now: 6–13x range, average of 10x for 16 lanes

Page 25: Summary

Explored cache design: ~2x performance for ~2x system area
- Area growth due largely to the memory crossbar
- Widened cache line size to 64B and depth to 16KB

Enhanced VESPA with hardware data prefetching
- Up to 2.2x performance, average of 28% for K=15
- Vector length prefetcher gains 21% on average for 1*VL
- Good for mixed workloads, no tuning, no cache pollution
- Peak at 8*VL, average of 29% speedup

Overall improved VESPA memory system & scalability
- Decreased miss cycles to 4% and memory unit stall cycles to 31%

Page 26: Vector Memory Unit

[Figure: the vector memory unit. Each memory lane i (for i = 0 .. L, where L = # lanes - 1; 4 memory lanes shown) computes its address as base + stride*i or base + index_i via a MUX, feeding a memory request queue to the Dcache. Read data returns through a read crossbar to rddata0..rddataL; write data wrdata0..wrdataL passes through a write crossbar into a memory write queue.]
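A C sketch of the per-lane address generation shown in the figure (illustrative only; function and parameter names are assumptions):

    #include <stdint.h>
    #include <stdio.h>

    /* Each memory lane i selects base + stride*i (unit/strided access)
     * or base + index[i] (indexed access) via a MUX. */
    static void gen_lane_addrs(uint32_t base, int32_t stride,
                               const uint32_t *index, int indexed,
                               int lanes, uint32_t *addr) {
        for (int i = 0; i < lanes; i++)
            addr[i] = indexed ? base + index[i]
                              : base + (uint32_t)(stride * i);
    }

    int main(void) {
        uint32_t addr[4];
        gen_lane_addrs(0x1000, 4, NULL, 0, 4, addr); /* unit-stride words */
        for (int i = 0; i < 4; i++)
            printf("lane %d: 0x%04x\n", i, addr[i]);
        return 0;
    }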