Improving Pipelined Soft Processors with Multithreading

Martin Labrecque, Gregory Steffan

ECE Dept. University of Toronto

Presented at RAAW 2006, Orlando, FL

Processors and FPGAs

FPGAs increasingly implement SoCs, with CPUs
Soft processors: processors in the FPGA fabric

Soft processors are:
• Easier to program than HDL
• Customizable

[Figure: an FPGA containing custom logic and a soft processor; the processor datapath shows the PC, instruction memory, register array, ALU and data memory]

Soft processors in Embedded Systems

What do designers care about?
• Minimizing area?
• Matching frequency?
• Hitting performance target?

We trade off 4 criteria (soft proc. power is related to area)

Area efficiency: a combined metric

$$\text{Area efficiency} = \frac{\text{Performance}}{\text{Area}} = \frac{\text{Instr. Count} \times \text{Frequency}}{\text{Cycle Count} \times \text{Area}}$$
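As a rough illustration of the combined metric, the sketch below computes area efficiency in MIPS per 1000 LEs (the unit used in the results plot of this talk). The function name and the example numbers are hypothetical and are not measurements from the talk.

```python
def area_efficiency(instr_count, cycle_count, freq_mhz, area_les):
    """Area efficiency in MIPS per 1000 LEs (hypothetical helper).

    Performance = (instr_count / cycle_count) * freq_mhz, in MIPS.
    """
    ipc = instr_count / cycle_count      # instructions per cycle
    mips = ipc * freq_mhz                # million instructions per second
    return mips / (area_les / 1000.0)    # normalize by area in thousands of LEs


if __name__ == "__main__":
    # Made-up processor: 800k instructions in 1M cycles at 80 MHz on 1000 LEs.
    print(area_efficiency(800_000, 1_000_000, 80.0, 1000))
```

Reducing the cycle count, raising the frequency, or shrinking the area all raise the metric, which is why it is convenient for comparing designs that trade one criterion against another.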

Multithreading

Replace processor stalls
• Fill them with instructions from other threads

When to switch thread?
• Every instruction (e.g. Sun’s Niagara)
• Convenient technique for in-order processors

Fine-grained multithreading: 1 instr. per thread in round-robin

$$\text{Area efficiency} = \frac{\text{Million Instr.} \times \text{Frequency}}{\text{\# Cycles} \times \text{Area}}$$
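Fine-grained round-robin switching can be sketched in a few lines. This is a minimal model, assuming 4-byte instructions and purely sequential fetch (no branches); the function names are hypothetical.

```python
def round_robin_schedule(num_threads, num_cycles):
    """Thread ID selected for fetch on each cycle (one instruction per thread in turn)."""
    return [cycle % num_threads for cycle in range(num_cycles)]


def fetch_trace(start_pcs, num_cycles):
    """Return (thread_id, pc) pairs fetched cycle by cycle.

    Assumption: each thread simply advances its own PC by 4 per fetch.
    """
    pcs = list(start_pcs)                # one program counter per thread
    trace = []
    for tid in round_robin_schedule(len(pcs), num_cycles):
        trace.append((tid, pcs[tid]))
        pcs[tid] += 4                    # next sequential instruction
    return trace


if __name__ == "__main__":
    for tid, pc in fetch_trace([0x000, 0x100, 0x200], 6):
        print(f"thread {tid}: fetch pc=0x{pc:03x}")
```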

Avoiding processor stall cycles

Data and control hazards create stall cycles

Multithreading: execute streams of independent instructions

Ideally, eliminates all stalls

[Figure: 3-stage pipeline (F, E, W) timing diagrams. Before: traditional execution, with stall cycles. After: instructions from Thread1, Thread2 and Thread3 interleaved over time.]
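The effect on stall cycles can be illustrated with a toy model. It assumes the worst case for a single thread: every instruction depends on the previous instruction of the same thread, and there is no forwarding, so a thread may issue again only `depth` cycles after its last issue. The model and its names are illustrative assumptions, not the simulator used for the results in this talk.

```python
def simulate(num_threads, depth, instrs_per_thread):
    """Return (total_cycles, stall_cycles) for strict round-robin issue.

    Toy assumption: a thread's next instruction must wait `depth` cycles
    after the previous one was issued (fully dependent, no forwarding).
    """
    remaining = [instrs_per_thread] * num_threads
    ready_at = [0] * num_threads         # earliest cycle each thread may issue
    total = num_threads * instrs_per_thread
    cycle = issued = stalls = tid = 0
    while issued < total:
        if remaining[tid] > 0 and ready_at[tid] <= cycle:
            remaining[tid] -= 1
            ready_at[tid] = cycle + depth
            issued += 1
        else:
            stalls += 1                  # this issue slot is wasted
        tid = (tid + 1) % num_threads    # round-robin thread selection
        cycle += 1
    return cycle, stalls


if __name__ == "__main__":
    print("1 thread, 3 stages:", simulate(1, 3, 4))
    print("3 threads, 3 stages:", simulate(3, 3, 4))
```

Under these assumptions, one thread on a 3-stage pipeline wastes two slots per instruction, while three interleaved threads fill every slot.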

How useful is multithreading?

Commercial SPs: single-threaded (NIOS-II, Microblaze)

Fort et al. [FCCM’06] have shown:
• A multithreaded SP is smaller than multiple SPs, with some performance degradation

We go further by showing that the area efficiency of a multithreaded SP is GREATER THAN the area efficiency of a single-threaded SP

Not straightforward; here is how we did it

Outline

• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading

Architectural Support for Multiple Threads

Single-Threaded Processor (simplified)

[Figure: single-threaded datapath with PC, +4 incrementer, instruction memory, register array, ALU, data memory, hazard detection logic and forwarding lines]

2-Threaded Processor (simplified)

Replicate state for each thread
Simplify control logic

[Figure: 2-threaded datapath with two PCs, +4 incrementer, instruction memory, register array, ALU, data memory, control (Ctrl.) and hazard detection logic]

Additional storage for multiple threads

N x:
• Program counters
• Registers
• Data mem.

More efficiently done in FPGA than in ASIC
Increase memory size while preserving frequency

Multithreading builds on the strengths of FPGAs
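Replicated per-thread state can be pictured as one larger memory indexed by thread ID. The sketch below is an assumption about one possible organization (thread ID as the upper address bits of a single register RAM, 32 MIPS-style registers per thread); the class and method names are hypothetical.

```python
class ThreadState:
    """Per-thread architectural state for an N-threaded processor (sketch)."""

    def __init__(self, num_threads, regs_per_thread=32):
        self.num_threads = num_threads
        self.regs_per_thread = regs_per_thread
        self.pcs = [0] * num_threads                          # one PC per thread
        self.regfile = [0] * (num_threads * regs_per_thread)  # one larger RAM

    def _index(self, tid, reg):
        # The thread ID selects the bank; the register number selects the entry.
        return tid * self.regs_per_thread + reg

    def read_reg(self, tid, reg):
        return self.regfile[self._index(tid, reg)]

    def write_reg(self, tid, reg, value):
        if reg != 0:  # assumption: register 0 is hard-wired to zero, as in MIPS
            self.regfile[self._index(tid, reg)] = value & 0xFFFFFFFF


if __name__ == "__main__":
    state = ThreadState(num_threads=2)
    state.write_reg(0, 5, 111)
    state.write_reg(1, 5, 222)
    print(state.read_reg(0, 5), state.read_reg(1, 5))
```

Growing the memory this way adds depth rather than extra ports, which fits the point above that FPGA block RAMs can grow in size while preserving frequency.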

Outline

• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading

Measurement Infrastructure

Single-thread processors: SPREE System [FPGA’06]

RTL into the Modelsim RTL simulator, running benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc):
1. Cycle Count

RTL into the Quartus II 5.0 CAD software, targeting Stratix 1S40C5:
2. Resource Usage
3. Clock Frequency
4. Power

We can measure area/performance/energy accurately

Evaluation methodology

Same benchmark running on all threads
Some mixed-benchmark results are in the paper

Run until completion of the last thread
Same instruction space

We present results with fixed-latency on-chip RAM
We are implementing a solution for off-chip RAM

Processors: 3, 5 and 7 stages

F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback

Pipe3: F/D | R/EX/M | WB (1174 LEs, 78.3 MHz)
Pipe5: F | D | R/EX1 | EX2/M | WB (1283 LEs, 86.79 MHz)
Pipe7: F | D | R | EX1 | EX2/M | EX3/WB1 | WB2 (1557 LEs, 100.59 MHz)

Best of each pipeline depth generated by SPREE
By default: thread count = number of pipeline stages

Area efficiency results

[Figure: bar chart of area efficiency (MIPS / 1000 LEs), single-threaded vs. multithreaded (MT), for the 3-stage, 5-stage and 7-stage processors; MT improvements of 33%, 77% and 106%, respectively]

Area efficiency is most improved with deeper pipelines
The 3- and 7-stage processors have similar area efficiency

IPC results for 3, 5 and 7 stages

[Figure: IPC (instructions/cycle, 0 to 1) per benchmark (bubble_sort, crc, des, fft, fir, quant, iquant, vlc, bitcnts, gol) and mean, for pipe3_mt, pipe5_mt and pipe7_mt. Ideal IPC = 1]

[Figure: mean normalized IPC (0 to 2.5): IPC versus single-threaded proc.]

24%, 45% and 104% more instructions per cycle, respectively

Improvements to the Baseline Multithreaded Soft Processors

Optimize away unpipelined multi-cycle paths

Selection of architectural features
1) Multiplier implementation
2) Number of registers
3) Number of threads

Combination of techniques optimizing area efficiency

1- Changing multiplication support

[Figure: register file feeding a multiplier, with Hi/Lo registers and a MUX]

• Default MIPS has Hi/Lo registers

• 3-operand multiplies (NIOS2 and Microblaze)

– Two instructions compute high and low parts

– Avoids replicating Hi and Lo register support
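The two-instruction scheme can be illustrated as follows. This is a sketch assuming unsigned 32-bit operands; `mul_lo` and `mul_hi` are hypothetical names standing in for the two 3-operand multiply instructions, each writing its half of the 64-bit product directly to a destination register instead of to dedicated Hi/Lo registers.

```python
MASK32 = 0xFFFFFFFF


def mul_lo(a, b):
    """Low 32 bits of the 64-bit product of two unsigned 32-bit operands."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def mul_hi(a, b):
    """High 32 bits of the 64-bit product of two unsigned 32-bit operands."""
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32


if __name__ == "__main__":
    a, b = 0xFFFFFFFF, 2
    hi, lo = mul_hi(a, b), mul_lo(a, b)
    print(hex(hi), hex(lo))
    # The two halves together reproduce the full 64-bit product.
    assert (hi << 32) | lo == a * b
```

Because each half lands in an ordinary register, no extra per-thread Hi/Lo state has to be replicated.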

2- Reducing the register file

Not all registers are utilized [RAAW’06]
Many threads can combine the savings
Results in saved memory blocks

• Applicable to the 5-stage processor

• Slightly increases cycle count due to increased register pressure

• Allows area and frequency improvements

[Figure: two threads' register sets, 1..N each (2N registers in total), reduced to 1..N-k each (2N-2k registers in total)]
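How combined savings can cross a memory-block boundary is sketched below. This is an idealized capacity model only: the 4096-bit block size is an assumed parameter, the register counts are made up, and real register files also face port and width constraints that the model ignores.

```python
import math


def ram_blocks(num_threads, regs_per_thread, reg_bits=32, block_bits=4096):
    """Number of RAM blocks needed to hold all threads' registers.

    Idealized: counts capacity only. block_bits is an assumed block size.
    """
    total_bits = num_threads * regs_per_thread * reg_bits
    return math.ceil(total_bits / block_bits)


if __name__ == "__main__":
    # Hypothetical example: 5 threads, trimming each thread from 32 to 25 registers.
    print(ram_blocks(5, 32), "->", ram_blocks(5, 25))
    # A single thread sees no block saving from the same trim.
    print(ram_blocks(1, 32), "->", ram_blocks(1, 25))
```

In this model the per-thread saving is too small to free a block on its own, but summed over several threads it can.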

Reducing the Number of Threads

• Usually: # threads = # pipeline stages
• Last stage: writeback to non-conflicting register

Positive effect on the 5 and 7-stage processors
Helps meet processing latency deadline (shorter round-robin)
Gives designers more flexibility

[Figure: 3-stage pipeline (F, E, W) timing diagrams over time; legend: Thread1, Thread2, Thread3]
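The latency argument is simple arithmetic: under round-robin issue, a thread gets one slot every `num_threads` cycles. The sketch below uses a hypothetical function name and a made-up 100 MHz clock.

```python
def same_thread_issue_interval_ns(num_threads, freq_mhz):
    """Time between consecutive instructions of one thread, in nanoseconds.

    Assumes strict round-robin issue with one instruction per cycle.
    """
    cycle_ns = 1000.0 / freq_mhz         # clock period in ns
    return num_threads * cycle_ns


if __name__ == "__main__":
    # Hypothetical 100 MHz clock: 5 threads vs. 4 threads.
    print(same_thread_issue_interval_ns(5, 100.0))
    print(same_thread_issue_interval_ns(4, 100.0))
```

Fewer threads means each thread is serviced more often, which is what shortens the round-robin and helps with a per-thread latency deadline.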

Conclusions

Multithreaded SPs outperform single-threaded SPs
• Assumes independent threads
• Assumes use of on-chip memory

33%, 77% and 106% increase in area efficiency
Demonstrated that benefits increase with pipeline depth
Techniques to optimize away unpipelined multi-cycle paths
Selection and combination of architectural features
• Multiplier support
• Number of threads
• Number of registers

Commercial FPGA makers should have a multithreaded SP

Long term goals

Multiple multithreaded soft processors
• Research using off-chip memory hierarchy
• Study of synchronization mechanisms
• Make easy to target and scale up for non-HW people

Experimental Testbed: NetFPGA
• Stanford/Xilinx platform
• Collaboration with network researchers
• Perform real high bandwidth experiments

– Virtex-II Pro

– 4 x 1 Gbps Ethernet

– PCI board

– 64 MB DDR2 DRAM

Thank you

Martin Labrecque ([email protected]), Gregory Steffan

ECE Dept. University of Toronto

Where do threads come from?

Event processing, e.g. multiple sources of interrupts

Packet processing, e.g. CAN, RS-485, Ethernet, etc.

Systems handling requests, e.g. bus controllers

For now, we consider independent threads

SPREE vs Nios II [IEEE TCAD’07]

[Figure: scatter plot of geomean wall clock time (us) versus area (equivalent LEs) for SPREE processors and Altera Nios II/e, Nios II/s and Nios II/f; arrows mark the "smaller" and "faster" directions]

Architectural Parameters Used in SPREE

We focus on core microarchitecture (for now)

• Multiplication support: hardware FU or software routine
• Shifter implementation: flip-flops, multiplier, or LUTs
• Pipelining: depth (2-7 stages)
• Pipelining: forwarding lines

Contributions on Multithreaded Soft Processors

Multithreaded SPs dominate single-threaded processors in area and IPC

Demonstrated that these benefits increase with the # of pipeline stages

Explained techniques to optimize away unpipelined multi-cycle paths

Selection of architectural features
• Number of threads
• Number of registers
• Multiplier support

Combination of techniques that optimize area efficiency

Unpipelined Multicycle Paths

Example of a 3-stage pipeline with multicycle paths on loads, stores, shifts and multiplies

[Figure: ST and MT variants of the pipeline; the stage sequences shown are F/D | R/EX | EX | WB and F/D | R/EX | M | WB]

Important source of IPC improvement

Not practical in ST because of hazard detection

Changing multiplication support

[Figure: normalized area (equiv. LEs), frequency (MHz) and energy per instruction (nJ/instr), Hi/Lo vs. 3op multiplies, for the 3-stage, 5-stage and 7-stage processors]

For multithreaded SPs, 3op-multiplies always win

Reducing the Number of Threads

[Figure: normalized area (equiv. LEs), frequency (MHz) and energy per instruction (nJ/instr) for pipe3_mt_2T, pipe5_mt_4T and pipe7_mt_6T]

Positive effect on the 5 and 7-stage processors

SPREE System (Soft Processor Rapid Exploration Environment)

■ Input: Processor description (ISA, datapath)
■ Made of hand-coded components

■ SPREE System
1. Verify ISA against datapath
2. Datapath Instantiation
3. Control Generation

■ Output: Synthesizable Verilog (RTL)

Multithreading

Replace processor stalls
• Fill them with instructions from other threads

When to switch thread?
• Multiple techniques
• Most common: every instruction (e.g. Sun’s Niagara)

Fine-grained multithreading: 1 instr. per thread in round-robin

[Figure: interleaved instructions in pipeline over time: T1 T2 T3 T1 T2 T3]

$$\text{Area efficiency} = \frac{\text{Million Instr.} \times \text{Frequency}}{\text{\# Cycles} \times \text{Area}}$$


Removed load and branch delay slots in the code
