Improving Pipelined Soft Processors with Multithreading

Martin Labrecque, Gregory Steffan. ECE Dept., University of Toronto. Presented at RAAW 2006, Orlando, FL.



Page 1: Improving Pipelined Soft Processors with Multithreading

Improving Pipelined Soft Processors with Multithreading

Martin Labrecque, Gregory Steffan

ECE Dept. University of Toronto

Presented at RAAW 2006, Orlando, FL

Page 2: Improving Pipelined Soft Processors with Multithreading

Processors and FPGAs

FPGAs increasingly implement SoCs, with CPUs

Soft processors: processors in the FPGA fabric

[Figure: an FPGA containing a processor and custom logic; the processor is drawn as a simplified datapath with PC, instruction memory, register array, ALU and data memory.]

Soft processors are:
• Easier to program than HDL
• Customizable

Page 3: Improving Pipelined Soft Processors with Multithreading

Soft processors in Embedded Systems

What do designers care about?
• Minimizing area?
• Matching frequency?
• Hitting performance target?

We trade off 4 criteria (soft proc. power is related to area)

Area efficiency: a combined metric

$$\frac{\text{Performance}}{\text{Area}} = \frac{\text{Instr. Count} \times \text{Frequency}}{\text{Cycle Count} \times \text{Area}}$$
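As a rough illustration of the combined metric, the sketch below computes area efficiency in MIPS per 1000 LEs, the unit used on the results slide. The function name and the input numbers are hypothetical and are not measurements from the talk.

```python
def area_efficiency(instr_count, cycle_count, freq_mhz, area_les):
    """Return area efficiency in MIPS per 1000 LEs.

    Performance / Area = (instr_count * frequency) / (cycle_count * area).
    """
    ipc = instr_count / cycle_count   # instructions per cycle
    mips = ipc * freq_mhz             # (instr/cycle) * (million cycles/s)
    return mips / (area_les / 1000.0)


# Hypothetical inputs: 1M instructions in 2M cycles at 80 MHz on 1200 LEs.
eff = area_efficiency(1_000_000, 2_000_000, 80.0, 1200)
print(f"{eff:.2f} MIPS per 1000 LEs")
```

Because area sits in the denominator, a design that raises IPC while adding only a little logic improves the metric, which is the argument the talk makes for multithreading.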

Page 4: Improving Pipelined Soft Processors with Multithreading

Multithreading

• Replace processor stalls
– Fill them with instructions from other threads
• When to switch thread?
– Every instruction (e.g. Sun’s Niagara)
– Convenient technique for in-order processors

Fine-grained multithreading: 1 instr. per thread in round-robin

$$\text{Area efficiency} = \frac{\text{Million Instr.} \times \text{Frequency}}{\text{\# Cycles} \times \text{Area}}$$

Page 5: Improving Pipelined Soft Processors with Multithreading

Avoiding processor stall cycles

BEFORE: Traditional execution. Data and control hazards create stall cycles.

AFTER: Multithreading: execute streams of independent instructions. Ideally, eliminates all stalls.

[Figure: timing diagrams of a 3-stage pipeline (F, E, W) over time, before and after multithreading; legend: Thread1, Thread2, Thread3.]
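The effect can be sketched with a toy issue model. This is not the authors' simulator, and it rests on simplifying assumptions that are not in the slides: no forwarding, a result is usable only `depth` cycles after its producer issues, strict round-robin issue, and pipeline fill and drain are ignored. The function name `simulate_issue` is hypothetical.

```python
def simulate_issue(threads, depth):
    """Return (cycles, stalls) for strict round-robin issue.

    threads: one list per thread; each entry is True when that instruction
             depends on the previous instruction of the same thread.
    depth:   pipeline depth; a dependent instruction may only issue
             `depth` cycles after its producer (assumed, no forwarding).
    """
    n = len(threads)
    next_idx = [0] * n    # index of the next instruction of each thread
    ready_at = [0] * n    # cycle at which each thread's last result is usable
    remaining = sum(len(prog) for prog in threads)
    cycle = stalls = t = 0
    while remaining:
        prog, i = threads[t], next_idx[t]
        if i < len(prog):
            if prog[i] and cycle < ready_at[t]:
                stalls += 1                  # hazard: this issue slot is a bubble
            else:
                next_idx[t] += 1             # issue the instruction
                ready_at[t] = cycle + depth
                remaining -= 1
        cycle += 1
        t = (t + 1) % n                      # switch thread every cycle
    return cycle, stalls


prog = [False, True, True, True]             # a chain of dependent instructions
print(simulate_issue([prog], depth=3))       # one thread
print(simulate_issue([prog] * 3, depth=3))   # three threads, three stages
```

In this model a single thread spends most issue slots waiting on its own results, while three interleaved threads on a 3-stage pipeline never stall, because consecutive instructions of the same thread are already `depth` cycles apart.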

Page 6: Improving Pipelined Soft Processors with Multithreading

How useful is multithreading?

• Commercial SPs: single-threaded (NIOS-II, Microblaze)
• Fort et al. [FCCM’06] have shown:
– multithreaded SP smaller than multiple SPs, with some performance degradation
• We go further by showing that the area efficiency of a multithreaded SP is GREATER THAN the area efficiency of a single-threaded SP

Not straightforward, here is how we did it

Page 7: Improving Pipelined Soft Processors with Multithreading

Outline

• Architectural Support for Multiple Threads (this section)
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading

Page 8: Improving Pipelined Soft Processors with Multithreading

Single-Threaded Processor (simplified)

[Figure: simplified datapath; labels: PC, +4, Instr. Mem, Reg. Array, ALU, Data Mem, Hazard Detection Logic, Forwarding lines.]

Page 9: Improving Pipelined Soft Processors with Multithreading

2-Threaded Processor (simplified)

• Replicate state for each thread
• Simplify control logic

[Figure: simplified datapath; labels: PC, PC, +4, Instr. Mem, Reg. Array, ALU, Data Mem, Ctrl., Hazard Detection Logic.]

Page 10: Improving Pipelined Soft Processors with Multithreading

Additional storage for multiple threads

N x: program counters, registers, data mem.

• More efficiently done in FPGA than in ASIC
• Increase memory size while preserving frequency

Multithreading builds on the strengths of FPGAs
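A minimal software sketch of the replicated state follows, assuming a MIPS-style file of 32 registers per thread. The class and method names are hypothetical, and the per-thread data memory is left out for brevity.

```python
from dataclasses import dataclass, field

NUM_REGS = 32  # assumed MIPS-style register count per thread


@dataclass
class ThreadContext:
    """State that is replicated N times, once per hardware thread."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * NUM_REGS)


class MultithreadedCore:
    def __init__(self, num_threads):
        self.contexts = [ThreadContext() for _ in range(num_threads)]
        self.current = 0

    def next_context(self):
        """Select contexts in round-robin order, one per cycle."""
        ctx = self.contexts[self.current]
        self.current = (self.current + 1) % len(self.contexts)
        return ctx

    def storage_words(self):
        """Words of replicated state: N x (registers + program counter)."""
        return len(self.contexts) * (NUM_REGS + 1)


core = MultithreadedCore(num_threads=2)
core.next_context().pc += 4      # thread 0 advances
print(core.next_context().pc)    # thread 1 is untouched
print(core.storage_words())
```

The datapath is shared and only the storage grows with the thread count, which is why the slide points to memory capacity as the resource that matters.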

Page 11: Improving Pipelined Soft Processors with Multithreading

Outline

• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading

Page 12: Improving Pipelined Soft Processors with Multithreading

Measurement Infrastructure

• Processors: Single-Thread Processors, SPREE System [FPGA’06], producing RTL
• Benchmarks: MiBench, Dhrystone 2.1, RATES, XiRisc
• Modelsim RTL Simulator: 1. Cycle Count
• Quartus II 5.0 CAD Software (Stratix 1S40C5): 2. Resource Usage, 3. Clock Frequency, 4. Power

We can measure area/performance/energy accurately

Page 13: Improving Pipelined Soft Processors with Multithreading

Evaluation methodology

• Same benchmark running on all threads
– Some mixed benchmark results in the paper
• Run until completion of the last thread
– Same instruction space
• We present results with fixed latency on-chip RAM
– We are implementing a solution for off-chip RAM
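One way to read "run until completion of the last thread" is that throughput is charged for the whole run, including the tail where some threads have already finished. The sketch below shows that bookkeeping. The function name and the numbers are hypothetical, and the slides do not spell out the exact accounting.

```python
def aggregate_ipc(instr_counts, finish_cycles):
    """Instructions retired by all threads per cycle of the whole run.

    instr_counts:  instructions executed by each thread
    finish_cycles: cycle at which each thread completed
    """
    total_cycles = max(finish_cycles)   # the run ends with the last thread
    return sum(instr_counts) / total_cycles


# Hypothetical run: three copies of one benchmark finishing a few cycles apart.
print(aggregate_ipc([1000, 1000, 1000], [3000, 3010, 3020]))
```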

Page 14: Improving Pipelined Soft Processors with Multithreading

Processors: 3, 5 and 7 stages

Stage legend: F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback

• Pipe3: F/D, R/EX/M, WB
• Pipe5: F, D, R/EX1, EX2/M, WB
• Pipe7: F, D, R, EX1, EX2/M, EX3/WB1, WB2

Area and clock frequency, in the order listed on the slide: 1174 LEs, 78.3 MHz; 1283 LEs, 86.79 MHz; 1557 LEs, 100.59 MHz

• Best of each pipeline depth generated by SPREE
• By default: thread count = number of pipeline stages

Page 15: Improving Pipelined Soft Processors with Multithreading

Area efficiency results

[Figure: bar chart of area efficiency (MIPS / 1000 LEs), single-threaded versus multithreaded (MT), for the 3-stage, 5-stage and 7-stage processors; annotated improvements: 33%, 77% and 106%.]

• Area efficiency is most improved with deeper pipelines
• 3- and 7-stages have similar area efficiency

Page 16: Improving Pipelined Soft Processors with Multithreading

IPC results for 3, 5 and 7 stages

[Figure: bar chart of IPC (instructions/cycle) for pipe3_mt, pipe5_mt and pipe7_mt on bubble_sort, crc, des, fft, fir, quant, iquant, vlc, bitcnts, gol, and their mean. Ideal IPC = 1.]

[Figure: mean normalized IPC (instructions per cycle): IPC versus single-threaded proc.]

24%, 45% and 104% more instructions per cycle, respectively

Page 17: Improving Pipelined Soft Processors with Multithreading

Improvements to the Baseline Multithreaded Soft Processors

• Optimize away unpipelined multi-cycle paths
• Selection of architectural features
1) Multiplier implementation
2) Number of registers
3) Number of threads
• Combination of techniques optimizing area efficiency

Page 18: Improving Pipelined Soft Processors with Multithreading

1- Changing multiplication support

[Figure: register file feeding a multiplier, with Hi/Lo registers and a MUX.]

• Default MIPS has Hi/Lo registers
• 3-operand multiplies (NIOS2 and Microblaze)
– Two instructions compute high and low parts
– Avoids replicating Hi and Lo registers support
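To make the two-instruction scheme concrete, the sketch below models two hypothetical 3-operand operations in Python: one returns the low 32 bits of the product and the other the high 32 bits. Operands are treated as unsigned. The function names are illustrative and are not actual NIOS II or MicroBlaze mnemonics.

```python
MASK32 = 0xFFFFFFFF


def mul_lo(a, b):
    """Low 32 bits of the 64-bit product (hypothetical 3-operand multiply)."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def mul_hi_unsigned(a, b):
    """High 32 bits of the unsigned 64-bit product."""
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32


a, b = 0xFFFFFFFF, 2
lo = mul_lo(a, b)             # written to an ordinary destination register
hi = mul_hi_unsigned(a, b)    # written to another ordinary register
print(hex(hi), hex(lo))
```

Both results land in general-purpose registers, so there is no separate Hi/Lo state to replicate per thread.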

Page 19: Improving Pipelined Soft Processors with Multithreading

2- Reducing the register file

• Not all registers are utilized [RAAW’06]
• Many threads can combine the savings
• Results in saved memory blocks

• Applicable to the 5-stage processor
• Slightly increases cycle count due to increased register pressure
• Allows area and frequency improvements

[Figure: two register sets of 1..N registers occupy 2N entries; reduced to 1..N-k registers each, they occupy 2N-2k entries.]
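The arithmetic behind "many threads can combine the savings" can be sketched as follows. The block capacity of 4096 bits is a hypothetical parameter, not a figure from the talk, and the model ignores any duplication of the register file for extra read ports.

```python
import math


def regfile_blocks(num_threads, regs_per_thread, word_bits=32, block_bits=4096):
    """Memory blocks needed to hold every thread's registers.

    block_bits is an assumed on-chip block capacity (hypothetical value).
    """
    total_bits = num_threads * regs_per_thread * word_bits
    return math.ceil(total_bits / block_bits)


# Hypothetical: 5 threads, trimming each register set from 32 to 25 entries.
print(regfile_blocks(5, 32))   # full register sets
print(regfile_blocks(5, 25))   # reduced register sets
```

Removing k registers from one thread rarely frees a whole block, but the same k removed from every thread can push the combined total under a block boundary.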

Page 20: Improving Pipelined Soft Processors with Multithreading

Reducing the Number of Threads

• Usually: # threads = # pipeline stages
• Last stage: writeback to non-conflicting register

[Figure: timing diagram of a 3-stage pipeline (F, E, W) over time; legend: Thread1, Thread2, Thread3.]

• Positive effect on the 5 and 7-stage processors
• Helps meet processing latency deadline (shorter round-robin)
• Gives designers more flexibility

Page 21: Improving Pipelined Soft Processors with Multithreading

Conclusions

• Multithreaded SPs outperform single-threaded SPs
– Assumes independent threads
– Assumes use of on-chip memory
• 33%, 77% and 106% increase in area efficiency
• Demonstrated that benefits increase with pipeline depth
• Techniques to optimize away unpipelined multi-cycle paths
• Selection and combination of architectural features
– Multiplier support
– Number of threads
– Number of registers
• Commercial FPGA makers should have a multithreaded SP

Page 22: Improving Pipelined Soft Processors with Multithreading

Long term goals

• Multiple multithreaded soft processors
– Research using off-chip memory hierarchy
– Study of synchronization mechanisms
– Make it easy to target and scale up for non-HW people

Experimental Testbed: NetFPGA

• Stanford/Xilinx platform
• Collaboration with network researchers
• Perform real high bandwidth experiments
– Virtex-II Pro
– 4 x 1 Gbps Ethernet
– PCI board
– 64 MB DDR2 DRAM

Page 23: Improving Pipelined Soft Processors with Multithreading

Thank you

Martin Labrecque ([email protected]), Gregory Steffan

ECE Dept. University of Toronto

Page 24: Improving Pipelined Soft Processors with Multithreading

Where do threads come from?

• Event processing, e.g. multiple sources of interrupts
• Packet processing, e.g. CAN, RS-485, Ethernet, etc.
• Systems handling requests, e.g. bus controllers

For now, we consider independent threads

Page 25: Improving Pipelined Soft Processors with Multithreading

SPREE vs Nios II [IEEE TCAD’07]

[Figure: scatter plot of geomean wall clock time (us) versus area (equivalent LEs) for SPREE processors and Altera Nios II/e, Nios II/s and Nios II/f; arrows mark the "smaller" and "faster" directions.]

Page 26: Improving Pipelined Soft Processors with Multithreading

Architectural Parameters Used in SPREE

• Multiplication support: hardware FU or software routine
• Shifter implementation: flipflops, multiplier, or LUTs
• Pipelining
– Depth (2-7 stages)
– Forwarding lines

We focus on core microarchitecture (for now)

Page 27: Improving Pipelined Soft Processors with Multithreading

Contributions on Multithreaded Soft Processors

• Multithreaded SPs dominate single-threaded processors in area and IPC
• Demonstrated that these benefits increase with the # of pipeline stages
• Explained techniques to optimize away unpipelined multi-cycle paths
• Selection of architectural features
– Number of threads
– Number of registers
– Multiplier support
• Combination of techniques that optimize area efficiency

Page 28: Improving Pipelined Soft Processors with Multithreading

Unpipelined Multicycle Paths

[Figure: stage diagrams of the single-threaded (ST) and multithreaded (MT) pipelines; stage labels: F/D, R/EX, EX, M, WB.]

Example of 3-stage pipeline with multicycle on load, store, shift and multiplies

• Important source of IPC improvement
• Not practical in ST because of hazard detection

Page 29: Improving Pipelined Soft Processors with Multithreading

Changing multiplication support

[Figure: bar chart of normalized area (equiv. LEs), frequency (MHz) and energy per instruction (nJ/instr), Hi/Lo versus 3op, for the 3-stage, 5-stage and 7-stage processors.]

For multithreaded SPs, 3op-multiplies always win

Page 30: Improving Pipelined Soft Processors with Multithreading

Reducing the Number of Threads

[Figure: bar chart of normalized area (equiv. LEs), frequency (MHz) and energy per instruction (nJ/instr) for pipe3_mt_2T, pipe5_mt_4T and pipe7_mt_6T.]

Positive effect on the 5 and 7-stage processors

Page 31: Improving Pipelined Soft Processors with Multithreading

SPREE System (Soft Processor Rapid Exploration Environment)

■ Input: Processor description (ISA and datapath)
■ Made of hand-coded components
■ SPREE System
1. Verify ISA against datapath
2. Datapath Instantiation
3. Control Generation
■ Output: Synthesizable Verilog (RTL)

Page 32: Improving Pipelined Soft Processors with Multithreading

Multithreading

• Replace processor stalls
– Fill them with instructions from other threads
• When to switch thread?
– Multiple techniques
– Most common: every instruction (e.g. Sun’s Niagara)

Fine-grained multithreading: 1 instr. per thread in round-robin

[Figure: interleaved instructions in pipeline over time: T1 T2 T3 T1 T2 T3.]

$$\text{Area efficiency} = \frac{\text{Million Instr.} \times \text{Frequency}}{\text{\# Cycles} \times \text{Area}}$$

Page 33: Improving Pipelined Soft Processors with Multithreading

Experimental Testbed: NetFPGA

• Stanford/Xilinx platform
• Collaboration with network researchers
• Perform real high bandwidth experiments
– Virtex-II Pro
– 4 x 1 Gbps Ethernet
– PCI board
– 64 MB DDR2 DRAM

Page 34: Improving Pipelined Soft Processors with Multithreading


Removed load and branch delay slots in the code