
Cptr350 Chapter 4 — The Processor - Datapath 1

COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface

5th Edition

Chapter 4
The Processor
Pipelining

Single-Cycle Disadvantages & Advantages

- Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction.
  - This would be especially problematic for more complex instructions like floating point multiply.
- May be wasteful of area. Some functional units (e.g., adders, memory) must be duplicated since they cannot be shared during a clock cycle.
- However, the single-cycle implementation is simple and easy to understand.

[Figure: clock (Clk) timing for lw in Cycle 1 and sw in Cycle 2; part of the clock period is marked "Waste".]


The Single-Cycle Datapath

[Figure: single-cycle datapath. Components: PC, Instruction Memory, Register File, ALU (ovf and zero outputs), Data Memory, Sign Extend (16 to 32 bits), two Shift left 2 units, two adders, ALU control, and the Control Unit. Control signals: RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, Jump, PCSrc, ALUOp. Instruction fields: Instr[31-26], Instr[25-21], Instr[20-16], Instr[15-11], Instr[15-0], Instr[5-0], Instr[25-0].]

The Five Stages of the Load Instruction

- IFetch: Instruction Fetch and Update PC
- Dec: Instruction Decode and Register fetch
- Exec: Execute R-type; calculate memory address
- Mem: Read/write the data from/to the Data Memory
- WB: Write the result into the register file

[Figure: lw timeline. Cycle 1: IFetch, Cycle 2: Dec, Cycle 3: Exec, Cycle 4: Mem, Cycle 5: WB.]


A Pipelined MIPS Processor

- Start the next instruction before the current one has completed.
  - Improves throughput: the total amount of work done in a given time.
  - Instruction latency (time from the start of an instruction to its completion) is not reduced. It is often increased due to imbalances between stages.

[Figure: pipeline timing diagram over Cycles 1 to 8 showing lw, sw, and an R-type instruction overlapped in the stages IFetch, Dec, Exec, Mem, WB.]

- Clock cycle (pipeline stage time) is limited by the slowest stage.
- For some stages, we don’t need the whole clock cycle (e.g., WB).
- For some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction).

Pipeline Performance

- Single-cycle (cycle time = 800 ps)
- Pipelined (cycle time = 200 ps)
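The two cycle times above can be compared over a run of instructions. The following is a minimal sketch, assuming a five-stage pipeline with no stalls; the function names and the instruction counts are illustrative and do not come from the slides.

```python
SINGLE_CYCLE_PS = 800   # cycle time from the slide
PIPELINED_PS = 200      # cycle time from the slide
STAGES = 5              # assumption: five stages (IFetch, Dec, Exec, Mem, WB)


def single_cycle_time(n):
    """Total time in ps: every instruction takes one full 800 ps cycle."""
    return n * SINGLE_CYCLE_PS


def pipelined_time(n, stages=STAGES):
    """Total time in ps, assuming no stalls: the first instruction needs
    `stages` cycles, and each later one completes one cycle after it."""
    if n == 0:
        return 0
    return (stages + n - 1) * PIPELINED_PS


for n in (3, 1000):
    s, p = single_cycle_time(n), pipelined_time(n)
    print(f"{n} instructions: single-cycle {s} ps, pipelined {p} ps, "
          f"speedup {s / p:.2f}")
```

Under these assumptions, three instructions take 2400 ps single-cycle and 1400 ps pipelined, a speedup of about 1.7. For long runs the speedup approaches 800/200 = 4.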


Pipelining the MIPS ISA

- What makes it easy
  - All instructions are the same length (32 bits).
    - Can fetch in the 1st stage and decode in the 2nd stage.
  - Few instruction formats (three) with symmetry across formats.
    - Can begin reading register file in 2nd stage.
  - Memory operations occur only in loads and stores.
    - Can use the execute stage to calculate memory addresses.
  - Each instruction writes at most one result (i.e., changes the machine state) and does it in the last pipeline stages (MEM or WB).
  - Operands must be aligned in memory so a single data transfer takes only one data memory access.

Graphically Representing MIPS Pipeline

- Can help with answering questions like:
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
  - Is there a hazard (whatever that is), why does it occur, and how can it be fixed?

[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg.]


Five Instruction Sequence

Once the pipeline is full, one instruction is completed every cycle, so CPI = 1.

[Figure: pipeline diagram with instruction order (Inst 0 to Inst 4) on the vertical axis and time (clock cycles) on the horizontal axis. Each instruction passes through IM, Reg, ALU, DM, Reg. The initial cycles are labeled "Time to fill the pipeline".]
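The diagram for this slide can be regenerated as text. The sketch below assumes that one instruction issues per cycle with no stalls; the helper names `pipeline_rows` and `print_diagram` are hypothetical, and the stage labels follow the slide.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]  # stage labels from the slide


def pipeline_rows(num_instructions, stages=STAGES):
    """Return one row per instruction; row i is shifted right by i cycles.
    Assumes one instruction issues per cycle and nothing stalls."""
    total_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name
        rows.append(row)
    return rows


def print_diagram(num_instructions):
    """Print the rows as a table with one column per clock cycle."""
    rows = pipeline_rows(num_instructions)
    header = "".join(f"{'C' + str(c + 1):<5}" for c in range(len(rows[0])))
    print(f"{'':<8}{header}")
    for i, row in enumerate(rows):
        print(f"{'Inst ' + str(i):<8}" + "".join(f"{cell:<5}" for cell in row))


print_diagram(5)
```

With five instructions the table spans nine cycles: the first four cycles fill the pipeline, and from cycle 5 onward one instruction completes per cycle.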

Other Pipeline Structures Are Possible

- What about a slow multiply instruction?
  - Make the clock twice as slow or …
  - Let it take two cycles (since it doesn’t use the DM stage).

[Figure: pipeline IM, Reg, ALU, DM, Reg, with a MUL unit.]

- What if the data memory access is twice as slow as the instruction memory?
  - Make the clock twice as slow or …
  - Let data memory access take two cycles (and keep the same clock rate).

[Figure: pipeline IM, Reg, ALU, DM1, DM2, Reg.]


Other Pipeline Architectures

- ARM7 (stages: IM, Reg, EX)
  - IM: PC update, IM access
  - Reg: decode, reg access
  - EX: ALU op, DM access, shift/rotate, commit result (write back)
- Intel XScale (stages: IM1, IM2, Reg, SHFT, ALU, DM1, DM2/Reg)
  - IM1: PC update, BTB access, start IM access
  - IM2: IM access
  - Reg: decode, reg 1 access
  - SHFT: shift/rotate, reg 2 access
  - ALU: ALU op
  - DM1: start DM access, exception
  - DM2/Reg: DM write, reg write

Latches and Clocks in Single-Cycle Design

[Figure: single-cycle design blocks PC, Instr Mem, Reg File, ALU, Addr, Data Memory.]

- The entire instruction executes in a single cycle.
- Green blocks are latches.
- At the rising edge, a new PC is latched.
- At the rising edge, the result of the previous cycle is latched.
- At the falling edge, the address of LW/SW is latched so we can access the data memory in the 2nd half of the cycle.


Multi-Stage Circuit

[Figure: multi-stage circuit with PC, Instr Mem, Reg File, ALU, Data Memory, and Reg File, separated by latches L2, L3, L4, and L5.]

- Instead of executing the entire instruction in a single cycle (a single stage), let’s break up the execution into multiple stages, each separated by a latch.

Pipeline Registers

- Need registers between stages to hold information produced in the previous cycle.


The Assembly Line

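- Unpipelined: Start and finish a job before moving on.
- Pipelined: Break the job into smaller stages.

[Figure: jobs plotted against time. Unpipelined, each job is a single block A; pipelined, each job is split into stages A, B, C.]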

Instruction Fetch


Instruction Decode and Register Fetch

Execute


Memory Access (Load or Store)

Write-back

Wrong register number


Corrected Datapath for Load Instruction

Execute Stage for Store Instruction


Memory Access Stage for Store

Write-back Stage for Store Instruction


Performance Improvements?

- Compared to single-cycle execution
  - Does it take longer to finish each individual instruction?
  - Does it take longer to finish a series of instructions?
  - Is a 10-stage pipeline better than a 5-stage pipeline?

Quantitative Effects

- As a result of pipelining:
  - The latency of each individual instruction (in ns) increases.
  - Each instruction takes more cycles to execute.
  - But… average CPI remains roughly the same.
  - Clock speed goes up.
  - Total execution time goes down, resulting in lower average time per instruction.
- Under ideal conditions,
  - Speedup = ratio of elapsed times between successive instruction completions = number of pipeline stages = increase in clock speed
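The gap between the ideal speedup and what unbalanced stages deliver can be checked with a few lines. This is a sketch under stated assumptions: the per-stage latencies below are not on the slides; they are chosen only so that they sum to the 800 ps single-cycle time and peak at the 200 ps pipelined cycle time. All function names are illustrative.

```python
# Assumed per-stage latencies in ps. These are NOT on the slides; they are
# chosen so the stages sum to 800 ps and the slowest stage takes 200 ps.
STAGE_PS = {"IFetch": 200, "Dec": 100, "Exec": 200, "Mem": 200, "WB": 100}


def single_cycle_time(stage_ps):
    """Single-cycle clock period: one instruction passes through every stage."""
    return sum(stage_ps.values())


def pipelined_cycle_time(stage_ps):
    """Pipelined clock period: limited by the slowest stage."""
    return max(stage_ps.values())


def instruction_latency(stage_ps):
    """Time for one instruction to traverse the whole pipeline."""
    return len(stage_ps) * pipelined_cycle_time(stage_ps)


def steady_state_speedup(stage_ps):
    """Ratio of times between successive instruction completions."""
    return single_cycle_time(stage_ps) / pipelined_cycle_time(stage_ps)


balanced = {name: 160 for name in STAGE_PS}  # hypothetical balanced stages

print(instruction_latency(STAGE_PS))    # 1000 ps, up from 800 ps single-cycle
print(steady_state_speedup(STAGE_PS))   # 4.0, below the stage count of 5
print(steady_state_speedup(balanced))   # 5.0, equal to the stage count
```

With the assumed unbalanced stages, each instruction takes longer (1000 ps instead of 800 ps) while the steady-state speedup is 4. Only the balanced case reaches the ideal speedup equal to the number of stages.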


Can Pipelining Get Us Into Trouble?

- Yes: Pipeline Hazards
  - Structural hazards: attempt to use the same resource by two different instructions at the same time.
  - Data hazards: attempt to use data before it is ready.
    - An instruction’s source operand(s) are produced by a prior instruction still in the pipeline.
  - Control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated.
    - Branch and jump instructions, exceptions.
- Hazards can usually be resolved by waiting.
  - Pipeline control must detect the hazard and take action to resolve it.
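Detection of the data-hazard case can be illustrated in software. The sketch below is not the hardware detection logic; it scans an instruction list for read-after-write dependences. It assumes a five-stage pipeline without forwarding in which a result becomes readable three instructions after its producer, so only the next two instructions are at risk. The `Instr` type, the `raw_hazards` function, the `window` parameter, and the sample program are all hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None (e.g., sw, beq)
    srcs: Tuple[str, ...]     # registers read


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each consumer
    that reads a register written by one of the `window` preceding
    instructions. Simplification: intermediate overwrites of the same
    register are not tracked."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].srcs:
                hazards.append((i, j, producer.dest))
    return hazards


program = [
    Instr("add $1, $2, $3", "$1", ("$2", "$3")),
    Instr("sub $4, $1, $5", "$4", ("$1", "$5")),
    Instr("and $6, $7, $8", "$6", ("$7", "$8")),
    Instr("or  $9, $1, $4", "$9", ("$1", "$4")),
]

for i, j, reg in raw_hazards(program):
    print(f"{program[j].text!r} needs {reg} from {program[i].text!r}")
```

Here `sub` is flagged because it reads `$1` right after `add` writes it, and `or` is flagged for `$4`. The `or` instruction's use of `$1` is three instructions after the `add`, which under the stated assumption is far enough apart to be safe.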

Summary

- All modern day processors use pipelining.
- Pipelining doesn’t help latency of single task, it helps throughput of entire workload.
- Potential speedup: a CPI of 1 and faster Clock Cycle.
- Pipeline rate limited by slowest pipeline stage.
  - Unbalanced pipe stages make for inefficiencies.
  - The time to “fill” pipeline and time to “drain” it can impact speedup for deep pipelines and short code runs.
- Must detect and resolve hazards.
  - Stalling negatively affects CPI.
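The effect of fill time and stalls on CPI can be put into numbers. This is a rough sketch, assuming a five-stage pipeline and that every stall delays completion by exactly one cycle; the function names and the sample counts are illustrative, not measurements.

```python
def total_cycles(instructions, stall_cycles=0, stages=5):
    """Cycles to run a program: (stages - 1) cycles to fill the pipeline,
    then one cycle per instruction, plus one cycle per stall."""
    return (stages - 1) + instructions + stall_cycles


def measured_cpi(instructions, stall_cycles=0, stages=5):
    """Average cycles per instruction over the whole run."""
    return total_cycles(instructions, stall_cycles, stages) / instructions


print(measured_cpi(10))                        # short run: fill time dominates
print(measured_cpi(1_000_000))                 # long run: close to the ideal CPI of 1
print(measured_cpi(10, stages=20))             # deeper pipeline, same short run
print(measured_cpi(1_000_000, stall_cycles=200_000))  # stalls raise CPI
```

A ten-instruction run on five stages averages 1.4 cycles per instruction, and the same run on a hypothetical 20-stage pipeline averages 2.9. On a long run the fill time is negligible, so stalls are what push CPI above 1.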
