chapter4 - pipelining › ~curt.nelson › cptr380 › lecture...pipelining single-cycle...
TRANSCRIPT
Cptr350 Chapter 4 — The Processor - Datapath 1
COMPUTERORGANIZATION AND DESIGNThe Hardware/Software Interface
5thEdition
Chapter 4The ProcessorPipelining
Single-Cycle Disadvantages & Advantagesn Uses the clock cycle inefficiently – the clock cycle
must be timed to accommodate the slowestinstruction.n This would be especially problematic for more complex
instructions like floating point multiply.
n May be wasteful of area. Some functional units (e.g., adders, memory) must be duplicated since they can not be shared during a clock cycle.
n However, the single-cycle implementation is simple and easy to understand.
Clk
lw sw Waste
Cycle 1 Cycle 2
Cptr350 Chapter 4 — The Processor - Datapath 2
The Single-Cycle Datapath
ReadAddress
Instr[31-0]
InstructionMemory
Add
PC
4
Write Data
Read Addr 1
Read Addr 2
Write Addr
Register
File
ReadData 1
ReadData 2
ALU
ovf
zero
RegWrite
DataMemory
Address
Write Data
Read Data
MemWrite
MemRead
SignExtend16 32
MemtoReg
ALUSrc
Shiftleft 2
Add
PCSrc
RegDst
ALUcontrol
1
1
1
00
0
0
1
ALUOp
Instr[5-0]
Instr[15-0]
Instr[25-21]
Instr[20-16]
Instr[15 -11]
ControlUnit
Instr[31-26]
Branch
Shiftleft 2
0
1
Jump
32Instr[25-0]
26PC+4[31-28]
28
The Five Stages of the Load Instruction
n IFetch: Instruction Fetch and Update PCn Dec: Instruction Decode and Register fetchn Exec: Execute R-type; calculate memory addressn Mem: Read/write the data from/to the Data Memoryn WB: Write the result into the register file
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
IFetch Dec Exec Mem WBlw
Cptr350 Chapter 4 — The Processor - Datapath 3
A Pipelined MIPS Processorn Start the next instruction before the current one has completed
n Improves throughput - total amount of work done in a given time.n Instruction latency (time from the start of an instruction to its
completion) is not reduced. It is often increased due to imbalances between stages.
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
IFetch Dec Exec Mem WBlw
Cycle 7Cycle 6 Cycle 8
sw IFetch Dec Exec Mem
R-type IFetch Dec Exec WB
§ Clock cycle (pipeline stage time) is limited by the slowest stage.§ For some stages, we don’t need the whole clock cycle (e.g., WB).§ For some instructions, some stages are wasted cycles (i.e., nothing is
done during that cycle for that instruction).
Pipeline PerformanceSingle-cycle (cycle time = 800ps)
Pipelined (cycle time = 200ps)
Cptr350 Chapter 4 — The Processor - Datapath 4
Pipelining the MIPS ISA
n What makes it easyn All instructions are the same length (32 bits)
n Can fetch in the 1st stage and decode in the 2nd stage.n Few instruction formats (three) with symmetry across
formatsn Can begin reading register file in 2nd stage.
n Memory operations occur only in loads and storesn Can use the execute stage to calculate memory addresses.
n Each instruction writes at most one result (i.e., changes the machine state) and does it in the last pipeline stages (MEM or WB).
n Operands must be aligned in memory so a single data transfer takes only one data memory access.
Graphically Representing MIPS Pipeline
n Can help with answering questions like:n How many cycles does it take to execute this code?n What is the ALU doing during cycle 4?n Is there a hazard (whatever that is), why does it occur,
and how can it be fixed?
ALUIM Reg DM Reg
Cptr350 Chapter 4 — The Processor - Datapath 5
Five Instruction Sequence
Once the pipeline is
full, one instruction is
completed every cycle,
so CPI = 1
Instr.
Order
Time (clock cycles)
Inst 0
Inst 1
Inst 2
Inst 4
Inst 3
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
Time to fill the pipeline
Other Pipeline Structures Are Possiblen What about a slow multiply instruction?
n Make the clock twice as slow or …n Let it take two cycles (since it doesn’t use the DM
stage).
ALUIM Reg DM Reg
MUL
ALUIM Reg DM1 RegDM2
n What if the data memory access is twice as slow as the instruction memory?n Make the clock twice as slow or …n Let data memory access take two cycles (and keep the
same clock rate).
Cptr350 Chapter 4 — The Processor - Datapath 6
Other Pipeline Architecturesn ARM7
ALUIM1 IM2 DM1 Reg
DM2
IM Reg EX
PC updateIM access
decodereg
access
ALU opDM accessshift/rotate
commit result(write back)
Reg SHFT
PC updateBTB access
start IM access
IM access
decodereg 1 access
shift/rotatereg 2 access
ALU op
start DM accessexception
DM writereg write
n Motorola XScale
Latches and Clocks in Single-Cycle Design
PC InstrMem
RegFile ALU Data
MemoryAddr
n The entire instruction executes in a single cycle.n Green blocks are latches.n At the rising edge, a new PC is latched.n At the rising edge, the result of the previous cycle is latched.n At the falling edge, the address of LW/SW is latched so we can
access the data memory in the 2nd half of the cycle.
Cptr350 Chapter 4 — The Processor - Datapath 7
Multi-Stage Circuit
PCInstrMem
ALU DataMemory
L2RegFile
L3 L4
RegFile
L5
n Instead of executing the entire instruction in a single cycle (a single stage), let’s break up the execution into multiple stages, each separated by a latch.
Pipeline Registers
n Need registers between stages to hold information produced in previous cycle.
Cptr350 Chapter 4 — The Processor - Datapath 8
The Assembly Line
A
Start and finish a job before moving on.
Time
Jobs
Break the job into smaller stages.B CA B C
A B CA B C
Unpipelined
Pipelined
Instruction Fetch
Cptr350 Chapter 4 — The Processor - Datapath 9
Instruction Decode and Register Fetch
Execute
Cptr350 Chapter 4 — The Processor - Datapath 10
Memory Access (Load or Store)
Write-back
Wrongregisternumber
Cptr350 Chapter 4 — The Processor - Datapath 11
Corrected Datapath for Load Instruction
Execute Stage for Store Instruction
Cptr350 Chapter 4 — The Processor - Datapath 12
Memory Access Stage for Store
Write-back Stage for Store Instruction
Cptr350 Chapter 4 — The Processor - Datapath 13
Performance Improvements?
n Compared to single-cycle executionn Does it take longer to finish each individual instruction?n Does it take longer to finish a series of instructions?n Is a 10-stage pipeline better than a 5-stage pipeline?
Quantitative Effects
n As a result of pipelining:n Time in ns/instruction increases.n Each instruction takes more cycles to execute.n But… average CPI remains roughly the same.n Clock speed goes up.n Total execution time goes down, resulting in lower average
time per instruction.n Under ideal conditions,
n Speedup = ratio of elapsed times between successive instruction completions = number of pipeline stages = increase in clock speed
Cptr350 Chapter 4 — The Processor - Datapath 14
Can Pipelining Get Us Into Trouble?n Yes: Pipeline Hazards
n Structural hazards: attempt to use the same resource by two different instructions at the same time.
n Data hazards: attempt to use data before it is readyn An instruction’s source operand(s) are produced by a
prior instruction still in the pipeline.n Control hazards: attempt to make a decision about
program control flow before the condition has been evaluated and the new PC target address calculated
n Branch and jump instructions, exceptions.
n Hazards can usually be resolved by waiting.n Pipeline control must detect the hazard and take
action to resolve it.
Summary
n All modern day processors use pipelining.n Pipelining doesn’t help latency of single task, it helps throughput of entire workload.
n Potential speedup: a CPI of 1 and faster Clock Cycle.n Pipeline rate limited by slowest pipeline stage
n Unbalanced pipe stages make for inefficiencies.n The time to “fill” pipeline and time to “drain” it can
impact speedup for deep pipelines and short code runs.
n Must detect and resolve hazardsn Stalling negatively affects CPI.