8/2/2019 CA-4 Multi Cycle Processor
BK TP.HCM
2010
dce
COMPUTER ARCHITECTURE
CE2010
Faculty of Computer Science and Engineering
Department of Computer Engineering
Dinh Duc Anh Vu
http://www.cse.hcmut.edu.vn/~anhvu
Computer Architecture, Chapter 4, 2010, Dr. Dinh Duc Anh Vu
Chapter 4
The Processor
What is Pipelining?
A way of speeding up execution of instructions
Key idea: overlap execution of multiple instructions
Analogy: doing your laundry
1. Run load through washer
2. Run load through dryer
3. Fold clothes
4. Put away clothes
5. Go to 1
Observation: we can start another load as soon as we finish step 1!
The Laundry Analogy
Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 30 minutes
Folder takes 30 minutes
Stasher takes 30 minutes to put clothes into drawers
If we do laundry sequentially...
Time Required: 8 hours for 4 loads
[Figure: task order (A, B, C, D) against time, 6 PM to 2 AM; each load runs its four 30-minute steps before the next load starts]
To Pipeline, We Overlap Tasks
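Time Required: 3.5 hours for 4 loads
Latency remains 2 hours per load
Throughput improves by a factor of 2.3 for 4 loads (the factor grows toward 4 as more loads are added)
[Figure: task order (A, B, C, D) against time, 6 PM to 2 AM; the loads overlap, each starting 30 minutes after the previous one]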
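The two laundry schedules can be checked with a short calculation. This is a minimal sketch, assuming four equal 30-minute steps as in the analogy; the function names are illustrative only.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load finishes every step before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """A new load enters the washer every step; the last load still needs all steps."""
    return (stages + loads - 1) * stage_minutes


if __name__ == "__main__":
    for loads in (4, 40, 400):
        seq = sequential_minutes(loads, 4, 30)
        pipe = pipelined_minutes(loads, 4, 30)
        print(f"{loads} loads: {seq / 60:.1f} h sequential, "
              f"{pipe / 60:.1f} h pipelined, speedup {seq / pipe:.2f}")
```

For 4 loads this gives 480 minutes (8 hours) against 210 minutes (3.5 hours).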
Pipelining a Digital System
Key idea: break big computation up into pieces
Separate each piece with a pipeline register
[Figure: a 1 ns block of logic split into five 200 ps pieces, each followed by a pipeline register]
Pipelining a Digital System
Why do this?
Because it's faster for repeated computations
Non-pipelined (one 1 ns block): 1 operation finishes every 1 ns
Pipelined (five 200 ps stages): 1 operation finishes every 200 ps
Comments about pipelining
Pipelining increases throughput, but not latency
Answer available every 200ps, BUT
A single computation still takes 1ns
Limitations:
Computations must be divisible into stage size
Pipeline registers add overhead
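The throughput and latency figures, and the effect of register overhead, can be sketched numerically. This assumes the logic divides evenly into stages; the 20 ps register overhead used in the example is a made-up value for illustration, not a number from the slides.

```python
def pipeline_timing(total_logic_ps: float, stages: int,
                    register_overhead_ps: float = 0.0):
    """Return (clock_period_ps, latency_ps) for logic split evenly into stages."""
    clock = total_logic_ps / stages + register_overhead_ps
    latency = clock * stages
    return clock, latency


def time_for_n_ops(n: int, total_logic_ps: float, stages: int,
                   register_overhead_ps: float = 0.0) -> float:
    """First result appears after one full latency, then one result per clock."""
    clock, latency = pipeline_timing(total_logic_ps, stages, register_overhead_ps)
    return latency + (n - 1) * clock


if __name__ == "__main__":
    print(pipeline_timing(1000, 5))        # ideal split of the 1 ns block
    print(pipeline_timing(1000, 5, 20))    # hypothetical 20 ps per register
    print(time_for_n_ops(1000, 1000, 5))   # 1000 repeated computations
```

With any non-zero overhead, the latency of a single computation is slightly worse than in the unpipelined circuit, even though results arrive far more often.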
The Five Steps of the Load Instruction
[Figure: lw passing through IFetch, Dec, Exec, Mem, WB in Cycles 1 to 5]
IFetch: Instruction Fetch and Update PC
Dec: Instruction Decode, Register Read, Sign Extend Offset
Exec: Execute R-type; Calculate Memory Address; Branch Comparison; Branch and Jump Completion
Mem: Memory Read; Memory Write Completion; R-type Completion (RegFile write)
WB: Memory Read Completion (RegFile write)
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
Review: Single-Cycle Processor
IF: Instruction Fetch
ID: Instruction Decode
EX: Execute / Address Calc.
MEM: Memory Access
WB: Write Back
All 5 steps done in a single clock cycle
Dedicated hardware required for each step
Pipelining - Key Idea
Question:
What happens if we break execution into multiple cycles, but keep the extra hardware?
Answer: In the best case, we can start executing a new instruction on each clock cycle; this is pipelining
Pipelining stages:
IF - Instruction Fetch
ID - Instruction Decode
EX - Execute / Address Calculation
MEM - Memory Access (read / write)
WB - Write Back (results into register file)
Basic Pipelined Processor
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between the stages]
Comments about Pipelining
The good news
Multiple instructions are being processed at the same time
This works because stages are isolated by registers
Best case speedup of N (the number of pipeline stages)
The bad news
Instructions interfere with each other: hazards
Example: different instructions may need the same piece of hardware (e.g., memory) in the same clock cycle
Example: an instruction may require a result produced by an earlier instruction that is not yet complete
Worst case: must suspend execution (stall)
Multicycle Datapath Approach
Let an instruction take more than 1 clock cycle to complete
Break up instructions into steps where each step takes a cycle, while trying to
balance the amount of work to be done in each step
restrict each cycle to use only one major functional unit
Not every instruction takes the same number of clock cycles
In addition to faster clock rates, multicycle allows functional units to be used more than once per instruction as long as they are used on different clock cycles; as a result
only need one memory, but only one memory access per cycle
need only one ALU/adder, but only one ALU operation per cycle
Our Multicycle Datapath Approach
At the end of a cycle
Store values needed in a later cycle by the current instruction in an internal register (not visible to the programmer). All (except IR) hold data only between a pair of adjacent clock cycles (no write control signal needed)
IR: Instruction Register
MDR: Memory Data Register
A, B: regfile read data registers
ALUout: ALU output register
Data used by subsequent instructions are stored in programmer-visible registers (i.e., register file, PC, or memory)
[Figure: multicycle datapath with PC, a single Memory (instructions or data), IR, MDR, Register File, A, B, ALU, and ALUout]
Clocking the Multicycle Datapath
[Figure: the multicycle datapath (PC, Memory, IR, MDR, Register File, A, B, ALU, ALUout) with the System Clock, the MemWrite and RegWrite signals, and one clock cycle marked]
The Multicycle Datapath with Control Signals
[Figure: the multicycle datapath with Sign Extend, two Shift left 2 units, ALU control, and a Control unit driven by Instr[31-26]. Control signals: PCWriteCond, PCWrite, IorD, MemRead, MemWrite, MemtoReg, IRWrite, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst. Other labels: zero, Instr[5-0], Instr[15-0], Instr[25-0], PC[31-28]]
Multicycle Control Unit
Multicycle datapath control signals are not determined solely by the bits in the instruction
e.g., op code bits tell what operation the ALU should be doing, but not what instruction cycle is to be done next
Must use a finite state machine (FSM) for control
a set of states (current state stored in State Register)
next state function (determined by current state and the input)
output function (determined by current state and the input)
[Figure: combinational control logic takes the Inst Opcode and the State Reg output as inputs and produces the datapath control points and the Next State]
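A next-state function of this kind can be sketched in a few lines. The state names below are assumptions chosen for illustration (the slides do not name the states); the transitions follow the five steps listed earlier, where memory writes and R-type results complete in the Mem step and branches and jumps complete in the Exec step.

```python
# Hypothetical state names; only the next-state function is modeled,
# not the output function that drives the datapath control points.
FETCH, DECODE = "IFetch", "Dec"
MEM_ADDR, MEM_READ, LW_WRITEBACK, MEM_WRITE = "MemAddr", "MemRead", "LwWB", "MemWrite"
R_EXEC, R_COMPLETE, BRANCH, JUMP = "RExec", "RComplete", "Branch", "Jump"


def next_state(state: str, op: str) -> str:
    """Next state depends on the current state and the opcode class."""
    if state == FETCH:
        return DECODE
    if state == DECODE:
        return {"lw": MEM_ADDR, "sw": MEM_ADDR, "R": R_EXEC,
                "beq": BRANCH, "j": JUMP}[op]
    if state == MEM_ADDR:
        return MEM_READ if op == "lw" else MEM_WRITE
    if state == MEM_READ:
        return LW_WRITEBACK
    if state == R_EXEC:
        return R_COMPLETE
    return FETCH  # every completion state returns to instruction fetch


def cycles(op: str) -> int:
    """Count clock cycles from fetch until the FSM returns to fetch."""
    state, count = FETCH, 1
    while True:
        state = next_state(state, op)
        if state == FETCH:
            return count
        count += 1


if __name__ == "__main__":
    for op in ("lw", "sw", "R", "beq", "j"):
        print(op, cycles(op))
```

The decode state is the only one where the opcode selects among several successors, which is why the opcode alone cannot determine the control signals.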
Multicycle Advantages and Disadvantages
Uses the clock cycle efficiently: the clock cycle is timed to accommodate the slowest instruction step
Multicycle implementations allow functional units to be used more than once per instruction as long as they are used on different clock cycles
but
Requires additional internal state registers, more muxes, and more complicated (FSM) control
[Figure: Clk, Cycles 1 to 10: lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; R-type starts with IFetch in Cycle 10]
Single Cycle vs. Multiple Cycle Timing
Single Cycle Implementation:
[Figure: Clk, Cycle 1: lw; Cycle 2: sw, with the unused part of the cycle marked Waste]
Multiple Cycle Implementation:
[Figure: Clk, Cycles 1 to 10: lw takes IFetch, Dec, Exec, Mem, WB; sw takes IFetch, Dec, Exec, Mem; R-type starts with IFetch in Cycle 10]
The multicycle clock period is longer than 1/5th of the single cycle clock period due to state register overhead
How Can We Make It Faster?
Start fetching and executing the next instruction before the current one has completed
Pipelining: (all?) modern processors are pipelined for performance
Remember the performance equation: CPU time = CPI * CC * IC
Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
A five stage pipeline is nearly five times faster because the CC is nearly five times faster
Fetch (and execute) more than one instruction at a time
Superscalar processing: stay tuned
A Pipelined MIPS Processor
Start the next instruction before the current one has completed
improves throughput: total amount of work done in a given time
instruction latency (execution time, delay time, response time: time from the start of an instruction to its completion) is not reduced
clock cycle (pipeline stage time) is limited by the slowest stage
some stages don't need the whole clock cycle (e.g., WB)
for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)
[Figure: Cycles 1 to 8: lw, sw, and R-type each pass through IFetch, Dec, Exec, Mem, WB, each starting one cycle after the previous instruction]
Single Cycle vs. Pipeline
To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
How long does each take to complete 1,000,000 adds?
Single Cycle Implementation (CC = 800 ps):
[Figure: Clk, Cycle 1: lw; Cycle 2: sw, with the unused part of the cycle marked Waste]
Pipeline Implementation (CC = 200 ps):
[Figure: lw, sw, and R-type each pass through IFetch, Dec, Exec, Mem, WB, overlapped one cycle apart; a 400 ps interval is marked]
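Both questions can be worked through numerically. A minimal sketch, assuming an ideal pipeline with no stalls and the two clock periods given above; the function names are illustrative.

```python
def single_cycle_time_ps(n_instructions: int, cc_ps: int = 800) -> int:
    """Every instruction occupies one full (long) clock cycle."""
    return n_instructions * cc_ps


def pipelined_time_ps(n_instructions: int, cc_ps: int = 200, stages: int = 5) -> int:
    """The first instruction needs all stages; each later one finishes a cycle after."""
    return (stages + n_instructions - 1) * cc_ps


if __name__ == "__main__":
    n = 1_000_000
    single = single_cycle_time_ps(n)
    piped = pipelined_time_ps(n)
    print(f"one instruction, pipelined: {pipelined_time_ps(1)} ps")
    print(f"single cycle: {single} ps, pipelined: {piped} ps, "
          f"speedup {single / piped:.3f}")
```

One pipelined instruction takes 5 stages of 200 ps, i.e. 1000 ps, because every stage is stretched to the length of the slowest one. Over a million adds the speedup approaches 800/200 = 4 rather than 5, since the stages are not balanced.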
Pipeline Performance
Assume time for stages is
100ps for register read or write
200ps for other stages
Compare pipelined datapath with single-cycle datapath

Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time
lw       | 200ps       | 100ps         | 200ps  | 200ps         | 100ps          | 800ps
sw       | 200ps       | 100ps         | 200ps  | 200ps         |                | 700ps
R-format | 200ps       | 100ps         | 200ps  |               | 100ps          | 600ps
beq      | 200ps       | 100ps         | 200ps  |               |                | 500ps
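The two clock periods follow directly from the table: the single-cycle clock must fit the slowest instruction, while the pipelined clock must fit the slowest stage. A short sketch of that calculation, using the table values (the dictionary layout and key names are illustrative):

```python
# Stage times in picoseconds, copied from the table; unused stages are omitted.
STAGE_PS = {
    "lw":       {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100},
    "sw":       {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200},
    "R-format": {"fetch": 200, "reg_read": 100, "alu": 200, "reg_write": 100},
    "beq":      {"fetch": 200, "reg_read": 100, "alu": 200},
}


def total_time_ps(instr: str) -> int:
    """Sum of the stage times an instruction actually uses."""
    return sum(STAGE_PS[instr].values())


def single_cycle_clock_ps() -> int:
    """The clock must be long enough for the slowest whole instruction."""
    return max(total_time_ps(instr) for instr in STAGE_PS)


def pipelined_clock_ps() -> int:
    """The clock must be long enough for the slowest single stage."""
    return max(t for stages in STAGE_PS.values() for t in stages.values())


if __name__ == "__main__":
    for instr in STAGE_PS:
        print(instr, total_time_ps(instr))
    print("single-cycle Tc:", single_cycle_clock_ps())
    print("pipelined Tc:", pipelined_clock_ps())
```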
Pipeline Performance
Single-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)
Pipeline Speedup
If all stages are balanced
i.e., all take the same time
$\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
If not balanced, speedup is less
Speedup due to increased throughput
Latency (time for each instruction) does not decrease
Pipelining the MIPS ISA
What makes it easy
all instructions are the same length (32 bits)
can fetch in the 1st stage and decode in the 2nd stage
few instruction formats (three) with symmetry across formats
can begin reading register file in 2nd stage
memory operations occur only in loads and stores
can use the execute stage to calculate memory addresses
each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
operands must be aligned in memory so a single data transfer takes only one data memory access
MIPS Pipeline Datapath Additions/Mods
State registers between each pipeline stage to isolate them
[Figure: pipelined datapath on the System Clock. IF: IFetch (PC, Add 4, Instruction Memory); ID: Dec (Register File, Sign Extend 16 to 32); EX: Execute (Shift left 2, Add, ALU); MEM: MemAccess (Data Memory); WB: WriteBack. State registers: IF/ID, ID/EX, EX/MEM, MEM/WB]
MIPS Pipeline Control Path Modifications
All control signals can be determined during Decode and held in the state registers between pipeline stages
[Figure: the pipelined datapath with a Control unit and ALU cntrl added. Control signals: RegWrite, MemRead, MemtoReg, RegDst, ALUOp, ALUSrc, Branch, PCSrc]
Pipeline Control
IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
ID Stage: no optional control signals to set

      | EX Stage                        | MEM Stage                | WB Stage
Instr | RegDst  ALUOp1  ALUOp0  ALUSrc  | Brch  MemRead  MemWrite  | RegWrite  MemtoReg
R     | 1       1       0       0       | 0     0        0         | 1         0
lw    | 0       0       0       1       | 0     1        0         | 1         1
sw    | X       0       0       1       | 0     0        1         | 0         X
beq   | X       0       1       0       | 1     0        0         | 0         X
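The control table can be written as a lookup that Decode performs once, after which each state register carries only the signals that later stages still need. A minimal sketch; the helper names and the use of `None` for the don't-care entries (X) are choices made here, not part of the slides.

```python
X = None  # don't-care, shown as X in the table

SIGNAL_NAMES = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc",   # used in EX
                "Branch", "MemRead", "MemWrite",          # used in MEM
                "RegWrite", "MemtoReg")                   # used in WB

CONTROL_TABLE = {
    "R":   (1, 1, 0, 0, 0, 0, 0, 1, 0),
    "lw":  (0, 0, 0, 1, 0, 1, 0, 1, 1),
    "sw":  (X, 0, 0, 1, 0, 0, 1, 0, X),
    "beq": (X, 0, 1, 0, 1, 0, 0, 0, X),
}


def decode(instr_class: str) -> dict:
    """All nine control signals are determined during Decode."""
    return dict(zip(SIGNAL_NAMES, CONTROL_TABLE[instr_class]))


def carried_in(register: str, signals: dict) -> dict:
    """Signals a given state register still has to hold for later stages."""
    first = {"ID/EX": 0, "EX/MEM": 4, "MEM/WB": 7}[register]
    return {name: signals[name] for name in SIGNAL_NAMES[first:]}


if __name__ == "__main__":
    lw = decode("lw")
    for reg in ("ID/EX", "EX/MEM", "MEM/WB"):
        print(reg, carried_in(reg, lw))
```

ID/EX holds all nine signals, EX/MEM the five for MEM and WB, and MEM/WB only the two for WB.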
Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the five stages IF, ID, EX, MEM, WB]
Can help with answering questions like:
How many cycles does it take to execute this code?
What is the ALU doing during cycle 4?
Is there a hazard, why does it occur, and how can it be fixed?
Why Pipeline? For Performance!
[Figure: instruction order (Inst 0 to Inst 4) against time in clock cycles; each instruction passes through IF, ID, EX, MEM, WB, starting one cycle after the previous one; the first cycles are marked as the time to fill the pipeline]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
Can Pipelining Get Us Into Trouble?
Yes: Pipeline Hazards
Situations that prevent starting the next instruction in the next cycle
structural hazards: attempt to use the same resource by two different instructions at the same time
data hazards: attempt to use data before it is ready
An instruction's source operand(s) are produced by a prior instruction still in the pipeline
control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
branch and jump instructions, exceptions
Can usually resolve hazards by waiting
pipeline control must detect the hazard and take action to resolve hazards
Structural hazard: Conflict for use of a resource
[Figure: instruction order (lw, Inst 1, Inst 2, Inst 3, Inst 4) against time in clock cycles, each passing through IF, ID, EX, MEM, WB; in the same cycle lw is reading data from memory while a later instruction is reading an instruction from memory]
Instruction fetch would have to stall for that cycle
Fix with separate instr and data memories (I$ and D$)
How About Register File Access?
[Figure: instruction order (add $1, ; Inst 1; Inst 2; add $2,$1, ) against time in clock cycles, each passing through IF, ID, EX, MEM, WB; one clock edge controls register writing and the other controls loading of the pipeline state registers]
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
Note: accessing the register file takes time. For this to work as shown, the clock period must be more than twice the time a register access needs.
Register Usage Can Cause Data Hazards
Dependencies backward in time cause hazards
Read before write data hazard
add $1,
sub $4,$1,$5
and $6,$1,$7
xor $4,$1,$5
or $8,$1,$9
[Figure: instruction order against time; each instruction passes through IF, ID, EX, MEM, WB, starting one cycle after the previous one]
One Way to Fix a Data Hazard
add $1,
stall
stall
sub $4,$1,$5
and $6,$1,$7
[Figure: instruction order against time; add $1, is followed by two stall cycles before sub and and proceed through IF, ID, EX, MEM, WB]
Can fix data hazard by waiting (stall), but impacts CPI
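The number of stall cycles can be estimated with a small scheduling model. This is a sketch under stated assumptions: a five-stage pipeline with no forwarding, a register file that writes in the first half of a cycle and reads in the second half (so a value written in WB can be read by an ID in the same cycle), and in-order issue. The source operands of the first add are elided on the slide, so none are modeled for it.

```python
def schedule_without_forwarding(program):
    """program: list of (dest, sources). Returns (ID cycle per instruction, total stalls)."""
    id_cycles = []
    ready = {}   # register name -> first cycle in which an ID stage may read it
    stalls = 0
    for i, (dest, sources) in enumerate(program):
        # Without hazards, ID follows the previous instruction's ID by one cycle.
        earliest = 2 if i == 0 else id_cycles[-1] + 1
        needed = max((ready.get(reg, 0) for reg in sources), default=0)
        cycle = max(earliest, needed)
        stalls += cycle - earliest
        id_cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle + 3   # WB happens three cycles after ID
    return id_cycles, stalls


if __name__ == "__main__":
    program = [
        ("$1", ()),              # add $1, ...  (operands elided on the slide)
        ("$4", ("$1", "$5")),    # sub $4,$1,$5
        ("$6", ("$1", "$7")),    # and $6,$1,$7
    ]
    print(schedule_without_forwarding(program))
```

For this sequence the model reports two stall cycles before sub, and none for the following and.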
Forwarding (aka Bypassing)
forwarding results as soon as they are available to where they are needed
Use result when it is computed
Don't wait for it to be stored in a register
Requires extra connections in the datapath
add $s0,$t0,$t1
sub $t2,$s0,$t3
[Figure: add and sub each pass through IF, ID, EX, MEM, WB; the add result is forwarded to sub]
Note: so that we do not have to wait as with load word, the result is passed straight from EX to the following instruction (the forwarding technique)
- needs extra dedicated connections
- only works in some cases
Loads Can Cause Data Hazards
Dependencies backward in time cause hazards
Load-use data hazard
lw $1,4($2)
sub $4,$1,$5
and $6,$1,$7
xor $4,$1,$5
or $8,$1,$9
[Figure: instruction order against time; each instruction passes through IF, ID, EX, MEM, WB, starting one cycle after the previous one]
Load Use Data Hazard
Can't always avoid stalls by forwarding
If value not computed when needed
Can't forward backward in time!
lw $s0, 20($t1)
sub $t2,$s0,$t3
[Figure: lw and sub each pass through IF, ID, EX, MEM, WB]
Note: this is a hazard caused by a data access. Use stalling together with forwarding, and combine several different techniques to optimize the program.
Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the next instruction
C code for A = B + E; C = B + F;

Original order (13 cycles):
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2    # stall: $t2 is loaded by the previous instruction
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4    # stall: $t4 is loaded by the previous instruction
sw $t5, 16($t0)

Reordered (11 cycles):
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)

Note: to avoid waiting, we reorder the instructions of the program. When a load and the use of its result are 1 cycle apart, the pipeline stalls (waits 1 cycle); when they are 2 cycles apart, no waiting is needed.
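The two cycle counts can be reproduced with a simple load-use check. A sketch, assuming a five-stage pipeline with forwarding, where the only stall is one cycle whenever an instruction reads a register loaded by the lw immediately before it; the tuple encoding of instructions is illustrative.

```python
def count_cycles(program, stages: int = 5) -> int:
    """program: list of (op, dest, sources). One stall per adjacent load-use pair."""
    stalls = 0
    for (prev_op, prev_dest, _), (_, _, cur_sources) in zip(program, program[1:]):
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    # n instructions need n + stages - 1 cycles to drain, plus any stalls.
    return len(program) + stages - 1 + stalls


LW_T1 = ("lw", "$t1", ("$t0",))
LW_T2 = ("lw", "$t2", ("$t0",))
LW_T4 = ("lw", "$t4", ("$t0",))
ADD_T3 = ("add", "$t3", ("$t1", "$t2"))
ADD_T5 = ("add", "$t5", ("$t1", "$t4"))
SW_T3 = ("sw", None, ("$t3", "$t0"))
SW_T5 = ("sw", None, ("$t5", "$t0"))

ORIGINAL = [LW_T1, LW_T2, ADD_T3, SW_T3, LW_T4, ADD_T5, SW_T5]
REORDERED = [LW_T1, LW_T2, LW_T4, ADD_T3, SW_T3, ADD_T5, SW_T5]

if __name__ == "__main__":
    print("original:", count_cycles(ORIGINAL))
    print("reordered:", count_cycles(REORDERED))
```

Moving the third lw up puts an independent instruction between each load and the first use of its result, which removes both stalls.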
Branch Instructions Cause Control Hazards
Dependencies backward in time cause hazards
[Figure: instruction order (beq, lw, Inst 3, Inst 4) against time, each passing through IF, ID, EX, MEM, WB]
Stall on Branch
Wait until branch outcome determined before fetching next instruction
beq $s1, $s2, L
sub $t2,$t1,$t3
[Figure: beq and sub each pass through IF, ID, EX, MEM, WB]
Note: the simplest approach is to wait
Branch Prediction
Longer pipelines can't readily determine branch outcome early
Stall penalty becomes unacceptable
Predict outcome of branch
Only stall if prediction is wrong
In MIPS pipeline
Can predict branches not taken
Fetch instruction after branch, with no delay
Notes:
- the commonly used technique is prediction
- at worst it is as slow as the stall case
- if the prediction is correct, no waiting is needed
- MIPS predicts that the branch condition does not hold, i.e., it fetches the instruction right after beq
- this is static prediction (not dynamic prediction)
- if the prediction is correct, effectively no cycle is lost
MIPS with Predict Not Taken
Prediction correct:
beq $s1, $s2, L
sub $t2,$t1,$t3
[Figure: beq and sub each pass through IF, ID, EX, MEM, WB]
Prediction incorrect:
beq $s1, $s2, L
add $t2,$t1,$t3
[Figure: beq and add each pass through IF, ID, EX, MEM, WB]
More-Realistic Branch Prediction
Static branch prediction
Based on typical branch behavior
Example: loop and if-statement branches
Predict backward branches taken
Predict forward branches not taken
Note (static prediction): does not rely on earlier information, i.e., it always picks one of the two: taken or not taken
Dynamic branch prediction
Hardware measures actual branch behavior
e.g., record recent history of each branch
Assume future behavior will continue the trend
When wrong, stall while re-fetching, and update history
Note (dynamic prediction): relies on earlier predictions. One strategy uses 2 bits of history: after 2 wrong predictions, the prediction is changed
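The 2-bit history idea can be sketched as a saturating counter. This is one common way to realize such a predictor, shown as an illustration rather than as the scheme of any particular processor; the class and function names are made up for this example.

```python
class TwoBitPredictor:
    """Saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state: int = 0):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions(outcomes, predictor=None) -> int:
    """Count wrong predictions over a sequence of branch outcomes (True = taken)."""
    if predictor is None:
        predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    # A loop branch: taken 9 times, then not taken on exit, run 3 times.
    loop_branch = ([True] * 9 + [False]) * 3
    print(mispredictions(loop_branch))
```

From a saturated state, a single loop exit does not flip the prediction, so a loop that is entered again costs only one misprediction per run.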
Other Pipeline Structures Are Possible
What about the (slow) multiply operation?
Make the clock twice as slow or
let it take two cycles (since it doesn't use the DM stage)
[Figure: IF, ID, MUL, MEM, WB]
What if the data memory access is twice as slow as the instruction memory?
make the clock twice as slow or
let data memory access take two cycles (and keep the same clock rate)
[Figure: IF, ID, EX, MEM1, MEM2, WB]
Notes:
- multiply executes more slowly than add, so the multiply instruction takes longer
- we choose between stretching the clock and letting multiply take several cycles
- let the instruction use several cycles
- split the memory access into 2 consecutive cycles
- load instructions are usually very slow compared with other instructions
Other Sample Pipeline Alternatives
ARM7: IM | Reg | EX
IM: PC update, IM access
Reg: decode, reg access
EX: ALU op, DM access, shift/rotate, commit result (write back)
XScale: IM1 | IM2 | Reg | SHFT | ALU | DM1 | DM2
IM1: PC update, BTB access, start IM access
IM2: IM access
Reg: decode, reg 1 access
SHFT: shift/rotate, reg 2 access
ALU: ALU op
DM1: start DM access, exception
DM2: DM write, reg write
Note: following the Reg structure: simple instruction set, simple structure (design)
Summary
All modern day processors use pipelining
Pipelining doesn't help latency of a single task, it helps throughput of the entire workload
Potential speedup: a CPI of 1 and a fast CC
Pipeline rate limited by slowest pipeline stage
Unbalanced pipe stages make for inefficiencies
The time to fill pipeline and time to drain it can impact speedup for deep pipelines and short code runs
Must detect and resolve hazards
Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
Notes:
- instructions are fetched back to back, overlapping one another
- balance the stages: rd, sd, mem ...
- fill all the stages. If the pipeline is kept continuously full, efficiency is at its maximum