real-world pipelines: car washes idea divide process into independent stages move objects through...
TRANSCRIPT
Real-World Pipelines: Car Washes
Idea Divide process into independent
stages Move objects through stages in
sequence At any instant, multiple objects
being processed
Sequential Parallel
Pipelined
Computational Example
System Computation requires total of 300 picoseconds Additional 20 picoseconds to save result in register Clock cycle must be at least 320 ps
Combinationallogic
Reg
300 ps 20 ps
Clock
Delay = 320 psThroughput = 3.12 GOPS
Throughput = 1 operation
320 ps 1 ns
1000 ps 3.12 GOPSx
3-Stage Pipelined Version
System Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A.
Begin new operation every 120 ps Performance impact
Latency increases: 360 ps from start to finish (was 320) Speedup is less than 3x – why?
Reg
Clock
Comb.logic
A
Reg
Comb.logic
B
Reg
Comb.logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
Delay = 360 psThroughput = 8.33 GOPS
Pipelining Unpipelined
Cannot start new operation until previous one completes 3-way pipelined
Up to 3 operations in process simultaneously
Time
OP1
OP2
OP3
Time
A B C
A B C
A B C
OP1
OP2
OP3
Operating a Pipeline
Time
OP1
OP2
OP3
A B C
A B C
A B C
0 120 240 360 480 640
Clock
Reg
Clock
Comb.logic
A
Reg
Comb.logic
B
Reg
Comb.logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
239
Reg
Clock
Comb.logic
A
Reg
Comb.logic
B
Reg
Comb.logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
241
Reg
Reg
Reg
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
Comb.logic
A
Comb.logic
B
Comb.logic
C
Clock
300
Reg
Clock
Comb.logic
A
Reg
Comb.logic
B
Reg
Comb.logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
359
Real-World Pipelines: Laundry Assuming you’ve got:
One washer (takes 30 minutes)
One drier (takes 40 minutes)
One “folder” (takes 20 minutes)
It takes 90 minutes to wash, dry, and fold 1 load of laundry. How long does 4 loads take?
The slow way
If each load is done sequentially it takes 6 hours
30 40 20 30 40 20 30 40 20 30 40 20
6 PM 7 8 9 10 11 Midnight
Time
8
Laundry Pipelining Start each load as soon as possible: overlap loads
Pipelined laundry takes 3.5 hours
6 PM 7 8 9 10 11 Midnight
Time
30 40 40 40 40 20
Limitations: Nonuniform Delays
Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stages
Reg
Clock
Reg
Comb.logic
B
Reg
Comb.logic
C
50 ps 20 ps 150 ps 20 ps 100 ps 20 ps
Delay = 510 psThroughput = 5.88 GOPS
Comb.logic
A
Time
OP1
OP2
OP3
A B C
A B C
A B C
Limitations: Register Overhead
As pipeline deepens, overhead of loading registers becomes more significant
Percentage of clock cycle spent loading register: 1-stage pipeline: 6.25% 3-stage pipeline: 16.67% 6-stage pipeline: 28.57%
High speeds of modern processor designs obtained through deep pipelining: high overhead
Delay = 420 ps, Throughput = 14.29 GOPSClock
Reg
Comb.logic
50 ps 20 ps
Reg
Comb.logic
50 ps 20 ps
Reg
Comb.logic
50 ps 20 ps
Reg
Comb.logic
50 ps 20 ps
Reg
Comb.logic
50 ps 20 ps
Reg
Comb.logic
50 ps 20 ps
Data Dependencies
Each operation depends on result from preceding one Not a problem for SEQ implementation
Clock
Combinationallogic
Reg
Time
OP1
OP2
OP3
Limitations: Data Hazards
Result does not feed back around in time for next operation Unless special action is taken, results will be incorrect!
Reg
Clock
Comb.logic
A
Reg
Comb.logic
B
Reg
Comb.logic
C
Time
OP1
OP2
OP3
A B C
A B C
A B C
OP4 A B C
Review: Arithmetic and Logical Operations
Refer to generically as “OPl” Encodings differ only by
“function code” Low-order 4 bits in first
instruction word All set condition codes as side
effect
addl rA, rB 6 0 rA rB
subl rA, rB 6 1 rA rB
andl rA, rB 6 2 rA rB
xorl rA, rB 6 3 rA rB
Add
Subtract (rA from rB)
And
Exclusive-Or
Instruction Code Function Code
Review: Move Operations
Similar to the IA32 movl instruction Simpler format for memory addresses Separated into different instructions to simplify hardware
implementation
rrmovl rA, rB 2 0 rA rB Register --> Register
Immediate --> Registerirmovl V, rB 3 0 F rB V
Register --> Memoryrmmovl rA, D(rB) 4 0 rA rB D
Memory --> Registermrmovl D(rB), rA 5 0 rA rB D
Data Dependencies in Y86 Code
Result from one instruction used as operand for another Read-after-write (RAW) dependency
Very common in actual programs Must make sure our pipeline handles these properly
Essential: results must be correct Desirable: performance impact should be minimized
1 irmovl $50, %eax
2 addl %eax , %ebx
3 mrmovl 100( %ebx ), %edx
SEQ Hardware Stages occur in sequence One operation in process at a
time
Instructionmemory
Instructionmemory
PCincrement
PCincrement
CCCC ALUALU
Datamemory
Datamemory
NewPC
rB
dstE dstM
ALUA
ALUB
Addr
srcA srcB
read
ALUfun.
Fetch
Decode
Execute
Memory
Write back
data out
Registerfile
Registerfile
A B M
E
Cnd
dstE dstM srcA srcB
icode ifun rA
PC
valC valP
valBvalA
Data
valE
valM
PC
newPC
Stat
dmem_error
instr_valid
imem_error
stat
write
Mem.control
SEQ+ Hardware Still sequential implementation Reorder PC stage to put at
beginning PC stage
Task is to select PC for current instruction
Based on results computed by previous instruction
Processor state PC is no longer stored in
register PC is determined from other
stored information Instructionmemory
Instructionmemory
PCincrement
PCincrement
CCCC ALUALU
Datamemory
Datamemory
PC
rB
dstE dstM
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
Write back
data out
Registerfile
Registerfile
A BM
E
Cnd
dstE dstM srcA srcB
icode ifun rA
pCnd pValM pValC pValP pIcode
PC
valC valP
valBvalA
Data
valE
valM
PC
Stat
dmem_errorstat
instr_valid
imem_error
Adding Pipeline Registers
Instructionmemory
Instructionmemory
PC increment
PC increment
CCCCALUALU
Datamemory
Datamemory
Fetch
Decode
Execute
Memory
Write back
icode, ifunrA, rB
valC
Registerfile
Registerfile
A B
M
E
PC
valP
srcA, srcBdstE, dstM
valA, valB
aluA, aluB
Cnd
valE
Addr, Data
valM
PC
valE, valM
newPC
PC increment
PC increment
CCCCALUALU
Datamemory
Datamemory
Fetch
Decode
Execute
Memory
Write back
Registerfile
RegisterfileA B M
E
d_srcA, d_srcB
valA, valB
aluA, aluB
CndvalE
Addr, Data
valM
W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM
icode, ifun,rA, rB, valC
E
M
W
F
D
valP
f_pc
predPC
Instructionmemory
Instructionmemory
M_icode, M_Cnd, M_valA
Pipeline Stages Fetch
Select current PC Read instruction Compute incremented PC
Decode Read program registers
Execute Perform ALU operation
Memory Read or write data memory
Write back Update register file
PC increment
PC increment
CCCCALUALU
Datamemory
Datamemory
Fetch
Decode
Execute
Memory
Write back
Registerfile
RegisterfileA B M
E
d_srcA, d_srcB
valA, valB
aluA, aluB
CndvalE
Addr, Data
valM
W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM
icode, ifun,rA, rB, valC
E
M
W
F
D
valP
f_pc
predPC
Instructionmemory
Instructionmemory
M_icode, M_Cnd, M_valA
PIPE- Hardware Pipeline registers hold
intermediate values from instruction execution
Note naming conventions Forward (upward) paths
Values flow from one stage to next
Values cannot skip stages; look at
valC in decode dstE and dstM in execute,
memory
E
M
W
F
D
Instructionmemory
Instructionmemory
PCincrement
PCincrement
Registerfile
Registerfile
ALUALU
Datamemory
Datamemory
SelectPC
rB
dstE dstMSelectA
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
icode
data out
data in
A BM
E
M_valA
W_valM
W_valE
M_valA
W_valM
d_rvalA
f_pc
PredictPC
valE valM dstE dstM
Cndicode valE valA dstE dstM
icode ifun valC valA valB dstE dstMsrcA srcB
valC valPicode ifun rA
predPC
CCCC
d_srcBd_srcA
e_Cnd
M_Cnd
stat
stat
stat
stat
statimem_error
instr_valid
stat
dstE
dmem_errorm_stat
W_stat
M_stat
E_stat
D_stat
f_stat
Writeback stat
Stat
PIPE- Example
At clock cycle 4, what are the values of
d_srcA d_srcB E_dstE e_valE M_dstE M_valE m_valM
Highlight the datapath in D, E, M stages.
E
M
W
F
D
Instructionmemory
Instructionmemory
PCincrement
PCincrement
Registerfile
Registerfile
ALUALU
Datamemory
Datamemory
SelectPC
rB
dstE dstMSelectA
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
icode
data out
data in
A BM
E
M_valA
W_valM
W_valE
M_valA
W_valM
d_rvalA
f_pc
PredictPC
valE valM dstE dstM
Cndicode valE valA dstE dstM
icode ifun valC valA valB dstE dstMsrcA srcB
valC valPicode ifun rA
predPC
CCCC
d_srcBd_srcA
e_Cnd
M_Cnd
stat
stat
stat
stat
statimem_error
instr_valid
stat
dstE
dmem_errorm_stat
W_stat
M_stat
E_stat
D_stat
f_stat
Writeback stat
Stat
1: mrmovl 4(%ebx), %ecx2: addl %eax, %edx3: addl %ecx, %edx
Problems:Feedback Paths
Predicted PC Guess value of next PC
Branch information Jump taken/not-taken Fall-through or target
address Return address
Read from memory Register updates
To register file write ports
E
M
W
F
D
Instructionmemory
Instructionmemory
PCincrement
PCincrement
Registerfile
Registerfile
ALUALU
Datamemory
Datamemory
SelectPC
rB
dstE dstMSelectA
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
icode
data out
data in
A BM
E
M_valA
W_valM
W_valE
M_valA
W_valM
d_rvalA
f_pc
PredictPC
valE valM dstE dstM
Cndicode valE valA dstE dstM
icode ifun valC valA valB dstE dstMsrcA srcB
valC valPicode ifun rA
predPC
CCCC
d_srcBd_srcA
e_Cnd
M_Cnd
stat
stat
stat
stat
statimem_error
instr_valid
stat
dstE
dmem_errorm_stat
W_stat
M_stat
E_stat
D_stat
f_stat
Writeback stat
Stat
Predicting the PC
Pipeline timing requires next instruction fetch to begin one cycle after current instruction is fetched Not enough time to determine next
instruction in all cases Can be done for all but conditional
jumps, return Solution for conditional jumps: guess
outcome + continue Recover later if prediction was
incorrectPC
incrementPC
increment
CCCCALUALU
Datamemory
Datamemory
Fetch
Decode
Execute
Memory
Write back
Registerfile
RegisterfileA B M
E
d_srcA, d_srcB
valA, valB
aluA, aluB
CndvalE
Addr, Data
valM
W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM
icode, ifun,rA, rB, valC
E
M
W
F
D
valP
f_pc
predPC
Instructionmemory
Instructionmemory
M_icode, M_Cnd, M_valA
Simple Prediction Strategy Instructions that don’t transfer control
Next PC is always valP (no guess required) Call and unconditional jumps
Next PC is always valC (no guess required) Conditional jumps
Could predict next PC to be valC (branch target) Correct only if branch is taken (~60% of time)
Could predict next PC to be valP (sequential successor) Correct only if branch not taken (~40% of time)
Could predict taken if forward, not-taken if backward Correct ~65% of time
Return instruction Don’t try to predict (Why? What would be required?)
Branch Misprediction Example
Should execute only first 7 instructions Our predictor will guess the branch will be taken What must be done to recover?
0x000: xorl %eax,%eax 0x002: jne t # Not taken 0x007: irmovl $1, %eax # Fall through 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $2, %edx # Target (Should not execute) 0x017: irmovl $3, %ebx # Should not execute 0x01d: irmovl $4, %edx # Should not execute
Handling Misprediction: Desired Behavior
Detection Branch is in execute, and it was not taken Two incorrect instructions follow branch in pipe
Action On next cycle, replace mispredicted instructions with “bubbles”
0x000: irmovl Stack,%esp # Initialize stack pointer 0x006: call p # Procedure call 0x00b: irmovl $5,%esi # Return point 0x011: halt 0x020: .pos 0x20 0x020: p: irmovl $-1,%edi # procedure 0x026: ret 0x027: irmovl $1,%eax # Should not be executed 0x02d: irmovl $2,%ecx # Should not be executed 0x033: irmovl $3,%edx # Should not be executed 0x039: irmovl $4,%ebx # Should not be executed 0x100: .pos 0x100 0x100: Stack: # Stack: Stack pointer
Return Example
Without special handling we would fetch three more instructions after ret before we know return address
Instead, we’ll freeze the pipe when ret is fetched
0x026: ret F D E M
Wbubble F D E M
W
bubble F D E M W
bubble F D E M W
0x00b: irmovl $5,%esi # Return F D E M W
# demo-retb
F D E M W
FvalC f 5rB f %esi
FvalC f 5rB f %esi
W
valM = 0x0b
W
valM = 0x0b
•••
Handling Return: Desired Behavior
Detection ret is in decode, execute or
memory Address of next instruction
unavailable Action
Fill pipe with bubbles until ret’s memory access completes
Pipeline Operation
irmovl $1,%eax #I1
1 2 3 4 5 6 7 8 9
F D E M
Wirmovl $2,%ecx #I2 F D E M
W
irmovl $3,%edx #I3 F D E M W
irmovl $4,%ebx #I4 F D E M W
halt #I5 F D E M W
Cycle 5
W
I1
M
I2
E
I3
D
I4
F
I5
Problems with PIPE-: Example with 3 Nops
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M WF D E M W
0x006: irmovl $3,%eax F D E M WF D E M W
0x00c: nop F D E M WF D E M W
0x00d: nop F D E M WF D E M W
0x00e: nop F D E M WF D E M W
0x00f: addl %edx,%eax F D E M WF D E M W
10
W
R[%eax] f 3
W
R[%eax] f 3
D
valA f R[%edx] = 10valB f R[%eax] = 3
D
valA f R[%edx] = 10valB f R[%eax] = 3
# demo-h3.ys
Cycle 6
11
0x011: halt F D E M WF D E M W
Cycle 7
Example: 2 Nops
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M WF D E M W
0x006: irmovl $3,%eax F D E M WF D E M W
0x00c: nop F D E M WF D E M W
0x00d: nop F D E M WF D E M W
0x00e: addl %edx,%eax F D E M WF D E M W
0x010: halt F D E M WF D E M W
10# demo-h2.ys
W
R[%eax] f 3
D
valA f R[ %edx] = 10valB f R[ %eax] = 0
•••
W
R[%eax] f 3
W
R[%eax] f 3
D
valA fR[% edx] = 10valB fR[% eax] = 0
D
valA fR[% edx] = 10valB fR[% eax] = 0
•••
Cycle 6
Error
Example: 1 Nop0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M
W0x006: irmovl $3,%eax F D E M
W
0x00c: nop F D E M WF D E M W
0x00d: addl %edx,%eax F D E M WF D E M W
0x00f: halt F D E M WF D E M W
# demo-h1.ys
W
R[%edx] f 10
W
R[%edx] f 10
D
valA f R[%edx] = 0valB f R[%eax] = 0
D
valA f R[%edx] = 0valB f R[%eax] = 0
•••
Cycle 5
Error
MM_valE = 3M_dstE = %eax
Example: 0 Nops
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8
F D E M
W0x006: irmovl $3,%eax F D E M
W
F D E M W0x00c: addl % edx,% eax
F D E M W0x00e: halt
# demo-h0.ys
E
D
valA f R[%edx] = 0valB f R[%eax] = 0
D
valA f R[%edx] = 0valB f R[%eax] = 0
Cycle 4
Error
MM_valE = 10M_dstE = %edx
e_valE f 0 + 3 = 3 E_dstE = %eax
Hazard Solution 1: Stalling
If instruction that reads register follows too closely after instruction that writes same register, delay reading instruction
Delay or stall always takes place in decode stage Following instruction also delayed in fetch, but just one “stall cycle” tallied
Stall = dynamically injecting nop (bubble) into execute stage Hazard we get wrong answer without special action
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M W
0x006: irmovl $3,%eax F D E M W
0x00c: nop F D E M W
bubble
F
E M W
0x00e: addl %edx,%eax D D E M W
0x010: halt F D E M W
10# demo-h2.ys
F
F D E M W0x00d: nop
11
Stalls
Detection Write is pending for a src
register srcA or srcB (not 0xF)
matches dstE or dstM in execute, memory, or write-back stages
Action Stall instruction in decode
E
M
W
F
D
Instructionmemory
Instructionmemory
PCincrement
PCincrement
Registerfile
Registerfile
ALUALU
Datamemory
Datamemory
SelectPC
rB
dstE dstMSelectA
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
icode
data out
data in
A BM
E
M_valA
W_valM
W_valE
M_valA
W_valM
d_rvalA
f_pc
PredictPC
valE valM dstE dstM
Cndicode valE valA dstE dstM
icode ifun valC valA valB dstE dstMsrcA srcB
valC valPicode ifun rA
predPC
CCCC
d_srcBd_srcA
e_Cnd
M_Cnd
stat
stat
stat
stat
statimem_error
instr_valid
stat
dstE
dmem_errorm_stat
W_stat
M_stat
E_stat
D_stat
f_stat
Writeback stat
Stat
1: addl %eax, %edx2: mrmovl 4(%ebx), %ecx3: addl %ecx, %edx
Stalls
Detection Write is pending for a src
register srcA or srcB (not 0xF)
matches dstE or dstM in execute, memory, or write-back stages
Action Stall instruction in decode
E
M
W
F
D
Instructionmemory
Instructionmemory
PCincrement
PCincrement
Registerfile
Registerfile
ALUALU
Datamemory
Datamemory
SelectPC
rB
dstE dstMSelectA
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
icode
data out
data in
A BM
E
M_valA
W_valM
W_valE
M_valA
W_valM
d_rvalA
f_pc
PredictPC
valE valM dstE dstM
Cndicode valE valA dstE dstM
icode ifun valC valA valB dstE dstMsrcA srcB
valC valPicode ifun rA
predPC
CCCC
d_srcBd_srcA
e_Cnd
M_Cnd
stat
stat
stat
stat
statimem_error
instr_valid
stat
dstE
dmem_errorm_stat
W_stat
M_stat
E_stat
D_stat
f_stat
Writeback stat
Stat
1: addl %eax, %edx2: mrmovl 4(%ebx), %ecx3: addl %ecx, %edx
Detecting Stall Condition
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M W
0x006: irmovl $3,%eax F D E M W
0x00c: nop F D E M W
F0x00e: addl %edx,%eax D E M W
0x010: halt D E M W
10# demo-h2.ys
F
F D E M W0x00d: nop
11
Cycle 6
W
D
•••
W_dstE = %eaxW_valE = 3
srcA = %edxsrcB = %eax
Detecting Stall Condition
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M W
0x006: irmovl $3,%eax F D E M W
0x00c: nop F D E M W
bubble
F
E M W
0x00e: addl %edx,%eax D D E M W
0x010: halt F D E M W
10# demo-h2.ys
F
F D E M W0x00d: nop
11
Cycle 6
W
D
•••
W_dstE = %eaxW_valE = 3
srcA = %edxsrcB = %eax
Multi-cycle Stalls
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M W
0x006: irmovl $3,%eax F D E M W
bubble
F
E M W
bubble
D
E M W
0x00c: addl %edx,%eax D D E M W
0x00e: halt F D E M W
10# demo-h0.ys
F F
D
F
E M W bubble
11
Cycle 4
DsrcA = %edxsrcB = %eax
Multi-cycle Stalls
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M W
0x006: irmovl $3,%eax F D E M W
bubble
F
E M W
bubble
D
E M W
0x00c: addl %edx,%eax D D E M W
0x00e: halt F D E M W
10# demo-h0.ys
F F
D
F
E M W bubble
11
Cycle 4
EE_dstE = %eax
DsrcA = %edxsrcB = %eax
Multi-cycle Stalls
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M W
0x006: irmovl $3,%eax F D E M W
bubble
F
E M W
bubble
D
E M W
0x00c: addl %edx,%eax D D E M W
0x00e: halt F D E M W
10# demo-h0.ys
F F
D
F
E M W bubble
11
Cycle 4
•••D
srcA = %edxsrcB = %eax
EE_dstE = %eax
DsrcA = %edxsrcB = %eax
Cycle 5
Multi-cycle Stalls
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M W
0x006: irmovl $3,%eax F D E M W
bubble
F
E M W
bubble
D
E M W
0x00c: addl %edx,%eax D D E M W
0x00e: halt F D E M W
10# demo-h0.ys
F F
D
F
E M W bubble
11
Cycle 4
•••
MM_dstE = %eax
DsrcA = %edxsrcB = %eax
EE_dstE = %eax
DsrcA = %edxsrcB = %eax
Cycle 5
Multi-cycle Stalls
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M W
0x006: irmovl $3,%eax F D E M W
bubble
F
E M W
bubble
D
E M W
0x00c: addl %edx,%eax D D E M W
0x00e: halt F D E M W
10# demo-h0.ys
F F
D
F
E M W bubble
11
Cycle 4 •••
DsrcA = %edxsrcB = %eax
•••
MM_dstE = %eax
DsrcA = %edxsrcB = %eax
EE_dstE = %eax
DsrcA = %edxsrcB = %eax
Cycle 5
Cycle 6
Multi-cycle Stalls
0x000: irmovl $10,%edx
1 2 3 4 5 6 7 8 9
F D E M W
0x006: irmovl $3,%eax F D E M W
bubble
F
E M W
bubble
D
E M W
0x00c: addl %edx,%eax D D E M W
0x00e: halt F D E M W
10# demo-h0.ys
F F
D
F
E M W bubble
11
Cycle 4 •••
WW_dstE = %eax
DsrcA = %edxsrcB = %eax
•••
MM_dstE = %eax
DsrcA = %edxsrcB = %eax
EE_dstE = %eax
DsrcA = %edxsrcB = %eax
Cycle 5
Cycle 6
Stalling: Another View
Operational rules: Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage
Like dynamically generated nop’s Move through stage-by-stage as regular instructions
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: addl %edx,%eax
Cycle 4
0x00e: halt
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: addl %edx,%eax
# demo-h0.ys
0x00e: halt
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
bubble
0x00c: addl %edx,%eax
Cycle 5
0x00e: halt
0x006: irmovl $3,%eax
bubble
0x00c: addl %edx,%eax
bubble
Cycle 6
0x00e: halt
bubble
bubble
0x00c: addl %edx,%eax
bubble
Cycle 7
0x00e: halt
bubble
bubble
Cycle 8
0x00c: addl %edx,%eax
0x00e: halt
Write BackMemoryExecuteDecode
Fetch