pipelining 4 / caches 1 - university of virginia school of...
TRANSCRIPT
![Page 1: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/1.jpg)
Pipelining 4 / Caches 1
1
![Page 2: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/2.jpg)
after forwarding/prediction
where do we still need to stall?
memory output needed in fetchret followed by anything
memory output needed in exceutemrmovq or popq + use(in immediatelly following instruction)
2
![Page 3: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/3.jpg)
overall CPU
5 stage pipeline
1 instruction completes every cycle — except hazards
most data hazards: solved by forwarding
load/use hazard: 1 cycle of stalling
jXX control hazard: branch prediction + squashing2 cycle penalty for misprediction(correct misprediction after jXX finishes execute)
ret control hazard: 3 cycles of stalling(fetch next instruction after ret finishes memory)
3
![Page 4: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/4.jpg)
ret paths
pred.PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
fetch
decode executememory
writeback
jmp target(from other stage)
very long critical path
4
![Page 5: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/5.jpg)
ret paths
pred.PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
fetch
decode executememory
writeback
jmp target(from other stage)
very long critical path
4
![Page 6: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/6.jpg)
ret paths
pred.PC
Instr.Mem.
register filesrcA
srcB
R[srcA]R[srcB]
dstE
next R[dstE]
dstM
next R[dstM]
DataMem.
ZF/SF
Stat
Data in
Addr inData out
valC
0xF
0xF%rsp
%rsp
0xF
0xF%rsp
rA
rB
ALUaluA
aluBvalE8
0
add/subxor/and(functionof instr.)
write?
functionof opcode
PC+9
instr.length+
fetch
decode executememory
writeback
jmp target(from other stage) very long critical path
4
![Page 7: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/7.jpg)
unsolved problemcycle # 0 1 2 3 4 5 6 7 8
mrmovq 0(%rax), %rbx F D E M Wsubq %rbx, %rcx F D E M Wsubq %rbx, %rcx F D D E M W
stall
5
![Page 8: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/8.jpg)
unsolved problemcycle # 0 1 2 3 4 5 6 7 8
mrmovq 0(%rax), %rbx F D E M Wsubq %rbx, %rcx F D E M Wsubq %rbx, %rcx F D D E M W
stall
5
![Page 9: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/9.jpg)
solveable problemcycle # 0 1 2 3 4 5 6 7 8
mrmovq 0(%rax), %rbx F D E M Wrmmovq %rbx, 0(%rcx) F D E M W
common for real processors to do thisbut our textbook only forwards to the end of decode
6
![Page 10: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/10.jpg)
fetch/fetch logic — advance or not
predicted PC
MUXfrom incremented PC
should we stall?
…
7
![Page 11: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/11.jpg)
fetch/decode logic — bubble or not
rA
MUXno-op value — 0xF
should we sendno-op value (“bubble”)?
8
![Page 12: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/12.jpg)
HCLRS signals
register aB {...
}
HCLRS: every register bank has these MUXes built-in
stall_B: keep old value for all registersregister input → register output
bubble_B: use default value for all registersregister input → default value
9
![Page 13: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/13.jpg)
exerciseregister aB {
value : 8 = 0xFF;}...time a_value B_value stall_B bubble_B0 0x01 0xFF 0 01 0x02 ??? 1 02 0x03 ??? 0 03 0x04 ??? 0 14 0x05 ??? 0 05 0x06 ??? 0 06 0x07 ??? 1 07 0x08 ??? 1 08 ???
stall: keep old valuebubble: store default value
10
![Page 14: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/14.jpg)
exercise resultregister aB {
value : 8 = 0xFF;}...time a_value B_value stall_B bubble_B0 0x01 0xFF 0 01 0x02 0x01 1 02 0x03 0x01 0 03 0x04 0x03 0 14 0x05 0xFF 0 05 0x06 0x05 0 06 0x07 0x06 1 07 0x08 0x06 1 08 0x06
11
![Page 15: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/15.jpg)
exercise: squash + stall (1)time fetch decode execute memory writeback
1 E D C B A
2 E nop C nop B
stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
? ? ? ? ?
S B S B N
exercise: what are the ?swrite down your answers,then compare with your neighbors
12
![Page 16: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/16.jpg)
exercise: squash + stall (1)time fetch decode execute memory writeback
1 E D C B A
2 E nop C nop B
stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
? ? ? ? ?
S B S B N
exercise: what are the ?swrite down your answers,then compare with your neighbors
12
![Page 17: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/17.jpg)
exercise: squash + stall (2)time fetch decode execute memory writeback
1 E D C B A
2 F E C nop B
stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
? ? ? ? ?
N N S B N
exercise: what are the ?swrite down your answers,then compare with your neighbors
13
![Page 18: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/18.jpg)
exercise: squash + stall (2)time fetch decode execute memory writeback
1 E D C B A
2 F E C nop B
stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
? ? ? ? ?
N N S B N
exercise: what are the ?swrite down your answers,then compare with your neighbors
13
![Page 19: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/19.jpg)
exercise: squash + stall (2)time fetch decode execute memory writeback
1 E D C B A
2 F E C nop B
stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
? ? ? ? ?
N N S B N
exercise: what are the ?swrite down your answers,then compare with your neighbors
13
![Page 20: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/20.jpg)
ret stalltime fetch decode execute memory writeback
0 call
1 ret call
2 wait for ret ret call
3 wait for ret nothing ret call (store)
4 wait for ret nothing nothing ret (load) call
5 addq nothing nothing nothing ret
stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
N N
S N N
S B N N
S B N N N
N B N N N
14
![Page 21: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/21.jpg)
ret stalltime fetch decode execute memory writeback
0 call
1 ret call
2 wait for ret ret call
3 wait for ret nothing ret call (store)
4 wait for ret nothing nothing ret (load) call
5 addq nothing nothing nothing ret
stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
N N
S N N
S B N N
S B N N N
N B N N N
14
![Page 22: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/22.jpg)
ret stalltime fetch decode execute memory writeback
0 call
1 ret call
2 wait for ret ret call
3 wait for ret nothing ret call (store)
4 wait for ret nothing nothing ret (load) call
5 addq nothing nothing nothing ret
stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
N N
S N N
S B N N
S B N N N
N B N N N
14
![Page 23: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/23.jpg)
ret stalltime fetch decode execute memory writeback
0 call
1 ret call
2 wait for ret ret call
3 wait for ret nothing ret call (store)
4 wait for ret nothing nothing ret (load) call
5 addq nothing nothing nothing ret
stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
N N
S N N
S B N N
S B N N N
N B N N N
14
![Page 24: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/24.jpg)
ret stalltime fetch decode execute memory writeback
0 call
1 ret call
2 wait for ret ret call
3 wait for ret nothing ret call (store)
4 wait for ret nothing nothing ret (load) call
5 addq nothing nothing nothing ret
stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
N N
S N N
S B N N
S B N N N
N B N N N
14
![Page 25: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/25.jpg)
HCLRS bubble example
register fD {icode : 4 = NOP;rA : 4 = REG_NONE;rB : 4 = REG_NONE;...
};wire need_ret_bubble : 1;need_ret_bubble = ( D_icode == RET ||
E_icode == RET ||M_icode == RET );
bubble_D = ( need_ret_bubble ||... /* other cases */ );
15
![Page 26: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/26.jpg)
squashing with stall/bubbletime fetch decode execute memory writeback
1 subq
2 jne subq
3 addq [?] jne subq (set ZF)
4 rmmovq [?] addq [?] jne (use ZF) subq
5 xorq nothing nothing jne (done) subqstall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
N N
N N N
N N N N
N B B N Ncan compute bubble signal based on execute phasewon’t even start CC write for addq
16
![Page 27: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/27.jpg)
squashing with stall/bubbletime fetch decode execute memory writeback
1 subq
2 jne subq
3 addq [?] jne subq (set ZF)
4 rmmovq [?] addq [?] jne (use ZF) subq
5 xorq nothing nothing jne (done) subqstall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
N N
N N N
N N N N
N B B N Ncan compute bubble signal based on execute phasewon’t even start CC write for addq
16
![Page 28: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/28.jpg)
squashing with stall/bubbletime fetch decode execute memory writeback
1 subq
2 jne subq
3 addq [?] jne subq (set ZF)
4 rmmovq [?] addq [?] jne (use ZF) subq
5 xorq nothing nothing jne (done) subqstall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
N N
N N N
N N N N
N B B N Ncan compute bubble signal based on execute phasewon’t even start CC write for addq
16
![Page 29: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/29.jpg)
squashing with stall/bubbletime fetch decode execute memory writeback
1 subq
2 jne subq
3 addq [?] jne subq (set ZF)
4 rmmovq [?] addq [?] jne (use ZF) subq
5 xorq nothing nothing jne (done) subqstall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
N N
N N N
N N N N
N B B N N
can compute bubble signal based on execute phasewon’t even start CC write for addq
16
![Page 30: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/30.jpg)
squashing with stall/bubbletime fetch decode execute memory writeback
1 subq
2 jne subq
3 addq [?] jne subq (set ZF)
4 rmmovq [?] addq [?] jne (use ZF) subq
5 xorq nothing nothing jne (done) subqstall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);
N N
N N N
N N N N
N B B N Ncan compute bubble signal based on execute phasewon’t even start CC write for addq
16
![Page 31: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/31.jpg)
squashing HCLRS
just_detected_mispredict =e_icode == JXX && !branchTaken;
bubble_D = just_detected_mispredict || ...;bubble_E = just_detected_mispredict || ...;
17
![Page 32: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/32.jpg)
missing pieces
multi-cycle memories
beyond pipelining: out-of-order, multiple issue
18
![Page 33: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/33.jpg)
missing pieces
multi-cycle memories
beyond pipelining: out-of-order, multiple issue
19
![Page 34: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/34.jpg)
multi-cycle memories
ideal case for memories: single-cycle
achieved with caches (next topic)fast access to small number of things
typical performance:90+% of the time: single-cycle
sometimes many cycles (3–400+)
20
![Page 35: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/35.jpg)
variable speed memoriescycle # 0 1 2 3 4 5 6 7 8
mrmovq 0(%rbx), %r8 F D E M Wmrmovq 0(%rcx), %r9 F D E M Waddq %r8, %r9 F D D E M W
mrmovq 0(%rbx), %r8 F D E M M M M M Wmrmovq 0(%rcx), %r9 F D E E E E E M M M M Maddq %r8, %r9 F D D D D D D D D D D
memory is fast: (cache “hit”; recently accessed?)
memory is slow: (cache “miss”)
21
![Page 36: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/36.jpg)
missing pieces
multi-cycle memories
beyond pipelining: out-of-order, multiple issue
22
![Page 37: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/37.jpg)
beyond pipelining: multiple issue
start more than one instruction/cycle
multiple parallel pipelines; many-input/output register file
hazard handling much more complexcycle # 0 1 2 3 4 5 6 7 8
addq %r8, %r9 F D E M Wsubq %r10, %r11 F D E M Wxorq %r9, %r11 F D E M Wsubq %r10, %rbx F D E M W…
23
![Page 38: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/38.jpg)
beyond pipelining: out-of-order
find later instructions to do instead of stallinglists of available instructions in pipeline registers
take any instruction with available values
provide illusion that work is still done in ordermuch more complicated hazard handling logic
cycle # 0 1 2 3 4 5 6 7 8mrmovq 0(%rbx), %r8 F D E M M M Wsubq %r8, %r9 F D E Waddq %r10, %r11 F D E Wxorq %r12, %r13 F D E W…
24
![Page 39: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/39.jpg)
better branch prediction
forward (target > PC) not taken; backward taken
intuition: loops:
LOOP: ......je LOOP
LOOP: ...jne SKIP_LOOP...jmp LOOP
SKIP_LOOP:
25
![Page 40: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/40.jpg)
predicting ret: extra copy of stack
predicting ret — stack in processor registers
different than real stack/out of room? just slowerbaz saved registers
baz return address
bar saved registers
bar return address
foo local variables
foo saved registers
foo return address
foo saved registers
stack in memory
baz return address
bar return address
foo return address
(partial?) stackin CPU registers
26
![Page 41: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/41.jpg)
prediction before fetch
real processors can take multiple cycles to read instruction memory
predict branches before reading their opcodes
how — more extra data structurestables of recent branches (often many kilobytes)
27
![Page 42: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/42.jpg)
2004 CPURegisters
L1 cacheL2 cache
Branch Prediction(approximate)
Image: approx 2004 AMD press image of Opteron die;approx register/branch prediction location via chip-architect.org (Hans de Vries) 28
![Page 43: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/43.jpg)
stalling/misprediction and latency
hazard handling where pipeline latency matterslonger pipeline — larger penaltypart of Intel’s Pentium 4 problem (c. 2000)
on release: 50% higher clock rate, 2-3x pipeline stages of competitors
out-of-order, multiple issue processorfirst-generation review quote:
Review quote: Anand Lai Shimpi, “Intel Pentium 4 1.4 & 1.5 GHz”, AnandTech, 20 November 2000 29
![Page 44: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/44.jpg)
2004 CPU
RegistersL1 cache
L2 cacheL3 cache
mainmemory
< 1 ns
∼ 1 ns
∼ 5 ns
∼ 20 ns
∼ 100 ns
Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30
![Page 45: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/45.jpg)
2004 CPURegisters
L1 cacheL2 cache
L3 cachemain
memory
< 1 ns
∼ 1 ns
∼ 5 ns
∼ 20 ns
∼ 100 ns
Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30
![Page 46: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/46.jpg)
2004 CPURegisters
L1 cache
L2 cacheL3 cache
mainmemory
< 1 ns
∼ 1 ns
∼ 5 ns
∼ 20 ns
∼ 100 ns
Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30
![Page 47: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/47.jpg)
2004 CPURegisters
L1 cacheL2 cache
L3 cachemain
memory
< 1 ns
∼ 1 ns
∼ 5 ns
∼ 20 ns
∼ 100 ns
Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30
![Page 48: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/48.jpg)
2004 CPURegisters
L1 cacheL2 cache
L3 cachemain
memory
< 1 ns
∼ 1 ns
∼ 5 ns
∼ 20 ns
∼ 100 ns
Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30
![Page 49: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/49.jpg)
2004 CPURegisters
L1 cacheL2 cache
L3 cachemain
memory
< 1 ns
∼ 1 ns
∼ 5 ns
∼ 20 ns
∼ 100 ns
Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30
![Page 50: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/50.jpg)
2004 CPURegisters
L1 cacheL2 cache
L3 cachemain
memory
< 1 ns
∼ 1 ns
∼ 5 ns
∼ 20 ns
∼ 100 ns
Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30
![Page 51: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/51.jpg)
cache: real memory
Data MemoryAKA
L1 Data Cache
address
input (if writing)write enable
value
ready?
L2 Cache
31
![Page 52: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/52.jpg)
cache: real memory
Data MemoryAKA
L1 Data Cache
address
input (if writing)write enable
value
ready?
L2 Cache
31
![Page 53: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/53.jpg)
the place of cache
CPU Cache
RAMor
anothercache
read 0xABCD?read 0x1234?
0xABCD is 10000x1234 is 4000
read 0xABCD?
0xABCD is 1000
32
![Page 54: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/54.jpg)
memory hierarchy goals
performance of the fastest (smallest) memoryhide 100x latency difference? 99+% hit (= value found in cache) rate
capacity of the largest (slowest) memory
33
![Page 55: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/55.jpg)
memory hierarchy assumptions
temporal locality“if a value is accessed now, it will be accessed again soon”
caches should keep recently accessed values
spatial locality“if a value is accessed now, adjacent values will be accessed soon”
caches should store adjacent values at the same time
natural properties of programs — think about loops
34
![Page 56: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/56.jpg)
locality examples
double computeMean(int length, double *values) {double total = 0.0;for (int i = 0; i < length; ++i) {
total += values[i];}return total / length;
}
temporal locality: machine code of the loop
spatial locality: machine code of most consecutive instructions
temporal locality: total, i, length accessed repeatedly
spatial locality: values[i+1] accessed after values[i]35
![Page 57: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/57.jpg)
backup slides
36
![Page 58: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/58.jpg)
HCLRS pipelining debugging: intro
debugging pipelines is consistently one of the biggest sources ofdifficulty in this class
37
![Page 59: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/59.jpg)
HCL2D pipeline debugging (1)
draw a picture of the state of the instructions
get -d outputredirect to a file./hclrs file.hcl -d input.yo >output.txt
check each stage of the broken instruction
expect forwarding/hazard-handling problems
38
![Page 60: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE](https://reader033.vdocuments.us/reader033/viewer/2022053015/5f15bb297c5d2f266467c12f/html5/thumbnails/60.jpg)
HCL2D pipeline debugging (2)
write assembly — not just supplied test caseswrite file.ys — make file.yo to assembleremove anything not involved in the errorfind a minimal test casedon’t spend time looking at irrelevant instructions
draw the pipeline stageswhat instructions are in fetch/decode/etc. when?what values should be in pipelien registers when?what forwarding should happen?
39