csc 4250 computer architectures october 20, 2006 chapter 3.instruction-level parallelism & its...
![Page 1: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/1.jpg)
CSC 4250 Computer Architectures
October 20, 2006
Chapter 3. Instruction-Level Parallelism
& Its Dynamic Exploitation
![Page 2: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/2.jpg)
One More Example on Tomasulo’s Algorithm
L.D F0,0(R0)
ADD.D F0,F0,F2
MUL.D F0,F0,F4
ADD.D F0,F0,F2
MUL.D F0,F0,F4
S.D F0,0(R0)
ADD.D F0,F4,F2
![Page 3: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/3.jpg)
IBM 360 Assembly Language
Only two operands. Advantage? Disadvantage? Example:
L.D F0,0(R0)
ADD.D F0,F2
MUL.D F0,F4
ADD.D F0,F2
MUL.D F0,F4
S.D F0,0(R0)
… …
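In the two-operand form the first operand is both a source and the destination, so `ADD.D F0,F2` means F0 ← F0 + F2. The sketch below is one possible way to lower the three-operand listing into that form. The function name `to_two_operand`, the `COMMUTATIVE` set, and the use of `MOV.D` as a register copy are assumptions made for illustration and are not taken from the slides.

```python
# Hypothetical helper: lower 'OP dst,src1,src2' to two-operand form 'OP dst,src'.
COMMUTATIVE = {"ADD.D", "MUL.D"}  # assumed set of operations whose sources may be swapped


def to_two_operand(instr):
    """Return a list of two-operand instructions equivalent to one three-operand one."""
    op, operands = instr.split()
    dst, src1, src2 = operands.split(",")
    if dst == src1:
        # dst already holds the first source: a direct translation exists.
        return [f"{op} {dst},{src2}"]
    if dst == src2 and op in COMMUTATIVE:
        # Swap the sources so that dst is the accumulated operand.
        return [f"{op} {dst},{src1}"]
    if dst == src2:
        # Copying src1 into dst would destroy src2; a scratch register is needed.
        raise ValueError("needs a scratch register: " + instr)
    # dst differs from both sources: copy first, then operate in place.
    return [f"MOV.D {dst},{src1}", f"{op} {dst},{src2}"]


if __name__ == "__main__":
    for line in ["ADD.D F0,F0,F2", "MUL.D F0,F0,F4", "ADD.D F0,F4,F2"]:
        print(line, "->", to_two_operand(line))
```

Under these assumptions, a chain that keeps accumulating into F0 translates one for one, while an instruction whose destination differs from both sources costs an extra copy.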
![Page 4: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/4.jpg)
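Figure 0.1

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | | |
| ADD.D F0,F0,F2 | | | |
| MUL.D F0,F0,F4 | | | |
| ADD.D F0,F0,F2 | | | |
| MUL.D F0,F0,F4 | | | |
| S.D F0,0(R0) | | | |
| ADD.D F0,F4,F2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 0+Reg[R0] |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | No | | | | | | |
| Mult2 | No | | | | | | |
| Store1 | No | | | | | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Load1 | | | | | | | | |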
![Page 5: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/5.jpg)
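Figure 0.2

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | | | |
| ADD.D F0,F0,F2 | | | |
| MUL.D F0,F0,F4 | | | |
| S.D F0,0(R0) | | | |
| ADD.D F0,F4,F2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 0+Reg[R0] |
| Add1 | Yes | Add | | Reg[F2] | Load1 | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | No | | | | | | |
| Mult2 | No | | | | | | |
| Store1 | No | | | | | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Add1 | | | | | | | | |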
![Page 6: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/6.jpg)
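Figure 0.3

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | √ | | |
| ADD.D F0,F0,F2 | | | |
| MUL.D F0,F0,F4 | | | |
| S.D F0,0(R0) | | | |
| ADD.D F0,F4,F2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 0+Reg[R0] |
| Add1 | Yes | Add | | Reg[F2] | Load1 | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Add1 | | |
| Mult2 | No | | | | | | |
| Store1 | No | | | | | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | | | | | | | | |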
![Page 7: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/7.jpg)
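Figure 0.4

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | √ | | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | | | |
| S.D F0,0(R0) | | | |
| ADD.D F0,F4,F2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 0+Reg[R0] |
| Add1 | Yes | Add | | Reg[F2] | Load1 | | |
| Add2 | Yes | Add | | Reg[F2] | Mult1 | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Add1 | | |
| Mult2 | No | | | | | | |
| Store1 | No | | | | | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Add2 | | | | | | | | |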
![Page 8: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/8.jpg)
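Figure 0.5

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | √ | | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | √ | | |
| S.D F0,0(R0) | | | |
| ADD.D F0,F4,F2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 0+Reg[R0] |
| Add1 | Yes | Add | | Reg[F2] | Load1 | | |
| Add2 | Yes | Add | | Reg[F2] | Mult1 | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Add1 | | |
| Mult2 | Yes | Mult | | Reg[F4] | Add2 | | |
| Store1 | No | | | | | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult2 | | | | | | | | |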
![Page 9: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/9.jpg)
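Figure 0.6

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | √ | | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | √ | | |
| S.D F0,0(R0) | √ | | |
| ADD.D F0,F4,F2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 0+Reg[R0] |
| Add1 | Yes | Add | | Reg[F2] | Load1 | | |
| Add2 | Yes | Add | | Reg[F2] | Mult1 | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Add1 | | |
| Mult2 | Yes | Mult | | Reg[F4] | Add2 | | |
| Store1 | Yes | Store | | | | Mult2 | 0+Reg[R0] |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult2 | | | | | | | | |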
![Page 10: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/10.jpg)
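Figure 0.7

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | √ | | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | √ | | |
| S.D F0,0(R0) | √ | | |
| ADD.D F0,F4,F2 | √ | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 0+Reg[R0] |
| Add1 | Yes | Add | | Reg[F2] | Load1 | | |
| Add2 | Yes | Add | | Reg[F2] | Mult1 | | |
| Add3 | Yes | Add | Reg[F4] | Reg[F2] | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Add1 | | |
| Mult2 | Yes | Mult | | Reg[F4] | Add2 | | |
| Store1 | Yes | Store | | | | Mult2 | 0+Reg[R0] |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Add3 | | | | | | | | |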
![Page 11: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/11.jpg)
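Figure 0.8

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | √ | | |
| ADD.D F0,F0,F2 | √ | | |
| MUL.D F0,F0,F4 | √ | | |
| S.D F0,0(R0) | √ | | |
| ADD.D F0,F4,F2 | √ | √ | √ |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 0+Reg[R0] |
| Add1 | Yes | Add | | Reg[F2] | Load1 | | |
| Add2 | Yes | Add | | Reg[F2] | Mult1 | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Add1 | | |
| Mult2 | Yes | Mult | | Reg[F4] | Add2 | | |
| Store1 | Yes | Store | | | | Mult2 | 0+Reg[R0] |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | | | | | | | | | |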
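The issue-stage bookkeeping traced in Figures 0.1 through 0.7 can be sketched in Python. This is a simplified model under stated assumptions: no instruction writes its result while the others issue (so no station is freed), the address field A and integer register R0 are left out, and the store's pending value is tracked in Qk. The names `POOL`, `PROGRAM`, and `issue_all` are hypothetical.

```python
# Reservation-station pool used by each opcode (pool names follow the slides).
POOL = {"L.D": "Load", "S.D": "Store", "ADD.D": "Add", "MUL.D": "Mult"}

# (op, destination, source j, source k); unused fields are None.
# For S.D the value to store is passed as source k (assumed Qk placement).
PROGRAM = [
    ("L.D", "F0", None, None),
    ("ADD.D", "F0", "F0", "F2"),
    ("MUL.D", "F0", "F0", "F4"),
    ("ADD.D", "F0", "F0", "F2"),
    ("MUL.D", "F0", "F0", "F4"),
    ("S.D", None, None, "F0"),
    ("ADD.D", "F0", "F4", "F2"),
]


def issue_all(program):
    """Issue every instruction in order; return (stations, register status Qi)."""
    qi = {}        # register -> station that will produce its value
    counts = {}    # pool -> number of stations handed out so far
    stations = []
    for op, dest, src_j, src_k in program:
        pool = POOL[op]
        counts[pool] = counts.get(pool, 0) + 1
        name = f"{pool}{counts[pool]}"
        entry = {"name": name, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
        # Read the sources BEFORE renaming the destination, so that
        # 'ADD.D F0,F0,F2' sees the previous producer of F0.
        for src, v_field, q_field in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
            if src is None:
                continue
            if src in qi:
                entry[q_field] = qi[src]        # value pending: record the tag
            else:
                entry[v_field] = f"Reg[{src}]"  # value available in the register file
        if dest is not None:
            qi[dest] = name                     # rename: newest writer of dest
        stations.append(entry)
    return stations, qi


if __name__ == "__main__":
    stations, qi = issue_all(PROGRAM)
    for s in stations:
        print(s)
    print("Qi:", qi)
```

Each write to F0 simply replaces the tag in Qi, which is why the WAW and WAR hazards on F0 do not stall issue in this model.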
![Page 12: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/12.jpg)
Modified Loop-Based Example
Loop: L.D F0,0(R1)
      MUL.D F0,F0,F2
      ADD.D F0,F0,F4
      S.D F0,0(R1)
      DADDIU R1,R1,#-8
      BNE R1,R2,Loop
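Read as array code, each pass of this loop loads the double at address R1, multiplies it by F2, adds F4, stores it back, and steps R1 down by 8 bytes until R1 equals R2. A minimal Python sketch of that behaviour follows. Memory is modelled as a dictionary keyed by byte address and branch delay slots are ignored; both are simplifying assumptions, and the function name `run_loop` is hypothetical.

```python
def run_loop(memory, r1, r2, f2, f4):
    """Mirror the loop body: memory[r1] = memory[r1] * f2 + f4 for each element."""
    while True:
        f0 = memory[r1]      # L.D    F0,0(R1)
        f0 = f0 * f2         # MUL.D  F0,F0,F2
        f0 = f0 + f4         # ADD.D  F0,F0,F4
        memory[r1] = f0      # S.D    F0,0(R1)
        r1 -= 8              # DADDIU R1,R1,#-8
        if r1 == r2:         # BNE    R1,R2,Loop falls through when R1 == R2
            break
    return memory


if __name__ == "__main__":
    mem = {16: 1.0, 8: 2.0}
    print(run_loop(mem, r1=16, r2=0, f2=3.0, f4=0.5))
```

Successive iterations touch different addresses and share only the register name F0, which is the property the renaming in the following figures relies on.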
![Page 13: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/13.jpg)
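Figure 0.1. One active iteration of loop

| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ | |
| MUL.D F0,F0,F2 | 1 | √ | | |
| ADD.D F0,F0,F4 | 1 | √ | | |
| S.D F0,0(R1) | 1 | √ | | |
| L.D F0,0(R1) | 2 | | | |
| MUL.D F0,F0,F2 | 2 | | | |
| ADD.D F0,F0,F4 | 2 | | | |
| S.D F0,0(R1) | 2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | Reg[R1] |
| Load2 | No | | | | | | |
| Add1 | Yes | Add | | Reg[F4] | Mult1 | | |
| Add2 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F2] | Load1 | | |
| Mult2 | No | | | | | | |
| Store1 | Yes | Store | | | | Add1 | Reg[R1] |
| Store2 | No | | | | | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Add1 | | | | | | | | |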
![Page 14: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/14.jpg)
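Figure 0.2. Two active iterations of loop

| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ | |
| MUL.D F0,F0,F2 | 1 | √ | | |
| ADD.D F0,F0,F4 | 1 | √ | | |
| S.D F0,0(R1) | 1 | √ | | |
| L.D F0,0(R1) | 2 | √ | √ | |
| MUL.D F0,F0,F2 | 2 | √ | | |
| ADD.D F0,F0,F4 | 2 | √ | | |
| S.D F0,0(R1) | 2 | √ | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | Reg[R1] |
| Load2 | Yes | Load | | | | | Reg[R1]-8 |
| Add1 | Yes | Add | | Reg[F4] | Mult1 | | |
| Add2 | Yes | Add | | Reg[F4] | Mult2 | | |
| Mult1 | Yes | Mult | | Reg[F2] | Load1 | | |
| Mult2 | Yes | Mult | | Reg[F2] | Load2 | | |
| Store1 | Yes | Store | | | | Add1 | Reg[R1] |
| Store2 | Yes | Store | | | | Add2 | Reg[R1]-8 |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Add2 | | | | | | | | |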
![Page 16: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/16.jpg)
Dynamic Branch Prediction
Static branch prediction is covered in Appendix A.
Branch prediction buffer: a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not.
The prediction bit may have been placed there by another branch whose address has the same low-order bits.
![Page 17: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/17.jpg)
Figure 3.14. A Branch Prediction Buffer. Use the 4 low-order bits of the branch's word address to choose a row.
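A 1-bit buffer indexed this way can be sketched in a few lines of Python. The sketch assumes 4-byte instructions (so the word address is the byte address shifted right by 2) and an initial prediction of not taken. The class name `BranchPredictionBuffer` and the example addresses are hypothetical.

```python
class BranchPredictionBuffer:
    """1-bit prediction buffer indexed by low-order bits of the branch word address."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        # One bit per row; False means "predict not taken" (assumed initial state).
        self.bits = [False] * (1 << index_bits)

    def index(self, pc):
        # Drop the 2 byte-offset bits of a 4-byte instruction, keep the low index bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.bits[self.index(pc)]

    def update(self, pc, taken):
        # Remember only the most recent outcome seen in this row.
        self.bits[self.index(pc)] = taken


if __name__ == "__main__":
    bpb = BranchPredictionBuffer()
    print(bpb.index(0x1004), bpb.index(0x1044))  # both map to row 1
    bpb.update(0x1004, True)
    print(bpb.predict(0x1044))  # True: the bit was set by a different branch
```

There is no tag check, so two branches that share a row silently share a prediction bit, which is the aliasing the slide describes.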
![Page 18: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/18.jpg)
Nested Loops
Loop1: L.D F2,1600(R1)
       DADDIU R2,R0,#80
Loop2: L.D F0,1000(R2)
       ADD.D F0,F0,F2
       S.D F0,1000(R2)
       DADDIU R2,R2,#-8
       BNEZ R2,Loop2
       DADDIU R1,R1,#-8
       BNEZ R1,Loop1
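The inner branch `BNEZ R2,Loop2` shows why a single prediction bit does poorly on loops. With R2 counting down from 80 in steps of 8, the branch is taken nine times and then falls through. The Python sketch below counts 1-bit mispredictions for that branch. The number of outer iterations is a parameter because the initial value of R1 is not given on the slide; the sketch also assumes an initial prediction of not taken and no aliasing with other branches. Both function names are hypothetical.

```python
def inner_branch_outcomes(outer_iterations, start=80, step=8):
    """Outcomes of BNEZ R2,Loop2 in program order (True = taken)."""
    outcomes = []
    for _ in range(outer_iterations):
        r2 = start                 # DADDIU R2,R0,#80
        while True:
            r2 -= step             # DADDIU R2,R2,#-8
            taken = r2 != 0        # BNEZ R2,Loop2
            outcomes.append(taken)
            if not taken:
                break
    return outcomes


def one_bit_mispredictions(outcomes, initial=False):
    """Count misses when the prediction is simply the previous outcome."""
    bit = initial
    misses = 0
    for taken in outcomes:
        if bit != taken:
            misses += 1
        bit = taken                # the single bit records the latest outcome
    return misses


if __name__ == "__main__":
    outcomes = inner_branch_outcomes(outer_iterations=3)
    print(len(outcomes), one_bit_mispredictions(outcomes))
```

Under these assumptions the predictor misses twice per run of the inner loop, once on the exit and once on re-entry, even though the branch is not taken only once in ten executions.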
![Page 19: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/19.jpg)
Figure 3.7. States in 2-bit Prediction Scheme
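The figure itself is not reproduced in this transcript, so the sketch below uses a 2-bit saturating counter, one common form of 2-bit predictor; the state transitions in the figure may differ in detail. It is applied to a loop branch that is taken nine times and then falls through, as the inner branch of the nested-loop example would be if R2 counts down from 80 in steps of 8. The function name and the starting counter values are assumptions.

```python
def two_bit_mispredictions(outcomes, counter=0):
    """Count misses of a 2-bit saturating counter (0..3); predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


if __name__ == "__main__":
    # Three runs of a loop branch: nine taken, then one not taken.
    pattern = ([True] * 9 + [False]) * 3
    print(two_bit_mispredictions(pattern))
    print(two_bit_mispredictions(pattern, counter=3))
```

Once warmed up, this counter misses only on the loop exit, because a single not-taken outcome is not enough to flip the prediction.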
![Page 20: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/20.jpg)
Figure 3.8. Prediction Accuracy of 4096-entry 2-bit Prediction Buffer for SPEC89 Benchmarks
![Page 21: CSC 4250 Computer Architectures October 20, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader038.vdocuments.us/reader038/viewer/2022102906/56649c745503460f949280a9/html5/thumbnails/21.jpg)
Figure 3.9. Prediction Accuracy of 4096-entry 2-bit Prediction Buffer versus an infinite 2-bit Prediction Buffer for SPEC89