![Page 1: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/1.jpg)
CSC 4250 Computer Architectures
October 17, 2006
Chapter 3. Instruction-Level Parallelism
& Its Dynamic Exploitation
![Page 2: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/2.jpg)
MIPS FP Unit using Tomasulo’s Algorithm
![Page 3: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/3.jpg)
MIPS Processor with Scoreboard
![Page 4: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/4.jpg)
Three Steps in Execution for Tomasulo’s Alg.
1. Issue ─ if no structural hazards
2. Execute ─ if both operands are available
3. Write result on CDB (from there into reservation stations waiting for results)
Recall that for Scoreboard: Four Steps in Execution
1. Issue ─ if no structural nor WAW hazards
2. Read operands ─ if no RAW hazards
3. Execute ─ if both operands are received
4. Write result ─ if no WAR hazards
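The difference between the two issue conditions can be sketched in a few lines. This is only an illustration, not the actual control logic of either scheme; the function names and the `pending_dests` set (destination registers of instructions still in flight) are assumptions made for the sketch.

```python
def scoreboard_can_issue(unit_busy, dest, pending_dests):
    """Scoreboard issue test: functional unit free and no WAW hazard on dest."""
    return (not unit_busy) and (dest not in pending_dests)


def tomasulo_can_issue(free_stations):
    """Tomasulo issue test: only a free reservation station is required."""
    return free_stations > 0


# ADD.D F6,F8,F2 arrives while an earlier write to F6 is still pending.
pending = {"F6"}
print(scoreboard_can_issue(False, "F6", pending))  # False: WAW stall
print(tomasulo_can_issue(1))                       # True: renaming removes the WAW
```

Under these assumptions the scoreboard stalls at issue on the pending write to F6, while Tomasulo's scheme issues as long as a reservation station is free.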
![Page 5: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/5.jpg)
How Hazards are Handled
Structural Hazards ─ Reservation stations allow more instructions to be issued
RAW Hazards ─ An instruction is executed only when its operands are available
WAR and WAW Hazards ─ Register renaming eliminates these hazards by renaming all destination registers, including those with a pending read or write for an earlier instruction, so that the out-of-order write does not affect any instruction that depends on an earlier value of an operand
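A rough sketch of why renaming removes WAR and WAW hazards: every destination gets a fresh name, and each source reads whichever name was most recently assigned to its register. The temporary names `T0`, `T1`, ... are hypothetical; in Tomasulo's scheme the reservation station tags play this role. The program is the six-instruction sequence used on the Example slide.

```python
def rename(program):
    """Give every destination a fresh name; sources read the latest name."""
    latest = {}      # architectural register -> current renamed register
    renamed = []
    for n, (op, dest, srcs) in enumerate(program):
        new_srcs = [latest.get(s, s) for s in srcs]  # unrenamed regs pass through
        latest[dest] = f"T{n}"
        renamed.append((op, latest[dest], new_srcs))
    return renamed


program = [
    ("L.D",   "F6",  ["R2"]),
    ("L.D",   "F2",  ["R3"]),
    ("MUL.D", "F0",  ["F2", "F4"]),
    ("SUB.D", "F8",  ["F2", "F6"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6",  ["F8", "F2"]),
]

for op, dest, srcs in rename(program):
    print(op, dest, srcs)
```

After renaming, DIV.D reads `T0` while ADD.D writes `T5`, so the ADD.D result can be written early without disturbing the value that SUB.D and DIV.D still need.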
![Page 6: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/6.jpg)
Tags
A tag is a 4-bit quantity that denotes one of the five reservation stations or one of the six load buffers
Tag fields are found in the reservation stations, the register file, and the store buffers
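The 4-bit width can be checked with a short calculation. The sketch below assumes that code 0 is reserved to mean "no pending producer" and that the five stations are named Add1 to Add3 and Mult1 to Mult2, as in the tables later in these slides; the actual encoding used by the hardware is not given here.

```python
STATIONS = ["Add1", "Add2", "Add3", "Mult1", "Mult2"]
LOAD_BUFFERS = [f"Load{i}" for i in range(1, 7)]


def tag_table():
    """Assign tag codes 1..11; code 0 is assumed to mean 'value is ready'."""
    names = STATIONS + LOAD_BUFFERS
    return {name: code for code, name in enumerate(names, start=1)}


def bits_needed(n_codes):
    """Smallest number of bits that can encode n_codes distinct values."""
    return max(1, (n_codes - 1).bit_length())


tags = tag_table()
print(len(tags))                   # 11 named producers
print(bits_needed(len(tags) + 1))  # 4 bits cover 11 tags plus the reserved 0
```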
![Page 7: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/7.jpg)
Example
L.D F6,34(R2)
L.D F2,45(R3)
MUL.D F0,F2,F4
SUB.D F8,F2,F6
DIV.D F10,F0,F6
ADD.D F6,F8,F2
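To see which hazards this sequence contains, a naive pairwise check is enough. The sketch below is an assumption-laden illustration: it compares every earlier instruction with every later one and ignores intervening redefinitions, which does not matter for this particular sequence. Each instruction is written as a hypothetical `(op, destination, sources)` tuple.

```python
def find_hazards(program):
    """Return (kind, earlier, later, register) for every dependent pair."""
    hazards = []
    for j, (_, dst_j, srcs_j) in enumerate(program):
        for i in range(j):
            _, dst_i, srcs_i = program[i]
            if dst_i in srcs_j:          # later instruction reads earlier result
                hazards.append(("RAW", i, j, dst_i))
            if dst_j in srcs_i:          # later instruction overwrites a source
                hazards.append(("WAR", i, j, dst_j))
            if dst_i == dst_j:           # both write the same register
                hazards.append(("WAW", i, j, dst_j))
    return hazards


program = [
    ("L.D",   "F6",  ["R2"]),
    ("L.D",   "F2",  ["R3"]),
    ("MUL.D", "F0",  ["F2", "F4"]),
    ("SUB.D", "F8",  ["F2", "F6"]),
    ("DIV.D", "F10", ["F0", "F6"]),
    ("ADD.D", "F6",  ["F8", "F2"]),
]

name_hazards = [h for h in find_hazards(program) if h[0] != "RAW"]
print(name_hazards)
```

The only name hazards involve F6: ADD.D writes it after SUB.D and DIV.D read it (WAR) and after the first L.D writes it (WAW).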
![Page 8: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/8.jpg)
Three Tables
(The 1st table is not part of the hardware; the 2nd and 3rd tables are distributed)
1. Instruction status ─ indicates which of the three steps each instruction is in
2. Reservation stations ─ Busy, Op, Vj, Vk, Qj, Qk, A (V = operand value; Q = tag of the reservation station that will produce the operand; A = address information for a load or store)
3. Register status ─ indicates which reservation station will write each register
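The reservation station and register status tables can be modelled with a small data structure. The sketch below covers the issue step only; the class and method names are hypothetical, only two load buffers are modelled (matching the tables in these slides), and operand values are kept as strings such as `Reg[F4]` purely for display.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Station:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[str] = None   # operand values
    vk: Optional[str] = None
    qj: Optional[str] = None   # tags of stations that will produce the operands
    qk: Optional[str] = None
    a: Optional[str] = None    # address field for loads and stores


class Tomasulo:
    def __init__(self):
        layout = {"load": ["Load1", "Load2"],
                  "add": ["Add1", "Add2", "Add3"],
                  "mult": ["Mult1", "Mult2"]}
        self.groups = {k: [Station(n) for n in ns] for k, ns in layout.items()}
        self.reg_status = {}   # register -> tag of the station that will write it

    def _operand(self, reg):
        """Return (value, tag): a tag if the register is pending, else its value."""
        tag = self.reg_status.get(reg)
        return (None, tag) if tag else (f"Reg[{reg}]", None)

    def issue(self, kind, op, dest, src1=None, src2=None, addr=None):
        station = next((s for s in self.groups[kind] if not s.busy), None)
        if station is None:
            return None        # structural hazard: no free station, so stall
        station.busy, station.op, station.a = True, op, addr
        station.vj = station.vk = station.qj = station.qk = None
        if src1:
            station.vj, station.qj = self._operand(src1)
        if src2:
            station.vk, station.qk = self._operand(src2)
        self.reg_status[dest] = station.name   # rename dest to this station
        return station


m = Tomasulo()
m.issue("load", "Load", "F6", addr="34+Reg[R2]")
m.issue("load", "Load", "F2", addr="45+Reg[R3]")
mult1 = m.issue("mult", "Mult", "F0", "F2", "F4")
print(mult1)
print(m.reg_status)
```

Issuing the first three instructions of the example leaves Mult1 waiting on tag Load2 for its first operand and holding `Reg[F4]` as its second, with F0, F2 and F6 marked as pending in the register status.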
![Page 9: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/9.jpg)
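Figure 0.0

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | |
| L.D F2,45(R3) | √ | √ | |
| MUL.D F0,F2,F4 | √ | | |
| SUB.D F8,F2,F6 | | | |
| DIV.D F10,F0,F6 | | | |
| ADD.D F6,F8,F2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 34+Reg[R2] |
| Load2 | Yes | Load | | | | | 45+Reg[R3] |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Load2 | | |
| Mult2 | No | | | | | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Load1 | | | | | |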
![Page 10: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/10.jpg)
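Figure 0.1

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | |
| L.D F2,45(R3) | √ | √ | |
| MUL.D F0,F2,F4 | √ | | |
| SUB.D F8,F2,F6 | √ | | |
| DIV.D F10,F0,F6 | | | |
| ADD.D F6,F8,F2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 34+Reg[R2] |
| Load2 | Yes | Load | | | | | 45+Reg[R3] |
| Add1 | Yes | Sub | | | Load2 | Load1 | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Load2 | | |
| Mult2 | No | | | | | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Load1 | Add1 | | | | |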
![Page 11: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/11.jpg)
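Figure 0.2 (Suppose LD is slow)

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | |
| L.D F2,45(R3) | √ | √ | |
| MUL.D F0,F2,F4 | √ | | |
| SUB.D F8,F2,F6 | √ | | |
| DIV.D F10,F0,F6 | √ | | |
| ADD.D F6,F8,F2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 34+Reg[R2] |
| Load2 | Yes | Load | | | | | 45+Reg[R3] |
| Add1 | Yes | Sub | | | Load2 | Load1 | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Load2 | | |
| Mult2 | Yes | Div | | | Mult1 | Load1 | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Load1 | Add1 | Mult2 | | | |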
![Page 12: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/12.jpg)
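Figure 0.3 (Suppose LD is slow)

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | |
| L.D F2,45(R3) | √ | √ | |
| MUL.D F0,F2,F4 | √ | | |
| SUB.D F8,F2,F6 | √ | | |
| DIV.D F10,F0,F6 | √ | | |
| ADD.D F6,F8,F2 | √ | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | 34+Reg[R2] |
| Load2 | Yes | Load | | | | | 45+Reg[R3] |
| Add1 | Yes | Sub | | | Load2 | Load1 | |
| Add2 | Yes | Add | | | Add1 | Load2 | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Load2 | | |
| Mult2 | Yes | Div | | | Mult1 | Load1 | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Add2 | Add1 | Mult2 | | | |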
![Page 13: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/13.jpg)
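Figure 3.3

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | √ |
| L.D F2,45(R3) | √ | √ | |
| MUL.D F0,F2,F4 | √ | | |
| SUB.D F8,F2,F6 | √ | | |
| DIV.D F10,F0,F6 | √ | | |
| ADD.D F6,F8,F2 | √ | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | No | | | | | | |
| Load2 | Yes | Load | | | | | 45+Reg[R3] |
| Add1 | Yes | Sub | | Mem[34+Reg[R2]] | Load2 | | |
| Add2 | Yes | Add | | | Add1 | Load2 | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F4] | Load2 | | |
| Mult2 | Yes | Div | | Mem[34+Reg[R2]] | Mult1 | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | Load2 | | Add2 | Add1 | Mult2 | | | |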
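The step from Figure 0.3 to Figure 3.3 is a single write on the CDB by Load1. A minimal sketch of that broadcast follows, assuming the stations are plain dictionaries and values are display strings; `broadcast` and `reg_file` are hypothetical names introduced for the illustration.

```python
def broadcast(stations, reg_status, reg_file, tag, value):
    """Put (tag, value) on the CDB: waiting stations and registers capture it."""
    for s in stations.values():
        if s.get("Qj") == tag:
            s["Vj"], s["Qj"] = value, None
        if s.get("Qk") == tag:
            s["Vk"], s["Qk"] = value, None
    for reg in [r for r, producer in reg_status.items() if producer == tag]:
        reg_file[reg] = value      # register is written only if still owned by tag
        del reg_status[reg]
    stations[tag] = {"Busy": False}   # the producing station or buffer is freed


# State of Figure 0.3, just before Load1 writes its result.
stations = {
    "Load1": {"Busy": True, "Op": "Load", "A": "34+Reg[R2]"},
    "Load2": {"Busy": True, "Op": "Load", "A": "45+Reg[R3]"},
    "Add1":  {"Busy": True, "Op": "Sub", "Qj": "Load2", "Qk": "Load1"},
    "Add2":  {"Busy": True, "Op": "Add", "Qj": "Add1", "Qk": "Load2"},
    "Mult1": {"Busy": True, "Op": "Mult", "Vk": "Reg[F4]", "Qj": "Load2"},
    "Mult2": {"Busy": True, "Op": "Div", "Qj": "Mult1", "Qk": "Load1"},
}
reg_status = {"F0": "Mult1", "F2": "Load2", "F6": "Add2", "F8": "Add1", "F10": "Mult2"}
reg_file = {}

broadcast(stations, reg_status, reg_file, "Load1", "Mem[34+Reg[R2]]")
print(stations["Add1"])
print(stations["Mult2"])
print(reg_status)
```

Add1 and Mult2 capture the loaded value in Vk, while F6 is left untouched because its register status already names Add2. That is how the WAW hazard between the first L.D and ADD.D is avoided.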
![Page 14: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/14.jpg)
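Figure 0.4 (2nd load just completes)

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | √ |
| L.D F2,45(R3) | √ | √ | √ |
| MUL.D F0,F2,F4 | √ | √ | |
| SUB.D F8,F2,F6 | √ | √ | |
| DIV.D F10,F0,F6 | √ | | |
| ADD.D F6,F8,F2 | √ | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | No | | | | | | |
| Load2 | No | | | | | | |
| Add1 | Yes | Sub | Mem[45+Reg[R3]] | Mem[34+Reg[R2]] | | | |
| Add2 | Yes | Add | | Mem[45+Reg[R3]] | Add1 | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | Mem[45+Reg[R3]] | Reg[F4] | | | |
| Mult2 | Yes | Div | | Mem[34+Reg[R2]] | Mult1 | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | | | Add2 | Add1 | Mult2 | | | |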
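Once the second load has written its result, the execute condition (both operands available) can be read straight off the Q fields. The sketch below encodes the Figure 0.4 reservation stations as dictionaries; the helper name `ready_to_execute` is an assumption made for the illustration.

```python
def ready_to_execute(stations):
    """A busy station may start executing once neither Q field holds a tag."""
    return [name for name, s in stations.items()
            if s["Busy"] and s.get("Qj") is None and s.get("Qk") is None]


# Reservation stations of Figure 0.4 (arithmetic units only).
stations = {
    "Add1":  {"Busy": True, "Op": "Sub",
              "Vj": "Mem[45+Reg[R3]]", "Vk": "Mem[34+Reg[R2]]"},
    "Add2":  {"Busy": True, "Op": "Add",
              "Vk": "Mem[45+Reg[R3]]", "Qj": "Add1"},
    "Add3":  {"Busy": False},
    "Mult1": {"Busy": True, "Op": "Mult",
              "Vj": "Mem[45+Reg[R3]]", "Vk": "Reg[F4]"},
    "Mult2": {"Busy": True, "Op": "Div",
              "Vk": "Mem[34+Reg[R2]]", "Qj": "Mult1"},
}

print(ready_to_execute(stations))  # ['Add1', 'Mult1']
```

Add1 and Mult1 can execute, which matches the Execute marks for SUB.D and MUL.D; Add2 and Mult2 keep waiting for the tags Add1 and Mult1.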
![Page 15: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/15.jpg)
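Figure 3.4

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F6,34(R2) | √ | √ | √ |
| L.D F2,45(R3) | √ | √ | √ |
| MUL.D F0,F2,F4 | √ | √ | |
| SUB.D F8,F2,F6 | √ | √ | √ |
| DIV.D F10,F0,F6 | √ | | |
| ADD.D F6,F8,F2 | √ | √ | √ |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | No | | | | | | |
| Load2 | No | | | | | | |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | Mem[45+Reg[R3]] | Reg[F4] | | | |
| Mult2 | Yes | Div | | Mem[34+Reg[R2]] | Mult1 | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 | | | | | Mult2 | | | |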
![Page 16: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/16.jpg)
Loop-Based Example
Loop: L.D F0,0(R1)
MUL.D F4,F0,F2
S.D F4,0(R1)
DADDIU R1,R1,#−8
BNE R1,R2,Loop
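For readers less familiar with MIPS, the loop multiplies each double-precision array element by the scalar in F2, walking down in memory 8 bytes at a time until R1 equals R2. A rough Python model is given below; memory is represented as a hypothetical dictionary from byte address to value, and the addresses in the usage lines are made up for the example.

```python
def run_loop(mem, r1, r2, f2):
    """Model of the loop: scale mem[r1], mem[r1-8], ... down to mem[r2+8] by f2."""
    while True:
        f0 = mem[r1]       # L.D    F0,0(R1)
        f4 = f0 * f2       # MUL.D  F4,F0,F2
        mem[r1] = f4       # S.D    F4,0(R1)
        r1 -= 8            # DADDIU R1,R1,#-8
        if r1 == r2:       # BNE    R1,R2,Loop falls through when R1 == R2
            break
    return mem


mem = {8: 1.0, 16: 2.0, 24: 3.0}
print(run_loop(mem, r1=24, r2=0, f2=2.0))  # {8: 2.0, 16: 4.0, 24: 6.0}
```

Each iteration touches a different address, so successive iterations are independent apart from the reuse of the register names F0 and F4, which is exactly what renaming removes.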
![Page 17: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/17.jpg)
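Figure 0.5. One active iteration of loop

| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ | |
| MUL.D F4,F0,F2 | 1 | √ | | |
| S.D F4,0(R1) | 1 | √ | | |
| L.D F0,0(R1) | 2 | | | |
| MUL.D F4,F0,F2 | 2 | | | |
| S.D F4,0(R1) | 2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | Reg[R1] |
| Load2 | No | | | | | | |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F2] | Load1 | | |
| Mult2 | No | | | | | | |
| Store1 | Yes | Store | | | | Mult1 | Reg[R1] |
| Store2 | No | | | | | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Load1 | | Mult1 | | | | | | |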
![Page 18: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/18.jpg)
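Figure 0.6. One+ active iteration of loop

| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ | |
| MUL.D F4,F0,F2 | 1 | √ | | |
| S.D F4,0(R1) | 1 | √ | | |
| L.D F0,0(R1) | 2 | √ | | |
| MUL.D F4,F0,F2 | 2 | | | |
| S.D F4,0(R1) | 2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | Reg[R1] |
| Load2 | Yes | Load | | | | | Reg[R1]-8 |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F2] | Load1 | | |
| Mult2 | No | | | | | | |
| Store1 | Yes | Store | | | | Mult1 | Reg[R1] |
| Store2 | No | | | | | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Load2 | | Mult1 | | | | | | |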
![Page 19: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/19.jpg)
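Figure 0.7. One++ active iteration of loop

| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ | |
| MUL.D F4,F0,F2 | 1 | √ | | |
| S.D F4,0(R1) | 1 | √ | | |
| L.D F0,0(R1) | 2 | √ | √ | |
| MUL.D F4,F0,F2 | 2 | √ | | |
| S.D F4,0(R1) | 2 | | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | Reg[R1] |
| Load2 | Yes | Load | | | | | Reg[R1]-8 |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F2] | Load1 | | |
| Mult2 | Yes | Mult | | Reg[F2] | Load2 | | |
| Store1 | Yes | Store | | | | Mult1 | Reg[R1] |
| Store2 | No | | | | | | |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Load2 | | Mult2 | | | | | | |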
![Page 20: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/20.jpg)
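Figure 3.6. Two active iterations of loop

| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ | |
| MUL.D F4,F0,F2 | 1 | √ | | |
| S.D F4,0(R1) | 1 | √ | | |
| L.D F0,0(R1) | 2 | √ | √ | |
| MUL.D F4,F0,F2 | 2 | √ | | |
| S.D F4,0(R1) | 2 | √ | | |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load | | | | | Reg[R1] |
| Load2 | Yes | Load | | | | | Reg[R1]-8 |
| Add1 | No | | | | | | |
| Add2 | No | | | | | | |
| Add3 | No | | | | | | |
| Mult1 | Yes | Mult | | Reg[F2] | Load1 | | |
| Mult2 | Yes | Mult | | Reg[F2] | Load2 | | |
| Store1 | Yes | Store | | | | Mult1 | Reg[R1] |
| Store2 | Yes | Store | | | | Mult2 | Reg[R1]-8 |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Load2 | | Mult2 | | | | | | |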
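With two iterations in flight, the second load is active while the first store is still waiting for Mult1, so the hardware has to confirm that the two do not touch the same address. A minimal sketch of such an address check follows; the function names and the concrete value chosen for R1 are assumptions, and a real implementation compares addresses held in the load and store buffers.

```python
def load_can_proceed(load_addr, earlier_store_addrs):
    """A load may go ahead only if no earlier pending store uses its address."""
    return load_addr not in earlier_store_addrs


def store_can_proceed(store_addr, earlier_load_addrs, earlier_store_addrs):
    """A store must wait for earlier loads and stores to the same address."""
    return (store_addr not in earlier_load_addrs
            and store_addr not in earlier_store_addrs)


r1 = 1000                    # hypothetical value of R1 in iteration 1
store1_addr = r1             # S.D F4,0(R1), iteration 1
load2_addr = r1 - 8          # L.D F0,0(R1), iteration 2 (after R1 = R1 - 8)

print(load_can_proceed(load2_addr, [store1_addr]))   # True: different addresses
print(load_can_proceed(store1_addr, [store1_addr]))  # False: would read stale data
```

In the loop the addresses Reg[R1] and Reg[R1]-8 differ, so the second iteration's load does not need to wait for the first iteration's store.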
![Page 21: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/21.jpg)
IBM 360/91
Great ideas:
Data tagging
Register renaming
Dynamic detection of memory hazards
Generalized forwarding
Ideas broadly used now in microprocessors
Was 360/91 successful commercially?
![Page 22: CSC 4250 Computer Architectures October 17, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649d4d5503460f94a2bccc/html5/thumbnails/22.jpg)
IBM 360/85 (1968)
First commercial computer with a cache:
Slower clock time (80 ns versus 60 ns)
Less memory interleaving (4 versus 16)
Slower main memory (1.04 μs versus 0.75 μs)
Cheaper in price
Which machine was faster on applications?