CSC 4250 Computer Architectures
October 27, 2006
Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation
Nested Loops
        DADDIU R1,R0,#80
Loop1:  L.D    F2,1600(R1)
        DADDIU R2,R0,#40
Loop2:  L.D    F0,1000(R2)
        ADD.D  F0,F0,F2
        S.D    F0,1000(R2)
        DADDIU R2,R2,#-8
        BNEZ   R2,Loop2
        DADDIU R1,R1,#-8
        BNEZ   R1,Loop1
How many times do Loop1 and Loop2 iterate?
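One way to answer the question is to mimic the two counters directly. The sketch below is a Python model added for illustration, not part of the original slides. It assumes that each DADDIU subtracts 8 and that each BNEZ branches back while the register is nonzero, as in the listing; the helper name `count_iterations` is hypothetical.

```python
def count_iterations(start, step=8):
    """Count passes through a loop body that decrements a register by `step`
    and branches back while the register is nonzero (DADDIU followed by BNEZ).
    Assumes `start` is a positive multiple of `step`."""
    reg, passes = start, 0
    while True:
        passes += 1      # the loop body executes once
        reg -= step      # DADDIU Rx,Rx,#-8
        if reg == 0:     # BNEZ falls through when the register reaches 0
            return passes

loop1_passes = count_iterations(80)            # outer loop: R1 starts at 80
loop2_passes_per_outer = count_iterations(40)  # inner loop: R2 starts at 40
total_loop2_passes = loop1_passes * loop2_passes_per_outer
print(loop1_passes, loop2_passes_per_outer, total_loop2_passes)
```

Under these assumptions Loop1 runs 10 times and Loop2 runs 5 times per pass of Loop1, which agrees with the five-branch groups (TTTTN) used in the branch histories that follow.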
BNEZ R2,Loop2
Branch history: TTTTN|TTTTN|TTTTN|TTTTN|… (T means branch taken; N means branch not taken.)
1-bit predictor: TTTTT|NTTTT|NTTTT|NTTTT|… → two errors per pass through Loop2.
2-bit predictor: TTTTT|TTTTT|TTTTT|TTTTT|… → one error per pass through Loop2.
The error behavior for Loop1 is similar.
Put more bits in the counter to improve error behavior?
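The prediction strings above can be reproduced with a small simulation. This is a minimal Python sketch, not taken from the slides; it assumes the 1-bit predictor starts out predicting taken and the 2-bit saturating counter starts at 3 (strongly taken). The function names are made up for illustration.

```python
def predict_1bit(history, state="T"):
    """1-bit predictor: predict whatever the branch did last time."""
    predictions = []
    for outcome in history:
        predictions.append(state)
        state = outcome                  # remember only the latest outcome
    return "".join(predictions)

def predict_2bit(history, counter=3):
    """2-bit saturating counter: values 2 and 3 predict taken, 0 and 1 not taken."""
    predictions = []
    for outcome in history:
        predictions.append("T" if counter >= 2 else "N")
        if outcome == "T":
            counter = min(3, counter + 1)
        else:
            counter = max(0, counter - 1)
    return "".join(predictions)

def mispredictions(history, predictions):
    return sum(h != p for h, p in zip(history, predictions))

history = "TTTTN" * 4                    # four passes through Loop2
one_bit = predict_1bit(history)
two_bit = predict_2bit(history)
print(one_bit, mispredictions(history, one_bit))
print(two_bit, mispredictions(history, two_bit))
```

With these starting states the 1-bit predictor misses twice per pass after the first pass, while the 2-bit counter never drops below 2 and misses only the loop exit.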
Global Branch History
Global branch history:
Branch: TTTTN|T|TTTTN|T|TTTTN|T|TTTTN|T| …
Loop:   22222|1|22222|1|22222|1|22222|1| …
Can we use global branch history to get a better result?
(On previous slide, we looked at local branch history.)
5-Bit Global Branch History
We keep a 5-bit global branch history, and use the bit pattern to choose one of 2^5 = 32 1-bit predictors:

| Last 5 branches | Prediction |
|---|---|
| TTTTT | N |
| TTTTN | T |
| TTTNT | T |
| TTNTT | T |
| TNTTT | T |
| NTTTT | T |
| … | … |
| NNNNN | T |

We get 100% accuracy in the steady state.
This strategy works if at least 5 bits are used.
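The claim that 5 history bits suffice, and that fewer do not, can be checked by simulation. The Python sketch below is an illustration under stated assumptions: the global branch stream repeats TTTTNT (five Loop2 branches, then one Loop1 branch), the history starts as all N, and a pattern that has never been seen predicts N. The name `steady_state_errors` is hypothetical.

```python
def steady_state_errors(stream, bits, warmup):
    """Count mispredictions after `warmup` branches for a table of 1-bit
    predictors selected by the last `bits` global outcomes (bits >= 1)."""
    table = {}               # history pattern -> outcome last seen after it
    history = "N" * bits     # assumed initial history: all not taken
    errors = 0
    for i, outcome in enumerate(stream):
        prediction = table.get(history, "N")    # unseen pattern predicts N (assumption)
        if i >= warmup and prediction != outcome:
            errors += 1
        table[history] = outcome                # 1-bit update: remember the last outcome
        history = (history + outcome)[-bits:]   # shift the new outcome into the history
    return errors

stream = "TTTTNT" * 20       # 20 passes: TTTTN from Loop2, then T from Loop1
errors_5_bits = steady_state_errors(stream, bits=5, warmup=30)
errors_4_bits = steady_state_errors(stream, bits=4, warmup=30)
print(errors_5_bits, errors_4_bits)
```

With 4 bits the pattern TTTT is followed sometimes by T and sometimes by N, so a single 1-bit entry cannot capture both cases; with 5 bits every pattern in the repeating stream has a unique successor.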
Correlating Branch Predictors (p. 200)
A 2-bit predictor uses only the recent behavior of a single branch.
SPEC92 benchmark eqntott (the worst case in Figures 3.8 and 3.9 with an 18% error rate):
if (aa==2)
    aa=0;
if (bb==2)
    bb=0;
if (aa!=bb) {
MIPS Code
Assume that aa and bb are assigned to R1 and R2:
        DSUBUI R3,R1,#2
        BNEZ   R3,L1      ;branch b1 (aa!=2)
        DADD   R1,R0,R0   ;aa=0
L1:     DSUBUI R3,R2,#2
        BNEZ   R3,L2      ;branch b2 (bb!=2)
        DADD   R2,R0,R0   ;bb=0
L2:     DSUBU  R3,R1,R2   ;R3=aa-bb
        BEQZ   R3,L3      ;branch b3 (aa==bb)
Consider the branches. The behavior of branch b3 is correlated with the behavior of branches b1 and b2: if both b1 and b2 are not taken, then b3 will be taken (as aa and bb have both been set to 0 and are therefore equal).
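The correlation can be confirmed by enumerating inputs. The following Python sketch mirrors the three branches of the MIPS fragment; the value range 0 to 3 for aa and bb is an assumption chosen only to keep the enumeration small, and `branch_outcomes` is a hypothetical helper.

```python
def branch_outcomes(aa, bb):
    """Return (b1, b2, b3) as booleans, where True means the branch is taken."""
    b1 = aa != 2       # BNEZ R3,L1 is taken when aa != 2 (skips aa=0)
    if not b1:
        aa = 0
    b2 = bb != 2       # BNEZ R3,L2 is taken when bb != 2 (skips bb=0)
    if not b2:
        bb = 0
    b3 = aa == bb      # BEQZ R3,L3 is taken when aa - bb == 0
    return b1, b2, b3

# Collect every input for which neither b1 nor b2 is taken.
both_not_taken = []
for aa in range(4):
    for bb in range(4):
        b1, b2, b3 = branch_outcomes(aa, bb)
        if not b1 and not b2:
            both_not_taken.append((aa, bb, b3))
print(both_not_taken)
```

In every such case b3 is taken, which a predictor that looks only at the history of b3 cannot exploit.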
Simplified Example (p. 202)
Suppose that d has values 0, 1, and 2:
if (d==0)
    d=1;
if (d==1)
MIPS Code: Assume that d is assigned to R1:
        BNEZ   R1,L1      ;branch b1 (d!=0)
        DADDUI R1,R0,#1   ;d==0, so d=1
L1:     DADDUI R3,R1,#-1
        BNEZ   R3,L2      ;branch b2 (d!=1)
        …
L2:
Figure 3.10. Possible execution sequence

| Initial value of d | d==0? | b1 | Value of d before b2 | d==1? | b2 |
|---|---|---|---|---|---|
| 0 | Yes | NT | 1 | Yes | NT |
| 1 | No | T | 1 | Yes | NT |
| 2 | No | T | 2 | No | T |
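The rows of Figure 3.10 follow mechanically from the code fragment. As a rough check, this Python sketch traces the fragment for each initial value of d; the function name `trace` is hypothetical and the branch senses are taken from the MIPS comments (b1 is taken when d != 0, b2 is taken when d != 1).

```python
def trace(d):
    """Return (b1 action, value of d before b2, b2 action) for one execution."""
    b1 = "T" if d != 0 else "NT"   # BNEZ R1,L1
    if d == 0:
        d = 1                      # fall-through path: d=1
    b2 = "T" if d != 1 else "NT"   # BNEZ R3,L2 with R3 = d - 1
    return b1, d, b2

rows = [(d0, *trace(d0)) for d0 in (0, 1, 2)]
for row in rows:
    print(row)
```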
Figure 3.11. Behavior of 1-bit predictor initialized to NT
Suppose that d = 2, 0, 2, 0, …

| d=? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction |
|---|---|---|---|---|---|---|
| 2 | NT | T | T | NT | T | T |
| 0 | T | NT | NT | T | NT | NT |
| 2 | NT | T | T | NT | T | T |
| 0 | T | NT | NT | T | NT | NT |

Misprediction Rate = 100%!
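The 100% misprediction rate can be reproduced with a few lines of simulation. This Python sketch is illustrative only: it gives b1 and b2 one independent 1-bit predictor each, both initialized to NT as in the figure, and the helper names are made up.

```python
def branch_actions(d):
    """Actions of b1 and b2 for one execution of the fragment."""
    b1 = "T" if d != 0 else "NT"
    if d == 0:
        d = 1
    b2 = "T" if d != 1 else "NT"
    return b1, b2

def simulate_one_bit(d_values):
    """Return (mispredictions, total branches) for per-branch 1-bit predictors."""
    prediction = {"b1": "NT", "b2": "NT"}   # both predictors start at NT
    wrong = total = 0
    for d in d_values:
        for name, action in zip(("b1", "b2"), branch_actions(d)):
            total += 1
            if prediction[name] != action:
                wrong += 1
            prediction[name] = action       # 1-bit update: copy the last action
    return wrong, total

wrong, total = simulate_one_bit([2, 0, 2, 0])
print(wrong, total)
```

Because d alternates between 2 and 0, each branch flips direction on every execution, so "same as last time" is always the wrong guess.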
Figure 3.12. Meaning of Prediction Bits

| Prediction bits | Prediction if last branch not taken | Prediction if last branch taken |
|---|---|---|
| NT/NT | NT | NT |
| NT/T | NT | T |
| T/NT | T | NT |
| T/T | T | T |
Fig. 3.13. Action of 1-bit predictor with 1 bit of correlation. Initialized to NT/NT

| d=? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction |
|---|---|---|---|---|---|---|
| 2 | NT/NT | T | T/NT | NT/NT | T | NT/T |
| 0 | T/NT | NT | T/NT | NT/T | NT | NT/T |
| 2 | T/NT | T | T/NT | NT/T | T | NT/T |
| 0 | T/NT | NT | T/NT | NT/T | NT | NT/T |
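The table can be regenerated by simulating a (1,1) predictor. The Python sketch below is an illustration, assuming the "last branch" is the most recently executed branch of either b1 or b2 and that it starts out as not taken; the data layout and names are hypothetical. Each branch keeps two 1-bit predictions, written X/Y in the figure, where X is used if the last branch was not taken and Y if it was taken.

```python
def branch_actions(d):
    """Actions of b1 and b2 for one execution of the fragment."""
    b1 = "T" if d != 0 else "NT"
    if d == 0:
        d = 1
    b2 = "T" if d != 1 else "NT"
    return b1, b2

def simulate_correlating(d_values):
    """Return (per-branch misprediction flags, final prediction bits)."""
    # bits[branch][last outcome] -> prediction; everything starts at NT/NT
    bits = {name: {"NT": "NT", "T": "NT"} for name in ("b1", "b2")}
    last = "NT"                              # assumed initial global history
    missed = []
    for d in d_values:
        for name, action in zip(("b1", "b2"), branch_actions(d)):
            missed.append(bits[name][last] != action)
            bits[name][last] = action        # update only the selected bit
            last = action                    # this branch is now the last branch
    return missed, bits

missed, bits = simulate_correlating([2, 0, 2, 0])
print(missed)
print({name: f"{b['NT']}/{b['T']}" for name, b in bits.items()})
```

Only the two branches of the first iteration are mispredicted; after that the predictors settle at T/NT for b1 and NT/T for b2, as in the last rows of the table.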
Figure 3.14. A (2,2) Branch Prediction Buffer
This buffer uses a 2-bit global history to choose from among 2^2 = 4 predictors for each branch address. Each predictor is in turn a 2-bit predictor for that branch.
Figure 3.12 shows a (1,1) branch prediction buffer.
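A general (m,n) buffer can be sketched as a table with one row per branch-address index and 2^m n-bit saturating counters per row. The Python class below is a minimal illustration, not the design of any particular processor; the class name, the modulo indexing by `pc`, and the all-zero initial state are assumptions.

```python
class CorrelatingPredictor:
    """(m,n) predictor: m bits of global history select one of 2**m
    n-bit saturating counters in the row chosen by the branch address."""

    def __init__(self, m, n, entries):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0                       # last m outcomes packed into an integer
        self.max_count = (1 << n) - 1
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        counter = self.table[pc % self.entries][self.history]
        return counter > self.max_count // 2   # upper half of the range means taken

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        if taken:
            row[self.history] = min(self.max_count, row[self.history] + 1)
        else:
            row[self.history] = max(0, row[self.history] - 1)
        # Shift the outcome into the global history and keep only m bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    def storage_bits(self):
        return (1 << self.m) * self.n * self.entries

predictor = CorrelatingPredictor(m=2, n=2, entries=1024)
late_misses = 0
for i in range(20):                            # one branch alternating T, N, T, N, ...
    taken = (i % 2 == 0)
    if i >= 10 and predictor.predict(pc=0) != taken:
        late_misses += 1
    predictor.update(pc=0, taken=taken)
print(predictor.storage_bits(), late_misses)
```

The storage cost is 2^m × n × (number of entries selected by the branch address), so a (2,2) buffer with 1024 entries uses 8K bits.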
Figure 3.15. Comparison of 2-bit Predictors
Tournament Predictors (p. 206)
Adaptively combine local and global predictors.
Alpha 21264 has a tournament predictor using 4K 2-bit counters indexed by the local branch address to choose between a global predictor and a local predictor.
The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor.
The local predictor consists of a 2-level predictor. The top level is a local history table consisting of 1024 10-bit entries. Each entry is used to index a table of 1K entries consisting of 3-bit saturating counters, providing the local prediction.
(Total = 29K bits. For SPECfp95 benchmarks, fewer than 1 misprediction per 1000 completed instructions.)
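The 29K-bit total follows from the sizes quoted above. A quick arithmetic check in Python, with variable names chosen only for this illustration:

```python
K = 1024

chooser_bits = 4 * K * 2          # 4K 2-bit counters that pick a predictor
global_bits = 4 * K * 2           # 4K entries, each a 2-bit predictor
local_history_bits = 1024 * 10    # 1024 local history entries of 10 bits
local_counter_bits = 1 * K * 3    # 1K 3-bit saturating counters

total_bits = chooser_bits + global_bits + local_history_bits + local_counter_bits
global_entries = 2 ** 12          # 12 bits of global history index the 4K entries
print(total_bits // K, "K bits")
```

The 10-bit local history entries likewise index exactly 2^10 = 1K counters.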
Fig. 3.16. State Transition Diagram for Tournament Predictor
The counter is incremented whenever the “predicted” predictor is correct and the other predictor is incorrect, and it is decremented in the reverse situation.
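The selector can be modeled as a 2-bit saturating counter. The Python sketch below fixes one orientation as an assumption: high counter values favor the global predictor, so the counter moves up when only the global predictor is right and down when only the local predictor is right. The class and method names are hypothetical, and the figure's exact state labels are not reproduced here.

```python
class TournamentChooser:
    """2-bit saturating selector between a local and a global predictor."""

    def __init__(self):
        self.counter = 0      # assumed start: strongly prefer the local predictor

    def selected(self):
        return "global" if self.counter >= 2 else "local"

    def update(self, local_correct, global_correct):
        # The counter moves only when exactly one of the predictors was right.
        if global_correct and not local_correct:
            self.counter = min(3, self.counter + 1)
        elif local_correct and not global_correct:
            self.counter = max(0, self.counter - 1)

chooser = TournamentChooser()
chooser.update(local_correct=False, global_correct=True)
after_one = chooser.selected()
chooser.update(local_correct=False, global_correct=True)
after_two = chooser.selected()
chooser.update(local_correct=True, global_correct=True)    # both right: no change
after_tie = chooser.selected()
print(after_one, after_two, after_tie)
```

As with a 2-bit branch predictor, two consecutive disagreements in the same direction are needed before the selection changes.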
Figure 3.17. Fraction of predictions from local predictor for a tournament predictor using SPEC89
Figure 3.18. Misprediction rates for three different predictors on SPEC89 as total # of bits is increased