Ch 03. Instruction-Level Parallelism and Its Exploitation (1)
This is related to processor design.
Computer Architecture
Professor Yong Ho Song
Spring 2015
Instruction-Level Parallelism and Its Exploitation
Preliminary
Pipelining
Multiple instructions are overlapped in execution
E.g. automobile assembly line
Multiple different steps
Pipe stage or pipe segment
Cycle time
Throughput improvement
How many cars are assembled in an hour?
Balanced pipeline
Ideal cycle time = time per instruction (unpipelined) / # stages
Hard to perfectly balance a pipeline
► Causing overhead in time
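The ideal cycle-time formula above can be illustrated with a quick calculation (the 10 ns figure is hypothetical):

```python
def ideal_cycle_time(unpipelined_time, n_stages):
    """Balanced pipeline: each stage gets an equal share of the work."""
    return unpipelined_time / n_stages

# Hypothetical example: a 10 ns unpipelined instruction split across 5 stages
cycle = ideal_cycle_time(10.0, 5)      # 2.0 ns per stage
speedup = 10.0 / cycle                 # ideal throughput speedup = stage count
```

In practice the stages cannot be perfectly balanced, so the real cycle time is set by the slowest stage plus register overhead.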
RISC Instruction Set Basics
Properties
All operations assume operands are already loaded in registers
Memory accesses are done via load/store operations
All instructions are 32 bits long
3 classes of instructions
ALU instructions
Load/Store instructions
Branch and Jump instructions
Five Execution Steps
Instruction Fetch (IF)
Instruction Decode and Register Fetch (ID)
Execution, Memory Address Computation, or Branch Completion (EX)
Memory Access (MEM)
Write-back (WB)
Step 1: Instruction Fetch
Use PC to get the instruction and put it in the Instruction Register (IR).
Increment the PC by 4 and put the result back in the PC.
Can be described succinctly using RTL ("Register-Transfer Language"):
IR = Memory[PC];
PC = PC + 4;
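The two RTL statements can be mirrored in a tiny sketch (the instruction encodings here are placeholder strings, not real machine words):

```python
# Instruction memory keyed by byte address; contents are placeholders.
memory = {0: "lw r2,0(r1)", 4: "daddiu r2,r2,1"}

def instruction_fetch(pc):
    """IR = Memory[PC]; PC = PC + 4 -- the two RTL statements above."""
    ir = memory[pc]
    return ir, pc + 4

ir, pc = instruction_fetch(0)          # fetch the first instruction
```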
Step 2: Instruction Decode and Register Fetch
Read registers rs and rt in case we need them.
Compute the branch address in case the instruction is a branch.
RTL:
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extend(IR[15-0]) << 2);
We aren't setting any control lines based on the instruction type
(we are busy "decoding" it in our control logic).
Step 3 (instruction dependent)
ALU is performing one of three functions, based on instruction type
Memory Reference: ALUOut = A + sign-extend(IR[15-0]);
R-type: ALUOut = A op B;
Branch: if (A==B) PC = ALUOut;
Jump: PC = PC [31-28] || (IR[25-0] << 2);
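As a sketch, the address arithmetic in these RTL fragments can be checked directly (values are illustrative; Python integers stand in for 32-bit registers):

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a signed integer."""
    return imm - 0x10000 if imm & 0x8000 else imm

def jump_target(pc, ir):
    """PC = PC[31-28] || (IR[25-0] << 2), as in the jump RTL above."""
    return (pc & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)

addr = 100 + sign_extend16(0xFFF8)            # memory ref: A + sign-extend(-8)
target = jump_target(0x10000000, 0x00000001)  # keep top 4 PC bits, shift index
```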
Step 4 (R-type or memory-access)
Loads and stores access memory:
MDR = Memory[ALUOut];
or
Memory[ALUOut] = B;
R-type instructions finish:
Reg[IR[15-11]] = ALUOut;
The write actually takes place at the end of the cycle, on the clock edge.
Step 5 (Write-back Step)
Reg[IR[20-16]] = MDR;
What about all the other instructions?
Simple RISC Pipeline
Data Path Shifted in Time
Pipeline Registers for Temporary Storage
Performance Issues in Pipelining
Example
Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations.
Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively.
Suppose that, due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock.
Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
Performance Issues in Pipelining
Solution
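The solution slide survives only as an image; the arithmetic it presumably showed follows directly from the numbers given and can be reconstructed:

```python
# Unpipelined: 1 ns cycle; ALU 40% x 4 cycles, branch 20% x 4, memory 40% x 5
avg_unpipelined = 1.0 * (0.4 * 4 + 0.2 * 4 + 0.4 * 5)  # 4.4 ns per instruction
# Pipelined: one instruction per cycle, but the clock stretches to 1 + 0.2 ns
pipelined = 1.0 + 0.2
speedup = avg_unpipelined / pipelined                  # about 3.7x
```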
Pipeline Hazards
Structural Hazards
arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution
Data Hazards
arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline
Control Hazards
arise from the pipelining of branches and other instructions that change the PC
Structural Hazard
Data Hazard
Forwarding for Data Hazard
Forwarding for Load/Store
Forwarding Is Not a Panacea
MIPS Data Path w/o Pipeline
MIPS Data Path w/ Pipeline
Data Forwarding Path
Re-design for Reducing Control Hazard
Multi-cycle Memory Access
Forwarding from Memory Access
Forwarding from Branch
Pipeline Performance
Instruction-Level Parallelism
Introduction
Pipelining became a universal technique in 1985
Overlaps execution of instructions
Exploits "Instruction-Level Parallelism"
Beyond this, there are two main approaches:
Hardware-based dynamic approaches
► Used in server and desktop processors
► Not used as extensively in PMD (personal mobile device) processors
Compiler-based static approaches
► Not as successful outside of scientific applications
Instruction-Level Parallelism
When exploiting instruction-level parallelism, the goal is to minimize CPI
Pipeline CPI =
► Ideal pipeline CPI +
► Structural stalls +
► Data hazard stalls +
► Control stalls
Parallelism within a basic block is limited
Typical size of a basic block = 3-6 instructions
Must optimize across branches
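As an illustration of the decomposition above, with hypothetical per-instruction stall counts:

```python
# Hypothetical stall contributions, in cycles per instruction
ideal_cpi  = 1.0
structural = 0.00
data_stall = 0.05
control    = 0.15
# Pipeline CPI = ideal CPI + structural + data-hazard + control stalls
pipeline_cpi = ideal_cpi + structural + data_stall + control
```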
Data Dependence
Loop-Level Parallelism
Unroll loop statically or dynamically
Use SIMD (vector processors and GPUs)
Challenges:
Data dependency
► Instruction j is data dependent on instruction i if:
instruction i produces a result that may be used by instruction j, or
instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
Dependent instructions cannot be executed
simultaneously
Data Dependence
Dependencies are a property of programs
Pipeline organization determines if dependence is
detected and if it causes a stall
Data dependence conveys:
Possibility of a hazard
Order in which results must be calculated
Upper bound on exploitable instruction level parallelism
Dependencies that flow through memory locations are
difficult to detect
Name Dependence
Two instructions use the same name but with no flow of information
Not a true data dependence, but a problem when reordering instructions
Antidependence: instruction j writes a register or memory location that instruction i reads
► Initial ordering (i before j) must be preserved
Output dependence: instruction i and instruction j write the same register or memory location
► Ordering must be preserved
To resolve, use renaming techniques
Other Factors
Data Hazards
Read after write (RAW)
Write after write (WAW)
Write after read (WAR)
Control Dependence
Ordering of instruction i with respect to a branch instruction
► An instruction control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
► An instruction not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch
Examples
OR instruction dependent on DADDU and DSUBU
Assume R4 isn't used after skip
Possible to move DSUBU before the branch
Example 1:
DADDU R1,R2,R3
BEQZ  R4,L
DSUBU R1,R1,R6
L:    ...
OR    R7,R1,R8
Example 2:
DADDU R1,R2,R3
BEQZ  R12,skip
DSUBU R4,R5,R6
DADDU R5,R4,R9
skip:
OR    R7,R8,R9
Compiler Techniques for Exposing ILP
Pipeline scheduling
Separate a dependent instruction from the source instruction by the pipeline latency of the source instruction
Example:
for (i = 999; i >= 0; i = i - 1)
    x[i] = x[i] + s;
Pipeline Stalls
Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall              ;assume integer load latency is 1
      BNE    R1,R2,Loop
Pipeline Scheduling
Scheduled code:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop
Loop Unrolling
Loop unrolling
Unroll by a factor of 4 (assume # elements is divisible by 4)
Eliminate unnecessary instructions
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)     ;drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)    ;drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)  ;drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop
note: number of live registers vs. original loop
Loop Unrolling/Pipeline Scheduling
Pipeline schedule the unrolled loop:
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop
Strip Mining
Unknown number of loop iterations?
Number of iterations = n
Goal: make k copies of the loop body
Generate pair of loops:
► First executes n mod k times
► Second executes n / k times
► “Strip mining”
for (j = 0; j < n % k; j++) {
    /* original loop body (not unrolled) */
}
for (j = 0; j < n / k; j++) {
    /* loop body unrolled k times */
}
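A strip-mined version of the earlier `x[i] = x[i] + s` loop can be sketched as follows (unroll factor k = 4; the function name is mine):

```python
def strip_mined_sum(x, s, k=4):
    """Add s to every element: a cleanup loop running n mod k times,
    followed by a loop unrolled by a factor of k running n // k times."""
    n = len(x)
    for j in range(n % k):              # first loop: original body
        x[j] += s
    for j in range(n % k, n, k):        # second loop: body unrolled 4 times
        x[j] += s
        x[j + 1] += s
        x[j + 2] += s
        x[j + 3] += s
    return x
```

Running the cleanup loop first means the unrolled loop always gets a multiple of k iterations, whatever n is.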
Branch Prediction
Basic 2-bit predictor:
For each branch:
► Predict taken or not taken
► If the prediction is wrong two consecutive times, change prediction
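A minimal sketch of the 2-bit saturating counter described above:

```python
class TwoBitPredictor:
    """Saturating counter: states 0-1 predict not taken, 2-3 predict taken.
    The prediction flips only after two consecutive wrong outcomes."""
    def __init__(self):
        self.state = 0                  # start: strongly not taken
    def predict(self):
        return self.state >= 2          # True = predict taken
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
p.update(True)                          # one taken outcome: still predicts not taken
first = p.predict()
p.update(True)                          # second in a row: prediction flips
second = p.predict()
```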
Misprediction Rate
Branch Prediction
Correlating predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes of preceding n
branches
(m, n) predictor
Uses the behavior of the last m branches to choose from 2^m branch predictors
Each of which is an n-bit predictor for a single branch
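A sketch of an (m, 2) correlating predictor under the description above (the table layout and training sequence are illustrative):

```python
class CorrelatingPredictor:
    """(m, 2) predictor sketch: the outcomes of the last m branches (global
    history) select one of 2^m two-bit counters kept per branch."""
    def __init__(self, m):
        self.m = m
        self.history = 0                   # last m outcomes as a bit pattern
        self.tables = {}                   # branch PC -> 2^m two-bit counters
    def _counters(self, pc):
        return self.tables.setdefault(pc, [0] * (1 << self.m))
    def predict(self, pc):
        return self._counters(pc)[self.history] >= 2
    def update(self, pc, taken):
        c = self._counters(pc)
        i = self.history
        c[i] = min(3, c[i] + 1) if taken else max(0, c[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

p = CorrelatingPredictor(2)
cold = p.predict(0x40)                     # untrained: predicts not taken
for _ in range(8):
    p.update(0x40, True)                   # an always-taken branch trains it
warm = p.predict(0x40)                     # now predicts taken
```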
Branch Prediction
Local predictor:
Multiple 2-bit predictors for each branch
One for each possible combination of outcomes for the last n
occurrences of this branch
Tournament predictor:
Combine correlating predictor with local predictor
Selector chooses one of the predictors
Comparison of 2-bit Predictors
Branch Prediction Performance
Dynamic Scheduling
Rearrange order of instructions to reduce stalls while
maintaining data flow
Advantages:
Compiler doesn’t need to have knowledge of microarchitecture
Handles cases where dependencies are unknown at compile time
Disadvantage:
Substantial increase in hardware complexity
Complicates exceptions
Dynamic Scheduling
Dynamic scheduling implies:
Out-of-order execution
Out-of-order completion
Creates the possibility for WAR and WAW hazards
Tomasulo’s Approach
Tracks when operands are available
► Minimizes RAW hazards
Introduces register renaming in hardware
► Minimizes WAW and WAR hazards
Register Renaming
Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8
Antidependence on F8 (ADD.D reads F8, SUB.D writes it)
Antidependence on F6 (S.D reads F6, MUL.D writes it)
Output (name) dependence on F6 (ADD.D and MUL.D both write it)
Register Renaming
Example, with the ADD.D/S.D use of F6 renamed to S and the second write of F8 renamed to T:
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
Now only RAW hazards remain, which can be strictly ordered
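A generic renaming pass over this example can be sketched as follows; unlike the slide, which renames only the conflicting writes, this version gives every write a fresh name (the physical-register pool is arbitrary):

```python
def rename(instrs, fresh):
    """Give every register write a fresh physical name and patch later reads,
    removing WAR/WAW (name) dependences; only true RAW dependences remain."""
    mapping = {}                                      # architectural -> physical
    out = []
    for op, dst, srcs in instrs:
        srcs = tuple(mapping.get(s, s) for s in srcs) # read current names
        if dst is not None:                           # stores have no register dest
            mapping[dst] = next(fresh)
            dst = mapping[dst]
        out.append((op, dst, srcs))
    return out

code = [("DIV.D", "F0", ("F2", "F4")),
        ("ADD.D", "F6", ("F0", "F8")),
        ("S.D",   None, ("F6", "R1")),   # the stored value F6 is a source
        ("SUB.D", "F8", ("F10", "F14")),
        ("MUL.D", "F6", ("F10", "F8"))]
renamed = rename(code, iter(["P0", "S", "T", "P3"]))
```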
Register Renaming
Register renaming is provided by reservation stations (RS)
Contains:
► The instruction
► Buffered operand values (when available)
► Reservation-station number of the instruction providing the operand values
RS fetches and buffers an operand as soon as it becomes available (not necessarily involving the register file)
Register Renaming
Pending instructions designate the RS to which they will send their output
► Result values are broadcast on a result bus, called the common data bus (CDB)
Only the last output updates the register file
As instructions are issued, the register specifiers are renamed with the reservation station
May be more reservation stations than registers
Tomasulo’s Algorithm
Load and store buffers
Contain data and addresses, act like reservation stations
Top-level design:
Tomasulo’s Algorithm
Three Steps:
Issue
► Get next instruction from FIFO queue
► If an RS is available, issue the instruction to the RS, with operand values if available
► If operand values are not available, record which RS will produce them (stall only when no RS is free)
Execute
► When an operand becomes available, store it in any reservation stations waiting for it
► When all operands are ready, execute the instruction
► Loads and stores are maintained in program order through effective address calculation
► No instruction is allowed to initiate execution until all branches that precede it in program order have completed
Write result
► Write result on CDB into reservation stations and store buffers
(Stores must wait until both address and value are received)
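The issue and write-result steps can be sketched with a toy reservation-station model (Vj/Vk hold ready values, Qj/Qk name the producing station; the register contents are illustrative):

```python
class ReservationStation:
    """One Tomasulo reservation station: Vj/Vk hold ready operand values,
    Qj/Qk name the station that will produce a missing operand."""
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.op = self.vj = self.vk = self.qj = self.qk = None

def issue(rs, op, operands, reg_status, regs):
    """Issue step: copy operands that are ready, otherwise record the producer."""
    src1, src2, dst = operands
    rs.busy, rs.op = True, op
    rs.qj = reg_status.get(src1)              # producing station, if any
    rs.vj = None if rs.qj else regs[src1]
    rs.qk = reg_status.get(src2)
    rs.vk = None if rs.qk else regs[src2]
    reg_status[dst] = rs.name                 # dst is now renamed to this station

def broadcast(stations, name, value):
    """Write-result step: the CDB delivers the value to every waiting station."""
    for rs in stations:
        if rs.qj == name:
            rs.vj, rs.qj = value, None
        if rs.qk == name:
            rs.vk, rs.qk = value, None

regs = {"F2": 2.0, "F4": 4.0, "F0": 0.0, "F8": 8.0}
status = {}                                   # register -> name of producing RS
div1, add1 = ReservationStation("Div1"), ReservationStation("Add1")
issue(div1, "DIV.D", ("F2", "F4", "F0"), status, regs)
issue(add1, "ADD.D", ("F0", "F8", "F6"), status, regs)
waiting_on = add1.qj                          # ADD waits on DIV, not on stale F0
broadcast([div1, add1], "Div1", 0.5)          # DIV finishes; result on the CDB
```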
Example
Reorder Buffer
Reorder buffer: holds the result of an instruction between completion and commit
Four fields:
Instruction type: branch/store/register
Destination field: register number
Value field: output value
Ready field: completed execution?
Modify reservation stations:
Operand source is now reorder buffer instead of functional unit
Hardware-Based Speculation
Execute instructions along predicted execution paths, but only commit the results if the prediction was correct
Instruction commit: allowing an instruction to update the register file when the instruction is no longer speculative
Need an additional piece of hardware to prevent any irrevocable action until an instruction commits
i.e., updating state or taking an exception
Reorder Buffer
Register values and memory values are not written
until an instruction commits
On misprediction:
Speculated entries in ROB are cleared
Exceptions:
Not recognized until it is ready to commit
Multiple Issue and Static Scheduling
To achieve CPI < 1, need to complete multiple instructions per clock
Solutions:
Statically scheduled superscalar processors
VLIW (very long instruction word) processors
Dynamically scheduled superscalar processors
Multiple Issue
VLIW Processors
Package multiple operations into one instruction
Example VLIW processor:
One integer instruction (or branch)
Two independent floating-point operations
Two independent memory references
Must be enough parallelism in code to fill the available
slots
VLIW Processors
Disadvantages:
Statically finding parallelism
Code size
No hazard detection hardware
Binary code compatibility
Dynamic Scheduling, Multiple Issue, and Speculation
Modern microarchitectures:
Dynamic scheduling + multiple issue + speculation
Two approaches:
Assign reservation stations and update pipeline control table in
half clock cycles
► Only supports 2 instructions/clock
Design logic to handle any possible dependencies between the
instructions
Hybrid approaches
Issue logic can become bottleneck
Overview of Design
Multiple Issue
Limit the number of instructions of a given class that can be issued in a "bundle"
i.e., one FP, one integer, one load, one store
Examine all the dependencies among the instructions in the bundle
If dependencies exist within the bundle, encode them in the reservation stations
Also need multiple completion/commit
Example
Loop: LD R2,0(R1) ;R2=array element
DADDIU R2,R2,#1 ;increment R2
SD R2,0(R1) ;store result
DADDIU R1,R1,#8 ;increment pointer
BNE R2,R3,Loop ;branch if not last element
Example (No Speculation)
Example
Branch-Target Buffer
Need high instruction bandwidth!
Branch-Target buffers
► Next PC prediction buffer, indexed by current PC
Branch Folding
Optimization:
Larger branch-target buffer
Add target instruction into buffer to deal with longer decoding
time required by larger buffer
“Branch folding”
Return Address Predictor
Most unconditional branches come from function returns
The same procedure can be called from multiple sites
Causes the buffer to potentially forget the return address from previous calls
Create a return-address buffer organized as a stack
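The stack-organized return address buffer can be sketched as:

```python
class ReturnAddressStack:
    """Small stack of predicted return addresses: calls push, returns pop.
    A real implementation has a fixed depth; overflow drops the oldest entry."""
    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []
    def on_call(self, return_addr):
        self.entries.append(return_addr)
        if len(self.entries) > self.depth:
            self.entries.pop(0)           # overwrite the oldest prediction
    def on_return(self):
        return self.entries.pop() if self.entries else None

ras = ReturnAddressStack()
ras.on_call(0x400)        # call from site A: push its return address
ras.on_call(0x800)        # nested call from site B
r1 = ras.on_return()      # most recent call returns first
r2 = ras.on_return()
```

Because calls and returns nest, the stack discipline predicts the right return address even when the same procedure is called from many sites.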
Integrated Instruction Fetch Unit
Design monolithic unit that performs:
Branch prediction
Instruction prefetch
► Fetch ahead
Instruction memory access and buffering
► Deal with crossing cache lines
Register Renaming
Register renaming vs. reorder buffers
Instead of virtual registers from reservation stations and reorder buffer,
create a single register pool
► Contains visible registers and virtual registers
Use hardware-based map to rename registers during issue
WAW and WAR hazards are avoided
Speculation recovery occurs by copying during commit
Still need a ROB-like queue to update table in order
Simplifies commit:
► Record that mapping between architectural register and physical register is no
longer speculative
► Free up physical register used to hold older value
► In other words: SWAP physical registers on commit
Physical register de-allocation is more difficult
Integrated Issue and Renaming
Combining instruction issue with register renaming:
Issue logic pre-reserves enough physical registers for the bundle
(fixed number?)
Issue logic finds dependencies within bundle, maps registers as
necessary
Issue logic finds dependencies between current bundle and
already in-flight bundles, maps registers as necessary
How Much?
How much to speculate
Mis-speculation degrades performance and power relative to no speculation
► May cause additional misses (cache, TLB)
Prevent speculative code from causing more costly misses (e.g. in L2)
Speculating through multiple branches
Complicates speculation recovery
No processor has yet combined full speculation with resolving multiple branches per cycle
Energy Efficiency
Speculation and energy efficiency
Note: speculation is only energy efficient when it significantly improves performance
Value prediction
Uses:
► Loads that load from a constant pool
► Instructions that produce a value from a small set of values
Has not been incorporated into modern processors
A similar idea, address aliasing prediction, is used in some processors
End of Chapter