![Page 1: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/1.jpg)
The life of an instruction in EV6 pipeline
Constantinos Kourouyiannis
![Page 2: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/2.jpg)
Alpha 21264 (EV6) pipeline
![Page 3: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/3.jpg)
Instruction Fetch
- 4 instructions per cycle
- Techniques for maximum fetch efficiency
  - Large 64KB 2-way associative instruction cache
  - Line and set prediction to indicate where to fetch the next block from, including which set should be used
  - Low mispredict cost of line and set prediction (single-cycle bubble)
- Branch Predictor
  - Branch prediction scheme dynamically chooses between local and global history
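The idea of dynamically choosing between local and global history can be sketched in Python as follows. This is an illustrative toy only: the table sizes, the 2-bit counters, the indexing, and the names `TournamentPredictor` and `_bump` are assumptions made for the example, not the 21264's actual parameters.

```python
def _bump(counter, up, max_value=3):
    """Saturating increment/decrement of a small counter."""
    return min(counter + 1, max_value) if up else max(counter - 1, 0)


class TournamentPredictor:
    """Toy chooser between a local-history and a global-history predictor."""

    def __init__(self, entries=1024, history_bits=10):
        # All sizes here are assumptions chosen for illustration.
        self.entries = entries
        self.mask = (1 << history_bits) - 1
        self.local_history = [0] * entries               # per-branch outcome history
        self.local_counters = [1] * (1 << history_bits)  # indexed by local history
        self.global_history = 0                          # outcomes of recent branches
        self.global_counters = [1] * (1 << history_bits) # indexed by global history
        self.choice = [1] * (1 << history_bits)          # >= 2 means "trust global"

    def _components(self, pc):
        lh = self.local_history[pc % self.entries]
        local_pred = self.local_counters[lh] >= 2
        global_pred = self.global_counters[self.global_history] >= 2
        return lh, local_pred, global_pred

    def predict(self, pc):
        _, local_pred, global_pred = self._components(pc)
        use_global = self.choice[self.global_history] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        lh, local_pred, global_pred = self._components(pc)
        gh = self.global_history
        # Train the chooser only when the two components disagree.
        if local_pred != global_pred:
            self.choice[gh] = _bump(self.choice[gh], global_pred == taken)
        self.local_counters[lh] = _bump(self.local_counters[lh], taken)
        self.global_counters[gh] = _bump(self.global_counters[gh], taken)
        self.local_history[pc % self.entries] = ((lh << 1) | int(taken)) & self.mask
        self.global_history = ((gh << 1) | int(taken)) & self.mask


predictor = TournamentPredictor()
for _ in range(50):
    predictor.update(0x40, True)   # a branch that is always taken
print(predictor.predict(0x40))
```

The chooser is trained only when the two component predictions disagree, so it drifts toward whichever history has been more accurate in the current context.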
![Page 4: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/4.jpg)
Register Renaming
- Assignment of a unique storage location with each write reference to a register
- Elimination of WAR and WAW register dependencies, but preservation of all RAW register dependencies necessary for correct computation
- 64 architectural registers + 41 integer + 41 floating-point registers available for holding speculative results prior to instruction retirement in an 80-instruction in-flight window
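A minimal Python sketch of renaming shows why WAR and WAW dependencies disappear while RAW dependencies survive. The register counts, the `rN`/`pN` naming, and the `RenameTable` class are hypothetical and chosen for readability. Freeing physical registers at retirement is deliberately left out.

```python
from collections import deque


class RenameTable:
    """Toy rename map: every register write gets a fresh physical register."""

    def __init__(self, num_arch=4, num_phys=8):
        # Assumption: architectural register rN starts out mapped to pN.
        self.mapping = {f"r{i}": f"p{i}" for i in range(num_arch)}
        self.free = deque(f"p{i}" for i in range(num_arch, num_phys))

    def rename(self, dest, srcs):
        # Sources read the current mapping, which preserves RAW dependencies.
        phys_srcs = [self.mapping[s] for s in srcs]
        # The destination gets a new location, which removes WAR and WAW hazards.
        # popleft() raises IndexError if no free physical register remains.
        phys_dest = self.free.popleft()
        self.mapping[dest] = phys_dest
        return phys_dest, phys_srcs


program = [
    ("r1", ["r2", "r3"]),  # r1 = r2 + r3
    ("r2", ["r1", "r1"]),  # r2 = r1 + r1  (RAW on r1, WAR on r2)
    ("r1", ["r3", "r3"]),  # r1 = r3 + r3  (WAW on r1)
]

table = RenameTable()
renamed = [table.rename(dest, srcs) for dest, srcs in program]
for dest, srcs in renamed:
    print(dest, "<-", srcs)
```

After renaming, the second instruction still reads the physical register written by the first (the RAW dependency), while the third writes a different physical register and no longer conflicts with either earlier instruction.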
![Page 5: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/5.jpg)
Out of Order Issue Queues
- Separate integer and floating-point queues
- Each cycle the queues select from pending instructions as they become data-ready, using register scoreboards based on the renamed register numbers
- Scoreboards maintain the status of renamed registers by tracking the progress of single-cycle, multiple-cycle and variable-cycle instructions
- When a functional unit or load result becomes available, the scoreboard unit notifies the instructions in the queue that require the register value
- These instructions can issue as soon as the bypassed result is available from the functional unit or load
![Page 6: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/6.jpg)
Out of Order Issue Queues (cont.)
- 20-entry integer queue
  - can issue 4 instructions per cycle
- 15-entry floating-point queue
  - can issue 2 instructions per cycle
- Static assignment of instructions to 2 of 4 execution pipes before entering the queue
- Issue queue has 2 arbiters that dynamically issue the oldest 2 instructions each cycle within the upper and lower pipes respectively
- Queues issue instructions speculatively
- Queue is collapsing (an entry becomes available) when the instruction issues or is squashed due to mis-speculation
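The two-arbiter, oldest-first selection for the integer queue can be sketched roughly as below. The `Entry` fields and the `select` function are hypothetical names. The sketch models only data-readiness and age, not functional-unit availability or squashing.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    age: int      # smaller number = older instruction
    pipes: str    # "upper" or "lower", assigned statically before enqueue
    ready: bool   # True once all source operands are available


def select(queue):
    """Pick the oldest 2 ready entries per pipe group, then collapse the queue."""
    issued = []
    for group in ("upper", "lower"):  # one arbiter per pipe group
        candidates = sorted(
            (e for e in queue if e.pipes == group and e.ready),
            key=lambda e: e.age,
        )
        issued.extend(candidates[:2])
    issued_ids = {id(e) for e in issued}
    # Collapsing: issued entries leave and the remaining entries keep their order.
    remaining = [e for e in queue if id(e) not in issued_ids]
    return issued, remaining


queue = [
    Entry(0, "upper", True),
    Entry(1, "upper", False),
    Entry(2, "lower", True),
    Entry(3, "upper", True),
    Entry(4, "upper", True),
    Entry(5, "lower", True),
]
issued, remaining = select(queue)
print([e.age for e in issued], [e.age for e in remaining])
```

In this example the upper arbiter skips the older but not-yet-ready entry 1 and issues entries 0 and 3, which is the out-of-order behavior the queue is designed for.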
![Page 7: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/7.jpg)
Execution Engine
- All execution units require access to the register file
- 14 ports needed to support 4 simultaneous instructions in addition to 2 load operations, which makes the register file large
- The 21264 splits the register file into 2 clusters that contain duplicates of the 80-entry register file.
- 2 pipes access a single register file to form a cluster, and 2 clusters are combined to support 4-way integer instruction execution
- Incremental cost: an additional cycle of latency to broadcast results from each integer cluster to the other cluster (a small cost)
- Integer issue queue dynamically schedules instructions to minimize the 1-cycle cross-cluster communication cost
- 2 FP execution pipes access a single 72-entry register file
![Page 8: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/8.jpg)
Execution Engine (cont.)
New functionality not present in prior Alpha microprocessors:
- Fully-pipelined integer multiply unit
- Integer population count and leading/trailing zero count unit
- Floating-point square root FU
- Instructions to move register values directly between FP and integer registers
![Page 9: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/9.jpg)
Memory System
- Supports in-flight memory references and out-of-order operation
- Receives up to 2 memory operations from the integer execution pipes every cycle
- Data cache operates at twice the frequency of the processor cycle
- 3-cycle latency for integer loads and 4 cycles for FP loads
![Page 10: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/10.jpg)
Store/Load Memory Ordering
- Hazard detection logic to recover from mis-speculation that allows a load to incorrectly issue before an earlier store to the same address
- After the first mis-speculation of a load, the out-of-order execution core is trained to avoid it on subsequent executions of the same load. This is done by setting a bit in a load wait table that is examined at fetch time. If the bit is set, the 21264 forces the issue point of the load to be delayed until all prior stores have issued.
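The load wait table behavior described above can be sketched as follows. The table size, the PC-based indexing, and every name (`LoadWaitTable`, `record_violation`, `can_issue_load`) are assumptions for illustration. The slide does not give these details.

```python
class LoadWaitTable:
    """Toy load wait table: one bit per entry, indexed by the load's PC."""

    def __init__(self, entries=1024):  # the size is an assumption
        self.entries = entries
        self.bits = [False] * entries

    def _index(self, pc):
        # Assumption: 4-byte instructions, so the low 2 PC bits are dropped.
        return (pc >> 2) % self.entries

    def must_wait(self, pc):
        """Examined when the load is fetched."""
        return self.bits[self._index(pc)]

    def record_violation(self, pc):
        """Called when the load issued before an earlier store to the same address."""
        self.bits[self._index(pc)] = True


def can_issue_load(table, load_pc, prior_stores_issued):
    """A marked load may issue only after all prior stores have issued."""
    if table.must_wait(load_pc):
        return prior_stores_issued
    return True


table = LoadWaitTable()
load_pc = 0x1200
print(can_issue_load(table, load_pc, prior_stores_issued=False))  # untrained load
table.record_violation(load_pc)                                   # ordering trap seen
print(can_issue_load(table, load_pc, prior_stores_issued=False))  # now delayed
```

The first execution of a load is therefore always speculative, and only loads that have actually caused an ordering violation pay the cost of waiting.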
![Page 11: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/11.jpg)
Load Hit/Miss Prediction
- To achieve the 3-cycle integer load hit latency, it is necessary to speculatively issue consumers of integer load data before knowing if the load hit or missed in the on-chip data cache.
- In case of a load miss: mini-restart
  - When consumers speculatively issue 3 cycles after a load that misses, 2 integer issue cycles are squashed and all instructions that issued during these 2 cycles are pulled back into the issue queue to be re-issued later
  - Less costly than a full pipeline restart, but still expensive for applications with many integer load misses
- The 21264 predicts when loads will miss and does not speculatively issue the consumers of the load in that case.
- Effective load latency: 5 cycles for an integer load hit that is incorrectly predicted to miss
![Page 12: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/12.jpg)
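Load Hit Speculation

| Symbol | Meaning |
|---|---|
| Q | Issue queue |
| R | Register file read |
| E | Execute |
| D | DCache access |
| B | Data bus active |

| Cycle Number | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Integer Load | Q | R | E | D | B | |
| Instruction 1 | | | | Q | R | |
| Instruction 2 | | | | | Q | |

Pipeline Timing for Integer Load. The figure's "Hit" label marks where the Dcache hit status becomes known.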
![Page 13: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/13.jpg)
Load Hit Speculation (cont.)
- There are 2 cycles in which the issue queue may speculatively issue instructions that use load data before Dcache hit information is known
- Any instructions issued in these 2 cycles are kept in the issue queue until the load hit condition is known, even if they are not dependent on the load operation.
  - If the load hits: the instructions are removed from the queue
  - If the load misses: execution of these instructions is aborted and the instructions are allowed to request service again
- In the previous example, instructions 1 and 2 are issued within the speculative window of the load instruction. If the load hits, both instructions will be removed from the queue by the start of cycle 7, while if it misses, both instructions will be aborted from the execution pipelines.
![Page 14: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/14.jpg)
Load Hit Speculation (cont.)
- If software misses are likely, the 21264 can still benefit from scheduling the instruction stream for Dcache miss latency.
- Saturating 4-bit counter, incremented by 1 when a load hits and decremented by 2 when a load misses.
- When the upper bit of the counter is 0, the integer load latency is increased to 5 cycles and the speculative window is removed.
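The counter rule above is small enough to write out directly. In this sketch the initial counter value and the names `LoadHitPredictor`, `record`, and `integer_load_latency` are assumptions. The +1/-2 updates, the 4-bit width, and the 3-cycle and 5-cycle latencies come from the slides.

```python
class LoadHitPredictor:
    """Saturating 4-bit counter: +1 on a load hit, -2 on a load miss."""

    MAX_VALUE = 15  # largest value of a 4-bit counter

    def __init__(self, value=15):  # the starting value is an assumption
        self.value = value

    def record(self, hit):
        if hit:
            self.value = min(self.value + 1, self.MAX_VALUE)
        else:
            self.value = max(self.value - 2, 0)

    def predicts_hit(self):
        # The upper bit of a 4-bit counter is set when the value is 8 or more.
        return self.value >= 8

    def integer_load_latency(self):
        # With the upper bit clear, consumers are scheduled for 5 cycles
        # and the speculative window is removed.
        return 3 if self.predicts_hit() else 5


predictor = LoadHitPredictor()
for _ in range(4):
    predictor.record(hit=False)
print(predictor.value, predictor.integer_load_latency())
```

Because a miss costs two counts and a hit restores only one, the counter's upper bit can stay set in the long run only if loads hit at least about twice as often as they miss.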
![Page 15: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/15.jpg)
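Load Hit Speculation (cont.)

| Symbol | Meaning |
|---|---|
| Q | Issue queue |
| R | Register file read |
| E | Execute |
| D | DCache access |
| B | Data bus active |

| Cycle Number | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Floating-Point Load | Q | R | E | D | B | |
| Instruction 1 | | | | | Q | R |
| Instruction 2 | | | | | | Q |

Pipeline Timing for Floating Point Load. The figure's "Hit" label marks where the Dcache hit status becomes known.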
![Page 16: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/16.jpg)
Load Hit Speculation (cont.)
- Speculative window for FP loads = 1 cycle
- FQ-issued instructions within this window of an FP load that has missed are aborted only if they depend on the load being successful.
- In the example, only instruction 1 is issued in the speculative window. If this instruction is not a user of data returned by the load, it is removed from the queue at the normal time (cycle 7). But if it is dependent on the load instruction data and the load hits, then it is removed from the queue one cycle later. If the load misses, instruction 1 is aborted from the execution pipelines and may request service again in cycle 7.
![Page 17: The life of an instruction in EV6 pipeline Constantinos Kourouyiannis](https://reader036.vdocuments.us/reader036/viewer/2022082620/5a4d1afc7f8b9ab05998436b/html5/thumbnails/17.jpg)
Conclusion
- 21264: fastest microprocessor available
- Combines high Alpha clock speeds with many advanced micro-architectural techniques, including out-of-order and speculative execution with many in-flight instructions
- High-bandwidth memory system to quickly deliver data values to the execution core, giving robust performance for many applications