the life of an instruction in ev6 pipeline constantinos kourouyiannis

The life of an instruction in EV6 pipeline

Constantinos Kourouyiannis

Alpha 21264 (EV6) pipeline

Instruction Fetch

Fetches 4 instructions per cycle

Techniques for maximum fetch efficiency:

Large 64 KB, 2-way set-associative instruction cache

Line and set prediction to indicate where to fetch the next block from, including which set should be used

Low mispredict cost for line and set prediction (single-cycle bubble)

Branch Predictor

Branch prediction scheme dynamically chooses between local and global history

Register Renaming

Assigns a unique storage location to each write reference to a register

Eliminates WAR and WAW register dependencies while preserving all RAW register dependencies, which are necessary for correct computation

In addition to the 64 architectural registers, 41 integer and 41 floating-point registers are available to hold speculative results prior to instruction retirement, within an 80-instruction in-flight window
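
The renaming rule above can be sketched in a few lines. This is a simplified illustration with assumed details: the function `rename`, the tiny register counts and the three-instruction program are hypothetical, and registers are never returned to the free list because retirement is not modeled.

```python
def rename(instructions, num_arch=4, num_phys=8):
    """Rename (dest, src1, src2) architectural register triples to physical ones."""
    map_table = {r: r for r in range(num_arch)}    # architectural -> physical
    free_list = list(range(num_arch, num_phys))    # unused physical registers
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping, which preserves RAW dependencies.
        p1, p2 = map_table[src1], map_table[src2]
        # Every write gets a fresh physical register, removing WAR and WAW hazards.
        pd = free_list.pop(0)
        map_table[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


program = [
    (1, 2, 3),   # r1 = r2 op r3
    (2, 1, 3),   # r2 = r1 op r3   (RAW on r1, WAR on r2)
    (1, 3, 3),   # r1 = r3 op r3   (WAW on r1)
]
renamed = rename(program)
# renamed == [(4, 2, 3), (5, 4, 3), (6, 3, 3)]
```

The second instruction still reads physical register 4, the value written by the first, while the two writes to r1 now target different physical registers (4 and 6) and can complete in either order.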

Out of Order Issue Queues

Separate integer and floating-point queues

Each cycle, the queues select from pending instructions as they become data-ready, using register scoreboards based on the renamed register numbers

The scoreboards maintain the status of renamed registers by tracking the progress of single-cycle, multiple-cycle and variable-cycle instructions

When a functional unit (FU) or load result becomes available, the scoreboard unit notifies all instructions in the queue that require the register value

These instructions can issue as soon as the bypassed result is available from the FU or load

Out of Order Issue Queues (cont.)

20-entry integer queue that can issue 4 instructions per cycle

15-entry floating-point queue that can issue 2 instructions per cycle

Instructions are statically assigned to 2 of the 4 execution pipes before entering the queue

The issue queue has 2 arbiters that dynamically issue the oldest 2 instructions each cycle within the upper and lower pipes respectively

The queues issue instructions speculatively

The queues are collapsing: an entry becomes available as soon as its instruction issues or is squashed due to mis-speculation
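
The oldest-first selection by the two arbiters can be sketched as follows. This is a simplified model under stated assumptions: the `Entry` record, the `select` function and the use of a set of ready registers in place of a scoreboard are all hypothetical, and entries leave the queue immediately on issue (the speculative load window described later is ignored).

```python
from dataclasses import dataclass


@dataclass
class Entry:
    age: int       # smaller means older
    pipes: str     # "upper" or "lower", assigned before entering the queue
    srcs: tuple    # renamed source registers the instruction waits on


def select(queue, ready_regs, per_arbiter=2):
    """Issue up to `per_arbiter` oldest data-ready entries per pipe group."""
    issued = []
    for group in ("upper", "lower"):
        candidates = [
            e for e in queue
            if e.pipes == group and all(s in ready_regs for s in e.srcs)
        ]
        candidates.sort(key=lambda e: e.age)
        issued.extend(candidates[:per_arbiter])
    # Collapsing queue: issued entries free their slots right away.
    remaining = [e for e in queue if e not in issued]
    return issued, remaining


queue = [
    Entry(0, "upper", (1,)),
    Entry(1, "upper", (2,)),
    Entry(2, "upper", (1, 2)),
    Entry(3, "lower", (9,)),    # register 9 is not ready yet
    Entry(4, "lower", (1,)),
]
issued, remaining = select(queue, ready_regs={1, 2})
issued_ages = [e.age for e in issued]        # [0, 1, 4]
remaining_ages = [e.age for e in remaining]  # [2, 3]
```

Entry 2 is ready but loses to the two older upper-pipe entries, and entry 3 stays queued because one of its sources is not ready.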

Execution Engine

All execution units require access to the register file

14 ports would be needed to support 4 simultaneous instructions in addition to 2 load operations, which makes for a large register file

The 21264 therefore splits the register file into 2 clusters, each containing a duplicate of the 80-entry register file

2 pipes access a single register file to form a cluster, and the 2 clusters are combined to support 4-way integer instruction execution

Incremental cost: one additional cycle of latency to broadcast results from each integer cluster to the other cluster, which is a small cost

The integer issue queue dynamically schedules instructions to minimize the 1-cycle cross-cluster communication cost

The 2 FP execution pipes access a single 72-entry register file
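
The effect of the 1-cycle cross-cluster delay on scheduling can be illustrated with a small model. The selection policy shown here (pick the cluster where all operands are usable earliest) is an assumption made for illustration, not a description of the actual arbiter logic; `operand_ready_cycle` and `best_cluster` are hypothetical names.

```python
CROSS_CLUSTER_DELAY = 1  # extra cycle to forward a result to the other cluster


def operand_ready_cycle(produced_cycle, producer_cluster, consumer_cluster):
    """Cycle at which a result becomes usable in `consumer_cluster`."""
    if producer_cluster == consumer_cluster:
        return produced_cycle
    return produced_cycle + CROSS_CLUSTER_DELAY


def best_cluster(operands, clusters=(0, 1)):
    """Pick the cluster where all operands are usable earliest.

    operands: non-empty list of (produced_cycle, producer_cluster) pairs.
    Returns (cluster, earliest_issue_cycle); ties go to the lower cluster number.
    """
    options = []
    for c in clusters:
        ready = max(operand_ready_cycle(t, p, c) for t, p in operands)
        options.append((ready, c))
    ready, cluster = min(options)
    return cluster, ready


# One operand arrives late from cluster 1, the other early from cluster 0.
choice = best_cluster([(10, 1), (8, 0)])
# choice == (1, 10): running next to the late producer hides the delay.
```

Placing the consumer in the cluster of its latest-arriving operand avoids paying the extra cycle on the critical path.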

Execution Engine (cont.)

New functionality not present in prior Alpha microprocessors:

Fully pipelined integer multiply unit

Integer population count and leading/trailing zero count unit

Floating-point square root functional unit

Instructions to move register values directly between FP and integer registers

Memory System

Supports in-flight memory references and out-of-order operation

Receives up to 2 memory operations from the integer execution pipes every cycle

Data cache operates at twice the frequency of the processor clock

3-cycle latency for integer loads and 4-cycle latency for FP loads

Store/Load Memory Ordering

Hazard detection logic recovers from mis-speculation in which a load incorrectly issues before an earlier store to the same address

After the first mis-speculation of a load, the out-of-order execution core is trained to avoid it on subsequent executions of the same load. This is done by setting a bit in a load wait table that is examined at fetch time. If the bit is set, the 21264 delays the issue point of the load until all prior stores have issued.
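
The training mechanism can be sketched as a one-bit-per-entry table. The table size, the PC-based indexing and the names `LoadWaitTable` and `load_may_issue` are assumptions made for this illustration; the slide only states that a bit is set after a mis-speculation and that a set bit delays the load until all prior stores have issued.

```python
class LoadWaitTable:
    """One bit per entry; a set bit means "wait for all prior stores"."""

    def __init__(self, entries=1024):
        self.bits = [False] * entries

    def _index(self, pc):
        # Hypothetical indexing: drop the 2 low bits of a 4-byte-aligned PC.
        return (pc >> 2) % len(self.bits)

    def record_violation(self, pc):
        """Called when a load was caught issuing before an older store."""
        self.bits[self._index(pc)] = True

    def must_wait(self, pc):
        return self.bits[self._index(pc)]


def load_may_issue(table, load_pc, older_stores_unissued):
    """A trained load issues only once no older store is still waiting."""
    if table.must_wait(load_pc):
        return older_stores_unissued == 0
    return True


table = LoadWaitTable()
before = load_may_issue(table, 0x1000, older_stores_unissued=2)      # speculates
table.record_violation(0x1000)
after = load_may_issue(table, 0x1000, older_stores_unissued=2)       # now held back
after_stores = load_may_issue(table, 0x1000, older_stores_unissued=0)
```

An untrained load is free to issue ahead of older stores; once its bit is set, it is held until those stores have issued.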

Load Hit/Miss Prediction

To achieve the 3-cycle integer load hit latency, it is necessary to speculatively issue consumers of integer load data before knowing whether the load hit or missed in the on-chip data cache

A load miss triggers a mini-restart: when consumers speculatively issue 3 cycles after a load that misses, 2 integer issue cycles are squashed and all instructions that issued during these 2 cycles are pulled back into the issue queue to be re-issued later

This is less costly than a full pipeline restart, but still expensive for applications with many integer load misses

The 21264 predicts when loads will miss and does not speculatively issue the consumers of the load in that case

Effective load latency: 5 cycles for an integer load hit that is incorrectly predicted to miss

Load Hit Speculation

Symbol  Meaning
Q       Issue queue
R       Register file read
E       Execute
D       DCache access
B       Data bus active

[Figure: Pipeline Timing for Integer Load. Over cycles 1 to 6, the integer load passes through stages Q, R, E, D and B; Instruction 1 is shown in stages Q and R and Instruction 2 in stage Q; the hit signal is marked.]

Load Hit Speculation (cont.)

There are 2 cycles in which the issue queue may speculatively issue instructions that use load data before Dcache hit information is known

Any instructions issued in these 2 cycles are kept in the issue queue until the load hit condition is known, even if they do not depend on the load operation

If the load hits, the instructions are removed from the queue

If the load misses, execution of these instructions is aborted and the instructions are allowed to request service again

In the previous example, instructions 1 and 2 are issued within the speculative window of the load instruction. If the load hits, both instructions are removed from the queue by the start of cycle 7; if it misses, both instructions are aborted from the execution pipelines.

Load Hit Speculation (cont.)

If the running software is likely to miss in the Dcache, the 21264 can still benefit from scheduling the instruction stream for Dcache miss latency

A saturating 4-bit counter is incremented by 1 when a load hits and decremented by 2 when a load misses. When the upper bit of the counter is 0, the integer load latency is increased to 5 cycles and the speculative window is removed.
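
The counter behaviour described above can be sketched directly. The class name `LoadHitPredictor` and the initial counter value of 15 are assumptions for the example; the increment, decrement, width and upper-bit rule follow the slide.

```python
class LoadHitPredictor:
    """4-bit saturating counter: +1 on a load hit, -2 on a load miss."""

    def __init__(self, initial=15):   # the starting value is an assumption
        self.counter = initial

    def update(self, hit):
        if hit:
            self.counter = min(15, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 2)

    def predicts_hit(self):
        # The upper bit of a 4-bit value is set exactly when the value is >= 8.
        return self.counter >= 8

    def integer_load_latency(self):
        # 3 cycles with hit speculation, 5 cycles once the window is removed.
        return 3 if self.predicts_hit() else 5


p = LoadHitPredictor()
for _ in range(3):
    p.update(hit=False)               # 15 -> 13 -> 11 -> 9
latency_after_3_misses = p.integer_load_latency()
p.update(hit=False)                   # 9 -> 7, upper bit clears
latency_after_4_misses = p.integer_load_latency()
p.update(hit=True)                    # 7 -> 8, upper bit set again
latency_after_recovery = p.integer_load_latency()
```

Because misses subtract 2 and hits add only 1, the counter leans toward the 5-cycle mode once misses make up more than about a third of recent loads.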

Load Hit Speculation (cont.)

[Figure: Pipeline Timing for Floating Point Load. Same symbols (Q, R, E, D, B) as the integer load timing figure: over cycles 1 to 6, the load passes through stages Q, R, E, D and B; Instruction 1 is shown in stages Q and R and Instruction 2 in stage Q; the hit signal is marked.]

Load Hit Speculation (cont.)

The speculative window for FP loads is 1 cycle

Instructions issued from the floating-point queue (FQ) within this window of an FP load that has missed are aborted only if they depend on the load being successful

In the example, only instruction 1 is issued in the speculative window. If this instruction does not use the data returned by the load, it is removed from the queue at the normal time (cycle 7). If it does depend on the load data and the load hits, it is removed from the queue one cycle later. If the load misses, instruction 1 is aborted from the execution pipelines and may request service again in cycle 7.
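
The difference between the integer and FP abort rules can be summarized in a short sketch. The function `instructions_to_abort`, the instruction names and the register names are hypothetical; only the policy itself (all window instructions for an integer load miss, dependents only for an FP load miss) comes from the slides.

```python
def instructions_to_abort(window, load_dest, load_hit, fp_load):
    """Return names of instructions aborted once the load outcome is known.

    window: list of (name, source_registers) issued in the speculative window.
    load_dest: register written by the load.
    """
    if load_hit:
        return []                      # nothing is aborted on a hit
    if fp_load:
        # FP load miss: only instructions that use the load data are aborted.
        return [name for name, srcs in window if load_dest in srcs]
    # Integer load miss: every instruction in the window is aborted.
    return [name for name, _ in window]


window = [("instr1", ("r2", "r3")), ("instr2", ("r4", "r5"))]
fp_miss = instructions_to_abort(window, "r2", load_hit=False, fp_load=True)
int_miss = instructions_to_abort(window, "r2", load_hit=False, fp_load=False)
any_hit = instructions_to_abort(window, "r2", load_hit=True, fp_load=False)
# fp_miss == ["instr1"], int_miss == ["instr1", "instr2"], any_hit == []
```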

Conclusion

21264: fastest microprocessor available

Combines high Alpha clock speeds with many advanced micro-architectural techniques, including out-of-order and speculative execution with many in-flight instructions

High-bandwidth memory system that quickly delivers data values to the execution core, giving robust performance for many applications
