out-of-order execution -p6 lihu rappoport, 12/2004 1 mamas – computer architecture the p6...
TRANSCRIPT
out-of-order execution -P6Lihu Rappoport, 12/2004 1
MAMAS – Computer Architecture
The P6 Micro-Architecture An Example of an Out-Of-Order Micro-processor
Dr. Lihu Rappoport
out-of-order execution -P6Lihu Rappoport, 12/2004 2
The P6 family Pentium® Pro (1995)
– 150~200 Hz, 512K L2
Pentium® II (1997)– 233~450 MHz, 512K L2
– MMXTM Technology
Pentium® III (1999)– Up to 1.4GHz (@ 0.13μ)
– 512K L2 (on die, full speed)
– MMXTM + SSE
Pentium® III XeonTM
– Up to 900MHz (@ 0.18μ)
– Up to 2MB L2 cache
Celeron® (1998)– Up to 1.4GHz (@ 0.13μ)
– 256K L2 on die
– Moved to a P4 based core (≥1.7GHz)
out-of-order execution -P6Lihu Rappoport, 12/2004 3
P6 Features
Dynamic Execution - combination of :– Out Of Order execution: Data flow analysis
– Register renaming
– Speculative Execution
– Multiple Branch prediction
Super pipeline: 12 pipe stages
out-of-order execution -P6Lihu Rappoport, 12/2004 4
In-Order Front End– BIU: Bus Interface Unit
– IFU: Instruction Fetch Unit (includes IC)
– BTB: Branch Target Buffer
– ID: Instruction Decoder
– MIS: Micro-Instruction Sequencer
– RAT: Register Alias Table
Out-of-order Core– ROB: Reorder Buffer
– RRF: Real Register File
– RS: Reservation Stations
– IEU: Integer Execution Unit
– FEU: Floating-point Execution Unit
– AGU: Address Generation Unit
– MIU: Memory Interface Unit
– DCU: Data Cache Unit
– MOB: Memory Order Buffer
– L2: Level 2 cache
In-Order Retire
P6 Arch
MIS
AGU
MOB
External Bus
IEU
MIU
FEU
BTB
BIU
IFU
ID
RAT
RS
L2
DCU
ROB
out-of-order execution -P6Lihu Rappoport, 12/2004 5
O1 O3
R1 R2
Ex
I1 I2 I3 I4 I5 I6 I7 I8
NextIP
RegRen
RSWr
Icache Decode
RS disp
Retirement
In-Order Front End
Out-of-order Core
In-order Retirement
1: Next IP2: ICache lookup3: ILD (instruction length decode)4: rotate5: ID16: ID27: RAT- rename sources, ALLOC-assign destinations8: ROB-read sources RS-schedule data-ready uops for dispatch9: RS-dispatch uops10:EX11:Retirement
P6 Pipeline
out-of-order execution -P6Lihu Rappoport, 12/2004 6
MIS
AGU
MOB
External Bus
IEU
MIU
FEU
ROB
BTB
BIU
IFU
ID
RAT
RS
L2
DCU
In-Order Front End
BTB: predicts the address of the next instruction to be fetched
IFU: fetches 16 bytes per cycle from the instruction cache – L2, or memory in case of IC miss
ID: Decodes instructions into uops – up to 6 uops/cycle
MIS: Produces uops for complex instructions– Instruction which decode into >4
uops
RAT: Register Alias Table
out-of-order execution -P6Lihu Rappoport, 12/2004 7
Branch Prediction Implementation
– Use local history to predict direction
– Need to predict multiple branches Need to predict branches before previous branches are resolved Branch history updated first based on prediction, later based on
actual execution (speculative history)
– Target address taken from BTB
Prediction rate: ~92%– ~60 instructions between mispredictions
– High prediction rate is very crucial for long pipelines
– Especially important for OOOE, speculative execution: On misprediction all instructions following the branch in the instruction
window are flushed Effective size of the window is determined by prediction accuracy
RSB used for Call/Return pairs
out-of-order execution -P6Lihu Rappoport, 12/2004 8
Branch Prediction - Clustering
A cluster is a 16 byte aligned memory block = fetch line The predictor needs to provide a prediction for the entire fetch line:
– Predict the first taken branch in the line, following the fetch IP
Implemented by – Splitting IP into offset within line, set, and tag
– If the tag of more than one way matches the fetch IP The offsets of the matching ways are ordered Ways with offset smaller than the fetch IP offset are discarded The first branch that is predicted taken is chosen as the predicted branch
Jump into the fetch line
Jump out of the line
jmp jmpjmpjmp
Predictednot taken
Predictedtaken
out-of-order execution -P6Lihu Rappoport, 12/2004 9
The P6 BTB 2-level, local histories, per-set counters 4-way set associative: 512 entries in 128 sets
IP Tag Hist
1001
Pred=msb of counter
9
0
15
Way 0
Target
Way
2W
ay 3
9 432
counters
128sets
PTV
21 1 32
LRR
2
Per-Set
Branch Type00- cond01- ret10- call11- uncond
ReturnStackBuffer
Way
1
Prediction bit
4
ofst
Up to 4 branches can have a tag match
out-of-order execution -P6Lihu Rappoport, 12/2004 10
MIS
AGUAGU
MOBMOB
External External BusBus
IEUIEU
MIUMIU
FEUFEU
ROBROB
BTB
BIUBIU
IFU
ID
RAT
RRSS
L2L2
DCUDCU
Alignment
D0
1 uop
Determine where each IA instruction starts
In-Order Front End: Decoder
Instruction Length Decode
16 Instruction bytes from IFU
D1 D2
IDQ
Direct the bytes of eachinst. to the decoder
Convert inst. Into uops• If inst aligned with dec1/2
decodes into >1 uops, defer it to next cycle 1 uop4 uops
Buffers up to 6 uops:• Smooth decoder’s
variable throughput
out-of-order execution -P6Lihu Rappoport, 12/2004 11
Micro Operations (Uops) Each “CISC” inst is broken into one or more “RISC” uops
– Simplicity: Each uop is (relatively) simple Canonical representation of src/dest (2 src, 1 dest)
– Increased ILP e.g., pop eax becomes esp1<-esp0+4, eax1<-[esp0]
Simple instructions translate to a few uops– Typical uop count (it is not necessarily cycle count!)
Reg-Reg ALU/Mov inst: 1 uopMem-Reg Mov (load) 1 uopMem-Reg ALU (load + op) 2 uopsReg-Mem Mov (store) 2 uops (st addr, st data)Reg-Mem ALU (ld + op + st) 4 uops
Complicated instructions need ucode
out-of-order execution -P6Lihu Rappoport, 12/2004 12
Out-of-order Core
MIS
AGU
MOB
External Bus
IEU
MIU
FEU
BTB
BIU
IFU
ID
RAT
RS
L2
DCU
ROB
ROB: Mechanism for renaming and retirement
– 40 entries that hold instructions in-order.
RS: pool of all “not yet executed” instructions (up to 20)
Execution Units– IEU: Integer Execution Unit
– FEU: Floating-point Execution Unit
Memory related units– AGU: Address Generation Unit
MIU: Memory Interface Unit
– DCU: Data Cache Unit
– MOB: Orders Memory operations
– L2: Level 2 cache
out-of-order execution -P6Lihu Rappoport, 12/2004 13
MISMIS
AGUAGU
MOBMOB
External External BusBus
IEUIEU
MIUMIU
FEUFEU
ROBROB
BTBBTB
BIUBIU
IFUIFU
IIDD
RATRAT
RRSS
L2L2
DCUDCU
SHFFMU
FDIVIDI
VFAUIEU
JEUIEU
AGU
AGU
Port 0
Port 1
Port 2
Port 3,4
Load Address
Store Address
Out-of-order Core: Execution Units
out-of-order execution -P6Lihu Rappoport, 12/2004 14
Out Of Order Execution
Reservation station– 20 entries, schedule, dispatch, bypass
Execution ports– 5 ports, each one serving one or more Execution units
port 0: Integer, FP, SIMD, Shift, DIV, Pfmul, PFlogic port 1: Integer, Jmp, SIMD, PFadd, PFlogic, Shuffle port 2: Load address, Load data port 3: Store address port 4: Store data
– All the units on a port share the same WB bus
Reorder buffer– 40 entries, RRF, In order, 2 read ports
MOB
out-of-order execution -P6Lihu Rappoport, 12/2004 15
Alloc & Rat Perform register allocation and renaming for ≤3 uops/cyc The Register Alias Table (RAT)
– Maps the architectural registers into the physical registers For each arch reg, holds number of latest phy reg that updates it When a new uop that writes to a arch reg R is allocated,
the RAT records the phy reg allocated to the uop as the latest reg that updates R
The Allocator (Alloc) – Assigns each uop an entry number in the ROB / RS / MOB
– For each one of the sources (architectural registers) of the uop Lookup the RAT to find out the latest phy reg updating it Write it up in the RS entry
– Allocate Load & Store buffers in the MOB
EAX 0 RRF
EBX 19 ROB
ECX 23 ROB
out-of-order execution -P6Lihu Rappoport, 12/2004 16
Re-order Buffer (ROB) Hold 40 uops which are not yet committed
– At the same order as in the program
Provide a large physical register space for register renaming– One physical register per each ROB entry
physical register number = entry number Each uop has only one destination Buffer the execution results until retirement
– Valid data is set after uop executed and result written to physical reg
#entry Data
Valid
Physical Reg Data
Architectural dest. reg
0 V 12H EBX
1 V 33H ECX
39 I xxx XXX
out-of-order execution -P6Lihu Rappoport, 12/2004 17
RRF – Real Register File
Holds the Architectural Register File– Architectural Register are numbered: 0 – EAX, 1 – EBX, …
The value of an architectural register – is the value written to it by the last instruction committed
which writes to this register
RRF: #entry Arch Reg Data
0 (EAX) 9AH
1 (EBX) F34H
out-of-order execution -P6Lihu Rappoport, 12/2004 18
Uop flow through the ROB Uops are entered in order
– Registers renamed by the entry number
Once assigned: execution order unimportant After execution:
– entries marked “executed” and wait for retirement
– executed entry can be “retired” once all prior instruction have retired
– Commit architectural state only after speculation (branch, exception) has resolved
Retirement– Detect exceptions and mispredictions
Initiate repair to get machine back on right track
– Update “real registers” with value of renamed registers
– Update memory
– Leave the ROB
out-of-order execution -P6Lihu Rappoport, 12/2004 19
Reservation station (RS) Pool of all “not yet executed” uops (up to 20)
– Holds the uop attributes and the uop source data until it is dispatched
When a uop is allocated in RS, operand values are updated– If operand is from an architectural register, value is taken from the
RRF
– If operand is from a phy reg, with data valid set, value taken from ROB
– If operand is from a phy reg, with data valid not set, wait for value
The RS maintains operands status “ready/not-ready”– Each cycle, executed uops make more operands “ready”
The RS arbitrate the WB busses between the units The RS monitors the WB bus to capture data needed by awaiting uops Data can be bypassed directly from WB bus to execution unit
– Uops whose all operands are ready can be dispatched for execution Dispatcher chooses which of the ready uops to execute next Dispatches chosen uops to functional units
out-of-order execution -P6Lihu Rappoport, 12/2004 20
Register Renaming exampleIDQ
Add EAX, EBX, EAX
ALLOC
EAX 0 RRF
EBX 19 ROB
ECX 23 ROB
Add EAX, ROB19, RRF00
RAT
EAX 37 ROB
EBX 19 ROB
ECX 23 ROB
Add ROB37, ROB19, RRF0
ROB
19 V 12H EBX
23 V 33H ECX
37 I xxx XXX
19 V 12H EBX
23 V 33H ECX
37 I xxx EAX
src1 src2 Pdst
add 97H 12H 37
RRF: 0 EAX 97H
RS
out-of-order execution -P6Lihu Rappoport, 12/2004 21
In-Order Retire
MIS
AGU
MOB
External Bus
IEU
MIU
FEU
BTB
BIU
IFU
ID
RAT
RS
L2
DCU
ROB
ROB:– Retires up to 3 uops per clock
– Copies the values to the RRF
– Retirement is done In-order
– Performs exception checking
out-of-order execution -P6Lihu Rappoport, 12/2004 22
In-order Retirement
The process of committing the results to the architectural state of the processor
Retires up to 3 uops per clock Copies the values to the RRF Retirement is done In Order Performs exception checking An instruction is retired after the following checks
– Instruction has executed
– All previous instructions have retired
– Instruction isn’t mis-predicted
– no exceptions
out-of-order execution -P6Lihu Rappoport, 12/2004 23
Flow of Uops
DECODE:
– Decoders translate instructions into uops (1 to 6 uops per cycle) ISSUE:
– ALLOC unit allocates one entry per uop in the RS and in the ROB (for up to 3 uops per cycle) If source data is available from the ROB (either from the RRF of from
the Result Buffer (RB) it is written in the RS entry Otherwise, it is marked invalid in the RS (and should be captured
from the WB bus)
READY/SCHEDULE:
– Check for data-ready uops if desired functional unit available
– Upto 5 resource-ready uops selected, and dispatched per clock DISPATCH:
– Ship scheduled uops to appropriate functional unit (RS)
out-of-order execution -P6Lihu Rappoport, 12/2004 24
Flow of Uops (cont) WRITEBACK:
– Capture results returned by the functional units in a result buffer (ROB)– Snoop result writeback ports for results that are sources to uops in RS– Update data-ready status of these uops (RS)
RETIREMENT: – 3 consecutive entries read out of the RB
these entries are candidates for retirement
– Algorithm to determine fitness for retirement: candidate is retired if its ready bit is set it will not cause an exception all preceding candidates are eligible for retirement
– Commit results from result buffer to architecturally visible state in original “Issue” order
– clear machine and restart execution if “badness” occurs (ROB)
out-of-order execution -P6Lihu Rappoport, 12/2004 25
Large ROB and RS are Important
Large RS– Increases the window in which looking for impendent instructions
Exposes more parallelism potential
Large ROB– The ROB is a superset of the RS ROB size ≥ RS size
– Allows for of covering long latency operations (cache miss, divide)
Example– Assume there is a Load that misses the L1 cache
Data takes ~10 cycles to return ~30 new instructions get into the pipeline
– Instructions following the Load cannot commit Pile up in the ROB
– Instructions independent of the load are executed, and leave the RS As long as the ROB is not full, we can keep executing instructions
– A 40 entry ROB can cover for an L1 cache miss Cannot cover for an L2 cache miss, which is hundreds of cycles
out-of-order execution -P6Lihu Rappoport, 12/2004 26
P6 Caches
Blocking Caches– A cache miss prevents from other cache requests (which could possibly
be hits) to be served
– Makes OOO execution much less beneficial (cannot hide misses)
Both L1 and L2 cache in the P6 are non-blocking– Initiate the actions necessary to return data to cache miss while they
respond to subsequent cached data requests
– Support up to 4 outstanding misses Misses translate into outstanding requests on the P6 bus The bus can support up to 8 outstanding requests
The cache “squashes” (suspends) subsequent requests for the same missed cache line– Squashed requests are not counted in the number of outstanding requests
– Once the engine has executed beyond the 4 outstanding requests, subsequent load requests are placed in the (12 deep) load buffer
out-of-order execution -P6Lihu Rappoport, 12/2004 27
Memory Operations
Two types of memory access: loads and stores Loads
– Loads need to specify memory address to be accessed width of data being retrieved the destination register
– Loads are encoded into a single uop
Stores– Stores need to provide
a memory address, a data width, and the data to be written
– Stores require two uops: One to generate the address One to generate the data
– These uops are scheduled independently to maximize concurrency must re-combine in the store buffer for the store to complete
out-of-order execution -P6Lihu Rappoport, 12/2004 28
The Memory Execution Unit
As a black box, the MEU is just another execution unit– Receives memory reads and writes from the RS
– Returns data back to the RS and to the ROB
Unlike many execution units, the MEU has side effects – May cause a bus request to external memory
The RS dispatches memory uops – When all the data used for address calculation is ready
– Both to the MOB and to the Address Generation Unit (AGU) are free
– The AGU computes the linear address:
Segment-Base + Base-Address + (Scale*Index) + Displacement Sends linear address to MOB, where it is stored with the uop in
the appropriate Load Buffer or Store Buffer entry
out-of-order execution -P6Lihu Rappoport, 12/2004 29
The Memory Execution Unit (cont.)
The RS operates based on dataflow dependencies– But cannot detect memory dependencies, even as simple as the
following:
i1: movl -4(%ebp), %ebx # MEM[ebp-4] ← ebx
i2: movl %eax, -4(%ebp) # eax ← MEM[ebp-4]
– The RS may dispatch these operations in any order
– If the MEU blindly executed the above operations, then eax may get the wrong value
– It is the job of the MEU to detect violations of memory ordering, and prevent stale data from returning to the core
Could require memory operations dispatch non-speculatively– Would be slow and severely penalize all other operations in the
machine waiting for memory read return data
MEU allows speculative execution as much as possible
out-of-order execution -P6Lihu Rappoport, 12/2004 30
OOO Execution of Memory Operations
Stores are not executed OOO– Stores are never performed speculatively
there is no transparent way to undo them
– Stores are also never re-ordered among themselves The Store Buffer dispatches a store only when
the store has both its address and its data, and there are no older stores awaiting dispatch
Resolving memory dependencies: memory disambiguation– Some memory dependencies can be resolved statically
store r1,a load r2,b
– Problem: some cannot store r1,[r3]; load r2,b
can advance load before store
load must wait till r3 is known
out-of-order execution -P6Lihu Rappoport, 12/2004 31
Performance Impact of Forcing In-order Memory Operations
x86 has small register set uses memory often Studied the importance of memory access reordering. Basic conclusions:
– Preventing Stores from passing Stores/Loads 3%~5% performance loss P6 chooses not allow Stores to pass Stores/Loads
– Preventing Loads from passing Loads/Stores Big performance loss Need to allow loads to pass stores, and loads to pass loads
out-of-order execution -P6Lihu Rappoport, 12/2004 32
Memory Order Buffer (MOB)
Every memory uop is allocated an entry in order– De-allocated when retires
Address & data (for stores), are updated when known Load is checked against all previous stores
– Waits if store to same address exist, but data not ready If store data exists, just use it
– Waits till all previous store addresses are resolved In case of no address collision – go to memory
out-of-order execution -P6Lihu Rappoport, 12/2004 33
Load Execution Load is dispatched into memory pipe after dispatch from the RS If it misses the DCU, it is dispatched by the BUS unit After the data returns from the L2 or the bus
– the load signals as complete (Load WB valid)
– eligible to retire
L2
Bus
DCU
MOBROB
RS
Mem
Data ReqData Return
Data Req
Bus Req
Mem Load and FB Request
Load WB Valid
Port 2 RS dispatch
Data Return
out-of-order execution -P6Lihu Rappoport, 12/2004 34
Pentium® III: Senior Load Implementation
If a load misses the cache its retirement is delayed – Instructions following the load are executed, but cannot retire until the load
retires Even if the instructions are independent of the load data
– These instructions accumulate inside the machine
– Eventually the ROB is saturated, and the front-end is stalled
Pentium® III removed this bottleneck for Prefetch instructions– Instructions following a Prefetch are allowed to retire before the Prefetch
returns data
Prefetch signal readiness for retirement much earlier than Load– Completion signaled almost immediately after allocation into the MOB
(Pref WB Valid)
– Completion not delayed until the data is actually fetched from the memory subsystem
out-of-order execution -P6Lihu Rappoport, 12/2004 35
Senior load implementation
L2
Bus
DCU
MOBROB
RS
Mem
Data ReqData Return
Data Req
Bus Req
Mem Load and FB Request
Pref WB Valid
Port 2 RS dispatch
Data Return
out-of-order execution -P6Lihu Rappoport, 12/2004 36
Jump Misprediction – Flush at Retire
Flushing the pipe when the mis-predicted jump retires– retirement is in-order all the instructions preceding the jump have already retired all the instructions remaining in the pipe follow the jump flush all the instructions in the pipe
Disadvantage– Mis-prediction is known after the jump was executed, but
we continue fetching instructions from the wrong path until the branch retires
out-of-order execution -P6Lihu Rappoport, 12/2004 37
Jump Misprediction – Flush at Execute
When the JEU detects jump misprediction it
– Flush the in-order front-end
– Start fetching and decoding from the “correct” path The “correct” path still be wrong
A preceding uop that hasn’t executed may cause an exception A preceding jump executed OOO can also mispredict
– The “correct” instruction stream is stalled at the RAT The RAT was wrongly updated also by wrong path instruction
When the mispredicted branch retires
– Resets all state in the Out-of-Order Engine (RAT, RS, RB, MOB, etc.) Only instruction following the jump are left – they must all be flushed Reset the RAT to point only to architectural registers
– Un-stalls the in-order machine
– RS gets uops from RAT and starts scheduling and dispatching them
out-of-order execution -P6Lihu Rappoport, 12/2004 38
Interrupt Handling Complications for pipelines:
– Interrupts occur in the middle of an instruction
– Must restart interrupting and subsequent instructions
Precise interrupts preserve the model that instructions execute in program order– Identify the instruction that caused the interrupt
– Instructions before the faulting instruction finish
– Disable writes for faulting and subsequent instructions
– Force trap instruction into pipeline
– Trap routine
– Save the state of the executing program
– Correct the cause of the interrupt
– Restore program state
– Restart faulting and subsequent instructions
out-of-order execution -P6Lihu Rappoport, 12/2004 39
P6 Tutorials
Pentium® Pro Processor Microarchitecture Overviewhttp://www.intel.com/software/products/college/ia32/ppardown.htm
Optimizing for Intel® Pentium® II and Pentium Pro Processorshttp://www.intel.com/software/products/college/ia32/ppropdown.htm
out-of-order execution -P6Lihu Rappoport, 12/2004 40
Backup
out-of-order execution -P6Lihu Rappoport, 12/2004 41
O1 O3
R1 R2
Ex
I1 I2 I3 I4 I5 I6 I7 I8
NextIP
RegRen
RSWr
Icache Decode
RS disp
Retirement
P6 Queuing
RS scheduling
Retirement scheduling
Decoder queue (IDQ)•Stores uops until they can enter the RAT/Allocator
•Smoothes the variable number of uops output by the decoder per cycle
out-of-order execution -P6Lihu Rappoport, 12/2004 42
Register Renaming Helps with Small Number of Architectural Registers
x86 has a small number of general purpose registers– False dependencies can be caused by the need to reuse registers
for unconnected reasons, i.e.
MOV EAX, 17
ADD Mem, EAX
MOV EAX, 3
ADD EAX, EBX– Solved by register renaming
out-of-order execution -P6Lihu Rappoport, 12/2004 43
Streaming Stores
Applications can stream data from memory and use them just once– Using regular cache models will result in eviction of useful data from the cache– Many applications have a large working set (e.g., video and 3D), which doesn’t fit into the cache– In such a situation, it is better to bypasses the cache– Hence it should write directly to the memory: this is what streaming stores are for
Pentium III improved the write bandwidth by 20%– Pentium III now can saturate a 100 MHz bus with 800 MBs of writes– Done by removing a dead cycle between back to back write combining writes
Before the Pentium III, the architecture was mainly oriented to scalar applications – Fill buffers provide high instantaneous throughput caused by bursts of misses in scalar apps– Comparatively small average bandwidth requirements (~100 MB per second)
SSE applications requires high average bandwidth– Now the fill buffers need to sustain high average throughput
Fill buffers WC modified to allow all four buffers to be utilized for WC in parallel– Pentium II allows just one
The following WC eviction conditions policies added in Pentium III:– A buffer is evicted when all bytes are written (all dirty) to the fill buffer. – Previously the buffer eviction policy was “resource Demand” driven
i.e. a buffer gets evicted when DCU requests the allocation of new buffer– When all fill buffers are busy a DCU fill buffer allocation request, such as regular loads, stores,
or prefetches requiring a fill buffer can evict a WC buffer even if it is not full yet