ECE 4100/6100 Guest Lecture: P6 & NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee, School of ECE, Georgia Institute of Technology. February 11, 2003.
1
ECE4100/6100   H-H. S. Lee
ECE 4100/6100 Guest Lecture:
P6 & NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee
School of ECE
Georgia Institute of Technology
February 11, 2003
2
Why study P6 from last millennium?
A paradigm shift from Pentium
A RISC core disguised as a CISC
Huge market success:
  Microarchitecture
  And stock price
Architected by former VLIW and RISC folks
  Multiflow (pioneer in VLIW architecture for super-minicomputers)
  Intel i960 (Intel's RISC for graphics and embedded controllers)
NetBurst (P4's microarchitecture) is based on P6
3
P6 Basics
One implementation of IA32 architecture
Super-pipelined processor
3-way superscalar
In-order front-end and back-end
Dynamic execution engine (restricted dataflow)
Speculative execution
P6 microarchitecture family processors include
  Pentium Pro
  Pentium II (PPro + MMX + 2x caches: 16KB I / 16KB D)
  Pentium III (P-II + SSE + enhanced MMX, e.g. PSADBW)
  Celeron (without MP support)
Later P-II/P-III/Celeron all have on-die L2 cache
4
x86 Platform Architecture
[Figure: platform block diagram. Labels: Host Processor (P6 Core, L1 Cache (SRAM)); Back-Side Bus; L2 Cache (SRAM), on-die or on-package; Front-Side Bus; chipset (MCH, ICH); System Memory (DRAM); AGP; Graphics Processor (GPU) with Local Frame Buffer; PCI, USB, I/O]
5
Pentium III Die Map
EBL/BBL - External/Backside Bus Logic
MOB - Memory Order Buffer
Packed FPU - Floating Point Unit for SSE
IEU - Integer Execution Unit
FAU - Floating Point Arithmetic Unit
MIU - Memory Interface Unit
DCU - Data Cache Unit (L1)
PMH - Page Miss Handler
DTLB - Data TLB
BAC - Branch Address Calculator
RAT - Register Alias Table
SIMD - Packed Floating Point unit
RS - Reservation Station
BTB - Branch Target Buffer
TAP - Test Access Port
IFU - Instruction Fetch Unit and L1 I-Cache
ID - Instruction Decode
ROB - Reorder Buffer
MS - Micro-instruction Sequencer
6
ISA Enhancement (on top of Pentium)
CMOVcc / FCMOVcc r, r/m
  Conditional move (predicated move) instructions
  Based on condition codes (cc)
FCOMI/P: compare FP stack and set integer flags
RDPMC/RDTSC instructions
Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory
MMX in Pentium II
  SIMD integer operations
SSE in Pentium III
  Prefetches (non-temporal nta + temporal t0, t1, t2), sfence
  SIMD single-precision FP operations
7
P6 Pipelining
[Figure: P6 pipeline stage diagram; the numbers are pipeline stage IDs.
  In-order front end (stages 11-17, 20-22): Next IP, I-Cache, ILD, Rotate, Dec1, Dec2, Br Dec, RS Write, RAT, IDQ
  Single-cycle inst pipeline (31-33, 82-83): RS schd, RS Disp, Exec / WB
  Multi-cycle inst pipeline (31-33, 81-83): ..., Exec2, ..., Exec n
  Non-blocking memory pipeline (31-33, 42-43, 81-83): AGU, DCache1, DCache2
  Blocking memory pipeline (31-33, 40-43, 81-83): AGU, MOB blk, MOB wr, MOB disp, DCache1, DCache2, MOB wakeup
  Writeback stages: 81: Mem/FP WB, 82: Int WB, 83: Data WB
  In-order retirement (91-93): Ret ptr wr, Ret ROB rd, RRF wr
  Marked boundaries: FE in-order boundary, Retirement in-order boundary
  Marked delays: RS Scheduling Delay, ROB Scheduling Delay, MOB Scheduling Delay
  Stage summary: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2]
8
P6 Microarchitecture
[Figure: P6 block diagram inside the chip boundary, with control-flow and (restricted) data-flow paths.
  Blocks: BTB/BAC, Instruction Fetch Unit, Bus interface unit, Instruction Decoder, Microcode Sequencer, Register Alias Table, Allocator, Reservation Station, ROB & Retire RF, AGU, MMX, IEU/JEU, FEU, MIU, Memory Order Buffer, Data Cache Unit (L1), External bus
  Clusters: Instruction Fetch Cluster, Issue Cluster, Out-of-order Cluster, Memory Cluster, Bus Cluster]
9
Instruction Fetching Unit
IFU1: Initiate fetch, requesting 16 bytes at a time
IFU2: Instruction length decoder marks instruction boundaries; BTB makes prediction
IFU3: Align instructions to 3 decoders in 4-1-1 format
[Figure: IFU datapath. Labels: Next PC Mux, Other fetch requests, Linear Address, Instruction TLB (P.Addr), Streaming Buffer, Instruction Cache, Victim Cache (data, addr), Select mux, Branch Target Buffer (Prediction marks), ILD (Length marks), Instruction rotator, Instruction buffer, #bytes consumed by ID]
10
Dynamic Branch Prediction
Similar to a 2-level PAs design
Associated with each BTB entry
W/ 16-entry Return Stack Buffer
4 branch predictions per cycle (due to 16-byte fetch per cycle)
Static prediction provided by Branch Address Calculator when BTB misses (see the Static Branch Prediction slide)
[Figure: 512-entry BTB with 4 ways (W0-W3). Each entry holds a Branch History Register (BHR, e.g. 1101) that indexes the Pattern History Tables (PHT, entries 0000 ... 1111) of 2-bit saturating counters to produce the prediction. The branch result (Rc) updates the counter, and the new (speculative) history is written back by a speculative update.]
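The two-level scheme above can be sketched as follows. This is a minimal illustration, not the P6 implementation: the class and function names are hypothetical, the 4-bit history length is taken from the 4-bit patterns in the figure, and each branch gets its own counter table here (a real PAs design shares pattern tables among the branches of a set). Speculative history update and BTB associativity are left out.

```python
class TwoLevelPredictor:
    """Per-branch history register indexing 2-bit saturating counters."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.history = {}   # branch address -> branch history register (BHR)
        self.counters = {}  # (branch address, BHR value) -> 2-bit counter

    def predict(self, pc):
        bhr = self.history.get(pc, 0)
        # Counters start at 1 (weakly not-taken); values 2 and 3 predict taken.
        return self.counters.get((pc, bhr), 1) >= 2

    def update(self, pc, taken):
        bhr = self.history.get(pc, 0)
        ctr = self.counters.get((pc, bhr), 1)
        # Saturate at 0 and 3.
        ctr = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
        self.counters[(pc, bhr)] = ctr
        # Shift the outcome into the history register.
        self.history[pc] = ((bhr << 1) | int(taken)) & self.mask


def run_pattern(pattern, repeats):
    """Feed a repeating outcome pattern for one branch; return per-branch hit flags."""
    predictor = TwoLevelPredictor()
    results = []
    for _ in range(repeats):
        for taken in pattern:
            results.append(predictor.predict(0x400) == taken)
            predictor.update(0x400, taken)
    return results
```

For a loop branch that is taken three times and then falls through, `run_pattern([True, True, True, False], 20)` mispredicts only during warm-up: once the history register is full, each 4-bit history identifies the position in the pattern, which a single 2-bit counter without history cannot do.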
11
Static Branch Prediction
[Figure: decision flowchart. Decision boxes: BTB miss?, PC-relative?, Conditional?, Backwards?, Return?, Unconditional PC-relative?]
BTB hit: use the BTB's decision
BTB miss:
  Unconditional PC-relative: Taken
  Return: Taken
  Indirect jump: Taken
  Conditional PC-relative, backwards: Taken
  Conditional PC-relative, forwards: Not Taken
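The flowchart's outcomes can be written as a small decision function. This is a sketch under assumptions: the function names and the string labels for branch kinds are hypothetical, and the order in which the hardware evaluates the decision boxes is not modeled, only the resulting prediction.

```python
def static_predict(kind, backward=False):
    """Static prediction used on a BTB miss.

    kind is one of 'uncond_relative', 'cond_relative', 'return', 'indirect'
    (labels chosen for this sketch). backward matters only for conditional
    PC-relative branches.
    """
    if kind == "cond_relative":
        # Backward conditional branches (typically loops) are predicted taken,
        # forward ones not taken.
        return backward
    if kind in ("uncond_relative", "return", "indirect"):
        return True
    raise ValueError("unknown branch kind: %r" % (kind,))


def predict(btb_hit, btb_prediction, kind, backward=False):
    """Use the BTB's decision on a hit, the static rule on a miss."""
    if btb_hit:
        return btb_prediction
    return static_predict(kind, backward)
```

For example, `predict(False, None, "cond_relative", backward=True)` returns `True`, while the same branch pointing forward returns `False`.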
12
X86 Instruction Decode
4-1-1 decoder
Decode rate depends on instruction alignment
DEC1: translate x86 into micro-operations (μops)
DEC2: move decoded μops to ID queue
MS performs translations either by
  Generating the entire μop sequence from microcode ROM
  Receiving 4 μops from the complex decoder, and the rest from microcode ROM
[Figure: IFU3 feeds one complex decoder (1-4 μops) and two simple decoders (1 μop each), together with the micro-instruction sequencer (MS); they fill the instruction decoder queue (6 μops)]
Next 3 inst   #Inst to dec
S,S,S         3
S,S,C         First 2
S,C,S         First 1
S,C,C         First 1
C,S,S         3
C,S,C         First 2
C,C,S         First 1
C,C,C         First 1
S: Simple, C: Complex
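The table follows from one rule: only the first decoder accepts a complex instruction, so decoding stops at the first complex instruction that lands on the second or third decoder. A minimal sketch of that rule (the function name and the 'S'/'C' string encoding are assumptions made for illustration):

```python
def instructions_decoded(kinds):
    """Number of the next three instructions decoded this cycle.

    kinds is a string such as 'SCS': 'S' for simple, 'C' for complex,
    in program order.
    """
    count = 0
    for slot, kind in enumerate(kinds[:3]):
        if kind == "C" and slot != 0:
            break  # decoders 1 and 2 handle simple instructions only
        count += 1
    return count
```

`instructions_decoded("CSS")` gives 3 and `instructions_decoded("SCS")` gives 1, which is why the decode rate depends on how complex instructions align with decoder 0.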
13
Allocator
The interface between in-order and out-of-order pipelines
Allocates
  "3-or-none" μops per cycle into RS, ROB
  "all-or-none" in MOB (LB and SB)
Generates physical destination Pdst from the ROB and passes it to the Register Alias Table (RAT)
Stalls upon shortage of resources
14
Register Alias Table (RAT)
Register renaming for 8 integer registers, 8 floating point (stack) registers and flags: 3 μops per cycle
40 80-bit physical registers embedded in the ROB (thereby, 6 bits to specify PSrc)
RAT looks up physical ROB locations for renamed sources based on RRF bit
[Figure: RAT datapath. Labels: In-order queue, Logical Src, FP TOS Adjust, FP RAT Array, Integer RAT Array, Array Physical Src (Psrc), Int and FP Overrides, Physical ROB Pointers from Allocator, RAT PSrc's]
[Figure: Renaming Example. Integer RAT entries for EAX, EBX, ECX, EDX, each with an RRF bit and a PSrc field pointing into the ROB or the RRF; values shown include PSrc 25, 2, 15 and RRF bits 0, 0, 0, 1]
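A rough sketch of the renaming step may help here. It assumes a simplified model: the class and method names are hypothetical, only the integer registers are covered, and ROB-full stalls, flags, FP stack adjustment and partial-width banks are ignored. The RRF bit says whether the current value of a logical register lives in the retirement register file (1) or in a ROB entry (0).

```python
ROB_ENTRIES = 40  # from the slide: 40 physical registers embedded in the ROB


class RAT:
    def __init__(self, regs=("EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "ESP", "EBP")):
        # Each entry is (rrf_bit, psrc); initially every value is architectural.
        self.table = {r: (1, None) for r in regs}
        self.next_pdst = 0

    def rename(self, dst, srcs):
        """Rename one uop: look up its sources, then allocate a Pdst for dst."""
        sources = []
        for s in srcs:
            rrf, psrc = self.table[s]
            sources.append(("RRF", s) if rrf else ("ROB", psrc))
        pdst = self.next_pdst  # ROB entries are handed out sequentially
        self.next_pdst = (self.next_pdst + 1) % ROB_ENTRIES
        self.table[dst] = (0, pdst)
        return pdst, sources

    def retire(self, dst, pdst):
        """On retirement, point dst back at the RRF unless a newer rename exists."""
        if self.table[dst] == (0, pdst):
            self.table[dst] = (1, None)
```

Renaming `ADD EAX, EBX` and then `ADD EBX, EAX` with this model makes the second uop read EAX from ROB entry 0 rather than from the retirement register file, which is the lookup the RRF bit selects.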
15
Partial Register Width Renaming
32/16-bit accesses:
  Read from low bank
  Write to both banks
8-bit RAT accesses: depending on which bank is being written
[Figure: RAT datapath. Labels: In-order queue, Logical Src, FP TOS Adjust, FP RAT Array, Integer RAT Array, Array Physical Src, Int and FP Overrides, Physical ROB Pointers from Allocator, RAT Physical Src]
Integer RAT Array
  INT Low Bank (32b/16b/L): 8 entries
  INT High Bank (H): 4 entries
  Entry format: Size(2), RRF(1), PSrc(6)
Example
  μop0: MOV AL = (a)
  μop1: MOV AH = (b)
  μop2: ADD AL = (c)
  μop3: ADD AH = (d)
16
Partial Stalls due to RAT
Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a larger (e.g. 32-bit) read
Partial flags stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction touches
Partial register stalls (write AX, then read EAX)
  MOVB AL, m8
  ADD EAX, m32     ; stall
Idiom Fix (1)
  XOR EAX, EAX
  MOVB AL, m8
  ADD EAX, m32     ; no stall
Idiom Fix (2)
  SUB EAX, EAX
  MOVB AL, m8
  ADD EAX, m32     ; no stall
Partial flag stalls (1)
  CMP EAX, EBX
  INC ECX
  JBE XX           ; stall
  JBE reads both ZF and CF while INC affects (ZF, OF, SF, AF, PF)
Partial flag stalls (2)
  TEST EBX, EBX
  LAHF             ; stall
  LAHF loads low byte of EFLAGS
17
Reservation Stations
Gateway to execution: dispatches up to 5 μops per cycle, one to each port
20-entry μop buffer bridging the in-order and out-of-order engines
RS fields include μop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
Oldest-first FIFO scheduling when multiple μops are ready in the same cycle
[Figure: RS dispatch ports and writeback paths.
  Port 0: IEU0, Fadd, Fmul, Imul, Div
  Port 1: IEU1, JEU
  Port 2: AGU0, Ld addr (LDA)
  Port 3: AGU1, St addr (STA)
  Port 4: St data (STD)
  Packed FP units: Pfadd, Pfmul, Pfshuf
  Other labels: WB bus 0, WB bus 1, MOB, DCU, Loaded data, ROB, RRF, Retired data]
18
ReOrder Buffer
A 40-entry circular buffer
  Similar to that described in [SmithPleszkun85]
  157-bit wide
  Provide 40 alias physical registers
Out-of-order completion
Deposit exception in each entry
Retirement (or de-allocation)
  After resolving prior speculation
  Handle exceptions thru MS
  Clear OOO state when a mis-predicted branch or exception is detected
  3 μops per cycle in program order
  For multi-μop x86 instructions: none or all (atomic)
[Figure: ALLOC, RAT and RS connect to the ROB, which retires into the RRF; exceptions (exp) go to the MS for code assist]
19
Memory Execution Cluster
Manage data memory accesses
Address translation
Detect violation of access ordering
Fill buffers (FB) in DCU (similar to MSHR [Kroft'81]) for handling cache misses (non-blocking)
[Figure: Memory Cluster Blocks. Labels: RS / ROB issuing LD, STA, STD; DTLB; DCU; Load Buffer; Store Buffer; FB; EBL]
20
Memory Order Buffer (MOB)
Allocated by ALLOC
A second-order RS for memory operations
1 μop for load; 2 μops for store: Store Address (STA) and Store Data (STD)
MOB
  16-entry load buffer (LB)
  12-entry store address buffer (SAB)
  SAB works in unison with
    Store data buffer (SDB) in MIU
    Physical Address Buffer (PAB) in DCU
  Store Buffer (SB): SAB + SDB + PAB
Senior Stores
  Upon STD/STA retired from ROB
  SB marks the store "senior"
  Senior stores are committed back in program order to memory when bus idle or SB full
Prefetch instructions in P-III
  Senior load behavior
  Due to no explicit architectural destination
21
Store Coloring
ALLOC assigns Store Buffer ID (SBID) in program order
ALLOC tags loads with the most recent SBID
Check loads against stores with equal or older SBIDs for potential address conflicts
SDB forwards data if conflict detected
x86 instruction       μops          store color
mov (0x1220), ebx     std (ebx)     2
                      sta 0x1220    2
mov (0x1110), eax     std (eax)     3
                      sta 0x1110    3
mov ecx, (0x1220)     ld            3
mov edx, (0x1280)     ld            3
mov (0x1400), edx     std (edx)     4
                      sta 0x1400    4
mov edx, (0x1380)     ld            4
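The coloring rule can be sketched in a few lines. The names below are hypothetical and the model is deliberately small: one address per access, exact address matching, and no timing. A load inherits the SBID of the most recent store, so the stores that precede it in program order are exactly those whose SBID does not exceed the load's color.

```python
def color_uops(accesses, first_sbid=2):
    """Tag each access with a store color.

    accesses is a list of ("st" | "ld", address) in program order.
    first_sbid=2 is chosen only to match the numbering of the slide example.
    """
    colored, sbid = [], first_sbid - 1
    for kind, addr in accesses:
        if kind == "st":
            sbid += 1  # each store gets the next SBID in program order
        colored.append((kind, addr, sbid))  # loads take the most recent SBID
    return colored


def forwarding_store(colored, load_index):
    """Color of the youngest preceding store to the load's address, or None."""
    _, addr, color = colored[load_index]
    match = None
    for kind, a, c in colored:
        if kind == "st" and c <= color and a == addr:
            match = c  # later matches overwrite earlier ones
    return match
```

With the slide's sequence, the load from 0x1220 carries color 3 and matches the store with color 2, so the store data buffer would forward `ebx`; the load from 0x1280 matches nothing and goes to the cache.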
22
Memory Type Range Registers (MTRR)
Control registers written by the system (OS)
Supported memory types
  UnCacheable (UC)
  Uncacheable Speculative Write-combining (USWC or WC)
    Use a fill buffer entry as WC buffer
  WriteBack (WB)
  Write-Through (WT)
  Write-Protected (WP)
    E.g. support copy-on-write in UNIX: save memory space by allowing child processes to share pages with their parents, creating new memory pages only when a child process attempts to write
Page Miss Handler (PMH)
  Look up MTRR while supplying physical addresses
  Return memory types and physical address to DTLB
23
Intel NetBurst Microarchitecture
Pentium 4's microarchitecture, a post-P6 new generation
Original target market: graphics workstations, but … the major competitor screwed up themselves…
Design goals:
  Performance, performance, performance, …
  Unprecedented multimedia/floating-point performance
    Streaming SIMD Extensions 2 (SSE2)
  Reduced CPI
    Low latency instructions
    High bandwidth instruction fetching
    Rapid execution of arithmetic & logic operations
  Reduced clock period
    New pipeline designed for scalability
24
Innovations Beyond P6
Hyperpipelined technology
Streaming SIMD Extension 2
Enhanced branch predictor
Execution trace cache
Rapid execution engine
Advanced Transfer Cache
Hyper-threading Technology (in Xeon and Xeon MP)
25
Pentium 4 Fact Sheet
IA-32 fully backward compatible
Available at speeds ranging from 1.3 to ~3 GHz
Hyperpipelined (20+ stages)
42+ million transistors
0.18 μm for 1.7 to 1.9 GHz; 0.13 μm for 1.8 to 2.8 GHz
Die size of 217 mm²
Consumes 55 watts of power at 1.5 GHz
400 MHz (850) and 533 MHz (850E) system bus
512KB or 256KB 8-way full-speed on-die L2 Advanced Transfer Cache (up to 89.6 GB/s @ 2.8 GHz to L1)
1MB or 512KB L3 cache (in Xeon MP)
144 new 128-bit SIMD instructions (SSE2)
HyperThreading Technology (only enabled in Xeon and Xeon MP)
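The bandwidth figures are peak values obtained from interface width times transfer rate. The short calculation below reproduces them; it assumes the 256-bit L2-to-L1 interface and the 64-bit quad-pumped system bus shown on the Pentium 4 block diagram slide, and the function name is made up for this sketch.

```python
def peak_bandwidth_gb_s(width_bits, transfers_per_second):
    """Peak bandwidth in GB/s (10^9 bytes/s) for one transfer of width_bits per beat."""
    return (width_bits / 8) * transfers_per_second / 1e9


# L2 to L1: 256 bits (32 bytes) per core clock
l2_at_2_8_ghz = peak_bandwidth_gb_s(256, 2.8e9)   # about 89.6 GB/s
l2_at_1_5_ghz = peak_bandwidth_gb_s(256, 1.5e9)   # about 48 GB/s

# System bus: 64 bits per transfer, 400 or 533 million transfers per second
bus_400 = peak_bandwidth_gb_s(64, 400e6)          # about 3.2 GB/s
bus_533 = peak_bandwidth_gb_s(64, 533e6)          # about 4.3 GB/s
```

These are upper bounds that assume a transfer on every beat; sustained bandwidth is lower.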
26
Recent Intel IA-32 Processors
27
Building Blocks of NetBurst
[Figure: NetBurst building blocks.
  Memory subsystem: System bus, Bus Unit, Level 2 Cache
  Front-end: Fetch/Dec, ETC, μROM, BTB / Br Pred.
  Out-of-Order Engine: OOO logic, Retire
  Execution Units: INT and FP Exec. Unit, L1 Data Cache
  Feedback path: Branch history update]
28
Pentium 4 Microarchitecture
[Figure: Pentium 4 block diagram.
  Front end: BTB (4k entries), I-TLB/Prefetcher, IA32 Decoder, Execution Trace Cache, Trace Cache BTB (512 entries), μCode ROM, μop Queue
  Allocator / Register Renamer
  Queues: INT / FP μop Queue, Memory μop Queue
  Schedulers: Memory scheduler, Fast, Slow/General FP scheduler, Simple FP
  Register files: INT Register File / Bypass Network, FP RF / Bypass Ntwk
  Execution units: AGU (Ld addr), AGU (St addr), 2x ALU (Simple Inst.), 2x ALU (Simple Inst.), Slow ALU (Complex Inst.), FP / MMX / SSE/2, FP Move
  L1 Data Cache (8KB 4-way, 64-byte line, WT, 1 rd + 1 wr port)
  U-L2 Cache: 256KB 8-way, 128B line, WB; 256-bit path to L1, 48 GB/s @ 1.5 GHz
  BIU: 64-bit quad-pumped system bus, 400/533 MHz, 3.2/4.3 GB/sec]
29
Pipeline Depth Evolution
P5 Microarchitecture: PREF, DEC, DEC, EXEC, WB
P6 Microarchitecture: IFU1, IFU2, IFU3, DEC1, DEC2, RAT, ROB, DIS, EX, RET1, RET2
NetBurst Microarchitecture: TC NextIP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Ck, Drive
30
Execution Trace Cache
Primary first-level I-cache to replace conventional L1
  Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
  Branch misprediction penalty is horrible: 20 pipeline stages lost vs. 10 stages in P6
Advantages
  Cache post-decode μops
  High bandwidth instruction fetching
  Eliminate x86 decoding overheads
  Reduce branch recovery time if TC hits
Hold up to 12,000 μops
  6 μops per trace line
  Many (?) trace lines in a single trace
31
Execution Trace Cache
Deliver 3 μops per cycle to OOO engine
x86 instructions read from L2 when TC misses (7+ cycle latency)
TC hit rate comparable to that of an 8KB to 16KB conventional I-cache
Simplified x86 decoder
  Only one complex instruction per cycle
  Instructions > 4 μops will be executed by microcode ROM (P6's MS)
Perform branch prediction in TC
  512-entry BTB + 16-entry RAS
  With BP in x86 IFU, reduce mispredictions by 1/3 compared to P6
  Intel did not disclose the details of BP algorithms used in TC and x86 IFU (Dynamic + Static)
32
Out-Of-Order Engine
Similar design philosophy to P6; uses
  Allocator
  Register Alias Table
  128 physical registers
  126-entry ReOrder Buffer
  48-entry load buffer
  24-entry store buffer
33
Register Renaming Schemes
[Figure: P6 Register Renaming. The RAT (entries EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) points into the ROB (40-entry, allocated sequentially), whose entries hold both Data and Status; retired values are kept in the RRF.]
[Figure: NetBurst Register Renaming. A Front-end RAT and a Retirement RAT (entries EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP) point into a separate RF (128-entry) that holds Data; the ROB (126 entries, allocated sequentially) holds Status.]
34
Micro-op Scheduling
μop FIFO queues
  Memory queue for loads and stores
  Non-memory queue
μop schedulers
  Several schedulers fire instructions to execution (P6's RS)
  4 distinct dispatch ports
  Maximum dispatch: 6 μops per cycle (2 fast ALU μops from each of Port 0 and Port 1 per cycle; 1 from each of the ld/st ports)
Exec Port 0
  Fast ALU (2x pumped): Add/sub, Logic, Store Data, Branches
  FP Move: FP/SSE Move, FP/SSE Store, FXCH
Exec Port 1
  Fast ALU (2x pumped): Add/sub
  INT Exec: Shift, Rotate
  FP Exec: FP/SSE Add, FP/SSE Mul, FP/SSE Div, MMX
Load Port
  Memory Load: Loads, LEA, Prefetch
Store Port
  Memory Store: Stores
35
Data Memory Accesses
8KB 4-way L1 + 256KB 8-way L2 (with a HW prefetcher)
Load-to-use speculation
  Dependent instruction dispatched before load finishes
    Due to the high frequency and deep pipeline depth
  Scheduler assumes loads always hit L1
  If L1 misses, dependent instructions that have left the scheduler temporarily receive incorrect data: mis-speculation
  Replay logic: re-execute the load when mis-speculated
  Independent instructions are allowed to proceed
Up to 4 outstanding load misses (= 4 fill buffers in original P6)
Store-to-load forwarding buffer
  24 entries
  Have the same starting physical address
  Load data size <= store data size
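The two forwarding conditions listed above amount to a simple predicate. The sketch below is an illustration only, with hypothetical names; it ignores alignment, memory types and timing, all of which matter in the real hardware.

```python
def can_forward(store_addr, store_size, load_addr, load_size):
    """True if a buffered store could supply the load's data.

    Both accesses must start at the same physical address, and the load
    must not read more bytes than the store wrote.
    """
    return store_addr == load_addr and load_size <= store_size
```

A 4-byte store followed by a 2-byte load from the same address qualifies, while a 2-byte store followed by a 4-byte load, or a load that starts in the middle of the stored bytes, does not.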
36
Streaming SIMD Extension 2
P-III SSE (Katmai New Instructions: KNI)
  Eight 128-bit wide xmm registers (new architecture state)
  Single-precision 128-bit SIMD FP
    Four 32-bit FP operations in one instruction
    Broken down into 2 μops for execution (only 80-bit data in ROB)
  64-bit SIMD MMX (use 8 mm registers, mapped to FP stack)
  Prefetch (nta, t0, t1, t2) and sfence
P4 SSE2 (Willamette New Instructions: WNI)
  Support double-precision 128-bit SIMD FP
    Two 64-bit FP operations in one instruction
    Throughput: 2 cycles for most SSE2 operations (exceptional examples: DIVPD and SQRTPD: 69 cycles, non-pipelined)
  Enhanced 128-bit SIMD MMX using xmm registers
37
Examples of Using SSE
Packed SP FP operation (e.g. ADDPS xmm1, xmm2)
  xmm1 (before): X3, X2, X1, X0
  xmm2:          Y3, Y2, Y1, Y0
  xmm1 (after):  X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0
Scalar SP FP operation (e.g. ADDSS xmm1, xmm2)
  xmm1 (before): X3, X2, X1, X0
  xmm2:          Y3, Y2, Y1, Y0
  xmm1 (after):  X3, X2, X1, X0 op Y0
Shuffle FP operation with 8-bit imm (e.g. SHUFPS xmm1, xmm2, imm8)
  xmm1 (before): X3, X2, X1, X0
  xmm2:          Y3, Y2, Y1, Y0
  xmm1 (after):  Y3 .. Y0, Y3 .. Y0, X3 .. X0, X3 .. X0
Shuffle FP operation with 8-bit imm (e.g. SHUFPS xmm1, xmm2, 0xf1)
  xmm1 (before): X3, X2, X1, X0
  xmm2:          Y3, Y2, Y1, Y0
  xmm1 (after):  Y3, Y3, X0, X1
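The lane behavior of these three instructions can be emulated on plain lists. This is a sketch for checking the diagrams, not an intrinsic API: the function names simply mirror the mnemonics, and index 0 is the lowest lane, so a list reads in the opposite order to the high-to-low layout of the slide.

```python
def addps(x, y):
    """Packed single-precision add: all four lanes."""
    return [a + b for a, b in zip(x, y)]


def addss(x, y):
    """Scalar single-precision add: lane 0 only; upper lanes of x pass through."""
    return [x[0] + y[0]] + list(x[1:])


def shufps(x, y, imm8):
    """SHUFPS x, y, imm8: the two low result lanes are selected from x (the
    destination), the two high result lanes from y (the source); each 2-bit
    field of imm8 picks one lane."""
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    return [x[sel[0]], x[sel[1]], y[sel[2]], y[sel[3]]]


x = ["X0", "X1", "X2", "X3"]
y = ["Y0", "Y1", "Y2", "Y3"]
shuffled = shufps(x, y, 0xF1)  # low to high: X1, X0, Y3, Y3
```

Read from the high lane down, `shuffled` is Y3, Y3, X0, X1: the immediate 0xf1 is 11 11 00 01 in binary, selecting lane 3 of the source twice, then lanes 0 and 1 of the destination.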
38
Examples of Using SSE and SSE2
SSE (xmm1 = X3, X2, X1, X0; xmm2 = Y3, Y2, Y1, Y0)
  Packed SP FP operation (e.g. ADDPS xmm1, xmm2)
    xmm1 (after): X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0
  Scalar SP FP operation (e.g. ADDSS xmm1, xmm2)
    xmm1 (after): X3, X2, X1, X0 op Y0
  Shuffle FP operation with 8-bit imm (e.g. SHUFPS xmm1, xmm2, imm8)
    xmm1 (after): Y3 .. Y0, Y3 .. Y0, X3 .. X0, X3 .. X0
  Shuffle FP operation with 8-bit imm (e.g. SHUFPS xmm1, xmm2, 0xf1)
    xmm1 (after): Y3, Y3, X0, X1
SSE2 (xmm1 = X1, X0; xmm2 = Y1, Y0)
  Packed DP FP operation (e.g. ADDPD xmm1, xmm2)
    xmm1 (after): X1 op Y1, X0 op Y0
  Scalar DP FP operation (e.g. ADDSD xmm1, xmm2)
    xmm1 (after): X1, X0 op Y0
  Shuffle DP operation with 2-bit imm (e.g. SHUFPD xmm1, xmm2, imm2)
    xmm1 (after): Y1 or Y0, X1 or X0
39
HyperThreading
In Intel Xeon Processor and Intel Xeon MP Processor
Enable Simultaneous Multi-Threading (SMT)
  Exploit ILP through TLP (Thread-Level Parallelism)
  Issuing and executing multiple threads at the same snapshot
Single P4 Xeon appears to be 2 logical processors
Share the same execution resources
Architectural states are duplicated in hardware
40
Multithreading (MT) Paradigms
[Figure: occupancy of four functional units (FU1-FU4) over execution time, with issue slots marked by thread (Thread 1 to Thread 5) or unused, for five designs:
  Conventional Superscalar, Single Threaded
  Fine-grained Multithreading (cycle-by-cycle interleaving)
  Coarse-grained Multithreading (block interleaving)
  Chip Multiprocessor (CMP)
  Simultaneous Multithreading]
41
More SMT commercial processors
Intel Xeon Hyperthreading
  Supports 2 replicated hardware contexts: PC (or IP) and architecture registers
  New directions of usage
    Helper (or assisted) threads (e.g. speculative precomputation)
    Speculative multithreading
Clearwater (once called XStream Logic): 8-context SMT "network processor" designed by DISC architect (company no longer exists)
SUN 4-SMT-processor CMP?
42
Speculative Multithreading
SMT can justify wider-than-ILP datapath
But, datapath is only fully utilized by multiple threads
How to speed up single-thread program by utilizing multiple threads?
What to do with spare resources?
  Execute both sides of hard-to-predict branches
    Eager execution or Polypath execution
    Dynamic predication
  Send another thread to scout ahead to warm up caches & BTB
    Speculative precomputation
    Early branch resolution
  Speculatively execute future work
    Multiscalar or dynamic multithreading
    e.g. start several loop iterations concurrently as different threads; if data dependence is detected, redo the work
  Run a dynamic compiler/optimizer on the side
  Dynamic verification
    DIVA or Slipstream Processor