Intel Labs
Copyright © 2000 Intel Corporation.
Fall 2000
Inside the Pentium® 4 Processor Micro-architecture
Next Generation IA-32 Micro-architecture
Doug Carmean
Principal Architect
Intel Architecture Group
August 24, 2000

Agenda
IA-32 Processor Roadmap
Design Goals
Frequency
Instructions Per Cycle
Summary

Intel® Pentium® 4 Processor
[Figure: performance versus time across the 486, P5, and P6 micro-architectures, with the Intel® NetBurst™ micro-architecture marked "Today"]

Intel® Pentium® 4 Processor Design Goals
Deliver world-class performance across both existing and emerging applications
Deliver performance headroom and scalability for the future
Micro-architecture that will drive performance leadership for the next several years

Intel® NetBurst™ Micro-architecture
400 MHz System Bus
Rapid Execution Engine
Execution Trace Cache
Hyper Pipelined Technology
Advanced Transfer Cache
Advanced Dynamic Execution
Streaming SIMD Extensions 2
Enhanced Floating Point / Multi-Media

Pentium® 4 Processor Block Diagram
[Figure: block diagram. Units shown: BTB & I-TLB, Decoder, Trace Cache with its own BTB, uCode ROM, Rename/Alloc, uop Queues, Schedulers, Integer RF, FP RF, four ALUs, Load AGU, Store AGU, FMul / FAdd / MMX / SSE, FP move / FP store, L1 D-Cache and D-TLB, L2 Cache and Control, 3.2 GB/s System Interface]

CPU Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle
Frequency
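
The relation on this slide can be sketched in a few lines of Python. The function name and the sample numbers are illustrative assumptions, not figures from the talk; the point is only that a frequency gain and an IPC gain multiply.

```python
def delivered_performance(frequency_hz, ipc):
    """Instructions completed per second = clock frequency * instructions per cycle."""
    return frequency_hz * ipc


# Hypothetical numbers, chosen only to illustrate the relation.
baseline = delivered_performance(1.0e9, 1.0)
faster_clock = delivered_performance(1.5e9, 1.0)  # higher frequency, same IPC
both = delivered_performance(1.5e9, 1.2)          # higher frequency and higher IPC

print(faster_clock / baseline)  # speedup from frequency alone
print(both / baseline)          # frequency ratio times IPC ratio
```

The rest of the talk follows this split: first the frequency term, then the IPC term.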

Frequency
What limits frequency?
–Process technology
–Microarchitecture
On a given process technology
–Fewer gates per pipeline stage will deliver higher frequency
Frequency is driven by Microarchitecture
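
As a rough illustration of why fewer gates per stage raises frequency, the sketch below models the cycle time as gate delays plus a fixed per-stage latch overhead. The model and every number in it are assumptions made for this example; the slide gives no gate counts or delays.

```python
def max_frequency_ghz(gates_per_stage, gate_delay_ps, latch_overhead_ps):
    """Clock frequency allowed by the slowest stage, in GHz.

    Cycle time (ps) = gates in the stage * delay per gate + latch overhead.
    """
    cycle_ps = gates_per_stage * gate_delay_ps + latch_overhead_ps
    return 1000.0 / cycle_ps  # 1000 ps per ns, so 1000 / ps gives GHz


# Hypothetical process: 30 ps per gate, 100 ps of latch and clock overhead.
print(max_frequency_ghz(30, 30, 100))  # deeper logic per stage
print(max_frequency_ghz(15, 30, 100))  # half the gates per stage
```

In this toy model halving the gates per stage raises frequency by less than 2x, because the latch overhead does not shrink with the logic.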

NetBurst™ Micro-architecture Pipeline vs P6
Basic P6 Pipeline (intro at 733MHz, .18µ)
1 Fetch, 2 Fetch, 3 Decode, 4 Decode, 5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec
Basic Pentium® 4 Processor Pipeline (intro at 1.4GHz, .18µ)
1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 Br Ck, 20 Drive
Hyper pipelined Technology enables industry leading performance and clock rate
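
The two pipelines on this slide can be written down as data, which makes the depth difference easy to check. The list and tuple layout is an assumption of this sketch; the two-stage spans for TC Nxt IP, TC Fetch, and Rename are read off the slide layout, where those labels cover two numbered cells.

```python
P6_PIPELINE = [
    "Fetch", "Fetch", "Decode", "Decode", "Decode",
    "Rename", "ROB Rd", "Rdy/Sch", "Dispatch", "Exec",
]

# (stage name, number of clock stages it occupies)
P4_STAGE_SPANS = [
    ("TC Nxt IP", 2), ("TC Fetch", 2), ("Drive", 1), ("Alloc", 1),
    ("Rename", 2), ("Que", 1), ("Sch", 3), ("Disp", 2), ("RF", 2),
    ("Ex", 1), ("Flgs", 1), ("Br Ck", 1), ("Drive", 1),
]


def expand(spans):
    """Turn (name, length) spans into one entry per pipeline stage."""
    return [name for name, length in spans for _ in range(length)]


P4_PIPELINE = expand(P4_STAGE_SPANS)

print(len(P6_PIPELINE), len(P4_PIPELINE))
# 1-based stage in which a branch outcome is checked on the Pentium 4 pipeline
print(P4_PIPELINE.index("Br Ck") + 1)
```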

Hyper Pipelined Technology
[Figure: frequency versus time, with pipeline depth per micro-architecture]
P5 Micro-Architecture: 5 stages, 60MHz at introduction, 233MHz
P6 Micro-Architecture: 10 stages, 166MHz at introduction, 1.13GHz
NetBurst Micro-Architecture: 20 stages, 1.4GHz today

Hyper pipelined Technology
[Figure: block diagram above the 20-stage pipeline strip, with the current stage highlighted]

TC Nxt IP (stages 1-2): Trace cache next instruction pointer
Pointer from the BTB, indicating location of next instruction.

TC Fetch (stages 3-4): Trace cache fetch
Read the decoded instructions (uOPs) out of the Execution Trace Cache.

Drive (stage 5): Wire delay
Drive the uOPs to the allocator.

Alloc (stage 6): Allocate
Allocate resources required for execution. The resources include Load buffers, Store buffers, etc.

Rename (stages 7-8): Register renaming
Rename the logical registers (EAX) to the physical register space (128 are implemented).

Que (stage 9): Write into the uOP Queue
uOPs are placed into the queues, where they are held until there is room in the schedulers.

Sch (stages 10-12): Schedule
Write into the schedulers and compute dependencies. Watch for dependencies to resolve.

Disp (stages 13-14): Dispatch
Send the uOPs to the appropriate execution unit.

RF (stages 15-16): Register File
Read the register file. These are the source(s) for the pending operation (ALU or other).

Ex (stage 17): Execute
Execute the uOPs on the appropriate execution port.

Flgs (stage 18): Flags
Compute flags (zero, negative, etc.). These are typically the input to a branch instruction.

Br Ck (stage 19): Branch Check
The branch operation compares the result of the actual branch direction with the prediction.

Drive (stage 20): Wire delay
Drive the result of the branch check to the front end of the machine.

CPU Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle
Instructions Per Cycle

Improving Instructions Per Cycle
Improve efficiency
–Branch prediction
–Do more things in a clock
Reduce time it takes to do something
–Reducing latency

Branch Prediction
Dramatic improvement over P6 branch predictor:
–8x the size (4K)
–Eliminated 1/3 of the mispredictions
Proven to be better than all other publicly disclosed predictors
–(g-share, hybrid, etc.)
Accurate branch prediction is key to enabling longer pipelines
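
A back-of-the-envelope model shows why prediction accuracy matters more as the pipeline gets longer. The formula is a common first-order approximation, and the branch fraction, misprediction rates, and penalties below are hypothetical values picked for illustration; none of them come from the slides.

```python
def effective_ipc(base_ipc, branch_fraction, mispredict_rate, penalty_cycles):
    """First-order IPC estimate once misprediction stalls are charged.

    Each instruction costs 1/base_ipc cycles, plus the misprediction penalty
    weighted by how often an instruction is a mispredicted branch.
    """
    cycles_per_instruction = (
        1.0 / base_ipc + branch_fraction * mispredict_rate * penalty_cycles
    )
    return 1.0 / cycles_per_instruction


# Hypothetical workload: 20% branches, penalties of 10 and 20 cycles.
short_pipe = effective_ipc(1.0, 0.20, 0.06, 10)
long_pipe = effective_ipc(1.0, 0.20, 0.06, 20)
long_pipe_better_predictor = effective_ipc(1.0, 0.20, 0.04, 20)  # 1/3 fewer misses

print(short_pipe, long_pipe, long_pipe_better_predictor)
```

Under these assumed numbers, removing a third of the mispredictions recovers part of the IPC that the longer penalty costs.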

The Execution Trace Cache
[Figure: Pentium® 4 Processor block diagram with the Trace Cache and its BTB highlighted]

Execution Trace Cache
Advanced L1 instruction cache
–Caches "decoded" IA-32 instructions (uops)
Removes decoder pipeline latency
Capacity is ~12K uOps
Integrates branches into single line
–Follows predicted path of program execution
Execution Trace Cache feeds fast engine

Execution Trace Cache
Program code:
     1 cmp
     2 br -> T1
     ... (unused code)
T1:  3 sub
     4 br -> T2
     ... (unused code)
T2:  5 mov
     6 sub
     7 br -> T3
     ... (unused code)
T3:  8 add
     9 sub
     10 mul
     11 cmp
     12 br -> T4
Trace Cache Delivery
1 cmp    | 2 br T1     | 3 T1: sub
4 br T2  | 5 mov       | 6 sub
7 br T3  | 8 T3: add   | 9 sub
10 mul   | 11 cmp      | 12 br T4
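
The example on this slide can be reproduced with a small sketch that walks the program along the predicted-taken branches and packs what it visits into rows of three, as drawn in the delivery figure. The tuple layout, function names, and row width are assumptions of this sketch; it models only the idea of following the predicted path, not how the hardware builds or stores traces.

```python
# Static program order from the slide. Tuple layout (label, text, branch_target)
# is an assumption of this sketch; "(unused code)" is skipped by the predicted path.
PROGRAM = [
    (None, "1 cmp", None),
    (None, "2 br -> T1", "T1"),
    (None, "(unused code)", None),
    ("T1", "3 sub", None),
    (None, "4 br -> T2", "T2"),
    (None, "(unused code)", None),
    ("T2", "5 mov", None),
    (None, "6 sub", None),
    (None, "7 br -> T3", "T3"),
    (None, "(unused code)", None),
    ("T3", "8 add", None),
    (None, "9 sub", None),
    (None, "10 mul", None),
    (None, "11 cmp", None),
    (None, "12 br -> T4", "T4"),
]


def predicted_path(program, start=0):
    """Collect instructions in predicted execution order (all branches taken)."""
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    trace, pc = [], start
    while pc < len(program):
        _, text, target = program[pc]
        trace.append(text)
        if target is None:
            pc += 1                # fall through to the next instruction
        elif target in labels:
            pc = labels[target]    # follow the predicted-taken branch
        else:
            break                  # target (T4) lies outside this fragment
    return trace


def pack(trace, width=3):
    """Group the trace into rows of `width` entries."""
    return [trace[i:i + width] for i in range(0, len(trace), width)]


for row in pack(predicted_path(PROGRAM)):
    print(" | ".join(row))
```

The unused code between the branches never appears in the rows, which is the property the slide illustrates.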

Advanced Dynamic Execution
Extends basic features found in P6 core
Very deep speculative execution
–126 instructions in flight (3x P6)
–48 loads (3x P6) and 24 stores (2x P6)
Provides larger window of visibility
–Better use of execution resources
Deep Speculation Improves Parallelism

Rapid Execution Engine
Dramatically lower ALU latency
P6:
–1 clock @ 1GHz (1ns)
P4P:
–½ clock @ >1.4GHz (<0.36ns)
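
The nanosecond figures on this slide follow directly from clocks divided by frequency. A minimal sketch, with a function name chosen for this example:

```python
def latency_ns(clocks, frequency_ghz):
    """Latency in nanoseconds for an operation taking `clocks` cycles."""
    return clocks / frequency_ghz


p6_alu = latency_ns(1, 1.0)     # 1 clock at 1 GHz
p4_alu = latency_ns(0.5, 1.4)   # half a clock at 1.4 GHz

print(p6_alu, p4_alu, p6_alu / p4_alu)
```

At exactly 1.4 GHz the half-clock ALU takes about 0.357 ns, which is where the "<0.36ns" bound for ">1.4GHz" comes from.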

L1 Data Cache
[Figure: Pentium® 4 Processor block diagram with the L1 D-Cache and D-TLB highlighted]

High Performance L1 Data Cache
8KB, 4-way set associative, 64-byte lines
Very high bandwidth
–1 Ld and 1 St per clock
New access algorithms
Very low latency
–2 clock read
New algorithm enables faster cache
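
The organization quoted above fixes the number of lines and sets in the cache. The sketch below derives them, assuming power-of-two sizes and a conventional offset/index address split; the slide does not describe the actual indexing scheme, so the bit counts are only what the textbook layout would give.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Lines, sets, and address-bit split for a set-associative cache.

    Assumes size, associativity, and line size are powers of two.
    """
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }


# 8KB, 4-way set associative, 64-byte lines (from the slide)
l1 = cache_geometry(8 * 1024, 4, 64)
print(l1)
```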

Data Speculation
Observation: Almost all memory accesses hit in the cache
Optimize for the common case
–Assume that the access will hit the cache
–Use a low cost mechanism to fix the rare cases that miss
Benefit:
–Reduces latency
–Significantly higher performance

Replay
Repairs incorrect speculation
–Re-execute until correct
Replay is uOP specific
–Replay the uOP that mis-speculated
–Replay dependent uOPs
–Independent uOPs are not replayed
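
The selection rule on this slide (replay the mis-speculated uOP and everything that depends on it, nothing else) amounts to a reachability walk over the dependency graph. The following is a toy software model with made-up uOP names; it says nothing about how the hardware actually tracks dependencies.

```python
def uops_to_replay(sources, failed):
    """Return the mis-speculated uOP plus all of its transitive dependents.

    `sources` maps each uOP to the uOPs whose results it reads.
    """
    # Invert the graph: producer -> list of consumers.
    consumers = {}
    for uop, producers in sources.items():
        for producer in producers:
            consumers.setdefault(producer, []).append(uop)

    replay, pending = set(), [failed]
    while pending:
        uop = pending.pop()
        if uop in replay:
            continue
        replay.add(uop)
        pending.extend(consumers.get(uop, []))
    return replay


# Hypothetical uOPs: two independent load/add chains.
SOURCES = {
    "ld1": [], "add1": ["ld1"], "add2": ["add1"],
    "ld2": [], "add3": ["ld2"],
}

print(sorted(uops_to_replay(SOURCES, "ld1")))
```

If `ld1` was wrongly assumed to hit the cache, only its own chain is replayed; `ld2` and `add3` are untouched.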

L1 Cache is >2x Faster
P6:
–3 clocks @ 1GHz (3ns)
P4P:
–2 clocks @ 1.4GHz (<1.4ns)
Lower Latency is Higher Performance

Example with Higher IPC and Faster Clock!
Code sequence: Ld, Add, Add, Ld, Add, Add
P6 @1GHz: 10 clocks, 10ns, IPC = 0.6
Pentium® 4 @1.4GHz: 6 clocks, 4.3ns, IPC = 1.0
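
The IPC and time figures in this example can be recomputed from the instruction count, the clock counts, and the frequencies given on the slide. The helper name is an assumption of this sketch.

```python
def summarize(instructions, clocks, frequency_ghz):
    """Return (IPC, elapsed time in ns) for a code sequence."""
    return instructions / clocks, clocks / frequency_ghz


# Six instructions: Ld, Add, Add, Ld, Add, Add
p6_ipc, p6_ns = summarize(6, 10, 1.0)   # 10 clocks at 1 GHz
p4_ipc, p4_ns = summarize(6, 6, 1.4)    # 6 clocks at 1.4 GHz

print(p6_ipc, p6_ns)
print(p4_ipc, round(p4_ns, 1))
print(p6_ns / p4_ns)  # overall speedup on this sequence
```

The overall speedup is the product of the IPC ratio (1.0 / 0.6) and the frequency ratio (1.4 / 1.0).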

L2 Advanced Transfer Cache
[Figure: Pentium® 4 Processor block diagram with the L2 Cache and Control block highlighted]

L2 ATC Organization
256KB, 8-way set associative
–128-byte lines
–Two 64-byte pieces per line
Holds both data and instructions
High bandwidth: 45 GB/Sec @ 1.4GHz
–2.8x P6 @1GHz
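
Dividing the quoted bandwidth by the clock frequency gives the implied transfer size per core clock. The sketch uses the 44.8 GB/sec and 16 GB/sec values from the recap table later in the deck (which the slide rounds to 45); the per-clock byte counts are an inference from those numbers, not something the slides state.

```python
def bytes_per_clock(bandwidth_gb_per_s, frequency_ghz):
    """Average bytes moved per core clock at the given bandwidth."""
    return bandwidth_gb_per_s / frequency_ghz


p4_l2 = bytes_per_clock(44.8, 1.4)   # Pentium 4 at 1.4 GHz
p6_l2 = bytes_per_clock(16.0, 1.0)   # P6 at 1 GHz

print(p4_l2, p6_l2)
print(44.8 / 16.0)  # bandwidth ratio quoted as 2.8x
```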

Aggregate Cache Latency
Function of all caches in a processor
Overall Effective Latency =
    L1 latency +
    L1 Miss Rate * L2 latency +
    L2 Miss Rate * DRAM Latency
Average cache speed is >1.8x better than the Pentium® III Processor!
Average on desktop applications, Pentium® III Processor @ 1GHz, Pentium® 4 Processor @ 1.4GHz
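
The effective-latency expression on this slide translates directly into code. The miss rates and the L2 and DRAM latencies below are hypothetical placeholders (the slide gives none), and the L2 miss rate is taken here as a fraction of all accesses, which is an assumption about how the slide intends it.

```python
def effective_latency(l1_latency, l1_miss_rate, l2_latency,
                      l2_miss_rate, dram_latency):
    """Overall effective latency, in the same unit as the inputs.

    l2_miss_rate is assumed to be measured per access, not per L1 miss.
    """
    return (l1_latency
            + l1_miss_rate * l2_latency
            + l2_miss_rate * dram_latency)


# Hypothetical values in clocks: 2-clock L1, 20-clock L2, 200-clock DRAM,
# 5% of accesses miss L1 and 1% of accesses miss L2.
print(effective_latency(2, 0.05, 20, 0.01, 200))
```

With these placeholder values the rare L2 misses contribute as much as the L1 hit time itself, which is why the slide treats latency as a property of the whole hierarchy.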

Recap
                         Pentium® III Processor   Pentium® 4 Processor   Relative Improvement
Frequency                1 GHz                    > 1.4 GHz              > 1.4
Adder Speed              1 ns                     < .36 ns               > 2.8
L1 Cache Speed           3 ns                     < 1.42 ns              > 2.1
L1 Cache Size            16 KB                    8 KB                   0.5
L1 Cache Bandwidth       16 GB/sec                > 44.8 GB/sec          > 2.8
L2 Cache Bandwidth       16 GB/sec                > 44.8 GB/sec          > 2.8
Uop Fetch Bandwidth      3 billion/sec            > 4.2 billion/sec      > 1.4
Adder Bandwidth          2 billion/sec            > 5.6 billion/sec      > 2.8
Branch targets           512                      4092                   8
Instructions In flight   40                       126                    3.15
Loads in flight          16                       48                     3
Stores in flight         12                       24                     2

Intel® Pentium® 4 Processor Summary
Revolutionary, new micro-architecture from Intel designed for the evolving Internet
Design features for balanced, high performance platform scalability and headroom
The world's highest performance desktop processor