
Intel Labs
Copyright © 2000 Intel Corporation.

Fall 2000

Inside the Pentium® 4 Processor Micro-architecture

Next Generation IA-32 Micro-architecture

Doug Carmean
Principal Architect

Intel Architecture Group

August 24, 2000


Agenda

IA-32 Processor Roadmap
Design Goals
Frequency
Instructions Per Cycle
Summary


Intel® Pentium® 4 Processor

[Figure: IA-32 processor roadmap. Performance (vertical axis) plotted against Time (horizontal axis), with the 486 Micro-architecture, the P5 Micro-Architecture and, today, the Intel® NetBurst™ Micro-Architecture marked on the curve.]


Intel® Pentium® 4 Processor Design Goals

Deliver world-class performance across both existing and emerging applications
Deliver performance headroom and scalability for the future

Micro-architecture that will Drive Performance Leadership for the Next Several Years


Intel® NetBurst™ Micro-architecture

400 MHz System Bus
Rapid Execution Engine
Execution Trace Cache
Hyper Pipelined Technology
Advanced Transfer Cache
Advanced Dynamic Execution
Streaming SIMD Extensions 2
Enhanced Floating Point / Multi-Media

Pentium® 4 Processor Block Diagram

[Figure: block diagram. Labeled blocks: 3.2 GB/s System Interface; L2 Cache and Control; BTB & I-TLB; Decoder; Trace Cache; BTB; uCode ROM; Rename/Alloc; uop Queues; Schedulers; Integer RF; FP RF; ALU (x4); Load AGU; Store AGU; L1 D-Cache and D-TLB; FMul, FAdd, MMX, SSE; FP move, FP store.]


CPU Architecture 101

Delivered Performance = Frequency * Instructions Per Cycle
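The relation above can be sketched in a few lines of Python. This is only an illustration: the function name and both design points are hypothetical and are not figures from the talk. It shows that a higher clock can outweigh a somewhat lower IPC.

```python
def delivered_performance(frequency_hz: float, ipc: float) -> float:
    """Instructions completed per second: frequency * instructions per cycle."""
    return frequency_hz * ipc


# Hypothetical design points, chosen only to illustrate the trade-off.
baseline = delivered_performance(1.0e9, 1.0)
deeper_pipeline = delivered_performance(1.4e9, 0.9)  # faster clock, lower IPC

# Relative performance of the faster-clocked design.
print(round(deeper_pipeline / baseline, 2))
```

The rest of the talk treats the two factors separately: first frequency, then instructions per cycle.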


Frequency

What limits frequency?
–Process technology
–Microarchitecture

On a given process technology
–Fewer gates per pipeline stage will deliver higher frequency

Frequency is driven by Microarchitecture


NetBurst™ Micro-architecture Pipeline vs P6

Basic P6 Pipeline (intro at 733MHz, .18µ)

Stage  1      2      3       4       5       6       7       8        9         10
       Fetch  Fetch  Decode  Decode  Decode  Rename  ROB Rd  Rdy/Sch  Dispatch  Exec

Basic Pentium® 4 Processor Pipeline (intro at 1.4GHz, .18µ)

Stage  1-2        3-4       5      6      7-8     9    10-12  13-14  15-16  17  18    19     20
       TC Nxt IP  TC Fetch  Drive  Alloc  Rename  Que  Sch    Disp   RF     Ex  Flgs  Br Ck  Drive

Hyper pipelined Technology enables industry leading performance and clock rate


[Figure: Hyper Pipelined Technology. Frequency plotted against time of introduction, annotated with pipeline depth: P5 Micro-Architecture, 5 stages, 60MHz to 233MHz; 10 stages, 166MHz to 1.13GHz; NetBurst Micro-Architecture, 20 stages, 1.4GHz today.]


Hyper pipelined Technology

[Figure, repeated on each of the following slides: the block diagram above the 20-stage pipeline strip, with the stage being described and the blocks it uses picked out.]

TC Nxt IP (stages 1-2): Trace cache next instruction pointer
Pointer from the BTB, indicating the location of the next instruction.


TC Fetch (stages 3-4): Trace cache fetch
Read the decoded instructions (uOPs) out of the Execution Trace Cache.


Drive (stage 5): Wire delay
Drive the uOPs to the allocator.




Alloc (stage 6): Allocate
Allocate resources required for execution. The resources include load buffers, store buffers, etc.


Rename (stages 7-8): Register renaming
Rename the logical registers (e.g. EAX) to the physical register space (128 are implemented).
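As a rough illustration of the renaming step, the sketch below maps logical register names onto a pool of 128 physical registers. The class and method names are hypothetical, and the free-list policy is an assumption; the slide only states that 128 physical registers are implemented. Releasing registers at retirement is left out.

```python
class RenameTable:
    """Maps logical register names to physical register numbers (a sketch)."""

    def __init__(self, logical_regs, num_physical=128):
        # 128 physical registers is the figure quoted on the slide.
        self.free = list(range(num_physical))
        # Give every logical register an initial physical home.
        self.mapping = {reg: self.free.pop(0) for reg in logical_regs}

    def rename(self, dest, sources):
        """Rename one uOP; returns (new physical dest, physical sources)."""
        # Sources read the current mapping, so they see the latest producer.
        phys_sources = [self.mapping[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register")
        # A fresh destination register removes false dependencies on EAX.
        phys_dest = self.free.pop(0)
        self.mapping[dest] = phys_dest
        return phys_dest, phys_sources


table = RenameTable(["EAX", "EBX"])
first = table.rename("EAX", ["EAX", "EBX"])   # add EAX, EBX
second = table.rename("EAX", ["EAX", "EBX"])  # add EAX, EBX again
print(first, second)
```

The second add reads the physical register written by the first, which is the only true dependency between the two uOPs.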


Que (stage 9): Write into the uOP Queue
uOPs are placed into the queues, where they are held until there is room in the schedulers.


Sch (stages 10-12): Schedule
Write into the schedulers and compute dependencies. Watch for dependencies to resolve.


Disp (stages 13-14): Dispatch
Send the uOPs to the appropriate execution unit.




RF (stages 15-16): Register File
Read the register file. These are the source(s) for the pending operation (ALU or other).


Ex (stage 17): Execute
Execute the uOPs on the appropriate execution port.


Flgs (stage 18): Flags
Compute flags (zero, negative, etc.). These are typically the input to a branch instruction.


Br Ck (stage 19): Branch Check
The branch operation compares the actual branch direction with the prediction.


Drive (stage 20): Wire delay
Drive the result of the branch check to the front end of the machine.
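The stage walk-through can be captured as a small lookup table. The sketch below is one possible encoding: the stage ranges are read off the pipeline strip in the slides, while the data layout and the helper names are hypothetical.

```python
# (mnemonic, what the stage does, first clock, last clock)
PENTIUM4_PIPELINE = [
    ("TC Nxt IP", "trace cache next instruction pointer", 1, 2),
    ("TC Fetch", "trace cache fetch", 3, 4),
    ("Drive", "wire delay: drive uOPs to the allocator", 5, 5),
    ("Alloc", "allocate load/store buffers and other resources", 6, 6),
    ("Rename", "register renaming", 7, 8),
    ("Que", "write into the uOP queue", 9, 9),
    ("Sch", "schedule", 10, 12),
    ("Disp", "dispatch", 13, 14),
    ("RF", "register file read", 15, 16),
    ("Ex", "execute", 17, 17),
    ("Flgs", "compute flags", 18, 18),
    ("Br Ck", "branch check", 19, 19),
    ("Drive", "wire delay: drive branch result to the front end", 20, 20),
]


def pipeline_depth() -> int:
    """Total number of clocks covered by the table."""
    return sum(last - first + 1 for _, _, first, last in PENTIUM4_PIPELINE)


def stage_at(clock: int) -> str:
    """Mnemonic of the stage that occupies a given clock (1-based)."""
    for name, _, first, last in PENTIUM4_PIPELINE:
        if first <= clock <= last:
            return name
    raise ValueError(f"clock {clock} is outside the pipeline")


print(pipeline_depth(), stage_at(19))
```

Two of the entries are pure wire-delay stages, which is a direct consequence of keeping the number of gates per stage small.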


CPU Architecture 101

Instructions Per Cycle

Improving Instructions Per Cycle

Improve efficiency
–Branch prediction
–Do more things in a clock

Reduce time it takes to do something
–Reducing latency




Branch Prediction

Dramatic improvement over P6 branch predictor:
–8x the size (4K)
–Eliminated 1/3 of the mispredictions

Proven to be better than all other publicly disclosed predictors
–(g-share, hybrid, etc.)

Accurate branch prediction is key to enabling longer pipelines
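For readers unfamiliar with the predictors named above, here is a minimal sketch of g-share, one of the publicly disclosed schemes the slide compares against. It is not the Pentium 4 predictor, whose design the slides do not describe. The class name, the 12-bit index and the initial counter value are illustrative assumptions.

```python
class GsharePredictor:
    """Textbook g-share: branch address XOR global history indexes 2-bit counters."""

    def __init__(self, index_bits: int = 12):
        self.mask = (1 << index_bits) - 1
        # 2-bit saturating counters, all starting at "weakly not taken" (1).
        self.counters = [1] * (1 << index_bits)
        self.history = 0  # most recent branch outcomes, newest in bit 0

    def _index(self, pc: int) -> int:
        return (pc ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = GsharePredictor()
mispredicts = 0
for _ in range(100):  # a loop branch that is always taken
    if not predictor.predict(0x40):
        mispredicts += 1
    predictor.update(0x40, True)
print(mispredicts)
```

After a short warm-up, while the history register fills, the always-taken branch is predicted correctly on every iteration.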


The Execution Trace Cache

[Figure: block diagram with the Trace Cache and BTB picked out.]


Execution Trace Cache

Advanced L1 instruction cache
–Caches “decoded” IA-32 instructions (uops)

Removes decoder pipeline latency
Capacity is ~12K uOps
Integrates branches into single line
–Follows predicted path of program execution

Execution Trace Cache feeds fast engine


Execution Trace Cache

      1 cmp
      2 br -> T1
      ... (unused code)
T1:   3 sub
      4 br -> T2
      ... (unused code)
T2:   5 mov
      6 sub
      7 br -> T3
      ... (unused code)
T3:   8 add
      9 sub
     10 mul
     11 cmp
     12 br -> T4

Trace Cache Delivery

 1 cmp      2 br T1      3 T1: sub
 4 br T2    5 mov        6 sub
 7 br T3    8 T3: add    9 sub
10 mul     11 cmp       12 br T4
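The packing shown in the delivery figure can be reproduced with a short sketch. It assumes every branch is predicted taken and that a trace line holds three uOPs, as the figure suggests. The dictionary layout, the function name and the `max_uops` guard are hypothetical.

```python
# Basic blocks of the example program, keyed by label; unused code is omitted.
PROGRAM = {
    "start": ["cmp", "br T1"],
    "T1": ["sub", "br T2"],
    "T2": ["mov", "sub", "br T3"],
    "T3": ["add", "sub", "mul", "cmp", "br T4"],
}


def build_trace(program, entry, uops_per_line=3, max_uops=64):
    """Follow the predicted (taken) path and pack uOPs into trace lines."""
    uops = []
    label = entry
    while label in program and len(uops) < max_uops:
        block = program[label]
        uops.extend(block)
        last = block[-1]
        # Continue at the branch target; stop if the block does not end in a branch.
        label = last.split()[1] if last.startswith("br ") else None
    return [uops[i:i + uops_per_line] for i in range(0, len(uops), uops_per_line)]


for line in build_trace(PROGRAM, "start"):
    print(line)
```

Each trace line crosses at least one taken branch, so the front end delivers uOPs from several basic blocks without refetching at every branch target.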


Advanced Dynamic Execution

Extends basic features found in P6 core
Very deep speculative execution
–126 instructions in flight (3x P6)
–48 loads (3x P6) and 24 stores (2x P6)

Provides larger window of visibility
–Better use of execution resources

Deep Speculation Improves Parallelism




Rapid Execution Engine

Dramatically lower ALU latency

P6:
–1 clock @ 1GHz: 1ns

P4P:
–½ clock @ >1.4GHz: <0.36ns
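The ALU latencies quoted above follow directly from the clock counts and frequencies. A small check, with a hypothetical helper name, evaluated at exactly 1.4GHz:

```python
def latency_ns(clocks: float, frequency_ghz: float) -> float:
    """Latency in nanoseconds for a given number of clocks at a given frequency."""
    return clocks / frequency_ghz


p6_alu = latency_ns(1, 1.0)    # one full clock at 1GHz
p4_alu = latency_ns(0.5, 1.4)  # half a clock at 1.4GHz

print(p6_alu, round(p4_alu, 3), round(p6_alu / p4_alu, 1))
```

Half a clock at 1.4GHz is a little under 0.36ns, so at any higher frequency the "<0.36ns" bound on the slide holds, and the ratio against the P6 figure is 2.8.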


L1 Data Cache

[Figure: block diagram with the L1 D-Cache and D-TLB picked out.]


High Performance L1 Data Cache

8KB, 4-way set associative, 64-byte lines
Very high bandwidth
–1 Ld and 1 St per clock

New access algorithms
Very low latency
–2 clock read

New algorithm enables faster cache


Data Speculation

Observation: Almost all memory accesses hit in the cache

Optimize for the common case
–Assume that the access will hit the cache
–Use a low cost mechanism to fix the rare cases that miss

Benefit:
–Reduces latency
–Significantly higher performance


Replay

Repairs incorrect speculation
–Re-execute until correct

Replay is uOP specific
–Replay the uOP that mis-speculated
–Replay dependent uOPs
–Independent uOPs are not replayed
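The selection rule in the bullets above (replay the mis-speculated uOP and everything that depends on it, leave the rest alone) can be sketched as a walk over a dependency map. The uOP names and the dictionary representation are hypothetical; the slides do not describe how the hardware tracks dependents.

```python
def uops_to_replay(dependencies, mis_speculated):
    """Return the mis-speculated uOP plus every uOP that depends on it.

    `dependencies` maps each uOP name to the uOPs whose results it consumes.
    """
    replay = {mis_speculated}
    changed = True
    while changed:
        changed = False
        for uop, sources in dependencies.items():
            # A uOP must replay if any of its inputs is being replayed.
            if uop not in replay and any(src in replay for src in sources):
                replay.add(uop)
                changed = True
    return replay


# Two independent load/add chains; only the first load missed the cache.
deps = {
    "ld1": [],
    "add1": ["ld1"],
    "add2": ["add1"],
    "ld2": [],
    "add3": ["ld2"],
}
print(sorted(uops_to_replay(deps, "ld1")))
```

The second chain (`ld2`, `add3`) is untouched, which is what keeps the cost of a rare miss low.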


L1 Cache is >2x Faster

P6:
–3 clocks @ 1GHz: 3ns

P4P:
–2 clocks @ 1.4GHz: <1.4ns

Lower Latency is Higher Performance


Example with Higher IPC and Faster Clock!

Code Sequence
Ld
Add
Add
Ld
Add
Add

P6 @1GHz: 10 clocks, 10ns, IPC = 0.6
Pentium® 4 @1.4GHz: 6 clocks, 4.3ns, IPC = 1.0
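The figures in this example can be reproduced from the instruction count, the clock counts and the two frequencies. The helper below is a hypothetical convenience and not part of the talk.

```python
def summarize(instructions: int, clocks: int, frequency_ghz: float) -> dict:
    """IPC and elapsed time for a code sequence."""
    return {
        "ipc": instructions / clocks,
        "time_ns": clocks / frequency_ghz,
    }


# Six instructions: Ld, Add, Add, Ld, Add, Add.
p6 = summarize(6, 10, 1.0)
p4 = summarize(6, 6, 1.4)

print(p6["ipc"], p6["time_ns"])
print(p4["ipc"], round(p4["time_ns"], 1))
print(round(p6["time_ns"] / p4["time_ns"], 2))
```

Because both the clock count and the cycle time shrink, the sequence finishes in well under half the time.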


L2 Advanced Transfer Cache

[Figure: block diagram with the L2 Cache and Control block picked out.]



L2 ATC Organization

256KB, 8-way set associative
–128-byte lines
–Two 64-byte pieces per line

Holds both data and instructions
High bandwidth: 45 GB/Sec @ 1.4GHz
–2.8x P6 @1GHz


Aggregate Cache Latency

Function of all caches in a processor

Overall Effective Latency =
  L1 latency +
  L1 Miss Rate * L2 latency +
  L2 Miss Rate * DRAM Latency

Average cache speed is >1.8x better than the Pentium® III Processor!

Average on desktop applications, Pentium® III Processor @ 1GHz, Pentium® 4 Processor @ 1.4GHz
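The effective-latency formula on this slide translates directly into code. In the sketch below only the 3ns L1 latency comes from the slides (P6, 3 clocks @ 1GHz); the L2 latency, DRAM latency and both miss rates are made-up values for illustration, and both miss rates are taken per memory access, as the formula is written.

```python
def effective_latency_ns(l1_ns, l2_ns, dram_ns, l1_miss_rate, l2_miss_rate):
    """Overall effective latency, term by term as on the slide."""
    return l1_ns + l1_miss_rate * l2_ns + l2_miss_rate * dram_ns


# Hypothetical inputs apart from the 3ns L1 latency.
example = effective_latency_ns(
    l1_ns=3.0,
    l2_ns=10.0,         # assumed
    dram_ns=100.0,      # assumed
    l1_miss_rate=0.05,  # assumed
    l2_miss_rate=0.01,  # assumed
)
print(round(example, 2))
```

With low miss rates the L1 term dominates, which is why the talk spends so much effort on L1 latency.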


Recap

                        Pentium® III Processor   Pentium® 4 Processor   Relative Improvement
Frequency               1 GHz                    > 1.4 GHz              > 1.4
Adder Speed             1 ns                     < .36 ns               > 2.8
L1 Cache Speed          3 ns                     < 1.42 ns              > 2.1
L1 Cache Size           16 KB                    8 KB                   0.5
L1 Cache Bandwidth      16 GB/sec                > 44.8 GB/sec          > 2.8
L2 Cache Bandwidth      16 GB/sec                > 44.8 GB/sec          > 2.8
Uop Fetch Bandwidth     3 billion/sec            > 4.2 billion/sec      > 1.4
Adder Bandwidth         2 billion/sec            > 5.6 billion/sec      > 2.8
Branch targets          512                      4092                   8
Instructions in flight  40                       126                    3.15
Loads in flight         16                       48                     3
Stores in flight        12                       24                     2


Intel® Pentium® 4 Processor Summary

Revolutionary, new micro-architecture from Intel designed for the evolving Internet
Design features for balanced, high performance platform scalability and headroom
The world’s highest performance desktop processor

