pentium pro case study - ece/cs 752 fall 2019 · 2019-11-25 · pentium pro case study prof. mikko...

Pentium Pro Case Study

Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti

Pentium Pro Case Study

• Microarchitecture

– Order-3 Superscalar

– Out-of-Order execution

– Speculative execution

– In-order completion

• Design Methodology

• Performance Analysis

• Retrospective

Goals of P6 Microarchitecture

IA-32 Compliant

Performance (Frequency - IPC)

Validation

Die Size

Schedule

Power

P6 – The Big Picture

MOB

DCU

IEU1AGU0 IEU0

Fadd

Fmul

Imul

Div

AGU1

01234

Reservation Station (20)

Dispatch

Decode

Fetch

2 Cycles

4 Cycles

2 Cycles

BTB/ICU

BAC/Rename

Allocation

2 Cycles

ROB

RRF

JEU(40x157)

2 cyc

Memory Hierarchy

• Level 1 instruction and data caches - 2 cycle access time

• Level 2 unified cache - 6 cycle access time

• Separate level 2 cache and memory address/data bus

ICache(8KB)

DCache(8Kb)

BIUL2 Cache(256Kb)

Main Memory PCI

CPU

64 bit

16bytes

Instruction Fetch

Cache

I n s t

. B

u f

I n s t

. R

o t a

t o r

Inst. Length

Decoder

Length

Inst.

Marks

P r e

d i c

t i o

n

M a r k

s

Instruction

Victim

ICache

Stream

TLB

Buffer

Physical Addr.

L2 Cache (256Kb)

F e t

c h A

d d r e

s s

Next Addr. Logic

Other Fetch

Requests

Branch Target Buffer (512) P r e

d i c

t i o

n

To Decode

1 6 b

y t e

s

1 6 b

y t e

s +

m a r

k s

I n s t

r u c t

i o n D

a t a

M u x

I n s t

r u c t

i o n D

a t a

(8Kb)

2 cycle

Branch Target

Instruction Cache Unit

Stream Buffer

ICache(8 Kb)

Victim Cache

BusInterface

Unit

Data MuxInstructionTag Array

ITLBHit/MissInstruction

Data

Fetch Address

Lower 12 bits

Lower 12 bits

Upper20 bits

Branch Target Buffer

F e t c

h A

d d

r . T

a g

4 - b

i t B

H R

B r .

O f f

s e t

4 - b

i t B

H R

s p

e c .

T a r

g e t

A d

d r .

Way 0 1

2 8

S e

t s

F e t c

h A

d d

r . T

a g

4 - b

i t B

H R

B r .

O f f

s e t

4 - b

i t B

H R

s p

e c .

T a r

g e t

A d

d r .

Way 1

F e t c

h A

d d

r . T

a g

4 - b

i t B

H R

B r .

O f f

s e t

4 - b

i t B

H R

s p

e c .

T a r

g e t

A d

d r .

Way 3

Tag Compare

PHT

1 6

e n

t r i e

s / s e

t

Return Stack

Prediction Control Logic

Prediction &

Target Addr.

Fetch Address

Pattern History Table (PHT) is not speculatively updated

A speculative Branch History Register (BHR) and prediction state is maintained

Uses speculative prediction state if it exist for that branch

Branch Prediction Algorithm

Current prediction updates the speculative history prior to the next instance of the branch instruction

Branch History Register (BHR) is updated during branch execution

Branch recovery flushes front-end and drains the execution core

Branch mis-prediction resets the speculative branch history state to match BHR

0 0 1 0

1

0 1 0 1

Speculative History

Br. History

0000

0001

0010

0011

0100

0101

0110

1110

1111

1 0

1

.

.

.

Pattern TableStateMachine

0101

0010

11

10

Spec. Pred.

Branch Execution

Br. Pred.

Instruction Decode - 1

Branch instruction detection

Branch address calculation - Static prediction and branch always execution

One branch decode per cycle (break on branch)

Instruction Buffer 16 bytes

Macro-Instruction Bytes from IFU

Decoder0

Decoder1

Decoder2

BranchAddress

Calc.

To NextAddress

Calc.

4 uops 1 uop 1 uop

Up to 3 uops Issued to dispatch

uop Queue (6)

uROM

Instruction Decode - 2

Instruction Buffer contains up to 16 instructions, which must be decoded and queued before the instruction buffer is re-filled

Macro-instructions must shift from decoder 2 to decoder 1 to decoder 0

Instruction Buffer 16 bytes

Macro-Instruction Bytes from IFU

Decoder0

Decoder1

Decoder2

BranchAddress

Calc.

To NextAddress

Calc.

4 uops 1 uop 1 uop

Up to 3 uops Issued to dispatch

uop Queue (6)

uROM

What is a uop?

Small two-operand instruction - Very RISC like.

IA-32 instruction

add (eax),(ebx) MEM(eax) <- MEM(eax) + MEM(ebx)

Uop decomposition:

ld guop0, (eax) guop0 <- MEM(eax)

ld guop1, (ebx) guop1 <- MEM(ebx)

add guop0,guop1 guop0 <- guop0 + guop1

sta eax

std guop0 MEM(eax) <- guop0

Instruction Dispatch

Register Renaming

Allocation requirements

“3-or-none” Reorder buffer entries

Reservation station entry

Load buffer or store buffer entry

Dispatch buffer “probably” dispatches all 3 uops before re-fill

Renaming

Allocator

u o

P Q

u e u

e ( 6

)

D i s

p a

t c h

B u

f f e r

( 3 )

Mux Logic

To Reservation

Station

Retirement Info

2 cycles

Register Renaming - 1

Similar to Tomasulo’s Algorithm - Uses ROB entry number as tags

The register alias tables (RAT) maintain a pointer to the most recent data for the renamed register

Execution results are stored in the ROB

Integer RATEAXEBXECX

Floating Point RATFST0FST1FST2

FST7

GuoP0GuoP1

Real Register File (RRF) Reorder Buffer (ROB)0123456789

39

EAXEBXECX

FST0FST1

GuoP0GuoP1

IuoP(0-3)CC/Events

8

8

12

49

Register Renaming - Example

© Shen, Lipasti 15

Integer RATEAXEBXECX


FST7

GuoP0GuoP1


39

EAXEBXECX

FST0FST1

GuoP0GuoP1

IuoP(0-3)CC/Events

8

8

12

49

Dispatching:

add eax, ebx

add eax, ecx

fxch f0, f1

Completing:

sub eax, ecx

AllocComp sub

Challenges to Register Renaming Integer RAT

EAXEBXECX


FST7

GuoP0GuoP1


39

EAXEBXECX

FST0FST1

GuoP0GuoP1

IuoP(0-3)CC/Events

8

8

12

49

8-bit code

mov AL, #data1

mov AH, #data2

add AL, #data3

add AL, #data4

Byte addressable registers

Out-of-Order Execution Engine

• In-order branch issue and execution

• In-order load/store issue to address generation units

• Instruction execution and result bus scheduling

• Is the reservation station “truly” centralized & what is “binding”?

IEU1AGU0 IEU0

Fadd

Fmul

Imul

Div

AGU1

MOB

DCU (8Kb)

01234

Reservation Station (20) 2 Cycles

JEU

RSbypass

Reservation Station

• Cycle 1

– Order checking

– Operand availability

• Cycle 2

– Writeback bus scheduling

Cycle 1

Cycle 2

To Execution Units

Port 0Port 1Port 2Port 3Port 4

From Dispatch Queue

Memory Ordering Buffer (MOB)

• Load buffer retains loads until completed, for coherency checking

• Store forwarding out of store buffers

• 2 cycle latency through MOB

• “Store Coloring” - Load instructions are tagged by the last store

ConflictLogic

Load Buffer Store DataBuffer (12)

Store AddressBuffer (12)(16)

Data Cache Unit (8Kb)

BypassLogicControl

Load Data Result

AGU0 AGU1 R/S

MOB

2 cycle

2 cycle

Instruction Completion

• Handles all exception/interrupt/trap conditions

• Handles branch recovery

– OOO core drains out right-path instructions, commits to RRF

– In parallel, front end starts fetching from target/fall-through

– However, no renaming is allowed until OOO core is drained

– After draining is done, RAT is reset to point to RRF

– Avoids checkpointing RAT, recovering to intermediate RAT state

• Commits execution results to the architectural state in-order

– Retirement Register File (RRF)

– Must handle hazards to RRF (writes/reads in same cycle)

– Must handle hazards to RAT (writes/reads in same cycle)

• “Atomic” IA-32 instruction completion

– uops are marked as 1st or last in sequence

– exception/interrupt/trap boundary

• 2 cycle retirement

Pentium Pro Design Methodology - 1

Logic Design

Structural Model

Conception

Silicon Debug

Behavioral Model

Circuit Design

Layout

Performance Model

Microarchitecture

Start 1990

Finish 4Q

1 year

1.75 years

1 year

1994

0.75 years

Pentium Pro Performance Analysis

• Observability

– On-chip event counters

– Dynamic analysis

• Benchmark Suite

– BAPco Sysmark32 - 32-bit Windows NT applications

– Winstone97 - 32-bit Windows NT applications

– Some SPEC95 benchmarks

Performance – Run Times

avs

microstation

Pwave

PhotoShop

MSVCCorel excel

PageMaker

paradoxPowerPnt

word

WordProSYSexcelSYSword7SYSCorel6

SYSflw96SYSexcel7

SYSpowerpnt7SYSParadox7

SYSpagemaker6

compress

ligo

0

2E+10

4E+10

6E+10

8E+10

1E+11

1.2E+11

avs

microstation

Pwave

PhotoShop

MSVCCorel excel

PageMaker

paradoxPowerPnt

word

WordProSYSexcelSYSword7SYSCorel6

SYSflw96SYSexcel7

SYSpowerpnt7SYSParadox7

SYSpagemaker6

compress

ligo

Unhalted User-Mode Processor Cycles

Total of 27.5 billion cycles

User-Mode Processor Cycles

Performance – IPC vs. uPC

WS-AVS

WS-Microstation

WS-PWave

WS-PhotoShop

WS-MSVC++

WS-CorelDraw

WS-Excel

WS-PageMaker

WS-Paradox

WS-PowerPnt

WS-Word

WS-WordPro

SYS-ExcelSYS-Word

SYS-Core lDraw

SYS-FlW96SYS-Excel

SYS-PowerPnt

SYS-Paradox

SYS-PagemakerSPEC-Compress

SPEC-Li SPEC-Go

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

# Retired Per Cycle

WS-AVS

WS-Microstation

WS-PWave

WS-PhotoShop

WS-MSVC++

WS-CorelDraw

WS-Excel

WS-PageMaker

WS-Paradox

WS-PowerPnt

WS-Word

WS-WordPro

SYS-ExcelSYS-Word

SYS-Core lDraw

SYS-FlW96SYS-Excel

SYS-PowerPnt

SYS-Paradox


SPEC-Li SPEC-Go

Instructions and UOPs Retired Per Cycle

Inst retired

UOPS retired

2 uops/instruction

Instructions and Uops retired per cycle

Performance – IPC vs. uPC

3 UOPs 2 UOPs 1 UOP 0 UOPs 3 Inst. 2 Inst. 1 Inst. 0 Inst.

0

10

20

30

40

50

60

70

80

90

100

Percent of Cycles where n=0-3 UOPs or

Instructions were Retired

3 UOPs 2 UOPs 1 UOP 0 UOPs 3 Inst. 2 Inst. 1 Inst. 0 Inst.

Inst. or UOPS retired per cycle

AVS

PageMaker

Paradox

PowerPoint

WordPro

Inst. or Uops retired per cycle

Performance – Cache Misses

WS-AVS

WS-Microstation

WS-PWave

WS-PhotoShop

WS-MSVC++

WS-CorelDraw

WS-Excel

WS-PageMaker

WS-ParadoxWS-PowerPnt

WS-Word

WS-WordPro

SYS-ExcelSYS-Word

SYS-CorelDraw

SYS-FlW96SYS-Excel

SYS-PowerPntSYS-Paradox


SPEC-LiSPEC-Go

0

0.5

1

1.5

2

2.5

% Unhalted Cycles

WS-AVS

WS-Microstation

WS-PWave

WS-PhotoShop

WS-MSVC++

WS-CorelDraw

WS-Excel

WS-PageMaker


WS-Word

WS-WordPro

SYS-ExcelSYS-Word

SYS-CorelDraw

SYS-FlW96SYS-Excel



SPEC-LiSPEC-Go

Cache Misses Per Cycle

IFU Misses / Cycle

DCU Misses / Cycle

L2 Misses / Cycle

IL1 - 0.51%DL1 - 1.15%L2 - 0.25%

Cache Misses Per Cycle

Performance – Branch Prediction

W S-AVS

W S-M icrostat ion

W S-PW ave

WS-P hoto Shop

WS -MS VC+ +

W S-C orelD raw

W S-Exce l

W S-Pag eMa ker

W S-P aradoxWS- Pow erPnt

W S-Wo rd

W S-W ordPro

SYS -ExcelSY S-W ord

SYS-C orelD raw

SYS-F lW 96SYS-E xcel

SYS -Pow erPntSY S-Parado x

S YS-Page makerSPEC -Co mpre ss

SPEC -LiSPEC -Go

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

% Branches that Miss in BTB

W S-AVS

W S-M icrostat ion

W S-PW ave

WS-P hoto Shop

WS -MS VC+ +

W S-C orelD raw

W S-Exce l

W S-Pag eMa ker

W S-P aradoxWS- Pow erPnt

W S-Wo rd

W S-W ordPro

SYS -ExcelSY S-W ord

SYS-C orelD raw

SYS-F lW 96SYS-E xcel

SYS -Pow erPntSY S-Parado x

S YS-Page makerSPEC -Co mpre ss

SPEC -LiSPEC -Go

BTB Miss Rate

WS-AVS

WS-Microstation

WS-PWave

WS-PhotoShop

WS-MSVC++

WS-CorelDraw

WS-Excel

WS-PageMaker


WS-Word

WS-WordPro

SYS-ExcelSYS-Word

SYS-CorelDraw

SYS-FlW96SYS-Excel



SPEC-LiSPEC-Go

0%

2%

4%

6%

8%

10%

12%

14%

16%

% Branches Mispredicted

WS-AVS

WS-Microstation

WS-PWave

WS-PhotoShop

WS-MSVC++

WS-CorelDraw

WS-Excel

WS-PageMaker


WS-Word

WS-WordPro

SYS-ExcelSYS-Word

SYS-CorelDraw

SYS-FlW96SYS-Excel



SPEC-LiSPEC-Go

Branch Mispredict Rate

BTB Miss Rate

Branch Mispredict Rate

6.8% avg

Conclusions

IA-32 Compliant

Performance (Frequency - IPC) 366.0 ISpec92

283.2 FSpec92

8.09 SPECint95

6.70 SPECfp95

Validation

Die Size - Fabable Schedule - 1 year late Power -

Retrospective • Most commercially successful

microarchitecture in history

• Evolution

– Pentium II/III, Xeon, etc.

• Derivatives with on-chip L2, ISA extensions, etc.

– Replaced by Pentium 4 as flagship in 2001

• High frequency, deep pipeline, extreme speculation

– Resurfaced as Pentium M in 2003

• Initially a response to Transmeta in laptop market

• Pentium 4 derivative (90nm Prescott) delayed, slow, hot

– Core Duo, Core 2 Duo, Core i7 replaced Pentium 4 © Shen, Lipasti 29

Microarchitectural Updates • Pentium M (Banias), Core Duo (Yonah)

– Micro-op fusion (also in AMD K7/K8) • Multiple uops in one: (add eax,[mem] => ld/alu), sta/std

• These uops decode/dispatch/commit once, issue twice

– Better branch prediction • Loop count predictor

• Indirect branch predictor

– Slightly deeper pipeline (12 stages) • Extra decode stage for micro-op fusion

• Extra stage between issue and execute (for RS/PLRAM read)

– Data-capture reservation station (payload RAM) • Clock gated for 32 (int) , 64 (fp), and 128 (SSE) operands

© Shen, Lipasti 30

Microarchitectural Updates • Core 2 Duo (Merom)

– 64-bit ISA from AMD K8

– Macro-op fusion • Merge uops from two x86 ops

• E.g. cmp, jne => cmpjne

– 4-wide decoder (Complex + 3x Simple) • Peak x86 decode throughput is 5 due to macro-op fusion

– Loop buffer • Loops that fit in 18-entry instruction queue avoid fetch/decode

overhead

– Even deeper pipeline (14 stages)

– Larger reservation station (32), instruction window (96)

– Memory dependence prediction © Shen, Lipasti 31

Microarchitectural Updates • Nehalem (Core i7/i5/i3)

– RS size 36, ROB 128

– Loop cache up to 28 uops

– L2 branch predictor

– L2 TLB

– I$ and D$ now 32K, L2 back to 256K, inclusive L3 up to 8M

– Simultaneous multithreading

– RAS now renamed (repaired)

– 6 issue, 48 load buffers, 32 store buffers

– New system interface (QPI) – finally dropped front-side bus

– Integrated memory controller (up to 3 channels)

– New STTNI instructions for string/text handling

© Shen, Lipasti 32

Microarchitectural Updates • Sandybridge/Ivy Bridge (2nd-3rd generation Core i7)

– On-chip integrated graphics (GPU)

– Decoded uop cache up to 1.5K uops, handles loops, but more general

– 54-entry RS, 168-entry ROB

– Physical register file: 144 FP, 160 integer

– 256-bit AVX units: 8 DPFLOP/cycle, 16 SPFLOP/cycle

– 2 general AGUs enable 2 ld/cycle, 2 st/cycle or any combination, 2x128-bit load path from L1 D$

© Shen, Lipasti 33

Microarchitectural Updates • Haswell/Broadwell/Skylake: wider & deeper

– 8-wide issue (up from 6 wide)

– 4th integer ALU, third AGU, second branch unit

– 60-entry RS, 192-entry ROB

– 72-entry load queue/42-entry store queue

– Physical register file: 168 FP, 168 integer

– Doubled FP throughput (32 SP/16 DP)

– Load/store bandwidth to L1 doubled (64B/32B)

– TSX (transactional memory)

– Integrated voltage regulator

© Shen, Lipasti 34

pentium pro case study - ece/cs 752 fall 2019 · 2019-11-25 · pentium pro case study prof. mikko...

Documents