overcoming the memory wall: kilo- instruction , … the memory wall: kilo-instruction , runahead,...

114
Overcoming the Memory Wall: Kilo- Instruction , Runahead, and SMT Processors Prof. Mateo Valero, UPC, Barcelona BSC Director Onassis Foundation Lectures on Computer Science Heraklion, Crete, July 21-25, 2008

Upload: vuongkhuong

Post on 09-Mar-2018

221 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

Overcoming the Memory Wall: Kilo-

Instruction , Runahead, and SMT Processors

Prof. Mateo Valero,

UPC, Barcelona

BSC Director

Onassis Foundation

Lectures on Computer Science

Heraklion, Crete, July 21-25, 2008

Page 2: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 2

Motivation

Justin Rattner, Intel-MRL, Keynote lecture, Micro-32

Technology works against ILP: Faster clock rates => Lower ILP

Page 3: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 3

Processor Organization: Basic Concepts

� Instruction types:� Load/Store� Operation� Control

Control

Unit

Memory

Instructions + Data

. . .

Register File

Instructionsload Rx := M[]

store M[] := Rx

Ri := Rj op Rk

Branch (cond.)

Processor

Page 4: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 4

Technological Achievements

� Transistor (Bell Labs, 1947)

� DEC PDP-1 (1957)

� IBM 7090 (1960)

� Integrated circuit (1958)

� IBM System 360 (1965)

� DEC PDP-8 (1965)

� Microprocessor (1971)

� Intel 4004

Page 5: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 5

2X transistors/Chip Every 1.5 years

Called “Moore’s Law”

Moore’s Law

Microprocessors have become

smaller, denser, and more powerful.

Not just processors, bandwidth,

storage, etc

Gordon Moore (co-founder of

Intel) predicted in 1965 that the

transistor density of semiconductor

chips would double roughly every

18 months.

Technology Trends:

Microprocessor Capacity

Page 6: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 6

Medium High Very HighVariability

Energy scaling will slow down>0.5>0.5>0.35Energy/Logic Op scaling

0.5 to 1 layer per generation8-97-86-7Metal Layers

11111111RC Delay

Reduce slowly towards 2-2.5<3~3ILD (K)

Low Probability High ProbabilityAlternate, 3G etc

128

11

2016

High Probability Low ProbabilityBulk Planar CMOS

Delay scaling will slow down>0.7~0.70.7Delay = CV/I scaling

256643216842Integration Capacity (BT)

8162232456590Technology Node (nm)

2018201420122010200820062004High Volume Manufacturing

Shekhar Borkar, Micro37, P

Technology Outlook

Page 7: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 7

Pipeline (H. Ford)

Page 8: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 8

Program Dependences

� Data dependences

� Control dependencies

a Ri := …

b := Ri op Rj

c Ri := ...

DataData dependencesa and b (RAW)

Name dependencesb and c (WAR)a and c (WAW)

.

.

.a...b branch (cond.) a

b+1

F D R E W

F D R E W

F D R E W

F D R E W

F D R E W

F D R E W

CU

Main Memory

. . .

RF

Page 9: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 9

Superscalar ProcessorFetch

Decode

Renam

e

Instruction

Window

Wakeup+

select

Register

file

Bypass

Data Cache

Fetch of multiple instructions every cycle.

Rename of registers to eliminate added dependencies.

Instructions wait for source operands and for functional units.

Out- of -order execution, but in order graduation.

Predict branches and speculative execution

J.E. Smith and S.Vajapeyam.¨Trace Processors…¨ IEEE Computer.Sept. 1997. pp68-74.

Register Write

Commit

Page 10: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 10

L1

INSTRUCTION CACHE

FETCH IFQ

ALLO

CA

TIO

N &

RE

NA

ME

MA

PP

ER

/DIS

PA

TC

H

BRANCH

PREDICTOR

DE

CO

DE

Int. R

eg.

File

FP

Reg.

File

L1

DATA CACHE

L2

UNIFIED CACHE

Integer

FUs

DATA

PREFETCHING

PC BY

PA

SS

FP

FUs

IIQ

IFQ

LSQ

ROB

Modern Superscalar Processors

Marco A. Ramírez Salinas, PhD Thesis, Barcelona July 9th, 2007

Page 11: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 11

Superscalar Processors

� Out of order (IPC <= 3)

F D R W

F D R E W

F D R W

F D R E W

F D R E W

F D R E W

F D R E W

F D R E W

F D R W

E

E

E

E

E

E

E

E

E

E

E

E

E E

E

E

E

E

E

E

E E E

Time

T = N * * tc1

IPC

Page 12: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 12

Assignment, use and release of Resources

ROB IQ

LSQ registers

Release

---Assignment

Resource

CommitCommit

IQ(issued)Release

ROB IQ LSQ registers

AssignmentResource

Decode, RenamingDecode, Renaming

FetchFetch

CommitCommit

Memory LatencyMemory Latency

i.ei.e, 1000 cycles, 1000 cycles

T. Karkhanis and J.Smith, “A day in the life of a data cache miss” Workshop Memory Performance Issues. ISCA-2002

M.Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Short

Latency

Long

Latency

Page 13: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 13

Processor-DRAM Gap (latency)

µProc

60%/yr.

DRAM

7%/yr.1

10

100

10001980

1981

1983

1984

1985

1986

1987

1988

1989

1990

1991

1992

1993

1994

1995

1996

1997

1998

1999

2000

DRAM

CPU

1982

Processor-Memory

Performance Gap:

(grows 50% / year)

Pe

rfo

rma

nc

e

Time

“Moore’s Law”

D.A. Patterson “ New directions in Computer Architecture” Berkeley, June 1998

Page 14: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 14

Latencies and Pipelines

• F D/

M

a

p

Q R E

x

D/

St

R

W

R

et

PC

Ic

a

c

h

e

Ic

a

c

h

e

Register

Map

D

c

a

c

h

e

R

e

g

s

R

e

g

s

R

e

g

s

10-20 cycles

100-1000 cycles

L2 Cache

Memory

Processor on a chip1-3 cycles

1-3 cycles

Page 15: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 15

128 in-flight inst.

0.0

0.5

1.0

1.5

2.0

2.5

3.0

int 4way fp 4way int 8way fp 8way

IPC 100

500

1000

latency

Memory Wall Problem

Memory latency has enormous impact on IPC

0.6X

0.45X

M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 16: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 16

Branch Instructions

Every 5-6 instructions

Limits to high-speed

b1

b2b3

b4b7 b6

b5

b0

b8

b1

b2b3

b4b7 b6

b5

b0

b8

Page 17: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 17

Power DensityW

att

s/c

m2

1

10

100

1000

1.5µ1.5µ1.5µ1.5µ 1µ1µ1µ1µ 0.7µ0.7µ0.7µ0.7µ 0.5µ0.5µ0.5µ0.5µ 0.35µ0.35µ0.35µ0.35µ 0.25µ0.25µ0.25µ0.25µ 0.18µ0.18µ0.18µ0.18µ 0.13µ0.13µ0.13µ0.13µ 0.1µ0.1µ0.1µ0.1µ 0.07µ0.07µ0.07µ0.07µ

i386i386

i486i486

PentiumPentium®®

PentiumPentium®® ProPro

PentiumPentium®® IIII

PentiumPentium®® IIIIIIHot plateHot plate

Nuclear ReactorNuclear ReactorNuclear Reactor

RocketNozzleRocketRocketNozzleNozzle

* * ““New Microarchitecture Challenges in the Coming Generations of CMNew Microarchitecture Challenges in the Coming Generations of CMOS Process TechnologiesOS Process Technologies”” ––

Fred Pollack, Intel Corp. Micro32 conference keynote Fred Pollack, Intel Corp. Micro32 conference keynote -- 1999.1999.

PentiumPentium®® 44

Page 18: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 18

Lower Lower VoltageVoltage

Increase Increase Clock RateClock Rate& Transistor & Transistor

DensityDensity

We have seen increasing number of gates on a chip and increasing clock speed.

Heat becoming an unmanageable problem, Intel Processors > 100 Watts

We will not see the dramatic increases in clock speeds in the future.

However, the number of gates on a chip will continue to increase.

Increasing the number of gates into a tight knot and decreasing the cycle time of the processor

Increasing CPU performance: a delicate balancing act

Core

Cache

Core

Cache

Core

C1 C2

C3 C4

Cache

C1 C2

C3 C4

Cache

C1 C2

C3 C4

C1 C2

C3 C4

C1 C2

C3 C4

C1 C2

C3 C4

Page 19: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 19

Reducing Memory Latency

� Technology

� Caches

� Prefetching� Hardware, Software and combined

� Assisted/SSMT Threads

� Runahead Processors

� ………

� “Kilo-instruction” Processor

Page 20: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

20

Motivation: The Memory Wall and KIPs

� Since the 80s processor frequencies have accelerated about 40% every year. � However, main memory access time has decreased much slower.

SPECINT2000 SPECFP2000

Avg

IP

C

Avg

IP

C

Size of Instruction Window Size of Instruction Window

2.23x

1.29x0.6x 0.48x

KILO-Instruction Processors can overcome the Memory Wall for many codes, but how do we design

them to be efficient in power and complexity?

KILO-Instruction Processors (KIPs)I

Cristal et al. (1st Proposal to Intel, 2001) + many later works

Page 21: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 21

“Kilo-instruction” Processors

� Our goals� Better tolerate increasing memory latency

� Further improve ILP, even for such longer memory latency

� Allow additional optimizations enabled by the new architecture (See below)

� Our proposal: “Kilo-instruction” Processors� Out-Of-Order processors with thousands of instructions

in-flight (Very Large Instruction Windows)

� Intelligent use of resources (Resource requirements growing much slower than window size)

Page 22: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 22

“Kilo-instruction Processsor”

� It is not…..� A “heavy” processor ☺� Cyber-205 like processor� Vector Processor� Blue-Gene like� Multiscalar,Trace Processor� Raw, Imagine, Levo,TRIPS

� It is …….� An Affordable O-O-O Superscalar Processor having

“Thousands of In-flight Instructions”

Page 23: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 23

Outline� Motivation� Increasing the number of in-flight instructions� Kilo-instruction Processor Ingredients:

� Multi-Checkpointing the ROB• Out-of-Order Commit

� Early Release of Resources• Ephemeral Registers• Load Queues

� Locality Exploitation• Instruction Queues• LSQ

� Affordable KILO-Instruction Processors� Cross-pollination with other techniques:

� “Kilo-processor” and multiprocessor systems� “kilo-vector” processors and “kilo-valpred” processors� “Kilo-SMT” processor� Further Improvements:

• Branch prediction• Control Independence• Reuse• Predicated and multipath execution

� Conclusion

Page 24: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 24

branch 3

ROB ActivityROB Register File

IQ

x

xx

branch 3xbx

load 2x

branchx

branch 1xax

load 1

load 1a

branch 1load 2

b128-entry

1024-entry

fp

int

90%76%

72%62%

Lat=1000Lat=100

ROB full:

fp

int

75%50%

48%30%

Lat=1000Lat=100

ROB full:

M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 25: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

Effect of Window Size on Performance (EQUAKE)

CheckpointedArchitecture(~3000 Insts)

SmallWindow

Processor(~200 Insts)

~0.3

~2.2

smvp() : matrix-vector product

x7 x7 x7

Page 26: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

Impact of Memory Wall

SPECINT2000 SPECFP2000

The memory wall has a big impact on the performance on current hardware. Large-window processors can overcome this problem for many codes, but technology constraints limit the scalability of current designs

Page 27: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 27

Scalability

� Thousands of In-flight Instructions and In-Order Commit make designs impractical:� ROB : Needs to maintain a copy of every in-

flight instruction

� IQs : Instructions depending on long latency instructions remain in these queues for a long time

� LSQs : Instructions remain in the queue until commit

� Registers : A new physical register for each instruction producing a new value

� We would like to get the IPC of thousandsof instructions in-flight without drasticallyincreasing resource requirements

Re

ord

er

Buff

er

IQF

PQ

LS

Q

I nt e

ge

rR

eg

.F

ileF

PR

eg.

File

M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 28: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 28

Early Release of Resources

ROB IQ

LSQ registers

Release

---Assignment

Resource

CommitCommit

IQ(issued)Release

ROB IQ LSQ registers

AssignmentResource

Decode, RenamingDecode, Renaming

FetchFetch

CommitCommit

Memory LatencyMemory Latency

i.ei.e, 1000 cycles, 1000 cycles

T. Karkhanis and J.Smith, “A day in the life of a data cache miss” Workshop Memory Performance Issues. ISCA-2002

M.Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Short

Latency

Long

Latency

Page 29: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 29

Outline

� Motivation� Increasing the number of in-flight instructions� Kilo-instruction Processor Ingredients:

� Multi-Checkpointing the ROB• Out-of-Order Commit

� Early Release of Resources• Ephemeral Registers• Load Queues

� Locality Exploitation• Instruction Queues• LSQ

� Affordable KILO-Instruction Processors� Cross-pollination with other techniques:

� “Kilo-processor” and multiprocessor systems� “kilo-vector” processors and “kilo-valpred” processors� “Kilo-SMT” processor� Further Improvements:

• Branch prediction• Control Independence• Reuse• Predicated and multipath execution

� Conclusion

Page 30: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 30

Checkpointing the ROB

� Checkpointing to support precise exceptions:� Quite well established and used technique

• J.E. Smith and A.R. Pleszkun, ISCA 1985

• W.M.Hwu and Y.N.Patt, ISCA 1987

� Checkpointing to early release resources:� Quite recent concept

• Cherry: J. Martínez et al., MICRO, Nov. 2002

• Large VROB: A. Cristal et al. TR-UPC-DAC, July 2002

M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003

Page 31: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 31

Checkpointing Example: MIPS Like

branch

branch

load

ROBhead

tail

logical physical

Architectural State

logical physical

Rename Table

logical physical

Checkpoint

logical physical

Checkpoint

Page 32: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 32

load

branch

Nearby & Distant Parallelism

ROB Register Fileload

Speculative

Replayable

f(X)

X

load

branch

load

branch

Balasubramonian et al.: “Dynamically Allocating Processor Resources…”, ISCA’01

Nearby

Distant

Page 33: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 33

Runahead ExecutionROB

x

x

x

branch 3

x

b

x

load 2

x

branch

x

branch 1

x

a

x

load 1

L2 cache miss

CheckpointCheckpoint

INV

INV

INV

INV

INV

Runahead Mode- generate bogus value

- invalidate dep. registers

- continue execution

- Virtually increments ROB size

- Prefetch data of future loads

- Preexecution of Branches

Mutlu et al.: “Runahead Execution: An Alternative…”, HPCA’03

Page 34: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 34

CherryROB

branch

load

Early Release

- registers

- loads

- stores

Point of no return

(PNR)

Martínez et al.: “Cherry: Checkpointed Early Resource Recycling…”, MICRO’02

irreversible

reversible

CherryCherry

Page 35: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 35

Multi-Checkpoint

IQ

ROB

x

x

x

load 3

x

b

x

branch 2

x

branch

x

branch 1

x

a

x

load 1 Checkpointing TableCheckpoint 1Checkpoint 1

OOO commit

load 1: PC, status, counter, …

load 1a

branch 1

load 1Checkpoint 2Checkpoint 2

branch 2: PC, status, counter, …

x

x

x

x

branch 4

x

x

x

load 4

x

x

load 3

x

b

x

branch 2

branch 2b

load 3

Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002

Research Proposal to Intel (January 2002) and presentation to Intel-MRL Feb. 2002

branch 2 load 1load 1 arrivesarrives

GangGang commitcommit

CheckpointCheckpoint 11

Page 36: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 36

Checkpointing Description

� How many in-flight checkpoints should be supported?

� What kind of instructions should be checkpointed?

� How often should a checkpoint be taken?

� How much information should be kept?

Page 37: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 37

Outline� Motivation� Increasing the number of in-flight instructions� Kilo-instruction Processor Ingredients:

� Multi-Checkpointing the ROB• Out-of-Order Commit

� Early Release of Resources• Ephemeral Registers• Load Queues

� Locality Exploitation• Instruction Queues• LSQ

� Affordable KILO-Instruction Processors� Cross-pollination with other techniques:

� “Kilo-processor” and multiprocessor systems� “kilo-vector” processor and “kilo-valpred” processors� “Kilo-SMT” processor� Further Improvements:

• Branch prediction• Control Independence• Reuse• Predicated and multipath execution

� Conclusion

Page 38: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 38

Registers

� Register File is a critical component of a modern superscalar processor� Large number of entries to support out-of-order

execution and memory latency

� Large number of ports to increase issue width

� Power and access time are key issues for register file design

� It is always beneficial, to reduce the number of physical registers

Page 39: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 39

� Conventional renaming scheme

� Virtual-Physical Registers

� Early Release

� Ephemeral Registers: checkpoint + virtual-physical

Physical Registers

Icache Decode&Rename Commit

Register Unused Register Used Register Unused

Register Used Register Unused

Register Unused Register Used

Register Used

T. Monreal et al.: “Delaying physical register allocation through virtual-physical registers”, MICRO’99

M. Moudgill et al, “Register renaming and dynamic speculation: an alternative approach”, MICRO93

T. Monreal et al., “Late allocation and early release of physical registers”, IEEE-TC (to appear)

J. Martínez et al, “Ephemeral Registers”, Technical Report CSL-TR-2003-1035 , 2003

A. Cristal et al, “Ephemeral Registers with Multicheckpointing” Technical report UPC-DAC-2003-51, Oct 2003

Page 40: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 40

State of Registers (FP, ROB=2048)

1 10 25 50 75 90 1000

200

400

600

800

1000

1200

1400

FP

Re

gis

ters

Distribution of in-flight Instructions

Dead

Blocked-Long

Blocked-Short

Live

1168 1382 1607 1868 1955Number of Instructions

A. Cristal, et al, “ A case for resource-concious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

Early Release

Late Allocation

Page 41: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 41

Outline� Motivation� Increasing the number of in-flight instructions� Kilo-instruction Processor Ingredients:

� Multi-Checkpointing the ROB• Out-of-Order Commit

� Early Release of Resources• Ephemeral Registers• Load Queues

� Locality Exploitation• Instruction´s Queues• LSQ

� Affordable KILO-Instruction Processors� Cross-pollination with other techniques:

� “Kilo-processor” and multiprocessor systems� “kilo-vector” processor and “kilo-valpred” processors� “Kilo-SMT” processor� Further Improvements:

• Branch prediction• Control Independence• Reuse• Predicated and multipath execution

� Conclusion

Page 42: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 42

� Increasing the number of IQ entries increase the power, area and access time

� Wake-up and selection logic need to be done efficiently

� “Kilo-instruction” processors may have many “in-flight” instructions

� We need new organization for the IQs in order to have affordable “kilo-instruction processors”

Issue Queues

Page 43: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 43

L1

INSTRUCTION CACHE

FETCH IFQ

ALLO

CA

TIO

N &

RE

NA

ME

MA

PP

ER

/DIS

PA

TC

H

BRANCH

PREDICTOR

DE

CO

DE

Int. R

eg.

File

FP

Reg.

File

L1

DATA CACHE

L2

UNIFIED CACHE

Integer

FUs

DATA

PREFETCHING

PC BY

PA

SS

FP

FUs

IIQ

IFQ

LSQ

ROB

Instruction Queues

Marco A. Ramírez Salinas, PhD Thesis, Barcelona July 9th, 2007

Page 44: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 44

A new low power wakeup mechanism

Block Mapping Table

CAM-Blocks with explicit enable

Each in-flight instruction has an only identifier namely the Destination Register Tag

Marco A. Ramírez Salinas, PhD Thesis, Barcelona July 9th, 2007

Page 45: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 45

A new low power wakeup mechanism

Contributions:

Block Mapping Table Mechanism

Allow fast blocks enabling in wakeup phase

Power Reduction

1.5 comparison per committed instruction

Save: ~70% of power for 32-entries queues

Hardware

Minimal hardware requirements

Publications:

[1] Marco A. Ramírez, A. Cristal, Luis Villa, Alex V. Veidenbaum and Mateo Valero "A Low-Power Instruction-Queue Wakeup Mechanism " XIV Jornadas de Paralelismo ‘03 Leganés-Madrid Spain.

[2] Marco A. Ramírez, A. Cristal, Luis Villa, Alex V. Veidenbaum and Mateo Valero "A Simple Low-Energy Instruction Wakeup Mechanism” International Symposium on High Performance Computer ’03 Tokyo Japan.

[3] M. A. Ramirez, A. Cristal, M. Valero, A. V. Veidenbaumand L. Villa, A partitioned instruction queue to reduce instruction wakeup energy, International Journal of High Performance Computing and Networking ’04.

Marco A. Ramírez Salinas, PhD Thesis, Barcelona July 9th, 2007

Page 46: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 46

A RAM Queue design

MWT

RAM32X8

8W4R Ports

ST C M

Pointers Four 5-bits DECODERS

Mapping Table

RAM128X10

8W4R Ports

Ready

Counters

Six 5-bits DECODERS

SELECTION LOGIC

PAYLOAD RAM

RAM40X32

4W4R

4

8

1

1

ALLOCATION LOGIC

1

I0src1

I0src2

I1src1

I1src2

I2src1

I2src2

I3src1

T0Dest

T1Dest

T2Dest

T3Dest

4 Instructions from

Dispatch

4 Inst’s to

Execute

MCST

M0-5[0:2]

C0-5[0:4]

D0-3[0:39]

Ready Counter Update Logic

A0-3[0:2]

T4Dest

T5Dest

W0-5[0:6] T4Dest, T5Dest from LD queue

I3src2

A New Direct Instruction Wakeup Queue design

(all RAM + Traditional Selection Logic)

Marco A. Ramírez Salinas, PhD Thesis, Barcelona July 9th, 2007

Page 47: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 47

1 10 25 50 75 90 1000

100

200

300

400

500

600

FP

Qu

eue

Distribution of in-flight Instructions

Blocked-Long

Blocked-Short

Ready

1168 1382 1607 1868 1955

Number of Instructions

Long/Short Lat. Inst.Remove – ReinsertDependence Chain

State of FP IQs (specFP, ROB=2048)

Page 48: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 48

1

2

3

slowfast medium

x

x

branch 4

x

x

x

load 4

x

x

load 3

x

b

x

branch 2

Secondary Buffer

ROB

13

31

2

IQ

Execution Time of Instructions

� Lebeck et al., “A large, fast instruction window for tolerating cache misses”, ISCA-29, 2002.

� Brekelbaum et al., “Hierarchical scheduling windows”, ISCA-35, 2002.

� Cristal et al., “Out-of-Order Commit Processors”, TR UPC-DAC-2003-44, July 2003 & HPCA-10, Feb. 2004

Page 49: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 49

Load/Store Queues

� Efficient and affordable memory disambiguation is mandatory for kilo-instruction processors� We need to guarantee that loads and stores arrive to the

memory in the correct order

� Increasing the number of in-flight instructions, can make the load/store queues a true bottleneck both in latency and power

Page 50: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 50

1 10 25 50 75 90 1000

100

200

300

400

500

600

LD

Queue

Distribution of in-flight Instructions

Dead

Blocked-Long

Blocked-Short

Replayable

Live

1168 1382 1607 1868 1955

Number of Instructions

CheckpointingEarly Release

State of LD Queues (specFP, ROB=2048)

Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, 2003.

Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002

J.F. Martínez et al., “Cherry: checkpointer early resource recycling in out-of-order microprocessors”, MICRO-35, 2002.

Page 51: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 51

1 10 25 50 75 90 1000

50

100

150

200

250

300

ST

Qu

eue

Distribution of in-flight Instructions

Ready

Address Ready

Blocked-Long

Blocked-Short

1168 1382 1607 1868 1955

Number of Instructions

FP

State of ST Queues (specFP, ROB=2048)

Locality

Page 52: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 52

0

0.5

1

1.5

2

2.5

3

3.5

512 1024 2048 512 1024 2048 512 1024 2048

100 500 1000

Virtual Tags

Memory Latency

IPC

256

512

Limit 4096

Limit 4096

Limit 4096

Baseline 128

Baseline 128

Baseline 128

Putting It All Together

Physical

Registers

Virtual

Registers

Memory Latency

A. Cristal et al. “Kilo-instruction Processors”. Invited paper. ISHPC-V.Tokyo, LNCS-2858. October 20-22th, 2003

1.3X

2.1X

2.2X

IQs of 64 entries

Page 53: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 53

Outline� Motivation� Increasing the number of in-flight instructions� Kilo-instruction Processor Ingredients:

� Multi-Checkpointing the ROB• Out-of-Order Commit

� Early Release of Resources• Ephemeral Registers• Load Queues

� Locality Exploitation• Instruction´s Queues• LSQ

� Affordable Kilo-Instruction Processors:

� Cross-pollination with other techniques:� “Kilo-processor” and multiprocessor systems� “kilo-vector” processors and “kilo-valpred” processors� “Kilo-SMT” processor� Further Improvements:

• Branch prediction• Control Independence• Reuse• Predicated and multipath execution

� Conclusion

Page 54: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

54

58%

15% 6%20%

A different view: D-KIP

About 70% High Locality

About 30% Low Locality

Cache

Processor

Memory

Processor

FE

TC

H miss-dependent insts

� Processes only Cache Hit/Register dependent Insts

� Latency Critical

� Buffer few instructions (<100)J

� Speculative / Out-of-Order

� Executes most of control code

� LD/ST intensive

� Memory Lookahead (Prefetching)J

DECOUPLED KILO-INSTRUCTION PROCESSOR (HPCA'06, PACT'07)1

400 cyclesmain memoryaccess latency

Distribution of Instructions based on Decode->Issue Latency

KILO-Instructionprocessor model2MB L2 Cache

measured in groups of 30 cycles

SPEC FP 2000� Miss-dependent Instructions

� Latency Tolerant

� Buffer thousands of instructions

� Relaxed Scheduling

� Little Control Code -> Few Recoveries

� Few address calculations

� No caches, No fetch/decode logic

Page 55: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

Decoupled Kilo-Instruction Processor (D-KIP)

ADD2

LD

A

ADD1

LD

B

MISS HIT

MUL

Cache

Processor

Memory

Processor

LLIB

Low Execution Locality

High Execution

Locality

Intermediate

Instruction

Buffering

ADD

1 , [B

]

ADD1

r1 = A+B;

r2 = (B+imm1)*imm2;

• Code with short decode-to-issue distance (high locality code) is executed by a small "Cache Processor"• Miss-dependent code with large decode-to-issue distance (low locality code)migrates to a simpler "Memory Processor" through an in-order instruction buffer

Pericas et al, "A Decoupled KILO-Instruction Processor", HPCA-12 (2006)

missreturns

Page 56: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

•multiple engines where each one processes a sequential subset of the low-locality instruction window•every engine performs multiple in-order scans emulating OoO•use downsized engines (2wide, in-order) since execution of low-locality code is tolerant to additional latencies •commit/recovery like checkpointed architecture (one chkpt per engine)•there is no back-communication from MP to CP except for recovery

The Flexible Heterogeneous MultiCore (I)

Cache Processor Memory Processor

Instruction Windowyoungest oldest

ROB

Pericas et al, "A Flexible Heterogeneous MultiCore Architecture", PACT-16 (2007)

The D-KIP architecture forces in-order processing of all low-locality code. Codes covered bya disparity of miss latencies among parallel statements suffer an unnecessary penality

New Idea: Cache Processor + DataFlow Memory Processor

•small•out-of-order

Page 57: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

The Flexible Heterogeneous MultiCore (II)

FMC can be easily extended to share the back-end among threads

Engines are allocated based on the threads needs, incrementing throughput and fairness

Dynamically Assigned Pool of MEs

Cache Processors

ROB

ROB

ROB

Pericas et al, "A Flexible Heterogeneous MultiCore Architecture", PACT-16 (2007)

Page 58: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

Designing a Load/Store Queue for FMC

Load Address Calculation Distance Store Address Calculation Distance

91%

95% 99%

94%

95%

99%

Pericas et al., "A two-level Load/Store Queue based on Execution Locality", ISCA-35 (2008)

Most addresses calculations have high locality!Address Calculation Distribution based on Decode-to-Issue Latency

� The original proposal of FMC used a centralized LSQ in the Cache Processor.

� In this version, every load access the LSQ/Cache hierarchy from the Memory Processor pays a round trip penalty

� Using the ideas of Execution Locality we propose a new LSQ design for the FMC

Page 59: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 59

Epoch-based Load/Store Queue

� The Epoch-based Load/Store Queue is based on two main principles:� Execution Locality (Classification of Loads/Stores into high and low locality)

� Local and Global Disambiguation

� High/Low Locality Distribution� Same as in D-KIP / FMC

LQ

SQ

High Locality LSQ

LQ

SQ

Low Locality LSQ

Migrate Loads/Stores with known address or loads/storeswhose address calculation is miss-dependent (sooner or later all are migrated)

Partition Low-Locality LSQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

corresponding loads/storescalled a "Memory Epoch"

Pericas et al., "A two-level Load/Store Queue based on Execution Locality", ISCA-35 (2008)

Page 60: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 60

Epoch based Load/Store Queue: Disambiguation

� First Local Disambiguation: Traditional Method� High-Locality LSQ Loads Search locally for forwarding Stores

� Epoch LSQs loads search locally for forwarding Stores

� Stores also search locally for store-load violation

LQ

SQ

search store queue

AddressHash

Epochs0 1 2 3 4 5 6 7 8

0 0 0 0 0 0 011@A

STORE ERT

Possibly Matching Storesin Epochs 4 and 7

search (epoch 7);if(!match)

search (epoch 4);

Epoch Resolution Table:• Tracks LL loads/stores• Very simple:

•no counters•column clean when engine iscommited

� Global Disambiguation: Consult ERT (Epoch Resolution Table)� Only if Local Disambiguation does not return a definitive answer

Pericas et al., "A two-level Load/Store Queue based on Execution Locality", ISCA-35 (2008)

Page 61: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

ELSQ: Global Picture

Thanks to local forwarding, ELSQ even outperformsthe Central LSQ in Cache Processormodel by 1-2%

L1 DataCache

Cache Processor

ROB

iALU FPU

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

LQ

SQ

Memory Processor

to L2

HL-L

SQ

LL-LSQLocal Forwardings

Local Forwardings

Global Forwarding

ME ME ME ME

ME ME ME ME

ME ME ME ME

ERT

Pericas et al., "A two-level Load/Store Queue based on Execution Locality", ISCA-35 (2008)

Page 62: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 62

ELSQ: LSQ Network Utilization

� The network of LSQs has many interesting propertiesthat allow reduced-power operation:� Around 50% of all cycles the Memory Processor is empty. Only

static power is consumed. Low-power techniques (e.g., sleeptransistors) can be applied.

� Due to high locality, LD/ST activity in the MP is very reduced. ~80% of all searches happen in the Cache Processor.

� As ~98% of stores get their address in the CP, we can simplifythe architecture by forcing all stores to compute theiraddress in the CP. This completely eliminates the Load Queuefrom the MP and has less than a 2% IPC penality.

Pericas et al., "A two-level Load/Store Queue based on Execution Locality", ISCA-35 (2008)

Page 63: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 63

Evaluation of Architecture: Parameters

� Several Architectures and Configurations are evaluated:� Out-of-Order Processors with ROBs of 64 and 256 instructions (Issue Queue and

Register File sized to avoid stalls)

� D-KIP model is implemented with:

• 4-way OoO 64-ROB Cache Processor (40-entries IQ, 96 registers)

• 2K FIFO Instruction Buffer

• 4-way In-Order Memory Processor

� FMC Parameters:

• 4-way OoO 64-ROB Cache Processor (40-entries IQ, 96 registers)

• 16 Memory Engines

• Memory Engines are 2-way & In-Order. Up to 128 long-latency instructions, 64 loads and 32 stores per engine

� RunAhead is implemented on the OoO model with unlimited fully-associative runahead cache

� Stream prefetcher holding up to 64KB of prefetched data (up to 16 streams)

� Memory Latency: 400 cycles

Page 64: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

Performance: SPECINT2000•D-KIP and FMC perform similar to OoO-256-RA and OoO-256-Prefetch•OoO-64-RA performs only slightly worse than OoO-256-RA and OoO-256-Pref: RA very resource-efficient for integer programs

SPECINT2000 SPECFP2000

•D-KIP, FMC and OoO-64-RA outperform OoO-256 by 7% despite much smaller associative structures

•D-KIP-Pref and OoO-256-PrefRA have similar performance and slightly better than FMC-Pref

Page 65: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

Performance: SPECFP2000

•FMC outperforms D-KIP by 12%, OoO-256-RA by 33% and OoO-256 by 56%

SPECINT2000 SPECFP2000

•FMC lookahead is so far and accurate that a prefetcher is no longer necessary

Page 66: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 66

Outline� Motivation� Increasing the number of in-flight instructions� Kilo-instruction Processor Ingredients:

� Multi-Checkpointing the ROB• Out-of-Order Commit

� Early Release of Resources• Ephemeral Registers• Load Queues

� Locality Exploitation• Instruction´s Queues• LSQ

� Cross-pollination with other techniques:� “Kilo-processor” and multiprocessor systems� “kilo-vector” processors and “kilo-valpred” processors� “Kilo-SMT” processor� Further Improvements:

• Branch prediction• Control Independence• Reuse• Predicated and multipath execution

� Conclusion

Page 67: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 67

Motivation: multiprocessors� Shared-Memory Multiprocessors: increased

latencies� Traversing interconnection hardware

• Centralized memory (SMP)

• Remote memories (DSM)

� Preserving cache coherence

network

cpu

bus

directory

memory

L2

L1

net. interface

cpu

bus

directory

memory

L2

L1

net. interfaceDS

M m

ode

l

Page 68: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 68

Evaluate potential

Ideal Network - 500 mem. latency

0,00E+00

2,00E+06

4,00E+06

6,00E+06

8,00E+06

1,00E+07

1,20E+07

FFT RADIX LU

16 processors

Ex

ec

uti

on

Tim

e

ROB 64

ROB 128

ROB 512

ROB 1024

ROB 2048

70%

75% 55%

Page 69: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 69

0

0,5

1

1,5

2

2,5

3

3,5

64 128 512 1024 2048 64 128 512 1024 2048 64 128 512 1024 2048

FFT RADIX LU

ROB Size / Benchmark

IPC

IDEAL NET

BADA

IDEAL NET & MEM

First Results

M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors” Invited Paper. ACM Computing Frontiers

Conference. Ischia, Italy, April 10-12, 2004

“Kilo-processor” and multiprocessor systems

Page 70: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 70

2. Transactional Memory

� Programmer specifies large, atomic tasks� atomic { some_work; }� Multiple objects, unstructured control-flow, …� Declarative: user simply specifies, system implements details

� TM simplifies parallel programming� Parallel algorithms: non-blocking sync with coarse-grain code

• Performance = fine grain locks

� Sequential algorithms: speculative parallelization� “All transactions all the time”

� Stanford Transactional Coherence & Consistency (TCC)

� Eases performance optimization� Makes deterministic replay trivial

� Atomicity & isolation are generally useful� For debugging, checkpointing, exception handling, garbage collection,

security, compiler optimization

Page 71: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 71

Implicit transactions

P3

Chk34

Chk33

Chk32

Chk31

st a

P1

Chk11

Chk12

Chk13

Chk14

P4

ld a

Chk44

Chk43

Chk42

Chk41

P2

Chk24

Chk23

Chk22

Chk21

ld a

Executed Instruction

Checkpoint

In-flight Instruction

Chk34

End StoreChkPt

Commit

Bro

ad

cas

tEnd Load

Inv

alid

atio

n

R

O

L

L

B

A

C

K

End Load

Get v

alu

e

Page 72: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 72

SC – explanation

Process A

A1

A2

A3

Process B

B1

B2

B3

Global order proc. A, B

A4

A5

A6

B4

B5

A1

A2

A3

A4

A5

A6

B1

B2

B3

B4

B5

Page 73: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 73

SC – using transactions

TR

B2

TR

B1

TR

A2

TR

A1

Process A

A1

A2

A3

Process B

B1

B2

B3

Global order proc. A, B

A4

A5

A6

B4

B5

TR

A2

TR

A1

A1

A2

A3

A4

A5

A6

TR

B1

B1

B2

B3

TR

B2

B4

B5

� We allow:� Re-ordering� Overlapping

Page 74: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 74

Outline� Motivation� Increasing the number of in-flight instructions� Kilo-instruction Processor Ingredients:

� Multi-Checkpointing the ROB• Out-of-Order Commit

� Early Release of Resources• Ephemeral Registers• Load Queues

� Locality Exploitation• Instruction´s Queues• LSQ

� Cross-pollination with other techniques:� “Kilo-processor” and multiprocessor systems� “kilo-vector” processors and “kilo-valpred” processors� “Kilo-SMT” processor� Further Improvements:

• Branch prediction• Control Independence• Reuse• Predicated and multipath execution

� Conclusion

Page 75: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 75

“Kilo-vector” processor

20 80Program:

20 8Program:

5Program:

8

Kilo

Vector

Speedup: 3.5

Speedup: 7.7

M. Valero Keynote at HPCA, Madrid-2003

Page 76: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 76

“Kilo-valpred” processor

0

0.5

1

1.5

2

2.5

3

normal Checkpoint Value

Prediction

Hybrid

SpecFP

SpecINT

1.4X

1.1X1.2X

M. Valero Keynote at HPCA, Madrid-2003

Page 77: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 77

Outline� Motivation� Increasing the number of in-flight instructions� Kilo-instruction Processor Ingredients:

� Multi-Checkpointing the ROB• Out-of-Order Commit

� Early Release of Resources• Ephemeral Registers• Load Queues

� Locality Exploitation• Instruction´s Queues• LSQ

� Cross-pollination with other techniques:� “Kilo-processor” and multiprocessor systems� “kilo-vector” processors and “kilo-valpred” processors� “Kilo-SMT” processor� Further Improvements:

• Branch prediction• Control Independence• Reuse• Predicated and multipath execution

� Conclusion

Page 78: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 78

Kilo and Control Independence

� Larger windows improve:� The probability of finding the

reconvergence point

� The correct detection of control

independent instructions because the wrong path is completely executed

� The execution of more control

independent instructions for later reuse

Wrongpath

Correctpath

RP

currentinstructionwindows

kilo-instructionwindows

CI

M. Valero Keynote at HPCA, Madrid-2003

Page 79: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 79

Kilo and Control Independence

� More opportunities to find control independent instructions� Squash reuse

� Control-independent instruction

reexecution removal

� Savings:• Power/energy

• Execution bandwidth

• Resources

� Helps to go far ahead in the instruction window faster

% C.I. instructions

0

0,05

0,1

0,15

0,2

0,25

0,3

0,35

0,4

0,45

INT

M. Valero Keynote at HPCA, Madrid-2003

Page 80: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

Some Recent Work

� Transparent Control Independence is an interesting approach to overcome the control path problem:

to Issue Pipeline

ICache

Checkpoints

RepairRename

MapSelecitve Re-Execution

Buffer (RXB)

Only CD + CIDD Instructions are re-executed

SpeculativeRename

Map

CD + CI

correct CD

CD = Control-Dependent

CI = Control-Independent

CIDD = Control-IndependentData-Dependent

to RXB Phys.RF

Also stores input sources

Al-Zawawi et al., “Transparent Control Independence (TCI)”, ISCA-34 (2007)

Page 81: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 81

Page 82: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 82

Page 83: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 83

Superscalar Issue

Superscalar leads to more performance, but lower utilization

Time

J. Emer Compaq

Page 84: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 84

Multithreaded Processor

Fetch

Decode

Renam

e

Instruction

Window

Wakeup+

select

Register

file

Bypass

Data Cache

Thread-1

Thread-2

Thread-n

Several Threads executed concurrently

Threads belong to same / different processes

HEP ( 1978 ), Alewife , M-Machine , Tera-Computer

Page 85: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 85

SMT out-of-order

Fetch Decode

/Map

Queue Reg

Read

Execut

e

Dcache

/Store

Buffer

Reg

Write

Retire

Icache

Dcache

PC

Register

Map

Regs Regs

J. Emer Compaq

Page 86: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 86

Fine Grained Multithreading

Intra-thread dependencies still limit performance

Time

J. Emer Compaq

Page 87: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 87

Simultaneous Multithreading

Maximum utilization of function units by independent operations

Time

J. Emer Compaq

Page 88: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 88

Page 89: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 89

SMT Processor

RESOURCESSHARED

Page 90: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 90

SMT Processor

Threads share but also

compete for shared

resources

Page 91: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 91

Multi-Threaded Processors� MT processors are used in a wide range of computing systems:

� High-performance: IBM Power5, Intel/AMD Quad-Core

� Real-Time: Infineon Tricore, Imagination Meta

� Network: Sun Niagara T1, Sun Niagara T2

L2 cache

Core 0

Core 1

Power5

2 cores 2-way SMT

10-way 2MB L2 cache

Intel Core Duo

2cores,

8/16-way 2MB L2 cache

Niagara T2

8 cores 8-way FGMT

12-way 3MB L2 cache

Page 92: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 92

Ifetch policies� Icount1:

� Threads with less instructions in the pre-issue stages are given high priority

� When a thread experiences an L2 miss: • It can monopolize resources

• In this case all threads are stopped

� Some Ifetch policies try to solve this problem� Work on top of Icount

� Use L1/L2 data cache misses as indicators of resource monopolization,

� Stalling/squashing threads based on these indicators

[1] Tullsen et al. "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor “ (ISCA 96)

Page 93: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 93

IFetch Policies (II)

� Main fetch policies:� Icount1: fetches from threads with fewer

instructions in pre-execution stages

� Stall2: stops a thread if it experiences an L2 miss

� Flush2: in addition, flushes all instructions after the missing load

� Data Gating3: stops threads on every L1 data cache miss

� Does not take into account resource occupancy.

� Basically they are a response actions to a given event

1 Tullsen et al. “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor” (ISCA 96)

2 Tullsen et al. "Handling long-latency loads in a simultaneous multithreaded processor“ (MICRO 01)

3 El-Moursy et al. "Front-End Policies for Improved Issue Efficiency in SMT Processors“ (HPCA 03)

Page 94: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 94

� memory-bound thread is stalled

� thread (T2) holds the resources

STALL

L2 miss detection

T1

T2

long-latency miss���� STALL

time

ld

Page 95: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 95

� memory-bound thread is flushed

� thread (T2) frees the resources

� but it is also stopped

FLUSH

L2 miss detection

T1

T2

long-latency miss� FLUSH

time

ld

Page 96: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 96

Enter in Runahead Execution

� When L2 miss reaches head of ROB� checkpoint architectural state and

� enter in runahead mode

long-latency miss

time

Runahead mode

ROB

x xxb

ran

ch

2xbx

load

2x

bra

nch

xb

ran

ch

1xax

load

1

x dxlo

ad

3xxx

bra

nch

xcx

FULL

Page 97: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 97

Sources of Improvement� Prefetching effect

� RaT allows memory-bound threads going speculatively in advance doing prefetching• prefetches increase the MLP

• each thread itself is faster by RaT, without disturbing the other threads.

T1

T2

long-latency miss

time

prefetches

Runahead Thread

[1] T. Ramirez et al. “Runahead Threads to Improve SMT Performance”. HPCA 2008

Page 98: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 98

Sources of Improvement

�Alleviate resource contention

� RaT are much less aggressive than normal ones with the important SMT resources, allocating them in short periods of time

T1

T2

long-latency miss

timeRunahead Thread

Valid Instructions

INValid Instructions

don’t use resources free resources faster

[1] T. Ramirez et al. “Runahead Threads to Improve SMT Performance”. HPCA 2008

Page 99: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 99

Evaluation - Fairness

� RaT performs better than flush and stall

55%

63%

� Significant for MEM

workloads.

Page 100: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 100

� RaT again achieves better fairness regarding dynamic control policies

36%

71%

� for MEM workloads • 57% over DCRA

• 53% over HillClimb

Evaluation – Fairness (II)

Page 101: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 101

RaT Sources of Improvement

� RaT benefits reduce register file up to 60%

0,0

0,5

1,0

1,5

2,0

2,5

3,0

3,5

4,0

64 128 192 256 320 64 128 192 256 320

FLUSH RaT

ILP4 MIX4 MEM4

Page 102: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 102

UPC contribution to “kilo” processors (1 of 3)

� We started our work in June 2001� Grant proposal to Intel-MRL in January 28th. 2002

� A. Cristal, et al. “Large virtual ROBs by processor checkpointing” Technical Report UPC-DAC-2002-39, July 2002. (Rejected for Micro-2002)

• Multiple Checkpointing• Out-of-order Commit, No need for ROB• Early release of registers and loads

� A. Cristal and M. Valero, ”ROBs virtuales utilizando checkpointing”. Spanish Workshop on Parallelism. Lleida, Sept., 2002

• Same as the previous report, but in Spanish

� A. Cristal, J. Martínez, M. Valero and J. Llosa, “Ephemeral Registers”, Technical Report CSL-TR-2003-1035 , 2003. Rejected for ISCA 2003 and Micro 2003

• Ckeckpoint + Early Release + Late allocation of registers� A. Cristal, J. Martínez, J. LLosa and M. Valero, “ A case for resource-conscious

out-of-order processors”, IEEE TCCA Computer Architecture Letters, Vol. 2, October 2003

• Underutilization of resources

Page 103: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 103

UPC contribution to “kilo” processors (2 of 3)� A. Cristal, et al. “ A case for resource-conscious out-of-order processors:

Towards Kilo-instruction in-flight processors”. MEDEA Workshop, Sept 2003 and ACM-CAN, March 2004

� A. Cristal et al. “Kilo-instruction Processors”. Invited paper. ISHPC-V.Tokyo, LNCS-2858. October 20-22th, 2003

� A. Cristal et al. “Future ILP Processors”. Invited paper. IJHPCN, to be published� A. Cristal, et al. “Out-of-Order Commit Processors” Technical Report UPC-DAC-

2003-44, July 2003. HPCA-10, Madrid, Feb. 2004• Unified mechanism for Out-of-Order Commit and IQs management• Use of Checkpointing and Ephemeral Registers

� M. Galluzzi et al. “ A First glance at Kiloinstruction Based Multiprocessors”Invited Paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004

� E. Vallejo, M. Galluzzi, A. Cristal, F. Vallejo, R. Beivide, Per Stenström, James E. Smith and Mateo Valero. “Implementing Kilo-Instruction Multiprocessors”. Invited paper. IEEE Conference on Pervasive Services, ICPS-05. Santorini, Greece. July 11-14, 200

� M. Galluzzi, E. Vallejo, A. Cristal, F. Vallejo, R. Beivide, P. Stenstrom, J. Smith and M. Valero. “Implicit Transactional Memory in Kilo-InstructionMultiprocessor”. Invited paper. ACSAC-2007. The Twelfth Asia-Pacific Computer Systems Architecture Conference. Seoul, Korea, August 23-25, 2007. LNCS 4697, pp.339-353

Page 104: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 104

� Pericas et al. “A decoupled KILO-Instruction Processor”, HPCA-12, Austin, Feb 2006� Execution Locality Concept

� Decoupled Processor (Cache/Memory Processor)

� Pericas et al. “A Flexible Heterogeneous MultiCoreArchitecture”, PACT-16, Brasov, Sept 2007� DataFlow Memory Processor

� Shared Memory Processor

� Pericas et al. “A Two-Level Load/Store Queue based on Execution Locality”, ISCA-35, June 2008� Locality-based LSQ + Local/Global Disambiguation

� Completes execution-locality based KILO-Instruction Processor

� More work being conducted at this point

UPC contribution to “kilo” processors (3 of 3)

Page 105: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 105

Talks about “Kilo” processors, from UPC� Presentation in Barcelona, to Intel-MRL in February 2002� Spanish Workshop on Parallelism. Lleida, Sept., 2002� Presentation to Intel-MRL in March 2003� Invited presentation. NSF Panel “On the Future of Computer Architecture Research:

Wise Views and Fresh Perspectives”. San Diego, June 2003 � Invited Lecture. PA3CT Conference. Edegem, Belgium, September 22-23, 2003� MEDEA Workshop. New Orleans, September 2003� Invited Lecture. ISHPC-V. The 5th International Symposium on High Performance

Computing. Tokyo, Japan, October 20-22, 2003� Keynote lecture. Seminar on Compilers and Architecture. IBM Haifa. November 11th.,

2003.� Invited lecture. Intel MRL. Haifa., Israel. Nov. 12th., 2003� HPCA-10, Madrid, February 14-18, 2003� Keynote lecture. HPCA-10. Madrid, February 14-18, 2003� Invited lecture. ACM Computing Frontiers. Ischia, April, 2004� ACM Invited lecture. ENCAR México, May 2004� Keynote Lecture. Europar. Pisa, September 2004� Distinguish lecture, Irvine, January 2006� HPCA-12, Austin, February 2006� PACT-16, Brasov, September 2007� … several keynotes and Distinguish Lectures from ACM (India, México,..)� Onassis Foundation. Haraklion, Crete, July 2008

Page 106: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 106

Memory Latency� Jouppi and P. Ranganathan. “ The relative importance of

memory latency, bandwidth and branch prediction”Whorkshop on Mixing Logic and DRAM: Chips that compute and remember”, during ISCA-24, 1997

� S. Srinivasan and A. Lebeck, “ Load latency tolerance in dynamically scheduled processors”, Micro-31, 1998

� K. Skadron, P. Ahuja, M. Martonosi and D. Clark “Branch prediction, instruction window size and cache size: Performance tradeoffs and simulation techniques” IEEE-TC, pp. 1260-1281, 1999.

� Tejas Karkhanis and James E. Smith. “A Day in the Life of a Data Cache Miss”, 2nd Annual Workshop on Memory Performance Issues (WMPI), June, 2002.

Page 107: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 107

Large Reorder Buffers� G. Sohi, S. Breach, and T. N. Vijaykumar “Multiscalar processors”

ISCA-22, 1995.

� E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith “Trace processors” ISCA-24, 1997

� H. Akkari and M. Driscoll “A dynamic multithreaded processor”Micro-31, 1998

� R. Balasubramonian, S. Dwarkadas, and D. Albonesi.“Dynamically allocating processor resources between nearby and distant ilp”ISCA, June 2001.

• Save some resources allocated for eager execution

� P. Ranganathan, V. Pai, and S. Adve “Using speculative retirement and large instruction windows to narrow the performance gap between memory consistency models” SPAA, 1997

� J. M. Tendler, S. Dodson, S. Fields, H. Lee, and B. Sinharoy“Power4 System Microarchitecture” IBM Journal of Research and Development, pp. 5-25, January 2002.

Page 108: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 108

Checkpointing

� J.E. Smith and A.R. Plezskun “Implementing Precise Interrupts in Pipelined Processors”, ISCA-12, 1985

� W.M. Hwu and Y. N. Patt, “Checkpoint repair for out-of-order execution machines” ISCA-14, 1987.

• Checkpointing as a recovery mechanism

� Early Release of Resources� A. Cristal, M. Valero, and J. LLosa. “Large virtual ROBs by

processor checkpointing” Technical Report UPC-DAC-2002-39, July 2002.• Multiple Checkpointers• Out-of-order Commit, No need for ROB• Early release of registers and loads

� J.F. Martínez, J. Renau, M.C. Huang, M. Prvulovic, and J. Torrellas. Cherry: checkpointed early resource recycling in out-of-order microprocessors. MICRO-35, Nov. 2002.• One checkpoint• Early release of resources

Page 109: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 109

Register File� M. Moudgill and K. Pingali and S. Vassiliadis, “Register renaming and dynamic

speculation: an alternative approach”, In Proceedings of the 26th annual international symposium on Microarchitecture, 1993.

• Early Release of Registers

� T. Monreal, A. González, M. Valero, J. González, V. Viñals, “Delaying Physical Register Allocation through Virtual-Physical Registers”, In Proceedings of the 33th annual international symposium on Microarchitecture, 1999.

• Virtual Registers, Late allocation of registers

� A. Cristal, J. Martínez, M. Valero and J. Llosa, “Ephemeral Registers”, Technical Report CSL-TR-2003-1035 , 2003.

• Ckeckpoint + Early Release + Late allocation of registers

� T. Monreal et al., “Late allocation and early release of physical registers”, IEEE-TC (to appear in October 2004)

Page 110: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 110

Instruction Queues

� S. Palacharla, N.P. Jouppi, and J.E. Smith “Complexity-effective superscalar processors” ISCA-24, 1997.

• Divide the Instruction queues in a set of FIFO queues

� A.R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg “A large, fast instruction window for tolerating cache misses” ISCA-29, 2002.

• Remove-Reinsert Mechanism• Keep the load dependence of all instructions

� E. Brekelbaum, J. Rupley, C.Wilkerson, and B. Black “Hierarchical scheduling windows” ISCA-35, 2002.

• Two clusters, a slow/big one, and a faster/small one for critical instructions

� A. Cristal, D. Ortega, J. Llosa and M. Valero “Out-of-Order CommitProcessors” Technical Report UPC-DAC-2003-44, July 2003. HPCA-10, Madrid, Feb. 2004

• Remove-Reinsert Mechanism• Simple reinsert mechanism

Page 111: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 111

References for LSQ for Large ROB

� A. Cristal, M. Valero, and J. LLosa. “Large virtual ROBs by processor checkpointing”Technical Report UPC-DAC-2002-39, July 2002

� J.F. Martínez, J. Renau, M.C. Huang, M. Prvulovic, and J. Torrellas. “Cherry: checkpointed early resource recycling in out-of-order microprocessors”. MICRO-35, 2002

� H. Akkari, R. Rajwar and S. T. Srinivasan “Checkpointing Processing and Recovery: Towards Scalable Large Instruction Window Processors” Micro-36, 2003

� S. Sethumadhavan, R. Desikan, D. Burger, C.R. Moore and S. W. Keckler “Scalable Hardware Memory Disambiguation for High ILP Processors” Micro-36, 2003

� H. W. Cain et al. “Memory Ordering: A Value-based approach”, ISCA-32, 2004

� A. Roth, “Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization”, ISCA‘-33, 2005

Page 112: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 112

Conclusion� Affordable “Kilo-instruction Processors”� Checkpointing and resource-conscious architectures

� Out-of- order commit� Ephemeral registers� Two-level instruction queues� Early release of loads� Load/store queue management

� New ideas to watch for� Better branch predictors� Predication and Multi-path execution� Control and data independent instructions� Reuse of large blocks of instructions

� New processor paradigms:� “Kilo-based” multiprocessor systems� “Kilo-vector” processors� “Kilo-SMT” processors� “Kilo-valpred” processors

Page 113: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 113

Acknowledgments

� Adrián Cristal� José Martínez

� Josep Llosa� Daniel Ortega� Fran Cazorla� Enrique Fernández� Oliver Santana� Ayose Falcón� Alex Pajuelo� Marco Galluzzi� Tanausu Ramírez� Miquel Pericas� Jim Smith

� Yale Patt� Alex Veidenbaum� Guri Sohi� Mark Hill� Wen-mei Hwu� “Mon” Beivide� Valentín Puente� José Angel Gregorio� Teresa Monreal� Victor Viñals

� Intel, Konrad Lai and Ronny Ronen

Page 114: Overcoming the Memory Wall: Kilo- Instruction , … the Memory Wall: Kilo-Instruction , Runahead, and SMT Processors ... Intel-MRL, Keynote lecture, Micro-32 Technology works against

July 25th, 2008 Onassis Foundation Heraklion, Crete 114

Thank you very much ☺