ppt

43
© 2005 Babak Falsafi Temporal Memory Streaming Temporal Memory Streaming Babak Falsafi Team: Mike Ferdman, Brian Gold, Nikos Hardavellas, Jangwoo Kim, Stephen Somogyi, Tom Wenisch Collaborator: Anastassia Ailamaki & Andreas Moshovos STEMS STEMS Computer Architecture Lab Carnegie Mellon http://www.ece.cmu.edu/CALCM

Upload: tess98

Post on 30-Nov-2014

522 views

Category:

Documents


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: PPT

© 2005 Babak Falsafi

Temporal Memory StreamingTemporal Memory Streaming

Babak Falsafi

Team: Mike Ferdman, Brian Gold, Nikos Hardavellas, Jangwoo Kim, Stephen Somogyi, Tom Wenisch Collaborator: Anastassia Ailamaki & Andreas Moshovos

STEMSSTEMSComputer Architecture Lab Carnegie Mellonhttp://www.ece.cmu.edu/CALCM

Page 2: PPT

2© 2005 Babak Falsafi

The Memory Wall

Logic/DRAM speed gap continues to increase!

0.33

10

0.04

80

6

0.01

0.1

1

10

100

1000

Clo

ck

s p

er

ins

tru

cti

on

0.01

0.1

1

10

100

1000

Clo

ck

s p

er

DR

AM

ac

ce

ss

Core (s)Memory

VAX/1980 PPro/1996 2010+

Page 3: PPT

3© 2005 Babak Falsafi

Current Approach Cache hierarchies:• Trade off capacity for speed• Exploit “reuse”

But, in modern servers• Only 50% utilization of one proc.

[Ailamaki, VLDB’99]

• Much bigger problem in MPs

What is wrong?• Demand fetch/repl. data 100GDRAMDRAM

CCPPUU

L3 64M

L2 2M

100

0 clk

L1 64K

1 c

lk 1

0 cl

k 1

00 c

lk

Page 4: PPT

4© 2005 Babak Falsafi

Prior Work (SW-Transparent)

Prefetching [Joseph 97] [Roth 96] [Nesbit 04] [Gracia Pérez 04]

Simple patterns or low accuracy

Large Exec. Windows / Runahead [Mutlu 03]

Fetch dependent addresses serially

Coherence Optimizations [Stenström 93] [Lai 00] [Huh 04]

Limited applicability (e.g., migratory)

Need solutions for arbitrary access patterns

Page 5: PPT

5© 2005 Babak Falsafi

Our Solution: Spatio-Temporal Memory StreamingSpatio-Temporal Memory Streaming

Observation:• Data spatially/temporally correlated • Arbitrary, yet repetitive, patterns

Approach Memory Streaming• Extract spat./temp. patterns• Stream data to/from CPU

Manage resources for multiple blocks Break dependence chains

• In HW, SW or both

L1

CPUCPU

AC

DB

..

STEMSSTEMS

ZX

W

stre

am

repl

ace

fetc

h

DRAM or DRAM or Other CPUsOther CPUs

Page 6: PPT

6© 2005 Babak Falsafi

Contribution #1: Temporal Shared-Memory Streaming

• Recent coherence miss sequences recur 50% misses closely follow previous sequence Large opportunity to exploit MLP

• Temporal streaming engine Ordered streams allow practical HW Performance improvement:

7%-230% in scientific apps. 6%-21% in commercial Web & OLTP apps.

Page 7: PPT

7© 2005 Babak Falsafi

Contribution #2: Last-touch Correlated Data Streaming

• Last-touch prefetchers Cache block deadtime >> livetime Fetch on a predicted “last touch” But, designs impractical (> 200MB on-chip)

• Last-touch correlated data streaming Miss order ~ last-touch order Stream table entries from off-chip Eliminates 75% of all L1 misses with ~200KB

Page 8: PPT

8© 2005 Babak Falsafi

Outline

• STEMS OverviewSTEMS Overview

• Example Temporal Streaming

1. Temporal Shared-Memory Streaming

2. Last-Touch Correlated Data Streaming

• Summary

Page 9: PPT

9© 2005 Babak Falsafi

• Record sequences of memory accesses• Transfer data sequences ahead of requests

Temporal Shared-Memory Streaming [ISCA’05]

Baseline System

CPU Mem

Miss A

Fill A

Streaming System

Miss B

Fill B

CPU MemMiss A

Fill A,B,C,…

• Accelerates arbitrary access patterns Parallelizes critical path of pointer-chasing

Page 10: PPT

10© 2005 Babak Falsafi

Relationship Between Misses

• Intuition: Miss sequences repeat Because code sequences repeat

• Observed for uniprocessors in [Chilimbi’02]

• Temporal Address Correlation Same miss addresses repeat in the same order

Correlated miss sequence = stream

Q W A B C D E R T A B C D E YMiss seq. …

Page 11: PPT

11© 2005 Babak Falsafi

Relationship Between Streams

• Intuition: Streams exhibit temporal locality Because working set exhibits temporal locality For shared data, repetition often across nodes

• Temporal Stream Locality Recent streams likely to recur

Q W A B C D E R

T A B C D E Y

Node 1

Node 2

Addr. correlation + stream locality = temporal correlation

Page 12: PPT

12© 2005 Babak Falsafi

Memory Level Parallelism

• Streams create MLP for dependent misses

• Not possible with larger windows / runahead

Temporal streaming breaks dependence chains

AB

C

Baseline

CPU

Must wait to follow pointers

Temporal Streaming

CPU

Fetch in parallel

AB

C

Page 13: PPT

13© 2005 Babak Falsafi

Temporal Streaming

Record

Node 1

Miss AMiss BMiss CMiss D

Directory Node 2

Page 14: PPT

14© 2005 Babak Falsafi

Temporal Streaming

Record

Node 1

Miss AMiss BMiss CMiss D

Directory Node 2

Miss AReq. A

Fill A

Page 15: PPT

15© 2005 Babak Falsafi

Temporal Streaming

Record

Locate

Node 1

Miss AMiss BMiss CMiss D

Directory Node 2

Miss AReq. A

Fill ALocate A

Stream B, C, D

Page 16: PPT

16© 2005 Babak Falsafi

Temporal Streaming

Node 1

Miss AMiss BMiss CMiss D

Directory Node 2

Miss AReq. A

Fill ALocate A

Stream B, C, D

Record

Locate

Stream

Fetch B, C, D

Hit B

Hit C

Page 17: PPT

17© 2005 Babak Falsafi

Temporal Streaming Engine

Record• Coherence Miss Order Buffer (CMOB)

~1.5MB circular buffer per node In local memory Addresses only Coalesced accesses

CPU

$

Local Memory

CMOBQ W A B C D E R T Y

Fill E

Page 18: PPT

18© 2005 Babak Falsafi

Temporal Streaming Engine

Locate

• Annotate directory Already has coherence info for every block CMOB append send pointer to directory Coherence miss forward stream request

Directory

A

B

shared Node 4 @ CMOB[23]

modified Node 11 @ CMOB[401]

Page 19: PPT

19© 2005 Babak Falsafi

Temporal Streaming Engine

Stream• Fetch data to match use rate

Addresses in FIFO stream queue Fetch into streamed value buffer

F E D C BStream Queue

CPU

L1 $Streamed

Value Buffer

Node i: stream {A,B,C…}

A data

Fetch A

~32 entries

Page 20: PPT

20© 2005 Babak Falsafi

Practical HW Mechanisms

• Streams recorded/followed in order FIFO stream queues ~32-entry streamed value buffer Coalesced cache-block size CMOB appends

• Predicts many misses from one request More lookahead Allows off-chip stream storage Leverages existing directory lookup

Page 21: PPT

21© 2005 Babak Falsafi

Methodology: Infrastructure

SimFlex [SIGMETRICS’04]

Statistically sampling → uArch sim. in minutes Full-system MP simulation (boots Linux & Solaris)

Uni, CMP, DSM timing models Real server software (e.g., DB2 & Oracle) Component-based FPGA board interface for hybrid

simulation/prototyping

Publicly available athttp://www.ece.cmu.edu/~simflex

Page 22: PPT

22© 2005 Babak Falsafi

Methodology: Benchmarks & Parameters

Model Parameters 16 4GHz SPARC CPUs

8-wide OoO; 8-stage pipe

256-entry ROB/LSQ

64K L1, 8MB L2

TSO w/ speculation

Benchmark Applications• Scientific

em3d, moldyn, ocean• OLTP: TPC-C 3.0 100 WH

IBM DB2 7.2 Oracle 10g

• SPECweb99 w/ 16K con. Apache 2.0 Zeus 4.3

Page 23: PPT

23© 2005 Babak Falsafi

TSE Coverage Comparison

0%

50%

100%

150%

200%

250%S

trid

eG

/DC

G/A

CT

SE

Str

ide

G/D

CG

/AC

TS

E

Str

ide

G/D

CG

/AC

TS

E

Str

ide

G/D

CG

/AC

TS

E

Str

ide

G/D

CG

/AC

TS

E

Str

ide

G/D

CG

/AC

TS

E

Str

ide

G/D

CG

/AC

TS

E

em3d moldyn ocean Apache DB2 Oracle Zeus

% C

oh

ere

nt

Re

ad

Mis

se

s Coverage Discards

TSE outperforms Stride and GHB for coherence misses

Page 24: PPT

24© 2005 Babak Falsafi

Stream Lengths

0%

20%

40%

60%

80%

100%0 1 2 4 8 16

32

64

128

256

512

1K

2K

4K

8K

16K

32K

64K

128K

Length (# of streamed blocks)

Cu

m. %

of

All H

its

Apache DB2 Oracle Zeusem3d moldyn ocean

• Comm: Short streams; low base MLP (1.2-1.3)• Sci: Long streams; high base MLP (1.6-6.6)• Temporal Streaming addresses both cases

Page 25: PPT

25© 2005 Babak Falsafi

3.3

1.0

1.1

1.2

1.3

em3d

mol

dyn

ocea

n

Apa

che

DB

2O

racl

eZ

eus

TSE Performance Impact

-

0.2

0.4

0.6

0.8

1.0

ba

seT

SE

ba

seT

SE

ba

seT

SE

ba

seT

SE

ba

seT

SE

ba

seT

SE

ba

seT

SE

em3d moldyn ocean Apache DB2 Oracle Zeus

No

rma

lize

d T

ime

Busy Other Stalls Coherent Read Stalls

em3d moldyn ocean Apache DB2 Oracle Zeus

Time Breakdown Speedup

95% CI

• TSE eliminates 25%-95% of coherent read stalls6% to 230% performance improvement

Page 26: PPT

26© 2005 Babak Falsafi

TSE Conclusions

• Temporal Streaming Intuition: Recent coherence miss sequences recur Impact: Eliminates 50-100% of coherence misses

• Temporal Streaming Engine Intuition: In-order streams enable practical HW Impact: Performance improvement

7%-230% in scientific apps. 6%-21% in commercial Web & OLTP apps.

Page 27: PPT

27© 2005 Babak Falsafi

Outline

• Big Picture

• Example Streaming Techniques

1. Temporal Shared Memory Streaming

2. Last-Touch Correlated Data Streaming

• Summary

Page 28: PPT

28© 2005 Babak Falsafi

Enhancing Lookahead

Observation [Mendelson, Wood&Hill]:

• Few live sets Use until last “hit” Data reuse high hit rate ~80% dead frames!

Exploit for lookahead:• Predict last “touch” prior to “death”• Evict, predict and fetch next line

L1 @ Time T1

L1 @ Time T2

Live

set

s D

ead

set

s

Page 29: PPT

29© 2005 Babak Falsafi

How Much Lookahead?

Predicting last-touches will eliminate all latency!

DR

AM

lat

ency

L2

late

ncy

Frame Deadtimes (cycles)

Page 30: PPT

30© 2005 Babak Falsafi

Dead-Block Prediction [ISCA’00 & ’01]

• Per-block trace of memory accesses to a block Predicts repetitive last-touch events

PC3: load/store A1

PC1: load/store A1

PC3: load/store A1

PC5: load/store A3

Acc

esse

s to

a b

lock

fram

e

(miss)

(hit)

(hit)

(miss)

PC0: load/store A0 (hit)

Trace = A1 (PC1,PC3, PC3)

Last touch

First touch

Page 31: PPT

31© 2005 Babak Falsafi

Dead-Block Prefetcher (DBCP)

Evict A1Fetch A3

11

Correlation Table

A3A1,PC1,PC3,PC3PC1,PC3

History Table (HT)

PC3

Current Access

Latest

A1

• History & correlation tables History ~ L1 tag array Correlation ~ memory footprint

• Encoding truncated addition• Two bit saturating counter

Page 32: PPT

32© 2005 Babak Falsafi

0%

20%

40%

60%

80%

100%

120%

Olden SPEC INT SPEC FP

(%)

of

Cac

he

Mis

ses

early

train

incorrect

correct

DBCP Coverage with Unlimited Table Storage

• High average L1 miss coverage• Low misprediction (2-bit counters)

Page 33: PPT

33© 2005 Babak Falsafi

0%

20%

40%

60%

80%

100%

160K

B

640K

B

2M

B

10M

B

40M

B

160M

B

On-Chip Correlation Table Size

% o

f A

ch

ievab

le C

overa

ge

average

worst-case

Impractical On-Chip Storage Size

Needs over 150MB to achieve full potential!

Page 34: PPT

34© 2005 Babak Falsafi

Our Observation: Signatures are Temporally Correlated

Signatures need not reside on chip1. Last-touch sequences recur

• Much as cache miss sequences recur [Chilimbi’02]

• Often due to large structure traversals

2. Last-touch order ~ cache miss order• Off by at most L1 cache capacity

Key implications:• Can record last touches in miss order• Store & stream signatures from off-chip

Page 35: PPT

35© 2005 Babak Falsafi

Last-Touch Correlated Data Streaming(LT-CORDS)

• Streaming signatures on chip Keep all sigs. in sequences in off-chip DRAM Retain sequence “heads” on chip “Head” signals a stream fetch

• Small (~200KB) on-chip stream cache Tolerate order mismatch Lookahead for stream startup

DBCP coverage with moderate on-chip storage!

Page 36: PPT

36© 2005 Babak Falsafi

DBCP Mechanisms

Core L1L2

DRAM

HT

All signatures in random-access on-chip table

Sigs. (160MB)

Page 37: PPT

37© 2005 Babak Falsafi

Only a subset needed at a time“Head” as cue for the “stream”

Signatures stored off-chip

What LT-CORDS Does

Core L1L2

DRAM

HT

… and only in order

Page 38: PPT

38© 2005 Babak Falsafi

LT-CORDS Mechanisms

Core L1L2

DRAM

HT

SC

On-chip storage independent of footprint

Heads (10K) Sigs. (200K)

Page 39: PPT

39© 2005 Babak Falsafi

Methodology• SimpleScalar CPU model with Alpha ISA

SPEC CPU2000 & Olden benchmarks

• 8-wide out-of-order processor 2 cycle L1, 16 cycle L2, 180 cycle DRAM FU latencies similar to Alpha EV8 64KB 2-way L1D, 1MB 8-way L2

• LT-CORDS with 214KB on-chip storage• Apps. with significant memory stalls

Page 40: PPT

40© 2005 Babak Falsafi

0%

20%

40%

60%

80%

100%

120%D

BC

P

LT-

cord

s

DB

CP

LT-

cord

s

DB

CP

LT-

cord

s

Olden SPEC INT SPEC FP

(%)

of

Ca

ch

e M

iss

es

early

train

incorrect

correct

LT-CORDS vs. DBCP Coverage

LT-CORDS reaches infinite DBCP coverage

Page 41: PPT

41© 2005 Babak Falsafi

1

2

3

4

5

treead

dmgri

dsw

im

equa

keap

plu mcffm

a3d art

em3d

facere

c

wupwise bh

ammp gc

cpa

rser

sixtra

ck

Sp

ee

du

p

Infinite Cache

LT-CORDS

GHB (PC/DC)

LT-CORDS Speedup

LT-CORDS hides large fraction of memory latency

Page 42: PPT

42© 2005 Babak Falsafi

LT-CORDS Conclusions

• Intuition: Signatures temporally correlated Cache miss & last-touch sequences recur Miss order ~ last-touch order

• Impact: eliminates 75% of all misses Retains DBCP coverage, lookahead, accuracy On-chip storage indep. of footprint 2x less memory stalls over best prior work

Page 43: PPT

43© 2005 Babak Falsafi

For more informationVisit our website:http://www.ece.cmu.edu/CALCM