improving cache locality for thread-level speculation stanley fung and j. gregory steffan


Improving Cache Locality for Thread-Level Speculation

Stanley Fung and J. Gregory Steffan

Electrical and Computer Engineering

University of Toronto

Chip Multiprocessors (CMPs) are Here!

IBM Power 5, AMD Opteron, Intel Yonah

Use CMPs to improve sequential program performance?

Exploiting CMPs: The Intuition

CMPs have lots of distributed resources

– Caches, branch predictors, processors

Somehow distribute sequential programs

– Use distributed resources to improve performance

Increasingly aggressive approaches:

1) Prefetching (e.g., helper threads)

2) Transactions and transactional memory

3) Thread-Level Speculation (TLS)

But distributing a sequential program is non-trivial…

Exploiting CMPs: The Tension

Distributed CMP Resources vs. Sequential Program

Parallelism vs. Locality

Our challenge: relaxing this tension

Example: TLS Execution on 4 Processors

[Figure: processor activity (active vs. inactive) under sequential execution and under TLS execution]

4X total cache capacity: 4X cache performance?

TLS on 4 CPU CMP: % Increase in Cache Misses

[Figure: % increase in cache misses per benchmark; one labeled value: 272.5%]

4X total cache capacity: 4X increase in cache misses

Opportunities for Improvement

1) Prefetching Effects

– TLS indirectly prefetches from off-chip into L2

– Orthogonal to the focus of this work

2) "Locality Misses"

– An L1 miss where the line is resident in another L1

– An indicator of both:

• Broken locality

• Opportunity to repair locality

What fraction of misses are locality misses?
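The definition of a locality miss above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `is_locality_miss` and `locality_miss_fraction` and the snapshot-style `l1_contents` map are hypothetical and do not come from the talk, which measures this in a simulator.

```python
from typing import Dict, Iterable, Set, Tuple


def is_locality_miss(line_addr: int, cpu: int,
                     l1_contents: Dict[int, Set[int]]) -> bool:
    """True if `cpu` misses on `line_addr` while some other L1 holds the line.

    l1_contents maps a CPU id to the set of line addresses resident in its L1
    (a hypothetical snapshot; a real simulator would track this per access).
    """
    if line_addr in l1_contents[cpu]:
        return False  # resident locally: this is a hit, not a miss
    return any(line_addr in lines
               for other, lines in l1_contents.items() if other != cpu)


def locality_miss_fraction(misses: Iterable[Tuple[int, int]],
                           l1_contents: Dict[int, Set[int]]) -> float:
    """Fraction of the given (cpu, line_addr) L1 misses that are locality misses."""
    misses = list(misses)
    if not misses:
        return 0.0
    count = sum(is_locality_miss(addr, cpu, l1_contents) for cpu, addr in misses)
    return count / len(misses)
```

For example, with line 0x40 resident only in CPU 0's L1, a miss on 0x40 by CPU 1 counts as a locality miss, while a miss on a line held by no L1 does not.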

TLS on 4 CPU CMP: % Locality Misses

[Figure: percentage of misses that are locality misses, per benchmark]

significant locality misses: problem and opportunity

Outline

• Experimental Framework

• Classification of Misses

• Techniques for Reducing Misses

• Combining Techniques

• Impact on Scalability

• Conclusion

Support for TLS

Break programs into speculative threads

– We use the compiler

Track data dependences

– We extend invalidation-based cache coherence

Recover from failed speculation

– We extend L1 data caches to buffer speculative state

three key elements of every TLS system

Compiler Support for TLS

Sequential Source Code -> Region Selection -> Transformation and Optimization -> Executable

– Region Selection: which loops? (uses profile information)

– Transformation and Optimization: inserts TLS instructions

Hardware Support for TLS

[Figure: cache with State and Data fields]

extend generic CMP’s L1 caches and coherence

Experimental Framework

• CMP with 4 CPUs (or more)

– 4-way issue, out-of-order superscalar

• Memory Hierarchy

– Private L1 data caches: 32KB, 2-way

– 2MB shared L2 cache

– Bus interconnect

• Not shown: results for crossbar interconnect

• Benchmarks: SPEC INT 95 and 2000

– Speculatively parallelized

TLS Cache Locality Problem: Our Investigation

Cache Locality Problem

– Private Cache Architecture vs. Shared Cache Architecture: a shared cache solves locality problems (but slow)

– Data Cache vs. Instruction Cache: i-cache misses are insignificant; focus on d-cache

– Parallel Regions (miss patterns) vs. Sequential Regions (transitions)

TLS Execution Stages and Transitions

[Figure: four processors (P) over time, moving between a sequential region and a parallel region]

– Startup: little impact

– Steady state: our main focus

– Wind-down: has impact

wind-down transitions: scheduling the seq. region

Scheduling the Sequential Region

[Figure: processors P0 to P3 with two options, a floating sequential processor or a fixed sequential processor; the fixed option has potential cache locality]

which is better?

Performance of Fixed Relative to Floating

[Figure: execution time with a fixed sequential processor relative to floating, per benchmark]

Overall Program: 3.4% speedup

fixed sequential processor is superior, at no cost


Classifying Misses Within Parallel Regions

1) L2 Misses (ignore)

– These cannot be locality misses (inclusion enforced)

2) Read-based sharing

– Line is read by multiple processors

3) Write-based sharing

– Line is written (and possibly read) by multiple processors

4) Strided

– Addresses of missing lines progress by a cross-CPU stride

5) Other (ignore)

– No observable patterns; likely conflict and capacity misses

caveats: there is overlap; priority order; sliding window
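A rough sketch of how such a classifier might be organized, applying the categories in the priority order listed and looking only at a sliding window of recent misses. The window size, the `Miss` record, and the exact tests for sharing and striding are assumptions made for illustration; the talk does not specify them.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Miss:
    cpu: int
    line: int        # cache-line address
    is_write: bool
    l2_miss: bool    # True if the access also missed in the shared L2


def classify(miss, window):
    """Classify one L1 miss against the recent misses in `window`."""
    if miss.l2_miss:
        return "l2"                       # 1) cannot be a locality miss
    others = [m for m in window if m.line == miss.line and m.cpu != miss.cpu]
    if not miss.is_write and any(not m.is_write for m in others):
        return "read_sharing"             # 2) read by multiple processors
    if others:
        return "write_sharing"            # 3) at least one side wrote the line
    recent = list(window)[-2:]
    if len(recent) == 2:
        a, b = recent
        stride = b.line - a.line
        # 4) assumed test: three consecutive misses from three different CPUs
        #    whose line addresses form an arithmetic progression
        if (stride != 0 and miss.line - b.line == stride
                and len({a.cpu, b.cpu, miss.cpu}) == 3):
            return "strided"
    return "other"                        # 5) no observable pattern


def classify_trace(trace, window_size=16):
    """Label every miss in a trace using a sliding window of prior misses."""
    window = deque(maxlen=window_size)
    labels = []
    for miss in trace:
        labels.append(classify(miss, window))
        window.append(miss)
    return labels
```

Because the checks run in a fixed order, a miss that matches several patterns is counted once, under the highest-priority category.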

Miss Patterns Observed

Miss Pattern            Percentage
L2 miss                 15.7%
Read-based sharing      53.7%
Write-based sharing     11.4%
Strided                 6.2%
Other                   13.0%

investigate techniques targeting these three patterns

Exploiting Read-Only Sharing Patterns

• Read-only sharing misses dominate (53.7%)

– Hence a given read miss predicts future read misses

– i.e., other CPUs will likely read-miss that same line

• Broadcasting for all read misses

– Any read miss results in that line being pushed to all caches

• Provided lines in speculative state are not evicted

– Trivial to implement in CMP with bus interconnect

• No extra traffic

will such broadcasting result in cache pollution?
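The broadcast rule can be illustrated with a toy model. Everything here is an assumption for the sake of the sketch: the `L1` class, a small fully associative cache with oldest-first replacement, and the choice to drop a pushed line when only speculative lines could be evicted. The real caches are 2-way set-associative and sit on a snooping bus.

```python
from collections import OrderedDict


class L1:
    """Toy fully associative L1; maps line -> speculative flag, oldest first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def accept_push(self, line):
        """Install a broadcast line unless that would evict speculative state."""
        if line in self.lines:
            return True
        if len(self.lines) >= self.capacity:
            victim = next((l for l, spec in self.lines.items() if not spec), None)
            if victim is None:
                return False          # only speculative lines left: drop the push
            del self.lines[victim]
        self.lines[line] = False      # pushed copies start non-speculative
        return True


def read(caches, cpu, line):
    """Return True on an L1 hit; on a miss, push the fill to every L1."""
    if line in caches[cpu].lines:
        caches[cpu].lines.move_to_end(line)
        return True
    for cache in caches:              # the requester's own fill is modeled as a push
        cache.accept_push(line)
    return False
```

In this model, after CPU 0 misses on a line, a later read of the same line by CPU 1 hits in its own L1, which is the effect the broadcast is meant to achieve.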

Impact of Broadcasting All Read Misses (RB)

Data cache misses: 27.7% reduction

Execution time: 7.3% speedup

simple broadcasting is effective

• Attempts to throttle broadcasting reduced benefits

– Hence resulting cache pollution is limited

Exploiting Write-Based Sharing Patterns

• Note: caches extended for TLS are write-back

– Modifications are not propagated before thread commits

• Example: write-based sharing of a cache line

– CPU0 writes then commits; then CPU1 reads

– Read results in miss, read-request, write-back, then fill

• Aggressive approach:

– On commit, broadcast all modified lines

– Too much traffic, too many superfluous copies

• A more selective approach:

– Predict lines involved in write-based sharing

more general: predict stores involved in WB sharing

Predicting Stores & Lines Involved in WB Sharing

Address: tag | index | offset

– Extended Tag (etag) and RST Index are taken from the address

– Recent Store Table (RST): recent store PCs, selected by RST Index; 8 entries

– Invalidation PC List (IPCL): store PCs for lines that are written back; 8 entries

– Push Required Buffer (PRB): lines to push on commit; 8 entries

8 entries each is sufficient

Operation of Write-Based Sharing Technique

On a store:

– Add store PC to Recent Store Table (RST)

– If store PC is in Invalidation PC List (IPCL):

• Add the line's extended tag to Push Required Buffer (PRB)

On a coherence request requiring writeback:

– Use RST index to look up PC in RST, add PC to IPCL

On commit:

– For each extended tag in PRB:

• Writeback, self-invalidate, push line to next cache

simple case: next cache is in round-robin order
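The three steps above can be sketched as a small per-CPU predictor. This is a minimal illustration and rests on several assumptions that are not stated in the talk: a 32-byte line, an extended tag equal to the line address, an RST index formed from the low bits of the line address, FIFO replacement in the IPCL and PRB, and a PRB that holds extended tags (as the commit step suggests). The class name `WBPredictor` is hypothetical.

```python
from collections import deque

LINE_BYTES = 32   # assumed line size
ENTRIES = 8       # the talk reports 8 entries per structure as sufficient


class WBPredictor:
    def __init__(self, cpu, num_cpus):
        self.cpu, self.num_cpus = cpu, num_cpus
        self.rst = [None] * ENTRIES         # RST index -> most recent store PC
        self.ipcl = deque(maxlen=ENTRIES)   # store PCs whose lines were written back
        self.prb = deque(maxlen=ENTRIES)    # extended tags of lines to push on commit

    @staticmethod
    def etag(addr):
        return addr // LINE_BYTES           # assumed: tag and index bits together

    def rst_index(self, addr):
        return self.etag(addr) % ENTRIES    # assumed mapping from address to RST slot

    def on_store(self, pc, addr):
        self.rst[self.rst_index(addr)] = pc
        if pc in self.ipcl and self.etag(addr) not in self.prb:
            self.prb.append(self.etag(addr))

    def on_writeback_request(self, addr):
        """A coherence request forced a writeback of the line holding `addr`."""
        pc = self.rst[self.rst_index(addr)]
        if pc is not None and pc not in self.ipcl:
            self.ipcl.append(pc)

    def on_commit(self):
        """Return (etag, target_cpu) pushes; the target is the next CPU, round-robin."""
        target = (self.cpu + 1) % self.num_cpus
        pushes = [(etag, target) for etag in self.prb]
        self.prb.clear()
        return pushes
```

A store PC only starts triggering pushes after one of its lines has been written back in response to another cache's request, so stores that never take part in write-based sharing add no traffic.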

Impact of Write-Based Technique (WB)

Data cache misses: 19.6% reduction

Execution time: 7.8% speedup

worth the cost of small additional hardware

Exploiting Strided Miss Patterns

• Hardware stride-prefetcher [Fu et al., Baer et al.]

– Each CPU has its own aggressive prefetcher

– Fully associative, 512 entries:

• PC, miss address, stride distance, state

– Issue 16 prefetches when stride is recognized

• Prefetches are throttled to avoid burst of traffic

• Prefetch from L2 to private caches

– To be fair, prefetches do not go beyond L2
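A minimal sketch of a PC-indexed stride prefetcher with the table size and prefetch degree given above. The recognition rule (the same non-zero stride seen on two consecutive misses from one PC) and the LRU replacement are assumptions, and the sketch models neither the throttling nor the restriction to L2-resident lines.

```python
from collections import OrderedDict

TABLE_ENTRIES = 512     # fully associative table size from the slide
PREFETCH_DEGREE = 16    # prefetches issued once a stride is recognized


class StridePrefetcher:
    def __init__(self):
        # pc -> (last miss address, stride distance, steady state), in LRU order
        self.table = OrderedDict()

    def on_miss(self, pc, addr):
        """Record a miss; return the addresses to prefetch (possibly empty)."""
        prefetches = []
        if pc in self.table:
            last_addr, stride, _ = self.table.pop(pc)
            new_stride = addr - last_addr
            steady = new_stride != 0 and new_stride == stride
            if steady:
                prefetches = [addr + new_stride * i
                              for i in range(1, PREFETCH_DEGREE + 1)]
            self.table[pc] = (addr, new_stride, steady)
        else:
            if len(self.table) >= TABLE_ENTRIES:
                self.table.popitem(last=False)   # evict the least recently used PC
            self.table[pc] = (addr, 0, False)
        return prefetches
```

With misses at 1000, 1064, and 1128 from one load PC, the third miss confirms a stride of 64 and produces 16 prefetch addresses starting at 1192.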

Impact of Strided Prefetching (ST)

Data cache misses: 10.3% reduction

Execution time: no significant impact

no good alone; complementary with other techniques?

Combining Techniques: Parallel Region Perf.

[Figure: data cache misses and execution time for combinations of techniques, including RB/WB/ST]

RB/WB/ST has fewest misses, but RB/WB performs best

Overall Program Speedup

[Figure: overall program speedup for Float, Baseline, RB, WB, ST, RB/WB, and RB/WB/ST]

RB/WB further improves program performance by 5.5%

Impact of RB/WB on Scalability

[Figure: baseline vs. improved for Bzip2_comp, Vpr_place, and Average (all benchmarks)]

facilitates scaling

Summary

• Have a fixed processor for sequential regions

• Exploiting read-only sharing patterns (RB):

– Simple broadcasting for all load misses is effective

• No significant cache pollution

• Exploiting write-based sharing patterns (WB):

– Write-back/self-invalidate/push technique is effective

• Exploiting strided miss patterns (ST):

– Extra traffic overwhelms benefit of reduced misses

• RB/WB are complementary and perform best

– And dramatically improve the scalability of TLS

Improving cache locality is key for effective TLS

Backups

Ideal Caches

Ideal Caches Model (Parallel Region Performance)

Configuration                   Value (baseline = 100)
Baseline                        100
Ideal instruction cache         99.9
Ideal data cache                80.4
Ideal instruction and data      80.1

Parallel Region Cache Miss Breakdown

[Figure: per-benchmark breakdown of misses into L2 Misses, Read-Based Sharing, Write-Based Sharing, and Strided]
