Improving Cache Locality for Thread-Level Speculation, Stanley Fung and J. Gregory Steffan
Post on 28-Jan-2016
1Improving Cache Locality for TLS Steffan
Improving Cache Locality for Thread-Level Speculation
Stanley Fung and J. Gregory Steffan
Electrical and Computer Engineering
University of Toronto

Chip Multiprocessors (CMPs) are Here!
IBM Power 5, AMD Opteron, Intel Yonah
Use CMPs to improve sequential program performance?

Exploiting CMPs: The Intuition
CMPs have lots of distributed resources
– Caches, branch predictors, processors
Somehow distribute sequential programs
– Use distributed resources to improve performance
Increasingly aggressive approaches:
1) Prefetching (e.g., helper threads)
2) Transactions and transactional memory
3) Thread-Level Speculation (TLS)
But distributing a sequential program is non-trivial…

Exploiting CMPs: The Tension
Distributed CMP Resources vs. Sequential Program
Parallelism vs. Locality
Our challenge: relaxing this tension
[Figure: CMP with four processors (P), each with a private L1 cache, sharing one L2 cache]

Example: TLS Execution on 4 Processors
[Figure: sequential execution vs. TLS execution over time on a CMP with four processors (P), private L1 caches, and a shared L2; legend: active, inactive]
4X total cache capacity => 4X cache performance?

TLS on 4 CPU CMP: % Increase in Cache Misses
[Figure: bar chart, percentage increase in data cache miss rate per benchmark]
Benchmark     % increase
bzip2_comp    93.4
crafty        129.8
gcc           72.4
go            209.8
ijpeg         306.9
li            9.3
m88ksim       808.9
mcf           37.9
parser        19.1
perlbmk       934.2
vortex        22.7
vpr_place     110.3
vpr_route     788.2
average       272.5
272.5% ~= 4X
4X total cache capacity => 4X increase in cache misses

Opportunities for Improvement
1) Prefetching Effects
– TLS indirectly prefetches from off-chip into L2
– Orthogonal to the focus of this work
2) "Locality Misses"
– An L1 miss where the line is resident in another L1
– An indicator of both:
• Broken locality
• Opportunity to repair locality
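The definition of a locality miss above can be made concrete with a small sketch. It assumes each private L1 is modeled as a plain Python set of resident line addresses; the function names `is_locality_miss` and `locality_miss_fraction` are hypothetical and not taken from the slides.

```python
def is_locality_miss(l1_caches, cpu, line):
    """Return True if `cpu` misses on `line` while another L1 holds it.

    `l1_caches` is a list of sets of resident line addresses, one per CPU
    (an assumed, highly simplified cache model).
    """
    if line in l1_caches[cpu]:
        return False  # the access hits locally, so it is not a miss at all
    return any(line in cache
               for other, cache in enumerate(l1_caches)
               if other != cpu)


def locality_miss_fraction(l1_caches, misses):
    """Fraction of the given (cpu, line) misses that are locality misses."""
    if not misses:
        return 0.0
    count = sum(is_locality_miss(l1_caches, cpu, line) for cpu, line in misses)
    return count / len(misses)
```

For example, with `caches = [{1, 2}, {3}, set(), set()]`, a miss by CPU 1 on line 2 counts as a locality miss because CPU 0 already holds that line.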
What fraction of misses are locality misses?

TLS on 4 CPU CMP: % Locality Misses
[Figure: bar chart, percentage of cache misses that are locality misses per benchmark]
Benchmark     % locality misses
bzip2_comp    44.4
crafty        93.6
gcc           68.5
go            81.3
ijpeg         78.6
li            24.0
m88ksim       37.5
mcf           7.6
parser        43.5
perlbmk       88.7
vortex        92.2
vpr_place     65.8
vpr_route     69.1
average       61.1
significant locality misses (61.1% on average): problem and opportunity

Outline
• Experimental Framework
• Classification of Misses
• Techniques for Reducing Misses
• Combining Techniques
• Impact on Scalability
• Conclusion

Support for TLS
Break programs into speculative threads
– We use the compiler
Track data dependences
– We extend invalidation-based cache coherence
Recover from failed speculation
– We extend L1 data caches to buffer speculative state
three key elements of every TLS system

Compiler Support for TLS
[Figure: compiler flow: Sequential Source Code -> Region Selection (which loops? uses profile information) -> Transformation and Optimization (inserts TLS instructions) -> MIPS Executable]

Hardware Support for TLS
[Figure: CMP with four processors (P), private L1 caches, and a shared L2; each L1 cache line holds Tag, State, and Data, extended with SL and SM bits]
extend generic CMP’s L1 caches and coherence

Experimental Framework
• CMP with 4 CPUs (or more)
– 4-way issue, out-of-order superscalar
• Memory Hierarchy
– Private L1 data caches: 32KB, 2-way
– 2MB shared L2 cache
– Bus interconnect
• Not shown: results for crossbar interconnect
• Benchmarks: SPEC INT 95 and 2000
– Speculatively parallelized

TLS Cache Locality Problem: Our Investigation
Cache Locality Problem

TLS Cache Locality Problem: Our Investigation
Cache Locality Problem
Private Cache Architecture Shared Cache Architecture
a shared cache solves locality problems (but slow)

TLS Cache Locality Problem: Our Investigation
Cache Locality Problem
Private Cache Architecture Shared Cache Architecture
Data Cache Instruction Cache
i-cache misses are insignificant; focus on d-cache

TLS Cache Locality Problem: Our Investigation
Cache Locality Problem
Private Cache Architecture Shared Cache Architecture
Data Cache Instruction Cache
Parallel Regions Sequential Regions
miss patterns transitions

TLS Execution Stages and Transitions
[Figure: timeline across four processors (P): a sequential region, a parallel region, then another sequential region]
Startup: little impact
Steady state: our main focus
Wind-down: has impact
wind-down transitions: scheduling the seq. region

Scheduling the Sequential Region
[Figure: processors P0, P1, P2, P3, comparing a floating sequential processor with a fixed sequential processor and the potential cache locality]
which is better?

Performance of Fixed Relative to Floating
Overall Program: 3.4% speedup
fixed sequential processor is superior, at no cost
[Figure: bar chart, normalized execution time per benchmark]
Benchmark     Normalized execution time
bzip2_comp    99.6
crafty        95.0
gcc           96.7
go            88.8
ijpeg         96.5
li            100.0
m88ksim       87.0
mcf           99.8
parser        99.3
perlbmk       97.3
vortex        99.9
vpr_place     94.8
vpr_route     101.6
average       96.6

Classifying Misses Within Parallel Regions
1) L2 Misses (ignore)
– These cannot be locality misses (inclusion enforced)
2) Read-based sharing
– Line is read by multiple processors
3) Write-based sharing
– Line is written (and possibly read) by multiple processors
4) Strided
– Addresses of missing lines progress by a cross-CPU stride
5) Other (ignore)
– No observable patterns; likely conflict and capacity misses
caveats: there is overlap; priority order; sliding window
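A rough sketch of such a classifier follows, applied in priority order over a sliding window of recent misses. The slides do not give the window size or the exact tests, so everything here is an assumption: the `Miss` record, the 8-entry window, treating any write to a shared line as write-based sharing, and detecting a stride from the last two misses issued by two other CPUs.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Miss:
    cpu: int
    line: int            # cache-line address
    is_write: bool
    l2_hit: bool = True  # False means the miss also missed in the L2


def classify(miss, window):
    """Classify one L1 miss against recent misses, in priority order."""
    if not miss.l2_hit:
        return "l2_miss"
    # Same line touched recently by a different CPU: a sharing pattern.
    same_line = [m for m in window if m.line == miss.line and m.cpu != miss.cpu]
    if same_line:
        wrote = miss.is_write or any(m.is_write for m in same_line)
        return "write_sharing" if wrote else "read_sharing"
    # Cross-CPU stride: the last two misses and this one come from three
    # different CPUs and are separated by the same non-zero distance.
    recent = list(window)[-2:]
    if len(recent) == 2:
        a, b = recent
        stride = b.line - a.line
        if (stride != 0 and miss.line - b.line == stride
                and len({a.cpu, b.cpu, miss.cpu}) == 3):
            return "strided"
    return "other"


def classify_trace(trace, window_size=8):
    """Label every miss in `trace`, keeping a sliding window of recent misses."""
    window = deque(maxlen=window_size)
    labels = []
    for miss in trace:
        labels.append(classify(miss, window))
        window.append(miss)
    return labels
```

Because each miss receives exactly one label, the order of the checks decides how overlapping cases are counted, which is the role of the priority order mentioned in the caveats.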

Miss Patterns Observed
Miss Pattern           Percentage
L2 miss                15.7%
Read-based sharing     53.7%
Write-based sharing    11.4%
Strided                6.2%
Other                  13.0%
Read-based sharing + write-based sharing + strided: 71.3%
investigate techniques targeting these three patterns

Exploiting Read-Only Sharing Patterns
• Read-only sharing misses dominate (53.7%)
– Hence a given read miss predicts future read misses
– i.e., other CPUs will likely read-miss that same line
• Broadcasting for all read misses
– Any read miss results in that line being pushed to all caches
• Provided lines in speculative state are not evicted
– Trivial to implement in CMP with bus interconnect
• No extra traffic
will such broadcasting result in cache pollution?
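The broadcast policy can be illustrated with a toy model. This is only a sketch under stated assumptions: each L1 is a tiny direct-mapped cache with one line per set, and the `L1Cache` class and `read` function are hypothetical names rather than the simulator used in the study. A pushed fill is refused when it would evict a line in speculative state, as the slide requires.

```python
class L1Cache:
    """Tiny direct-mapped L1 model: one line per set, with a speculative flag."""

    def __init__(self, num_sets=4):
        self.num_sets = num_sets
        self.sets = {}  # set index -> (line address, is_speculative)

    def holds(self, line):
        entry = self.sets.get(line % self.num_sets)
        return entry is not None and entry[0] == line

    def fill(self, line, pushed):
        """Install `line`; a pushed (broadcast) fill never evicts speculative state."""
        if self.holds(line):
            return True
        index = line % self.num_sets
        victim = self.sets.get(index)
        if pushed and victim is not None and victim[1]:
            return False  # keep the speculative line, drop the pushed copy
        self.sets[index] = (line, False)
        return True


def read(caches, cpu, line, broadcast=True):
    """Perform a read by `cpu`; return "hit" or "miss"."""
    if caches[cpu].holds(line):
        return "hit"
    caches[cpu].fill(line, pushed=False)  # demand fill for the requester
    if broadcast:
        # On a bus, every cache sees the fill, so each one may keep a copy.
        for other, cache in enumerate(caches):
            if other != cpu:
                cache.fill(line, pushed=True)
    return "miss"
```

With broadcasting enabled, a read miss by one CPU turns later reads of the same line by the other CPUs into hits; with `broadcast=False`, each CPU misses separately.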

Impact of Broadcasting All Read Misses (RB)
Data cache misses: 27.7% reduction
Execution time: 7.3% speedup
simple broadcasting is effective
• Attempts to throttle broadcasting reduced benefits
– Hence resulting cache pollution is limited

Exploiting Write-Based Sharing Patterns
• Note: caches extended for TLS are write-back
– Modifications are not propagated before thread commits
• Example: write-based sharing of a cache line
– CPU0 writes then commits; then CPU1 reads
– Read results in miss, read-request, write-back, then fill
• Aggressive approach:
– On commit, broadcast all modified lines
– Too much traffic, too many superfluous copies
• A more selective approach:
– Predict lines involved in write-based sharing
more general: predict stores involved in WB sharing

Predicting Stores & Lines Involved in WB Sharing
[Figure: address fields (tag, index, offset) with the extended tag (etag) marked, and an RST index used to look up the Recent Store Table]
Recent Store Table (RST): recent store PCs, 8 entries
Invalidation PC List (IPCL): store PCs for lines that are written back, 8 entries
Push Required Buffer (PRB): etags of lines to push on commit, 8 entries
8 entries each is sufficient

Operation of Write-Based Sharing Technique
On a store:
– Add store PC to Recent Store Table (RST)
– If store PC is in Invalidation PC List (IPCL):
• Add the stored line's extended tag to Push Required Buffer (PRB)
On a coherence request requiring writeback:
– Use RST index to look up PC in RST, add PC to IPCL
On commit:
– For each extended tag in PRB:
• Writeback, self-invalidate, push line to next cache
simple case: next cache is in round-robin order
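The three events above can be sketched in software as follows. This is a minimal model under assumptions the slides leave open: round-robin replacement in the RST, first-in-first-out replacement in the IPCL and PRB, and a dictionary standing in for the RST index stored with each cache line. The sketch records the line's extended tag in the PRB, since the commit step iterates over extended tags. The class and method names are hypothetical.

```python
from collections import deque


class WriteSharingPredictor:
    """Per-CPU predictor for stores involved in write-based sharing."""

    def __init__(self, entries=8):
        self.entries = entries
        self.rst = [None] * entries        # Recent Store Table: store PCs
        self.rst_next = 0                  # round-robin allocation pointer
        self.ipcl = deque(maxlen=entries)  # Invalidation PC List
        self.prb = deque(maxlen=entries)   # Push Required Buffer: etags
        self.line_rst_index = {}           # etag -> RST index (per-line field)

    def on_store(self, pc, etag):
        if pc in self.rst:
            index = self.rst.index(pc)
        else:
            index = self.rst_next
            self.rst[index] = pc           # may overwrite an older store PC
            self.rst_next = (index + 1) % self.entries
        self.line_rst_index[etag] = index
        if pc in self.ipcl and etag not in self.prb:
            self.prb.append(etag)          # predicted to be shared: push on commit

    def on_writeback_request(self, etag):
        """Another CPU requested a line this cache modified."""
        index = self.line_rst_index.get(etag)
        if index is None:
            return
        pc = self.rst[index]               # may be stale if the entry was reused
        if pc is not None and pc not in self.ipcl:
            self.ipcl.append(pc)

    def on_commit(self, cpu, num_cpus):
        """Return (etag, target CPU) pushes; the next cache is round-robin."""
        target = (cpu + 1) % num_cpus
        pushes = [(etag, target) for etag in self.prb]
        self.prb.clear()
        return pushes
```

The first time a store's line is requested by another CPU, its PC enters the IPCL; later stores from the same PC mark their lines for pushing, so the next thread finds the data in its own L1.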

Impact of Write-Based Technique (WB)
Data cache misses: 19.6% reduction
Execution time: 7.8% speedup
worth the cost of small additional hardware

Exploiting Strided Miss Patterns
• Hardware stride-prefetcher [Fu et al., Baer et al.]
– Each CPU has its own aggressive prefetcher
– Fully associative, 512 entries:
• PC, miss address, stride distance, state
– Issue 16 prefetches when stride is recognized
• Prefetches are throttled to avoid burst of traffic
• Prefetch from L2 to private caches
– To be fair, prefetches do not go beyond L2
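A simplified sketch of a PC-indexed stride prefetcher with the table size and prefetch degree listed above. It is not the cited designs: the per-entry state machine is reduced to "the same non-zero stride was seen twice in a row", replacement is assumed to be LRU, and throttling is not modeled. The class name `StridePrefetcher` is hypothetical.

```python
from collections import OrderedDict


class StridePrefetcher:
    """Fully associative, PC-indexed stride table with LRU replacement."""

    def __init__(self, entries=512, degree=16):
        self.entries = entries
        self.degree = degree
        self.table = OrderedDict()  # pc -> (last miss address, last stride)

    def on_miss(self, pc, addr):
        """Record a miss; return the list of addresses to prefetch (maybe empty)."""
        prefetches = []
        entry = self.table.pop(pc, None)
        if entry is None:
            stride = 0  # first miss for this PC: nothing to compare against
        else:
            last_addr, last_stride = entry
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                # Stride recognized: prefetch the next `degree` addresses.
                prefetches = [addr + stride * k
                              for k in range(1, self.degree + 1)]
        self.table[pc] = (addr, stride)      # re-insert as most recently used
        if len(self.table) > self.entries:
            self.table.popitem(last=False)   # evict the least recently used PC
        return prefetches
```

Three misses from the same PC at a constant distance are enough to trigger prefetching in this model: the second miss establishes the stride and the third confirms it.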

Impact of Strided Prefetching (ST)
Data cache misses: 10.3% reduction
Execution time: no significant impact
no good alone; complementary with other techniques?

Combining Techniques: Parallel Region Perf.
[Figure: bar chart, normalized to the baseline (100)]
Technique    Data cache misses    Execution time
WB/ST        72.4                 92.7
RB/ST        65.9                 93.6
RB/WB        61.8                 87.2
RB/WB/ST     57.3                 88.1
RB/WB/ST has fewest misses, but RB/WB performs best

Overall Program Speedup
[Figure: bar chart, percentage program speedup]
Configuration    % speedup
Float            9.2
Baseline         13.4
RB               16.7
WB               16.2
ST               13.0
RB/WB            18.9
RB/WB/ST         18.1
RB/WB further improves program performance by 5.5%

Impact of RB/WB on Scalability
[Figure: bar chart, normalized execution time for 2, 4, 6, and 8 CPUs, baseline vs. improved]
        Bzip2_comp            Vpr_place             Average (all benchmarks)
CPUs    baseline  improved    baseline  improved    baseline  improved
2       92.4      82.5        93.8      83.3        88.2      82.0
4       81.8      64.6        84.5      63.5        77.5      67.1
6       81.6      62.9        89.0      58.4        75.6      64.0
8       82.2      62.4        88.6      57.3        76.8      62.7
facilitates scaling

Summary
• Have a fixed processor for sequential regions
• Exploiting read-only sharing patterns (RB):
– Simple broadcasting for all load misses is effective
• No significant cache pollution
• Exploiting write-based sharing patterns (WB):
– Write-back/self-invalidate/push technique is effective
• Exploiting strided miss patterns (ST):
– Extra traffic overwhelms benefit of reduced misses
• RB/WB are complementary and perform best
– And dramatically improve the scalability of TLS
Improving cache locality is key for effective TLS

Backups

Ideal Caches
Ideal Caches Model (Parallel Region Performance)
[Figure: bar chart, normalized execution time]
Configuration                       Normalized execution time
Baseline                            100
Ideal instruction cache             99.9
Ideal data cache                    80.4
Ideal instruction and data cache    80.1

Parallel Region Cache Miss Breakdown
[Figure: breakdown of parallel-region cache misses into L2 misses, read-based sharing, write-based sharing, strided, and other]