Symbiotic Scheduling for Shared Caches in Multi-Core Systems Using Memory Footprint Signature
Mrinmoy Ghosh
Ripal Nathuji
Min Lee
Karsten Schwan
Hsien-Hsin S. Lee
ARM, Microsoft Research, Georgia Tech
Cache Interference in “Concurrent Processes”
[Diagram: Core A and Core B, each with a private L1 cache, share an L2 cache. Processes P1 and P2; their cache lines (P1 $ line, P2 $ line) are marked "Line Hit !!!" and "Conflict !!!".]
Cache Interference Effect (Concurrent Processes)
Maximum performance degradation less than 10%
[Figure: relative run time (y-axis, 0.96 to 1.10) for perlbench, gobmk, hmmer, soplex, povray, omnetpp, mcf, libquantum, astar, bwaves, sphinx3, and xalancbmk. Bar labels: mcf, libq, perl.]
Cache Interference in “Shared Cache Multi-Core”
[Diagram: Core A and Core B, each with a private L1 cache, share an L2 cache. Processes P1 and P2; their cache lines (P1 $ line, P2 $ line) are marked "Conflict !!!".]
Cache Interference Effect (Shared Cache Multi-Core)
Performance degraded by as much as 65%
[Figure: relative run time (y-axis, 0.0 to 1.8) for perlbench, gobmk, hmmer, soplex, povray, omnetpp, mcf, libquantum, astar, bwaves, sphinx3, and xalancbmk. Bar labels: lbm, libq, bwaves, mcf, soplex.]
Process (In-)Compatibility in Multi-Cores
• Problem
– Processes in different cores can be incompatible
– Shared resource contention
• Observation
– Less contention between incompatible processes when they run on the same core
• Insight
– Process incompatibility severely affects performance
– Compatibility-based scheduling increases throughput
Intelligent Process Management Needed !!
Ideas
• Use Counting Bloom Filter to record memory access signature
• Compatibility test using signature
Insertion: Counting Bloom Filter
[Diagram: N-bit data address A is hashed by two N-to-m hash functions, X and Y. Each hashed index selects a counter and a presence bit; both selected presence bits are 1.]
Insertion: Counting Bloom Filter
[Diagram: N-bit data address B is hashed by the same N-to-m hash functions, X and Y. Three presence bits are now 1, and one counter reads 2.]
Deletion: Counting Bloom Filter
[Diagram: data address A was evicted. The address is hashed by N-to-m hash functions X and Y to select the counters and presence bits to update.]
Query: Counting Bloom Filter
[Diagram: a query for data address A is hashed by N-to-m hash functions X and Y. One selected presence bit is 1 and the other is 0, so the result is "Data Not Present !!!".]
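The insertion, deletion, and query slides can be illustrated with a small software model. This is a minimal sketch, not the hardware design from the talk: the class name, the slot count, and the salted SHA-256 hashing (a stand-in for the N-to-m hash functions X and Y) are all assumptions made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Software model: one counter per slot; presence bit = (counter > 0)."""

    def __init__(self, num_slots=64, num_hashes=2):
        self.num_slots = num_slots      # hypothetical size, not from the slides
        self.num_hashes = num_hashes    # two hash functions, like X and Y
        self.counters = [0] * num_slots

    def _indices(self, address):
        # Stand-in for the N-to-m hash functions: salt a SHA-256 digest
        # with the hash number and reduce it to a slot index.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{address}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_slots

    def insert(self, address):
        # A line was brought into the cache: increment each selected counter.
        for i in self._indices(address):
            self.counters[i] += 1

    def delete(self, address):
        # A line was evicted: decrement each selected counter.
        for i in self._indices(address):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def presence_bits(self):
        # The presence bit of a slot is set while its counter is non-zero.
        return [1 if c > 0 else 0 for c in self.counters]

    def may_contain(self, address):
        # False means "definitely not present"; True may be a false positive.
        return all(self.counters[i] > 0 for i in self._indices(address))
```

A short usage example: after `cbf.insert(0xA000)` followed by `cbf.delete(0xA000)` on an otherwise empty filter, `cbf.may_contain(0xA000)` is `False`, which corresponds to the "Data Not Present" case on the query slide.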
Bloom Filter Signatures vs. Cache Footprint
Strong Correlation !!!
[Figure: cache footprint (axis ticks 0 to 3500) plotted against signature value (axis ticks 0 to 1700).]
Architectural Support
Bloom Filter Signature Multi-Core Architecture
[Diagram: Core A and Core B, each with a private L1 cache, share an L2 cache. Each core has a Core Filter and a Last Filter; the diagram also shows the Bloom Filter Counters. A second build of the slide adds processes P1, P2, and P3.]
Metric for Execution State
[Diagram: the Last Filter and the Core Filter are combined (operator drawn as "+") into the RBV (Running Bit Vector). Occupancy Weight is the number of 1s in the RBV.]
Interference Metric (Complement of Symbiosis)
[Diagram: a process pool (processes waiting to be scheduled) holds Proc0, Proc1, Proc2, and further processes. Proc1's RBV is combined with a Core Filter (operator drawn as "+"). In the example, Symbiosis = 5 and Interference Metric = N - 5.]
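The two metrics can be sketched in a few lines. The slides state only that occupancy weight is the number of 1s in the RBV and that the interference metric is N minus the symbiosis value; the transcript does not preserve which operator combines the RBV with the core filter. The sketch below assumes a bitwise XOR and assumes N is the length of the bit vectors. Both are assumptions, and the bit patterns are made up so that the symbiosis value comes out as 5, as in the slide example.

```python
def occupancy_weight(rbv):
    """Number of 1s in a running bit vector (list of 0/1 values)."""
    return sum(rbv)


def symbiosis(rbv, core_filter):
    """Count of 1s after combining a process RBV with a core filter.

    ASSUMPTION: the combining operator is bitwise XOR; the transcript
    only shows it as "+".
    """
    if len(rbv) != len(core_filter):
        raise ValueError("bit vectors must have the same length")
    return sum(a ^ b for a, b in zip(rbv, core_filter))


def interference_metric(rbv, core_filter):
    """Complement of symbiosis: N - symbiosis.

    ASSUMPTION: N is the bit-vector length.
    """
    return len(rbv) - symbiosis(rbv, core_filter)


# Hypothetical 8-bit vectors chosen so that symbiosis is 5.
proc1_rbv = [1, 0, 1, 1, 0, 0, 1, 0]
core_filter = [0, 1, 1, 0, 1, 0, 0, 0]
print(occupancy_weight(proc1_rbv))                 # 4
print(symbiosis(proc1_rbv, core_filter))           # 5
print(interference_metric(proc1_rbv, core_filter)) # 3
```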
Process-to-Core Mapping Algorithms
• A1: Use Occupancy Weight
• A2: Use Interference Graph
• A3: Use Weighted Interference Graph
A1: Weight Sorted Algorithm
• Sort all processes according to occupancy weight
• Processes form groups using sorted weight
– # of processes in a group = Processes/Cores
• Map processes to cores based on sorting results
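[Example: processes sorted by occupancy weight: P0 (100), P4 (99), P2 (70), P5 (65), P6 (43), P3 (20), P1 (15). They are mapped onto Core A, Core B, Core C, and Core D, each with a private L1 cache.]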
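The three A1 steps can be sketched as follows. The slides do not say whether the sort is ascending or descending, how Processes/Cores is rounded, or which group goes to which core, so this sketch assumes a descending sort, rounds the group size up, and assigns consecutive groups to consecutive cores. The function name is hypothetical.

```python
import math


def weight_sorted_mapping(weights, cores):
    """Map processes to cores by occupancy weight.

    weights: dict of process name -> occupancy weight
    cores:   list of core names
    """
    # Step 1: sort processes by occupancy weight (ASSUMPTION: descending).
    ordered = sorted(weights, key=weights.get, reverse=True)
    # Step 2: group size = Processes/Cores (ASSUMPTION: rounded up).
    group_size = math.ceil(len(ordered) / len(cores))
    # Step 3: consecutive groups go to consecutive cores (ASSUMPTION).
    mapping = {core: [] for core in cores}
    for rank, proc in enumerate(ordered):
        mapping[cores[rank // group_size]].append(proc)
    return mapping


# Occupancy weights taken from the slide example.
weights = {"P0": 100, "P4": 99, "P2": 70, "P5": 65, "P6": 43, "P3": 20, "P1": 15}
print(weight_sorted_mapping(weights, ["A", "B", "C", "D"]))
# {'A': ['P0', 'P4'], 'B': ['P2', 'P5'], 'C': ['P6', 'P3'], 'D': ['P1']}
```

Under these assumptions, processes with similar occupancy weight share a core and therefore never occupy the shared cache at the same time.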
A2: Interference Graph Algorithm
• Form interference graph using interference metric
• Find MAX-CUT of the graph
[Example, built up over four slides: P0 (CA=20, CB=30), P1 (CA=10, CB=45), P2 (CA=40, CB=25), P3 (CA=15, CB=50). P0 and P1 were in Core A; P2 and P3 were in Core B. Interference graph: the first edge combines 30 and 40 into a weight of 70; the complete graph has edge weights 70, 60, 30, 75, 45, and 85. The last slide shows P1(A), P3(B), P0(A), P2(B) with edges 85 and 45.]
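The A2 example can be reproduced in code. The edge weights on the slides are consistent with one rule: the weight between two processes is the first process's interference metric against the core the second last ran on, plus the reverse. That rule is inferred from the numbers, not stated in the transcript. The MAX-CUT search below is a brute-force stand-in that is only practical for a handful of processes; the slides do not say how the cut is computed or how its two sides are turned into a core assignment.

```python
from itertools import combinations


def build_interference_graph(metric, last_core):
    """Return {(p, q): weight} for every pair of processes.

    metric[p][c]: interference metric of process p against core c's filter
    last_core[p]: core that p last ran on
    INFERRED RULE: weight(p, q) = metric[p][last_core[q]] + metric[q][last_core[p]]
    """
    edges = {}
    for p, q in combinations(sorted(metric), 2):
        edges[(p, q)] = metric[p][last_core[q]] + metric[q][last_core[p]]
    return edges


def max_cut(nodes, edges):
    """Exhaustive MAX-CUT: returns (cut weight, side one, side two)."""
    nodes = sorted(nodes)
    first, rest = nodes[0], nodes[1:]
    best_weight, best_side = -1, frozenset()
    # Pin the first node to one side so mirror-image cuts are not revisited.
    for r in range(len(rest) + 1):
        for chosen in combinations(rest, r):
            side = frozenset((first,) + chosen)
            weight = sum(w for (p, q), w in edges.items()
                         if (p in side) != (q in side))
            if weight > best_weight:
                best_weight, best_side = weight, side
    return best_weight, best_side, frozenset(nodes) - best_side


# Values from the slide example.
metric = {
    "P0": {"A": 20, "B": 30},
    "P1": {"A": 10, "B": 45},
    "P2": {"A": 40, "B": 25},
    "P3": {"A": 15, "B": 50},
}
last_core = {"P0": "A", "P1": "A", "P2": "B", "P3": "B"}
edges = build_interference_graph(metric, last_core)
print(edges)
print(max_cut(metric, edges))
```

The resulting edge weights (30, 70, 45, 85, 60, 75) match the six weights shown on the slides.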
A3: Weighted Interference Graph Algorithm
• To address high interference issues
• Weight the edges of the interference graph
• The rest is the same as A2
[Example: P0 (OW=90, CA=20, CB=30), P1 (OW=85, CA=10, CB=45), P2 (OW=50, CA=40, CB=25), P3 (OW=100, CA=15, CB=50). P0 and P1 were in Core A; P2 and P3 were in Core B. Interference graph: the edge annotations shown are 90*30 (P0's OW times its CB) and 50*40 (P2's OW times its CA).]
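The A3 edge weighting can be sketched from the slide annotations. The products 90*30 and 50*40 come from the slide; adding the two products into a single edge weight is an assumption made by analogy with A2, where 30 and 40 combine into 70. The function name is hypothetical.

```python
from itertools import combinations


def build_weighted_graph(ow, metric, last_core):
    """Return {(p, q): weight} with each term scaled by occupancy weight.

    ow[p]:        occupancy weight (OW) of process p
    metric[p][c]: interference metric of process p against core c's filter
    last_core[p]: core that p last ran on
    ASSUMPTION: the two scaled terms are summed, as in the unweighted graph.
    """
    edges = {}
    for p, q in combinations(sorted(metric), 2):
        edges[(p, q)] = (ow[p] * metric[p][last_core[q]]
                         + ow[q] * metric[q][last_core[p]])
    return edges


# Values from the slide example.
ow = {"P0": 90, "P1": 85, "P2": 50, "P3": 100}
metric = {
    "P0": {"A": 20, "B": 30},
    "P1": {"A": 10, "B": 45},
    "P2": {"A": 40, "B": 25},
    "P3": {"A": 15, "B": 50},
}
last_core = {"P0": "A", "P1": "A", "P2": "B", "P3": "B"}
edges = build_weighted_graph(ow, metric, last_core)
print(edges[("P0", "P2")])  # 90*30 + 50*40 = 4700
```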
Performance Evaluation
Evaluation Methodology
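[Diagram: three stages. (1) Gather footprint in emulator: processes P1, P2, P3, ..., PN run on Fedora Linux on Simics x86, with a “magic” interface. (2) Process-to-core mapping. (3) Runs on Intel Core 2: a native x86 run of P1, P2, P3, ..., PN, and a VM run of P1, P2, ..., PN on Linux guests over the Xen hypervisor.]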
Performance Results
[Figure: two per-benchmark charts (y-axes 0% to 60% and 0% to 25%) for astar, gobmk, hmmer, lbm, libquantum, mcf, omnetpp, perlbench, povray, soplex, sphinx, and xalancbmk.]
Maximum performance improvement of up to 54%
Average performance improvement of up to 23%
Performance of Virtualized Systems
Maximum performance improvement of up to 26%
Average performance improvement of up to 9.5%
[Figure: per-benchmark chart (y-axis 0% to 10%) for astar, gobmk, hmmer, lbm, libquantum, mcf, omnetpp, perlbench, povray, soplex, sphinx, and xalancbmk.]
Performance Sensitivity of 3 Algorithms
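[Figure: performance benefit (y-axis, 0% to 16%) per application mix (x-axis) for the three algorithms: Sorted, Graph, and Weighted Graph. Application mixes: mcf + gobmk + povray + omnetpp; mcf + hmmer + libquantum + omnetpp; perlbench + gobmk + libquantum + omnetpp; gobmk + hmmer + libquantum + povray; mcf + hmmer + libquantum + povray.]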
Weighted Interference Graph has the best performance
Conclusion
Shared Resource (e.g., LLC) Management is Critical
Process Scheduling using Compatibility in Multi-Core
Capturing Cache Reference Behavior for Processes
Symbiotic Scheduling with Bloom Filter Signature
Measured Speedup of 22% (up to 54%) on Intel Core 2
That’s All, Folks!
Georgia Tech ECE MARS Lab, http://arch.ece.gatech.edu