Symbiotic Scheduling for Shared Caches in Multi-Core Systems Using Memory Footprint Signature

Mrinmoy Ghosh

Ripal Nathuji

Min Lee

Karsten Schwan

Hsien-Hsin S. Lee

ARM, Microsoft Research, Georgia Tech

Cache Interference in “Concurrent Processes”

[Figure: Core A and Core B, each with an L1 cache, share an L2 cache. Labels: P1, P2, P1 $ Line, P2 $ Line, Line Hit !!!, Conflict !!!]

Cache Interference Effect (Concurrent Processes)

Maximum performance degradation less than 10%

[Figure: relative run time (y-axis, 0.96 to 1.10) per benchmark: perlbench, gobmk, hmmer, soplex, povray, omnetpp, mcf, libquantum, astar, bwaves, sphinx3, xalancbmk. Data points are labeled mcf, libq, or perl.]

Cache Interference in “Shared Cache Multi-Core”

[Figure: Core A and Core B, each with an L1 cache, share an L2 cache. Labels: P1, P2, P1 $ Line, P2 $ Line, Conflict !!!]

Cache Interference Effect (Shared Cache Multi-Core)

Performance degraded by as much as 65%

[Figure: relative run times (y-axis, 0.0 to 1.8) per benchmark: perlbench, gobmk, hmmer, soplex, povray, omnetpp, mcf, libquantum, astar, bwaves, sphinx3, xalancbmk. Data points are labeled lbm, libq, bwaves, mcf, or soplex.]

Intelligent Process Management Needed !!

• Problem
– Processes in different cores can be incompatible
– Shared resource contention

• Observation
– Less contention of incompatible processes when running on the same core

• Insight
– Process incompatibility severely affects performance
– Compatibility-based scheduling increases throughput

Process (In-)Compatibility in Multi-Cores


Ideas

• Use Counting Bloom Filter to record memory access signature

• Compatibility test using signature

Insertion: Counting Bloom Filter

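[Figure: an N-bit data address A feeds two N-to-m hash functions, X and Y. Each hash output indexes a counter with an associated presence bit; the two indexed presence bits are set to 1.]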

Insertion: Counting Bloom Filter

[Figure: an N-bit data address B feeds the same N-to-m hash functions, X and Y. Three presence bits are now 1, and one counter reads 2.]

Deletion: Counting Bloom Filter

[Figure: data address A was evicted, and the entries indexed by the N-to-m hash functions X and Y are updated. Two presence bits are shown as 1.]

Query: Counting Bloom Filter

[Figure: a query for data address A passes through the N-to-m hash functions X and Y. One indexed presence bit is 1 and the other is 0.]

Data Not Present !!!
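The insert, delete, and query steps in the slides above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design from the talk: the two hash functions (`_hash_x`, `_hash_y`) and the table size are hypothetical stand-ins, since the slides only name them "N-to-m Hash Func X" and "N-to-m Hash Func Y". The presence bit of an entry is taken to be 1 whenever its counter is non-zero.

```python
class CountingBloomFilter:
    """Counting Bloom filter with two hash functions (illustrative sketch)."""

    def __init__(self, m_bits=10):
        self.m_bits = m_bits
        self.size = 1 << m_bits            # 2^m counters
        self.mask = self.size - 1
        self.counters = [0] * self.size

    # Hypothetical hash functions; the slides do not define X and Y.
    def _hash_x(self, addr):
        return addr & self.mask                            # low m bits

    def _hash_y(self, addr):
        return ((addr >> self.m_bits) ^ addr) & self.mask  # folded XOR

    def _indices(self, addr):
        # A set, so an address whose two hashes collide is counted once.
        return {self._hash_x(addr), self._hash_y(addr)}

    def insert(self, addr):
        for i in self._indices(addr):
            self.counters[i] += 1

    def delete(self, addr):
        # Called when the line holding `addr` is evicted from the cache.
        for i in self._indices(addr):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def presence_bit(self, i):
        return 1 if self.counters[i] > 0 else 0

    def query(self, addr):
        # Present only if every indexed presence bit is 1.
        return all(self.presence_bit(i) for i in self._indices(addr))

    def signature(self):
        # The presence-bit vector.
        return [self.presence_bit(i) for i in range(self.size)]


cbf = CountingBloomFilter(m_bits=4)
A, B = 0x12, 0x23          # A and B share one counter under these hashes
cbf.insert(A)
cbf.insert(B)
cbf.delete(A)              # data address A was evicted
print(cbf.query(A), cbf.query(B))   # False True
```

Because the counters track how many resident lines map to each entry, evicting A clears only the entry that A alone occupied, while the entry it shared with B stays set.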

Bloom Filter Signatures vs. Cache Footprint

Strong Correlation !!!

[Figure: plot of cache footprint against signature value; axis ticks run from 0 to 3500 and from 0 to 1700.]

Architectural Support

Bloom Filter Signature Multi-Core Architecture

[Figure: Core A and Core B, each with an L1 cache, share an L2 cache. Each core has a Core Filter and a Last Filter, shown alongside the Bloom filter counters. A second build of the slide adds processes P1, P2, and P3.]

Metric for Execution State

[Figure: Last Filter, Core Filter, a "+" operator, and the RBV (Running Bit Vector).]

Occupancy Weight (i.e., # of 1s)

Interference Metric (Complement of Symbiosis)

[Figure: a process pool (processes waiting to be scheduled) holds Proc0, Proc1, Proc2, Proc*, Proc**. Proc1's RBV is combined with a Core Filter through a "+" operator.]

Symbiosis = 5

Interference Metric = N - 5
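The metrics on this slide can be illustrated with plain integers used as bit vectors. The sketch below rests on one loud assumption: the "+" operator in the figure is read as a bitwise XOR between a process's RBV and a Core Filter, with symbiosis being the number of 1s in the result. The slide text alone does not confirm this reading. The bit patterns and the vector length `N_BITS` are made up so that the symbiosis comes out as 5, matching the slide.

```python
def popcount(bits):
    """Number of 1s in a non-negative integer used as a bit vector."""
    return bin(bits).count("1")


def occupancy_weight(rbv):
    # Occupancy weight: the number of 1s in the running bit vector.
    return popcount(rbv)


def symbiosis(rbv, core_filter):
    # ASSUMPTION: the slide's "+" is a bitwise XOR of RBV and Core Filter.
    return popcount(rbv ^ core_filter)


def interference(rbv, core_filter, n_bits):
    # Interference metric is the complement of symbiosis: N - symbiosis.
    return n_bits - symbiosis(rbv, core_filter)


N_BITS = 8                     # hypothetical vector length N
proc1_rbv = 0b10110100         # hypothetical RBV of Proc1
core_filter = 0b01001100       # hypothetical Core Filter of another core

print(occupancy_weight(proc1_rbv))                  # 4
print(symbiosis(proc1_rbv, core_filter))            # 5
print(interference(proc1_rbv, core_filter, N_BITS)) # 3
```

Under this reading, the more bits the two vectors share, the fewer 1s survive the XOR and the higher the interference metric.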

Process-to-Core Mapping Algorithms

• A1: Use Occupancy Weight

• A2: Use Interference Graph

• A3: Use Weighted Interference Graph

A1: Weight Sorted Algorithm

• Sort all processes according to occupancy weight
• Processes form groups using sorted weight
– # of processes in a group = Processes/Cores
• Map processes to cores based on sorting results

[Figure: processes listed by occupancy weight (P0 = 100, P4 = 99, P2 = 70, P5 = 65, P6 = 43, P3 = 20, P1 = 15) above Core A, Core B, Core C, and Core D, each with an L1 cache.]
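The three A1 steps can be sketched as follows, using the occupancy weights from the figure. Several details are assumptions rather than statements from the slide: the sort is descending, the group size Processes/Cores is rounded up when it does not divide evenly, and each group is assigned to one core. The function name `weight_sorted_mapping` is hypothetical.

```python
import math


def weight_sorted_mapping(weights, cores):
    """Map processes to cores by sorted occupancy weight.

    weights: dict of process name -> occupancy weight
    cores:   list of core names
    """
    # Step 1: sort all processes by occupancy weight (descending, assumed).
    ordered = sorted(weights, key=weights.get, reverse=True)
    # Step 2: group size = Processes / Cores (rounded up, assumed).
    group_size = math.ceil(len(ordered) / len(cores))
    # Step 3: consecutive groups go to consecutive cores (assumed).
    mapping = {}
    for g, core in enumerate(cores):
        mapping[core] = ordered[g * group_size:(g + 1) * group_size]
    return mapping


weights = {"P0": 100, "P1": 15, "P2": 70, "P3": 20,
           "P4": 99, "P5": 65, "P6": 43}
mapping = weight_sorted_mapping(weights, ["A", "B", "C", "D"])
for core, procs in mapping.items():
    print(core, procs)
```

With seven processes and four cores, the last core receives the single leftover process under the round-up assumption.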

A2: Interference Graph Algorithm

• Form interference graph using interference metric
• Find MAX-CUT of the graph

Process  Was in  CA  CB
P0       CA      20  30
P1       CA      10  45
P2       CB      40  25
P3       CB      15  50

[Figure: interference graph over the nodes P0(A), P1(A), P2(B), and P3(B). The edge between P0(A) and P2(B) combines P0's CB value (30) with P2's CA value (40) into a weight of 70. The complete graph carries the edge weights 70, 60, 30, 75, 45, and 85. The final build lists the nodes as P1(A), P3(B), P0(A), P2(B) and shows the weights 85 and 45.]
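The graph construction in this example can be reproduced with a short sketch. The edge rule used here is inferred from the numbers on the slide (P0's CB of 30 plus P2's CA of 40 gives 70): each edge adds the interference of one process with the other's core to the reverse term. Applied to all pairs, it yields exactly the six weights shown. The exhaustive `max_cut` search is a stand-in for whatever MAX-CUT procedure the authors use, and how the resulting cut is turned into a process-to-core mapping is not spelled out in the slide, so the sketch stops at reporting the partition.

```python
from itertools import combinations

# Interference metric of each process against core A (CA) and core B (CB),
# plus the core it last ran on. Values are taken from the slide.
METRICS = {
    "P0": {"A": 20, "B": 30, "was_in": "A"},
    "P1": {"A": 10, "B": 45, "was_in": "A"},
    "P2": {"A": 40, "B": 25, "was_in": "B"},
    "P3": {"A": 15, "B": 50, "was_in": "B"},
}


def edge_weight(p, q):
    """Interference of p with q's core plus interference of q with p's core."""
    return (METRICS[p][METRICS[q]["was_in"]]
            + METRICS[q][METRICS[p]["was_in"]])


def build_graph(nodes):
    """Complete graph as {frozenset({p, q}): weight}."""
    return {frozenset(pair): edge_weight(*pair)
            for pair in combinations(nodes, 2)}


def cut_weight(graph, side):
    """Total weight of edges with exactly one endpoint in `side`."""
    return sum(w for pair, w in graph.items() if len(pair & side) == 1)


def max_cut(graph, nodes):
    """Exhaustive MAX-CUT; only practical for a handful of processes."""
    first, rest = nodes[0], nodes[1:]   # pin one node to skip mirror cuts
    best_side, best_weight = None, -1
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            side = frozenset((first,) + extra)
            weight = cut_weight(graph, side)
            if weight > best_weight:
                best_side, best_weight = side, weight
    return best_side, best_weight


nodes = sorted(METRICS)
graph = build_graph(nodes)
for pair, weight in sorted(graph.items(), key=lambda kv: sorted(kv[0])):
    print("-".join(sorted(pair)), weight)

side, weight = max_cut(graph, nodes)
print(sorted(side), weight)
```

Exhaustive search grows exponentially with the number of processes, so a real scheduler would need a heuristic; the brute-force version is used here only because it is easy to check by hand.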

A3: Weighted Interference Graph Algorithm

• To address high interference issues
• Weight the edges of the interference graph
• The rest are the same as A2

Process  Was in  OW   CA  CB
P0       CA      90   20  30
P1       CA      85   10  45
P2       CB      50   40  25
P3       CB      100  15  50

[Figure: interference graph over the nodes P0(A), P1(A), P2(B), and P3(B). The edge between P0(A) and P2(B) carries the weighted terms 90*30 and 50*40.]
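The edge weighting in this example can be sketched as below. Each endpoint's interference term is multiplied by that endpoint's occupancy weight (OW), which matches the two products visible in the figure (90*30 for P0 and 50*40 for P2). Summing the two products into a single edge weight is an assumption carried over from the unweighted graph; the slide shows only the individual terms. The function name `weighted_edge` is hypothetical.

```python
# Occupancy weight (OW), interference against core A (CA) and core B (CB),
# and the core each process last ran on. Values are taken from the slide.
PROCS = {
    "P0": {"ow": 90,  "A": 20, "B": 30, "was_in": "A"},
    "P1": {"ow": 85,  "A": 10, "B": 45, "was_in": "A"},
    "P2": {"ow": 50,  "A": 40, "B": 25, "was_in": "B"},
    "P3": {"ow": 100, "A": 15, "B": 50, "was_in": "B"},
}


def weighted_edge(p, q):
    """OW-scaled interference of p with q's core, plus the reverse term."""
    a, b = PROCS[p], PROCS[q]
    term_p = a["ow"] * a[b["was_in"]]   # e.g. P0: 90 * CB(30)
    term_q = b["ow"] * b[a["was_in"]]   # e.g. P2: 50 * CA(40)
    # ASSUMPTION: the two terms are summed, as in the unweighted graph.
    return term_p + term_q


print(weighted_edge("P0", "P2"))   # 90*30 + 50*40 = 4700
```

Scaling by occupancy weight makes edges that touch cache-hungry processes heavier, so those processes dominate the partitioning decision.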


Performance Evaluation

Evaluation Methodology

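[Figure: evaluation flow in three parts.
Gather Footprint in Emulator: processes P1, P2, P3, ..., PN run on Fedora Linux on Simics x86; a "magic" interface leads to the process-to-core mapping.
Native x86 Run: processes P1, P2, P3, ..., PN run on an Intel Core 2.
VM Run: processes P1, P2, ..., PN each run on Linux on the Xen hypervisor on an Intel Core 2.]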

Performance Results

[Figure: two charts of performance improvement per benchmark (astar, gobmk, hmmer, lbm, libquantum, mcf, omnetpp, perlbench, povray, soplex, sphinx, xalancbmk), one with a y-axis from 0% to 60% and one with a y-axis from 0% to 25%.]

Maximum performance improvement of up to 54%

Average performance improvement of up to 23%

Performance of Virtualized Systems

Maximum performance improvement of up to 26%

Average performance improvement of up to 9.5%

[Figure: performance improvement (y-axis, 0% to 10%) per benchmark: astar, gobmk, hmmer, lbm, libquantum, mcf, omnetpp, perlbench, povray, soplex, sphinx, xalancbmk.]

Performance Sensitivity of 3 Algorithms

[Figure: performance benefit (y-axis, 0% to 16%) per application mix for the Sorted, Graph, and Weighted Graph algorithms. Application mixes: mcf + gobmk + povray + omnetpp; mcf + hmmer + libquantum + omnetpp; perlbench + gobmk + libquantum + omnetpp; gobmk + hmmer + libquantum + povray; mcf + hmmer + libquantum + povray.]

Weighted Interference Graph has the best performance

Conclusion


Shared Resource (e.g., LLC) Management is Critical

Process Scheduling using Compatibility in Multi-Core

Capturing Cache Reference Behavior for Processes

Symbiotic Scheduling with Bloom Filter Signature

Measured Speedup of 22% (up to 54%) on Intel Core 2


That’s All, Folks!

Georgia Tech ECE MARS Lab
http://arch.ece.gatech.edu
