thesis defense - carnegie mellon university · thesis defense lavanya subramanian 1 committee:...

Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Thesis Defense Lavanya Subramanian 1 Committee: Advisor: Onur Mutlu Greg Ganger James Hoe Ravi Iyer (Intel)


Post on 10-Aug-2020


Page 1: Thesis Defense - Carnegie Mellon University · Thesis Defense Lavanya Subramanian 1 Committee: Advisor: Onur Mutlu Greg Ganger James Hoe Ravi Iyer (Intel) The Multicore Era 2 Main

Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management

Thesis Defense

Lavanya Subramanian

Committee:

Advisor: Onur Mutlu

Greg Ganger

James Hoe

Ravi Iyer (Intel)

Page 2: The Multicore Era

Diagram: Core, Cache, Main Memory

Page 3: The Multicore Era

Diagram: four cores connected through an interconnect to a shared cache and main memory

Multiple applications execute in parallel

High throughput and efficiency

Page 4: Challenge: Interference at Shared Resources

Diagram: four cores connected through an interconnect to a shared cache and main memory

Page 5: Impact of Shared Resource Interference

Figure: two bar charts of Slowdown (y-axis, 0 to 6). Left panel: leslie3d (core 0) run with gcc (core 1). Right panel: leslie3d (core 0) run with mcf (core 1).

1. High application slowdowns

2. Unpredictable application slowdowns

Page 6: Why Predictable Performance?

• There is a need for predictable performance
– When multiple applications share resources
– Especially if some applications require performance guarantees

• Example 1: In server systems
– Different users’ jobs consolidated onto the same server
– Need to provide bounded slowdowns to critical jobs

• Example 2: In mobile systems
– Interactive applications run with non-interactive applications
– Need to guarantee performance for interactive applications

Page 7: Thesis Statement

High and predictable performance can be achieved in multicore systems through simple/implementable mechanisms to mitigate and quantify shared resource interference

Goals: High and predictable performance

Approaches: Mitigate and quantify shared resource interference

Page 8: Goals and Approaches

Goals:
1. High Performance
2. Predictable Performance

Approaches: Mitigate Interference, Quantify Interference

Page 9: Focus Shared Resources in This Thesis

Diagram: four cores connected through an interconnect to a shared cache and main memory. The resources in focus are Shared Cache Capacity and Main Memory Bandwidth.

Page 10: Related Prior Work

Mitigate Interference, Cache Capacity: CQoS (ICS ’04), UCP (MICRO ’06), DIP (ISCA ’07), DRRIP (ISCA ’10), EAF (PACT ’12). Much explored. Not our focus.

Mitigate Interference, Memory Bandwidth: STFM (MICRO ’07), PARBS (ISCA ’08), ATLAS (HPCA ’10), TCM (MICRO ’11), Criticality-aware (ISCA ’13). Challenge: High complexity

Quantify Interference: PCASA (DATE ’12), Ubik (ASPLOS ’14). Goal: Meet resource allocation target. Not our focus.

Quantify Interference: STFM (MICRO ’07). Challenge: High inaccuracy

Quantify Interference: FST (ASPLOS ’10), PTCA (TACO ’13). Challenge: High inaccuracy

Page 11: Outline

Mitigate Interference, Cache Capacity: Much explored. Not our focus.

Mitigate Interference, Memory Bandwidth: Blacklisting Memory Scheduler

Quantify Interference, Memory Bandwidth: Memory Interference-induced Slowdown Estimation Model and its uses

Quantify Interference, Cache Capacity and Memory Bandwidth: Application Slowdown Model and its uses

Page 12: Outline

Page 13: Background: Main Memory

• FR-FCFS Memory Scheduler [Zuravleff and Robinson, US Patent ’97; Rixner et al., ISCA ’00]
– Row-buffer hit first
– Older request first

• Unaware of inter-application interference

Diagram: a memory controller connected over a channel to Banks 0 to 3. Each bank is an array of rows and columns with its own row buffer. A request is either a row-buffer hit or a row-buffer miss.
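The two FR-FCFS rules above can be sketched as a single selection function. This is a minimal illustration, not the patented design: the `Request` type, the `open_rows` map (bank to currently open row), and the function name are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    arrival: int  # arrival time in cycles; smaller means older
    bank: int
    row: int


def frfcfs_pick(queue: List[Request], open_rows: Dict[int, int]) -> Optional[Request]:
    """Pick the next request: row-buffer hits first, then the oldest request."""
    if not queue:
        return None
    # The key sorts hits (False) before misses (True), then by age.
    return min(queue, key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))


# Hypothetical example: the younger request hits the open row, so it goes first.
queue = [Request(arrival=0, bank=0, row=5), Request(arrival=1, bank=0, row=7)]
chosen = frfcfs_pick(queue, open_rows={0: 7})
```

Note that nothing in the selection key refers to the requesting application, which is why the policy is unaware of inter-application interference.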

Page 14: Tackling Inter-Application Interference: Application-aware Memory Scheduling

Diagram: a Monitor step ranks the applications and an Enforce Ranks step compares the application ID (AID) of every entry in the request buffer against the highest-ranked AID.

Full ranking increases critical path latency and area significantly to improve performance and fairness

Page 15: Performance vs. Fairness vs. Simplicity

Figure: schedulers compared on three axes (Performance, Fairness, Simplicity) against an Ideal reference. App-unaware: FRFCFS. App-aware (Ranking): PARBS, ATLAS, TCM. Our Solution (No Ranking): Blacklisting.

App-unaware: Very simple. Low performance and fairness.

App-aware (Ranking): Complex

Is it essential to give up simplicity to optimize for performance and/or fairness?

Our solution achieves all three goals

Page 16: Problems with Previous Application-aware Memory Schedulers

1. Full ranking increases hardware complexity

2. Full ranking causes unfair slowdowns

Our Goal: Design a memory scheduler with Low Complexity, High Performance, and Fairness

Page 17: Key Observation 1: Group Rather Than Rank

Observation 1: Sufficient to separate applications into two groups, rather than do full ranking

Diagram: Monitor and Rank produces a full order of the applications. Monitor and Group produces two groups instead, with Vulnerable > Interference Causing.

Benefit 1: Low complexity compared to ranking

Benefit 2: Lower slowdowns than ranking

Page 18: Key Observation 1: Group Rather Than Rank

Observation 1: Sufficient to separate applications into two groups, rather than do full ranking

How to classify applications into groups?

Page 19: Key Observation 2

Observation 2: Serving a large number of consecutive requests from an application causes interference

Basic Idea:
• Group applications with a large number of consecutive requests as interference-causing → Blacklisting
• Deprioritize blacklisted applications
• Clear blacklist periodically (1000s of cycles)

Benefits:
• Lower complexity
• Finer grained grouping decisions → Lower unfairness

Page 20: The Blacklisting Memory Scheduler (ICCD ’14)

Diagram: the memory controller keeps the AID of the last scheduled request (Last Req AID), a count of consecutive requests from that AID (# Consecutive Requests), and one blacklist bit per AID. Each entry in the request buffer carries the blacklist bit of its application.

1. Monitor

2. Blacklist

3. Prioritize

4. Clear Periodically

Simple and scalable design
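The four steps (monitor, blacklist, prioritize, clear periodically) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the published hardware design: the threshold of 4, the clearing interval of 10,000 cycles, the counter reset after blacklisting, and the tie-breaking order (non-blacklisted first, then row-buffer hit, then older) are all choices made for the example, as are the class and field names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Request:
    arrival: int  # arrival time in cycles; smaller means older
    aid: int      # application ID
    bank: int
    row: int


@dataclass
class BlacklistingScheduler:
    threshold: int = 4            # assumed streak length that triggers blacklisting
    clear_interval: int = 10_000  # assumed clearing period, in cycles
    last_aid: Optional[int] = None
    streak: int = 0
    last_clear: int = 0
    blacklist: Set[int] = field(default_factory=set)

    def pick(self, queue: List[Request], open_rows: Dict[int, int],
             cycle: int) -> Optional[Request]:
        # 4. Clear periodically.
        if cycle - self.last_clear >= self.clear_interval:
            self.blacklist.clear()
            self.last_clear = cycle
        if not queue:
            return None
        # 3. Prioritize: non-blacklisted first, then row-buffer hit, then older.
        req = min(queue, key=lambda r: (r.aid in self.blacklist,
                                        open_rows.get(r.bank) != r.row,
                                        r.arrival))
        queue.remove(req)
        # 1. Monitor: count consecutive requests from the same application.
        if req.aid == self.last_aid:
            self.streak += 1
        else:
            self.last_aid, self.streak = req.aid, 1
        # 2. Blacklist: a long streak marks the application as interference-causing.
        if self.streak >= self.threshold:
            self.blacklist.add(req.aid)
            self.streak = 0
        return req


# Hypothetical workload: application 1 streams row hits, application 2 has one miss.
queue = [Request(arrival=t, aid=1, bank=0, row=1) for t in range(5)]
queue.append(Request(arrival=5, aid=2, bank=1, row=9))
sched = BlacklistingScheduler()
order = []
cycle = 0
while queue:
    cycle += 1
    order.append(sched.pick(queue, {0: 1}, cycle).aid)
```

In this run application 1 is blacklisted after four consecutive requests, so application 2's request is served before application 1's fifth. The per-request state is a single bit per application rather than a full rank, which is the source of the simplicity claimed on the slide.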

Page 21: Methodology

• Configuration of our simulated baseline system
– 24 cores
– 4 channels, 8 banks/channel
– DDR3 1066 DRAM
– 512 KB private cache/core

• Workloads
– SPEC CPU2006, TPC-C, Matlab, NAS
– 80 multiprogrammed workloads

Page 22: Performance and Fairness

Figure: Unfairness (y-axis, lower is better) against Performance (x-axis, higher is better) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting. Annotations: 5%, 21%.

1. Blacklisting achieves the highest performance

2. Blacklisting balances performance and fairness

Page 23: Complexity

Figure: Scheduler Area (sq. um, y-axis, 0 to 120000) against Critical Path Latency (ns, x-axis, 0 to 12) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting. Annotations: 43%, 70%.

Blacklisting reduces complexity significantly

Page 24: Outline

Page 25: Impact of Interference on Performance

Diagram: two timelines of execution time, Alone (No interference) and Shared (With interference). The difference between them is the Impact of Interference.

Page 26: Slowdown: Definition

$$\text{Slowdown} = \frac{\text{Performance}_{\text{Alone}}}{\text{Performance}_{\text{Shared}}}$$

Page 27: Impact of Interference on Performance

Diagram: two timelines of execution time, Alone (No interference) and Shared (With interference). The difference between them is the Impact of Interference.

Previous Approach: Estimate impact of interference at a per-request granularity

Difficult to estimate due to request overlap

Page 28: Outline

Page 29: Observation: Request Service Rate is a Proxy for Performance

For a memory bound application, Performance ∝ Memory request service rate

Figure: Normalized Performance (y-axis, 0.3 to 1) against Normalized Request Service Rate (x-axis, 0.3 to 1) for omnetpp, mcf, and astar, measured on an Intel Core i7 with 4 cores.

Difficult:

$$\text{Slowdown} = \frac{\text{Performance}_{\text{Alone}}}{\text{Performance}_{\text{Shared}}}$$

Easy:

$$\text{Slowdown} = \frac{\text{Request Service Rate}_{\text{Alone}}}{\text{Request Service Rate}_{\text{Shared}}}$$

Page 30: Observation: Highest Priority Enables Request Service Rate Alone Estimation

Request Service Rate Alone (RSRAlone) of an application can be estimated by giving the application highest priority at the memory controller

Highest priority → Little interference (almost as if the application were run alone)

Page 31: Observation: Highest Priority Enables Request Service Rate Alone Estimation

Diagram: request buffer state and service order at main memory, in time units, for three cases:

1. Run alone

2. Run with another application

3. Run with another application: highest priority

Page 32: Memory Interference-induced Slowdown Estimation (MISE) model for memory bound applications

$$\text{Slowdown} = \frac{\text{Request Service Rate}_{\text{Alone}}\ (\text{RSR}_{\text{Alone}})}{\text{Request Service Rate}_{\text{Shared}}\ (\text{RSR}_{\text{Shared}})}$$

Page 33: Observation: Memory Bound vs. Non-Memory Bound

• Memory-bound application

Diagram: timelines of Compute Phase and Memory Phase (requests), with No interference and With interference.

Memory phase slowdown dominates overall slowdown

Page 34: Observation: Memory Bound vs. Non-Memory Bound

• Non-memory-bound application

Diagram: timelines of Compute Phase and Memory Phase, with No interference and With interference.

Only memory fraction ($\alpha$) slows down with interference

$$\text{Slowdown} = (1 - \alpha) + \alpha \, \frac{\text{RSR}_{\text{Alone}}}{\text{RSR}_{\text{Shared}}}$$

Memory Interference-induced Slowdown Estimation (MISE) model for non-memory bound applications
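Both forms of the MISE model fit in one small function, since the memory-bound form is the special case where the memory fraction equals 1. The function and parameter names below are illustrative assumptions, and the input numbers are made up for the example.

```python
def mise_slowdown(rsr_alone: float, rsr_shared: float, alpha: float = 1.0) -> float:
    """Estimate slowdown from request service rates.

    alpha is the fraction of time spent in the memory phase;
    alpha = 1.0 gives the memory-bound form RSR_alone / RSR_shared.
    """
    if rsr_shared <= 0:
        raise ValueError("rsr_shared must be positive")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) + alpha * (rsr_alone / rsr_shared)


# Hypothetical rates, in requests per cycle.
memory_bound = mise_slowdown(rsr_alone=0.5, rsr_shared=0.25)
half_memory = mise_slowdown(rsr_alone=0.5, rsr_shared=0.25, alpha=0.5)
```

With the same service-rate ratio of 2, the memory-bound application is estimated to slow down by 2.0, while an application that spends half its time in the memory phase is estimated at 1.5.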

Page 35: Interval Based Operation

Diagram: time is divided into intervals. During each interval: Measure RSRShared, Estimate RSRAlone. At the end of each interval: Estimate slowdown.

Page 36: Previous Work on Slowdown Estimation

• Previous work on slowdown estimation
– STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO ’07]
– FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS ’10]
– Per-thread Cycle Accounting [Du Bois et al., HiPEAC ’13]

• Basic Idea:

$$\text{Slowdown} = \frac{\text{Stall Time}_{\text{Shared}}}{\text{Stall Time}_{\text{Alone}}}$$

Count number of cycles application receives interference

Page 37: Methodology

• Configuration of our simulated system
– 4 cores
– 1 channel, 8 banks/channel
– DDR3 1066 DRAM
– 512 KB private cache/core

• Workloads
– SPEC CPU2006
– 300 multiprogrammed workloads

Page 38: Quantitative Comparison

Figure: Slowdown (y-axis, 1 to 4) over time (x-axis, 0 to 100 Million Cycles) for the SPEC CPU 2006 application leslie3d, comparing Actual, STFM, and MISE.

Page 39: Comparison to STFM

Figure: six panels of Slowdown (y-axis, 0 to 4) over time (x-axis, 0 to 100), one each for cactusADM, GemsFDTD, soplex, wrf, calculix, and povray.

Average error of MISE: 8.2%

Average error of STFM: 29.4%

(across 300 workloads)

Page 40: Possible Use Cases of the MISE Model

• Bounding application slowdowns [HPCA ’13]

• Achieving high system fairness and performance [HPCA ’13]

• VM migration and admission control schemes [VEE ’15]

• Fair billing schemes in a commodity cloud

Page 41: MISE-QoS: Providing “Soft” Slowdown Guarantees

• Goal
1. Ensure QoS-critical applications meet a prescribed slowdown bound
2. Maximize system performance for other applications

• Basic Idea
– Allocate just enough bandwidth to QoS-critical application
– Assign remaining bandwidth to other applications

Page 42: Methodology

• Each application (25 applications in total) considered the QoS-critical application

• Run with 12 sets of co-runners of different memory intensities

• Total of 300 multiprogrammed workloads

• Each workload run with 10 slowdown bound values

• Baseline memory scheduling mechanism
– Always prioritize QoS-critical application [Iyer et al., SIGMETRICS 2007]
– Other applications’ requests scheduled in FR-FCFS order [Zuravleff and Robinson, US Patent 1997; Rixner+, ISCA 2000]

Page 43: A Look at One Workload

Figure: Slowdown (y-axis, 0 to 3) of leslie3d (QoS-critical) and of hmmer, lbm, and omnetpp (non-QoS-critical) under AlwaysPrioritize, MISE-QoS-10/1, MISE-QoS-10/3, MISE-QoS-10/5, MISE-QoS-10/7, and MISE-QoS-10/9. Marked bounds: Slowdown Bound = 10, Slowdown Bound = 3.33, Slowdown Bound = 2.

MISE is effective in
1. meeting the slowdown bound for the QoS-critical application
2. improving performance of non-QoS-critical applications

Page 44: Effectiveness of MISE in Enforcing QoS

                     Predicted Met    Predicted Not Met
QoS Bound Met        78.8%            2.1%
QoS Bound Not Met    2.2%             16.9%

Across 3000 data points

MISE-QoS meets the bound for 80.9% of workloads

AlwaysPrioritize meets the bound for 83% of workloads

MISE-QoS correctly predicts whether or not the bound is met for 95.7% of workloads
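The two MISE-QoS summary figures follow directly from the four cells of the table. As a worked check, the bound-met rate is the sum of the "QoS Bound Met" row, and the correct-prediction rate is the sum of the diagonal:

```latex
\begin{align*}
\text{Bound met} &= 78.8\% + 2.1\% = 80.9\% \\
\text{Correct prediction} &= 78.8\% + 16.9\% = 95.7\% \\
\text{Incorrect prediction} &= 2.1\% + 2.2\% = 4.3\%
\end{align*}
```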

Page 45: Performance of Non-QoS-Critical Applications

Figure: System Performance (y-axis, 0 to 1) under AlwaysPrioritize, MISE-QoS-10/1, MISE-QoS-10/3, MISE-QoS-10/5, MISE-QoS-10/7, and MISE-QoS-10/9.

Higher performance when bound is loose

When slowdown bound is 10/3, MISE-QoS improves system performance by 10%

Page 46: Outline

Page 47: Shared Cache Capacity Contention

Diagram: four cores sharing the capacity of a cache in front of main memory

Page 48: Cache Capacity Contention

Diagram: two cores accessing a shared cache in front of main memory, labeled with Cache Access Rate and Priority

Applications evict each other’s blocks from the shared cache

Page 49: Outline

Page 50: Estimating Cache and Memory Slowdowns

Diagram: sixteen cores sharing a cache and main memory, labeled with Cache Service Rate and Memory Service Rate

Page 51: Service Rates vs. Access Rates

Diagram: sixteen cores sharing a cache and main memory, labeled with Cache Service Rate and Cache Access Rate

Request service and access rates are tightly coupled

Page 52: The Application Slowdown Model

Diagram: sixteen cores sharing a cache and main memory, labeled with Cache Access Rate

$$\text{Slowdown} = \frac{\text{Cache Access Rate}_{\text{Alone}}}{\text{Cache Access Rate}_{\text{Shared}}}$$

Page 53: Real System Studies: Cache Access Rate vs. Slowdown

Figure: Slowdown (y-axis, 1 to 2.2) against Cache Access Rate Ratio (x-axis, 1 to 2.2) for astar, lbm, and bzip2.

Page 54: Challenge

How to estimate alone cache access rate?

Diagram: sixteen cores sharing a cache and main memory, labeled with Cache Access Rate, Auxiliary Tag Store, and Priority

Page 55: Auxiliary Tag Store

Diagram: two cores, a shared cache with an auxiliary tag store, and main memory, labeled with Cache Access Rate and Priority. A block evicted from the shared cache is still in the auxiliary tag store.

Auxiliary tag store tracks such contention misses

Page 56: Accounting for Contention Misses

• Revisiting alone memory request service rate

$$\text{Alone Request Service Rate of an Application} = \frac{\#\ \text{Requests During High Priority Epochs}}{\#\ \text{High Priority Cycles}}$$

Cycles serving contention misses should not count as high priority cycles

Page 57: Alone Cache Access Rate Estimation

$$\text{Cache Access Rate}_{\text{Alone}}\ \text{of an Application} = \frac{\#\ \text{Requests During High Priority Epochs}}{\#\ \text{High Priority Cycles} - \#\ \text{Cache Contention Cycles}}$$

Cache Contention Cycles: Cycles spent serving contention misses

$$\text{Cache Contention Cycles} = \#\ \text{Contention Misses} \times \text{Average Memory Service Time}$$

# Contention Misses: From auxiliary tag store when given high priority

Average Memory Service Time: Measured when given high priority
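The alone cache access rate estimate and the resulting ASM slowdown can be worked through numerically. The sketch below only restates the two formulas; the function names, parameter names, and counter values are assumptions made for the example, not measurements from the thesis.

```python
def alone_cache_access_rate(requests_high_priority: int,
                            high_priority_cycles: int,
                            contention_misses: int,
                            avg_memory_service_time: float) -> float:
    """Estimate the cache access rate an application would have when run alone."""
    # Cycles spent serving contention misses are removed from the denominator.
    contention_cycles = contention_misses * avg_memory_service_time
    effective_cycles = high_priority_cycles - contention_cycles
    if effective_cycles <= 0:
        raise ValueError("contention cycles must be fewer than high priority cycles")
    return requests_high_priority / effective_cycles


def asm_slowdown(car_alone: float, car_shared: float) -> float:
    """Slowdown as the ratio of alone to shared cache access rate."""
    return car_alone / car_shared


# Hypothetical counters collected over the high priority epochs of one interval.
car_alone = alone_cache_access_rate(
    requests_high_priority=1000,
    high_priority_cycles=10_000,
    contention_misses=20,
    avg_memory_service_time=100.0,
)
slowdown = asm_slowdown(car_alone, car_shared=0.0625)
```

Here 20 contention misses at 100 cycles each remove 2,000 cycles from the 10,000 high priority cycles, giving an alone rate of 1000 / 8000 = 0.125 accesses per cycle and a slowdown of 2.0 against a shared rate of 0.0625.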

Page 58: Application Slowdown Model (ASM)

Diagram: sixteen cores sharing a cache and main memory, labeled with Cache Access Rate

$$\text{Slowdown} = \frac{\text{Cache Access Rate}_{\text{Alone}}}{\text{Cache Access Rate}_{\text{Shared}}}$$

Page 59: Previous Work on Slowdown Estimation

• Previous work on slowdown estimation
– STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO ’07]
– FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS ’10]
– Per-thread Cycle Accounting [Du Bois et al., HiPEAC ’13]

• Basic Idea:

$$\text{Slowdown} = \frac{\text{Execution Time}_{\text{Shared}}}{\text{Execution Time}_{\text{Alone}}}$$

Count interference experienced by each request

Page 60: Model Accuracy Results

Figure: Slowdown Estimation Error (in %, y-axis, 0 to 160) of FST, PTCA, and ASM for select applications: calculix, povray, tonto, namd, dealII, sjeng, perlben, gobmk, xalancb, sphinx3, GemsF…, omnetpp, lbm, leslie3d, soplex, milc, libq, mcf, NPBbt, NPBft, NPBis, NPBua, and Average.

Average error of ASM’s slowdown estimates: 10%

Page 61: Leveraging ASM’s Slowdown Estimates

• Slowdown-aware resource allocation for high performance and fairness

• Slowdown-aware resource allocation to bound application slowdowns

• VM migration and admission control schemes [VEE ’15]

• Fair billing schemes in a commodity cloud

Page 62: Cache Capacity Partitioning

Diagram: two cores accessing a shared cache in front of main memory, labeled with Cache Access Rate

Goal: Partition the shared cache among applications to mitigate contention

Page 63: Cache Capacity Partitioning

Diagram: two cores sharing a cache organized as Set 0 to Set N-1 by Way 0 to Way 3, in front of main memory

Previous partitioning schemes optimize for miss count

Problem: Not aware of performance and slowdowns

Page 64: ASM-Cache: Slowdown-aware Cache Way Partitioning

• Key Requirement: Slowdown estimates for all possible way partitions

• Extend ASM to estimate slowdown for all possible cache way allocations

• Key Idea: Allocate each way to the application whose slowdown reduces the most
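The key idea reads naturally as a greedy loop over ways. The sketch below is one possible reading under stated assumptions: the per-application table `slowdown[app][w]` (estimated slowdown with `w` ways, for `w` from 0 to the total way count) stands in for the extended ASM estimates, and the one-way-at-a-time greedy rule and all names are illustrative, not the published ASM-Cache algorithm.

```python
from typing import Dict, List


def partition_ways(slowdown: Dict[str, List[float]], total_ways: int) -> Dict[str, int]:
    """Assign ways one at a time to the application whose slowdown drops the most.

    slowdown[app][w] is the estimated slowdown of app when it owns w ways,
    so each list needs total_ways + 1 entries.
    """
    alloc = {app: 0 for app in slowdown}
    for _ in range(total_ways):
        # Marginal benefit of one more way for each application.
        best = max(alloc, key=lambda a: slowdown[a][alloc[a]] - slowdown[a][alloc[a] + 1])
        alloc[best] += 1
    return alloc


# Hypothetical slowdown estimates for two applications and a 4-way cache.
estimates = {
    "A": [4.0, 2.0, 1.5, 1.5, 1.5],
    "B": [3.0, 2.25, 1.5, 1.25, 1.0],
}
allocation = partition_ways(estimates, total_ways=4)
```

With these made-up estimates, application A takes the first way (its slowdown drops by 2.0), B takes the next two (0.75 each), and A takes the last (0.5 against 0.25), so each ends with two ways. The decision is driven by slowdown reduction rather than miss count, which is the contrast the slide draws with previous partitioning schemes.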

Page 65: Memory Bandwidth Partitioning

Diagram: two cores accessing a shared cache in front of main memory, labeled with Cache Access Rate

Goal: Partition the main memory bandwidth among applications to mitigate contention

Page 66: ASM-Mem: Slowdown-aware Memory Bandwidth Partitioning

• Key Idea: Allocate high priority proportional to an application’s slowdown

$$\text{High Priority Fraction}_i = \frac{\text{Slowdown}_i}{\sum_j \text{Slowdown}_j}$$

• Application i’s requests given highest priority at the memory controller for its fraction
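The high priority fraction is a plain normalization of the slowdown estimates. A minimal sketch, with hypothetical application IDs and slowdown values:

```python
from typing import Dict


def high_priority_fractions(slowdowns: Dict[int, float]) -> Dict[int, float]:
    """Give each application a share of high priority time proportional to its slowdown."""
    total = sum(slowdowns.values())
    if total <= 0:
        raise ValueError("slowdowns must sum to a positive value")
    return {app: s / total for app, s in slowdowns.items()}


# Hypothetical estimates: application 1 is slowed down three times as much as application 0.
fractions = high_priority_fractions({0: 1.0, 1: 3.0})
```

Application 1 receives three quarters of the high priority time in this example, so the more an application is slowed down, the larger its share of highest-priority service at the memory controller.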

Page 67: Coordinated Resource Allocation Schemes

Diagram: sixteen cores sharing a cache and main memory

Cache capacity-aware bandwidth allocation

1. Employ ASM-Cache to partition cache capacity

2. Drive ASM-Mem with slowdowns from ASM-Cache

Page 68: Fairness and Performance Results

Figure: two panels for a 16-core system across 100 workloads, each against Number of Channels (1 and 2): Fairness (y-axis, 4 to 11, lower is better) and Performance (y-axis, 0 to 0.35), comparing FRFCFS-NoPart, FRFCFS+UCP, TCM+UCP, PARBS+UCP, and ASM-Cache-Mem.

Significant fairness benefits across different channel counts

Outline

69

Mitigate Interference

Quantify Interference

Cache Capacity

Memory Bandwidth

Much explored; not our focus

Blacklisting Memory Scheduler

Not our focus

Memory Interference induced Slowdown Estimation Model and its uses

Application Slowdown Model and its uses

Thesis Contributions

• Principles behind our scheduler and models

– Simple two-level prioritization sufficient to mitigate interference

– Request service rate a proxy for performance

• Simple and high-performance memory scheduler design

• Accurate slowdown estimation models

• Mechanisms that leverage our slowdown estimates

70

Summary

• Problem: Shared resource interference causes high and unpredictable application slowdowns

• Goals: High and predictable performance

• Approaches: Mitigate and quantify interference

• Thesis Contributions:

1. Principles behind our scheduler and models

2. Simple and high-performance memory scheduler

3. Accurate slowdown estimation models

4. Mechanisms that leverage our slowdown estimates

71

Future Work

• Leveraging slowdown estimates at the system and cluster level

• Interference estimation and performance predictability for multithreaded applications

• Performance predictability in heterogeneous systems

• Coordinating the management of main memory and storage

72

Research Summary

• Predictable performance in multicore systems [HPCA ’13, SuperFri ’14, KIISE ’15]

• High and predictable performance in heterogeneous systems [ISCA ’12, SAFARI Tech Report ’15]

• Low-complexity memory scheduling [ICCD ’14]

• Memory channel partitioning [MICRO ’11]

• Architecture-aware cluster management [VEE ’15]

• Low-latency DRAM architectures [HPCA ’13]

73

Backup Slides

74

Blacklisting

75

Problems with Previous Application-aware Memory Schedulers

1. Full ranking increases hardware complexity

2. Full ranking causes unfair slowdowns

76

Ranking Increases Hardware Complexity

77

[Diagram: a request buffer whose entries each hold a request and its application ID (AID); every entry is compared (=) against the highest ranked AID and the next highest ranked AIDs, alongside logic to monitor and enforce ranks]

Hardware complexity increases with application/core count

78

Ranking Increases Hardware Complexity

[Figure: two bar charts comparing App-unaware and App-aware schedulers: critical path latency (in ns) and scheduler area (in square um); the App-aware scheduler is annotated 8x and 1.8x, respectively]

Ranking-based application-aware schedulers incur high hardware cost

From synthesis of RTL implementations using a 32nm library

Problems with Previous Application-aware Memory Schedulers

1. Full ranking increases hardware complexity

2. Full ranking causes unfair slowdowns

79

Ranking Causes Unfair Slowdowns

80

[Figure: number of requests versus execution time (in 1000s of cycles) for GemsFDTD (high memory intensity) and sjeng (low memory intensity); series: App-unaware, Ranking]

Full ordered ranking of applications

GemsFDTD denied request service

Ranking Causes Unfair Slowdowns

81

[Figure: slowdown of GemsFDTD (high memory intensity) and sjeng (low memory intensity); series: App-unaware, Ranking]

Ranking-based application-aware schedulers cause unfair slowdowns

Key Observation 1: Group Rather Than Rank

82

[Figure: number of requests versus execution time (in 1000s of cycles) for GemsFDTD (high memory intensity) and sjeng (low memory intensity); series: App-unaware, Ranking, Grouping]

No unfairness due to denial of request service

83

Key Observation 1: Group Rather Than Rank

[Figure: slowdown of GemsFDTD (high memory intensity) and sjeng (low memory intensity); series: App-unaware, Ranking, Grouping]

Benefit 2: Lower slowdowns than ranking

Previous Memory Schedulers

• FRFCFS [Zuravleff and Robinson, US Patent 1997, Rixner et al., ISCA 2000]

– Prioritizes row-buffer hits and older requests

• FRFCFS-Cap [Mutlu and Moscibroda, MICRO 2007]

– Caps number of consecutive row-buffer hits

• PARBS [Mutlu and Moscibroda, ISCA 2008]

– Batches oldest requests from each application; prioritizes batch

– Employs ranking within a batch

• ATLAS [Kim et al., HPCA 2010]

– Prioritizes applications with low memory-intensity

• TCM [Kim et al., MICRO 2010]

– Always prioritizes low memory-intensity applications

– Shuffles thread ranks of high memory-intensity applications

84

Application-unaware

+ Low complexity

- Low performance and fairness

Application-aware

+ High performance and fairness

- High complexity

Performance and Fairness

85

[Figure: scatter plot of unfairness versus performance for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting; annotations: 5% and 21%]

1. Blacklisting achieves the highest performance

2. Blacklisting balances performance and fairness

Performance vs. Fairness vs. Simplicity

86

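[Figure: three-axis chart (performance, fairness, simplicity) comparing FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, Blacklisting, and an ideal scheduler; Blacklisting is annotated as highest performance, close to simplest, and close to fairest]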

Blacklisting is the closest scheduler to ideal

Summary

• Applications’ requests interfere at main memory

• Prevalent solution approach

– Application-aware memory request scheduling

• Key shortcoming of previous schedulers: Full ranking

– High hardware complexity

– Unfair application slowdowns

• Our Solution: Blacklisting memory scheduler

– Sufficient to group applications rather than rank

– Group by tracking number of consecutive requests

• Much simpler than application-aware schedulers at higher performance and fairness
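The grouping idea summarized above can be sketched as follows. This is an illustrative model only, not the hardware design: the class name, method names, and the default threshold and clearing interval (picked from the ranges swept in the sensitivity slides) are assumptions, and row-buffer-hit prioritization is left out for brevity.

```python
class Blacklist:
    """Two-group prioritization by tracking consecutive requests (sketch)."""

    def __init__(self, threshold=4, clearing_interval=10000):
        self.threshold = threshold                  # assumed default
        self.clearing_interval = clearing_interval  # assumed default
        self.blacklisted = set()
        self.last_app = None
        self.streak = 0

    def on_request_served(self, app_id):
        # Count how many requests in a row came from the same application.
        if app_id == self.last_app:
            self.streak += 1
        else:
            self.last_app = app_id
            self.streak = 1
        # Too many consecutive requests: move the application to the
        # blacklisted (deprioritized) group.
        if self.streak > self.threshold:
            self.blacklisted.add(app_id)
            self.streak = 0

    def clear_if_due(self, cycle):
        # Periodically forget all blacklisting decisions.
        if cycle % self.clearing_interval == 0:
            self.blacklisted.clear()

    def pick(self, requests):
        # requests: list of (arrival_cycle, app_id) tuples.
        # Non-blacklisted applications first, then older requests first.
        return min(requests, key=lambda r: (r[1] in self.blacklisted, r[0]))


bl = Blacklist(threshold=4, clearing_interval=10000)
for _ in range(5):
    bl.on_request_served("hog")
pending = [(0, "hog"), (5, "light")]
chosen_while_blacklisted = bl.pick(pending)
bl.clear_if_due(10000)
chosen_after_clearing = bl.pick(pending)
```

Only one application ID, one counter, and one bit per application are tracked here, which is the intuition behind the lower complexity relative to full ranking.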

87

Performance and Fairness

[Figure: bar charts of weighted speedup and maximum slowdown for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting]

5% higher system performance and 21% lower maximum slowdown than TCM

88

Complexity Results

Blacklisting achieves 70% lower latency than TCM

89

Blacklisting achieves 43% lower area than TCM

[Figure: bar charts of latency (in ns) and area (in square um) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting]

Understanding Why Blacklisting Works

[Figure: fraction of requests versus streak length for FRFCFS, PARBS, TCM, and Blacklisting; panels: libquantum and calculix]

libquantum (high memory-intensity application): Blacklisting shifts the request distribution towards the left

calculix (low memory-intensity application): Blacklisting shifts the request distribution towards the right

90

Harmonic Speedup

91

[Figure: bar chart of harmonic speedup for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting]

Effect of Workload Memory Intensity

92

[Figure: normalized weighted speedup and normalized maximum slowdown across workload memory intensity categories (25, 50, 75, 100, Avg) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting]

Combining FRFCFS-Cap and Blacklisting

93

[Figure: weighted speedup and maximum slowdown for FRFCFS, FRFCFS-Cap, Blacklisting, and FRFCFS-Cap-Blacklisting]

Sensitivity to Blacklisting Threshold

94

[Figure: weighted speedup and maximum slowdown for FRFCFS and for Blacklisting with thresholds 1, 2, 4, 8, and 16 (Blacklisting-1 through Blacklisting-16)]

Sensitivity to Clearing Interval

95

[Figure: weighted speedup and maximum slowdown for FRFCFS and for Blacklisting with clearing intervals 1000, 10000, and 100000 (Blacklisting-1000, Blacklisting-10000, Blacklisting-100000)]

Sensitivity to Core Count

96

[Figure: performance and unfairness versus core count (16, 24, 32, 64) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting]

Sensitivity to Channel Count

97

[Figure: performance and unfairness versus channel count (1, 2, 4, 8) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting]

Sensitivity to Cache Size

98

[Figure: performance and unfairness versus cache size (512KB, 1MB, 2MB) for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting]

Performance and Fairness with Shared Cache

99

[Figure: performance and unfairness with a shared cache for FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting]

Breakdown of Benefits

100

[Figure: weighted speedup and maximum slowdown for FRFCFS, TCM, Grouping, and Blacklisting]

BLISS vs. Criticality-aware Scheduling

101

[Figure: weighted speedup and maximum slowdown for FRFCFS, PARBS, TCM, Crit-MaxStall, Crit-TotalStall, and Blacklisting]

Sub-row Interleaving

102

[Figure: maximum slowdown and weighted speedup for FRFCFS-Row, FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting]

MISE

103

Measuring RSRShared and α

• Request Service Rate Shared (RSRShared)

– Per-core counter to track number of requests serviced

– At the end of each interval, measure

$$\text{RSR}_{\text{Shared}} = \frac{\text{Number of Requests Served}}{\text{Interval Length}}$$

• Memory Phase Fraction (α)

– Count number of stall cycles at the core

– Compute fraction of cycles stalled for memory

104

Estimating Request Service Rate Alone (RSRAlone)

• Divide each interval into shorter epochs

• At the beginning of each epoch

– Randomly pick an application as the highest priority application

• At the end of an interval, for each application, estimate

$$\text{RSR}_{\text{Alone}} = \frac{\text{Number of Requests During High Priority Epochs}}{\text{Number of Cycles Application Given High Priority}}$$

105

Goal: Estimate RSRAlone

How: Periodically give each application highest priority in accessing memory
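A rough sketch of the epoch mechanism on this slide: each epoch, one application is picked at random to receive highest priority, and the alone request service rate is then the ratio of requests served in those epochs to the cycles spent with high priority. The function names, the uniform random choice, and the counter values are assumptions for illustration.

```python
import random


def pick_high_priority_apps(apps, num_epochs, seed=0):
    """Return the application given highest priority in each epoch.

    A uniform random choice is assumed here; the slide only says
    "randomly pick".
    """
    rng = random.Random(seed)
    return [rng.choice(apps) for _ in range(num_epochs)]


def estimate_rsr_alone(requests_in_hp_epochs, cycles_with_high_priority):
    # Requests served while the application had highest priority, divided by
    # the number of cycles it held highest priority.
    return requests_in_hp_epochs / cycles_with_high_priority


schedule = pick_high_priority_apps(["app0", "app1", "app2"], num_epochs=10)
rsr_alone = estimate_rsr_alone(
    requests_in_hp_epochs=300,         # hypothetical counter value
    cycles_with_high_priority=30_000,  # hypothetical counter value
)
```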

Inaccuracy in Estimating RSRAlone

106

• When an application has highest priority

– Still experiences some interference

[Diagram: request buffer state and the resulting service order at main memory over time units 1 to 3, marking the interference cycles experienced by the high priority application]

Accounting for Interference in RSRAlone Estimation

• Solution: Determine and remove interference cycles from RSRAlone calculation

• A cycle is an interference cycle if

– a request from the highest priority application is waiting in the request buffer and

– another application’s request was issued previously

107

$$\text{ARSR} = \frac{\text{Number of Requests During High Priority Epochs}}{\text{Number of Cycles Application Given High Priority} - \text{Interference Cycles}}$$
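The interference correction can be illustrated with a few lines of arithmetic. The sketch below subtracts interference cycles from the high-priority cycle count before dividing, and then forms a slowdown estimate as the ratio of the alone rate to the shared rate. That last step assumes a memory-bound application, for which the deck treats request service rate as a proxy for performance. All names and counter values are hypothetical.

```python
def estimate_arsr(hp_requests, hp_cycles, interference_cycles):
    """Alone request service rate with interference cycles removed."""
    effective_cycles = hp_cycles - interference_cycles
    if effective_cycles <= 0:
        raise ValueError("no interference-free high-priority cycles")
    return hp_requests / effective_cycles


def measure_rsr_shared(requests_served, interval_length):
    """Shared request service rate measured over one interval."""
    return requests_served / interval_length


# Hypothetical counter values for one application over one interval.
arsr = estimate_arsr(hp_requests=500, hp_cycles=12_000, interference_cycles=2_000)
rsr_shared = measure_rsr_shared(requests_served=2_000, interval_length=100_000)

# Assumed slowdown estimate for a memory-bound application.
slowdown = arsr / rsr_shared
```

Without the subtraction, the same counters would give a lower alone rate and therefore an underestimated slowdown.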

MISE Operation: Putting it All Together

[Timeline: execution is divided into intervals; during each interval, measure RSRShared and estimate RSRAlone; at the end of each interval, estimate slowdown]

108

MISE-QoS: Mechanism to Provide Soft QoS

• Assign an initial bandwidth allocation to QoS-critical application

• Estimate slowdown of QoS-critical application using the MISE model

• After every N intervals

– If slowdown > bound B +/- ε, increase bandwidth allocation

– If slowdown < bound B +/- ε, decrease bandwidth allocation

• When slowdown bound not met for N intervals

– Notify the OS so it can migrate/de-schedule jobs

109
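The adjustment loop above might look roughly like the following. This sketch reads "bound B +/- ε" as a dead band around the bound, which is an interpretation rather than something the slide states; the step size, ε value, and function names are likewise assumptions.

```python
def adjust_allocation(allocation, slowdown, bound, epsilon=0.1, step=0.05):
    """Return the new bandwidth fraction for the QoS-critical application.

    allocation is a fraction of memory bandwidth in [0, 1].
    """
    if slowdown > bound + epsilon:
        return min(1.0, allocation + step)  # bound violated: give more
    if slowdown < bound - epsilon:
        return max(0.0, allocation - step)  # comfortable slack: give back
    return allocation                       # inside the dead band


def should_notify_os(recent_slowdowns, bound, n):
    """True if the bound was missed in each of the last n intervals."""
    last = recent_slowdowns[-n:]
    return len(last) == n and all(s > bound for s in last)


increased = adjust_allocation(0.5, slowdown=4.0, bound=3.0)
decreased = adjust_allocation(0.5, slowdown=2.0, bound=3.0)
unchanged = adjust_allocation(0.5, slowdown=3.05, bound=3.0)
notify = should_notify_os([3.4, 3.6, 3.5], bound=3.0, n=3)
```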

Performance of Non-QoS-Critical Applications

[Figure: harmonic speedup versus number of memory intensive applications (0, 1, 2, 3, Avg) for AlwaysPrioritize and for MISE-QoS with slowdown bounds 10/1, 10/3, 10/5, 10/7, and 10/9]

Higher performance when bound is loose

When slowdown bound is 10/3, MISE-QoS improves system performance by 10%

110

Case Study with Two QoS-Critical Applications

• Two comparison points

– Always prioritize both applications

– Prioritize each application 50% of time

[Figure: slowdown of astar, mcf, leslie3d, and mcf under AlwaysPrioritize, EqualBandwidth, and MISE-QoS with bounds 10/1, 10/2, 10/3, 10/4, and 10/5]

MISE-QoS can achieve a lower slowdown bound for both applications

MISE-QoS provides much lower slowdowns for non-QoS-critical applications

111

Minimizing Maximum Slowdown

• Goal

– Minimize the maximum slowdown experienced by any application

• Basic Idea

– Assign more memory bandwidth to the more slowed down application

112

Mechanism

• Memory controller tracks

– Slowdown bound B

– Bandwidth allocation of all applications

• Different components of mechanism

– Bandwidth redistribution policy

– Modifying target bound

– Communicating target bound to OS periodically

113

Bandwidth Redistribution

• At the end of each interval,

– Group applications into two clusters

– Cluster 1: applications that meet bound

– Cluster 2: applications that don’t meet bound

– Steal small amount of bandwidth from each application in cluster 1 and allocate to applications in cluster 2
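One way to picture the redistribution step is sketched below. The amount stolen per application and the equal split among the applications that miss the bound are assumptions; the slide does not specify either. The function name is hypothetical.

```python
def redistribute(allocation, slowdown, bound, steal=0.02):
    """Move a little bandwidth from applications meeting the bound
    (cluster 1) to applications missing it (cluster 2)."""
    cluster1 = [a for a in allocation if slowdown[a] <= bound]
    cluster2 = [a for a in allocation if slowdown[a] > bound]
    new_alloc = dict(allocation)
    if not cluster1 or not cluster2:
        return new_alloc  # nothing to take from, or nobody needs more

    pool = 0.0
    for app in cluster1:
        taken = min(steal, new_alloc[app])
        new_alloc[app] -= taken
        pool += taken
    for app in cluster2:
        new_alloc[app] += pool / len(cluster2)  # assumed: equal split
    return new_alloc


# Hypothetical bandwidth fractions and slowdown estimates.
before = {"a": 0.4, "b": 0.4, "c": 0.2}
after = redistribute(before, {"a": 1.5, "b": 2.0, "c": 5.0}, bound=3.0)
```

Total allocated bandwidth is unchanged; only its distribution shifts toward the application that is missing the bound.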

114

Modifying Target Bound

• If bound B is met for past N intervals

– Bound can be made more aggressive

– Set bound higher than the slowdown of most slowed down application

• If bound B not met for past N intervals by more than half the applications

– Bound should be more relaxed

– Set bound to slowdown of most slowed down application

115

Results: Harmonic Speedup

[Figure: harmonic speedup versus core count (4, 8, 16) for FRFCFS, ATLAS, TCM, STFM, and MISE-Fair]

116

Results: Maximum Slowdown

[Figure: maximum slowdown versus core count (4, 8, 16) for FRFCFS, ATLAS, TCM, STFM, and MISE-Fair]

117

Sensitivity to Memory Intensity (16 cores)

[Figure: maximum slowdown across memory intensity categories (0, 25, 50, 75, 100, Avg) for FRFCFS, ATLAS, TCM, STFM, and MISE-Fair]

118

MISE: Per-Application Error

Benchmark        STFM    MISE      Benchmark        STFM    MISE
453.povray       56.3    0.1       473.astar        12.3    8.1
454.calculix     43.5    1.3       456.hmmer        17.9    8.1
400.perlbench    26.8    1.6       464.h264ref      13.7    8.3
447.dealII       37.5    2.4       401.bzip2        28.3    8.5
436.cactusADM    18.4    2.6       458.sjeng        21.3    8.8
450.soplex       29.8    3.5       433.milc         26.4    9.5
444.namd         43.6    3.7       481.wrf          33.6    11.1
437.leslie3d     26.4    4.3       429.mcf          83.74   11.5
403.gcc          25.4    4.5       445.gobmk        23.1    12.5
462.libquantum   48.9    5.3       483.xalancbmk    18      13.6
459.GemsFDTD     21.6    5.5       435.gromacs      31.4    15.6
470.lbm          6.9     6.3       482.sphinx3      21      16.8
473.astar        12.3    8.1       471.omnetpp      26.2    17.5
456.hmmer        17.9    8.1       465.tonto        32.7    19.5

119

Sensitivity to Epoch and Interval Lengths

                Interval Length
Epoch Length    1 mil.   5 mil.   10 mil.   25 mil.   50 mil.
1000            65.1%    9.1%     11.5%     10.7%     8.2%
10000           64.1%    8.1%     9.6%      8.6%      8.5%
100000          64.3%    11.2%    9.1%      8.9%      9%
1000000         64.5%    31.3%    14.8%     14.9%     11.7%

120

Workload Mixes

Mix No.   Benchmark 1   Benchmark 2   Benchmark 3
1         sphinx3       leslie3d      milc
2         sjeng         gcc           perlbench
3         tonto         povray        wrf
4         perlbench     gcc           povray
5         gcc           povray        leslie3d
6         perlbench     namd          lbm
7         h264ref       bzip2         libquantum
8         hmmer         lbm           omnetpp
9         sjeng         libquantum    cactusADM
10        namd          libquantum    mcf
11        xalancbmk     mcf           astar
12        mcf           libquantum    leslie3d

121

STFM’s Effectiveness in Enforcing QoS

                     Predicted Met   Predicted Not Met
QoS Bound Met        63.7%           16%
QoS Bound Not Met    2.4%            17.9%

Across 3000 data points
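As a quick reading aid for the four percentages on this slide, the snippet below adds up the cases where STFM's prediction agreed with the outcome and the cases where it did not. It assumes the layout is outcome (rows) by prediction (columns); the variable names are illustrative.

```python
# Percentages from the slide, assuming rows are the actual outcome and
# columns are STFM's prediction.
met_predicted_met = 63.7
met_predicted_not_met = 16.0
not_met_predicted_met = 2.4
not_met_predicted_not_met = 17.9

# Prediction agreed with the outcome.
correct = met_predicted_met + not_met_predicted_not_met
# Prediction disagreed with the outcome.
incorrect = met_predicted_not_met + not_met_predicted_met
total = correct + incorrect
```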

122

STFM vs. MISE’s System Performance

[Figure: system performance of MISE and STFM under QoS-10/1, QoS-10/3, QoS-10/5, QoS-10/7, and QoS-10/9]

123

MISE’s Implementation Cost

1. Per-core counters worth 20 bytes

• Request Service Rate Shared

• Request Service Rate Alone

– 1 counter for number of high priority epoch requests

– 1 counter for number of high priority epoch cycles

– 1 counter for interference cycles

• Memory phase fraction (α)

2. Register for current bandwidth allocation – 4 bytes

3. Logic for prioritizing an application in each epoch

124

MISE Accuracy w/o Interference Cycles

• Average error – 23%

125

MISE Average Error by Workload Category

Workload Category (Number of memory intensive applications)    Average Error
0                                                              4.3%
1                                                              8.9%
2                                                              21.2%
3                                                              18.4%

126

ASM

127

Impact of Cache Capacity Contention

128

Cache capacity interference causes high application slowdowns

[Figure: slowdown of bzip2 (core 0) and soplex (core 1) in two configurations: shared main memory, and shared main memory and caches]

Error with Sampling

129

Error Distribution

130

Impact of Prefetching

131

Sensitivity to Epoch and Quantum Lengths

132

Sensitivity to Core Count

133

Sensitivity to Cache Capacity

134

Sensitivity to Auxiliary Tag Store Sampling

135

ASM-Cache: Fairness and Performance Results

136

Significant fairness benefits across different systems

[Figure: fairness (lower is better) and performance versus number of cores (4, 8, 16) for NoPart, UCP, and ASM-Cache]

ASM-Mem: Fairness and Performance Results

137

[Figure: fairness (lower is better) and performance versus number of cores (4, 8, 16) for FRFCFS, TCM, PARBS, and ASM-Mem]

Significant fairness benefits across different systems

ASM-QoS: Meeting Slowdown Bounds

138

[Figure: slowdown of h264ref, mcf, sphinx3, and soplex under Naive-QoS and ASM-QoS with bounds 2.5, 3, 3.5, and 4]

Previous Approach: Estimate Interference Experienced Per-Request

139

[Diagram: timeline of requests A, B, and C during shared (with interference) execution]

Request Overlap Makes Interference Estimation Per-Request Difficult

Estimating PerformanceAlone

140

Shared (With interference)

Execution time

Req A

Req B

Req CRequest QueuedRequest Served

Difficult to estimate impact of interference per-request due to request overlap

Impact of Interference on Performance

[Figure: two execution timelines, Alone (no interference) and Shared (with interference); the gap between their execution times is labeled Impact of Interference.]

Previous Approach: Estimate impact of interference at a per-request granularity

Difficult to estimate due to request overlap

Application-aware Memory Channel Partitioning

Goal: Mitigate Inter-Application Interference

Previous Approach: Application-Aware Memory Request Scheduling

Our First Approach: Application-Aware Memory Channel Partitioning

Our Second Approach: Integrated Memory Partitioning and Scheduling

Observation: Modern Systems Have Multiple Channels

A new degree of freedom

Mapping data across multiple channels

[Figure: two cores running Red App and Blue App, connected through two memory controllers to memory on Channel 0 and Channel 1.]

Data Mapping in Current Systems

[Figure: two cores running Red App and Blue App, two memory controllers, and memory on Channel 0 and Channel 1, showing where the applications' pages are mapped.]

Causes interference between applications’ requests

Partitioning Channels Between Applications

[Figure: two cores running Red App and Blue App, two memory controllers, and memory on Channel 0 and Channel 1, showing where the applications' pages are mapped.]

Eliminates interference between applications’ requests
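
The contrast between the two mappings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the modulo-based interleaving, and the one-channel-per-application assignment are hypothetical and are not taken from the slides.

```python
def interleaved_channel(page_number, num_channels):
    """Baseline (assumed): pages are spread over all channels regardless of owner."""
    return page_number % num_channels


def partitioned_channel(app, page_number, app_channels):
    """Partitioned: an application's pages map only to its assigned channels."""
    channels = app_channels[app]
    return channels[page_number % len(channels)]


def shared_channels(channels_by_app):
    """Return the channels that receive pages from more than one application."""
    owners = {}
    for app, channels in channels_by_app.items():
        for channel in channels:
            owners.setdefault(channel, set()).add(app)
    return {channel for channel, apps in owners.items() if len(apps) > 1}


NUM_CHANNELS = 2
PAGES = range(8)
APP_CHANNELS = {"red": [0], "blue": [1]}  # hypothetical assignment

# Channels touched by each application under the two mappings.
interleaved = {
    app: {interleaved_channel(p, NUM_CHANNELS) for p in PAGES}
    for app in APP_CHANNELS
}
partitioned = {
    app: {partitioned_channel(app, p, APP_CHANNELS) for p in PAGES}
    for app in APP_CHANNELS
}
```

Under the interleaved mapping both applications touch both channels, while under the partitioned mapping no channel is shared.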

Integrated Memory Partitioning and Scheduling

Goal: Mitigate Inter-Application Interference

Previous Approach: Application-Aware Memory Request Scheduling

Our First Approach: Application-Aware Memory Channel Partitioning

Our Second Approach: Integrated Memory Partitioning and Scheduling

Slowdown/Interference Estimation in Existing Systems

[Figure: a 16-core system with a shared cache and main memory.]

How do we detect/mitigate the impact of interference on a real system using existing performance counters?

Our Approach: Mitigating Interference in a Cluster

1. Detect memory bandwidth contention at each host

2. Estimate impact of moving each VM to a non-contended host (cost-benefit analysis)

3. Execute the migrations that provide the most benefit
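
The three steps above can be sketched as a greedy loop. This is a minimal sketch, not the system's actual algorithm: the `VM` fields, the 80% contention threshold, and the "bandwidth relieved minus migration cost" benefit score are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    host: str
    bandwidth: float       # measured memory bandwidth use (hypothetical units)
    migration_cost: float  # assumed cost of moving this VM, in the same units


def host_usage(vms, capacity):
    """Total bandwidth used on each host, including hosts with no VMs."""
    usage = {host: 0.0 for host in capacity}
    for vm in vms:
        usage[vm.host] += vm.bandwidth
    return usage


def plan_migrations(vms, capacity, threshold=0.8):
    """Greedy sketch: repeatedly apply the single most beneficial migration."""
    plan = []
    while True:
        usage = host_usage(vms, capacity)
        # Step 1: detect hosts whose bandwidth use exceeds the threshold.
        contended = {h for h in capacity if usage[h] > threshold * capacity[h]}
        # Step 2: score each move of a VM from a contended host to a host
        # that would remain below the threshold after the move.
        candidates = []
        for vm in vms:
            if vm.host not in contended:
                continue
            for dest in capacity:
                if dest == vm.host or dest in contended:
                    continue
                if usage[dest] + vm.bandwidth > threshold * capacity[dest]:
                    continue
                net_benefit = vm.bandwidth - vm.migration_cost
                if net_benefit > 0:
                    candidates.append((net_benefit, vm, dest))
        if not candidates:
            return plan
        # Step 3: execute the move with the largest net benefit.
        _, vm, dest = max(candidates, key=lambda c: c[0])
        plan.append((vm.name, vm.host, dest))
        vm.host = dest


capacity = {"pm1": 10.0, "pm2": 10.0}
vms = [
    VM("a", "pm1", bandwidth=5.0, migration_cost=1.0),
    VM("b", "pm1", bandwidth=4.0, migration_cost=1.0),
    VM("c", "pm2", bandwidth=2.0, migration_cost=1.0),
]
plan = plan_migrations(vms, capacity)
```

In this toy input, pm1 is contended (9 of 10 units), and moving VM a to pm2 gives the largest net benefit, after which no host is contended.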

Architecture-aware DRM – ADRM (VEE 2015)

[Figure: ADRM architecture. Physical machines PM1 through PMM each run KVM + QEMU (Kernel-based Virtual Machine) with a Profiler and host VMs 1 through n, each running an App. Also shown, under the label AI-DRM: Profiling Engine, Contention Detector, Recommendation Engine, and Actuation Engine.]

ADRM: Key Ideas and Results

• Key Ideas:

– Memory bandwidth captures impact of shared cache and memory bandwidth interference

– Model degradation in performance as linearly proportional to bandwidth increase/decrease

• Key Results:

– Average performance improvement of 9.67% on a 4-node cluster
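
One possible reading of the linear model in the second key idea is sketched below. The exact form used in ADRM is not given on the slide; the function name, the `sensitivity` parameter, and the choice of relative (rather than absolute) bandwidth change are assumptions.

```python
def estimated_performance(current_perf, current_bw, predicted_bw, sensitivity=1.0):
    """Assumed linear model: the relative change in performance is
    proportional to the relative change in achieved memory bandwidth."""
    relative_bw_change = (predicted_bw - current_bw) / current_bw
    return current_perf * (1.0 + sensitivity * relative_bw_change)


# A VM expected to gain 25% bandwidth after migration, and one expected to lose 25%.
gain = estimated_performance(100.0, current_bw=4.0, predicted_bw=5.0)
loss = estimated_performance(100.0, current_bw=4.0, predicted_bw=3.0)
```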

QoS in Heterogeneous Systems

• Staged memory scheduling

– In collaboration with Rachata Ausavarungnirun, Kevin Chang and Gabriel Loh

– Goal: High performance in CPU-GPU systems

• Memory scheduling in heterogeneous systems

– In collaboration with Hiroyuki Usui

– Goal: Meet deadlines for accelerators while improving performance

Performance Predictability in Heterogeneous Systems

[Figure: eight CPU cores and two accelerators, with a shared cache and main memory.]

Goal of our Scheduler (SQUASH)

• Goal: Design a memory scheduler that

– Meets accelerators’ deadlines and

– Achieves high CPU performance

• Basic Idea:

– Different CPU applications and hardware accelerators have different memory requirements

– Track progress of different agents and prioritize accordingly

Key Observation: Distribute Priority for Accelerators

• Accelerators need priority to meet deadlines

• Worst-case prioritization is not always the best

• Prioritize accelerators when they are not on track to meet a deadline

Distributing priority mitigates impact of accelerators on CPU cores’ requests
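
A minimal sketch of the "on track" check follows. It assumes that progress is measured as the fraction of an accelerator's memory requests completed so far, compared against the fraction of the deadline period that has elapsed; the function names and this particular progress metric are illustrative assumptions.

```python
def is_on_track(completed_requests, total_requests, elapsed, period):
    """Assumed test: progress so far keeps pace with the time used so far."""
    progress = completed_requests / total_requests
    expected = elapsed / period
    return progress >= expected


def accelerator_gets_priority(completed_requests, total_requests, elapsed, period):
    """Raise priority only while the accelerator is behind schedule."""
    return not is_on_track(completed_requests, total_requests, elapsed, period)


# 40% of the period has elapsed in both cases.
ahead = accelerator_gets_priority(50, 100, elapsed=40, period=100)
behind = accelerator_gets_priority(30, 100, elapsed=40, period=100)
```

An accelerator that is ahead of schedule leaves the memory system to the CPU cores; one that falls behind is prioritized until it catches up.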

Key Observation: Not All Accelerators are Equal

• Long-deadline accelerators are more likely to meet their deadlines

• Short-deadline accelerators are more likely to miss their deadlines

Schedule short-deadline accelerators based on worst-case memory access time

Key Observation: Not All CPU cores are Equal

• Memory-intensive cores are much less vulnerable to interference

• Memory non-intensive cores are much more vulnerable to interference

Prioritize accelerators over memory-intensive cores to ensure accelerators do not become urgent

SQUASH: Key Ideas and Results

• Distribute priority for HWAs

• Prioritize HWAs over memory-intensive CPU cores even when not urgent

• Prioritize short-deadline-period HWAs based on worst case estimates
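
The three ideas above can be combined into one priority ordering, sketched below. The four-level ranking is an assumption: the slides state only that HWAs outrank memory-intensive CPU cores even when not urgent, so placing non-intensive CPU cores above non-urgent HWAs is an illustrative choice. `Agent`, `priority_rank`, and `short_deadline_urgent` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    kind: str                      # "hwa" or "cpu"
    urgent: bool = False           # only meaningful for HWAs
    memory_intensive: bool = False  # only meaningful for CPU cores


def short_deadline_urgent(remaining_requests, time_left, worst_case_latency):
    """Short-deadline HWAs: treat as urgent if the remaining requests,
    each taking the worst-case access time, would not fit in the time left."""
    return remaining_requests * worst_case_latency >= time_left


def priority_rank(agent):
    """Lower rank means higher priority (assumed ordering)."""
    if agent.kind == "hwa" and agent.urgent:
        return 0  # urgent accelerators first
    if agent.kind == "cpu" and not agent.memory_intensive:
        return 1  # interference-sensitive CPU cores
    if agent.kind == "hwa":
        return 2  # non-urgent accelerators
    return 3      # memory-intensive CPU cores


def schedule_order(agents):
    return [a.name for a in sorted(agents, key=priority_rank)]


agents = [
    Agent("cpu_heavy", "cpu", memory_intensive=True),
    Agent("hwa_relaxed", "hwa"),
    Agent("cpu_light", "cpu"),
    Agent("hwa_short", "hwa", urgent=short_deadline_urgent(10, 900, 100)),
]
order = schedule_order(agents)
```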

Improves CPU performance by 7-21%

Meets 99.9% of deadlines for HWAs