a model for application slowdown estimation in on-chip …safari/pubs/on-chip-network... · in...

25
A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*, Saugata Ghose , Onur Mutlu †§ , Nian-Feng Tzeng * * University of Louisiana at Lafayette Carnegie Mellon University § ETH Zürich

Upload: others

Post on 02-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving

System Fairness and Performance

Xiyue Xiang*, Saugata Ghose†, Onur Mutlu†§, Nian-Feng Tzeng*

*University of Louisiana at Lafayette†Carnegie Mellon University

§ETH Zürich

Page 2: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Executive Summary

2

Problem: inter-application interference in on-chip networks (NoCs)

In a multicore processor, interference can occur due to NoC contention

Interference causes applications to slow down unfairly

Goal: estimate NoC-level slowdown at runtime, and use slowdown information to

improve system fairness and performance

Our Approach

NoC Application Slowdown Model (NAS): first online model to quantify

inter-application interference in NoCs

Fairness-Aware Source Throttling (FAST): throttle network injection rate of

processor cores based on slowdown estimate from NAS

Results

NAS is very accurate and scalable: 4.2% error rate on average (8×8 mesh)

FAST improves system fairness by 9.5%, and performance by 5.2%

(compared to a baseline without source throttling on a 8×8 mesh)

Page 3: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Motivation: Interference in NoCs

3

0

1

2

3

lbm leslie3d mcf GemsFDTD

Slo

wd

ow

n

2.7×

1.6×

Interference slows down applications and increases system unfairness

16 copies of each application run concurrently on a 64-core processor

Root cause:

NoC bandwidth is shared𝒔𝒍𝒐𝒘𝒅𝒐𝒘𝒏 =

𝒕𝒊𝒏𝒕𝒆𝒓𝒇𝒆𝒓𝒆𝒏𝒄𝒆

𝒕𝒏𝒐_𝒊𝒏𝒕𝒆𝒓𝒇𝒆𝒓𝒆𝒏𝒄𝒆=𝒕𝒔𝒉𝒂𝒓𝒆𝒅𝒕𝒂𝒍𝒐𝒏𝒆

Page 4: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Challenges:

Flit-level delay ≠ slowdown

NAS: NoC Application Slowdown Model

4

Online estimation of ∆𝑡𝑠𝑡𝑎𝑙𝑙: application stall time due to interference

talone: unknown at runtime

𝑠𝑙𝑜𝑤𝑑𝑜𝑤𝑛 =𝑡𝑠ℎ𝑎𝑟𝑒𝑑𝑡𝑎𝑙𝑜𝑛𝑒

=𝑡𝑠ℎ𝑎𝑟𝑒𝑑

𝑡𝑠ℎ𝑎𝑟𝑒𝑑 − ∆𝑡𝑠𝑡𝑎𝑙𝑙

Node S Node D

request

response

Each request involves multiple packets

tshared: measured directly

Page 5: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Challenges:

Flit-level delay ≠ slowdown

Random and distributive

Overlapped delay

NAS: NoC Application Slowdown Model

4

tshared: measured directly talone: unknown at runtime

𝑠𝑙𝑜𝑤𝑑𝑜𝑤𝑛 =𝑡𝑠ℎ𝑎𝑟𝑒𝑑𝑡𝑎𝑙𝑜𝑛𝑒

=𝑡𝑠ℎ𝑎𝑟𝑒𝑑

𝑡𝑠ℎ𝑎𝑟𝑒𝑑 − ∆𝑡𝑠𝑡𝑎𝑙𝑙

Node S Node D

A packet is formed by multiple flits

Basic idea: track delay and calculate ∆tstall

Online estimation of ∆𝑡𝑠𝑡𝑎𝑙𝑙: application stall time due to interference

Page 6: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Flit-Level Interference

5

Three interference events

Injection

Virtual channel arbitration

Switch arbitration

Each flit carries an additional field ∆tflit

If arbitration loses, ∆tflit = ∆tflit + 1

Sum up arbitration delays due to interference

12 13 14 15

8 9 10 11

4 5 6 7

0 1 2 3

Node

Core

L1

Router

Shared LLC

Slice

MSHRs

Page 7: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Packet-Level Interference

6

∆ treassembly= 𝑻𝒍𝒂𝒔𝒕_𝒂𝒓𝒓𝒊𝒗𝒂𝒍 − 𝑻𝒇𝒊𝒓𝒔𝒕_𝒂𝒓𝒓𝒊𝒗𝒂𝒍 −𝑴

∆𝒕𝒑𝒂𝒄𝒌𝒆𝒕 = ∆𝒕𝒇𝒊𝒓𝒔𝒕_𝒇𝒍𝒊𝒕 + ∆𝒕𝒓𝒆𝒂𝒔𝒔𝒆𝒎𝒃𝒍𝒚

Tfirst_arrival =3 Tlast_arrival=11

treassembly = M cycles (M=5)

1 2 3 4 5

f1

∆𝑡𝑓𝑖𝑟𝑠𝑡_𝑓𝑙𝑖𝑡=2

Alone run:

Shared run:

f2 f3 f4 f5

f1 f2 f3 f4 f5

1 2 3 4 5 6 7 8 9 10 11

M-cycle reassembly

Packet’s flits arrive consecutively when there is no interference

∆𝑡𝑝𝑎𝑐𝑘𝑒𝑡 = 2 + 11 − 3 − 5 = 5 𝑐𝑦𝑐𝑙𝑒𝑠

∆𝑡𝑟𝑒𝑎𝑠𝑠𝑒𝑚𝑏𝑙𝑦

Track increase in packet reassembly time

Page 8: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Request-Level Interference

7

Node S Node D1 Request packet delayed by 5 cycles due

to inter-application interference

0

Leverage closed-loop packet behavior to accumulate ∆tpacket

Inheritance Table: lump sum of ∆tpacket for associated packets

Page 9: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Request-Level Interference

7

Node S Node D

NI LLC Slice

Inheritance Table

2 Register request packet info in

inheritance table (Δtpacket = 5)

4 Generate response packet,

inheriting Δtpacket from table

3 Cache

access

1 Request packet delayed by 5 cycles due

to inter-application interference

reqID mshrID Δtpacket......

...

0

Leverage closed-loop packet behavior to accumulate ∆tpacket

Inheritance Table: lump sum of ∆tpacket for associated packets

5

5

Page 10: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Request-Level Interference

7

Node S

3 Cache

access

1 Request packet delayed by 5 cycles due

to inter-application interference

5 Response packet delayed by 3 cycles

due to inter-application interference

5

Leverage closed-loop packet behavior to accumulate ∆tpacket

Inheritance Table: lump sum of ∆tpacket for associated packets

Node D

NI LLC Slice

Inheritance Table

2 Register request packet info in

inheritance table (Δtpacket = 5)

4 Generate response packet,

inheriting Δtpacket from table

3 Cache

accessreqID mshrID Δtpacket......

...

5

Page 11: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Request-Level Interference

7

∆𝒕𝒓𝒆𝒒𝒖𝒆𝒔𝒕= ∆𝒕𝒓𝒆𝒒𝒖𝒆𝒔𝒕_𝒑𝒂𝒄𝒌𝒆𝒕 + ∆𝒕𝒓𝒆𝒔𝒑𝒐𝒏𝒔𝒆_𝒑𝒂𝒄𝒌𝒆𝒕

Node S 1 Request packet delayed by 5 cycles due

to inter-application interference

5 Response packet delayed by 3 cycles

due to inter-application interference

Final valueof Δtpacket

is 8 cycles

Leverage closed-loop packet behavior to accumulate ∆tpacket

Inheritance Table: lump sum of ∆tpacket for associated packets

8

Sum up delays of all associated packets

3 Cache

access

Node D

NI LLC Slice

Inheritance Table

2 Register request packet info in

inheritance table (Δtpacket = 5)

4 Generate response packet,

inheriting Δtpacket from table

3 Cache

accessreqID mshrID Δtpacket......

...

5

Page 12: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Application Stall Time

8

A memory request becomes critical if

1) It is the oldest instruction at ROB and ROB is full, and/or

2) It is the oldest instruction at LSQ and LSQ is full when the next is a memory instruction

∆𝒕𝒔𝒕𝒂𝒍𝒍_𝒑𝒆𝒓_𝒓𝒆𝒒𝒖𝒆𝒔𝒕= 𝒎𝒊𝒏(𝑻𝒔𝒆𝒓𝒗𝒊𝒄𝒆 − 𝑻𝒄𝒓𝒊𝒕𝒊𝒄𝒂𝒍, 𝒕𝒓𝒆𝒒𝒖𝒆𝒔𝒕 )

For all critical requests

Latency is hidden App. stalls

ignored

Latency of critical request

Tcritical Tservice

∆𝒕𝒔𝒕𝒂𝒍𝒍_𝒑𝒆𝒓_𝒓𝒆𝒒𝒖𝒆𝒔𝒕

∆𝒕𝒔𝒕𝒂𝒍𝒍=

𝒊

∆𝒕𝒔𝒕𝒂𝒍𝒍_𝒑𝒆𝒓_𝒓𝒆𝒒𝒖𝒆𝒔𝒕,𝒊

ILP, MLP

Count only request delays on critical path of execution time

Page 13: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Using NAS to Improve Fairness

9

NAS provides online estimation of slowdown

Sum up flit-level arbitration delays due to interference

Track increase in packet reassembly time

Sum up delays of all associated packets

Determine which request delays causes application stall

Goal

Use NAS to improve system fairness and performance

FAST: Fairness-Aware Source Throttling

Page 14: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

A New Metric: NoC Stall-Time Criticality

10

𝑺𝑻𝑪𝒏𝒐𝒄 =𝒔𝒍𝒐𝒘𝒅𝒐𝒘𝒏

𝑳𝟏𝒎𝒊𝒔𝒔

NoC Stall-Time Criticality

FAST utilizes STCnoc to proactively estimate

the expected impact of each L1 miss

Lower STCnoc <==> Less sensitive to NoC-level interference

Good candidate to be throttled down

Interference in NoCs

has uneven impact

0

20

40

60

80

100

120

0.0

0.5

1.0

1.5

2.0

2.5

3.0

lbm leslie3d mcf GemsFDTD

Ne

two

rk I

nte

nsi

ty (

MP

KI)

Slo

wd

ow

n

Slowdown Network Intensity

1.0

Page 15: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Key Knobs of FAST

11

Rank based on slowdown

Classification based on network intensity

Latency-sensitive: spends more time in the core

Throughput-sensitive: network intensive

Throttle Up

Latency-sensitive applications: improve system performance

Slower applications: optimize system fairness

Throttle Down

Throughput sensitive application with lower STCnoc: reduce

interference with lower negative impact on performance

Avoid throttling down the slowest application

Page 16: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Methodology

12

Processor

Out-of-order, ROB / instruction window = 128

Caches

L1: 64KB, 16 MSHRs

L2: perfect shared

NoCs

Topology: 4×4 and 8×8 mesh

Router: conventional VC router with 8 VCs, 4 flits/VC

Workloads: multiprogrammed SPEC CPU2006

90 randomly-chosen workloads

Categorized by network intensity (i.e., MPKI)

Page 17: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

NAS is Accurate

0%

5%

10%

15%

Slo

wd

ow

nE

stim

ati

on

Err

or

4x4

8x8

31.7% 4.2%2.6%Network saturation

Slowdown estimation error: 4.2% (2.6%) for 8×8 (4×4)

Low estimated slowdown error consistently

Good scalabilityNAS is highly accurate and scalable

13

Page 18: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

FAST Improves Performance

14

(a) Mixed workloads (b) Heavy workloads

FAST has better performance than both HAT and NoST

Inter-application interference is reduced

Only throttles applications with low negative impact (i.e., lower STCnoc)

0.90

0.95

1.00

1.05

1.10

NoST HAT FAST NoST HAT FAST

4×4 8×8

No

rma

lize

dW

eig

hte

d S

pee

du

p

0.90

0.95

1.00

1.05

1.10

NoST HAT FAST NoST HAT FAST

4×4 8×8

No

rma

lize

dW

eig

hte

d S

pee

du

p

+5.2%+5.0%

Page 19: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

0.85

0.90

0.95

1.00

1.05

1.10

NoST HAT FAST NoST HAT FAST

4×4 8×8

No

rma

lize

dU

nfa

irn

ess

0.85

0.90

0.95

1.00

1.05

1.10

NoST HAT FAST NoST HAT FAST

4×4 8×8

No

rma

lize

dU

nfa

irn

ess

FAST Reduces Unfairness

15

- 4.7%

-9.5%

FAST can improve fairness

Source throttling allows slower applications to catch up

Uses runtime slowdown to identify and avoid throttling the slowest application

(a) Mixed workloads (b) Heavy workloads

Page 20: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Conclusion

16

Problem: inter-application interference in on-chip networks (NoCs)

In a multicore processor, interference can occur due to NoC contention

Interference causes applications to slow down unfairly

Goal: estimate NoC-level slowdown at runtime, and use slowdown information to

improve system fairness and performance

Our Approach

NoC Application Slowdown Model (NAS): first online model to quantify

inter-application interference in NoCs

Fairness-Aware Source Throttling (FAST): throttle network injection rate of

processor cores based on slowdown estimate from NAS

Results

NAS is very accurate and scalable: 4.2% error rate on average (8×8 mesh)

FAST improves system fairness by 9.5%, and performance by 5.2%

(compared to a baseline without source throttling on a 8×8 mesh)

Page 21: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving

System Fairness and Performance

Xiyue Xiang*, Saugata Ghose†, Onur Mutlu†§, Nian-Feng Tzeng*

*University of Louisiana at Lafayette†Carnegie Mellon University

§ETH Zürich

Page 22: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Backup Slides

Xiyue Xiang*, Saugata Ghose†, Onur Mutlu†§, Nian-Feng Tzeng*

*University of Louisiana at Lafayette†Carnegie Mellon University

§ETH Zürich

Page 23: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Related Works

19

Slowdown modeling Fine grained: [Mutlu+ MICRO ’07], [Ebrahimi+ ASPLOS ’10], [Bois+ TACO ’13]

Coarse grained: [Subramanian+ HPCA ’13], [Subramanian MICRO ’15]

Source throttling

[Chang+ SBAC-PAD ’12], [Nychis+ SIGCOMM ’12], [Nychis+ HotNet ’10]

Application mapping

[Chou+ ICCD ’08], [Das+ HPCA ’13]

Prioritization

[Das+ MICRO ’09], [Das ISCA ’10]

Scheduling [Kim+ MICRO’10]

QoS [Grot+ MICRO ’09], [Grot+ ISCA ’11], [Lee+ ISCA ’08]

Page 24: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

Hardware Cost of NAS

20

Location Components Costs

Router Interference delay of each flit 5.3% wider data path

NI

Timestamp of the first and last arrival flit of a packet

(16+16)×16 bits

Inheritance table (6+4+8)×20 bits

Core

Interference delay of the request 8 bits

Timestamp when processor stalls 16 bits

Estimated application stall time 16 bits

Total cost of NAS per node 114 Bytes + 5.3% router area

Page 25: A Model for Application Slowdown Estimation in On-Chip …safari/pubs/on-chip-network... · in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang*,

NAS Error Distribution

Plot 7,200 application instances

0%

10%

20%

30%

40%

50%

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Fra

ctio

n o

fA

pp

lica

tio

n I

nst

an

ces

Slowdown Estimation Error (Binned)

66.0% of application instances with < 10% error

84.3% of application instances with < 20% error

5.6% of application instances with ≥ 40% error

Plot 7,200 application instance

NAS exhibits high accuracy most of the time

21