a model for application slowdown estimation in on-chip …safari/pubs/on-chip-network... · in...

A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving

System Fairness and Performance

Xiyue Xiang*, Saugata Ghose†, Onur Mutlu†§, Nian-Feng Tzeng*

*University of Louisiana at Lafayette†Carnegie Mellon University

§ETH Zürich

http://www.louisiana.edu/


Executive Summary

2

Problem: inter-application interference in on-chip networks (NoCs)

In a multicore processor, interference can occur due to NoC contention

Interference causes applications to slow down unfairly

Goal: estimate NoC-level slowdown at runtime, and use slowdown information to

improve system fairness and performance

Our Approach

NoC Application Slowdown Model (NAS): first online model to quantify

inter-application interference in NoCs

Fairness-Aware Source Throttling (FAST): throttle network injection rate of

processor cores based on slowdown estimate from NAS

Results

NAS is very accurate and scalable: 4.2% error rate on average (8×8 mesh)

FAST improves system fairness by 9.5%, and performance by 5.2%

(compared to a baseline without source throttling on a 8×8 mesh)

Motivation: Interference in NoCs

3

0

1

2

3

lbm leslie3d mcf GemsFDTD

Slo

wd

ow

n

2.7×

1.6×

Interference slows down applications and increases system unfairness

16 copies of each application run concurrently on a 64-core processor

Root cause:

NoC bandwidth is shared𝒔𝒍𝒐𝒘𝒅𝒐𝒘𝒏 =

𝒕𝒊𝒏𝒕𝒆𝒓𝒇𝒆𝒓𝒆𝒏𝒄𝒆

𝒕𝒏𝒐_𝒊𝒏𝒕𝒆𝒓𝒇𝒆𝒓𝒆𝒏𝒄𝒆=𝒕𝒔𝒉𝒂𝒓𝒆𝒅𝒕𝒂𝒍𝒐𝒏𝒆

Challenges:

Flit-level delay ≠ slowdown

NAS: NoC Application Slowdown Model

4

Online estimation of ∆𝑡𝑠𝑡𝑎𝑙𝑙: application stall time due to interference

talone: unknown at runtime

𝑠𝑙𝑜𝑤𝑑𝑜𝑤𝑛 =𝑡𝑠ℎ𝑎𝑟𝑒𝑑𝑡𝑎𝑙𝑜𝑛𝑒

=𝑡𝑠ℎ𝑎𝑟𝑒𝑑

𝑡𝑠ℎ𝑎𝑟𝑒𝑑 − ∆𝑡𝑠𝑡𝑎𝑙𝑙

Node S Node D

request

response

Each request involves multiple packets

tshared: measured directly

Challenges:

Flit-level delay ≠ slowdown

Random and distributive

Overlapped delay

NAS: NoC Application Slowdown Model

4

tshared: measured directly talone: unknown at runtime

𝑠𝑙𝑜𝑤𝑑𝑜𝑤𝑛 =𝑡𝑠ℎ𝑎𝑟𝑒𝑑𝑡𝑎𝑙𝑜𝑛𝑒

=𝑡𝑠ℎ𝑎𝑟𝑒𝑑

𝑡𝑠ℎ𝑎𝑟𝑒𝑑 − ∆𝑡𝑠𝑡𝑎𝑙𝑙

Node S Node D

A packet is formed by multiple flits

Basic idea: track delay and calculate ∆tstall

Online estimation of ∆𝑡𝑠𝑡𝑎𝑙𝑙: application stall time due to interference

Flit-Level Interference

5

Three interference events

Injection

Virtual channel arbitration

Switch arbitration

Each flit carries an additional field ∆tflit

If arbitration loses, ∆tflit = ∆tflit + 1

Sum up arbitration delays due to interference

12 13 14 15

8 9 10 11

4 5 6 7

0 1 2 3

Node

Core

L1

Router

Shared LLC

Slice

MSHRs

Packet-Level Interference

6

∆ treassembly= 𝑻𝒍𝒂𝒔𝒕_𝒂𝒓𝒓𝒊𝒗𝒂𝒍 − 𝑻𝒇𝒊𝒓𝒔𝒕_𝒂𝒓𝒓𝒊𝒗𝒂𝒍 −𝑴

∆𝒕𝒑𝒂𝒄𝒌𝒆𝒕 = ∆𝒕𝒇𝒊𝒓𝒔𝒕_𝒇𝒍𝒊𝒕 + ∆𝒕𝒓𝒆𝒂𝒔𝒔𝒆𝒎𝒃𝒍𝒚

Tfirst_arrival =3 Tlast_arrival=11

treassembly = M cycles (M=5)

1 2 3 4 5

f1

∆𝑡𝑓𝑖𝑟𝑠𝑡_𝑓𝑙𝑖𝑡=2

Alone run:

Shared run:

f2 f3 f4 f5

f1 f2 f3 f4 f5

1 2 3 4 5 6 7 8 9 10 11

M-cycle reassembly

Packet’s flits arrive consecutively when there is no interference

∆𝑡𝑝𝑎𝑐𝑘𝑒𝑡 = 2 + 11 − 3 − 5 = 5 𝑐𝑦𝑐𝑙𝑒𝑠

∆𝑡𝑟𝑒𝑎𝑠𝑠𝑒𝑚𝑏𝑙𝑦

Track increase in packet reassembly time

Request-Level Interference

7

Node S Node D1 Request packet delayed by 5 cycles due

to inter-application interference

0

Leverage closed-loop packet behavior to accumulate ∆tpacket

Inheritance Table: lump sum of ∆tpacket for associated packets


7

Node S Node D

NI LLC Slice

Inheritance Table

2 Register request packet info in

inheritance table (Δtpacket = 5)

4 Generate response packet,

inheriting Δtpacket from table

3 Cache

access

1 Request packet delayed by 5 cycles due


reqID mshrID Δtpacket......

...

0



5

5


7

Node S

3 Cache

access

1 Request packet delayed by 5 cycles due


5 Response packet delayed by 3 cycles

due to inter-application interference

5



Node D

NI LLC Slice

Inheritance Table





3 Cache

accessreqID mshrID Δtpacket......

...

5


7

∆𝒕𝒓𝒆𝒒𝒖𝒆𝒔𝒕= ∆𝒕𝒓𝒆𝒒𝒖𝒆𝒔𝒕_𝒑𝒂𝒄𝒌𝒆𝒕 + ∆𝒕𝒓𝒆𝒔𝒑𝒐𝒏𝒔𝒆_𝒑𝒂𝒄𝒌𝒆𝒕

Node S 1 Request packet delayed by 5 cycles due


5 Response packet delayed by 3 cycles

due to inter-application interference

Final valueof Δtpacket

is 8 cycles



8

Sum up delays of all associated packets

3 Cache

access

Node D

NI LLC Slice

Inheritance Table





3 Cache

accessreqID mshrID Δtpacket......

...

5

Application Stall Time

8

A memory request becomes critical if

1) It is the oldest instruction at ROB and ROB is full, and/or

2) It is the oldest instruction at LSQ and LSQ is full when the next is a memory instruction

∆𝒕𝒔𝒕𝒂𝒍𝒍_𝒑𝒆𝒓_𝒓𝒆𝒒𝒖𝒆𝒔𝒕= 𝒎𝒊𝒏(𝑻𝒔𝒆𝒓𝒗𝒊𝒄𝒆 − 𝑻𝒄𝒓𝒊𝒕𝒊𝒄𝒂𝒍, 𝒕𝒓𝒆𝒒𝒖𝒆𝒔𝒕 )

For all critical requests

Latency is hidden App. stalls

ignored

Latency of critical request

Tcritical Tservice

∆𝒕𝒔𝒕𝒂𝒍𝒍_𝒑𝒆𝒓_𝒓𝒆𝒒𝒖𝒆𝒔𝒕

∆𝒕𝒔𝒕𝒂𝒍𝒍=

𝒊

∆𝒕𝒔𝒕𝒂𝒍𝒍_𝒑𝒆𝒓_𝒓𝒆𝒒𝒖𝒆𝒔𝒕,𝒊

ILP, MLP

Count only request delays on critical path of execution time

Using NAS to Improve Fairness

9

NAS provides online estimation of slowdown

Sum up flit-level arbitration delays due to interference

Track increase in packet reassembly time

Sum up delays of all associated packets

Determine which request delays causes application stall

Goal

Use NAS to improve system fairness and performance

FAST: Fairness-Aware Source Throttling

A New Metric: NoC Stall-Time Criticality

10

𝑺𝑻𝑪𝒏𝒐𝒄 =𝒔𝒍𝒐𝒘𝒅𝒐𝒘𝒏

𝑳𝟏𝒎𝒊𝒔𝒔

NoC Stall-Time Criticality

FAST utilizes STCnoc to proactively estimate

the expected impact of each L1 miss

Lower STCnoc <==> Less sensitive to NoC-level interference

Good candidate to be throttled down

Interference in NoCs

has uneven impact

0

20

40

60

80

100

120

0.0

0.5

1.0

1.5

2.0

2.5

3.0

lbm leslie3d mcf GemsFDTD

Ne

two

rk I

nte

nsi

ty (

MP

KI)

Slo

wd

ow

n

Slowdown Network Intensity

1.0

Key Knobs of FAST

11

Rank based on slowdown

Classification based on network intensity

Latency-sensitive: spends more time in the core

Throughput-sensitive: network intensive

Throttle Up

Latency-sensitive applications: improve system performance

Slower applications: optimize system fairness

Throttle Down

Throughput sensitive application with lower STCnoc: reduce

interference with lower negative impact on performance

Avoid throttling down the slowest application

Methodology

12

Processor

Out-of-order, ROB / instruction window = 128

Caches

L1: 64KB, 16 MSHRs

L2: perfect shared

NoCs

Topology: 4×4 and 8×8 mesh

Router: conventional VC router with 8 VCs, 4 flits/VC

Workloads: multiprogrammed SPEC CPU2006

90 randomly-chosen workloads

Categorized by network intensity (i.e., MPKI)

NAS is Accurate

0%

5%

10%

15%

Slo

wd

ow

nE

stim

ati

on

Err

or

4x4

8x8

31.7% 4.2%2.6%Network saturation

Slowdown estimation error: 4.2% (2.6%) for 8×8 (4×4)

Low estimated slowdown error consistently

Good scalabilityNAS is highly accurate and scalable

13

FAST Improves Performance

14

(a) Mixed workloads (b) Heavy workloads

FAST has better performance than both HAT and NoST

Inter-application interference is reduced

Only throttles applications with low negative impact (i.e., lower STCnoc)

0.90

0.95

1.00

1.05

1.10

NoST HAT FAST NoST HAT FAST

4×4 8×8

No

rma

lize

dW

eig

hte

d S

pee

du

p

0.90

0.95

1.00

1.05

1.10


4×4 8×8

No

rma

lize

dW

eig

hte

d S

pee

du

p

+5.2%+5.0%

0.85

0.90

0.95

1.00

1.05

1.10


4×4 8×8

No

rma

lize

dU

nfa

irn

ess

0.85

0.90

0.95

1.00

1.05

1.10


4×4 8×8

No

rma

lize

dU

nfa

irn

ess

FAST Reduces Unfairness

15

- 4.7%

-9.5%

FAST can improve fairness

Source throttling allows slower applications to catch up

Uses runtime slowdown to identify and avoid throttling the slowest application

(a) Mixed workloads (b) Heavy workloads

Conclusion

16

Problem: inter-application interference in on-chip networks (NoCs)

In a multicore processor, interference can occur due to NoC contention

Interference causes applications to slow down unfairly

Goal: estimate NoC-level slowdown at runtime, and use slowdown information to

improve system fairness and performance

Our Approach

NoC Application Slowdown Model (NAS): first online model to quantify

inter-application interference in NoCs

Fairness-Aware Source Throttling (FAST): throttle network injection rate of

processor cores based on slowdown estimate from NAS

Results

NAS is very accurate and scalable: 4.2% error rate on average (8×8 mesh)

FAST improves system fairness by 9.5%, and performance by 5.2%

(compared to a baseline without source throttling on a 8×8 mesh)

A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving

System Fairness and Performance



§ETH Zürich



Backup Slides



§ETH Zürich



Related Works

19

Slowdown modeling Fine grained: [Mutlu+ MICRO ’07], [Ebrahimi+ ASPLOS ’10], [Bois+ TACO ’13]

Coarse grained: [Subramanian+ HPCA ’13], [Subramanian MICRO ’15]

Source throttling

[Chang+ SBAC-PAD ’12], [Nychis+ SIGCOMM ’12], [Nychis+ HotNet ’10]

Application mapping

[Chou+ ICCD ’08], [Das+ HPCA ’13]

Prioritization

[Das+ MICRO ’09], [Das ISCA ’10]

Scheduling [Kim+ MICRO’10]

QoS [Grot+ MICRO ’09], [Grot+ ISCA ’11], [Lee+ ISCA ’08]

Hardware Cost of NAS

20

Location Components Costs

Router Interference delay of each flit 5.3% wider data path

NI

Timestamp of the first and last arrival flit of a packet

(16+16)×16 bits

Inheritance table (6+4+8)×20 bits

Core

Interference delay of the request 8 bits

Timestamp when processor stalls 16 bits

Estimated application stall time 16 bits

Total cost of NAS per node 114 Bytes + 5.3% router area

NAS Error Distribution

Plot 7,200 application instances

0%

10%

20%

30%

40%

50%

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Fra

ctio

n o

fA

pp

lica

tio

n I

nst

an

ces

Slowdown Estimation Error (Binned)

66.0% of application instances with < 10% error

84.3% of application instances with < 20% error

5.6% of application instances with ≥ 40% error

Plot 7,200 application instance

NAS exhibits high accuracy most of the time

21

a model for application slowdown estimation in on-chip …safari/pubs/on-chip-network... · in...

Documents