a model for application slowdown estimation in on-chip …safari/pubs/on-chip-network... · in...
TRANSCRIPT
A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving
System Fairness and Performance
Xiyue Xiang*, Saugata Ghose†, Onur Mutlu†§, Nian-Feng Tzeng*
*University of Louisiana at Lafayette†Carnegie Mellon University
§ETH Zürich
Executive Summary
2
Problem: inter-application interference in on-chip networks (NoCs)
In a multicore processor, interference can occur due to NoC contention
Interference causes applications to slow down unfairly
Goal: estimate NoC-level slowdown at runtime, and use slowdown information to
improve system fairness and performance
Our Approach
NoC Application Slowdown Model (NAS): first online model to quantify
inter-application interference in NoCs
Fairness-Aware Source Throttling (FAST): throttle network injection rate of
processor cores based on slowdown estimate from NAS
Results
NAS is very accurate and scalable: 4.2% error rate on average (8×8 mesh)
FAST improves system fairness by 9.5%, and performance by 5.2%
(compared to a baseline without source throttling on a 8×8 mesh)
Motivation: Interference in NoCs
3
0
1
2
3
lbm leslie3d mcf GemsFDTD
Slo
wd
ow
n
2.7×
1.6×
Interference slows down applications and increases system unfairness
16 copies of each application run concurrently on a 64-core processor
Root cause:
NoC bandwidth is shared𝒔𝒍𝒐𝒘𝒅𝒐𝒘𝒏 =
𝒕𝒊𝒏𝒕𝒆𝒓𝒇𝒆𝒓𝒆𝒏𝒄𝒆
𝒕𝒏𝒐_𝒊𝒏𝒕𝒆𝒓𝒇𝒆𝒓𝒆𝒏𝒄𝒆=𝒕𝒔𝒉𝒂𝒓𝒆𝒅𝒕𝒂𝒍𝒐𝒏𝒆
Challenges:
Flit-level delay ≠ slowdown
NAS: NoC Application Slowdown Model
4
Online estimation of ∆𝑡𝑠𝑡𝑎𝑙𝑙: application stall time due to interference
talone: unknown at runtime
𝑠𝑙𝑜𝑤𝑑𝑜𝑤𝑛 =𝑡𝑠ℎ𝑎𝑟𝑒𝑑𝑡𝑎𝑙𝑜𝑛𝑒
=𝑡𝑠ℎ𝑎𝑟𝑒𝑑
𝑡𝑠ℎ𝑎𝑟𝑒𝑑 − ∆𝑡𝑠𝑡𝑎𝑙𝑙
Node S Node D
request
response
Each request involves multiple packets
tshared: measured directly
Challenges:
Flit-level delay ≠ slowdown
Random and distributive
Overlapped delay
NAS: NoC Application Slowdown Model
4
tshared: measured directly talone: unknown at runtime
𝑠𝑙𝑜𝑤𝑑𝑜𝑤𝑛 =𝑡𝑠ℎ𝑎𝑟𝑒𝑑𝑡𝑎𝑙𝑜𝑛𝑒
=𝑡𝑠ℎ𝑎𝑟𝑒𝑑
𝑡𝑠ℎ𝑎𝑟𝑒𝑑 − ∆𝑡𝑠𝑡𝑎𝑙𝑙
Node S Node D
A packet is formed by multiple flits
Basic idea: track delay and calculate ∆tstall
Online estimation of ∆𝑡𝑠𝑡𝑎𝑙𝑙: application stall time due to interference
Flit-Level Interference
5
Three interference events
Injection
Virtual channel arbitration
Switch arbitration
Each flit carries an additional field ∆tflit
If arbitration loses, ∆tflit = ∆tflit + 1
Sum up arbitration delays due to interference
12 13 14 15
8 9 10 11
4 5 6 7
0 1 2 3
Node
Core
L1
Router
Shared LLC
Slice
MSHRs
Packet-Level Interference
6
∆ treassembly= 𝑻𝒍𝒂𝒔𝒕_𝒂𝒓𝒓𝒊𝒗𝒂𝒍 − 𝑻𝒇𝒊𝒓𝒔𝒕_𝒂𝒓𝒓𝒊𝒗𝒂𝒍 −𝑴
∆𝒕𝒑𝒂𝒄𝒌𝒆𝒕 = ∆𝒕𝒇𝒊𝒓𝒔𝒕_𝒇𝒍𝒊𝒕 + ∆𝒕𝒓𝒆𝒂𝒔𝒔𝒆𝒎𝒃𝒍𝒚
Tfirst_arrival =3 Tlast_arrival=11
treassembly = M cycles (M=5)
1 2 3 4 5
f1
∆𝑡𝑓𝑖𝑟𝑠𝑡_𝑓𝑙𝑖𝑡=2
Alone run:
Shared run:
f2 f3 f4 f5
f1 f2 f3 f4 f5
1 2 3 4 5 6 7 8 9 10 11
M-cycle reassembly
Packet’s flits arrive consecutively when there is no interference
∆𝑡𝑝𝑎𝑐𝑘𝑒𝑡 = 2 + 11 − 3 − 5 = 5 𝑐𝑦𝑐𝑙𝑒𝑠
∆𝑡𝑟𝑒𝑎𝑠𝑠𝑒𝑚𝑏𝑙𝑦
Track increase in packet reassembly time
Request-Level Interference
7
Node S Node D1 Request packet delayed by 5 cycles due
to inter-application interference
0
Leverage closed-loop packet behavior to accumulate ∆tpacket
Inheritance Table: lump sum of ∆tpacket for associated packets
Request-Level Interference
7
Node S Node D
NI LLC Slice
Inheritance Table
2 Register request packet info in
inheritance table (Δtpacket = 5)
4 Generate response packet,
inheriting Δtpacket from table
3 Cache
access
1 Request packet delayed by 5 cycles due
to inter-application interference
reqID mshrID Δtpacket......
...
0
Leverage closed-loop packet behavior to accumulate ∆tpacket
Inheritance Table: lump sum of ∆tpacket for associated packets
5
5
Request-Level Interference
7
Node S
3 Cache
access
1 Request packet delayed by 5 cycles due
to inter-application interference
5 Response packet delayed by 3 cycles
due to inter-application interference
5
Leverage closed-loop packet behavior to accumulate ∆tpacket
Inheritance Table: lump sum of ∆tpacket for associated packets
Node D
NI LLC Slice
Inheritance Table
2 Register request packet info in
inheritance table (Δtpacket = 5)
4 Generate response packet,
inheriting Δtpacket from table
3 Cache
accessreqID mshrID Δtpacket......
...
5
Request-Level Interference
7
∆𝒕𝒓𝒆𝒒𝒖𝒆𝒔𝒕= ∆𝒕𝒓𝒆𝒒𝒖𝒆𝒔𝒕_𝒑𝒂𝒄𝒌𝒆𝒕 + ∆𝒕𝒓𝒆𝒔𝒑𝒐𝒏𝒔𝒆_𝒑𝒂𝒄𝒌𝒆𝒕
Node S 1 Request packet delayed by 5 cycles due
to inter-application interference
5 Response packet delayed by 3 cycles
due to inter-application interference
Final valueof Δtpacket
is 8 cycles
Leverage closed-loop packet behavior to accumulate ∆tpacket
Inheritance Table: lump sum of ∆tpacket for associated packets
8
Sum up delays of all associated packets
3 Cache
access
Node D
NI LLC Slice
Inheritance Table
2 Register request packet info in
inheritance table (Δtpacket = 5)
4 Generate response packet,
inheriting Δtpacket from table
3 Cache
accessreqID mshrID Δtpacket......
...
5
Application Stall Time
8
A memory request becomes critical if
1) It is the oldest instruction at ROB and ROB is full, and/or
2) It is the oldest instruction at LSQ and LSQ is full when the next is a memory instruction
∆𝒕𝒔𝒕𝒂𝒍𝒍_𝒑𝒆𝒓_𝒓𝒆𝒒𝒖𝒆𝒔𝒕= 𝒎𝒊𝒏(𝑻𝒔𝒆𝒓𝒗𝒊𝒄𝒆 − 𝑻𝒄𝒓𝒊𝒕𝒊𝒄𝒂𝒍, 𝒕𝒓𝒆𝒒𝒖𝒆𝒔𝒕 )
For all critical requests
Latency is hidden App. stalls
ignored
Latency of critical request
Tcritical Tservice
∆𝒕𝒔𝒕𝒂𝒍𝒍_𝒑𝒆𝒓_𝒓𝒆𝒒𝒖𝒆𝒔𝒕
∆𝒕𝒔𝒕𝒂𝒍𝒍=
𝒊
∆𝒕𝒔𝒕𝒂𝒍𝒍_𝒑𝒆𝒓_𝒓𝒆𝒒𝒖𝒆𝒔𝒕,𝒊
ILP, MLP
Count only request delays on critical path of execution time
Using NAS to Improve Fairness
9
NAS provides online estimation of slowdown
Sum up flit-level arbitration delays due to interference
Track increase in packet reassembly time
Sum up delays of all associated packets
Determine which request delays causes application stall
Goal
Use NAS to improve system fairness and performance
FAST: Fairness-Aware Source Throttling
A New Metric: NoC Stall-Time Criticality
10
𝑺𝑻𝑪𝒏𝒐𝒄 =𝒔𝒍𝒐𝒘𝒅𝒐𝒘𝒏
𝑳𝟏𝒎𝒊𝒔𝒔
NoC Stall-Time Criticality
FAST utilizes STCnoc to proactively estimate
the expected impact of each L1 miss
Lower STCnoc <==> Less sensitive to NoC-level interference
Good candidate to be throttled down
Interference in NoCs
has uneven impact
0
20
40
60
80
100
120
0.0
0.5
1.0
1.5
2.0
2.5
3.0
lbm leslie3d mcf GemsFDTD
Ne
two
rk I
nte
nsi
ty (
MP
KI)
Slo
wd
ow
n
Slowdown Network Intensity
1.0
Key Knobs of FAST
11
Rank based on slowdown
Classification based on network intensity
Latency-sensitive: spends more time in the core
Throughput-sensitive: network intensive
Throttle Up
Latency-sensitive applications: improve system performance
Slower applications: optimize system fairness
Throttle Down
Throughput sensitive application with lower STCnoc: reduce
interference with lower negative impact on performance
Avoid throttling down the slowest application
Methodology
12
Processor
Out-of-order, ROB / instruction window = 128
Caches
L1: 64KB, 16 MSHRs
L2: perfect shared
NoCs
Topology: 4×4 and 8×8 mesh
Router: conventional VC router with 8 VCs, 4 flits/VC
Workloads: multiprogrammed SPEC CPU2006
90 randomly-chosen workloads
Categorized by network intensity (i.e., MPKI)
NAS is Accurate
0%
5%
10%
15%
Slo
wd
ow
nE
stim
ati
on
Err
or
4x4
8x8
31.7% 4.2%2.6%Network saturation
Slowdown estimation error: 4.2% (2.6%) for 8×8 (4×4)
Low estimated slowdown error consistently
Good scalabilityNAS is highly accurate and scalable
13
FAST Improves Performance
14
(a) Mixed workloads (b) Heavy workloads
FAST has better performance than both HAT and NoST
Inter-application interference is reduced
Only throttles applications with low negative impact (i.e., lower STCnoc)
0.90
0.95
1.00
1.05
1.10
NoST HAT FAST NoST HAT FAST
4×4 8×8
No
rma
lize
dW
eig
hte
d S
pee
du
p
0.90
0.95
1.00
1.05
1.10
NoST HAT FAST NoST HAT FAST
4×4 8×8
No
rma
lize
dW
eig
hte
d S
pee
du
p
+5.2%+5.0%
0.85
0.90
0.95
1.00
1.05
1.10
NoST HAT FAST NoST HAT FAST
4×4 8×8
No
rma
lize
dU
nfa
irn
ess
0.85
0.90
0.95
1.00
1.05
1.10
NoST HAT FAST NoST HAT FAST
4×4 8×8
No
rma
lize
dU
nfa
irn
ess
FAST Reduces Unfairness
15
- 4.7%
-9.5%
FAST can improve fairness
Source throttling allows slower applications to catch up
Uses runtime slowdown to identify and avoid throttling the slowest application
(a) Mixed workloads (b) Heavy workloads
Conclusion
16
Problem: inter-application interference in on-chip networks (NoCs)
In a multicore processor, interference can occur due to NoC contention
Interference causes applications to slow down unfairly
Goal: estimate NoC-level slowdown at runtime, and use slowdown information to
improve system fairness and performance
Our Approach
NoC Application Slowdown Model (NAS): first online model to quantify
inter-application interference in NoCs
Fairness-Aware Source Throttling (FAST): throttle network injection rate of
processor cores based on slowdown estimate from NAS
Results
NAS is very accurate and scalable: 4.2% error rate on average (8×8 mesh)
FAST improves system fairness by 9.5%, and performance by 5.2%
(compared to a baseline without source throttling on a 8×8 mesh)
A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving
System Fairness and Performance
Xiyue Xiang*, Saugata Ghose†, Onur Mutlu†§, Nian-Feng Tzeng*
*University of Louisiana at Lafayette†Carnegie Mellon University
§ETH Zürich
Backup Slides
Xiyue Xiang*, Saugata Ghose†, Onur Mutlu†§, Nian-Feng Tzeng*
*University of Louisiana at Lafayette†Carnegie Mellon University
§ETH Zürich
Related Works
19
Slowdown modeling Fine grained: [Mutlu+ MICRO ’07], [Ebrahimi+ ASPLOS ’10], [Bois+ TACO ’13]
Coarse grained: [Subramanian+ HPCA ’13], [Subramanian MICRO ’15]
Source throttling
[Chang+ SBAC-PAD ’12], [Nychis+ SIGCOMM ’12], [Nychis+ HotNet ’10]
Application mapping
[Chou+ ICCD ’08], [Das+ HPCA ’13]
Prioritization
[Das+ MICRO ’09], [Das ISCA ’10]
Scheduling [Kim+ MICRO’10]
QoS [Grot+ MICRO ’09], [Grot+ ISCA ’11], [Lee+ ISCA ’08]
Hardware Cost of NAS
20
Location Components Costs
Router Interference delay of each flit 5.3% wider data path
NI
Timestamp of the first and last arrival flit of a packet
(16+16)×16 bits
Inheritance table (6+4+8)×20 bits
Core
Interference delay of the request 8 bits
Timestamp when processor stalls 16 bits
Estimated application stall time 16 bits
Total cost of NAS per node 114 Bytes + 5.3% router area
NAS Error Distribution
Plot 7,200 application instances
0%
10%
20%
30%
40%
50%
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Fra
ctio
n o
fA
pp
lica
tio
n I
nst
an
ces
Slowdown Estimation Error (Binned)
66.0% of application instances with < 10% error
84.3% of application instances with < 20% error
5.6% of application instances with ≥ 40% error
Plot 7,200 application instance
NAS exhibits high accuracy most of the time
21