TRANSCRIPT
Department of Computer Science and Engineering, The Pennsylvania State University
Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das
Addressing End-to-End Memory Access Latency in NoC Based Multicores
Outline
Introduction and Motivation
Details of the Proposed Schemes
Implementation
Experimental Setup and Results
Target System
Tiled multicore architecture
Mesh NoC
Shared, banked L2 cache (S-NUCA)
Memory controllers (MCs)
[Figure: each node contains a core, L1, L2 bank, and router; nodes are connected by communication links, with four MCs attached to the mesh]
Components of Memory Latency
Many components add to end-to-end memory access latency
[Figure: the five components of an access: (1) L1 to L2 request, (2) L2 to MC request, (3) memory service at the MC, (4) MC to L2 response, (5) L2 to L1 response; request and response messages travel over the NoC]
End-to-end Memory Latency Distribution
Significant contribution from network
Higher contribution for longer latencies
Motivation: reduce the contribution from the network and make delays more uniform
[Figure: (left) per-component delay breakdown (L1 to L2, L2 to Mem, Mem, Mem to L2, L2 to L1) across delay ranges from 150-200 to 650-700 cycles; (right) distribution of end-to-end delay, fraction of total accesses vs. delay in cycles, with the average marked]
Out-of-Order Execution and MLP
OoO execution: Many memory requests in flight
The instruction window advances when the oldest instruction commits
A memory access with a long delay blocks the instruction window, degrading performance
[Figure: instruction window (begin to end) with Load-A through Load-D in flight; some loads hit in L1 or L2 while others miss and traverse the network; the window advances only when the oldest load commits]
Memory Bank Utilization
Large idle times; more banks means more idle time
Variation in queue length: some queues occupied, some empty
Motivation: utilize banks better and improve memory performance
[Figure: (left) mesh with four MCs and requests R-0, R-1, R-2 in flight; (right) idleness of banks 1-16, ranging from about 0.70 to 0.90, with per-bank queues shown for Bank 0, Bank 1, and Bank 2]
Proposed Schemes
Scheme 1
Identify and expedite “late” memory response messages
Reduce the NoC latency component
Provide more uniform end-to-end memory latency
Scheme 2
Identify and expedite memory request messages targeting idle memory banks
Improve memory bank utilization
Improve memory performance
Scheme 1
Based on the first motivation: messages with high latency can be problematic, and the NoC is a significant contributor, so expedite them on the network
Prioritization: higher priority to “late” messages
Response (return path) only; why?
Request messages: not enough information yet
Response messages: easier to classify as late
Bypassing the pipeline: merge stages in the router to reduce latency
Scheme 1: Calculating Age
Age = “so-far” delay of a message (12 bits)
Part of the 128-bit header flit; no extra flit needed (assuming 12 bits are available)
Updated locally at each router and MC; no global clock needed
Frequency taken into account; DVFS at routers/nodes supported
age = age + (cycles_current − cycles_message_entry) × FREQ_MULT / local_frequency
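The per-hop age update can be sketched as follows. The formula, the 12-bit field, and the FREQ_MULT scaling come from the slide; the concrete frequency units and the saturating behavior at the 12-bit limit are assumptions for illustration:

```python
# Minimal sketch of the age update carried in the header flit.
# FREQ_MULT normalizes locally measured cycles to a reference frequency
# so routers/nodes running under DVFS accumulate comparable "age".
FREQ_MULT = 1000  # assumed: reference frequency in MHz

def update_age(age, cycles_current, cycles_message_entry, local_frequency):
    """Add the message's residence time at this hop, scaled by
    FREQ_MULT / local_frequency, to the 12-bit age field."""
    elapsed = cycles_current - cycles_message_entry
    age += elapsed * FREQ_MULT // local_frequency
    return min(age, 0xFFF)  # assumed: saturate at the 12-bit maximum

# 40 local cycles at the reference frequency add 40 to the age;
# at half frequency the same 40 local cycles count as 80.
age_full = update_age(0, 140, 100, local_frequency=1000)
age_half = update_age(0, 140, 100, local_frequency=500)
```

Saturation (rather than wraparound) is chosen here so that a very old message can never appear young after overflow.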
Scheme 1: Example
[Figure: mesh with MC0-MC3; core-1, through its L1 and L2, sends a request to MC1]
MC1 receives request from core-1
R0 is the response message
MC1 updates the age field, adding memory queuing/service delay
Uses the age to decide whether the response is “late”
If late, injects it into the network as “high-priority”
Scheme 1: Late Threshold Calculation
Cores: continuously calculate the average round-trip delay, convert it into an average “so-far” delay and then into a lateness threshold, and periodically send it to the MCs
MCs: record the threshold values and use them to decide whether a response is “late”
Each application is treated independently: the goal is uniform latency for each core/app, not uniform latency across the whole system
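The MC-side decision described above can be sketched as below. The per-core thresholds and the lateness test follow the slide; the class shape, the 1.2x multiplier on the average delay, and all names are illustrative assumptions:

```python
# Sketch of MC-side "late" classification. Each core periodically
# reports a threshold derived from its own average round-trip delay;
# the MC flags a response as late when its so-far age exceeds the
# threshold of the core that issued it.
LATE_FACTOR = 1.2  # assumed multiplier on the average delay

class MemoryController:
    def __init__(self):
        self.threshold = {}  # core id -> reported lateness threshold

    def report_threshold(self, core_id, delay_avg):
        # Cores convert their average round-trip delay into a threshold.
        self.threshold[core_id] = LATE_FACTOR * delay_avg

    def is_late(self, core_id, age):
        # Compare against this core's own threshold, not a system-wide
        # one, so each core/app sees uniform latency independently.
        return age > self.threshold.get(core_id, float("inf"))

mc = MemoryController()
mc.report_threshold(core_id=1, delay_avg=300)
late = mc.is_late(1, age=400)     # past 1.2 * 300 = 360 cycles
on_time = mc.is_late(1, age=350)
```

Using a per-core threshold is what makes the latency uniform per application rather than across the whole system.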
[Figure: distributions of round-trip delay and “so-far” delay (fraction of total accesses vs. delay in cycles), annotated with Delay_avg, Delay_so-far-avg, and the lateness Threshold]
Scheme 2: Improving Bank Utilization
Based on the second motivation: high idleness at memory banks, uneven utilization
Improving bank utilization using network prioritization
Problem: no global information available
Approach: prioritize at the routers using router-local information
Bank History Table per router: number of requests sent to each bank in the last T cycles
If a message targets an idle bank, prioritize it
Route a diverse set of requests to all banks; keep all banks busy
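The Bank History Table can be sketched as below. The per-router request counts over the last T cycles and the "prioritize if the target bank looks idle" rule are from the slide; the data-structure layout and the exact idleness test (zero recent requests) are assumptions:

```python
# Sketch of a per-router Bank History Table. The router counts the
# requests it forwarded to each bank within a sliding T-cycle window;
# a request heading to a bank with no recent traffic is presumed idle
# and is given network priority.
from collections import deque

T = 200  # default history window in cycles (per the sensitivity study)

class BankHistoryTable:
    def __init__(self, num_banks, window=T):
        self.window = window
        self.history = deque()        # (cycle, bank) of forwarded requests
        self.count = [0] * num_banks  # requests per bank within the window

    def _expire(self, now):
        # Drop entries older than the window and decrement their counts.
        while self.history and now - self.history[0][0] >= self.window:
            _, bank = self.history.popleft()
            self.count[bank] -= 1

    def record(self, now, bank):
        self._expire(now)
        self.history.append((now, bank))
        self.count[bank] += 1

    def should_prioritize(self, now, bank):
        # Assumed idleness test: no request sent to this bank in the
        # last `window` cycles from this router.
        self._expire(now)
        return self.count[bank] == 0

bht = BankHistoryTable(num_banks=16)
bht.record(now=0, bank=3)
```

Because each router sees only its own traffic, this stays router-local: no global bank-occupancy information is needed.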
Network Prioritization Implementation
Routing: 5-stage router pipeline; flit-buffering; virtual channel (VC) credit-based flow control; messages split into several flits; wormhole routing
Our method: a message to expedite gets higher priority in VC and SW arbitrations
Employ pipeline bypassing [Kumar07]: fewer stages, packing 4 cycles into 1 cycle
[Figure: baseline 5-stage router pipeline (BW, RC, VA, SA, ST) vs. pipeline bypassing (setup, ST)]
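The priority rule in the arbitrations can be sketched as below. That expedited messages win VC/SW arbitration is from the slide; the oldest-first tie-breaking among equals and all names are assumptions, and this is a behavioral sketch, not the router RTL:

```python
# Sketch of a switch-allocation arbiter: flagged "expedite" flits win
# over normal ones; within a class, the oldest waiting flit is assumed
# to go first.
def arbitrate(candidates):
    """candidates: list of (flit_id, expedite: bool, age: int).
    Returns the flit granted the output port this cycle."""
    # Priority class first (True > False), then larger age.
    return max(candidates, key=lambda c: (c[1], c[2]))[0]

grant = arbitrate([("A", False, 90), ("B", True, 10), ("C", False, 50)])
```

Here flit B wins despite being the youngest, because it carries the expedite flag.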
Experimental Setup
Simulator: GEMS (Simics+Opal+Ruby)
[Figure: 4x8 mesh NoC with per-node core, L1, L2 bank, and router; MCs attached to the mesh]
Cores: 32 OoO cores, 128-entry instruction window, 64-entry LSQ
L1: 32KB, 64B/line, 3-cycle latency
L2: 512KB per bank, 64B/line, 10-cycle latency, 1 bank/node (32 banks total)
Routers: 5-stage, 128-bit flit size, 6-flit buffers, 4 VCs per port, X-Y routing
Memory: DDR-800, bus multiplier = 5, bank busy time = 22 cycles, rank delay = 2 cycles, read-write delay = 3 cycles, memory controller latency = 20 cycles, 16 banks per MC, 4 MCs
Experimental Setup
Benchmarks from SPEC CPU2006 [Henning06]
Applications categorized by memory intensity (L2 MPKI): high vs. low memory intensity
18 multiprogrammed workloads, 32 applications each
Workload categories:
WL 1-6: Mixed (50% high intensity, 50% low intensity)
WL 7-12: Memory intensive (100% high intensity)
WL 13-18: Memory non-intensive (100% low intensity)
1-1 application-to-core mapping
Metric
WeightedSpeedup = WS = Σ_i IPC_i(together) / IPC_i(alone)
NormalizedWS = WS(optimized) / WS(baseline)
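The metric above can be computed as in the sketch below; the IPC values are illustrative, not measured results:

```python
# Weighted speedup as defined on this slide: each application's IPC
# when co-scheduled, normalized by its IPC when running alone, summed
# over all applications.
def weighted_speedup(ipc_together, ipc_alone):
    """WS = sum over applications of IPC_i(together) / IPC_i(alone)."""
    return sum(t / a for t, a in zip(ipc_together, ipc_alone))

ipc_alone = [2.0, 1.0, 0.5]           # illustrative standalone IPCs
ipc_together = [1.0, 0.8, 0.4]        # slower when co-scheduled
ws_baseline = weighted_speedup(ipc_together, ipc_alone)
ws_optimized = weighted_speedup([1.2, 0.9, 0.45], ipc_alone)
normalized_ws = ws_optimized / ws_baseline  # > 1 means improvement
```

Normalizing the optimized WS by the baseline WS yields the NormalizedWS reported in the result plots.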
Experimental Results
[Figure: normalized WS of Scheme-1 and Scheme-1 + Scheme-2 for mixed workloads (w-1 to w-6), high-intensity workloads (w-7 to w-12), and low-intensity workloads (w-13 to w-18), with annotated average improvements between 6% and 15%]
Higher-intensity benchmarks benefit more from Scheme 1: more traffic means more “late” messages
w-2 and w-9 degrade: prioritizing some messages hurts other messages
Experimental Results
Cumulative distribution of latencies (8 threads of WL-1): the 90% point of the delay is reduced from ~700 cycles to ~600 cycles
Probability density: accesses move from region 1 to region 2, but not all can be moved
[Figure: CDFs of total delay before and after Scheme-1, and the delay distribution before vs. after, showing fewer accesses with high delays as accesses shift from region 1 to region 2 in the new distribution]
Experimental Results
Reduction in idleness of banks
Dynamic behavior: Scheme-2 reduces the idleness consistently over time
[Figure: (left) per-bank idleness for banks 1-16, default vs. Scheme-2; (right) average idleness over 100k-cycle time intervals, default vs. Scheme-2]
Sensitivity Analysis
Analyze the sensitivity of results to system parameters; experimented with different values
System parameters:
Lateness threshold
Bank History Table history length
Number of memory controllers
Number of cores
Number of router stages
Sensitivity Analysis – “Late” Threshold
Scheme 1: threshold to determine whether a message is late; the default is a fixed multiple of Delay_avg
Reduced threshold: more messages considered late; too many prioritized messages can hurt other messages
Increased threshold: fewer messages considered late; can miss opportunities
[Figure: normalized WS for mixed workloads (w-1 to w-6) with lateness thresholds of 1.1x, 1.2x, and 1.4x Delay_avg]
Sensitivity Analysis – History Length
Scheme 2: history kept at the routers for the past T cycles; default T=200 cycles
Shorter history (T=100 cycles): cannot identify idle banks precisely
Longer history (T=400 cycles): fewer requests prioritized
[Figure: normalized WS for mixed workloads with history lengths T=100, T=200, and T=400]
Sensitivity Analysis – Number of MCs
Fewer MCs means more pressure on each MC and higher queuing latency
More late requests, so more room for Scheme 1
Less idleness at banks, so less room for Scheme 2
[Figure: normalized WS for mixed workloads with 4 MCs vs. 2 MCs; slightly higher improvements with 2 MCs]
Sensitivity Analysis – 16 Cores
Scheme-1 + Scheme-2 achieves 8%, 10%, and 5% speedup; about 5% less than with 32 cores
Gains grow with the number of cores: higher network latency leaves more room for our optimizations
[Figure: normalized WS of Scheme-1 and Scheme-1 + Scheme-2 with 16 cores, for mixed (w-1 to w-6), high-intensity (w-7 to w-12), and low-intensity (w-13 to w-18) workloads]
Sensitivity Analysis – Router Stages
NoC latency depends on the number of stages in the routers; with 2-stage instead of 5-stage routers, Scheme 1+2 still achieves ~7% average speedup on mixed workloads
[Figure: normalized weighted speedup for mixed workloads with 5-stage vs. 2-stage router pipelines]
Summary
Identified:
Some memory accesses suffer long network delays and block the cores
Bank utilization is low and uneven
Proposed two schemes:
1. Network prioritization and pipeline bypassing on “late” memory response messages to expedite them
2. Network prioritization of memory request messages to improve bank utilization
Demonstrated:
Scheme 1 achieves 6%, 10%, and 11% speedup
Scheme 1+2 achieves 10%, 13%, and 15% speedup
Questions?
Thank you for attending this presentation.