
Department of Computer Science and Engineering, The Pennsylvania State University

Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das

Addressing End-to-End Memory Access Latency in NoC Based Multicores

Outline

Introduction and Motivation

Details of the Proposed Schemes

Implementation

Experimental Setup and Results

Target System

- Tiled multicore architecture
- Mesh NoC
- Shared, banked L2 cache (S-NUCA)
- Memory controllers (MCs)

[Figure: tiled mesh layout; each node contains a core, L1, L2 bank, and router, connected by communication links, with four MCs on the edges]

Components of Memory Latency

Many components add to end-to-end memory access latency

[Figure: a memory access broken into five numbered steps: request message from L1 to the L2 bank, request from L2 to MC1, memory service, response message from MC1 to L2, and response from L2 back to L1]

End-to-end Memory Latency Distribution

- Significant contribution from the network
- Higher contribution for longer latencies
- Motivation: reduce the network's contribution and make delays more uniform

[Figure: average delay (cycles) broken down by component (L1 to L2, L2 to Mem, Mem, Mem to L2, L2 to L1) across delay ranges from 150 to 700 cycles; histogram of the fraction of total accesses by delay (100–900 cycles), with the average marked]

Out-of-Order Execution and MLP

- OoO execution: many memory requests in flight
- Instruction window: when the oldest instruction commits, the window advances
- A memory access with a long delay can block the instruction window, degrading performance

[Figure: an instruction window holding Load-A through Load-D; some loads hit in L1 or L2 while others miss and cross the network, and the window cannot commit past the oldest outstanding miss]
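The head-of-line blocking described above can be sketched with a toy commit model (ours, not the authors'; all latencies are illustrative):

```python
# Toy model of an in-order-commit instruction window: even with many
# loads in flight (MLP), commit stalls until the oldest load returns.
def commit_cycles(load_latencies):
    """Loads issue together at cycle 0; commit is in program order, one
    per cycle once a load's data is back. Returns the final commit cycle."""
    done = 0  # cycle at which the previous load committed
    for lat in load_latencies:
        done = max(done + 1, lat)  # cannot commit before data returns
    return done

# A single long-latency miss (700 cycles) dominates, even though the
# other three loads hit quickly -- the window is blocked behind it.
print(commit_cycles([700, 10, 10, 10]))   # -> 703
print(commit_cycles([100, 10, 10, 10]))   # expedited miss -> 103
```

This is why expediting the few "late" accesses matters more than average latency: the oldest outstanding miss sets the pace for the whole window.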

Memory Bank Utilization

- Large idle times; more banks means more idle time
- Variation in queue length: some queues occupied, some queues empty
- Motivation: utilize banks better and improve memory performance

[Figure: MC request queues R-0 through R-2 with uneven occupancy; per-bank idleness between 0.70 and 0.90 across 16 banks]

Proposed Schemes

Scheme 1
- Identify and expedite “late” memory response messages
- Reduce the NoC latency component
- Provide more uniform end-to-end memory latency

Scheme 2
- Identify and expedite memory request messages targeting idle memory banks
- Improve memory bank utilization
- Improve memory performance

Scheme 1

- Based on the first motivation: messages with high latency can be problematic, and the NoC is a significant contributor, so expedite them on the network
- Prioritization: higher priority to “late” messages, on the response (return) path only. Why? Request messages carry too little information yet; response messages are easier to classify as late
- Pipeline bypassing: merge stages in the router to reduce latency

Scheme 1: Calculating Age

- Age = “so-far” delay of a message, stored in 12 bits
- Part of the 128-bit header flit; no extra flit needed (assuming 12 bits are available)
- Updated locally at each router and MC; no global clock needed
- Frequency is taken into account, so DVFS at routers/nodes is supported

age = age + (cycles_current − cycles_message_entry) × FREQ_MULT / local_frequency
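The per-hop update can be sketched as follows (a minimal sketch; the function name and the FREQ_MULT value are ours, and we assume the 12-bit field simply wraps on overflow):

```python
# Sketch of the per-hop age update: each router/MC adds the cycles the
# message spent locally, scaled to a common reference time base so that
# nodes running at different DVFS frequencies agree on "age".
FREQ_MULT = 1000  # reference-frequency multiplier (illustrative value)

def update_age(age, cycles_now, cycles_entry, local_freq):
    """age travels in the 12-bit header field; local cycle counts are
    converted to reference-clock units via FREQ_MULT / local_freq."""
    elapsed = cycles_now - cycles_entry
    age += (elapsed * FREQ_MULT) // local_freq
    return age & 0xFFF  # keep within the 12-bit field (wraps on overflow)

# A message that waited 40 local cycles at a router running at half the
# reference frequency accumulates twice as many reference cycles.
print(update_age(100, 140, 100, local_freq=500))  # -> 180
```

Because each hop only reads its own cycle counter, no global clock is required, matching the slide's claim.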

Scheme 1: Example

[Figure: mesh with MC0–MC3; core-1 issues a request that travels through its L1 and an L2 bank to MC1]

- MC1 receives the request from core-1; R0 is the response message
- MC1 updates the age field, adding memory queuing/service time
- MC1 uses the age to decide whether R0 is “late” and, if so, marks it “high-priority”
- R0 is injected into the network as “high-priority”

Scheme 1: Late Threshold Calculation

- Cores: continuously calculate the average round-trip delay, convert it into a lateness threshold and a corresponding “so-far” threshold, and periodically send these to the MCs
- MCs: record the values and use them to decide whether a message is “late”
- Each application is treated independently: the goal is not uniform latency across the whole system, but uniform latency for each core/application

[Figure: distribution of round-trip delay with Delay_avg and the lateness Threshold marked; distribution of “so-far” delay with Delay_so-far-avg and the corresponding threshold]
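One way to realize the core/MC split above is a running average on the core side and a simple comparison on the MC side. This is a sketch under stated assumptions: the exponential moving average and the 1.2× multiplier are our choices for illustration (1.2× Delay_avg is among the values evaluated in the sensitivity analysis), not the paper's exact mechanism.

```python
# Sketch of per-application lateness tracking. The core maintains an
# average round-trip delay and periodically ships a threshold to the
# MCs; the MC compares a response's so-far age against it.
class LatenessTracker:
    def __init__(self, alpha=0.1, mult=1.2):
        self.avg = 0.0      # running average round-trip delay (cycles)
        self.alpha = alpha  # EWMA smoothing factor (assumed)
        self.mult = mult    # threshold = mult * average delay (assumed)

    def record_round_trip(self, delay):
        # Exponentially weighted moving average of observed delays.
        self.avg += self.alpha * (delay - self.avg)

    def threshold(self):
        # Value the core would periodically send to the MCs.
        return self.mult * self.avg

def is_late(age_so_far, threshold):
    """MC-side check: mark the response high-priority if its so-far
    delay already exceeds the per-application threshold."""
    return age_so_far > threshold

t = LatenessTracker()
for d in [300, 320, 280, 900]:   # mostly typical, one outlier access
    t.record_round_trip(d)
print(is_late(500, t.threshold()))
```

Keeping the tracker per application is what yields uniform latency per core/app rather than across the whole system.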

Scheme 2: Improving Bank Utilization

- Based on the second motivation: high idleness at memory banks and uneven utilization
- Improve bank utilization using network prioritization. Problem: no global information is available. Approach: prioritize at the routers using router-local information
- Bank History Table per router: tracks the number of requests sent to each bank in the last T cycles; if a message targets an idle bank, prioritize it
- The effect is to route a diverse set of requests to all banks and keep all banks busy
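The Bank History Table can be sketched as a sliding-window counter per bank (our reading of the slide, not the authors' code; structure and names are assumptions):

```python
# Sketch of a per-router Bank History Table: count requests forwarded
# to each bank over the last T cycles and boost requests that target a
# bank this router has not fed recently (likely idle).
from collections import deque

class BankHistoryTable:
    def __init__(self, num_banks=16, window=200):
        self.window = window            # T cycles of history (default 200)
        self.counts = [0] * num_banks   # requests per bank within window
        self.events = deque()           # (cycle, bank) pairs, oldest first

    def _expire(self, now):
        # Drop events that fell out of the T-cycle window.
        while self.events and now - self.events[0][0] >= self.window:
            _, bank = self.events.popleft()
            self.counts[bank] -= 1

    def record(self, now, bank):
        self._expire(now)
        self.events.append((now, bank))
        self.counts[bank] += 1

    def should_prioritize(self, now, bank):
        """Prioritize a request whose target bank saw no traffic from
        this router in the last T cycles."""
        self._expire(now)
        return self.counts[bank] == 0

bht = BankHistoryTable()
bht.record(10, 3)
print(bht.should_prioritize(50, 3))    # recently used bank -> False
print(bht.should_prioritize(50, 7))    # untouched bank     -> True
print(bht.should_prioritize(250, 3))   # history expired    -> True
```

Note the decision uses only router-local state, consistent with the "no global information" constraint above.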

Network Prioritization Implementation

- Routing: 5-stage router pipeline; flit-buffered, virtual channel (VC), credit-based flow control; messages split into several flits; wormhole routing
- Our method: a message to expedite gets higher priority in VC and switch (SW) arbitrations, and employs pipeline bypassing [Kumar07] to traverse fewer stages, packing 4 cycles into 1

[Figure: baseline pipeline (BW, RC, VA, SA, ST) vs. pipeline bypassing (setup, ST)]
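Priority-aware arbitration can be illustrated with a single-output-port toy (a sketch, not the router's actual arbiter; tie-breaking by arrival age is our assumption):

```python
# Sketch of priority-aware switch arbitration: among flits competing
# for an output port this cycle, high-priority ("late") flits win;
# ties fall back to oldest-first.
def arbitrate(candidates):
    """candidates: list of (flit_id, high_priority, arrival_cycle).
    Returns the flit_id granted the output port."""
    # Sort key: high-priority flits first (not c[1] is False for them),
    # then earliest arrival among equals.
    winner = min(candidates, key=lambda c: (not c[1], c[2]))
    return winner[0]

flits = [("req-A", False, 5), ("resp-late", True, 9), ("req-B", False, 2)]
print(arbitrate(flits))  # "resp-late" wins despite arriving last
```

The same priority bit that wins arbitration is what makes a flit eligible for pipeline bypassing, so a “late” response spends one cycle per hop instead of five.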

Experimental Setup

Simulator: GEMS (Simics + Opal + Ruby)

- Cores: 32 OoO cores; 128-entry instruction window; 64-entry LSQ
- L1: 32 KB, 64 B/line, 3-cycle latency
- L2: 512 KB/bank, 64 B/line, 10-cycle latency, 1 bank/node (32 banks total)
- NoC: 4x8 mesh; 5-stage routers; 128-bit flits; 6-flit buffers; 4 VCs per port; X-Y routing
- Memory: DDR-800; memory bus multiplier = 5; bank busy time = 22 cycles; rank delay = 2 cycles; read-write delay = 3 cycles; memory controller latency = 20 cycles; 16 banks per MC, 4 MCs

Experimental Setup

- Benchmarks from SPEC CPU2006 [Henning06], categorized by memory intensity (L2 MPKI): high vs. low memory intensity
- 18 multiprogrammed workloads, 32 applications each:
  - WL 1-6: mixed (50% high intensity, 50% low intensity)
  - WL 7-12: memory intensive (100% high intensity)
  - WL 13-18: memory non-intensive (100% low intensity)
- 1-to-1 application-to-core mapping

Metric:

Weighted Speedup = WS = sum over i of IPC_i(together) / IPC_i(alone)
Normalized WS = WS(optimized) / WS(baseline)
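The metric above computes directly (the IPC numbers in the example are illustrative, not from the talk):

```python
# Weighted speedup: each application's IPC when co-running is
# normalized by its IPC when running alone, and the ratios are summed.
def weighted_speedup(ipc_together, ipc_alone):
    return sum(t / a for t, a in zip(ipc_together, ipc_alone))

def normalized_ws(ws_optimized, ws_baseline):
    return ws_optimized / ws_baseline

# Two applications: co-running slows both, optimization recovers some.
base = weighted_speedup([0.8, 0.5], [1.0, 1.0])   # 1.3
opt  = weighted_speedup([0.9, 0.6], [1.0, 1.0])   # 1.5
print(normalized_ws(opt, base))                    # > 1.0 means speedup
```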

Experimental Results

[Figure: normalized WS of Scheme-1 and Scheme-1 + Scheme-2 for mixed workloads (w-1 to w-6), high-intensity workloads (w-7 to w-12), and low-intensity workloads (w-13 to w-18), with improvements labeled between 6% and 15%]

- Higher-intensity benchmarks benefit more from Scheme 1: more traffic means more “late” messages
- w-2 and w-9 degrade: prioritizing some messages hurts other messages

Experimental Results

- Cumulative distribution of latencies for 8 threads of WL-1: the 90% point of the delay is reduced from ~700 cycles to ~600 cycles
- Probability density moves from region 1 (high delays) to region 2; not all accesses can be moved

[Figure: CDFs of total delay before and after Scheme-1; delay histogram showing the new distribution shifted from region 1 to region 2, with fewer accesses at high delays]

Experimental Results

- Reduction in the idleness of banks
- Dynamic behavior: Scheme-2 reduces idleness consistently over time

[Figure: per-bank idleness under the default scheme vs. Scheme-2; average idleness per 100k-cycle interval, consistently lower with Scheme-2]

Sensitivity Analysis

- Analyzed sensitivity of the results to system parameters by experimenting with different values
- Parameters: lateness threshold, Bank History Table history length, number of memory controllers, number of cores, number of router stages

Sensitivity Analysis – “Late” Threshold

- Scheme 1: threshold that determines whether a message is late, set as a multiple of Delay_avg (1.1x, 1.2x, and 1.4x evaluated)
- Reduced threshold: more messages considered late; prioritizing too many messages can hurt other messages
- Increased threshold: fewer messages considered late; opportunities can be missed

[Figure: normalized WS for mixed workloads w-1 to w-6 with thresholds of 1.1x, 1.2x, and 1.4x Delay_avg]

Sensitivity Analysis – History Length

- Scheme 2: history kept at the routers for the past T cycles; default T = 200 cycles
- Shorter history (T = 100 cycles): cannot identify idle banks precisely
- Longer history (T = 400 cycles): fewer requests get prioritized

[Figure: normalized WS for mixed workloads with T = 100, 200, and 400 cycles]

Sensitivity Analysis – Number of MCs

- Fewer MCs mean more pressure on each MC and higher queuing latency
- More late requests: more room for Scheme 1
- Less idleness at the banks: less room for Scheme 2

[Figure: normalized WS for mixed workloads with 4 MCs vs. 2 MCs; slightly higher improvements with 2 MCs]

Sensitivity Analysis – 16 Cores

- Scheme-1 + Scheme-2: 8%, 10%, and 5% speedups, about 5% less than with 32 cores
- The benefit grows with the number of cores: more cores mean higher network latency, leaving more room for our optimizations

[Figure: normalized WS of Scheme-1 and Scheme-1 + Scheme-2 on 16 cores for mixed, high-intensity, and low-intensity workloads]

Sensitivity Analysis – Router Stages

- NoC latency depends on the number of stages in the routers: 5-stage vs. 2-stage routers
- Scheme 1+2 speedup is ~7% on average for mixed workloads

[Figure: normalized weighted speedup for mixed workloads with 5-stage vs. 2-stage router pipelines]

Summary

Identified
- Some memory accesses suffer long network delays and block the cores
- Bank utilization is low and uneven

Proposed two schemes
1. Network prioritization and pipeline bypassing of “late” memory response messages to expedite them
2. Network prioritization of memory request messages to improve bank utilization

Demonstrated
- Scheme 1 achieves 6%, 10%, and 11% speedups
- Scheme 1+2 achieves 10%, 13%, and 15% speedups

Questions?

Department of Computer Science and Engineering, The Pennsylvania State University

Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das

Addressing End-to-End Memory Access Latency in NoC Based Multicores

Thank you for attending this presentation.


References

[Kumar07] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, “Express Virtual Channels: Towards the Ideal Interconnection Fabric,” in ISCA, 2007.

[Henning06] J. L. Henning, “SPEC CPU2006 Benchmark Descriptions,” SIGARCH Computer Architecture News, 2006.