Handling the Problems and Opportunities Posed by Multiple
On-Chip Memory Controllers
Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian, Al Davis
University of Utah
Takeaway
• Multiple on-chip MCs will be common in future CMPs, with multiple cores sharing one MC
  – NUMA memory hierarchies across multiple sockets
  – Intelligent data mapping is required to reduce average memory access delay
• A hardware-software co-design approach is required for efficient data placement
  – Minimal software involvement
• Data placement needs to be aware of system parameters
  – Row-buffer hit rates, queuing delays, physical proximity, etc.
NUMA - Today
[Figure: Conceptual representation of a four-socket Nehalem machine. Each socket holds four cores and one on-chip memory controller (MC) that drives its own DIMMs over a memory channel; sockets are connected by the QPI interconnect, and socket boundaries are marked.]
NUMA - Future
[Figure: Future CMPs with multiple on-chip MCs. A 16-core chip (Core 1 through Core 16, each with a private L2$) is connected by an on-chip interconnect to four on-chip memory controllers (MC1-MC4), each driving its own DIMMs over a memory channel.]
Local Memory Access
• Accessing local memory is fast!!

[Figure: Four-socket system. A core on Socket 1 sends the address to its own socket's MC and the data returns from the locally attached DIMMs, never crossing a socket boundary.]
Problem 1 - Remote Memory Access
• Data for Core N can be anywhere!

[Figure: Four-socket system. A core on Socket 1 issues an address, but the request must travel over the inter-socket interconnect to the MC of a remote socket.]
[Figure, continued: The data is read from the remote socket's DIMMs and returns to the requesting core over the inter-socket interconnect, adding to the access latency.]
Memory Access Stream – Single Core

[Figure: Memory controller request queue. Nearly every entry belongs to the same program (Prog 1 on CPU 0), with only an occasional request from a context-switched program (Prog 2).]

• A single core executed only a handful of context-switched programs.
• Spatio-temporal locality can be exploited!!
Problem 2 - Memory Access Stream - CMPs

[Figure: Memory controller request queue. Requests from many different programs and cores (Prog 0 on CPU 0 through Prog 6 on CPU 6) are mixed together.]

• Memory accesses from different cores get interleaved, leading to loss of spatio-temporal locality.
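To make the loss concrete, the following small sketch (mine, not from the talk; the row size, per-core strides and request ordering are illustrative assumptions) models a single DRAM bank with an open-page policy and compares the row-buffer hit rate of one core's back-to-back stream against the same requests interleaved across four cores.

# Illustrative model of row-buffer locality under interleaving (assumptions mine).
ROW_SIZE = 8 * 1024  # assume an 8 KB row buffer

def row_buffer_hit_rate(addresses):
    """Fraction of accesses that hit the currently open row (open-page policy)."""
    open_row, hits = None, 0
    for addr in addresses:
        row = addr // ROW_SIZE
        if row == open_row:
            hits += 1
        open_row = row
    return hits / len(addresses)

# Four cores, each streaming sequentially (64 B lines) through its own region.
streams = [[core * 1_000_000 + 64 * i for i in range(1024)] for core in range(4)]

# Single-core-like stream: one program's requests arrive back to back.
solo = [a for s in streams for a in s]

# CMP stream: requests from the four cores interleave at the controller.
interleaved = [s[i] for i in range(1024) for s in streams]

print(f"back-to-back hit rate: {row_buffer_hit_rate(solo):.2f}")         # ~0.99
print(f"interleaved hit rate:  {row_buffer_hit_rate(interleaved):.2f}")  # ~0.00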
Problem 3 – Increased Overheads for Memory Accesses

[Figure: Queuing delays increase sharply when going from 1 core/1 thread to 16 cores/16 threads.]
Problem 4 – Pin Limitations

[Figure: A 16-core chip drawn once with 8 peripheral MCs and once with 16, suggesting that the MC count cannot simply be scaled up with the core count.]

• Pin bandwidth is limited: the number of MCs cannot grow with the number of cores.
• A small number of MCs will have to handle all the traffic.
Problems Summary - I
• Pin limitations imply an increase in queuing delay
  – Almost an 8x increase in queuing delay from 1 core/1 thread to 16 cores/16 threads
• Multi-core implies an increase in row-buffer interference
  – Increasingly randomized memory access stream
  – Row-buffer hit rates are bound to go down
• Longer on- and off-chip wire delays imply an increase in the NUMA factor (the ratio of remote to local memory access latency)
  – The NUMA factor is already around 1.5 today
Problems Summary - II
• DRAM access time in systems with multiple on-chip MCs is governed by
  – Distance between the requesting core and the responding MC
  – Load on the on-chip interconnect
  – Average queuing delay at the responding MC
  – Bank and rank contention at the target DIMM
  – Row-buffer hit rate at the responding MC

Bottom line: intelligent management of data is required.
Adaptive First Touch Policy
• Basic idea: assign each new virtual page to a DRAM (physical) page belonging to the MC j that minimizes the following cost function:

  cost_j = α × load_j + β × rowhits_j + λ × distance_j

  where load_j is a measure of the queuing delay at MC j, rowhits_j a measure of the locality (row-buffer hits) at its DRAM, and distance_j a measure of its physical proximity to the requesting core.
• The constants α, β and λ can be made programmable.
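A minimal sketch of the first-touch decision (my own pseudocode, not the authors' implementation; the per-MC statistics and the exact meaning of each term are illustrative assumptions):

# Adaptive First Touch: map a newly touched virtual page to the MC j that
# minimizes  cost_j = alpha*load_j + beta*rowhits_j + lambda*distance_j.
from dataclasses import dataclass

@dataclass
class MCStats:
    load: int      # proxy for queuing delay at this MC (e.g. request-queue occupancy)
    rowhits: int   # row-buffer locality metric reported by this MC
    distance: int  # on-chip hops from the faulting core to this MC

# Illustrative weights; the talk notes these constants can be made programmable
# (the evaluation uses alpha, beta, lambda = 10, 20, 100).
ALPHA, BETA, LAMBDA_ = 10, 20, 100

def aft_cost(mc: MCStats) -> int:
    return ALPHA * mc.load + BETA * mc.rowhits + LAMBDA_ * mc.distance

def first_touch_mc(mcs: dict) -> int:
    """Return the id of the MC with the minimum cost for a new page."""
    return min(mcs, key=lambda j: aft_cost(mcs[j]))

# Hypothetical snapshot of four on-chip MCs as seen by the faulting core.
mcs = {1: MCStats(load=40, rowhits=5, distance=1),
       2: MCStats(load=10, rowhits=2, distance=3),
       3: MCStats(load=25, rowhits=8, distance=2),
       4: MCStats(load=60, rowhits=1, distance=1)}
print("assign new page to MC", first_touch_mc(mcs))   # MC 2 in this example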
Dynamic Page Migration Policy
• Programs change phases!!
  – They can completely stop touching new pages
  – They can change the frequency of access to a subset of pages
• This leads to imbalance in MC accesses
  – For long-running programs with varying working sets, AFT can lead to some MCs getting overloaded
• Solution: dynamically migrate pages between MCs at runtime to reduce the imbalance
[Figure: 16-core CMP with four on-chip MCs. One MC (MC3) is heavily loaded and becomes the donor while the other MCs are lightly loaded; N pages are selected at the donor, a recipient MC is selected, and the N pages are copied from the donor to the recipient.]
Dynamic Page Migration Policy - Challenges
• Selecting the recipient MC
  – Move pages to the MC with the least value of the cost function below
  – Goal: move pages to a physically proximal MC while minimizing interference at the recipient MC

  cost_k = Λ × distance_k + Γ × rowhits_k

• Selecting N, the number of pages to migrate
  – Empirically select the best possible value
  – Can also be made programmable
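The selection logic could look roughly like this (a sketch under my own assumptions: the "hottest pages first" heuristic and the per-MC statistics are illustrative, not taken from the paper):

# Dynamic page migration: pick a recipient MC k that minimizes
#   cost_k = Lambda*distance_k + Gamma*rowhits_k
# (physically proximal, with little row-buffer activity to disturb), and pick
# N pages to move away from the overloaded donor MC.
LAMBDA_MIG, GAMMA_MIG = 100, 100   # evaluation values; programmable in principle

def pick_recipient(donor_id, mc_stats):
    """mc_stats: MC id -> {'distance': hops (assumed: from the donor), 'rowhits': locality metric}."""
    candidates = {k: v for k, v in mc_stats.items() if k != donor_id}
    return min(candidates,
               key=lambda k: LAMBDA_MIG * candidates[k]["distance"]
                           + GAMMA_MIG * candidates[k]["rowhits"])

def pick_pages(access_counts, n):
    """Assumed heuristic: migrate the N most-accessed pages mapped to the donor."""
    return sorted(access_counts, key=access_counts.get, reverse=True)[:n]

# Hypothetical state: MC3 is the overloaded donor, N = 2 pages per migration epoch.
mc_stats = {1: {"distance": 2, "rowhits": 7},
            2: {"distance": 1, "rowhits": 3},
            4: {"distance": 1, "rowhits": 9}}
access_counts = {0x1A000: 120, 0x2B000: 45, 0x3C000: 300}
print(pick_recipient(3, mc_stats), pick_pages(access_counts, 2))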
Dynamic Page Migration Policy - Overheads
• Pages are physically copied to new addresses
  – The original address mapping has to be invalidated
  – Cache lines belonging to the copied pages must be invalidated
• Copying pages can block resources, leading to unnecessary stalls
• Immediate TLB invalidates could force requests to miss all the way to memory even though the data is still present
• Solution: Lazy Copying
  – Essentially, a delayed write-back
Issues with TLB Invalidates

[Figure: Pages A and B are copied from the donor MC to the recipient MC while TLB invalidates are sent to cores 1, 3, 5 and 12; a read of A through the new mapping (A') arrives before the copy is complete and must stall in the OS.]
Lazy Copying

[Figure: Pages A and B are copied from the donor MC to the recipient MC. Instead of TLB invalidates, the cores' existing translations are marked read-only; the OS flushes dirty cache lines, and a read of A is still serviced through the old mapping. Only after the copy completes are the TLBs updated, after which reads use the new location A'.]
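As a rough illustration of the protocol above (the state names, the read-only downgrade and the single-page granularity are my simplifications), lazy copying can be viewed as a small state machine per migrating page:

# Lazy copying sketch: the old mapping stays usable (read-only) while the page
# is copied, and TLBs are only updated after the copy completes, avoiding the
# OS stall caused by eager TLB invalidates.
from enum import Enum, auto

class PageState(Enum):
    MAPPED = auto()     # normal mapping behind the donor MC
    COPYING = auto()    # old translation downgraded to read-only, copy in flight
    REMAPPED = auto()   # TLBs updated to the recipient MC's copy

class MigratingPage:
    def __init__(self, vpage, donor_frame):
        self.vpage, self.frame, self.state = vpage, donor_frame, PageState.MAPPED

    def start_copy(self, recipient_frame):
        # Downgrade existing translations to read-only (no shootdown yet) and
        # flush dirty cache lines so the copy sees up-to-date data.
        self.new_frame, self.state = recipient_frame, PageState.COPYING

    def translate(self):
        # Reads during the copy are still serviced from the old (donor) frame.
        return self.new_frame if self.state is PageState.REMAPPED else self.frame

    def finish_copy(self):
        # Copy complete: broadcast TLB updates that point at the recipient frame.
        self.frame, self.state = self.new_frame, PageState.REMAPPED

page = MigratingPage(vpage=0x7F0, donor_frame=0x1A000)
page.start_copy(recipient_frame=0x9C000)
assert page.translate() == 0x1A000   # reads proceed, no OS stall
page.finish_copy()
assert page.translate() == 0x9C000   # later accesses use the new location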
Methodology
• Simics-based simulation platform
• DRAMSim-based DRAM timing
• DRAM energy figures from CACTI 6.5
• Baseline: assign pages to the closest MC

CPU: 16-core out-of-order CMP, 3 GHz
L1 Inst. and Data Cache: private, 32 KB/2-way, 1-cycle access
L2 Unified Cache: shared, 2 MB/8-way, 4x4 S-NUCA, 3-cycle bank access
Total DRAM Capacity: 4 GB
DIMM Configuration: 8 DIMMs, 1 rank/DIMM, 64-bit channel, 8 devices/DIMM
α, β, λ, Λ, Γ: 10, 20, 100, 100, 100
Results - Throughput
• AFT: 17.1%, Dynamic Page Migration: 34.8% (average improvement over the baseline)
Results – DRAM Locality
• AFT: 16.6%, Dynamic Page Migration: 22.7%
• Standard deviation goes down: increased fairness
Results – Reasons for Benefits
Sensitivity Studies
• Lazy Copying does help, a little
  – Average 3.2% improvement over migration without lazy copying
• Terms/variables in the cost function
  – Very sensitive to load and row-buffer hit rates, not as much to distance
• Cost of TLB shootdowns
  – Negligible, since they are fairly uncommon
• Physical placement of MCs – center or peripheral
  – Most workloads are agnostic to physical placement
Summary
• Multiple on-chip MCs will be common in future CMPs, with multiple cores sharing one MC
  – Intelligent data mapping will be needed to reduce average memory access delay
• Adaptive First Touch policy
  – Increases performance by 17.1%
  – Decreases DRAM energy consumption by 14.1%
• Dynamic page migration, an improvement on AFT
  – Further improvement of 17.7% over AFT, 34.8% over baseline
  – Increases energy consumption by 5.2%
Thank You
http://www.cs.utah.edu/arch-research