Vinodh Cuppu and Bruce Jacob, University of Maryland
Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance?
Richard Wells
ECE 7810
April 21, 2009
Reservations
The paper is old: presented at ISCA 2001
Only considers uniprocessor systems
Some of the conclusions, while valid, are focused on the authors' research goals
Papers relating to our group's project are not prevalent in recent years, except one already presented at the architecture reading club
Overview
Investigate DRAM-system organization parameters to determine the bottleneck
Determine synergy or antagonism between groups of parameters
Empirically determine the optimal DRAM system configuration
Methodologies to increase system performance
Concurrent transactions
Reducing latency
Reducing system overhead
Previous approaches to reduce memory system overhead
DRAM component:
Increase bandwidth: the current "tack" taken by the PC industry
Reduce DRAM latency:
ESDRAM: an SRAM cache of the full row buffer allows precharge to begin immediately after an access
FCRAM: subdivides an internal bank by activating only a portion of each wordline
Previous approaches to reduce memory system overhead (cont.)
FCRAM (continued): reducing wordline capacitance brings the access time down to about 30 ns (as of 2001)
MoSys: subdivides storage into a large number of very small banks, reducing the latency of the DRAM core to nearly that of SRAM
VCDRAM: a set-associative SRAM buffer that holds a number of sub-pages
The Jump
DRAM-oriented approaches do reduce application execution time
Because even zero-latency DRAM does not reduce memory-system overhead to zero, bus transactions are considered as well
Other factors considered:
Turnaround time
Queuing delays
Inefficiencies due to asymmetric read/write requests
In a multiprocessor, arbitration and cache coherence would add further overhead
CPU – DRAM Channel
Access reordering (the paper cites the Impulse group here at the U):
Compacts sparse data into densely packed bus transactions
Reduces the number of bus transactions
Possibly reduces the duration of a bus transaction
Increasing concurrency
Different banks on the same channel
Independent channels to different banks
Pipelined requests
Split-transaction bus
Decreasing channel latency
Due to channel contention:
Back-to-back read requests
A read arriving during precharge
Narrow channels
Large data burst sizes
Addressing System Overhead
Bus turnaround time
Dead cycles due to asymmetric read/write shapes
Queuing overhead
Coalescing of queued requests
Dynamic re-prioritization of requests
Timing Assumptions
10 ns for the address
70 ns until the burst starts on a read
40 ns until a write can start
(These combine into the transaction-time sketch below.)
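A minimal sketch of how these numbers might combine into a full transaction time, assuming the 800 MHz bus from the results slides and that the data burst simply follows the initial access latency; the constant names and the formula are illustrative, not the paper's simulator.

```python
BUS_MHZ = 800          # bus speed, assumed from the "Degrees of Freedom" slide
ADDR_NS = 10           # address transmission
READ_ACCESS_NS = 70    # time until the burst starts on a read
WRITE_SETUP_NS = 40    # time until a write can start

def burst_ns(burst_bytes: int, bus_width_bytes: int) -> float:
    """Time the data burst occupies the bus."""
    cycles = burst_bytes / bus_width_bytes
    return cycles * 1000.0 / BUS_MHZ

def read_ns(burst_bytes: int, bus_width_bytes: int) -> float:
    return ADDR_NS + READ_ACCESS_NS + burst_ns(burst_bytes, bus_width_bytes)

def write_ns(burst_bytes: int, bus_width_bytes: int) -> float:
    return ADDR_NS + WRITE_SETUP_NS + burst_ns(burst_bytes, bus_width_bytes)

# A 64-byte read burst: 160 ns on a 1-byte-wide channel vs. 90 ns on an 8-byte channel,
# which is why narrow channels pay heavily for large bursts.
print(read_ns(64, 1), read_ns(64, 8))
```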
Split Transaction Bus Assumptions
Overlapping supported for:
Back-to-back reads
Back-to-back read/write pairs
Burst Ordering, Coalescing
Critical bursts first, non-critical bursts second, writes last
Coalesce writes with reads that follow them to the same address (sketched below)
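A toy sketch of the ordering and coalescing policy on this slide, assuming a simple per-block write match and a three-level priority; the class names and interface are illustrative, not the paper's memory controller.

```python
from collections import deque

class Request:
    """A single DRAM request; the fields are illustrative."""
    def __init__(self, kind: str, block: int, critical: bool = False):
        self.kind = kind          # "read" or "write"
        self.block = block        # block (burst) address
        self.critical = critical  # holds the critical word the CPU is waiting on

class RequestQueue:
    """Toy scheduler: critical bursts first, other read bursts second, writes last."""
    def __init__(self):
        self.pending = deque()

    def enqueue(self, req: Request) -> str:
        # Coalescing: a read arriving behind a buffered write to the same block
        # can be satisfied from the write data without a DRAM access.
        if req.kind == "read":
            for other in self.pending:
                if other.kind == "write" and other.block == req.block:
                    return "serviced from write buffer"
        self.pending.append(req)
        return "queued"

    def next_request(self):
        if not self.pending:
            return None
        def priority(r: Request) -> int:
            if r.kind == "read" and r.critical:
                return 0          # critical burst first
            if r.kind == "read":
                return 1          # non-critical burst second
            return 2              # writes last
        best = min(self.pending, key=priority)
        self.pending.remove(best)
        return best
```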
Bit Addressing & Page Policy
Bit assignments chosen to exploit page mode and maximize the degree of memory concurrency
Most significant bits identify the smallest-scale component in the system
Least significant bits identify the largest-scale component in the system
Allows sequential addresses to be striped across channels, maximizing concurrency (sketched below)
Close-page, auto-precharge policy
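A minimal sketch of this addressing rule. The field order (burst offset, then channel, then bank) follows the slide's statement that the least significant bits select the largest-scale component; the exact field widths are assumptions, not the paper's values.

```python
BURST_BYTES = 64
CHANNELS = 4
BANKS_PER_CHANNEL = 8

def decode(addr: int):
    burst_offset = addr % BURST_BYTES
    addr //= BURST_BYTES
    channel = addr % CHANNELS            # lowest field above the burst: stripes across channels
    addr //= CHANNELS
    bank = addr % BANKS_PER_CHANNEL      # next field: stripes across banks within a channel
    row_and_column = addr // BANKS_PER_CHANNEL
    return channel, bank, burst_offset, row_and_column

# Sequential 64-byte bursts land on channels 0, 1, 2, 3, 0, ... so back-to-back
# accesses to consecutive addresses can proceed concurrently.
for a in range(0, 5 * BURST_BYTES, BURST_BYTES):
    print(decode(a)[0])
```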
Simulation Environment
SimpleScalar (used in 6810)
2 GHz clock
L1 caches: 64 KB / 64 KB, 2-way set associative
L2 cache: unified 1 MB, 4-way set associative, 10-cycle access time
Lockup-free caches using miss status holding registers (MSHRs)
Timing Calculations
The CPU component (processor plus caches) is determined by running a second simulation with perfect primary memory (data available on the next cycle); the DRAM-system component is the difference from the full run (see the sketch below)
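A minimal sketch of the decomposition described above; the function name and the use of cycle counts are illustrative assumptions, not the paper's tooling.

```python
def split_execution_time(total_cycles: float, perfect_memory_cycles: float):
    """Split execution time into a CPU+cache component and a DRAM-system component."""
    cpu_component = perfect_memory_cycles                   # run with perfect primary memory
    dram_component = total_cycles - perfect_memory_cycles   # everything attributable to DRAM
    return cpu_component, dram_component
```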
Results – Degrees of Freedom
Bus speed: 800 MHz
Bus width: 1, 2, 4, 8 bytes
Channels: 1, 2, 4
Banks/channel: 1, 2, 4, 8
Queue size: infinite, 0, 1, 2, 8, 16, 32
Turnaround: 0, 1 cycles
R/W shapes: symmetric, asymmetric
(The full sweep is enumerated in the sketch below.)
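To make the size of the design space concrete, a small sketch that enumerates the cross-product of the listed values; assuming every combination is simulated is my reading, not something the slides state.

```python
from itertools import product

bus_widths  = [1, 2, 4, 8]                # bytes
channels    = [1, 2, 4]
banks       = [1, 2, 4, 8]                # per channel
queue_sizes = [-1, 0, 1, 2, 8, 16, 32]    # -1 stands in for "infinite"
turnaround  = [0, 1]                      # cycles
rw_shapes   = ["symmetric", "asymmetric"]

configs = list(product(bus_widths, channels, banks, queue_sizes, turnaround, rw_shapes))
print(len(configs))  # 4 * 3 * 4 * 7 * 2 * 2 = 1344 configurations
```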
Results – Execution Times
Assumes an infinite request queue
System parameters can lead to widely varying CPI
Results – Turnaround and Banks
Turnaround accounts for only 5% of system-related overhead
Banks/channel account for a 1.2x to 2x variation, showing that concurrency is important
Latency accounts for roughly 50% of CPI
Results – Burst Length vs. BW
Accounts for 10-30% of execution time
Wider channels have optimal performance with larger bursts
Narrow channels have optimal performance with smaller bursts
Results – Concurrency
Increasing the number of banks typically increases performance, but not always by much
Many narrow channels are risky because the application might not have much inherent concurrency
Optimal configurations: 1 channel x 4 bytes with 64-byte bursts; 2 channels x 2 bytes with 64-byte bursts; 1 channel x 4 bytes with 128-byte bursts
Performance varies depending on the concurrency of the benchmark
Results – Concurrency (Cont.)
“We find that, in a uniprocessor setting, concurrency is very important, but it is not more important than latency. . . . However, we find that if, in an attempt to increase support for concurrent transactions, one interleaves very small bursts or fragments the DRAM bus into multiple channels, one does so at the expense of latency, and this expense is too great for the levels of concurrency being produced.”
Results – Request Queue Size
How queuing benefits system performance:
Sub-blocks of different read requests can be interleaved
Writes can be buffered until read-burst traffic has died down
Read and write requests may be coalesced
Applications with significant write activity see more benefit from queuing; bzip has many more writes than gcc
Anomalies are attributed to requests with temporal locality going to the same bank; with a small queue they are delayed
Conclusions
Tuning system-level parameters can improve memory-system performance by 40%
Bus turnaround: 5-10%
Banks: 1.2x to 2x
Burst length vs. bandwidth: 10%-30%
Concurrency: interleaving smaller bursts is not a good idea, because the latency cost outweighs the concurrency gained
Our Project
Evaluate the effect of mat-array size on the power and latency of DRAM chips
Simulators: CACTI, DRAMSim, Simics
Predicted results
Positive:
Decreased memory latency
Decreased power profile
Increased DIMM parallelism
Negative:
Decreased row-buffer hit rates
Decreased memory capacity (for the same chip area)
Increased cost/bit, an important metric
How project relates to the paper
Both try to decrease memory-system bottlenecks, although we have evaluated the bottlenecks differently
Jacob indirectly showed the importance of minimizing DRAM latency
DRAM latency was the largest portion of CPI, so Amdahl's law justifies reducing latency
Both solutions could work together synergistically
Additional thoughts
The current path of DRAM innovation has limitations
DRAM chips and DIMMs need to undergo fundamental changes, and this could be one step in that direction
Helps power efficiency
Can be balanced against cost effectiveness
Partially addresses the memory gap