
Linearly Compressed Pages: A Low-Complexity, Low-Latency Main Memory Compression Framework

Gennady Pekhimenko†   Vivek Seshadri†   Yoongu Kim†   Hongyi Xin†
Onur Mutlu†   Phillip B. Gibbons⋆   Michael A. Kozuch⋆   Todd C. Mowry†

†Carnegie Mellon University   ⋆Intel Labs Pittsburgh

ABSTRACT

Data compression is a promising approach for meeting the increasing memory capacity demands expected in future systems. Unfortunately, existing compression algorithms do not translate well when directly applied to main memory because they require the memory controller to perform non-trivial computation to locate a cache line within a compressed memory page, thereby increasing access latency and degrading system performance. Prior proposals for addressing this performance degradation problem are either costly or energy inefficient.

By leveraging the key insight that all cache lines within a page should be compressed to the same size, this paper proposes a new approach to main memory compression—Linearly Compressed Pages (LCP)—that avoids the performance degradation problem without requiring costly or energy-inefficient hardware. We show that any compression algorithm can be adapted to fit the requirements of LCP, and we specifically adapt two previously-proposed compression algorithms to LCP: Frequent Pattern Compression and Base-Delta-Immediate Compression.

Evaluations using benchmarks from SPEC CPU2006 and five server benchmarks show that our approach can significantly increase the effective memory capacity (by 69% on average). In addition to the capacity gains, we evaluate the benefit of transferring consecutive compressed cache lines between the memory controller and main memory. Our new mechanism considerably reduces the memory bandwidth requirements of most of the evaluated benchmarks (by 24% on average), and improves overall performance (by 6.1%/13.9%/10.7% for single-/two-/four-core workloads on average) compared to a baseline system that does not employ main memory compression. LCP also decreases energy consumed by the main memory subsystem (by 9.5% on average over the best prior mechanism).

Categories and Subject Descriptors

B.3.1 [Semiconductor Memories]: Dynamic memory (DRAM); D.4.2 [Storage Management]: Main memory; E.4 [Coding and Information Theory]: Data compaction and compression


Keywords

Data Compression, Memory, Memory Bandwidth, Memory Capacity, Memory Controller, DRAM

1. INTRODUCTION

Main memory, commonly implemented using DRAM technology, is a critical resource in modern systems. To avoid the devastating performance loss resulting from frequent page faults, main memory capacity must be sufficiently provisioned to prevent the target workload's working set from overflowing into the orders-of-magnitude-slower backing store (e.g., hard disk or flash).

Unfortunately, the required minimum memory capacity is expected to increase in the future due to two major trends: (i) applications are generally becoming more data-intensive with increasing working set sizes, and (ii) with more cores integrated onto the same chip, more applications are running concurrently on the system, thereby increasing the aggregate working set size. Simply scaling up main memory capacity at a commensurate rate is unattractive for two reasons: (i) DRAM already constitutes a significant portion of the system's cost and power budget [19], and (ii) for signal integrity reasons, today's high frequency memory channels prevent many DRAM modules from being connected to the same channel [17], effectively limiting the maximum amount of DRAM in a system unless one resorts to expensive off-chip signaling buffers [6].

If its potential could be realized in practice, data compression would be a very attractive approach to effectively increase main memory capacity without requiring significant increases in cost or power, because a compressed piece of data can be stored in a smaller amount of physical memory. Further, such compression could be hidden from application (and most system¹) software by materializing the uncompressed data as it is brought into the processor cache. Building upon the observation that there is significant redundancy in in-memory data, previous work has proposed a variety of techniques for compressing data in caches [2, 3, 5, 12, 25, 37, 39] and in main memory [1, 7, 8, 10, 35].

¹We assume that main memory compression is made visible to the memory management functions of the operating system (OS). In Section 2.3, we discuss the drawbacks of a design that makes main memory compression mostly transparent to the OS [1].

1.1 Shortcomings of Prior Approaches

A key stumbling block to making data compression practical is that decompression lies on the critical path of accessing any compressed data. Sophisticated compression algorithms, such as Lempel-Ziv and Huffman encoding [13, 40], typically achieve high compression ratios at the expense of large decompression latencies that can significantly degrade performance. To counter this problem, prior work [3, 25, 39] on cache compression proposed specialized compression algorithms that exploit regular patterns present in in-memory data, and showed that such specialized algorithms have reasonable compression ratios compared to more complex algorithms while incurring much lower decompression latencies.

While promising, applying compression algorithms, sophisticated or simpler, to compress data stored in main memory requires first overcoming the following three challenges. First, main memory compression complicates memory management, because the operating system has to map fixed-size virtual pages to variable-size physical pages. Second, because modern processors employ on-chip caches with tags derived from the physical address to avoid aliasing between different cache lines (as physical addresses are unique, while virtual addresses are not), the cache tagging logic needs to be modified in light of memory compression to take the main memory address computation off the critical path of latency-critical L1 cache accesses. Third, in contrast with normal virtual-to-physical address translation, the physical page offset of a cache line is often different from the corresponding virtual page offset, because compressed physical cache lines are smaller than their corresponding virtual cache lines. In fact, the location of a compressed cache line in a physical page in main memory depends upon the sizes of the compressed cache lines that come before it in that same physical page. As a result, accessing a cache line within a compressed page in main memory requires an additional layer of address computation to compute the location of the cache line in main memory (which we will call the main memory address). This additional main memory address computation not only adds complexity and cost to the system, but it can also increase the latency of accessing main memory (e.g., it requires up to 22 integer addition operations in one prior design for main memory compression [10]), which in turn can degrade system performance.

While simple solutions exist for these first two challenges (as we describe later in Section 4), prior attempts to mitigate the performance degradation of the third challenge are either costly or inefficient [1, 10]. One approach (IBM MXT [1]) aims to reduce the number of main memory accesses, the cause of long-latency main memory address computation, by adding a large (32MB) uncompressed cache managed at the granularity at which blocks are compressed (1KB). If locality is present in the program, this approach can avoid the latency penalty of main memory address computations to access compressed data. Unfortunately, its benefit comes at a significant additional area and energy cost, and the approach is ineffective for accesses that miss in the large cache. A second approach [10] aims to hide the latency of main memory address computation by speculatively computing the main memory address of every last-level cache request in parallel with the cache access (i.e., before it is known whether or not the request needs to access main memory). While this approach can effectively reduce the performance impact of main memory address computation, it wastes a significant amount of energy (as we show in Section 7.3) because many accesses to the last-level cache do not result in an access to main memory.

1.2 Our Approach: Linearly Compressed Pages

We aim to build a main memory compression framework that neither incurs the latency penalty for memory accesses nor requires power-inefficient hardware. Our goals are: (i) having low complexity and low latency (especially when performing memory address computation for a cache line within a compressed page), (ii) being compatible with compression employed in on-chip caches (thereby minimizing the number of compressions/decompressions performed), and (iii) supporting compression algorithms with high compression ratios.

To this end, we propose a new approach to compress pages, which we call Linearly Compressed Pages (LCP). The key idea of LCP is to compress all of the cache lines within a given page to the same size. Doing so simplifies the computation of the physical address of the cache line, because the page offset is simply the product of the index of the cache line and the compressed cache line size (i.e., it can be calculated using a simple shift operation). Based on this idea, a target compressed cache line size is determined for each page. Cache lines that cannot be compressed to the target size for their page are called exceptions. All exceptions, along with the metadata required to locate them, are stored separately in the same compressed page. If a page requires more space in compressed form than in uncompressed form, then this page is not compressed. The page table indicates the form in which the page is stored.

The LCP framework can be used with any compression algorithm. We adapt two previously proposed compression algorithms (Frequent Pattern Compression (FPC) [2] and Base-Delta-Immediate Compression (BDI) [25]) to fit the requirements of LCP, and show that the resulting designs can significantly improve effective main memory capacity on a wide variety of workloads.

Note that, throughout this paper, we assume that compressed cache lines are decompressed before being placed in the processor caches. LCP may be combined with compressed cache designs by storing compressed lines in the higher-level caches (as in [2, 25]), but the techniques are largely orthogonal, and for clarity, we present an LCP design where only main memory is compressed.²

²We show the results from combining main memory and cache compression in our technical report [26].

An additional, potential benefit of compressing data in main memory, which has not been fully explored by prior work on main memory compression, is memory bandwidth reduction. When data are stored in compressed format in main memory, multiple consecutive compressed cache lines can be retrieved at the cost of accessing a single uncompressed cache line. Given the increasing demand on main memory bandwidth, such a mechanism can significantly reduce the memory bandwidth requirement of applications, especially those with high spatial locality. Prior works on bandwidth compression [27, 32, 36] assumed efficient variable-length off-chip data transfers that are hard to achieve with general-purpose DRAM (e.g., DDR3 [23]). We propose a mechanism that enables the memory controller to retrieve multiple consecutive cache lines with a single access to DRAM, with negligible additional cost. Evaluations show that our mechanism provides significant bandwidth savings, leading to improved system performance.

In summary, this paper makes the following contributions:

• We propose a new main memory compression framework—Linearly Compressed Pages (LCP)—that solves the problem of efficiently computing the physical address of a compressed cache line in main memory with much lower cost and complexity than prior proposals. We also demonstrate that any compression algorithm can be adapted to fit the requirements of LCP.

• We evaluate our design with two state-of-the-art compression algorithms (FPC [2] and BDI [25]), and observe that it can significantly increase the effective main memory capacity (by 69% on average).

• We evaluate the benefits of transferring compressed cache lines over the bus between DRAM and the memory controller and observe that it can considerably reduce memory bandwidth consumption (24% on average), and improve overall performance by 6.1%/13.9%/10.7% for single-/two-/four-core workloads, relative to a system without main memory compression. LCP also decreases the energy consumed by the main memory subsystem (9.5% on average over the best prior mechanism).

2. BACKGROUND ON MAIN MEMORY COMPRESSION

Data compression is widely used in storage structures to increase the effective capacity and bandwidth without significantly increasing the system cost and power consumption. One primary downside of compression is that the compressed data must be decompressed before it can be used. Therefore, for latency-critical applications, using complex dictionary-based compression algorithms [40] significantly degrades performance due to their high decompression latencies. Thus, prior work on compression of in-memory data has proposed simpler algorithms with low decompression latencies and reasonably high compression ratios, as discussed next.

2.1 Compressing In-Memory Data

Several studies [2, 3, 25, 39] have shown that in-memory data has exploitable patterns that allow for simpler compression techniques. Frequent value compression (FVC) [39] is based on the observation that an application's working set is often dominated by a small set of values. FVC exploits this observation by encoding such frequently-occurring 4-byte values with fewer bits. Frequent pattern compression (FPC) [3] shows that a majority of words (4-byte elements) in memory fall under a few frequently occurring patterns. FPC compresses individual words within a cache line by encoding the frequently occurring patterns with fewer bits. Base-Delta-Immediate (BDI) compression [25] observes that, in many cases, words co-located in memory have small differences in their values. BDI compression encodes a cache line as a base-value and an array of differences that represent the deviation either from the base-value or from zero (for small values) for each word. These three low-latency compression algorithms have been proposed for on-chip caches, but can be adapted for use as part of the main memory compression framework proposed in this paper.
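To make the flavor of such low-latency schemes concrete, the following is a minimal sketch (not the authors' implementation) of a BDI-style check: it tests whether a 64-byte line of eight 8-byte words fits an 8-byte-base/1-byte-delta encoding. The real BDI design in [25] supports several base and delta widths as well as an implicit zero base.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of a BDI-style compressibility check (8B base, 1B deltas).
 * Illustrative only; the actual encoder in [25] is more general. */
static bool bdi_8_1_compressible(const uint64_t line[8]) {
    uint64_t base = line[0];                  /* first word serves as the base */
    for (int i = 0; i < 8; i++) {
        int64_t delta = (int64_t)(line[i] - base);
        if (delta < -128 || delta > 127)      /* each delta must fit in one byte */
            return false;
    }
    return true;                              /* 8B base + 8x1B deltas = 16B */
}

int main(void) {
    uint64_t line[8] = {0x1000, 0x1008, 0x1010, 0x1018,
                        0x1020, 0x1028, 0x1030, 0x1038}; /* pointer-like values */
    printf("compressible to 16B: %s\n",
           bdi_8_1_compressible(line) ? "yes" : "no");
    return 0;
}
```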

2.2 Challenges in Memory Compression

LCP leverages the fixed-size memory pages of modern systems for the basic units of compression. However, three challenges arise from the fact that different pages (and cache lines within a page) compress to different sizes depending on data compressibility.

Challenge 1: Main Memory Page Mapping. Irregular page sizes in main memory complicate the memory management module of the operating system for two reasons (as shown in Figure 1). First, the operating system needs to allow mappings between the fixed-size virtual pages presented to software and the variable-size physical pages stored in main memory. Second, the operating system must implement mechanisms to efficiently handle fragmentation in main memory.

Figure 1: Main Memory Page Mapping Challenge

Challenge 2: Physical Address Tag Computation. On-chip caches (including L1 caches) typically employ tags derived from the physical address of the cache line to avoid aliasing, and in such systems, every cache access requires the physical address of the corresponding cache line to be computed. Hence, because the main memory addresses of the compressed cache lines differ from the nominal physical addresses of those lines, care must be taken that the computation of the cache line tag does not lengthen the critical path of latency-critical L1 cache accesses.

Challenge 3: Cache Line Address Computation. When main memory is compressed, different cache lines within a page can be compressed to different sizes. The main memory address of a cache line is therefore dependent on the sizes of the compressed cache lines that come before it in the page. As a result, the processor (or the memory controller) must explicitly compute the location of a cache line within a compressed main memory page before accessing it (Figure 2), e.g., as in [10]. This computation not only increases complexity, but can also lengthen the critical path of accessing the cache line from both the main memory and the physically addressed cache. Note that systems that do not employ main memory compression do not suffer from this problem because the offset of a cache line within the physical page is the same as the offset of the cache line within the corresponding virtual page.

Figure 2: Cache Line Address Computation Challenge

As will be seen shortly, while prior research efforts have considered subsets of these challenges, this paper is the first design that provides a holistic solution to all three challenges, particularly Challenge 3, with low latency and low (hardware and software) complexity.

2.3 Prior Work on Memory Compression

Of the many prior works on using compression for main memory (e.g., [1, 7, 8, 10, 18, 27, 35]), two in particular are the most closely related to the design proposed in this paper, because both of them are mostly hardware designs. We describe these two designs along with their shortcomings.

Tremaine et al. [34] proposed a memory controller design, Pinnacle, based on IBM's Memory Extension Technology (MXT) [1] that employed Lempel-Ziv compression [40] to manage main memory. To address the three challenges described above, Pinnacle employs two techniques. First, Pinnacle internally uses a 32MB last-level cache managed at a 1KB granularity, same as the granularity at which blocks are compressed. This cache reduces the number of accesses to main memory by exploiting locality in access patterns, thereby reducing the performance degradation due to the address computation (Challenge 3). However, there are several drawbacks to this technique: (i) such a large cache adds significant area and energy costs to the memory controller, (ii) the approach requires the main memory address computation logic to be present and used when an access misses in the 32MB cache, and (iii) if caching is not effective (e.g., due to lack of locality or larger-than-cache working set sizes), this approach cannot reduce the performance degradation due to main memory address computation. Second, to avoid complex changes to the operating system and on-chip cache-tagging logic, Pinnacle introduces a real address space between the virtual and physical address spaces. The real address space is uncompressed and is twice the size of the actual available physical memory. The operating system maps virtual pages to same-size pages in the real address space, which addresses Challenge 1. On-chip caches are tagged using the real address (instead of the physical address, which is dependent on compressibility), which effectively solves Challenge 2. On a miss in the 32MB cache, Pinnacle maps the corresponding real address to the physical address of the compressed block in main memory, using a memory-resident mapping table managed by the memory controller. Following this, Pinnacle retrieves the compressed block from main memory, performs decompression and sends the data back to the processor. Clearly, the additional access to the memory-resident mapping table on every cache miss significantly increases the main memory access latency. In addition to this, Pinnacle's decompression latency, which is on the critical path of a memory access, is 64 processor cycles.

Ekman and Stenström [10] proposed a main memory compression design to address the drawbacks of MXT. In their design, the operating system maps the uncompressed virtual address space directly to a compressed physical address space. To compress pages, they use a variant of the Frequent Pattern Compression technique [2, 3], which has a much smaller decompression latency (5 cycles) than the Lempel-Ziv compression in Pinnacle (64 cycles). To avoid the long latency of a cache line's main memory address computation (Challenge 3), their design overlaps this computation with the last-level (L2) cache access. For this purpose, their design extends the page table entries to store the compressed sizes of all the lines within the page. This information is loaded into a hardware structure called the Block Size Table (BST). On an L1 cache miss, the BST is accessed in parallel with the L2 cache to compute the exact main memory address of the corresponding cache line. While the proposed mechanism reduces the latency penalty of accessing compressed blocks by overlapping main memory address computation with L2 cache access, the main memory address computation is performed on every L2 cache access (as opposed to only on L2 cache misses in LCP). This leads to significant wasted work and additional power consumption. Even though the BST has the same number of entries as the translation lookaside buffer (TLB), its size is at least twice that of the TLB [10]. This adds to the complexity and power consumption of the system significantly. To address Challenge 1, the operating system uses multiple pools of fixed-size physical pages. This reduces the complexity of managing physical pages at a fine granularity. Ekman and Stenström [10] do not address Challenge 2.

In summary, prior works on hardware-based main memory compression mitigate the performance degradation due to the main memory address computation problem (Challenge 3) either by adding large hardware structures that consume significant area and power [1] or by using techniques that require energy-inefficient hardware and lead to wasted energy [10].

3. LINEARLY COMPRESSED PAGES

In this section, we provide the basic idea and a brief overview of our proposal, Linearly Compressed Pages (LCP), which overcomes the aforementioned shortcomings of prior proposals. Further details will follow in Section 4.

3.1 LCP: Basic Idea

The main shortcoming of prior approaches to main memory compression is that different cache lines within a physical page can be compressed to different sizes based on the compression scheme. As a result, the location of a compressed cache line within a physical page depends on the sizes of all the compressed cache lines before it in the same page. This requires the memory controller to explicitly perform this complex calculation (or cache the mapping in a large, energy-inefficient structure) in order to access the line.

Figure 3: Organization of a Linearly Compressed Page

To address this shortcoming, we propose a new approach to compressing pages, called the Linearly Compressed Page (LCP). The key idea of LCP is to use a fixed size for compressed cache lines within a given page (alleviating the complex and long-latency main memory address calculation problem that arises due to variable-size cache lines), and yet still enable a page to be compressed even if not all cache lines within the page can be compressed to that fixed size (enabling high compression ratios).

Because all the cache lines within a given page are compressed to the same size, the location of a compressed cache line within the page is simply the product of the index of the cache line within the page and the size of the compressed cache line—essentially a linear scaling using the index of the cache line (hence the name Linearly Compressed Page). LCP greatly simplifies the task of computing a cache line's main memory address. For example, if all cache lines within a page are compressed to 16 bytes, the byte offset of the third cache line (index within the page is 2) from the start of the physical page is 16 × 2 = 32, if the line is compressed. This computation can be implemented as a simple shift operation.
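The offset computation above is small enough to show directly; the following sketch (illustrative only, not the controller's RTL) computes the byte offset of a cache line within an LCP for the paper's 16B example. For a power-of-two target size the multiply reduces to a shift; non-power-of-two targets (e.g., FPC's 21B) would use a small fixed multiplier instead.

```c
#include <stdint.h>
#include <stdio.h>

/* Byte offset of a cache line inside a linearly compressed page:
 * simply index * C*. For c_star == 16 this is equivalent to index << 4. */
static inline uint32_t lcp_line_offset(uint32_t index, uint32_t c_star) {
    return index * c_star;
}

int main(void) {
    /* The paper's example: third cache line (index 2), 16B target -> offset 32. */
    printf("offset = %u\n", lcp_line_offset(2, 16));
    return 0;
}
```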

Figure 3 shows the organization of an example Linearly Compressed Page, based on the ideas described above. In this example, we assume that a virtual page is 4KB, an uncompressed cache line is 64B, and the target compressed cache line size is 16B.

As shown in the figure, the LCP contains three distinct regions. The first region, the compressed data region, contains a 16-byte slot for each cache line in the virtual page. If a cache line is compressible, the corresponding slot stores the compressed version of the cache line. However, if the cache line is not compressible, the corresponding slot is assumed to contain invalid data. In our design, we refer to such an incompressible cache line as an "exception". The second region, metadata, contains all the necessary information to identify and locate the exceptions of a page. We provide more details on what exactly is stored in the metadata region in Section 4.2. The third region, the exception storage, is the place where all the exceptions of the LCP are stored in their uncompressed form. Our LCP design allows the exception storage to contain unused space. In other words, not all entries in the exception storage may store valid exceptions. As we will describe in Section 4, this enables the memory controller to use the unused space for storing future exceptions, and also simplifies the operating system page management mechanism.

Next, we will provide a brief overview of the main memory compression framework we build using LCP.

3.2 LCP Operation

Our LCP-based main memory compression framework consists of components that handle three key issues: (i) page compression, (ii) cache line reads from main memory, and (iii) cache line writebacks into main memory. Figure 4 shows the high-level design and operation.


Figure 4: Memory request flow

Page Compression. When a page is accessed for the first time from disk, the operating system (with the help of the memory controller) first determines whether the page is compressible using the compression algorithm employed by the framework (described in Section 4.7). If the page is compressible, the OS allocates a physical page of appropriate size and stores the compressed page (LCP) in the corresponding location. It also updates the relevant portions of the corresponding page table mapping to indicate (i) whether the page is compressed, and if so, (ii) the compression scheme used to compress the page (details in Section 4.1).

Cache Line Read. When the memory controller receives a read request for a cache line within an LCP, it must find and decompress the data. Multiple design solutions are possible to perform this task efficiently. A naïve way of reading a cache line from an LCP would require at least two accesses to the corresponding page in main memory. First, the memory controller accesses the metadata in the LCP to determine whether the cache line is stored in the compressed format. Second, based on the result, the controller either (i) accesses the cache line from the compressed data region and decompresses it, or (ii) accesses it uncompressed from the exception storage.

To avoid two accesses to main memory, we propose two optimizations that enable the controller to retrieve the cache line with the latency of just one main memory access in the common case. First, we add a small metadata (MD) cache to the memory controller that caches the metadata of the recently accessed LCPs—the controller avoids the first main memory access to the metadata in cases when the metadata is present in the MD cache. Second, in cases when the metadata is not present in the metadata cache, the controller speculatively assumes that the cache line is stored in the compressed format and first accesses the data corresponding to the cache line from the compressed data region. The controller then overlaps the latency of the cache line decompression with the access to the metadata of the LCP. In the common case, when the speculation is correct (i.e., the cache line is actually stored in the compressed format), this approach significantly reduces the latency of serving the read request. In the case of a misspeculation (uncommon case), the memory controller issues another request to retrieve the cache line from the exception storage.
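The read-path decisions just described can be summarized as a small decision function; the sketch below is illustrative control flow only, with made-up names, and is not the controller's actual implementation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative sketch of the LCP read-path decisions described above. */
static const char *lcp_read_path(bool md_cache_hit, bool line_is_exception) {
    if (md_cache_hit) {
        /* Metadata known up front: a single DRAM access to the right region. */
        return line_is_exception
            ? "one access to the exception storage (no decompression)"
            : "one access to the compressed slot, then decompress";
    }
    /* MD cache miss: speculatively fetch the compressed slot and overlap
     * decompression with the metadata fetch. */
    return line_is_exception
        ? "speculation wrong: extra access to the exception storage"
        : "speculation right: decompression overlapped with the metadata fetch";
}

int main(void) {
    printf("%s\n", lcp_read_path(true, false));
    printf("%s\n", lcp_read_path(false, true));
    return 0;
}
```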

Cache Line Writeback. If the memory controller receives a request for a cache line writeback, it then attempts to compress the cache line using the compression scheme associated with the corresponding LCP. Depending on the original state of the cache line (compressible or incompressible), there are four different possibilities: the cache line (1) was compressed and stays compressed, (2) was uncompressed and stays uncompressed, (3) was uncompressed but becomes compressed, and (4) was compressed but becomes uncompressed. In the first two cases, the memory controller simply overwrites the old data with the new data at the same location associated with the cache line. In case 3, the memory controller frees the exception storage slot for the cache line and writes the compressible data in the compressed data region of the LCP. (Section 4.2 provides more details on how the exception storage is managed.) In case 4, the memory controller checks whether there is enough space in the exception storage region to store the uncompressed cache line. If so, it stores the cache line in an available slot in the region. If there are no free exception storage slots in the exception storage region of the page, the memory controller traps to the operating system, which migrates the page to a new location (which can also involve page recompression). In both cases 3 and 4, the memory controller appropriately modifies the LCP metadata associated with the cache line's page.
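The four writeback cases map naturally onto a small decision function; the sketch below is illustrative only (the structure is hypothetical and the metadata updates are summarized as text).

```c
#include <stdbool.h>
#include <stdio.h>

/* The four writeback cases described above, as a decision sketch. */
static const char *lcp_writeback_action(bool was_compressed,
                                        bool now_compressible,
                                        bool free_exception_slot) {
    if (was_compressed == now_compressible)   /* cases 1 and 2 */
        return "overwrite the old data in place";
    if (now_compressible)                     /* case 3 */
        return "free the exception slot; write to the compressed data region";
    /* case 4: the line now needs an exception slot */
    return free_exception_slot
        ? "claim a free exception slot (set its v-bit); write uncompressed"
        : "page overflow: trap to the OS to migrate (and possibly recompress) the page";
}

int main(void) {
    printf("%s\n", lcp_writeback_action(true, false, false));
    return 0;
}
```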

Note that in the case of an LLC writeback to main memory (and assuming that TLB information is not available at the LLC), the cache tag entry is augmented with the same bits that are used to augment page table entries. Cache compression mechanisms, e.g., FPC [2] and BDI [25], already have the corresponding bits for encoding, so that the tag size overhead is minimal when main memory compression is used together with cache compression.

4. DETAILED DESIGN

In this section, we provide a detailed description of LCP, along with the changes to the memory controller, operating system and on-chip cache tagging logic. In the process, we explain how our proposed design addresses each of the three challenges (Section 2.2).

4.1 Page Table Entry Extension

To keep track of virtual pages that are stored in compressed format in main memory, the page table entries need to be extended to store information related to compression (Figure 5). In addition to the information already maintained in the page table entries (such as the base address for a corresponding physical page, p-base), each virtual page in the system is associated with the following pieces of metadata: (i) c-bit, a bit that indicates if the page is mapped to a compressed physical page (LCP), (ii) c-type, a field that indicates the compression scheme used to compress the page, (iii) c-size, a field that indicates the size of the LCP, and (iv) c-base, a p-base extension that enables LCPs to start at an address not aligned with the virtual page size. The number of bits required to store c-type, c-size and c-base depends on the exact implementation of the framework. In the implementation we evaluate, we assume 3 bits for c-type (allowing 8 possible different compression encodings), 2 bits for c-size (4 possible page sizes: 512B, 1KB, 2KB, 4KB), and 3 bits for c-base (at most eight 512B compressed pages can fit into a 4KB uncompressed slot). Note that existing systems usually have enough unused bits (up to 15 bits in Intel x86-64 systems [15]) in their PTE entries that can be used by LCP without increasing the PTE size.

Figure 5: Page table entry extension.
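As a rough illustration, the sketch below packs the four fields with the widths given above (1 + 3 + 2 + 3 = 9 bits) into a standalone value; exactly which unused PTE bits would hold them is left open here, so the layout is an assumption for illustration only.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the per-page compression metadata added to each PTE
 * (field widths from the paper: c-bit 1b, c-type 3b, c-size 2b, c-base 3b). */
typedef uint16_t lcp_pte_bits;

static lcp_pte_bits lcp_encode(unsigned c_bit, unsigned c_type,
                               unsigned c_size, unsigned c_base) {
    return (lcp_pte_bits)(( c_bit  & 0x1)        |   /* page compressed?      */
                          ((c_type & 0x7) << 1)  |   /* one of 8 encodings    */
                          ((c_size & 0x3) << 4)  |   /* 512B/1KB/2KB/4KB      */
                          ((c_base & 0x7) << 6));    /* 512B-granularity base */
}

int main(void) {
    lcp_pte_bits bits = lcp_encode(1, /*c_type=*/2, /*c_size=*/1, /*c_base=*/3);
    printf("encoded LCP PTE bits: 0x%03x\n", (unsigned)bits);
    return 0;
}
```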

When a virtual page is compressed (the c-bit is set), all the compressible cache lines within the page are compressed to the same size, say C*. The value of C* is uniquely determined by the compression scheme used to compress the page, i.e., the c-type (Section 4.7 discusses determining the c-type for a page). We next describe the LCP organization in more detail.

4.2 LCP Organization

We will discuss each of an LCP's three regions in turn. We begin by defining the following symbols: V is the virtual page size of the system (e.g., 4KB); C is the uncompressed cache line size (e.g., 64B);³ n = V/C is the number of cache lines per virtual page (e.g., 64); and M is the size of LCP's metadata region. In addition, on a per-page basis, we define P to be the compressed physical page size; C* to be the compressed cache line size; and n_avail to be the number of slots available for exceptions.

³Large pages (e.g., 4MB or 1GB) can be supported with LCP through minor modifications that include scaling the corresponding sizes of the metadata and compressed data regions. The exception area metadata keeps the exception index for every cache line on a compressed page. This metadata can be partitioned into multiple 64-byte cache lines that can be handled similar to 4KB pages. The exact "metadata partition" can be easily identified based on the cache line index within a page.

Figure 6: Physical memory layout with the LCP framework.

4.2.1 Compressed Data Region

The compressed data region is a contiguous array of n slots, each of size C*. Each one of the n cache lines in the virtual page is mapped to one of the slots, irrespective of whether the cache line is compressible or not. Therefore, the size of the compressed data region is n × C*. This organization simplifies the computation required to determine the main memory address for the compressed slot corresponding to a cache line. More specifically, the address of the compressed slot for the i-th cache line can be computed as p-base + m-size × c-base + (i − 1) × C*, where the first two terms correspond to the start of the LCP (m-size equals the minimum page size, 512B in our implementation) and the third indicates the offset within the LCP of the i-th compressed slot (see Figure 6). Thus, computing the main memory address of a compressed cache line requires one multiplication (which can be implemented as a shift) and two additions independent of i (fixed latency). This computation requires lower latency and simpler hardware than prior approaches (e.g., up to 22 additions in the design proposed in [10]), thereby efficiently addressing Challenge 3 (cache line address computation).
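The slot-address formula translates directly into code; the sketch below follows the expression above, with made-up example inputs (p_base, c_base, C*).

```c
#include <stdint.h>
#include <stdio.h>

#define M_SIZE 512u                  /* minimum physical page size (512B) */

/* Main memory address of the compressed slot of the i-th cache line,
 * following p-base + m-size * c-base + (i - 1) * C*. `i` is 1-based,
 * matching the (i - 1) term in the formula. */
static uint64_t compressed_slot_addr(uint64_t p_base, unsigned c_base,
                                     unsigned i, unsigned c_star) {
    return p_base + (uint64_t)M_SIZE * c_base + (uint64_t)(i - 1) * c_star;
}

int main(void) {
    /* Made-up example: LCP starting 512B into a physical page, 16B target,
     * third cache line (i = 3). */
    uint64_t addr = compressed_slot_addr(0x200000, /*c_base=*/1,
                                         /*i=*/3, /*c_star=*/16);
    printf("slot address: 0x%llx\n", (unsigned long long)addr);
    return 0;
}
```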

4.2.2 Metadata Region

The metadata region of an LCP contains two parts (Figure 7). The first part stores two pieces of information for each cache line in the virtual page: (i) a bit indicating whether the cache line is incompressible, i.e., whether the cache line is an exception, the e-bit, and (ii) the index of the cache line in the exception storage, the e-index. If the e-bit is set for a cache line, then the corresponding cache line is stored uncompressed in location e-index in the exception storage. The second part of the metadata region is a valid bit (v-bit) vector to track the state of the slots in the exception storage. If a v-bit is set, it indicates that the corresponding slot in the exception storage is used by some uncompressed cache line within the page.

The size of the first part depends on the size of e-index, which in turn depends on the number of exceptions allowed per page. Because the number of exceptions cannot exceed the number of cache lines in the page (n), we will need at most 1 + ⌈log₂ n⌉ bits for each cache line in the first part of the metadata. For the same reason, we will need at most n bits in the bit vector in the second part of the metadata. Therefore, the size of the metadata region is given by M = n(1 + ⌈log₂ n⌉) + n bits. Since n is fixed for the entire system, the size of the metadata region (M) is the same for all compressed pages (64B in our implementation).

Figure 7: Metadata region, when n = 64.

4.2.3 Exception Storage Region

The third region, the exception storage, is the place where all incompressible cache lines of the page are stored. If a cache line is present in location e-index in the exception storage, its main memory address can be computed as: p-base + m-size × c-base + n × C* + M + e-index × C. The number of slots available in the exception storage (n_avail) is dictated by the size of the compressed physical page allocated by the operating system for the corresponding LCP. The following equation expresses the relation between the physical page size (P), the compressed cache line size (C*) that is determined by c-type, and the number of available slots in the exception storage (n_avail):

n_avail = ⌊(P − (n × C* + M)) / C⌋    (1)

As mentioned before, the metadata region contains a bit vector that is used to manage the exception storage. When the memory controller assigns an exception slot to an incompressible cache line, it sets the corresponding bit in the bit vector to indicate that the slot is no longer free. If the cache line later becomes compressible and no longer requires the exception slot, the memory controller resets the corresponding bit in the bit vector. In the next section, we describe the operating system memory management policy that determines the physical page size (P) allocated for an LCP, and hence, the number of available exception slots (n_avail).
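Equation 1 and the exception-address formula above translate directly into code; the sketch below uses made-up example values (a 2KB physical page, 64 lines, a 16B target size, 64B metadata).

```c
#include <stdint.h>
#include <stdio.h>

/* Number of exception slots per Equation 1: floor((P - (n*C* + M)) / C). */
static unsigned n_avail(unsigned P, unsigned n, unsigned c_star,
                        unsigned M, unsigned C) {
    return (P - (n * c_star + M)) / C;      /* floor via integer division */
}

/* Address of exception e_index, following
 * p-base + m-size * c-base + n*C* + M + e-index * C (m-size = 512B). */
static uint64_t exception_addr(uint64_t p_base, unsigned c_base, unsigned n,
                               unsigned c_star, unsigned M,
                               unsigned e_index, unsigned C) {
    return p_base + 512u * c_base + n * c_star + M + (uint64_t)e_index * C;
}

int main(void) {
    unsigned slots = n_avail(2048, 64, 16, 64, 64);   /* (2048 - 1088) / 64 */
    printf("available exception slots: %u\n", slots);
    printf("addr of exception 0: 0x%llx\n",
           (unsigned long long)exception_addr(0x100000, 0, 64, 16, 64, 0, 64));
    return 0;
}
```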

4.3 Operating System Memory Management

The first challenge related to main memory compression is to provide operating system support for managing variable-size compressed physical pages – i.e., LCPs. Depending on the compression scheme employed by the framework, different LCPs may be of different sizes. Allowing LCPs of arbitrary sizes would require the OS to keep track of main memory at a very fine granularity. It could also lead to fragmentation across the entire main memory at a fine granularity. As a result, the OS would need to maintain large amounts of metadata to maintain the locations of individual pages and the free space, which would also lead to increased complexity.

To avoid this problem, our mechanism allows the OS to manage main memory using a fixed number of pre-determined physical page sizes – e.g., 512B, 1KB, 2KB, 4KB (a similar approach was proposed in [4] to address the memory allocation problem). For each one of the chosen sizes, the OS maintains a pool of allocated pages and a pool of free pages. When a page is compressed for the first time or recompressed due to overflow (described in Section 4.6), the OS chooses the smallest available physical page size that fits the compressed page. For example, if a page is compressed to 768B, then the OS allocates a physical page of size 1KB. For a page with a given size, the available number of exceptions for the page, n_avail, can be determined using Equation 1.
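The size-class selection is straightforward; the sketch below picks the smallest size from the paper's example set that fits a compressed page (purely illustrative, not the OS's actual allocator).

```c
#include <stdio.h>

/* Pick the smallest pre-determined physical page size that fits the
 * compressed page; if none does, the page is stored uncompressed (4KB). */
static unsigned pick_physical_page_size(unsigned compressed_bytes) {
    static const unsigned sizes[] = {512, 1024, 2048, 4096};
    for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        if (compressed_bytes <= sizes[i])
            return sizes[i];
    return 4096;
}

int main(void) {
    /* The text's example: a page compressed to 768B gets a 1KB physical page. */
    printf("%u\n", pick_physical_page_size(768));
    return 0;
}
```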

4.4 Changes to the Cache Tagging Logic

As mentioned in Section 2.2, modern systems employ physically-tagged caches to avoid aliasing problems. However, in a system that employs main memory compression, using the physical (main memory) address to tag cache lines puts the main memory address computation on the critical path of L1 cache access (Challenge 2). To address this challenge, we modify the cache tagging logic to use the tuple <physical page base address, cache line index within the page> for tagging cache lines. This tuple maps to a unique cache line in the system, and hence avoids aliasing problems without requiring the exact main memory address to be computed. The additional index bits are stored within the cache line tag.

4.5 Changes to the Memory Controller

In addition to the changes to the memory controller operation described in Section 3.2, our LCP-based framework requires two hardware structures to be added to the memory controller: (i) a small metadata cache to accelerate main memory lookups in LCP, and (ii) compression/decompression hardware to perform the compression and decompression of cache lines.

4.5.1 Metadata Cache

As described in Section 3.2, a small metadata cache in the memory controller enables our approach, in the common case, to retrieve a compressed cache block in a single main memory access. This cache stores the metadata region of recently accessed LCPs so that the metadata for subsequent accesses to such recently-accessed LCPs can be retrieved directly from the cache. In our study, we find that a small 512-entry metadata cache (32KB⁴) can service 88% of the metadata accesses on average across all our workloads. Some applications have a lower hit rate, especially sjeng and astar [29]. An analysis of these applications reveals that their memory accesses exhibit very low locality. As a result, we also observed a low TLB hit rate for these applications. Because TLB misses are costlier than MD cache misses (the former requires multiple memory accesses), the low MD cache hit rate does not lead to significant performance degradation for these applications.

We expect the MD cache power to be much lower than the power consumed by other on-chip structures (e.g., L1 caches), because the MD cache is accessed much less frequently (hits in any on-chip cache do not lead to an access to the MD cache).

⁴We evaluated the sensitivity of performance to MD cache size and find that 32KB is the smallest size that enables our design to avoid most of the performance loss due to additional metadata accesses.

4.5.2 Compression/Decompression Hardware

Depending on the compression scheme employed with our LCP-based framework, the memory controller should be equipped with the hardware necessary to compress and decompress cache lines using the corresponding scheme. Although our framework does not impose any restrictions on the nature of the compression algorithm, it is desirable to have compression schemes that have low complexity and decompression latency – e.g., Frequent Pattern Compression (FPC) [2] and Base-Delta-Immediate Compression (BDI) [25]. In Section 4.7, we provide more details on how to adapt any compression algorithm to fit the requirements of LCP and also the specific changes we made to FPC and BDI as case studies of compression algorithms that we adapted to the LCP framework.

4.6 Handling Page Overflows

As described in Section 3.2, when a cache line is written back to main memory, the cache line may switch from being compressible to being incompressible. When this happens, the memory controller should explicitly find a slot in the exception storage for the uncompressed cache line. However, it is possible that all the slots in the exception storage are already used by other exceptions in the LCP. We call this scenario a page overflow. A page overflow increases the size of the LCP and leads to one of two scenarios: (i) the LCP still requires a physical page size that is smaller than the uncompressed virtual page size (type-1 page overflow), and (ii) the LCP now requires a physical page size that is larger than the uncompressed virtual page size (type-2 page overflow).

Type-1 page overflow simply requires the operating system to migrate the LCP to a physical page of larger size (without recompression). The OS first allocates a new page and copies the data from the old location to the new location. It then modifies the mapping for the virtual page to point to the new location. While in transition, the page is locked, so any memory request to this page is delayed. In our evaluations, we stall the application for 20,000 cycles⁵ when a type-1 overflow occurs; we also find that (on average) type-1 overflows happen less than once per two million instructions. We vary this latency between 10,000–100,000 cycles and observe that the benefits of our framework (e.g., bandwidth compression) far outweigh the overhead due to type-1 overflows.

⁵To fetch a 4KB page, we need to access 64 cache lines (64 bytes each). In the worst case, this will lead to 64 accesses to main memory, most of which are likely to be DRAM row-buffer hits. Since a row-buffer hit takes 7.5ns, the total time to fetch the page is 495ns. On the other hand, the latency penalty of two context switches (into the OS and out of the OS) is around 4us [20]. Overall, a type-1 overflow takes around 4.5us. For a 4.4GHz or slower processor, this is less than 20,000 cycles.

In a type-2 page overflow, the size of the LCP exceeds the uncompressed virtual page size. Therefore, the OS attempts to recompress the page, possibly using a different encoding (c-type). Depending on whether the page is compressible or not, the OS allocates a new physical page to fit the LCP or the uncompressed page, and migrates the data to the new location. The OS also appropriately modifies the c-bit, c-type and the c-base in the corresponding page table entry. Clearly, a type-2 overflow requires more work from the OS than a type-1 overflow. However, we expect page overflows of type-2 to occur rarely. In fact, we never observed a type-2 overflow in our evaluations.

4.6.1 Avoiding Recursive Page Faults

There are two types of pages that require special consideration: (i) pages that keep internal OS data structures, e.g., pages containing information required to handle page faults, and (ii) shared data pages that have more than one page table entry (PTE) mapping to the same physical page. Compressing pages of the first type can potentially lead to recursive page fault handling. The problem can be avoided if the OS sets a special do not compress bit, e.g., as a part of the page compression encoding, so that the memory controller does not compress these pages. The second type of pages (shared pages) require consistency across multiple page table entries, such that when one PTE's compression information changes, the second entry is updated as well. There are two possible solutions to this problem. First, as with the first type of pages, these pages can be marked as do not compress. Second, the OS could maintain consistency of the shared PTEs by performing multiple synchronous PTE updates (with accompanying TLB shootdowns). While the second solution can potentially lead to better average compressibility, the first solution (used in our implementation) is simpler and requires minimal changes inside the OS.

Another situation that can potentially lead to a recursive fault is the eviction of dirty cache lines from the LLC to DRAM due to some page overflow handling that leads to another overflow. In order to solve this problem, we assume that the memory controller has a small dedicated portion of main memory that is used as a scratchpad to store cache lines needed to perform page overflow handling. Dirty cache lines that are evicted from the LLC to DRAM due to OS overflow handling are stored in this buffer space. The OS is responsible for minimizing the memory footprint of the overflow handler. Note that this situation is expected to be very rare in practice, because even a single overflow is infrequent.

4.7 Compression Algorithms

Our LCP-based main memory compression framework can be employed with any compression algorithm. In this section, we describe how to adapt a generic compression algorithm to fit the requirements of the LCP framework. Subsequently, we describe how to adapt the two compression algorithms used in our evaluation.

4.7.1 Adapting a Compression Algorithm to Fit LCP

Every compression scheme is associated with a compression function, fc, and a decompression function, fd. To compress a virtual page into the corresponding LCP using the compression scheme, the memory controller carries out three steps. In the first step, the controller compresses every cache line in the page using fc and feeds the sizes of each compressed cache line to the second step. In the second step, the controller computes the total compressed page size (compressed data + metadata + exceptions, using the formulas from Section 4.2) for each of a fixed set of target compressed cache line sizes and selects a target compressed cache line size C* that minimizes the overall LCP size. In the third and final step, the memory controller classifies any cache line whose compressed size is less than or equal to the target size as compressible and all other cache lines as incompressible (exceptions). The memory controller uses this classification to generate the corresponding LCP based on the organization described in Section 3.1.
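To illustrate these three steps, the sketch below selects the target size C* for one page. This is a simplified software model, not the hardware implementation: compress_line stands in for fc, the fixed metadata cost is an assumption, and the candidate sizes are taken from the FPC adaptation described in Section 4.7.2.

```c
/* Hypothetical sketch of target-size selection (Section 4.7.1). */
#include <stddef.h>
#include <stdint.h>

#define LINES_PER_PAGE   64           /* 4KB page, 64B cache lines        */
#define UNCOMP_LINE_SIZE 64u
#define METADATA_SIZE    64u          /* per-page metadata, assumed fixed */

extern size_t compress_line(const uint8_t *line, uint8_t *out);  /* fc */

/* Total LCP size for one candidate target size: compressed region +
 * metadata + one uncompressed slot per exception (cf. Section 4.2). */
static size_t lcp_size_for_target(const size_t comp_size[], size_t target)
{
    size_t exceptions = 0;
    for (int i = 0; i < LINES_PER_PAGE; i++)
        if (comp_size[i] > target)
            exceptions++;
    return LINES_PER_PAGE * target + METADATA_SIZE
           + exceptions * UNCOMP_LINE_SIZE;
}

/* Step 1: compress every line; steps 2-3: pick C* and classify lines. */
size_t choose_target_size(const uint8_t page[LINES_PER_PAGE][UNCOMP_LINE_SIZE],
                          uint8_t is_exception[LINES_PER_PAGE])
{
    static const size_t candidates[] = { 16, 21, 32, 44 };  /* FPC targets */
    size_t comp_size[LINES_PER_PAGE];
    uint8_t scratch[UNCOMP_LINE_SIZE];

    for (int i = 0; i < LINES_PER_PAGE; i++)
        comp_size[i] = compress_line(page[i], scratch);

    size_t best_target = UNCOMP_LINE_SIZE;
    size_t best_total  = (size_t)-1;
    for (size_t c = 0; c < sizeof(candidates) / sizeof(candidates[0]); c++) {
        size_t total = lcp_size_for_target(comp_size, candidates[c]);
        if (total < best_total) {
            best_total  = total;
            best_target = candidates[c];
        }
    }

    /* Lines larger than C* become exceptions (stored uncompressed). */
    for (int i = 0; i < LINES_PER_PAGE; i++)
        is_exception[i] = (comp_size[i] > best_target);
    return best_target;   /* C* */
}
```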

To decompress a compressed cache line of the page, the memory controller reads the fixed-target-sized compressed data and feeds it to the hardware implementation of function fd.

4.7.2 FPC and BDI Compression Algorithms

Although any compression algorithm can be employed with our framework using the approach described above, it is desirable to use compression algorithms that have low-complexity hardware implementations and low decompression latency, so that the overall complexity and latency of the design are minimized. For this reason, we adapt to fit our LCP framework two state-of-the-art compression algorithms that achieve such design points in the context of compressing in-cache data: (i) Frequent Pattern Compression [2], and (ii) Base-Delta-Immediate Compression [25].

Frequent Pattern Compression (FPC) is based on the observation that a majority of the words accessed by applications fall under a small set of frequently occurring patterns [3]. FPC compresses each cache line one word at a time. Therefore, the final compressed size of a cache line is dependent on the individual words within the cache line. To minimize the time to perform the compression search procedure described in Section 4.7.1, we limit the search to four different target cache line sizes: 16B, 21B, 32B and 44B (similar to the fixed sizes used in [10]).

Base-Delta-Immediate (BDI) Compression is based on the observation that in most cases, words co-located in memory have small differences in their values, a property referred to as low dynamic range [25]. BDI encodes cache lines with such low dynamic range using a base value and an array of differences (∆s) of words within the cache line from either the base value or from zero. The size of the final compressed cache line depends only on the size of the base and the size of the ∆s. To employ BDI within our framework, the memory controller attempts to compress a page with different versions of the Base-Delta encoding as described by Pekhimenko et al. [25] and then chooses the combination that minimizes the final compressed page size (according to the search procedure in Section 4.7.1).
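As an illustration of one such Base-Delta check, the sketch below tests whether a 64B line can be encoded with an 8-byte base and 1-byte deltas, using zero as an implicit second base. This shows only one of the several (base size, delta size) combinations that BDI tries [25]; the function and variable names are illustrative.

```c
/* Hypothetical sketch of one BDI encoding check (8-byte base, 1-byte deltas). */
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_LINE 8   /* 64B cache line viewed as 8-byte words */

static bool fits_in_delta(int64_t delta, int delta_bytes)
{
    int64_t limit = 1ll << (8 * delta_bytes - 1);
    return delta >= -limit && delta < limit;
}

/* Returns true and fills the encoding if every word is within a 1-byte
 * delta of either the chosen base or zero. Compressed size in this case:
 * 8B base + 8 x 1B deltas + 1B zero-base mask = 17B instead of 64B. */
bool bdi_base8_delta1(const uint64_t line[WORDS_PER_LINE],
                      uint64_t *base, int8_t delta[WORDS_PER_LINE],
                      uint8_t *zero_mask)
{
    *base = line[0];          /* a simple choice of base for the sketch */
    *zero_mask = 0;

    for (int i = 0; i < WORDS_PER_LINE; i++) {
        int64_t d_zero = (int64_t)line[i];
        int64_t d_base = (int64_t)(line[i] - *base);

        if (fits_in_delta(d_zero, 1)) {          /* delta from zero      */
            *zero_mask |= (uint8_t)(1u << i);
            delta[i] = (int8_t)d_zero;
        } else if (fits_in_delta(d_base, 1)) {   /* delta from the base  */
            delta[i] = (int8_t)d_base;
        } else {
            return false;     /* this line needs a different encoding    */
        }
    }
    return true;
}
```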

5. LCP OPTIMIZATIONS

In this section, we describe two simple optimizations to our proposed LCP-based framework: (i) memory bandwidth reduction via compressed cache lines, and (ii) exploiting zero pages and cache lines for higher bandwidth utilization.

5.1 Enabling Memory Bandwidth Reduction

One potential benefit of main memory compression that has not been examined in detail by prior work on memory compression is bandwidth reduction.⁶ When cache lines are stored in compressed format in main memory, multiple consecutive compressed cache lines can be retrieved at the cost of retrieving a single uncompressed cache line. For example, when cache lines of a page are compressed to 1/4 their original size, four compressed cache lines can be retrieved at the cost of a single uncompressed cache line access. This can significantly reduce the bandwidth requirements of applications, especially those with good spatial locality. We propose two mechanisms that exploit this idea.

⁶ Prior work [11, 27, 32, 36] looked at the possibility of using compression for bandwidth reduction between the memory controller and DRAM. While significant reduction in bandwidth consumption is reported, prior work achieves this reduction either at the cost of increased memory access latency [11, 32, 36], as they have to both compress and decompress a cache line for every request, or based on a specialized main memory design [27], e.g., GDDR3 [16].

In the first mechanism, when the memory controller needs to access a cache line in the compressed data region of the LCP, it obtains the data from multiple consecutive compressed slots (which add up to the size of an uncompressed cache line). However, some of the cache lines that are retrieved in this manner may not be valid. To determine if an additionally-fetched cache line is valid or not, the memory controller consults the metadata corresponding to the LCP. If a cache line is not valid, then the corresponding data is not decompressed. Otherwise, the cache line is decompressed and then stored in the cache.

The second mechanism is an improvement over the first mechanism, where the memory controller additionally predicts if the additionally-fetched cache lines are useful for the application. For this purpose, the memory controller uses hints from a multi-stride prefetcher [14]. In this mechanism, if the stride prefetcher suggests that an additionally-fetched cache line is part of a useful stream, then the memory controller stores that cache line in the cache. This approach has the potential to prevent cache lines that are not useful from polluting the cache. Section 7.5 shows the effect of this approach on both performance and bandwidth consumption.
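The sketch below illustrates both mechanisms on the read path: one uncompressed-line-sized DRAM transfer returns several consecutive compressed slots, per-line validity comes from the LCP metadata, and an optional prefetcher hint decides whether an extra (not originally requested) line is inserted into the cache. The helper functions and their signatures are assumptions for illustration, not the paper's hardware design.

```c
/* Hypothetical memory-controller sketch of the Section 5.1 mechanisms. */
#include <stdbool.h>
#include <stdint.h>

#define UNCOMP_LINE_SIZE 64u

extern void dram_read(uint64_t addr, uint8_t *buf, uint32_t bytes);
extern bool md_line_valid(uint64_t page, uint32_t line_idx);      /* MD cache */
extern void decompress_line(const uint8_t *in, uint32_t size,
                            uint8_t out[UNCOMP_LINE_SIZE]);       /* fd */
extern void cache_insert(uint64_t page, uint32_t line_idx,
                         const uint8_t data[UNCOMP_LINE_SIZE]);
extern bool prefetcher_predicts_useful(uint64_t page, uint32_t line_idx);

void fetch_compressed_lines(uint64_t page, uint64_t comp_region_base,
                            uint32_t requested_idx, uint32_t target_size,
                            bool use_prefetch_hints)
{
    uint32_t lines_per_burst = UNCOMP_LINE_SIZE / target_size;  /* e.g., 4 */
    uint32_t first = (requested_idx / lines_per_burst) * lines_per_burst;
    uint8_t burst[UNCOMP_LINE_SIZE];

    /* One uncompressed-line-sized transfer brings several compressed slots. */
    dram_read(comp_region_base + (uint64_t)first * target_size,
              burst, UNCOMP_LINE_SIZE);

    for (uint32_t i = 0; i < lines_per_burst; i++) {
        uint32_t idx = first + i;
        bool extra = (idx != requested_idx);

        if (!md_line_valid(page, idx))
            continue;                       /* invalid slot: do not decompress */
        if (extra && use_prefetch_hints &&
            !prefetcher_predicts_useful(page, idx))
            continue;                       /* avoid polluting the cache       */

        uint8_t line[UNCOMP_LINE_SIZE];
        decompress_line(burst + i * target_size, target_size, line);
        cache_insert(page, idx, line);
    }
}
```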

Note that prior work [11, 27, 32, 36] assumed that when a cache line is compressed, only the compressed amount of data can be transferred over the DRAM bus, thereby freeing the bus for future accesses. Unfortunately, modern DRAM chips are optimized for full cache block accesses [38], so they would need to be modified to support such smaller granularity transfers. Our proposal does not require modifications to DRAM itself or the use of specialized DRAM such as GDDR3 [16].

5.2 Zero Pages and Zero Cache Lines

Prior work [2, 9, 10, 25, 37] observed that in-memory data contains a significant number of zeros at two granularities: all-zero pages and all-zero cache lines. Because this pattern is quite common, we propose two changes to the LCP framework to more efficiently compress such occurrences of zeros. First, one value of the page compression encoding (e.g., c-type of 0) is reserved to indicate that the entire page is zero. When accessing data from a page with c-type = 0, the processor can avoid any LLC or DRAM access by simply zeroing out the allocated cache line in the L1 cache. Second, to compress all-zero cache lines more efficiently, we can add another bit per cache line to the first part of the LCP metadata. This bit, which we call the z-bit, indicates if the corresponding cache line is zero. Using this approach, the memory controller does not require any main memory access to retrieve a cache line with the z-bit set (assuming a metadata cache hit).
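A minimal read-path sketch of these two zero optimizations is shown below, assuming the all-zero page encoding is c-type 0 as in the example above and that the relevant metadata is already cached; the helper names are hypothetical.

```c
/* Hypothetical read path for the Section 5.2 zero optimizations. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UNCOMP_LINE_SIZE 64u
#define CTYPE_ZERO_PAGE  0u      /* reserved encoding for all-zero pages */

extern uint8_t md_c_type(uint64_t page);                     /* from TLB/PTE  */
extern bool    md_z_bit(uint64_t page, uint32_t line_idx);   /* from MD cache */
extern void    fetch_and_decompress(uint64_t page, uint32_t line_idx,
                                    uint8_t out[UNCOMP_LINE_SIZE]);

void read_cache_line(uint64_t page, uint32_t line_idx,
                     uint8_t out[UNCOMP_LINE_SIZE])
{
    if (md_c_type(page) == CTYPE_ZERO_PAGE || md_z_bit(page, line_idx)) {
        memset(out, 0, UNCOMP_LINE_SIZE);       /* zero fill, no DRAM access */
        return;
    }
    fetch_and_decompress(page, line_idx, out);  /* normal LCP read path      */
}
```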

6. METHODOLOGY

Our evaluations use an in-house, event-driven 32-bit x86 simulator whose front-end is based on Simics [22]. All configurations have private L1 caches and shared L2 caches. Major simulation parameters are provided in Table 1. We use benchmarks from the SPEC CPU2006 suite [29], four TPC-H/TPC-C queries [33], and an Apache web server. All results are collected by running a representative portion (based on PinPoints [24]) of the benchmarks for 1 billion instructions. We build our energy model based on McPAT [21], CACTI [31], C-Pack [5], and the Synopsys Design Compiler with a 65nm library (to evaluate the energy of compression/decompression with BDI and address calculation in [10]).

Table 1: Major Parameters of the Simulated Systems.
  CPU processor:   1–4 cores, 4GHz, x86 in-order
  CPU L1-D cache:  32KB, 64B cache line, 2-way, 1 cycle
  CPU L2 cache:    2MB, 64B cache line, 16-way, 20 cycles
  Main memory:     2GB, 4 banks, 8KB row buffers, 1 memory channel, DDR3-1066 [23]
  LCP design:      Type-1 overflow penalty: 20,000 cycles

Metrics. We measure the performance of our benchmarks using IPC (instructions per cycle) and effective compression ratio (effective DRAM size increase, e.g., a compression ratio of 1.5 for 2GB DRAM means that the compression scheme achieves the size benefits of a 3GB DRAM). For multi-programmed workloads we use the weighted speedup [28] performance metric: ∑_i (IPC_i^shared / IPC_i^alone). For bandwidth consumption we use BPKI (bytes transferred over the memory bus per thousand instructions [30]).
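For clarity, the two metrics can be computed as follows; this is only an illustrative snippet, and the per-program IPCs and bus byte counts are assumed to come from the simulator.

```c
/* Illustrative computation of weighted speedup and BPKI. */
#include <stdint.h>

double weighted_speedup(const double ipc_shared[], const double ipc_alone[],
                        int num_programs)
{
    double ws = 0.0;
    for (int i = 0; i < num_programs; i++)
        ws += ipc_shared[i] / ipc_alone[i];   /* sum of per-program slowdown-adjusted IPCs */
    return ws;
}

double bpki(uint64_t bytes_on_bus, uint64_t instructions)
{
    /* Bytes transferred over the memory bus per thousand instructions. */
    return (double)bytes_on_bus / ((double)instructions / 1000.0);
}
```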

Parameters of the Evaluated Schemes. As reported in the respective previous works, we used a decompression latency of 5 cycles for FPC and 1 cycle for BDI.

7. RESULTS

In our experiments for both single-core and multi-core systems, we compare five different designs that employ different main memory compression strategies (frameworks) and different compression algorithms: (i) Baseline system with no compression, (ii) robust main memory compression (RMC-FPC) [10], (iii) and (iv) LCP framework with both FPC and BDI compression algorithms (LCP-FPC and LCP-BDI), and (v) MXT [1]. Note that it is fundamentally possible to build an RMC-BDI design as well, but we found that it leads to either low energy efficiency (due to an increase in the BST metadata table entry size [10] with many more encodings in BDI) or low compression ratio (when the number of encodings is artificially decreased). Hence, for brevity, we exclude this potential design from our experiments.

In addition, we evaluate two hypothetical designs: Zero Page Compression (ZPC) and Lempel-Ziv (LZ)⁷ to show some practical upper bounds on main memory compression. Table 2 summarizes all the designs.

⁷ Our implementation of LZ performs compression at 4KB page granularity and serves as an idealized upper bound for the in-memory compression ratio. In contrast, MXT employs Lempel-Ziv at 1KB granularity.

Table 2: List of evaluated designs.
  Name       Framework   Compression Algorithm
  Baseline   None        None
  RMC-FPC    RMC [10]    FPC [2]
  LCP-FPC    LCP         FPC [2]
  LCP-BDI    LCP         BDI [25]
  MXT        MXT [1]     Lempel-Ziv [40]
  ZPC        None        Zero Page Compression
  LZ         None        Lempel-Ziv [40]

7.1 Effect on DRAM Capacity

Figure 8 compares the compression ratio of all the designs described in Table 2. We draw two major conclusions. First, as expected, MXT, which employs the complex LZ algorithm, has the highest average compression ratio (2.30) of all practical designs and performs closely to our idealized LZ implementation (2.60). At the same time, LCP-BDI provides a reasonably high compression ratio (1.62 on average), outperforming RMC-FPC (1.59) and LCP-FPC (1.52). (Note that LCP could be used with both BDI and FPC algorithms together, and the average compression ratio in this case is as high as 1.69.)

Second, while the average compression ratio of ZPC is relatively low (1.29), it greatly improves the effective memory capacity for a number of applications (e.g., GemsFDTD, zeusmp, and cactusADM). This justifies our design decision of handling zero pages at the TLB-entry level. We conclude that our LCP framework achieves the goal of high compression ratio.

7.1.1 Distribution of Compressed Pages

The primary reason why applications have different compression ratios is the redundancy difference in their data. This leads to the situation where every application has its own distribution of compressed pages with different sizes (0B, 512B, 1KB, 2KB, 4KB). Figure 9 shows these distributions for the applications in our study when using the LCP-BDI design. As we can see, the percentage of memory pages of every size in fact varies significantly between the applications, leading to different compression ratios (shown in Figure 8). For example, cactusADM has a high compression ratio due to many 0B and 512B pages (there is a significant number of zero cache lines in its data), while astar and h264ref get most of their compression with 2KB pages due to cache lines with low dynamic range [25].

7.1.2 Compression Ratio over Time

To estimate the efficiency of LCP-based compression over time, we conduct an experiment where we measure the compression ratios of our applications every 100 million instructions (for a total period of 5 billion instructions). The key observation we make is that the compression ratio for most of the applications is stable over time (the difference between the highest and the lowest ratio is within 10%). Figure 10 shows all notable outliers from this observation: astar, cactusADM, h264ref, and zeusmp. Even for these applications, the compression ratio stays relatively constant for a long period of time, although there are some noticeable fluctuations in compression ratio (e.g., for astar at around 4 billion instructions, for cactusADM at around 500M instructions). We attribute this behavior to a phase change within an application that sometimes leads to changes in the application's data. Fortunately, these cases are infrequent and do not have a noticeable effect on the application's performance (as we describe in Section 7.2). We conclude that the capacity benefits provided by the LCP-based frameworks are usually stable over long periods of time.

Figure 8: Main memory compression ratio.

Figure 9: Compressed page size distribution with LCP-BDI (fraction of pages that are 0B, 512B, 1KB, 2KB, or 4KB/uncompressed, per application).

Figure 10: Compression ratio over time with LCP-BDI (shown for astar, cactusADM, h264ref, and zeusmp; x-axis in units of 100M instructions).

7.2 Effect on Performance

Main memory compression can improve performance in two major ways: (i) reduced memory bandwidth requirements, which can enable less contention on the main memory bus, an increasingly important bottleneck in systems, and (ii) reduced memory footprint, which can reduce long-latency disk accesses. We evaluate the performance improvement due to memory bandwidth reduction (including our optimizations for compressing zero values described in Section 5.2) in Sections 7.2.1 and 7.2.2. We also evaluate the decrease in page faults in Section 7.2.3.

7.2.1 Single-Core Results

Figure 11 shows the performance of single-core workloads using three key evaluated designs (RMC-FPC, LCP-FPC, and LCP-BDI) normalized to the Baseline. Compared against an uncompressed system (Baseline), the LCP-based designs (LCP-BDI and LCP-FPC) improve performance by 6.1%/5.2% and also outperform RMC-FPC.⁸ We conclude that our LCP framework is effective in improving performance by compressing main memory.

⁸ Note that in order to provide a fair comparison, we enhanced the RMC-FPC approach with the same optimizations we did for LCP, e.g., bandwidth compression. The original RMC-FPC design reported an average degradation in performance [10].

Figure 11: Performance comparison (IPC) of different compressed designs for the single-core system.

Note that LCP-FPC outperforms RMC-FPC (on average) despite having a slightly lower compression ratio. This is mostly due to the lower overhead when accessing metadata information (RMC-FPC needs two memory accesses to different main memory pages in the case of a BST table miss, while the LCP-based framework performs two accesses to the same main memory page that can be pipelined). This is especially noticeable in several applications, e.g., astar, milc, and xalancbmk, that have low metadata table (BST) hit rates (LCP can also degrade performance for these applications). We conclude that our LCP framework is more effective in improving performance than RMC [10].

7.2.2 Multi-Core Results

When the system has a single core, the memory bandwidth pressure may not be large enough to take full advantage of the bandwidth benefits of main memory compression. However, in a multi-core system where multiple applications are running concurrently, savings in bandwidth (reduced number of memory bus transfers) may significantly increase the overall system performance.

To study this effect, we conducted experiments using 100 randomly generated multiprogrammed mixes of applications (for both 2-core and 4-core workloads). Our results show that the bandwidth benefits of memory compression are indeed more pronounced for multi-core workloads. Using our LCP-based design, LCP-BDI, the average performance improvement (normalized to the performance of the Baseline system without compression) is 13.9% for 2-core workloads and 10.7% for 4-core workloads. We summarize our multi-core performance results in Figure 12a.

We also vary the last-level cache size (1MB–16MB) for both single-core and multi-core systems across all evaluated workloads. We find that LCP-based designs outperform the Baseline across all evaluated systems (average performance improvement for single-core varies from 5.1% to 13.4%), even when the L2 cache size of the system is as large as 16MB.

Figure 12: Performance (with 2GB DRAM) and number of page faults (varying DRAM size) using LCP-BDI. (a) Average performance improvement (weighted speedup). (b) Number of page faults (normalized to Baseline with 256MB).

7.2.3 Effect on the Number of Page Faults

Modern systems are usually designed such that concurrently-running applications have enough main memory to avoid most of the potential capacity page faults. At the same time, if the applications' total working set size exceeds the main memory capacity, the increased number of page faults can significantly affect performance. To study the effect of the LCP-based framework (LCP-BDI) on the number of page faults, we evaluate twenty randomly generated 16-core multiprogrammed mixes of applications from our benchmark set. We also vary the main memory capacity from 256MB to 1GB (larger memories usually lead to almost no page faults for these workload simulations). Our results (Figure 12b) show that the LCP-based framework (LCP-BDI) can decrease the number of page faults by 21% on average (for 1GB DRAM) when compared with the Baseline design with no compression. We conclude that the LCP-based framework can significantly decrease the number of page faults, and hence improve system performance beyond the benefits it provides due to reduced bandwidth.

7.3 Effect on Bus Bandwidth and Memory Subsystem Energy

When DRAM pages are compressed, the traffic between the LLC and DRAM can be reduced. This can have two positive effects: (i) reduction in the average latency of memory accesses, which can lead to improvement in the overall system performance, and (ii) decrease in the bus energy consumption due to the decrease in the number of transfers.

Figure 13 shows the reduction in main memory bandwidth between LLC and DRAM (in terms of bytes per kilo-instruction, normalized to the Baseline system with no compression) using different compression designs. The key observation we make from this figure is that there is a strong correlation between bandwidth compression and performance improvement (Figure 11). Applications that show a significant reduction in bandwidth consumption (e.g., GemsFDTD, cactusADM, soplex, zeusmp, leslie3d, and the four TPC queries) also see large performance improvements. There are some noticeable exceptions to this observation, e.g., h264ref, wrf and bzip2. Although the memory bus traffic is compressible in these applications, main memory bandwidth is not the bottleneck for their performance.

Figure 13: Effect of different main memory compression schemes on memory bandwidth.

Figure 14 shows the reduction in memory subsystem energy of three systems that employ main memory compression (RMC-FPC, LCP-FPC, and LCP-BDI), normalized to the energy of Baseline. The memory subsystem energy includes the static and dynamic energy consumed by caches, TLBs, memory transfers, and DRAM, plus the energy of additional components due to main memory compression: BST [10], MD cache, address calculation, and compressor/decompressor units. Two observations are in order.

Figure 14: Effect of different main memory compression schemes on memory subsystem energy.

First, our LCP-based designs (LCP-BDI and LCP-FPC) improve the memory subsystem energy by 5.2% / 3.4% on average over the Baseline design with no compression, and by 11.3% / 9.5% over the state-of-the-art design (RMC-FPC) based on [10]. This is especially noticeable for bandwidth-limited applications, e.g., zeusmp and cactusADM. We conclude that our framework for main memory compression enables significant energy savings, mostly due to the decrease in bandwidth consumption.

Second, RMC-FPC consumes significantly more energy than Baseline (6.1% more energy on average, as high as 21.7% for dealII). The primary reason for this energy consumption increase is the physical address calculation that RMC-FPC speculatively performs on every L1 cache miss (to avoid increasing the memory latency due to complex address calculations). The second reason is the frequent (every L1 miss) accesses to the BST table (described in Section 2) that holds the address calculation information.

Note that other factors, e.g., compression/decompression energy overheads or different compression ratios, are not the reasons for this energy consumption increase. LCP-FPC uses the same compression algorithm as RMC-FPC (and even has a slightly lower compression ratio), but does not increase energy consumption; in fact, LCP-FPC improves the energy consumption due to its decrease in consumed bandwidth. We conclude that our LCP-based framework is a more energy-efficient main memory compression framework than previously proposed designs such as RMC-FPC.

7.4 Analysis of LCP Parameters

7.4.1 Analysis of Page Overflows

As described in Section 4.6, page overflows can stall an application for a considerable duration. As we mentioned in that section, we did not encounter any type-2 overflows (the more severe type) in our simulations. Figure 15 shows the number of type-1 overflows per instruction. The y-axis uses a log scale as the number of overflows per instruction is very small. As the figure shows, on average, less than one type-1 overflow occurs every one million instructions.

Although such overflows are more frequent for some applications (e.g., soplex and the three tpch queries), our evaluations show that this does not degrade performance in spite of adding a 20,000 cycle penalty for each type-1 page overflow.⁹ In fact, these applications gain significant performance from our LCP design. The main reason for this is that the performance benefits of bandwidth reduction far outweigh the performance degradation due to type-1 overflows. We conclude that page overflows do not prevent the proposed LCP framework from providing good overall performance.

⁹ We varied the type-1 overflow latency from 10,000 to 100,000 cycles and found that the impact on performance was negligible as we varied the latency. Prior work on main memory compression [10] also used the 10,000 to 100,000 cycle range for such overflows.

Figure 15: Type-1 page overflows for different applications (overflows per instruction, log scale).

7.4.2 Number of Exceptions

The number of exceptions (uncompressed cache lines) in the LCP framework is critical for two reasons. First, it determines the size of the physical page required to store the LCP. The higher the number of exceptions, the larger the required physical page size. Second, it can affect an application's performance as exceptions require three main memory accesses on an MD cache miss (Section 3.2). We studied the average number of exceptions (across all compressed pages) for each application. Figure 16 shows the results of these studies.

The number of exceptions varies from as low as 0.02/page for GemsFDTD to as high as 29.2/page in milc (17.3/page on average). The average number of exceptions has a visible impact on the compression ratio of applications (Figure 8). An application with a high compression ratio also has relatively few exceptions per page. Note that we do not restrict the number of exceptions in an LCP. As long as an LCP fits into a physical page not larger than the uncompressed page size (i.e., 4KB in our system), it will be stored in compressed form irrespective of how large the number of exceptions is. This is why applications like milc have a large number of exceptions per page. We note that better performance is potentially achievable by either statically or dynamically limiting the number of exceptions per page; a complete evaluation of the design space is a part of our future work.

Figure 16: Average number of exceptions per compressed page for different applications.

7.5 Comparison to Stride Prefetching

Our LCP-based framework improves performance due to its ability to transfer multiple compressed cache lines using a single memory request. Because this benefit resembles that of prefetching cache lines into the LLC, we compare our LCP-based design to a system that employs a stride prefetcher implemented as described in [14]. Figures 17 and 18 compare the performance and bandwidth consumption of three systems: (i) one that employs stride prefetching, (ii) one that employs LCP-BDI, and (iii) one that employs LCP-BDI along with hints from a prefetcher to avoid cache pollution due to bandwidth compression (Section 5.1). Two conclusions are in order.

First, our LCP-based designs (second and third bars) are competitive with the more general stride prefetcher for all but a few applications (e.g., libquantum). The primary reason is that a stride prefetcher can sometimes increase the memory bandwidth consumption of an application due to inaccurate prefetch requests. On the other hand, LCP obtains the benefits of prefetching without increasing (in fact, while significantly reducing) memory bandwidth consumption.

Second, the effect of using prefetcher hints to avoid cache pollution is not significant. The reason for this is that our systems employ a large, highly-associative LLC (2MB, 16-way), which is less susceptible to cache pollution. Evicting the LRU lines from such a cache has little effect on performance, but we did observe the benefits of this mechanism on multi-core systems with shared caches (up to 5% performance improvement for some two-core workload mixes; not shown).

Figure 17: Performance comparison with stride prefetching, and using prefetcher hints with the LCP framework.

Figure 18: Bandwidth comparison with stride prefetching.

8. CONCLUSION

Data compression is a promising technique to increase the effective main memory capacity without significantly increasing cost and power consumption. As we described in this paper, the primary challenge in incorporating compression in main memory is to devise a mechanism that can efficiently compute the main memory address of a cache line without significantly adding complexity, cost, or latency. Prior approaches to addressing this challenge are either relatively costly or energy inefficient.

In this work, we proposed a new main memory compression framework, called Linearly Compressed Pages (LCP), to address this problem. The two key ideas of LCP are to use a fixed size for compressed cache lines within a page (which simplifies main memory address computation) and to enable a page to be compressed even if some cache lines within the page are incompressible (which enables high compression ratios). We showed that any compression algorithm can be adapted to fit the requirements of our LCP-based framework.

We evaluated the LCP-based framework using two state-of-the-art compression algorithms (Frequent Pattern Compression and Base-Delta-Immediate Compression) and showed that it can significantly increase effective memory capacity (by 69%) and reduce page fault rate (by 23%). We showed that storing compressed data in main memory can also enable the memory controller to reduce memory bandwidth consumption (by 24%), leading to significant performance and energy improvements on a wide variety of single-core and multi-core systems with different cache sizes. Based on our results, we conclude that the proposed LCP-based framework provides an effective approach for designing low-complexity and low-latency compressed main memory.

Acknowledgments

Many thanks to Brian Hirano, Kayvon Fatahalian, David Hansquine and Karin Strauss for their feedback during various stages of this project. We thank the anonymous reviewers and our shepherd Andreas Moshovos for their feedback. We acknowledge members of the SAFARI and LBA groups for their feedback and for the stimulating research environment they provide. We acknowledge the support of AMD, IBM, Intel, Oracle, Samsung and Microsoft. This research was partially supported by NSF (CCF-0953246, CCF-1147397, CCF-1212962), Intel University Research Office Memory Hierarchy Program, Intel Science and Technology Center for Cloud Computing, Semiconductor Research Corporation and a Microsoft Research Fellowship.

References

[1] B. Abali et al. Memory Expansion Technology (MXT): Software Support and Performance. IBM J. Res. Dev., 2001.
[2] A. R. Alameldeen and D. A. Wood. Adaptive Cache Compression for High-Performance Processors. In ISCA-31, 2004.
[3] A. R. Alameldeen and D. A. Wood. Frequent Pattern Compression: A Significance-Based Compression Scheme for L2 Caches. Tech. Rep., 2004.
[4] E. D. Berger. Memory Management for High-Performance Applications. PhD thesis, 2002.
[5] X. Chen et al. C-Pack: A High-Performance Microprocessor Cache Compression Algorithm. IEEE Transactions on VLSI Systems, 2010.
[6] E. Cooper-Balis, P. Rosenfeld, and B. Jacob. Buffer-On-Board Memory Systems. In ISCA, 2012.
[7] R. S. de Castro, A. P. do Lago, and D. Da Silva. Adaptive Compressed Caching: Design and Implementation. In SBAC-PAD, 2003.
[8] F. Douglis. The Compression Cache: Using On-line Compression to Extend Physical Memory. In Winter USENIX Conference, 1993.
[9] J. Dusser et al. Zero-Content Augmented Caches. In ICS, 2009.
[10] M. Ekman and P. Stenström. A Robust Main-Memory Compression Scheme. In ISCA-32, 2005.
[11] M. Farrens and A. Park. Dynamic Base Register Caching: A Technique for Reducing Address Bus Width. In ISCA, 1991.
[12] E. G. Hallnor and S. K. Reinhardt. A Unified Compressed Memory Hierarchy. In HPCA-11, 2005.
[13] D. Huffman. A Method for the Construction of Minimum-Redundancy Codes. IRE, 1952.
[14] S. Iacobovici et al. Effective Stream-Based and Execution-Based Data Prefetching. In ICS, 2004.
[15] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, 2013.
[16] JEDEC. GDDR3 Specific SGRAM Functions, JESD21-C, 2012.
[17] U. Kang et al. 8Gb 3D DDR3 DRAM Using Through-Silicon-Via Technology. In ISSCC, 2009.
[18] S. F. Kaplan. Compressed Caching and Modern Virtual Memory Simulation. PhD thesis, 1999.
[19] C. Lefurgy et al. Energy Management for Commercial Servers. In IEEE Computer, 2003.
[20] C. Li, C. Ding, and K. Shen. Quantifying the Cost of Context Switch. In ExpCS, 2007.
[21] S. Li et al. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In MICRO-42, 2009.
[22] P. S. Magnusson et al. Simics: A Full System Simulation Platform. IEEE Computer, 2002.
[23] Micron. 2Gb: x4, x8, x16, DDR3 SDRAM, 2012.
[24] H. Patil et al. Pinpointing Representative Portions of Large Intel Itanium Programs with Dynamic Instrumentation. In MICRO-37, 2004.
[25] G. Pekhimenko et al. Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches. In PACT, 2012.
[26] G. Pekhimenko et al. Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency. SAFARI Technical Report No. 2012-002, 2012.
[27] V. Sathish, M. J. Schulte, and N. S. Kim. Lossless and Lossy Memory I/O Link Compression for Improving Performance of GPGPU Workloads. In PACT, 2012.
[28] A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreaded Processor. In ASPLOS-9, 2000.
[29] SPEC CPU2006. http://www.spec.org/.
[30] S. Srinath et al. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. In HPCA-13, 2007.
[31] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi. CACTI 5.1. Technical Report HPL-2008-20, HP Laboratories, 2008.
[32] M. Thuresson et al. Memory-Link Compression Schemes: A Value Locality Perspective. IEEE TC, 2008.
[33] Transaction Processing Performance Council. http://www.tpc.org/.
[34] R. B. Tremaine et al. Pinnacle: IBM MXT in a Memory Controller Chip. IEEE Micro, 2001.
[35] P. R. Wilson, S. F. Kaplan, and Y. Smaragdakis. The Case for Compressed Caching in Virtual Memory Systems. In USENIX Annual Technical Conference, 1999.
[36] J. Yang, R. Gupta, and C. Zhang. Frequent Value Encoding for Low Power Data Buses. ACM TODAES, 2004.
[37] J. Yang, Y. Zhang, and R. Gupta. Frequent Value Compression in Data Caches. In MICRO-33, 2000.
[38] D. H. Yoon, M. K. Jeong, M. Sullivan, and M. Erez. The Dynamic Granularity Memory System. In ISCA, 2012.
[39] Y. Zhang, J. Yang, and R. Gupta. Frequent Value Locality and Value-Centric Data Cache Design. In ASPLOS-9, 2000.
[40] J. Ziv and A. Lempel. A Universal Algorithm for Sequential Data Compression. IEEE TIT, 1977.
