
Architectural Support for Address Translation on GPUs

Bharath Pichai Lisa Hsu∗ Abhishek Bhattacharjee

Dept. of Computer Science, Rutgers University          ∗Qualcomm Research
{bsp57, abhib}@cs.rutgers.edu          [email protected]

Technical Report DCS–TR–703, Submitted July 2013

ABSTRACT

The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent example, necessitates a manageable programming model to ensure widespread adoption. A key component of this is a shared unified address space between the heterogeneous units to obtain the programmability benefits of virtual memory. Indeed, processor vendors have already begun embracing heterogeneous systems with unified address spaces (e.g., Intel's Haswell, AMD's Berlin processor, and ARM's Mali and Cortex cores).

We are the first to explore GPU Translation Lookaside Buffers (TLBs) and page table walkers for address translation in the context of shared virtual memory for heterogeneous systems. To exploit the programmability benefits of shared virtual memory, it is natural to consider mirroring CPUs and placing TLBs prior (or parallel) to cache accesses, making caches physically addressed. We show the performance challenges of such an approach and propose modest hardware augmentations to recover much of this lost performance.

We then consider the impact of this approach on the design of general purpose GPU performance improvement schemes. We look at: (1) warp scheduling to increase cache hit rates; and (2) dynamic warp formation to mitigate control flow divergence overheads. We show that introducing cache-parallel address translation does pose challenges, but that modest optimizations can buy back much of this lost performance.

Overall, this paper explores address translation mechanisms on GPUs. While cache-parallel address translation does introduce non-trivial performance overheads, modestly TLB-aware designs can move overheads into a range deemed acceptable in the CPU world (5-15% of runtime). We presume this initial design leaves room for improvement but hope the larger result, that a little TLB-awareness goes a long way in GPUs, spurs future work in this fruitful area.

1. INTRODUCTION

The advent of the dark silicon era [59] has generated research on hardware accelerators. Recent heterogeneous systems include accelerators for massive multidimensional datasets [61], common signal processing operations [53], object caching [41], and spatially-programmed architectures [51]. To ensure widespread adoption of accelerators, their programming models must be efficient and easy to use.

One option is to have unified virtual and physical address spaces between CPUs and accelerators. Unified address spaces enjoy many benefits; they make data structures and pointers globally visible among compute units, obviating the need for expensive memory copies between CPUs and accelerators. They also unburden CPUs from pinning data pages for accelerators in main memory, improving memory efficiency. Unified address spaces do, however, also require architectural support for virtual-to-physical address translation.

CPUs currently use per-core Translation Lookaside Buffers (TLBs) and hardware page table walkers (PTWs) to access frequently-used address translations from operating system (OS) page tables. CPU TLBs and PTWs are accessed before (or in parallel with) hardware caches, making caches physically-addressed. This placement restricts TLB size (so that its access time does not overly increase cache access times), but efficiently supports multiple contexts, dynamically-linked libraries, cache coherence, and unified virtual/physical address spaces. Overall, CPU TLB misses expend 5-15% of runtime [7, 9, 10, 11, 12, 52], which is considered acceptable for their programmability benefits.

As accelerators proliferate, we must study address translation and its role in the impending unified virtual address space programming paradigm. We are the first to explore key accelerator TLB and PTW designs in this context. We focus on general purpose programming for graphics processing units (GPUs) [15, 19, 50, 49] for two reasons. First, GPUs are a relatively mature acceleration technology that has seen significant recent research [20, 21, 31, 42, 46, 55, 57]. Second, processor vendors like Intel, AMD, ARM, Qualcomm, and Samsung are embracing integrated CPU/GPUs and moving towards fully unified address space support, as detailed in Heterogeneous Systems Architecture (HSA) [40] specifications. For example, AMD's upcoming Berlin processor commits to fully unified address spaces using heterogeneous uniform memory access (hUMA) technology [54].

Today's GPUs and CPUs typically use separate virtual and physical address spaces.


Main memory may be physically shared, but it is usually partitioned or allows only unidirectional coherence (e.g., ARM allows accelerators to snoop CPU memory partitions but not the other way around). GPU address translation has traditionally been performed by Input/Output Memory Management Unit (IOMMU) TLBs in the memory controller, leaving caches virtually-addressed. This sufficed because there was no concept of coherence or shared memory support with the host CPU. However, as vendors pursue address space unification with specifications like HSA, this approach becomes insufficient. This poses the following question: how should GPU address translation be architected to ease programmability, with minimal performance overheads? There is not yet an industry or academic consensus on this question, leaving open many possible TLB and PTW architectures.

In this context, we are the first to study GPU address translation in a unified virtual/physical memory space paradigm. We consider a natural design point by mirroring CPUs and placing GPU TLBs and PTWs before (or in parallel with) cache access, realizing physically-addressed caches. This achieves efficient support for multiple contexts and application libraries, and cache coherence between CPU and GPU caches (all of which are industry goals [40]).

We show that vanilla GPU TLBs degrade performance, demonstrating that the success of accelerator platforms may hinge upon more thoughtful TLB designs. In response, we propose augmentations that recover much of this lost performance. We then show how address translation affects: (1) warp scheduling to increase cache hit rates; and (2) dynamic warp formation to mitigate control flow divergence overheads. Specifically, our contributions are as follows:

First, we show that placing TLBs and PTWs before (or in parallel with) cache access degrades performance by 20-50%. This is because cache-parallel accesses mean latency (and thus sizing) is paramount. Meanwhile, GPU SIMD execution, in which multiple warp threads execute in lock-step, means a TLB miss from a single thread can stall all warp threads, magnifying miss penalties. Fortunately, we show that modest optimizations recover most lost performance.

Second, we show the impact of TLBs on cache-conscious warp/wavefront scheduling (CCWS) [55], recently proposed to boost GPU cache hit rates. While naively adding TLBs offsets CCWS, simple modifications yield performance close to ideal CCWS (without TLBs). We also study TLB-based CCWS schemes that are simpler and higher-performance.

Third, we show how TLBs affect dynamic warp formation for branch divergence. Using thread block compaction (TBC) [20], we find that dynamically assimilating threads from disparate warps increases memory divergence and TLB misses. Though naive designs degrade performance by over 20%, adding TLB-awareness mitigates these overheads, boosting performance close to ideal TBC (without TLBs).

While our work is relevant to server and client systems, our evaluations focus on server workloads. Our insights are, however, relevant across the compute spectrum. All address translation support has performance overheads; the question is whether the overheads are worth the associated programmability benefits. Current CPUs generally deem 5-15% performance degradations acceptable [7, 8, 11, 12, 18]. While our GPU TLB proposals meet these ranges, we shed light on various GPU address translation design issues rather than advocate a specific implementation. In fact, we believe that there is plenty of room for improvement and even alternate GPU TLB designs and placement (e.g., opportunistic virtual caching [9]). This paper provides valuable insights for future studies and a methodology for reasoning about the impact of address translation on GPU performance.

2. BACKGROUND

2.1 Address Translation on CPUs

CPU TLBs have long been studied by the academic community [7, 8, 11, 12, 33, 52]. Most CPUs currently access TLBs prior to (or in parallel with) L1 cache access, realizing physically-addressed caches. This approach dominates commercial systems (over virtually-addressed caches [16, 38]) because of its programmability benefits. Physically-addressed caches prevent coherence problems from address synonyms (multiple virtual addresses mapping to the same physical address) and homonyms (a single virtual address mapping to multiple physical addresses) [16]. This efficiently supports multiple contexts, dynamically-linked libraries, and cache coherence among multiple cores and with DMA engines.
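To make the overlap concrete, the following is a minimal C++ sketch of a virtually-indexed, physically-tagged (VIPT) lookup, assuming 4KB pages, 64-byte lines, and 64 sets so that the set index falls entirely within the page offset. The structures and names are illustrative, not a description of any particular CPU's hardware.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

constexpr uint64_t kPageBits = 12;   // 4KB pages (assumed)
constexpr uint64_t kLineBits = 6;    // 64-byte lines (assumed)
constexpr uint64_t kSetBits  = 6;    // 64 sets: index bits lie inside the page offset

struct TLB {
    std::unordered_map<uint64_t, uint64_t> entries;    // VPN -> PPN, stand-in for a small CAM
    std::optional<uint64_t> lookup(uint64_t vpn) const {
        auto it = entries.find(vpn);
        if (it == entries.end()) return std::nullopt;   // TLB miss: a page walk is needed
        return it->second;
    }
};

struct CacheLine { bool valid = false; uint64_t ptag = 0; };

struct VIPTCache {
    CacheLine sets[1u << kSetBits][8];                  // 8-way set-associative

    bool access(uint64_t vaddr, const TLB& tlb) {
        // Set selection uses only page-offset bits, so it proceeds while the
        // TLB translates the virtual page number in parallel.
        uint64_t set = (vaddr >> kLineBits) & ((1u << kSetBits) - 1);
        auto ppn = tlb.lookup(vaddr >> kPageBits);
        if (!ppn) return false;                         // miss handling elided
        uint64_t paddr = (*ppn << kPageBits) | (vaddr & ((1u << kPageBits) - 1));
        uint64_t ptag  = paddr >> (kLineBits + kSetBits);
        for (const auto& way : sets[set])
            if (way.valid && way.ptag == ptag) return true;   // hit on physical tag
        return false;                                   // cache miss
    }
};
```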

2.2 Address Translation on CPU/GPUs

Recent interest in general purpose programming for GPUs has spurred commercial CPU/GPU implementations and research to program them [15, 19, 50, 49]. Intel, AMD, ARM, Qualcomm, and Samsung already support integrated CPU/GPU systems, with cross-platform standards for parallel programming (e.g., OpenCL [44]). NVidia supports general purpose GPU programming through CUDA [60] and is considering GPU architectures for the cloud (e.g., Kepler [48]), with an eye on general purpose computation.

Current heterogeneous systems use rigid programming models that require separate page tables, data replication, and manual data movement between CPU and GPU. This is especially problematic for pointer-based data structures (e.g., linked lists, trees).1 Recent research [22, 23, 28, 27] tries to overcome these issues, but none solves the problem generally by unifying the address space. As such, although GPUs currently use separate address spaces (though the latest CUDA releases permit limited CPU/GPU virtual address sharing [60]), vendors have begun supporting address translation [13] as a step towards unified virtual and physical address spaces.

There are many possibilities for address translation in GPUs. AMD and Intel use Input/Output Memory Management Units (IOMMUs) [1, 2, 25] with their own page tables, TLBs, and PTWs. IOMMUs with large TLBs are placed in the memory controller, making GPU caches virtually-addressed. In this paper, we study the natural alternative of translating addresses before (or in parallel with) cache access. This has the following programmability benefits.

CPU/GPU systems with full-blown address translation more naturally support a single virtual and physical address space. This eliminates the current need for CPUs to initialize, copy, and pin data pages (in main memory), duplicate page tables for GPU IOMMUs, and set up IOMMU TLB entries [13]. It also allows GPUs to support page faults and access memory mapped files, features desired in hUMA specifications [54] (though their feasibility requires a range of hardware/software studies beyond the scope of this work).

1 Some platforms provide pinning mechanisms that do not need data transfers. Both CPU and GPU maintain pointers to the pinned data and the offset arithmetic is the same; however, the pointers are distinct. This suffers overheads from pinning and pointer replication (which can lead to buggy code).


Figure 1: The diagram on the left shows conventional GPU address translation with an IOMMU TLB and PTW in the memory controller. Instead, our approach embeds a TLB and PTW per shader core so that all caches become physically-addressed.

Second, physically-addressed caches support multiple contexts (deemed desirable by HSA [40] and NVidia's Kepler whitepaper [48]) more efficiently. Virtually-addressed caches struggle due to incoherence from address synonyms [38]. Workarounds like cache flushing on context switches can ensure correct functionality, but do so at a performance and complexity cost versus physically-addressed caches.

Third, address translation placement that allows physically-addressed caches also efficiently supports application libraries (mentioned by HSA as a design goal). Traditional virtually-addressed GPU caches suffer from address homonyms [16], which can arise when executing libraries.

Finally, cache coherence between CPU and GPU caches has long been deemed desirable [35, 40, 57]. In general, cache coherence is greatly simplified if GPU caches are physically-addressed, in tandem with CPU caches.

3. GOALS OF OUR WORK

Processor vendors are embracing support for unified address spaces between CPUs and accelerators, making accelerator address translation a first-class design objective. Unfortunately, the programmability benefits of address translation do not come for free. In the past, CPU address translation overheads of 5-15% of system runtime have been deemed a fair tradeoff for its benefits [7, 9, 10, 11, 12, 52]. Given the inevitable rise of GPU address translation, our goal is to conduct the first study of GPU TLBs and PTWs.

We have already established that, like CPUs, it is desirable for GPUs to have physically-addressed caches for various programmability benefits. A natural design option is to mirror CPUs, placing GPU TLBs prior to (or in parallel with) L1 caches. The right side of Figure 1 shows our approach, with per-shader-core TLBs and PTWs. Like CPUs, we assume that L1 caches are virtually-indexed and physically-tagged, allowing TLB access to overlap with L1 cache access. This contrasts with past academic work ignoring address translation [20, 21, 55] or placing an IOMMU TLB and PTW in the memory controller (the diagram on the left side of Figure 1).

For all its programmability benefits, address translation at the L1 level is challenging because it degrades performance by constraining TLB size.

Figure 2: Compared to a baseline architecture without TLBs, speedup of naive, 3-ported TLBs per shader core, with and without cache-conscious wavefront scheduling, and with and without thread block compaction. Naive TLBs degrade performance in every case.

Figure 2 shows that naive designs that do not consider the distinguishing characteristics of GPUs can severely degrade performance. The plots show speedups (values higher and lower than 1 are improvements and degradations) of general purpose GPU benchmarks using naive 128-entry, 3-port TLBs with 1 PTW per shader core (With TLB). Not only do naive TLBs degrade performance, they also lose 30-50% performance versus conventional cache-conscious wavefront scheduling and thread block compaction [20, 55]. These degradations arise for three reasons, unique to GPUs.

First, GPU memory accesses have low temporal locality, increasing TLB miss rates. Second, shader cores run multiple threads in lock-step in warps (in NVIDIA terminology) [21, 20, 42]. Therefore, a TLB miss on one warp thread effectively stalls all warp threads, magnifying the miss penalty. Third, multiple warp threads often TLB miss in tandem, stressing conventional PTWs that serialize TLB miss handling.

Therefore, a key goal of this work is to refashion conventional CPU TLB and PTW hardware to fit GPU characteristics. We show that modest but informed modifications significantly reduce the degradations of Figure 2. In fact, simple, thoughtful GPU TLBs reduce GPU address translation overheads to 5-15% of system runtime (just like CPUs).

We emphasize that our study is one natural design point to achieve programmability benefits [40]. Whether this is the best design point is a function of what performance overhead is considered reasonable, alternate implementations, and emerging classes of GPU workloads. Though this is beyond our scope, we do present broad insights to reason about address translation in GPUs (and, more broadly, accelerators).

4. METHODOLOGY

4.1 Evaluation Workloads

We use server workloads from past studies on control-flow divergence and cache scheduling [20, 55]. From the Rodinia benchmarks [17], we use bfs (graph traversal), kmeans (data clustering), streamcluster (data mining), mummergpu (DNA sequence alignment), and pathfinder (grid dynamic programming). In addition, we use memcached, a key-value store and retrieval system, driven by a representative portion of the Wikipedia traces [24]. Our baseline versions of these benchmarks have memory footprints in excess of 1GB.

Ideally, we would like to run benchmarks that will be prevalent on future integrated CPU/GPU platforms.


Unfortunately, there is a dearth of such workloads because of the lack of abstractions that ease CPU/GPU programming (like unified address spaces themselves). We do include applications like memcached which will likely run on these systems, but we expect that future workloads supporting braided parallelism (a mix of task and data parallelism from a single source) [43] will particularly benefit from unified address spaces. Our studies enable these applications and provide insights on their performance with GPU address translation.

4.2 Evaluation Infrastructure

We use GPGPU-Sim [5] with parameters similar to past work [20, 55]. In particular, we assume 30 SIMT cores, 32-thread warps, and a pipeline width of 8. We have, per core, 1024 threads, 16KB of shared memory, and 32KB L1 data caches (with 128-byte lines and LRU). We also use 8 memory channels with 128KB of unified L2 cache space per channel. We run binaries with CPU and GPU portions, but report timing results for the GPU part. GPGPU-Sim uses single instruction multiple data (SIMD) pipelines, grouping SIMD cores into core clusters, each of which has a port to the interconnection network with a unified L2 cache and the memory controller.

Most of our results focus on 4KB pages due to the additional challenge imposed by the small page size; however, we also present initial results for large 2MB pages later in the paper. Note that large pages, while effective, don't come for free and can have their own overheads in certain situations [3, 8, 47]. As a result, many applications are restricted to 4KB page sizes; it is important for GPUs with address translation to be compatible with them.
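For reference, the simulated configuration above can be collected in one place. The struct below is purely illustrative, simply recording the parameters reported in this section; it is not GPGPU-Sim code.

```cpp
// Summary of the simulated system described in Section 4.2 (illustrative only).
struct GpuConfig {
    int simt_cores        = 30;
    int warp_size         = 32;    // threads per warp
    int pipeline_width    = 8;
    int threads_per_core  = 1024;
    int shared_mem_kb     = 16;    // per core
    int l1d_kb            = 32;    // per core, LRU
    int l1d_line_bytes    = 128;
    int memory_channels   = 8;
    int l2_kb_per_channel = 128;   // unified L2
    int tlb_entries       = 128;   // naive baseline TLB (Section 5.2)
    int ptws_per_core     = 1;
};
```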

5. ADDRESS TRANSLATION FOR GPUS

We now introduce design options for GPU TLBs and PTWs, showing the shortcomings of blindly porting CPU-style address translation into GPUs. We then present GPU-appropriate address translation.

5.1 Address Translation Design Space

We consider the following address translation design points.

Number and placement of TLBs: CPUs traditionally place a TLB in each processor core so that pipelines can have quick, unfettered address translation. One might consider the same option for each GPU shader core. It is possible to implement one TLB per SIMD lane; this provides the highest performance, at the cost of power and area (e.g., we assume 240 SIMD lanes, so this approach requires 240 TLBs per shader core). Instead, we assume a more power- and area-frugal approach, with one TLB per shader core (shared among lanes).

PTW mechanisms, counts, and placement: It is possible to implement page table walking with either hardware or software (where the operating system is interrupted on a TLB miss [29]). Hardware approaches require more area but perform far better [29]. When using hardware PTWs, it is possible to implement one or many per TLB. While a single PTW per TLB saves area, multiple PTWs can potentially provide higher performance if multiple TLB misses occur. Finally, placing PTWs with TLBs encourages faster miss handling; alternately, using a single shared PTW saves area at the cost of performance.

Size of TLBs: While larger TLBs have a higher hit rate, they also occupy more area and have higher hit times. Since TLB access must complete by the time the L1 cache set is selected (for virtually-indexed, physically-tagged caches), overly-high hit times degrade performance. In addition, TLBs use power-hungry content-addressable memories [9]; larger TLBs hence consume much more power.

TLB port count: More TLB ports permit parallel address translation, which can be crucial in high-throughput shader cores. Unfortunately, they also consume area and power.

Blocking versus non-blocking TLBs: One way of reducing the impact of TLB misses is to overlap them with useful work. Traditional approaches involve augmenting TLBs to support hits under the original miss or additional misses under the miss. Both approaches improve performance, but require additional hardware like Miss Status Holding Registers (MSHRs) and access ports.

Page table walk scheduling: Each page table walk requires multiple memory references (four in x86 [6]) to find the desired PTE, some of which may hit in caches. In multicore systems, it is possible that more than one core concurrently experiences TLB misses. In response, one option (which has not been studied to date) is to consider the page table walks of the different cores and see if some of their memory references are to the same locations or cache lines. It is then possible to interleave PTW memory references from different cores to reduce the number of memory references and boost cache hit rates (while retaining functional correctness).

System-level issues (TLB shootdowns, page faults): The adoption of unified address spaces means that there are many options for handling TLB shootdowns and page faults. In one possibility, we interrupt a CPU to execute the shootdown code or page fault handler on the GPU's behalf. This CPU could be the one that launched the GPU kernel or any other idle core. Alternately, GPUs could themselves run the shootdown code or the page fault handler, as suggested by hUMA [54].

5.2 CPU-Style Address Translation in GPUs

Since GPU address translation at the L1 cache level has not previously been studied, the design space is a blank slate. As such, it is natural to begin with CPU-style TLBs. We now discuss this baseline naive architecture.

Figure 3 shows our design. Each shader core maintains a TLB/PTW and runs 48 warps (the minimum scheduling unit). Threads of a warp execute the same instruction in lock-step. Instructions are fetched from an I-cache and operands are read from a banked register file. Loads and stores access the memory unit (which Figure 3 zooms in on).

The memory unit's address generation unit calculates virtual addresses, which are coalesced to unique cache line references. We enhance conventional coalescing logic to also coalesce multiple intra-warp requests to the same virtual page (and hence PTE). This is crucial to reducing TLB access traffic and port counts. At this point, two sets of accesses are available: (1) unique cache accesses; and (2) unique PTE accesses. These are presented in parallel to the TLB and data cache. Note that we implement virtually-indexed, physically-tagged L1 caches. We now elaborate on the design space options from Section 5.1, referring to Figure 3 when necessary.
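A minimal sketch of this page-level coalescing step is shown below. The types and helper function are hypothetical stand-ins rather than the simulator's coalescer, but they capture how a warp's per-thread addresses reduce to unique cache-line and unique-PTE requests; the size of the second set is what we later call the warp's page divergence.

```cpp
#include <cstdint>
#include <set>
#include <vector>

struct CoalescedAccess {
    std::set<uint64_t> lines;   // unique 128-byte cache-line addresses
    std::set<uint64_t> pages;   // unique virtual page numbers (one TLB lookup each)
};

// One virtual address per active warp thread; line/page sizes are the values
// assumed in this paper (128-byte lines, 4KB pages).
CoalescedAccess coalesce_warp(const std::vector<uint64_t>& vaddrs,
                              uint64_t line_bytes = 128,
                              uint64_t page_bytes = 4096) {
    CoalescedAccess out;
    for (uint64_t va : vaddrs) {
        out.lines.insert(va / line_bytes);   // cache-line granularity
        out.pages.insert(va / page_bytes);   // page (PTE) granularity
    }
    return out;
}
// Usage: coalesce_warp(addresses_of_32_threads).pages.size() gives the
// warp's page divergence for this access.
```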


Figure 3: Shader core pipeline with address translation. We assume that L1 data caches are virtually-indexed and physically-tagged (allowing TLB lookup in parallel with cache access). All caches (L1 and shared caches, which are not shown) are physically-addressed. Note that the diagram zooms in on the memory unit.

Number and placement of TLBs: As previously detailed, we assume 1 TLB per shader core, shared among SIMD lanes, to save power and area.

PTW mechanisms, counts, and placement: Our GPU, like most CPUs today, assumes hardware PTWs because: (1) they achieve higher performance than software-managed TLBs [29]; and (2) they do not need to run OS code (which GPUs cannot currently execute), unlike software-managed TLBs. CPUs usually place PTWs close to each TLB so that page table walks can be quickly initiated on misses. We use the same logic to place PTWs next to TLBs. Finally, CPUs maintain one PTW per TLB. While it is possible to consider multiple PTWs per GPU TLB, our baseline design mirrors CPUs and similarly has one PTW per shader core. We investigate the suitability of this decision in subsequent sections.

Size of TLBs: Commercial CPUs currently implement 64-512 entry TLBs per core [11, 26]. These sizes are typically picked to be substantially smaller and lower-latency than CPU L1 caches. Using a similar methodology with CACTI [45], we have found that 128-entry TLBs are the largest structures that do not increase the access time of 32KB GPU L1 data caches. We use 128-entry TLBs in our naive baseline GPU design, studying other sizes later.

TLB port count: Intel's CPU TLBs support three ports in today's systems [26]. Past work has found that this presents a performance/power/area tradeoff appropriate for CPU workloads [4]. Our baseline design assumes 3-ported TLBs for GPUs too, but we revisit this in subsequent sections.

Blocking versus non-blocking: There is evidence that some CPU TLBs do support hits under misses because of the relatively simple hardware support required [30]. However, most commercial CPUs typically use blocking TLBs [7, 52] because of their high hit rates. We therefore assume blocking TLBs in our baseline GPU design. This means that, similar to cache misses, TLB misses prompt the scheduler to swap another warp into the SIMD pipeline. Swapped-in warps executing non-memory instructions proceed unhindered (until the original warp's page table walks finish and it completes the Writeback stage). Swapped-in threads with memory references, however, do not proceed in this naive design, as they would require non-blocking support.

Figure 3 shows that TLBs have their own MSHRs. We assume, like both GPU caches and past work on TLBs [42], that there is one TLB MSHR per warp thread (32 in total). MSHR allocation triggers page table walks, which inject memory requests to the shared caches and main memory.
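The MSHR behavior just described can be sketched as follows. This is an illustrative software model assuming a 32-entry file and a hypothetical start_page_walk() hook, not the actual hardware.

```cpp
#include <cstdint>
#include <vector>

struct TlbMshr {
    bool     busy = false;
    uint64_t vpn  = 0;
    std::vector<int> waiting_threads;   // threads stalled on this translation
};

struct TlbMshrFile {
    TlbMshr entries[32];   // one per warp thread, as in the baseline above

    // Returns false if no MSHR is free (the warp must stall and retry).
    // start_page_walk() is a hypothetical hook into the per-core PTW.
    bool handle_miss(uint64_t vpn, int thread_id,
                     void (*start_page_walk)(uint64_t vpn)) {
        for (auto& m : entries)                 // merge duplicate misses to one walk
            if (m.busy && m.vpn == vpn) { m.waiting_threads.push_back(thread_id); return true; }
        for (auto& m : entries)                 // otherwise allocate a free entry
            if (!m.busy) {
                m.busy = true;
                m.vpn  = vpn;
                m.waiting_threads = {thread_id};
                start_page_walk(vpn);           // allocation triggers the page walk
                return true;
            }
        return false;
    }
};
```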

Figure 4: The left diagram shows the percentage of total instructions that are memory references along with TLB miss rates; the right diagram shows the average number of distinct translations requested per warp (page divergence) and the maximum number of translations requested by any warp through the execution.

Page table walk scheduling: CPU PTWs do not currently interleave memory references from multiple concurrent TLB misses for increased cache hit rate and reduced memory traffic. This is because the low incidence of concurrent TLB misses on CPUs [12] isn't worth the complexity and area overheads of such logic (though this is likely to require relatively simple combinational logic). Similarly, our naive baseline GPU design also doesn't include PTW scheduling logic.

System-level issues (shootdowns, page faults): CPUs usually shoot down TLBs on remote cores when their own TLB updates an entry, using software inter-processor interrupts (IPIs). We assume the same approach for GPUs (i.e., if the CPU that initiated the GPU kernel modifies its TLB entries, GPU TLBs are flushed). Similarly, we assume that a page fault interrupts a CPU to run the handler. In practice, our performance was not affected by these decisions (because shootdowns and page faults almost never occur in our workloads). If these become a problem, as detailed in hUMA, future GPUs may be able to run dedicated OS code for shootdowns and page faults without interrupting CPUs. We leave this for future work.

Performance: Having detailed this naive baseline approach, we profile its performance for our workloads. Our results, already presented in Figure 2 in Section 3, show the inadequacies of naive baseline GPU address translation. Overall, performance is degraded by 20-35%. The graph on the left of Figure 4 complements this data by showing: (1) the number of memory references in each workload as a percentage of the total instructions; and (2) miss rates of 128-entry GPU TLBs. While the number of memory references is generally low compared to CPUs (under 25% for all benchmarks), TLB miss rates are very high (ranging from 22% to 70%). This is because GPU benchmarks tend to stream through large quantities of data without much reuse. In addition, we find that average miss penalties are well above 200 cycles for every benchmark (Figure 5). To put this in perspective, this is over double the penalty of L1 cache misses.

Fortunately, we now show that simple GPU-aware modifications of naive TLBs and PTWs counter these problems, recovering most of this lost performance.

5.3 Augmenting Address Translation for GPUs

GPU execution pipelines differ widely from CPUs in that they perform data-parallel SIMD operations on warps.


Figure 5: Average cycles per TLB miss, compared to L1 cache misses. TLB miss penalties are typically twice as long as L1 cache miss penalties.

Since each warp simultaneously executes multiple threads, it can conceivably prompt multiple TLB misses (up to the warp width, which is 32 in our configuration). In the worst case, all TLB misses could be to different virtual pages or PTEs. Thus, there is more pressure on TLBs in some ways, but also more opportunity for parallelism, if designed with GPUs in mind. We now detail optimizations to exploit these opportunities, quantifying their improvements.

TLB size and port counts: An ideal TLB is large and low-latency. Furthermore, it is heavily multi-ported, with one port per warp thread (32 in total). While more ports facilitate quick lookups and miss detection, they also significantly increase area and power. Figure 7 sheds light on size, access time, and port count tradeoffs for naive baseline GPU address translation. We vary TLB sizes from 64 to 512 entries (the range of CPU TLB sizes) and port counts from 3 (like L1 CPU TLBs) to an ideal number of 32. We present speedups against the no-TLB case (speedups are under 1 since adding TLBs degrades performance). We use CACTI to assess access time increases with size.

Figure 7 shows that larger sizes and more ports greatly improve GPU TLB performance. In general, 128-entry TLBs perform best; beyond this, increased access times reduce performance. Figure 7 also shows that while port counts do impact performance (particularly for mummergpu and bfs), modestly increasing from 3 ports (in our naive baseline) to 4 ports recovers much of this lost performance. The graph on the right of Figure 4 shows why, by plotting page divergence (the number of distinct translations requested by a warp). We show both the average page divergence and the maximum page divergence of any warp through execution. Figure 4 shows warps usually request far fewer than their maximum of 32 translations. This is because the coalescing logic accessed before the TLBs reduces requests to the same PTE to a single lookup. Only bfs and mummergpu have average page divergence higher than 4. While this does mean that some warps have higher lookup times (e.g., when mummergpu encounters its maximum page divergence of 32), simply augmenting the port count of the naive implementation to 4 recovers much of the lost performance. Figure 6 shows more details on the page divergence requirements of the different benchmarks.

Blocking versus non-blocking TLBs: Using blocking TLBs, the only way to overlap miss penalties with useful work is to execute alternate warps without memory references. We have seen, however, that GPU TLB miss penalties are extremely long. Therefore, GPU address translation requires more aggressive non-blocking facilities.

Figure 6: Page divergence cumulative distribution functions. We show what percentage of warps have a page divergence of 1, 2-3, 4-7, 8-15, and 16-32.

Figure 7: Performance for TLB sizes and port counts, assuming fixed access times. Note that TLBs larger than 128 entries and 4 ports are impractical to implement and actually have much higher access times that degrade performance.

Specifically, we investigate:

Hits from one warp under misses from another warp: In this approach, the swapped-in warp executes even if it has memory references, as long as they are all TLB hits. As soon as the swapped-in warp TLB misses, it too must be swapped out. This approach leverages already-existing TLB MSHRs and requires only simple combinational logic updates to the warp scheduler. In fact, hardware costs are similar to CPU TLBs, which already support hits under misses [30]. We leave more aggressive miss-under-miss support for future work.

Overlapping TLB misses with cache accesses within a warp: Beyond overlapping a warp's TLB miss penalty with the execution of other warps, it is also possible to overlap TLB misses with work from the warp that originally missed. Since warps have multiple threads, even when some miss, others may TLB hit. Since hits immediately yield physical addresses, it is possible to look up the L1 cache with these addresses without waiting for the warp's TLB misses to be resolved. This boosts cache hit rates, since this warp's data is likelier to be in the cache before a swapped-in warp evicts its data. Furthermore, if these early cache accesses do miss, the subsequent cache miss penalties can be overlapped with the TLB miss penalties of the warp.

In this approach, TLB hit addresses immediately look up the cache even if the same warp has a TLB miss. Data found in the cache is buffered in the standard warp context state in the register files (from past work [21, 55]) before the warp is swapped out. This approach requires no hardware beyond simple combinational logic in the PTW and MSHRs to allow TLB hits to proceed to cache lookup.
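A minimal software sketch of these two non-blocking policies follows. The MemoryUnit structure and its fields (a stand-in TLB map, a queue of physical addresses sent to the L1, and a list of VPNs handed to the PTW) are assumptions for illustration, not the simulator's implementation.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct ThreadAccess { int thread_id; uint64_t vaddr; };

struct MemoryUnit {
    std::unordered_map<uint64_t, uint64_t> tlb;   // VPN -> PPN (stand-in for the TLB)
    std::vector<uint64_t> pending_walks;          // VPNs handed to the page table walker
    std::vector<uint64_t> cache_queue;            // physical addresses issued to the L1

    // Returns true if the warp must be swapped out (at least one TLB miss).
    // Threads that hit in the TLB access the cache immediately, overlapping
    // with the page walks of the threads that missed ("cache overlap").
    bool issue(const std::vector<ThreadAccess>& warp) {
        bool any_miss = false;
        for (const auto& a : warp) {
            uint64_t vpn = a.vaddr >> 12;          // 4KB pages assumed
            auto it = tlb.find(vpn);
            if (it != tlb.end())
                cache_queue.push_back((it->second << 12) | (a.vaddr & 0xfff));
            else { pending_walks.push_back(vpn); any_miss = true; }
        }
        return any_miss;   // the scheduler then swaps in another warp, which may
                           // keep issuing as long as its own lookups hit (hit under miss)
    }
};
```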


Figure 8: On a 128-entry, 4-port TLB, adding non-blocking support improves performance closer to an ideal (no increase in access latency) 32-port, 512-entry TLB.

Results: Figure 8 quantifies the benefits of non-blocking TLBs, normalized to a baseline without TLBs. We first permit hits under misses, then also allow TLB hits to proceed to the cache without waiting for misses to be resolved. We compare these to an ideal and impractical 512-entry TLB with 32 ports and no increase in access latency. While hits under misses improve performance, immediately looking up the cache for threads that TLB hit (and, as we show next, PTW scheduling) is even more effective. For example, streamcluster gains an additional 8% performance from overlapped cache access. Overall, Figure 11 shows that modest TLB design enhancements boost performance substantially. We will show how additional enhancements bring performance even closer to the impractical, ideal case.

Page table walk scheduling: Our page divergence results show that GPUs execute warps that can suffer TLB misses on multiple threads. Often, the TLB misses are to distinct PTEs. Consider the example in Figure 9, which shows three concurrent x86 page walks. An x86 page table walk requires a memory reference to the Page Map Level 4 (PML4), Page Directory Pointer (PDP), Page Directory (PD), and Page Table (PT). A CR3 register provides the base physical address of the PML4. A nine-bit index (bits 47 to 39 of the virtual address) is concatenated with the base physical address to generate a memory reference to the PML4. This finds the base physical address of the PDP. Bits 38-30 of the virtual address are then used to look up the PDP. Bits 29-21 and 20-12 are similarly used for the PD and PT (which holds the desired translation).
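A minimal sketch of this four-level walk is shown below, assuming 8-byte entries and a stubbed read_phys() memory hook (both illustrative). It only extracts the 9-bit indices and issues the four dependent loads described above.

```cpp
#include <cstdint>

// Stubbed physical-memory read so the sketch compiles on its own.
static uint64_t read_phys(uint64_t paddr) { (void)paddr; return 0; }

// 9-bit index of `level` (0 = PML4, 1 = PDP, 2 = PD, 3 = PT).
inline uint64_t walk_index(uint64_t vaddr, int level) {
    return (vaddr >> (39 - 9 * level)) & 0x1ff;   // bits 47-39, 38-30, 29-21, 20-12
}

// Walk the page table for one virtual address: four dependent loads, each of
// which may hit in the shared cache or go to main memory.
uint64_t page_walk(uint64_t cr3, uint64_t vaddr) {
    uint64_t base  = cr3;                          // CR3 holds the PML4 base
    uint64_t entry = 0;
    for (int level = 0; level < 4; ++level) {
        entry = read_phys(base + 8 * walk_index(vaddr, level));   // 8-byte entries
        base  = entry & ~0xfffULL;                 // next level's base (flags masked off)
    }
    return entry;                                  // the PTE holding the translation
}
// In the example that follows, the three missing pages share their PML4 and
// PDP indices, so those two loads can be reused across all three walks.
```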

Figure 9 shows multilevel page lookups for a warp that has three threads missing on virtual pages (0xb9, 0x0c, 0xac, 0x03), (0xb9, 0x0c, 0xac, 0x04), and (0xb9, 0x0c, 0xad, 0x05). We present addresses as groups of 9-bit indices, as these correspond directly to the page table lookups. Naive baseline GPU PTWs perform the three page table walks serially (shown with dark bubbles). This means that each page table walk requires four memory references: (1-4) for (0xb9, 0x0c, 0xac, 0x03), (5-8) for (0xb9, 0x0c, 0xac, 0x04), and (9-12) for (0xb9, 0x0c, 0xad, 0x05). Each of these references hits in the shared cache (several tens of cycles) or main memory. Fortunately, simple PTW scheduling that exploits commonality across concurrent page table walks improves performance in two ways:

Reducing the number of page table walk memory references: Higher-order virtual address bits tend to be constant across memory references. For example, bits 47-39 and 38-30 of the virtual address change infrequently, as lower-order bits (29 to 0) cover a 1GB address space.

Figure 9: Three threads from a warp TLB miss on addresses (0xb9, 0x0c, 0xac, 0x03), (0xb9, 0x0c, 0xac, 0x04), and (0xb9, 0x0c, 0xad, 0x05). A conventional page walker carries out three serial page walks (shown with dark bubbles), making references (1-4), (5-8), and (9-12), a total of 12 loads. Our cache-aware coalesced page walker (shown with light bubbles) reduces this to 7 loads and achieves a better cache hit rate.

Figure 10: Our page table walk scheduler attempts to reduce the number of memory references and increase cache hit rate. The highlighted entries show which memory references in the page walks are performed and in what order. This hardware assumes a mux per MSHR, with a tree-based comparator circuit. We show 4 MSHRs, though this approach generalizes to 32 MSHRs.

Since these bits index the PML4 and PDP during page table walks, multiple page table walks usually traverse similar paths. In Figure 9, all three page walks read the same PML4 and PDP locations. We leverage this observation by remembering PML4 and PDP reads so that they can be reused among page walks. This means that three PML4 and three PDP reads are replaced by a single read each.

Increasing page table walk cache hit rates: 128-byte cache lines hold 16 consecutive 8-byte PTEs. Therefore, there is potential for cache line reuse across different page table walks. For example, in Figure 9, two PD entries from the same cache line are used for all the walks. Similarly, the PT entries for virtual pages (0xb9, 0x0c, 0xac, 0x03) and (0xb9, 0x0c, 0xac, 0x04) are on the same cache line. We exploit this observation by interleaving memory references from different page table walks (shown in lighter bubbles). In Figure 9, references 3 and 4 (from three different page table walks) are handled successively, as are references 5 and 6 (from the page walks for virtual pages (0xb9, 0x0c, 0xac, 0x03) and (0xb9, 0x0c, 0xac, 0x04)), boosting hit rates.

Implementation: Figure 10 shows PTW scheduling (TLBs and PTWs are present, though not shown). MSHRs store the virtual page numbers causing TLB misses. We propose combinational hardware that scans the MSHRs, extracting, in four consecutive steps, the PML4, PDP, PD, and PT indices.

7

Page 8: Architectural Support for Address Translation on GPUs

Figure 11: (Left) On a 128-entry, 4-port TLB, adding non-blocking and PTW scheduling logic achieves close to the performance of an ideal (no increase in access latency) 32-port, 512-entry TLB. (Right) Our augmented TLB with 1 PTW consistently outperforms 8 PTWs.

Each stage checks whether the memory accesses for its level are amenable to coalescing (they are repeated) or lie on the same cache line. The PTW then injects references (for each step, we show the matching memory reference from Figure 9).

Figure 10 shows that a comparator tree matches indices in each stage of the algorithm. It scans the PML4 indices looking for a match in bits 47-43, because both a repeated memory reference and PTEs within the same cache line share all but the bottom 4 index bits. Once the comparator discovers that all the page walks can be satisfied from the same cache line in step 0, a memory reference for the PML4 commences. In parallel with this memory reference, the same comparator tree compares the upper 5 bits of the PDP indices (bits 38-34). Again, all indices match, meaning that in step 1 only one PDP reference is necessary. In step 2, the PD indices are studied (in parallel with the PDP memory reference). We find that the top 5 bits match but the bottom 4 bits don't (indicating multiple accesses to the same cache line). Therefore, step 2 injects two memory references (3 and 4) successively. Step 3 uses similar logic to complete the page walks for (0xb9, 0x0c, 0xac, 0x03), (0xb9, 0x0c, 0xac, 0x04), and (0xb9, 0x0c, 0xad, 0x05).

To reduce hardware, we use a comparator tree rather than implementing a comparator between every pair of MSHRs (which would provide maximum performance). We have found that this achieves close to pairwise comparator performance. Also, we share one comparator tree, with muxes, across all page table levels. This is possible because MSHR scans can proceed in parallel with loads from the previous step.

Results: Figure 11 (left) shows that PTW scheduling significantly boosts GPU performance. For example, bfs and mummergpu gain from PTW scheduling because they have higher page divergence (so there are more memory references from different TLB misses to schedule). We have found that PTW scheduling achieves its performance benefits by completely eliminating 10-20% of the PTW memory references and boosting PTW cache hit rates by 5-8% across the workloads. Consequently, the number of idle cycles (due in large part to TLB misses) drops from 5-15% to 4-6% (see Figure 12), boosting performance.
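The coalescing performed by these comparator stages can be approximated in software as follows. This is a behavioral stand-in, not a model of the proposed circuit; it assumes that concurrent walks sharing a path prefix read the same page-table entry and that 128-byte lines hold 16 PTEs.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Concatenation of the 9-bit indices of levels 0..level for one virtual address.
inline uint64_t path_prefix(uint64_t vaddr, int level) {
    return vaddr >> (39 - 9 * level);
}

// Returns the number of loads actually issued at each of the four levels.
// Walks with identical prefixes collapse to one load (a repeated reference);
// prefixes differing only in the bottom 4 bits fall on the same 128-byte line
// (16 x 8-byte PTEs), so their loads are grouped and issued back-to-back.
std::vector<size_t> schedule_walks(const std::vector<uint64_t>& missing_vaddrs) {
    std::vector<size_t> loads_per_level;
    for (int level = 0; level < 4; ++level) {
        // cache line (prefix >> 4) -> distinct entries needed within that line
        std::map<uint64_t, std::set<uint64_t>> lines;
        for (uint64_t va : missing_vaddrs)
            lines[path_prefix(va, level) >> 4].insert(path_prefix(va, level));
        size_t loads = 0;
        for (const auto& kv : lines)
            loads += kv.second.size();
        loads_per_level.push_back(loads);
    }
    return loads_per_level;
}
// For the three walks of Figure 9 this yields 1 (PML4) + 1 (PDP) + 2 (PD) +
// 3 (PT) = 7 loads instead of 12, matching the figure.
```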

Figure 12: Percentage of total cycles that are idle for blocking GPU TLBs and various non-blocking and page table walking optimizations.

Overall, Figure 11 (left) shows that thoughtful non-blocking and PTW scheduling extensions to naive baseline GPUs boost performance to the extent that it is within 1% of an ideal, impractical, large and heavily-ported 512-entry, 32-port TLB with no access latency penalties. In fact, all the techniques together reduce GPU address translation overheads to under 10% for all benchmarks, well within the 5-15% range considered acceptable on CPUs.

Multiple PTWs: Multiple PTWs are an alternative to PTW scheduling. Unfortunately, their higher performance comes with far higher area and power overheads. In our studies, we have found that a single PTW with augmented TLBs (4 ports with non-blocking and PTW scheduling extensions) consistently outperforms naive TLBs with more PTWs. Specifically, Figure 11 (right) shows a 10% performance gap between the augmented 1-PTW approach and 8 naive PTWs. We therefore assume, for the remainder of the paper, a single lower-overhead PTW with non-blocking support (hit under miss and cache overlap) and PTW scheduling augmentations.

6. TLBS AND CACHE CONSCIOUS WARP SCHEDULING

Having showcased the GPU address translation design space, we now move to our second contribution. There are many recent studies on improving GPU cache performance [32, 31, 36, 46, 55] via better warp scheduling. We focus on the impact of address translation on cache-conscious wavefront scheduling (CCWS) [55]. Our insights, however, are general and hold for other approaches too.

6.1 Baseline Cache-Conscious Wavefront Scheduling

Basic operation: CCWS observes that conventional warp scheduling (e.g., round-robin) is oblivious to intra-warp locality, touching data from enough threads to thrash the L1 cache [55]. Carefully limiting the warps that overlap with one another promotes better cache reuse and boosts performance. CCWS accomplishes this using the baseline hardware shown in Figure 13. The cache holds tags and data, but also a warp identifier for the warp that allocated the cache line. A lost locality detector (LLD) maintains per-warp, set-associative cache victim tag arrays (VTAs), which store the tags of evicted cache lines. The LLD, with lost locality scoring (LLS), identifies warps that share cache working sets and those that increase cache thrashing.


Figure 13: The left diagram shows conventional CCWS, with a cache victim tag array. The right diagram shows, compared to a baseline architecture without TLBs, the speedup of naive, 4-ported TLBs per shader core, augmented TLBs and PTWs, CCWS without TLBs, CCWS with naive TLBs and PTWs, and CCWS with augmented TLBs and PTWs.

CCWS scheduling logic encourages warps that share working sets to run together. CCWS operates as follows. Suppose a warp issues a memory reference. After coalescing, the address is presented to the data cache. On a cache miss, CCWS logic is invoked to determine whether multiple warps are thrashing the cache. The VTA of the current warp is probed to see if the desired cache line was recently evicted. A hit indicates the possibility of inter-warp interference: if the warp making the current request were prioritized by the scheduler, intra-warp reuse would be promoted and cache misses reduced. This information is communicated to the LLS, which maintains a counter per warp. A VTA hit increments the warp's counter; whenever this happens, LLS logic sums all counter values. The LLS cutoff logic checks if the total is larger than a predefined cutoff; if so, warps with the highest counter values are prioritized, since they hit most in the VTAs, indicating that their lines are most recently evicted and hence most likely to gain if not swapped out. We refer readers to the CCWS paper [55] for more details and sensitivity studies on the update and cutoff values and the LLS and LLD hardware overheads. The remainder of our work assumes 16-entry, 8-way VTAs per warp.

Performance of basic approach: While baseline CCWS ignored address translation [55], one might expect that boosting cache hit rates should also increase TLB hit rates. Figure 13 (right) quantifies the speedup (against a baseline without TLBs) of: (1) naive blocking 128-entry, 4-port TLBs with one PTW (no non-blocking or PTW scheduling); (2) augmented TLBs that overlap misses with cache access and allow hits under misses (non-blocking), with PTW scheduling; (3) CCWS without TLBs; (4) CCWS with naive TLBs; and (5) CCWS with augmented TLBs.

Baseline CCWS (without TLBs) improves performance for all benchmarks by at least 20%. However, adding CCWS to naive TLBs and augmented TLBs outperforms the vanilla naive and augmented versions by only 5-10%. Also, the gap between CCWS with and without TLBs remains large (even augmented TLBs and PTWs show a 50-120% difference).
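For concreteness, the baseline CCWS bookkeeping described above can be sketched as follows. This is an illustrative software model, not the paper's hardware: the VTA is modeled as a small FIFO rather than a 16-entry, 8-way set-associative array, and the update and cutoff constants are left as parameters.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

struct WarpLocality {
    std::deque<uint64_t> vta;   // victim tags (16-entry, 8-way in the paper; a FIFO here)
    unsigned lls_score = 0;     // lost-locality score counter
};

struct CCWS {
    std::vector<WarpLocality> warps;
    unsigned cutoff;            // predefined cutoff on the summed scores
    CCWS(size_t num_warps, unsigned cutoff_value) : warps(num_warps), cutoff(cutoff_value) {}

    void on_cache_eviction(int warp, uint64_t line_tag) {
        auto& v = warps[warp].vta;
        v.push_back(line_tag);
        if (v.size() > 16) v.pop_front();
    }

    void on_cache_miss(int warp, uint64_t line_tag) {
        for (uint64_t t : warps[warp].vta)
            if (t == line_tag) { ++warps[warp].lls_score; break; }   // VTA hit
    }

    // True when inter-warp thrashing is detected; the scheduler then
    // prioritizes the warps with the highest scores.
    bool should_throttle() const {
        unsigned sum = 0;
        for (const auto& w : warps) sum += w.lls_score;
        return sum > cutoff;
    }
};
```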

Figure 14: On the left, we show TA-CCWS, which updates locality scores with TLB misses. On the right, we show TLB-conscious warp scheduling, which replaces cache VTAs with TLB VTAs to outperform TA-CCWS, despite using less hardware.

6.2 Adding Address Translation Awareness

In response, we propose two CCWS schemes that use TLB information.

TLB-aware CCWS (TA-CCWS): CCWS loses performance when integrating TLBs (even augmented ones) because it treats all cache misses equivalently. In reality, some cache misses are accompanied by TLB misses, others by TLB hits. In this light, baseline CCWS should be modified so that LLS logic weighs cache misses accompanied by TLB misses more heavily than those with TLB hits. TA-CCWS does exactly this, prompting more frequent TLB misses to push the LLS counter sum over the threshold faster. This in turn ensures that the final pool of warps identified as scheduling candidates enjoys intra-warp cache and intra-warp TLB reuse.

The diagram on the left of Figure 14 shows TA-CCWS hardware, with practically no changes to baseline CCWS. We consider only TLB weights that are powers of two, so that shifters can perform the counter updates.

TLB-conscious warp scheduling (TCWS): TCWS goes beyond TA-CCWS by observing that TLB and cache behavior are highly correlated. For example, TLB misses are frequently accompanied by cache misses, because a TLB miss indicates that cache lines from its physical page were referenced far back in the past. Therefore, it is possible to subsume cache access behavior by analyzing the intra-warp locality lost on TLB behavior. TCWS exploits this correlation by replacing cache-line-based VTAs with smaller, lower-overhead, yet more performance-efficient TLB-based VTAs.

TLB conscious warp scheduling (TCWS): TCWS goes beyond TA-CCWS by observing that TLB and cache behavior are highly correlated. For example, TLB misses are frequently accompanied by cache misses, because a TLB miss indicates that cache lines from its physical page were referenced far back in the past. Therefore, it is possible to subsume cache access behavior by analyzing the intra-warp locality lost on TLB behavior. TCWS exploits this correlation by replacing cache-line-based VTAs with smaller, lower-overhead, yet more performance-efficient TLB-based VTAs.

Figure 14 (right) illustrates TCWS with TLB-based VTAs containing virtual address tags. VTAs are now looked up on TLB misses rather than cache misses. VTA hits are communicated to the LSS update logic, where per-warp counters are updated (as in baseline CCWS). When the sum of the counters exceeds the predefined cutoff, warps with the highest LLS counters are prioritized.

This initial approach uses baseline CCWS but with TLB VTAs. There is, however, one problem with this approach.



Figure 15: TLB-aware CCWS performance for varying weights of TLB misses versus cache misses. TA-CCWS (x:y) indicates that the LLS scoring logic weights a TLB miss x times for every y times it weights a cache miss.

As described, we now update LSS scores only on TLB misses, so warp scheduling decisions are relatively infrequent compared to conventional CCWS, which updates scores on cache misses. This makes CCWS less rapidly adaptive, possibly degrading performance. Therefore, we force more frequent scheduling decisions by also updating LSS counters on TLB hits. We do so by observing that each TLB set logically maintains an LRU stack of PTEs. This study assumes 128-entry, 4-way set-associative GPU TLBs. To update the LSS logic sufficiently often, we track the LRU depth of TLB hits and update the LLS scoring logic accordingly, weighting a deeper hit more heavily since it indicates that the PTEs are closer to eviction (and the TLBs and caches are likelier to suffer thrashing). A key parameter is how to vary weights with LRU depth.

Interestingly, TCWS requires less hardware than CCWS. Since TCWS VTAs maintain tags for 4KB pages, fewer of them are necessary compared to cache line VTAs. We find that TLB-based VTAs in TCWS require half the area overhead of cache-line-based CCWS.
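The sketch below pulls together the TCWS mechanics described above: TLB-based VTAs probed on TLB misses, plus LRU-depth weights applied on TLB hits so that scores are updated often enough. The depth weights mirror the LRU(1, 2, 4, 8) configuration evaluated below; the data structures and method names are illustrative assumptions, not the proposed hardware.

#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Illustrative TCWS scoring (container choices and sizes are assumptions):
// per-warp victim tag arrays hold virtual-page tags and are probed on TLB
// misses; TLB hits also update the score, weighted by the LRU depth of the
// hit within its 4-way TLB set so that scheduling decisions stay frequent.
constexpr int kDepthWeight[4] = {1, 2, 4, 8};      // the LRU(1,2,4,8) configuration

struct TCWSScoring {
    std::vector<std::deque<uint64_t>> tlb_vta;     // per-warp TLB-based VTAs
    std::vector<int> score;                        // per-warp LLS counters
    std::size_t entries_per_warp;                  // 8 EPWs performs best (Section 6.3)

    TCWSScoring(int warps, std::size_t epw = 8)
        : tlb_vta(warps), score(warps, 0), entries_per_warp(epw) {}

    // A PTE for virtual page `vpage` was evicted from the TLB by `warp`.
    void on_tlb_eviction(int warp, uint64_t vpage) {
        auto& v = tlb_vta[warp];
        v.push_front(vpage);
        if (v.size() > entries_per_warp) v.pop_back();
    }

    // TLB miss: a hit in the warp's TLB VTA means intra-warp TLB locality was
    // recently lost; score it as in the baseline scheme (a unit increment here).
    void on_tlb_miss(int warp, uint64_t vpage) {
        for (uint64_t t : tlb_vta[warp]) {
            if (t == vpage) { ++score[warp]; break; }
        }
    }

    // TLB hit at LRU depth `d` (0 = MRU, 3 = LRU) within its set: deeper hits
    // are weighted more heavily since those PTEs are closer to eviction.
    void on_tlb_hit(int warp, int d) {
        if (d < 0) d = 0;
        if (d > 3) d = 3;
        score[warp] += kDepthWeight[d];
    }
};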

6.3 Performance of CCWS with TLB Information

We now quantify how well TA-CCWS and TCWS perform.

TLB-aware cache conscious scheduling results: Figure 15 shows that updating the LLS scoring logic with not just cache misses but also TLB miss information improves performance substantially. The graph separates the speedups of baseline CCWS (no TLBs) and CCWS with augmented TLBs (non-blocking, with PTW scheduling logic). It then shows TA-CCWS with various ratios of how much a TLB miss is weighted versus a cache miss. Clearly, weighting TLB misses more heavily improves performance. When they are weighted 4 times as much as cache misses, TA-CCWS achieves within 5-10% of CCWS without TLBs for four benchmarks (mummergpu, memcached, streamcluster, and pathfinder). While bfs and kmeans still suffer degradations, we will show that TLB conscious warp scheduling boosts performance significantly even for these benchmarks.

TLB conscious warp scheduling results: Figure 16 shows that TCWS comprehensively achieves high performance. The graph on the left shows how varying the number of entries per warp (EPW) in the TLB VTA affects performance (this graph isolates the impact of EPWs alone, without LRU depth weighting). Typically, 8 EPWs per warp VTA does best, consistently outperforming TA-CCWS.

Figure 16 (right) then shows how updating LLS scores based on the depth of the hit in the TLB set's LRU stack improves performance. We consider many LRU stack weights but only show three of them due to space constraints.


Figure 16: TLB conscious warp scheduling achieves within 5-15% of baseline CCWS without TLBs. The left diagram shows TCWS performance as the number of entries per warp (EPW) in the VTA is varied. The right diagram adds LRU depth weights to LLS scoring.

The first scheme weights hits on the first (MRU) entry of the set with a score of 1, the second entry with a score of 2, the third with 3, and the fourth with 4 (LRU(1, 2, 3, 4)). Similarly, we also show LRU(1, 2, 4, 8) and LRU(1, 3, 6, 9). TLB misses that result in TLB VTA hits are scored as before. LRU(1, 2, 4, 8) typically performs best, consistently getting within 1-15% of the baseline CCWS performance without TLBs. In fact, not only does TCWS require only half the hardware of TA-CCWS or even CCWS, it also consistently outperforms TA-CCWS.

At a high level, these results make two points. First, warp scheduling schemes that improve cache hit rates are intimately affected by TLB hit rates too. Second, the resulting overheads can be effectively countered with simple, thoughtful TLB and PTW awareness. Our approach, like CPU address translation, consistently reduces overheads to 5-15% of runtime.

7. TLBS AND THREAD BLOCK COMPACTION

Our final contribution is to show how address translation affects mechanisms that reduce branch divergence overheads. Traditionally, SIMD architectures have supported divergent branch execution by masking vector lanes and using stack reconvergence [14, 21], significantly reducing SIMD throughput. Proposed solutions have included stream programming language extensions [34], allowing vector lanes to execute scalar code for short durations [39], and, more recently, dynamic warp formation techniques [21], which assimilate threads with similar control flow paths to form new warps. While address translation affects all of these approaches, we focus on the best-known dynamic warp formation scheme, Thread Block Compaction (TBC) [20].

7.1 Baseline Thread Block Compaction

Basic operation: We now briefly outline the operation of baseline thread block compaction, referring readers to the original paper for complete details [20]. In CUDA and OpenCL, threads are issued to SIMD cores in units of thread blocks. Warps within a thread block can communicate through shared memory. TBC essentially exploits control flow locality within a thread block and is implemented using block-wide reconvergence stacks for divergence handling [20]. At a divergent branch, all warps of a thread block synchronize. TBC hardware scans the thread block's threads (which can span multiple warps) to identify which ones follow the same control flow path. Threads are compacted into new dynamic warps according to branch outcomes and executed until the next branch or reconvergence point, where they are synchronized again for compaction.



Figure 17: Comparison of warp execution when using reconvergence stacks, thread block compaction, and TLB-aware thread block compaction. While TLB-TBC may execute more warps, its higher TLB hit rate provides higher overall performance.

Overall, this approach increases SIMD utilization.

Unfortunately, blindly adding address translation has problems. Dynamically assimilating threads from different warps into new warps increases both TLB miss rates and warp page divergence (which amplifies the latency of one thread's TLB miss onto all warp threads). Consider, for example, the control flow graph of Figure 17. In this example, each thread block contains three warps of 4 threads. Each thread is given a number, along with the virtual page it is accessing if it is a memory operation. For example, 1(6) refers to thread 1 accessing virtual page 6, 1(x) means that thread 1 is executing a non-memory instruction, and x(x) means that the thread is masked off through branching.

All threads execute blocks A and D, but only threads 2, 3, 5, and 12 execute block C due to a branch divergence at the end of A (the rest execute block B). Blocks B and C each consist of a memory operation. Figure 17 shows the order in which warps execute blocks B and C using conventional stack reconvergence. Since there is no dynamic warp formation, it takes six distinct warp fetches to execute both branch paths. Instead, Figure 17 shows that thread block compaction reduces warp fetches to just three, fully utilizing SIMD pipelines.

While TBC may at first seem ideal, address translation poses problems. For example, the first dynamic warp now requires virtual pages 1 and 6. If we consider a 1-entry TLB which is initially empty, the first warp takes 2 TLB misses, the second 3, and the third 2. Instead, Figure 17 shows a TLB-aware scheme that potentially performs better by forming a first dynamic warp with threads requiring only virtual page 6 and a second warp requesting virtual page 1. Now, the first two warps suffer 2 TLB misses as opposed to 5 (for baseline, TLB-agnostic TBC), without sacrificing SIMD utilization.

Performance of basic approach: Figure 18 quantifies the performance of TLB-agnostic TBC when using address translation. Against a baseline without TLBs, we plot the speedup of TBC without TLBs, TBC with naive 128-entry, 4-port TLBs (blocking, no PTW scheduling), and TBC with augmented TLBs (non-blocking, overlapping misses with cache access, PTW scheduling).


Figure 18: Performance of TBC without TLBs, TBC with naive 128-entry, 4-port blocking TLBs, and TBC with TLBs augmented with non-blocking and PTW scheduling facilities.


Figure 19: Hardware implementation of TLB-aware TBC. We add only the combinational logic in the common page matrix (CPM) and a warp history field per TLB entry. The red dotted arrows zoom into different hardware modules.

We also show naive and augmented TLBs without TBC. There is a significant performance gap between TBC with and without TLBs. Even with augmented TLBs, an average of 20% of performance is lost compared to ideal TBC (without TLBs). In fact, augmented TLBs without TBC actually outperform augmented TLBs with TBC. We have found that this occurs primarily because TBC increases per-warp page divergence (by an average of 2-4 for our workloads). This further increases TLB miss rates by 5-10%. In response, we study TLB-aware TBC, assuming block-wide reconvergence stacks and age-based scheduling [20].
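As a software analogue of the idea in Figure 17, the hypothetical helper below greedily groups divergent threads that request the same virtual page into the same dynamic warp; it simplifies away lane placement and multi-page threads, and is only meant to illustrate the property that the hardware in Section 7.2 approximates.

#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// One divergent thread that took the current branch outcome, tagged with the
// virtual page of its next memory access (0 here for non-memory threads).
struct DivergentThread {
    int tid;
    uint64_t vpage;
};

// Greedily pack threads that touch the same virtual page into the same dynamic
// warp of `warp_size` lanes. Threads from different pages are still merged when
// a page group cannot fill a warp, so SIMD utilization is not sacrificed.
std::vector<std::vector<int>> form_page_aware_warps(
        const std::vector<DivergentThread>& threads, std::size_t warp_size) {
    std::map<uint64_t, std::vector<int>> by_page;
    for (const auto& t : threads) by_page[t.vpage].push_back(t.tid);

    std::vector<std::vector<int>> warps;
    std::vector<int> current;
    for (const auto& group : by_page) {
        for (int tid : group.second) {
            current.push_back(tid);
            if (current.size() == warp_size) {
                warps.push_back(current);
                current.clear();
            }
        }
    }
    if (!current.empty()) warps.push_back(current);   // last, partially filled warp
    return warps;
}

Grouping by page in this way reduces the number of distinct translations each dynamic warp needs, which is the property the CPM hardware described next tries to preserve.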

7.2 Address Translation Awareness

Hardware details: Figure 19 shows how we make TBC aware of address translation with minimal additional hardware. The diagram shows the basic SIMD pipeline, with red dotted arrows zooming in on specific modules. We add only the shaded hardware to basic TBC from past work [20].

In baseline TBC, on a divergent branch, a branch unit generates active masks based on branch outcomes. A block-wide active mask is sent to the thread compactor and stored in multiple buffers. Each cycle, a priority encoder selects at most one thread from its corresponding inputs and sends its ID to the warp buffer. Bits corresponding to the selected threads are reset, allowing encoders to select from the remaining threads in subsequent cycles. At the end of this step, threads have been formed into dynamic warps based on branch outcomes.


When these dynamic warps become ready (indicated by the r bit per warp buffer entry), they are sent to the fetch unit, which stores program counters and initiates warp fetch. We refer readers to the TBC paper [20] for details on how the priority encoder and compactor assimilate dynamic warps.

Our contribution is to modify this basic design and encourage dynamic warp formation among warps that have historically accessed similar TLB PTEs and are hence likely to do so in the future, minimizing page divergence and miss rates. We use a table called the Common Page Matrix (CPM). Each CPM row holds a tag and multiple saturating counters. The CPM maintains a row for every warp (48 rows for our SIMD cores), each of which has a counter for every other warp (47 counters). Each counter essentially indicates how often the warp ID associated with the row and the column have accessed the same PTEs in the past. We use this structure in tandem with the priority encoder. Threads are compacted into the warp buffer only if the thread's original warp had accessed PTEs that the threads already compacted into the new dynamic warp also accessed. This information is easily extracted from the CPM when compacting threads: we select the CPM row corresponding to the thread's original warp number and then look up the counters using the original warp numbers of the threads that have already been compacted into the target dynamic warp. We compact the candidate thread into the dynamic warp only if the selected counters are at their maximum value.

Figure 19 shows that CPM counters are updated on TLB hits. Each TLB entry maintains a history of the warp numbers that previously accessed it. Every time a warp hits on the entry, it selects a CPM row and updates the counters corresponding to the warps in the history list. To ensure that the CPM continues to adapt to program behavior, the table is periodically flushed (a flush every 500 cycles suffices).

Hardware overheads: TLB-aware TBC adds little hardware to baseline TBC. We track PTE access similarity between warps rather than between threads to reduce hardware costs and because original warps (not dynamic ones) usually have modest page divergence. This means that focusing on warp access patterns provides most of the benefits of per-thread information, but with far less overhead. The CPM has 48x47 entries; we find that 3-bit counters perform well, for a total CPM size of 0.8KB. In addition, we use a history length of 2 per TLB entry; this requires 12 bits (since a warp identifier is 6 bits). Fortunately, we observe that PTEs do not actually use the full 64-bit address space yet, leaving 18 bits unused. We use 12 of these 18 bits to maintain the history. All CPM updates and flushes occur off the critical path of dynamic warp formation.
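A minimal software model of the CPM bookkeeping described above is sketched below; the table dimensions follow the text (48 warps, 3-bit saturating counters, a two-deep warp history per TLB entry), while the method names, history encoding, and flush hook are illustrative assumptions.

#include <array>
#include <cstdint>
#include <vector>

// Software model of the Common Page Matrix: one row per warp, one 3-bit
// saturating counter per other warp (the diagonal is unused), tracking how
// often two warps have touched the same PTEs.
class CommonPageMatrix {
public:
    static const int kWarps = 48;          // one row per warp on our SIMD cores
    static const uint8_t kMax = 7;         // 3-bit saturating counters

    CommonPageMatrix() { flush(); }

    // TLB hit by `warp` on a PTE whose history records up to two warps that
    // touched it earlier: bump the corresponding counters in this warp's row.
    void on_tlb_hit(int warp, const std::array<int, 2>& pte_history) {
        for (int prev : pte_history) {
            if (prev >= 0 && prev != warp && cpm[warp][prev] < kMax) ++cpm[warp][prev];
        }
    }

    // Compaction gate: admit a candidate thread into a dynamic warp only if its
    // original warp shares saturated counters with every original warp already
    // represented in that dynamic warp.
    bool may_compact(int candidate_warp, const std::vector<int>& warps_in_dynamic) const {
        for (int w : warps_in_dynamic) {
            if (w != candidate_warp && cpm[candidate_warp][w] < kMax) return false;
        }
        return true;
    }

    // Periodic flush (every 500 cycles in the text) so the table adapts to
    // changes in program behavior.
    void flush() {
        for (int r = 0; r < kWarps; ++r)
            for (int c = 0; c < kWarps; ++c) cpm[r][c] = 0;
    }

private:
    uint8_t cpm[kWarps][kWarps];
};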

7.3 Performance of TLB-Aware TBC

Figure 20 shows the performance of TLB-aware TBC against baseline TLB-agnostic TBC without TLBs and with augmented TLBs. We assume 1, 2, and 3 bits per CPM counter; more bits provide greater confidence in detecting whether warps access the same PTEs. Even 1-bit counters drastically improve performance over TLB-agnostic TBC with augmented TLBs (15-20% on average). 3-bit counters boost performance to within 5-12% of baseline TBC without TLBs, well within the typical 5-15% range deemed acceptable on CPUs. As with CCWS, these results mean that even though address translation tests conventional dynamic warp formation, simple tweaks recover most of the lost performance.

8. DISCUSSION AND FUTURE WORK


Figure 20: Performance of TLB-aware TBC as the number of bits per CPM counter is varied. With 3 bits per counter, TLB-aware TBC achieves performance within 3-12% of TBC without TLBs.

Broad applicability. Our work is applicable to throughput-oriented architectures beyond GPUs, like Intel's Many Integrated Core architecture [56], and to accelerators like Rigel [37] and vector-thread architectures [39].

Shared last-level TLBs. Recent work has shown the benefits of last-level TLBs shared among multiple CPU cores [11]. We will consider the benefits of this approach in future work.

Memory management unit (MMU) caches. x86 cores use MMU caches to store frequently-used PTEs from upper levels of the page table tree [6]. These structures accelerate page table walks and may be effective for GPUs too. Future studies will consider MMU caches; note, though, that many of their benefits are similar to those of adding PTW intelligence.

Large pages. Past work [3, 8, 58, 47] has shown that large pages (2MB/1GB) can potentially improve TLB performance. However, in some cases, their overheads can become an issue; for example, they require specialized OS code, can increase paging traffic, and may require pinning in physical memory (e.g., in Windows). We leave a detailed analysis of the pros and cons of large pages to future work. We do, however, present initial insights on 2MB pages. Specifically, one may, at first blush, expect large pages to dramatically reduce page divergence, since it is much likelier that 32 warp threads request the same 2MB chunk rather than the same 4KB chunk. Though this is usually true, we have found that some benchmarks (mummer and bfs) still suffer high page divergences of 6 and 3 (Figure 21). Warp threads in these benchmarks have such far-flung accesses that they essentially span 12MB and 6MB of the address space. We therefore believe that a careful design-space study of superpages is a natural next step in the evolution of this work.
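To make the superpage observation concrete, the hypothetical helper below computes a warp's page divergence for an arbitrary page size; with 2MB pages most warps collapse to a divergence of 1, but warps whose accesses span many megabytes (as in mummer and bfs) do not.

#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Page divergence of one warp: the number of distinct virtual pages referenced
// by its active threads for a given page size (4KB, 2MB, ...). Names are
// illustrative; thread_vaddrs holds one virtual address per active lane.
std::size_t page_divergence(const std::vector<uint64_t>& thread_vaddrs,
                            uint64_t page_bytes) {
    std::set<uint64_t> pages;
    for (uint64_t va : thread_vaddrs) pages.insert(va / page_bytes);
    return pages.size();
}

// For most warps, page_divergence(addrs, 2 << 20) collapses to 1 even when
// page_divergence(addrs, 4 << 10) is large; warps whose 32 accesses span
// roughly 12MB or 6MB keep divergences of about 6 or 3, as in Figure 21.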

9. CONCLUSION

This work examines address translation in CPU/GPU systems. We are prompted to study this by industry trends toward fully-coherent unified virtual address spaces in heterogeneous platforms. The reasons for this trend are well understood: fully-coherent unified virtual address spaces between cores and accelerators simplify programming models and reduce the burden on programmers to manage memory. However, this trend's implications on architecture have not been studied in academia. Unsurprisingly, we find that adding address translation at the L1 level of the GPU does degrade performance.



Figure 21: Page divergence cumulative distribution functions when using 2MB large pages. We show what percentage of warps have a page divergence of 1, 2-3, 4-7, 8-15, and 16-32.

Moreover, we find that the design of GPU address translation should not be naively borrowed from CPUs (even though CPU address translation is a relatively mature technology), because the resulting overheads are untenable. We conclude that the wide adoption of heterogeneous systems, which rely on a manageable programming model, hinges upon thoughtful GPU-aware address translation.

We show that, fortunately, simple designs mindful of GPU data-parallel execution significantly reduce the performance overheads of cache-parallel address translation. We also show that two recent proposals for improving GPU compute execution times (Cache-Conscious Wavefront Scheduling and Thread Block Compaction) have their gains nearly eliminated with the introduction of cache-parallel address translation. However, these schemes can be made TLB-aware with a few simple adjustments, bringing their performance back from the brink and approaching their original gains. Overall, mindful implementation of TLB awareness in the GPU execution pipeline is not complicated, enabling manageable performance degradation in exchange for the industry-driven desire for enhanced programmability. We therefore expect there is a body of low-hanging fruit yet to be plucked for enhancing address translation in heterogeneous systems.

10. REFERENCES

[1] AMD, "AMD I/O Virtualization Technology (IOMMU) Specification," 2006.
[2] N. Amit, M. B. Yehuda, and B.-A. Yassour, "IOMMU: Strategies for Mitigating the IOTLB Bottleneck," WIOSCA, 2010.
[3] Andrea Arcangeli, "Transparent Hugepage Support," KVM Forum, 2010.
[4] T. Austin and G. Sohi, "High-Bandwidth Address Translation for Multiple-Issue Processors," ISCA, 1996.
[5] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," ISPASS, 2009.
[6] T. Barr, A. Cox, and S. Rixner, "Translation Caching: Skip, Don't Walk (the Page Table)," ISCA, 2010.
[7] T. Barr, A. Cox, and S. Rixner, "SpecTLB: A Mechanism for Speculative Address Translation," ISCA, 2011.
[8] A. Basu, J. Gandhi, J. Chang, M. Swift, and M. Hill, "Efficient Virtual Memory for Big Memory Servers," ISCA, 2013.
[9] A. Basu, M. Hill, and M. Swift, "Reducing Memory Reference Energy with Opportunistic Virtual Caching," ISCA, 2012.
[10] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne, "Accelerating Two-Dimensional Page Walks for Virtualized Systems," ASPLOS, 2008.
[11] A. Bhattacharjee, D. Lustig, and M. Martonosi, "Shared Last-Level TLBs for Chip Multiprocessors," HPCA, 2010.
[12] A. Bhattacharjee and M. Martonosi, "Inter-Core Cooperative TLB Prefetchers for Chip Multiprocessors," ASPLOS, 2010.
[13] P. Boudier and G. Sellers, "Memory System on Fusion APUs," Fusion Developer Summit, 2012.
[14] W. Bouknight, S. Denenberg, D. McIntyre, J. Randall, A. Sameh, and D. Slotnick, "The Illiac IV System," Proceedings of the IEEE, vol. 60, no. 4, pp. 369–388, April 1972.
[15] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: Stream Computing on Graphics Hardware," SIGGRAPH, 2004.
[16] M. Cekleov and M. Dubois, "Virtual-Addressed Caches," IEEE Micro, 1997.
[17] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-h. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," IISWC, 2009.
[18] D. Clark and J. Emer, "Performance of the VAX-11/780 Translation Buffers: Simulation and Measurement," ACM Transactions on Computer Systems, vol. 3, no. 1, 1985.
[19] W. Dally, P. Hanrahan, M. Erez, T. Knight, F. Labonte, J.-H. Ahn, N. Jayasena, U. Kapasi, A. Das, J. Gummaraju, and I. Buck, "Merrimac: Supercomputing with Streams," SC, 2003.
[20] W. Fung and T. Aamodt, "Thread Block Compaction for Efficient SIMT Control Flow," HPCA, 2011.
[21] W. Fung, I. Sham, G. Yuan, and T. Aamodt, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," MICRO, 2007.
[22] I. Gelado, J. Cabezas, N. Navarro, J. Stone, S. Patel, and W.-m. Hwu, "An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems," ASPLOS, 2010.
[23] B. Hechtman and D. Sorin, "Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips," ISPASS, 2013.
[24] T. Hetherington, T. Rogers, L. Hsu, M. O'Connor, and T. Aamodt, "Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems," ISPASS, 2012.
[25] Intel, "Intel Virtualization Technology for Directed I/O Architecture Specification," 2006.
[26] Intel Corporation, "TLBs, Paging-Structure Caches and their Invalidation," Intel Technical Report, 2008.
[27] T. Jablin, J. Jablin, P. Prabhu, F. Liu, and D. August, "Dynamically Managed Data for CPU-GPU Architectures," CGO, 2012.
[28] T. Jablin, P. Prabhu, J. Jablin, N. Johnson, S. Beard, and D. August, "Automatic CPU-GPU Communication Management and Optimization," PLDI, 2011.

[29] B. Jacob and T. Mudge, "A Look at Several Memory Management Units: TLB-Refill, and Page Table Organizations," ASPLOS, 1998.
[30] A. Jaleel and B. Jacob, "In-Line Interrupt Handling for Software-Managed TLBs," ICCD, 2001.
[31] A. Jog, O. Kayiran, N. CN, A. Mishra, M. Kandemir, O. Mutlu, R. Iyer, and C. Das, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS, 2013.
[32] A. Jog, O. Kayiran, A. Mishra, M. Kandemir, O. Mutlu, R. Iyer, and C. Das, "Orchestrated Scheduling and Prefetching for GPUs," ISCA, 2013.
[33] G. Kandiraju and A. Sivasubramaniam, "Going the Distance for TLB Prefetching: An Application-Driven Study," ISCA, 2002.
[34] U. Kapasi, W. Dally, S. Rixner, P. Mattson, J. Owens, and B. Khailany, "Efficient Conditional Operations for Data-Parallel Architectures," MICRO, 2000.
[35] S. Kaxiras and A. Ros, "A New Perspective for Efficient Virtual-Cache Coherence," ISCA, 2013.
[36] O. Kayiran, A. Jog, M. Kandemir, and C. Das, "Neither More nor Less: Optimizing Thread Level Parallelism for GPGPUs," PACT, 2013.
[37] J. Kelm, D. Johnson, M. Johnson, N. Crago, W. Tuohy, A. Mahesri, S. Lumetta, M. Frank, and S. Patel, "Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator," ISCA, 2008.
[38] J. Kim, S. L. Min, S. Jeon, B. Ahn, D.-K. Jeong, and C. S. Kim, "U-Cache: A Cost-Effective Solution to the Synonym Problem," HPCA, 1995.
[39] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic, "The Vector-Thread Architecture," ISCA, 2004.
[40] G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," Whitepaper, 2012.
[41] K. Lim, D. Meisner, A. Saidi, P. Ranganathan, and T. Wenisch, "Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached," ISCA, 2013.
[42] J. Meng, D. Tarjan, and K. Skadron, "Dynamic Warp Subdivision for Integrated Branch and Memory Divergence," ISCA, 2010.
[43] G. Morris, B. Gaster, and L. Howes, "Kite: Braided Parallelism for Heterogeneous Systems," 2012.
[44] A. Munshi, "The OpenCL Specification," Khronos OpenCL Working Group, 2012.
[45] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "CACTI 6.0: A Tool to Model Large Caches," MICRO, 2007.
[46] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhudinov, O. Mutlu, and Y. Patt, "Improving GPU Performance via Large Warps and Two-Level Warp Scheduling," MICRO, 2011.
[47] J. Navarro, S. Iyer, P. Druschel, and A. Cox, "Practical, Transparent Operating System Support for Superpages," OSDI, 2002.
[48] NVidia, "NVidia's Next Generation CUDA Compute Architecture: Kepler GK110," NVidia Whitepaper, 2012.
[49] J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, "GPU Computing," Proceedings of the IEEE, vol. 96, no. 5, 2008.
[50] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," EUROGRAPHICS, vol. 26, no. 1, 2007.
[51] A. Parashar, M. Pellauer, M. Adler, B. Ahsan, N. Crago, D. Lustig, V. Pavlov, A. Zhai, M. Gambhir, A. Jaleel, R. Allmon, R. Rayess, S. Maresh, and J. Emer, "Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures," ISCA, 2013.
[52] B. Pham, V. Vaidyanathan, A. Jaleel, and A. Bhattacharjee, "CoLT: Coalesced Large Reach TLBs," MICRO, 2012.
[53] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. Horowitz, "Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing," ISCA, 2013.
[54] P. Rogers, "AMD Heterogeneous Uniform Memory Access," AMD, 2013.
[55] T. Rogers, M. O'Connor, and T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO, 2012.
[56] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan, "Larrabee: A Many-Core x86 Architecture for Visual Computing," SIGGRAPH, 2008.
[57] I. Singh, A. Shriraman, W. Fung, M. O'Connor, and T. Aamodt, "Cache Coherence for GPU Architecture," HPCA, 2013.
[58] M. Talluri and M. Hill, "Surpassing the TLB Performance of Superpages with Less Operating System Support," ASPLOS, 1994.
[59] M. Taylor, "Is Dark Silicon Useful?" DAC, 2012.
[60] N. Wilt, "The CUDA Handbook," 2012.
[61] L. Wu, R. Barker, M. Kim, and K. Ross, "Navigating Big Data with High-Throughput, Energy-Efficient Data Partitioning," ISCA, 2013.
