CHAPTER 10

Consistency and Coherence for Heterogeneous Systems

This is the age of specialization. Today's server, mobile, and desktop processors not only contain conventional CPUs but also various flavors of accelerators. The most prominent among the accelerators are Graphics Processing Units (GPUs). Other types of accelerators, including Digital Signal Processors (DSPs), AI accelerators (e.g., Apple's Neural Engine, Google's TPU), cryptographic accelerators, and Field-Programmable Gate Arrays (FPGAs), are also becoming common.

Such heterogeneous architectures, however, pose a programmability challenge. How does one synchronize and communicate within and across the accelerators? And how can this be done efficiently? One promising trend is to expose a global shared memory interface across the CPUs and the accelerators. Recall that shared memory not only helps with programmability by providing an intuitive load-store interface for communication, but also helps in reaping the benefits of locality via programmer-transparent caching.

The CPUs, GPUs, and other accelerators can either be tightly integrated, actually sharing the same physical memory as in mobile systems-on-chip, or they may have physically independent memories with a runtime that provides a logical abstraction of shared memory. To simplify the discussion, we assume the former. As shown in Figure 10.1, the CPUs, GPUs, and accelerators may each have multiple cores with private per-core L1s and a shared L2. The CPUs and the accelerators may also share a memory-side last-level cache (LLC), which, unless specified otherwise, is non-inclusive. (Recall that a memory-side LLC does not pose coherence issues.) The LLC also serves as an on-chip memory controller.

Shared memory automatically raises two questions. What is the consistency model? How is the consistency model (efficiently) enforced? In particular, how are the caches within an accelerator (the L1s) and across the CPUs and accelerators (the L1s and L2s) kept coherent?


Source: pages.cs.wisc.edu/~markhill/Tmp/primer2_GPU2019_03_11.pdf




Figure 10.1: System model of a heterogeneous system-on-chip.

In this chapter, we discuss potential answers to these questions in this fast-moving area. We first examine consistency and coherence within an accelerator, focusing on GPUs. We then consider consistency and coherence across the accelerators and the CPU.

10.1 GPU CONSISTENCY AND COHERENCE

We start this section by briefly summarizing early GPU architectures (i.e., those targeted towards graphics workloads) and their programming model. This will help us appreciate the design choices with regard to consistency and coherence in such GPUs. We then discuss the trend of using GPU-like architectures to run general-purpose workloads, in the so-called General-Purpose Graphics Processing Units (GP-GPUs). Specifically, we discuss the new demands that GP-GPUs place on consistency and coherence. We then discuss in detail some of the recent proposals for fulfilling these demands.

10.1.1 EARLY GPUS: ARCHITECTURE AND PROGRAMMING MODEL

Early GPUs were primarily tailored towards embarrassingly parallel graphics workloads. Roughly speaking, these workloads involve independently computing each of the pixels that constitute the display. Thus, the workloads are characterized by a very high degree of data parallelism and a low degree of data sharing, synchronization, and communication.


GPU Architecture

To exploit the abundance of parallelism, as shown in Figure 10.2, GPUs typically have tens of cores called Streaming Multiprocessors (SMs).¹ Each SM is highly multithreaded, capable of running on the order of a thousand threads. The threads mapped to an SM share the L1 cache and a local scratchpad memory (not shown). All of the SMs share an L2 cache.

In order to amortize the cost of maintaining state for all of these threads, GPUs typically execute threads in groups called warps.² All of the threads in a warp share the program counter (PC) and the stack, but can still execute independent thread-specific paths using mask bits that specify which threads in the warp are active and which should not execute. This style of parallelism is known as Single-Instruction, Multiple-Threads (SIMT). Because all of the threads in a warp share the PC, traditionally the threads in a warp were scheduled together. Recently, however, GPUs have started to allow threads within a warp to have independent PCs and stacks, and consequently allow the threads to be independently scheduled. For the rest of the discussion, we assume that threads can be independently scheduled.
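The mask-bit mechanism can be sketched with a toy model (an illustrative Python sketch; the warp width, function name, and branch condition are invented for this example, not vendor semantics):

```python
# Toy model of SIMT execution: one program counter per warp, with per-lane
# mask bits deciding which threads actually commit each instruction.

WARP_SIZE = 4  # illustrative; real warps are 32 (NVIDIA) or 64 (AMD wavefronts)

def warp_branch(values):
    """Run `if v % 2 == 0: v += 10 else: v -= 1` across one warp via masking."""
    assert len(values) == WARP_SIZE
    mask = [v % 2 == 0 for v in values]          # lanes taking the "then" side
    # The warp executes BOTH sides of the branch; inactive lanes are masked off.
    then_side = [v + 10 if m else v for m, v in zip(mask, values)]
    result = [v - 1 if not m else v for m, v in zip(mask, then_side)]
    return result

print(warp_branch([8, 3, 6, 5]))  # [18, 2, 16, 4]
```

Note that both branch paths are executed back-to-back by the warp; lanes with a cleared mask bit simply do not commit results, which is why heavily divergent code wastes issue slots.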

Because graphics workloads do not share data or synchronize frequently (synchronization and communication typically happen infrequently, at a coarse granularity), early GPUs chose not to implement hardware cache coherence for keeping the L1 caches coherent. The absence of hardware cache coherence makes synchronization a trickier prospect for the programmer, as we will see next.

GPU Programming Model

Analogously to the CPU instruction set architecture (ISA), GPUs also have (virtual) ISAs. Nvidia's virtual ISA, for example, is known as Parallel Thread Execution (PTX). In addition, there are higher-level language frameworks. Two such frameworks are especially popular: CUDA and OpenCL. The high-level programs in these frameworks are compiled down to intermediate-level virtual ISAs (e.g., PTX), which in turn are translated into native binary at kernel installation time. A kernel here refers to a unit of work offloaded by the CPU onto the GPU, and it typically consists of a large number of software threads.

GPU programming models (virtual ISAs as well as language frameworks) choose to expose the hierarchical nature of the GPU architecture to the programmer via a thread hierarchy known as scopes. In contrast to CPU threads, where all threads are "equal", GPU threads from a kernel are grouped into clusters called Cooperative Thread Arrays (CTAs).³ The threads belonging to a CTA are guaranteed to be mapped to the same SM and are said to be of CTA scope. Because they are mapped to the same SM, all of these threads share the same L1 cache; in that sense, the CTA scope also implicitly refers to the memory hierarchy level (the L1) that all of these threads share. Threads from two different CTAs (from the same GPU) are said to be of GPU scope. All of these threads share the L2, and in that sense the GPU scope implicitly refers to the L2. Finally, threads from a CPU and a GPU (or from two different accelerators) are said to be of system scope. All of these threads may share a cache (the LLC); thus, the system scope implicitly refers to the LLC (or unified shared memory, if there is no shared LLC).

¹SM in NVIDIA parlance; aka Compute Unit (CU) in AMD parlance.
²Warp in NVIDIA parlance; also known as a wavefront in AMD parlance.
³CTA in PTX parlance; aka workgroup in OpenCL and (thread) block in CUDA.

Why expose the thread and memory hierarchy to the software? In the absence of hardware cache coherence, this allows programmers and hardware to cooperatively achieve efficient synchronization. Specifically, if the programmer ensures that two synchronizing threads are within the same CTA, the threads can synchronize and communicate efficiently via the L1 cache.

How can two threads of GPU scope (i.e., two threads from different CTAs but from the same GPU kernel) synchronize? Somewhat surprisingly, early GPU consistency models did not explicitly allow this. In practice, though, programmers could realize GPU-scoped synchronization via specially crafted loads and stores that bypass the L1, which, needless to say, caused a great deal of confusion among programmers. In the next section, we explore GPU consistency in detail with examples.

GPU Consistency

GPUs support relaxed memory consistency models. Like relaxed-consistency CPUs, GPUs enforce only the memory orderings indicated by the programmer (e.g., via FENCE instructions).

However, because of the absence of hardware cache coherence in GPUs, the semantics of a FENCE instruction are different. Recall that a FENCE instruction in a multicore CPU ensures that the loads and stores before the FENCE are performed, with respect to all threads, before the loads and stores following the FENCE. A GPU FENCE provides similar semantics, but only with respect to other threads belonging to the same CTA.

Consider the message-passing example shown in Table 10.1, where the programmer intends for load Ld2 to read NEW (and not the old value of 0). How can this be ensured?

First of all, note that without the two FENCE instructions, Ld2 can read the old value of 0: GPUs, being relaxed, may perform loads and stores out of order.

Now, let us suppose the two threads T1 and T2 belong to the same CTA (and hence are mapped to the same SM). At the microarchitectural level, a GPU FENCE works similarly to a FENCE in the XC model we discussed earlier in this primer. (The reorder unit ensures that the Load/Store→FENCE and FENCE→Load/Store orderings are enforced.) Because T1 and T2 share an L1, honoring the above rules is sufficient to ensure that the two stores from T1 are made visible to T2 in program order, ensuring that load Ld2 reads NEW.

Table 10.1: Message-passing example. T1 and T2 belong to the same CTA.

Thread T1                Thread T2                        Comments
St1: St data = NEW;      Ld1: Ld r1 = flag;               Initially, data and flag are 0.
FENCE;                   B1: if (r1 != SET) goto Ld1;
St2: St flag = SET;      FENCE;
                         Ld2: Ld r2 = data;               Can r2 == 0?

On the other hand, if the two threads T1 and T2 belong to different CTAs (and hence are mapped to different SMs, say SM1 and SM2), it is possible for load Ld2 to read 0. To see how, consider the following sequence of events:

• Initially, both data and flag are cached in both L1s.

• St1 and St2 perform in program order, writing NEW and SET, respectively, in the L1 cache of SM1.

• The cache line holding flag is evicted from SM1's L1, and flag=SET is written back to the L2; the line holding data=NEW stays in SM1's L1 and does not reach the L2.

• The cache line holding flag in the L1 cache of SM2 is evicted.

• Load Ld1 performs; it misses in the L1, so the line is fetched from the L2, and Ld1 reads SET.

• Load Ld2 performs; it hits in the L1 and reads 0.

Thus, although the FENCE ensures that the two stores are written to the L1 of SM1 in the correct order, in the absence of hardware cache coherence they may become visible to T2 in a different order. That is why early GPU programming manuals explicitly disallowed this type of inter-CTA communication between threads of the same kernel.
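The sequence of events above can be replayed in a toy model of incoherent write-back L1s in front of a shared L2 (an illustrative Python sketch; all names are invented, not real hardware):

```python
# Two incoherent write-back L1s and a shared L2, modeled as dictionaries.

l2 = {"data": 0, "flag": 0}
l1 = {"SM1": {"data": 0, "flag": 0},   # T1 runs on SM1
      "SM2": {"data": 0, "flag": 0}}   # T2 runs on SM2

def store(sm, var, val):               # stores hit only the local L1
    l1[sm][var] = val

def writeback_evict(sm, var):          # dirty eviction writes the line to L2
    l2[var] = l1[sm].pop(var)

def load(sm, var):                     # L1 hit, else fill from L2
    if var not in l1[sm]:
        l1[sm][var] = l2[var]
    return l1[sm][var]

store("SM1", "data", "NEW")            # St1
store("SM1", "flag", "SET")            # St2 (the FENCE ordered these in L1)
writeback_evict("SM1", "flag")         # flag=SET reaches L2; data=NEW does not
del l1["SM2"]["flag"]                  # SM2's clean copy of flag is evicted
r1 = load("SM2", "flag")               # Ld1: miss, fetches SET from L2
r2 = load("SM2", "data")               # Ld2: hits the stale line, reads 0
print(r1, r2)                          # SET 0 -> the reordering in question
```

The stores leave SM1's L1 in program order only by coincidence of eviction timing; nothing orders their arrival at the L2, which is precisely the gap the FENCE cannot close across CTAs.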

Is inter-CTA synchronization then impossible to achieve? In practice, a workaround is possible, leveraging special load and store instructions that directly target specific levels of the memory hierarchy. As we can see in Table 10.2, the two stores in T1 explicitly write to the GPU scope (i.e., to the L2), bypassing the L1. Likewise, the two loads in T2 bypass the L1 and read directly from the L2. Thus, the two threads of GPU scope, T1 and T2, are explicitly made to synchronize at the L2 by using loads and stores that bypass the L1.

Table 10.2: Message-passing example. T1 and T2 belong to different CTAs.

Thread T1                    Thread T2                            Comments
St1: St.GPU data = NEW;      Ld1: Ld.GPU r1 = flag;               Initially, data and flag are 0.
FENCE;                       B1: if (r1 != SET) goto Ld1;
St2: St.GPU flag = SET;      FENCE;
                             Ld2: Ld.GPU r2 = data;               Can r2 == 0?

However, there are problems with the above workaround, the primary one being performance inefficiency owing to loads and stores bypassing the L1. The reader might wonder: the two variables in question, flag and data, must in any case be communicated across the SMs, so why is bypassing a significant problem? But consider a commonly occurring pattern involving a big data array (instead of the single data variable), with only a small subset actually being communicated, while the rest is private or read-only. In such a situation, the programmer is confronted with the onerous task of carefully directing loads and stores to the appropriate memory hierarchy level in order to make effective use of the L1.

Summary: Limitations and Desiderata

Early GPUs were primarily targeted towards embarrassingly parallel workloads that neither synchronize nor share data frequently. Therefore, such GPUs chose not to support hardware cache coherence for keeping the local caches coherent, at the cost of a scoped memory consistency model that permits only intra-CTA synchronization. More flexible inter-CTA synchronization is either too inefficient or places a huge burden on programmers.

GPUs are increasingly being used for general-purpose workloads. Such workloads tend to involve relatively frequent fine-grained synchronization and more general sharing patterns. Thus, it is desirable for a GP-GPU to have:

• A rigorous and intuitive memory consistency model that permits synchronization across all threads.

• A coherence protocol that enforces the consistency model while allowing for efficient data sharing and synchronization, and at the same time does not depart too much from the simplicity of conventional GPU architecture (GPUs are expected to cater predominantly to graphics workloads).

10.1.2 BIG PICTURE: GP-GPU CONSISTENCY AND COHERENCE

We outlined the desired properties for GP-GPU consistency and coherence. One straightforward approach to meeting these demands is to use a (multicore) CPU-like approach to coherence and consistency, i.e., use one of the consistency-agnostic coherence protocols (which we covered at great length in earlier chapters) to ideally enforce a strong consistency model such as sequential consistency (SC). Despite ticking almost all boxes (the consistency model is certainly intuitive, without the notion of scopes, and the coherence protocol allows for the efficient sharing of data), this approach is ill-suited for a GPU, for two major reasons.

First, a CPU-like coherence protocol that invalidates sharers upon a write would incur a high traffic overhead in the GPU context. This is because the aggregate capacity of the local caches (L1s) in GPUs is typically comparable to, or even greater than, the size of the L2. The NVIDIA Volta GPU, for example, has an aggregate L1 capacity of about 10 MB and only 6 MB of L2 cache. A standalone inclusive directory for tracking all of the L1 blocks would incur an impractically large area overhead. On the other hand, an inclusive directory embedded in the L2 would result in a significant amount of recall traffic upon L2 evictions, given the size of the aggregate L1s.
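A rough calculation illustrates why inclusive tracking of the L1s is impractical (the capacities are approximate Volta-class figures, and the 128-byte line size is an assumption for illustration only):

```python
# Back-of-the-envelope arithmetic for the capacity argument above.

MB = 1 << 20
l1_aggregate = 10 * MB          # assumed aggregate L1 capacity across all SMs
l2_capacity = 6 * MB            # assumed shared L2 capacity
line_size = 128                 # assumed cache line size in bytes

l1_lines = l1_aggregate // line_size
l2_lines = l2_capacity // line_size
print(l1_lines, l2_lines)       # 81920 49152

# An inclusive directory embedded in the L2 has one entry per L2 line, yet
# there are more aggregate L1 lines than L2 lines, so L2 evictions would
# constantly recall live L1 blocks.
print(round(l1_lines / l2_lines, 2))  # 1.67
```

With more L1 lines than L2 lines, inclusion cannot even hold in principle without continual recalls, and a standalone directory sized for all L1 lines would rival the L2 itself in area.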

Second, because GPUs maintain thousands of active hardware threads, there is a need to track a correspondingly high number of coherence transactions, which would cost significant hardware overhead.

Without writer-initiated invalidations, how can a store be propagated to other, non-local L1 caches? (That is, how can the store be made visible to threads belonging to other CTAs?) And without writer-initiated invalidations, how can a consistency model, let alone a strong one, be enforced?

In the next section, we discuss two proposals, temporal coherence and release consistency-directed coherence, that employ self-invalidation, whereby a processor invalidates lines in its local cache to ensure that stores from other threads become visible.

What consistency model can self-invalidation protocols enforce efficiently? We argue that such protocols are amenable to efficiently enforcing relaxed consistency models directly, rather than enforcing consistency-agnostic invariants such as SWMR. Whether or not the consistency model should include scopes, however, remains an open question. Because scopes cannot be ruled out, in the next section we outline a scoped consistency model that does not limit synchronization to within a subset of scopes.

10.1.3 TEMPORAL COHERENCE

In this section, we discuss an approach for enforcing coherence using timestamps, called temporal coherence [Singh et al., 2013]. We discuss two variants of temporal coherence: (1) a consistency-agnostic variant that enforces SWMR; and (2) a more efficient consistency-directed variant that directly enforces a relaxed consistency model. For the following discussion, we assume that the shared cache (L2) is inclusive: a block not present in the shared L2 implies that it is not present in any of the local L1s.

Table 10.3: Message-passing example. T1 and T2 belong to different CTAs.

Thread T1                  Thread T2                        Comments
St1: St data1 = NEW;       Ld1: Ld r1 = flag;               Initially, all variables are 0.
St2: St data2 = NEW;       B1: if (r1 != SET) goto Ld1;
FENCE;                     FENCE;
St3: St flag = SET;        Ld2: Ld r2 = data2;              Can r2 == 0?

Consistency-agnostic Temporal Coherence

Instead of a writer invalidating all sharers in non-local caches, consider a protocol in which the writer is made to wait until all of the sharers have evicted the block. By making the write wait until there are no sharers, the protocol ensures that there are no concurrent readers at the instant the write succeeds, thereby enforcing SWMR.

How can the writer know how long to wait? That is, how can the writer ascertain that there are no more sharers for the block? Temporal coherence achieves this by leveraging a global notion of time. Specifically, it requires that each of the L1s and the L2 have access to a register that keeps track of global time.

Readers (upon an L1 miss) predict how long they expect to hold the block in the L1 and inform the L2 of this time duration, known as the lease. Every L1 cache block is tagged with a timestamp that holds the lease for that block. A read of an L1 block for which the current time is greater than its lease is treated as a miss.

Furthermore, each block in the L2 is tagged with a timestamp. When a reader consults the L2 (on an L1 cache miss), it informs the L2 of its lease; the L2 updates the timestamp for the block, subject to the invariant that the timestamp holds the latest lease for that block across all L1s.

Every write, even if the block is present in the L1 cache, is written through to the L2. The write request accesses the block's timestamp held in the L2, and if the timestamp corresponds to a time in the future, the write blocks until that time. This blocking ensures that there are no sharers of the block when the write performs at the L2, thereby ensuring SWMR.
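The read-lease and blocking-write mechanism can be sketched as follows (an illustrative Python model; `TemporalL2` and its methods are invented names, and "blocking" is modeled by returning the time at which the write is allowed to perform):

```python
# Toy model of consistency-agnostic temporal coherence. Reads take out
# leases; a write-through must wait until the block's latest lease across
# all L1s has expired before it performs at the L2.

class TemporalL2:
    def __init__(self):
        self.data = {}   # address -> value
        self.ts = {}     # address -> latest lease expiry across all L1s

    def read(self, addr, now, lease):
        # L1 miss: grant the block until now + lease; the L2 keeps the
        # maximum expiry it has handed out for this block.
        expiry = now + lease
        self.ts[addr] = max(self.ts.get(addr, 0), expiry)
        return self.data.get(addr, 0), expiry

    def write(self, addr, value, now):
        # Enforce SWMR: stall (modeled by jumping the clock) until no L1
        # can still hold a valid copy, then perform the write.
        perform_time = max(now, self.ts.get(addr, 0))
        self.data[addr] = value
        return perform_time

l2 = TemporalL2()
_, expiry = l2.read("data1", now=0, lease=30)  # a reader leases data1 until t=30
t = l2.write("data1", "NEW", now=10)           # the writer arrives at t=10
print(expiry, t)                               # 30 30: blocked until the lease ran out
```

Note that no message is ever sent to the reader: the writer simply waits out the lease, which is what makes the scheme attractive for a GPU with thousands of potential sharers.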

Example. In order to understand how temporal coherence works, let us consider the message-passing example in Table 10.3 (ignoring the fences for now). Let us assume that threads T1 and T2 are from two different CTAs and hence are mapped to two different SMs (SM1 and SM2) with separate local L1s.

We illustrate a timeline of events at SM1, SM2, and the shared L2 in Table 10.4. Initially, let us assume that flag, data1, and data2 are cached in the local L1 of SM2 with lease values of 35, 30, and 20, respectively. When St1 is performed, since the L1s use a write-through (no-write-allocate) policy, a write request is issued to the L2. But because data1 is valid in the L1 cache of SM2 (until time=30), the write is blocked until that time. At time=30, the write performs at the L2. St2 then issues a write request to the L2; by the time the request reaches the L2, the lease for data2 (20) has already expired, so the write performs at the L2 without any blocking. Similarly, St3's write request is written to the L2 without any blocking (the lease for flag, 35, has expired by that time). Meanwhile, when Ld1 checks its L1, it finds that the lease for flag has expired, so a read request is issued to the L2, from which it reads SET. Likewise, Ld2 issues a read request to the L2 and reads the expected value of NEW. Because this variant of temporal coherence enforces SWMR, as long as GPU threads issue memory operations in program order, SC is enforced.⁴

Protocol Specification. We present the detailed protocol specifications for the L1 and L2 controllers in Tables 10.5 and 10.6, respectively. The L1 has two stable states: (I)nvalid and (V)alid. The L1 communicates with the L2 using GetS, GetM, and Upgr messages, all of which carry their usual meanings, but with one difference: the GetS and Upgr messages additionally carry a timestamp (TS) as a parameter. For GetS, the TS parameter holds the requested lease, whereas for Upgr it holds the current lease for the block.

A load in state I causes a GetS request to be sent to the L2; upon the receipt of data from the L2, the state transitions to V. A store in state I causes a GetM request to be sent to the L2. Upon the receipt of an ack from the L2, indicating that the L2 has written the data, the state transitions back to I. (Because the L1 uses a no-write-allocate policy, the data is not brought into the L1.)

Recall that a block ceases to be valid if the global time advances past the lease for that block; this is represented logically by the V to I transition upon lease expiry.⁵ Upon a store to a block in state V, an Upgr request is sent to the L2. The purpose of the upgrade is to exploit the fact that if the block is held privately in the L1 that issued the upgrade, then there is no need for the write to block. Rather, the write can simply update the L2 (as well as the L1) and continue to cache the block in the L1 until its lease runs out.

⁴The original temporal coherence paper considered a warp as equivalent to a thread, whereas we consider a more modern GPU setting where threads can be individually scheduled. Consistency in our setting is therefore subtly different from their setting.
⁵It is worth noting that this state transition is logical, in that the actual implementation need not proactively detect that a block has expired and transition it to state I; the fact that a block has expired may be discovered lazily on a subsequent transaction to the block.

Table 10.4: Temporal coherence: timeline of events (for the example in Table 10.3).

/* T1 and T2 (belonging to different CTAs) are mapped to SM1 and SM2, respectively.
   flag=0 (lease=35), data1=0 (lease=30), and data2=0 (lease=20) are cached in the L1 of SM2. */

1. SM1: St1 issued. L2: St1 attempts to write, but is blocked at the L2 until time=30 (data1's lease).
2. At time=30, L2: St1 is unblocked and writes NEW to data1. SM1: St1 completes.
3. SM1: St2 issued. L2: St2 writes NEW to data2 without blocking, since the current time exceeds the lease (20). SM1: St2 completes.
4. SM1: St3 issued. SM2: Ld1 issued; the L1 lease for flag has expired, so Ld1 is sent to the L2.
5. L2: St3 writes SET to flag without blocking, since the current time exceeds the lease (35).
6. SM2: Ld1 reads SET from the L2 and gets a new lease. SM1: St3 completes. SM2: Ld1 completes.
7. SM2: Ld2 issued; the L1 lease for data2 has expired, so Ld2 is sent to the L2. L2: Ld2 reads NEW and gets a new lease. SM2: Ld2 completes.

Table 10.5: Enforcing SWMR via temporal coherence: L1 controller.

State | Load                 | Store                | L1 Eviction | L1 Expiry | Data from L2 | Write-Ack from L2
I     | send GetS to L2 /IVD | send GetM to L2 /IIA |             |           |              |
IVD   | stall                | stall                | stall       | stall     | /V           |
IIA   | stall                | stall                | stall       |           |              | /I
V     | hit                  | send Upgr to L2 /VVA | /I          | /I        |              |
VVA   | hit                  | stall                | stall       | /IIA      |              | /V

We now describe the L2 controller. The L2 has four stable states: (I)nvalid, indicating that the block is present neither in the L2 nor in any of the L1s; (P)rivate, indicating that the block is present in exactly one of the L1s; (S)hared, indicating that the block may be present in one or more L1s; and (E)xpired, indicating that the block is present in the L2 but not valid in any of the L1s. Upon receiving a GetS request in state I, the L2 fetches the block from memory, updates the block's timestamp in accordance with the requested lease, and transitions to P. Upon receiving a GetM, it fetches the block from memory, updates the value in the L2, sends back an ack, and transitions to E (as the block is not valid in any of the L1s).

Table 10.6: Enforcing SWMR via temporal coherence: L2 controller.

State | GetS(TS)                                 | GetM                                | Upgr(TS)                                                                  | L2 Eviction            | L2 Expiry | Data from Mem
I     | send Fetch to Mem, update timestamp /IPD | send Fetch to Mem /IED              |                                                                           |                        |           |
IPD   | stall                                    | stall                               | stall                                                                     | stall                  |           | /P
IED   | stall                                    | stall                               | stall                                                                     | stall                  |           | update L2, send Write-Ack to L1 /E
P     | send Data to L1, extend TS /S            | stall (until expiry)                | if TS matches: update L2, send Write-Ack to L1; else: stall until expiry  | stall (until expiry)   | /E        |
S     | send Data to L1, extend TS               | stall (until expiry)                | stall (until expiry)                                                      | stall (until expiry)   | /E        |
E     | update timestamp /P                      | update L2, send Write-Ack to L1     | update L2, send Write-Ack to L1                                           | write back if dirty /I |           |

Upon receiving a GetS request in state P, the L2 responds with the data, extends the timestamp (if the requested lease is greater than the current timestamp), and transitions to S. For an Upgr request received in state P, there are two cases. In the straightforward case, the only SM holding the block privately writes to it; here, the L2 can simply update the block without any stalling and reply with an ack. But there is a tricky corner case in which the Upgr message from a private block is delayed such that the block is now held privately, but by a different SM! In order to disambiguate these two cases, every Upgr message carries a timestamp with the lease for that block; a match with the timestamp held in the L2 indicates the former, straightforward case. A mismatch indicates the latter, in which case the Upgr request stalls until the block expires, at which time the L2 is updated and an ack is sent back to the L1. An Upgr (or GetM) request received in state S, on the other hand, always has to wait until the block's lease at the L2 expires. Finally, an L2 block is allowed to be evicted only in state E (because writes must know how long to stall in order to enforce SWMR).

In summary, we saw how temporal coherence enforces SWMR using leased reads and blocking writes. In combination with a processor that presents memory operations in program order, temporal coherence can be used to enforce sequential consistency (SC).


Consistency-directed Temporal Coherence

Temporal coherence as described previously enforces SWMR, but at a significant cost: every write to an unexpired shared block needs to stall at the L2 controller. Since the L2 is shared across all threads, stalling at the L2 can indirectly affect all of the threads, thereby reducing overall GPU throughput.

Recall that earlier in this primer we discussed two classes of coherence interfaces. An SWMR-enforcing protocol, which propagates writes synchronously before the write returns, belongs to the class of coherence protocols we call consistency-agnostic. Given the cost of enforcing SWMR in the GPU setting, can we explore consistency-directed coherence? That is, instead of making writes visible to other threads synchronously, can we make them visible asynchronously without violating consistency? Specifically, a relaxed consistency model such as XC⁶ mandates only the memory orderings indicated by the programmer via FENCE instructions. Such a model allows writes to propagate to other threads asynchronously.

Considering the message-passing example shown in Table 10.3, XC merely requires that St1 and St2 become visible to T2 before St3 becomes visible. It does not mandate that by the time St1 performs it must have propagated to T2 (i.e., it does not strictly require SWMR). Consequently, XC permits a variant of temporal coherence which, rather than stalling a write at the L2, allows for delayed stalling, if necessary, upon hitting a fence. To this end, when a write request reaches the L2 and the block is shared, the L2 simply replies to the SM with the timestamp associated with the block. We refer to this time as the global write completion time (GWCT), as it indicates the time until which the thread must stall (upon hitting a fence) in order to ensure that the write has become globally visible to all threads.

For each thread mapped to an SM, the SM keeps track of the maximum of the GWCTs returned for that thread's writes in a per-thread stall-time register. Upon hitting a fence, stalling (or descheduling) the thread until this time ensures that all of the writes before the fence have become globally visible.
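The stall-time bookkeeping described above can be sketched as follows. This is a minimal illustration (the class and method names are ours, not from the text), using the lease values of the running example (data1's lease expiring at time 30, data2's at time 20):

```python
# Hypothetical sketch of per-thread GWCT tracking at an SM. The L2 attaches
# a block's lease-expiry time (the GWCT) to each Write-Ack for an unexpired
# shared block; the SM remembers the maximum GWCT per thread and a FENCE
# stalls the thread until that time has passed.

class SMThreadContext:
    def __init__(self):
        self.stall_time = 0          # max GWCT seen so far for this thread

    def on_write_ack(self, gwct):
        # A Write-Ack without a GWCT (block's lease already expired) is a no-op.
        if gwct is not None:
            self.stall_time = max(self.stall_time, gwct)

    def fence(self, now):
        # Returns the time at which the fence may complete: the thread stalls
        # only if some earlier write is not yet globally visible.
        return max(now, self.stall_time)

t = SMThreadContext()
t.on_write_ack(30)   # St1: data1's lease expires at t=30, so GWCT=30
t.on_write_ack(20)   # St2: GWCT=20 is smaller; stall-time unchanged
assert t.stall_time == 30
assert t.fence(now=25) == 30   # a fence at t=25 stalls until t=30
assert t.fence(now=40) == 40   # nothing left to wait for after t=30
```

The `max` in `on_write_ack` is exactly the "extend stall-time" action of the L1 controller; the `max` in `fence` is the "block until stall-time" action.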

Example: We illustrate a timeline of events in the table that follows for the same message-passing example, but now taking into account the effect of fences, because we are seeking to enforce a relaxed XC-like model. (For now, let us assume there are no scopes in the memory model.)

St1 issues a write request to the L2 and is written to the L2 even though its lease (expiring at time 30) has not yet expired. The L2 replies with a GWCT of 30, which is remembered at SM1 in the per-thread stall-time register. In a similar vein, St2 issues a request, is written to the L2, and the L2 replies back with a GWCT of 20. Upon receiving this GWCT, SM1 does not update its stall-time (30), as the current value is higher. The FENCE instruction then executes and blocks thread T1 until the stall-time of 30 has passed. After the fence completes, St3 issues a


GPU CONSISTENCY AND COHERENCE

Table: Consistency-directed temporal coherence: timeline for the message-passing example.

/* T1 and T2 (belonging to different CTAs) are mapped to SM1 and SM2 respectively.
   flag=0 (lease=35), data1=0 (lease=30) and data2=0 (lease=20) are cached in the L1 of SM2. */

Time  Agent  Event
t1    SM1    St1 issued
t2    L2     St1 writes NEW to the L2, although current time < 30 (data1's lease); replies with GWCT=30
t3    SM1    St1 completes; stall-time := 30
t4    SM1    St2 issued
t5    L2     St2 writes NEW to the L2, although current time < 20 (data2's lease); replies with GWCT=20
t6    SM1    St2 completes; GWCT(20) < stall-time(30), so stall-time unchanged
t7    SM1    FENCE: block until stall-time (30)
t8    SM1    St3 issued
t9    L2     St3 writes SET to the L2, since current time > 35 (flag's lease)
t10   SM2    Ld1 issued; the L1 lease for flag has expired, so Ld1 is issued to the L2
t11   SM1    St3 completes
t12   SM2    Ld1 reads SET from the L2, gets a new lease; Ld1 completes
t13   SM2    Ld2 issued; the L1 lease for data1 has expired, so Ld2 is issued to the L2
t14   SM2    Ld2 reads NEW from the L2, gets a new lease; Ld2 completes

write request to the L2 and is written to the L2. Because the lease for flag (35) has by then expired, the L2 does not respond with a GWCT, and St3 completes. Meanwhile, Ld1 issues a read request to the L2, as its L1 lease for flag has expired; it reads SET from the L2 and completes. In a similar vein, Ld2 is issued, reads from the L2, and completes, returning the expected value of NEW.

Protocol Specification: The consistency-directed temporal coherence protocol specification is mostly similar to that of the consistency-agnostic variant; the differences are confined to the handling of Write-Acks at the L1 controller and of GetM/Upgr requests at the L2 controller, as shown in the tables that follow.

The main difference in the L1 controller is that Write-Acks from the L2 now carry GWCTs. Accordingly, upon receiving a Write-Ack, the L1 controller extends the stall-time if the incoming GWCT is greater than the



State  Events and actions
I:     Load: send GetS to L2; go to I^VD. Store: send GetM to L2; go to I^IA.
I^VD:  Load, Store, L1-Eviction: stall. Data from L2: go to V.
I^IA:  Load, Store, L1-Eviction: stall. Write-Ack (+GWCT) from L2: extend stall-time; go to I.
V:     Load: hit. Store: send Upgr to L2; go to V^VA. L1-Eviction: go to I. L1-Expiry: go to I.
V^VA:  Load: hit. Store: stall. L1-Eviction: stall. Write-Ack (+GWCT) from L2: extend stall-time; go to V.

Table: Consistency-directed temporal coherence, L1 controller (the differences from the consistency-agnostic variant lie in the Write-Ack handling).

State  Events and actions
I:     GetS(TS): send Fetch to Mem; update timestamp; go to I^PD. GetM: send Fetch to Mem; go to I^ED.
I^PD:  GetS, GetM, Upgr, L2-Eviction: stall. Data from Mem: go to P.
I^ED:  GetS, GetM, Upgr, L2-Eviction: stall. Data from Mem: update L2 and send Write-Ack to L1; go to E.
P:     GetS(TS): send Data to L1; extend TS; go to S. GetM: update L2 and send Write-Ack+GWCT to L1.
       Upgr(TS): if TS matches, update L2 and send Write-Ack to L1; else stall until expiry.
       L2-Eviction: stall until expiry. L2-Expiry: go to E.
S:     GetS(TS): send Data to L1; extend TS. GetM: update L2 and send Write-Ack+GWCT to L1.
       Upgr(TS): update L2 and send Write-Ack+GWCT to L1.
       L2-Eviction: stall until expiry. L2-Expiry: go to E.
E:     GetS(TS): send Data to L1; update timestamp; go to P. GetM: update L2 and send Write-Ack to L1.
       Upgr(TS): update L2 and send Write-Ack to L1. L2-Eviction: write back if dirty; go to I.

Table: Consistency-directed temporal coherence, L2 controller (the differences from the consistency-agnostic variant lie in the GetM and Upgr handling).

currently held stall-time for that thread. (Recall that upon hitting a FENCE, the thread is stalled until the stall-time.)

The main difference in the L2 controller is that GetM requests do not induce a stall; rather, the write is performed and a GWCT is returned along with the Write-Ack. In a similar vein, an Upgr request in state S also does not stall.

Temporal Coherence: Summary and Limitations

We saw how temporal coherence can either be used to enforce SWMR or to directly enforce a relaxed consistency model such as XC. The key benefit of the latter approach is that it eliminates expensive stalling at the L2; instead, writes optionally stall at the SM upon hitting a FENCE. More optimizations with even less stalling are possible [ref]. However, there are some critical limitations to temporal coherence:

• Supporting a non-inclusive L2 cache is cumbersome. This is because temporal coherence requires that a block that is valid in one or more L1s must have its

(Temporal coherence enforces a variant of XC in which writes are not atomic.)



lease time available at the L2. A (complex) workaround is possible wherein an unexpired block may be evicted from the L2 provided the evicted block's lease is held someplace, for instance alongside the L2 miss status holding register (MSHR) [ref].

• It requires global timestamps. With modern GPUs being relatively large area-wise, maintaining globally synchronized timestamps could be hard. A recent proposal, however, has shown how a variant of temporal coherence can be implemented without using global timestamps [ref].

• Performance could be sensitive to the choice of the lease period. A lease period that is too short increases the L1 miss rate; a lease period that is too long causes the writes (or FENCEs) to stall more.

• Temporal coherence cannot directly take advantage of scoped synchronization. For instance, stores involved in CTA-scoped synchronization (intuitively) need not be written to the L2; but in temporal coherence every store is written to the L2, since the protocol is designed under the assumption of write-through/no-write-allocate L1 caches.

• Temporal coherence involves timestamps. Timestamps introduce complexities (e.g., timestamp rollover) into the design and verification process.

In summary, although workarounds are possible for most of the above limitations, they do tend to add complexity to an already unconventional timestamp-based coherence protocol.

RELEASE CONSISTENCY-DIRECTED COHERENCE

Temporal coherence, which is based on the powerful idea of leases, is versatile enough to enforce both consistency-agnostic and consistency-directed variants of coherence. On the other hand, protocols involving leases and timestamps are arguably cumbersome. In this section, we discuss an alternative approach to GPGPU coherence called release consistency-directed coherence (RCC) that directly enforces release consistency (RC).

RCC compromises on flexibility, in that it can only enforce variants of RC. But in return for this reduced flexibility, RCC is arguably simpler, can naturally exploit scope information, and can be made to work with non-inclusive L2 caches. In the following, we start by briefly recapping the RC memory model and then extend it with scopes. We then describe a simple RCC protocol followed by two optimizations. Each of the protocols can enforce both the scoped and non-scoped variants of RC.



Table: Message-passing example, non-scoped RC.

Thread T1                   Thread T2                       Comments
St1: St data1 = NEW         Ld1: Acq: Ld r1 = flag          /* Initially all variables 0 */
St2: Rel: St flag = SET     B1: if (r1 != SET) goto Ld1
                            Ld2: Ld r2 = data1              Can r2 == 0?

Release Consistency: Non-scoped and Scoped Variants

In this section, we discuss the RC memory model, starting with the non-scoped variant, and then extend it with scopes.

Recall that RC has special operations called atomic operations. Each such atomic operation carries an annotation (a release (Rel) annotation for a store and an acquire (Acq) annotation for a load) that orders memory accesses in one direction, as opposed to the bidirectional ordering enforced by a FENCE, as follows:

• Acq Load → Load/Store

• Load/Store → Rel Store

• Rel Store/Acq Load → Rel Store/Acq Load

Consider the message-passing example shown in the table above. Marking St2 as a release ensures the St1 → St2 ordering; marking Ld1 as an acquire ensures the Ld1 → Ld2 ordering. The fact that the acquire (Ld1) reads the value written by the release (St2), i.e., the acquire synchronizes with the release, implies that St2 → Ld1 in the global memory order. (Recall that every load returns the value of the latest store before it in the global memory order.) Therefore, combining all of the above orderings implies St1 → Ld2, thus ensuring that Ld2 sees the new value (and not 0).

In the variant of the memory model without scopes, an acquire synchronizes with a release as long as the acquire load returns the value written by the release store, irrespective of whether the two threads are from the same scope or different scopes. Thus, in the above example, Ld2 will see the new value irrespective of whether T1 and T2 belong to the same CTA or different CTAs.
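The non-scoped message-passing idiom can be written down concretely. The sketch below uses Python threads purely to illustrate the pattern; the thread and variable names are ours, and CPython happens to provide stronger ordering than RC requires, so the Rel/Acq annotations are implicit here (on a real RC machine they would be explicit):

```python
# Message-passing pattern: T1 publishes data1 and then releases flag;
# T2 acquires flag (spinning until it reads SET) and then reads data1.
import threading

data1, flag = 0, 0

def t1():
    global data1, flag
    data1 = 1        # St1: ordinary store
    flag = 1         # St2: the release store (Load/Store -> Rel Store)

def t2(result):
    while flag != 1: # Ld1: the acquire load, retried until it returns SET
        pass
    result.append(data1)  # Ld2: ordered after the acquire

res = []
a = threading.Thread(target=t1)
b = threading.Thread(target=t2, args=(res,))
b.start(); a.start()
a.join(); b.join()
assert res == [1]    # the acquire synchronized with the release
```

The spin loop is what makes Ld1 eventually read the value written by the release, which is the precondition for the synchronizes-with relation discussed above.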

A Scoped RC Model

In a scoped RC model, every atomic operation, including release stores and acquire loads, is associated with a scope. An acquire is said to synchronize with a release (if the acquire load returns the value written by the release store) only if the scope of each atomic operation includes the thread executing the other operation.

For instance, an acquire of CTA scope is said to synchronize with a release of CTA scope only if the two threads executing the acquire and release belong to the same CTA. On the other hand, an acquire of GPU scope is said to synchronize with



a release of GPU scope, irrespective of whether the two threads are from the same CTA or different CTAs.

Table: Message-passing example, scoped RC.

Thread T1                       Thread T2                       Comments
St1: St data1 = NEW             Ld1: CTA Acq: Ld r1 = flag      /* Initially all variables 0 */
St2: GPU Rel: St flag = SET     B1: if (r1 != SET) goto Ld1
                                Ld2: Ld r2 = data1              Can r2 == 0? (It could be 0 if T1 and T2 are from different CTAs.)

For more intuition, consider the scoped variant of the message-passing example shown in the table above. As we can see, the release St2 carries a GPU scope, whereas the acquire Ld1 carries only a CTA scope. If T1 and T2 are from different CTAs, then the scope of the acquire (Ld1) does not include T1. Therefore, in such a situation the acquire is not said to synchronize with the release, which means that r2 could in fact read the old value of 0. On the other hand, if T1 and T2 are from the same CTA, r2 cannot read a 0.
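The synchronizes-with rule above can be captured as a small predicate. This is a sketch under our own naming (none of the function or parameter names below come from the text):

```python
# Hypothetical predicate: a scoped acquire synchronizes with a scoped release
# only if (a) the acquire load read the value written by the release store and
# (b) each operation's scope includes the thread executing the other operation.

def scope_includes(scope, me, other, cta_of):
    # GPU scope includes every thread; CTA scope includes only same-CTA threads.
    return scope == "GPU" or (scope == "CTA" and cta_of[me] == cta_of[other])

def synchronizes_with(rel_thread, rel_scope, acq_thread, acq_scope,
                      value_read_from_release, cta_of):
    return (value_read_from_release
            and scope_includes(rel_scope, rel_thread, acq_thread, cta_of)
            and scope_includes(acq_scope, acq_thread, rel_thread, cta_of))

diff_cta = {"T1": 0, "T2": 1}
same_cta = {"T1": 0, "T2": 0}
# GPU-scoped release, CTA-scoped acquire (the scoped table's scenario):
assert not synchronizes_with("T1", "GPU", "T2", "CTA", True, diff_cta)
assert synchronizes_with("T1", "GPU", "T2", "CTA", True, same_cta)
```

The first assertion is exactly the case in which r2 may read the old value of 0: the CTA-scoped acquire's scope does not include T1, so no synchronization is established.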

One approach [ref] to formalizing scoped RC is to employ a variant of Shasha and Snir's formalism [ref] and use a partial order (rather than a global memory order) to order conflicting operations. More specifically, not all conflicting operations are ordered; only releases and acquires that synchronize with each other are ordered.

Release Consistency-directed Coherence (RCC)

In this section, we introduce a protocol that directly enforces RC instead of enforcing SWMR. Specifically, the protocol does not eagerly propagate writes to other threads, i.e., it does not enforce SWMR. Instead, writes are written to the L2 upon a release and become visible to another thread when that thread self-invalidates its L1 on an acquire and pulls in new values from the L2.

For the following discussion, let us assume write-back/write-allocate L1 caches. Also, for now let us ignore scopes and assume that synchronization is between threads from two different CTAs. We will describe later how RCC can handle intra-CTA synchronization efficiently. The main steps involved in RCC are as follows:

• Loads and stores that are not marked acquire or release behave like normal loads and stores in a write-back/write-allocate L1 cache.

• Upon a store marked release, all dirty blocks in the L1, including the data written by the release, are written to the L2, ensuring Load/Store → Rel Store.

• Upon a load marked acquire, all valid blocks in the L1 are self-invalidated before the acquire load reads from the L2, ensuring Acq Load → Load/Store.
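The steps above can be sketched as a toy model. This is a minimal illustration with invented class names, modeling the shared L2 as a plain dictionary, ignoring scopes, and (for brevity) not modeling the fetch-on-write-allocate step:

```python
# Hypothetical sketch of RCC's L1 actions: a release writes back ALL dirty
# blocks to the L2; an acquire self-invalidates ALL valid blocks first.

class RCCL1:
    def __init__(self, l2):
        self.l2 = l2                   # shared L2, modeled as {addr: value}
        self.lines = {}                # addr -> {"val": ..., "dirty": bool}

    def load(self, addr):
        if addr not in self.lines:     # miss: fetch from the L2
            self.lines[addr] = {"val": self.l2.get(addr, 0), "dirty": False}
        return self.lines[addr]["val"]

    def store(self, addr, val):        # ordinary store: write-allocate, stays dirty
        self.lines[addr] = {"val": val, "dirty": True}

    def release_store(self, addr, val):
        self.store(addr, val)
        for a, line in self.lines.items():   # write back every dirty block,
            if line["dirty"]:                # including the released one
                self.l2[a] = line["val"]
                line["dirty"] = False

    def acquire_load(self, addr):
        self.lines.clear()             # self-invalidate all valid blocks
        return self.load(addr)         # then pull the fresh value from the L2

l2 = {}
sm1, sm2 = RCCL1(l2), RCCL1(l2)
sm2.load("data1")                      # SM2 caches the stale value 0
sm1.store("data1", "NEW")              # St1
sm1.release_store("flag", "SET")       # St2: pushes data1 and flag to the L2
assert sm2.acquire_load("flag") == "SET"
assert sm2.load("data1") == "NEW"      # self-invalidation exposed the new value
```

Without the `lines.clear()` on the acquire, SM2 would keep returning its stale cached copy of data1, which is exactly the SWMR violation RCC is allowed to (and does) permit between synchronization points.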



State  Events and actions
I:     Load: send GetS to L2; receive Data from L2; go to V.
       Store: send GetS to L2; receive Data from L2; write; go to V.
       Rel(scope): send GetS to L2; receive Data from L2; write; go to V. If scope == GPU: for all dirty blocks, send Write-back and receive Acks.
       Acq(scope): if scope == GPU: invalidate valid blocks. Send GetS to L2; receive Data from L2; go to V.
V:     Load: hit. Store: write hit.
       Rel(scope): write; if scope == GPU: for all dirty blocks, send Write-back and receive Acks.
       Acq(scope): if scope == GPU: invalidate valid blocks; send GetS to L2; receive Data from L2; stay in V. Else if scope == CTA: read hit.
       L1-Eviction: go to I (silent).

Table: RCC, L1 controller. (In the non-scoped version, scope is set to GPU.)

State  Events and actions
I:     GetS: fetch Data; receive Data from Mem; reply with Data; go to V.
       Write-back: allocate space; write Data; reply Ack; go to V.
V:     GetS: reply with Data. Write-back: write Data; reply Ack.
       L2-Eviction: write back to lower level; go to I.

Table: RCC, L2 controller.

Example: Consider the message-passing example (non-scoped RC), assuming that T1 and T2 are from different CTAs and hence are mapped to two different SMs, SM1 and SM2. Initially, let us assume the cache blocks containing data1 and flag are cached in the L1s of both SM1 and SM2. Upon hitting St2 (marked release), the block containing flag is written to; then, all of the dirty blocks in the L1 of SM1, including data1 and flag, are written to the L2. Upon hitting Ld1 (marked acquire), all of the valid blocks in the L1 of SM2, including data1 and flag, are self-invalidated. Then, the flag variable is read from the L2, returning SET. When Ld2 is performed, the request for data1 misses in the L1, and the up-to-date value of NEW is read from the L2.

Exploiting Scopes: If the release store and acquire load are of GPU scope, the protocol described above is applicable as is. This is because the protocol already pushes all of the dirty data to the L2 (i.e., the cache level corresponding to GPU scope) on a release, and pulls data from the L2 upon an acquire. RCC can also take advantage of CTA scopes: a CTA-scoped release need not write back dirty blocks to the L2; likewise, a CTA-scoped acquire need not self-invalidate any of the valid blocks in the L1.



Protocol Specification: We present the detailed specifications of the L1 and L2 controllers in the tables above. There are two stable states: (I)nvalid and (V)alid. For presentational reasons (and since the protocols we present are blocking in nature), we do not show the transient states. For example, a Load in state I causes a GetS to be sent and enters a transient state that we do not show explicitly; upon receiving Data, the state transitions to V.

The L1 controller table shows the scoped version of the protocol. If the release or acquire is of GPU scope, the protocol must write through data (on a release) and self-invalidate valid data (on an acquire). If the release or acquire is of CTA scope, the protocol does not write through or self-invalidate. A non-scoped version of the protocol behaves as if all releases and acquires are of GPU scope.

Finally, it is worth noting that the protocol does not rely on L2 inclusivity. Indeed, as shown in the L1 controller table, a valid L1 block can be silently evicted without informing the L2. Intuitively, this is because the L2 does not hold any critical metadata such as sharer or ownership information.

Summary: In summary, RCC is a simple protocol that directly enforces RC. Because it does not hold any protocol metadata at the L2, it does not require the L2 to be inclusive and allows L1 blocks to be evicted silently. The protocol can take advantage of scope information; specifically, if the release and acquire are of CTA scope, it does not require expensive write-backs or self-invalidations. On the other hand, without knowledge of scopes, intra-CTA synchronization is inefficient: RCC has to assume conservatively that releases and acquires are of GPU scope and write back/self-invalidate data from the L1 even if the synchronization is within one CTA.

Exploiting Ownership: RCC-O

Releases (which cause all dirty lines to be written to the L2) and acquires (which cause all valid lines to be self-invalidated) are expensive operations in the simple protocol described above. One approach to reducing the cost of releases and acquires is to track ownership, so that exclusively owned blocks need not be self-invalidated on an acquire nor written back upon a release.

The key idea is to add an O(wned) state: every store must obtain exclusive ownership of the L1 cache line before a subsequent release. For each block, the L2 maintains the owner, which refers to the identity of the L1 that owns the block. Upon a request for ownership, the L2 first downgrades the previous owner (if there is one), causing the previous owner to write back dirty data to the L2. After sending the current data to the new owner, the L2 changes ownership. Because a block in state O implies that the L1 is the exclusive owner of the block, there is no need to self-invalidate an owned L1 block on an acquire (or to write back dirty blocks in state O from the L1 upon a release).
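The downgrade-on-GetO idea can be sketched as follows. This is a toy model with invented names, showing only the ownership bookkeeping at the L2:

```python
# Hypothetical sketch of RCC-O ownership tracking: a GetO first downgrades
# the previous owner (forcing a write-back of data that may be stale at the
# L2), then hands the up-to-date block to the requester and records the
# new owner.

class L1:
    def __init__(self):
        self.lines = {}                       # addr -> value (owned, dirty)
    def writeback(self, addr):                # downgrade: flush the block
        return self.lines.pop(addr)

class L2:
    def __init__(self):
        self.data, self.owner = {}, {}
    def get_o(self, addr, requester, l1s):    # handle a GetO request
        prev = self.owner.get(addr)
        if prev is not None:                  # Req-Write-back to previous owner
            self.data[addr] = l1s[prev].writeback(addr)
        self.owner[addr] = requester          # change ownership
        return self.data.get(addr, 0)         # send current data to new owner

l2, sm1, sm2 = L2(), L1(), L1()
l1s = {"SM1": sm1, "SM2": sm2}
l2.get_o("flag", "SM1", l1s)                  # SM1 obtains ownership...
sm1.lines["flag"] = "SET"                     # ...and writes locally
assert l2.data.get("flag") is None            # the L2 copy is stale meanwhile
assert l2.get_o("flag", "SM2", l1s) == "SET"  # SM2's GetO downgrades SM1
assert l2.owner["flag"] == "SM2"
```

The middle assertion makes the key point: once a line is owned, subsequent stores update only the L1, so the L2 copy is potentially stale until the next downgrade.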



State  Events and actions
I:     Load: send GetS to L2; receive Data from L2; go to V.
       Store: send GetO to L2; receive Data from L2; write; go to O.
       Rel(scope): send GetO to L2; receive Data from L2; write; go to O.
       Acq(scope): if scope == GPU: invalidate valid non-O blocks. Send GetS to L2; receive Data from L2; go to V.
V:     Load: hit.
       Store: send GetO to L2; receive Data from L2; write; go to O.
       Rel(scope): send GetO to L2; receive Data from L2; write; go to O.
       Acq(scope): if scope == GPU: invalidate valid non-O blocks; send GetS to L2; receive Data from L2; stay in V. Else if scope == CTA: read hit.
       L1-Eviction: send Write-back; receive Ack; go to I.
O:     Load: hit. Store: write hit. Rel(scope): write hit. Acq(scope): hit.
       L1-Eviction: send Write-back; receive Ack; go to I.
       From L2: Req-Write-back: send Data.

Table: RCC-O, L1 controller. (In the non-scoped version, scope = GPU.)

There is yet another benefit to ownership tracking: even in the absence of scope information, ownership tracking can help reduce the cost of intra-CTA synchronization (specifically, the cost of intra-CTA acquires). Consider the message-passing example (without scopes), assuming now that the threads T1 and T2 belong to the same CTA. In the absence of scope information, recall that RCC has to treat both releases and acquires conservatively, writing back all of the dirty data (on a release) and self-invalidating all valid blocks (on an acquire). With RCC-O, releases still have to be treated conservatively: all of the stores before the release still have to obtain ownership. But because the intra-CTA release store obtains ownership, upon hitting the acquire, if the block is still in the owned state, it implies that the release must have come from a thread within the same CTA. Consequently, the acquire can be treated like a CTA-scoped acquire, and the self-invalidation of valid blocks can be obviated.

State  Events and actions
I:     GetS: fetch Data; receive Data from Mem; reply with Data; go to V.
       GetO: fetch Data; receive Data from Mem; reply with Data; go to O.
V:     GetS: reply with Data. GetO: reply with Data; go to O.
       Write-back: write and reply Ack. L2-Eviction: go to I.
O:     GetS: send Req-Write-back to owner; receive Data from owner; reply with Data; go to V.
       GetO: send Req-Write-back to owner; receive Data from owner; change owner; reply with Data; stay in O.
       Write-back: write and reply Ack; go to V.
       L2-Eviction: send Req-Write-back to owner; receive Data from owner; write back to lower level; go to I.

Table: RCC-O, L2 controller.

Protocol Specification: We present the detailed specifications of the L1 and L2 controllers in the tables above. The main change is the addition of the O state. Every store in the V (or I) state has to contact the L2 controller and request ownership. Having obtained ownership, any subsequent store to that line can simply



update the line in the L1 without contacting the L2. Thus, a block in the O state is potentially stale at the L2. For this reason, when the L2 receives a request for ownership of a line already in the O state, the L2 first requests the previous owner to write back the block, sends the up-to-date data to the new requester, and then changes ownership.

Upon hitting a release, there is no need to write back dirty data from the L1 to the L2. Instead, the protocol merely requires that all previous stores have obtained ownership. Upon an acquire, only non-owned valid blocks need to be self-invalidated. If the acquire is to a block that is already owned, it implies that the synchronization is intra-CTA, and hence there is no need for self-invalidation.

Finally, because the protocol hinges on the L2 maintaining ownership information, an L2 block in the O state cannot be silently evicted; instead, the L2 must first downgrade the current owner, asking it to write back the block in question (if it is dirty).

Summary: By tracking ownership, RCC-O allows for fewer self-invalidations in comparison to RCC. In the absence of scopes, ownership tracking also enables the detection of intra-CTA acquires and obviates the need for self-invalidation in such situations. But intra-CTA releases have to be treated conservatively: each store to a previously unowned block must still obtain ownership before a subsequent release. Finally, RCC-O cannot allow blocks in state O to be evicted from the L2 silently; it has to first downgrade the current owner.

Lazy Release Consistency-directed Coherence: LRCC

In the absence of scope information, RCC treats both releases and acquires conservatively. In RCC-O, intra-CTA acquires can be detected and efficiently handled, but releases are still treated conservatively.

Is there a way for intra-CTA releases to also be handled efficiently? Lazy release consistency-directed coherence (LRCC) allows for this. In LRCC, writes before a release do not obtain ownership; instead, only the release obtains ownership. When the released block is later acquired, the state of the block (at the time of the acquire) determines the coherence actions.

If the block is owned by the L1 of the acquiring thread, it implies that the synchronization is intra-CTA; hence, there is no need to self-invalidate valid L1 blocks. If the block is instead owned by a remote L1, it implies that the synchronization is inter-CTA. Therefore, the dirty cache blocks in the remote L1 are first written back, and the valid blocks in the acquiring L1 are self-invalidated. Thus, by delaying coherence actions until an acquire, LRCC is able to handle intra-CTA synchronization efficiently.
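The acquire-time decision can be sketched as a toy model. The names below are ours; note that ordinary stores deliberately skip ownership while the release obtains it, and for brevity every cached line is treated as dirty when flushing:

```python
# Hypothetical sketch of LRCC: only the release store obtains ownership.
# At acquire time, a locally owned block means intra-CTA synchronization
# (do nothing); a remotely owned block means inter-CTA synchronization
# (flush the remote L1's dirty blocks, then self-invalidate locally).

class L1:
    def __init__(self):
        self.lines, self.owned = {}, set()
    def write(self, addr, val):               # ordinary store: no GetO
        self.lines[addr] = val
    def release(self, addr, val, me, l2):     # only the release gets ownership
        self.lines[addr] = val
        self.owned.add(addr)
        l2.owner[addr] = me
    def flush_dirty(self):                    # downgrade: flush everything
        flushed = dict(self.lines)
        self.lines.clear()
        self.owned.clear()
        return flushed

class L2:
    def __init__(self):
        self.data, self.owner = {}, {}

def lrcc_acquire(addr, me, l1s, l2):
    my = l1s[me]
    if addr in my.owned:                      # intra-CTA: no invalidations
        return my.lines[addr]
    owner = l2.owner.get(addr)
    if owner is not None and owner != me:     # inter-CTA: flush the remote L1
        l2.data.update(l1s[owner].flush_dirty())
    my.lines.clear()                          # self-invalidate valid blocks
    return l2.data.get(addr, 0)

l2 = L2()
l1s = {"SM1": L1(), "SM2": L1()}
l1s["SM1"].write("data1", "NEW")              # St1: stays in SM1's L1 only
l1s["SM1"].release("flag", "SET", "SM1", l2)  # Rel St2: GetO on flag only
assert lrcc_acquire("flag", "SM2", l1s, l2) == "SET"
assert l2.data["data1"] == "NEW"              # remote dirty data was flushed
```

Observe that all of SM1's write-backs happen in the critical path of SM2's acquire, which is exactly the cost of laziness discussed later for inter-device synchronization.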

Example: Consider the message-passing example shown earlier, with code from a non-scoped RC model, assuming that the threads T1 and T2 belong to different CTAs. Upon hitting the release store, ownership of the block containing flag is obtained. (The store to data1 before the release simply updates the value in the L1 and does not obtain ownership.) Upon hitting the acquire, since the block containing flag is not owned, the valid blocks in the L1 are first self-invalidated; then the up-to-date value of flag is requested from the L2. At the L2, the block containing flag is owned by the L1 of SM1, so the L2 requests that all of the dirty blocks in that L1, including data1, be written back. All of this ensures that the read of data1 from T2 will miss in the L1 and will read the up-to-date value of NEW from the L2.

Protocol Specification: We present the detailed specifications of the L1 and L2 controllers in the tables below. (As in the previous protocols, in the non-scoped version of the protocol, scope is set to GPU.) In contrast to RCC-O, only release stores obtain ownership (ordinary stores do not). Note that when an owned block is evicted from the L1, all of the dirty blocks are written back to the L2 as well.

As in RCC-O, an acquire load with the block in state O at the L1 implies intra-CTA synchronization, and hence there are no self-invalidations. An acquire whose block is in any other state implies inter-CTA synchronization, and so valid L1 blocks are self-invalidated. A GetS request is sent to the L2 to obtain the up-to-date value of the block being acquired. If the block is owned by a different L1, the L2 sends a request to that L1 asking for all of the dirty blocks in that L1 to be written back. If the block is not owned by any L1, all of the data written before the release is guaranteed to be in the L2 (since an L1 eviction of an owned block also writes back dirty data).

Finally, as in RCC-O, LRCC hinges on the L2 maintaining ownership information. So an L2 block in the O state cannot be evicted silently; instead, the L2 must downgrade the current owner, asking it to write back not only the block in question but also any other dirty data in that L1.

Summary: By delaying coherence actions until the acquire, LRCC allows intra-CTA synchronization to be detected and handled efficiently even in the absence of scopes. LRCC cannot allow blocks in state O to be evicted from the L2 silently, but the impact of this is limited to synchronization objects (because only release stores obtain ownership).

Release Consistency-directed Coherence: Summary

In this section, we discussed three variants of release consistency-directed coherence that can be used to enforce both scoped and non-scoped flavors of release consistency. While all of the variants can exploit knowledge of scopes, LRCC can handle intra-CTA synchronization efficiently even in the absence of scopes.



State  Events and actions
I:     Load: send GetS to L2; receive Data from L2; go to V.
       Store: send GetS to L2; receive Data from L2; write; go to V.
       Rel(scope): send GetO to L2; receive Data from L2; write; go to O.
       Acq(scope): if scope == GPU: invalidate valid non-O blocks. Send GetS to L2; receive Data from L2; go to V.
V:     Load: hit. Store: write hit.
       Rel(scope): send GetO to L2; receive Data from L2; write; go to O.
       Acq(scope): if scope == GPU: invalidate valid non-O blocks; send GetS to L2; receive Data from L2; stay in V.
       L1-Eviction: if dirty, send Write-back and receive Ack; go to I.
O:     Load: hit. Store: write hit. Rel(scope): write hit. Acq(scope): hit.
       L1-Eviction: for all dirty blocks, send Write-back and receive Acks; go to I.
       From L2: Req-Write-back: for all dirty blocks, send Data; go to V.

Table: LRCC, L1 controller. (In the non-scoped version, scope is set to GPU.)

State  Events and actions
I:     GetS: fetch Data; receive Data from Mem; reply with Data; go to V.
       GetO: fetch Data; receive Data from Mem; reply with Data; go to O.
       Write-back: allocate space; write and reply Ack; go to V.
V:     GetS: reply with Data. GetO: reply with Data; go to O.
       Write-back: write and reply Ack. L2-Eviction: go to I.
O:     GetS: send Req-Write-back to owner; receive Data from owner; reply with Data; go to V.
       GetO: send Req-Write-back to owner; receive Data from owner; update owner; reply with Data; stay in O.
       Write-back: write and reply Ack; go to V.
       L2-Eviction: send Req-Write-back to owner; receive Data from owner; write back to lower level; go to I.

Table: LRCC, L2 controller.

Given that LRCC can handle synchronization efficiently even in the absence of scopes, can we get rid of scoped memory consistency models? Not quite, in our opinion. Whereas laziness helps avoid wasteful write-backs (in the case of intra-CTA synchronization), it makes inter-CTA acquires slower. This is because laziness forces all of the write-backs from the L1 of the releasing thread to be performed in the critical path of the acquire. (This effect becomes more pronounced in the case of synchronization between two devices, e.g., between a CPU and a GPU.)

Thus, whether or not GPU and heterogeneous memory models must involve scopes is a complex programmability vs. performance tradeoff.

MORE HETEROGENEITY THAN JUST GPUS

In this section, we consider the problem of how to expose a global shared memory interface across multiple multi-core devices consisting of CPUs, GPUs, and other accelerators.

What makes the problem challenging is that each of the devices might guarantee a distinct consistency model enforced via a distinct coherence protocol. For instance, a CPU (or a CPU-like accelerator) may choose to enforce a relatively strong consistency model such as TSO, using a consistency-agnostic protocol that enforces



Figure: Semantics of a heterogeneous architecture in which some cores support SC while the others support TSO.

SWMR. On the other hand, a GPU may choose to enforce scoped RC by employing a consistency-directed lazy release consistency protocol.

When two or more devices with distinct consistency models are integrated, what is the resulting consistency model of the heterogeneous device? How can it be programmed? How should the coherence protocols of the two devices be integrated? Most of these questions are being actively researched by academia and industry. In the following, we attempt to understand these questions better and outline the design space.

HETEROGENEOUS CONSISTENCY MODELS

Let us first understand the semantics of composing two distinct consistency models. Suppose two component multi-core devices A and B are integrated such that they share memory: what is the resulting consistency model?

For example, suppose multi-core A enforces SC and multi-core B satisfies TSO: how does the heterogeneous multi-core made up of A and B behave? The figure above shows the operational model of the heterogeneous multi-core processor. Intuitively, operations coming out of (a thread from) A should satisfy the memory ordering rules of A (which in our example is SC), and operations coming out of B should satisfy the memory ordering rules of B (TSO).

Consider the message-passing example shown in the table below: the heterogeneous machine should not allow r2 to read the old value of 0. This is because SC's ordering guarantees mandate St1 → St2, while TSO's ordering guarantees mandate Ld1 → Ld2, thus implying St1 → Ld2. On the other hand, considering the Dekker's example further below, it


Table 10.4: Message passing example. Combined SC + TSO memory model.

Thread T1 (SC)        Thread T2 (TSO)                 Comments
St1: St data = NEW;   Ld1: Ld r1 = flag;              // Initially all variables 0
St2: St flag = SET;   B1: if (r1 != SET) goto Ld1;
                      Ld2: Ld r2 = data;              // Can r2 == 0? (No!)

Table 10.5: Dekker's example. Combined SC + TSO memory model.

Thread T1 (SC)        Thread T2 (TSO)        Comments
St1: St flag1 = 1;    St2: St flag2 = 1;     // Initially all variables 0
Ld1: Ld r1 = flag2;   Ld2: Ld r2 = flag1;    // Can both r1 and r2 read 0? Yes!

is possible for both Ld1 and Ld2 to read 0, since TSO does not enforce the St2→Ld2 ordering.

Thus, a heterogeneous shared memory architecture that is formed by composing component multi-cores with distinct consistency models results in a compound consistency model, such that memory operations originating from each component satisfy the memory ordering rules of that component. While this notion seems intuitive, to our knowledge it has not been formalized yet.
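The per-component ordering rules can be made concrete with a small sketch. The preserved-program-order table and names below are our own illustrative encoding (not a formal model from this chapter):

```python
# Which program-order pairs each consistency model preserves (illustrative).
PPO = {
    "SC":  {("St", "St"), ("St", "Ld"), ("Ld", "St"), ("Ld", "Ld")},
    "TSO": {("St", "St"), ("Ld", "St"), ("Ld", "Ld")},   # St->Ld may reorder
}

def preserved(model, first, second):
    """True if `model` keeps `first` before `second` in program order."""
    return (first, second) in PPO[model]

# Message passing (SC producer, TSO consumer): both required orderings hold,
# so r2 cannot read 0.
assert preserved("SC", "St", "St")     # St1 -> St2 (producer)
assert preserved("TSO", "Ld", "Ld")    # Ld1 -> Ld2 (consumer)
# Dekker: the TSO thread's St -> Ld is not preserved, so both loads can read 0.
assert not preserved("TSO", "St", "Ld")
```

Under the compound model, each operation is simply checked against the PPO of the component it originates from.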

Programming with Heterogeneous Consistency Models

How does one program such a heterogeneous shared memory architecture where not all devices support the same consistency model? As we saw above, one approach is to simply target the compound consistency model. However, as one can imagine, programming with compound memory models can quickly get tricky, especially if disparate memory consistency models are involved.

A more promising approach is to program with languages (such as HSA or OpenCL) with formally specified (scoped) synchronization primitives. Recall that C++ provides a language model based on Sequential Consistency for data-race-free programs (SC for DRF), which we saw in Chapter 5. HSA and OpenCL extend the SC for DRF paradigm with scopes, dubbed Sequential Consistency for Heterogeneous Race Free (SC for HRF) [12].

HRF is conceptually similar to the scoped RC model we introduced in the previous section, with a couple of differences. First, HRF is at the language level, whereas the scoped RC model is at the (virtual) ISA level. Second, whereas HRF does not provide any semantics for racy programs, the scoped RC model provides semantics for such codes. (Thus the relationship between HRF and scoped RC is akin to the relationship between DRF-based language models and RC.) HRF has two variants: HRF-direct and HRF-indirect. In HRF-direct, two threads can synchronize only if


the synchronization operations (releases and acquires) have the exact same scope. HRF-indirect, on the other hand, allows for transitive synchronization using different scopes. For example, in HRF-indirect, if thread T1 synchronizes with T2 using scope S1 and subsequently T2 synchronizes with T3 using scope S2, T1 is said to transitively synchronize with T3; HRF-direct does not allow this, however. It is worth noting that the scoped RC model, similarly to HRF-indirect, allows for transitive synchronization.
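The difference between the two variants can be sketched as a simple predicate over a chain of synchronization hops. The encoding below is our own illustration, not a formal statement of HRF:

```python
# Illustrative model of HRF-direct vs HRF-indirect: do pairwise synchronization
# hops, possibly using different scopes, compose transitively?
def chain_synchronizes(scopes_along_chain, variant):
    """scopes_along_chain: the scope used at each pairwise hop, e.g.
    T1->T2 via S1 then T2->T3 via S2 is ["S1", "S2"]."""
    if variant == "HRF-direct":
        # Hops compose only when every hop uses the exact same scope.
        return len(set(scopes_along_chain)) == 1
    if variant == "HRF-indirect":
        # Transitive synchronization is allowed across different scopes.
        return True
    raise ValueError(variant)

# T1 syncs with T2 using S1; T2 then syncs with T3 using S2:
assert chain_synchronizes(["S1", "S2"], "HRF-indirect")    # T1 syncs with T3
assert not chain_synchronizes(["S1", "S2"], "HRF-direct")  # not in HRF-direct
assert chain_synchronizes(["S1", "S1"], "HRF-direct")      # same scope: OK
```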

The language-level scoped synchronization primitives are mapped to hardware instructions; each device with a unique hardware consistency model has separate mappings. Lustig et al. [17] provide a framework dubbed ArMOR for algorithmically deriving correct mappings, based on their precise way of specifying different memory consistency models.

A third approach toward programming with heterogeneity is to program with one of the component hardware memory consistency models, while instrumenting the code running in each of the other components with fences and other instructions to ensure compatibility with the chosen memory model. Again, the ArMOR framework can be used to translate between memory consistency models in this fashion.
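As a rough illustration of such instrumentation (the op encoding and fence-placement rule below are our own simplification, not ArMOR itself), a fence is inserted between consecutive memory operations whose ordering the chosen model preserves but the weaker device does not:

```python
# Orderings TSO preserves; assume the weak device preserves none of them.
TSO_PRESERVED = {("St", "St"), ("Ld", "Ld"), ("Ld", "St")}
WEAK_PRESERVED = set()

def instrument(ops):
    """Insert FENCEs so a weakly ordered device obeys TSO-like ordering."""
    out = [ops[0]]
    for prev, cur in zip(ops, ops[1:]):
        if (prev, cur) in TSO_PRESERVED and (prev, cur) not in WEAK_PRESERVED:
            out.append("FENCE")          # restore the ordering the weak model drops
        out.append(cur)
    return out

# St->St must be fenced; St->Ld may still reorder (TSO allows it anyway).
assert instrument(["St", "St", "Ld"]) == ["St", "FENCE", "St", "Ld"]
```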

10.3.2 HETEROGENEOUS COHERENCE PROTOCOLS

Consider two multi-core devices A and B, with each device adhering to a distinct consistency model enforced via a distinct coherence protocol. As we can see in Figure 10.5, each of the devices has a local coherence protocol for keeping its local L1s coherent. How do we integrate the two devices into one heterogeneous shared memory machine? In particular, how do we stitch the two coherence protocols together correctly? Correctness hinges on whether the heterogeneous machine satisfies the compound consistency model (memory operations from each device must satisfy the memory ordering rules of that device's consistency model).

We will first discuss hierarchical coherence, wherein the two intra-device coherence protocols are stitched together via a higher-level inter-device coherence protocol. We then discuss how coarse-grained coherence tracking can help in mitigating the bandwidth demands of hierarchical coherence when used for multi-chip CPU-GPU systems. Finally, we will conclude with a simple approach to CPU-GPU coherence wherein blocks cached by the CPU are not cached in the GPU.

A Hierarchical Approach to Heterogeneous Coherence

Recall that in Section 9.1.6 we introduced hierarchical coherence for integrating two disparate coherence protocols. The same idea can be extended to support heterogeneous coherence too.


Figure 10.5: Integrating two or more devices into one heterogeneous machine via hierarchical coherence involves composing the two local coherence protocols together via a global coherence protocol.

More specifically, hierarchical coherence works as follows. A local coherence controller, upon receiving a coherence request, attempts to fulfill the request locally within the device. Requests that cannot be completely fulfilled locally (e.g., a GetM request with sharers present in the other device) are forwarded to the global coherence controller, which in turn forwards the request to the other device's local coherence controller. After serving the forwarded request, that local controller responds, and the response in turn is forwarded back to the requestor by the global coherence controller.

To realize all of this:

• The global controller must be designed with an interface that is rich enough to handle coherence requests initiated by the devices' local controllers.

• Shims: Each of the local controllers must be extended with a shim that not only serves as a translator between the local and global coherence controller interfaces, but also chooses the appropriate global coherence requests to make and interprets global coherence responses appropriately.

In order to understand the global coherence interface and the shims better, let us consider the following scenarios.

Scenario 1: consistency-agnostic + consistency-agnostic.

Suppose we want to connect a multi-core CPU with a CPU-like accelerator. Let us assume that both devices employ consistency-agnostic, SWMR-enforcing coherence protocols (since such protocols are well-suited for a typical CPU workload with a


high degree of sharing and locality). Note, however, that the actual protocols employed by the two devices are different: the CPU uses a directory protocol, whereas the accelerator uses snooping on a bus. Further, let us assume that the two protocols are stitched together via a global directory protocol with a coarse-grained sharer list, maintaining whether or not a cache block is present in one or both of the devices.

Now, suppose a CPU core wants to write to a block that is globally shared across the CPU and the accelerator. The CPU core sends a GetM request to the local directory, which must not only invalidate sharers within the CPU, but also send a GetM request to the global directory for invalidating sharers in the accelerator. Upon receiving the GetM, the global directory forwards an Invalidation request to the local snooping controller of the accelerator. The local snooping controller interprets the Inv request as a GetM (because the GetM request is the one that invalidates sharers in a snooping protocol) and issues the GetM on its local bus, invalidating any sharers present in the accelerator; it then responds to the global directory, which in turn responds to the CPU's local directory, completing the write.

Thus, for stitching together consistency-agnostic protocols, the global coherence controller must have an interface that can handle requests and responses similar to Table 6.4 from Chapter 6. Further, the shims must issue requests to the global controller as per its interface, and interpret forwarded requests and responses from the global controller in accordance with the local controller's interface (e.g., a shim must interpret an Invalidation request from the directory as a GetM for a bus-based protocol).
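The Scenario 1 flow can be sketched as a toy model; the class and message names below are our own glue, with only GetM/Inv taken from the text:

```python
# A global directory with a coarse-grained sharer list, plus a shim that
# renders a forwarded Invalidation as the bus-side message that actually
# invalidates sharers in a snooping protocol (a GetM).
class GlobalDirectory:
    def __init__(self):
        self.sharers = {}                 # block -> set of device names
        self.sent = []                    # log of forwarded messages
    def getm(self, requester, block):
        for dev in sorted(self.sharers.get(block, set()) - {requester}):
            self.sent.append((dev, "Inv", block))   # forward Inv to other device
        self.sharers[block] = {requester}

def snooping_shim(msg):
    """The accelerator's bus has no Inv message; an invalidation is a bus GetM."""
    return {"Inv": "GetM"}[msg]

gdir = GlobalDirectory()
gdir.sharers["B"] = {"CPU", "ACC"}
gdir.getm("CPU", "B")                     # CPU write to a globally shared block
assert gdir.sent == [("ACC", "Inv", "B")]
assert snooping_shim("Inv") == "GetM"     # shim reissues it on the local bus
```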

Scenario 2: consistency-directed + consistency-directed.

Suppose we want to connect a GPU with another GPU-like accelerator via a shared LLC, which serves as the global coherence controller. By GPU-like, we mean that the accelerator runs workloads having characteristics similar to a typical GPU workload (e.g., throughput-limited, low temporal locality). Let us assume that both the GPU and the accelerator enforce (non-scoped) release consistency using variants of release consistency-directed coherence protocols: the GPU employs the LRCC protocol, whereas the accelerator employs the RCC protocol.

Consider the message passing example shown earlier in this chapter, where T1 is from the GPU and T2 is from the accelerator. Further, let us assume that St1 and St2 have already been performed and both cache lines (containing flag and data) are in Owned state in one of the GPU's L1s. When T2 performs Ld1 in the accelerator, it first invalidates the local L1 and sends a GetS request for flag to the local L2 controller. Since the L2 does not have the block, its shim must forward the request to the global LLC controller. The global LLC controller finds that the block is Owned by the GPU, and so sends a write-back request (for all dirty items) to the GPU's local L2 controller (which forwards it to the L1 that owns the block). After all of the dirty blocks (including flag and data) are written back to the LLC (which is the point of coherence), the global


Table 10.6: Message passing example. SC + Non-scoped RC.

Thread T1 (CPU)       Thread T2 (GPU)                 Comments
St1: St data = NEW;   Ld1: Acq Ld r1 = flag;          // Initially all variables 0
St2: St flag = SET;   B1: if (r1 != SET) goto Ld1;
                      Ld2: Ld r2 = data;              // Can r2 == 0?

Table 10.7: Message passing example. Non-scoped RC + SC.

Thread T1 (GPU)         Thread T2 (CPU)                 Comments
St1: St data = NEW;     Ld1: Ld r1 = flag;              // Initially all variables 0
St2: Rel St flag = SET; B1: if (r1 != SET) goto Ld1;
                        Ld2: Ld r2 = data;              // Can r2 == 0?

LLC controller responds to the accelerator's L2 controller, which in turn responds to the requester, thus completing the request.

Thus, with regard to stitching together consistency-directed protocols, we can infer the following. First, the global LLC controller (or the memory controller if there is no LLC) serves as the point of coherence. Second, the LRCC controller introduced earlier in this chapter can serve as the global controller, because LRCC subsumes the different variants of release consistency-directed coherence protocols.
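The key property of Scenario 2, namely that the shared LLC is the point of coherence and a read of an Owned block first forces write-back of all dirty data, can be sketched as follows (class and field names are our own illustration):

```python
# Toy model: a GetS for a block Owned elsewhere first triggers write-back of
# ALL dirty data from the owning device, so flag and data arrive together.
class SharedLLC:
    def __init__(self):
        self.mem = {"data": 0, "flag": 0}
        self.dirty_in_owner = {}          # owning device -> {addr: value}
    def gets(self, addr):
        for dev in list(self.dirty_in_owner):
            self.mem.update(self.dirty_in_owner.pop(dev))  # write-back request
        return self.mem[addr]

llc = SharedLLC()
# St1 and St2 already performed; both lines Owned dirty in a GPU L1:
llc.dirty_in_owner["GPU"] = {"data": "NEW", "flag": "SET"}
assert llc.gets("flag") == "SET"          # the accelerator's Ld1
assert llc.mem["data"] == "NEW"           # so its later Ld2 cannot read stale 0
```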

Scenario 3: consistency-agnostic + consistency-directed.

Suppose we want to connect a CPU that enforces SC (using a consistency-agnostic MSI directory protocol) with a GPU that enforces non-scoped RC (using a consistency-directed LRCC protocol) via a global directory that is embedded in a globally shared LLC (Figure 10.5). How should the two coherence protocols be stitched together?

Consider the message passing example shown in Table 10.6, where T1 is from the CPU and T2 is from the GPU. Initially, let us assume that both data and flag are cached in the L1s of the GPU as well as the CPU, with initial values of 0. When St1 is performed in the CPU, it sends a GetM request to the local directory controller, which in turn must forward the request to the global directory. The global directory must then upgrade the state of the block to Modified (i.e., owned by the CPU) and respond. In particular, the global directory need not forward the GetM to the GPU, despite the GPU caching the block. This is because the LRCC protocol on the GPU does not require sharers to be invalidated upon a write; instead, sharers are self-invalidated upon an acquire. Accordingly, when acquire Ld1 is performed, it self-invalidates the L1 and sends a GetS to the local L2 controller, which in turn forwards the request to the global directory. The global directory, which finds the block in Modified state in the CPU, forwards the request to the CPU local directory. The CPU directory


forwards the request to the L1, which responds with the data; the data is forwarded back to the global directory, then written back to the LLC, and finally forwarded to the requestor.

Now, let us consider a similar message passing example, but with the releasing thread in the GPU and the acquiring thread in the CPU (Table 10.7). When St2 (release) is performed in the GPU, it issues a GetO request to the local L2 controller, which in turn must translate it and forward a GetM to the global directory. Because there are sharers in the CPU, the global directory must forward an Invalidation request to the local directory controller of the CPU, which must respond after invalidating all sharers. When Ld1 is performed at the CPU, since flag would have been invalidated, a GetS request is sent to the local directory controller, which in turn must forward it to the global directory controller. The global directory controller, upon finding that the block is owned by the GPU, must forward the GetS request to the GPU L2 controller, which in turn forwards it to the L1. The GPU L1 controller, upon receiving the forwarded request, would write back dirty data (including data and flag) to the local L2 controller. The L2 controller must then forward the write-back requests to the directory/LLC. When the write-back for the block containing data reaches the global directory, it would find the block shared in the CPU, so the directory must forward an Invalidation request for data to the CPU local directory, which in turn invalidates the L1 cached copy of data. Thus, when Ld2 is performed in the CPU, it would miss in the L1 and get the correct value of NEW from the LLC.

From these two examples, we can infer the following. Any coherence request due to a GPU store (GetO and write-back requests) must invalidate CPU sharers of that block. Therefore, for such requests, the global directory must forward Invalidations to the CPU if there are any sharers. In contrast, any coherence request due to a CPU store (GetM) simply updates global directory state (marking the block as owned by the CPU) and need not invalidate GPU sharers (because the GPUs are responsible for self-invalidating their cache blocks upon an acquire). In other words, the global directory must disambiguate CPU sharers from GPU sharers. One way to do this is to have two types of Read requests as part of the global coherence interface: (1) GetS, which requests a block for reading and asks the directory to invalidate the block upon a remote write (the directory tracks it via the sharer list); and (2) GetV [8], which requests a block for reading and is responsible for self-invalidating the block (the directory does not track it as part of the sharer list).
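The GetS/GetV distinction can be sketched in a few lines; the class name and method signatures below are our own illustration, with only the request names taken from the text:

```python
# A toy global directory with two read types: GetS readers are tracked and
# invalidated on a remote write; GetV readers are not tracked, since they
# self-invalidate on an acquire.
class SharerDirectory:
    def __init__(self):
        self.tracked = set()              # devices that read with GetS
    def read(self, kind, dev):
        if kind == "GetS":
            self.tracked.add(dev)
        # "GetV": nothing to record; the reader self-invalidates later
    def store_request(self, writer):
        """Handle a GetM/GetO/write-back on behalf of `writer`."""
        invs = self.tracked - {writer}    # forward Inv only to tracked sharers
        self.tracked = set()
        return invs

d = SharerDirectory()
d.read("GetS", "CPU0")                    # CPU read: tracked
d.read("GetV", "GPU0")                    # GPU read: untracked
assert d.store_request("GPU1") == {"CPU0"}
```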

Summary.

In summary, hierarchical coherence is an elegant solution for integrating disparate coherence protocols. Hierarchical coherence requires: (1) a global coherence protocol with an interface that is rich enough to handle a range of coherence protocols,


and (2) extensions to the original coherence protocols (shims) for interfacing with the global coherence protocol.

Is there a universal coherence interface that can be used to stitch together any two coherence protocols? We believe the interface specified in Table 6.4 from Chapter 6 can handle any type of consistency-agnostic protocol efficiently. Handling consistency-directed protocols requires an additional read request, GetV, that reads data without needing the directory to track or invalidate the block. It is worth noting that a number of such coherence specifications have been proposed both in academia (Crossing Guard [18] and Spandex [8]) and industry (CCIX [1], OpenCAPI [3], and Gen-Z [2]).

Mitigating Bandwidth Demands of Heterogeneous Coherence

One problem with hierarchical coherence, especially for CPU-GPU multi-chip systems, is that the inter-chip coherent interconnect can become a bottleneck.

As the name suggests, in multi-chip systems the CPU and GPU are on separate chips, and hence the global directory has to be accessed by the GPU via an off-chip coherent interconnect. Because GPU workloads typically incur high cache miss ratios, the global directory needs to be accessed frequently, which explains why the coherent interconnect can become saturated.

One approach [19] to mitigating this bandwidth demand is to employ coarse-grained coherence tracking [9], wherein coherence state is tracked at a coarse granularity (e.g., at page-size granularity) locally in the GPU (as well as the CPU). When the GPU incurs a miss and the page corresponding to the location is known to be private to the GPU or read-only, there is no need to access the global directory. Instead, the block may be accessed directly from memory via a high-bandwidth bus.
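The routing decision can be sketched as follows (the page-table layout and names are assumed for illustration, not taken from the proposals themselves):

```python
# Coarse-grained coherence tracking on the GPU: misses to pages known to be
# GPU-private or read-only bypass the off-chip global directory.
PAGE_SIZE = 4096
page_state = {0: "private", 1: "shared", 2: "read-only"}

def route_gpu_miss(addr):
    state = page_state.get(addr // PAGE_SIZE, "shared")  # unknown: be safe
    if state in ("private", "read-only"):
        return "memory"                   # high-bandwidth direct path
    return "global-directory"             # must consult the directory

assert route_gpu_miss(0x100) == "memory"                       # page 0: private
assert route_gpu_miss(PAGE_SIZE + 0x10) == "global-directory"  # page 1: shared
assert route_gpu_miss(2 * PAGE_SIZE) == "memory"               # page 2: read-only
```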

A Low-Complexity Solution to Heterogeneous CPU-GPU Coherence

CPUs and GPUs are not always designed by the same vendor. Unfortunately, both of the above approaches to heterogeneous coherence require changes to the CPU as well as the GPU.

One simple approach to CPU-GPU coherence in multi-chip systems is selective GPU caching [5]. Any data that is mapped to CPU memory is not cached in the GPU. Furthermore, any data from GPU memory that is currently cached in the CPU is also not cached in the GPU. This simple policy trivially enforces coherence. To enforce the policy, the GPU maintains a coarse-grained remote directory that tracks the data currently cached by the CPU. Whenever a block from GPU memory is accessed by the CPU, the corresponding coarse-grained region is inserted into the remote directory. (If the GPU was caching the line, that line is flushed.) Any location present in the remote directory is not cached in the GPU.
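The policy amounts to one cacheability test plus one update rule, sketched below with our own names and an assumed region granularity:

```python
# Selective GPU caching: the GPU caches a block only if it lives in GPU memory
# AND its region is absent from the remote directory.
REGION_SIZE = 4096
remote_directory = set()                  # GPU-memory regions cached by the CPU
gpu_cache = set()

def gpu_may_cache(addr, in_gpu_memory):
    return in_gpu_memory and (addr // REGION_SIZE) not in remote_directory

def cpu_accesses_gpu_memory(addr):
    region = addr // REGION_SIZE
    remote_directory.add(region)          # region becomes uncacheable in the GPU
    gpu_cache.discard(region)             # flush any line the GPU was caching

assert not gpu_may_cache(0x10, in_gpu_memory=False)   # CPU memory: never cached
assert gpu_may_cache(0x10, in_gpu_memory=True)
cpu_accesses_gpu_memory(0x10)
assert not gpu_may_cache(0x10, in_gpu_memory=True)    # region now held by CPU
```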


Unfortunately, the above naive scheme can incur a significant penalty, because any location that is cached in the CPU must be retrieved from the CPU. To offset the cost, several optimizations have been proposed [5], including GPU request coalescing (multiple requests to the CPU are coalesced) and CPU-side caching (a special CPU-side cache for GPU remote requests).

10.4 FURTHER READING

Although most of the literature on CPU coherence uses the consistency-agnostic definition, there have been some classic works that have targeted coherence protocols to the consistency model.

Afek et al. [4] proposed lazy caching, which directly enforces SC without satisfying SWMR. Lebeck and Wood proposed dynamic self-invalidation [14], the first self-invalidation-based coherence protocol that directly targets SC as well as weaker models without satisfying SWMR. Kontothanassis et al. [13] proposed lazy release consistency for CPUs, the precursor to similar protocols proposed for the GPU.

The advent of multi-cores sparked renewed interest in consistency-directed coherence protocols. DeNovo [10] showed that targeting coherence toward DRF models can lead to simpler and more scalable coherence protocols. VIPS [21] proposed a directory-less approach for directly enforcing release consistency, relying on TLBs for tracking private and read-only data. TSO-CC [11] and Tardis [25] are consistency-directed coherence protocols that directly target TSO and SC, respectively.

A number of proposals for GPU coherence were adaptations of CPU coherence protocols. Temporal coherence [24] was the first to propose coherence for GPUs, by adapting library coherence [15], a timestamp-based protocol for CPU coherence. HRF [12] extended DRF to a heterogeneous context and showed how scoped consistency models can be enforced. Meanwhile, Sinclair et al. [23] adapted DeNovo for GPUs and showed that non-scoped memory models can perform almost as well as scoped consistency models. Alsop et al. [7] improved upon this by adapting lazy release consistency for GPUs. Finally, Ren and Lis [20] adapted Tardis for GPUs and showed that SC can be enforced efficiently on GPUs.

REFERENCES

[1] The CCIX Consortium. https://www.ccixconsortium.com/.

[2] The Gen-Z Consortium. https://genzconsortium.org/.

[3] The OpenCAPI Consortium. https://opencapi.org/.


[4] Y. Afek, G. M. Brown, and M. Merritt. Lazy caching. ACM Transactions on Programming Languages and Systems, 15(1):182–205, 1993.

[5] N. Agarwal, D. W. Nellans, E. Ebrahimi, T. F. Wenisch, J. Danskin, and S. W. Keckler. Selective GPU caches to eliminate CPU-GPU HW cache coherence. In HPCA, 2016.

[6] J. Alglave, M. Batty, A. F. Donaldson, G. Gopalakrishnan, J. Ketema, D. Poetzl, T. Sorensen, and J. Wickerson. GPU concurrency: Weak behaviours and programming assumptions. In ASPLOS, 2015.

[7] J. Alsop, M. S. Orr, B. M. Beckmann, and D. A. Wood. Lazy release consistency for GPUs. In MICRO, 2016.

[8] J. Alsop, M. D. Sinclair, and S. V. Adve. Spandex: A flexible interface for efficient heterogeneous coherence. In ISCA, 2018.

[9] J. F. Cantin, J. E. Smith, M. H. Lipasti, A. Moshovos, and B. Falsafi. Coarse-grain coherence tracking: RegionScout and region coherence arrays. IEEE Micro, 26(1), 2006.

[10] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter, and C. Chou. DeNovo: Rethinking the memory hierarchy for disciplined parallelism. In PACT, 2011.

[11] M. Elver and V. Nagarajan. TSO-CC: Consistency directed cache coherence for TSO. In HPCA, 2014.

[12] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Heterogeneous-race-free memory models. In ASPLOS, 2014.


[13] L. I. Kontothanassis, M. L. Scott, and R. Bianchini. Lazy release consistency for hardware-coherent multiprocessors. In Supercomputing, 1995.

[14] A. R. Lebeck and D. A. Wood. Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors. In ISCA, 1995.

[15] M. Lis, K. S. Shim, M. H. Cho, and S. Devadas. Memory coherence in the age of multicores. In ICCD, 2011.

[16] D. Lustig, S. Sahasrabuddhe, and O. Giroux. A formal analysis of the NVIDIA PTX memory consistency model. In ASPLOS, 2019.

[17] D. Lustig, C. Trippel, M. Pellauer, and M. Martonosi. ArMOR: Defending against memory consistency model mismatches in heterogeneous architectures. In ISCA, 2015.

[18] L. E. Olson, M. D. Hill, and D. A. Wood. Crossing Guard: Mediating host-accelerator coherence interactions. In ASPLOS, 2017.

[19] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Heterogeneous system coherence for integrated CPU-GPU systems. In MICRO, 2013.

[20] X. Ren and M. Lis. Efficient sequential consistency in GPUs via relativistic cache coherence. In HPCA, 2017.

[21] A. Ros and S. Kaxiras. Complexity-effective multicore coherence. In PACT, 2012.


[22] D. E. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Transactions on Programming Languages and Systems, 10(2):282–312, 1988.

[23] M. D. Sinclair, J. Alsop, and S. V. Adve. Efficient GPU synchronization without scopes: Saying no to complex consistency models. In MICRO, 2015.

[24] I. Singh, A. Shriraman, W. W. L. Fung, M. O'Connor, and T. M. Aamodt. Cache coherence for GPU architectures. In HPCA, 2013.

[25] X. Yu and S. Devadas. Tardis: Time traveling coherence algorithm for distributed shared memory. In PACT, 2015.