This paper is included in the Proceedings of the 30th USENIX Security Symposium. August 11–13, 2021 978-1-939133-24-3 Open access to the Proceedings of the 30th USENIX Security Symposium is sponsored by USENIX. Rage Against the Machine Clear: A Systematic Analysis of Machine Clears and Their Implications for Transient Execution Attacks Hany Ragab, Enrico Barberis, Herbert Bos, and Cristiano Giuffrida, Vrije Universiteit Amsterdam https://www.usenix.org/conference/usenixsecurity21/presentation/ragab



Rage Against the Machine Clear: A Systematic Analysis of Machine Clears and Their Implications for Transient Execution Attacks

Hany Ragab* (hany.ragab@vu.nl), Enrico Barberis* (e.barberis@vu.nl), Herbert Bos (herbertb@cs.vu.nl), Cristiano Giuffrida (giuffrida@cs.vu.nl)

Vrije Universiteit Amsterdam, Amsterdam, The Netherlands

*Equal contribution, joint first authors

Abstract

Since the discovery of the Spectre and Meltdown vulnerabilities, transient execution attacks have increasingly gained momentum. However, while the community has investigated several variants to trigger attacks during transient execution, much less attention has been devoted to the analysis of the root causes of transient execution itself. Most attack variants simply build on well-known root causes, such as branch misprediction and aborts of Intel TSX (which is no longer supported on many recent processors).

In this paper, we tackle the problem from a new perspective, closely examining the different root causes of transient execution rather than focusing on new attacks based on known transient windows. Our analysis specifically focuses on the class of transient execution based on machine clears (MC), reverse engineering previously unexplored root causes such as Floating Point MC, Self-Modifying Code MC, Memory Ordering MC, and Memory Disambiguation MC. We show these events not only originate new transient execution windows that widen the horizon for known attacks, but also yield entirely new attack primitives to inject transient values (Floating Point Value Injection or FPVI) and execute stale code (Speculative Code Store Bypass or SCSB). We present an end-to-end FPVI exploit on the latest Mozilla SpiderMonkey JavaScript engine with all the mitigations enabled, disclosing arbitrary memory in the browser through attacker-controlled and transiently-injected floating-point results. We also propose mitigations for both attack primitives and evaluate their performance impact. Finally, as a by-product of our analysis, we present a new root cause-based classification of all known transient execution paths.

1 Introduction

Since the public disclosure of the Meltdown and Spectre vulnerabilities in 2018, researchers have investigated ways to use transient execution windows for crafting several new attack variants [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Building mostly on well-known causes of transient execution, such as branch mispredictions or aborting Intel TSX transactions, such variants violate many security boundaries, allowing attackers to obtain access to unauthorized data, divert control flow, or inject values in transiently executed code. Nonetheless, little effort has been made to systematically investigate the root causes of what Intel refers to as bad speculation [36]: the conditions that cause the CPU to discard already issued micro-operations (µOps) and render them transient. As a result, our understanding of these root causes, as well as their security implications, is still limited. In this paper, we systematically examine such causes, with a focus on a major, largely unexplored class of bad speculation.

In particular, Intel [36] identifies two general causes of bad speculation (and transient execution): branch misprediction and machine clears. Upon detecting branch mispredictions, the CPU needs to squash all the µOps executed in the mispredicted branches. Such occurrences, in a wide variety of forms and manifestations, have been extensively examined in the existing literature on transient execution attacks [10, 11, 31, 40, 41, 43, 48]. The same cannot be said for machine clears, a class of bad speculation that relies on a full flush of the processor pipeline to restart from the last retired instruction. In this paper, we therefore focus on the systematic exploration of machine clears, their behavior, and security implications.

Specifically, by reverse engineering this undocumented class of bad speculation, we closely examine four previously unexplored sources of machine clears (and thus transient execution), related to floating points, self-modifying code, memory ordering, and memory disambiguation. We show attackers can exploit such causes to originate new transient execution windows with unique characteristics. For instance, attackers can exploit Floating Point Machine Clear events to embed attacks in transient execution windows that require no training or special (in many cases disabled) features such as Intel TSX. Besides providing a general framework to run existing and future attacks in a variety of transient execution windows with different constraints and leakage rates, our analysis also uncovered new attack primitives based on machine clears.


In particular, we show that Self-Modifying Code Machine Clear events allow attackers to transiently execute stale code, while Floating Point Machine Clear events allow them to inject transient values in victim code. We term these primitives Speculative Code Store Bypass (SCSB) and Floating Point Value Injection (FPVI), respectively. The former is loosely related to Speculative Store Bypass (SSB) [55, 82], but allows attackers to transiently reference stale code rather than data. The latter is loosely related to Load Value Injection (LVI) [71], but allows attackers to inject transient values by controlling operands of floating-point operations rather than triggering victim page faults or microcode assists. We also discuss possible gadgets for exploitation in applications such as JIT engines. For instance, we found that developers following instructions detailed in the Intel optimization manual [36] ad litteram could easily introduce SCSB gadgets in their applications. In addition, we show that attackers controlling JIT'ed code can easily inject FPVI gadgets and craft arbitrary memory reads, even from JavaScript in a modern browser using NaN-boxing such as Mozilla Firefox [53].

Moreover, since existing mitigations cannot hinder our primitives, we implement new mitigations to eliminate the uncovered attack surface. As shown in our evaluation, SCSB can be efficiently mitigated with serializing instructions in the practical cases of interest, such as JavaScript engines. FPVI can be mitigated using transient execution barriers between the source of injection and its transmit gadgets, either inserted by the programmer or by the compiler. We implemented the general compiler-based mitigation in LLVM [45], measuring an overhead of 32% / 53% on SPECfp 2006/2017 (geomean).

While not limited to Intel, our analysis does build on a variety of performance counters available in different generations of Intel CPUs, but not on other architectures. Nevertheless, we show that our insights into the root causes of bad speculation also generalize to other architectures, by successfully repeating our transient execution leakage experiments (which we originally designed for Intel) on AMD.

Finally, armed with a deeper understanding of the root causes of transient execution, we also present a new classification for the resulting paths. Existing classifications [10] are entirely attack-centric, focusing on classes of Spectre- or Meltdown-like attacks and blending together (common) causes of transient execution with their uses (i.e., what attackers can do with them). Such classifications cannot easily accommodate the new transient execution paths presented in this paper. For this reason, we propose a new, orthogonal, root cause-centric classification, much closer to the sources of bad speculation identified by the chip vendors themselves.

Summarizing, we make the following contributions:

• We systematically explore the root causes of transient execution and closely examine the major, largely unexplored machine clear-based class. To this end, we present the reverse engineering and security analysis of causes such as Floating Point Machine Clear (FP MC), Self-Modifying Code Machine Clear (SMC MC), Memory Ordering Machine Clear (MO MC), and Memory Disambiguation Machine Clear (MD MC).

• We present two novel machine clear-based transient execution attack primitives (FPVI and SCSB) and an end-to-end FPVI exploit disclosing arbitrary memory in Firefox.

• We propose and evaluate possible mitigations.

• We propose a new root cause-based classification of all the known transient execution paths.

Code, exploit demo, and additional information are available at https://www.vusec.net/projects/fpvi-scsb.

2 Background

2.1 IEEE-754 Denormal Numbers

Modern processors implement a Floating Point Unit (FPU). To represent floating point numbers, the IEEE-754 standard [1] distinguishes three components (Fig. 1). First, the most significant bit serves as the sign bit $s$. The next $w$ bits contain a biased exponent $e$. For instance, for double precision (64-bit) floating point numbers, $e$ is 11 bits long and the bias is $2^{11-1} - 1 = 1023$ (i.e., $2^{w-1} - 1$). The value $e$ is computed by adding the real exponent $e_{real}$ to the bias, which ensures that $e$ is a positive number even if the real exponent is negative. The remaining bits are used for the mantissa $m$. The combination of these components represents the value. As an example, suppose the sign bit $s = 1$, the biased exponent $e = 10000000011_2 = 1027_{10}$, so that $e_{real} = 1027 - 1023 = 4$, and the mantissa $m = 0111000\ldots0_2$. In that case, the corresponding decimal value is $-1 \cdot 2^{4} \cdot (1 + 0.0111_2)$. We refer to such numbers as normal numbers, as they are in the normalized form with an implicit leading 1 present in the mantissa.

Figure 1: Fields of a 64-bit IEEE-754 double.

If $e = 0$ and $m \neq 0$, the CPU treats the value as a denormal value. In this case, the leading 1 becomes a leading 0 instead, while the exponent becomes the minimal value of $2 - 2^{w-1}$, so that with $s = 1$, $e = 0$, and $m = 0111000\ldots0_2$, the resulting value for a double precision number is $-1 \cdot 2^{-1022} \cdot (0 + 0.0111_2)$. This additional representation is intended to allow a gradual underflow when values get closer to 0. In 64-bit floating point numbers, the smallest unbiased exponent is -1022, making it impossible to represent a number with an exponent of -1023 or lower. In contrast, a denormal number can represent this number by appending enough leading zeroes to the mantissa until the minimal exponent is obtained. The denormal representation trades precision for the ability to represent a larger set of numbers. Generating such a representation is called denormalization or gradual underflow.
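The two worked examples above can be checked with a few lines of Python. This is only a minimal sketch: the helper names `decode_double` and `encode_double` are hypothetical, and it assumes the host represents Python floats as IEEE-754 doubles (true on mainstream platforms).

```python
import struct

BIAS = 1023        # 2**(11 - 1) - 1 for double precision
MANT_BITS = 52     # width of the mantissa field

def encode_double(sign, e, m):
    """Build a double from raw sign, biased-exponent, and mantissa fields."""
    bits = (sign << 63) | (e << MANT_BITS) | m
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def decode_double(value):
    """Split a double into its fields and classify it (hypothetical helper)."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    sign = bits >> 63
    e = (bits >> MANT_BITS) & 0x7FF
    m = bits & ((1 << MANT_BITS) - 1)
    if e == 0 and m != 0:
        kind, e_real = "denormal", 1 - BIAS   # exponent pinned at -1022
    elif e == 0:
        kind, e_real = "zero", None
    elif e == 0x7FF:
        kind, e_real = "inf/nan", None
    else:
        kind, e_real = "normal", e - BIAS
    return {"sign": sign, "e": e, "m": m, "kind": kind, "e_real": e_real}

# Mantissa 0111000...0 from the text: the top four mantissa bits are 0111.
M_0111 = 0b0111 << (MANT_BITS - 4)

normal = encode_double(1, 1027, M_0111)    # -1 * 2^4 * (1 + 0.0111b)
denormal = encode_double(1, 0, M_0111)     # -1 * 2^-1022 * (0 + 0.0111b)

print(normal, decode_double(normal))
print(denormal, decode_double(denormal))
```

Since $0.0111_2 = 0.4375$, the normal example evaluates to $-16 \cdot 1.4375 = -23$, while the denormal one drops the implicit leading 1 and keeps the exponent at -1022.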

2.2 x86 Cache Coherence

On multicore processors, L1 and L2 caches are usually per core, while the L3 is shared among all cores. Much complexity arises due to the cache coherence problem, when the same memory location is cached by multiple cores. To ensure correctness, the memory must behave as a single coherent source of data, even if data is spread across a cache hierarchy. Informally, a memory system is coherent if any read of a data item returns the most recently written value [30]. To obtain coherence, memory operations that affect shared data must propagate to other cores' private caches. This propagation is the responsibility of the cache coherence protocol, such as MESIF on Intel processors [37] and MOESI on AMD [3]. Cache controllers implementing these protocols snoop and exchange coherence requests on the shared bus. For example, when a core writes to a shared memory location, it signals all other cores to invalidate their now stale local copy.
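The write-invalidate behavior in the last sentence can be illustrated with a toy model. This is a deliberately simplified sketch, not MESIF or MOESI: the class `ToyCoherentSystem` is hypothetical, it tracks no per-line states, and it assumes write-through to a single shared memory.

```python
class ToyCoherentSystem:
    """Write-invalidate toy: one shared memory, one private cache per core."""

    def __init__(self, n_cores):
        self.memory = {}
        self.caches = [dict() for _ in range(n_cores)]
        self.invalidations = 0

    def read(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:                 # miss: fetch from shared memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, core, addr, value):
        # Snoop: every other core holding this address drops its stale copy.
        for other, cache in enumerate(self.caches):
            if other != core and addr in cache:
                del cache[addr]
                self.invalidations += 1
        self.caches[core][addr] = value
        self.memory[addr] = value             # write-through, for simplicity


system = ToyCoherentSystem(n_cores=2)
system.read(0, 0x40)          # both cores cache the same location
system.read(1, 0x40)
system.write(0, 0x40, 7)      # core 0 writes, core 1's copy is invalidated
print(system.read(1, 0x40), system.invalidations)
```

After the write, core 1 misses in its private cache and re-fetches the new value, which is the property the informal coherence definition asks for.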

The cache coherence policy also maintains the illusion of a unified memory model for backward compatibility. The original Intel 8086, released in 1978, operated in real-address mode to implement what is essentially a pure Von Neumann architecture with no separation between data and code. Modern processors have more sophisticated memory architectures with separate L1 caches for code (L1i) and data (L1d). The need for backward compatibility with the simple 8086 memory model has led modern CPUs to a split-cache Harvard architecture, whereby the cache coherence protocol ensures that the (L1) data and instruction caches are always coherent.

2.3 Memory Ordering

A Memory Consistency Model, or Memory Ordering Model, is a contract between the microarchitecture and the programmer regarding the permissible order of memory operations in parallel programs. Memory consistency is often confused with memory coherence, but where coherence hides the complex memory hierarchy in multicore systems, consistency deals with the order of read/write operations.

Consider the program in Figure 2. In the simplest consistency model, 'Sequential Consistency' (SC), all cores see all operations in the same order as specified by the program. In other words, A0 always executes before A1, and B0 always before B1. Thus, a valid memory order would be A0-B0-A1-B1, while A1-B1-A0-B0 would be invalid.

Figure 2: Memory Ordering Example.

Intel and AMD CPUs implement Total-Store-Order (TSO), which is equivalent to SC apart from one case: a store followed by a load on a different address may be reordered. This allows cores to use a private store buffer to hide the latency of store operations. In the example of Figure 2, the store operations A0 and B0 write their values initially only in their private store buffers. The subsequent loads A1 and B1 will now read the stale value 0 until the stores are globally visible. Thus, the order A1-B1-A0-B0 (r1=0, r2=0) is also valid.

2.4 Memory Disambiguation

Loads must normally be executed only after all the preceding stores to the same memory locations. However, modern processors rely on speculative optimizations based on memory disambiguation to allow loads to be executed before the addresses of all preceding stores are computed. In particular, if a load is predicted not to alias a preceding store, the CPU hoists the load and speculatively executes it before the preceding store address is known. Otherwise, the load is stalled until the preceding store is completed. In case of a no-alias (or hoist) misprediction, the load reads a stale value and the CPU needs to re-issue it after flushing the pipeline [36, 37].
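The hoisting decision and its failure mode can be sketched as follows. This is a toy model only: the function `run_store_then_load` is hypothetical, the prediction is passed in as a flag rather than produced by a predictor, and "flushing the pipeline" is reduced to reporting that the transient value differs from the architectural one.

```python
def run_store_then_load(memory, store_addr, store_val, load_addr, predict_no_alias):
    """Model one store followed by one load.

    Returns (transient_value, architectural_value, mispredicted), where
    transient_value is None if the load was stalled instead of hoisted.
    """
    transient = None
    if predict_no_alias:
        # Hoisted load: runs before the store address is known, so it can
        # only observe the old memory contents.
        transient = memory[load_addr]
    memory[store_addr] = store_val          # the store completes
    architectural = memory[load_addr]       # value the load must finally return
    mispredicted = predict_no_alias and store_addr == load_addr
    return transient, architectural, mispredicted


# Aliasing store + "no-alias" prediction: stale value first, then re-issue.
print(run_store_then_load({0x100: "stale"}, 0x100, "new", 0x100, True))
# Non-aliasing store + "no-alias" prediction: the hoist was safe.
print(run_store_then_load({0x100: "a", 0x200: "b"}, 0x100, "new", 0x200, True))
# "Alias" prediction: the load is stalled, no transient value exists.
print(run_store_then_load({0x100: "stale"}, 0x100, "new", 0x100, False))
```

Only the first case exposes a stale value to dependent operations before the re-issue, which is the situation described in the last sentence of the paragraph above.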

3 Threat Model

We consider unprivileged attackers who aim to disclose confidential information, such as private keys, passwords, or randomized pointers. We assume an x86-64 based victim machine running the latest microcode and operating system version, with all state-of-the-art mitigations against transient execution attacks enabled. We also consider a victim system with no exploitable vulnerabilities apart from the ones described hereafter. Finally, we assume attackers can run (only) unprivileged code on the victim (e.g., in JavaScript, user processes, or VMs), but seek to leak data across security boundaries.

4 Machine Clears

The Intel Architectures Optimization Reference Manual [36] refers to the root cause of discarding issued µOps as Bad Speculation. Bad speculation consists of two subcategories:

• Branch Mispredict. A misprediction of the direction or target of a branch by the branch predictor will squash all µOps executed within a mispeculated branch.

• Machine Clear (or Nuke). A machine clear condition will flush the entire processor pipeline and restart the execution from the last retired instruction.


Table 1: Machine clear performance counters

Name                             Description
MACHINE_CLEARS.COUNT             Number of machine clears of any type
MACHINE_CLEARS.SMC               Number of machine clears caused by a self/cross-modifying code
MACHINE_CLEARS.DISAMBIGUATION    Number of machine clears caused by a memory disambiguation unit misprediction
MACHINE_CLEARS.MEMORY_ORDERING   Number of machine clears caused by a memory ordering principle violation
MACHINE_CLEARS.FP_ASSIST         Number of machine clears caused by an assisted floating point operation
MACHINE_CLEARS.PAGE_FAULT        Number of machine clears caused by a page fault
MACHINE_CLEARS.MASKMOV           Number of machine clears caused by an AVX maskmov on an illegal address with a mask set to 0

Bad speculation not only causes performance degradation but also security concerns [6–8, 31, 41, 43, 47, 55, 59, 63, 71, 74–76, 79, 82]. In contrast to branch misprediction, extensively studied by security researchers [10, 11, 31, 40, 41, 43, 48], machine clears have undergone little scrutiny. In this paper, we perform the first deep analysis of machine clears and the corresponding root causes of transient execution.

Since machine clears are hardly documented, we examined all the performance counters for every Intel architecture and found a number of relevant counters (Table 1). Some counters are present only in specific architectures. For example, the page fault counter is available only on Goldmont Plus. However, thanks to the generic counter MACHINE_CLEARS.COUNT, it is always possible to count the overall number of machine clears, regardless of the architecture. In the remainder of this work, we will reverse engineer and analyze the causes of machine clears by means of these counters.

As a general observation, we note that the Floating Point Assist and Page Fault counters immediately suggest that machine clears are also related to microcode assists and faults/exceptions. In particular, further analysis shows that:

• Microcode assists trigger machine clears. The hardware occasionally needs to resort to microcode to handle complex events. Doing so requires flushing all the pending instructions with a machine clear before handling a microcode assist. Indeed, in our experiments where the OTHER_ASSISTS.ANY counter increased, we also observed a matching increase in MACHINE_CLEARS.COUNT.

• Machine clears do not necessarily trigger microcode assists. Not all machine clears are microcode assisted, as some machine clear causes are handled directly in silicon. Indeed, in our experiments, we observed that SMC, MD, and MO machine clears cause an increase of MACHINE_CLEARS.COUNT but leave OTHER_ASSISTS.ANY unaltered.

• An exception triggers a machine clear. When a fault or exception is detected, the subsequent µOps must be flushed from the pipeline with a machine clear, as the execution should resume at the exception handler. Indeed, in our experiments, we observed an increase of MACHINE_CLEARS.COUNT at each faulty instruction.

In this paper, we focus on the machine clear causes mentioned by the Intel documentation (Table 1), acknowledging that undocumented causes may still exist (much like undocumented x86 instructions [16]). We now first examine the four most relevant causes of machine clears (Self-Modifying Code MC, Floating Point MC, Memory Ordering MC, and Memory Disambiguation MC), then briefly discuss the other cases.

5 Self-Modifying Code Machine Clear

In a Von Neumann architecture, stores may write instructions as data and modify program code as it is being executed, as long as the code pages are writable. This is commonly referred to as Self-Modifying Code (SMC).

Self-modifying code is problematic for the Instruction Fetch Unit (IFU), which maintains high execution throughput by aggressively prefetching the instructions it expects to execute next and feeding them to the decode units. The CPU speculatively fetches and decodes the instructions and feeds them to the execution units, well ahead of retirement.

In case of a misprediction, the CPU flushes the speculatively processed instructions and resumes execution at the correct target. The IFU's aggressive prefetching ensures that the first-level instruction cache (L1i) is constantly filled with instructions which are either currently in (possibly speculative) execution or about to be executed. As a result, a store instruction targeting code cached in L1i requires drastic measures, as the associated cache lines should now be invalidated. Moreover, the target instructions do not even need to be part of the actual execution: since a prefetch is sufficient to bring them into L1i, any write to prefetched instructions also invalidates the prefetch queue. In other words, the problem occurs when the code is already in L1i, or the store is sufficiently close to the target code to ensure the target is prefetched in L1i. This behavior leads to a temporary desynchronization between the code and data views of the CPU, transiently breaking the architectural memory model (where L1d/L1i coherence ensures consistent code/data views).

To reverse engineer this behavior, we use the analysis code exemplified by Listing 1. The store at line 15 overwrites code already cached in L1i (lines 18-21), triggering a machine clear. The machine clear needs to update the L1i cache (and related microarchitectural structures) by flushing any stale instruction(s), resuming execution at the last retired instruction, and then fetching the new instructions. To test for the presence of a transient execution path, our analysis code immediately executes lines 18-21 targeted by the store and jumps to a spec_code gadget (lines 29-32), which fills a number of cache lines in a (flushed) buffer. Architecturally, this gadget should never be executed, as the store instruction should nop out the branch at line 18. However, microarchitecturally, we do observe multiple cache hits in the (reload) buffer using FLUSH + RELOAD, which confirms the existence of a pre-SMC-handling transient execution path executing stale code and leaving observable traces in the cache. We observe that the scheduling of the store instruction heavily affects the length of the transient path. Indeed, we use different instructions (lines 5-8) to delay as much as possible the store retirement and thus the SMC detection. This suggests that the root cause of the observed transient window might be the microarchitectural de-synchronization between the store buffer (new code) and the instruction queue (stale code), yielding transiently incoherent code/data views. We sampled machine clear performance counters to confirm the transient execution window is caused by the SMC machine clear and not by other events (e.g., memory disambiguation misprediction). Finally, we repeated our experiments on uncached code memory and could not observe any transient path. The counters revealed one machine clear triggered for each executed instruction, since the CPU has to pessimistically assume every fetched instruction has potentially been overwritten. Additionally, we have verified that SMC detection is performed on physical addresses rather than virtual ones.

Cross-Modifying Code

Instead of modifying its own instructions, a thread running on one core may also write data into the currently executing code segment of a thread running on a different physical core. Such Cross-Modifying Code (XMC) may be synchronous (the executor thread waits for the writer thread to complete before executing the new code) or asynchronous, with no particular coordination between threads. To reverse engineer the behavior, we distributed our analysis code across cores and reproduced a signal on the reload buffer in both cases. This confirms a Cross-Modifying Code Machine Clear (XMC machine clear) behaves similarly to a SMC machine clear across cores, with a store on the writer core originating a transient execution window on the executor core.

Listing 1: Self-Modifying Code Machine Clear analysis code

     1  smc_snippet:
     2    push r11
     3    lea r11, [target]        ; Load addr of target instr (line 17)
     4
     5    clflush [r11]            ; These instructions serve as a delay
     6    %rep 10                  ; for the store argument address. They
     7    imul r11, 1              ; ensure that the execution window of
     8    %endrep                  ; spec_code is as long as possible
     9
    10    ; Code to write as data: 8 nops (overwriting lines 18-21)
    11    mov rax, 0x9090909090909090
    12
    13    ; Store at target addr. Also the last retired instr
    14    ; from which the execution will resume after the SMC MC
    15    mov QWORD [r11], rax
    16
    17  target:                    ; Target instruction to be modified
    18    jmp spec_code
    19    nop
    20    nop
    21    nop
    22
    23    ; Architectural exit point of the function
    24    pop r11
    25    ret
    26
    27  ; Code executed speculatively (flushed after SMC MC)
    28  spec_code:
    29    mov rax, [rdi+0x0]       ; rdi: covert channel reload buffer
    30    mov rax, [rdi+0x400]
    31    mov rax, [rdi+0x800]
    32    mov rax, [rdi+0xc00]

6 Floating Point Machine Clear

On Intel, when the Floating Point Unit (FPU) is unable to produce results in the IEEE-754 [1] format directly, for instance in the case of denormal operands or results [4, 17], the CPU requires special handling to produce a correct result. According to an Intel patent [62], the denormalization is indeed implemented as a microcode assist or an exception handler, since the corresponding hardware would be too complex. In our experiments, we observed microcode assists on all x87, SSE, and AVX instructions that perform mathematical operations on denormal floating-point (FP) numbers. Increments of the FP_ASSIST.ANY or MACHINE_CLEARS.COUNT (or, on older processors, MACHINE_CLEARS.FP_ASSIST) performance counters confirm such assists cause machine clears.

Since a machine clear implies a pipeline flush, the assisted FP operation will be squashed together with subsequent µOps. To reverse engineer the behavior, we used analysis code exemplified by Algorithm 1. Our code relies on a FLUSH + RELOAD [81] covert channel to observe the result of a floating-point operation at the byte granularity. In our experiments, we observed two different hits in the reload buffer for each byte, for the transient and architectural result, respectively. The double-hit microarchitectural trace confirms that the transient (and wrong) value generated by the FPU is used in subsequent µOps, as also exemplified in Table 2. Later, the CPU detects the error and triggers a machine clear to flush the wrongly executed path. The microcode assist then corrects the result, while subsequent instructions are reissued.
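To make the double hit concrete, the following sketch computes which reload-buffer offsets Algorithm 1 would touch for the operands of Table 2. It is an offline calculation, not a side-channel measurement: the transient bit pattern is copied from Table 2 rather than observed, the architectural result is computed by the host FPU (assumed to run without flush-to-zero), and the byte numbering (1 = least significant) is an assumption.

```python
import struct

X_BITS = 0x0010DEADBEEF1337        # x from Table 2 (smallest normal exponent)
Y = 65536.0                        # y from Table 2 (2^16)
Z_TRAN_BITS = 0x3F10DEADBEEF1337   # transient result as reported in Table 2
STRIDE = 1024                      # reload buffer stride used by Algorithm 1

def bits_to_double(bits):
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def double_to_bits(value):
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def byte(z_bits, i):
    """i-th byte of z; i is 1-based from the least significant byte (assumed)."""
    return (z_bits >> (8 * (i - 1))) & 0xFF

# Architectural result: the correctly denormalized quotient.
z_arch_bits = double_to_bits(bits_to_double(X_BITS) / Y)
print(f"z_arch = {z_arch_bits:#018x}, z_tran = {Z_TRAN_BITS:#018x}")

# Offsets touched in reload_buf for each byte position.
for i in range(1, 9):
    tran_off = byte(Z_TRAN_BITS, i) * STRIDE
    arch_off = byte(z_arch_bits, i) * STRIDE
    print(f"byte {i}: transient offset {tran_off:#07x}, architectural offset {arch_off:#07x}")
```

For every byte position where the two patterns differ, two distinct cache lines of the reload buffer end up cached: one from the transient execution and one from the re-issued, architecturally correct execution.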

While we could not find any documentation on floating-point assist handling (even in the patents), our experiments revealed the following important properties. First, we verified that many FP operations can trigger FP assists (i.e., add, sub, mul, div, and sqrt) across different extensions (i.e., x87, SSE, and AVX). Second, the transient result is computed by "blindly" executing the operation as if both operands and result are normal numbers (see Table 2). Third, the detection of the wrong computation occurs later in time, creating a transient execution window. Finally, by performing multiple floating-point operations together with the assisted one, we were able to expand the size of the window, suggesting that detection is delayed if the FPU is busy handling multiple operations.

Algorithm 1: Floating Point Machine Clear analysis (pseudo)code. byte is used to extract the i-th byte of z.

    for i ← 1, 8 do
        flush(reload_buf)
        z = x / y                           ▷ Any denormal FP operation
        reload_buf[byte(z, i) * 1024]
        reload(reload_buf)
    end for

Table 2: Architectural (z_arch) and transient (z_tran) results of dividing x and y of Algorithm 1. N: Normal, D: Denormal representations. The mantissa is the low 52 bits of each representation. A normal division by 2^16 leaves the mantissa untouched and subtracts 16 from the exponent, which yields z_tran, where the exponent overflowed from -1022 to -14.

            Representation        Value           Type   Exp
    x       0x0010deadbeef1337    2.34e-308       N      -1022
    y       0x40f0000000000000    65536 (2^16)    N      16
    z_arch  0x00000010deadbeef    3.57e-313       D      -1022
    z_tran  0x3f10deadbeef1337    6.43e-05        N      -14

Figure 3: Transient execution due to invalid memory ordering.

7 Memory Ordering Machine Clear

The CPU initiates a memory ordering (MO) machine clear when, upon receiving a snoop request, it is uncertain if memory ordering will be preserved [36]. Consider the program in Figure 3. Processor A loads X and Y, while processor B performs two stores to the same locations. If the load of X is slow due to a cache miss, the out-of-order CPU will issue the next load (and subsequent operations) ahead of schedule. Suppose that, while the load of X is pending, processor B signals through a snoop request that the values of X and Y have changed. In this scenario, the memory ordering is A1-B0-B1-A0, which is not allowed according to the Total-Store-Order memory model. As two loads cannot be reordered, r1=1, r2=0 is an illegal result. Thus, processor A has no choice other than to flush its pipeline and re-issue the load of Y in the correct order. This MO machine clear is needed for every inconsistent speculation on the memory ordering, implementing the speculation behavior originally proposed by Gharachorloo et al. [27], with the advantage that strict memory order principles can co-exist with aggressive out-of-order scheduling.

Algorithm 2: Pseudo-code triggering a MO machine clear

    Processor A:
        clflush(X)          ▷ Make the load slow
        unlock(lock)        ▷ Synchronize loads and stores
        r1 ← [X]
        r2 ← [Y]
        reload(r2)          ▷ For FLUSH + RELOAD

    Processor B:
        wait(lock)          ▷ Synchronize loads and stores
        1 → [Y]

Notice that, in the previous example, the store on X is not even necessary to cause a memory ordering violation, since A1-B1-A0 is still an invalid order. Counterintuitively, the memory ordering violation disappears if the load on X is not performed, as A1-B0-B1 is a perfectly valid order.
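The validity of the orders named above can be checked mechanically. The sketch below assumes the labelling implied by the text (A0 and A1 are processor A's loads of X and Y, B0 and B1 are processor B's stores) and encodes only one rule: since TSO does not reorder two loads or two stores of the same processor, each processor's operations must appear in program order. The function name `keeps_program_order` is hypothetical.

```python
def keeps_program_order(order):
    """Check that each processor's operations appear in program order.

    `order` is a string such as "A1-B0-B1-A0"; operations that do not
    appear in it are simply treated as not performed.
    """
    ops = order.split("-")
    for proc in ("A", "B"):
        indices = [int(op[1:]) for op in ops if op[0] == proc]
        if indices != sorted(indices):
            return False
    return True


for candidate in ("A0-B0-A1-B1", "A1-B0-B1-A0", "A1-B1-A0", "A1-B0-B1"):
    verdict = "valid" if keeps_program_order(candidate) else "invalid"
    print(f"{candidate}: {verdict}")
```

Dropping the store on X keeps the inversion of A's two loads, so the order stays invalid, whereas dropping the load on X removes the only pair of operations that could be inverted.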

To reverse engineer the memory ordering handling behavior on Intel CPUs, we used analysis code exemplified by Algorithm 2. Our code mimics the scenario of Figure 3, with loads/stores from two threads racing against each other, but relies on a FLUSH + RELOAD [81] covert channel to observe the loaded value of Y. The synchronization through the lock variable ensures the desired (problematic) proximity and ordering of the memory operations.

As before, in our experiments, we observed two different hits in the reload buffer for the loaded value of Y: one for the stale (transient) value and one for the new (architectural) value. For every double hit in the reload buffer, we also measured an increase of MACHINE_CLEARS.MEMORY_ORDERING. We also ran experiments without the load of X. Even though memory ordering violations are no longer possible, since each processor executes a single memory operation and any order is permitted, we still observed MO machine clears. The reason is that Intel CPUs seem to resort to a simple but conservative approach to memory ordering violation detection, where the CPU initiates a MO machine clear when a snoop request from another processor matches the source of any load operation in the pipeline. We obtained the same results with the two threads running across physical or logical (hyperthreaded) cores. Similarly, the results are unchanged if the matching store and load are performed on different addresses; the only requirement we observed is that the memory operations need to refer to the same cache line. Overall, our results confirm the presence of a transient execution window and the ability of a thread to trigger a transient execution path in another thread by simply dirtying a cache line used in a ready-to-commit load. Exploitation-wise, abusing this type of MC is non-trivial, due to the strict synchronization requirements and the difficulty of controlling pending stale data.
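The conservative detection rule inferred above can be written down in a few lines. This is a sketch of the observed behavior, not of the actual hardware logic: the function name `snoop_triggers_mo_clear` is hypothetical, and the 64-byte cache line size is an assumption that holds for common x86 parts but is not stated in the text.

```python
CACHE_LINE = 64  # bytes; assumed line size

def line_of(addr):
    """Cache line index that an address falls into."""
    return addr // CACHE_LINE

def snoop_triggers_mo_clear(snoop_addr, inflight_load_addrs):
    """Conservative rule: clear if the snooped line matches any in-flight load."""
    snooped = line_of(snoop_addr)
    return any(line_of(addr) == snooped for addr in inflight_load_addrs)


# Different addresses, same cache line: still a machine clear.
print(snoop_triggers_mo_clear(0x1008, [0x1030]))
# Neighboring cache line: no match.
print(snoop_triggers_mo_clear(0x1040, [0x1030]))
```

Note that the rule never inspects other loads or the eventual memory order, which is consistent with the machine clears observed even when no ordering violation is possible.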

8 Memory Disambiguation Machine Clear

As suggested by the MACHINE_CLEARS.DISAMBIGUATION counter description, memory disambiguation (MD) mispredictions are handled via machine clears. Our experiments confirmed this behavior by observing matching increases of MACHINE_CLEARS.COUNT. Moreover, we observed no changes in microcode assist counters, suggesting mispredictions are resolved entirely in hardware. In case of a misprediction, a stale value is passed to subsequent loads, a primitive that was previously used to leak secret information with Spectre Variant 4 or Speculative Store Bypass (SSB) [55, 82].

Unlike the other machine clears, MD machine clears trigger only on address aliasing mispredictions, i.e., when the CPU wrongly predicts that a load does not alias a preceding store instruction and can be hoisted. Similar to branch prediction, generating an MD-based transient execution window requires mistraining the underlying predictor. The latter has complex, undocumented behavior, which has been partially reverse engineered [18]. Due to space constraints, we discuss our full reverse engineering strategy in Appendix A. Our results show that executing the same load 64 + 15 times with non-aliasing stores is sufficient to ensure the next prediction is "no-alias" (and thus the next load is hoisted). Upon reaching the hoisting prediction, one can perform the load with an aliasing store to trigger the transient path exposed to the incorrect, stale value.
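The mistraining procedure amounts to a fixed schedule of store/load pairs. The sketch below only generates that schedule; it does not model the predictor itself, and the function name and the boolean encoding are assumptions of this example.

```python
from typing import List

TRAINING_RUNS = 64 + 15  # non-aliasing executions reported as sufficient in the text

def md_mistraining_schedule(training_runs: int = TRAINING_RUNS) -> List[bool]:
    """Return one flag per execution of the same load:
    False = the preceding store does not alias the load, True = it does."""
    return [False] * training_runs + [True]

schedule = md_mistraining_schedule()
print(len(schedule), schedule[-1])  # 80 True
```

Only the final, aliasing iteration is expected to be mispredicted and hence to open the transient window.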

Finally, we experimentally verified that 4k aliasing [50] does not cause any machine clear, but only incurs a further time penalty in case of wrong aliasing predictions. More details on 4k aliasing results can be found in Appendix A.

9 Other Types of Machine Clear

AVX vmaskmov. The AVX vmaskmov instructions perform conditional packed load and store operations depending on a bitmask. For example, a vmaskmovpd load may read 4 packed doubles from memory depending on a 4-bit mask: each double will be read only if the corresponding mask bit is set; the others will be assigned the value 0.

According to the Intel Optimization Reference Manual [36], the instruction does not generate an exception in the face of invalid addresses, provided they are masked out. However, our experiments confirm that it does incur a machine clear (and a microcode assist) when accessing an invalid address (e.g., with the present bit set to 0) with a loading mask set to zero (i.e., no bytes should be loaded), to check whether the bytes in the invalid address have the corresponding mask bits set or not. We speculate the special handling is needed because the permission check is very complex, especially in the absence of memory alignment requirements.

In our experiments, vmaskmov instructions with all-zero masks and invalid addresses increment the OTHER_ASSIST.ANY and MACHINE_CLEARS.COUNT counters, confirming that the instruction triggers a machine clear. However, the resulting transient execution window seems short-lived or absent, as we were unable to observe cache or other microarchitectural side effects of the execution.
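The architectural semantics of the masked load can be sketched as follows. This is a simplified model, assuming the mask is passed as a small integer with one bit per lane (the real instruction takes the mask from a vector register); it says nothing about the machine clear itself.

```python
from typing import List, Sequence

def masked_load(memory: Sequence[float], mask_bits: int, lanes: int = 4) -> List[float]:
    """Model a vmaskmovpd-style load: lane i is read from memory only if bit i
    of mask_bits is set; every other lane is zero and memory is not touched."""
    result = []
    for lane in range(lanes):
        if (mask_bits >> lane) & 1:
            result.append(memory[lane])
        else:
            result.append(0.0)
    return result

print(masked_load([1.5, 2.5, 3.5, 4.5], 0b0101))  # [1.5, 0.0, 3.5, 0.0]
print(masked_load([], 0b0000))                    # [0.0, 0.0, 0.0, 0.0]
```

The second call mirrors the experiment above: with an all-zero mask, no element of the (here empty, i.e., inaccessible) memory is architecturally read.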

Exceptions. The MACHINE_CLEARS.PAGE_FAULT counter present on older microarchitectures confirms page faults are another cause of machine clears. Indeed, we verified that each instruction incurring a page fault, or any other exception such as "Division by zero", increments the MACHINE_CLEARS.COUNT counter. We also verified that exceptions do not trigger microcode assists and that software interrupts (traps) do not trigger machine clears. Indeed, the Intel documentation [37] specifies that instructions following a trap may be fetched but not speculatively executed. Transient execution windows originating from exceptions, and page faults in particular, have been extensively used in prior work, with a faulty load instruction also used as the trigger to leak information [7, 47, 63, 75].

Hardware interrupts. Although hardware interrupts are an undocumented cause of machine clears, our experiments with APIC timer interrupts showed they do increment the MACHINE_CLEARS.COUNT counter. While this confirms hardware interrupts are another root cause of transient execution, the asynchronous nature of these events yields a less than ideal vector for transient execution attacks. Nonetheless, hardware interrupts play an important role in other classes of microarchitectural attacks [73].

Microcode assists. Microcode assists require a pipeline flush to insert the required µOps in the frontend and represent a subclass of machine clears (and thus a root cause of transient execution) for cases where a fast path in hardware is not available. In this paper, we detailed the behavior of floating-point and vmaskmov assists. Prior work has discussed different situations requiring microcode assists, such as those related to page table entry Access/Dirty bits, typically in the context of assisted loads used as the trigger to leak information [8, 63, 75]. AVX-to-SSE transitions [36] represent another microcode assist which, based on our experiments, we could not observe on modern Intel CPUs. The remaining known microcode assists, such as access control of memory pages belonging to SGX secure enclaves and Precise Event Based Sampling (PEBS), are not presented in this work, since they have already been studied [14] or are only related to privileged performance profiling, respectively.

[Figure 4 plots omitted. x-axis ("Transient Execution Management"): F+R, TSX, BHT, FAULT, SMC, XMC, FP, MD, MO; y-axes: Number of Transient Loads (top), Leakage Rate [Mb/s] (bottom); legend (CPU): Intel Core i7-10700K, Intel Xeon Silver 4214, Intel Core i9-9900K, Intel Core i7-7700K, AMD Ryzen 5 5600X, AMD Ryzen Threadripper 2990WX, AMD Ryzen 7 2700X.]

Figure 4: Top plot: transient window size vs. mechanism. Each bar reports the number of transient loads that complete and leave a microarchitectural trace. Bottom plot: leakage rate vs. mechanism for a simple Spectre Bounds-Check-Bypass attack and a 1-bit FLUSH + RELOAD (F+R) cache covert channel. F+R is the leakage rate upper bound (covert channel loop only, no actual transient window or attack).

10 Transient Execution Capabilities

Transient execution attacks rely on crafting a transient window to issue instructions that are never retired. For this purpose, state-of-the-art attacks traditionally rely on mechanisms based on root causes such as branch mispredictions (BHT) [41, 43], faulty loads (Fault) [8, 47, 63, 75], or memory transaction aborts (TSX) [8, 47, 59, 63, 75]. However, the different machine clears discussed in this paper provide an attacker with the exact same capabilities.

To compare the capabilities of machine clear-based transient windows with those of more traditional mechanisms, we implemented a framework able to run arbitrary attacker-controlled code in a window generated by a mechanism of choice. We now evaluate our framework on recent processors (with all the microcode updates and mitigations enabled) to compare the transient window size and leakage rate of the different mechanisms.

10.1 Transient Window Size

The transient window size provides an indication of the number of operations an attacker can issue on a transient path before the results are squashed. Larger windows can, in principle, host more complex attacks. Using a classic FLUSH + RELOAD cache covert channel (F+R) as a reference, we measure the window size by counting how many transient loads can complete and hit entries in a designated F+R buffer. Figure 4 (top) presents our results.

As shown in the figure, the window size varies greatly across the different mechanisms. Broadly speaking, mechanisms that have a higher detection cost, such as XMC and MO machine clears, yield larger window sizes. Not surprisingly, branch mispredictions yield the largest window sizes, as we can significantly slow down the branch resolution process (i.e., by causing cache misses) and delay detection. FP, on the other hand, yields the shortest windows, suggesting that denormal numbers are efficiently detected inside the FPU. Our results also show that, while our framework was designed for Intel processors, similar if not better results can usually be obtained on AMD processors (where we use the same conservative training code for branch/memory prediction). This shows that both CPU families share a similar implementation in all cases except for MD, where the mistraining pattern we used is not valid for pre-Zen3 architectures.
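The window-size metric of Section 10.1 reduces to counting cache hits in the reload buffer. A minimal sketch, assuming that entry i of the buffer is touched by the i-th transient load and that a fixed latency threshold separates hits from misses (the threshold value and the sample latencies are hypothetical):

```python
from typing import Sequence

HIT_THRESHOLD_CYCLES = 100  # hypothetical cached/uncached cutoff; must be calibrated

def transient_window_size(reload_latencies: Sequence[int],
                          threshold: int = HIT_THRESHOLD_CYCLES) -> int:
    """Count F+R buffer entries whose reload latency indicates a cache hit.
    Entry i is assumed to be touched by the i-th transient load."""
    return sum(1 for latency in reload_latencies if latency < threshold)

# Hypothetical measurement: the first three transient loads completed.
print(transient_window_size([42, 45, 40, 310, 295]))  # 3
```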

10.2 Leakage Rate

To compare the leakage rates for the different transient execution mechanisms, we transiently read and repeatedly leak data from a large memory region through a classic F+R cache covert channel. We report the resulting leakage rates, as the number of bits successfully leaked per second, across different microarchitectures, using a 1-bit covert channel to highlight the time complexity of each mechanism. We consider data to be successfully leaked after a single correct hit in the reload buffer. In case of a miss for a particular value, we restart the leak for the same value until we get a hit (or until we get 100 misses in a row).
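The retry policy described above can be sketched as a small loop. The callback `try_leak_once` is a hypothetical stand-in for one transient read followed by a reload; it is assumed to return the leaked value on a hit and `None` on a miss.

```python
from typing import Callable, Optional

MAX_MISSES_IN_A_ROW = 100  # retry budget stated in the text

def leak_with_retries(try_leak_once: Callable[[], Optional[int]],
                      max_misses: int = MAX_MISSES_IN_A_ROW) -> Optional[int]:
    """Call try_leak_once() until it reports a hit (a non-None value).
    Give up and return None after max_misses consecutive misses."""
    for _ in range(max_misses):
        value = try_leak_once()
        if value is not None:
            return value
    return None

# Stand-in for the real channel: misses twice, then hits with bit value 1.
attempts = iter([None, None, 1])
print(leak_with_retries(lambda: next(attempts)))  # 1
```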

As shown in Figure 4 (bottom), different Intel and AMD microarchitectures generally yield similar leakage rates, with some variations. For instance, FP MC offers better leakage rates on Intel. This difference stems from the different performance impact of the corresponding machine clears on Intel vs. AMD microarchitectures.

[Figure 5 plot omitted. x-axis ("Transient Execution Management"): F+R, TSX, BHT, FP, SMC, XMC, MO, MD, FAULT; y-axis: Leakage Rate [Mb/s]; legend (F+R granularity [bit]): 1, 4, 8.]

Figure 5: Leakage rate vs. mechanism with a 1-bit, 4-bit, or 8-bit FLUSH + RELOAD cache covert channel (Intel Core i9).

As shown in Figure 5, the leakage rate instead varies greatly across the different mechanisms, and so does the optimal covert channel bitwidth. Indeed, while existing attacks typically rely on 8-bit covert channels, our results suggest 1-bit or 4-bit channels can be much more efficient, depending on the specific mechanism. Roughly speaking, optimal leakage rates can be obtained by balancing the time complexity (and hence bitwidth) of the covert channel with that of the mechanism. For example, FP is a lightweight and reliable mechanism, hence using a comparably fast and narrow 1-bit covert channel is beneficial. In contrast, MD requires a time-consuming predictor training phase between leak iterations, so leaking more bits per iteration with a 4-bit covert channel is more efficient. Interestingly, a classic 8-bit covert channel yields consistently worse, and comparable, leakage rates across all the mechanisms, since F+R dominates the execution time.
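The bitwidth trade-off can be illustrated with a deliberately crude cost model. It assumes that each leak iteration pays the mechanism cost once and then reloads all 2^b entries of a b-bit F+R buffer at a fixed cost per entry; the model, the cost units, and the function names are assumptions of this sketch and are not fitted to Figure 5.

```python
from typing import Iterable

def leakage_rate(bits: int, mechanism_cost: float, probe_cost: float = 1.0) -> float:
    """Bits leaked per time unit in a toy model: each iteration pays the
    mechanism cost once and reloads all 2**bits entries of the F+R buffer."""
    return bits / (mechanism_cost + probe_cost * 2 ** bits)

def best_bitwidth(mechanism_cost: float, candidates: Iterable[int] = (1, 4, 8)) -> int:
    """Return the candidate bitwidth with the highest modeled leakage rate."""
    return max(candidates, key=lambda b: leakage_rate(b, mechanism_cost))

print(best_bitwidth(mechanism_cost=1.0))   # 1
print(best_bitwidth(mechanism_cost=50.0))  # 4
```

In this model, a cheap mechanism (as FP is described above) favors the 1-bit channel, while a mechanism with a costly per-iteration setup (as MD is described above) favors the 4-bit channel, because the setup cost is amortized over more bits.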

Our results show that only two mechanisms (TSX and FP) are close to the maximum theoretical leakage rate of pure F+R. Moreover, FP performs as efficiently as TSX but, unlike TSX, is available on both Intel and AMD, is always enabled, and can be used from managed code (e.g., JavaScript). BHT, on the other hand, yields the worst leakage rates due to the inefficient training-based transient window. The BHT leakage rate can be improved if a tailored mistraining sequence is used, as in MD (Appendix A). Overall, our results show that machine clear-based windows achieve comparable, and in many cases (e.g., FP) better, leakage rates compared to traditional mechanisms. Moreover, many machine clears eliminate the need for mistraining, which, other than resulting in efficient leakage rates, can escape existing pattern-based mitigations and disabled hardware extensions (e.g., Intel TSX).

11 Attack Primitives

Building on our reverse engineering results and focusing on the unexplored SMC and FP machine clears, we now present two new transient execution attack primitives and analyze their security implications. We also present an end-to-end FPVI exploit disclosing arbitrary memory in Firefox. Later, we discuss mitigations.

11.1 Speculative Code Store Bypass (SCSB)

Our first attack primitive, Speculative Code Store Bypass (SCSB), allows an attacker to execute stale, controlled code in a transient execution window originated by a SMC machine clear. Since the primitive relies on SMC, its primary applicability is on JIT (e.g., JavaScript) engines running attacker-controlled code, although OS kernels and hypervisors storing code pages and allowing their execution without first issuing a serializing instruction are also potentially affected.

Figure 6: SCSB primitive example, where the instruction pointer is pointing at the bold code blocks. g code is freed JIT'ed code (of some g function) under the attacker's control. (1) Force the engine to JIT and execute code of function f, causing desynchronization of the code and data views. (2) Execute stale code and SMC MC. (3) After the SMC MC, code and data view coherence is restored and the new code is executed.

As exemplified in Figure 6, the operations of the primitive can be broken down into three steps: (1) the JIT engine compiles a function f, storing the generated code into a JIT code cache region previously used by a (now-stale) version of function g; (2) the JIT engine jumps to the newly generated code for the function f but, due to the temporary desynchronization between the code and data views of the CPU, this causes transient execution of the stale code of g until the SMC machine clear is processed; (3) after the pipeline flush, the code and data views are resynchronized and the CPU restarts the execution of the correct code of f. For exploitation, the attacker needs to (i) massage the JIT code cache allocator to reuse a freed region with a target gadget g of choice, and (ii) force the JIT engine to generate and execute new (f) code in such a region, enabling transient, out-of-context execution of the gadget and spilling secrets into the microarchitectural state.
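The three steps can be mimicked with a purely conceptual model that keeps separate "code" and "data" views of one code cache region. The class and its behavior are assumptions of this sketch: it abstracts away all timing and says nothing about how real hardware tracks the two views.

```python
from typing import Optional, Tuple

class ToyCore:
    """Toy model of a core whose instruction fetch may lag behind stores."""

    def __init__(self, code: str) -> None:
        self.data_view = code  # what loads and stores see
        self.code_view = code  # what instruction fetch sees

    def store_code(self, new_code: str) -> None:
        # Step 1: the JIT writes f over the freed region; only the data view updates.
        self.data_view = new_code

    def execute(self) -> Tuple[Optional[str], str]:
        # Step 2: if the views differ, the stale code runs transiently.
        transient = self.code_view if self.code_view != self.data_view else None
        # Step 3: the SMC machine clear resynchronizes the views and restarts.
        self.code_view = self.data_view
        return transient, self.code_view

core = ToyCore("g (stale gadget)")
core.store_code("f (new code)")
print(core.execute())  # ('g (stale gadget)', 'f (new code)')
print(core.execute())  # (None, 'f (new code)')
```

The first element of the returned pair is what would run transiently (if anything), the second what finally executes architecturally.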

Our primitive bears similarities with both transient and architectural primitives used in prior attacks. On the transient front, our primitive is conceptually similar to a Speculative-Store-Bypass (SSB) primitive [55, 82], but can transiently execute stale code rather than reading stale data. Interestingly, however, the underlying causes of the two primitives are quite different (MD misprediction vs. SMC machine clear). On the architectural front, our primitive mimics classic Use-After-Free (UAF) exploitation on the JIT code cache, also known as Return-After-Free (RAF) in the hacker community [19, 25]. An example is CVE-2018-0946, where a use-after-free vulnerability can be exploited to force the Chakra JS engine to erroneously execute freed (attacker-controlled) JIT code, resulting in arbitrary code execution after massaging the right gadget into the JIT code cache [24].

Figure 7: Coding options suggested by the Intel Architectures Software Developer's Manuals to handle SMC and XMC execution. Option 1 describes the exact steps required by our Speculative Code Store Bypass attack primitive, potentially resulting in exploitable gadgets.

Indeed, at a high level, SCSB yields a transient use-after-free primitive on a JIT code cache, with exploitation properties similar to its architectural counterpart. However, there are some differences due to the transient nature of SCSB. First, we need to find an out-of-context gadget to transiently leak data, rather than architecturally execute arbitrary code. In JavaScript engines, similar gadgets have already been exploited by ret2spec [48], escalating out-of-context transient execution of valid code to type confusion, arbitrary reads, and ultimately a secret-dependent load to transmit the value.

Second, we need to target short-lived JIT code cache updates, so that the newly generated code is immediately executed, fitting the target gadget in the resulting transient execution window. Interestingly, executing JIT'ed code is the next step after JIT code generation mentioned in the Intel manual (Figure 7, Option 1), suggesting that developers who follow such directives ad litteram can easily introduce such gadgets. Moreover, modern JavaScript compilers feature a multi-stage optimization pipeline [12, 54], and short-lived JIT code cache updates are favored for just-in-time (re)optimization. Indeed, we found test code in V8 to specifically test such updates and verify instruction/data cache coherence [70].

Finally, we need to ensure JIT code cache updates are not accompanied by barriers that force immediate synchronization of the code and data views. We analyzed the code of the popular SpiderMonkey and V8 JavaScript engines and verified that the functions called upon JIT code cache updates to synchronize instruction and data caches (Listing 2 and Listing 3) are always empty on x86 (as expected, given Intel's primary recommendation in Figure 7).

Listing 2: Chromium instruction cache flush (chromium/src/v8/src/codegen/x64/cpu-x64.cc)

    void CpuFeatures::FlushICache(void* start, size_t size) {
      // No need to flush the instruction cache on Intel
    }

Listing 3: Firefox instruction cache flush (mozilla-unified/js/src/jit/FlushICache.h)

    inline void FlushICache(void* code, size_t size,
                            bool codeIsThreadLocal = true) {
      // No-op. Code and data caches are coherent on x86 and x64.
    }

Overall, while the exploitation is far from trivial (i.e., having to address the challenges of traditional use-after-free exploitation as well as transient execution exploitation in the browser), we believe SCSB expands the attack surface of transient execution attacks in the browser. We found a number of candidate SCSB gadgets in real-world code, but none of them was ultimately exploitable due to the coincidental presence of some serializing instruction preventing stale code execution. Nonetheless, mitigations are needed to enforce security-by-design. Luckily, as shown later, SCSB is amenable to a practical and efficient mitigation (i.e., at a similar cost as faced by non-x86 architectures). The implementation/performance cost for mitigation is even lower than for its sibling SSB primitive, whose mitigation has been deployed in practice even in the absence of known practical exploits [55].

In Table 3, we show that all tested Intel and AMD processors are affected by SCSB. In contrast, ARM is not vulnerable, since SMC updates require explicit software barriers.

11.2 Floating Point Value Injection (FPVI)

Our second attack primitive, Floating Point Value Injection (FPVI), allows an attacker to inject arbitrary values into a transient execution window originated by a FP machine clear. As exemplified in Figure 8, the operations of the primitive can be broken down into four steps: (1) the attacker triggers the execution of a gadget starting with a denormal FP operation in the victim application, with the x and y operands under the attacker's control; (2) the transient z result of the operation is processed by the subsequent gadget instructions, leaving a microarchitectural trace; (3) the CPU detects the error condition (i.e., wrong result of a denormal operation), triggering a machine clear and thus a pipeline flush; (4) the CPU re-executes the entire gadget with the correct architectural z result. For exploitation, the attacker needs to (i) massage the x and y operands to inject the desired z value into the victim transient path, and (ii) target a victim gadget so that the injected value yields a security-sensitive trace, which can be observed with FLUSH + RELOAD or other microarchitectural attacks.

Our primitive bears similarities with Load-Value-Injection (LVI) [71], since both allow attackers to inject controlled values into transient execution. Moreover, both primitives require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues, and hence the triggering conditions, are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. Indeed, while it is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild (much like for Spectre [41], LVI [71], etc., this would require gadget scanners, the subject of orthogonal research [56, 58]), we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all the prior Mozilla Firefox mitigations against transient execution attacks [52].

Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x,y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE-754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case, a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.
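The tag/payload split can be checked with a short decoder. This is a sketch of the encoding as described in the text, not SpiderMonkey's code. One detail is an interpretation of this example: the text gives 0x1fff0 as a 17-bit tag value but 0xfff88 and 0xfffb0 as the leading five hex digits (20 bits) of the 64-bit word, so the sketch drops the low 3 bits of the latter two to obtain 17-bit tags.

```python
import struct

TAG_SHIFT = 47                       # 64 bits minus the 17 tag bits
PAYLOAD_MASK = (1 << TAG_SHIFT) - 1  # low 47 bits
JSVAL_TAG_MAX_DOUBLE = 0x1fff0       # 17-bit tag value from the text
# Assumption of this sketch: 0xfff88 and 0xfffb0 are 20-bit prefixes, so
# shifting out their low 3 bits yields the corresponding 17-bit tags.
TAG_INT32 = 0xfff88 >> 3
TAG_STRING = 0xfffb0 >> 3

def decode(bits: int):
    """Classify a 64-bit NaN-boxed word as described in the text."""
    tag = bits >> TAG_SHIFT
    if tag <= JSVAL_TAG_MAX_DOUBLE:
        return "double", struct.unpack("<d", struct.pack("<Q", bits))[0]
    payload = bits & PAYLOAD_MASK
    if tag == TAG_INT32:
        return "int32", payload & 0xffffffff  # raw low 32 bits
    if tag == TAG_STRING:
        return "string", payload              # pointer to the string descriptor
    return "other", payload

kind, value = decode(0xfffb0deadbeef000)
print(kind, hex(value))            # string 0xdeadbeef000
print(decode(0x3ff8000000000000))  # ('double', 1.5)
```

Applied to the transient result discussed in this section, the decoder yields a string whose pointer payload is 0xdeadbeef000, matching the address used in the gadget.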

To verify the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., the string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see the gadget in Figure 8).
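The architectural half of this example can be reproduced in software. The sketch below reinterprets the two operand bit patterns as doubles, checks which one is denormal, and performs the division; the helper names are illustrative. It cannot show the transient result, which exists only inside the CPU before the machine clear.

```python
import struct

def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit integer as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def is_denormal(bits: int) -> bool:
    """A double is denormal when its exponent field is 0 and its mantissa is not."""
    exponent = (bits >> 52) & 0x7ff
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0 and mantissa != 0

X_BITS = 0xc000e8b2c9755600  # dividend from the text
Y_BITS = 0x0004000000000000  # divisor from the text

print(is_denormal(X_BITS), is_denormal(Y_BITS))         # False True
print(bits_to_double(X_BITS) / bits_to_double(Y_BITS))  # -inf
```

The divisor is the denormal operand (2^-1024), so the quotient overflows to -Infinity architecturally, as stated above.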

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the length attribute of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 51 bits set to zero, a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables, such as string pointers, can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on a FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0 and, by wrapping this primitive with a variant of an EVICT + RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

Table 3: Tested processors.

| Processor                            | Microcode | SCSB vulnerable | FPVI vulnerable |
|--------------------------------------|-----------|-----------------|-----------------|
| Intel Core i7-10700K                 | 0xe0      | ✓               | ✓               |
| Intel Xeon Silver 4214               | 0x500001c | ✓               | ✓               |
| Intel Core i9-9900K                  | 0xde      | ✓               | ✓               |
| Intel Core i7-7700K                  | 0xca      | ✓               | ✓               |
| AMD Ryzen 5 5600X                    | 0xa201009 | ✓               | ✓ †             |
| AMD Ryzen 2990WX                     | 0x800820b | ✓               | ✓ †             |
| AMD Ryzen 7 2700X                    | 0x800820b | ✓               | ✓ †             |
| Broadcom BCM2711 Cortex-A72 (ARM v8) |           | ✗ §             | ✗               |

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serializing instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified that any serializing instruction, such as lfence, mfence, or cpuid, placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serializing instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. This function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows JavaScript execution time is heavily dominated by JIT code generation/execution and code updates have negligible impact. Our results show this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
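The effect of the masking option can be illustrated at the bit level. The sketch below replaces every NaN bit pattern with the x86 QNaN Floating-Point Indefinite encoding for doubles (0xfff8000000000000) and passes other values through. It is only an illustration of the intended result: it is not SpiderMonkey's implementation, and an ordinary Python conditional offers none of the transient-execution safety that the cmov-based masking is meant to provide.

```python
QNAN_INDEFINITE = 0xfff8000000000000  # x86 QNaN Floating-Point Indefinite (double)
EXPONENT_MASK = 0x7ff0000000000000
MANTISSA_MASK = 0x000fffffffffffff

def canonicalize_nan(bits: int) -> int:
    """Replace any NaN bit pattern with the canonical QNaN; pass through the rest."""
    is_nan = (bits & EXPONENT_MASK) == EXPONENT_MASK and (bits & MANTISSA_MASK) != 0
    return QNAN_INDEFINITE if is_nan else bits

print(hex(canonicalize_nan(0xfffb0deadbeef000)))  # 0xfff8000000000000
print(hex(canonicalize_nan(0x3ff8000000000000)))  # 0x3ff8000000000000
```

With such masking, the NaN-boxed string pointer from the exploit collapses to the canonical NaN before any dependent instruction can interpret it as a pointer.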

A more general and automated mitigation is for the compiler to place a serializing instruction, such as lfence, after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76] and systems resistant to it. For the former, we consider both load and store instructions as sinks (e.g., an FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by a MDS attacker. For the latter, we limit ourselves to load sinks, to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.
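The source/sink strategy can be sketched on a toy straight-line instruction list. Everything here is an assumption of the example (the `Insn` format, the opcode names, and tracking only the nearest FP source); it is not the LLVM pass described next, which works on real machine code and control flow.

```python
from typing import Dict, List, NamedTuple, Set, Tuple

class Insn(NamedTuple):
    op: str                # e.g. "fdiv", "add", "load", "store", "ijmp"
    dst: str               # destination register, "" if none
    srcs: Tuple[str, ...]  # source registers; srcs[0] is the address/target of a sink

FP_OPS = {"fadd", "fsub", "fmul", "fdiv"}
LFENCE = Insn("lfence", "", ())

def harden(insns: List[Insn], mds_vulnerable: bool) -> List[Insn]:
    """Insert an lfence after each FP operation whose result reaches a sink."""
    sinks = {"load", "ijmp"} | ({"store"} if mds_vulnerable else set())
    taint: Dict[str, Set[int]] = {}  # register -> indices of FP sources it derives from
    to_fence: Set[int] = set()
    for i, insn in enumerate(insns):
        if insn.op in sinks and insn.srcs:
            to_fence |= taint.get(insn.srcs[0], set())
        if insn.op in FP_OPS:
            deps = {i}  # simplification: track only the nearest FP source
        else:
            deps = set().union(*(taint.get(r, set()) for r in insn.srcs))
        if insn.dst:
            taint[insn.dst] = deps
    hardened: List[Insn] = []
    for i, insn in enumerate(insns):
        hardened.append(insn)
        if i in to_fence:
            hardened.append(LFENCE)
    return hardened

program = [
    Insn("fdiv", "z", ("x", "y")),
    Insn("add", "a", ("z", "base")),
    Insn("load", "v", ("a",)),
    Insn("fmul", "w", ("x", "x")),
    Insn("store", "", ("w", "v")),
]
print([i.op for i in harden(program, mds_vulnerable=False)])
print([i.op for i in harden(program, mds_vulnerable=True)])
```

In the MDS-resistant configuration only the fdiv feeding the load address is fenced; in the MDS-vulnerable configuration the fmul feeding the store address is fenced as well.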

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that, on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.

[Figure 9 plot omitted. x-axis: SPEC FP 2017 (619.lbm_s, 638.imagick_s, 644.nab_s, SPEC17 geomean) and SPEC FP 2006 (433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, SPEC06 geomean); y-axis: Overhead (ticks from 0 to 800); legend (LLVM Mitigation): LVI, FPVI (MDS-vulnerable system), FPVI (MDS-resistant system), LVI+FPVI (MDS-vulnerable system).]

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0.

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact, up to 280%, despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance impact difference between MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by using interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).
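For readers unfamiliar with geomean overheads, the sketch below shows one common way to aggregate per-benchmark overheads: convert each to a slowdown ratio, take the geometric mean of the ratios, and convert back to a percentage. Whether the numbers above were aggregated exactly this way is an assumption of this example, and the input values are hypothetical.

```python
import math
from typing import Sequence

def geomean_overhead(overheads_percent: Sequence[float]) -> float:
    """Geometric-mean overhead (%) from per-benchmark overheads (%).
    Each overhead is first converted to a slowdown ratio (1 + overhead/100)."""
    ratios = [1.0 + o / 100.0 for o in overheads_percent]
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return (math.exp(log_mean) - 1.0) * 100.0

# Hypothetical per-benchmark overheads (not the paper's measurements).
print(round(geomean_overhead([0.0, 100.0, 300.0]), 1))  # 100.0
```

Slowdowns of 1x, 2x, and 4x have a geometric mean of 2x, i.e., a 100% geomean overhead, whereas the arithmetic mean of the same overheads would be about 133%.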

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, a MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal, root-cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation, due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It is worth noting that, in Figure 10, there are two separate predictor subclasses, under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74–76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a HW interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less-than-ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived from likely invariants [15] used by the CPU. Such invariants commonly hold but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.
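To make the floating-point invariant concrete, the following minimal sketch tests whether a double is denormal (subnormal) by inspecting its IEEE-754 encoding: a zero exponent field with a non-zero mantissa. The helper name is hypothetical and the code assumes the usual 64-bit binary64 layout of double; it only illustrates which values break the "never denormal" invariant and does not trigger or measure a machine clear.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumption: double uses the IEEE-754 binary64 layout. */
static_assert(sizeof(double) == sizeof(uint64_t), "binary64 expected");

/* Hypothetical helper: true if x is a denormal (subnormal) number,
 * i.e. exponent field == 0 and mantissa field != 0. */
static bool is_denormal(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);                /* type-pun without UB */
    uint64_t exponent = (bits >> 52) & 0x7FFu;     /* 11 exponent bits */
    uint64_t mantissa = bits & 0xFFFFFFFFFFFFFull; /* 52 mantissa bits */
    return exponent == 0 && mantissa != 0;
}
```

For example, DBL_MIN is the smallest normal double, so is_denormal(DBL_MIN) is false, while halving it yields a denormal value.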

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them with the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it with the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd, Daniel Genkin, and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From ghz to mhz effective cpu frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. Smotherspectre: exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. Kaslr: Break it, fix it, repeat. In ACM ASIA CCS 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. Sgxpectre: Stealing intel secrets from sgx enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS 2018.

[16] Christopher Domas. Breaking the x86 isa. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.

[20] Dmitry Evtyushkin, Ryan Riley, Nael CSE Abu-Ghazaleh ECE, and Dmitry Ponomarev. Branchscope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts. October 1, 2013. US Patent 8,549,263.

[45] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.

[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.

[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.

[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.

[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. Memjam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358.

[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111.

[53] Mozilla. Spider Monkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h.

[54] Mozilla. SpiderMonkey IonMonkey documentation.

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. Specfuzz: Bringing spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the cpu on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. Spectaint: Speculative taint analysis for discovering spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. Sok: In search of lost time: A review of javascript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers. 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in javascript. In FC, IFCA 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. Netspectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. Lazyfp: Leaking fpu register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. Meltdownprime and spectreprime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. Sok: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8.

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. Lvi: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary cpu interrupt logic. In ACM CCS, 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Sgaxe: How sgx fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Cacheout: Leaking data on intel cpus via cache evictions. arXiv preprint.

[77] WebKit. Browserbench. https://browserbench.org.

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-ng: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+reload: a high resolution, low noise, l3 cache side-channel attack. In USENIX Security 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                      ; rdi: store addr, rsi: load addr
    %rep 10                     ; Trick to delay the store address
        imul rdi, 1
    %endrep
        mov DWORD [rdi], 0x42   ; Store
        mov eax, DWORD [rsi]    ; Load
    %rep 10                     ; Pronounce load timing
        imul eax, 1
    %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturating counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that saturating counter is set to 0
    for (i = 0; i < 100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for (i = 100; i < 120; i++) st_ld(mem, mem + 64);
    // Trigger a memory ordering machine clear
    for (i = 120; i < 130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
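The counter behavior described above can be sketched as a small software model. This is a minimal illustration under stated assumptions, not the hardware implementation: the type and function names (md_entry, md_update, md_replay_listing5) are hypothetical, and the model only encodes the rules stated in the text (4-bit counter, increment on a non-aliasing load, reset to 0 on an aliasing one, hoist only when saturated).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MD_COUNTER_MAX 15u /* 4-bit saturating counter */

/* Hypothetical model of one per-address predictor entry. */
typedef struct {
    uint8_t counter; /* 0..15 */
} md_entry;

/* Hoisting is predicted only when the counter is saturated. */
static bool md_predict_hoist(const md_entry *e) {
    return e->counter == MD_COUNTER_MAX;
}

/* Update the entry once the load resolves.
 * Returns true when the load was hoisted but did alias, i.e. the case
 * that the paper observes as a machine clear. */
static bool md_update(md_entry *e, bool aliases) {
    assert(e->counter <= MD_COUNTER_MAX);
    bool hoisted = md_predict_hoist(e);
    if (aliases) {
        e->counter = 0; /* reset on aliasing load */
        return hoisted;
    }
    if (e->counter < MD_COUNTER_MAX)
        e->counter++;   /* saturating increment */
    return false;
}

/* Replay the access pattern of Listing 5 on the model: calls 0..99 and
 * 120..129 alias, calls 100..119 do not. Returns the number of modeled
 * machine clears and stores the index of the first hoisted call. */
static int md_replay_listing5(int *first_hoisted_call) {
    md_entry e = { 0 };
    int clears = 0;
    *first_hoisted_call = -1;
    for (int i = 0; i < 130; i++) {
        bool aliases = (i < 100) || (i >= 120);
        if (md_predict_hoist(&e) && *first_hoisted_call < 0)
            *first_hoisted_call = i;
        if (md_update(&e, aliases))
            clears++;
    }
    return clears;
}
```

In this model, the first hoisted call is call 115 (after the 15 non-aliasing loads at calls 100 to 114), and the only modeled machine clear happens at call 120, the first aliasing load issued in the hoisting state.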

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld, 100 to 130; y-axis: clock cycles). Orange bar: memory ordering machine clear observed.

Listing 6: Snippet observing activation of the never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            st_ld(mem, mem + 64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld, 0 to 120; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters) and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function, but with a number of nops added in the preamble (which we change at every run). The goal is to observe if two unrelated loads in these functions are able to influence each other when varying the number of nops. In our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256k bytes apart in memory (k ∈ N). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
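The indexing scheme inferred above can be illustrated with a short sketch. The function names are hypothetical and the instruction pointers used in the example are made up; the only facts taken from the text are the 256 entries and the use of the lowest 8 bits of the load's instruction pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 256 entries as observed on the tested CPUs (the patent suggests 64). */
#define MD_NUM_ENTRIES 256u
static_assert((MD_NUM_ENTRIES & (MD_NUM_ENTRIES - 1u)) == 0,
              "entry count must be a power of two for the mask below");

/* Predictor entry selected by a load: lowest 8 bits of its IP. */
static unsigned md_index(uint64_t load_ip) {
    return (unsigned)(load_ip & (MD_NUM_ENTRIES - 1u));
}

/* Two loads share an entry when their IPs are 256*k bytes apart. */
static bool md_entries_collide(uint64_t ip_a, uint64_t ip_b) {
    return md_index(ip_a) == md_index(ip_b);
}

/* Number of padding bytes (e.g. nops, as in st_ld_offset) to place before
 * an attacker load at attacker_ip so that it maps to the victim's entry. */
static unsigned md_padding_for_collision(uint64_t attacker_ip,
                                         uint64_t victim_ip) {
    return (unsigned)((victim_ip - attacker_ip) & (MD_NUM_ENTRIES - 1u));
}
```

For instance, with a hypothetical victim load at 0x401234 and an attacker load at 0x500010, the model asks for 0x24 bytes of padding, after which both loads index entry 0x34.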

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that after 4 machine clears the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads, 15 to 399; y-axis: total measured machine clears, 1 to 4).

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0 and disabled otherwise. The flush counter is decremented on every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measure how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
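The interplay between the per-address counter and the watchdog can be sketched as follows. This is a simplified model under explicit assumptions, with hypothetical names: the flush counter ranges from -1 to 3, it is incremented after 64 correct (possibly unused) hoisting predictions, and the tally of correct predictions is not reset by an aliasing load (the text does not say either way).

```c
#include <assert.h>
#include <stdbool.h>

enum {
    COUNTER_MAX  = 15, /* 4-bit per-address saturating counter */
    FLUSH_MIN    = -1, /* flush counter < 0: never-hoisting state */
    FLUSH_MAX    = 3,  /* 2-bit saturating flush counter */
    FLUSH_PERIOD = 64  /* correct predictions per flush increment */
};

/* Hypothetical model of one predictor entry plus the global watchdog. */
typedef struct {
    int counter; /* per-address counter, 0..15 */
    int flush;   /* flush counter, -1..3 */
    int streak;  /* correct hoisting predictions since last increment */
} md_unit;

/* A load is hoisted only if predicted so and the watchdog is off. */
static bool md_unit_hoists(const md_unit *u) {
    return u->counter == COUNTER_MAX && u->flush >= 0;
}

/* Process one store-load pair; returns true on a modeled machine clear. */
static bool md_unit_step(md_unit *u, bool aliases) {
    assert(u->flush >= FLUSH_MIN && u->flush <= FLUSH_MAX);
    bool predicted = (u->counter == COUNTER_MAX);
    bool hoisted = md_unit_hoists(u);
    if (aliases) {
        u->counter = 0;
        if (hoisted) {
            u->flush--; /* each MD machine clear decrements the counter */
            return true;
        }
        return false;
    }
    /* Non-aliasing load: a hoisting prediction would have been correct. */
    if (predicted && ++u->streak == FLUSH_PERIOD) {
        u->streak = 0;
        if (u->flush < FLUSH_MAX)
            u->flush++;
    }
    if (u->counter < COUNTER_MAX)
        u->counter++;
    return false;
}

/* Non-aliasing loads needed to leave the never-hoisting state. */
static int md_loads_to_reach_hoisting(void) {
    md_unit u = { 0, FLUSH_MIN, 0 };
    int n = 0;
    while (!md_unit_hoists(&u)) {
        md_unit_step(&u, false);
        n++;
    }
    return n;
}

/* Replay Listing 6 starting with a saturated flush counter. */
static int md_replay_listing6(void) {
    md_unit u = { 0, FLUSH_MAX, 0 };
    int clears = 0;
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            md_unit_step(&u, false);
        if (md_unit_step(&u, true))
            clears++;
    }
    return clears;
}
```

Under these assumptions the model reproduces the two numbers reported in the text: 15+64=79 non-aliasing loads to leave the never-hoisting state, and 4 machine clears in the Listing 6 pattern before hoisting is disabled.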

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for (i = 0; i < 10; i++) st_ld(mem + 0x1000, mem + 0x1000);
    for (i = 10; i < 20; i++) st_ld(mem + 0x2000, mem + 0x2000);
    // Cause 4k aliasing
    for (i = 20; i < 40; i++) st_ld(mem + 0x1000, mem + 0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld, 0 to 40; y-axis: clock cycles). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Figure 15: Memory disambiguation unit, simplified view.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value, and in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load, and (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
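The address condition behind 4k aliasing can be written down as a short predicate. This sketch is a simplification with hypothetical names: it ignores access sizes and partial overlaps and only checks whether two distinct addresses share the same 4KB page offset, applied to the three phases of Listing 7. As noted above, whether the penalty is actually observed also depends on the MD unit making a no-hoist prediction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK ((uintptr_t)0xFFF) /* low 12 bits: 4KB page offset */

/* True when the store and the load touch different addresses that share
 * the same 4KB page offset (a false match for the forwarding logic). */
static bool is_4k_alias(uintptr_t store_addr, uintptr_t load_addr) {
    return store_addr != load_addr &&
           ((store_addr ^ load_addr) & PAGE_OFFSET_MASK) == 0;
}

/* Whether the i-th st_ld call of Listing 7 uses a 4k-aliasing pair,
 * given the base address mem used in that listing. */
static bool listing7_call_is_4k_alias(uintptr_t mem, int i) {
    assert(i >= 0 && i < 40);
    if (i < 10)
        return is_4k_alias(mem + 0x1000, mem + 0x1000); /* true alias */
    if (i < 20)
        return is_4k_alias(mem + 0x2000, mem + 0x2000); /* true alias */
    return is_4k_alias(mem + 0x1000, mem + 0x2000);     /* 4k alias */
}
```

Only calls 20 to 39 of Listing 7 satisfy the predicate; the first 20 calls store and load the same address, which is a genuine alias rather than a false match.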

B Root-Causes Description Table

Acronym          Description

BHT              Branch History Table
BTB              Branch Target Buffer
RSB              Return Stack Buffer
MD               Memory Disambiguation Unit
NM               Device Not Available Exception
DE               Divide-by-Zero Exception
UD               Invalid Opcode Exception
GP               General Protection Fault
AC               Alignment Check Exception
SS               Stack Segment Exception
PF               Page Fault
PF - U/S Bit     Page Table Entry User/Supervisor Bit PF
PF - R/W Bit     Page Table Entry Read/Write Bit PF
PF - P Bit       Page Table Entry Present Bit PF
PF - PKU         Page Table Entry Protection Keys PF
BR               Bound Range Exceeded Exception
FP               Floating Point Assist
SMC              Self-Modifying Code
XMC              Cross-Modifying Code
MO               Memory Ordering Principles Violation
MASKMOV          Masked Load/Store Instruction Assist
A/D Bits         Page Table Entry Access/Dirty Bits Assist
TSX              Intel TSX Transaction Abort
UC               Uncacheable Memory Assist
PRM              Processor Reserved Memory Assist
HW Interrupts    Hardware Interrupts


Page 2: Rage Against the Machine Clear: A Systematic Analysis of

Rage Against the Machine Clear A Systematic Analysis of Machine Clearsand Their Implications for Transient Execution Attacks

Hany Ragablowast

hanyragabvunlEnrico Barberislowast

ebarberisvunlHerbert Bos

herbertbcsvunlCristiano Giuffridagiuffridacsvunl

Vrije Universiteit AmsterdamAmsterdam The Netherlands

lowastEqual contribution joint first authors

AbstractSince the discovery of the Spectre and Meltdown vulnera-bilities transient execution attacks have increasingly gainedmomentum However while the community has investigatedseveral variants to trigger attacks during transient executionmuch less attention has been devoted to the analysis of theroot causes of transient execution itself Most attack vari-ants simply build on well-known root causes such as branchmisprediction and aborts of Intel TSXmdashwhich are no longersupported on many recent processors

In this paper we tackle the problem from a new perspectiveclosely examining the different root causes of transient exe-cution rather than focusing on new attacks based on knowntransient windows Our analysis specifically focuses on theclass of transient execution based on machine clears (MC)reverse engineering previously unexplored root causes suchas Floating Point MC Self-Modifying Code MC MemoryOrdering MC and Memory Disambiguation MC We showthese events not only originate new transient execution win-dows that widen the horizon for known attacks but also yieldentirely new attack primitives to inject transient values (Float-ing Point Value Injection or FPVI) and executing stale code(Speculative Code Store Bypass or SCSB) We present anend-to-end FPVI exploit on the latest Mozilla SpiderMonkeyJavaScript engine with all the mitigations enabled disclosingarbitrary memory in the browser through attacker-controlledand transiently-injected floating-point results We also pro-pose mitigations for both attack primitives and evaluate theirperformance impact Finally as a by-product of our analysiswe present a new root cause-based classification of all knowntransient execution paths

1 Introduction

Since the public disclosure of the Meltdown and Spectre vul-nerabilities in 2018 researchers have investigated ways to usetransient execution windows for crafting several new attackvariants [6ndash11 28 40 43 48 59 63 65 67 68 71 74ndash76 78

79 82] Building mostly on well-known causes of transientexecution such as branch mispredictions or aborting IntelTSX transactions such variants violate many security bound-aries allowing attackers to obtain access to unauthorized datadivert control flow or inject values in transiently executedcode Nonetheless little effort has been made to systemat-ically investigate the root causes of what Intel refers to asbad speculation [36]mdashthe conditions that cause the CPU todiscard already issued micro-operations (microOps) and renderthem transient As a result our understanding of these rootcauses as well as their security implications is still limitedIn this paper we systematically examine such causes with afocus on a major largely unexplored class of bad speculation

In particular Intel [36] identifies two general causes of badspeculation (and transient execution)mdashbranch mispredictionand machine clears Upon detecting branch mispredictionsthe CPU needs to squash all the microOps executed in the themispredicted branches Such occurrences in a wide varietyof forms and manifestations have been extensively examinedin the existing literature on transient execution attacks [10113140414348] The same cannot be said for machine clearsa class of bad speculation that relies on a full flush of theprocessor pipeline to restart from the last retired instructionIn this paper we therefore focus on the systematic explorationof machine clears their behavior and security implications

Specifically by reverse engineering this undocumentedclass of bad speculation we closely examine four previouslyunexplored sources of machine clears (and thus transient exe-cution) related to floating points self-modifying code mem-ory ordering and memory disambiguation We show attackerscan exploit such causes to originate new transient executionwindows with unique characteristics For instance attackerscan exploit Floating Point Machine Clear events to embedattacks in transient execution windows that require no train-ing or special (in many cases disabled) features such as IntelTSX Besides providing a general framework to run existingand future attacks in a variety of transient execution windowswith different constraints and leakage rates our analysis alsouncovered new attack primitives based on machine clears


In particular, we show that Self-Modifying Code Machine Clear events allow attackers to transiently execute stale code, while Floating Point Machine Clear events allow them to inject transient values in victim code. We term these primitives Speculative Code Store Bypass (SCSB) and Floating Point Value Injection (FPVI), respectively. The former is loosely related to Speculative Store Bypass (SSB) [55, 82], but allows attackers to transiently reference stale code rather than data. The latter is loosely related to Load Value Injection (LVI) [71], but allows attackers to inject transient values by controlling operands of floating-point operations rather than triggering victim page faults or microcode assists. We also discuss possible gadgets for exploitation in applications such as JIT engines. For instance, we found that developers following instructions detailed in the Intel optimization manual [36] ad litteram could easily introduce SCSB gadgets in their applications. In addition, we show that attackers controlling JIT'ed code can easily inject FPVI gadgets and craft arbitrary memory reads, even from JavaScript in a modern browser using NaN-boxing such as Mozilla Firefox [53].

Moreover, since existing mitigations cannot hinder our primitives, we implement new mitigations to eliminate the uncovered attack surface. As shown in our evaluation, SCSB can be efficiently mitigated with serializing instructions in the practical cases of interest, such as JavaScript engines. FPVI can be mitigated using transient execution barriers between the source of injection and its transmit gadgets, inserted either by the programmer or by the compiler. We implemented the general compiler-based mitigation in LLVM [45], measuring an overhead of 32% / 53% on SPECfp 2006/2017 (geomean).

While not limited to Intel, our analysis does build on a variety of performance counters available in different generations of Intel CPUs but not on other architectures. Nevertheless, we show that our insights into the root causes of bad speculation also generalize to other architectures by successfully repeating our transient execution leakage experiments (originally designed for Intel) on AMD.

Finally, armed with a deeper understanding of the root causes of transient execution, we also present a new classification for the resulting paths. Existing classifications [10] are entirely attack-centric, focusing on classes of Spectre- or Meltdown-like attacks and blending together (common) causes of transient execution with their uses (i.e., what attackers can do with them). Such classifications cannot easily accommodate the new transient execution paths presented in this paper. For this reason, we propose a new, orthogonal, root cause-centric classification, much closer to the sources of bad speculation identified by the chip vendors themselves.

Summarizing, we make the following contributions:

• We systematically explore the root causes of transient execution and closely examine the major, largely unexplored machine clear-based class. To this end, we present the reverse engineering and security analysis of causes such as Floating Point Machine Clear (FP MC), Self-Modifying Code Machine Clear (SMC MC), Memory Ordering Machine Clear (MO MC), and Memory Disambiguation Machine Clear (MD MC).

• We present two novel machine clear-based transient execution attack primitives (FPVI and SCSB) and an end-to-end FPVI exploit disclosing arbitrary memory in Firefox.

• We propose and evaluate possible mitigations.

• We propose a new root-cause based classification of all the known transient execution paths.

Code, exploit demo, and additional information are available at https://www.vusec.net/projects/fpvi-scsb.

2 Background

2.1 IEEE-754 Denormal Numbers

Modern processors implement a Floating Point Unit (FPU). To represent floating point numbers, the IEEE-754 standard [1] distinguishes three components (Fig. 1). First, the most significant bit serves as the sign bit s. The next w bits contain a biased exponent e. For instance, for double precision 64-bit floating point numbers, e is 11 bits long and the bias is $2^{11-1} - 1 = 1023$ (i.e., $2^{w-1} - 1$). The value e is computed by adding the real exponent $e_{real}$ to the bias, which ensures that e is a positive number even if the real exponent is negative. The remaining bits are used for the mantissa m. The combination of these components represents the value. As an example, suppose the sign bit s = 1, the biased exponent $e = 10000000011_2 = 1027_{10}$, so that $e_{real} = 1027 - 1023 = 4$, and the mantissa m = 0111000...0. In that case, the corresponding decimal value is $-1 \cdot 2^{4} \cdot (1 + 0.0111_2) = -23$. We refer to such numbers as normal numbers, as they are in the normalized form with an implicit leading 1 present in the mantissa.

Figure 1: Fields of a 64-bit IEEE 754 double

If e = 0 and $m \neq 0$, the CPU treats the value as a denormal value. In this case, the leading 1 becomes a leading 0 instead, while the exponent becomes the minimal value of $2 - 2^{w-1}$, so that with s = 1, e = 0, and m = 0111000...0, the resulting value for a double precision number is $-1 \cdot 2^{-1022} \cdot (0 + 0.0111_2)$. This additional representation is intended to allow a gradual underflow when values get closer to 0. In 64-bit floating point numbers, the smallest unbiased exponent is -1022, making it impossible to represent a normal number with an exponent of -1023 or lower. In contrast, a denormal number can represent such a value by adding enough leading zeroes to the mantissa until the minimal exponent is obtained. The denormal representation trades precision for the ability to represent a larger set of numbers. Generating such a representation is called denormalization or gradual underflow.
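The field layout described above can be checked with a short Python sketch. It is only an illustration: the helper name `decode_double` and the returned tuple layout are our own choices, and the two bit patterns are the normal and denormal examples from the text.

```python
import struct

def decode_double(bits: int):
    """Split a 64-bit IEEE-754 pattern into (s, e, m, kind, value)."""
    s = bits >> 63                      # sign bit
    e = (bits >> 52) & 0x7FF            # biased exponent, w = 11 bits
    m = bits & ((1 << 52) - 1)          # 52-bit mantissa field
    if e == 0x7FF:
        kind = "inf/nan"
    elif e == 0:
        kind = "zero" if m == 0 else "denormal"
    else:
        kind = "normal"
    # Let the host reinterpret the same bits as a double.
    value = struct.unpack(">d", bits.to_bytes(8, "big"))[0]
    return s, e, m, kind, value

BIAS = (1 << (11 - 1)) - 1              # 2^(w-1) - 1 = 1023

# Normal example from the text: s = 1, e = 1027, m = 0111 000...0
normal_bits = (1 << 63) | (1027 << 52) | (0b0111 << 48)
# Denormal example from the text: s = 1, e = 0, same mantissa
denormal_bits = (1 << 63) | (0b0111 << 48)

if __name__ == "__main__":
    for bits in (normal_bits, denormal_bits):
        s, e, m, kind, value = decode_double(bits)
        print(f"{bits:#018x}: s={s} e={e} kind={kind} value={value!r}")
```

Running the script prints the two decoded values; the denormal one is the same mantissa scaled by the fixed minimal exponent instead of a biased one.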

2.2 x86 Cache Coherence

On multicore processors, L1 and L2 caches are usually per core, while the L3 is shared among all cores. Much complexity arises due to the cache coherence problem, when the same memory location is cached by multiple cores. To ensure correctness, the memory must behave as a single coherent source of data, even if data is spread across a cache hierarchy. Informally, a memory system is coherent if any read of a data item returns the most recently written value [30]. To obtain coherence, memory operations that affect shared data must propagate to other cores' private caches. This propagation is the responsibility of the cache coherence protocol, such as MESIF on Intel processors [37] and MOESI on AMD [3]. Cache controllers implementing these protocols snoop and exchange coherence requests on the shared bus. For example, when a core writes to a shared memory location, it signals all other cores to invalidate their now stale local copy.
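As a rough illustration of the invalidate-on-write behavior, the following Python sketch models two private caches on a shared bus. It is a deliberately simplified toy, not MESIF or MOESI: the `Core` and `Bus` classes are hypothetical, there are no per-line states, and writes go straight through to memory.

```python
class Core:
    def __init__(self, name):
        self.name = name
        self.cache = {}                  # address -> locally cached value

class Bus:
    """Toy write-invalidate bus; real protocols track per-line states."""
    def __init__(self, cores):
        self.cores = cores
        self.memory = {}

    def read(self, core, addr):
        if addr not in core.cache:       # miss: fetch the current value
            core.cache[addr] = self.memory.get(addr, 0)
        return core.cache[addr]

    def write(self, core, addr, value):
        for other in self.cores:         # snoop: drop now-stale copies
            if other is not core:
                other.cache.pop(addr, None)
        core.cache[addr] = value
        self.memory[addr] = value        # write-through, for simplicity

if __name__ == "__main__":
    a, b = Core("A"), Core("B")
    bus = Bus([a, b])
    print(bus.read(b, 0x40))             # B caches the old value
    bus.write(a, 0x40, 7)                # A's write invalidates B's copy
    print(bus.read(b, 0x40))             # B misses and sees the new value
```

The point of the toy is only that a read after a remote write can never return the stale copy, which is the informal coherence property quoted above.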

The cache coherence policy also maintains the illusion of a unified memory model for backward compatibility. The original Intel 8086, released in 1978, operated in real-address mode to implement what is essentially a pure Von Neumann architecture with no separation between data and code. Modern processors have more sophisticated memory architectures with separate L1 caches for code (L1i) and data (L1d). The need for backward compatibility with the simple 8086 memory model has led modern CPUs to a split-cache Harvard architecture whereby the cache coherence protocol ensures that the (L1) data and instruction caches are always coherent.

2.3 Memory Ordering

A Memory Consistency Model, or Memory Ordering Model, is a contract between the microarchitecture and the programmer regarding the permissible order of memory operations in parallel programs. Memory consistency is often confused with memory coherence, but where coherence hides the complex memory hierarchy in multicore systems, consistency deals with the order of read/write operations.

Consider the program in Figure 2. In the simplest consistency model, 'Sequential Consistency' (SC), all cores see all operations in the same order as specified by the program. In other words, A0 always executes before A1 and B0 always before B1. Thus, a valid memory order would be A0-B0-A1-B1, while A1-B1-A0-B0 would be invalid.

Intel and AMD CPUs implement Total-Store-Order (TSO), which is equivalent to SC apart from one case: a store followed by a load on a different address may be reordered. This allows cores to use a private store buffer to hide the latency of

Figure 2: Memory Ordering Example

store operations In the example of Figure 2 the store opera-tions A0 and B0 write their values initially only in their privatestore buffers The subsequent loads A1 and B1 will now readthe stale value 0 until the stores are globally visible Thusthe order A1-B1-A0-B0 (r1=0 r2=0) is also valid

2.4 Memory Disambiguation

Loads must normally be executed only after all the preceding stores to the same memory locations. However, modern processors rely on speculative optimizations based on memory disambiguation to allow loads to be executed before the addresses of all preceding stores are computed. In particular, if a load is predicted not to alias a preceding store, the CPU hoists the load and speculatively executes it before the preceding store address is known. Otherwise, the load is stalled until the preceding store is completed. In case of a no-alias (or hoist) misprediction, the load reads a stale value and the CPU needs to re-issue it after flushing the pipeline [36, 37].
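The hoisting decision and its failure mode can be sketched in a few lines of Python. This is a toy model of a single store followed by a single load, not a description of the actual (undocumented) predictor: the function `store_then_load` and its return convention are our own, and the prediction is simply passed in as a flag.

```python
def store_then_load(memory, store_addr, store_val, load_addr, predict_no_alias):
    """Toy model of one store followed (in program order) by one load.

    Returns (value first seen by the load, architectural value, machine_clear).
    """
    if not predict_no_alias:
        memory[store_addr] = store_val        # load stalls until the store is done
        value = memory[load_addr]
        return value, value, False
    hoisted = memory[load_addr]               # load executes before the store
    memory[store_addr] = store_val
    mispredicted = (store_addr == load_addr)  # the addresses did alias after all
    architectural = memory[load_addr]         # value obtained after the re-issue
    return hoisted, architectural, mispredicted

if __name__ == "__main__":
    mem = {0x10: 0, 0x20: 0}
    # Correct no-alias prediction: the hoisted load is harmless.
    print(store_then_load(dict(mem), 0x10, 5, 0x20, True))
    # Wrong no-alias prediction: the load first sees the stale 0, then 5.
    print(store_then_load(dict(mem), 0x10, 5, 0x10, True))
```

The second call shows the interesting case: the value first seen by the load differs from the architectural one, and the flag marks where a flush and re-issue would be required.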

3 Threat Model

We consider unprivileged attackers who aim to disclose confidential information, such as private keys, passwords, or randomized pointers. We assume an x86-64 based victim machine running the latest microcode and operating system version, with all state-of-the-art mitigations against transient execution attacks enabled. We also consider a victim system with no exploitable vulnerabilities apart from the ones described hereafter. Finally, we assume attackers can run (only) unprivileged code on the victim (e.g., in JavaScript, user processes, or VMs), but seek to leak data across security boundaries.

4 Machine Clears

The Intel Architectures Optimization Reference Manual [36] refers to the root cause of discarding issued µOps as Bad Speculation. Bad speculation consists of two subcategories:

• Branch Mispredict. A misprediction of the direction or target of a branch by the branch predictor will squash all µOps executed within a mispeculated branch.

• Machine Clear (or Nuke). A machine clear condition will flush the entire processor pipeline and restart the execution from the last retired instruction.


Table 1: Machine clear performance counters

| Name | Description |
|---|---|
| MACHINE_CLEARS.COUNT | Number of machine clears of any type |
| MACHINE_CLEARS.SMC | Number of machine clears caused by self/cross-modifying code |
| MACHINE_CLEARS.DISAMBIGUATION | Number of machine clears caused by a memory disambiguation unit misprediction |
| MACHINE_CLEARS.MEMORY_ORDERING | Number of machine clears caused by a memory ordering principle violation |
| MACHINE_CLEARS.FP_ASSIST | Number of machine clears caused by an assisted floating point operation |
| MACHINE_CLEARS.PAGE_FAULT | Number of machine clears caused by a page fault |
| MACHINE_CLEARS.MASKMOV | Number of machine clears caused by an AVX maskmov on an illegal address with a mask set to 0 |

Bad speculation not only causes performance degradation but also security concerns [6–8, 31, 41, 43, 47, 55, 59, 63, 71, 74–76, 79, 82]. In contrast to branch misprediction, extensively studied by security researchers [10, 11, 31, 40, 41, 43, 48], machine clears have undergone little scrutiny. In this paper, we perform the first deep analysis of machine clears and the corresponding root causes of transient execution.

Since machine clears are hardly documented, we examined all the performance counters for every Intel architecture and found a number of relevant counters (Table 1). Some counters are present only in specific architectures. For example, the page fault counter is available only on Goldmont Plus. However, thanks to the generic counter MACHINE_CLEARS.COUNT, it is always possible to count the overall number of machine clears, regardless of the architecture. In the remainder of this work, we will reverse engineer and analyze the causes of machine clears by means of these six counters.

As a general observation, we note that the Floating Point Assist and Page Fault counters immediately suggest that machine clears are also related to microcode assists and faults/exceptions. In particular, further analysis shows that:

• Microcode assists trigger machine clears. The hardware occasionally needs to resort to microcode to handle complex events. Doing so requires flushing all the pending instructions with a machine clear before handling a microcode assist. Indeed, in our experiments where the OTHER_ASSISTS.ANY counter increased, we also observed a matching increase in MACHINE_CLEARS.COUNT.

• Machine clears do not necessarily trigger microcode assists. Not all machine clears are microcode assisted, as some machine clear causes are handled directly in silicon. Indeed, in our experiments we observed that SMC, MD, and MO machine clears cause an increase of MACHINE_CLEARS.COUNT, but leave OTHER_ASSISTS.ANY unaltered.

• An exception triggers a machine clear. When a fault or exception is detected, the subsequent µOps must be flushed from the pipeline with a machine clear, as the execution should resume at the exception handler. Indeed, in our experiments we observed an increase of MACHINE_CLEARS.COUNT at each faulty instruction.

In this paper, we focus on the machine clear causes mentioned by the Intel documentation (Table 1), acknowledging that undocumented causes may still exist (much like undocumented x86 instructions [16]). We now first examine the four most relevant causes of machine clears (Self-Modifying Code MC, Floating Point MC, Memory Ordering MC, and Memory Disambiguation MC), then briefly discuss the other cases.

5 Self-Modifying Code Machine Clear

In a Von Neumann architecture, stores may write instructions as data and modify program code as it is being executed, as long as the code pages are writable. This is commonly referred to as Self-modifying Code (SMC).

Self-modifying code is problematic for the Instruction Fetch Unit (IFU), which maintains high execution throughput by aggressively prefetching the instructions it expects to execute next and feeding them to the decode units. The CPU speculatively fetches and decodes the instructions and feeds them to the execution units, well ahead of retirement.

In case of a misprediction, the CPU flushes the speculatively processed instructions and resumes execution at the correct target. The IFU's aggressive prefetching ensures that the first-level instruction cache (L1i) is constantly filled with instructions which are either currently in (possibly speculative) execution or about to be executed. As a result, a store instruction targeting code cached in L1i requires drastic measures, as the associated cache lines should now be invalidated. Moreover, the target instructions do not even need to be part of the actual execution: since a prefetch is sufficient to bring them into L1i, any write to prefetched instructions also invalidates the prefetch queue. In other words, the problem occurs when the code is already in L1i or the store is sufficiently close to the target code to ensure the target is prefetched in L1i. This behavior leads to a temporary desynchronization between the code and data views of the CPU, transiently breaking the architectural memory model (where L1d/L1i coherence ensures consistent code/data views).

To reverse engineer this behavior, we use the analysis code exemplified by Listing 1. The store at line 15 overwrites code already cached in L1i (lines 18-21), triggering a machine clear. The machine clear needs to update the L1i cache (and related microarchitectural structures) by flushing any stale instruction(s), resuming execution at the last retired instruction, and then fetching the new instructions. To test for the presence of a transient execution path, our analysis code immediately executes lines 18-21 targeted by the store and jumps to a spec_code gadget (lines 29-32), which fills a number of cache lines in a (flushed) buffer. Architecturally, this gadget should never be executed, as the store instruction should nop out the branch at line 18. However, microarchitecturally, we do observe multiple cache hits in the (reload) buffer using FLUSH + RELOAD, which confirms the existence of a pre-SMC-handling transient execution path executing stale code and leaving observable traces in the cache. We observe that the scheduling of the store instruction heavily affects the length of the transient path. Indeed, we use different instructions (lines 5-8) to delay as much as possible the store retirement and thus the SMC detection. This suggests that the root cause of the observed transient window might be the microarchitectural de-synchronization between the store buffer (new code) and the instruction queue (stale code), yielding transiently incoherent code/data views. We sampled machine clear performance counters to confirm the transient execution window is caused by the SMC machine clear and not by other events (e.g., memory disambiguation misprediction). Finally, we repeated our experiments on uncached code memory and could not observe any transient path. The counters revealed one machine clear triggered for each executed instruction, since the CPU has to pessimistically assume every fetched instruction has potentially been overwritten. Additionally, we have verified that SMC detection is performed on physical addresses rather than virtual ones.

Cross-Modifying Code

Instead of modifying its own instructions, a thread running on one core may also write data into the currently executing code segment of a thread running on a different physical core. Such Cross-Modifying Code (XMC) may be synchronous (the executor thread waits for the writer thread to complete before executing the new code) or asynchronous, with no particular coordination between threads. To reverse engineer the behavior, we distributed our analysis code across cores and reproduced a signal on the reload buffer in both cases. This confirms a Cross-Modifying Code Machine Clear (XMC machine clear) behaves similarly to an SMC machine clear across cores, with a store on the writer core originating a transient execution window on the executor core.

Listing 1: Self-Modifying Code Machine Clear analysis code

     1  smc_snippet:
     2    push r11
     3    lea r11, [target]      ; Load addr of target instr (line 17)
     4
     5    clflush [r11]          ; These instructions serve as a delay
     6    %rep 10                ; for the store argument address. They
     7    imul r11, 1            ; ensure that the execution window of
     8    %endrep                ; spec_code is as long as possible
     9
    10    ; Code to write as data: 8 nops (overwriting lines 18-21)
    11    mov rax, 0x9090909090909090
    12
    13    ; Store at target addr. Also the last retired instr
    14    ; from which the execution will resume after the SMC MC
    15    mov QWORD [r11], rax
    16
    17  target:                  ; Target instruction to be modified
    18    jmp spec_code
    19    nop
    20    nop
    21    nop
    22
    23    ; Architectural exit point of the function
    24    pop r11
    25    ret
    26
    27    ; Code executed speculatively (flushed after SMC MC)
    28  spec_code:
    29    mov rax, [rdi+0x0]     ; rdi: covert channel reload buffer
    30    mov rax, [rdi+0x400]
    31    mov rax, [rdi+0x800]
    32    mov rax, [rdi+0xc00]

6 Floating Point Machine Clear

On Intel, when the Floating Point Unit (FPU) is unable to produce results in the IEEE-754 [1] format directly, for instance in the case of denormal operands or results [4, 17], the CPU requires special handling to produce a correct result. According to an Intel patent [62], the denormalization is indeed implemented as a microcode assist or an exception handler, since the corresponding hardware would be too complex. In our experiments, we observed microcode assists on all x87, SSE, and AVX instructions that perform mathematical operations on denormal floating-point (FP) numbers. Increments of the FP_ASSIST.ANY or MACHINE_CLEARS.COUNT (or, on older processors, MACHINE_CLEARS.FP_ASSIST) performance counters confirm such assists cause machine clears.

Since a machine clear implies a pipeline flush, the assisted FP operation will be squashed together with subsequent µOps. To reverse engineer the behavior, we used analysis code exemplified by Algorithm 1. Our code relies on a FLUSH + RELOAD [81] covert channel to observe the result of a floating-point operation at the byte granularity. In our experiments, we observed two different hits in the reload buffer for each byte, for the transient and architectural result, respectively. The double-hit microarchitectural trace confirms that the transient (and wrong) value generated by the FPU is used in subsequent µOps, as also exemplified in Table 2. Later, the CPU detects the error and triggers a machine clear to flush the wrongly executed path. The microcode assist then corrects the result, while subsequent instructions are reissued.

While we could not find any documentation on floating-point assist handling (even in the patents), our experiments revealed the following important properties. First, we verified that many FP operations can trigger FP assists (i.e., add, sub, mul, div, and sqrt) across different extensions (i.e., x87, SSE, and AVX). Second, the transient result is computed by "blindly" executing the operation as if both operands and result are normal numbers (see Table 2). Third, the detection of the wrong computation occurs later in time, creating a transient execution window. Finally, by performing multiple floating-point operations together with the assisted one, we were able to expand the size of the window, suggesting that detection is delayed if the FPU is busy handling multiple operations.

Algorithm 1: Floating Point Machine Clear analysis (pseudo)code. byte(z, i) is used to extract the i-th byte of z

    1: for i ← 1, 8 do
    2:   flush(reload_buf)
    3:   z = x / y                          ▷ Any denormal FP operation
    4:   reload_buf[byte(z, i) * 1024]
    5:   reload(reload_buf)
    6: end for

| | Representation | Value | Type | Exp |
|---|---|---|---|---|
| x | 0x001**0deadbeef1337** | 2.34e-308 | N | -1022 |
| y | 0x40f**0000000000000** | 65536 (2^16) | N | 16 |
| zarch | 0x000**00010deadbeef** | 3.57e-313 | D | -1022 |
| ztran | 0x3f1**0deadbeef1337** | 6.43e-05 | N | -14 |

Table 2: Architectural (zarch) and transient (ztran) results of dividing x and y of Algorithm 1. N: Normal, D: Denormal representations. The mantissa is in bold. A normal division by 2^16 leaves the mantissa untouched and subtracts 16 from the exponent: this is the result ztran, where the exponent overflowed from -1022 to -14.

Figure 3: Transient execution due to invalid memory ordering
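The values in Table 2 can be cross-checked on any host whose FPU handles denormals in the default IEEE-754 way. The Python sketch below recomputes the architectural result of x / y and compares the fields of the transient value reported in the table. The helper names are our own, and the final modular-arithmetic check is only an observation about the numbers in Table 2, not a documented description of how the hardware derives the transient exponent.

```python
import struct

def to_double(bits: int) -> float:
    """Reinterpret a 64-bit pattern as an IEEE-754 double."""
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]

def to_bits(value: float) -> int:
    """Reinterpret a double as its 64-bit pattern."""
    return int.from_bytes(struct.pack(">d", value), "big")

def fields(bits: int):
    """Return (biased exponent field, 52-bit mantissa field)."""
    return (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

X = 0x0010deadbeef1337               # normal, unbiased exponent -1022
Y = 0x40f0000000000000               # 65536.0 = 2^16
Z_TRAN = 0x3f10deadbeef1337          # transient result reported in Table 2

# Architectural result: computed by the host FPU with denormals enabled.
z_arch = to_bits(to_double(X) / to_double(Y))

x_exp, x_man = fields(X)
t_exp, t_man = fields(Z_TRAN)

if __name__ == "__main__":
    print(f"z_arch = {z_arch:#018x} ({to_double(z_arch):.3e})")
    print(f"z_tran = {Z_TRAN:#018x} ({to_double(Z_TRAN):.3e})")
    print("mantissa preserved in z_tran:", t_man == x_man)
    print("unbiased exponent of z_tran:", t_exp - 1023)
    # Observation about the table only (assumption, not a hardware rule):
    # the transient exponent field equals (x_exp - 16) modulo 2^10.
    print("(x_exp - 16) mod 1024 == t_exp:", (x_exp - 16) % 1024 == t_exp)
```

The architectural quotient loses the low mantissa bits to denormalization, whereas the transient value keeps the full mantissa of x, which is what makes the two FLUSH + RELOAD hits distinguishable.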

7 Memory Ordering Machine Clear

The CPU initiates a memory ordering (MO) machine clear when, upon receiving a snoop request, it is uncertain if memory ordering will be preserved [36]. Consider the program in Figure 3. Processor A loads X and Y, while processor B performs two stores to the same locations. If the load of X is slow due to a cache miss, the out-of-order CPU will issue the next load (and subsequent operations) ahead of schedule. Suppose that while the load of X is pending, processor B signals, through a snoop request, that the values of X and Y have changed. In this scenario, the memory ordering is A1-B0-B1-A0, which is not allowed according to the Total-Store-Order memory model. As two loads cannot be reordered, r1=1, r2=0 is an illegal result. Thus, processor A has no choice other than to flush its pipeline and re-issue the load of Y in the correct order. This MO machine clear is needed for every inconsistent speculation on the memory ordering, implementing speculation behavior originally proposed by Gharachorloo et al. [27], with the advantage that strict memory order principles can co-exist with aggressive out-of-order scheduling.

Algorithm 2: Pseudo-code triggering a MO machine clear

    Processor A
    1: clflush(X)            ▷ Make the load slow
    2: unlock(lock)          ▷ Synchronize loads and stores
    3: r1 ← [X]
    4: r2 ← [Y]
    5: reload(r2)            ▷ For FLUSH + RELOAD

    Processor B
    1: wait(lock)            ▷ Synchronize loads and stores
    2: 1 → [Y]

Notice that in the previous example, the store on X is not even necessary to cause a memory ordering violation, since A1-B1-A0 is still an invalid order. Counterintuitively, the memory ordering violation disappears if the load on X is not performed, as A1-B0-B1 is a perfectly valid order.

To reverse engineer the memory ordering handling behavior on Intel CPUs, we used analysis code exemplified by Algorithm 2. Our code mimics the scenario of Figure 3, with loads/stores from two threads racing against each other, but relies on a FLUSH + RELOAD [81] covert channel to observe the loaded value of Y. The synchronization through the lock variable ensures the desired (problematic) proximity and ordering of the memory operations.

As before, in our experiments we observed two different hits in the reload buffer for the loaded value of Y: one for the stale (transient) value and one for the new (architectural) value. For every double hit in the reload buffer, we also measured an increase of MACHINE_CLEARS.MEMORY_ORDERING. We also ran experiments without the load of X. Even though memory ordering violations are no longer possible, since each processor executes a single memory operation and any order is permitted, we still observed MO machine clears. The reason is that Intel CPUs seem to resort to a simple but conservative approach to memory ordering violation detection, where the CPU initiates a MO machine clear when a snoop request from another processor matches the source of any load operation in the pipeline. We obtained the same results with the two threads running across physical or logical (hyperthreaded) cores. Similarly, the results are unchanged if the matching store and load are performed on different addresses: the only requirement we observed is that the memory operations need to refer to the same cache line. Overall, our results confirm the presence of a transient execution window and the ability of a thread to trigger a transient execution path in another thread by simply dirtying a cache line used in a ready-to-commit load. Exploitation-wise, abusing this type of MC is non-trivial, due to the strict synchronization requirements and the difficulty of controlling pending stale data.
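The conservative detection rule inferred above can be summarized as a simple predicate. The sketch below is only a paraphrase of the observed behavior, not the actual hardware logic: the function names are hypothetical and the 64-byte line size is an assumption (it is the usual line size on current x86 parts).

```python
CACHE_LINE = 64  # bytes; assumed line size

def same_line(addr_a: int, addr_b: int) -> bool:
    """True if both byte addresses fall into the same cache line."""
    return addr_a // CACHE_LINE == addr_b // CACHE_LINE

def mo_machine_clear(snoop_addr: int, inflight_load_addrs) -> bool:
    """Conservative rule: clear if the snooped line matches any pending load,
    regardless of whether a real ordering violation would be observable."""
    return any(same_line(snoop_addr, a) for a in inflight_load_addrs)

if __name__ == "__main__":
    pending = [0x1008]                        # a load still in the pipeline
    print(mo_machine_clear(0x1000, pending))  # same line, different address
    print(mo_machine_clear(0x1040, pending))  # next line: no match
```

Under this rule, a remote store to any byte of the line triggers the clear, which is consistent with the observation that differing addresses within one cache line still produce MO machine clears.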

8 Memory Disambiguation Machine Clear

As suggested by the MACHINE_CLEARS.DISAMBIGUATION counter description, memory disambiguation (MD) mispredictions are handled via machine clears. Our experiments confirmed this behavior by observing matching increases of MACHINE_CLEARS.COUNT. Moreover, we observed no changes in microcode assist counters, suggesting mispredictions are resolved entirely in hardware. In case of a misprediction, a stale value is passed to subsequent loads, a primitive that was previously used to leak secret information with Spectre Variant 4 or Speculative Store Bypass (SSB) [55, 82].

Different from the other machine clears, MD machine clears trigger only on address aliasing mispredictions, when the CPU wrongly predicts that a load does not alias a preceding store instruction and can be hoisted. Similar to branch prediction, generating an MD-based transient execution window requires mistraining the underlying predictor. The latter has complex, undocumented behavior which has been partially reverse engineered [18]. For space constraints, we discuss our full reverse engineering strategy in Appendix A. Our results show that executing the same load 64 + 15 times with non-aliasing stores is sufficient to ensure the next prediction is "no-alias" (and thus the next load is hoisted). Upon reaching the hoisting prediction, one can perform the load with an aliasing store to trigger the transient path exposed to the incorrect, stale value.

Finally, we experimentally verified that 4k aliasing [50] does not cause any machine clear but only incurs a further time penalty in case of wrong aliasing predictions. More details on 4k aliasing results can be found in Appendix A.

9 Other Types of Machine Clear

AVX vmaskmov. The AVX vmaskmov instructions perform conditional packed load and store operations depending on a bitmask. For example, a vmaskmovpd load may read 4 packed doubles from memory depending on a 4-bit mask: each double will be read only if the corresponding mask bit is set, and the others will be assigned the value 0.

According to the Intel Optimization Reference Manual [36], the instruction does not generate an exception in face of invalid addresses, provided they are masked out. However, our experiments confirm that it does incur a machine clear (and a microcode assist) when accessing an invalid address (e.g., with the present bit set to 0) with a loading mask set to zero (i.e., no bytes should be loaded), to check whether the bytes in the invalid address have the corresponding mask bits set or not. We speculate the special handling is needed because the permission check is very complex, especially in the absence of memory alignment requirements.

In our experiments, vmaskmov instructions with all-zero masks and invalid addresses increment the OTHER_ASSISTS.ANY and MACHINE_CLEARS.COUNT counters, confirming that the instruction triggers a machine clear. However, the resulting transient execution window seems short-lived or absent, as we were unable to observe cache or other microarchitectural side effects of the execution.

Exceptions. The MACHINE_CLEARS.PAGE_FAULT counter present on older microarchitectures confirms page faults are another cause of machine clears. Indeed, we verified each instruction incurring a page fault or any other exception, such as "Division by zero", increments the MACHINE_CLEARS.COUNT counter. We also verified exceptions do not trigger microcode assists, and software interrupts (traps) do not trigger machine clears. Indeed, the Intel documentation [37] specifies that instructions following a trap may be fetched but not speculatively executed. Transient execution windows originating from exceptions, and page faults in particular, have been extensively used in prior work, with a faulty load instruction also used as the trigger to leak information [7, 47, 63, 75].

Hardware interrupts. Although hardware interrupts are an undocumented cause of machine clears, our experiments with APIC timer interrupts showed they do increment the MACHINE_CLEARS.COUNT counter. While this confirms hardware interrupts are another root cause of transient execution, the asynchronous nature of these events yields a less than ideal vector for transient execution attacks. Nonetheless, hardware interrupts play an important role in other classes of microarchitectural attacks [73].

Microcode assists. Microcode assists require a pipeline flush to insert the required µOps in the frontend and represent a subclass of machine clears (and thus a root cause of transient execution) for cases where a fast path in hardware is not available. In this paper, we detailed the behavior of floating-point and vmaskmov assists. Prior work has discussed different situations requiring microcode assists, such as those related to page table entry Access/Dirty bits, typically in the context of assisted loads used as the trigger to leak information [8, 63, 75]. AVX-to-SSE transitions [36] represent another microcode assist which, based on our experiments, we could not observe on modern Intel CPUs. Remaining known microcode assists, such as access control of memory pages belonging to SGX secure enclaves and Precise Event Based Sampling (PEBS), are not presented in this work since they are already studied [14] or only related to privileged performance profiling, respectively.

[Figure 4: two bar plots over the mechanisms F+R, TSX, BHT, FAULT, SMC, XMC, FP, MD, and MO. Top y-axis: Number of Transient Loads (0 to 160). Bottom y-axis: Leakage Rate [Mb/s] (0 to 4). CPUs: Intel Core i7-10700K, Intel Xeon Silver 4214, Intel Core i9-9900K, Intel Core i7-7700K, AMD Ryzen 5 5600X, AMD Ryzen Threadripper 2990WX, AMD Ryzen 7 2700X.]

Figure 4: Top plot: transient window size vs. mechanism. Each bar reports the number of transient loads that complete and leave a microarchitectural trace. Bottom plot: leakage rate vs. mechanism for a simple Spectre Bounds-Check-Bypass attack and a 1-bit FLUSH + RELOAD (F+R) cache covert channel. F+R is the leakage rate upper bound (covert channel loop only, no actual transient window or attack).

10 Transient Execution Capabilities

Transient execution attacks rely on crafting a transient window to issue instructions that are never retired. For this purpose, state-of-the-art attacks traditionally rely on mechanisms based on root causes such as branch mispredictions (BHT) [41, 43], faulty loads (Fault) [8, 47, 63, 75], or memory transaction aborts (TSX) [8, 47, 59, 63, 75]. However, the different machine clears discussed in this paper provide an attacker with the exact same capabilities.

To compare the capabilities of machine clear-based transient windows with those of more traditional mechanisms, we implemented a framework able to run arbitrary attacker-controlled code in a window generated by a mechanism of choice. We now evaluate our framework on recent processors (with all the microcode updates and mitigations enabled) to compare the transient window size and leakage rate of the different mechanisms.

10.1 Transient Window Size

The transient window size provides an indication of the number of operations an attacker can issue on a transient path before the results are squashed. Larger windows can, in principle, host more complex attacks. Using a classic FLUSH + RELOAD cache covert channel (F+R) as a reference, we measure the window size by counting how many transient loads can complete and hit entries in a designated F+R buffer. Figure 4 (top) presents our results.
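As a rough illustration of the counting step only, the sketch below classifies the reload latency of each F+R buffer entry as a hit or a miss and counts the hits. The function name, the 100-cycle threshold, and the latency values are hypothetical placeholders, not values taken from the paper's framework.

```python
def count_transient_loads(reload_cycles, hit_threshold=100):
    """Count F+R buffer entries whose reload latency indicates a cache hit.

    reload_cycles: one reload latency (in cycles) per buffer entry.
    hit_threshold: hypothetical cutoff separating cached from uncached lines.
    """
    return sum(1 for cycles in reload_cycles if cycles < hit_threshold)


# Made-up latencies: the first three entries were touched transiently.
example = [42, 51, 47, 310, 298, 305]
window_size = count_transient_loads(example)
```

In practice the threshold would have to be calibrated per machine, since hit and miss latencies differ across microarchitectures.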

As shown in the figure, the window size varies greatly across the different mechanisms. Broadly speaking, mechanisms that have a higher detection cost, such as XMC and MO Machine Clear, yield larger window sizes. Not surprisingly, branch mispredictions yield the largest window sizes, as we can significantly slow down the branch resolution process (i.e., causing cache misses) and delay detection. FP, on the other hand, yields the shortest windows, suggesting that denormal numbers are efficiently detected inside the FPU. Our results also show that, while our framework was designed for Intel processors, similar if not better results can usually be obtained on AMD processors (where we use the same conservative training code for branch/memory prediction). This shows that both CPU families share a similar implementation in all cases except for MD, where the used mistraining pattern is not valid for pre-Zen3 architectures.

10.2 Leakage Rate

To compare the leakage rates for the different transient execution mechanisms, we transiently read and repeatedly leak data from a large memory region through a classic F+R cache covert channel. We report the resulting leakage rates (as the number of bits successfully leaked per second) across different microarchitectures, using a 1-bit covert channel to highlight the time complexity of each mechanism. We consider data to be successfully leaked after a single correct hit in the reload buffer. In case of a miss for a particular value, we restart the leak for the same value until we get a hit (or until we get 100 misses in a row).
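The retry policy described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical callable `attempt` that performs one transient leak plus reload and reports whether the reload buffer showed a correct hit; it is not the paper's measurement harness.

```python
def leak_value(attempt, max_misses=100):
    """Repeat one leak attempt until a hit, or give up after max_misses misses in a row.

    attempt: hypothetical callable returning True on a correct hit
             in the reload buffer.
    Returns (leaked, attempts_used).
    """
    for tries in range(1, max_misses + 1):
        if attempt():
            return True, tries
    return False, max_misses


def leakage_rate(bits_leaked, elapsed_seconds):
    """Bits successfully leaked per second."""
    return bits_leaked / elapsed_seconds


# Example: two misses followed by a hit.
outcomes = iter([False, False, True])
result = leak_value(lambda: next(outcomes))
```

Because every miss repeats the whole mechanism, an unreliable mechanism lowers the rate even when a single iteration is cheap.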

As shown in Figure 4 (bottom), different Intel and AMD microarchitectures generally yield similar leakage rates, with some variations. For instance, FP MC offers better leakage rates on Intel. This difference stems from the different performance impact of the corresponding machine clears on Intel vs. AMD microarchitectures.

[Figure 5: bar plot of leakage rate [Mb/s] over the mechanisms F+R, TSX, BHT, FP, SMC, XMC, MO, MD, and FAULT (x-axis: Transient Execution Management), for F+R granularities of 1, 4, and 8 bits.]

Figure 5: Leakage rate vs. mechanism with a 1-bit, 4-bit, or 8-bit FLUSH + RELOAD cache covert channel (Intel Core i9).

As shown in Figure 5, the leakage rate instead varies greatly across the different mechanisms, and so does the optimal covert channel bitwidth. Indeed, while existing attacks typically rely on 8-bit covert channels, our results suggest 1-bit or 4-bit channels can be much more efficient, depending on the specific mechanism. Roughly speaking, optimal leakage rates can be obtained by balancing the time complexity (and hence bitwidth) of the covert channel with that of the mechanism. For example, FP is a lightweight and reliable mechanism, hence using a comparably fast and narrow 1-bit covert channel is beneficial. In contrast, MD requires a time-consuming predictor training phase between leak iterations, and leaking more bits per iteration with a 4-bit covert channel is more efficient. Interestingly, a classic 8-bit covert channel yields consistently worse, yet comparable, leakage rates across all the mechanisms, since F+R dominates the execution time.
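The balancing argument can be made concrete with a toy cost model. This is an assumption made for illustration, not a model from the paper: each leak iteration pays a fixed mechanism cost once and then reloads 2^b covert channel entries for a b-bit channel. All costs are in arbitrary, made-up time units.

```python
def modeled_rate(bits, mechanism_cost, reload_cost=1.0):
    """Leaked bits per time unit in a toy model.

    One iteration pays the mechanism cost once, then reloads 2**bits
    covert channel entries (one per possible value).
    """
    return bits / (mechanism_cost + (2 ** bits) * reload_cost)


def best_bitwidth(mechanism_cost, candidates=(1, 4, 8)):
    """Pick the candidate bitwidth with the highest modeled rate."""
    return max(candidates, key=lambda b: modeled_rate(b, mechanism_cost))


cheap = best_bitwidth(mechanism_cost=1.0)    # lightweight mechanism
costly = best_bitwidth(mechanism_cost=50.0)  # expensive training phase
```

Under these assumed costs the cheap mechanism favors the 1-bit channel and the costly one favors the 4-bit channel, while the 8-bit channel is dominated by its 256 reloads in both cases, which matches the qualitative trend described above.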

Our results show that only two mechanisms (TSX and FP) are close to the maximum theoretical leakage rate of pure F+R. Moreover, FP performs as efficiently as TSX but, unlike TSX, is available on both Intel and AMD, is always enabled, and can be used from managed code (e.g., JavaScript). BHT, on the other hand, yields the worst leakage rates due to the inefficient training-based transient window. The BHT leakage rate can be improved if a tailored mistraining sequence is used, as in MD (Appendix A). Overall, our results show that machine clear-based windows achieve comparable, and in many cases (e.g., FP) better, leakage rates compared to traditional mechanisms. Moreover, many machine clears eliminate the need for mistraining, which, besides resulting in efficient leakage rates, can sidestep existing pattern-based mitigations and the disabling of hardware extensions (e.g., Intel TSX).

11 Attack Primitives

Building on our reverse engineering results and focusing on the unexplored SMC and FP machine clears, we now present two new transient execution attack primitives and analyze their security implications. We also present an end-to-end FPVI exploit disclosing arbitrary memory in Firefox. Later, we discuss mitigations.

11.1 Speculative Code Store Bypass (SCSB)

Our first attack primitive, Speculative Code Store Bypass (SCSB), allows an attacker to execute stale, controlled code in a transient execution window originated by an SMC machine clear. Since the primitive relies on SMC, its primary applicability is on JIT (e.g., JavaScript) engines running attacker-controlled code, although OS kernels and hypervisors storing code pages and allowing their execution without first issuing a serializing instruction are also potentially affected.

Figure 6: SCSB primitive example, where the instruction pointer is pointing at the bold code blocks. g code is freed JIT'ed code (of some g function) under the attacker's control. (1) Force engine to JIT and execute code of function f, causing desynchronization of code and data views. (2) Execute stale code and SMC MC. (3) After the SMC MC, code and data view coherence is restored and the new code is executed.

As exemplified in Figure 6, the operations of the primitive can be broken down into three steps: (1) the JIT engine compiles a function f, storing the generated code into a JIT code cache region previously used by a (now-stale) version of function g; (2) the JIT engine jumps to the newly generated code for the function f, but due to the temporary desynchronization between the code and data views of the CPU, this causes transient execution of the stale code of g until the SMC machine clear is processed; (3) after the pipeline flush, the code and data views are resynchronized and the CPU restarts the execution of the correct code of f. For exploitation, the attacker needs to (i) massage the JIT code cache allocator to reuse a freed region with a target gadget g of choice, and (ii) force the JIT engine to generate and execute new (f) code in such a region, enabling transient out-of-context execution of the gadget and spilling secrets into the microarchitectural state.
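The three steps can be mimicked with a deliberately simplified software model. The `ToyCore` class below is a hypothetical abstraction invented for illustration: it tracks a "data view" and a "fetch view" of one code region and treats the machine clear as the point where both are resynchronized. It does not describe how a real frontend is implemented.

```python
class ToyCore:
    """Toy core whose instruction view can lag behind stores to code."""

    def __init__(self, code):
        self.data_view = list(code)    # what loads and stores observe
        self.fetch_view = list(code)   # what the frontend executes

    def store_code(self, new_code):
        # Step 1: the JIT writes f over the freed region; only the data view changes.
        self.data_view = list(new_code)

    def serialize(self):
        # A serializing instruction forces both views to agree.
        self.fetch_view = list(self.data_view)

    def execute(self):
        transient = []
        if self.fetch_view != self.data_view:
            # Step 2: stale code runs transiently until the SMC machine clear.
            transient = list(self.fetch_view)
            # Step 3: the pipeline flush resynchronizes the views.
            self.fetch_view = list(self.data_view)
        return transient, list(self.fetch_view)


core = ToyCore(["g: load secret", "g: touch probe[secret]"])
core.store_code(["f: return 0"])
transient, architectural = core.execute()
```

In this model, calling `serialize()` between `store_code()` and `execute()` leaves the transient list empty, which is the effect a serializing instruction after the code update is meant to have.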

Our primitive bears similarities with both transient and architectural primitives used in prior attacks. On the transient front, our primitive is conceptually similar to a Speculative-Store-Bypass (SSB) primitive [55, 82], but can transiently execute stale code rather than reading stale data. Interestingly, however, the underlying causes of the two primitives are quite different (MD misprediction vs. SMC machine clear). On the architectural front, our primitive mimics classic Use-After-Free (UAF) exploitation on the JIT code cache, also known as Return-After-Free (RAF) in the hacker community [19, 25]. An example is CVE-2018-0946, where a use-after-free vulnerability can be exploited to force the Chakra JS engine to erroneously execute freed (attacker-controlled) JIT code, resulting in arbitrary code execution after massaging the right gadget into the JIT code cache [24].

Figure 7: Coding options suggested by the Intel Architectures Software Developer's Manuals to handle SMC and XMC execution. Option 1 describes the exact steps required by our Speculative Code Store Bypass attack primitive, potentially resulting in exploitable gadgets.

Indeed, at a high level, SCSB yields a transient use-after-free primitive on a JIT code cache, with exploitation properties similar to its architectural counterpart. However, there are some differences due to the transient nature of SCSB. First, we need to find an out-of-context gadget to transiently leak data, rather than architecturally execute arbitrary code. In JavaScript engines, similar gadgets have already been exploited by ret2spec [48], escalating out-of-context transient execution of valid code to type confusion, arbitrary reads, and ultimately a secret-dependent load to transmit the value.

Second, we need to target short-lived JIT code cache updates, so that the newly generated code is immediately executed, fitting the target gadget in the resulting transient execution window. Interestingly, executing JIT'ed code is the next step after JIT code generation mentioned in the Intel manual (Figure 7, Option 1), suggesting that developers who follow such directives ad litteram can easily introduce such gadgets. Moreover, modern JavaScript compilers feature a multi-stage optimization pipeline [12, 54], and short-lived JIT code cache updates are favored for just-in-time (re)optimization. Indeed, we found test code in V8 that specifically tests such updates and verifies instruction/data cache coherence [70].

Finally, we need to ensure JIT code cache updates are not accompanied by barriers that force immediate synchronization of the code and data views. We analyzed the code of the popular SpiderMonkey and V8 JavaScript engines and verified that the functions called upon JIT code cache updates to synchronize instruction and data caches (Listing 2 and Listing 3) are always empty on x86 (as expected, given Intel's primary recommendation in Figure 7).

Listing 2: Chromium instruction cache flush (chromium/src/v8/src/codegen/x64/cpu-x64.cc)

    void CpuFeatures::FlushICache(void* start, size_t size) {
      // No need to flush the instruction cache on Intel
    }

Listing 3: Firefox instruction cache flush (mozilla-unified/js/src/jit/FlushICache.h)

    inline void FlushICache(void* code, size_t size,
                            bool codeIsThreadLocal = true) {
      // No-op. Code and data caches are coherent on x86 and x64.
    }

Overall, while the exploitation is far from trivial (i.e., having to address the challenges of traditional use-after-free exploitation as well as transient execution exploitation in the browser), we believe SCSB expands the attack surface of transient execution attacks in the browser. We found a number of candidate SCSB gadgets in real-world code, but none of them was ultimately exploitable due to the coincidental presence of some serializing instruction preventing stale code execution. Nonetheless, mitigations are needed to enforce security-by-design. Luckily, as shown later, SCSB is amenable to a practical and efficient mitigation (i.e., at a similar cost as faced by non-x86 architectures). The implementation/performance cost for mitigation is even lower than for its sibling SSB primitive, whose mitigation has been deployed in practice even in absence of known practical exploits [55].

In Table 3, we show that all tested Intel and AMD processors are affected by SCSB. In contrast, ARM is not vulnerable, since SMC updates require explicit software barriers.

11.2 Floating Point Value Injection (FPVI)

Our second attack primitive, Floating Point Value Injection (FPVI), allows an attacker to inject arbitrary values into a transient execution window originated by an FP machine clear. As exemplified in Figure 8, the operations of the primitive can be broken down into four steps: (1) the attacker triggers the execution of a gadget starting with a denormal FP operation in the victim application, with the x and y operands under the attacker's control; (2) the transient z result of the operation is processed by the subsequent gadget instructions, leaving a microarchitectural trace; (3) the CPU detects the error condition (i.e., wrong result of a denormal operation), triggering a machine clear and thus a pipeline flush; (4) the CPU re-executes the entire gadget with the correct architectural z result. For exploitation, the attacker needs to (i) massage the x and y operands to inject the desired z value into the victim transient path, and (ii) target a victim gadget so that the injected value yields a security-sensitive trace, which can be observed with FLUSH + RELOAD or other microarchitectural attacks.

Our primitive bears similarities with Load-Value-Injection (LVI) [71], since both allow attackers to inject controlled values into transient execution. Moreover, both primitives require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues, and hence the triggering conditions, are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. Indeed, while it is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild (much like Spectre [41], LVI [71], etc., this would require gadget scanners, the subject of orthogonal research [56, 58]), we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all the prior Mozilla Firefox mitigations against transient execution attacks [52].

Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x,y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE-754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.
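The encoding can be checked with a few lines of bit arithmetic. The sketch below assumes that the 0xfff88 and 0xfffb0 values quoted above are the five leading hex digits (upper 20 bits) of the boxed value, while the comparison against JSVAL_TAG_MAX_DOUBLE uses the upper 17 bits. The helper names `decode` and `hex_prefix` are illustrative and are not SpiderMonkey identifiers.

```python
TAG_SHIFT = 47                      # payload width in bits
JSVAL_TAG_MAX_DOUBLE = 0x1FFF0      # largest tag still treated as a double
PAYLOAD_MASK = (1 << TAG_SHIFT) - 1


def decode(bits):
    """Split a 64-bit NaN-boxed value into (is_double, tag, payload)."""
    tag = bits >> TAG_SHIFT
    return tag <= JSVAL_TAG_MAX_DOUBLE, tag, bits & PAYLOAD_MASK


def hex_prefix(bits):
    """Upper 20 bits, i.e. the five leading hex digits quoted in the text."""
    return bits >> 44


string_box = 0xFFFB0DEADBEEF000     # string-tagged value with a pointer payload
is_double, tag, payload = decode(string_box)
```

For this value the 17-bit tag exceeds JSVAL_TAG_MAX_DOUBLE, so the low 47 bits are treated as a pointer rather than as part of a double, whereas the bit pattern of -Infinity (0xfff0000000000000) stays below the limit and decodes as a plain double.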

To verify the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see gadget in Figure 8).
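The architectural half of this example can be reproduced in software. The sketch below decodes the quoted operands into IEEE-754 fields, confirms that the divisor is a denormal, and computes the architectural quotient. It cannot reproduce the transient result, which exists only inside the CPU; the helper names are illustrative.

```python
import struct


def fields(bits):
    """Return (sign, biased_exponent, mantissa) of a 64-bit IEEE-754 pattern."""
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


def as_double(bits):
    """Reinterpret a 64-bit pattern as a Python float."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


x_bits, y_bits = 0xC000E8B2C9755600, 0x0004000000000000
_, y_exp, y_mant = fields(y_bits)
y_is_denormal = (y_exp == 0 and y_mant != 0)   # zero exponent, nonzero mantissa
architectural = as_double(x_bits) / as_double(y_bits)
target = fields(0xFFFB0DEADBEEF000)            # fields of the transient value
```

The quotient overflows to -Infinity architecturally, while the quoted transient value has an all-ones exponent and a nonzero mantissa, which is exactly the NaN-shaped pattern that NaN-boxing reserves for tagged values.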

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the length attribute of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 52 bits set to zero, a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables, such as string pointers, can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on an FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0, and, by wrapping this primitive with a variant of an EVICT + RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

Table 3: Tested processors.

Processor                              Microcode    SCSB vulnerable   FPVI vulnerable
Intel Core i7-10700K                   0xe0         ✓                 ✓
Intel Xeon Silver 4214                 0x500001c    ✓                 ✓
Intel Core i9-9900K                    0xde         ✓                 ✓
Intel Core i7-7700K                    0xca         ✓                 ✓
AMD Ryzen 5 5600X                      0xa201009    ✓                 ✓ †
AMD Ryzen 2990WX                       0x800820b    ✓                 ✓ †
AMD Ryzen 7 2700X                      0x800820b    ✓                 ✓ †
Broadcom BCM2711 Cortex-A72 (ARM v8)   -            ✗ §               ✗

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified that any serialization instruction, such as lfence, mfence, or cpuid, placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. This function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows that JavaScript execution time is heavily dominated by JIT code generation/execution and that code updates have negligible impact. Our results show this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
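The effect of the conditional-masking option can be illustrated at the value level. The sketch below assumes the canonical quiet NaN pattern for doubles is 0xfff8000000000000 and replaces any NaN-shaped result with it. It only models the resulting value and is not SpiderMonkey's actual mitigation code; a JIT would emit a branch-free select such as cmov, so that the mask itself cannot be bypassed by misprediction.

```python
EXP_MASK = 0x7FF0000000000000
MANT_MASK = 0x000FFFFFFFFFFFFF
CANONICAL_QNAN = 0xFFF8000000000000  # assumed canonical quiet NaN pattern


def mask_fp_result(bits):
    """Replace any NaN-patterned FP result with the canonical quiet NaN."""
    is_nan = (bits & EXP_MASK) == EXP_MASK and (bits & MANT_MASK) != 0
    # Python's conditional expression only models the selected value;
    # it says nothing about how the select is executed in hardware.
    return CANONICAL_QNAN if is_nan else bits


masked = mask_fp_result(0xFFFB0DEADBEEF000)   # string-tagged transient value
```

After masking, the upper 17 bits of a former string-tagged value equal 0x1fff0, so dependent code sees an ordinary double NaN rather than a pointer, while infinities and finite values pass through unchanged.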

A more general and automated mitigation is for the compiler to place a serializing instruction, such as lfence, after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable to and those resistant to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76]. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by an MDS attacker. For the latter, we limit ourselves to load sinks to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.
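The source/sink idea can be sketched on a toy straight-line IR. Everything below is hypothetical: the tuple format, the opcode names, and the `harden` function are invented for illustration and do not correspond to the LLVM pass. For load, store, and ijmp, the `srcs` field is assumed to hold only address or target registers.

```python
FP_OPS = {"fadd", "fmul", "fdiv"}   # toy FP opcodes acting as sources


def harden(instrs, mds_vulnerable=True):
    """Insert 'lfence' before sinks that consume an unfenced FP result.

    instrs: list of (op, dst, srcs) tuples over register names.
    """
    sinks = {"load", "ijmp"}
    if mds_vulnerable:
        sinks.add("store")           # stores also count on MDS-vulnerable systems
    tainted, out = set(), []
    for op, dst, srcs in instrs:
        uses_taint = any(src in tainted for src in srcs)
        if op in sinks and uses_taint:
            out.append(("lfence", None, ()))
            tainted.clear()          # the fence resolves pending FP results
            uses_taint = False
        out.append((op, dst, srcs))
        if dst is not None:
            if op in FP_OPS or uses_taint:
                tainted.add(dst)     # FP results and values derived from them
            else:
                tainted.discard(dst)
    return out


prog = [
    ("fdiv", "z", ("x", "y")),
    ("add", "p", ("z", "k")),
    ("load", "v", ("p",)),
    ("load", "w", ("v",)),
]
hardened = harden(prog)
```

Only the first load is fenced here, since the second one depends on a value produced after the fence. A real pass would also have to handle control flow and calls, which this straight-line sketch ignores.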

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that on MDS-resistant systems the FPVI transmitters are already covered by the LVI mitigation.

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has a non-trivial performance impact (up to 280%), despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance difference between the MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by using interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

[Figure 9: bar plot of overhead (axis ticks 0 to 800) per benchmark. SPEC FP 2017: 619.lbm_s, 638.imagick_s, 644.nab_s, SPEC17 geomean. SPEC FP 2006: 433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, SPEC06 geomean. LLVM mitigation series: LVI; FPVI (MDS-vulnerable system); FPVI (MDS-resistant system); LVI+FPVI (MDS-vulnerable system).]

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0.

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, an MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root-cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It is worth noting that in Figure 10 there are two separate predictor subclasses, under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74–76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a hardware interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less-than-ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived from likely invariants [15] used by the CPU. Such invariants commonly hold but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them with the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it with the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd, Daniel Genkin, and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.

References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From ghz to mhz effective cpu frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. Smotherspectre: exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. Kaslr: Break it, fix it, repeat. In ACM ASIA CCS 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. Sgxpectre: Stealing intel secrets from sgx enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS 2018.

[16] Christopher Domas. Breaking the x86 isa. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.

[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. Branchscope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts. October 1, 2013. US Patent 8,549,263.

[45] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.

[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.

[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.

[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.

[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. Memjam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358.

[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111.

[53] Mozilla. Spider Monkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h.

[54] Mozilla. SpiderMonkey IonMonkey documentation.

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. Specfuzz: Bringing spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the cpu on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. Spectaint: Speculative taint analysis for discovering spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. Sok: In search of lost time: A review of javascript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers. 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in javascript. In FC, IFCA, 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. Netspectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. Lazyfp: Leaking fpu register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. Meltdownprime and spectreprime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. Sok: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8.

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. Lvi: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary cpu interrupt logic. In ACM CCS 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Sgaxe: How sgx fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Cacheout: Leaking data on intel cpus via cache evictions. arXiv preprint.

[77] WebKit. Browserbench. https://browserbench.org.

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-ng: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+reload: a high resolution, low noise, l3 cache side-channel attack. In USENIX Security 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                      ; rdi: store addr, rsi: load addr
        %rep 10                 ; Trick to delay the store address
        imul rdi, 1
        %endrep
        mov DWORD [rdi], 0x42   ; Store
        mov eax, DWORD [rsi]    ; Load
        %rep 10                 ; Pronounce load timing
        imul eax, 1
        %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturation counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that saturating counter is set to 0
    for(i=0; i<100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for(i=100; i<120; i++) st_ld(mem, mem+64);
    // Trigger a memory ordering machine clear
    for(i=120; i<130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
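
The counter behavior described above can be mimicked in software. The following is a minimal sketch, not the hardware implementation: the type and function names (md_entry, md_predict_hoist, md_update, md_replay_listing5) are hypothetical, and the model assumes a single predictor entry that starts at 0. It replays the call pattern of Listing 5 and reports when hoisting would first be predicted and how many mispredicted hoists (machine clears) would occur.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one per-address predictor entry: a 4-bit
 * saturating counter (values 0..15), as inferred from the experiment. */
#define MD_COUNTER_MAX 15

typedef struct {
    uint8_t counter;
} md_entry;

/* Hoisting is predicted only when the counter is saturated. */
static bool md_predict_hoist(const md_entry *e) {
    return e->counter == MD_COUNTER_MAX;
}

/* Called once the store address is known: reset on alias, else count up. */
static void md_update(md_entry *e, bool aliased) {
    assert(e->counter <= MD_COUNTER_MAX);
    if (aliased)
        e->counter = 0;
    else if (e->counter < MD_COUNTER_MAX)
        e->counter++;
}

typedef struct {
    int first_hoisted_call; /* first call predicted to hoist, -1 if none */
    int machine_clears;     /* hoisted loads that turned out to alias */
} md_replay_result;

/* Replays the st_ld call pattern of Listing 5 against the model. */
static md_replay_result md_replay_listing5(void) {
    md_entry e = { 0 };
    md_replay_result r = { -1, 0 };
    for (int i = 0; i < 130; i++) {
        /* Calls 100..119 load from mem+64 (no alias); the rest alias. */
        bool aliased = (i < 100 || i >= 120);
        bool hoist = md_predict_hoist(&e);
        if (hoist && r.first_hoisted_call < 0)
            r.first_hoisted_call = i;
        if (hoist && aliased)
            r.machine_clears++;
        md_update(&e, aliased);
    }
    return r;
}
```

Under these assumptions, the model predicts hoisting for the first time at call 115 (after the 15 non-aliasing calls 100 to 114) and yields a single machine clear at call 120, which is in line with the behavior reported for Figure 11.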

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange bar: memory ordering machine clear observed.

Listing 6: Snippet observing activation of never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for(int i=0; i<10; i++) {
        for(int j=0; j<19; j++)
            st_ld(mem, mem+64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters), and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe if two unrelated loads in these functions are able to influence each other when varying the number of nops. In our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256k bytes apart in memory (k ∈ ℕ). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
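
The hashing scheme inferred above can be written down in a few lines. This is a sketch under the stated observation (256 predictors indexed by the lowest 8 bits of the load's instruction pointer); the names md_index and md_same_predictor and the example addresses are hypothetical and only serve to illustrate which loads would share a predictor entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Observed on the tested CPUs; the patent suggests 64 entries instead. */
#define MD_NUM_PREDICTORS 256u

/* Index of the per-address predictor used by a load at this address. */
static unsigned md_index(uint64_t load_ip) {
    /* The mask trick below only works for a power-of-two table size. */
    assert((MD_NUM_PREDICTORS & (MD_NUM_PREDICTORS - 1u)) == 0);
    return (unsigned)(load_ip & (MD_NUM_PREDICTORS - 1u));
}

/* Two loads can influence each other when they map to the same entry,
 * i.e., when their instruction pointers are 256k bytes apart. */
static bool md_same_predictor(uint64_t attacker_ip, uint64_t victim_ip) {
    return md_index(attacker_ip) == md_index(victim_ip);
}
```

For instance, with a hypothetical victim load at 0x401000, an attacker load at 0x401000 + 3 * 256 maps to the same entry, while a load at 0x401040 does not (it would only collide under a 64-entry, 6-bit scheme).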

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that after 4 machine clears the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads; y-axis: total measured machine clears).

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0 and disabled otherwise. The flush counter is decremented on every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measure how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal that the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts from the never-hoisting state, our results show that, to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
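
To make the interplay between the per-address counter and the watchdog concrete, the sketch below combines both in a small software model. It is an illustration under assumptions, not the hardware design: the state layout, the names (md_model, md_step, md_loads_until_hoisting, md_replay_listing6), the flush counter range of -1 to 3, and the choice not to reset the streak on a machine clear are all hypothetical; only the constants 15, 4, and 64 come from the measurements above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the predictor plus watchdog. */
enum {
    PER_ADDR_MAX = 15, /* 4-bit per-address saturating counter */
    FLUSH_MIN    = -1, /* below 0: watchdog active, never hoist */
    FLUSH_MAX    = 3,  /* allows at most 4 machine clears in a row */
    FLUSH_PERIOD = 64  /* correct hoists per flush counter increment */
};

typedef struct {
    int per_addr; /* per-address counter, 0..PER_ADDR_MAX */
    int flush;    /* flush counter, FLUSH_MIN..FLUSH_MAX */
    int streak;   /* (would-be) correct hoists since last increment */
} md_model;

static bool md_hoists(const md_model *m) {
    return m->per_addr == PER_ADDR_MAX && m->flush >= 0;
}

/* Executes one store-load pair; returns true if it causes a machine clear. */
static bool md_step(md_model *m, bool aliased) {
    assert(m->flush >= FLUSH_MIN && m->flush <= FLUSH_MAX);
    bool wants_hoist = (m->per_addr == PER_ADDR_MAX);
    bool machine_clear = wants_hoist && m->flush >= 0 && aliased;

    if (machine_clear) {
        if (m->flush > FLUSH_MIN)
            m->flush--;
    } else if (wants_hoist && !aliased && ++m->streak == FLUSH_PERIOD) {
        m->streak = 0;
        if (m->flush < FLUSH_MAX)
            m->flush++;
    }

    if (aliased)
        m->per_addr = 0;
    else if (m->per_addr < PER_ADDR_MAX)
        m->per_addr++;
    return machine_clear;
}

/* Non-aliasing loads needed to leave the never-hoisting state. */
static int md_loads_until_hoisting(void) {
    md_model m = { 0, FLUSH_MIN, 0 };
    int loads = 0;
    while (!md_hoists(&m)) {
        md_step(&m, false);
        loads++;
    }
    return loads;
}

/* Replays Listing 6 from a fully trusted predictor; counts machine clears. */
static int md_replay_listing6(void) {
    md_model m = { 0, FLUSH_MAX, 0 };
    int clears = 0;
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            clears += md_step(&m, false);
        clears += md_step(&m, true);
    }
    return clears;
}
```

In this model, md_loads_until_hoisting() returns 79 (15 loads to saturate the per-address counter plus 64 would-be correct predictions), and md_replay_listing6() returns 4, after which the watchdog keeps the model in the never-hoisting state for the remaining iterations.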

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for(i=0; i<10; i++) st_ld(mem+0x1000, mem+0x1000);
    for(i=10; i<20; i++) st_ld(mem+0x2000, mem+0x2000);
    // Cause 4k aliasing
    for(i=20; i<40; i++) st_ld(mem+0x1000, mem+0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value, and in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load, and (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.

Figure 15: Memory disambiguation unit – simplified view
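
The 4k aliasing condition exercised by Listing 7 can be expressed as a simple address check. The sketch below is a simplification: the helper names (same_page_offset, is_4k_alias) are hypothetical, and it compares only the starting addresses of the store and the load, ignoring access sizes and partially overlapping accesses.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK 0xFFFu /* low 12 bits: offset within a 4 KB page */

/* True when both addresses share the same offset within a 4 KB page. */
static bool same_page_offset(uintptr_t a, uintptr_t b) {
    return ((a ^ b) & PAGE_OFFSET_MASK) == 0;
}

/* False match: same 4 KB page offset but different addresses. */
static bool is_4k_alias(uintptr_t store_addr, uintptr_t load_addr) {
    return store_addr != load_addr && same_page_offset(store_addr, load_addr);
}
```

With the offsets used in Listing 7, a store to mem+0x1000 followed by a load from mem+0x2000 satisfies is_4k_alias (assuming mem is page-aligned or, more generally, for any base address, since both offsets differ by a multiple of 4 KB), whereas a store and load to the same address are a true alias rather than a false match.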

B Root-Causes Description Table

Acronym         Description

BHT             Branch History Table
BTB             Branch Target Buffer
RSB             Return Stack Buffer
MD              Memory Disambiguation Unit
NM              Device Not Available Exception
DE              Divide-by-Zero Exception
UD              Invalid Opcode Exception
GP              General Protection Fault
AC              Alignment Check Exception
SS              Stack Segment Exception
PF              Page Fault
PF - U/S Bit    Page Table Entry User/Supervisor Bit PF
PF - R/W Bit    Page Table Entry Read/Write Bit PF
PF - P Bit      Page Table Entry Present Bit PF
PF - PKU        Page Table Entry Protection Keys PF
BR              Bound Range Exceeded Exception
FP              Floating Point Assist
SMC             Self-Modifying Code
XMC             Cross-Modifying Code
MO              Memory Ordering Principles Violation
MASKMOV         Masked Load/Store Instruction Assist
A/D Bits        Page Table Entry Access/Dirty Bits Assist
TSX             Intel TSX Transaction Abort
UC              Uncachable Memory Assist
PRM             Processor Reserved Memory Assist
HW Interrupts   Hardware Interrupts

  • Introduction
  • Background
    • IEEE-754 Denormal Numbers
    • x86 Cache Coherence
    • Memory Ordering
    • Memory Disambiguation
  • Threat Model
  • Machine Clears
  • Self-Modifying Code Machine Clear
  • Floating Point Machine Clear
  • Memory Ordering Machine Clear
  • Memory Disambiguation Machine Clear
  • Other Types of Machine Clear
  • Transient Execution Capabilities
    • Transient Window Size
    • Leakage Rate
  • Attack Primitives
    • Speculative Code Store Bypass (SCSB)
    • Floating Point Value Injection (FPVI)
  • Mitigations
    • SCSB Mitigation
    • FPVI Mitigation
  • Root Cause-based Classification
  • Related Work
  • Conclusions
  • Reversing Memory Disambiguation
  • Root-Causes Description Table
Page 3: Rage Against the Machine Clear: A Systematic Analysis of

In particular, we show that Self-Modifying Code Machine Clear events allow attackers to transiently execute stale code, while Floating Point Machine Clear events allow them to inject transient values in victim code. We term these primitives Speculative Code Store Bypass (SCSB) and Floating Point Value Injection (FPVI), respectively. The former is loosely related to Speculative Store Bypass (SSB) [55, 82], but allows attackers to transiently reference stale code rather than data. The latter is loosely related to Load Value Injection (LVI) [71], but allows attackers to inject transient values by controlling operands of floating-point operations rather than triggering victim page faults or microcode assists. We also discuss possible gadgets for exploitation in applications such as JIT engines. For instance, we found that developers following instructions detailed in the Intel optimization manual [36] ad litteram could easily introduce SCSB gadgets in their applications. In addition, we show that attackers controlling JIT'ed code can easily inject FPVI gadgets and craft arbitrary memory reads, even from JavaScript in a modern browser using NaN-boxing such as Mozilla Firefox [53].

Moreover, since existing mitigations cannot hinder our primitives, we implement new mitigations to eliminate the uncovered attack surface. As shown in our evaluation, SCSB can be efficiently mitigated with serializing instructions in the practical cases of interest such as JavaScript engines. FPVI can be mitigated using transient execution barriers between the source of injection and its transmit gadgets, either inserted by the programmer or by the compiler. We implemented the general compiler-based mitigation in LLVM [45], measuring an overhead of 32% / 53% on SPECfp 2006 / 2017 (geomean).

While not limited to Intel, our analysis does build on a variety of performance counters available in different generations of Intel CPUs but not on other architectures. Nevertheless, we show that our insights into the root causes of bad speculation also generalize to other architectures, by successfully repeating our transient execution leakage experiments (which we originally designed for Intel) on AMD.

Finally, armed with a deeper understanding of the root causes of transient execution, we also present a new classification for the resulting paths. Existing classifications [10] are entirely attack-centric, focusing on classes of Spectre- or Meltdown-like attacks and blending together (common) causes of transient execution with their uses (i.e., what attackers can do with them). Such classifications cannot easily accommodate the new transient execution paths presented in this paper. For this reason, we propose a new orthogonal, root cause-centric classification, much closer to the sources of bad speculation identified by the chip vendors themselves.

Summarizing, we make the following contributions:

• We systematically explore the root causes of transient execution and closely examine the major, largely unexplored machine clear-based class. To this end, we present the reverse engineering and security analysis of causes such as Floating Point Machine Clear (FP MC), Self-Modifying Code Machine Clear (SMC MC), Memory Ordering Machine Clear (MO MC), and Memory Disambiguation Machine Clear (MD MC).

• We present two novel machine clear-based transient execution attack primitives (FPVI and SCSB) and an end-to-end FPVI exploit disclosing arbitrary memory in Firefox.

• We propose and evaluate possible mitigations.

• We propose a new root-cause based classification of all the known transient execution paths.

Code, exploit demo, and additional information are available at https://www.vusec.net/projects/fpvi-scsb.

2 Background

2.1 IEEE-754 Denormal Numbers

Modern processors implement a Floating Point Unit (FPU). To represent floating point numbers, the IEEE-754 standard [1] distinguishes three components (Fig. 1). First, the most significant bit serves as the sign bit $s$. The next $w$ bits contain a biased exponent $e$. For instance, for double precision (64-bit) floating point numbers, $e$ is 11 bits long and the bias is $2^{11-1}-1 = 1023$ (i.e., $2^{w-1}-1$). The value $e$ is computed by adding the real exponent $e_{real}$ to the bias, which ensures that $e$ is a positive number even if the real exponent is negative. The remaining bits are used for the mantissa $m$. The combination of these components represents the value. As an example, suppose the sign bit $s = 1$, the biased exponent $e = 10000000011_2 = 1027_{10}$, so that $e_{real} = 1027 - 1023 = 4$, and the mantissa $m = 0111000\ldots0_2$. In that case, the corresponding decimal value is $-1 \cdot 2^{4} \cdot (1 + 0.0111_2)$. We refer to such numbers as normal numbers, as they are in the normalized form with an implicit leading 1 present in the mantissa.

Figure 1: Fields of a 64-bit IEEE 754 double

If $e = 0$ and $m \neq 0$, the CPU treats the value as a denormal value. In this case, the leading 1 becomes a leading 0 instead, while the exponent becomes the minimal value of $2 - 2^{w-1}$, so that with $s = 1$, $e = 0$, and $m = 0111000\ldots0_2$, the resulting value for a double precision number is $-1 \cdot 2^{-1022} \cdot (0 + 0.0111_2)$. This additional representation is intended to allow a gradual underflow when values get closer to 0. In 64-bit floating point numbers, the smallest unbiased exponent is -1022, making it impossible to represent a number with an exponent of -1023 or lower. In contrast, a denormal number can represent this number by appending enough leading zeroes to the mantissa until the minimal exponent is obtained. The denormal representation trades precision for the ability to represent a larger set of numbers. Generating such a representation is called denormalization or gradual underflow.
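
The field layout and the two worked examples above can be checked with a few lines of C. This is a minimal sketch that assumes the platform's double is an IEEE-754 binary64 value with the same byte order as 64-bit integers (true on mainstream x86-64 systems); the type and function names (double_fields, decode_double, encode_double, is_denormal) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MANTISSA_BITS 52
#define MANTISSA_MASK ((UINT64_C(1) << MANTISSA_BITS) - 1)
#define EXPONENT_MASK 0x7FFu /* w = 11 exponent bits */

typedef struct {
    unsigned sign;     /* s: 1 bit */
    unsigned exponent; /* e: 11-bit biased exponent */
    uint64_t mantissa; /* m: 52-bit mantissa */
} double_fields;

/* Splits a double into its sign, biased exponent, and mantissa. */
static double_fields decode_double(double x) {
    uint64_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    double_fields f;
    f.sign = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> MANTISSA_BITS) & EXPONENT_MASK);
    f.mantissa = bits & MANTISSA_MASK;
    return f;
}

/* Builds a double from raw field values. */
static double encode_double(unsigned sign, unsigned exponent, uint64_t mantissa) {
    uint64_t bits = ((uint64_t)(sign & 1u) << 63)
                  | ((uint64_t)(exponent & EXPONENT_MASK) << MANTISSA_BITS)
                  | (mantissa & MANTISSA_MASK);
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Denormal: biased exponent of 0 with a non-zero mantissa. */
static bool is_denormal(double x) {
    double_fields f = decode_double(x);
    return f.exponent == 0 && f.mantissa != 0;
}
```

As a usage example, the normal number from the text is encode_double(1, 1027, UINT64_C(7) << 48), which evaluates to $-2^{4} \cdot 1.4375 = -23$, while encode_double(1, 0, UINT64_C(7) << 48) yields the denormal value $-2^{-1022} \cdot 0.4375$, for which is_denormal returns true.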

2.2 x86 Cache Coherence

On multicore processors, L1 and L2 caches are usually per core, while the L3 is shared among all cores. Much complexity arises due to the cache coherence problem, when the same memory location is cached by multiple cores. To ensure correctness, the memory must behave as a single coherent source of data, even if data is spread across a cache hierarchy. Informally, a memory system is coherent if any read of a data item returns the most recently written value [30]. To obtain coherence, memory operations that affect shared data must propagate to other cores' private caches. This propagation is the responsibility of the cache coherence protocol, such as MESIF on Intel processors [37] and MOESI on AMD [3]. Cache controllers implementing these protocols snoop and exchange coherence requests on the shared bus. For example, when a core writes in a shared memory location, it signals all other cores to invalidate their now stale local copy.

The cache coherence policy also maintains the illusion of a unified memory model for backward compatibility. The original Intel 8086, released in 1978, operated in real-address mode to implement what is essentially a pure Von Neumann architecture with no separation between data and code. Modern processors have more sophisticated memory architectures with separate L1 caches for code (L1i) and data (L1d). The need for backward compatibility with the simple 8086 memory model has led modern CPUs to a split-cache Harvard architecture whereby the cache coherence protocol ensures that the (L1) data and instruction caches are always coherent.

2.3 Memory Ordering

A Memory Consistency Model, or Memory Ordering Model, is a contract between the microarchitecture and the programmer regarding the permissible order of memory operations in parallel programs. Memory consistency is often confused with memory coherence, but where coherence hides the complex memory hierarchy in multicore systems, consistency deals with the order of read/write operations.

Consider the program in Figure 2. In the simplest consistency model, 'Sequential Consistency' (SC), all cores see all operations in the same order as specified by the program. In other words, A0 always executes before A1, and B0 always before B1. Thus, a valid memory order would be A0-B0-A1-B1, while A1-B1-A0-B0 would be invalid.

Figure 2: Memory Ordering Example

Intel and AMD CPUs implement Total-Store-Order (TSO), which is equivalent to SC, apart from one case: a store followed by a load on a different address may be reordered. This allows cores to use a private store buffer to hide the latency of store operations. In the example of Figure 2, the store operations A0 and B0 write their values initially only in their private store buffers. The subsequent loads A1 and B1 will now read the stale value 0 until the stores are globally visible. Thus, the order A1-B1-A0-B0 (r1=0, r2=0) is also valid.

2.4 Memory Disambiguation

Loads must normally be executed only after all the preceding stores to the same memory locations. However, modern processors rely on speculative optimizations based on memory disambiguation to allow loads to be executed before the addresses of all preceding stores are computed. In particular, if a load is predicted not to alias a preceding store, the CPU hoists the load and speculatively executes it before the preceding store address is known. Otherwise, the load is stalled until the preceding store is completed. In case of a no-alias (or hoist) misprediction, the load reads a stale value and the CPU needs to re-issue it after flushing the pipeline [36, 37].

3 Threat Model

We consider unprivileged attackers who aim to disclose confidential information, such as private keys, passwords, or randomized pointers. We assume an x86-64 based victim machine running the latest microcode and operating system version, with all state-of-the-art mitigations against transient execution attacks enabled. We also consider a victim system with no exploitable vulnerabilities apart from the ones described hereafter. Finally, we assume attackers can run (only) unprivileged code on the victim (e.g., in JavaScript, user processes, or VMs), but seek to leak data across security boundaries.

4 Machine Clears

The Intel Architectures Optimization Reference Manual [36] refers to the root cause of discarding issued microOps as Bad Speculation. Bad speculation consists of two subcategories:

• Branch Mispredict. A misprediction of the direction or target of a branch by the branch predictor will squash all microOps executed within a misspeculated branch.

• Machine Clear (or Nuke). A machine clear condition will flush the entire processor pipeline and restart the execution from the last retired instruction.


Table 1: Machine clear performance counters

Name                            Description
MACHINE_CLEARS.COUNT            Number of machine clears of any type
MACHINE_CLEARS.SMC              Number of machine clears caused by a self/cross-modifying code
MACHINE_CLEARS.DISAMBIGUATION   Number of machine clears caused by a memory disambiguation unit misprediction
MACHINE_CLEARS.MEMORY_ORDERING  Number of machine clears caused by a memory ordering principle violation
MACHINE_CLEARS.FP_ASSIST        Number of machine clears caused by an assisted floating point operation
MACHINE_CLEARS.PAGE_FAULT       Number of machine clears caused by a page fault
MACHINE_CLEARS.MASKMOV          Number of machine clears caused by an AVX maskmov on an illegal address with a mask set to 0

Bad speculation not only causes performance degradation but also security concerns [6–8, 31, 41, 43, 47, 55, 59, 63, 71, 74–76, 79, 82]. In contrast to branch misprediction, extensively studied by security researchers [10, 11, 31, 40, 41, 43, 48], machine clears have undergone little scrutiny. In this paper, we perform the first deep analysis of machine clears and the corresponding root causes of transient execution.

Since machine clears are hardly documented, we examined all the performance counters for every Intel architecture and found a number of relevant counters (Table 1). Some counters are present only in specific architectures. For example, the page fault counter is available only on Goldmont Plus. However, thanks to the generic counter MACHINE_CLEARS.COUNT, it is always possible to count the overall number of machine clears, regardless of the architecture. In the remainder of this work, we will reverse engineer and analyze the causes of machine clears by means of these six counters.

As a general observation, we note that the Floating Point Assist and Page Fault counters immediately suggest that machine clears are also related to microcode assists and faults/exceptions. In particular, further analysis shows that:

• Microcode assists trigger machine clears. The hardware occasionally needs to resort to microcode to handle complex events. Doing so requires flushing all the pending instructions with a machine clear before handling a microcode assist. Indeed, in our experiments, where the OTHER_ASSISTS.ANY counter increased, we also observed a matching increase in MACHINE_CLEARS.COUNT.

• Machine clears do not necessarily trigger microcode assists. Not all machine clears are microcode assisted, as some machine clear causes are handled directly in silicon. Indeed, in our experiments, we observed that SMC, MD, and MO machine clears cause an increase of MACHINE_CLEARS.COUNT, but leave OTHER_ASSISTS.ANY unaltered.

• An exception triggers a machine clear. When a fault or exception is detected, the subsequent microOps must be flushed from the pipeline with a machine clear, as the execution should resume at the exception handler. Indeed, in our experiments, we observed an increase of MACHINE_CLEARS.COUNT at each faulty instruction.

In this paper, we focus on the machine clear causes mentioned by the Intel documentation (Table 1), acknowledging that undocumented causes may still exist (much like undocumented x86 instructions [16]). We now first examine the four most relevant causes of machine clears (Self-Modifying Code MC, Floating Point MC, Memory Ordering MC, and Memory Disambiguation MC), then briefly discuss the other cases.

5 Self-Modifying Code Machine Clear

In a Von Neumann architecture, stores may write instructions as data and modify program code as it is being executed, as long as the code pages are writable. This is commonly referred to as Self-Modifying Code (SMC).

Self-modifying code is problematic for the Instruction Fetch Unit (IFU), which maintains high execution throughput by aggressively prefetching the instructions it expects to execute next and feeding them to the decode units. The CPU speculatively fetches and decodes the instructions and feeds them to the execution units, well ahead of retirement.

In case of a misprediction, the CPU flushes the speculatively processed instructions and resumes execution at the correct target. The IFU's aggressive prefetching ensures that the first-level instruction cache (L1i) is constantly filled with instructions which are either currently in (possibly speculative) execution or about to be executed. As a result, a store instruction targeting code cached in L1i requires drastic measures, as the associated cache lines should now be invalidated. Moreover, the target instructions do not even need to be part of the actual execution: since a prefetch is sufficient to bring them into L1i, any write to prefetched instructions also invalidates the prefetch queue. In other words, the problem occurs when the code is already in L1i, or the store is sufficiently close to the target code to ensure the target is prefetched in L1i. This behavior leads to a temporary desynchronization between the code and data views of the CPU, transiently breaking the architectural memory model (where L1d/L1i coherence ensures consistent code/data views).

To reverse engineer this behavior, we use the analysis code exemplified by Listing 1. The store at line 15 overwrites code already cached in L1i (lines 18-21), triggering a machine clear. The machine clear needs to update the L1i cache (and related microarchitectural structures) by flushing any stale instruction(s), resuming execution at the last retired instruction, and then fetching the new instructions. To test for the presence of a transient execution path, our analysis code immediately executes lines 18-21 targeted by the store and jumps to a spec_code gadget (lines 29-32), which fills a number of cache lines in a (flushed) buffer. Architecturally, this gadget should never be executed, as the store instruction should nop out the branch at line 18. However, microarchitecturally, we do observe multiple cache hits in the (reload) buffer using FLUSH + RELOAD, which confirms the existence of a pre-SMC-handling transient execution path executing stale code and leaving observable traces in the cache. We observe that the scheduling of the store instruction heavily affects the length of the transient path. Indeed, we use different instructions (lines 5-8) to delay as much as possible the store retirement and thus the SMC detection. This suggests that the root cause of the observed transient window might be the microarchitectural de-synchronization between the store buffer (new code) and the instruction queue (stale code), yielding transiently incoherent code/data views. We sampled machine clear performance counters to confirm the transient execution window is caused by the SMC machine clear and not by other events (e.g., memory disambiguation misprediction). Finally, we repeated our experiments on uncached code memory and could not observe any transient path. The counters revealed one machine clear triggered for each executed instruction, since the CPU has to pessimistically assume every fetched instruction has potentially been overwritten. Additionally, we have verified that SMC detection is performed on physical addresses rather than virtual ones.

Cross-Modifying Code

Instead of modifying its own instructions, a thread running on one core may also write data into the currently executing code segment of a thread running on a different physical core. Such Cross-Modifying Code (XMC) may be synchronous (the executor thread waits for the writer thread to complete before executing the new code) or asynchronous, with no particular coordination between threads. To reverse engineer the behavior, we distributed our analysis code across cores and reproduced a signal on the reload buffer in both cases. This confirms a Cross-Modifying Code Machine Clear (XMC machine clear) behaves similarly to a SMC machine clear across cores, with a store on the writer core originating a transient execution window on the executor core.

Listing 1: Self-Modifying Code Machine Clear analysis code

     1  smc_snippet:
     2    push r11
     3    lea r11, [target]        ; Load addr of target instr (line 17)
     4
     5    clflush [r11]            ; These instructions serve as a delay
     6    %rep 10                  ; for the store argument address. They
     7    imul r11, 1              ; ensure that the execution window of
     8    %endrep                  ; spec_code is as long as possible
     9
    10    ; Code to write as data: 8 nops (overwriting lines 18-21)
    11    mov rax, 0x9090909090909090
    12
    13    ; Store at target addr. Also the last retired instr
    14    ; from which the execution will resume after the SMC MC
    15    mov QWORD [r11], rax
    16
    17  target:                    ; Target instruction to be modified
    18    jmp spec_code
    19    nop
    20    nop
    21    nop
    22
    23    ; Architectural exit point of the function
    24    pop r11
    25    ret
    26
    27    ; Code executed speculatively (flushed after SMC MC)
    28  spec_code:
    29    mov rax, [rdi+0x0]       ; rdi: covert channel reload buffer
    30    mov rax, [rdi+0x400]
    31    mov rax, [rdi+0x800]
    32    mov rax, [rdi+0xc00]

6 Floating Point Machine Clear

On Intel, when the Floating Point Unit (FPU) is unable to produce results in the IEEE-754 [1] format directly, for instance in the case of denormal operands or results [4, 17], the CPU requires special handling to produce a correct result. According to an Intel patent [62], the denormalization is indeed implemented as a microcode assist or an exception handler, since the corresponding hardware would be too complex. In our experiments, we observed microcode assists on all x87, SSE, and AVX instructions that perform mathematical operations on denormal floating-point (FP) numbers. Increments of the FP_ASSIST.ANY or MACHINE_CLEARS.COUNT (or, on older processors, MACHINE_CLEARS.FP_ASSIST) performance counters confirm such assists cause machine clears.

Since a machine clear implies a pipeline flush, the assisted FP operation will be squashed together with subsequent microOps. To reverse engineer the behavior, we used analysis code exemplified by Algorithm 1. Our code relies on a FLUSH + RELOAD [81] covert channel to observe the result of a floating-point operation at the byte granularity. In our experiments, we observed two different hits in the reload buffer for each byte, for the transient and architectural result, respectively. The double-hit microarchitectural trace confirms that the transient (and wrong) value generated by the FPU is used in subsequent microOps, as also exemplified in Table 2. Later, the CPU detects the error and triggers a machine clear to flush the wrongly executed path. The microcode assist then corrects the result, while subsequent instructions are reissued.

Algorithm 1: Floating Point Machine Clear analysis (pseudo)code. byte is used to extract the i-th byte of z.

    1: for i ← 1, 8 do
    2:     flush(reload_buf)
    3:     z = x / y                          ▷ Any denormal FP operation
    4:     reload_buf[byte(z, i) * 1024]
    5:     reload(reload_buf)
    6: end for

            Representation        Value           Type   Exp
    x       0x0010deadbeef1337    2.34e-308       N      -1022
    y       0x40f0000000000000    65536 (2^16)    N      16
    z_arch  0x00000010deadbeef    3.57e-313       D      -1022
    z_tran  0x3f10deadbeef1337    6.43e-05        N      -14

Table 2: Architectural (z_arch) and transient (z_tran) results of dividing x and y of Algorithm 1. N: Normal, D: Denormal representations. The mantissa is set in bold in the original table. A normal division by 2^16 leaves the mantissa untouched and subtracts 16 from the exponent, which is the result of z_tran, where the exponent overflowed from -1022 to -14.

Figure 3: Transient execution due to invalid memory ordering

While we could not find any documentation on floating-point assist handling (even in the patents), our experiments revealed the following important properties. First, we verified that many FP operations can trigger FP assists (i.e., add, sub, mul, div, and sqrt) across different extensions (i.e., x87, SSE, and AVX). Second, the transient result is computed by "blindly" executing the operation as if both operands and result are normal numbers (see Table 2). Third, the detection of the wrong computation occurs later in time, creating a transient execution window. Finally, by performing multiple floating-point operations together with the assisted one, we were able to expand the size of the window, suggesting that detection is delayed if the FPU is busy handling multiple operations.
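The values in Table 2 can be inspected at the bit level. The sketch below only decodes the three bit patterns reported in the table and recomputes the architectural quotient with ordinary IEEE-754 double arithmetic. It does not model how the FPU produces the transient value, and the helper names (`bits_to_double`, `double_to_bits`, `fields`) are illustrative.

```python
import struct


def bits_to_double(bits):
    """Reinterpret a 64-bit integer as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def double_to_bits(value):
    """Reinterpret an IEEE-754 double as a 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def fields(bits):
    """Split a double into (11-bit exponent field, 52-bit mantissa field)."""
    return (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


# Bit patterns taken from Table 2.
X = 0x0010DEADBEEF1337
Y = 0x40F0000000000000       # 65536.0, i.e. 2^16
Z_TRAN = 0x3F10DEADBEEF1337  # transient result reported in the table

# Architectural result: the correctly rounded (denormal) quotient.
z_arch = double_to_bits(bits_to_double(X) / bits_to_double(Y))

print(f"z_arch = {z_arch:#018x}")
print("x      exponent field, mantissa:", [hex(f) for f in fields(X)])
print("z_tran exponent field, mantissa:", [hex(f) for f in fields(Z_TRAN)])
print("z_tran unbiased exponent:", fields(Z_TRAN)[0] - 1023)
```

Decoding shows that z_tran carries exactly the mantissa field of x with a different exponent field, whereas z_arch has a zero exponent field (a denormal) and a mantissa shifted right by 16 bits.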

7 Memory Ordering Machine Clear

Algorithm 2: Pseudo-code triggering a MO machine clear

    Processor A
    1: clflush(X)          ▷ Make the load slow
    2: unlock(lock)        ▷ Synchronize loads and stores
    3: r1 ← [X]
    4: r2 ← [Y]
    5: reload(r2)          ▷ For FLUSH + RELOAD

    Processor B
    1: wait(lock)          ▷ Synchronize loads and stores
    2: 1 → [Y]

The CPU initiates a memory ordering (MO) machine clear when, upon receiving a snoop request, it is uncertain if memory ordering will be preserved [36]. Consider the program in Figure 3. Processor A loads X and Y, while processor B performs two stores to the same locations. If the load of X is slow due to a cache miss, the out-of-order CPU will issue the next load (and subsequent operations) ahead of schedule. Suppose that, while the load of X is pending, processor B signals, through a snoop request, that the values of X and Y have changed. In this scenario, the memory ordering is A1-B0-B1-A0, which is not allowed according to the Total-Store-Order memory model. As two loads cannot be reordered, r1=1, r2=0 is an illegal result. Thus, processor A has no choice other than to flush its pipeline and re-issue the load of Y in the correct order. This MO machine clear is needed for every inconsistent speculation on the memory ordering, implementing speculation behavior originally proposed by Gharachorloo et al. [27], with the advantage that strict memory order principles can co-exist with aggressive out-of-order scheduling.

Notice that, in the previous example, the store on X is not even necessary to cause a memory ordering violation, since A1-B1-A0 is still an invalid order. Counterintuitively, the memory ordering violation disappears if the load on X is not performed, as A1-B0-B1 is a perfectly valid order.

To reverse engineer the memory ordering handling behavior on Intel CPUs, we used analysis code exemplified by Algorithm 2. Our code mimics the scenario of Figure 3, with loads/stores from two threads racing against each other, but relies on a FLUSH + RELOAD [81] covert channel to observe the loaded value of Y. The synchronization through the lock variable ensures the desired (problematic) proximity and ordering of the memory operations.

As before, in our experiments, we observed two different hits in the reload buffer for the loaded value of Y: one for the stale (transient) value and one for the new (architectural) value. For every double hit in the reload buffer, we also measured an increase of MACHINE_CLEARS.MEMORY_ORDERING. We also ran experiments without the load of X. Even though memory ordering violations are no longer possible, since each processor executes a single memory operation and any order is permitted, we still observed MO machine clears. The reason is that Intel CPUs seem to resort to a simple but conservative approach to memory ordering violation detection, where the CPU initiates a MO machine clear when a snoop request from another processor matches the source of any load operation in the pipeline. We obtained the same results with the two threads running across physical or logical (hyperthreaded) cores. Similarly, the results are unchanged if the matching store and load are performed on different addresses; the only requirement we observed is that the memory operations need to refer to the same cache line. Overall, our results confirm the presence of a transient execution window and the ability of a thread to trigger a transient execution path in another thread by simply dirtying a cache line used in a ready-to-commit load. Exploitation-wise, abusing this type of MC is non-trivial, due to the strict synchronization requirements and the difficulty of controlling pending stale data.
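The conservative detection rule observed above (a snoop that matches the cache line of any in-flight load) can be written down as a simple predicate. This is only a sketch of the observed behavior, not Intel's implementation. The 64-byte line size is an assumption (typical for x86 but not stated in the text), and the function names are hypothetical.

```python
CACHE_LINE_SIZE = 64  # bytes; assumed, typical for x86 but not stated in the text


def cache_line(addr):
    """Index of the cache line that contains the given byte address."""
    return addr // CACHE_LINE_SIZE


def snoop_triggers_clear(snooped_addr, in_flight_load_addrs):
    """True if the snooped line matches the line of any in-flight load.

    The exact addresses do not need to be equal: sharing a cache line is
    enough, mirroring the only requirement reported in the experiments.
    """
    line = cache_line(snooped_addr)
    return any(cache_line(addr) == line for addr in in_flight_load_addrs)


# A store to 0x1008 and a pending load from 0x1030 share line 0x1000-0x103f.
print(snoop_triggers_clear(0x1008, [0x1030]))
# A store to the next line does not match.
print(snoop_triggers_clear(0x1040, [0x1030]))
```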

8 Memory Disambiguation Machine Clear

As suggested by the MACHINE_CLEARS.DISAMBIGUATION counter description, memory disambiguation (MD) mispredictions are handled via machine clears. Our experiments confirmed this behavior by observing matching increases of MACHINE_CLEARS.COUNT. Moreover, we observed no changes in microcode assist counters, suggesting mispredictions are resolved entirely in hardware. In case of a misprediction, a stale value is passed to subsequent loads, a primitive that was previously used to leak secret information with Spectre Variant 4 or Speculative Store Bypass (SSB) [55, 82].

Different from the other machine clears, MD machine clears trigger only on address aliasing mispredictions, when the CPU wrongly predicts a load does not alias a preceding store instruction and can be hoisted. Similar to branch prediction, generating a MD-based transient execution window requires mistraining the underlying predictor. The latter has complex, undocumented behavior which has been partially reverse engineered [18]. For space constraints, we discuss our full reverse engineering strategy in Appendix A. Our results show that executing the same load 64 + 15 times with non-aliasing stores is sufficient to ensure the next prediction to be "no-alias" (and thus the next load to be hoisted). Upon reaching the hoisting prediction, one can perform the load with an aliasing store to trigger the transient path exposed to the incorrect, stale value.

Finally, we experimentally verified that 4k aliasing [50] does not cause any machine clear, but only incurs a further time penalty in case of wrong aliasing predictions. More details on 4k aliasing results can be found in Appendix A.

9 Other Types of Machine Clear

AVX vmaskmov. The AVX vmaskmov instructions perform conditional packed load and store operations depending on a bitmask. For example, a vmaskmovpd load may read 4 packed doubles from memory depending on a 4-bit mask: each double will be read only if the corresponding mask bit is set; the others will be assigned the value 0.

According to the Intel Optimization Reference Manual [36], the instruction does not generate an exception in face of invalid addresses, provided they are masked out. However, our experiments confirm that it does incur a machine clear (and a microcode assist) when accessing an invalid address (e.g., with the present bit set to 0) with a loading mask set to zero (i.e., no bytes should be loaded), to check whether the bytes in the invalid address have the corresponding mask bits set or not. We speculate the special handling is needed because the permission check is very complex, especially in the absence of memory alignment requirements.

In our experiments, vmaskmov instructions with all-zero masks and invalid addresses increment the OTHER_ASSISTS.ANY and MACHINE_CLEARS.COUNT counters, confirming that the instruction triggers a machine clear. However, the resulting transient execution window seems short-lived or absent, as we were unable to observe cache or other microarchitectural side effects of the execution.

Exceptions. The MACHINE_CLEARS.PAGE_FAULT counter, present on older microarchitectures, confirms page faults are another cause of machine clears. Indeed, we verified each instruction incurring a page fault, or any other exception such as "Division by zero", increments the MACHINE_CLEARS.COUNT counter. We also verified exceptions do not trigger microcode assists, and software interrupts (traps) do not trigger machine clears. Indeed, the Intel documentation [37] specifies that instructions following a trap may be fetched but not speculatively executed. Transient execution windows originating from exceptions, and page faults in particular, have been extensively used in prior work, with a faulty load instruction also used as the trigger to leak information [7, 47, 63, 75].

Hardware interrupts. Although hardware interrupts are an undocumented cause of machine clears, our experiments with APIC timer interrupts showed they do increment the MACHINE_CLEARS.COUNT counter. While this confirms hardware interrupts are another root cause of transient execution, the asynchronous nature of these events yields a less than ideal vector for transient execution attacks. Nonetheless, hardware interrupts play an important role in other classes of microarchitectural attacks [73].

Microcode assists. Microcode assists require a pipeline flush to insert the required microOps in the frontend and represent a subclass of machine clears (and thus a root cause of transient execution) for cases where a fast path in hardware is not available. In this paper, we detailed the behavior of floating-point and vmaskmov assists. Prior work has discussed different situations requiring microcode assists, such as those related to page table entry Access/Dirty bits, typically in the context of assisted loads used as the trigger to leak information [8, 63, 75]. AVX-to-SSE transitions [36] represent another microcode assist, which, based on our experiments, we could not observe on modern Intel CPUs. Remaining known microcode assists, such as access control of memory pages belonging to SGX secure enclaves and Precise Event Based Sampling (PEBS), are not presented in this work, since they are already studied [14] or only related to privileged performance profiling, respectively.

[Figure 4 plots. Top: Number of Transient Loads (y-axis). Bottom: Leakage Rate [Mb/s] (y-axis). X-axis (Transient Execution Management): F+R, TSX, BHT, FAULT, SMC, XMC, FP, MD, MO. CPUs: Intel Core i7-10700K, Intel Xeon Silver 4214, Intel Core i9-9900K, Intel Core i7-7700K, AMD Ryzen 5 5600X, AMD Ryzen Threadripper 2990WX, AMD Ryzen 7 2700X.]

Figure 4: Top plot: transient window size vs. mechanism. Each bar reports the number of transient loads that complete and leave a microarchitectural trace. Bottom plot: leakage rate vs. mechanism for a simple Spectre Bounds-Check-Bypass attack and a 1-bit FLUSH + RELOAD (F+R) cache covert channel. F+R is the leakage rate upper bound (covert channel loop only, no actual transient window or attack).

10 Transient Execution Capabilities

Transient execution attacks rely on crafting a transient window to issue instructions that are never retired. For this purpose, state-of-the-art attacks traditionally rely on mechanisms based on root causes such as branch mispredictions (BHT) [41, 43], faulty loads (Fault) [8, 47, 63, 75], or memory transaction aborts (TSX) [8, 47, 59, 63, 75]. However, the different machine clears discussed in this paper provide an attacker with the exact same capabilities.

To compare the capabilities of machine clear-based transient windows with those of more traditional mechanisms, we implemented a framework able to run arbitrary attacker-controlled code in a window generated by a mechanism of choice. We now evaluate our framework on recent processors (with all the microcode updates and mitigations enabled) to compare the transient window size and leakage rate of the different mechanisms.

10.1 Transient Window Size

The transient window size provides an indication of the number of operations an attacker can issue on a transient path before the results are squashed. Larger windows can, in principle, host more complex attacks. Using a classic FLUSH + RELOAD cache covert channel (F+R) as a reference, we measure the window size by counting how many transient loads can complete and hit entries in a designated F+R buffer. Figure 4 (top) presents our results.
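The counting step of this measurement can be sketched as follows, assuming the reload latency of every F+R buffer entry has already been measured. The threshold that separates cache hits from misses is machine-specific; the value used below and the sample latencies are made up for illustration, and the function name is hypothetical.

```python
def count_transient_hits(reload_latencies, hit_threshold):
    """Count F+R buffer entries whose reload latency indicates a cache hit.

    Each entry corresponds to one load issued in the transient window, so
    the number of hits estimates how many transient loads completed.
    """
    return sum(1 for latency in reload_latencies if latency < hit_threshold)


# Made-up reload latencies (in cycles) for a four-entry buffer.
sample_latencies = [40, 300, 35, 280]
print(count_transient_hits(sample_latencies, hit_threshold=100))
```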

As shown in the figure, the window size varies greatly across the different mechanisms. Broadly speaking, mechanisms that have a higher detection cost, such as XMC and MO Machine Clear, yield larger window sizes. Not surprisingly, branch mispredictions yield the largest window sizes, as we can significantly slow down the branch resolution process (i.e., causing cache misses) and delay detection. FP, on the other hand, yields the shortest windows, suggesting that denormal numbers are efficiently detected inside the FPU. Our results also show that, while our framework was designed for Intel processors, similar, if not better, results can usually be obtained on AMD processors (where we use the same conservative training code for branch/memory prediction). This shows that both CPU families share a similar implementation in all cases, except for MD, where the used mistraining pattern is not valid for pre-Zen3 architectures.

10.2 Leakage Rate

To compare the leakage rates for the different transient execution mechanisms, we transiently read and repeatedly leak data from a large memory region through a classic F+R cache covert channel. We report the resulting leakage rates (as the number of bits successfully leaked per second) across different microarchitectures, using a 1-bit covert channel to highlight the time complexity of each mechanism. We consider data to be successfully leaked after a single correct hit in the reload buffer. In case of a miss for a particular value, we restart the leak for the same value until we get a hit (or until we get 100 misses in a row).
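The retry policy described above can be sketched as a small driver loop. This is an illustration under stated assumptions, not the authors' framework: `try_leak` stands for a caller-supplied probe (hypothetical) that performs one transient read plus one F+R round and returns True on a correct hit, and the function names are made up.

```python
def leak_values(values, try_leak, max_misses=100):
    """Leak each value, retrying on a miss, and report the attempts used.

    try_leak(value) is a caller-supplied (hypothetical) probe that returns
    True when the reload buffer shows a correct hit for this value. A value
    is abandoned after max_misses misses in a row.
    """
    leaked, attempts = 0, 0
    for value in values:
        for _ in range(max_misses):
            attempts += 1
            if try_leak(value):
                leaked += 1
                break
    return leaked, attempts


def leakage_rate(leaked_items, bits_per_item, elapsed_seconds):
    """Leakage rate in bits per second."""
    return leaked_items * bits_per_item / elapsed_seconds
```

In a real measurement the elapsed time would cover the whole loop, so mechanisms that miss more often or cost more per attempt report a lower rate.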

As shown in Figure 4 (bottom), different Intel and AMD microarchitectures generally yield similar leakage rates, with some variations. For instance, FP MC offers better leakage rates on Intel. This difference stems from the different performance impact of the corresponding machine clears on Intel vs. AMD microarchitectures.

[Figure 5 plot. Y-axis: Leakage Rate [Mb/s]. X-axis (Transient Execution Management): F+R, TSX, BHT, FP, SMC, XMC, MO, MD, FAULT. Legend: F+R granularity [bit]: 1, 4, 8.]

Figure 5: Leakage rate vs. mechanism with a 1-bit, 4-bit, or 8-bit FLUSH + RELOAD cache covert channel (Intel Core i9).

As shown in Figure 5, the leakage rate varies instead greatly across the different mechanisms, and so does the optimal covert channel bitwidth. Indeed, while existing attacks typically rely on 8-bit covert channels, our results suggest 1-bit or 4-bit channels can be much more efficient, depending on the specific mechanism. Roughly speaking, optimal leakage rates can be obtained by balancing the time complexity (and hence bitwidth) of the covert channel with that of the mechanism. For example, FP is a lightweight and reliable mechanism, hence using a comparably fast and narrow 1-bit covert channel is beneficial. In contrast, MD requires a time-consuming predictor training phase between leak iterations, and leaking more bits per iteration with a 4-bit covert channel is more efficient. Interestingly, a classic 8-bit covert channel yields consistently worse and comparable leakage rates across all the mechanisms, since F+R dominates the execution time.
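The balance between mechanism cost and covert channel bitwidth can be illustrated with a toy cost model. It is an assumption made here for illustration, not the paper's measurement methodology: one leak iteration is assumed to cost a fixed mechanism overhead plus a flush-and-reload pass over 2^b buffer entries for a b-bit channel. All cost values are hypothetical, unitless numbers.

```python
def modeled_rate(bits, mechanism_cost, per_entry_cost):
    """Bits leaked per unit of time for one iteration of the toy model."""
    entries = 2 ** bits  # assumed: one F+R buffer entry per possible value
    return bits / (mechanism_cost + entries * per_entry_cost)


def best_bitwidth(mechanism_cost, per_entry_cost, candidates=(1, 4, 8)):
    """Candidate bitwidth with the highest modeled rate."""
    return max(
        candidates,
        key=lambda bits: modeled_rate(bits, mechanism_cost, per_entry_cost),
    )


# Hypothetical cheap mechanism (FP-like): the narrow channel wins.
print(best_bitwidth(mechanism_cost=1, per_entry_cost=1))
# Hypothetical expensive mechanism (MD-like training): a wider channel wins.
print(best_bitwidth(mechanism_cost=100, per_entry_cost=1))
```

In this model the 8-bit channel is dominated by the cost of scanning 256 entries regardless of the mechanism, which is in line with the observation that F+R dominates the execution time.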

Our results show that only two mechanisms (TSX and FP) are close to the maximum theoretical leakage rate of pure F+R. Moreover, FP performs as efficiently as TSX but, unlike TSX, is available on both Intel and AMD, is always enabled, and can be used from managed code (e.g., JavaScript). BHT, on the other hand, yields the worst leakage rates due to the inefficient training-based transient window. The BHT leakage rate can be improved if a tailored mistrain sequence is used, as in MD (Appendix A). Overall, our results show that machine clear-based windows achieve comparable, and in many cases (e.g., FP) better, leakage rates compared to traditional mechanisms. Moreover, many machine clears eliminate the need for mistraining, which, other than resulting in efficient leakage rates, can escape existing pattern-based mitigations and disabled hardware extensions (e.g., Intel TSX).

11 Attack Primitives

Building on our reverse engineering results and focusing on the unexplored SMC and FP machine clears, we now present two new transient execution attack primitives and analyze their security implications. We also present an end-to-end FPVI exploit disclosing arbitrary memory in Firefox. Later, we discuss mitigations.

11.1 Speculative Code Store Bypass (SCSB)

Our first attack primitive, Speculative Code Store Bypass (SCSB), allows an attacker to execute stale, controlled code in a transient execution window originated by a SMC machine clear. Since the primitive relies on SMC, its primary applicability is on JIT (e.g., JavaScript) engines running attacker-controlled code, although OS kernels and hypervisors storing code pages and allowing their execution without first issuing a serializing instruction are also potentially affected.

Figure 6: SCSB primitive example, where the instruction pointer is pointing at the bold code blocks. g code is freed JIT'ed code (of some g function) under attacker's control. (1) Force engine to JIT and execute code of function f, causing desynchronization of code and data views. (2) Execute stale code and SMC MC. (3) After the SMC MC, code and data view coherence is restored and the new code is executed.

As exemplified in Figure 6, the operations of the primitive can be broken down into three steps: (1) the JIT engine compiles a function f, storing the generated code into a JIT code cache region previously used by a (now-stale) version of function g; (2) the JIT engine jumps to the newly generated code for the function f, but, due to the temporary desynchronization between the code and data views of the CPU, this causes transient execution of the stale code of g until the SMC machine clear is processed; (3) after the pipeline flush, the code and data views are resynchronized and the CPU restarts the execution of the correct code of f. For exploitation, the attacker needs to (i) massage the JIT code cache allocator to reuse a freed region with a target gadget g of choice; (ii) force the JIT engine to generate and execute new (f) code in such a region, enabling transient, out-of-context execution of the gadget and spilling secrets into the microarchitectural state.

Our primitive bears similarities with both transient and architectural primitives used in prior attacks. On the transient front, our primitive is conceptually similar to a Speculative-Store-Bypass (SSB) primitive [55, 82], but can transiently execute stale code rather than reading stale data. However, interestingly, the underlying causes of the two primitives are quite different (MD misprediction vs. SMC machine clear). On the architectural front, our primitive mimics classic Use-After-Free (UAF) exploitation on the JIT code cache, also known as Return-After-Free (RAF) in the hacker community [19, 25]. An example is CVE-2018-0946, where a use-after-free vulnerability can be exploited to force the Chakra JS engine to erroneously execute freed (attacker-controlled) JIT code, resulting in arbitrary code execution after massaging the right gadget into the JIT code cache [24].

Figure 7: Coding options suggested by the Intel Architectures Software Developer's Manuals to handle SMC and XMC execution. Option 1 describes the exact steps required by our Speculative Code Store Bypass attack primitive, potentially resulting in exploitable gadgets.

Indeed at a high level SCSB yields a transient use-after-free primitive on a JIT code cache with exploitation prop-erties similar to its architectural counterpart However thereare some differences due to the transient nature of SCSBFirst we need to find an out-of-context gadget to transientlyleak data rather than architecturally execute arbitrary codeIn JavaScript engines similar gadgets have already been ex-ploited by ret2spec [48] escalating out-of-context transientexecution of valid code to type confusion arbitrary reads andultimately a secret-dependent load to transmit the value

Second, we need to target short-lived JIT code cache updates, so that the newly generated code is immediately executed, fitting the target gadget in the resulting transient execution window. Interestingly, executing JIT'ed code is the next step after JIT code generation mentioned in the Intel manual (Figure 7, Option 1), suggesting that developers who follow such directives ad litteram can easily introduce such gadgets. Moreover, modern JavaScript compilers feature a multi-stage optimization pipeline [12, 54], and short-lived JIT code cache updates are favored for just-in-time (re)optimization. Indeed, we found test code in V8 that specifically exercises such updates and verifies instruction/data cache coherence [70].

Finally, we need to ensure JIT code cache updates are not accompanied by barriers that force immediate synchronization of the code and data views. We analyzed the code of the popular SpiderMonkey and V8 JavaScript engines and verified that the functions called upon JIT code cache updates to synchronize instruction and data caches (Listing 2 and Listing 3) are always empty on x86 (as expected, given Intel's primary recommendation in Figure 7).

Listing 2: Chromium instruction cache flush (chromium/src/v8/src/codegen/x64/cpu-x64.cc)

void CpuFeatures::FlushICache(void* start, size_t size) {
  // No need to flush the instruction cache on Intel.
}

Listing 3: Firefox instruction cache flush (mozilla-unified/js/src/jit/FlushICache.h)

inline void FlushICache(void* code, size_t size,
                        bool codeIsThreadLocal = true) {
  // No-op. Code and data caches are coherent on x86 and x64.
}

Overall, while the exploitation is far from trivial (i.e., having to address the challenges of traditional use-after-free exploitation as well as transient execution exploitation in the browser), we believe SCSB expands the attack surface of transient execution attacks in the browser. We found a number of candidate SCSB gadgets in real-world code, but none of them was ultimately exploitable, due to the coincidental presence of some serializing instruction preventing stale code execution. Nonetheless, mitigations are needed to enforce security-by-design. Luckily, as shown later, SCSB is amenable to a practical and efficient mitigation (i.e., at a similar cost as faced by non-x86 architectures). The implementation/performance cost for mitigation is even lower than for its sibling SSB primitive, whose mitigation has been deployed in practice even in the absence of known practical exploits [55].

In Table 3, we show that all tested Intel and AMD processors are affected by SCSB. In contrast, ARM is not vulnerable, since SMC updates require explicit software barriers.

11.2 Floating Point Value Injection (FPVI)

Our second attack primitive, Floating Point Value Injection (FPVI), allows an attacker to inject arbitrary values into a transient execution window originated by an FP machine clear. As exemplified in Figure 8, the operations of the primitive can be broken down into four steps: (1) the attacker triggers the execution of a gadget starting with a denormal FP operation in the victim application, with the x and y operands under the attacker's control; (2) the transient z result of the operation is processed by the subsequent gadget instructions, leaving a microarchitectural trace; (3) the CPU detects the error condition (i.e., wrong result of a denormal operation), triggering a machine clear and thus a pipeline flush; (4) the CPU re-executes the entire gadget with the correct architectural z result. For exploitation, the attacker needs to (i) massage the x and y operands to inject the desired z value into the victim transient path, and (ii) target a victim gadget so that the injected value yields a security-sensitive trace, which can be observed with FLUSH + RELOAD or other microarchitectural attacks.

Our primitive bears similarities with Load Value Injection (LVI) [71], since both allow attackers to inject controlled values into transient execution. Moreover, both primitives require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues, and hence the triggering conditions, are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. It is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild: much like for Spectre [41], LVI [71], etc., this would require gadget scanners, which are the subject of orthogonal research [56, 58]. Still, we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all the prior Mozilla Firefox mitigations against transient execution attacks [52].

Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x, y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE-754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag (written here as the upper 20 bits of the 64-bit value) is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case, a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.
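The encoding described above can be sketched in a few lines of Python. This is an illustrative decoder, not SpiderMonkey's implementation: the function names are hypothetical, and the 17-bit tag constants are derived here by dropping the three low bits of the 20-bit patterns quoted in the text (0xfff88 and 0xfffb0).

```python
import struct

TAG_SHIFT = 47                          # low 47 bits hold the payload
PAYLOAD_MASK = (1 << TAG_SHIFT) - 1
JSVAL_TAG_MAX_DOUBLE = 0x1FFF0          # from the text

# The text quotes tags as the upper 20 bits; the 17-bit tag drops 3 bits.
TAG_INT32 = 0xFFF88 >> 3                # 0x1FFF1
TAG_STRING = 0xFFFB0 >> 3               # 0x1FFF6


def double_to_bits(value):
    """Return the raw 64-bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def tag_of(bits):
    return bits >> TAG_SHIFT


def payload_of(bits):
    return bits & PAYLOAD_MASK


def classify(bits):
    """Classify a 64-bit value the way the text describes NaN-Boxing."""
    tag = tag_of(bits)
    if tag <= JSVAL_TAG_MAX_DOUBLE:
        return "double"
    if tag == TAG_INT32:
        return "int32"
    if tag == TAG_STRING:
        return "string"
    return "other"


injected = 0xFFFB0DEADBEEF000           # transient result quoted in the paper
kind = classify(injected)
pointer = payload_of(injected)
```

With the value quoted in the paper, the decoder sees a string tag and a payload of 0xdeadbeef000, while ordinary doubles such as 1.5 or -Infinity stay below JSVAL_TAG_MAX_DOUBLE.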

To verify the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see gadget in Figure 8).
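The architectural side of this example can be checked in software. The sketch below decodes the two operands, confirms that the divisor is denormal, and computes the architectural quotient. The transient value cannot be reproduced this way, so it is simply copied from the text as a constant; the helper names are hypothetical.

```python
import math
import struct


def bits_to_double(bits):
    """Reinterpret a 64-bit pattern as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def double_to_bits(value):
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def is_denormal(bits):
    """Denormal: zero exponent field and non-zero mantissa field."""
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0 and mantissa != 0


X_BITS = 0xC000E8B2C9755600      # dividend from the text
Y_BITS = 0x0004000000000000      # divisor from the text (denormal)

x = bits_to_double(X_BITS)
y = bits_to_double(Y_BITS)
architectural = x / y            # overflows to -Infinity

# Transient result as reported in the text; not computed here.
TRANSIENT_BITS = 0xFFFB0DEADBEEF000
string_pointer = TRANSIENT_BITS & ((1 << 47) - 1)
length_address = string_pointer + 4   # the length field sits 4 bytes in
```

The divisor decodes to 2^-1024, so the quotient overflows to -Infinity (0xfff0000000000000), which SpiderMonkey treats as a plain double; only the transient value carries the string tag.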

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the length attribute of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 51 bits set to zero (0xfff8000000000000), a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables, such as string pointers, can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on an FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0. By wrapping this primitive with a variant of an EVICT + RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

Table 3: Tested processors.

Processor                                Microcode    SCSB vulnerable   FPVI vulnerable
Intel Core i7-10700K                     0xe0         ✓                 ✓
Intel Xeon Silver 4214                   0x500001c    ✓                 ✓
Intel Core i9-9900K                      0xde         ✓                 ✓
Intel Core i7-7700K                      0xca         ✓                 ✓
AMD Ryzen 5 5600X                        0xa201009    ✓                 ✓ †
AMD Ryzen 2990WX                         0x800820b    ✓                 ✓ †
AMD Ryzen 7 2700X                        0x800820b    ✓                 ✓ †
Broadcom BCM2711 Cortex-A72 (ARM v8)     -            ✗ §               ✗

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified that any serialization instruction, such as lfence, mfence, or cpuid, placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. This function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows JavaScript execution time is heavily dominated by JIT code generation/execution, and code updates have negligible impact. Our results show this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
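The conditional-masking idea can be modeled on raw bit patterns. The sketch below is not the SpiderMonkey patch: it only illustrates the invariant being enforced, namely that any NaN-encoded FP result is replaced by the canonical QNaN Floating-Point Indefinite pattern. In a JIT this selection would be emitted as a branchless cmov; the Python conditional expression here is merely a stand-in, and the function names are hypothetical.

```python
QNAN_INDEFINITE = 0xFFF8000000000000    # canonical x86 default NaN
EXPONENT_MASK = 0x7FF0000000000000
MANTISSA_MASK = (1 << 52) - 1


def is_nan_bits(bits):
    """NaN: all-ones exponent field and a non-zero mantissa field."""
    return (bits & EXPONENT_MASK) == EXPONENT_MASK and (bits & MANTISSA_MASK) != 0


def canonicalize(bits):
    """Replace any NaN pattern with the canonical one; keep other values."""
    return QNAN_INDEFINITE if is_nan_bits(bits) else bits


masked = canonicalize(0xFFFB0DEADBEEF000)   # transient value from the text
```

After masking, the injected pattern no longer carries a string tag, so the dependent length access cannot be steered to an attacker-chosen address, while finite values and infinities pass through unchanged.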

A more general and automated mitigation is for the compiler to place a serializing instruction such as lfence after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76] and systems resistant to it. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by an MDS attacker. For the latter, we limit ourselves to load sinks, to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.
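The source/sink strategy can be sketched on a toy straight-line intermediate representation. Everything below is an assumption made for illustration: the (opcode, destination, operands) tuple format, the opcode names, and the harden function are hypothetical and far simpler than the LLVM pass. Address operands are assumed to come first for load, store, and ijmp.

```python
FP_OPS = {"fadd", "fsub", "fmul", "fdiv"}   # FPVI sources in this toy IR


def harden(block, mds_vulnerable):
    """Insert an lfence after every FP source whose result reaches a sink."""
    taint = {}            # register -> indices of FP sources flowing into it
    needs_fence = set()   # indices of FP instructions to fence
    for index, (op, dest, srcs) in enumerate(block):
        is_sink = op in ("load", "ijmp") or (op == "store" and mds_vulnerable)
        if is_sink and srcs:
            # srcs[0] is the address (or branch target) operand.
            needs_fence |= taint.get(srcs[0], set())
        if dest is None:
            continue
        if op in FP_OPS:
            taint[dest] = {index}             # fresh FPVI source
        elif op == "load":
            taint[dest] = set()               # loaded values are LVI's concern
        else:
            taint[dest] = set().union(*(taint.get(r, set()) for r in srcs))
    hardened = []
    for index, instr in enumerate(block):
        hardened.append(instr)
        if index in needs_fence:
            hardened.append(("lfence", None, ()))
    return hardened


example = [
    ("fdiv", "z", ("x", "y")),        # source
    ("add", "p", ("z", "base")),      # propagates taint
    ("load", "v", ("p",)),            # sink: tainted address
    ("fmul", "w", ("a", "b")),        # source, only stored as a value
    ("store", None, ("q", "w")),      # address q is untainted
    ("fadd", "s", ("c", "d")),        # source
    ("store", None, ("s", "v")),      # sink only on MDS-vulnerable systems
]
resistant = harden(example, mds_vulnerable=False)
vulnerable = harden(example, mds_vulnerable=True)
```

In this toy block, the MDS-resistant configuration fences only the fdiv feeding a load address, while the MDS-vulnerable configuration additionally fences the fadd whose result is used as a store address.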

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that, on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.

[Figure 9 plot. Y-axis: overhead (0 to 800). Benchmarks: SPEC FP 2017 (619.lbm_s, 638.imagick_s, 644.nab_s, SPEC17 geomean) and SPEC FP 2006 (433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, SPEC06 geomean). Series (LLVM mitigation): LVI; FPVI (MDS-vulnerable system); FPVI (MDS-resistant system); LVI+FPVI (MDS-vulnerable system).]

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0.

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact (up to 280%) despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance difference between the MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by performing interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).
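For readers unfamiliar with the metric, a geomean overhead is commonly obtained by taking the geometric mean of the per-benchmark slowdown ratios and converting it back to a percentage. The sketch below assumes that convention (the text does not spell out the exact formula), and the sample percentages are made-up placeholders, not measurements from Figure 9.

```python
import math


def geomean_overhead(overheads_pct):
    """Geometric-mean overhead (%) from per-benchmark overheads (%)."""
    ratios = [1.0 + pct / 100.0 for pct in overheads_pct]   # slowdown factors
    mean_log = sum(math.log(r) for r in ratios) / len(ratios)
    return (math.exp(mean_log) - 1.0) * 100.0


# Hypothetical per-benchmark overheads, for illustration only.
placeholder = geomean_overhead([10.0, 40.0, 250.0])
```

Unlike an arithmetic mean, this aggregate is less dominated by a single benchmark with a very large slowdown.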

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, an MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It is worth noting that, in Figure 10, there are two separate predictors subclasses under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74–76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a HW interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less than ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived from likely invariants [15] used by the CPU. Such invariants commonly hold but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.
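The two-level structure of the classification can be captured as a small lookup table. The sketch below covers only root causes named in the surrounding text and is not a transcription of Figure 10; the enum and dictionary names are hypothetical, and the placement of each cause follows our reading of the class descriptions above.

```python
from enum import Enum


class MainClass(Enum):
    CONTROL_FLOW_MISPREDICTION = "branch misprediction"
    DATA_MISPREDICTION = "machine clear"


class Subclass(Enum):
    PREDICTORS = "Predictors"
    EXCEPTIONS = "Exceptions"
    INTERRUPTS = "Interrupts"
    LIKELY_INVARIANTS_VIOLATIONS = "Likely invariants violations"


# Root cause -> (main class, subclass); partial, illustrative mapping.
ROOT_CAUSES = {
    "branch misprediction":
        (MainClass.CONTROL_FLOW_MISPREDICTION, Subclass.PREDICTORS),
    "memory disambiguation misprediction":
        (MainClass.DATA_MISPREDICTION, Subclass.PREDICTORS),
    "page fault":
        (MainClass.DATA_MISPREDICTION, Subclass.EXCEPTIONS),
    "hardware interrupt":
        (MainClass.DATA_MISPREDICTION, Subclass.INTERRUPTS),
    "self-modifying code":
        (MainClass.DATA_MISPREDICTION, Subclass.LIKELY_INVARIANTS_VIOLATIONS),
    "floating-point denormal operation":
        (MainClass.DATA_MISPREDICTION, Subclass.LIKELY_INVARIANTS_VIOLATIONS),
}


def causes_in(subclass):
    """Return the listed root causes that fall into a given subclass."""
    return sorted(name for name, (_, sub) in ROOT_CAUSES.items() if sub is subclass)


invariant_causes = causes_in(Subclass.LIKELY_INVARIANTS_VIOLATIONS)
```

Keying the table by root cause rather than by attack name mirrors the point of the classification: the same exploit can appear under different entries depending on what opens its transient window.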

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control-flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them through the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it through the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd, Daniel Genkin, and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From ghz to mhz effective cpu frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. Smotherspectre: exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. Kaslr: Break it, fix it, repeat. In ACM ASIA CCS 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. Sgxpectre: Stealing intel secrets from sgx enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS 2018.

[16] Christopher Domas. Breaking the x86 isa. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.

[20] Dmitry Evtyushkin, Ryan Riley, Nael CSE Abu-Ghazaleh ECE, and Dmitry Ponomarev. Branchscope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts. October 1, 2013. US Patent 8,549,263.


[45] Chris Lattner and Vikram Adve Llvm A compilation framework forlifelong program analysis amp transformation In CGO 2004

[46] Orion Lawlor Hari Govind Isaac Dooley Michael Breitenfeld andLaxmikant Kale Performance degradation in the presence of subnor-mal floating-point values In OSIHPA 2005

[47] Moritz Lipp Michael Schwarz Daniel Gruss Thomas Prescher WernerHaas Anders Fogh Jann Horn Stefan Mangard Paul Kocher DanielGenkin Yuval Yarom and Mike Hamburg Meltdown Reading KernelMemory from User Space In USENIX Securityrsquo18

[48] Giorgi Maisuradze and Christian Rossow ret2spec Speculative ex-ecution using return stack buffers In Proceedings of the 2018 ACMSIGSAC

[49] Andrea Mambretti Alexandra Sandulescu Matthias NeugschwandtnerAlessandro Sorniotti and Anil Kurmus Two methods for exploitingspeculative control flow hijacks In USENIX WOOT 19

[50] Ahmad Moghimi Jan Wichelmann Thomas Eisenbarth and BerkSunar Memjam A false dependency attack against constant-timecrypto implementations International Journal of Parallel Program-ming 2019

[51] Mozilla Firefox Bug 1692972 mitigation httpshgmozillaorgreleasesmozilla-betarevb129bba64358

[52] Mozilla Spectre mitigations for Value type checks - x86 part httpsbugzillamozillaorgshow_bugcgiid=1433111

[53] Mozilla Spider Monkey JSValue httpshgmozillaorgmozilla-centralfiletipjspublicValueh

[54] Mozilla SpiderMonkey IonMonkey documentation

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. SpecFuzz: Bringing Spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the CPU on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. SpecTaint: Speculative taint analysis for discovering Spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. SoK: In search of lost time: A review of JavaScript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers, 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in JavaScript. In FC, IFCA, 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. NetSpectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. LazyFP: Leaking FPU register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. MeltdownPrime and SpectrePrime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. SoK: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. LVI: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P 20.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary CPU interrupt logic. In ACM CCS 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-Step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. SGAxe: How SGX fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. CacheOut: Leaking data on Intel CPUs via cache evictions. arXiv preprint.

[77] WebKit. BrowserBench. https://browserbench.org

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+Reload: a high resolution, low noise, L3 cache side-channel attack. In USENIX Security 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to reverse engineer the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                        ; rdi: store addr, rsi: load addr
      %rep 10                     ; Trick to delay the store address
        imul rdi, 1
      %endrep
      mov DWORD [rdi], 0x42       ; Store
      mov eax, DWORD [rsi]        ; Load
      %rep 10                     ; Pronounce load timing
        imul eax, 1
      %endrep
      ret

Listing 5: Observing the size and behavior of the per-address saturating counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that the saturating counter is set to 0
    for (i = 0; i < 100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for (i = 100; i < 120; i++) st_ld(mem, mem + 64);
    // Trigger a memory ordering machine clear
    for (i = 120; i < 130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, the flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter, incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
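The counter behavior described above can be made concrete with a small model. The following Python sketch is an illustration only, not the hardware logic: the class name HoistCounter and the outcome labels 'slow', 'fast', and 'clear' are hypothetical, and the model assumes a single 4-bit counter that starts at 0. It replays the call pattern of Listing 5.

```python
class HoistCounter:
    """Toy model of one per-address predictor entry (assumed 4-bit, reset on alias)."""

    SATURATED = 15  # largest value of a 4-bit counter

    def __init__(self):
        self.value = 0

    def st_ld(self, aliased):
        """Model one st_ld call; return 'slow', 'fast', or 'clear'."""
        hoisted = self.value == self.SATURATED  # hoist only when saturated
        if aliased:
            self.value = 0  # an aliasing load resets the counter
            return "clear" if hoisted else "slow"
        self.value = min(self.value + 1, self.SATURATED)
        return "fast" if hoisted else "slow"


def replay_listing5():
    """Replay Listing 5: 100 aliasing, 20 non-aliasing, 10 aliasing calls."""
    counter = HoistCounter()
    pattern = [True] * 100 + [False] * 20 + [True] * 10
    return [counter.st_ld(aliased) for aliased in pattern]


outcomes = replay_listing5()
print(outcomes[100:122])
```

In this model, calls 100 to 114 are still stalled, calls 115 to 119 are hoisted correctly, and call 120 is the single mispredicted (hoisted but aliasing) load, which mirrors the shape of the timing plot described in the text.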

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld, 100 to 130; y-axis: clock cycles). Orange bar: machine clear (memory ordering) observed.

Listing 6: Snippet observing activation of the never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            st_ld(mem, mem + 64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld, 0 to 120; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters), and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe whether two unrelated loads in these functions are able to influence each other when varying the number of nops. On our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256·k bytes apart in memory (k ∈ N). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except that we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
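The hashing described above amounts to masking the load's instruction pointer. The following Python sketch illustrates it under the numbers reported in the text (256 entries observed, 64 suggested by the patent); the function names and the example addresses are hypothetical.

```python
OBSERVED_ENTRIES = 256  # predictors observed in the text
PATENT_ENTRIES = 64     # predictors suggested by the patent


def predictor_index(load_ip, entries=OBSERVED_ENTRIES):
    """Select a predictor entry from the low bits of the load's instruction pointer.

    `entries` must be a power of two, so the mask keeps log2(entries) bits.
    """
    return load_ip & (entries - 1)


def shares_predictor(ip_a, ip_b, entries=OBSERVED_ENTRIES):
    """True if two load instructions map to the same predictor entry."""
    return predictor_index(ip_a, entries) == predictor_index(ip_b, entries)


victim_ip = 0x401234                 # hypothetical address of a victim load
attacker_ip = victim_ip + 256 * 3    # a load placed 256*k bytes away (k = 3)
print(shares_predictor(victim_ip, attacker_ip))
```

With a 256-entry table, a load placed only 64 bytes away from the victim maps to a different entry, whereas it would collide under the 64-entry design suggested by the patent. This is the kind of difference the nop-sliding experiment can distinguish.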

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check whether hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that after 4 machine clears the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads, 15 to 399 in steps of 64; y-axis: total measured machine clears, 1 to 4).

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0, and disabled otherwise. The flush counter is decremented every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measured how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal that the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
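The interplay between the per-address counter and the flush counter can be sketched as a small state machine. The Python model below is a hedged reconstruction, not the actual hardware: the class and helper names are hypothetical, the flush counter is assumed to range from -1 to 3 so that four clears exhaust it, and would-be correct predictions are assumed to accumulate across counter resets.

```python
class MDPredictorModel:
    """Toy model: one per-address counter plus the global flush counter (watchdog)."""

    SATURATED = 15   # 4-bit per-address counter
    FLUSH_MAX = 3    # assumed upper bound, so that 4 clears in a row reach -1
    REFILL = 64      # correct (or would-be correct) hoists per flush increment

    def __init__(self, flush=FLUSH_MAX):
        self.counter = 0
        self.flush = flush   # never-hoisting state while flush < 0
        self.good = 0        # hoist predictions that were (or would be) correct

    def st_ld(self, aliased):
        """Model one st_ld call; return True if it causes an MD machine clear."""
        wants_hoist = self.counter == self.SATURATED
        clear = wants_hoist and self.flush >= 0 and aliased
        if clear:
            self.flush -= 1
        elif wants_hoist and not aliased:
            self.good += 1
            if self.good == self.REFILL:
                self.good = 0
                self.flush = min(self.flush + 1, self.FLUSH_MAX)
        self.counter = 0 if aliased else min(self.counter + 1, self.SATURATED)
        return clear


def clears_after(n, attempts=6):
    """From the never-hoisting state, run n non-aliasing loads, then count the
    machine clears that `attempts` mistraining rounds manage to trigger."""
    model = MDPredictorModel(flush=-1)
    for _ in range(n):
        model.st_ld(aliased=False)
    clears = 0
    for _ in range(attempts):
        clears += model.st_ld(aliased=True)
        for _ in range(MDPredictorModel.SATURATED):
            model.st_ld(aliased=False)  # re-saturate the per-address counter
    return clears


def listing6_clears():
    """Replay the access pattern of Listing 6 and count machine clears."""
    model = MDPredictorModel()
    clears = 0
    for _ in range(10):
        for _ in range(19):
            model.st_ld(aliased=False)
        clears += model.st_ld(aliased=True)
    return clears


print(listing6_clears(), [clears_after(n) for n in (78, 79, 143, 207, 271, 399)])
```

Under these assumptions the model reproduces the two observations in the text: the Listing 6 pattern yields 4 clears before hoisting is disabled, and 79 (but not 78) non-aliasing loads are enough to leave the never-hoisting state.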

Listing 7: Snippet verifying the 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for (i = 0; i < 10; i++) st_ld(mem + 0x1000, mem + 0x1000);
    for (i = 10; i < 20; i++) st_ld(mem + 0x2000, mem + 0x2000);
    // Cause 4k aliasing
    for (i = 20; i < 40; i++) st_ld(mem + 0x1000, mem + 0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld, 0 to 40; y-axis: clock cycles, 50 to 90). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Figure 15: Memory disambiguation unit, simplified view.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value, and in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load; (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
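As a small illustration of the 4k aliasing condition and of the worst-case training cost, the following Python sketch checks whether a store-load pair matches in its 4KB page offset without being the same address. The helper name and the base address are hypothetical; the offsets come from Listing 7.

```python
PAGE_OFFSET_MASK = 0xFFF  # low 12 bits: offset within a 4KB page


def is_4k_alias(store_addr, load_addr):
    """True for a false match: same 4KB page offset, but different addresses."""
    same_offset = (store_addr & PAGE_OFFSET_MASK) == (load_addr & PAGE_OFFSET_MASK)
    return same_offset and store_addr != load_addr


# Worst-case number of non-aliasing store-load pairs needed for mistraining:
# 15 to saturate the per-address counter, 64 to leave the never-hoisting state.
WORST_CASE_TRAINING = 15 + 64

mem = 0x7F0000000000  # hypothetical page-aligned buffer address
print(is_4k_alias(mem + 0x1000, mem + 0x2000), WORST_CASE_TRAINING)
```

The last loop of Listing 7 uses exactly such a pair (mem+0x1000 and mem+0x2000), while the st_ld(mem, mem+64) calls of the earlier listings do not alias in this sense.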

B Root-Causes Description Table

Acronym         Description

BHT             Branch History Table
BTB             Branch Target Buffer
RSB             Return Stack Buffer
MD              Memory Disambiguation Unit
NM              Device Not Available Exception
DE              Divide-by-Zero Exception
UD              Invalid Opcode Exception
GP              General Protection Fault
AC              Alignment Check Exception
SS              Stack Segment Exception
PF              Page Fault
PF - U/S Bit    Page Table Entry User/Supervisor Bit PF
PF - R/W Bit    Page Table Entry Read/Write Bit PF
PF - P Bit      Page Table Entry Present Bit PF
PF - PKU        Page Table Entry Protection Keys PF
BR              Bound Range Exceeded Exception
FP              Floating Point Assist
SMC             Self-Modifying Code
XMC             Cross-Modifying Code
MO              Memory Ordering Principles Violation
MASKMOV         Masked Load/Store Instruction Assist
A/D Bits        Page Table Entry Access/Dirty Bits Assist
TSX             Intel TSX Transaction Abort
UC              Uncacheable Memory Assist
PRM             Processor Reserved Memory Assist
HW Interrupts   Hardware Interrupts


  • Introduction
  • Background
    • IEEE-754 Denormal Numbers
    • x86 Cache Coherence
    • Memory Ordering
    • Memory Disambiguation
  • Threat Model
  • Machine Clears
  • Self-Modifying Code Machine Clear
  • Floating Point Machine Clear
  • Memory Ordering Machine Clear
  • Memory Disambiguation Machine Clear
  • Other Types of Machine Clear
  • Transient Execution Capabilities
    • Transient Window Size
    • Leakage Rate
  • Attack Primitives
    • Speculative Code Store Bypass (SCSB)
    • Floating Point Value Injection (FPVI)
  • Mitigations
    • SCSB Mitigation
    • FPVI Mitigation
  • Root Cause-based Classification
  • Related Work
  • Conclusions
  • Reversing Memory Disambiguation
  • Root-Causes Description Table
Page 4: Rage Against the Machine Clear: A Systematic Analysis of

mantissa until the minimal exponent is obtained. The denormal representation trades precision for the ability to represent a larger set of numbers. Generating such a representation is called denormalization or gradual underflow.

2.2 x86 Cache Coherence

On multicore processors, L1 and L2 caches are usually per core, while the L3 is shared among all cores. Much complexity arises due to the cache coherence problem, when the same memory location is cached by multiple cores. To ensure correctness, the memory must behave as a single coherent source of data, even if data is spread across a cache hierarchy. Informally, a memory system is coherent if any read of a data item returns the most recently written value [30]. To obtain coherence, memory operations that affect shared data must propagate to other cores' private caches. This propagation is the responsibility of the cache coherence protocol, such as MESIF on Intel processors [37] and MOESI on AMD [3]. Cache controllers implementing these protocols snoop and exchange coherence requests on the shared bus. For example, when a core writes to a shared memory location, it signals all other cores to invalidate their now stale local copy.
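The invalidation step in the last example can be illustrated with a toy model. The Python sketch below is deliberately simplified and is not MESIF or MOESI: it assumes a write-through, write-invalidate scheme with no coherence states, and the class name ToyCoherentSystem is hypothetical.

```python
class ToyCoherentSystem:
    """Write-invalidate toy: private per-core caches over one shared memory."""

    def __init__(self, num_cores):
        self.memory = {}
        self.caches = [dict() for _ in range(num_cores)]

    def read(self, core, addr):
        cache = self.caches[core]
        if addr not in cache:                  # miss: fetch the current value
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, core, addr, value):
        for other, cache in enumerate(self.caches):
            if other != core:
                cache.pop(addr, None)          # invalidate now-stale copies
        self.caches[core][addr] = value
        self.memory[addr] = value              # write-through, for simplicity


system = ToyCoherentSystem(num_cores=2)
before = system.read(1, 0x40)    # core 1 caches the old value
system.write(0, 0x40, 7)         # core 0 writes; core 1's copy is invalidated
after = system.read(1, 0x40)     # core 1 misses and sees the new value
print(before, after)
```

Without the invalidation loop in write, core 1 would keep returning its stale cached copy, which is the situation the coherence protocol is there to prevent.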

The cache coherence policy also maintains the illusion of a unified memory model for backward compatibility. The original Intel 8086, released in 1978, operated in real-address mode to implement what is essentially a pure Von Neumann architecture with no separation between data and code. Modern processors have more sophisticated memory architectures with separate L1 caches for code (L1i) and data (L1d). The need for backward compatibility with the simple 8086 memory model has led modern CPUs to a split-cache Harvard architecture whereby the cache coherence protocol ensures that the (L1) data and instruction caches are always coherent.

2.3 Memory Ordering

A Memory Consistency Model, or Memory Ordering Model, is a contract between the microarchitecture and the programmer regarding the permissible order of memory operations in parallel programs. Memory consistency is often confused with memory coherence, but where coherence hides the complex memory hierarchy in multicore systems, consistency deals with the order of read/write operations.

Consider the program in Figure 2. In the simplest consistency model, 'Sequential Consistency' (SC), all cores see all operations in the same order as specified by the program. In other words, A0 always executes before A1, and B0 always before B1. Thus, a valid memory order would be A0-B0-A1-B1, while A1-B1-A0-B0 would be invalid.

Figure 2: Memory Ordering Example

Intel and AMD CPUs implement Total-Store-Order (TSO), which is equivalent to SC apart from one case: a store followed by a load on a different address may be reordered. This allows cores to use a private store buffer to hide the latency of store operations. In the example of Figure 2, the store operations A0 and B0 write their values initially only in their private store buffers. The subsequent loads A1 and B1 will now read the stale value 0 until the stores are globally visible. Thus, the order A1-B1-A0-B0 (r1=0, r2=0) is also valid.

2.4 Memory Disambiguation

Loads must normally be executed only after all the preceding stores to the same memory locations. However, modern processors rely on speculative optimizations based on memory disambiguation to allow loads to be executed before the addresses of all preceding stores are computed. In particular, if a load is predicted not to alias a preceding store, the CPU hoists the load and speculatively executes it before the preceding store address is known. Otherwise, the load is stalled until the preceding store is completed. In case of a no-alias (or hoist) misprediction, the load reads a stale value and the CPU needs to re-issue it after flushing the pipeline [36, 37].

3 Threat Model

We consider unprivileged attackers who aim to disclose confidential information, such as private keys, passwords, or randomized pointers. We assume an x86-64 based victim machine running the latest microcode and operating system version, with all state-of-the-art mitigations against transient execution attacks enabled. We also consider a victim system with no exploitable vulnerabilities apart from the ones described hereafter. Finally, we assume attackers can run (only) unprivileged code on the victim (e.g., in JavaScript, user processes, or VMs), but seek to leak data across security boundaries.

4 Machine Clears

The Intel Architectures Optimization Reference Manual [36] refers to the root cause of discarding issued µOps as Bad Speculation. Bad speculation consists of two subcategories:

• Branch Mispredict. A misprediction of the direction or target of a branch by the branch predictor will squash all µOps executed within a misspeculated branch.

• Machine Clear (or Nuke). A machine clear condition will flush the entire processor pipeline and restart the execution from the last retired instruction.


Table 1: Machine clear performance counters

Name                             Description

MACHINE_CLEARS.COUNT             Number of machine clears of any type
MACHINE_CLEARS.SMC               Number of machine clears caused by a self/cross-modifying code
MACHINE_CLEARS.DISAMBIGUATION    Number of machine clears caused by a memory disambiguation unit misprediction
MACHINE_CLEARS.MEMORY_ORDERING   Number of machine clears caused by a memory ordering principle violation
MACHINE_CLEARS.FP_ASSIST         Number of machine clears caused by an assisted floating point operation
MACHINE_CLEARS.PAGE_FAULT        Number of machine clears caused by a page fault
MACHINE_CLEARS.MASKMOV           Number of machine clears caused by an AVX maskmov on an illegal address with a mask set to 0

Bad speculation not only causes performance degradation but also security concerns [6–8, 31, 41, 43, 47, 55, 59, 63, 71, 74–76, 79, 82]. In contrast to branch misprediction, extensively studied by security researchers [10, 11, 31, 40, 41, 43, 48], machine clears have undergone little scrutiny. In this paper, we perform the first deep analysis of machine clears and the corresponding root causes of transient execution.

Since machine clears are hardly documented, we examined all the performance counters for every Intel architecture and found a number of relevant counters (Table 1). Some counters are present only in specific architectures. For example, the page fault counter is available only on Goldmont Plus. However, thanks to the generic counter MACHINE_CLEARS.COUNT, it is always possible to count the overall number of machine clears, regardless of the architecture. In the remainder of this work, we will reverse engineer and analyze the causes of machine clears by means of these six counters.

As a general observation, we note that the Floating Point Assist and Page Fault counters immediately suggest that machine clears are also related to microcode assists and faults/exceptions. In particular, further analysis shows that:

• Microcode assists trigger machine clears. The hardware occasionally needs to resort to microcode to handle complex events. Doing so requires flushing all the pending instructions with a machine clear before handling a microcode assist. Indeed, in our experiments, where the OTHER_ASSISTS.ANY counter increased, we also observed a matching increase in MACHINE_CLEARS.COUNT.

• Machine clears do not necessarily trigger microcode assists. Not all machine clears are microcode assisted, as some machine clear causes are handled directly in silicon. Indeed, in our experiments, we observed that SMC, MD, and MO machine clears cause an increase of MACHINE_CLEARS.COUNT but leave OTHER_ASSISTS.ANY unaltered.

• An exception triggers a machine clear. When a fault or exception is detected, the subsequent µOps must be flushed from the pipeline with a machine clear, as the execution should resume at the exception handler. Indeed, in our experiments, we observed an increase of MACHINE_CLEARS.COUNT at each faulty instruction.

In this paper, we focus on the machine clear causes mentioned by the Intel documentation (Table 1), acknowledging that undocumented causes may still exist (much like undocumented x86 instructions [16]). We now first examine the four most relevant causes of machine clears (Self-Modifying Code MC, Floating Point MC, Memory Ordering MC, and Memory Disambiguation MC), then briefly discuss the other cases.

5 Self-Modifying Code Machine Clear

In a Von Neumann architecture, stores may write instructions as data and modify program code as it is being executed, as long as the code pages are writable. This is commonly referred to as Self-Modifying Code (SMC).

Self-modifying code is problematic for the Instruction Fetch Unit (IFU), which maintains high execution throughput by aggressively prefetching the instructions it expects to execute next and feeding them to the decode units. The CPU speculatively fetches and decodes the instructions and feeds them to the execution units, well ahead of retirement.

In case of a misprediction, the CPU flushes the speculatively processed instructions and resumes execution at the correct target. The IFU's aggressive prefetching ensures that the first-level instruction cache (L1i) is constantly filled with instructions which are either currently in (possibly speculative) execution or about to be executed. As a result, a store instruction targeting code cached in L1i requires drastic measures, as the associated cache lines should now be invalidated. Moreover, the target instructions do not even need to be part of the actual execution: since a prefetch is sufficient to bring them into L1i, any write to prefetched instructions also invalidates the prefetch queue. In other words, the problem occurs when the code is already in L1i or the store is sufficiently close to the target code to ensure the target is prefetched in L1i. This behavior leads to a temporary desynchronization between the code and data views of the CPU, transiently breaking the architectural memory model (where L1d/L1i coherence ensures consistent code/data views).

To reverse engineer this behavior, we use the analysis code exemplified by Listing 1. The store at line 15 overwrites code already cached in L1i (lines 18-21), triggering a machine clear. The machine clear needs to update the L1i cache (and related microarchitectural structures) by flushing any stale instruction(s), resuming execution at the last retired instruction, and then fetching the new instructions. To test for the presence of a transient execution path, our analysis code immediately executes lines 18-21 targeted by the store and jumps to a spec_code gadget (lines 29-32), which fills a number of cache lines in a (flushed) buffer. Architecturally, this gadget should never be executed, as the store instruction should nop out the branch at line 18. However, microarchitecturally, we do observe multiple cache hits in the (reload) buffer using FLUSH + RELOAD, which confirms the existence of a pre-SMC-handling transient execution path executing stale code and leaving observable traces in the cache. We observe that the scheduling of the store instruction heavily affects the length of the transient path. Indeed, we use different instructions (lines 5-8) to delay as much as possible the store retirement and thus the SMC detection. This suggests that the root cause of the observed transient window might be the microarchitectural de-synchronization between the store buffer (new code) and the instruction queue (stale code), yielding transiently incoherent code/data views. We sampled machine clear performance counters to confirm the transient execution window is caused by the SMC machine clear and not by other events (e.g., memory disambiguation misprediction). Finally, we repeated our experiments on uncached code memory and could not observe any transient path. The counters revealed one machine clear triggered for each executed instruction, since the CPU has to pessimistically assume every fetched instruction has potentially been overwritten. Additionally, we have verified that SMC detection is performed on physical addresses rather than virtual ones.

Cross-Modifying Code

Instead of modifying its own instructions, a thread running on one core may also write data into the currently executing code segment of a thread running on a different physical core. Such Cross-Modifying Code (XMC) may be synchronous (the executor thread waits for the writer thread to complete before executing the new code) or asynchronous, with no particular coordination between threads. To reverse engineer the behavior, we distributed our analysis code across cores and reproduced a signal on the reload buffer in both cases. This confirms a Cross-Modifying Code Machine Clear (XMC machine clear) behaves similarly to an SMC machine clear across cores, with a store on the writer core originating a transient execution window on the executor core.

6 Floating Point Machine Clear

On Intel, when the Floating Point Unit (FPU) is unable to produce results in the IEEE-754 [1] format directly, for instance in the case of denormal operands or results [4, 17], the CPU requires special handling to produce a correct result. According to an Intel patent [62], the denormalization is indeed implemented as a microcode assist or an exception handler, since the corresponding hardware would be too complex. In our experiments, we observed microcode assists on all x87, SSE, and AVX instructions that perform mathematical operations on denormal floating-point (FP) numbers. Increments of the FP_ASSIST.ANY or MACHINE_CLEARS.COUNT (or, on older processors, MACHINE_CLEARS.FP_ASSIST) performance counters confirm such assists cause machine clears.

Listing 1: Self-Modifying Code Machine Clear analysis code

     1 smc_snippet:
     2   push r11
     3   lea r11, [target]        ; Load addr of target instr (line 17)
     4
     5   clflush [r11]            ; These instructions serve as a delay
     6   %rep 10                  ; for the store argument address. They
     7     imul r11, 1            ; ensure that the execution window of
     8   %endrep                  ; spec_code is as long as possible
     9
    10   ; Code to write as data: 8 nops (overwriting lines 18-21)
    11   mov rax, 0x9090909090909090
    12
    13   ; Store at target addr. Also the last retired instr
    14   ; from which the execution will resume after the SMC MC
    15   mov QWORD [r11], rax
    16
    17 target:                    ; Target instruction to be modified
    18   jmp spec_code
    19   nop
    20   nop
    21   nop
    22
    23   ; Architectural exit point of the function
    24   pop r11
    25   ret
    26
    27 ; Code executed speculatively (flushed after SMC MC)
    28 spec_code:
    29   mov rax, [rdi+0x0]       ; rdi: covert channel reload buffer
    30   mov rax, [rdi+0x400]
    31   mov rax, [rdi+0x800]
    32   mov rax, [rdi+0xc00]

Since a machine clear implies a pipeline flush, the assisted FP operation will be squashed together with subsequent µOps. To reverse engineer the behavior, we used analysis code exemplified by Algorithm 1. Our code relies on a FLUSH + RELOAD [81] covert channel to observe the result of a floating-point operation at the byte granularity. In our experiments, we observed two different hits in the reload buffer for each byte, for the transient and architectural result, respectively. The double-hit microarchitectural trace confirms that the transient (and wrong) value generated by the FPU is used in subsequent µOps, as also exemplified in Table 2. Later, the CPU detects the error and triggers a machine clear to flush the wrongly executed path. The microcode assist then corrects the result, while subsequent instructions are reissued.

Algorithm 1: Floating Point Machine Clear analysis (pseudo) code. byte is used to extract the i-th byte of z.

    1: for i ← 1, 8 do
    2:   flush(reload_buf)
    3:   z = x / y                       // Any denormal FP operation
    4:   reload_buf[byte(z, i) * 1024]
    5:   reload(reload_buf)
    6: end for

           Representation        Value           Type   Exp

    x      0x0010deadbeef1337    2.34e-308       N      -1022
    y      0x40f0000000000000    65536 (2^16)    N      16
    zarch  0x00000010deadbeef    3.57e-313       D      -1022
    ztran  0x3f10deadbeef1337    6.43e-05        N      -14

Table 2: Architectural (zarch) and transient (ztran) results of dividing x and y of Algorithm 1. N: Normal, D: Denormal representations. A normal division by 2^16 leaves the mantissa untouched and subtracts 16 from the exponent: the result of ztran, where the exponent overflowed from -1022 to -14.

Figure 3: Transient execution due to invalid memory ordering

While we could not find any documentation on floating-point assist handling (even in the patents), our experiments revealed the following important properties. First, we verified that many FP operations can trigger FP assists (i.e., add, sub, mul, div, and sqrt) across different extensions (i.e., x87, SSE, and AVX). Second, the transient result is computed by "blindly" executing the operation as if both operands and result are normal numbers (see Table 2). Third, the detection of the wrong computation occurs later in time, creating a transient execution window. Finally, by performing multiple floating-point operations together with the assisted one, we were able to expand the size of the window, suggesting that detection is delayed if the FPU is busy handling multiple operations.
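The values in Table 2 can be checked by decoding the IEEE-754 fields directly. The Python sketch below is an illustration with hypothetical helper names: it recomputes the architectural result with an ordinary division (assuming the host handles denormals per IEEE-754, which is the default) and compares the mantissa and exponent fields of the transient value reported in Table 2 against those of x. It does not model how the FPU produces the transient value.

```python
import struct

MANTISSA_MASK = (1 << 52) - 1
EXPONENT_BIAS = 1023


def to_double(bits):
    """Reinterpret a 64-bit integer as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def to_bits(value):
    """Reinterpret an IEEE-754 double as a 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def fields(bits):
    """Return (biased exponent, 52-bit mantissa) of a double's bit pattern."""
    return (bits >> 52) & 0x7FF, bits & MANTISSA_MASK


X = 0x0010DEADBEEF1337        # smallest normal exponent, mantissa 0xdeadbeef1337
Y = 0x40F0000000000000        # 65536.0 = 2**16
Z_TRAN = 0x3F10DEADBEEF1337   # transient value as reported in Table 2

z_arch = to_bits(to_double(X) / to_double(Y))  # architectural (denormal) result

print(hex(z_arch), fields(z_arch))
print(hex(Z_TRAN), fields(Z_TRAN), fields(Z_TRAN)[0] - EXPONENT_BIAS)
```

The architectural result has a zero exponent field (a denormal) with the mantissa shifted right, while the transient value keeps the mantissa of x untouched and only differs in its exponent field, consistent with the operation being executed as if on normal numbers.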

7 Memory Ordering Machine Clear

Algorithm 2: Pseudo-code triggering a MO machine clear

    Processor A
    1: clflush(X)      // Make the load slow
    2: unlock(lock)    // Synchronize loads and stores
    3: r1 ← [X]
    4: r2 ← [Y]
    5: reload(r2)      // For FLUSH + RELOAD

    Processor B
    1: wait(lock)      // Synchronize loads and stores
    2: 1 → [Y]

The CPU initiates a memory ordering (MO) machine clear when, upon receiving a snoop request, it is uncertain if memory ordering will be preserved [36]. Consider the program in Figure 3. Processor A loads X and Y, while processor B performs two stores to the same locations. If the load of X is slow due to a cache miss, the out-of-order CPU will issue the next load (and subsequent operations) ahead of schedule. Suppose that, while the load of X is pending, processor B signals through a snoop request that the values of X and Y have changed. In this scenario, the memory ordering is A1-B0-B1-A0, which is not allowed according to the Total-Store-Order memory model. As two loads cannot be reordered, r1=1, r2=0 is an illegal result. Thus, processor A has no choice other than to flush its pipeline and re-issue the load of Y in the correct order. This MO machine clear is needed for every inconsistent speculation on the memory ordering, implementing speculation behavior originally proposed by Gharachorloo et al. [27], with the advantage that strict memory order principles can co-exist with aggressive out-of-order scheduling.

Notice that, in the previous example, the store on X is not even necessary to cause a memory ordering violation, since A1-B1-A0 is still an invalid order. Counterintuitively, the memory ordering violation disappears if the load on X is not performed, as A1-B0-B1 is a perfectly valid order.
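The three orders named above can be checked mechanically. The following Python sketch is an illustration only, not part of the original analysis: it treats an order as valid when each processor's operations appear in program order, which is all that matters in this example because processor A issues only loads and processor B only stores. The operation labels (A0, A1, B0, B1) follow the text; the function name is an assumption.

```python
def respects_program_order(order):
    """Return True if each processor's operations appear in program order.

    Each operation is a label such as "A1": a processor letter followed by
    its position in that processor's program.
    """
    last_seen = {}
    for op in order:
        cpu, position = op[0], int(op[1:])
        if position < last_seen.get(cpu, -1):
            return False  # a later operation was observed before an earlier one
        last_seen[cpu] = position
    return True


if __name__ == "__main__":
    for text in ("A1-B0-B1-A0", "A1-B1-A0", "A1-B0-B1"):
        print(text, respects_program_order(text.split("-")))
```

The check deliberately ignores the store-to-load relaxation that Total-Store-Order permits, since no processor in this example mixes stores and loads.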

To reverse engineer the memory ordering handling behavior on Intel CPUs, we used analysis code exemplified by Algorithm 2. Our code mimics the scenario of Figure 3, with loads/stores from two threads racing against each other, but relies on a FLUSH + RELOAD [81] covert channel to observe the loaded value of Y. The synchronization through the lock variable ensures the desired (problematic) proximity and ordering of the memory operations.

As before, in our experiments, we observed two different hits in the reload buffer for the loaded value of Y: one for the stale (transient) value and one for the new (architectural) value. For every double hit in the reload buffer, we also measured an increase of MACHINE_CLEARS.MEMORY_ORDERING. We also ran experiments without the load of X. Even though memory ordering violations are no longer possible, since each processor executes a single memory operation and any order is permitted, we still observed MO machine clears. The reason is that Intel CPUs seem to resort to a simple but conservative approach to memory ordering violation detection, where the CPU initiates a MO machine clear when a snoop request from another processor matches the source of any load operation in the pipeline. We obtained the same results with the two threads running across physical or logical (hyperthreaded) cores. Similarly, the results are unchanged if the matching store and load are performed on different addresses: the only requirement we observed is that the memory operations need to refer to the same cache line. Overall, our results confirm the presence of a transient execution window and the ability of a thread to trigger a transient execution path in another thread by simply dirtying a cache line used in a ready-to-commit load. Exploitation-wise, abusing this type of MC is non-trivial, due to the strict synchronization requirements and the difficulty of controlling pending stale data.
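The "same cache line" requirement can be made concrete with a short Python sketch. It assumes 64-byte cache lines, which is typical for recent x86 processors but is not stated in the text; the constant and function names are hypothetical.

```python
CACHE_LINE_SIZE = 64  # bytes; an assumption, typical for recent x86 CPUs


def same_cache_line(addr_a, addr_b, line_size=CACHE_LINE_SIZE):
    """Return True if both byte addresses fall within the same cache line."""
    return addr_a // line_size == addr_b // line_size


if __name__ == "__main__":
    # A store to 0x103f and a load from 0x1000 use different addresses but
    # share a line, so under the observed behavior they would still match.
    print(same_cache_line(0x1000, 0x103F))
    print(same_cache_line(0x1000, 0x1040))
```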

8 Memory Disambiguation Machine Clear

As suggested by the MACHINE_CLEARS.DISAMBIGUATION counter description, memory disambiguation (MD) mispredictions are handled via machine clears. Our experiments confirmed this behavior by observing matching increases of MACHINE_CLEARS.COUNT. Moreover, we observed no changes in microcode assist counters, suggesting mispredictions are resolved entirely in hardware. In case of a misprediction, a stale value is passed to subsequent loads, a primitive that was previously used to leak secret information with Spectre Variant 4 or Speculative Store Bypass (SSB) [55, 82].

Different from the other machine clears, MD machine clears trigger only on address aliasing mispredictions, when the CPU wrongly predicts a load does not alias a preceding store instruction and can be hoisted. Similar to branch prediction, generating a MD-based transient execution window requires mistraining the underlying predictor. The latter has complex, undocumented behavior, which has been partially reverse engineered [18]. For space constraints, we discuss our full reverse engineering strategy in Appendix A. Our results show that executing the same load 64 + 15 times with non-aliasing stores is sufficient to ensure the next prediction to be "no-alias" (and thus the next load to be hoisted). Upon reaching the hoisting prediction, one can perform the load with an aliasing store to trigger the transient path exposed to the incorrect, stale value.

Finally, we experimentally verified that 4k aliasing [50] does not cause any machine clear, but only incurs a further time penalty in case of wrong aliasing predictions. More details on 4k aliasing results can be found in Appendix A.

9 Other Types of Machine Clear

AVX vmaskmov. The AVX vmaskmov instructions perform conditional packed load and store operations depending on a bitmask. For example, a vmaskmovpd load may read 4 packed doubles from memory depending on a 4-bit mask: each double will be read only if the corresponding mask bit is set; the others will be assigned the value 0.

According to the Intel Optimization Reference Manual [36], the instruction does not generate an exception in face of invalid addresses, provided they are masked out. However, our experiments confirm that it does incur a machine clear (and a microcode assist) when accessing an invalid address (e.g., with the present bit set to 0) with a loading mask set to zero (i.e., no bytes should be loaded), to check whether the bytes in the invalid address have the corresponding mask bits set or not. We speculate the special handling is needed because the permission check is very complex, especially in the absence of memory alignment requirements.

In our experiments, vmaskmov instructions with all-zero masks and invalid addresses increment the OTHER_ASSIST.ANY and MACHINE_CLEARS.COUNT counters, confirming that the instruction triggers a machine clear. However, the resulting transient execution window seems short-lived or absent, as we were unable to observe cache or other microarchitectural side effects of the execution.
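The architectural semantics of the masked load described above can be sketched in Python as follows. This is a software model of the simplified 4-bit-mask description given in the text, not of the hardware encoding of the mask operand; memory is modeled as a dictionary from element index to value, and a missing key stands in for an invalid address. All names are hypothetical.

```python
def masked_load(memory, base, mask, lanes=4):
    """Model a masked packed load: lane i is read only if mask bit i is set.

    Lanes whose mask bit is clear are never accessed and yield 0.0, so an
    unmapped location behind a cleared bit raises no error in this model.
    """
    result = []
    for lane in range(lanes):
        if (mask >> lane) & 1:
            result.append(memory[base + lane])  # KeyError models a fault
        else:
            result.append(0.0)
    return result


if __name__ == "__main__":
    memory = {0: 1.5, 1: 2.5, 2: 3.5, 3: 4.5}
    print(masked_load(memory, 0, 0b0101))
    # All-zero mask on unmapped memory: no architectural fault in the model.
    print(masked_load({}, 0x1000, 0b0000))
```

The model captures only what is architecturally visible. The machine clear and microcode assist reported above are microarchitectural effects that it does not represent.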

Exceptions. The MACHINE_CLEARS.PAGE_FAULT counter, present on older microarchitectures, confirms page faults are another cause of machine clears. Indeed, we verified each instruction incurring a page fault, or any other exception such as "Division by zero", increments the MACHINE_CLEARS.COUNT counter. We also verified exceptions do not trigger microcode assists and software interrupts (traps) do not trigger machine clears. Indeed, the Intel documentation [37] specifies that instructions following a trap may be fetched but not speculatively executed. Transient execution windows originating from exceptions, and page faults in particular, have been extensively used in prior work, with a faulty load instruction also used as the trigger to leak information [7, 47, 63, 75].

Hardware interrupts. Although hardware interrupts are an undocumented cause of machine clears, our experiments with APIC timer interrupts showed they do increment the MACHINE_CLEARS.COUNT counter. While this confirms hardware interrupts are another root cause of transient execution, the asynchronous nature of these events yields a less than ideal vector for transient execution attacks. Nonetheless, hardware interrupts play an important role in other classes of microarchitectural attacks [73].

Microcode assists. Microcode assists require a pipeline flush to insert the required µOps in the frontend and represent a subclass of machine clears (and thus a root cause of transient execution) for cases where a fast path in hardware is not available. In this paper, we detailed the behavior of floating-point and vmaskmov assists. Prior work has discussed different situations requiring microcode assists, such as those related to page table entry Access/Dirty bits, typically in the context of assisted loads used as the trigger to leak information [8, 63, 75]. AVX-to-SSE transitions [36] represent another microcode assist, which, based on our experiments, we could not observe on modern Intel CPUs. Remaining known microcode assists, such as access control of memory pages belonging to SGX secure enclaves and Precise Event Based Sampling (PEBS), are not presented in this work, since they are already studied [14] or only related to privileged performance profiling, respectively.

Figure 4: Top plot: transient window size vs. mechanism. Each bar reports the number of transient loads that complete and leave a microarchitectural trace. Bottom plot: leakage rate vs. mechanism for a simple Spectre Bounds-Check-Bypass attack and a 1-bit FLUSH + RELOAD (F+R) cache covert channel. F+R is the leakage rate upper bound (covert channel loop only, no actual transient window or attack). [Plot not reproduced. Mechanisms on the x-axis: F+R, TSX, BHT, FAULT, SMC, XMC, FP, MD, MO. Y-axes: number of transient loads (top), leakage rate in Mb/s (bottom). CPUs: Intel Core i7-10700K, Intel Xeon Silver 4214, Intel Core i9-9900K, Intel Core i7-7700K, AMD Ryzen 5 5600X, AMD Ryzen Threadripper 2990WX, AMD Ryzen 7 2700X.]

10 Transient Execution Capabilities

Transient execution attacks rely on crafting a transient window to issue instructions that are never retired. For this purpose, state-of-the-art attacks traditionally rely on mechanisms based on root causes such as branch mispredictions (BHT) [41, 43], faulty loads (Fault) [8, 47, 63, 75], or memory transaction aborts (TSX) [8, 47, 59, 63, 75]. However, the different machine clears discussed in this paper provide an attacker with the exact same capabilities.

To compare the capabilities of machine clear-based transient windows with those of more traditional mechanisms, we implemented a framework able to run arbitrary attacker-controlled code in a window generated by a mechanism of choice. We now evaluate our framework on recent processors (with all the microcode updates and mitigations enabled) to compare the transient window size and leakage rate of the different mechanisms.

10.1 Transient Window Size

The transient window size provides an indication of the number of operations an attacker can issue on a transient path before the results are squashed. Larger windows can, in principle, host more complex attacks. Using a classic FLUSH + RELOAD cache covert channel (F+R) as a reference, we measure the window size by counting how many transient loads can complete and hit entries in a designated F+R buffer. Figure 4 (top) presents our results.

As shown in the figure, the window size varies greatly across the different mechanisms. Broadly speaking, mechanisms that have a higher detection cost, such as XMC and MO Machine Clear, yield larger window sizes. Not surprisingly, branch mispredictions yield the largest window sizes, as we can significantly slow down the branch resolution process (i.e., causing cache misses) and delay detection. FP, on the other hand, yields the shortest windows, suggesting that denormal numbers are efficiently detected inside the FPU. Our results also show that, while our framework was designed for Intel processors, similar if not better results can usually be obtained on AMD processors (where we use the same conservative training code for branch/memory prediction). This shows that both CPU families share a similar implementation in all cases except for MD, where the used mistraining pattern is not valid for pre-Zen3 architectures.

10.2 Leakage Rate

To compare the leakage rates for the different transient execution mechanisms, we transiently read and repeatedly leak data from a large memory region through a classic F+R cache covert channel. We report the resulting leakage rates, as the number of bits successfully leaked per second, across different microarchitectures, using a 1-bit covert channel to highlight the time complexity of each mechanism. We consider data to be successfully leaked after a single correct hit in the reload buffer. In case of a miss for a particular value, we restart the leak for the same value until we get a hit (or until we get 100 misses in a row).
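The retry policy described above can be sketched as a small Python helper. This is a minimal sketch and not the authors' framework: `try_leak` is a hypothetical callable standing in for one transient-window-plus-reload iteration, returning the leaked value on a hit in the reload buffer and `None` on a miss.

```python
def leak_value(try_leak, max_misses=100):
    """Retry one leak until the first hit, or give up after max_misses misses.

    Returns a (value, attempts) pair; value is None if every attempt missed.
    """
    for attempt in range(1, max_misses + 1):
        value = try_leak()
        if value is not None:
            return value, attempt  # a single hit counts as success
    return None, max_misses


if __name__ == "__main__":
    # Deterministic stand-in for the covert channel: two misses, then a hit.
    outcomes = iter([None, None, 1])
    print(leak_value(lambda: next(outcomes)))
```

Counting attempts alongside the value makes it straightforward to fold the cost of misses into a bits-per-second figure.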

As shown in Figure 4 (bottom), different Intel and AMD microarchitectures generally yield similar leakage rates, with some variations. For instance, FP MC offers better leakage rates on Intel. This difference stems from the different performance impact of the corresponding machine clears on Intel vs. AMD microarchitectures.

Figure 5: Leakage rate vs. mechanism with a 1-bit, 4-bit, or 8-bit FLUSH + RELOAD cache covert channel (Intel Core i9). [Plot not reproduced. Mechanisms on the x-axis: F+R, TSX, BHT, FP, SMC, XMC, MO, MD, FAULT. Y-axis: leakage rate in Mb/s. Legend: F+R granularity of 1, 4, or 8 bits.]

As shown in Figure 5, the leakage rate varies instead greatly across the different mechanisms, and so does the optimal covert channel bitwidth. Indeed, while existing attacks typically rely on 8-bit covert channels, our results suggest 1-bit or 4-bit channels can be much more efficient, depending on the specific mechanism. Roughly speaking, optimal leakage rates can be obtained by balancing the time complexity (and hence bitwidth) of the covert channel with that of the mechanism. For example, FP is a lightweight and reliable mechanism, hence using a comparably fast and narrow 1-bit covert channel is beneficial. In contrast, MD requires a time-consuming predictor training phase between leak iterations, and leaking more bits per iteration with a 4-bit covert channel is more efficient. Interestingly, a classic 8-bit covert channel yields consistently worse and comparable leakage rates across all the mechanisms, since F+R dominates the execution time.
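The balancing argument can be illustrated with a toy cost model in Python. The model is an assumption made for illustration, not one taken from the measurements: each leak iteration pays a fixed cost for the transient execution mechanism plus one probe per entry of a 2^b-entry F+R buffer, and transfers b bits. The cost values are arbitrary units, and all names are hypothetical.

```python
def leak_rate(bits, mechanism_cost, probe_cost=1.0):
    """Bits leaked per time unit for one leak iteration (toy model)."""
    entries = 2 ** bits  # F+R buffer entries to flush and reload
    return bits / (mechanism_cost + entries * probe_cost)


def best_bitwidth(mechanism_cost, candidates=(1, 4, 8)):
    """Pick the covert channel width that maximizes the modeled rate."""
    return max(candidates, key=lambda bits: leak_rate(bits, mechanism_cost))


if __name__ == "__main__":
    # A cheap mechanism (cost 1) versus one with an expensive setup (cost 100).
    for cost in (1, 100):
        print(cost, best_bitwidth(cost))
```

Under these assumed costs, a cheap mechanism favors the 1-bit channel because the probe loop dominates, whereas a mechanism with an expensive per-iteration setup amortizes that setup better over a 4-bit channel. This mirrors the qualitative trend reported for FP and MD.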

Our results show that only two mechanisms (TSX and FP) are close to the maximum theoretical leakage rate of pure F+R. Moreover, FP performs as efficiently as TSX but, unlike TSX, is available on both Intel and AMD, is always enabled, and can be used from managed code (e.g., JavaScript). BHT, on the other hand, yields the worst leakage rates due to the inefficient training-based transient window. The BHT leakage rate can be improved if a tailored mistrain sequence is used, as in MD (Appendix A). Overall, our results show that machine clear-based windows achieve comparable and, in many cases (e.g., FP), better leakage rates compared to traditional mechanisms. Moreover, many machine clears eliminate the need for mistraining, which, other than resulting in efficient leakage rates, can escape existing pattern-based mitigations and disabled hardware extensions (e.g., Intel TSX).

11 Attack Primitives

Building on our reverse engineering results and focusing on the unexplored SMC and FP machine clears, we now present two new transient execution attack primitives and analyze their security implications. We also present an end-to-end FPVI exploit disclosing arbitrary memory in Firefox. Later, we discuss mitigations.

11.1 Speculative Code Store Bypass (SCSB)

Our first attack primitive, Speculative Code Store Bypass (SCSB), allows an attacker to execute stale, controlled code in a transient execution window originated by a SMC machine clear. Since the primitive relies on SMC, its primary applicability is on JIT (e.g., JavaScript) engines running attacker-controlled code, although OS kernels and hypervisors storing code pages and allowing their execution without first issuing a serializing instruction are also potentially affected.

Figure 6: SCSB primitive example, where the instruction pointer is pointing at the bold code blocks. g code is freed JIT'ed code (of some g function) under attacker's control. (1) Force engine to JIT and execute code of function f, causing desynchronization of code and data views. (2) Execute stale code and SMC MC. (3) After the SMC MC, code and data view coherence is restored and the new code is executed.

As exemplified in Figure 6, the operations of the primitive can be broken down into three steps: (1) the JIT engine compiles a function f, storing the generated code into a JIT code cache region previously used by a (now-stale) version of function g; (2) the JIT engine jumps to the newly generated code for the function f, but due to the temporary desynchronization between the code and data views of the CPU, this causes transient execution of the stale code of g until the SMC machine clear is processed; (3) after the pipeline flush, the code and data views are resynchronized and the CPU restarts the execution of the correct code of f. For exploitation, the attacker needs to: (i) massage the JIT code cache allocator to reuse a freed region with a target gadget g of choice; (ii) force the JIT engine to generate and execute new (f) code in such a region, enabling transient, out-of-context execution of the gadget and spilling secrets into the microarchitectural state.

Our primitive bears similarities with both transient and architectural primitives used in prior attacks. On the transient front, our primitive is conceptually similar to a Speculative-Store-Bypass (SSB) primitive [55, 82], but can transiently execute stale code rather than reading stale data. However, interestingly, the underlying causes of the two primitives are quite different (MD misprediction vs. SMC machine clear). On the architectural front, our primitive mimics classic Use-After-Free (UAF) exploitation on the JIT code cache, also known as Return-After-Free (RAF) in the hackers community [19, 25]. An example is CVE-2018-0946, where a use-after-free vulnerability can be exploited to force the Chakra JS engine to erroneously execute freed (attacker-controlled) JIT code, resulting in arbitrary code execution after massaging the right gadget into the JIT code cache [24].

Figure 7: Coding options suggested by the Intel Architectures Software Developer's Manuals to handle SMC and XMC execution. Option 1 describes the exact steps required by our Speculative Code Store Bypass attack primitive, potentially resulting in exploitable gadgets.

Indeed, at a high level, SCSB yields a transient use-after-free primitive on a JIT code cache, with exploitation properties similar to its architectural counterpart. However, there are some differences due to the transient nature of SCSB. First, we need to find an out-of-context gadget to transiently leak data, rather than architecturally execute arbitrary code. In JavaScript engines, similar gadgets have already been exploited by ret2spec [48], escalating out-of-context transient execution of valid code to type confusion, arbitrary reads, and ultimately a secret-dependent load to transmit the value.

Second, we need to target short-lived JIT code cache updates, so that the newly generated code is immediately executed, fitting the target gadget in the resulting transient execution window. Interestingly, executing JIT'ed code is the next step after JIT code generation mentioned in the Intel manual (Figure 7, Option 1), suggesting that developers who follow such directives ad litteram can easily introduce such gadgets. Moreover, modern JavaScript compilers feature a multi-stage optimization pipeline [12, 54], and short-lived JIT code cache updates are favored for just-in-time (re)optimization. Indeed, we found test code in V8 to specifically test such updates and verify instruction/data cache coherence [70].

Finally, we need to ensure JIT code cache updates are not accompanied by barriers that force immediate synchronization of the code and data views. We analyzed the code of the popular SpiderMonkey and V8 JavaScript engines and verified that the functions called upon JIT code cache updates to synchronize instruction and data caches (Listing 2 and Listing 3) are always empty on x86 (as expected, given Intel's primary recommendation in Figure 7).

Listing 2: Chromium instruction cache flush (chromium/src/v8/src/codegen/x64/cpu-x64.cc)

    void CpuFeatures::FlushICache(void* start, size_t size) {
      // No need to flush the instruction cache on Intel
    }

Listing 3: Firefox instruction cache flush (mozilla-unified/js/src/jit/FlushICache.h)

    inline void FlushICache(void* code, size_t size,
                            bool codeIsThreadLocal = true) {
      // No-op. Code and data caches are coherent on x86 and x64.
    }

Overall, while the exploitation is far from trivial (i.e., having to address the challenges of traditional use-after-free exploitation, as well as transient execution exploitation in the browser), we believe SCSB expands the attack surface of transient execution attacks in the browser. We found a number of candidate SCSB gadgets in real-world code, but none of them was ultimately exploitable, due to the coincidental presence of some serializing instruction preventing stale code execution. Nonetheless, mitigations are needed to enforce security-by-design. Luckily, as shown later, SCSB is amenable to a practical and efficient mitigation (i.e., at a similar cost as faced by non-x86 architectures). The implementation/performance cost for mitigation is even lower than for its sibling SSB primitive, whose mitigation has been deployed in practice even in absence of known practical exploits [55].

In Table 3, we show that all tested Intel and AMD processors are affected by SCSB. In contrast, ARM is not vulnerable, since SMC updates require explicit software barriers.

11.2 Floating Point Value Injection (FPVI)

Our second attack primitive, Floating Point Value Injection (FPVI), allows an attacker to inject arbitrary values into a transient execution window originated by a FP machine clear. As exemplified in Figure 8, the operations of the primitive can be broken down into four steps: (1) the attacker triggers the execution of a gadget starting with a denormal FP operation in the victim application, with the x and y operands under attacker's control; (2) the transient z result of the operation is processed by the subsequent gadget instructions, leaving a microarchitectural trace; (3) the CPU detects the error condition (i.e., wrong result of a denormal operation), triggering a machine clear and thus a pipeline flush; (4) the CPU re-executes the entire gadget with the correct architectural z result. For exploitation, the attacker needs to: (i) massage the x and y operands to inject the desired z value into the victim transient path; and (ii) target a victim gadget so that the injected value yields a security-sensitive trace, which can be observed with FLUSH + RELOAD or other microarchitectural attacks.

Our primitive bears similarities with Load-Value-Injection (LVI) [71], since both allow attackers to inject controlled values into transient execution. Moreover, both primitives require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues, and hence the triggering conditions, are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. Indeed, while it is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild (much like Spectre [41], LVI [71], etc., this would require gadget scanners, subject of orthogonal research [56, 58]), we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all the prior Mozilla Firefox mitigations against transient execution attacks [52].

Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x,y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE-754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.
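The tag/payload split can be sketched in Python as follows. This sketch reflects only the layout stated above (a 17-bit tag over a 47-bit payload, with 0x1fff0 as the largest tag of a plain double) and is not SpiderMonkey's implementation. One reading is an assumption: the five-digit values 0xfff88 and 0xfffb0 are treated as the top 20 bits of the 64-bit word, which is what makes them line up with the string pointer example 0xfffb0deadbeef000. The helper names are hypothetical.

```python
TAG_SHIFT = 47                       # 17-bit tag above a 47-bit payload
PAYLOAD_MASK = (1 << TAG_SHIFT) - 1
JSVAL_TAG_MAX_DOUBLE = 0x1FFF0       # value given in the text

# Assumption: the five-hex-digit prefixes quoted in the text are the top
# 20 bits of the 64-bit word, so they are widened by 44 bits here.
INT32_PREFIX = 0xFFF88 << 44
STRING_PREFIX = 0xFFFB0 << 44


def tag_of(bits):
    """Return the 17-bit type tag of a 64-bit boxed value."""
    return bits >> TAG_SHIFT


def payload_of(bits):
    """Return the 47-bit payload (e.g. a pointer) of a boxed value."""
    return bits & PAYLOAD_MASK


def is_double(bits):
    """True if the value is interpreted as a plain IEEE-754 double."""
    return tag_of(bits) <= JSVAL_TAG_MAX_DOUBLE


def is_string(bits):
    """True if the value carries the string tag."""
    return tag_of(bits) == tag_of(STRING_PREFIX)


if __name__ == "__main__":
    boxed = 0xFFFB0DEADBEEF000
    print(hex(tag_of(boxed)), hex(payload_of(boxed)), is_string(boxed))
```

In this model, any 64-bit pattern whose tag exceeds 0x1fff0 is treated as boxed data, so a value with the string prefix is decoded as a pointer rather than as a number.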

To verify the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see gadget in Figure 8).

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the attribute length of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 52 bits set to zero, a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables, such as string pointers, can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on a FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.
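The concrete values quoted in this section can be cross-checked with a few lines of Python. The sketch reproduces only the architectural side of the example, the ordinary IEEE-754 division, together with the arithmetic on the transient bit pattern reported in the text. It cannot reproduce the transient result itself, which exists only inside the CPU. Function and constant names are hypothetical.

```python
import struct

X_BITS = 0xC000E8B2C9755600          # dividend from the text
Y_BITS = 0x0004000000000000          # divisor from the text
TRANSIENT_BITS = 0xFFFB0DEADBEEF000  # transient result reported in the text
LENGTH_OFFSET = 4                    # offset of the length attribute


def bits_to_double(bits):
    """Reinterpret a 64-bit pattern as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def is_denormal(bits):
    """True for a denormal: zero exponent field and non-zero mantissa."""
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0 and mantissa != 0


def architectural_result():
    """The division as any IEEE-754 compliant implementation computes it."""
    return bits_to_double(X_BITS) / bits_to_double(Y_BITS)


def transient_read_address():
    """Address read on the transient path: 47-bit payload plus the offset."""
    return (TRANSIENT_BITS & ((1 << 47) - 1)) + LENGTH_OFFSET


if __name__ == "__main__":
    print(is_denormal(X_BITS), is_denormal(Y_BITS))
    print(architectural_result())
    print(hex(transient_read_address()))
```

Only the divisor is denormal, which is enough to make the division an assisted operation in the sense used by the paper.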

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0, and, by wrapping this primitive with a variant of an EVICT + RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

Table 3: Tested processors.

Processor                                  Microcode    SCSB vulnerable   FPVI vulnerable
Intel Core i7-10700K                       0xe0         ✓                 ✓
Intel Xeon Silver 4214                     0x500001c    ✓                 ✓
Intel Core i9-9900K                        0xde         ✓                 ✓
Intel Core i7-7700K                        0xca         ✓                 ✓
AMD Ryzen 5 5600X                          0xa201009    ✓                 ✓†
AMD Ryzen 2990WX                           0x800820b    ✓                 ✓†
AMD Ryzen 7 2700X                          0x800820b    ✓                 ✓†
Broadcom BCM2711 Cortex-A72 (ARM v8)       -            ✗§                ✗

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified any serialization instruction such as lfence, mfence, or cpuid placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added a lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. This function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows JavaScript execution time is heavily dominated by JIT code generation/execution, and code updates have negligible impact. Our results show this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
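The effect of the masking strategy can be modeled in Python as follows. This is a sketch of the idea only, not the SpiderMonkey patch: it canonicalizes any NaN bit pattern to 0xfff8000000000000, the x86 QNaN floating-point indefinite encoding for doubles, and leaves every other value untouched. A real mitigation must perform the selection without a branch (for instance with cmov, as noted above), which a Python conditional cannot express. All names are hypothetical.

```python
QNAN_INDEFINITE = 0xFFF8000000000000  # x86 default NaN for doubles
EXPONENT_MASK = 0x7FF0000000000000
MANTISSA_MASK = 0x000FFFFFFFFFFFFF


def is_nan_bits(bits):
    """True if the 64-bit pattern encodes a NaN (any payload, any sign)."""
    all_ones_exponent = (bits & EXPONENT_MASK) == EXPONENT_MASK
    return all_ones_exponent and (bits & MANTISSA_MASK) != 0


def canonicalize(bits):
    """Replace any NaN pattern with the canonical QNaN; keep other values."""
    return QNAN_INDEFINITE if is_nan_bits(bits) else bits


if __name__ == "__main__":
    # The transient string pointer from the example is a NaN pattern, so it
    # would be replaced before dependent code can treat it as boxed data.
    print(hex(canonicalize(0xFFFB0DEADBEEF000)))
    print(hex(canonicalize(0x3FF8000000000000)))
```

Because every NaN-boxed tag lies in the NaN range of the double encoding, forcing FP results onto the canonical pattern removes the attacker's ability to inject a tagged value through an FP operation.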

A more general and automated mitigation is for the compiler to place a serializing instruction such as lfence after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable and those resistant to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76]. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by a MDS attacker. For the latter, we limit ourselves to load sinks to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.
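The source/sink strategy can be sketched over a toy, straight-line instruction list in Python. This is an illustration of the data-flow idea under strong simplifying assumptions (no control flow, no memory aliasing, a made-up instruction set) and is unrelated to the actual LLVM pass. Opcode names, the tuple layout, and the `harden` function are hypothetical.

```python
FP_OPS = {"fadd", "fsub", "fmul", "fdiv"}     # sources: computed FP values
PROPAGATE_OPS = {"mov", "add", "and", "shl"}  # ops that forward their inputs


def harden(program, mds_vulnerable=True):
    """Insert an lfence after every FP op whose result reaches a sink.

    program is a list of (opcode, dest, operands) tuples. For load, store
    and ijmp, operands[0] is the address or branch target; a store carries
    its data in operands[1]. Stores are sinks only on MDS-vulnerable systems.
    """
    sinks = {"load", "ijmp"} | ({"store"} if mds_vulnerable else set())
    origins = {}      # register -> indices of the FP ops it derives from
    to_fence = set()  # indices of FP ops that must be followed by a fence
    for index, (opcode, dest, operands) in enumerate(program):
        if opcode in sinks:
            to_fence |= origins.get(operands[0], set())
        if opcode in FP_OPS:
            origins[dest] = {index}  # simplification: track the latest FP op
        elif opcode in PROPAGATE_OPS:
            merged = set()
            for register in operands:
                merged |= origins.get(register, set())
            origins[dest] = merged
        elif dest is not None:
            origins[dest] = set()    # e.g. a loaded value is not an FP source
    hardened = []
    for index, instruction in enumerate(program):
        hardened.append(instruction)
        if index in to_fence:
            hardened.append(("lfence", None, ()))
    return hardened


if __name__ == "__main__":
    program = [
        ("fdiv", "z", ("x", "y")),    # FP source
        ("and", "p", ("z", "mask")),  # value derived from the FP result
        ("load", "v", ("p",)),        # sink: FP-derived address
        ("fmul", "w", ("a", "b")),    # FP op that never reaches a sink
        ("store", None, ("q", "w")),  # w is stored as data; address q is clean
    ]
    for instruction in harden(program):
        print(instruction)
```

In this example only the division is fenced: its result flows into a load address, whereas the multiplication result is merely stored as data and never used as an address or branch target.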

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that, on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact, up to 280%, despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance impact difference between MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by performing interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0. [Plot not reproduced. Benchmarks on the x-axis: 619.lbm_s, 638.imagick_s, 644.nab_s, SPEC17 geomean (SPEC FP 2017); 433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, SPEC06 geomean (SPEC FP 2006). Y-axis: overhead. Legend: LVI, FPVI (MDS-vulnerable system), FPVI (MDS-resistant system), LVI+FPVI (MDS-vulnerable system).]

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, an MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root-cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It's worth noting that in Figure 10 there are two separate predictor subclasses, under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74–76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a HW interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less than ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived from likely invariants [15] used by the CPU. Such invariants commonly hold but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.
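To make the two-level structure of the classification concrete, the following C sketch encodes it as a lookup table. It covers only the root causes named explicitly in the text above and is not a transcription of Figure 10, which remains the authoritative source. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The two main classes of Bad Speculation. */
typedef enum { CONTROL_FLOW_MISPREDICTION, DATA_MISPREDICTION } spec_class;

/* The four subclasses of root causes. */
typedef enum {
    PREDICTORS,
    EXCEPTIONS,
    INTERRUPTS,
    LIKELY_INVARIANT_VIOLATIONS
} spec_subclass;

typedef struct {
    const char *cause;
    spec_class cls;
    spec_subclass sub;
} root_cause;

/* Subset of root causes mentioned in the surrounding text. */
static const root_cause ROOT_CAUSES[] = {
    {"branch misprediction",  CONTROL_FLOW_MISPREDICTION, PREDICTORS},
    {"memory disambiguation", DATA_MISPREDICTION, PREDICTORS},
    {"page fault",            DATA_MISPREDICTION, EXCEPTIONS},
    {"hardware interrupt",    DATA_MISPREDICTION, INTERRUPTS},
    {"self-modifying code",   DATA_MISPREDICTION, LIKELY_INVARIANT_VIOLATIONS},
    {"floating-point assist", DATA_MISPREDICTION, LIKELY_INVARIANT_VIOLATIONS},
};

/* Returns the table entry for a root cause, or NULL if it is not listed. */
const root_cause *find_root_cause(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < sizeof ROOT_CAUSES / sizeof ROOT_CAUSES[0]; i++)
        if (strcmp(ROOT_CAUSES[i].cause, name) == 0)
            return &ROOT_CAUSES[i];
    return NULL;
}
```

Note that the Predictors subclass appears under both main classes, which is why the table stores the class and the subclass separately.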

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them through the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it through the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd Daniel Genkin and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and No. 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.
[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From GHz to MHz effective CPU frequencies. arXiv preprint arXiv:2101.01077.
[3] AMD. AMD64 Architecture Programmer's Manual.
[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.
[5] ARM. Architecture Reference Manual for Armv8-A.
[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. SMoTherSpectre: exploiting speculative execution through port contention. In CCS'19.
[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.
[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.
[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. KASLR: Break it, fix it, repeat. In ACM ASIA CCS, 2020.
[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.
[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. SgxPectre: Stealing Intel secrets from SGX enclaves via speculative execution. In 2019 IEEE EuroS&P.
[12] Chrome. V8 TurboFan documentation.
[13] Chromium. Site Isolation documentation.
[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.
[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS, 2018.
[16] Christopher Domas. Breaking the x86 ISA. Black Hat USA, 2017.
[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.
[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.
[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.
[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. BranchScope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.
[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.
[23] Firefox. Project Fission documentation.
[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.
[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.
[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.
[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.
[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.
[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.
[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
[31] Jann Horn. Reading privileged memory with a side-channel. 2018.
[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.
[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.
[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.
[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.
[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.
[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.
[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.
[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.
[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.
[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.
[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.
[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.
[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts, October 1, 2013. US Patent 8,549,263.


[45] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.
[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.
[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.
[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.
[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. MemJam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.
[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358.
[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111.
[53] Mozilla. SpiderMonkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h.
[54] Mozilla. SpiderMonkey IonMonkey documentation.
[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.
[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. SpecFuzz: Bringing Spectre-type vulnerabilities to the surface. In USENIX Security 20.
[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the Ring(s): Side channel attacks on the CPU on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.
[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. SpecTaint: Speculative taint analysis for discovering Spectre gadgets. 2021.
[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.
[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. SoK: In search of lost time: A review of JavaScript timers in browsers. In IEEE EuroS&P'21.
[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.
[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers, 2001. US Patent 6,219,684.
[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.
[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in JavaScript. In FC, IFCA, 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. NetSpectre: Read arbitrary memory over network. In ESORICS 19.
[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.
[67] Julian Stecklina and Thomas Prescher. LazyFP: Leaking FPU register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.
[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. MeltdownPrime and SpectrePrime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.
[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. SoK: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.
[70] Google V8. test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8.
[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. LVI: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P.
[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary CPU interrupt logic. In ACM CCS, 2018.
[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-Step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.
[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. SGAxe: How SGX fails in practice.
[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.
[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. CacheOut: Leaking data on Intel CPUs via cache evictions. arXiv preprint.
[77] WebKit. BrowserBench. https://browserbench.org.
[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.
[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.
[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.
[81] Yuval Yarom and Katrina Falkner. Flush+Reload: a high resolution, low noise, L3 cache side-channel attack. In USENIX Security 14.
[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                        ; rdi = store addr, rsi = load addr
        %rep 10                   ; Trick to delay the store address
        imul rdi, 1
        %endrep
        mov DWORD [rdi], 0x42     ; Store
        mov eax, DWORD [rsi]      ; Load
        %rep 10                   ; Pronounce load timing
        imul eax, 1
        %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturation counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that saturating counter is set to 0
    for(i=0; i<100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for(i=100; i<120; i++) st_ld(mem, mem+64);
    // Trigger a memory ordering machine clear
    for(i=120; i<130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
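The per-address predictor behavior described above can be modeled in a few lines of C. This is a behavioral sketch under the stated 4-bit saturating counter design, not a description of the actual hardware implementation. The `md_predictor` type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MD_COUNTER_MAX 15u /* saturation value of a 4-bit counter */

typedef struct {
    unsigned counter; /* 0..15, per-address saturating counter */
} md_predictor;

/* Hoisting is predicted only when the counter is saturated. */
bool md_predicts_hoist(const md_predictor *p) {
    return p->counter == MD_COUNTER_MAX;
}

/* Update the predictor once the aliasing outcome of a load is known.
 * Returns true when the load had been hoisted but did alias, i.e.,
 * the case that results in a machine clear. */
bool md_update(md_predictor *p, bool aliased) {
    assert(p->counter <= MD_COUNTER_MAX);
    bool hoisted = md_predicts_hoist(p);
    if (aliased)
        p->counter = 0;                /* reset on aliasing */
    else if (p->counter < MD_COUNTER_MAX)
        p->counter++;                  /* saturating increment */
    return hoisted && aliased;
}
```

Replaying the sequence of Listing 5 on this model, the counter saturates after 15 non-aliasing loads, and the first aliasing load that follows is reported as a misprediction.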

[Figure 11 plot: clock cycles (y-axis, 0 to 100) for the i-th call to st_ld (x-axis, 100 to 130).]

Figure 11: Timing measurement of the experiment in Listing 5. Orange bar: machine clear memory ordering observed.

Listing 6: Snippet observing activation of never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for(int i=0; i<10; i++) {
        for(int j=0; j<19; j++)
            st_ld(mem, mem+64);
        st_ld(mem, mem);
    }

[Figure 12 plot: clock cycles (y-axis, 0 to 100) for the i-th call to st_ld (x-axis, 0 to 120).]

Figure 12: Never-hoisting state. Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters) and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function, but with a number of nops added in the preamble (which we change at every run). The goal is to observe if two unrelated loads in these functions are able to influence each other when varying the number of nops. In our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256k bytes apart in memory (k ∈ ℕ). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
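The observed hashing of the load instruction pointer can be written down directly. The sketch below assumes exactly the observation reported above (256 predictors, indexed by the lowest 8 bits of the instruction pointer). The function names are hypothetical, and other CPU generations may use a different mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MD_NUM_PREDICTORS 256u /* observed value; the patent suggests 64 */

static_assert((MD_NUM_PREDICTORS & (MD_NUM_PREDICTORS - 1)) == 0,
              "the index mask requires a power-of-two predictor count");

/* Index of the per-address predictor used by the load at load_ip. */
unsigned md_predictor_index(uintptr_t load_ip) {
    return (unsigned)(load_ip & (MD_NUM_PREDICTORS - 1));
}

/* Two loads share (and can mistrain) the same predictor when their
 * instruction pointers differ by a multiple of 256 bytes. */
bool md_loads_collide(uintptr_t ip_a, uintptr_t ip_b) {
    return md_predictor_index(ip_a) == md_predictor_index(ip_b);
}
```

For example, an attacker load placed at any address of the form `victim_ip + 256 * k` would map to the same predictor as the victim load in this model.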

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that after 4 machine clears the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

[Figure 13 plot: total measured machine clears (y-axis, 1 to 4) against the number of independent store-loads (x-axis ticks at 15, 79, 143, 207, 271, 335, 399).]

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state.

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0, and disabled otherwise. The flush counter is decremented every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measure how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
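The interaction between the per-address counter and the watchdog can be captured in a small behavioral model. The C sketch below is one possible reading of the description above and makes several assumptions that the text does not settle: the flush counter is modeled as an integer in the range -1..3 (with -1 meaning never-hoisting), the tally of correct predictions is not reset by a machine clear, and a would-be-correct prediction made while hoisting is disabled counts toward the 64 needed to increment the flush counter. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int sat;   /* per-address saturating counter, 0..15 */
    int flush; /* watchdog flush counter, -1..3 (assumed range);
                  -1 means the never-hoisting state */
    int good;  /* correct (or would-be-correct) hoist predictions
                  since the last flush counter increment */
} md_model;

/* Simulate one store-load pair. Returns true if it causes a machine
 * clear, i.e., the load was hoisted but aliased the preceding store. */
bool md_step(md_model *m, bool aliased) {
    assert(m->sat >= 0 && m->sat <= 15);
    assert(m->flush >= -1 && m->flush <= 3);
    bool predict_hoist = (m->sat == 15);
    bool hoisted = predict_hoist && m->flush >= 0;
    if (aliased) {
        m->sat = 0;             /* aliasing resets the counter */
        if (hoisted) {
            m->flush--;         /* each MD machine clear decrements */
            return true;
        }
        return false;
    }
    if (m->sat < 15)
        m->sat++;
    if (predict_hoist && ++m->good == 64) {
        m->good = 0;            /* 64 correct predictions: increment */
        if (m->flush < 3)
            m->flush++;
    }
    return false;
}
```

Under these assumptions, the model reproduces both observations: replaying Listing 6 from a fully trusted predictor yields exactly 4 machine clears before hoisting is disabled, and 15+64=79 non-aliasing loads are needed to leave the never-hoisting state.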

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for(i=0; i<10; i++) st_ld(mem+0x1000, mem+0x1000);
    for(i=10; i<20; i++) st_ld(mem+0x2000, mem+0x2000);
    // Cause 4k aliasing
    for(i=20; i<40; i++) st_ld(mem+0x1000, mem+0x2000);

[Figure 14 plot: clock cycles (y-axis, 50 to 90) for the i-th call to st_ld (x-axis, 0 to 40).]

Figure 14: Timing measurements of Listing 7. Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Figure 15: Memory disambiguation unit – simplified view.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value, and in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load, (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
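The 4k-aliasing condition exercised by Listing 7 can be expressed as a simple address check. The sketch below is a simplification that compares only the start addresses of the store and the load, ignoring access sizes and partial overlaps. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K 0x1000u

static_assert((PAGE_SIZE_4K & (PAGE_SIZE_4K - 1)) == 0,
              "the offset mask requires a power-of-two page size");

/* True when both addresses share the same 4KB page offset (low 12 bits). */
bool same_page_offset(uintptr_t a, uintptr_t b) {
    return ((a ^ b) & (PAGE_SIZE_4K - 1)) == 0;
}

/* 4k aliasing: the load matches the page offset of the store although
 * the two accesses target different addresses (a false match). */
bool is_4k_alias(uintptr_t store_addr, uintptr_t load_addr) {
    return store_addr != load_addr &&
           same_page_offset(store_addr, load_addr);
}

/* Worst-case number of non-aliasing store-load pairs needed to train
 * the predictor to the hoisting state, as derived in the text. */
int md_worst_case_training_pairs(void) {
    return 15 + 64;
}
```

In Listing 7, the pair `(mem+0x1000, mem+0x2000)` satisfies `is_4k_alias`, whereas the pairs used to force the no-hoisting prediction are true aliases.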

B Root-Causes Description Table

Acronym          Description
BHT              Branch History Table
BTB              Branch Target Buffer
RSB              Return Stack Buffer
MD               Memory Disambiguation Unit
NM               Device Not Available Exception
DE               Divide-by-Zero Exception
UD               Invalid Opcode Exception
GP               General Protection Fault
AC               Alignment Check Exception
SS               Stack Segment Exception
PF               Page Fault
PF - U/S Bit     Page Table Entry User/Supervisor Bit PF
PF - R/W Bit     Page Table Entry Read/Write Bit PF
PF - P Bit       Page Table Entry Present Bit PF
PF - PKU         Page Table Entry Protection Keys PF
BR               Bound Range Exceeded Exception
FP               Floating Point Assist
SMC              Self-Modifying Code
XMC              Cross-Modifying Code
MO               Memory Ordering Principles Violation
MASKMOV          Masked Load/Store Instruction Assist
A/D Bits         Page Table Entry Access/Dirty Bits Assist
TSX              Intel TSX Transaction Abort
UC               Uncacheable Memory Assist
PRM              Processor Reserved Memory Assist
HW Interrupts    Hardware Interrupts



Table 1: Machine clear performance counters.

Name                              Description
MACHINE_CLEARS.COUNT              Number of machine clears of any type
MACHINE_CLEARS.SMC                Number of machine clears caused by a self/cross-modifying code
MACHINE_CLEARS.DISAMBIGUATION     Number of machine clears caused by a memory disambiguation unit misprediction
MACHINE_CLEARS.MEMORY_ORDERING    Number of machine clears caused by a memory ordering principle violation
MACHINE_CLEARS.FP_ASSIST          Number of machine clears caused by an assisted floating point operation
MACHINE_CLEARS.PAGE_FAULT         Number of machine clears caused by a page fault
MACHINE_CLEARS.MASKMOV            Number of machine clears caused by an AVX maskmov on an illegal address with a mask set to 0

Bad speculation not only causes performance degradation but also security concerns [6–8, 31, 41, 43, 47, 55, 59, 63, 71, 74–76, 79, 82]. In contrast to branch misprediction, extensively studied by security researchers [10, 11, 31, 40, 41, 43, 48], machine clears have undergone little scrutiny. In this paper, we perform the first deep analysis of machine clears and the corresponding root causes of transient execution.

Since machine clears are hardly documented, we examined all the performance counters for every Intel architecture and found a number of relevant counters (Table 1). Some counters are present only in specific architectures. For example, the page fault counter is available only on Goldmont Plus. However, thanks to the generic counter MACHINE_CLEARS.COUNT, it is always possible to count the overall number of machine clears regardless of the architecture. In the remainder of this work, we will reverse engineer and analyze the causes of machine clears by means of these six counters.

As a general observation, we note that the Floating Point Assist and Page Fault counters immediately suggest that machine clears are also related to microcode assists and faults/exceptions. In particular, further analysis shows that:

• Microcode assists trigger machine clears. The hardware occasionally needs to resort to microcode to handle complex events. Doing so requires flushing all the pending instructions with a machine clear before handling a microcode assist. Indeed, in our experiments, where the OTHER_ASSISTS.ANY counter increased, we also observed a matching increase in MACHINE_CLEARS.COUNT.

• Machine clears do not necessarily trigger microcode assists. Not all machine clears are microcode assisted, as some machine clear causes are handled directly in silicon. Indeed, in our experiments, we observed that SMC, MD, and MO machine clears cause an increase of MACHINE_CLEARS.COUNT but leave OTHER_ASSISTS.ANY unaltered.

• An exception triggers a machine clear. When a fault or exception is detected, the subsequent µOps must be flushed from the pipeline with a machine clear, as the execution should resume at the exception handler. Indeed, in our experiments, we observed an increase of MACHINE_CLEARS.COUNT at each faulty instruction.
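The three observations above amount to a small decision rule over two counter deltas. The following is a minimal sketch of that rule, not the authors' tooling: the function name and its inputs are hypothetical, and it assumes the deltas of MACHINE_CLEARS.COUNT and OTHER_ASSISTS.ANY have already been sampled around the instruction under test by some external means.

```python
def classify_event(delta_machine_clears: int, delta_other_assists: int) -> str:
    """Classify one measurement from the deltas of two performance counters.

    Both arguments are hypothetical inputs: how much MACHINE_CLEARS.COUNT and
    OTHER_ASSISTS.ANY grew while the instruction under test executed.
    """
    if delta_machine_clears == 0:
        if delta_other_assists > 0:
            # The text reports a matching machine clear for every assist,
            # so this combination would contradict the observations.
            return "unexpected: assist without machine clear"
        return "no machine clear"
    if delta_other_assists > 0:
        return "machine clear with microcode assist"
    # Per the text: SMC, MD and MO machine clears, or an exception.
    return "machine clear without microcode assist"


if __name__ == "__main__":
    print(classify_event(1, 1))
    print(classify_event(1, 0))
    print(classify_event(0, 0))
```

The rule only separates assisted from non-assisted machine clears; telling SMC, MD, MO and exceptions apart would need the cause-specific counters of Table 1 where they exist.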

In this paper, we focus on the machine clear causes mentioned by the Intel documentation (Table 1), acknowledging that undocumented causes may still exist (much like undocumented x86 instructions [16]). We now first examine the four most relevant causes of machine clears (Self-Modifying Code MC, Floating Point MC, Memory Ordering MC, and Memory Disambiguation MC), then briefly discuss the other cases.

5 Self-Modifying Code Machine Clear

In a Von Neumann architecture, stores may write instructions as data and modify program code as it is being executed, as long as the code pages are writable. This is commonly referred to as Self-modifying Code (SMC).

Self-modifying code is problematic for the Instruction Fetch Unit (IFU), which maintains high execution throughput by aggressively prefetching the instructions it expects to execute next and feeding them to the decode units. The CPU speculatively fetches and decodes the instructions and feeds them to the execution units, well ahead of retirement.

In case of a misprediction, the CPU flushes the speculatively processed instructions and resumes execution at the correct target. The IFU's aggressive prefetching ensures that the first-level instruction cache (L1i) is constantly filled with instructions which are either currently in (possibly speculative) execution or about to be executed. As a result, a store instruction targeting code cached in L1i requires drastic measures, as the associated cache lines should now be invalidated. Moreover, the target instructions do not even need to be part of the actual execution: since a prefetch is sufficient to bring them into L1i, any write to prefetched instructions also invalidates the prefetch queue. In other words, the problem occurs when the code is already in L1i, or the store is sufficiently close to the target code to ensure the target is prefetched in L1i. This behavior leads to a temporary desynchronization between the code and data views of the CPU, transiently breaking the architectural memory model (where L1d/L1i coherence ensures consistent code/data views).

To reverse engineer this behavior, we use the analysis code exemplified by Listing 1. The store at line 15 overwrites code already cached in L1i (lines 18-21), triggering a machine clear. The machine clear needs to update the L1i cache (and related microarchitectural structures) by flushing any stale instruction(s), resuming execution at the last retired instruction, and then fetching the new instructions. To test for the presence of a transient execution path, our analysis code immediately executes lines 18-21 targeted by the store and jumps to a spec_code gadget (lines 29-32), which fills a number of cache lines in a (flushed) buffer. Architecturally, this gadget should never be executed, as the store instruction should nop out the branch at line 18. However, microarchitecturally, we do observe multiple cache hits in the (reload) buffer using FLUSH + RELOAD, which confirms the existence of a pre-SMC-handling transient execution path executing stale code and leaving observable traces in the cache. We observe that the scheduling of the store instruction heavily affects the length of the transient path. Indeed, we use different instructions (lines 5-8) to delay as much as possible the store retirement and thus the SMC detection. This suggests that the root cause of the observed transient window might be the microarchitectural de-synchronization between the store buffer (new code) and the instruction queue (stale code), yielding transiently incoherent code/data views. We sampled machine clear performance counters to confirm the transient execution window is caused by the SMC machine clear and not by other events (e.g., memory disambiguation misprediction). Finally, we repeated our experiments on uncached code memory and could not observe any transient path. The counters revealed one machine clear triggered for each executed instruction, since the CPU has to pessimistically assume every fetched instruction has potentially been overwritten. Additionally, we have verified that SMC detection is performed on physical addresses rather than virtual ones.

Cross-Modifying Code

Instead of modifying its own instructions, a thread running on one core may also write data into the currently executing code segment of a thread running on a different physical core. Such Cross-Modifying Code (XMC) may be synchronous (the executor thread waits for the writer thread to complete before executing the new code) or asynchronous, with no particular coordination between threads. To reverse engineer the behavior, we distributed our analysis code across cores and reproduced a signal on the reload buffer in both cases. This confirms a Cross-Modifying Code Machine Clear (XMC machine clear) behaves similarly to a SMC machine clear across cores, with a store on the writer core originating a transient execution window on the executor core.

Listing 1: Self-Modifying Code Machine Clear analysis code

     1  smc_snippet:
     2      push r11
     3      lea r11, [target]        ; Load addr of target instr (line 17)
     4
     5      clflush [r11]            ; These instructions serve as a delay
     6      %rep 10                  ; for the store argument address. They
     7      imul r11, 1              ; ensure that the execution window of
     8      %endrep                  ; spec_code is as long as possible
     9
    10      ; Code to write as data: 8 nops (overwriting lines 18-21)
    11      mov rax, 0x9090909090909090
    12
    13      ; Store at target addr. Also the last retired instr
    14      ; from which the execution will resume after the SMC MC
    15      mov QWORD [r11], rax
    16
    17  target:                      ; Target instruction to be modified
    18      jmp spec_code
    19      nop
    20      nop
    21      nop
    22
    23      ; Architectural exit point of the function
    24      pop r11
    25      ret
    26
    27      ; Code executed speculatively (flushed after SMC MC)
    28  spec_code:
    29      mov rax, [rdi+0x0]       ; rdi: covert channel reload buffer
    30      mov rax, [rdi+0x400]
    31      mov rax, [rdi+0x800]
    32      mov rax, [rdi+0xc00]

6 Floating Point Machine Clear

On Intel, when the Floating Point Unit (FPU) is unable to produce results in the IEEE-754 [1] format directly, for instance in the case of denormal operands or results [4, 17], the CPU requires special handling to produce a correct result. According to an Intel patent [62], the denormalization is indeed implemented as a microcode assist or an exception handler, since the corresponding hardware would be too complex. In our experiments, we observed microcode assists on all x87, SSE, and AVX instructions that perform mathematical operations on denormal floating-point (FP) numbers. Increments of the FP_ASSIST.ANY or MACHINE_CLEARS.COUNT (or, on older processors, MACHINE_CLEARS.FP_ASSIST) performance counters confirm such assists cause machine clears.

Since a machine clear implies a pipeline flush, the assisted FP operation will be squashed together with subsequent µOps. To reverse engineer the behavior, we used analysis code exemplified by Algorithm 1. Our code relies on a FLUSH + RELOAD [81] covert channel to observe the result of a floating-point operation at the byte granularity. In our experiments, we observed two different hits in the reload buffer for each byte, for the transient and architectural result, respectively. The double-hit microarchitectural trace confirms that the transient (and wrong) value generated by the FPU is used in subsequent µOps, as also exemplified in Table 2. Later, the CPU detects the error and triggers a machine clear to flush the wrongly executed path. The microcode assist then corrects the result, while subsequent instructions are reissued.

While we could not find any documentation on floating-point assist handling (even in the patents), our experiments revealed the following important properties. First, we verified that many FP operations can trigger FP assists (i.e., add, sub, mul, div, and sqrt) across different extensions (i.e., x87, SSE, and AVX). Second, the transient result is computed by "blindly" executing the operation as if both operands and result are normal numbers (see Table 2). Third, the detection of the wrong computation occurs later in time, creating a transient execution window. Finally, by performing multiple floating-point operations together with the assisted one, we were able to expand the size of the window, suggesting that detection is delayed if the FPU is busy handling multiple operations.

Algorithm 1: Floating Point Machine Clear analysis (pseudo)code. byte is used to extract the i-th byte of z.

    for i ← 1, 8 do
        flush(reload_buf)
        z = x / y                          // Any denormal FP operation
        reload_buf[byte(z, i) * 1024]
        reload(reload_buf)
    end for

        Representation        Value           Type   Exp
x       0x0010deadbeef1337    2.34e-308       N      -1022
y       0x40f0000000000000    65536 (2^16)    N      16
zarch   0x00000010deadbeef    3.57e-313       D      -1022
ztran   0x3f10deadbeef1337    6.43e-05        N      -14

Table 2: Architectural (zarch) and transient (ztran) results of dividing x and y of Algorithm 1. N: Normal, D: Denormal representations. The mantissa is in bold. A normal division by 2^16 leaves the mantissa untouched and subtracts 16 from the exponent: the result of ztran, where the exponent overflowed from -1022 to -14.

Figure 3: Transient execution due to invalid memory ordering.
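The arithmetic behind Table 2 can be replayed in software. The sketch below computes the architectural result with an ordinary IEEE-754 division, and the transient result with a hypothetical `blind_divide` helper that keeps the mantissa of x and only subtracts exponents, as the text describes. One detail is an assumption fitted to the single example in Table 2 and not documented behavior: the biased exponent is wrapped to its low 10 bits, which is one arithmetic that reproduces the reported 0x3f1 exponent field.

```python
import struct

FRACTION_MASK = (1 << 52) - 1   # low 52 bits of a double: the mantissa field
EXP_BIAS = 1023                 # IEEE-754 double-precision exponent bias


def to_double(bits: int) -> float:
    """Reinterpret a 64-bit pattern as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def to_bits(value: float) -> int:
    """Reinterpret an IEEE-754 double as its 64-bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def exponent_field(bits: int) -> int:
    """Extract the 11-bit biased exponent field."""
    return (bits >> 52) & 0x7FF


def blind_divide(x_bits: int, y_bits: int) -> int:
    """Divide as if operands and result were normal (positive inputs, y a power of two).

    ASSUMPTION: the exponent wraps to 10 bits (& 0x3FF). This is chosen only
    because it matches the z_tran row of Table 2.
    """
    exp = (exponent_field(x_bits) - exponent_field(y_bits) + EXP_BIAS) & 0x3FF
    return (exp << 52) | (x_bits & FRACTION_MASK)


X = 0x0010DEADBEEF1337
Y = 0x40F0000000000000   # 65536.0

z_arch = to_bits(to_double(X) / to_double(Y))   # correct denormal result
z_tran = blind_divide(X, Y)                     # modeled transient result

if __name__ == "__main__":
    print(f"z_arch = {z_arch:#018x}")
    print(f"z_tran = {z_tran:#018x}")
```

The two values differ in every byte position an attacker would probe with Algorithm 1, which is consistent with the double hits reported in the reload buffer.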

7 Memory Ordering Machine Clear

The CPU initiates a memory ordering (MO) machine clear when, upon receiving a snoop request, it is uncertain if memory ordering will be preserved [36]. Consider the program in Figure 3. Processor A loads X and Y (operations A0 and A1), while processor B performs two stores to the same locations (operations B0 and B1). If the load of X is slow due to a cache miss, the out-of-order CPU will issue the next load (and subsequent operations) ahead of schedule. Suppose that, while the load of X is pending, processor B signals, through a snoop request, that the values of X and Y have changed. In this scenario, the memory ordering is A1-B0-B1-A0, which is not allowed according to the Total-Store-Order memory model. As two loads cannot be reordered, r1=1, r2=0 is an illegal result. Thus, processor A has no choice other than to flush its pipeline and re-issue the load of Y in the correct order. This MO machine clear is needed for every inconsistent speculation on the memory ordering, implementing speculation behavior originally proposed by Gharachorloo et al. [27], with the advantage that strict memory order principles can co-exist with aggressive out-of-order scheduling.

Algorithm 2: Pseudo-code triggering a MO machine clear

    Processor A
        clflush(X)       // Make the load slow
        unlock(lock)     // Synchronize loads and stores
        r1 ← [X]
        r2 ← [Y]
        reload(r2)       // For FLUSH + RELOAD

    Processor B
        wait(lock)       // Synchronize loads and stores
        1 → [Y]

Notice that, in the previous example, the store on X is not even necessary to cause a memory ordering violation, since A1-B1-A0 is still an invalid order. Counterintuitively, the memory ordering violation disappears if the load on X is not performed, as A1-B0-B1 is a perfectly valid order.
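The orders discussed above can be checked mechanically. The sketch below is a deliberately simplified model: it only verifies that each processor's operations appear in program order within a global order, which is enough here because processor A only issues loads and processor B only issues stores. It does not model the store-to-load relaxation that Total-Store-Order otherwise permits. Operation labels such as "A0" follow the text, and the function name is hypothetical.

```python
def respects_program_order(global_order):
    """Return True if every processor's operations appear in increasing index order.

    Each operation is a label such as "A0": a one-letter processor name
    followed by its position in that processor's program.
    """
    last_index = {}
    for op in global_order:
        cpu, index = op[0], int(op[1:])
        if index < last_index.get(cpu, -1):
            return False          # a later operation overtook an earlier one
        last_index[cpu] = index
    return True


if __name__ == "__main__":
    for order in (["A1", "B0", "B1", "A0"],   # load of Y hoisted above load of X
                  ["A1", "B1", "A0"],         # same, without the store on X
                  ["A1", "B0", "B1"]):        # load of X not performed at all
        print("-".join(order), respects_program_order(order))
```

The third order passes because, with the load of X removed, nothing on processor A precedes A1 anymore.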

To reverse engineer the memory ordering handling behavior on Intel CPUs, we used analysis code exemplified by Algorithm 2. Our code mimics the scenario of Figure 3 with loads/stores from two threads racing against each other, but relies on a FLUSH + RELOAD [81] covert channel to observe the loaded value of Y. The synchronization through the lock variable ensures the desired (problematic) proximity and ordering of the memory operations.

As before, in our experiments, we observed two different hits in the reload buffer for the loaded value of Y: one for the stale (transient) value and one for the new (architectural) value. For every double hit in the reload buffer, we also measured an increase of MACHINE_CLEARS.MEMORY_ORDERING. We also ran experiments without the load of X. Even though memory ordering violations are no longer possible, since each processor executes a single memory operation and any order is permitted, we still observed MO machine clears. The reason is that Intel CPUs seem to resort to a simple but conservative approach to memory ordering violation detection, where the CPU initiates a MO machine clear when a snoop request from another processor matches the source of any load operation in the pipeline. We obtained the same results with the two threads running across physical or logical (hyperthreaded) cores. Similarly, the results are unchanged if the matching store and load are performed on different addresses: the only requirement we observed is that the memory operations need to refer to the same cache line. Overall, our results confirm the presence of a transient execution window and the ability of a thread to trigger a transient execution path in another thread by simply dirtying a cache line used in a ready-to-commit load. Exploitation-wise, abusing this type of MC is non-trivial, due to the strict synchronization requirements and the difficulty of controlling pending stale data.
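The conservative detection rule inferred above can be written down as a predicate. This is a sketch of the observed behavior only, not of the hardware implementation: the function names are hypothetical, and the 64-byte line size is an assumption that holds for the processors evaluated in this paper but is not stated in the text.

```python
CACHE_LINE_SIZE = 64  # bytes; assumption, typical for current x86 processors


def cache_line(address: int) -> int:
    """Index of the cache line that contains the given byte address."""
    return address // CACHE_LINE_SIZE


def snoop_triggers_mo_clear(snoop_address: int, inflight_load_addresses) -> bool:
    """Model of the observed rule: a snoop hitting the cache line of any
    in-flight load triggers a MO machine clear, even if the exact addresses
    differ and no ordering violation is actually possible."""
    snooped = cache_line(snoop_address)
    return any(cache_line(addr) == snooped for addr in inflight_load_addresses)


if __name__ == "__main__":
    # Store to 0x1008 while a load from 0x1030 is in flight: same line.
    print(snoop_triggers_mo_clear(0x1008, [0x1030]))
    # Store to the next line: no match.
    print(snoop_triggers_mo_clear(0x1040, [0x1030]))
```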

8 Memory Disambiguation Machine Clear

As suggested by the MACHINE_CLEARS.DISAMBIGUATION counter description, memory disambiguation (MD) mispredictions are handled via machine clears. Our experiments confirmed this behavior by observing matching increases of MACHINE_CLEARS.COUNT. Moreover, we observed no changes in microcode assist counters, suggesting mispredictions are resolved entirely in hardware. In case of a misprediction, a stale value is passed to subsequent loads, a primitive that was previously used to leak secret information with Spectre Variant 4 or Speculative Store Bypass (SSB) [55, 82].

Different from the other machine clears, MD machine clears trigger only on address aliasing mispredictions, when the CPU wrongly predicts a load does not alias a preceding store instruction and can be hoisted. Similar to branch prediction, generating a MD-based transient execution window requires mistraining the underlying predictor. The latter has complex, undocumented behavior, which has been partially reverse engineered [18]. For space constraints, we discuss our full reverse engineering strategy in Appendix A. Our results show that executing the same load 64 + 15 times with non-aliasing stores is sufficient to ensure the next prediction to be "no-alias" (and thus the next load to be hoisted). Upon reaching the hoisting prediction, one can perform the load with an aliasing store to trigger the transient path exposed to the incorrect, stale value.

Finally, we experimentally verified that 4k aliasing [50] does not cause any machine clear, but only incurs a further time penalty in case of wrong aliasing predictions. More details on 4k aliasing results can be found in Appendix A.

9 Other Types of Machine Clear

AVX vmaskmov. The AVX vmaskmov instructions perform conditional packed load and store operations depending on a bitmask. For example, a vmaskmovpd load may read 4 packed doubles from memory depending on a 4-bit mask: each double will be read only if the corresponding mask bit is set, the others will be assigned the value 0.

According to the Intel Optimization Reference Manual [36], the instruction does not generate an exception in face of invalid addresses, provided they are masked out. However, our experiments confirm that it does incur a machine clear (and a microcode assist) when accessing an invalid address (e.g., with the present bit set to 0) with a loading mask set to zero (i.e., no bytes should be loaded), to check whether the bytes in the invalid address have the corresponding mask bits set or not. We speculate the special handling is needed because the permission check is very complex, especially in the absence of memory alignment requirements.
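The architectural semantics described above can be modeled in a few lines. This is a sketch of the visible behavior only, following the text's 4-bit mask description; the real instruction takes its mask from the most significant bit of each element of a vector register. All names are hypothetical, and the model says nothing about the machine clear itself, which is a microarchitectural effect.

```python
def masked_load(read_element, mask_bits: int, lanes: int = 4):
    """Return `lanes` values; lane i is read only if bit i of mask_bits is set.

    Masked-out lanes are never passed to read_element, so an invalid address
    behind them cannot raise, mirroring the no-exception guarantee in the text.
    """
    return [read_element(i) if (mask_bits >> i) & 1 else 0.0 for i in range(lanes)]


memory = [1.5, 2.5, 3.5, 4.5]


def read_valid(lane: int) -> float:
    """Stand-in for a load from mapped memory."""
    return memory[lane]


def read_unmapped(lane: int) -> float:
    """Stand-in for a load from a page whose present bit is 0."""
    raise MemoryError(f"page fault on lane {lane}")


if __name__ == "__main__":
    print(masked_load(read_valid, 0b0101))
    # All-zero mask on an invalid address: architecturally a no-op.
    print(masked_load(read_unmapped, 0b0000))
```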

In our experiments, vmaskmov instructions with all-zero masks and invalid addresses increment the OTHER_ASSISTS.ANY and MACHINE_CLEARS.COUNT counters, confirming that the instruction triggers a machine clear. However, the resulting transient execution window seems short-lived or absent, as we were unable to observe cache or other microarchitectural side effects of the execution.

Exceptions. The MACHINE_CLEARS.PAGE_FAULT counter, present on older microarchitectures, confirms page faults are another cause of machine clears. Indeed, we verified each instruction incurring a page fault or any other exception, such as "Division by zero", increments the MACHINE_CLEARS.COUNT counter. We also verified exceptions do not trigger microcode assists, and software interrupts (traps) do not trigger machine clears. Indeed, the Intel documentation [37] specifies that instructions following a trap may be fetched but not speculatively executed. Transient execution windows originating from exceptions, and page faults in particular, have been extensively used in prior work, with a faulty load instruction also used as the trigger to leak information [7, 47, 63, 75].

Hardware interrupts. Although hardware interrupts are an undocumented cause of machine clears, our experiments with APIC timer interrupts showed they do increment the MACHINE_CLEARS.COUNT counter. While this confirms hardware interrupts are another root cause of transient execution, the asynchronous nature of these events yields a less than ideal vector for transient execution attacks. Nonetheless, hardware interrupts play an important role in other classes of microarchitectural attacks [73].

Microcode assists. Microcode assists require a pipeline flush to insert the required µOps in the frontend and represent a subclass of machine clears (and thus a root cause of transient execution) for cases where a fast path in hardware is not available. In this paper, we detailed the behavior of floating-point and vmaskmov assists. Prior work has discussed different situations requiring microcode assists, such as those related to page table entry Access/Dirty bits, typically in the context of assisted loads used as the trigger to leak information [8, 63, 75]. AVX-to-SSE transitions [36] represent another microcode assist, which, based on our experiments, we could not observe on modern Intel CPUs. Remaining known microcode assists, such as access control of memory pages belonging to SGX secure enclaves and Precise Event Based Sampling (PEBS), are not presented in this work, since already studied [14] or only related to privileged performance profiling, respectively.

Figure 4: Top plot: transient window size vs. mechanism. Each bar reports the number of transient loads that complete and leave a microarchitectural trace. Bottom plot: leakage rate vs. mechanism for a simple Spectre Bounds-Check-Bypass attack and a 1-bit FLUSH + RELOAD (F+R) cache covert channel. F+R is the leakage rate upper bound (covert channel loop only, no actual transient window or attack).
[Plots not reproduced. x-axis: Transient Execution Management (F+R, TSX, BHT, FAULT, SMC, XMC, FP, MD, MO). y-axes: Number of Transient Loads (top), Leakage Rate [Mb/s] (bottom). CPUs: Intel Core i7-10700K, Intel Xeon Silver 4214, Intel Core i9-9900K, Intel Core i7-7700K, AMD Ryzen 5 5600X, AMD Ryzen Threadripper 2990WX, AMD Ryzen 7 2700X.]

10 Transient Execution Capabilities

Transient execution attacks rely on crafting a transient window to issue instructions that are never retired. For this purpose, state-of-the-art attacks traditionally rely on mechanisms based on root causes such as branch mispredictions (BHT) [41, 43], faulty loads (Fault) [8, 47, 63, 75], or memory transaction aborts (TSX) [8, 47, 59, 63, 75]. However, the different machine clears discussed in this paper provide an attacker with the exact same capabilities.

To compare the capabilities of machine clear-based transient windows with those of more traditional mechanisms, we implemented a framework able to run arbitrary attacker-controlled code in a window generated by a mechanism of choice. We now evaluate our framework on recent processors (with all the microcode updates and mitigations enabled) to compare the transient window size and leakage rate of the different mechanisms.

10.1 Transient Window Size

The transient window size provides an indication of the number of operations an attacker can issue on a transient path before the results are squashed. Larger windows can, in principle, host more complex attacks. Using a classic FLUSH + RELOAD cache covert channel (F+R) as a reference, we measure the window size by counting how many transient loads can complete and hit entries in a designated F+R buffer. Figure 4 (top) presents our results.
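The counting step of this measurement can be sketched as follows. The snippet assumes the reload phase has already produced one access latency per F+R buffer entry. The function name, the sample latencies, and the 100-cycle threshold are all illustrative assumptions, since the text does not report the threshold used.

```python
def count_transient_loads(reload_latencies, hit_threshold):
    """Count F+R buffer entries whose reload latency indicates a cache hit.

    Each hit corresponds to one transient load that completed before the
    window was squashed.
    """
    return sum(1 for latency in reload_latencies if latency < hit_threshold)


if __name__ == "__main__":
    # Hypothetical latencies in cycles: three fast (cached) entries, two slow ones.
    latencies = [40, 38, 45, 210, 250]
    print(count_transient_loads(latencies, hit_threshold=100))
```

In practice the threshold would be calibrated per machine by timing known-cached and known-flushed accesses.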

As shown in the figure, the window size varies greatly across the different mechanisms. Broadly speaking, mechanisms that have a higher detection cost, such as XMC and MO Machine Clear, yield larger window sizes. Not surprisingly, branch mispredictions yield the largest window sizes, as we can significantly slow down the branch resolution process (i.e., causing cache misses) and delay detection. FP, on the other hand, yields the shortest windows, suggesting that denormal numbers are efficiently detected inside the FPU. Our results also show that, while our framework was designed for Intel processors, similar, if not better, results can usually be obtained on AMD processors (where we use the same conservative training code for branch/memory prediction). This shows that both CPU families share a similar implementation in all cases, except for MD, where the used mistraining pattern is not valid for pre-Zen3 architectures.

10.2 Leakage Rate

To compare the leakage rates for the different transient execution mechanisms, we transiently read and repeatedly leak data from a large memory region through a classic F+R cache covert channel. We report the resulting leakage rates (as the number of bits successfully leaked per second) across different microarchitectures, using a 1-bit covert channel to highlight the time complexity of each mechanism. We consider data to be successfully leaked after a single correct hit in the reload buffer. In case of a miss for a particular value, we restart the leak for the same value until we get a hit (or until we get 100 misses in a row).
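The retry policy described above can be sketched as a small loop. The `try_leak_once` callback is a hypothetical stand-in for one transient leak attempt followed by a reload; it returns the leaked value on a hit and None on a miss. Only the policy (retry the same value, give up after 100 consecutive misses) comes from the text.

```python
def leak_with_retry(try_leak_once, max_misses: int = 100):
    """Repeat one leak attempt until it hits or max_misses misses occur in a row.

    Returns (value, misses): value is None if every attempt missed.
    """
    misses = 0
    while misses < max_misses:
        value = try_leak_once()
        if value is not None:
            return value, misses
        misses += 1
    return None, misses


if __name__ == "__main__":
    # Simulated channel: two misses, then a hit that leaks the bit 1.
    outcomes = iter([None, None, 1])
    print(leak_with_retry(lambda: next(outcomes)))
```

Counting a value as leaked after a single hit, as the text does, trades accuracy for speed; a stricter policy would require several agreeing hits.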

As shown in Figure 4 (bottom), different Intel and AMD microarchitectures generally yield similar leakage rates, with some variations. For instance, FP MC offers better leakage rates on Intel. This difference stems from the different performance impact of the corresponding machine clears on Intel vs. AMD microarchitectures.

Figure 5: Leakage rate vs. mechanism with a 1-bit, 4-bit, or 8-bit FLUSH + RELOAD cache covert channel (Intel Core i9).
[Plot not reproduced. x-axis: Transient Execution Management (F+R, TSX, BHT, FP, SMC, XMC, MO, MD, FAULT). y-axis: Leakage Rate [Mb/s]. Series: F+R granularity of 1, 4, and 8 bits.]

As shown in Figure 5, the leakage rate varies instead greatly across the different mechanisms, and so does the optimal covert channel bitwidth. Indeed, while existing attacks typically rely on 8-bit covert channels, our results suggest 1-bit or 4-bit channels can be much more efficient, depending on the specific mechanism. Roughly speaking, optimal leakage rates can be obtained by balancing the time complexity (and hence bitwidth) of the covert channel with that of the mechanism. For example, FP is a lightweight and reliable mechanism, hence using a comparably fast and narrow 1-bit covert channel is beneficial. In contrast, MD requires a time-consuming predictor training phase between leak iterations, and leaking more bits per iteration with a 4-bit covert channel is more efficient. Interestingly, a classic 8-bit covert channel yields consistently worse and comparable leakage rates across all the mechanisms, since F+R dominates the execution time.
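The balance between mechanism cost and covert channel width can be illustrated with a toy cost model. It is not fitted to the measurements in Figure 5: it simply assumes that one leak iteration costs a fixed mechanism time plus one probe per reload-buffer entry, and that a b-bit channel needs 2^b entries. The function names and the unit-less costs are illustrative assumptions.

```python
def leak_rate(bits: int, mechanism_cost: float, probe_cost: float = 1.0) -> float:
    """Bits leaked per unit of time for one iteration of a b-bit F+R channel.

    Toy model: iteration time = mechanism_cost + 2**bits * probe_cost.
    """
    return bits / (mechanism_cost + (2 ** bits) * probe_cost)


def best_width(mechanism_cost: float, widths=(1, 4, 8), probe_cost: float = 1.0) -> int:
    """Channel width, among the candidates, that maximizes the modeled rate."""
    return max(widths, key=lambda b: leak_rate(b, mechanism_cost, probe_cost))


if __name__ == "__main__":
    # Cheap mechanism (FP-like in spirit): the narrow channel wins.
    print(best_width(mechanism_cost=1.0))
    # Expensive mechanism (MD-like in spirit, with training): a wider one wins.
    print(best_width(mechanism_cost=50.0))
```

Under this model the 8-bit channel is dominated by its 256 probes for both example costs, which matches the qualitative observation that F+R dominates the execution time. For an arbitrarily expensive mechanism the model would eventually favor 8 bits, a regime the reported measurements do not show.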

Our results show that only two mechanisms (TSX and FP) are close to the maximum theoretical leakage rate of pure F+R. Moreover, FP performs as efficiently as TSX but, unlike TSX, is available on both Intel and AMD, is always enabled, and can be used from managed code (e.g., JavaScript). BHT, on the other hand, yields the worst leakage rates due to the inefficient training-based transient window. The BHT leakage rate can be improved if a tailored mistrain sequence is used, as in MD (Appendix A). Overall, our results show that machine clear-based windows achieve comparable and, in many cases (e.g., FP), better leakage rates compared to traditional mechanisms. Moreover, many machine clears eliminate the need for mistraining, which, other than resulting in efficient leakage rates, can escape existing pattern-based mitigations and disabled hardware extensions (e.g., Intel TSX).

11 Attack Primitives

Building on our reverse engineering results and focusing on the unexplored SMC and FP machine clears, we now present two new transient execution attack primitives and analyze their security implications. We also present an end-to-end FPVI exploit disclosing arbitrary memory in Firefox. Later, we discuss mitigations.

11.1 Speculative Code Store Bypass (SCSB)

Our first attack primitive, Speculative Code Store Bypass (SCSB), allows an attacker to execute stale, controlled code in a transient execution window originated by a SMC machine clear. Since the primitive relies on SMC, its primary applicability is on JIT (e.g., JavaScript) engines running attacker-controlled code, although OS kernels and hypervisors storing code pages and allowing their execution without first issuing a serializing instruction are also potentially affected.

Figure 6: SCSB primitive example, where the instruction pointer is pointing at the bold code blocks. g code is freed JIT'ed code (of some g function) under attacker's control. (1) Force engine to JIT and execute code of function f, causing desynchronization of code and data views. (2) Execute stale code and SMC MC. (3) After the SMC MC, code and data view coherence is restored and the new code is executed.

As exemplified in Figure 6, the operations of the primitive can be broken down into three steps: (1) the JIT engine compiles a function f, storing the generated code into a JIT code cache region previously used by a (now-stale) version of function g; (2) the JIT engine jumps to the newly generated code for the function f, but, due to the temporary desynchronization between the code and data views of the CPU, this causes transient execution of the stale code of g until the SMC machine clear is processed; (3) after the pipeline flush, the code and data views are resynchronized and the CPU restarts the execution of the correct code of f. For exploitation, the attacker needs to: (i) massage the JIT code cache allocator to reuse a freed region with a target gadget g of choice; (ii) force the JIT engine to generate and execute new (f) code in such a region, enabling transient out-of-context execution of the gadget and spilling secrets into the microarchitectural state.
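The three steps above can be mimicked with a toy state machine that keeps separate code and data views. This is purely an illustration of the sequence of events and does not model any real processor or JIT engine. The `ToyCore` class, its methods, and the string labels standing in for machine code are all hypothetical.

```python
class ToyCore:
    """Toy model with a data view (memory) and a possibly stale code view."""

    def __init__(self):
        self.memory = {}    # data view: what stores have written
        self.fetched = {}   # code view: what the front end has already fetched

    def fetch(self, address):
        """Bring the code at `address` into the code view (e.g., by running it)."""
        self.fetched[address] = self.memory.get(address)

    def store_code(self, address, code):
        """Write new code as data; the code view is not updated yet."""
        self.memory[address] = code

    def run(self, address):
        """Execute at `address`, reporting a transient phase if the views differ."""
        trace = []
        stale = self.fetched.get(address)
        if stale is not None and stale != self.memory[address]:
            trace.append(("transient", stale))          # step 2: stale g runs
        self.fetched[address] = self.memory[address]    # SMC machine clear resyncs
        trace.append(("architectural", self.memory[address]))  # step 3
        return trace


if __name__ == "__main__":
    core = ToyCore()
    region = 0x1000
    core.store_code(region, "g gadget")
    core.fetch(region)                   # g was executed earlier, then freed
    core.store_code(region, "f code")    # step 1: JIT reuses the region for f
    print(core.run(region))
    print(core.run(region))              # views now coherent: no transient phase
```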

Our primitive bears similarities with both transient and architectural primitives used in prior attacks. On the transient front, our primitive is conceptually similar to a Speculative-Store-Bypass (SSB) primitive [55, 82], but can transiently execute stale code rather than reading stale data. However, interestingly, the underlying causes of the two primitives are quite different (MD misprediction vs. SMC machine clear). On the architectural front, our primitive mimics classic Use-After-Free (UAF) exploitation on the JIT code cache, also known as Return-After-Free (RAF) in the hackers community [19, 25]. An example is CVE-2018-0946, where a use-after-free vulnerability can be exploited to force the Chakra JS engine to erroneously execute freed (attacker-controlled) JIT code, resulting in arbitrary code execution after massaging the right gadget into the JIT code cache [24].

Figure 7: Coding options suggested by the Intel Architectures Software Developer's Manuals to handle SMC and XMC execution. Option 1 describes the exact steps required by our Speculative Code Store Bypass attack primitive, potentially resulting in exploitable gadgets.

Indeed, at a high level, SCSB yields a transient use-after-free primitive on a JIT code cache, with exploitation properties similar to its architectural counterpart. However, there are some differences due to the transient nature of SCSB. First, we need to find an out-of-context gadget to transiently leak data, rather than architecturally execute arbitrary code. In JavaScript engines, similar gadgets have already been exploited by ret2spec [48], escalating out-of-context transient execution of valid code to type confusion, arbitrary reads, and ultimately a secret-dependent load to transmit the value.

Second, we need to target short-lived JIT code cache updates, so that the newly generated code is immediately executed, fitting the target gadget in the resulting transient execution window. Interestingly, executing JIT'ed code is the next step after JIT code generation mentioned in the Intel manual (Figure 7, Option 1), suggesting that developers that follow such directives ad litteram can easily introduce such gadgets. Moreover, modern JavaScript compilers feature a multi-stage optimization pipeline [12, 54], and short-lived JIT code cache updates are favored for just-in-time (re)optimization. Indeed, we found test code in V8 to specifically test such updates and verify instruction/data cache coherence [70].

Finally, we need to ensure JIT code cache updates are not accompanied by barriers that force immediate synchronization of the code and data views. We analyzed the code of the popular SpiderMonkey and V8 JavaScript engines and verified that the functions called upon JIT code cache updates to synchronize instruction and data caches (Listing 2 and Listing 3) are always empty on x86 (as expected, given Intel's primary recommendation in Figure 7).

Listing 2: Chromium instruction cache flush (chromium/src/v8/src/codegen/x64/cpu-x64.cc)

    void CpuFeatures::FlushICache(void* start, size_t size) {
      // No need to flush the instruction cache on Intel
    }

Listing 3: Firefox instruction cache flush (mozilla-unified/js/src/jit/FlushICache.h)

    inline void FlushICache(void* code, size_t size,
                            bool codeIsThreadLocal = true) {
      // No-op. Code and data caches are coherent on x86 and x64.
    }

Overall, while the exploitation is far from trivial (i.e., having to address the challenges of traditional use-after-free exploitation as well as transient execution exploitation in the browser), we believe SCSB expands the attack surface of transient execution attacks in the browser. We found a number of candidate SCSB gadgets in real-world code, but none of them was ultimately exploitable due to the coincidental presence of some serializing instruction preventing stale code execution. Nonetheless, mitigations are needed to enforce security-by-design. Luckily, as shown later, SCSB is amenable to a practical and efficient mitigation (i.e., at a similar cost as faced by non-x86 architectures). The implementation/performance cost for mitigation is even lower than for its sibling SSB primitive, whose mitigation has been deployed in practice even in absence of known practical exploits [55].

In Table 3, we show that all tested Intel and AMD processors are affected by SCSB. In contrast, ARM is not vulnerable, since SMC updates require explicit software barriers.

11.2 Floating Point Value Injection (FPVI)

Our second attack primitive, Floating Point Value Injection (FPVI), allows an attacker to inject arbitrary values into a transient execution window originated by a FP machine clear. As exemplified in Figure 8, the operations of the primitive can be broken down into four steps: (1) the attacker triggers the execution of a gadget starting with a denormal FP operation in the victim application, with the x and y operands under the attacker's control; (2) the transient z result of the operation is processed by the subsequent gadget instructions, leaving a microarchitectural trace; (3) the CPU detects the error condition (i.e., wrong result of a denormal operation), triggering a machine clear and thus a pipeline flush; (4) the CPU re-executes the entire gadget with the correct architectural z result. For exploitation, the attacker needs to (i) massage the x and y operands to inject the desired z value into the victim transient path, and (ii) target a victim gadget so that the injected value yields a security-sensitive trace, which can be observed with FLUSH + RELOAD or other microarchitectural attacks.

Our primitive bears similarities with Load-Value-Injection (LVI) [71], since both allow attackers to inject controlled values into transient execution. Moreover, both primitives require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues, and hence the triggering conditions, are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. Indeed, it is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild (much like Spectre [41], LVI [71], etc., this would require gadget scanners, subject of orthogonal research [56, 58]), yet we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all the prior Mozilla Firefox mitigations against transient execution attacks [52].

Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x, y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE-754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag (written here as the leading 20 bits of the 64-bit word) is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case, a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.
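The tag layout described above can be made concrete with a few bit operations. The following is a minimal Python sketch, not SpiderMonkey code: the helper names (`unbox`, `kind`) and the lookup table are hypothetical, and it assumes that the five-digit values quoted in the text (0xfff88, 0xfffb0) are the leading 20 bits of the 64-bit word, so the corresponding 17-bit tags are obtained by dropping their three lowest bits.

```python
TAG_SHIFT = 47                       # the low 47 bits hold the payload
PAYLOAD_MASK = (1 << TAG_SHIFT) - 1
JSVAL_TAG_MAX_DOUBLE = 0x1FFF0       # from the text: tags <= this are doubles

# 17-bit tags derived from the 20-bit prefixes quoted in the text.
# The table itself is illustrative and deliberately incomplete.
BOXED_KINDS = {0xFFF88 >> 3: "int32", 0xFFFB0 >> 3: "string"}


def unbox(bits):
    """Split a 64-bit word into (17-bit tag, 47-bit payload)."""
    return bits >> TAG_SHIFT, bits & PAYLOAD_MASK


def kind(bits):
    """Classify a 64-bit word following the rule described in the text."""
    tag, _ = unbox(bits)
    if tag <= JSVAL_TAG_MAX_DOUBLE:
        return "double"
    return BOXED_KINDS.get(tag, "other boxed type")
```

Under these assumptions, any word whose leading hex digits are 0xfffb0 is classified as a string and its payload is read as the string pointer, whereas ordinary doubles, infinities, and the default x86 NaN all stay at or below the 0x1fff0 threshold.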

To verify the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see gadget in Figure 8).

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the length attribute of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 51 bits set to zero, a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables, such as string pointers, can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on a FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.
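The operands and addresses quoted above can be sanity-checked at the architectural level. The Python sketch below (helper names are hypothetical) only reproduces what software can observe: that the divisor is denormal, that the architectural quotient is -Infinity, and which address the reported transient value would dereference. It cannot reproduce the transient result itself, which exists only inside the pipeline before the machine clear.

```python
import struct
import sys


def to_double(bits):
    """Reinterpret a 64-bit pattern as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def to_bits(value):
    """Reinterpret an IEEE-754 double as a 64-bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def is_denormal(value):
    """True for nonzero values below the smallest normal double."""
    return value != 0.0 and abs(value) < sys.float_info.min


X_BITS = 0xC000E8B2C9755600          # dividend quoted in the text
Y_BITS = 0x0004000000000000          # divisor quoted in the text
TRANSIENT_BITS = 0xFFFB0DEADBEEF000  # transient result reported in the text

x, y = to_double(X_BITS), to_double(Y_BITS)
architectural = x / y                # overflows; Python returns -inf silently

string_ptr = TRANSIENT_BITS & ((1 << 47) - 1)  # 47-bit payload
length_addr = string_ptr + 4                   # length sits 4 bytes further

print(is_denormal(y), architectural, hex(to_bits(architectural)))
print(hex(string_ptr), hex(length_addr))
```

Note that the architectural result decodes as an ordinary double, so only the transient value carries the string tag.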

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0 and, by wrapping this primitive with a variant of an EVICT + RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

Table 3: Tested processors

    Processor                               Microcode    SCSB vulnerable   FPVI vulnerable
    Intel Core i7-10700K                    0xe0         ✓                 ✓
    Intel Xeon Silver 4214                  0x500001c    ✓                 ✓
    Intel Core i9-9900K                     0xde         ✓                 ✓
    Intel Core i7-7700K                     0xca         ✓                 ✓
    AMD Ryzen 5 5600X                       0xa201009    ✓                 ✓†
    AMD Ryzen 2990WX                        0x800820b    ✓                 ✓†
    AMD Ryzen 7 2700X                       0x800820b    ✓                 ✓†
    Broadcom BCM2711 Cortex-A72 (ARM v8)    -            ✗§                ✗

    † No exploitable NaN-boxed transient results found.
    § On ARM, SMC updates require explicit software barriers.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified that any serialization instruction, such as lfence, mfence, or cpuid, placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. Such a function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows JavaScript execution time is heavily dominated by JIT code generation/execution, and code updates have negligible impact. Our results show this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
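The conditional-masking option can be illustrated on raw bit patterns. This is a simplified Python sketch under stated assumptions, not the SpiderMonkey patch [51]: the function name is hypothetical, the tag layout is the 17-bit tag / 47-bit payload split described earlier for SpiderMonkey, and a real JIT would emit a branchless cmov rather than a Python conditional.

```python
TAG_SHIFT = 47
JSVAL_TAG_MAX_DOUBLE = 0x1FFF0         # largest tag still read as a double
QNAN_INDEFINITE = 0xFFF8000000000000   # default x86 NaN encoding for doubles


def mask_fp_result(bits):
    """Pass through bit patterns that decode as doubles; replace anything
    that would decode as a NaN-boxed value with the default NaN."""
    is_boxed = (bits >> TAG_SHIFT) > JSVAL_TAG_MAX_DOUBLE
    return QNAN_INDEFINITE if is_boxed else bits
```

With this check applied to every FP result before it is consumed, a value such as 0xfffb0deadbeef000 would be replaced by the default NaN, while legitimate doubles (including infinities and the default NaN itself) are left untouched.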

A more general and automated mitigation is for the compiler to place a serializing instruction, such as lfence, after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76] and systems resistant to it. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by a MDS attacker. For the latter, we limit ourselves to load sinks, to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.
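The source-to-sink strategy above can be sketched as a toy taint analysis over straight-line code. This is a minimal illustration, not the LLVM pass: the instruction format `(op, dst, srcs)`, the opcode names, and the `harden` function are all hypothetical, `srcs` lists only address or branch-target operands for sinks, and control flow and calls are ignored.

```python
FP_OPS = {"fadd", "fsub", "fmul", "fdiv"}   # FPVI sources (illustrative)


def harden(instrs, mds_vulnerable):
    """Insert an lfence before any sink whose operands depend on an
    unfenced FP result. instrs is a list of (op, dst, srcs) tuples."""
    sinks = {"load", "ijmp"} | ({"store"} if mds_vulnerable else set())
    tainted = set()                 # registers derived from FP results
    out = []
    for op, dst, srcs in instrs:
        uses_taint = any(s in tainted for s in srcs)
        if op in sinks and uses_taint:
            out.append(("lfence", None, ()))
            tainted.clear()         # earlier FP results are now resolved
            uses_taint = False
        out.append((op, dst, srcs))
        if dst is not None:
            if op in FP_OPS or uses_taint:
                tainted.add(dst)    # new source, or taint propagation
            else:
                tainted.discard(dst)
    return out


program = [
    ("fdiv", "z", ("x", "y")),      # FP source
    ("add", "p", ("z", "base")),    # taint propagates to p
    ("load", "v", ("p",)),          # sink: address derived from FP result
]
for instr in harden(program, mds_vulnerable=False):
    print(instr)
```

In this model, a store whose address depends on an FP result is fenced only when `mds_vulnerable` is set, mirroring the two sink sets described above.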

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that, on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact, up to 280%, despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance impact difference between MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by operating interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0.
[Bar chart. Y-axis: overhead, with ticks from 0 to 800. X-axis: SPEC FP 2017 (619.lbm_s, 638.imagick_s, 644.nab_s, SPEC17 geomean) and SPEC FP 2006 (433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, SPEC06 geomean). Series (LLVM mitigation): LVI; FPVI (MDS-vulnerable system); FPVI (MDS-resistant system); LVI+FPVI (MDS-vulnerable system).]
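For readers less familiar with the metric, a geomean overhead can be computed as follows. This is a small Python sketch that assumes the overhead is the geometric mean of per-benchmark slowdown ratios minus one; the function name and the sample ratios are hypothetical and are not measurements from our evaluation.

```python
import math


def geomean_overhead(ratios):
    """ratios: hardened_time / baseline_time per benchmark (each > 0).
    Returns the geometric-mean overhead as a percentage."""
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return (math.exp(log_mean) - 1.0) * 100.0


# Hypothetical slowdown ratios, for illustration only.
example = [1.10, 1.25, 2.00]
print(f"{geomean_overhead(example):.1f}% geomean overhead")
```

Compared to an arithmetic mean, the geometric mean is less dominated by a single benchmark with an unusually large slowdown.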

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, a MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root-cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It's worth noting that in Figure 10 there are two separate predictor subclasses, under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74–76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a HW interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less than ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived by likely invariants [15] used by the CPU. Such invariants commonly hold but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them with the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it with the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd, Daniel Genkin, and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From ghz to mhz effective cpu frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. Smotherspectre: exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. Kaslr: Break it, fix it, repeat. In ACM ASIA CCS 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. Sgxpectre: Stealing intel secrets from sgx enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS 2018.

[16] Christopher Domas. Breaking the x86 isa. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.

[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. Branchscope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts. October 1, 2013. US Patent 8,549,263.


[45] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.

[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.

[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.

[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.

[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. Memjam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358.

[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111.

[53] Mozilla. Spider Monkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h.

[54] Mozilla. SpiderMonkey IonMonkey documentation.

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. Specfuzz: Bringing spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the cpu on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. Spectaint: Speculative taint analysis for discovering spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. Sok: In search of lost time: A review of javascript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers. 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in javascript. In FC, IFCA 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. Netspectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. Lazyfp: Leaking fpu register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. Meltdownprime and spectreprime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. Sok: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. LVI: Hijacking transient execution through microarchitectural load value injection. In IEEE S&P, 2020.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary CPU interrupt logic. In ACM CCS, 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-Step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. SGAxe: How SGX fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. CacheOut: Leaking data on Intel CPUs via cache evictions. arXiv preprint.

[77] WebKit. Browserbench. https://browserbench.org

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution, 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+Reload: a high resolution, low noise, L3 cache side-channel attack. In USENIX Security, 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                          ; rdi: store addr, rsi: load addr
        %rep 10                     ; Trick to delay the store address
        imul rdi, 1
        %endrep
        mov DWORD [rdi], 0x42       ; Store
        mov eax, DWORD [rsi]        ; Load
        %rep 10                     ; Pronounce load timing
        imul eax, 1
        %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturation counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that the saturating counter is set to 0
    for (i = 0; i < 100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for (i = 100; i < 120; i++) st_ld(mem, mem + 64);
    // Trigger a memory ordering machine clear
    for (i = 120; i < 130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter, incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
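The counter behavior just described can be sketched as a small C model. This is a minimal sketch, assuming exactly the design stated above (a 4-bit counter incremented on every non-aliasing load, reset to 0 on an aliasing one, hoisting predicted only when saturated). The names md_entry, md_entry_step and replay_listing5 are hypothetical and are not taken from the paper's code; the model tracks predictor state only, not timing.

```c
#include <assert.h>
#include <stdbool.h>

enum { MD_COUNTER_MAX = 15 };  /* 4-bit saturating counter */

/* Model of one per-address predictor entry (illustrative name). */
typedef struct {
    int counter;  /* 0..15; hoisting is predicted only when saturated */
} md_entry;

/* Simulate one st_ld(store_addr, load_addr) call.
 * `aliases` says whether the load reads the address just stored to.
 * Returns true when the load was hoisted (predicted "no alias").
 * *machine_clear is set when a hoisted load turned out to alias. */
bool md_entry_step(md_entry *e, bool aliases, bool *machine_clear)
{
    assert(e->counter >= 0 && e->counter <= MD_COUNTER_MAX);
    bool hoisted = (e->counter == MD_COUNTER_MAX);
    *machine_clear = hoisted && aliases;
    if (aliases)
        e->counter = 0;                 /* reset on aliasing */
    else if (e->counter < MD_COUNTER_MAX)
        e->counter++;                   /* saturating increment */
    return hoisted;
}

/* Replay the call pattern of Listing 5: calls 0..99 alias,
 * calls 100..119 do not, calls 120..129 alias again.
 * Reports the index of the first hoisted call and the clear count. */
void replay_listing5(int *first_hoisted, int *clears)
{
    md_entry e = { 0 };
    *first_hoisted = -1;
    *clears = 0;
    for (int i = 0; i < 130; i++) {
        bool aliases = (i < 100) || (i >= 120);
        bool mc;
        bool hoisted = md_entry_step(&e, aliases, &mc);
        if (hoisted && *first_hoisted < 0)
            *first_hoisted = i;
        if (mc)
            (*clears)++;
    }
}
```

Under these assumptions, replaying Listing 5 yields a first hoisted call at i = 115 (after 15 non-aliasing loads) and a single machine clear at i = 120, which is consistent with the behavior reported for Figure 11.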

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange bar: memory ordering machine clear observed.

Listing 6: Snippet observing activation of never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            st_ld(mem, mem + 64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters), and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe if two unrelated loads in these functions are able to influence each other when varying the number of nops. In our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256k bytes apart in memory (k ∈ ℕ). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except that we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
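To make the collision condition concrete, the following C sketch computes the predictor index from a load's instruction pointer. It assumes, as described above, that the hash is simply the lowest bits of the instruction pointer (6 bits in the patent, 8 bits on the tested CPUs). The function names and the example address are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index of the per-address predictor selected by a load at `load_ip`,
 * assuming the hash is the lowest `index_bits` bits of the IP
 * (6 bits in the patent, 8 bits on the CPUs tested in the text). */
unsigned md_predictor_index(uint64_t load_ip, unsigned index_bits)
{
    assert(index_bits >= 1 && index_bits <= 16);
    return (unsigned)(load_ip & ((UINT64_C(1) << index_bits) - 1));
}

/* Two loads share (and can therefore mistrain) a predictor entry
 * when their instruction pointers hash to the same index. */
bool md_loads_collide(uint64_t ip_a, uint64_t ip_b, unsigned index_bits)
{
    return md_predictor_index(ip_a, index_bits) ==
           md_predictor_index(ip_b, index_bits);
}
```

With 8 index bits, a load at a hypothetical address 0x401234 collides with any load located a multiple of 256 bytes away, but not with one 64 bytes away; with the 6 bits suggested by the patent, the 64-byte offset would collide as well.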

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that after 4 machine clears the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads, from 15 to 399 in steps of 64; y-axis: total measured machine clears, from 1 to 4).

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0 and disabled otherwise. The flush counter is decremented on every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measure how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal that the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that 15+64 non-aliasing loads are sufficient to switch from a never-hoisting to a predict-hoisting state. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
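The interplay between the per-address counter and the watchdog can be sketched as the following C state machine. It is a minimal model under the numbers reported above (4-bit per-address counter, 2-bit flush counter with -1 meaning never-hoisting, one increment per 64 correct or would-be-correct predictions). All names are hypothetical. Whether the tally of correct predictions is reset by a machine clear is not stated in the text; this sketch assumes it is not.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    PRED_MAX     = 15,  /* per-address 4-bit saturating counter */
    FLUSH_MAX    = 3,   /* 2-bit flush counter: 0..3, plus -1 = watchdog on */
    FLUSH_PERIOD = 64   /* correct predictions per flush-counter increment */
};

typedef struct {
    int pred;     /* per-address counter, 0..PRED_MAX */
    int flush;    /* flush counter, -1..FLUSH_MAX; -1 means never-hoisting */
    int correct;  /* correct (or would-be correct) hoist predictions so far */
} md_model;

/* Simulate one store-load pair; returns true on an MD machine clear. */
bool md_model_step(md_model *m, bool aliases)
{
    assert(m->flush >= -1 && m->flush <= FLUSH_MAX);
    bool predict_hoist = (m->pred == PRED_MAX);
    bool hoisted = predict_hoist && m->flush >= 0;  /* watchdog can veto */
    bool machine_clear = hoisted && aliases;

    if (aliases) {
        m->pred = 0;
        if (machine_clear)
            m->flush--;  /* reaches -1 after FLUSH_MAX + 1 clears */
    } else {
        if (predict_hoist && ++m->correct == FLUSH_PERIOD) {
            m->correct = 0;
            if (m->flush < FLUSH_MAX)
                m->flush++;
        }
        if (m->pred < PRED_MAX)
            m->pred++;
    }
    return machine_clear;
}

/* Issue `n` non-aliasing pairs, then one aliasing pair; report whether
 * the aliasing pair was (incorrectly) hoisted, i.e. caused a clear. */
bool md_clear_after_training(md_model *m, int n)
{
    for (int i = 0; i < n; i++)
        md_model_step(m, false);
    return md_model_step(m, true);
}
```

Starting from the never-hoisting state (pred = 0, flush = -1), this model produces a machine clear after 79 non-aliasing pairs but not after 78. Starting with a saturated flush counter, the pattern of Listing 6 (19 non-aliasing pairs followed by one aliasing pair, repeated 10 times) produces exactly 4 clears.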

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for (i = 0; i < 10; i++) st_ld(mem + 0x1000, mem + 0x1000);
    for (i = 10; i < 20; i++) st_ld(mem + 0x2000, mem + 0x2000);
    // Cause 4k aliasing
    for (i = 20; i < 40; i++) st_ld(mem + 0x1000, mem + 0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Figure 15: Memory disambiguation unit, simplified view.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value, and in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor.

In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load; (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.

B Root-Causes Description Table

Acronym         Description

BHT             Branch History Table
BTB             Branch Target Buffer
RSB             Return Stack Buffer
MD              Memory Disambiguation Unit
NM              Device Not Available Exception
DE              Divide-by-Zero Exception
UD              Invalid Opcode Exception
GP              General Protection Fault
AC              Alignment Check Exception
SS              Stack Segment Exception
PF              Page Fault
PF - US Bit     Page Table Entry User/Supervisor Bit PF
PF - RW Bit     Page Table Entry Read/Write Bit PF
PF - P Bit      Page Table Entry Present Bit PF
PF - PKU        Page Table Entry Protection Keys PF
BR              Bound Range Exceeded Exception
FP              Floating Point Assist
SMC             Self-Modifying Code
XMC             Cross-Modifying Code
MO              Memory Ordering Principles Violation
MASKMOV         Masked Load/Store Instruction Assist
AD Bits         Page Table Entry Access/Dirty Bits Assist
TSX             Intel TSX Transaction Abort
UC              Uncacheable Memory Assist
PRM             Processor Reserved Memory Assist
HW Interrupts   Hardware Interrupts


Page 6: Rage Against the Machine Clear: A Systematic Analysis of

related microarchitectural structures) by flushing any stale instruction(s), resuming execution at the last retired instruction, and then fetching the new instructions. To test for the presence of a transient execution path, our analysis code immediately executes lines 18-21 targeted by the store and jumps to a spec_code gadget (lines 29-32), which fills a number of cache lines in a (flushed) buffer. Architecturally, this gadget should never be executed, as the store instruction should nop out the branch at line 18. However, microarchitecturally, we do observe multiple cache hits in the (reload) buffer using FLUSH + RELOAD, which confirms the existence of a pre-SMC-handling transient execution path executing stale code and leaving observable traces in the cache. We observe that the scheduling of the store instruction heavily affects the length of the transient path. Indeed, we use different instructions (lines 5-8) to delay the store retirement, and thus the SMC detection, as much as possible. This suggests that the root cause of the observed transient window might be the microarchitectural de-synchronization between the store buffer (new code) and the instruction queue (stale code), yielding transiently incoherent code/data views. We sampled machine clear performance counters to confirm that the transient execution window is caused by the SMC machine clear and not by other events (e.g., memory disambiguation misprediction). Finally, we repeated our experiments on uncached code memory and could not observe any transient path. The counters revealed one machine clear triggered for each executed instruction, since the CPU has to pessimistically assume every fetched instruction has potentially been overwritten. Additionally, we have verified that SMC detection is performed on physical addresses rather than virtual ones.

Cross-Modifying Code

Instead of modifying its own instructions, a thread running on one core may also write data into the currently executing code segment of a thread running on a different physical core. Such Cross-Modifying Code (XMC) may be synchronous (the executor thread waits for the writer thread to complete before executing the new code) or asynchronous, with no particular coordination between threads. To reverse engineer the behavior, we distributed our analysis code across cores and reproduced a signal on the reload buffer in both cases. This confirms that a Cross-Modifying Code Machine Clear (XMC machine clear) behaves similarly to an SMC machine clear across cores, with a store on the writer core originating a transient execution window on the executor core.

6 Floating Point Machine Clear

Listing 1: Self-Modifying Code Machine Clear analysis code.

     1  smc_snippet:
     2      push r11
     3      lea r11, [target]       ; Load addr of target instr (line 17)
     4
     5      clflush [r11]           ; These instructions serve as a delay
     6      %rep 10                 ; for the store argument address. They
     7      imul r11, 1             ; ensure that the execution window of
     8      %endrep                 ; spec_code is as long as possible
     9
    10      ; Code to write as data: 8 nops (overwriting lines 18-21)
    11      mov rax, 0x9090909090909090
    12
    13      ; Store at target addr. Also the last retired instr
    14      ; from which the execution will resume after the SMC MC
    15      mov QWORD [r11], rax
    16
    17  target:                     ; Target instruction to be modified
    18      jmp spec_code
    19      nop
    20      nop
    21      nop
    22
    23      ; Architectural exit point of the function
    24      pop r11
    25      ret
    26
    27  ; Code executed speculatively (flushed after SMC MC)
    28  spec_code:
    29      mov rax, [rdi+0x0]      ; rdi: covert channel reload buffer
    30      mov rax, [rdi+0x400]
    31      mov rax, [rdi+0x800]
    32      mov rax, [rdi+0xc00]

On Intel, when the Floating Point Unit (FPU) is unable to produce results in the IEEE-754 [1] format directly, for instance in the case of denormal operands or results [4, 17], the CPU requires special handling to produce a correct result. According to an Intel patent [62], the denormalization is indeed implemented as a microcode assist or an exception handler, since the corresponding hardware would be too complex. In our experiments, we observed microcode assists on all x87, SSE, and AVX instructions that perform mathematical operations on denormal floating-point (FP) numbers. Increments of the FP_ASSIST.ANY or MACHINE_CLEARS.COUNT (or, on older processors, MACHINE_CLEARS.FP_ASSIST) performance counters confirm that such assists cause machine clears.

Since a machine clear implies a pipeline flush, the assisted FP operation will be squashed together with subsequent µOps. To reverse engineer the behavior, we used analysis code exemplified by Algorithm 1. Our code relies on a FLUSH + RELOAD [81] covert channel to observe the result of a floating-point operation at the byte granularity. In our experiments, we observed two different hits in the reload buffer for each byte, for the transient and architectural result respectively. The double-hit microarchitectural trace confirms that the transient (and wrong) value generated by the FPU is used in subsequent µOps, as also exemplified in Table 2. Later, the CPU detects the error and triggers a machine clear to flush the wrongly executed path. The microcode assist then corrects the result, while subsequent instructions are reissued.

Algorithm 1: Floating Point Machine Clear analysis (pseudo)code. byte is used to extract the i-th byte of z.

    for i ← 1, 8 do
        flush(reload_buf)
        z = x / y                          ▷ Any denormal FP operation
        reload_buf[byte(z, i) * 1024]
        reload(reload_buf)
    end for

Table 2: Architectural (z_arch) and transient (z_tran) results of dividing x and y of Algorithm 1. N: Normal, D: Denormal representations. The mantissa is the low 52 bits of each representation. A normal division by 2^16 leaves the mantissa untouched and subtracts 16 from the exponent, which is the result seen in z_tran, where the exponent overflowed from -1022 to -14.

            Representation        Value           Type   Exp
    x       0x0010deadbeef1337    2.34e-308       N      -1022
    y       0x40f0000000000000    65536 (2^16)    N      16
    z_arch  0x00000010deadbeef    3.57e-313       D      -1022
    z_tran  0x3f10deadbeef1337    6.43e-05        N      -14

Figure 3: Transient execution due to invalid memory ordering.

While we could not find any documentation on floating-point assist handling (even in the patents), our experiments revealed the following important properties. First, we verified that many FP operations can trigger FP assists (i.e., add, sub, mul, div, and sqrt) across different extensions (i.e., x87, SSE, and AVX). Second, the transient result is computed by "blindly" executing the operation as if both operands and result are normal numbers (see Table 2). Third, the detection of the wrong computation occurs later in time, creating a transient execution window. Finally, by performing multiple floating-point operations together with the assisted one, we were able to expand the size of the window, suggesting that detection is delayed if the FPU is busy handling multiple operations.
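The values in Table 2 can be checked by decoding their IEEE-754 binary64 fields. The C sketch below is illustrative only: the helper names are hypothetical, and it assumes the host FPU uses the default round-to-nearest mode with denormals enabled (no flush-to-zero). It computes the architectural quotient with an ordinary division and lets the reader compare the mantissa and exponent fields of x and z_tran; it does not attempt to reproduce the transient result itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fields of an IEEE-754 binary64 value. */
typedef struct {
    unsigned sign;       /* 1 bit */
    unsigned exp_field;  /* 11-bit biased exponent */
    uint64_t mantissa;   /* 52-bit fraction */
} f64_fields;

f64_fields f64_decode(uint64_t bits)
{
    f64_fields f;
    f.sign      = (unsigned)(bits >> 63);
    f.exp_field = (unsigned)((bits >> 52) & 0x7ff);
    f.mantissa  = bits & ((UINT64_C(1) << 52) - 1);
    return f;
}

/* Unbiased exponent; denormals (exp_field == 0) use the minimum, -1022. */
int f64_exponent(f64_fields f)
{
    assert(f.exp_field != 0x7ff);  /* infinities and NaNs are out of scope */
    return f.exp_field == 0 ? -1022 : (int)f.exp_field - 1023;
}

int f64_is_denormal(f64_fields f)
{
    return f.exp_field == 0 && f.mantissa != 0;
}

/* Architectural result of x / y, computed by the host FPU. */
uint64_t f64_div_bits(uint64_t x_bits, uint64_t y_bits)
{
    double x, y, z;
    uint64_t z_bits;
    memcpy(&x, &x_bits, sizeof x);
    memcpy(&y, &y_bits, sizeof y);
    z = x / y;
    memcpy(&z_bits, &z, sizeof z_bits);
    return z_bits;
}
```

Decoding the table entries this way shows that z_tran keeps the mantissa of x unchanged while its exponent field decodes to -14, whereas the architectural quotient is a denormal whose bits are the significand of x shifted right by 16.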

7 Memory Ordering Machine Clear

Algorithm 2: Pseudo-code triggering a MO machine clear.

    Processor A
        clflush(X)        ▷ Make the load slow
        unlock(lock)      ▷ Synchronize loads and stores
        r1 ← [X]
        r2 ← [Y]
        reload(r2)        ▷ For FLUSH + RELOAD

    Processor B
        wait(lock)        ▷ Synchronize loads and stores
        1 → [Y]

The CPU initiates a memory ordering (MO) machine clear when, upon receiving a snoop request, it is uncertain if memory ordering will be preserved [36]. Consider the program in Figure 3. Processor A loads X and Y, while processor B performs two stores to the same locations. If the load of X is slow due to a cache miss, the out-of-order CPU will issue the next load (and subsequent operations) ahead of schedule. Suppose that, while the load of X is pending, processor B signals through a snoop request that the values of X and Y have changed. In this scenario, the memory ordering is A1-B0-B1-A0, which is not allowed according to the Total-Store-Order memory model. As two loads cannot be reordered, r1=1, r2=0 is an illegal result. Thus, processor A has no choice other than to flush its pipeline and re-issue the load of Y in the correct order. This MO machine clear is needed for every inconsistent speculation on the memory ordering. It implements speculation behavior originally proposed by Gharachorloo et al. [27], with the advantage that strict memory order principles can co-exist with aggressive out-of-order scheduling.

Notice that, in the previous example, the store on X is not even necessary to cause a memory ordering violation, since A1-B1-A0 is still an invalid order. Counterintuitively, the memory ordering violation disappears if the load on X is not performed, as A1-B0-B1 is a perfectly valid order.

To reverse engineer the memory ordering handling behavior on Intel CPUs, we used analysis code exemplified by Algorithm 2. Our code mimics the scenario of Figure 3, with loads/stores from two threads racing against each other, but relies on a FLUSH + RELOAD [81] covert channel to observe the loaded value of Y. The synchronization through the lock variable ensures the desired (problematic) proximity and ordering of the memory operations.

As before, in our experiments we observed two different hits in the reload buffer for the loaded value of Y: one for the stale (transient) value and one for the new (architectural) value. For every double hit in the reload buffer, we also measured an increase of MACHINE_CLEARS.MEMORY_ORDERING. We also ran experiments without the load of X. Even though memory ordering violations are no longer possible, since each processor executes a single memory operation and any order is permitted, we still observed MO machine clears. The reason is that Intel CPUs seem to resort to a simple but conservative approach to memory ordering violation detection, where the CPU initiates a MO machine clear when a snoop request from another processor matches the source of any load operation in the pipeline. We obtained the same results with the two threads running across physical or logical (hyperthreaded) cores. Similarly, the results are unchanged if the matching store and load are performed on different addresses: the only requirement we observed is that the memory operations need to refer to the same cache line. Overall, our results confirm the presence of a transient execution window and the ability of a thread to trigger a transient execution path in another thread by simply dirtying a cache line used in a ready-to-commit load. Exploitation-wise, abusing this type of MC is non-trivial due to the strict synchronization requirements and the difficulty of controlling pending stale data.
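The conservative detection rule observed above can be written down as a simple predicate. The C sketch below is a model of the observed behavior, not of the actual hardware logic: it assumes 64-byte cache lines, and the function names and the flat array of in-flight load addresses are hypothetical simplifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { CACHE_LINE_BYTES = 64 };  /* assumed cache line size */

uint64_t cache_line_of(uint64_t addr)
{
    return addr / CACHE_LINE_BYTES;
}

/* Conservative check described in the text: a snoop for `snoop_addr`
 * triggers an MO machine clear if it hits the cache line of any load
 * still in the pipeline, whether or not ordering was actually violated. */
bool snoop_triggers_mo_clear(const uint64_t *inflight_loads, size_t n,
                             uint64_t snoop_addr)
{
    assert(inflight_loads != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (cache_line_of(inflight_loads[i]) == cache_line_of(snoop_addr))
            return true;
    return false;
}
```

Note that the check compares cache lines rather than exact addresses, which mirrors the observation that a store and a load on different addresses of the same line are enough to cause the clear.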

8 Memory Disambiguation Machine Clear

As suggested by the MACHINE_CLEARS.DISAMBIGUATION counter description, memory disambiguation (MD) mispredictions are handled via machine clears. Our experiments confirmed this behavior by observing matching increases of MACHINE_CLEARS.COUNT. Moreover, we observed no changes in microcode assist counters, suggesting mispredictions are resolved entirely in hardware. In case of a misprediction, a stale value is passed to subsequent loads, a primitive that was previously used to leak secret information with Spectre Variant 4 or Speculative Store Bypass (SSB) [55, 82].

Different from the other machine clears, MD machine clears trigger only on address aliasing mispredictions, when the CPU wrongly predicts that a load does not alias a preceding store instruction and can be hoisted. Similar to branch prediction, generating an MD-based transient execution window requires mistraining the underlying predictor. The latter has complex, undocumented behavior, which has been partially reverse engineered [18]. For space constraints, we discuss our full reverse engineering strategy in Appendix A. Our results show that executing the same load 64+15 times with non-aliasing stores is sufficient to ensure the next prediction is "no-alias" (and thus the next load is hoisted). Upon reaching the hoisting prediction, one can perform the load with an aliasing store to trigger the transient path exposed to the incorrect, stale value.

Finally, we experimentally verified that 4k aliasing [50] does not cause any machine clear, but only incurs a further time penalty in case of wrong aliasing predictions. More details on 4k aliasing results can be found in Appendix A.

9 Other Types of Machine Clear

AVX vmaskmov. The AVX vmaskmov instructions perform conditional packed load and store operations depending on a bitmask. For example, a vmaskmovpd load may read 4 packed doubles from memory depending on a 4-bit mask: each double will be read only if the corresponding mask bit is set; the others will be assigned the value 0.

According to the Intel Optimization Reference Manual [36], the instruction does not generate an exception in face of invalid addresses, provided they are masked out. However, our experiments confirm that it does incur a machine clear (and a microcode assist) when accessing an invalid address (e.g., with the present bit set to 0) with a loading mask set to zero (i.e., no bytes should be loaded), to check whether the bytes in the invalid address have the corresponding mask bits set or not. We speculate the special handling is needed because the permission check is very complex, especially in the absence of memory alignment requirements.
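For readers unfamiliar with masked loads, the architectural semantics described above can be sketched in scalar C. This is a simplified model, not the instruction itself: the function name is hypothetical, and the mask is passed as a plain 4-bit integer following the description in the text, whereas the real instruction takes it from the most significant bit of each element of a vector register.

```c
#include <assert.h>
#include <stddef.h>

enum { PACKED_DOUBLES = 4 };  /* a 256-bit vmaskmovpd load covers 4 doubles */

/* Scalar model of the architectural result of a masked load:
 * element i is read from memory only if bit i of `mask` is set,
 * otherwise the destination element is zeroed. Masked-out elements
 * are never dereferenced, which is why an all-zero mask architecturally
 * tolerates an invalid `src`. */
void masked_load_pd(double dst[PACKED_DOUBLES], const double *src,
                    unsigned mask)
{
    assert(mask < (1u << PACKED_DOUBLES));
    for (size_t i = 0; i < PACKED_DOUBLES; i++)
        dst[i] = ((mask >> i) & 1u) ? src[i] : 0.0;
}
```

In this model, a call with an all-zero mask never touches src, so even an invalid pointer is harmless architecturally. The experiments above indicate that the hardware nonetheless takes a machine clear in that situation.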

In our experiments, vmaskmov instructions with all-zero masks and invalid addresses increment the OTHER_ASSIST.ANY and MACHINE_CLEARS.COUNT counters, confirming that the instruction triggers a machine clear. However, the resulting transient execution window seems short-lived or absent, as we were unable to observe cache or other microarchitectural side effects of the execution.

Exceptions. The MACHINE_CLEARS.PAGE_FAULT counter, present on older microarchitectures, confirms that page faults are another cause of machine clears. Indeed, we verified that each instruction incurring a page fault or any other exception, such as "Division by zero", increments the MACHINE_CLEARS.COUNT counter. We also verified that exceptions do not trigger microcode assists and that software interrupts (traps) do not trigger machine clears. Indeed, the Intel documentation [37] specifies that instructions following a trap may be fetched but not speculatively executed. Transient execution windows originating from exceptions, and page faults in particular, have been extensively used in prior work, with a faulty load instruction also used as the trigger to leak information [7, 47, 63, 75].

Hardware interrupts. Although hardware interrupts are an undocumented cause of machine clears, our experiments with APIC timer interrupts showed that they do increment the MACHINE_CLEARS.COUNT counter. While this confirms that hardware interrupts are another root cause of transient execution, the asynchronous nature of these events yields a less than ideal vector for transient execution attacks. Nonetheless, hardware interrupts play an important role in other classes of microarchitectural attacks [73].

Microcode assists. Microcode assists require a pipeline flush to insert the required µOps in the frontend and represent a subclass of machine clears (and thus a root cause of transient execution) for cases where a fast path in hardware is not available. In this paper, we detailed the behavior of floating-point and vmaskmov assists. Prior work has discussed different situations requiring microcode assists, such as those related to page table entry Access/Dirty bits, typically in the context of assisted loads used as the trigger to leak information [8, 63, 75]. AVX-to-SSE transitions [36] represent another microcode assist which, based on our experiments, we could not observe on modern Intel CPUs. Remaining known microcode assists, such as access control of memory pages belonging to SGX secure enclaves and Precise Event Based Sampling (PEBS), are not presented in this work since they are already studied [14] or only related to privileged performance profiling, respectively.

Figure 4: Top plot: transient window size vs. mechanism. Each bar reports the number of transient loads that complete and leave a microarchitectural trace. Bottom plot: leakage rate [Mb/s] vs. mechanism for a simple Spectre Bounds-Check-Bypass attack and a 1-bit FLUSH + RELOAD (F+R) cache covert channel. F+R is the leakage rate upper bound (covert channel loop only, no actual transient window or attack). Mechanisms shown: F+R, TSX, BHT, FAULT, SMC, XMC, FP, MD, MO. CPUs shown: Intel Core i7-10700K, Intel Xeon Silver 4214, Intel Core i9-9900K, Intel Core i7-7700K, AMD Ryzen 5 5600X, AMD Ryzen Threadripper 2990WX, AMD Ryzen 7 2700X.

10 Transient Execution Capabilities

Transient execution attacks rely on crafting a transient window to issue instructions that are never retired. For this purpose, state-of-the-art attacks traditionally rely on mechanisms based on root causes such as branch mispredictions (BHT) [41, 43], faulty loads (Fault) [8, 47, 63, 75], or memory transaction aborts (TSX) [8, 47, 59, 63, 75]. However, the different machine clears discussed in this paper provide an attacker with the exact same capabilities.

To compare the capabilities of machine clear-based transient windows with those of more traditional mechanisms, we implemented a framework able to run arbitrary attacker-controlled code in a window generated by a mechanism of choice. We now evaluate our framework on recent processors (with all the microcode updates and mitigations enabled) to compare the transient window size and leakage rate of the different mechanisms.

10.1 Transient Window Size

The transient window size provides an indication of the number of operations an attacker can issue on a transient path before the results are squashed. Larger windows can, in principle, host more complex attacks. Using a classic FLUSH + RELOAD cache covert channel (F+R) as a reference, we measure the window size by counting how many transient loads can complete and hit entries in a designated F+R buffer. Figure 4 (top) presents our results.

As shown in the figure, the window size varies greatly across the different mechanisms. Broadly speaking, mechanisms that have a higher detection cost, such as XMC and MO Machine Clear, yield larger window sizes. Not surprisingly, branch mispredictions yield the largest window sizes, as we can significantly slow down the branch resolution process (i.e., causing cache misses) and delay detection. FP, on the other hand, yields the shortest windows, suggesting that denormal numbers are efficiently detected inside the FPU. Our results also show that, while our framework was designed for Intel processors, similar if not better results can usually be obtained on AMD processors (where we use the same conservative training code for branch/memory prediction). This shows that both CPU families share a similar implementation in all cases except for MD, where the mistraining pattern we used is not valid for pre-Zen3 architectures.

10.2 Leakage Rate

To compare the leakage rates for the different transient execution mechanisms, we transiently read and repeatedly leak data from a large memory region through a classic F+R cache covert channel. We report the resulting leakage rates, as the number of bits successfully leaked per second, across different microarchitectures, using a 1-bit covert channel to highlight the time complexity of each mechanism. We consider data to be successfully leaked after a single correct hit in the reload buffer. In case of a miss for a particular value, we restart the leak for the same value until we get a hit (or until we get 100 misses in a row).
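The success criterion and retry policy described above can be sketched as follows. This is an illustrative C fragment, not the framework's actual code: leak_probe_fn stands for a hypothetical callback that performs one transient leak attempt plus the FLUSH + RELOAD check, and fake_probe is a stand-in used only to exercise the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe: runs one transient leak attempt plus the
 * FLUSH + RELOAD check and returns true on a correct hit. */
typedef bool (*leak_probe_fn)(void *ctx);

/* Retry one value until the first hit, giving up after `max_misses`
 * consecutive misses (100 in the text). Returns the number of attempts
 * made on success, or -1 if the value was abandoned. */
int leak_with_retry(leak_probe_fn probe, void *ctx, int max_misses)
{
    assert(probe != NULL && max_misses > 0);
    for (int misses = 0; misses < max_misses; misses++)
        if (probe(ctx))
            return misses + 1;
    return -1;
}

/* Stand-in probe for demonstration: misses until *remaining reaches 0. */
bool fake_probe(void *ctx)
{
    int *remaining = ctx;
    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```

Because every miss costs a full attempt, a mechanism with a low hit probability pays for it directly in the measured bits per second, which is one reason the leakage rate reflects both the speed and the reliability of a mechanism.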

As shown in Figure 4 (bottom), different Intel and AMD microarchitectures generally yield similar leakage rates, with some variations. For instance, FP MC offers better leakage rates on Intel. This difference stems from the different performance impact of the corresponding machine clears on Intel vs. AMD microarchitectures.

Figure 5: Leakage rate [Mb/s] vs. mechanism (F+R, TSX, BHT, FP, SMC, XMC, MO, MD, FAULT) with a 1-bit, 4-bit, or 8-bit FLUSH + RELOAD cache covert channel (Intel Core i9).

As shown in Figure 5, the leakage rate instead varies greatly across the different mechanisms, and so does the optimal covert channel bitwidth. Indeed, while existing attacks typically rely on 8-bit covert channels, our results suggest that 1-bit or 4-bit channels can be much more efficient, depending on the specific mechanism. Roughly speaking, optimal leakage rates can be obtained by balancing the time complexity (and hence bitwidth) of the covert channel with that of the mechanism. For example, FP is a lightweight and reliable mechanism, hence using a comparably fast and narrow 1-bit covert channel is beneficial. In contrast, MD requires a time-consuming predictor training phase between leak iterations, and leaking more bits per iteration with a 4-bit covert channel is more efficient. Interestingly, a classic 8-bit covert channel yields consistently worse and comparable leakage rates across all the mechanisms, since F+R dominates the execution time.
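The trade-off between mechanism cost and covert channel width can be illustrated with a toy cost model. Everything in the C sketch below is an assumption made for illustration and is not the paper's measurement method: each leak iteration is assumed to pay a fixed mechanism cost t_mech (transient window setup plus any predictor training) and a probe cost t_probe for each of the 2^b reload buffer entries of a b-bit channel, in arbitrary time units.

```c
#include <assert.h>

/* Toy cost model (an assumption, not the paper's measurement method):
 * one leak iteration pays `t_mech` for the transient window (including
 * any predictor training) and `t_probe` for each of the 2^bits reload
 * buffer entries, and yields `bits` bits. */
double cost_per_bit(double t_mech, double t_probe, unsigned bits)
{
    assert(bits >= 1 && bits <= 8);
    double entries = (double)(1u << bits);
    return (t_mech + entries * t_probe) / (double)bits;
}

/* Cheapest width among the 1-, 4- and 8-bit channels compared in Figure 5. */
unsigned best_bitwidth(double t_mech, double t_probe)
{
    static const unsigned widths[] = { 1, 4, 8 };
    unsigned best = widths[0];
    for (unsigned i = 1; i < 3; i++)
        if (cost_per_bit(t_mech, t_probe, widths[i]) <
            cost_per_bit(t_mech, t_probe, best))
            best = widths[i];
    return best;
}
```

Under this model, a cheap mechanism (t_mech comparable to t_probe) favors the 1-bit channel, an expensive one (t_mech around 100 probe times) favors the 4-bit channel, and the 8-bit channel is dominated by its 256 probes in both cases, which is qualitatively in line with the trends discussed above.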

Our results show that only two mechanisms (TSX and FP) are close to the maximum theoretical leakage rate of pure F+R. Moreover, FP performs as efficiently as TSX but, unlike TSX, is available on both Intel and AMD, is always enabled, and can be used from managed code (e.g., JavaScript). BHT, on the other hand, yields the worst leakage rates due to the inefficient training-based transient window. The BHT leakage rate can be improved if a tailored mistrain sequence is used, as in MD (Appendix A). Overall, our results show that machine clear-based windows achieve comparable, and in many cases (e.g., FP) better, leakage rates compared to traditional mechanisms. Moreover, many machine clears eliminate the need for mistraining, which, besides yielding efficient leakage rates, can escape existing pattern-based mitigations and does not depend on hardware extensions that may be disabled (e.g., Intel TSX).

11 Attack Primitives

Building on our reverse engineering results and focusing on the unexplored SMC and FP machine clears, we now present two new transient execution attack primitives and analyze their security implications. We also present an end-to-end FPVI exploit disclosing arbitrary memory in Firefox. Later, we discuss mitigations.

11.1 Speculative Code Store Bypass (SCSB)

Our first attack primitive, Speculative Code Store Bypass (SCSB), allows an attacker to execute stale, controlled code in a transient execution window originated by an SMC machine clear. Since the primitive relies on SMC, its primary applicability is on JIT (e.g., JavaScript) engines running attacker-controlled code, although OS kernels and hypervisors storing code pages and allowing their execution without first issuing a serializing instruction are also potentially affected.

Figure 6: SCSB primitive example, where the instruction pointer is pointing at the bold code blocks. The g code is freed JIT'ed code (of some g function) under the attacker's control. (1) Force the engine to JIT and execute the code of function f, causing desynchronization of the code and data views. (2) Execute stale code and SMC MC. (3) After the SMC MC, code and data view coherence is restored and the new code is executed.

As exemplified in Figure 6, the operations of the primitive can be broken down into three steps: (1) the JIT engine compiles a function f, storing the generated code into a JIT code cache region previously used by a (now-stale) version of function g; (2) the JIT engine jumps to the newly generated code for the function f, but, due to the temporary desynchronization between the code and data views of the CPU, this causes transient execution of the stale code of g until the SMC machine clear is processed; (3) after the pipeline flush, the code and data views are resynchronized and the CPU restarts the execution of the correct code of f. For exploitation, the attacker needs to (i) massage the JIT code cache allocator to reuse a freed region with a target gadget g of choice, and (ii) force the JIT engine to generate and execute new (f) code in such a region, enabling transient out-of-context execution of the gadget and spilling secrets into the microarchitectural state.

Our primitive bears similarities with both transient and architectural primitives used in prior attacks. On the transient front, our primitive is conceptually similar to a Speculative-Store-Bypass (SSB) primitive [55, 82], but can transiently execute stale code rather than read stale data. Interestingly, however, the underlying causes of the two primitives are quite different (MD misprediction vs. SMC machine clear). On the architectural front, our primitive mimics classic Use-After-Free (UAF) exploitation on the JIT code cache, also known as Return-After-Free (RAF) in the hacker community [19, 25]. An example is CVE-2018-0946, where a use-after-free vulnerability can be exploited to force the Chakra JS engine to erroneously execute freed (attacker-controlled) JIT code, resulting in arbitrary code execution after massaging the right gadget into the JIT code cache [24].

Figure 7: Coding options suggested by the Intel Architectures Software Developer's Manuals to handle SMC and XMC execution. Option 1 describes the exact steps required by our Speculative Code Store Bypass attack primitive, potentially resulting in exploitable gadgets.

Indeed, at a high level, SCSB yields a transient use-after-free primitive on a JIT code cache, with exploitation properties similar to its architectural counterpart. However, there are some differences due to the transient nature of SCSB. First, we need to find an out-of-context gadget to transiently leak data, rather than architecturally execute arbitrary code. In JavaScript engines, similar gadgets have already been exploited by ret2spec [48], escalating out-of-context transient execution of valid code to type confusion, arbitrary reads, and ultimately a secret-dependent load to transmit the value.

Second, we need to target short-lived JIT code cache updates, so that the newly generated code is immediately executed, fitting the target gadget in the resulting transient execution window. Interestingly, executing JIT'ed code is the next step after JIT code generation mentioned in the Intel manual (Figure 7, Option 1), suggesting that developers who follow such directives ad litteram can easily introduce such gadgets. Moreover, modern JavaScript compilers feature a multi-stage optimization pipeline [12, 54], and short-lived JIT code cache updates are favored for just-in-time (re)optimization. Indeed, we found test code in V8 to specifically test such updates and verify instruction/data cache coherence [70].

Finally, we need to ensure JIT code cache updates are not accompanied by barriers that force immediate synchronization of the code and data views. We analyzed the code of the popular SpiderMonkey and V8 JavaScript engines and verified that the functions called upon JIT code cache updates to synchronize instruction and data caches (Listing 2 and Listing 3) are always empty on x86 (as expected, given Intel's primary recommendation in Figure 7).

Listing 2: Chromium instruction cache flush (chromium/src/v8/src/codegen/x64/cpu-x64.cc).

    void CpuFeatures::FlushICache(void* start, size_t size) {
      // No need to flush the instruction cache on Intel.
    }

Listing 3: Firefox instruction cache flush (mozilla-unified/js/src/jit/FlushICache.h).

    inline void FlushICache(void* code, size_t size,
                            bool codeIsThreadLocal = true) {
      // No-op. Code and data caches are coherent on x86 and x64.
    }

Overall, while the exploitation is far from trivial (i.e., having to address the challenges of traditional use-after-free exploitation as well as transient execution exploitation in the browser), we believe SCSB expands the attack surface of transient execution attacks in the browser. We found a number of candidate SCSB gadgets in real-world code, but none of them was ultimately exploitable due to the coincidental presence of some serializing instruction preventing stale code execution. Nonetheless, mitigations are needed to enforce security-by-design. Luckily, as shown later, SCSB is amenable to a practical and efficient mitigation (i.e., at a similar cost as faced by non-x86 architectures). The implementation/performance cost of the mitigation is even lower than for its sibling SSB primitive, whose mitigation has been deployed in practice even in the absence of known practical exploits [55].

In Table 3, we show that all tested Intel and AMD processors are affected by SCSB. In contrast, ARM is not vulnerable, since SMC updates require explicit software barriers.

11.2 Floating Point Value Injection (FPVI)

Our second attack primitive, Floating Point Value Injection (FPVI), allows an attacker to inject arbitrary values into a transient execution window originated by an FP machine clear. As exemplified in Figure 8, the operations of the primitive can be broken down into four steps: (1) the attacker triggers the execution of a gadget starting with a denormal FP operation in the victim application, with the x and y operands under the attacker's control; (2) the transient z result of the operation is processed by the subsequent gadget instructions, leaving a microarchitectural trace; (3) the CPU detects the error condition (i.e., wrong result of a denormal operation), triggering a machine clear and thus a pipeline flush; (4) the CPU re-executes the entire gadget with the correct architectural z result. For exploitation, the attacker needs to (i) massage the x and y operands to inject the desired z value into the victim transient path, and (ii) target a victim gadget so that the injected value yields a security-sensitive trace, which can be observed with FLUSH + RELOAD or other microarchitectural attacks.

Our primitive bears similarities with Load-Value-Injection (LVI) [71], since both allow attackers to inject controlled values into transient execution. Moreover, both primitives require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues, and hence the triggering conditions, are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. Indeed, while it is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild (much like for Spectre [41], LVI [71], etc., this would require gadget scanners, subject of orthogonal research [56, 58]), we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all of Mozilla Firefox's prior mitigations against transient execution attacks [52].

Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x,y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE-754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case, a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.

To verify that the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., the string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see gadget in Figure 8).
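The architectural side of this example can be checked in software. The sketch below reinterprets the two 64-bit operand patterns quoted above as doubles, confirms that the divisor is a denormal, and computes the architectural quotient. The helper names are chosen here for illustration. The transient result cannot be reproduced this way, since it only exists inside the CPU before the machine clear.

```python
import math
import struct
import sys


def bits_to_double(bits):
    """Reinterpret a 64-bit integer pattern as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def is_denormal(value):
    """True for nonzero values below the smallest normal double."""
    return value != 0.0 and abs(value) < sys.float_info.min


x = bits_to_double(0xC000E8B2C9755600)  # dividend from the text
y = bits_to_double(0x0004000000000000)  # divisor from the text (denormal)

architectural = x / y  # overflows the double range
print(is_denormal(y), architectural)
```

The division overflows because a finite negative dividend of magnitude around 2 is divided by a value far below the smallest normal double, which is why the architectural result is -Infinity.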

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the length attribute of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 51 bits set to zero, a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables, such as string pointers, can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on an FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.
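The tag and payload split, and the resulting transient read address, can be sketched as follows. This is a simplified model of the layout described in the text (17-bit tag, 47-bit payload, length field at offset 4), not SpiderMonkey code, and the function names are hypothetical. Note that the tags quoted in the text (0xfff88, 0xfffb0) are the leading 20 bits of the value in hex, so the 17-bit tag of the example value is those bits shifted right by 3.

```python
TAG_SHIFT = 47                       # payload width in bits (from the text)
PAYLOAD_MASK = (1 << TAG_SHIFT) - 1
JSVAL_TAG_MAX_DOUBLE = 0x1FFF0       # threshold quoted in the text
LENGTH_OFFSET = 4                    # length field offset (from the text)


def decode(bits):
    """Split a 64-bit NaN-boxed value into (17-bit tag, 47-bit payload)."""
    return bits >> TAG_SHIFT, bits & PAYLOAD_MASK


def is_double(bits):
    """Values whose tag does not exceed the threshold are plain doubles."""
    return (bits >> TAG_SHIFT) <= JSVAL_TAG_MAX_DOUBLE


def length_read_address(bits):
    """Address read by a z.length access if bits is taken as a string."""
    _, payload = decode(bits)
    return payload + LENGTH_OFFSET


transient_z = 0xFFFB0DEADBEEF000     # transient result from the text
tag, payload = decode(transient_z)
print(hex(tag), hex(payload), hex(length_read_address(transient_z)))
```

The architectural result of the same division, -Infinity (0xfff0000000000000), passes the `is_double` check, so the type confusion exists only on the transient path.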

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0 and, by wrapping this primitive with a variant of an EVICT + RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

Table 3: Tested processors.

Processor                              Microcode    SCSB vulnerable   FPVI vulnerable
Intel Core i7-10700K                   0xe0         ✓                 ✓
Intel Xeon Silver 4214                 0x500001c    ✓                 ✓
Intel Core i9-9900K                    0xde         ✓                 ✓
Intel Core i7-7700K                    0xca         ✓                 ✓
AMD Ryzen 5 5600X                      0xa201009    ✓                 ✓ †
AMD Ryzen 2990WX                       0x800820b    ✓                 ✓ †
AMD Ryzen 7 2700X                      0x800820b    ✓                 ✓ †
Broadcom BCM2711 Cortex-A72 (ARM v8)   -            ✗ §               ✗

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified that any serialization instruction such as lfence, mfence, or cpuid placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. This function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows that JavaScript execution time is heavily dominated by JIT code generation/execution and that code updates have negligible impact. Our results show this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
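The effect of the conditional masking strategy on a result's bit pattern can be sketched as follows. This is an illustration of the idea only, not the SpiderMonkey patch: it assumes the x86 QNaN Floating-Point Indefinite encoding 0xfff8000000000000 as the canonical value, uses a hypothetical function name, and mimics a cmov-style select with integer masks. Python cannot model whether the selection itself is safe under transient execution.

```python
EXP_MASK = 0x7FF0000000000000         # exponent bits of a double
FRAC_MASK = 0x000FFFFFFFFFFFFF        # fraction (mantissa) bits
QNAN_INDEFINITE = 0xFFF8000000000000  # x86 QNaN Floating-Point Indefinite
U64 = 0xFFFFFFFFFFFFFFFF


def mask_nan_result(bits):
    """Replace any NaN bit pattern with the canonical QNaN.

    The select is written without a data-dependent branch on the
    result path, to mirror a conditional-move style selection.
    """
    is_nan = (bits & EXP_MASK) == EXP_MASK and (bits & FRAC_MASK) != 0
    select = -int(is_nan) & U64       # all ones if NaN, else zero
    return (bits & ~select & U64) | (QNAN_INDEFINITE & select)


# The malicious transient value from the exploit loses its payload,
# while ordinary results such as -Infinity pass through unchanged.
print(hex(mask_nan_result(0xFFFB0DEADBEEF000)))
print(hex(mask_nan_result(0xFFF0000000000000)))
```

After masking, no FP result can carry a NaN-boxed tag and payload, which restores the invariant the NaN-boxing implementation relies on.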

A more general and automated mitigation is for the compiler to place a serializing instruction such as lfence after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable and those resistant to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76]. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by an MDS attacker. For the latter, we limit ourselves to load sinks, to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.
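The source and sink selection described above can be sketched on a toy straight-line instruction list. This is a simplified illustration and not the LLVM pass: the `Instr` record and opcode names are hypothetical, every operand of a sink is treated as an address or branch target, and control flow and calls are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    op: str                 # "fp", "alu", "load", "store", or "ibranch"
    dst: Optional[str]      # destination register, if any
    srcs: Tuple[str, ...]   # source registers


def fp_sources_to_fence(block: List[Instr], mds_vulnerable: bool) -> Set[int]:
    """Indices of FP operations whose result reaches a sink."""
    sinks = {"load", "ibranch"}
    if mds_vulnerable:
        sinks.add("store")  # stores are sinks only on MDS-vulnerable systems
    taint: Dict[str, Set[int]] = {}
    to_fence: Set[int] = set()
    for index, ins in enumerate(block):
        incoming: Set[int] = set()
        for reg in ins.srcs:
            incoming |= taint.get(reg, set())
        if ins.op in sinks:
            to_fence |= incoming
        if ins.dst is not None:
            if ins.op == "fp":
                taint[ins.dst] = {index}       # an FP result is a new source
            elif ins.op == "load":
                taint[ins.dst] = set()         # loaded values are LVI's domain
            else:
                taint[ins.dst] = set(incoming)
    return to_fence


EXAMPLE_BLOCK = [
    Instr("fp", "z", ("x", "y")),     # 0: z = FP_Op(x, y)
    Instr("alu", "a", ("z",)),        # 1: a derived from z
    Instr("load", "v", ("a",)),       # 2: load through a
    Instr("fp", "w", ("x", "y")),     # 3: w = FP_Op(x, y)
    Instr("store", None, ("w",)),     # 4: store through w
]
print(fp_sources_to_fence(EXAMPLE_BLOCK, mds_vulnerable=False))
print(fp_sources_to_fence(EXAMPLE_BLOCK, mds_vulnerable=True))
```

In this toy block, the first FP operation needs a fence in both configurations because its result feeds a load address, while the second one needs a fence only when stores count as sinks.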

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact (up to 280%), despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance impact difference between MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by using interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

[Figure 9 plot. x-axis: SPEC FP 2017 benchmarks (619.lbm_s, 638.imagick_s, 644.nab_s, geomean) and SPEC FP 2006 benchmarks (433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, geomean); y-axis: overhead, 0 to 800; legend (LLVM mitigation): LVI, FPVI (MDS-vulnerable system), FPVI (MDS-resistant system), LVI+FPVI (MDS-vulnerable system).]

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0.

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, an MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root-cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation, due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It is worth noting that, in Figure 10, there are two separate predictors subclasses under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74–76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a HW interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less than ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived from likely invariants [15] used by the CPU. Such invariants commonly hold but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.
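The classification described in this section can be captured as a small lookup structure. The sketch below is illustrative only: it lists just the example causes named in the surrounding text, not the full contents of Figure 10, and the dictionary layout and function name are choices made here.

```python
ROOT_CAUSES = {
    "Control-Flow Misprediction": {
        "Predictors": ["branch misprediction"],
    },
    "Data Misprediction (machine clear)": {
        "Predictors": ["memory disambiguation (MD)"],
        "Exceptions": ["page fault"],
        "Interrupts": ["hardware interrupt"],
        "Likely invariants violations": [
            "self-modifying code (SMC)",
            "denormal floating point (FP)",
        ],
    },
}


def classify(cause):
    """Return (class, subclass) for a listed root cause, else None."""
    for main_class, subclasses in ROOT_CAUSES.items():
        for subclass, causes in subclasses.items():
            if cause in causes:
                return main_class, subclass
    return None


print(classify("self-modifying code (SMC)"))
```

Keying the structure by root cause rather than by attack keeps the trigger of an exploit separate from the mechanism that opens its transient window, which is the distinction this section draws.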

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them through the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it through the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd Daniel Genkin and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. HyperDegrade: From GHz to MHz effective CPU frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. SMoTherSpectre: Exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. KASLR: Break it, fix it, repeat. In ACM ASIA CCS 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. SgxPectre: Stealing Intel secrets from SGX enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS 2018.

[16] Christopher Domas. Breaking the x86 ISA. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937

[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. BranchScope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas Kaveh Razavi Georgios Portokalidis Herbert Bos andCristiano Giuffrida Speculative Probing Hacking Blind in the SpectreEra In CCS 2020

[29] Ben Gras Kaveh Razavi Erik Bosman Herbert Bos and CristianoGiuffrida ASLR on the Line Practical Cache Attacks on the MMUIn NDSS February 2017

[30] John L Hennessy and David A Patterson Computer architecture aquantitative approach Elsevier 2011

[31] Jann Horn Reading privileged memory with a side-channel 2018

[32] Xen Hypervisor block_speculation function call in invoke_stubhttpsxenbitsxenorggitwebp=xengita=blobf=xenarchx86x86_emulatex86_emulatechb=HEAD

[33] Xen Hypervisor block_speculation function call inio_emul_stub_setup httpsxenbitsxenorggitwebp=xengita=blobf=xenarchx86pvemul-priv-opchb=HEAD

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts, October 1, 2013. US Patent 8,549,263.


[45] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program analysis & transformation. In CGO 2004.

[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.

[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.

[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.

[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.

[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. Memjam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358

[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111

[53] Mozilla. Spider Monkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h

[54] Mozilla. SpiderMonkey IonMonkey documentation.

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. Specfuzz: Bringing spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the cpu on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. Spectaint: Speculative taint analysis for discovering spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. Sok: In search of lost time: A review of javascript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers, 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in javascript. In FC, IFCA 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. Netspectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. Lazyfp: Leaking fpu register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. Meltdownprime and spectreprime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. Sok: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. Lvi: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary cpu interrupt logic. In ACM CCS 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Sgaxe: How sgx fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Cacheout: Leaking data on intel cpus via cache evictions. arXiv preprint.

[77] WebKit. Browserbench. https://browserbench.org

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-ng: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+reload: a high resolution, low noise, l3 cache side-channel attack. In USENIX Security 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                      ; rdi: store addr, rsi: load addr
    %rep 10                     ; Trick to delay the store address
        imul rdi, 1
    %endrep
        mov DWORD [rdi], 0x42   ; Store
        mov eax, DWORD [rsi]    ; Load
    %rep 10                     ; Pronounce load timing
        imul eax, 1
    %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturation counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that the saturating counter is set to 0
    for(i=0; i<100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for(i=100; i<120; i++) st_ld(mem, mem+64);
    // Trigger a memory ordering machine clear
    for(i=120; i<130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, the flush counter, etc.) based entirely on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter, incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
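The per-address predictor described above can be modeled in a few lines of C. This is only a sketch of the behavior inferred from the measurements (a 4-bit counter, incremented on non-aliasing loads, reset on aliasing ones, hoisting only when saturated); the type and function names are hypothetical and do not come from the patent or from Intel documentation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed model: one 4-bit saturating counter per predictor entry. */
#define MD_COUNTER_MAX 15u

typedef struct {
    uint8_t counter; /* 0..15 */
} md_entry;

/* True when the entry predicts "no alias", i.e. the load would be hoisted. */
bool md_predicts_hoist(const md_entry *e) {
    return e->counter == MD_COUNTER_MAX;
}

/* Update the entry once the real aliasing outcome of the load is known. */
void md_update(md_entry *e, bool load_aliased) {
    if (load_aliased) {
        e->counter = 0;               /* reset on any aliasing load */
    } else if (e->counter < MD_COUNTER_MAX) {
        e->counter++;                 /* saturate at 15 */
    }
}

/* Consecutive non-aliasing loads needed before hoisting is predicted. */
unsigned md_loads_until_hoist(md_entry e) {
    unsigned n = 0;
    while (!md_predicts_hoist(&e)) {
        md_update(&e, false);
        n++;
    }
    return n;
}
```

Starting from a counter of 0, `md_loads_until_hoist` returns 15 in this model, which is the number of non-aliasing loads after which the st_ld invocations in Figure 11 become faster.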

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld, 100 to 130; y-axis: clock cycles). Orange bar: memory ordering machine clear observed.

Listing 6: Snippet observing activation of the never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for(int i=0; i<10; i++) {
        for(int j=0; j<19; j++)
            st_ld(mem, mem+64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld, 0 to 120; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters), and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe if two unrelated loads in these functions are able to influence each other when varying the number of nops. In our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256·k bytes apart in memory (k ∈ ℕ). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
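The observed hashing can be written down directly. The sketch below assumes the 256-entry, low-8-bit indexing reported above; the function names are hypothetical, and the real hardware may mix in other state that these experiments did not expose.

```c
#include <stdbool.h>
#include <stdint.h>

#define MD_NUM_PREDICTORS 256u /* observed value; the patent suggests 64 */

/* Predictor index for a load at instruction pointer `ip`: its lowest 8 bits. */
unsigned md_predictor_index(uintptr_t ip) {
    return (unsigned)(ip & (MD_NUM_PREDICTORS - 1u));
}

/* Two loads share a predictor when their instruction pointers are
 * 256*k bytes apart, i.e. when their low 8 bits are equal. */
bool md_loads_collide(uintptr_t ip_a, uintptr_t ip_b) {
    return md_predictor_index(ip_a) == md_predictor_index(ip_b);
}
```

Under this model, an attacker who knows the victim load's address only needs a load of their own whose address has the same low byte, which is what the nop padding in st_ld_offset searches for.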

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that, after 4 machine clears, the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads, 15 to 399 in steps of 64; y-axis: total measured machine clears, 1 to 4).

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0 and disabled otherwise. The flush counter is decremented on every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measure how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal that the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts from the never-hoisting state (i.e., with the predictor disabled), our results show that, to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
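To make the interaction between the per-address counter and the watchdog concrete, the following C model combines both. It is a sketch under explicit assumptions: the flush counter ranges from -1 to 3, it starts at 3 in the Listing 6 scenario, and the count of correct predictions is not reset by a machine clear. None of these details are confirmed by the patent text quoted here, and all names are hypothetical.

```c
#include <stdbool.h>

/* Assumed constants, taken from the measurements described in the text. */
enum {
    SAT_MAX = 15,              /* 4-bit per-address saturating counter */
    FLUSH_MAX = 3,             /* assumed upper bound of the flush counter */
    FLUSH_DISABLED = -1,       /* below 0: watchdog on, never-hoisting state */
    HOISTS_PER_INCREMENT = 64
};

typedef struct {
    int sat;    /* per-address counter, 0..SAT_MAX */
    int flush;  /* flush counter, FLUSH_DISABLED..FLUSH_MAX */
    int good;   /* correct (or would-be correct) hoisting predictions */
} md_model;

/* Simulate one st_ld call; returns true if it would cause a machine clear. */
bool md_step(md_model *m, bool load_aliases) {
    bool predict_hoist = (m->sat == SAT_MAX);
    bool hoisted = predict_hoist && m->flush >= 0;

    if (load_aliases) {
        m->sat = 0;
        if (hoisted) {                      /* misprediction */
            if (m->flush > FLUSH_DISABLED) m->flush--;
            return true;
        }
        return false;
    }
    if (!predict_hoist) {
        m->sat++;
    } else if (++m->good == HOISTS_PER_INCREMENT) {
        m->good = 0;
        if (m->flush < FLUSH_MAX) m->flush++;
    }
    return false;
}

/* Non-aliasing loads needed to leave the never-hoisting state from scratch. */
int md_loads_to_reenable(void) {
    md_model m = { 0, FLUSH_DISABLED, 0 };
    int n = 0;
    while (!(m.sat == SAT_MAX && m.flush >= 0)) {
        md_step(&m, false);
        n++;
    }
    return n;
}

/* Machine clears seen when repeating the pattern of Listing 6. */
int md_clears_in_listing6(int rounds) {
    md_model m = { 0, FLUSH_MAX, 0 };
    int clears = 0;
    for (int i = 0; i < rounds; i++) {
        for (int j = 0; j < 19; j++) md_step(&m, false);
        clears += md_step(&m, true);
    }
    return clears;
}
```

In this model, `md_loads_to_reenable()` yields 79 (15+64) and `md_clears_in_listing6(10)` yields 4, matching the numbers reported for Figures 12 and 13. The agreement only shows that the model is consistent with the measurements, not that the hardware is built this way.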

Listing 7: Snippet verifying the 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for(i=0; i<10; i++) st_ld(mem+0x1000, mem+0x1000);
    for(i=10; i<20; i++) st_ld(mem+0x2000, mem+0x2000);
    // Cause 4k aliasing
    for(i=20; i<40; i++) st_ld(mem+0x1000, mem+0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld, 0 to 40; y-axis: clock cycles). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value, and in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor.

Figure 15: Memory disambiguation unit, simplified view.

Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load, and (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
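The 4k aliasing condition examined above reduces to a comparison of page offsets. The helper below is a simplified sketch: it compares only the low 12 address bits and ignores access sizes and partial overlaps, and its name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK 0xFFFu /* low 12 bits: offset within a 4 KB page */

/* True when a load falsely matches an earlier store: the two addresses
 * differ but share the same 4 KB page offset. */
bool is_4k_alias(uintptr_t store_addr, uintptr_t load_addr) {
    return store_addr != load_addr &&
           (store_addr & PAGE_OFFSET_MASK) == (load_addr & PAGE_OFFSET_MASK);
}
```

For the last loop of Listing 7, `is_4k_alias(mem+0x1000, mem+0x2000)` holds for any base address, while the two training loops use identical store and load addresses and therefore alias for real.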

B Root-Causes Description Table

Acronym         Description

BHT             Branch History Table
BTB             Branch Target Buffer
RSB             Return Stack Buffer
MD              Memory Disambiguation Unit
NM              Device Not Available Exception
DE              Divide-by-Zero Exception
UD              Invalid Opcode Exception
GP              General Protection Fault
AC              Alignment Check Exception
SS              Stack Segment Exception
PF              Page Fault
PF - U/S Bit    Page Table Entry User/Supervisor Bit PF
PF - R/W Bit    Page Table Entry Read/Write Bit PF
PF - P Bit      Page Table Entry Present Bit PF
PF - PKU        Page Table Entry Protection Keys PF
BR              Bound Range Exceeded Exception
FP              Floating Point Assist
SMC             Self-Modifying Code
XMC             Cross-Modifying Code
MO              Memory Ordering Principles Violation
MASKMOV         Masked Load/Store Instruction Assist
A/D Bits        Page Table Entry Access/Dirty Bits Assist
TSX             Intel TSX Transaction Abort
UC              Uncacheable Memory Assist
PRM             Processor Reserved Memory Assist
HW Interrupts   Hardware Interrupts



Algorithm 1: Floating Point Machine Clear analysis (pseudo)code. byte is used to extract the i-th byte of z.

    for i ← 1, 8 do
        flush(reload_buf)
        z = x / y                        // Any denormal FP operation
        reload_buf[byte(z, i) * 1024]
        reload(reload_buf)
    end for

        Representation        Value           Type   Exp

x       0x0010deadbeef1337    2.34e-308       N      -1022
y       0x40f0000000000000    65536 (2^16)    N      16
zarch   0x00000010deadbeef    3.57e-313       D      -1022
ztran   0x3f10deadbeef1337    6.43e-05        N      -14

Table 2: Architectural (zarch) and transient (ztran) results of dividing x and y of Algorithm 1. N: Normal, D: Denormal representations. The mantissa is the low 13 hex digits of each representation. A normal division by 2^16 leaves the mantissa untouched and subtracts 16 from the exponent: this is the result of ztran, where the exponent overflowed from -1022 to -14.
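The architectural row of Table 2 can be reproduced on any IEEE-754 platform with default rounding and denormals enabled (no flush-to-zero). The sketch below recomputes zarch from the bit patterns of x and y and checks that ztran keeps the mantissa of x; the helper names are hypothetical, and the byte numbering (1 = least significant) is an assumption about Algorithm 1's byte function.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as an IEEE-754 double and back. */
double bits_to_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

uint64_t double_to_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Low 52 bits: the stored mantissa (fraction) of a double. */
uint64_t mantissa_bits(uint64_t bits) {
    return bits & 0x000FFFFFFFFFFFFFull;
}

/* Denormal: zero exponent field and non-zero mantissa. */
bool is_denormal_bits(uint64_t bits) {
    return ((bits >> 52) & 0x7FFu) == 0 && mantissa_bits(bits) != 0;
}

/* Architectural result of x / y, returned as a raw bit pattern. */
uint64_t fp_div_bits(uint64_t x_bits, uint64_t y_bits) {
    volatile double x = bits_to_double(x_bits); /* volatile: divide at run time */
    volatile double y = bits_to_double(y_bits);
    return double_to_bits(x / y);
}

/* i-th byte of z (1 = least significant), used to index the reload buffer. */
unsigned byte_of(uint64_t z_bits, unsigned i) {
    return (unsigned)((z_bits >> (8u * (i - 1u))) & 0xFFu);
}
```

With x = 0x0010deadbeef1337 and y = 0x40f0000000000000, `fp_div_bits` returns 0x00000010deadbeef, a denormal, while the transient pattern 0x3f10deadbeef1337 shares its mantissa bits with x. The transient value itself cannot be produced by this code, since it only exists before the machine clear.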

Figure 3: Transient execution due to invalid memory ordering.

are normal numbers (see Table 2). Third, the detection of the wrong computation occurs later in time, creating a transient execution window. Finally, by performing multiple floating-point operations together with the assisted one, we were able to expand the size of the window, suggesting that detection is delayed if the FPU is busy handling multiple operations.

7 Memory Ordering Machine Clear

Algorithm 2: Pseudo-code triggering a MO machine clear.

    Processor A
        clflush(X)      // Make the load slow
        unlock(lock)    // Synchronize loads and stores
        r1 ← [X]
        r2 ← [Y]
        reload(r2)      // For FLUSH + RELOAD

    Processor B
        wait(lock)      // Synchronize loads and stores
        1 → [Y]

The CPU initiates a memory ordering (MO) machine clear when, upon receiving a snoop request, it is uncertain if memory ordering will be preserved [36]. Consider the program in Figure 3. Processor A loads X and Y, while processor B performs two stores to the same locations. If the load of X is slow due to a cache miss, the out-of-order CPU will issue the next load (and subsequent operations) ahead of schedule. Suppose that, while the load of X is pending, processor B signals through a snoop request that the values of X and Y have changed. In this scenario, the memory ordering is A1-B0-B1-A0, which is not allowed according to the Total-Store-Order memory model. As two loads cannot be reordered, r1=1, r2=0 is an illegal result. Thus, processor A has no choice other than to flush its pipeline and re-issue the load of Y in the correct order. This MO machine clear is needed for every inconsistent speculation on the memory ordering. It implements speculation behavior originally proposed by Gharachorloo et al. [27], with the advantage that strict memory order principles can co-exist with aggressive out-of-order scheduling.

Notice that, in the previous example, the store on X is not even necessary to cause a memory ordering violation, since A1-B1-A0 is still an invalid order. Counterintuitively, the memory ordering violation disappears if the load on X is not performed, as A1-B0-B1 is a perfectly valid order.
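The three orders discussed above can be checked mechanically against the rule the argument relies on: operations of the same processor must appear in program order. The sketch below encodes only that rule. It is stricter than Total-Store-Order in general (TSO lets a load pass an earlier store to a different address), which does not matter here because processor A only loads and processor B only stores. The type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* One memory operation, e.g. {'A', 1} for A1 in Figure 3. */
typedef struct {
    char processor; /* 'A' or 'B' */
    int index;      /* position in that processor's program order */
} mem_op;

/* True if every processor's operations appear in increasing program order. */
bool respects_program_order(const mem_op *order, size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (order[i].processor == order[j].processor &&
                order[i].index > order[j].index) {
                return false; /* a later operation was performed first */
            }
        }
    }
    return true;
}
```

A1-B0-B1-A0 and A1-B1-A0 are rejected because A1 precedes A0, while A1-B0-B1 is accepted once the load of X (A0) is dropped.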

To reverse engineer the memory ordering handling behavior on Intel CPUs, we used analysis code exemplified by Algorithm 2. Our code mimics the scenario of Figure 3, with loads/stores from two threads racing against each other, but relies on a FLUSH + RELOAD [81] covert channel to observe the loaded value of Y. The synchronization through the lock variable ensures the desired (problematic) proximity and ordering of the memory operations.

As before, in our experiments we observed two different hits in the reload buffer for the loaded value of Y: one for the stale (transient) value and one for the new (architectural) value. For every double hit in the reload buffer, we also measured an increase of MACHINE_CLEARS.MEMORY_ORDERING. We also ran experiments without the load of X. Even though memory ordering violations are no longer possible, since each processor executes a single memory operation and any order is permitted, we still observed MO machine clears. The reason is that Intel CPUs seem to resort to a simple but conservative approach to memory ordering violation detection, where the CPU initiates a MO machine clear when a snoop request from another processor matches the source of any load operation in the pipeline. We obtained the same results with the two threads running across physical or logical (hyperthreaded) cores. Similarly, the results are unchanged if the matching store and load are performed on different addresses: the only requirement we observed is that the memory operations need to refer to the same cache line. Overall, our results confirm the presence of a transient execution window and the ability of a thread to trigger a transient execution path in another thread by simply dirtying a cache line used in a ready-to-commit load. Exploitation-wise, abusing this type of MC is non-trivial, due to the strict synchronization requirements and the difficulty of controlling pending stale data.
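The conservative detection rule inferred above (a snoop that hits the cache line of any in-flight load triggers a clear) can be sketched as follows. The 64-byte line size is an assumption that holds for the x86 CPUs evaluated here, and the function names are hypothetical. This models the observed behavior only, not the actual hardware logic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u /* assumed cache line size in bytes */

/* Cache line number that contains `addr`. */
uintptr_t cache_line_of(uintptr_t addr) {
    return addr / CACHE_LINE_SIZE;
}

/* Conservative check: does a snooped address fall in the cache line of
 * any load currently in the pipeline? */
bool snoop_triggers_mo_clear(uintptr_t snoop_addr,
                             const uintptr_t *inflight_loads, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (cache_line_of(inflight_loads[i]) == cache_line_of(snoop_addr)) {
            return true;
        }
    }
    return false;
}
```

Note that the check compares cache lines rather than exact addresses, which reflects the observation that a store to a different address in the same line is enough.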

8 Memory Disambiguation Machine Clear

As suggested by the MACHINE_CLEARS.DISAMBIGUATION counter description, memory disambiguation (MD) mispredictions are handled via machine clears. Our experiments confirmed this behavior by observing matching increases of MACHINE_CLEARS.COUNT. Moreover, we observed no changes in microcode assist counters, suggesting mispredictions are resolved entirely in hardware. In case of a misprediction, a stale value is passed to subsequent loads, a primitive that was previously used to leak secret information with Spectre Variant 4, or Speculative Store Bypass (SSB) [55, 82].

Different from the other machine clears, MD machine clears trigger only on address aliasing mispredictions, when the CPU wrongly predicts that a load does not alias a preceding store instruction and can be hoisted. Similar to branch prediction, generating a MD-based transient execution window requires mistraining the underlying predictor. The latter has complex, undocumented behavior, which has been partially reverse engineered [18]. Due to space constraints, we discuss our full reverse engineering strategy in Appendix A. Our results show that executing the same load 64+15 times with non-aliasing stores is sufficient to ensure the next prediction is "no-alias" (and thus the next load is hoisted). Upon reaching the hoisting prediction, one can perform the load with an aliasing store to trigger the transient path exposed to the incorrect, stale value.

Finally, we experimentally verified that 4k aliasing [50] does not cause any machine clear but only incurs a further time penalty in case of wrong aliasing predictions. More details on 4k aliasing results can be found in Appendix A.

9 Other Types of Machine Clear

AVX vmaskmov. The AVX vmaskmov instructions perform conditional packed load and store operations depending on a bitmask. For example, a vmaskmovpd load may read 4 packed doubles from memory depending on a 4-bit mask: each double will be read only if the corresponding mask bit is set, and the others will be assigned the value 0.

According to the Intel Optimization Reference Manual [36], the instruction does not generate an exception in the face of invalid addresses, provided they are masked out. However, our experiments confirm that it does incur a machine clear (and a microcode assist) when accessing an invalid address (e.g., with the present bit set to 0) with a loading mask set to zero (i.e., no bytes should be loaded), to check whether the bytes in the invalid address have the corresponding mask bits set or not. We speculate the special handling is needed because the permission check is very complex, especially in the absence of memory alignment requirements.
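The architectural semantics of the masked load described above can be captured by a scalar model. This is a sketch, not the instruction itself: the real vmaskmovpd takes its mask from the most significant bit of each 64-bit mask element, whereas this hypothetical helper uses one bit of an integer per element for brevity.

```c
#include <stddef.h>

/* Scalar model of a 4-element masked load: bit i of `mask` selects element i. */
void masked_load4(const double *src, unsigned mask, double dst[4]) {
    for (size_t i = 0; i < 4; i++) {
        /* Masked-out elements are never read and are set to 0. */
        dst[i] = ((mask >> i) & 1u) ? src[i] : 0.0;
    }
}
```

With an all-zero mask, the model never dereferences `src`, so even an invalid pointer causes no architectural access. That is the case in which the experiments nonetheless observe a microcode assist and a machine clear.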

In our experiments, vmaskmov instructions with all-zero masks and invalid addresses increment the OTHER_ASSISTS.ANY and MACHINE_CLEARS.COUNT counters, confirming that the instruction triggers a machine clear. However, the resulting transient execution window seems short-lived or absent, as we were unable to observe cache or other microarchitectural side effects of the execution.

Exceptions. The MACHINE_CLEARS.PAGE_FAULT counter, present on older microarchitectures, confirms page faults are another cause of machine clears. Indeed, we verified that each instruction incurring a page fault or any other exception, such as "Division by zero", increments the MACHINE_CLEARS.COUNT counter. We also verified that exceptions do not trigger microcode assists and that software interrupts (traps) do not trigger machine clears. Indeed, the Intel documentation [37] specifies that instructions following a trap may be fetched but not speculatively executed. Transient execution windows originating from exceptions, and page faults in particular, have been extensively used in prior work, with a faulty load instruction also used as the trigger to leak information [7, 47, 63, 75].

Hardware interrupts. Although hardware interrupts are an undocumented cause of machine clears, our experiments with APIC timer interrupts showed they do increment the MACHINE_CLEARS.COUNT counter. While this confirms hardware interrupts are another root cause of transient execution, the asynchronous nature of these events yields a less than ideal vector for transient execution attacks. Nonetheless, hardware interrupts play an important role in other classes of microarchitectural attacks [73].

Microcode assists. Microcode assists require a pipeline flush to insert the required µOps in the frontend and represent a subclass of machine clears (and thus a root cause of transient execution) for cases where a fast path in hardware is not available. In this paper, we detailed the behavior of floating-point and vmaskmov assists. Prior work has discussed different situations requiring microcode assists, such as those related to page table entry Access/Dirty bits, typically in the context of assisted loads used as the trigger to leak information [8, 63, 75]. AVX-to-SSE transitions [36] represent another microcode assist, which, based on our experiments, we could not observe on modern Intel CPUs. Remaining known microcode assists, such as access control of memory pages belonging to SGX secure enclaves and Precise Event Based Sampling (PEBS), are not presented in this work, since they are already studied [14] or only related to privileged performance profiling, respectively.

Figure 4: Top plot: transient window size vs. mechanism. Each bar reports the number of transient loads that complete and leave a microarchitectural trace. Bottom plot: leakage rate vs. mechanism for a simple Spectre Bounds-Check-Bypass attack and a 1-bit FLUSH + RELOAD (F+R) cache covert channel. F+R is the leakage rate upper bound (covert channel loop only, no actual transient window or attack). Mechanisms on the x-axis (Transient Execution Management): F+R, TSX, BHT, FAULT, SMC, XMC, FP, MD, MO. Y-axes: number of transient loads (top) and leakage rate in Mb/s (bottom). CPUs: Intel Core i7-10700K, Intel Xeon Silver 4214, Intel Core i9-9900K, Intel Core i7-7700K, AMD Ryzen 5 5600X, AMD Ryzen Threadripper 2990WX, AMD Ryzen 7 2700X.

10 Transient Execution Capabilities

Transient execution attacks rely on crafting a transient window to issue instructions that are never retired. For this purpose, state-of-the-art attacks traditionally rely on mechanisms based on root causes such as branch mispredictions (BHT) [41, 43], faulty loads (Fault) [8, 47, 63, 75], or memory transaction aborts (TSX) [8, 47, 59, 63, 75]. However, the different machine clears discussed in this paper provide an attacker with the exact same capabilities.

To compare the capabilities of machine clear-based transient windows with those of more traditional mechanisms, we implemented a framework able to run arbitrary attacker-controlled code in a window generated by a mechanism of choice. We now evaluate our framework on recent processors (with all the microcode updates and mitigations enabled) to compare the transient window size and leakage rate of the different mechanisms.

10.1 Transient Window Size

The transient window size provides an indication of the number of operations an attacker can issue on a transient path before the results are squashed. Larger windows can, in principle, host more complex attacks. Using a classic FLUSH + RELOAD cache covert channel (F+R) as a reference, we measure the window size by counting how many transient loads can complete and hit entries in a designated F+R buffer. Figure 4 (top) presents our results.

As shown in the figure, the window size varies greatly across the different mechanisms. Broadly speaking, mechanisms that have a higher detection cost, such as XMC and MO Machine Clear, yield larger window sizes. Not surprisingly, branch mispredictions yield the largest window sizes, as we can significantly slow down the branch resolution process (i.e., causing cache misses) and delay detection. FP, on the other hand, yields the shortest windows, suggesting that denormal numbers are efficiently detected inside the FPU. Our results also show that, while our framework was designed for Intel processors, similar if not better results can usually be obtained on AMD processors (where we use the same conservative training code for branch/memory prediction). This shows that both CPU families share a similar implementation in all cases except for MD, where the mistraining pattern we used is not valid for pre-Zen3 architectures.

10.2 Leakage Rate

To compare the leakage rates for the different transient execution mechanisms, we transiently read and repeatedly leak data from a large memory region through a classic F+R cache covert channel. We report the resulting leakage rates, as the number of bits successfully leaked per second, across different microarchitectures, using a 1-bit covert channel to highlight the time complexity of each mechanism. We consider data to be successfully leaked after a single correct hit in the reload buffer. In case of a miss for a particular value, we restart the leak for the same value until we get a hit (or until we get 100 misses in a row).
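The retry policy and the rate computation described above can be sketched as follows. The callback type, the function names, and the countdown stub are hypothetical. A real measurement would place the transient access and the FLUSH + RELOAD probe inside the callback, which this sketch does not attempt.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: one leak attempt; returns true on a reload-buffer hit. */
typedef bool (*leak_attempt_fn)(void *ctx);

/* Retry one value until a hit or `max_misses` consecutive misses.
 * Returns the number of attempts made, or 0 if the value was given up. */
unsigned leak_value(leak_attempt_fn attempt, void *ctx, unsigned max_misses) {
    for (unsigned tries = 1; tries <= max_misses; tries++) {
        if (attempt(ctx)) {
            return tries;
        }
    }
    return 0;
}

/* Leakage rate in bits per second. */
double leakage_rate_bps(unsigned long bits_leaked, double seconds) {
    return seconds > 0.0 ? (double)bits_leaked / seconds : 0.0;
}

/* Stub for illustration: misses until its countdown reaches zero, then hits. */
bool hit_after_countdown(void *ctx) {
    unsigned *remaining = ctx;
    if (*remaining == 0) return true;
    (*remaining)--;
    return false;
}
```

Calling `leak_value(attempt, ctx, 100)` mirrors the limit of 100 misses in a row. Every retry adds to the elapsed time, so an unreliable mechanism lowers the rate returned by `leakage_rate_bps` even when each value is eventually leaked.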

As shown in Figure 4 (bottom), different Intel and AMD microarchitectures generally yield similar leakage rates, with some variations. For instance, FP MC offers better leakage rates on Intel. This difference stems from the different performance impact of the corresponding machine clears on Intel vs. AMD microarchitectures.

[Figure 5 plot. x-axis: Transient Execution Management (F+R, TSX, BHT, FP, SMC, XMC, MO, MD, FAULT); y-axis: Leakage Rate [Mb/s], 0 to 4; legend: F+R granularity [bit]: 1, 4, 8.]

Figure 5: Leakage rate vs. mechanism with a 1-bit, 4-bit, or 8-bit FLUSH + RELOAD cache covert channel (Intel Core i9).

As shown in Figure 5, the leakage rate instead varies greatly across the different mechanisms, and so does the optimal covert channel bitwidth. Indeed, while existing attacks typically rely on 8-bit covert channels, our results suggest 1-bit or 4-bit channels can be much more efficient, depending on the specific mechanism. Roughly speaking, optimal leakage rates can be obtained by balancing the time complexity (and hence bitwidth) of the covert channel with that of the mechanism. For example, FP is a lightweight and reliable mechanism, hence using a comparably fast and narrow 1-bit covert channel is beneficial. In contrast, MD requires a time-consuming predictor training phase between leak iterations, and leaking more bits per iteration with a 4-bit covert channel is more efficient. Interestingly, a classic 8-bit covert channel yields consistently worse, yet comparable, leakage rates across all the mechanisms, since F+R dominates the execution time.
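The balance between mechanism cost and channel bitwidth can be illustrated with a toy cost model. The model is an assumption made for illustration only: it charges one fixed mechanism cost per leak iteration plus one flush-and-reload probe for each of the 2^k entries of a k-bit channel. The function names and the cost units are hypothetical, and none of the numbers are measurements from the paper.

```cpp
#include <cassert>
#include <cmath>

// Toy model (an assumption, not a measurement): bits leaked per unit of time
// when one iteration pays `mechanism_cost` plus one probe per channel entry.
inline double model_rate(int bits, double mechanism_cost, double probe_cost) {
    const double entries = std::ldexp(1.0, bits);  // 2^bits reload-buffer entries
    return bits / (mechanism_cost + entries * probe_cost);
}

// Pick the best of the three bitwidths compared in Figure 5.
inline int best_bitwidth(double mechanism_cost, double probe_cost) {
    const int candidates[] = {1, 4, 8};
    int best = candidates[0];
    for (int b : candidates) {
        if (model_rate(b, mechanism_cost, probe_cost) >
            model_rate(best, mechanism_cost, probe_cost)) {
            best = b;
        }
    }
    return best;
}
```

In this model a cheap mechanism favors the 1-bit channel, while a mechanism costing about 100 probes per iteration favors 4 bits. For very large mechanism costs the model would favor 8 bits, which Figure 5 does not show, so it should be read only as an illustration of the trade-off.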

Our results show that only two mechanisms (TSX and FP) are close to the maximum theoretical leakage rate of pure F+R. Moreover, FP performs as efficiently as TSX but, unlike TSX, is available on both Intel and AMD, is always enabled, and can be used from managed code (e.g., JavaScript). BHT, on the other hand, yields the worst leakage rates due to the inefficient training-based transient window. The BHT leakage rate can be improved if a tailored mistraining sequence is used, as in MD (Appendix A). Overall, our results show that machine clear-based windows achieve comparable and, in many cases (e.g., FP), better leakage rates compared to traditional mechanisms. Moreover, many machine clears eliminate the need for mistraining, which, other than resulting in efficient leakage rates, can escape existing pattern-based mitigations and disabled hardware extensions (e.g., Intel TSX).

11 Attack Primitives

Building on our reverse engineering results and focusing on the unexplored SMC and FP machine clears, we now present two new transient execution attack primitives and analyze their security implications. We also present an end-to-end FPVI exploit disclosing arbitrary memory in Firefox. Later, we discuss mitigations.

11.1 Speculative Code Store Bypass (SCSB)

Our first attack primitive, Speculative Code Store Bypass (SCSB), allows an attacker to execute stale, controlled code in a transient execution window originated by an SMC machine clear. Since the primitive relies on SMC, its primary applicability is on JIT (e.g., JavaScript) engines running attacker-controlled code, although OS kernels and hypervisors storing code pages and allowing their execution without first issuing a serializing instruction are also potentially affected.

Figure 6: SCSB primitive example, where the instruction pointer is pointing at the bold code blocks. The g code is freed JIT'ed code (of some g function) under the attacker's control. (1) Force the engine to JIT and execute the code of function f, causing desynchronization of the code and data views. (2) Execute stale code and SMC MC. (3) After the SMC MC, code and data view coherence is restored and the new code is executed.

As exemplified in Figure 6, the operations of the primitive can be broken down into three steps: (1) the JIT engine compiles a function f, storing the generated code into a JIT code cache region previously used by a (now-stale) version of function g; (2) the JIT engine jumps to the newly generated code for the function f but, due to the temporary desynchronization between the code and data views of the CPU, this causes transient execution of the stale code of g until the SMC machine clear is processed; (3) after the pipeline flush, the code and data views are resynchronized and the CPU restarts the execution of the correct code of f. For exploitation, the attacker needs to (i) massage the JIT code cache allocator to reuse a freed region with a target gadget g of choice and (ii) force the JIT engine to generate and execute new (f) code in such a region, enabling transient, out-of-context execution of the gadget and spilling secrets into the microarchitectural state.

Our primitive bears similarities with both transient and architectural primitives used in prior attacks. On the transient front, our primitive is conceptually similar to a Speculative-Store-Bypass (SSB) primitive [55, 82], but can transiently execute stale code rather than reading stale data. However, interestingly, the underlying causes of the two primitives are quite different (MD misprediction vs. SMC machine clear). On the architectural front, our primitive mimics classic Use-After-Free (UAF) exploitation on the JIT code cache, also known as Return-After-Free (RAF) in the hackers community [19, 25]. An example is CVE-2018-0946, where a use-after-free vulnerability can be exploited to force the Chakra JS engine to erroneously execute freed (attacker-controlled) JIT code, resulting in arbitrary code execution after massaging the right gadget into the JIT code cache [24].

Figure 7: Coding options suggested by the Intel Architectures Software Developer's Manuals to handle SMC and XMC execution. Option 1 describes the exact steps required by our Speculative Code Store Bypass attack primitive, potentially resulting in exploitable gadgets.

Indeed, at a high level, SCSB yields a transient use-after-free primitive on a JIT code cache, with exploitation properties similar to its architectural counterpart. However, there are some differences due to the transient nature of SCSB. First, we need to find an out-of-context gadget to transiently leak data, rather than architecturally execute arbitrary code. In JavaScript engines, similar gadgets have already been exploited by ret2spec [48], escalating out-of-context transient execution of valid code to type confusion, arbitrary reads, and ultimately a secret-dependent load to transmit the value.

Second, we need to target short-lived JIT code cache updates, so that the newly generated code is immediately executed, fitting the target gadget in the resulting transient execution window. Interestingly, executing JIT'ed code is the next step after JIT code generation mentioned in the Intel manual (Figure 7, Option 1), suggesting that developers who follow such directives ad litteram can easily introduce such gadgets. Moreover, modern JavaScript compilers feature a multi-stage optimization pipeline [12, 54], and short-lived JIT code cache updates are favored for just-in-time (re)optimization. Indeed, we found test code in V8 to specifically test such updates and verify instruction/data cache coherence [70].

Finally, we need to ensure JIT code cache updates are not accompanied by barriers that force immediate synchronization of the code and data views. We analyzed the code of the popular SpiderMonkey and V8 JavaScript engines and verified that the functions called upon JIT code cache updates to synchronize instruction and data caches (Listing 2 and Listing 3) are always empty on x86 (as expected, given Intel's primary recommendation in Figure 7).

Listing 2: Chromium instruction cache flush (chromium/src/v8/src/codegen/x64/cpu-x64.cc)

    void CpuFeatures::FlushICache(void* start, size_t size) {
      // No need to flush the instruction cache on Intel
    }

Listing 3: Firefox instruction cache flush (mozilla-unified/js/src/jit/FlushICache.h)

    inline void FlushICache(void* code, size_t size,
                            bool codeIsThreadLocal = true) {
      // No-op. Code and data caches are coherent on x86 and x64.
    }

Overall, while the exploitation is far from trivial (i.e., having to address the challenges of traditional use-after-free exploitation as well as transient execution exploitation in the browser), we believe SCSB expands the attack surface of transient execution attacks in the browser. We found a number of candidate SCSB gadgets in real-world code, but none of them was ultimately exploitable due to the coincidental presence of some serializing instruction preventing stale code execution. Nonetheless, mitigations are needed to enforce security-by-design. Luckily, as shown later, SCSB is amenable to a practical and efficient mitigation (i.e., at a similar cost as faced by non-x86 architectures). The implementation/performance cost for mitigation is even lower than for its sibling SSB primitive, whose mitigation has been deployed in practice even in the absence of known practical exploits [55].

In Table 3, we show that all tested Intel and AMD processors are affected by SCSB. In contrast, ARM is not vulnerable, since SMC updates require explicit software barriers.

11.2 Floating Point Value Injection (FPVI)

Our second attack primitive, Floating Point Value Injection (FPVI), allows an attacker to inject arbitrary values into a transient execution window originated by an FP machine clear. As exemplified in Figure 8, the operations of the primitive can be broken down into four steps: (1) the attacker triggers the execution of a gadget starting with a denormal FP operation in the victim application, with the x and y operands under the attacker's control; (2) the transient z result of the operation is processed by the subsequent gadget instructions, leaving a microarchitectural trace; (3) the CPU detects the error condition (i.e., wrong result of a denormal operation), triggering a machine clear and thus a pipeline flush; (4) the CPU re-executes the entire gadget with the correct architectural z result. For exploitation, the attacker needs to (i) massage the x and y operands to inject the desired z value into the victim transient path and (ii) target a victim gadget so that the injected value yields a security-sensitive trace, which can be observed with FLUSH + RELOAD or other microarchitectural attacks.

Our primitive bears similarities with Load-Value-Injection (LVI) [71], since both allow attackers to inject controlled values into transient execution. Moreover, both primitives require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues, and hence the triggering conditions, are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. Indeed, while it is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild (much like Spectre [41], LVI [71], etc., this would require gadget scanners, subject of orthogonal research [56, 58]), we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all the prior Mozilla Firefox mitigations against transient execution attacks [52].

Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x,y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE-754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case, a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.
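The tag/payload split described above can be sketched with a few bit operations. The 17/47-bit layout and the 0x1fff0 threshold come from the text; the helper names are hypothetical and this is not SpiderMonkey's actual code. Note that the 0xfff88 and 0xfffb0 values quoted in the text are the leading hex digits of the 64-bit word, so shifting such a word right by 47 bits yields the corresponding 17-bit tag (0x1fff6 for the string case).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Layout from the text: 17-bit tag above a 47-bit payload.
constexpr unsigned kTagShift = 47;
constexpr std::uint64_t kTagMaxDouble = 0x1fff0;  // JSVAL_TAG_MAX_DOUBLE
constexpr std::uint64_t kPayloadMask = (std::uint64_t{1} << kTagShift) - 1;

// Raw bit pattern of a double (memcpy avoids undefined type punning).
inline std::uint64_t bits_of(double d) {
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

// Hypothetical helpers: extract the tag and the payload of a boxed value.
inline std::uint64_t tag_of(std::uint64_t v) { return v >> kTagShift; }
inline std::uint64_t payload_of(std::uint64_t v) { return v & kPayloadMask; }

// A value is treated as a plain double when its tag does not exceed the threshold.
inline bool is_plain_double(std::uint64_t v) { return tag_of(v) <= kTagMaxDouble; }
```

Applied to the transient value 0xfffb0deadbeef000 used in the example, `payload_of` returns 0xdeadbeef000, which is the address the type-confused code would treat as a string pointer.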

To verify the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see gadget in Figure 8).
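The architectural half of this example is easy to check in isolation. The sketch below only reproduces the committed result of the division; the transient result is never architecturally visible and cannot be observed this way. The helper names are hypothetical, and the code assumes a standard IEEE-754 double.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Build a double from its raw 64-bit pattern.
inline double from_bits(std::uint64_t u) {
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}

// Architectural (committed) result of the division from the text.
// volatile keeps the compiler from folding the division at compile time.
inline double architectural_quotient() {
    volatile double x = from_bits(0xc000e8b2c9755600ULL);  // normal, negative, magnitude slightly above 2
    volatile double y = from_bits(0x0004000000000000ULL);  // denormal, 2^-1024
    return x / y;
}
```

Dividing a value of magnitude around 2 by 2^-1024 overflows the double range, which is why the committed result is -Infinity while the transient path sees a different bit pattern.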

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the length attribute of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 52 bits set to zero, a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables, such as string pointers, can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on an FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0 and, by wrapping this primitive with a variant of an EVICT + RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

Table 3: Tested processors.

Processor                               Microcode    SCSB vulnerable   FPVI vulnerable
Intel Core i7-10700K                    0xe0         ✓                 ✓
Intel Xeon Silver 4214                  0x500001c    ✓                 ✓
Intel Core i9-9900K                     0xde         ✓                 ✓
Intel Core i7-7700K                     0xca         ✓                 ✓
AMD Ryzen 5 5600X                       0xa201009    ✓                 ✓ †
AMD Ryzen 2990WX                        0x800820b    ✓                 ✓ †
AMD Ryzen 7 2700X                       0x800820b    ✓                 ✓ †
Broadcom BCM2711 Cortex-A72 (ARM v8)    -            ✗ §               ✗

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified that any serialization instruction, such as lfence, mfence, or cpuid, placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. Such a function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows JavaScript execution time is heavily dominated by JIT code generation/execution, and code updates have negligible impact. Our results show this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.
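A minimal sketch of such a one-line change is shown below. It is not the actual V8 or SpiderMonkey patch: `FlushICacheHardened` is a hypothetical name, the code assumes GCC or Clang, and the non-x86 branch simply defers to the compiler's `__builtin___clear_cache` so that the example stays portable.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_lfence
#endif

// Hypothetical hardened variant of an otherwise empty x86 FlushICache hook.
// It is meant to be called right after the stores that update JIT code.
inline void FlushICacheHardened(void* start, std::size_t size) {
#if defined(__SSE2__)
    // x86 with SSE2: code and data caches are coherent, so no flush is
    // needed; the fence is the barrier the text reports as sufficient.
    (void)start;
    (void)size;
    _mm_lfence();
#else
    // Other targets: use the compiler's cache maintenance builtin
    // (a GCC/Clang extension).
    char* begin = static_cast<char*>(start);
    __builtin___clear_cache(begin, begin + size);
#endif
}
```

Because the hook only runs on code updates, which are rare compared to executing the generated code, the added fence sits off the hot path.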

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
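The conditional masking idea can be sketched at the bit level as follows. This is an illustration under assumptions, not the SpiderMonkey patch: the function and constant names are hypothetical, the threshold is the JSVAL_TAG_MAX_DOUBLE value quoted earlier, and the replacement value is the x86 default QNaN encoding. A real mitigation must also ensure the select is emitted as a cmov rather than a branch, which plain C++ cannot guarantee.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMaxDoubleTag = 0x1fff0;                   // JSVAL_TAG_MAX_DOUBLE
constexpr std::uint64_t kDefaultQNaNBits = 0xfff8000000000000ULL;  // x86 default QNaN

// If an FP result carries a tag above the double range, replace it with the
// default QNaN so it can never be read as a boxed pointer or integer.
inline std::uint64_t mask_fp_result(std::uint64_t result_bits) {
    const bool looks_boxed = (result_bits >> 47) > kMaxDoubleTag;
    return looks_boxed ? kDefaultQNaNBits : result_bits;
}
```

Legitimate doubles, infinities, and the default QNaN itself pass through unchanged, so the mask only affects bit patterns that an FP operation should never produce architecturally.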

A more general and automated mitigation is for the compiler to place a serializing instruction such as lfence after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable and those resistant to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76]. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by an MDS attacker. For the latter, we limit ourselves to load sinks to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.
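The source/sink reasoning above can be illustrated on a toy straight-line instruction list. This sketch is not the LLVM pass: the `Inst` representation and all names are hypothetical, control flow and calls are ignored, and it assumes that fencing the most recent FP operation in a dependency chain is enough to cover the earlier ones.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

enum class Kind { FpOp, Other, Load, Store, IndirectBranch };

struct Inst {
    Kind kind;
    int dest;                   // value id defined by this instruction, or -1
    std::vector<int> operands;  // value ids read (address for Load/Store, target for branches)
};

// Returns the indices of the FP operations that should be followed by a fence.
inline std::set<std::size_t> fp_ops_to_fence(const std::vector<Inst>& code,
                                             bool mds_vulnerable) {
    std::map<int, std::set<std::size_t>> sources;  // value id -> unfenced FP ops it depends on
    std::set<std::size_t> fenced;
    for (std::size_t i = 0; i < code.size(); ++i) {
        const Inst& inst = code[i];
        // Collect the FP sources flowing into this instruction's operands.
        std::set<std::size_t> reaching;
        for (int v : inst.operands) {
            auto it = sources.find(v);
            if (it != sources.end()) {
                reaching.insert(it->second.begin(), it->second.end());
            }
        }
        // Loads and indirect branches are always sinks; stores only count
        // on MDS-vulnerable systems.
        const bool is_sink = inst.kind == Kind::Load ||
                             inst.kind == Kind::IndirectBranch ||
                             (inst.kind == Kind::Store && mds_vulnerable);
        if (is_sink) {
            fenced.insert(reaching.begin(), reaching.end());
            reaching.clear();  // the flow is cut once its sources are fenced
        }
        if (inst.kind == Kind::FpOp) {
            reaching = {i};  // assumption: the latest FP op stands in for the chain
        }
        if (inst.dest >= 0) {
            sources[inst.dest] = reaching;
        }
    }
    return fenced;
}
```

For a sequence where one FP result feeds a load address and another feeds only a store address, the sketch fences just the first FP operation on an MDS-resistant system and both on an MDS-vulnerable one.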

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that, on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact (up to 280%), despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance impact difference between MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by adopting interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

[Figure 9 plot. x-axis: SPEC FP 2017 (619.lbm_s, 638.imagick_s, 644.nab_s, SPEC17 geomean) and SPEC FP 2006 (433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, SPEC06 geomean); y-axis: Overhead, 0 to 800; legend (LLVM Mitigation): LVI, FPVI (MDS-vulnerable system), FPVI (MDS-resistant system), LVI+FPVI (MDS-vulnerable system).]

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0.

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, an MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation, due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It is worth noting that, in Figure 10, there are two separate predictors subclasses under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74–76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a HW interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less than ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived by likely invariants [15] used by the CPU. Such invariants commonly hold but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control-flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them with the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it with the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd, Daniel Genkin, and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. HyperDegrade: From GHz to MHz effective CPU frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. SMoTherSpectre: Exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. KASLR: Break it, fix it, repeat. In ACM ASIA CCS, 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security'19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. SgxPectre: Stealing Intel secrets from SGX enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS, 2018.

[16] Christopher Domas. Breaking the x86 ISA. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.

[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. BranchScope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: A quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts. October 1, 2013. US Patent 8,549,263.


[45] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.

[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.

[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.

[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT'19.

[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. MemJam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358.

[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111.

[53] Mozilla. SpiderMonkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h.

[54] Mozilla. SpiderMonkey IonMonkey documentation.

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. SpecFuzz: Bringing Spectre-type vulnerabilities to the surface. In USENIX Security'20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the CPU on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. SpecTaint: Speculative taint analysis for discovering Spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. SoK: In search of lost time: A review of JavaScript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: A fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers. 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in JavaScript. In FC, IFCA'17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. NetSpectre: Read arbitrary memory over network. In ESORICS'19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. LazyFP: Leaking FPU register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. MeltdownPrime and SpectrePrime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. SoK: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8.

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. LVI: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary CPU interrupt logic. In ACM CCS, 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-Step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. SGAxe: How SGX fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. CacheOut: Leaking data on Intel CPUs via cache evictions. arXiv preprint.

[77] WebKit. Browserbench. https://browserbench.org.

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+Reload: A high resolution, low noise, L3 cache side-channel attack. In USENIX Security'14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: Speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                      ; rdi: store addr, rsi: load addr
    %rep 10                     ; Trick to delay the store address
        imul rdi, 1
    %endrep
        mov DWORD [rdi], 0x42   ; Store
        mov eax, DWORD [rsi]    ; Load
    %rep 10                     ; Pronounce load timing
        imul eax, 1
    %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturating counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that saturating counter is set to 0
    for (i = 0; i < 100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for (i = 100; i < 120; i++) st_ld(mem, mem + 64);
    // Trigger a memory ordering machine clear
    for (i = 120; i < 130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter, incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
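The per-address counter described above can be illustrated with a small software model. The following is a minimal sketch under the assumptions stated in the text (a 4-bit counter that increments on non-aliasing loads, resets on aliasing loads, and predicts hoisting only when saturated); the names md_predictor, md_step, and md_replay are hypothetical, and the code models the inferred behavior rather than Intel's actual hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define MD_COUNTER_MAX 15u  /* assumption from the text: 4-bit counter */

typedef struct {
    uint8_t counter;        /* 0..15, one such counter per predictor slot */
} md_predictor;

/* Hoisting is predicted only when the counter is saturated. */
bool md_predict_hoist(const md_predictor *p) {
    return p->counter == MD_COUNTER_MAX;
}

/* Models one st_ld call. Returns true when the load was hoisted although
 * it aliased the store, i.e. the case that ends in a machine clear. */
bool md_step(md_predictor *p, bool aliased) {
    bool machine_clear = md_predict_hoist(p) && aliased;
    if (aliased)
        p->counter = 0;                     /* reset on aliasing */
    else if (p->counter < MD_COUNTER_MAX)
        p->counter++;                       /* saturating increment */
    return machine_clear;
}

/* Replays the three phases of Listing 5 and counts machine clears. */
int md_replay(int n_reset, int n_noalias, int n_alias) {
    md_predictor p = { 0 };
    int clears = 0;
    for (int i = 0; i < n_reset; i++)   clears += md_step(&p, true);
    for (int i = 0; i < n_noalias; i++) clears += md_step(&p, false);
    for (int i = 0; i < n_alias; i++)   clears += md_step(&p, true);
    return clears;
}
```

In this model, md_replay(100, 20, 10) mirrors the loop bounds of Listing 5 and reports a single machine clear, while shortening the training phase to 14 non-aliasing loads reports none, since the counter never saturates.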

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange bar: memory ordering machine clear observed.

Listing 6: Snippet observing activation of never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            st_ld(mem, mem + 64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters), and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe whether two unrelated loads in these functions are able to influence each other when varying the number of nops. In our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256·k bytes apart in memory (k ∈ ℕ). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
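To make the indexing scheme concrete, the sketch below computes the predictor slot from a load's instruction pointer using the lowest 8 bits, as observed in the experiments above. The function names and the helper that searches for a colliding address are hypothetical and only illustrate the arithmetic; they are not part of the original experiments.

```c
#include <stdbool.h>
#include <stdint.h>

#define MD_INDEX_BITS  8u                      /* observed: 256 predictors */
#define MD_NUM_ENTRIES (1u << MD_INDEX_BITS)

/* Predictor slot selected by the low 8 bits of the load's instruction pointer. */
unsigned md_index(uint64_t load_ip) {
    return (unsigned)(load_ip & (MD_NUM_ENTRIES - 1u));
}

/* Two loads share a counter when their addresses are 256*k bytes apart. */
bool md_loads_collide(uint64_t ip_a, uint64_t ip_b) {
    return md_index(ip_a) == md_index(ip_b);
}

/* Lowest address at or above `base` that maps to the victim's slot
 * (hypothetical helper: where an attacker-controlled load could be placed). */
uint64_t md_colliding_ip(uint64_t victim_ip, uint64_t base) {
    uint64_t candidate =
        (base & ~(uint64_t)(MD_NUM_ENTRIES - 1u)) | md_index(victim_ip);
    if (candidate < base)
        candidate += MD_NUM_ENTRIES;
    return candidate;
}
```

For example, a victim load at 0x401234 maps to slot 0x34, so an attacker load placed at 0x7034 (or any address 256·k bytes away) would train the same counter under this model.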

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that after 4 machine clears the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads, with ticks at 15, 79, 143, 207, 271, 335, 399; y-axis: total measured machine clears, from 1 to 4).

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0 and disabled otherwise. The flush counter is decremented every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measure how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
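The interplay between the per-address counter and the watchdog can be sketched as a small state machine. The model below is an illustration built only from the numbers reported above (counter saturating at 15, at most 4 consecutive machine clears, one flush-counter increment per 64 correct or would-be correct hoisting predictions). The representation of the flush counter as an integer in the range -1..3, the decision not to reset the credit count on aliasing loads, and all identifiers are assumptions made for the sketch.

```c
#include <stdbool.h>

/* Assumed ranges: per-address counter 0..15, flush counter -1..3. */
enum { CTR_MAX = 15, FLUSH_MAX = 3, FLUSH_MIN = -1, HOISTS_PER_CREDIT = 64 };

typedef struct {
    int counter;  /* per-address saturating counter */
    int flush;    /* watchdog flush counter; below 0 means never-hoisting */
    int credit;   /* correct (or would-be correct) hoisting predictions */
} md_model;

/* Models one st_ld call; returns true when it ends in a machine clear. */
bool md_model_step(md_model *m, bool aliased) {
    bool predict_hoist = (m->counter == CTR_MAX);
    bool hoisted = predict_hoist && m->flush >= 0;   /* watchdog may veto */
    bool machine_clear = hoisted && aliased;

    if (machine_clear && m->flush > FLUSH_MIN)
        m->flush--;
    if (predict_hoist && !aliased && ++m->credit == HOISTS_PER_CREDIT) {
        m->credit = 0;
        if (m->flush < FLUSH_MAX)
            m->flush++;
    }
    if (aliased)
        m->counter = 0;
    else if (m->counter < CTR_MAX)
        m->counter++;
    return machine_clear;
}

/* Non-aliasing loads needed to go from never-hoisting to actual hoisting. */
int md_loads_to_leave_never_hoisting(void) {
    md_model m = { 0, FLUSH_MIN, 0 };
    int n = 0;
    while (!(m.counter == CTR_MAX && m.flush >= 0)) {
        md_model_step(&m, false);
        n++;
    }
    return n;
}

/* Replays Listing 6: 19 non-aliasing loads followed by one aliasing load. */
int md_replay_listing6(int rounds) {
    md_model m = { 0, FLUSH_MAX, 0 };
    int clears = 0;
    for (int i = 0; i < rounds; i++) {
        for (int j = 0; j < 19; j++)
            md_model_step(&m, false);
        clears += md_model_step(&m, true);
    }
    return clears;
}
```

Under these assumptions, the model reproduces the two headline observations: md_loads_to_leave_never_hoisting() yields 15+64=79, and replaying the 10 rounds of Listing 6 yields 4 machine clears before hoisting is disabled.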

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for (i = 0; i < 10; i++) st_ld(mem + 0x1000, mem + 0x1000);
    for (i = 10; i < 20; i++) st_ld(mem + 0x2000, mem + 0x2000);
    // Cause 4k aliasing
    for (i = 20; i < 40; i++) st_ld(mem + 0x1000, mem + 0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Figure 15: Memory disambiguation unit, simplified view.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value, and in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load; (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
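The 4k aliasing condition discussed above reduces to a comparison of page offsets. The following is a minimal sketch, assuming the match is decided purely on the low 12 address bits and ignoring access sizes and partial overlaps; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK 0xFFFu  /* low 12 bits: offset within a 4 KB page */

/* True when a store and a later load touch different addresses that share
 * the same 4 KB page offset, i.e. a false store-to-load forwarding match. */
bool is_4k_alias(uintptr_t store_addr, uintptr_t load_addr) {
    return store_addr != load_addr &&
           ((store_addr ^ load_addr) & PAGE_OFFSET_MASK) == 0;
}
```

With this predicate, the last loop of Listing 7 (store to mem+0x1000, load from mem+0x2000) is a 4k-aliasing pair, whereas the first two loops are true aliases and the pairs used for training (mem, mem+64) are neither.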

B Root-Causes Description Table

Acronym         Description
BHT             Branch History Table
BTB             Branch Target Buffer
RSB             Return Stack Buffer
MD              Memory Disambiguation Unit
NM              Device Not Available Exception
DE              Divide-by-Zero Exception
UD              Invalid Opcode Exception
GP              General Protection Fault
AC              Alignment Check Exception
SS              Stack Segment Exception
PF              Page Fault
PF - US Bit     Page Table Entry User/Supervisor Bit PF
PF - RW Bit     Page Table Entry Read/Write Bit PF
PF - P Bit      Page Table Entry Present Bit PF
PF - PKU        Page Table Entry Protection Keys PF
BR              Bound Range Exceeded Exception
FP              Floating Point Assist
SMC             Self-Modifying Code
XMC             Cross-Modifying Code
MO              Memory Ordering Principles Violation
MASKMOV         Masked Load/Store Instruction Assist
AD Bits         Page Table Entry Access/Dirty Bits Assist
TSX             Intel TSX Transaction Abort
UC              Uncachable Memory Assist
PRM             Processor Reserved Memory Assist
HW Interrupts   Hardware Interrupts



another processor matches the source of any load operation in the pipeline. We obtained the same results with the two threads running across physical or logical (hyperthreaded) cores. Similarly, the results are unchanged if the matching store and load are performed on different addresses; the only requirement we observed is that the memory operations need to refer to the same cache line. Overall, our results confirm the presence of a transient execution window and the ability of a thread to trigger a transient execution path in another thread by simply dirtying a cache line used in a ready-to-commit load. Exploitation-wise, abusing this type of MC is non-trivial due to the strict synchronization requirements and the difficulty of controlling pending stale data.

8 Memory Disambiguation Machine Clear

As suggested by the MACHINE_CLEARS.DISAMBIGUATION counter description, memory disambiguation (MD) mispredictions are handled via machine clears. Our experiments confirmed this behavior by observing matching increases of MACHINE_CLEARS.COUNT. Moreover, we observed no changes in microcode assist counters, suggesting mispredictions are resolved entirely in hardware. In case of a misprediction, a stale value is passed to subsequent loads, a primitive that was previously used to leak secret information with Spectre Variant 4, or Speculative Store Bypass (SSB) [55, 82].

Different from the other machine clears, MD machine clears trigger only on address aliasing mispredictions, when the CPU wrongly predicts a load does not alias a preceding store instruction and can be hoisted. Similar to branch prediction, generating an MD-based transient execution window requires mistraining the underlying predictor. The latter has complex, undocumented behavior, which has been partially reverse engineered [18]. Due to space constraints, we discuss our full reverse engineering strategy in Appendix A. Our results show that executing the same load 64+15 times with non-aliasing stores is sufficient to ensure the next prediction is "no-alias" (and thus the next load is hoisted). Upon reaching the hoisting prediction, one can perform the load with an aliasing store to trigger the transient path exposed to the incorrect, stale value.

Finally, we experimentally verified that 4k aliasing [50] does not cause any machine clear, but only incurs a further time penalty in case of wrong aliasing predictions. More details on 4k aliasing results can be found in Appendix A.

9 Other Types of Machine Clear

AVX vmaskmov. The AVX vmaskmov instructions perform conditional packed load and store operations depending on a bitmask. For example, a vmaskmovpd load may read 4 packed doubles from memory depending on a 4-bit mask: each double will be read only if the corresponding mask bit is set; the others will be assigned the value 0.

According to the Intel Optimization Reference Manual [36], the instruction does not generate an exception in face of invalid addresses, provided they are masked out. However, our experiments confirm that it does incur a machine clear (and a microcode assist) when accessing an invalid address (e.g., with the present bit set to 0) with a loading mask set to zero (i.e., no bytes should be loaded), to check whether the bytes in the invalid address have the corresponding mask bits set or not. We speculate the special handling is needed because the permission check is very complex, especially in the absence of memory alignment requirements.
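The architectural semantics of the masked load described above can be emulated in scalar code. This is a minimal sketch for illustration only: the function name is hypothetical, and the compact 4-bit integer mask follows the description in the text, whereas the actual instruction takes its mask in a vector register and tests the most significant bit of each element.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar emulation of a 4-lane masked load of doubles: lanes whose mask
 * bit is set are read from memory, all other lanes are set to 0. */
void masked_load_pd4(double dst[4], const double *src, uint8_t mask) {
    for (size_t lane = 0; lane < 4; lane++) {
        if (mask & (1u << lane))
            dst[lane] = src[lane];  /* lane selected: memory is read */
        else
            dst[lane] = 0.0;        /* lane masked out: no access, zero result */
    }
}
```

Note that with an all-zero mask the emulation never dereferences src, which corresponds to the architectural guarantee that masked-out invalid addresses raise no exception; the machine clear and microcode assist reported above are microarchitectural effects that such a software model cannot capture.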

In our experiments, vmaskmov instructions with all-zero masks and invalid addresses increment the OTHER_ASSIST.ANY and MACHINE_CLEARS.COUNT counters, confirming that the instruction triggers a machine clear. However, the resulting transient execution window seems short-lived or absent, as we were unable to observe cache or other microarchitectural side effects of the execution.

Exceptions. The MACHINE_CLEARS.PAGE_FAULT counter, present on older microarchitectures, confirms page faults are another cause of machine clears. Indeed, we verified that each instruction incurring a page fault, or any other exception such as "Division by zero", increments the MACHINE_CLEARS.COUNT counter. We also verified that exceptions do not trigger microcode assists and that software interrupts (traps) do not trigger machine clears. Indeed, the Intel documentation [37] specifies that instructions following a trap may be fetched but not speculatively executed. Transient execution windows originating from exceptions, and page faults in particular, have been extensively used in prior work, with a faulty load instruction also used as the trigger to leak information [7, 47, 63, 75].

Hardware interrupts. Although hardware interrupts are an undocumented cause of machine clears, our experiments with APIC timer interrupts showed they do increment the MACHINE_CLEARS.COUNT counter. While this confirms hardware interrupts are another root cause of transient execution, the asynchronous nature of these events yields a less than ideal vector for transient execution attacks. Nonetheless, hardware interrupts play an important role in other classes of microarchitectural attacks [73].

Microcode assists. Microcode assists require a pipeline flush to insert the required µOps in the frontend and represent a subclass of machine clears (and thus a root cause of transient execution) for cases where a fast path in hardware is not available. In this paper, we detailed the behavior of floating-point and vmaskmov assists. Prior work has discussed different situations requiring microcode assists, such as those related to page table entry Access/Dirty bits, typically in the context of assisted loads used as the trigger to leak information [8, 63, 75]. AVX-to-SSE transitions [36] represent another microcode assist, which, based on our experiments, we could not observe on modern Intel CPUs. Remaining known microcode assists, such as access control of memory pages belonging to SGX secure enclaves and Precise Event Based Sampling (PEBS), are not presented in this work since they are already studied [14] or only related to privileged performance profiling, respectively.

Figure 4: Top plot: transient window size vs. mechanism (y-axis: number of transient loads). Each bar reports the number of transient loads that complete and leave a microarchitectural trace. Bottom plot: leakage rate vs. mechanism (y-axis: leakage rate in Mb/s) for a simple Spectre Bounds-Check-Bypass attack and a 1-bit FLUSH + RELOAD (F+R) cache covert channel. F+R is the leakage rate upper bound (covert channel loop only, no actual transient window or attack). X-axis (Transient Execution Management): F+R, TSX, BHT, FAULT, SMC, XMC, FP, MD, MO. CPUs: Intel Core i7-10700K, Intel Xeon Silver 4214, Intel Core i9-9900K, Intel Core i7-7700K, AMD Ryzen 5 5600X, AMD Ryzen Threadripper 2990WX, AMD Ryzen 7 2700X.

10 Transient Execution Capabilities

Transient execution attacks rely on crafting a transient window to issue instructions that are never retired. For this purpose, state-of-the-art attacks traditionally rely on mechanisms based on root causes such as branch mispredictions (BHT) [41, 43], faulty loads (Fault) [8, 47, 63, 75], or memory transaction aborts (TSX) [8, 47, 59, 63, 75]. However, the different machine clears discussed in this paper provide an attacker with the exact same capabilities.

To compare the capabilities of machine clear-based transient windows with those of more traditional mechanisms, we implemented a framework able to run arbitrary attacker-controlled code in a window generated by a mechanism of choice. We now evaluate our framework on recent processors (with all the microcode updates and mitigations enabled) to compare the transient window size and leakage rate of the different mechanisms.

10.1 Transient Window Size

The transient window size provides an indication of the number of operations an attacker can issue on a transient path before the results are squashed. Larger windows can, in principle, host more complex attacks. Using a classic FLUSH + RELOAD cache covert channel (F+R) as a reference, we measure the window size by counting how many transient loads can complete and hit entries in a designated F+R buffer. Figure 4 (top) presents our results.

As shown in the figure, the window size varies greatly across the different mechanisms. Broadly speaking, mechanisms that have a higher detection cost, such as XMC and MO Machine Clear, yield larger window sizes. Not surprisingly, branch mispredictions yield the largest window sizes, as we can significantly slow down the branch resolution process (i.e., causing cache misses) and delay detection. FP, on the other hand, yields the shortest windows, suggesting that denormal numbers are efficiently detected inside the FPU. Our results also show that, while our framework was designed for Intel processors, similar if not better results can usually be obtained on AMD processors (where we use the same conservative training code for branch/memory prediction). This shows that both CPU families share a similar implementation in all cases except for MD, where the used mistraining pattern is not valid for pre-Zen3 architectures.

10.2 Leakage Rate

To compare the leakage rates for the different transient execution mechanisms, we transiently read and repeatedly leak data from a large memory region through a classic F+R cache covert channel. We report the resulting leakage rates, as the number of bits successfully leaked per second, across different microarchitectures, using a 1-bit covert channel to highlight the time complexity of each mechanism. We consider data to be successfully leaked after a single correct hit in the reload buffer. In case of a miss for a particular value, we restart the leak for the same value until we get a hit (or until we get 100 misses in a row).
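The measurement loop described above (retry each value until a hit, giving up after 100 misses in a row) can be sketched as follows. The callback type, the leak_stats structure, and the demo_attempt stand-in are hypothetical; a real attempt would open a transient window and probe the F+R reload buffer, which is outside the scope of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_MISSES 100  /* give up on a value after 100 misses in a row */

/* Hypothetical callback: runs one leak attempt for value `index` and
 * reports whether the reload buffer registered a correct hit. */
typedef bool (*leak_attempt_fn)(size_t index, void *ctx);

typedef struct {
    size_t leaked;    /* values leaked (a single hit is enough) */
    size_t attempts;  /* total attempts, including retries */
} leak_stats;

leak_stats leak_region(size_t n_values, leak_attempt_fn attempt, void *ctx) {
    leak_stats s = { 0, 0 };
    for (size_t i = 0; i < n_values; i++) {
        for (int misses = 0; misses < MAX_MISSES; misses++) {
            s.attempts++;
            if (attempt(i, ctx)) {  /* hit: move on to the next value */
                s.leaked++;
                break;
            }
        }
    }
    return s;
}

/* Leakage rate in bits per second for a covert channel `bits` wide. */
double leakage_rate(leak_stats s, unsigned bits, double seconds) {
    return seconds > 0.0 ? (double)(s.leaked * bits) / seconds : 0.0;
}

/* Stand-in for a real attempt: hits on every third call (illustration only). */
bool demo_attempt(size_t index, void *ctx) {
    (void)index;
    unsigned *calls = ctx;
    return ++*calls % 3 == 0;
}
```

The attempts field makes the cost of retries visible: a mechanism that needs many attempts per hit spends more time per leaked value, which is what the per-mechanism leakage rates ultimately reflect.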

As shown in Figure 4 (bottom), different Intel and AMD microarchitectures generally yield similar leakage rates, with some variations. For instance, FP MC offers better leakage rates on Intel. This difference stems from the different performance impact of the corresponding machine clears on Intel vs. AMD microarchitectures.

Figure 5: Leakage rate vs. mechanism with a 1-bit, 4-bit, or 8-bit FLUSH + RELOAD cache covert channel (Intel Core i9). X-axis (Transient Execution Management): F+R, TSX, BHT, FP, SMC, XMC, MO, MD, FAULT; y-axis: leakage rate in Mb/s; legend: F+R granularity of 1, 4, or 8 bits.

As shown in Figure 5, the leakage rate instead varies greatly across the different mechanisms, and so does the optimal covert channel bitwidth. Indeed, while existing attacks typically rely on 8-bit covert channels, our results suggest 1-bit or 4-bit channels can be much more efficient, depending on the specific mechanism. Roughly speaking, optimal leakage rates can be obtained by balancing the time complexity (and hence bitwidth) of the covert channel with that of the mechanism. For example, FP is a lightweight and reliable mechanism, hence using a comparably fast and narrow 1-bit covert channel is beneficial. In contrast, MD requires a time-consuming predictor training phase between leak iterations, and leaking more bits per iteration with a 4-bit covert channel is more efficient. Interestingly, a classic 8-bit covert channel yields consistently worse and comparable leakage rates across all the mechanisms, since F+R dominates the execution time.

Our results show that only two mechanisms (TSX and FP) are close to the maximum theoretical leakage rate of pure F+R. Moreover, FP performs as efficiently as TSX but, unlike TSX, is available on both Intel and AMD, is always enabled, and can be used from managed code (e.g., JavaScript). BHT, on the other hand, yields the worst leakage rates due to the inefficient training-based transient window. The BHT leakage rate can be improved if a tailored mistraining sequence is used, as in MD (Appendix A). Overall, our results show that machine clear-based windows achieve comparable and, in many cases (e.g., FP), better leakage rates compared to traditional mechanisms. Moreover, many machine clears eliminate the need for mistraining, which, other than resulting in efficient leakage rates, can escape existing pattern-based mitigations and disabled hardware extensions (e.g., Intel TSX).

11 Attack Primitives

Building on our reverse engineering results and focusing on the unexplored SMC and FP machine clears, we now present two new transient execution attack primitives and analyze their security implications. We also present an end-to-end FPVI exploit disclosing arbitrary memory in Firefox. Later, we discuss mitigations.

11.1 Speculative Code Store Bypass (SCSB)

Our first attack primitive, Speculative Code Store Bypass (SCSB), allows an attacker to execute stale, controlled code in a transient execution window originated by an SMC machine clear. Since the primitive relies on SMC, its primary applicability is on JIT (e.g., JavaScript) engines running attacker-controlled code, although OS kernels and hypervisors storing code pages and allowing their execution without first issuing a serializing instruction are also potentially affected.

Figure 6: SCSB primitive example, where the instruction pointer is pointing at the bold code blocks. g code is freed JIT'ed code (of some g function) under the attacker's control. (1) Force the engine to JIT and execute the code of function f, causing desynchronization of the code and data views. (2) Execute stale code and SMC MC. (3) After the SMC MC, code and data view coherence is restored and the new code is executed.

As exemplified in Figure 6, the operations of the primitive can be broken down into three steps: (1) the JIT engine compiles a function f, storing the generated code into a JIT code cache region previously used by a (now-stale) version of function g; (2) the JIT engine jumps to the newly generated code for the function f but, due to the temporary desynchronization between the code and data views of the CPU, this causes transient execution of the stale code of g until the SMC machine clear is processed; (3) after the pipeline flush, the code and data views are resynchronized and the CPU restarts the execution of the correct code of f. For exploitation, the attacker needs to (i) massage the JIT code cache allocator to reuse a freed region with a target gadget g of choice and (ii) force the JIT engine to generate and execute new (f) code in such a region, enabling transient out-of-context execution of the gadget and spilling secrets into the microarchitectural state.

Our primitive bears similarities with both transient and architectural primitives used in prior attacks. On the transient front, our primitive is conceptually similar to a Speculative Store Bypass (SSB) primitive [55, 82], but can transiently execute stale code rather than reading stale data. Interestingly, however, the underlying causes of the two primitives are quite different (MD misprediction vs. SMC machine clear). On the architectural front, our primitive mimics classic Use-After-Free (UAF) exploitation on the JIT code cache, also known as Return-After-Free (RAF) in the hackers community [19, 25]. An example is CVE-2018-0946, where a use-after-free vulnerability can be exploited to force the Chakra JS engine to erroneously execute freed (attacker-controlled) JIT code, resulting in arbitrary code execution after massaging the right gadget into the JIT code cache [24].

Figure 7: Coding options suggested by the Intel Architectures Software Developer's Manuals to handle SMC and XMC execution. Option 1 describes the exact steps required by our Speculative Code Store Bypass attack primitive, potentially resulting in exploitable gadgets.

Indeed, at a high level, SCSB yields a transient use-after-free primitive on a JIT code cache, with exploitation properties similar to its architectural counterpart. However, there are some differences due to the transient nature of SCSB. First, we need to find an out-of-context gadget to transiently leak data rather than architecturally execute arbitrary code. In JavaScript engines, similar gadgets have already been exploited by ret2spec [48], escalating out-of-context transient execution of valid code to type confusion, arbitrary reads, and ultimately a secret-dependent load to transmit the value.

Second, we need to target short-lived JIT code cache updates, so that the newly generated code is immediately executed, fitting the target gadget in the resulting transient execution window. Interestingly, executing JIT'ed code is the next step after JIT code generation mentioned in the Intel manual (Figure 7, Option 1), suggesting that developers who follow such directives ad litteram can easily introduce such gadgets. Moreover, modern JavaScript compilers feature a multi-stage optimization pipeline [12, 54], and short-lived JIT code cache updates are favored for just-in-time (re)optimization. Indeed, we found test code in V8 that specifically tests such updates and verifies instruction/data cache coherence [70].

Finally, we need to ensure JIT code cache updates are not accompanied by barriers that force immediate synchronization of the code and data views. We analyzed the code of the popular SpiderMonkey and V8 JavaScript engines and verified that the functions called upon JIT code cache updates to synchronize instruction and data caches (Listing 2 and Listing 3) are always empty on x86 (as expected, given Intel's primary recommendation in Figure 7).

Listing 2: Chromium instruction cache flush (chromium/src/v8/src/codegen/x64/cpu-x64.cc)

    void CpuFeatures::FlushICache(void* start, size_t size) {
      // No need to flush the instruction cache on Intel.
    }

Listing 3: Firefox instruction cache flush (mozilla-unified/js/src/jit/FlushICache.h)

    inline void FlushICache(void* code, size_t size,
                            bool codeIsThreadLocal = true) {
      // No-op. Code and data caches are coherent on x86 and x64.
    }

Overall, while the exploitation is far from trivial (i.e., having to address the challenges of traditional use-after-free exploitation as well as transient execution exploitation in the browser), we believe SCSB expands the attack surface of transient execution attacks in the browser. We found a number of candidate SCSB gadgets in real-world code, but none of them was ultimately exploitable due to the coincidental presence of some serializing instruction preventing stale code execution. Nonetheless, mitigations are needed to enforce security-by-design. Luckily, as shown later, SCSB is amenable to a practical and efficient mitigation (i.e., at a similar cost as faced by non-x86 architectures). The implementation/performance cost for mitigation is even lower than for its sibling SSB primitive, whose mitigation has been deployed in practice even in absence of known practical exploits [55].

In Table 3, we show that all tested Intel and AMD processors are affected by SCSB. In contrast, ARM is not vulnerable, since SMC updates require explicit software barriers.

11.2 Floating Point Value Injection (FPVI)

Our second attack primitive, Floating Point Value Injection (FPVI), allows an attacker to inject arbitrary values into a transient execution window originated by an FP machine clear. As exemplified in Figure 8, the operations of the primitive can be broken down into four steps: (1) the attacker triggers the execution of a gadget starting with a denormal FP operation in the victim application, with the x and y operands under the attacker's control; (2) the transient z result of the operation is processed by the subsequent gadget instructions, leaving a microarchitectural trace; (3) the CPU detects the error condition (i.e., wrong result of a denormal operation), triggering a machine clear and thus a pipeline flush; (4) the CPU re-executes the entire gadget with the correct architectural z result. For exploitation, the attacker needs to (i) massage the x and y operands to inject the desired z value into the victim transient path and (ii) target a victim gadget so that the injected value yields a security-sensitive trace, which can be observed with FLUSH + RELOAD or other microarchitectural attacks.

Our primitive bears similarities with Load Value Injection (LVI) [71], since both allow attackers to inject controlled values into transient execution. Moreover, both primitives require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues, and hence the triggering conditions, are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. Indeed, while it is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild (much like for Spectre [41], LVI [71], etc., this would require gadget scanners, a subject of orthogonal research [56, 58]), we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all the prior Mozilla Firefox mitigations against transient execution attacks [52].

Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x, y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE 754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case, a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.
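The encoding described above can be sketched as a few bit operations. This is a minimal illustration and not SpiderMonkey's actual code: apart from JSVAL_TAG_MAX_DOUBLE, all names are hypothetical. Note that 0xfff88 and 0xfffb0 are the tags as they appear in the upper five hex digits of the 64-bit value; shifted right by 47 bits, they correspond to the 17-bit tag values 0x1fff1 and 0x1fff6.

```cpp
#include <cstdint>

// Constants taken from the text; names other than JSVAL_TAG_MAX_DOUBLE are
// hypothetical and chosen for this sketch only.
constexpr unsigned kTagShift = 47;                   // 47 payload bits
constexpr std::uint64_t kTagMaxDouble = 0x1fff0ULL;  // JSVAL_TAG_MAX_DOUBLE
constexpr std::uint64_t kPayloadMask = (std::uint64_t{1} << kTagShift) - 1;

// Upper 17 bits: data type tag.
constexpr std::uint64_t TagOf(std::uint64_t bits) { return bits >> kTagShift; }

// Lower 47 bits: data value (e.g., a pointer for strings).
constexpr std::uint64_t PayloadOf(std::uint64_t bits) {
  return bits & kPayloadMask;
}

// Values whose tag does not exceed JSVAL_TAG_MAX_DOUBLE are plain doubles.
constexpr bool IsDouble(std::uint64_t bits) {
  return TagOf(bits) <= kTagMaxDouble;
}
```

For the value 0xfffb0deadbeef000 used in our example, TagOf returns 0x1fff6 (string) and PayloadOf returns the pointer 0xdeadbeef000, whereas -Infinity (0xfff0000000000000) is classified as a plain double.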

To verify that the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., the string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see gadget in Figure 8).
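The architectural side of this example can be checked in ordinary code. The sketch below only reproduces the committed result of the division; the transient result is by definition not observable this way. It assumes the default floating-point environment (round-to-nearest, denormals enabled), and the helper names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Reinterprets a 64-bit pattern as an IEEE 754 double and back.
double FromBits(std::uint64_t bits) {
  double d;
  std::memcpy(&d, &bits, sizeof d);
  return d;
}

std::uint64_t ToBits(double d) {
  std::uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  return bits;
}

// A denormal double has a zero exponent field and a non-zero mantissa.
bool IsDenormal(std::uint64_t bits) {
  const std::uint64_t exponent = (bits >> 52) & 0x7ff;
  const std::uint64_t mantissa = bits & ((std::uint64_t{1} << 52) - 1);
  return exponent == 0 && mantissa != 0;
}

// Architectural (committed) result of x / y. The volatile qualifiers keep the
// compiler from folding the division away at compile time.
std::uint64_t ArchitecturalDivide(std::uint64_t x_bits, std::uint64_t y_bits) {
  volatile double x = FromBits(x_bits);
  volatile double y = FromBits(y_bits);
  return ToBits(x / y);
}
```

With the operands above, the divisor is a denormal number (2^-1024), so the quotient overflows and the committed result is 0xfff0000000000000, i.e., -Infinity.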

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the length attribute of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 51 bits set to zero, a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables such as string pointers can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on an FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.
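The invariant and the resulting read address can be made concrete with a short sketch. It is illustrative only: the function names are hypothetical, the 4-byte offset and the 47-bit payload are the values quoted in the text, and the check covers the committed (architectural) result of an invalid operation, not the transient one.

```cpp
#include <cstdint>
#include <cstring>

constexpr std::uint64_t kTagMaxDouble = 0x1fff0ULL;  // JSVAL_TAG_MAX_DOUBLE
constexpr std::uint64_t kLengthOffset = 4;           // offset quoted in the text

std::uint64_t ToBits(double d) {
  std::uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  return bits;
}

// True if a committed FP result could be mistaken for a NaN-boxed value.
bool LooksNaNBoxed(double result) {
  return (ToBits(result) >> 47) > kTagMaxDouble;
}

// NaN produced by an invalid operation (0/0); volatile prevents folding.
double InvalidOperationNaN() {
  volatile double zero = 0.0;
  return zero / zero;
}

// Address touched by the z.length access for a (transient) boxed string.
std::uint64_t LengthFieldAddress(std::uint64_t boxed_bits) {
  const std::uint64_t pointer = boxed_bits & ((std::uint64_t{1} << 47) - 1);
  return pointer + kLengthOffset;
}
```

Architecturally, LooksNaNBoxed(InvalidOperationNaN()) is false, which is the property the JIT'ed code relies on. For the transient value 0xfffb0deadbeef000, LengthFieldAddress yields 0xdeadbeef004.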

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0 and, by wrapping this primitive with a variant of an EVICT + RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

Table 3: Tested processors.

Processor                               Microcode    SCSB vulnerable   FPVI vulnerable
Intel Core i7-10700K                    0xe0         ✓                 ✓
Intel Xeon Silver 4214                  0x500001c    ✓                 ✓
Intel Core i9-9900K                     0xde         ✓                 ✓
Intel Core i7-7700K                     0xca         ✓                 ✓
AMD Ryzen 5 5600X                       0xa201009    ✓                 ✓†
AMD Ryzen 2990WX                        0x800820b    ✓                 ✓†
AMD Ryzen 7 2700X                       0x800820b    ✓                 ✓†
Broadcom BCM2711 Cortex-A72 (ARM v8)    -            ✗§                ✗

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified that any serialization instruction, such as lfence, mfence, or cpuid, placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. Such a function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows that JavaScript execution time is heavily dominated by JIT code generation/execution and that code updates have negligible impact. Our results show this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.
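The kind of one-line change described above might look as follows. This is a sketch, not the patch applied to V8 or SpiderMonkey: FlushICacheHardened and PatchCode are hypothetical names, the inline-assembly syntax assumes GCC or Clang, and on non-x86 targets the fence compiles to nothing.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical hardened variant of the (normally empty) x86 FlushICache hook.
inline void FlushICacheHardened(void* /*code*/, std::size_t /*size*/) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
  // The text reports that an lfence placed after the SMC stores is
  // sufficient to suppress the transient window.
  __asm__ __volatile__("lfence" ::: "memory");
#endif
}

// Hypothetical helper: copy freshly generated code into a code cache region
// and fence before the caller is allowed to jump to it.
inline void PatchCode(void* dst, const void* src, std::size_t size) {
  std::memcpy(dst, src, size);
  FlushICacheHardened(dst, size);
}
```

Keeping the fence inside the flush hook means every code update path that already calls the hook is covered without touching its callers.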

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Mozilla to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
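The conditional masking option can be sketched as a branch-free select on the raw bits of an FP result. This is an illustration of the idea only and not the SpiderMonkey patch: the function name is hypothetical, and a JIT would emit the cmov directly, whereas a C++ compiler is free to choose how to lower this code.

```cpp
#include <cstdint>

constexpr std::uint64_t kTagMaxDouble = 0x1fff0ULL;            // JSVAL_TAG_MAX_DOUBLE
constexpr std::uint64_t kCanonicalNaN = 0xfff8000000000000ULL; // QNaN FP Indefinite

// Branch-free select: returns kCanonicalNaN when the raw FP result carries a
// tag above JSVAL_TAG_MAX_DOUBLE, and the unchanged result otherwise.
constexpr std::uint64_t MaskFpResult(std::uint64_t result_bits) {
  const std::uint64_t boxed = (result_bits >> 47) > kTagMaxDouble ? 1 : 0;
  const std::uint64_t mask = ~(boxed - 1);  // all ones if boxed, else zero
  return (result_bits & ~mask) | (kCanonicalNaN & mask);
}
```

Under this sketch, the transient value 0xfffb0deadbeef000 from our example is replaced by 0xfff8000000000000, while ordinary doubles, infinities, and the canonical NaN pass through unchanged.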

A more general and automated mitigation is for the compiler to place a serializing instruction such as lfence after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable and those resistant to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76]. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by an MDS attacker. For the latter, we limit ourselves to load sinks to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.
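The source/sink strategy can be illustrated on a toy straight-line IR. The sketch below is a simplification under stated assumptions and is unrelated to the data structures or the placement algorithm of the LLVM pass: the IR, the names, and the greedy policy (fence right before the first tainted sink, then treat earlier values as committed) are all hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy straight-line IR (hypothetical; unrelated to LLVM's data structures).
enum class Op { FpArith, IntArith, Load, Store, IndirectBranch };

struct Inst {
  Op op;
  std::string dst;                // defined value ("" if none)
  std::vector<std::string> uses;  // operands; for Load, Store, and
                                  // IndirectBranch: address/target operands
};

// Returns the indices of instructions that need an lfence placed before them.
// Sources: results of FP operations. Sinks: loads and indirect branches whose
// operands depend on a source, plus stores when the system is MDS-vulnerable.
std::vector<std::size_t> FenceBefore(const std::vector<Inst>& code,
                                     bool mds_vulnerable) {
  std::set<std::string> tainted;
  std::vector<std::size_t> fences;
  for (std::size_t i = 0; i < code.size(); ++i) {
    const Inst& inst = code[i];
    bool uses_tainted = false;
    for (const std::string& u : inst.uses) {
      if (tainted.count(u) != 0) uses_tainted = true;
    }
    const bool is_sink = inst.op == Op::Load || inst.op == Op::IndirectBranch ||
                         (inst.op == Op::Store && mds_vulnerable);
    if (is_sink && uses_tainted) {
      fences.push_back(i);
      tainted.clear();  // simplification: values before the fence are committed
      uses_tainted = false;
    }
    if (inst.dst.empty()) continue;
    if (inst.op == Op::FpArith || uses_tainted) {
      tainted.insert(inst.dst);  // FP result, or derived from one
    } else {
      tainted.erase(inst.dst);   // redefined with an untainted value
    }
  }
  return fences;
}
```

For a sequence such as z = fdiv x, y; addr = and z, mask; len = load [addr], the sketch requests a single fence before the load. A store through an FP-derived address is fenced only when mds_vulnerable is true, mirroring the two sink definitions above.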

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that, on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact (up to 280%), despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance impact difference between MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by performing interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0. [Bar chart. Y-axis: overhead (ticks from 0 to 800). X-axis: SPEC FP 2017 (619.lbm_s, 638.imagick_s, 644.nab_s, geomean) and SPEC FP 2006 (433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, geomean). Series (LLVM mitigation): LVI, FPVI (MDS-vulnerable system), FPVI (MDS-resistant system), LVI+FPVI (MDS-vulnerable system).]

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, an MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It is worth noting that, in Figure 10, there are two separate predictor subclasses under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74–76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a HW interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less than ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived from likely invariants [15] used by the CPU. Such invariants commonly hold but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control-flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them with the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it with the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd, Daniel Genkin, and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From ghz to mhz effective cpu frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. Smotherspectre: exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. Kaslr: Break it, fix it, repeat. In ACM ASIA CCS 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. Sgxpectre: Stealing intel secrets from sgx enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS 2018.

[16] Christopher Domas. Breaking the x86 isa. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.

[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. Branchscope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts. October 1, 2013. US Patent 8,549,263.


[45] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.

[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.

[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.

[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.

[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. Memjam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358.

[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111.

[53] Mozilla. Spider Monkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h.

[54] Mozilla. SpiderMonkey IonMonkey documentation.

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. Specfuzz: Bringing Spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the CPU on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. Spectaint: Speculative taint analysis for discovering Spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. SoK: In search of lost time: A review of JavaScript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers, 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in JavaScript. In FC. IFCA, 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. Netspectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. Lazyfp: Leaking FPU register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. Meltdownprime and Spectreprime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. SoK: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8.

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. LVI: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P, 20.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary CPU interrupt logic. In ACM CCS, 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-Step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. SGAxe: How SGX fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Cacheout: Leaking data on Intel CPUs via cache evictions. arXiv preprint.

[77] WebKit. Browserbench. https://browserbench.org.

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+Reload: a high resolution, low noise, L3 cache side-channel attack. In USENIX Security 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                      ; rdi: store addr, rsi: load addr
    %rep 10                     ; Trick to delay the store address
        imul rdi, 1
    %endrep
        mov DWORD [rdi], 0x42   ; Store
        mov eax, DWORD [rsi]    ; Load
    %rep 10                     ; Pronounce load timing
        imul eax, 1
    %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturating counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that the saturating counter is set to 0
    for(i=0; i<100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for(i=100; i<120; i++) st_ld(mem, mem+64);
    // Trigger a memory ordering machine clear
    for(i=120; i<130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, the flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
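The counter behavior described above can be sketched as a small software model. This is only an illustration under the stated assumptions (a 4-bit counter, incremented on non-aliasing loads, reset to 0 on aliasing loads, hoisting only when saturated); the names md_entry, md_predicts_hoist, md_update, and replay_listing5 are hypothetical and do not come from the patent or from our framework.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MD_COUNTER_MAX 15u /* a 4-bit counter saturates at 15 */

/* Hypothetical model of one per-address predictor entry. */
typedef struct {
    uint8_t counter;
} md_entry;

/* Hoisting is predicted only when the counter is saturated. */
static bool md_predicts_hoist(const md_entry *e)
{
    return e->counter == MD_COUNTER_MAX;
}

/*
 * Update the entry once the aliasing outcome is known.
 * Returns true when the load was hoisted but did alias,
 * i.e., when the model expects a machine clear.
 */
static bool md_update(md_entry *e, bool aliased)
{
    bool hoisted = md_predicts_hoist(e);
    if (aliased) {
        e->counter = 0;
        return hoisted;
    }
    if (e->counter < MD_COUNTER_MAX)
        e->counter++;
    return false;
}

/*
 * Replay the call pattern of Listing 5 on the model. Returns the number of
 * modeled machine clears and stores the index of the first hoisted call.
 */
static int replay_listing5(int *first_hoisted_call)
{
    assert(first_hoisted_call != NULL);
    md_entry e = { 0 };
    int clears = 0;
    *first_hoisted_call = -1;
    for (int i = 0; i < 130; i++) {
        /* Calls 100..119 load from mem+64, so they do not alias the store. */
        bool aliased = (i < 100) || (i >= 120);
        if (md_predicts_hoist(&e) && *first_hoisted_call < 0)
            *first_hoisted_call = i;
        if (md_update(&e, aliased))
            clears++;
    }
    return clears;
}
```

Under these assumptions, the model hoists for the first time at call 115 (after the 15 non-aliasing calls 100 to 114) and reports a single machine clear, at call 120.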

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld, 100 to 130; y-axis: clock cycles). Orange bar: memory ordering machine clear observed.

Listing 6: Snippet observing activation of the never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for(int i=0; i<10; i++) {
        for(int j=0; j<19; j++)
            st_ld(mem, mem+64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld, 0 to 120; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters) and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe whether two unrelated loads in these functions are able to influence each other when varying the number of nops. On our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located a multiple of 256 bytes apart in memory (256·k bytes, k ∈ N). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except that we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
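To make the indexing concrete, the following minimal sketch computes the predictor index from a load's instruction pointer and the amount of one-byte nop padding that makes an attacker load share the entry of a victim load. It assumes the 8-bit index observed above; md_index, md_entries_collide, and md_nop_padding are hypothetical helper names, and the addresses used with them are purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MD_INDEX_BITS 8u /* observed above; the patent suggests 6 */
#define MD_INDEX_MASK ((1ull << MD_INDEX_BITS) - 1ull)

/* Predictor entry selected by a load: the lowest bits of its instruction pointer. */
static unsigned md_index(uint64_t load_ip)
{
    return (unsigned)(load_ip & MD_INDEX_MASK);
}

/* Two loads share a predictor entry when their indices match. */
static bool md_entries_collide(uint64_t ip_a, uint64_t ip_b)
{
    return md_index(ip_a) == md_index(ip_b);
}

/*
 * Number of one-byte nops to place before the attacker load so that it
 * maps to the same predictor entry as the victim load.
 */
static unsigned md_nop_padding(uint64_t attacker_load_ip, uint64_t victim_load_ip)
{
    unsigned pad = (unsigned)((victim_load_ip - attacker_load_ip) & MD_INDEX_MASK);
    assert(md_entries_collide(attacker_load_ip + pad, victim_load_ip));
    return pad;
}
```

For instance, loads at 0x1000 and 0x1300 (3·256 bytes apart) collide in this model, while loads at 0x1000 and 0x1040 do not, although they would collide under the 6-bit hash suggested by the patent.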

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the results in Figure 12 show that, after 4 machine clears, the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads, with ticks at 15, 79, 143, 207, 271, 335, and 399; y-axis: total measured machine clears, 1 to 4).

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0, and disabled otherwise. The flush counter is decremented on every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measured how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal that the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that, to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
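The interplay between the per-address counter and the watchdog can be illustrated with a toy model. It is a sketch under several assumptions that go beyond what we measured: the flush counter ranges from -1 to 3, would-be correct hoisting predictions are counted in a single running tally that is not reset by machine clears, and only one predictor entry is modeled. All names (md_model, md_step, loads_to_leave_never_hoisting, clears_in_listing6) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SAT_MAX = 15, FLUSH_MAX = 3, HOISTS_PER_INCREMENT = 64 };

typedef struct {
    int sat;    /* per-address counter, 0..15 */
    int flush;  /* flush counter, -1..3; a negative value enables the watchdog */
    int streak; /* (would-be) correct hoist predictions since the last increment */
} md_model;

/* Model one st_ld call; returns true if a machine clear is expected. */
static bool md_step(md_model *m, bool aliased)
{
    assert(m != NULL);
    bool predict_hoist = (m->sat == SAT_MAX);
    bool watchdog = (m->flush < 0);
    if (aliased) {
        m->sat = 0;
        if (predict_hoist && !watchdog) {
            m->flush--;     /* every MD machine clear decrements the flush counter */
            return true;
        }
        return false;
    }
    if (!predict_hoist) {
        m->sat++;
        return false;
    }
    /* Correct prediction (unused while the watchdog is enabled). */
    if (++m->streak == HOISTS_PER_INCREMENT) {
        m->streak = 0;
        if (m->flush < FLUSH_MAX)
            m->flush++;
    }
    return false;
}

/* Non-aliasing loads needed to go from never-hoisting to predict-hoisting. */
static int loads_to_leave_never_hoisting(void)
{
    md_model m = { 0, -1, 0 };
    int n = 0;
    while (!(m.sat == SAT_MAX && m.flush >= 0)) {
        md_step(&m, false);
        n++;
    }
    return n;
}

/* Machine clears expected when replaying the call pattern of Listing 6. */
static int clears_in_listing6(void)
{
    md_model m = { 0, FLUSH_MAX, 0 };
    int clears = 0;
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            md_step(&m, false);
        if (md_step(&m, true))
            clears++;
    }
    return clears;
}
```

In this model, leaving the never-hoisting state takes 15+64=79 non-aliasing loads, and replaying Listing 6 yields 4 machine clears before the watchdog takes over, which is consistent with the measurements reported above.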

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for(i=0; i<10; i++) st_ld(mem+0x1000, mem+0x1000);
    for(i=10; i<20; i++) st_ld(mem+0x2000, mem+0x2000);
    // Cause 4k aliasing
    for(i=20; i<40; i++) st_ld(mem+0x1000, mem+0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld, 0 to 40; y-axis: clock cycles). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Figure 15: Memory disambiguation unit (simplified view).

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value and, in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load; (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
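The 4k aliasing condition exercised by Listing 7 can be expressed as a simple address check. The sketch below is a simplification: it treats a store-load pair as 4k-aliased when the two addresses differ but share the same 12-bit page offset, and it ignores access sizes and partial overlaps. The helpers is_4k_alias and listing7_pair are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK 0xfffull /* low 12 bits: offset within a 4KB page */

/* False match: different addresses that share the same 4KB page offset. */
static bool is_4k_alias(uint64_t store_addr, uint64_t load_addr)
{
    return store_addr != load_addr &&
           ((store_addr ^ load_addr) & PAGE_OFFSET_MASK) == 0;
}

/* Store and load addresses used by the i-th st_ld call of Listing 7. */
static void listing7_pair(uint64_t mem, int i, uint64_t *store, uint64_t *load)
{
    assert(i >= 0 && i < 40);
    *store = mem + ((i >= 10 && i < 20) ? 0x2000u : 0x1000u);
    *load = mem + ((i >= 10) ? 0x2000u : 0x1000u);
}
```

With these definitions, calls 0 to 19 of Listing 7 use truly aliasing pairs (same address), while calls 20 to 39 use 4k-aliased pairs. In contrast, the (mem, mem+64) pairs of Listing 5 are neither.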

B Root-Causes Description Table

Acronym         Description

BHT             Branch History Table
BTB             Branch Target Buffer
RSB             Return Stack Buffer
MD              Memory Disambiguation Unit
NM              Device Not Available Exception
DE              Divide-by-Zero Exception
UD              Invalid Opcode Exception
GP              General Protection Fault
AC              Alignment Check Exception
SS              Stack Segment Exception
PF              Page Fault
PF - U/S Bit    Page Table Entry User/Supervisor Bit PF
PF - R/W Bit    Page Table Entry Read/Write Bit PF
PF - P Bit      Page Table Entry Present Bit PF
PF - PKU        Page Table Entry Protection Keys PF
BR              Bound Range Exceeded Exception
FP              Floating Point Assist
SMC             Self-Modifying Code
XMC             Cross-Modifying Code
MO              Memory Ordering Principles Violation
MASKMOV         Masked Load/Store Instruction Assist
A/D Bits        Page Table Entry Access/Dirty Bits Assist
TSX             Intel TSX Transaction Abort
UC              Uncachable Memory Assist
PRM             Processor Reserved Memory Assist
HW Interrupts   Hardware Interrupts


Figure 4: Top plot: transient window size vs. mechanism. Each bar reports the number of transient loads that complete and leave a microarchitectural trace. Bottom plot: leakage rate (in Mb/s) vs. mechanism for a simple Spectre Bounds-Check-Bypass attack and a 1-bit FLUSH + RELOAD (F+R) cache covert channel. F+R is the leakage rate upper bound (covert channel loop only, no actual transient window or attack). Mechanisms on the x-axis: F+R, TSX, BHT, FAULT, SMC, XMC, FP, MD, MO. CPUs: Intel Core i7-10700K, Intel Xeon Silver 4214, Intel Core i9-9900K, Intel Core i7-7700K, AMD Ryzen 5 5600X, AMD Ryzen Threadripper 2990WX, AMD Ryzen 7 2700X.

secure enclaves and Precise Event Based Sampling (PEBS) are not presented in this work, since they were already studied [14] or are only related to privileged performance profiling, respectively.

10 Transient Execution Capabilities

Transient execution attacks rely on crafting a transient window to issue instructions that are never retired. For this purpose, state-of-the-art attacks traditionally rely on mechanisms based on root causes such as branch mispredictions (BHT) [41, 43], faulty loads (Fault) [8, 47, 63, 75], or memory transaction aborts (TSX) [8, 47, 59, 63, 75]. However, the different machine clears discussed in this paper provide an attacker with the exact same capabilities.

To compare the capabilities of machine clear-based transient windows with those of more traditional mechanisms, we implemented a framework able to run arbitrary attacker-controlled code in a window generated by a mechanism of choice. We now evaluate our framework on recent processors (with all the microcode updates and mitigations enabled) to compare the transient window size and leakage rate of the different mechanisms.

10.1 Transient Window Size

The transient window size provides an indication of the number of operations an attacker can issue on a transient path before the results are squashed. Larger windows can, in principle, host more complex attacks. Using a classic FLUSH + RELOAD cache covert channel (F+R) as a reference, we measure the window size by counting how many transient loads can complete and hit entries in a designated F+R buffer. Figure 4 (top) presents our results.

As shown in the figure, the window size varies greatly across the different mechanisms. Broadly speaking, mechanisms that have a higher detection cost, such as the XMC and MO machine clears, yield larger window sizes. Not surprisingly, branch mispredictions yield the largest window sizes, as we can significantly slow down the branch resolution process (i.e., by causing cache misses) and delay detection. FP, on the other hand, yields the shortest windows, suggesting that denormal numbers are efficiently detected inside the FPU. Our results also show that, while our framework was designed for Intel processors, similar if not better results can usually be obtained on AMD processors (where we use the same conservative training code for branch/memory prediction). This shows that both CPU families share a similar implementation in all cases except for MD, where the used mistraining pattern is not valid for pre-Zen3 architectures.

10.2 Leakage Rate

To compare the leakage rates for the different transient execution mechanisms, we transiently read and repeatedly leak data from a large memory region through a classic F+R cache covert channel. We report the resulting leakage rates (as the number of bits successfully leaked per second) across different microarchitectures, using a 1-bit covert channel to highlight the time complexity of each mechanism. We consider data to be successfully leaked after a single correct hit in the reload buffer. In case of a miss for a particular value, we restart the leak for the same value until we get a hit (or until we get 100 misses in a row).
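The retry policy and the rate computation described above can be sketched as follows. This is not the code of our framework: the transient read and the F+R probe are hardware-specific and are abstracted behind a callback, and every name here (leak_attempt_fn, leak_stats, leak_region, leakage_rate_bps, hit_every_third) is hypothetical. The hit_every_third function is only a deterministic stand-in that makes the sketch runnable without a real covert channel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MISSES_IN_A_ROW 100

/* One leak attempt for the value at `index`; returns true on a correct F+R hit. */
typedef bool (*leak_attempt_fn)(size_t index, void *ctx);

typedef struct {
    size_t leaked;   /* values leaked successfully */
    size_t attempts; /* total attempts, including retries */
} leak_stats;

/* Leak n_values values, retrying each one until a hit or 100 misses in a row. */
static leak_stats leak_region(size_t n_values, leak_attempt_fn attempt, void *ctx)
{
    assert(attempt != NULL);
    leak_stats st = { 0, 0 };
    for (size_t i = 0; i < n_values; i++) {
        for (int misses = 0; misses < MAX_MISSES_IN_A_ROW; misses++) {
            st.attempts++;
            if (attempt(i, ctx)) {
                st.leaked++;
                break;
            }
        }
    }
    return st;
}

/* Leakage rate in bits per second for a covert channel of the given width. */
static double leakage_rate_bps(leak_stats st, unsigned bits_per_value, double seconds)
{
    assert(seconds > 0.0);
    return (double)st.leaked * bits_per_value / seconds;
}

/* Stand-in attempt for demonstration only: reports a hit on every third call. */
static bool hit_every_third(size_t index, void *ctx)
{
    (void)index;
    unsigned *calls = ctx;
    return ++(*calls) % 3 == 0;
}
```

In a real measurement, the callback would trigger the chosen mechanism, perform the transient access, and probe the reload buffer, and the elapsed time would come from a wall-clock timer around leak_region.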

As shown in Figure 4 (bottom), different Intel and AMD microarchitectures generally yield similar leakage rates, with some variations. For instance, FP MC offers better leakage rates on Intel. This difference stems from the different performance impact of the corresponding machine clears on Intel vs. AMD microarchitectures.

Figure 5: Leakage rate vs. mechanism with a 1-bit, 4-bit, or 8-bit FLUSH + RELOAD cache covert channel (Intel Core i9). Mechanisms on the x-axis: F+R, TSX, BHT, FP, SMC, XMC, MO, MD, FAULT; y-axis: leakage rate in Mb/s; legend: F+R granularity of 1, 4, or 8 bits.

As shown in Figure 5, the leakage rate instead varies greatly across the different mechanisms, and so does the optimal covert channel bitwidth. Indeed, while existing attacks typically rely on 8-bit covert channels, our results suggest 1-bit or 4-bit channels can be much more efficient, depending on the specific mechanism. Roughly speaking, optimal leakage rates can be obtained by balancing the time complexity (and hence bitwidth) of the covert channel with that of the mechanism. For example, FP is a lightweight and reliable mechanism, hence using a comparably fast and narrow 1-bit covert channel is beneficial. In contrast, MD requires a time-consuming predictor training phase between leak iterations, and leaking more bits per iteration with a 4-bit covert channel is more efficient. Interestingly, a classic 8-bit covert channel yields consistently worse and comparable leakage rates across all the mechanisms, since F+R dominates the execution time.

Our results show that only two mechanisms (TSX and FP) are close to the maximum theoretical leakage rate of pure F+R. Moreover, FP performs as efficiently as TSX but, unlike TSX, is available on both Intel and AMD, is always enabled, and can be used from managed code (e.g., JavaScript). BHT, on the other hand, yields the worst leakage rates due to the inefficient training-based transient window. The BHT leakage rate can be improved if a tailored mistraining sequence is used, as in MD (Appendix A). Overall, our results show that machine clear-based windows achieve comparable and, in many cases (e.g., FP), better leakage rates compared to traditional mechanisms. Moreover, many machine clears eliminate the need for mistraining, which, other than resulting in efficient leakage rates, can escape existing pattern-based mitigations and disabled hardware extensions (e.g., Intel TSX).

11 Attack Primitives

Building on our reverse engineering results and focusing on the unexplored SMC and FP machine clears, we now present two new transient execution attack primitives and analyze their security implications. We also present an end-to-end FPVI exploit disclosing arbitrary memory in Firefox. Later, we discuss mitigations.

11.1 Speculative Code Store Bypass (SCSB)

Our first attack primitive, Speculative Code Store Bypass (SCSB), allows an attacker to execute stale, controlled code in a transient execution window originated by a SMC machine clear. Since the primitive relies on SMC, its primary applicability is on JIT (e.g., JavaScript) engines running attacker-controlled code, although OS kernels and hypervisors storing code pages and allowing their execution without first issuing a serializing instruction are also potentially affected.

Figure 6: SCSB primitive example, where the instruction pointer is pointing at the bold code blocks. g code is freed JIT'ed code (of some g function) under attacker's control. (1) Force engine to JIT and execute code of function f, causing desynchronization of code and data views. (2) Execute stale code and SMC MC. (3) After the SMC MC, code and data view coherence is restored and the new code is executed.

As exemplified in Figure 6, the operations of the primitive can be broken down into three steps: (1) the JIT engine compiles a function f, storing the generated code into a JIT code cache region previously used by a (now-stale) version of function g; (2) the JIT engine jumps to the newly generated code for the function f but, due to the temporary desynchronization between the code and data views of the CPU, this causes transient execution of the stale code of g until the SMC machine clear is processed; (3) after the pipeline flush, the code and data views are resynchronized and the CPU restarts the execution of the correct code of f. For exploitation, the attacker needs to: (i) massage the JIT code cache allocator to reuse a freed region with a target gadget g of choice; (ii) force the JIT engine to generate and execute new (f) code in such a region, enabling transient, out-of-context execution of the gadget and spilling secrets into the microarchitectural state.

Our primitive bears similarities with both transient and architectural primitives used in prior attacks. On the transient front, our primitive is conceptually similar to a Speculative-Store-Bypass (SSB) primitive [55, 82], but can transiently execute stale code rather than reading stale data. However, interestingly, the underlying causes of the two primitives are quite different (MD misprediction vs. SMC machine clear). On the architectural front, our primitive mimics classic Use-After-Free (UAF) exploitation on the JIT code cache, also known as Return-After-Free (RAF) in the hackers community [19, 25]. An example is CVE-2018-0946, where a use-after-free vulnerability can be exploited to force the Chakra JS engine to erroneously execute freed (attacker-controlled) JIT code, resulting in arbitrary code execution after massaging the right gadget into the JIT code cache [24].

Figure 7: Coding options suggested by the Intel Architectures Software Developer's Manuals to handle SMC and XMC execution. Option 1 describes the exact steps required by our Speculative Code Store Bypass attack primitive, potentially resulting in exploitable gadgets.

Indeed, at a high level, SCSB yields a transient use-after-free primitive on a JIT code cache, with exploitation properties similar to its architectural counterpart. However, there are some differences due to the transient nature of SCSB. First, we need to find an out-of-context gadget to transiently leak data rather than architecturally execute arbitrary code. In JavaScript engines, similar gadgets have already been exploited by ret2spec [48], escalating out-of-context transient execution of valid code to type confusion, arbitrary reads, and ultimately a secret-dependent load to transmit the value.

Second, we need to target short-lived JIT code cache updates, so that the newly generated code is immediately executed, fitting the target gadget in the resulting transient execution window. Interestingly, executing JIT'ed code is the next step after JIT code generation mentioned in the Intel manual (Figure 7, Option 1), suggesting that developers that follow such directives ad litteram can easily introduce such gadgets. Moreover, modern JavaScript compilers feature a multi-stage optimization pipeline [12, 54], and short-lived JIT code cache updates are favored for just-in-time (re)optimization. Indeed, we found test code in V8 to specifically test such updates and verify instruction/data cache coherence [70].

Finally, we need to ensure JIT code cache updates are not accompanied by barriers that force immediate synchronization of the code and data views. We analyzed the code of the popular SpiderMonkey and V8 JavaScript engines and verified that the functions called upon JIT code cache updates to synchronize instruction and data caches (Listing 2 and Listing 3) are always empty on x86 (as expected, given Intel's primary recommendation in Figure 7).

Listing 2: Chromium instruction cache flush (chromium/src/v8/src/codegen/x64/cpu-x64.cc).

    void CpuFeatures::FlushICache(void* start, size_t size) {
      // No need to flush the instruction cache on Intel
    }

Listing 3: Firefox instruction cache flush (mozilla-unified/js/src/jit/FlushICache.h).

    inline void FlushICache(void* code, size_t size,
                            bool codeIsThreadLocal = true) {
      // No-op. Code and data caches are coherent on x86 and x64.
    }

Overall, while the exploitation is far from trivial (i.e., having to address the challenges of traditional use-after-free exploitation as well as transient execution exploitation in the browser), we believe SCSB expands the attack surface of transient execution attacks in the browser. We found a number of candidate SCSB gadgets in real-world code, but none of them was ultimately exploitable, due to the coincidental presence of some serializing instruction preventing stale code execution. Nonetheless, mitigations are needed to enforce security-by-design. Luckily, as shown later, SCSB is amenable to a practical and efficient mitigation (i.e., at a similar cost as faced by non-x86 architectures). The implementation/performance cost for mitigation is even lower than for its sibling SSB primitive, whose mitigation has been deployed in practice even in absence of known practical exploits [55].

In Table 3, we show that all tested Intel and AMD processors are affected by SCSB. In contrast, ARM is not vulnerable, since SMC updates require explicit software barriers.

11.2 Floating Point Value Injection (FPVI)

Our second attack primitive, Floating Point Value Injection (FPVI), allows an attacker to inject arbitrary values into a transient execution window originated by a FP machine clear. As exemplified in Figure 8, the operations of the primitive can be broken down into four steps: (1) the attacker triggers the execution of a gadget starting with a denormal FP operation in the victim application, with the x and y operands under attacker's control; (2) the transient z result of the operation is processed by the subsequent gadget instructions, leaving a microarchitectural trace; (3) the CPU detects the error condition (i.e., wrong result of a denormal operation), triggering a machine clear and thus a pipeline flush; (4) the CPU re-executes the entire gadget with the correct architectural z result. For exploitation, the attacker needs to: (i) massage the x and y operands to inject the desired z value into the victim transient path; and (ii) target a victim gadget so that the injected value yields a security-sensitive trace, which can be observed with FLUSH + RELOAD or other microarchitectural attacks.
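As a small illustration of step (1), the sketch below checks in software whether a division involves a denormal value. It assumes, as a simplification, that a division counts as a denormal FP operation when either operand or the result is denormal, and that double is an IEEE-754 binary64 value evaluated with default (non flush-to-zero) settings. The helpers is_denormal and division_involves_denormal are hypothetical names; the code only classifies values and does not itself demonstrate any transient behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(double) == sizeof(uint64_t), "binary64 double expected");

/* A binary64 value is denormal when its exponent field is 0 and its fraction is not. */
static bool is_denormal(double v)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);
    uint64_t exponent = (bits >> 52) & 0x7ffu;
    uint64_t fraction = bits & 0xfffffffffffffull; /* low 52 bits */
    return exponent == 0 && fraction != 0;
}

/* True if x / y has a denormal operand or produces a denormal result. */
static bool division_involves_denormal(double x, double y)
{
    volatile double z = x / y; /* volatile: keep the result in binary64 format */
    return is_denormal(x) || is_denormal(y) || is_denormal(z);
}
```

For example, dividing the smallest normal double (0x1p-1022) by 2.0 produces a denormal result, whereas 1.0 / 3.0 involves no denormal value.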

Our primitive bears similarities with Load-Value-Injection (LVI) [71], since both allow attackers to inject controlled values into transient execution. Moreover, both primitives require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues, and hence the triggering conditions, are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. Indeed, while it is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild (much like Spectre [41], LVI [71], etc., this would require gadget scanners, subject of orthogonal research [56, 58]), we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all the prior Mozilla Firefox mitigations against transient execution attacks [52].

Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x,y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE-754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.

To verify the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see gadget in Figure 8).
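The architectural half of this example can be checked without any transient-execution machinery. The following is a minimal C sketch, assuming default IEEE-754 behavior (no flush-to-zero or denormals-are-zero mode, no fast-math compiler flags); all helper names are ours and purely illustrative. It only reproduces the architectural result: the transient value 0xfffb0deadbeef000 is a microarchitectural artifact that this code cannot observe.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NEG_INFINITY_BITS UINT64_C(0xfff0000000000000)

/* Reinterpret a 64-bit pattern as an IEEE-754 double (hypothetical helper). */
static double bits_to_double(uint64_t bits) {
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Reinterpret a double as its raw 64-bit pattern (hypothetical helper). */
static uint64_t double_to_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* A denormal has an all-zero exponent field and a non-zero fraction. */
static int is_denormal_bits(uint64_t bits) {
    uint64_t exponent = (bits >> 52) & 0x7ff;
    uint64_t fraction = bits & ((UINT64_C(1) << 52) - 1);
    return exponent == 0 && fraction != 0;
}

/* Architectural result of the example division from the text. */
static uint64_t example_division_bits(void) {
    /* volatile keeps the compiler from folding the division at build time */
    volatile double x = bits_to_double(UINT64_C(0xc000e8b2c9755600)); /* about -2.11 */
    volatile double y = bits_to_double(UINT64_C(0x0004000000000000)); /* 2^-1024 */
    return double_to_bits(x / y);
}
```

Under these assumptions, example_division_bits() should return the -Infinity pattern, since dividing a value of magnitude around 2 by the denormal 2^-1024 overflows the double range.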

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the attribute length of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 52 bits set to zero, a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables, such as string pointers, can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on a FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.
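The NaN-boxing arithmetic described above can be sketched in a few lines of C. This is an illustrative decoder, not SpiderMonkey code: the function and macro names are hypothetical, and only the 17/47-bit split, the 0x1fff0 threshold, and the 4-byte length offset are taken from the text. Note that the five-digit tags quoted in the text (0xfffb0, 0xfff88) are the leading hex digits of the raw 64-bit value, whereas the 17-bit tag compared against 0x1fff0 is the value shifted right by 47.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TAG_SHIFT       47                               /* low 47 bits hold the payload */
#define TAG_MAX_DOUBLE  0x1fff0u                         /* threshold stated in the text */
#define PAYLOAD_MASK    ((UINT64_C(1) << TAG_SHIFT) - 1)
#define LENGTH_OFFSET   4                                /* length field offset from the text */

/* 17-bit type tag stored in the highest bits. */
static uint32_t nanbox_tag(uint64_t bits) { return (uint32_t)(bits >> TAG_SHIFT); }

/* 47-bit payload (e.g., a pointer to the string descriptor). */
static uint64_t nanbox_payload(uint64_t bits) { return bits & PAYLOAD_MASK; }

/* Values whose tag does not exceed the threshold are plain doubles. */
static bool nanbox_is_double(uint64_t bits) { return nanbox_tag(bits) <= TAG_MAX_DOUBLE; }

/* Five leading hex digits, matching the notation used in the text.
   The lowest 3 of these 20 bits overlap the payload, so this equals the
   quoted tag only when the payload's top bits are zero. */
static uint32_t nanbox_top20(uint64_t bits) { return (uint32_t)(bits >> 44); }

/* Address read by a length access on a value treated as a boxed string. */
static uint64_t transient_length_address(uint64_t bits) {
    assert(!nanbox_is_double(bits));
    return nanbox_payload(bits) + LENGTH_OFFSET;
}
```

For the transient value 0xfffb0deadbeef000, this decoder yields a payload of 0xdeadbeef000 and a length address of 0xdeadbeef004, while the architectural -Infinity result (0xfff0000000000000) is classified as a plain double.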

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0, and by wrapping this primitive with a variant of an EVICT+RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

Table 3: Tested processors.

Processor                              Microcode    SCSB vulnerable   FPVI vulnerable
Intel Core i7-10700K                   0xe0         ✓                 ✓
Intel Xeon Silver 4214                 0x500001c    ✓                 ✓
Intel Core i9-9900K                    0xde         ✓                 ✓
Intel Core i7-7700K                    0xca         ✓                 ✓
AMD Ryzen 5 5600X                      0xa201009    ✓                 ✓†
AMD Ryzen 2990WX                       0x800820b    ✓                 ✓†
AMD Ryzen 7 2700X                      0x800820b    ✓                 ✓†
Broadcom BCM2711 Cortex-A72 (ARM v8)   -            ✗§                ✗

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified any serialization instruction such as lfence, mfence, or cpuid placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. Such function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows JavaScript execution time is heavily dominated by JIT code generation/execution and code updates have negligible impact. Our results show this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.
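As a rough illustration of this kind of 1-line change, the C sketch below shows a hypothetical flush helper that issues an lfence after a code update on x86-64. It is not the V8 or SpiderMonkey implementation: the function names are invented for this example, the non-x86 fallback relies on the GCC/Clang builtin __builtin___clear_cache, and the empty inline asm statement is a GNU-style compiler barrier added on the assumption that the compiler should not move the code stores past the fence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* _mm_lfence */
#endif

/* Hypothetical flush helper, called after every code update. */
static void flush_icache_hardened(void *start, size_t len) {
    assert(start != NULL);
    /* Compiler barrier: keep the preceding code stores before the fence. */
    __asm__ __volatile__("" ::: "memory");
#if defined(__x86_64__)
    (void)len;
    _mm_lfence();        /* serialize so no stale code runs transiently */
#else
    /* Architectures without coherent instruction caches need an explicit flush. */
    __builtin___clear_cache((char *)start, (char *)start + len);
#endif
}

/* Hypothetical patching routine: write new code bytes, then flush. */
static void patch_code(unsigned char *code, const unsigned char *new_bytes, size_t len) {
    memcpy(code, new_bytes, len);
    flush_icache_hardened(code, len);
}
```

A JIT engine would call the flush helper once per code update, which is consistent with the negligible overhead reported above when updates are rare compared to code generation and execution.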

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
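The first option can be sketched in C as follows. This is a minimal example, assuming an x86-64 target where scalar double arithmetic uses SSE and is therefore governed by the MXCSR register; the function names are hypothetical, and the two bit masks are the MXCSR Flush-to-Zero (bit 15) and Denormals-are-Zero (bit 6) flags. On other targets the helper does nothing and reports that no change was made.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr */
#endif

#define MXCSR_DAZ 0x0040u   /* Denormals-are-Zero: denormal inputs read as zero */
#define MXCSR_FTZ 0x8000u   /* Flush-to-Zero: denormal results replaced by zero */

/* Enable FTZ and DAZ for the calling thread; returns 1 if applied, 0 otherwise. */
static int enable_ftz_daz(void) {
#if defined(__x86_64__)
    _mm_setcsr(_mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ);
    return 1;
#else
    return 0;
#endif
}

/* Multiply the denormal 2^-1024 by two; volatile prevents constant folding. */
static double denormal_times_two(void) {
    uint64_t bits = UINT64_C(0x0004000000000000);
    double tmp;
    assert(sizeof tmp == sizeof bits);
    memcpy(&tmp, &bits, sizeof tmp);
    volatile double x = tmp;
    volatile double two = 2.0;
    return x * two;
}
```

Before the flags are set, the product is a tiny non-zero denormal; after enable_ftz_daz() succeeds, the denormal input is treated as zero and the product is zero. The setting is per thread, which is one reason such a mitigation has to be applied selectively.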

A more general and automated mitigation is for the compiler to place a serializing instruction such as lfence after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable and those resistant to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76]. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by a MDS attacker. For the latter, we limit ourselves to load sinks to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that, on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.
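The source/sink strategy described above can be illustrated with a toy taint analysis over a straight-line instruction sequence. This C sketch is not the LLVM pass: the instruction encoding, register model, and function names are invented for illustration, and it assumes that a fence makes all previously computed values architectural, so taint is cleared once a fence is placed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NREGS 16

enum op_kind { OP_FP, OP_ALU, OP_LOAD, OP_STORE, OP_IBRANCH };

/* Toy instruction: dst is the written register (or -1); src is the register
   used as ALU operand, load/store address, or indirect branch target (or -1). */
struct insn {
    enum op_kind kind;
    int dst;
    int src;
};

/* Count the fences needed so that no FP-derived value reaches a sink unfenced. */
static size_t count_fences(const struct insn *code, size_t n, bool mds_vulnerable) {
    bool tainted[NREGS] = { false };
    size_t fences = 0;
    for (size_t i = 0; i < n; i++) {
        const struct insn *in = &code[i];
        assert(in->src < NREGS && in->dst < NREGS);
        bool src_tainted = in->src >= 0 && tainted[in->src];
        /* Loads and indirect branches are always sinks; stores only on
           MDS-vulnerable systems. */
        bool is_sink = in->kind == OP_LOAD || in->kind == OP_IBRANCH ||
                       (in->kind == OP_STORE && mds_vulnerable);
        if (is_sink && src_tainted) {
            fences++;                        /* fence placed before the sink */
            for (int r = 0; r < NREGS; r++)
                tainted[r] = false;          /* values are now architectural */
            src_tainted = false;
        }
        if (in->dst >= 0)
            tainted[in->dst] = in->kind == OP_FP ||
                               (in->kind == OP_ALU && src_tainted);
    }
    return fences;
}
```

For a sequence with an FP-derived store address followed by an FP-derived load address, this toy model places two fences on an MDS-vulnerable system and one on an MDS-resistant system, mirroring the difference in sink sets described in the text.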

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact (up to 280%), despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance impact difference between MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by operating interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

[Figure 9 plot: bar chart of overhead (y-axis, ticks from 0 to 800) per benchmark for four LLVM mitigation configurations: LVI, FPVI (MDS-vulnerable system), FPVI (MDS-resistant system), and LVI+FPVI (MDS-vulnerable system). SPEC FP 2017: 619.lbm_s, 638.imagick_s, 644.nab_s, SPEC17 geomean. SPEC FP 2006: 433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, SPEC06 geomean.]

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0.

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, a MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root-cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronyms descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It is worth noting that, in Figure 10, there are two separate predictors subclasses under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74-76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a HW interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less than ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived by likely invariants [15] used by the CPU. Such invariants commonly hold, but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6-11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74-76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them with the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it with the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd Daniel Genkin and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.
[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From GHz to MHz effective CPU frequencies. arXiv preprint arXiv:2101.01077.
[3] AMD. AMD64 Architecture Programmer's Manual.
[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.
[5] ARM. Architecture Reference Manual for Armv8-A.
[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. Smotherspectre: exploiting speculative execution through port contention. In CCS'19.
[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.
[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.
[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. KASLR: Break it, fix it, repeat. In ACM ASIA CCS 2020.
[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.
[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. SgxPectre: Stealing Intel secrets from SGX enclaves via speculative execution. In 2019 IEEE EuroS&P.
[12] Chrome. V8 TurboFan documentation.
[13] Chromium. Site Isolation documentation.
[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.
[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS 2018.
[16] Christopher Domas. Breaking the x86 ISA. Black Hat USA, 2017.
[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.
[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.
[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.
[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. BranchScope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693-707, 2018.
[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.
[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.
[23] Firefox. Project Fission documentation.
[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.
[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.
[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.
[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.
[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.
[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.
[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
[31] Jann Horn. Reading privileged memory with a side-channel. 2018.
[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.
[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.
[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.
[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.
[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.
[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.
[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.
[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.
[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.
[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.
[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.
[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.
[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts, October 1, 2013. US Patent 8,549,263.


[45] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.
[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.
[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.
[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.
[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.
[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. MemJam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.
[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358.
[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111.
[53] Mozilla. Spider Monkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h.
[54] Mozilla. SpiderMonkey IonMonkey documentation.
[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.
[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. SpecFuzz: Bringing Spectre-type vulnerabilities to the surface. In USENIX Security 20.
[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the CPU on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.
[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. SpecTaint: Speculative taint analysis for discovering Spectre gadgets. 2021.
[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.
[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. SoK: In search of lost time: A review of JavaScript timers in browsers. In IEEE EuroS&P'21.
[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS 2021.
[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers, 2001. US Patent 6,219,684.
[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.
[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in JavaScript. In FC, IFCA, 17.
[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. NetSpectre: Read arbitrary memory over network. In ESORICS 19.
[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.
[67] Julian Stecklina and Thomas Prescher. LazyFP: Leaking FPU register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.
[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. MeltdownPrime and SpectrePrime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.
[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. SoK: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.
[70] Google V8. test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8.
[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. LVI: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P.
[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary CPU interrupt logic. In ACM CCS 2018.
[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-Step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.
[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. SGAxe: How SGX fails in practice.
[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.
[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. CacheOut: Leaking data on Intel CPUs via cache evictions. arXiv preprint.
[77] WebKit. BrowserBench. https://browserbench.org.
[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.
[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.
[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.
[81] Yuval Yarom and Katrina Falkner. Flush+Reload: a high resolution, low noise, L3 cache side-channel attack. In USENIX Security 14.
[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                        ; rdi: store addr, rsi: load addr
        %rep 10                   ; Trick to delay the store address
        imul rdi, 1
        %endrep
        mov DWORD [rdi], 0x42     ; Store
        mov eax, DWORD [rsi]      ; Load
        %rep 10                   ; Pronounce load timing
        imul eax, 1
        %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturation counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that the saturating counter is set to 0
    for (i = 0; i < 100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for (i = 100; i < 120; i++) st_ld(mem, mem + 64);
    // Trigger a memory ordering machine clear
    for (i = 120; i < 130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
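The counter behavior described above can be captured in a small software model. The following is a minimal sketch in C, not the hardware implementation; the names md_counter, md_predicts_hoist, and md_update are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MD_COUNTER_MAX 15u /* 4-bit saturating counter: values 0..15 */

/* Hypothetical software model of one per-address predictor entry. */
typedef struct {
    uint8_t value;
} md_counter;

/* Hoisting is predicted only when the counter is saturated. */
static bool md_predicts_hoist(const md_counter *c) {
    return c->value == MD_COUNTER_MAX;
}

/* Update after a load completes: reset on aliasing, otherwise count up. */
static void md_update(md_counter *c, bool load_aliased_store) {
    if (load_aliased_store) {
        c->value = 0;
    } else if (c->value < MD_COUNTER_MAX) {
        c->value++;
    }
}
```

In this model, 15 consecutive non-aliasing updates starting from 0 saturate the counter, and a single aliasing load resets it, which mirrors the three phases of Listing 5.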

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld, 100 to 130; y-axis: clock cycles). Orange bar: memory ordering machine clear observed.

Listing 6: Snippet observing activation of the never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            st_ld(mem, mem + 64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld, 0 to 120; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters) and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe whether two unrelated loads in these functions are able to influence each other when varying the number of nops. In our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256·k bytes apart in memory (k ∈ N). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
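The observed indexing can be written down as a one-line hash. This is a minimal sketch in C under the assumption that the index is exactly the lowest 8 bits of the load's instruction pointer, as the measurements above suggest; md_predictor_index and md_loads_collide are hypothetical names.

```c
#include <stdint.h>

/* 256 predictors were observed on the tested CPUs (the patent suggests 64). */
#define MD_NUM_PREDICTORS 256u

/* Assumed hash: the lowest 8 bits of the load's instruction pointer. */
static unsigned md_predictor_index(uintptr_t load_ip) {
    return (unsigned)(load_ip & (MD_NUM_PREDICTORS - 1u));
}

/* Two loads share a predictor entry when their addresses are 256*k bytes apart. */
static int md_loads_collide(uintptr_t ip_a, uintptr_t ip_b) {
    return md_predictor_index(ip_a) == md_predictor_index(ip_b);
}
```

Under this assumption, an attacker-controlled load placed at any address congruent to the victim load's address modulo 256 would train the same counter.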

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that after 4 machine clears the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0, and disabled otherwise. The flush counter is decremented every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measure how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that, to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads, ticks at 15, 79, 143, 207, 271, 335, 399; y-axis: total measured machine clears, 1 to 4).
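The interplay between the per-address counter and the watchdog can be checked with a simplified software model. The sketch below is an illustration in C, not the hardware logic: the md_model structure and its functions are hypothetical, and it assumes that would-be-correct predictions are counted globally and are not reset by a machine clear, a detail the measurements above do not settle.

```c
#include <stdbool.h>

#define MD_COUNTER_MAX      15 /* 4-bit per-address saturating counter */
#define FLUSH_COUNTER_MAX    3 /* 2-bit flush counter; -1 is the never-hoisting state */
#define LOADS_PER_INCREMENT 64 /* correct predictions per flush-counter increment */

/* Hypothetical model: one per-address entry plus the global watchdog. */
typedef struct {
    int per_address;         /* 0..15 */
    int flush;               /* -1..3 */
    int correct_predictions; /* correct predictions since the last increment */
} md_model;

/* A load is hoisted only if the watchdog allows it and the counter is saturated. */
static bool md_will_hoist(const md_model *m) {
    return m->flush >= 0 && m->per_address == MD_COUNTER_MAX;
}

/* Feed one store-load pair into the model; returns true on a machine clear. */
static bool md_step(md_model *m, bool aliases) {
    bool predicted_hoist = (m->per_address == MD_COUNTER_MAX);
    bool machine_clear = md_will_hoist(m) && aliases;

    if (machine_clear) {
        m->flush--; /* every MD machine clear decrements the flush counter */
    } else if (predicted_hoist && !aliases) {
        /* Correct (or would-be-correct) hoisting prediction. */
        if (++m->correct_predictions == LOADS_PER_INCREMENT) {
            m->correct_predictions = 0;
            if (m->flush < FLUSH_COUNTER_MAX) m->flush++;
        }
    }

    /* Per-address counter: reset on aliasing, otherwise saturating increment. */
    if (aliases) {
        m->per_address = 0;
    } else if (m->per_address < MD_COUNTER_MAX) {
        m->per_address++;
    }
    return machine_clear;
}

/* Non-aliasing loads needed to go from never-hoisting to predict-hoisting. */
static int md_loads_to_leave_never_hoisting(void) {
    md_model m = { 0, -1, 0 };
    int n = 0;
    while (!md_will_hoist(&m)) {
        md_step(&m, false);
        n++;
    }
    return n;
}
```

Under these assumptions, md_loads_to_leave_never_hoisting() reproduces the 15+64 figure, and replaying the access pattern of Listing 6 from a full flush counter produces 4 machine clears before hoisting is disabled.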

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for (i = 0; i < 10; i++) st_ld(mem + 0x1000, mem + 0x1000);
    for (i = 10; i < 20; i++) st_ld(mem + 0x2000, mem + 0x2000);
    // Cause 4k aliasing
    for (i = 20; i < 40; i++) st_ld(mem + 0x1000, mem + 0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld, 0 to 40; y-axis: clock cycles). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value and, in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load; (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.

Figure 15: Memory disambiguation unit, simplified view.

B Root-Causes Description Table

Acronym        Description

BHT            Branch History Table
BTB            Branch Target Buffer
RSB            Return Stack Buffer
MD             Memory Disambiguation Unit
NM             Device Not Available Exception
DE             Divide-by-Zero Exception
UD             Invalid Opcode Exception
GP             General Protection Fault
AC             Alignment Check Exception
SS             Stack Segment Exception
PF             Page Fault
PF - US Bit    Page Table Entry User/Supervisor Bit PF
PF - RW Bit    Page Table Entry Read/Write Bit PF
PF - P Bit     Page Table Entry Present Bit PF
PF - PKU       Page Table Entry Protection Keys PF
BR             Bound Range Exceeded Exception
FP             Floating Point Assist
SMC            Self-Modifying Code
XMC            Cross-Modifying Code
MO             Memory Ordering Principles Violation
MASKMOV        Masked Load/Store Instruction Assist
AD Bits        Page Table Entry Access/Dirty Bits Assist
TSX            Intel TSX Transaction Abort
UC             Uncachable Memory Assist
PRM            Processor Reserved Memory Assist
HW Interrupts  Hardware Interrupts


  • Introduction
  • Background
    • IEEE-754 Denormal Numbers
    • x86 Cache Coherence
    • Memory Ordering
    • Memory Disambiguation
  • Threat Model
  • Machine Clears
  • Self-Modifying Code Machine Clear
  • Floating Point Machine Clear
  • Memory Ordering Machine Clear
  • Memory Disambiguation Machine Clear
  • Other Types of Machine Clear
  • Transient Execution Capabilities
    • Transient Window Size
    • Leakage Rate
  • Attack Primitives
    • Speculative Code Store Bypass (SCSB)
    • Floating Point Value Injection (FPVI)
  • Mitigations
    • SCSB Mitigation
    • FPVI Mitigation
  • Root Cause-based Classification
  • Related Work
  • Conclusions
  • Reversing Memory Disambiguation
  • Root-Causes Description Table

Figure 5: Leakage rate vs. mechanism with a 1-bit, 4-bit, or 8-bit FLUSH + RELOAD cache covert channel (Intel Core i9). x-axis: transient execution management mechanism (F+R, TSX, BHT, FP, SMC, XMC, MO, MD, FAULT); y-axis: leakage rate [Mb/s], 0 to 4; series: F+R granularity of 1, 4, and 8 bits.

mance impact of the corresponding machine clears on Intel vs. AMD microarchitectures.

As shown in Figure 5, the leakage rate instead varies greatly across the different mechanisms, and so does the optimal covert channel bitwidth. Indeed, while existing attacks typically rely on 8-bit covert channels, our results suggest 1-bit or 4-bit channels can be much more efficient, depending on the specific mechanism. Roughly speaking, optimal leakage rates can be obtained by balancing the time complexity (and hence bitwidth) of the covert channel with that of the mechanism. For example, FP is a lightweight and reliable mechanism, hence using a comparably fast and narrow 1-bit covert channel is beneficial. In contrast, MD requires a time-consuming predictor training phase between leak iterations, and leaking more bits per iteration with a 4-bit covert channel is more efficient. Interestingly, a classic 8-bit covert channel yields consistently worse and comparable leakage rates across all the mechanisms, since F+R dominates the execution time.

Our results show that only two mechanisms (TSX and FP) are close to the maximum theoretical leakage rate of pure F+R. Moreover, FP performs as efficiently as TSX but, unlike TSX, is available on both Intel and AMD, is always enabled, and can be used from managed code (e.g., JavaScript). BHT, on the other hand, yields the worst leakage rates due to the inefficient training-based transient window. The BHT leakage rate can be improved if a tailored mistraining sequence is used, as in MD (Appendix A). Overall, our results show that machine clear-based windows achieve comparable and, in many cases (e.g., FP), better leakage rates compared to traditional mechanisms. Moreover, many machine clears eliminate the need for mistraining, which, other than resulting in efficient leakage rates, can escape existing pattern-based mitigations and disabled hardware extensions (e.g., Intel TSX).

11 Attack Primitives

Building on our reverse engineering results and focusing on the unexplored SMC and FP machine clears, we now present two new transient execution attack primitives and analyze their security implications. We also present an end-to-end FPVI exploit disclosing arbitrary memory in Firefox. Later, we discuss mitigations.

11.1 Speculative Code Store Bypass (SCSB)

Our first attack primitive, Speculative Code Store Bypass (SCSB), allows an attacker to execute stale, controlled code in a transient execution window originated by an SMC machine clear. Since the primitive relies on SMC, its primary applicability is on JIT (e.g., JavaScript) engines running attacker-controlled code, although OS kernels and hypervisors storing code pages and allowing their execution without first issuing a serializing instruction are also potentially affected.

Figure 6: SCSB primitive example, where the instruction pointer is pointing at the bold code blocks. g code is freed JIT'ed code (of some g function) under the attacker's control. (1) Force the engine to JIT and execute the code of function f, causing desynchronization of code and data views. (2) Execute stale code and SMC MC. (3) After the SMC MC, code and data view coherence is restored and the new code is executed.

As exemplified in Figure 6, the operations of the primitive can be broken down into three steps: (1) the JIT engine compiles a function f, storing the generated code into a JIT code cache region previously used by a (now-stale) version of function g; (2) the JIT engine jumps to the newly generated code for the function f, but due to the temporary desynchronization between the code and data views of the CPU, this causes transient execution of the stale code of g until the SMC machine clear is processed; (3) after the pipeline flush, the code and data views are resynchronized and the CPU restarts the execution of the correct code of f. For exploitation, the attacker needs to: (i) massage the JIT code cache allocator to reuse a freed region with a target gadget g of choice; (ii) force the JIT engine to generate and execute new (f) code in such a region, enabling transient out-of-context execution of the gadget and spilling secrets into the microarchitectural state.

Our primitive bears similarities with both transient and architectural primitives used in prior attacks. On the transient front, our primitive is conceptually similar to a Speculative-Store-Bypass (SSB) primitive [55, 82], but can transiently execute stale code rather than reading stale data. However, interestingly, the underlying causes of the two primitives are quite different (MD misprediction vs. SMC machine clear). On the architectural front, our primitive mimics classic Use-After-Free (UAF) exploitation on the JIT code cache, also known as Return-After-Free (RAF) in the hackers community [19, 25]. An example is CVE-2018-0946, where a use-after-free vulnerability can be exploited to force the Chakra JS engine to erroneously execute freed (attacker-controlled) JIT code, resulting in arbitrary code execution after massaging the right gadget into the JIT code cache [24].

Figure 7: Coding options suggested by the Intel Architectures Software Developer's Manuals to handle SMC and XMC execution. Option 1 describes the exact steps required by our Speculative Code Store Bypass attack primitive, potentially resulting in exploitable gadgets.

Indeed, at a high level, SCSB yields a transient use-after-free primitive on a JIT code cache, with exploitation properties similar to its architectural counterpart. However, there are some differences due to the transient nature of SCSB. First, we need to find an out-of-context gadget to transiently leak data rather than architecturally execute arbitrary code. In JavaScript engines, similar gadgets have already been exploited by ret2spec [48], escalating out-of-context transient execution of valid code to type confusion, arbitrary reads, and ultimately a secret-dependent load to transmit the value.

Second, we need to target short-lived JIT code cache updates, so that the newly generated code is immediately executed, fitting the target gadget in the resulting transient execution window. Interestingly, executing JIT'ed code is the next step after JIT code generation mentioned in the Intel manual (Figure 7, Option 1), suggesting that developers that follow such directives ad litteram can easily introduce such gadgets. Moreover, modern JavaScript compilers feature a multi-stage optimization pipeline [12, 54], and short-lived JIT code cache updates are favored for just-in-time (re)optimization. Indeed, we found test code in V8 to specifically test such updates and verify instruction/data cache coherence [70].

Finally, we need to ensure JIT code cache updates are not accompanied by barriers that force immediate synchronization of the code and data views. We analyzed the code of the popular SpiderMonkey and V8 JavaScript engines and verified that the functions called upon JIT code cache updates to synchronize instruction and data caches (Listing 2 and Listing 3) are always empty on x86 (as expected, given Intel's primary recommendation in Figure 7).

Listing 2: Chromium instruction cache flush (chromium/src/v8/src/codegen/x64/cpu-x64.cc).

    void CpuFeatures::FlushICache(void* start, size_t size) {
      // No need to flush the instruction cache on Intel
    }

Listing 3: Firefox instruction cache flush (mozilla-unified/js/src/jit/FlushICache.h).

    inline void FlushICache(void* code, size_t size,
                            bool codeIsThreadLocal = true) {
      // No-op. Code and data caches are coherent on x86 and x64.
    }

Overall, while the exploitation is far from trivial (i.e., having to address the challenges of traditional use-after-free exploitation as well as transient execution exploitation in the browser), we believe SCSB expands the attack surface of transient execution attacks in the browser. We found a number of candidate SCSB gadgets in real-world code, but none of them was ultimately exploitable due to the coincidental presence of some serializing instruction preventing stale code execution. Nonetheless, mitigations are needed to enforce security-by-design. Luckily, as shown later, SCSB is amenable to a practical and efficient mitigation (i.e., at a similar cost as faced by non-x86 architectures). The implementation/performance cost for mitigation is even lower than for its sibling SSB primitive, whose mitigation has been deployed in practice even in absence of known practical exploits [55].

In Table 3, we show that all tested Intel and AMD processors are affected by SCSB. In contrast, ARM is not vulnerable, since SMC updates require explicit software barriers.

11.2 Floating Point Value Injection (FPVI)

Our second attack primitive, Floating Point Value Injection (FPVI), allows an attacker to inject arbitrary values into a transient execution window originated by an FP machine clear. As exemplified in Figure 8, the operations of the primitive can be broken down into four steps: (1) the attacker triggers the execution of a gadget starting with a denormal FP operation in the victim application, with the x and y operands under the attacker's control; (2) the transient z result of the operation is processed by the subsequent gadget instructions, leaving a microarchitectural trace; (3) the CPU detects the error condition (i.e., wrong result of a denormal operation), triggering a machine clear and thus a pipeline flush; (4) the CPU re-executes the entire gadget with the correct architectural z result. For exploitation, the attacker needs to: (i) massage the x and y operands to inject the desired z value into the victim transient path; and (ii) target a victim gadget so that the injected value yields a security-sensitive trace, which can be observed with FLUSH + RELOAD or other microarchitectural attacks.

Our primitive bears similarities with Load-Value-Injection (LVI) [71], since both allow attackers to inject controlled values into transient execution. Moreover, both primitives require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues, and hence the triggering conditions, are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. Indeed, while it is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild (much like Spectre [41], LVI [71], etc., this would require gadget scanners, subject of orthogonal research [56, 58]), we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all the prior Mozilla Firefox mitigations against transient execution attacks [52].

Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x, y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE-754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case, a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.
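The tag/payload split described above can be illustrated with a few bit operations. This is a minimal sketch in C, not SpiderMonkey's code: the nanbox_* functions and the TAG_* and PAYLOAD_MASK names are hypothetical, and it assumes that the five-digit values 0xfff88 and 0xfffb0 in the text denote the top 20 bits of the 64-bit value, so that dropping their low 3 bits gives the 17-bit tag.

```c
#include <stdbool.h>
#include <stdint.h>

#define TAG_SHIFT            47       /* highest 17 bits hold the tag */
#define JSVAL_TAG_MAX_DOUBLE 0x1fff0u /* tags up to this value are plain doubles */

/* Assumption: the text's 20-bit prefixes, reduced to 17-bit tags. */
#define TAG_INT32  (0xfff88u >> 3)
#define TAG_STRING (0xfffb0u >> 3)

#define PAYLOAD_MASK ((UINT64_C(1) << TAG_SHIFT) - 1) /* lowest 47 bits */

static uint32_t nanbox_tag(uint64_t bits) {
    return (uint32_t)(bits >> TAG_SHIFT);
}

static uint64_t nanbox_payload(uint64_t bits) {
    return bits & PAYLOAD_MASK;
}

static bool nanbox_is_double(uint64_t bits) {
    return nanbox_tag(bits) <= JSVAL_TAG_MAX_DOUBLE;
}

static bool nanbox_is_string(uint64_t bits) {
    return nanbox_tag(bits) == TAG_STRING;
}
```

With this decoding, the bit pattern 0xfffb0deadbeef000 is classified as a string whose payload is the pointer 0xdeadbeef000, whereas -Infinity (0xfff0000000000000) is classified as a double.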

To verify the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see gadget in Figure 8).
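The architectural half of this example can be reproduced in ordinary C; the transient result cannot, since it is only visible through a covert channel. The sketch below assumes IEEE-754 doubles with default rounding and denormal handling; the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double and back, without aliasing issues. */
static double bits_to_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

static uint64_t double_to_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* A double is denormal when its exponent field is zero and its mantissa is not. */
static int is_denormal_bits(uint64_t bits) {
    return ((bits >> 52) & 0x7ff) == 0 &&
           (bits & UINT64_C(0x000fffffffffffff)) != 0;
}

/* Divide the two operands from the text and return the architectural result bits. */
static uint64_t fpvi_example_architectural_bits(void) {
    /* volatile keeps the division at run time instead of compile time */
    volatile double x = bits_to_double(UINT64_C(0xc000e8b2c9755600));
    volatile double y = bits_to_double(UINT64_C(0x0004000000000000));
    volatile double z = x / y;
    return double_to_bits(z);
}
```

The divisor is a denormal value (2^-1024), so the quotient overflows and the architectural result is the bit pattern of -Infinity, 0xfff0000000000000.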

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the length attribute of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 52 bits set to zero, a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables such as string pointers can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on an FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0 and, by wrapping this primitive with a variant of an EVICT + RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

Table 3: Tested processors.

Processor                              Microcode    SCSB vulnerable   FPVI vulnerable

Intel Core i7-10700K                   0xe0         ✓                 ✓
Intel Xeon Silver 4214                 0x500001c    ✓                 ✓
Intel Core i9-9900K                    0xde         ✓                 ✓
Intel Core i7-7700K                    0xca         ✓                 ✓
AMD Ryzen 5 5600X                      0xa201009    ✓                 ✓ †
AMD Ryzen 2990WX                       0x800820b    ✓                 ✓ †
AMD Ryzen 7 2700X                      0x800820b    ✓                 ✓ †
Broadcom BCM2711 Cortex-A72 (ARM v8)                ✗ §               ✗

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified that any serialization instruction, such as lfence, mfence, or cpuid, placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. Such a function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows JavaScript execution time is heavily dominated by JIT code generation/execution, and code updates have negligible impact. Our results show this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.
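The one-line change can be sketched as follows. This is an illustration in C assuming GCC/Clang inline assembly syntax, not the actual V8 or SpiderMonkey patch; flush_icache_hardened is a hypothetical name, and on non-x86 targets the function is left empty here only so that the sketch compiles.

```c
#include <stddef.h>

/*
 * Hypothetical hardened variant of an otherwise empty x86 FlushICache hook:
 * the fence is issued after the code bytes have been written and before
 * the freshly generated code is executed.
 */
static inline void flush_icache_hardened(void *start, size_t size) {
    (void)start; /* unused: x86 keeps code and data caches coherent */
    (void)size;
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("lfence" ::: "memory");
#endif
}
```

A JIT engine would call such a hook right after each code cache update, which is where the existing (empty) functions in Listings 2 and 3 are already invoked.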

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
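The masking option can be illustrated at the bit level. The sketch below is a branch-free C rendering of the idea, not the SpiderMonkey mitigation [51]: canonicalize_nan_bits is a hypothetical name, the canonical pattern 0xfff8000000000000 is the x86 QNaN floating-point indefinite encoding, and plain C offers no guarantee that the compiler emits a cmov or that the sequence is safe under transient execution.

```c
#include <stdint.h>

#define F64_EXP_MASK    UINT64_C(0x7ff0000000000000)
#define F64_MANT_MASK   UINT64_C(0x000fffffffffffff)
#define QNAN_INDEFINITE UINT64_C(0xfff8000000000000)

/*
 * Replace any NaN bit pattern with the canonical QNaN indefinite value and
 * leave every other pattern untouched, using masks instead of branches.
 */
static uint64_t canonicalize_nan_bits(uint64_t bits) {
    uint64_t exp_all_ones = (uint64_t)((bits & F64_EXP_MASK) == F64_EXP_MASK);
    uint64_t mant_nonzero = (uint64_t)((bits & F64_MANT_MASK) != 0);
    /* all ones when the value is a NaN, all zeros otherwise */
    uint64_t nan_mask = (uint64_t)0 - (exp_all_ones & mant_nonzero);
    return (bits & ~nan_mask) | (QNAN_INDEFINITE & nan_mask);
}
```

Applied to the result of an FP operation, such a mask would turn a NaN-boxed pattern like 0xfffb0deadbeef000 into the canonical NaN, while ordinary doubles and infinities pass through unchanged.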

A more general and automated mitigation is for the compiler to place a serializing instruction such as lfence after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable and those resistant to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76]. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by an MDS attacker. For the latter, we limit ourselves to load sinks to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact (up to 280%), despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance impact difference between MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by operating interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0. x-axis: SPEC FP 2017 (619.lbm_s, 638.imagick_s, 644.nab_s, SPEC17 geomean) and SPEC FP 2006 (433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, SPEC06 geomean); y-axis: overhead, 0 to 800; series (LLVM mitigation): LVI, FPVI (MDS-vulnerable system), FPVI (MDS-resistant system), LVI+FPVI (MDS-vulnerable system).

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, an MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient executionexploits we propose the orthogonal root-cause-based classifi-cation in Figure 10 We draw from the Intel terminology todefine the two main classes of root causes of Bad Specula-tion (ie transient execution) Control-Flow Misprediction(ie branch misprediction) and Data Misprediction (ie ma-chine clear) Based on these two main classes we observethat all the known root causes of transient execution paths canbe classified into the following four subclasses PredictorsExceptions Likely invariants violations and Interrupts

Figure 10 A root cause-centric classification of known tran-sient execution paths (causes analyzed in the paper in bold)Acronyms descriptions can be found in Appendix B

Predictors. This category includes the prediction-based causes of bad speculation due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It is worth noting that, in Figure 10, there are two separate predictor subclasses under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74–76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a hardware interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less-than-ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived from likely invariants [15] used by the CPU. Such invariants commonly hold but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown that such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them through the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it through the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd, Daniel Genkin, and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From GHz to MHz effective CPU frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. Smotherspectre: Exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. KASLR: Break it, fix it, repeat. In ACM ASIA CCS 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. SgxPectre: Stealing Intel secrets from SGX enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS 2018.

[16] Christopher Domas. Breaking the x86 ISA. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.

[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. BranchScope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts, October 1, 2013. US Patent 8,549,263.


[45] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.

[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.

[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.

[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.

[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. MemJam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358.

[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111.

[53] Mozilla. SpiderMonkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h.

[54] Mozilla. SpiderMonkey IonMonkey documentation.

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. SpecFuzz: Bringing Spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the CPU on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. SpecTaint: Speculative taint analysis for discovering Spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. SoK: In search of lost time: A review of JavaScript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers, 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in JavaScript. In FC, IFCA 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. NetSpectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. LazyFP: Leaking FPU register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. MeltdownPrime and SpectrePrime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. SoK: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8.

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. LVI: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary CPU interrupt logic. In ACM CCS 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-Step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. SGAxe: How SGX fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. CacheOut: Leaking data on Intel CPUs via cache evictions. arXiv preprint.

[77] WebKit. BrowserBench. https://browserbench.org.

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-NG: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+Reload: a high resolution, low noise, L3 cache side-channel attack. In USENIX Security 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                        ; rdi: store addr, rsi: load addr
    %rep 10                       ; Trick to delay the store address
        imul rdi, 1
    %endrep
        mov DWORD [rdi], 0x42     ; Store
        mov eax, DWORD [rsi]      ; Load
    %rep 10                       ; Pronounce load timing
        imul eax, 1
    %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturating counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that the saturating counter is set to 0
    for (i = 0; i < 100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for (i = 100; i < 120; i++) st_ld(mem, mem + 64);
    // Trigger a memory ordering machine clear
    for (i = 120; i < 130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
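The counter behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the type and function names (md_counter, md_predict_hoist, md_update, md_loads_until_hoist) are hypothetical, and the model captures the observed counting rule, not the actual hardware implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MD_COUNTER_MAX 15u /* largest value of a 4-bit counter */

/* Hypothetical model of one per-address predictor entry. */
typedef struct {
    uint8_t value;
} md_counter;

/* Hoisting is predicted only while the counter is saturated. */
bool md_predict_hoist(const md_counter *c) {
    assert(c != NULL);
    return c->value == MD_COUNTER_MAX;
}

/* Update once the load resolves: count up if it did not alias a
 * preceding store, reset to 0 if it did. */
void md_update(md_counter *c, bool aliased) {
    assert(c != NULL);
    if (aliased) {
        c->value = 0;
    } else if (c->value < MD_COUNTER_MAX) {
        c->value++;
    }
}

/* Non-aliasing loads needed, starting from a reset counter, before
 * the next load would be predicted as hoistable. */
unsigned md_loads_until_hoist(void) {
    md_counter c = {0};
    unsigned loads = 0;
    while (!md_predict_hoist(&c)) {
        md_update(&c, false);
        loads++;
    }
    return loads;
}
```

In this model, md_loads_until_hoist() returns 15, in line with the 15 non-aliasing loads observed in Figure 11, and a single aliasing load resets the entry to the no-hoisting state.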

[Figure 11 plot omitted. X-axis: i-th call to st_ld (100 to 130). Y-axis: clock cycles (0 to 100).]

Figure 11: Timing measurement of the experiment in Listing 5. Orange bar: memory ordering machine clear observed.

Listing 6: Snippet observing activation of the never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            st_ld(mem, mem + 64);
        st_ld(mem, mem);
    }

[Figure 12 plot omitted. X-axis: i-th call to st_ld (0 to 120). Y-axis: clock cycles (0 to 100).]

Figure 12: Never-hoisting state. Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters), and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe whether two unrelated loads in these functions are able to influence each other when varying the number of nops. In our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256k bytes apart in memory (k ∈ N). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except we observed 256 (rather than 64) predictors hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
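As a minimal sketch of the observed hashing, assuming the predictor index is simply the lowest 8 bits of the load's instruction pointer, the collision condition can be written as follows. The function names and the example address are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MD_NUM_PREDICTORS 256u /* observed on the tested CPUs (patent: 64) */

/* Index of the per-address predictor used by a load at load_ip. */
unsigned md_predictor_index(uint64_t load_ip) {
    return (unsigned)(load_ip & (MD_NUM_PREDICTORS - 1u));
}

/* Two loads share a predictor entry when their indices match, i.e.,
 * when their addresses differ by a multiple of 256 bytes. */
bool md_loads_collide(uint64_t ip_a, uint64_t ip_b) {
    return md_predictor_index(ip_a) == md_predictor_index(ip_b);
}

/* Usage example with a hypothetical victim load address. */
void md_hash_example(void) {
    uint64_t victim = 0x401234u;
    assert(md_loads_collide(victim, victim + 3u * 256u));
    assert(!md_loads_collide(victim, victim + 64u));
}
```

Under this assumption, an attacker-controlled load placed at any address congruent to the victim load's address modulo 256 would train the same counter.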

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that, after 4 machine clears, the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

[Figure 13 plot omitted. X-axis: number of independent store-loads (ticks at 15, 79, 143, 207, 271, 335, 399). Y-axis: total measured machine clears (1 to 4).]

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state.

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0 and disabled otherwise. The flush counter is decremented on every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measured how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal that the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that, to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
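The interplay between the per-address counter and the watchdog can be sketched as a toy simulation. Several details here are assumptions made for illustration and are not confirmed by the text: the flush counter is modeled as ranging from -1 to 3 and starting at 3, the count of (would-be) correct predictions is not reset by a machine clear, and all names (md_model, md_step, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { CTR_MAX = 15, FLUSH_MAX = 3, FLUSH_PERIOD = 64 };

typedef struct {
    int ctr;   /* per-address 4-bit saturating counter             */
    int flush; /* flush counter; -1 means never-hoisting state     */
    int good;  /* (would-be) correct hoist predictions accumulated */
} md_model;

/* Simulate one store-load pair; returns true on a machine clear. */
bool md_step(md_model *m, bool aliased) {
    assert(m != NULL);
    bool predicted = (m->ctr == CTR_MAX);       /* per-address guess */
    bool hoisted = predicted && m->flush >= 0;  /* watchdog may veto */
    bool machine_clear = hoisted && aliased;    /* wrong hoist       */

    if (machine_clear) {
        m->flush--;                             /* 3, 2, 1, 0, -1    */
    }
    if (predicted && !aliased && ++m->good == FLUSH_PERIOD) {
        m->good = 0;
        if (m->flush < FLUSH_MAX) m->flush++;
    }
    m->ctr = aliased ? 0 : (m->ctr < CTR_MAX ? m->ctr + 1 : CTR_MAX);
    return machine_clear;
}

/* Listing 6 pattern: per round, 19 non-aliasing pairs, then 1 aliasing. */
int md_count_clears(int rounds) {
    md_model m = { .ctr = 0, .flush = FLUSH_MAX, .good = 0 };
    int clears = 0;
    for (int i = 0; i < rounds; i++) {
        for (int j = 0; j < 19; j++) md_step(&m, false);
        if (md_step(&m, true)) clears++;
    }
    return clears;
}

/* Non-aliasing pairs needed to leave the never-hoisting state. */
int md_loads_to_leave_never_hoisting(void) {
    md_model m = { .ctr = 0, .flush = -1, .good = 0 };
    int loads = 0;
    while (m.flush < 0) {
        md_step(&m, false);
        loads++;
    }
    return loads;
}
```

With these assumptions, md_count_clears(10) yields 4 machine clears (as in Figure 12) and md_loads_to_leave_never_hoisting() yields 79, i.e., the 15+64 non-aliasing loads discussed above.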

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for (i = 0; i < 10; i++) st_ld(mem + 0x1000, mem + 0x1000);
    for (i = 10; i < 20; i++) st_ld(mem + 0x2000, mem + 0x2000);
    // Cause 4k aliasing
    for (i = 20; i < 40; i++) st_ld(mem + 0x1000, mem + 0x2000);

[Figure 14 plot omitted. X-axis: i-th call to st_ld (0 to 40). Y-axis: clock cycles (50 to 90).]

Figure 14: Timing measurements of Listing 7. Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Figure 15: Memory disambiguation unit, simplified view.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value, and in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load; (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
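The 4k aliasing condition exercised by Listing 7 can be expressed as a simple address check. The following is a simplified sketch that ignores access sizes and partial overlaps; the helper names and the base address are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK 0xfffu /* low 12 bits: offset within a 4 KB page */

/* True when both addresses have the same 4 KB page offset. */
bool same_page_offset(uintptr_t store_addr, uintptr_t load_addr) {
    return ((store_addr ^ load_addr) & PAGE_OFFSET_MASK) == 0;
}

/* 4k aliasing: the page offsets match but the addresses differ, so a
 * forwarding decision based on the offset alone is a false match. */
bool is_4k_alias(uintptr_t store_addr, uintptr_t load_addr) {
    return store_addr != load_addr &&
           same_page_offset(store_addr, load_addr);
}

/* Usage example mirroring the address pairs of Listing 7. */
void alias_example(void) {
    uintptr_t mem = 0x10000u; /* hypothetical page-aligned base */
    assert(!is_4k_alias(mem + 0x1000u, mem + 0x1000u)); /* real alias    */
    assert(is_4k_alias(mem + 0x1000u, mem + 0x2000u));  /* 4k alias      */
    assert(!is_4k_alias(mem + 0x1000u, mem + 0x2040u)); /* no match      */
}
```

The first two loops of Listing 7 use truly aliasing pairs, while the last loop uses a pair for which is_4k_alias holds.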

B Root-Causes Description Table

Acronym          Description
BHT              Branch History Table
BTB              Branch Target Buffer
RSB              Return Stack Buffer
MD               Memory Disambiguation Unit
NM               Device Not Available Exception
DE               Divide-by-Zero Exception
UD               Invalid Opcode Exception
GP               General Protection Fault
AC               Alignment Check Exception
SS               Stack Segment Exception
PF               Page Fault
PF - U/S Bit     Page Table Entry User/Supervisor Bit PF
PF - R/W Bit     Page Table Entry Read/Write Bit PF
PF - P Bit       Page Table Entry Present Bit PF
PF - PKU         Page Table Entry Protection Keys PF
BR               Bound Range Exceeded Exception
FP               Floating Point Assist
SMC              Self-Modifying Code
XMC              Cross-Modifying Code
MO               Memory Ordering Principles Violation
MASKMOV          Masked Load/Store Instruction Assist
A/D Bits         Page Table Entry Access/Dirty Bits Assist
TSX              Intel TSX Transaction Abort
UC               Uncachable Memory Assist
PRM              Processor Reserved Memory Assist
HW Interrupts    Hardware Interrupts


  • Introduction
  • Background
    • IEEE-754 Denormal Numbers
    • x86 Cache Coherence
    • Memory Ordering
    • Memory Disambiguation
  • Threat Model
  • Machine Clears
  • Self-Modifying Code Machine Clear
  • Floating Point Machine Clear
  • Memory Ordering Machine Clear
  • Memory Disambiguation Machine Clear
  • Other Types of Machine Clear
  • Transient Execution Capabilities
    • Transient Window Size
    • Leakage Rate
  • Attack Primitives
    • Speculative Code Store Bypass (SCSB)
    • Floating Point Value Injection (FPVI)
  • Mitigations
    • SCSB Mitigation
    • FPVI Mitigation
  • Root Cause-based Classification
  • Related Work
  • Conclusions
  • Reversing Memory Disambiguation
  • Root-Causes Description Table

Figure 7: Coding options suggested by the Intel Architectures Software Developer's Manuals to handle SMC and XMC execution. Option 1 describes the exact steps required by our Speculative Code Store Bypass attack primitive, potentially resulting in exploitable gadgets.

known as Return-After-Free (RAF) in the hackers community [19, 25]. An example is CVE-2018-0946, where a use-after-free vulnerability can be exploited to force the Chakra JS engine to erroneously execute freed (attacker-controlled) JIT code, resulting in arbitrary code execution after massaging the right gadget into the JIT code cache [24].

Indeed, at a high level, SCSB yields a transient use-after-free primitive on a JIT code cache, with exploitation properties similar to its architectural counterpart. However, there are some differences due to the transient nature of SCSB. First, we need to find an out-of-context gadget to transiently leak data rather than architecturally execute arbitrary code. In JavaScript engines, similar gadgets have already been exploited by ret2spec [48], escalating out-of-context transient execution of valid code to type confusion, arbitrary reads, and ultimately a secret-dependent load to transmit the value.

Second, we need to target short-lived JIT code cache updates, so that the newly generated code is immediately executed, fitting the target gadget in the resulting transient execution window. Interestingly, executing JIT'ed code is the next step after JIT code generation mentioned in the Intel manual (Figure 7, Option 1), suggesting that developers who follow such directives to the letter can easily introduce such gadgets. Moreover, modern JavaScript compilers feature a multi-stage optimization pipeline [12, 54], and short-lived JIT code cache updates are favored for just-in-time (re)optimization. Indeed, we found test code in V8 that specifically tests such updates and verifies instruction/data cache coherence [70].

Finally, we need to ensure JIT code cache updates are not accompanied by barriers that force immediate synchronization of the code and data views. We analyzed the code of the popular SpiderMonkey and V8 JavaScript engines and verified that the functions called upon JIT code cache updates to synchronize instruction and data caches (Listing 2 and Listing 3) are always empty on x86 (as expected, given Intel's primary recommendation in Figure 7).

Listing 2: Chromium instruction cache flush (chromium/src/v8/src/codegen/x64/cpu-x64.cc).

    void CpuFeatures::FlushICache(void* start, size_t size) {
      // No need to flush the instruction cache on Intel
    }

Listing 3: Firefox instruction cache flush (mozilla-unified/js/src/jit/FlushICache.h).

    inline void FlushICache(void* code, size_t size,
                            bool codeIsThreadLocal = true) {
      // No-op. Code and data caches are coherent on x86 and x64.
    }

Overall, while the exploitation is far from trivial (i.e., having to address the challenges of traditional use-after-free exploitation as well as transient execution exploitation in the browser), we believe SCSB expands the attack surface of transient execution attacks in the browser. We found a number of candidate SCSB gadgets in real-world code, but none of them was ultimately exploitable due to the coincidental presence of some serializing instruction preventing stale code execution. Nonetheless, mitigations are needed to enforce security-by-design. Luckily, as shown later, SCSB is amenable to a practical and efficient mitigation (i.e., at a similar cost as faced by non-x86 architectures). The implementation/performance cost for mitigation is even lower than for its sibling SSB primitive, whose mitigation has been deployed in practice even in the absence of known practical exploits [55].

In Table 3, we show that all tested Intel and AMD processors are affected by SCSB. In contrast, ARM is not vulnerable, since SMC updates require explicit software barriers.

11.2 Floating Point Value Injection (FPVI)

Our second attack primitive, Floating Point Value Injection (FPVI), allows an attacker to inject arbitrary values into a transient execution window originated by a FP machine clear. As exemplified in Figure 8, the operations of the primitive can be broken down into four steps: (1) the attacker triggers the execution of a gadget starting with a denormal FP operation in the victim application, with the x and y operands under the attacker's control; (2) the transient z result of the operation is processed by the subsequent gadget instructions, leaving a microarchitectural trace; (3) the CPU detects the error condition (i.e., wrong result of a denormal operation), triggering a machine clear and thus a pipeline flush; (4) the CPU re-executes the entire gadget with the correct architectural z result. For exploitation, the attacker needs to (i) massage the x and y operands to inject the desired z value into the victim transient path and (ii) target a victim gadget so that the injected value yields a security-sensitive trace, which can be observed with FLUSH + RELOAD or other microarchitectural attacks.

Our primitive bears similarities with Load-Value-Injection (LVI) [71], since both allow attackers to inject controlled values into transient execution. Moreover, both primitives require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues, and hence the triggering conditions, are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. Indeed, it is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild: much like for Spectre [41], LVI [71], etc., this would require gadget scanners, a subject of orthogonal research [56, 58]. Still, we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all the prior Mozilla Firefox mitigations against transient execution attacks [52].

Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x,y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE-754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case, a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.

To verify the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see gadget in Figure 8).
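The architectural half of this example can be checked with ordinary code. The following is a minimal sketch, assuming a default IEEE-754 environment (64-bit doubles, no fast-math style compiler flags); the helper names BitsToDouble, DoubleToBits, and ArchitecturalDivide are hypothetical and not part of our fuzzer. It only reproduces the architectural -Infinity result: the transient value is a microarchitectural artifact and cannot be observed this way.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 64-bit pattern as an IEEE-754 double without undefined behavior.
inline double BitsToDouble(uint64_t bits) {
  static_assert(sizeof(double) == sizeof(uint64_t), "expects 64-bit doubles");
  double d;
  std::memcpy(&d, &bits, sizeof d);
  return d;
}

// Inverse of BitsToDouble.
inline uint64_t DoubleToBits(double d) {
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  return bits;
}

// Hypothetical helper: architectural result of x / y for a denormal divisor y,
// returned as a raw bit pattern.
inline uint64_t ArchitecturalDivide(uint64_t x_bits, uint64_t y_bits) {
  const double x = BitsToDouble(x_bits);
  const double y = BitsToDouble(y_bits);
  // The FPVI gadget relies on a denormal (subnormal) operand.
  assert(std::fpclassify(y) == FP_SUBNORMAL);
  return DoubleToBits(x / y);
}
```

Calling ArchitecturalDivide(0xc000e8b2c9755600, 0x0004000000000000) divides roughly -2.1 by 2^-1024, which overflows the double range and therefore yields the bit pattern of -Infinity (0xfff0000000000000).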

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the length attribute of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 52 bits set to zero, a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables such as string pointers can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on a FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.
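The decoding that turns the transient bit pattern into a read address can be sketched as follows. This is an illustration of the 17-bit tag / 47-bit payload layout described above, not SpiderMonkey code: apart from JSVAL_TAG_MAX_DOUBLE, all names are hypothetical. Note that the text writes tags as the upper 20 bits of the value (e.g., 0xfffb0), so the corresponding 17-bit tag is 0xfffb0 >> 3 = 0x1fff6.

```cpp
#include <cassert>
#include <cstdint>

// Layout from the text: 17-bit type tag above a 47-bit payload.
constexpr int kTagShift = 47;
constexpr uint64_t kPayloadMask = (uint64_t{1} << kTagShift) - 1;
constexpr uint64_t JSVAL_TAG_MAX_DOUBLE = 0x1fff0;

constexpr uint64_t TagOf(uint64_t bits) { return bits >> kTagShift; }
constexpr uint64_t PayloadOf(uint64_t bits) { return bits & kPayloadMask; }

// Values whose tag does not exceed JSVAL_TAG_MAX_DOUBLE are plain doubles.
constexpr bool IsDouble(uint64_t bits) {
  return TagOf(bits) <= JSVAL_TAG_MAX_DOUBLE;
}

// Address touched by a length read on a (type-confused) string value,
// assuming the length field sits 4 bytes past the pointer, as stated above.
inline uint64_t LengthReadAddress(uint64_t bits) {
  assert(!IsDouble(bits));  // only meaningful for NaN-boxed values
  return PayloadOf(bits) + 4;
}
```

For the transient result 0xfffb0deadbeef000, the tag exceeds JSVAL_TAG_MAX_DOUBLE, so the value is treated as NaN-boxed and the payload 0xdeadbeef000 is used as a pointer. The architectural result (-Infinity, 0xfff0000000000000) has tag 0x1ffe0 and is correctly treated as a double.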

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0, and by wrapping this primitive with a variant of an EVICT + RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

Table 3: Tested processors

Processor                               Microcode    SCSB vulnerable   FPVI vulnerable
Intel Core i7-10700K                    0xe0         ✓                 ✓
Intel Xeon Silver 4214                  0x500001c    ✓                 ✓
Intel Core i9-9900K                     0xde         ✓                 ✓
Intel Core i7-7700K                     0xca         ✓                 ✓
AMD Ryzen 5 5600X                       0xa201009    ✓                 ✓†
AMD Ryzen 2990WX                        0x800820b    ✓                 ✓†
AMD Ryzen 7 2700X                       0x800820b    ✓                 ✓†
Broadcom BCM2711 Cortex-A72 (ARM v8)    -            ✗§                ✗

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified that any serialization instruction, such as lfence, mfence, or cpuid, placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. This function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows that JavaScript execution time is heavily dominated by JIT code generation/execution and that code updates have negligible impact. Our results show that this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.
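The shape of this one-line change can be sketched as follows. This is an illustration rather than the actual V8 or SpiderMonkey patch: FlushICacheHardened, PatchCode, and SKETCH_SERIALIZE are hypothetical names, and the non-x86 branch is only a placeholder so that the sketch compiles elsewhere (it is not an SCSB mitigation; architectures such as ARM need their own instruction cache maintenance).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // _mm_lfence
#define SKETCH_SERIALIZE() _mm_lfence()
#else
#include <atomic>
// Placeholder for non-x86 builds only; not a substitute for icache maintenance.
#define SKETCH_SERIALIZE() std::atomic_thread_fence(std::memory_order_seq_cst)
#endif

// Counterpart of the empty FlushICache functions in Listings 2 and 3,
// with a single serializing instruction added.
inline void FlushICacheHardened(void* code, size_t size) {
  (void)code;
  (void)size;
  SKETCH_SERIALIZE();  // ensure the new code bytes are visible before execution
}

// Hypothetical JIT code update: write the new bytes, then serialize.
inline void PatchCode(unsigned char* dst, const unsigned char* src, size_t size) {
  assert(dst != nullptr && src != nullptr);
  std::memcpy(dst, src, size);
  FlushICacheHardened(dst, size);
}
```

The fence is placed after the stores that update the code and before control can reach the updated region, matching the placement described in Section 12.1.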

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
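The effect of the masking strategy on a result's bit pattern can be sketched as below. This illustrates the idea only and is not the SpiderMonkey patch [51]: all names are hypothetical, the canonical pattern 0xfff8000000000000 is our reading of the x86 QNaN Floating-Point Indefinite encoding (sign, all-ones exponent, and quiet bit set), and a C++ compiler is not guaranteed to lower this select to a cmov. A JIT would emit the instruction sequence directly.

```cpp
#include <cstdint>

constexpr uint64_t kExponentMask = 0x7ff0000000000000ULL;
constexpr uint64_t kFractionMask = 0x000fffffffffffffULL;
// Assumed canonical NaN: sign, all-ones exponent, and quiet bit set.
constexpr uint64_t kQNaNIndefinite = 0xfff8000000000000ULL;

// A double is NaN when the exponent is all ones and the fraction is non-zero.
constexpr bool IsNaNBits(uint64_t bits) {
  return (bits & kExponentMask) == kExponentMask && (bits & kFractionMask) != 0;
}

// All-ones when the condition holds, zero otherwise.
constexpr uint64_t SelectMask(bool condition) {
  return uint64_t{0} - static_cast<uint64_t>(condition);
}

// Replace any NaN result with the canonical NaN via a data-flow select, so a
// dependent use never sees an attacker-chosen NaN-boxed payload.
constexpr uint64_t MaskFPResult(uint64_t bits) {
  return (bits & ~SelectMask(IsNaNBits(bits))) |
         (kQNaNIndefinite & SelectMask(IsNaNBits(bits)));
}

static_assert(MaskFPResult(0xfffb0deadbeef000ULL) == kQNaNIndefinite,
              "a NaN-boxed string pointer collapses to the canonical NaN");
```

Non-NaN results, including -Infinity, pass through unchanged, while the transient value from Section 11.2 is reduced to a pattern whose tag does not exceed JSVAL_TAG_MAX_DOUBLE.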

A more general and automated mitigation is for the compiler to place a serializing instruction such as lfence after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable and those resistant to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76]. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by an MDS attacker. For the latter, we limit ourselves to load sinks to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.
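The sink selection rule above can be summarized in a few lines. This is a simplified sketch of the policy only, not the LLVM pass: the InstrKind enumeration and IsFPVISink are hypothetical names, and the data-flow analysis that connects FP sources to sinks is omitted.

```cpp
// Coarse instruction categories relevant to the sink policy (illustrative).
enum class InstrKind { Load, Store, IndirectBranch, Other };

// Returns true if an instruction whose address or target depends on an FP
// result should be treated as a sink (transmitter) under the stated policy:
// loads and indirect branches always, stores only on MDS-vulnerable systems.
constexpr bool IsFPVISink(InstrKind kind, bool mds_vulnerable) {
  return kind == InstrKind::Load || kind == InstrKind::IndirectBranch ||
         (kind == InstrKind::Store && mds_vulnerable);
}
```

An FP operation would then be followed by an lfence whenever its result can reach an instruction for which IsFPVISink returns true.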

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact (up to 280%), despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance impact difference between MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by performing interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

[Figure 9 plot: overhead (y-axis, 0 to 800) per benchmark for SPEC FP 2006 (433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, SPEC06 geomean) and SPEC FP 2017 (619.lbm_s, 638.imagick_s, 644.nab_s, SPEC17 geomean). LLVM mitigation series: LVI, FPVI (MDS-vulnerable system), FPVI (MDS-resistant system), LVI+FPVI (MDS-vulnerable system).]

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0.

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, an MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root-cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It's worth noting that, in Figure 10, there are two separate predictors subclasses under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74–76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a HW interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less than ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived by likely invariants [15] used by the CPU. Such invariants commonly hold, but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them with the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it with the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd, Daniel Genkin, and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From ghz to mhz effective cpu frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. Smotherspectre: exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. Kaslr: Break it, fix it, repeat. In ACM ASIA CCS 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. Sgxpectre: Stealing intel secrets from sgx enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS 2018.

[16] Christopher Domas. Breaking the x86 isa. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937

[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. Branchscope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts. October 1, 2013. US Patent 8,549,263.


[45] Chris Lattner and Vikram Adve Llvm A compilation framework forlifelong program analysis amp transformation In CGO 2004

[46] Orion Lawlor Hari Govind Isaac Dooley Michael Breitenfeld andLaxmikant Kale Performance degradation in the presence of subnor-mal floating-point values In OSIHPA 2005

[47] Moritz Lipp Michael Schwarz Daniel Gruss Thomas Prescher WernerHaas Anders Fogh Jann Horn Stefan Mangard Paul Kocher DanielGenkin Yuval Yarom and Mike Hamburg Meltdown Reading KernelMemory from User Space In USENIX Securityrsquo18

[48] Giorgi Maisuradze and Christian Rossow ret2spec Speculative ex-ecution using return stack buffers In Proceedings of the 2018 ACMSIGSAC

[49] Andrea Mambretti Alexandra Sandulescu Matthias NeugschwandtnerAlessandro Sorniotti and Anil Kurmus Two methods for exploitingspeculative control flow hijacks In USENIX WOOT 19

[50] Ahmad Moghimi Jan Wichelmann Thomas Eisenbarth and BerkSunar Memjam A false dependency attack against constant-timecrypto implementations International Journal of Parallel Program-ming 2019

[51] Mozilla Firefox Bug 1692972 mitigation httpshgmozillaorgreleasesmozilla-betarevb129bba64358

[52] Mozilla Spectre mitigations for Value type checks - x86 part httpsbugzillamozillaorgshow_bugcgiid=1433111

[53] Mozilla Spider Monkey JSValue httpshgmozillaorgmozilla-centralfiletipjspublicValueh

[54] Mozilla SpiderMonkey IonMonkey documentation

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. Specfuzz: Bringing spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the cpu on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. Spectaint: Speculative taint analysis for discovering spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. Sok: In search of lost time: A review of javascript timers in browsers. In IEEE EuroS&P '21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers, 2001. US Patent 6219684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS '19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in javascript. In FC, IFCA, 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. Netspectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. Lazyfp: Leaking fpu register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. Meltdownprime and spectreprime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. Sok: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. Lvi: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary cpu interrupt logic. In ACM CCS 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX '17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Sgaxe: How sgx fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Cacheout: Leaking data on intel cpus via cache evictions. arXiv preprint.

[77] WebKit. Browserbench. https://browserbench.org

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-ng: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+reload: a high resolution, low noise, l3 cache side-channel attack. In USENIX Security 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                          ; rdi: store addr, rsi: load addr
        %rep 10                     ; Trick to delay the store address
            imul rdi, 1
        %endrep
        mov DWORD [rdi], 0x42       ; Store
        mov eax, DWORD [rsi]        ; Load
        %rep 10                     ; Pronounce load timing
            imul eax, 1
        %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturation counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that saturating counter is set to 0
    for (i = 0; i < 100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for (i = 100; i < 120; i++) st_ld(mem, mem + 64);
    // Trigger a memory ordering machine clear
    for (i = 120; i < 130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
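The counter behavior described above can be sketched as a small software model. This is only an illustration under the stated reading of the text (one 4-bit counter, reset on aliasing, hoisting only when saturated); the names md_step and replay_listing5 are hypothetical and do not come from the paper or the patent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COUNTER_MAX 15u  /* 4-bit saturating counter, as inferred in the text */

/* Outcome of one modelled st_ld() call. */
typedef struct {
    bool hoisted;        /* predictor let the load run ahead of the store */
    bool machine_clear;  /* a hoisted load turned out to alias the store */
} md_outcome;

/* Apply one store-load pair to a single per-address counter. */
static md_outcome md_step(uint8_t *counter, bool aliases)
{
    assert(*counter <= COUNTER_MAX);
    md_outcome out;
    out.hoisted = (*counter == COUNTER_MAX);
    out.machine_clear = out.hoisted && aliases;
    if (aliases)
        *counter = 0;                 /* reset on a real dependency */
    else if (*counter < COUNTER_MAX)
        (*counter)++;                 /* saturating increment */
    return out;
}

typedef struct {
    int first_hoisted_call;  /* index of the first hoisted call, or -1 */
    int first_clear_call;    /* index of the first machine clear, or -1 */
    int machine_clears;      /* total machine clears in the replay */
} replay_result;

/* Replay the call pattern of Listing 5 against the model. */
static replay_result replay_listing5(void)
{
    replay_result r = { -1, -1, 0 };
    uint8_t counter = 0;
    for (int i = 0; i < 130; i++) {
        /* Calls 100..119 use st_ld(mem, mem + 64); all others alias. */
        bool aliases = (i < 100) || (i >= 120);
        md_outcome o = md_step(&counter, aliases);
        if (o.hoisted && r.first_hoisted_call < 0)
            r.first_hoisted_call = i;
        if (o.machine_clear) {
            if (r.first_clear_call < 0)
                r.first_clear_call = i;
            r.machine_clears++;
        }
    }
    return r;
}
```

In this model the 15 non-aliasing calls 100 to 114 saturate the counter, so call 115 is the first hoisted one, and the aliasing call 120 produces the single machine clear. Real hardware timing is of course not captured by such a model.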

Figure 11: Timing measurement of the experiment in Listing 5. Orange bar: machine clear memory ordering observed. [Plot: clock cycles (0 to 100) versus i-th call to st_ld (100 to 130).]

Listing 6: Snippet observing activation of never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            st_ld(mem, mem + 64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state. Orange bar: MC observed. [Plot: clock cycles (0 to 100) versus i-th call to st_ld (0 to 120).]

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters), and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe if two unrelated loads in these functions are able to influence each other when varying the number of nops. In our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256·k bytes apart in memory (k ∈ ℕ). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
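The observed indexing can be written down in a few lines. The following sketch assumes exactly what the text reports (256 slots selected by the lowest 8 bits of the load's instruction pointer); the helper names and the example addresses in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MD_INDEX_BITS 8u   /* 256 predictors observed; the patent suggests 6 bits */
#define MD_INDEX_MASK ((1u << MD_INDEX_BITS) - 1u)

/* Predictor slot selected by a load located at instruction pointer `ip`. */
static unsigned md_index(uint64_t ip)
{
    return (unsigned)(ip & MD_INDEX_MASK);
}

/* True when two loads share a slot and can therefore train each other. */
static bool md_collides(uint64_t victim_ip, uint64_t attacker_ip)
{
    return md_index(victim_ip) == md_index(attacker_ip);
}

/* Bytes of padding (e.g., nops) that move the attacker's load onto the
 * victim's slot. */
static unsigned md_padding_for(uint64_t victim_ip, uint64_t attacker_ip)
{
    unsigned pad = (md_index(victim_ip) - md_index(attacker_ip)) & MD_INDEX_MASK;
    assert(md_collides(victim_ip, attacker_ip + pad));
    return pad;
}
```

For instance, md_collides(0x401037, 0x401037 + 3 * 256) holds because the two addresses differ by a multiple of 256, which mirrors the 256·k spacing at which machine clears were observed.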

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that after 4 machine clears the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads starting from the never-hoisting state. [Plot: total measured machine clears (1 to 4) versus number of independent store-loads (15, 79, 143, 207, 271, 335, 399).]

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0 and disabled otherwise. The flush counter is decremented every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measure how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
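The arithmetic behind the 15+64 figure can be made explicit. The sketch below is an interpretation, not a description of the hardware: it assumes the flush counter takes values from -1 (never-hoisting) to 3, that only loads seen with a saturated per-address counter are credited, and that no aliasing load occurs during the n training loads. The function names are hypothetical.

```c
#include <assert.h>

enum {
    SATURATE_LOADS    = 15, /* non-aliasing loads to saturate the 4-bit counter */
    FLUSH_PERIOD      = 64, /* credited loads per flush counter increment */
    FLUSH_COUNTER_MAX = 3   /* assumed ceiling, implied by at most 4 MCs in a row */
};

/* Flush counter value after n non-aliasing store-loads, starting from the
 * never-hoisting state (flush counter at -1, per-address counter at 0). */
static int flush_counter_after(int n)
{
    assert(n >= 0);
    int credited = n - SATURATE_LOADS;   /* loads seen with a saturated counter */
    if (credited < 0)
        credited = 0;
    int value = -1 + credited / FLUSH_PERIOD;
    return value > FLUSH_COUNTER_MAX ? FLUSH_COUNTER_MAX : value;
}

/* Machine clears that can be triggered in a row before the flush counter
 * drops below 0 again and hoisting is disabled. */
static int max_machine_clears(int n)
{
    return flush_counter_after(n) + 1;
}
```

Under these assumptions max_machine_clears(79) is 1 and the value grows by one for every further 64 loads until it saturates at 4, which is consistent with the tick positions 79, 143, 207, and 271 on the horizontal axis of Figure 13.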

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for (i = 0; i < 10; i++) st_ld(mem + 0x1000, mem + 0x1000);
    for (i = 10; i < 20; i++) st_ld(mem + 0x2000, mem + 0x2000);
    // Cause 4k aliasing
    for (i = 20; i < 40; i++) st_ld(mem + 0x1000, mem + 0x2000);

Figure 14: Timing measurements of Listing 7. Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS. [Plot: clock cycles (50 to 90) versus i-th call to st_ld (0 to 40).]

Figure 15: Memory disambiguation unit, simplified view.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value, and in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load, and (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
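As a side note on the 4k aliasing experiment, the address relationship it relies on can be expressed as a simple predicate. This is a simplified sketch that compares only the 12-bit page offsets and ignores access sizes and partial overlaps; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK 0xfffu   /* low 12 bits: offset within a 4 KB page */

typedef enum { NO_MATCH, TRUE_ALIAS, FALSE_4K_ALIAS } stl_match;

/* Classify how a load address relates to a preceding store address. */
static stl_match classify_pair(uintptr_t store_addr, uintptr_t load_addr)
{
    if (store_addr == load_addr)
        return TRUE_ALIAS;        /* genuine store-to-load forwarding case */
    if (((store_addr ^ load_addr) & PAGE_OFFSET_MASK) == 0)
        return FALSE_4K_ALIAS;    /* same page offset, different address */
    return NO_MATCH;
}

/* Classify the pair used by the third loop of Listing 7 for a buffer `mem`. */
static stl_match listing7_third_loop(uintptr_t mem)
{
    stl_match m = classify_pair(mem + 0x1000, mem + 0x2000);
    assert(m == FALSE_4K_ALIAS);  /* holds for any base address */
    return m;
}
```

The first two loops of Listing 7 fall into the TRUE_ALIAS case, which is what forces the no-hoisting prediction before the false match is introduced.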

B Root-Causes Description Table

Acronym         Description
-------------   ------------------------------------------
BHT             Branch History Table
BTB             Branch Target Buffer
RSB             Return Stack Buffer
MD              Memory Disambiguation Unit
NM              Device Not Available Exception
DE              Divide-by-Zero Exception
UD              Invalid Opcode Exception
GP              General Protection Fault
AC              Alignment Check Exception
SS              Stack Segment Exception
PF              Page Fault
PF - U/S Bit    Page Table Entry User/Supervisor Bit PF
PF - R/W Bit    Page Table Entry Read/Write Bit PF
PF - P Bit      Page Table Entry Present Bit PF
PF - PKU        Page Table Entry Protection Keys PF
BR              Bound Range Exceeded Exception
FP              Floating Point Assist
SMC             Self-Modifying Code
XMC             Cross-Modifying Code
MO              Memory Ordering Principles Violation
MASKMOV         Masked Load/Store Instruction Assist
A/D Bits        Page Table Entry Access/Dirty Bits Assist
TSX             Intel TSX Transaction Abort
UC              Uncacheable Memory Assist
PRM             Processor Reserved Memory Assist
HW Interrupts   Hardware Interrupts



Figure 8: FPVI gadget example in SpiderMonkey. FP_Op(x,y) is an arbitrary denormal FP operation. The erroneous z result causes dependent operations to be executed twice (first transiently, then architecturally). The NaN-boxing z encoding allows the attacker to type-confuse the JIT'ed code and read from an arbitrary address on the transient path.

require gadgets in the victim application to process the injected value and perform security-sensitive computations on the transient path. Nonetheless, the underlying issues and, hence, the triggering conditions are fairly different. LVI requires the attacker to induce faulty or assisted loads on the victim execution, which is straightforward in SGX applications but more difficult in the general case [71]. FPVI imposes no such requirement, but does require an attacker to directly or indirectly control operands of a floating-point operation in the victim. Nonetheless, FPVI can extend the existing LVI attack surface (e.g., for compute-intensive SGX applications processing attacker-controlled inputs) and also provides exploitation opportunities in new scenarios. Indeed, while it is hard to draw general conclusions on the availability of exploitable FPVI gadgets in the wild (much like Spectre [41], LVI [71], etc., this would require gadget scanners, subject of orthogonal research [56, 58]), we found exploitable gadgets in NaN-Boxing implementations of modern JIT engines [53]. NaN-Boxing implementations encode arbitrary data types as double values, allowing attackers running code in a JIT sandbox (and thus trivially controlling operands of FP operations) to escalate FPVI to a speculative type confusion primitive. The latter can be exploited similarly to NaN-Boxing-enabled architectural type confusion [26] and allows an attacker to access arbitrary data on a transient path. Figure 8 presents our end-to-end exploit for a JavaScript-enabled attacker in a SpiderMonkey (Mozilla JavaScript runtime) sandbox, illustrating a gadget unaffected by all the prior Mozilla Firefox mitigations against transient execution attacks [52].

As exemplified in the figure, SpiderMonkey's NaN-Boxing strategy represents every variable with an IEEE-754 (64-bit) double, where the highest 17 bits store the data type tag and the remaining 47 bits store the data value itself. If the tag value is less than or equal to 0x1fff0 (i.e., JSVAL_TAG_MAX_DOUBLE), all the 64 bits are interpreted as a double, while NaN-Boxing encoding is used otherwise. For instance, the 0xfff88 tag is used to represent 32-bit integers and the 0xfffb0 tag to represent a string, with the data value storing a pointer to the string descriptor. In the example, the attacker crafts the operands of a vulnerable FP operation (in this case a division) to produce a transient result which the JIT'ed code interprets as a string pointer due to NaN-Boxing. This causes type confusion on a transient path and ultimately triggers a read with an attacker-controlled address.
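The encoding just described can be illustrated with a short decoder. This is a sketch based only on the numbers given in the text (17 tag bits, 47 payload bits, the 0x1fff0 threshold, and the 0xfff88 and 0xfffb0 leading hex digits); it is not SpiderMonkey code, and the function names are hypothetical. The 4-byte offset of the length attribute is the one stated in this section for the z.length access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TAG_SHIFT      47u                            /* 64 bits minus 17 tag bits */
#define PAYLOAD_MASK   ((UINT64_C(1) << TAG_SHIFT) - 1u)
#define TAG_MAX_DOUBLE UINT64_C(0x1fff0)              /* JSVAL_TAG_MAX_DOUBLE in the text */
/* The text quotes tags as the leading hex digits of the 64-bit word, so the
 * 17-bit tags are derived here by shifting those patterns. */
#define TAG_INT32      (UINT64_C(0xfff8800000000000) >> TAG_SHIFT)
#define TAG_STRING     (UINT64_C(0xfffb000000000000) >> TAG_SHIFT)

static uint64_t box_tag(uint64_t bits)     { return bits >> TAG_SHIFT; }
static uint64_t box_payload(uint64_t bits) { return bits & PAYLOAD_MASK; }

/* Values at or below the threshold are ordinary doubles. */
static bool is_plain_double(uint64_t bits) { return box_tag(bits) <= TAG_MAX_DOUBLE; }
static bool is_boxed_int32(uint64_t bits)  { return box_tag(bits) == TAG_INT32; }
static bool is_boxed_string(uint64_t bits) { return box_tag(bits) == TAG_STRING; }

/* Address that a `.length` access would read for a value treated as a string. */
static uint64_t string_length_address(uint64_t bits)
{
    assert(is_boxed_string(bits));
    return box_payload(bits) + 4;   /* length is stored 4 bytes into the descriptor */
}
```

Applied to the transient value 0xfffb0deadbeef000 used in the exploit, the decoder reports a string tag and the payload 0xdeadbeef000, so the length access targets 0xdeadbeef000+4.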

To verify the attacker can inject arbitrary pointers without fully reverse engineering the complex function used by the FPU, we implemented a simple fuzzer to find FP operands that yield transient division results with the upper bits set to 0xfffb0 (i.e., string tag). With such operands, we can easily control the remaining bits by performing the inverse operation, since the mantissa bits are transiently unaffected by the exponent value, as shown in Table 2. For example, using 0xc000e8b2c9755600 and 0x0004000000000000 as division operands yields -Infinity as the architectural result and our target string pointer 0xfffb0deadbeef000 as the transient result (see gadget in Figure 8).
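The architectural half of this example can be checked from ordinary C. The sketch below only reproduces the committed (architectural) quotient of the two operand bit patterns; the transient value cannot be observed this way, since it exists only on the squashed path. It assumes IEEE-754 doubles with default rounding, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double and back without aliasing issues. */
static double from_bits(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

static uint64_t to_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* A denormal has an all-zero exponent field and a non-zero mantissa. */
static bool is_denormal(uint64_t bits)
{
    uint64_t exponent = (bits >> 52) & 0x7ff;
    uint64_t mantissa = bits & ((UINT64_C(1) << 52) - 1u);
    return exponent == 0 && mantissa != 0;
}

/* Architectural quotient of two operands given as bit patterns. The
 * volatile qualifiers keep the division from being folded at compile time. */
static uint64_t divide_bits(uint64_t x_bits, uint64_t y_bits)
{
    volatile double x = from_bits(x_bits);
    volatile double y = from_bits(y_bits);
    return to_bits(x / y);
}
```

With the operands from the text, the divisor 0x0004000000000000 is a denormal (2^-1024), the dividend is roughly -2.11, and the quotient overflows to -Infinity, whose bit pattern is 0xfff0000000000000.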

Note that SpiderMonkey uses no guards or Spectre mitigations when accessing the attribute length of the string. This is normally safe, since x86 guarantees that NaN results of FP operations will always have the lowest 52 bits set to zero, a representation known as QNaN Floating-Point Indefinite [37]. In other words, the implementation relies on the fact that NaN-boxed variables such as string pointers can never accidentally appear as the result of FP operations and can only be crafted by the JIT engine itself. Unfortunately, this invariant no longer holds on an FPVI-controlled transient path. As shown in Figure 8, this invariant violation allows an attacker to transiently read arbitrary memory. Since the length attribute is stored 4 bytes away from the string pointer, the z.length access yields a transient read to 0xdeadbeef000+4.

We ran our exploit on an Intel i9-9900K CPU (microcode 0xde) running Linux 5.8.0 and Firefox 85.0, and by wrapping this primitive with a variant of an EVICT+RELOAD covert channel [61], we confirmed the ability to read arbitrary memory. Since prior work has already demonstrated that custom high-precision timers are possible in JavaScript [26, 29, 60, 64], we enabled precise timers in Firefox to simplify our covert channel. With our exploit, we measured a leakage rate of ~13 KB/s and a transient window of ~12 load instructions, enabled by increasing the FPU pressure through a chain of multiple dependent FP operations.

Table 3: Tested processors.

Processor                              Microcode    SCSB vulnerable   FPVI vulnerable
------------------------------------   ----------   ---------------   ---------------
Intel Core i7-10700K                   0xe0         ✓                 ✓
Intel Xeon Silver 4214                 0x500001c    ✓                 ✓
Intel Core i9-9900K                    0xde         ✓                 ✓
Intel Core i7-7700K                    0xca         ✓                 ✓
AMD Ryzen 5 5600X                      0xa201009    ✓                 ✓ †
AMD Ryzen 2990WX                       0x800820b    ✓                 ✓ †
AMD Ryzen 7 2700X                      0x800820b    ✓                 ✓ †
Broadcom BCM2711 Cortex-A72 (ARM v8)   -            ✗ §               ✗

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

Finally, as shown in Table 3, we observe that both Intel and AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified any serialization instruction, such as lfence, mfence, or cpuid, placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. Such function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows JavaScript execution time is heavily dominated by JIT code generation/execution, and code updates have negligible impact. Our results show this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.
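The shape of such a one-line change can be sketched as follows. This is not the V8 or SpiderMonkey code from Listings 2 and 3: flush_icache_hardened and patch_code are hypothetical stand-ins, the sketch assumes a GCC- or Clang-compatible compiler for the inline assembly and for __builtin___clear_cache, and the buffer it writes to is never executed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical post-patch hook. On x86 the hardware keeps code and data
 * coherent, so the only action is the serializing lfence; on other
 * architectures the compiler builtin performs the required invalidation. */
static void flush_icache_hardened(void *start, size_t size)
{
#if defined(__x86_64__) || defined(__i386__)
    (void)start;
    (void)size;
    __asm__ __volatile__("lfence" ::: "memory");
#else
    __builtin___clear_cache((char *)start, (char *)start + size);
#endif
}

/* Write new code bytes and run the hook before anything may jump to them. */
static size_t patch_code(uint8_t *code_buf, size_t buf_size,
                         const uint8_t *new_code, size_t n)
{
    assert(n <= buf_size);
    memcpy(code_buf, new_code, n);       /* the SMC store operations */
    flush_icache_hardened(code_buf, n);  /* serialize before execution */
    return n;
}
```

The design point is simply ordering: the fence sits after the stores that modify code and before any control transfer into the modified region.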

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
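The logic of the conditional-masking option can be sketched in C. This is not the SpiderMonkey mitigation itself: a JIT would emit the select directly as a cmov, whereas a C compiler gives no such guarantee, so the code below only illustrates the data flow. The tag threshold is the one given in the NaN-boxing description, the canonical NaN pattern 0xfff8000000000000 is an assumption about the default x86 encoding, and mask_fp_result is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT      47u
#define TAG_MAX_DOUBLE UINT64_C(0x1fff0)             /* highest tag still read as a double */
#define CANONICAL_NAN  UINT64_C(0xfff8000000000000)  /* assumed default x86 NaN pattern */

/* Replace any FP result whose bits would decode as a NaN-boxed value with a
 * canonical NaN, using a mask-based select rather than a branch. */
static uint64_t mask_fp_result(uint64_t result_bits)
{
    uint64_t boxed = (uint64_t)((result_bits >> TAG_SHIFT) > TAG_MAX_DOUBLE); /* 0 or 1 */
    uint64_t mask  = (uint64_t)0 - boxed;            /* all ones when boxed */
    uint64_t out   = (result_bits & ~mask) | (CANONICAL_NAN & mask);
    assert((out >> TAG_SHIFT) <= TAG_MAX_DOUBLE);    /* never decodes as a boxed value */
    return out;
}
```

Ordinary doubles, including infinities, pass through unchanged, while a pattern such as 0xfffb0deadbeef000 is replaced before any dependent operation can treat it as a string pointer.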

A more general and automated mitigation is for the compiler to place a serializing instruction such as lfence after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable and those resistant to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76]. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by an MDS attacker. For the latter, we limit ourselves to load sinks to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks. Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0. [Bar chart: overhead (%, 0 to 800) per benchmark for four LLVM mitigation configurations: LVI; FPVI (MDS-vulnerable system); FPVI (MDS-resistant system); LVI+FPVI (MDS-vulnerable system). SPEC FP 2017: 619.lbm_s, 638.imagick_s, 644.nab_s, geomean. SPEC FP 2006: 433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, geomean.]

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact (up to 280%) despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance impact difference between MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by performing interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, an MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root-cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It's worth noting that in Figure 10 there are two separate predictors subclasses under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance, the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74-76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a HW interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less than ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived by likely invariants [15] used by the CPU. Such invariants commonly hold but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6-11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74-76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them with the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it with the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd Daniel Genkin and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.

1464 30th USENIX Security Symposium USENIX Association

References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From ghz to mhz effective cpu frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. Smotherspectre: exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. Kaslr: Break it, fix it, repeat. In ACM ASIA CCS, 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. Sgxpectre: Stealing intel secrets from sgx enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS, 2018.

[16] Christopher Domas. Breaking the x86 isa. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937

[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. Branchscope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts. October 1, 2013. US Patent 8,549,263.


[45] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.

[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.

[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.

[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.

[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. Memjam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358

[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111

[53] Mozilla. Spider Monkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h

[54] Mozilla. SpiderMonkey IonMonkey documentation.

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. Specfuzz: Bringing spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the cpu on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. Spectaint: Speculative taint analysis for discovering spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. Sok: In search of lost time: A review of javascript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers. 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in javascript. In FC, IFCA, 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. Netspectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. Lazyfp: Leaking fpu register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. Meltdownprime and spectreprime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. Sok: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. Lvi: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary cpu interrupt logic. In ACM CCS, 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Sgaxe: How sgx fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Cacheout: Leaking data on intel cpus via cache evictions. arXiv preprint.

[77] WebKit. Browserbench. https://browserbench.org

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-ng: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+reload: a high resolution, low noise, l3 cache side-channel attack. In USENIX Security 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                      ; rdi: store addr, rsi: load addr
        %rep 10                 ; Trick to delay the store address
        imul rdi, 1
        %endrep
        mov DWORD [rdi], 0x42   ; Store
        mov eax, DWORD [rsi]    ; Load
        %rep 10                 ; Pronounce load timing
        imul eax, 1
        %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturating counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that the saturating counter is set to 0
    for (i = 0; i < 100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for (i = 100; i < 120; i++) st_ld(mem, mem + 64);
    // Trigger a memory ordering machine clear
    for (i = 120; i < 130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
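The counter behavior inferred from this experiment can be written down as a small C model. This is a minimal sketch of the observed behavior, not the hardware implementation; the names md_counter, md_predicts_hoist, md_update, and md_loads_until_hoist are hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MD_COUNTER_MAX 15u /* 4-bit saturating counter: values 0..15 */

/* One per-address predictor entry (hypothetical name). */
typedef struct {
    uint8_t value;
} md_counter;

/* Hoisting is predicted only once the counter is saturated. */
int md_predicts_hoist(const md_counter *c)
{
    return c->value == MD_COUNTER_MAX;
}

/* Update once the aliasing outcome is known: a non-aliasing load increments
 * the counter (saturating at 15), an aliasing load resets it to 0. */
void md_update(md_counter *c, int load_aliased_store)
{
    assert(c->value <= MD_COUNTER_MAX);
    if (load_aliased_store)
        c->value = 0;
    else if (c->value < MD_COUNTER_MAX)
        c->value++;
}

/* Number of consecutive non-aliasing loads needed to reach the hoisting
 * state, starting from a freshly reset counter. */
int md_loads_until_hoist(void)
{
    md_counter c = { 0 };
    int loads = 0;

    while (!md_predicts_hoist(&c)) {
        md_update(&c, 0);
        loads++;
    }
    return loads;
}
```

In this model, md_loads_until_hoist() reproduces the 15 non-aliasing loads seen in Figure 11, and a single aliasing load afterwards puts the entry back in the no-hoisting state.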

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange bar: memory ordering machine clear observed.

Listing 6: Snippet observing activation of the never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            st_ld(mem, mem + 64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters), and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe whether two unrelated loads in these functions are able to influence each other when varying the number of nops. On our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256k bytes apart in memory (k ∈ N). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except that we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
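The indexing observed above can be illustrated with a short C sketch. It assumes the predictor is selected purely by the lowest 8 bits of the load's instruction pointer, as our measurements suggest; the function names md_index and md_loads_share_predictor are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MD_INDEX_BITS 8u                    /* observed: 8 bits (patent: 6) */
#define MD_ENTRIES    (1u << MD_INDEX_BITS) /* 256 per-address predictors   */

/* Predictor selected by a load: lowest 8 bits of its instruction pointer. */
unsigned md_index(uint64_t load_ip)
{
    return (unsigned)(load_ip & (MD_ENTRIES - 1u));
}

/* Two loads interfere when they map to the same predictor, i.e., when their
 * instruction pointers are a multiple of 256 bytes apart. */
int md_loads_share_predictor(uint64_t ip_a, uint64_t ip_b)
{
    return md_index(ip_a) == md_index(ip_b);
}
```

Under this assumption, an attacker load placed at victim_ip + 256k for any natural k trains the victim's entry, whereas a load 64 bytes away does not.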

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that, after 4 machine clears, the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads; y-axis: total measured machine clears).

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0 and disabled otherwise. The flush counter is decremented on every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measured how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal that the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts in the never-hoisting state (i.e., with the predictor disabled by the watchdog), our results show that to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads each record a would-be-correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
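The interplay between the per-address counters and the watchdog can be sketched as a small simulation in C. This is a simplified model under stated assumptions, not a description of the actual circuit: the flush counter is modeled as an integer in the range -1..3, the count of correct hoists is not reset on a machine clear (the experiments do not determine this), and all identifiers (md_model, md_model_store_load, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MD_ENTRIES        256 /* per-address predictors (observed)          */
#define MD_COUNTER_MAX    15  /* 4-bit saturating counter                   */
#define MD_FLUSH_MAX      3   /* allows 4 MCs in a row: 3, 2, 1, 0, then -1 */
#define MD_HOISTS_PER_INC 64  /* correct hoists per flush-counter increment */

typedef struct {
    uint8_t counter[MD_ENTRIES];
    int flush;       /* < 0: watchdog on, hoisting is never performed  */
    int good_hoists; /* correct (or would-be-correct) hoist predictions */
} md_model;

/* Start in the never-hoisting state with all per-address counters at 0. */
void md_model_reset_never_hoisting(md_model *m)
{
    memset(m->counter, 0, sizeof m->counter);
    m->flush = -1;
    m->good_hoists = 0;
}

/* Simulate one store-load pair; returns 1 if it causes an MD machine clear. */
int md_model_store_load(md_model *m, uint64_t load_ip, int aliases)
{
    uint8_t *ctr = &m->counter[load_ip % MD_ENTRIES];
    int predicts_hoist = (*ctr == MD_COUNTER_MAX);
    int watchdog_on = (m->flush < 0);
    int machine_clear = 0;

    assert(m->flush >= -1 && m->flush <= MD_FLUSH_MAX);

    if (predicts_hoist && !aliases) {
        /* Correct prediction (unused while the watchdog is on). */
        if (++m->good_hoists == MD_HOISTS_PER_INC) {
            m->good_hoists = 0;
            if (m->flush < MD_FLUSH_MAX)
                m->flush++;
        }
    } else if (predicts_hoist && aliases && !watchdog_on) {
        /* The load was wrongly hoisted: machine clear. */
        machine_clear = 1;
        m->flush--;
    }

    /* Per-address counter update, active even in the never-hoisting state. */
    if (aliases)
        *ctr = 0;
    else if (*ctr < MD_COUNTER_MAX)
        (*ctr)++;

    return machine_clear;
}

/* Non-aliasing pairs needed, from the never-hoisting state, before the next
 * load at load_ip is actually hoisted. */
int md_model_pairs_to_mistrain(uint64_t load_ip)
{
    md_model m;
    int pairs = 0;

    md_model_reset_never_hoisting(&m);
    while (m.flush < 0 || m.counter[load_ip % MD_ENTRIES] != MD_COUNTER_MAX) {
        md_model_store_load(&m, load_ip, 0);
        pairs++;
    }
    return pairs;
}
```

With these assumptions the model yields the 15+64=79 non-aliasing pairs reported above, and replaying the loop of Listing 6 from a fully enabled predictor (flush counter at 3) produces exactly 4 machine clears before the watchdog takes over.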

Listing 7: Snippet verifying the 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for (i = 0; i < 10; i++) st_ld(mem + 0x1000, mem + 0x1000);
    for (i = 10; i < 20; i++) st_ld(mem + 0x2000, mem + 0x2000);
    // Cause 4k aliasing
    for (i = 20; i < 40; i++) st_ld(mem + 0x1000, mem + 0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Figure 15: Memory disambiguation unit, simplified view.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value and, in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load; (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
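The 4k aliasing condition exercised by Listing 7 can be expressed as a simple address check. The following C sketch is an approximation that only compares the 12-bit page offsets of the two addresses and ignores access sizes and partial overlaps; is_4k_alias is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK UINT64_C(0xfff) /* low 12 bits: 4 KB page offset */

/* A store-load pair is a 4k alias when the two addresses differ but share
 * the same 4 KB page offset, so the forwarding logic sees a false match. */
int is_4k_alias(uint64_t store_addr, uint64_t load_addr)
{
    return store_addr != load_addr &&
           ((store_addr ^ load_addr) & PAGE_OFFSET_MASK) == 0;
}
```

For the addresses used in Listing 7, the pair (mem+0x1000, mem+0x2000) satisfies this check, whereas a pair with identical addresses is a true alias and a pair 64 bytes apart is not an alias at all.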

B Root-Causes Description Table

Acronym         Description

BHT             Branch History Table
BTB             Branch Target Buffer
RSB             Return Stack Buffer
MD              Memory Disambiguation Unit
NM              Device Not Available Exception
DE              Divide-by-Zero Exception
UD              Invalid Opcode Exception
GP              General Protection Fault
AC              Alignment Check Exception
SS              Stack Segment Exception
PF              Page Fault
PF - U/S Bit    Page Table Entry User/Supervisor Bit PF
PF - R/W Bit    Page Table Entry Read/Write Bit PF
PF - P Bit      Page Table Entry Present Bit PF
PF - PKU        Page Table Entry Protection Keys PF
BR              Bound Range Exceeded Exception
FP              Floating Point Assist
SMC             Self-Modifying Code
XMC             Cross-Modifying Code
MO              Memory Ordering Principles Violation
MASKMOV         Masked Load/Store Instruction Assist
A/D Bits        Page Table Entry Access/Dirty Bits Assist
TSX             Intel TSX Transaction Abort
UC              Uncachable Memory Assist
PRM             Processor Reserved Memory Assist
HW Interrupts   Hardware Interrupts



Table 3: Tested processors.

Processor                              Microcode    SCSB vulnerable   FPVI vulnerable

Intel Core i7-10700K                   0xe0         ✓                 ✓
Intel Xeon Silver 4214                 0x500001c    ✓                 ✓
Intel Core i9-9900K                    0xde         ✓                 ✓
Intel Core i7-7700K                    0xca         ✓                 ✓
AMD Ryzen 5 5600X                      0xa201009    ✓                 ✓†
AMD Ryzen 2990WX                       0x800820b    ✓                 ✓†
AMD Ryzen 7 2700X                      0x800820b    ✓                 ✓†
Broadcom BCM2711 Cortex-A72 (ARM v8)   -            ✗§                ✗

† No exploitable NaN-boxed transient results found.
§ On ARM, SMC updates require explicit software barriers.

AMD are affected by FPVI, although we found no exploitable transient NaN-boxed values on AMD. On ARM, we never observed traces of transient results, yet we cannot rule out other FPU implementations being affected.

12 Mitigations

12.1 SCSB Mitigation

SCSB can be mitigated by ensuring that any freshly written code is architecturally visible before being executed. For example, on ARM architectures, where the hardware does not automatically enforce cache coherence, explicit serializing instructions (i.e., L1i cache invalidation instructions) are needed to correctly support SMC [5]. As such, spec-compliant ARM implementations cannot be affected by SCSB. On Intel and AMD, we can force eager code/data coherence using a serialization instruction, although this is normally not necessary (see Option 1 in Figure 7). In our experiments, we verified that any serialization instruction such as lfence, mfence, or cpuid placed after the SMC store operations is indeed sufficient to suppress the transient window. Note that sfence cannot serve as a serialization instruction [35] to eliminate the transient path. This serialization mitigation was confirmed by CPU vendors and adopted by the Xen hypervisor [32, 33].

To evaluate the performance impact of our mitigation, we added an lfence instruction inside the FlushICache function (Listings 2 and 3) of the two popular V8 and SpiderMonkey JavaScript engines. This function, normally empty on x86, is called after every code update. Our repeated experiments on the popular JetStream2 and Speedometer2 [77] benchmarks did not produce any statistically measurable performance overhead. This shows that JavaScript execution time is heavily dominated by JIT code generation/execution, and that code updates have negligible impact. Our results show that this mitigation is practical and can hinder SCSB-based attack primitives in JIT engines with a 1-line code change.
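The 1-line change described above can be sketched in C as follows. This is an illustration under stated assumptions rather than the actual V8 or SpiderMonkey patch: flush_icache_hardened is a hypothetical stand-in for the engines' FlushICache hook, the x86-64 path relies on GCC/Clang inline assembly, and the fallback path uses the GCC/Clang builtin __builtin___clear_cache for architectures that require explicit instruction cache maintenance.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a JIT engine's FlushICache hook, called after
 * every update to generated code in [start, start + size). */
void flush_icache_hardened(void *start, size_t size)
{
    assert(start != NULL || size == 0);
#if defined(__x86_64__) && defined(__GNUC__)
    /* x86 keeps code and data coherent in hardware, so this hook is normally
     * empty; the lfence closes the SCSB transient window by forcing the
     * preceding code stores to complete before execution continues. */
    (void)start;
    (void)size;
    __asm__ __volatile__("lfence" ::: "memory");
#elif defined(__GNUC__)
    /* Other architectures (e.g., ARM) need explicit cache maintenance. */
    __builtin___clear_cache((char *)start, (char *)start + size);
#else
#error "add the platform-specific serialization primitive here"
#endif
}
```

A JIT would call this helper right after patching code and before jumping to it; the function does not modify the buffer it is given.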

12.2 FPVI Mitigation

The most efficient way to mitigate FPVI is to disable the denormal representation. On Intel, this translates to enabling the Flush-to-Zero and Denormals-are-Zero flags [37], which respectively replace denormal results and inputs with zero. This is a viable mitigation for applications with modest floating-point precision requirements and has also been selectively applied to browsers [42]. However, this defense may break common real-world (denormal-dependent) applications, a concern that has led browser vendors such as Firefox to adopt other mitigations. Another option for browsers is to enable Site Isolation [13], but JIT engines such as SpiderMonkey still do not have a production implementation [23]. Yet another option for JIT engines is to conditionally mask (i.e., using a transient execution-safe cmov instruction) the result of FP operations to enforce QNaN Floating-Point Indefinite semantics [37], as done in the SpiderMonkey FPVI mitigation [51]. This strategy suppresses any malicious NaN-boxed transient results, but requires manual changes to the NaN-boxing implementation and only applies to NaN-boxed gadgets.
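The conditional masking idea can be illustrated at the bit level in C. This is only a sketch of the architectural effect, assuming IEEE-754 binary64 doubles: a C compiler gives no guarantee that the select below is lowered to cmov rather than a branch, so a real JIT would emit the instruction sequence directly. The names mask_nan_result, f64_to_bits, and f64_from_bits are hypothetical; the constant is the x86 double-precision QNaN floating-point indefinite encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* x86 double-precision QNaN floating-point indefinite. */
#define QNAN_INDEFINITE UINT64_C(0xFFF8000000000000)

uint64_t f64_to_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

double f64_from_bits(uint64_t bits)
{
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Replace any NaN result (including attacker-shaped NaN-boxed payloads) with
 * the canonical indefinite value, leaving all other results untouched. The
 * select is written without an explicit branch. */
double mask_nan_result(double result)
{
    uint64_t bits = f64_to_bits(result);
    uint64_t magnitude = bits & UINT64_C(0x7FFFFFFFFFFFFFFF);
    /* NaN: exponent all ones and a non-zero fraction. */
    uint64_t is_nan = (uint64_t)(magnitude > UINT64_C(0x7FF0000000000000));
    uint64_t mask = (uint64_t)0 - is_nan; /* all ones if NaN, else zero */

    bits = (bits & ~mask) | (QNAN_INDEFINITE & mask);
    return f64_from_bits(bits);
}
```

Ordinary values, infinities, and denormals pass through unchanged, while a NaN carrying a pointer-like payload collapses to the canonical value, so it can no longer be interpreted as a NaN-boxed reference.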

A more general and automated mitigation is for the compiler to place a serializing instruction such as lfence after FP operations whose (attacker-controlled) result might leak secrets by means of transmit gadgets (or transmitters). We observe this is the same high-level strategy adopted by the existing LVI mitigation in modern compilers [39], which identifies loaded values as sources and uses data-flow analysis to ensure all the sources that reach a sink (transmitter) are fenced. Hence, to mitigate FPVI, we can rely on the same mitigation strategy, but use computed FP values as sources instead. To identify sinks, we consider both systems vulnerable and those resistant to Microarchitectural Data Sampling (MDS) [8, 59, 63, 75, 76]. For the former, we consider both load and store instructions as sinks (e.g., FP operation result used as a load/store address), as the corresponding arbitrary data spilled into microarchitectural buffers may be leaked by an MDS attacker. For the latter, we limit ourselves to load sinks, to catch all the arbitrary read values potentially disclosed by a dependent transmitter. In both cases, we add indirect branches controlled by FP operations (and thus potentially leading to speculative control-flow hijacking) to the list of sinks.
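The sink selection rules above can be condensed into a small predicate. The following C sketch is a toy illustration and not the LLVM pass itself: op_kind and is_fpvi_sink are hypothetical names, and the data-flow analysis that decides whether an operand derives from an FP result is assumed to have run already.

```c
#include <assert.h>

/* Instruction kinds relevant to the sink decision (hypothetical enum). */
typedef enum {
    OP_LOAD,
    OP_STORE,
    OP_INDIRECT_BRANCH,
    OP_OTHER
} op_kind;

/* Returns 1 if an instruction is an FPVI sink and must be preceded by a
 * fence. operand_from_fp: the address or branch target derives from an FP
 * operation result (the source). mds_vulnerable: the target system is
 * vulnerable to MDS. */
int is_fpvi_sink(op_kind kind, int operand_from_fp, int mds_vulnerable)
{
    if (!operand_from_fp)
        return 0;

    switch (kind) {
    case OP_LOAD:            /* sink on all systems */
    case OP_INDIRECT_BRANCH: /* potential speculative control-flow hijack */
        return 1;
    case OP_STORE:           /* sink only if spilled data can be sampled */
        return mds_vulnerable != 0;
    default:
        return 0;
    }
}
```

A hardening pass built on this predicate would walk each FP source's uses and insert an lfence before the first instruction for which is_fpvi_sink returns 1.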

We have implemented such a mitigation in LLVM [45] with only 100 lines of code on top of the existing x86 LVI load hardening pass. To evaluate the performance impact of our mitigation on floating-point-intensive programs, we ran all the C/C++ SPECfp 2006 and 2017 benchmarks in four configurations: LVI instrumentation, FPVI instrumentation for both MDS-vulnerable and MDS-resistant systems, and joint LVI+FPVI instrumentation for MDS-vulnerable systems. Please note that, on MDS-resistant systems, the FPVI transmitters are already covered by the LVI mitigation.

Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks (y-axis: overhead in %, from 0 to 800; SPEC FP 2017: 619.lbm_s, 638.imagick_s, 644.nab_s, geomean; SPEC FP 2006: 433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, geomean; LLVM mitigation series: LVI, FPVI (MDS-vulnerable system), FPVI (MDS-resistant system), LVI+FPVI (MDS-vulnerable system)). Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0.

Figure 9 shows the performance overhead of such configurations compared to the baseline. As expected, the LVI mitigation has non-trivial performance impact (up to 280%), despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable difference in performance impact between the MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by performing interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, an MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root-cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It is worth noting that, in Figure 10, there are two separate predictors subclasses under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions. This class includes the causes of machine clear due to exceptions, for instance the different (sub)classes of page faults. Forcing an exception is sufficient to create a transient execution path erroneously executing code following the exception-inducing instruction. Exceptions are a less common way to manage transient execution windows (as they require dedicated handling), but have been extensively used as triggers in Meltdown-like attacks [7, 8, 47, 59, 63, 67, 71, 74–76, 78].

Interrupts. This class includes the causes of machine clear due to hardware (only) interrupts. Similar to exceptions, forcing a HW interrupt is sufficient to create a transient execution path erroneously executing code following the interrupted instruction. Hardware interrupts are asynchronous by nature and thus difficult to control, resulting in a less than ideal way to manage transient execution windows. Nonetheless, they were abused by prior side-channel attacks [72].

Likely invariants violations. This class includes all the remaining causes of machine clear, derived by likely invariants [15] used by the CPU. Such invariants commonly hold, but occasionally fail, allowing hardware to implement fast-path optimizations. However, compared to exceptions and interrupts, slow-path occurrences are typically more frequent, requiring more efficient handling in hardware or microcode. We discussed examples of such invariants in the paper (e.g., store instructions are expected to never target cached instructions, floating-point operations are expected to never operate on denormal numbers, etc.) and their lazy handling mechanisms (i.e., L1d/L1i resynchronization, microcode-based denormal arithmetic). Forcing a likely invariant violation is sufficient to create a transient execution path accessing erroneous code or data. In this paper, we have shown such violations are not only a realistic way to manage transient execution windows, but also provide new opportunities and primitives for transient execution attacks.

14 Related Work

Spectre [31, 41] and Meltdown [47] first examined the security implications of transient execution, originating a large body of research on transient execution attacks [6–11, 28, 40, 43, 48, 59, 63, 65, 67, 68, 71, 74–76, 78, 79, 82]. Rather than focusing on attacks and their classification [10, 75, 80], ours is the first effort to systematize the root causes of transient execution and examine the many unexplored cases of machine clears.

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them with the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it with the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd Daniel Genkin and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From ghz to mhz effective cpu frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. Smotherspectre: exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. Kaslr: Break it, fix it, repeat. In ACM ASIA CCS, 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. Sgxpectre: Stealing intel secrets from sgx enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS, 2018.

[16] Christopher Domas. Breaking the x86 isa. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.

[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. Branchscope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts, October 1, 2013. US Patent 8,549,263.

[45] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.

[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.

[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.

[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.

[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. Memjam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358.

[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111.

[53] Mozilla. Spider Monkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h.

[54] Mozilla. SpiderMonkey IonMonkey documentation.

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. Specfuzz: Bringing spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the cpu on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. Spectaint: Speculative taint analysis for discovering spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. Sok: In search of lost time: A review of javascript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers, 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in javascript. In FC. IFCA, 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. Netspectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. Lazyfp: Leaking fpu register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. Meltdownprime and spectreprime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. Sok: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8.

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. Lvi: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P, 20.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary cpu interrupt logic. In ACM CCS, 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Sgaxe: How sgx fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Cacheout: Leaking data on intel cpus via cache evictions. arXiv preprint.

[77] WebKit. Browserbench. https://browserbench.org.

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-ng: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+reload: a high resolution, low noise, l3 cache side-channel attack. In USENIX Security 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    st_ld:                        ; rdi: store addr, rsi: load addr
    %rep 10                       ; Trick to delay the store address
        imul rdi, 1
    %endrep
        mov DWORD [rdi], 0x42     ; Store
        mov eax, DWORD [rsi]      ; Load
    %rep 10                       ; Pronounce load timing
        imul eax, 1
    %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturating counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that the saturating counter is set to 0
    for(i=0; i<100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for(i=100; i<120; i++) st_ld(mem, mem+64);
    // Trigger a memory ordering machine clear
    for(i=120; i<130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) based entirely on a simple implementation, the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
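The counter behavior described above can be sketched as a small software model. This is only an illustration of the inferred design (a 4-bit counter that is incremented on non-aliasing loads, reset on aliasing ones, and predicts hoisting only when saturated); the names `md_entry`, `md_step`, and `md_replay_listing5` are hypothetical, and the model does not measure real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MD_COUNTER_MAX 15u   /* a 4-bit counter saturates at 15 */

/* Hypothetical model of one per-address predictor entry. */
typedef struct {
    uint8_t counter;         /* 0..15 */
} md_entry;

/* Hoisting is predicted only when the counter is saturated. */
static bool md_predict_hoist(const md_entry *e) {
    return e->counter == MD_COUNTER_MAX;
}

/*
 * Model one st_ld call. `aliases` tells whether the load really aliased the
 * preceding store. Returns true when the call would end in a machine clear,
 * i.e. the load was hoisted but turned out to alias.
 */
static bool md_step(md_entry *e, bool aliases) {
    bool hoisted = md_predict_hoist(e);
    if (aliases) {
        e->counter = 0;                       /* reset on a real alias */
    } else if (e->counter < MD_COUNTER_MAX) {
        e->counter++;                         /* saturating increment */
    }
    assert(e->counter <= MD_COUNTER_MAX);
    return hoisted && aliases;
}

/*
 * Replay the three phases of Listing 5 (calls 0..99 alias, 100..119 do not,
 * 120..129 alias again). Stores the index of the first hoisted call in
 * *first_hoisted_call and returns the number of modeled machine clears.
 */
static int md_replay_listing5(int *first_hoisted_call) {
    md_entry e = { .counter = 0 };
    int clears = 0;
    *first_hoisted_call = -1;
    for (int i = 0; i < 130; i++) {
        bool aliases = (i < 100) || (i >= 120);
        if (*first_hoisted_call < 0 && md_predict_hoist(&e))
            *first_hoisted_call = i;
        clears += md_step(&e, aliases);
    }
    return clears;
}
```

In this model, calls 100 to 114 are the 15 non-aliasing loads that saturate the counter, call 115 is the first hoisted one, and the aliasing load at call 120 is the only one that produces a machine clear.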

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld, 100 to 130; y-axis: clock cycles, 0 to 100). Orange bar: machine clear (memory ordering) observed.

Listing 6: Snippet observing activation of the never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for(int i=0; i<10; i++) {
        for(int j=0; j<19; j++)
            st_ld(mem, mem+64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld, 0 to 120; y-axis: clock cycles, 0 to 100). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters) and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe whether two unrelated loads in these functions are able to influence each other when varying the number of nops. On our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256 × k bytes apart in memory (k ∈ ℕ). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except that we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction with knowledge of just its location in memory.
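The indexing scheme reported above (256 predictors selected by the lowest 8 bits of the load's instruction pointer) can be illustrated with a few helper functions. This is a sketch under that assumption only; the function names are hypothetical, and the padding helper assumes that inserting one-byte nops before the attacker's load shifts its instruction pointer by exactly one byte per nop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Observed on the tested CPUs; the patent suggests 64 entries instead. */
#define MD_NUM_PREDICTORS 256u

/* Predictor index: the lowest 8 bits of the load's instruction pointer. */
static unsigned md_index(uintptr_t load_ip) {
    return (unsigned)(load_ip & (MD_NUM_PREDICTORS - 1u));
}

/* Two loads share a predictor when their IPs are 256 * k bytes apart. */
static bool md_loads_collide(uintptr_t ip_a, uintptr_t ip_b) {
    return md_index(ip_a) == md_index(ip_b);
}

/*
 * Smallest number of padding bytes (e.g. one-byte nops) to insert before an
 * attacker load currently at `attacker_ip` so that it maps to the same
 * predictor entry as the victim load at `victim_ip`.
 */
static unsigned md_padding_for_collision(uintptr_t attacker_ip,
                                         uintptr_t victim_ip) {
    unsigned pad = (md_index(victim_ip) - md_index(attacker_ip))
                   & (MD_NUM_PREDICTORS - 1u);
    assert(md_loads_collide(attacker_ip + pad, victim_ip));
    return pad;
}
```

For example, an attacker load at 0x4010f0 and a victim load at 0x402010 would collide after 0x20 bytes of padding, since both instruction pointers then end in 0x10.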

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that after 4 machine clears the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads, with ticks at 15, 79, 143, 207, 271, 335, and 399; y-axis: total measured machine clears, 1 to 4).

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0 and disabled otherwise. The flush counter is decremented every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measured how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
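To make the 15+64 arithmetic concrete, the following sketch combines the per-address counter with a flush counter and watchdog as inferred above. It is a simplified software model, not the hardware design: the type and function names are hypothetical, the flush counter range of -1 to 3 is an assumption chosen to reproduce the "at most 4 MCs in a row" observation, and resetting the streak of correct predictions on a machine clear is a further assumption not stated in the text.

```c
#include <assert.h>
#include <stdbool.h>

#define MD_COUNTER_MAX   15   /* per-address 4-bit saturating counter */
#define MD_FLUSH_MAX      3   /* assumed upper bound of the flush counter */
#define MD_FLUSH_PERIOD  64   /* correct predictions per flush increment */

typedef struct {
    int counter;      /* per-address counter, 0..15 */
    int flush;        /* flush counter, -1..3; -1 means never-hoisting */
    int good_streak;  /* correct hoist predictions since last increment */
} md_model;

static bool md_watchdog_active(const md_model *m) {
    return m->flush < 0;
}

/* Model one store-load pair; returns true on a (modeled) machine clear. */
static bool md_step(md_model *m, bool aliases) {
    bool predicted = (m->counter == MD_COUNTER_MAX);
    bool hoisted = predicted && !md_watchdog_active(m);
    bool clear = hoisted && aliases;

    if (clear) {
        m->flush--;               /* every MD machine clear decrements */
        m->good_streak = 0;       /* assumption: a clear resets the streak */
    } else if (predicted && !aliases) {
        /* A correct (or would-be correct) hoisting prediction. */
        if (++m->good_streak == MD_FLUSH_PERIOD) {
            m->good_streak = 0;
            if (m->flush < MD_FLUSH_MAX)
                m->flush++;
        }
    }
    if (aliases)
        m->counter = 0;
    else if (m->counter < MD_COUNTER_MAX)
        m->counter++;
    assert(m->flush >= -1 && m->flush <= MD_FLUSH_MAX);
    return clear;
}

/* Non-aliasing pairs needed before the next load would really be hoisted. */
static int md_pairs_until_hoisting(md_model m) {
    int n = 0;
    while (!(m.counter == MD_COUNTER_MAX && !md_watchdog_active(&m))) {
        (void)md_step(&m, false);
        n++;
    }
    return n;
}

/* Retrain with 15 non-aliasing pairs, then alias; repeat and count clears. */
static int md_max_clears(md_model m) {
    int clears = 0;
    for (int round = 0; round < 16; round++) {
        for (int i = 0; i < MD_COUNTER_MAX; i++)
            (void)md_step(&m, false);
        clears += md_step(&m, true);
    }
    return clears;
}
```

Starting from the never-hoisting state with a cleared per-address counter, the model needs 79 non-aliasing pairs before hoisting resumes; starting from a full flush counter, it produces at most 4 machine clears before the watchdog disables hoisting.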

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for(i=0; i<10; i++) st_ld(mem+0x1000, mem+0x1000);
    for(i=10; i<20; i++) st_ld(mem+0x2000, mem+0x2000);
    // Cause 4k aliasing
    for(i=20; i<40; i++) st_ld(mem+0x1000, mem+0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld, 0 to 40; y-axis: clock cycles, 50 to 90). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Figure 15: Memory disambiguation unit – simplified view.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value, and in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load; (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
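The address relationship behind 4k aliasing can be sketched as a simple classification of a store-load pair. This is an illustration only: it compares the low 12 bits of the two addresses and ignores access sizes and partial overlaps, and the names `stl_match`, `classify_pair`, and `count_4k_alias_calls` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK 0xfffu   /* low 12 bits: offset in a 4 KiB page */

typedef enum {
    STL_NO_MATCH,     /* page offsets differ: nothing to forward */
    STL_TRUE_ALIAS,   /* same address: forwarding is correct */
    STL_4K_ALIAS      /* same page offset, different address: false match */
} stl_match;

/* Classify a store-load pair by comparing addresses and page offsets. */
static stl_match classify_pair(uintptr_t store_addr, uintptr_t load_addr) {
    if (((store_addr ^ load_addr) & PAGE_OFFSET_MASK) != 0)
        return STL_NO_MATCH;
    return store_addr == load_addr ? STL_TRUE_ALIAS : STL_4K_ALIAS;
}

/*
 * Replay the address pattern of Listing 7 for a buffer at `mem` and count
 * how many of the 40 calls use a 4k-aliasing store-load pair.
 */
static int count_4k_alias_calls(uintptr_t mem) {
    int n = 0;
    for (int i = 0; i < 40; i++) {
        uintptr_t st = mem + ((i >= 10 && i < 20) ? 0x2000u : 0x1000u);
        uintptr_t ld = mem + ((i < 10) ? 0x1000u : 0x2000u);
        stl_match kind = classify_pair(st, ld);
        /* Every pair in Listing 7 shares the same page offset. */
        assert(kind != STL_NO_MATCH);
        if (kind == STL_4K_ALIAS)
            n++;
    }
    return n;
}
```

Under this classification, the first 20 calls of Listing 7 are true aliases (which keep the predictor in the no-hoisting state) and the last 20 are 4k-aliasing pairs.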

B Root-Causes Description Table

Acronym        Description

BHT            Branch History Table
BTB            Branch Target Buffer
RSB            Return Stack Buffer
MD             Memory Disambiguation Unit
NM             Device Not Available Exception
DE             Divide-by-Zero Exception
UD             Invalid Opcode Exception
GP             General Protection Fault
AC             Alignment Check Exception
SS             Stack Segment Exception
PF             Page Fault
PF - US Bit    Page Table Entry User/Supervisor Bit PF
PF - RW Bit    Page Table Entry Read/Write Bit PF
PF - P Bit     Page Table Entry Present Bit PF
PF - PKU       Page Table Entry Protection Keys PF
BR             Bound Range Exceeded Exception
FP             Floating Point Assist
SMC            Self-Modifying Code
XMC            Cross-Modifying Code
MO             Memory Ordering Principles Violation
MASKMOV        Masked Load/Store Instruction Assist
AD Bits        Page Table Entry Access/Dirty Bits Assist
TSX            Intel TSX Transaction Abort
UC             Uncachable Memory Assist
PRM            Processor Reserved Memory Assist
HW Interrupts  Hardware Interrupts


Figure 9: Performance overhead of our FPVI mitigation on the C/C++ SPECfp 2006 and 2017 benchmarks (y-axis: overhead; series: LVI, FPVI (MDS-vulnerable system), FPVI (MDS-resistant system), LVI+FPVI (MDS-vulnerable system); benchmarks: 619.lbm_s, 638.imagick_s, 644.nab_s, and the SPEC17 geomean; 433.milc, 444.namd, 447.dealII, 450.soplex, 453.povray, 470.lbm, 482.sphinx3, and the SPEC06 geomean). Experimental setup: 5 SPEC iterations, Intel i9-9900K (microcode 0xde), and LLVM 11.1.0.

mitigation has non-trivial performance impact (up to 280%), despite targeting floating-point-intensive programs. On the other hand, our FPVI mitigation incurs a 32% and 53% geomean overhead on SPECfp 2006 and 2017, respectively, with no observable performance impact difference between MDS-vulnerable and MDS-resistant variants. We observed that approximately 70% of the inserted lfence instructions are due to the intraprocedural design of the original LVI pass, forcing our analysis to consider every callsite with FPVI source-based arguments as a potential transmitter. This suggests the overhead can be further reduced by operating interprocedural analysis or more aggressive inlining (e.g., using LLVM LTO [45]).

13 Root Cause-based Classification

We now summarize the results of our investigation by presenting a root cause-based classification for the known transient execution paths. While there have already been several attempts to classify properties of transient execution [10, 75, 80], all the existing classifications are attack-centric. While certainly useful, such classifications inevitably blend together the root causes of transient execution with the attack triggers. For example, an MDS exploit based on a demand-paging page fault [75] may be simply classified as a Meltdown-like attack based on a present page fault [10]. However, in such an exploit, the page fault is both the vulnerability trigger and the root cause of the transient execution window. A similar exploit may instead be embedded in an Intel TSX window, yielding a different root cause and capabilities.

To better characterize the capabilities of transient execution exploits, we propose the orthogonal root-cause-based classification in Figure 10. We draw from the Intel terminology to define the two main classes of root causes of Bad Speculation (i.e., transient execution): Control-Flow Misprediction (i.e., branch misprediction) and Data Misprediction (i.e., machine clear). Based on these two main classes, we observe that all the known root causes of transient execution paths can be classified into the following four subclasses: Predictors, Exceptions, Likely invariants violations, and Interrupts.

Figure 10: A root cause-centric classification of known transient execution paths (causes analyzed in the paper in bold). Acronym descriptions can be found in Appendix B.

Predictors. This category includes the prediction-based causes of bad speculation due to either control-flow or data mispredictions. Mistraining a predictor and forcing a misprediction is sufficient to create a transient execution path accessing erroneous code or data. It's worth noting that in Figure 10 there are two separate predictors subclasses, under control-flow and data mispredictions, as they are different in nature. While control-flow mispredictions are failed attempts to guess the next instructions to execute, data mispredictions are failed attempts to operate on not-yet-validated data. Misprediction is a common way to manage transient execution windows in attacks like Spectre and derivatives [11, 20, 28, 40, 41, 43, 48, 55, 65, 82].

Exceptions This class includes the causes of machineclear due to exceptions for instance the different (sub)classesof page faults Forcing an exception is sufficient to create atransient execution path erroneously executing code follow-ing the exception-inducing instruction Exceptions are a lesscommon way to manage transient execution windows (as theyrequire dedicated handling) but have been extensively usedas triggers in Meltdown-like attacks [78475963677174ndash76 78]

Interrupts This class includes the causes of machine cleardue to hardware (only) interrupts Similar to exceptions forc-

USENIX Association 30th USENIX Security Symposium 1463

ing a HW interrupt is sufficient to create a transient executionpath erroneously executing code following the interruptedinstruction Hardware interrupt are asynchronous by natureand thus difficult to control resulting in a less than ideal wayto manage transient execution windows Nonetheless theywere abused by prior side-channel attacks [72]

Likely invariants violations This class includes all theremaining causes of machine clear derived by likely invari-ants [15] used by the CPU Such invariants commonly holdbut occasionally fail allowing hardware to implement fast-path optimizations However compared to exceptions andinterrupts slow-path occurrences are typically more frequentrequiring more efficient handling in hardware or microcodeWe discussed examples of such invariants in the paper (egstore instructions are expected to never target cached instruc-tions floating-point operations are expected to never operateon denormal numbers etc) and their lazy handling mecha-nisms (ie L1dL1i resynchronization microcode-based de-normal arithmetic) Forcing a likely invariant violation is suf-ficient to create a transient execution path accessing erroneouscode or data In this paper we have shown such violationsare not only a realistic way to manage transient executionwindows but also provide new opportunities and primitivesfor transient execution attacks

14 Related Work

Spectre [3141] and Meltdown [47] first examined the securityimplications of transient execution originating a large bodyof research on transient execution attacks [6ndash112840434859 63 65 67 68 71 74ndash76 78 79 82] Rather than focusingon attacks and their classification [10 75 80] ours is the firsteffort to systematize the root causes of transient executionand examine the many unexplored cases of machine clears

We now briefly survey prior security efforts concerned with the major causes of machine clear discussed in this paper. Self-modifying code is commonly used by malware as an obfuscation technique [69] and has also been used to improve side-channel attacks by means of performance degradation [2]. Moreover, our SCSB primitive bears similarities with prior transient execution primitives inducing speculative control flow hijacking, either through branch target injection [41, 49] or architectural branch target corruption [28]. The performance variability of floating-point operations on denormal numbers [17, 46] has been previously exploited in traditional timing side-channel attacks [4]. Speculation introduced by stricter memory models is a well-known concept in the computer architecture literature [27, 30, 66]. While this is non-trivial to exploit, prior work did demonstrate information disclosure [57, 79] by exploiting the snoop protocol discussed in Section 7. The memory disambiguation predictor has been previously abused to leak stale data in Spectre Speculative Store Bypass exploits [55, 82]. Moreover, its behavior has been partially reverse engineered before [18]. In contrast to all these efforts, we focus on machine clears to systematically study all the root causes of transient execution, fully reverse engineering their behavior and uncovering their security implications well beyond the state of the art.

15 Conclusions

We have shown that the root causes of transient execution can be quite diverse and go well beyond simple branch misprediction or similar. To support this claim, we systematically explored and reverse engineered the previously unexplored class of bad speculation known as machine clear. We discussed several transient execution paths neglected in the literature, examining their capabilities and new opportunities for attacks. Furthermore, we presented two new machine clear-based transient execution attack primitives (Floating Point Value Injection and Speculative Code Store Bypass). We also presented an end-to-end FPVI exploit disclosing arbitrary memory in Firefox and analyzed the applicability of SCSB in real-world applications such as JIT engines. Additionally, we proposed mitigations and evaluated their performance overhead. Finally, we presented a new root cause-based classification for all the known transient execution paths.

Disclosure

We disclosed Floating Point Value Injection and Speculative Code Store Bypass to CPU, browser, OS, and hypervisor vendors in February 2021. Following our reports, Intel confirmed the FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089) vulnerabilities, rewarded them with the Intel Bug Bounty program, and released a security advisory with recommendations in line with our proposed mitigations [34]. Mozilla confirmed the FPVI exploit (CVE-2021-29955 [21, 22]), rewarded it with the Mozilla Security Bug Bounty program, and deployed a mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51].

Acknowledgments

We thank our shepherd, Daniel Genkin, and the anonymous reviewers for their valuable comments. We also thank Erik Bosman from VUSec and Andrew Cooper from Citrix for their input, Intel and Mozilla engineers for the productive mitigation discussions, Travis Downs for his MD reverse engineering, and Evan Wallace for his Float Toy tool. This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and 825377 (UNICORE), by Intel Corporation through the Side Channel Vulnerability ISRA, and by the Dutch Research Council (NWO) through the INTERSECT project.


References

[1] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019, 2019.

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley. Hyperdegrade: From ghz to mhz effective cpu frequencies. arXiv preprint arXiv:2101.01077.

[3] AMD. AMD64 Architecture Programmer's Manual.

[4] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In 2015 IEEE S&P.

[5] ARM. Architecture Reference Manual for Armv8-A.

[6] Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kurmus. Smotherspectre: exploiting speculative execution through port contention. In CCS'19.

[7] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. In USENIX Security'18.

[8] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking Data on Meltdown-resistant CPUs. In CCS'19.

[9] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. Kaslr: Break it, fix it, repeat. In ACM ASIA CCS 2020.

[10] Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin Von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, and Daniel Gruss. A systematic evaluation of transient execution attacks and defenses. In USENIX Security 19.

[11] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H. Lai. Sgxpectre: Stealing intel secrets from sgx enclaves via speculative execution. In 2019 IEEE EuroS&P.

[12] Chrome. V8 TurboFan documentation.

[13] Chromium. Site Isolation documentation.

[14] Victor Costan and Srinivas Devadas. Intel SGX Explained. IACR Cryptology ePrint Archive, 2016.

[15] David Devecsery, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Optimistic hybrid analysis: Accelerating dynamic analysis through predicated static analysis. In ASPLOS 2018.

[16] Christopher Domas. Breaking the x86 isa. Black Hat USA, 2017.

[17] Isaac Dooley and Laxmikant Kale. Quantifying the interference caused by subnormal floating-point values. In Proceedings of the Workshop on OSIHPA, 2006.

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.

[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. Branchscope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.

[34] Intel. FPVI & SCSB Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts. October 1, 2013. US Patent 8,549,263.


[45] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.

[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.

[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.

[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.

[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. Memjam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358.

[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111.

[53] Mozilla. Spider Monkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h.

[54] Mozilla. SpiderMonkey IonMonkey documentation.

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. Specfuzz: Bringing spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the cpu on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. Spectaint: Speculative taint analysis for discovering spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. Sok: In search of lost time: A review of javascript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers. 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in javascript. In FC, IFCA, 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. Netspectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. Lazyfp: Leaking fpu register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. Meltdownprime and spectreprime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. Sok: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8.

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. Lvi: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary cpu interrupt logic. In ACM CCS 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Sgaxe: How sgx fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Cacheout: Leaking data on intel cpus via cache evictions. arXiv preprint.

[77] WebKit. Browserbench. https://browserbench.org.

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-ng: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+reload: a high resolution, low noise, l3 cache side-channel attack. In USENIX Security 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    ; rdi: store addr, rsi: load addr
    st_ld:
        %rep 10                 ; Trick to delay the store address
        imul rdi, 1
        %endrep
        mov DWORD [rdi], 0x42   ; Store
        mov eax, DWORD [rsi]    ; Load
        %rep 10                 ; Pronounce load timing
        imul eax, 1
        %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturating counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that saturating counter is set to 0
    for (int i = 0; i < 100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for (int i = 100; i < 120; i++) st_ld(mem, mem + 64);
    // Trigger a memory ordering machine clear
    for (int i = 120; i < 130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter, incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
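The per-address counter described above can be sketched as a tiny software model. This is only an illustration of the inferred behavior (4-bit counter, increment on a non-aliasing load, reset on an aliasing one, hoist only when saturated); the type and function names are ours, and the model ignores the watchdog and every other microarchitectural detail.

```c
#include <stdbool.h>
#include <stdint.h>

#define MD_COUNTER_MAX 15u   /* saturation value of a 4-bit counter */

/* Hypothetical model of one per-address predictor entry. */
typedef struct {
    uint8_t value;           /* 0..15 */
} md_counter_t;

/* Hoisting is predicted only when the counter is saturated. */
static bool md_predict_hoist(const md_counter_t *c)
{
    return c->value == MD_COUNTER_MAX;
}

/* Update the counter once the load is resolved.
 * Returns true when the load had been hoisted but turned out to alias,
 * i.e. the case that would surface as a machine clear. */
static bool md_update(md_counter_t *c, bool aliases)
{
    bool hoisted = md_predict_hoist(c);
    if (aliases) {
        c->value = 0;        /* aliasing load resets the counter */
        return hoisted;      /* misprediction only if it was hoisted */
    }
    if (c->value < MD_COUNTER_MAX)
        c->value++;          /* non-aliasing load: saturating increment */
    return false;
}
```

Starting from a zeroed counter, 15 calls to `md_update(&c, false)` bring the model to the hoisting state, and the next aliasing call reports a misprediction, mirroring the sequence in Listing 5.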

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange bar: machine clear (memory ordering) observed.

Listing 6: Snippet observing activation of never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            st_ld(mem, mem + 64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters), and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe if two unrelated loads in these functions are able to influence each other when varying the number of nops. In our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256·k bytes apart in memory (k ∈ ℕ). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
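The indexing scheme observed above (256 predictors selected by the lowest 8 bits of the load's instruction pointer) can be written down as a short sketch. The helper names are hypothetical, and the constant reflects only the CPUs tested in the paper; other microarchitectures may differ.

```c
#include <stdbool.h>
#include <stdint.h>

#define MD_NUM_PREDICTORS 256u   /* observed on the tested CPUs (patent: 64) */

/* Predictor entry selected by a load: lowest 8 bits of its instruction pointer. */
static unsigned md_predictor_index(uintptr_t load_ip)
{
    return (unsigned)(load_ip & (MD_NUM_PREDICTORS - 1u));
}

/* Two loads share a predictor entry when their instruction pointers
 * differ by a multiple of 256 bytes. */
static bool md_loads_collide(uintptr_t load_ip_a, uintptr_t load_ip_b)
{
    return md_predictor_index(load_ip_a) == md_predictor_index(load_ip_b);
}
```

Under this model, an attacker-controlled load placed at any address congruent to the victim load's address modulo 256 would train the same entry.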

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that, after 4 machine clears, the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads; y-axis: total measured machine clears).

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0 and disabled otherwise. The flush counter is decremented every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measure how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that, to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
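To make the 15+64 arithmetic above concrete, the following sketch combines one per-address counter with a flush counter in a simplified software model. The names, the representation of the flush counter as an integer in the range -1..3, and the choice to leave the run of correct predictions untouched by aliasing loads are our assumptions for illustration; the model tracks a single load address and is not the hardware implementation.

```c
#include <stdbool.h>

#define MD_COUNTER_MAX       15   /* per-address 4-bit saturating counter */
#define FLUSH_COUNTER_MAX     3   /* assumed upper bound of the flush counter */
#define FLUSH_COUNTER_MIN   (-1)  /* below zero: never-hoisting state */
#define HOISTS_PER_INCREMENT 64   /* correct predictions per flush increment */

typedef struct {
    int per_address;   /* 0..15 */
    int flush;         /* -1..3 */
    int correct_run;   /* correct hoisting predictions since last increment */
} md_model_t;

typedef enum { MD_NO_HOIST, MD_HOIST_OK, MD_MACHINE_CLEAR } md_outcome_t;

/* Simulate one store-load pair; 'aliases' says whether the load aliases the store. */
static md_outcome_t md_step(md_model_t *m, bool aliases)
{
    bool predicted = (m->per_address == MD_COUNTER_MAX);
    bool watchdog  = (m->flush < 0);          /* predictor disabled */
    bool hoisted   = predicted && !watchdog;
    md_outcome_t out = MD_NO_HOIST;

    if (aliases) {
        m->per_address = 0;
        if (hoisted) {                        /* wrongly hoisted: machine clear */
            out = MD_MACHINE_CLEAR;
            if (m->flush > FLUSH_COUNTER_MIN)
                m->flush--;
        }
    } else {
        if (m->per_address < MD_COUNTER_MAX)
            m->per_address++;
        if (predicted) {                      /* correct, possibly unused, prediction */
            if (hoisted)
                out = MD_HOIST_OK;
            if (++m->correct_run == HOISTS_PER_INCREMENT) {
                m->correct_run = 0;
                if (m->flush < FLUSH_COUNTER_MAX)
                    m->flush++;
            }
        }
    }
    return out;
}

/* Number of non-aliasing loads issued before the first load that is really hoisted. */
static int md_loads_until_hoist(md_model_t m)
{
    int n = 0;
    while (md_step(&m, false) != MD_HOIST_OK)
        n++;
    return n;
}
```

Starting from a zeroed per-address counter in the never-hoisting state, the model needs 79 non-aliasing loads before the next one is hoisted, and replaying the loop of Listing 6 from a saturated flush counter yields exactly 4 machine clears.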

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for (i = 0; i < 10; i++) st_ld(mem + 0x1000, mem + 0x1000);
    for (i = 10; i < 20; i++) st_ld(mem + 0x2000, mem + 0x2000);
    // Cause 4k aliasing
    for (i = 20; i < 40; i++) st_ld(mem + 0x1000, mem + 0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld; y-axis: clock cycles). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Figure 15: Memory disambiguation unit – simplified view.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value and, in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load; (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
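The 4k aliasing condition discussed above (a load whose 4KB page offset matches that of an earlier store to a different address) can be expressed as a one-line check. The helper below is a hypothetical illustration that compares start addresses only and ignores access sizes and partial overlaps.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK 0xFFFu   /* low 12 bits: offset within a 4KB page */

/* True when the two addresses differ but share the same 4KB page offset,
 * i.e. the false-match case referred to as 4k aliasing. */
static bool is_4k_alias(uintptr_t store_addr, uintptr_t load_addr)
{
    return store_addr != load_addr &&
           ((store_addr ^ load_addr) & PAGE_OFFSET_MASK) == 0;
}
```

For the address pair used in the last loop of Listing 7 (mem+0x1000 and mem+0x2000), this check holds, whereas the mem and mem+64 pair of Listing 5 does not satisfy it.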

B Root-Causes Description Table

Acronym         Description

BHT             Branch History Table
BTB             Branch Target Buffer
RSB             Return Stack Buffer
MD              Memory Disambiguation Unit
NM              Device Not Available Exception
DE              Divide-by-Zero Exception
UD              Invalid Opcode Exception
GP              General Protection Fault
AC              Alignment Check Exception
SS              Stack Segment Exception
PF              Page Fault
PF - U/S Bit    Page Table Entry User/Supervisor Bit PF
PF - R/W Bit    Page Table Entry Read/Write Bit PF
PF - P Bit      Page Table Entry Present Bit PF
PF - PKU        Page Table Entry Protection Keys PF
BR              Bound Range Exceeded Exception
FP              Floating Point Assist
SMC             Self-Modifying Code
XMC             Cross-Modifying Code
MO              Memory Ordering Principles Violation
MASKMOV         Masked Load/Store Instruction Assist
A/D Bits        Page Table Entry Access/Dirty Bits Assist
TSX             Intel TSX Transaction Abort
UC              Uncachable Memory Assist
PRM             Processor Reserved Memory Assist
HW Interrupts   Hardware Interrupts



Likely invariants violations This class includes all theremaining causes of machine clear derived by likely invari-ants [15] used by the CPU Such invariants commonly holdbut occasionally fail allowing hardware to implement fast-path optimizations However compared to exceptions andinterrupts slow-path occurrences are typically more frequentrequiring more efficient handling in hardware or microcodeWe discussed examples of such invariants in the paper (egstore instructions are expected to never target cached instruc-tions floating-point operations are expected to never operateon denormal numbers etc) and their lazy handling mecha-nisms (ie L1dL1i resynchronization microcode-based de-normal arithmetic) Forcing a likely invariant violation is suf-ficient to create a transient execution path accessing erroneouscode or data In this paper we have shown such violationsare not only a realistic way to manage transient executionwindows but also provide new opportunities and primitivesfor transient execution attacks

14 Related Work

Spectre [3141] and Meltdown [47] first examined the securityimplications of transient execution originating a large bodyof research on transient execution attacks [6ndash112840434859 63 65 67 68 71 74ndash76 78 79 82] Rather than focusingon attacks and their classification [10 75 80] ours is the firsteffort to systematize the root causes of transient executionand examine the many unexplored cases of machine clears

We now briefly survey prior security efforts concernedwith the major causes of machine clear discussed in this pa-per Self-modifying code is commonly used by malware asan obfuscation technique [69] and has also been used to im-prove side-channel attacks by means of performance degra-dation [2] Moreover our SCSB primitive bears similaritieswith prior transient execution primitives inducing specula-tive control flow hijacking either through branch target in-jection [41 49] or architectural branch target corruption [28]The performance variability of floating-point operations ondenormal numbers [17 46] has been previously exploited intraditional timing side-channel attacks [4] Speculation intro-duced by stricter memory models is a well-known concept inthe computer architecture literature [27 30 66] While this isnon-trivial to exploit prior work did demonstrate informationdisclosure [5779] by exploiting the snoop protocol discussedin Section 7 The memory disambiguation predictor has beenpreviously abused to leak stale data in Spectre SpeculativeStore Bypass exploits [55 82] Moreover its behavior hasbeen partially reverse engineered before [18] In contrast to

all these efforts we focus on machine clears to systematicallystudy all the root causes of transient execution fully reverseengineering their behavior and uncovering their security im-plications well beyond the state of the art

15 Conclusions

We have shown that the root causes of transient execution canbe quite diverse and go well beyond simple branch mispredic-tion or similar To support this claim we systematically ex-plored and reverse engineered the previously unexplored classof bad speculation known as machine clear We discussedseveral transient execution paths neglected in the literatureexamining their capabilities and new opportunities for attacksFurthermore we presented two new machine clear-based tran-sient execution attack primitives (Floating Point Value Injec-tion and Speculative Code Store Bypass) We also presentedan end-to-end FPVI exploit disclosing arbitrary memory inFirefox and analyzed the applicability of SCSB in real-worldapplications such as JIT engines Additionally we proposedmitigations and evaluated their performance overhead Finallywe presented a new root cause-based classification for all theknown transient execution paths

Disclosure

We disclosed Floating Point Value Injection and SpeculativeCode Store Bypass to CPU browser OS and hypervisor ven-dors in February 2021 Following our reports Intel confirmedthe FPVI (CVE-2021-0086) and SCSB (CVE-2021-0089)vulnerabilities rewarded them with the Intel Bug Bounty pro-gram and released a security advisory with recommendationsin line with our proposed mitigations [34] Mozilla confirmedthe FPVI exploit (CVE-2021-29955 [21 22]) rewarded itwith the Mozilla Security Bug Bounty program and deployeda mitigation based on conditionally masking malicious NaN-boxed FP results in Firefox 87 [51]

Acknowledgments

We thank our shepherd Daniel Genkin and the anonymousreviewers for their valuable comments We also thank ErikBosman from VUSec and Andrew Cooper from Citrix fortheir input Intel and Mozilla engineers for the productivemitigation discussions Travis Downs for his MD reverse en-gineering and Evan Wallace for his Float Toy tool This workwas supported by the European Unionrsquos Horizon 2020 re-search and innovation programme under grant agreements No786669 (ReAct) and 825377 (UNICORE) by Intel Corpora-tion through the Side Channel Vulnerability ISRA and by theDutch Research Council (NWO) through the INTERSECTproject

1464 30th USENIX Security Symposium USENIX Association

References[1] IEEE Standard for Floating-Point Arithmetic IEEE Std 754-2019

2019

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley Hyperde-grade From ghz to mhz effective cpu frequencies arXiv preprintarXiv210101077

[3] AMD AMD64 Architecture Programmerrsquos Manual

[4] Marc Andrysco David Kohlbrenner Keaton Mowery Ranjit JhalaSorin Lerner and Hovav Shacham On subnormal floating point andabnormal timing In 2015 IEEE S amp P

[5] ARM Architecture Reference Manual for Armv8-A

[6] Atri Bhattacharyya Alexandra Sandulescu Matthias NeugschwandtnerAlessandro Sorniotti Babak Falsafi Mathias Payer and Anil KurmusSmotherspectre exploiting speculative execution through port con-tention In CCSrsquo19

[7] Jo Van Bulck Marina Minkin Ofir Weisse Daniel Genkin BarisKasikci Frank Piessens Mark Silberstein Thomas F Wenisch Yu-val Yarom and Raoul Strackx Foreshadow Extracting the Keys tothe Intel SGX Kingdom with Transient Out-of-Order Execution InUSENIX Securityrsquo18

[8] Claudio Canella Daniel Genkin Lukas Giner Daniel Gruss MoritzLipp Marina Minkin Daniel Moghimi Frank Piessens MichaelSchwarz Berk Sunar Jo Van Bulck and Yuval Yarom Fallout LeakingData on Meltdown-resistant CPUs In CCSrsquo19

[9] Claudio Canella Michael Schwarz Martin Haubenwallner MartinSchwarzl and Daniel Gruss Kaslr Break it fix it repeat In ACMASIA CCS 2020

[10] Claudio Canella Jo Van Bulck Michael Schwarz Moritz Lipp Ben-jamin Von Berg Philipp Ortner Frank Piessens Dmitry Evtyushkinand Daniel Gruss A systematic evaluation of transient execution at-tacks and defenses In USENIX Security 19

[11] Guoxing Chen Sanchuan Chen Yuan Xiao Yinqian Zhang ZhiqiangLin and Ten H Lai Sgxpectre Stealing intel secrets from sgx enclavesvia speculative execution In 2019 IEEE EuroSampP

[12] Chrome V8 TurboFan documentation

[13] Chromium Site Isolation documentation

[14] Victor Costan and Srinivas Devadas Intel SGX Explained IACRCryptology ePrint Archive 2016

[15] David Devecsery Peter M Chen Jason Flinn and Satish NarayanasamyOptimistic hybrid analysis Accelerating dynamic analysis throughpredicated static analysis In ASPLOS 2018

[16] Christopher Domas Breaking the x86 isa Black Hat USA 2017

[17] Isaac Dooley and Laxmikant Kale Quantifying the interference causedby subnormal floating-point values In Proceedings of the Workshopon OSIHPA 2006

[18] Travis Downs. Memory Disambiguation on Skylake. https://github.com/travisdowns/uarch-bench/wiki/Memory-Disambiguation-on-Skylake, 2019.

[19] Thomas Dullien. Return after free discussion. https://twitter.com/halvarflake/status/1273220345525415937.

[20] Dmitry Evtyushkin, Ryan Riley, Nael Abu-Ghazaleh, and Dmitry Ponomarev. Branchscope: A new side-channel attack on directional branch predictor. ACM SIGPLAN Notices, 53(2):693–707, 2018.

[21] Firefox. Firefox 87 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-10/#CVE-2021-29955.

[22] Firefox. Firefox ESR 78.9 Security Advisory. https://www.mozilla.org/en-US/security/advisories/mfsa2021-11/#CVE-2021-29955.

[23] Firefox. Project Fission documentation.

[24] Fortinet. Use-After-Free Bug in Chakra (CVE-2018-0946). https://www.fortinet.com/blog/threat-research/an-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine.

[25] Ivan Fratric. Return after free discussion. https://twitter.com/ifsecure/status/1273230733516177408.

[26] Pietro Frigo, Cristiano Giuffrida, Herbert Bos, and Kaveh Razavi. Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU. In S&P, May 2018.

[27] Kourosh Gharachorloo, Anoop Gupta, and John L. Hennessy. Two techniques to enhance the performance of memory consistency models. 1991.

[28] Enes Goktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative Probing: Hacking Blind in the Spectre Era. In CCS, 2020.

[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. ASLR on the Line: Practical Cache Attacks on the MMU. In NDSS, February 2017.

[30] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[31] Jann Horn. Reading privileged memory with a side-channel. 2018.

[32] Xen Hypervisor. block_speculation function call in invoke_stub. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/x86_emulate/x86_emulate.c;hb=HEAD.

[33] Xen Hypervisor. block_speculation function call in io_emul_stub_setup. https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/pv/emul-priv-op.c;hb=HEAD.

[34] Intel. FPVI & SCSB. Intel Security Advisory 00516. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00516.html.

[35] Intel. INTEL-SA-00088 - Bounds Check Bypass.

[36] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.

[37] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, combined volumes.

[38] Intel. Intel® VTune™ Profiler User Guide - 4K Aliasing. https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/l1-bound/aliasing-of-4k-address-offset.html.

[39] Intel. Load value injection - deep dive. https://software.intel.com/security-software-guidance/deep-dives/deep-dive-load-value-injection, 2020.

[40] Vladimir Kiriansky and Carl Waldspurger. Speculative buffer overflows: Attacks and defenses. arXiv:1807.03757.

[41] Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre Attacks: Exploiting Speculative Execution. In S&P'19.

[42] David Kohlbrenner and Hovav Shacham. On the effectiveness of mitigations against floating-point timing channels. In USENIX Security Symposium, 2017.

[43] Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael Abu-Ghazaleh. Spectre Returns! Speculation Attacks using the Return Stack Buffer. In USENIX WOOT'18.

[44] Evgeni Krimer, Guillermo Savransky, Idan Mondjak, and Jacob Doweck. Counter-based memory disambiguation techniques for selectively predicting load/store conflicts. October 1, 2013. US Patent 8,549,263.

[45] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program analysis & transformation. In CGO, 2004.

[46] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In OSIHPA, 2005.

[47] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. Meltdown: Reading Kernel Memory from User Space. In USENIX Security'18.

[48] Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative execution using return stack buffers. In Proceedings of the 2018 ACM SIGSAC.

[49] Andrea Mambretti, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, and Anil Kurmus. Two methods for exploiting speculative control flow hijacks. In USENIX WOOT 19.

[50] Ahmad Moghimi, Jan Wichelmann, Thomas Eisenbarth, and Berk Sunar. Memjam: A false dependency attack against constant-time crypto implementations. International Journal of Parallel Programming, 2019.

[51] Mozilla. Firefox Bug 1692972 mitigation. https://hg.mozilla.org/releases/mozilla-beta/rev/b129bba64358.

[52] Mozilla. Spectre mitigations for Value type checks - x86 part. https://bugzilla.mozilla.org/show_bug.cgi?id=1433111.

[53] Mozilla. Spider Monkey JSValue. https://hg.mozilla.org/mozilla-central/file/tip/js/public/Value.h.

[54] Mozilla. SpiderMonkey IonMonkey documentation.

[55] Ken Johnson, Microsoft Security Response Center (MSRC). Analysis and mitigation of speculative store bypass. https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639, 2019.

[56] Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, and Christof Fetzer. Specfuzz: Bringing spectre-type vulnerabilities to the surface. In USENIX Security 20.

[57] Riccardo Paccagnella, Licheng Luo, and Christopher W. Fletcher. Lord of the ring(s): Side channel attacks on the cpu on-chip ring interconnect are practical. arXiv preprint arXiv:2103.03443, 2021.

[58] Zhenxiao Qi, Qian Feng, Yueqiang Cheng, Mengjia Yan, Peng Li, Heng Yin, and Tao Wei. Spectaint: Speculative taint analysis for discovering spectre gadgets. 2021.

[59] Hany Ragab, Alyssa Milburn, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. CrossTalk: Speculative Data Leaks Across Cores Are Real. In S&P, May 2021.

[60] Thomas Rokicki, Clémentine Maurice, and Pierre Laperdrix. Sok: In search of lost time: A review of javascript timers in browsers. In IEEE EuroS&P'21.

[61] Gururaj Saileshwar, Christopher W. Fletcher, and Moinuddin Qureshi. Streamline: a fast, flushless cache covert-channel attack by enabling asynchronous collusion. In ASPLOS, 2021.

[62] Rahul Saxena and John William Phillips. Optimized rounding in underflow handlers. 2001. US Patent 6,219,684.

[63] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Cross-privilege-boundary data sampling. In CCS'19.

[64] Michael Schwarz, Clémentine Maurice, Daniel Gruss, and Stefan Mangard. Fantastic timers and where to find them: High-resolution microarchitectural attacks in javascript. In FC, IFCA 17.

[65] Michael Schwarz, Martin Schwarzl, Moritz Lipp, Jon Masters, and Daniel Gruss. Netspectre: Read arbitrary memory over network. In ESORICS 19.

[66] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A primer on memory consistency and cache coherence. 2011.

[67] Julian Stecklina and Thomas Prescher. Lazyfp: Leaking fpu register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.

[68] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. Meltdownprime and spectreprime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols. arXiv.

[69] Xabier Ugarte-Pedrero, Davide Balzarotti, Igor Santos, and Pablo G. Bringas. Sok: Deep packer inspection: A longitudinal study of the complexity of run-time packers. In 2015 IEEE S&P.

[70] Google. V8 test-jump-table-assembler.cc:220, commit 251fece. https://github.com/v8.

[71] Jo Van Bulck, Daniel Moghimi, Michael Schwarz, Moritz Lipp, Marina Minkin, Daniel Genkin, Yuval Yarom, Berk Sunar, Daniel Gruss, and Frank Piessens. Lvi: Hijacking transient execution through microarchitectural load value injection. In 2020 IEEE S&P 20.

[72] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary cpu interrupt logic. In ACM CCS, 2018.

[73] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-step: A Practical Attack Framework for Precise Enclave Execution Control. In SysTEX'17.

[74] Stephan van Schaik, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Sgaxe: How sgx fails in practice.

[75] Stephan van Schaik, Alyssa Milburn, Sebastian Österlund, Pietro Frigo, Giorgi Maisuradze, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. RIDL: Rogue in-flight data load. In S&P, May 2019.

[76] Stephan van Schaik, Marina Minkin, Andrew Kwong, Daniel Genkin, and Yuval Yarom. Cacheout: Leaking data on intel cpus via cache evictions. arXiv preprint.

[77] WebKit. Browserbench. https://browserbench.org.

[78] Ofir Weisse, Jo Van Bulck, Marina Minkin, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Raoul Strackx, Thomas F. Wenisch, and Yuval Yarom. Foreshadow-ng: Breaking the virtual memory abstraction with transient out-of-order execution. 2018.

[79] Pawel Wieczorkiewicz. Intel deep-dive: snoop-assisted L1 Data Sampling.

[80] Wenjie Xiong and Jakub Szefer. Survey of transient execution attacks. arXiv preprint.

[81] Yuval Yarom and Katrina Falkner. Flush+reload: a high resolution, low noise, l3 cache side-channel attack. In USENIX Security 14.

[82] Jann Horn, Google Project Zero. Speculative execution, variant 4: speculative store bypass. https://bugs.chromium.org/p/project-zero/issues/detail?id=1528, 2019.

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    ; rdi: store addr, rsi: load addr
    st_ld:
    %rep 10                     ; Trick to delay the store address
        imul rdi, 1
    %endrep
        mov DWORD [rdi], 0x42   ; Store
        mov eax, DWORD [rsi]    ; Load
    %rep 10                     ; Pronounce load timing
        imul eax, 1
    %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturating counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that the saturating counter is set to 0
    for (i = 0; i < 100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for (i = 100; i < 120; i++) st_ld(mem, mem + 64);
    // Trigger a memory ordering machine clear
    for (i = 120; i < 130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, the flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter, incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
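The counter behavior described above can be sketched as a small software model. The following C snippet is only an illustration under stated assumptions, not the hardware implementation: the names md_entry, md_predicts_hoist, md_update, and replay_listing5 are hypothetical, the counter is assumed to start at 0, and a machine clear is modeled as any load that was hoisted and then turned out to alias. It replays the access pattern of Listing 5 against a 4-bit saturating counter.

```c
#include <assert.h>
#include <stdbool.h>

#define MD_COUNTER_MAX 15 /* saturation value of a 4-bit counter */

/* Hypothetical model of one per-address predictor entry. */
typedef struct {
    int counter; /* 0..MD_COUNTER_MAX */
} md_entry;

/* Hoisting is predicted only while the counter is saturated. */
bool md_predicts_hoist(const md_entry *e) {
    return e->counter == MD_COUNTER_MAX;
}

/* Feed the real aliasing outcome of one store-load pair to the model.
 * Returns true when the pair would cause a machine clear, i.e. the load
 * was hoisted but aliased with the preceding store. */
bool md_update(md_entry *e, bool aliased) {
    assert(e->counter >= 0 && e->counter <= MD_COUNTER_MAX);
    bool hoisted = md_predicts_hoist(e);
    if (aliased)
        e->counter = 0;                 /* reset on any alias */
    else if (e->counter < MD_COUNTER_MAX)
        e->counter++;                   /* saturating increment */
    return hoisted && aliased;
}

/* Replay the pattern of Listing 5: calls 0..99 alias, calls 100..119 do
 * not, calls 120..129 alias again. Reports the index of the first hoisted
 * call and of the first machine clear (-1 if none); returns total clears. */
int replay_listing5(int *first_hoisted, int *first_clear) {
    md_entry e = { 0 }; /* assumption: counter starts at 0 */
    int clears = 0;
    *first_hoisted = -1;
    *first_clear = -1;
    for (int i = 0; i < 130; i++) {
        bool aliased = (i < 100) || (i >= 120);
        if (md_predicts_hoist(&e) && *first_hoisted < 0)
            *first_hoisted = i;
        if (md_update(&e, aliased)) {
            if (*first_clear < 0)
                *first_clear = i;
            clears++;
        }
    }
    return clears;
}
```

In this model, the first hoisted call is call 115 (after the 15 non-aliasing calls 100 to 114), and the only machine clear occurs at call 120, the first aliasing load issued while the counter is saturated.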

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld, from 100 to 130; y-axis: clock cycles). Orange bar: memory ordering machine clear observed.

Listing 6: Snippet observing activation of the never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            st_ld(mem, mem + 64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld, from 0 to 120; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters), and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe if two unrelated loads in these functions are able to influence each other when varying the number of nops. In our tested CPUs, we observed machine clears when two unrelated loads in st_ld and st_ld_offset are located exactly 256·k bytes apart in memory (k ∈ ℕ). We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
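The predictor indexing reported above (256 entries selected by the lowest 8 bits of the load's instruction pointer) can be illustrated with a short C sketch. The function names md_index and md_loads_collide and the constant MD_NUM_PREDICTORS are hypothetical and exist only for this example; the sketch assumes the hash is a plain bit mask, as the text suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 256 entries as observed in the text; the patent suggests 64. */
#define MD_NUM_PREDICTORS 256u

/* Hypothetical index function: the lowest 8 bits of the load's
 * instruction pointer select the per-address predictor. */
unsigned md_index(uintptr_t load_ip) {
    /* The mask below only works for power-of-two table sizes. */
    assert((MD_NUM_PREDICTORS & (MD_NUM_PREDICTORS - 1u)) == 0);
    return (unsigned)(load_ip & (MD_NUM_PREDICTORS - 1u));
}

/* Two loads share a predictor entry when their instruction pointers
 * differ by a multiple of 256 bytes. */
bool md_loads_collide(uintptr_t ip_a, uintptr_t ip_b) {
    return md_index(ip_a) == md_index(ip_b);
}
```

Under these assumptions, an attacker-controlled load placed at any address congruent to the victim load's address modulo 256 would train the same entry. A load 64 bytes away would collide under the patent's 64-entry design but not under the 256-entry behavior reported here.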

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that after 4 machine clears the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0, and disabled otherwise. The flush counter is decremented on every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measured how many MCs can be triggered in a row before the flush counter is decremented to -1 and thus the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal that the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the predictor disabled by the watchdog (the never-hoisting state), our results show that, to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads, from 15 to 399 in steps of 64; y-axis: total measured machine clears, from 1 to 4).
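The watchdog behavior described above can be combined with the per-address counter in one small simulation. The C sketch below is a hypothetical model, not the actual hardware logic: the type md_model, the functions md_step, loads_to_leave_never_hoisting, and replay_listing6_clears, and all constants are names chosen for this example. It assumes that the flush counter ranges from -1 to 3, that it starts saturated at 3 in the Listing 6 replay, and that the running count of correct predictions is reset by a machine clear. None of these details are confirmed by the text beyond the observed counts.

```c
#include <assert.h>
#include <stdbool.h>

#define MD_COUNTER_MAX 15  /* 4-bit per-address counter */
#define FLUSH_MIN      (-1) /* never-hoisting state */
#define FLUSH_MAX      3    /* assumed ceiling: 4 clears reach -1 */
#define HOISTS_PER_INC 64   /* correct predictions per flush increment */

typedef struct {
    int counter; /* per-address saturating counter */
    int flush;   /* flush counter, FLUSH_MIN..FLUSH_MAX */
    int good;    /* correct (or would-be correct) hoist predictions so far */
} md_model;

/* Simulate one store-load pair; returns true on a machine clear. */
bool md_step(md_model *m, bool aliased) {
    assert(m->flush >= FLUSH_MIN && m->flush <= FLUSH_MAX);
    bool predicted = (m->counter == MD_COUNTER_MAX);
    bool hoisted = predicted && m->flush >= 0; /* watchdog may veto */
    bool clear = hoisted && aliased;

    if (clear) {
        m->flush--;   /* flush >= 0 here, so it never drops below -1 */
        m->good = 0;  /* assumption: a clear resets the running count */
    } else if (predicted && !aliased && ++m->good == HOISTS_PER_INC) {
        m->good = 0;
        if (m->flush < FLUSH_MAX)
            m->flush++;
    }
    if (aliased)
        m->counter = 0;
    else if (m->counter < MD_COUNTER_MAX)
        m->counter++;
    return clear;
}

/* Number of non-aliasing loads needed, starting from the never-hoisting
 * state with a cleared per-address counter, before hoisting is possible. */
int loads_to_leave_never_hoisting(void) {
    md_model m = { 0, FLUSH_MIN, 0 };
    int n = 0;
    while (!(m.counter == MD_COUNTER_MAX && m.flush >= 0)) {
        md_step(&m, false);
        n++;
    }
    return n;
}

/* Replay Listing 6 (10 rounds of 19 non-aliasing loads followed by one
 * aliasing load) and count the machine clears. */
int replay_listing6_clears(void) {
    md_model m = { 0, FLUSH_MAX, 0 }; /* assumption: flush starts at 3 */
    int clears = 0;
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 19; j++)
            md_step(&m, false);
        clears += md_step(&m, true);
    }
    return clears;
}
```

With these assumptions the model needs 15+64=79 non-aliasing loads to leave the never-hoisting state and produces 4 machine clears in the Listing 6 replay before hoisting is disabled, which agrees with the counts reported in the text.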

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for (i = 0; i < 10; i++) st_ld(mem + 0x1000, mem + 0x1000);
    for (i = 10; i < 20; i++) st_ld(mem + 0x2000, mem + 0x2000);
    // Cause 4k aliasing
    for (i = 20; i < 40; i++) st_ld(mem + 0x1000, mem + 0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld, from 0 to 40; y-axis: clock cycles). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value and, in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor.

Figure 15 presents a simplified view of the reversed memory disambiguation predictor. In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load; (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.

Figure 15: Memory disambiguation unit, simplified view.
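The distinction between a true alias and 4k aliasing discussed above comes down to comparing page offsets. The following C sketch classifies a store-load address pair. It is a simplified illustration: the names pair_kind, classify_pair, and check_listing_pairs are hypothetical, and it assumes same-size, naturally aligned accesses, so that a true alias means identical addresses.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK ((uintptr_t)0xFFF) /* low 12 bits: 4KB page offset */

typedef enum { PAIR_INDEPENDENT, PAIR_TRUE_ALIAS, PAIR_4K_ALIAS } pair_kind;

/* Simplified classification of a store-load pair by address. */
pair_kind classify_pair(uintptr_t store_addr, uintptr_t load_addr) {
    if (store_addr == load_addr)
        return PAIR_TRUE_ALIAS;   /* store-to-load forwarding applies */
    if (((store_addr ^ load_addr) & PAGE_OFFSET_MASK) == 0)
        return PAIR_4K_ALIAS;     /* same page offset, different page */
    return PAIR_INDEPENDENT;
}

/* Check the address pairs used in Listings 5 and 7 for a given base. */
void check_listing_pairs(uintptr_t mem) {
    assert(classify_pair(mem + 0x1000, mem + 0x1000) == PAIR_TRUE_ALIAS);
    assert(classify_pair(mem + 0x1000, mem + 0x2000) == PAIR_4K_ALIAS);
    assert(classify_pair(mem, mem + 64) == PAIR_INDEPENDENT);
}
```

The last loop of Listing 7 uses exactly the second kind of pair: the two addresses differ by 0x1000, so their low 12 bits match while the accesses touch different locations.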

B Root-Causes Description Table

Acronym        Description
-------------  ------------------------------------------
BHT            Branch History Table
BTB            Branch Target Buffer
RSB            Return Stack Buffer
MD             Memory Disambiguation Unit
NM             Device Not Available Exception
DE             Divide-by-Zero Exception
UD             Invalid Opcode Exception
GP             General Protection Fault
AC             Alignment Check Exception
SS             Stack Segment Exception
PF             Page Fault
PF - US Bit    Page Table Entry User/Supervisor Bit PF
PF - RW Bit    Page Table Entry Read/Write Bit PF
PF - P Bit     Page Table Entry Present Bit PF
PF - PKU       Page Table Entry Protection Keys PF
BR             Bound Range Exceeded Exception
FP             Floating Point Assist
SMC            Self-Modifying Code
XMC            Cross-Modifying Code
MO             Memory Ordering Principles Violation
MASKMOV        Masked Load/Store Instruction Assist
AD Bits        Page Table Entry Access/Dirty Bits Assist
TSX            Intel TSX Transaction Abort
UC             Uncacheable Memory Assist
PRM            Processor Reserved Memory Assist
HW Interrupts  Hardware Interrupts


Page 16: Rage Against the Machine Clear: A Systematic Analysis of

References[1] IEEE Standard for Floating-Point Arithmetic IEEE Std 754-2019

2019

[2] Alejandro Cabrera Aldaya and Billy Bob Brumley Hyperde-grade From ghz to mhz effective cpu frequencies arXiv preprintarXiv210101077

[3] AMD AMD64 Architecture Programmerrsquos Manual

[4] Marc Andrysco David Kohlbrenner Keaton Mowery Ranjit JhalaSorin Lerner and Hovav Shacham On subnormal floating point andabnormal timing In 2015 IEEE S amp P

[5] ARM Architecture Reference Manual for Armv8-A

[6] Atri Bhattacharyya Alexandra Sandulescu Matthias NeugschwandtnerAlessandro Sorniotti Babak Falsafi Mathias Payer and Anil KurmusSmotherspectre exploiting speculative execution through port con-tention In CCSrsquo19

[7] Jo Van Bulck Marina Minkin Ofir Weisse Daniel Genkin BarisKasikci Frank Piessens Mark Silberstein Thomas F Wenisch Yu-val Yarom and Raoul Strackx Foreshadow Extracting the Keys tothe Intel SGX Kingdom with Transient Out-of-Order Execution InUSENIX Securityrsquo18

[8] Claudio Canella Daniel Genkin Lukas Giner Daniel Gruss MoritzLipp Marina Minkin Daniel Moghimi Frank Piessens MichaelSchwarz Berk Sunar Jo Van Bulck and Yuval Yarom Fallout LeakingData on Meltdown-resistant CPUs In CCSrsquo19

[9] Claudio Canella Michael Schwarz Martin Haubenwallner MartinSchwarzl and Daniel Gruss Kaslr Break it fix it repeat In ACMASIA CCS 2020

[10] Claudio Canella Jo Van Bulck Michael Schwarz Moritz Lipp Ben-jamin Von Berg Philipp Ortner Frank Piessens Dmitry Evtyushkinand Daniel Gruss A systematic evaluation of transient execution at-tacks and defenses In USENIX Security 19

[11] Guoxing Chen Sanchuan Chen Yuan Xiao Yinqian Zhang ZhiqiangLin and Ten H Lai Sgxpectre Stealing intel secrets from sgx enclavesvia speculative execution In 2019 IEEE EuroSampP

[12] Chrome V8 TurboFan documentation

[13] Chromium Site Isolation documentation

[14] Victor Costan and Srinivas Devadas Intel SGX Explained IACRCryptology ePrint Archive 2016

[15] David Devecsery Peter M Chen Jason Flinn and Satish NarayanasamyOptimistic hybrid analysis Accelerating dynamic analysis throughpredicated static analysis In ASPLOS 2018

[16] Christopher Domas Breaking the x86 isa Black Hat USA 2017

[17] Isaac Dooley and Laxmikant Kale Quantifying the interference causedby subnormal floating-point values In Proceedings of the Workshopon OSIHPA 2006

[18] Travis Downs Memory Disambiguation on Skylakehttpsgithubcomtravisdownsuarch-benchwikiMemory-Disambiguation-on-Skylake 2019

[19] Thomas Dullien Return after free discussion httpstwittercomhalvarflakestatus1273220345525415937

[20] Dmitry Evtyushkin Ryan Riley Nael CSE Abu-Ghazaleh ECE andDmitry Ponomarev Branchscope A new side-channel attack on di-rectional branch predictor ACM SIGPLAN Notices 53(2)693ndash7072018

[21] Firefox Firefox 87 Security Advisory httpswwwmozillaorgen-USsecurityadvisoriesmfsa2021-10CVE-2021-29955

[22] Firefox Firefox ESR 789 Security Advisory httpswwwmozillaorgen-USsecurityadvisoriesmfsa2021-11CVE-2021-29955

[23] Firefox Project Fission documentation

[24] Fortninet Use-After-Free Bug in Chakra (CVE-2018-0946)httpswwwfortinetcomblogthreat-researchan-analysis-of-the-use-after-free-bug-in-microsoft-edge-chakra-engine

[25] Ivan Fratric Return after free discussion httpstwittercomifsecurestatus1273230733516177408

[26] Pietro Frigo Cristiano Giuffrida Herbert Bos and Kaveh RazaviGrand Pwning Unit Accelerating Microarchitectural Attacks withthe GPU In SampP May 2018

[27] Kourosh Gharachorloo Anoop Gupta and John L Hennessy Twotechniques to enhance the performance of memory consistency models1991

[28] Enes Goktas Kaveh Razavi Georgios Portokalidis Herbert Bos andCristiano Giuffrida Speculative Probing Hacking Blind in the SpectreEra In CCS 2020

[29] Ben Gras Kaveh Razavi Erik Bosman Herbert Bos and CristianoGiuffrida ASLR on the Line Practical Cache Attacks on the MMUIn NDSS February 2017

[30] John L Hennessy and David A Patterson Computer architecture aquantitative approach Elsevier 2011

[31] Jann Horn Reading privileged memory with a side-channel 2018

[32] Xen Hypervisor block_speculation function call in invoke_stubhttpsxenbitsxenorggitwebp=xengita=blobf=xenarchx86x86_emulatex86_emulatechb=HEAD

[33] Xen Hypervisor block_speculation function call inio_emul_stub_setup httpsxenbitsxenorggitwebp=xengita=blobf=xenarchx86pvemul-priv-opchb=HEAD

[34] Intel FPVI amp SCSB Intel Security Advisoray 00516 httpswwwintelcomcontentwwwusensecurity-centeradvisoryintel-sa-00516html

[35] Intel INTEL-SA-00088 - Bounds Check Bypass

[36] Intel Intelreg 64 and IA-32 Architectures Optimization ReferenceManual

[37] Intel Intelreg 64 and IA-32 Architectures Software Developerrsquos Manualcombined volumes

[38] Intel Intelreg VTunetrade Profiler User Guide - 4KAliasing httpssoftwareintelcomcontentwwwusendevelopdocumentationvtune-helptopreferencecpu-metrics-referencel1-boundaliasing-of-4k-address-offsethtml

[39] Intel Load value injection - deep divehttpssoftwareintelcomsecurity-software-guidancedeep-divesdeep-dive-load-value-injection 2020

[40] Vladimir Kiriansky and Carl Waldspurger Speculative buffer overflowsAttacks and defenses arXiv180703757

[41] Paul Kocher Jann Horn Anders Fogh Daniel Genkin Daniel GrussWerner Haas Mike Hamburg Moritz Lipp Stefan Mangard ThomasPrescher Michael Schwarz and Yuval Yarom Spectre Attacks Ex-ploiting Speculative Execution In SampPrsquo19

[42] David Kohlbrenner and Hovav Shacham On the effectiveness ofmitigations against floating-point timing channels In USENIX SecuritySymposium 2017

[43] Esmaeil Mohammadian Koruyeh Khaled N Khasawneh ChengyuSong and Nael Abu-Ghazaleh Spectre Returns Speculation Attacksusing the Return Stack Buffer In USENIX WOOTrsquo18

[44] Evgeni Krimer Guillermo Savransky Idan Mondjak and JacobDoweck Counter-based memory disambiguation techniques for se-lectively predicting loadstore conflicts October 1 2013 US Patent8549263

USENIX Association 30th USENIX Security Symposium 1465

[45] Chris Lattner and Vikram Adve Llvm A compilation framework forlifelong program analysis amp transformation In CGO 2004

[46] Orion Lawlor Hari Govind Isaac Dooley Michael Breitenfeld andLaxmikant Kale Performance degradation in the presence of subnor-mal floating-point values In OSIHPA 2005

[47] Moritz Lipp Michael Schwarz Daniel Gruss Thomas Prescher WernerHaas Anders Fogh Jann Horn Stefan Mangard Paul Kocher DanielGenkin Yuval Yarom and Mike Hamburg Meltdown Reading KernelMemory from User Space In USENIX Securityrsquo18

[48] Giorgi Maisuradze and Christian Rossow ret2spec Speculative ex-ecution using return stack buffers In Proceedings of the 2018 ACMSIGSAC

[49] Andrea Mambretti Alexandra Sandulescu Matthias NeugschwandtnerAlessandro Sorniotti and Anil Kurmus Two methods for exploitingspeculative control flow hijacks In USENIX WOOT 19

[50] Ahmad Moghimi Jan Wichelmann Thomas Eisenbarth and BerkSunar Memjam A false dependency attack against constant-timecrypto implementations International Journal of Parallel Program-ming 2019

[51] Mozilla Firefox Bug 1692972 mitigation httpshgmozillaorgreleasesmozilla-betarevb129bba64358

[52] Mozilla Spectre mitigations for Value type checks - x86 part httpsbugzillamozillaorgshow_bugcgiid=1433111

[53] Mozilla Spider Monkey JSValue httpshgmozillaorgmozilla-centralfiletipjspublicValueh

[54] Mozilla SpiderMonkey IonMonkey documentation

[55] Ken Johnson Microsoft Security Response Center(MSRC) Analysis and mitigation of speculative storebypass httpsmsrc-blogmicrosoftcom20180521analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639 2019

[56] Oleksii Oleksenko Bohdan Trach Mark Silberstein and Christof Fet-zer Specfuzz Bringing spectre-type vulnerabilities to the surface InUSENIX Security 20

[57] Riccardo Paccagnella Licheng Luo and Christopher W Fletcher Lordof the ring (s) Side channel attacks on the cpu on-chip ring interconnectare practical arXiv preprint arXiv210303443 2021

[58] Zhenxiao Qi Qian Feng Yueqiang Cheng Mengjia Yan Peng Li HengYin and Tao Wei Spectaint Speculative taint analysis for discoveringspectre gadgets 2021

[59] Hany Ragab Alyssa Milburn Kaveh Razavi Herbert Bos and CristianoGiuffrida CrossTalk Speculative Data Leaks Across Cores Are RealIn SampP May 2021

[60] Thomas Rokicki Cleacutementine Maurice and Pierre Laperdrix Sok Insearch of lost time A review of javascript timers in browsers In IEEEEuroSampPrsquo21

[61] Gururaj Saileshwar Christopher W Fletcher and Moinuddin QureshiStreamline a fast flushless cache covert-channel attack by enablingasynchronous collusion In ASPLOS 2021

[62] Rahul Saxena and John William Phillips Optimized rounding in un-derflow handlers 2001 US Patent 6219684

[63] Michael Schwarz Moritz Lipp Daniel Moghimi Jo Van Bulck JulianStecklina Thomas Prescher and Daniel Gruss ZombieLoad Cross-privilege-boundary data sampling In CCSrsquo19

[64] Michael Schwarz Cleacutementine Maurice Daniel Gruss and Stefan Man-gard Fantastic timers and where to find them High-resolution microar-chitectural attacks in javascript In FC IFCA 17

[65] Michael Schwarz Martin Schwarzl Moritz Lipp Jon Masters andDaniel Gruss Netspectre Read arbitrary memory over network InESORICS 19

[66] Daniel J Sorin Mark D Hill and David A Wood A primer on memoryconsistency and cache coherence 2011

[67] Julian Stecklina and Thomas Prescher Lazyfp Leaking fpu reg-ister state using microarchitectural side-channels arXiv preprintarXiv180607480 2018

[68] Caroline Trippel Daniel Lustig and Margaret Martonosi Meltdown-prime and spectreprime Automatically-synthesized attacks exploitinginvalidation-based coherence protocols arXiv

[69] Xabier Ugarte-Pedrero Davide Balzarotti Igor Santos and Pablo GBringas Sok Deep packer inspection A longitudinal study of thecomplexity of run-time packers In 2015 IEEE S amp P

[70] Google V8 test-jump-table-assemblercc220 commit 251fece httpsgithubcomv8

[71] Jo Van Bulck Daniel Moghimi Michael Schwarz Moritz Lippi Ma-rina Minkin Daniel Genkin Yuval Yarom Berk Sunar Daniel Grussand Frank Piessens Lvi Hijacking transient execution through mi-croarchitectural load value injection In 2020 IEEE S amp P 20

[72] Jo Van Bulck Frank Piessens and Raoul Strackx Nemesis Studyingmicroarchitectural timing leaks in rudimentary cpu interrupt logic InACM CCS 2018

[73] Jo Van Bulck Frank Piessens and Raoul Strackx SGX-step A Prac-tical Attack Framework for Precise Enclave Execution Control InSysTEXrsquo17

[74] Stephan van Schaik Andrew Kwong Daniel Genkin and Yuval YaromSgaxe How sgx fails in practice

[75] Stephan van Schaik Alyssa Milburn Sebastian Oumlsterlund Pietro FrigoGiorgi Maisuradze Kaveh Razavi Herbert Bos and Cristiano GiuffridaRIDL Rogue in-flight data load In SampP May 2019

[76] Stephan van Schaik Marina Minkin Andrew Kwong Daniel Genkinand Yuval Yarom Cacheout Leaking data on intel cpus via cacheevictions arXiv preprint

[77] WebKit Browserbench httpsbrowserbenchorg

[78] Ofir Weisse Jo Van Bulck Marina Minkin Daniel Genkin BarisKasikci Frank Piessens Mark Silberstein Raoul Strackx Thomas FWenisch and Yuval Yarom Foreshadow-ng Breaking the virtual mem-ory abstraction with transient out-of-order execution 2018

[79] Pawel Wieczorkiewicz Intel deep-dive snoop-assisted L1 Data Sam-pling

[80] Wenjie Xiong and Jakub Szefer Survey of transient execution attacksarXiv preprint

[81] Yuval Yarom and Katrina Falkner Flush+ reload a high resolutionlow noise l3 cache side-channel attack In USENIX Security 14

[82] Jann Horn Google Project Zero Speculative execution variant4 speculative store bypass httpsbugschromiumorgpproject-zeroissuesdetailid=1528 2019

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictions, it is essential to reverse engineer the predictor and understand how it can be massaged into the desired state. When executing a load operation, the physical addresses of all prior stores must be known to decide whether the load should be forwarded the value from the store buffer (store-to-load forwarding, when the store and the load alias) or served from the memory subsystem (when they do not). Since these dependencies introduce significant bottlenecks, modern processors rely on a memory disambiguation predictor to improve common-case performance. If a load is predicted not to alias preceding stores, it can be speculatively executed before the prior stores' addresses are known (i.e., the load is hoisted). Otherwise, the load is stalled until aliasing information is available.

Listing 4: A function to RE the MD predictor. The first 10 imuls, used to delay the store address computation, create the ideal conditions for mispredictions.

    ; st_ld(rdi: store addr, rsi: load addr)
    st_ld:
    %rep 10                     ; Trick to delay the store address
        imul rdi, 1
    %endrep
        mov DWORD [rdi], 0x42   ; Store
        mov eax, DWORD [rsi]    ; Load
    %rep 10                     ; Pronounce load timing
        imul eax, 1
    %endrep
        ret

Listing 5: Observing the size and behavior of the per-address saturating counter.

    uint8_t *mem = malloc(0x1000);
    // Ensure that saturating counter is set to 0
    for(i=0; i<100; i++) st_ld(mem, mem);
    // Make hoisting possible
    for(i=100; i<120; i++) st_ld(mem, mem+64);
    // Trigger a memory ordering machine clear
    for(i=120; i<130; i++) st_ld(mem, mem);

Partial reverse engineering of the memory disambiguation unit behavior was presented in [18], based on a (complex) analysis of Intel patent US8549263B2 [44]. In contrast, we present a full reverse engineering effort (including features such as 4k aliasing, flush counter, etc.) entirely based on a simple implementation: the st_ld function in Listing 4. By surrounding the st_ld function with instrumentation code to measure the timing and the number of machine clears, we were able to accurately detect the status of the predictor for every call to st_ld.

Our first experiment, illustrated in Listing 5, is designed to observe how many missed load hoisting opportunities are needed to switch the predictor state. As shown in the corresponding plot in Figure 11, after 15 non-aliasing loads, we observed that subsequent st_ld invocations are faster due to a correct hoisting prediction. This matches the design suggested in the patent, with the predictor implemented as a 4-bit saturating counter, incremented every time the load does not ultimately alias with preceding stores (and reset to 0 otherwise). Load hoisting is predicted only if the counter is saturated. To ensure that the hoisting state is reached, we later scheduled an aliasing load and checked that the load was incorrectly hoisted by observing a machine clear.
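The counter behavior described above can be illustrated with a small software model. This is only a sketch under the stated reading of the experiment (4-bit counter, increment on a non-aliasing load, reset on an aliasing load, hoist only when saturated); the names md_entry, md_predict_hoist, md_update and md_step are hypothetical and do not come from the paper or the patent.

```c
#include <stdbool.h>
#include <stdint.h>

#define MD_COUNTER_MAX 15u /* a 4-bit counter saturates at 15 */

/* Hypothetical model of one per-address predictor entry. */
typedef struct {
    uint8_t counter; /* 0..15 */
} md_entry;

/* Hoisting is predicted only when the counter is saturated. */
bool md_predict_hoist(const md_entry *e) {
    return e->counter == MD_COUNTER_MAX;
}

/* Update the entry once the aliasing outcome of the load is known. */
void md_update(md_entry *e, bool aliased) {
    if (aliased)
        e->counter = 0;              /* reset on an aliasing load */
    else if (e->counter < MD_COUNTER_MAX)
        e->counter++;                /* saturating increment otherwise */
}

/* Replay one st_ld call against the model.
 * Returns true when the load was hoisted but aliased,
 * i.e., when a machine clear would be expected. */
bool md_step(md_entry *e, bool aliased) {
    bool hoisted = md_predict_hoist(e);
    md_update(e, aliased);
    return hoisted && aliased;
}
```

Replaying the access pattern of Listing 5 against this model (100 aliasing calls, 20 non-aliasing calls, 10 aliasing calls) yields a hoisting prediction from the 16th non-aliasing call onward and a single expected machine clear on the first aliasing call that follows.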

Figure 11: Timing measurement of the experiment in Listing 5 (x-axis: i-th call to st_ld, calls 100 to 130; y-axis: clock cycles). Orange bar: memory ordering machine clear observed.

Listing 6: Snippet observing activation of never-hoisting state.

    uint8_t *mem = malloc(0x1000);
    for(int i=0; i<10; i++) {
        for(int j=0; j<19; j++)
            st_ld(mem, mem+64);
        st_ld(mem, mem);
    }

Figure 12: Never-hoisting state (x-axis: i-th call to st_ld, calls 0 to 120; y-axis: clock cycles). Orange bar: MC observed.

According to the Intel patent [44], there are 64 per-address predictors (i.e., saturating counters), and the suggested hashing function simply uses the lowest 6 bits of the instruction pointer of the load. We verified these numbers using the function st_ld_offset, which is an exact copy of the st_ld function but with a number of nops added in the preamble (which we change at every run). The goal is to observe whether two unrelated loads in these functions are able to influence each other when varying the number of nops. On our tested CPUs, we observed machine clears when the two unrelated loads in st_ld and st_ld_offset are located exactly 256·k bytes apart in memory (k ∈ ℕ), i.e., at a distance that is a multiple of 256 bytes. We used machine clear observations to detect that the per-address prediction of st_ld was affected by the execution of st_ld_offset. Our results match the design suggested in the patent, except that we observed 256 (rather than 64) predictors, hashing the lowest 8 bits of the instruction pointer. With these implementation-specific numbers, an attacker can easily mistrain the predictor of a victim load instruction just with knowledge of its location in memory.
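The collision condition observed in this experiment can be written down as a short check. The sketch below assumes that the predictor index is exactly the lowest 8 bits of the load's instruction pointer, as the measurements suggest for the tested CPUs; md_index and md_same_entry are hypothetical helper names, and other microarchitectures may hash differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define MD_INDEX_BITS  8u
#define MD_NUM_ENTRIES (1u << MD_INDEX_BITS) /* 256 per-address predictors */

/* Assumed hash: the lowest 8 bits of the load's instruction pointer. */
unsigned md_index(uint64_t load_ip) {
    return (unsigned)(load_ip & (MD_NUM_ENTRIES - 1u));
}

/* Two loads share a predictor entry when their instruction pointers
 * are a multiple of 256 bytes apart. */
bool md_same_entry(uint64_t ip_a, uint64_t ip_b) {
    return md_index(ip_a) == md_index(ip_b);
}
```

Under this assumption, an attacker-controlled load placed at any address congruent to the victim load's address modulo 256 would train the same counter as the victim load.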

One important additional component of the predictor is the presence of a watchdog. A never-hoisting global state is used as a fallback to temporarily disable the predictor when the CPU has decided it may be counterproductive. To reverse engineer the behavior, we triggered as many mispredictions as possible to check if hoisting was eventually disabled. The resulting code is illustrated in Listing 6, and the numbers in Figure 12 show that, after 4 machine clears, the predictor is disabled. Indeed, even after 19 further non-aliasing loads (normally abundantly sufficient to switch to a hoisting state), the execution time of st_ld does not decrease.

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads, with ticks at 15, 79, 143, 207, 271, 335, 399; y-axis: total measured machine clears, 1 to 4).

We also reversed the conditions under which the watchdog is enabled/disabled. The patent suggests the watchdog is enabled when the value of a flush counter is smaller than 0 and disabled otherwise. The flush counter is decremented every MD MC and incremented every n correctly hoisted loads. To reverse engineer this behavior, we measured how many MCs can be triggered in a row before the flush counter is decremented to -1 and, thus, the predictor is disabled. As shown in Figure 13, we never observed more than 4 MCs in a row. This suggests that the flush counter is a 2-bit saturating counter. The machine clear patterns also reveal that the flush counter is incremented every 64 correctly hoisted loads. Lastly, since every run starts with the watchdog disabled, our results show that, to switch from a never-hoisting to a predict-hoisting state, 15+64 non-aliasing loads are sufficient. The first 15 loads are necessary to bring the per-address predictor to the hoisting state. The next 64 loads record a would-be correct prediction of the per-address predictor. After 64 such (unused) predictions, the flush counter is incremented to leave the never-hoisting state.
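The interplay between the per-address counter and the flush counter can be explored with a toy model. The sketch below is an assumption-laden reading of the text, not the hardware design: the flush counter is modeled as an integer in the range -1 to 3, the count of correct (or would-be correct) hoist predictions is not reset by a machine clear, and all names (md_model, md_model_step, and the constants) are hypothetical.

```c
#include <stdbool.h>

#define PER_ADDR_MAX       15 /* 4-bit per-address counter saturates at 15 */
#define FLUSH_MAX          3  /* assumed upper bound of the flush counter */
#define GOOD_PER_INCREMENT 64 /* correct predictions per flush increment */

typedef struct {
    int per_addr; /* per-address saturating counter, 0..15 */
    int flush;    /* flush counter, -1..3; below 0 means never-hoisting */
    int good;     /* correct (or would-be correct) hoist predictions so far */
} md_model;

/* Replay one store-load pair; returns true if a machine clear is expected. */
bool md_model_step(md_model *m, bool aliased) {
    bool predicted = (m->per_addr == PER_ADDR_MAX);
    bool hoisted = predicted && (m->flush >= 0); /* watchdog may veto */
    bool machine_clear = hoisted && aliased;

    if (machine_clear)
        m->flush--; /* every MD machine clear decrements the flush counter */

    if (predicted && !aliased && ++m->good == GOOD_PER_INCREMENT) {
        m->good = 0;
        if (m->flush < FLUSH_MAX)
            m->flush++;
    }

    /* Per-address counter: reset on aliasing, saturating increment otherwise. */
    if (aliased)
        m->per_addr = 0;
    else if (m->per_addr < PER_ADDR_MAX)
        m->per_addr++;

    return machine_clear;
}
```

Starting this model in the never-hoisting state (flush counter at -1, per-address counter at 0), 78 non-aliasing pairs leave the predictor disabled, while 79 pairs (15+64) bring the flush counter back to 0 so that a subsequent aliasing load is hoisted. Starting with the flush counter at 3 and replaying the pattern of Listing 6 produces four machine clears before hoisting is disabled, in line with the observations reported above.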

Listing 7: Snippet verifying 4k aliasing-MD unit interaction.

    // Force no-hoisting prediction
    for(i=0; i<10; i++) st_ld(mem+0x1000, mem+0x1000);
    for(i=10; i<20; i++) st_ld(mem+0x2000, mem+0x2000);
    // Cause 4k aliasing
    for(i=20; i<40; i++) st_ld(mem+0x1000, mem+0x2000);

Figure 14: Timing measurements of Listing 7 (x-axis: i-th call to st_ld, calls 0 to 40; y-axis: clock cycles). Orange: increment of LD_BLOCKS_PARTIAL.ADDRESS_ALIAS.

Finally, we examined the interaction between memory disambiguation and 4k aliasing. On Intel CPUs, when a store is followed by a load matching its 4KB page offset, the store-to-load forwarding (STL) logic forwards the stored value, and, in case of a false match (4k aliasing), a few-cycle overhead is needed to re-issue the load [38]. With the help of the performance counter LD_BLOCKS_PARTIAL.ADDRESS_ALIAS and the experiment shown in Listing 7, we verified that 4k aliasing can only happen when a no-hoist prediction is made, as shown in Figure 14. Indeed, STL can be performed only if the store-load pair is executed in order. Additionally, Figure 14 shows that 4k aliasing introduces a slight performance overhead on top of an incorrect no-hoisting guess of the MD predictor. Figure 15 presents a simplified view of the reversed memory disambiguation predictor.

Figure 15: Memory disambiguation unit (simplified view).

In conclusion, an attacker can precisely mistrain the memory disambiguation predictor by satisfying only two requirements: (1) knowing the instruction pointer of the victim load; (2) issuing (in the worst-case scenario) 15+64=79 non-aliasing store-load pairs to train the predictor to the hoisting state.
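The 4k aliasing condition exercised by Listing 7 can be stated as a simple address predicate. This is a simplified sketch that compares only the low 12 address bits and ignores access sizes and partial overlaps; is_4k_alias is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_MASK 0xFFFu /* low 12 bits: offset within a 4 KB page */

/* True if the load has the same 4 KB page offset as the store but targets
 * a different address, i.e., a false store-to-load forwarding match. */
bool is_4k_alias(uintptr_t store_addr, uintptr_t load_addr) {
    return store_addr != load_addr &&
           ((store_addr ^ load_addr) & PAGE_OFFSET_MASK) == 0;
}
```

With the addresses of Listing 7, the pair (mem+0x1000, mem+0x2000) satisfies the predicate, while the true-aliasing pairs used in the first two loops do not.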

B Root-Causes Description Table

Acronym         Description

BHT             Branch History Table
BTB             Branch Target Buffer
RSB             Return Stack Buffer
MD              Memory Disambiguation Unit
NM              Device Not Available Exception
DE              Divide-by-Zero Exception
UD              Invalid Opcode Exception
GP              General Protection Fault
AC              Alignment Check Exception
SS              Stack Segment Exception
PF              Page Fault
PF - US Bit     Page Table Entry User/Supervisor Bit PF
PF - RW Bit     Page Table Entry Read/Write Bit PF
PF - P Bit      Page Table Entry Present Bit PF
PF - PKU        Page Table Entry Protection Keys PF
BR              Bound Range Exceeded Exception
FP              Floating Point Assist
SMC             Self-Modifying Code
XMC             Cross-Modifying Code
MO              Memory Ordering Principles Violation
MASKMOV         Masked Load/Store Instruction Assist
AD Bits         Page Table Entry Access/Dirty Bits Assist
TSX             Intel TSX Transaction Abort
UC              Uncacheable Memory Assist
PRM             Processor Reserved Memory Assist
HW Interrupts   Hardware Interrupts

[77] WebKit Browserbench httpsbrowserbenchorg

[78] Ofir Weisse Jo Van Bulck Marina Minkin Daniel Genkin BarisKasikci Frank Piessens Mark Silberstein Raoul Strackx Thomas FWenisch and Yuval Yarom Foreshadow-ng Breaking the virtual mem-ory abstraction with transient out-of-order execution 2018

[79] Pawel Wieczorkiewicz Intel deep-dive snoop-assisted L1 Data Sam-pling

[80] Wenjie Xiong and Jakub Szefer Survey of transient execution attacksarXiv preprint

[81] Yuval Yarom and Katrina Falkner Flush+ reload a high resolutionlow noise l3 cache side-channel attack In USENIX Security 14

[82] Jann Horn Google Project Zero Speculative execution variant4 speculative store bypass httpsbugschromiumorgpproject-zeroissuesdetailid=1528 2019

A Reversing Memory Disambiguation

To precisely trigger memory disambiguation mispredictionsit is essential to reverse engineer the predictor and understandhow it can be massaged into the desired state When exe-cuting a load operation the physical addresses of all priorstores must be known to decide whether the load should beforwarded the value from the store buffer (store-to-load for-warding when the store and the load alias) or served from thememory subsystem (when they do not) Since these dependen-cies introduce significant bottlenecks modern processors rely

1466 30th USENIX Security Symposium USENIX Association

Listing 4 A function to RE the MD predictor The first 10imuls used to delay the store address computation create theideal conditions for mispredictionsst_ld rdi store addr rsi load addrrep 10 Trick to delay the store addressimul rdi 1endrepmov DWORD [rdi] 0x42 Storemov eax DWORD [rsi] Loadrep 10 Pronounce load timingimul eax 1endrepret

Listing 5 Observing the size and behavior of the per-addresssaturation counteruint8_t mem = malloc(0x1000)Ensure that saturing counter is set to 0for(i=0 ilt100 i++) st_ld(mem mem)Make hoisiting possiblefor(i=100 ilt120 i++) st_ld(mem mem+64)Trigger a memory ordering machine clearfor(i=120 ilt130 i++) st_ld(mem mem)

on a memory disambiguation predictor to improve common-case performance If a load is predicted not to alias precedingstores it can be speculatively executed before the prior storesrsquoaddresses are known (ie the load is hoisted) Otherwise theload is stalled until aliasing information is available

Partial reverse engineering of the memory disambiguationunit behavior was presented in [18] based on a (complex)analysis of Intel patent US8549263B2 [44] In contrast wepresent a full reverse engineering effort (including featuressuch as 4k aliasing flush counter etc) entirely based on asimple implementationmdashst_ld function in Listing 4 Bysurrounding the st_ld function with instrumentation codeto measure the timing and the number of machine clears wewere able to accurately detect the status of the predictor forevery call to st_ld

Our first experiment illustrated in Listing 5 is designedto observe how many missed load hoisting opportunities areneeded to switch the predictor state As shown in the corre-sponding plot in Figure 11 after 15 non-aliasing loads weobserved that subsequent st_ld invocations are faster due toa correct hoisting prediction This matches the design sug-gested in the patent with the predictor implemented as a 4-bitsaturating counter incremented every time the load does not ul-timately alias with preceding stores (and reset to 0 otherwise)Load hoisting is predicted only if the counter is saturated Toensure that the hoisting state is reached we later scheduledan aliasing load and checked that the load was incorrectlyhoisted by observing a machine clear

According to the Intel patent [44] there are 64 per-addresspredictors (ie saturating counters) and the suggested hash-ing function simply uses the lowest 6 bits of the instructionpointer of the load We verified these numbers using the func-

100 105 110 115 120 125 130i-th call to st_ld

0

50

100

Cloc

k cy

cles

Figure 11 Timing measurement of the experiment in Listing5 Orange bar machine clear memory ordering observed

Listing 6 Snippet observing activation of never-hoisting stateuint8_t mem = malloc(0x1000)for(int i=0 ilt10 i++)

for(int j=0 jlt19 j++)st_ld(mem mem+64)

st_ld(mem mem)

0 10 20 30 40 50 60 70 80 90 100 110 120i-th call to st_ld

0

25

50

75

100

Cloc

k cy

cles

Figure 12 Never-hoisting state Orange bar MC observed

tion st_ld_offset which is an exact copy of the st_ldfunction but with a number of nops added in the preamble(which we change at every run The goal is to observe iftwo unrelated loads in these functions are able to influenceeach other when varying the number of nops In our testedCPUs we observed machine clears when two unrelated loadsin st_ld and st_ld_offset are located exactly 256k bytesapart in memory (k isinN) We used machine clear observationsto detect that the per-address prediction of st_ld was affectedby the execution of st_ld_offset Our results match the de-sign suggested in the patent except we observed 256 (ratherthan 64) predictors hashing the lowest 8 bits of the instruc-tion pointer With these implementation-specific numbers anattacker can easily mistrain the predictor of a victim loadinstruction just with knowledge of its location in memory

One important additional component of the predictor is thepresence of a watchdog A never-hoisting global state is usedas a fallback to temporarily disable the predictor when theCPU has decided it may be counterproductive To reverseengineer the behavior we triggered as many mispredictionsas possible to check if hoisting was eventually disabled Theresulting code is illustrated in Listing 6 and the numbers inFigure 12 shows that after 4 machine clears the predictoris disabled Indeed even after 19 further non-aliasing loads(normally abundantly sufficient to switch to a hoisting state)the execution time of st_ld does not decrease

We also reversed the conditions under which the watch-dog is enableddisabled The patent suggests the watchdog

USENIX Association 30th USENIX Security Symposium 1467

15 79 143 207 271 335 399Number of independent store-loads

1

2

3

4

Tota

l mea

sure

dM

achi

ne C

lear

s

Figure 13 MCs observed after n independent store-loadsstarting from the never-hoisting state

is enabled when the value of a flush counter is smaller than0 and disabled otherwise The flush counter is decrementedevery MD MC and incremented every n correctly hoistedloads To reverse engineer this behavior we measure howmany MCs can be triggered in a row before the flush counteris decremented to -1 and thus the predictor is disabled Asshown in the figure 13 we never observed more than 4 MCsin a row This suggests that the flush counter is a 2-bit sat-urating counter The machine clear patterns also reveal theflush counter is incremented every 64 correctly hoisted loadsLastly since every run starts with the watchdog disabled ourresults show that to switch from a never-hoisting to a predict-hoisting state 15+64 non-aliasing loads are sufficient Thefirst 15 loads are necessary to bring the per-address predictorto the hoisting state The next 64 loads record a would-becorrect prediction of the per-address predictor After 64 such(unused) predictions the flush counter is incremented to leavethe never-hoisting state

Listing 7 Snippet verifying 4k aliasing-MD unit interactionForce no-hoisting predictionfor(i=0 ilt10 i++) st_ld(mem+0x1000 mem+0x1000)for(i=10 ilt20 i++) st_ld(mem+0x2000 mem+0x2000)Cause 4k aliasingfor(i=20 ilt40 i++) st_ld(mem+0x1000 mem+0x2000)

0 5 10 15 20 25 30 35 40i-th call to st_ld

50

60

70

80

90

Cloc

k cy

cles

Figure 14 Timing measurements of Listing 7 Orange incre-ment of LD_BLOCKS_PARTIALADDRESS_ALIAS

Finally we examined the interaction between memory dis-ambiguation and 4k aliasing On Intel CPUs when a store isfollowed by a load matching its 4KB page offset the store-to-load forwarding (STL) logic forwards the stored valueand in case of a false match (4k aliasing) a few-cycle over-head is needed to re-issue the load [38] With the help of theperformance counter LD_BLOCKS_PARTIALADDRESS_ALIASand the experiment shown in Listing 7 we verified that 4kaliasing can only happen when a no-hoist prediction is made

Figure 15 Memory disambiguation unit ndash simplified view

as shown in Figure 14 Indeed STL can be performed onlyif the store-load pair is executed in order Additionally Fig-ure 14 shows that 4k aliasing introduces a slight performanceoverhead on top of an incorrect no-hoisting guess of the MDpredictor Figure 15 presents a simplified view of the reversedmemory disambiguation predictor In conclusion an attackercan precisely mistrain the memory disambiguation predictorby satisfying only two requirements (1) knowing the instruc-tion pointer of the victim load (2) issuing (in the worst-casescenario) 15+64=79 non-aliasing store-load pairs to train thepredictor to the hoisting state

B Root-Causes Description Table

Acronym Description

BHT Branch History TableBTB Branch Target BufferRSB Return Stack BufferMD Memory Disabmbiguation UnitNM Device Not Available ExceptionDE Divide-by-Zero ExceptionUD Invalid Opcode ExceptionGP General Protection FaultAC Alignment Check ExceptionSS Stack Segment ExceptionPF Page FaultPF - US Bit Page Table Entry UserSupervisor Bit PFPF - RW Bit Page Table Entry ReadWrite Bit PFPF - P Bit Page Table Entry Present Bit PFPF - PKU Page Table Entry Protection Keys PFBR Bound Range Exceeded ExceptionFP Floating Point AssistSMC Self-Modifying CodeXMC Cross-Modifying CodeMO Memory Ordering Principles ViolationMASKMOV Masked LoadStore Instruction AssistAD Bits Page Table Entry AccessDirty Bits AssistTSX Intel TSX Transaction AbortUC Uncachable Memory AssistPRM Processor Reserved Memory AssistHW Interrupts Hardware Interrupts

1468 30th USENIX Security Symposium USENIX Association

  • Introduction
  • Background
    • IEEE-754 Denormal Numbers
    • x86 Cache Coherence
    • Memory Ordering
    • Memory Disambiguation
      • Threat Model
      • Machine Clears
      • Self-Modifying Code Machine Clear
      • Floating Point Machine Clear
      • Memory Ordering Machine Clear
      • Memory Disambiguation Machine Clear
      • Other Types of Machine Clear
      • Transient Execution Capabilities
        • Transient Window Size
        • Leakage Rate
          • Attack Primitives
            • Speculative Code Store Bypass (SCSB)
            • Floating Point Value Injection (FPVI)
              • Mitigations
                • SCSB Mitigation
                • FPVI Mitigation
                  • Root Cause-based Classification
                  • Related Work
                  • Conclusions
                  • Reversing Memory Disambiguation
                  • Root-Causes Description Table
Page 18: Rage Against the Machine Clear: A Systematic Analysis of

Listing 4 A function to RE the MD predictor The first 10imuls used to delay the store address computation create theideal conditions for mispredictionsst_ld rdi store addr rsi load addrrep 10 Trick to delay the store addressimul rdi 1endrepmov DWORD [rdi] 0x42 Storemov eax DWORD [rsi] Loadrep 10 Pronounce load timingimul eax 1endrepret

Listing 5 Observing the size and behavior of the per-addresssaturation counteruint8_t mem = malloc(0x1000)Ensure that saturing counter is set to 0for(i=0 ilt100 i++) st_ld(mem mem)Make hoisiting possiblefor(i=100 ilt120 i++) st_ld(mem mem+64)Trigger a memory ordering machine clearfor(i=120 ilt130 i++) st_ld(mem mem)

on a memory disambiguation predictor to improve common-case performance If a load is predicted not to alias precedingstores it can be speculatively executed before the prior storesrsquoaddresses are known (ie the load is hoisted) Otherwise theload is stalled until aliasing information is available

Partial reverse engineering of the memory disambiguationunit behavior was presented in [18] based on a (complex)analysis of Intel patent US8549263B2 [44] In contrast wepresent a full reverse engineering effort (including featuressuch as 4k aliasing flush counter etc) entirely based on asimple implementationmdashst_ld function in Listing 4 Bysurrounding the st_ld function with instrumentation codeto measure the timing and the number of machine clears wewere able to accurately detect the status of the predictor forevery call to st_ld

Our first experiment illustrated in Listing 5 is designedto observe how many missed load hoisting opportunities areneeded to switch the predictor state As shown in the corre-sponding plot in Figure 11 after 15 non-aliasing loads weobserved that subsequent st_ld invocations are faster due toa correct hoisting prediction This matches the design sug-gested in the patent with the predictor implemented as a 4-bitsaturating counter incremented every time the load does not ul-timately alias with preceding stores (and reset to 0 otherwise)Load hoisting is predicted only if the counter is saturated Toensure that the hoisting state is reached we later scheduledan aliasing load and checked that the load was incorrectlyhoisted by observing a machine clear

According to the Intel patent [44] there are 64 per-addresspredictors (ie saturating counters) and the suggested hash-ing function simply uses the lowest 6 bits of the instructionpointer of the load We verified these numbers using the func-

100 105 110 115 120 125 130i-th call to st_ld

0

50

100

Cloc

k cy

cles

Figure 11 Timing measurement of the experiment in Listing5 Orange bar machine clear memory ordering observed

Listing 6 Snippet observing activation of never-hoisting stateuint8_t mem = malloc(0x1000)for(int i=0 ilt10 i++)

for(int j=0 jlt19 j++)st_ld(mem mem+64)

st_ld(mem mem)

0 10 20 30 40 50 60 70 80 90 100 110 120i-th call to st_ld

0

25

50

75

100

Cloc

k cy

cles

Figure 12 Never-hoisting state Orange bar MC observed

tion st_ld_offset which is an exact copy of the st_ldfunction but with a number of nops added in the preamble(which we change at every run The goal is to observe iftwo unrelated loads in these functions are able to influenceeach other when varying the number of nops In our testedCPUs we observed machine clears when two unrelated loadsin st_ld and st_ld_offset are located exactly 256k bytesapart in memory (k isinN) We used machine clear observationsto detect that the per-address prediction of st_ld was affectedby the execution of st_ld_offset Our results match the de-sign suggested in the patent except we observed 256 (ratherthan 64) predictors hashing the lowest 8 bits of the instruc-tion pointer With these implementation-specific numbers anattacker can easily mistrain the predictor of a victim loadinstruction just with knowledge of its location in memory

One important additional component of the predictor is thepresence of a watchdog A never-hoisting global state is usedas a fallback to temporarily disable the predictor when theCPU has decided it may be counterproductive To reverseengineer the behavior we triggered as many mispredictionsas possible to check if hoisting was eventually disabled Theresulting code is illustrated in Listing 6 and the numbers inFigure 12 shows that after 4 machine clears the predictoris disabled Indeed even after 19 further non-aliasing loads(normally abundantly sufficient to switch to a hoisting state)the execution time of st_ld does not decrease

We also reversed the conditions under which the watch-dog is enableddisabled The patent suggests the watchdog

USENIX Association 30th USENIX Security Symposium 1467

15 79 143 207 271 335 399Number of independent store-loads

1

2

3

4

Tota

l mea

sure

dM

achi

ne C

lear

s

Figure 13 MCs observed after n independent store-loadsstarting from the never-hoisting state

is enabled when the value of a flush counter is smaller than0 and disabled otherwise The flush counter is decrementedevery MD MC and incremented every n correctly hoistedloads To reverse engineer this behavior we measure howmany MCs can be triggered in a row before the flush counteris decremented to -1 and thus the predictor is disabled Asshown in the figure 13 we never observed more than 4 MCsin a row This suggests that the flush counter is a 2-bit sat-urating counter The machine clear patterns also reveal theflush counter is incremented every 64 correctly hoisted loadsLastly since every run starts with the watchdog disabled ourresults show that to switch from a never-hoisting to a predict-hoisting state 15+64 non-aliasing loads are sufficient Thefirst 15 loads are necessary to bring the per-address predictorto the hoisting state The next 64 loads record a would-becorrect prediction of the per-address predictor After 64 such(unused) predictions the flush counter is incremented to leavethe never-hoisting state

Listing 7 Snippet verifying 4k aliasing-MD unit interactionForce no-hoisting predictionfor(i=0 ilt10 i++) st_ld(mem+0x1000 mem+0x1000)for(i=10 ilt20 i++) st_ld(mem+0x2000 mem+0x2000)Cause 4k aliasingfor(i=20 ilt40 i++) st_ld(mem+0x1000 mem+0x2000)

0 5 10 15 20 25 30 35 40i-th call to st_ld

50

60

70

80

90

Cloc

k cy

cles

Figure 14 Timing measurements of Listing 7 Orange incre-ment of LD_BLOCKS_PARTIALADDRESS_ALIAS

Finally we examined the interaction between memory dis-ambiguation and 4k aliasing On Intel CPUs when a store isfollowed by a load matching its 4KB page offset the store-to-load forwarding (STL) logic forwards the stored valueand in case of a false match (4k aliasing) a few-cycle over-head is needed to re-issue the load [38] With the help of theperformance counter LD_BLOCKS_PARTIALADDRESS_ALIASand the experiment shown in Listing 7 we verified that 4kaliasing can only happen when a no-hoist prediction is made

Figure 15 Memory disambiguation unit ndash simplified view

as shown in Figure 14 Indeed STL can be performed onlyif the store-load pair is executed in order Additionally Fig-ure 14 shows that 4k aliasing introduces a slight performanceoverhead on top of an incorrect no-hoisting guess of the MDpredictor Figure 15 presents a simplified view of the reversedmemory disambiguation predictor In conclusion an attackercan precisely mistrain the memory disambiguation predictorby satisfying only two requirements (1) knowing the instruc-tion pointer of the victim load (2) issuing (in the worst-casescenario) 15+64=79 non-aliasing store-load pairs to train thepredictor to the hoisting state

B Root-Causes Description Table

Acronym         Description

BHT             Branch History Table
BTB             Branch Target Buffer
RSB             Return Stack Buffer
MD              Memory Disambiguation Unit
NM              Device Not Available Exception
DE              Divide-by-Zero Exception
UD              Invalid Opcode Exception
GP              General Protection Fault
AC              Alignment Check Exception
SS              Stack Segment Exception
PF              Page Fault
PF - U/S Bit    Page Table Entry User/Supervisor Bit PF
PF - R/W Bit    Page Table Entry Read/Write Bit PF
PF - P Bit      Page Table Entry Present Bit PF
PF - PKU        Page Table Entry Protection Keys PF
BR              Bound Range Exceeded Exception
FP              Floating Point Assist
SMC             Self-Modifying Code
XMC             Cross-Modifying Code
MO              Memory Ordering Principles Violation
MASKMOV         Masked Load/Store Instruction Assist
A/D Bits        Page Table Entry Access/Dirty Bits Assist
TSX             Intel TSX Transaction Abort
UC              Uncachable Memory Assist
PRM             Processor Reserved Memory Assist
HW Interrupts   Hardware Interrupts


  • Introduction
  • Background
    • IEEE-754 Denormal Numbers
    • x86 Cache Coherence
    • Memory Ordering
    • Memory Disambiguation
  • Threat Model
  • Machine Clears
  • Self-Modifying Code Machine Clear
  • Floating Point Machine Clear
  • Memory Ordering Machine Clear
  • Memory Disambiguation Machine Clear
  • Other Types of Machine Clear
  • Transient Execution Capabilities
    • Transient Window Size
    • Leakage Rate
  • Attack Primitives
    • Speculative Code Store Bypass (SCSB)
    • Floating Point Value Injection (FPVI)
  • Mitigations
    • SCSB Mitigation
    • FPVI Mitigation
  • Root Cause-based Classification
  • Related Work
  • Conclusions
  • Reversing Memory Disambiguation
  • Root-Causes Description Table

Figure 13: MCs observed after n independent store-loads, starting from the never-hoisting state (x-axis: number of independent store-loads, 15 to 399 in steps of 64; y-axis: total measured machine clears, 1 to 4).
