privacy scope: a precise information flow tracking system for

16
Privacy Scope: A Precise Information Flow Tracking System For Finding Application Leaks Yu Zhu Jaeyeon Jung Dawn Song Tadayoshi Kohno David Wetherall Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2009-145 http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.html October 27, 2009

Upload: doanduong

Post on 05-Feb-2017

217 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Privacy Scope: A Precise Information Flow Tracking System For

Privacy Scope: A Precise Information Flow Tracking

System For Finding Application Leaks

Yu ZhuJaeyeon JungDawn SongTadayoshi KohnoDavid Wetherall

Electrical Engineering and Computer SciencesUniversity of California at Berkeley

Technical Report No. UCB/EECS-2009-145

http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-145.html

October 27, 2009

Page 2: Privacy Scope: A Precise Information Flow Tracking System For

Copyright © 2009, by the author(s).All rights reserved.

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission.

Page 3: Privacy Scope: A Precise Information Flow Tracking System For

Privacy Scope: A Precise Information Flow Tracking SystemFor Finding Application Leaks

David (Yu) Zhu∗, Jaeyeon Jung†, Dawn Song∗, Tadayoshi Kohno‡ and David Wetherall†‡

∗University of California, Berkeley†Intel Labs Seattle

‡University of Washington

Abstract

We present Privacy Scope, a new system that tracksthe movement of sensitive user data as it flowsthrough off-the-shelf applications. Privacy Scope usesapplication-level dynamic taint analysis, implementedwith dynamic binary translation tools, to let users runapplications in their own environment while pinpoint-ing information leaks, even when the sensitive datais encrypted. The system is made possible by tech-niques we developed for accurate and efficient taint-ing. Semantic-aware instruction-level tainting handlesspecial cases and is critical to avoid taint explosionor loss. Function summaries provide an interface tohandle taint propagation within the kernel and reducethe overhead of instruction-level tracking. On-demandinstrumentation enables fast loading of large applica-tions. Together, these techniques let us run on large,multi-threaded, networked applications and preciselytrack where information goes. In tests on InternetExplorer, Yahoo! Messenger, and Windows Notepad,Privacy Scope generated no false positives and instru-mented fewer than 5% of the executed instructions.

1. Introduction

Media and research papers regularly report privacyvulnerabilities in which applications collect or exposepersonal information. Some of these applications aremalware that intend to maliciously exfiltrate data, butmany are not. Our previous study shows that networkapplications often disclose various types of personal in-formation (e.g., search terms, user names, system con-figuration) to their publishers and to third parties [11].Google desktop is known to send indexes of localfiles to servers under a certain configuration [28], andTom-Skype tracks personal chat messages [14]. Otherapplications leak information via temporary copies orcached file snapshots [7].

These examples highlight the fact that legitimatecommercial off-the-shelf applications may manipulateuser information in ways that their users neither expect,nor appreciate. Unfortunately, it is not feasible forusers to check that the applications they run meettheir privacy expectations, company guidelines, or anyother policies they have for the handling of sensitiveinformation. Tools are available to protect storage(e.g., file encryption on sensitive volumes) or limitnetwork access (e.g., two-way firewalls such as LittleSnitch [17]). However, these tools fail to provideprotection once an application is authorized to read theuser’s data and has access to output channels. ConsiderAlice, who uses a messenger client on a companylaptop and wants to be sure that her messages do notaccumulate in a log that may surface later. Or considerBob, who purchases goods online and wants to knowexactly which sites receive his credit card number.Once they choose to use an application, they mustsimply hope for the best.

Our goal is to develop a system that will let usersdiscover when and where applications reveal sensitiveuser data. To be valuable in practice, we have foursubgoals. First, our system must run on real applica-tions. These may be large, multi-threaded and makeheavy use of operating system services. Second, it mustrun in the user’s own environment without the needfor application source code. Requiring either sourcecode or testing environments would greatly limit ap-plicability. Third, while we do not target maliciousapplications that intentionally avoid our techniques,our system must track information even when it istransformed in the output stream. Encryption is animportant transformation that is used with sensitivedata, but there are many other ways that informationis encoded in practice. Fourth, our system must be fastenough to run networked and interactive applications.Heavyweight mechanisms can introduce delays thatcause timeouts in client-server programs (e.g., Web

Page 4: Privacy Scope: A Precise Information Flow Tracking System For

browsers) and prevent normal use. While these goalsare ambitious, we believe they define a highly usableand desirable system.

In this paper we present Privacy Scope, our systemfor pinpointing leaks of sensitive data by commer-cial software that takes significant steps to our goal.Pioneered by Bell and La Padula [1], there is alarge body of literature on information flow security,in which mandatory access control policy is definedand enforced between “subjects” (e.g., processes, pro-grams) and sensitive “objects” (e.g., files, user inputs).Recent works proposed different system prototypes,each implementing information flow security in theform of new operating systems [8], [30], programminglanguages and libraries [19], [12], or system call in-terposition in the legacy system [25]. However, littlehas been studied on how to efficiently track infor-mation flow by real applications in modern operatingsystems. Previous works developed system prototypesfor detecting application leaks while treating a testingapplication as a black box [29], [11]. However, thesesystems are ineffective when the testing applicationreveals a randomized function of the input data. Wholesystem simulation can provide low-level informationon how sensitive input data is accessed and propagatedthroughout the system [5], [28]. However, hardware-level data tracking incurs significant performance andanalysis overhead, which makes it unsuitable for in-specting interactive network applications.

Privacy Scope implements information flow analysison applications by using dynamic binary translationwith Pin [13]. This lets users run off-the-shelf softwarein their own environment. On this base, we develop aset of techniques to track where user information goesaccurately and with enough run-time efficiency that itis plausible for end users to run the system.

Accuracy was a surprising challenge we faced giventhat the idea of tainting data is conceptually simple.However, there are corner cases in which some instruc-tions (e.g., MOV, AND) or situations (e.g., system callside-effects) need special taint-propagation logic. Aswe found when testing on real applications, failure tohandle these cases quickly results in taint explosionor loss of taint for real applications. After overcomingthese special cases, we find that we can interpose onsystem calls and precisely track information betweenkeystroke, file and network socket input and output.

Runtime efficiency is a more traditional but stillchallenging problem for information flow, and we haveinvested considerable engineering effort to developPrivacy Scope into a feasible system. We start withinstrumentation that operates at both the instructionlevel and function level in one program. Function-level

tainting speeds up frequently called functions, and byusing it to model the taint effects of systems calls wecan confine our analysis to an application instead ofhaving to check the entire system at the hardware level.Further, by deferring taint tracking from load time towhen it is first needed we can speed up the analysisdramatically for the tested applications. The result isthat we are able to run our system on networkedand interactive applications such as Internet Explorerand Yahoo! Messenger. We believe that we will beable to further improve performance given the growingresearch community around binary instrumentation.

We make three contributions in this paper. Ourmain contribution is the Privacy Scope system itselfand the set of design choices it embodies. It showshow to implement personal information tracking thatis both accurate and efficient in a context that isbroadly applicable to everyday usage. We plan torelease Privacy Scope as an open source package toempower users to check their off-the-shelf softwarepackages for unexpected information leakage.

A second contribution is semantic-aware taint prop-agation rules we develop to implement precise infor-mation flow tracking. At the instruction level, special-ized taint routines are prescribed for uncommon datamovements (e.g., the REP MOV string copy instruction).At the function level, previously generated modelspropagate taint to capture important side-effects forcalls into the kernel. Our evaluation results showthat Privacy Scope is highly accurate, generating nofalse positives when analyzing large size complicatedapplications running on Windows. While doing so, itaccurately detected when user typed messages weresent or written to a file after transformation (even withencryption or encoding).

Our last contribution is multi-level techniques forefficient tainting. We find that function summariesspeed up taint tracking eight to nine times compared toinstruction-level taint, and their impact is greatly multi-plied by using them for frequently called functions andas part of our approach to instrument the applicationbut not the operating system. Compared to typical load-time instrumentation, our on-demand instrumentationdramatically reduces the number of instructions thatare analyzed, e.g., 5% for Internet Explorer.

The rest of the paper is organized as follows: §2describes our approach. §3 describes the system designfollowed by the implementation highlights. §4 presentsour optimization techniques and the performance study.§5 shows the evaluation results. §6 reviews relatedwork in comparison to Privacy Scope. After presentingremaining challenges in §7, the paper concludes in §8.

Page 5: Privacy Scope: A Precise Information Flow Tracking System For

2. Approach

This section presents two underlying technologiesthat Privacy Scope is built upon. First, we discuss thedynamic taint analysis technique and how we usedthe technique for tracking sensitive data. Then wepresent our approach in terms of (1) what input datawe monitor; (2) how we track the propagation ofsensitive input data; and (3) what output channels wemonitor for leak detection. Second, we discuss the Pindynamic binary transformation (DBT) system [13]. Pinis publicly available software, offering a set of highlevel APIs for instrumenting applications at run time.We review the overall software architecture of Pin DBTand explain in detail a few specific features that weleveraged for Privacy Scope’s efficient operations.

2.1. Dynamic Taint Analysis

Taint analysis is a process for tracking informationthat may have been influenced, or tainted, by otherdata. We use taint analysis to track data deemed to besensitive by the user. The process is inductive. First,sensitive data is marked as tainted as it enters theprogram. Then, each time an instruction that outputsdata accesses input data that is tainted, any outputsthat may have been influenced by the tainted data arethemselves marked as tainted. Figure 1 illustrates thesebasic steps. Tainted data may also influence other dataindirectly, through branch instructions that change theflow of control and determine which instruction willwrite to a given variable [19]. We find that in theapplications we monitor, we are able to accuratelytrack sensitive data even though we do not track taintthat results exclusively as a result of changes of controlflow.

We track taint at byte-level granularity in applicationcode, with object-level granularity for system callsas we do not track instructions within the kernel.In general, without instrumenting the whole-system(e.g., Panorama [28]), taint analysis loses the track ofinformation flow when the application moves tainteddata around via system calls. In this work, we developa special function-level taint propagation to addressthis limitation of application-level taint analysis, whichwe will describe in detail §4.1Taint Source. An application can receive and processsensitive information from many input channels (e.g.USB, keyboard, files, network sockets). Ideally, onewants to monitor all the possible input channels andtaint the input data if the data deem sensitive. However,in practice, monitoring an input channel involves OS

• MOV B,A

• Backup register state

• SetTaint (B, GetTaint(A))

• Restore register state

• MOV B,A

…retmov al, byte ptr ds[esi]mov byte ptr ds[edi], almov eax, dword ptr ss[ebp+0x8]pop esipop edi…

code snippet from IE

instrumentation

k

k

virtualmemory

taint map

0

01

mem

reg

01

10…

Figure 1. MOV instruction is instrumented for dynamictaint analysis. The first step brings the tainted data(k) from memory to the register al. The second stepupdates the taint map entry corresponding to al. Third,the content of the register al is copied to memory. Last,the taint map is updated to reflect the data movement.

device-level instrumentation, thereby requiring some-what complicated interactions with the underlying op-erating systems (e.g., system call hooking). In thiswork, we mainly focus on two common input channels.For each input type, the followings discuss cases inwhich these input channels carry sensitive informationand how we detect these cases for initial tainting.Keystroke Tracking. Users routinely type in creden-tials for accessing online services. Financial data suchas bank accounts and credit card numbers are alsofrequently entered by users for online transactions.These data are often a target of phishing attacksand the increasing complexity of Web pages withdynamic scripts makes it hard to track exactly wherethe sensitive information is sent to. In other cases, usersmay want to send sensitive messages via an instantmessenger or email client and want to make sure thatnothing is cached in the local file system. While itis straightforward to monitor all keyboard input data,it is challenging to automatically identify sensitiveinputs and to taint only those. In our system, PrivacyScope continuously monitors keystrokes via Windowsmessages and listens for designated key combinationsfor marking the beginning/ending of sensitive keyboardinput data. We stress that a better user interface oreven automatic tagging of sensitive data (if available)could improve the user experience of the system butthe system usability is beyond the scope of this paper.File Tracking. Confidential documents, company se-crets and private memo need to be protected from

Page 6: Privacy Scope: A Precise Information Flow Tracking System For

accidental leaks. Even if the original document staysencrypted in the local file system, its temporary copiescan be left unencrypted by the editing software [7] orby the users themselves for convenience. While thetemporary copies are created on the same machine,these copies could be leaked out if the machine isrunning file sharing clients or remote backup programs.In this work, we use the extended attributes availablein NTFS to tag files as sensitive. Privacy Scope auto-matically tracks the file content if a tainted file is readby the application.Leak Detection. Output channels through which sen-sitive information can escape include network sockets,files, Windows registry keys, shared memory, and sys-tem messages. We monitor output to network socketsand files using system call interposition; Privacy Scopecan be extended to monitor other output channels asnecessary. Notification will be displayed (and logged)when tainted information is leaked to monitored outputchannels. While we seek to track input data propaga-tion irrespectively to data transformation, we treat allleaks equally, regardless of how data has been trans-formed. For example, input data can be transformedand may be reduced to the fewer number of bits to thepoint in which the leak is insignificant.

2.2. Dynamic Binary Translation

Software systems that support run-time binaryinstrumentation of Windows applications includePin [13], DynamoRIO [2], StarDBT [24], and Val-grind [21]. Our design is independent of the underlyingDBT system. We chose Pin because of its avail-ability1, well defined programming interface, efficientinstrumentation, and support for additional operatingsystems should we also choose to support them.

Pin uses just-in-time instrumentation and thereforecan handle dynamic program behaviors such as selfmodifying code and statically unknown indirect jumptargets that are hard to predict with static instrumenta-tion. Pin also allows us to instrument instructions (us-ing pintools) without worrying about the housekeepingrequired to ensure that application behavior is not af-fected. However, as we will show in §4, this instrumen-tation transparency comes at the cost of performanceoverhead: every time analysis code is executed, Pinbacks-up the registers before and restores them afterto reserve the application context. We use Pin’s otherfeatures such as inlining and flexible instrumentationfor reducing the overhead. Figure 2 shows a simple

1. In comparison, StarDBT [24] is not publicly available and lacksprogramming interface although the optimizations implemented onthe DBT by the LIFT tool [18] made it an attractive choice at first.

pintool: The program counts the number of instructionsexecuted. The figure highlights routines that are usedfor instrumentation (which is executed only once) andfor analysis (which is executed whenever the programencounters the instrumented code.).

analysis routine

Instrumentation routine

#include "pin.H"#include <iostream>

UINT64 ins_count = 0;

void count(){ ins_count++; }

void Instruction(INS ins,VOID *v){ INS_InsertCall(ins,IPOINT_BEFORE,(AFUNPTR)count, IARG_END); }

void Fini(INT32 code, VOID *v){ cerr << "Count " << ins_count << endl; }

int main(int argc, char *argv[]){ PIN_Init(argc,argv); INS_AddInstrumentFunction(Instruction, 0); PIN_AddFiniFunction(Fini, 0); PIN_StartProgram(); return 0;}

Figure 2. A simple example of Pin-based instrumenta-tion [9]. This pintool counts the number of times that theanalyzed application’s instructions are executed.

Flexible Instrumentation. First, a pintool can im-plement complex instrumentation using the instruc-tion semantics (e.g., opcode, operand type) exposedby Pin at the instrumentation time. For instance,the above example can be easily modified to countonly MOV instructions that read memory. Once aninstruction is instrumented, Pin places the instrumentedcode in the code cache for reuse throughout theapplication’s runtime. However, a pintool can callPIN_RemoveInstrumentation() to remove the al-ready instrumented code and invoke a new instru-mentation logic at runtime. We utilize this feature toimplement the on-demand instrumentation as discussedin §3.

Conditional Inlining. When an instrumented code isconditionally executed in the analysis time, a pintoolcan use the conditional inlining feature to bypassinjecting the instrumented code all together when thecondition is unmet. Pin further optimizes this featureby placing the condition logic inline as checking forthe condition (e.g., the if part) needs to be run allthe time whereas the body (the then part) does not.We use this feature to quickly turn on/off instructionlevel taint propagation logic as needed (e.g., when thefunction summary is available). See §3 for details.

Page 7: Privacy Scope: A Precise Information Flow Tracking System For

3. Privacy Scope: Design and Implemen-tation

In this section, we present the design and implemen-tation of Privacy Scope and how it interacts with theunderlying operating system for monitoring a targetapplication’s input and output channels. Privacy Scopeis implemented in the framework of Pin 2.6 for Win-dows XP. Because we are aiming to create a systemthat end users can run on their runtime environment,efficiency and accuracy are two of our top concerns,and they are the driving factors in our design.Taint Map and Propagation. We have a staticallyallocated taint table with a statically configurable sizeof 8MB. Each bit in the table corresponds to a tainttag of a byte in the virtual memory (4GB). We mapthe virtual memory space into our tag map by leftshift operations. In the case when we use 8MB forour tag map, we left shift virtual address by 6 bits( 4GB

8∗8MB ) to obtain its tag position. We chose to use astatic tag map instead of a shadow page table structure[28] for its simplicity and efficiency in looking up oftaint information. It can potentially lead to collisions,but we expect the collision rate to be low sinceapplication tends to write to only a small portion ofthe virtual memory space. Our experimental results in§5 reaffirms this assumption.

We use a combination of generic instrumentationand instruction specific instrumentation to implementtaint propagation in our tool. We use instruction analy-sis API provided by Pin to determine the registers andmemory regions read and written by an instruction.We are implementing a policy based on the notionof data dependency. If the output is a direct copy ortransformation of the input, then the output will betainted if the input is tainted. We adopt that policyfor our generalized instrumentation. This gives us agood coverage over all instructions and allows us toimplement taint propagation without a specific handlerfor each type of instruction.

However, there are a few notable exceptions to thisgeneralization. We have identified four cases in Table1. Instruction-specific instrumentation is aware of thedifferent mode of operation of the instruction, andtherefore can be more accurate and efficient. §5 showsthat these exceptions lead to significantly fewer falsepositives in real world applications

One important exception is how we handle taintedindex registers. As observed by [22], tainting point-ers to a tainted piece of data could lead to manyfalse positives. Our experience confirms those findings.Hence, we do not taint pointers to tainted data inour system. However, another finding by [22] shows

that it is critical to handle table lookup operationswhere tainted data is used as an index to an array.Without propagating taint through a table lookup op-eration, keyboard taint propagation breaks down whenthe input is translated from keyboard code into anASCII character. We found these table lookups areaccomplished often by addressing memory using anindex register which contains tainted data. Thus wemake an exception for the case, and propagate taintfrom index register to the destination. Furthermore, wehave found that some compilers use the base registeras the index to the array, and store the location ofthe array in the index register. We also propagate thetaint from the base register, but only when both baseand index registers are used in the instruction. Toavoid the problems with full pointer tainting, we donot propagate the taint when only the base register isused. In that case, the instruction is doing a pointerdereference. In practice, this limited pointer taintingallows us to capture important table lookup operations,while avoiding many pitfalls with full pointer tainting.Object-based Kernel Propagation. Previous tainttracking systems based on application level binaryrewriting do not propagate taint through system calls.Specifically, return values from system calls are nevertainted even when the parameters to these system callsare tainted. This problem occurs on all operating sys-tems, but is a rather common occurrence on Windowssystems as its user interface subsystem are in thekernel (GDI subsystem). Messages are passed betweenuser programs and the kernel frequently. Additionally,memory management and file operations can changethe data in the memory without being tracked byPrivacy Scope.

Previously tainting tracking through the kernel in-volves installing a kernel module/ driver and it isresponsible for tracking propagation within the ker-nel [29]. Unfortunately, integrating a kernel modulewith a dynamic binary translation framework wouldintroduce a lot of complexity to the system as two com-ponents need to coordinately update taint information.This kernel component can also potentially slow downother parts of the system, as it is constantly resident.

We take a hybrid approach to this problem by usingbyte level propagation at user level and object levelpropagation at the kernel level. Instead of having kernelcomponent monitoring changes in the kernel, PrivacyScope maintains a list of tainted kernel level objects(often object handles in Windows) and a few importantattributes of these objects such as size or the locationin memory by interposing on kernel function calls.Example of such kernel objects include file handles,memory mapped regions among other things.

Page 8: Privacy Scope: A Precise Information Flow Tracking System For

Instruction Reason and specific handlerXOR, SUB, SBB, AND These instructions can be used to clear register if the operands are the same.

Need to special case this as a clear operation rather than a taint propagation.MOV MOV instruction represent a very common case of propagating taint from

between registers and memory regions. We choose to instrument this to reducethe time needed to analyze each MOV instruction in a generic fashion.

REP prefix A number of instructions can have the REP prefix. The operation followingREP prefix is repeated until a register counter counts down to 0. When usedwith MOV instruction, it can facilitate large memory copy efficiently. However,the counters should be excluded from taint propagation. They should not bethe source or the destination of taint, even though they are both read andwritten.

Tainted index registers When an x86 instruction addresses memory, it computes the final address usingBase + (Index * Scale) + Displacement. Base and index value are specifiedusing a base and an index registers. In this case, we adopt the policy topropagate the taint in the base and index register to the destination if both ofthem are present. If only base register is used, then we ignore the propagation.Note this is also an exception to explicit flow propagation. We found this policynecessary for taint propagation in real world applications.

Table 1. Exception Cases to Generic Data Dependency Propagation

We will explain the object-based kernel propagationusing the file system as an example. We monitoroperations such as CreateFile, WriteFile, andCloseHandle so that we can detect opening of taintedfile or when we write tainted data to a file. For eachof these operations, we insert a function level instru-mentation, so that we capture the parameters to thesefunctions and create a corresponding object in userspace for each file that is open. Taint can flow from thememory taint map to these objects. For example, whenWriteFile is called with a tainted buffer as parameter,the entire file object becomes tainted.

Because files can also be mapped into user’saddress space using CreateFileMapping andMapViewOfFile, we also monitor these calls andrecord the location where the files are mapped.In essence, we are mirroring some kernel statesand use them to propagate taint from one kernelobject (file) to another (memory mapping). Becauseof these auxiliary information, we can constructaccurate instrumentations for system calls basedon the semantics of the function. We are able tocapture any potentially tainted output from the kernelcall, as well as simulating its side effect using theshadow data structures. We name these functionlevel instrumentations ”kernel function summaries”,because they summarize the taint propagation behaviorof the kernel function. They are a special class offunction summary, which we use in general to improvethe performance of Privacy Scope.

Input and Output Channel Monitoring. We monitoruser input and allow users to indicate which keystrokes

are sensitive information. In our implementation, weintercept any calls made to DispatchMessage() andexamine the message to look for WM_KEYDOWN typemessages which indicate key press. If input tainting isturned on (between ALT+F9 and ALT+F10), these char-acters will be set to be tainted and tracked throughoutapplication execution. We handle taint flows to the filesystem using the aforementioned object level propa-gation. The in-memory file handle object is taintedas soon as any tainted information is written to thefile. Furthermore, this information is persistent, andstored in the file system. We use extended attributessupported by NTFS to store taint information in eachfile. While extended attributes are flexible to use (i.e.,an attribute is defined by a pair of name and value,whose length can be variable.), a potential securityissue is that currently there is no support for controllingaccess or modification of extended attributes.

To detect leaks, we monitor send() andWriteFile() system calls: Any tainted networktraffic can be recorded with the socket infomation.Files are marked to be tainted if tainted information iswritten to them. Figure 3 shows the interfaces definedfor input and output monitoring.

4. Efficient Taint Tracking

As shown in the previous works [16], [4], [18],instruction-level taint tracking introduces significantslow down when implemented in a straightforwardway. Although these previous solutions were builton a different DBT, we also experience a similar

Page 9: Privacy Scope: A Precise Information Flow Tracking System For

application

files (persistent taint infostored in extended attributes)

network buffers

DispatchMessage() Send()

keystrokes WriteFile()CreateFile()

MapViewOfFile()CloseHandle()

ALT+F9: taint begin

ALT+F10: taint end

WM_KEYDOWN: input char

NtSetEaFile()NtQueryEaFile()

Figure 3. Windows system calls interposed for monitor-ing input and output channels of applications

performance overhead (e.g., taking a couple of minutesto signin to Yahoo! Messenger as opposed to seconds)when running a testing application with an early ver-sion of Privacy Scope. This section discusses the newfeatures that we developed for faster taint tracking.

Since the instrumentation routine is executed onlyonce in most cases, the majority of the overhead comesfrom the analysis routine which inspects the inputdata used by an instruction and taints output data ifnecessary. In general, there are two ways to reduce theanalysis overhead as shown in the figure. The first oneis to optimize the analysis routine itself and the secondone is to reduce the number of times that the analysisroutine is called. Many techniques have been proposedfor the former [4], [18] and in this work, we focus onthe techniques to achieve the latter.

4.1. Function Summary

When DBTs instrument an instruction, it needs tobackup the necessary state (i.e. registers) and switchto a new execution stack before starting to executethe analysis routine. Although much research has beendone about only partially backing up state, this switch-ing cost can still be expensive because it happens eachtime an instruction executes. We noticed that manyhighly utilized functions have well defined semantics,and we can completely turn off taint propagationwhile running the function. If necessary, at the endof the execution, we will run a patching function topropagate taint information between inputs and outputsof the function. Because we turn off taint propagationfor each instruction inside the function, we eliminatethe cost of these context switches while running the

function. The followings show the different types ofsummaries that can be generated based on the types offunctions:

1) No patching necessary: Functions that do notproduce output nor have side effect or functionswhose only outputs are independent of the inputs.

2) Function-level taint tracking: Functions that pro-duce output in output parameters or by modifyingmemory region. We can still turn off the taintpropagation inside the function, but we need tomodify the taint table upon returning from thisfunction.

As a first step, we currently rely on human experts togenerate the summaries of highly utilized functions. Toimprove scalability, we can employ static analysis togenerate function summary or to statically instrumentthe binaries with taint propagation logic [20]. We pro-filed one of our example application, Internet Explorer(IE) to capture the functions where the most of thetime is spent. For this profile, we do not count anycallee’s execution time in the caller’s time. We hope tofind functions that are general enough to be beneficialto most applications. Figure 4 shows the distributionof instructions over the functions called by IE duringa test run. We have excluded functions that are notdocumented or do not have clear semantics defined forthem. The functions are sorted by the cumulative num-ber of executed instructions for the observation period.A noteworthy point is that these top fifteen functioncalls account for over 25% of the total instructions,suggesting that there is potential for significant savingif these functions are summarized. Table 2 dividedthese functions into the types listed above. Similarfunction distributions are observed for our other testapplications, Notepad and Yahoo! Messenger. Thesetop functions we obtained for IE are also among the topfunctions in our other experiments, which shows someempirical evidence that we are not being too specificin our function selections.

We will use a representative example of wcsncpyto explain how function summary works and its per-formance implications. According to MSDN [26], ithas three input parameters, source address, destinationaddress and length N . It copies up to N wide characterstrings from source buffer to the destination buffer.After the function terminates, we will perform taintpropagation operation to the taint table that mimicthe logic of the function. It copies the taint infor-mation from the source to the destination up to Nwide characters. In addition to a reduction in thenumber of context switches, function summary alsoallows wcsncpy to execute without interruption. This

Page 10: Privacy Scope: A Precise Information Flow Tracking System For

Category Functions1. No patching necessary wcslen, strchr, bsearch, strchr,ReleaseMutex, RtlEqualUnicodeString, bsearch,

RtlAllocateHeap, RtlFreeHeap, RtlValidateUnicodeString, GetWindowThread-ProcessId, GetDC, LdrGetProcedureAdress

2. Function-level taint tracking wcscpy, RtlHashUnicodeString, tan

Table 2. Break-down of the top fifteen functions called by Internet Explorer

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

1 144 287 430 573 716 859 1002 1145 1288 1431 1574 1717Function ranks sorted by the total instructions ran

CD

F (%

)

Figure 4. IE profile results: We pick top 15 functionsfrom the profiling. They account for 28% of the total num-ber of instructions executed at runtime. These functionsaccount for 26.60% and 11.50% for Notepad and forYahoo! Messenger respectively.

preserves any caching locality that can be disruptedwhen instrumented code are mixed with applicationinstructions.Fast Lookup. Because any arbitrary function can becalled within a function, we must ensure the calledfunction will not propagate taint and thus interferewith the logic of the function summary. In the initialimplementation of function summary, this was doneby using a thread local variable to indicate whethera particular thread is propagating taint. However, it isexpensive simply to check that state variable becauseof context switch cost. We take advantage of a featurecalled conditional inlining where a separate scratchregister is used to indicate whether the analysis routineshould run. This lead to roughly about 2x performanceimprovement.Performance. We present the empirical results show-ing the significant speed up when function summariesare in place. To better understand the performancesaving, we conducted micro-benchmarking, measuringthe time spent in selected functions in each call, ratherthan the total time spent to execute an entire program.However, we show that savings from each functioncall can be multiplied, resulting in a significant overall

improvement.We create an artificial workload for the benchmark.

In each experiment, we measured the time it took fromhitting the first instruction of the function to the returninstruction in microseconds. We will use two specificfunctions GetWindowThreadProcessId and wcsncpyto represent the two types of function listed above.Different instrumentations were applied to the twofunctions. We used 50 sample data to generate figures 5and used 10 samples to generate each data point in 6.

Figure 5 compares the average times to run theGetWindowThreadProcessId function with instruc-tion level taint tracking and with function summary.Because it does not propagate taint, function sum-mary essentially turns off the propagation within thefunction. The graph also shows the standard deviation,which is small (less than 5 usecs). The average speed-up by skipping taint-tracking inside the function is 6.6(97.34 vs 14.8). Note that the function is quite simplein its logic so the number of context switches we savedusing function summary is also relatively small.

Next, we would like to study how much overheadthe patching function incurs over the baseline wherepropagation is simply turned off. Figure 6 shows thatfunctional propagation is virtually overlapping withthe baseline case. The graph also shows how thecomplexity of the function affects the benefit of thefunction summary (note the log-log scale). The benefitof function summary improves as the function com-plexity increases. Note that when N ≥ 16, the speed-up is more than 10 times. This results encourage usto summarize higher level functions. However, thereis an inherent tradeoff between how much benefit weget from each function summary and how often afunction is called and whether the function is used ina wide range of applications. Since we are building ageneral framework for evaluating different kind of ap-plications, we chose generality over high-level functionsummaries that tend to be application specific.

Although we do not report the results from the other12 functions for the sake of brevity, we note the similaroverhead reduction by function summaries. Since theseare the functions that are frequently called by thetesting application, we expect that the performance

Page 11: Privacy Scope: A Precise Information Flow Tracking System For

gain gets compounded as the program runs longer. All15 function summaries are added to Privacy Scope andwas used for the application study.

0

20

40

60

80

100

120

Instruc/onlevelpropaga/on

Nopropaga/on

Execu&

on&me

(microsecond

s)

Figure 5. Average time to run GetWindowThreadPro-cessId measured in u seconds (category 1)

1

10

100

1000

10000

1 2 4 8 16 32 64 128 256 5121024

Execu&

on&me(m

icrosecond

s)

Parametersize

nopropaga0on

func0onlevelpropaga0on

instruc0onlevelpropaga0on

Figure 6. Average time to run wcsncpy measured in useconds (category 2). We vary N, the number of bytesto copy, to show the increasing benefit of function-levelpropagation.

4.2. On-demand Instrumentation

When the application starts up, the instrumentationcode cache is completely empty and all instructionsneed to be instrumented. In addition, many initial-ization routines are instrumented to propagate taintwhen there is no taint in the system yet. Both factorslead to a significant delay in application start-up. Wenoticed that the instrumentation cost is often unnec-essary if sensitive information is never introduced tothe program. We introduce on-demand instrumentationin Privacy Scope. As the application starts, we per-form no instructional instrumentation and only limitedfunctional instrumentation to monitor the various input

channels taint can be introduced to the process. Thismay include opening of a tainted file and keyboardinput. Because there is no instructional instrumentationinitially, the application loads very quickly. When oneof the trigger condition happens, Privacy Scope inval-idates all existing instrumentations, and re-instrumentinstructions as necessary. This has the added benefit ofnot instrumenting libraries or functions that are onlyused at the loading time of the application.Performance. The cost of instrumentation comes fromthe cost of inserting the instrumentation (instrumenta-tion time) and the cost of running those instrumenta-tions (analysis time). We break it down by measuringthe number of instrumentations done and the number ofinstrumentations we ran during run time and see howthey are affected by the on-demand instrumentation.To our surprise, the number of total instruction instru-mented is only different by a small amount. However,the number of total instrumentation ran is different byan order of magnitude.

1.00E+03

1.00E+04

1.00E+05

1.00E+06

1.00E+07

1.00E+08

1.00E+09

1.00E+10

Notepad IE Yahoo

Num

ber o

f Ins

truct

ion

Ana

lysi

s R

un OnDemand Enabled

OnDemand Disabled

Figure 7. Savings of on-demand instrumentation foreach use case

Figure 7 that shows the benefit of on-demand instru-mentation is also a function of the application and spe-cific application use cases that we are experimenting.As the application size and complexity increases, thebenefit from on-demand instrumentation also increase.

5. Evaluation

This section presents the evaluation results of Pri-vacy Scope using three real applications: Notepad isa simple but representative example of text editingsoftware that may create copies of potentially sen-sitive information. Yahoo! Messenger carries privateconversations and its complexity (especially running apropriety protocol such as YMSG) makes the analysis

Page 12: Privacy Scope: A Precise Information Flow Tracking System For

intriguing. Internet Explorer is a popular browser,through which users send sensitive information (e.g.,credit card numbers) and interact with online services.They vary in application code size and complexity,but are also general enough to represent classes ofapplications we are interested in. In what follows, wefirst present the evaluation methodology and show theexperimental results. Then, we examine the specialtaint propagation logics that Privacy Scope implements(as discussed in §3) and how they affect the accuracywhen analyzing real-world applications.

5.1. Application Study

Because of the interactive nature of the applicationsthat we evaluate, it is difficult to automate experimentsand moreover to tease apart human induced delay fromthe overall performance results. However, we believethat it is important to run experiments with realistic usecases and to report the overall latency to demonstratethe practical value of the Privacy Scope system. In thissection, we run each experiment at least three timesand report the average. Each experiment starts afterthe application has been running for a while in orderto separate out one time instrumentation cost (i.e., weonly report hotcache numbers). However, we reset thetaint map prior to each run to isolate experiments.

Table 3 summarizes the evaluation results. Test runtime shows the average time to execute each experi-ment over three runs. We use Pin to print out times-tamps in the beginning and the end of an experimentto obtain the results. The next column (# of taintmap updates) shows the number of times that tainteddata is updated by instructions during the experimentperiod. This value represents the low-level movementsof sensitive data during the experiment. For accuracy,we inspect the output of each experiment (e.g., networkor file buffers) and check the taint map for eachoutput byte. We use domain knowledge (given thatthese applications are closed source2) to determine theaccuracy. The last column (taint map size) shows thenumber of bytes in the application’s memory that arestill tainted after the experiment is done.Windows Notepad. We use Notepad to test file taintpropagation. Privacy Scope inserts the special markingin the extended attributes of a file containing tainted (orsensitive) data. In this experiment, we open a taintedfile using Notepad and then save the content to a newfile. As expected, when WriteFile was called, Privacy

2. Even for an open source program, one needs to understand boththe program and the system calls used by the program in order tofully understand all the possible paths of taint propagation, which ischallenging.

Scope correctly carried the taint over to the buffercontaining the data to be written to the new file andupdated the extended attributes of the new file to reflectthe change. This experiment took longer than the restpartly because it involves multiple user interactions(e.g., clicking Save As after the file is loaded thentyping the new file name). However, most delay wasintroduced by taint analysis — as Figure 7 shows thatalthough this task appears simple, it requires more taintpropagating instructions to be run than the other twoapplications, thus resulting in higher latency.

Consistently across all the experiments, there arethree strings remained in the taint map, two of whichcorrespond to the file content. For example, we usedthe tainted input file that contained 7 characters,tainted. When WriteFile was called, the taint mapwas 25 bytes containing the following three strings:tainted (7 bytes), the unicode conversion of tainted(14 bytes), and x71x00x00x00x00 (4 bytes). Althoughit is obvious that the first two are correctly tainted,since we do not know the internals of the testingapplication, it is difficult to determine the correctnessof the third string. Therefore, in this analysis, we onlyexamine tainted bytes exposed to output channels.

This experiment shows that Privacy Scope couldhave prevented accidental creation of temporary filesthat may cause leakage of private information. Sincethe taint marking is embedded in a file’s meta data,Privacy Scope can eliminate the propagation of sensi-tive inputs through file system if an application copieswhat is in the memory (e.g., password) to a file andlater attempts to send it from the file.Yahoo! Messenger. We first sign in using a Yahoo!Messenger client then send a message, taintme toanother user. To track when and where the messageleaves the client program, we use the hotkeysmentioned in the earlier section to tag only themessage as we type each character into the client.Privacy Scope first detects that the message is sentto a Yahoo! server (as expected). The network buffercontaining the message is correctly tainted: Only the7 bytes of taintme are marked tainted in the bufferof 118 bytes as shown below. Each byte is showneither in ASCII or hex (e.g., x80) if not printable.Tainted bytes are shown between [TS] and [TE]:...x8014xc0x80[TS]taintme[TE]xc0x80429xc0...What was a surprise to us is that shortly after thisevent, Privacy Scope discovers that the taintedmessage is written to a file after converted to a stringof the same length as shown in Table 3. Because of thetransformation, initially we were unsure of the result.However, the filename (20090830-peername.dat)suggests that the file contains message archives. We

Page 13: Privacy Scope: A Precise Information Flow Tracking System For

Test details Test results Test runtime

# of taintmap updates

Taint mapsize

NotepadOpen a tainted filethen save thecontent to a new file

The new file’s extendedattributes has the taint tag andthe entire buffer of WriteFileis correctly tainted

22.9 (sec) 908 25 (bytes)

Yahoo!Messenger

Send a taintedmessage, “taintme”

The network buffer containing“taintme” is correctly tainted 6.3 (sec) 3854 148 (byes)

“taintme” is saved to a file asx17x13x80x1Ex1Dx1bx1c.The file is marked tainted.

8.4 (sec) 3885 118 (byes)

InternetExplorer

Post a tainted data,“7777” over HTTP

The network buffer containing“7777” is correctly tainted. 5.6 (sec) 4884 132 (byes)

Post a tainted data,“7777” over HTTPS

The last 41 bytes of the SSLpacket containing “7777” iscorrectly tainted.

9.0 (sec) 4474 252 (byes)

Table 3. Evaluation results of Privacy Scope using three applications running on Windows

later find an option to turn on/off message archivingand another option to display the archive. Using theoption, we confirm that indeed the file contains taintedmessage.

As shown in the table, each experiment takes lessthan 10 seconds on average with less than 1.5 secondstandard deviation. We also checked all clean networkbuffers and files that were created shortly after themessage was sent and found no false negatives to ourbest knowledge.

This experiment demonstrates that by monitoringboth network and file outputs, Privacy Scope wouldhave effectively detected and prevented an applica-tion’s logging of keystrokes without user knowledge.It also shows our ability to track the input informationregardless of transformation by the application.Internet Explorer. This experiment was designed tostudy Privacy Scope’s ability to track sensitive dataeven when it is encrypted. We set up a Web page witha simple HTML Form that sends the input data to aremote server via the HTTP Post method. First, weenter four digits 7777 on the form and submit it andconfirm that Privacy Scope correctly taints the fourdigits in the post message. Again, we double checkthat only the four digits out of 870 bytes of the sendbuffer were tainted as shown below....cardnumber=[TS]7777[TE]&expmonth=8&...Second, we repeat the same experiment but this timewe modify the page so that the form is submitted to thesame server over HTTPS. The results show that PrivacyScope taint the last 41 bytes of the encrypted messagewhose length is 888 bytes. In order to confirm that thetainted part contains the input string 7777, we vary the

length of an input and record how the length a taintedstring changes. Two observations suggest that PrivacyScope correctly taints the segment of the encryptedmessage containing the input string.(1) The length of the (output) tainted string is directlyproportional to that of the input string. I.e., the lengthof the tainted string increased from 41 to 45 and to53 when the input length increased from 4 to 8 andto 16. Our hypothesis is that 41 bytes (when the inputwas 4 bytes) include the input (4 bytes), the remainingmessage (starting from &expmonth= as shown above),which is 21 bytes, and MAC (16 bytes). This suggeststhat the data was encrypted with a length-preservingscheme with an authenticator added at the end.(2) The tainted string in the output buffer always beginsat the same byte position (842th byte in this case). Webelieve that that is where the input string is inserted.However, without knowing the exactly what encryptionmode is used, we are unsure whether all the bytesafter the input were correctly tainted3. However, webelieve that the tainted string correctly includes thetainted input.

As shown in Table 3, it takes also less than 10 sec-onds on average to run experiments with Internet Ex-plorer even when Internet Explorer is communicatingwith a server over SSL. This experiment demonstratesthat Privacy Scope could detect if online applicationsleak sensitive user inputs even when the applicationsmake use of cryptography, which would otherwiserender unreadable the output data.

3. For a pure stream cipher like RC4, those bytes shouldn’t havebeen tainted. For CFB or GCM, they should have been tainted.

Page 14: Privacy Scope: A Precise Information Flow Tracking System For

Accuracy. The highly accurate results demonstrate thatlow-level taint propagation logics discussed in Table 1correctly model data dependency flows. To confirmthat these logics are necessary (and not optional)for precise information tracking, we repeat theabove experiment with only generic data dependencypropagation logic and register clearing logic on (i.e.,turning off the logics implementing rows 3—5 inTable 1). The result shows that the taint map getsquickly polluted and resulted in many false positives inthe output. For instance, the same experiment with IEcauses overtainting in the network buffer as follows:[TS]paymentType=American+Express&cardnumber=7777&expmonth=8&expyear=[TE] when only 7777should have been tainted. The taint map size is 5,308bytes including large strings of randomly lookingbytes, which are very likely false positives given thatthe input string is only 4 byte long. Then, we reinstatethe special logics for MOV instructions but excludethe logic for handling REP prefix. The result is quitesurprising. Taint fails to propagate to network output.Moreover, the propagation quickly disappeared. Thenumber of times that taint map is updated is only 195,which is by an order of magnitude smaller what wesee with Privacy Scope (4884 as shown in Table 3).

6. Related Work

There is a large body of work aimed at protectinguser privacy using information flow tracking tech-niques. This section discusses how Privacy Scopediffers and complements these previous approaches.Dynamic Software Analysis for Information Leaks.Vachharajani et al. present a detailed discussion ofissues around building a runtime system for enforcinguser-defined information-flow security policies [23].The proposed solution, RIFLE, tracks information flowusing new hardware extensions with carefully designedbinary translation. RIFLE incurs low overhead and caneffectively handle conditional dependency and loopsbut it requires significant hardware support, thus notdirectly applicable to existing systems.

Panorama [28] and TaintBochs [5] are built onwhole-system simulation (e.g., QEMU, Bochs), capa-ble of tracking the propagation of sensitive data atthe hardware level. Designed for malware analysis(Panorama) and for data lifetime analysis (TaintBochs),both systems generate the sheer volume of logs de-tailing data propagation across applications and theunderlying operating system. While such low-levelinformation is valuable for understanding the completepicture of information leaks within the system, thecurrent implementations incur high overhead —- 20x

slowdown (Panorama), 2 to 10 times slower than Bochs(TaintBochs), rendering these systems not suitable forcapturing client-server interactions. Dytan [6] is ageneric taint analysis framework for Linux platformand supports customizable taint propagation policies.However, our multilevel taint propagation techniquewould have required significant reengineering of theDytan’s internals.

Privacy Oracle [11] and TightLip [29] arelightweight tools that are capable of analyzing appli-cations for information leaks without any application-level instrumentations. However, these systems arelimited to the software whose outputs only depend oninputs (and externally controllable parameters such astime and system configurations) and not scalable totracing multiple input data.Optimizing Dynamic Taint Analysis. Many opti-mization techniques have been proposed to improveefficiency in dynamic taint tracking using binary instru-mentation [18], [16], [4], [10]. Although these systemsfocused on software vulnerability analysis (by tracingincoming network data), some of the optimizationtechniques are complementary to what our currentsystem implements and can further improve PrivacyScope.

LIFT [18], built on the StarDBT binary instrumen-tation tool, implements three optimization techniques:fast-switch is to reduce the overhead of context switchwhereas the other two (fast-path, and merge-check)are to reduce the number of taint propagating instru-mentations. A key difference between LIFT’s fast-pathand merge-check and our function summary is thatLIFT’s optimization techniques apply at basic blockor trace levels and they require runtime checking.Unfortunately, since neither LIFT nor StarDBT is notpublicly available for testing, we cannot compare theeffectiveness of the two approaches. However, ourfunction summary can be implemented on top ofLIFT’s three optimizations and further reduce the taint-tracking overhead.

Ho et al. [10] implement page-granularity tainttracking for efficiency. Their system is built on theXen virtual machine monitor and dynamically switchesfrom virtualization to hardware-based emulation whena tainted page is accessed by the processor. We be-lieve that this optimization can be highly effectivewhen implemented in Privacy Scope since most taintsources are small. Although not suitable for commer-cial software analysis, TaintPolicy [27] instruments Cprograms through a source-to-source transformation toperform efficient runtime taint tracking.OS Level Information Flow. Other systems have beenbuilt to integrate the notion of information flow and

Page 15: Privacy Scope: A Precise Information Flow Tracking System For

taint tracking directly into the operating system. BothAsbestos [8] and HiStar [30] use labels to indicate thetaint level of OS abstractions and restrict informationflow from more sensitive object to less sensitive objectwithout the use of a trusted agent. Many legacy ap-plications cannot run on these experimental platformsand end users would have to run a different operatingsystem to benefit.

PRECIP [25] retrofits these ideas to Windows oper-ating system and is a lightweight system that aims toprevent information leaks. PRECIP intercepts systemcalls and monitors output channels (e.g., files, networksockets) in which sensitive input data (e.g., files, userinputs) are written to, and prevents malicious processes(e.g., keyloggers) from gaining access to these re-sources. Tracing is done at the object level granularityand thus it is unable to track information if transformedby the application.

7. Discussions

The application study in §5 demonstrated the abilityof Privacy Scope in tracking sensitive data propagationand discovering leaks by off-the-shelf applications.However, issues are remained to increase the effec-tiveness of Privacy Scope. In this section, we discusslimitations of the current implementation and promis-ing avenues to explore in order to improve the systemwith fast analysis and accurate and comprehensive leaktracking.Performance. Privacy Scope markedly improved theperformance of application-level taint tracking withthe two techniques (function summaries, on-demandinstrumentation) as shown in §4. We expect furtherperformance gain by adapting some of the previouslyexplored methods for reducing taint analysis overheadin [18], [4], [10].

Various low-level system supports for fast binaryinstrumentations are on the horizon as well. Dynamicbinary translation tools are continuously evolved withnew optimizations and additional features. For in-stance, persistent code caches are shown to be ef-fective in reducing long initialization sequences ofapplications when implemented in the DynamoRIODBT [3]. A simple hardware enhancement (a dedicatedinterconnect with added ISA support) is shown todrastically reduce the overhead of information flowtracking by efficiently leveraging multicores [15].Limitations. Sensitive data can be passed along (viashared memory or system messages) and then leakedby other processes running on the same system. Todetect such leaks involving multiple processes, one

may turn to a whole-system simulation based approachsuch as Panorama [28] or TaintBochs [5].

In the current implementation, we chose 1 bit tainttag for speed and simplicity, but it does not allow usersto pinpoint which sensitive input is getting leaked.In addition, for file tainting, taint level changes cancause previously collected taint information to becomeobsolete. While it is rather straightforward to extendthe tag table to include multiple bit tags, larger tagscan lead to cache pollution and space overhead.

Like many existing tools [18], [16], [4], we trackonly explicit flows (also called data flows or datadependency). As a result, Privacy Scope will miss leaksif tainted data propagate through control flows, which,we assume, is infrequent in practice. However, Slowin-ska and Bos point out that explicit data flow trackingcan lose taint very easily through table lookups [22].We address this issue by employing specific policiesregarding the use of index registers.Evasion. Privacy Scope is designed for evaluating off-the-shelf software and aim to protect accidental leakof private information. Therefore, it is not our goalto avoid possible active evasion techniques one mightemploy. Some software like Skype or Limewire ac-tively probe for the presence of instrumentation tools,debuggers, or virtual machines and abort the programwhen detected. These practices are not common andwould likely alarm the user to proceed more cautiouslyif such behavior is observed.

8. Conclusion

We present Privacy Scope, a system that pinpointsleaks of sensitive data by commercial software. Itis well-known that legitimate, popular applicationscan accidentally or intentionally expose private userinformation. With Privacy Scope, users can check thatapplications disclose their personal information in theways that they expect, rather than simply trust them.

Privacy Scope uses dynamic binary translation tech-niques to implement dynamic taint analysis on un-modified commercial applications running in normaluser environments. To build our system, we developedand integrated a set of techniques that include: mixedinstruction and function-level tainting; function sum-maries for efficiency and accuracy of application-onlytainting; special semantics for corner-case instructionsand kernel side-effects; and tainting on demand ratherthan at load time. The result is a comprehensivesystem that is efficient enough to track where sensi-tive information goes in large multi-threaded networkapplications that include Internet Explorer and Yahoo!

Page 16: Privacy Scope: A Precise Information Flow Tracking System For

Messenger. In tests, we were able to run these appli-cations online and precisely trace where input markedas sensitive was output with no false positives. Withadditional engineering effort, we believe that PrivacyScope can be valuable as a system that is widely usedto discover how applications behave in practice.

Acknowledgments

We are indebted to Robert Cohn and Greg Lueck at Intel’sVSSAD group for sharing their technical insight on Pin DBT. Wealso like to thank Heng Yin for the details of Panorama that we usedto verify function summaries. Stuart Schechter, Heidi Pan, and WillEnck read our early drafts and provided feedback that improved thepaper.

References[1] D.E. Bell and L.J. La Padula. Secure Computer

System: Unified Exposition and Multics Interpretation.Technical Report ESD-TR-75-306, MITRE, 1976.

[2] Derek Bruening. Efficient, Transparent, and Compre-hensive Runtime Code Manipulation. PhD thesis, MIT,September 2004.

[3] Derek Bruening and Vladimir Kiriansky. Process-Shared and Persistent Code Caches. In VEE, 2008.

[4] W. Cheng, Q. Zhao, B. Yu, and S. Hiroshige. Taint-Trace: Efficient Flow Tracing with Dynamic BinaryRewriting. In IEEE Symposium on Computers andCommunications,, 2006.

[5] Jim Chow, Ben Pfaff, Tal Garfinkel, Kevin Christopher,and Mendel Rosenblum. Understanding data lifetimevia whole system simulation. In USENIX SecuritySymposium, 2004.

[6] James Clause, Wanchun Li, and Alessandro Orso.Dytan: a generic dynamic taint analysis framework.In ISSTA ’07: Proceedings of the 2007 internationalsymposium on Software testing and analysis, pages196–206, New York, NY, USA, 2007. ACM.

[7] Alexei Czeskis, David J. St. Hilaire, Karl Koscher,Steven D. Gribble, Tadayoshi Kohno, and BruceSchneier. Defeating Encrypted and Deniable File Sys-tems: TrueCrypt v5.1a and the Case of the Tattling OSand Applications. In HotSec, 2008.

[8] Petros Efstathopoulos, Maxwell Krohn, Steve VanDe-Bogart, Cliff Frey, David Ziegler, Eddie Kohler, DavidMazieres, Frans Kaashoek, and Robert Morris. Labelsand event processes in the asbestos operating system.In SOSP, 2005.

[9] Kim Hazelwood and Vijay Janapa Reddi. TutorialNotes on Hands-on Pin. In ASPLOS, 2008.

[10] Alex Ho, Michael Fetterman, Christopher Clark, An-drew Warfield, and Steven Hand. Practical taint-basedprotection using demand emulation. SIGOPS Oper.Syst. Rev., 40(4), 2006.

[11] Jaeyeon Jung, Anmol Sheth, Ben Greenstein, DavidWetherall, Gabriel Maganis, and Tadayoshi Kohno.Privacy Oracle: a System for Finding Application Leakswith Black Box Differential Testing. In CCS, 2008.

[12] Maxwell Krohn, Alexander Yip, Micah Brodsky, NatanCliffer, M. Frans Kaashoek, Eddie Kohler, and RobertMorris. Information flow control for standard OSabstractions. In SOSP, 2007.

[13] Chi-Keung Luk, Robert Cohn, Robert Muth, HarishPatil, Artur Klauser, Geoff Lowney, Steven Wallace,Vijay Janapa Reddi, and Kim Hazelwood. Pin: build-ing customized program analysis tools with dynamicinstrumentation. In PLDI, 2005.

[14] John Markoff. Surveillance of Skype Messages Foundin China. The New York Times, October 2008.

[15] Vijay Nagarajan, Ho-Seop Kim, Youfeng Wu, andRajiv Gupta. Dynamic Information Flow Tracking onMulticores. In Interact, 2008.

[16] James Newsome and Dawn Song. Dynamic Taint Anal-ysis for Automatic Detection, Analysis, and SignatureGeneration of Exploits on Commodity Software. InNDSS, 2005.

[17] Objective Development. Little Snitch. http://www.obdev.at/products/littlesnitch/.

[18] Feng Qin, Cheng Wang, Zhenmin Li, Ho seop Kim,Yuanyuan Zhou, and Youfeng Wu. LIFT: A Low-Overhead Practical Information Flow Tracking Systemfor Detecting Security Attacks. In MICRO, 2006.

[19] Andrei Sabelfeld and Andrew C. Myers. Language-based information-flow security. IEEE JSAC, 21:2003,2003.

[20] Prateek Saxena, R Sekar, and Varun Puranik. EfficientFine-Grained Binary Instrumentation with Applicationsto Taint-Tracking. In CGO, 2008.

[21] Julian Seward and Nicholas Nethercote. Using Valgrindto detect undefined value errors with bit-precision. InUSENIX Annual Technical Conference, 2005.

[22] Asia Slowinska and Herbert Bos. Pointless tainting?:evaluating the practicality of pointer tainting. InEuroSys ’09: Proceedings of the 4th ACM Europeanconference on Computer systems, pages 61–74, NewYork, NY, USA, 2009. ACM.

[23] Neil Vachharajani, Matthew J. Bridges, JonathanChang, Ram Rangan, Guilherme Ottoni, Jason A.Blome, George A. Reis, Manish Vachharajani, andDavid I. August. RIFLE: An architectural frameworkfor user-centric information-flow security. In MICRO,2004.

[24] Cheng Wang, Shiliang Hu, Ho-Seop Kim, Sreeku-mar R. Nair, Mauricio Breternitz Jr, Zhiwei Ying, andYoufeng Wu. StarDBT: An Efficient Multi-platformDynamic Binary Translation System. In Asia-PacificComputer Systems Architecture Conference, 2007.

[25] XiaoFeng Wang, Zhuowei Li, Ninghui Li, andJong Youl Choi. PRECIP: Practical and RetrofittableConfidential Information Protection. In NDSS, February2008.

[26] Msdn documentation - wcsncpy. http://msdn.microsoft.com/en-us/library/ms860450.aspx.

[27] Wei Xu, Sandeep Bhatkar, and R. Sekar. Taint-enhanced policy enforcement: a practical approach todefeat a wide range of attacks. In USENIX SecuritySymposium, 2006.

[28] Heng Yin, Dawn Song, Manuel Egele, ChristopherKruegel, and Engin Kirda. Panorama: capturingsystem-wide information flow for malware detectionand analysis. In CCS, 2007.

[29] Aydan R. Yumerefendi, Benjamin Mickle, and Lan-don P. Cox. TightLip: Keeping Applications fromSpilling the Beans. In NSDI, April 2007.

[30] Nickolai Zeldovich, Silas Boyd-Wickizer, EddieKohler, and David Mazieres. Making information flowexplicit in HiStar. In OSDI, 2006.