
This paper is included in the Proceedings of the 30th USENIX Security Symposium

August 11–13, 2021

978-1-939133-24-3

Open access to the Proceedings of the 30th USENIX Security Symposium is sponsored by USENIX.

Understanding Malicious Cross-library Data Harvesting on Android

Jice Wang, National Computer Network Intrusion Protection Center, University of Chinese Academy of Sciences; Indiana University Bloomington; Yue Xiao and Xueqiang Wang, Indiana University Bloomington; Yuhong Nan, Purdue University; Luyi Xing and Xiaojing Liao, Indiana University Bloomington; JinWei Dong, School of Cyber Engineering, Xidian University; Nicolas Serrano, Indiana University Bloomington; Haoran Lu and XiaoFeng Wang, Indiana University Bloomington; Yuqing Zhang, National Computer Network Intrusion Protection Center, University of Chinese Academy of Sciences; School of Cyber Engineering, Xidian University; School of Computer Science and Cyberspace Security, Hainan University

https://www.usenix.org/conference/usenixsecurity21/presentation/wang-jice


Jice Wang(1,2,*), Yue Xiao(2,*), Xueqiang Wang(2), Yuhong Nan(3), Luyi Xing(2,†), Xiaojing Liao(2,†), JinWei Dong(4), Nicolas Serrano(2), Haoran Lu(2), XiaoFeng Wang(2), Yuqing Zhang(1,4,5,†)

(1) National Computer Network Intrusion Protection Center, University of Chinese Academy of Sciences; (2) Indiana University Bloomington; (3) Purdue University; (4) School of Cyber Engineering, Xidian University; (5) School of Computer Science and Cyberspace Security, Hainan University

Abstract

Recent years have witnessed the rise of security risks of libraries integrated in mobile apps, which are reported to steal private user data from the host apps and the app backend servers. Their security implications, however, have never been fully understood. In our research, we brought to light a new attack vector, long ignored yet with serious privacy impacts: malicious libraries strategically target other vendors' SDKs integrated in the same host app to harvest private user data (e.g., Facebook's user profile). Using a methodology that incorporates semantic analysis on an SDK's Terms of Service (ToS, which describes restricted data access and sharing policies) and code analysis on cross-library interactions, we were able to investigate 1.3 million Google Play apps and the ToSes from 40 highly popular SDKs, leading to the discovery of 42 distinct libraries stealthily harvesting data from 16 popular SDKs, which affect more than 19K apps with a total of 9 billion downloads. Our study further sheds light on the underground ecosystem behind such library-based data harvesting (e.g., monetary incentives for SDK integration), their unique strategies (e.g., hiding data in crash reports and using a C2 server to schedule data exfiltration) and significant impacts.

1 Introduction

Mobile apps today extensively incorporate third-party libraries (e.g., analytics, advertising, app monetization, or single-sign-on SDKs), which enrich their functionalities but also bring in security risks. It has been reported that malicious SDKs stealthily collect private user data from the device running their host app [53, 73, 74, 78] (e.g., IMEI, GPS location, phone number, MAC address, SIM card ID, Android ID, etc.), and from the server or the cloud supporting the app [69]. Despite the significance of such leaks, the security implications of library integration have yet to be fully revealed: it is less clear whether a malicious library could endanger a user's sensitive information from other data sources, those not under the direct control of the affected app.

*The first two authors are ordered alphabetically. Work was done when Jice Wang was studying at Indiana University Bloomington.

†Luyi Xing, Xiaojing Liao, and Yuqing Zhang are co-corresponding authors.

Figure 1: Workflow of cross-library data harvesting (XLDH)

Cross-library data harvesting. In our research, we discovered a type of data harvesting libraries never reported before, which strategically target the SDKs from other vendors also integrated by the host app. These SDKs carry sensitive user data. For example, the Facebook SDK, extensively used by apps for single sign-on [34], also manages information such as a user's name, birthday, locations she went to, and the social, health and political groups she follows. The data could be exposed to the malicious library hosted by the same app integrating the Facebook SDK. Figure 1 illustrates such a malicious library, which first checks the presence of the Facebook SDK in its host app and, if so, invokes the Facebook API to acquire the user's Facebook session token and data. Since both the malicious library and the victim SDK co-exist within the same app, this invocation is not mediated. Given the wide deployment of the Facebook Login SDK (in more than 16% of the apps on Google Play [5]), the risk of such a data leak is significant. We call this attack Cross-Library Data Harvesting (XLDH).¹

Beyond its threat to personal privacy, malicious data harvesting can also have serious social implications. A prominent example is the Cambridge Analytica scandal [17], in which the personal data of millions of Facebook users (profiles, page likes, current city, News Feed, etc., enough to create psychographic profiles of the users) were collected and utilized for malicious political advertising [17]. XLDH also provides a new avenue for such political profiling and promotion, as discovered in our research (Section 5.4). Despite the importance of the problem, little has been done so far to understand

¹In this paper, the terms "SDK" and "library" always refer to those developed by third parties, i.e., vendors other than the host app vendor or OS vendor; also, we refer to "SDK" as the victim and "library" as a general term.


XLDH, not to mention any attempt to address this new security and privacy risk.

Finding XLDH in the wild. In this paper, we report the first study on XLDH on Android, aiming to understand its privacy and social impacts, underground ecosystem, and challenges in controlling the threat. To this end, we developed a new automatic methodology, called XFinder, to identify malicious libraries integrated in real-world apps on Google Play. Our idea is to discover restricted data managed by the SDKs and their third-party data sharing policies, which describe whether and how restricted data can be shared with or collected by other libraries. We automatically extract those policies from the terms of service (ToS, a.k.a. terms of use, terms and conditions) released by the SDK vendors, and then analyze the code of each integrated library to find out whether it makes any access to the SDK's data in violation of these policies. This turns out to be nontrivial, due to the challenges in analyzing ToS to recover its semantics and evaluating apps to find cross-library interactions.

More specifically, unlike app privacy policies, which protect known sensitive content (e.g., address, contact, etc.) and therefore can be handled by existing privacy policy analyzers such as Polisis [57], ToS describes restricted data whose security or privacy implications can only be determined from the context of their usage. Examples include security-critical data such as password and token, and SDK-specific sensitive data such as utdid, used by Alibaba for identifying user devices [2], page likes and health or political groups of a user recorded by Facebook, and education and project information maintained by LinkedIn [18]. More challenging is to recover the data sharing policies from ToS that specify the restrictions on collecting and sharing different data items, which tend to be complicated. For example, Google allows developers to access advertising ID or device identifier (e.g., ssaid, mac address, imei) but restricts the collection of these two data items simultaneously; also, a Facebook user's page likes, timeline, etc. are open to the apps certified by Facebook [15] but not to other parties (including third-party libraries) [55], while Facebook user ID and password are not allowed to be sent out to the Internet by any party. Our research shows that existing techniques like Polisis [57] and PolicyLint [45] cannot be directly adopted for ToS analysis (see the evaluation in Section 3.2).

To address these challenges, XFinder utilizes a semantic analysis tuned towards the unique features of ToS, which leverages natural language processing techniques to capture sensitive data items and to recover complicated policies (Section 3.2). Further, our code analyzer module in XFinder is designed to handle potential evasion tricks played by malicious libraries when evaluating their interactions with a target SDK (Section 3.3). Our experiment shows that XFinder achieved a high precision of 86% and successfully detected 42 malicious libraries from more than one million Android apps.

Measurement and discoveries. From the 1.3 million Google Play apps analyzed in our research, we are surprised to find the significant impacts of the new threat. More specifically, we discovered 42 distinct libraries that stealthily harvest data from third-party SDKs without user consent. These libraries have been integrated into more than 19K apps with a total of 9 billion downloads. The data harvested are highly sensitive, including access tokens, profile photos and friend lists (see Section 5). As an example, OneAudience, a library integrated in more than 1,738 apps with more than 100 million users, collects users' private data from the Facebook and Twitter SDKs. Based on a press release from Nielsen [29], OneAudience shared mobile user data with Nielsen, a marketing research firm, and the data can be used by Nielsen's customers for political marketing purposes, among other marketing usages. Hence, we suspect that the data harvesting campaign might lead to a Cambridge-Analytica-like political scandal if the data were taken advantage of by an adversary. Although the campaign has been stopped after we reported it to Facebook (see below), millions of Facebook users' data have already been exposed, since the library has been continuously gathering user data once per hour on both Android and iOS since 2014.

Also interesting is the ecosystem behind XLDH, which includes library distribution, stealthy data exfiltration channels and data monetization. In particular, XLDH vendors are found to distribute their libraries through multiple channels, including colluding with free app building services, integrating into popular libraries and offering app monetization (Section 5.4). For example, app monetization is used to attract app developers to integrate problematic libraries into their apps: app developers that integrate OneAudience and Mobiburn are paid $0.015 to $0.03 per app install. Furthermore, we revealed the techniques used by malicious libraries that made their data harvesting activities more stealthy and harder to detect, such as the abuse of the Java reflection technique (see Section 3.3).

Our study also sheds light on the challenges in eliminating the XLDH risk. We found that although VirusTotal and Google Play are able to detect the libraries collecting data from mobile devices (such as IMEI, contact), they all failed to detect XLDH libraries and the apps integrating them, possibly due to the challenges in determining third-party data sharing policies and non-compliance with the policies. This has been addressed by XFinder. We reported our findings to affected parties, including Facebook, Twitter, Google Play and others, who are all serious about this new risk and expressed gratitude for our help with bounty programs. Google asked the developers of affected apps to remove the malicious libraries or drop these apps to control the risk. Facebook and Twitter have taken legal actions to take down OneAudience, an XLDH library owned by Bridge, a digital marketing company.

Contributions. We summarize the contributions as follows:

• Our study brings to light a new attack vector that has long been ignored yet has serious privacy implications: malicious libraries aiming at third-party SDKs integrated in the same apps to harvest private user data. Our findings demonstrate the significant privacy and social impacts of this new threat. Our work also helps better understand the underground ecosystem behind it and the challenges in controlling the risk.

• Our study has been made possible by a novel methodology that automatically identifies XLDH from over a million Android apps, through semantic analysis on ToS and code analysis on cross-library interactions.

• We release the dataset used in this research and our source code for the automatic ToS analysis online [39].

2 Background

Cross-library API calls. Like an app that calls functions of a library, libraries in an app naturally can invoke the functions of another library. On Android, this is typically done through first explicitly importing the package name of the callee class (in Java) and then invoking the target function through the callee class's instance. Further, Java features a technique called reflection [26] that allows function invocation in a more flexible manner. As illustrated in Figure 4a, to invoke a function getCurrentAccessToken in the Facebook library, one can first obtain a class object through the Java reflection API Class.forName by providing the class name (com.facebook.AccessToken); then, through another reflection API, getDeclaredMethod, one can obtain a method object using the name of the target function getCurrentAccessToken; last, calling the reflection API invoke on the method object, one can invoke the target function. In our research, we observed that XLDH libraries often leverage reflection to call victim libraries, likely for making the behaviors more stealthy. Note that Android provides a coarse-grained sandbox and permission model to regulate third-party libraries, allowing them to operate with the same permissions as their host apps [4, 76]. In particular, there is no security boundary between libraries within the same app, allowing one library to access another (e.g., invoking functions) without restrictions.
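The three reflection steps described above (Class.forName, getDeclaredMethod, invoke) can be tried in isolation with the minimal sketch below. It is an illustration only: FakeToken is a hypothetical stand-in class defined in the sketch itself, not the Facebook SDK, and the method names getCurrentToken and getToken are chosen for the example.

```java
import java.lang.reflect.Method;

/** Illustrative only: reflective invocation against a stand-in class, not a real SDK. */
class ReflectionSketch {

    /** Hypothetical stand-in for an SDK class that holds a session token. */
    public static final class FakeToken {
        private static final FakeToken CURRENT = new FakeToken("demo-token");
        private final String value;

        private FakeToken(String value) {
            this.value = value;
        }

        public static FakeToken getCurrentToken() {
            return CURRENT;
        }

        public String getToken() {
            return value;
        }
    }

    /** Resolves the class and both methods by name at run time, then invokes them. */
    static String readTokenReflectively(String className) {
        try {
            // Step 1: obtain the class object from its name.
            Class<?> clz = Class.forName(className);
            // Step 2: obtain method objects from the method names.
            Method current = clz.getDeclaredMethod("getCurrentToken");
            Method getter = clz.getDeclaredMethod("getToken");
            // Step 3: invoke. For a static method the receiver is ignored, so null is passed.
            Object token = current.invoke(null);
            return (String) getter.invoke(token);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("reflective call failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readTokenReflectively(FakeToken.class.getName()));
    }
}
```

Because the class and method names are plain strings, no import of the callee class appears in the caller's source, which is consistent with the observation that reflection makes such calls less visible to simple inspection.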

SDK terms of service. Terms of service (ToS) is an SDK developer document that lays out terms, conditions, requirements and clauses associated with the use of a mobile SDK, e.g., copyright protection, account termination in the cases of abuse, data usage and management, etc. Note that in addition to the ToS for developers, an SDK vendor (e.g., Facebook and Twitter) may also have a ToS for regular users, such as [16], which is outside the scope of our study. In our research, we manually collected 40 ToSes from SDK vendors' developer websites to investigate the XLDH risks.

Unlike a privacy policy, which aims at informing end users about collection and use of personal data (e.g., name, email address, mailing address, birthday, IP address), ToS specifies rules and guidelines for developers who use an SDK, as illustrated in Figure 2. Also, data protected under privacy policy

Figure 2: The dependency parsing tree (a) and constituency parsing tree (b) of the sentence "you may not associate the Advertising ID with any device identifier without consent from the end user". Token labels shown in the figure: You (OTHER), may (OTHER), not (OTHER), associate (ACTION), the advertising ID (DATA), with (OTHER), any device identifier (DATA), without (OTHER), user consent (CONDITION).

is different from that covered by ToS. The former is usually "personal information" (e.g., name, email address, mailing address), as guided by laws (e.g., GDPR, CalOPPA [48]). The latter also prevents the abuse of security-critical data (e.g., password and token) and SDK-specific data (e.g., API keys, access credentials). Table 1 shows the data items protected by the ToS of Facebook, Twitter and Pinterest. We can see that 21 data items in the ToSes are SDK-specific and not mentioned by the privacy policies.

In our research, we found that the state-of-the-art privacy policy analyzer (e.g., Polisis [57]) cannot effectively analyze ToS to recover the content about sensitive data sharing policy (see Section 3.2), possibly because the grammatical structures of ToS (which addresses a different audience and describes different data items and rules) differ from those appearing in common privacy policy corpora.

Natural language processing. In our research, we leverage natural language processing (NLP) to automatically extract third-party sharing policies for sensitive data from SDK ToS. Below, we briefly introduce the NLP techniques used in our research.

• Named entity recognition. Named entity recognition (NER) is a technique that locates named entities mentioned in unstructured text and classifies them into pre-defined categories, such as person names, organizations and locations. The state-of-the-art NER tools, such as Stanford NER and spaCy NER, can achieve a 95% accuracy on open-domain corpora when recognizing person names, organizations and locations. However, NER systems are known to be brittle and highly domain-specific: those designed for one domain hardly work well on another domain [61]. A direct use of state-of-the-art tools like Stanford NER [66] does not work, because the common pre-defined categories (names, organizations, locations) are not suitable for our task. In our study, we tailor named entity recognition techniques to identify sensitive data that is protected by third-party sharing policies.


Table 1: Examples of data items protected by the ToS of Facebook, Twitter and Pinterest

| SDK | Term of Service |
|---|---|
| Facebook | access token, access credentials, Friend data, Facebook user IDs, trademarks, PSIDs (Page-scoped user IDs), Marketplace Lead Data |
| Twitter | API keys, access credentials, Twitter Content, Twitter passwords, Tweet IDs, Direct Message IDs, user IDs, Periscope Broadcasts |
| Pinterest | Wordmark, image, Ad Data, user ID and campaign reporting, secret boards |

• Constituency parsing and dependency parsing. Constituency parsing and dependency parsing are NLP techniques to analyze a sentence's syntactic structure. Constituency parsing breaks a sentence into sub-phrases and displays its syntactic structure using a context-free grammar, while dependency parsing analyzes the grammatical relations between words, such as subject-verb (SBV), verb-object (VOB), attribute (ATT), adverbial (ADV), coordinate (COO) and others [11]. Figure 2 illustrates the constituency parsing tree and dependency parsing tree of a sentence. In the constituency parsing tree, non-terminals are types of phrases and the terminals are the words in the sentence. For instance, as shown in Figure 2b, NP is a non-terminal node that represents a noun phrase and connects three child nodes (the (DT), advertising (NNP), ID (NNP)). Here, DT means determiner and NNP means a singular proper noun [30]. By comparison, the dependency parsing tree is represented as a rooted parsing tree (see Figure 2a). At the center of the tree is the verb of a clause structure, which is linked, directly or indirectly, by other linguistic units. Such a unit can either be a single word or a noun phrase merged by the parser's built-in phrase merge API phrasemerge [28]. The state-of-the-art dependency and constituency parsers (e.g., Stanford parser [49], AllenNLP [56]) can achieve over 90% accuracy in syntactic structure discovery from a sentence. In our study, we leverage both dependency and constituency parsing trees generated from sentences in ToS to recover the semantics of third-party sharing policies in SDK ToS.

• word2vec. Word2vec [67] is a word embedding technique that maps text (words or phrases) to numerical vectors. Such a mapping can be done in different ways, e.g., using the continuous bag-of-words model [8] or the skip-gram technique [35] to analyze the context in which the words show up. Such a vector representation is intended to ensure that synonyms are given similar vectors and antonyms are mapped to different vectors. In our study, we build a customized word embedding model for data sharing policies to measure the similarity of words in this domain, as elaborated in Section 3.4.
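To make "similar vectors" concrete, the sketch below computes cosine similarity, a common way to compare embedding vectors. Using cosine here is our assumption, since the text does not name the similarity measure, and the three-dimensional vectors are made up for illustration; a trained model would supply real ones.

```java
/** Illustrative only: cosine similarity between two embedding vectors. */
class EmbeddingSimilarity {

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen in this sketch for zero vectors
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up vectors standing in for the embeddings of three words.
        double[] share = {0.9, 0.1, 0.3};
        double[] disclose = {0.8, 0.2, 0.4};
        double[] unrelated = {-0.1, 0.9, -0.5};
        System.out.println("share vs. disclose:  " + cosine(share, disclose));
        System.out.println("share vs. unrelated: " + cosine(share, unrelated));
    }
}
```

With these toy values the first pair scores close to 1 and the second pair scores below 0, which is the kind of separation a domain-specific embedding is meant to provide.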

Threat model. We consider an adversary who spreads malicious libraries that harvest private user data from third-party SDKs hosted by the same mobile apps. For this purpose, the adversary often offers appealing functionalities or monetary incentives to app developers for integrating a malicious library into their apps. In our study, victim SDKs in an app are those neither owned by the app vendor nor provided by AOSP (Android Open Source Project) [42], the official Android

Figure 3: Overview of XFinder. Data Policy Analyzer: takes ToS as input, recognizes restricted data, identifies policy statements and extracts policies, producing restricted data items and data sharing policies. Meta-DB Constructor: takes SDK API specifications and the restricted data items, and identifies the sensitive APIs (Meta-DB construction). Cross-Library Analyzer: takes Android apps, locates all cross-library calls, identifies sensitive calls, locates and taints the return values of sensitive APIs, tracks data flows, and runs a compliance check against the data sharing policies to report XLDH.

version not customized by original equipment manufacturers (OEMs). In this regard, and for the best understanding of the threats, libraries developed by Google but not on AOSP are also studied in our research.

3 Methodology

In this section, we elaborate on the design and implementation of XFinder, a methodology for discovering XLDH from real-world Android apps.

3.1 Overview

Architecture. As mentioned earlier, our approach relies on the extraction of data sharing policies from ToS and analysis of policy compliance during a library's interactions with other SDKs within the same app. In particular, the design of XFinder includes three major components: Data Policy Analyzer (DPA), Meta-DB Constructor, and Cross-library Analyzer (XLA), as outlined in Figure 3.

DPA takes as its input a set of SDK ToSes associated with popular SDKs, which are widely deployed in Android apps. These ToSes are processed by DPA to output the SDKs' data sharing policies (Section 3.2). The restricted data items governed by those data sharing policies, along with the corresponding SDK APIs that return those data items, are recorded in a Meta-DB (Section 3.4). To identify the leaks of those restricted data due to XLDH activities, XLA inspects the decompiled code of an app to find all cross-library invocations on the sensitive SDK APIs (those that return the restricted data, as recorded in the Meta-DB). XLA then tracks down the data flow of their return value to identify an exfiltration (Section 3.3). The cross-library interactions discovered in this way are then checked against the data sharing policies that DPA extracted, to find policy violations.

Example. Here we use a real example to explain how XFinder works. Specifically, XFinder inspected the app Columns Gembira (com.frzgame.columns), which includes both an XLDH library, Mobiburn, and the Facebook SDK. In analyzing the app, XLA first scans all function calls to find cross-library API calls, i.e., those with caller classes and callee classes in different libraries. For example, the class com.mobiburn.e.h in the Mobiburn library is found to invoke the function com.facebook.AccessToken.getToken() in the Facebook SDK, as shown in Figure 4a. Then XLA looks up the Meta-DB to determine the return value of the function, which is the user's Facebook session token, and further tracks down the data flow using taint tracking. In the end, we found that the token is used to fetch a user's Facebook profile data (ID, name, gender, email, locale, link, etc.) in function com.mobiburn.e.h.getFbProfile(), and all the data, including the token, are sent out to the server of Mobiburn (see Figure 4b).
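The first step of this example, deciding whether a caller class and a callee class belong to different libraries, can be sketched as follows. The sketch assumes that each library is identified by a package prefix and that the longest matching prefix wins; this matching rule and the class CrossLibraryCallSketch are our illustration, not a description of how XLA attributes classes to libraries.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: flag calls whose caller and callee belong to different libraries. */
class CrossLibraryCallSketch {

    /** Returns the longest known library prefix that owns the class, or null if none matches. */
    static String libraryOf(String className, List<String> libraryPrefixes) {
        String best = null;
        for (String prefix : libraryPrefixes) {
            boolean owns = className.equals(prefix) || className.startsWith(prefix + ".");
            if (owns && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        return best;
    }

    /** A call is cross-library when both ends map to known libraries and the libraries differ. */
    static boolean isCrossLibraryCall(String callerClass, String calleeClass,
                                      List<String> libraryPrefixes) {
        String caller = libraryOf(callerClass, libraryPrefixes);
        String callee = libraryOf(calleeClass, libraryPrefixes);
        return caller != null && callee != null && !caller.equals(callee);
    }

    public static void main(String[] args) {
        // Prefix list assumed for the example; a real analysis would derive it from the app.
        List<String> libraries = Arrays.asList("com.mobiburn", "com.facebook");
        System.out.println(isCrossLibraryCall(
                "com.mobiburn.e.h", "com.facebook.AccessToken", libraries));
    }
}
```

Calls from the host app's own classes are deliberately not flagged here, because the host app is not a third-party library under the definitions used in this paper.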

XFinder then checks whether such a data practice violates the data sharing policies specified in Facebook ToS. More specifically, given the statement "keep private your secret key and access tokens" in Facebook's ToS, DPA automatically extracted the data sharing policy (access token, condition: null), which indicates that access token is the restricted data item and that it cannot be shared with or transferred to a third party under any conditions (i.e., condition: null). Hence, the XLDH of the Mobiburn library violates the data sharing policy of the Facebook SDK, and thus XFinder flags Mobiburn as an XLDH library.

Dataset summary. We summarize the dataset produced and consumed by each stage of our pipeline as below. Table 2 shows the datasets used in our study.

In total, we collected 1.3M Android apps (Dg) from Google Play for XLDH library detection. More specifically, the dataset was collected based on a publicly available app list (AndroZoo [43]) using an open-source Google Play crawler [23], which has been widely used in previous research such as [87]. We used the default settings of the crawler to download the apps from Google Play from Oct. 03 to Oct. 15, 2019. In total, we successfully collected 93.12% of the apps on the list (1,341,148/1,440,160) from Google Play. Among them, we identified the top 200 SDKs widely integrated into real-world apps (Section 3.4). After removing utility SDKs, which are not associated with sensitive data, we further gathered ToSes for the remaining 40 SDKs from their vendor websites (Ctos).

    public class h {
        public static String getAccessToken() {
            Class[] param = new Class[0];
            Class clz = Class.forName("com.facebook.AccessToken");
            Method meth1 = clz.getDeclaredMethod("getCurrentAccessToken", param);
            Object curToken = meth1.invoke(clz, null);
            Method meth2 = clz.getDeclaredMethod("getToken", param);
            return meth2.invoke(curToken, null);
        }

        public JSONObject getFbProfile(String token) {
            String uri = Uri.parse("https://graph.facebook.com/v2.10/me")
                .appendParam("accesstoken", token)
                .appendParam("fields", "id,first_name,gender,last_name,link,locale,name,timezone,updated_time,verified,email");
            HttpsURLConnection httpsURLConnection = new URL(uri).openConnection();
            return new JSONObject(httpsURLConnection.getInputStream().readLine());
        }
    }

(a) Reading app users' Facebook access token and profile

    public class f {
        public void a() {
            JSONObject userData = new JSONObject();
            userData.put("accessToken", getAccessToken());
            userData.put("accountJson", getFbProfile());
            HttpsURLConnection httpsURLConnection = new URL(this.serverUri).openConnection();
            DataOutputStream dataOutputStream = httpsURLConnection.getOutputStream();
            dataOutputStream.write(userData);
        }
    }

(b) Sending the Facebook token and profile to the Mobiburn server

Figure 4: Code of XLDH library com.mobiburn

After that, we bootstrapped our study by using DPA to automatically extract 1,056 data sharing policies associated with 1,215 restricted data objects from the 40 SDK ToSes (Section 3.2). We constructed the Meta-DB (Section 3.4), which recorded all 936 sensitive APIs of the SDKs that return restricted data. Then, in XLA, we statically analyzed 1.3M Android apps (Dg) to extract cross-library API calls (Section 3.3). After filtering by Meta-DB, 1,934,874 of them are regarded as sensitive. Given those sensitive API calls, we tracked their data flows to check whether such flows are in compliance with the SDK's data sharing policies. In particular, for restricted data not allowing access by a third party or any party, we consider the exfiltration of the data a violation of the ToS, and identify 15 XLDH libraries. For restricted data access requiring user consent or complying with regulations, we check whether such behavior was disclosed in the caller library's privacy policy (Cp), which revealed 27 XLDH libraries. In total, our study reported 42 distinct XLDH libraries (4 manually found and 38 automatically detected) integrated in more than 19K apps and targeting 16 victim SDKs.
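The two-branch compliance decision described above can be written down compactly. The sketch below is a simplification under stated assumptions: a policy is modeled as the (object, condition) pair used in this paper, a null condition means that no condition permits sharing, and the two boolean inputs stand in for the results of data flow tracking and privacy policy inspection. The class and method names are hypothetical.

```java
/** Illustrative only: an (object, condition) policy pair and a simplified compliance decision. */
class PolicyComplianceSketch {

    static final class DataSharingPolicy {
        final String object;    // restricted data item, e.g. "access token"
        final String condition; // null: sharing is not permitted under any condition

        DataSharingPolicy(String object, String condition) {
            this.object = object;
            this.condition = condition;
        }
    }

    /**
     * @param exfiltrated                    whether the tracked data flow leaves the device
     * @param disclosedInCallerPrivacyPolicy whether the caller library's privacy policy
     *                                       discloses this behavior
     * @return true if the observed flow is treated as a violation
     */
    static boolean violates(DataSharingPolicy policy, boolean exfiltrated,
                            boolean disclosedInCallerPrivacyPolicy) {
        if (!exfiltrated) {
            return false; // no exfiltration observed, nothing to flag
        }
        if (policy.condition == null) {
            return true; // unconditional restriction: any exfiltration is a violation
        }
        return !disclosedInCallerPrivacyPolicy; // conditional restriction: require disclosure
    }

    public static void main(String[] args) {
        DataSharingPolicy token = new DataSharingPolicy("access token", null);
        System.out.println(violates(token, true, false));
    }
}
```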


Table 2: Summary of datasets and corpora

| Name | Source | Size | Timestamp (yyyyMM) | Usage |
|---|---|---|---|---|
| Dg | Google Play | 1.3M apps | 201910 | Detection |
| Ctos | 40 victim SDK ToSes | 8,622 sentences | 201910 | Detection |
| Capi | 40 victim SDK API specifications | Documentations of 27K APIs | 201910 | Detection |
| Cp | 73 XLDH library privacy policies | 10K sentences | 202006 | Detection |
| Dha | Historical Google Play apps | 300K apps | 2014-2019 | Measurement |
| Dhl | Historical XLDH library versions | 42 XLDH libraries of 495 versions | 2011-2019 | Measurement |

Table 3: The verbs related to data sharing policy

connect, associate, post, combine, lease, disclose, offer, distribute, afford, share, send, deliver, disseminate, transport, protect against, keep, proxy, request, track, aggregate, provide, give, transfer, cache, transmit, get, seek, possess, accumulate, convert, collect, use, store, gather, obtain, receive, access, save

3.2 Data Policy Analyzer

The goal of Data Policy Analyzer (DPA) is to extract third-party data sharing policies from an SDK ToS, which describe how restricted data items can be shared with or collected by other libraries. Here we describe a data sharing policy as a pair (object, condition), where object is the restricted data item of the SDK, such as utdid or password, and condition is the requirements and clauses for the operations on the restricted data, which can be empty. For example, the policy statement "the advertising identifier must not be associated with any persistent device identifier without explicit consent of the user" can be represented as (advertising ID and device ID, user consent). Note that in our study, we focus on ToS data sharing policies, and thus the subject of such a policy is the library developers (and their libraries) that call the target SDK, and the operation on the restricted data is GET.

During the analysis, our approach first runs an NER model to recover restricted data items. More specifically, the NER model is customized on the ToS corpora and the entity category of restricted data items (e.g., utdid, password) using an efficient constituency parsing technique. Then, based on the restricted data items, we identify the sentences related to third-party data sharing policies from the ToS. After that, we extract the pair (object, condition) from the data sharing policies, using restricted data as "anchors" to recognize the pattern of each policy's grammatical structures and to locate the condition on data sharing. We elaborate on our methodology as follows.

Restricted data object recognition. As mentioned earlier, identifying restricted data from an SDK ToS is an NER problem. Unfortunately, NER techniques today are known to be highly domain-specific [63]: an open-domain NER model does not work well on the security corpora, as restricted data are different from the common entity categories (e.g., location, people, organization) whose annotated datasets are available. In our study, we observe that restricted data in the SDK ToS is often characterized by a long noun phrase (e.g., Google Advertising ID, Facebook password, Amazon purchase history), covered by a single or multiple consecutive noun phrases in the constituency tree (Figure 2). Therefore, we can utilize the features of the constituency tree to help identify such a phrase as an entity.

More specifically, we include the constituency tree of a sentence as a feature, which enables our NER model to learn that certain types of phrasal nodes, such as NPs, are more likely to be entities, i.e., restricted data. Hence, we crafted several features based on constituency parsing tree tags for each word, which include a word's tag, its parent tag, its left and right siblings, and the location of the word in the span of NP nodes. For example, as shown in Figure 2, the word "advertising" in the NP span "the advertising ID" has 5 features: its tag "NNP", its parent tag "NP", its left sibling "DT the", its right sibling "NNP ID", and its position of 1 under the span. Such features help the model learn and infer similar long noun phrases (e.g., "the Twitter ID").
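The five features listed above can be computed from a single phrase node and its leaves. The sketch below reproduces the "advertising" example under simplifying assumptions: the span is a flat list of (word, tag) leaves under one parent node, positions are zero-based, and the feature names and the "NONE" placeholder are hypothetical labels chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: per-word features derived from one flat phrase node of a constituency tree. */
class ConstituencyFeatureSketch {

    /**
     * @param parentTag tag of the phrase node covering the span, e.g. "NP"
     * @param words     words under that node, in order
     * @param tags      part-of-speech tag of each word
     * @param index     zero-based position of the word of interest within the span
     */
    static Map<String, String> features(String parentTag, String[] words, String[] tags, int index) {
        if (words.length != tags.length || index < 0 || index >= words.length) {
            throw new IllegalArgumentException("inconsistent span or index");
        }
        Map<String, String> f = new LinkedHashMap<>();
        f.put("tag", tags[index]);
        f.put("parentTag", parentTag);
        f.put("leftSibling", index > 0 ? tags[index - 1] + " " + words[index - 1] : "NONE");
        f.put("rightSibling",
                index < words.length - 1 ? tags[index + 1] + " " + words[index + 1] : "NONE");
        f.put("position", Integer.toString(index));
        return f;
    }

    public static void main(String[] args) {
        String[] words = {"the", "advertising", "ID"};
        String[] tags = {"DT", "NNP", "NNP"};
        System.out.println(features("NP", words, tags, 1));
    }
}
```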

In our implementation, we utilize the AllenNLP constituency parser [56] to generate the constituency tree related features for each sentence. Then we built these features into the state-of-the-art conditional random fields (CRF) based NER model, Stanford NER [66]. As these features are not built-in features [37] in Stanford NER, we configure their feature variables using the SeqClassifierFlags class and then read the feature set into the CoreLabel class. In addition, we updated the training data using SDK ToSes. Particularly, we manually annotated 534 sentences from 6 SDK documents using IOB encoding [25] to retrain the NER model.
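For readers unfamiliar with IOB encoding, the sketch below tags one entity span in a tokenized sentence: the first token of the entity receives a B- tag, the following entity tokens receive I- tags, and all other tokens receive O. The label name DATA mirrors the token labels shown in Figure 2, but the exact label set used for annotation is our assumption, as is the single-span simplification.

```java
import java.util.Arrays;

/** Illustrative only: IOB tags for a sentence that contains one entity span. */
class IobEncodingSketch {

    /**
     * @param tokenCount  number of tokens in the sentence
     * @param entityStart index of the first entity token (inclusive)
     * @param entityEnd   index just past the last entity token (exclusive)
     * @param label       entity category, e.g. "DATA"
     */
    static String[] encode(int tokenCount, int entityStart, int entityEnd, String label) {
        if (entityStart < 0 || entityStart >= entityEnd || entityEnd > tokenCount) {
            throw new IllegalArgumentException("invalid entity span");
        }
        String[] tags = new String[tokenCount];
        for (int i = 0; i < tokenCount; i++) {
            if (i == entityStart) {
                tags[i] = "B-" + label; // beginning of the entity
            } else if (i > entityStart && i < entityEnd) {
                tags[i] = "I-" + label; // inside the entity
            } else {
                tags[i] = "O";          // outside any entity
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String[] tokens = {"you", "may", "not", "associate", "the", "advertising", "ID"};
        // "the advertising ID" occupies token positions 4 to 6.
        System.out.println(Arrays.toString(encode(tokens.length, 4, 7, "DATA")));
    }
}
```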

To evaluate the model we perform 10-fold cross validationon the annotated sentences Our result shows that by leverag-ing constituency tree features the model achieves a precisionof 952 and a recall of 908 Compared with the modelwithout constituency tree features our model shows a increaseof 13 and 21 for the precision and recall respectivelyAfter that the model was also evaluated on additional 103randomly selected and manually annotated sentences fromtwo previously-unseen SDK ToSes which yields a precisionof 882 and a recall of 904

Policy statement discovery. From each ToS, the analyzer identifies the sentences describing how restricted data items can be shared with or collected by other libraries. These sentences are selected based on the restricted data identified by the aforementioned model, the subject (i.e., library developer), and the operation (i.e., GET) they contain.

We first need to construct the keyword list associated with data collection and sharing (e.g., use, collect, transfer). For this purpose, we leverage the OPP-115 [86] and APP-350 [87] datasets, which contain 46,259 manually annotated privacy policy statements. Among them, 14,100 annotated sentences are related to first-party collection and third-party collection. We then use the constituency parser [56] to recognize the verb in each verb phrase of those sentences and identify the lemma of that verb. In this way, we collect 38 keywords related to data collection and sharing, as shown in Table 3.

After that, we use the keyword list in Table 3 and the restricted data to filter out the sentences irrelevant to data sharing and collection policy. Specifically, after parsing the HTML content of each ToS and splitting the text into sentences, we run our NER model to find all statement sentences that contain restricted data. Then we leverage a dependency parser [71] to locate the verb or nominal modifier of the restricted data. In particular, if (1) the dependency relationship between them is direct object (e.g., collect personal information) or nominal modifier (e.g., the usage of personal information), and (2) the verb or nominal modifier is in the keyword list, we regard the sentence as describing a data sharing and collection policy.

After this, we check that the subject of the sentence is not the SDK itself but the library developer. Specifically, we build a dependency parsing tree to recognize the subject of a sentence and eliminate sentences with the “first party” as the subject, for example, the target SDK's name (e.g., “Twitter”, “Facebook”) or a first-person plural pronoun (e.g., “we”, “us”). Note that for sentences with an ambiguous subject reference (e.g., “it”, “this”, “that”), we run a co-reference resolution tool [56] on the paragraph containing the sentence to identify the subject.

Altogether, we gathered 1,056 sentences associated with data sharing policy from 40 ToSes. By manually inspecting 200 sentences, we found that our method yields a recall of 89.3%. The missed sentences are mainly due to wrong dependency parsing results from the underlying tools we use. For example, “store non-public Twitter content” is parsed as a noun phrase instead of a verb-object phrase by [71].

Data sharing policy identification. To extract the pair (object, condition) from the policy statements of an SDK ToS, our approach first uses a dependency parser [58] to transform a sentence into a dependency parsing tree, which describes the grammatical connections between words, and then leverages the restricted data object as a known anchor to locate the condition by traversing the parsing tree. Here, we prune the dependency tree of the policy statement into a subtree that represents the grammatical relation among the object, the condition, and the operation (e.g., “transfer”, “use”). We do so because a policy statement is usually long and noisy, and the subtree is the part most relevant to understanding the relation.

• Object identification. In our research, we observe that the object in a data sharing policy sometimes consists of more than one restricted data item, e.g., “Don't collect usernames or passwords”. Hence, to extract the object from each policy statement, we first identify the restricted data d1, d2, ..., dn using the aforementioned method and then use the dependency tree to determine whether they have a conjunctive relation and whether their coordinating conjunction (a.k.a. CCONJ [32]) is “OR”. If so, we recognize them as n different objects.

Similarly, if the restricted data d1, d2, ..., dn are in a conjunctive relation but their coordinating conjunction (a.k.a. CCONJ [32]) is “AND”, we recognize them as one object. Things get more complicated when the policy statement states that multiple objects cannot be obtained (GET) at the same time, e.g., “Don't associate user profiles with any mobile device identifier”. Here, we use specific verbs (e.g., associate with, combine with, connect to) to identify this relationship. In this way, we recognize them as one object, i.e., d1 ∧ d2 ∧ ... ∧ dn.

In addition, we use the lexico-syntactic patterns discovered in [45] to find the hyponyms of an object and then use the specific hyponyms in the policy tuples. For example, given the pattern “X, for example, Y1, Y2, ..., Yn”, where Y1, Y2, ..., Yn are hyponyms of X, and the sentence “device identifier, for example, ssaid, mac address, imei, etc.”, we will extract four policies, for “device identifier”, “ssaid”, “mac address”, and “imei”.
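The object identification rules can be sketched as follows. This is a simplified illustration only: the helper names `group_objects` and `expand_hyponyms` are hypothetical, the conjunction is assumed to have been read off the dependency tree already, and a regular expression stands in for one of the lexico-syntactic patterns of [45].

```python
import re

def group_objects(items, conjunction):
    """'or' yields n separate objects; 'and' yields one combined object."""
    if conjunction.lower() == "or":
        return [(d,) for d in items]
    return [tuple(items)]

# Illustrative regex for the pattern "X, for example, Y1, Y2, ..., Yn".
_FOR_EXAMPLE = re.compile(
    r"\s*(?P<hyper>.+?)\s*,?\s*for example,?\s*(?P<list>.+)", re.IGNORECASE
)

def expand_hyponyms(phrase):
    """Return the hypernym X followed by its listed hyponyms Y1..Yn."""
    m = _FOR_EXAMPLE.match(phrase)
    if not m:
        return [phrase.strip()]
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", m.group("list"))
    hyponyms = [p.strip(" .") for p in parts]
    hyponyms = [p for p in hyponyms if p and p.lower() != "etc"]
    return [m.group("hyper").strip()] + hyponyms

separate = group_objects(["usernames", "passwords"], "or")
combined = group_objects(["user profiles", "mobile device identifier"], "and")
objects = expand_hyponyms(
    "device identifier, for example, ssaid, mac address, imei, etc."
)
```

Here `separate` holds two single-item objects, `combined` holds one object covering both data items, and `objects` lists the hypernym together with the three hyponyms named in the sentence.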

• Condition extraction. By manually inspecting 1K sentences from 10 SDK ToSes, we annotated 14 generic patterns (in terms of dependency trees) that describe the grammatical relation among the object, the condition, and the operation. The annotated pattern list is shown in Table 9. We then fed them into the analyzer, which uses these patterns to match the dependency parsing trees of the policy statements, with the object and operation nodes as anchors. More specifically, given a policy statement, we use the depth-first search algorithm, starting at the object and operation nodes, to extract all subtrees for pattern similarity comparison. Then we identify the most similar subtree of a policy statement by calculating a dependency tree edit distance between each subtree and the patterns in Table 9. Here, we define the dependency tree edit distance as

$D(t_1, t_2) = \min_{(o_1, \ldots, o_k) \in O(t_1, t_2)} \sum_{i=1}^{k} o_i$,

where $O(t_1, t_2)$ is the set of sequences of tree edits (e.g., insertion, deletion, and substitution of a node or edge) that transform $t_1$ into $t_2$, and we consider $t_1$ and $t_2$ equal when all node types and edge attributes match. After that, we locate the condition node based on the matched subtree.

For example, Figure 5a illustrates the dependency tree structure of a policy statement. In the tree, each edge has an attribute dep that shows the dependency relationship between nodes, and each node has an attribute type that indicates whether it is an object, an operation, or none of the above (other).


Figure 5: An example of condition extraction with the policy statement “You must have legally valid consent from a Member before you store that Member's Profile Data.” (a) Tree structure of policy statement. (b) Tree structure of matched pattern.

The subtree consisting of all green nodes is the most similar subtree, with an edit distance of 0 to the pattern shown in Figure 5b. Traversing the matched pattern, we can locate the condition node, which is (have legally valid consent).
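The matching in Figure 5 can be approximated with a small sketch. It deliberately replaces the tree edit distance with a simpler measure, the number of (head type, dep, child type) edges that must be inserted or deleted for the two trees to agree, and the edge lists below are transcribed by hand from Figure 5; the function names are hypothetical.

```python
from collections import Counter

def edge_signature(edges, types):
    """Multiset of (head type, dep label, child type) triples for a tree."""
    return Counter((types[h], dep, types[c]) for h, dep, c in edges)

def simplified_distance(sub_edges, sub_types, pat_edges, pat_types):
    """Number of edge insertions/deletions needed to make the signatures equal.

    This is a simplification, not the full dependency tree edit distance.
    """
    a = edge_signature(sub_edges, sub_types)
    b = edge_signature(pat_edges, pat_types)
    return sum(((a - b) + (b - a)).values())

# Pattern of Figure 5b: obtain consent [before you] use data.
pat_types = {"obtain": "other", "consent": "other",
             "use": "operation", "data": "data"}
pat_edges = [("obtain", "dobj", "consent"),
             ("obtain", "advcl", "use"),
             ("use", "dobj", "data")]

# Candidate subtree of Figure 5a (the green nodes).
sub_types = {"have": "other", "consent": "other",
             "store": "operation", "profile_data": "data"}
sub_edges = [("have", "dobj", "consent"),
             ("have", "advcl", "store"),
             ("store", "dobj", "profile_data")]

# A second candidate that also keeps the auxiliary "must".
noisy_types = dict(sub_types, must="other")
noisy_edges = sub_edges + [("have", "aux", "must")]

best = simplified_distance(sub_edges, sub_types, pat_edges, pat_types)
worse = simplified_distance(noisy_edges, noisy_types, pat_edges, pat_types)
```

Under this simplified measure the green subtree is at distance 0 from the pattern, while the candidate that keeps the extra aux edge is at distance 1, so the former would be selected as the match.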

As the graph matching problem is NP-complete, the computation time grows dramatically as the number of nodes and edges increases. In our implementation, to reduce the matching time when constructing subtrees using depth-first search, we limit the search depth to at most twice the pattern length and set the edit distance threshold to 3.

Discussion and evaluation. DPA recognized 1,215 (object, condition) pairs from 1,056 policy statements in 40 ToSes. We manually inspected all of the detected pairs against the relevant policy statements. The results show that our method achieves a precision of 84.2%. However, our technique still misses some cases. We acknowledge that the effectiveness of the method can increase with more annotated sentences for pattern matching. In our research, to guarantee the diversity of the annotated patterns, we designed a sampling strategy for selecting sentences to annotate. Specifically, we keep randomly sampling sentences for annotation until no new pattern is observed for 200 consecutive sentences. In this way, we annotated 14 patterns by inspecting 1K sentences from 10 SDK ToSes. Another limitation is the capability to process long sentences. Our method uses the dependency parsing trees of the sentences for condition extraction. However, the state-of-the-art dependency parser cannot maintain its accuracy when sentences become too long.

Analyzing these 1,215 pairs of data sharing policies, we found that most are from the Facebook SDK (94), followed by Amazon (88) and LinkedIn (73). Also, 37 of them have an object with more than one restricted data item. We observe that the objects advertising ID and mobile device identifier always co-occur (7 pairs), because the user-resettable advertising ID becomes personally identifiable when associated with a mobile device identifier, which is not privacy-compliant [9]. To understand the data sharing conditions, we manually analyzed all of the recognized data sharing policies and categorized them into five types: No access by any party, Requiring user consents, No third-party access, Complying with regulations (i.e., GDPR, CCPA, COPPA), and Others, as shown in Table 5. Note that 96% of the data sharing policies fall under the first four types of data sharing conditions. The Others type is rarely observed and is sometimes associated with a vaguely described condition such as “Only certain application types can access”. In our policy compliance check (see Section 3.3), we did not check compliance with this type of condition. We acknowledge that checking those policies would allow for a more holistic view of XLDH activities. However, doing so would require subjective analysis of vague and ambiguous policies and a large amount of manual effort for corner cases.

Comparison with other policy analyzers. Since there is no publicly available ToS analyzer, we compared DPA's policy statement discovery with two state-of-the-art privacy policy analyzers, Polisis [57] and PolicyLint [45] (footnote 2). Note that Polisis and PolicyLint are designed for privacy policy analysis, not ToS analysis.

More specifically, we manually annotated 200 sentences from 3 SDK ToSes (e.g., Twitter, Google, Facebook), of which 83 are associated with data collection and sharing policy and the remaining 117 are not. We evaluated the approaches on this dataset; Table 4 shows the results. DPA outperforms both approaches in precision and recall.

We also compared DPA's restricted data object recognition module with PolicyLint [45]. We use the aforementioned 534 IOB-encoded sentences to evaluate both DPA and PolicyLint via 10-fold cross validation. For PolicyLint, we retrained its NER model using our annotated corpora. We also show the performance of PolicyLint with its original model. Table 4 shows the precision and recall of both approaches. Our study shows that DPA has a much better recall (90.8% vs. 82.7%).

3.3 Cross-library Analysis

To capture malicious data harvesting in an app, our Cross-library Analyzer (XLA) identifies cross-library API calls to find the data gathered by a third-party library from a co-located SDK (within the same app) and then checks the compliance of such activities with the SDK's data sharing policies recovered by DPA (Section 3.2). To this end, XLA runs a program analysis tool that integrates existing techniques.

2 PolicyCheck [46] shared the same policy analysis module with PolicyLint.

Table 4: Comparison with Polisis and PolicyLint/PolicyCheck

Task | Tool | Precision | Recall
Statement finder | XFinder | 87.2% | 89.3%
Statement finder | Polisis | 27.5% | 19.2%
Statement finder | PolicyLint | 71.4% | 25.3%
Restricted data detector | XFinder | 95.2% | 90.8%
Restricted data detector | PolicyLint (original) | 82.2% | 71.3%
Restricted data detector | PolicyLint (retrained) | 86.5% | 82.7%

Locating cross-library API calls. XLA looks for cross-library calls by walking through the call graph generated by FlowDroid [47]. Specifically, each node in the graph represents a function and carries information about the function's class and package (according to Java's reverse domain name notation convention [33]); each directed edge describes a call from the caller node to the callee node. On the graph, XLA identifies cross-library calls by comparing the package names of the caller and callee classes: if their top- and second-level domains (1) do not match each other and (2) do not match the host app's package name, the call is considered cross-library, an approach also used by MAPS [87].
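A minimal sketch of this package-name comparison is shown below. The function names and the caller class `com.oneaudience.sdk.Collector` and host package `org.example.game` are hypothetical, chosen only for illustration; the call graph itself is assumed to come from a separate tool.

```python
def top_two(name):
    """Top- and second-level domain of a reverse-domain package or class name."""
    return ".".join(name.split(".")[:2])

def is_cross_library(caller_class, callee_class, host_package):
    """A call is cross-library if caller and callee differ in their first two
    package components and neither matches the host app's package."""
    caller = top_two(caller_class)
    callee = top_two(callee_class)
    host = top_two(host_package)
    return caller != callee and caller != host and callee != host

# Third-party library calling into the Facebook SDK inside a host app.
cross = is_cross_library("com.oneaudience.sdk.Collector",
                         "com.facebook.AccessToken",
                         "org.example.game")
# A call that stays inside the Facebook SDK.
internal = is_cross_library("com.facebook.login.LoginManager",
                            "com.facebook.AccessToken",
                            "org.example.game")
# The host app itself calling the SDK.
from_host = is_cross_library("org.example.game.MainActivity",
                             "com.facebook.AccessToken",
                             "org.example.game")
```

Only the first call is flagged; calls within one library and calls made by the host app are ignored.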

Also, a cross-library call can leverage Java's reflection to implicitly trigger a function (see the example in Figure 4a). Hence, XLA inspects all reflection calls on the call graph and checks whether the caller and callee classes belong to different libraries. To this end, XLA first locates reflection calls using a set of call patterns (see Table 6). As shown by a recent study [62], these patterns cover the most common reflection use cases in Android apps. Further, our approach recovers the callee's class name and method name from the arguments passed to the reflection functions. For example, the argument of Class.forName(target_class_name) indicates the callee class name, e.g., com.facebook.AccessToken in Figure 4a. A problem here is that the argument could be a variable. To find its value, XLA utilizes DroidRA [62], an inter-procedural, context-sensitive, and flow-sensitive analyzer dedicated to resolving reflection calls, to track the string content propagated to the variable.

Identifying cross-library leaks. With the discovered cross-library calls, XLA then identifies the restricted data items returned to the caller library and performs taint tracking to detect potential data exfiltration (to the Internet) by the caller library. In particular, XLA leverages Meta-DB, which records the sensitive SDK APIs and the restricted data they return, to recognize the restricted data items being returned (see Section 3.4).

Further, we need to track the data flow of the restricted data. Directly using the techniques of FlowDroid (e.g., with deep object sensitivity) is considered heavy-weight for an analysis of 1.3M apps [77, 87], so we need a relatively light-weight tool. Hence, we opt for existing taint tracking techniques that are inter-procedural and field-sensitive but not object-sensitive. We take the return values of the cross-library calls as the taint sources and networking APIs as the sinks. For example, the return value of the reflection call on com.facebook.AccessToken.getToken() is a taint source carrying the Facebook user's session token (Figure 4a); once the data reaches a sink in the caller library, e.g., the OutputStream.write(String.getBytes()) API that sends the data to the Internet, XLA reports a potential data exfiltration.
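The source-to-sink idea can be illustrated with a toy propagation over a straight-line list of calls. This sketch is far simpler than the inter-procedural, field-sensitive analysis described above: the three-field statement format and the function `find_leaks` are assumptions made for illustration, and only the source and sink names come from the example in the text.

```python
# Taint source and sink taken from the example in the text.
SOURCES = {"com.facebook.AccessToken.getToken"}
SINKS = {"java.io.OutputStream.write"}

def find_leaks(statements):
    """Propagate taint through (dst, callee, args) statements in order and
    report every sink call that receives a tainted argument."""
    tainted = set()
    leaks = []
    for dst, callee, args in statements:
        hit = [a for a in args if a in tainted]
        if callee in SOURCES:
            tainted.add(dst)                 # return value of a sensitive API
        elif callee in SINKS:
            if hit:
                leaks.append((callee, hit))  # tainted data reaches the network
        elif hit and dst is not None:
            tainted.add(dst)                 # taint flows from arguments to result
    return leaks

# Hypothetical caller-library code, flattened into call statements.
program = [
    ("token", "com.facebook.AccessToken.getToken", []),
    ("payload", "java.lang.String.concat", ["prefix", "token"]),
    ("data", "java.lang.String.getBytes", ["payload"]),
    (None, "java.io.OutputStream.write", ["data"]),
]
leaks = find_leaks(program)

# The same program without the sensitive call produces no report.
clean = find_leaks(program[1:])
```

In the first program the token flows through two string operations into `OutputStream.write`, so one potential exfiltration is reported; in the second, nothing is tainted and the sink call is ignored.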

Checking policy non-compliance. Given a potential exfiltration of restricted data from a victim SDK, we check whether it violates the ToS policy of the target SDK (obtained by DPA in Section 3.2). Depending on the conditions with which the ToSes restrict access to individual data items, our approach for a compliance check is as follows.

• No third-party access, no access by any party. If the ToS (e.g., those of Facebook, Twitter, and Pinterest) prohibits access to the data by a third party (e.g., a third-party library or its vendor) or by any party (e.g., Facebook user ID and password are not even allowed to be exfiltrated or stored by the host app vendor), we consider the exfiltration of the data a violation of the ToS, and an XLDH activity is identified.

• Requiring user consents, complying with regulations. Some ToSes require that access to certain data items be subject to user consent or comply with privacy regulations (i.e., GDPR, CCPA, COPPA). In XLDH, data sharing and collection occur between the caller library and the victim SDK without being processed by the host app. Hence, we consider the caller library to be a data controller [19], which has obligations to comply with regulations and disclose the data practice in its privacy policy [19]. In our study, we check the privacy policy of the caller library to determine whether it discloses these data collection and sharing behaviors. To automatically analyze the privacy policy, we use PolicyLint [45] to extract privacy policy tuples (actor, action, data object, entity) associated with the restricted data. Here, the tuple (actor, action, data object, entity) states who [actor] collects/shares [action] what [data object] with whom [entity], e.g., “We [actor] share [action] personal information [data object] with advertisers [entity]”. In our study, we care about the tuples with the caller library as actor, share/collect as action, and the restricted data as data object. Note that non-English privacy policies, which PolicyLint [45] cannot handle, are translated into English for further processing.
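The two compliance rules above can be summarized in a short sketch. The `PolicyTuple` type, the condition labels, and both functions are hypothetical names introduced here; the tuples are assumed to have been extracted from the caller library's privacy policy beforehand.

```python
from typing import NamedTuple

class PolicyTuple(NamedTuple):
    """Assumed shape of an extracted tuple: who does what to which data, with whom."""
    actor: str
    action: str
    data_object: str
    entity: str

def discloses(tuples, library, restricted_data):
    """True if the library's privacy policy states that it collects or shares the data."""
    return any(
        t.actor == library
        and t.action in {"collect", "share"}
        and t.data_object == restricted_data
        for t in tuples
    )

def is_violation(condition, exfiltrated, disclosed):
    """Apply the two compliance rules to one restricted data item."""
    if not exfiltrated:
        return False
    if condition in {"no_third_party_access", "no_access_by_any_party"}:
        return True               # any exfiltration violates the ToS
    if condition in {"user_consent", "comply_with_regulations"}:
        return not disclosed      # violation only if the practice is undisclosed
    return False                  # the "Others" type is not checked

# Hypothetical privacy policy of a caller library "libX".
policy = [PolicyTuple("libX", "collect", "ip address", "advertisers")]
disclosed = discloses(policy, "libX", "advertising id")

v1 = is_violation("no_third_party_access", exfiltrated=True, disclosed=True)
v2 = is_violation("user_consent", exfiltrated=True, disclosed=disclosed)
v3 = is_violation("user_consent", exfiltrated=True, disclosed=True)
```

In this example the library exfiltrates the advertising ID without mentioning it in its policy, so `v2` is a violation, whereas the same flow with a matching disclosure (`v3`) is not.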

Discussion. Recent studies on privacy compliance, such as [80, 87], considered data as leaked once an API returning the data is invoked by an unauthorized party. We found this to be imprecise for detecting XLDH, due to the pervasiveness of service syndication (e.g., Twitter4j, Firebase Authentication), in which a benign library wraps other SDKs (Facebook login, Twitter login) to support their easy integration into apps. Such syndication libraries also acquire restricted data from these third-party SDKs but rarely send them out to their servers. Therefore, a policy violation can only be confirmed once the collected data are delivered to the unauthorized recipient.

Table 5: Summary of 40 vendors' data sharing policies

Condition | Percentage | Example
No access by any party | 396 (38.7%) | Don't proxy, request or collect Facebook usernames or passwords
User consent | 249 (24.34%) | Obtain consent from people before using user data in any ad
No third-party access | 206 (20.13%) | Don't use the Ads API if you're an ad network or data broker
Comply with regulations | 123 (12.02%) | Any End User Customer Data collected through your use of the Service is subject to the GDPR
Other | 47 (4.59%) | Only certain application types may access Restricted data for each product

Table 6: Most common patterns of reflection call sequences

Sequence pattern
Class.forName() → getMethod() → invoke()
getDeclaredMethod() → setAccessible() → invoke()

3.4 Meta-DB Construction

Our Meta-DB records the API specifications and metadata of the top 40 third-party libraries, which cover 91% of Google Play apps (see below). For each API, Meta-DB records the data it returns (e.g., session token, page likes, user ID, profiles, groups followed) and whether or not the returned data is restricted by the SDK's ToS.

Identifying popular third-party SDKs. To find the most popular SDKs, which are appealing XLDH attack targets, we ranked the third-party SDKs based on the number of apps using them. Specifically, we randomly sampled 200,000 apps in Dg and identified the third-party SDKs used by those apps. Just like MAPS [87], we considered an SDK as third-party if the top- and second-level domains in its package name do not match the app's package name. After ranking those SDKs, we selected the top 200, excluding those with obfuscated package names, and further manually reviewed them and removed utility SDKs that are not associated with restricted data, e.g., the Google gson SDK [21]. The remaining 40 SDKs were then used in our research to construct Meta-DB (Figure 3).

Note that in our study, the 40 SDKs (from the top 200) recorded in Meta-DB are integrated in 91% of apps. This indicates a high chance for them to co-locate with a malicious library in an app. In contrast, the remaining 6,273 SDKs we found were less popular: the 201st most popular SDK was integrated in just 0.8% of Google Play apps.

Identifying privacy-sensitive APIs. We gathered 26,707 API specifications provided by the aforementioned 40 SDK vendors. Such documentation, especially that provided by popular vendors, tends to be highly structured, with well-specified API names, argument lists, and return data. This allowed us to build a parser to extract the API names and the return values. In particular, for each API we use a regex (e.g., returns(Ww)|retrieves(Ww)|get(Ww)) to match the return values. Because API specifications are often well structured, the regex-based method identifies the return values efficiently. We evaluated the regex-based method on 200 labelled samples and achieved a precision of 100% and a recall of 98.74%. Altogether, we extracted 10,336 APIs and their associated return values from the 26,707 API specifications.
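To make the regex-based extraction concrete, the sketch below uses an illustrative pattern written for this example. It is an assumption and not the expression used to build Meta-DB; the function name `extract_return_value` and the sample descriptions are likewise hypothetical.

```python
import re

# Illustrative pattern only: a verb such as "returns"/"retrieves"/"get(s)",
# an optional article, then a short noun phrase that ends at punctuation,
# at "of"/"for"/"from", or at the end of the description.
RETURN_RE = re.compile(
    r"\b(?:returns|retrieves|gets?)\s+(?:the\s+|an?\s+)?"
    r"([\w ]+?)(?=[.,;]|\s+(?:of|for|from)\b|$)",
    re.IGNORECASE,
)

def extract_return_value(description):
    """Return the phrase naming the API's return value, or None if absent."""
    m = RETURN_RE.search(description)
    return m.group(1).strip() if m else None

r1 = extract_return_value("Returns the access token for the current user.")
r2 = extract_return_value("Retrieves user ID.")
r3 = extract_return_value("Closes the session.")
```

On these hypothetical descriptions the sketch yields “access token” and “user ID” for the first two and nothing for the third, which does not describe a return value.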

Our study marked an API as privacy-sensitive if its return values were protected by data sharing policies. This is done by checking each API's return values against the restricted data reported by DPA. However, this cannot be achieved by simple string matching, because the API specification and the ToS usually describe protected information differently. For example, a ToS tends to describe a data object in a more generic way (e.g., user profile), while the API documentation usually uses more specific terms (e.g., username). Hence, we align the data objects in the API specification with those in the ToS based on their semantics (represented by vectors computed using an embedding technique). Specifically, we train a domain-specific word embedding model to get the data object vectors and then measure their similarity by calculating the cosine distance between the vectors. In our implementation, we gather 15G domain-specific corpora (e.g., privacy policies, ToSes, API documentation) and 25G open-domain corpora (e.g., Google News, Wikipedia) to train a skip-gram based word2vec model. Here, we leverage a data augmentation technique [84], which generates a new sentence by randomly replacing synonyms, inserting words, swapping the positions of words, and deleting words, to enlarge our domain-specific corpora.

Evaluation. To evaluate the model, we randomly sampled 300 APIs from the 6,394 APIs associated with 10 SDKs' API specifications. We manually checked the API specifications and labeled 153 privacy-sensitive APIs and 147 non-privacy-sensitive APIs. With a similarity threshold of 0.7, our approach achieved 87% precision and 93% recall on the annotated dataset. In total, our model discovered 1,094 sensitive APIs from the 26,707 APIs of the 40 SDKs in Meta-DB. We manually checked all of them and obtained a precision of 85.6%. Note that we only used the validated sensitive APIs in XLDH detection.
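The alignment step can be sketched with plain cosine similarity and the 0.7 threshold mentioned above. The three-dimensional vectors below are made-up toy values standing in for word2vec embeddings, and `is_privacy_sensitive` is a hypothetical helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def is_privacy_sensitive(api_vec, restricted_vecs, threshold=0.7):
    """An API return value is sensitive if it is close to any restricted data term."""
    return any(cosine_similarity(api_vec, r) >= threshold for r in restricted_vecs)

# Toy embeddings (assumed values, not trained vectors).
user_profile = [0.8, 0.2, 0.3]   # restricted data term from a ToS
username = [0.9, 0.1, 0.2]       # return value named in an API specification
screen_width = [0.0, 1.0, 0.1]   # unrelated return value

sensitive = is_privacy_sensitive(username, [user_profile])
not_sensitive = is_privacy_sensitive(screen_width, [user_profile])
```

With these toy vectors, “username” lands close to “user profile” and is marked sensitive, while “screen width” falls well below the threshold.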


4 Evaluation and Challenges in Detection

This section reports our evaluation study on XFinder to understand its effectiveness and performance, and the challenges in identifying XLDH from a large number of real-world apps.

4.1 Effectiveness

Evaluation on ground-truth set. We evaluated XFinder over a ground-truth dataset including a “bad set” and a “good set” with 40 apps each. The apps in the bad set are integrated with 4 XLDH libraries (com.yandex.metrica, com.inmobi, com.appsgeyser, cn.sharesdk), which were found manually early in our research (before we built XFinder). The good set includes apps randomly sampled from the top paid app list on Google Play [22]. They are considered to be mostly clean and were further confirmed manually in our research to be free of XLDH libraries: we inspected cross-library calls in these apps against the top 40 SDKs (recorded in Meta-DB) and concluded that their corresponding data flows do not violate the callees' ToSes. Running on these ground-truth sets, XFinder achieved a precision of 100% and a recall of 100%.

Evaluation on unknown set. We then evaluated XFinder on a large “unknown” dataset: Dg excluding the 13,018 apps integrating known XLDH libraries, which contains 1,328,130 free Android apps in total, together with 40 SDK ToSes. XFinder reported 2,968 apps associated with 37 distinct XLDH libraries (distinguished based on their package names). To measure the effectiveness of XFinder, we randomly selected three apps for each identified XLDH library (105 in total) and manually validated the detection results. Of the 37 identified XLDH libraries, 32 were true positives (a precision of 86%), affecting 93 out of the 105 apps. We performed manual end-to-end tests on seven XLDH libraries (including OneAudience, Mobiburn, and Devtodev) in real-world apps and confirmed that they indeed exfiltrated Facebook user data to their servers (using Xposed [40] for app instrumentation and Packet Capture [1] for inspecting network traffic).

Looking into the five falsely reported libraries, we found that three of them (com.parse, com.batch, and com.gigabud) were caused by the taint analysis of XLA. As mentioned in Section 3.3, for better scalability, our taint tracking is object-insensitive. Specifically, when our approach taints a field f (holding a Facebook token) in an object obj of class C, the whole class is tainted as a result. When the field f′ (not storing sensitive data) of another object obj2 of the same class is then propagated to a sink, XLA cannot distinguish the two objects and simply considers the token-related information to be exposed to the sink, thereby leading to false alarms.
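The effect of object insensitivity can be reproduced with a toy model. The representation of taints as (object, class, field) triples and the function `reaches_sink` are assumptions made for illustration; they do not reflect the internals of XLA.

```python
def reaches_sink(taints, sink_read, object_sensitive):
    """Decide whether the value read at a sink is considered tainted.

    taints:    set of (object_id, class_name, field) triples marked as tainted
    sink_read: the (object_id, class_name, field) triple flowing into the sink
    """
    obj, cls, field = sink_read
    if object_sensitive:
        # Precise view: only the exact field of the exact object is tainted.
        return (obj, cls, field) in taints
    # Object-insensitive view: one tainted field taints the whole class.
    return any(c == cls for _, c, _ in taints)

# Field f of obj holds a Facebook token; f_prime of obj2 holds harmless data.
taints = {("obj", "C", "f")}
sink_read = ("obj2", "C", "f_prime")

coarse = reaches_sink(taints, sink_read, object_sensitive=False)
precise = reaches_sink(taints, sink_read, object_sensitive=True)
```

The coarse analysis reports a leak for `obj2.f_prime` even though only `obj.f` holds the token, which is the kind of false alarm described above; the precise variant does not.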

Another two false positives (com.xcosoftware and fr.pcsoft) were introduced because our current program analysis could not fully resolve the server endpoints of data exfiltration. Although XFinder found that the two libraries expose a Facebook access token to the Internet (and thus reported them as XLDH), the libraries actually send the token to the Facebook server (to retrieve additional user data, e.g., name, ID, page likes), not to an unauthorized recipient. Fully automated resolution of such an endpoint is challenging, since the Facebook endpoint used in the networking API is heavily obfuscated (using a complicated control flow to transform the endpoint string; see the code snippet in our released dataset [39]). We utilized one of the state-of-the-art tools [89] capable of statically resolving string values in Android apps (using a value set analysis approach with backward slicing and string-related operation analysis), which, however, still failed to handle this case. In our research, we also observed that certain XLDH libraries, such as com.mobiburn, fetch a dynamic exfiltration endpoint whose value cannot be resolved statically (see the evasion techniques in Section 5.3). Hence, for better coverage, XFinder opts to report all exfiltration cases, even if the endpoints cannot be resolved, and then relies on a manual process to validate the results. Note that the percentage of such false cases is low in our results.

Discussion of potentially missed cases. Due to the lack of ground truth, determining the number of missed XLDH libraries on a large scale is challenging. In general, false negatives can be introduced for two reasons: (1) challenges in automatic data-sharing policy analysis on ToSes, and (2) limitations of today's static program analysis techniques, e.g., precise taint tracking, building complete call graphs, and resolving reflection-call targets.

Specifically, although DPA achieved a high precision and recall in ToS analysis (see evaluation), it missed vague and complicated cases, as mentioned in Section 3.2. This can be improved in the future by investigating efficient dependency parsing on long sentences. Second, XFinder shares the limitations of current static analysis techniques. In particular, false negatives could be introduced due to the limited capabilities of taint tracking in complicated real-world apps and libraries. For example, an XLDH library could store the restricted data in the host app's datastore (e.g., Android SharedPreferences, SQLite database files [3]) and later use another thread or module to retrieve a specific data item and send it out. Such a complicated data flow cannot be automatically taint-tracked by our current approach, nor can it be handled by other state-of-the-art tools like FlowDroid. We leave the systematic study of such convoluted XLDH data practices to future work. Furthermore, limited by the capability of DroidRA [62] to resolve the targets of reflection calls, our approach may not identify all cross-library calls if the target class name and function name are stored in variables passed from other threads, or are obfuscated. Also, since we leverage FlowDroid to build the call graphs, which may not be complete, we may not find all cross-library calls based on the graphs.


4.2 Performance

Running XFinder on 1.3 million apps and 40 SDK ToSes took around two months to finish all the tasks, including DPA, Meta-DB construction, and XLA. Among these three components, XLA was the most time-consuming (around two months). To analyze all 1.3M apps (Dg), we utilized the computing resources available to us, including one supercomputer (shared in our organization), two servers (20 cores/251GB memory and 12 cores/62GB memory, respectively), and 24 desktops (4 cores/15GB memory each). We configured a 300-second timeout for taint tracking, with 83.7% of the apps successfully analyzed without timeouts or de-compilation errors (11.4% and 4.9% of them hit timeouts and de-compilation errors, respectively). DPA took 2 hours to extract 1,215 (object, condition) pairs on a Mac machine with a 2.6 GHz Quad-Core Intel Core i7 processor and 6GB of 2133 MHz LPDDR3 memory. Meta-DB construction took 4 hours to find privacy-sensitive APIs from the API documentation of the top 40 SDKs.

5 Measurement

Based on the detected XLDH libraries and affected apps, we further conducted a measurement study to understand the XLDH ecosystem. In this section, we first present an overview of the real-world XLDH ecosystem discovered in our study (Section 5.1) and then describe the scope and magnitude of this malicious activity, as well as the infection techniques and distribution channels of the XLDH libraries.

5.1 XLDH Ecosystem

Before coming to the details of our measurement findings, we first summarize the XLDH ecosystem. As outlined in Figure 6, an adversary who owns an XLDH-based data brokerage platform (e.g., OneAudience) releases an XLDH library that aims to harvest data from the Facebook SDK. To this end, the adversary needs to distribute the library to a large number of real-world mobile apps, so he reaches out to app owners, especially those with popular apps embedding the Facebook SDK, and offers them a monetary incentive to integrate the XLDH library (1). The app integrating the library and the Facebook SDK, once it passes the SDK vendor's review (2) and the app store vetting (3), is available for download (4). When innocent users install the app and log in to Facebook, the XLDH library stealthily accesses the Facebook token to harvest the user's Facebook data (5) and sends them to its back-end platform (6). Meanwhile, the app owner receives commissions from the adversary based on the number of app installations (e.g., $0.03 per installation for OneAudience) (7). Finally, the brokerage platform monetizes users' Facebook data by sharing it with a marketing company (e.g., Nielsen, which offers political and business marketing) (8).

Table 7: App Categories

Categories | # of apps | Proportion
Game | 5556 | 28%
Entertainment | 1296 | 7%
Food and Drink | 1160 | 6%
Books | 827 | 4%
Business | 827 | 4%

Figure 6: XLDH ecosystem

5.2 XLDH Libraries in the Wild

Prevalence of XLDH. Our study reveals that XLDH activities are indeed trending among real-world apps. Altogether, we detected 42 distinct XLDH libraries integrated in more than 19K apps and targeting 16 SDKs. Apps including XLDH libraries, as discovered in our research, are found in 33 categories on Google Play. As shown in Table 7, over 35% of the apps are in the game and entertainment categories. Those apps have been downloaded more than 9 billion times in total on Google Play. Among all affected apps, some are highly popular, with more than 100 million downloads (Table 8).

Table 8 lists the top-10 XLDH libraries based on the number of apps integrating them, along with the data harvested (we release the full list of XLDH libraries online [39]). We found that a few XLDH libraries dominate the XLDH ecosystem. In particular, com.yandex.metrica is the most popular XLDH library and appears in 40% of the affected apps. com.yandex.metrica is the SDK provided by Yandex, a Russian Internet corporation, for its traffic analytics service. It leverages reflection to fetch the Google advertising ID and Android device ID from the Google Play services SDK. Associating the Google advertising ID with the Android device ID is privacy-sensitive, since the combination can be used to identify a specific Android user. However, com.yandex.metrica does not declare this activity in its privacy policy, which violates the ToS of the Google Play services SDK [13]. Similar behavior is also found in com.appsgeyser, which has affected more than 4 thousand apps with 15 million downloads.

Table 8: Top-10 XLDH libraries (integrated in the most apps)

XLDH library          # of apps / downloads   Harvested data
com.yandex.metrica    8014 / 2B+              Google Advertising ID, Android ID
com.inmobi            4283 / 4B+              Google Activity Recognition
com.appsgeyser        4202 / 15M+             Google Advertising ID, Android ID, IMEI, MAC address
com.oneaudience       1738 / 100M+            Facebook ID, name, gender, email, link; Twitter user data
cn.sharesdk           815 / 191M+             Bytedance ID, name
com.umeng.socialize   495 / 175M+             AccessToken and user data (ID, name, link, photo) of Facebook, Twitter, Dropbox, Kakao, Yixin, WeChat, QQ, Sina, Alipay, Laiwang, VK, Line, LinkedIn
com.revmob            340 / 36M+              Facebook AccessToken
ru.mail               299 / 100M+             Google Advertising ID, MAC address, Android ID, IMEI
com.ad4screen         245 / 183M+             Facebook app ID, AccessToken
com.devtodev          231 / 318M+             Facebook user gender, birthday

Historical versions of XLDH libraries. To understand the evolution of XLDH libraries' behavior, we collected their old versions and then investigated the change of their malicious functionality. Specifically, we gathered historical versions of XLDH libraries from the library websites, Maven repositories [27], and GitHub [20]. In this way, we found 495 versions, released from October 31, 2011 to February 12, 2020, for all the 42 XLDH libraries. After that, for each XLDH library, we monitored the code changes related to malicious cross-library data harvesting by checking its fingerprints (e.g., the class names used with the reflection calls forName and getMethod) across different library versions.

We selected 7 XLDH libraries (com.ad4screen, com.onradar, ru.wapstart, io.radar, com.devtodev, com.yandex.metrica, and com.inmobi), each of which had at least three versions recorded in our dataset, to look at their trends individually. Figure 7 illustrates the maliciousness of XLDH libraries across different versions. We observe that libraries tend to have XLDH code in their newer versions. Among them, com.yandex.metrica and ru.wapstart began to release XLDH versions in late 2014. Interestingly, we found that after October 2019, the versions of io.radar and com.devtodev removed the XLDH function that steals users' Facebook data. We believe that this is at least in part thanks to our report to Facebook, Twitter, and Google Play (in October 2019), which then warned the vendors of the offending libraries. Also, in 2018, com.inmobi released a version without XLDH and stopped collecting Google activity recognition data. This could result from an attempt to comply with GDPR.
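The fingerprint check across library versions described above can be sketched as follows. This is an illustrative approximation rather than the tooling used in the study: the two regular expressions, the sample version records, and the function names are assumptions, and real fingerprints would be derived per library from its decompiled code.

```python
import re

# Hypothetical fingerprints: reflection calls that name a class or method
# belonging to another vendor's SDK.
FINGERPRINTS = [
    re.compile(r'Class\.forName\(\s*"com\.facebook\.AccessToken"\s*\)'),
    re.compile(r'getMethod\(\s*"getCurrentAccessToken"'),
]

def has_xldh_fingerprint(source):
    """True if any fingerprint occurs in the decompiled source text."""
    return any(p.search(source) for p in FINGERPRINTS)

def xldh_timeline(versions):
    """versions: iterable of (release_date, version, decompiled_source).
    Returns (release_date, version, flagged) tuples in release order;
    dates are ISO strings, so lexical order equals chronological order."""
    return [(d, v, has_xldh_fingerprint(src)) for d, v, src in sorted(versions)]

# Hypothetical version history of one library.
versions = [
    ("2019-11-05", "3.0", 'void init() { /* no cross-library access */ }'),
    ("2018-06-01", "2.0", 'Class c = Class.forName("com.facebook.AccessToken");'),
    ("2017-01-15", "1.0", 'void init() { }'),
]

timeline = xldh_timeline(versions)
for date, version, flagged in timeline:
    print(date, version, "XLDH" if flagged else "clean")
```

In this toy history the fingerprint appears in version 2.0 and is gone again in version 3.0, which mirrors the kind of per-version trend plotted in Figure 7.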

Figure 7: Distribution of XLDH versions

To understand the presence of XLDH libraries in the wild, we performed a longitudinal study of the Google Play apps with the XLDH libraries com.oneaudience, com.devtodev, and io.radar from January 2015 to December 2019. All of these libraries stealthily access app users' Facebook data (e.g., gender, birthday, visited places). Specifically, we started from the 2076 apps (out of the 1.3 million set) that were found to have integrated the above libraries, and fetched the historical versions of these apps from APKPure [41] to evaluate the presence of XLDH libraries in each version. In this way, we were able to collect 936 apps with 5732 versions. Among them, 1976 of the versions were affected.

Figure 8 illustrates how the number of newly appearing apps with the XLDH libraries evolves over time, compared with the number of apps with the libraries removed. We can observe that a large number of the affected apps came into sight from July 2017 to December 2019, and the growth started to slow down in 2019 (in part thanks to our report to Facebook, Google Play, etc. in October 2019). Interestingly, for each library, the trend of disappeared apps is almost identical to that of newly appearing apps, with a delay of about half a year. For example, 165 new com.oneaudience apps were published during the second half of 2018, while 166 apps disappeared half a year later. Comparing these two app sets, we found that 159 apps overlapped, all generated by an app builder, appsgeyser.com; these apps share similar package names, which we released online [39]. This indicates that the adversaries were leveraging appsgeyser.com to quickly release and then remove batches of com.oneaudience apps periodically, which is likely a strategy for gathering data from different users. As further evidence, we observed that since March 2017, AppsGeyser has acknowledged in its privacy policy that it would include OneAudience in app generation [7]. Similarly, the technique is adopted by a game app provider, duksel.com, to integrate com.devtodev.
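The comparison between newly appearing and disappeared apps reduces to set differences between consecutive snapshots. The sketch below illustrates this with hypothetical half-year snapshots and package names; it is not the measurement pipeline itself, and the function name churn is an assumption.

```python
def churn(snapshots):
    """snapshots: ordered list of (period, set of package names that include
    the library). Returns (period, appeared, disappeared) per period,
    relative to the previous snapshot."""
    result = []
    previous = set()
    for period, apps in snapshots:
        result.append((period, apps - previous, previous - apps))
        previous = apps
    return result

# Hypothetical half-year snapshots for one XLDH library.
snapshots = [
    ("2018-H1", {"com.example.a"}),
    ("2018-H2", {"com.example.a", "com.example.b", "com.example.c"}),
    ("2019-H1", {"com.example.a", "com.example.d"}),
]

changes = churn(snapshots)
appeared_2018_h2 = changes[1][1]
disappeared_2019_h1 = changes[2][2]

# Apps that appeared in one period and vanished in the next one.
short_lived = appeared_2018_h2 & disappeared_2019_h1
```

A large overlap between the two sets, as with the 159 of 165 apps reported above, points to batches of apps being published and withdrawn together rather than to independent developer decisions.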

5.3 Dissecting Infection Operations

Most targeted SDKs. Figure 9 shows the victim SDKs and the number of XLDH libraries (among the top 20 XLDH libraries by the number of apps integrating them) that attack them. The Google ads service is the most commonly affected SDK, followed by Facebook login and Twitter login. Among the victim SDKs, 7 are OSNs, 2 are advertising and tracking platforms, 6 are instant messaging services, and 1 is a cloud service.

Figure 8: Number of newly appearing and disappeared XLDH libraries

Figure 9: XLDH flows, where XLDH libraries (left) fetch data from victim SDKs (right)

Given the list of victim SDKs, we observe that around half of them are in the category of online social network (OSN). This suggests that high-profile OSN platforms present an invaluable source of private user data to the XLDH adversaries. In our study, we observe the adversaries stealthily collecting Facebook, WeChat, and LinkedIn tokens, which can be used to access OSN-specific, semantically rich data. For example, we observe com.oneaudience and com.mobiburn stealthily accessing the Facebook token and further leveraging that token to fetch semantically rich Facebook data, including Facebook ID, gender, email, page likes, and followed groups. In particular, Facebook data such as page likes, the social, political, and health groups that users follow, and their timelines can be used to create the users' psychographic profiles, as suggested by the recent Facebook political scandal [85]. Similarly, data from Twitter, LinkedIn, and Amazon, including a user's tweets, page likes, education background, followed celebrities, and purchase history, can also be exploited by the adversaries to profile the user's personality, values, opinions, attitudes, interests, and lifestyle. The risk is especially serious since we observed that data harvested through com.oneaudience have indeed been shared with a political and business advertising company, Nielsen.

Evasiveness of XLDH libraries. In our study, all affected apps were published on Google Play, which means that the XLDH libraries have bypassed the app vetting of Google Play and of the victim SDK vendors (e.g., Facebook and Twitter, where an app review process exists). The result suggests that XLDH is a new type of threat not well understood before. To further assess whether existing techniques can detect XLDH libraries, we leveraged VirusTotal [38], which aggregates more than 70 antivirus products, to scan all the XLDH libraries we found. Interestingly, no single product in VirusTotal detected any of them; in contrast, VirusTotal was known to be capable of detecting the harmful libraries studied in prior works [50, 60, 64, 75]. More than half of the XLDH libraries we found abused Java reflection to call the target SDK (see Section 2), leveraging a generic Java function to implicitly call the target SDK API. Further, unlike regular Java code, which needs to import the target class (e.g., import com.facebook.AccessToken [10]), the malicious library does not have to explicitly import the target class. This makes the cross-library intention more evasive. Interestingly, in our study, we also observe that XLDH libraries strategically hide the data exfiltration channel to evade detection, as elaborated below.
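The difference between an explicit import and a reflective reference can be illustrated with a small scanner over decompiled Java source. This is a hedged sketch, not XFinder: the regular expressions, the function cross_library_reflection, and the sample package com.example.lib are assumptions, and the scanner only handles class names that appear as literal strings.

```python
import re

IMPORT_RE = re.compile(r'^\s*import\s+([\w.]+)\s*;', re.MULTILINE)
REFLECT_RE = re.compile(r'Class\.forName\(\s*"([\w.$]+)"\s*\)')

def cross_library_reflection(source, own_package, target_prefixes):
    """Return classes of targeted SDKs that are loaded via Class.forName
    but never imported explicitly by the library's own code."""
    imported = set(IMPORT_RE.findall(source))
    hits = []
    for cls in REFLECT_RE.findall(source):
        foreign = not cls.startswith(own_package + ".")
        targeted = any(cls.startswith(p + ".") for p in target_prefixes)
        if foreign and targeted and cls not in imported:
            hits.append(cls)
    return hits

# Hypothetical decompiled library code that reaches into another SDK.
decompiled = '''
package com.example.lib;
import java.lang.reflect.Method;

class Collector {
    Object token() throws Exception {
        Class<?> c = Class.forName("com.facebook.AccessToken");
        Method m = c.getMethod("getCurrentAccessToken");
        return m.invoke(null);
    }
}
'''

found = cross_library_reflection(decompiled, "com.example.lib", ["com.facebook"])
print(found)
```

An import-based check would report nothing for this snippet, since the only import is java.lang.reflect.Method. The sketch also shows the limit of string matching: a class name assembled at runtime would not be found, which is why resolving reflection-call targets remains a program analysis problem.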

Hidden data exfiltration channels. We manually analyzed the decompiled code of XLDH libraries and observed some other techniques they use that make the data-harvesting behaviors harder to analyze.

• Dynamic exfiltration endpoints. In some XLDH libraries, we observed that the adversaries remotely control the data exfiltration process: the endpoints that receive the data and the time of exfiltration are configured dynamically. More specifically, at runtime the XLDH library (e.g., com.mobiburn) fetches a remote configuration file, which specifies the endpoint (a URL) that data should be exfiltrated to and when to do so. Such a treatment helps the XLDH library evade domain-blocklist-based blocking once its server endpoint is on a blocklist and thereby blocked by local firewalls, ISPs, etc.; not hard-coding the malicious endpoint can also render static vetting less effective. Furthermore, a dynamically controlled exfiltration schedule gives the adversaries better control to evade detection, e.g., no exfiltration during vetting.

• Hiding data exfiltration in crash reports. Another interesting observation is that XLDH libraries put the exfiltrated data in crash reports. For instance, the XLDH library com.kongregate appends the harvested Facebook session token to its crash report and sends it out to the Internet.

• Data encryption. We observed that XLDH libraries (e.g., com.mobiburn and com.umeng) encrypt the data before sending it to the Internet; in particular, com.mobiburn uses RSA and com.umeng uses AES. This can render network-based privacy detection less effective, e.g., [73], which used mitmproxy to decrypt HTTPS traffic for data leakage inspection.
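To illustrate why dynamic endpoints weaken static vetting, the sketch below compares the hosts visible in a library's code with the hosts that only appear in a remotely fetched configuration. All URLs, the configuration fields endpoint and send_after_hours, and the function names are hypothetical; the configuration format of any real XLDH library is not documented here.

```python
import json
import re
from urllib.parse import urlparse

URL_RE = re.compile(r'https?://[^\s"\']+')

def hardcoded_hosts(source):
    """Hosts a static scan can extract from string literals in the code."""
    return {urlparse(u).hostname for u in URL_RE.findall(source)}

def runtime_hosts(config_json):
    """Hosts that become known only after the remote configuration is fetched."""
    config = json.loads(config_json)
    return {urlparse(config["endpoint"]).hostname}

# Hypothetical library code: only the configuration URL is hard-coded.
library_code = 'String CONFIG_URL = "https://config.example.com/v1/settings";'

# Hypothetical configuration returned by the server at runtime.
remote_config = '{"endpoint": "https://collector.example.net/upload", "send_after_hours": 48}'

blocklist = {"collector.example.net"}

static_hits = hardcoded_hosts(library_code) & blocklist
dynamic_hits = runtime_hosts(remote_config) & blocklist
```

Here the static scan finds no blocklisted host, while the exfiltration endpoint shows up only in the runtime configuration, which the operator can change at any time. The delay field stands in for the scheduling control that lets a library stay silent during vetting.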

Backend server endpoints. Our study uncovers a series of backend servers of the XLDH libraries (see Table 10 in the Appendix). We used VirusTotal to scan all of the backend server URLs; however, none of them was flagged as malicious. We observe that while an XLDH library typically exfiltrates data to its own domain, which might be blocked by domain blocklists [31, 36, 38], XLDH developers also leverage reputable domains to receive data. Specifically, the XLDH library com.kongregate appends the data to its crash reports, and thus the data was sent to Google's Crashlytics, a platform that helps app developers collect crash information and analyze stability issues [12]. Similarly, com.adience sent its data to a free web hosting service, frwrd.me.

5.4 XLDH Library Distribution Channels

Behind the popularity of XLDH libraries are the promotion mechanisms that XLDH vendors use to attract app vendors. To understand their promotion and distribution channels, we searched the library websites and news reports, analyzed the code and harvested data of XLDH libraries, and summarize their distribution channels below.

• Pre-installed libraries. As mentioned in Section 5.2, we observe that XLDH libraries were widely pre-installed in apps generated by free app building services (e.g., appsgeyser [6], duksel [14]). For instance, 18% of the affected apps of com.devtodev were generated by duksel [14], whose apps show a similar package name pattern, “comduksel(Ww)free”.

• Embedding in other libraries/services. Another important channel to distribute malicious libraries in real-world apps is to be integrated by other popular libraries or services. As mentioned earlier, AppsGeyser, the vendor of com.appsgeyser, is the biggest free Android app builder on the market [6], and it integrates XLDH libraries into the Android apps it creates. Interestingly, we found that com.appsgeyser appears together with com.oneaudience and com.yandex 34% and 29% of the time, respectively. Hence, the users of apps generated by AppsGeyser face the risk of exposing Facebook and Twitter user data, Google Advertising IDs, and device IDs.

As another example, we found that the XLDH library com.mobiburn included another XLDH library, com.oneaudience. So any app that integrates the former library could silently include the latter, increasing the chance of XLDH-library distribution. This was observed in 8 older versions of com.mobiburn, released from April 11, 2018 to June 17, 2019 (v153 to v190). This might come out of a collaboration between the XLDH developers.

• Offering app monetization. Another important promotion channel XLDH vendors take is to offer app vendors monetary incentives to integrate the libraries. For example, OneAudience offers app vendors 0.03 USD per app installation, and Mobiburn offers 0.015 USD per app installation.

Among the libraries offering app monetization, com.oneaudience and com.mobiburn do not provide any functionality to the apps except for data harvesting. Interestingly, the app vendors may not fully understand the risk such a library incurs in their apps. For example, OneAudience claims in its privacy policy (dated February 19, 2019 and accessed in our study in October 2019) that it collects the user's device ID, operating system type, device make and model, etc., which, albeit private, are commonly considered acceptable to collect if properly disclosed; however, behind the scenes it collects a full spectrum of Facebook and Twitter data through unauthorized use of the AccessToken.

• Offering appealing functionalities. Some libraries (com.sharesdk and com.umeng.socialize) offer app developers helpful functionalities while performing XLDH behind the scenes. The functionalities they offer include integration with social media (e.g., single sign-on, posting to Facebook/Twitter), analytics, crash reports, message pushing, in-app purchases, etc.
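The co-occurrence figures quoted in the list above (how often one library appears together with another) can be computed from a per-app library inventory. The following sketch uses a hypothetical five-app inventory and a hypothetical function name; it only demonstrates the ratio, not the actual dataset.

```python
def co_occurrence_rate(apps, base_lib, other_lib):
    """Fraction of apps containing base_lib that also contain other_lib."""
    with_base = [libs for libs in apps.values() if base_lib in libs]
    if not with_base:
        return 0.0
    return sum(other_lib in libs for libs in with_base) / len(with_base)

# Hypothetical inventory: app name -> set of library packages found in it.
apps = {
    "app1": {"com.appsgeyser", "com.oneaudience"},
    "app2": {"com.appsgeyser", "com.yandex.metrica"},
    "app3": {"com.appsgeyser"},
    "app4": {"com.oneaudience"},
    "app5": {"com.appsgeyser", "com.oneaudience", "com.yandex.metrica"},
}

rate = co_occurrence_rate(apps, "com.appsgeyser", "com.oneaudience")
```

In this toy inventory, two of the four apps that contain com.appsgeyser also contain com.oneaudience, so the rate is 0.5. Note that the measure is directional: it is conditioned on the presence of the first library, not of the second.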

6 Discussion

Impacts to privacy regulations. Our study complements the recent understanding of privacy compliance. A thread of recent works [44, 87] assessed whether an app's data practice (e.g., data collection and sharing with third parties) is consistent with what is disclosed in its privacy policy (a.k.a. flow-to-policy consistency analysis). These studies generally suggested that app vendors are at fault for not properly disclosing data sharing with third parties (e.g., advertisers, analytics providers), and correspondingly, app vendors have been charged by regulatory agencies (e.g., the FTC) [24]. To complement these studies, our study shows that app vendors can also be victims, since an in-app data flow to a malicious third party through XLDH can be opaque to app vendors. In this regard, our work has serious implications for privacy compliance regulations; also, XFinder can be used by both regulators and app vendors to inspect in-app data practices involving third parties.

Responsible disclosure. We have reported the affected apps and XLDH libraries we found to the app vendors and app store (e.g., Facebook, Twitter, Google Play store), and have helped them understand the threats since October 2019. Google has removed the affected apps from Google Play or asked the app owners to remove those libraries. Facebook and Twitter have taken legal actions against the XLDH library providers.

Future work. As discussed in Section 4, automatic and sound detection of XLDH libraries in the wild is limited, at least in part, by today's techniques for document analysis (ToS, privacy policies) and program analysis (e.g., tracking complicated data flows, resolving reflection-call targets). Note that, as we observed, reflection-based calls are widely used in XLDH activities, likely because they are stealthier and more difficult to detect than conventional calls (developers may also use reflection calls with less or no malicious motivation, such as maintaining backward compatibility when the target class may not exist [62]). Hence, a better capability to resolve reflection calls in the future may contribute to more sound detection of XLDH and a more in-depth understanding of XLDH activities. Further, although we focused on Android in this study, XLDH is completely feasible on iOS, which we plan to study in future work. We indeed made a preliminary attempt by looking at the iOS version of an XLDH library, OneAudience, and found that it has the same XLDH behavior as on Android (harvesting data from the Facebook and Twitter SDKs). We communicated XLDH risks to Apple in late 2019; Apple worked with us to analyze the iOS counterparts of the XLDH libraries we found on Android and asked affected apps to remove the malicious libraries.

7 Related Work

Study on malicious mobile SDKs. Prior studies have extensively explored the risks of malicious mobile SDKs. In particular, prior research showed that malicious SDKs could collect users' sensitive data from the host apps and mobile devices, leading to serious privacy leakage due to their wide integration and adoption by popular mobile apps [53, 72, 78, 79, 81]. The problem has been studied through large-scale measurements using both static [69] and dynamic [52, 72, 73] program analysis. The sensitive data studied include on-device data (e.g., IMEI, phone number, GPS coordinates) as well as user profiles (e.g., age, gender, preferences) from app servers [54, 69]. To mitigate the problems, prior research proposed different fine-grained mechanisms to isolate third-party SDKs [59, 76, 82, 83]. Unfortunately, these mitigations are hard for the current ecosystem to fully adopt due to different deployment limitations (e.g., requiring app code instrumentation [76, 82] or human-crafted policies [83]). Recent studies also examined malicious SDKs involved in ad-fraud schemes using techniques like click injection and click flooding [65]. Different from prior research, our study sheds light on a new type of privacy harvesting channel (i.e., cross-library data harvesting), which differs significantly from prior studies in terms of the diversity of the private data and the complexity of determining their data sharing policies (specific to individual SDKs). In addition, our measurement covers 1.3M apps for a comprehensive understanding of the problem in the wild, the largest scale compared to all previous research on privacy leakage.

Text analysis for mobile privacy. Our approach for identifying privacy non-compliance in cross-library invocations through their documentation (i.e., Terms of Service) follows a line of text analysis work on mobile apps that mixes NLP and machine learning techniques. Topics within this range include identifying sensitive data items [68, 69], their desired destinations [87], policy contradictions [45], as well as benign usage contexts [44, 51]. In terms of mobile privacy compliance, Whyper [70] is among the first to use NLP techniques to automatically reason about permission usage in mobile apps through text analysis of app descriptions. Later, a thread of recent works [44, 57, 80, 87, 88] provides a better understanding of privacy policies and their compliance with mobile apps' data practices. Specifically, Polisis [57] proposes neural-network-based classifiers to automatically annotate privacy policies with both high-level and fine-grained labels. MAPS [87] performs a large-scale measurement analysis to identify privacy leakage of mobile apps that is not disclosed by their privacy policies. PolicyLint [45] investigates the internal contradictions of a given privacy policy by identifying and analyzing the data collection and sharing statements at the sentence level. However, these works focus more on the privacy implications caused by app developers. Instead, our research looks into privacy compliance among different parties. This calls for a more in-depth analysis over a broader range of privacy-related documents (i.e., the Terms of Service of third-party libraries). The sensitive data considered in our research on cross-library invocations are obtained by parsing ToS statements, rather than from a pre-defined list as in previous studies [69, 87]. In addition, identifying such privacy violations requires more fine-grained rules (Section 3.2), which is not addressed by previous research.

8 Conclusion

In this paper, we report the first systematic research on XLDH libraries, which target third-party SDKs to harvest private user data, based upon a suite of techniques that address the challenges in analyzing SDK ToSes to recover the semantics of data sharing policies and in evaluating apps to find cross-library interactions. Our study demonstrates the significant privacy and social impacts of this new threat. Our research further uncovers a series of unique characteristics of the XLDH libraries, such as their distribution channels and hidden data exfiltration channels. We discussed the limitations of our current tool and the future research that is needed for a more in-depth understanding of XLDH activities.


Acknowledgement

Jice Wang, Jinwei Dong, and Yuqing Zhang are supported by National Natural Science Foundation of China (U1836210) and a CSC scholarship. The authors from Indiana University are supported in part by Indiana University FRSP-SF and NSF CNS-1618493, 1801432, and 1838083.

References

[1] Packet Capture. https://play.google.com/store/apps/details?id=app.greyshirts.sslcapture

[2] Aliyun - resolve conflicted utdids. https://help.aliyun.com/knowledge_detail/59152.html

[3] Android app storage. https://developer.android.com/training/data-storage

[4] Android application sandbox. https://source.android.com/security/app-sandbox

[5] Android social library statistics and market share. https://www.appbrain.com/stats/libraries/social-libs

[6] Appsgeyser. https://appsgeyser.com

[7] Appsgeyser's privacy policy. http://appsgeyser.com/privacy/app

[8] Bag-of-words. https://en.wikipedia.org/wiki/Bag-of-words_model

[9] Best practice of advertising ID. https://developer.android.com/training/articles/user-data-ids

[10] com.facebook.AccessToken - Facebook. https://developers.facebook.com/docs/reference/android/current/class/AccessToken/?locale=en_US

[11] Constituency parsing. https://web.stanford.edu/~jurafsky/slp3/13.pdf

[12] Crashlytics - Google. https://en.wikipedia.org/wiki/Crashlytics

[13] Developer policy center - Google. https://play.google.com/intl/en_us/about/developer-content-policy

[14] Duksel: we create amazing games. https://duksel.com

[15] Facebook app review – app development. https://developers.facebook.com/docs/apps/review

[16] Facebook terms of service. https://www.facebook.com/terms.php

[17] Facebook–Cambridge Analytica data scandal - Wikipedia. https://en.wikipedia.org/wiki/Facebook%E2%80%93Cambridge_Analytica_data_scandal

[18] Full profile fields - LinkedIn. https://docs.microsoft.com/zh-cn/linkedin/shared/references/v2/profile/full-profile?context=linkedin/consumer/context

[19] GDPR. https://gdpr-info.eu

[20] GitHub. https://github.com

[21] Google Gson SDK. https://github.com/google/gson

[22] Google Play top apps. https://play.google.com/store/apps/top

[23] googleplay-api. https://github.com/NoMore201/googleplay-api

[24] In the Matter of Goldenshores Technologies, LLC, and Erik M. Geidl. https://www.ftc.gov/enforcement/cases-proceedings/132-3087/goldenshores-technologies-llc-erik-m-geidl-matter

[25] Inside-outside-beginning. https://en.wikipedia.org/wiki/Inside--outside--beginning_(tagging)

[26] Java reflection - Oracle. https://www.oracle.com/technical-resources/articles/java/javareflection.html

[27] Maven repository. https://mvnrepository.com

[28] merge-phrase. https://spacy.io/api/pipeline-functions

[29] Nielsen and Bridge Marketing. https://www.nielsen.com/us/en/press-releases/2016/nielsen-and-bridge-marketing-collaborate-to-bring-exclusive-oneaudience-data-to-nielsen-marketing-cloud

[30] Part-of-speech. http://web.stanford.edu/class/cs124/lec/postagging.pdf

[31] PhishTank. https://phishtank.org/user.php?username=cleanmx

[32] pos-attributes. https://spacy.io/api/dependencyparser

[33] Reverse domain name notation - Java. https://en.wikipedia.org/wiki/Reverse_domain_name_notation

[34] Single sign-on (SSO) - Wikipedia. https://en.wikipedia.org/wiki/Single_sign-on

[35] Skip-gram. https://en.wikipedia.org/wiki/N-gram#Skip-gram

[36] Spamhaus. https://www.spamhaus.org/lookup

[37] standford-doc. https://nlp.stanford.edu/software/crf-faq.html

[38] VirusTotal. https://www.virustotal.com/en/user/cleanmx

[39] XLDH Project Dataset. https://sites.google.com/view/roommatetheft

[40] Xposed. https://repo.xposed.info

[41] Apkpure: download APK free online. https://apkpure.com, 2020.

[42] Open Handset Alliance. Android open source project. https://source.android.com

[43] Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein, and Yves Le Traon. Androzoo: Collecting millions of android apps for the research community. In Proceedings of the 13th International Conference on Mining Software Repositories, MSR '16, pages 468–471, New York, NY, USA, 2016. ACM.

[44] Benjamin Andow, Samin Yaseer Mahmud, et al. Actions speak louder than words: Entity-sensitive privacy policy and data flow analysis with policheck.

[45] Benjamin Andow, Samin Yaseer Mahmud, et al. Policylint: investigating internal privacy policy contradictions on google play. In 28th USENIX Security Symposium (USENIX Security 19), pages 585–602, 2019.

[46] Benjamin Andow, Samin Yaseer Mahmud, et al. Actions speak louder than words: Entity-sensitive privacy policy and data flow analysis with policheck. In 29th USENIX Security Symposium (USENIX Security 20), pages 985–1002, 2020.

[47] Steven Arzt, Rasthofer, et al. Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps.

[48] CalOPPA. California online privacy protection act (caloppa). https://leginfo.legislature.ca.gov, 2020.

[49] Danqi Chen and Christopher D. Manning. A fast and accurate dependency parser using neural networks.

[50] Kai Chen, Xueqiang Wang, Yi Chen, Peng Wang, et al. Following devil's footprints: Cross-platform analysis of potentially harmful libraries on android and ios. In 2016 IEEE Symposium on Security and Privacy (SP), pages 357–376. IEEE, 2016.

[51] Yi Chen, Mingming Zha, et al. Demystifying hidden privacy settings in mobile apps. In 2019 IEEE Symposium on Security and Privacy (SP), pages 570–586. IEEE, 2019.

[52] Jonathan Crussell, Ryan Stevens, et al. Madfraud: Investigating ad fraud in android applications. In Proceedings of the 12th annual international conference on Mobile systems, applications, and services, pages 123–134, 2014.

[53] Soteris Demetriou, Whitney Merrill, et al. Free for all! assessing user data exposure to advertising libraries on android. In NDSS, 2016.

[54] Mojtaba Eskandari, Bruno Kessler, et al. Analyzing remote server locations for personal data transfers in mobile apps. Proceedings on Privacy Enhancing Technologies, 2017(1):118–131, 2017.

[55] Facebook. Facebook platform policy. https://developers.facebook.com/policy?locale=en_US


[56] Matt Gardner, Joel Grus, Mark Neumann, et al. Allennlp: A deep semantic natural language processing platform. 2017.

[57] Hamza Harkous, Kassem Fawaz, et al. Polisis: Automated analysis and presentation of privacy policies using deep learning. In 27th USENIX Security Symposium (USENIX Security 18), pages 531–548, 2018.

[58] Matthew Honnibal and Ines Montani. spacy 2: Natural language understanding with bloom embeddings.

[59] Jie Huang, Oliver Schranz, et al. The art of app compartmentalization: Compiler-based library privilege separation on stock android. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1037–1049, 2017.

[60] Bum Jun Kwon, Mondal, et al. The dropper effect: Insights into malware distribution with downloader graph analytics. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1118–1129, 2015.

[61] Guillaume Lample, Miguel Ballesteros, et al. Neural architectures for named entity recognition.

[62] Li Li, Tegawendé F. Bissyandé, et al. Droidra: Taming reflection to support whole-program analysis of android apps. 2016.

[63] Xiaojing Liao, Kan Yuan, et al. Acing the ioc game: Toward automatic discovery and analysis of open-source cyber threat intelligence. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 755–766, 2016.

[64] Xiaojing Liao, Kan Yuan, XiaoFeng Wang, et al. Seeking nonsense, looking for trouble: Efficient promotional-infection detection through semantic inconsistency search. In 2016 IEEE Symposium on Security and Privacy (SP), pages 707–723. IEEE, 2016.

[65] Bin Liu, Suman Nath, et al. DECAF: Detecting and characterizing ad fraud in mobile apps. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pages 57–70, 2014.

[66] Christopher D. Manning, Mihai Surdeanu, et al. The stanford corenlp natural language processing toolkit.

[67] Tomas Mikolov, Ilya Sutskever, Kai Chen, et al. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[68] Yuhong Nan, Min Yang, Zhemin Yang, et al. Uipicker: User-input privacy identification in mobile applications. In 24th USENIX Security Symposium (USENIX Security 15), pages 993–1008, 2015.

[69] Yuhong Nan, Zhemin Yang, et al. Finding clues for your secrets: Semantics-driven, learning-based privacy discovery in mobile apps. In NDSS, 2018.

[70] Rahul Pandita, Xusheng Xiao, et al. WHYPER: Towards automating risk assessment of mobile applications. In Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), pages 527–542, 2013.

[71] Peng Qi, Yuhao Zhang, Yuhui Zhang, et al. Stanza: A python natural language processing toolkit for many human languages.

[72] Abbas Razaghpanah, Rishab Nithyanand, et al. Apps, trackers, privacy, and regulators: A global study of the mobile tracking ecosystem. 2018.

[73] Jingjing Ren, Martina Lindorfer, et al. Bug fixes, improvements, and privacy leaks: A longitudinal study of pii leaks across android app versions. 2018.

[74] Jingjing Ren, Ashwin Rao, et al. Recon: Revealing and controlling pii leaks in mobile network traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374, 2016.

[75] Sankardas Roy, DeLoach, et al. Experimental study with real-world data for android app security analysis using machine learning. In Proceedings of the 31st Annual Computer Security Applications Conference, pages 81–90, 2015.

[76] Jaebaek Seo, Daehyeok Kim, et al. Flexdroid: Enforcing in-app privilege separation in android. In NDSS, 2016.

[77] Rocky Slavin, Xiaoyin Wang, et al. Toward a framework for detecting privacy policy violations in android application code. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, pages 25–36. ACM, 2016.

[78] Sooel Son, Daehyeok Kim, and Vitaly Shmatikov. What mobile ads know about mobile users. In NDSS, 2016.

[79] Ryan Stevens, Clint Gibler, et al. Investigating user privacy in android ad libraries. In Workshop on Mobile Security Technologies (MoST), volume 10. Citeseer, 2012.

[80] Peter Story, Sebastian Zimmeck, Ravichander, et al. Natural language processing for mobile app privacy compliance. In AAAI Spring Symposium on Privacy-Enhancing Artificial Intelligence and Language Technologies, 2019.

[81] Kurt Thomas, Elie Bursztein, et al. Ad injection at scale: Assessing deceptive advertisement modifications. In 2015 IEEE Symposium on Security and Privacy, pages 151–167. IEEE, 2015.

[82] Eran Tromer and Roei Schuster. Droiddisintegrator: Intra-application information flow control in android apps. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, pages 401–412, 2016.

[83] Nikos Vasilakis, Ben Karel, et al. Breakapp: Automated, flexible application compartmentalization. In NDSS, 2018.

[84] Jason W. Wei and Kai Zou. Eda: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196, 2019.

[85] Wikipedia. Facebook–Cambridge Analytica data scandal. https://en.wikipedia.org/wiki/Facebook--Cambridge_Analytica_data_scandal

[86] Shomir Wilson, Florian Schaub, et al. The creation and analysis of a website privacy policy corpus.

[87] Sebastian Zimmeck, Peter Story, et al. Maps: Scaling privacy compliance analysis to a million apps. Proceedings on Privacy Enhancing Technologies, 2019(3):66–86, 2019.

[88] Sebastian Zimmeck, Ziqi Wang, Lieyong Zou, et al. Automated analysis of privacy requirements for mobile apps. In 2016 AAAI Fall Symposium Series, 2016.

[89] Chaoshun Zuo et al. Why does your data leak? uncovering the data leakage in cloud from mobile apps. In 2019 IEEE Symposium on Security and Privacy (SP), pages 1296–1310. IEEE, 2019.

9 Appendix

Table 9: Examples of subtree patterns (with the full list released online [39])

Sentence: store the confidential information without our prior written consent
Pattern:  ((datadobj(conditionpobj)conditionprep)actionROOT)

Sentence: collect such information when the applicable End User has consented to such activities
Pattern:  ((datadobj(othernsubjhasaux(conditionpobj)conditionprep)conditionadvcl)actionROOT)

Sentence: obtain consent before you use our service data
Pattern:  ((conditiondobj(datadobj)actionadvcl)obtainROOT)

Table 10: Examples of exfiltration endpoints (with the full list released online [39])

XLDH library      Endpoint
com.oneaudience   https://api.oneaudience.com/api/devices
com.revmob        https://android.revmob.com
io.radar          https://api.radar.io
com.adience       http://frwrd.me/sdkserver
com.buongiorno    https://auth-api.newton.pm















