
What Happens After You Are Pwnd: Understanding the Use of Leaked Webmail Credentials in the Wild

Jeremiah Onaolapo, Enrico Mariconti, and Gianluca Stringhini

University College London

{j.onaolapo, e.mariconti, g.stringhini}@cs.ucl.ac.uk

ABSTRACT

Cybercriminals steal access credentials to webmail accounts and then misuse them for their own profit, release them publicly, or sell them on the underground market. Despite the importance of this problem, the research community still lacks a comprehensive understanding of what these stolen accounts are used for. In this paper, we aim to shed light on the modus operandi of miscreants accessing stolen Gmail accounts. We developed an infrastructure that is able to monitor the activity performed by users on Gmail accounts, and leaked credentials to 100 accounts under our control through various means, such as having information-stealing malware capture them, leaking them on public paste sites, and posting them on underground forums. We then monitored the activity recorded on these accounts over a period of 7 months. Our observations allowed us to devise a taxonomy of malicious activity performed on stolen Gmail accounts, to identify differences in the behavior of cybercriminals that get access to stolen accounts through different means, and to identify systematic attempts to evade the protection systems in place at Gmail and blend in with the legitimate user activity. This paper gives the research community a better understanding of a so far understudied, yet critical aspect of the cybercrime economy.

Categories and Subject Descriptors

J.4 [Computer Applications]: Social and Behavioral Sciences; K.6.5 [Security and Protection]: Unauthorized Access


IMC 2016, November 14-16, 2016, Santa Monica, CA, USA

© 2016 ACM. ISBN 978-1-4503-4526-2/16/11...$15.00

DOI: http://dx.doi.org/10.1145/2987443.2987475

Keywords

Cybercrime, Webmail, Underground Economy, Malware

1. INTRODUCTION

The wealth of information that users store in webmail accounts on services such as Gmail, Yahoo! Mail, or Outlook.com, as well as the possibility of misusing them for illicit activities, has attracted cybercriminals, who actively engage in compromising such accounts. Miscreants obtain the credentials to victims’ online accounts by performing phishing scams [17], by infecting users with information-stealing malware [29], or by compromising large password databases, leveraging the fact that people often use the same password across multiple services [16]. Such credentials can be used by the cybercriminal privately, or can be sold on the black market to other cybercriminals who wish to use the stolen accounts for profit. This ecosystem has become a very sophisticated market in which only vetted sellers are allowed to join [30].

Cybercriminals can use compromised accounts in multiple ways. First, they can use them to send spam [18]. This practice is particularly effective because of the established reputation of such accounts: the already-established contacts of the account are likely to trust its owner, and are therefore more likely to open the messages that they receive from her [20]. Similarly, the stolen account is likely to have a history of good behavior with the online service, and the malicious messages sent by it are therefore less likely to be detected as spam, especially if the recipients are within the same service (e.g., a Gmail account used to send spam to other Gmail accounts) [33]. Alternatively, cybercriminals can use the stolen accounts to collect sensitive information about the victim. Such information can include financial credentials (credit card numbers, bank account numbers), login information to other online services, and personal communications of the victim [13].

Despite the importance of stolen accounts for the underground economy, there is surprisingly little work on the topic. Bursztein et al. [13] studied the modus operandi of cybercriminals collecting Gmail account credentials through phishing scams. Their paper shows that criminals access these accounts to steal financial information from their victims, or use these accounts to send fraudulent emails. Since their work focused on only one of the possible ways in which criminals steal user login credentials, it leaves open the question of how well their observations generalize to credentials acquired through other means. Most importantly, [13] relies on proprietary information from Google, and therefore it is not possible for other researchers to replicate their results or build on top of their work.

Other researchers have not attempted to study the activity of criminals on compromised online accounts because it is usually difficult to monitor what happens to them without being a large online service. The rare exceptions are studies that look at information that is publicly observable, such as the messages posted on Twitter by compromised accounts [18, 19].

To close this gap, in this paper we present a system that is able to monitor the activity performed by attackers on Gmail accounts. To this end, we instrumented the accounts using Google Apps Script [1]; by doing so, we were able to monitor any time an email was opened, favorited, or sent, or a new draft was created. We also monitored the accesses that the accounts received, with particular attention to their system configuration and their origin. We call such accounts honey accounts.

We set up 100 honey accounts, each resembling the Gmail account of an employee of a fictitious company. To understand how criminals use these accounts after they get compromised, we leaked the credentials to such accounts on multiple outlets, modeling the different ways in which cybercriminals share and get access to such credentials. First, we leaked credentials on paste sites, such as pastebin [5]. Paste sites are commonly used by cybercriminals to post account credentials after data breaches [2]. We also leaked them to underground forums, which have been shown to be the place where cybercriminals gather to trade stolen commodities such as account credentials [30]. Finally, we logged in to our honey accounts on virtual machines that were previously infected with information-stealing malware. By doing this, the credentials are sent to the cybercriminal behind the malware’s command and control infrastructure, and are then used directly by her or placed on the black market for sale [29]. We know that there are other outlets that attackers use, for instance, phishing and data breaches, but we decided to focus on paste sites, underground forums, and malware in this paper. We worked in close collaboration with the Google anti-abuse team, to make sure that any unwanted activity by the compromised accounts would be promptly blocked. The accounts were configured to send any email to a mail server under our control, to prevent them from successfully delivering spam.

After leaking our credentials, we recorded any interaction with our honey accounts for a period of 7 months. Our analysis allowed us to draw a taxonomy of the different actions performed by criminals on stolen Gmail accounts, and provided us interesting insights on the keywords that criminals typically search for when looking for valuable information on these accounts. We also show that criminals who obtain access to stolen accounts through certain outlets appear more skilled than others, and make additional efforts to avoid detection from Gmail. For instance, criminals who steal account credentials via malware make more efforts to hide their identity, by connecting from the Tor network and disguising their browser user agent. Criminals who obtain access to stolen credentials through paste sites, on the other hand, tend to connect to the accounts from locations that are closer to the typical location used by the owner of the account, if this information is shared with them. At the lowest level of sophistication are criminals who browse free underground forums looking for free samples of stolen accounts: these individuals do not take significant measures to avoid detection, and are therefore easier to detect and block. Our findings complement what was reported by previous work in the case of manual account hijacking [13], and show that the modus operandi of miscreants varies considerably depending on how they obtain the credentials to stolen accounts.

In summary, this paper makes the following contributions:

• We developed a system to monitor the activity of Gmail accounts. We publicly release the source code of our system, to allow other researchers to deploy their own Gmail honey accounts and further the understanding that the security community has of malicious activity on online services. To the best of our knowledge, this is the first publicly available Gmail honeypot infrastructure.

• We deployed 100 honey accounts on Gmail, and leaked credentials through three different outlets: underground forums, public paste sites, and virtual machines infected with information-stealing malware.

• We provide detailed measurements of the activity logged by our honey accounts over a period of 7 months. We show that certain outlets on which credentials are leaked appear to be used by more skilled criminals, who act stealthily and actively attempt to evade detection systems.

2. BACKGROUND

Gmail accounts. In this paper we focus on Gmail accounts, with particular attention to the actions performed by cybercriminals once they obtain access to someone else’s account. We made this choice over other webmail platforms because Gmail allows users to set up scripts that augment the functionality of their accounts, and it was therefore the ideal platform for developing webmail-based honeypots. To ease the understanding of the rest of the paper, we briefly summarize the capabilities offered by webmail accounts in general, and by Gmail in particular.

In Gmail, after logging in, users are presented with a view of their Inbox. The inbox contains all the emails that the user received, and highlights the ones that have not been read yet by displaying them in boldface font. Users have the option to mark emails that are important to them and that need particular attention by starring them. Users are also given a search functionality, which allows them to find emails of interest by typing related keywords. They are also given the possibility to organize their email by placing related messages in folders, or assigning them descriptive labels. Such operations can be automated by creating rules that automatically process received emails. When writing emails, content is saved in a Drafts folder until the user decides to send it. Once this happens, sent emails can be found in a dedicated folder, and they can be searched similarly to what happens for received emails.

Threat model. Cybercriminals can get access to account credentials in many ways. First, they can perform social engineering-based scams, such as setting up phishing web pages that resemble the login pages of popular online services [17] or sending spearphishing emails pretending to be members of customer support teams at such online services [32]. As a second way of obtaining user credentials, cybercriminals can install malware on victim computers and configure it to report back any account credentials entered by the user to the command and control server of the botnet [29]. As a third way of obtaining access to user credentials, cybercriminals can exploit vulnerabilities in the databases used by online services to store them [6]. User credentials can also be obtained illegitimately through targeted online password guessing techniques [36], often aided by the problem of password reuse across various online services [16]. Finally, cybercriminals can steal user credentials and access tokens by running network sniffers [14] or mounting Man-in-the-Middle [11] attacks against victims.

After stealing account credentials, a cybercriminal can either use them privately for their own profit, release them publicly, or sell them on the underground market. Previous work studied the modus operandi of cybercriminals stealing user accounts through phishing and using them privately [13]. In this work, we study a broader threat model in which we mimic cybercriminals leaking credentials on paste sites [5] as well as miscreants advertising them for sale on underground forums [30]. In particular, previous research showed that cybercriminals often offer a small number of account credentials for free to test their “quality” [30]. We followed a similar approach, pretending to have more accounts for sale, but never following up on any further inquiries. In addition, we simulate infected victim machines in which malware steals the user’s credentials and sends them to the cybercriminal. We describe our setup and how we leaked account credentials on each outlet in detail in Section 3.2.

3. METHODOLOGY

Our overall goal was to gain a better understanding of malicious activity in compromised webmail accounts. To achieve this goal, we developed a system able to monitor accesses and activity on Gmail accounts. We set up accounts and leaked them through different outlets. In the following sections, we describe our system architecture and experiment setup in detail.

3.1 System overview

Our system comprises two components, namely, honey accounts and a monitoring infrastructure.

Honey accounts. Our honey accounts are webmail accounts instrumented with Google Apps Script to monitor activity in them. Google Apps Script is a cloud-based scripting language based on JavaScript, designed to augment the functionality of Gmail accounts and Google Drive documents, in addition to building web apps [4]. The scripts we embedded in the honey accounts send notifications to a dedicated webmail account under our control whenever an email is opened, sent, or “starred.” In addition, the scripts send us copies of all draft emails created in the honey accounts. We also added a “heartbeat message” function, to send us a message once a day from each honey account, to attest that the account was still functional and had not been blocked by Google. In each honey account, we hid the script in a Google Docs spreadsheet. We believe that this measure makes it unlikely for attackers to find and delete our scripts. To minimize abuse, we changed each honeypot account’s default send-from address to an email address pointing to a mailserver under our control. All emails sent from the honeypot accounts are delivered to the mailserver, which simply dumps the emails to disk and does not forward them to the intended destination.

Monitoring infrastructure. Google Apps Scripts are quite powerful, but they do not provide enough information in some cases. For example, they do not provide location information and IP addresses of accesses to webmail accounts. To track those accesses, we set up external scripts to drive a web browser, periodically log in to each honey account, and record information about visitors (cookie identifier, geolocation information, and times of accesses, among others). The scripts navigate to the visitor activity page in each honey account, and dump the pages to disk for offline parsing. By collecting information from the visitor activity pages, we obtain location and system configuration information of accesses, as provided by Google’s geolocation and system configuration fingerprinting system.
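As an illustration of how the daily heartbeat messages could be consumed on the receiving side, the following minimal sketch flags honey accounts whose latest heartbeat is overdue. It is not the authors' code: the function name, the account addresses, and the 36-hour tolerance are assumptions chosen for the example.

```python
from datetime import datetime, timedelta

def silent_accounts(last_heartbeat, now, max_gap=timedelta(hours=36)):
    """Return the accounts whose most recent heartbeat is older than max_gap.

    last_heartbeat maps an account address to the time of its latest
    heartbeat message. The 36-hour default is an assumed tolerance around
    the once-a-day schedule described in the text.
    """
    return sorted(
        account
        for account, seen in last_heartbeat.items()
        if now - seen > max_gap
    )

# Hypothetical data: two honey accounts and the last time each one reported.
now = datetime(2015, 7, 10, 12, 0)
last_heartbeat = {
    "honey01@example.com": datetime(2015, 7, 10, 3, 0),
    "honey02@example.com": datetime(2015, 7, 7, 3, 0),
}
print(silent_accounts(last_heartbeat, now))
```

An account that appears in the returned list may have been blocked or hijacked, and would need manual inspection.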

We believe that our honey account and monitoring framework opens up multiple possibilities for researchers who want to further study the behavior of attackers in webmail accounts. For this reason, we release the source code of our system (see footnote 1).

3.2 Experiment setup

As part of our experiments, we first set up a number of honey accounts on Gmail, and then leaked them through multiple outlets used by cybercriminals.

Honey account setup. We created 100 Gmail accounts and assigned them random combinations of popular first and last names, similar to what was done in [31]. Creating and setting up these accounts is a manual process. Google also rate-limits the creation of new accounts from the same IP address by presenting a phone verification page after a few accounts have been created. These factors imposed limits on the number of honey accounts we could set up in practice.

We populated the freshly-created accounts with emails from the public Enron email dataset [22]. This dataset contains the emails sent by the executives of the energy corporation Enron, and was publicly released as evidence for the bankruptcy trial of the company. This dataset is suitable for our purposes, since the emails that it contains are the typical emails exchanged by corporate users. To make the honey accounts believable and avoid raising suspicion from cybercriminals accessing them, we mapped distinct recipients in the Enron dataset to our fictional characters (i.e., the fictitious “owners” of the honey accounts), and replaced the original first names and last names in the dataset with our honey first names and last names. In addition, we changed all instances of “Enron” to a fictitious company name that we came up with.
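The name and company substitution described above can be sketched as follows. This is an assumed, simplified version rather than the released code: the names "John Doe" and "Mary Jones", the company "Example Corp", and the whole-word matching rule are all hypothetical choices made for illustration.

```python
import re

def personalize(text, name_map, company="Example Corp"):
    """Replace original names with persona names and rename the company.

    name_map maps an original full name to the honey persona's full name.
    Matching on word boundaries is an assumption; a real corpus would also
    need to handle email addresses, initials, and other spellings.
    """
    for original, persona in name_map.items():
        pattern = r"\b" + re.escape(original) + r"\b"
        # A function replacement avoids backslash interpretation in names.
        text = re.sub(pattern, lambda match, p=persona: p, text)
    return re.sub(r"\bEnron\b", lambda match: company, text)

sample = "John Doe asked Enron staff to call John Doe."
print(personalize(sample, {"John Doe": "Mary Jones"}))
```

Applying the replacements in a single pass per name keeps the sketch short, but it assumes that no persona name is itself an original name in the mapping.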

In order to have realistic email timestamps, we translated the old Enron email timestamps to recent timestamps slightly earlier than our experiment start date. For instance, given two email timestamps t1 and t2 in the Enron dataset such that t1 is earlier than t2, we translate them to more recent timestamps T1 and T2 such that T1 is earlier than T2. We then schedule those particular emails to be sent to the recipient honey accounts at times T1 and T2 respectively. We sent between 200 and 300 emails from the Enron dataset to each honey account in the process of populating them.

Leaking account credentials. To achieve our objectives, we had to entice cybercriminals to interact with our account honeypots while we logged their accesses. We selected paste sites and underground forums as appropriate venues for leaking account credentials, since they tend to be misused by cybercriminals for dissemination of stolen credentials. In addition, we leaked some credentials through malware, since this is a popular way by which professional cybercriminals steal credentials and compromise accounts [10]. We divided the honeypot accounts in groups and leaked their credentials in different locations, as shown in Table 1. We leaked 50 accounts in total on paste sites. For 20 of them, we leaked basic credentials (username and password pairs) on the popular paste sites pastebin.com and pastie.org. We leaked 10 account credentials on Russian paste websites (p.for-us.nl and paste.org.ru). For the remaining 20 accounts, we leaked username and password pairs along with UK and US location information of the fictitious personas that we associated with the honey accounts. We also included date of birth information of each persona.

1 https://bitbucket.org/gianluca_students/gmail-honeypot
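The timestamp translation described earlier in this section only requires that the order of the emails is preserved. One simple way to satisfy that requirement, shown here as a sketch and not as the authors' procedure, is to shift every timestamp by the same offset. The one-day margin before the experiment start and the function name are assumptions.

```python
from datetime import datetime, timedelta

def translate_timestamps(old_times, experiment_start, margin=timedelta(days=1)):
    """Shift all timestamps by one constant offset.

    The newest original timestamp lands `margin` before experiment_start,
    so t1 < t2 always implies T1 < T2 and the gaps between emails are kept.
    The constant-offset rule is an assumption; the text only states that
    ordering is preserved.
    """
    offset = (experiment_start - margin) - max(old_times)
    return [t + offset for t in old_times]

# Two hypothetical Enron-era timestamps, moved to just before 25 June 2015.
old = [datetime(2001, 5, 1, 9, 0), datetime(2001, 5, 3, 17, 30)]
new = translate_timestamps(old, datetime(2015, 6, 25))
print(new)
```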

Group  Accounts  Outlet of leak
1      30        paste websites (no location)
2      20        paste websites (with location)
3      10        forums (no location)
4      20        forums (with location)
5      20        malware (no location)

Table 1: List of account honeypot groupings.
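As a quick consistency check on Table 1, the grouping can be written down as data and summed per outlet. The tuple layout and function name below are illustrative choices, while the numbers are taken from the table.

```python
from collections import Counter

# (group, number of accounts, outlet, leaked with location information)
GROUPS = [
    (1, 30, "paste websites", False),
    (2, 20, "paste websites", True),
    (3, 10, "forums", False),
    (4, 20, "forums", True),
    (5, 20, "malware", False),
]

def accounts_per_outlet(groups):
    """Sum the number of honey accounts leaked through each outlet."""
    totals = Counter()
    for _group, count, outlet, _with_location in groups:
        totals[outlet] += count
    return totals

totals = accounts_per_outlet(GROUPS)
print(dict(totals), sum(totals.values()))
```

The totals match the text: 50 accounts on paste sites, 30 on underground forums, and 20 through malware, for 100 accounts overall.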

We leaked 30 account credentials on underground forums. For 10 of them, we only specified username and password pairs, without additional information. For the remaining 20, in a manner similar to the paste site leaks described earlier, we appended UK and US location information to the underground forum leaks, claiming that our fictitious personas lived in those locations. We also included date of birth information for each persona.

To leak credentials, we used these forums: offensivecommunity.net, bestblackhatforums.eu, hackforums.net, and blackhatworld.com. We selected them because they were open for anybody to register, and were highly ranked in Google results. We acknowledge that some underground forums are not open, and have a strict vetting policy to let users in [30]. Unfortunately, we did not have access to any private forum. In addition, the same approach of studying open underground forums has been used by previous work [7]. When leaking credentials on underground forums, we mimicked the modus operandi of cybercriminals that was outlined by Stone-Gross et al. in [30]. In that paper, the authors showed that cybercriminals often post a sample of their stolen datasets on the forum to show that the accounts are real, and promise to provide additional data in exchange for a fee. We logged the messages that we received on underground forums, mostly inquiring about obtaining the full dataset, but we did not follow up on them.

Finally, to study the activity in honey accounts of criminals who obtain credentials through information-stealing malware, we leaked access credentials of 20 accounts to information-stealing malware samples. To this end, we selected malware samples from the Zeus family, which is one of the most popular malware families performing information stealing [10], as well as from the Corebot family. We provide detailed information on our malware honeypot infrastructure later in this section.

The reason for leaking different accounts on different outlets is to study differences in the behavior of cybercriminals getting access to stolen credentials through different sources. Similarly, we provide decoy location information in some leaks, and not in others, with the idea of observing differences in malicious activity depending on the amount and type of information available to cybercriminals. As we will show in Section 4, the accesses that were observed in our honey accounts were heavily influenced by the presence of additional location information in the leaked content.

Malware honeypot infrastructure. Our malware sandbox system is structured as follows. A web server entity manages the honey credentials (usernames and passwords) and the malware samples. The host machine creates a Virtual Machine (VM), which contacts the web server to request an executable malware file and a honey credential file. The structure is similar to the one explained in [21]. The malware file is then executed in the VM (that is, the VM is infected with malware), after which a script drives a browser in the VM to log in to Gmail using the downloaded credentials. The idea is to expose the honey credentials to the malware that is already running in the VM. After some time, the infected VM is deleted and a fresh one is created. This new VM downloads another malware sample and a different honey credential file, and it repeats the infection and login operation. To maximize the efficiency of the configuration, before the experiment we carried out a test without the Gmail login process to select only samples whose C&C servers were still up and running.
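The sandbox cycle described above (fresh VM, one malware sample, one honey credential, then teardown) can be outlined as a small scheduling loop. The sketch below is only a control-flow illustration under stated assumptions: `infect_and_login` is a hypothetical callable standing in for VM creation, infection, the scripted Gmail login, and VM deletion, and reusing samples in round-robin order is an assumption not stated in the text.

```python
from itertools import cycle

def run_cycles(samples, credentials, infect_and_login):
    """Expose each honey credential to one malware sample in a fresh VM.

    samples and credentials are lists of identifiers. infect_and_login is
    a hypothetical callable (vm_id, sample, credential) that would create
    the VM, execute the sample, log in to Gmail, and delete the VM.
    """
    log = []
    pairs = zip(cycle(samples), credentials)  # round-robin over samples
    for vm_id, (sample, credential) in enumerate(pairs):
        infect_and_login(vm_id, sample, credential)
        log.append((vm_id, sample, credential))
    return log

# Stub that only records which VM ids were used; no malware is involved.
calls = []
log = run_cycles(
    ["sample_a", "sample_b"],
    ["cred_1", "cred_2", "cred_3"],
    lambda vm_id, sample, credential: calls.append(vm_id),
)
print(log)
```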

3.3 Threats to validity

We acknowledge that seeding the honey accounts with emails from the Enron dataset may introduce bias into our results, and may make the honey accounts less believable to visitors. However, it is necessary to note that the Enron dataset is the only large publicly available email corpus, to the best of our knowledge. To make the emails believable, we changed the names in the emails, dates, and company name. In the future, we will work towards obtaining or generating a better email dataset, if possible. Also, some visitors may notice that the honey accounts did not receive any new emails during the period of observation, and this may affect the way in which criminals interact with the accounts. Another threat is that we only leaked honey credentials through the outlets listed previously (namely paste sites, underground forums, and malware); therefore, our results reflect the activity of participants present on those outlets only. Finally, since we selected only underground forums that are publicly accessible, our observations might not reflect the modus operandi of actors who are active on closed forums that require vetting for signing up.

3.4 Ethics

The experiments performed in this paper require some ethical considerations. First of all, by giving access to our honey accounts to cybercriminals, we incur the risk that these accounts will be used to damage third parties. To minimize this risk, as we said, we configured our accounts in a way that all emails would be forwarded to a sinkhole mailserver under our control and never delivered to the outside world. We also established a close collaboration with Google and made sure to report to them any malicious activity that needed attention. Although the suspicious login filters that Google typically uses to protect their accounts from unauthorized accesses were disabled for our honey accounts, all other malicious activity detection algorithms were still in place, and in fact Google suspended a number of accounts under our control that engaged in suspicious activity. It is important to note, however, that our approach does not rely on help from Google to work. Our main reason for enlisting Google’s help to disable suspicious login filters was to ensure that all accesses get through to the honey accounts (most accesses would be blocked if Google did not disable the login filters). This does not impact directly on our methodology, and as a result does not reduce the wider applicability of our approach. It is also important to note that Google did not share with us any details on the techniques used internally for the detection of malicious activity on Gmail. Another point of risk is ensuring that the malware in our VMs would not be able to harm third parties. We followed common practices [28] such as restricting the bandwidth available to our virtual machines and sinkholing all email traffic sent by them. Finally, our experiments involve deceiving cybercriminals by providing them fake accounts with fake personal information in them. To ensure that our experiments were run in an ethical fashion, we obtained IRB approval from our institution.

4. DATA ANALYSIS

We monitored the activity on our honey accounts for a period of 7 months, from 25th June, 2015 to 16th February, 2016. In this section, we first provide an overview of our results. We then discuss a taxonomy of the types of activity that we observed. We provide a detailed analysis of the type of activity monitored on our honey accounts, focusing on the differences in modus operandi shown by cybercriminals who obtain credentials to our honey accounts from different outlets. We then investigate whether cybercriminals attempt to evade location-based detection systems by connecting from locations that are closer to where the owner of the account typically connects from. We also develop a metric to infer which keywords attackers search for when looking for interesting information in an email account. Finally, we analyze how certain types of cybercriminals appear to be stealthier and more advanced than others.

Google records each unique access to a Gmail account and labels the access with a unique cookie identifier. These unique cookie identifiers, along with more information including times of accesses, are included in the visitor activity pages of Gmail accounts. Our scripts extract this data, which we analyze in this section. For the sake of convenience, we will use the terms “cookie” and “unique access” interchangeably in the remainder of this paper.
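Treating each cookie identifier as one unique access suggests a simple grouping step over the parsed visitor activity records. The sketch below assumes each record has already been reduced to a (cookie identifier, timestamp) pair; the record layout and the cookie values are hypothetical.

```python
from datetime import datetime

def summarize_by_cookie(records):
    """Group access records by cookie identifier.

    records is an iterable of (cookie_id, timestamp) pairs. The result maps
    each cookie to the (first_seen, last_seen) times of that unique access.
    """
    seen = {}
    for cookie, ts in records:
        first, last = seen.get(cookie, (ts, ts))
        seen[cookie] = (min(first, ts), max(last, ts))
    return seen

# Hypothetical records: cookie "c1" appears twice, cookie "c2" once.
records = [
    ("c1", datetime(2015, 7, 1, 10, 0)),
    ("c2", datetime(2015, 7, 2, 8, 30)),
    ("c1", datetime(2015, 7, 1, 10, 12)),
]
summary = summarize_by_cookie(records)
print(len(summary), summary["c1"])
```

The number of keys gives the count of unique accesses, and the difference between the two times gives the length of each access.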

4.1 Overview

We created, instrumented, and leaked 100 Gmail accounts for our experiments. To avoid biasing our results, we removed all accesses made to honey accounts by IP addresses from our monitoring infrastructure. We also removed all accesses that originated from the city where our monitoring infrastructure is located. After this filtering operation, we observed 326 unique accesses to the accounts during the experiments, during which 147 emails were opened, 845 emails were sent, and there were 12 unique draft emails composed by cybercriminals.
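The filtering step above amounts to dropping every access that matches the monitoring infrastructure by IP address or by city. A minimal sketch follows; the field names `ip` and `city`, the documentation-range IP addresses, and the city names are placeholders and not values from the study.

```python
def filter_accesses(accesses, monitor_ips, monitor_city):
    """Drop accesses made by the monitoring infrastructure.

    An access is removed if its IP address belongs to the monitoring
    infrastructure or if it originated from the monitoring city.
    """
    return [
        access
        for access in accesses
        if access["ip"] not in monitor_ips and access["city"] != monitor_city
    ]

# Hypothetical accesses using reserved documentation IP ranges.
accesses = [
    {"cookie": "c1", "ip": "192.0.2.10", "city": "Monitor City"},
    {"cookie": "c2", "ip": "198.51.100.7", "city": "Elsewhere"},
    {"cookie": "c3", "ip": "203.0.113.5", "city": "Monitor City"},
]
kept = filter_accesses(accesses, {"192.0.2.10"}, "Monitor City")
print([access["cookie"] for access in kept])
```

Filtering by city is conservative: it also discards any genuine attacker who happens to connect from the same city as the monitoring infrastructure.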

In total, 90 accounts received accesses during the experiment, comprising 41 accounts leaked to paste sites, 30 accounts leaked to underground forums, and 19 accounts leaked through malware. 42 accounts were blocked by Google during the course of the experiment, due to suspicious activity. We were able to log activity in those accounts for some time before Google blocked them. 36 accounts were hijacked by cybercriminals, that is, the passwords of such accounts were changed by the cybercriminals. As a result, we lost control of those accounts. We did not observe any attempt by attackers to change the default send-from address of our honey accounts. Had that happened and had attackers started sending spam messages, Google would have blocked such accounts, since we asked them to monitor the accounts with particular attention. A dataset containing the parsed metadata of the accesses received by our honey accounts during our experiments is publicly available at http://dx.doi.org/10.14324/000.ds.1508297

4.2 A taxonomy of account activity

From our dataset of activity observed in the honey accounts, we devise a taxonomy of attackers based on unique accesses to such accounts. We identify four types of attackers, described in detail in the following.

Curious. These accesses constitute the most basic type of access to stolen accounts. After getting hold of account credentials, people log in to those accounts to check if such credentials work. Afterwards, they do not perform any additional action. The majority of the observed accesses belong to this category, accounting for 224 accesses. We acknowledge that this large number of curious accesses may be due in part to experienced attackers avoiding interactions with the accounts after logging in, probably after some careful observations indicating that the accounts do not look real. This could potentially introduce some bias into our results.

Gold diggers. When getting access to a stolen account, attackers often want to understand its worth. For this reason, on logging into honey accounts, some attackers search for sensitive information, such as account information and attachments that have financial-related names. They also seek information that may be useful in spearphishing attacks. We call these accesses “gold diggers.” Previous research showed that this practice is quite common for manual account hijackers [13]. In this paper, we confirm that finding, provide a methodology to assess the keywords that cybercriminals search for, and analyze differences in the modus operandi of gold digger accesses for credentials leaked through different outlets. In total, we observed 82 accesses of this type.

Spammers. One of the main capabilities of webmail accounts is sending emails. Previous research showed that large spamming botnets have code in their bots and in their C&C infrastructure to take advantage of this capability, by having the bots directly connect to such accounts and send spam [30]. We consider accesses to belong to this category if they send any email. We observed 8 accounts that recorded such accesses. This low number of accounts shows that sending spam appears not to be one of the main purposes that cybercriminals use stolen accounts for, when stolen through the outlets that we studied.

Hijackers. A stealthy cybercriminal is likely to keep a low profile when accessing a stolen account, to avoid raising suspicion from the account’s legitimate owner. Less concerned miscreants, however, might just act to lock the legitimate owner out of their account by changing the account’s password. We call these accesses “hijackers.” In total, we observed 36 accesses of this type. A change of password prevents us from scraping the visitor activity page, and therefore we are unable to collect further information about the accesses performed to that account.

It is important to note that the taxonomy classes that we described are not exclusive. For example, an attacker might use an account to send spam emails, therefore falling in the “spammer” category, and then change the password of that account, therefore falling into the “hijacker” category. Such overlaps happened often for the accesses recorded in our honey accounts. It is interesting to note that there was no access that behaved exclusively as “spammer.” Miscreants that sent spam through our honey accounts also acted as “hijackers” or as “gold diggers,” searching for sensitive information in the account.
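Because the classes overlap, a labeling routine has to return a set of labels rather than a single one. The following is a minimal sketch of that idea, not the authors' implementation. The `Access` record and its fields are hypothetical, and it assumes that opened emails are the observable proxy for "gold digger" searches.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Access:
    """Hypothetical record of the actions seen during one unique access."""
    opened_emails: int = 0          # emails opened (proxy for searching)
    sent_emails: int = 0            # emails sent from the account
    changed_password: bool = False  # owner locked out


def classify(access: Access) -> Set[str]:
    """Return every taxonomy label that applies to one access."""
    labels = set()
    if access.opened_emails > 0:
        labels.add("gold digger")
    if access.sent_emails > 0:
        labels.add("spammer")
    if access.changed_password:
        labels.add("hijacker")
    # "Curious" is the fallback: a login with no further action.
    if not labels:
        labels.add("curious")
    return labels
```

Under these assumptions, an access that sends spam and then changes the password receives both the "spammer" and the "hijacker" label, matching the overlap described above.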

We wanted to understand the distribution of different types of accesses in accounts that were leaked through different means. Figure 1 shows a breakdown of this distribution. As can be seen, cybercriminals who get access to stolen accounts through malware are the stealthiest, and never lock the legitimate users out of their accounts. Instead, they limit their activity to checking if such credentials are real or searching for sensitive information in the account inbox, perhaps in an attempt to estimate the value of the accounts. Accounts leaked through paste sites and underground forums see the presence of “hijackers.” 20% of the accesses to accounts leaked through paste sites, in particular, belong to this category. Accounts leaked through underground forums, on the other hand, see the highest percentage of “gold digger” accesses, with about 30% of all accesses belonging to this category.

http://dx.doi.org/10.14324/000.ds.1508297

[Figure 1 plot: activity fraction (y-axis, 0.0 to 1.0) per leak outlet (Malware, Paste Sites, Underground Forums); legend: Curious, Gold Digger, Hijacker, Spammer]

Figure 1: Distribution of types of accesses for different credential leak outlets. As can be seen, most accesses belong to the “curious” category. It is possible to spot differences in the types of activities for different leak outlets. For example, accounts leaked by malware do not present activity of “hijacker” type. Hijackers, on the other hand, are particularly common among miscreants who obtain stolen credentials through paste sites.

Figure 2: CDF of the length of unique accesses for different types of activity on our honey accounts. The vast majority of unique accesses last a few minutes. Spammers tend to use accounts aggressively for a short time and then disconnect. The other types of accesses, and in particular “curious” ones, come back after some time, likely to check for new activity in the honey accounts.

4.3 Activity on honey accounts

In the following, we provide detailed analysis on the unique accesses that we recorded for our honey accounts.

4.3.1 Duration of accesses

For each cookie identifier, we recorded the time that the cookie first appeared in a particular honey account as t_0, and the last time it appeared in the honey account as t_last. From this information, we computed the duration of activity of each cookie as t_last − t_0. It is necessary to note that t_last of each cookie is a lower bound, since we cease to obtain information about cookies if the password of the honey account that is recording cookies is changed, for instance. Figure 2 shows the Cumulative Distribution Function (CDF) of the length of unique accesses of different types of attackers. As can be seen, the vast majority of accesses are very short, lasting only a few minutes and never coming back. “Spammer” accesses, in particular, tend to send emails in bursts for a certain period and then disconnect. “Hijacker” and “gold digger” accesses, on the other hand, have a long tail of about 10% of accesses that keep coming back for several days in a row. The CDF shows that most “curious” accesses are repeated over many days, indicating that the cybercriminals keep coming back to find out if there is new information in the accounts. This conflicts with the finding in [13], which states that most cybercriminals connect to a compromised webmail account once, to assess its value within a few minutes. However, [13] focused only on accounts compromised via phishing pages, while we look at a broader range of ways in which criminals can obtain such credentials. Our results show that the modes of operation of cybercriminals vary, depending on the outlets they obtain stolen credentials from.

Figure 3: CDF of the time passed between account credentials leaks and the first visit by a cookie. Accounts leaked through paste sites receive on average accesses earlier than accounts leaked through other outlets.
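The duration computation from Section 4.3.1 (t_last − t_0 per cookie, then a CDF over the durations) can be sketched as follows. This is an illustrative sketch only: the `(account, cookie_id, timestamp)` event layout and the function names are assumptions, not the format used by the monitoring infrastructure.

```python
def access_durations(events):
    """Map each (account, cookie) pair to t_last - t_0.

    `events` is an iterable of (account, cookie_id, timestamp) tuples;
    this layout is an assumption made for illustration.
    """
    first, last = {}, {}
    for account, cookie, ts in events:
        key = (account, cookie)
        first[key] = min(ts, first.get(key, ts))
        last[key] = max(ts, last.get(key, ts))
    return {key: last[key] - first[key] for key in first}


def empirical_cdf(values):
    """Return sorted (value, fraction of samples <= value) points."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, v in enumerate(ordered, start=1):
        if points and points[-1][0] == v:
            points[-1] = (v, i / n)  # tie: keep the highest fraction
        else:
            points.append((v, i / n))
    return points
```

A cookie that is seen only once gets a duration of zero, which is consistent with treating t_last as a lower bound on the true end of the access.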

4.3.2 Time between leak and first access

We then studied how long it takes after credentials are leaked on different outlets until our infrastructure records accesses from cybercriminals. Figure 3 reports a CDF of the time between leak and first access for accounts leaked through different outlets. As can be seen, within the first 25 days after leak, we recorded 80% of all unique accesses to accounts leaked to paste sites, 60% of all unique accesses to accounts leaked to underground forums, and 40% of all unique accesses to accounts leaked to malware. A particularly interesting observation is the nature of unique accesses to accounts leaked to malware. A close look at Figure 3 reveals rapid increases in unique accesses to honey accounts leaked to malware, about 30 days after the leak, and also after 100 days, indicated by two sharp inflection points.

Figure 4 sheds more light on what happened at those points. The figure reports the unique accesses to each of our honey accounts over time. An interesting aspect to note is that accounts that are leaked on public outlets such as forums and paste sites can be accessed by multiple cybercriminals at the same time. Account credentials leaked through malware, on the other hand, are available only to the botmaster that stole them, until they decide to sell them or to give them to someone else. Seeing bursts in accesses to accounts leaked through malware months after the actual leak happened could indicate that the accounts were visited again by the same criminal who operated the malware infrastructure, or that the accounts were sold on the underground market and that another miscreant is now using them. The second hypothesis is supported by the fact that these bursts in accesses were of the “gold digger” type, while all previous accesses to the same accounts were of the “curious” type.

In addition, Figure 4 shows that the majority of accounts leaked to paste sites were accessed within a few days of leak, while a particular subset was not accessed for more than 2 months. That subset consists of the ten credentials we leaked to Russian paste sites. This indicates either that few cybercriminals are active on the Russian paste sites, or that they did not believe that the accounts were real and therefore did not bother to access them.

Figure 4: Plot of duration between time of leak and unique accesses in accounts leaked through different outlets. As can be seen, accounts leaked to malware experience a sudden increase in unique accesses after 30 days and 100 days from the leak, indicating that they may have been sold or transferred to some other party by the cybercriminals behind the malware command and control infrastructure.

4.3.3 System configuration of accesses

We observed a wide variety of system configurations for the accesses across groups of leaked accounts, by leveraging Google’s system fingerprinting information available to us inside the honey accounts. As shown in Figure 5a, accesses to accounts leaked on paste sites were made through a variety of popular browsers, with Firefox and Chrome taking the lead. We also recorded many accesses from unknown browsers. It is possible for an attacker to hide browser information from Google servers by presenting an empty user agent and hiding other fingerprintable information [27]. About 50% of accesses to accounts leaked through paste sites were not identifiable. Chrome and Firefox take the lead in groups leaked in underground forums as well, but there is less variety of browsers there. Interestingly, all accesses to accounts in malware groups were made from unknown browsers. This suggests that cybercriminals that accessed groups leaked through malware were stealthier than others. While analyzing the operating systems used by criminals, we observed that honey accounts leaked through malware mostly received accesses from Windows computers, followed by Mac OS X and Linux. This is shown in Figure 5b. In the paste sites and underground forum groups, we observe a wider range of operating systems. More than 50% of computers in the three categories ran on Windows. It is interesting to note that Android devices were also used to connect to the honey accounts in paste sites and underground forum groups.

The diversity of devices and browsers in the paste sites and underground forums groups indicates a motley mix of cybercriminals with various motives and capabilities, compared to the malware groups that appear to be more homogeneous. It also suggests that attackers that steal credentials through malware make more efforts to cover their tracks by evading browser fingerprinting.

[Figure 5a plot: browser fraction (y-axis, up to 1.0) per leak outlet; legend: Vivaldi, Firefox, Chrome, Opera, Edge, Explorer, Iceweasel, Unknown]

(a) Distribution of browsers of honey account accesses

[Figure 5b plot: OS fraction (y-axis, up to 1.0) per leak outlet; legend: Android, Chrome OS, Linux, Mac OS X, Windows, Unknown]

(b) Distribution of operating systems of honey account accesses

Figure 5: Distribution of browsers and operating systems of the accesses that we logged to our honey accounts. As can be seen, accounts leaked through different outlets attracted cybercriminals with different system configurations.

4.3.4 Location of accesses

We recorded the location information that we found in the accesses that were logged by our infrastructure. Our goal was to understand patterns in the locations (or proxies) used by criminals to access stolen accounts. Out of the 326 accesses logged, 132 were coming from Tor exit nodes. More specifically, 28 accesses to accounts leaked on paste sites were made via Tor, out of a total of 144 accesses to accounts leaked on paste sites. 48 accesses to accounts leaked on forums were made through Tor, out of a total of 125 accesses made to accounts leaked on forums. We observed 57 accesses to accounts leaked through malware, and all except one of those accesses were made via Tor. We excluded these accesses from further analysis, since they do not provide information on the physical location of the criminals. After removing Tor nodes, 173 unique accesses presented location information. To determine this location information, we used the geolocation provided by Google on the account activity page of the honey accounts. We observed accesses from a total of 29 countries. To understand whether the IP addresses that connected to our honey accounts had been recorded in previous malicious activity, we ran checks on all IP addresses we observed against the Spamhaus blacklist. We found 20 IP addresses that accessed our honey accounts in the Spamhaus blacklist. Because of the nature of this blacklist, we believe that the addresses belong to malware-infected machines that were used by cybercriminals to connect to the stolen accounts.
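As a quick arithmetic cross-check of the Tor figures quoted in Section 4.3.4, the per-outlet counts can be tallied as below. The dictionary layout and function name are illustrative; the numbers themselves are the ones reported in the text, with "all except one" of 57 read as 56.

```python
# (total accesses, accesses via Tor) per leak outlet, as reported in the text.
ACCESSES = {
    "paste sites": (144, 28),
    "underground forums": (125, 48),
    "malware": (57, 56),  # "all except one" of the 57 accesses
}


def tor_summary(accesses):
    """Return overall totals and the per-outlet share of Tor accesses."""
    total = sum(t for t, _ in accesses.values())
    via_tor = sum(v for _, v in accesses.values())
    shares = {outlet: v / t for outlet, (t, v) in accesses.items()}
    return total, via_tor, shares


total, via_tor, shares = tor_summary(ACCESSES)
print(total, via_tor, total - via_tor)
```

The totals agree with the text (326 accesses, 132 via Tor), leaving 194 non-Tor accesses, of which 173 are reported to carry location information.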

One of our goals was to observe if cybercriminals attempt to evade location-based login risk analysis systems by tweaking access origins. In particular, we wanted to assess whether telling criminals the location where the owner of an account is based influences the locations that they will use to connect to the account. Of the 57 accesses to our honey accounts leaked through malware, all except one originated from Tor exit nodes. This suggests that the malware operators that accessed our accounts prefer to hide their location through the use of anonymizing systems rather than modifying their location based on where the stolen account is typically connecting from.

While leaking the honey credentials, we chose London and Pontiac, MI as our decoy UK and US locations respectively. The idea was to claim that the honey accounts leaked with location details belonged to fictitious personas living in either London or Pontiac. However, we realized that leaking multiple accounts with the same location might cause suspicion. As a result, we chose decoy UK and US locations such that London and Pontiac, MI were the midpoints of those locations respectively.

To observe the impact of availability of location information about the honey accounts on the locations that cybercriminals connect from, we calculated the median values of distances of the locations recorded in unique accesses, from the midpoints of the advertised decoy locations in our account leaks. For example, for the accesses A to honey accounts leaked on paste sites, advertised with UK information, we extracted location information, translated it to coordinates L_A, and computed the dist_paste_UK vector as distance(L_A, mid_UK), where mid_UK denotes London’s coordinates. All distances are in kilometers. We extracted the median values of all distance vectors obtained, and plotted circles on UK and US maps, specifying those median distances as radii of the circles, as shown in Figures 6a and 6b.
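The median-distance computation can be sketched as follows. The text does not say which distance function was used, so the great-circle (haversine) distance here is an assumption, as are the function names and the approximate London coordinates.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
LONDON = (51.5074, -0.1278)  # approximate (latitude, longitude) of mid_UK


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def median_distance_km(access_coords, midpoint):
    """Median distance of login locations from an advertised midpoint."""
    return median(haversine_km(c, midpoint) for c in access_coords)
```

Calling `median_distance_km(coords, LONDON)` on the coordinates of one group of accesses would give the radius of that group's circle in a plot like Figure 6a, under the assumptions above.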

Interestingly, we observe that connections to accounts with advertised location information originate from places closer to the midpoints than accounts with leaked information containing usernames and passwords only. Figure 6a shows that connections to accounts leaked on paste sites and forums with advertised UK location information result in the smaller median circles, that is, the connections originate from locations closer to London, the UK midpoint. The smallest circle is for the accounts leaked on paste sites, with advertised UK location information (radius 1400 kilometers). In contrast, the circle of accounts leaked on paste sites without location information has a radius of 1784 kilometers. The median circle of the accounts leaked in underground forums, with no advertised location information, is the largest circle in Figure 6a, while the one of accounts leaked in underground forums, along with UK location information, is smaller.

(a) Distance of login locations from the UK midpoint (b) Distance of login locations from the US midpoint

Figure 6: Distance of login locations from the midpoints of locations advertised while leaking credentials. Red lines indicate credentials leaked on paste sites with no location information, green lines indicate credentials leaked on paste sites with location information, purple lines indicate credentials leaked on underground forums without location information, while blue lines indicate credentials leaked on underground forums with location information. As can be seen, account credentials leaked with location information attract logins from hosts that are closer to the advertised midpoint than credentials that are posted without any location information.

We obtained similar results in the US plot, with some interesting distinctions. As shown in Figure 6b, connections to honey accounts leaked on paste sites with advertised US locations are clustered around the US midpoint, as indicated by the circle with a radius of 939 kilometers, compared to the median circle of accounts leaked on paste sites without location information, which has a radius of 7900 kilometers. However, although the median circle of accounts leaked in underground forums with advertised location information is smaller than that of the one without advertised location information, the difference in their radii is not as pronounced. This again supports the indication that cybercriminals on paste sites exhibit more location malleability, that is, masquerading their origins of accesses to appear closer to the advertised location, when provided. It also suggests that cybercriminals on the studied forums are less sophisticated, or care less, than the ones on paste sites.

Statistical significance. As we explained, Figures 6a and 6b show that accesses to leaked accounts happen closer to advertised locations if this information is included in the leak. To confirm the statistical significance of this finding, we performed a Cramér-von Mises test [15]. The Anderson version [8] of this test is used to understand whether two vectors of values are likely to have the same statistical distribution or not. The p-value has to be under 0.01 for us to reject the null hypothesis (i.e., that the two vectors of distances have the same distribution); otherwise it is not possible to state with statistical significance that the two distance vectors come from different distributions. The p-values from the tests on paste site vectors (0.0017415 for UK location information versus no location, and 0.0000007 for US location information versus no location) allow us to reject the null hypothesis, and thus to state that the two vectors come from different distributions. We cannot say the same for the tests on forum vectors (p-values of 0.272883 for the UK case and 0.272011 for the US one). Therefore, at the 0.01 significance level, the test supports the finding that criminals using paste sites connect from closer locations when location information is provided along with the leaked credentials. We cannot reach that conclusion in the case of accounts leaked to underground forums, although Figures 6a and 6b indicate that there are some location differences in this case too.
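To make the test concrete, the sketch below computes a two-sample Cramér-von Mises statistic, following the form usually attributed to Anderson, and estimates a p-value by permutation. The permutation step is a swapped-in technique chosen to keep the example dependency-free; it is not the procedure used in the paper, which relies on the test's own distribution. Function names, the number of permutations, and the seed are all illustrative.

```python
import random
from bisect import bisect_right


def cvm_statistic(x, y):
    """Two-sample Cramér-von Mises statistic over the pooled sample."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    total = 0.0
    for v in xs + ys:
        # Empirical CDFs of both samples evaluated at each pooled point.
        f = bisect_right(xs, v) / n
        g = bisect_right(ys, v) / m
        total += (f - g) ** 2
    return n * m / (n + m) ** 2 * total


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Estimate a p-value by reshuffling group labels (an assumption,
    used here instead of the test's tabulated distribution)."""
    rng = random.Random(seed)
    observed = cvm_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if cvm_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

With two distance vectors as input, a p-value below 0.01 would lead to rejecting the null hypothesis that both come from the same distribution, which mirrors the decision rule stated above.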

4.3.5 What are “gold diggers” looking for?

Cybercriminals compromise online accounts due to the inherent value of those accounts. As a result, they assess accounts to decide how valuable they are, and decide exactly what to do with such accounts. We decided to study the words that they searched for in the honey accounts, in order to understand and potentially characterize anomalous searches in the accounts. A limiting factor in this case was the fact that we did not have access to search logs of the honey accounts, but only to the content of the emails that were opened. To overcome this limitation, we employed Term Frequency–Inverse Document Frequency (TF-IDF). TF-IDF is used to rank words in a corpus by importance. As a result we relied on TF-IDF to infer the words that cybercriminals searched for in the honey accounts. TF-IDF is a product of two metrics, namely Term Frequency (TF) and Inverse Document Frequency (IDF). The idea is that we can infer the words that cybercriminals searched for, by comparing the important words in the emails opened by cybercriminals to the important words in all emails in the decoy accounts.

Searched words  TFIDF_R  TFIDF_A  TFIDF_R − TFIDF_A  | Common words  TFIDF_R  TFIDF_A  TFIDF_R − TFIDF_A
results         0.2250   0.0127   0.2122             | transfer      0.2795   0.2949   -0.0154
bitcoin         0.1904   0.0      0.1904             | please        0.2116   0.2608   -0.0493
family          0.1624   0.0200   0.1423             | original      0.1387   0.1540   -0.0154
seller          0.1333   0.0037   0.1296             | company       0.0420   0.1531   -0.1111
localbitcoins   0.1009   0.0      0.1009             | would         0.0864   0.1493   -0.0630
account         0.1114   0.0247   0.0866             | energy        0.0618   0.1471   -0.0853
payment         0.0982   0.0157   0.0824             | information   0.0985   0.1308   -0.0323
bitcoins        0.0768   0.0      0.0768             | about         0.1342   0.1226   0.0116
below           0.1236   0.0496   0.0740             | email         0.1402   0.1196   0.0207
listed          0.0858   0.0207   0.0651             | power         0.0462   0.1175   -0.0713

Table 2: List of top 10 words by TFIDF_R − TFIDF_A (on the left) and list of top 10 words by TFIDF_A (on the right). The words on the left are the ones that have the highest difference in importance between the emails opened by attackers and the emails in the entire corpus. For this reason, they are the words that attackers most likely searched for when looking for sensitive information in the stolen accounts. The words on the right, on the other hand, are the ones that have the highest importance in the entire corpus.

In its simplest form, TF is a measure of how frequently term t is found in document d, as shown in Equation 1. IDF is a logarithmic scaling of the inverse of the fraction of documents that contain term t, as shown in Equation 2, where D is the set of all documents in the corpus, N is the total number of documents in the corpus, and |{d ∈ D : t ∈ d}| is the number of documents in D that contain term t. Once TF and IDF are obtained, TF-IDF is computed by multiplying TF and IDF, as shown in Equation 3.

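$$\mathrm{tf}(t, d) = f_{t,d} \tag{1}$$

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{2}$$

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) \tag{3}$$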

The output of TF-IDF is a weighted metric that ranges between 0 and 1. The closer the weighted value is to 1, the more important the term is in the corpus.
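Equations 1 to 3 translate directly into code. The sketch below implements them literally on tokenized documents; the function names, the toy tokens, and the handling of terms absent from the corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Equation 1: raw count of `term` in one document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus):
    """Equation 2: log(N / number of documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: terms absent from the corpus get weight 0
    return math.log(len(corpus) / containing)


def tfidf(term, doc_tokens, corpus):
    """Equation 3: product of TF and IDF."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Toy two-document corpus in the spirit of [d_A, d_R].
d_all = ["energy", "energy", "power", "transfer"]
d_read = ["bitcoin", "bitcoin", "transfer"]
corpus = [d_all, d_read]
print(tfidf("bitcoin", d_read, corpus), tfidf("transfer", d_read, corpus))
```

Note that this raw form is not confined to the range 0 to 1, and a term present in every document receives a weight of zero. The values in Table 2 therefore presumably come from a normalized or smoothed variant, which the text does not specify.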

We evaluated TF-IDF on all terms in a corpus of text comprising two documents, namely, all emails d_A in the honey accounts, and all emails d_R opened by the attackers. The intuition is that the words that have a large importance in the emails that have been opened by a criminal, but have a lower importance in the overall dataset, are likely to be keywords that the attackers searched for in the Gmail account. We preprocessed the corpus by filtering out all words that have fewer than 5 characters, and removing all known header-related words (for instance “delivered” and “charset”), honey email handles, and signaling information that our monitoring infrastructure introduced into the emails. After running TF-IDF on all remaining terms in the corpus [d_A, d_R], we obtained their TF-IDF values as the vectors TFIDF_A and TFIDF_R. We proceeded to compute the vector TFIDF_R − TFIDF_A. The top 10 words by TFIDF_R − TFIDF_A, compared to the top 10 words by TFIDF_A, are presented in Table 2. Words that have TFIDF_R values that are higher than TFIDF_A will rank higher in the list, and those are the words that the cybercriminals likely searched for.
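The preprocessing and ranking steps can be sketched as follows. The function names and the stop-word set are illustrative; only the two header words quoted in the text are included, and the scores in the usage example are a handful of values copied from Table 2.

```python
HEADER_WORDS = {"delivered", "charset"}  # the two examples named in the text


def preprocess(tokens, min_length=5, stop_words=HEADER_WORDS):
    """Drop words shorter than `min_length` and known header-related words."""
    return [t for t in tokens if len(t) >= min_length and t not in stop_words]


def rank_by_difference(tfidf_r, tfidf_a, top_k=10):
    """Sort terms by TFIDF_R - TFIDF_A, largest difference first."""
    terms = set(tfidf_r) | set(tfidf_a)
    diffs = {t: tfidf_r.get(t, 0.0) - tfidf_a.get(t, 0.0) for t in terms}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# A few (TFIDF_R, TFIDF_A) pairs taken from Table 2.
tfidf_r = {"results": 0.2250, "bitcoin": 0.1904, "transfer": 0.2795, "company": 0.0420}
tfidf_a = {"results": 0.0127, "bitcoin": 0.0, "transfer": 0.2949, "company": 0.1531}
ranked = rank_by_difference(tfidf_r, tfidf_a)
print([term for term, _ in ranked])
```

Terms that are important in the opened emails but not in the corpus as a whole float to the top, while words that are common everywhere, such as "company", end up with negative differences.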

As seen in Table 2, the top 10 important words by TFIDF_R − TFIDF_A are sensitive words, such as “bitcoin,” “family,” and “payment.” Comparing these words with the most important words in the entire corpus indicates that attackers likely searched for sensitive information, especially financial information. In addition, words with the highest importance in the entire corpus (for example, “company” and “energy”), shown in the right side of Table 2, have much lower importance in the emails opened by cybercriminals, and most of them have negative values in TFIDF_R − TFIDF_A. This is a strong indicator that the emails opened in the honey accounts were not opened at random, but were the result of searches for sensitive information.

Originally, the Enron dataset had no “bitcoin” term. However, that term was introduced into the opened emails document d_R, through the actions of one of the cybercriminals that accessed some of the honey accounts. The cybercriminal attempted to send blackmail messages from some of our honey accounts to Ashley Madison scandal victims [3], requesting ransoms in bitcoin, in exchange for silence. In the process, many draft emails containing information about “bitcoin” were created and abandoned by the cybercriminal, and other cybercriminals opened them during later accesses. That way, our monitoring infrastructure picked up “bitcoin” related terms, and they rank high in Table 2, indicating that cybercriminals showed a lot of interest in those emails.

4.4 Interesting case studies

In this section, we present some interesting case studies that we encountered during our experiments. They help to shed further light on the actions that cybercriminals take on compromised webmail accounts.

Three of the honey accounts were used by an attacker to send multiple blackmail messages to some victims of the Ashley Madison scandal. The blackmailer threatened to expose the victims, unless they made some payments in bitcoin to a specified bitcoin wallet. Tutorials on how to make bitcoin payments were also included in the messages. The blackmailer created and abandoned many draft emails targeted at more Ashley Madison victims. As we have already mentioned, some other visitors to the accounts opened these drafts, thus contributing to the opened emails that our monitoring infrastructure recorded.

Two of the honey accounts received notification emails about the hidden Google Apps Script in both honey accounts “using too much computer time.” The notifications were opened by an attacker, and we received notifications about the opening actions.

Finally, an attacker registered on a carding forum using one of the honey accounts as registration email address. As a result, registration confirmation information was sent to the honey account. This shows that some of the accounts were used as stepping stones by cybercriminals to perform further illicit activity.

4.5 Sophistication of attackers

From the accesses we recorded in the honey accounts, we identified three behaviors of cybercriminals that indicate their level of sophistication: configuration hiding (for instance, by hiding user agent information), location filter evasion (by connecting from locations close to the advertised decoy location, if provided), and stealthiness (avoiding clearly malicious actions such as hijacking and spamming). Attackers accessing the different groups of honey accounts exhibit different types of sophistication. Those accessing accounts leaked through malware are stealthier than others: they do not hijack the accounts, and they do not send spam from them. They also access the accounts through Tor, and they hide their system configuration; for instance, their web browser is not fingerprintable by Google. Attackers accessing accounts leaked on paste sites tend to connect from locations closer to the ones specified as decoy locations in the leaked account. They do this in a bid to evade detection. Attackers accessing accounts leaked in underground forums do not make significant attempts to stay stealthy or to connect from closer locations. These differences in sophistication could be used to characterize attacker behavior in future work.

5. DISCUSSION

In this section, we discuss the implications of the findings we made in this paper. First, we talk about what our findings mean for current mitigation techniques against compromised online service accounts, and how they could be used to devise better defenses. Then, we talk about some limitations of our method. Finally, we present some ideas for future work.

Implications of our findings. In this paper, we made multiple findings that provide the research community with a better understanding of what happens when online accounts get compromised. In particular, we discovered that if attackers are provided with location information about the online accounts, they then tend to connect from places that are closer to those advertised locations. We believe that this is an attempt to evade current security mechanisms employed by online services to discover suspicious logins. Such systems often rely on the origin of logins, to assess how suspicious those login attempts are. Our findings show that there is an arms race going on, with attackers attempting to actively evade the location-based anomaly detection systems employed by Google. We also observed that many accesses were received through Tor exit nodes, so it is hard to determine the exact origins of logins. This problem shows the importance of defense in depth in protecting online systems, in which multiple detection systems are employed at the same time to identify and block miscreants.

Besides confirming existing evasion techniques in use by cybercriminals, our experiments also highlighted interesting behaviors that could be used to develop effective systems to detect malicious activity. For example, our observations about the words searched for by the cybercriminals show that behavioral modeling could work in identifying anomalous behavior in online accounts. Anomaly detection systems could be trained adaptively on words being searched for by the legitimate account owner over a period of time. A deviation of search behavior would then be flagged as anomalous, indicating that the account may have been compromised. Similarly, anomaly detection systems could be trained on the durations of connections during benign usage, and deviations from those could be flagged as anomalous.

Limitations. We encountered a number of limitations in the course of the experiments. For example, we were able to leak the honey accounts only on a few outlets, namely paste sites, underground forums, and malware. In particular, we could only target underground forums that were open to the public and for which registration was free. Similarly, we could not study some of the most recent families of information-stealing malware such as Dridex, because they would not execute in our virtual environment. Attackers could find the scripts we hid in the honey accounts and remove them, making it impossible for us to monitor the activity of the accounts. This is an intrinsic limitation of our monitoring architecture, but in principle studies similar to ours could be performed by the online service providers themselves, such as Google and Facebook. By having access to the full logs of their systems, such entities would have no need to set up monitoring scripts, and it would be impossible for attackers to evade their scrutiny. Finally, while evaluating what cybercriminals were looking for in the honey accounts, we were able to observe the emails that they found interesting in the honey accounts, not everything they searched for. This is because we do not have access to the search logs of the honey accounts.

Future work. In the future, we plan to continue exploring the ecosystem of stolen accounts, and gaining a better understanding of the underground economy surrounding them. We will explore ways to make the decoy accounts more believable, to attract more cybercriminals and keep them engaged with the decoy accounts. We intend to set up additional scenarios, such as studying attackers who have a specific motivation, for example compromising accounts that belong to political activists (rather than generic corporate accounts, as we did in this paper). We would also like to study if demographic information, as well as the language that the emails in honey accounts are written in, influence the way in which cybercriminals interact with these accounts. To mitigate the fact that our infrastructure can only identify search terms for emails that were found in the accounts, we plan to seed the honey accounts with some specially crafted emails containing decoy sensitive information, for instance, fake bank account information and login credentials, along with other regular email messages. We expect this type of specialized email seeding to increase the variety of hits when cybercriminals search for content in the honey accounts, and thus to improve our insight into what criminals search for.
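The behavioral-modeling idea raised under the implications above (training on the legitimate owner's search terms and flagging deviations) could be prototyped along the following lines. This is a deliberately simple sketch and not a system described in the paper: the class name, the bag-of-words model, and the 0.5 threshold are all hypothetical choices.

```python
class SearchProfile:
    """Hypothetical per-account model of the owner's search vocabulary."""

    def __init__(self, threshold=0.5):
        self.known_terms = set()
        self.threshold = threshold  # illustrative value, not from the paper

    def train(self, query):
        """Add the words of a benign query to the owner's vocabulary."""
        self.known_terms.update(query.lower().split())

    def is_anomalous(self, query):
        """Flag a query whose share of unseen words exceeds the threshold."""
        terms = query.lower().split()
        if not terms:
            return False
        unseen = sum(1 for t in terms if t not in self.known_terms)
        return unseen / len(terms) > self.threshold


profile = SearchProfile()
profile.train("quarterly energy report")
profile.train("meeting schedule")
print(profile.is_anomalous("bitcoin wallet password"))
```

A production detector would need to adapt over time and tolerate legitimate new interests, but even this toy version shows how searches for previously unseen sensitive terms could stand out.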

6. RELATED WORK

In this section, we briefly compare this paper with previous work, noting that most previous work focused on spam and social spam. Only a few focused on manual hijacking of accounts and their activity.

Bursztein et al. [13] investigated manual hijacking of online accounts through phishing pages. The study focuses on cybercriminals that steal user credentials and use them privately, and shows that manual hijacking is not as common as automated hijacking by botnets. This paper illustrates the usefulness of honey credentials (account honeypots), in the study of hijacked accounts. Compared to the work by Bursztein et al., which focused on phishing, we analyzed a much broader threat model, looking at account credentials automatically stolen by malware, as well as the behavior of cybercriminals that obtain account credentials through underground forums and paste sites. By focusing on multiple types of miscreants, we were able to show differences in their modus operandi, and provide multiple insights on the activities that happen on hijacked Gmail accounts in the wild. We also provide an open source framework that can be used by other researchers to set up experiments similar to ours and further explore the ecosystem of stolen Google accounts. To the best of our knowledge, our infrastructure is the first publicly available Gmail honeypot infrastructure. Despite the fact that the authors of [13] had more visibility on the account hijacking phenomenon than we did, since they were operating the Gmail service, the dataset that we collected is of comparable size to theirs: we logged 326 malicious accesses to 100 accounts, while they studied 575 high-confidence hijacked accounts.

A number of papers looked at abuse of accounts on social networks. Thomas et al. [34] studied Twitter accounts under the control of spammers. Stringhini et al. [31] studied social spam using 300 honeypot profiles, and presented a tool for detection of spam on Facebook and Twitter. Similar work was also carried out in [9, 12, 24, 38]. Thomas et al. [35] studied underground markets in which fake Twitter accounts are sold and then used to spread spam and other malicious content. Unlike this paper, they focus on fake accounts and not on legitimate ones that have been hijacked. Wang et al. [37] proposed the use of patterns of click events to find fake accounts in online services.

Researchers also looked at developing systems to detect compromised accounts. Egele et al. [18] presented a system that detects malicious activity in online social networks using statistical models. Stringhini et al. [32] developed a tool for detecting compromised email accounts based on the behavioral modeling of senders.

Other papers investigated the use of stolen credentials and stolen files by setting up honeyfiles. Liu et al. [25] deployed honeyfiles containing honey account credentials in P2P shared spaces. The study used a similar approach to ours, especially in the placement of honey account credentials. However, they placed more emphasis on honeyfiles than on honey credentials. Besides, they studied P2P networks while our work focuses on compromised accounts in webmail services. Nikiforakis et al. [26] studied privacy leaks in file hosting services by deploying honeyfiles on them. In our previous work [23], we developed an infrastructure to study malicious activity in online spreadsheets, using an approach similar to the one described in this paper. Stone-Gross et al. [30] studied a large-scale spam operation by analyzing 16 C&C servers of the Pushdo/Cutwail botnet. In the paper, the authors highlight that the Cutwail botnet, one of the largest of its time, has the capability of connecting to webmail accounts to send spam. In their paper, Stone-Gross et al. also describe the activity of cybercriminals on spamdot, a large underground forum. They show that cybercriminals were actively trading account information such as the one provided in this paper, providing free “teasers” of the overall datasets for sale. In this paper, we used a similar approach to leak account credentials on underground forums.

7. CONCLUSION

In this paper, we presented a honey account system able to monitor the activity of cybercriminals that gain access to Gmail account credentials. Our system is publicly available to encourage researchers to set up additional experiments and improve the knowledge of our community regarding what happens after webmail accounts are compromised.² We leaked credentials to 100 honey accounts on paste sites, underground forums, and virtual machines infected with malware, and provided detailed statistics of the activity of cybercriminals on the accounts, together with a taxonomy of the criminals. Our findings help the research community to get a better understanding of the ecosystem of stolen online accounts, and potentially help researchers to develop better detection systems against this malicious activity.

8. ACKNOWLEDGMENTS

We wish to thank our shepherd Andreas Haeberlen for his advice on how to improve our paper, and Mark Risher and Tejaswi Nadahalli from Google for their support throughout the project. We also thank the anonymous reviewers for their comments. This work was supported by the EPSRC under grant EP/N008448/1, and by a Google Faculty Award. Jeremiah Onaolapo was supported by the Petroleum Technology Development Fund (PTDF), Nigeria, while Enrico Mariconti was funded by the EPSRC under grant 1490017.

9. REFERENCES

[1] Apps Script. https://developers.google.com/apps-script/?hl=en.

[2] Dropbox User Credentials Stolen: A Reminder To Increase Awareness In House. http://www.symantec.com/connect/blogs/dropbox-user-credentials-stolen-reminder-increase-awareness-house.

[3] Hackers Finally Post Stolen Ashley Madison Data. https://www.wired.com/2015/08/happened-hackers-posted-stolen-ashley-madison-data/.

[4] Overview of Google Apps Script. https://developers.google.com/apps-script/overview.

[5] Pastebin. pastebin.com.

[6] The Target Breach, By the Numbers. http://krebsonsecurity.com/2014/05/the-target-breach-by-the-numbers/.

[7] S. Afroz, A. C. Islam, A. Stolerman, R. Greenstadt, and D. McCoy. Doppelganger finder: Taking stylometry to the underground. In IEEE Symposium on Security and Privacy, 2014.

[8] T. W. Anderson and D. A. Darling. Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. The Annals of Mathematical Statistics, 1952.

² https://bitbucket.org/gianluca_students/gmail-honeypot

[9] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting Spammers on Twitter. In Conference on Email and Anti-Spam (CEAS), 2010.

[10] H. Binsalleeh, T. Ormerod, A. Boukhtouta, P. Sinha, A. Youssef, M. Debbabi, and L. Wang. On the analysis of the Zeus botnet crimeware toolkit. In Privacy, Security and Trust (PST), 2010.

[11] D. Boneh, S. Inguva, and I. Baker. SSL MITM Proxy. http://crypto.stanford.edu/ssl-mitm, 2007.

[12] Y. Boshmaf, I. Muslukhov, K. Beznosov, and M. Ripeanu. The socialbot network: when bots socialize for fame and money. In Annual Computer Security Applications Conference (ACSAC), 2011.

[13] E. Bursztein, B. Benko, D. Margolis, T. Pietraszek, A. Archer, A. Aquino, A. Pitsillidis, and S. Savage. Handcrafted Fraud and Extortion: Manual Account Hijacking in the Wild. In ACM Internet Measurement Conference (IMC), 2014.

[14] E. Butler. Firesheep. http://codebutler.com/firesheep, 2010.

[15] H. Cramer. On the composition of elementary errors. Skandinavisk Aktuarietidskrift, 1928.

[16] A. Das, J. Bonneau, M. Caesar, N. Borisov, and X. Wang. The Tangled Web of Password Reuse. In Symposium on Network and Distributed System Security (NDSS), 2014.

[17] R. Dhamija, J. D. Tygar, and M. Hearst. Why phishing works. In ACM Conference on Human Factors in Computing Systems (CHI), 2006.

[18] M. Egele, G. Stringhini, C. Kruegel, and G. Vigna. COMPA: Detecting Compromised Accounts on Social Networks. In Symposium on Network and Distributed System Security (NDSS), 2013.

[19] M. Egele, G. Stringhini, C. Kruegel, and G. Vigna. Towards Detecting Compromised Accounts on Social Networks. In IEEE Transactions on Dependable and Secure Computing (TDSC), 2015.

[20] T. N. Jagatic, N. A. Johnson, M. Jakobsson, and F. Menczer. Social Phishing. Communications of the ACM, 50(10):94–100, 2007.

[21] J. P. John, A. Moshchuk, S. D. Gribble, and A. Krishnamurthy. Studying Spamming Botnets Using Botlab. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2009.

[22] B. Klimt and Y. Yang. Introducing the Enron Corpus. In Conference on Email and Anti-Spam (CEAS), 2004.

[23] M. Lazarov, J. Onaolapo, and G. Stringhini. Honey Sheets: What Happens to Leaked Google Spreadsheets? In USENIX Workshop on Cyber Security Experimentation and Test (CSET), 2016.

[24] K. Lee, J. Caverlee, and S. Webb. The social honeypot project: protecting online communities from spammers. In World Wide Web Conference (WWW), 2010.

[25] B. Liu, Z. Liu, J. Zhang, T. Wei, and W. Zou. How many eyes are spying on your shared folders? In ACM Workshop on Privacy in the Electronic Society (WPES), 2012.

[26] N. Nikiforakis, M. Balduzzi, S. Van Acker, W. Joosen, and D. Balzarotti. Exposing the Lack of Privacy in File Hosting Services. In USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET), 2011.

[27] N. Nikiforakis, A. Kapravelos, W. Joosen, C. Kruegel, F. Piessens, and G. Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In IEEE Symposium on Security and Privacy, 2013.

[28] C. Rossow, C. J. Dietrich, C. Grier, C. Kreibich, V. Paxson, N. Pohlmann, H. Bos, and M. van Steen. Prudent practices for designing malware experiments: Status quo and outlook. In IEEE Symposium on Security and Privacy, 2012.

[29] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski, R. Kemmerer, C. Kruegel, and G. Vigna. Your Botnet is My Botnet: Analysis of a Botnet Takeover. In ACM Conference on Computer and Communications Security (CCS), 2009.

[30] B. Stone-Gross, T. Holz, G. Stringhini, and G. Vigna. The underground economy of spam: A botmaster’s perspective of coordinating large-scale spam campaigns. In USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET), 2011.

[31] G. Stringhini, C. Kruegel, and G. Vigna. Detecting Spammers on Social Networks. In Annual Computer Security Applications Conference (ACSAC), 2010.

[32] G. Stringhini and O. Thonnard. That Ain’t You: Blocking Spearphishing Through Behavioral Modelling. In Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), 2015.

[33] B. Taylor. Sender Reputation in a Large Webmail Service. In Conference on Email and Anti-Spam (CEAS), 2006.

[34] K. Thomas, C. Grier, D. Song, and V. Paxson. Suspended accounts in retrospect: an analysis of Twitter spam. In ACM Internet Measurement Conference (IMC), 2011.

[35] K. Thomas, D. McCoy, C. Grier, A. Kolcz, and V. Paxson. Trafficking Fraudulent Accounts: The Role of the Underground Market in Twitter Spam and Abuse. In USENIX Security Symposium, 2013.

[36] D. Wang, Z. Zhang, P. Wang, J. Yan, and X. Huang. Targeted Online Password Guessing: An Underestimated Threat. In ACM Conference on Computer and Communications Security (CCS), 2016.

[37] G. Wang, T. Konolige, C. Wilson, X. Wang, H. Zheng, and B. Y. Zhao. You are How You Click: Clickstream Analysis for Sybil Detection. In USENIX Security Symposium, 2013.

[38] S. Webb, J. Caverlee, and C. Pu. Social Honeypots: Making Friends with a Spammer Near You. In Conference on Email and Anti-Spam (CEAS), 2008.
