searching the searchers with searchaudit


Searching the Searchers with SearchAudit
John P., Fang Yu, Yinglian Xie, Martin Abadi, Arvind Krishnamurthy
University of California, Santa Cruz
USENIX Security Symposium, August 2010
A Presentation at Advanced Defense Lab

Outline
Introduction
Related Work
Architecture
Implementation Stage 1
Implementation Stage 2
Attack 1: Identifying Vulnerable Web Sites
Attack 2: Forum Spamming
Attack 3: Windows Live Messenger Phishing
Conclusion

Introduction
A framework that identifies malicious queries from massive search engine logs to uncover their relationship with potential attacks.

It uses a small set of malicious queries as seeds and generates regular expressions for detecting new malicious queries.

With the amount of information in the Web rapidly growing, the search engine has become an everyday tool for people to find relevant and useful information. While search engines make online browsing easier for normal users, they have also been exploited by malicious entities to facilitate their various attacks. For example, in 2004, the MyDoom worm used Google to search for email addresses in order to send spam and virus emails. Recently, it was also reported that hackers used search engines to identify vulnerable Web sites and compromised them immediately after the malicious searches [20, 16]. These compromised Web sites were then used to serve malware or phishing pages.

Indeed, by crafting specific search queries, hackers may get very specific information from search engines that could potentially reveal the existence and locations of security flaws such as misconfigured servers and vulnerable software. Furthermore, attackers may prefer using search engines because it is stealthier and easier than setting up their own crawlers.

In addition, these malicious queries could provide rich information about the attackers, including their intentions and locations.

Therefore, strategically, we can let the attackers guide us to better understand their methods and techniques, and ultimately, to predict and prevent followup attacks before they are launched.

Introduction
Two stages: Identification, Investigation

SearchAudit identifies malicious queries.

Analyzing those queries and the attacks of which they are part.

Working with SearchAudit consists of two stages: identification and investigation. In the first stage, SearchAudit identifies malicious queries. In the second stage, with SearchAudit's assistance, we focus on analyzing those queries and the attacks of which they are part.

------------------------------- Stage 1 ----------------------------------------


More specifically, in the first stage, SearchAudit takes a few known malicious queries as seed input and tries to identify more malicious queries. The seed can be obtained from hacker Web sites [1], known security vulnerabilities, or case studies performed by other security researchers.

As seed malicious queries are usually limited in quantity and restricted by previous discoveries, SearchAudit monitors the hosts that conducted these malicious queries to obtain an expanded set of queries from these hosts.

Using the expanded set of queries, SearchAudit further generates regular expressions, which are then used to match search logs for identifying other malicious queries. This step is critical as malicious queries are typically automated searches generated by scripts.

Using regular expressions offers us the opportunity to catch a large number of other queries with a similar format, possibly generated by such scripts.
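
The expansion and matching steps described above can be sketched roughly as follows. This is only an illustration, not SearchAudit's implementation: the log is assumed to be a list of (IP, query) pairs, the function names and sample data are hypothetical, and the regular expression is written by hand here rather than generated automatically.

```python
import re
from collections import defaultdict

def expand_queries(log, seeds):
    """Return (IPs that issued a seed query, every query those IPs issued)."""
    seed_set = set(seeds)
    queries_by_ip = defaultdict(set)
    for ip, query in log:
        queries_by_ip[ip].add(query)
    suspicious_ips = {ip for ip, qs in queries_by_ip.items() if qs & seed_set}
    expanded = set()
    for ip in suspicious_ips:
        expanded |= queries_by_ip[ip]
    return suspicious_ips, expanded

def match_log(log, patterns):
    """Return every logged query that matches at least one pattern."""
    compiled = [re.compile(p) for p in patterns]
    return {q for _, q in log if any(c.search(q) for c in compiled)}

# Hypothetical toy log: (IP, query) pairs.
sample_log = [
    ("10.0.0.1", "index.php?id=1 site:.com"),
    ("10.0.0.1", "index.php?id=2 site:.org"),
    ("10.0.0.2", "weather today"),
    ("10.0.0.3", "index.php?id=7 site:.net"),
]
sample_seeds = ["index.php?id=1 site:.com"]

ips, expanded = expand_queries(sample_log, sample_seeds)
# Hand-written stand-in for a generated signature.
matched = match_log(sample_log, [r"^index\.php\?id=\d+ site:\.[a-z]{2,3}$"])
```

In this toy log, expansion only reaches the queries of the one host that issued the seed, while the regular expression also catches the similarly formatted query from a host that never issued a seed.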

------------------------------------ Stage 2 ------------------------------------------

After identifying a large number of malicious queries, in stage two, we analyze the malicious queries and the correlation between search and other attacks. In particular, we ask questions such as: why do attackers use Web search, how do they leverage search results, and who are the victims?

Answers to these questions not only help us better understand the attacks, but also provide us an opportunity to protect or notify potential victims before the actual attacks are launched, and hence stop attacks in their early stages.

Introduction
Enhanced detection capability: 500 seed queries become 4 million.
Low false-positive rates: 2%.
Ability to detect new attacks: forum spamming.
Facilitation of attack analysis: analyze a series of phishing attacks that lasted for more than one year.

Enhanced detection capability:

Using just 500 seed queries obtained from one hacker Web site, SearchAudit detects another 4 million malicious queries, some even before they are listed by hacker Web sites.

Low false-positive rates:

Over 99% of the captured malicious queries display multiple bot features, while less than 2% of normal user queries do.

Ability to detect new attacks:

While the seed queries are mostly ones used to search for Web site vulnerabilities, SearchAudit identifies a large number of queries belonging to a different type of attack: forum spamming.

Facilitation of attack analysis:

SearchAudit helps identify vulnerable Web sites that are targeted by attackers. In addition, SearchAudit helps analyze a series of phishing attacks that lasted for more than one year. These attacks set up more than 400 phishing domains, and tried to steal a large number of Windows Live Messenger user credentials.

Related Work
There's a significant amount of automated Web traffic on the Internet.
Other research showed that more than 3% of the entire search traffic may be generated by stealthy search bots.
What's the motivation of those search bots?
Search engine competitors
Studying search quality
Click fraud for monetary gain
Spreading infection (MyDoom, Santy)
Identifying victims

There is a significant amount of automated Web traffic on the Internet [5]. A recent study by Yu et al. showed that more than 3% of the entire search traffic may be generated by stealthy search bots [25].

One natural question to ask is: what is the motivation of these search bots? While some search bots have legitimate uses, e.g., by search engine competitors or third parties studying search quality [8, 17], others are malicious: attackers conduct click fraud for monetary gain [7, 10]. Recently, researchers have associated malicious searches with other types of attacks. For example, Provos et al. reported that worms such as MyDoom.O and Santy used Web search to identify victims for spreading infection [20].

Also, Moore et al. [16] identified four types of evil searches and showed that some Web sites were compromised shortly after evil searches. They showed that attackers searched for keywords like "phpizabi v0.848bc1 hfp1" to gather all the Web sites that have a known PHP vulnerability [9].

Subsequently these vulnerable Web servers were compromised to set up phishing pages.

----------------------------------------- better ------------------------------------------

Besides email spamming and phishing, there are many other types of attacks, e.g., malware propagation and Denial of Service (DoS) attacks. Although there is a wealth of attack-detection approaches, most of these attacks were studied in isolation.

Their correlations, especially to Web searches, have not been extensively studied. In this paper, we aim to take a step towards a systematic framework to unveil the correlations between malicious searches and many other attacks.

Related Work
Using regular expression patterns: Honeycomb, Polygraph, Hamsa

AutoRE (a way to generate regular expressions, taken from other research)

In [20], Provos et al. found malicious queries from the Santy worm by looking at search results. In those attacks, the attackers constantly changed the queries, but obtained similar search results (viz., the Web servers that are vulnerable to Santy's attack).

SearchAudit, on the other hand, is primarily targeted at finding new attacks, of which we have no prior knowledge. SearchAudit is thus a general framework to detect and understand malicious searches.

Architecture
Let attackers be our guides.
Follow their activities and predict their future attacks.

---------------------------- First Stage --------------------------------

In the first stage, SearchAudit examines search query logs, and expands the set of seed queries to generate additional sets of suspicious queries. This stage is automated and quite general, i.e., it can be used to find different types of suspicious queries pertaining to different malicious activities.

---------------------------- Second Stage ----------------------------

The second stage involves the analysis of these suspicious queries to see how different attacks are connected with search. This is mostly done manually, since it requires a significant amount of domain knowledge to understand the behavior of the different malicious entities. This section focuses on the first stage of our system and Sections 6, 7, and 8 provide examples of the analysis done in the second stage.

Architecture
Platform: Dryad/DryadLINQ
Query Expansion: take a small set of seed queries and expand them; extract IPs and search again
Regular Expression Generation
Signature Generation (AutoRE)
Eliminating Redundancies
Eliminating Proxies

Extending the seed using query logs appears to be a straightforward idea.

Yet, there are two challenges.

First, hackers do not always use the same queries; they modify and change query terms over time in order to obtain different sets of search results, and thereby identify new victims. Therefore, simply using a blacklist of bad queries is not effective.

Second, malicious searches may be mixed with normal user activities, especially on proxies. So we need to differentiate malicious queries from normal ones, though they may originate from the same machine or IP address.

To address these challenges, we do not simply use the suspicious queries directly, but instead generate regular expression signatures from these suspicious queries.

Regular expressions help us capture the structure of these malicious queries, which is necessary to identify future queries. We also filter regular expressions that are too general and therefore match both malicious and normal queries. Using these two approaches, the first stage of the system now consists of a pipeline of two steps: Query Expansion and Regular Expression Generation. Since any set of malicious queries could potentially lead to additional ones, we loop back these queries until we reach a fixed point with respect to query expansion. The rest of this section presents each of the stages in detail.
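
The loop-back until a fixed point can be illustrated with a small sketch. It assumes a hypothetical `expand_once` callable standing in for one full pass of Query Expansion plus Regular Expression Generation and matching; the round limit and the toy `links` table are illustrative assumptions, not part of the original system.

```python
def expand_to_fixed_point(seeds, expand_once, max_rounds=10):
    """Repeat expansion until a round adds no new queries (or rounds run out)."""
    known = set(seeds)
    for _ in range(max_rounds):
        new_queries = expand_once(known) - known
        if not new_queries:
            break  # fixed point: nothing new was discovered
        known |= new_queries
    return known

# Toy stand-in for one pipeline pass: each query "leads to" other queries.
links = {"a": {"b"}, "b": {"c"}, "c": set()}

def toy_expand_once(queries):
    found = set()
    for q in queries:
        found |= links.get(q, set())
    return found

result = expand_to_fixed_point({"a"}, toy_expand_once)
```

Starting from the single seed "a", the loop discovers "b", then "c", and stops on the next round because no new query appears.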

Architecture: Eliminating Redundancies
Algorithm: REGEX_CONSOLIDATE

Architecture: Eliminating Proxies
Most users in a geographical region have similar query patterns.

Queries from mostly legitimate users will have a large overlap with the popular queries from the same /16 IP prefix.

We label an IP as a proxy if the K most popular queries from that IP and the K most popular queries from that prefix overlap in at least m queries.
K = 100, m = 5
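
A minimal sketch of the proxy test follows, using the K = 100 and m = 5 values from the slide. The function names and the toy query lists are hypothetical, and treating the overlap as "at least m" is an assumption about how the threshold is applied.

```python
from collections import Counter

def top_k(queries, k):
    """Set of the k most frequent queries in a list of query strings."""
    return {q for q, _ in Counter(queries).most_common(k)}

def prefix16(ip):
    """The /16 prefix of a dotted IPv4 address, e.g. '10.1.2.3' -> '10.1'."""
    return ".".join(ip.split(".")[:2])

def is_proxy(ip_queries, prefix_queries, k=100, m=5):
    """Label an IP as a proxy if its popular queries overlap the prefix's."""
    overlap = top_k(ip_queries, k) & top_k(prefix_queries, k)
    return len(overlap) >= m

# Hypothetical data: popular queries seen across one /16 prefix.
prefix_queries = ["news", "weather", "maps", "mail", "sports", "games"] * 3
proxy_like = ["news", "weather", "maps", "mail", "sports", "inurl:index.php?id="]
bot_like = ["inurl:a.php?id=", "inurl:b.php?id=", "news"]

proxy_flag = is_proxy(proxy_like, prefix_queries)
bot_flag = is_proxy(bot_like, prefix_queries)
```

The first host shares five popular queries with its prefix and is labeled a proxy even though it also issued one suspicious query; the second shares only one and is not.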

Data Description and System Setup
Use 3 months of search logs from the Bing search engine: February 2009 (when it was known as Live Search), December 2009, and January 2010.
Each month of sampled data contains around 2 billion pageviews.
The 500 seed malicious queries are obtained from a hacker Web site, milw0rm.com.
It takes about 7 hours to process the 1.2 TB of sampled data.

Selection of REs
Use cookies to identify the malicious queries.
Benign proxies are eliminated.
Use a threshold to pick regular expressions based on their scores.

Detection Results: Effect of Query Expansion and Regular Expression Matching
Feeding the 500 malicious queries into SearchAudit, we find that 122 of the 500 queries appear in the dataset.
February 2009 dataset: 174 IPs issued these queries.
Use the result to feed our system again: 800 unique queries from 264 IPs.

Effect of Incomplete Seeds
Split the 122 seed queries into two sets:
100 queries that were first posted on milw0rm.com before 2009
22 queries that were posted in 2009

Looping Back Seed Queries
Use derived REs as new seeds to feed back as input to SearchAudit.

Overall Matching Statistics

Verification of Malicious Queries
We lack ground truth information about whether a query is malicious or not, so we:
Check whether the query is reported on any hacker Web sites.
Check query behavior: whether the query matches individual bot or botnet features.
For each query q returned by SearchAudit, issue a query "q AND (dork OR vulnerability)" to the search engine, and save the results.

Domains that list a large number of malicious searches from our set are likely to be security forums, blogs by security companies or researchers, or even hacker Web sites. These can now be used as new sources for finding more seed queries. We manually examine 50 of these Web sites, and find that around 60% of them are security blogs or advisories. The remaining 40% are in fact hacker forums. In all, 73% of the queries reported by SearchAudit contain search results associated with these 50 Web sites.

Verification of Queries Generated by Individual Bots
Two features help us to distinguish bot queries from human queries:
Cookie: Most bot queries do not enable cookies, resulting in an empty cookie field. For normal users who do not clear their cookies, all the queries carry the old cookies.
Link clicked: Many bots do not click any link on the result page. Instead, they scrape the results off the page.
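
The two individual-bot features can be computed per group of queries along these lines. This is a sketch under assumptions: the `QueryRecord` fields and the sample records are hypothetical and do not reflect the actual log schema.

```python
from dataclasses import dataclass

@dataclass
class QueryRecord:
    query: str
    cookie: str    # empty string when the request carried no cookie
    clicked: bool  # True if any link on the result page was clicked

def bot_feature_fractions(records):
    """Return (fraction with empty cookie, fraction with no click)."""
    total = len(records)
    if total == 0:
        return 0.0, 0.0
    no_cookie = sum(1 for r in records if not r.cookie) / total
    no_click = sum(1 for r in records if not r.clicked) / total
    return no_cookie, no_click

# Hypothetical records for queries matched by one regular expression.
sample_records = [
    QueryRecord("index.php?id=1", "", False),
    QueryRecord("index.php?id=2", "", False),
    QueryRecord("index.php?id=3", "", False),
    QueryRecord("index.php?id=4", "abc123", False),
]
no_cookie_fraction, no_click_fraction = bot_feature_fractions(sample_records)
```

A group whose queries mostly lack cookies and never lead to a click looks bot-generated under these two features.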

Verification of Queries Generated by Botnets
If most of the IPs that issued malicious queries exhibit similar behavior, then it's likely that all these IPs were running the same script.
User agent: contains information about the browser and the version used.
Metadata: records certain metadata that comes with the request.
Pages per query: records the number of search result pages retrieved per query.
Inter-query interval: denotes the time between queries issued by the same IP.

--------------------------------- First Two ------------------------------

Some botnets use a fixed user agent string or metadata, or choose from a set of common values. For each group, we check the percentage of IP addresses that have identical values or identical behavior, e.g., changing value for each request. If over 90% of the IPs show similar behavior, we infer that IPs in this group might have used the same script.

--------------------------------- Last Two ------------------------------

Queries generated by the same script may retrieve a similar number of result pages per query or have a similar inter-query interval. For these two features, we compute the median value for each IP address and then check whether there is only a small spread in this value across IP addresses (< 20%). This allows us to infer whether the different IPs follow the same distribution, and so belong to the same group.
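
The two kinds of group checks can be sketched as below. The 90% and 20% thresholds come from the notes above; everything else is an assumption made for illustration: only identical values are handled for the first check (not "identical behavior" such as a value that changes on every request), and "spread" is interpreted here as each IP's relative deviation from the median of the per-IP medians.

```python
from collections import Counter
from statistics import median

def share_same_value(value_by_ip, threshold=0.9):
    """True if over `threshold` of IPs report the same value (e.g. user agent)."""
    counts = Counter(value_by_ip.values())
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(value_by_ip) > threshold

def small_spread(samples_by_ip, max_spread=0.2):
    """True if every per-IP median lies within max_spread of the group center."""
    medians = [median(samples) for samples in samples_by_ip.values()]
    center = median(medians)
    if center == 0:
        return all(m == 0 for m in medians)
    return all(abs(m - center) / center < max_spread for m in medians)

# Hypothetical group: 19 of 20 IPs send the same user agent string.
user_agents = {f"ip{i}": "Mozilla/4.0" for i in range(19)}
user_agents["ip19"] = "other-agent"

# Hypothetical inter-query intervals (seconds) per IP.
similar_intervals = {"a": [10, 11, 12], "b": [10, 12, 14], "c": [9, 11, 13]}
different_intervals = {"a": [10, 10], "b": [30, 30]}

same_agent = share_same_value(user_agents)
same_timing = small_spread(similar_intervals)
mixed_timing = small_spread(different_intervals)
```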



Analysis of Detection Results
Large countries such as USA, Russia, and China are responsible for almost half the IPs issuing malicious queries.
Vulnerable Web Sites
Try to exploit these web sites by SQL injection: index.php?content=[?=#+;&:]{1,10}
Try to find particular software with known vulnerabilities: Power by
Forum spamming: /includes/joomla.php site:.[a-zA-Z]{2,3}
Windows Live Messenger phishing



Identifying Vulnerable Web Sites
Applications of Vulnerability Searches
Sample 5000 queries returned by SearchAudit.
For every query q we issue a query q dork vulnerability.
Obtain 80,490 URLs from 39,475 unique Web sites.
Compare this list of random Web sites against a list of known phishing or malware sites (PhishTank, Microsoft).
Test and show that many of these sites indeed have SQL injection vulnerabilities.

SQL Injection Vulnerabilities
For the malicious queries, we look at the search results and crawl all of the links twice.
First time, we crawl the link as is.
Second time, we add a single quote (').
If the two pages are identical, then it suggests that there's no obvious SQL injection vulnerability.
If the second page has any kind of SQL error, then there might exist an SQL injection vulnerability.
In 14,500 URLs, we find 1,760 URLs (12%) that may have an SQL injection vulnerability.
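
The comparison of the two crawled pages can be sketched as follows. The crawling itself is left out; the function takes the two page bodies as already-fetched strings. Appending the quote at the end of the URL, the list of error markers, and the "inconclusive" outcome are assumptions for illustration; the slide does not specify them.

```python
# Illustrative substrings that often appear in database error pages (assumed list).
SQL_ERROR_MARKERS = ("sql syntax", "sql error", "unclosed quotation mark")

def quoted_variant(url):
    """URL for the second crawl: the original link with a single quote added."""
    return url + "'"

def classify(plain_body, quoted_body):
    """Compare the page fetched as-is with the page fetched with a quote added."""
    if plain_body == quoted_body:
        return "no obvious vulnerability"
    lowered = quoted_body.lower()
    if any(marker in lowered for marker in SQL_ERROR_MARKERS):
        return "possibly vulnerable"
    return "inconclusive"

# Hypothetical page bodies standing in for the two crawl results.
same_page = classify("<html>item 1</html>", "<html>item 1</html>")
error_page = classify(
    "<html>item 1</html>",
    "You have an error in your SQL syntax near ''' at line 1",
)
changed_page = classify("<html>item 1</html>", "<html>not found</html>")
```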

Forum-Spamming Attacks
We manually identified 46 REs that are associated with forum spamming.



Applications of Forum Searching Queries
Using Project Honey Pot to identify Web spamming


Windows Live MSN Phishing
What is MSN phishing?
http://[a-zA-Z0-9._]*./http://?user=[a-zA-Z0-9._]*


Characteristics of Compromised Accounts


Conclusion
