[ieee 2011 international conference on communications, computing and control applications (ccca) -...

PhishBlock: A hybrid anti-phishing tool

Hossam M.A.Fahmy1 and Salma A.Ghoneim2 Faculty of Engineering – Ain Shams University Computer Engineering and Systems Department

Cairo-EGYPT [email protected]

[email protected]

Abstract - Phishing is a means of obtaining confidential information through fraudulent websites that appear to be legitimate. Anti-phishing detection techniques are either lookup-based or classifier-based. Lookup-based systems suffer from high false negatives, while classifier systems suffer from high false positives. To better detect fraudulent websites, we propose in this work an efficient hybrid system based on both lookup and a support vector machine classifier that checks features derived from a website's URL, text and linkage.

I. INTRODUCTION

Phishing is an online form of identity theft in which "phishers" pose as legitimate entities in order to trick users into revealing sensitive personal information. The term comes from the fact that phishers "fish" for credentials from legitimate users [1]. For each phishing campaign, phishers choose an established legitimate site to mimic and design the phishing site to imitate its look and feel. They then send potential victims emails that link to the phishing site; these emails typically include some "call to action" such as sending bank account information. The offenders have several means at their disposal: innocuous-looking links embedded in emails that redirect to fake sites, pop-up windows that encourage users to enter sensitive information, and URL masks that imitate real web addresses.

Accordingly, a need has arisen for more refined fake website detection techniques. A variety of methods can be used to identify a page as a phishing site, including whitelists (lists of known safe sites), blacklists (lists of known fraudulent sites), various heuristics that check whether a URL is similar to a well-known URL, and community ratings. Comprehensive surveys are available in [1,2,8,9]. It has also been shown that blacklists are becoming increasingly ineffective [13],[14] owing to the agility of the attackers.

In this work we propose PhishBlock, a hybrid system that blocks URLs listed on the blacklist, while its neural network based classifier evaluates websites that are not blacklisted. This leads to anti-phishing with high catch rates.

This work is organized as follows: Section II surveys some anti-phishing tools, Section III describes PhishBlock, the proposed hybrid tool, Section IV illustrates PhishBlock techniques, Section V presents the comparison study, and Section VI highlights the conclusion.

II. ANTI-PHISHING TOOLS

A. SpoofGuard

SpoofGuard is an anti-phishing toolbar developed at Stanford University [3]. SpoofGuard does not use whitelists or blacklists; instead, it employs a series of heuristics to identify phishing pages. SpoofGuard runs on Microsoft Windows 98/NT/2000/XP and, once installed, appears as a toolbar. When a user enters a username and password on a spoofed site that contains some combination of suspicious URLs, a misleading domain name, images from an honest site, and a username and password that have previously been used at an honest site, SpoofGuard intercepts the post and warns the user with a pop-up that stops the attack.

SpoofGuard uses configurable, weighted heuristics to determine the probability that a page is malicious. It analyzes URLs for patterns used by phishing sites, compares images on the site to those from popular domains, and hashes and compares post data to stored sensitive data. SpoofGuard also allows the user to configure the algorithmic weights in order to control the level of false positives. Subsequent studies found that SpoofGuard has a 90% true positive rate, but a 35-48% false positive rate [5], [6].

Chen and Guo [4] improved on the URL analysis work done in SpoofGuard. After studying 203 phishing emails collected from the APWG [7], they identified five categories of website link obfuscation employed by phishing sites and measured their frequency.

B. Microsoft Phishing Filter in Internet Explorer

This tool largely relies on a blacklist hosted by Microsoft. However, it also uses some heuristics when it encounters a site that is not on the blacklist. When a suspected phishing website is encountered, users are redirected to a built-in warning message and asked whether they would like to continue visiting the site or close the window. Users also have the option of using this feature to report suspected phishing sites or to report that a site has been incorrectly added to the blacklist [10].

C. Netcraft

The Netcraft anti-phishing toolbar uses several methods to determine the legitimacy of a website [11]. The toolbar traps suspicious URLs containing characters which have no common purpose other than to deceive. It also enforces display of browser navigation controls (tool and address bar) in all windows to defend against pop-up windows which attempt to hide the navigational controls. Netcraft clearly displays each site's hosting location, including the country, which helps the user evaluate fraudulent URLs. The Netcraft toolbar also uses a blacklist, which consists of fraudulent sites identified by Netcraft as well as sites submitted by users and verified by the company.

D. Firefox 2

Firefox 2.0 includes a new feature designed to identify fraudulent websites. Originally, this functionality was an optional extension for Firefox as part of the Google Safe Browsing toolbar. URLs are checked against a blacklist, which Firefox downloads periodically. The feature displays a pop-up if it suspects the visited site to be fraudulent and provides users with a choice of leaving the site or ignoring the warning. Optionally, the feature can send every URL to Google to determine if it is a scam. According to the Google toolbar download site, the toolbar combines advanced algorithms with reports about misleading pages from a number of sources [12].

III. PROPOSED TOOL

PhishBlock is a tool for dynamic, proactive detection of both spoofed and concocted websites, based on a hybrid of lookup and classifier systems in one simple, browser-independent, user-friendly application. PhishBlock is browser independent in the sense that the user does not need a separate anti-phishing tool for each browser. It is user friendly, with an easy-to-use GUI that requires no prior knowledge. PhishBlock is open source, which permits additions to the tool.

As shown in Figure 1, if a URL is neither whitelisted nor blacklisted in the proxy server, it is checked against the PhishTank, Escrow Fraud and Google blacklists. If it is not in any of them, it is delivered to the Support Vector Machine (SVM) for feature extraction, and the outcome is used to replenish the lists.

Figure 1. PhishBlock design

IV. TECHNICAL DESCRIPTION

A. PhishBlock Components

1) Lookup System: Lookup systems use a client-server architecture in which the server side maintains a blacklist of known fake URLs, while the client-side tool checks the blacklist and provides a warning if a website poses a threat. Lookup systems also consider URLs directly reported or rated by system users. Lookup systems typically have high precision since they are less likely to consider authentic sites as fake. They are also easier to implement than classifier systems. However, lookup systems are more susceptible to higher levels of false negatives.

In PhishBlock, we use three local lists (Blacklist, Whitelist and Suspicious list), and the URL to be tested is checked against each of them in turn. If it is found in any of them, the corresponding message is presented to the user (Fake, Safe or Suspicious); otherwise, the URL is passed to the global servers. PhishBlock uses three servers (PhishTank [15], Escrow Fraud [16] and Google). The URL is first passed to PhishTank, which reports it as either a fake or an unknown website. If it is fake, it is added to the PhishBlock blacklist; if it is unknown, it is passed to Escrow Fraud. The same procedure takes place in Escrow Fraud, where the URL is passed to Google if it is unknown and is otherwise added to the PhishBlock blacklist. A URL that remains unknown to all three servers is tested by the SVM.
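The lookup cascade described above can be sketched as follows. This is a minimal illustration, not the PhishBlock implementation: the class name LookupSystem and the callables standing in for PhishTank, Escrow Fraud and Google are hypothetical, and real services would be queried over the network.

```python
from typing import Callable, Iterable, Optional

# Hypothetical stand-in for a remote blacklist service: returns True when
# the URL is listed as fake and False when the service does not know it.
RemoteCheck = Callable[[str], bool]


class LookupSystem:
    """Illustrative local-lists-then-remote-servers lookup cascade."""

    def __init__(self, remote_checks: Iterable[RemoteCheck]):
        self.blacklist = set()
        self.whitelist = set()
        self.suspicious = set()
        # Order matters: e.g. PhishTank, then Escrow Fraud, then Google.
        self.remote_checks = list(remote_checks)

    def check(self, url: str) -> Optional[str]:
        """Return 'Fake', 'Safe' or 'Suspicious', or None if still unknown."""
        # The three local lists are consulted first.
        if url in self.blacklist:
            return "Fake"
        if url in self.whitelist:
            return "Safe"
        if url in self.suspicious:
            return "Suspicious"
        # Remote services are queried one after another; the first hit
        # is recorded in the local blacklist.
        for is_listed in self.remote_checks:
            if is_listed(url):
                self.blacklist.add(url)
                return "Fake"
        # Unknown everywhere: the caller hands the URL to the classifier.
        return None
```

Returning None for an unknown URL keeps the lookup stage separate from the classifier, so either stage can be replaced without touching the other.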

2) Classifier System: Classifier systems are client-side tools that apply rule- or similarity-based heuristics to website content or domain registration information. Classifier systems offer better coverage for spoof and concocted sites than lookup systems. They are proactive, capable of detecting fakes independent of blacklists. The main disadvantage is that classifier systems can take longer to classify webpages than lookup systems. They are also more prone to false positives.

The proposed PhishBlock classifier system is implemented using neural networks. Neural networks, with their remarkable ability to derive meaning from complicated or imprecise data, can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyze. This expert can then be used to provide projections given new situations of interest and answer "what if" questions. A support vector machine (SVM) is used as the classifier. SVMs are a set of related supervised learning methods used for classification [19]. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.
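To make the two-category training step concrete, the following is a minimal sketch of a linear SVM trained by sub-gradient descent on the regularised hinge loss. The paper does not state the kernel, training algorithm or hyperparameters used in PhishBlock, so the linear model, the function names and the default values here are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],  # +1 = phishing, -1 = legitimate (assumed coding)
    epochs: int = 200,
    learning_rate: float = 0.1,
    reg: float = 0.01,
) -> Tuple[List[float], float]:
    """Minimise hinge loss plus an L2 penalty by sub-gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The L2 penalty shrinks the weights on every step.
            weights = [w - learning_rate * reg * w for w in weights]
            if margin < 1:
                # The sample is misclassified or inside the margin:
                # move the separating hyperplane towards classifying it.
                weights = [w + learning_rate * y * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1
```

Each sample would be a vector of feature scores scaled to a comparable range; a production system would more likely rely on an established SVM library than on a hand-written trainer.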

3) Hybrid System: Hybrid systems combine classifier and lookup mechanisms. The system blocks URLs on the blacklist, while the classifier evaluates the others. PhishBlock uses a hybrid system for detecting fake websites, which leads to better results with high catch rates. Moreover, PhishBlock is the first hybrid tool that uses neural networks.
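The hybrid decision can be sketched as a lookup followed by a classifier fallback. The function name hybrid_verdict and both callables are hypothetical, and writing the classifier's verdict back to the local lists is an assumption about how the lists are replenished.

```python
from typing import Callable, Optional, Set


def hybrid_verdict(
    url: str,
    lookup: Callable[[str], Optional[str]],  # returns a verdict or None
    classify: Callable[[str], bool],         # True means "looks like phishing"
    blacklist: Set[str],
    whitelist: Set[str],
) -> str:
    """Use the lookup verdict when there is one, else ask the classifier."""
    verdict = lookup(url)
    if verdict is not None:
        return verdict
    # Unknown to the lookup stage: fall back to the classifier and record
    # its verdict so the next lookup for this URL is answered locally.
    if classify(url):
        blacklist.add(url)
        return "Fake"
    whitelist.add(url)
    return "Safe"
```

In this arrangement the classifier is consulted only for URLs the lookup stage cannot resolve, which limits the slower classification step to unknown sites.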

4) PhishBlock Checks:

URL Features: Nine features are tracked to detect suspicious URLs; some of them are found in [17], and a survey is found in [18]:

1. Numbers, a URL with more numbers is suspicious.

2. Http and Https, Https in a URL indicates a safe website.

3. Suspicious characters, a URL is checked for '@' or '_'; when found, the URL is regarded as suspicious.

4. Number of dots and dashes in a domain name, the dots and dashes in the domain name of a page's URL are counted.

5. Number of dots in a URL, as the number of dots in the site's URL increases, the phishing score increases.

6. Number of dashes and dots in a URL, the higher the sum of dashes and dots in the whole URL, the higher the probability of it being suspicious.

7. Hosting, a URL is checked for a free hosting domain name, which is a common feature of phishing URLs.

8. Number of slashes, the number of slashes in the URL adds to the phishing score.

9. IP Address, the URL is checked for an IP address; if one is found, the URL is regarded as suspicious.
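The nine URL checks above can be sketched as a single feature extractor. This is an illustration under stated assumptions: the function name url_features, the feature keys, the IPv4-only pattern and the FREE_HOSTS sample are hypothetical, and the paper does not say how the raw values are turned into a phishing score.

```python
import re
from urllib.parse import urlsplit

# Hypothetical sample list; the paper does not name the free hosts it checks.
FREE_HOSTS = ("freehost.example",)

# Simplifying assumption: only dotted-decimal IPv4 hosts are recognised.
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_features(url: str) -> dict:
    """Extract the nine raw URL features as counts and flags."""
    host = urlsplit(url).hostname or ""
    return {
        "digits": sum(ch.isdigit() for ch in url),                # 1
        "https": urlsplit(url).scheme == "https",                 # 2
        "suspicious_chars": "@" in url or "_" in url,             # 3
        "domain_dots_dashes": host.count(".") + host.count("-"),  # 4
        "url_dots": url.count("."),                               # 5
        "url_dots_dashes": url.count(".") + url.count("-"),       # 6
        "free_hosting": any(host == h or host.endswith("." + h)
                            for h in FREE_HOSTS),                 # 7
        "slashes": url.count("/"),                                # 8
        "ip_address": bool(IPV4.match(host)),                     # 9
    }
```

For example, url_features("http://192.168.10.5/secure-login/update.php") reports an IP-address host, nine digits and four slashes.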

Webpage Content Features: Nine features are checked in the webpage html source:

1. Image source, the image tag is checked as well as the image source.

2. Input tag, the input tag is checked for the word "password" which adds to the phishing score.

3. Link servers, given a webpage, the parser will extract attributes of “HREF:”, a counter to internal links and a counter to external links are implemented.

4. Link servers (external and internal), similar to the previous one but the sum of both links is calculated.

5. Title, the title is extracted from the HTML source and the domain name is extracted from the URL string. If the title does not contain the domain name, the phishing score is increased.

6. Form tag, given a webpage, the parser will extract the "action" parameter link of the "form" tag.

7. Dots and dashes in external links, the number of dots and dashes in each of the page's external links is counted.

8. Term frequency, title, keywords and descriptions are extracted from html body, the term frequency score of each is found.

9. External links against Google Whitelist, given a webpage, the parser will extract attribute of "HREF:", the external links are compared with Google whitelist.
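Several of the content checks above (link counters, password inputs, form actions and the title check) can be sketched with Python's standard HTML parser. The class and function names are hypothetical, and taking the second-to-last host label as the domain name is a naive assumption that fails for multi-part suffixes such as "co.uk".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class PageFeatureParser(HTMLParser):
    """Collects link, input, form and title information from a page."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlsplit(page_url).hostname or ""
        self.internal_links = 0
        self.external_links = 0
        self.password_inputs = 0
        self.form_actions = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a" and attr.get("href"):
            # Resolve relative links against the page URL before comparing.
            target = urlsplit(urljoin(self.page_url, attr["href"]))
            if (target.hostname or "") == self.page_host:
                self.internal_links += 1
            else:
                self.external_links += 1
        elif tag == "input":
            # Count inputs whose attributes mention the word "password".
            if any("password" in (v or "").lower() for v in attr.values()):
                self.password_inputs += 1
        elif tag == "form":
            self.form_actions.append(attr.get("action") or "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def content_features(page_url: str, html: str) -> dict:
    parser = PageFeatureParser(page_url)
    parser.feed(html)
    labels = parser.page_host.split(".")
    # Naive assumption: the domain name is the second-to-last host label.
    name = labels[-2] if len(labels) >= 2 else parser.page_host
    return {
        "internal_links": parser.internal_links,
        "external_links": parser.external_links,
        "total_links": parser.internal_links + parser.external_links,
        "password_inputs": parser.password_inputs,
        "form_actions": parser.form_actions,
        "title_has_domain": name.lower() in parser.title.lower(),
    }
```

The image, term-frequency and whitelist checks are omitted here because the paper does not give enough detail to sketch them without guessing.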

V. EXPERIMENTS AND RESULTS

A. Test Approach

A three-phase testing approach is adopted to examine PhishBlock:

1) Phase 1: Assesses and tests the anti-phishing tools based on lookup systems along with the lookup system used in PhishBlock (PhishTank, Google blacklist and Escrow Fraud database).

2) Phase 2: Assesses the PhishBlock SVM classifier against SpoofGuard, the sole classifier-based anti-phishing tool.

3) Phase 3: Combines the lookup and the classifier to build the proposed hybrid system and checks whether it enhances security and accuracy over time.

B. Test metrics

Phishing wastes effort and resources, so the most important metrics for anti-phishing tools are:

Catch rate (or true positive rate) [2], the number of phishing sites positively identified by an anti-phishing tool out of the total number of active phishing sites visited, with sites that had been taken down at the time of testing removed from the denominator.

False positive rate, the share of legitimate sites that are detected as fraudulent sites [2]. Ideally the tool should have a low false positive rate, since this rate reflects the error margin of the tool.
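The two metrics can be written as small helper functions. The function names and the counts in the usage note are illustrative only and are not results from this study.

```python
def catch_rate(detected: int, visited: int, taken_down: int = 0) -> float:
    """Phishing sites detected / active phishing sites visited.

    Sites already taken down at test time are removed from the denominator.
    """
    active = visited - taken_down
    if active <= 0:
        raise ValueError("no active phishing sites to evaluate")
    return detected / active


def false_positive_rate(flagged_legitimate: int, total_legitimate: int) -> float:
    """Legitimate sites wrongly flagged / all legitimate sites tested."""
    if total_legitimate <= 0:
        raise ValueError("no legitimate sites to evaluate")
    return flagged_legitimate / total_legitimate
```

With illustrative counts, a tool that flags 45 of 60 visited phishing sites, 10 of which were already offline, has a catch rate of 45 / 50 = 0.9, and flagging 2 of 100 legitimate sites gives a false positive rate of 0.02.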

C. Testing Tools and Environment

PhishBlock is examined against the IE8 phishing filter (not assessed before), the Netcraft toolbar, Firefox ver.3.3.6 with Google safe browsing, and SpoofGuard (which works only on IE6), the sole classifier-based anti-phishing tool available online.

An environment was adopted and a policy was devised to ensure proper comparisons:

1. All tests were implemented on the same OS (Windows XP).

2. IE8 is the web browser used (except for SpoofGuard, which works on IE6).

3. Verified fake sites must be tested within 12 hours of acquiring them to prevent broken links, since the lifetime of phishing websites is short.

4. Unverified fake sites must be tested within 2 hours of acquiring them to be able to truly test the impact of time.

5. Lookup tests are carried out between 8:00 am and 11:00 pm.

D. Test Cases

1) Lookup Test: examines the effectiveness of lookup tools and the impact of time on the catch rate (accuracy) of lookup systems. A screenshot was taken for every website and the toolbar warnings were monitored. A testbed of 200 unverified fake websites was pulled from PhishTank for this test. The examined tools are (Figure 2): the IE8 Phishing filter; Firefox 3.3.6 with Google safe browsing; the Netcraft toolbar (although it uses some domain registration name heuristics); and the PhishBlock lookup solution (PhishTank, Google blacklist and Escrow Fraud database).

Figure 2. Accuracy of lookup tools

2) Classifier Test: the core of PhishBlock is the SVM-based classifier, which applies neural network techniques to anti-phishing. 200 verified fake websites and 200 safe sites are used in this test to examine the catch rate and false positives. The examined tools are: the IE8 Phishing filter; Firefox 3.3.6 with Google safe browsing; the Netcraft toolbar; the PhishBlock SVM classifier; and SpoofGuard.

As seen in Figure 3, PhishBlock compares well with other tools at a 95% catch rate (accuracy), whereas the IE8 filter offers 75% accuracy. As for false positives, PhishBlock has a 0.1% false positive rate while SpoofGuard has 36%, which illustrates the power of the neural technique used in PhishBlock.

3) Hybrid Test (combining lookup and classifier systems): examines the impact of the hybrid technique on the performance of PhishBlock. We assessed this impact using the same testbed of 200 verified fake websites and 200 safe sites. The test was carried out every N hours, where N ranges from 1 to 24. Three different systems were compared: lookup systems, the PhishBlock classifier and the hybrid lookup+classifier approach. As observed in Figure 4, the hybrid approach outperforms both the classifier and the lookup systems and delivers higher accuracy over time.

VI. CONCLUSION

The proposed PhishBlock anti-phishing tool compares well with other lookup-based and classifier-based tools, providing 95% accuracy and a very low false positive rate (0.1%). Equipped with dynamic lookup and a neural network based SVM, PhishBlock provides the basis of a stronger hybrid tool. This study suggests that systems relying solely on lookup mechanisms, or classifier systems that utilize a small set of features, are ineffective in combating phishing.

Figure 3. Catch rate and false positives

Figure 4. Accuracy of the proposed hybrid tool

REFERENCES

[1] A.Abbasi and H.Chen, “A comparison of tools for detecting fake websites,” IEEE Computer Magazine, vol.42, n.10, October 2009, pp.78-86.

[2] Y.Zhang, S. Egelman, L. Cranor, and J.Hong, “Phinding phish: evaluating anti-phishing tools,” in Proceedings of the 14th Annual Network and Distributed System Security Symposium (NDSS 2007), San Diego, CA, 28 February-2 March, 2007.

[3] N. Chou, R. Ledesma, Y. Teraguchi, and J. Mitchell, “Client-side defense against web-based identity theft,” in Proceedings of the 11th Network and Distributed Systems Security (NDSS 2004), San Diego, CA, 5-6 February, 2004.

[4] J. Chen and C. Guo, “Online detection and prevention of phishing attacks,” in proceedings of Communications and Networking in China (ChinaCom'06), October 2006, pp.1-7.

[5] P. Likarish, E. Jung, D. Dunbar, T. Hansen, and J. Hourcade, “B-apt: bayesian anti-phishing toolbar,” in proceedings of the IEEE International Conference on Communications, (ICC '08), Beijing, China, May 2008, pp.1745-1749.

[6] Y. Zhang, J. I. Hong, and L. F. Cranor, “Cantina: a content-based approach to detecting phishing websites,” in proceedings of the ACM 16th International Conference on World Wide Web (WWW'07), NY, USA, 2007, pp.639-648.

[7] Anti-phishing Working Group, http://www.antiphishing.org/, April, 2009. Accessed: October 15, 2010.

[8] M. Jakobsson, S. Myers, Phishing and counter measures: understanding the increasing problem of electronic identity theft, Wiley, 2006.

[9] C. Ludl, S. Mcallister, E. Kirda, and C. Kruegel, “On the effectiveness of techniques to detect phishing sites,” in proceedings of the 4th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA '07), Springer-Verlag, Berlin, Heidelberg, 2007, pp.20-39.

[10] Microsoft Corporation, Anti-phishing technologies, http://www.microsoft.com/mscorp/safety/technologies/antiphishing/default.mspx. Accessed: October 15, 2010.

[11] Netcraft. Anti-phishing tool. http://toolbar.netcraft.com/. Accessed: October 15, 2010.

[12] Google, Inc. Google Safe Browsing for Firefox. http://www.google.com/tools/firefox/safebrowsing/. Accessed: October 15, 2010.

[13] A. Ramachandran, D. Dagon, and N. Feamster, “Can DNS-based blacklists keep up with bots?,” in Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS 2006), Mountain View, California, July 2006.

[14] Sheng, B. Wardman, G. Warner, L. F. Cranor, J. Hong, and C. Zhang, “An empirical analysis of phishing blacklists,” in Proceedings of the 6th Conference on Email and Anti-Spam (CEAS 2009), Mountain View, California, July 2009.

[15] Phishtank. Phishtank-join the fight against phishing, June 2009. http://www.phishtank.com/. Accessed: October 19, 2010.

[16] Home-Escrow Fraud prevention, Stop Escrow Fraud, March 2009, http://escrow-fraud.com/. Accessed: October 19, 2010.

[17] Michael Blasi, Techniques for detecting zero day phishing websites, MSc thesis, Iowa State University, Ames, Iowa, 2009.

[18] C. Ludl, S. Mcallister, E. Kirda, and C. Kruegel, “On the effectiveness of techniques to detect phishing sites,” in Proceedings of the 4th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA '07), Springer-Verlag, Berlin, Heidelberg, 2007, pp.20-39.

[19] N. Cristianini, J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, UK, 2000.
