Malicious URL Detection Using Machine Learning
TRANSCRIPT
![Page 1: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/1.jpg)
Beyond Blacklists: Malicious URL Detection Using Machine Learning
![Page 2: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/2.jpg)
Who am I?
• Information security investigator @ Cisco.
• Completed M.Tech. from IIT Jodhpur in 2014.
• Areas of interest include machine learning, computer vision and A.I.
• Email: [email protected]
![Page 3: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/3.jpg)
Malicious websites
Phishing: which one is real?
![Page 4: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/4.jpg)
Visiting Malicious Websites
![Page 5: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/5.jpg)
What do we want?
![Page 6: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/6.jpg)
Problem in a Nutshell
• URL features to identify malicious Web sites
  • No context, no content
• Different classes of URLs
  • Benign, spam, phishing, exploits, scams...
  • For now, distinguish benign vs. malicious
• Example: facebook.com vs. fblight.com
![Page 7: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/7.jpg)
Information about new websites
![Page 8: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/8.jpg)
State of the Practice
• Current approaches
  • Blacklists [SORBS, URIBL, SURBL, Spamhaus]
  • Learning on hand-tuned features [Garera et al., 2007]
• Limitations
  • Cannot predict unlisted sites
  • Cannot account for new features
• Arms race: fast feedback cycle is critical
  • More automated approach?
![Page 9: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/9.jpg)
URL Classification System
[Diagram labels: Label, Example, Hypothesis]
![Page 10: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/10.jpg)
Data Sets
• Malicious URLs
  • 5,000 from PhishTank (phishing)
  • 15,000 from Spamscatter (spam, phishing, etc.)
• Benign URLs
  • 15,000 from Yahoo Web directory
  • 15,000 from DMOZ directory
• Malicious × Benign → 4 data sets
  • 30,000 to 55,000 features per data set
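The "Malicious × Benign" pairing can be read as a cross product of the two malicious sources with the two benign sources. A minimal sketch, assuming a `benign-malicious` naming pattern (the deck only shows the name "Yahoo-PhishTank"; the other three names are extrapolated from it):

```python
from itertools import product

# Source names taken from the slide; the naming pattern is an assumption.
malicious_sources = ["PhishTank", "Spamscatter"]
benign_sources = ["Yahoo", "DMOZ"]

# Each data set pairs one benign source with one malicious source.
datasets = [f"{b}-{m}" for b, m in product(benign_sources, malicious_sources)]
print(datasets)
```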
![Page 11: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/11.jpg)
Algorithms
• Logistic regression w/ L1-norm regularization
  • Implicit feature selection
  • Easier to interpret
• Other models
  • Naive Bayes
  • Support vector machines (linear, RBF kernels)
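As a rough illustration of why L1-norm regularization gives implicit feature selection, here is a minimal sketch of logistic regression trained with proximal gradient descent (soft-thresholding). The function names, toy data, and hyperparameters are assumptions made for this example; it is not the training setup used in the study.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(w, t):
    # Proximal operator of t * |w|: shrinks w toward zero, clipping small values to exactly 0.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def train_l1_logreg(X, y, lam=0.05, lr=0.5, epochs=1000):
    """X: list of feature lists; y: list of 0/1 labels (1 = malicious)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Gradient step on the log-loss, then soft-thresholding for the L1 penalty.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the bias term is not penalized
    return w, b

# Toy data (hypothetical): feature 0 matches the label, feature 1 carries no signal.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
w, b = train_l1_logreg(X, y)
print(w, b)
```

On this toy data the weight on the uninformative second feature should end at zero while the informative one stays positive, which is the sense in which the penalty selects features and keeps the model easier to interpret.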
![Page 12: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/12.jpg)
Feature vector construction
![Page 13: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/13.jpg)
Features to consider?
1) Blacklists
2) Simple heuristics
3) Domain name registration
4) Host properties
5) Lexical
![Page 14: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/14.jpg)
(1) Blacklist Queries
• List of known malicious sites
• Providers: SORBS, URIBL, SURBL, Spamhaus
• Blacklist queries as features

http://www.bfuduuioo1fp.mobi → In blacklist? Yes
http://fblight.com → In blacklist? No
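One way to read "blacklist queries as features" is one binary feature per blacklist. A minimal sketch, assuming hypothetical in-memory host sets in place of real lookups (real providers are typically queried over DNS, which is not shown here; the list names and contents are made up for illustration):

```python
from urllib.parse import urlsplit

# Hypothetical snapshots standing in for real blacklist queries.
BLACKLISTS = {
    "list_a": {"www.bfuduuioo1fp.mobi"},
    "list_b": {"www.bfuduuioo1fp.mobi", "example-bad.test"},
}

def blacklist_features(url, blacklists=BLACKLISTS):
    """Return one binary feature per blacklist: 1 if the URL's host is listed."""
    host = (urlsplit(url).hostname or "").lower()
    return {
        f"blacklist:{name}": int(host in hosts)
        for name, hosts in sorted(blacklists.items())
    }

print(blacklist_features("http://www.bfuduuioo1fp.mobi"))
print(blacklist_features("http://fblight.com"))
```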
![Page 15: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/15.jpg)
(2) Manually-Selected Features
• Considered by previous studies
  • IP address in hostname?
  • Number of dots in URL
  • WHOIS (domain name) registration date

http://72.23.5.122/www.bankofamerica.com/
http://www.bankofamerica.com.qytrpbcw.stopgap.cn/
stopgap.cn registered 28 June 2009
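The first two heuristics can be computed from the URL string alone. A minimal sketch using only the Python standard library; the function and feature names are assumptions for illustration, and the WHOIS registration date is left out because it needs an external lookup:

```python
import ipaddress
from urllib.parse import urlsplit

def manual_features(url):
    """Two hand-picked heuristics computed from the URL string."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError if host is not an IP literal
        ip_in_hostname = 1
    except ValueError:
        ip_in_hostname = 0
    return {"ip_in_hostname": ip_in_hostname, "num_dots": url.count(".")}

print(manual_features("http://72.23.5.122/www.bankofamerica.com/"))
print(manual_features("http://www.bankofamerica.com.qytrpbcw.stopgap.cn/"))
```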
![Page 16: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/16.jpg)
(3) WHOIS Features
• Domain name registration
  • Date of registration, update, expiration
  • Registrant: Who registered domain?
  • Registrar: Who manages registration?

http://sleazysalmon.com
http://angryalbacore.com
http://mangymackerel.com
http://yammeringyellowtail.com
Registered on 29 June 2009 by SpamMedia
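WHOIS properties like these can be turned into features once a record has been fetched. A minimal sketch, assuming the record is already available as a plain dictionary; the keys, the registrar name `ExampleRegistrar`, and the observation date are hypothetical, and no WHOIS lookup is performed:

```python
from datetime import date

def whois_features(record, observed_on):
    """record: dict with hypothetical keys 'registered', 'registrar', 'registrant'."""
    feats = {"domain_age_days": (observed_on - record["registered"]).days}
    # Categorical values become indicator features, so domains that share
    # a registrant or registrar also share a feature.
    feats[f"registrar={record['registrar']}"] = 1
    feats[f"registrant={record['registrant']}"] = 1
    return feats

record = {
    "registered": date(2009, 6, 29),   # date from the slide
    "registrar": "ExampleRegistrar",   # hypothetical
    "registrant": "SpamMedia",         # registrant from the slide
}
print(whois_features(record, observed_on=date(2009, 7, 6)))
```

Because all four example domains share the same registration date and registrant, they would share these features even though their names have nothing in common.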
![Page 17: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/17.jpg)
(4) Host-Based Features
• Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)
• WHOIS: registrar, registrant, dates
• IP address: Which ASes/IP prefixes?
• DNS: TTL? PTR record exists/resolves?
• Geography-related: Locale? Connection speed?

facebook.com: 69.63.176.0/20
fblight.com: 75.102.60.0/22
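The "which IP prefix?" question maps naturally to one indicator feature per known prefix. A minimal sketch using the two prefixes shown on the slide; the function name and the test addresses are assumptions, and resolving a hostname to its IP address is not shown:

```python
import ipaddress

# Prefixes taken from the slide; a real system would learn many more from training data.
KNOWN_PREFIXES = [ipaddress.ip_network(p) for p in ("75.102.60.0/22", "69.63.176.0/20")]

def prefix_features(ip, prefixes=KNOWN_PREFIXES):
    """Return one binary feature per prefix: 1 if the address falls inside it."""
    addr = ipaddress.ip_address(ip)
    return {f"prefix={net}": int(addr in net) for net in prefixes}

# Hypothetical addresses chosen to fall inside each prefix.
print(prefix_features("69.63.180.10"))
print(prefix_features("75.102.61.5"))
```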
![Page 18: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/18.jpg)
(5) Lexical Features
• Tokens in URL hostname + path
• Length of URL
• Entropy of the domain name

http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
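The three lexical features can be sketched with the standard library alone. The delimiter sets used for tokenizing and the feature names are assumptions for this example, and the entropy here is character-level Shannon entropy of the hostname, which is one plausible reading of "entropy of the domain name":

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit

def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    n = len(text)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

def lexical_features(url):
    parts = urlsplit(url)
    host = parts.hostname or ""
    feats = {"url_length": len(url), "host_entropy": shannon_entropy(host)}
    # Bag-of-words tokens, kept separate for hostname and path.
    for tok in filter(None, re.split(r"[.\-]", host)):
        feats[f"host_token={tok}"] = 1
    for tok in filter(None, re.split(r"[/.?=\-_&]", parts.path)):
        feats[f"path_token={tok}"] = 1
    return feats

print(lexical_features("http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll"))
```

For the slide's example, a token such as `ebayisapi` in the path becomes its own feature, so a classifier can learn that it is suspicious when it appears outside the expected host.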
![Page 19: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/19.jpg)
Which feature sets?

| Feature set | # Features |
|---|---|
| Blacklist | 4 |
| Manual | 3 |
| WHOIS | 4,000 |
| Host-based | 13,000 |
| Lexical | 17,000 |
| Full | 30,000 |
| w/o WHOIS/Blacklist | 26,000 |
![Page 20: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/20.jpg)
Beyond Blacklists
[Figure: detection rate vs. false positive rate for the Blacklist and Full feature sets, Yahoo-PhishTank data set]
• Full features: higher detection rate for a given false positive rate
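Comparing classifiers by "detection rate for a given false positive rate" means fixing a false positive budget and asking how many malicious URLs are caught within it. A minimal sketch of that measurement; the function name and the scores below are made up for illustration and are not results from the study:

```python
def detection_rate_at_fpr(scores, labels, max_fpr):
    """Highest true positive rate achievable with false positive rate <= max_fpr.

    scores: higher means more likely malicious; labels: 1 = malicious, 0 = benign.
    Assumes at least one malicious and one benign example.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_tpr = 0.0
    for threshold in sorted(set(scores)):
        flagged = [s >= threshold for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
        if fp / neg <= max_fpr:
            best_tpr = max(best_tpr, tp / pos)
    return best_tpr

# Hypothetical classifier scores for six URLs.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(detection_rate_at_fpr(scores, labels, max_fpr=0.0))
```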
![Page 21: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/21.jpg)
Limitations
• False positives
  • Sites hosted in disreputable ISP
  • Guilt by association
• False negatives
  • Compromised sites
  • Free hosting sites
  • Hosted in reputable ISP
• Future work: Web page content
![Page 22: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/22.jpg)
Conclusion
• Detect malicious URLs with high accuracy
  • Only using URL
  • Diverse feature set helps: 86.5% w/ 18,000+ features
• Proof of concept working in lab
• Future work
  • Scaling up for deployment
![Page 23: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/23.jpg)
References
• Ma, Justin, et al. "Beyond blacklists: learning to detect malicious web sites from suspicious URLs." Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009.
![Page 24: Malicious url detection using machine learning](https://reader035.vdocuments.us/reader035/viewer/2022062503/588499f71a28ab26058b609b/html5/thumbnails/24.jpg)
Q & A