Post on 28-Mar-2015

© 2011 Cisco Systems, Inc. All rights reserved.

Applications of Machine Learning in Cisco Web Security

Richard Wheeldon PhD BSc

rwheeldo@cisco.com

Cisco Web Security

• Cisco, Ironport and ScanSafe

• Request time filtering
  • Categorization and classification
  • Reputation

• Response time filtering
  • Malware types and attack vectors
  • Malware detection
  • Dynamic classification

• Other challenges

The Ubiquitous Speaker Slide

• Richard Wheeldon
  • UCL Graduate in 1999
  • PhD from Birkbeck in 2003
  • Joined Cisco December 2009
  • http://www.rswheeldon.com/

• Acknowledgements
  • Steve Poulson - spoulson@cisco.com
  • Bryan Feeney - b.feeney@cs.ucl.ac.uk

Cisco, Ironport and ScanSafe

• Cisco
  • World’s leading network company

• Ironport
  • Leader in Anti-spam
  • Provide Web Security Appliances

• ScanSafe
  • World leader in “Security as a Service”
  • Scans 1.8 billion web requests a day
  • Blocks 32 million of them

We’re local

Previous MSc projects

• Tree Kernels for CFG similarity
  • Guangyan Song, 2010

• Fast computation of the Kernel of a Tree and applications to Semi-Supervised Learning
  • Malcolm Reynolds, 2009

• Comparing N-gram features for web page classification
  • Noureen Tejani, 2007

We’re hiring

• Positions
  • Software Developers
  • QA, Operations, Research

• Locations
  • ScanSafe
  • UK - Bedfont Lakes, Reading, Staines, Edinburgh
  • Galway, EMEA, US, Worldwide

• Graduate recruitment
  • http://www.cisco.com/go/universityjobs
  • http://www.cisco.com/careers/
  • rwheeldo@cisco.com

ScanSafe’s SaaS

1. Availability
   Time our service is available to scan traffic
   99.999% guaranteed availability

2. Latency
   Additional load time attributable to services
   Evaluated by 3rd party analysis

3. False Positives
   Pages that were blocked but should not have been

4. False Negatives
   Pages that were not blocked but should have been
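
As a rough illustration of what these four measures mean numerically, the sketch below converts the 99.999% availability figure into permitted downtime per year and computes false positive and false negative rates from a confusion matrix. The function names and the example counts are illustrative assumptions, not ScanSafe figures.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a service may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate).

    tp: bad pages blocked      fp: good pages blocked
    tn: good pages allowed     fn: bad pages allowed
    """
    return fp / (fp + tn), fn / (fn + tp)


print(round(allowed_downtime_minutes(99.999), 2))  # 5.26 minutes per year
# Hypothetical counts, chosen only to show the calculation.
print(error_rates(tp=90, fp=5, tn=995, fn=10))     # (0.005, 0.1)
```

Five nines leaves a little over five minutes of downtime per year, which is why availability and latency are listed alongside the two error rates.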

Risks of Unfiltered Content

• Software threats
  • Malware
  • Phishing
  • Botnets

• Business threats
  • Productivity Loss
  • Bandwidth congestion
  • Legal liability
  • Data Leaks

The Web vs. Email

Web                                   Email
Most web traffic is good              Most e-mail is bad
Easy to find safe sites               Easy to get Spam
Harder to get dangerous URLs          Harder to get examples of good mail
Blocking web sites is visible         Blocking email is invisible
Performance gain from white-listing   Performance gain from blocking
Very Real-Time (<2s)                  Not Real-Time (<N hrs)

Request time filtering

• Motivation
  • Quicker blocks save bandwidth and processing time
  • If the request is made, the damage may be done

• Techniques
  • Databases
  • Reputation
  • Rules
  • Trained systems

Category-based filtering

• Responsible for most blocks

• High-risk and high-traffic

• Manual categorizers

• 10 million URLs

• 97% of traffic

• 2 million porn sites

Web Reputation

[Figure: inputs (3rd-party feeds, spam hosts, databases) combined into a score between -10 and +10 (bad, neutral or good)]

• Feeds
  • Phishing sites
  • Malware sites

• Heuristics
  • In spam but not in ham
  • Age of domain registration
  • High traffic – e.g. Alexa 1000
  • Scanned but never blocked
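
One way to picture how feeds and heuristics like these could be folded into a single score in the -10 to +10 range is sketched below. This is a minimal sketch only: the `HostEvidence` fields, the weights and the age thresholds are all invented for illustration and are not the scoring used by any Cisco product.

```python
from dataclasses import dataclass


@dataclass
class HostEvidence:
    """Signals about one host (field names are hypothetical)."""
    in_phishing_feed: bool = False
    in_malware_feed: bool = False
    seen_in_spam: bool = False
    seen_in_ham: bool = False
    domain_age_days: int = 0
    high_traffic: bool = False           # e.g. listed among top sites
    scanned_never_blocked: bool = False


def reputation_score(e):
    """Combine evidence into a score clamped to [-10, +10].

    Every weight below is a made-up illustrative value.
    """
    score = 0.0
    if e.in_phishing_feed or e.in_malware_feed:
        score -= 10.0
    if e.seen_in_spam and not e.seen_in_ham:
        score -= 4.0
    if e.domain_age_days < 30:           # very recently registered
        score -= 2.0
    elif e.domain_age_days > 365:        # long-established domain
        score += 2.0
    if e.high_traffic:
        score += 4.0
    if e.scanned_never_blocked:
        score += 3.0
    return max(-10.0, min(10.0, score))


old_busy_site = HostEvidence(domain_age_days=2000, high_traffic=True,
                             scanned_never_blocked=True)
print(reputation_score(old_busy_site))   # 9.0
```

A hand-weighted sum is the simplest possible combiner; a trained model could replace the fixed weights while keeping the same inputs and output range.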

Web Reputation in the WSA

Keyword-based URL filtering

• Keyword rules
  • Fitness -> Health
  • Basketball -> Sport
  • Pizzeria -> Food
  • Restaurant -> Food
  • Whore -> Porn

• Strange URLs
  • whorepresents.com
  • therapistfinder.com
  • speedofart.com
  • expertsexchange.com
  • penisland.com
  • powergenitalia.it
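
The failure mode behind the strange URLs can be reproduced with a few lines of naive substring matching. The rules are the ones listed above; the function name and the `joespizzeria.example` host are hypothetical.

```python
# Keyword rules taken from the slide, lower-cased for matching.
KEYWORD_RULES = {
    "fitness": "Health",
    "basketball": "Sport",
    "pizzeria": "Food",
    "restaurant": "Food",
    "whore": "Porn",
}


def categorize_by_keyword(hostname):
    """Return every category whose keyword occurs as a substring."""
    host = hostname.lower()
    return {cat for kw, cat in KEYWORD_RULES.items() if kw in host}


print(categorize_by_keyword("joespizzeria.example"))  # {'Food'}
print(categorize_by_keyword("whorepresents.com"))     # {'Porn'}, a false positive
```

Substring matching has no notion of word boundaries, so "who represents" is indistinguishable from a porn keyword followed by "presents".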

17© 2011 Cisco Systems, Inc. All rights reerved.

Recognizing Porn URLs

• http://www.penisland.com

• Example of segmentation problem
  P('peni') × P('sland')
  P('penis') × P('land')
  P('pen') × P('island')

• Extends to classification
  P('penis') × P('land') × P(porn|'penis') × P(porn|'land')
  P('pen') × P('island') × P(not_porn|'pen') × P(not_porn|'island')
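
The products above can be evaluated directly. The sketch below enumerates every segmentation of a host name, scores each one jointly with a label using the same form of product, and keeps the best. All probabilities are toy values invented for illustration, so the winning answer reflects those numbers rather than any real model.

```python
from math import prod

# Toy numbers, invented for illustration only.
WORD_PROB = {"pen": 2e-3, "island": 1e-3, "penis": 5e-4, "land": 3e-3}
PORN_GIVEN_WORD = {"pen": 0.05, "island": 0.10, "penis": 0.95, "land": 0.30}
UNKNOWN_WORD_PROB = 1e-9   # penalty for pieces such as 'peni' or 'sland'
UNKNOWN_PORN_PROB = 0.5


def segmentations(text):
    """Yield every way of splitting text into contiguous non-empty pieces."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        for rest in segmentations(text[i:]):
            yield [text[:i]] + rest


def score(words, label):
    """P(w1) x P(w2) x ... x P(label|w1) x P(label|w2) x ..."""
    p_words = prod(WORD_PROB.get(w, UNKNOWN_WORD_PROB) for w in words)
    p_porn = [PORN_GIVEN_WORD.get(w, UNKNOWN_PORN_PROB) for w in words]
    if label == "porn":
        return p_words * prod(p_porn)
    return p_words * prod(1 - p for p in p_porn)


def classify(name):
    """Return the best-scoring (score, words, label) for a host name."""
    candidates = (
        (score(words, label), words, label)
        for words in segmentations(name)
        for label in ("porn", "not_porn")
    )
    return max(candidates, key=lambda c: c[0])


best_score, words, label = classify("penisland")
print(words, label)  # ['pen', 'island'] not_porn (with these toy numbers)
```

Exhaustive enumeration is exponential in the name length; it is acceptable for a short host name but a dynamic-programming search would be the usual choice at scale.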

18© 2011 Cisco Systems, Inc. All rights reerved.

Phishing and Malware Examples

• Phishing examples
  • http://pavpals-com-usaprewiwerluithaniirse.345.pl
  • http://82.195.143.18/onlinepaypal.com/
  • http://www.jetboatflush.com/~nfioemro/www.paypal.fr/webscrcmd=...

• Malicious examples:
  • www1.scan-projectrf.cz.cc
  • www1.scan-projectsi.cz.cc
  • www1.scan-projectst.cz.cc
  • www1.scan-projectte.cz.cc
  • www1.scan-projectti.cz.cc
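
URLs like these carry signals that can be read without fetching the page. The sketch below extracts a few lexical features using only the Python standard library; the feature set and the `BRANDS` watch-list are assumptions chosen for illustration, not the features used in production.

```python
import ipaddress
from urllib.parse import urlsplit

BRANDS = ("paypal",)  # hypothetical watch-list of impersonated brands


def is_ip(host):
    """True if host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url):
    """Extract simple lexical features from a URL (no network access)."""
    if "//" not in url:
        url = "http://" + url   # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "host_is_ip": is_ip(host),
        "brand_in_host": any(b in host for b in BRANDS),
        "brand_in_path": any(b in parts.path.lower() for b in BRANDS),
        "host_hyphens": host.count("-"),
        "host_labels": host.count(".") + 1,
    }


print(url_features("http://82.195.143.18/onlinepaypal.com/"))
print(url_features("http://pavpals-com-usaprewiwerluithaniirse.345.pl"))
```

The second example shows a limit of exact brand matching: the look-alike "pavpals" does not contain "paypal", so a real system would need fuzzy matching or learned features.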

19© 2011 Cisco Systems, Inc. All rights reerved.

Searchahead

• If we can identify bad URLs we can warn before the user clicks.

• Over 90% of new sites are visited as the result of an Internet search

[Figure: search results annotated as Acceptable, Uncategorized, Prohibited or Malicious]

20© 2011 Cisco Systems, Inc. All rights reerved.

Response Time Scanning

• Trusted sites are targets

• Strength-in-depth combination of commercial scanners and in-house technology.

[Figure: attack vectors (graphics, webmail, new web pages, blogs, ad links, links, comments, banner ads) and malware types (backdoors, rootkits, Trojan horses, keyloggers, worms)]

21© 2011 Cisco Systems, Inc. All rights reerved.

Exploited sites in recent years

• Facebook

• Times India

• Miami Dolphins

• Samsung

22© 2011 Cisco Systems, Inc. All rights reerved.

Nothing is safe – not even Twitter!

http://www.youtube.com/fslabs

23© 2011 Cisco Systems, Inc. All rights reerved.

Signature Databases

[Chart: signatures (millions), axis from 0 to 1.5, for 2006, 2007 and 2008]

• From 2006 to 2008, the F-Secure signature database grew from 250000 entries to 1.5 million

• The rate at which variants of viruses come out is growing rapidly

• No vendor can rely exclusively on signatures

24© 2011 Cisco Systems, Inc. All rights reerved.

Zero-hour protection

• Vendors take time to release signature updates
  • Win32.IstBar.jl trojan

• Outbreak Intelligence (OI) provides proactive threat detection

• A huge data set of traffic to be leveraged

25© 2011 Cisco Systems, Inc. All rights reerved.

How does OI use Machine Learning?

• Approaches
  • Malware detection
  • Anomaly detection
  • Dynamic categorization

• Techniques Employed
  • Supervised Learning
  • Unsupervised Learning
  • Sandboxing

26© 2011 Cisco Systems, Inc. All rights reerved.

Dynamic Classification

• Document classification across 80 categories
  • Increases coverage
  • Language identification

• Identifies inappropriate content
  • Porn is relatively easy
  • Phishing is harder – but not impossible?
  • Hate speech is harder still

27© 2011 Cisco Systems, Inc. All rights reerved.

DC for identifying malicious sites

• Automated tools generate malicious sites
  • Fake escrow
  • Fake pharmacy
  • Mule recruitment

• Examples from Richard Clayton’s 2010 FOSDEM talk
  • http://www.google.com/search?q=%22before+that+was+a+commercial+manager+of+a+large+corporation+engaged+in+electronics+production%22
  • http://www.google.com/search?q=%22as+the+most+trusted+escrow+service+on+the+internet%22
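
Because tool-generated sites reuse the same boilerplate, a page can be checked for overlap with known template text. The sketch below measures how many of a template's word shingles reappear in a page. The technique (shingle containment) is a stand-in chosen for illustration, not necessarily what Dynamic Classification does, and `PAGE` is an invented example; only `TEMPLATE` comes from the search query above.

```python
def shingles(text, k=4):
    """Set of k-word shingles from lower-cased, whitespace-split text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(template, page, k=4):
    """Fraction of the template's k-word shingles that also occur in page."""
    wanted = shingles(template, k)
    if not wanted:
        return 0.0
    return len(wanted & shingles(page, k)) / len(wanted)


# Phrase taken from the second search query on the slide.
TEMPLATE = "as the most trusted escrow service on the internet"
# Hypothetical page text, invented for this example.
PAGE = ("Welcome to Example Escrow, as the most trusted escrow service "
        "on the internet we protect buyers")

print(containment(TEMPLATE, PAGE))  # 1.0
```

Containment is used rather than symmetric similarity because a malicious page usually embeds the template inside other text.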

28© 2011 Cisco Systems, Inc. All rights reerved.

Malicious Executable Files

• The final stage of an attack is frequently downloading an executable

• Traditionally blocked using signatures

• We use a combination of signature-based scanners and machine-learning

29© 2011 Cisco Systems, Inc. All rights reerved.

Drive-by attacks

• Almost no-one opens executables from odd sources any more, so instead people use drive-by attacks.

• A normal file (e.g. Flash, PDF, Javascript, Image file) is crafted to exploit a vulnerability in a viewer or library and execute code embedded within the file.

30© 2011 Cisco Systems, Inc. All rights reerved.

Flash

“Symantec recently highlighted Flash for having one of the worst security records in 2009. We also know first hand that Flash is the number one reason Macs crash. We have been working with Adobe to fix these problems, but they have persisted for several years now. We don’t want to reduce the reliability and security of our iPhones, iPods and iPads by adding Flash”

Steve Jobs, April 2010

http://www.apple.com/hotnews/thoughts-on-flash/

31© 2011 Cisco Systems, Inc. All rights reerved.

The growing threat of Java

• Almost as common as Flash
  • 90% of PCs have Java
  • 700 000 JDK downloads per month
  • 3.48 Million JRE downloads per month

• Growth in known vulnerabilities
  • 29 patched in a single update (Oct 2010)
  • Growth in exploits reported by Sophos, Symantec, Microsoft and Cisco

• Signatures + Trained Scanlet

32© 2011 Cisco Systems, Inc. All rights reerved.

Detecting Malicious JavaScript

• Sandboxing
  • Behavioural checking
  • Good way to beat obfuscation techniques
  • Difficult to constrain

• Trained classification
  • Analyse features

33© 2011 Cisco Systems, Inc. All rights reerved.

Javascript Features

function v46f658f5e2260(v46f658f5e3226){ function v46f658f5e4207 () {return 16;} return(parseInt(v46f658f5e3226,v46f658f5e4207()));}function v46f658f5e61f4(v46f658f5e7174){ function v46f658f5ea0cd () {return 2;} var v46f658f5e813e=\'\';for(v46f658f5e9105=0; v46f658f5e9105<v46f658f5e7174.length; v46f658f5e9105+=v46f658f5ea0cd()){ v46f658f5e813e+=(String.fromCharCode(v46f658f5e2260(v46f658f5e7174.substr(v46f658f5e9105, v46f658f5ea0cd()))));}return v46f658f5e813e;} document.write(v46f658f5e61f4(\'3C5343524950543E77696E646F772E7374617475733D2\'));

The above is JavaScript, but where are the features? An exercise for the reader!
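
One possible start on that exercise, offered as a sketch rather than the answer: decode the hex payload to see what the script writes, and compute a few crude lexical features of the kind a trained classifier might consume. The feature names and choices are assumptions and are not the Outbreak Intelligence feature set.

```python
import math
import re
from collections import Counter

# Hex payload copied from the sample above. The sample is truncated, so the
# final digit has no partner and is dropped before decoding.
PAYLOAD = "3C5343524950543E77696E646F772E7374617475733D2"


def decode_hex(payload):
    """Decode pairs of hex digits to characters, ignoring a trailing odd digit."""
    even = payload[: len(payload) - len(payload) % 2]
    return bytes.fromhex(even).decode("latin-1")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in Counter(text).values())


def js_features(source):
    """A few candidate lexical features (illustrative only)."""
    # Crude tokenizer: also picks up keywords and fragments of literals.
    identifiers = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", source)
    hex_runs = re.findall(r"[0-9A-Fa-f]{16,}", source)
    return {
        "length": len(source),
        "entropy": shannon_entropy(source) if source else 0.0,
        "mean_identifier_length": (
            sum(map(len, identifiers)) / len(identifiers) if identifiers else 0.0
        ),
        "longest_hex_run": max(map(len, hex_runs), default=0),
        "calls_document_write": "document.write" in source,
        "calls_from_char_code": "fromCharCode" in source,
        "calls_eval": bool(re.search(r"\beval\s*\(", source)),
    }


print(decode_hex(PAYLOAD))  # <SCRIPT>window.status=
```

The decoded prefix shows the script writing a further script tag into the page; long hex runs, machine-generated identifiers and `String.fromCharCode` combined with `document.write` are the kind of signals the features try to capture.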

34© 2011 Cisco Systems, Inc. All rights reerved.

Obfuscation

• Attackers use obfuscation
  • But so do legitimate vendors (e.g. Google)
  • And large Web 2.0 libraries

• Techniques include
  • Name changes
  • String concatenation (eval)
  • Dynamically loaded/generated/decrypted code (eval)
  • Splitting functionality across files

35© 2011 Cisco Systems, Inc. All rights reerved.

Malicious Non-Executable Files

• There are a lot of file formats out there – documents, pictures, videos.

• For zero-day attacks, we have no data to compare against.

• Basically this is anomaly detection.
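
A minimal sketch of the anomaly-detection idea: model a feature of known-good files and flag anything far from that model. Here the single feature is byte entropy and the model is a mean and standard deviation. Both choices, along with the class name and the threshold, are assumptions made to keep the example short; a real detector would use format-specific features.

```python
import math
import statistics
from collections import Counter


def byte_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in Counter(data).values())


class EntropyAnomalyDetector:
    """Flags files whose entropy is far from that of known-good files."""

    def __init__(self, benign_files, threshold=3.0):
        values = [byte_entropy(f) for f in benign_files]
        self.mean = statistics.mean(values)
        # stdev needs at least two samples; avoid dividing by zero later.
        self.stdev = statistics.stdev(values) or 1e-9
        self.threshold = threshold

    def is_anomalous(self, data):
        """True if the z-score of the file's entropy exceeds the threshold."""
        z = abs(byte_entropy(data) - self.mean) / self.stdev
        return z > self.threshold


# Toy "known-good" corpus of plain text, invented for this example.
benign = [b"hello world " * 10,
          b"the quick brown fox " * 10,
          b"lorem ipsum dolor sit amet " * 10]
detector = EntropyAnomalyDetector(benign)
print(detector.is_anomalous(bytes(range(256))))  # True: uniform random-looking bytes
```

Only good examples are needed for training, which is the point when no malicious samples exist yet. Entropy alone would misfire on compressed formats, where high entropy is normal.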

36© 2011 Cisco Systems, Inc. All rights reerved.

Development Constraints

• Low False Positive Rate

• Robust
  • Tolerant against malformed data
  • Language-agnostic

• Scalable
  • 1.8 Billion requests per day on 1000 servers

• Low latency
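
The scalability figures translate into a per-server load with simple arithmetic. The calculation below assumes traffic is spread evenly over the day and across servers, which real traffic is not, so peak load will be higher.

```python
REQUESTS_PER_DAY = 1.8e9   # from the slide
SERVERS = 1000             # from the slide
SECONDS_PER_DAY = 24 * 60 * 60

total_rps = REQUESTS_PER_DAY / SECONDS_PER_DAY
per_server_rps = total_rps / SERVERS

print(f"{total_rps:,.0f} requests/s overall")          # 20,833 requests/s overall
print(f"{per_server_rps:.1f} requests/s per server")   # 20.8 requests/s per server
```

At roughly 21 requests per second per server on average, each server has on the order of tens of milliseconds of processing time per request, which is what makes the low-latency constraint bite.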

37© 2011 Cisco Systems, Inc. All rights reerved.

Back-end processing

[Diagram: AM scanners, URL blacklists and AV scanners mark files as bad; file whitelists, no AV hits and URL whitelists mark files as good; behavioural features and content features feed ML, which outputs bad or good]

• If a technique is too slow for real-time scanning, that doesn’t make it useless.

• Back end processing can generate lists of good and bad files and help evaluate new techniques.
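
A sketch of how offline signals might be turned into good/bad labels for evaluating or training a classifier. The `Observation` fields and the precedence rule (any scanner hit or blacklist entry wins over a whitelist) are assumptions for illustration; the slides do not say how the signals are combined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Offline evidence about one file (field names are hypothetical)."""
    av_hits: int = 0               # anti-virus scanner detections
    am_hits: int = 0               # anti-malware scanner detections
    url_blacklisted: bool = False
    url_whitelisted: bool = False
    file_whitelisted: bool = False


def offline_label(obs):
    """Return 'bad', 'good', or None when the evidence is inconclusive."""
    if obs.av_hits or obs.am_hits or obs.url_blacklisted:
        return "bad"
    # Reaching here means no scanner hits; require a whitelist as well.
    if obs.file_whitelisted or obs.url_whitelisted:
        return "good"
    return None


def build_training_set(samples):
    """Keep only (features, label) pairs whose label could be decided."""
    labelled = ((features, offline_label(obs)) for features, obs in samples)
    return [(features, label) for features, label in labelled if label]


print(offline_label(Observation(av_hits=2)))             # bad
print(offline_label(Observation(url_whitelisted=True)))  # good
print(offline_label(Observation()))                      # None
```

Unlabelled samples are dropped rather than guessed, so slow but accurate back-end checks supply ground truth without affecting real-time latency.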

38© 2011 Cisco Systems, Inc. All rights reerved.

Want to know more?

• Cisco 2Q10 Global Threat Report http://www.cisco.com/web/about/security/intelligence/cisco_threat_072610_959.pdf

• Richard Clayton : Evil on the Internet http://www.securitytube.net/Phishing-(Evil-on-the-Internet)-FOSDEM-Talk-video.aspx

• Kaspersky Lab Security News Service http://threatpost.com/

• A plan for Spam http://www.paulgraham.com/spam.html

39© 2011 Cisco Systems, Inc. All rights reerved.

Still want to know more?

• Identifying Suspicious URLs : An Application of Large-Scale Online Learning http://videolectures.net/icml09_ma_isu/

• Peter Norvig Google : Statistical Learning as the Ultimate Agile Development Tool http://videolectures.net/cikm08_norvig_slatuad/

• Writing ClamAV Signatures Alain Zidouemba http://www.clamav.net/doc/webinars/Webinar-Alain-2009-03-04.ppt

40© 2011 Cisco Systems, Inc. All rights reerved.

Take Home Messages

• Web Security
  • Challenging and interesting domain
  • Many applications for Machine Learning

• ScanSafe and Cisco
  • Many opportunities for collaboration
  • Several opportunities for student projects

Any Questions?

rwheeldo@cisco.com