Web spamming: Detecting Spam Web Pages through Content Analysis
Alexandros Ntoulas et al., 2006, International World Wide Web Conference
TRANSCRIPT

Page 1:

Web spamming

Detecting Spam Web Pages through Content Analysis

Alexandros Ntoulas et al, 2006, International World Wide Web Conference

Page 2:

• Link stuffing: to game link-based ranking, black-hat SEO techniques include the creation of extraneous pages that link to a target page

• Keyword stuffing: the content of other pages may be "engineered" so as to appear relevant to popular searches

Page 3:

Figure 1: An example spam page; although it contains popular keywords, the overall content is useless to a human user

Page 4:

Web spam

• The practices of crafting web pages for the sole purpose of increasing the ranking of these or some affiliated pages, without improving the utility to the viewer, are called “web spam”.

Page 5:

Why engage in web spamming?

• First, spammers get search engines to rank spam sites highly, drawing web searchers to those sites for economic gain.

• Second, by getting spam sites surfaced in results, spammers erode users' trust in the search engine's quality; in effect, an attack on the search engine itself.

• Finally, spam pages make a search engine waste storage space, processing time, and network resources. – 1/7 of English-language pages are spam

Page 6:

Importance of detecting web spam

• Creating an effective spam detection method is a challenging problem. – Given the size of the web, such a method has to be automated.

– However, while detecting spam, we have to ensure that we identify spam pages alone, and that we do not mistakenly consider legitimate pages to be spam.

– At the same time, it is most useful if we can detect that a page is spam as early as possible, and certainly prior to query processing. In this way, we can allocate our crawling, processing, and indexing efforts to non-spam pages, thus making more efficient use of our resources.

Page 7:

Web spamming techniques

Page 8:

• Web Spam Taxonomy, by Zoltán Gyöngyi and Hector Garcia-Molina, Stanford University. First International Workshop on Adversarial Information Retrieval on the Web, May 2005

Page 9:

Term Spamming

• p: page, q: query words
• TF(t): the number of occurrences of term t in the document
• IDF(t): the inverse document frequency, derived from the number of documents that contain term t

• Term spamming targets search engines whose ranking algorithms are based on TF-IDF scores.
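As a concrete illustration of why TF-IDF-based ranking is attackable, here is a minimal sketch (the function name and toy corpus are mine, not from the paper) showing how repeating a query term inflates a page's score:

```python
import math
from collections import Counter

def tfidf_score(page_terms, query_terms, corpus):
    """Score a page against a query with a simple TF-IDF sum.

    TF(t): occurrences of t in the page.
    IDF(t): log(N / df(t)), where df(t) is the number of
    corpus documents containing t.
    """
    n_docs = len(corpus)
    tf = Counter(page_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for doc in corpus if t in doc)
        if df == 0:
            continue  # term appears nowhere; contributes nothing
        score += tf[t] * math.log(n_docs / df)
    return score

# Toy corpus: each document is its set of terms.
corpus = [
    {"cheap", "cameras", "lens"},
    {"weather", "report"},
    {"news", "sports"},
    {"recipes", "cooking"},
]
query = ["cheap", "cameras"]
honest = "we sell cheap cameras and lenses".split()
stuffed = ("cheap cameras " * 20).split()  # keyword stuffing
assert tfidf_score(stuffed, query, corpus) > tfidf_score(honest, query, corpus)
```

Because TF grows without bound while IDF is fixed, simply repeating the target terms dominates the ranking, which is exactly the weakness term spamming exploits.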

Page 10:

Term Spamming
• Body / title / meta tag / anchor text

<meta name="keywords" content="buy, cheap, cameras, lens, accessories, nikon, canon">

<a href="target.html">free, great deals, cheap, inexpensive, cheap, free</a>

• URL spam: buy-canon-rebel-20d-lens-case.camerasx.com, buy-nikon-d100-d70-lens-case.camerasx.com

Page 11:

How term spamming is done
• Repetition of one or a few specific terms
• Dumping of a large number of unrelated terms
• Weaving of spam terms into copied contents

• Phrase stitching is also used by spammers to create content quickly
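The "weaving" technique above can be sketched as follows; the `weave` helper is hypothetical, purely to illustrate interleaving spam terms into legitimate copied text:

```python
import random

def weave(copied_text, spam_terms, every=4, seed=0):
    """Weaving sketch: insert a randomly chosen spam term
    after every few words of legitimately copied content,
    so the page still reads as (mostly) real text."""
    rng = random.Random(seed)
    words = copied_text.split()
    out = []
    for i, w in enumerate(words, 1):
        out.append(w)
        if i % every == 0:
            out.append(rng.choice(spam_terms))
    return " ".join(out)

legit = "the quick brown fox jumps over the lazy dog"
spammed = weave(legit, ["cheap", "free", "deals"])
```

The copied words survive in order, so copied-content detection by exact match fails, while the injected terms still boost the page for spam queries.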

Page 12:

Link Spamming

• Link spamming exploits the workings of the PageRank algorithm by manipulating a page's outgoing and incoming links.

Page 13:

Outgoing links

• A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page's hub score.

• At the same time, the most widespread method for creating a massive number of outgoing links is directory cloning: one can find on the World Wide Web a number of directory sites, some larger and better known (e.g., the DMOZ Open Directory, dmoz.org, or the Yahoo! directory, dir.yahoo.com).

Page 14:

Incoming links

• Create a honey pot, a set of pages that provide some useful resource (e.g., copies of some Unix documentation pages), but that also have (hidden) links to the target spam page(s).

• Post links on blogs, unmoderated message boards, guest books, or wikis: spammers may include URLs to their spam pages as part of the seemingly innocent comments/messages they post.

Page 15:

Hiding Techniques: Content Hiding

Page 16:

Hiding Techniques: Cloaking

If spammers can clearly identify web crawler clients, they can adopt the following strategy, called cloaking: given a URL, spam web servers return one specific HTML document to a regular web browser, while they return a different document to a web crawler. This way, spammers can present the ultimately intended content to the web users (without traces of spam on the page), and, at the same time, send a spammed document to the search engine for indexing.
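A minimal sketch of the server-side logic described above; the `serve` function and the crawler markers are my illustration (real cloakers typically also match crawler IP ranges, not just the User-Agent header):

```python
def serve(url, user_agent, pages):
    """Cloaking sketch: return a different document to a web
    crawler than to a regular browser, keyed on the User-Agent."""
    crawler_markers = ("googlebot", "bingbot", "msnbot", "slurp")
    is_crawler = any(m in user_agent.lower() for m in crawler_markers)
    variant = "crawler" if is_crawler else "browser"
    return pages[url][variant]

pages = {"/deals.html": {
    # spammed copy, sent only to the search engine for indexing
    "crawler": "cheap cameras cheap lens free deals cheap cameras",
    # clean copy, shown to human visitors
    "browser": "Welcome! Browse our camera catalogue.",
}}
assert serve("/deals.html", "Mozilla/5.0 (compatible; Googlebot/2.1)", pages) != \
       serve("/deals.html", "Mozilla/5.0 (Windows NT 10.0)", pages)
```

This is why search engines sometimes crawl with browser-like User-Agents from non-datacenter IPs: fetching the same URL both ways and comparing the responses exposes the cloak.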

Page 17:

Hiding Techniques: Redirection

Page 18:

Spam occurrence per top-level domain

• 105,484,446 web pages, collected by the MSN Search crawler during August 2004.

Page 19:

Spam occurrence per language in our data set.

Page 20:

Prevalence of spam - number of words on page

Page 21:

Prevalence of spam - number of words in title

Page 22:

Prevalence of spam - average word-length of page

Page 23:

Prevalence of spam - visible content on page

Page 24:

Prevalence of spam - compressibility of page
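The content features named in the preceding slides (words on page, words in title, average word length, visible-content fraction, compressibility) can be sketched roughly as below. The paper's exact feature definitions differ in detail, so treat this as an approximation, not the authors' implementation:

```python
import zlib

def content_features(title, body_text, html):
    """Approximate per-page content features: word counts,
    average word length, visible-content fraction, and a
    compressibility measure (compressed size / raw size;
    highly repetitive spam text compresses unusually well,
    giving a lower ratio here)."""
    words = body_text.split()
    raw = body_text.encode("utf-8")
    return {
        "words_on_page": len(words),
        "words_in_title": len(title.split()),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "visible_fraction": len(body_text) / max(len(html), 1),
        "compression_ratio": len(zlib.compress(raw)) / max(len(raw), 1),
    }

spammy = content_features("buy cheap cameras cheap",
                          "cheap cameras " * 200,
                          "<html>" + "cheap cameras " * 200 + "</html>")
normal = content_features("Camera reviews",
                          "We tested twelve mirrorless cameras across many scenarios.",
                          "<html>We tested twelve mirrorless cameras across many scenarios.</html>")
assert spammy["compression_ratio"] < normal["compression_ratio"]
```

No single feature separates spam from non-spam cleanly, which is why the paper feeds them all into a classifier rather than thresholding any one of them.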

Page 25:

Classification model to detect spam

Page 26:

• Given the training set DS, we generate N training sets by sampling n random items with replacement.

• For each of the N training sets, we now create a classifier, thus obtaining N classifiers.

• In order to classify a page, we have each of the N classifiers provide a class prediction, which is considered as a vote for that particular class.

• The eventual class of the page is the class with the majority of the votes
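The steps above can be sketched as follows. The bootstrap sampling and majority vote follow the slide; the toy threshold learner is my invention, standing in for the paper's decision-tree base classifier:

```python
import random
from collections import Counter

def bag_train(train_set, n_classifiers, learn, seed=0):
    """Bagging sketch: draw N bootstrap samples (sampling with
    replacement) from the training set DS, and train one
    classifier on each sample."""
    rng = random.Random(seed)
    n = len(train_set)
    return [learn([rng.choice(train_set) for _ in range(n)])
            for _ in range(n_classifiers)]

def bag_classify(classifiers, page):
    """Each of the N classifiers votes; the majority class wins."""
    votes = Counter(clf(page) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy base learner (hypothetical): threshold on a single numeric
# feature, e.g. the fraction of the page taken by its top word.
def learn(sample):
    threshold = sum(x for x, _ in sample) / len(sample)
    return lambda page: "spam" if page > threshold else "non-spam"

ds = [(0.9, "spam"), (0.8, "spam"), (0.1, "non-spam"), (0.2, "non-spam")]
ensemble = bag_train(ds, n_classifiers=5, learn=learn)
assert bag_classify(ensemble, 0.95) == "spam"
assert bag_classify(ensemble, 0.05) == "non-spam"
```

Because each classifier sees a different resample, their errors are partly independent, and the majority vote is more robust than any single classifier.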

Page 27:

Bagging & Boosting

Confusion matrix (rows: actual class, columns: predicted class):

                   Predicted spam    Predicted non-spam
Actual spam              A                   B
Actual non-spam          C                   D
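Assuming the usual orientation (rows actual, columns predicted), precision and recall for the spam class follow directly from the four cells; the counts below are made up for illustration:

```python
def precision_recall(A, B, C, D):
    """From the confusion matrix:
    A = spam predicted as spam, B = spam predicted as non-spam,
    C = non-spam predicted as spam, D = non-spam predicted as non-spam."""
    precision = A / (A + C)  # of pages flagged as spam, how many really are
    recall = A / (A + B)     # of actual spam pages, how many were caught
    return precision, recall

p, r = precision_recall(A=80, B=20, C=10, D=890)  # hypothetical counts
```

C is the costly cell for a search engine: each unit of C is a legitimate page wrongly treated as spam, which is why the talk stresses keeping false positives low.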

Page 28:

Challenges in Web Information Retrieval

Mehran Sahami, Vibhu Mittal, Shumeet Baluja, Henry Rowley

Google Inc.

Page 29:

Information Retrieval on the Web

• Goal: identify which pages are of high quality and relevance to a user's query.
– PageRank, HITS

• Two challenges
– Adversarial classification: detecting web spamming
– Evaluating search results

Page 30:

PageRank

• Assume four web pages: A, B, C, and D.
• The initial values of PageRank:
– PR(A) = PR(B) = PR(C) = PR(D) = 0.25
• PageRank for any page u: PR(u) = Σv∈Bu PR(v)/Nv
• Bu = {v | v links to page u}
• Nv = the number of links from page v

Page 31:

PR(A) = PR(C)/1

PR(B) = PR(A)/2

PR(C) = PR(A)/2 + PR(B)/1+PR(D)/1

PR(D) = 0
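These update equations can be iterated to a fixed point. A small sketch, assuming the simplified PageRank of the previous slide (no damping factor) and the link graph the equations imply:

```python
def pagerank(links, iterations=100):
    """Iterate the simplified (no damping) PageRank update
    PR(u) = sum over v in Bu of PR(v)/Nv."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}  # start at 0.25 each
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, outs in links.items():
            for u in outs:
                new[u] += pr[v] / len(outs)  # v passes PR(v)/Nv to each target
        pr = new
    return pr

# The graph implied by the slide's equations:
# A -> B, C;  B -> C;  C -> A;  D -> C  (nothing links to D)
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pr = pagerank(links)
```

At the fixed point PR(A) = PR(C) = 0.4, PR(B) = 0.2, and PR(D) = 0: since nothing links to D, its rank drains away entirely, which is exactly the property link spammers attack by manufacturing incoming links.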

Page 32:

Determining the relatedness of fragments of text

• E.g.: "Captain Kirk" & "Star Trek" are more similar than "Captain Kirk" & "Fried Chicken".

• How do we measure the closeness between two phrases?

• K(x,y) =
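The transcript omits the definition of K(x, y). As a generic stand-in for the idea of a similarity kernel (not the authors' actual kernel), cosine similarity over context-term vectors captures the "Captain Kirk"/"Star Trek" example; the context terms below are hypothetical, standing in for text associated with each phrase:

```python
import math
from collections import Counter

def cosine(x_terms, y_terms):
    """Cosine similarity between two bags of terms:
    dot product of their term-frequency vectors divided
    by the product of the vector norms."""
    x, y = Counter(x_terms), Counter(y_terms)
    dot = sum(x[t] * y[t] for t in x)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

# Hypothetical context terms gathered for each phrase
# (e.g., from text that co-occurs with it).
kirk = ["star", "trek", "enterprise", "captain", "spock"]
trek = ["star", "trek", "spock", "federation", "enterprise"]
chicken = ["fried", "chicken", "recipe", "crispy", "batter"]
assert cosine(kirk, trek) > cosine(kirk, chicken)
```

Whatever the exact kernel, the point of the slide is the same: relatedness is computed from the contexts in which phrases occur, not from the phrase strings themselves.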

Page 33:
Page 34:

Retrieval of UseNet Articles

• at least 800 million documents

Page 35:

Retrieval of Images and Sounds

• Non-textual "documents"
– from digital still and video cameras, camera phones, audio recording devices, and mp3 music