Web Spam Final
-
8/2/2019 Web Spam Final
UNDER THE GUIDANCE OF
MR. KHASIM D VALI
NAVYA R
8th SEM CS
-
CONTENTS
What is web spam?
Boosting techniques
Term spamming techniques and targets
Link spamming techniques
Hiding techniques
Major types of spamming
Spam filters
Conclusion
-
What is web spam?
Spamming is any deliberate action taken solely in order to boost a web page's position in search
engine results, regardless of the page's real value.
Spam = web pages that are the result of spamming
Approximately 10-15% of web pages are spam
-
Web spam taxonomy
Boosting techniques
Techniques to achieve high relevance or
importance for some pages.
Hiding techniques
Techniques to hide the adopted boosting techniques from human web users.
-
Boosting techniques
Term spamming
Manipulating the text/fields of web pages in
order to appear relevant to queries
Link spamming
Creating link structures that boost page rank
-
Term spamming
Target algorithms
-
Techniques
Repetition of one or a few specific terms
e.g., free, cheap
Increases relevance for a document with respect to a small number of query terms
Dumping of a large number of unrelated terms
e.g., copy entire dictionaries
Weaving
Copy legitimate pages and insert spam terms at random positions
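The effect of repetition can be sketched with a toy relevance score. This is only an illustration, assuming a ranking function that rewards raw term frequency; the function name `term_frequency` and the sample texts are hypothetical and not from the slides.

```python
from collections import Counter


def term_frequency(text: str, term: str) -> float:
    """Fraction of the words in `text` that equal `term` (case-insensitive)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return Counter(words)[term.lower()] / len(words)


# Hypothetical page text, before and after a spammer repeats one term.
honest = "we sell laptops and offer free shipping on large orders"
stuffed = honest + " free" * 10

tf_honest = term_frequency(honest, "free")    # 1 of 10 words
tf_stuffed = term_frequency(stuffed, "free")  # 11 of 20 words

print(f"honest page:  {tf_honest:.2f}")
print(f"stuffed page: {tf_stuffed:.2f}")
```

The same measurement can be turned around: an unusually high share of a single term is one simple signal that a page may be using repetition.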
-
Term spam targets
Body of web page
Title
HTML meta tags
Anchor text
URL
-
Link spam
There are three kinds of web pages from a spammer's point of view
Inaccessible pages
Accessible pages
e.g., blog comment pages
The spammer can post links to his own pages
Own pages
Completely controlled by the spammer.
A group of own pages is known as a spam farm.
-
Target algorithms
Hypertext Induced Topic Search (HITS) algorithm
Hub scores
A spammer should add many outgoing links on the target page t, pointing to well-known pages, to increase its hub score
Authority scores
Require many incoming links from presumably important hubs
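The hub-score manipulation can be sketched with a small HITS iteration. This is a minimal sketch under stated assumptions: the graph, the page names (`h1`, `a1`, `t`, ...) and the fixed iteration count are all made up for illustration, and scores are normalized to unit length after each round.

```python
from typing import Dict, List, Tuple


def hits(links: Dict[str, List[str]], iterations: int = 50) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub, authority) scores for a graph given as page -> outgoing links."""
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page = sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score of a page = sum of authority scores of the pages it links to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: two honest hubs point at three authorities; t has no links yet.
before = {"h1": ["a1", "a2"], "h2": ["a2", "a3"], "t": []}
# The spammer adds outgoing links from t to every authority.
after = {"h1": ["a1", "a2"], "h2": ["a2", "a3"], "t": ["a1", "a2", "a3"]}

hub_before, _ = hits(before)
hub_after, _ = hits(after)
print("hub score of t before:", round(hub_before["t"], 3))
print("hub score of t after: ", round(hub_after["t"], 3))
```

In this toy graph the target page ends up with a higher hub score than the honest hubs, using only links the spammer fully controls. Raising the authority score is harder because it depends on incoming links from other hubs.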
-
Cont.
PageRank
Uses incoming link information to assign numerical weights to all pages on the web
The numerical weight that it assigns to any given element E is also called the PageRank of E and denoted by PR(E)
Spammers manipulate the algorithm using links
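How links change PR(E) can be sketched with a simple power iteration. The slides do not give a formula, so this sketch assumes the commonly used form with a damping factor of 0.85 and spreads the rank of pages without outgoing links evenly over all pages. The graph and the page names (`a`, `b`, `t`, `f0`...) are hypothetical.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85, iterations: int = 100) -> Dict[str, float]:
    """Approximate PageRank for a graph given as page -> outgoing links."""
    pages = sorted(set(links) | {dst for targets in links.values() for dst in targets})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared by all pages.
        dangling = sum(pr[p] for p in pages if not links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes its rank in equal parts to the pages it links to.
        for src, targets in links.items():
            for dst in targets:
                new[dst] += damping * pr[src] / len(targets)
        pr = new
    return pr


# Hypothetical web: a and b link to each other, the target t links to a.
web = {"a": ["b"], "b": ["a"], "t": ["a"]}
# Spam farm: five boosting pages owned by the spammer, each linking to t.
farm = dict(web)
farm.update({f"f{i}": ["t"] for i in range(5)})

pr_before = pagerank(web)
pr_after = pagerank(farm)
print("PR(t) without farm:", round(pr_before["t"], 4))
print("PR(t) with farm:   ", round(pr_after["t"], 4))
```

In this toy example PR(t) rises once the boosting pages link to it, even though the total rank is now shared among more pages.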
-
Targets and Techniques
Outgoing links
Directory cloning
Incoming links
Create a honey pot
Infiltrate a web directory
Post links to unmoderated message boards or guest books
-
Hiding techniques
Content hiding
Use the same color for text and page background
Spam terms or links on a page can be made invisible when the browser renders the page
Cloaking
Return a different page to crawlers and browsers
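Both hiding techniques suggest simple checks, sketched below. This is an illustration only: the function names, the hex-only color handling and the 0.5 similarity threshold are assumptions, and the two page texts are assumed to have been fetched already (once as a crawler, once as a browser).

```python
def normalize_hex(color: str) -> str:
    """Lowercase a hex color and expand the short form, e.g. '#FFF' -> '#ffffff'."""
    c = color.strip().lower().lstrip("#")
    if len(c) == 3:
        c = "".join(ch * 2 for ch in c)
    return "#" + c


def is_hidden_text(text_color: str, background_color: str) -> bool:
    """Content hiding: text drawn in the same color as the page background."""
    return normalize_hex(text_color) == normalize_hex(background_color)


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, from 0.0 (disjoint) to 1.0 (identical)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def looks_cloaked(crawler_text: str, browser_text: str, threshold: float = 0.5) -> bool:
    """Cloaking: the crawler and the browser were served very different pages."""
    return jaccard(crawler_text, browser_text) < threshold


print(is_hidden_text("#FFF", "#ffffff"))
print(looks_cloaked("cheap free pills casino", "welcome to my photo blog"))
```

Real pages can hide content in other ways, so a check like this covers only the two cases listed above.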
-
MAJOR TYPES OF SPAMMING
Email spam
Blog spam
Social networking spam
Text spam
-
FILTERS TO DETECT SPAM
A spam filter is a web-based program which helps to prevent spam e-mail from entering your inbox.
Password filter
Blacklist and whitelist filters
Challenge/response filter
Rules-based filter
Community-based filter
Adaptive/Bayesian filter
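The last filter type can be sketched as a tiny naive Bayes classifier that adapts as it is trained on labeled messages. This is a minimal sketch, not a production filter: the class name `NaiveBayesFilter`, the whitespace tokenization, the add-one smoothing and the four training messages are all assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy adaptive filter: learns word counts from messages labeled spam or ham."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Update the counts with one message; label must be 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_log_odds(self, text: str) -> float:
        """Log of P(spam | text) / P(ham | text), with add-one smoothing."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        spam_total = sum(self.word_counts["spam"].values())
        ham_total = sum(self.word_counts["ham"].values())
        # Smoothed prior from the number of messages seen in each class.
        score = math.log((self.message_counts["spam"] + 1) / (self.message_counts["ham"] + 1))
        for word in text.lower().split():
            p_spam = (self.word_counts["spam"][word] + 1) / (spam_total + len(vocab) + 1)
            p_ham = (self.word_counts["ham"][word] + 1) / (ham_total + len(vocab) + 1)
            score += math.log(p_spam) - math.log(p_ham)
        return score

    def is_spam(self, text: str) -> bool:
        return self.spam_log_odds(text) > 0


# Hypothetical training data.
f = NaiveBayesFilter()
f.train("free cheap pills", "spam")
f.train("win free money now", "spam")
f.train("meeting agenda for monday", "ham")
f.train("lunch at noon tomorrow", "ham")

print(f.is_spam("free money"))
print(f.is_spam("monday meeting"))
```

Each call to `train` shifts the word statistics, which is what makes this kind of filter adaptive, in contrast to a fixed blacklist or rule set.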
-
CONCLUSION
Identify instances of spam, i.e., find pages that
contain specific types of spam, and stop crawling
and/or indexing such pages.
Prevent spamming, that is, make specific
spamming techniques impossible to use.
Counterbalance the effect of spamming.