web spam final

Upload: navya-nav

Post on 06-Apr-2018


  • 8/2/2019 Web Spam Final

    1/18

    UNDER THE GUIDANCE OF

    MR. KHASIM D VALI

    NAVYA R

    8th SEM CS


    CONTENTS

    What is web spam?

    Boosting techniques

    Term spamming techniques and targets

    Link spamming techniques

    Hiding techniques

    Major types of spamming

    Spam filters

    Conclusion


    What is web spam?

    Spamming is any deliberate action taken solely in order to boost a web page's position in search engine results, regardless of the page's real value.

    Spam = web pages that are the result of spamming

    Approximately 10-15% of web pages are spam


    Web spam taxonomy

    Boosting techniques

    Techniques to achieve high relevance or

    importance for some pages.

    Hiding techniques

    Techniques to hide the adopted boosting techniques from human web users.


    Boosting techniques

    Term spamming

    Manipulating the text/fields of web pages in

    order to appear relevant to queries

    Link spamming

    Creating link structures that boost page rank


    Term spamming

    Target algorithms


    Techniques

    Repetition of one or a few specific terms, e.g., free, cheap

    Increases the relevance of a document with respect to a small number of query terms

    Dumping of a large number of unrelated terms, e.g., copying entire dictionaries

    Weaving: copy legitimate pages and insert spam terms at random positions
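The effect of repetition can be illustrated with a minimal sketch. It assumes a ranking signal based on plain term frequency, which is a simplification; the page texts and the `term_frequency` helper are hypothetical and not taken from the slides.

```python
from collections import Counter


def term_frequency(text: str, term: str) -> float:
    """Fraction of the words in `text` that equal `term` (case-insensitive)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return Counter(words)[term.lower()] / len(words)


# Hypothetical page texts, made up for illustration.
honest_page = "we sell laptops and offer free shipping on some orders"
spam_page = honest_page + " free" * 10  # repetition of one specific term

honest_score = term_frequency(honest_page, "free")
spam_score = term_frequency(spam_page, "free")
print(honest_score, spam_score)
```

Under this assumption the repeated term makes the spam page look far more relevant to the query "free", even though the useful content is unchanged.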


    Term spam targets

    Body of web page

    Title

    HTML meta tags

    Anchor text

    URL


    Link spam

    There are three kinds of web pages from a spammer's point of view:

    Inaccessible pages

    Accessible pages

    e.g., web blog comment pages, where a spammer can post links to his own pages

    Own pages

    Completely controlled by the spammer.

    A group of own pages is known as a spam farm.


    Target algorithms

    Hypertext Induced Topic Search (HITS) algorithm

    Hub scores: a spammer adds many outgoing links on the target page t, pointing to well-known pages, to increase its hub score

    Authority scores: a page's authority score rises when it has many incoming links from presumably important hubs
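The hub and authority updates can be sketched as a short iteration. This is a minimal sketch of the standard HITS update under assumed conditions; the toy graph and page names (`portal`, `blog`, `t`, and so on) are hypothetical.

```python
def hits(links: dict[str, list[str]], iterations: int = 20):
    """Run a basic HITS iteration on a graph given as {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: "t" is the spam target stuffed with outgoing links.
links = {
    "portal": ["news", "wiki"],
    "blog": ["news"],
    "t": ["news", "wiki", "shop"],
}
hub, auth = hits(links)
print(sorted(hub.items(), key=lambda kv: -kv[1]))
```

In this toy graph the target page t ends up with the highest hub score simply because it links to the most pages, which is the weakness the slide describes.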


    PageRank

    Uses incoming link information to assign numerical weights to all pages on the web

    The numerical weight that it assigns to any given element E is also called the PageRank of E and is denoted by PR(E)

    Spammers manipulate the algorithm using links
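How links can manipulate PageRank is easier to see with a small computation. The sketch below uses the common power-iteration formulation with a damping factor of 0.85; the damping value, the toy graphs, and the `farm0` to `farm4` pages are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a graph given as {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Each page splits its damped weight evenly over its out-links.
                share = damping * pr[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Dangling page: spread its weight evenly over all pages.
                for q in pages:
                    new[q] += damping * pr[p] / n
        pr = new
    return pr


# Hypothetical base graph with a target page "t".
base = {"a": ["b"], "b": ["a", "t"], "t": ["a"]}

# Same graph plus a small spam farm: five own pages that all link to "t".
with_farm = dict(base)
for i in range(5):
    with_farm[f"farm{i}"] = ["t"]

pr_base = pagerank(base)
pr_farm = pagerank(with_farm)
print(pr_base["t"], pr_farm["t"])
```

In this example PR(t) rises once the farm pages point at it, even though none of the farm pages has any incoming links of its own.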


    Targets and Techniques

    Outgoing links

    Directory cloning

    Incoming links

    Create a honey pot

    Infiltrate a web directory

    Post links to unmoderated message boards or guest books


    Hiding techniques

    Content hiding

    Use same color for text and page background

    Spam terms or links on a page can be made invisible when the browser renders the page

    Cloaking

    Return different pages to crawlers and browsers
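Cloaking can be sketched as a server that picks its response from the User-Agent header, together with a naive check that requests the same page twice. All names, marker strings, and page bodies below are hypothetical; real pages can differ between requests for legitimate reasons, so a mismatch is only a hint.

```python
# Illustrative substrings only; not a real crawler list.
CRAWLER_MARKERS = ("googlebot", "bingbot", "crawler", "spider")


def is_crawler(user_agent: str) -> bool:
    """Guess from the User-Agent string whether the client is a crawler."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)


def cloaked_response(user_agent: str) -> str:
    """What a cloaking server might return (hypothetical page bodies)."""
    if is_crawler(user_agent):
        # Term-stuffed page intended for the search index.
        return "cheap flights cheap hotels free offers"
    return "<html>Unrelated page shown to human visitors</html>"


def looks_cloaked(fetch) -> bool:
    """Fetch the same page as a browser and as a crawler and compare the bodies.

    `fetch` is any callable that maps a User-Agent string to a page body.
    """
    as_browser = fetch("Mozilla/5.0 (X11; Linux x86_64)")
    as_crawler = fetch("Googlebot/2.1")
    return as_browser != as_crawler


print(looks_cloaked(cloaked_response))
```

Passing the fetch function in as a parameter keeps the sketch free of network calls; a real check would need an HTTP client and a more tolerant comparison than exact equality.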


    MAJOR TYPES OF SPAMMING

    Email spam

    Blog spam

    Social networking spam

    Text spam


    FILTERS TO DETECT SPAM

    A spam filter is a web-based program which helps to prevent spam e-mail from entering your inbox.

    Password filter

    Blacklist and whitelist filters

    Challenge/response filter

    Rules-based filter

    Community-based filter

    Adaptive/Bayesian filter
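As an illustration of the last filter type, here is a minimal sketch of a naive Bayes classifier with add-one smoothing. It is one common way to build a Bayesian filter, not necessarily what any particular product does; the class name and the training messages are made up for the example.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy word-based naive Bayes spam filter (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Add one labelled message; label must be 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def _log_score(self, words, label) -> float:
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(self.message_counts[label] / sum(self.message_counts.values()))
        for w in words:
            score += math.log((counts[w] + 1) / (total + vocab))
        return score

    def is_spam(self, text: str) -> bool:
        words = text.lower().split()
        return self._log_score(words, "spam") > self._log_score(words, "ham")


# Hypothetical training data; at least one message per label is required.
f = NaiveBayesFilter()
f.train("win free money now", "spam")
f.train("cheap pills free offer", "spam")
f.train("meeting agenda for monday", "ham")
f.train("lunch with the project team", "ham")
print(f.is_spam("free money offer"), f.is_spam("project meeting monday"))
```

The filter is adaptive in the sense that every call to `train` with a newly labelled message shifts the word statistics used for later decisions.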


    CONCLUSION

    Identify instances of spam, i.e., find pages that contain specific types of spam, and stop crawling and/or indexing such pages.

    Prevent spamming, that is, make specific spamming techniques impossible to use.

    Counterbalance the effect of spamming.
