graph-based algorithms in large scale information retrieval fatemeh kaveh-yazdy computer engineering...

Graph-based Algorithms in Large Scale Information Retrieval

Fatemeh Kaveh-Yazdy

Computer Engineering Department, School of Electrical and Computer Engineering

Yazd University

Graduate Research Assistant @ Web Lab. & IPKD Lab., YU

Senior Research Fellow @ Parsijoo

External Research Member of MSC Lab., DUT

• Information Retrieval Systems: Search Engines
• Graphs in Information Retrieval
– Connection-based Ranking
• Spamming
• Spam Detection
• A Real World Case

Outline

Introduction to Information Retrieval

Graphs in Information Retrieval

A Real World Case…

• Enterprise Document Retrieval
• Web Information Retrieval Systems: Search Engines
• Web Retrieval vs. Document Retrieval
– Structure of documents
– Scale
– Domain
– Users
– Query Specificity
– Determination

Introduction to IR



Search Engines

Trust in Web

Architecture of Search Engines




[Figure: search engine architecture. Components: Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, Client, Queries]

• Web Structure
– Meta Data
– Linkage
• Applications of Web Structure
– Crawling
– Indexing
– Ranking

Cont.




[Figure: www.sharif.edu and math.sharif.edu (Math Dept.)]

• Cite / Link
– Use / Quote / Express favor
– Trust / Applicability
• Assumption
– A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
• Recursion: the quality of a page is related to
– its in-degree,
– the quality of the pages linking to it

Trust in Web Structure




[Figure: pages A and B]

• Page and Brin [1] introduced the random surfer model
• Definition
– The random surfer starts from a random page
– The surfer proceeds to a randomly chosen successor of the current page (each successor with probability 1/outdegree)

Random Surfer on the Web
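
The random surfer definition can be simulated directly. The following is a minimal sketch, assuming a small toy graph stored as a dict of successor lists in which every page has at least one successor. The function name `random_surf` and the example graph are illustrative and do not come from the slides.

```python
import random
from collections import Counter

def random_surf(graph, steps, seed=0):
    """Simulate a random surfer and return the visit frequency of each page.

    graph: dict mapping each page to a non-empty list of successors
           (hypothetical toy input; sink pages are not handled here).
    """
    rng = random.Random(seed)
    page = rng.choice(sorted(graph))      # the surfer starts from a random page
    visits = Counter()
    for _ in range(steps):
        visits[page] += 1
        # each successor is chosen with probability 1/outdegree(page)
        page = rng.choice(graph[page])
    return {p: visits[p] / steps for p in graph}

# Toy graph (assumed for illustration): A -> B, C; B -> C; C -> A
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
freq = random_surf(toy_graph, steps=10_000)
```

On this toy graph the long-run visit frequencies approach 0.4, 0.2 and 0.4 for A, B and C, which is the kind of score PageRank assigns.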



Random Surfer Model

PageRank

PageRank with Teleport

Spamming

[Figure: surfer at node s]

Random Surfer on the Web (II)




[Figure: surfer at node s; label 1/n]

Random Surfer on the Web (III)




[Figure: graph of nodes labeled s; label 1/n]

• Each page inherits its rank from its ancestors.
• Issues
– Web graph is not strongly connected
– Convergence of PageRank is not guaranteed
– Effects of sink nodes
– Pages without outputs
– Trapping pages

PageRank




$$\mathrm{PageRank}(p) = \sum_{q \to p} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$
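
The recursive definition (each page receives an equal share of the rank of every page linking to it) can be evaluated by repeated substitution. The following is a minimal sketch, assuming a toy graph in which every page has at least one successor, so none of the sink issues listed on the slide arise. The name `pagerank_basic` and the iteration count are illustrative choices.

```python
def pagerank_basic(graph, iterations=50):
    """Iterate PageRank(p) = sum over q -> p of PageRank(q) / outdegree(q).

    graph: dict mapping each page to a non-empty list of successors
           (assumption: no sink pages, otherwise the division below fails).
    """
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}     # uniform starting vector
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in graph}
        for q, successors in graph.items():
            share = rank[q] / len(successors)
            for p in successors:
                new_rank[p] += share       # q passes an equal share to each successor
        rank = new_rank
    return rank

# Toy graph (assumed for illustration): A -> B, C; B -> C; C -> A
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank_basic(toy_graph)
```

The iteration settles here because the toy graph is strongly connected and aperiodic. On a graph with sinks or trapping pages the same loop either loses rank or concentrates it, which is what motivates teleporting.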

Cont.




[Figure: graph of nodes labeled s with two sink nodes]

• Teleport
– The random surfer jumps from a node to any other node
– The destination is chosen uniformly from all nodes
• The probability of selecting each node is 1/n
– At each node, the surfer has the option of jumping
• The probability of jumping is α (0 ≤ α ≤ 1)
• Damping factor: d = 1 - α

PageRank with Teleport




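$$\mathrm{PageRank}(p) = \frac{\alpha}{n} + (1-\alpha) \sum_{q \to p} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$

$$\mathrm{PageRank}(p) = \frac{1-d}{n} + d \sum_{q \to p} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$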
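
PageRank with teleport gives every page a base score of (1 - d)/n and scales the link contributions by the damping factor d. The following is a minimal sketch under two assumptions that are not stated on the slides: d = 0.85 is used as a conventional default, and sink pages are treated as if they linked to every page. The name `pagerank_teleport` and the toy graph are illustrative.

```python
def pagerank_teleport(graph, d=0.85, iterations=100):
    """PageRank(p) = (1 - d)/n + d * sum over q -> p of PageRank(q)/outdegree(q).

    graph: dict mapping each page to a list of successors (may be empty).
    Sink pages are treated as linking to every page; this is one common
    convention and an assumption of this sketch.
    """
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in graph}   # teleport term
        for q, successors in graph.items():
            targets = successors if successors else list(graph)
            share = d * rank[q] / len(targets)
            for p in targets:
                new_rank[p] += share                   # damped link term
        rank = new_rank
    return rank

# Toy graph with a sink page D (assumed for illustration)
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": []}
ranks = pagerank_teleport(toy_graph)
```

With teleporting the scores keep summing to 1 and every page has a nonzero score, including the sink.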


Spamming




• Spam
– The manipulation of web page content for the purpose of appearing high up in search results.
• Spamming Techniques
– Text content manipulation (tags, comments, invisible text blocks)
– Structural content manipulation (mimicking important websites)

Spam Detection




• Spam Detection Methods
– Text Spam
• Comparing word probability
– Link-farm Spam
• Trust/Anti-trust Rank
• Community Detection

Link-farm Spam Detection




• Link-farm Spam
– Trust Rank
– Anti-trust
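
The slides name Trust Rank without detailing it. The usual idea is a PageRank variant whose teleport step always returns to a small set of manually trusted seed pages, so trust flows only along links that are reachable from those seeds. The following is a minimal sketch of that idea, not the method used in the talk. The names `trust_rank` and `trusted_seeds`, the value d = 0.85, and the toy graph are assumptions for illustration.

```python
def trust_rank(graph, trusted_seeds, d=0.85, iterations=100):
    """Propagate trust from seed pages along out-links.

    graph: dict mapping each page to a list of successors (may be empty).
    trusted_seeds: non-empty set of pages assumed to be hand-verified as non-spam.
    """
    n_seeds = len(trusted_seeds)
    # teleport distribution: uniform over the seeds, zero elsewhere
    teleport = {p: (1.0 / n_seeds if p in trusted_seeds else 0.0) for p in graph}
    trust = dict(teleport)
    for _ in range(iterations):
        new_trust = {p: (1.0 - d) * teleport[p] for p in graph}
        for q, successors in graph.items():
            if not successors:
                continue                    # trust reaching a sink is dropped in this sketch
            share = d * trust[q] / len(successors)
            for p in successors:
                new_trust[p] += share
        trust = new_trust
    return trust

# Toy graph (assumed): two good pages and a two-page link farm that links out to good1
toy_graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
scores = trust_rank(toy_graph, trusted_seeds={"good1"})
```

The link-farm pages receive zero trust because no trusted page links to them, however densely they link to each other. Anti-trust approaches apply the same propagation in the opposite direction, starting from known spam pages and following links backwards.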

Parsijoo




• Parsijoo Facts
– Crawled Pages: (1 × 10^9/month) rem. 500 × 10^6
• Crawling rate: 2000 pages/sec
– Cached URLs: 10 × 10^9
• 80,000 URLs/sec
• 10 × 10^6 unique hosts (each host needs one queue)
– Unique URLs: 800 × 10^6
– Unique Words: 80 × 10^6
– Unique Requests: 200 × 10^3/day
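
The remark that each host needs one queue describes a common crawler frontier layout: URLs are grouped by host so that the crawler can take at most one URL per host in each round and avoid overloading any single site. The following is a minimal sketch of that layout, not Parsijoo's implementation. The class name `HostQueues` and its methods are hypothetical.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

class HostQueues:
    """Crawler frontier sketch with one FIFO queue per host (hypothetical design)."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def add(self, url):
        host = urlsplit(url).hostname      # e.g. "math.sharif.edu"
        self.queues[host].append(url)

    def next_round(self):
        """Return at most one URL per host, dropping queues that become empty."""
        batch = []
        for host in list(self.queues):
            batch.append(self.queues[host].popleft())
            if not self.queues[host]:
                del self.queues[host]
        return batch

frontier = HostQueues()
for u in ["http://www.sharif.edu/a", "http://www.sharif.edu/b", "http://math.sharif.edu/"]:
    frontier.add(u)
first = frontier.next_round()
second = frontier.next_round()
```

With 10 × 10^6 unique hosts, a real frontier would need to keep most of these queues on disk; the in-memory dict here only illustrates the grouping.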




Parsijoo

• Parsijoo Facts
– Requests (per day)
• Web: 100 K, Image: 35 K, News: 10 K
• Music: 10 K, Scholar: 1 K, Video: 5 K
• SADANA etc.: 35 K
– Unique Requests: 200 × 10^3/day
