Graph-based Algorithms in Large Scale Information Retrieval

Fatemeh Kaveh-Yazdy
Computer Engineering Department, School of Electrical and Computer Engineering, Yazd University
Graduate Research Assistant @ Web Lab. & IPKD Lab., YU
Senior Research Fellow @ Parsijoo
External Research Member of MSC Lab., DUT

Post on 12-Jan-2016



Slide 2

Outline

• Information Retrieval Systems: Search Engines
• Graphs in Information Retrieval
  – Connection-based Ranking
• Spamming
• Spam Detection
• A Real World Case

Introduction to Information Retrieval
Graphs in Information Retrieval
A Real World Case…

Slide 3

Introduction to IR

• Enterprise Document Retrieval
• Web Information Retrieval Systems: Search Engines
• Web Retrieval vs. Document Retrieval
  – Structure of documents
  – Scale
  – Domain
  – Users
  – Query Specificity
  – Determination

Slide 4

Architecture of Search Engines

[Figure: search engine architecture. Components: Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, Client (Queries)]

Slide 5

Cont.

• Web Structure
  – Meta Data
  – Linkage
• Applications of Web Structure
  – Crawling
  – Indexing
  – Ranking

[Figure: www.sharif.edu and math.sharif.edu (Math Dept.)]

Slide 6

Trust in Web Structure

• Cite / Link
  – Use / Quote / Express favor
  – Trust / Applicability
• Assumption
  – A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
• Recursion: Quality of a page is related to
  – its in-degree,
  – the quality of pages linking to it

[Figure: A → B]

Slide 7

Random Surfer on the Web

• Page and Brin [1] introduce the random surfer model
• Definition
  – Random surfer starts from a random page
  – The surfer proceeds to a randomly chosen successor of the current page (with probability 1/outdegree)

[Figure: a surfer (s) on the web graph]
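The random surfer definition can be sketched as a short simulation. This is a minimal illustration, not the presenter's code: the four-page graph, the function name `random_surfer`, and the step count are assumptions made for the example. The fraction of steps spent on each page approximates how often a random surfer would be found there.

```python
import random
from collections import Counter

def random_surfer(graph, steps, seed=0):
    """Simulate one surfer and return the fraction of visits per page."""
    rng = random.Random(seed)
    page = rng.choice(sorted(graph))       # the surfer starts from a random page
    visits = Counter()
    for _ in range(steps):
        visits[page] += 1
        successors = graph[page]
        if not successors:                 # dead end: the plain model defines no next move
            break
        page = rng.choice(successors)      # each successor has probability 1/outdegree
    total = sum(visits.values())
    return {p: visits[p] / total for p in graph}

# Hypothetical four-page web: page -> list of successors
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
freq = random_surfer(graph, steps=100_000)
```

Page D has no incoming links in this toy graph, so the surfer can be there only at the very start of the walk.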

Slide 8

Random Surfer on the Web (II)

[Figure: animation over slides 8 to 12 showing one surfer (s) on the web graph; the labels "1" and "n" survive from the image]

Slide 13

Random Surfer on the Web (III)

[Figure: animation over slides 13 and 14 showing several surfers (s) on the web graph; the labels "1" and "n" survive from the image]

Slide 17

PageRank

• Each page inherits its rank from its ancestors.

$$\mathrm{PageRank}(p) = \sum_{q \to p} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$

• Issues
  – Web graph is not strongly connected
  – Convergence of PageRank is not guaranteed
  – Effects of sink nodes
  – Pages without outputs
  – Trapping pages
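The PageRank recurrence on this slide can be iterated directly. The sketch below is an illustration under assumptions: the function name `pagerank_basic`, the uniform starting ranks, the iteration count, and both toy graphs are not from the slides. The second graph contains a page without outputs, to show one of the listed issues.

```python
def pagerank_basic(graph, iterations=50):
    """Iterate PageRank(p) = sum over q -> p of PageRank(q) / outdegree(q)."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}            # assumed uniform start
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in graph}
        for q, successors in graph.items():
            for p in successors:                  # q passes an equal share to each successor
                new_rank[p] += rank[q] / len(successors)
        rank = new_rank
    return rank

# Strongly connected toy graph: the iteration settles on stable ranks
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank_basic(graph)

# Same graph, but C has no outputs: rank that reaches C is never passed on
leaky = {"A": ["B", "C"], "B": ["C"], "C": []}
leaked = pagerank_basic(leaky, iterations=5)
```

On the first graph the ranks approach A = 0.4, B = 0.2, C = 0.4. On the second graph the total rank drains away through C, which is the sink effect named in the issues list.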

Slide 18

Cont.

[Figure: surfers (s) on the web graph, with two nodes marked "Sink"]

Slide 19

PageRank with Teleport

• Teleport
  – Random surfer jumps from a node to any other node
  – The destination is chosen uniformly from all nodes
    • Prob. of selecting each node is (1/n)
  – In each node, surfer has the option of jumping
    • Prob. of jumping is α (0 ≤ α ≤ 1)
    • Damping factor (d = 1 − α)

$$\mathrm{PageRank}(p) = \frac{\alpha}{n} + (1-\alpha) \sum_{q \to p} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$

$$\mathrm{PageRank}(p) = \frac{1-d}{n} + d \sum_{q \to p} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$
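The damping-factor form of the formula can be sketched as follows. Several choices here are assumptions rather than content from the slides: the function name `pagerank_teleport`, the value d = 0.85 (a commonly used default), the iteration count, the toy graph, and the treatment of pages without outputs, which spread their rank uniformly as if the surfer always teleports from them.

```python
def pagerank_teleport(graph, d=0.85, iterations=100):
    """PageRank(p) = (1 - d)/n + d * sum over q -> p of PageRank(q)/outdegree(q)."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        # Assumption (not on the slide): a page without outputs hands its
        # rank to all pages uniformly, so no rank is lost.
        sink_mass = sum(rank[q] for q, succ in graph.items() if not succ)
        new_rank = {p: (1.0 - d) / n + d * sink_mass / n for p in graph}
        for q, successors in graph.items():
            for p in successors:
                new_rank[p] += d * rank[q] / len(successors)
        rank = new_rank
    return rank

# Hypothetical graph with one sink page D
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
rank = pagerank_teleport(graph)
```

With the teleport term every page keeps a positive rank and the ranks still sum to 1, even though D has neither incoming nor outgoing links.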

Slide 20

Spamming

• Spam
  – The manipulation of web page content for the purpose of appearing high up in search results.
• Spamming Techniques
  – Text content manipulation (tags, comments, invisible text blocks)
  – Structural content manipulation (mimicking important websites)

Slide 21

Spam Detection

• Spam Detection Methods
  – Text Spam
    • Comparing word probability
  – Link-farm Spam
    • Trust/Anti-trust Rank
    • Community Detection

Slide 22

Link-farm Spam Detection

• Link-farm Spam
  – Trust Rank
  – Anti-trust
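The slide names Trust Rank without giving details. As commonly described, TrustRank is a PageRank variant in which the teleport goes only to a hand-picked set of trusted seed pages, so trust flows outward along links from those seeds. The sketch below follows that description and is an assumption about what the slide refers to; the function name `trust_rank`, the value d = 0.85, and the toy graph with a small link farm are hypothetical.

```python
def trust_rank(graph, trusted, d=0.85, iterations=100):
    """PageRank-style propagation whose teleport lands only on trusted seeds."""
    seed = {p: (1.0 / len(trusted) if p in trusted else 0.0) for p in graph}
    trust = dict(seed)
    for _ in range(iterations):
        new_trust = {p: (1.0 - d) * seed[p] for p in graph}   # biased teleport
        for q, successors in graph.items():
            for p in successors:
                new_trust[p] += d * trust[q] / len(successors)
        trust = new_trust
    return trust

# Hypothetical web: two good pages, and a link farm boosting "target"
graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "farm1": ["farm2", "target"],
    "farm2": ["farm1", "target"],
    "target": ["farm1", "farm2"],
}
scores = trust_rank(graph, trusted={"good1"})
```

The farm pages link heavily to each other, but no trusted page links to them, so their trust score stays at zero. An anti-trust style score is usually described as the mirror image: start from known spam pages and propagate along reversed links.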

Slide 23

Parsijoo

Slide 24

A Real World Case…

• Parsijoo Facts
  – Crawled Pages: (1 × 10^9 /month) rem. 500 × 10^6
    • Crawling rate: 2000 pages/sec
  – Cached URLs: 10 × 10^9
    • 80,000 URLs/sec
    • 10 × 10^6 unique hosts (each host needs one queue)
  – Unique URLs: 800 × 10^6
  – Unique Words: 80 × 10^6
  – Unique Requests: 200 × 10^3/day
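The note that each host needs one queue suggests a crawl frontier organised per host. The sketch below shows one possible shape for such a structure; it is not Parsijoo's implementation. The class name `HostFrontier`, its methods, and the example URL paths are hypothetical.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

class HostFrontier:
    """Hypothetical crawl frontier keeping one FIFO queue per host."""

    def __init__(self):
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.seen = set()                  # URLs already enqueued

    def add(self, url):
        """Enqueue a URL under its host; return False for a duplicate."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queues[urlsplit(url).hostname].append(url)
        return True

    def next_batch(self):
        """Take at most one URL per host, so no host is hit twice in a round."""
        return [q.popleft() for q in self.queues.values() if q]

frontier = HostFrontier()
for url in [
    "http://www.sharif.edu/a",             # hypothetical paths
    "http://www.sharif.edu/b",
    "http://math.sharif.edu/",
    "http://www.sharif.edu/a",             # duplicate, ignored
]:
    frontier.add(url)
batch = frontier.next_batch()
```

Keeping a separate queue per host makes it easy to limit how often any single host is contacted, at the cost of holding one queue object for every unique host.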

Slide 25

A Real World Case…

• Parsijoo Facts
  – Requests (per day)
    • Web: 100 K
    • Image: 35 K
    • News: 10 K
    • Music: 10 K
    • Scholar: 1 K
    • Video: 5 K
    • SADANA etc.: 35 K
  – Unique Requests: 200 × 10^3/day

Slide 26

Q&A