Graph-based Algorithms in Large Scale Information Retrieval
Fatemeh Kaveh-Yazdy
Computer Engineering Department
School of Electrical and Computer Engineering
Yazd University
Graduate Research Assistant @ Web Lab. & IPKD Lab., YU
Senior Research Fellow @ Parsijoo
External Research Member of MSC Lab., DUT
Slide 2
Outline
• Information Retrieval Systems: Search Engines
• Graphs in Information Retrieval
  – Connection-based Ranking
• Spamming
• Spam Detection
• A Real World Case
Introduction to Information Retrieval Graphs in Information Retrieval
A Real World Case…
Slide 3
Introduction to IR
• Enterprise Document Retrieval
• Web Information Retrieval Systems: Search Engines
• Web Retrieval vs. Document Retrieval
  – Structure of documents
  – Scale
  – Domain
  – Users
  – Query Specificity
  – Determination
Search Engines | Trust in Web
Slide 4
Architecture of Search Engines
[Figure: search engine architecture. Components: Web, Crawler(s), Page Repository, Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, Client (Queries).]
Slide 5
Cont.
• Web Structure
  – Meta Data
  – Linkage
• Applications of Web Structure
  – Crawling
  – Indexing
  – Ranking
[Figure: linkage example with www.sharif.edu and math.sharif.edu (Math Dept.).]
Slide 6
Trust in Web Structure
• Cite / Link
  – Use / Quote / Express favoring
  – Trust / Applicability
• Assumption
  – A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
• Recursion: the quality of a page is related to
  – its in-degree,
  – the quality of the pages linking to it
[Figure: pages A and B.]
Slide 7
Random Surfer on the Web
• Page and Brin [1] introduce the random surfer model
• Definition
  – The random surfer starts from a random page
  – The surfer proceeds to a randomly chosen successor of the current page (each successor with probability 1/outdegree)
Random Surfer Model | PageRank | PageRank with Teleport | Spamming
[Figure: the surfer at a node s of the web graph.]
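The random surfer definition can be sketched as a short simulation. This is a minimal sketch, not taken from the slides: the three-page graph, the function name `random_surf`, and the step count are hypothetical, and every page is assumed to have at least one successor.

```python
import random
from collections import Counter


def random_surf(graph, steps, seed=0):
    """Simulate one random surfer and return the visit frequency of each page.

    graph maps each page to the list of its successors.
    Assumption: every page has at least one successor (no sinks).
    """
    rng = random.Random(seed)
    pages = sorted(graph)
    current = rng.choice(pages)  # the surfer starts from a random page
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        # each successor is chosen with probability 1/outdegree
        current = rng.choice(graph[current])
    return {p: visits[p] / steps for p in pages}


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
freq = random_surf(toy_graph, steps=100_000)
```

On this toy graph the visit frequencies settle near 0.4 for A and C and 0.2 for B, which is the long-run share of time the surfer spends on each page.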
Slide 8
Random Surfer on the Web (II)
[Figure: web graph, labeled 1/n.]
Slide 9
Random Surfer on the Web (II)
[Figure: web graph with the surfer at node s, labeled 1/n.]
Slide 13
Random Surfer on the Web (III)
[Figure: web graph with every node labeled s, labeled 1/n.]
Slide 17
PageRank
• Each page inherits its rank from its ancestors.
• Issues
  – Web graph is not strongly connected
  – Convergence of PageRank is not guaranteed
  – Effects of sink nodes
  – Pages without outgoing links
  – Trapping pages
$$\mathrm{PageRank}(p) = \sum_{q \to p} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$
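The rank-inheritance rule and the sink issue can be illustrated with a minimal sketch. The function name `pagerank_basic`, the iteration count, and both toy graphs are hypothetical; the update applies the summation over linking pages directly, with no teleport.

```python
def pagerank_basic(graph, iterations=50):
    """Repeatedly apply PageRank(p) = sum over q -> p of PageRank(q) / outdegree(q).

    graph maps each page to the list of pages it links to.
    """
    pages = sorted(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for q in pages:
            for p in graph[q]:
                # q passes an equal share of its rank to each successor
                new_rank[p] += rank[q] / len(graph[q])
        rank = new_rank
    return rank


# Hypothetical graphs: the first is strongly connected, the second has C as a sink.
strongly_connected = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
with_sink = {"A": ["B", "C"], "B": ["C"], "C": []}

ranks_ok = pagerank_basic(strongly_connected)
ranks_leaky = pagerank_basic(with_sink)
```

On the strongly connected graph the ranks settle near 0.4, 0.2, and 0.4. On the graph with a sink, the rank that reaches C is never passed on, so the total drains to zero after a few iterations.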
Slide 18
Cont.
[Figure: web graph with two nodes marked as sinks.]
Slide 19
PageRank with Teleport
• Teleport
  – Random surfer jumps from a node to any other node
  – The destination is chosen uniformly from all nodes
    • Prob. of selecting each node is (1/n)
  – In each node, the surfer has the option of jumping
    • Prob. of jumping is α (0 ≤ α ≤ 1)
    • Damping factor (d = 1 − α)
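$$\mathrm{PageRank}(p) = \frac{\alpha}{n} + (1 - \alpha) \sum_{q \to p} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$

$$\mathrm{PageRank}(p) = \frac{1 - d}{n} + d \sum_{q \to p} \frac{\mathrm{PageRank}(q)}{\mathrm{outdegree}(q)}$$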
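The teleport formula with damping factor d can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `pagerank_teleport`, the value d = 0.85, the iteration count, and the toy graph are hypothetical, and the handling of sinks (spreading a sink's rank uniformly over all pages) is one common choice that the slides do not specify.

```python
def pagerank_teleport(graph, d=0.85, iterations=100):
    """Iterate PageRank(p) = (1 - d)/n + d * sum over q -> p of PageRank(q)/outdegree(q).

    graph maps each page to the list of pages it links to.
    d = 0.85 is an assumed damping factor, not a value from the slides.
    """
    pages = sorted(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # teleport term: every page receives (1 - d)/n
        new_rank = {p: (1.0 - d) / n for p in pages}
        for q in pages:
            successors = graph[q]
            if successors:
                share = d * rank[q] / len(successors)
                for p in successors:
                    new_rank[p] += share
            else:
                # Assumption: a sink spreads its rank uniformly over all pages
                share = d * rank[q] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph in which C is a sink.
sink_graph = {"A": ["B", "C"], "B": ["C"], "C": []}
ranks = pagerank_teleport(sink_graph)
```

With teleport and the sink rule, the total rank stays at 1 even though C has no outgoing links, and C ends up ranked above B, which is ranked above A.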
Slide 20
Spamming
• Spam
  – The manipulation of web page content for the purpose of appearing high up in search results.
• Spamming Techniques
  – Text content manipulation (tags, comments, invisible text blocks)
  – Structural content manipulation (mimicking important websites)
Slide 21
Spam Detection
• Spam Detection Methods
  – Text Spam
    • Comparing word probability
  – Link-farm Spam
    • Trust/Anti-trust Rank
    • Community Detection
Slide 22
Link-farm Spam Detection
• Link-farm Spam
  – Trust Rank
  – Anti-trust
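The slide names Trust Rank without giving details. A minimal sketch, assuming the usual formulation in which trust is a PageRank whose teleport jumps land only on a hand-picked set of trusted seed pages: the function name `trust_rank`, the damping value, and the toy web below are hypothetical and do not describe Parsijoo's implementation.

```python
def trust_rank(graph, trusted_seeds, d=0.85, iterations=50):
    """Propagate trust from seed pages along outgoing links.

    graph maps each page to the list of pages it links to.
    trusted_seeds must be pages of graph; d = 0.85 is an assumed damping factor.
    """
    pages = sorted(graph)
    seeds = set(trusted_seeds)
    # teleport vector: uniform over the trusted seeds, zero elsewhere
    bias = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(bias)
    for _ in range(iterations):
        new_trust = {p: (1.0 - d) * bias[p] for p in pages}
        for q in pages:
            for p in graph[q]:
                # q passes a damped, equal share of its trust to each successor
                new_trust[p] += d * trust[q] / len(graph[q])
        trust = new_trust
    return trust


# Hypothetical web: a trusted pair of pages and a separate link farm boosting "target".
toy_web = {
    "seed": ["good"],
    "good": ["seed"],
    "farm1": ["farm2", "target"],
    "farm2": ["farm1", "target"],
    "target": ["farm1"],
}
scores = trust_rank(toy_web, trusted_seeds=["seed"])
```

Because no trusted page links into the farm, the farm pages and their target receive zero trust, while pages reachable from the seed receive a positive score. Anti-trust would presumably run the same propagation in the opposite direction from known spam pages, but the slide does not say how.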
Slide 23
Parsijoo
Slide 24
A Real World Case…
• Parsijoo Facts
  – Crawled Pages: (1 × 10^9 / month) rem. 500 × 10^6
    • Crawling rate: 2000 pages/sec
  – Cached URLs: 10 × 10^9
    • 80,000 URLs/sec
    • 10 × 10^6 unique hosts (each host needs one queue)
  – Unique URLs: 800 × 10^6
  – Unique Words: 80 × 10^6
  – Unique Requests: 200 × 10^3 / day
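The note that each host needs one queue can be illustrated with a small in-memory sketch. The class `HostQueues`, its methods, and the example paths are hypothetical; a crawler at the scale listed above would need persistent, distributed storage rather than Python dictionaries, and the slides do not describe how Parsijoo implements its queues.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class HostQueues:
    """Crawl frontier that keeps one FIFO queue of pending URLs per host."""

    def __init__(self):
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.seen = set()                 # URLs that were already queued

    def add(self, url):
        """Queue a URL under its host; return False if it was seen before."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).hostname or ""
        self.queues[host].append(url)
        return True

    def next_batch(self):
        """Pop at most one URL per host, so no host is requested twice in a round."""
        batch = []
        for host in list(self.queues):
            queue = self.queues[host]
            batch.append(queue.popleft())
            if not queue:
                del self.queues[host]  # drop empty queues
        return batch


frontier = HostQueues()
for url in [
    "http://www.sharif.edu/a",
    "http://www.sharif.edu/b",
    "http://math.sharif.edu/",
    "http://www.sharif.edu/a",  # duplicate, ignored
]:
    frontier.add(url)

first_round = frontier.next_batch()
second_round = frontier.next_batch()
```

Grouping URLs by host lets the crawler take one URL per host in each round, which is one plausible reason for the per-host queues mentioned on the slide.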
Slide 25
A Real World Case…
• Parsijoo Facts
  – Requests (per day)
    • Web: 100 K, Image: 35 K, News: 10 K
    • Music: 10 K, Scholar: 1 K, Video: 5 K
    • SADANA, etc.: 35 K
  – Unique Requests: 200 × 10^3 / day
Slide 26
QA