Keyword Search in Databases using PageRank By Michael Sirivianos April 11, 2003

Page 1: Keyword Search in Databases using PageRank

By Michael Sirivianos

April 11, 2003

Page 2: Roadmap

• PageRank: Ranking Web Pages using link structure
• Ranking Keyword Search Results in Structured Databases
• Ranking Combining Individual PageRanks

Page 3: Roadmap

• PageRank: Ranking Web Pages using link structure of the web
• Ranking Keyword Search Results in Structured Databases
• Ranking Combining Individual PageRanks

Page 4: PageRank (1)

• Stanford project
• Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. “The PageRank Citation Ranking: Bringing Order to the Web”.
• Started Google

Page 5: PageRank (2)

• Makes use of the link structure of the web to calculate a quality ranking (PageRank) for each web page.
• Citation counting is a metric for measuring page/paper quality.
• PageRank is a more sophisticated citation-counting method that is less prone to manipulation.
• Each page has a single PageRank, independent of the keyword query.
• PageRank does NOT express the relevance of a page to a query.

Page 6: PageRank (3) Calculation

• Intuition: the PageRank of page P increases when pages with large PageRanks point to P.
• The rank of a page is evenly distributed among its forward links.
• A problem: when two pages form a loop by pointing to each other but to no other page, the loop accumulates rank in every iteration and never distributes it. This is called a rank sink.

Page 7: PageRank is a Usage Simulation

“Random surfer”:
• Is given a random URL.
• Clicks randomly on links.
• After a while gets bored and gets a new random URL.

The number of visits to each page gives its PageRank.
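The random-surfer description above can be sketched as a small simulation. This is only an illustration under stated assumptions: the four-page `links` graph, the step count, the seed, and the choice to treat "gets bored" as a jump taken with probability 1 - d are all hypothetical and do not come from the slides.

```python
import random

def random_surfer(links, steps=100_000, d=0.85, seed=0):
    """Count page visits of a simulated surfer (illustrative sketch)."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)              # start from a random URL
    for _ in range(steps):
        visits[page] += 1
        if links[page] and rng.random() < d:
            page = rng.choice(links[page])  # click a random link
        else:
            page = rng.choice(pages)        # bored: jump to a random URL
    return visits

# Hypothetical graph: each key links to the pages in its list.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
visits = random_surfer(links)
```

Dividing each count by the number of steps gives visit frequencies, which play the role of relative PageRanks in this sketch.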

Page 8: PageRank Calculation

PR(A) = (1-d) + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

• d: damping factor, normally set to 0.85.
• T1, …, Tn: pages pointing to page A.
• PR(A): PageRank of page A.
• PR(Ti): PageRank of page Ti.
• C(Ti): the number of links going out of page Ti.

Note: d accounts for PageRank sinks.
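The formula above can be evaluated by repeated substitution until the values stop changing. The following is a minimal sketch, assuming every page starts at PageRank 1 and every linked page appears as a key; the three-page graph, `epsilon`, and `max_iter` are hypothetical choices, not values from the slides.

```python
def pagerank(links, d=0.85, epsilon=1e-8, max_iter=1000):
    """Iterate PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)) over the in-links of A."""
    out_count = {page: len(targets) for page, targets in links.items()}
    in_links = {page: [] for page in links}
    for src, targets in links.items():
        for dst in targets:
            in_links[dst].append(src)     # src is one of the Ti pointing to dst

    pr = {page: 1.0 for page in links}
    for _ in range(max_iter):
        new = {
            a: (1 - d) + d * sum(pr[t] / out_count[t] for t in in_links[a])
            for a in links
        }
        change = max(abs(new[p] - pr[p]) for p in links)
        pr = new
        if change < epsilon:              # values have stopped moving
            break
    return pr

# Hypothetical three-page graph.
ranks = pagerank({"X": ["Y", "Z"], "Y": ["Z"], "Z": ["X"]})
```

With this form of the formula the PageRanks of a graph without dead-end pages sum to the number of pages, which is a handy sanity check.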

Page 9: Example of Calculation (1)

[Figure: link graph over four pages: Page A, Page B, Page C and Page D.]

Page 10: Example of Calculation (2)

[Figure: the same graph with Page A, Page B, Page C and Page D each starting at PageRank 1. Two links are labeled 1*0.85/2 and three links are labeled 1*0.85.]

Page 11: Example of Calculation (3)

Each page retains 0.15 that it has not passed on, so we get:
• Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
• Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
• Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
• Page D: receives none, but has not transferred 0.15 = 0.15

[Figure: the graph with updated PageRanks: Page A 1, Page B 0.575, Page C 2.275, Page D 0.15.]

Page 12: Example of Calculation (4)

• Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
• Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
• Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
• Page D: receives none, but has not transferred 0.15 = 0.15

[Figure: the graph with updated PageRanks: Page A 2.08375, Page B 0.575, Page C 1.19125, Page D 0.15.]

Page 13: Example - Conclusions

• Once the values converge, Page C has the highest PageRank and Page A has the next highest: Page C has the highest importance in this page graph!
• More iterations lead to convergence of the PageRanks.
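The example can be replayed in a few lines by pushing each page's rank along its forward links. The link structure below is inferred from the contributions listed in the example (A links to B and C, B to C, C to A, D to C); the helper name `step` and the number of extra iterations are illustrative choices.

```python
# Link structure inferred from the per-page contributions in the example.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def step(ranks, d=0.85):
    """One synchronous update: each page keeps 1-d and receives shares."""
    new = {page: 1 - d for page in links}
    for src, targets in links.items():
        share = d * ranks[src] / len(targets)  # rank split evenly over links
        for dst in targets:
            new[dst] += share
    return new

start = {page: 1.0 for page in links}   # every page starts at PageRank 1
first = step(start)                      # values of the first update
second = step(first)                     # values of the second update

converged = second
for _ in range(200):                     # keep iterating until stable
    converged = step(converged)
```

After the second update Page A is temporarily ahead of Page C; continuing the iteration settles on the ordering C, A, B, D.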

Page 14: Base set

• In practice, when the user gets bored, he tends to go to one of his bookmarked pages instead of a random one. These bookmarked pages constitute the base set.
• The PR formula is modified to reflect this behavior:

PR(A) = (1-d)*E + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

where E = 1 if A is in the base set, and E = 0 otherwise.
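A minimal sketch of the base-set variant follows. Only pages in the base set receive the (1-d) term; the graph, the base set, and the fixed iteration count are hypothetical values chosen for illustration.

```python
def base_set_pagerank(links, base_set, d=0.85, iterations=200):
    """PR(A) = (1-d)*E + d * sum(PR(Ti)/C(Ti)), with E = 1 only in the base set."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new = {page: (1 - d) if page in base_set else 0.0 for page in links}
        for src, targets in links.items():
            for dst in targets:
                new[dst] += d * ranks[src] / len(targets)
        ranks = new
    return ranks

# Hypothetical graph: D links into the cycle but nothing links to D.
links = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
ranks = base_set_pagerank(links, base_set={"A"})
```

In this sketch a page that is outside the base set and has no incoming links, such as D, ends up with rank 0, because rank can only originate at base-set pages.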

Page 15: Roadmap

• PageRank: Ranking Web Pages using link structure
• Ranking Keyword Search Results in Structured Databases
• Ranking Combining Individual PageRanks

Page 16: Keyword Query

Input: set of keywords

Output: list of nodes ranked according to their relevance to the keywords

Score of a result node:
• Sum of keyword-specific PRs (OR semantics)
• Product of keyword-specific PRs (AND semantics)
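The two scoring rules can be sketched as follows. The function name, the dictionary layout, the sample values, and the choice to treat a node that is missing for a keyword as having PR 0 are assumptions made for illustration.

```python
import math

def combine_scores(per_keyword_pr, semantics="or"):
    """Rank nodes by the sum (OR) or product (AND) of keyword-specific PRs."""
    nodes = set()
    for scores in per_keyword_pr.values():
        nodes.update(scores)
    combined = {}
    for node in nodes:
        # Assumption: a node absent from a keyword's scores has PR 0 for it.
        values = [scores.get(node, 0.0) for scores in per_keyword_pr.values()]
        combined[node] = sum(values) if semantics == "or" else math.prod(values)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical keyword-specific PRs.
per_keyword_pr = {
    "databases": {"p1": 0.5, "p2": 1.2},
    "pagerank": {"p1": 1.5, "p3": 1.0},
}
or_ranking = combine_scores(per_keyword_pr, "or")
and_ranking = combine_scores(per_keyword_pr, "and")
```

With the product, a node must have a nonzero PR for every keyword to score above 0, which is what gives the AND behavior.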

Page 17: Database Schema

• C(cid,name)
• Y(yid,year,cid)
• P(pid,title,yid)
• A(aid,name)
• PP(pid1,pid2)
• PA(pid,aid)

C: conference, Y: conference year, P: paper, A: author

[Figure legend: arrows denote primary-to-foreign-key relations.]

• Tuples in C, Y, P, A are objects that represent nodes in the schema graph.
• Primary-to-foreign-key relations represent edges in the graph.
• All connections are two-way except P – P, which goes only from a paper to the paper it cites.
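The mapping from tuples to graph nodes and edges can be sketched like this. The sample rows are invented, node identifiers are written as (table, id) pairs for convenience, and reading PP(pid1,pid2) as "pid1 cites pid2" is an assumption, since the slides do not say which column is the citing paper.

```python
from collections import defaultdict

def build_graph(years, papers, cites, wrote):
    """years: (yid, year, cid); papers: (pid, title, yid);
    cites: (pid1, pid2); wrote: (pid, aid)."""
    edges = defaultdict(set)

    def connect(src, dst, two_way=True):
        edges[src].add(dst)
        if two_way:
            edges[dst].add(src)

    for yid, _year, cid in years:
        connect(("C", cid), ("Y", yid))        # conference <-> year
    for pid, _title, yid in papers:
        connect(("Y", yid), ("P", pid))        # year <-> paper
    for pid, aid in wrote:
        connect(("P", pid), ("A", aid))        # paper <-> author
    for pid1, pid2 in cites:
        # Assumed direction: pid1 is the citing paper, pid2 the cited one.
        connect(("P", pid1), ("P", pid2), two_way=False)
    return edges

# Hypothetical rows.
edges = build_graph(
    years=[(1, 2003, 10)],
    papers=[(100, "Paper X", 1), (101, "Paper Y", 1)],
    cites=[(100, 101)],
    wrote=[(100, 7)],
)
```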

Page 18: Architecture

[Figure: architecture diagram with the following labels.]
• Preprocessing stage: Create PR index; Database; PRindex; parameters d, edge weights, epsilon, threshold.
• Query stage: Query Module; input Keywords, k; output Results.
• Attributes of PRindex table: Keyword; CLOB of (id,PR) list.
• List of: Node id; Node text; PR wrt all keywords.

Page 19: Modified PageRank Formula

If A has the keyword:
PR(A) = (1-d) + d*(weight(T1→A)*PR(T1)/C(T1) + … + weight(Tn→A)*PR(Tn)/C(Tn))

If A doesn’t have the keyword:
PR(A) = d*(weight(T1→A)*PR(T1)/C(T1) + … + weight(Tn→A)*PR(Tn)/C(Tn))

Page 20: Preprocessing stage (1)

• Load the whole database in memory:
  • Create edges Hashtable (nodeId, nodeId, type of edge)
  • Create nodes Hashtable (nodeId)
  • Create text Hashtable (nodeId, text)
• For each keyword:
  • Find all nodes that contain the keyword and put them in the base set.
  • Execute the PR algorithm with the base set.

Page 21: Preprocessing stage (2)

• Create a descending list of (nodeid,PR) pairs.
• Store the list in a CLOB in the PRindex table, indexed by keyword.
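The preprocessing steps, together with the modified formula that uses edge weights and a keyword base set, might be sketched as below. Several things here are assumptions rather than details from the slides: the sample data and edge weights, epsilon read as a convergence tolerance, threshold read as the minimum PR kept in the index, whole-word keyword matching, and a JSON string standing in for the CLOB.

```python
import json

def keyword_pagerank(nodes, edges, base_set, weights,
                     d=0.85, epsilon=1e-6, max_iter=500):
    """Modified PR: only base-set nodes get (1-d); edges are weighted by type."""
    pr = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new = {n: (1 - d) if n in base_set else 0.0 for n in nodes}
        for src, out in edges.items():
            for dst, edge_type in out:
                new[dst] += d * weights[edge_type] * pr[src] / len(out)
        change = max(abs(new[n] - pr[n]) for n in nodes)
        pr = new
        if change < epsilon:
            break
    return pr

def build_pr_index(text, edges, weights, keywords, threshold=1e-4):
    """Return {keyword: serialized descending list of (nodeid, PR)}."""
    index = {}
    for kw in keywords:
        base = {n for n, t in text.items() if kw.lower() in t.lower().split()}
        pr = keyword_pagerank(list(text), edges, base, weights)
        ranked = sorted(((n, s) for n, s in pr.items() if s >= threshold),
                        key=lambda pair: pair[1], reverse=True)
        index[kw] = json.dumps(ranked)    # stand-in for the CLOB column
    return index

# Hypothetical nodes, typed edges, and edge weights.
text = {"p1": "keyword search in databases",
        "p2": "pagerank citation ranking",
        "a1": "michael"}
edges = {"p1": [("p2", "cites"), ("a1", "paper-author")],
         "a1": [("p1", "author-paper")],
         "p2": []}
weights = {"cites": 0.7, "paper-author": 0.2, "author-paper": 0.2}
index = build_pr_index(text, edges, weights, ["pagerank", "databases"])
```

Keeping each list sorted at build time is what lets the query stage scan the lists from the top without sorting again.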

Page 22: Query Stage

• For each keyword in the input, retrieve the (id, PR) list from the database.
• Resolve the top-k ids with respect to the sum of PageRanks using Fagin’s algorithm (PODS 2001).

Page 23: Fagin’s Algorithm

• Operates on keyword-specific PR lists sorted in descending order.
• For each node, keep the maximum possible value: the PR extracted for the node so far from the scanned lists, plus the PR of the currently pointed-to nodes in the other lists. Also keep the minimum value: the PR extracted for the node so far.
• The algorithm terminates when it finds k objects whose minimum value is greater than the maximum possible value of each of the remaining nodes.
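The bookkeeping described above can be sketched with a sorted scan that reads one entry from every list per round. This is an illustrative reading of the slide, not the system's actual code: the function name and sample lists are invented, a node missing from a list is assumed to contribute 0, and ties are accepted at the stopping test (greater than or equal).

```python
def top_k_by_sum(lists, k):
    """Top-k node ids by summed PR, scanning descending lists in parallel."""
    seen = {}                                  # node -> {list index: PR}
    max_depth = max(len(lst) for lst in lists)
    for depth in range(max_depth):
        frontier = []                          # PR currently pointed to per list
        for i, lst in enumerate(lists):
            if depth < len(lst):
                node, score = lst[depth]
                seen.setdefault(node, {})[i] = score
                frontier.append(score)
            else:
                frontier.append(0.0)           # exhausted list adds nothing
        # Minimum value: PR extracted so far for the node.
        lower = {n: sum(s.values()) for n, s in seen.items()}
        # Maximum value: minimum plus the frontier of lists not yet matched.
        upper = {n: lower[n] + sum(f for i, f in enumerate(frontier) if i not in s)
                 for n, s in seen.items()}
        top = sorted(lower, key=lower.get, reverse=True)[:k]
        if len(top) == k:
            rest = [upper[n] for n in seen if n not in top]
            rest.append(sum(frontier))         # bound for nodes not seen at all
            if lower[top[-1]] >= max(rest):
                return top
    lower = {n: sum(s.values()) for n, s in seen.items()}
    return sorted(lower, key=lower.get, reverse=True)[:k]

# Hypothetical keyword-specific lists, each sorted by descending PR.
lists = [
    [("p1", 0.9), ("p2", 0.8), ("p3", 0.3), ("p4", 0.1)],
    [("p2", 0.7), ("p3", 0.6), ("p1", 0.2), ("p4", 0.05)],
]
best_two = top_k_by_sum(lists, 2)
```

On this sample the scan stops after three rounds without ever reading the entries for p4, which is the point of keeping the two bounds.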

Page 24: Conclusions

• We implemented a system for keyword search in databases using PageRank.
• It uses an index of keyword-specific Object Ranks