Page 1: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Google Search Engine*

CS461 Lecture, Department of Computer Science

Iowa State University

1. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, S. Brin and L. Page, in Proceedings of WWW’98

2. “The PageRank Citation Ranking: Bringing Order to the Web”, L. Page, S. Brin, R. Motwani, and T. Winograd, Technical Report, Stanford University, 1998

Page 2: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

What to cover today

PageRank

Google Architecture

Page 3: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Problem Statement

Ultimate version: find what I want
  In most cases, I don’t know exactly, or cannot clearly express, what I want
  “What-I-want” can be estimated using a set of keywords

Simplified version: find the files that are most related to a set of keywords

Page 4: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Naïve Solution

How it works:
  Download the entire Internet to a local machine
  Search and return all files containing the set of keywords

Problems:
  All files are treated as equally important
  Could return tons of files, but most of them are not what I want
  Since most users simply check out the first few files, this scheme cannot actually find many useful results
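The naive scheme can be sketched as follows. This is a minimal illustration, assuming the downloaded files fit in a dictionary that maps a file name to its text; the function name `naive_search`, the sample files, and the case-insensitive substring matching are assumptions made here, not part of the lecture.

```python
def naive_search(files, keywords):
    """Return the names of all files whose text contains every keyword.

    No ranking is applied: every match is treated as equally important.
    """
    wanted = [k.lower() for k in keywords]
    return sorted(
        name
        for name, text in files.items()
        if all(k in text.lower() for k in wanted)
    )


# Hypothetical local copy of "the Internet".
files = {
    "a.html": "Google is a web search engine",
    "b.html": "A search for the lost engine",
    "c.html": "Cooking recipes",
}

matches = naive_search(files, ["search", "engine"])
print(matches)
```

Both matching files come back with no indication of which one is more useful, which is the weakness the slide points out.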

Page 5: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Ranking Based on Hit Rate

How it works:
  A file is ranked higher if it is visited more frequently

Problems:
  Could be affected by faked hits
  A highly ranked file attracts more visits, so it will be ranked higher and higher

Page 6: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Ranking based on Citation

Basic idea:
  A paper is important if it is cited by many papers
    Each paper has a set of references that link to the related work
    A pioneering paper typically has a high citation count
  An HTML page is more important if it is linked to by many other pages
    Each page may link to other pages

Problems:
  Publication of academic papers is well controlled
    Many are peer-reviewed
    Chronologically ordered
  Internet files could be anything
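Citation-style ranking amounts to counting incoming links. The sketch below assumes the link structure is available as a dictionary from each page to the pages it links to; the function name `backlink_counts` and the four-page graph are hypothetical.

```python
from collections import Counter


def backlink_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # a linking page is counted once
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

counts = backlink_counts(links)
for page, n in counts.most_common():
    print(page, n)
```

Every link counts the same here, regardless of how important the linking page is.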

Page 7: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Proposed: PageRank

Basic idea:
  A page with many links to it is more likely to be useful than one with few links to it
    Just like citation
  The links from a page that itself is the target of many links are likely to be particularly important
    This is something new

Page 8: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Proposed: PageRank

[Figure: a page with its back links (incoming) and forward link (outgoing)]

Each link has a different weight

Page 9: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Proposed: PageRank

How it works:
  Each page is ranked using a value called PageRank (PR)
  A page’s PR depends on the PRs of its back-link pages

PR(A) = (1 - d) + d * [PR(T1)/C(T1) + … + PR(Tn)/C(Tn)]

d: damping factor, normally set to 0.85
T1, …, Tn: the pages that point to page A
PR(A): PageRank of page A
PR(Ti): PageRank of page Ti, which points to page A
C(Ti): the number of links going out of page Ti
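The formula can be read directly as code. This is a minimal sketch, assuming the current PR values and the link structure are held in dictionaries; the names `pagerank_of`, `ranks` and `links`, and the pages T1, T2 and X, are illustrative only.

```python
def pagerank_of(page, ranks, links, d=0.85):
    """PR(A) = (1 - d) + d * sum of PR(Ti) / C(Ti) over pages Ti linking to A.

    ranks: current PR value of every page
    links: page -> list of pages it links to, so C(Ti) = len(links[Ti])
    """
    total = 0.0
    for source, targets in links.items():
        if page in targets:
            total += ranks[source] / len(targets)
    return (1 - d) + d * total


# T1 links to A and X (C(T1) = 2); T2 links only to A (C(T2) = 1).
links = {"T1": ["A", "X"], "T2": ["A"], "A": [], "X": []}
ranks = {"T1": 1.0, "T2": 1.0, "A": 1.0, "X": 1.0}

pr_a = pagerank_of("A", ranks, links)
print(pr_a)  # 0.15 + 0.85 * (1/2 + 1/1), up to floating-point rounding
```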

Page 10: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Proposed: PageRank

Properties of the PageRank formula:
  PageRanks form a probability distribution over web pages, so the normalized sum of all web pages' PageRanks (the average over all pages) will be one

Challenge of calculating PageRanks:
  The links could be circular, e.g., A → B → A

[Figure: Page A and Page B linking to each other]

Page 11: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

PageRank Calculation

Assign each page an initial rank value
  Could be any number (seed)
Repeat the calculations until the rank of each page does not change much

[Figure: Page A and Page B linking to each other]

d = 0.85
PR(A) = (1 - d) + d * (PR(B)/1)
PR(B) = (1 - d) + d * (PR(A)/1)

Seed = 1
PR(A) = 0.15 + 0.85 * 1 = 1
PR(B) = 0.15 + 0.85 * 1 = 1
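A short derivation suggests why a seed of 1 does not move in this two-page case. It is a sketch that uses only the two update equations above; the symbol x for the common settled value is introduced here for illustration.

```latex
\begin{aligned}
PR(A) - PR(B) &= d\,\bigl(PR(B) - PR(A)\bigr)
  \;\Longrightarrow\; PR(A) = PR(B) = x \\
x &= (1 - d) + d\,x \\
(1 - d)\,x &= 1 - d \\
x &= 1 \qquad (d \neq 1)
\end{aligned}
```

So 1 is the only value at which both equations hold at once, whatever damping factor below 1 is chosen.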

Page 12: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

PageRank Calculation

Seed = 0
1) PR(A) = 0.15 + 0.85 * 0 = 0.15
   PR(B) = 0.15 + 0.85 * 0.15 = 0.2775
2) PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875
   PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375
3) PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875
   PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375
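The three rounds above can be reproduced with a few lines of code. This sketch assumes, as the slide's numbers imply, that PR(B) is computed from the PR(A) value just updated in the same round; the function name `iterate_two_pages` is hypothetical.

```python
def iterate_two_pages(seed, rounds, d=0.85):
    """Run the two-page update, recording (PR(A), PR(B)) after each round."""
    pr_a = pr_b = seed
    history = []
    for _ in range(rounds):
        pr_a = (1 - d) + d * pr_b  # uses PR(B) from the previous round
        pr_b = (1 - d) + d * pr_a  # uses the PR(A) just computed
        history.append((pr_a, pr_b))
    return history


history = iterate_two_pages(seed=0.0, rounds=3)
for step, (pr_a, pr_b) in enumerate(history, start=1):
    print(f"{step}) PR(A) = {pr_a:.12f}  PR(B) = {pr_b:.12f}")
```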

Page 13: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

PageRank Calculation

Seed = 40
1) PR(A) = 0.15 + 0.85 * 40 = 34.15
   PR(B) = 0.15 + 0.85 * 34.15 = 29.1775
2) PR(A) = 0.15 + 0.85 * 29.1775 = 24.950875
   PR(B) = 0.15 + 0.85 * 24.950875 = 21.35824375
3) ......

Page 14: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

PageRank Calculation

Observation: it doesn’t matter what seed value you use; once the PageRank calculations settle down, the “normalized probability distribution” (the average PageRank over all pages) will be 1.0
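The observation can be checked numerically for the two-page example. The sketch below repeats the update until neither value changes by more than a tolerance; the function name `settle`, the tolerance `tol` and the cap `max_rounds` are choices made here for illustration.

```python
def settle(seed, d=0.85, tol=1e-9, max_rounds=10_000):
    """Iterate the two-page example until the ranks stop changing much."""
    pr_a = pr_b = float(seed)
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        new_a = (1 - d) + d * pr_b
        new_b = (1 - d) + d * new_a
        change = max(abs(new_a - pr_a), abs(new_b - pr_b))
        pr_a, pr_b = new_a, new_b
        if change < tol:
            break
    return pr_a, pr_b, rounds


results = {seed: settle(seed) for seed in (0, 1, 40)}
for seed, (pr_a, pr_b, rounds) in results.items():
    average = (pr_a + pr_b) / 2
    print(f"seed={seed}: PR(A)={pr_a:.6f} PR(B)={pr_b:.6f} "
          f"average={average:.6f} rounds={rounds}")
```

The seed only affects how many rounds are needed, not the values the calculation settles on.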

Page 15: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Example of Calculation (0)

[Figure: four pages A, B, C and D; A links to B and C, B links to C, C links to A, and D links to C]

Page 16: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Example of Calculation (1)

Page A: 1
Page B: 1
Page C: 1
Page D: 1

Page 17: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Example of Calculation (2)

Page A: 1
Page B: 1
Page C: 1
Page D: 1

Transfers along the links:
A → B: 1*0.85/2
A → C: 1*0.85/2
B → C: 1*0.85
C → A: 1*0.85
D → C: 1*0.85

Page 18: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Each page keeps 0.15 that is not passed on, so we get:

Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
Page D: receives none, but has not transferred 0.15 = 0.15

Page A: 1
Page B: 0.575
Page C: 2.275
Page D: 0.15

Page 19: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Example of Calculation (3)

Page A: 1
Page B: 0.575
Page C: 2.275
Page D: 0.15

Page 20: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
Page D: receives none, but has not transferred, remains at 0.15

Page A: 2.08375
Page B: 0.575
Page C: 1.19125
Page D: 0.15
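The first two passes of the four-page example can be checked with a short sketch. It assumes the link structure implied by the transfers (A links to B and C, B to C, C to A, D to C) and that every page is updated from the previous pass's values; `LINKS` and `update_once` are names chosen here.

```python
# Link structure implied by the transfers: page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}


def update_once(ranks, links, d=0.85):
    """One pass: each page keeps (1 - d) and receives d * PR / C per back link.

    Assumes every link target also appears as a key of `links`.
    """
    new_ranks = {page: 1 - d for page in links}
    for source, targets in links.items():
        share = d * ranks[source] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


first = update_once({page: 1.0 for page in LINKS}, LINKS)
second = update_once(first, LINKS)

for label, ranks in (("pass 1", first), ("pass 2", second)):
    print(label, {page: round(value, 5) for page, value in sorted(ranks.items())})
```

The second pass gives 2.08375 for page A and 1.19125 for page C, matching the per-page sums worked out above.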

Page 21: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Example of calculation (4)

After 20 iterations, we get

Page A: 1.490
Page B: 0.783
Page C: 1.577
Page D: 0.15

In reality: PageRank for 26,000,000 web pages can be computed in a few hours on a medium-size workstation (1998).
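For the four-page example, running the update 20 times is easy to sketch. The code assumes the same link structure as the example (A links to B and C, B to C, C to A, D to C), a seed of 1 for every page, and updates computed from the previous iteration's values; the function name `pagerank` and its parameters are illustrative, and this is not the implementation behind the 1998 figure.

```python
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}


def pagerank(links, d=0.85, iterations=20, seed=1.0):
    """Repeat the PageRank update a fixed number of times.

    Assumes every page has at least one outgoing link and that every
    link target appears as a key of `links`.
    """
    ranks = {page: seed for page in links}
    for _ in range(iterations):
        new_ranks = {page: 1 - d for page in links}
        for source, targets in links.items():
            for target in targets:
                new_ranks[target] += d * ranks[source] / len(targets)
        ranks = new_ranks
    return ranks


ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"Page {page}: {ranks[page]:.3f}")
print("average:", sum(ranks.values()) / len(ranks))
```

The four values sum to 4, so the average PageRank is 1, in line with the earlier observation about the seed.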

Page 22: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Result

Page C has the highest PageRank, and page A has the next highest: page C is the most important page in this link structure.

More iterations lead to stable PageRank values, which are then used to rank the pages returned for a keyword search.

Page 23: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

PageRank Summary

PageRank is a citation importance ranking
  Approximate measure of importance or quality
  Number of citations or backlinks

The pages with high PageRanks are those that are linked to by many pages and/or by important pages (e.g., Yahoo!)

Page 24: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

PageRank Summary

Each citation has a different weight

Questions: how to improve the ranking of your web pages?
  Creating dummy sites that link to your main site?
  Increasing internal links and/or decreasing external links?
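The first question can be explored on a toy graph. This is only a sketch under the lecture's formula: it takes a hypothetical four-page graph in which page D has no back links, adds three hypothetical dummy pages that each link only to D, and compares D's PageRank before and after. It says nothing about how a real search engine treats such pages.

```python
def pagerank(links, d=0.85, iterations=50):
    """Fixed-iteration PageRank using PR = (1 - d) + d * sum(PR(T) / C(T))."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {page: 1 - d for page in links}
        for source, targets in links.items():
            for target in targets:
                new_ranks[target] += d * ranks[source] / len(targets)
        ranks = new_ranks
    return ranks


# Hypothetical base graph: page -> pages it links to.
base = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

# Same graph plus three dummy pages whose only link points at D.
boosted = dict(base)
for name in ("dummy1", "dummy2", "dummy3"):
    boosted[name] = ["D"]

before = pagerank(base)["D"]
after = pagerank(boosted)["D"]
print(f"PR(D) without dummy pages: {before:.4f}")
print(f"PR(D) with 3 dummy pages:  {after:.4f}")
```

Each dummy page has no back links of its own, so it settles at 0.15 and passes 0.85 * 0.15 to D; the gain per dummy page is small because the dummy pages themselves are unimportant.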

Page 25: Google Search Engine* CS461 Lecture Department of Computer Science Iowa State University

Google Architecture