the anatomy of a large-scale hypertextual web search engine

The Anatomy of a Large-Scale Hypertextual Web Search Engine

A review by: Adam Chamberlain, Adrian Hudnott, Rob Garrood & Ben Smith

November 2005


Agenda
• Introduction
• Overview of Google
• PageRank
  – Motivation & Description
  – Example
  – Issues & Comparison
  – Further Work
• Application
• Conclusions


Introduction
• About the paper
  – Brin & Page, 1998, Stanford University
  – Details a prototype search engine, Google
  – Covers both architecture and algorithms
  – Cited in the web metrics literature in relation to web page significance
    • Also relevant to Web Graph Properties
• PageRank
  – Covered in a separate paper from Brin & Page
  – Is the primary metric used in the paper


Overview : What is Google?
• Web search engine
  – Tackles the scalability and manipulation problems faced by previous crawlers
• Academic
  – Built on strong understanding of web metrics
  – Use of hyperlink structures
• Transparent
  – Initially released into the public domain
  – Support for informatics research


Overview : Architecture

[Figure: architecture diagram with components URL Server, Crawler, Store Server, Repository, Indexer, URL Resolver, Anchors, Lexicon, Barrels, Links, Doc Index, Sorter, PageRank, Searcher and Checksums]


Overview : Google Architecture

(Explanation for handout only.)
• URL Server: Finds pages to surf.
• Crawler: Downloads pages and places them in the repository.
• Store Server: Document compression.
• Repository: Cached copies of most web pages.
• Indexer: Creates the forward index (documents → words) and extracts hyperlink tags into the Anchors file.
• URL Resolver: Converts relative URLs into absolute URLs and creates the Links file.
• Links file: Ordered pairs of document IDs where a hyperlink exists between them.
• Sorter: Re-sorts the forward index to create the inverted index (words → documents) and creates the Lexicon.
• Lexicon: Dictionary of all possible search keywords.
• Doc Index: Maps document identifier codes to URLs.
• PageRank: An influential web metric used to sort Google’s matches.
• Searcher: Performs searches!


Overview : Forward Index
• Indexer identifies keyword ‘hits’ in a document
• Maps document (page) IDs to word IDs in the Lexicon
• Word IDs partially sorted into barrels
  – 64 of these
  – Word IDs within a barrel are unsorted.
  – An individual document may be spread over several barrels.
• However, not useful for search!


Overview : Inverted Index
• Want to know in what documents a keyword occurs
• Need the ‘Inverted Index’
• Sorts the forward index into its inverted form
• Function performed by the ‘Sorter’
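
The relationship between the two indexes can be sketched in a few lines of Python. This is a toy illustration only: the document IDs, word IDs and the function name `invert` are made up for the example, and it ignores barrels, hit types and on-disk layout, all of which the real Sorter has to deal with.

```python
from collections import defaultdict

# Hypothetical toy forward index: document ID -> list of word IDs ("hits").
forward_index = {
    1: [10, 42, 42, 7],
    2: [42, 99],
    3: [7, 10],
}


def invert(forward):
    """Turn doc ID -> word IDs into word ID -> sorted list of doc IDs."""
    inverted = defaultdict(set)
    for doc_id, word_ids in forward.items():
        for word_id in word_ids:
            inverted[word_id].add(doc_id)
    # Sort by word ID so a lookup structure (the Lexicon) can point into it.
    return {word_id: sorted(docs) for word_id, docs in sorted(inverted.items())}


inverted_index = invert(forward_index)
print(inverted_index[42])  # documents containing word 42 -> [1, 2]
```

A search for a keyword then becomes a single lookup by word ID, which is why the forward index on its own is "not useful for search".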


Overview : Ranking System
• Proximity of keyword ‘hits’
  – This is the sum of the distances between them
• Hits have ‘types’
  – Types: body text, heading text, anchor text, URL, …
  – Relative font size factor used
• Count how many hits occur of each type and range of proximity values
  – Apply a function to each type-proximity count
• These form a type-proximity vector, C


Overview : Ranking System (2)
• V = C·W (dot product) is computed.
  – W is the importance associated with each type-proximity class.
• Combine V with the PageRank score
• Effect of increasing hits declines
  – Prevents large scale manipulation

[Figure: plot of f(x) against hit count, x]
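
A minimal sketch of the V = C·W step follows, under loudly stated assumptions: the class names, the weights in `WEIGHTS` and the choice of a logarithm for f(x) are all hypothetical, picked only to show a function whose effect declines as the hit count grows. The slides give none of these values, and they do not say how V is combined with PageRank, so that step is left out.

```python
import math

# Hypothetical type-proximity classes and weights W (not from the paper).
WEIGHTS = {
    ("anchor", "near"): 4.0,
    ("heading", "near"): 3.0,
    ("body", "near"): 1.5,
    ("body", "far"): 1.0,
}


def damp(count):
    """Assumed f(x): rises quickly for the first hits, then flattens."""
    return math.log1p(count)


def ir_score(hit_counts, weights=WEIGHTS):
    """V = C . W, where C holds f(count) for each type-proximity class."""
    return sum(damp(hit_counts.get(cls, 0)) * w for cls, w in weights.items())


few = ir_score({("body", "far"): 5})
many = ir_score({("body", "far"): 500})
print(few, many)
```

With this f(x), a page that repeats a keyword 100 times as often gains only a few times the score, which is the anti-manipulation effect the slide describes.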


PageRank : Motivation
• Academic Citation Analysis* attempted, but…
  – Web has no formal quality control or peer review
  – Possible to inflate citation counts artificially
  – Web pages vary more than academic papers
• Consider:
  – One link from the University’s main page, or one link from Yahoo’s main page…
  – Which citation should carry the higher weight?

*Also known as bibliometrics


PageRank : Description
• Informal Definition:
  – “A page has a high rank if the sum of the ranks of its backlinks is high”
  – Handles ‘Yahoo’ case on previous slide
• Intuitive Definition:
  – Corresponds to the Random Surfer Model
  – User keeps clicking on links ‘linearly’, then gets bored and restarts at a random location
• Now for the maths…


PageRank : Description (2)
• Formal Definition:

$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + cE(u)$$

  – $c$ is a ‘dampening’ factor; 0.85 was used
  – $N_v$ is the number of out-links from page $v$
  – $B_u$ is the set of pages that link to page $u$ (its backlinks)
  – $cE(u)$ corresponds to the surfer getting ‘bored’


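PageRank : Example
• Considering an example network
• Calculating A:

$$R(A) = (1 - c) + c\left(\frac{R(B)}{N(B)} + \frac{R(C)}{N(C)} + \frac{R(E)}{N(E)}\right)$$

c = dampening factor
N = out-degree
R = PageRank

[Figure: example network of five pages, A, B, C, D and E]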


PageRank : Example (2)
• Initially set all PageRank to 1
• First Iteration:

In-Links   Rank (R)   Out-Links (N)   R/N
B          1          1               1
C          1          2               0.5
E          1          2               0.5

$$R(A) = (1 - 0.85) + 0.85\,(1 + 0.5 + 0.5) = 1.85$$

[Figure: example network of five pages, A, B, C, D and E]


PageRank : Example (3)
• Repeat process for B, C, D and E
• Feed computed values into next iteration

Iteration   1        2        3        4        5        6
A           1.8500   1.2479   1.1967   1.5230   1.3412   1.2954
B           0.4333   0.4333   0.6380   0.4930   0.4807   0.5593
C           0.8583   0.7981   0.9772   0.9084   0.8668   0.9277
D           1.0000   1.7225   1.2107   1.1672   1.4445   1.2900
E           0.8583   0.7981   0.9772   0.9084   0.8668   0.9277
Order       ADCEB    DACEB    ADCEB    ADCEB    DACEB    ADCEB
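
The iteration can be sketched in Python as below. One assumption needs flagging: the network diagram did not survive extraction, so the link structure in `LINKS` is inferred from the table values (A→D; B→A; C→A, E; D→B, C, E; E→A, C) rather than copied from the slides. It matches the stated in-links of A and their out-degrees. The function names are illustrative, and every page is updated from the previous iteration's values, as the slide describes.

```python
# Link structure inferred from the table values (an assumption; the
# original diagram is not available): page -> pages it links to.
LINKS = {
    "A": ["D"],
    "B": ["A"],
    "C": ["A", "E"],
    "D": ["B", "C", "E"],
    "E": ["A", "C"],
}


def pagerank_step(ranks, links, c=0.85):
    """One synchronous update: R(u) = (1 - c) + c * sum(R(v) / N(v))."""
    new_ranks = {}
    for page in links:
        # Sum R(v) / N(v) over every page v that links to this page.
        incoming = sum(
            ranks[v] / len(out) for v, out in links.items() if page in out
        )
        new_ranks[page] = (1 - c) + c * incoming
    return new_ranks


def iterate(links, iterations, c=0.85):
    """Start every page at rank 1 and record the ranks after each step."""
    ranks = {page: 1.0 for page in links}
    history = []
    for _ in range(iterations):
        ranks = pagerank_step(ranks, links, c)
        history.append(ranks)
    return history


history = iterate(LINKS, 6)
for i, ranks in enumerate(history, start=1):
    order = "".join(sorted(ranks, key=lambda p: -ranks[p]))
    print(i, {p: round(r, 4) for p, r in ranks.items()}, order)
```

Under the inferred link structure, the first iterations reproduce the table, for example R(A) = 1.85 after iteration 1 and R(D) = 1.7225 after iteration 2.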


PageRank : Analysis
• Converges in log n time
  – Constrained by the time to build a full-text index more than anything
• Rank ‘Sinks’
  – Caused by two pages that point to each other but not to any other pages: rank accumulates
  – Solved by random surfer model
• Manipulation – ‘Google Bombing’
  – French Military ‘Victories’ links to ‘Defeats’
  – ‘Miserable Failure’ links to George Bush biography
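
The rank sink problem can be illustrated with a small sketch. The three-page web below is hypothetical: A links into a pair (B, C) that link only to each other. Without the ‘bored surfer’ term, all rank drains into the pair; with it (here in the (1 − c) form used in the example slides, c = 0.85), every page keeps a baseline share.

```python
# Hypothetical three-page web: B and C only link to each other (a sink).
LINKS = {"A": ["B"], "B": ["C"], "C": ["B"]}


def step(ranks, links, c=None):
    """One update. c=None means no damping: R(u) = sum(R(v) / N(v))."""
    new_ranks = {}
    for page in links:
        incoming = sum(
            ranks[v] / len(out) for v, out in links.items() if page in out
        )
        new_ranks[page] = incoming if c is None else (1 - c) + c * incoming
    return new_ranks


def run(links, iterations, c=None):
    """Start every page at rank 1 and apply the update repeatedly."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        ranks = step(ranks, links, c)
    return ranks


undamped = run(LINKS, 10)
damped = run(LINKS, 10, c=0.85)
print(undamped)  # A ends with no rank; B and C hold all of it
print(damped)    # A keeps the (1 - c) baseline
```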


PageRank : Comparison
• Web Graph Properties
  – Uses graph of the entire web: depends on full crawl
  – More sophisticated than simply summing in/out-degrees
• Web Page Significance
  – Uses Boolean Spread Activation – match all words
  – Enhanced citation analysis – building on work of Kleinberg, Egghe & Rousseau
  – Doesn’t suffer from Tightly Knit Communities effect of Kleinberg’s Hubs & Authorities


PageRank : Further Work
• Personalised PageRank, Haveliwala, 1999
  – In-memory, block-oriented algorithm
    • PageRank can be computed in an hour on a 450 MHz PIII using less than 100 MB of main memory
  – Compute PageRank on the client side
    • Use local information: bookmarks, searches, history
    • Provide the link structure of the web on a DVD
  – 11/11/05, “Personalized Search” released


PageRank : Further Work (2)
• Topic Sensitive PageRank, Haveliwala, 2002
  – Improve Google by giving weight to the informational relationship between sites
  – A) Uniform Results
    • Similar to ‘current’ Google but with topics
  – B) Personalised to a particular user
    • Based on previous searches and users’ surfing habits


Applications : Google
• Google Inc.
  – Largest search engine
    • Technologies utilised by others (e.g. Yahoo!)
    • Biggest ever technology IPO, 2004
  – Redefining search
    • Set a trend for other search providers
    • Raised importance of quality web search results
    • Combining information retrieval methods
  – Business model based on advertising
    • Potential area for conflict
    • Over 100 factors now influence results


Applications : PageRank
• Back-link prediction
  – Desire for optimal web crawling strategy
  – Better indicator than citation counts!
• Improving user navigation
  – ‘The PageRank Proxy’
  – Providing PageRank information with links
• Establishing trust
  – Wealth of authors on the web, who to trust?
  – Use PageRank to rate trust


Applications : The Future
• Internal Development
  – Project no longer in academic realm
    • Lacks the transparency initially intended
    • Role of PageRank unclear
    • Likely focus on extensions and results tuning
• External Development
  – APIs
    • Allowing innovative use of Google technologies
  – Open Source Code
    • Focused on developing infrastructure


Conclusions
• Academic Background
  – Success from strong academic understanding
  – Raised profile of informatics and search
  – Good platform for future research
• Success as a failure
  – Intention for transparency and use in academia
  – Commercial success has removed transparency
  – Potentially bad for further research in this area


Summary
• We have seen:
  – The architecture used by Google
  – PageRank as a web metric
  – Strengths and potential manipulations
  – The commercial success of Google
  – Applications
  – Potential areas of future research


References
• Work by Brin & Page (now at Google)
  – Brin, S. and Page, L. (1998), ‘The anatomy of a large-scale hypertextual Web search engine’, Computer Networks and ISDN Systems, 30(1-7):107–117.
  – Page, L., Brin, S., Motwani, R. and Winograd, T. (1998), ‘The PageRank Citation Ranking: Bringing Order to the Web’, Stanford Digital Library Technologies Project.
  – More papers at: http://www.google.com on many aspects of web metrics and search in general
• PageRank
  – http://www.iprcom.com/papers/pagerank/
  – Take a look at the example at: http://www.dcs.warwick.ac.uk/~csucbu
  – http://en.wikipedia.org/wiki/Google_bomb


References (2)
• Further Developments
  – Haveliwala, T. H. (1999), ‘Efficient computation of PageRank’, Technical report, Stanford University, Stanford, CA.
  – Haveliwala, T. H. (2002), ‘Topic-sensitive PageRank’, in Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, May 2002.
• Commercial Aspect
  – http://money.cnn.com/2004/04/29/technology/google/
  – http://www.google.com/corporate/history.html
• Web Metrics
  – Dhyani, D., Ng, W. K. and Bhowmick, S. S. (2002), ‘A survey of web metrics’, ACM Computing Surveys, 34(4):469–503.
