CSM06 Information Retrieval

Lecture 4: Web IR part 1

Dr Andrew Salway a.salway@surrey.ac.uk

Lecture 4: OVERVIEW

• Previously we looked at IR techniques that indexed a document based on the words that occur in the document

• Some of these techniques are applied in web search engines (although the vector space model, VSM, may not be appropriate). However, web IR can also exploit a distinctive feature of information on the web: its hypertext link structure

Use of anchor text for indexing web pages

The PageRank algorithm based on link structure analysis

Other techniques for ranking web pages

Challenges for IR on the Web

• High volume of information

• Heterogeneous information (multimedia and multilingual)

• Diverse users, hence diverse information needs, and many inexperienced users

• Average query length 2-5 words

• Poorly structured and low quality information

Scale

• Projection of worldwide Internet population in 2005 = 1.07 billion users, www.clickz.com/stats/web_worldwide/

• Early in 2005 Google claimed to index over 8 billion web pages; Yahoo recently claimed 19 billion; Google now claims to index 3 times more than its nearest competitor, http://select.nytimes.com/gst/abstract.html?res=F30610F93E540C748EDDA00894DD404482

• Given the low overlap in search engine results for a given query, it is likely that the total number of webpages is much greater than that indexed by any single web search engine

Requirements of Web Search Engine Users?

• Fast response time

• Some relevant results in first page; maybe less concern with getting all relevant results

• Good coverage of web, at least of ‘important sites’

• Up-to-date links

• Simple and intuitive to use, both for making queries and for understanding results

NB. Some of these requirements contrast with those of expert researchers using specialist information retrieval systems

User Goals (Information Needs)

• Queries are used to express a user’s goal (or information need), but note that the same query might be used for quite different goals

(Rose and Levinson 2004)

User Goals: Rose and Levinson’s classification (2004)

1. Navigational – wanting a specific known website

2. Informational – “my goal is to learn something by reading or viewing web pages” – e.g. closed and open-ended questions, advice

3. Resource – “my goal is to obtain a resource (not information) available on web pages” – e.g. download music, interact with online shopping service

NOTE: prior to the web, most IR was concerned only with Informational queries

User Goals: Rose and Levinson’s classification (2004)

• The more a search engine understands about a user’s goal, the better the results it can provide

User goals may be deduced not only from the query, but also from:

• The results returned by the search engine

• Results clicked on by the user

• Further searches / actions by the user

Opportunity…

• Web search engines can exploit the fact that information on the web is in the form of hypertext…

Hypertext

• The web is, in some senses at least, hypertextual, i.e. it can be viewed as networks of nodes (e.g. pages) and links (between pages)

Hypertext

• Links suggest relatedness of topic, and perhaps also a recommendation

• Topological information about the hypertext graph gained by link structure analysis can be exploited for ranking

Use of Anchor Text (Brin and Page 1998)

• Words in the anchor text can be used to index the webpage being linked to – the text in an anchor may give a good description of the page it points to, e.g.

<p><a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a></p>

• The words in the anchor text might be a better indicator of what the webpage is about than the words in the webpage

• Anchor text is also good for resources like images, which contain no words of their own to index
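
The idea of indexing a page by the anchor text that points to it can be sketched as follows. This is a minimal illustration using Python's standard `html.parser`, not a description of how any search engine actually does it; the names `AnchorCollector` and `index_anchor_text` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (target URL, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.pairs.append((self._href, text))
            self._href = None


def index_anchor_text(pages):
    """Map each anchor-text word to the set of URLs that the anchors point to."""
    index = defaultdict(set)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        for url, text in collector.pairs:
            for word in text.lower().split():
                index[word].add(url)  # the word indexes the TARGET page
    return index


pages = ['<p><a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a></p>']
index = index_anchor_text(pages)
print(sorted(index["beckham"]))
```

Note that the words are attached to the page being linked to, not to the page that contains the link, so a target can be indexed even if it is an image or has not been crawled.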

PageRank (Brin and Page 1998)

• “Google makes use of both link structure and anchor text”

• “The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines”

PageRank is “an objective measure of [a web page’s] citation importance that corresponds well with people’s subjective idea of importance”

Calculating PageRank

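PR(A) = (1-d) + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

PR(A) = PageRank of webpage A

C(A) = the number of links out of webpage A

T1…Tn = the webpages that point to webpage A

d = a damping factor set between 0 and 1

In practice, the calculation of PageRank is iterative: the PageRank of A depends on the PageRank of the pages that point to it, so every page is given a starting value and the formula is reapplied until the values settle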
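
A minimal sketch of the iterative calculation, applying the formula above to a tiny hand-made graph. The function name `pagerank`, the starting value of 1.0, the fixed number of iterations and the example graph are all assumptions made for illustration; d = 0.85 is a commonly quoted setting, used here only as a default.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(A) = (1-d) + d * sum(PR(T)/C(T)).

    links maps each page to the list of pages it links to
    (assumed to contain no duplicate targets).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page T that points to `page` contributes PR(T) / C(T).
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example C is pointed to by both other pages and ends up with the highest value, while B, which receives only half of A's vote, ends up lowest.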

Web-adjacency Analysis (a similar idea to PageRank)

• Kleinberg and colleagues proposed a method for identifying authoritative web-pages

– Identify set of relevant pages (as normal)

– Identify those with a large in-degree, i.e. lots of pages point to them (cf. ‘impact’)

– Ensure that the authorities selected are referred to by a number of the same hubs, i.e. those with a large out-degree

Web-adjacency Analysis

• “Hubs and authorities exhibit what could be called a mutually reinforcing relationship” (Kleinberg 1998)

• Computing authority and hub values for web-pages is an iterative process over a graph, where each node is a web-page

– Two weights are given to each node relating to in-degree and out-degree: total in-degree weights and total out-degree weights are kept constant

– Weights are modified each iteration depending on weights of connected nodes
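
The iterative process described above can be sketched as below. This is an illustrative simplification rather than Kleinberg's full method: it skips the step of selecting relevant pages, the function name `hits` and the example graph are hypothetical, and keeping the totals constant is done here by rescaling each set of weights so that its squares sum to 1.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) weights for a graph given as page -> linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages pointing to the page.
        auth = {
            p: sum(hub[s] for s, targets in links.items() if p in targets)
            for p in pages
        }
        # Hub weight: sum of the authority weights of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Rescale so that the total weight stays constant between iterations.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth


# Hypothetical graph: three hub-like pages pointing to two candidate authorities.
links = {"H1": ["X", "Y"], "H2": ["X", "Y"], "H3": ["X"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

The mutual reinforcement shows up in the two update steps: a page becomes a better authority when good hubs point to it, and a better hub when it points to good authorities.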

Some other Factors used to rank Web Pages (Hock 2001)

• Popularity of the Page: measured either by how many other web-pages link to it, or by how many people have clicked on it when they had the same query

• Frequency of search terms: need to consider the length of the document, and web-page authors’ attempts to affect ranking by deliberate repetition

• Number of query terms matched: but remember many queries are only one or two words

Other Factors (continued…)

• Rarity of terms: rank pages containing rare search terms more highly (cf. TFIDF)

• Weighting by Field: give high ranking to pages including search terms in important fields, e.g. Title

• Proximity of Terms: rank pages more highly if search terms occur near one another

• Order of Query Terms: give priority to pages containing the search term entered first
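
As a rough illustration of how some of these factors might be combined, the sketch below scores a page using length-normalised term frequency, rarity of terms and a boost for terms in the title. The formula, the `title_weight` value and the two example pages are assumptions made for this example only; proximity and order of query terms are left out, and real search engines do not publish their weightings.

```python
import math


def score(query, doc, collection, title_weight=2.0):
    """Toy relevance score for one page; the weighting scheme is an assumption."""
    body = doc["body"].lower().split()
    title = doc["title"].lower().split()
    total = 0.0
    for term in query.lower().split():
        # Frequency of search terms, normalised by document length.
        tf = body.count(term) / len(body) if body else 0.0
        # Rarity of terms: terms found in few pages count for more (cf. TFIDF).
        containing = sum(
            1 for d in collection if term in d["body"].lower().split()
        )
        idf = math.log((1 + len(collection)) / (1 + containing)) + 1.0
        # Weighting by field: boost terms that also occur in the title.
        boost = title_weight if term in title else 1.0
        total += tf * idf * boost
    return total


# Hypothetical two-page collection.
docs = [
    {"title": "David Beckham biography",
     "body": "beckham played football for england"},
    {"title": "Football news",
     "body": "football results and football news from england"},
]
for doc in docs:
    print(doc["title"], round(score("football", doc, docs), 3))
```

Dividing by document length is one simple guard against long pages, but it does nothing about deliberate repetition of terms, which is one reason link-based measures are attractive.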

Set Reading for Lecture 4

• Brin and Page (1998), “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. SECTIONS 1 and 2. Explains Google’s use of anchor text and PageRank.

www-db.stanford.edu/~backrub/google.html

• Hock (2001), The Extreme Searcher's Guide to Web Search Engines, pages 25-31. Gives an overview of some factors used by web search engines to rank webpages. AVAILABLE in Main Library collection and in Library Article Collection.

Further Reading

Rose and Levinson (2004), “Understanding User Goals in Web Search”, 13th International WWW Conference, 2004. www.sims.berkeley.edu/courses/is141/f05/readings/rose_www04.pdf

Page, Brin, Motwani and Winograd (1999), “The PageRank Citation Ranking: Bringing Order to the Web.” http://dbpubs.stanford.edu:8090/pub/1999-66

Belew (2000), Finding Out About, pages 195-199 for an overview of Kleinberg’s work on web-adjacency analysis and authorities and hubs.

Kleinberg (1998), ‘Authoritative Sources in a Hyperlinked Environment’, Journal of the ACM. http://citeseer.nj.nec.com/87928.html

Kobayashi and Takeda (2000), “Information Retrieval on the Web”, ACM Computing Surveys 32(2), pp. 144-173. AVAILABLE IN LIBRARY / ARTICLE COLLECTION. This comprehensive article reviews many of the ideas covered so far in this module and discusses them in the context of Web IR. NOTE: it is already a little out of date in places because of the rapid changes of the Web.

Lecture 4: LEARNING OUTCOMES

After this lecture you should be able to:

• Explain how the challenges of web IR are different from those facing the developers of traditional IR systems

• Explain how web search engines can exploit the hypertext structure of the web to index and rank web pages, e.g. using Anchor Text and PageRank

• Explain how PageRank is calculated

• Discuss and critique a range of factors used by web search engines to rank web pages

Reading ahead for LECTURE 5

If you want to read about next week’s lecture topics, see:

Dean and Henzinger (1999), ‘Finding Related Pages in the World Wide Web’. Pages 1-10.

http://citeseer.ist.psu.edu/dean99finding.html

Agichtein, Lawrence and Gravano (2001), ‘Learning Search Engine Specific Query Transformations for Question Answering’, Procs. 10th International WWW Conference. **Section 1 and Section 3**

www.cs.columbia.edu/~eugene/papers/www10.pdf

Oppenheim, Morris and McKnight (2000), ‘The Evaluation of WWW Search Engines’, Journal of Documentation, 56(2). Pages 194-205. In Library Article Collection.