
Page 1:

1

Web Search Engine Metrics for Measuring User Satisfaction

Ali Dasdan, Kostas Tsioutsiouliklis, Emre Velipasaoglu
{dasdan, kostas, emrev}@yahoo-inc.com

Yahoo! Inc., 20 Apr 2009

Page 2:

2

Tutorial @ 18th International World Wide Web Conference

http://www2009.org/, April 20-24, 2009

Page 3:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Disclaimers

•  This talk presents the opinions of the authors. It does not necessarily reflect the views of Yahoo! Inc.

•  This talk does not imply that these metrics are used by Yahoo!; should they be used, they may not be used in the way described in this talk.

•  The examples are just that – examples. Please do not generalize them to the level of comparing search engines.

3

Page 4:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Acknowledgments for presentation material (in alphabetical order of last names in each category)

•  Coverage
–  Paolo D’Alberto, Amit Sasturkar
•  Discovery
–  Chris Drome, Kaori Drome
•  Freshness
–  Xinh Huynh
•  Presentation
–  Rob Aseron, Youssef Billawala, Prasad Kantamneni, Diane Yip
•  General
–  Stanford U. presentation audience (organized by Aneesh Sharma and Panagiotis Papadimitriou), Yahoo! presentation audience (organized by Pavel Dmitriev)

4

Page 5:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Learning objectives

•  To learn about user satisfaction metrics

•  To learn about how to interpret metrics results

•  To get the relevant bibliography

•  To learn about the open problems

5

Page 6:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Scope

• Web “textual” search
• Users’ point of view
• Analysis rather than synthesis
• Intuitive rather than formal
• Not exhaustive coverage (including references)

6

Page 7:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Outline

•  Introduction (30min) – Ali
•  Relevance metrics (50min) – Emre
•  Break (15min)
•  Coverage metrics (15min) – Ali
•  Diversity metrics (15min) – Ali
•  Discovery metrics (15min) – Ali
•  Freshness metrics (15min) – Ali
•  Presentation metrics (50min) – Kostas
•  Conclusions (5min) – Kostas

7

Page 8:

8

Introduction PART “0”

of WWW’09 Tutorial on Web Search Engine Metrics

by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 9:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

“To measure is to know”

“If you cannot measure it, you cannot improve it”

Lord Kelvin (1824-1907)

Why measure? Why metrics?

9

Page 10:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Search engine pipeline: Simplified architecture

10

•  Serving system: serves user queries and search results

•  Content system: acquires and processes content

[Architecture diagram: User, Serving system (front-end tiers, search tiers), Content system (crawlers, web graphs, indexers), WWW]

Page 11:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Search engine pipeline: Content selection

11

[Funnel diagram of content catalogs: the Web, crawled, graphed, indexed, served, accessed]

How do you select content to pass to the next catalog?


Page 12:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

User view of metrics: Example with coverage metrics (SE #1)

12

[Screenshot: example coverage check on Search Engine (SE) #1; pipeline diagram repeated from the previous slide]

Page 13:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

User view of metrics: Example with coverage metrics (SE #2)

13

[Screenshot: example coverage check on Search Engine (SE) #2; pipeline diagram repeated from the previous slide]

Page 14:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

User view of metrics: Example with coverage metrics (SE #3)

14

[Screenshot: example coverage check on Search Engine (SE) #3; pipeline diagram repeated from the previous slide]

Page 15:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

System view of metrics: Example with coverage metrics

15

Check for coverage of expected URL http://rain.stanford.edu/schedule/ (if missing from SRP)


Page 16:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Ideal vs. reality

•  Ideal view
–  crawl all content
–  discover all changes instantaneously
–  serve all content instantaneously
–  store all content indefinitely
–  meet user’s information need perfectly
•  Practical view
–  constraints on the above aspects due to market focus, long tails, cost, resources, complexity
•  Moral of the story
–  Cannot make all the users happy all the time!

16

Page 17:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Sampling methods for metrics

•  Random sampling of queries
–  from search engine’s query logs
–  from third-party logs (e.g., ComScore)
•  Random sampling of URLs
–  from random walking the Web
•  see a review at Baykan et al., WWW’06
–  from directories and similar hubs
–  from RSS feeds and sitemaps
–  from third-party feeds
–  from search engine’s catalogs
–  from competitor’s indices using queries
•  Customer-selected samples

17

Page 18:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Different dimensions for metrics

•  Content types and sources
–  news, blogs, wikipedia, forums, scholar; regions, languages; adult, spam, etc.
•  Site types
–  small vs. large, region, language
•  Document formats
–  html, pdf, etc.
•  Query types
–  head, torso, tail; #terms; informational, navigational, transactional; celebrity, adult, business, research, etc.
•  Open web vs. hidden web
•  Organic vs. commercial
•  Dynamic vs. static content
•  New content vs. existing content

18

Page 19:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Further issues to consider

•  Rate limitations
–  search engine blocking, hence difficulty of large competitive testing
–  internal bandwidth usage limitations
•  Intrusiveness
–  How can metrics queries affect what’s observed?
•  Statistical soundness
–  in methods used and guarantees provided
–  accumulation of errors
–  the “value” question
–  E.g., what is “random”? Is “random” good enough?
•  Undesired positive feedback or the chicken-and-egg problem
–  Focus on popular queries may make them more popular at the expense of what is potentially good for the future.
•  Controlled feedback or labeled training and testing data
–  Paid human judges (or editors), crowdsourcing (e.g., Amazon’s Mechanical Turk), Games with a Purpose (e.g., Dasdan et al., WWW’09), bucket testing on live traffic, etc.

19

Page 20:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Key problems for metrics

•  Measure user satisfaction
•  Compare two search engines
•  Optimize for user satisfaction in each component of the pipeline
•  Automate all metrics
•  Discover anomalies
•  Visualize, mine, and summarize metrics data
•  Debug problems automatically

20
Also see: Yahoo Research list at http://research.yahoo.com/ksc

Page 21:

21

Relevance Metrics PART I

of WWW’09 Tutorial on Web Search Engine Metrics

by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 22:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 22 22

Example on relevance

[Screenshot annotations on example search results:]
–  Ad for gear. OK if I will go to the game.
–  No schedule here.
–  There is a schedule.
–  A different schedule?
–  A different Real Madrid!

Page 23:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 23 23

What is relevance?

•  User issues a query to a search engine and receives an ordered list of results…

•  Relevance: How effectively was the user’s information need met?
–  How useful were the results?
–  How many of the retrieved results were useful?
–  Were there any useful pages not retrieved?
–  Did the order of the results make the user’s search easier or harder?
–  How successfully did the search engine handle the ambiguity and the subjectivity of the query?

Page 24:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 24

Evaluating relevance

•  Set based evaluation
–  basic but fundamental
•  Rank based evaluation with explicit absolute judgments
–  binary vs. graded judgments
•  Rank based evaluation with explicit preference judgments
–  binary vs. graded judgments
–  practical system testing and incomplete judgments
•  Rank based evaluation with implicit judgments
–  direct and indirect evaluation by clicks
•  User satisfaction
•  More notes

Page 25:

25

Relevance Metrics: Set Based Evaluation

Page 26:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 26

Precision

–  True Positive (TP): A retrieved document is relevant
–  False Positive (FP): A retrieved document is not relevant

•  Kent et al. (1955)

$$\text{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} = \frac{TP}{TP + FP} = \Pr(\text{relevant} \mid \text{retrieved})$$

•  How many of the retrieved results were useful?

Page 27:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 27

Recall

$$\text{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} = \frac{TP}{TP + FN} = \Pr(\text{retrieved} \mid \text{relevant})$$

–  True Positive (TP): A retrieved document is relevant
–  False Negative (FN): A relevant document is not retrieved

•  Kent et al. (1955)

•  Were there any useful pages left not retrieved?

Page 28:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 28

Properties of precision and recall

•  Precision decreases when false positives increase
•  False positives:
–  also known as false alarms in signal processing
–  correspond to Type I errors in statistical hypothesis testing
•  Recall decreases when false negatives increase
•  False negatives:
–  also known as missed opportunities
–  correspond to Type II errors in statistical hypothesis testing

Page 29:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 29

F-measure

•  Inconvenient to have two numbers
•  F-measure: Harmonic mean of precision and recall
–  related to van Rijsbergen’s effectiveness measure
–  reflects user’s willingness to trade precision for recall, controlled by a parameter selected by the system designer

$$F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}, \qquad \alpha = \frac{1}{\beta^2 + 1}$$

$$F(\beta = 1) = \frac{2PR}{P + R}$$
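A minimal Python sketch (not from the tutorial) of these set based metrics; `retrieved` and `relevant` are assumed to be sets of document ids:

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Set based precision, recall, and F-measure for sets of document ids."""
    tp = len(retrieved & relevant)                      # relevant items retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        f = 0.0
    else:
        b2 = beta * beta
        f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Example: 10 retrieved documents, 4 of them relevant, 6 relevant documents overall.
print(precision_recall_f({1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, {1, 2, 4, 7, 11, 12}))
# approximately (0.4, 0.67, 0.5)
```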

Page 30:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 30

Various means of precision and recall

[Chart: average of precision and recall vs. precision, at a fixed recall of 70%, comparing the arithmetic mean, geometric mean, F1, F2, and F0.5.]

Page 31:

31

Relevance Metrics: Rank Based Evaluation

with Explicit Absolute

Judgments

Page 32:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 32

Extending precision and recall

•  So far, considered:
–  How many of the retrieved results were useful?
–  Were there any useful pages left not retrieved?
•  Next, consider:
–  Did the order of the results make the user’s search for information easier or harder?
•  Extending set based precision/recall to a ranked list
–  It is possible to define many sets over a ranked list.
–  E.g., start with a set containing the first result and progressively increase the size of the set by adding the next result.
•  Precision-recall curve:
–  Calculate precision at standard recall levels and interpolate.

Page 33:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 33

Precision-recall curve example

rank | relevance | TP | FP | FN | recall | precision | interpolated precision
 1   |     1     |  1 |  0 |  3 |  0.25  |   1.00    |   1.00
 2   |     1     |  2 |  0 |  2 |  0.50  |   1.00    |   1.00
 3   |     0     |  2 |  1 |  2 |  0.50  |   0.67    |   0.75
 4   |     1     |  3 |  1 |  1 |  0.75  |   0.75    |   0.75
 5   |     0     |  3 |  2 |  1 |  0.75  |   0.60    |   0.60
 6   |     0     |  3 |  3 |  1 |  0.75  |   0.50    |   0.57
 7   |     1     |  4 |  3 |  0 |  1.00  |   0.57    |   0.57
 8   |     0     |  4 |  4 |  0 |  1.00  |   0.50    |   0.50
 9   |     0     |  4 |  5 |  0 |  1.00  |   0.44    |   0.44
10   |     0     |  4 |  6 |  0 |  1.00  |   0.40    |   0.40
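The table can be reproduced with a short Python sketch (illustrative only), assuming 4 relevant documents in total; interpolated precision at a rank is the maximum precision at that rank or deeper:

```python
def pr_at_ranks(relevance, num_relevant):
    """Precision/recall at each rank plus interpolated precision for a ranked list.

    relevance: list of 0/1 judgments in rank order; num_relevant: total relevant docs.
    """
    rows, tp = [], 0
    for rank, rel in enumerate(relevance, start=1):
        tp += rel
        rows.append((rank, tp / rank, tp / num_relevant))   # (rank, precision, recall)
    # Interpolated precision at rank r: max precision at rank r or any deeper rank.
    interp, best = [0.0] * len(rows), 0.0
    for i in range(len(rows) - 1, -1, -1):
        best = max(best, rows[i][1])
        interp[i] = best
    return [(r, p, rc, ip) for (r, p, rc), ip in zip(rows, interp)]

for rank, p, rc, ip in pr_at_ranks([1, 1, 0, 1, 0, 0, 1, 0, 0, 0], 4):
    print(rank, round(rc, 2), round(p, 2), round(ip, 2))    # matches the table above
```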

Page 34:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 34

Precision-recall curve example

[Plot: precision vs. recall for the example above, showing both the raw precision and the interpolated precision curves.]

Page 35:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 35

Average precision-recall curve

•  Precision-recall curve is for one ranked list (i.e., one query).

•  To evaluate relevance of a search engine:
–  Calculate interpolated precision-recall curves for a sample of queries at 11 points (Recall = 0.0:0.1:1.0).
–  Average over the test sample of queries.

[Plot: 11-point interpolated precision-recall curve averaged over queries.]

Page 36:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 36

Mean average precision (MAP)

•  Single number instead of a graph
•  Measure of quality at all recall levels
•  Average precision for a single query:

$$AP = \frac{1}{\#\,\text{relevant}} \sum_{k=1}^{\#\,\text{relevant}} \text{Precision at rank of the } k\text{th relevant document}$$

•  MAP: Mean of average precision over all queries
–  Most frequently, the arithmetic mean is used over the query sample.
–  Sometimes, the geometric mean can be useful by putting emphasis on low performing queries.

Page 37:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 37

Average precision example

rank | relevance | TP | FP | FN |  R   |  P   | P@rel(k)
 1   |     1     |  1 |  0 |  3 | 0.25 | 1.00 |  1.00
 2   |     1     |  2 |  0 |  2 | 0.50 | 1.00 |  1.00
 3   |     0     |  2 |  1 |  2 | 0.50 | 0.67 |  0
 4   |     1     |  3 |  1 |  1 | 0.75 | 0.75 |  0.75
 5   |     0     |  3 |  2 |  1 | 0.75 | 0.60 |  0
 6   |     0     |  3 |  3 |  1 | 0.75 | 0.50 |  0
 7   |     1     |  4 |  3 |  0 | 1.00 | 0.57 |  0.57
 8   |     0     |  4 |  4 |  0 | 1.00 | 0.50 |  0
 9   |     0     |  4 |  5 |  0 | 1.00 | 0.44 |  0
10   |     0     |  4 |  6 |  0 | 1.00 | 0.40 |  0

# relevant = 4; average precision = (1.00 + 1.00 + 0.75 + 0.57) / 4 = 0.83
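A short Python sketch (an illustration, not the tutorial's code) of average precision and MAP with binary judgments; the example reproduces the 0.83 above:

```python
def average_precision(relevance, num_relevant=None):
    """AP for one ranked list of 0/1 judgments."""
    num_relevant = num_relevant or sum(relevance)
    tp, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            tp += 1
            precisions.append(tp / rank)    # precision at rank of the k-th relevant doc
    return sum(precisions) / num_relevant if num_relevant else 0.0

def mean_average_precision(ranked_lists):
    """MAP: arithmetic mean of AP over a sample of queries."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

print(round(average_precision([1, 1, 0, 1, 0, 0, 1, 0, 0, 0]), 2))  # 0.83, as above
```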

Page 38:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 38

Precision @ k

•  MAP evaluates precision at all recall levels.
•  In web search, the top portion of a result set is more important.
•  A natural alternative is to report precision at top-k (e.g., top-10).
•  Problem:
–  Not all queries will have more than k relevant results. So, even a perfect system may score less than 1.0 for some queries.

Page 39:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 39

R-precision

•  Allan (2005)
•  Use a variable result set cut-off for each query based on the number of its relevant results.
•  In this case, a perfect system can score 1.0 over all queries.
•  Official evaluation metric of the TREC HARD track
•  Highly correlated with MAP

Page 40:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 40

Mean reciprocal rank (MRR)

•  Voorhees (1999)
•  Reciprocal of the rank of the first relevant result, averaged over a population of queries
•  Possible to define it for entities other than explicit absolute relevance judgments (e.g., clicks; see implicit judgments later on)

$$MRR = \frac{1}{\#\,\text{queries}} \sum_{q=1}^{\#\,\text{queries}} \frac{1}{\text{rank}(\text{1st relevant result of query } q)}$$
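A minimal Python illustration (assumed, not from the slides), where each query is represented by the rank of its first relevant result, or None if there is none:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over queries; None means the query returned no relevant result."""
    rr = [1.0 / r if r else 0.0 for r in first_relevant_ranks]
    return sum(rr) / len(rr)

print(mean_reciprocal_rank([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
```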

Page 41:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 41

Graded Relevance

•  So far, the evaluation methods did not measure satisfaction in the following aspects:
–  How useful were the results?
•  Do documents have grades of usefulness in meeting an information need?
–  How successfully did the search engine handle the ambiguity and the subjectivity of the query?
•  Is the information need of the user clear in the query?
•  Do different users mean different things with the same query?

•  Can we cover these aspects by using graded relevance judgments instead of binary ones?
–  very useful
–  somewhat useful
–  not useful

Page 42:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 42

Precision-recall curves

•  If we have grades of relevance, how can we modify some of the binary relevance measures?

•  Calculate Precision-Recall curves at each grade level (Järvelin and Kekäläinen (2000))

•  Informative, but too many curves to compare

Page 43:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 43 43

Discounted cumulative gain (DCG)

•  Järvelin and Kekäläinen (2002)
•  Gain adjustable for the importance of different relevance grades for user satisfaction
•  Discounting desirable for web ranking
–  Most users don’t browse deep.
–  Search engines truncate the list of results returned.

$$DCG = \sum_{r=1}^{R} \frac{\text{Gain}(\text{result@}r)}{\log_b(r+1)}$$

–  Gain is proportional to the utility of the result at rank r.
–  Discount is proportional to the effort to reach the result at rank r.

Page 44:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 44

DCG example

•  Gain for various grades (chosen so that the numbers below work out)
–  Very useful (V): 2
–  Somewhat useful (S): 1
–  Not useful (N): 0

•  E.g., results ordered as VSN:

DCG = 2/log2(1+1) + 1/log2(2+1) + 0/log2(3+1) = 2.63

•  E.g., results ordered as VNS:

DCG = 2/log2(1+1) + 0/log2(2+1) + 1/log2(3+1) = 2.50

Page 45:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 45

Normalized DCG (nDCG)

•  DCG yields unbounded scores. It is desirable for the best possible result set to have a score of 1.

•  For each query, divide the DCG by the best attainable DCG for that query.

•  E.g. VSN:

nDCG = 2.63 / 2.63 = 1.00

•  E.g. VNS:

nDCG = 2.50 / 2.63 = 0.95
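A compact Python sketch of DCG and nDCG (illustrative only); gains follow the V = 2, S = 1, N = 0 example above, and the ideal ordering is obtained by sorting the gains in decreasing order:

```python
import math

def dcg(gains, base=2):
    """Discounted cumulative gain of a ranked list of gain values."""
    return sum(g / math.log(r + 1, base) for r, g in enumerate(gains, start=1))

def ndcg(gains, base=2):
    """DCG normalized by the best attainable DCG for the same set of gains."""
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(gains, base) / ideal if ideal > 0 else 0.0

print(round(dcg([2, 1, 0]), 2), round(ndcg([2, 1, 0]), 2))    # 2.63 1.0  (VSN)
print(round(dcg([2, 0, 1]), 2), round(ndcg([2, 0, 1]), 2))    # 2.5 0.95  (VNS)
```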

Page 46:

46

Relevance Metrics: Rank Based Evaluation

with Explicit Preference

Judgments

Page 47:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 47

Kendall tau coefficient

•  Based on counts of preferences
–  Preference judgments are cheaper and easier/cleaner than absolute judgments.
–  But, may need to deal with circular preferences.
•  Range in [-1, 1]
–  τ = 1 when all preferences are in agreement
–  τ = -1 when all disagree
•  Robust for incomplete judgments
–  Just use the known set of preferences.

$$\tau = \frac{A - D}{A + D}, \qquad A = \#\,\text{preferences in agreement}, \quad D = \#\,\text{preferences in disagreement}$$
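A small Python sketch (not from the tutorial) computing τ from a set of known pairwise preferences against a ranking; a preference (a, b) means a should rank above b:

```python
def kendall_tau(ranking, preferences):
    """ranking: items, best first; preferences: iterable of (better, worse) pairs."""
    pos = {item: i for i, item in enumerate(ranking)}
    agree = disagree = 0
    for better, worse in preferences:
        if better in pos and worse in pos:        # use only known, comparable pairs
            if pos[better] < pos[worse]:
                agree += 1
            else:
                disagree += 1
    return (agree - disagree) / (agree + disagree)

print(kendall_tau(["d1", "d2", "d3"], [("d1", "d2"), ("d3", "d2"), ("d1", "d3")]))  # (2-1)/3 ≈ 0.33
```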

Page 48:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 48

Binary preference (bpref)

•  Buckley and Voorhees (2004)
•  Designed in particular for incomplete judgments
•  Similar to some other relevance metrics (MAP)
•  Can be generalized to graded judgments

For a query with R relevant results:

$$bpref = \frac{1}{R} \sum_{r \in R} \left( 1 - \frac{N_r}{R} \right) \propto \frac{A}{A + D}$$

where N_r is the number of non-relevant documents ranked above relevant document r, counting only the first R judged non-relevant documents.

Page 49:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 49

Bpref example

rank | relevance | numerator | denominator | summand
 1   |     0     |           |             |
 2   |     1     |     1     |      3      |  0.66
 3   |    NA     |           |             |
 4   |     1     |     1     |      3      |  0.66
 5   |    NA     |           |             |
 6   |     0     |           |             |
 7   |     0     |           |             |
 8   |     0     |           |             |
 9   |     1     |     3     |      3      |  0
10   |     0     |           |             |

# relevant = 3, # non-relevant = 5, # unjudged = 2; bpref = (0.66 + 0.66 + 0) / 3 = 0.44
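A Python sketch of bpref (illustrative); the input lists one judgment per rank, with None marking unjudged results, matching the example above:

```python
def bpref(ranked_judgments):
    """bpref over a ranked list; entries are 1 (relevant), 0 (non-relevant), or None (unjudged)."""
    R = sum(1 for j in ranked_judgments if j == 1)
    if R == 0:
        return 0.0
    total, nonrel_above = 0.0, 0
    for j in ranked_judgments:
        if j == 1:
            total += 1.0 - min(nonrel_above, R) / R    # cap at the first R non-relevant docs
        elif j == 0:
            nonrel_above += 1
    return total / R

# Example from the slide: relevant at ranks 2, 4, 9; ranks 3 and 5 unjudged.
print(round(bpref([0, 1, None, 1, None, 0, 0, 0, 1, 0]), 2))  # 0.44
```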

Page 50:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 50

Generalization of bpref to graded judgments - rpref

•  De Beer and Moens (2006)
•  Graded relevance version of bpref
•  Sakai (2007) gives a corrected version expressed in terms of cumulative gain:

$$rpref_{relative}(R) = \frac{1}{CG_{ideal}(R)} \sum_{r \le R,\ g(r) > 0} g(r) \left( 1 - \frac{penalty(r)}{N_r} \right)$$

$$penalty(r) = \sum_{i < r,\ g(i) < g(r)} \frac{g(r) - g(i)}{g(r)}$$

where g(r) is the relevance gain of the result at rank r, N_r is the number of judged documents above rank r, penalty(r) is a soft count of out-of-order pairs, and CG_ideal is the ideal cumulative gain.

Page 51:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 51

Practical system testing with incomplete judgments

•  Comparing two search engines in practice
–  Scrape top-k result sets for a sample of queries
–  Calculate any of the metrics above for each engine and compare using a statistical test (e.g., paired t-test)
•  Need judgments
•  Use existing judgments
•  What to do if judgments are missing?
•  Use a metric robust to missing judgments

Page 52:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 52

Comparing various metrics under incomplete judgment scenario

•  Sakai (2007) simulates incomplete judgments by sampling from pooled judgments
–  Stratified sampling yields various levels of completeness from 100% to 10%.
•  Then tests bpref, rpref, MAP, Q-measure, and normalized DCG (nDCG).
–  Q-measure is similar to rpref (see Sakai (2007)).
–  Since all but the first two are originally designed for complete judgments, he tests two versions of them:
•  one based on assuming results with missing judgments are non-relevant,
•  and another computed on condensed lists by removing results with missing judgments.
•  nDCG with incomplete absolute judgments
–  As in average precision based measures, one can ignore the unjudged documents when using normalized DCG.

Page 53:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 53

Robustness of evaluation with incomplete judgments

•  Among the original methods, only bpref and rpref stay stable with increasing incompleteness.

•  nDCG, Q-measure, and MAP computed on condensed lists also perform well.
–  Furthermore, they have more discriminative power.

•  Graded relevance metrics are more robust than binary metrics for incompleteness.

•  nDCG and Q-measure on condensed lists are the best metrics.

Page 54:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 54

Average precision based rank correlation

•  Yilmaz, Aslam, and Robertson (2008)
•  Kendall tau rank correlation as a random variable
–  Pick a pair of items at random.
–  Define p: return 1 if the pair is in the same order in both lists, 0 otherwise.
•  Rank correlation based on average precision as a random variable
–  Pick an item at random from the 1st list (other than the top item).
–  Pick another document at random above the current one.
–  Define p’: return 1 if this pair is in the same relative order in the 2nd list, 0 otherwise.
•  Agreement at the top of the list is rewarded.

$$\tau = \frac{A - D}{A + D} = p - (1 - p) = 2p - 1$$
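A Python sketch (my illustration, under the assumption that the AP based coefficient is 2·E[p′] − 1 with p′ defined as on this slide) that computes it exactly rather than by sampling:

```python
def tau_ap(list1, list2):
    """AP based rank correlation: for each item below the top of list1, take the
    fraction of items ranked above it in list1 that are also above it in list2."""
    pos2 = {item: i for i, item in enumerate(list2)}
    n, total = len(list1), 0.0
    for i in range(1, n):                        # skip the top item of list1
        above_ok = sum(1 for j in range(i) if pos2[list1[j]] < pos2[list1[i]])
        total += above_ok / i                    # contribution to E[p']
    p_prime = total / (n - 1)
    return 2 * p_prime - 1

print(tau_ap(["a", "b", "c", "d"], ["a", "b", "c", "d"]))  # 1.0
print(tau_ap(["a", "b", "c", "d"], ["b", "a", "c", "d"]))  # discordance near the top is penalized more
```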

Page 55:

55

Relevance Metrics: Rank Based Evaluation

with Implicit Judgments

Page 56:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 56

Implicit judgments from clicks

•  Explicit judgments are expensive.
•  A search engine has lots of user interaction data:
–  which results were viewed for a query, and
–  which of those received clicks.
•  Can we obtain implicit judgments of satisfaction or relevance from clicks?
–  Clicks are highly biased:
•  presentation details (order of results, attractiveness of abstracts)
•  trust and other subtle aspects of user’s need
–  Not impossible; some innovative methods are emerging.
•  Pros: Cheap, better model of ambiguity and subjectivity
•  Cons: Noisy and retroactive. (May expose poor quality search engines to live traffic.)

Page 57:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 57

Performance metrics from user logs

•  A naïve way to utilize user interaction data is to develop basic statistics from raw observations:
–  abandonment rate
–  reformulation rate
–  number of queries per session
–  clicks per query
–  mean reciprocal rank of clicked results
–  time to first or last click

•  Intuitive, but it is not clear how sensitive these metrics are to what we want to measure.

Page 58:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 58

Implicit preference judgments from clicks

•  Joachims (2002)
•  Radlinski and Joachims (2005)
•  These are document level preference judgments and have not been used in evaluation.

[Diagram: for ranked results A, B, C, a click on C with A and B skipped yields the preferences C > A and C > B; a click on A with B and C skipped yields A > B.]

Page 59:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 59

Direct evaluation by clicks

•  Randomly interleave the two result sets to be compared.
–  Have the same number of links from the top of each result set.
–  More clicks on links from one result set indicates preference.
•  Balanced interleaving (Joachims (2003))
–  Determine randomly which side goes first at the start.
–  Pick the next available result from the side that has the turn, removing duplicates.
–  Caution: Biased when the two result sets are nearly identical.
•  Team draft interleaving (Radlinski et al. (2008))
–  Determine randomly which side goes first at each round.
–  Pick the next available result from the side that has the turn, removing duplicates.
•  Effectively removes the rank bias, but not directly applicable to evaluation of multi-page sessions.

Page 60:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 60

Interleaving example

Rankings to compare: A = (a, b, c, d, e, f, g, h, i, j), B = (b, e, a, f, g, h, k, c, d, i).

[Table: for each rank, the balanced interleave of A and B (A first vs. B first) and two team draft interleave outcomes, each showing which team (A or B) contributed the result at that rank.]
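A Python sketch of team draft interleaving (an illustration of the scheme summarized on the previous slide, with a fixed random seed so the example is reproducible; A and B are the two rankings above):

```python
import random

def team_draft_interleave(a, b, k=10, seed=0):
    """Team draft interleaving of rankings a and b.
    Returns the interleaved list and, per position, which team (A or B) contributed it."""
    rng = random.Random(seed)
    interleaved, teams = [], []
    remaining_a, remaining_b = list(a), list(b)
    count = {"A": 0, "B": 0}
    while len(interleaved) < k and (remaining_a or remaining_b):
        # The team with fewer picks drafts next; ties are broken randomly (a new round).
        if not remaining_a:
            side = "B"
        elif not remaining_b:
            side = "A"
        elif count["A"] != count["B"]:
            side = "A" if count["A"] < count["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        pool = remaining_a if side == "A" else remaining_b
        doc = pool.pop(0)
        if doc not in interleaved:              # drop results already shown
            interleaved.append(doc)
            teams.append(side)
            count[side] += 1
        # Keep both lists free of documents that are already shown.
        remaining_a = [d for d in remaining_a if d not in interleaved]
        remaining_b = [d for d in remaining_b if d not in interleaved]
    return interleaved, teams

A = list("abcdefghij")
B = list("beafghkcdi")
print(team_draft_interleave(A, B, k=6))
```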

Page 61:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 61

Indirect evaluation by clicks

•  Carterette and Jones (2007)
•  Relevance as a multinomial random variable:

$$P(R_i = \text{grade}_j)$$

•  Model absolute judgments by clicks:

$$p(R \mid q, c) = \prod_{i=1}^{N} p(R_i \mid q, c), \qquad \log \frac{p(R > g_j \mid q, c)}{p(R \le g_j \mid q, c)} = \alpha_j + \beta_q + \sum_{i=1}^{N} \beta_i c_i + \sum_{i<k}^{N} \beta_{ik} c_i c_k$$

•  Expected DCG (incomplete judgments are OK):

$$E[DCG_N] = E[R_1] + \sum_{i=2}^{N} \frac{E[R_i]}{\log_2 i}$$

Page 62:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 62

Indirect evaluation by clicks (cont’d)

•  Comparing two search engines:

$$E[\Delta DCG] = E[DCG_A] - E[DCG_B]$$

•  Predict if the difference is statistically significant, e.g.,

$$P(\Delta DCG < 0) \ge 0.95$$

–  use Monte Carlo

•  Can improve confidence by asking for labels for the documents with the largest expected gain difference:

$$\max_i \left( E[G_i^A] - E[G_i^B] \right), \qquad G_i = \begin{cases} R_i & \text{if } \mathrm{rank}(i) = 1 \\ R_i / \log_2 \mathrm{rank}(i) & \text{otherwise} \end{cases}$$

•  Efficient, but effectiveness depends on the quality of the relevance model obtained from the clicks.

Page 63:

63

Relevance Metrics: User Satisfaction

Page 64:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 64

Relevance evaluation and user satisfaction

•  So far, we focused on the evaluation method rather than the entity (i.e., user satisfaction) to be evaluated.

•  Subtle and salient aspects of user satisfaction are difficult for traditional relevance evaluation.
–  E.g., trust, expectation, patience, ambiguity, subjectivity
–  Explicit absolute or preference judgments are not very successful in addressing all aspects at once.
–  Implicit judgment models get one step closer to user satisfaction by incorporating user feedback.

•  The popular IR relevance metrics are not strongly based on user tasks and experiences.
–  Turpin and Scholer (2006): precision based metrics such as MAP fail to assess user satisfaction on tasks targeting recall.

Page 65:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 65

Modeling user satisfaction

•  Huffman and Hochster (2007)
•  Obtain explicit judgments of true satisfaction over a sample of sessions or any other grain.
•  Develop a predictive model based on observable statistics:
–  explicit absolute relevance judgments
–  number of user actions in a session
–  query classification
•  Carry out correlation analysis.
•  Pros: More direct than many other evaluation metrics.

•  Cons: More exploratory than a usable metric at this stage.

Page 66:

66

Relevance Metrics: More Notes

Page 67:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 67

Relevance through search system components

•  Relevance can explicitly be measured for each search system component (Dasdan and Drome (2009)).
–  Use set based evaluation for WWW, catalog, and database tiers.
•  Rank based evaluation can be used if the sampled subset is ordered by explicit judgments or by using an order inferred from a downstream component.
–  Yields approximate upper bounds
–  Use rank based evaluation for candidate documents and the result set.
•  Useful for quantifying and monitoring the relevance gap
–  inter-system relevance gap by comparing different system stages
–  intra-system relevance gap by comparing against external benchmarks

[Diagram: query → WWW → crawl → catalog → index tier 1 … tier N; selection produces a candidate doc list, and ranking produces the result set.]

Page 68:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 68

Where to find more

•  Traditional relevance metrics have deep roots in information retrieval
–  Cranfield experiments (Cleverdon (1991))
–  SMART (Salton (1991))
–  TREC (Voorhees and Harman (2005))
•  Modern metrics address cost and noise by using statistical inference in more advanced ways
•  For more on relevance evaluation, see
–  Manning, Raghavan, and Schütze (2008)
–  Croft, Metzler, and Strohman (2009)
•  For more on the user dimension, see
–  Baeza-Yates and Ribeiro-Neto (1999)
–  Spink and Cole (2005)

Page 69:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 69

References 1/2

•  J. Allan (2005), HARD track overview in TREC 2005: High accuracy retrieval from documents.
•  R. Baeza-Yates and B. Ribeiro-Neto (1999), Modern Information Retrieval, Addison-Wesley.
•  C. Buckley and E.M. Voorhees (2004), Retrieval Evaluation with Incomplete Information, SIGIR’04.
•  B. Carterette and R. Jones (2007), Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks, NIPS’07.
•  C.W. Cleverdon (1991), The significance of the Cranfield tests on index languages, SIGIR’91.
•  B. Croft, D. Metzler, and T. Strohman (2009), Search Engines: Information Retrieval in Practice, Addison-Wesley.
•  A. Dasdan and C. Drome (2008), Measuring Relevance Loss of Search Engine Components, submitted.
•  J. De Beer and M.-F. Moens (2006), Rpref: A Generalization of Bpref towards Graded Relevance Judgments, SIGIR’06.
•  S.B. Huffman and M. Hochster (2007), How Well does Result Relevance Predict Session Satisfaction? SIGIR’07.
•  K. Järvelin and J. Kekäläinen (2000), IR evaluation methods for retrieving highly relevant documents, SIGIR’00.
•  K. Järvelin and J. Kekäläinen (2002), Cumulated Gain-Based Evaluation of IR Techniques, ACM Trans. IS 20(4):422-446.
•  T. Joachims (2002), Optimizing Search Engines using Clickthrough Data, SIGKDD’02.
•  T. Joachims (2003), Evaluating Retrieval Performance using Clickthrough Data, in J. Franke et al. (eds.), Text Mining, Physica Verlag.

Page 70:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009. 70

References 2/2

•  A. Kent, M.M. Berry, F.U. Luehrs Jr., and J.W. Perry (1955), Machine literature searching VIII. Operational criteria for designing information retrieval systems, American Documentation 6(2):93-101.
•  C. Manning, P. Raghavan, and H. Schütze (2008), Introduction to Information Retrieval, Cambridge University Press.
•  F. Radlinski, M. Kurup, and T. Joachims (2008), How Does Clickthrough Data Reflect Retrieval Quality? CIKM’08.
•  F. Radlinski and T. Joachims (2005), Evaluating the Robustness of Learning from Implicit Feedback, ICML’05.
•  T. Sakai (2007), Alternatives to Bpref, SIGIR’07.
•  G. Salton (1991), The SMART project in automatic document retrieval, SIGIR’91.
•  A. Spink and C. Cole (eds.) (2005), New Directions in Cognitive Information Retrieval, Springer.
•  A. Turpin and F. Scholer (2006), User performance versus precision measures for simple search tasks, SIGIR’06.
•  C.J. van Rijsbergen (1979), Information Retrieval (2nd ed.), Butterworth.
•  E.M. Voorhees and D. Harman (eds.) (2005), TREC: Experiment and Evaluation in Information Retrieval, MIT Press.
•  E.M. Voorhees (1999), TREC-8 Question Answering Track Report.
•  E. Yilmaz and J. Aslam (2006), Estimating Average Precision with Incomplete and Imperfect Information, CIKM’06.
•  E. Yilmaz, J. Aslam, and S. Robertson (2008), A New Rank Correlation Coefficient for Information Retrieval, SIGIR’08.

Page 71:

71

Coverage Metrics PART II

of WWW’09 Tutorial on Web Search Engine Metrics

by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 72:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on coverage: Heard some interesting news; decided to search

72

Page 73:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on coverage: URL was not found

73

Page 74:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on coverage: But content was found under different URLs

74

Page 75:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on coverage: URL was also found after some time

75

Page 76:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Definitions for coverage

•  Coverage refers to the presence of content of interest in a catalog.

•  Coverage ratio
–  defined as the ratio of the number of documents (pages) found to the number of documents (pages) tested
–  can be represented as a distribution when many document attributes are considered together

76

Page 77:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Some background: Shingling and Jaccard Index

77

Doc = (a b c d e) (5 terms); 2-grams: (a b, b c, c d, d e)
Shingles for the 2-grams (after hashing them): 10, 3, 7, 16; min shingle: 3 (used as a signature of Doc)

Doc1 = (a b c d e), Doc2 = (a e f g)
Doc1 ∩ Doc2 = (a e), Doc1 ∪ Doc2 = (a b c d e f g)

Jaccard index = |Doc1 ∩ Doc2| / |Doc1 ∪ Doc2| = 2 / 7 ≈ 30% (shingling estimates this index)
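A small Python sketch (illustrative, extending the single min shingle above to a multi-hash signature) that estimates the Jaccard index of two documents' 2-gram shingle sets by min-hashing and compares it with the exact value:

```python
import hashlib

def shingle_set(terms, n=2):
    """The document's n-gram shingles (kept as strings here)."""
    return {" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)}

def minhash_signature(shingles, num_hashes=64):
    """For each of num_hashes salted hashes, keep the minimum hash value over the shingles."""
    sig = []
    for h in range(num_hashes):
        sig.append(min(int(hashlib.md5(f"{h}:{s}".encode()).hexdigest(), 16) for s in shingles))
    return sig

def estimated_jaccard(sig1, sig2):
    """The fraction of hash functions whose minima agree estimates the Jaccard index."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

d1 = shingle_set("a b c d e".split())
d2 = shingle_set("a b c d f".split())
exact = len(d1 & d2) / len(d1 | d2)
estimate = estimated_jaccard(minhash_signature(d1), minhash_signature(d2))
print(round(exact, 2), round(estimate, 2))   # exact 0.6; the estimate is close with enough hashes
```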

Page 78:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

How to measure coverage

•  Given an input document with its URL
•  Query by URL (QBU)
–  enter the URL at the target search engine’s query interface
–  if the URL is not found, then iterate using “normalized” forms of the same URL
•  Query by content (QBC)
–  if the URL is not given or the URL search has failed, then perform this search
–  generate a set of queries (called strong queries) from the document
–  submit the queries to the target search engine’s query interface
–  combine the returned results
–  perform a more thorough similarity check between the returned documents and the input document
•  Compute the coverage ratio over multiple documents (see the sketch below)

78
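A hedged Python sketch of the overall loop; `search_by_url`, `search_by_queries`, `make_strong_queries`, and `similar` are hypothetical hooks standing in for the engine's query interface and the shingle based similarity check, not real APIs:

```python
def coverage_ratio(documents, search_by_url, search_by_queries, make_strong_queries, similar):
    """Fraction of sampled documents found in the target engine via QBU, then QBC.

    All callables are hypothetical hooks:
      search_by_url(url) -> bool,
      search_by_queries(queries) -> list of result documents,
      make_strong_queries(doc) -> list of query strings,
      similar(doc_a, doc_b) -> bool (e.g., shingle based Jaccard above a threshold).
    """
    found = 0
    for doc in documents:
        if doc.get("url") and search_by_url(doc["url"]):          # Query by URL (QBU)
            found += 1
            continue
        results = search_by_queries(make_strong_queries(doc))     # Query by content (QBC)
        if any(similar(doc, r) for r in results):
            found += 1
    return found / len(documents) if documents else 0.0
```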

Page 79:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Query-by-Content flowchart

79

[Flowchart stages: string signature (terms from the page); strings combined into queries; search results extraction; similarity check using shingles]

Page 80:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Query by content: How to generate queries

•  Select sequences of terms randomly
–  find the document’s shingle signature
–  find the corresponding sequences of terms
–  This method can produce the same query signature for the same document, as opposed to just selecting random sequences of terms from the document.
•  Select sequences of terms by frequency (see the sketch below)
–  terms with the lowest frequency or highest TF-IDF
•  Select sequences of terms by position
–  +/- two terms at every 5th term

80
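An illustrative Python sketch (my own, assuming strong queries are short sequences of rare terms) of the frequency based selection; `doc_freq` is an assumed term-to-document-frequency map:

```python
def strong_queries(terms, doc_freq, num_queries=3, terms_per_query=4):
    """Build strong queries from the rarest terms of a document.

    terms: the document's terms in order; doc_freq: assumed term -> document frequency map.
    """
    # Rank term positions by rarity (lowest document frequency first).
    positions = sorted(range(len(terms)), key=lambda i: doc_freq.get(terms[i], 1))
    queries = []
    for q in range(num_queries):
        chunk = positions[q * terms_per_query:(q + 1) * terms_per_query]
        if chunk:
            # Keep the original term order inside each query.
            queries.append(" ".join(terms[i] for i in sorted(chunk)))
    return queries

doc = "the quick brown fox jumps over the lazy dog near the old lighthouse".split()
df = {"the": 1000, "over": 800, "near": 500, "old": 400, "quick": 50,
      "brown": 40, "fox": 30, "jumps": 20, "lazy": 25, "dog": 60, "lighthouse": 5}
print(strong_queries(doc, df, num_queries=2))
```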

Page 81:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Further issues to consider

•  URL normalization
–  see Dasgupta, Kumar, and Sasturkar (2008)
•  Page templates and ads
–  or how to avoid undesired matches
•  Search for non-textual content
–  images, mathematical formulas, tables, and other similar structures
•  Definition of content similarity
•  Syntactic vs. semantic match
•  How to balance coverage against other objectives

81

Page 82:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Key problems

•  Measure web growth in general and along any dimension

•  Compare search engines automatically and reliably

•  Improve content-based search, including semantic-similarity search

•  Improve copy detection methods for quality and performance, including URL based copy detection

82

Page 83:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Reference review on coverage metrics

•  Luhn (1957)
–  summarizes an input document by selecting terms or sentences by frequency
–  Bharat and Broder (1998) discovered the same method independently for a different purpose
•  Bar-Yossef and Gurevich (2008)
–  introduces improved methods to randomly sample pages from a search engine’s index using its public query interface, a problem introduced by Bharat and Broder (1998)
•  Dasdan et al. (2008), Pereira and Ziviani (2004)
–  represent an input document by selecting (sequences of) terms randomly or by frequency
–  use the term-based document signature as queries (called strong queries) for similarity search
–  Yang et al. (2009) propose similar methods for blog search

83

Page 84:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

References

•  Z. Bar-Yossef and M. Gurevich (2008), Random sampling from a search engine’s index, J. ACM, 55(5).

•  K. Bharat, A. Broder (1998), A technique for measuring the relative size and overlap of public Web search engines, WWW’98.

•  S. Brin, J. Davis, and H. Garcia-Molina (1995), Copy detection mechanisms for digital documents, SIGMOD’95.

•  A. Dasdan, P. D’Alberto, C. Drome, and S. Kolay (2008), Automating retrieval for similar content using search engine query interface, submitted.

•  A. Dasgupta, R. Kumar, and A. Sasturkar (2008), De-duping URLs via Rewrite Rules, KDD’08.

•  H. Luhn (1957), A statistical approach to mechanized encoding and searching of literary information, IBM J. Research and Dev., 1(4):309–317.

•  H. P. Luhn (1958), The automatic creation of literature abstracts, IBM J. Research and Dev., 2(2).

•  A.R. Pereira Jr. and N. Ziviani (2004), Retrieving similar documents from the Web, J. Web Engineering, 2(4):247-261.

•  Y. Yang, N. Bansal, W. Dakka, P. Ipeirotis, N. Koudas, D. Papadias (2009), Query by document, WSDM’09.

84

Page 85:

85

Diversity Metrics PART III

of WWW’09 Tutorial on Web Search Engine Metrics

by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 86:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on diversity: Long query

86

[Screenshot annotation: “Every result is about the same news.”]

Page 87:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on diversity: Long query

87

[Screenshot annotation: “More diverse.”]

Page 88:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on diversity: Ambiguous query [stanford]

88

[Screenshot annotation: “See http://en.wikipedia.org/wiki/Stanford_(disambiguation)”]

Page 89:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Example on diversity: Ambiguous query [stanford]

89

[Screenshot annotation: “See http://en.wikipedia.org/wiki/Stanford_(disambiguation)”]

Page 90:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Definitions for diversity

•  Diversity
–  related to the breadth of the content
–  also related to the quantification of “concepts” in a set of documents, or the quantification of query disambiguation or query intent
•  Closely tied to relevance and redundancy
–  excluding near-duplicate results
•  May have implications for search engine interfaces too
–  e.g., clustered or faceted presentations

90

Page 91:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

How to measure diversity

•  Method #1:
–  get editorial judgments as to the degree of diversity in a catalog
•  Method #2:
–  use the number of the content or source types for the documents in a catalog
–  find the set of concepts in a catalog and measure diversity based on their relationships
•  e.g., cluster using document similarity and assign a concept to each cluster
•  Method #3: (with a given relevance metric; see the sketch below)
–  iterate over each intent of the input query
–  consider the sets of documents relevant to each intent
–  weight the given relevance metric by the probability of each intent

91
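A minimal Python sketch of Method #3 (my illustration): weight a per-intent relevance metric by the probability of each query intent; the metric function and the intent distribution here are assumptions for the example:

```python
def intent_aware_metric(result_list, intents, relevance_metric):
    """Weight a relevance metric by the probability of each query intent.

    intents: list of (probability, relevant_docs_for_intent) pairs, probabilities summing to 1.
    relevance_metric: function(result_list, relevant_docs) -> score in [0, 1].
    """
    return sum(p * relevance_metric(result_list, relevant) for p, relevant in intents)

def precision_at_k(results, relevant, k=5):
    """A simple plug-in metric: precision of the top-k results for one intent."""
    top = results[:k]
    return sum(1 for d in top if d in relevant) / len(top)

results = ["d1", "d2", "d3", "d4", "d5"]
intents = [
    (0.7, {"d1", "d2", "d5"}),   # dominant intent (e.g., Stanford University)
    (0.3, {"d3"}),               # minority intent (e.g., Stanford, KY)
]
print(round(intent_aware_metric(results, intents, precision_at_k), 2))  # 0.7*0.6 + 0.3*0.2 = 0.48
```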

Page 92:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

How to measure diversity: Example

92

•  Types: news, organic, rich, ads
•  Sources for 10 organic results:
–  4 domains
•  Themes for organic results:
–  6 for Stanford University related
–  1 for Stanfords restaurant related
–  1 for Stanford, MT related
–  1 for Stanford, KY related
•  Detailed themes for organic results:
–  2 for general Stanford U. intro
–  1 for Stanford athletics
–  1 for Stanford medical school
–  1 for Stanford business school
–  1 for Stanford news
–  1 for Stanford green buildings
–  1 for Stanfords restaurant
–  1 for Stanford, MT high school
–  1 for Stanford, KY fire department

Page 93:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Further issues to consider

•  Categorization and similarity methods
–  for documents, queries, sites
•  Presentation issues
–  single page, clusters, facets, term cloud
•  Summarizing diversity
•  How to balance diversity against other objectives
–  diversity vs. relevance in particular

93

Page 94:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Key problems

• Measure and summarize diversity better

• Measure tradeoffs between diversity and relevance better

• Determine the best presentation of diversity

94

Page 95:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

Reference review on diversity metrics

•  Goldstein and Carbonell (1998)
–  defines maximal marginal relevance as a parameterized linear combination of novelty and relevance
•  novelty: measured via the similarity among documents (to avoid redundancy)
•  relevance: measured via the similarity between documents and the query
•  Jain, Sarda, and Haritsa (2003); Chen and Karger (2006); Joachims et al. (2008); and Swaminathan et al. (2008)
–  iteratively expand a document set to maximize marginal gain
–  each time add a new relevant document that is least similar to the existing set
–  Joachims et al. (2008) address the learning aspect.
•  Radlinski and Dumais (2006)
–  diversifies search results using relevant results to the input query and queries related to it
•  Agrawal et al. (2009)
–  diversifies search results using a taxonomy for classifying queries and documents
–  also reviews diversity metrics and proposes new ones
•  Gollapudi and Sharma (2009)
–  proposes an axiomatization of result diversification (similar to recent efforts for ranking and clustering) and proves the impossibility of satisfying all properties
–  enumerates a set of diversification functions satisfying different subsets of properties
•  Metrics to measure the diversity of a given set of results are proposed by Chen and Karger (2006), Clarke et al. (2008), and Agrawal et al. (2009).

95

Page 96:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.

References

•  R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong (2009), Diversifying search results, WSDM’09.

•  H. Chen and D.R. Karger (2006), Less is more: Probabilistic models for retrieving fewer relevant documents, SIGIR’06.

•  C.L.A. Clarke, M. Kolla, G.V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon (2008), Novelty and diversity in information retrieval evaluation, SIGIR’08.

•  J. Goldstein and J. Carbonell (1998), Summarization: (1) Using MMR for Diversity-based Reranking and (2) Evaluating Summaries, SIGIR’98.

•  S. Gollapudi and A. Sharma (2009), An axiomatic approach for result diversification, WWW’09.

•  A. Jain, P. Sarda, and J.R. Haritsa (2003), Providing Diversity in K-Nearest Neighbor Query Results, CoRR’03.

•  R. Kleinberg, F. Radlinski, and T. Joachims (2008), Learning Diverse Rankings with Multi-armed Bandits, ICML’08.

•  F. Radlinski and S.T. Dumais (2006), Improving personalized web search using result diversification, SIGIR’06.

•  A. Swaminathan, C. Mathew, and D. Kirovski (2008), Essential pages, MSR-TR-2008-015, Microsoft Research.

•  Y. Yue and T. Joachims (2008), Predicting Diverse Subsets Using Structural SVMs, ICML’08.

•  C. Zhai and J.D. Lafferty (2006), A risk minimization framework for information retrieval, Info. Proc. and Management, 42(1):31-55.

96

Page 97:

97

Discovery and Latency Metrics

PART IV of

WWW’09 Tutorial on Web Search Engine Metrics by

A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 98:

Example on discovery: Page was born ~30 minutes before

98

Page 99:

Example on discovery: URL of page was not found

99

Page 100:

Example on discovery: But content existed under different URLs

100

Page 101:

Example on discovery: URL was also found after ~1 hr

101

Page 102:

Life of a URL

102

[Figure: timeline of a single URL with BORN, DISCOVERED, NOW, and EXPIRED events on the TIME axis; the LATENCY interval (from born to discovered) and the AGE interval are marked.]

Page 103:

Lives of many URLs

103

[Figure: the same timeline repeated for many URLs, each with its own BORN, DISCOVERED, and EXPIRED times and hence its own LATENCY and AGE intervals.]

Page 104:

How to measure discovery and latency

•  Consider a sample of new pages on the Web (a measurement sketch follows this list)
   –  Feeds at regular intervals
   –  Each sample monitored for a period (e.g., 15 days)
•  User view
   –  Discovery: measure how many of these new pages are in the search results
      •  using the coverage ratio formula
   –  Latency: measure how long it took to get these new pages into the search results
•  System view
   –  Discovery: measure how many of these new pages are in a catalog
   –  Latency: measure how long it took to get these new pages into a catalog
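
A minimal sketch of the measurement above, assuming we already know, for each sampled URL, its birth time (from the feed) and the time it first appeared in the search results (user view) or in a catalog (system view), or None if it was never observed; names and the 15-day window are illustrative:

    from datetime import timedelta

    def discovery_metrics(samples, window=timedelta(days=15)):
        """samples: list of (born_time, discovered_time_or_None) pairs for the
        monitored new pages.

        Returns the discovery coverage ratio within the monitoring window and
        the list of discovery latencies (discovered - born) for covered pages.
        """
        latencies = [discovered - born
                     for born, discovered in samples
                     if discovered is not None and discovered - born <= window]
        coverage = len(latencies) / len(samples) if samples else 0.0
        return coverage, latencies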

104

Page 105:

Discovery profile of a search engine component: Overview

105

[Figure: discovery coverage over time, aggregated over many URLs per search engine component; annotations mark the time to reach a certain coverage percentage, convergence, the no-expiration-yet and content-expired cases, and other behaviors.]

Page 106:

Discovery profiles and monitoring: Examples

106

[Figure: example discovery profiles and the monitoring of their profile parameters over time.]

Page 107:

Latency profiles of a search engine component: Overview

107

[Figure: latency distribution aggregated over many URLs per search engine component; annotations mark the desired skewness direction and note that latency should be close to zero for crawlers.]

Page 108:

Latency profiles and monitoring: Examples

108

[Figure: example latency profiles and the monitoring of their profile parameters over time.]

Page 109:

Further issues to consider

•  How to discover samples to measure discovery and latency

•  How to beat crawlers to acquire samples

•  Discovery of top-level pages •  Discovery of deep links •  Discovery of hidden web content •  How to balance discovery against

other objectives

109

Page 110:

Key problems

•  Predict content changes on the Web
•  Discover new content almost instantaneously
•  Reduce latency per search engine component and overall

110

Page 111:

Reference review on discovery metrics

•  Cho, Garcia-Molina, & Page (1998)
   –  discusses how to order URL accesses based on importance scores
      •  importance: PageRank (best), link count, similarity to query in anchortext or URL string, attributes of URL string
•  Dasgupta et al. (2007)
   –  formulates the problem of discoverability (discover new content from the fewest number of known pages) and proposes approximation algorithms
•  Kim and Kang (2007)
   –  compares top three search engines for discovery (called “timeliness”), freshness, and latency
•  Lewandowski (2008)
   –  compares top three search engines for freshness and latency
•  Dasdan and Drome (2009)
   –  proposes discovery metrics along the lines discussed in this section

111

Page 112:

References

•  J. Cho, H. Garcia-Molina, and L. Page (1998), Efficient Crawling Through URL Ordering, Computer Networks and ISDN Systems, 30(1-7):161-172.

•  A. Dasdan and C. Drome (2009), Discovery coverage: Measuring how fast content is discovered by search engines, submitted.

•  A. Dasgupta, A. Ghosh, R. Kumar, C. Olston, S. Pandey, and A. Tomkins (2007), The discoverability of the Web, WWW’07.

•  J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.

•  N. Eiron, K.S. McCurley, and J.A. Tomlin (2004), Ranking the Web frontier, WWW’04.

•  C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk and optimality in estimating refresh rates for web pages, INTERFACE’08.

•  Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines with webpage monitoring results, WISE’07.

•  D. Lewandowski (2008), A three-year study on the freshness of Web search engine databases, to appear in J. Info. Syst., 2008.

112

Page 113:

113

Freshness Metrics

PART V of

WWW’09 Tutorial on Web Search Engine Metrics by

A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Page 114:

Example on freshness: Stale abstract in Search Results Page

114

Page 115:

Example on freshness: Actual page content

115

http://en.wikipedia.org/wiki/John_Yoo:

Page 116:

Example on freshness: Fresh abstract now

116

Page 117:

Definitions illustrated for a page

117

[Figure (Dasdan and Huynh, WWW’09): timeline of a page over times 0–6 with CRAWLED, INDEXED, MODIFIED, and CLICKED events; the page is up-to-date (fresh) from the last sync until the first modification at time 3, and the AGE interval shown is 3.]

Page 118:

Definitions illustrated for a page

118

[Figure (Dasdan and Huynh, WWW’09): the same page timeline with FRESHNESS and AGE plotted against TIME; freshness drops from 1 to 0 at the first modification (time 3), and age starts growing from 0 at that point.]

Page 119:

Freshness and age of a page

•  The freshness F(p,t) of a local page p at time t is
   –  1 if p is up-to-date at time t
   –  0 otherwise
•  The age A(p,t) of a local page p at time t is
   –  0 if p is up-to-date at time t
   –  t − tmod otherwise, where tmod is the time of the first modification after the last sync of p (restated as formulas below)
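
Restated as formulas, directly from the two definitions above:

    F(p,t) = \begin{cases} 1 & \text{if } p \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}
    \qquad
    A(p,t) = \begin{cases} 0 & \text{if } p \text{ is up-to-date at time } t \\ t - t_{\mathrm{mod}} & \text{otherwise} \end{cases}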

119

Page 120:

Freshness and age of a catalog

•  S: catalog of documents •  Sc: catalog of clicked documents •  Basic freshness and age

•  Unweighted freshness and age

•  Weighted freshness and age (c(): #clicks)
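
The formulas on this slide did not survive the conversion to text; a plausible reconstruction, following the simple averages of Cho & Garcia-Molina (2003) and the clicked/weighted variants of Dasdan and Huynh (WWW’09), is:

    \text{basic:}\quad      F(S,t)   = \frac{1}{|S|}   \sum_{p \in S}   F(p,t), \qquad A(S,t)   = \frac{1}{|S|}   \sum_{p \in S}   A(p,t)
    \text{unweighted:}\quad F(S_c,t) = \frac{1}{|S_c|} \sum_{p \in S_c} F(p,t), \qquad A(S_c,t) = \frac{1}{|S_c|} \sum_{p \in S_c} A(p,t)
    \text{weighted:}\quad   F_w(S_c,t) = \frac{\sum_{p \in S_c} c(p)\,F(p,t)}{\sum_{p \in S_c} c(p)}, \qquad A_w(S_c,t) = \frac{\sum_{p \in S_c} c(p)\,A(p,t)}{\sum_{p \in S_c} c(p)}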

120

Page 121:

How to measure freshness

•  Find the true refresh history of each page in the sample
   –  Needs independent crawling
•  Compare with the history in the search engine
•  Determine freshness and age (a small sketch follows this list)
   –  basic form: averaged over all documents in the catalog
•  Consider clicked or viewed documents
   –  unweighted form: averaged over all clicked or viewed documents in the catalog
   –  weighted form: unweighted form weighted with #clicks or #views (or any other weight function)
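
A minimal sketch of this procedure, under the simplifying assumption that for each sampled page we know its last sync time in the engine and its true modification times from independent monitoring; function and field names are illustrative:

    def catalog_freshness_and_age(pages, now, clicks=None):
        """pages: list of (last_sync_time, true_modification_times) per page,
        with true_modification_times obtained from independent monitoring.
        clicks: optional list of click counts aligned with pages (weighted form).
        Assumes a non-empty sample.
        """
        f_vals, a_vals = [], []
        for last_sync, mods in pages:
            later = [t for t in mods if last_sync < t <= now]
            if later:                      # changed after the last sync: stale
                f_vals.append(0.0)
                a_vals.append(now - min(later))
            else:                          # up-to-date: fresh, age 0
                f_vals.append(1.0)
                a_vals.append(0.0)
        if clicks is None:                 # basic / unweighted form: simple averages
            n = len(pages)
            return sum(f_vals) / n, sum(a_vals) / n
        total = float(sum(clicks))         # weighted form: weight by #clicks
        return (sum(c * f for c, f in zip(clicks, f_vals)) / total,
                sum(c * a for c, a in zip(clicks, a_vals)) / total)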

121

Page 122:

How to measure freshness: Example

122

Page 123:

Further issues to consider

•  Sampling pages
   –  random, from DMOZ, revisited, popular
•  Classifying pages
   –  topical, importance, change period, refresh period
•  Refresh period for monitoring
   –  daily, hourly, minutely
•  Measuring change (a sketch follows this list)
   –  hashing (MD5, Broder’s shingles, Charikar’s SimHash), Jaccard index, Dice coefficient, word frequency distribution similarity, structural similarity via DOM trees
•  What is change?
   –  content, “information”, structure, status, links, features, ads
•  How to balance freshness against other objectives
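
For the “measuring change” item above, one common and simple option (an illustrative sketch, not the tutorial’s prescribed method) is the Jaccard similarity of word shingles between two snapshots of a page:

    def shingles(text, k=4):
        """Set of k-word shingles of a page's text."""
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def jaccard(a, b):
        """Jaccard similarity of two shingle sets (1.0 = identical)."""
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    def has_changed(old_text, new_text, threshold=0.9):
        """Treat the page as changed if similarity drops below the threshold."""
        return jaccard(shingles(old_text), shingles(new_text)) < threshold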

123

Page 124:

Key problems

•  Measure the evolution of the content on the Web

•  Design refresh policies to adapt to the changes on the Web

•  Reduce latency from discovery to serving

•  Improve freshness metrics

124

Page 125:

Reference review on web page change patterns

•  Cho & Garcia-Molina (2000): Crawled 720K pages once a day for 4 months.

•  Ntoulas, Cho, & Olston (2004): Crawled 150 sites once a week for a year.

–  found: most pages didn’t change; changes were minor; frequency of change couldn’t predict degree of change, but degree of change could predict future degree of change

•  Fetterly, Manasse, Najork, & Wiener (2003): Crawled 150M pages once a week for 11 weeks.

–  found: past change could predict future change; page length and top-level domain name were correlated with change

•  Olston & Pandey (2008): Crawled 10K random pages and 10K pages sampled from DMOZ every two days for several months.

–  found: moderate correlation between change frequency and information longevity

•  Adar, Teevan, Dumais, & Elsas (2009): Crawled 55K revisited pages (sub)hourly for 5 weeks.

–  found: higher change rates compared to random pages; large portions of pages changing more than hourly; focus on pages with important static or dynamic content

125

Page 126:

Reference review on predicting refresh rates

•  Grimes, Ford & Tassone (2008)
   –  determines optimal crawl rates under a set of scenarios:
      •  while doing estimation; while fairly sure of the estimate
      •  when crawls are expensive, and when they are cheap
•  Matloff (2005)
   –  derives estimators similar to Cho & Garcia-Molina but with lower variance (and with improved theory); a naive variant is sketched after this list
   –  also derives estimators for the non-Poisson case
   –  finds that the Poisson model is not very good for its data
      •  but the estimators seem accurate (bias around 10%)
•  Singh (2007)
   –  non-homogeneous Poisson, localized windows, piecewise, Weibull, experimental evaluation
•  No work seems to consider the non-periodical case.
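
To make the estimation problem concrete, a simple Poisson-style estimator in the spirit of this line of work (a hedged sketch; the cited papers derive better-behaved estimators): if a page is checked n times at a fixed interval Δ, a change is detected on X of those checks, and changes arrive as a Poisson process with rate λ, then the probability of seeing at least one change between two checks is 1 − e^{−λΔ}, so setting X/n equal to that probability gives λ̂ = −ln(1 − X/n)/Δ.

    from math import log

    def naive_poisson_change_rate(n_checks, n_detected_changes, interval):
        """Naive change-rate estimate from regular revisits (illustrative).

        Assumes changes follow a Poisson process and that each check only
        reveals whether the page changed since the previous check.
        Undefined when a change was detected on every check.
        """
        frac = n_detected_changes / n_checks
        if frac >= 1.0:
            raise ValueError("change detected on every check; rate not identifiable")
        return -log(1.0 - frac) / interval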

126

Page 127:

Reference review on freshness metrics

•  Cho & Garcia-Molina (2003)
   –  freshness & age of one page
   –  average/expected freshness & age of one page & corpus
   –  freshness & age wrt Poisson model of change (reconstructed formulas below)
   –  weighted freshness & age
   –  sync policies
      •  uniform (better): all pages at the same rate
      •  nonuniform: rates proportional to change rates
   –  sync order
      •  fixed order (better), random order
   –  to improve freshness, penalize pages that change too often
   –  to improve age, sync proportionally to frequency, but uniform is not far from optimal
•  Han et al. (2004) and Dasdan and Huynh (2009) add the user perspective with weights.
•  Lewandowski (2008) and Kim and Kang (2007) compare the top three search engines for freshness.

127
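
To make the Poisson-model entry above concrete (stated here as a hedged reconstruction of the standard result rather than a quote from the slide): if a page changes according to a Poisson process with rate λ and is synced every I time units, its time-averaged freshness and age are

    \bar{F}(\lambda, I) = \frac{1 - e^{-\lambda I}}{\lambda I},
    \qquad
    \bar{A}(\lambda, I) = \frac{I}{2} - \frac{1}{\lambda} + \frac{1 - e^{-\lambda I}}{\lambda^{2} I}

As sanity checks, both expressions behave as expected in the limits: freshness tends to 1 and age to 0 as λ → 0, while freshness tends to 0 and age to I/2 as λ grows large.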

Page 128:

References 1/2

•  E. Adar, J. Teevan, S. Dumais, and J.L. Elsas (2009), The Web changes everything: Understanding the dynamics of Web content, WSDM’09.

•  J. Cho and H. Garcia-Molina (2000), The evolution of the Web and implications for an incremental crawler, VLDB’00.

•  D. Fetterly, M. Manasse, M. Najork, and J. Wiener (2003), A Large scale study of the evolution of Web pages, WWW’03.

•  F. Grandi (2000), Introducing an annotated bibliography on temporal and evolution aspects in the World Wide Web, SIGMOD Record, 33(2):84-86.

•  A. Ntoulas, J. Cho, and C. Olston (2004), What’s new on the Web? The evolution of the Web from a search engine perspective, WWW’04.

128

Page 129:

References 2/2

•  J. Cho and H. Garcia-Molina (2003), Effective page refresh policies for web crawlers, ACM Trans. Database Syst., 28(4):390-426.

•  J. Cho and H. Garcia-Molina (2003), Estimating frequency of change, ACM Trans. Inter. Tech., 3(3):256-290.

•  A. Dasdan and X. Huynh (2009), User-centric content freshness metrics for search engines, WWW’09.

•  J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.

•  C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk and optimality in estimating refresh rates for web pages, INTERFACE’08.

•  J. Han, N. Cercone, and X. Hu (2004), A Weighted freshness metric for maintaining a search engine local repository, WI’04.

•  Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines with webpage monitoring results, WISE’07.

•  D. Lewandowski, H. Wahlig, and G. Meyer-Bautor (2006), The freshness of web search engine databases, J. Info. Syst., 32(2):131-148.

•  D. Lewandowski (2008), A three-year study on the freshness of Web search engine databases, to appear in J. Info. Syst., 2008.

•  N. Matloff (2005), Estimation of internet file-access/modification rates from indirect data, ACM Trans. Model. Comput. Simul., 15(3):233-253.

•  C. Olston and S. Pandey (2008), Recrawl scheduling based on information longevity, WWW’08.

•  S.R. Singh (2007), Estimating the rate of web page changes, IJCAI’07.

129