IR Evaluation using Rank-Biased Precision

DESCRIPTION

How IR systems (search engines) are evaluated, in particular under the TREC methodology. The common measure of Mean Average Precision is discussed and compared to the more recently proposed Rank-Biased Precision (Moffat and Zobel, 2008). For more discussion, see: http://alteregozi.com/2009/01/18/evaluating-search-engines-relevance/

TRANSCRIPT

Page 1: IR Evaluation using Rank-Biased Precision

Measuring Search Engine Quality

Rank-Biased Precision
Alistair Moffat and Justin Zobel, “Rank-Biased Precision for Measurement of Retrieval Effectiveness”, ACM TOIS, vol. 27, no. 1, 2008.

Ofer Egozi, LARA group, Technion

Page 2: IR Evaluation using Rank-Biased Precision

Introduction to IR Evaluation

Mean Average Precision

Rank-Biased Precision

Analysis of RBP

Outline

Page 4: IR Evaluation using Rank-Biased Precision

Task: given query q, output a ranked list of documents
◦ Find the probability that document d is relevant for q

Evaluation is difficult
◦ No (per-query) test data

◦ Queries vary tremendously

◦ Relevance is a vague (human) concept

IR Evaluation

Page 5: IR Evaluation using Rank-Biased Precision

Precision / recall

◦ Precision and recall usually conflict

◦ Single measures proposed (P@X, RR, AP…)

Elementary IR Measures

[Venn diagram: document collection D, relevant set rel(q,D), retrieved set alg(q,D)]

Precision: |alg ∩ rel| / |alg|    Recall: |alg ∩ rel| / |rel|
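A minimal Python sketch of these set-based definitions, keeping the slide's alg/rel names; the example documents are made up for illustration:

```python
# Precision and recall over a single query, following the definitions above.
# `alg` is the set of documents returned by the system, `rel` the set judged relevant.

def precision(alg: set, rel: set) -> float:
    return len(alg & rel) / len(alg) if alg else 0.0

def recall(alg: set, rel: set) -> float:
    return len(alg & rel) / len(rel) if rel else 0.0

alg = {"d1", "d3", "d7", "d9"}   # retrieved by the system
rel = {"d1", "d2", "d3", "d5"}   # judged relevant for the query
print(precision(alg, rel))       # 2/4 = 0.5
print(recall(alg, rel))          # 2/4 = 0.5
```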

Page 6: IR Evaluation using Rank-Biased Precision

Relevance requires human judgment
◦ Exhaustive judging is not scalable

◦ TREC uses pooling (see the sketch below)

◦ Shown to miss a significant portion of the relevant documents…

◦ … but shown to support cross-system comparison well

◦ Biased against novel approaches

Pooling for Scalable Judging
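A rough sketch of the pooling idea, assuming each participating run contributes its top-k results to the judging pool; the run names and the value of k are made up for illustration:

```python
# TREC-style pooling: assessors judge only the union of the top-k documents
# from every submitted run; documents outside the pool are never judged,
# which is where the bias against novel systems comes from.

def build_pool(runs: dict, k: int) -> set:
    """runs maps a run name to its ranked list of document ids."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

runs = {
    "systemA": ["d1", "d2", "d3", "d4"],
    "systemB": ["d2", "d5", "d1", "d6"],
}
print(sorted(build_pool(runs, k=3)))  # ['d1', 'd2', 'd3', 'd5']
```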

Page 8: IR Evaluation using Rank-Biased Precision

In the real world, what does recall measure?
◦ Recall is important only with “perfect” knowledge
◦ If I got one result, and there is another I don’t know of, am I half-satisfied?...
◦ …yes, for specific needs (legal or patent search sessions)
◦ “Boiling temperature of lead”

Precision is more user-oriented
◦ P@10 measures real user satisfaction
◦ Still, P@10 = 0.3 can mean the first three results or the last three…

How Important is Recall?

Page 10: IR Evaluation using Rank-Biased Precision

Calculated as AP = (1/R) · Σ P@k, summing over the ranks k at which a relevant document is found (R = number of relevant documents)
◦ Intuitively: sum all P@X where a relevant document is found, and divide by the total number of relevant documents to normalize for summing across queries

 Example: $$---$----$-----$--- (AP ≈ 0.63)
 Consider: $$---$----$-----$$$$ (see the worked sketch below)

◦ AP drops to about 0.53, despite P@20 increasing from 0.25 to 0.40

◦ Finding more relevant documents can lower the AP score!

◦ Similar problems arise if some documents are initially unjudged

(Mean) Average Precision
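A small sketch that reproduces the two rankings above (‘$’ = relevant, ‘-’ = not relevant), with R taken as the number of relevant documents retrieved, as the slide’s numbers imply:

```python
# Average Precision: mean of P@k over the ranks k holding a relevant document.

def average_precision(ranking: str) -> float:
    rel_seen = 0
    precisions = []
    for k, mark in enumerate(ranking, start=1):
        if mark == "$":                      # relevant document at rank k
            rel_seen += 1
            precisions.append(rel_seen / k)  # P@k
    return sum(precisions) / rel_seen if rel_seen else 0.0

print(round(average_precision("$$---$----$-----$---"), 4))   # 0.6316
print(round(average_precision("$$---$----$-----$$$$"), 4))   # 0.5324 -- lower, despite 3 more relevant docs
```

Adding three relevant documents at the tail raises P@20 yet lowers AP, which is exactly the instability the slide points at.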

Page 11: IR Evaluation using Rank-Biased Precision

Methodological problem of instability
◦ Results may depend on the extent of judging

◦ More judging can be destabilizing (error margins don’t shrink even as uncertainty is reduced)

MAP is Unstable

Page 13: IR Evaluation using Rank-Biased Precision

Complex abstraction of user satisfaction
◦ “Every time a relevant document is encountered, the user pauses, asks ‘Over the documents I have seen so far, on average how satisfied am I?’ and writes a number on a piece of paper. Finally, when the user has examined every document in the collection — because this is the only way to be sure that all of the relevant ones have been seen — the user computes the average of the values they have written.”

How can R (the total number of relevant documents) be truly calculated? Think of evaluating a Google query…

Still, MAP is highly popular and useful:
◦ Validated in numerous TREC studies
◦ Shown to be stable and robust across query sets (for deep enough pools)

MAP is not “Real-Life”

Page 14: IR Evaluation using Rank-Biased Precision

Enter RBP…

Page 16: IR Evaluation using Rank-Biased Precision

Induced by a user model

◦ The document at rank i is examined with probability p^(i-1)

◦ Expected #docs seen: Σ_{i≥1} p^(i-1) = 1/(1-p)

◦ Total expected utility (r_i = known relevance of the document at rank i): Σ_{i=1..d} r_i · p^(i-1)

◦ RBP = expected utility rate = utility/effort = (1-p) · Σ_{i=1..d} r_i · p^(i-1) (see the sketch below)

Rank-Biased Precision
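A minimal sketch of the formula as reconstructed above; binary relevance (r_i ∈ {0, 1}) and the example ranking are assumptions made here for illustration:

```python
# RBP = (1 - p) * sum_i r_i * p**(i-1); relevances[i] holds the relevance of
# the document at rank i+1 (graded values in [0, 1] would also work).

def rbp(relevances, p: float) -> float:
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))

ranking = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]   # relevant at ranks 1, 2 and 6
print(round(rbp(ranking, p=0.95), 4))      # 0.1362  (persistent user)
print(round(rbp(ranking, p=0.5), 4))       # 0.7656  (impatient user)
```

The same ranking scores very differently under the two user models, which is exactly the role of p discussed on the next slide.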

Page 19: IR Evaluation using Rank-Biased Precision

Values of p reflect user behaviors (a quick numeric check follows below)
◦ p = 0.95: persistent user (about 60% chance of reaching the 2nd page)

◦ p = 0.5: impatient user (about 0.1% chance of reaching the 2nd page)

◦ p = 0: “I’m feeling lucky” (identical to P@1)

The value of p also controls the contribution of each relevant document
◦ But the contribution is always positive!

RBP Intuitions
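A quick numeric check of the percentages quoted above, assuming a 10-results-per-page layout (reaching the 2nd page means examining rank 11, which under the model happens with probability p^10):

```python
# Chance that the modelled user gets as far as the first result of page 2.
for p in (0.95, 0.5):
    print(p, round(p ** 10, 4))    # 0.95 -> 0.5987 (~60%), 0.5 -> 0.001 (~0.1%)

# Per-rank weight of a relevant document under RBP: (1 - p) * p**(rank - 1).
# With p = 0.5 the weights decay quickly but never reach zero:
print([(1 - 0.5) * 0.5 ** (rank - 1) for rank in range(1, 4)])   # [0.5, 0.25, 0.125]
```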

Page 20: IR Evaluation using Rank-Biased Precision

Evaluating a new evaluation measure…

Page 21: IR Evaluation using Rank-Biased Precision

RBP Stability

Page 23: IR Evaluation using Rank-Biased Precision

Uncertainty: how many more relevant documents are there? (further down the ranking, or even unjudged within the current depth)

The reported RBP value is inherently a lower bound

Residual uncertainty is easy to calculate: assume every unjudged document is relevant… (see the sketch below)

Error Bounds
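A sketch of the error-bound idea under the simplifying assumption that everything in the top d is judged and everything below it is not; the value of p, the depth d and the example judgments are made up for illustration:

```python
# The base score counts only judged-relevant documents, so it is a lower
# bound on the true RBP; the residual is the total weight of all unjudged
# ranks (here, everything below depth d), i.e. what an "assume relevant"
# completion could still add.

def rbp_with_residual(judged, p: float):
    """judged[i] is 1 (relevant) or 0 (not relevant) for the top-d documents."""
    d = len(judged)
    base = (1 - p) * sum(r * p ** i for i, r in enumerate(judged))
    residual = p ** d                # geometric tail beyond depth d
    return base, residual            # true RBP lies in [base, base + residual]

base, res = rbp_with_residual([1, 1, 0, 0, 1, 0, 0, 0, 0, 0], p=0.8)
print(round(base, 4), round(res, 4))   # 0.4419 0.1074
```

The deeper the judging (larger d) or the less persistent the modelled user (smaller p), the smaller the residual and the tighter the bounds.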

Page 24: IR Evaluation using Rank-Biased Precision

RBP in Comparison

Similarity (correlation) between measures

Detected significance in evaluated systems’ ranking

Page 25: IR Evaluation using Rank-Biased Precision

RBP has significant advantages:
◦ Based on a solid and supported user model

◦ Grounded in real life: no unknown factors (R, |D|) involved

◦ Error bounds for uncertainty

◦ Statistical significance as good as others

But also:
◦ Absolute values, not relative to query difficulty

◦ A choice for p must be made

Conclusion