Transcript
Page 1: IR Evaluation using Rank-Biased Precision

Measuring Search Engine Quality

Rank-Biased Precision
Alistair Moffat and Justin Zobel, "Rank-Biased Precision for Measurement of Retrieval Effectiveness", ACM TOIS, vol. 27, no. 1, 2008.

Ofer Egozi, LARA group, Technion

Page 2: IR Evaluation using Rank-Biased Precision

Introduction to IR Evaluation

Mean Average Precision

Rank-Biased Precision

Analysis of RBP

Outline

Page 3: IR Evaluation using Rank-Biased Precision

Task: given query q, output a ranked list of documents
◦ Find the probability that document d is relevant to q

IR Evaluation

Page 4: IR Evaluation using Rank-Biased Precision

Task: given query q, output a ranked list of documents
◦ Find the probability that document d is relevant to q

Evaluation is difficult
◦ No (per-query) test data

◦ Queries vary tremendously

◦ Relevance is a vague (human) concept

IR Evaluation

Page 5: IR Evaluation using Rank-Biased Precision

Precision / recall

◦ Precision and recall usually conflict

◦ Single-number measures have been proposed (P@X, RR, AP…)

Elementary IR Measures

[Diagram: document collection D, relevant set rel(q,D), retrieved set alg(q,D)]

Precision: |alg ∩ rel| / |alg|
Recall: |alg ∩ rel| / |rel|
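A minimal sketch of these two set-based measures (the function name and document IDs are illustrative; binary relevance is assumed):

def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                                  # |alg ∩ rel|
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall(["d1", "d2", "d3"], ["d2", "d4"]))        # (0.333..., 0.5)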

Page 6: IR Evaluation using Rank-Biased Precision

Relevance requires human judgment
◦ Exhaustive judging is not scalable

◦ TREC uses pooling

◦ Pooling has been shown to miss a significant portion of the relevant documents…

◦ … but also shown to support reliable cross-system comparison

◦ It is biased against novel approaches

Pooling for Scalable Judging

Page 7: IR Evaluation using Rank-Biased Precision

In the real world, what does recall measure?
◦ Recall is important only with "perfect" knowledge of what is relevant
◦ If I got one result, and there is another I don't know of, am I half-satisfied?...
◦ …yes, for specific needs (a legal or patent search session)
◦ …but not for a query like "Boiling temperature of lead"

How Important is Recall?

Page 8: IR Evaluation using Rank-Biased Precision

In the real world, what does recall measure?
◦ Recall is important only with "perfect" knowledge of what is relevant
◦ If I got one result, and there is another I don't know of, am I half-satisfied?...
◦ …yes, for specific needs (a legal or patent search session)
◦ …but not for a query like "Boiling temperature of lead"

Precision is more user-oriented
◦ P@10 measures real user satisfaction
◦ Still, P@10=0.3 can mean the first three results or the last three…

How Important is Recall?

Page 9: IR Evaluation using Rank-Biased Precision

Calculated as AP = (1/R) · Σ_{ranks i with a relevant document} P@i, where R is the total number of relevant documents
◦ Intuitively: sum all P@X at ranks where a relevant document is found, then divide by the total number of relevant documents to allow averaging across queries

Example: $$---$----$-----$--- ($ = relevant, - = non-relevant)

(Mean) Average Precision

Page 10: IR Evaluation using Rank-Biased Precision

Calculated as AP = (1/R) · Σ_{ranks i with a relevant document} P@i, where R is the total number of relevant documents
◦ Intuitively: sum all P@X at ranks where a relevant document is found, then divide by the total number of relevant documents to allow averaging across queries

Example: $$---$----$-----$--- ($ = relevant, - = non-relevant)
Consider: $$---$----$-----$$$$

◦ AP is down to 0.5234, despite P@20 increasing

◦ Finding more relevant documents can harm the AP score!

◦ Similar problems arise if some documents are initially unjudged

(Mean) Average Precision
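A minimal sketch that reproduces the two examples above, assuming the relevant ($) documents shown are the only relevant ones and encoding each ranking as a string:

def average_precision(ranking):
    """AP for a ranking encoded as '$' (relevant) / '-' (non-relevant)."""
    rel_ranks = [i for i, c in enumerate(ranking, start=1) if c == "$"]
    precisions = [(hits + 1) / rank for hits, rank in enumerate(rel_ranks)]   # P@rank at each relevant hit
    return sum(precisions) / len(rel_ranks)

print(average_precision("$$---$----$-----$---"))    # ~0.63
print(average_precision("$$---$----$-----$$$$"))    # ~0.53 -- lower, although P@20 went up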

Page 11: IR Evaluation using Rank-Biased Precision

A methodological problem of instability
◦ Results may depend on the extent of judging

◦ More judging can be destabilizing (error margins do not necessarily shrink as uncertainty is reduced)

MAP is Unstable

Page 12: IR Evaluation using Rank-Biased Precision

A complex abstraction of user satisfaction
◦ "Every time a relevant document is encountered, the user pauses, asks "Over the documents I have seen so far, on average how satisfied am I?" and writes a number on a piece of paper. Finally, when the user has examined every document in the collection — because this is the only way to be sure that all of the relevant ones have been seen — the user computes the average of the values they have written."

How can R (the total number of relevant documents) be truly calculated? Think of evaluating a Google query…

MAP is not “Real-Life”

Page 13: IR Evaluation using Rank-Biased Precision

A complex abstraction of user satisfaction
◦ "Every time a relevant document is encountered, the user pauses, asks "Over the documents I have seen so far, on average how satisfied am I?" and writes a number on a piece of paper. Finally, when the user has examined every document in the collection — because this is the only way to be sure that all of the relevant ones have been seen — the user computes the average of the values they have written."

How can R (the total number of relevant documents) be truly calculated? Think of evaluating a Google query…

Still, MAP is highly popular and useful:
◦ Validated in numerous TREC studies
◦ Shown to be stable and robust across query sets (for deep enough pools)

MAP is not “Real-Life”

Page 14: IR Evaluation using Rank-Biased Precision

Enter RBP…

Page 15: IR Evaluation using Rank-Biased Precision

Induced by a user model

Rank-Biased Precision

Page 16: IR Evaluation using Rank-Biased Precision

Induced by a user model

◦ The document at rank i is examined with probability p^(i-1)

◦ Expected number of documents seen: Σ_{i≥1} p^(i-1) = 1/(1-p)

◦ Total expected utility (r_i = known relevance of the document at rank i): Σ_i r_i · p^(i-1)

◦ RBP = expected rate of utility = utility/effort = (1-p) · Σ_i r_i · p^(i-1)

Rank-Biased Precision
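A minimal sketch of this definition under binary relevance (the function name is illustrative):

def rbp(relevances, p=0.95):
    """Rank-biased precision: (1 - p) * sum over ranks i of r_i * p^(i-1).

    relevances -- r_i values in rank order (1 = relevant, 0 = not)
    p          -- persistence: probability of continuing from one rank to the next
    """
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))   # enumerate gives i = rank - 1

print(rbp([1, 1, 0, 0, 0, 1], p=0.5))   # 0.765625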

Page 17: IR Evaluation using Rank-Biased Precision

Values of p reflect user behaviors
◦ p=0.95: persistent user (60% chance for 2nd page)

◦ p=0.5: impatient user (0.1% chance for 2nd page)

RBP Intuitions

Page 18: IR Evaluation using Rank-Biased Precision

Values of p reflect user behaviors
◦ p=0.95: persistent user (60% chance for 2nd page)

◦ p=0.5: impatient user (0.1% chance for 2nd page)

◦ p=0: "I'm feeling lucky" (identical to P@1)

RBP Intuitions

Page 19: IR Evaluation using Rank-Biased Precision

Values of p reflect user behaviors
◦ p=0.95: persistent user (60% chance for 2nd page)

◦ p=0.5: impatient user (0.1% chance for 2nd page)

◦ p=0: "I'm feeling lucky" (identical to P@1)

Values of p also control the contribution of each relevant document
◦ But the contribution is always positive!

RBP Intuitions
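The "2nd page" figures above follow directly from the model: assuming 10 results per page, reaching rank 11 requires ten continuation steps, so the probability is p^10. A quick check:

# Probability of reaching the second page (rank 11) is p**10
for p in (0.95, 0.5):
    print(p, p ** 10)
# 0.95 -> ~0.599   (about 60%)
# 0.5  -> ~0.00098 (about 0.1%)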

Page 20: IR Evaluation using Rank-Biased Precision

Evaluating a new evaluation measure…

Page 21: IR Evaluation using Rank-Biased Precision

RBP Stability

Page 22: IR Evaluation using Rank-Biased Precision

Uncertainty: how many more relevant documents are there? (further down the ranking, or even unjudged within the current depth)

The RBP value computed from the judged documents is inherently a lower bound

Error Bounds

Page 23: IR Evaluation using Rank-Biased Precision

Uncertainty: how many more relevant documents are there? (further down the ranking, or even unjudged within the current depth)

The RBP value computed from the judged documents is inherently a lower bound
The residual uncertainty is easy to calculate – assume every unjudged document is relevant…

Error Bounds
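A minimal sketch of how such bounds could be computed (my reading of the idea, not code from the paper): score judged relevant documents as usual for the lower bound, then add the maximum possible contribution of every unjudged rank; all ranks below the judged depth d contribute at most p^d in total.

def rbp_bounds(judgments, p=0.95):
    """Lower and upper bounds on RBP from partial judgments.

    judgments -- list in rank order: 1 (relevant), 0 (non-relevant), None (unjudged)
    """
    lower = (1 - p) * sum(p ** i for i, r in enumerate(judgments) if r == 1)
    residual = (1 - p) * sum(p ** i for i, r in enumerate(judgments) if r is None)
    residual += p ** len(judgments)   # maximum contribution of all ranks below the judged depth
    return lower, lower + residual

print(rbp_bounds([1, 1, None, 0, 1], p=0.5))   # (0.78125, 0.9375)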

Page 24: IR Evaluation using Rank-Biased Precision

RBP in Comparison

Similarity (correlation) between measures

Statistical significance detected in the evaluated systems' rankings

Page 25: IR Evaluation using Rank-Biased Precision

RBP has significant advantages:
◦ Based on a solid and supported user model

◦ Real-life, no unknown factors (R, |D|)

◦ Error bounds for uncertainty

◦ Statistical significance as good as others

But also:
◦ Absolute values, not relative to query difficulty

◦ A choice for p must be made

Conclusion

