IR Evaluation Using Rank-Biased Precision

Measuring Search Engine Quality

Rank-Biased Precision

Alistair Moffat and Justin Zobel, “Rank-Biased Precision for Measurement of Retrieval Effectiveness”, TOIS vol. 27 no. 1, 2008.

Ofer Egozi

LARA group, Technion

Introduction to IR Evaluation

Mean Average Precision

Rank-Biased Precision

Analysis of RBP

Outline

Task: given query q, output ranked list of documents

◦ Find probability that document d is relevant for q

Evaluation is difficult

◦ No (per query) test data

◦ Queries vary tremendously

◦ Relevance is a vague (human) concept

IR Evaluation

Precision / recall

◦ Precision and recall usually conflict

◦ Single measures proposed (P@X, RR, AP…)

Elementary IR Measures

[Diagram: within the collection D, the relevant set rel(q,D) and the retrieved set alg(q,D) overlap]

Precision: |alg ∩ rel| / |alg|

Recall: |alg ∩ rel| / |rel|
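
The two set-based definitions can be sketched in a few lines. This is a minimal illustration; the function name and the document ids are made up for the example and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query.

    retrieved: documents returned by the system, i.e. alg(q, D)
    relevant:  documents judged relevant, i.e. rel(q, D)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # alg ∩ rel
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Here precision is 2/4 and recall is 2/3: retrieving more documents can only keep recall the same or raise it, while it typically lowers precision.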

Relevance requires human judgment

◦ Exhaustive judging is not scalable

◦ TREC uses pooling

◦ Shown to miss a significant portion of the relevant documents…

◦ … but shown to compare systems against each other well

◦ Bias against novel approaches

Pooling for Scalable Judging
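
As a rough sketch of the idea, depth-k pooling takes the union of the top k documents from every participating run and sends only that pool to the human judges. The function name, run names, and document ids below are hypothetical.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents of every run.

    runs: dict mapping a (hypothetical) system name to its ranked list of doc ids.
    Documents outside the pool are never judged, and are then usually
    treated as nonrelevant when scoring.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


runs = {
    "sysA": ["d1", "d2", "d3"],
    "sysB": ["d2", "d4", "d5"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

A system that was not part of the pooled runs may retrieve relevant documents that nobody judged, which is one way to read the "bias against novel approaches" bullet.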

In the real world, what does recall measure?

◦ Recall important only with “perfect” knowledge

◦ If I got one result, and there is another I don’t know of, am I half-satisfied?...

◦ …yes, for specific needs (legal, patent) session

◦ “Boiling temperature of lead”

Precision is more user-oriented

◦ P@10 measures real user satisfaction

◦ Still, P@10=0.3 can mean first three or last three…

How Important is Recall?
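
The P@10 = 0.3 ambiguity is easy to reproduce. The sketch below uses two made-up rankings (1 = relevant, 0 = not) and illustrative function names; reciprocal rank (RR) is included only to show a measure that does tell the two apart.

```python
def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant."""
    return sum(rels[:k]) / k


def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


top_heavy = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]     # relevant docs at ranks 1-3
bottom_heavy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]  # relevant docs at ranks 8-10

print(precision_at_k(top_heavy, 10), precision_at_k(bottom_heavy, 10))
print(reciprocal_rank(top_heavy), reciprocal_rank(bottom_heavy))
```

Both rankings score P@10 = 0.3, although a user reading from the top would have a very different experience with each.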

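Calculated as \( AP = \frac{1}{R} \sum_{i=1}^{|D|} r_i \cdot \mathrm{P@}i \), where \( r_i \in \{0,1\} \) is the relevance of the document at rank i and R is the total number of relevant documents

◦ Intuitively: sum all P@X where rel found, divide by total rel to normalize for summing across queries

Example: $$---$----$-----$--- (AP = 0.6316 if these five are all the relevant documents)

Consider: $$---$----$-----$$$$

◦ AP is down to 0.5324, despite P@20 increasing

◦ Finding more rels can harm AP performance!

◦ Similar problems if some are initially unjudged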

(Mean) Average Precision
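
The two example rankings can be checked directly. The sketch below assumes that `$` marks a relevant document, `-` a nonrelevant one, and that R is simply the number of `$` in the ranking unless stated otherwise; the function names are illustrative.

```python
def average_precision(ranking, total_relevant=None):
    """AP for a ranking string where '$' = relevant and '-' = nonrelevant.

    total_relevant is R; by assumption it defaults to the number of
    relevant documents that appear in the ranking itself.
    """
    rels = [c == "$" for c in ranking]
    R = total_relevant if total_relevant is not None else sum(rels)
    if R == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # P@rank at each relevant document
    return total / R


def p_at_20(ranking):
    return ranking[:20].count("$") / 20


first = "$$---$----$-----$---"
second = "$$---$----$-----$$$$"

print(round(average_precision(first), 4), p_at_20(first))    # 0.6316 0.25
print(round(average_precision(second), 4), p_at_20(second))  # 0.5324 0.4
```

The second ranking has strictly more relevant documents in the same 20 positions, so P@20 rises from 0.25 to 0.4, yet AP falls because R grows from 5 to 8 and the new hits arrive at low precision.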

Methodological problem of instability

◦ Results may depend on judging extent

◦ More judging can be destabilizing (error margins do not shrink as uncertainty is reduced)

MAP is Unstable

Complex abstraction of user satisfaction

◦ “Every time a relevant document is encountered, the user pauses, asks ‘Over the documents I have seen so far, on average how satisfied am I?’ and writes a number on a piece of paper. Finally, when the user has examined every document in the collection, because this is the only way to be sure that all of the relevant ones have been seen, the user computes the average of the values they have written.”

How can R be truly calculated? Think evaluating a Google query…

Still, MAP is highly popular and useful:

◦ Validated in numerous TREC studies

◦ Shown to be stable and robust across query sets (for deep enough pools)

MAP is not “Real-Life”

Enter RBP…

Induced by a user model

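◦ The document at rank i is observed with probability \( p^{i-1} \)

◦ Expected #docs seen: \( \sum_{i \ge 1} p^{i-1} = \frac{1}{1-p} \)

◦ Total expected utility (\( r_i \) = known relevance function): \( \sum_{i \ge 1} r_i \, p^{i-1} \)

◦ RBP = expected utility rate = utility/effort: \( \mathrm{RBP} = (1-p) \sum_{i \ge 1} r_i \, p^{i-1} \)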
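
The user model translates into a one-line computation. This is a minimal sketch, assuming relevance values r_i in [0, 1] listed in rank order; the function name `rbp` is illustrative.

```python
def rbp(rels, p):
    """Rank-biased precision: (1 - p) * sum_i r_i * p^(i-1).

    rels: relevance values in rank order (rank 1 first), each in [0, 1].
    p:    persistence, the probability of moving on to the next document.
    """
    # enumerate() starts at 0, so p ** i is p^(rank - 1).
    return (1 - p) * sum(r * p ** i for i, r in enumerate(rels))


print(rbp([1, 1], 0.5))  # 0.5 * (1 + 0.5) = 0.75
print(rbp([1, 0], 0.0))  # p = 0: only rank 1 counts
print(rbp([0, 1], 0.0))
```

Neither R nor |D| appears anywhere in the computation: only the judged prefix of the ranking and the choice of p are needed.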


Values of p reflect user behaviors

◦ p=0.95 persistent user (60% chance for 2nd page)

◦ p=0.5 impatient (0.1% chance for 2nd page)

◦ p=0 I’m feeling lucky (identical to P@1)

Values of p control contribution of each relevant document

◦ But always positive!

RBP Intuitions
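
The "second page" figures follow from the user model if a results page is assumed to hold 10 documents, so reaching the second page means reaching rank 11. The helper names below are illustrative.

```python
def prob_reach(rank, p):
    """Probability that the user gets as far as `rank` (rank 1 is always seen)."""
    return p ** (rank - 1)


def rank_weight(rank, p):
    """Contribution of a relevant document at `rank` to the RBP score."""
    return (1 - p) * p ** (rank - 1)


for p in (0.95, 0.5):
    # Assumption: 10 results per page, so page 2 starts at rank 11.
    print(p, prob_reach(11, p))

# For 0 < p < 1 every rank has a positive weight, and the weights sum to 1.
print(sum(rank_weight(i, 0.5) for i in range(1, 60)))
```

With p = 0.95 the chance of reaching rank 11 is about 0.60, and with p = 0.5 it is about 0.001, which matches the percentages quoted on the slide.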

Evaluating a new evaluation measure…

RBP Stability

Uncertainty: how many relevant documents? (down the ranking, or even in current depth)

RBP value is inherently a lower bound

Residual uncertainty is easy to calculate – assume relevant…

Error Bounds
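
One way to sketch the bound: treat every judged-relevant document as contributing to the base score, and compute how much could still be added if every unjudged document, both inside the evaluated depth and below it, turned out to be relevant. The encoding of judgments as 1, 0, or None is an assumption made for this example.

```python
def rbp_with_residual(judgments, p):
    """Return (lower_bound, residual) for a ranking evaluated to depth d.

    judgments: rank-ordered list with 1 = relevant, 0 = nonrelevant,
               None = unjudged (hypothetical encoding for this sketch).
    The true RBP lies between lower_bound and lower_bound + residual.
    """
    base = (1 - p) * sum(p ** i for i, j in enumerate(judgments) if j == 1)
    # Unjudged documents within the evaluated depth, assumed relevant.
    unjudged = (1 - p) * sum(p ** i for i, j in enumerate(judgments) if j is None)
    # Everything below depth d: (1 - p) * sum_{i > d} p^(i-1) = p^d.
    tail = p ** len(judgments)
    return base, unjudged + tail


low, residual = rbp_with_residual([1, None, 0, 1], p=0.5)
print(low, low + residual)
```

For this made-up ranking the score lies between 0.5625 and 0.875; judging more documents or evaluating deeper can only narrow that interval.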

RBP in Comparison

Similarity (correlation) between measures

Detected significance in evaluated systems’ ranking

RBP has significant advantages:

◦ Based on a solid and supported user model

◦ Real-life, no unknown factors (R, |D|)

◦ Error bounds for uncertainty

◦ Statistical significance as good as others

But also:

◦ Absolute values, not relative to query difficulty

◦ A choice for p must be made

Conclusion
