TRANSCRIPT
Super Awesome Presentation
Dandre Allison, Devin Adair
Comparing the Sensitivity of Information Retrieval Metrics
Filip Radlinski, Microsoft, Cambridge, UK ([email protected])
Nick Craswell, Microsoft, Redmond, WA ([email protected])
How do you evaluate Information Retrieval effectiveness?
• Precision (P)
• Mean Average Precision (MAP)
• Normalized Discounted Cumulative Gain (NDCG)
Precision
• Compute the fraction of relevant documents in the top 5 for a given query
• Average over all queries
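The Precision@5 steps above can be sketched as follows (hypothetical data; `relevant` is assumed to be a set of relevant document ids and `ranking` a ranked list of ids):

```python
def precision_at_k(relevant, ranking, k=5):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def mean_precision_at_k(queries, k=5):
    """Average Precision@k over all (relevant, ranking) query pairs."""
    return sum(precision_at_k(rel, rank, k) for rel, rank in queries) / len(queries)

# Example: relevant docs {1, 3}, ranking [1, 2, 3, 4, 5] → Precision@5 = 2/5
```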
Mean Average Precision
• For each relevant document in the top 10, find the precision at its rank for a given query
• Sum these precisions and normalize by the number of known relevant documents
• Average over all queries
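A minimal sketch of the MAP computation just described (hypothetical data; the per-query step is Average Precision over the top 10):

```python
def average_precision(relevant, ranking, k=10):
    """For each relevant document in the top-k, take the precision at its
    rank; sum and normalize by the number of known relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision up to and including this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries, k=10):
    """Average AP over all (relevant, ranking) query pairs."""
    return sum(average_precision(rel, rank, k) for rel, rank in queries) / len(queries)
```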
Normalized Discounted Cumulative Gain
• Normalize the Discounted Cumulative Gain by the Ideal Discounted Cumulative Gain for a given query
• Average over all queries
Normalized Discounted Cumulative Gain
• Discounted Cumulative Gain
– Give more emphasis to relevant documents by using 2^relevance
– Give more emphasis to earlier ranks by using a logarithmic reduction factor
– Sums over the top 5
• Ideal Discounted Cumulative Gain
– Same as DCG, but sorts by relevance
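The DCG/NDCG recipe above can be sketched as follows. This assumes the common (2^rel - 1) gain with a log2 rank discount; exact gain and discount conventions vary between implementations. `gains` is the list of graded relevance labels in ranked order:

```python
import math

def dcg(gains, k=5):
    """Discounted Cumulative Gain over the top-k: exponential gain in the
    relevance grade, discounted logarithmically by rank."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))

def ndcg(gains, k=5):
    """Normalize DCG by the ideal DCG: the same gains sorted best-first."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered ranking scores 1.0; any misordering among the top-k scores strictly less.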
What’s the problem?
• Sensitivity – Might reject small but significant improvements
• Bias – Judges are removed from the search process
• Fidelity – Evaluation should reflect user success!
Alternative Evaluation
• Use actual user searches
• Judges become actual users
• Evaluation becomes user success
Interleaving
System A Results + System B Results
Team-Draft Algorithm
Captain Ahab Captain Barnacle
Interleaved List
Crediting
• Whichever team gets the most distinct clicks is considered “better”
• In case of a tie, the query is ignored
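A sketch of team-draft interleaving and the crediting rule above (hypothetical data; like the captains picking players, the two rankers take turns, with a coin flip whenever the teams are tied, each picking its highest-ranked document not yet in the interleaved list):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings team-draft style; returns the interleaved list
    plus the set of documents each team contributed."""
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    interleaved, team_a, team_b, seen = [], set(), set(), set()
    remaining = lambda r: [d for d in r if d not in seen]
    while remaining(ranking_a) or remaining(ranking_b):
        # A picks when it has fewer picks, or wins the coin flip on a tie.
        a_turn = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and rng.random() < 0.5)
        if a_turn and remaining(ranking_a):
            doc = remaining(ranking_a)[0]; team_a.add(doc)
        elif remaining(ranking_b):
            doc = remaining(ranking_b)[0]; team_b.add(doc)
        else:
            doc = remaining(ranking_a)[0]; team_a.add(doc)
        seen.add(doc); interleaved.append(doc)
    return interleaved, team_a, team_b

def credit(clicks, team_a, team_b):
    """The team with more distinct clicks wins; ties are ignored (None)."""
    a, b = len(set(clicks) & team_a), len(set(clicks) & team_b)
    return "A" if a > b else "B" if b > a else None
```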
Retrieval System Pairs
• Major improvements
– majorAB
– majorBC
– majorAC
• Minor improvements
– minorE
– minorD
Evaluation
• 12,000 queries
– Sampled n times with replacement
• Count the sampled queries where the rankers differ
– Ignores ties
• Percent where the better ranker scores better
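The sampling procedure above can be sketched as a bootstrap (hypothetical encoding: each per-query outcome is +1 if the known-better ranker wins, -1 if it loses, 0 for a tie):

```python
import random

def bootstrap_agreement(outcomes, n, trials=1000, rng=None):
    """Sample n per-query outcomes with replacement, drop ties, and report
    the fraction of trials where the better ranker comes out ahead."""
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    wins = 0
    for _ in range(trials):
        sample = [rng.choice(outcomes) for _ in range(n)]
        decisive = [o for o in sample if o != 0]  # ignore ties
        if sum(decisive) > 0:
            wins += 1
    return wins / trials
```

With a small n, even a genuinely better ranker frequently fails to come out ahead, which is the sensitivity problem the talk is driving at.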
Interleaving Evaluation
Credit Assignment Alternatives
• Shared top k
– Ignore?
– Lower clicks treated the same
• Not all clicks are created equal
– log(rank)
– 1/rank
– Top
– Bottom
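One way to read the weighting schemes listed above (this interpretation is an assumption; the slide only names them): instead of counting distinct clicks equally, weight each click by its rank, or credit only the topmost or bottommost click. A sketch, where `click_ranks` are the 1-based ranks of a team's clicked documents:

```python
import math

def click_score(click_ranks, scheme="distinct"):
    """Alternative credit-assignment weightings for a team's clicks."""
    if not click_ranks:
        return 0.0
    if scheme == "distinct":        # baseline: every distinct click counts equally
        return float(len(click_ranks))
    if scheme == "log_rank":        # discount clicks deeper in the list, log(rank)
        return sum(1.0 / math.log2(r + 1) for r in click_ranks)
    if scheme == "inverse_rank":    # steeper 1/rank discount
        return sum(1.0 / r for r in click_ranks)
    if scheme == "top":             # credit only the topmost click
        return 1.0 / min(click_ranks)
    if scheme == "bottom":          # credit only the bottommost click
        return 1.0 / max(click_ranks)
    raise ValueError(f"unknown scheme: {scheme}")
```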
Conclusions
• Performance measured by:
– Judgment-based
– Usage-based
• Surprise surprise, a small sample size is stupid
– (check out that alliteration)
• Interleaving is transitive