TRANSCRIPT
Super Awesome Presentation
Dandre Allison, Devin Adair
Comparing the Sensitivity of Information Retrieval Metrics
Filip Radlinski, Microsoft, Cambridge, UK ([email protected])
Nick Craswell, Microsoft, Redmond, WA ([email protected])
How do you evaluate Information Retrieval effectiveness?
• Precision (P)
• Mean Average Precision (MAP)
• Normalized Discounted Cumulative Gain (NDCG)
Precision
• Compute the fraction of relevant documents in the top 5 for a given query
• Average over all queries
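The Precision@5 steps above can be sketched as follows (hypothetical data; `relevant` is assumed to be a set of relevant document ids and `ranking` a ranked list of ids):

```python
def precision_at_k(relevant, ranking, k=5):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def mean_precision_at_k(queries, k=5):
    """Average Precision@k over all (relevant, ranking) query pairs."""
    return sum(precision_at_k(rel, rank, k) for rel, rank in queries) / len(queries)

# Example: relevant docs {1, 3}, ranking [1, 2, 3, 4, 5] → Precision@5 = 2/5
```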
Mean Average Precision
• For each relevant document in the top 10, find the precision at its rank for a given query
• Sum these precisions and normalize by the number of known relevant documents
• Average over all queries
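A minimal sketch of the MAP computation just described (hypothetical data; the per-query step is Average Precision over the top 10):

```python
def average_precision(relevant, ranking, k=10):
    """For each relevant document in the top-k, take the precision at its
    rank; sum and normalize by the number of known relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision up to and including this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries, k=10):
    """Average AP over all (relevant, ranking) query pairs."""
    return sum(average_precision(rel, rank, k) for rel, rank in queries) / len(queries)
```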
Normalized Discounted Cumulative Gain
• Normalize the Discounted Cumulative Gain by the Ideal Discounted Cumulative Gain for a given query
• Average over all queries
Normalized Discounted Cumulative Gain
• Discounted Cumulative Gain
– Give more emphasis to relevant documents by using 2^relevance
– Give more emphasis to earlier ranks by using a logarithmic reduction factor
– Sums over the top 5
• Ideal Discounted Cumulative Gain
– Same as DCG, but sorts by relevance
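The DCG/NDCG recipe above can be sketched as follows. This assumes the common (2^rel - 1) gain with a log2 rank discount; exact gain and discount conventions vary between implementations. `gains` is the list of graded relevance labels in ranked order:

```python
import math

def dcg(gains, k=5):
    """Discounted Cumulative Gain over the top-k: exponential gain in the
    relevance grade, discounted logarithmically by rank."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))

def ndcg(gains, k=5):
    """Normalize DCG by the ideal DCG: the same gains sorted best-first."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered ranking scores 1.0; any misordering among the top-k scores strictly less.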
What’s the problem?
• Sensitivity – Might reject small but significant improvements
• Bias – Judges are removed from the search process
• Fidelity – Evaluation should reflect user success!
Alternative Evaluation
• Use actual user searches
• Judges become actual users
• Evaluation becomes user success
Interleaving
System A Results + System B Results
Team-Draft Algorithm
Captain Ahab Captain Barnacle
Interleaved List
Crediting
• Whichever team gets the most distinct clicks is considered “better”
• In case of a tie, the query is ignored
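A sketch of team-draft interleaving and the crediting rule above (hypothetical data; like the captains picking players, the two rankers take turns, with a coin flip whenever the teams are tied, each picking its highest-ranked document not yet in the interleaved list):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings team-draft style; returns the interleaved list
    plus the set of documents each team contributed."""
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    interleaved, team_a, team_b, seen = [], set(), set(), set()
    remaining = lambda r: [d for d in r if d not in seen]
    while remaining(ranking_a) or remaining(ranking_b):
        # A picks when it has fewer picks, or wins the coin flip on a tie.
        a_turn = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and rng.random() < 0.5)
        if a_turn and remaining(ranking_a):
            doc = remaining(ranking_a)[0]; team_a.add(doc)
        elif remaining(ranking_b):
            doc = remaining(ranking_b)[0]; team_b.add(doc)
        else:
            doc = remaining(ranking_a)[0]; team_a.add(doc)
        seen.add(doc); interleaved.append(doc)
    return interleaved, team_a, team_b

def credit(clicks, team_a, team_b):
    """The team with more distinct clicks wins; ties are ignored (None)."""
    a, b = len(set(clicks) & team_a), len(set(clicks) & team_b)
    return "A" if a > b else "B" if b > a else None
```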
Retrieval System Pairs
• Major improvements
– majorAB
– majorBC
– majorAC
• Minor improvements
– minorE
– minorD
Evaluation
• 12,000 queries
– Sampled n times with replacement
• Count the sampled queries where the rankers differ
– Ignores ties
• Percent where the better ranker scores better
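The sampling procedure above can be sketched as a bootstrap (hypothetical encoding: each per-query outcome is +1 if the known-better ranker wins, -1 if it loses, 0 for a tie):

```python
import random

def bootstrap_agreement(outcomes, n, trials=1000, rng=None):
    """Sample n per-query outcomes with replacement, drop ties, and report
    the fraction of trials where the better ranker comes out ahead."""
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    wins = 0
    for _ in range(trials):
        sample = [rng.choice(outcomes) for _ in range(n)]
        decisive = [o for o in sample if o != 0]  # ignore ties
        if sum(decisive) > 0:
            wins += 1
    return wins / trials
```

With a small n, even a genuinely better ranker frequently fails to come out ahead, which is the sensitivity problem the talk is driving at.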
Interleaving Evaluation
Credit Assignment Alternatives
• Shared top k
– Ignore?
– Lower clicks treated the same
• Not all clicks are created equal
– log(rank)
– 1/rank
– Top
– Bottom
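One way to read the weighting schemes listed above (this interpretation is an assumption; the slide only names them): instead of counting distinct clicks equally, weight each click by its rank, or credit only the topmost or bottommost click. A sketch, where `click_ranks` are the 1-based ranks of a team's clicked documents:

```python
import math

def click_score(click_ranks, scheme="distinct"):
    """Alternative credit-assignment weightings for a team's clicks."""
    if not click_ranks:
        return 0.0
    if scheme == "distinct":        # baseline: every distinct click counts equally
        return float(len(click_ranks))
    if scheme == "log_rank":        # discount clicks deeper in the list, log(rank)
        return sum(1.0 / math.log2(r + 1) for r in click_ranks)
    if scheme == "inverse_rank":    # steeper 1/rank discount
        return sum(1.0 / r for r in click_ranks)
    if scheme == "top":             # credit only the topmost click
        return 1.0 / min(click_ranks)
    if scheme == "bottom":          # credit only the bottommost click
        return 1.0 / max(click_ranks)
    raise ValueError(f"unknown scheme: {scheme}")
```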
Conclusions
• Performance measured by:
– Judgment-based
– Usage-based
• Surprise surprise, a small sample size is stupid
– (check out that alliteration)
• Interleaving is transitive