SIGIR 2013 · Dublin, Ireland · July 30th
On the Measurement of Test Collection Reliability
@julian_urbano University Carlos III of Madrid
Mónica Marrero University Carlos III of Madrid
Diego Martín Technical University of Madrid
Gratefully supported by Student Travel Grant
Is System A More Effective than System B?
[Figure: distribution of Δeffectiveness between systems A and B, ranging from −1 to 1, with the mean difference 𝑑 marked]
Is System A More Effective than System B?
Get a test collection and evaluate
Measure the average difference 𝒅
and conclude which one is better
Samples
Test collections are samples from a larger, possibly infinite, population
Documents, queries and assessors
𝒅 is only an estimate
How reliable is our conclusion?
Reliability vs. Cost
Building reliable collections is easy…
Just use more documents, more queries, more assessors
…but it is prohibitively expensive
Our best bet is to increase query set size
Data-based approach
1. Randomly split the query set
2. Compute indicators of reliability based on those two subsets
3. Extrapolate to larger query sets
…with some variations
Voorhees’98, Zobel’98, Buckley & Voorhees’00, Voorhees & Buckley’02, Sanderson & Zobel’05,
Sakai’07, Voorhees’09
Data-based Reliability Indicators based on results with two collections
Kendall 𝝉 correlation: stability of the ranking of systems
𝝉𝑨𝑷 correlation: adds a top-heaviness component
Absolute sensitivity: minimum absolute 𝒅 such that swaps < 5%
Relative sensitivity: minimum relative 𝒅 such that swaps < 5%
Data-based Reliability Indicators based on results with two collections
Power ratio: statistically significant results
Minor conflict ratio: statistically non-significant swaps
Major conflict ratio: statistically significant swaps
RMSE: differences in 𝒅
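The split-based procedure behind these indicators can be sketched in code. A minimal Python sketch of a random query-set split followed by a Kendall 𝜏 comparison of the two induced system rankings (the code released with the talk is for R; function names like `split_half_tau` are illustrative, not the authors' API):

```python
import random

def kendall_tau(scores_a, scores_b):
    """Kendall tau correlation between the rankings induced by two score lists."""
    n = len(scores_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def split_half_tau(scores, rng):
    """scores: dict system -> list of per-query effectiveness scores.
    Randomly split the query set in two subsets and correlate the
    system rankings obtained with each subset."""
    n_q = len(next(iter(scores.values())))
    queries = list(range(n_q))
    rng.shuffle(queries)
    half_a, half_b = queries[: n_q // 2], queries[n_q // 2:]
    # Mean effectiveness of each system over each query subset
    mean_a = [sum(q_scores[q] for q in half_a) / len(half_a) for q_scores in scores.values()]
    mean_b = [sum(q_scores[q] for q in half_b) / len(half_b) for q_scores in scores.values()]
    return kendall_tau(mean_a, mean_b)
```

Sensitivity and the conflict ratios are computed from the same kind of split, looking at per-pair differences and significance tests instead of the full ranking.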
Generalizability Theory
Directly address variability of scores
G-study: estimate variance components from previous, representative data
D-study: estimate reliability based on the estimated variance components
G-study
𝝈² = 𝝈𝒔² + 𝝈𝒒² + 𝝈𝒔:𝒒²
𝝈𝒔²: system differences, our goal!
𝝈𝒒²: query difficulty
𝝈𝒔:𝒒²: some systems better for some queries
Estimated using Analysis of Variance
From previous data, usually an existing test collection
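For a fully crossed systems × queries design with one observation per cell (e.g. average precision of one system on one query), the variance components follow from the ANOVA mean squares. A minimal Python sketch (the talk's released code is for R; `g_study` and the matrix layout are illustrative assumptions):

```python
import numpy as np

def g_study(scores):
    """Estimate variance components for a crossed systems x queries design.
    scores: 2-D array, rows = systems, columns = queries."""
    n_s, n_q = scores.shape
    grand = scores.mean()
    sys_means = scores.mean(axis=1)
    qry_means = scores.mean(axis=0)
    # Mean squares from a two-way ANOVA with one observation per cell
    ms_s = n_q * ((sys_means - grand) ** 2).sum() / (n_s - 1)
    ms_q = n_s * ((qry_means - grand) ** 2).sum() / (n_q - 1)
    resid = scores - sys_means[:, None] - qry_means[None, :] + grand
    ms_res = (resid ** 2).sum() / ((n_s - 1) * (n_q - 1))
    # Expected mean squares give the variance component estimates
    var_sq = ms_res                      # system-query interaction (plus error)
    var_s = max((ms_s - ms_res) / n_q, 0.0)   # system differences: our goal
    var_q = max((ms_q - ms_res) / n_s, 0.0)   # query difficulty
    return var_s, var_q, var_sq
```

Negative estimates (possible when a component is near zero) are clamped to zero, a common convention in variance component estimation.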
D-study
Relative stability: 𝑬𝝆² = 𝝈𝒔² / (𝝈𝒔² + 𝝈𝒔:𝒒² / 𝒏𝒒′)
Absolute stability: 𝚽 = 𝝈𝒔² / (𝝈𝒔² + (𝝈𝒒² + 𝝈𝒔:𝒒²) / 𝒏𝒒′)
Easy to estimate how many queries we need for a certain stability level
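A sketch of the D-study computations under the two formulas above, again in Python with illustrative names; `queries_needed` simply solves the stability formula for 𝒏𝒒′ given a target level:

```python
import math

def d_study(var_s, var_q, var_sq, n_q):
    """Relative (Erho2) and absolute (Phi) stability for a query set of size n_q."""
    erho2 = var_s / (var_s + var_sq / n_q)
    phi = var_s / (var_s + (var_q + var_sq) / n_q)
    return erho2, phi

def queries_needed(var_s, var_q, var_sq, target=0.95, absolute=False):
    """Smallest n_q' reaching the target stability, from solving the
    D-study equation: n_q' >= target * noise / ((1 - target) * var_s)."""
    noise = (var_q + var_sq) if absolute else var_sq
    return math.ceil(target * noise / ((1.0 - target) * var_s))
```

For example, with 𝝈𝒔² equal to the interaction component and a 0.95 target, the formula asks for about 19 queries for relative stability.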
Generalizability Theory
Proposed by Bodoff’07
Kanoulas & Aslam’09 derive optimal gain & discount in nDCG
TREC Million Query Track
≈80 queries sufficient for stable rankings
≈130 queries for stable absolute scores
In this Paper / Talk
How sensitive is the D-study to the initial data used in the G-study?
How to interpret G-Theory in practice:
why 𝑬𝝆² = 0.95 and 𝚽 = 0.95?
From the above two, review the reliability of >40 TREC test collections
variability of G-theory indicators of reliability
Data
43 TREC collections from TREC-3 to TREC 2011
12 tasks across 10 tracks
Ad Hoc, Web, Novelty, Genomics, Robust, Terabyte, Enterprise, Million
Query, Medical and Microblog
Experiment
Vary number of queries in the G-study from 𝒏𝒒 = 5 to the full set
Use all runs available
Run D-study
Compute 𝑬𝝆² and 𝚽
Compute 𝒏𝒒′ to reach 0.95 stability
200 random trials
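This experiment can be sketched end-to-end on synthetic data (a real run would plug in an actual TREC system × query score matrix; all effect sizes below are made up for illustration):

```python
import numpy as np

def erho2_from_matrix(scores):
    """Erho2 from a systems x queries score matrix (crossed-design ANOVA)."""
    n_s, n_q = scores.shape
    grand = scores.mean()
    sys_means = scores.mean(axis=1)
    qry_means = scores.mean(axis=0)
    ms_s = n_q * ((sys_means - grand) ** 2).sum() / (n_s - 1)
    resid = scores - sys_means[:, None] - qry_means[None, :] + grand
    ms_res = (resid ** 2).sum() / ((n_s - 1) * (n_q - 1))
    var_s = max((ms_s - ms_res) / n_q, 0.0)
    return var_s / (var_s + ms_res / n_q) if var_s > 0 else 0.0

rng = np.random.default_rng(42)
# Synthetic collection: 50 systems x 100 queries with system, query
# and interaction effects (sizes chosen arbitrarily for the sketch)
n_s, n_q = 50, 100
scores = (rng.normal(0.3, 0.05, (n_s, 1))       # system effect
          + rng.normal(0.0, 0.10, (1, n_q))     # query difficulty
          + rng.normal(0.0, 0.05, (n_s, n_q)))  # interaction

# Repeat the G-study on random subsets of 10 queries:
# the estimated stability varies with the sample of queries
estimates = []
for _ in range(200):
    subset = rng.choice(n_q, size=10, replace=False)
    estimates.append(erho2_from_matrix(scores[:, subset]))
print(round(min(estimates), 2), "to", round(max(estimates), 2))
```

The spread between the minimum and maximum estimate across trials is exactly the variability the next slides illustrate.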
Variability due to queries
We may get 𝐸𝜌² = 0.9 or 𝐸𝜌² = 0.3, depending on which 10 queries we use
Experiment (II)
The same, but vary number of systems from 𝒏𝒔 = 𝟓 to full set
Use all queries available
200 random trials
Variability due to systems
We may get 𝐸𝜌² = 0.9 or 𝐸𝜌² = 0.5, depending on which 20 systems we use
Results
G-Theory is very sensitive to the initial data
About 50 queries and 50 systems are needed for differences in 𝑬𝝆² and 𝚽 below 0.1
The number of queries for 𝑬𝝆² = 0.95 may change by orders of magnitude
Microblog 2011 (all 184 systems and 30 queries): 63 to 133 queries needed
Medical 2011 (all 40 systems and 34 queries): 109 to 566 queries needed
Use Confidence Intervals
Bodoff’08: confidence intervals in the G-study
But what about the D-study? Feldt’65 and Arteaga et al.’82
They work reasonably well even when assumptions are violated (Brennan’01)
Example
Account for variability in the initial data
Required number of queries to reach the lower end of the interval
Summary in TREC (that is, the 43 collections we study here)
𝑬𝝆²: mean = 0.88, sd = 0.1; 95% conf. intervals are 0.1 long
𝚽: mean = 0.74, sd = 0.2; 95% conf. intervals are 0.19 long
interpretation of G-Theory indicators of reliability
Experiment
Split the query set in 2 subsets, from 𝒏𝒒 = 10 to half the full set
Use all runs available
Run the D-study
Compute 𝑬𝝆² and 𝚽 and map them onto 𝝉, sensitivity, power, conflicts, etc.
50 random trials
>28,000 datapoints
Example: 𝑬𝝆² → 𝝉
𝑬𝝆² = 0.95 → 𝝉 ≈ 0.85
𝝉 = 0.9 → 𝑬𝝆² ≈ 0.97
Million Query 2007
Million Query 2008
*All mappings in the paper
Future Predictions
Allows us to make more informed decisions within a collection
What about a new collection?
Fit a single model for each mapping with 90% and 95% prediction intervals
Assess whether a larger collection
is really worth the effort
Example: 𝑬𝝆² → 𝝉, with the current collection and a target
*All mappings in the paper
Example: 𝚽 → relative sensitivity
review of TREC collections
Outline
Estimate 𝑬𝝆² and 𝚽, with 95% confidence intervals, using the full query set
Map onto 𝝉, sensitivity, power,
conflicts, etc.
Results within task offer historical perspective since 1994
Example: Ad Hoc 3-8
𝑬𝝆² ∈ [0.86, 0.93] → 𝝉 ∈ [0.65, 0.81]
minor conflicts ∈ [0.6, 8.2]%
major conflicts ∈ [0.02, 1.38]%
Queries to get 𝑬𝝆² = 0.95: [37, 233]
Queries to get 𝚽 = 0.95: [116, 999]
50 queries were used
*All collections and mappings in the paper
Example: Web Ad Hoc
TREC-8 to TREC 2001 (WT2g and WT10g): 𝑬𝝆² ∈ [0.86, 0.93] → 𝝉 ∈ [0.65, 0.81]
Queries to get 𝑬𝝆² = 0.95: [40, 220]
TREC 2009 to TREC 2011 (ClueWeb09): 𝑬𝝆² ∈ [0.80, 0.83] → 𝝉 ∈ [0.53, 0.59]
Queries to get 𝑬𝝆² = 0.95: [107, 438]
50 queries were used
Historical Trend
Decreasing within and across tracks?
Systems getting better for specific problems?
Increasing task-specificity in queries?
summing up
Generalizability Theory
Regarded as a more appropriate, easier-to-use and more powerful tool
to assess test collection reliability
But very sensitive to the initial data used to estimate the variance components
and almost impossible to interpret in practical terms
Sensitivity of G-Theory
About 50 queries and 50 systems are needed for robust estimates
Caution if building a new collection
Can always use confidence intervals
Interpretation of G-Theory
Empirical mapping onto traditional
indicators of reliability like 𝝉 correlation
𝝉 = 0.9 → 𝑬𝝆² ≈ 0.97
𝑬𝝆² = 0.95 → 𝝉 ≈ 0.85
Historical Reliability in TREC
On average, 𝑬𝝆² = 0.88 → 𝝉 ≈ 0.7
Some collections clearly unreliable: Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011
50 queries are not enough for stable rankings; about 200 are needed
Implications
Fixing a minimum number of queries across tracks is unrealistic
Not even across editions of the same task
Need to analyze on a case-by-case basis, while building the collections
to be continued…
Future Work
Study the assessor effect
Study the document-collection effect
Better models to map G-Theory onto data-based indicators
We fitted theoretically correct(-ish) models, but in practice theory does not hold
Methods to reliably measure reliability while building the collection
Source Code Online
Code for the R statistical software
G-study and D-study
Required number of queries
Mapping onto data-based indicators
Confidence intervals
…in two simple steps
G-Theory is too sensitive to the initial data
Questionable with small collections
Compute confidence intervals
Need 𝑬𝝆² ≈ 0.97 for 𝝉 = 0.9
50 queries are not enough for stable rankings
Fixing a minimum number of queries across tasks is unrealistic
Need to analyze on a case-by-case basis