On the Measurement of Test Collection Reliability


DESCRIPTION

The reliability of a test collection is proportional to the number of queries it contains, but building a collection with many queries is expensive, so researchers have to strike a balance between reliability and cost. Previous work on the measurement of test collection reliability relied on data-based approaches that contemplated random what-if scenarios and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative, founded on analysis of variance, that provides reliability indicators based on statistical theory. However, these reliability indicators are hard to interpret in practice because they do not correspond to well-known indicators like Kendall tau correlation. We empirically establish these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators and show that they are extremely dependent on the sample of systems and queries used, so much so that the number of queries required to achieve a certain level of reliability can vary by orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool to measure test collection reliability. Reflecting upon all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.

TRANSCRIPT

SIGIR 2013 · Dublin, Ireland · July 30th (picture by Philip Milne)

On the Measurement of Test Collection Reliability

@julian_urbano University Carlos III of Madrid

Mónica Marrero University Carlos III of Madrid

Diego Martín Technical University of Madrid

Gratefully supported by a Student Travel Grant

Is System A More Effective than System B?

[Figure: Δeffectiveness scale from −1 to 1, with the observed difference 𝑑 near 0]

Is System A More Effective than System B?

Get a test collection and evaluate

Measure the average difference 𝒅 and conclude which system is better
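As a minimal sketch of this procedure in R (the language of the authors' released code; the data here is synthetic and purely illustrative):

```r
# Hypothetical per-query effectiveness scores (e.g. Average Precision)
# for systems A and B over the same 50 queries
scores.a <- runif(50)
scores.b <- runif(50)

# Average difference: the basis for concluding which system is better
d <- mean(scores.a - scores.b)

# A paired test hints at how trustworthy that conclusion is
t.test(scores.a, scores.b, paired = TRUE)
```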

Samples

Test collections are samples from a larger, possibly infinite, population

Documents, queries and assessors

𝒅 is only an estimate

How reliable is our conclusion?

Reliability vs. Cost

Building reliable collections is easy…

Just use more documents, more queries, more assessors

…but it is prohibitively expensive

Our best bet is to increase query set size

Data-based approach

1. Randomly split the query set
2. Compute indicators of reliability based on those two subsets
3. Extrapolate to larger query sets

…with some variations

Voorhees’98, Zobel’98, Buckley & Voorhees’00, Voorhees & Buckley’02, Sanderson & Zobel’05, Sakai’07, Voorhees’09

Data-based Reliability Indicators based on results with two collections

Kendall 𝝉 correlation stability of the ranking of systems

𝝉𝑨𝑷 correlation adds a top-heaviness component

Absolute sensitivity minimum absolute 𝒅 s.t. swaps <5%

Relative sensitivity minimum relative 𝒅 s.t. swaps <5%

Data-based Reliability Indicators based on results with two collections

Power ratio statistically significant results

Minor conflict ratio statistically non-significant swap

Major conflict ratio statistically significant swap

RMSE differences in 𝒅
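A hedged sketch of how the sensitivity indicators can be computed, following the spirit of the Voorhees & Buckley swap method but simplified to a single threshold search (the function name and the unbinned search are my own illustration, not the exact published procedure):

```r
# m1, m2: per-system mean scores from two disjoint query subsets.
# Returns the minimum absolute difference d such that fewer than 5% of the
# system pairs at or above that difference swap between the two subsets.
abs.sensitivity <- function(m1, m2, max.swap = 0.05) {
  pairs <- t(combn(length(m1), 2))       # all system pairs
  d1 <- m1[pairs[, 1]] - m1[pairs[, 2]]  # differences in subset 1
  d2 <- m2[pairs[, 1]] - m2[pairs[, 2]]  # differences in subset 2
  swap <- sign(d1) != sign(d2)           # conclusion reverses across subsets
  for (thr in sort(unique(abs(d1)))) {
    idx <- abs(d1) >= thr
    if (mean(swap[idx]) < max.swap) return(thr)
  }
  NA  # no threshold achieves the target swap rate
}
```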

Generalizability Theory

Directly addresses the variability of scores

G-study: estimate variance components from previous, representative data

D-study: estimate reliability based on the estimated variance components

G-study

$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$

Estimated using Analysis of Variance, from previous data, usually an existing test collection

$\sigma_s^2$: system differences, our goal!

$\sigma_q^2$: query difficulty

$\sigma_{s:q}^2$: system-query interaction, some systems better for some queries
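A minimal G-study sketch in R, estimating the three variance components from a systems × queries score matrix with the standard expected-mean-squares equations of the two-way crossed ANOVA (variable names are illustrative; this is not the authors' released code):

```r
# scores: matrix of effectiveness scores, one row per system, one column per query
g.study <- function(scores) {
  ns <- nrow(scores); nq <- ncol(scores)
  ms.s <- nq * var(rowMeans(scores))  # mean square for systems
  ms.q <- ns * var(colMeans(scores))  # mean square for queries
  ss.tot <- sum((scores - mean(scores))^2)
  # residual mean square: the system:query interaction
  ms.e <- (ss.tot - (ns - 1) * ms.s - (nq - 1) * ms.q) / ((ns - 1) * (nq - 1))
  # expected mean squares yield the variance component estimates
  list(var.s  = (ms.s - ms.e) / nq,  # system differences: our goal
       var.q  = (ms.q - ms.e) / ns,  # query difficulty
       var.sq = ms.e)                # system:query interaction
}
```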

D-study

Relative stability: $E\rho^2 = \dfrac{\sigma_s^2}{\sigma_s^2 + \sigma_{s:q}^2 / n_q'}$

Absolute stability: $\Phi = \dfrac{\sigma_s^2}{\sigma_s^2 + (\sigma_q^2 + \sigma_{s:q}^2) / n_q'}$

Easy to estimate how many queries we need for a certain stability level
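Given those estimates, the D-study is a direct computation; a sketch continuing from the hypothetical g.study above, including the closed-form number of queries needed for a target 𝐸𝜌²:

```r
# D-study: project stability onto a new collection with nq2 queries
d.study <- function(g, nq2) {
  list(Erho2 = g$var.s / (g$var.s + g$var.sq / nq2),
       Phi   = g$var.s / (g$var.s + (g$var.q + g$var.sq) / nq2))
}

# Queries needed to reach a target relative stability (solve Erho2 = target)
queries.for.Erho2 <- function(g, target = 0.95)
  ceiling(g$var.sq / g$var.s * target / (1 - target))
```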

Generalizability Theory

Proposed by Bodoff’07

Kanoulas & Aslam’09 derive optimal gain & discount in nDCG

TREC Million Query Track

≈80 queries sufficient for stable rankings; ≈130 queries for stable absolute scores

In this Paper / Talk

How sensitive is the D-study to the initial data used in the G-study?

How to interpret G-Theory in practice: what do 𝐸𝜌² = 0.95 and Φ = 0.95 actually mean?

From the above two, review the reliability of >40 TREC test collections

variability of G-theory indicators of reliability

Data

43 TREC collections from TREC-3 to TREC 2011

12 tasks across 10 tracks

Ad Hoc, Web, Novelty, Genomics, Robust, Terabyte, Enterprise, Million Query, Medical and Microblog

Experiment

Vary the number of queries in the G-study from 𝑛𝑞 = 5 to the full set

Use all runs available

Run the D-study

Compute 𝐸𝜌² and Φ, and the 𝑛𝑞′ needed to reach 0.95 stability

200 random trials
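The experiment itself is a plain resampling loop; a sketch under the same hypothetical functions as above (Experiment II below follows the same scheme, subsampling rows, i.e. systems, instead of columns):

```r
# How variable are the D-study estimates across random query subsets?
set.seed(1)
trials <- 200
for (nq in seq(5, ncol(scores), by = 5)) {
  Erho2 <- replicate(trials, {
    sub <- scores[, sample(ncol(scores), nq)]  # random query subset
    d.study(g.study(sub), nq)$Erho2
  })
  cat(nq, "queries: Erho2 ranges", round(min(Erho2), 2),
      "to", round(max(Erho2), 2), "\n")
}
```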

Variability due to queries

We may get 𝐸𝜌² = 0.9 or 𝐸𝜌² = 0.3, depending on which 10 queries we use

Experiment (II)

The same, but vary the number of systems from 𝑛𝑠 = 5 to the full set

Use all queries available

200 random trials

Variability due to systems

We may get 𝐸𝜌² = 0.9 or 𝐸𝜌² = 0.5, depending on which 20 systems we use

Results

G-Theory is very sensitive to the initial data: about 50 queries and 50 systems are needed to keep differences in 𝐸𝜌² and Φ below 0.1

The number of queries for 𝐸𝜌² = 0.95 may change by orders of magnitude

Microblog 2011 (all 184 systems and 30 queries): need 63 to 133 queries

Medical 2011 (all 34 queries and 40 systems): need 109 to 566 queries

Use Confidence Intervals

Bodoff’08: confidence intervals in the G-study

But what about the D-study? Feldt’65 and Arteaga et al.’82

They work reasonably well even when assumptions are violated (Brennan’01)
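For 𝐸𝜌², which has the form of an intraclass correlation, Feldt's interval follows from F quantiles; the sketch below uses one common textbook formulation of Feldt’65 (with systems playing the role of subjects and queries the role of items), which may differ in detail from the exact variant used in the paper:

```r
# Feldt-style confidence interval for a reliability coefficient such as Erho2,
# from ns systems and nq queries (assumes the usual ANOVA assumptions)
feldt.ci <- function(Erho2, ns, nq, level = 0.95) {
  a <- 1 - level
  df1 <- ns - 1
  df2 <- (ns - 1) * (nq - 1)
  c(lower = 1 - (1 - Erho2) * qf(1 - a / 2, df1, df2),
    upper = 1 - (1 - Erho2) * qf(a / 2, df1, df2))
}

feldt.ci(0.9, ns = 50, nq = 50)  # 95% interval around the 0.9 point estimate
```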

Example

[Figure: D-study estimates with 95% confidence intervals]

Account for variability in the initial data

Required number of queries to reach the lower end of the interval

Summary in TREC (that is, the 43 collections we study here)

𝐸𝜌²: mean = 0.88, sd = 0.1; 95% conf. intervals are 0.1 long

Φ: mean = 0.74, sd = 0.2; 95% conf. intervals are 0.19 long

interpretation of G-Theory indicators of reliability

Experiment

Split the query set into 2 subsets, from 𝑛𝑞 = 10 to half the full set

Use all runs available

Run the D-study

Compute 𝐸𝜌² and Φ and map onto 𝜏, sensitivity, power, conflicts, etc.

50 random trials

>28,000 data points
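Each data point pairs a G-Theory estimate from one query subset with a data-based indicator computed between the two subsets; the 𝜏 leg of one trial might look like this sketch (again reusing the hypothetical g.study and d.study from above):

```r
# One trial: split queries in two, D-study on one half, tau between halves
idx <- sample(ncol(scores), ncol(scores) %/% 2)
m1 <- rowMeans(scores[, idx])    # per-system mean scores, subset 1
m2 <- rowMeans(scores[, -idx])   # per-system mean scores, subset 2

Erho2 <- d.study(g.study(scores[, idx]), length(idx))$Erho2
tau <- cor(m1, m2, method = "kendall")  # stability of the system ranking
c(Erho2 = Erho2, tau = tau)
```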

Example: 𝐸𝜌² → 𝜏

𝐸𝜌² = 0.95 → 𝜏 ≈ 0.85

𝜏 = 0.9 → 𝐸𝜌² ≈ 0.97

[Figure: observed mappings for Million Query 2007 and Million Query 2008]

*All mappings in the paper

Future Predictions

Allows us to make more informed decisions within a collection

What about a new collection?

Fit a single model for each mapping with 90% and 95% prediction intervals

Assess whether a larger collection is really worth the effort
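Fitting a mapping with prediction intervals is standard in R; a sketch assuming a data frame of pooled (Erho2, tau) points (the paper fits theoretically motivated models; the plain linear fit here is only for illustration):

```r
# mapping: data frame with one row per (Erho2, tau) data point
fit <- lm(tau ~ Erho2, data = mapping)

# Predicted tau with a 95% prediction interval for a new collection
# whose D-study yields Erho2 = 0.95
predict(fit, newdata = data.frame(Erho2 = 0.95),
        interval = "prediction", level = 0.95)
```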

Example: 𝐸𝜌² → 𝜏

[Figure: fitted mapping with prediction intervals, marking the current collection and the target]

*All mappings in the paper

Example: Φ → relative sensitivity

review of TREC collections

Outline

Estimate 𝐸𝜌² and Φ, with 95% confidence intervals, using the full query set

Map onto 𝜏, sensitivity, power, conflicts, etc.

Results within each task offer a historical perspective since 1994

Example: Ad Hoc 3-8

𝐸𝜌² ∈ [0.86, 0.93] → 𝜏 ∈ [0.65, 0.81]

minor conflicts ∈ [0.6, 8.2]%

major conflicts ∈ [0.02, 1.38]%

Queries to get 𝐸𝜌² = 0.95: [37, 233]

Queries to get Φ = 0.95: [116, 999]

50 queries were used

*All collections and mappings in the paper

Example: Web Ad Hoc

TREC-8 to TREC 2001: WT2g and WT10g

𝐸𝜌² ∈ [0.86, 0.93] → 𝜏 ∈ [0.65, 0.81]

Queries to get 𝐸𝜌² = 0.95: [40, 220]

TREC 2009 to TREC 2011: ClueWeb09

𝐸𝜌² ∈ [0.80, 0.83] → 𝜏 ∈ [0.53, 0.59]

Queries to get 𝐸𝜌² = 0.95: [107, 438]

50 queries were used

Historical Trend

Decreasing within and across tracks?

Systems getting better for specific problems?

Increasing task-specificity in queries?

summing up

Generalizability Theory

Regarded as a more appropriate, easy-to-use and powerful tool to assess test collection reliability

Very sensitive to the initial data used to estimate variance components

Almost impossible to interpret in practical terms

Sensitivity of G-Theory

About 50 queries and 50 systems are needed for robust estimates

Caution if building a new collection

Can always use confidence intervals

Interpretation of G-Theory

Empirical mapping onto traditional indicators of reliability like 𝜏 correlation

𝝉 = 𝟎. 𝟗 → 𝑬𝝆𝟐 ≈ 𝟎. 𝟗𝟕

𝑬𝝆𝟐 = 𝟎. 𝟗𝟓 → 𝝉 ≈ 𝟎. 𝟖𝟓

Historical Reliability in TREC

On average, 𝑬𝝆𝟐 = 𝟎. 𝟖𝟖 → 𝝉 ≈ 𝟎. 𝟕

Some collections are clearly unreliable: Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011

50 queries are not enough for stable rankings; about 200 are needed

Implications

Fixing a minimum number of queries across tracks is unrealistic, not even across editions of the same task

Need to analyze on a case-by-case basis, while building the collections

Need to analyze on a case-by-case basis, while building the collections

to be continued…

Future Work

Study the assessor effect

Study the document-collection effect

Better models to map G-Theory onto data-based indicators: we fitted theoretically correct(-ish) models, but in practice the theory does not hold

Methods to reliably measure reliability while building the collection

Source Code Online

Code for the R stats software

G-study and D-study

Required number of queries

Mapping onto data-based indicators

Confidence intervals

…in two simple steps

G-Theory is too sensitive to the initial data; questionable with small collections

Compute confidence intervals

Need 𝐸𝜌² ≈ 0.97 for 𝜏 = 0.9

50 queries are not enough for stable rankings

Fixing a minimum number of queries across tasks is unrealistic

Need to analyze on a case-by-case basis
