On the Measurement of Test Collection Reliability


DESCRIPTION

The reliability of a test collection is proportional to the number of queries it contains, but building a collection with many queries is expensive, so researchers have to strike a balance between reliability and cost. Previous work on the measurement of test collection reliability relied on data-based approaches that contemplated random what-if scenarios and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative, founded on analysis of variance, that provides reliability indicators based on statistical theory. However, these reliability indicators are hard to interpret in practice because they do not correspond to well-known indicators like Kendall tau correlation. We empirically establish these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators and show that they are extremely dependent on the sample of systems and queries used, so much so that the number of queries required to achieve a certain level of reliability can vary by orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool to measure test collection reliability. Reflecting upon all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.

TRANSCRIPT

SIGIR 2013 · Dublin, Ireland · July 30th (picture by Philip Milne)

On the Measurement of Test Collection Reliability

@julian_urbano University Carlos III of Madrid

Mónica Marrero University Carlos III of Madrid

Diego Martín Technical University of Madrid

Gratefully supported by a Student Travel Grant

Is System A More Effective than System B?

[Figure: Δeffectiveness scale from −1 to 1, with the observed difference 𝑑 near 0]

Is System A More Effective than System B?

Get a test collection and evaluate

Measure the average difference 𝒅 and conclude which system is better
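As a minimal sketch of this procedure in R (the language of the authors' released code; the data here is synthetic and purely illustrative):

```r
# Hypothetical per-query effectiveness scores (e.g. Average Precision)
# for systems A and B over the same 50 queries
scores.a <- runif(50)
scores.b <- runif(50)

# Average difference: the basis for concluding which system is better
d <- mean(scores.a - scores.b)

# A paired test hints at how trustworthy that conclusion is
t.test(scores.a, scores.b, paired = TRUE)
```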

Samples

Test collections are samples from a larger, possibly infinite, population

Documents, queries and assessors

𝒅 is only an estimate

How reliable is our conclusion?

Reliability vs. Cost

Building reliable collections is easy…

Just use more documents, more queries, more assessors

…but it is prohibitively expensive

Our best bet is to increase query set size

Data-based approach

1. Randomly split the query set
2. Compute indicators of reliability based on those two subsets
3. Extrapolate to larger query sets

…with some variations

Voorhees’98, Zobel’98, Buckley & Voorhees’00, Voorhees & Buckley’02, Sanderson & Zobel’05, Sakai’07, Voorhees’09

Data-based Reliability Indicators based on results with two collections

Kendall 𝝉 correlation stability of the ranking of systems

𝝉𝑨𝑷 correlation adds a top-heaviness component

Absolute sensitivity minimum absolute 𝒅 s.t. swaps <5%

Relative sensitivity minimum relative 𝒅 s.t. swaps <5%

Data-based Reliability Indicators based on results with two collections

Power ratio statistically significant results

Minor conflict ratio statistically non-significant swap

Major conflict ratio statistically significant swap

RMSE differences in 𝒅
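A hedged sketch of how the sensitivity indicators can be computed, following the spirit of the Voorhees & Buckley swap method but simplified to a single threshold search (the function name and the unbinned search are my own illustration, not the exact published procedure):

```r
# m1, m2: per-system mean scores from two disjoint query subsets.
# Returns the minimum absolute difference d such that fewer than 5% of the
# system pairs at or above that difference swap between the two subsets.
abs.sensitivity <- function(m1, m2, max.swap = 0.05) {
  pairs <- t(combn(length(m1), 2))       # all system pairs
  d1 <- m1[pairs[, 1]] - m1[pairs[, 2]]  # differences in subset 1
  d2 <- m2[pairs[, 1]] - m2[pairs[, 2]]  # differences in subset 2
  swap <- sign(d1) != sign(d2)           # conclusion reverses across subsets
  for (thr in sort(unique(abs(d1)))) {
    idx <- abs(d1) >= thr
    if (mean(swap[idx]) < max.swap) return(thr)
  }
  NA  # no threshold achieves the target swap rate
}
```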

Generalizability Theory

Directly addresses the variability of scores

G-study: estimate variance components from previous, representative data

D-study: estimate reliability based on the estimated variance components

G-study

$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$

Estimated using Analysis of Variance, from previous data, usually an existing test collection

$\sigma_s^2$: system differences, our goal!

$\sigma_q^2$: query difficulty

$\sigma_{s:q}^2$: system-query interaction, some systems better for some queries
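A minimal G-study sketch in R, estimating the three variance components from a systems × queries score matrix with the standard expected-mean-squares equations of the two-way crossed ANOVA (variable names are illustrative; this is not the authors' released code):

```r
# scores: matrix of effectiveness scores, one row per system, one column per query
g.study <- function(scores) {
  ns <- nrow(scores); nq <- ncol(scores)
  ms.s <- nq * var(rowMeans(scores))  # mean square for systems
  ms.q <- ns * var(colMeans(scores))  # mean square for queries
  ss.tot <- sum((scores - mean(scores))^2)
  # residual mean square: the system:query interaction
  ms.e <- (ss.tot - (ns - 1) * ms.s - (nq - 1) * ms.q) / ((ns - 1) * (nq - 1))
  # expected mean squares yield the variance component estimates
  list(var.s  = (ms.s - ms.e) / nq,  # system differences: our goal
       var.q  = (ms.q - ms.e) / ns,  # query difficulty
       var.sq = ms.e)                # system:query interaction
}
```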

D-study

Relative stability: $E\rho^2 = \dfrac{\sigma_s^2}{\sigma_s^2 + \sigma_{s:q}^2 / n_q'}$

Absolute stability: $\Phi = \dfrac{\sigma_s^2}{\sigma_s^2 + (\sigma_q^2 + \sigma_{s:q}^2) / n_q'}$

Easy to estimate how many queries we need for a certain stability level
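Given those estimates, the D-study is a direct computation; a sketch continuing from the hypothetical g.study above, including the closed-form number of queries needed for a target 𝐸𝜌²:

```r
# D-study: project stability onto a new collection with nq2 queries
d.study <- function(g, nq2) {
  list(Erho2 = g$var.s / (g$var.s + g$var.sq / nq2),
       Phi   = g$var.s / (g$var.s + (g$var.q + g$var.sq) / nq2))
}

# Queries needed to reach a target relative stability (solve Erho2 = target)
queries.for.Erho2 <- function(g, target = 0.95)
  ceiling(g$var.sq / g$var.s * target / (1 - target))
```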

Generalizability Theory

Proposed by Bodoff’07

Kanoulas & Aslam’09 derive optimal gain & discount in nDCG

TREC Million Query Track

≈80 queries sufficient for stable rankings; ≈130 queries for stable absolute scores

In this Paper / Talk

How sensitive is the D-study to the initial data used in the G-study?

How to interpret G-Theory in practice: what do 𝐸𝜌² = 0.95 and Φ = 0.95 actually mean?

From the above two, review the reliability of >40 TREC test collections

variability of G-theory indicators of reliability

Data

43 TREC collections from TREC-3 to TREC 2011

12 tasks across 10 tracks

Ad Hoc, Web, Novelty, Genomics, Robust, Terabyte, Enterprise, Million Query, Medical and Microblog

Experiment

Vary the number of queries in the G-study from 𝑛𝑞 = 5 to the full set

Use all runs available

Run the D-study

Compute 𝐸𝜌² and Φ, and the 𝑛𝑞′ needed to reach 0.95 stability

200 random trials
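The experiment itself is a plain resampling loop; a sketch under the same hypothetical functions as above (Experiment II below follows the same scheme, subsampling rows, i.e. systems, instead of columns):

```r
# How variable are the D-study estimates across random query subsets?
set.seed(1)
trials <- 200
for (nq in seq(5, ncol(scores), by = 5)) {
  Erho2 <- replicate(trials, {
    sub <- scores[, sample(ncol(scores), nq)]  # random query subset
    d.study(g.study(sub), nq)$Erho2
  })
  cat(nq, "queries: Erho2 ranges", round(min(Erho2), 2),
      "to", round(max(Erho2), 2), "\n")
}
```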

Variability due to queries

We may get 𝐸𝜌² = 0.9 or 𝐸𝜌² = 0.3, depending on which 10 queries we use

Experiment (II)

The same, but vary the number of systems from 𝑛𝑠 = 5 to the full set

Use all queries available

200 random trials

Variability due to systems

We may get 𝐸𝜌² = 0.9 or 𝐸𝜌² = 0.5, depending on which 20 systems we use

Results

G-Theory is very sensitive to the initial data: about 50 queries and 50 systems are needed to keep differences in 𝐸𝜌² and Φ below 0.1

The number of queries for 𝐸𝜌² = 0.95 may change by orders of magnitude

Microblog 2011 (all 184 systems and 30 queries): need 63 to 133 queries

Medical 2011 (all 34 queries and 40 systems): need 109 to 566 queries

Use Confidence Intervals

Bodoff’08: confidence intervals in the G-study

But what about the D-study? Feldt’65 and Arteaga et al.’82

They work reasonably well even when assumptions are violated (Brennan’01)
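For 𝐸𝜌², which has the form of an intraclass correlation, Feldt's interval follows from F quantiles; the sketch below uses one common textbook formulation of Feldt’65 (with systems playing the role of subjects and queries the role of items), which may differ in detail from the exact variant used in the paper:

```r
# Feldt-style confidence interval for a reliability coefficient such as Erho2,
# from ns systems and nq queries (assumes the usual ANOVA assumptions)
feldt.ci <- function(Erho2, ns, nq, level = 0.95) {
  a <- 1 - level
  df1 <- ns - 1
  df2 <- (ns - 1) * (nq - 1)
  c(lower = 1 - (1 - Erho2) * qf(1 - a / 2, df1, df2),
    upper = 1 - (1 - Erho2) * qf(a / 2, df1, df2))
}

feldt.ci(0.9, ns = 50, nq = 50)  # 95% interval around the 0.9 point estimate
```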

Example

[Figure: D-study estimates with 95% confidence intervals]

Account for variability in the initial data

Required number of queries to reach the lower end of the interval

Summary in TREC (that is, the 43 collections we study here)

𝐸𝜌²: mean = 0.88, sd = 0.1; 95% conf. intervals are 0.1 long

Φ: mean = 0.74, sd = 0.2; 95% conf. intervals are 0.19 long

interpretation of G-Theory indicators of reliability

Experiment

Split the query set into 2 subsets, from 𝑛𝑞 = 10 to half the full set

Use all runs available

Run the D-study

Compute 𝐸𝜌² and Φ and map onto 𝜏, sensitivity, power, conflicts, etc.

50 random trials

>28,000 data points
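Each data point pairs a G-Theory estimate from one query subset with a data-based indicator computed between the two subsets; the 𝜏 leg of one trial might look like this sketch (again reusing the hypothetical g.study and d.study from above):

```r
# One trial: split queries in two, D-study on one half, tau between halves
idx <- sample(ncol(scores), ncol(scores) %/% 2)
m1 <- rowMeans(scores[, idx])    # per-system mean scores, subset 1
m2 <- rowMeans(scores[, -idx])   # per-system mean scores, subset 2

Erho2 <- d.study(g.study(scores[, idx]), length(idx))$Erho2
tau <- cor(m1, m2, method = "kendall")  # stability of the system ranking
c(Erho2 = Erho2, tau = tau)
```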

Example: 𝐸𝜌² → 𝜏

𝐸𝜌² = 0.95 → 𝜏 ≈ 0.85

𝜏 = 0.9 → 𝐸𝜌² ≈ 0.97

[Figure: observed mappings for Million Query 2007 and Million Query 2008]

*All mappings in the paper

Future Predictions

Allows us to make more informed decisions within a collection

What about a new collection?

Fit a single model for each mapping with 90% and 95% prediction intervals

Assess whether a larger collection is really worth the effort
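Fitting a mapping with prediction intervals is standard in R; a sketch assuming a data frame of pooled (Erho2, tau) points (the paper fits theoretically motivated models; the plain linear fit here is only for illustration):

```r
# mapping: data frame with one row per (Erho2, tau) data point
fit <- lm(tau ~ Erho2, data = mapping)

# Predicted tau with a 95% prediction interval for a new collection
# whose D-study yields Erho2 = 0.95
predict(fit, newdata = data.frame(Erho2 = 0.95),
        interval = "prediction", level = 0.95)
```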

Example: 𝐸𝜌² → 𝜏

[Figure: fitted mapping with prediction intervals, marking the current collection and the target]

*All mappings in the paper

Example: Φ → relative sensitivity

review of TREC collections

Outline

Estimate 𝐸𝜌² and Φ, with 95% confidence intervals, using the full query set

Map onto 𝜏, sensitivity, power, conflicts, etc.

Results within each task offer a historical perspective since 1994

Example: Ad Hoc 3-8

𝐸𝜌² ∈ [0.86, 0.93] → 𝜏 ∈ [0.65, 0.81]

minor conflicts ∈ [0.6, 8.2]%

major conflicts ∈ [0.02, 1.38]%

Queries to get 𝐸𝜌² = 0.95: [37, 233]

Queries to get Φ = 0.95: [116, 999]

50 queries were used

*All collections and mappings in the paper

Example: Web Ad Hoc

TREC-8 to TREC 2001: WT2g and WT10g

𝐸𝜌² ∈ [0.86, 0.93] → 𝜏 ∈ [0.65, 0.81]

Queries to get 𝐸𝜌² = 0.95: [40, 220]

TREC 2009 to TREC 2011: ClueWeb09

𝐸𝜌² ∈ [0.80, 0.83] → 𝜏 ∈ [0.53, 0.59]

Queries to get 𝐸𝜌² = 0.95: [107, 438]

50 queries were used

Historical Trend

Decreasing within and across tracks?

Systems getting better for specific problems?

Increasing task-specificity in queries?

summing up

Generalizability Theory

Regarded as a more appropriate, easy-to-use and powerful tool to assess test collection reliability

Very sensitive to the initial data used to estimate variance components

Almost impossible to interpret in practical terms

Sensitivity of G-Theory

About 50 queries and 50 systems are needed for robust estimates

Caution if building a new collection

Can always use confidence intervals

Interpretation of G-Theory

Empirical mapping onto traditional indicators of reliability like 𝜏 correlation

𝝉 = 𝟎. 𝟗 → 𝑬𝝆𝟐 ≈ 𝟎. 𝟗𝟕

𝑬𝝆𝟐 = 𝟎. 𝟗𝟓 → 𝝉 ≈ 𝟎. 𝟖𝟓

Historical Reliability in TREC

On average, 𝑬𝝆𝟐 = 𝟎. 𝟖𝟖 → 𝝉 ≈ 𝟎. 𝟕

Some collections are clearly unreliable: Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011

50 queries are not enough for stable rankings; about 200 are needed

Implications

Fixing a minimum number of queries across tracks is unrealistic, not even across editions of the same task

Need to analyze on a case-by-case basis, while building the collections

Need to analyze on a case-by-case basis, while building the collections

to be continued…

Future Work

Study the assessor effect

Study the document-collection effect

Better models to map G-Theory onto data-based indicators: we fitted theoretically correct(-ish) models, but in practice the theory does not hold

Methods to reliably measure reliability while building the collection

Source Code Online

Code for the R stats software

G-study and D-study

Required number of queries

Mapping onto data-based indicators

Confidence intervals

…in two simple steps

G-Theory is too sensitive to the initial data; questionable with small collections

Compute confidence intervals

Need 𝐸𝜌² ≈ 0.97 for 𝜏 = 0.9

50 queries are not enough for stable rankings

Fixing a minimum number of queries across tasks is unrealistic

Need to analyze on a case-by-case basis
