a new measure of retrieval effectiveness (or: what’s wrong with precision and recall) stefano...

Post on 18-Dec-2015


A new measure of retrieval effectiveness (Or: What’s wrong with precision and recall)

Stefano Mizzaro

Department of Mathematics and Computer Science
University of Udine

mizzaro@dimi.uniud.it
http://www.dimi.uniud.it/~mizzaro

S. Mizzaro - A new measure of retrieval effectiveness (Or: What's wrong...) 2/16

Outline

Introduction: Measures of retrieval effectiveness... motivation for...

...a new measure: Average Distance Measure (ADM)

Discussion
– Theoretical and practical adequacy of ADM
– ADM vs. precision and recall
– Problems with P & R

Conclusions and future work


From binary to continuous relevance & retrieval

[Figure: the documents database partitioned by retrieval and by relevance. Binary view: Not retrieved / Retrieved and Not relevant / Relevant [Salton & McGill, 84]. Continuous view: “Less” to “More” retrieved and “Less” to “More” relevant.]


Continuous relevance & retrieval

[Figure: URE-SRE plane, both axes running from 0 to 1.0 with 0.5 marked. The SRE axis goes from “less” to “more” retrieved, the URE axis from “less” to “more” relevant.]

• SRE = System Relevance Estimate (aka RSV)

• URE = User Relevance Estimate


Thresholds on URE & SRE: why?

[Figure: URE-SRE plane with 0.5 marked on both axes, split into four regions: Retrieved & relevant? Nonretrieved & relevant? Nonretrieved & nonrelevant? Retrieved & nonrelevant?]

P = RetRel / (RetRel + RetNRel)
R = RetRel / (RetRel + NRetRel)

... and historical reasons
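The thresholding step behind P and R can be sketched as follows. This is a minimal illustration rather than the author's procedure: the function name, the default thresholds of 0.5, the "at or above the threshold" convention, and the sample scores are all assumptions.

```python
from typing import Sequence, Tuple


def precision_recall(
    ure: Sequence[float],
    sre: Sequence[float],
    ure_threshold: float = 0.5,
    sre_threshold: float = 0.5,
) -> Tuple[float, float]:
    """Binarize continuous URE/SRE scores, then compute (P, R)."""
    if len(ure) != len(sre):
        raise ValueError("ure and sre must have the same length")

    # Assumption: a score at or above its threshold counts as relevant / retrieved.
    relevant = [u >= ure_threshold for u in ure]
    retrieved = [s >= sre_threshold for s in sre]

    ret_rel = sum(r and t for r, t in zip(relevant, retrieved))
    ret_nrel = sum((not r) and t for r, t in zip(relevant, retrieved))
    nret_rel = sum(r and (not t) for r, t in zip(relevant, retrieved))

    # Assumption: an empty denominator yields 0.0 instead of raising.
    p = ret_rel / (ret_rel + ret_nrel) if (ret_rel + ret_nrel) else 0.0
    r = ret_rel / (ret_rel + nret_rel) if (ret_rel + nret_rel) else 0.0
    return p, r


if __name__ == "__main__":
    # Hypothetical scores for four documents.
    ure = [0.9, 0.7, 0.2, 0.6]
    sre = [0.8, 0.4, 0.6, 0.9]
    print(precision_recall(ure, sre))
```

Once the scores are binarized, how far a document lies from the threshold no longer affects P or R.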


Average Distance Measure (ADM)

SRE: $\mathrm{SRE}_q(d_i)$

URE: $\mathrm{URE}_q(d_i)$

ADM = average “distance” between URE and SRE values

$$\mathrm{ADM}_q = 1 - \frac{\sum_{d_i \in D} \left| \mathrm{SRE}_q(d_i) - \mathrm{URE}_q(d_i) \right|}{|D|}$$
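A minimal sketch of the computation, reading ADM for a query as one minus the mean absolute difference between the SRE and URE values over the documents in D. The function name `adm` and the list-based inputs are assumptions made for illustration.

```python
from typing import Sequence


def adm(ure: Sequence[float], sre: Sequence[float]) -> float:
    """Average Distance Measure for one query.

    ure[i] and sre[i] are the user and system relevance estimates
    (both in [0, 1]) for the same document d_i.
    """
    if len(ure) != len(sre):
        raise ValueError("ure and sre must have the same length")
    if not ure:
        raise ValueError("need at least one document")

    total_distance = sum(abs(s - u) for u, s in zip(ure, sre))
    return 1.0 - total_distance / len(ure)


if __name__ == "__main__":
    print(adm([1.0, 0.0, 0.5], [1.0, 0.0, 0.5]))  # identical estimates
    print(adm([1.0, 0.0], [0.0, 1.0]))            # opposite estimates
```

Identical estimates give the maximum value of 1, and completely opposite estimates give 0.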


ADM: graphical representation

[Figure: URE-SRE plane, both axes running from 0 to 1.0, with the “Exactly evaluated” documents marked.]


ADM: An example

[Figure: the example documents plotted in the URE-SRE plane (two panels).]

Docs.   d1    d2    d3    ADM
IRS1    0.9   0.5   0.2   0.9
IRS2    1.0   0.6   0.3   0.8
IRS3    0.8   0.4   1.0   0.7
URE     0.8   0.4   0.1
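As a check on the ADM column, the three values can be recomputed by hand. This assumes that ADM is one minus the mean absolute difference between SRE and URE, and that the flattened table reads URE = (0.8, 0.4, 0.1) for (d1, d2, d3), with SRE values (0.9, 0.5, 0.2) for IRS1, (1.0, 0.6, 0.3) for IRS2 and (0.8, 0.4, 1.0) for IRS3.

```latex
\begin{align*}
\mathrm{ADM}(\mathrm{IRS1}) &= 1 - \frac{|0.9-0.8| + |0.5-0.4| + |0.2-0.1|}{3} = 1 - \frac{0.3}{3} = 0.9 \\
\mathrm{ADM}(\mathrm{IRS2}) &= 1 - \frac{|1.0-0.8| + |0.6-0.4| + |0.3-0.1|}{3} = 1 - \frac{0.6}{3} = 0.8 \\
\mathrm{ADM}(\mathrm{IRS3}) &= 1 - \frac{|0.8-0.8| + |0.4-0.4| + |1.0-0.1|}{3} = 1 - \frac{0.9}{3} = 0.7
\end{align*}
```

Under this reading, IRS3 scores two documents exactly but misjudges d3 badly, and so ends up below both IRS1 and IRS2, which are slightly off on every document.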


Adequacy of ADM

One single number

Allows complete ordering of different performances

...

ADM vs. P & R
– No hyper-sensitiveness to small variations close to borders
– No lack of sensitiveness to big variations inside “equivalence” regions
– Wrong thresholds


Hyper-sensitiveness: Three very similar IRSs

[Figure: URE-SRE plane, both axes running from 0 to 1.0, with 0.49 and 0.5 marked on each axis.]

        P      R     E      ADM
IRS1    0.67   1     0.84   0.83
IRS2    1      0.5   0.75   0.83
IRS3    0.5    0.5   0.5    0.826

P, R, E: unstable
ADM: stable
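The effect can be reproduced with a small sketch. The data below are hypothetical, not the slide's: two documents sit just on either side of an assumed 0.5 threshold, and the three systems differ only by moving those two scores by 0.02. The helper names and the "at or above the threshold" convention are assumptions as well.

```python
from typing import List, Tuple

THRESHOLD = 0.5  # assumed cut-off on both URE and SRE


def precision_recall(ure: List[float], sre: List[float]) -> Tuple[float, float]:
    """P and R after binarizing both estimates at THRESHOLD."""
    rel = [u >= THRESHOLD for u in ure]
    ret = [s >= THRESHOLD for s in sre]
    ret_rel = sum(a and b for a, b in zip(rel, ret))
    n_ret, n_rel = sum(ret), sum(rel)
    return (ret_rel / n_ret if n_ret else 0.0,
            ret_rel / n_rel if n_rel else 0.0)


def adm(ure: List[float], sre: List[float]) -> float:
    """One minus the mean absolute difference between SRE and URE."""
    return 1.0 - sum(abs(s - u) for u, s in zip(ure, sre)) / len(ure)


# Hypothetical data: d2 and d3 lie just on either side of the threshold.
URE = [0.9, 0.51, 0.49, 0.1]
SYSTEMS = {
    "IRS1": [0.9, 0.51, 0.51, 0.1],  # d3 nudged above the threshold
    "IRS2": [0.9, 0.49, 0.49, 0.1],  # d2 nudged below the threshold
    "IRS3": [0.9, 0.49, 0.51, 0.1],  # both nudged across
}

if __name__ == "__main__":
    for name, sre in SYSTEMS.items():
        p, r = precision_recall(URE, sre)
        print(f"{name}: P={p:.2f} R={r:.2f} ADM={adm(URE, sre):.3f}")
```

With these made-up scores, P and R swing between 0.5 and 1 while ADM moves by less than 0.01. The pattern matches the table's, although the ADM values differ because the underlying data differ.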


Lack of sensitiveness: two very different IRSs

[Figure: URE-SRE plane, both axes running from 0 to 1.0 with 0.5 marked.]

        P   R   E   ADM
IRS1    1   1   1   1
IRS2    1   1   1   0.5

P, R, E: stable
ADM: unstable
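One hypothetical configuration that produces this pattern, worked by hand. It assumes thresholds of 0.5 on both axes and counts a score at or above the threshold as relevant or retrieved. Take two documents with URE values (1.0, 0.5). IRS1 assigns exactly the URE values as SRE, so it gets P = R = 1 and ADM = 1. IRS2 assigns SRE values (0.5, 1.0), so both documents are still retrieved and relevant:

```latex
\begin{align*}
P(\mathrm{IRS2}) &= \frac{2}{2 + 0} = 1, \qquad R(\mathrm{IRS2}) = \frac{2}{2 + 0} = 1 \\
\mathrm{ADM}(\mathrm{IRS2}) &= 1 - \frac{|0.5 - 1.0| + |1.0 - 0.5|}{2} = 1 - \frac{1.0}{2} = 0.5
\end{align*}
```

P and R cannot tell the two systems apart because every document stays inside the same region, while ADM registers the large shifts within that region.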


Again on the thresholds...

[Figure: URE-SRE plane with 0.5 marked on both axes, split into four regions: Retrieved & relevant? Nonretrieved & relevant? Nonretrieved & nonrelevant? Retrieved & nonrelevant?]


The “right” thresholds

[Figure: URE-SRE plane, both axes running from 0 to 1.0 with 0.5 marked, divided into OverEvaluated, UnderEvaluated and Correctly Evaluated regions.]

E = CE / (OE + UE)


ADM in practice

How to get URE values? Either
– asking the judge(s) to directly express continuous relevance judgments (feasible, literature evidence), or
– averaging dichotomous/discrete relevance judgments

UREs for all the documents in the database? Impossible!!
– Sampling
– (that takes place with P & R too, anyway)
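Both practical steps can be sketched together: averaging several judges' 0/1 judgments into a URE per document, then computing ADM over a sample of the judged documents. The function names, the dictionary layout, and the choice of uniform random sampling are assumptions; the slides leave the sampling scheme open.

```python
import random
from typing import Dict, List


def ure_from_binary_judgments(judgments: Dict[str, List[int]]) -> Dict[str, float]:
    """Average each document's 0/1 judgments into a URE in [0, 1]."""
    return {doc: sum(votes) / len(votes) for doc, votes in judgments.items()}


def sampled_adm(
    ure: Dict[str, float],
    sre: Dict[str, float],
    sample_size: int,
    seed: int = 0,
) -> float:
    """ADM over a uniform random sample of the judged documents.

    Assumption: uniform sampling without replacement; other schemes
    are equally possible.
    """
    rng = random.Random(seed)
    docs = rng.sample(sorted(ure), sample_size)
    return 1.0 - sum(abs(sre[d] - ure[d]) for d in docs) / len(docs)


if __name__ == "__main__":
    # Hypothetical judgments from four judges on three documents.
    judgments = {"d1": [1, 1, 0, 1], "d2": [0, 0, 0, 1], "d3": [1, 1, 1, 1]}
    ure = ure_from_binary_judgments(judgments)
    sre = {"d1": 0.8, "d2": 0.3, "d3": 0.6}  # hypothetical system scores
    print(ure)
    print(sampled_adm(ure, sre, sample_size=2))
```

The fixed seed only makes the sketch repeatable; an actual evaluation would need to justify how the sample is drawn.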


Conclusions

ADM, a new measure of retrieval effectiveness
– Adequacy
– Improvements w.r.t. P & R: avoids hyper-sensitiveness and lack of sensitiveness
– Practical usability (continuous relevance judgments, sampling)

Very preliminary work


Future work

Theoretical variations and improvements
– Standard deviation in place of the absolute value of the difference?
– Which sampling?

Re-examine the data of some evaluation experiments (any volunteers?)

Using ADM in real life
