TRANSCRIPT
Advances on the Development of Evaluation Measures
Ben Carterette, Evangelos Kanoulas, Emine Yilmaz
Information Retrieval Systems
Match information seekers with
the information they seek
“What you can’t measure you can’t improve”
Lord Kelvin
Why is Evaluation so Important?
Most retrieval systems are tuned to optimize for an objective evaluation metric.
Outline
• Intro to evaluation
– Different approaches to evaluation
– Traditional evaluation measures
• User model based evaluation measures
• Session Evaluation
• Novelty and Diversity
Online Evaluation
• Design interactive experiments
• Use users’ actions (click / no click) to evaluate result quality
Online Evaluation
• Standard click metrics
– Clickthrough rate
– Queries per user
– Probability that a user skips over results they have considered (pSkip)
• Result interleaving
What is result interleaving?
• A way to compare rankers online
– Given the two rankings produced by two methods
– Present a combination of the rankings to users
• Result interleaving
– Credit assignment based on clicks
Team Draft Interleaving (Radlinski et al., 2008)
• Interleaving two rankings
– Input: Two rankings
• Repeat:
– Toss a coin to see which team picks next
– Winner picks their best remaining player
– Loser picks their best remaining player
– Output: One ranking
• Credit assignment
– Ranking providing more of the clicked results wins
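The draft-and-credit procedure above can be sketched in Python. This is a minimal illustration (function and document names are mine), not the authors' implementation:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random.random):
    """Team-draft interleaving (Radlinski et al., 2008).

    Each round, toss a coin to decide which team picks first; each team
    then appends its best remaining document not already taken.
    Returns the interleaved list and each slot's team ('A' or 'B').
    """
    interleaved, teams, used = [], [], set()
    a, b = list(ranking_a), list(ranking_b)

    def pick(ranking, team):
        for doc in ranking:
            if doc not in used:
                used.add(doc)
                interleaved.append(doc)
                teams.append(team)
                return

    while any(d not in used for d in a) or any(d not in used for d in b):
        if rng() < 0.5:
            pick(a, 'A'); pick(b, 'B')
        else:
            pick(b, 'B'); pick(a, 'A')
    return interleaved, teams

def credit(teams, clicked_positions):
    """The ranker contributing more of the clicked results wins."""
    a = sum(1 for i in clicked_positions if teams[i] == 'A')
    b = sum(1 for i in clicked_positions if teams[i] == 'B')
    return 'A' if a > b else 'B' if b > a else 'tie'
```

Passing a fixed `rng` (as in a unit test) makes the draft deterministic; in production the coin toss randomizes away position bias between the two rankers.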
Team Draft Interleaving

Ranking A:
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Valley Wineries - Plan your wine... (www.napavalley.com/wineries)
3. Napa Valley College (www.napavalley.edu/homex.asp)
4. Been There | Tips | Napa Valley (www.ivebeenthere.co.uk/tips/16681)
5. Napa Valley Wineries and Wine (www.napavintners.com)
6. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)

Ranking B:
1. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
2. Napa Valley – The authority for lodging... (www.napavalley.com)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
5. NapaValley.org (www.napavalley.org)
6. The Napa Valley Marathon (www.napavalleymarathon.org)

Presented Ranking:
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries)
5. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
6. Napa Valley College (www.napavalley.edu/homex.asp)
7. NapaValley.org (www.napavalley.org)

B wins! (more of the clicked results came from ranking B)
Offline Evaluation
• Controlled laboratory experiments
• The user’s interaction with the engine is only simulated
– Ask experts to judge each query result
– Predict how users behave when they search
– Aggregate judgments to evaluate
Online vs. Offline Evaluation

Online
– Pros: cheap; measures actual user reactions
– Cons: need to go live; noisy; slow; not duplicable

Offline
– Pros: fast to evaluate; easy to try new ideas; portable
– Cons: needs ground truth; judgments are slow and “expensive” to obtain, and can be “inconsistent”; difficult to model how users behave
Outline
• Intro to evaluation
– Different approaches to evaluation
– Traditional evaluation measures
• User model based evaluation measures
• Session Evaluation
• Novelty and Diversity
Traditional Experiment
[Figure: search engines return results; judges assess them, asking “How many good docs have I missed/found?”]
Depth-k Pooling
[Figure: each of M systems (sys1 ... sysM) returns a ranked list; the documents appearing in the top k ranks of any list (A, B, C, ...) form the pool and are sent to judges. Judging labels each pooled document relevant (R) or non-relevant (N); documents below depth k remain unjudged.]
Reusable Test Collections
• Document Corpus
• Topics
• Relevance Judgments
Evaluation Metrics: Precision vs Recall

Retrieved list (ranks 1–10): R N R N N R N N N R

• Precision@k: the fraction of the top k retrieved documents that are relevant
• Recall@k: the fraction of all relevant documents that appear in the top k
Visualizing Retrieval Performance: Precision-Recall Curves

List: R N R N N R N N N R
[Figure: precision plotted against recall at each rank of the list.]
Evaluation Metrics: Average Precision

List: R N R N N R N N N R

AP averages Precision@k over the ranks k of the relevant documents. Assuming these four are all the relevant documents for the topic:
AP = (1/4)(1/1 + 2/3 + 3/6 + 4/10) ≈ 0.64
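As a concrete sketch, the traditional measures can be computed for the example list. This is a minimal illustration, assuming the four relevant documents in the list are all the relevant documents for the topic:

```python
def precision_at_k(rels, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(rels[:k]) / k

def recall_at_k(rels, k, num_relevant):
    """Fraction of all relevant documents retrieved in the top k."""
    return sum(rels[:k]) / num_relevant

def average_precision(rels, num_relevant):
    """Average of precision@k over the ranks holding relevant docs."""
    return sum(precision_at_k(rels, k)
               for k, rel in enumerate(rels, start=1) if rel) / num_relevant

rels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]  # R N R N N R N N N R
```

For this list, AP = (1/4)(1/1 + 2/3 + 3/6 + 4/10) ≈ 0.64.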
Outline
• Intro to evaluation
– Different approaches to evaluation
– Traditional evaluation measures
• User model based evaluation measures
• Session Evaluation
• Novelty and Diversity
User Models Behind Traditional Metrics
• Precision@k
– Users always look at top k documents
– What fraction of the top k documents are relevant?
• Recall
– Users would like to find all the relevant documents.
– What fraction of these documents have been retrieved by the search engine?
User Model of Average Precision (Robertson ‘08)
1. User steps down a ranked list one-by-one
2. Stops browsing documents due to satisfaction
– stops with a certain probability after observing a relevant document
3. Gains utility from each relevant document
User Model of Average Precision (Robertson ‘08)
• The probability that the user stops browsing is uniform over the R relevant documents:
    P(n) = 1/R if the doc at rank n is relevant, 0 otherwise
• The utility the user gains when stopping at a relevant document at rank n is precision at rank n:
    U(n) = (1/n) Σ_{k=1}^{n} rel(k)
• AP can be written as:
    AP = Σ_n P(n) · U(n)
User Model Based Evaluation Measures
• Directly aim at evaluating user satisfaction
– An effectiveness measure should be correlated to the user’s experience
• Thus interest in effectiveness measures based on explicit models of user interaction
– Devise a user model correlated with user behavior
– Infer an evaluation metric from the user model
Basic User Model
• Simple model of user interaction:
1. User steps down ranked results one-by-one
2. Stops at a document at rank k with some probability P(k)
3. Gains some utility U(k) from relevant documents
M = Σ_{k=1}^{∞} U(k) · P(k)
Basic User Model
1. Discount: What is the chance a user will visit a document?
– Model of the browsing behavior
2. Utility: What does the user gain by visiting a document?
Model Browsing Behavior
Position-based models
The chance of observing a document depends on the position
it is presented in the ranked list.
Query: black powder ammunition
[Figure: ranked results 1–10; the chance of observing a document depends only on the rank at which it is presented.]
Rank Biased Precision
Query: black powder ammunition
[Figure: after viewing each item, the user either stops or views the next item.]
Rank Biased Precision
    RBP = (1 − p) · Σ_{i=1}^{∞} p^(i−1) · rel_i
where p is the user’s persistence: the probability of viewing the next item rather than stopping.
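A direct transcription of the RBP formula, as a sketch (the default persistence p = 0.8 is a commonly used value, not one fixed by the slide):

```python
def rbp(rels, p=0.8):
    """Rank-biased precision: (1 - p) * sum_i p^(i-1) * rel_i.

    p is the user's persistence: the probability of continuing to the
    next result after viewing one.
    """
    return (1 - p) * sum(rel * p ** (i - 1) for i, rel in enumerate(rels, 1))
```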
Discounted Cumulative Gain
Query: black powder ammunition

Rank  Judgment  Relevance score  Gain (2^rel − 1)  Discounted gain (gain / log2(r+1))
1     HR        2                3                 3.00
2     R         1                1                 0.63
3     N         0                0                 0
4     N         0                0                 0
5     HR        2                3                 1.16
6     R         1                1                 0.36
7     N         0                0                 0
8     R         1                1                 0.32
9     N         0                0                 0
10    N         0                0                 0

DCG ≈ 5.46
Discounted Cumulative Gain
• DCG can be written as:
    DCG = Σ_{r=1}^{N} P(user visits doc r) · Utility(r)
• The discount function models the probability that the user visits (clicks on) the document at rank r
– Currently, P(user visits doc r) = 1/log2(r+1)
• Instead of a stopping probability, think about a viewing probability
• This fits in the discounted gain model framework
Normalised Discounted Cumulative Gain
(Same worked example: query “black powder ammunition”, judgments HR R N N HR R N R N N, gain 2^rel − 1, discount 1/log2(r+1).)

    NDCG = DCG / optDCG

where optDCG is the DCG of the ideal reordering of the judged documents.
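The worked example can be reproduced with a short sketch (gain 2^rel − 1 and discount 1/log2(r+1), as on the slide):

```python
import math

def dcg(grades):
    """DCG with gain 2^rel - 1 and discount 1/log2(r + 1)."""
    return sum((2 ** g - 1) / math.log2(r + 1)
               for r, g in enumerate(grades, start=1))

def ndcg(grades):
    """Normalise by optDCG, the DCG of the ideal reordering."""
    opt = dcg(sorted(grades, reverse=True))
    return dcg(grades) / opt if opt > 0 else 0.0

grades = [2, 1, 0, 0, 2, 1, 0, 1, 0, 0]  # HR R N N HR R N R N N
```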
Model Browsing Behavior
Cascade-based models
[Figure: results 1–10 for the query “black powder ammunition”, examined top to bottom.]
• The user views search results from top to bottom
• At each rank i, the user has a certain probability of being satisfied
– Probability of satisfaction is proportional to the relevance grade of the document at rank i
• Once the user is satisfied with a document, he terminates the search
Rank Biased Precision
Query: black powder ammunition
[Figure: after viewing each item, the user either stops or views the next item.]
Expected Reciprocal Rank [Chapelle et al CIKM09]
Query: black powder ammunition
[Figure: after viewing each item the user asks “Relevant?” (no / somewhat / highly) and either stops, satisfied, or views the next item.]
Expected Reciprocal Rank [Chapelle et al CIKM09]
Query: black powder ammunition

    φ(r) = 1/r : utility of finding “the perfect document” at rank r

    ERR = Σ_{r=1}^{n} φ(r) · P(user stops at position r)
        = Σ_{r=1}^{n} (1/r) · Π_{i=1}^{r−1} (1 − R_i) · R_r

    R_r = (2^{g_r} − 1) / 2^{g_max} : probability of relevance of the doc at rank r
    g_r : relevance grade of the r-th document
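The cascade computation can be sketched directly from the formula (a minimal illustration, assuming a maximum grade of 2):

```python
def err(grades, g_max=2):
    """Expected reciprocal rank (Chapelle et al., CIKM 2009).

    R_r = (2^g_r - 1) / 2^g_max is the chance the document at rank r
    satisfies the user; the user stops at the first satisfying
    document, and the utility of stopping at rank r is 1/r.
    """
    total = 0.0
    p_reach = 1.0  # probability the user reaches rank r unsatisfied
    for r, g in enumerate(grades, start=1):
        r_rel = (2 ** g - 1) / 2 ** g_max
        total += p_reach * r_rel / r
        p_reach *= 1 - r_rel
    return total
```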
Metrics derived from Query Logs
• Use the query logs to understand how users behave
• Learn the parameters of the user model from the query logs
– Utility, discount, etc.
Metrics derived from Query Logs
• Users tend to stop search if they are satisfied or frustrated
• P(observe a doc at rank r) highly affected by snippet quality
Relevance   P(C|R)   P(Stop|R)
Bad         0.50     0.49
Fair        0.49     0.41
Good        0.45     0.37
Excellent   0.59     0.53
Perfect     0.79     0.76
Metrics derived from Query Logs
• Users behave differently for different queries
– Informational queries
– Navigational queries
             Navigational          Informational
Relevance    P(C|R)   P(Stop|R)    P(C|R)   P(Stop|R)
Bad          0.632    0.587        0.516    0.431
Fair         0.569    0.523        0.455    0.357
Good         0.526    0.483        0.442    0.349
Excellent    0.700    0.669        0.533    0.458
Perfect      0.809    0.786        0.557    0.502
Expected Browsing Utility (Yilmaz et al. CIKM’10)
    DEBU(r) = P(E_r) · P(C | R_r)
    EBU = Σ_{r=1}^{n} DEBU(r) · R_r
Basic User Model
1. Discount: What is the chance a user will visit a document?
– Model of the browsing behavior
2. Utility: What does the user gain by visiting a document?
– Mostly ad-hoc, no clear user model
Graded Average Precision (Robertson et al. SIGIR’10)
• One document is more useful than another
• One possible meaning:
– one document is useful to more users than another
• Hence the following:
– assume grades of relevance...
– ...but each user has a threshold relevance grade which defines a binary view
– different users have different thresholds, described by a probability distribution over users
Graded Average Precision [Robertson et al. SIGIR10]
• User has binary view of relevance
– by thresholding the relevance scale
Considered relevant with probability g1
Irrelevant
Relevant
Highly Relevant
Relevance Scale
Graded Average Precision [Robertson et al. SIGIR10]
• User has binary view of relevance
– by thresholding the relevance scale
Irrelevant
Relevant
Highly Relevant
Relevance Scale
Considered relevant with probability g2
Graded Average Precision
• Assume relevance grades {0...c}
– 0 for non-relevant, plus c positive grades
• gi = P(user threshold is at i) for i ∈ {1...c}
– i.e. the user regards grades {i...c} as relevant, grades {0...(i-1)} as not relevant
– the gis sum to one
• Step down the ranked list, stopping at documents that may be relevant
– then calculate expected precision at each of these (expected over the population of users)
Graded Average Precision (GAP)

Relevance:   With prob. g1 (threshold at R):
1 HR         1 Rel
2 R          2 Rel
3 N          3 N
4 N          4 N
5 R          5 Rel
6 HR         6 Rel
7 R          7 Rel

prec@6 = 4/6
Graded Average Precision (GAP)

Relevance:   With prob. g2 (threshold at HR):
1 HR         1 Rel
2 R          2 N
3 N          3 N
4 N          4 N
5 R          5 N
6 HR         6 Rel
7 R          7 N

prec@6 = 2/6
Graded Average Precision (GAP)

Relevance: 1 HR, 2 R, 3 N, 4 N, 5 R, 6 HR, 7 R

wprec@6 = (4/6) · g1 + (2/6) · g2
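The expected-precision computation can be sketched as follows (function names are mine; grades use 2 = HR, 1 = R, 0 = N):

```python
def binarize(grades, threshold):
    """A user with threshold t treats grades >= t as relevant."""
    return [1 if g >= threshold else 0 for g in grades]

def expected_precision_at_k(grades, k, threshold_probs):
    """Precision@k averaged over the population of user thresholds.

    threshold_probs maps threshold i -> g_i = P(user threshold is i);
    the g_i must sum to one.
    """
    return sum(p * sum(binarize(grades, t)[:k]) / k
               for t, p in threshold_probs.items())

grades = [2, 1, 0, 0, 1, 2, 1]  # HR R N N R HR R from the slide
```

With threshold 1 the top 6 contain 4 relevant documents, with threshold 2 only 2, so the expectation is (4/6)·g1 + (2/6)·g2.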
Probability Models
• Almost all the measures we’ve discussed are based on probabilistic models of users
– Most have one or more parameters representing something about user behavior
– Is there a way to incorporate variability in the user population?
• How do we estimate parameter values?
– Is a single point estimate good enough?
Choosing Parameter Values
• Parameter θ models a user
– Higher θ: more patience, more results viewed
– Lower θ: less patience, fewer results viewed
• Different approaches:
– Minimize variance in evaluation (Kanoulas & Aslam, CIKM ‘09)
– Use click logs; fit a model to gaps between clicks (Zhang et al., IRJ, 2010)
– All try to infer a single value for the parameters
Distribution of “Patience” for RBP
• Form a distribution P(θ)
• Sampling from P(θ) is like sampling a “user” defined by their patience
• How can we form a proper distribution of θ?
• Idea: mine logged search engine user data
– Look at the ranks users are clicking
– Estimate patience based on the absence or presence of clicks
Modeling Patience from Log Data
• We will assume a flat prior on θ that we want to update using log data L
• Decompose L into individual search sessions
– For each session q, count:
• cq, the total number of clicks
• rq, the total number of no-clicks
– Model cq with a negative binomial distribution conditional on rq and θ:
Modeling Patience from Log Data
• Marginalize P(θ|L) over r:
• Apply Bayes’ rule to P(θ | r, L):
• P(L | θ, r) is the likelihood of the observed clicks
Complete Model Expression
• Model components result in three equations to estimate P(θ | L)
Empirical Patience Profiles: Navigational Queries
Empirical Patience Profiles: Informational Queries
Extend to ERR Parameters
Evaluation Using Parameter Distributions
• Monte Carlo procedure:
– Sample a parameter value from P(θ | L)
• Or a vector of values for ERR
– Compute the measure with the sampled value
– Iterate to form distribution P(RBP) or P(ERR)
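The Monte Carlo procedure can be sketched as below; the Beta distribution here is only a stand-in for a fitted posterior P(θ | L), not the distribution estimated in the work:

```python
import random

def rbp(rels, p):
    """Rank-biased precision with persistence p."""
    return (1 - p) * sum(rel * p ** (i - 1) for i, rel in enumerate(rels, 1))

def measure_distribution(rels, sample_theta, n_samples=5000, seed=0):
    """Sample a patience value from P(theta | L), compute the measure,
    and iterate to build an empirical distribution of RBP."""
    rng = random.Random(seed)
    return [rbp(rels, sample_theta(rng)) for _ in range(n_samples)]

# S1 = [R N N N N N N N N N]; Beta(2, 2) stands in for the posterior
samples = measure_distribution([1] + [0] * 9,
                               lambda rng: rng.betavariate(2, 2))
```

Comparing the resulting distributions for two systems gives quantities like P(M1 > M2) directly, rather than a single point estimate.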
Marginal Distribution Analysis
• S1=[R N N N N N N N N N]
• S2=[N R R R R R R R R R]
Distribution of RBP
Distribution of ERR
Marginal Distribution Analysis
• Given two systems, over all choices of θ
– What is P(M1 > M2)?
– What is P((M1 - M2)>t)?
Marginal Distribution Analysis
Outline
• Intro to evaluation
– Different approaches to evaluation
– Traditional evaluation measures
• User model based evaluation measures
• Session Evaluation
• Novelty and Diversity
Why sessions?
• Current evaluation framework
– Assesses the effectiveness of systems over one-shot queries
• But users reformulate their initial query
• Still fine if...
– optimizing systems for one-shot queries led to optimal performance over an entire session
• e.g. retrieval systems should accumulate information along a session

Example: When was the DuPont Science Essay Contest created?
Initial Query: DuPont Science Essay Contest
Reformulation: When was the DSEC created?
Why sessions?
Example query sequence: “Paris Luxurious Hotels”, “Paris Hilton”, “J Lo Paris”
Extend the evaluation framework
– from single-query evaluation
– to multi-query session evaluation
Construct appropriate test collections
Rethink evaluation measures
• A set of information needs
“A friend from Kenya is visiting you and you’d like to surprise him by cooking a traditional Swahili dish. You would like to search online to decide which dish you will cook at home.”
– A static sequence of m queries

Basic test collection
Initial Query: kenya cooking traditional
1st Reformulation: kenya cooking traditional swahili
2nd Reformulation: kenya swahili traditional food recipes
…
(m-1)th Reformulation: …
Example relevant result: www.allrecipes.com
Basic Test Collection
Factual/Amorphous, Known-item search
Intellectual/Amorphous, Explanatory search
Factual/Amorphous, Known-item search
Experiment
[Figure: for each query in the session (“kenya cooking traditional”, “kenya cooking traditional swahili”, “kenya swahili traditional food recipes”), a ranked list of results 1–10 is retrieved and judged.]
Construct appropriate test collections
Rethink evaluation measures
What is a good system?
How can we measure “goodness”?
Measuring “goodness”
The user steps down a ranked list of documents, observing each one until a decision point, and then either
a) abandons the search, or
b) reformulates.
While stepping down or sideways, the user accumulates utility.
What are the challenges?
Evaluation over a single ranked list
[Figure: each query in the session (“kenya cooking traditional”, “kenya cooking traditional swahili”, “kenya swahili traditional food recipes”) returns its own ranked list 1–10; traditional measures evaluate each list in isolation.]
Session DCG [Järvelin et al ECIR 2008]

Within each list (e.g. for “kenya cooking traditional swahili” and “kenya cooking traditional”):
    DCG(RL_i) = Σ_{r=1}^{k} (2^{rel(r)} − 1) / log_b(r + b − 1)

Across the session, the i-th list’s DCG is discounted by its query position:
    sDCG = (1/log_c(1 + c − 1)) · DCG(RL1) + (1/log_c(2 + c − 1)) · DCG(RL2) + ...
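A sketch of sDCG under these definitions; b and c are the within-list and between-query log bases, and the defaults here are illustrative choices, not values fixed by the paper:

```python
import math

def dcg_at_k(rels, k, b=2):
    """Within-list DCG: gain 2^rel - 1, discount log_b(r + b - 1)."""
    return sum((2 ** rel - 1) / math.log(r + b - 1, b)
               for r, rel in enumerate(rels[:k], start=1))

def session_dcg(session_lists, k=10, b=2, c=4):
    """Discount the i-th query's DCG by log_c(i + c - 1)."""
    return sum(dcg_at_k(rels, k, b) / math.log(i + c - 1, c)
               for i, rels in enumerate(session_lists, start=1))
```

Note that the first query gets discount log_c(c) = 1, so a single-query session reduces to ordinary DCG.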
Session Metrics
• Session DCG [Järvelin et al ECIR 2008]
The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]
Model-based measures
Probabilistic space of users following different paths
• Ω is the space of all paths
• P(ω) is the prob of a user following a path ω in Ω
• U(ω) is the utility of path ω in Ω
[Yang and Lad ICTIR 2009]
EGU = Σ_{ω∈Ω} P(ω) · U(ω)
Expected Global Utility [Yang and Lad ICTIR 2009]
1. User steps down ranked results one-by-one
2. Stops browsing documents based on a stochastic process that defines a stopping probability distribution over ranks and reformulates
3. Gains something from relevant documents, accumulating utility
Expected Global Utility [Yang and Lad ICTIR 2009]
• The probability of a user following a path ω:
P(ω) = P(r1, r2, ..., rK)
ri is the stopping and reformulation point in list i
– Assumption: stopping positions in each list are independent
P(r1, r2, ..., rK) = P(r1)P(r2)...P(rK)
– Use geometric distribution (RBP) to model the stopping and reformulation behaviour
P(ri = r) = (1 − θ) · θ^(r−1)
[Figure: ranked lists Q1–Q3 with N/R judgments; the stopping/reformulation rank within each list follows a geometric distribution with parameter θ.]
Expected Global Utility
Session Metrics
• Session DCG [Järvelin et al ECIR 2008]
The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]
• Expected global utility [Yang and Lad ICTIR 2009]
The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]
Model-based measures
Probabilistic space of users following different paths
• Ω is the space of all paths
• P(ω) is the prob of a user following a path ω in Ω
• Mω is a measure over a path ω
[Kanoulas et al. SIGIR 2011]
esM = Σ_{ω∈Ω} P(ω) · M_ω
Probability of a path
    e.g. P(path) = P(abandoning at reformulation 2) × P(reformulating at rank 3)
[Figure: ranked lists Q1–Q3 with N/R judgments. (1) The probability of abandoning the session at reformulation i follows a truncated geometric distribution with parameter preform; (2) the probability of reformulating at rank j follows a geometric distribution with parameter pdown.]
Session Metrics
• Session DCG [Järvelin et al ECIR 2008]
The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]
• Expected global utility [Yang and Lad ICTIR 2009]
The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]
• Expected session measures [Kanoulas et al. SIGIR 2011]
The user steps down a ranked list of documents until a decision point and either abandons the query or reformulates [Stochastic; allows early abandonment]
Outline
• Intro to evaluation
– Different approaches to evaluation
– Traditional evaluation measures
• User model based evaluation measures
• Session Evaluation
• Novelty and Diversity
Novelty
• The redundancy problem:
– the first relevant document contains some useful information
– every document with the same information after that is worth less to the user
• but worth the same to traditional evaluation measures
• Novelty retrieval attempts to ensure that ranked results do not have much redundancy
Example
• query: “oil-producing nations”
– members of OPEC
– North Atlantic nations
– South American nations
• 10 relevant articles about OPEC are probably not as useful as one relevant article about each group
– And one relevant article about all oil-producing nations might be even better
How to Evaluate?
• One approach:
– List subtopics, aspects, or facets of the topic
– Judge each document relevant or not to each possible subtopic
• For oil-producing nations, subtopics could be names of nations
– Saudi Arabia, Russia, Canada, …
Subtopic Relevance Example
Evaluation Measures
• Subtopic recall and precision (Zhai et al., 2003)
– Subtopic recall at rank k:
• Count the unique subtopics in the top k documents
• Divide by the total number of known unique subtopics
– Subtopic precision at recall r:
• Find the least k at which subtopic recall r is achieved
• Find the least k at which subtopic recall r could possibly be achieved (by a perfect system)
• Divide the latter by the former
– Models a user that wants all subtopics
• and doesn’t care about redundancy as long as they are seeing new information
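A sketch of subtopic recall and the rank computation behind subtopic precision (function names are mine; each document is represented by the set of subtopics it is relevant to):

```python
def subtopic_recall_at_k(doc_subtopics, k, num_subtopics):
    """Unique subtopics covered in the top k over all known subtopics."""
    covered = set()
    for subs in doc_subtopics[:k]:
        covered |= set(subs)
    return len(covered) / num_subtopics

def min_rank_for_recall(doc_subtopics, target, num_subtopics):
    """Least k at which subtopic recall reaches target, or None."""
    covered = set()
    for k, subs in enumerate(doc_subtopics, start=1):
        covered |= set(subs)
        if len(covered) / num_subtopics >= target:
            return k
    return None

def subtopic_precision(system_ranking, ideal_ranking, target, num_subtopics):
    """Least possible k (perfect system) divided by the system's least k."""
    ideal = min_rank_for_recall(ideal_ranking, target, num_subtopics)
    actual = min_rank_for_recall(system_ranking, target, num_subtopics)
    return ideal / actual if actual else 0.0
```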
Subtopic Relevance Evaluation
Copyright © Ben Carterette
Diversity
• Short keyword queries are inherently ambiguous
– An automatic system can never know the user’s intent
• Diversification attempts to retrieve results that may be relevant to a space of possible intents
Evaluation Measures
• Subtopic recall and precision
– This time with judgments to “intents” rather than subtopics
• Measures that know about intents:
– “Intent-aware” family of measures (Agrawal et al.)
– D, D♯ measures (Sakai et al.)
– α-nDCG (Clarke et al.)
– ERR-IA (Chapelle et al.)
Intent-Aware Measures
• Assume there is a probability distribution P(i | Q) over intents for a query Q
– Probability that a randomly-sampled user means intent i when submitting query Q
• The intent-aware version of a measure is its weighted average over this distribution
P@10-IA = 0.35*0.3 + 0.35*0.3 + 0.2*0.2 + 0.08*0.1 + 0.02*0.1 = 0.26
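The weighted average is straightforward to write down; the intent labels below are placeholders for the slide's five intents:

```python
def intent_aware(metric_per_intent, intent_probs):
    """Weighted average of a per-intent measure over P(i | Q)."""
    return sum(intent_probs[i] * m for i, m in metric_per_intent.items())

# the slide's example: five intents with P(i|Q) and per-intent P@10
probs = {'i1': 0.35, 'i2': 0.35, 'i3': 0.2, 'i4': 0.08, 'i5': 0.02}
p_at_10 = {'i1': 0.3, 'i2': 0.3, 'i3': 0.2, 'i4': 0.1, 'i5': 0.1}
```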
D-measure
• Take the idea of intent-awareness and apply it to computing document gain
– The gain for a document is the (weighted) average of its gains for subtopics it is relevant to
• D-nDCG is nDCG computed using intent-aware gains
D-DCG = 0.35/log 2 + 0.35/log 3 + …
α-nDCG
• α-nDCG is a generalization of nDCG that accounts for both novelty and diversity
• α is a geometric penalization for redundancy
– Redefine the gain of a document:
• +1 for each subtopic it is relevant to
• ×(1-α) for each document higher in the ranking that subtopic already appeared in
• Discount is the same as usual
[Figure: per-document gains in a ranking: +1 for a subtopic’s first appearance, +(1−α) for its second, +(1−α)² for its third.]
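The redefined gain can be sketched directly; this is a minimal illustration with made-up subtopic sets (the DCG-style discount is then applied as usual):

```python
def alpha_gains(doc_subtopics, alpha=0.5):
    """Novelty-biased gain per document (as in alpha-nDCG): each
    subtopic contributes (1 - alpha)^m, where m is how many earlier
    documents in the ranking already covered that subtopic."""
    seen = {}  # subtopic -> number of earlier documents covering it
    gains = []
    for subs in doc_subtopics:
        gain = sum((1 - alpha) ** seen.get(s, 0) for s in subs)
        for s in subs:
            seen[s] = seen.get(s, 0) + 1
        gains.append(gain)
    return gains
```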
ERR-IA
• Intent-aware version of ERR
• But it has appealing properties other IA measures do not have:
– ranges between 0 and 1
– submodularity: diminishing returns for relevance to a given subtopic → built-in redundancy penalization
• Also has appealing properties over α-nDCG:
– Easily handles graded subtopic judgments
– Easily handles intent distributions
Granularity of Judging
• What exactly is a “subtopic”?
– Perhaps any piece of information a user may be interested in finding?
• At what granularity should subtopics be defined?
– For example:
• “cardinals” has many possible meanings
• “cardinals baseball team” is still very broad
• “cardinals baseball team schedule” covers 6 months
• “cardinals baseball team schedule august” covers ~25 games
• “cardinals baseball team schedule august 12th”
Preference Judgments for Novelty
• What about evaluating novelty with no subtopic judgments?
• Preference judgments:
– Is document A more relevant than document B?
• Conditional preference judgments:
– Is document A better than document B given that I’ve just seen document C?
– Assumption: preference is based on novelty over C
• Is it true? Come to our presentation on Wednesday…
Conclusions
• Strong interest in using evaluation measures to model user behavior and satisfaction
– Driven by availability of user logs, increased computational power, and good abstract models
– DCG, RBP, ERR, EBU, session measures, diversity measures all model users in different ways
• Cranfield-style evaluation is still important!
• But there is still much to understand about users and how they derive satisfaction
Conclusions
• Ongoing and future work:
– Models with more degrees of freedom
– Direct simulation of users from start of session to finish
– Application to other domains
• Thank you!
– Slides will be available online
– http://ir.cis.udel.edu/SIGIR12tutorial