Text Similarity
Dr Eamonn Keogh
Computer Science & Engineering Department
University of California, Riverside
Riverside, CA
[email protected]
Word-length frequencies, five Twain samples vs. Snodgrass:

Word     Twain     Twain     Twain     Twain     Twain
Length   Sample 1  Sample 2  Sample 3  Sample 4  Sample 5  Snodgrass
1        74        312       116       138       122       424
2        349       1146      496       532       466       2685
3        456       1394      673       741       653       2752
4        374       1177      565       591       517       2302
5        212       661       381       357       343       1431
6        127       442       249       258       207       992
7        107       367       185       215       152       896
8        84        231       125       150       103       638
9        45        181       94        83        92        465
10       27        109       51        55        45        276
11       13        50        23        30        18        152
12       8         24        8         10        12        101
13+      9         12        8         9         9         61
[Figure: word-length frequency distributions for the samples above, plotted both as raw counts (e.g. Sample 1 vs. Sample 2) and as normalized frequencies, over word lengths 1 to 13.]
Information Retrieval
• Task Statement:
Build a system that retrieves documents that users are likely to find relevant to their queries.
• This assumption underlies the field of Information Retrieval.
[Diagram: the IR pipeline. An information need is expressed as a query (text input), which is parsed and pre-processed; the document collections are indexed; documents are ranked against the query; results are evaluated. How is the query constructed? How is the text processed?]
Terminology
Token: a natural language word, e.g. "Swim", "Simpson", "92513"
Document: usually a web page, but more generally any file.
Some IR History
– Roots in the scientific "Information Explosion" following WWII
– Interest in computer-based IR from mid 1950s
  • H.P. Luhn at IBM (1958)
  • Probabilistic models at Rand (Maron & Kuhns) (1960)
  • Boolean system development at Lockheed ('60s)
  • Vector Space Model (Salton at Cornell 1965)
  • Statistical weighting methods and theoretical advances ('70s)
  • Refinements and advances in application ('80s)
  • User interfaces, large-scale testing and application ('90s)
Relevance
• In what ways can a document be relevant to a query?
  – Answer a precise question precisely. (Who is Homer's boss? Montgomery Burns.)
  – Partially answer a question. (Where does Homer work? The Power Plant.)
  – Suggest a source for more information. (What is Bart's middle name? Look in Issue 234 of the Fanzine.)
  – Give background information.
  – Remind the user of other knowledge.
  – Others ...
[IR pipeline diagram, repeated.]

The section that follows is about:

Content Analysis
(transforming raw text into a computationally more manageable form)
Stemming and Morphological Analysis
• Goal: "normalize" similar words
• Morphology ("form" of words)
  – Inflectional Morphology
    • E.g., inflect verb endings and noun number
    • Never changes grammatical class
      – dog, dogs
      – bike, biking
      – swim, swimmer, swimming
• What about... build, building?
Original Words   Stemmed Words
consign          consign
consigned        consign
consigning       consign
consignment      consign
consist          consist
consisted        consist
consistency      consist
consistent       consist
consistently     consist
consisting       consist
consists         consist
Examples of Stemming (using Porter's algorithm)
Porter's algorithm is available in Java, C, Lisp, Perl, Python etc. from
http://www.tartarus.org/~martin/PorterStemmer/
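The flavor of a suffix-stripping stemmer can be sketched in a few lines. This toy version is an illustration only: the rule list and the 3-letter minimum-stem guard are simplifying assumptions, not the real Porter algorithm, which applies five phases of rules guarded by a "measure" condition (use the reference implementations at the URL above for real work).

```python
# Toy suffix-stripping stemmer in the spirit of Porter's algorithm.
# The rule list and the 3-letter minimum-stem guard are simplifications
# assumed here; they are NOT the full Porter rule set.
SUFFIX_RULES = [
    ("ational", "ate"), ("ment", ""), ("ness", ""),
    ("ing", ""), ("ed", ""), ("ies", "y"), ("s", ""),
]

def toy_stem(word):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Only strip if a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word

for w in ["consigned", "consigning", "consignment", "consists"]:
    print(w, "->", toy_stem(w))  # conflates the "consign"/"consist" families
```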
Errors Generated by the Porter Stemmer (Krovetz 93)

Too Aggressive             Too Timid
organization / organ       european / europe
policy / police            cylinder / cylindrical
execute / executive        create / creation
arm / army                 search / searcher

Homework!! Play with the following URL:
http://fusion.scs.carleton.ca/~dquesnel/java/stuff/PorterApplet.html
Statistical Properties of Text
• Token occurrences in text are not uniformly distributed
• They are also not normally distributed
• They do exhibit a Zipf distribution

Most frequent tokens:
8164 the, 4771 of, 4005 to, 2834 a, 2827 and, 2802 in, 1592 The, 1370 for, 1326 is, 1324 s, 1194 that, 973 by, 969 on, 915 FT, 883 Mr, 860 was, 855 be, 849 Pounds, 798 TEXT, 798 PUB, 798 PROFILE, 798 PAGE, 798 HEADLINE, 798 DOCNO

Tokens occurring once:
1 ABC, 1 ABFT, 1 ABOUT, 1 ACFT, 1 ACI, 1 ACQUI, 1 ACQUISITIONS, 1 ACSIS, 1 ADFT, 1 ADVISERS, 1 AE, ...

(Government documents, 157734 tokens, 32259 unique)
Plotting Word Frequency by Rank
• Main idea: count how many times tokens occur in the text
  • Over all texts in the collection
• Now rank these according to how often they occur. This is called the rank.

Rank  Freq  Term
1     37    system
2     32    knowledg
3     24    base
4     20    problem
5     18    abstract
6     15    model
7     15    languag
8     15    implem
9     13    reason
10    13    inform
11    11    expert
12    11    analysi
13    10    rule
14    10    program
15    10    oper
16    10    evalu
17    10    comput
18    10    case
19    9     gener
20    9     form
The Corresponding Zipf Curve
Zipf Distribution
• The Important Points:
  – a few elements occur very frequently
  – a medium number of elements have medium frequency
  – many elements occur very infrequently
Zipf Distribution
• The product of the frequency of words (f) and their rank (r) is approximately constant
  – Rank = order of words' frequency of occurrence

    f ≈ C × (1/r)        C ≈ N/10

• Another way to state this is with an approximately correct rule of thumb:
  – Say the most common term occurs C times
  – The second most common occurs C/2 times
  – The third most common occurs C/3 times
  – ...
Illustration by Jacob Nielsen
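The rank-frequency relationship is easy to check empirically. The sketch below builds a small artificial corpus (assumed purely for illustration; a real collection is needed to see a clean Zipf curve), counts tokens, and prints frequency × rank for the top words:

```python
from collections import Counter

# Toy corpus, invented for illustration only.
text = ("the quick brown fox jumps over the lazy dog the fox "
        "the dog the the of of of to to a ") * 50

counts = Counter(text.split())
ranked = counts.most_common()  # [(token, freq), ...] sorted by frequency

# Under Zipf's law, freq * rank stays roughly constant down the list.
for rank, (token, freq) in enumerate(ranked[:5], start=1):
    print(f"rank {rank}: {token!r} freq={freq} freq*rank={freq * rank}")
```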
Zipf Distribution (linear and log scale)
What Kinds of Data Exhibit a Zipf Distribution?
• Words in a text collection
  – Virtually any language usage
• Library book checkout patterns
• Incoming web page requests
• Outgoing web page requests
• Document size on the web
• City sizes
• ...
Consequences of Zipf
• There are always a few very frequent tokens that are not good discriminators.
  – Called "stop words" in IR
  – English examples: to, from, on, and, the, ...
• There are always a large number of tokens that occur once and can mess up algorithms.
• Medium-frequency words are the most descriptive.
Word Frequency vs. Resolving Power (from van Rijsbergen 79)
The most frequent words are not the most descriptive.
Statistical Independence
Two events x and y are statistically independent if the product of the probabilities of each happening individually equals the probability of their happening together:

    P(x) × P(y) = P(x, y)
Lexical Associations
• Subjects write the first word that comes to mind
  – doctor/nurse; black/white (Palermo & Jenkins 64)
• Text corpora yield similar associations
• One measure: Mutual Information (Church and Hanks 89)

    I(x, y) = log2( P(x, y) / ( P(x) × P(y) ) )

• If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
Statistical Independence
• Compute for a window of words:

    P(x) ≈ f(x)/N            (P(x)P(y) = P(x,y) if independent)

  We'll approximate P(x, y) as follows:

    P(x, y) ≈ (1/N) Σ_{i=1..N} w_i(x, y)

  where
    w_i      = window of length |w| (say 5) starting at position i
    w_i(x,y) = number of times x and y co-occur in window w_i
    N        = number of words in the collection

[Figure: sliding windows w1, w11, w21 over the token sequence a b c d e f g h i j k l m n o p]
Interesting Associations with "Doctor"
(AP Corpus, N=15 million, Church & Hanks 89)

I(x,y)  f(x,y)  f(x)   x         f(y)   y
11.3    12      111    Honorary  621    Doctor
11.3    8       1105   Doctors   44     Dentists
10.7    30      1105   Doctors   241    Nurses
9.4     8       1105   Doctors   154    Treating
9.0     6       275    Examined  621    Doctor
8.9     11      1105   Doctors   317    Treat
8.7     25      621    Doctor    1407   Bills
Un-Interesting Associations with "Doctor"
(AP Corpus, N=15 million, Church & Hanks 89)

I(x,y)  f(x,y)  f(x)    x       f(y)   y
0.96    6       621     doctor  73785  with
0.95    41      284690  a       1105   doctors
0.93    12      84716   is      1105   doctors

These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.
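The Church & Hanks measure can be sketched as follows. The tiny corpus and the sliding-window estimate of P(x, y) are assumptions for illustration only:

```python
import math
from collections import Counter

def mutual_information(tokens, x, y, window=5):
    """I(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with P(x, y)
    estimated from co-occurrences inside a sliding window of words."""
    n = len(tokens)
    freq = Counter(tokens)
    cooc = sum(1 for i in range(n)
               if x in tokens[i:i + window] and y in tokens[i:i + window])
    if cooc == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2((cooc / n) / ((freq[x] / n) * (freq[y] / n)))

# Toy corpus, invented for illustration.
tokens = ("the doctor met the nurse " * 3 + "the red car ").split()
print(mutual_information(tokens, "doctor", "nurse"))  # positive: associated
```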
Associations Are Important Because...
• We may be able to discover phrases that should be treated as a single word, e.g. "data mining".
• We may be able to automatically discover synonyms, e.g. "Bike" and "Bicycle".
Content Analysis Summary
• Content Analysis: transforming raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
  – Word frequencies have a Zipf distribution
  – Word co-occurrences exhibit dependencies
• Text documents are transformed to vectors
  – Pre-processing includes tokenization, stemming, collocations/phrases
[IR pipeline diagram, repeated. How is the index constructed?]

The section that follows is about:

Index Construction
Inverted Index
• This is the primary data structure for text indexes
• Main Idea:
  – Invert documents into a big index
• Basic steps:
  – Make a "dictionary" of all the tokens in the collection
  – For each token, list all the docs it occurs in.
  – Do a few things to reduce redundancy in the data structure
Inverted Indexes
We have seen "Vector files" conceptually. An Inverted File is a vector file "inverted" so that rows become columns and columns become rows.

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7 ...
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0
How Are Inverted Files Created
• Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: "Now is the time for all good men to come to the aid of their country"
Doc 2: "It was a dark and stormy night in the country manor. The time was past midnight"

Term/Doc # pairs, in order of parsing:
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1, it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2
How Inverted Files are Created
• After all documents have been parsed, the inverted file is sorted alphabetically.

Term/Doc # pairs, sorted:
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2
How Inverted Files are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.

Term, Doc #, (Freq):
a 2 (1), aid 1 (1), all 1 (1), and 2 (1), come 1 (1), country 1 (1), country 2 (1), dark 2 (1), for 1 (1), good 1 (1), in 2 (1), is 1 (1), it 2 (1), manor 2 (1), men 1 (1), midnight 2 (1), night 2 (1), now 1 (1), of 1 (1), past 2 (1), stormy 2 (1), the 1 (2), the 2 (2), their 1 (1), time 1 (1), time 2 (1), to 1 (2), was 2 (2)
How Inverted Files are Created
• Then the file can be split into:
  – A Dictionary file, and
  – A Postings file
How Inverted Files are Created

Dictionary, Term (N docs, Tot Freq):
a (1,1), aid (1,1), all (1,1), and (1,1), come (1,1), country (2,2), dark (1,1), for (1,1), good (1,1), in (1,1), is (1,1), it (1,1), manor (1,1), men (1,1), midnight (1,1), night (1,1), now (1,1), of (1,1), past (1,1), stormy (1,1), the (2,4), their (1,1), time (2,2), to (1,2), was (1,2)

Postings, (Doc #, Freq) in dictionary order:
(2,1) (1,1) (1,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,2) (2,2) (1,1) (1,1) (2,1) (1,2) (2,2)
Inverted Indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
  – document ID
  – frequency of term in doc (optional)
  – position of term in doc (optional)
• These lists can be used to solve Boolean queries:
  – country -> d1, d2
  – manor -> d2
  – country AND manor -> d2
• Also used for statistical ranking algorithms
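The construction and use of an inverted index can be sketched end-to-end. The two example documents are the ones from the slides; collapsing the dictionary and postings into one nested mapping is a simplification assumed for brevity:

```python
from collections import defaultdict

# The two example documents from the slides.
docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor "
       "the time was past midnight",
}

# Dictionary and postings collapsed into one mapping:
# token -> {doc_id: within-document frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for token in text.split():
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

def boolean_and(*terms):
    """Answer a Boolean AND query by intersecting posting lists."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

print(boolean_and("time", "dark"))  # only doc 2 contains both terms
```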
How Inverted Files are Used
Query on "time" AND "dark":
• The dictionary shows 2 docs contain "time" -> IDs 1 and 2 from the postings file
• The dictionary shows 1 doc contains "dark" -> ID 2 from the postings file
• Therefore, only doc 2 satisfies the query.
[IR pipeline diagram, repeated.]

The section that follows is about:

Querying (and ranking)
Simple query language: Boolean
– Terms + Connectors (or operators)
– terms
  • words
  • normalized (stemmed) words
  • phrases
– connectors
  • AND
  • OR
  • NOT
  • NEAR (pseudo-Boolean)
Word    Doc
Cat     x
Dog
Collar  x
Leash

Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
Boolean Searching
"Measurement of the width of cracks in prestressed concrete beams"

Formal Query:
cracks AND beams AND Width_measurement AND Prestressed_concrete

[Venn diagram over the four concepts: Cracks, Beams, Width measurement, Prestressed concrete]

Relaxed Query:
(C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
Ordering of Retrieved Documents
• Pure Boolean has no ordering
• In practice:
  – order chronologically
  – order by total number of "hits" on query terms
• What if one term has more hits than others?
• Is it better to have one hit on each term, or many hits on one term?
Boolean Model
• Advantages
  – simple queries are easy to understand
  – relatively easy to implement
• Disadvantages
  – difficult to specify what is wanted
  – too much returned, or too little
  – ordering not well determined
• Dominant language in commercial Information Retrieval systems until the WWW

Since the Boolean model is limited, let's consider a generalization...
Vector Model
• Documents are represented as "bags of words"
• Represented as vectors when used computationally
  – A vector is like an array of floating-point numbers
  – Has direction and magnitude
  – Each vector holds a place for every term in the collection
  – Therefore, most vectors are sparse
• "Smithers secretly loves Monty Burns" and "Monty Burns secretly loves Smithers" both map to:
  [ Burns, loves, Monty, secretly, Smithers ]
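The bag-of-words mapping can be sketched directly; the helper name and the fixed five-term vocabulary are assumptions for illustration:

```python
def bag_of_words(doc, vocabulary):
    """Map a document to a term-count vector over a fixed vocabulary.
    Word order is discarded, which is the point of the model."""
    tokens = doc.lower().split()
    return [tokens.count(term.lower()) for term in vocabulary]

vocab = ["Burns", "loves", "Monty", "secretly", "Smithers"]
a = bag_of_words("Smithers secretly loves Monty Burns", vocab)
b = bag_of_words("Monty Burns secretly loves Smithers", vocab)
print(a, a == b)  # identical vectors: the order information is lost
```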
Document Vectors
(one location for each word)

Each row is a document (ids A through I); the columns are the terms nova, galaxy, heat, h'wood, film, role, diet, fur. Nonzero weights only:
A: 10 5 3
B: 5 10
C: 10 8 7
D: 9 10 5
E: 10 10
F: 9 10
G: 5 7 9
H: 6 10 2 8
I: 7 5 1 3
We Can Plot the Vectors
[Figure: documents plotted in a 2-D space with axes "Star" and "Diet"; a doc about astronomy and a doc about movie stars lie near the Star axis, a doc about mammal behavior near the Diet axis. Illustration from Jurafsky & Martin]
Documents in 3D Vector Space
[Figure: documents D1 through D11 plotted in the 3-D space spanned by terms t1, t2, t3]
Vector Space Model

docs  Homer  Marge  Bart
D1      *             *
D2      *
D3             *      *
D4      *
D5      *      *      *
D6      *      *
D7             *
D8             *
D9                    *
D10            *      *
D11     *             *
Q              *

Note that the query is projected into the same vector space as the documents.
The query here is for "Marge".
We can use a vector similarity model to determine the best match to our query (details in a few slides).
But what weights should we use for the terms?
Assigning Weights to Terms
• Binary weights
• Raw term frequency
• tf x idf
  – Recall the Zipf distribution
  – Want to weight terms highly if they are
    • frequent in relevant documents ... BUT
    • infrequent in the collection as a whole
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1

We have already seen and discussed this model.
Raw Term Weights
• The frequency of occurrence for the term in each document is included in the vector

docs  t1  t2  t3
D1    2   0   3
D2    1   0   0
D3    0   4   7
D4    3   0   0
D5    1   6   3
D6    3   5   0
D7    0   8   0
D8    0  10   0
D9    0   0   1
D10   0   3   5
D11   4   0   1

This model is open to exploitation by websites... ("sex sex sex sex sex..." repeated over and over).
Counts can be normalized by document lengths.
tf * idf Weights
• tf * idf measure:
  – term frequency (tf)
  – inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
• Goal: assign a tf * idf weight to each term in each document

    w_ik = tf_ik × log(N / n_k)

where
    T_k   = term k in document D_i
    tf_ik = frequency of term T_k in document D_i
    idf_k = inverse document frequency of term T_k in collection C
    N     = total number of documents in collection C
    n_k   = number of documents in C that contain T_k
    idf_k = log(N / n_k)
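A minimal tf * idf computation, following the formula above. The three tiny documents are invented for illustration, and natural log is assumed (any base works, since it only rescales the weights):

```python
import math

# Three tiny documents, invented for illustration.
docs = {
    "D1": "bart homer bart",
    "D2": "marge homer",
    "D3": "marge bart lisa",
}
N = len(docs)

# n_k: number of documents containing term k (document frequency)
df = {}
for text in docs.values():
    for term in set(text.split()):
        df[term] = df.get(term, 0) + 1

def tfidf(term, doc_id):
    """w_ik = tf_ik * log(N / n_k)."""
    tf = docs[doc_id].split().count(term)
    return tf * math.log(N / df[term])

print(round(tfidf("bart", "D1"), 3))  # frequent here, rare-ish elsewhere
```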
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words

For a collection of 10000 documents (idf = log base 10 of N/n_k):

    log(10000 / 1)     = 4
    log(10000 / 20)    = 2.698
    log(10000 / 5000)  = 0.301
    log(10000 / 10000) = 0
Similarity Measures

    |Q ∩ D|                              Simple matching (coordination level match)

    2|Q ∩ D| / (|Q| + |D|)               Dice's Coefficient

    |Q ∩ D| / |Q ∪ D|                    Jaccard's Coefficient

    |Q ∩ D| / (|Q|^½ × |D|^½)            Cosine Coefficient

    |Q ∩ D| / min(|Q|, |D|)              Overlap Coefficient
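Treating the query and a document as sets of terms, the coefficients above can be written directly. The function names are mine, and the example sets echo the "cracks in prestressed concrete beams" query from a later slide:

```python
def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):
    return len(q & d) / len(q | d)

def cosine_set(q, d):
    return len(q & d) / (len(q) ** 0.5 * len(d) ** 0.5)

def overlap(q, d):
    return len(q & d) / min(len(q), len(d))

# Example sets, invented for illustration.
Q = {"cracks", "beams", "width"}
D = {"cracks", "beams", "concrete", "prestressed"}
print(jaccard(Q, D))  # 2 shared terms out of 5 distinct -> 0.4
```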
Cosine
[Figure: 2-D plot of the query Q = (0.4, 0.8) against documents D1 = (0.8, 0.3) and D2 = (0.2, 0.7)]

    cos θ1 = 0.74    (angle between Q and D1)
    cos θ2 = 0.98    (angle between Q and D2)
Vector Space Similarity Measure

    Q   = (w_q1, w_q2, ..., w_qt)
    D_i = (w_di1, w_di2, ..., w_dit)        (w = 0 if a term is absent)

If term weights are normalized:

    sim(Q, D_i) = Σ_{j=1..t} w_qj × w_dij

Otherwise, normalize in the similarity comparison:

    sim(Q, D_i) = Σ_j (w_qj × w_dij) / ( sqrt(Σ_j w_qj²) × sqrt(Σ_j w_dij²) )
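The normalized (cosine) form of the similarity can be sketched as follows, reusing the 2-D vectors from the worked cosine example:

```python
import math

def cosine_sim(q, d):
    """sim(Q, D) = sum_j q_j*d_j / (sqrt(sum q_j^2) * sqrt(sum d_j^2))."""
    dot = sum(qj * dj for qj, dj in zip(q, d))
    norms = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norms

# The 2-D vectors from the cosine example slide.
Q  = [0.4, 0.8]
D1 = [0.8, 0.3]
D2 = [0.2, 0.7]
print(cosine_sim(Q, D1), cosine_sim(Q, D2))  # D2 is the closer match
```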
Problems with Vector Space
• There is no real theoretical basis for the assumption of a term space
  – it is more for visualization than having any real basis
  – most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
  – Terms are not independent of all other terms
Probabilistic Models
• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Rely on accurate estimates of probabilities
Relevance Feedback
Definition: Relevance Feedback is the reformulation of a search query in response to feedback provided by the user for the results of previous versions of the query.
• Main Idea:
  – Modify the existing query based on relevance judgements
    • Query Expansion: extract terms from relevant documents and add them to the query
    • Term Re-weighting: and/or re-weight the terms already in the query
  – Two main approaches:
    • Automatic (pseudo-relevance feedback)
    • Users select relevant documents
      – Users/system select terms from an automatically-generated list
Suppose you are interested in bovine agriculture on the banks of the river Jordan...

[Feedback loop: Search → Display Results → Gather Feedback → Update Weights → Search → ...]

Initial query:
Term Vector  [Jordan, Bank, Bull, River]
Term Weights [  1   ,  1  ,  1  ,  1   ]

After feedback:
Term Vector  [Jordan, Bank, Bull, River]
Term Weights [ 1.1  , 0.1 , 1.3 , 1.2  ]
Rocchio Method

    Q1 = Q0 + (β/n1) Σ_{i=1..n1} R_i − (γ/n2) Σ_{i=1..n2} S_i

where
    Q0  = the vector for the initial query
    R_i = the vector for the relevant document i
    S_i = the vector for the non-relevant document i
    n1  = the number of relevant documents chosen
    n2  = the number of non-relevant documents chosen
    β and γ tune the importance of relevant and non-relevant terms
    (in some studies best to set β to 0.75 and γ to 0.25)
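The update above can be transcribed directly, with the initial query taken with weight 1 as in the formula. The example vectors over [Jordan, Bank, Bull, River] are invented for illustration:

```python
def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Q1 = Q0 + (beta/n1) * sum(R_i) - (gamma/n2) * sum(S_i)."""
    q1 = list(q0)
    if relevant:
        for j in range(len(q1)):
            q1[j] += beta * sum(r[j] for r in relevant) / len(relevant)
    if nonrelevant:
        for j in range(len(q1)):
            q1[j] -= gamma * sum(s[j] for s in nonrelevant) / len(nonrelevant)
    return q1

# Query over [Jordan, Bank, Bull, River]; judged vectors are invented.
q0  = [1.0, 1.0, 1.0, 1.0]
rel = [[0.0, 0.0, 1.0, 1.0]]  # user liked a doc about bulls and rivers
non = [[0.0, 1.0, 0.0, 0.0]]  # user rejected a doc about banks
print(rocchio(q0, rel, non))  # weights shift toward Bull and River
```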
Rocchio Illustration
Although we usually work in vector space for text, it is easier to visualize Euclidean space.
[Figure: original query; term re-weighting (note that both the location of the center and the shape of the query have changed); query expansion.]
Rocchio Method
• Rocchio automatically:
  – re-weights terms
  – adds in new terms (from relevant docs)
• Most methods perform similarly
  – results heavily dependent on test collection
• Machine learning methods are proving to work better than standard IR approaches like Rocchio
Using Relevance Feedback
• Known to improve results
• People don’t seem to like giving feedback!
[IR pipeline diagram, repeated.]

The section that follows is about:

Evaluation
Evaluation
• Why Evaluate?
• What to Evaluate?
• How to Evaluate?
Why Evaluate?
• Determine if the system is desirable
• Make comparative assessments
What to Evaluate?
• How much of the information need is satisfied.
• How much was learned about a topic.
• Incidental learning:– How much was learned about the collection.– How much was learned about other topics.
• How inviting the system is.
What to Evaluate?
What can be measured that reflects users' ability to use the system? (Cleverdon 66)
– Coverage of information
– Form of presentation
– Effort required / ease of use
– Time and space efficiency
– Recall
  • proportion of relevant material actually retrieved
– Precision
  • proportion of retrieved material actually relevant
(The last two together measure effectiveness.)
Relevant vs. Retrieved
[Venn diagram: within the set of all docs, the Relevant set and the Retrieved set overlap.]
Precision vs. Recall

    Recall    = |RelRetrieved| / |Rel in Collection|

    Precision = |RelRetrieved| / |Retrieved|

[Venn diagram: the Relevant and Retrieved sets within all docs; RelRetrieved is their intersection.]
Why Precision and Recall?
Intuition: get as much good stuff while at the same time getting as little junk as possible.
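Both measures are one-liners once the retrieved and relevant sets are known; the document ids here are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Precision = |RelRetrieved| / |Retrieved|
    Recall    = |RelRetrieved| / |Rel in Collection|"""
    rel_ret = retrieved & relevant
    precision = len(rel_ret) / len(retrieved) if retrieved else 0.0
    recall = len(rel_ret) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4, 5}   # invented document ids
retrieved = {3, 4, 5, 6}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # precision 0.75, recall 0.6
```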
Retrieved vs. Relevant Documents
[Venn diagrams illustrating four cases of the Retrieved set against the Relevant set:]
• Very high precision, very low recall
• Very low precision, very low recall (0 in fact)
• High recall, but low precision
• High precision, high recall (at last!)
Precision/Recall Curves
• There is a tradeoff between Precision and Recall
• So measure Precision at different levels of Recall
• Note: this is an AVERAGE over MANY queries
[Figure: precision (y-axis) vs. recall (x-axis) curve through several measured points]
Precision/Recall Curves
• Difficult to determine which of these two hypothetical results is better:
[Figure: two crossing precision/recall curves]
Document Cutoff Levels
• Another way to evaluate:
  – Fix the number of documents retrieved at several levels:
    • top 5, top 10, top 20, top 50, top 100, top 500
  – Measure precision at each of these levels
  – Take (weighted) average over results
• This is a way to focus on how well the system ranks the first k documents.
Problems with Precision/Recall
• Can't know true recall value
  – except in small collections
• Precision/Recall are related
  – A combined measure is sometimes more appropriate
• Assumes batch mode
  – Interactive IR is important and has different criteria for successful searches
• Assumes a strict rank ordering matters.
Relation to Contingency Table

                       Doc is Relevant   Doc is NOT relevant
Doc is retrieved             a                   b
Doc is NOT retrieved         c                   d

• Accuracy: (a+d) / (a+b+c+d)
• Precision: a / (a+b)
• Recall: a / (a+c)
• Why don't we use Accuracy for IR? (Assuming a large collection)
  – Most docs aren't relevant
  – Most docs aren't retrieved
  – This inflates the accuracy value
The E-Measure
Combine Precision and Recall into one number (van Rijsbergen 79):

    E = 1 − (1 + b²) × P × R / (b² × P + R)

P = precision
R = recall
b = measure of relative importance of P or R

For example, b = 0.5 means the user is twice as interested in precision as recall.
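The E-measure is a direct transcription of the formula; note that with b = 1 it reduces to 1 minus the F1 score (the harmonic mean of P and R):

```python
def e_measure(precision, recall, b=1.0):
    """E = 1 - (1 + b^2) * P * R / (b^2 * P + R)   (van Rijsbergen 79)."""
    if precision == 0 and recall == 0:
        return 1.0  # degenerate case: worst possible score
    return 1.0 - (1 + b * b) * precision * recall / (b * b * precision + recall)

print(e_measure(0.75, 0.6))         # with b = 1 this is 1 - F1
print(e_measure(0.75, 0.6, b=0.5))  # b < 1 weights precision more heavily
```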
How to Evaluate? Test Collections

TREC
• Text REtrieval Conference/Competition
  – Run by NIST (National Institute of Standards & Technology)
  – 2004 (November) will be the 13th year
• Collection: >6 Gigabytes (5 CD-ROMs), >1.5 Million Docs
  – Newswire & full-text news (AP, WSJ, Ziff, FT)
  – Government documents (Federal Register, Congressional Record)
  – Radio transcripts (FBIS)
  – Web "subsets"
TREC (cont.)
• Queries + Relevance Judgments
  – Queries devised and judged by "Information Specialists"
  – Relevance judgments done only for those documents retrieved -- not the entire collection!
• Competition
  – Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
  – Results judged on precision and recall, going up to a recall level of 1000 documents
TREC
• Benefits:
  – made research systems scale to large collections (pre-WWW)
  – allows for somewhat controlled comparisons
• Drawbacks:
  – emphasis on high recall, which may be unrealistic for what most users want
  – very long queries, also unrealistic
  – comparisons still difficult to make, because systems are quite different on many dimensions
  – focus on batch ranking rather than interaction
  – no focus on the WWW
TREC is changing
• Emphasis on specialized "tracks"
  – Interactive track
  – Natural Language Processing (NLP) track
  – Multilingual tracks (Chinese, Spanish)
  – Filtering track
  – High-Precision
  – High-Performance
• http://trec.nist.gov/
Homework…