Introduction to Information Retrieval

CS276: Information Retrieval and Web Search
Chris Manning and Pandu Nayak

Evaluation
Situation
§ Thanks to your stellar performance in CS276, you quickly rise to VP of Search at internet retail giant nozama.com. Your boss brings in her nephew Sergey, who claims to have built a better search engine for nozama. Do you
§ Laugh derisively and send him to rival Tramlaw Labs?
§ Counsel Sergey to go to Stanford and take CS276?
§ Try a few queries on his engine and say "Not bad"?
§ …?
What could you ask Sergey?
§ How fast does it index?
§ Number of documents/hour
§ Incremental indexing – nozama adds 10K products/day
§ How fast does it search?
§ Latency and CPU needs for nozama's 5 million products
§ Does it recommend related products?
§ This is all good, but it says nothing about the quality of Sergey's search
§ You want nozama's users to be happy with the search experience
Sec. 8.6
How do you tell if users are happy?
§ Search returns products relevant to users
§ How do you assess this at scale?
§ Search results get clicked a lot
§ Misleading titles/summaries can cause users to click
§ Users buy after using the search engine
§ Or, users spend a lot of $ after using the search engine
§ Repeat visitors/buyers
§ Do users leave soon after searching?
§ Do they come back within a week/month/…?
Happiness: elusive to measure
§ Most common proxy: relevance of search results
§ But how do you measure relevance?
§ Pioneered by Cyril Cleverdon in the Cranfield Experiments
Sec. 8.1
Measuring relevance
§ Three elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. An assessment of either Relevant or Nonrelevant for each query and each document
Sec. 8.1
So you want to measure the quality of a new search algorithm
§ Benchmark documents – nozama's products
§ Benchmark query suite – more on this
§ Judgments of document relevance for each query
[Figure: 5 million nozama.com products × 50,000 sample queries → relevance judgments]
Relevance judgments
§ Binary (relevant vs. non-relevant) in the simplest case, more nuanced (0, 1, 2, 3, …) in others
§ What are some issues already?
§ 5 million times 50K takes us into the range of a quarter trillion judgments
§ If each judgment took a human 2.5 seconds, we'd still need 10^11 seconds, or nearly $300 million if you pay people $10 per hour to assess
§ 10K new products per day
Crowdsource relevance judgments?
§ Present query-document pairs to low-cost labor on online crowd-sourcing platforms
§ Hope that this is cheaper than hiring qualified assessors
§ Lots of literature on using crowd-sourcing for such tasks
§ Main takeaway – you get some signal, but the variance in the resulting judgments is very high
Evaluating an IR system
§ Note: user need is translated into a query
§ Relevance is assessed relative to the user need, not the query
§ E.g., Information need: My swimming pool bottom is becoming black and needs to be cleaned.
§ Query: pool cleaner
§ Assess whether the doc addresses the underlying need, not whether it has these words
Sec. 8.1
What else?
§ Still need test queries
§ Must be germane to docs available
§ Must be representative of actual user needs
§ Random query terms from the documents generally not a good idea
§ Sample from query logs if available
§ Classically (non-Web)
§ Low query rates – not enough query logs
§ Experts hand-craft "user needs"
Sec. 8.5
Some public test collections

Sec. 8.5

[Table of public test collections, including a typical TREC collection]
Now we have the basics of a benchmark
§ Let's review some evaluation measures
§ Precision
§ Recall
§ NDCG
§ …
Unranked retrieval evaluation: Precision and Recall – recap from IIR 8 / video
§ Binary assessments
Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
§ Precision P = tp/(tp + fp)
§ Recall R = tp/(tp + fn)

              | Relevant | Nonrelevant
Retrieved     | tp       | fp
Not Retrieved | fn       | tn
Sec. 8.3
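In code, the contingency table gives precision and recall directly; a minimal sketch (the function name and the example counts are mine, not from the slides):

```python
def precision_recall(tp, fp, fn):
    """Precision = tp/(tp+fp); Recall = tp/(tp+fn), per the table above."""
    return tp / (tp + fp), tp / (tp + fn)

# E.g. 8 docs retrieved, 6 of them relevant; 4 relevant docs were missed:
p, r = precision_recall(tp=6, fp=2, fn=4)  # p = 6/8 = 0.75, r = 6/10 = 0.6
```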
Rank-Based Measures
§ Binary relevance
§ Precision@K (P@K)
§ Mean Average Precision (MAP)
§ Mean Reciprocal Rank (MRR)
§ Multiple levels of relevance
§ Normalized Discounted Cumulative Gain (NDCG)
Introduc)ontoInforma)onRetrieval
Precision@K
§ Set a rank threshold K
§ Compute % relevant in top K
§ Ignores documents ranked lower than K
§ Ex:
§ Prec@3 of 2/3
§ Prec@4 of 2/4
§ Prec@5 of 3/5
§ In similar fashion we have Recall@K
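P@K is a one-liner over a list of binary judgments; a small sketch (the ranking `[1, 0, 1, 0, 1]` is just one list consistent with the slide's P@3/P@4/P@5 numbers, not necessarily the figure's):

```python
def precision_at_k(rels, k):
    """Fraction of the top-k results judged relevant (rels: 0/1 in rank order)."""
    return sum(rels[:k]) / k

rels = [1, 0, 1, 0, 1]          # relevant results at ranks 1, 3, 5
# precision_at_k(rels, 3) == 2/3; with k=4 it is 2/4; with k=5 it is 3/5
```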
A precision-recall curve

[Figure: sawtooth precision-recall curve, Recall (0.0–1.0) on the x-axis, Precision (0.0–1.0) on the y-axis]
Sec. 8.4
Lots more detail on this in the Coursera video
Mean Average Precision
§ Consider rank position of each relevant doc
§ K1, K2, … KR
§ Compute Precision@K for each K1, K2, … KR
§ Average precision = average of P@K
§ Ex: a ranking with relevant docs at ranks 1, 3, 5 has AvgPrec of
(1/3) · (1/1 + 2/3 + 3/5) ≈ 0.76
§ MAP is Average Precision across multiple queries/rankings
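The calculation above can be sketched as follows; `average_precision` is my name for the helper, and MAP would then be its mean over a query set:

```python
def average_precision(rels, R):
    """Mean of P@K over the ranks K of retrieved relevant docs (rels: 0/1 in
    rank order); relevant docs never retrieved contribute 0 (divide by R)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / R

# Slide example: relevant docs retrieved at ranks 1, 3, 5 and R = 3:
ap = average_precision([1, 0, 1, 0, 1], R=3)  # (1/1 + 2/3 + 3/5)/3 ≈ 0.76
```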
Average Precision
MAP
Mean average precision
§ If a relevant document never gets retrieved, we assume the precision corresponding to that relevant doc to be zero
§ MAP is macro-averaging: each query counts equally
§ Now perhaps most commonly used measure in research papers
§ Good for web search?
§ MAP assumes user is interested in finding many relevant documents for each query
§ MAP requires many relevance judgments in text collection
BEYOND BINARY RELEVANCE
[Figure: example search results annotated with graded relevance labels – fair, fair, Good]
Discounted Cumulative Gain
§ Popular measure for evaluating web search and related tasks
§ Two assumptions:
§ Highly relevant documents are more useful than marginally relevant documents
§ The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined
Discounted Cumulative Gain
§ Uses graded relevance as a measure of usefulness, or gain, from examining a document
§ Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
§ Typical discount is 1/log(rank)
§ With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
Summarize a Ranking: DCG
§ What if relevance judgments are on a scale of [0, r]? r > 2
§ Cumulative Gain (CG) at rank n
§ Let the ratings of the n documents be r1, r2, … rn (in ranked order)
§ CG = r1 + r2 + … + rn
§ Discounted Cumulative Gain (DCG) at rank n
§ DCG = r1 + r2/log₂2 + r3/log₂3 + … + rn/log₂n
§ We may use any base for the logarithm
Discounted Cumulative Gain
§ DCG is the total gain accumulated at a particular rank p:
DCG_p = rel₁ + Σ_{i=2..p} rel_i / log₂ i
§ Alternative formulation:
DCG_p = Σ_{i=1..p} (2^rel_i − 1) / log₂(i + 1)
§ Used by some web search companies
§ Emphasis on retrieving highly relevant documents
DCG Example
§ 10 ranked documents judged on a 0–3 relevance scale:
3, 2, 3, 0, 0, 1, 2, 2, 3, 0
§ Discounted gain:
3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
= 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
§ DCG (running totals): 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
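The worked example can be checked with a short sketch of the slide's DCG formulation (function name mine):

```python
import math

def dcg(gains):
    """DCG as on the slides: r1 undiscounted, then r_i / log2(i) for i >= 2."""
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

grades = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
# dcg(grades) ≈ 9.61, matching the final running total above
```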
Summarize a Ranking: NDCG
§ Normalized Discounted Cumulative Gain (NDCG) at rank n
§ Normalize DCG at rank n by the DCG value at rank n of the ideal ranking
§ The ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, etc.
§ Normalization useful for contrasting queries with varying numbers of relevant results
§ NDCG is now quite popular in evaluating Web search
NDCG – Example

4 documents: d1, d2, d3, d4

i | Ground Truth  | Ranking Function 1 | Ranking Function 2
  | Document  ri  | Document  ri       | Document  ri
1 | d4        2   | d3        2        | d3        2
2 | d3        2   | d4        2        | d2        1
3 | d2        1   | d2        1        | d4        2
4 | d1        0   | d1        0        | d1        0

DCG_GT  = 2 + 2/log₂2 + 1/log₂3 + 0/log₂4 = 4.6309
DCG_RF1 = 2 + 2/log₂2 + 1/log₂3 + 0/log₂4 = 4.6309
DCG_RF2 = 2 + 1/log₂2 + 2/log₂3 + 0/log₂4 = 4.2619
MaxDCG  = DCG_GT = 4.6309

NDCG_GT = 1.00   NDCG_RF1 = 1.00   NDCG_RF2 = 4.2619/4.6309 = 0.9203
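The example can be reproduced with a small sketch reusing the slides' DCG formulation (function names are mine):

```python
import math

def dcg(gains):
    """r1 undiscounted, then r_i / log2(i) for i >= 2, as in the slides."""
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """Divide DCG by the DCG of the ideal (sorted-descending) ordering."""
    return dcg(gains) / dcg(sorted(gains, reverse=True))

rf1 = [2, 2, 1, 0]   # Ranking Function 1: d3, d4, d2, d1
rf2 = [2, 1, 2, 0]   # Ranking Function 2: d3, d2, d4, d1
# ndcg(rf1) = 1.00, ndcg(rf2) ≈ 0.9203
```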
What if the results are not in a list?
§ Suppose there's only one Relevant Document
§ Scenarios:
§ known-item search
§ navigational queries
§ looking for a fact
§ Search duration ~ Rank of the answer
§ measures a user's effort
Mean Reciprocal Rank
§ Consider rank position, K, of first relevant doc
§ Could be – only clicked doc
§ Reciprocal Rank score = 1/K
§ MRR is the mean RR across multiple queries
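A minimal sketch of RR and MRR (function names and the two-query example are mine):

```python
def reciprocal_rank(rels):
    """1/K for the first relevant (or only clicked) result; 0 if none."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1 / rank
    return 0.0

def mean_reciprocal_rank(rankings):
    """Mean RR over a set of queries' ranked judgment lists."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)

# Two queries whose first relevant docs sit at ranks 2 and 4:
# MRR = (1/2 + 1/4) / 2 = 0.375
```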
Human judgments are
§ Expensive
§ Inconsistent
§ Between raters
§ Over time
§ Decay in value as documents/query mix evolves
§ Not always representative of "real users"
§ Rating vis-à-vis query, vs underlying need
§ So – what alternatives do we have?
USING USER CLICKS
What do clicks tell us?

[Figure: # of clicks received by rank position]

Strong position bias, so absolute click rates are unreliable
Relative vs absolute ratings

[Figure: a results page with the user's click sequence marked]

Hard to conclude Result1 > Result3
Probably can conclude Result3 > Result2
Pairwise relative ratings
§ Pairs of the form: Doc A better than Doc B for a query
§ Doesn't mean that Doc A is relevant to the query
§ Now, rather than assess a rank-ordering wrt per-doc relevance assessments
§ Assess in terms of conformance with historical pairwise preferences recorded from user clicks
A/B testing at web search engines
§ Purpose: Test a single innovation
§ Prerequisite: You have a large search engine up and running
§ Have most users use the old system
§ Divert a small proportion of traffic (e.g., 1%) to an experiment to evaluate an innovation
§ Full page experiment
§ Interleaved experiment
Sec. 8.6.3
Comparing two rankings via clicks (Joachims 2002)

Query: [support vector machines]

Ranking A          Ranking B
Kernel machines    Kernel machines
SVM-light          SVMs
Lucent SVM demo    Intro to SVMs
Royal Holl. SVM    Archives of SVM
SVM software       SVM-light
SVM tutorial       SVM software
Interleave the two rankings

Kernel machines (from B)
Kernel machines (from A)
SVMs (from B)
SVM-light (from A)
Intro to SVMs (from B)
Lucent SVM demo (from A)
Archives of SVM (from B)
Royal Holl. SVM (from A)
SVM-light (from B)
…

This interleaving starts with B
Remove duplicate results

Kernel machines (from B)
SVMs (from B)
SVM-light (from A)
Intro to SVMs (from B)
Lucent SVM demo (from A)
Archives of SVM (from B)
Royal Holl. SVM (from A)
…
Count user clicks

[Figure: the de-duplicated interleaved list with clicked results marked. The top result's click is credited "A, B" since it came from both rankings; two further clicks are each credited "A".]

Clicks
Ranking A: 3
Ranking B: 1
Interleaved ranking
§ Present interleaved ranking to users
§ Start randomly with ranking A or ranking B to even out presentation bias
§ Count clicks on results from A versus results from B
§ Better ranking will (on average) get more clicks
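The procedure above can be sketched as follows; this is a simplified alternate-and-deduplicate version, not Joachims's exact balanced-interleaving bookkeeping, and the function name is mine:

```python
def interleave(a, b, first_b=True):
    """Alternate between rankings B and A (or A and B), skipping results
    already taken, as in the worked SVM example."""
    order = (b, a) if first_b else (a, b)
    merged, seen = [], set()
    for pair in zip(*order):       # assumes equal-length rankings
        for doc in pair:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

ranking_a = ["Kernel machines", "SVM-light", "Lucent SVM demo",
             "Royal Holl. SVM", "SVM software", "SVM tutorial"]
ranking_b = ["Kernel machines", "SVMs", "Intro to SVMs",
             "Archives of SVM", "SVM-light", "SVM software"]

merged = interleave(ranking_a, ranking_b)
# merged begins: Kernel machines, SVMs, SVM-light, ...
```

Clicks on `merged` would then be credited back to whichever source ranking contributed each result.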
Facts/entities (what happens to clicks?)
Comparing two rankings to a baseline ranking
§ Given a set of pairwise preferences P
§ We want to measure two rankings A and B
§ Define a proximity measure between A and P
§ And likewise, between B and P
§ Want to declare the ranking with better proximity to be the winner
§ Proximity measure should reward agreements with P and penalize disagreements
Kendall tau distance
§ Let X be the number of agreements between a ranking (say A) and P
§ Let Y be the number of disagreements
§ Then the Kendall tau distance between A and P is (X − Y)/(X + Y)
§ Say P = {(1,2), (1,3), (1,4), (2,3), (2,4), (3,4)} and A = (1, 3, 2, 4)
§ Then X = 5, Y = 1 …
§ (What are the minimum and maximum possible values of the Kendall tau distance?)
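The slide's example can be checked with a short sketch (function name mine; it assumes every document mentioned in P appears in the ranking):

```python
def kendall_tau_distance(prefs, ranking):
    """(X - Y)/(X + Y): X = preference pairs the ranking orders the same way,
    Y = pairs it reverses."""
    pos = {doc: i for i, doc in enumerate(ranking)}
    x = sum(1 for a, b in prefs if pos[a] < pos[b])
    y = len(prefs) - x
    return (x - y) / (x + y)

P = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
A = (1, 3, 2, 4)
# X = 5 agreements, Y = 1 disagreement (A puts 3 before 2),
# so the measure is (5 - 1)/6 = 2/3
```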
Recap
§ Benchmarks consist of
§ Document collection
§ Query set
§ Assessment methodology
§ Assessment methodology can use raters, user clicks, or a combination
§ These get quantized into a goodness measure – Precision/NDCG etc.
§ Different engines/algorithms compared on a benchmark together with a goodness measure