Web Search & Information Retrieval
Boolean queries: Examples
Simple queries involving relationships between terms and documents:
- Documents containing the word Java
- Documents containing the word Java but not the word coffee
Proximity queries:
- Documents containing the phrase Java beans or the term API
- Documents where Java and island occur in the same sentence
Document preprocessing
Tokenization:
- Filtering away tags
- Tokens regarded as nonempty sequences of characters, excluding spaces and punctuation
- Each token represented by a suitable integer, tid, typically 32 bits
- Optional: stemming/conflation of words
- Result: document (did) transformed into a sequence of integers (tid, pos)
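The steps above can be sketched as follows; the tag-stripping regex and the lexicon layout are assumptions for illustration, not from the text:

```python
import re

def tokenize(doc_text, lexicon):
    """Strip tags, split into tokens, and map each token to an integer tid."""
    # Filter away markup tags (assumed simple HTML-like tags)
    text = re.sub(r"<[^>]+>", " ", doc_text)
    # Tokens: nonempty runs of word characters (no spaces or punctuation)
    tokens = re.findall(r"\w+", text.lower())
    result = []
    for pos, tok in enumerate(tokens):
        # Assign each distinct token a small integer id (fits in 32 bits)
        tid = lexicon.setdefault(tok, len(lexicon))
        result.append((tid, pos))
    # The document is now a sequence of (tid, pos) pairs
    return result

lexicon = {}
seq = tokenize("<p>Java beans and Java</p>", lexicon)
# seq == [(0, 0), (1, 1), (2, 2), (0, 3)]
```

Stemming, if used, would be applied to each token before the lexicon lookup.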
Storing tokens
A straightforward implementation uses a relational database:
- Example figure
- Space scales to almost 10 times
- Accesses to the table show a common pattern:
  - Reduce storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples
Indexing = transposing the document-term matrix
Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as “document/position”) may be implemented using a B-tree or a hash-table.
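A minimal in-memory sketch of the offset-storing variant, assuming documents have already been tokenized into (tid, pos) sequences (the dict-based layout stands in for the on-disk B-tree or hash-table):

```python
from collections import defaultdict

def build_index(docs):
    """Transpose the document-term matrix: tid -> sorted (did, pos) postings."""
    index = defaultdict(list)
    for did, token_seq in docs.items():
        for tid, pos in token_seq:
            index[tid].append((did, pos))
    # Keep each postings buffer lexicographically sorted by (did, pos)
    for postings in index.values():
        postings.sort()
    return dict(index)

# Hypothetical tokenized corpus: did -> [(tid, pos), ...]
docs = {1: [(0, 0), (1, 1)], 2: [(1, 0), (0, 1)]}
index = build_index(docs)
# index[0] == [(1, 0), (2, 1)]
```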
Stopwords
Function words and connectives:
- Appear in a large number of documents and are of little use in pinpointing documents
Indexing stopwords:
- Stopwords not indexed
  - Reduces index space and improves performance
- Replace stopwords with a placeholder (to remember the offset)
Issues:
- Queries containing only stopwords are ruled out
- Polysemous words that are stopwords in one sense but not in others
  - E.g., can as a verb vs. can as a noun
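The placeholder trick can be sketched as below; the stopword list and the use of None as the placeholder are illustrative assumptions:

```python
STOPWORDS = {"the", "and", "of", "a"}  # illustrative list, not from the text
PLACEHOLDER = None  # stands in for any stopword so offsets are preserved

def filter_stopwords(tokens):
    """Replace stopwords with a placeholder instead of deleting them,
    so the remaining tokens keep their original offsets."""
    return [PLACEHOLDER if t in STOPWORDS else t for t in tokens]

out = filter_stopwords(["java", "and", "the", "island"])
# ["java", None, None, "island"]: "island" keeps offset 3,
# so proximity queries over offsets still work
```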
Stemming
Conflating words to help match a query term with a morphological variant in the corpus:
- Remove inflections that convey part of speech, tense and number
- E.g., university and universal both stem to universe
Techniques:
- Morphological analysis (e.g., Porter's algorithm)
- Dictionary lookup (e.g., WordNet)
Stemming may increase recall, but at the price of precision:
- Abbreviations, polysemy and names coined in the technical and commercial sectors
- E.g., stemming “ides” to “IDE”, “SOCKS” to “sock”, or “gated” to “gate” may be bad!
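A toy suffix-stripping stemmer makes the precision risk concrete. This is a sketch, not Porter's algorithm; note how it over-stems "gated":

```python
def naive_stem(word):
    """Toy suffix stripper (NOT Porter's algorithm): drop a common
    inflectional suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

naive_stem("matches")  # "match" -- helpful conflation
naive_stem("gated")    # "gat"   -- over-stemming: hurts precision
```

Real stemmers apply ordered rewrite rules with measure conditions, but even they make the kinds of mistakes the slide lists.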
Maintaining indices over dynamic collections.
Relevance ranking
Keyword queries:
- In natural language
- Not precise, unlike SQL
- A Boolean yes/no decision for each response is unacceptable
Solution:
- Rate each document for how likely it is to satisfy the user's information need
- Sort in decreasing order of the score
- Present results in a ranked list
There is no algorithmic way of ensuring that the ranking strategy always favors the information need:
- The query is only a part of the user's information need
Responding to queries
Set-valued response:
- The response set may be very large
  - (E.g., by recent estimates, over 12 million Web pages contain the word java.)
- Demanding a more selective query from the user
Guessing the user's information need and ranking responses
Evaluating rankings
Evaluation procedure
Given benchmark:
- Corpus of n documents D
- A set of queries Q
- For each query q ∈ Q, an exhaustive set of relevant documents D_q ⊆ D, identified manually
Query submitted to the system:
- Ranked list of documents (d_1, d_2, …, d_n) retrieved
- Compute a 0/1 relevance list (r_1, r_2, …, r_n):
  - r_i = 1 iff d_i ∈ D_q, r_i = 0 otherwise
Recall and precision
Recall at rank k ≥ 1:
- Fraction of all relevant documents included in (d_1, …, d_k):
  recall(k) = (1/|D_q|) · Σ_{i=1..k} r_i
Precision at rank k:
- Fraction of the top k responses that are actually relevant:
  precision(k) = (1/k) · Σ_{i=1..k} r_i
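The two definitions above translate directly into code; the relevance list and |D_q| below are hypothetical values for illustration:

```python
def recall(r, k, num_relevant):
    """recall(k) = (1/|D_q|) * sum of r_i for i <= k."""
    return sum(r[:k]) / num_relevant

def precision(r, k):
    """precision(k) = (1/k) * sum of r_i for i <= k."""
    return sum(r[:k]) / k

r = [1, 0, 1, 0]   # hypothetical 0/1 relevance list
# assume |D_q| = 2 relevant documents in the benchmark
recall(r, 3, 2)    # 1.0: both relevant documents appear in the top 3
precision(r, 3)    # 2/3: two of the top 3 responses are relevant
```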
Other measures
Average precision:
- Sum of precision at each relevant hit position in the response list, divided by the total number of relevant documents:
  avg.precision = (1/|D_q|) · Σ_{k=1..|D|} r_k · precision(k)
- avg.precision = 1 iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document
Interpolated precision:
- Used to combine precision values from multiple queries
- Gives a precision-vs.-recall curve for the benchmark
- For each query, take the maximum precision obtained for the query at any recall greater than or equal to the given recall level ρ
- Average these together over all queries
Others, like measures of authority, prestige, etc.
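Average precision, as defined above, can be computed in one pass over the relevance list (the example inputs are hypothetical):

```python
def average_precision(r, num_relevant):
    """Sum precision(k) at each relevant rank k, divided by |D_q|."""
    total = 0.0
    hits = 0
    for k, rel in enumerate(r, start=1):
        if rel:
            hits += 1
            total += hits / k   # precision at rank k, counted only at hits
    return total / num_relevant

average_precision([1, 1, 0], 2)  # 1.0: all relevant docs ranked first
average_precision([0, 1, 1], 2)  # (1/2 + 2/3) / 2
```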
Precision-Recall tradeoff
- Interpolated precision cannot increase with recall
- Interpolated precision at recall level 0 may be less than 1
- At rank k = 0:
  - Precision (by convention) = 1, recall = 0
- Inspecting more documents:
  - Can increase recall
  - Precision may decrease: we start encountering more and more irrelevant documents
- A search engine with a good ranking function will generally show a negative relation between recall and precision
- The higher the curve, the better the engine
Precision and interpolated precision plotted against recall for the given relevance vector r_k. Missing entries are zeroes.
The vector space model
- Documents represented as vectors in a multi-dimensional Euclidean space
  - Each axis = a term (token)
- Coordinate of document d in the direction of term t determined by:
  - Term frequency TF(d,t)
    - Number of times term t occurs in document d, scaled in a variety of ways to normalize document length
  - Inverse document frequency IDF(t)
    - Scales down the coordinates of terms that occur in many documents
Term frequency
- Simplest choices:
  TF(d,t) = n(d,t), or
  TF(d,t) = n(d,t) / max_τ n(d,τ)
- The Cornell SMART system uses a smoothed version:
  TF(d,t) = 0 if n(d,t) = 0
  TF(d,t) = 1 + log(1 + log(n(d,t))) otherwise
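The SMART smoothed TF is a direct transcription of the piecewise formula above:

```python
import math

def smart_tf(n_dt):
    """Cornell SMART smoothed term frequency:
    TF = 0 if the term is absent, else 1 + log(1 + log(n(d,t)))."""
    if n_dt == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(n_dt))

smart_tf(0)  # 0.0
smart_tf(1)  # 1.0 (log(1) = 0, so 1 + log(1 + 0) = 1)
```

The double logarithm means repeated occurrences of a term add very little weight beyond the first few.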
Inverse document frequency
Given:
- D is the document collection and D_t is the set of documents containing t
Formulae:
- Mostly dampened functions of |D| / |D_t|
- SMART: IDF(t) = log(1 + |D| / |D_t|)
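The SMART IDF formula is equally short in code (the parameter names are mine):

```python
import math

def smart_idf(num_docs, num_docs_with_t):
    """SMART inverse document frequency: IDF(t) = log(1 + |D| / |D_t|)."""
    return math.log(1.0 + num_docs / num_docs_with_t)

# A term appearing in every document gets a low weight;
# a rare term gets a high weight.
smart_idf(100, 100)  # log(2)
```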
Vector space model
- Coordinate of document d along axis t: d_t = TF(d,t) · IDF(t)
- Document d is thus transformed to a vector in the TFIDF-space
Query q:
- Interpreted as a document
- Transformed to a vector in the same TFIDF-space as d
Measures of proximity
Distance measure:
- Magnitude of the vector difference, |d − q|
- Document vectors must be normalized to unit (L1 or L2) length
  - Else shorter documents dominate (since queries are short)
Cosine similarity:
- Cosine of the angle between d and q
- Shorter documents are penalized
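Cosine similarity over sparse TFIDF vectors can be sketched with plain dicts (term → weight); the example vectors are hypothetical:

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two sparse vectors (dicts term -> weight).
    Dividing by both norms makes the measure length-invariant."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

cosine_similarity({"java": 1.0, "island": 1.0}, {"java": 1.0})  # 1/sqrt(2)
```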
Relevance feedback
Users learn how to modify queries:
- Response list must have at least some relevant documents
Relevance feedback:
- `correcting' the ranks to the user's taste
- Automates the query refinement process
Rocchio's method:
- Folds in user feedback. To the query vector q:
  - Add a weighted sum of the vectors for the relevant documents D+
  - Subtract a weighted sum of the vectors for the irrelevant documents D−
- q' = α·q + β·Σ_{d ∈ D+} d − γ·Σ_{d ∈ D−} d
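Rocchio's update can be sketched over sparse vectors as below; the default α, β, γ values are illustrative choices, not from the text:

```python
def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*sum(D+) - gamma*sum(D-), with sparse dict vectors.
    The alpha/beta/gamma defaults here are illustrative assumptions."""
    q_new = {t: alpha * w for t, w in q.items()}
    for d in relevant:       # add weighted relevant-document vectors
        for t, w in d.items():
            q_new[t] = q_new.get(t, 0.0) + beta * w
    for d in irrelevant:     # subtract weighted irrelevant-document vectors
        for t, w in d.items():
            q_new[t] = q_new.get(t, 0.0) - gamma * w
    # Negative term weights are usually clipped to zero
    return {t: w for t, w in q_new.items() if w > 0}

q2 = rocchio({"java": 1.0}, [{"java": 1.0, "api": 1.0}], [])
# q2 == {"java": 1.75, "api": 0.75}: feedback pulls in the term "api"
```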
Relevance feedback (contd.)
Pseudo-relevance feedback:
- D+ and D− are generated automatically
- E.g., in the Cornell SMART system, the top 10 documents reported by the first round of query execution are included in D+
- γ is typically set to 0; D− is not used
Not a commonly available feature:
- Web users want instant gratification
- System complexity: executing the second-round query is slower and more expensive for major search engines
Ranking by odds ratio
- R: a Boolean random variable which represents the relevance of document d w.r.t. query q
- Rank documents by their odds ratio for relevance:
  Pr(R|q,d) / Pr(¬R|q,d) = [Pr(R,q,d) / Pr(q,d)] / [Pr(¬R,q,d) / Pr(q,d)]
                         = [Pr(d|R,q) · Pr(R|q)] / [Pr(d|¬R,q) · Pr(¬R|q)]
- Approximating the probability of d by the product of the probabilities of the individual terms in d:
  Pr(d|R,q) / Pr(d|¬R,q) ≈ Π_t Pr(x_t|R,q) / Pr(x_t|¬R,q)
  where x_t indicates whether term t occurs in d
- Approximately, writing a_{q,t} = Pr(x_t = 1 | R,q) and b_{q,t} = Pr(x_t = 1 | ¬R,q):
  Pr(R|q,d) / Pr(¬R|q,d) ∝ Π_{t ∈ q∩d} [a_{q,t} · (1 − b_{q,t})] / [b_{q,t} · (1 − a_{q,t})]
Meta-search systems
- Take the search engine to the document:
  - Forward queries to many geographically distributed repositories
  - Each has its own search service
  - Consolidate their responses
- Advantages:
  - Perform non-trivial query rewriting
    - Suit a single user query to many search engines with different query syntax
  - Surprisingly small overlap between crawls
- Consolidating responses:
  - The function goes beyond just eliminating duplicates
  - Search services do not provide standard ranks which can be combined meaningfully
Mining the Web, Chakrabarti and Ramakrishnan
Similarity search
- Cluster hypothesis:
  - Documents similar to relevant documents are also likely to be relevant
- Handling “find similar” queries:
  - Replication or duplication of pages
  - Mirroring of sites
Document similarity
- Jaccard coefficient of similarity between documents d1 and d2:
  - T(d) = set of tokens in document d
  - r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
  - Symmetric, reflexive, not a metric
  - Forgives any number of occurrences and any permutations of the terms
- 1 − r'(d1, d2) is a metric
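The Jaccard coefficient is a one-liner over token sets; the convention for two empty documents below is an assumption:

```python
def jaccard(d1_tokens, d2_tokens):
    """r'(d1,d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|.
    Sets discard counts and order, which is why the measure
    'forgives' repeated occurrences and permutations."""
    t1, t2 = set(d1_tokens), set(d2_tokens)
    if not (t1 | t2):
        return 1.0  # assumed convention for two empty documents
    return len(t1 & t2) / len(t1 | t2)

jaccard(["java", "island"], ["java", "coffee"])  # 1/3
```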