Web Search & Information Retrieval
Boolean queries: Examples
Simple queries involving relationships between terms and documents:
- Documents containing the word Java
- Documents containing the word Java but not the word coffee
Proximity queries:
- Documents containing the phrase Java beans or the term API
- Documents where Java and island occur in the same sentence
Document preprocessing
Tokenization:
- Filtering away tags
- Tokens regarded as nonempty sequences of characters, excluding spaces and punctuation
- Each token represented by a suitable integer, tid, typically 32 bits
- Optional: stemming/conflation of words
- Result: document (did) transformed into a sequence of integers (tid, pos)
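The steps above can be sketched as follows; the tag-stripping regex and the lexicon layout are assumptions for illustration, not from the text:

```python
import re

def tokenize(doc_text, lexicon):
    """Strip tags, split into tokens, and map each token to an integer tid."""
    # Filter away markup tags (assumed simple HTML-like tags)
    text = re.sub(r"<[^>]+>", " ", doc_text)
    # Tokens: nonempty runs of word characters (no spaces or punctuation)
    tokens = re.findall(r"\w+", text.lower())
    result = []
    for pos, tok in enumerate(tokens):
        # Assign each distinct token a small integer id (fits in 32 bits)
        tid = lexicon.setdefault(tok, len(lexicon))
        result.append((tid, pos))
    # The document is now a sequence of (tid, pos) pairs
    return result

lexicon = {}
seq = tokenize("<p>Java beans and Java</p>", lexicon)
# seq == [(0, 0), (1, 1), (2, 2), (0, 3)]
```

Stemming, if used, would be applied to each token before the lexicon lookup.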
Storing tokens
A straightforward implementation uses a relational database:
- Example figure
- Space scales to almost 10 times
- Accesses to the table show a common pattern:
  - Reduce storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples
Indexing = transposing the document-term matrix
Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as “document/position”) may be implemented using a B-tree or a hash-table.
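A minimal in-memory sketch of the offset-storing variant, assuming documents have already been tokenized into (tid, pos) sequences (the dict-based layout stands in for the on-disk B-tree or hash-table):

```python
from collections import defaultdict

def build_index(docs):
    """Transpose the document-term matrix: tid -> sorted (did, pos) postings."""
    index = defaultdict(list)
    for did, token_seq in docs.items():
        for tid, pos in token_seq:
            index[tid].append((did, pos))
    # Keep each postings buffer lexicographically sorted by (did, pos)
    for postings in index.values():
        postings.sort()
    return dict(index)

# Hypothetical tokenized corpus: did -> [(tid, pos), ...]
docs = {1: [(0, 0), (1, 1)], 2: [(1, 0), (0, 1)]}
index = build_index(docs)
# index[0] == [(1, 0), (2, 1)]
```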
Stopwords
Function words and connectives:
- Appear in a large number of documents and are of little use in pinpointing documents
Indexing stopwords:
- Stopwords not indexed
  - Reduces index space and improves performance
- Replace stopwords with a placeholder (to remember the offset)
Issues:
- Queries containing only stopwords are ruled out
- Polysemous words that are stopwords in one sense but not in others
  - E.g., can as a verb vs. can as a noun
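The placeholder trick can be sketched as below; the stopword list and the use of None as the placeholder are illustrative assumptions:

```python
STOPWORDS = {"the", "and", "of", "a"}  # illustrative list, not from the text
PLACEHOLDER = None  # stands in for any stopword so offsets are preserved

def filter_stopwords(tokens):
    """Replace stopwords with a placeholder instead of deleting them,
    so the remaining tokens keep their original offsets."""
    return [PLACEHOLDER if t in STOPWORDS else t for t in tokens]

out = filter_stopwords(["java", "and", "the", "island"])
# ["java", None, None, "island"]: "island" keeps offset 3,
# so proximity queries over offsets still work
```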
Stemming
Conflating words to help match a query term with a morphological variant in the corpus:
- Remove inflections that convey part of speech, tense and number
- E.g., university and universal both stem to universe
Techniques:
- Morphological analysis (e.g., Porter's algorithm)
- Dictionary lookup (e.g., WordNet)
Stemming may increase recall, but at the price of precision:
- Abbreviations, polysemy and names coined in the technical and commercial sectors
- E.g., stemming “ides” to “IDE”, “SOCKS” to “sock”, or “gated” to “gate” may be bad!
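A toy suffix-stripping stemmer makes the precision risk concrete. This is a sketch, not Porter's algorithm; note how it over-stems "gated":

```python
def naive_stem(word):
    """Toy suffix stripper (NOT Porter's algorithm): drop a common
    inflectional suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

naive_stem("matches")  # "match" -- helpful conflation
naive_stem("gated")    # "gat"   -- over-stemming: hurts precision
```

Real stemmers apply ordered rewrite rules with measure conditions, but even they make the kinds of mistakes the slide lists.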
Maintaining indices over dynamic collections.
Relevance ranking
Keyword queries:
- In natural language
- Not precise, unlike SQL
- A Boolean yes/no decision for each response is unacceptable
Solution:
- Rate each document for how likely it is to satisfy the user's information need
- Sort in decreasing order of the score
- Present results in a ranked list
There is no algorithmic way of ensuring that the ranking strategy always favors the information need:
- The query is only a part of the user's information need
Responding to queries
Set-valued response:
- The response set may be very large
  - (E.g., by recent estimates, over 12 million Web pages contain the word java.)
- Demanding a more selective query from the user
Guessing the user's information need and ranking responses
Evaluating rankings
Evaluation procedure
Given benchmark:
- Corpus of n documents D
- A set of queries Q
- For each query q ∈ Q, an exhaustive set of relevant documents D_q ⊆ D, identified manually
Query submitted to the system:
- Ranked list of documents (d_1, d_2, …, d_n) retrieved
- Compute a 0/1 relevance list (r_1, r_2, …, r_n):
  - r_i = 1 iff d_i ∈ D_q, r_i = 0 otherwise
Recall and precision
Recall at rank k ≥ 1:
- Fraction of all relevant documents included in (d_1, …, d_k):
  recall(k) = (1/|D_q|) · Σ_{i=1..k} r_i
Precision at rank k:
- Fraction of the top k responses that are actually relevant:
  precision(k) = (1/k) · Σ_{i=1..k} r_i
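The two definitions above translate directly into code; the relevance list and |D_q| below are hypothetical values for illustration:

```python
def recall(r, k, num_relevant):
    """recall(k) = (1/|D_q|) * sum of r_i for i <= k."""
    return sum(r[:k]) / num_relevant

def precision(r, k):
    """precision(k) = (1/k) * sum of r_i for i <= k."""
    return sum(r[:k]) / k

r = [1, 0, 1, 0]   # hypothetical 0/1 relevance list
# assume |D_q| = 2 relevant documents in the benchmark
recall(r, 3, 2)    # 1.0: both relevant documents appear in the top 3
precision(r, 3)    # 2/3: two of the top 3 responses are relevant
```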
Other measures
Average precision:
- Sum of precision at each relevant hit position in the response list, divided by the total number of relevant documents:
  avg.precision = (1/|D_q|) · Σ_{k=1..|D|} r_k · precision(k)
- avg.precision = 1 iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document
Interpolated precision:
- Used to combine precision values from multiple queries
- Gives a precision-vs.-recall curve for the benchmark
- For each query, take the maximum precision obtained for the query at any recall greater than or equal to the given recall level ρ
- Average these together over all queries
Others, like measures of authority, prestige, etc.
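Average precision, as defined above, can be computed in one pass over the relevance list (the example inputs are hypothetical):

```python
def average_precision(r, num_relevant):
    """Sum precision(k) at each relevant rank k, divided by |D_q|."""
    total = 0.0
    hits = 0
    for k, rel in enumerate(r, start=1):
        if rel:
            hits += 1
            total += hits / k   # precision at rank k, counted only at hits
    return total / num_relevant

average_precision([1, 1, 0], 2)  # 1.0: all relevant docs ranked first
average_precision([0, 1, 1], 2)  # (1/2 + 2/3) / 2
```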
Precision-Recall tradeoff
- Interpolated precision cannot increase with recall
- Interpolated precision at recall level 0 may be less than 1
- At rank k = 0:
  - Precision (by convention) = 1, recall = 0
- Inspecting more documents:
  - Can increase recall
  - Precision may decrease: we start encountering more and more irrelevant documents
- A search engine with a good ranking function will generally show a negative relation between recall and precision
- The higher the curve, the better the engine
Precision and interpolated precision plotted against recall for the given relevance vector r_k. Missing entries are zeroes.
The vector space model
- Documents represented as vectors in a multi-dimensional Euclidean space
  - Each axis = a term (token)
- Coordinate of document d in the direction of term t determined by:
  - Term frequency TF(d,t)
    - Number of times term t occurs in document d, scaled in a variety of ways to normalize document length
  - Inverse document frequency IDF(t)
    - Scales down the coordinates of terms that occur in many documents
Term frequency
- Simplest choices:
  TF(d,t) = n(d,t), or
  TF(d,t) = n(d,t) / max_τ n(d,τ)
- The Cornell SMART system uses a smoothed version:
  TF(d,t) = 0 if n(d,t) = 0
  TF(d,t) = 1 + log(1 + log(n(d,t))) otherwise
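The SMART smoothed TF is a direct transcription of the piecewise formula above:

```python
import math

def smart_tf(n_dt):
    """Cornell SMART smoothed term frequency:
    TF = 0 if the term is absent, else 1 + log(1 + log(n(d,t)))."""
    if n_dt == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(n_dt))

smart_tf(0)  # 0.0
smart_tf(1)  # 1.0 (log(1) = 0, so 1 + log(1 + 0) = 1)
```

The double logarithm means repeated occurrences of a term add very little weight beyond the first few.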
Inverse document frequency
Given:
- D is the document collection and D_t is the set of documents containing t
Formulae:
- Mostly dampened functions of |D| / |D_t|
- SMART: IDF(t) = log(1 + |D| / |D_t|)
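The SMART IDF formula is equally short in code (the parameter names are mine):

```python
import math

def smart_idf(num_docs, num_docs_with_t):
    """SMART inverse document frequency: IDF(t) = log(1 + |D| / |D_t|)."""
    return math.log(1.0 + num_docs / num_docs_with_t)

# A term appearing in every document gets a low weight;
# a rare term gets a high weight.
smart_idf(100, 100)  # log(2)
```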
Vector space model
- Coordinate of document d along axis t: d_t = TF(d,t) · IDF(t)
- Document d is thus transformed to a vector in the TFIDF-space
Query q:
- Interpreted as a document
- Transformed to a vector in the same TFIDF-space as d
Measures of proximity
Distance measure:
- Magnitude of the vector difference, |d − q|
- Document vectors must be normalized to unit (L1 or L2) length
  - Else shorter documents dominate (since queries are short)
Cosine similarity:
- Cosine of the angle between d and q
- Shorter documents are penalized
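Cosine similarity over sparse TFIDF vectors can be sketched with plain dicts (term → weight); the example vectors are hypothetical:

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two sparse vectors (dicts term -> weight).
    Dividing by both norms makes the measure length-invariant."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

cosine_similarity({"java": 1.0, "island": 1.0}, {"java": 1.0})  # 1/sqrt(2)
```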
Relevance feedback
Users learn how to modify queries:
- Response list must have at least some relevant documents
Relevance feedback:
- `correcting' the ranks to the user's taste
- Automates the query refinement process
Rocchio's method:
- Folds in user feedback. To the query vector q:
  - Add a weighted sum of the vectors for the relevant documents D+
  - Subtract a weighted sum of the vectors for the irrelevant documents D−
- q' = α·q + β·Σ_{d ∈ D+} d − γ·Σ_{d ∈ D−} d
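Rocchio's update can be sketched over sparse vectors as below; the default α, β, γ values are illustrative choices, not from the text:

```python
def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*sum(D+) - gamma*sum(D-), with sparse dict vectors.
    The alpha/beta/gamma defaults here are illustrative assumptions."""
    q_new = {t: alpha * w for t, w in q.items()}
    for d in relevant:       # add weighted relevant-document vectors
        for t, w in d.items():
            q_new[t] = q_new.get(t, 0.0) + beta * w
    for d in irrelevant:     # subtract weighted irrelevant-document vectors
        for t, w in d.items():
            q_new[t] = q_new.get(t, 0.0) - gamma * w
    # Negative term weights are usually clipped to zero
    return {t: w for t, w in q_new.items() if w > 0}

q2 = rocchio({"java": 1.0}, [{"java": 1.0, "api": 1.0}], [])
# q2 == {"java": 1.75, "api": 0.75}: feedback pulls in the term "api"
```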
Relevance feedback (contd.)
Pseudo-relevance feedback:
- D+ and D− are generated automatically
- E.g., in the Cornell SMART system, the top 10 documents reported by the first round of query execution are included in D+
- γ is typically set to 0; D− is not used
Not a commonly available feature:
- Web users want instant gratification
- System complexity: executing the second-round query is slower and more expensive for major search engines
Ranking by odds ratio
- R: a Boolean random variable which represents the relevance of document d w.r.t. query q
- Rank documents by their odds ratio for relevance:
  Pr(R|q,d) / Pr(¬R|q,d) = [Pr(R,q,d) / Pr(q,d)] / [Pr(¬R,q,d) / Pr(q,d)]
                         = [Pr(d|R,q) · Pr(R|q)] / [Pr(d|¬R,q) · Pr(¬R|q)]
- Approximating the probability of d by the product of the probabilities of the individual terms in d:
  Pr(d|R,q) / Pr(d|¬R,q) ≈ Π_t Pr(x_t|R,q) / Pr(x_t|¬R,q)
  where x_t indicates whether term t occurs in d
- Approximately, writing a_{q,t} = Pr(x_t = 1 | R,q) and b_{q,t} = Pr(x_t = 1 | ¬R,q):
  Pr(R|q,d) / Pr(¬R|q,d) ∝ Π_{t ∈ q∩d} [a_{q,t} · (1 − b_{q,t})] / [b_{q,t} · (1 − a_{q,t})]
Meta-search systems
- Take the search engine to the document:
  - Forward queries to many geographically distributed repositories
  - Each has its own search service
  - Consolidate their responses
- Advantages:
  - Perform non-trivial query rewriting
    - Suit a single user query to many search engines with different query syntax
  - Surprisingly small overlap between crawls
- Consolidating responses:
  - The function goes beyond just eliminating duplicates
  - Search services do not provide standard ranks which can be combined meaningfully
Mining the Web, Chakrabarti and Ramakrishnan
Similarity search
- Cluster hypothesis:
  - Documents similar to relevant documents are also likely to be relevant
- Handling “find similar” queries:
  - Replication or duplication of pages
  - Mirroring of sites
Document similarity
- Jaccard coefficient of similarity between documents d1 and d2:
  - T(d) = set of tokens in document d
  - r'(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
  - Symmetric, reflexive, not a metric
  - Forgives any number of occurrences and any permutations of the terms
- 1 − r'(d1, d2) is a metric
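The Jaccard coefficient is a one-liner over token sets; the convention for two empty documents below is an assumption:

```python
def jaccard(d1_tokens, d2_tokens):
    """r'(d1,d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|.
    Sets discard counts and order, which is why the measure
    'forgives' repeated occurrences and permutations."""
    t1, t2 = set(d1_tokens), set(d2_tokens)
    if not (t1 | t2):
        return 1.0  # assumed convention for two empty documents
    return len(t1 & t2) / len(t1 | t2)

jaccard(["java", "island"], ["java", "coffee"])  # 1/3
```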