TRANSCRIPT
DD2476 Search Engines and Information Retrieval Systems
Lecture 4: Scoring, Weighting, Vector Space Model
Hedvig Kjellström, [email protected], www.csc.kth.se/DD2476
Assignment 1
• Thus far, Boolean queries:
  BRUTUS AND CAESAR AND NOT CALPURNIA
• Good for:
  - Expert users with precise understanding of their needs and the collection
  - Computer applications
• Not good for:
  - The majority of users
• This is particularly true of web search
Problem with Boolean Search: Feast or Famine
• Boolean queries often result in either too few (= 0) or too many (1000s) results.
  - Query 1: "standard user dlink 650": 200,000 hits
  - Query 2: "standard user dlink 650 no card found": 0 hits
• It takes a lot of skill to come up with a query that produces a manageable number of hits.
  - AND gives too few; OR gives too many
Ranked Retrieval
Feast or Famine: Not a Problem in Ranked Retrieval
• Large result sets are no issue:
  - Show the top K (≈ 10) results
  - Option to see more results
• Premise: the ranking algorithm works
Today
• Tf-idf and the vector space model (Manning Chapter 6.2-4)
  - Term frequency, collection statistics
  - Vector space scoring and ranking
• Scoring and ranking (Manning Chapter 6.1, 7)
  - Speeding up vector space scoring and ranking
  - Using zones in documents
  - Putting together a complete search system
Tf-idf and the Vector Space Model (Manning Chapter 6.2-4)
Scoring as the Basis of Ranked Retrieval
• Wish to return, in order, the documents most likely to be useful to the searcher
• Rank-order documents with respect to a query
• Assign a score, say in [0, 1], to each document
  - Measures how well document and query "match"
Query-Document Matching Scores
• One-term query: BRUTUS
• Term not in document: score 0
• More appearances of the term in the document: higher score
Recall (Lecture 1): Binary Term-Document Incidence Matrix
• Document represented by binary vector ∈ {0,1}^|V|
            Antony and  Julius  The
            Cleopatra   Caesar  Tempest  Hamlet  Othello  Macbeth
ANTONY          1          1       0        0       0        0
BRUTUS          1          1       0        1       0        0
CAESAR          1          1       0        1       1        1
CALPURNIA       0          1       0        0       0        0
CLEOPATRA       1          0       0        0       0        0
MERCY           1          0       1        1       1        1
WORSER          1          0       1        1       1        0
Term-Document Count Matrix
• Document represented by count vector ∈ ℕ^|V|
            Antony and  Julius  The
            Cleopatra   Caesar  Tempest  Hamlet  Othello  Macbeth
ANTONY         157         73      0        0       0        0
BRUTUS           4        157      0        1       0        0
CAESAR         232        227      0        2       1        1
CALPURNIA        0         10      0        0       0        0
CLEOPATRA       57          0      0        0       0        0
MERCY            2          0      3        5       5        1
WORSER           2          0      1        1       1        0
Bag of Words Model
• Ordering of words in the document is not considered:
  - "John is quicker than Mary" and "Mary is quicker than John" get the same representation
• This is called the bag of words model
• In a sense, a step back: the positional index was able to distinguish these two documents
• Task 2.1 Ranked Retrieval: back to bag-of-words
Term Frequency tf
• Term frequency tf_{t,d} of term t in document d: the number of times that t occurs in d
• How to use tf for query-document match scores?
• Raw term frequency is not what we want:
  - A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term
  - But not 10 times more relevant
(Note: frequency = count in IR)
Example from the count matrix: tf_{ANTONY, Antony and Cleopatra} = 157
Log-Frequency Weighting
• Log-frequency weight of term t in document d:

$$w_{t,d} = \begin{cases} 1 + \log_{10}\mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$$
Exercise (2 minutes)
• Document d
• Fill in the w_{t,d} column
term          tf_{t,d}    w_{t,d}
airplane             0
shakespeare          1
calpurnia           10
under              100
the              1,000
$$w_{t,d} = \begin{cases} 1 + \log_{10}\mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$$
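A minimal Python sketch of this weighting, with the exercise values as a check (the function name is illustrative):

```python
import math

def log_freq_weight(tf):
    """w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

for tf in [0, 1, 10, 100, 1000]:
    print(tf, log_freq_weight(tf))  # weights: 0, 1, 2, 3, 4
```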
Simple Query-Document Score
• Queries with more than one term
• Score for a document-query pair: sum over the terms t that occur in both q and d:

$$\mathrm{score}(q,d) = \sum_{t \in q \cap d} \left(1 + \log_{10}\mathrm{tf}_{t,d}\right)$$

• The score is 0 if none of the query terms is present in the document
• What is the problem with this measure?
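A minimal Python sketch of this score (names are illustrative; the document is given as a list of tokens):

```python
import math
from collections import Counter

def simple_score(query_terms, doc_terms):
    """score(q, d) = sum over t in q ∩ d of (1 + log10 tf_{t,d})."""
    tf = Counter(doc_terms)  # term frequencies in the document
    return sum(1.0 + math.log10(tf[t]) for t in set(query_terms) if tf[t] > 0)
```

One shortcoming, taken up on the next slides: every matching term counts equally, regardless of how common it is in the collection.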
Document Frequency
• Rare terms are more informative than frequent terms
• Example: rare word ARACHNOCENTRIC
  - A document containing this term is very likely to be relevant to the query ARACHNOCENTRIC
  ⇒ High weight for rare terms like ARACHNOCENTRIC
• Example: common word THE
  - A document containing this term can be about anything
  ⇒ Very low weight for common terms like THE
• We will use document frequency (df) to capture this.
idf Weight
• df_t is the document frequency of term t: the number of documents that contain t
  - df_t is an inverse measure of the informativeness of t
  - df_t ≤ N
• Informativeness idf (inverse document frequency) of t:

$$\mathrm{idf}_t = \log_{10}(N/\mathrm{df}_t)$$

  - log(N/df_t) instead of N/df_t to "dampen" the effect of idf
• Mathematical reasons later in the course
Exercise (2 minutes)
• Suppose N = 1,000,000
• Fill in the idf_t column
term        df_t        idf_t
calpurnia           1
animal            100
sunday          1,000
fly            10,000
under         100,000
the         1,000,000

$$\mathrm{idf}_t = \log_{10}(N/\mathrm{df}_t)$$
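A small Python check of the exercise, assuming N = 1,000,000:

```python
import math

N = 1_000_000

def idf(df):
    """idf_t = log10(N / df_t)."""
    return math.log10(N / df)

for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, idf(df))  # 6, 4, 3, 2, 1, 0
```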
Effect of idf on Ranking
• Does idf have an effect on ranking for one-term queries, like IPHONE?
  - No: it scales every matching document's score by the same factor
• idf only has an effect for queries with more than one term:
  - Query CAPRICIOUS PERSON: idf puts more weight on CAPRICIOUS than on PERSON
Collection vs. Document Frequency
• Collection frequency of t: the number of occurrences of t in the collection, counting multiple occurrences
• Example:

Word       Collection frequency   Document frequency
insurance        10440                  3997
try              10422                  8760

• Which word is a better search term (and should get a higher weight)?
tf-idf Weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight:

$$w_{t,d} = (1 + \log_{10}\mathrm{tf}_{t,d}) \times \log_{10}(N/\mathrm{df}_t)$$

• Best known weighting scheme in information retrieval
  - Note: the "-" in tf-idf is a hyphen, not a minus sign!
  - Alternative names: tf.idf, tf × idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
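Combining the two factors, a minimal sketch (illustrative names):

```python
import math

def tf_idf(tf, df, N):
    """w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t); 0 if the term is absent."""
    if tf == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(N / df)
```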
Binary → Count → Weight Matrix
            Antony and  Julius  The
            Cleopatra   Caesar  Tempest  Hamlet  Othello  Macbeth
ANTONY        5.25       3.18     0        0       0       0.35
BRUTUS        1.21       6.1      0        1       0       0
CAESAR        8.59       2.54     0        1.51    0.25    1
CALPURNIA     0          1.54     0        0       0       0
CLEOPATRA     2.85       0        0        0       0       0
MERCY         1.51       0        1.9      0.12    5.25    0.88
WORSER        1.37       0        0.11     4.15    0.25    1.95

• Document represented by tf-idf weight vector ∈ R^|V|
Documents as Vectors
• So we have a |V|-dimensional vector space
  - Terms are axes/dimensions
  - Documents are points in this space
• Very high-dimensional
  - On the order of 10^7 dimensions for a web search engine
• Very sparse vectors: most entries are zero
Queries as Vectors
• Key idea 1: Represent queries as vectors in the same space
• Key idea 2: Rank documents according to proximity to the query in this space
  - proximity = similarity of vectors
  - proximity ≈ inverse of distance
• Recall:
  - Get away from the Boolean model
  - Rank more relevant documents higher than less relevant documents
Formalizing Vector Space Proximity
• First cut: Euclidean distance?

$$\|\vec{q} - \vec{d}\|_2$$

• Euclidean distance is a bad idea...
• ...because Euclidean distance is large for vectors of different lengths
  - What determines length here?
Exercise (5 minutes)
• Euclidean distance is bad for:
  - vectors of different lengths (documents with different numbers of words)
  - high-dimensional vectors (large dictionaries)
• Discuss in pairs:
  - Can you come up with a better difference measure?
Use Angle Instead of Distance
• Thought experiment: take a document d and append it to itself; call this document d′
• "Semantically" d and d′ have the same content
• The Euclidean distance between the two documents can be quite large
• The angle between the two documents is 0, corresponding to maximal similarity
• Key idea:
  - Length unimportant
  - Rank documents according to angle from query
Problems with Angle
• Angles are expensive to compute (they require an inverse trigonometric function)
• Find a computationally cheaper, equivalent measure
  - One that gives the same ranking order ⇒ monotonically increasing/decreasing with angle
• Any ideas?
Cosine More Efficient Than Angle
• Monotonically decreasing!
[Figure: plot of cos(angle) for angles from 0° to 90°; it decreases monotonically from 1 to 0]
Length Normalization
• Computing cosine similarity involves length-normalizing the document and query vectors
• L2 norm:

$$\|\vec{x}\|_2 = \sqrt{\textstyle\sum_i x_i^2}$$

• Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
• Recall:
  - Length unimportant
  - Rank documents according to angle from query
Cosine Similarity
• q_i is the tf-idf weight of term i in the query
• d_i is the tf-idf weight of term i in the document
• cos(q, d) is the cosine similarity of q and d = the cosine of the angle between q and d:

$$\cos(\vec{q},\vec{d}) = \frac{\vec{q}}{\|\vec{q}\|_2} \cdot \frac{\vec{d}}{\|\vec{d}\|_2} = \frac{\vec{q}\cdot\vec{d}}{\|\vec{q}\|_2\,\|\vec{d}\|_2} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\;\sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

(· denotes the dot / scalar / inner product)
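A direct Python sketch of this formula for two equal-length weight vectors (illustrative, not the lecture's code):

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = (q · d) / (||q||_2 ||d||_2)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)
```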
Cosine Similarity
• In reality:
  - Length-normalize when the document is added to the index: $\vec{d} \leftarrow \vec{d}/\|\vec{d}\|_2$
  - Length-normalize the query: $\vec{q} \leftarrow \vec{q}/\|\vec{q}\|_2$
  - Cosine similarity is then fast to compute:

$$\cos(\vec{q},\vec{d}) = \vec{q}\cdot\vec{d} = \sum_{i=1}^{|V|} q_i d_i$$
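A sketch of the same idea with normalization done up front, so that query-time scoring is a plain dot product (names illustrative):

```python
import math

def normalize(v):
    """Divide a vector by its L2 norm; done once, e.g. at indexing time."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fast_cosine(q_unit, d_unit):
    """For unit vectors, cosine similarity is just the dot product."""
    return sum(qi * di for qi, di in zip(q_unit, d_unit))
```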
Cosine Similarity Example
• How similar are the novels
  - SaS: Sense and Sensibility
  - PaP: Pride and Prejudice
  - WH: Wuthering Heights?
• Term frequency tf_t (no idf weighting!):

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38
Cosine Similarity Example
• Log frequency weights:

$$w_{t,d} = \begin{cases} 1 + \log_{10}\mathrm{tf}_{t,d}, & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$$

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58
Cosine Similarity Example
• After length normalization, $\vec{d} \leftarrow \vec{d}/\|\vec{d}\|_2$ with $\vec{d} = [w_{1,d}, \ldots, w_{4,d}]$:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0 + 0 × 0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

• Why is cos(SaS,PaP) > cos(*,WH)?
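The whole example can be reproduced with a few lines of Python (term order: affection, jealous, gossip, wuthering):

```python
import math

tf = {  # raw term frequencies from the slides
    "SaS": [115, 10, 2, 0],
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}

def unit_log_weights(freqs):
    w = [1 + math.log10(f) if f > 0 else 0.0 for f in freqs]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]

docs = {name: unit_log_weights(freqs) for name, freqs in tf.items()}

def cos(a, b):
    return sum(x * y for x, y in zip(docs[a], docs[b]))

print(cos("SaS", "PaP"), cos("SaS", "WH"), cos("PaP", "WH"))  # ≈ 0.94, 0.79, 0.69
```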
Computing Cosine Scores
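A minimal Python sketch of term-at-a-time cosine scoring with accumulators, in the spirit of the CosineScore algorithm in Manning Chapter 6 (the data structures here are assumptions, not the lecture's):

```python
import heapq

def cosine_scores(query_weights, postings, doc_lengths, k):
    """query_weights: term -> w_{t,q}; postings: term -> [(doc_id, w_{t,d}), ...];
    doc_lengths: doc_id -> vector length. Returns the top-k (doc_id, score) pairs."""
    scores = {}
    for term, w_tq in query_weights.items():
        for doc, w_td in postings.get(term, []):
            scores[doc] = scores.get(doc, 0.0) + w_tq * w_td  # accumulate
    for doc in scores:
        scores[doc] /= doc_lengths[doc]  # normalize by document length
    # (dividing by the query length would not change the ranking)
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```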
Summary: Vector Space Ranking
• Represent the query as a tf-idf vector
• Represent each document as a tf-idf vector
• Compute the cosine similarity score for the query vector and each document vector
• Rank documents with respect to the query by score
• Return the top K (e.g., K = 10) to the user
Scoring and Ranking (Manning Chapter 6.1, 7)
Efficient Cosine Ranking
• Find the K docs in the collection "nearest" to the query ⇒ the K largest query-document cosine scores
• Up to now: linear scan through the collection
  - Did not make use of sparsity in term space
  - Computed all cosine scores
• Efficient cosine ranking:
  - Computing each cosine score efficiently
  - Choosing the K largest scores efficiently
Computing Cosine Scores Efficiently
• Approximation:
  - Assume that terms occur only once in the query:

$$w_{t,q} \leftarrow \begin{cases} 1, & \text{if } w_{t,q} > 0 \\ 0, & \text{otherwise} \end{cases}$$

• Works for short queries (|q| << N)
• Works since ranking is only relative
Computing Cosine Scores Efficiently
[Figure: the cosine scoring algorithm, with the query-weight step that the binary approximation speeds up highlighted: "Speedup here"]
Computing Cosine Scores Efficiently
• Downside of the approximation: sometimes we get it wrong
  - A document not in the top K may creep into the list of K output documents
• How bad is this?
• Cosine similarity is only a proxy
  - The user has a task and a query formulation
  - Cosine matches documents to the query
  - Thus cosine is anyway a proxy for user happiness
  - If we get a list of K documents "close" to the top K by the cosine measure, that should be OK
Choosing K Largest Scores Efficiently
• Retrieve the top K documents with respect to the query
  - Do not totally order all documents in the collection
• Do selection:
  - Avoid visiting all documents
• We already do selection:
  - Sparse term-document incidence matrix, |d| << N
  - Many cosine scores = 0
  - Only visit documents with nonzero cosine scores (≥1 term in common with the query)
Choosing K Largest Scores Efficiently: Generic Approach
• Find a set A of contenders, with K < |A| << N
  - A does not necessarily contain the top K, but has many docs from among the top K
  - Return the top K documents in A
• Think of A as pruning non-contenders
• The same approach can be used with any scoring function!
• We will look at several schemes following this approach
Choosing K Largest Scores Efficiently: Index Elimination
• The basic algorithm FastCosineScore only considers documents containing at least one query term
  - All considered documents have ≥1 term in common with the query
• Take this further:
  - Only consider high-idf query terms
  - Only consider documents containing many query terms
Choosing K Largest Scores Efficiently: Index Elimination
• Example: CATCHER IN THE RYE
• Only accumulate scores from CATCHER and RYE
• Intuition:
  - IN and THE contribute little to the scores and do not alter the rank-ordering much
  - Compare to stop words
• Benefit:
  - Posting lists of low-idf terms have many documents; these documents are eliminated from the set A of contenders
Choosing K Largest Scores Efficiently: Index Elimination
• Example: CAESAR ANTONY CALPURNIA BRUTUS
• Only compute scores for documents containing ≥3 of the query terms
[Figure: postings lists]
ANTONY:    3 → 4 → 8 → 16 → 32 → 64 → 128
BRUTUS:    1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
CAESAR:    2 → 4 → 8 → 16 → 32 → 64 → 128
CALPURNIA: 13 → 16 → 32
(Documents 8, 16, and 32 appear in at least three of the four lists.)
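Given the postings lists above, the candidate set can be computed with a counter; a hedged Python sketch:

```python
from collections import Counter

postings = {  # the postings lists from the figure
    "antony":    [3, 4, 8, 16, 32, 64, 128],
    "brutus":    [1, 2, 3, 5, 8, 13, 21, 34],
    "caesar":    [2, 4, 8, 16, 32, 64, 128],
    "calpurnia": [13, 16, 32],
}

def candidates(query_terms, min_terms=3):
    """Documents that contain at least min_terms of the query terms."""
    counts = Counter(doc for t in query_terms for doc in postings.get(t, []))
    return sorted(doc for doc, n in counts.items() if n >= min_terms)

print(candidates(["caesar", "antony", "calpurnia", "brutus"]))  # [8, 16, 32]
```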
Choosing K Largest Scores Efficiently: Champion Lists
• Precompute, for each dictionary term t, the r documents of highest tf-idf_{t,d} weight
  - Call this the champion list (fancy list, top docs) for t
• Benefit:
  - At query time, only compute scores for documents in the champion lists ⇒ fast
• Issue:
  - r is chosen at index build time
  - Too large: slow
  - Too small: r < K
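A minimal sketch of building champion lists from precomputed tf-idf weights; the structure of `tfidf` (term -> {doc_id: weight}) is an assumption for illustration:

```python
def build_champion_lists(tfidf, r):
    """For each term, keep only the r documents with highest tf-idf weight."""
    return {
        term: sorted(weights, key=weights.get, reverse=True)[:r]
        for term, weights in tfidf.items()
    }
```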
Exercise (5 minutes)
• Index Elimination: consider only high-idf query terms and only documents with many query terms
• Champion Lists: for each term t, consider only the r documents with highest tf-idf_{t,d} values
• Think quietly and write down:
  - How do Champion Lists relate to Index Elimination? Can they be used together?
  - How can Champion Lists be implemented in an inverted index?
Choosing K Largest Scores Efficiently: Static Quality Scores
• Develop the idea of champion lists
• We want top-ranking documents to be both relevant and authoritative
  - Relevance: cosine scores
  - Authority: a query-independent property
• Examples of authority signals:
  - Wikipedia among websites (qualitative)
  - Articles in certain newspapers (qualitative)
  - A paper with many citations (quantitative)
  - PageRank (quantitative)
More in Lecture 5
Choosing K Largest Scores Efficiently: Static Quality Scores
• Assign a query-independent quality score g(d) in [0,1] to each document d
• net-score(q,d) = g(d) + cos(q,d)
  - Two "signals" of user happiness
  - Combinations other than equal weighting are possible
• Seek the top K documents by net score
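A hedged sketch of net scoring, assuming the cosine scores have already been computed for the candidate documents:

```python
import heapq

def top_k_by_net_score(g, cosine_scores, k):
    """net-score(q, d) = g(d) + cos(q, d).
    g: doc_id -> static quality in [0, 1]; cosine_scores: doc_id -> cos(q, d)."""
    net = {d: g.get(d, 0.0) + s for d, s in cosine_scores.items()}
    return heapq.nlargest(k, net.items(), key=lambda item: item[1])
```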
Choosing K Largest Scores Efficiently: Champion Lists + Static Quality Scores
• Can combine champion lists with g(d)-ordering
• Maintain for each term t a champion list of the r documents with highest g(d) + tf-idf_{t,d}
• Seek the top K results from only the documents in these champion lists
Choosing K Largest Scores Efficiently: Cluster Pruning
• Pick √N documents at random: the leaders
• For every other document, pre-compute the nearest leader
  - Documents attached to a leader: its followers
  - The average leader has ~√N followers
• Process a query as follows:
  - Given query q, find its nearest leader l
  - Seek the K nearest documents from among only l's followers
• Variant:
  - More than one leader for each document
More in Lecture 11
Choosing K Largest Scores Efficiently: Cluster Pruning
[Figure: documents as points; leaders marked among them, each with attached followers; the query is matched to its nearest leader]
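A self-contained Python sketch of cluster pruning under these assumptions (random leaders, cosine as the proximity measure; all names illustrative):

```python
import math
import random

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_clusters(docs):
    """docs: doc_id -> vector. Pick ~sqrt(N) random leaders; attach every
    other document to its nearest leader (its followers)."""
    leaders = random.sample(list(docs), max(1, math.isqrt(len(docs))))
    followers = {l: [] for l in leaders}
    for d, vec in docs.items():
        if d not in followers:
            nearest = max(leaders, key=lambda l: cos(docs[l], vec))
            followers[nearest].append(d)
    return followers

def query_clusters(q, docs, followers, k):
    leader = max(followers, key=lambda l: cos(docs[l], q))
    pool = [leader] + followers[leader]  # search only this leader's followers
    return sorted(pool, key=lambda d: cos(docs[d], q), reverse=True)[:k]
```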
Approximate K Nearest Neighbors
• We have investigated methods to find the K nearest neighbor documents approximately:
  - Index Elimination
  - Champion Lists
  - Static Quality Scores
  - Cluster Pruning
• There are also principled methods, based on e.g. hashing
More in Lecture 10
Parametric and Zone Indexes
• Thus far, document = sequence of terms
• Documents have multiple parts, metadata:
  - Author
  - Title
  - Date of publication
  - Language
  - Format
  - etc.
• Sometimes we search by metadata: FIND DOCS AUTHORED BY WILLIAM SHAKESPEARE IN THE YEAR 1601, CONTAINING ALAS POOR YORICK
Parametric and Zone Indexes
• Field
  - year, author, etc., with only one value
  - Field or parametric index: postings for each field value
  - A field query is typically treated as a conjunction: only search documents with author = SHAKESPEARE
• Zone
  - Title, body, etc., that contain an arbitrary amount of text
  - Zone index: postings for terms in each zone
Parametric and Zone Indexes
• Encode zones in the dictionary vs. in the postings.
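The two encodings can be illustrated with toy postings (doc IDs and zone names are made up for illustration):

```python
# (1) Zones in the dictionary: one vocabulary entry per term-zone pair.
zones_in_dictionary = {
    "william.author": [2, 3, 5],
    "william.title":  [2],
    "william.body":   [1, 2, 5, 8],
}

# (2) Zones in the postings: one entry per term; each posting lists its zones.
zones_in_postings = {
    "william": [(1, ["body"]), (2, ["author", "title", "body"]),
                (3, ["author"]), (5, ["author", "body"]), (8, ["body"])],
}
```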
Query Term Proximity
• Free text queries:
  - Just a set of terms typed into the query box
  - Common on the web
• Ranking:
  - Term proximity in documents ⇒ higher score for closer terms
  - Term order ⇒ higher score for terms in the right order
Query Parser
• Query phrase: RISING INTEREST RATES
• Sequence:
  - Run it as a phrase query
  - If fewer than K documents contain the phrase RISING INTEREST RATES, run the phrase queries RISING INTEREST and INTEREST RATES
  - If still fewer than K docs, run the vector space query RISING INTEREST RATES
  - Rank the matching docs by vector space scoring
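A self-contained toy sketch of this cascade (the corpus, matching functions, and K-handling are all illustrative assumptions):

```python
docs = {
    1: "rising interest rates hit bond markets",
    2: "interest rates fall as inflation eases",
    3: "rising sea levels",
}

def phrase_hits(phrase):
    return {d for d, text in docs.items() if phrase in text}

def term_hits(terms):
    return {d for d, text in docs.items() if any(t in text.split() for t in terms)}

def parse_and_run(query, k):
    terms = query.split()
    hits = phrase_hits(query)                          # 1. full phrase query
    if len(hits) < k:
        for i in range(len(terms) - 1):                # 2. shorter phrase queries
            hits |= phrase_hits(" ".join(terms[i:i + 2]))
    if len(hits) < k:
        hits |= term_hits(terms)                       # 3. vector space query
    return hits  # finally, rank these by vector space scoring

print(parse_and_run("rising interest rates", k=3))  # {1, 2, 3}
```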
Next
• Assignment 1 left? Email Johan or Hedvig
• Lecture 5 (February 8, 10.15-12.00)
  - D3
  - Readings: Manning Chapter 21; Avrachenkov Sections 1-2
• Lecture 6 (February 15, 10.15-12.00)
  - D3
  - Readings: Manning Chapters 8, 9
• Computer hall session (February 19, 15.00-18.00)
  - Gul, Brun (Osquars Backe 2, level 4)
  - Examination of computer assignment 2