15-381 Artificial Intelligence
Information Retrieval (How to Power a Search Engine)

Jaime Carbonell
20 September 2001

Topics Covered:
• “Bag of Words” Hypothesis
• Vector Space Model & Cosine Similarity
• Query Expansion Methods
Information Retrieval: The Challenge (1)
Text DB includes:

(1) Rainfall measurements in the Sahara continue to show a steady decline starting from the first measurements in 1961. In 1996 only 12mm of rain were recorded in upper Sudan, and 1mm in Southern Algiers...

(2) Dan Marino states that professional football risks losing the number one position in the hearts of fans across this land. Declines in TV audience ratings are cited...

(3) Alarming reductions in precipitation in desert regions are blamed for desert encroachment of previously fertile farmland in Northern Africa. Scientists measured both yearly precipitation and groundwater levels...
Information Retrieval: The Challenge (2)
User query states:
"Decline in rainfall and impact on farms near Sahara"

Challenges:
• How to retrieve (1) and (3) and not (2)?
• How to rank (3) as best?
• How to cope with no shared words?
Information Retrieval Assumptions (1)
Basic IR task
• There exists a document collection {Dj}
• User enters an ad hoc query Q
• Q correctly states the user’s interest
• User wants the {Di} ⊂ {Dj} most relevant to Q
Information Retrieval Assumptions (2)

"Shared Bag of Words" assumption
Every query = {wi}
Every document = {wk}
...where wi & wk are in the same Σ

All syntax is irrelevant (e.g. word order)
All document structure is irrelevant
All meta-information is irrelevant (e.g. author, source, genre)
=> Words suffice for relevance assessment
Information Retrieval Assumption (3)
Retrieval by shared words
If Q and Dj share some wi , then Relevant(Q, Dj )
If Q and Dj share all wi , then Relevant(Q, Dj )
If Q and Dj share over K% of wi , then Relevant(Q, Dj)
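The shared-words heuristics above can be sketched directly. A minimal illustration (function names and the default K threshold are ours, not from the slides):

```python
# Minimal sketch of the shared-words relevance heuristic.
# Names and the default threshold are illustrative.
def shared_fraction(query, doc):
    """Fraction of distinct query words that also appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

def relevant(query, doc, k=0.5):
    """Relevant(Q, Dj) if over K% of the query words are shared."""
    return shared_fraction(query, doc) > k
```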
Boolean Queries (1)

Industrial use of Silver

Q: silver
R: "The Count’s silver anniversary..."
   "Even the crash of ’87 had a silver lining..."
   "The Lone Ranger lived on in syndication..."
   "Silver dropped to a new low in London..."
   ...

Q: silver AND photography
R: "Posters of Tonto and the Lone Ranger..."
   "The Queen’s Silver Anniversary photos..."
   ...
Boolean Queries (2)
Q: ((silver AND (NOT anniversary)
     AND (NOT lining)
     AND emulsion)
    OR (AgI AND crystal
     AND photography))

R: "Silver Iodide Crystals in Photography..."
   "The emulsion was worth its weight in silver..."
   ...
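The compound query above can be hand-compiled into set tests over a document's bag of words. A sketch (the query structure is translated by hand; tokenization is naive lower-cased splitting):

```python
# Sketch: the Boolean query above, hand-compiled to set operations
# over a document's bag of words.
def words(text):
    return set(text.lower().split())

def matches(doc_words):
    # (silver AND (NOT anniversary) AND (NOT lining) AND emulsion)
    left = ("silver" in doc_words and "anniversary" not in doc_words
            and "lining" not in doc_words and "emulsion" in doc_words)
    # OR (AgI AND crystal AND photography)
    right = {"agi", "crystal", "photography"} <= doc_words
    return left or right
```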
Boolean Queries (3)
Boolean queries are:
a) easy to implement
b) confusing to compose
c) seldom used (except by librarians)
d) prone to low recall
e) all of the above
Beyond the Boolean Boondoggle (1)
Desiderata (1)
• Query must be natural for all users
  – Sentence, phrase, or word(s)
  – No AND’s, OR’s, NOT’s, ...
  – No parentheses (no structure)
• System focuses on important words
  – Q: I want laser printers now
Beyond the Boolean Boondoggle (2)

Desiderata (2)
• Find what I mean, not just what I say

Q: cheap car insurance

(pAND (pOR "cheap" [1.0] "inexpensive" [0.9] "discount" [0.5])
      (pOR "car" [1.0] "auto" [0.8] "automobile" [0.9] "vehicle" [0.5])
      (pOR "insurance" [1.0] "policy" [0.3]))
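One plausible reading of the weighted query above (an assumption on our part: pOR takes the best-weighted synonym present in the document, and pAND multiplies the group scores, so a document missing a whole group scores zero):

```python
# Hedged sketch of one way to score the pAND/pOR query above.
# Assumed semantics: pOR = max weight among synonyms found,
# pAND = product of group scores (zero if any group is absent).
def p_or(doc_words, weighted_terms):
    return max((w for t, w in weighted_terms if t in doc_words), default=0.0)

def p_and(group_scores):
    score = 1.0
    for s in group_scores:
        score *= s
    return score

QUERY = [
    [("cheap", 1.0), ("inexpensive", 0.9), ("discount", 0.5)],
    [("car", 1.0), ("auto", 0.8), ("automobile", 0.9), ("vehicle", 0.5)],
    [("insurance", 1.0), ("policy", 0.3)],
]

def score(doc):
    doc_words = set(doc.lower().split())
    return p_and(p_or(doc_words, group) for group in QUERY)
```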
The Vector Space Model (1)
Let Σ = [w1, w2, ... wn ]
Let Dj = [c(w1, Dj), c(w2, Dj), ... c(wn, Dj)]
Let Q = [c(w1, Q), c(w2, Q), ... c(wn, Q)]
The Vector Space Model (2)
Initial Definition of Similarity:
  SI(Q, Dj) = Q · Dj

Normalized Definition of Similarity:
  SN(Q, Dj) = (Q · Dj) / (|Q| × |Dj|) = cos(Q, Dj)
The Vector Space Model (3)
Relevance Ranking
If SN(Q, Di) > SN(Q, Dj)
then Di is more relevant than Dj to Q

Retrieve(k, Q, {Dj}) =
  Argmax-k [cos(Q, Dj)]
  Dj in {Dj}
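The full VSM pipeline above fits in a few lines: documents and queries as term-count vectors, cosine similarity, and Argmax-k retrieval. A self-contained sketch (function names are ours):

```python
# Sketch of the VSM pipeline: term-count vectors, cosine similarity,
# and top-k retrieval by Argmax over cos(Q, Dj).
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cos(q, d):
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(c * c for c in q.values())) * \
           math.sqrt(sum(c * c for c in d.values()))
    return dot / norm if norm else 0.0

def retrieve(k, query, docs):
    """Return the k documents with highest cos(Q, Dj)."""
    q = vec(query)
    ranked = sorted(docs, key=lambda d: cos(q, vec(d)), reverse=True)
    return ranked[:k]
```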
Refinements to VSM (2)
Stop-Word Elimination
• Discard articles, auxiliaries, prepositions, ... typically 100-300 most frequent small words
• Reduce document length by 30-40%
• Retrieval accuracy improves slightly (5-10%)
Refinements to VSM (3)
Proximity Phrases
• E.g.: "air force" => airforce
• Found by high mutual information:
    p(w1 w2) >> p(w1) p(w2)
    p(w1 & w2 in k-window) >> p(w1 in k-window) p(w2 in same k-window)
• Retrieval accuracy improves slightly (5-10%)
• Too many phrases => inefficiency
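The mutual-information test above, sketched for adjacent word pairs (pointwise mutual information; the threshold value is an illustrative choice, not from the slides):

```python
# Sketch: flag candidate phrases where p(w1 w2) >> p(w1) p(w2),
# using pointwise mutual information over adjacent word pairs.
import math
from collections import Counter

def find_phrases(texts, threshold=1.5):
    """Return adjacent pairs with log2(p(w1 w2) / (p(w1) p(w2))) > threshold."""
    words, pairs = Counter(), Counter()
    n_words = n_pairs = 0
    for text in texts:
        toks = text.lower().split()
        words.update(toks)
        pairs.update(zip(toks, toks[1:]))
        n_words += len(toks)
        n_pairs += max(len(toks) - 1, 0)
    found = []
    for (w1, w2), c in pairs.items():
        pmi = math.log2((c / n_pairs) /
                        ((words[w1] / n_words) * (words[w2] / n_words)))
        if pmi > threshold:
            found.append((w1, w2))
    return found
```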
Refinements to VSM (4)
Words => Terms
• term = word | stemmed word | phrase
• Use exactly the same VSM method on terms (vs words)
Evaluating Information Retrieval (1)
Contingency table:

                relevant    not-relevant
retrieved          a             b
not retrieved      c             d

precision = |relevant & retrieved| / |all retrieved docs|
recall    = |relevant & retrieved| / |all relevant docs|
Evaluating Information Retrieval (2)
P = a/(a+b) R = a/(a+c)
Accuracy = (a+d)/(a+b+c+d)
F1 = 2PR/(P+R)
Miss = c/(a+c) = 1 - R   (false negative rate)
F/A = b/(a+b+c+d)   (false alarm / false positive rate)
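All of these measures follow directly from the four contingency counts; a sketch (the example counts in the test are made up):

```python
# Sketch: the evaluation measures above, from contingency counts
# a (relevant retrieved), b (non-relevant retrieved),
# c (relevant missed), d (non-relevant not retrieved).
def metrics(a, b, c, d):
    precision = a / (a + b)
    recall = a / (a + c)
    return {
        "P": precision,
        "R": recall,
        "Accuracy": (a + d) / (a + b + c + d),
        "F1": 2 * precision * recall / (precision + recall),
        "Miss": c / (a + c),          # false negative rate = 1 - R
        "F/A": b / (a + b + c + d),   # false alarm rate
    }
```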
Query Expansion (1)

Observations:
• Longer queries often yield better results
• User’s vocabulary may differ from document vocabulary
  Q: how to avoid heart disease
  D: "Factors in minimizing stroke and cardiac arrest: Recommended dietary and exercise regimens"
• Longer queries have more chances to match document terms, helping recall.
Query Expansion (2)
Bridging the Gap
• Human query expansion (user or expert)
• Thesaurus-based expansion
  – Seldom works in practice (unfocused)
• Relevance feedback
  – Widens a thin bridge over the vocabulary gap
  – Adds words from document space to query
• Pseudo-relevance feedback
• Local context analysis
Relevance Feedback
Rocchio Formula
Q’ = F[Q, Dret]

F = weighted vector sum, such as:

W(t, Q’) = αW(t, Q) + βW(t, Drel) - γW(t, Dirr)
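The Rocchio update above, sketched over term-weight dictionaries (the α, β, γ defaults below are illustrative free-parameter choices, not from the slide):

```python
# Sketch of the Rocchio update: W(t,Q') = a*W(t,Q) + b*W(t,Drel) - g*W(t,Dirr).
# The default parameter values are illustrative.
def rocchio(query_w, rel_w, irr_w, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query_w) | set(rel_w) | set(irr_w)
    return {t: alpha * query_w.get(t, 0.0)
               + beta * rel_w.get(t, 0.0)
               - gamma * irr_w.get(t, 0.0)
            for t in terms}
```

Terms from the relevant retrieved documents (e.g. "precipitation") enter the query with positive weight; terms from irrelevant ones are pushed negative.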
Term Weighting Methods (1)
Salton’s Tf*IDf

Tf  = term frequency in a document
Df  = document frequency of term
    = # documents in collection with this term
IDf = Df⁻¹ (inverse document frequency)
Term Weighting Methods (2)
Salton’s Tf*IDf

TfIDf = f1(Tf) * f2(IDf)

E.g. f1(Tf) = Tf * ave(|Dj|) / |D|
E.g. f2(IDf) = log2(IDf)
f1 and f2 can differ for Q and D
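A toy Tf*IDf weighting over a small collection. Two simplifications on our part: f1 is raw Tf (no length normalization), and f2 uses the common log2(N/Df) form of the inverse document frequency:

```python
# Sketch of Tf*IDf weighting: f1 = raw Tf, f2 = log2(N / Df).
# Both choices are simplifications for illustration.
import math
from collections import Counter

def tfidf_weights(docs):
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))          # each doc counts a term once
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: tf[t] * math.log2(n / df[t]) for t in tf})
    return weights
```

A term appearing in every document gets weight zero; rare terms are boosted.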
Efficient Implementations of VSM (1)
• Build an Inverted Index (next slide)
• Filter all 0-product terms
• Precompute IDF, per-document TF
• …but remove stopwords first.
Efficient Implementations of VSM (3)

[termi, IDFtermi,
  <doci, freq(termi, doci),
   docj, freq(termi, docj), ...>]

or, with term positions:

[termi, IDFtermi,
  <doci, freq(termi, doci), [pos1,i, pos2,i, ...],
   docj, freq(termi, docj), [pos1,j, pos2,j, ...], ...>]

pos1,j indicates the first position of the term in docj, and so on.
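The positional inverted index sketched above can be built in one pass over the collection (a minimal sketch; freq is stored as the length of each position list):

```python
# Sketch: building the positional inverted index described above.
# Maps term -> list of (doc_id, freq, [positions]).
from collections import defaultdict

def build_index(docs):
    postings = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            postings[term][doc_id].append(pos)
    index = {}
    for term, by_doc in postings.items():
        index[term] = [(doc_id, len(positions), positions)
                       for doc_id, positions in sorted(by_doc.items())]
    return index
```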
Generalized Vector Space Model (1)
Principles
• Define terms by their occurrence patterns in documents
• Define query terms in the same way
• Compute similarity by document-pattern overlap for terms in D and Q
• Use standard cosine similarity and either binary or TfIDf weights
Generalized Vector Space Model (2)
Advantages
• Automatically calculates partial similarity
  If "heart disease", "stroke", and "ventricular" co-occur in many documents, then when the query contains only one of these terms, documents containing the others receive partial credit proportional to their document co-occurrence ratio.
• No need to do query expansion or relevance feedback
GVSM, How it Works (1)
Represent the collection as a vector of documents:
  Let C = [D1, D2, ..., Dm]

Represent each term by its distributional frequency:
  Let ti = [Tf(ti, D1), Tf(ti, D2), ..., Tf(ti, Dm)]

Term-to-term similarity is computed as:
  Sim(ti, tj) = cos(vec(ti), vec(tj))

Hence, highly co-occurring terms like "Arafat" and "PLO" will be treated as near-synonyms for retrieval.
GVSM, How it Works (2)

Query-document similarity is computed as before, Sim(Q, D) = cos(vec(Q), vec(D)), except that instead of the direct dot product we use a function of the term-to-term similarities above. For instance:

  Sim(Q, D) = Σi [Maxj (sim(qi, dj))]

or, normalizing for document & query length:

  Simnorm(Q, D) = Σi [Maxj (sim(qi, dj))] / (|Q| × |D|)
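Putting the two GVSM slides together (a sketch; normalization here divides by the term counts |Q|·|D|, one reading of the formula above):

```python
# Sketch of GVSM: terms as vectors of their Tf across documents;
# query-document similarity = sum over query terms of the best
# term-to-term cosine with any document term, normalized by |Q|*|D|.
import math

def term_vectors(docs):
    """ti = [Tf(ti, D1), ..., Tf(ti, Dm)] for every term in the collection."""
    tokenized = [d.lower().split() for d in docs]
    vocab = {t for toks in tokenized for t in toks}
    return {t: [toks.count(t) for toks in tokenized] for t in vocab}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def gvsm_sim(query, doc, tvecs):
    q_terms = [t for t in query.lower().split() if t in tvecs]
    d_terms = [t for t in doc.lower().split() if t in tvecs]
    if not q_terms or not d_terms:
        return 0.0
    total = sum(max(cos(tvecs[q], tvecs[d]) for d in d_terms) for q in q_terms)
    return total / (len(q_terms) * len(d_terms))
```

With a collection where "arafat" and "plo" always co-occur, a query on one gives partial credit to documents containing only the other.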
A Critique of Pure Relevance (1)
IR Maximizes Relevance
• Precision and recall are relevance measures
• Quality of documents retrieved is ignored
A Critique of Pure Relevance (2)
Other Important Factors
• What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...?
• In IR, we really want to maximize:
    P(U(f1, ..., fn) | Q & {C} & U & H)
  where Q = query, {C} = collection set, U = user profile, H = interaction history
• ...but we don’t yet know how. Darn.
Maximal Marginal Relevance (1)
• A crude first approximation:
novelty => minimal-redundancy
• Weighted linear combination:
(redundancy = cost, relevance = benefit)
• Free parameters: k and λ
Maximal Marginal Relevance (2)
MMR(Q, C, R) =
  Argmax-k over di in C [ λ S(Q, di) - (1-λ) max over dj in R S(di, dj) ]
Maximal Marginal Relevance (MMR) (3)

COMPUTATION OF MMR RERANKING
1. Standard IR retrieval of top-N docs:
   Let Dr = IR(D, Q, N)
2. Rank the di ∈ Dr with max sim(di, Q) as top doc, i.e.
   Let Ranked = {di}
3. Let Dr = Dr \ {di}
4. While Dr is not empty, do:
   a. Find di with max MMR(Dr, Q, Ranked)
   b. Let Ranked = Ranked.di
   c. Let Dr = Dr \ {di}
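The reranking loop above, sketched end to end (an illustrative choice on our part: S is cosine over word-count vectors, and the λ default is arbitrary):

```python
# Sketch of MMR reranking: greedily pick the doc maximizing
# lam * S(Q, di) - (1 - lam) * max over already-ranked dj of S(di, dj).
import math
from collections import Counter

def cos(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mmr_rerank(query, docs, lam=0.7):
    q = Counter(query.lower().split())
    vecs = {d: Counter(d.lower().split()) for d in docs}
    remaining, ranked = list(docs), []
    while remaining:
        best = max(
            remaining,
            key=lambda d: lam * cos(q, vecs[d])
            - (1 - lam) * max((cos(vecs[d], vecs[r]) for r in ranked),
                              default=0.0),
        )
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

With λ = 1 this reduces to plain relevance ranking; lowering λ pushes near-duplicates of already-ranked documents down the list.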
Maximal Marginal Relevance (MMR) (4)
Applications:
• Ranking retrieved documents from IR Engine
• Ranking passages for inclusion in Summaries
Document Summarization in a Nutshell (1)

Types of Summaries

| Task | Query-relevant (focused) | Query-free (generic) |
| --- | --- | --- |
| INDICATIVE, for filtering (Do I read further?) | To filter search engine results | Short abstracts |
| CONTENTFUL, for reading in lieu of full doc. | To solve problems for busy professionals | Executive summaries |
Summarization as Passage Retrieval (1)
For Query-Driven Summaries
1. Divide document into passages
   e.g., sentences, paragraphs, FAQ-pairs, ...
2. Use query to retrieve most relevant passages, or better, use MMR to avoid redundancy.
3. Assemble retrieved passages into a summary.
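The three steps above can be sketched with the simplest choices at each step (sentences as passages, word overlap as the relevance score; swapping in cosine or MMR is a drop-in change):

```python
# Sketch: summarization as passage retrieval.
# Passages = sentences; score = query-word overlap; top-n are kept
# and reassembled in original document order.
def summarize(document, query, n=2):
    passages = [p.strip() for p in document.split(".") if p.strip()]
    q_words = set(query.lower().split())
    scored = sorted(
        passages,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    chosen = set(scored[:n])
    return ". ".join(p for p in passages if p in chosen) + "."
```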