Under the Hood [Part II]
Web-Based Information Architectures
MSEC 20-760, Mini II
Jaime Carbonell
Today’s Topics
• Term weighting in detail
• Generalized Vector Space Model (GVSM)
• Maximal Marginal Relevance
• Summarization as Passage Retrieval
Term Weighting Revisited (1)
Definitions
wi "ith Term:" a word, stemmed word, or indexed phrase
Dj "jth Document:" a unit of indexed text, e.g. a web-page, a news report, an article, a patent, a legal case, a book, a chapter of a book, etc.
Term Weighting Revisited (2)
Definitions
C "The Collection:" the full set of indexed documents
(e.g. the New York Times archive, the Web, ...)
Tf(wi, Dj) "Term Frequency:" the number of times wi occurs in document Dj. Tf is sometimes normalized by dividing by the frequency of the most-frequent non-stop term in the document [Tf_norm = Tf / max_TF], where:
max_TF(Dj) = max over wi in Dj of Tf(wi, Dj)
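The normalization above can be sketched in a few lines of Python (a minimal sketch, assuming a pre-tokenized document and a given stop-word list):

```python
from collections import Counter

def normalized_tf(doc_tokens, stopwords=frozenset()):
    """Tf_norm(w, D) = Tf(w, D) / max_TF(D), where max_TF(D) is the
    frequency of the most-frequent non-stop term in the document."""
    counts = Counter(t for t in doc_tokens if t not in stopwords)
    max_tf = max(counts.values())
    return {term: tf / max_tf for term, tf in counts.items()}

doc = "the cat sat on the mat the cat slept".split()
tf = normalized_tf(doc, stopwords={"the", "on"})
# "cat" is the most frequent non-stop term, so its Tf_norm is 1.0
```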
Term Weighting Revisited (3)
Definitions
Df(wi, C) "Document Frequency:" the number of documents from C in which wi occurs. Df may be normalized by dividing it by the total number of documents in C.
IDf(wi, C) "Inverse Document Frequency":
[Df(wi, C)/size(C)]^-1. Most often log2(IDf) is used, rather than IDf directly.
Term Weighting Revisited (4)
TfIDf Term Weights
In general: TfIDf(wi, Dj, C) = F1(Tf(wi, Dj)) * F2(IDf(wi, C))
Usually F1 = 0.5 + log2(Tf), or Tf/Tfmax, or 0.5 + 0.5*Tf/Tfmax
Usually F2 = log2(IDf)
In the SMART IR system: TfIDf(wi, Dj, C) =
[0.5 + 0.5*Tf(wi, Dj)/Tfmax(Dj)] * log2(IDf(wi, C))
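The SMART weighting above can be sketched as follows (a minimal illustration; `df` and `n_docs` are assumed to come from a precomputed collection index):

```python
import math
from collections import Counter

def smart_tfidf(term, doc_tokens, df, n_docs):
    """SMART-style weight: [0.5 + 0.5*Tf/Tfmax] * log2(N / Df),
    where log2(N / Df) is log2 of the inverse document frequency."""
    counts = Counter(doc_tokens)
    if counts[term] == 0 or df == 0:
        return 0.0
    tf_part = 0.5 + 0.5 * counts[term] / max(counts.values())
    idf_part = math.log2(n_docs / df)
    return tf_part * idf_part

# Hypothetical figures: "stroke" appears in 10 of 1000 indexed documents.
w = smart_tfidf("stroke", "heart disease and stroke".split(), df=10, n_docs=1000)
```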
Term Weighting beyond TfIDf (1)
Probabilistic Models
• Old style (see textbooks)
Improves precision-recall slightly
• Full statistical language modeling (CMU)
Improves precision-recall more significantly
• Difficult to compute efficiently.
Term Weighting beyond TfIDf (2)
Neural Networks
• Theoretically attractive
• Do not scale up at all, unfortunately
Fuzzy Sets
• Not deeply researched, scaling difficulties
Term Weighting beyond TfIDf (3)
Natural Language Analysis
• Analyze and understand D's & Q first
• Ultimate IR method, in theory
• Generally NL understanding is an unsolved problem
• Scale-up challenges, even if we could do it
• But, shown to improve IR for very limited domains
Generalized Vector Space Model (1)
Principles
• Define terms by their occurrence patterns in documents
• Define query terms in the same way
• Compute similarity by document-pattern overlap for terms in D and Q
• Use standard Cos similarity and either binary or TfIDf weights
Generalized Vector Space Model (2)
Advantages
• Automatically calculates partial similarity
If "heart disease" and "stroke" and "ventricular" co-occur in many documents, then if the query contains only one of these terms, documents containing the others will receive partial credit proportional to their document co-occurrence ratio.
• No need to do query expansion or relevance feedback
Generalized Vector Space Model (3)
Disadvantages
• Computationally expensive
• Performance comparable to vector space model plus query expansion
GVSM, How it Works (1)
Represent the collection as a vector of documents:
Let C = [D1, D2, ..., Dm]
Represent each term by its distributional frequency:
Let ti = [Tf(ti, D1), Tf(ti, D2), ..., Tf(ti, Dm)]
Term-to-term similarity is computed as:
Sim(ti, tj) = cos(vec(ti), vec(tj))
Hence, highly co-occurring terms like "Arafat" and "PLO" will be treated as near-synonyms for retrieval.
GVSM, How it Works (2)
And query-document similarity is computed as before: Sim(Q, D) = cos(vec(Q), vec(D)), except that instead of the dot-product calculation we use a function of the term-to-term similarity computation above. For instance:
Sim(Q, D) = Σi [Maxj sim(qi, dj)]
or, normalizing for document and query length:
Simnorm(Q, D) = Σi [Maxj sim(qi, dj)] / (|Q| * |D|)
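The whole GVSM pipeline above can be sketched as follows (a toy illustration; the collection and term lists are hypothetical, and real systems would use TfIDf rather than raw Tf):

```python
import math

def term_vector(term, docs):
    """GVSM term representation: ti = [Tf(ti, D1), ..., Tf(ti, Dm)]."""
    return [doc.count(term) for doc in docs]

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def gvsm_sim(query_terms, doc_terms, docs):
    """Simnorm(Q, D): sum over qi of max over dj of sim(qi, dj),
    normalized by |Q| * |D|."""
    total = sum(
        max(cos_sim(term_vector(q, docs), term_vector(d, docs)) for d in doc_terms)
        for q in query_terms
    )
    return total / (len(query_terms) * len(doc_terms))

# Toy collection: "arafat" and "plo" always co-occur, so GVSM treats
# them as near-synonyms even with no shared literal term.
docs = [["arafat", "plo", "talks"], ["arafat", "plo"], ["stock", "market"]]
partial = gvsm_sim(["arafat"], ["plo"], docs)  # close to 1.0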
GVSM, How it Works (3)
Primary problem:
More computation (sparse => dense)
Primary benefit:
Automatic term expansion by corpus
A Critique of Pure Relevance (1)
IR Maximizes Relevance
• Precision and recall are relevance measures
• Quality of documents retrieved is ignored
A Critique of Pure Relevance (2)
Other Important Factors
• What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...?
• In IR, we really want to maximize:
P(U(f1, ..., fn) | Q & {C} & U & H)
where Q = query, {C} = collection set, U = user profile, H = interaction history
• ...but we don’t yet know how. Darn.
Maximal Marginal Relevance (1)
• A crude first approximation:
novelty => minimal-redundancy
• Weighted linear combination:
(redundancy = cost, relevance = benefit)
• Free parameters: k and λ
Maximal Marginal Relevance (2)
MMR(Q, C, R) =
Argmax_k over di in C\R of [λ S(Q, di) - (1-λ) max over dj in R of S(di, dj)]
where C is the candidate set, R is the already-ranked set, and Argmax_k selects the k best documents.
Maximal Marginal Relevance (MMR) (3)
COMPUTATION OF MMR RERANKING
1. Standard IR retrieval of top-N docs:
Let Dr = IR(D, Q, N)
2. Rank the di in Dr with max sim(di, Q) as the top doc, i.e.
Let Ranked = {di}
3. Let Dr = Dr \ {di}
4. While Dr is not empty, do:
a. Find the di in Dr with max MMR(Dr, Q, Ranked)
b. Let Ranked = Ranked.di (append di to Ranked)
c. Let Dr = Dr \ {di}
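The reranking loop above can be sketched as follows (a minimal sketch; the similarity function S is assumed to be any symmetric measure such as cosine over TfIDf vectors — the Jaccard word-overlap below is only a stand-in):

```python
def mmr_rerank(query, docs, sim, lam=0.7, k=5):
    """Greedy MMR: repeatedly select the remaining document maximizing
    lam * S(Q, di) - (1 - lam) * max over dj in Ranked of S(di, dj)."""
    remaining = list(docs)
    ranked = []
    while remaining and len(ranked) < k:
        best = max(
            remaining,
            key=lambda d: lam * sim(query, d)
            - (1 - lam) * max((sim(d, r) for r in ranked), default=0.0),
        )
        ranked.append(best)
        remaining.remove(best)
    return ranked

def word_overlap(a, b):
    """Stand-in similarity: Jaccard overlap of word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

docs = ["heart disease treatment", "heart disease therapy", "stock market news"]
# A low lambda favors novelty: the redundant second heart-disease doc is skipped.
ranked = mmr_rerank("heart disease", docs, word_overlap, lam=0.3, k=2)
```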
Maximal Marginal Relevance (MMR) (4)
Applications:
• Ranking retrieved documents from IR Engine
• Ranking passages for inclusion in Summaries
Document Summarization in a Nutshell (1)
Types of Summaries
Task: INDICATIVE, for filtering ("Do I read further?")
• Query-relevant (focused): to filter search engine results
• Query-free (generic): short abstracts
Task: CONTENTFUL, for reading in lieu of full doc.
• Query-relevant (focused): to solve problems for busy professionals
• Query-free (generic): executive summaries
Document Summarization in a Nutshell (2)
Other Dimensions
• Single vs multi document summarization
• Genre-adaptive vs one-size-fits-all
• Single-language vs translingual
• Flat summary vs hyperlinked pyramid
• Text-only vs multi-media
• ...
Summarization as Passage Retrieval (1)
For Query-Driven Summaries
1. Divide the document into passages, e.g. sentences, paragraphs, FAQ-pairs, ...
2. Use the query to retrieve the most relevant passages, or better, use MMR to avoid redundancy.
3. Assemble the retrieved passages into a summary.
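The three steps above can be sketched as follows (a toy sketch; the period-based sentence splitter and word-overlap similarity are simplifying stand-ins for a real passage segmenter and TfIDf cosine):

```python
def summarize(document, query, sim, lam=0.7, n_passages=2):
    """Query-driven summary via passage retrieval: (1) split into sentence
    passages, (2) greedily MMR-select relevant non-redundant passages,
    (3) assemble the selected passages back in original document order."""
    passages = [p.strip() for p in document.split(".") if p.strip()]
    selected, candidates = [], list(passages)
    while candidates and len(selected) < n_passages:
        best = max(
            candidates,
            key=lambda p: lam * sim(query, p)
            - (1 - lam) * max((sim(p, s) for s in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return ". ".join(p for p in passages if p in selected) + "."

def word_overlap(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

doc = ("Heart disease is a leading cause of death. "
       "Heart disease deaths are declining. "
       "The stock market rose today.")
summary = summarize(doc, "heart disease deaths", word_overlap, n_passages=1)
```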
Summarization as Passage Retrieval (2)
For Generic Summaries
1. Use title or top-k Tf-IDF terms as query.
2. Proceed as Query-Driven Summarization.
Summarization as Passage Retrieval (3)
For Multidocument Summaries
1. Cluster documents into topically-related groups.
2. For each group, divide each document into passages and keep track of the source of each passage.
3. Use MMR to retrieve most relevant non-redundant passages (MMR is necessary for multiple docs).
4. Assemble a summary for each cluster.