Under the Hood [Part II]
Web-Based Information Architectures
MSEC 20-760, Mini II
Jaime Carbonell
Today’s Topics
• Term weighting in detail
• Generalized Vector Space Model (GVSM)
• Maximal Marginal Relevance
• Summarization as Passage Retrieval
Term Weighting Revisited (1)
Definitions
wi "ith Term:" a word, stemmed word, or indexed phrase
Dj "jth Document:" a unit of indexed text, e.g. a web-page, a news report, an article, a patent, a legal case, a book, a chapter of a book, etc.
Term Weighting Revisited (2)
Definitions
C "The Collection:" the full set of indexed documents
(e.g. the New York Times archive, the Web, ...)
Tf(wi, Dj) "Term Frequency:" the number of times wi occurs in document Dj. Tf is sometimes normalized by dividing by the frequency of the most-frequent non-stop term in the document [Tf_norm = Tf / max_TF], where:
max_TF(Dj) = max over wi in Dj of Tf(wi, Dj)
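The normalization above can be sketched in a few lines of Python (a minimal sketch, assuming a pre-tokenized document and a given stop-word list):

```python
from collections import Counter

def normalized_tf(doc_tokens, stopwords=frozenset()):
    """Tf_norm(w, D) = Tf(w, D) / max_TF(D), where max_TF(D) is the
    frequency of the most-frequent non-stop term in the document."""
    counts = Counter(t for t in doc_tokens if t not in stopwords)
    max_tf = max(counts.values())
    return {term: tf / max_tf for term, tf in counts.items()}

doc = "the cat sat on the mat the cat slept".split()
tf = normalized_tf(doc, stopwords={"the", "on"})
# "cat" is the most frequent non-stop term, so its Tf_norm is 1.0
```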
Term Weighting Revisited (3)
Definitions
Df(wi, C) "Document Frequency:" the number of documents from C in which wi occurs. Df may be normalized by dividing it by the total number of documents in C.
IDf(wi, C) "Inverse Document Frequency":
[Df(wi, C)/size(C)]^-1. Most often log2(IDf) is used, rather than IDf directly.
Term Weighting Revisited (4)
TfIDf Term Weights
In general: TfIDf(wi, Dj, C) = F1(Tf(wi, Dj)) * F2(IDf(wi, C))
Usually F1 = 0.5 + log2(Tf), or Tf/Tfmax, or 0.5 + 0.5*Tf/Tfmax
Usually F2 = log2(IDf)
In the SMART IR system: TfIDf(wi, Dj, C) =
[0.5 + 0.5*Tf(wi, Dj)/Tfmax(Dj)] * log2(IDf(wi, C))
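The SMART weighting above can be sketched as follows (a minimal illustration; `df` and `n_docs` are assumed to come from a precomputed collection index):

```python
import math
from collections import Counter

def smart_tfidf(term, doc_tokens, df, n_docs):
    """SMART-style weight: [0.5 + 0.5*Tf/Tfmax] * log2(N / Df),
    where log2(N / Df) is log2 of the inverse document frequency."""
    counts = Counter(doc_tokens)
    if counts[term] == 0 or df == 0:
        return 0.0
    tf_part = 0.5 + 0.5 * counts[term] / max(counts.values())
    idf_part = math.log2(n_docs / df)
    return tf_part * idf_part

# Hypothetical figures: "stroke" appears in 10 of 1000 indexed documents.
w = smart_tfidf("stroke", "heart disease and stroke".split(), df=10, n_docs=1000)
```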
Term Weighting beyond TfIDf (1)
Probabilistic Models
• Old style (see textbooks)
Improves precision-recall slightly
• Full statistical language modeling (CMU)
Improves precision-recall more significantly
• Difficult to compute efficiently.
Term Weighting beyond TfIDf (2)
Neural Networks
• Theoretically attractive
• Do not scale up at all, unfortunately
Fuzzy Sets
• Not deeply researched, scaling difficulties
Term Weighting beyond TfIDf (3)
Natural Language Analysis
• Analyze and understand D's & Q first
• Ultimate IR method, in theory
• Generally NL understanding is an unsolved problem
• Scale-up challenges, even if we could do it
• But, shown to improve IR for very limited domains
Generalized Vector Space Model (1)
Principles
• Define terms by their occurrence patterns in documents
• Define query terms in the same way
• Compute similarity by document-pattern overlap for terms in D and Q
• Use standard Cos similarity and either binary or TfIDf weights
Generalized Vector Space Model (2)
Advantages
• Automatically calculates partial similarity
If "heart disease" and "stroke" and "ventricular" co-occur in many documents, then if the query contains only one of these terms, documents containing the others will receive partial credit proportional to their document co-occurrence ratio.
• No need to do query expansion or relevance feedback
Generalized Vector Space Model (3)
Disadvantages
• Computationally expensive
• Performance comparable to vector space model plus query expansion
GVSM, How it Works (1)
Represent the collection as a vector of documents:
Let C = [D1, D2, ..., Dm]
Represent each term by its distributional frequency:
Let ti = [Tf(ti, D1), Tf(ti, D2), ..., Tf(ti, Dm)]
Term-to-term similarity is computed as:
Sim(ti, tj) = cos(vec(ti), vec(tj))
Hence, highly co-occurring terms like "Arafat" and "PLO" will be treated as near-synonyms for retrieval.
GVSM, How it Works (2)
And query-document similarity is computed as before: Sim(Q, D) = cos(vec(Q), vec(D)), except that instead of the dot-product calculation we use a function of the term-to-term similarity computation above. For instance:
Sim(Q, D) = Σi [Maxj sim(qi, dj)]
or, normalizing for document and query length:
Simnorm(Q, D) = Σi [Maxj sim(qi, dj)] / (|Q| * |D|)
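The whole GVSM pipeline above can be sketched as follows (a toy illustration; the collection and term lists are hypothetical, and real systems would use TfIDf rather than raw Tf):

```python
import math

def term_vector(term, docs):
    """GVSM term representation: ti = [Tf(ti, D1), ..., Tf(ti, Dm)]."""
    return [doc.count(term) for doc in docs]

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def gvsm_sim(query_terms, doc_terms, docs):
    """Simnorm(Q, D): sum over qi of max over dj of sim(qi, dj),
    normalized by |Q| * |D|."""
    total = sum(
        max(cos_sim(term_vector(q, docs), term_vector(d, docs)) for d in doc_terms)
        for q in query_terms
    )
    return total / (len(query_terms) * len(doc_terms))

# Toy collection: "arafat" and "plo" always co-occur, so GVSM treats
# them as near-synonyms even with no shared literal term.
docs = [["arafat", "plo", "talks"], ["arafat", "plo"], ["stock", "market"]]
partial = gvsm_sim(["arafat"], ["plo"], docs)  # close to 1.0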
GVSM, How it Works (3)
Primary problem:
More computation (sparse => dense)
Primary benefit:
Automatic term expansion by corpus
A Critique of Pure Relevance (1)
IR Maximizes Relevance
• Precision and recall are relevance measures
• Quality of documents retrieved is ignored
A Critique of Pure Relevance (2)
Other Important Factors
• What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...?
• In IR, we really want to maximize:
P(U(f1, ..., fn) | Q & {C} & U & H)
where Q = query, {C} = collection set, U = user profile, H = interaction history
• ...but we don’t yet know how. Darn.
Maximal Marginal Relevance (1)
• A crude first approximation:
novelty => minimal-redundancy
• Weighted linear combination:
(redundancy = cost, relevance = benefit)
• Free parameters: k and λ
Maximal Marginal Relevance (2)
MMR(Q, C, R) =
Argmax_k over di in C\R of [λ S(Q, di) - (1-λ) max over dj in R of S(di, dj)]
where C is the candidate set, R is the already-ranked set, and Argmax_k selects the k best documents.
Maximal Marginal Relevance (MMR) (3)
COMPUTATION OF MMR RERANKING
1. Standard IR retrieval of top-N docs:
Let Dr = IR(D, Q, N)
2. Rank the di in Dr with max sim(di, Q) as the top doc, i.e.
Let Ranked = {di}
3. Let Dr = Dr \ {di}
4. While Dr is not empty, do:
a. Find the di in Dr with max MMR(Dr, Q, Ranked)
b. Let Ranked = Ranked.di (append di to Ranked)
c. Let Dr = Dr \ {di}
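The reranking loop above can be sketched as follows (a minimal sketch; the similarity function S is assumed to be any symmetric measure such as cosine over TfIDf vectors — the Jaccard word-overlap below is only a stand-in):

```python
def mmr_rerank(query, docs, sim, lam=0.7, k=5):
    """Greedy MMR: repeatedly select the remaining document maximizing
    lam * S(Q, di) - (1 - lam) * max over dj in Ranked of S(di, dj)."""
    remaining = list(docs)
    ranked = []
    while remaining and len(ranked) < k:
        best = max(
            remaining,
            key=lambda d: lam * sim(query, d)
            - (1 - lam) * max((sim(d, r) for r in ranked), default=0.0),
        )
        ranked.append(best)
        remaining.remove(best)
    return ranked

def word_overlap(a, b):
    """Stand-in similarity: Jaccard overlap of word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

docs = ["heart disease treatment", "heart disease therapy", "stock market news"]
# A low lambda favors novelty: the redundant second heart-disease doc is skipped.
ranked = mmr_rerank("heart disease", docs, word_overlap, lam=0.3, k=2)
```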
Maximal Marginal Relevance (MMR) (4)
Applications:
• Ranking retrieved documents from IR Engine
• Ranking passages for inclusion in Summaries
Document Summarization in a Nutshell (1)
Types of Summaries
Task: INDICATIVE, for filtering ("Do I read further?")
• Query-relevant (focused): to filter search engine results
• Query-free (generic): short abstracts
Task: CONTENTFUL, for reading in lieu of full doc.
• Query-relevant (focused): to solve problems for busy professionals
• Query-free (generic): executive summaries
Document Summarization in a Nutshell (2)
Other Dimensions
• Single vs multi document summarization
• Genre-adaptive vs one-size-fits-all
• Single-language vs translingual
• Flat summary vs hyperlinked pyramid
• Text-only vs multi-media
• ...
Summarization as Passage Retrieval (1)
For Query-Driven Summaries
1. Divide the document into passages, e.g. sentences, paragraphs, FAQ-pairs, ...
2. Use the query to retrieve the most relevant passages, or better, use MMR to avoid redundancy.
3. Assemble the retrieved passages into a summary.
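The three steps above can be sketched as follows (a toy sketch; the period-based sentence splitter and word-overlap similarity are simplifying stand-ins for a real passage segmenter and TfIDf cosine):

```python
def summarize(document, query, sim, lam=0.7, n_passages=2):
    """Query-driven summary via passage retrieval: (1) split into sentence
    passages, (2) greedily MMR-select relevant non-redundant passages,
    (3) assemble the selected passages back in original document order."""
    passages = [p.strip() for p in document.split(".") if p.strip()]
    selected, candidates = [], list(passages)
    while candidates and len(selected) < n_passages:
        best = max(
            candidates,
            key=lambda p: lam * sim(query, p)
            - (1 - lam) * max((sim(p, s) for s in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return ". ".join(p for p in passages if p in selected) + "."

def word_overlap(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

doc = ("Heart disease is a leading cause of death. "
       "Heart disease deaths are declining. "
       "The stock market rose today.")
summary = summarize(doc, "heart disease deaths", word_overlap, n_passages=1)
```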
Summarization as Passage Retrieval (2)
For Generic Summaries
1. Use title or top-k Tf-IDF terms as query.
2. Proceed as Query-Driven Summarization.
Summarization as Passage Retrieval (3)
For Multidocument Summaries
1. Cluster documents into topically-related groups.
2. For each group, divide each document into passages and keep track of the source of each passage.
3. Use MMR to retrieve most relevant non-redundant passages (MMR is necessary for multiple docs).
4. Assemble a summary for each cluster.