
Information retrieval: overview

Information Retrieval and Text Processing

• Huge literature dating back to the 1950s!

• SIGIR/TREC - home for much of this

• Readings:
  – Salton, Wong, Yang, "A Vector Space Model for Automatic Indexing," CACM, Nov. 1975, Vol. 18, No. 11
  – Turtle, Croft, "Inference Networks for Document Retrieval," ???, [OPTIONAL]

IR/TP applications

• Search
• Filtering
• Summarization
• Classification
• Clustering
• Information extraction
• Knowledge management
• Author identification
• …and more...

Types of search

• Recall -- finding documents one knows exist, e.g., an old e-mail message or RFC

• Discovery -- finding “interesting” documents given a high-level goal

• Classic IR search is focused on discovery

Classic discovery problem

• Corpus: fixed collection of documents, typically “nice” docs (e.g., NYT articles)

• Problem: retrieve documents relevant to user’s information need

Classical search

[Diagram: Task → (Conception) → Info need → (Formulation) → Query → (Search over Corpus) → Results → (Refinement) → back to Query]

Definitions

• Task: example: write a Web crawler

• Information need: perception of documents needed to accomplish task, e.g., Web specs

• Query: sequence of characters given to a search engine that one hopes will return the desired documents

Conception

• Translating task into information need

• Mis-conception: identifying too little (e.g., omitting tips on high-bandwidth DNS lookups) and/or too much (e.g., including the TCP spec) as relevant to the task

• Sometimes a little extra breadth in the results can tip the user off to the need to refine the info need, but there is not much research into handling this automatically

Translation

• Translating info need into query syntax of particular search engine

• Mis-translation: get this wrong
  – Operator error (is "a b" == a&b or a|b ?)
  – Polysemy -- same word, different meanings
  – Synonymy -- different words, same meaning

• Automation: “NLP”, “easy syntax”, “query expansion”, “Q&A”

Refinement

• Modification of query, typically in light of particular results, to better meet info need

• Lots of work on refining queries automatically (often with some input from the user, e.g., "relevance feedback")

Precision and recall

• Classic metrics of search-result “goodness”

• Recall = fraction of all good docs retrieved
  – |relevant results| / |all relevant docs in corpus|

• Precision = fraction of results that are good
  – |relevant results| / |result-set size|
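As a concrete illustration of the two formulas, here is a minimal Python sketch (the function name and document ids are hypothetical, not from the slides):

```python
def precision_recall(results, relevant):
    """Compute precision and recall for one query's result list.

    results  -- list of retrieved document ids, in ranked order
    relevant -- set of ids of all relevant documents in the corpus
    """
    hits = sum(1 for doc in results if doc in relevant)
    precision = hits / len(results) if results else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved docs are relevant; the corpus holds 6 relevant docs.
print(precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d2", "d4", "d7", "d8", "d9"}))
# (0.75, 0.5)
```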

Precision and recall

• Recall/precision trade-off:
  – Return everything ==> great recall, bad precision
  – Return nothing ==> great precision, bad recall

• Precision curves
  – Search engine produces a total ranking
  – Plot precision at 10%, 20%, ..., 100% recall (sketched below)
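A hedged sketch of how such a curve might be computed from a total ranking (names are hypothetical; this is the simple, uninterpolated form):

```python
def precision_at_recall_levels(ranking, relevant):
    """Walk down a total ranking; each time recall reaches another tenth
    (10%, 20%, ..., 100%), record the precision at that point."""
    points = []
    hits = 0
    next_tenth = 1
    for rank, doc in enumerate(ranking, start=1):
        if doc not in relevant:
            continue
        hits += 1
        while next_tenth <= 10 and hits / len(relevant) >= next_tenth / 10:
            points.append((next_tenth / 10, hits / rank))
            next_tenth += 1
    return points  # (recall, precision) pairs to plot as the precision curve

print(precision_at_recall_levels(["d1", "d9", "d2"], {"d1", "d2"}))
# first relevant doc at rank 1 covers recall 0.1-0.5 at precision 1.0;
# the second, at rank 3, covers recall 0.6-1.0 at precision ~0.67
```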

Other metrics

• Novelty / anti-redundancy
  – Information content of result set is disjoint

• Comprehensible
  – Returned documents can be understood by the user

• Accurate / authoritative
  – Citation ranking!!

• Freshness

Classic search techniques

• Boolean

• Ranked boolean

• Vector space

• Probabilistic / Bayesian

Term vector basics

Document Id   Automobile   Carburetor   Feline   Jaguar   …
Doc 1         2            3            0        2
Doc 2         0            0            2        2
Doc 3         2            0            0        2
…

• Basic abstraction for information retrieval
• Useful for measuring "semantic" similarity of text
• A row in the above table is a "term vector"
• Columns are word stems and phrases
• Trying to capture "meaning"
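A minimal sketch of building such count vectors over a fixed vocabulary (stemming and phrase detection are omitted; names are hypothetical):

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count-based term vector over a fixed vocabulary.
    A real system would stem words and detect phrases first."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["automobile", "carburetor", "feline", "jaguar"]
print(term_vector("jaguar automobile carburetor carburetor automobile carburetor jaguar",
                  vocab))
# [2, 3, 0, 2] -- matches Doc 1 in the table above
```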

Everything’s a vector!!

• Documents are vectors

• Document collections are vectors

• Queries are vectors

• Topics are vectors

Cosine measurement of similarity

• cos(E1, E2) = E1 · E2 / (|E1| * |E2|)

• Rank docs against queries, measure similarity of docs, etc.

• In the example:
  – cos(doc1, doc2) ≈ 1/3
  – cos(doc1, doc3) ≈ 2/3
  – cos(doc2, doc3) ≈ 1/2
  – So: doc1 & doc3 are closest (verified in the sketch below)
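A quick sketch reproducing those numbers from the count vectors in the term-vector table (assumed Python, hypothetical names):

```python
import math

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc1 = [2, 3, 0, 2]   # Automobile, Carburetor, Feline, Jaguar counts
doc2 = [0, 0, 2, 2]
doc3 = [2, 0, 0, 2]
print(cosine(doc1, doc2))  # ~0.34 (≈ 1/3)
print(cosine(doc1, doc3))  # ~0.69 (≈ 2/3)
print(cosine(doc2, doc3))  # 0.5
```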

Weighting of terms in vectors

• Salton's "TF*IDF"
  – TF = term frequency in document
  – DF = doc frequency of term (# docs with term)
  – IDF = inverse doc freq. = 1/DF
  – Weight of term = TF * IDF

• "Importance" of term determined by:
  – Count of term in doc (high ==> important)
  – Number of docs with term (low ==> important)
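A minimal sketch of this weighting using the slide's simple IDF = 1/DF (many systems use a damped variant such as log(N/DF) instead; the names below are hypothetical):

```python
from collections import Counter

def tfidf_vectors(docs):
    """TF*IDF weights with the simple IDF = 1/DF from the slide.
    docs: list of token lists; returns one {term: weight} dict per doc."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per doc
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * (1.0 / df[t]) for t in tf})
    return weighted

docs = [["jaguar", "automobile", "carburetor"],
        ["jaguar", "feline"],
        ["jaguar", "automobile"]]
print(tfidf_vectors(docs)[0])
# carburetor (rarest term) gets the highest weight, jaguar (in every doc) the lowest
```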

Relevance-feedback in VSM

• Rocchio formula:
  – Q' = F[Q, Relevant, Irrelevant]
  – Where F is a weighted sum, such as:

Q'[t] = a*Q[t] + b*sum_i R_i[t] + c*sum_i I_i[t]

  (typically b > 0 and c < 0, so relevant docs pull the query toward them and irrelevant docs push it away)
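A sketch of that update on term-weight dictionaries; the constants a, b, c are illustrative (not from the slides), with c negative so irrelevant documents push the query away:

```python
def rocchio(query, relevant, irrelevant, a=1.0, b=0.75, c=-0.25):
    """Rocchio relevance feedback on {term: weight} dictionaries.
    relevant / irrelevant are lists of judged documents' weight dicts."""
    terms = set(query)
    for d in relevant + irrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = a * query.get(t, 0.0)
        w += b * sum(d.get(t, 0.0) for d in relevant)
        w += c * sum(d.get(t, 0.0) for d in irrelevant)
        new_q[t] = max(w, 0.0)  # negative weights are usually clipped to zero
    return new_q

q = rocchio({"jaguar": 1.0},
            relevant=[{"jaguar": 0.5, "automobile": 0.5}],
            irrelevant=[{"feline": 0.7}])
print(q)  # jaguar and automobile weights rise; feline stays at 0
```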

Remarks on VSM

• Principled way of solving many IR/text processing problems, not just search

• Tons of variations on VSM
  – Different term weighting schemes
  – Different similarity formulas

• Normalization itself is a huge sub-industry

All of this goes out the window on the Web

• Very small, unrefined queries

• Recall not an issue
  – Quality is the issue (want most relevant)
  – Precision-at-ten matters (how many total losers)

• Scale precludes heavy VSM techniques

• Corpus assumptions (e.g., unchanging, uniform quality) do not hold

• “Adversarial IR” - new challenge on Web

• Still, VSM is an important tool for Web Archeology