Research & Technology Development Center Information Retrieval James Mayfield The Johns Hopkins University Applied Physics Laboratory 11100 Johns Hopkins Road Laurel, MD 20723-6099 (443) 778-6944 -or- (240) 228-6944 [email protected]


Page 1:

Research & Technology Development Center

Information Retrieval

James Mayfield

The Johns Hopkins University Applied Physics Laboratory

11100 Johns Hopkins Road
Laurel, MD 20723-6099
(443) 778-6944 -or- (240) 228-6944
[email protected]

Page 2:

Overview

• What is IR?

• Evaluation

• Characterizing documents and queries

• Comparing documents against queries

• Implementation details

• Other IR tasks

Page 3:

Information Retrieval

• Information retrieval is the automatic identification of documents from a large document collection that are relevant to an explicitly-stated information need

• Simplifying assumptions
» The document collection is static
» A document is relevant or it isn’t
» There is no user
» All documents are in the same form

– corollary: all documents are text documents

Page 4:

Steps in Basic Text Retrieval

• Characterize each document in collection

• Store characterizations on disk

• Characterize user’s (natural language) query

• Compare characterization of query against document characterizations

• Return rank-ordered list of documents

Page 5:

Other Information Retrieval Tasks

• Document routing/filtering

• Multimedia retrieval (e.g., speech)

• Cross-language Retrieval

• Summarization

• Translation

• Question-answering

• Topic detection

Page 6:

HAIRCUT

• The Hopkins Automatic Information Retriever for Combing Unstructured Text (HAIRCUT): a hybrid information retrieval engine
» Developed entirely at APL under IR&D funding
» Written in Java for portability and ease of implementation
» Supports words, stemmed words, n-grams and phrases as indexing terms

» Uses a variety of similarity metrics to compare documents against queries

» Provides sophisticated tools for combination of evidence

» Indexes many gigabytes of text

Page 7:

Evaluation

• How do you know that one approach to retrieval is better than another?

• At least two requirements:

» An answer key

» A way to score a result set based on the answer key

Page 8:

Evaluating Retrieval Systems: The Text REtrieval Conference

• “TREC”

• An annual bake-off for text retrieval systems

• Sponsored by NIST

• Roughly 2.5 gigabytes of text (soon to be 10 gigabytes of Web data)

• 50 “topics” (queries)

• Return top 1000 documents for each topic

• Results judged by retired CIA and NSA analysts

• No-gloat rule

• Numerous tracks, including text routing, very large corpus, cross-language retrieval

Page 9:

Evaluating Retrieval Systems: Sample TREC Topic

<top>
<num> Number: 253
<title> Topic: Cryonic suspension services
<desc> Description: Status report of the cryonic suspension industry - background and future prospects.
<narr> Narrative: Cryonics suspension is the practice of quick-freezing a human body immediately upon death and its preservation in a nitrogen environment for possible resuscitation in the event that a cure is found for the cause of death. There was a rush by some to have this done when it first became feasible, but only by those wealthy enough to afford the freezing and long-term storage fees. This search seeks to determine where the industry is today; is it viable?
</top>

(Annotations on the slide: the topic is SGML markup; the title is a short phrase, the description a sentence fragment, and the narrative a paragraph.)

Page 10:

Evaluating Retrieval Systems: Precision and Recall

                relevant    not relevant
retrieved          A              B
not retrieved      C              D

precision = A / (A + B)

recall = A / (A + C)

average precision = area under the precision-recall curve
(precision from 0% to 100% on one axis, recall from 0% to 100% on the other)

B: “Type one errors” / “errors of commission” / “false positives”
C: “Type two errors” / “errors of omission” / “false negatives”
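The counts in the contingency table above map directly to code. A minimal sketch (function and variable names are illustrative, not from the slides): precision and recall from sets of document ids, plus uninterpolated average precision over a ranked list, a common stand-in for the area under the precision-recall curve.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall from sets of document ids."""
    a = len(retrieved & relevant)                         # cell A: retrieved and relevant
    precision = a / len(retrieved) if retrieved else 0.0  # A / (A + B)
    recall = a / len(relevant) if relevant else 0.0       # A / (A + C)
    return precision, recall

def average_precision(ranking, relevant):
    """Uninterpolated average precision over a ranked list of doc ids."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank   # precision at each relevant document
    return total / len(relevant) if relevant else 0.0
```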

Page 11:

TREC-8 Ad Hoc Retrieval Performance

[Precision-recall graph, TREC-8 ad hoc performance; average precision: Top Manual 46.92%, Top Automatic 33.03%, Median Automatic 26.08%]

Page 12:

Characterizing Documents and Queries

• Break text into pieces

• Fiddle with the pieces to produce terms

• Throw the terms into a bag
» No inter-term ordering information
» Terms are assumed to be independent
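The three steps above can be sketched in a few lines; whitespace tokenization and crude punctuation stripping are simplifying assumptions, not the slides' actual pipeline:

```python
from collections import Counter

def bag_of_terms(text):
    """Break text into pieces, fiddle with them, throw them into a bag.
    Ordering is discarded; only per-term counts survive."""
    pieces = text.lower().split()                       # break into pieces
    terms = [p.strip('.,;:!?"()') for p in pieces]      # fiddle: strip punctuation
    return Counter(t for t in terms if t)               # the bag
```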

Page 13:

Characterizing Documents and Queries: Words

• Most traditional information retrieval systems index documents according to the words in those documents.

• Word-based retrieval is language-specific (e.g., a retrieval system for English will not necessarily work well for Japanese, Turkish, or even Spanish).

Four score and seven...

Words:   four  score  and  seven  ...
Stemmed: four  scor  seven  year  ...
Stopped: four  score  seven  years  ...

• Word-based retrieval performs poorly when the documents to be retrieved are garbled or contain spelling mistakes (e.g., from OCR).

Page 14:

Characterizing Documents and Queries: N-grams

• An n-gram is a sequence of n consecutive characters (not words) in a text

• N-grams can span words

• Advantages of n-grams:
» language-independent

» robust against errors in text

» capture information about phrases

• Disadvantages of n-grams:
» Larger postings files

– corollary: slower retrieval

» Don’t mesh well with NLP techniques

Four score and seven...

6-grams: "four s"  "our sc"  "ur sco"  "r scor"  ...
5-grams: "four "  "our s"  "ur sc"  "r sco"  ...
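Generating character n-grams is a one-liner over a sliding window; the lowercasing step here is an assumption for illustration:

```python
def char_ngrams(text, n):
    """Every sequence of n consecutive characters in the text.
    N-grams freely span word boundaries (spaces are characters too)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```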

Page 15:

Words vs. N-grams

• The efficacy of words and n-grams varies across natural languages

• Some languages (e.g., Mandarin) are difficult to automatically segment into words, and therefore fare significantly better under n-gram approaches

[Precision-recall graphs comparing 6-grams and words; average precision by language:
German:  6-grams 28.29%, words 16.14%
English: 6-grams 25.38%, words 24.81%
French:  6-grams 33.55%, words 35.74%
Italian: 6-grams 23.91%, words 24.09%]

Page 16:

Retrieval Models

• Four basic models
» Boolean
» Vector
» Probabilistic
» Statistical

• Hybrid Retrieval

• Relevance Feedback

Page 17:

Retrieval Model #1: Boolean Retrieval

• Basic Boolean model combines terms with AND, OR, and NOT

• Boolean models do not rank results

• Extended Boolean models add operators…
» proximity operators
» wild cards
» stemming

• … and/or document ranking
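The basic (unranked) Boolean model can be sketched against an inverted index; the dict-of-sets index shape and the function name are illustrative assumptions:

```python
def boolean_query(index, must, must_not=()):
    """AND together the postings of the required terms, then NOT-out the
    excluded ones. `index` maps a term to the set of ids of documents
    containing it; `must` must be non-empty. Returns an unranked set:
    the basic Boolean model does not rank its results."""
    docs = set.intersection(*(index.get(t, set()) for t in must))
    for t in must_not:
        docs -= index.get(t, set())
    return docs
```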

Page 18:

Retrieval Model #2: Vector-Based Retrieval

• View documents and queries as points in a (high-dimensional) vector space

• Similarity between a document and a query is calculated geometrically

• Typically, each term receives its own dimension

• Euclidean distance is a poor measure

• Many systems use some variant of cosine

[Figure: documents and a query as points in a space whose axes are term 1, term 2, term 3]

Reference: Salton, Wong & Yang, “A vector space model for automatic indexing.” CACM 18(11):613-620. 1975.

sim(D, Q) = sum_i(d_i * q_i) / ( sqrt(sum_i d_i^2) * sqrt(sum_i q_i^2) )
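The cosine measure above, sketched for sparse vectors stored as term-to-weight dicts (a common representation, assumed here for illustration):

```python
import math

def cosine(d, q):
    """Cosine of the angle between a document vector d and a query vector q,
    each a dict mapping term -> weight. Missing terms have weight zero."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm = (math.sqrt(sum(w * w for w in d.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm else 0.0
```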

Page 19:

Term Weighting

• Term weights need not be binary (as they are in the Boolean model)

• The more frequently a term appears in a document, the better that term describes the document

• The more frequently a term appears in the collection as a whole, the worse that term is at discriminating relevant from non-relevant documents

• So, weight terms by their frequency within the document (TF), and inversely by their frequency in the collection (IDF)

w_t = tf_t * log(N / df_t)
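The TF×IDF weight above in code; N is the number of documents in the collection, df_t the number of documents containing term t:

```python
import math

def tf_idf(tf, df, n_docs):
    """w_t = tf_t * log(N / df_t): high for terms frequent in the document
    but rare in the collection; zero for terms appearing in every document."""
    return tf * math.log(n_docs / df)
```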

Page 20:

Retrieval Model #3: Probabilistic Retrieval

• Probability ranking principle: optimal retrieval performance is achieved when documents are ranked according to their probabilities of being judged relevant to a query

• Often, odds of relevance are used instead of probabilities:

• Many probabilistic methods boil down to different term weighting

• Okapi BM25 and variants seem to be the current favorites:

O(R) = P(R) / (1 - P(R))

w_t = tf_t * log((N - df_t + 0.5) / (df_t + 0.5)) / (2 * (0.25 + 0.75 * dl/avdl) + tf_t)

(N = number of documents; df_t = document frequency of term t; tf_t = term frequency; dl = document length; avdl = average document length)
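The Okapi-style weight above in code, with the slide's constants (k1 = 2, b = 0.75) baked in; the function name is illustrative:

```python
import math

def okapi_weight(tf, df, n_docs, dl, avdl):
    """Okapi BM25-style term weight with k1 = 2 and b = 0.75:
    the log factor is a smoothed IDF, and tf is damped by document
    length relative to the collection average."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf * idf / (2 * (0.25 + 0.75 * dl / avdl) + tf)
```

Two properties worth noticing: longer documents get smaller weights for the same tf, and very common terms (df near N) get weights near zero.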

Page 21:

Retrieval Model #4: Statistical Retrieval

• Build language model for each document in collection

• Calculate probability that each language model would produce query

• Requires smoothing for rare or non-existent terms

Page 22:

Comparing Documents With Queries: The Hidden Markov Model

[Figure: an HMM run from query start to query end, with one state emitting p(query term | document) and another emitting p(query term | General English)]

• Apply Bayes’ rule:

p(doc_k is relevant | query) = p(query | doc_k is relevant) * p(doc_k is relevant) / p(query)

» p(query) is identical for all documents (in a single language)
» p(doc_k is relevant) is assumed constant

• Model the remaining factor as a mixture of the two output distributions:

p(query | doc_k is relevant) = product over q in query of [ a * p(q | doc_k) + (1 - a) * p(q | General English) ]

» p(q | doc_k) is modeled using document statistics
» p(q | General English) is modeled using corpus statistics
» Good values for a: words 0.3, n-grams 0.15

Reference: David Miller, Tim Leek and Richard Schwartz, “A hidden Markov model information retrieval system.” In Proceedings of SIGIR ’99, pp. 214–221, 1999.
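The mixture score above can be sketched with unigram maximum-likelihood estimates; computing it in log space (an implementation choice, not from the slide) avoids underflow on long queries:

```python
import math

def hmm_score(query_terms, doc_tf, doc_len, corpus_tf, corpus_len, a):
    """Sum of log[ a*p(q|doc) + (1-a)*p(q|General English) ] over query terms.
    doc_tf / corpus_tf map term -> count; a = 0.3 is the slide's value for
    words. The corpus term keeps the score finite for terms absent from
    the document."""
    score = 0.0
    for q in query_terms:
        p_doc = doc_tf.get(q, 0) / doc_len
        p_ge = corpus_tf.get(q, 0) / corpus_len
        score += math.log(a * p_doc + (1 - a) * p_ge)
    return score
```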

Page 23:

Hybrid Retrieval Models

• Generate two or more sets of retrieval results
» e.g., a word-based run and an n-gram-based run

• Optionally normalize the various runs

• Merge into a single result set

Coffee & Doughnut Processing Center
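The normalize-then-merge steps above can be sketched as a weighted linear combination; min-max normalization and the function name are assumptions for illustration:

```python
def combine_runs(runs, weights):
    """Min-max normalize each run's scores to [0, 1], then merge with a
    weighted sum. Each run is a dict mapping doc id -> score; returns
    doc ids sorted best-first."""
    combined = {}
    for run, w in zip(runs, weights):
        lo, hi = min(run.values()), max(run.values())
        for doc, s in run.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 1.0
            combined[doc] = combined.get(doc, 0.0) + w * norm
    return sorted(combined, key=combined.get, reverse=True)
```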

Page 24:

Linear Combination of Results

[Precision-recall graph, APL TREC-8 results; average precision: 2:2:1 combination 31.5%, stems + HMM 29.1%, 6-grams + HMM 28.9%, 6-grams + cosine 23.4%]

Page 25:

Linear Combination of Results Query by Query

[Bar graph of average precision by topic for the TREC-8 runs: the 2:2:1 combination, stems + HMM, 6-grams + HMM, and 6-grams + cosine]

Page 26:

Blind Relevance Feedback

[Precision-recall graph: 23.7% average precision without feedback, 30.2% with feedback]

• Queries are typically short
• They omit valuable search terms
» e.g., ‘automobile’ in a query about cars
• Blind relevance feedback draws terms from the top retrieved documents, then performs a second retrieval
• 50 queries from the TREC-7 collection show:
» 27.7% increase in average retrieval performance
» 18.8% more relevant documents identified
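The feedback step above can be sketched as query expansion: assume the top-ranked documents are relevant, harvest their most common new terms, and rerun the retrieval. The term-count selection heuristic and names are illustrative assumptions:

```python
from collections import Counter

def expand_query(query_terms, top_docs, n_new=5):
    """Blind relevance feedback: add the n_new most common terms from the
    top retrieved documents (each a list of terms) that are not already
    in the query, then the caller performs a second retrieval."""
    counts = Counter(t for doc in top_docs for t in doc
                     if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(n_new)]
```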

Page 27:

Implementation

• Statistics about documents and terms must be stored on disk

• Most common data structure is an inverted index, which maps a term to a list of documents containing that term

• Other data structures (e.g., signature files, PAT trees) can also be used

• Compression of indexes is advisable
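Building the inverted index described above is straightforward; the dict-of-lists layout here is an in-memory sketch, not the compressed on-disk structure the slide recommends:

```python
def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it.
    `docs` maps doc id -> list of terms; duplicate terms within a document
    yield a single posting."""
    index = {}
    for doc_id, terms in docs.items():
        for t in set(terms):
            index.setdefault(t, []).append(doc_id)
    for postings in index.values():
        postings.sort()   # sorted postings allow fast merging and compression
    return index
```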

Page 28:

Cross-Language Retrieval

• Query in one language

• Documents in many languages

• System must compare query against all documents in collection

• Issues
» Translate query or documents?
» Merging results

Page 29:

The TREC Cross-Language Task

• Four languages:
» English: 764 MB
» French: 251 MB
» German: 550 MB
» Italian: 95 MB

• 28 Topics (we use English)

• Return top 1000 documents for each topic, irrespective of language

• Results judged by four independent teams from America, Switzerland, Germany, and Italy

Page 30:

TREC-8 Cross-Language Retrieval Capability

[Precision-recall graph, TREC-8 cross-language retrieval, top eight systems by average precision: IBM 26.13%, JHU/APL 25.71%, TwentyOne 25.23%, Claritech 23.57%, IRIT 20.59%, Eurospider 19.37%, UMCP 16.16%, NMSU 15.22%]

Page 31:

Our Current Approach to Cross-Language Retrieval

[Figure: an English query (“HAIRCUT”) is translated into French (“COUPE”), German (“HAARSCHNITT”) and Italian (“TAGLIO”); each query retrieves results in its own language, and the four result lists are merged into combined results.]

1. Translate English queries into French, German and Italian using SYSTRAN®

2. Use HAIRCUT to retrieve documents using both words and 6-grams

3. Use linear combination to merge results

Page 32:

How Web Search Engines Work: Indexing

• Place seed URLs into a priority queue
• Repeatedly
» Select next URL from queue
» Fetch page
» Characterize page
» Store characterization in index
» Extract links from page
» Assign priority to each link
» Add links to queue

[Figure: a priority queue of URLs (http://www.semmel.com, http://www.jhuapl.edu/, http://familysearch.com/) feeds the crawler over the Web; a fetched page, “Ralph’s Web Page” (http://www.semmel.com/ — “My favorite color is lavender! I collect Beanie Babies! See pictures of my moss garden!”), is characterized by its terms (baby, beanie, collect, color, …) and stored as doc42 in an inverted index (avocado → doc3, doc177; baby → doc3, doc42, doc117; beanie → doc42, doc77, doc193; …); extracted links (http://www.ty.com/, http://www.semmel.com/garden/) receive priorities (0.943, 0.424) and are added to the queue.]
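The indexing loop above can be sketched with a priority queue; fetch, characterize, extract_links, and priority are caller-supplied stand-ins for real components (an assumption, since the slide names no APIs). Python's heapq pops the lowest value, so priorities are negated to get highest-first.

```python
import heapq

def crawl(seeds, fetch, characterize, extract_links, priority, max_pages=100):
    """Crawl loop: pop the highest-priority URL, fetch and characterize the
    page, store the characterization in the index, and enqueue newly seen
    links with their assigned priorities."""
    queue = [(-1.0, url) for url in seeds]   # seed URLs get top priority
    heapq.heapify(queue)
    index, seen = {}, set(seeds)
    while queue and len(index) < max_pages:
        _, url = heapq.heappop(queue)        # select next URL from queue
        page = fetch(url)                    # fetch page
        index[url] = characterize(page)      # characterize; store in index
        for link in extract_links(page):     # extract links
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-priority(link), link))
    return index
```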

Page 33:

How Web Search Engines Work: Retrieval

• Retrieve query from user
• Characterize query
• Use index to find documents that contain query terms
• Measure similarity between query and each potentially relevant document
• Sort documents by similarity score
• Return documents with highest scores to user

[Figure: the query “lavender Beanie Babies” is characterized as the terms baby, beanie, lavender; the inverted index yields the candidate documents doc3, doc42, doc77, doc117, doc193; these are scored (.924 doc42, .841 doc117, .624 doc77, .331 doc3, .118 doc193) and returned in sorted order as search results: 1. Ralph’s Web Page, 2. Ty Homepage, 3. Toys R Expensive, 4. Caps for Freshmen, 5. Bohnanza, 6. Ralph’s Lavender Page, 7. 404 Not Found, 8. Hot Men in Tight Shorts]
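The retrieval steps above fit in one short function: gather candidates from the index, score them, sort, and return. The pluggable score function and names are illustrative assumptions:

```python
def search(index, score, query_terms):
    """Use the inverted index to find documents containing any query term,
    score each candidate with the supplied similarity function
    score(doc_id, query_terms), and return (score, doc id) pairs
    sorted best-first."""
    candidates = set()
    for t in query_terms:
        candidates |= set(index.get(t, []))
    return sorted(((score(d, query_terms), d) for d in candidates),
                  reverse=True)
```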

Page 34:

Why Web Search Engines Give Lousy Results

• don’t index the right documents
» corollary: do index lots of spam and dreck

• don’t distinguish word senses

• don’t recognize phrases/names/acronyms/you name it

• use poor similarity metrics

• don’t take advantage of human understanding

• care too much about speed

• assume everything is English

• don’t recognize duplicate pages
» corollary: don’t recognize aliased pages/sites

• wait months to revisit a page

• don’t do any meaningful evaluation of their techniques