
LIDA 2003 Invited Paper

The Challenge of Finding Information in Long Documents

David J Harper

The Robert Gordon University

Smart Web Technologies Centre

School of Computing

Aberdeen, Scotland


Preamble

Information retrieval research has focussed largely on document retrieval, and rather less on within-document retrieval

Within-document retrieval is just part of a range of tools and techniques that address “retrieval-with-reading” activities

Explore language modelling as a principled basis for “retrieval-with-reading” techniques or tools


Outline of Talk

Categorisation of retrieval-with-reading activities

Review of retrieval-with-reading techniques and tools

Language Modelling 101

ProfileSkim: Relevance Profiling Tool

Applying Language Modelling to retrieval-with-reading activities

Concluding Remarks


Categorisation of Reading Activities

Reading to …

… to select a document
  Buying a book
  Opening a webpage retrieved by search engine
  Deciding to read document

… to extract/locate specific information
  Finding a quotation in a book
  Locating contact details on a webpage

… to reference information (more generally)
  Finding supporting information for a legal case
  Finding related work


Categorisation of Reading Activities (cont)

Reading to …

… to write a document
  Usually involves a complex mix of other reading activities

… to explore the information space from a given “pivot” document
  Follow-up bibliographic references in a paper
  Follow hypertext links in web pages
  Find similar documents

… to understand a document in depth
  Reading a book/paper cover-to-cover
  Skimming a book/paper


Reading to Select a Document

Enabled by various forms of document summarisation or overview

Summarisation of documents, e.g. automatic abstracting or extracting

Snippet summarisation of web pages retrieved by search engines:
  Generic summarisation
  Query-biased summarisation

Overviews of document structure/content


Reading to Select a Document (Example 1)

Query-biased web page summarisation

Generating summaries for use in ranked retrieval display

Summaries based on distribution of words in document (title, headings, body), biased towards query words

Top-scoring sentences used in summary

User experiments confirm that query-biased summaries are better than general summaries

Tombros and Sanderson 1998


Reading to Select a Document (Example 2)

TileBars: compact visualisation of retrieved documents with respect to query (topic), showing:
  relative length of each document,
  the frequency of the topic words in the document, and
  the distribution of the topic words with respect to the document and to each other

Hearst 1995


Reading to Extract Specific Information

Information extraction techniques that extract factoids (and usually populate a database) based on templates, e.g. extracting contact details from web pages, Ask Jeeves

Passage (or snippet) retrieval, where the passage contains the desired specific information

Browsing tools and techniques:
  Query term highlighting within retrieved documents
  Find function in web browser/word processing package (woeful)


Reading to Reference Information in a Document

Reading tools that integrate document overviews (e.g. table of contents) and document view

Passage retrieval, providing that passages rather than documents are retrieved

Within-document retrieval tools
  ProfileSkim: passage retrieval in context


Reading to Write a Document

Interleaving of writing and reading sub-tasks

Mix of different kinds of reading activities

Example: Remembrance Agent
  Augments user while writing (unobtrusive)
  Displays documents (emails, notes, online documents) relevant to user’s current context
  Monitors writing/browsing activity and displays one-line summaries in document editor (Emacs)
  Rhodes and Starner 1996


Reading to Explore from Pivot Document

Follow-up references, papers by same author, same group, etc.
  CiteSeer is the obvious tool on the Web

Find nearest neighbour documents by essentially using pivot document as a query, e.g. “More Like This” function

Explore category in which document is located, e.g. documents in NLM MeSH category, web pages in Yahoo! category

Follow hard-wired hypertext links
  Within and between document cross references

Follow “soft” hypertext links
  Use chunk of document text as a query [Plagiarism Story]


Reading to Understand or Study a Document

In general, will involve a mix of other kinds of reading activity

Annotation (including ability to add dynamic cross references) and “clipping” are arguably as important as reading


“Reading” of Multi-media Documents

Kinds of reading activity equally applicable to multimedia documents

Reading to select: video or soundtrack

Reading to extract: quotation in audio speech

Reading to reference: scene/shot retrieval in a video


Language Modelling 101

(Simple) statistical representation of a “chunk” of text, e.g. of a document, paragraph, etc

Simplest model is “bag of words” model, which essentially:
  Counts frequencies of words (tokens) in text
  Interprets counts as a probability distribution

Use distributions to compare different text chunks!!
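As a minimal sketch of the “bag of words” idea, the following counts tokens in a text chunk and normalises the counts into a probability distribution. The function name `bag_of_words_model` and the simple lowercase alphanumeric tokenisation are assumptions made for illustration; the talk does not specify a tokeniser.

```python
import re
from collections import Counter


def bag_of_words_model(text):
    """Return a dict mapping each token to its relative frequency in text."""
    # Assumed tokenisation: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    # Interpret the counts as a probability distribution over words.
    return {word: n / total for word, n in counts.items()}


model = bag_of_words_model("retrieval of information: information retrieval systems")
```

Here the chunk has six tokens, so `retrieval` and `information` each get probability 2/6, and `of` and `systems` each get 1/6.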


“Bag of Words” Example

Consider relevance of this document with respect to queries:

{ TREC, experiment }

{ precision, recall }

Document

Word          Frequency (prob)
evaluation    0.05
retrieval     0.15
information   0.15
system        0.15
TREC          0.25
experiment    0.15
precision     0.05
recall        0.05
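The comparison of the two queries can be worked through in a few lines. This sketch assumes the query terms are generated independently, so that the probability of a query is the product of its term probabilities under the document model; the function name `query_likelihood` is chosen here for illustration.

```python
import math

# Word probabilities taken from the example document above.
doc_model = {
    "evaluation": 0.05, "retrieval": 0.15, "information": 0.15, "system": 0.15,
    "TREC": 0.25, "experiment": 0.15, "precision": 0.05, "recall": 0.05,
}


def query_likelihood(query, model):
    """P(query | model): product of the term probabilities (independence assumed)."""
    return math.prod(model.get(term, 0.0) for term in query)


p_trec = query_likelihood(["TREC", "experiment"], doc_model)    # 0.25 * 0.15
p_pr = query_likelihood(["precision", "recall"], doc_model)     # 0.05 * 0.05
```

Under this model the document is a better match for { TREC, experiment } (0.25 × 0.15 = 0.0375) than for { precision, recall } (0.05 × 0.05 = 0.0025).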


Language Modelling 101 (cont)

Language models can be built over any chunks of text:
  Collection or (arbitrary) set of documents
  Entire document
  Parts of document

Given Text1 and Text2, and corresponding language models ModelT1 and ModelT2, we can use them to:

Compare similarity of texts by comparing models
  ModelT1 <-> ModelT2
  e.g. document <-> document

Decide if a text could be “generated” from another text
  Probability of (ModelT1 -> Text2)
  e.g. document -> query, often expressed as Prob( Query | ModelDocument )
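The talk does not name a particular measure for comparing two models (ModelT1 <-> ModelT2). One common choice, used here purely as an assumption, is the Jensen-Shannon divergence, which is symmetric and stays finite when a word is missing from one model. With base-2 logarithms it is 0 for identical models and 1 for models with no words in common.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions (dicts)."""
    total = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # probability under the average of the two models
        if pw > 0:
            total += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            total += 0.5 * qw * math.log2(qw / mw)
    return total


# Toy models, invented for illustration.
model_t1 = {"retrieval": 0.5, "document": 0.5}
model_t2 = {"retrieval": 0.5, "query": 0.5}
model_t3 = {"speech": 0.5, "audio": 0.5}

d_12 = js_divergence(model_t1, model_t2)  # partial overlap
d_13 = js_divergence(model_t1, model_t3)  # no overlap
```

A smaller divergence means more similar texts, so in this toy case Text1 is closer to Text2 than to Text3.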


Using Language Models for Retrieval Processes

Similarity of text chunks, e.g. document with document

Matching based on probability of generating one text chunk from another, e.g. query from document

[Figure: Document 1 and Document 2 with Model of 1 and Model of 2, compared as ModelT1 <-> ModelT2; Document D with Model of D and Query, matched as Pr (ModelT1 -> Text2)]


ProfileSkim

Developed to support retrieval within long documents

Within-document retrieval tool: supports reading to extract and reading to reference

Main concept: relevance profiling based on language modelling

Harper et al 2002, 2003


Overview of ProfileSkim Tool

[Screenshot of ProfileSkim with callouts: file to skim; skim query; tile being visited; highlighted query term variants]


Relevance Profile Meter (1)

[Figure: Relevance Profile Meter shown alongside the document. Axes: retrieval status value against word position. Callouts: tile; click and visit ...]


Relevance Profiling Process

[Figure: a sliding window moves over the document and P(query | window) is computed at each position; within each tile, max -> tile RSV]


Profile Generation using Language Modelling

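Sliding window of N words of fixed size

Compute “retrieval status value” RSVwindow at each word position in the document

RSVwindow = P( generate query | window )

\[ P(\mathit{query} \mid \mathit{window}) = \prod_{t_i \in \mathit{query}} p_{mix}(t_i \mid \mathit{win}) \]

\[ p_{mix}(t_i \mid \mathit{win}) = w_{win}\, p_{win}(t_i \mid \mathit{win}) + (1 - w_{win})\, p_{doc}(t_i \mid \mathit{doc}) \]

\[ p_{win}(t_i \mid \mathit{win}) = \frac{n_{iW}}{n_W}, \qquad p_{doc}(t_i \mid \mathit{doc}) = \frac{n_{iD}}{n_D} \]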
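The profile computation on this slide can be sketched as follows. This is an illustration under stated assumptions, not the ProfileSkim implementation: the counts are read as term occurrences in the window (or document) divided by its length, the window is anchored at its first word, tiles are taken as fixed-size runs of window positions, and the values of `window_size`, `w_win` and `tile_size` are arbitrary.

```python
from collections import Counter


def relevance_profile(tokens, query, window_size=5, w_win=0.8):
    """RSV for each window start: product over query terms of the mixed probability."""
    if not tokens:
        return []
    doc_counts = Counter(tokens)
    n_doc = len(tokens)
    profile = []
    for start in range(max(1, n_doc - window_size + 1)):
        window = tokens[start:start + window_size]
        win_counts = Counter(window)
        rsv = 1.0
        for term in query:
            p_win = win_counts[term] / len(window)  # term frequency in the window
            p_doc = doc_counts[term] / n_doc        # term frequency in the document
            rsv *= w_win * p_win + (1.0 - w_win) * p_doc
        profile.append(rsv)
    return profile


def tile_rsvs(profile, tile_size):
    """Maximum window RSV within each consecutive tile of window positions."""
    return [max(profile[i:i + tile_size]) for i in range(0, len(profile), tile_size)]


# Toy document: the query terms occur together at positions 4 and 5.
tokens = ["x"] * 4 + ["language", "model"] + ["x"] * 4
profile = relevance_profile(tokens, ["language", "model"], window_size=2)
tiles = tile_rsvs(profile, tile_size=3)
```

The profile peaks at the window that contains both query terms, and the tile covering that window receives the highest tile RSV. Mixing in the document model keeps the RSV non-zero for windows that miss a query term, provided the term occurs somewhere in the document.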


Query-biased summarisation: Using LM

Select representative paragraph for a retrieved document based on query:
  Choose paragraph (para) where:
    Mpara <-> Mdoc is largest AND
    Pr (Mpara -> Query) is largest

[Figure: Document, Paragraph and Query, with language models Mdoc, Mpara1, Mpara2, etc]
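A rough sketch of this paragraph selection follows. The slide does not say how the two criteria are measured or combined, so everything specific here is an assumption: similarity to the document is taken as 1 minus the Jensen-Shannon divergence, the paragraph model is mixed with the document model before generating the query, and the two values are simply multiplied. All function names are hypothetical.

```python
import math
from collections import Counter


def model(tokens):
    """Relative-frequency word model of a non-empty token list."""
    counts = Counter(tokens)
    return {w: n / len(tokens) for w, n in counts.items()}


def similarity(p, q):
    """1 - Jensen-Shannon divergence (base 2); 1.0 means identical models."""
    js = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)
        if pw > 0:
            js += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            js += 0.5 * qw * math.log2(qw / mw)
    return 1.0 - js


def query_likelihood(query, m_para, m_doc, w_para=0.8):
    """Pr(Mpara -> Query), with the paragraph model mixed with the document model."""
    return math.prod(
        w_para * m_para.get(t, 0.0) + (1.0 - w_para) * m_doc.get(t, 0.0) for t in query
    )


def pick_paragraph(paragraphs, query):
    """Index of the paragraph with the highest combined score (assumed: product)."""
    m_doc = model([t for para in paragraphs for t in para])

    def score(para):
        m_para = model(para)
        return similarity(m_para, m_doc) * query_likelihood(query, m_para, m_doc)

    return max(range(len(paragraphs)), key=lambda i: score(paragraphs[i]))


paragraphs = [
    ["language", "models", "for", "retrieval"],
    ["tilebars", "show", "term", "distribution"],
    ["profiles", "show", "retrieval", "of", "passages"],
]
best = pick_paragraph(paragraphs, ["tilebars"])
```

For the query { tilebars } the second paragraph is chosen, since it is the only one that contains the query term. Other ways of combining the two criteria, such as a weighted sum of ranks, would fit the slide equally well.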


Soft hyperlinks: Using LM

Given selected text within document, generate soft-links to other (relevant) documents
  Assume text model of web (say) Mweb
  Compare Mweb and Mselect to choose set of terms that contribute MOST to divergence
  Use chosen terms to query the Web, and generate soft links

Note: Can mix Mselect and Mdoc to obtain better model of selected text!

[Figure: Selected Text (Mselect) within Document (Mdoc), linked to Soft-linked Documents]
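The term selection step might look like the sketch below. It assumes the divergence is the Kullback-Leibler divergence of Mselect from Mweb, so that each term contributes p_select(t) · log(p_select(t) / p_web(t)). The background probabilities, the floor value for terms unseen in the background model, and the function name `divergence_terms` are all invented for illustration.

```python
import math
from collections import Counter


def divergence_terms(selected_tokens, background_probs, k=3, floor=1e-6):
    """Top-k terms by contribution p_sel(t) * log(p_sel(t) / p_web(t)) to KL divergence."""
    counts = Counter(selected_tokens)
    total = sum(counts.values())
    contributions = {}
    for term, n in counts.items():
        p_sel = n / total
        # Assumption: terms missing from the background model get a small floor probability.
        p_web = background_probs.get(term, floor)
        contributions[term] = p_sel * math.log(p_sel / p_web)
    return sorted(contributions, key=contributions.get, reverse=True)[:k]


# Hypothetical background (Mweb) probabilities.
m_web = {"the": 0.05, "of": 0.03, "language": 0.001,
         "modelling": 0.0001, "retrieval": 0.0002}
selected = "the language modelling of retrieval the".split()
query_terms = divergence_terms(selected, m_web, k=3)
```

Common words such as “the” score low even when frequent in the selection, because they are also frequent in the background model. The chosen terms would then be submitted as a query to a search engine to generate the soft links.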


Reading to write: Using LM (exercise for reader)

As you are writing a document, a tool suggests parts of other documents that may be relevant.

cf. Remembrance Agent

[Figure label: “writing this”]


Reading in Context

Reading documents is generally done in the context of a larger task, and the pattern of reading activities will depend on the task.

Task: Writing a research proposal for EU Framework 6:
  Reading FP6 Programme Call (and many related documents): reading to extract and reference
  Reading to reference documents supporting proposal
  Reading to extract ancillary information, e.g. contact details from web pages (say)

Can you think of any searching/reading environment that supports such a complex set of interactions?


Concluding Remarks

Reading of (long) documents to find information is raising interesting challenges in the field of information retrieval

A variety of reading activities should be supported, and preferably within an information seeking (with reading) environment

Language Models enable us to model text chunks at various levels of granularity, and thus provide a principled foundation for “retrieval-with-reading” techniques and tools


Reading List

Hearst, M. A.: TileBars: visualization of term distribution information in full text information access. In: Proc. CHI'95 (1995) 56-66.

Whittaker, S., Hirschberg, J., Choi, J., Hindle, D., Pereira, F. and Singhal, A.: SCAN: Designing and evaluating user interfaces to support retrieval from speech archives. In: Proceedings of ACM SIGIR '99. ACM Press (1999) 26-33.

Kaszkiel, M. and Zobel, J.: Passage Retrieval Revisited. In: Proceedings of the Twentieth International ACM-SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, July 1997. ACM Press (1997) 178-185.

Kaszkiel, M.: Indexing and Retrieval of Passages in Full-Text Databases, PhD thesis. RMIT Computer Science Technical Report (RT-17), May 2000 (2000).



Kaszkiel, M., Zobel, J. and Sacks-Davis, R.: Efficient Passage Ranking for Document Databases. ACM Transactions on Information Systems, Vol 17, No. 4 (1999) 406-439.

Landauer, T., Egan, D., Remde, J., Lesk, M., Lochbaum, C., and Ketchum, D.: Enhancing the usability of text through computer delivery and formative evaluation: The SuperBook project. In: McKnight, C., Dillon, A., and Richardson, J. (eds): Hypertext: A Psychological Perspective. Ellis Horwood (1993) 71-136.

Marchionini, G.: Information Seeking in Electronic Environments. Cambridge University Press, Cambridge (1995).

Byrd, D.: A Scrollbar-based Visualization for Document Navigation. In: Proceedings of ACM Digital Libraries 99. ACM Press (1999).

de Kretser, O. and Moffat, A.: Effective Document Presentation with a Locality-Based Similarity Heuristic. In: Proceedings of the Twenty Second International ACM-SIGIR Conference on Research and Development in Information Retrieval, Berkeley, August 1999. ACM Press (1999) 113-120.



Tombros, A. and Sanderson, M.: Advantages of Query Biased Summaries in Information Retrieval. In: Proceedings of 1998 ACM SIGIR Conference on Research and Development in Information Retrieval (1998) 2-10.

Ponte, J. and Croft, W. B.: A language modeling approach to information retrieval. In: Proceedings of the 1998 ACM SIGIR Conference on Research and Development in Information Retrieval (1998) 275-281.

Song, F. and Croft, W. B.: A general language model for information retrieval. In: Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval (1999) 279-280.

Schilit, B. N., Golovchinsky, G. and Price, M. N.: Beyond paper: Supporting Active Reading with free-form digital ink annotations. In: Proceedings of CHI98, ACM Press (1998) 149-156.



Harper, D. J., Coulthard, S. and Sun, Y.: A Language Modelling Approach to Relevance Profiling for Document Browsing. In: Procs JCDL 2002, Oregon, USA (2002) 76-83.

Harper, D. J., Koychev, I. and Sun, Y.: Query-Based Document Skimming: A User-Centred Evaluation. In: Procs 25th European Conference on IR Research, LNCS 2622, Springer (2003) 377-392.

Rhodes, B. J. and Starner, T.: Remembrance Agent: A continuously running automated retrieval system. In: Proceedings of The First International Conference on The Practical Application Of Intelligent Agents and Multi Agent Technology (PAAM '96), (1996) 487-495.
