LIDA 2003 Invited Paper
The Challenge of Finding Information in Long Documents
David J Harper
The Robert Gordon University
Smart Web Technologies Centre
School of Computing
Aberdeen, Scotland
Preamble
Information retrieval research has focussed largely on document retrieval, and rather less on within-document retrieval
Within-document retrieval is just part of a range of tools and techniques that address “retrieval-with-reading” activities
Explore language modelling as a principled basis for “retrieval-with-reading” techniques or tools
Outline of Talk
Categorisation of retrieval-with-reading activities
Review of retrieval-with-reading techniques and tools
Language Modelling 101
ProfileSkim: Relevance Profiling Tool
Applying Language Modelling to retrieval-with-reading activities
Concluding Remarks
Categorization of Reading Activities
Reading to …
… to select a document
  Buying a book
  Opening a webpage retrieved by search engine
  Deciding to read document
… to extract/locate specific information
  Finding a quotation in a book
  Locating contact details on a webpage
… to reference information (more generally)
  Finding supporting information for a legal case
  Finding related work
Categorization of Reading Activities (cont)
Reading to …
… to write a document
  Usually involves a complex mix of other reading activities
… to explore the information space from a given “pivot” document
  Follow up bibliographic references in a paper
  Follow hypertext links in web pages
  Find similar documents
… to understand a document in depth
  Reading a book/paper cover-to-cover
  Skimming a book/paper
Reading to Select a Document
Enabled by various forms of document summarisation or overview
Summarisation of documents, e.g. automatic abstracting or extracting
Snippet summarisation of web pages retrieved by search engines:
  Generic summarisation
  Query-biased summarisation
Overviews of document structure/content
Reading to Select a Document (Example 1)
Query-biased web page summarisation
Generating summaries for use in ranked retrieval display
Summaries based on distribution of words in document (title, headings, body), biased towards query words
Top-scoring sentences used in summary
User experiments confirm that query-biased summaries are better than general summaries
Tombros and Sanderson 1998
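The sentence-scoring idea on this slide can be sketched in Python as follows. This is a minimal illustration, not the method of Tombros and Sanderson: the function name `query_biased_summary`, the bonus values `title_bonus` and `query_bonus`, the regular-expression tokeniser and the sentence splitter are all assumptions made for the example, and no stop-word handling is attempted.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def query_biased_summary(title, body, query, n_sentences=2,
                         title_bonus=1.0, query_bonus=2.0):
    """Return the n top-scoring sentences of `body`, in document order.

    The bonuses are illustrative assumptions, not published values.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", body) if s.strip()]
    body_tokens = tokenize(body)
    body_counts = Counter(body_tokens)
    title_terms = set(tokenize(title))
    query_terms = set(tokenize(query))

    def score(sentence):
        total = 0.0
        for term in tokenize(sentence):
            # Base score: relative frequency of the word in the document.
            total += body_counts[term] / len(body_tokens)
            if term in title_terms:
                total += title_bonus      # word also occurs in the title
            if term in query_terms:
                total += query_bonus      # bias towards query words
        return total

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])   # restore document order
    return [sentences[i] for i in keep]
```

Setting `query_bonus=0` gives a generic (query-independent) summary from the same code, which is one way to see what the query bias adds.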
Reading to Select a Document (Example 2)
TileBars: compact visualisation of retrieved documents with respect to query (topic), showing:
  the relative length of each document,
  the frequency of the topic words in the document, and
  the distribution of the topic words with respect to the document and to each other
Hearst 1995
Reading to Extract Specific Information
Information extraction techniques that extract factoids (and usually populate a database) based on templates, e.g. extracting contact details from web pages, Ask Jeeves
Passage (or snippet) retrieval, where the passage contains the desired specific information
Browsing tools and techniques:
  Query term highlighting within retrieved documents
  Find function in web browser/word processing package (woeful)
Reading to Reference Information in a Document
Reading tools that integrate document overviews (e.g. table of contents) and document view
Passage retrieval, providing that passages rather than documents are retrieved
Within-document retrieval tools
  ProfileSkim: passage retrieval in context
Reading to Write a Document
Interleaving of writing and reading sub-tasks
Mix of different kinds of reading activities
Example: Remembrance Agent
  Augments user while writing (unobtrusive)
  Displays documents (emails, notes, online documents) relevant to user’s current context
  Monitors writing/browsing activity and displays one-line summaries in document editor (Emacs)
Rhodes and Starner 1996
Reading to Explore from Pivot Document
Follow up references, papers by same author, same group, etc.
  CiteSeer is the obvious tool on the Web
Find nearest neighbour documents by essentially using pivot document as a query, e.g. “More Like This” function
Explore category in which document is located, e.g. documents in NLM MeSH category, web pages in Yahoo! category
Follow hard-wired hypertext links
  Within and between document cross references
Follow “soft” hypertext links
  Use chunk of document text as a query [Plagiarism Story]
Reading to Understand or Study a Document
In general, will involve a mix of other kinds of reading activity
Annotation (including ability to add dynamic cross references) and “clipping” are arguably as important as reading
“Reading” of Multi-media Documents
Kinds of reading activity equally applicable to multimedia documents
Reading to select: video or soundtrack
Reading to extract: quotation in audio speech
Reading to reference: scene/shot retrieval in a video
Language Modelling 101
(Simple) statistical representation of a “chunk” of text, e.g. of a document, paragraph, etc.
Simplest model is “bag of words” model, which essentially:
  Counts frequencies of words (tokens) in text
  Interprets counts as a probability distribution
Use distributions to compare different text chunks!!
“Bag of Words” Example
Consider relevance of this document with respect to queries:
{ TREC, experiment }
{ precision, recall }
Document
Word          Probability
evaluation    0.05
retrieval     0.15
information   0.15
system        0.15
TREC          0.25
experiment    0.15
precision     0.05
recall        0.05
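As a worked illustration of the example above, the Python sketch below scores the two queries against the document's bag-of-words model. It assumes the query words are generated independently, so the probability of a query is the product of its word probabilities; the function name `query_likelihood` is invented for the example, and no smoothing is applied, so a query word missing from the model gives probability zero.

```python
from math import prod

# Word probabilities taken from the slide's example document.
doc_model = {
    "evaluation": 0.05, "retrieval": 0.15, "information": 0.15, "system": 0.15,
    "TREC": 0.25, "experiment": 0.15, "precision": 0.05, "recall": 0.05,
}

def query_likelihood(query_terms, model):
    """P(query | model) under a unigram (bag-of-words) assumption."""
    return prod(model.get(term, 0.0) for term in query_terms)

p_trec = query_likelihood(["TREC", "experiment"], doc_model)    # 0.25 * 0.15
p_prec = query_likelihood(["precision", "recall"], doc_model)   # 0.05 * 0.05
print(p_trec, p_prec)
```

The first query scores 0.25 × 0.15 = 0.0375 and the second 0.05 × 0.05 = 0.0025, so under this model the document looks more relevant to { TREC, experiment } than to { precision, recall }.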
Language Modelling 101 (cont)
Language models can be built over any chunks of text:
  Collection or (arbitrary) set of documents
  Entire document
  Parts of document
Given Text1 and Text2, and corresponding language models ModelT1 and ModelT2, we can use them to:
  Compare similarity of texts by comparing models
    ModelT1 <-> ModelT2, e.g. document <-> document
  Decide if a text could be “generated” from another text
    Probability of (ModelT1 -> Text2), e.g. document -> query, often expressed as Prob( Query | ModelDocument )
Using Language Models for Retrieval Processes
Similarity of text chunks, e.g. document with document
  [Diagram: Document 1 and Document 2, each with its own model (Model of 1, Model of 2), compared as ModelT1 <-> ModelT2]
Matching based on probability of generating one text chunk from another, e.g. query from document
  [Diagram: Document D, Model of D and Query, matched as Pr (ModelT1 -> Text2)]
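The slides do not say which measure is used to compare two models. As one possible choice, the Python sketch below uses the Jensen-Shannon divergence (base 2), which is 0 for identical distributions and 1 for distributions with no words in common. The function names and the whitespace tokenisation in the usage are assumptions made for the example, and each text is assumed to be non-empty.

```python
from collections import Counter
from math import log2

def language_model(tokens):
    """Bag-of-words model: relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def js_divergence(model_a, model_b):
    """Jensen-Shannon divergence between two unigram models (0 = identical)."""
    divergence = 0.0
    for term in set(model_a) | set(model_b):
        p = model_a.get(term, 0.0)
        q = model_b.get(term, 0.0)
        m = 0.5 * (p + q)              # midpoint distribution
        if p > 0:
            divergence += 0.5 * p * log2(p / m)
        if q > 0:
            divergence += 0.5 * q * log2(q / m)
    return divergence

model_1 = language_model("retrieval of long documents".split())
model_2 = language_model("reading of long documents".split())
print(js_divergence(model_1, model_2))
```

A symmetric, bounded measure is convenient for the "ModelT1 <-> ModelT2" case; the asymmetric "ModelT1 -> Text2" case is the generation probability already described.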
ProfileSkim
Developed to support retrieval within long documents
Within-document retrieval tool: supports reading to extract and reading to reference
Main concept: relevance profiling based on language modelling
Harper et al 2002, 2003
Overview of ProfileSkim Tool
[Screenshot of the ProfileSkim tool, with callouts: file to skim, skim query, tile being visited, highlighted query term variants]
Relevance Profile Meter (1)
[Figure: relevance profile of a document, with axes "Retrieval Status Value" and "Word position"; labels: Document, Relevance Profile Meter, Tile, "Click and visit ..."]
Relevance Profiling Process
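[Diagram: a sliding window moves over the document and P(query | window) is computed at each position; the document is divided into tiles, and the maximum value within a tile gives the tile RSV (max -> tile RSV)]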
Profile Generation using Language Modelling
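Sliding window of fixed size (N words)
Compute “retrieval status value” RSVwindow at each word position in the document
RSVwindow = P( generate query | window )
$$P(\mathrm{query} \mid \mathrm{window}) = \prod_{t_i \in \mathrm{query}} p_{mix}(t_i \mid win)$$
$$p_{mix}(t_i \mid win) = w_{win} \, p_{win}(t_i \mid win) + (1 - w_{win}) \, p_{doc}(t_i \mid doc)$$
$$p_{win}(t_i \mid win) = \frac{n_{iW}}{n_W} \qquad p_{doc}(t_i \mid doc) = \frac{n_{iD}}{n_D}$$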
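The profile computation can be sketched in Python as below. This is an illustrative reading of the slide, not the ProfileSkim implementation: the window is assumed to be centred on each word position and truncated at the document edges, the window and document probabilities are taken as simple relative frequencies (term count divided by the number of words in the window or document), and the names `relevance_profile`, `tile_rsvs`, `window_size` and `w_win`, together with their default values, are assumptions.

```python
from collections import Counter

def relevance_profile(doc_tokens, query_terms, window_size=5, w_win=0.8):
    """Return one RSV per word position: P(query | window) under the mixture model."""
    n_doc = len(doc_tokens)
    doc_counts = Counter(doc_tokens)
    half = window_size // 2
    profile = []
    for pos in range(n_doc):
        # Assumption: window centred on `pos`, clipped at the document edges.
        window = doc_tokens[max(0, pos - half):min(n_doc, pos + half + 1)]
        win_counts = Counter(window)
        rsv = 1.0
        for term in query_terms:
            p_win = win_counts[term] / len(window)   # term count in window / window length
            p_doc = doc_counts[term] / n_doc         # term count in document / document length
            rsv *= w_win * p_win + (1 - w_win) * p_doc
        profile.append(rsv)
    return profile

def tile_rsvs(profile, tile_size):
    """Tile RSV = maximum window RSV among the positions inside the tile."""
    return [max(profile[i:i + tile_size]) for i in range(0, len(profile), tile_size)]

doc = "a b c d trec e f g h i".split()
profile = relevance_profile(doc, ["trec"], window_size=3, w_win=0.5)
print(tile_rsvs(profile, tile_size=5))
```

Mixing in the document model keeps the RSV above zero for windows that miss a query term, provided the term occurs somewhere in the document, so the profile degrades smoothly away from the matching passages.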
Query-biased summarisation: Using LM
Select representative paragraph for a retrieved document based on query:
Choose paragraph (para) where:
  Mpara <-> Mdoc is largest AND
  Pr (Mpara -> Query) is largest
[Diagram: the document and its paragraphs are represented by language models Mdoc, Mpara1, Mpara2, etc., which are matched against the query]
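A possible Python sketch of this selection rule follows. The slide does not say how the two criteria are combined or how similarity is measured, so several choices here are assumptions: similarity is taken as one minus the Jensen-Shannon divergence, the paragraph model is mixed with the document model (weight `w_para`) before computing the query probability, and the two values are simply multiplied. All function names are invented for the example, and paragraphs are assumed to be non-empty token lists.

```python
from collections import Counter
from math import log2

def unigram_model(tokens):
    """Relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def js_similarity(model_a, model_b):
    """1 - Jensen-Shannon divergence (base 2); 1.0 means identical models."""
    divergence = 0.0
    for term in set(model_a) | set(model_b):
        p, q = model_a.get(term, 0.0), model_b.get(term, 0.0)
        m = 0.5 * (p + q)
        if p > 0:
            divergence += 0.5 * p * log2(p / m)
        if q > 0:
            divergence += 0.5 * q * log2(q / m)
    return 1.0 - divergence

def query_likelihood(query_terms, m_para, m_doc, w_para):
    """Pr(Mpara -> Query), with the paragraph model smoothed by the document model."""
    likelihood = 1.0
    for term in query_terms:
        likelihood *= w_para * m_para.get(term, 0.0) + (1 - w_para) * m_doc.get(term, 0.0)
    return likelihood

def select_paragraph(paragraphs, query_terms, w_para=0.8):
    """Index of the paragraph with the best combined score (product is an assumption)."""
    m_doc = unigram_model([t for para in paragraphs for t in para])

    def score(index):
        m_para = unigram_model(paragraphs[index])
        return (js_similarity(m_para, m_doc)
                * query_likelihood(query_terms, m_para, m_doc, w_para))

    return max(range(len(paragraphs)), key=score)
```

Other combinations, such as ranking by one criterion and breaking ties with the other, would fit the slide equally well.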
Soft hyperlinks: Using LM
Given selected text within document, generate soft-links to other (relevant) documents
Assume text model of web (say) Mweb
Compare Mweb and Mselect to choose set of terms that contribute MOST to divergence
Use chosen terms to query the Web, and generate soft links
Note: Can mix Mselect and Mdoc to obtain better model of selected text!
[Diagram: selected text (Mselect) within a document (Mdoc), soft-linked to other documents]
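The term-selection step can be sketched in Python as follows. The slide does not name the divergence, so the sketch assumes the Kullback-Leibler divergence of Mselect from Mweb and ranks terms by their individual contribution to it. The background counts standing in for Mweb, the add-one smoothing, and the names `divergence_terms` and `top_k` are all assumptions; submitting the chosen terms to a web search engine is left out.

```python
from collections import Counter
from math import log

def divergence_terms(selected_tokens, background_counts, top_k=3):
    """Return the top_k terms of the selected text that contribute most to
    KL(Mselect || Mweb), where Mweb is estimated from `background_counts`."""
    sel_counts = Counter(selected_tokens)
    n_sel = sum(sel_counts.values())
    n_bg = sum(background_counts.values())
    vocab_size = len(set(background_counts) | set(sel_counts))
    contributions = {}
    for term, n in sel_counts.items():
        p_sel = n / n_sel
        # Add-one smoothing (an assumption) so unseen terms get non-zero probability.
        p_bg = (background_counts.get(term, 0) + 1) / (n_bg + vocab_size)
        contributions[term] = p_sel * log(p_sel / p_bg)
    return sorted(contributions, key=contributions.get, reverse=True)[:top_k]

# Hypothetical background counts standing in for a model of the Web.
web_counts = Counter({"the": 500, "of": 300, "a": 200,
                      "language": 5, "model": 3, "retrieval": 2})
selected = "the language model of retrieval the language model".split()
print(divergence_terms(selected, web_counts))
```

Terms that are common everywhere, such as "the", contribute little or negatively, while terms that are frequent in the selection but rare in the background rise to the top, which is what makes them useful as query terms.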
Reading to write: Using LM (exercise for reader)
As you are writing a document, a tool suggests parts of other documents that may be relevant to what you are writing, cf. Remembrance Agent
Reading in Context
Reading documents is generally done in the context of a larger task, and the pattern of reading activities will depend on the task.
Task: Writing a research proposal for EU Framework 6
  Reading FP6 Programme Call (and many related documents): reading to extract and reference
  Reading to reference documents supporting proposal
  Reading to extract ancillary information, e.g. contact details from web pages (say)
Can you think of any searching/reading environment that supports such a complex set of interactions?
Concluding Remarks
Reading of (long) documents to find information is raising interesting challenges in the field of information retrieval
A variety of reading activities should be supported, and preferably within an information seeking (with reading) environment
Language Models enable us to model text chunks at various levels of granularity, and thus provide a principled foundation for “retrieval-with-reading” techniques and tools
Reading List
Hearst, M. A.: TileBars: visualization of term distribution information in full text information access. Proc. CHI'95, (1995), 56-66.
Whittaker, S., Hirschberg, J., Choi, J., Hindle, D., Pereira, F. and Singhal, A.: SCAN: Designing and evaluating user interfaces to support retrieval from speech archives. In Proceedings ACM SIGIR '99. ACM Press (1999) 26-33.
Kaszkiel, M. and Zobel, J.: Passage Retrieval Revisited. In: Proceedings of the Twentieth International ACM-SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, July 1997. ACM Press (1997) 178-185.
Kaszkiel, M.: Indexing and Retrieval of Passages in Full-Text Databases, PhD thesis. RMIT Computer Science Technical Report (RT-17), May 2000 (2000).
Kaszkiel, M., Zobel, J. and Sacks-Davis, R.: Efficient Passage Ranking for Document Databases. ACM Transactions on Information Systems, Vol 17, No. 4 (1999) 406-439.
Landauer, T., Egan, D., Remde, J., Lesk, M., Lochbaum, C., and Ketchum, D.: Enhancing the usability of text through computer delivery and formative evaluation: The SuperBook project. In: McKnight, C., Dillon, A., and Richardson, J. (eds): Hypertext: A Psychological Perspective. Ellis Horwood (1993) 71-136.
Marchionini, G.: Information Seeking in Electronic Environments. Cambridge University Press, Cambridge (1995).
Byrd, D.: A Scrollbar-based Visualization for Document Navigation. In Proceedings of ACM Digital Libraries 99. ACM Press (1999).
de Kretser, O. and Moffat, A.: Effective Document Presentation with a Locality-Based Similarity Heuristic. In: Proceedings of the Twenty Second International ACM-SIGIR Conference on Research and Development in Information Retrieval, Berkeley, August 1999. ACM Press (1999) 113-120.
Tombros, A. and Sanderson, M.: Advantages of Query Biased Summaries in Information Retrieval. In: Proceedings of 1998 ACM SIGIR Conference on Research and Development in Information Retrieval (1998) 2-10.
Ponte, J. and Croft, W. B.: A language modeling approach to information retrieval. In: Proceedings of the 1998 ACM SIGIR Conference on Research and Development in Information Retrieval (1998) 275-281.
Song, F. and Croft, W. B.: A general language model for information retrieval. In: Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval (1999) 279-280.
Schilit, B. N., Golovchinsky, G. and Price, M. N.: Beyond paper: Supporting Active Reading with free-form digital ink annotations. In: Proceedings of CHI98, ACM Press (1998) 149-156.
Harper, D. J., Coulthard, S. and Sun, Y.: A Language Modelling Approach to Relevance Profiling for Document Browsing. In: Procs JCDL 2002, Oregon, USA (2002) 76-83.
Harper, D. J., Koychev, I. and Sun, Y.: Query-Based Document Skimming: A User-Centred Evaluation. In: Procs 25th European Conference on IR Research, LNCS 2622, Springer (2003) 377-392.
Rhodes, B. J. and Starner, T.: Remembrance Agent: A continuously running automated retrieval system. In: Proceedings of The First International Conference on The Practical Application Of Intelligent Agents and Multi Agent Technology (PAAM '96), (1996) 487-495.