Artificial intelligence & natural language processing
Mark Sanderson
Porto, 2000
Aims
• To provide an outline of the attempts made at using NLP techniques in IR
Objectives
• At the end of this lecture you will be able to
  – Outline a range of attempts to get NLP to work with IR systems
  – Idly speculate on why they failed
  – Describe the successful use of NLP in a limited domain
Why?
• Seems an obvious area of investigation
  – Why is it not working?
Use of NLP
• Syntactic
  – Parsing to identify phrases
  – Full syntactic structure comparison
• Semantic
  – Building an understanding of a document’s content
• Discourse
  – Exploiting document structure?
Syntactic
• Parsing to identify phrases
  – The issues.
  – Explain how it’s done (a bit).
  – Is it worth it?
• Other possibilities
  – Grammatical tagging
  – Full syntactic structure comparison
    • Explain how it’s done (a little bit).
    • Show results.
Simple phrase identification
• High frequency terms could be good candidates.
  – Why?
• Terms co-occurring more often than chance.
  – Within small number of words.
  – Surrounding simple terms.
  – Not surrounding punctuation.
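The co-occurrence idea above can be sketched in a few lines of Python. This is only an illustration, not the method used in any of the cited work: the function name `candidate_phrases`, the window size, and the thresholds `min_count` and `min_ratio` are all assumptions, and the "chance" estimate is a simple independence model. Pairs are counted only inside punctuation-free segments, so no pair surrounds a punctuation mark.

```python
from collections import Counter
import re


def candidate_phrases(text, window=3, min_count=2, min_ratio=2.0):
    """Return word pairs that co-occur more often than chance (hypothetical helper)."""
    # Split on punctuation first so that a pair never spans a punctuation mark.
    segments = re.split(r"[.,;:!?()]", text.lower())
    word_counts = Counter()
    pair_counts = Counter()
    for segment in segments:
        words = segment.split()
        word_counts.update(words)
        for i, left in enumerate(words):
            # Pair each word with the next (window - 1) words in the segment.
            for right in words[i + 1 : i + window]:
                pair_counts[(left, right)] += 1

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    results = []
    for (left, right), observed in pair_counts.items():
        if observed < min_count:
            continue
        # Expected count if the two words occurred independently of each other.
        expected = (
            total_pairs
            * (word_counts[left] / total_words)
            * (word_counts[right] / total_words)
        )
        ratio = observed / expected
        if ratio >= min_ratio:
            results.append(((left, right), observed, ratio))
    # Strongest associations first.
    return sorted(results, key=lambda item: item[2], reverse=True)
```

No stop-word filtering is done here, so on real text frequent function-word pairs would need extra handling.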
Problems
• Close words that aren’t phrases.
  • “the use of computers in science & technology”
• Distant words that are phrases.
  • “preparation & evaluation of abstracts and extracts”
Parsing for phrases
• Using parsers to identify noun phrases.
• Make a phrase out of a head and the head of its modifiers.
“automatic analysis of scientific text”
[Figure: parse tree with word tags ADJ NOUN PREP ADJ NOUN, grouped under NP and PP nodes]
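The head-and-modifier rule above can be illustrated with a small Python sketch. It assumes the text has already been tagged (the tags here are supplied by hand), and it uses a flat pattern over the tag sequence rather than a real parser, so it is a stand-in for the parser-based approach, not a reproduction of it. The function name `head_modifier_pairs` is hypothetical.

```python
def head_modifier_pairs(tagged):
    """Pair each noun-phrase head with the heads of its modifiers.

    `tagged` is a list of (word, tag) tuples using the tags ADJ, NOUN, PREP.
    Returns a list of (head, modifier) tuples.
    """
    pairs = []
    previous_head = None  # head of the noun phrase seen before a preposition
    after_prep = False    # True once a PREP has followed a noun phrase
    modifiers = []        # adjectives and nouns still waiting for their head
    for index, (word, tag) in enumerate(tagged):
        next_tag = tagged[index + 1][1] if index + 1 < len(tagged) else None
        if tag == "ADJ" or (tag == "NOUN" and next_tag == "NOUN"):
            modifiers.append(word)
        elif tag == "NOUN":
            # Last noun in a run: treat it as the head of the noun phrase.
            pairs.extend((word, modifier) for modifier in modifiers)
            modifiers = []
            if after_prep and previous_head is not None:
                # The prepositional phrase modifies the earlier head.
                pairs.append((previous_head, word))
            previous_head, after_prep = word, False
        elif tag == "PREP":
            after_prep = True
        else:
            # Any other tag breaks the phrase in this simplified sketch.
            modifiers, previous_head, after_prep = [], None, False
    return pairs


tagged = [
    ("automatic", "ADJ"), ("analysis", "NOUN"), ("of", "PREP"),
    ("scientific", "ADJ"), ("text", "NOUN"),
]
print(head_modifier_pairs(tagged))
# [('analysis', 'automatic'), ('text', 'scientific'), ('analysis', 'text')]
```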
Errors
• Not a perfect rule by any means.
  – Need restrictions to eliminate bogus phrases.
“automatic analysis of these four scientific texts”
[Figure: parse tree with word tags ADJ NOUN PREP DET QUANT ADJ NOUN, grouped under NP and PP nodes]
Do they work?
• Fagan compared statistical with syntactic phrases; the statistical approach won, but only just
  – J. Fagan (1987) Experiments in phrase indexing for document retrieval: a comparison of syntactic & nonsyntactic methods, TR 87-868, Department of Computer Science, Cornell University
• More research has been conducted.
  – T. Strzalkowski (1995) Natural language information retrieval, in Information Processing & Management, Vol. 31, No. 3, pp 397-417
Check out TREC
• Overview of the Seventh Text REtrieval Conference (TREC-7), E. Voorhees, D. Harman (National Institute of Standards and Technology)
  – http://trec.nist.gov/
  – Ad hoc track
    • Fairly even between statistical phrases, syntactic phrases and no phrases.
Grammatical tagging?
• Tag document text with grammatical codes?
  – R. Garside (1987). The CLAWS word tagging system, in The computational analysis of English: a corpus based approach, R. Garside, G. Leech, G. Sampson Eds., Longman: 30-41.
• Doesn’t appear to work
  – R. Sacks-Davis, P. Wallis, R. Wilkinson (1990). Using syntactic analysis in a document retrieval system that uses signature files, in Proceedings of 13th ACM SIGIR Conference: 179-191.
Syntactic structure comparison
• Has been tried…
  – A. F. Smeaton & P. Sheridan (1991) Using morpho-syntactic language analysis in phrase matching, in Proceedings of RIAO ’91, Pages 414-429
• Method
  – Parse sentences into tree structures
  – When you get a phrase match
    • Look at linking syntactic operator.
    • Look at the residual tree structure that didn’t match
• Does not seem to work
Semantic
• Disambiguation
  – Given a word appearing in a certain context, disambiguators will tell you what sense it is.
• IR system
  – Index document collections by senses rather than words
  – Ask the users what senses the query words are
  – Retrieve on senses
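A toy version of the sense-based IR system outlined above might look like the following. It assumes the disambiguator has already run, so each document arrives as a list of (word, sense) pairs; the sense labels, document ids and function names are all made up for illustration.

```python
from collections import defaultdict


def build_sense_index(documents):
    """Map each (word, sense) pair to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, tagged_words in documents.items():
        for word, sense in tagged_words:
            index[(word, sense)].add(doc_id)
    return index


def retrieve(index, query_senses):
    """Rank documents by how many of the query's (word, sense) pairs they contain."""
    scores = defaultdict(int)
    for key in query_senses:
        for doc_id in index.get(key, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical, hand-disambiguated documents.
documents = {
    "d1": [("bank", "river"), ("water", "liquid")],
    "d2": [("bank", "finance"), ("loan", "finance")],
}
index = build_sense_index(documents)
# The user has said that "bank" in the query means the financial sense.
print(retrieve(index, [("bank", "finance")]))
```

A plain word index would return both documents for the query "bank"; the sense index returns only the one using the chosen sense, which is the intended benefit and also why disambiguation errors hurt.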
Disambiguation
• Does it work?
  – No (well maybe)
    • M. Sanderson, Word sense disambiguation and information retrieval, in Proceedings of the 17th ACM SIGIR Conference, Pages 142-151, 1994
    • M. Sanderson & C.J. van Rijsbergen, The impact on retrieval effectiveness of skewed frequency distributions, in ACM Transactions on Information Systems (TOIS) Vol. 17 No. 4, 1999, Pages 440-465.
Partial conclusions
• NLP has yet to prove itself in IR
  – Agree
    – D.D. Lewis & K. Sparck-Jones (1996) Natural language processing for information retrieval, in Communications of the ACM (CACM) 1996 Vol. 39, No. 1, 92-101
  – Sort of don’t agree
    – A. Smeaton (1992) Progress in the application of natural language processing to information retrieval tasks, in The Computer Journal, Vol. 35, No. 3.
Mark’s idle speculation
• What people think is going on always
[Figure: diagram labelled “Keywords” and “NLP”]
Mark’s idle speculation
• What’s usually actually going on
[Figure: diagram labelled “Keywords” and “NLP”]
Areas where NLP does work
• Systems with the following ingredients.
  – Collection documents cover small domain.
  – Language use is limited in some manner.
  – User queries cover tight subject area.
  – Documents/queries very short
    • Image captions
      – LSI, pseudo-relevance feedback
  – People willing to spend money getting NLP to work
RIME & IOTA
• From Grenoble
  – Y. Chiaramella & J. Nie (1990) A retrieval model based on an extended modal logic and its application to the RIME experimental approach, in Proceedings of the 13th SIGIR conference, Pages 25-43
• Medical record retrieval system
• Some database’y parts
• Free text descriptions of cases
Indexing
• “an opacity affecting probably the lung and the trachea”
[Figure: index tree for this sentence, with nodes {[p], SGN}, {[and], SGN}, {[bears-on], SGN} (twice), {[opacity], SGN} (twice), {[lung], LOC} and {[trachea], LOC}]
LOC - localisation
SGN - observed sign
Retrieval
• How do we match a user’s query to these structures?
  – Using transformations - bit like logic.
[Figure: the tree {[bears-on], SGN} over {[lung], LOC} and {[opacity], SGN}, shown with the structures {[lung], LOC}, t and {[opacity], SGN}, t]
t - uncertainty
Tree transformation
[Figure: tree transformation example. Source tree nodes: {[bears-on], SGN}, {[has-for-value], SGN} (twice), {[lung], LOC}, {[opacity], SGN}, {[contour], SGN}, {[blurred], LOC}. Resulting structures: {[opacity], SGN}; {[has-for-value], SGN}, t; {[has-for-value], SGN} with {[contour], SGN} and {[blurred], LOC}]
Term transforms
• Basic medical terms stored in a hierarchy.
  – Transformations possible again with uncertainty added.

Level 1    Level 2    Level 3
tumour     cancer     sarcoma
                      hygroma
           kyste      polykystosis
                      pseudokyst
           polyp      polyposis
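One way to picture a term transform with uncertainty is the sketch below. It is not the RIME model itself: the `PARENT` table uses only the tumour, cancer, sarcoma chain from the slide, and the per-level penalty of 0.8 and the function names are assumptions chosen purely for illustration.

```python
# Assumed fragment of the term hierarchy: child term -> parent term.
PARENT = {"cancer": "tumour", "sarcoma": "cancer"}


def ancestors(term):
    """Return the chain of broader terms above `term`, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def match_certainty(query_term, document_term, step_penalty=0.8):
    """Certainty that a document term satisfies a query term.

    An exact match scores 1.0. Each level moved up or down the hierarchy
    multiplies the certainty by `step_penalty` (a hypothetical value).
    Unrelated terms score 0.0.
    """
    if query_term == document_term:
        return 1.0
    broader = ancestors(document_term)
    if query_term in broader:
        # The document term is more specific than the query term.
        return step_penalty ** (broader.index(query_term) + 1)
    broader = ancestors(query_term)
    if document_term in broader:
        # The document term is more general than the query term.
        return step_penalty ** (broader.index(document_term) + 1)
    return 0.0


print(match_certainty("tumour", "sarcoma"))  # two levels apart
print(match_certainty("cancer", "cancer"))   # exact match
```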
Isn’t this a bit slow?
• Yes
• Optimisation
  – Scan for potential documents.
  – Process them intensively.
• Evaluation?
  – Not in that paper.
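The scan-then-process optimisation above follows a common two-stage pattern, sketched here under assumptions: the cheap scan is plain keyword overlap, and `expensive_score` is a placeholder for whatever intensive matching the system performs (in RIME, the structure transformations). Neither the function names nor the scoring come from the paper.

```python
def two_stage_retrieve(query_terms, documents, expensive_score, limit=10):
    """Filter cheaply first, then run the costly matcher on the survivors only."""
    query = set(term.lower() for term in query_terms)
    # Stage 1: cheap scan keeps documents sharing at least one query term.
    candidates = [
        doc_id
        for doc_id, text in documents.items()
        if query & set(text.lower().split())
    ]
    # Stage 2: intensive processing, applied only to the candidates.
    scored = [
        (expensive_score(query_terms, documents[doc_id]), doc_id)
        for doc_id in candidates
    ]
    # Best score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]


def toy_score(query_terms, text):
    """Stand-in for the intensive matcher: counts query term occurrences."""
    words = text.lower().split()
    return sum(words.count(term.lower()) for term in query_terms)


documents = {
    "a": "opacity in the lung",
    "b": "normal heart",
    "c": "lung opacity with blurred contour",
}
print(two_stage_retrieve(["lung", "opacity"], documents, toy_score))
```

The saving comes from document "b" never reaching the second stage; with a genuinely expensive matcher, that filter is where most of the time is recovered.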
Not unique
• SCISOR
  – P.S. Jacobs & L.F. Rau (1990) SCISOR: Extracting Information from On-line News, in Communications of the ACM (CACM), Vol. 33, No. 11, 88-97
Why do they work?
• Because of the restrictions
  – Small subject domain.
  – Limited vocabulary.
  – Restricted type of question.
• Compare with large scale IR system.
  – Keywords are good enough.
  – Long time to set up.
  – Hard to adapt to new domain.
Anything else for NLP?
• Text Generation
  – IR system explaining itself?
Conclusions
• By now, you will be able to
  – Outline a range of attempts to get NLP to work with IR systems
  – Idly speculate on why they failed
  – Describe the successful use of NLP in a limited domain