
Page 1:

AnswerFinder
Question Answering from your Desktop

Mark A. Greenwood

Natural Language Processing Group

Department of Computer Science

University of Sheffield, UK

CLUK, 7th Annual Research Colloquium, 7th of January 2004

Page 2:

Outline of Talk

• What is Question Answering?
  Different Question Types
  A Generic Question Answering Framework
  Evaluating Question Answering Systems

• System Description
  Question Typing
  Information Retrieval
  Locating Possible Answers
  A Detailed Example

• Results and Evaluation

• Desktop Question Answering
  A Brief Comparison to Other On-Line Question Answering Systems

• Conclusions and Future Work

Page 3:

What is Question Answering?

• The main aim of QA is to present the user with a short answer to a question rather than a list of possibly relevant documents.

• As it becomes more and more difficult to find answers on the WWW using standard search engines, question answering technology will become increasingly important.

• Answering questions using the web is already enough of a problem for it to appear in fiction (Marshall, 2002):

“I like the Internet. Really, I do. Any time I need a piece of shareware or I want to find out the weather in Bogotá… I’m the first guy to get the modem humming. But as a source of information, it sucks. You got a billion pieces of data, struggling to be heard and seen and downloaded, and anything I want to know seems to get trampled underfoot in the crowd.”

Page 4:

Different Question Types

• Clearly there are many different types of questions:

When was Mozart born?
  • The question requires a single fact as an answer.
  • The answer may be found verbatim in the text, e.g. “Mozart was born in 1756”.

How did Socrates die?
  • Finding an answer may require reasoning.
  • In this example die has to be linked with drinking poisoned wine.

How do I assemble a bike?
  • The full answer may require fusing information from many different sources.
  • The complexity can range from simple lists to script-based answers.

Is the Earth flat?
  • Requires a simple yes/no answer.

• The systems outlined in this presentation attempt to answer the first two types of question.

Page 5:

A Generic QA Framework

[Diagram: Questions → Search Engine (over the Document Collection) → top n documents → Document Processing → Answers]

• A search engine is used to find the n most relevant documents in the document collection.

• These documents are then processed with respect to the question to produce a set of answers which are passed back to the user.

• Most of the differences between question answering systems are centred around the document processing stage.

Page 6:

Evaluating QA Systems

• The biggest independent evaluations of question answering systems have been carried out at TREC (the Text REtrieval Conference) over the past five years.
  Five hundred factoid questions are provided and the groups taking part have a week in which to process the questions and return one answer per question.
  No changes to systems are allowed between the time the questions are received and the time at which the answers are submitted.

• Not only do these annual evaluations give groups a chance to see how their systems perform against those from other institutions but, more importantly, they are slowly building an invaluable collection of resources, including questions and their associated answers, which can be used for further development and testing.

• Different metrics have been used over the years but the current metric is simply the percentage of questions correctly answered.

Page 7:

Outline of Talk

• What is Question Answering?
  Different Question Types
  A Generic Question Answering Framework
  Evaluating Question Answering Systems

• System Description
  Question Typing
  Information Retrieval
  Locating Possible Answers
  A Detailed Example

• Results and Evaluation

• Desktop Question Answering
  A Brief Comparison to Other On-Line Question Answering Systems

• Conclusions and Future Work

Page 8:

System Description

• Many of the systems which have proved successful in previous TREC evaluations have made use of a fine-grained set of answer types:
  One system (Harabagiu et al., 2000) has an answer type DOG BREED.
  The answer typology described in (Hovy et al., 2000) contains 94 different answer types.

• The original idea behind building the QA system underlying AnswerFinder was to determine how well a system which used only a fine-grained set of answer types could perform. The completed system consists of three distinct phases:
  • Question Typing
  • Information Retrieval
  • Locating Possible Answers

Page 9:

Question Typing

• The first stage of processing is to determine the semantic type of the expected answer.

• The semantic type, S, is determined through rules which examine the question, Q (a sketch of this style of rule is given below):
  If Q contains ‘congressman’ and does not start with ‘where’ or ‘when’ then S is person:male
  If Q contains ‘measured in’ then S is measurement_unit
  If Q contains ‘university’ and does not start with ‘who’, ‘where’ or ‘when’ then S is organization
  If Q contains ‘volcano’ and does not start with ‘who’ or ‘when’ then S is location

• The current system includes rules which can detect 46 different answer types.
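To make the rule format concrete, here is a minimal sketch of how rules of this kind could be written and applied. The rule wording follows the slide; the rule table, the get_answer_type() function and the “unknown” fallback are illustrative assumptions, not the actual AnswerFinder implementation.

```python
# Minimal sketch of rule-based question typing. The four rules mirror the
# slide; everything else (structure, names) is assumed for illustration.

RULES = [
    # (phrase the question must contain, words it must NOT start with, answer type)
    ("congressman", ("where", "when"),        "person:male"),
    ("measured in", (),                       "measurement_unit"),
    ("university",  ("who", "where", "when"), "organization"),
    ("volcano",     ("who", "when"),          "location"),
]

def get_answer_type(question: str) -> str:
    q = question.lower()
    for contains, bad_starts, answer_type in RULES:
        if contains in q and not q.startswith(bad_starts):
            return answer_type
    return "unknown"  # questions matching no rule cannot be answered

print(get_answer_type("Which congressman represents Ohio?"))  # person:male
print(get_answer_type("How is lumber measured in Canada?"))   # measurement_unit
```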

Page 10:

Information Retrieval

• This is by far the simplest part of the question answering system, with the question being passed, as is, to an appropriate search engine:
  Okapi is used to search the AQUAINT collection when answering the TREC questions.
  XXXXXXXX is used to search the Internet when using AnswerFinder as a general purpose question answering system.

• The top n relevant documents, as determined by the search engine, are then retrieved ready for the final processing stage.
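As a sketch of how thin this stage is, the fragment below simply passes the question text, unchanged, to whichever search engine has been configured and keeps the top n hits. The search callable stands in for Okapi or the web engine; its signature is an assumption made for illustration only.

```python
from typing import Callable, List

def retrieve_top_n(question: str,
                   search: Callable[[str], List[str]],
                   n: int = 20) -> List[str]:
    """Pass the question as-is to the configured search engine, keep the top n documents."""
    ranked_documents = search(question)  # the raw question text is the query
    return ranked_documents[:n]          # only the top n are processed further
```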

Page 11:

Locating Possible Answers

• The only answers we attempt to locate are entities which the system can recognise.

• Locating possible answers therefore consists of extracting all entities of the required type from the relevant documents.
  Entities are currently extracted using modified versions of the gazetteer lists and named entity transducer supplied with the GATE 2 framework (Cunningham et al., 2002).

• All entities of the correct type are retained as possible answers unless they fail one or both of the following tests (sketched in code below):
  The document the current entity appears in must contain all the entities in the question.
  A possible answer entity must not contain any of the question words (ignoring stopwords).
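The two tests can be sketched as follows. The Entity record, the stopword list and the helper names are assumptions made for illustration; the real system obtains its entities from GATE rather than from a structure like this.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

STOPWORDS = {"the", "a", "an", "of", "is", "was", "in", "how", "what", "who"}

@dataclass
class Entity:
    text: str      # e.g. "29,035 feet"
    sem_type: str  # e.g. "measurement:distance"
    doc_id: str    # identifier of the document it was found in

def question_words(question: str) -> Set[str]:
    return {w.lower().strip("?.,'") for w in question.split()} - STOPWORDS

def keep_candidate(entity: Entity,
                   question: str,
                   question_entities: List[str],
                   entities_by_doc: Dict[str, List[Entity]]) -> bool:
    """Apply the two filtering tests from the slide to one candidate answer."""
    # Test 1: the document the candidate appears in must contain all of the
    # entities mentioned in the question.
    doc_entity_texts = {e.text.lower() for e in entities_by_doc[entity.doc_id]}
    if not all(q.lower() in doc_entity_texts for q in question_entities):
        return False
    # Test 2: the candidate itself must not contain any of the question words
    # (ignoring stopwords).
    candidate_words = {w.lower() for w in entity.text.split()}
    return not (candidate_words & question_words(question))
```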

Page 12:

Locating Possible Answers

• All the remaining entities are then grouped together using the following equivalence test (Brill et al., 2001):
  Two answers are said to be equivalent if all of the non-stopwords in one are present in the other, or vice versa.

• The resulting answer groups are then ordered by (see the sketch below):
  the frequency of occurrence of all answers within the group
  the highest ranked document in which an answer in the group appears.

• This sorted list (or the top n answers) is then presented, along with a supporting snippet, to the user of the system.
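Below is a minimal sketch of the equivalence test and the group ordering, assuming each candidate carries the rank of the document it was found in. The Candidate record and the function names are illustrative assumptions, not the system's own code.

```python
from dataclasses import dataclass
from typing import List, Set

STOPWORDS = {"the", "a", "an", "of", "in"}

@dataclass
class Candidate:
    text: str
    doc_rank: int  # rank of the document the answer was found in (1 = best)

def content_words(text: str) -> Set[str]:
    return {w.lower() for w in text.split()} - STOPWORDS

def equivalent(a: str, b: str) -> bool:
    """Brill et al. (2001): equivalent if all non-stopwords of one appear in the other."""
    wa, wb = content_words(a), content_words(b)
    return wa <= wb or wb <= wa

def rank_answers(candidates: List[Candidate]) -> List[List[Candidate]]:
    groups: List[List[Candidate]] = []
    for c in candidates:
        for g in groups:
            if equivalent(c.text, g[0].text):
                g.append(c)
                break
        else:
            groups.append([c])
    # Order groups by (i) how many candidates they contain and
    # (ii) the best document rank of any member.
    groups.sort(key=lambda g: (-len(g), min(c.doc_rank for c in g)))
    return groups
```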

Page 13:

A Detailed Example

Q: How high is Everest?

Question typing rule: If Q contains ‘how’ and ‘high’ then the semantic class, S, is measurement:distance

D1: Everest’s 29,035 feet is 5.4 miles above sea level…
D2: At 29,035 feet the summit of Everest is the highest…

Known entities                         #
measurement:distance(‘5.4 miles’)      1
measurement:distance(‘29,035 feet’)    2
location(‘Everest’)                    2

Proposed answer: 29,035 feet
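To connect the example to the grouping stage, the fragment below simply counts the measurement:distance entities extracted from D1 and D2. The hard-coded candidate list is taken from the slide; everything else is an illustrative sketch rather than the system's code.

```python
from collections import Counter

# measurement:distance entities found in the two example documents:
# D1 contributes "29,035 feet" and "5.4 miles"; D2 contributes "29,035 feet".
candidates = ["29,035 feet", "5.4 miles", "29,035 feet"]

counts = Counter(candidates)
answer, frequency = counts.most_common(1)[0]
print(answer, frequency)  # -> 29,035 feet 2
```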

Page 14:

Outline of Talk

• What is Question Answering?
  Different Question Types
  A Generic Question Answering Framework
  Evaluating Question Answering Systems

• System Description
  Question Typing
  Information Retrieval
  Locating Possible Answers
  A Detailed Example

• Results and Evaluation

• Desktop Question Answering
  A Brief Comparison to Other On-Line Question Answering Systems

• Conclusions and Future Work

Page 15:

Results and Evaluation

• The underlying system was tested over the 500 factoid questions used in TREC 2002 (Voorhees, 2002):

• Results for the question typing stage were as follows:
  16.8% (84/500) of the questions were of an unknown type and hence could never be answered correctly.
  1.44% (6/416) of those questions which were typed were given the wrong type and hence could never be answered correctly.
  Therefore the maximum attainable score of the entire system, irrespective of any further processing, is 82% (410/500).

• Results for the information retrieval stage were as follows:
  At least one relevant document was found for 256 of the correctly typed questions.
  Therefore the maximum attainable score of the entire system, irrespective of further processing, is 51.2% (256/500).

Page 16:

Results and Evaluation

• Results for the question answering stage were as follows:
  25.6% (128/500) of the questions were correctly answered by the system using this approach.
  These results are not overly impressive, especially when compared with the best performing systems, which can answer approximately 85% of the same five hundred questions (Moldovan et al., 2002).

• Users of web search engines are, however, used to looking at a set of relevant documents and so would probably be happy looking at a handful of short answers.
  If we examine the top five answers returned for each question then the system correctly answers 35.8% (179/500) of the questions, which is 69.9% (179/256) of the maximum attainable score.
  If we examine all the answers returned for each question then 38.6% (193/500) of the questions are correctly answered, which is 75.4% (193/256) of the maximum attainable score, but this involves displaying over 20 answers per question.

Page 17:

Outline of Talk

• What is Question Answering?
  Different Question Types
  A Generic Question Answering Framework
  Evaluating Question Answering Systems

• System Description
  Question Typing
  Information Retrieval
  Locating Possible Answers
  A Detailed Example

• Results and Evaluation

• Desktop Question Answering
  A Brief Comparison to Other On-Line Question Answering Systems

• Conclusions and Future Work

Page 18:

Desktop Question Answering

• Question answering may be an interesting research topic, but what is needed is an application that is as simple to use as a modern web search engine:
  No training or special knowledge should be required to use it.
  It must respond within a reasonable period of time.
  Answers should be exact, but should also be supported by a small snippet of text so that users don’t have to read the supporting document to verify the answer.

• AnswerFinder attempts to meet all of these requirements…

Page 19:

Desktop Question Answering

When was Gustav Holst born?

… and get the answer!

Page 20:

Brief Comparison - PowerAnswer

• PowerAnswer is developed by the team responsible for the best performing TREC system.

• At TREC 2002 their entry answered approx. 85% of the questions.

• Unfortunately PowerAnswer acts more like a search engine than a question answering system:
  Each answer is a sentence or long phrase.
  No attempt is made to cluster/remove sentences which contain the same answers.

• This is strange as TREC results show that this system is very good at finding a single exact answer to a question.

Page 21:

Brief Comparison - AnswerBus

• Very similar to PowerAnswer in that:
  The answers presented are full sentences.
  No attempt is made to cluster/remove sentences containing the same answer.

• The interesting thing to note about AnswerBus is that questions can be asked in more than one language: English, French, Spanish, German, Italian or Portuguese – although all answers are given in English.

• The developer claims the system answers 70.5% of the TREC 8 questions, although:
  The TREC 8 question set is not a good reflection of real world questions.
  Finding exact answers, as the TREC evaluations have shown, is a harder task than simply finding answer bearing sentences.

Page 22:

Brief Comparison - NSIR

• The NSIR system, from the University of Michigan, is much closer to AnswerFinder than PowerAnswer or AnswerBus:
  Uses standard web search engines to find relevant documents.
  Returns a list of ranked exact answers.

• Unfortunately no context or confidence level is given for each answer, so users would still have to refer to the relevant documents to verify that a given answer is correct.

• NSIR was entered in TREC 2002, correctly answering 24.2% of the questions.
  Very similar to the 25.6% obtained by AnswerFinder over the same question set.

Page 23:

Brief Comparison - IONAUT

• IONAUT is the system closest to AnswerFinder when viewed from the user’s perspective:
  A ranked list of answers is presented.
  Supporting snippets of context are also displayed.

• Unfortunately the exact answers are not linked to specific snippets, so it is not immediately clear which snippet supports which answer.
  This problem is compounded by the fact that multiple snippets may support a single answer, as no attempt has been made to cluster/remove snippets which support the same answer.

Page 24:

Outline of Talk

• What is Question Answering?
  Different Question Types
  A Generic Question Answering Framework
  Evaluating Question Answering Systems

• System Description
  Question Typing
  Information Retrieval
  Locating Possible Answers
  A Detailed Example

• Results and Evaluation

• Desktop Question Answering
  A Brief Comparison to Other On-Line Question Answering Systems

• Conclusions and Future Work

Page 25:

Conclusions

• The original aim in developing the underlying question answering system was to determine how well a system relying only on a fine-grained set of answer types would perform:
  The system answers approximately 26% of the TREC 11 questions.
  The average performance by participants in TREC 11 was 22%.
  The best performing system at TREC 11 scored approximately 85%.

• The aim of developing AnswerFinder was to provide access to question answering technology in a manner similar to current web search engines:
  An interface similar to a web browser is used both to enter the question and to display the answers.
  The answers are displayed in a similar fashion to standard web search results.
  Very little extra time is required to locate possible answers over and above simply collecting the relevant documents.

Page 26:

Future Work

• The question typing stage could be improved either by adding more rules or by replacing the rules with an automatically acquired classifier (Li and Roth, 2002).

• It should be clear that increasing the types of entities we can recognise will increase the percentage of questions we can answer. Unfortunately this is a task that is both time-consuming and never-ending.

• A possible extension to this approach is to include answer extraction patterns (Greenwood and Gaizauskas, 2003):
  These patterns are enhanced regular expressions in which certain tags will match multi-word terms.
  For example, questions such as “What does CPR stand for?” generate patterns such as “NounChunK ( X )”, where CPR is substituted for X to select a noun chunk that will be suggested as a possible answer.
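As an illustration of the idea (not the actual pattern language of Greenwood and Gaizauskas, 2003), the sketch below approximates the “NounChunK ( X )” pattern with a plain regular expression, using a crude run of words in place of a real noun chunk.

```python
import re

def expand_acronym(acronym: str, text: str):
    # Approximate "NounChunK ( X )": capture up to five words immediately
    # preceding "( ACRONYM )". A real noun-chunk tagger would be used instead.
    pattern = rf"((?:[A-Za-z]+\s+){{1,5}})\(\s*{re.escape(acronym)}\s*\)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

print(expand_acronym("CPR", "Bystanders gave cardiopulmonary resuscitation (CPR) at the scene."))
# -> "Bystanders gave cardiopulmonary resuscitation"
#    (a real noun chunker would trim this to "cardiopulmonary resuscitation")
```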

Page 27:

Any Questions?

Copies of these slides can be found at:

http://www.dcs.shef.ac.uk/~mark/phd/work/

AnswerFinder can be downloaded from:

http://www.dcs.shef.ac.uk/~mark/phd/software/

Page 28:

Bibliography

Eric Brill, Jimmy Lin, Michele Banko, Susan Dumais and Andrew Ng. Data-Intensive Question Answering. In Proceedings of the 10th Text REtrieval Conference, 2001.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva and Valentin Tablan. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.

Mark A. Greenwood and Robert Gaizauskas. Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering. In Proceedings of the Workshop on Natural Language Processing for Question Answering (EACL03), pages 29–34, Budapest, Hungary, April 14, 2003.

Sanda Harabagiu, Dan Moldovan, Marius Paşca, Rada Mihalcea, Mihai Surdeanu, Răzvan Bunescu, Roxana Gîrju, Vasile Rus and Paul Morărescu. FALCON: Boosting Knowledge for Answer Engines. In Proceedings of the 9th Text REtrieval Conference, 2000.

Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Michael Junk and Chin-Yew Lin. Question Answering in Webclopedia. In Proceedings of the 9th Text REtrieval Conference, 2000.

Xin Li and Dan Roth. Learning Question Classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING’02), 2002.

Michael Marshall. The Straw Men. HarperCollins Publishers, 2002.

Dan Moldovan, Sanda Harabagiu, Roxana Girju, Paul Morarescu, Finley Lacatusu, Adrian Novischi, Adriana Badulescu, and Orest Bolohan. LCC Tools for Question Answering. In Proceedings of the 11th Text REtrieval Conference, 2002.

Ellen M. Voorhees. Overview of the TREC 2002 Question Answering Track. In Proceedings of the 11th Text REtrieval Conference, 2002.