
Page 1: Title Slide

AQUAINT 6-month Meeting, 10/08/04

JAVELIN Project Briefing, AQUAINT Program

JAVELIN II: Scenarios and Variable Precision Reasoning for Advanced QA from Multilingual, Distributed Sources

Eric Nyberg, Teruko Mitamura, Jamie Callan, Jaime Carbonell, Bob Frederking

Language Technologies Institute, Carnegie Mellon University

Page 2: JAVELIN II Research Areas

An emerging fact base sits at the center of six research areas:

1. Scenario Dialog. User: "I'm focusing on the new Iraqi minister Al Tikriti. What can you tell me about his family and associates?"

2. Scenario Representation. [Diagram: a graph of entities (e1–e5) linked by relations (r1–r4).]

3. Distributed, Multilingual Retrieval (produces relevant documents).

4. Multi-Strategy Information Gathering: NL parsing, statistical extraction, pattern matching.

5. Variable-Precision Knowledge Representation & Reasoning: scenario reasoning, search guidance, belief revision, answer justification.

6. Answer Visualization and Scenario Refinement. System: <displays instantiated scenario> User: "Can you find more about his brother-in-law's business associates?"

Page 3: Recent Highlights

• Multi-Strategy Information Gathering
  – Participation in the Relationship Pilot
  – Training extractors with MinorThird
• Variable-Precision KR and Reasoning
  – Text Processor module (1st version complete)
  – Fact Base (1st prototype complete)
• Distributed, Multilingual QA
  – Keyword translation for CLQA (English to Chinese)

Page 4: Relationship Pilot

• 50 sample scenarios, e.g.:

  "The analyst is interested in knowing if a particular country is a member of an international organization. Is Chechnya a member of the United Nations?"

• The Phase I JAVELIN system was used, with manual tweaking
• Output of the Question Analyzer module was manually corrected:
  – Decompose into subquestions (17 of 50 scenarios)
  – Gather key terms from the background text

Page 5: NIST Evaluation Methodology

• Two categories of information "nuggets":
  – vital: must be present
  – okay: relevant but not necessary
• Each answer item could match more than one nugget
• Recall is determined by vital nuggets
• Precision is based on answer length
• F-scores computed with recall weighted three times as heavily as precision:

R = (# vital nuggets returned) / (total # vital nuggets)

P = (# relevant non-whitespace characters) / (# non-whitespace characters)

F = 10PR / (9P + R)   (the F-beta measure with beta = 3)
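As a worked example of how these formulas combine, a minimal sketch in Python (the function and its inputs are illustrative, not part of NIST's scoring tools):

```python
def f_score(vital_returned, vital_total, relevant_chars, total_chars, beta=3.0):
    """F_beta combining nugget recall with character-based precision.
    beta=3 weights recall three times as heavily as precision."""
    recall = vital_returned / vital_total
    precision = relevant_chars / total_chars
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example: 2 of 3 vital nuggets returned, 150 of 400 non-whitespace
# characters judged relevant -> F is about 0.62.
print(f_score(2, 3, 150, 400))
```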

Page 6: JAVELIN Performance Statistics

• Average F-score computed by NIST: 0.298
• Average F-score with recall based on both vital and okay nuggets: 0.322
• Total scenarios with F = 1: 0
• Total scenarios with all vital information correct: 9
  – 1/1: scenarios 18, 19, 36, 38
  – 2/2: scenarios 4, 16, 34, 37
  – 3/3: scenario 33
• Total scenarios with F = 0: 19
• Total scenarios without any (vital or okay) correct answers: 10
  – no answer found: 3, 5
  – bad answers: 6, 8, 10, 11, 13, 27, 29, 30

Page 7: JAVELIN Performance Statistics [2]

• Average recall (vital): 0.344
• Average precision: 0.261
• Matches per answer item:
  – 0 nuggets: 206
  – 1 nugget matched: 57
  – 2 nuggets matched: 10
  – 3 nuggets matched: 6
  – 4 nuggets matched: 1
• Not done (but potentially useful?): determine which decomposed questions we provided relevant information for

Page 8: General Observations

• Nugget quality and assessment vary considerably (e.g., questions #3 and #8): nuggets overlap, repeat given information, and sometimes represent cues rather than answers; other relevant information does not count if it was not in the assessors' original set
• Difficult to assess retrieval performance: no document IDs are provided in the nugget file
• Difficult to reproduce the precision scores: relevant text spans appear to have been determined manually and are not noted in the annotated file

Page 9: MinorThird

• A standardized testbed for building and evaluating machine learning algorithms that work on text
• Includes a pattern language (Mixup) for building taggers (compiles to FSTs)
• Can we use MinorThird as a factory to build new information extractors for the QA task?

http://minorthird.sourceforge.net/

Page 10: Initial Training Experiments

• Can MinorThird train new taggers for specific tags and corpora, based on bootstrap information from existing tagger(s)?
• Setup:
  – Use Identifinder to annotate 101 messages (focus: ORGANIZATION)
  – Manually fix incorrect tags
  – Training set: 81 messages; test set: 20
• Experiments (see the sketch below):
  – Vary training set size: 40, 61, 81 messages
  – Vary the history size and window size parameters used by the MinorThird Learner class
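The experiment amounts to plotting a learning curve. A minimal sketch of that loop, assuming hypothetical train_tagger and evaluate helpers rather than the actual MinorThird API:

```python
def run_learning_curve(messages, train_tagger, evaluate, sizes=(40, 61, 81)):
    """Train an ORGANIZATION tagger on increasing subsets of the corrected
    corpus and score each one against the held-out 20 messages."""
    train, test = messages[:81], messages[81:101]
    results = {}
    for n in sizes:
        tagger = train_tagger(train[:n])     # history/window parameters vary here
        results[n] = evaluate(tagger, test)  # -> (precision, recall, F)
    return results
```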

Page 11: Varying Size of Training Set

[Chart: "Variation on Training Size". Precision, Recall, and F-measure (y-axis 0% to 70%) for the BBN Identifinder baseline vs. taggers trained on 40, 61, and 81 messages.]

Page 12: The Text Processor (TP)

• A server capable of processing text annotation requests (batch or run-time)
• Receives a text stream input and assigns multiple levels of tags or features
• An application can specify which processors to run on a text, and in what order
• Provides a single API for a variety of processors (sketched below):
  – Brill Tagger
  – BBN Identifinder
  – MXTerminator
  – Link Parser
  – RASP
  – WordNet
  – CLAWS
  – FrameNet
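One way to picture the single-API idea, as a sketch under assumptions: the TextProcessor class and its processor registry are illustrative, not the actual TP interface.

```python
class TextProcessor:
    """Serves annotation requests: the caller names which processors to run
    on a text, and in what order; each pass adds a layer of tags/features."""

    def __init__(self, processors):
        # e.g. {"mxterminator": ..., "brill": ..., "identifinder": ...}
        self.processors = processors

    def annotate(self, text, pipeline):
        layers = {}
        for name in pipeline:                 # order chosen by the application
            # each processor sees the text plus previously computed layers
            layers[name] = self.processors[name](text, layers)
        return layers
```

A request for sentence splitting, then POS tags, then named entities would then be annotate(text, ["mxterminator", "brill", "identifinder"]).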

Page 13: TP Object Model

[Diagram: TP object model.]

Page 14: Fact Base

• Relational data model (schema sketched below) containing:
  – Documents and metadata
  – Standoff annotations for:
    • Linguistic analysis (segmentation, POS, parsing, predicate extraction)
    • Semantic interpretation (frame filling -> facts/events/etc.)
    • Reasoning (reference resolution, inference)
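A minimal sqlite3 sketch of such a model; the table and column names are illustrative, not the actual JAVELIN schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (
    id    INTEGER PRIMARY KEY,
    uri   TEXT,
    text  TEXT
);
CREATE TABLE annotation (           -- standoff: offsets into document.text
    id         INTEGER PRIMARY KEY,
    doc_id     INTEGER REFERENCES document(id),
    layer      TEXT,                -- 'pos', 'parse', 'predicate', 'frame', ...
    span_start INTEGER,
    span_end   INTEGER,
    label      TEXT
);
CREATE TABLE fact (                 -- frame-filling output, with provenance
    id         INTEGER PRIMARY KEY,
    ann_id     INTEGER REFERENCES annotation(id),
    predicate  TEXT,
    subject    TEXT,
    object     TEXT,
    confidence REAL
);
""")
```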

Page 15: Fact Base [2]

[Diagram: text flows through the Text Processor API's segmenter, taggers, parsers, and framers; the results feed the fact base as features and facts.]

1. Relevant documents or passages are processed by the TP modules
2. Results are stored as features on text spans
3. Extracted frames are stored as possible facts, events, etc.

* All derived information is directly linked to its input source(s) at each level
* Persistent storage in an RDBMS supports:
  – training/learning on any combination of features
  – reuse of results across sessions, analysts, etc., when appropriate
  – use of relational querying for association chains (cf. G. Bhalotia et al., "Keyword Searching and Browsing in Databases Using BANKS", ICDE 2002), as in the query sketch below
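Continuing the illustrative sqlite3 schema from the Fact Base slide, an association chain becomes a self-join on the fact table (two hops shown; the entity name is only a placeholder):

```python
# Entities reachable from a starting entity through one intermediate:
rows = conn.execute("""
    SELECT f1.subject, f1.object AS via, f2.object AS associate
    FROM fact AS f1
    JOIN fact AS f2 ON f2.subject = f1.object
    WHERE f1.subject = ?
""", ("Al Tikriti",)).fetchall()
```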

Page 16: CLQA: The Keyword Translation Problem

• Given keywords extracted from the question, how do we correctly translate them into the languages of the information sources?

[Diagram: a Keyword Translator maps keywords in Language A into keywords in Languages B and C.]

Page 17: Tools for Query/Keyword Translation

• Machine Readable Dictionaries (MRD)
  – Pros:
    • Easily obtained for high-density languages
    • Domain-specific dictionaries provide good coverage in-domain
  – Cons:
    • Publicly available general dictionaries usually have low coverage
    • Cannot translate sentences
• MT Systems
  – Pros:
    • Usually provide more coverage than publicly available MRDs
    • Translate whole sentences
  – Cons:
    • Translation quality varies
    • Low language-pair coverage compared to MRDs
• Parallel Corpora
  – Pros: good for domain-specific translation
  – Cons: poor for open-domain translation


Page 19: Research Questions

• Can we improve keyword translation correctness by building a keyword selection model that selects one translation from the candidates produced by multiple MT systems?

• Can we improve keyword translation correctness by using the question sentence?

Page 20: The Translation Selection Problem

• Given a set of translation candidates and the question sentence, how do we select the translation that is most likely a correct translation of the keyword?

[Diagram: the source keyword and source question are sent to MT Systems 1–3; each system produces a target keyword and a target question, and a Selection Model scores each target keyword.]

Page 21: Keyword Selection Model

• A set of scoring metrics (see the sketch below):
  – A translation candidate is assigned an initial base score of 0
  – Each scoring metric adds to or subtracts from the running total of the score
  – After all candidates go through the model, the candidate with the highest score is selected as the most likely correct translation
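A minimal sketch of that procedure; the metric functions are placeholders for the metrics defined on the following slides, each returning a positive or negative contribution:

```python
def select_translation(candidates, metrics):
    """candidates: (target_keyword, target_question) pairs, one per MT system;
    metrics: functions scoring a candidate against the full candidate set."""
    def total(cand):
        # base score 0; each metric adds to or subtracts from the total
        return sum(metric(cand, candidates) for metric in metrics)
    return max(candidates, key=total)
```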

Page 22: The Experiment

• Language pair: English to Chinese
• Uses three free web-based MT systems:
  – www.systranbox.com
  – www.freetranslation.com
  – www.amikai.com
• Training data: 50 input questions (125 keywords) from TREC-8, TREC-9, and TREC-10
• Testing data: 50 input questions (147 keywords) from TREC-8, TREC-9, and TREC-10
• Evaluation: translation correctness

Page 23: Scoring Metrics

• In this experiment we constructed different selection models, each using a combination of the following five scoring metrics:
  I. Baseline
  II. Segmented Word-Matching and Partial Word-Matching
  III. Full Sentence Word-Matching without Fall Back to Partial Word-Matching
  IV. Full Sentence Word-Matching with Fall Back to Partial Word-Matching
  V. Penalty for Partially Translated or Un-Translated Keywords

Page 24: Scoring Metrics Summary

• Scoring legend: Full Match +1.0 | Partial Match +0.5 | Supported by >1 MT +0.3 | Not Fully Translated -1.3

• Description of scoring metrics:

  Abbr. | Description                                                             | Scoring
  B     | Baseline                                                                | █
  S     | Segmented Word-Matching and Partial Word-Matching                       | █ █
  F¹    | Full Sentence Word-Matching without Fall Back to Partial Word-Matching |
  F²    | Full Sentence Word-Matching with Fall Back to Partial Word-Matching    | █ █
  P     | Penalty for Partially Translated or Un-Translated Keywords             | █

Page 25: Results

Keyword translation accuracy of the different models on the test set:

  Model   | S¹S²    | S¹S²S³
  B       | 78.23%  | 78.91%
  B+S     | 61.90%  | 64.63%
  B+F¹    | 80.27%  | 80.95%
  B+F²    | 75.51%  | 78.91%
  B+P     | 78.23%  | 78.91%
  B+F¹+P  | 82.99%  | 85.71%
  B+F²+P  | 78.23%  | 83.67%

Relative improvement of each model over the base model:

  Model   | S¹S²     | S¹S²S³
  B       | 0%       | 0%
  B+S     | -20.87%  | -18.10%
  B+F¹    | 2.61%    | 2.59%
  B+F²    | -3.48%   | 0.00%
  B+P     | 0.00%    | 0.00%
  B+F¹+P  | 6.08%    | 8.62%
  B+F²+P  | 0.00%    | 6.95%

[Lin, F. and T. Mitamura, "Keyword Translation from English to Chinese for Multilingual QA", Proceedings of AMTA 2004, Georgetown.]

Page 26: Results [2]

The accuracy table from the previous slide, highlighting the key comparisons:

• Best single MT system performance: 78.23%
• Best multiple-MT model performance: 85.71%
• Best possible result if the correct keywords were selected every time they are produced: 92.52%

[Lin, F. and T. Mitamura, "Keyword Translation from English to Chinese for Multilingual QA", Proceedings of AMTA 2004, Georgetown.]

Page 27: Observations

• Models that include scoring metrics requiring segmentation did poorly
• Using more MT systems improves translation correctness
• Using the translated question improves keyword translation accuracy
• There is still room for improvement (85.71% achieved vs. 92.52% possible)

Page 28: More to Do…

• Use statistical/machine learning techniques (see the sketch below):
  – Make the result of each scoring metric a feature in a classification problem (SVM, MaxEnt)
  – Train weights for each scoring metric (EM)
• Use additional / improved scoring metrics:
  – Validate translations using search engines
  – Use better segmentation tools
• Compare with other evaluation methods:
  – retrieval performance
  – end-to-end (QA) system performance
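For the first item, a hedged sketch of what the classification setup might look like, with scikit-learn standing in for the SVM and made-up feature rows for illustration:

```python
from sklearn.svm import LinearSVC

# One row per translation candidate: the outputs of the B, S, F and P
# scoring metrics as features; label 1 if the candidate was correct.
X = [[1.0, 0.5, 0.3,  0.0],
     [0.0, 0.5, 0.0, -1.3],
     [1.0, 0.0, 0.3,  0.0],
     [0.0, 0.0, 0.0, -1.3]]
y = [1, 0, 1, 0]

clf = LinearSVC().fit(X, y)
print(clf.predict([[1.0, 0.5, 0.0, 0.0]]))   # classify a new candidate
```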

Page 29: Questions?