2004 ARDA Challenge Workshop
An Investigation of Evaluation Metrics for Analytic Question Answering
Overview
Antonio Sanfilippo, PNNL/NWRRC
AQUAINT Phase 2 Fall Workshop, Tampa, FL
Northwestern Regional Research Center
• Hosted by Pacific Northwest National Laboratory
• Located in Richland, WA
Problem
• The adoption of new QA technologies in the IC is hindered by the gap between the development and usage environments
  – There is no systematic way of ensuring that QA systems conform to the working practices of analysts
  – Systems may perform well in terms of accuracy, yet still fail to address the needs of analysts
Solution
• Develop evaluation metrics that reflect the interaction of users and QA systems to determine how and to what extent these systems meet user requirements
  – Determine the utility of features and functionalities
  – Establish and corroborate user requirements
  – Perform a user-centric comparison of different systems
Experimental Focus
• The development of the evaluation metrics is based on empirical studies of analysts using
  – 3 Question Answering systems
    • Cycorp
    • LCC
    • SUNY@Albany
  – the Google search engine as the baseline system
Stakeholders
• Government Champions
  – John Prange (ARDA)
  – Kelcy Allwein (DIA)
  – Mike Blair (NAVY)
• Team Leaders
  – Emile Morse & Jean Scholtz (NIST)
• Team Participants
  – Tomek Strzalkowski, Sharon Small, Sean Ryan, Hilda Hardy (SUNY@Albany)
  – Sanda Harabagiu, Andy Hickl, John Williams (LCC)
  – Stefano Bertolo (Cycorp)
  – Paul Kantor (Rutgers University)
  – Diane Kelly (University of North Carolina)
  – Peter LaMonica, Chuck Messenger (AFRL)
  – Joe Konczal (NIST)
  – Katherine Johnson, Frank Greitzer (PNNL)
• Analysts: 7 from NAVY, 1 from ARMY
• Graduate Students: Robert Rittman, Aleksandra Sarcevic, Ying Sun (Rutgers University)
• PNNL Oversight
  – Rich Quadrel (NWRRC Director)
  – Troy Juntunen (System Installation and Connectivity)
  – Ben Barnett, Trina Pitcher, John Calhoun, Eileen Boiling (Admin)
  – Antonio Sanfilippo (Project Manager)
Roadmap
• Feb 23
  – Project planning meeting (NIST)
• March-April
  – Preparation (contracts, purchases, data collection, initial scenario development)
• April 15-16
  – Kickoff meeting (NIST)
• April-May
  – Finalize scenarios, metric hypotheses, and evaluation methods & materials
  – Work with NWRRC to set up facilities for data collection at PNNL
• June 7-25
  – Install systems at PNNL
  – Carry out user studies with analysts
  – Collect data
• July
  – First version of data analysis
  – Internal progress report and agenda for the remaining work
• August
  – Final version of data analysis and final exam
• September
  – Final report
Technical Approach
• Construct evaluation metric hypotheses about the utility of QA systems and test these in experimental user studies (see the sketch below)
  – Collect data relative to the evaluation hypotheses for 8 analysts working on 8 task assignment scenarios with 4 QA systems
  – Analyze the collected data to verify the utility of the evaluation metric hypotheses
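For illustration, the following Python sketch shows one way such a crossed design (8 analysts, 8 scenarios, 4 systems) could be counterbalanced so that every analyst works all 8 scenarios and uses each system on two of them. The rotation scheme, names, and code are assumptions for illustration only, not the assignment plan actually used in the workshop.

```python
# Hypothetical counterbalancing sketch: 8 analysts x 8 scenarios x 4 systems.
# Each analyst works all 8 scenarios and uses each system on exactly 2 of them;
# the system order is rotated per analyst so scenario/system pairings vary.

analysts = [f"analyst_{i + 1}" for i in range(8)]
scenarios = list("ABCDEFGH")          # scenario IDs A-H from the scenario slide
systems = ["Cycorp", "LCC", "SUNY@Albany", "Google (baseline)"]

def build_schedule(analysts, scenarios, systems):
    """Latin-square-style rotation: shift the system sequence by analyst index."""
    schedule = {}
    for a_idx, analyst in enumerate(analysts):
        ordering = [systems[(a_idx + s_idx) % len(systems)]
                    for s_idx in range(len(scenarios))]
        schedule[analyst] = list(zip(scenarios, ordering))
    return schedule

if __name__ == "__main__":
    for analyst, tasks in build_schedule(analysts, scenarios, systems).items():
        print(analyst, tasks)
```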
Evaluation Hypotheses
• Question answering systems should:
  – H1: Support information gathering with lower cognitive workload
  – H2: Assist in exploring more paths/hypotheses
  – H3: Enable production of higher quality reports
  – H4: Provide useful suggestions to the analyst
  – H5: Provide more good surprises than bad
  – H6: Enable more focus on analysis than data collection
  – H7: Enable analysts to collect more data in less time
  – H8: Reduce the time spent reading
  – H9: Identify gaps in the knowledge base
  – H10: Help the analyst recognize gaps in their thinking
  – H11: Provide context for information
  – H12: Provide context, continuity and coherence of dialogue
  – H13: Let analysts relocate previously seen materials
  – H14: Be easy to use
  – H15: Increase an analyst’s confidence in exploration and report
• Each hypothesis is assessed with a subset of the data-collection instruments: Questionnaires, NASA TLX, SmiFro & Status, Cross-evaluation, System Logs, Glass Box, and Query Trails
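NASA TLX, one of the instruments listed above, rates workload on six standard subscales and weights them by pairwise comparisons. As a minimal sketch of how a weighted TLX score might be computed from questionnaire responses, the Python below follows the standard instrument; the example ratings and weights are hypothetical and are not data from the workshop.

```python
# Minimal sketch of a weighted NASA TLX workload score (standard instrument,
# not workshop-specific code). Ratings are on a 0-100 scale; weights come from
# the 15 pairwise comparisons between the six subscales, so they sum to 15.

SUBSCALES = ["Mental Demand", "Physical Demand", "Temporal Demand",
             "Performance", "Effort", "Frustration"]

def tlx_score(ratings: dict, weights: dict) -> float:
    """Weighted TLX: sum of (rating * weight) divided by the 15 comparisons."""
    total_weight = sum(weights[s] for s in SUBSCALES)
    assert total_weight == 15, "pairwise-comparison weights must sum to 15"
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / 15.0

# Hypothetical example values (illustration only):
example_ratings = {"Mental Demand": 70, "Physical Demand": 10,
                   "Temporal Demand": 55, "Performance": 40,
                   "Effort": 65, "Frustration": 35}
example_weights = {"Mental Demand": 5, "Physical Demand": 0,
                   "Temporal Demand": 3, "Performance": 2,
                   "Effort": 4, "Frustration": 1}

print(f"Weighted TLX workload: {tlx_score(example_ratings, example_weights):.1f}")
```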
Scenario Topics
• A: Indian Chemical Weapons Production and Delivery Systems
• B: Libyan Chemical Weapons Program
• C: Iranian Chemical Weapons Development and Impact
• D: North Korean Chemical and Biological Weapons Research
• E: Pakistani Chemical Agent Production
• F: Current Status of Russia’s Chemical Weapons Program
• G: South African Chemical Agents Program Status
• H: Assessment of Egypt’s Biological Weapons
Methodology
Accomplishments
• Results to date from the analysis of the data collected during the user studies at PNNL indicate that
  – Most of the evaluation hypotheses initially set by the team proved to be useful for the user-centered assessment of QA systems
  – The methodology developed by the team during the course of the user studies is effective for applying these evaluation metrics
  – On average, the Cycorp, Albany and LCC Question Answering systems were deemed more useful by users than the baseline system (Google)
Results & Benefits
• The workshop delivered a set of tested user-centric evaluation criteria and a methodology for applying these criteria to gain knowledge about how QA systems meet the needs of analysts
• The availability of user-centric evaluation metrics enables a systematic methodology for tailoring the utility of QA systems to the specific needs of the Intelligence Community
  – Target the features and functionalities that are most impactful
  – Facilitate technology insertion
Assessment
• The work has been carried out on schedule with great precision, attention to detail and high technical standards
• Results indicate that the Workshop will be impactful in establishing a user-centered evaluation framework for interactive information systems
• Results will be presented in the next talk by Emile Morse
• A version of the methodology developed will be demonstrated in today’s exercise
Parting Shots
Views from the June Challenge problem in Richland
Thank You!