2004 ARDA Challenge Workshop
An Investigation of Evaluation Metrics for Analytic Question Answering
Overview
Antonio Sanfilippo, PNNL/NWRRC
AQUAINT Phase 2 Fall Workshop, Tampa, FL
Northwestern Regional Research Center
• Hosted by Pacific Northwest National Laboratory
• Located in Richland, WA
Problem
• The adoption of new QA technologies in the IC is hindered by the gap between the development and usage environments
  – There is no systematic way of ensuring that QA systems conform to the working practices of analysts
  – Systems may perform well in terms of accuracy, yet still fail to address the needs of analysts
Solution
• Develop evaluation metrics that reflect the interaction of users and QA systems to determine how and to what extent these systems meet user requirements
  – Determine the utility of features and functionalities
  – Establish and corroborate user requirements
  – Perform a user-centric comparison of different systems
Experimental Focus
• The development of the evaluation metrics is based on empirical studies of analysts using
  – 3 Question Answering systems
    • Cycorp
    • LCC
    • SUNY@Albany
  – the Google search engine as the baseline system
Stakeholders
• Government Champions
  – John Prange (ARDA)
  – Kelcy Allwein (DIA)
  – Mike Blair (NAVY)
• Team Leaders
  – Emile Morse & Jean Scholtz (NIST)
• Team Participants
  – Tomek Strzalkowski, Sharon Small, Sean Ryan, Hilda Hardy (SUNY@Albany)
  – Sanda Harabagiu, Andy Hickl, John Williams (LCC)
  – Stefano Bertolo (Cycorp)
  – Paul Kantor (Rutgers University)
  – Diane Kelly (University of North Carolina)
  – Peter LaMonica, Chuck Messenger (AFRL)
  – Joe Konczal (NIST)
  – Katherine Johnson, Frank Greitzer (PNNL)
• Analysts: 7 from NAVY, 1 from ARMY
• Graduate Students: Robert Rittman, Aleksandra Sarcevic, Ying Sun (Rutgers University)
• PNNL Oversight
  – Rich Quadrel (NWRRC Director)
  – Troy Juntunen (System Installation and Connectivity)
  – Ben Barnett, Trina Pitcher, John Calhoun, Eileen Boiling (Admin)
  – Antonio Sanfilippo (Project Manager)
Roadmap
• Feb 23
  – Project planning meeting (NIST)
• March-April
  – Preparation (contracts, purchases, data collection, initial scenario development)
• April 15-16
  – Kickoff meeting (NIST)
• April-May
  – Finalize scenarios, metric hypotheses, and evaluation methods & materials
  – Work with NWRRC to set up facilities for data collection at PNNL
• June 7-25
  – Install systems at PNNL
  – Carry out user studies with analysts
  – Collect data
• July
  – First version of data analysis
  – Internal progress report and agenda for the remaining work
• August
  – Final version of data analysis and final exam
• September
  – Final report
Technical Approach
• Construct evaluation metric hypotheses about the utility of QA systems and test these in experimental user studies (see the sketch below)
  – Collect data relative to the evaluation hypotheses for 8 analysts working on 8 task assignment scenarios with 4 QA systems
  – Analyze the collected data to verify the utility of the evaluation metric hypotheses
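For illustration, the following Python sketch shows one way such a crossed design (8 analysts, 8 scenarios, 4 systems) could be counterbalanced so that every analyst works all 8 scenarios and uses each system on two of them. The rotation scheme, names, and code are assumptions for illustration only, not the assignment plan actually used in the workshop.

```python
# Hypothetical counterbalancing sketch: 8 analysts x 8 scenarios x 4 systems.
# Each analyst works all 8 scenarios and uses each system on exactly 2 of them;
# the system order is rotated per analyst so scenario/system pairings vary.

analysts = [f"analyst_{i + 1}" for i in range(8)]
scenarios = list("ABCDEFGH")          # scenario IDs A-H from the scenario slide
systems = ["Cycorp", "LCC", "SUNY@Albany", "Google (baseline)"]

def build_schedule(analysts, scenarios, systems):
    """Latin-square-style rotation: shift the system sequence by analyst index."""
    schedule = {}
    for a_idx, analyst in enumerate(analysts):
        ordering = [systems[(a_idx + s_idx) % len(systems)]
                    for s_idx in range(len(scenarios))]
        schedule[analyst] = list(zip(scenarios, ordering))
    return schedule

if __name__ == "__main__":
    for analyst, tasks in build_schedule(analysts, scenarios, systems).items():
        print(analyst, tasks)
```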
Evaluation Hypotheses
• Question answering systems should:
  – H1: Support information gathering with lower cognitive workload
  – H2: Assist in exploring more paths/hypotheses
  – H3: Enable production of higher quality reports
  – H4: Provide useful suggestions to the analyst
  – H5: Provide more good surprises than bad
  – H6: Enable more focus on analysis than data collection
  – H7: Enable analysts to collect more data in less time
  – H8: Reduce the time spent reading
  – H9: Identify gaps in the knowledge base
  – H10: Help the analyst recognize gaps in their thinking
  – H11: Provide context for information
  – H12: Provide context, continuity and coherence of dialogue
  – H13: Let analysts relocate previously seen materials
  – H14: Be easy to use
  – H15: Increase an analyst’s confidence in exploration and report
• Each hypothesis is assessed with a subset of the data-collection instruments: Questionnaires, NASA TLX, SmiFro & Status, Cross-evaluation, System Logs, Glass Box, and Query Trails
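NASA TLX, one of the instruments listed above, rates workload on six standard subscales and weights them by pairwise comparisons. As a minimal sketch of how a weighted TLX score might be computed from questionnaire responses, the Python below follows the standard instrument; the example ratings and weights are hypothetical and are not data from the workshop.

```python
# Minimal sketch of a weighted NASA TLX workload score (standard instrument,
# not workshop-specific code). Ratings are on a 0-100 scale; weights come from
# the 15 pairwise comparisons between the six subscales, so they sum to 15.

SUBSCALES = ["Mental Demand", "Physical Demand", "Temporal Demand",
             "Performance", "Effort", "Frustration"]

def tlx_score(ratings: dict, weights: dict) -> float:
    """Weighted TLX: sum of (rating * weight) divided by the 15 comparisons."""
    total_weight = sum(weights[s] for s in SUBSCALES)
    assert total_weight == 15, "pairwise-comparison weights must sum to 15"
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / 15.0

# Hypothetical example values (illustration only):
example_ratings = {"Mental Demand": 70, "Physical Demand": 10,
                   "Temporal Demand": 55, "Performance": 40,
                   "Effort": 65, "Frustration": 35}
example_weights = {"Mental Demand": 5, "Physical Demand": 0,
                   "Temporal Demand": 3, "Performance": 2,
                   "Effort": 4, "Frustration": 1}

print(f"Weighted TLX workload: {tlx_score(example_ratings, example_weights):.1f}")
```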
Scenario Topics
• A: Indian Chemical Weapons Production and Delivery Systems
• B: Libyan Chemical Weapons Program
• C: Iranian Chemical Weapons Development and Impact
• D: North Korean Chemical and Biological Weapons Research
• E: Pakistani Chemical Agent Production
• F: Current Status of Russia’s Chemical Weapons Program
• G: South African Chemical Agents Program Status
• H: Assessment of Egypt’s Biological Weapons
Methodology
Accomplishments
• Results to date from the analysis of the data collected during the user studies at PNNL indicate that
  – Most of the evaluation hypotheses initially set by the team proved to be useful for the user-centered assessment of QA systems
  – The methodology developed by the team during the course of the user studies is effective for applying these evaluation metrics
  – On average, the Cycorp, Albany and LCC Question Answering systems were deemed more useful by users than the baseline system (Google)
Results & Benefits
• The workshop delivered a set of tested user-centric evaluation criteria and a methodology for applying these criteria to gain knowledge about how QA systems meet the needs of analysts
• The availability of user-centric evaluation metrics enables a systematic methodology for tailoring the utility of QA systems to the specific needs of the Intelligence Community
  – Target the features and functionalities that are most impactful
  – Facilitate technology insertion
Assessment
• The work has been carried out on schedule with great precision, attention to detail and high technical standards
• Results indicate that the Workshop will be impactful in establishing a user-centered evaluation framework for interactive information systems
• Results will be presented in the next talk by Emile Morse
• A version of the methodology developed will be demonstrated in today’s exercise
Parting Shots
Views from the June Challenge problem in Richland
Thank You!