
Automating Slot Filling Validation to Assist Human Assessment

Suzanne Tamang and Heng Ji

Computer Science Department and Linguistics Department, Queens College and the Graduate Center

City University of New York

November 5, 2012

Overview

- KBP SF validation task
- Two-step validation
  - Logistic regression based reranking
  - Predicted confidence adjustment and filtering
- Validation features: shallow, contextual, emergent (voting)
- System combination: perfect setting, limiting conditions
- Evaluation results
- Opportunities

SF Validation Task

Standard answer format:

  id, slot, run, docid, filler, start and end offsets for filler, start and end offsets for justification, confidence

Example:

  Richmond Flowers, per:title, SFV_10_1, APW_ENG_20070810.1457.LDC2009T13, Attorney General, 336, 351, 321, 44, 1.0
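As a minimal sketch of working with the answer format above (the attribute names and the comma-delimited parsing are assumptions; a filler containing commas would need a more careful parser):

```python
from dataclasses import dataclass

# Field layout follows the standard answer format above; the attribute
# names here are assumed for illustration.
@dataclass
class SFAnswer:
    query_id: str
    slot: str
    run_id: str
    doc_id: str
    filler: str
    filler_start: int
    filler_end: int
    just_start: int
    just_end: int
    confidence: float

def parse_answer(line: str) -> SFAnswer:
    # Split on commas, assuming the filler itself contains none.
    p = [field.strip() for field in line.split(",")]
    return SFAnswer(p[0], p[1], p[2], p[3], p[4],
                    int(p[5]), int(p[6]), int(p[7]), int(p[8]), float(p[9]))

ans = parse_answer(
    "Richmond Flowers, per:title, SFV_10_1, "
    "APW_ENG_20070810.1457.LDC2009T13, Attorney General, "
    "336, 351, 321, 44, 1.0")
```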

Validation goal: use post-processing methods to label each answer 1 or -1.

Step one:
- Combine runs and rerank using a probabilistic classifier
- Identify a threshold for filtering the best candidates

Step two:
- Automatically assess system quality
- When available, use deeper contextual information
- Adjust confidence values to dampen noisy systems' contributions
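A minimal sketch of step one, assuming a trained logistic regression model over a handful of validation features (the weights, bias, and feature set below are illustrative stand-ins, not the paper's):

```python
import math

# Illustrative, assumed weights for a trained logistic regression reranker.
WEIGHTS = {"n_tokens": -0.1, "capitalized": 0.8, "system_votes": 2.5}
BIAS = -1.0

def predicted_confidence(features: dict) -> float:
    # Logistic sigmoid over a weighted feature sum.
    z = BIAS + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def validate(answers, threshold=0.5):
    # Label an answer 1 if its reranked confidence clears the threshold
    # (tuned on held-out data in the paper's setup), else -1.
    return [(a, 1 if predicted_confidence(f) >= threshold else -1)
            for a, f in answers]

pooled = [
    ("Attorney General", {"n_tokens": 2, "capitalized": 1, "system_votes": 0.9}),
    ("asdf1234",         {"n_tokens": 1, "capitalized": 0, "system_votes": 0.05}),
]
labels = validate(pooled)
```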


Features

Feature                                              | Description                                                          | Value    | Type
document type                                        | provided by document collection (newswire, broadcast news, web log) | category | shallow
*number of tokens                                    | count of white spaces (+1) between contiguous character strings     | integer  | shallow
*acronym                                             | identify and concatenate first letter of each token                 | binary   | shallow
*url                                                 | structural rules to determine if a valid URL                        | binary   | shallow
named entity type                                    | label with gazetteer                                                | category | shallow
city, *state, *country, *title, ethnicity, religion  | appears in specific slot-related gazetteer                          | binary   | shallow
*alphanumeric                                        | indicate if numbers and letters appear                              | binary   | shallow
date                                                 | structural rules to determine if an acceptable date format          | binary   | shallow
capitalized                                          | first character of token(s) capitalized                             | binary   | shallow
same                                                 | if query and fill strings match                                     | binary   | shallow
keywords                                             | used primarily for spouse and residence slots                       | binary   | context
dependency parse                                     | length from query to answer                                         | integer  | context
**system votes                                       | proportion of systems with answer agreement                         | 0-1      | emergent
**answer votes                                       | proportion of answers with answer agreement                         | 0-1      | emergent

* statistically significant predictor in select models
** statistically significant predictor in nearly all models
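A few of the shallow and emergent features above can be sketched as simple functions (the exact rule implementations are assumptions based on the descriptions in the table):

```python
def n_tokens(filler: str) -> int:
    # Count of white spaces (+1) between contiguous character strings.
    return filler.count(" ") + 1

def is_acronym(filler: str) -> bool:
    # Approximation: a single all-uppercase alphabetic token.
    return filler.isalpha() and filler.isupper()

def is_alphanumeric(filler: str) -> bool:
    # True if both digits and letters appear in the filler.
    return any(c.isdigit() for c in filler) and any(c.isalpha() for c in filler)

def is_capitalized(filler: str) -> bool:
    # First character of every token is uppercase.
    return all(tok[0].isupper() for tok in filler.split())

def system_votes(answer: str, runs: dict) -> float:
    # Proportion of systems whose output set agrees with this answer.
    agree = sum(1 for outputs in runs.values() if answer in outputs)
    return agree / len(runs)
```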


Two-Phased Validation Approach

Step 1: Classification, trained on 2011 KBP SF data
Using features extracted from the 2011 KBP results:
- Model selection using a stepwise procedure and AIC
- Threshold tuning on predicted confidence estimates

Step 2: Adjustment and filtering
- Automatic assessment of system quality
- Adjustment of predicted confidence using quality/DP
- Contextual analysis with answer provenance offsets

Features at the answer, system, and group levels: shallow, contextual, emergent


Attribute Distribution in Automatic Slot Filling


PER Attribute Distribution


ORG Attribute Distribution

SF Performance: Training and Testing


Performance, Mean Confidence & Set Size

27 distinct runs; variable F1, size, confidence, and offset use.


Results: Slot Filling Validation


Pre/Post Validation Results:

                  R     P     F1
  LDC             0.72  0.77  0.75
  w/o validation  0.71  0.03  0.06
  validation P1   0.12  0.07  0.09
  validation P2   0.35  0.08  0.13

Reranking multiple systems: the ideal case
- Diversity of systems
- Comparable performance
- Rich information: reliable answer context, system approach / intermediate system results

The KBP SF task in practice:
- Twenty-seven runs, limited intermediate results, unknown strategies, and variable performance
- Inconsistencies paired with a 'rigid' framework
- Provenance: unavailable or unreliable (off by a little and by a lot)
- Confidence may or may not be available

What have we learned that translates to more efficient assessment?
Confidence, provenance, approximating system quality, and flexibility.


Challenges and Solutions

Labor intensive:
- Training and quality control; tedious and unfulfilling
- 22% of total answers were redundant
- 1% gain in recall over systems

Validation:
- Inconsistencies in reporting (provenance / confidence)
- Lack of intermediate output

Confidence:
- Uniform weighting
- Automatic assessment of quality: inconsistency, confidence distributions
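A rough sketch of how confidence adjustment might dampen noisy systems, assuming a naive agreement-based quality estimate and a multiplicative adjustment (both are assumptions, not the paper's exact scheme):

```python
def estimate_quality(confidences, agreements):
    # Naive quality estimate: how often a system's high-confidence answers
    # (here, >= 0.9) agree with other systems. Inputs are parallel
    # per-answer lists: reported confidence and a 0/1 agreement indicator.
    pairs = [a for c, a in zip(confidences, agreements) if c >= 0.9]
    return sum(pairs) / len(pairs) if pairs else 0.0

def adjust(confidence: float, quality: float) -> float:
    # Multiplicative dampening: a noisy system's reported confidences
    # contribute less after scaling by its estimated quality.
    return confidence * quality

q = estimate_quality([0.95, 0.4, 0.92], [1, 0, 0])
```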

              R     P     F1    TP
  LDC         0.72  0.77  0.75  1119
  Systems     0.71  0.03  0.06  1081
  Answer Key  ?     1     ?     1543
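The F1 column in the table above is the harmonic mean of precision and recall; recomputing from the reported (rounded) P and R values lands within rounding of the table:

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

ldc = f1(0.77, 0.72)      # ~0.74 (table reports 0.75, likely from unrounded P/R)
systems = f1(0.03, 0.71)  # ~0.06
```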


Naïve Estimation of System Quality

Confidence of High and Low Performers

Shallow/emergent features reduce noise at the expense of better systems


Confidence-based Reranking
- Confidence is an important factor for a validator
- Informative above the 0.90 threshold
- Paired with quality estimates, culls more valid answers

Summary

Evaluation of a two-phase SF validation approach for KBP 2012:
- Improves overall F1: before 0.06, after 0.13
- Helps low performers at the expense of better systems

Key observations:
- Shallow features contribute to establishing a baseline
- Voting features did not generalize and were susceptible to system noise
- Contextual features are helpful (P1 to P2 gains)

Opportunities:
- Incorporating confidence as a classifier feature or for filtering
- More flexible frameworks for using provenance information
- Improved methods for naively estimating low and high performers in the multi-system setting

Thank you
