Andrew Hickl, Jeremy Bensley, John Williams, Kirk Roberts, Bryan Rink and Ying Shi
Recognizing Textual Entailment with LCC’s Groundhog System
Introduction
• We were grateful for the opportunity to participate in this year's PASCAL RTE-2 Challenge
– Our first exposure to RTE came as part of the Fall 2005 AQUAINT "Knowledge Base" Evaluation
• Included PASCAL veterans: University of Colorado at Boulder, University of Illinois at Urbana-Champaign, Stanford University, University of Texas at Dallas, and LCC (Moldovan)
• While this year's evaluation represented our first foray into RTE, our group has worked extensively on the types of textual inference that are crucial for:
– Question Answering
– Information Extraction
– Multi-Document Summarization
– Named Entity Recognition
– Temporal and Spatial Normalization
– Semantic Parsing
Outline of Today’s Talk
• Introduction
• Groundhog Overview
– Preprocessing
– New Sources of Training Data
– Performing Lexical Alignment
– Paraphrase Acquisition
– Feature Extraction
– Entailment Classification
• Evaluation
• Conclusions
2 Feb: Giorno della Marmotta (Groundhog Day), the RTE-2 deadline
Architecture of the Groundhog System
[Architecture diagram: RTE Dev/Test pairs, training corpora (positive and negative examples), and the WWW feed into Preprocessing (named entity recognition, name aliasing, name coreference, syntactic parsing, semantic parsing, semantic annotation, temporal normalization, temporal ordering), followed by Lexical Alignment, Paraphrase Acquisition, and Feature Extraction, ending in Entailment Classification with a YES/NO decision.]
A Motivating Example
• Questions we need to answer:
– What are the "important" portions that should be considered by a system? Can lexical alignment be used to identify these strings?
– How do we determine that the same meaning is being conveyed by phrases that may not necessarily be lexically related? Can phrase-level alternations ("paraphrases") help?
– How do we deal with the complexity that reduces the effectiveness of syntactic and semantic parsers? Annotations? Rules? Compression?
Example 139: Task=SUM, Judgment=YES, LCC=YES, Conf = +0.8875
Text: The Bills now appear ready to hand the reins over to one of their two-top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Hypothesis: The Bills plan to give the starting job to J.P. Losman.
[The Bills]Arg0 now appear ready to hand [the reins]Arg1 over to [one]Arg2 of their two-top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Preprocessing
• Groundhog starts the process of RTE by annotating t-h pairs with a wide range of lexicosemantic information.
• Named Entity Recognition
– LCC's CiceroLite NER software is used to categorize more than 150 different types of named entities:
[The Bills]SPORTS_ORG plan to give the starting job to [J.P. Losman]PERSON.
[The Bills]SPORTS_ORG [now]TIMEX appear ready to hand the reins over to one of their two-top picks from [a year ago]TIMEX in quarterback [J.P. Losman]PERSON, who missed most of [last season]TIMEX with [a broken leg]BODY_PART.
• Name Aliasing and Coreference
– Lexica and grammars found in CiceroLite are used to identify coreferential names and potential antecedents for pronouns:
[The Bills]ID=01 plan to give the starting job to [J.P. Losman]ID=02.
[The Bills]ID=01 now appear ready to hand the reins over to [one of [their]ID=01 two-top picks]ID=02 from a year ago in [quarterback]ID=02 [J.P. Losman]ID=02, [who]ID=02 missed most of last season with a broken leg.
Preprocessing
• Temporal Normalization and Ordering
– Heuristics found in LCC's TASER temporal normalization system are then used to normalize time expressions to their ISO 8601 values and to compute the relative order of time expressions within a context
• POS Tagging and Syntactic Parsing
– We use LCC's own implementation of the Brill POS tagger and the Collins Parser in order to syntactically parse sentences and to identify phrase chunks, phrase heads, relative clauses, appositives, and parentheticals.
The Bills plan to give the starting job to J.P. Losman.
The Bills [now]2006/01/01 appear ready to hand the reins over to one of their two-top picks from [a year ago]2005/01/01-2005/12/31 in quarterback J.P. Losman, who missed most of [last season]2005/01/01-2005/12/31 with a broken leg.
Preprocessing
• Semantic Parsing
– Semantic parsing is performed using a Maximum Entropy-based semantic role labeling system trained on PropBank annotations
[Predicate-argument structure for the text: appear ready(Arg0: The Bills, ArgM: now); hand(Arg0: The Bills, Arg1: the reins, Arg2: one of their two-top picks); missed(Arg0: who, Arg1: most of last season, Arg3: a broken leg)]
The Bills now appear ready to hand the reins over to one of their two-top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Preprocessing
• Semantic Parsing (continued)
[Predicate-argument structure for the hypothesis: plan(Arg0: The Bills); give(Arg0: The Bills, Arg1: the starting job, Arg2: J.P. Losman)]
The Bills plan to give the starting job to J.P. Losman.
Preprocessing
• Semantic Annotation
• Heuristics were used to annotate the following semantic information:
– Polarity: Predicates and nominals were assigned a negative polarity value when found in the scope of an overt negative marker (no, not, never) or when associated with a negation-denoting verb (refuse).
Both owners and players admit there is [unlikely]TRUE to be much negotiating.
Never before had ski racing [seen]FALSE the likes of Alberto Tomba.
Members of Iraq's Governing Council refused to [sign]FALSE an interim constitution.
– Factive Verbs: Predicates such as acknowledge, admit, and regret conventionally imply the truth of their complements; complements associated with a list of factive verbs were always assigned a positive polarity value.
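The polarity heuristic can be illustrated with a minimal sketch. This assumes a flat token list as input and treats anything earlier in the sentence as "in scope"; the system itself works over parses, so the scope test here is deliberately crude, and the word lists are illustrative.

```python
# A minimal sketch of the polarity heuristic over a flat token list.
# Scope detection is deliberately crude: any negative marker or
# negation-denoting verb preceding the predicate flips its polarity.
NEGATIVE_MARKERS = {"no", "not", "never", "n't"}
NEGATION_VERBS = {"refuse", "refused", "refuses", "refusing"}  # illustrative list

def predicate_polarity(tokens, pred_index):
    """Return FALSE if a negative marker or negation-denoting verb
    precedes the predicate, TRUE otherwise."""
    for tok in tokens[:pred_index]:
        if tok.lower() in NEGATIVE_MARKERS or tok.lower() in NEGATION_VERBS:
            return "FALSE"
    return "TRUE"

toks = "Members of Iraq's Governing Council refused to sign an interim constitution".split()
print(predicate_polarity(toks, toks.index("sign")))  # FALSE
```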
Preprocessing
• Semantic Annotation (Continued)
– Non-Factive Verbs: We refer to predicates that do not imply the truth of their complements as non-factive verbs.
– Predicates found as complements of the following contexts were marked as unresolved:
• non-factive speech act verbs (deny, claim)
• psych verbs (think, believe)
• verbs of uncertainty or likelihood (be uncertain, be likely)
• verbs marking intentions or plans (scheme, plot, want)
• verbs in conditional contexts (whether, if)
Congress approved a different version of the COCOPA Law, which did not include the autonomy clauses, claiming they [were in contradiction]UNRESOLVED with constitutional rights.
Defence Minister Robert Hill says a decision would need to be made by February next year, if Australian troops [extend]UNRESOLVED their stay in southern Iraq.
Preprocessing
• Semantic Annotation (Continued)
– Supplemental Expressions: Following (Huddleston and Pullum 2002), constructions that are known to trigger conventional implicatures – including nominal appositives, epithets/name aliases, as-clauses, and non-restrictive relative clauses – were also extracted from text and appended to the end of each text or hypothesis.
Nominal Appositives:
Shia pilgrims converge on Karbala to mark the death of Hussein, the prophet Muhammad's grandson, 1300 years ago.
Shia pilgrims converge on Karbala to mark the death of Hussein 1300 years ago AND Hussein is the prophet Muhammad's grandson.
Epithets / Name Aliases:
Ali al-Timimi had previously avoided prosecution, but now the radical Islamic cleric is behind bars in an American prison.
Ali al-Timimi had previously avoided prosecution but now the radical Islamic cleric is behind bars... AND Ali al-Timimi is a radical Islamic cleric.
Preprocessing
• Supplemental Expressions (continued):
As-Clauses:
The LMI was set up by Mr. Torvalds with John Hall as a non-profit organization to license the use of the word Linux.
The LMI was set up by Mr. Torvalds with John Hall as a non-profit organization to license the use of the word Linux AND the LMI is a non-profit organization.
Non-Restrictive Relative Clauses:
The Bills now appear ready to ... quarterback J.P. Losman, who missed most of last season with a broken leg.
The Bills now appear ready to ... quarterback J.P. Losman, who missed most of last season with a broken leg AND J.P. Losman missed most of last season...
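A rough sketch of the nominal-appositive case follows. The regular expression is an illustrative stand-in for the parse-based extraction the system actually uses, and only handles the simple "X, the Y, ..." pattern.

```python
import re

# Toy appositive extraction: turn 'X, the Y, ...' into the original
# sentence plus 'AND X is the Y'. The regex is a stand-in for the
# parse-based extraction the system actually performs.
def append_appositive(sentence):
    m = re.search(r"(\b[A-Z][\w.]*(?: [A-Z][\w.]*)*), (the [^,]+),", sentence)
    if not m:
        return sentence
    head, appositive = m.group(1), m.group(2)
    return sentence + " AND " + head + " is " + appositive + "."

s = ("Shia pilgrims converge on Karbala to mark the death of Hussein, "
     "the prophet Muhammad's grandson, 1300 years ago.")
print(append_appositive(s))
```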
Lexical Alignment
• We believe that these lexicosemantic annotations – along with the individual forms of the words – can provide us with the input needed to identify corresponding tokens, chunks, or collocations from the text and the hypothesis.
[Alignment diagram (text ↔ hypothesis):
– The Bills ↔ The Bills (Arg0, ID=01, Organization on both sides): alignment probability 0.94
– J.P. Losman ↔ J.P. Losman (ID=02, Person / Arg2, ID=02, Person): alignment probability 0.91
– hand ↔ give (Unresolved, WN Similar on both sides): alignment probability 0.74
– the reins ↔ the starting job (Arg1 on both sides): alignment probability 0.49]
Lexical Alignment
• In Groundhog, we used a Maximum Entropy classifier to compute the probability that an element selected from a text corresponds to – or can be aligned with – an element selected from a hypothesis.
• Three-step process:
– First, sentences were decomposed into a set of "alignable chunks" derived from the output of a chunk parser and a collocation detection system.
– Next, chunks from the text (Ct) and hypothesis (Ch) were assembled into an alignment matrix (Ct × Ch).
– Finally, each pair of chunks was submitted to a classifier that output the probability that the pair represented a positive example of alignment.
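The three steps can be sketched as follows; the `overlap` function is a crude token-overlap stand-in for the trained Maximum Entropy classifier, and the chunk lists are assumed to come from a chunk parser.

```python
from itertools import product

# Sketch of the three-step alignment process: build the Ct x Ch matrix
# and score every (text chunk, hypothesis chunk) pair.
def align(text_chunks, hyp_chunks, classifier):
    return {(ct, ch): classifier(ct, ch)
            for ct, ch in product(text_chunks, hyp_chunks)}

# Stand-in classifier: Jaccard token overlap as a crude alignment probability.
def overlap(ct, ch):
    a, b = set(ct.lower().split()), set(ch.lower().split())
    return len(a & b) / len(a | b)

matrix = align(["The Bills", "hand", "the reins"],
               ["The Bills", "give", "the starting job"], overlap)
best = max(matrix, key=matrix.get)
print(best, matrix[best])  # ('The Bills', 'The Bills') 1.0
```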
Lexical Alignment
• Four sets of features were used:
– Statistical Features:
• Cosine Similarity
• (Glickman and Dagan 2005)'s Lexical Entailment Probability
– Lexicosemantic Features:
• WordNet Similarity (Pedersen et al. 2004)
• WordNet Synonymy/Antonymy
• Named Entity Features
• Alternations
– String-based Features:
• Levenshtein Edit Distance
• Morphological Stem Equality
– Syntactic Features:
• Maximal Category
• Headedness
• Structure of entity NPs (modifiers, PP attachment, NP-NP compounds)
Training the Alignment Classifier
• Two developers annotated a held-out set of 10,000 alignment chunk pairs from the RTE-2 Development Set as either positive or negative examples of alignment.
• Performance for two different classifiers on a randomly selected set of 1000 examples from the RTE-2 Dev Set is presented below:

Classifier        Training Set   Precision  Recall  F1
Hillclimber       10K pairs      0.837      0.774   0.804
Maximum Entropy   10K pairs      0.881      0.851   0.866

• While both classifiers performed relatively satisfactorily, F-measure varied significantly (p < 0.05) on different test sets.
Creating New Sources of Training Data
• In order to perform more robust alignment, we experimented with two techniques for gathering training data.
• Positive Examples:
– Following (Burger and Ferro 2005), we created a corpus of 101,329 positive examples of entailment by pairing the headline and first sentence from newswire documents.
First Line: Sydney newspapers made a secret bid not to report on the fawning and spending made during the city's successful bid for the 2000 Olympics, former Olympics Minister Bruce Baird said today.
Headline: Papers Said To Protect Sydney Bid
– Examples were filtered extensively in order to select only those where the headline and the first line both synopsized the content of the document.
– In an evaluation set of 2500 examples, annotators found 91.8% to be positive examples of "rough" entailment.
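The pairing step can be sketched as a toy generator. The naive sentence splitter and the in-memory (headline, body) document list are illustrative simplifications; the real corpus was also filtered as described above.

```python
# Toy sketch of the headline-pairing heuristic: the first sentence of a
# news story is treated as a text that roughly entails its headline.
def headline_pairs(documents):
    """Yield (text, hypothesis) pairs from (headline, body) tuples."""
    for headline, body in documents:
        first_sentence = body.split(". ")[0] + "."  # naive sentence split
        yield first_sentence, headline

docs = [("Papers Said To Protect Sydney Bid",
         "Sydney newspapers made a secret bid not to report on the fawning "
         "and spending made during the city's successful bid for the 2000 "
         "Olympics, former Olympics Minister Bruce Baird said today. "
         "A second sentence follows.")]
for text, hypothesis in headline_pairs(docs):
    print(hypothesis)  # Papers Said To Protect Sydney Bid
```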
Creating New Sources of Training Data
Text: One player losing a close friend is Japanese pitcher Hideki Irabu, who was befriended by Wells during spring training last year.
Hypothesis: Irabu said he would take Wells out to dinner when the Yankees visit Toronto.
• Negative Examples:
– We gathered 119,113 negative examples of textual entailment by:
• Selecting sequential sentences from newswire texts that featured a repeat mention of a named entity (98,062 examples)
• Extracting pairs of sentences linked by discourse connectives such as even though, although, otherwise, and in contrast (21,051 examples)
Text: According to the professor, present methods of cleaning up oil slicks are extremely costly and are never completely efficient.
Hypothesis: [In contrast], Clean Mag has a 1000 percent pollution retrieval rate, is low cost, and can be recycled.
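The discourse-connective heuristic can be sketched as follows; sentence splitting is assumed to have been done upstream, and the connective list is the one given above.

```python
# Sketch of the discourse-connective heuristic for negative examples:
# adjacent sentence pairs where the second opens with a contrastive
# connective are unlikely to stand in an entailment relationship.
CONNECTIVES = ("even though", "although", "otherwise", "in contrast")

def connective_pairs(sentences):
    """Yield (text, hypothesis) pairs of adjacent sentences where the
    second one opens with a contrastive connective."""
    for prev, cur in zip(sentences, sentences[1:]):
        if any(cur.lower().startswith(c) for c in CONNECTIVES):
            yield prev, cur

sents = ["Present methods of cleaning up oil slicks are extremely costly.",
         "In contrast, Clean Mag has a 1000 percent pollution retrieval rate."]
for text, hypothesis in connective_pairs(sents):
    print(text, "-/->", hypothesis)
```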
Training the Alignment Classifier
• For performance reasons, the hillclimber trained on the 10K human-annotated pairs was used to annotate a selection of 450K chunk pairs selected equally from these two corpora.
• These annotations were then used to train a final MaxEnt classifier that was used in our final submission.
• A comparison of the three alignment classifiers is presented below for the same evaluation set of 1000 examples:

Classifier        Training Set   Precision  Recall  F1
Hillclimber       10K pairs      0.837      0.774   0.804
Maximum Entropy   10K pairs      0.881      0.851   0.866
Maximum Entropy   450K pairs     0.902      0.944   0.922
Paraphrase Acquisition
• Groundhog uses techniques derived from automatic paraphrase acquisition (Dolan et al. 2004, Barzilay and Lee 2003, Shinyama et al. 2002) in order to identify phrase-level alternations for each t-h pair.
• Output from an alignment classifier can be used to determine a "target region" of high correspondence between a text and a hypothesis:
Text: The Bills now appear ready to hand the reins over to one of their two-top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Hypothesis: The Bills plan to give the starting job to J.P. Losman.
• If paraphrases can be found for the "target regions" of both the text and the hypothesis, we may have strong evidence that the two sentences exist in an entailment relationship.
Target region (hypothesis): ... plan to give the starting job to ...
Paraphrase Acquisition
• For example, if a passage (or set of passages) can be found that are paraphrases of both a text and a hypothesis, those paraphrases can be said to encode the meaning that is common between the t and the h.
• Passages sharing the aligned entities (The Bills ... J.P. Losman):
... appear ready to hand the reins over to ...
... may go with quarterback ...
... could decide to put their trust in ...
... might turn the keys of the offense over to ...
• However, not all sentences containing both aligned entities will be true paraphrases:
... benched Bledsoe in favor of ...
... is molding their QB of the future ...
... are thinking about cutting ...
Paraphrase Acquisition
• Like Barzilay and Lee (2003), our approach focuses on creating clusters of potential paraphrases acquired automatically from the WWW.
– Step 1. The two entities with the highest alignment confidence were selected from each t-h pair.
– Step 2. Text passages containing both aligned entities (and a context window of m words) were extracted from each original t and h.
– Step 3. The top 500 documents containing each pair of aligned entities are retrieved from Google; only the sentences that contain both entities are kept.
– Step 4. Text passages containing the aligned entities are extracted from the sentences collected from the WWW.
– Step 5. WWW passages and original t-h passages are then clustered using the complete-link clustering algorithm outlined in Barzilay and Lee (2003); clusters with fewer than 10 passages are discarded, even if they include the original t-h passage.
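Step 5 can be sketched as a compact complete-link agglomerative clusterer. Word-overlap distance and the 0.8 threshold are stand-ins for the similarity measure and parameters of Barzilay and Lee (2003).

```python
# Compact sketch of complete-link agglomerative clustering over passages.
def distance(p, q):
    a, b = set(p.lower().split()), set(q.lower().split())
    return 1 - len(a & b) / len(a | b)  # Jaccard distance as a stand-in

def complete_link(passages, threshold=0.8):
    clusters = [[p] for p in passages]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete link: a candidate merge is scored by its WORST pair.
                d = max(distance(p, q) for p in clusters[i] for q in clusters[j])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)

clusters = complete_link(["plan to give the starting job to",
                          "may go with quarterback",
                          "plan to give the reins to"])
print(len(clusters))  # 2
```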
Entailment Classification
• As with other approximation-based approaches to RTE (Haghighi et al. 2005, MacCartney et al. 2006), we use a supervised machine learning classifier in order to determine whether an entailment relationship exists for a particular t-h pair.
– We experimented with a number of machine learning techniques:
• Support Vector Machines (SVMs)
• Maximum Entropy
• Decision Trees
– February 2006: Decision Trees outperformed MaxEnt and SVMs
– April 2006: MaxEnt comparable to Decision Trees; SVMs still lag behind
Entailment Classification
• Information from the previous three components is used to extract 4 types of features to inform the entailment classifier.
• Selected examples of features used:
– Alignment Features:
• Longest Common Substring: Longest contiguous string common to both t and h
• Unaligned Chunk: Number of chunks in h not aligned with chunks in t
– Dependency Features:
• Entity Role Match: Aligned entities assigned the same role
• Entity Near Role Match: Collapsed semantic roles commonly confused by the semantic parser (e.g. Arg1, Arg2 >> Arg1&2; ArgM, etc.)
• Predicate Role Match: Roles assigned by aligned predicates
• Predicate Role Near Match: Compared collapsed set of roles assigned by aligned predicates
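The Longest Common Substring feature, computed here over tokens rather than characters (an assumption; the slide does not specify the granularity), can be sketched with standard dynamic programming:

```python
# Longest contiguous token sequence common to both t and h,
# via the classic O(|t|*|h|) dynamic program.
def longest_common_substring(t, h):
    a, b = t.split(), h.split()
    best, table = [], [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > len(best):
                    best = a[i - table[i][j]:i]
    return " ".join(best)

print(longest_common_substring(
    "The Bills now appear ready to hand the reins over to J.P. Losman",
    "The Bills plan to give the starting job to J.P. Losman"))  # to J.P. Losman
```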
Entailment Classification
• Classifier Features (Continued)
– Paraphrase Features:
• Single Paraphrase Match: Paraphrase from a surviving cluster matches either the text or the hypothesis
– Did we select the correct entities at alignment?
– Are we dealing with something that can be expressed in multiple ways?
• Both Unique Paraphrase Match: Paraphrase P1 matches t, while paraphrase P2 matches h; P1 ≠ P2
• Category Match: Paraphrase P1 matches t, while paraphrase P2 matches h; P1 and P2 are found in the same surviving cluster of paraphrases
– Semantic Features:
• Truth-Value Mismatch: Aligned predicates differ in any truth value (true, false, unresolved)
• Polarity Mismatch: Aligned predicates assigned truth values of opposite polarity
Entailment Classification
• Alignment Features: What elements align in the t and h?
– The Bills ↔ The Bills (Good Alignment, 0.94)
– J.P. Losman ↔ J.P. Losman (Good Alignment, 0.91)
– hand ↔ give (Passable Alignment, 0.79)
– the reins ↔ the starting job (Marginal Alignment, 0.49)
• Dependency Features: Are the same dependencies assigned to corresponding entities in the t and h?
– The Bills: Arg0 of hand / Arg0 of give; the reins / the starting job: Arg1 of each; J.P. Losman: no role for hand / Arg2 of give
• Paraphrase Features: Were any paraphrases found that could be paraphrases of portions of the t and the h?
– ... have gone with quarterback ...
– ... has turned the keys of the offense over to ...
• Semantic Features: Were predicates assigned the same truth values? (hand: unresolved; give: unresolved)
Verdict: Likely Entailment!
Another Example
• Not all examples, however, include as many complementary features as Example 139:
Example 734: Task=IR, Judgment=NO, LCC=NO, Conf = -0.8344
Text: In spite of that, the government's "economic development first" priority did not initially recognize the need for preventative measures to halt pollution, which may have slowed economic growth.
Hypothesis: The government took measures to reduce pollution.
• Even though this pair has a number of points of alignment, annotations suggest that there are significant discrepancies between the sentences.
[Alignment diagram: text chunks "the government's priority / not recognize / the need for preventative measures / halt / pollution" aligned against hypothesis chunks "the government / took / measures / reduce / pollution", with scores: Partial Alignment (non-head), Arg Role Match, NE Category Mismatch: Passable, 0.39; Partial Alignment (non-head), Arg Role Match: Passable, 0.41; Degree POS Match: Good, 0.84; Lemma Match, Arg Role Match: Good, 0.93; POS Alignment, Polarity Mismatch, Non-Synonymous: Poor, 0.23]
• In addition, few "paraphrases" could be found that clustered with passages extracted from either the t or the h.
– Target regions: ... not recognize need for measures to halt ... / ... took measures to reduce ...
– WWW passages: ... has allowed companies to get away with ... / ... is looking for ways to deal with ... / ... wants to forget about ...
Verdict: Unlikely Entailment!
Evaluation: 2006 RTE Performance
• Groundhog correctly recognized entailment in 75.38% of examples in this year's RTE-2 Test Set:

Task      Accuracy  Average Precision
QA-Test   69.5%     0.8237
IE-Test   73.0%     0.8351
IR-Test   74.5%     0.7774
SUM-Test  84.5%     0.8343
Total     75.38%    0.8082

• Performance differed markedly across the 4 subtasks: while the system correctly classified 84.5% of the examples in the summarization set, Groundhog correctly categorized only 69.5% of the examples in the question-answering set.
• This is partly an artifact of our training data:
– The headline corpus features a large number of "sentence compression"-like examples; when Groundhog is trained on a balanced training corpus, performance on the SUM task falls to 79.3%.
Evaluation: Role of Training Data
• Training data did play an important role in boosting our overall accuracy on the 2006 Test Set: performance increased from 65.25% to 75.38% when the entire training corpus was used.
• Refactoring features has allowed us to obtain some performance gains with smaller training sets, however: our performance when only using the 800 examples from the 2006 Dev Set has increased by 5.25%.
Training Set  # of Examples  Feb 2006 Accuracy  Change    April 2006 Accuracy  Change
2006 Dev      800            65.25%             n/a       70.50%               n/a
"25% LCC"     50,600         67.00%             +1.75%    73.75%               +3.25%
"50% LCC"     101,300        72.25%             +7.00%    74.625%              +4.125%
"75% LCC"     151,000        74.38%             +9.13%    76.00%               +5.5%
"100% LCC"    202,600        75.38%             +10.13%   76.25%               +5.75%

• The performance increase appears to be tapering off as the amount of training data increases...
Evaluation: Role of Features in Entailment Classifier
• While the best results were obtained by combining all 4 sets of features used in our entailment classifier, the largest gains were observed by adding Paraphrase features:
[Bar chart: accuracy for incremental feature combinations (+Alignment, +Dependency, +Paraphrase, +Semantic, and +Alignment+Dependency+Paraphrase) for the February and April 2006 systems; reported values: 58.00%, 62.50%, 65.25%, 65.88%, 66.25%, 68.00%, 69.13%, 71.25%, 73.62%, 75.38%]
Conclusions
• We have introduced a three-tiered approach for RTE:
– Alignment Classifier: identifies "aligned" constituents using a wide range of lexicosemantic features
– Paraphrase Acquisition: derives phrase-level alternations for passages containing high-confidence aligned entities
– Entailment Classifier: combines lexical, semantic, and syntactic information with phrase-level alternation information in order to make an entailment decision
• In addition, we showed that it is possible, by relaxing the notion of strict entailment, to create training corpora that can prove effective for training RTE systems
– 200K+ examples (100K positive, 100K negative)