Andrew Hickl, Jeremy Bensley, John Williams, Kirk Roberts, Bryan Rink and Ying Shi
Recognizing Textual Entailment with LCC’s Groundhog System
Introduction
• We were grateful for the opportunity to participate in this year's PASCAL RTE-2 Challenge
– Our first exposure to RTE came as part of the Fall 2005 AQUAINT "Knowledge Base" Evaluation
• Included PASCAL veterans: University of Colorado at Boulder, University of Illinois at Urbana-Champaign, Stanford University, University of Texas at Dallas, and LCC (Moldovan)
• While this year's evaluation represented our first foray into RTE, our group has worked extensively on the types of textual inference that are crucial for:
– Question Answering
– Information Extraction
– Multi-Document Summarization
– Named Entity Recognition
– Temporal and Spatial Normalization
– Semantic Parsing
Outline of Today’s Talk
• Introduction
• Groundhog Overview
– Preprocessing
– New Sources of Training Data
– Performing Lexical Alignment
– Paraphrase Acquisition
– Feature Extraction
– Entailment Classification
• Evaluation
• Conclusions
2 Feb: Giorno della Marmotta (Groundhog Day), the RTE-2 deadline
Architecture of the Groundhog System
[Architecture diagram: RTE Dev/Test pairs, training corpora (positive and negative examples), and the WWW feed into Preprocessing (named entity recognition, name aliasing, name coreference, syntactic parsing, semantic parsing, semantic annotation, temporal normalization, temporal ordering), followed by Lexical Alignment, Paraphrase Acquisition, and Feature Extraction, ending in Entailment Classification with a YES/NO decision.]
A Motivating Example
• Questions we need to answer:
– What are the "important" portions that should be considered by a system? Can lexical alignment be used to identify these strings?
– How do we determine that the same meaning is being conveyed by phrases that may not necessarily be lexically related? Can phrase-level alternations ("paraphrases") help?
– How do we deal with the complexity that reduces the effectiveness of syntactic and semantic parsers? Annotations? Rules? Compression?
Example 139: Task=SUM, Judgment=YES, LCC=YES, Conf = +0.8875
Text: The Bills now appear ready to hand the reins over to one of their two-top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Hypothesis: The Bills plan to give the starting job to J.P. Losman.
[The Bills]Arg0 now appear ready to hand [the reins]Arg1 over to [one]Arg2 of their two-top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Preprocessing
• Groundhog starts the process of RTE by annotating t-h pairs with a wide range of lexicosemantic information.
• Named Entity Recognition
– LCC's CiceroLite NER software is used to categorize more than 150 different types of named entities:
[The Bills]SPORTS_ORG plan to give the starting job to [J.P. Losman]PERSON.
[The Bills]SPORTS_ORG [now]TIMEX appear ready to hand the reins over to one of their two-top picks from [a year ago]TIMEX in quarterback [J.P. Losman]PERSON, who missed most of [last season]TIMEX with [a broken leg]BODY_PART.
• Name Aliasing and Coreference
– Lexica and grammars found in CiceroLite are used to identify coreferential names and potential antecedents for pronouns:
[The Bills]ID=01 plan to give the starting job to [J.P. Losman]ID=02.
[The Bills]ID=01 now appear ready to hand the reins over to [one of [their]ID=01 two-top picks]ID=02 from a year ago in [quarterback]ID=02 [J.P. Losman]ID=02, [who]ID=02 missed most of last season with a broken leg.
Preprocessing
• Temporal Normalization and Ordering
– Heuristics found in LCC's TASER temporal normalization system are then used to normalize time expressions to their ISO 8601 values and to compute the relative order of time expressions within a context
• POS Tagging and Syntactic Parsing
– We use LCC's own implementation of the Brill POS tagger and the Collins Parser in order to syntactically parse sentences and to identify phrase chunks, phrase heads, relative clauses, appositives, and parentheticals.
The Bills plan to give the starting job to J.P. Losman.
The Bills [now]2006/01/01 appear ready to hand the reins over to one of their two-top picks from [a year ago]2005/01/01-2005/12/31 in quarterback J.P. Losman, who missed most of [last season]2005/01/01-2005/12/31 with a broken leg.
Preprocessing
• Semantic Parsing
– Semantic parsing is performed using a Maximum Entropy-based semantic role labeling system trained on PropBank annotations
[Predicate-argument structure for the text: appear ready(Arg0: The Bills, ArgM: now); hand(Arg0: The Bills, Arg1: the reins, Arg2: one of their two-top picks); missed(Arg0: who, Arg1: most of last season, Arg3: a broken leg)]
The Bills now appear ready to hand the reins over to one of their two-top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Preprocessing
• Semantic Parsing (continued)
[Predicate-argument structure for the hypothesis: plan(Arg0: The Bills); give(Arg0: The Bills, Arg1: the starting job, Arg2: J.P. Losman)]
The Bills plan to give the starting job to J.P. Losman.
Preprocessing
• Semantic Annotation
• Heuristics were used to annotate the following semantic information:
– Polarity: Predicates and nominals were assigned a negative polarity value when found in the scope of an overt negative marker (no, not, never) or when associated with a negation-denoting verb (refuse).
Both owners and players admit there is [unlikely]TRUE to be much negotiating.
Never before had ski racing [seen]FALSE the likes of Alberto Tomba.
Members of Iraq's Governing Council refused to [sign]FALSE an interim constitution.
– Factive Verbs: Predicates such as acknowledge, admit, and regret conventionally imply the truth of their complements; complements associated with a list of factive verbs were always assigned a positive polarity value.
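The polarity heuristic can be illustrated with a minimal sketch. This assumes a flat token list as input and treats anything earlier in the sentence as "in scope"; the system itself works over parses, so the scope test here is deliberately crude, and the word lists are illustrative.

```python
# A minimal sketch of the polarity heuristic over a flat token list.
# Scope detection is deliberately crude: any negative marker or
# negation-denoting verb preceding the predicate flips its polarity.
NEGATIVE_MARKERS = {"no", "not", "never", "n't"}
NEGATION_VERBS = {"refuse", "refused", "refuses", "refusing"}  # illustrative list

def predicate_polarity(tokens, pred_index):
    """Return FALSE if a negative marker or negation-denoting verb
    precedes the predicate, TRUE otherwise."""
    for tok in tokens[:pred_index]:
        if tok.lower() in NEGATIVE_MARKERS or tok.lower() in NEGATION_VERBS:
            return "FALSE"
    return "TRUE"

toks = "Members of Iraq's Governing Council refused to sign an interim constitution".split()
print(predicate_polarity(toks, toks.index("sign")))  # FALSE
```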
Preprocessing
• Semantic Annotation (Continued)
– Non-Factive Verbs: We refer to predicates that do not imply the truth of their complements as non-factive verbs.
– Predicates found as complements of the following contexts were marked as unresolved:
• non-factive speech act verbs (deny, claim)
• psych verbs (think, believe)
• verbs of uncertainty or likelihood (be uncertain, be likely)
• verbs marking intentions or plans (scheme, plot, want)
• verbs in conditional contexts (whether, if)
Congress approved a different version of the COCOPA Law, which did not include the autonomy clauses, claiming they [were in contradiction]UNRESOLVED with constitutional rights.
Defence Minister Robert Hill says a decision would need to be made by February next year, if Australian troops [extend]UNRESOLVED their stay in southern Iraq.
Preprocessing
• Semantic Annotation (Continued)
– Supplemental Expressions: Following (Huddleston and Pullum 2002), constructions that are known to trigger conventional implicatures – including nominal appositives, epithets/name aliases, as-clauses, and non-restrictive relative clauses – were also extracted from text and appended to the end of each text or hypothesis.
Nominal Appositives:
Shia pilgrims converge on Karbala to mark the death of Hussein, the prophet Muhammad's grandson, 1300 years ago.
Shia pilgrims converge on Karbala to mark the death of Hussein 1300 years ago AND Hussein is the prophet Muhammad's grandson.
Epithets / Name Aliases:
Ali al-Timimi had previously avoided prosecution, but now the radical Islamic cleric is behind bars in an American prison.
Ali al-Timimi had previously avoided prosecution but now the radical Islamic cleric is behind bars... AND Ali al-Timimi is a radical Islamic cleric.
Preprocessing
• Supplemental Expressions (continued):
As-Clauses:
The LMI was set up by Mr. Torvalds with John Hall as a non-profit organization to license the use of the word Linux.
The LMI was set up by Mr. Torvalds with John Hall as a non-profit organization to license the use of the word Linux AND the LMI is a non-profit organization.
Non-Restrictive Relative Clauses:
The Bills now appear ready to ... quarterback J.P. Losman, who missed most of last season with a broken leg.
The Bills now appear ready to ... quarterback J.P. Losman, who missed most of last season with a broken leg AND J.P. Losman missed most of last season...
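A rough sketch of the nominal-appositive case follows. The regular expression is an illustrative stand-in for the parse-based extraction the system actually uses, and only handles the simple "X, the Y, ..." pattern.

```python
import re

# Toy appositive extraction: turn 'X, the Y, ...' into the original
# sentence plus 'AND X is the Y'. The regex is a stand-in for the
# parse-based extraction the system actually performs.
def append_appositive(sentence):
    m = re.search(r"(\b[A-Z][\w.]*(?: [A-Z][\w.]*)*), (the [^,]+),", sentence)
    if not m:
        return sentence
    head, appositive = m.group(1), m.group(2)
    return sentence + " AND " + head + " is " + appositive + "."

s = ("Shia pilgrims converge on Karbala to mark the death of Hussein, "
     "the prophet Muhammad's grandson, 1300 years ago.")
print(append_appositive(s))
```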
Lexical Alignment
• We believe that these lexicosemantic annotations – along with the individual forms of the words – can provide us with the input needed to identify corresponding tokens, chunks, or collocations from the text and the hypothesis.
[Alignment diagram (text ↔ hypothesis):
– The Bills ↔ The Bills (Arg0, ID=01, Organization on both sides): alignment probability 0.94
– J.P. Losman ↔ J.P. Losman (ID=02, Person / Arg2, ID=02, Person): alignment probability 0.91
– hand ↔ give (Unresolved, WN Similar on both sides): alignment probability 0.74
– the reins ↔ the starting job (Arg1 on both sides): alignment probability 0.49]
Lexical Alignment
• In Groundhog, we used a Maximum Entropy classifier to compute the probability that an element selected from a text corresponds to – or can be aligned with – an element selected from a hypothesis.
• Three-step process:
– First, sentences were decomposed into a set of "alignable chunks" derived from the output of a chunk parser and a collocation detection system.
– Next, chunks from the text (Ct) and hypothesis (Ch) were assembled into an alignment matrix (Ct × Ch).
– Finally, each pair of chunks was submitted to a classifier that output the probability that the pair represented a positive example of alignment.
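The three steps can be sketched as follows; the `overlap` function is a crude token-overlap stand-in for the trained Maximum Entropy classifier, and the chunk lists are assumed to come from a chunk parser.

```python
from itertools import product

# Sketch of the three-step alignment process: build the Ct x Ch matrix
# and score every (text chunk, hypothesis chunk) pair.
def align(text_chunks, hyp_chunks, classifier):
    return {(ct, ch): classifier(ct, ch)
            for ct, ch in product(text_chunks, hyp_chunks)}

# Stand-in classifier: Jaccard token overlap as a crude alignment probability.
def overlap(ct, ch):
    a, b = set(ct.lower().split()), set(ch.lower().split())
    return len(a & b) / len(a | b)

matrix = align(["The Bills", "hand", "the reins"],
               ["The Bills", "give", "the starting job"], overlap)
best = max(matrix, key=matrix.get)
print(best, matrix[best])  # ('The Bills', 'The Bills') 1.0
```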
Lexical Alignment
• Four sets of features were used:
– Statistical Features:
• Cosine Similarity
• (Glickman and Dagan 2005)'s Lexical Entailment Probability
– Lexicosemantic Features:
• WordNet Similarity (Pedersen et al. 2004)
• WordNet Synonymy/Antonymy
• Named Entity Features
• Alternations
– String-based Features:
• Levenshtein Edit Distance
• Morphological Stem Equality
– Syntactic Features:
• Maximal Category
• Headedness
• Structure of entity NPs (modifiers, PP attachment, NP-NP compounds)
Training the Alignment Classifier
• Two developers annotated a held-out set of 10,000 alignment chunk pairs from the RTE-2 Development Set as either positive or negative examples of alignment.
• Performance for two different classifiers on a randomly selected set of 1000 examples from the RTE-2 Dev Set is presented below:

Classifier        Training Set   Precision  Recall  F1
Hillclimber       10K pairs      0.837      0.774   0.804
Maximum Entropy   10K pairs      0.881      0.851   0.866

• While both classifiers performed relatively satisfactorily, F-measure varied significantly (p < 0.05) on different test sets.
Creating New Sources of Training Data
• In order to perform more robust alignment, we experimented with two techniques for gathering training data.
• Positive Examples:
– Following (Burger and Ferro 2005), we created a corpus of 101,329 positive examples of entailment by pairing the headline and first sentence from newswire documents.
First Line: Sydney newspapers made a secret bid not to report on the fawning and spending made during the city's successful bid for the 2000 Olympics, former Olympics Minister Bruce Baird said today.
Headline: Papers Said To Protect Sydney Bid
– Examples were filtered extensively in order to select only those where the headline and the first line both synopsized the content of the document.
– In an evaluation set of 2500 examples, annotators found 91.8% to be positive examples of "rough" entailment.
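The pairing step can be sketched as a toy generator. The naive sentence splitter and the in-memory (headline, body) document list are illustrative simplifications; the real corpus was also filtered as described above.

```python
# Toy sketch of the headline-pairing heuristic: the first sentence of a
# news story is treated as a text that roughly entails its headline.
def headline_pairs(documents):
    """Yield (text, hypothesis) pairs from (headline, body) tuples."""
    for headline, body in documents:
        first_sentence = body.split(". ")[0] + "."  # naive sentence split
        yield first_sentence, headline

docs = [("Papers Said To Protect Sydney Bid",
         "Sydney newspapers made a secret bid not to report on the fawning "
         "and spending made during the city's successful bid for the 2000 "
         "Olympics, former Olympics Minister Bruce Baird said today. "
         "A second sentence follows.")]
for text, hypothesis in headline_pairs(docs):
    print(hypothesis)  # Papers Said To Protect Sydney Bid
```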
Creating New Sources of Training Data
Text: One player losing a close friend is Japanese pitcher Hideki Irabu, who was befriended by Wells during spring training last year.
Hypothesis: Irabu said he would take Wells out to dinner when the Yankees visit Toronto.
• Negative Examples:
– We gathered 119,113 negative examples of textual entailment by:
• Selecting sequential sentences from newswire texts that featured a repeat mention of a named entity (98,062 examples)
• Extracting pairs of sentences linked by discourse connectives such as even though, although, otherwise, and in contrast (21,051 examples)
Text: According to the professor, present methods of cleaning up oil slicks are extremely costly and are never completely efficient.
Hypothesis: [In contrast], Clean Mag has a 1000 percent pollution retrieval rate, is low cost, and can be recycled.
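The discourse-connective heuristic can be sketched as follows; sentence splitting is assumed to have been done upstream, and the connective list is the one given above.

```python
# Sketch of the discourse-connective heuristic for negative examples:
# adjacent sentence pairs where the second opens with a contrastive
# connective are unlikely to stand in an entailment relationship.
CONNECTIVES = ("even though", "although", "otherwise", "in contrast")

def connective_pairs(sentences):
    """Yield (text, hypothesis) pairs of adjacent sentences where the
    second one opens with a contrastive connective."""
    for prev, cur in zip(sentences, sentences[1:]):
        if any(cur.lower().startswith(c) for c in CONNECTIVES):
            yield prev, cur

sents = ["Present methods of cleaning up oil slicks are extremely costly.",
         "In contrast, Clean Mag has a 1000 percent pollution retrieval rate."]
for text, hypothesis in connective_pairs(sents):
    print(text, "-/->", hypothesis)
```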
Training the Alignment Classifier
• For performance reasons, the hillclimber trained on the 10K human-annotated pairs was used to annotate a selection of 450K chunk pairs selected equally from these two corpora.
• These annotations were then used to train a final MaxEnt classifier that was used in our final submission.
• A comparison of the three alignment classifiers is presented below for the same evaluation set of 1000 examples:

Classifier        Training Set   Precision  Recall  F1
Hillclimber       10K pairs      0.837      0.774   0.804
Maximum Entropy   10K pairs      0.881      0.851   0.866
Maximum Entropy   450K pairs     0.902      0.944   0.922
Paraphrase Acquisition
• Groundhog uses techniques derived from automatic paraphrase acquisition (Dolan et al. 2004, Barzilay and Lee 2003, Shinyama et al. 2002) in order to identify phrase-level alternations for each t-h pair.
• Output from an alignment classifier can be used to determine a "target region" of high correspondence between a text and a hypothesis:
Text: The Bills now appear ready to hand the reins over to one of their two-top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Hypothesis: The Bills plan to give the starting job to J.P. Losman.
• If paraphrases can be found for the "target regions" of both the text and the hypothesis, we may have strong evidence that the two sentences exist in an entailment relationship.
Target region (hypothesis): ... plan to give the starting job to ...
Paraphrase Acquisition
• For example, if a passage (or set of passages) can be found that are paraphrases of both a text and a hypothesis, those paraphrases can be said to encode the meaning that is common between the t and the h.
• Passages sharing the aligned entities (The Bills ... J.P. Losman):
... appear ready to hand the reins over to ...
... may go with quarterback ...
... could decide to put their trust in ...
... might turn the keys of the offense over to ...
• However, not all sentences containing both aligned entities will be true paraphrases:
... benched Bledsoe in favor of ...
... is molding their QB of the future ...
... are thinking about cutting ...
Paraphrase Acquisition
• Like Barzilay and Lee (2003), our approach focuses on creating clusters of potential paraphrases acquired automatically from the WWW.
– Step 1. The two entities with the highest alignment confidence were selected from each t-h pair.
– Step 2. Text passages containing both aligned entities (and a context window of m words) were extracted from each original t and h.
– Step 3. The top 500 documents containing each pair of aligned entities are retrieved from Google; only the sentences that contain both entities are kept.
– Step 4. Text passages containing the aligned entities are extracted from the sentences collected from the WWW.
– Step 5. WWW passages and original t-h passages are then clustered using the complete-link clustering algorithm outlined in Barzilay and Lee (2003); clusters with fewer than 10 passages are discarded, even if they include the original t-h passage.
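Step 5 can be sketched as a compact complete-link agglomerative clusterer. Word-overlap distance and the 0.8 threshold are stand-ins for the similarity measure and parameters of Barzilay and Lee (2003).

```python
# Compact sketch of complete-link agglomerative clustering over passages.
def distance(p, q):
    a, b = set(p.lower().split()), set(q.lower().split())
    return 1 - len(a & b) / len(a | b)  # Jaccard distance as a stand-in

def complete_link(passages, threshold=0.8):
    clusters = [[p] for p in passages]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete link: a candidate merge is scored by its WORST pair.
                d = max(distance(p, q) for p in clusters[i] for q in clusters[j])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)

clusters = complete_link(["plan to give the starting job to",
                          "may go with quarterback",
                          "plan to give the reins to"])
print(len(clusters))  # 2
```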
Entailment Classification
• As with other approximation-based approaches to RTE (Haghighi et al. 2005, MacCartney et al. 2006), we use a supervised machine learning classifier in order to determine whether an entailment relationship exists for a particular t-h pair.
– We experimented with a number of machine learning techniques:
• Support Vector Machines (SVMs)
• Maximum Entropy
• Decision Trees
– February 2006: Decision Trees outperformed MaxEnt and SVMs
– April 2006: MaxEnt comparable to Decision Trees; SVMs still lag behind
Entailment Classification
• Information from the previous three components is used to extract 4 types of features to inform the entailment classifier.
• Selected examples of features used:
– Alignment Features:
• Longest Common Substring: Longest contiguous string common to both t and h
• Unaligned Chunk: Number of chunks in h not aligned with chunks in t
– Dependency Features:
• Entity Role Match: Aligned entities assigned the same role
• Entity Near Role Match: Collapsed semantic roles commonly confused by the semantic parser (e.g. Arg1, Arg2 >> Arg1&2; ArgM, etc.)
• Predicate Role Match: Roles assigned by aligned predicates
• Predicate Role Near Match: Compared collapsed set of roles assigned by aligned predicates
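The Longest Common Substring feature, computed here over tokens rather than characters (an assumption; the slide does not specify the granularity), can be sketched with standard dynamic programming:

```python
# Longest contiguous token sequence common to both t and h,
# via the classic O(|t|*|h|) dynamic program.
def longest_common_substring(t, h):
    a, b = t.split(), h.split()
    best, table = [], [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > len(best):
                    best = a[i - table[i][j]:i]
    return " ".join(best)

print(longest_common_substring(
    "The Bills now appear ready to hand the reins over to J.P. Losman",
    "The Bills plan to give the starting job to J.P. Losman"))  # to J.P. Losman
```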
Entailment Classification
• Classifier Features (Continued)
– Paraphrase Features:
• Single Paraphrase Match: Paraphrase from a surviving cluster matches either the text or the hypothesis
– Did we select the correct entities at alignment?
– Are we dealing with something that can be expressed in multiple ways?
• Both Unique Paraphrase Match: Paraphrase P1 matches t, while paraphrase P2 matches h; P1 ≠ P2
• Category Match: Paraphrase P1 matches t, while paraphrase P2 matches h; P1 and P2 are found in the same surviving cluster of paraphrases
– Semantic Features:
• Truth-Value Mismatch: Aligned predicates differ in any truth value (true, false, unresolved)
• Polarity Mismatch: Aligned predicates assigned truth values of opposite polarity
Entailment Classification
• Alignment Features: What elements align in the t and h?
– The Bills ↔ The Bills (Good Alignment, 0.94)
– J.P. Losman ↔ J.P. Losman (Good Alignment, 0.91)
– hand ↔ give (Passable Alignment, 0.79)
– the reins ↔ the starting job (Marginal Alignment, 0.49)
• Dependency Features: Are the same dependencies assigned to corresponding entities in the t and h?
– The Bills: Arg0 of hand / Arg0 of give; the reins / the starting job: Arg1 of each; J.P. Losman: no role for hand / Arg2 of give
• Paraphrase Features: Were any paraphrases found that could be paraphrases of portions of the t and the h?
– ... have gone with quarterback ...
– ... has turned the keys of the offense over to ...
• Semantic Features: Were predicates assigned the same truth values? (hand: unresolved; give: unresolved)
Verdict: Likely Entailment!
Another Example
• Not all examples, however, include as many complementary features as Example 139:
Example 734: Task=IR, Judgment=NO, LCC=NO, Conf = -0.8344
Text: In spite of that, the government's "economic development first" priority did not initially recognize the need for preventative measures to halt pollution, which may have slowed economic growth.
Hypothesis: The government took measures to reduce pollution.
• Even though this pair has a number of points of alignment, annotations suggest that there are significant discrepancies between the sentences.
[Alignment diagram: text chunks "the government's priority / not recognize / the need for preventative measures / halt / pollution" aligned against hypothesis chunks "the government / took / measures / reduce / pollution", with scores: Partial Alignment (non-head), Arg Role Match, NE Category Mismatch: Passable, 0.39; Partial Alignment (non-head), Arg Role Match: Passable, 0.41; Degree POS Match: Good, 0.84; Lemma Match, Arg Role Match: Good, 0.93; POS Alignment, Polarity Mismatch, Non-Synonymous: Poor, 0.23]
• In addition, few "paraphrases" could be found that clustered with passages extracted from either the t or the h.
– Target regions: ... not recognize need for measures to halt ... / ... took measures to reduce ...
– WWW passages: ... has allowed companies to get away with ... / ... is looking for ways to deal with ... / ... wants to forget about ...
Verdict: Unlikely Entailment!
Evaluation: 2006 RTE Performance
• Groundhog correctly recognized entailment in 75.38% of examples in this year's RTE-2 Test Set:

Task      Accuracy  Average Precision
QA-Test   69.5%     0.8237
IE-Test   73.0%     0.8351
IR-Test   74.5%     0.7774
SUM-Test  84.5%     0.8343
Total     75.38%    0.8082

• Performance differed markedly across the 4 subtasks: while the system correctly classified 84.5% of the examples in the summarization set, Groundhog correctly categorized only 69.5% of the examples in the question-answering set.
• This is partly an artifact of our training data:
– The headline corpus features a large number of "sentence compression"-like examples; when Groundhog is trained on a balanced training corpus, performance on the SUM task falls to 79.3%.
Evaluation: Role of Training Data
• Training data did play an important role in boosting our overall accuracy on the 2006 Test Set: performance increased from 65.25% to 75.38% when the entire training corpus was used.
• Refactoring features has allowed us to obtain some performance gains with smaller training sets, however: our performance when only using the 800 examples from the 2006 Dev Set has increased by 5.25%.
Training Set  # of Examples  Feb 2006 Accuracy  Change    April 2006 Accuracy  Change
2006 Dev      800            65.25%             n/a       70.50%               n/a
"25% LCC"     50,600         67.00%             +1.75%    73.75%               +3.25%
"50% LCC"     101,300        72.25%             +7.00%    74.625%              +4.125%
"75% LCC"     151,000        74.38%             +9.13%    76.00%               +5.5%
"100% LCC"    202,600        75.38%             +10.13%   76.25%               +5.75%

• The performance increase appears to be tapering off as the amount of training data increases...
Evaluation: Role of Features in Entailment Classifier
• While the best results were obtained by combining all 4 sets of features used in our entailment classifier, the largest gains were observed by adding Paraphrase features:
[Bar chart: accuracy for incremental feature combinations (+Alignment, +Dependency, +Paraphrase, +Semantic, and +Alignment+Dependency+Paraphrase) for the February and April 2006 systems; reported values: 58.00%, 62.50%, 65.25%, 65.88%, 66.25%, 68.00%, 69.13%, 71.25%, 73.62%, 75.38%]
Conclusions
• We have introduced a three-tiered approach for RTE:
– Alignment Classifier: identifies "aligned" constituents using a wide range of lexicosemantic features
– Paraphrase Acquisition: derives phrase-level alternations for passages containing high-confidence aligned entities
– Entailment Classifier: combines lexical, semantic, and syntactic information with phrase-level alternation information in order to make an entailment decision
• In addition, we showed that it is possible, by relaxing the notion of strict entailment, to create training corpora that can prove effective for training RTE systems
– 200K+ examples (100K positive, 100K negative)