
Page 1

Learning syntactic patterns for automatic hypernym discovery

Rion Snow, Daniel Jurafsky and Andrew Y. Ng

Prepared by Ang Sun

2009-02-17

Page 2

Introduction

Hypernym/hyponym relation: a relation between two nouns X and Y. Y is a hypernym of X if X is a kind (or an instance) of Y.

Previous work: used hand-crafted patterns to automatically label hypernym relations between nouns. E.g., the pattern “NPx and other NPy” implies that NPx is a hyponym of NPy.

Novelty of this paper: uses known hypernym pairs to automatically identify useful lexico-syntactic patterns, and then trains a high-accuracy hypernym classifier by using these patterns as features in a supervised learning algorithm.

Example: X = Shakespeare, Y = Author; Author is a hypernym of Shakespeare.

Page 3

Introduction (cont’)

Overview of the approach

1. Training:
(a) Collect noun pairs from corpora, identifying pairs of nouns in a hypernym/hyponym relation using WordNet.
(b) For each noun pair, collect sentences in which both nouns occur.
(c) Parse the sentences and automatically extract patterns from the parse trees.
(d) Train a (binary) hypernym classifier based on these features.

2. Test:
(a) Given a pair of nouns in the test set, extract features and use the classifier to determine whether the noun pair is in the hypernym/hyponym relation or not.
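A rough sketch of this pipeline in Python (the helpers parse, noun_pairs, shortest_path and wordnet_label are assumed placeholders, not the authors' code):

```python
# Illustrative sketch of the train/test pipeline above; parse(), noun_pairs(),
# shortest_path() and wordnet_label() are assumed helpers, not real APIs.

def build_training_data(corpus, wordnet_label):
    """Steps (a)-(c): collect labeled noun pairs and their dependency-path counts."""
    features = {}                                   # (noun1, noun2) -> {path: count}
    for sentence in corpus:
        tree = parse(sentence)                      # dependency parse (e.g., MINIPAR)
        for n1, n2 in noun_pairs(tree):
            path = shortest_path(tree, n1, n2)      # pattern, assumed to be a hashable tuple
            counts = features.setdefault((n1, n2), {})
            counts[path] = counts.get(path, 0) + 1
    labels = {pair: wordnet_label(pair) for pair in features}   # 1 = hypernym pair
    return features, labels

def predict(classifier, vectorizer, features, pair):
    """Test step: decide whether an unseen noun pair is a hypernym pair."""
    x = vectorizer.transform([features.get(pair, {})])
    return classifier.predict(x)[0]
```

Here `classifier` and `vectorizer` stand in for whatever supervised model and feature encoder are trained in step (d).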

Page 4

Representing lexico-syntactic patterns with dependency paths

What does dependency parsing do?

A dependency parser produces a dependency tree that represents the syntactic relations between words by a list of edge tuples of the form:

(word1, CATEGORY1:RELATION:CATEGORY2, word2)

Example: (Herrick, -N:conj:N, Shakespeare)

Thus, the space of lexico-syntactic patterns is defined as the shortest path between any two nouns in a dependency tree.

Example: dependency path between 'authors' and 'Herrick':

Herrick, N:pcomp-n:-Prep, as, as, Prep:mod:-N, authors
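As an illustration, a shortest dependency path between two nouns can be recovered from the edge-tuple list with an off-the-shelf graph library (this helper and the networkx-based representation are assumptions for the sketch, not MINIPAR's own API):

```python
# Sketch: recover the shortest dependency path between two nouns from edge tuples
# of the form (word1, "CATEGORY1:RELATION:CATEGORY2", word2).
import networkx as nx

def shortest_dependency_path(edges, noun1, noun2):
    graph = nx.Graph()
    for w1, relation, w2 in edges:
        graph.add_edge(w1, w2, rel=relation)
    words = nx.shortest_path(graph, noun1, noun2)    # word sequence along the path
    path = []
    for a, b in zip(words, words[1:]):               # interleave words with relation labels
        path.extend([a, graph[a][b]["rel"]])
    path.append(words[-1])
    return path
```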

Page 5

Representing lexico-syntactic patterns with dependency paths (cont’)

Generalization and extension of the dependency-path representation

Generalization: remove the original nouns.

Example: Herrick, N:pcomp-n:-Prep, as, as, Prep:mod:-N, authors => N:pcomp-n:-Prep, as, as, Prep:mod:-N

Extension:
(a) Capture function words like ‘such’ (in “such NP as NP”) by adding optional ‘satellite links’ to each shortest path, since they are important parts of lexico-syntactic patterns.

Example: N:pcomp-n:-Prep, as, as, Prep:mod:-N => N:pcomp-n:-Prep, as, as, Prep:mod:-N, (such, PreDet:pre:-N)

(b) Capitalize on the distributive nature of the syntactic conjunction relation (nouns linked by ‘and’ or ‘or’, or in comma-separated lists) by distributing dependency links across such conjunctions.

Example: see the red dotted link on the next slide.

Page 6

Representing lexico-syntactic patterns with dependency paths (cont’)

Let’s look at the generation of the dependency representation of the pattern “NPy such as NPx”:

Herrick, N:pcomp-n:-Prep, as, as, Prep:mod:-N, authors
=> (generalization) N:pcomp-n:-Prep, as, as, Prep:mod:-N
=> (extension a) N:pcomp-n:-Prep, as, as, Prep:mod:-N, (such, PreDet:pre:-N)
=> (extension b) N:PCOMP-N:PREP, as, as, PREP:MOD:N, (such, PREDET:PRE:N)
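A small sketch of these transformations on the example path (the helper names are illustrative, not from the paper):

```python
# Sketch of the generalization and extension steps shown above.
def generalize(path):
    """Generalization: drop the two noun endpoints of the dependency path."""
    return path[1:-1]

def add_satellites(path, satellites):
    """Extension (a): attach optional satellite links (function words like 'such')."""
    return list(path) + [tuple(link) for link in satellites]

path = ["Herrick", "N:pcomp-n:-Prep", "as", "as", "Prep:mod:-N", "authors"]
core = generalize(path)                                  # removes 'Herrick' and 'authors'
full = add_satellites(core, [("such", "PreDet:pre:-N")])
# full == ['N:pcomp-n:-Prep', 'as', 'as', 'Prep:mod:-N', ('such', 'PreDet:pre:-N')]
```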

Page 7

Experimental paradigm

Corpus: 6 million newswire sentences

Preprocessing: Parse each sentence using MINIPAR and extract noun pairs.

Dev/test data:
1) WordNet-labeled data: label the hypernym/hyponym relation between two nouns according to WordNet’s hypernym taxonomy (a minimal labeling sketch follows after this list).
Known Hypernym Set: 14,387 pairs
Known Non-Hypernym Set: 737,924 pairs

2) Hand-labeled data (key): 5,387 noun pairs in total, of which 5,122 are “unrelated”, 134 are hypernym pairs, and 131 are “coordinate” (explained later in the paper).

Evaluation: compare the binary classifier’s performance against WordNet’s judgments and against the hand-labeled data (which involves annotator disagreement, shown later).
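As an aside, WordNet-based labeling of a noun pair can be sketched with NLTK’s WordNet interface (the paper’s exact labeling criteria may differ; this is only a minimal illustration):

```python
# Sketch: decide whether `hypernym` is an ancestor of `hyponym` in WordNet's
# hypernym taxonomy, using NLTK. Not the authors' labeling code.
from nltk.corpus import wordnet as wn

def is_hypernym_pair(hyponym, hypernym):
    targets = set(wn.synsets(hypernym, pos=wn.NOUN))
    for sense in wn.synsets(hyponym, pos=wn.NOUN):
        # Follow both class and instance hypernym links up the taxonomy.
        ancestors = set(sense.closure(lambda s: s.hypernyms() + s.instance_hypernyms()))
        if targets & ancestors:
            return True
    return False

print(is_hypernym_pair("dog", "animal"))   # True: 'animal' is an ancestor of 'dog'
print(is_hypernym_pair("animal", "dog"))   # False
```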

Page 8

Features: pattern discovery

Focus on discovering which dependency paths might prove useful as features for the binary hypernym classifier (details later).

Rediscovered hand-designed patterns (Hearst’s patterns, marked in red): these sit at the high-performance boundary of precision and recall for individual features.

Discovered new patterns (marked in blue): these also score highly.

Page 9

A hypernym-only classifier

The classifier:
1) Create a feature lexicon of 69,592 dependency paths, consisting of every dependency path that occurred between at least five unique noun pairs in the corpus.

2) Record in a noun-pair lexicon each noun pair that occurs with at least five unique paths from the feature lexicon.

3) Create a feature count vector for each such noun pair.

4) Each entry of the 69,592-dimensional vector represents a particular dependency path and contains the total number of times that path was the shortest path connecting the noun pair in some dependency tree in the corpus.

Thus the task becomes binary classification of a noun pair as a hypernym pair based on its feature vector of dependency paths.

5) Train a number of classifiers: perform 10-fold cross-validation on the WordNet-labeled data and evaluate each model by its maximum F-score averaged across all folds.
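A compact sketch of steps 3)-5) with scikit-learn (the data structures `pair_path_counts` and `pair_labels` are assumptions, and the paper’s exact training setup and evaluation by maximum F-score are not reproduced here):

```python
# Sketch: build sparse path-count vectors and run 10-fold cross-validation with
# logistic regression. `pair_path_counts` maps (noun1, noun2) -> {path: count},
# `pair_labels` maps the same pairs to 1 (hypernym) or 0 (non-hypernym).
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pairs = list(pair_path_counts)
X = DictVectorizer().fit_transform([pair_path_counts[p] for p in pairs])
y = np.array([pair_labels[p] for p in pairs])

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print("mean F1 across 10 folds:", scores.mean())
```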

Page 10

A hypernym-only classifier (cont’)

Comparison of performance:

The first four are hypernym-only classifiers. The “Hearst’s patterns” classifier simply detects the presence of at least one of Hearst’s patterns, arguably the previous best classifier consisting only of lexico-syntactic patterns; the “And/or other” classifier uses only the “NP and/or other NP” subset of Hearst’s patterns.

Clearly, the learned hypernym-only classifiers perform much better than the hand-designed pattern classifiers. But the performance is still NOT very good. WHY?

Page 11

Using coordinate terms to improve hypernym classification

Problem with patterns: patterns can only handle noun pairs that happen to occur in the same sentence, but many hypernym/hyponym pairs never occur together in a single sentence.

Solution: consider coordinate terms

Coordinate terms: nouns or verbs that share the same hypernym.

Assumption: if two nouns (ni, nj) are coordinate terms and nj is a hyponym of nk, we may infer with higher probability that ni is also a hyponym of nk, despite never having encountered the pair (ni, nk) within a single sentence.

Expectation: using coordinate information will increase the recall of the hypernym classifier.

Page 12

Using coordinate terms to improve hypernym classification (cont’)

3 Classifiers:

1) Distributional Similarity Vector Space Model (a simple variant is sketched after this list)

2) Thresholded Conjunction Pattern Classifier, based on the pattern “X, Y and Z”

3) Best WordNet Classifier

Comparison of performance:
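One simple way to realize the distributional-similarity idea is to compare each noun’s dependency-feature vector by cosine similarity and threshold the score (an illustrative variant only; the paper’s exact vector-space model may differ):

```python
# Sketch: thresholded distributional-similarity coordinate classifier.
# `noun_contexts` maps each noun to a {dependency_feature: count} vector (assumed input).
import math

def cosine(u, v):
    dot = sum(weight * v.get(feat, 0.0) for feat, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def are_coordinate(noun_contexts, n1, n2, threshold=0.2):
    """Predict that n1 and n2 are coordinate terms if their context vectors are similar enough."""
    return cosine(noun_contexts.get(n1, {}), noun_contexts.get(n2, {})) >= threshold
```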

Page 13

Hybrid hypernym-coordinate classification

1) P(ni <H nj): probability that noun ni has nj as an ancestor in its hypernym hierarchy

2) P(ni ~C nj): probability that nouns ni and nj are coordinate terms

3) P_old(ni <H nk): probability produced by the hypernym-only classifier

4) The new probability that nk is a hypernym of ni is computed by linear interpolation:

P_new(ni <H nk) = λ1 · P_old(ni <H nk) + λ2 · max_j [ P(ni ~C nj) · P_old(nj <H nk) ]

(for the final evaluation, they set λ1 and λ2 to fixed values)
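A minimal sketch of this interpolation (illustrative names; the λ1, λ2 defaults are placeholders, not the paper’s settings):

```python
# Sketch: combine hypernym-only and coordinate-term probabilities by linear interpolation.
# p_hyper[(i, k)]: hypernym-only probability that k is a hypernym of i.
# p_coord[(i, j)]: probability that i and j are coordinate terms.
def combined_hypernym_prob(i, k, nouns, p_hyper, p_coord, lam1=0.5, lam2=0.5):
    best = max(
        (p_coord.get((i, j), 0.0) * p_hyper.get((j, k), 0.0) for j in nouns if j != i),
        default=0.0,
    )
    return lam1 * p_hyper.get((i, k), 0.0) + lam2 * best
```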

Page 14

Result

The logistic regression hypernym-only model has a 16% relative F-score improvement over the best WordNet classifier.

The combined hypernym/coordinate model has a 40% relative F-score improvement. The best-performing classifier is a hypernym-only model additionally trained on the Wikipedia corpus, with an expanded feature lexicon of 200,000 dependency paths; this classifier shows a 54% improvement over WordNet.

Page 15

Result