l-isa learning domain specific isa relations from the web alessandra potrich and emanuele pianta...

L-ISA

Learning Domain Specific ISA relations

from the WEB

Alessandra Potrich and Emanuele Pianta

Fondazione Bruno Kessler - IRSTTrento, Italy

LREC 2008 Marrakech, 31may 2008

Overview

• Learning ISA relations in the patent processing domain (the PatExpert Project)

• The L-ISA algorithm• Evaluation• Future Work

Ontology Learning/Population

• Ontology Learning: acquisition of new concepts and relations between them – e.g., a device is an artifact

• Ontology Population: acquisition of factual knowledge about specific instances– e.g. Einstein is an instance of a scientist– e.g. Einstein was born in 1879

PATExpert

• Funded by the European Union

• Aim: improving patent retrieval, summarization, paraphrasing, classification and valuing through shallow and deep semantic analysis

• Main semantic analysis task: recognizing occurrences of KB concepts and relations

• Proof of the concept on two domains– Optical Recording– Machine Tools

• Focus of the presentation: Ontology Learning in the Optical Recording domain

Optical Recording Domain Ontology (ORDO)

• Based on the Owl formalism• Built in three stages

– 200 hundreds manually crafted concepts: starting from a list of the most frequent terms in a reference corpus

– Pro-ISA: ontology learning algorithm based on projection of WordNet fragments onto ORDO

– L-ISA: ontology learning algorithm based on acquisition of isa templates from the Web

Patent Concept Annotation

Given a target word:

– disambiguate it, by assigning a WN synset whose domain is compatible with the optical recording domain (exploiting WORDNET-DOMAINS)

– If the synset is linked to an ORDO concept annotate the target word with the ORDO concept

– Otherwise: apply Pro-ISA

– Otherwise: apply L-ISA

Choosing the right sense

Senses for the word “CD”:

1. cadmium, Cd, atomic_number_48 (CHEMISTRY)

2. candle, candela, cd, standard candle (PHISYCS)

3. certificate of deposit, CD (MONEY)4. compact disk, compact disc, CD (COMPUTER, MUSIC)

CD

{compact_disk, compact_disc, CD} ordo:cd

lemma

synset KB-concept

Direct Concept Annotation

Pro-ISA 1:Looking for a WN-to-ORDO

link{event}

{crosstalk, XT} -{cross_talk, cross-talk, crosstalk_amount}

{noise, interference, disturbance}

{trouble}

{happening, occurrence, occurrent, natural_event}

sumo:Process

cross-talkLemma:

Pro-ISA 2: Projecting ISA chains (WN ->

ORDO){event}

{crosstalk, XT} -{cross_talk, cross-talk, crosstalk_amount}

{noise, interference, disturbance}

{trouble}

{happening, occurrence, occurrent, natural_event}

sumo:Process

cross-talk

auto_ordo:crosstalk

auto_ordo:noise

auto_ordo:trouble

auto_ordo:happening

From Pro-ISA to L-ISA

• In 15% of cases, the target word is not in WordNet, so Pro-ISA cannot be applied

• Then try and exploit the WEB

• Why not the patent corpus itself?– Isa relations are not frequent in restricted corpora– Patents often contain concept definitions with local

scope– We don’t want idiosyncratic concept definitions, but

common, shared definitions.

Learning ISA relations from a corpus …

• … by exploiting linguistic patterns expressing the ISA relation (Hearst, 1992; Hearst, 1998; Mititelu, 2006)

• Many patterns have been presented in the literature, but– Few evaluations of the pattern reliability (except Snow

2006)– Even less task-oriented evaluation in domain specific,

concrete application scenarios.

• This paper: attempt to provide both kind of evaluations in a real-word, challenging scenario such as patent semantic analysis.

Lexico-Syntactic Patterns

• Patterns reported in the literature

NP1 isa-phrase NP2syntactic noun phrases

sequence of tokens

• In our case we are looking for the hypernym of a specific target term

Term-NP isa-phrase Hyper-NPHyper-NP isa-phrase Term-NP

L-ISA

• Google (or any other web engine) does not allow for searching lexico-syntactic patterns…

• So, we proceed in three steps– Snippet acquisition from Google– Lexico-syntactic filtering– Semantic filtering

L-ISA: Snippet Acquisition

• Suppose we cannot link the term “photodetector” to any ORDO concept.

• We want to exploit the following lexico-syntactic pattern: <TERM-NP> “is an” <HYPER-NP>

• Submit to Google the following string query: “photodetector is an”

• Keep the first 100 snippets (at most), e.g. “... upper frequencies, the PIN waveguide photodetector

is an attractive device, since it is possible to reduce transit time without ..”

• Transform HTML snippets in pure text.

L-ISA: Lexico-syntactic Filtering

• Annotate snippets with TextPro (PoS, lemma, chunk)• Recognize <Term-NP> isa-phrase <Hyper-NP> in

the annotated snippets

token PoS lemma chunk

TERM-NP the AT0 the B-NP

PIN NN1 pin I-NP

waveguide NN1 waveguide I-NP

photodetector NN1 photodetector I-NP

isa-phrase is VBZ be B-VP

an AT0 an B-NP

HYPER-NP attractive AJ0 attractive I-NP

device NN1 device I-NP

L-ISA: Lexico-syntactic Filtering

• Filter out TERM-NP : – if target term is modified (e.g. “PIN waveguide

photodetector” above) – if it looks like a proper names (e.g. uppercase letter in the

middle of a sentence).

• Keep HYPER-NP: – only if it fits a restricted number of PoS-pattern: (N | AN | NN | NNN | ANN | XNN | R Vpastpart AXN)

TERM-NP HYPER-NP

the photodetector analog signal

A photodetector apparatus

photodetector effective monitor

The photodetector electric device

a photodetector electronic device

photodetector object

Semantic Filtering

• Keep only those HYPER-NPs compatible with the Optical Recording domain, by checking

– whether the HYPER-NP is already a label in one of the known ontologies (SUMO, ORDO, AUTO-ORDO)

– whether it is present in a WordNet synset with a WORDNET-DOMAIN label compatible with the Optical Recording domain.

HYPER-NP IN KB IN WN

DOM. COMPAT

analog signal

apparatus yes yes midlow

effective monitor

electric device

electronic device yes

object yes yes midlow

Candidate hypernyms for photodetctor

Candidate Selection

• Candidates are weighed according to– Frequency and Reliability of patterns

where the hypernym occurs– Variety of patterns– Belonging to specific ontologies

(manual ORDO, AUTO-ORDO or SUMO, in decreasing preference order)

Evaluating the Reliability of ISA

Patterns• Assessement of the reliability of the patterns reported in

the literature as predictors of the isa relation– Around 80 templates– On three target terms: “groove”, “photodetector” and

“magnetic head”.– Google returned around 9.000 snippets– Only snippets passing lexico-syntactic filtering have

been actually manually evaluated (about 1,450)– Guideline: try to interpret the intentions of the author

(does he/she really intende to say that X isa subclass of Y, beyond inappropriate phrasing, and even if you know that it is not true?)

– Results of this evaluation exploited in weighting the hypernym candidates

Evaluating the L-ISA accuracy

• Measuring the accuracy of the L-ISA algorithm in finding the hypernym of a given domain concept

• Most frequent 100 terms that we were not able to link to the ORDO ontology using the Pro-ISA learning strategy

• Including “wrong” target terms (because of errors of the linguistic processors, e.g. a past participle instead of a noun)

• Accuracy: 78.6%

Future Work

• Extend evaluation set• Inter-coder agreement• Use Machine Learning to optimize

the weights associated to templates

l-isa learning domain specific isa relations from the web alessandra potrich and emanuele pianta...

Documents

overviewlearning isa

happeningfrom proisa

acquisition of isa templates

isa relation hearst

reference corpusproisa

ordo conceptotherwise

ordo link

word cd