l-isa learning domain specific isa relations from the web alessandra potrich and emanuele pianta...
TRANSCRIPT
L-ISA
Learning Domain Specific ISA relations
from the WEB
Alessandra Potrich and Emanuele Pianta
Fondazione Bruno Kessler - IRSTTrento, Italy
LREC 2008 Marrakech, 31may 2008
Overview
• Learning ISA relations in the patent processing domain (the PatExpert Project)
• The L-ISA algorithm• Evaluation• Future Work
Ontology Learning/Population
• Ontology Learning: acquisition of new concepts and relations between them – e.g., a device is an artifact
• Ontology Population: acquisition of factual knowledge about specific instances– e.g. Einstein is an instance of a scientist– e.g. Einstein was born in 1879
PATExpert
• Funded by the European Union
• Aim: improving patent retrieval, summarization, paraphrasing, classification and valuing through shallow and deep semantic analysis
• Main semantic analysis task: recognizing occurrences of KB concepts and relations
• Proof of the concept on two domains– Optical Recording– Machine Tools
• Focus of the presentation: Ontology Learning in the Optical Recording domain
Optical Recording Domain Ontology (ORDO)
• Based on the Owl formalism• Built in three stages
– 200 hundreds manually crafted concepts: starting from a list of the most frequent terms in a reference corpus
– Pro-ISA: ontology learning algorithm based on projection of WordNet fragments onto ORDO
– L-ISA: ontology learning algorithm based on acquisition of isa templates from the Web
Patent Concept Annotation
Given a target word:
– disambiguate it, by assigning a WN synset whose domain is compatible with the optical recording domain (exploiting WORDNET-DOMAINS)
– If the synset is linked to an ORDO concept annotate the target word with the ORDO concept
– Otherwise: apply Pro-ISA
– Otherwise: apply L-ISA
Choosing the right sense
Senses for the word “CD”:
1. cadmium, Cd, atomic_number_48 (CHEMISTRY)
2. candle, candela, cd, standard candle (PHISYCS)
3. certificate of deposit, CD (MONEY)4. compact disk, compact disc, CD (COMPUTER, MUSIC)
CD
{compact_disk, compact_disc, CD} ordo:cd
lemma
synset KB-concept
Direct Concept Annotation
Pro-ISA 1:Looking for a WN-to-ORDO
link{event}
{crosstalk, XT} -{cross_talk, cross-talk, crosstalk_amount}
{noise, interference, disturbance}
{trouble}
{happening, occurrence, occurrent, natural_event}
sumo:Process
cross-talkLemma:
Pro-ISA 2: Projecting ISA chains (WN ->
ORDO){event}
{crosstalk, XT} -{cross_talk, cross-talk, crosstalk_amount}
{noise, interference, disturbance}
{trouble}
{happening, occurrence, occurrent, natural_event}
sumo:Process
cross-talk
auto_ordo:crosstalk
auto_ordo:noise
auto_ordo:trouble
auto_ordo:happening
From Pro-ISA to L-ISA
• In 15% of cases, the target word is not in WordNet, so Pro-ISA cannot be applied
• Then try and exploit the WEB
• Why not the patent corpus itself?– Isa relations are not frequent in restricted corpora– Patents often contain concept definitions with local
scope– We don’t want idiosyncratic concept definitions, but
common, shared definitions.
Learning ISA relations from a corpus …
• … by exploiting linguistic patterns expressing the ISA relation (Hearst, 1992; Hearst, 1998; Mititelu, 2006)
• Many patterns have been presented in the literature, but– Few evaluations of the pattern reliability (except Snow
2006)– Even less task-oriented evaluation in domain specific,
concrete application scenarios.
• This paper: attempt to provide both kind of evaluations in a real-word, challenging scenario such as patent semantic analysis.
Lexico-Syntactic Patterns
• Patterns reported in the literature
NP1 isa-phrase NP2syntactic noun phrases
sequence of tokens
• In our case we are looking for the hypernym of a specific target term
Term-NP isa-phrase Hyper-NPHyper-NP isa-phrase Term-NP
L-ISA
• Google (or any other web engine) does not allow for searching lexico-syntactic patterns…
• So, we proceed in three steps– Snippet acquisition from Google– Lexico-syntactic filtering– Semantic filtering
L-ISA: Snippet Acquisition
• Suppose we cannot link the term “photodetector” to any ORDO concept.
• We want to exploit the following lexico-syntactic pattern: <TERM-NP> “is an” <HYPER-NP>
• Submit to Google the following string query: “photodetector is an”
• Keep the first 100 snippets (at most), e.g. “... upper frequencies, the PIN waveguide photodetector
is an attractive device, since it is possible to reduce transit time without ..”
• Transform HTML snippets in pure text.
L-ISA: Lexico-syntactic Filtering
• Annotate snippets with TextPro (PoS, lemma, chunk)• Recognize <Term-NP> isa-phrase <Hyper-NP> in
the annotated snippets
token PoS lemma chunk
TERM-NP the AT0 the B-NP
PIN NN1 pin I-NP
waveguide NN1 waveguide I-NP
photodetector NN1 photodetector I-NP
isa-phrase is VBZ be B-VP
an AT0 an B-NP
HYPER-NP attractive AJ0 attractive I-NP
device NN1 device I-NP
L-ISA: Lexico-syntactic Filtering
• Filter out TERM-NP : – if target term is modified (e.g. “PIN waveguide
photodetector” above) – if it looks like a proper names (e.g. uppercase letter in the
middle of a sentence).
• Keep HYPER-NP: – only if it fits a restricted number of PoS-pattern: (N | AN | NN | NNN | ANN | XNN | R Vpastpart AXN)
TERM-NP HYPER-NP
the photodetector analog signal
A photodetector apparatus
photodetector effective monitor
The photodetector electric device
a photodetector electronic device
photodetector object
Semantic Filtering
• Keep only those HYPER-NPs compatible with the Optical Recording domain, by checking
– whether the HYPER-NP is already a label in one of the known ontologies (SUMO, ORDO, AUTO-ORDO)
– whether it is present in a WordNet synset with a WORDNET-DOMAIN label compatible with the Optical Recording domain.
HYPER-NP IN KB IN WN
DOM. COMPAT
analog signal
apparatus yes yes midlow
effective monitor
electric device
electronic device yes
object yes yes midlow
Candidate hypernyms for photodetctor
Candidate Selection
• Candidates are weighed according to– Frequency and Reliability of patterns
where the hypernym occurs– Variety of patterns– Belonging to specific ontologies
(manual ORDO, AUTO-ORDO or SUMO, in decreasing preference order)
Evaluating the Reliability of ISA
Patterns• Assessement of the reliability of the patterns reported in
the literature as predictors of the isa relation– Around 80 templates– On three target terms: “groove”, “photodetector” and
“magnetic head”.– Google returned around 9.000 snippets– Only snippets passing lexico-syntactic filtering have
been actually manually evaluated (about 1,450)– Guideline: try to interpret the intentions of the author
(does he/she really intende to say that X isa subclass of Y, beyond inappropriate phrasing, and even if you know that it is not true?)
– Results of this evaluation exploited in weighting the hypernym candidates
Evaluating the L-ISA accuracy
• Measuring the accuracy of the L-ISA algorithm in finding the hypernym of a given domain concept
• Most frequent 100 terms that we were not able to link to the ORDO ontology using the Pro-ISA learning strategy
• Including “wrong” target terms (because of errors of the linguistic processors, e.g. a past participle instead of a noun)
• Accuracy: 78.6%
Future Work
• Extend evaluation set• Inter-coder agreement• Use Machine Learning to optimize
the weights associated to templates