An Unsupervised WSD Algorithm for a NLP System
Iulia Nica, Andrés Montoyo, Sonia Vázquez and Mª Antonia Martí
2
INDEX
Introduction
Architecture for the NLP System
WSD Method
Evaluation
Conclusions
Future Work
3
Introduction
Natural Language Processing (NLP) techniques are necessary for current information systems.
One problem of natural language is ambiguity (phonological, morphological, syntactic, semantic or pragmatic).
The resolution of lexical ambiguity is necessary for certain NLP applications: Machine Translation, Information Retrieval, Information Extraction, etc.
4
Introduction
Word Sense Disambiguation (WSD) is an intermediate task that attempts to resolve the lexical ambiguity problem by assigning each word its appropriate meaning.
WSD uses two information sources: the context and external knowledge sources.
WSD approaches: knowledge-driven and data-driven.
5
Introduction
WSD method characteristics: knowledge-driven and unsupervised.
Information sources: EuroWordNet (EWN) and a large untagged corpus.
Sense assignment uses paradigmatic information.
Easily adaptable to other languages.
6
Architecture for the NLP System
INPUT: Untagged text
1. POS-analyser (MACO): extracts all possible POS-tags.
2. POS-tagger (RELAX): selects only one morphosyntactic category.
3. Shallow parser (TACAT): identifies the sentence's constituents.
4. WSD module: uses the corpus and the Sense Discriminators from EWN (sets of nouns derived from the lexical-semantic relations of EWN).
OUTPUT: Text annotated with POS-tags, chunks and noun senses
7
WSD method
It operates on paradigmatic information. It extracts paradigmatic information for an ambiguous occurrence and maps this information to the paradigmatic information from the lexicon.
It rests on the assumption that semantically similar words can substitute for each other in the same context and, conversely, that words that can commute in a context are likely to be semantically close.
8
WSD method
It uses a POS-tagged corpus for searching syntactic patterns (the corpus of EFE News Agency, over 70M words).
For the identification of patterns, it follows a structural criterion, using a list of basic patterns and search schemes.
Each syntactic pattern is identified at both the lemma and the POS level.
9
WSD method
Syntactic patterns: X-R-Y
X and Y are lexical content units (nouns, adjectives, verbs and adverbs).
R is a relational element (function words: prepositions, conjunctions).
The pattern expresses a syntactic relation between X and Y.
Examples:
grano (noun) de (preposition) azúcar (noun)
pasaje (noun) subterráneo (adjective)
10
WSD method
Definition of basic patterns:
N , N
N C N
N P N
N A
N V
A N
V N
Conjunctions = {y, e, o, u}
Tags: N = Noun; R = Adverb; A = Adjective; V = Participle Verb; C* = Conjunction; D = Determiner
11
WSD method
Each basic pattern has discontinuous realisations in texts.
We pre-establish morphosyntactic schemes for the search of patterns; e.g.:
N (((R) R) A/V) , ((D) D) (((R) R) A/V) N
N (((R) R) A/V) C* ((D) D) (((R) R) A/V) N
N (((R) R) A/V) P ((D) D) (((R) R) A/V) N
N ((R) R) A (C* ((R) R) A/V)
N ((R) R) V (C* ((R) R) A/V)
(A/V C* ((D) D) (((R) R)) A N
(A/V C* ((D) D) (((R) R)) V N
The units between brackets are optional; those separated by a slash are alternatives for a position.
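One way to read a search scheme is as a regular expression over the sequence of POS tags. The sketch below encodes only the scheme N (((R) R) A/V) C* ((D) D) (((R) R) A/V) N. It assumes a hypothetical one-character tag per token and writes the tag C* as "C"; the function name `find_scheme` is illustrative and not part of the original system.

```python
import re

# Assumed encoding: one character per token tag (N, A, V, R, D, C for the slide's C*).
MODIFIER = r"(?:R{0,2}[AV])?"  # (((R) R) A/V): optional A or V preceded by up to two R
DETERMINERS = r"D{0,2}"        # ((D) D): up to two determiners
SCHEME_N_C_N = re.compile(rf"N{MODIFIER}C{DETERMINERS}{MODIFIER}N")


def find_scheme(tagged):
    """Return the token spans whose tag sequence matches the scheme.

    `tagged` is a list of (lemma, tag) pairs; matches do not overlap.
    """
    tags = "".join(tag for _, tag in tagged)
    return [tagged[m.start():m.end()] for m in SCHEME_N_C_N.finditer(tags)]


phrase = [("órgano", "N"), ("dañado", "A"), ("o", "C"), ("parte", "N")]
matches = find_scheme(phrase)
```

Here the whole phrase is returned as a single match, because its tag sequence N A C N is one realisation of the scheme.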
12
WSD method
For each search scheme, we define decomposition rules in order to extract the basic patterns.
Each unit of the sequence is also considered at the lemma level.
Example:
N A C* A → N A, N A
Coronas danesas y suecas → Corona danesa, Corona sueca
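A decomposition rule can be pictured as a mapping from the tag sequence matched by a scheme to the positions that make up each basic pattern. The following is a minimal sketch under that assumption: the table `DECOMPOSITION_RULES` and the function `decompose` are hypothetical names, only two rules are encoded, and the tag C* is written as "C".

```python
# Assumed rule table: matched tag sequence -> index tuples of the basic patterns.
DECOMPOSITION_RULES = {
    "NACA": [(0, 1), (0, 3)],     # N A C* A -> N A, N A
    "NACN": [(0, 1), (0, 2, 3)],  # N A C* N -> N A, N C N
}


def decompose(units, tags):
    """Split a matched sequence into basic patterns; unknown sequences yield []."""
    rule = DECOMPOSITION_RULES.get("".join(tags), [])
    return [tuple(units[i] for i in indices) for indices in rule]


coronas = decompose(["corona", "danesa", "y", "sueca"], ["N", "A", "C", "A"])
organos = decompose(["órgano", "dañado", "o", "parte"], ["N", "A", "C", "N"])
```

The first call reproduces the slide's example (corona danesa, corona sueca); the second corresponds to a sequence of the form N A C* N.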
13
WSD method
Information is extracted from two sources:
Corpus (paradigmatic information).
Sentences (syntagmatic information).
Paradigmatic information is extracted by exploiting the syntactic patterns.
Example:
{obra, concierto, pieza} para órgano
Paradigmatic relations: obra, concierto and pieza alternate in the same position.
Syntagmatic relations: each of them combines with "para órgano".
14
WSD method
Sense discriminators obtained from EWN:
Selection of all nouns related to each sense along the different lexical-semantic relations.
Elimination of the elements common to different senses.
Result: disjoint sets of nouns for the senses of a word.
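The elimination step amounts to a set difference: each sense keeps only the nouns that no other sense is related to. The sketch below assumes the related nouns have already been collected from EWN into a dictionary; the function name and the toy data (including the shared noun "estructura") are illustrative, not taken from EWN.

```python
def sense_discriminators(related_nouns):
    """Map each sense to the nouns related to it and to no other sense.

    `related_nouns` maps a sense id to the set of nouns reached through
    the lexical-semantic relations of that sense.
    """
    result = {}
    for sense, nouns in related_nouns.items():
        others = set().union(
            *(n for s, n in related_nouns.items() if s != sense)
        )
        result[sense] = set(nouns) - others
    return result


# Toy input: "estructura" is shared by both senses and is therefore dropped.
toy = {
    1: {"flor", "semilla", "estructura"},
    3: {"músculo", "riñón", "estructura"},
}
discriminators = sense_discriminators(toy)
```

By construction the resulting sets are pairwise disjoint, so a noun found in one of them points to exactly one sense.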
15
WSD method
Commutative test:
Hypothesis: if two words can commute in a given context, they are likely to be semantically close.
Application: if the ambiguous word can be substituted with a sense discriminator inside a syntactic pattern, then it has the sense corresponding to that discriminator.
The algorithm operates with words from a sense-untagged corpus.
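The test can be sketched as two set operations: collect the words that fill the ambiguous word's position in the corpus, then intersect them with each sense's discriminator set. This is only an illustration under simplifying assumptions: patterns are (X, R, Y) triples of lemmas, the ambiguous word is in position X, and the corpus and discriminator sets below are toy data.

```python
def substitutes_for(pattern, corpus_patterns):
    """Words that occupy the X position of `pattern` elsewhere in the corpus."""
    x, r, y = pattern
    return {
        cx for (cx, cr, cy) in corpus_patterns
        if (cr, cy) == (r, y) and cx != x
    }


def commutative_test(substitutes, discriminators):
    """Senses whose discriminator set contains at least one substitute."""
    return {sense for sense, sd in discriminators.items() if substitutes & sd}


# Toy corpus of already extracted patterns (hypothetical data).
corpus = [
    ("cabeza", "o", "parte"),
    ("árbol", "o", "parte"),
    ("libro", "de", "texto"),
]
toy_discriminators = {1: {"flor", "semilla"}, 3: {"cabeza", "músculo"}}

found = substitutes_for(("órgano", "o", "parte"), corpus)
senses = commutative_test(found, toy_discriminators)
```

With this toy data, "cabeza" commutes with "órgano" in the pattern and belongs to the set of sense 3, so sense 3 is the only candidate returned.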
16
WSD method
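Commutative Test Algorithm
[Flowchart: the ambiguous word X in the pattern X - R - Y is replaced by an empty position (__ - R - Y), and the corpus is searched for patterns Xk - R - Y. Each Xk is compared with the elements dij of the sense discriminator sets SD1, ..., SDi0, ..., SDn. YES (Xk belongs to SDi0): the occurrence receives sense i0 (X_i0 - R - Y). NO: the sense remains undetermined (X_? - R - Y).]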
17
WSD method
The WSD module has two heuristics:
H1: Commutative Test Algorithm applied to the paradigmatic information (the nouns obtained by substituting the ambiguous occurrence in the pattern).
H2: Commutative Test Algorithm applied to the syntagmatic information (the nouns obtained from the sentence).
The two heuristics act as voters for the sense assignment.
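The slides do not specify how the two votes are combined. The sketch below shows one plausible scheme, stated purely as an assumption: each heuristic proposes a (possibly empty) set of senses, the sense with the most votes wins, and the module abstains when there is no vote or a tie.

```python
from collections import Counter


def vote(*proposals):
    """Combine the sense sets proposed by the heuristics.

    Assumed rule (not from the slides): majority wins, ties and
    empty proposals lead to abstention (None).
    """
    counts = Counter(sense for proposal in proposals for sense in proposal)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top candidates: abstain
    return ranked[0][0]


agreed = vote({3}, {3})
only_h1 = vote({3}, set())
conflict = vote({1}, {3})
```

Abstaining on ties trades coverage for precision; other combination rules are equally compatible with the description above.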
18
WSD method
Example: Los enormes y continuados progresos científicos y técnicos de la Medicina actual han logrado hacer descender espectacularmente la mortalidad infantil, erradicar multitud de enfermedades hasta hace poco mortales, sustituir mediante trasplante o implantación órganos dañados o partes del cuerpo inutilizadas y alargar las expectativas de vida.
1. Input text POS-tagging.
2. Syntactic patterns identification.
2.1. Use of search schemes.
Scheme: órganos dañados o partes → N A C N
2.2. Use of decomposition rules.
Decomposition rules: N A C N → N A, N C N
Final result: órgano dañado, órgano o parte
3. Extraction of information.
3.1. From corpus: mediador, terreno, chófer, árbol, cabeza, planeta, parte, incremento, totalidad, guerrilla, programa, mitad, país, temporada, artículo, tercio
3.2. From sentence: progreso, científico, mortalidad, multitud, enfermedad, mortal, trasplante, implantación, órgano, parte, cuerpo, expectativa, vida
4. Extraction of Sense Discriminators.
Sense Discriminators Sets:
Sense 1: órgano vegetal, espora, flor, pera, manzana, bellota, hinojo, semilla, poro, píleo, carpóforo, ...
Sense 2: agencia, unidad administrativa, banco central, servicio secreto, seguridad social, FBI, ...
Sense 3: parte del cuerpo, trozo, músculo, riñón, oreja, ojo, glándula, lóbulo, tórax, dedo, articulación, rasgo, facción, ...
Sense 4: instrumento de viento, instrumento musical, mecanismo, aparato, teclado, pedal, corneta, ...
Sense 5: periódico, publicación, medio de comunicación, método, serie, serial, número, ejemplar, ...
5. Commutative Test.
Heuristic 1 (S1 = nouns from corpus): S1 ∩ SD1 = ∅, S1 ∩ SD2 = ∅, S1 ∩ SD3 ≠ ∅, S1 ∩ SD4 = ∅, S1 ∩ SD5 = ∅
Heuristic 2 (S2 = nouns from sentence): S2 ∩ SD1 = ∅, S2 ∩ SD2 = ∅, S2 ∩ SD3 ≠ ∅, S2 ∩ SD4 = ∅, S2 ∩ SD5 = ∅
6. Final sense assignment.
órgano#3: A fully differentiated structural and functional unit in an animal that is specialized for some particular function.
19
Evaluation
The WSD method was tested with the Spanish Lexical Sample task of Senseval-2.
For the evaluation, we selected all 17 nouns of this task.
We used the two heuristics H1 & H2.
20
Evaluation
Results obtained:
           Precision   Recall   Coverage
H1         0.54        0.11     0.21
H2         0.59        0.04     0.07
H1 + H2    0.56        0.15     0.27
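The three measures are linked. Assuming the usual Senseval definitions (precision = correct / attempted, recall = correct / total, coverage = attempted / total), recall equals precision times coverage. The sketch below spells out these definitions and checks that the reported figures are consistent with that relation after rounding to two decimals; the function name `scores` is illustrative.

```python
def scores(correct, attempted, total):
    """Precision, recall and coverage under the assumed Senseval definitions."""
    precision = correct / attempted
    recall = correct / total
    coverage = attempted / total
    return precision, recall, coverage


# Reported (precision, recall, coverage) for each configuration.
reported = {
    "H1": (0.54, 0.11, 0.21),
    "H2": (0.59, 0.04, 0.07),
    "H1 + H2": (0.56, 0.15, 0.27),
}

# recall == precision * coverage, up to rounding of the published values.
consistent = {
    name: round(p * c, 2) == r for name, (p, r, c) in reported.items()
}
```

The low recall therefore follows directly from the low coverage: the heuristics answer for only a fraction of the occurrences, while precision on the answered ones stays above 0.5.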
21
Evaluation
In Senseval-2, the values for the individual words reached the following levels:
Precision = 51.4% - 71.2%
Recall = 50.3% - 71.2%
Coverage = 98% - 100%
22
Conclusions
This WSD method can be used as a module in an NLP system to prepare input text for a real application.
It is independent of any corpus tagging at syntactic or semantic level.
It requires only a minimal preprocessing phase (POS-tagging) of the input text and of the search corpus.
23
Future work
Study of different possibilities to improve the WSD process.
Application of new algorithms to the information associated with the ambiguous occurrence.
Combination with other data-driven WSD methods.
Thank you!!