LIRICS WP2 – NLP Lexica
LIRICS Mid-term Review 1
LIRICS WP2 – NLP Lexica
Monica Monachini
CNR-ILC, Pisa
23rd May 2006
Summary of the presentation
• Overview of WP2
• 1st year objectives
• Main results in T2.1 and T2.2
• Work done
• Synergies with other LIRICS WPs, ISO activities, meetings
• Priorities for future activities
WP2 overall objective
Define a “family” of standards for NLP lexicons
Two-level standards:
• the high-level specifications provide structural elements, i.e. lexical classes and the relations between them (the meta-model);
• the low-level specifications provide standardized constants, i.e. data categories used to “adorn” the lexical classes (ISO 12620)
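The two-level split can be illustrated with a minimal sketch. It assumes a toy data category registry held in a dictionary; the `DATA_CATEGORIES` table and the `adorn` helper are hypothetical names used only for illustration and are not part of LMF or ISO 12620.

```python
# Hypothetical mini registry of data categories: attribute name -> permitted values.
# In the real setting these constants would come from the ISO Data Category Registry.
DATA_CATEGORIES = {
    "partOfSpeech": {"noun", "verb"},
    "grammaticalNumber": {"singular", "plural"},
}


def adorn(lexical_class, **features):
    """Attach data categories to an instance of a meta-model class.

    The class name stands for the high-level structure; the features are the
    low-level constants. Unknown categories or values are rejected.
    """
    for name, value in features.items():
        if name not in DATA_CATEGORIES:
            raise ValueError(f"unknown data category: {name}")
        if value not in DATA_CATEGORIES[name]:
            raise ValueError(f"value {value!r} not allowed for {name}")
    return {"class": lexical_class, "features": dict(features)}


entry = adorn("LexicalEntry", partOfSpeech="noun")
```

The point of the design is that the structural class never hard-codes its descriptors: they are looked up in a shared registry, so two lexicons using the same registry remain comparable.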
WP2 T2.1 overview and objectives
• From past and ongoing standardization activities, gather the linguistic information considered relevant for lexical description, to be combined with the layers of the lexical model
• Provide coherent input to the revision of the ISO Data Category Registry
WP2 T2.1 results
• Proposal for a unified set of lexical information and unified descriptors as a draft set of Data Categories
• Maximum set of candidate lexical data categories, subdivided along the layers of linguistic description: morphosyntax, syntax and semantics
• Data Categories shared between WP2 and WP3 that are relevant to morphosyntactic description have been incorporated in the Syntax Tool: the Morphosyntactic Profile
WP2 T2.1 Deliverables
[Timeline figure: months M1 to M30 across the 1st, 2nd and 3rd year, with rows for WP2 tasks T2.1, T2.2 and T2.3.]
Deliverables shown on the timeline:
• D.2.1 Survey and evaluation of existing standards for Lexica
• D.2.1 Survey and evaluation of existing standards for Lexica (revision): version foreseen in conjunction with the Data Categories, to be issued together with the data model in T2.2
WP2 T2.2 overview and objectives
• Define a lexical framework, a general and abstract meta-model as a set of structural nodes relevant for lexical description, enabling specific implementations on the basis of common Data Categories
• Definition of the common set of related Data Categories
WP2 T2.2 results
• Formulation of a high-level lexical meta-model, the Lexical Markup Framework, a flexible environment for user-defined mark-up languages
• Proofs of concept: mapping exercises of well-known NLP lexicon practices against the model
WP2 T2.2 Deliverables
[Timeline figure: months M1 to M30 across the 1st, 2nd and 3rd year, with rows for WP2 tasks T2.1, T2.2 and T2.3.]
Milestones shown on the timeline:
• NLP Lexica standard for CD ballot (submitted at the beginning of 2006)
• NLP Lexica standard for ISO DIS ballot
• Internal milestone for internal quality control
WP2 Activities, Meetings, Synergies...
LIRICS WPs bi- and tri-lateral working meetings:
• CNR-ILC – MPI, 15.2.2005: PAROLE-SIMPLE lexical architecture and LEXUS tool
• WP2 internal meeting, 16.2.2005: basic structure of the meta-model for lexicons (core model + extensions)
• CNR-ILC – DFKI, 5.5.2005: convergences between morpho-syntactic and syntactic data; issues for the submission of the NWI on Syntax (SynAF) to ISO
• Pisa, 23-24.11.2005, WP2 internal meeting: basic structure of the meta-model for the representation of multiword expressions
LIRICS meetings:
• Paris, 16-17.3.2005: progress of work within WP2; presentation of the standard core model for lexicons and the extensions for NLP lexicons
• Barcelona, 21-22.6.2005: LIRICS Industrial Advisory Board Meeting
• Barcelona, 22.6.2005: presentation of first bulk of information relevant for lexical description
• Nancy, 8-9.12.2005: WP4 TDG3 Workshop: connections between lexico-semantic representation and semantic roles in the lexicon
ISO meetings:
• Berlin, 8-9.4.2005: ISO TC37/SC4 WG4 meetings
• Warsaw, 21-26.8.2005: plenary meeting of ISO TC37/SC4; task force for the purpose of designating generic data category sets for alignment with the level of the metamodel; task force related to the representation of MWEs
• Rome, 27.10.2005: UNI-DIAM Commission: candidature of Italy as P-member in ISO TC37/SC4 (CNR-ILC reference expert)
What is LMF for?
• provide a common model for the creation and use of lexical resources
• manage the exchange of data between and among these resources
• enable the merging of electronic resources to form extensive global resources
Range of topics:
• monolingual,
• bilingual
• multilingual lexical resources
Scalability
• the same specifications are to be used for both small and large lexicons
Coverage
• linguistic description ranges from morphology, syntax and semantics to multilingual representation
• languages are not restricted to European languages
• the range of targeted NLP applications is not restricted
Future activities/Priorities/Plans
Data Categories
• deliver rev. 2 of D2.1: candidate data categories will receive the necessary adjustments after discussion
• extend the ISO Registry to cover further layers of linguistic description: do we need an ISO Syntactic Profile (Beijing)?
LMF model
• refine the NLP multilingual and MWE extensions
• XML representation of LMF linguistic objects, in order to allow unified access to LMF-conformant lexicons through APIs
• provide implementation of test suite lexical entries: PAROLE-SIMPLE lexicons ready to be described according to LMF (LEXUS), to be put in the LMF server and made accessible via the web
Structure of LMF
Core Package: structural skeleton, with the basic hierarchy of information in a lexical entry
Extensions:
• NLP Multilingual notations extension
• NLP Inflectional paradigm extension
• NLP Morphology extension
• NLP MWE pattern extension
• NLP Semantic extension
• MRD extension
• NLP Syntax extension
The extensions extend a subset of the core-model classes, are conformant to the core model, and cannot be used independently of the core model.
LMF specifications comply with UML modeling principles.
Core package
[UML class diagram. Classes: Database, Lexicon, Lexicon Information, Lexical Entry, Entry Relation, Form, Representation Frame, Sense, Sense Relation. The association multiplicities are not recoverable from the extracted text.]
Container for managing the top-level language components. The number of words or MWEs of the lexicon is equal to the number of lexical entries in a given lexicon.
Form consists of a text string that represents a single word or a multi-word expression
Sense specifies or disambiguates the meaning and context of a form
One to many Representation Frames can be associated with Form, each of which contains a form and data categories that specify the orthographic types and name of the word
It is a cross-reference pivot that can link to many Lexical Entries within or across Lexicons.
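A minimal sketch of how the core package might be mirrored in code is given below. It covers only Lexicon, Lexical Entry, Form and Sense; the attribute names (`written_form`, `gloss`, `language`) and the `size` method are assumptions made for illustration, and the multiplicities of the original diagram are not reproduced.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Form:
    # A text string representing a single word or a multi-word expression.
    written_form: str


@dataclass
class Sense:
    # Hypothetical attribute: a short gloss that disambiguates the meaning.
    gloss: str


@dataclass
class LexicalEntry:
    forms: List[Form]
    senses: List[Sense] = field(default_factory=list)


@dataclass
class Lexicon:
    language: str
    entries: List[LexicalEntry] = field(default_factory=list)

    def size(self) -> int:
        # The number of words or MWEs equals the number of lexical entries.
        return len(self.entries)


lexicon = Lexicon(
    language="eng",
    entries=[
        LexicalEntry([Form("oak")], [Sense("tree"), Sense("wood")]),
        LexicalEntry([Form("throw to the lions")]),
    ],
)
```

Here a multi-word expression is one entry like any single word, so `lexicon.size()` counts both in the same way.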
Package for extensional morphology
[UML class diagram. Classes: LexicalEntry, LemmatisedForm, InflectedForm, Stem, ListOfComponents, InflectionalParadigm. Some associations are marked {ordered}.]
: InflectedForm
grammaticalNumber = singular
writtenForm = clergyman
: InflectedForm
grammaticalNumber = plural
writtenForm = clergymen
: LemmatisedForm
writtenForm = clergyman
: LexiconInformation
language = eng
: LexicalEntry
: Database
: Lexicon
1st strategy: describe the morphology by representing explicitly all inflections
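The first strategy can be sketched as a plain lookup over explicitly listed forms. The class layout and the `inflect` method below are illustrative assumptions; only the attribute values (clergyman, clergymen, grammaticalNumber) come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InflectedForm:
    written_form: str
    features: Dict[str, str]


@dataclass
class LexicalEntry:
    lemmatised_form: str
    inflected_forms: List[InflectedForm] = field(default_factory=list)

    def inflect(self, **features: str) -> Optional[str]:
        """Return the stored form whose features match exactly, if any."""
        for form in self.inflected_forms:
            if form.features == features:
                return form.written_form
        return None


clergyman = LexicalEntry(
    lemmatised_form="clergyman",
    inflected_forms=[
        InflectedForm("clergyman", {"grammaticalNumber": "singular"}),
        InflectedForm("clergymen", {"grammaticalNumber": "plural"}),
    ],
)
plural = clergyman.inflect(grammaticalNumber="plural")
```

Nothing is computed here: every inflection is stored, which is simple to query but grows with the size of the paradigm.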
Package for inflectional paradigm
[UML class diagram. Classes: LemmatisedForm, ListOfComponents, Stem, InflectionalParadigm, MorphologicalFeaturesCombo, MorphologicalFeature, InflectedFormCalculator, Composer, Operation, OperationArgument. Some associations are marked {ordered}.]
: MorphologicalFeaturesCombo
: MorphologicalFeaturesCombo
: Operation
graphicalOperator = removeAfter
: InflectedFormCalculator
stem = 0
: Operation
graphicalOperator = addAfter
: InflectedFormCalculator
stem = 0
: LemmatisedForm
writtenForm = clergyman
: MorphologicalFeature
att = number
val = singular
: MorphologicalFeature
att = gender
val = masculine
: MorphologicalFeature
att = number
val = plural
: InflectionalParadigm
id = asMan
: OperationArgument
val = 2
: OperationArgument
val = en
for "clergymen"
for "clergyman"
2nd strategy: declare an inflectional paradigm; use the inflectional paradigm extension for defining it
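The second strategy can be sketched as a small interpreter over the operations of a paradigm. The reading of the operators is an assumption: `removeAfter` with argument 2 is taken to delete the last two characters of the stem, and `addAfter` with argument `en` to append that suffix. The dictionary layout and the function names are hypothetical.

```python
def remove_after(form: str, count: str) -> str:
    # Assumed semantics: drop `count` characters from the end of the form.
    n = int(count)
    return form[:-n] if n else form


def add_after(form: str, suffix: str) -> str:
    # Assumed semantics: append the suffix to the form.
    return form + suffix


OPERATORS = {"removeAfter": remove_after, "addAfter": add_after}

# Paradigm "asMan": one ordered list of (operator, argument) pairs per feature.
AS_MAN = {
    ("number", "singular"): [],
    ("number", "plural"): [("removeAfter", "2"), ("addAfter", "en")],
}


def inflect(stem: str, paradigm: dict, feature: tuple) -> str:
    """Apply the paradigm's operations for one feature, in order, to the stem."""
    form = stem
    for operator, argument in paradigm[feature]:
        form = OPERATORS[operator](form, argument)
    return form


plural = inflect("clergyman", AS_MAN, ("number", "plural"))
```

Because the paradigm is declared once, any entry that points to `asMan` gets its plural without listing it, for example `inflect("postman", AS_MAN, ("number", "plural"))`.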
Package for NLP syntax
[UML class diagram. Classes: LexicalEntry and Sense (described in core package), SemanticArgument (described in Semantic package), SyntacticBehavior, Construction, ConstructionSet, Self, SyntacticArgument.]
: SyntacticArgument
function = subject
syntacticConstituent = NP
: SyntacticArgument
function = object
syntacticConstituent = NP
: Construction
id = amare-SyntFrame
: Self
id = amare-self
auxiliary = avere
Syntactic behavior represents one of the behaviors of one (or more) senses
Construction describes one syntactic construction and can be shared by all words with the same syntactic behavior
Self refers to the head lexical entry and describes syntactic properties
Syntactic Argument describes a syntactic actant
ConstructionSet groups together various Syntactic Constructions and factorizes syntactic descriptions, so that the lexicon holds a minimum of syntactic behavior elements.
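The sharing of one Construction among several words can be sketched as follows. The construction id and the argument values are taken from the slide; the second verb `odiare`, the `behaviors` table and the `frame_of` helper are hypothetical and serve only to show the factorization.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SyntacticArgument:
    function: str
    syntactic_constituent: str


@dataclass(frozen=True)
class Construction:
    id: str
    arguments: Tuple[SyntacticArgument, ...]


# One construction, stored once.
frame = Construction(
    id="amare-SyntFrame",
    arguments=(
        SyntacticArgument("subject", "NP"),
        SyntacticArgument("object", "NP"),
    ),
)
constructions: Dict[str, Construction] = {frame.id: frame}

# Syntactic behaviors refer to the construction by id instead of copying it.
# "odiare" is a hypothetical second verb sharing the same frame.
behaviors: Dict[str, str] = {"amare": "amare-SyntFrame", "odiare": "amare-SyntFrame"}


def frame_of(lemma: str) -> List[Tuple[str, str]]:
    """Resolve a lemma to the (function, constituent) pairs of its construction."""
    construction = constructions[behaviors[lemma]]
    return [(a.function, a.syntactic_constituent) for a in construction.arguments]


amare_frame = frame_of("amare")
```

Two lemmas resolve to the same object, so a correction to the frame is made in one place.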
XML representation
Package for NLP semantics
[UML class diagram. Classes: LexicalEntry, Sense, SenseRelation, SenseExample, SemanticDefinition, Proposition, Synset, SynsetRelation, PredicativeRepresentation, SemanticPredicate, PredicateRelation, SemanticArgument, SyntacticBehavior, Construction, SyntacticArgument. Some classes are marked as described in the core package or in the syntactic package.]
Predicative Representation describes the link between Sense and Semantic Predicate
Semantic Predicate describes an abstract meaning
Semantic Argument describes a semantic actant and is linked with its syntactic counterpart
XML representation
Package for NLP semantics (cont.)
: Definition
text = a deciduous tree of the genus Quercus; has acorn ...
language = eng
view
: Definition
text = the hard durable wood of any oak
language = eng
view
: Definition
text = a tall perennial wood plant ...
language = eng
view
: Form
lemmatisedForm = oak tree
: SynSetRelation
type = hyponymy
: Form
lemmatisedForm = tree
: Form
lemmatisedForm = oak
: LexicalEntry
partOfSpeech = noun
: LexicalEntry
partOfSpeech = noun
: LexicalEntry
partOfSpeech = noun
: SynSet
id = 11520753
: SynSet
id = 11520081
: SynSet
id = 12352501
: Sense
: Sense
: Sense
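The oak example can be sketched as synsets linked by a typed relation. The flattened diagram does not preserve which numeric id belongs to which synset, so the sketch uses hypothetical string ids instead; the direction of the hyponymy link (oak tree as a hyponym of tree) and the helper `hypernyms` are also assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SynSet:
    id: str                      # hypothetical ids, not those of the slide
    members: List[str]           # lemmatised forms whose senses share the synset
    definition: str
    relations: List[Tuple[str, str]] = field(default_factory=list)  # (type, target id)


synsets: Dict[str, SynSet] = {
    "tree": SynSet("tree", ["tree"], "a tall perennial wood plant ..."),
    "oak-tree": SynSet(
        "oak-tree",
        ["oak", "oak tree"],
        "a deciduous tree of the genus Quercus ...",
        relations=[("hyponymy", "tree")],
    ),
    "oak-wood": SynSet("oak-wood", ["oak"], "the hard durable wood of any oak"),
}


def hypernyms(synset_id: str) -> List[str]:
    """Follow hyponymy links upward and return the members of the target synsets."""
    found: List[str] = []
    for relation_type, target in synsets[synset_id].relations:
        if relation_type == "hyponymy":
            found.extend(synsets[target].members)
    return found


senses_of_oak = sorted(s.id for s in synsets.values() if "oak" in s.members)
```

The form `oak` appears in two synsets, one per sense, while `oak tree` shares only the first of them.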
Package for Multilingual representation
[UML class diagram. Classes: Sense, SynSet, SenseExample, Syntactic Behavior, Sense Axis, Sense Axis Relation, Transfer Axis, Transfer Axis Relation, Example Axis, Source Test, Target Test.]
: Sense Axis Relation
comment = flows into the sea
label = more precise
: Sense
label = eng:river
: Sense
label = fra:rivière
: Sense
label = fra:fleuve
: Sense Axis
: Sense Axis
Sense Axis Relation describes the link between two different Sense Axes
Source Test and Target Test make it possible to express conditions on the translation on the source/target language side
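The river example can be sketched with two sense axes and one relation between them. The grouping is an assumption, since the flattened diagram does not show it: `eng:river` and `fra:rivière` are placed on one axis, `fra:fleuve` on another, and the relation labelled "more precise" points from the first axis to the second. The `translate` helper is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SenseAxis:
    senses: List[str]  # sense labels of the form "lang:word"


@dataclass
class SenseAxisRelation:
    source: SenseAxis
    target: SenseAxis
    label: str
    comment: str = ""


river_axis = SenseAxis(["eng:river", "fra:rivière"])
fleuve_axis = SenseAxis(["fra:fleuve"])
axes = [river_axis, fleuve_axis]
relations = [
    SenseAxisRelation(river_axis, fleuve_axis, "more precise", "flows into the sea")
]


def translate(sense: str, lang: str) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return direct equivalents on the same axis, plus labelled related ones."""
    prefix = lang + ":"
    direct: List[str] = []
    related: List[Tuple[str, str]] = []
    for axis in axes:
        if sense not in axis.senses:
            continue
        direct += [s for s in axis.senses if s.startswith(prefix)]
        for relation in relations:
            if relation.source is axis:
                related += [
                    (s, relation.label)
                    for s in relation.target.senses
                    if s.startswith(prefix)
                ]
    return direct, related


direct, related = translate("eng:river", "fra")
```

The axis acts as an interlingual pivot: adding a third language means attaching its senses to existing axes rather than writing new pairwise links.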
Package for Multiword expressions
[UML class diagram. Classes: Lemmatised Form, List of Components, MWE Pattern, Combiner, Combiner Argument.]
: MWE Pattern
id = VPSomebodyPP
comment = for a pattern, VP somebody IndirectObject
: Lemmatised Form
writtenForm = throw to the lions
: Combiner
constituent = NP
semanticRestriction = human
: Combiner
head = true
constituent = VP
rank = 0
graphicalSeparator = space
: Combiner Argument
rank = 1
graphicalSeparator = space
: Combiner Argument
rank = 2
graphicalSeparator = space
: Combiner Argument
rank = 3
graphicalSeparator = space
: Combiner Argument
function = directObject
: Combiner Argument
function = indirectObject
: List of Components
: Lemmatised Form
writtenForm = throw
: Lemmatised Form
writtenForm = to
: Lemmatised Form
writtenForm = the
: Lemmatised Form
writtenForm = lion
: Combiner
constituent = PP
number = plural
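How a pattern such as VPSomebodyPP might be linearized can be sketched as below. The sketch simplifies the instance data: each combiner carries its own rank and separator, the ranks given to the NP and PP slots are assumptions, and the `SEPARATORS` table, the `fillers` dictionary and the `linearize` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical mapping from graphicalSeparator values to actual strings.
SEPARATORS = {"space": " "}


@dataclass
class Combiner:
    constituent: str
    rank: int
    graphical_separator: str = "space"
    head: bool = False


def linearize(combiners: List[Combiner], fillers: Dict[str, str]) -> str:
    """Order the combiners by rank and join their fillers with the separators."""
    ordered = sorted(combiners, key=lambda c: c.rank)
    parts: List[str] = []
    for index, combiner in enumerate(ordered):
        if index > 0:
            parts.append(SEPARATORS[combiner.graphical_separator])
        parts.append(fillers[combiner.constituent])
    return "".join(parts)


# Pattern VPSomebodyPP: head VP, then a human NP slot, then a PP.
pattern = [
    Combiner("PP", rank=2),
    Combiner("VP", rank=0, head=True),
    Combiner("NP", rank=1),
]
fillers = {"VP": "throw", "NP": "somebody", "PP": "to the lions"}
phrase = linearize(pattern, fillers)
```

The pattern is independent of the words that fill it, so other expressions with the same VP, human NP, PP shape could reuse it with different fillers.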