STO: A Lexical Database of Danish for Language Technology Applications
Anna Braasch
Center for Sprogteknologi, Copenhagen
SPINN Seminar, October 27, 2001
Background
EU-funded international projects
• EAGLES: recommendations for morphological and syntactic specifications for 9 languages
• GENELEX: development of a generic lexicon model
• PAROLE: development of harmonized WL resources (lexicon, corpus) for 12 languages
• SIMPLE: development of an ontology and model of semantic description for 12 languages

Follow-up
• Danish, nationally funded co-operative lexicon project: STO
Aims of the project
Monolingual aim
to eliminate the usual ’bottleneck problem’: lack of a large-size Danish lexical database for
• language technology applications
• computational language research purposes

Multilingual aim
to provide an elaborated Danish lexical database for
• linked bi- or multilingual databases for LT/NLP applications
• contrastive CL and lexicology research …
STO development objectives
Requirements of monolingual applications
• tailor the linguistic specifications for Danish
• add more language-specific features
• extend the linguistic and lexical coverage
• refine the lexicon structure
• develop customized, user-friendly interfaces

...but also requirements of multilingual linking
• keep the basic, harmonised lexicon structure
• keep the principles and language of lexical description
• be attentive to similar follow-up projects

’more Danish’ but still consistent with the other lexicons
The three linguistic layers of description
Main info types: 3 independent but linked layers

Morphology
• Inflection (pattern-based)
• Spelling
• Compounding

Syntax (totally pattern-based)
• Syntactic frame (complementation structures & functional properties, etc.)
• Control, raising (constructional properties)

Semantics (the layer of multilingual linking)
• Domain (= sublanguage, source area)
• Semantic relations (qualia)
• Specification of meaning (SIMPLE model + core ontology)
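One way to picture three independent but linked layers is as three small record types that point to each other. The sketch below is only an illustration under assumptions: the class names, field names, pattern IDs and the direction of the links (morphology to syntax to semantics) are hypothetical and are not taken from the STO or PAROLE specifications.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SemanticUnit:
    """Semantic layer: the layer used for multilingual linking."""
    domain: str                                  # sublanguage / source area
    ontology_type: str                           # hypothetical core-ontology label
    qualia: dict = field(default_factory=dict)   # semantic relations


@dataclass
class SyntacticUnit:
    """Syntactic layer: described purely by patterns."""
    frame_pattern: str                           # hypothetical frame pattern ID
    control: Optional[str] = None                # control/raising property, if any
    senses: list = field(default_factory=list)   # links to SemanticUnit objects


@dataclass
class MorphologicalUnit:
    """Morphological layer: inflection, spelling, compounding."""
    lemma: str
    inflection_pattern: str                      # hypothetical inflection pattern ID
    frames: list = field(default_factory=list)   # links to SyntacticUnit objects


# Illustrative entry for the Danish noun "hus" (house); every ID is made up.
hus_sense = SemanticUnit(
    domain="general",
    ontology_type="Building",
    qualia={"formal": "bygning", "telic": "bo"},
)
hus_frame = SyntacticUnit(frame_pattern="N_no_complement", senses=[hus_sense])
hus = MorphologicalUnit(lemma="hus", inflection_pattern="N_neuter_01",
                        frames=[hus_frame])
```

Because each layer is its own record, a pattern on one layer (for example the inflection pattern) can be revised without touching the other two; only the links between units tie them together.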
Between syntax and semantics
No clear-cut borderline: difficult to represent mutual dependencies in a strictly modular description.

Syntactic or semantic units?
• Collocations: combine features of complex structure, (morpho)syntactic constraints and slightly restricted compositionality (meaning transparency); strong subcategorisation and selectional restrictions ...
• Phrasal verbs: combine features of complex syntactic structure and compositional/non-compositional semantics …

Different representation strategies: ’early’ vs. ’late’
Linking lexicons at the semantic level
Basic method: link between L1-meaning and L2-meaning
Basic requirement: harmonized semantics (ontology, model & method)

Advantages:
• proper treatment of all lexical units, including homonyms, polysemes and complex lexical units (collocations, idioms)
• independent treatment of L1 and L2 with respect to morphology and syntax
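Linking at the meaning level can be sketched as a join on a shared concept label. The example below is a minimal illustration, assuming that each sense in both lexicons carries a label from one harmonized ontology; the concept labels, sense numbers and the function name `link_senses` are hypothetical.

```python
from collections import defaultdict

# (lemma, sense number) -> concept label in the shared ontology (all made up).
DANISH = {
    ("bank", 1): "FINANCIAL_INSTITUTION",
    ("bred", 1): "RIVERSIDE",
}
ENGLISH = {
    ("bank", 1): "FINANCIAL_INSTITUTION",
    ("bank", 2): "RIVERSIDE",
}


def link_senses(l1, l2):
    """Map every L1 sense to the L2 senses that share its concept label."""
    by_concept = defaultdict(list)
    for sense, concept in l2.items():
        by_concept[concept].append(sense)
    # .get() avoids creating empty entries for concepts missing from L2.
    return {sense: by_concept.get(concept, []) for sense, concept in l1.items()}


links = link_senses(DANISH, ENGLISH)
```

Here Danish "bank" is linked only to the financial sense of English "bank", while Danish "bred" is linked to the riverside sense. The link never consults morphology or syntax, which is what allows the two lexicons to be described independently on those layers.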
About the STO lexical database (V.1)

Point of departure: PAROLE material
• linguistic specifications elaborated (incl. also Danish)
• modular lexicon architecture developed
• information structure developed
• 20,000 general language lexicon entries encoded

Main STO development steps:
• tailor and refine the LingSpec’s for Danish
• improve the information structure (DB)
• add new entry types (complex lexical units, etc.)
• extend the vocabulary to 50,000 entries (~35,000 GL and ~15,000 LSP from 6-8 domains)
Progress report for 2001 (1)
New status: nationally funded co-operative project requiring
• more thorough project planning (incl. ’logistics’)
• more detailed information (guidelines, specifications, cross-checks, evaluation…)

Continuously ongoing supporting processes
• Updating and refinement of LingSpec’s
• Elaboration of an Encoding Manual
• Elaboration of various additional documentation (evaluation sheets, etc.)
• Revision of the database/info structure
Progress report for 2001 (2)
New supporting tools for lexicographers developed
• Encoding tools for morphological and syntactic info
• Browsers for retrieval of encoded info...

Number of entries encoded with
• morphological information ~50,000
• syntactic information ~23,000
• semantic information ~8,500 (from SIMPLE)

Other tasks (ongoing/finished)
• selected entries (on customer’s request) downloaded
• work on principles of statistically based selection of lemmas and syntactic constructions to be encoded
• corpus-related work
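The slides do not say which statistics drive the selection of lemmas. As a rough illustration only, the sketch below assumes the simplest possible criterion, plain corpus frequency, and skips lemmas that are already encoded. The function name `select_lemmas` and the toy corpus are hypothetical.

```python
from collections import Counter


def select_lemmas(corpus_lemmas, already_encoded, n):
    """Return the n most frequent corpus lemmas that are not yet encoded."""
    counts = Counter(
        lemma for lemma in corpus_lemmas if lemma not in already_encoded
    )
    return [lemma for lemma, _ in counts.most_common(n)]


# Toy lemmatized corpus: "hus" is frequent but already in the lexicon.
corpus = ["hus"] * 3 + ["bil"] * 3 + ["sprog"] * 2 + ["hund"]
candidates = select_lemmas(corpus, already_encoded={"hus"}, n=2)
```

The same counting scheme could in principle be applied to syntactic constructions instead of lemmas, provided the corpus is annotated with them.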
Progress report for 2001 (3)
Treatment of new entry types
• domain specific (LSP) entries
• compounds (decomposition and linking elements implemented)
• geographical proper nouns (inflectional and agreement properties investigated, the results are implemented)
• collocations (information structure designed)
• revision of the treatment of phrasal verbs
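Compound decomposition with linking elements can be sketched as a search for a split point where both parts are known lemmas, optionally separated by a linking element. This is a simplified illustration, not the STO implementation: it assumes two-part compounds, only the linking elements -s- and -e-, and a tiny hypothetical lexicon.

```python
# Linking elements tried between the two parts; "" means no linking element.
LINKING_ELEMENTS = ("", "s", "e")


def decompose(compound, lexicon):
    """Return all (first part, linking element, second part) analyses."""
    results = []
    for i in range(1, len(compound)):
        second = compound[i:]
        if second not in lexicon:
            continue
        left = compound[:i]
        for linker in LINKING_ELEMENTS:
            if linker and not left.endswith(linker):
                continue
            # Strip the linking element (if any) to get the candidate lemma.
            first = left[: len(left) - len(linker)]
            if first in lexicon:
                results.append((first, linker, second))
    return results


LEXICON = {"land", "by", "hund", "hus", "sprog", "teknologi"}
village = decompose("landsby", LEXICON)
kennel = decompose("hundehus", LEXICON)
langtech = decompose("sprogteknologi", LEXICON)
```

With this lexicon, "landsby" is analysed as land + s + by, "hundehus" as hund + e + hus, and "sprogteknologi" as sprog + teknologi with no linking element. Real compounds also involve stem changes and more than two parts, which this sketch deliberately ignores.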
Summing up the goals
STO will
• conform to ’general’ linguistic knowledge
• meet demands of a broad application and research area (size, selection of domains and vocabulary, detail of linguistic description…)
• satisfy monolingual language-specific requirements
• be potentially compatible with other lexical databases for future linking
• be reasonably easy to access, customize/use...
• fulfil the development contract and meet the production deadlines