1 STO A Lexical Database of Danish for Language Technology Applications Anna Braasch Center for Sprogteknologi Copenhagen SPINN Seminar, October 27, 2001



STO
A Lexical Database of Danish for Language Technology Applications

Anna Braasch
Center for Sprogteknologi, Copenhagen

SPINN Seminar, October 27, 2001


Background

EU-funded international projects
• EAGLES: recommendations for morphological and syntactic specifications for 9 languages
• GENELEX: development of a generic lexicon model
• PAROLE: development of harmonized WL resources (lexicon, corpus) for 12 languages
• SIMPLE: development of an ontology and model of semantic description for 12 languages

Follow-up
• Danish, nationally funded co-operative lexicon project: STO


Aims of the project

Monolingual aim
to eliminate the usual ’bottleneck problem’: lack of a large-size Danish lexical database for
• language technology applications
• computational language research purposes

Multilingual aim
to provide an elaborated Danish lexical database for
• linked bi- or multilingual databases for LT/NLP applications
• contrastive CL and lexicology research …


STO development objectives

Requirements of monolingual applications
• tailor the linguistic specifications for Danish
• add more language-specific features
• extend the linguistic and lexical coverage
• refine the lexicon structure
• develop customized, user-friendly interfaces

...but also requirements of multilingual linking
• keep the basic, harmonised lexicon structure
• keep the principles and language of lexical description
• be attentive to similar follow-up projects

’more Danish’ but still consistent with the other lexicons


The three linguistic layers of description

Main info types: 3 independent but linked layers

Morphology
• Inflection (pattern-based)
• Spelling
• Compounding

Syntax (totally pattern-based)
• Syntactic frame (complementation structures & functional properties, etc.)
• Control, raising (constructional properties)

Semantics (the layer of multilingual linking)
• Domain (= sublanguage, source area)
• Semantic relations (qualia)
• Specification of meaning (SIMPLE model + core ontology)
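The three-layer organisation can be pictured as a small data structure. The following is only a minimal sketch: the class names, field names, pattern identifiers and example values are assumptions made for illustration and are not taken from the STO specifications or database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MorphUnit:
    """Morphological layer: spelling plus a reference to an inflection pattern."""
    spelling: str
    inflection_pattern: str  # hypothetical pattern ID, not an STO identifier


@dataclass
class SynUnit:
    """Syntactic layer: a reference to a shared syntactic frame pattern."""
    frame_pattern: str  # hypothetical pattern ID


@dataclass
class SemUnit:
    """Semantic layer: domain, ontology type and qualia-style relations."""
    domain: str
    ontology_type: str
    relations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Entry:
    """One lemma: a morphological unit linked to syntactic and semantic units."""
    morph: MorphUnit
    syn: List[SynUnit] = field(default_factory=list)
    sem: List[SemUnit] = field(default_factory=list)


# Illustrative values only; none are taken from the STO database.
hus = Entry(
    morph=MorphUnit(spelling="hus", inflection_pattern="N_NEUT_1"),
    syn=[SynUnit(frame_pattern="N_NO_COMPLEMENT")],
    sem=[SemUnit(domain="general", ontology_type="Building",
                 relations={"telic": "bo"})],
)
```

Keeping each layer in its own class mirrors the idea that the layers are described independently but linked through the entry; because inflection and syntax are pattern-based, the entry stores pattern references rather than full paradigms.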


Between syntax and semantics

No clear-cut borderline: difficult to represent mutual dependencies in a strictly modular description.

Syntactic or semantic units?
• Collocations: combine features of complex structure, (morpho)syntactic constraints and slightly restricted compositionality (meaning transparency); strong subcategorisation and selectional restrictions ...
• Phrasal verbs: combine features of complex syntactic structure and compositional/non-compositional semantics …

Different representation strategies: ’early’ vs. ’late’


Linking lexicons at the semantic level

Basic method: link between L1-meaning and L2-meaning

Basic requirement: harmonized semantics (ontology, model & method)

Advantages:
proper treatment of all lexical units including
• homonyms
• polysemes
• complex lexical units (collocations, idioms)
independent treatment of L1 and L2 wrt. morphology and syntax
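Linking at the level of meanings rather than word forms can be sketched as follows. This is an illustration under stated assumptions: the sense identifiers and concept labels are invented stand-ins for a shared, harmonised ontology, and the function `link_senses` is hypothetical, not part of any STO tool.

```python
from collections import defaultdict

# Each sense is (language, lemma, sense_id, concept). The concept labels are
# hypothetical stand-ins for a shared, harmonised ontology.
SENSES = [
    ("da", "bakke", "bakke_1", "Hill"),
    ("da", "bakke", "bakke_2", "Tray"),  # homonym kept as a separate sense
    ("en", "hill", "hill_1", "Hill"),
    ("en", "tray", "tray_1", "Tray"),
]


def link_senses(senses, source_lang, target_lang):
    """Pair source and target senses that share the same ontology concept."""
    by_concept = defaultdict(list)
    for lang, lemma, sense_id, concept in senses:
        if lang == target_lang:
            by_concept[concept].append((lemma, sense_id))

    links = {}
    for lang, lemma, sense_id, concept in senses:
        if lang == source_lang:
            # Source senses without a target counterpart map to an empty list.
            links[sense_id] = by_concept.get(concept, [])
    return links


links = link_senses(SENSES, "da", "en")
```

Because each reading of the Danish homonym "bakke" is its own sense, the two readings link to different English lemmas, and nothing in the link depends on the morphology or syntax of either language.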


About the STO lexical database (V.1)

Point of departure: PAROLE material
• linguistic specifications elaborated (incl. also Danish)
• modular lexicon architecture developed
• information structure developed
• 20,000 general language lexicon entries encoded

Main STO development steps:
• tailor and refine the LingSpec’s for Danish
• improve the information structure (DB)
• add new entry types (complex lexical units, etc.)
• extend the vocabulary to 50,000 entries (~35,000 GL and ~15,000 LSP from 6-8 domains)


Progress report for 2001 (1)

New status: nationally funded co-operative project requiring
• more thorough project planning (incl. ’logistics’)
• more detailed information (guidelines, specifications, cross-checks, evaluation…)

Continuously ongoing supporting processes
• Updating and refinement of LingSpec’s
• Elaboration of an Encoding Manual
• Elaboration of various additional documentation (evaluation sheets, etc.)
• Revision of the database/info structure


Progress report for 2001 (2)

New supporting tools for lexicographers developed
• Encoding tools for morphological and syntactic info
• Browsers for retrieval of encoded info...

Number of entries encoded with
• morphological information ~50,000
• syntactic information ~23,000
• semantic information ~8,500 (from SIMPLE)

Other tasks (ongoing/finished)
• selected entries (on customer’s request) downloaded
• work on principles of statistically based selection of lemmas and syntactic constructions to be encoded
• corpus-related work
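The slides do not say which statistics guide the selection of lemmas. One common approach, shown here purely as an assumed illustration, is to rank lemmas by their frequency in a lemmatised corpus and keep those above a threshold. The function name, the threshold and the toy token list are all hypothetical.

```python
from collections import Counter


def select_lemmas(lemmatised_tokens, min_count=2):
    """Return lemmas occurring at least min_count times, most frequent first."""
    counts = Counter(lemmatised_tokens)
    # most_common() orders by descending count; ties keep first-seen order.
    return [lemma for lemma, n in counts.most_common() if n >= min_count]


# Toy lemmatised corpus; real input would come from a large tagged corpus.
tokens = ["hus", "bil", "hus", "bakke", "bil", "hus"]
selected = select_lemmas(tokens)
```

With this toy input, "hus" and "bil" pass the threshold and "bakke" does not. The same counting scheme could be applied to syntactic constructions instead of lemmas.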


Progress report for 2001 (3)

Treatment of new entry types
• domain specific (LSP) entries
• compounds (decomposition and linking elements implemented)
• geographical proper nouns (inflectional and agreement properties investigated, the results are implemented)
• collocations (information structure designed)
• revision of the treatment of phrasal verbs
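Compound decomposition with linking elements can be illustrated with a toy splitter. This is a sketch under assumptions, not the STO implementation: the inventory of linking elements, the tiny lexicon and the function `decompose` are invented for the example, and only two-part compounds are handled.

```python
LINKING_ELEMENTS = ("", "s", "e")  # assumed inventory; not the STO list


def decompose(compound, lexicon):
    """Return (first, linker, second) splits where both parts are in lexicon."""
    splits = []
    for i in range(1, len(compound)):
        first = compound[:i]
        if first not in lexicon:
            continue
        rest = compound[i:]
        for linker in LINKING_ELEMENTS:
            second = rest[len(linker):]
            if rest.startswith(linker) and second in lexicon:
                splits.append((first, linker, second))
    return splits


# Toy lexicon of simplex nouns, for illustration only.
LEXICON = {"land", "by", "mand", "jul", "træ"}

landsby = decompose("landsby", LEXICON)    # land + s + by
juletrae = decompose("juletræ", LEXICON)   # jul + e + træ
landmand = decompose("landmand", LEXICON)  # land + mand, no linking element
```

Storing the linking element separately from the two constituents lets each part point back to its own lexicon entry, which is one plausible reading of "decomposition and linking elements" on the slide.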


Summing up the goals

STO will
• conform to ’general’ linguistic knowledge
• meet demands of a broad application and research area (size, selection of domains and vocabulary, detail of linguistic description…)
• satisfy monolingual language-specific requirements
• be potentially compatible with other lexical databases for future linking
• be reasonably easy to access, customize/use...
• fulfil the development contract and meet the production deadlines