
HLT at the AILab, IMCS, UL

Artificial Intelligence Laboratory, Institute of Mathematics and Computer Science

University of Latvia

www.ailab.lv

27.10.2004

Agenda

Brief history of the Laboratory

Corpus linguistics

MT modelling

Speech synthesis for Latvian

Development of electronic dictionaries

Computer-assisted teaching aids

History

IMCS (UL) has been dealing with automated processing of Latvian for more than 15 years

The first activities concerned the development of a Latvian character coding standard (1989)

The AILab was founded in 1992

One of the main tasks of the Laboratory is to ensure the usage and processing of Latvian in computer systems

There are 5–10 people working at the Laboratory – research fellows and students of the Faculty of Physics and Mathematics (UL) and Faculty of Philology (UL)

Building up Latvian Corpus

Collection of Latvian resources was initiated at the end of the 1980s and the beginning of the 1990s:

at the very beginning texts were manually keyboarded

later they were scanned and optically recognised; a precision of 98.5% has been achieved

Ca 20 mill. running words covering different types of texts:

Pieces of classical Latvian literature: > 10 000 pages

end of the 19th, beginning of the 20th century


Pieces of Latvian folklore:

the biggest collection of Latvian beliefs

the biggest collection of Latvian fairy tales and legends

Electronic library of Latvian folkloristics

In collaboration with the Archives of Latvian Folklore, a collection of Latvian proverbs is being built (> 20 000 units)

Latvian Culture (in Latvian and English, rich in pictures)

Texts from the newspaper “Rīgas Balss” (1994–1997) in Latvian and Russian


Issues: fragmentation, mark-up, character set/font compatibility, copyrights

Mark-up:

plain text format;

HTML

ca 4 mill. running words with structured SGML mark-up

ca 1 mill. running words transformed from HTML to XML

development of software for semi-automated morpho-syntactical annotation is ongoing


Tools:

A morphological analyser has been developed, which offers all base forms of a given word form and vice versa

Using data from the “Reverse dictionary of Latvian”, experimental software for Latvian morphemic analysis has been created; it is supplemented with new rules during the development of morpho-syntactical annotation tools

In collaboration with Stockholm University, the morphemic analysis system has been developed on the basis of “A derivational dictionary of Latvian”


Work on a pilot morpho-syntactically annotated corpus has been started (1996–2000)

It covers approximately 10 000 manually annotated words of modern written Latvian

An experimental mark-up transformation tool (HTML to structural XML) has been developed

IMCS has passed the 1st phase of the ESF project competition to receive funding for supplementing and balancing the corpus:

+ 20 mill. running words

both old and contemporary texts

Work with Parallel Texts

IMCS took part in the EU joint action TELRI (Trans-European Language Resources Infrastructure) during 1995–2001:

A Latvian translation of Plato’s “Republic” has been added to the versions in 14 other European languages, and a CD “East meets West” has been produced with these aligned parallel texts

Orwell’s novel “1984” aligned at sentence level is available from Tractor, TELRI Research Archive of Computational Tools and Resources (www.tractor.de)

The Vanilla aligner developed at Göteborg University (Danielsson and Ridings, 1999), which uses the algorithm of Gale and Church (1993) and operates on character counts, is used to align these texts
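The length-based idea behind this kind of alignment can be sketched as follows. This is a minimal illustration in the spirit of Gale and Church (1993), not the code of the aligner used at the Laboratory. The function names `match_cost` and `align` are hypothetical, and the constants (alignment-type priors, variance 6.8, length ratio 1) are the values commonly quoted from the Gale and Church paper, assumed here rather than tuned for Latvian.

```python
import math

# Prior probabilities of alignment types (source sentences, target sentences),
# as commonly quoted from Gale and Church (1993); assumed, not tuned for Latvian.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # assumed expected number of target characters per source character
S2 = 6.8   # assumed variance of that ratio per character


def match_cost(len_src, len_tgt):
    """Negative log-probability that spans of these character lengths match."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal variable.
    tail = math.erfc(abs(delta) / math.sqrt(2))
    return -math.log(max(tail, 1e-12))


def align(src, tgt):
    """Align two lists of sentences by dynamic programming over character counts."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                total = cost[i][j] + match_cost(len_src, len_tgt) - math.log(prior)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the aligned groups.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align(source_sentences, target_sentences)` returns a list of pairs of sentence groups. Because only character counts are compared, no dictionary is needed, which is why this approach suits language pairs with few existing resources.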

Work with Parallel Texts

Thanks to the collaboration with Translation and Terminology Centre (www.ttc.lv), there is a possibility to work with English-Latvian parallel texts:

a small pilot English-Latvian parallel corpus (legal texts) with ca 100 000 words per language, aligned at sentence level, was built in 2001

corpus-based analysis of English multi-word verb units and their Latvian translation equivalents has been carried out, as well as some translation studies

ways of applying the information gained from parallel texts in MT systems have been examined

Corpus of Early Written Latvian Texts

The project was initiated in the mid-1990s, when the most significant sources were keyboarded

The aim of the corpus is:

to promote and facilitate the diachronic study of Latvian

to offer computerised material to those interested in the development and varieties of the language

to serve as a basis both for the dictionary of the 17th century (in the near future) and for a Latvian thesaurus (in the distant future)

A pilot project on the development of an electronic dictionary of the early written texts was initiated in 2002; it contributed towards building up the first publicly available Latvian corpus


Statistics:

consists of texts from the 16th to the 18th century

ecclesiastical texts mainly

ca 900 000 running words

A number of conventions have been introduced for:

coding special characters (different accents etc.) using compound characters of the Baltic Windows code page

structural annotation: foreign-language fragments, cross-references to other parts of the text, structural containers, errors and other elements


Acquisition of the corpus: text typing or scanning → adding some structural mark-up → automated verification and (pre)processing

On-line end-user tools:

content navigation through different dimensions: periods of time, sources, authors and text types

search in the word form index by providing a word pattern (several criteria and bounding of scope can be applied)

a KWIC concordance

automatic context positioning of retrieved running words (by search in index and concordance)

some statistical tools: frequency lists, word lists, etc.
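The word-pattern search, KWIC concordance and frequency list mentioned on this slide can be illustrated with a short sketch. It is not the Laboratory's on-line tool: the function names `kwic` and `frequency_list` are hypothetical, the pattern syntax of the real tool is not given in the slides (regular expressions are assumed here), and tokens are simply runs of word characters.

```python
import re
from collections import Counter


def kwic(text, pattern, width=30):
    """Return one keyword-in-context line per word form that matches `pattern`."""
    regex = re.compile(pattern)
    lines = []
    for match in re.finditer(r"\w+", text):
        if regex.fullmatch(match.group()):
            # Cut a fixed window of characters on both sides of the keyword.
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines


def frequency_list(text):
    """Return (word form, count) pairs, most frequent first, ignoring case."""
    return Counter(w.lower() for w in re.findall(r"\w+", text)).most_common()
```

For example, `kwic(corpus_text, r"diev\w*")` would list every word form beginning with "diev" centred in its context, where `corpus_text` stands for any plain-text sample. Padding the left context to a fixed width keeps the keywords in one column, which is what makes a concordance easy to scan.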

Machine Translation

In 1993 work on a limited MT model was started

An interlingua MT model LATRA (SWETRA, Lund University) for translation of stock market texts (Latvian-English-Latvian) has been developed:

translates basic types of declarative sentences

problem – disambiguation

Supported by the Latvian Council of Science:

“Limited Model of Automated Machine Translation System for Latvian” (1993–1996)

“Development of Probabilistic Methods for Automated Disambiguation of Natural Language Texts and Applications for Machine Translation” (1997–1999)

Machine Translation

In 1997 the Laboratory joined the Universal Networking Language (UNL) project (www.undl.org):

artificial language to overcome the language barrier

consists of a dictionary of universal words, relations, knowledge base, attributes

an experimental grammar for deconversion of UNL texts into Latvian has been developed

“Automated synthesis of language independent text representation”, funded by the Latvian Council of Science

Perspective:

semantic aspects of text analysis

development of a general purpose translation system

implementation of semantic types proposed in the SIMPLE (Semantic Information for Multifunctional Plurilingual Lexica) project

Development of Electronic Dictionaries

A number of Latvian dictionaries have been keyboarded or scanned

Some special computerised lexicons have been prepared to meet the needs of particular projects:

for UNL, an English-Latvian machine-readable lexicon of ca 10 000 entries has been made; the grammatical information of Latvian entries is presented in a formalised way

An initiative to develop a new electronic dictionary to cover as many Latvian words and their meanings as possible

in order to achieve this, the main lexicographical resources are being processed


On-line versions of several dictionaries:

an explanatory dictionary of contemporary Latvian (ca 35 000 entries)

Mühlenbach-Endzelin’s “Lettisch-deutsches Wörterbuch” (ca 130 000 entries, very complex structure and character set)

Latvian-Russian-Latvian bilingual dictionary for students (ca 70 500 entries)

an internet Term Bank with ca 115 000 Latvian terms and their translation equivalents (Russian, German, English, Latin); developed in 1998

this was carried out for the Translation and Terminology Centre using the Trados MultiTerm platform


There is a need to move to standardised dictionary encoding (and development)

Existing lexicographical sources have to be converted into a widely compatible format

Development of a universal, metamodel-based on-line environment for dictionary production and publishing, serving both parties: dictionary creators/providers and end-users (humans as well as software agents)

funding from the Latvian Council of Science has been assigned

On the basis of Latvian Corpus and various machine-readable/understandable dictionaries – extraction of the Latvian WordNet

Work with Speech

In a project “ONOMASTICA-COPERNICUS” (1995–1997) ca 250 000 Latvian proper names were transcribed using IPA

In 2001 work on the development of a corpus of spoken language was started (funded by the Latvian Council of Science):

ca 1 300 phrases, words and sentences spoken by 15 persons (5 men, 7 women and 3 children)

an 8-hour recording of a seminar (2 women performing simultaneous interpreting)

special text of ca 1000 words read by 50 persons (29 women and 21 men)

speech is transcribed and text segmentation is performed using Transcriber software (Edinburgh University)

Work with Speech

In order to use these data in speech synthesis and speech recognition systems, grapheme-phoneme transcription software has been developed:

> 300 grapheme-to-phoneme rules

The machine-readable transcription presents:

consonant assimilation in sonority

consonant assimilation in place of articulation

vocalization

vowel weakening

The machine-readable transcription does not present: word stress, syllable intonation, sentence intonation
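One of the rule types listed above, consonant assimilation in sonority, can be illustrated with a toy sketch. It is not taken from the Laboratory's rule set of more than 300 rules. It assumes a single regressive rule (an obstruent takes on the voicing of the obstruent that follows it), a deliberately small table of voiced/voiceless pairs, and ordinary letters standing in for phonemes rather than a real phonemic alphabet. The function name `assimilate_voicing` is hypothetical.

```python
# Assumed, incomplete table of voiced/voiceless obstruent pairs.
VOICED_TO_VOICELESS = {"b": "p", "d": "t", "g": "k", "z": "s", "ž": "š"}
VOICELESS_TO_VOICED = {v: k for k, v in VOICED_TO_VOICELESS.items()}


def assimilate_voicing(word):
    """Apply regressive voicing assimilation to adjacent obstruents in a word."""
    chars = list(word.lower())
    # Walk right to left so that a change can propagate through a cluster.
    for i in range(len(chars) - 2, -1, -1):
        following = chars[i + 1]
        if following in VOICELESS_TO_VOICED and chars[i] in VOICED_TO_VOICELESS:
            chars[i] = VOICED_TO_VOICELESS[chars[i]]
        elif following in VOICED_TO_VOICELESS and chars[i] in VOICELESS_TO_VOICED:
            chars[i] = VOICELESS_TO_VOICED[chars[i]]
    return "".join(chars)
```

Under these assumptions `assimilate_voicing("labs")` yields "laps" and `assimilate_voicing("atdot")` yields "addot". A full transcription system would order many such context-dependent rules and would also need the other phenomena listed above.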

Work with Speech

The accuracy of automatically obtained phonemic transcription is approximately 92%

An experimental Latvian TTS (Text-to-Speech) synthesizer has been developed

The speech segment database has been prepared

The program uses not only diphones and half phonemes, but also triphones, phonemes etc.

The next stage will be creation of the prosodic model in order to avoid monotonous synthesized speech and to develop a complete TTS system

Quality improvement and supplementation of the speech corpora are planned for next year

Computer-assisted Teaching Aids

Since 1998 the AILab has been taking part in the project “Latvian education informatization system”

Special emphasis is put on non-native speakers of Latvian

For deaf students a sign language dictionary has been developed

A software tool helps learners master the pronunciation of single sounds, sound combinations or even short words

A Latvian word analyser and synthesizer is available on-line

Computer-assisted Teaching Aids

Development of e-books and e-courses of Latvian since 1998

E-course of the Latvian language for secondary schools:

widely covers theory of language

more than 600 interactive exercises with automatic testing

Latvian for primary schools:

expounded using animation, interactive exercises and tests

Teaching aids for foreigners:

theory and exercises

methodical guidelines for Russian teachers

interactive course “What have you said?” (17 themes, games and exercises; animation and sound)

Conclusions & Future Perspective

Fields that are and will remain of greatest interest to the Artificial Intelligence Laboratory, IMCS, UL:

Building up Latvian corpora and tools for their analysis:

spoken and written language

monolingual and multilingual

Work on electronic dictionary and software development

Latvian WordNet

MT modelling

Latvian text to speech synthesizer and speech recogniser

Thank You!
