

Part VI Cross-Cutting Issues

Cooperative GMS ICT Framework 

Case study: Construction of Lao Lexical Resources

Vincent Berment¹, Houmphanh Thongvilu²

¹ GETA, CLIPS, Joseph Fourier University

385 avenue de la Bibliothèque, Saint-Martin-d'Hères, France, 38400

[email protected]

² LaoNet - Lao Human Resources Network

Melbourne, Australia

[email protected]

Abstract

The case study of Lao resources development offers a window into a process by which a GMS organization can play an active role in regional ICT development, enhancing cooperation through a common platform and tools, for the benefit of its member countries, and reducing the digital and information divide. It is observed that most of the languages spoken in the GMS region (e.g. Khmer, Lanna, Lao, Lü, Tai Dam) are still weakly computerized or neglected. Among the different steps, one of the heaviest tasks is probably the development of lexical resources. In this paper, we present Montaigne-Lao, an approach for building miscellaneous Lao resources based on a cooperative effort on the web. We then discuss how it can contribute to a collaborative network for the GMS languages.

Keywords: Natural Language Processing, Generalized Linguistic Contribution, Minority language, Internet, Word Processing, Virtual Keyboard, Dictionary, Machine Translation, Unsegmented Writing

Generalized linguistic contribution

During the last ten years, significant efforts have been made to produce efficient linguistic tools for previously unsupported languages. However, this is still limited to several tens of languages (less than 1%; Berment 2002b). In particular, most of the languages in the Greater Mekong Subregion (GMS) are still poorly computerized, even though the most affluent GMS countries such as China, Thailand and Vietnam have greatly benefited from the computerization of their languages. The developing nations in the GMS have limited resources and more immediate priorities, so they cannot afford significant research and development in computational linguistics. Profit-driven research and development also often leaves these needs unanswered.

Our point of view is that computational linguistic resources can be efficiently obtained through collaborative work on the web (Boitet 1999), replacing a local development team with a free and potentially much larger distributed team. This idea of a “Generalized Linguistic Contribution” (GLC) on the web, already present in an early Montaigne project (1996), has been applied by Oki to the Japanese language¹ (Shimohata 2001), by NII and NECTEC to the Saikam Japanese-Thai dictionary project² and by miscellaneous contributors to the Papillon multilingual dictionary project³. Before that, in 1995, the LaoNet cyber-team had already developed the so-called Lao Romanization Transcription (LRT)⁴.

In this paper, we first present Montaigne-Lao, a cooperative project for developing Lao language resources whose initial framework consists of a discussion forum, a dictionary with a translation support environment, and an open source virtual keyboard, as well as two ongoing projects, text to speech and Lao New Coding. In a second part, we discuss the prospect of a Montaigne-GMS project that would be based on a principle of exchange.

The efficiency of the GLC concept presented here is illustrated by the fact that the authors of this paper had never met before the Regional Conference on Digital GMS in 2003, although they had worked together for more than one year and achieved significant progress thanks to their exchanges.

1 : http://www.yakushite.net/
2 : http://saikam.nii.ac.jp/
3 : http://www.papillon-dictionary.org/ The Papillon project already includes six languages: English, French, Japanese, Malay, Thai and Vietnamese.
4 : http://www.lao.net/html/LRT.html

342  Berment, V. and Thongvilu, H. 


A case study: the Montaigne-Lao project⁵

1. General description

Basically, the Montaigne project's idea is to offer a free cooperative work facility on the web for developing linguistic resources and machine translation tools. The Montaigne-Lao project is the first implemented step of this concept and has Lao as its target language.

One important question is how to go about developing a collaborative effort targeting specific needs, or in other terms how to harvest the goodwill and channel the energies towards producing workable solutions. The cyber-forum described in the next section is the way we propose for fostering such participation and for guiding it. It is the basis of an ICT cyber-cooperation.

More generally, we describe here three already existing components of the Montaigne-Lao project:

- Forum,
- Dictionary,
- Virtual keyboard,

and two projects:

- Text to speech,
- Lao New Coding.

We would like to emphasize that, although these existing or prospective developments were initially driven separately⁶, our discussions on the web allowed synergies and technical emulation to emerge.

2. Forum & mailing list: a method to emulate the cooperation

It is often said that the Internet has modernized workgroup relationships across cyberspace. However, a forum does not necessarily become a project team. What are the ingredients that would attract like-minded people to consolidate their efforts, instead of projects being worked on in isolation, duplicating efforts and resources?

The keys are: a central contact point, shared goals, commitment, and participants. Obviously participants are very important, as they are the means and the fuel that keep the central contact forum going. Even if some people think it is a waste of effort, such a forum can quickly expand and sometimes replace the traditional networking of peer groups.

To create such a forum is to enlist a group of people with the same interest. Internet newsgroups are a good place to start recruiting, as are private mailing lists. If such a contact point does not exist, then individuals or organizations should allocate resources to set it up, to foster dialogue among peers, and to encourage people to take ownership of the issues and problems and thereafter to search for and identify ways and means of realizing the goals.

5 : http://sabaidi.imag.fr/
6 : Mostly in France: dictionary and LaoWord/virtual keyboard, and in Australia: forum, Lao New Coding and text to speech.

The initial step requires coaching and mentorship by an experienced teamwork overseer to manage the evolution of the group and keep the group focused.

Like any community, cyber-communities suffer from a weak "signal to noise ratio" and limited effectiveness. It is often the case that groups of people break away into more compatible teams working on some common goals.

The driving force is not always money or fame, but the desire for the common good, so the 'why' is answered by the 'why not'. Often the effort of participants, mostly in time, can expand their horizons and bring new challenges, as well as unforeseen solutions.

The resourcefulness and contacts of the mentors, who can see to the group's needs, can steer the group up to some level of self-management. "If there is a will there is a way", and good resources make it easier. Once the means and tools are available for wider participation, the virtual workgroup can be replicated in other fields. Trust in the process brings peer groups to work on common problems for mutual benefit.

The LaoNet forum was modeled on such cyber-cooperation. Despite many difficulties and a lack of financial resources, we have managed to achieve some milestone results: Lao Romanization for Transliteration (LRT), 1995; the STEA Lao IT Technology seminar, Vientiane (VTE), August 1995, leading to the introduction of the Internet into Laos; the first Lao History Symposium at Berkeley, USA, in March 2003; the STEA PAN-Laos ICT affairs seminar, VTE, March 2003. The forum has brought many people to exchange ideas across cyberspace and to work together without ever needing to meet physically.

3. Dictionary and Translation Support Environment (DTSE)

3.1. Functional architecture

The following three tasks make up the initial step of the Montaigne-Lao dictionary project:

a) A group of skilled persons builds lexical entries, partly starting from existing paper dictionaries.
b) A panel of specialists derives a reference on-line dictionary from the gathered lexical entries.
c) Various tools and applications are developed to utilize the resource.

An original point is that, in parallel with this work of specialists, lexical contributions coming from a worldwide public are also compiled. This process relies on a translation support environment provided on the web site. In particular, a Lao-French bilingual editor is offered to visitors, together with a word-for-word translation service.


[Figure: Bilingual editor]

[Figure: Word for word translation]

To help them work out their personal translations, users can thus ask for a word-for-word translation of their Lao text. If a word is missing or is not correctly translated, they can also add new or improved entries to a personal dictionary that will be used in priority for their later translations. These personal dictionaries, as well as the personal bilingual texts themselves, are stored on the server and can only be seen by their owner. Then, in order to improve the reference on-line dictionary (and therefore the word-for-word translation service), users can propose that entries of their personal dictionaries be transferred into the main dictionary. This transfer is done under the control of the linguists' panel.
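As an illustration of the lookup order described above, the following minimal Python sketch gives the personal dictionary priority over the reference dictionary and transfers proposed entries only when a reviewer approves them. The function names, the `?` marker for unknown words and the sample entries are assumptions made for this sketch; they are not taken from the Montaigne-Lao implementation.

```python
def translate_word(word, personal, reference):
    """Return the translation of one word, or None if it is unknown.

    The personal dictionary is consulted first, then the reference one.
    """
    if word in personal:
        return personal[word]
    return reference.get(word)


def word_for_word(words, personal, reference, unknown="?"):
    """Translate an already segmented text word by word."""
    result = []
    for word in words:
        translation = translate_word(word, personal, reference)
        result.append(translation if translation is not None else unknown)
    return result


def review_proposals(proposals, reference, approve):
    """Transfer approved (word, translation) pairs into the reference
    dictionary and return the list of rejected pairs.

    `approve` stands in for the decision of the linguists' panel.
    """
    rejected = []
    for word, translation in proposals:
        if approve(word, translation):
            reference[word] = translation
        else:
            rejected.append((word, translation))
    return rejected


# Illustrative sample data (assumed for this sketch).
reference = {"ນ້ຳ": "eau"}
personal = {"ນ້ຳ": "eau, rivière", "ເຮືອນ": "maison"}
text = ["ນ້ຳ", "ເຮືອນ", "ຟ້າ"]

with_personal = word_for_word(text, personal, reference)

# The user proposes one personal entry; the panel accepts it here.
rejected = review_proposals([("ເຮືອນ", "maison")], reference,
                            approve=lambda word, translation: True)
without_personal = word_for_word(text, {}, reference)
```

After the review step, the accepted entry is available to every user through the reference dictionary, while the unreviewed personal variant of the first word stays private.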

[Figure: Functional architecture]

The Montaigne-Lao dictionary is planned to be exported to the Papillon dictionary⁷.

3.2. Lexicographic issues

From a lexicographic point of view, the project follows the concepts of explanatory and combinatorial lexicology (ECL)⁸, so the lexicographic content of an entry includes the lexical item itself and its part of speech, optional examples and idioms, as well as a semantic formula, the government pattern and lexical functions.

[Figure: Lexical items input page]

However, fields have been added in order to derive other applications from the database (paper dictionaries or machine translation tools). The initial lexicographic effort was handled by a group of lecturers and students of the Institut National des Langues et des Civilisations Orientales⁹ (Paris, France), and the entries are currently being reviewed by the panel of linguists.

3.3. Technical architecture

The dictionary is stored as a MySQL database table, as are the contributors' profiles. C code is used for segmenting Lao texts into words¹⁰ and for sorting the dictionary¹¹. It uses a syllable recognition technology (Berment 1998) and a longest matching algorithm (e.g. Meknavin et al. 1997). Text input is possible with the two currently used Lao keyboard layouts thanks to JavaScript and to TextArea or Input HTML form controls. The LaoUniKey input tool presented in the next section will be used once text input is done directly in Unicode.
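The longest matching principle mentioned above can be sketched as follows. This is a simplified Python illustration, not the project's C code: it matches directly on characters rather than on recognized syllables, and the three-entry lexicon is a toy assumption.

```python
def segment(text, lexicon):
    """Greedy longest-matching segmentation of an unsegmented text.

    At each position the longest lexicon entry that matches is taken;
    a character that starts no known entry is emitted on its own.
    """
    max_len = max((len(entry) for entry in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in lexicon:
                words.append(candidate)
                i += n
                break
        else:
            # No entry matches here: fall back to a single character.
            words.append(text[i])
            i += 1
    return words


# Toy lexicon (assumed for this sketch).
LEXICON = {"ພາ", "ພາສາ", "ລາວ"}
SAMPLE = "ພາສາລາວ"
segments = segment(SAMPLE, LEXICON)
```

With this lexicon, the longer entry wins over its two-character prefix at the start of the sample, so the text is cut into two words rather than three pieces. A greedy strategy can still choose a wrong boundary when a long match prevents a better later one, which is one reason why the cited work combines it with other information.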

7 : http://www.papillon-dictionary.org/
8 : On this matter, see André Clas, Igor Mel'cuk and Alain Polguère's book, Introduction à la lexicographie explicative et combinatoire, Duculot 1995.
9 : http://www.inalco.fr
10 : Lao is written from left to right with an alphabet deriving from Indian scripts. A major characteristic of Lao writing is that words are not separated by spaces, as in Khmer, Thai or Burmese writing.
11 : Another important characteristic of Lao writing is that some vowels are placed before the consonant. This contributes to making the automatic sorting of Lao dictionaries more complex.
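Footnote 11 can be illustrated with a small sort key. The Python sketch below assumes that the only reordering needed is to move one of the five Lao vowels written before the consonant (U+0EC0 to U+0EC4) behind the character that follows it. Real dictionary orders involve further rules (clusters, tones, final consonants) that are ignored here, so this is not the project's sorting code.

```python
# Lao vowel signs that are written, and stored in Unicode, before the
# consonant they belong to: U+0EC0 .. U+0EC4.
PREPOSED_VOWELS = {"\u0ec0", "\u0ec1", "\u0ec2", "\u0ec3", "\u0ec4"}


def sort_key(word):
    """Return the word with each preposed vowel swapped behind the
    character that follows it, so that the consonant is compared first."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in PREPOSED_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


# Three sample strings given as code points:
# KHO + AA, AI + KO, KO + AA.
words = ["\u0e82\u0eb2", "\u0ec4\u0e81", "\u0e81\u0eb2"]

naive = sorted(words)                    # plain code point order
reordered = sorted(words, key=sort_key)  # consonant-first order
```

In plain code point order the word starting with a preposed vowel is sorted after all words starting with a consonant; with the key it is grouped with the other word whose consonant is KO.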


4. Virtual Keyboard (VK)

Deriving from the Lao word processor called LaoWord, which is an add-in for Microsoft Word, a virtual keyboard was developed to provide a general Unicode input device for Lao: LaoUniKey¹². LaoUniKey is part of the Montaigne project both because it is available on-line and because its source code is available under the GPL license.

Its code has already been used as the basis of an ongoing Montaigne-Lao project addressing an improved Lao keyboard evaluator. Due to a lack of technology native to operating systems that can handle foreign language scripts, the enhanced VK is designed to create a smart pre-emptive keyboard across applications and platforms. In the first place, it simply interprets and maps the keystrokes to character codes (8 bits or 16 bits Unicode). By implementing keystroke buffering, keystroke history recall can be implemented. By adding a lookup in a keystroke translation table, macro programming of keystrokes is achieved. A single macro keystroke can be expanded to a stream of keystrokes, paving the way for the multi-element vowel sounds of a phonetic virtual keyboard.

By enhancing the keystroke buffering with some intelligent rules or a dictionary lookup, word corrections could be pre-empted.
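The three mechanisms described in this section (keystroke mapping, keystroke buffering, macro expansion) can be sketched in a few lines of Python. The class name, the key assignments and the macro below are hypothetical and chosen only for the illustration; they do not reproduce the LaoUniKey layout or its C implementation.

```python
from collections import deque


class VirtualKeyboard:
    """Toy keyboard: macro expansion, then key-to-character mapping,
    with a bounded buffer of the raw keystrokes."""

    def __init__(self, keymap, macros=None, history_size=32):
        self.keymap = keymap                 # key -> character
        self.macros = macros or {}           # key -> list of keys
        self.history = deque(maxlen=history_size)

    def press(self, key):
        """Record the keystroke and return the characters it produces."""
        self.history.append(key)
        # A macro key is expanded to a stream of ordinary keystrokes.
        expanded = self.macros.get(key, [key])
        # Keys without a mapping are passed through unchanged.
        return "".join(self.keymap.get(k, k) for k in expanded)


# Hypothetical layout: U+0E81 LAO LETTER KO, U+0EB7 LAO VOWEL SIGN YY,
# U+0EAD LAO LETTER O.
KEYMAP = {"k": "\u0e81", "Y": "\u0eb7", "o": "\u0ead"}
# Hypothetical macro: one key produces a two-element vowel.
MACROS = {"O": ["Y", "o"]}

keyboard = VirtualKeyboard(KEYMAP, MACROS)
typed = "".join(keyboard.press(key) for key in "kO")
```

Two keystrokes produce three characters here, and the history buffer keeps the two raw keystrokes, which is the information a recall or correction rule would work on.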

The virtual keyboard is written in C and is limited to Windows platforms at the moment. Later on, by implementing it in a platform-independent language such as Java, it will be possible to create a smart keyboard across platforms.

5. Text To Speech (TTS)

A text to speech synthesizer (TTS) is currently being worked on to broaden the appeal and usefulness of the resource and to attract an audience that would sustain and contribute to the growth of the online dictionary. It is planned that the service will be an extension of the dictionary server, where a client application can make a word-to-sound lookup request. Such an approach removes the need for a voice synthesizer engine at the client end. It will also foster simple voice-capable front-end applications, e.g. a language teaching web page.

The voice synthesizer is based on the public domain Festival system (http://www.festvox.org/). Although a full native Lao word-to-sound library is yet to be constructed, an intermediate solution has been adopted to demonstrate the concept. With Thai TTS already being more advanced, we hope that a future GMS forum could see Lao TTS benefiting from NECTEC's TTS work.

The current implementation is done with the Lao text being transcribed into roman characters, employing an earlier work on romanization for transliteration (LRT).

The transcription engine was carefully coded to follow Lao grammar closely, so as to guarantee a reversible transcription.

12 : LaoUniKey's technical principles are described in (Berment 2002a), section 4: "An input method using hooks".

Once the text is transcribed, the romanized Lao text is fed into a Festival voice synthesizer engine. Minor modifications to the sound rules of the Standard English sound library are sufficient for the engine to produce a workable Lao text-to-voice system.

Future Lao TTS work could share the basis of this approach in terms of word construction, word breaks, sentence formation, and utterance rules.

6. Lao New Coding (LNC)

6.1 Lao writing basics

Lao writing is derived from Indian Sanskrit script, and many words are borrowed to build Lao vocabulary. A word can be made of a single syllable or of multiple syllables. The syllable is formed by: [consonant / consonant cluster] [vowel / derived vowel] [consonant modifier (optional)] [tone (optional)]; this structure is referred to herein as a word-cluster.

Lao vowels are composed of multiple sub-elements; each element occupies a predefined position within the word-cluster, and some elements precede and displace the consonant when invoked. Such compounding of symbols is not supported for Lao within any operating system, and the lack of standards has bred many incoherent implementations that have confused the computerization of the Lao language.

It is interesting to note that the term "Lao grammar" is classically defined as the rules for Lao speech and writing and for word construction, with little on the inter-linkage of words. Multi-word words are built by joining a number of words. Lao grammar does not have extensive word identifiers to explicitly define word relationships in multi-word constructions.

Traditionally, a Lao text is not punctuated, other than being organized in paragraphs made of long continuous streams of characters.

So when 'Lao grammar' is referred to in this text, we are referring to the word-cluster construction rules.

6.2 Current technology

In current technology, Lao words require some five bytes on average. The keystroke or byte stream of a Lao word requires a correct sequence order to correctly position the elements within the word-cluster. A possible solution is to assign extra bytes that would preserve the elements' attributes and position information. The recent font standard OpenType¹³, developed by Adobe and Microsoft, combines TrueType and PostScript technologies to provide new typographic features, such as the desirable capability of compounding elements, and promises a possible solution to Lao script problems. Reliance on keystroke sequence alone could complicate lexical search and matching of words.

6.3 New encoding

LNC offers a new way of correctly storing the position of symbol elements at bit level within the word-cluster, in two to three bytes compared with some five bytes on average with current technology. By using fewer bytes for data representation and storage, LNC can also be regarded as a compression technique that is tightly related to Lao grammar.

13 : http://www.adobe.com/type/opentype/main.html

LNC is pure research work and has not yet been proven in any product. It is not designed to replace character encodings such as MS code pages or Unicode, but rather to complement them so that the data representation is compact and fast to transmit. A cross-conversion engine or driver program could transparently translate the data. However, raw LNC data would not be readily readable. LNC will allow an optimized use of memory in applications, e.g. Virtual Keyboard buffers, with fast word search and recall. LNC can guarantee a unique representation of the entered words regardless of the sequence of keystrokes.
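The paper does not specify the LNC bit layout, so the following Python sketch only illustrates the general idea with an invented one: each slot of the word-cluster (consonant, vowel, consonant modifier, tone) gets a fixed bit field holding an index into a table, here 6 + 5 + 3 + 2 = 16 bits. The field names and widths are assumptions for the illustration and should not be read as the actual LNC format.

```python
# Hypothetical fixed-width layout, most significant field first.
FIELDS = (("consonant", 6), ("vowel", 5), ("final", 3), ("tone", 2))


def pack(slots):
    """Pack slot indexes into one integer; missing slots are stored as 0."""
    code = 0
    for name, bits in FIELDS:
        value = slots.get(name, 0)
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} index {value} does not fit in {bits} bits")
        code = (code << bits) | value
    return code


def unpack(code):
    """Recover the slot indexes from a packed integer."""
    slots = {}
    for name, bits in reversed(FIELDS):
        slots[name] = code & ((1 << bits) - 1)
        code >>= bits
    return slots


def pack_keystrokes(keystrokes):
    """Pack (slot, index) pairs given in any typing order."""
    return pack(dict(keystrokes))


# The same word-cluster entered in two different keystroke orders.
first = pack_keystrokes([("consonant", 1), ("vowel", 2), ("final", 3), ("tone", 1)])
second = pack_keystrokes([("tone", 1), ("vowel", 2), ("consonant", 1), ("final", 3)])
stored = first.to_bytes(2, "big")
```

Because every element has its own position in the code, both keystroke orders give the same two-byte value, which is the property the text attributes to LNC: one representation per word-cluster, whatever the typing sequence.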

7. Available on-line resources

Before developing dictionaries, text-to-speech or other more sophisticated resources, a written language first needs the basic tools that allow it to be written on computers. For languages such as Lao, which has its own alphabet, those primary tools are fonts and virtual keyboards.

Lao language computerization has been industry driven, implemented with varying degrees of success and acceptance. It is mainly based on the Microsoft Windows platform, with two cross-application products dominating the Lao MS Windows landscape: LaoWord¹⁴ and LSWIN¹⁵. On Unix platforms, the project of adding foreign language support, i.e. internationalization (I18N)¹⁶ or localization of the Lao language at the level of commands and messages, is approaching fruition. Thereafter, applications that are compiled with I18N awareness will have full support of the language.

The MS Windows LaoWord and LSWIN solutions are mainly based on solving keyboard input of the Lao language with their respective keyboard translation applications: LaoKey and Keyman. Both input applications have been updated to allow Unicode entry (LaoUniKey is the version for LaoWord).

Lao TrueType fonts have been developed since 1992, mainly thanks to Fontographer. Most of them are free and can be downloaded from the Internet. Credit goes to the many font designers (such as John Durdin, author of LSWIN) who have made serious attempts at making the Lao language work across applications, and for their artful skill in rendering practical fonts despite the small Lao market. Useful Lao language computerization resources can be found at http://park.lao.net/references/.

14 : http://www.LaoSoftware.com
15 : http://www.tavultesoft.com/
16 : http://www.kde.org/international/

A scheme for computerizing GMS unsegmented languages

1. Introduction

In the Greater Mekong Subregion, there are many writing systems deriving from Indian scripts and sharing a common feature: they are unsegmented, in the sense that word boundaries are not marked.

Having the same origin means that many cross-adaptations of already developed solutions are possible. For instance, with font development, various existing fonts could be cloned for another character set.

The cyber-synergy described above for Lao language computerization (forum, lexical resources, open source, downloads) may already exist for other languages and could be generalized. This section discusses several possible topics where synergies could be found. The implementation phase could be supported by specific project grants and sponsorship. We propose to generalize an existing Lao word processing facility to other unsegmented languages of the GMS. We further highlight issues and problems relating to implementing an effective online lexical resource.

2. GMSWord platform proposal

In this section, we propose to reuse the LaoWord¹⁷ software and to build a GMSWord from its features. LaoWord is an add-in for Microsoft Word that performs word processing functions specific to Lao¹⁸:

1. Virtual keyboards:
Miscellaneous virtual keyboards are available, both adapted to national physical keyboards (e.g. French, US, Danish) and with different layouts (traditional typewriters or intuitive layouts).

2. Non-standardized font encoding management:
Several encodings have been used by font designers, making text input and font changes quite complex. LaoWord contains the necessary software for providing the user with a virtual encoding standard: users do not have to care which font they are using, either during text input or when changing fonts.

3. Text selections by entire syllables:
The absence of spaces between Lao words makes Microsoft Word selections rather chaotic with the Lao unsegmented writing. The LaoWord syllabic selection mode makes selections in Lao texts more natural (syllable by syllable), for example before copying text to the clipboard.

4. A small Lao-French / Lao-English lexicon:
The French or English translation of a Lao word can be obtained by selecting the word and clicking on the translation button.

5. Several romanizations:
The automatic romanization of a Lao sample can be obtained by selecting the text and clicking on the romanization button. Three different romanizations are offered: a free one (i.e. only making use of the Latin alphabet), and a phonetic and a reversible transcription, both using the International Phonetic Alphabet.

6. Lao table sort:
Tables with text in Lao cannot be directly sorted with Microsoft Word's own means. LaoWord allows Lao tables to be sorted. It can use any of the different rules found in the dictionaries (selectable).

7. Miscellaneous Lao text finishing functions:
- Lao page make-up,
- Height selection (high/low) for accents,
- Height selection (high/low) for "sala u",
- Ligature of accented vowels.

While Microsoft Windows still represents the majority of the available platforms, GMSWord would offer a suitable and effective tool for the GMS languages with unsegmented writing. With the same methodology implemented for the different languages, it would be easier to define common communication tools.

Note that a BurmaWord has already been prototyped for testing the Burmese Unicode capability.

17 : See http://www.LaoSoftware.com/ or http://sabaidi.imag.fr/.
18 : See the LaoWord user's manual for a more detailed description. It is included in the LaoWord package (http://www.LaoSoftware.com/).

3. Character encoding standard, data format

3.1 Reducing the brakes on development

Within the past ten years, despite neglect and small commercial viability, languages such as Lao have received enough interest and support to emerge with a workable set of tools. However, due to uncoordinated efforts, many diverging technologies and conventions exacerbate the problems. Several issues have been identified that have hampered the development of this Lao on-line lexical resource:

- Lack of a workable Lao standard character coding.
- Lack of Lao script support native to current operating systems, affecting keyboard input.
- Resources: human and financial.

We present hereafter some solutions for Lao that we are currently discussing in the forum or already use, and that could also be applied to other languages¹⁹.

3.2 Lao standard character coding

The code-page scheme used by MS Windows to display foreign character sets, combined with the lack of a workable Lao character standard, has created a situation where an exchange of Lao data does not always convey the relevant information. To address this issue, the Lao on-line lexical resource has to develop some standard or means of conversion:

- Adopt Unicode,
- Provide utilities for conversion,
- Develop a convention to embed font information, to achieve some data uniformity.

3.3 Lao script support, keyboard input

As previously described in section 4 (Virtual Keyboard), the implementation already exists for Lao and is fully adaptable to all GMS languages.

19 : See also (Berment 2002b) for a general approach.

3.4 Resources: human and financial

Currently most of the work on Lao lexical development is driven by a group of volunteers. Fast-tracking its development would require the support of resourceful hosts who share our vision. The goodwill and the human resources do exist, ready for an initiative that gathers the efforts.

3.5 Data format

The data record format has to be carefully designed as a compromise between compactness and flexible data fields. As Lao text is made of a continuous stream of characters, it is hard for an unfamiliar reader to discern word boundaries. This is further complicated by the lack of clear rules for multi-word construction.

Recording all Lao words would waste resources by duplicating many word entries. Retrieving words from a dictionary would then be a pure string search rather than rule-based.

A rule-based database could yield a more concise and smarter dictionary. Therefore data records need to be selective in how words and relevant information are stored, and in how multi-word items are linked. It appears necessary to introduce word identifiers and to derive word dependencies and relationships. Another aspect of Lao words is dialectal variants, with different intonations and consonants used in words.

Incorporating all these issues, a possible data record format is as follows:

1) Single word, Unicode encoded,
2) International phonetic notation,
3) Romanized wording,
4) Variant spellings (intonation and consonant),
5) Word identifier/type (e.g. adj., noun, etc.),
6) Links to multi-word items,
7) Description,
8) Comment.

To ensure quality word retrieval, a client-server architecture is a key feature that would allow a uniform and centralized database to be maintained. A centralized database receiving a significant number of hits from remote requests could suffer from poor performance. To improve availability, it is possible to distribute the traffic load to multiple slave servers that replicate from and update to a single master database.

By centralizing the database, it would be much easier to collect statistical data on word access and usage. Such data is relevant to further enhancing the design of more intelligent lexical rules.

The implementation of a client-server architecture would achieve a cross-platform solution, favoring HTML, SSI, PHP, JavaScript and compiled C code used as CGI technology.
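The eight-field record proposed in this section could be modelled, for instance, as a small Python data class. The class name, the attribute names and the sample values are assumptions made for this sketch (the phonetic and romanized values in particular are only illustrative); the paper lists the fields but does not fix a concrete schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class LexicalRecord:
    """One dictionary record, following the eight fields listed above."""
    word: str                                           # 1) Unicode-encoded word
    phonetic: str = ""                                  # 2) phonetic notation
    romanized: str = ""                                 # 3) romanized wording
    variants: list[str] = field(default_factory=list)   # 4) variant spellings
    word_type: str = ""                                 # 5) identifier/type
    links: list[str] = field(default_factory=list)      # 6) multi-word items
    description: str = ""                               # 7) description
    comment: str = ""                                   # 8) comment


# Illustrative record: a single word linked to one multi-word item,
# so that the compound does not need a fully duplicated entry.
record = LexicalRecord(
    word="ພາສາ",
    phonetic="pʰaːsaː",        # sample value, tones omitted
    romanized="phasa",         # sample value, not checked against LRT
    word_type="noun",
    links=["ພາສາລາວ"],
    description="language",
)

field_names = [f.name for f in fields(LexicalRecord)]
```

Storing links instead of full compound entries is one way to read the paper's remark that recording every multi-word item separately would duplicate many entries.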

Conclusion


Computerization of a language is a source of pride for its speakers, and it is a natural right for all linguistic groups to possess the same minimum set of tools for effective communication in their time. Dialogue breaks down the barriers of prejudice and ignorance. It establishes a working relationship built on trust. Trust and understanding ensure cooperation for mutual benefit.

Though the Montaigne project started with an application to the Lao language, its principles can be applied to any of the GMS languages.

References

Berment V. 1998. Prolégomènes graphotaxiques du laotien. DEA dissertation, Inalco, Paris, France.

Berment V. 2002a. Several Technical Issues for Building New Lexical Bases. Papillon seminar 2002.

Berment V. 2002b. Several directions for minority languages computerization. COLING 2002.

Boitet C. 1999. A research perspective on how to democratize machine translation and translation aids aiming at high quality final output. MT Summit 1999.

Meknavin S., Charoenpornsawat P. and Kijsirikul B. 1997. Feature-based Thai word segmentation. Natural Language Processing Pacific Rim Symposium 1997.

Shimohata S., Kitamura M., Sukehiro T. and Murata T. 2001. Collaborative Translation Environment on the Web. MT Summit 2001.

Thongvilu H. (LaoNet). Lao Romanisation for Transliteration. STEA Lao IT technology seminar, Vientiane, August 1995.

Thongvilu H. Lao New Coding (LNC). PAN-Laos National ICT affairs seminar, Vientiane, 10th March 2003. (to be published)
