computers and the thai language - lexitronlexitron.nectec.or.th/km_hl5001/file_hl5001/paper/inter...

16
Computers and the Thai Language Hugh Thaweesak Koanantakool National Science and Technology Development Agency Theppitak Karoonboonyanan Thai Linux Working Group Chai Wutiwiwatchai National Electronics and Computer Technology Center This article explains the history of Thai language development for computers, examining such factors as the language, script, and writing system, among others. The article also analyzes characteristics of Thai characters and I/O methods, and addresses key issues involved in Thai text processing. Finally, the article reports on language processing research and provides detailed information on Thai language resources. Thai is the official language of Thailand. The Thai script system has been used for Thai, Pali, and Sanskrit languages in Bud- dhist texts all over the country. Standard Thai is used in all schools in Thailand, and most dialects of Thai use the same script. Thai is the language of 65 million people, and has a number of regional dialects, such as Northeastern Thai (or Isan; 15 million people), Northern Thai (or Kam Meuang or Lanna; 6 million people), Southern Thai (5 million people), Khorat Thai (400,000 people), and many more variations (http:// en.wikipedia.org/wiki/Thai_language). Thai language is considered a member of the fam- ily of Tai languages, the language used in many parts of the Indochina subre- gion including India, southern China, northern Myanmar, Laos, Thai, Cambodia, and North Vietnam. The Thai script of today has a history going back about 700 years, with gradual changes in the script’s shape and writing sys- tem evolving over the years. The script was originally derived from the Khmer script in the sixth century. It is generally thought that the Khmer script developed from the Pallava script of India. 1 Thai is written left-to-right, without spaces between words. Each character has only one form, that is, no notion of uppercase and lowercase characters. Some vowels are writ- ten before and after the main consonant. Certain vowels, all tone marks, and diacritics are written above and below the main character. Pronunciation of Thai words does not change with their usage, as each word has a fixed tone. Changing the tone of a syllable may lead to a totally different meaning. Thai verbs do not change their forms as with tense, gender, and singular or plural form, as is the case in European languages. In- stead, there are other additional words to help with the meaning for tense, gender, and sin- gular or plural. Basic Thai words are typically monosyllabic. Contemporary Thai makes ex- tensive use of adapted Pali, Sanskrit, English, and Chinese words embedded in day-to-day vocabulary. Some words have been in use long enough that people have forgotten that they originated from other languages. At pres- ent, most new words are created from English. A Thai word is typically formed by the combination of one or more consonants, one vowel, one tone mark, and one or more final consonants to make one syllable. Cer- tain words may be polysyllabic and therefore they may consist of many characters in com- bination. Because of a limited number of characters (see Figure 1), the writing system can be implemented on typewriters by map- ping each symbol directly to each key. A typ- ing sequence is exactly the same as a writing sequence. [3B2-6] man2009010046.3d 12/2/09 13:47 Page 46 46 IEEE Annals of the History of Computing 1058-6180/09/$25.00 c 2009 IEEE Published by the IEEE Computer Society Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

Upload: vodan

Post on 04-Jun-2018

221 views

Category:

Documents


0 download

TRANSCRIPT

Computers and the Thai Language

Hugh Thaweesak KoanantakoolNational Science and Technology Development Agency

Theppitak KaroonboonyananThai Linux Working Group

Chai WutiwiwatchaiNational Electronics and Computer Technology Center

This article explains the history of Thai language development forcomputers, examining such factors as the language, script, andwriting system, among others. The article also analyzes characteristicsof Thai characters and I/O methods, and addresses key issues involvedin Thai text processing. Finally, the article reports on languageprocessing research and provides detailed information on Thailanguage resources.

Thai is the official language of Thailand.The Thai script system has been usedfor Thai, Pali, and Sanskrit languages in Bud-dhist texts all over the country. StandardThai is used in all schools in Thailand, andmost dialects of Thai use the same script.

Thai is the language of 65 million people,and has a number of regional dialects, suchas Northeastern Thai (or Isan; 15 millionpeople), Northern Thai (or Kam Meuang orLanna; 6 million people), Southern Thai(5 million people), Khorat Thai (400,000people), and many more variations (http://en.wikipedia.org/wiki/Thai_language). Thailanguage is considered a member of the fam-ily of Tai languages, the language usedin many parts of the Indochina subre-gion including India, southern China,northern Myanmar, Laos, Thai, Cambodia,and North Vietnam.

The Thai script of today has a historygoing back about 700 years, with gradualchanges in the script’s shape and writing sys-tem evolving over the years. The script wasoriginally derived from the Khmer script inthe sixth century. It is generally thoughtthat the Khmer script developed from thePallava script of India.1

Thai is written left-to-right, without spacesbetween words. Each character has only oneform, that is, no notion of uppercase andlowercase characters. Some vowels are writ-ten before and after the main consonant.

Certain vowels, all tone marks, and diacriticsare written above and below the maincharacter.

Pronunciation of Thai words does notchange with their usage, as each word has afixed tone. Changing the tone of a syllablemay lead to a totally different meaning.Thai verbs do not change their forms aswith tense, gender, and singular or pluralform, as is the case in European languages. In-stead, there are other additional words to helpwith the meaning for tense, gender, and sin-gular or plural. Basic Thai words are typicallymonosyllabic. Contemporary Thai makes ex-tensive use of adapted Pali, Sanskrit, English,and Chinese words embedded in day-to-dayvocabulary. Some words have been in uselong enough that people have forgotten thatthey originated from other languages. At pres-ent, most new words are created fromEnglish.

A Thai word is typically formed by thecombination of one or more consonants,one vowel, one tone mark, and one or morefinal consonants to make one syllable. Cer-tain words may be polysyllabic and thereforethey may consist of many characters in com-bination. Because of a limited number ofcharacters (see Figure 1), the writing systemcan be implemented on typewriters by map-ping each symbol directly to each key. A typ-ing sequence is exactly the same as a writingsequence.

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 46

46 IEEE Annals of the History of Computing 1058-6180/09/$25.00 �c 2009 IEEEPublished by the IEEE Computer Society

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

The computerization of Thai script for theThai language preserves the input sequence,and assigns one character in storage to corres-pond to one input keystroke. This concept issimple to understand, and all language pro-cessing algorithms are designed to suit thisarrangement, despite the fact that someThai vowels consist of up to three characters(from the vowel group) in that word.

Early IT standards in Thailand defined thecode points for computer handling, keyboardlayout, and I/O method. Subsequent researchprojects created many useful functions suchas word break, text-to-speech, optical charac-ter recognition (OCR), voice recognition, ma-chine translation, and search algorithms,which we describe elsewhere in this article.

Thai language on typewritersand computers

Printing of the Thai language was first ac-complished at Serampore in 1819: accordingto professor Michael Winship,2 a Catechism ofReligion had been prepared by the AmericanBaptist missionary Ann Hasseltine Judsonfor distribution to a small community ofThai prisoners of war in Burma. By 1836,Dan Beach Bradley, a missionary physician,started printing in Thailand using a donatedpress by the American Board of Commis-sioners for Foreign Mission. At the time, pub-lications were intended primarily forspreading Christianity into Thailand.3

Thai-language typewriter localizationwas first shown to the Thai public in1891, when Edwin McFarland, in coopera-tion with Smith Premier, a US typewriter

company, introduced the first Thai languagetypewriter. In 1898, son George McFarlandopened a typewriter shop in Bangkok.Those first typewriters had a moving carriagewithout any shift mechanism.4 Today’s type-writer keyboard has five rows, with standardshift keys.

The use of computers with the Thai lan-guage started in the late 1960s when IBMintroduced card-punch machines and lineprinters with Thai character capability. Indoing so, the 8-bit EBCDIC code assignmentswere made for Thai characters.5 With theprinter’s mechanical limitations, for eachline of Thai text an equivalent of four-passprinting was required to complete the multi-level nature of the Thai writing system. In the1970s, Univac, Control Data, and Wangintroduced their computers using the Thailanguage. In 1979, some companies offeredCRT terminals, with a 12-lines-per-screen ca-pability, which could display Thai characters.Interactive computing came to Thailand in1983 when a Thai inventor modified the Her-cules Graphic Card (HGC) circuitry and madeit capable of displaying 25 lines per screen intext mode. The full display capability of theThai language has been adopted widelysince then.

The computer industry in Thailand in the1980s was very much vertically integrated:that is, hardware manufacturers always sup-ported their version of localized operatingsystems and applications. Every companyused different Thai character codes as a resultof competition and corporate strategy to re-tain their customers. Application developers,however, suffered from the codes’ incompat-ibility and from differences in the systeminterfaces. By 1984, there were at least 20 dif-ferent character coding conventions used inThailand.6 This situation prompted the ThaiIndustrial Standard Institute (TISI) to estab-lish a standards committee to draft the na-tional standard code for computers. Thework, completed in 1986, is known as TIS620-2529—The Standard for Thai CharacterCodes for Computers B.E. [Buddhist Era]2529.7 Basically, two designs of the codepoints were adopted in the standard: theIBM-EBCDIC and the extended ASCII (8-bitplane). In 1990, the TIS 620 standard wasenhanced with a clearer explanation butwithout any change to the code points, andthis standard became TIS 620-2533. The tech-nical committees of TISI subsequently issueda number of IT standards for Thailand(see Table 1).

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 47

Consonants

Vowels

Tone marks

Diacritics

Numerals

Other symbols

Total

44

18

4

5

10

6

87

Figure 1. Thai characters.

January–March 2009 47

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

It took only one year after the standard’sannouncement for every computer companyto be fully compliant with the TIS 620 stan-dard code. Everyone’s data then becameinterchangeable.

In 1990, the Thai API Consortium, a groupof Thailand software developers led by Tha-weesak Koanantakool, drafted a commonspecification for computer handling of theThai I/O method. The work was funded bythe National Electronics and Computer Tech-nology Center (NECTEC) and was publishedin 19918 as the WTT 2.0 specification.9 Thespecification was widely used by industry,including IBM, Digital Equipment, Hewlett-Packard, and, later, Microsoft. WTT 2.0 wasproposed to TISI as a draft national standardand, seven years later, became TIS 1566-2541(1998). At the same time, many more inter-esting, natural language processing (NLP)projects with useful results and solutionsfound their way into industrial use. In thepast two decades, Thailand has achieved sev-eral milestones for advanced computer pro-cessing with the Thai language; we reporton most of them in this article.

Thai alphabetsEncoding of Thai characters into an octet

has been made simple, since we need only87 characters to represent the language. Inthe code space between 0�80 and 0�FF, TIS620 occupies only the code space between0�A1 and 0�FB. The ASCII-compatible ver-sion of TIS 620 was interoperable with anyASCII-based computer with true 8-bit storageper character. The actual placement on the

code table was placed in the ‘‘most accept-able’’ collating sequence. The TIS 620 se-quence was subsequently registered with theUniversal Character Set defined by the inter-national standard ISO/IEC 10646, UniversalMultiple-Octet Coded Character Set, in the re-gion Uþ0E00 . . .Uþ0E7F, and also as theISO/IEC 8859-11 standard for 8-bit (singlebyte) coded graphic character sets, Latin/Thai alphabet. Figure 2 shows the code assign-ment for Thai characters and their names inTIS 620 and Unicode.

Consonants

Classifications of the Thai consonants canbe made as plosives (stops), non-plosives, sibi-lants, and voiced ‘‘h.’’ In the plosive table inFigure 3, each column is also groupedby voiced/unvoiced properties. In the lastgroup, อ is classified as a zero consonant,and it can be used to write with vowels,which cannot be stand-alone. Figure 3 showsthe consonants, classifications, and the associ-ated International Phonetic Alphabet (createdby the International Phonetic Association,IPA) and the International Alphabet of San-skrit Transliteration (IAST) representations.

Vowels

The 18 symbols for vowels, together withthree consonants—ย, ว, and อ—are used incombination to create 32 vowels for Thai.

The 18 symbols are listed in Figure 4, withthe associated Unicode values.

These 18 vowel symbols and 3 consonants(21 in total) combine to form 32 vowels, asFigure 5 shows.

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 48

Computers and the Thai Language

Standard number

TIS 620-2529 (1986)TIS 820-2531 (1988)TIS 620-2533 (1990)TIS 988-2533 (1990)

TIS 1074-2535 (1992)TIS 1075-2535 (1992)

TIS 1099-2535 (1992)TIS 1111-2535 (1992)TIS 820-2538 (1995)TIS 1566-2541 (1988)

Standard title

Standard for Thai Character Codes for ComputersLayout of Thai Character Keys on Computer KeyboardsStandard for Thai Character Codes for Computers (Enhanced)Recommendation for Thai Combined Character Codes and Symbols for Line Graphics for Dot-Matrix PrintersStandard for 6-Bit Teletype CodesStandard for Conversion Between Computer Codes and 6-Bit Teletype CodesStandard for Province Identification Codes for Data InterchangeStandard for Representation of Dates and TimesLayout of Thai Character Keys on Computer Keyboards (Enhanced)Thai Input/Output Methods for Computers

Table 1. Thai industrial standards on information technology. The four digits at the end of the standard number (e.g., 2529) is the year of issue, according to Buddhist Era.

Source: http://www.nectec.or.th/it-standards/.

48 IEEE Annals of the History of Computing

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

Thai input method and keyboardsThai is written with the combining marks

stacked above or below the base consonant,like diacritics in European languages. How-ever, although the concepts are quite sim-ilar, the implementations are significantlydifferent.

Thai input method requirementsFirst, there are too many possible combina-

tions of base consonants and combining marksin Thai to be enumerated like Latin accents inthe ISO/IEC 8859 series standard for 8-bit char-acter encoding. Therefore, base characters andcombining characters are encoded separately,rather than precombined.

Second, Thai combining marks are classi-fied into upper or lower vowels, tone marks,and other diacritics. Moreover, a Thai baseconsonant can be combined with up to twocombining marks, that is, zero or one upperor lower vowel and zero or one tone mark or

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 49

TIS

0

1

2

3

4

5

6

7

8

9

A

B

C

D

E

Fto patak fo fan paiyannoi baht fongman

yamakkan

nikhahit

thanthakhat

mai chattawa

pinthu mai tri angkhankhu

thai ninemai thosara uu

thai eightmai eksara u

khomut

ho nokhukpho phando chada

o angfo fayo ying

cho choe pho phung lo chula

so so po pla ho hip

cho chang bo baimai so sua

cho ching no nu so rusi

cho chan tho thong so sala

thai sevenmaitaikhusara ueengo ngu tho thahan wo waen

thai sixmaiyamoksara uekho rakhang tho thung lu

thai fivelakkhangyaosara iikho khon to tao lo ling

thai foursara ai maimalaisara ikho khwai do dek ru

thai threesara ai maimuansara amkho khuat no nen ro rua

thai twosara osara aakho khai tho phuthao yo yak

thai onesara aemai han-akatko kai tho nangmontho mo ma

thai zerosara esara atho than pho samphao

Unicode U+0E0x U+0E1x U+0E2x U+0E3x U+0E4x U+0E5x U+0E6x U+0E7x

0xAx 0xBx 0xCx 0xDx 0xEx 0xFx

Figure 2. Standard codes for Thai characters—TIS 620 and Unicode, with their names as defined by the

WTT 2.0 specification.

unaspirated unaspiratedaspirated aspirated nasal

voicedUnvoicedPlosiveclass

velar

palatal

retroflex

dental

labial

non-plosive

sibilants

Voiced ''h''

zero consonant (never a vowel)

(semi-vowel)

tone middle high low low low

palatal retroflex dental labial

Figure 3. Classification of Thai consonants. Each cell consists of the

character, International Phonetic Alphabet, and the International Alpha-

bet of Sanskrit Transliteration (IAST) representations in square brackets.

January–March 2009 49

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

diacritic. The upper/lower vowel, if present, isalways attached to the consonant before thetone/diacritic. These conditions must be gov-erned by some rules to limit the number andorder of the combining characters to be putafter the base consonant.

As a third difference, Thai users are famil-iar with typing the combining charactersafter the base consonant, in contrast tomany European input methods in whichusers type the diacritic before the base letterto compose an accent. The typing sequenceand the internal storage of a Thai string,

therefore, are totally consistent. Moreover, atyping sequence is almost exactly the sameas a handwriting sequence, with the excep-tion of only one character, sara am, a com-posite vowel that consists of two symbolssharing one code point.

Typewriter keyboardsThe keyboard of the first Thai typewriter

made by Smith Premier in 1891 consisted of12 keys in seven rows (Figure 6a shows the lay-out). When the typewriter industry startedusing the shift key, the Thai layout was alsoadapted on the typical QWERTY arrangement.The characters that normally appear aboveand below the base characters are assigned toindividual keys, and they appear above orbelow the previous position without movingthe carriage. This concept is called dead keys.

The two main differences between Thaiand English typewriters are that there aretwo different characters on the same key,and that there are eight dead keys. All thedead keys are grouped in the center of the Ket-manee keyboard layout (see Figure 6b), theclassic layout named after its inventor.

In 1966, Sarit Pattajoti,11 an engineerat the Royal Irrigation Department, rede-signed the layout to improve the mechanicalcoordination of human fingers. Statistically,the new layout would eliminate the imbal-ance of the Ketmanee layout (30% distrib-uted to the left hand and 70% to the righthand), and distribute more of the workloadto the index and middle fingers. The Patta-joti layout (see Figure 6c) became the officialstandard. Manufacturers were encouraged tomake them in quantity for government offi-ces. However, this ‘‘official’’ standard wasabandoned in 1971 because its marginalspeed improvement was not significantenough for people to migrate from the defacto Ketmanee.

In 1988, TISI issued the standard layout forthe computer keyboard, TIS 820-2531 (1988)(see Figure 6d). The TISI technical committeeadopted the Ketmanee layout with only onechange. In 1993, the standard was enhancedto utilize the additional three keys availableon the computer keyboards. Figure 6e showsthe present standard layout as definedby TIS 820-2536 (1993). Four rarely used char-acters, however, are defined in the TIS 620standard that are not on the keyboard; theywould be entered on the specific programthat uses them, typically through a GUI or,in the pre-Microsoft Windows era, via directhexadecimal code in DOS.

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 50

Computers and the Thai Language

Front BackUnrounded

ShortClose

Close-mid

Open-midOpen

(a) 18 Monophthongs

(b) 9 Diphthongs

(c) 5 Semi-vowels

Thai

IPA

Thai

IPA

Short ShortLong Long LongRoundedUnrounded

Figure 5. Thirty-two vowels in Thai: 18 monophthongs, 9 diphthongs,

and 5 semi-vowels.

Vowel Name

sara a

mai han-akat

sara aa

sara am

sara i

sara ii

sara ue

sara uee

sara ue

sara uu

sara e

sara ae

sara o

sara ai maimuan

sara ai maimalai

ru

lu

lakkhangyao

FV1

AV2

FV1

FV1

AV1

AV3

AV2

AV3

BV1

BV2

LV

LV

LV

LV

LV

FV3

FV3

FV2

follow

above

follow

follow

above

above

above

above

below

below

leading vowel

leading vowel

leading vowel

leading vowel

leading vowel

follow

follow

follow

Type Position relative tothe main consonant

Figure 4. Eighteen vowel symbols in Thai.

50 IEEE Annals of the History of Computing

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

WTT 2.0 canonical orderThe Thai I/O method specifications

described in WTT 2.0 are based on the TIS620 standard code.8 The same concept waslater extended to the Thai code page in Uni-code. TIS 620 defines a code point for each ofthe Thai symbols, irrespective of their posi-tions when rendered. In the same manneras Unicode normalization, the WTT 2.0 speci-fication defines the canonical order of Thaicharacter strings. It differs, however, inthe implementation details and in terms ofsyntactic strictness. WTT 2.0 compliancerequires that certain input-sequence rulesmust be met, and most (if not all) input syn-tactic errors are eliminated at the time of dataentry (see Figure 7).

First, WTT 2.0 requires strings to be storedand transmitted in canonical order, not to benormalized at processing time. Therefore, inWTT 2.0, canonical and noncanonical stringsmean different things and can yield differentresults in rendering, sorting, searching, andso on. This helps simplify the string handlingroutines to some degree, leaving the task ofensuring the order at a single point, theinput method.

Second, WTT 2.0 concerns syntactic cor-rectness of the input string. It inhibits multi-ple tone marks to combine in one cell, whileUnicode does not care. Moreover, WTT 2.0defines three levels of syntactic strictness ofinput method. Level 0 (Passthrough) doesnot filter at all. Level 1 (BasicCheck) justensures that the input sequence complieswith canonical order and can be displayedgracefully in the output method. Level 2(Strict) is more picky in filtering out mostproblematic sequences. Therefore, the WTT2.0 input conditioning helps make the storageorder of Thai text cleaner than would the rawUnicode concept. The WTT 2.0 input methodproduces strings of a proper subset of the Uni-code specification, and thus it does not nega-tively affect Unicode-compliant programs.WTT 2.0 eliminates problematic strings thatmight otherwise fail string matching orthat might confuse the output method.

Technical characteristics of Thai input methodTechnical requirements of the Thai TIS

1566-2541 (WTT 2.0) input method differfrom input methods of other languages.The input method

� does not need pre-edit-and-commit stages(as used in some CJK [Chinese, Japanese,and Korean] input methods),

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 51

a

b

c

d

e

Figure 6. Evolution of Thai typewriter keyboard layout: from

top, (a) Smith Premier layout, ca. 1890, (b) Ketmanee layout (no

date), and computer keyboard layouts: (c) Pattajoti layout (1966),

(d) TIS 820-2531 (1988), and (e) TIS 820-2536 (1993). (Source:

Koanantakool10)

January–March 2009 51

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

� does not use the composing method (asused in Europe), and

� is not a straight key-to-character mapping(as used in ordinary English input method).

Rather, the input method requires thefollowing:

� one-to-one key-to-character mapping (ir-respective of the width and position ofthe character),

� ability to retrieve context character from ap-plication input buffer (to verify validity),and

� (optional) write access to applicationinput buffer.

The key-to-character mapping is straightfor-ward. Thailand has an industrial standardfor a keyboard map, totally compatible withthe traditional (i.e., mechanical) typewriter.

The ability to retrieve context charactersfrom the application input buffer is intendedto validate the key events, according to thestate machine. Note that other solutions arenot adequate, such as these:

� using pre-edit string and committing vali-dated characters as a chunk,

� using composing method to commitstrings cell-by-cell, and

� remembering previously typed charactersin the input context memory.

These solutions can serve only for the input-ting of new text but cannot handle the edit-ing of existing text, especially when the firstkey to be input is a combining character.

The last requirement of write access to theapplication input buffer is for input sequencecorrection enhancement.

ImplementationsKnown implementations of WTT-based

Thai input methods include the following:

� Thai Language Environment for Solaris� Microsoft DOS 6.0 and Windows (all

versions)� X Window: XIM (embedded in Xlib)� GTKþ:

—Thai-Lao IM module (stock GTKþ IMmodule)—gtk-im-libthai (third party, based onlibthai)

� SCIM:—scim-thai (third party, based on libthai)—scim-m17n (third party, by ETL, Japan)

Making TIS 620 a part of international standards

The early 1990s witnessed different opin-ions concerning the encoding of the Thaiblock in the making of draft Basic Multilin-gual Plane (BMP), the code space between0�0000 and 0�FFFF in Unicode. Trin Tant-setthi, a Thai computer architecture and soft-ware expert, developed proposals to relevantstandards bodies that were responsible forthe internationalization of Thai charactercodes.12 In order to have a Thai languagecharacter set for the ISO 8859 series, the TIS620 standard was registered under ISO 2375by the ECMA (European Computer Manufac-turers Association) as ISO-IR-166. The processencountered many obstacles. According toTantsetthi, the first proposal was to eliminatenonspacing characters, which were nonop-tional atomic parts of the script, by pre-composing characters into a rectangularbounding box of glyphs. This proposal waslater dropped because there were an insuffi-cient number of code spaces to accommodateall precomposed glyphs. Once the industryhad learned about this limitation, the long-awaited ISO/IEC 8859-11 Latin/Thai standardwas approved. TIS 620 was formally regis-tered by the Internet Assigned Numbers

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 52

Computers and the Thai Language

Forward

NON

LV

FV1

FV2

FV3

AV1

AV2

AV3

BV1

BV1

AD3

AD2

AD1

BD

TONE

CONS

NON = non-Thai printableCONS = Thai consonants

LV = leading vowels

FV1 = ordinary following vowels

FV2 = dependent following vowel

FV3 = special following vowels (RU, LU)BV1 = short below vowel (SARA U)

BV2 = long below vowel (SARA UU)BD = Below diacritic (PHINTHU)

TONE = tone marksAD1 = above diacritic 1

AD2 = above diacritic 2 (MAITAIKHU)

AD3 = above diacritic 3 (YAMAKKAN)

AV1 = above vowel 1 (SARA I)

AV2 = above vowel 2 (MAI HAN-AKAT,

AV3 = above vowel 3 (SARA II, SARA UEE) SARA UE)

(THANTHAKHAT, NIKHAHIT)

(LAKKHANGYAO)

Dead

Figure 7. WTT 2.0 Level 1 I/O state machine.

52 IEEE Annals of the History of Computing

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

Authority (IANA) as a legitimate encoding forthe Internet in 1998.12

Thailand also encountered another prob-lem in defining Thai in the ISO 10646 stan-dard, as some nonnative linguists hadstrongly proposed the use of phonetic order forThai, following Hindi-based scripts. Althoughthis scheme allows words to be normalizedinto a ‘‘proper form,’’ making word parsing abit easier, the proposal could not address thefollowing problems:

� it could not handle language exceptionsvigorously;

� compounded vowels were not addressed,neither in code-point assignments nor ininput method; and

� no backward compatibility was provided.

We opted for the traditional input method(i.e., visual, instead of phonetic, order) be-cause, since Thai localization had begun in1968, by 1990 the Thai computer industryhad equipped itself with libraries and toolsto use visual order without a problem.That’s basically how the Thai block in BMP/Unicode has developed. In 1998, IANA(Internet Assigned Numbers Authority)adopted TIS 620 as a legitimate character seton the Internet. The visual order was for-mally announced as a national standard inTIS 1566-2541 (1998).

Computer display, printing, and Thaiword processors

Like most languages, modern Thai textdisplay on computers has evolved from tradi-tional printing technologies. Mechanicaltypewriter design has allocated four verticalzones for placing Thai characters: one forbase characters, one for combining charactersbelow the baseline, and the other two forcombining characters above the base charac-ter. Characters are strictly classified on thebasis of their assigned positions. This haspropagated to a text-mode computer displaygrid. The WTT 2.0 specification was designedto accommodate exactly this.

Thai typography has a more-refinedscheme than that of typewriters. The top-most combining characters can be shifteddown slightly when the combining characterbelow them is missing to eliminate theempty gap. In addition, the combining char-acters can be slightly biased to the left orright of the base characters, which havelong ascenders (stems) to deliver aestheticprint results.

The bias technique was necessary for hot-metal typesetting because all the combinedsymbols had to be molded individually asone piece, and there were about 28 combi-nations to be implemented. Another set of28 types precombined with ascenders wasalso required. In phototypesetting or com-puter typesetting, these precombinationtechniques were no longer a concern, butthe same concept is useful to improve thespeed of dot matrix printing. To print a lineof Thai text on an impact printer normallyrequires four passes. By precombining thetop characters (two levels) before printing,the speed improves by 25% because onlythree passes are required, rather than thefour that mechanical typesetting required—one for each of the four vertical zones. Tomake all dot matrix printers interoperable,TISI issued TIS 988-2533, the standard codefor combined characters in dot matrix print-ers, in 1990.

Thai WTT-2.0�based rendering enginesfor text-mode display usually handle Thaitext in two steps. The text string is first clus-tered into cells, which are then shaped byproper placements of the combining charac-ters. Printing with an impact printer, on theother hand, requires a conversion of oneline of text into three or four stringswhose lengths are equal to the line width,with each string corresponding to eachpass of printing. The fastest dot matrixprinter manages to print a Thai line withonly two passes; typical printing requiresthree passes.

A number of DOS-based Thai-languageI/O subsystems were developed at KasetsartUniversity, Chulalongkorn University, andThammasat University. Yuen Poovorawanand his team at Kasetsart University createdthe first Thai word processor called ThaiEasy Writer. He also developed the Thai Ker-nel System as a resident program for DOS tohandle the Thai language.13 The most popu-lar DOS-based Thai word processor wasreleased in 1989. According to professorWanchai Rivepiboon of Chulalongkorn, CUWriter version 1.1 was first released to thepublic in April 1989. It was a tremendous suc-cess because of the demands for quality print-ing of Thai documents. The Faculty ofEngineering of Chulalongkorn Universitycontinued to support and improve theprogram for about five years until it wassuperseded in 1994 by commercial wordprocessing software running on the Windowsplatform.14

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 53

January–March 2009 53

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

Cell clusteringCell clustering shares the same state table

with the input method. Composable se-quences are tokenized into cells. Rejectedsequences are separated, with an optional dot-ted circle as the base character that can beeasily caught by human eyesight, as Figure 8illustrates.

Shaping

For typewriter-style displays, the toke-nized cells are rendered in a perfect rectangu-lar bounding box, and all combining marksare placed at fixed vertical positions. But formodern desktop publishing and the Thai lan-guage, beginning in 1989, combining markshave been typographically adjusted for thefollowing issues:

� The topmost combining mark without amark below it is shifted down to eliminatethe gap between it and the base character.

� Combining marks that are combined withthe base character with long ascenders(stems) are biased slightly to the left toavoid overlapping.

� Consonants with removable descenders(yo ying and tho than) have the descenderremoved when combined with a lowervowel or diacritic.

To accommodate these issues, typesettershave used Private Use Area glyphs. PUAglyphs are still widely used in most fontstoday. OpenType may eventually maketheir use obsolete with Thai, although notin the near term.

PUA glyphs solutionsOperating system vendors assigned sep-

arate, predefined PUA code points toshifted glyphs. Font creators prepared therequired glyphs, and the rendering engineknew how to use them. Unfortunately,Microsoft and Apple defined their ownPUA code points differently, which meantfonts could not be used across Windowsand Mac platforms.

To make matters worse, Adobe did notsupport any PUA glyphs for Thai. This

situation caused developers to create somecommon hacks with automatic filters toencode text with PUA glyphs so that suchapplications would render properly. This situ-ation created another mess: when users ofsuch filters transmitted messages in PUA,they were unreadable by anyone using an-other platform.

Thai users have long suffered from thislack of interoperability. The availability ofPUA glyphs, however, is important enoughin Thai typography to live on.

Pango, a free software rendering engine,supports both sets of PUA glyphs—Windowsand Apple—by detecting them before using.Open standard and open source softwarehas significantly helped the local Thai indus-try to find useful typographical solutions.

OpenType solutions

With OpenType technology, all shapingdetails can be moved into fonts. All that therendering engines need do is call the appro-priate features for the language.

Glyph processing features in OpenTypefonts are represented in two forms: glyphsubstitution (GSUB) and glyph positioning(GPOS). GSUB contains substitution rules,while GPOS describes anchor points forattaching combining marks to base glyph orcombining marks to other combining marks.

According to Microsoft’s specification forThai OpenType fonts,15 several features arerequired, as explained next.

Character Composition/Decomposition(‘‘ccmp’’). This must contain GSUB rules to

1. select the lower variation for topmostmarks in the absence of an upper vowel(see Figure 9, left),

2. remove the descender for the consonantsyo ying and tho than when combined withbelow-base characters (see Figure 9, mid-dle), and

3. decompose the composite vowel sara aminto the nikhahit sign and sara aa vowel,plus rearrange them with the combinedtone mark if one is present (see Figure 9,right).

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 54

Computers and the Thai Language

Figure 8. WTT 2.0 clustering. The dotted circles

indicate a rejected sequence.

Upper/lowervariation selection

Descenderremoval

SARA AMdecomposition

Figure 9. Glyph substitutions for shaping.

54 IEEE Annals of the History of Computing

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

Mark to Base Positioning (‘‘mark’’). This iscomposed of GPOS base anchors in the basecharacters and mark anchors in the combin-ing marks. This feature places marks aboveor below the base characters (see Figure 10).

Mark to Mark Positioning (‘‘mkmk’’). Thisis composed of GPOS base anchors in thebase marks and mark anchors in the combin-ing marks. This feature places topmost markson upper vowels.

Thai cultural conventionsThe most basic kind of cultural convention

is the Posix locale for the standard C library.According to Posix, a number of C functionsare defined to be locale-dependent, such asdate and time format, string collation, and soon. Many other programming languages alsorely on these functions.

The Posix locale specification has beenfurther extended in ISO/IEC TR 14652, Speci-fication Method for Cultural Conventions, byadding more categories, which have nowbeen supported by more-modern software.

Theppitak Karoonboonyanan16 has gath-ered information on Thai cultural conven-tions and documented his Thai localecreation for the GNU C library. Most Posixcategories were defined and later extendedwhen ISO/IEC TR 14652 was supported.

Handling of Thai wordsAccording to the Royal Institute of Thai-

land’s Royal Institute Dictionary (http://en.wikipedia.org/wiki/The_Royal_Institute_of_Thailand#The_Royal_Institute_Dictionary),Thai strings can be collated by comparisonfrom left to right, with two exceptions:

� Leading vowels are less significant thanthe initial consonant of the syllable andmust be compared after the consonant.

� Tone marks, thanthakhat and maitaikhu,are ignored unless all other parts are equal.

No syllable structure or word boundary analysisis required. Thai words are sorted alphabeti-cally, not phonetically. Figure 11 shows anexample of a sorted list of Thai words.

String collation solutions

The first known solution for Thai stringcollation was proposed in 1969 by D. Londeand Udom Warotamasikkhadit,17 and laterreferenced by Vichit Lorchirachoonkul18 in1979. Londe and Warotamasikkhadit pro-posed an algorithm for converting Thaistrings into left-to-right comparable form.

In 1992, Samphan Khamthaidee (a penname used by Samphan Raruenrom)19 pub-lished an article in a computer hobbyist jour-nal describing an algorithm for comparingtwo Thai strings on the fly. The algorithmwas well crafted so that the strings arescanned from left to right in a single pass,and the difference is detected as early as pos-sible. Karoonboonyanan20 later extendedKhamthaidee’s algorithm to cover punctua-tion marks by generalizing it to accommo-date extra levels of character weights in1997. Subsequently, details about the orderof Thai punctuation marks have been sum-marized in the literature.21,22

International standardsString collation is based primarily on the

Posix standard library functions strcoll()

and strxfrm(). Both are based on the LC_COL-

LATE locale setting. The Posix locale was laterextended as ISO/IEC TR 14652, SpecificationMethod for Cultural Conventions. For LC_COL-

LATE in particular, a Common TemplateTable for collating strings encoded in ISO/IEC

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 55

Figure 10. Mark positioning with GPOS.

Figure 11. Example of a sorted list of Thai words.

January–March 2009 55

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

10646, Universal Multiple-Octet Coded CharacterSet, in general has been defined as ISO/IEC14651, International String Ordering and Compar-ison. Every locale can customize the table to fitlocal needs. Unicode also defines UTS #10 (theUnicode Collation Algorithm) in parallel withthe ISO/IEC 14651 standard.

Both standards already supported multiple-level character weights in the first place,which was important for handling Thaitone marks and diacritics. So, all the specifica-tion needed at that time was the properweights definition. The reordering of theleading vowels, however, had been contro-versial for a while, as we will explain.

Defending the non-Indic style of Thai Uni-code encoding had given the Unicode commit-tee sufficient information to address Thai inUTS#10 since the beginning. The reorderingwas defined as a separate preprocessing stage.It took time, however, to convince the ISO/IEC 14651 committee to include this in theCommon Template Table (CTT) as a set of pre-defined contraction forms as there was no pre-processing concept in it. Rather, the reorderingrequirement was first described in Annex C inDIS 14651:200023 without realization in theCTT, before the actual amendment was imple-mented in 2003.24

Leading-vowel rearrangement issueOne common misinterpretation of the

leading-vowels rearrangement by nonnativelinguists, which has been frequently encoun-tered at conferences and meetings, was thatThai string collation required linguistic analysis,as a consequence of a Thai visual encodingscheme,asopposedtoaphoneticencodingschemeof other Indic scripts. This was simply wrong.

A major ambiguity problem would result ifone tried to do linguistic analysis. For example,the Thai string เพลา can mean two differentwords. One is a two-syllable word phe-la (mean-ing ‘‘time’’), and the other is a one-syllableword phlao (meaning ‘‘axel’’ or ‘‘abate’’). In try-ing to preprocess the string by rearranging theleading vowel เ with the syllable initial sound,one will find two possible rearrangements:พเลา for the former case, and พลเา for the lat-ter. This makes it impossible to determine asingle weight for the string. This is another

reason why Thailand completely rejected pho-netic encoding in its standards.

Thai string collation is in fact as simple asits encoding scheme. The leading vowel onlyneeds to be swapped with its immediate suc-ceeding character—nothing more than that.

Word and syllable boundaryTypical Thai writing has no space be-

tween words and syllables. We use spaceonly to separate sentences. Lacking a wordor syllable boundary marker has been amajor problem for early Thai language proc-essing applications, which has beenresearched since the 1980s. Early works fo-cused on syllable segmentation due to itspotential for being modeled in rule-basedfashion.25,26 Word segmentation, on theother hand, requires a dictionary. Longestor maximal matching incorporated with awell-designed dictionary was the first practi-cal approach.27 Over time, researchers haveproposed advanced algorithms. A majorcontribution includes the first open-sourcesoftware for word segmentation calledSmart Word Analyzer for THai (SWATH),which NECTEC developed.28 It is based on astatistical part-of-speech n-gram. In 2002,Wirote Aroonmanakun constructed a wordsegmentation engine using two-pass proc-essing, syllable segmentation, and syllablemerging based on collocations.29 These fun-damental tools have been applied in variouslanguage processing tasks such as line break-ing and word boundary detection in wordprocessing and publishing software.

SoundexSoundex is a phonetic algorithm for index-

ing words by sound. It has been widely usedin Internet search engines for flexible searchingby the same or a similar sound without regardto the written form. An original algorithm,proposed by Vichit Lorchirachoonkul,30 wasbased on a set of rules used to segment Thaicharacter strings into syllables, simplify thesyllables, and encode them in a five-charactercode. Another general procedure is to findpronunciations of a given keyword andmatch them to indexed words in the searchdatabase. Standard sound representationsare defined by the International Phonetic As-sociation (IPA) and the Speech AssessmentMethods Phonetic Alphabet (SAMPA).31,32

The Thai Royal Institute has defined a stan-dard sound-based romanization of Thaiwords. Figure 12 demonstrates these standardsound representations.

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 56

Computers and the Thai Language

Sample text Romanization

sawatdikhopkhunlakon

sa_2 wat_2 di:_1khOp_2 khun_1la:_1 kOn_2

IPA SAMPA

Figure 12. Examples of Thai sound representations.

56 IEEE Annals of the History of Computing

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

Automatic letter-to-sound conversion isan important tool that activates the auto-matic soundex search. One of the first com-plete tools for letter-to-sound conversionwas based on a probabilistic, generalizedleft-to-right parser trained by a pronuncia-tion dictionary.33

Advanced human�computer I/OSeveral forms of human�computer I/O—

OCR, handwriting recognition, text-to-speechand speech synthesis, and automaticspeech recognition—present different chal-lenges with respect to the Thai language.This section reviews major contributions tothese advanced technologies.

OCR and handwriting recognition

Thai OCR is not trivial, as a bounding boxmay contain up to three symbols stacking to-gether, and the written (or printed) sentencesdo not have spaces within them. In addition,modern Thai texts do mix with Englishwords. The challenges to OCR developershave been tremendous.

The history of Thai OCR spans well over20 years. Pipat Hiranvanichakorn et al. andChom Kimpan et al. were two of the originalgroups of researchers in this area.34,35 Theirsubsequent works primarily concerned recog-nition of individual (isolated) printed charac-ters, that is, without connections among thecharacters. A number of commercial productshave been launched since the 1990s—for ex-ample, ArnThai developed by NECTEC was oneof the first complete OCR software applica-tions that helped users convert scanned pic-tures of documents to editable documents,with over 90% conversion accuracy underrestricted scanning conditions.36

The technology trend has been tailored tothe use of machine learning such as an artifi-cial neural network. Although OCR technol-ogy has reached a commercial level, itsmass utilization has not yet been achieved.Problems include connected characters,which often appear in dirty documents;poorly scanned documents; or documentswith rarely used fonts that cause recognitionerrors. Research on Thai OCR is ongoing: forexample, an introduction of postprocessingto recover recognition errors.

Thai handwriting recognition wasexplored about a decade after OCR. Academicresearch started also on isolated-character rec-ognition.37 Current applicable systems stillmostly focus on isolated characters with aneural network as a basic classifier, like those

of OCR. Some systems have been adoptedcommercially in mobile devices.38 On thissimple but accurate task, characters aregrouped into Thai alphabets, English alpha-bets, and digits. Modification on charactershapes is applied to enlarge the difference ofsome characters or to simplify the writing.

Text-to-speech and speech synthesis

Research on Thai speech synthesis, chieflyfor academic purposes, began in the 1980s.Early work was reported by Sudaporn Luksa-neeyanawin.39 Text-to-speech synthesis(TTS) has resulted in several practical prod-ucts produced around 1990—for example,CUTalk by Luksaneeyanawin and Vaja byNECTEC.40 Major efforts required for Thai TTSare text processing, where a given text is seg-mented, normalized, and converted to soundrepresentatives; prosody prediction includingestimating unit durations, filled pauses, andfundamental frequencies; and synthesis, includ-ing the speech database and synthesizingalgorithms.

The technology for speech synthesis hasbeen incorporated by adopting availableTTS engines in applications such as the IBMhomepage reader,41 screen reader, and tele-phone voice response services. Limitationson speech quality and intelligibility inhibitwidespread use. The desire to provide TTSapplications on portable devices is increas-ing, and therefore researchers will have to op-timize the size of the speech database andtext processing dictionary. Thai TTS hasbeen used widely, however, for persons withvisual disabilities in Thailand.42

Automatic speech recognitionDevelopment of Thai automatic speech rec-

ognition (ASR) began with isolated wordrecognition (in which speakers utter only ashort command, not a sentence) in the mid-1990s, primarily at Chulalongkorn Univer-sity.43 A few years ago, the research expandedto applications of continuous speech in lim-ited tasks: for example, email access and tele-banking. Several attempts were made onlarge vocabulary continuous speech recogni-tion (LVCSR) for Thai; Table 2 summarizesthe results. The lack of a very large speechdatabase has slowed the advance of Thaispeech recognition; however, NECTEC is oneof the organizations that has continuouslydeveloped Thai speech recognition resourcesavailable for research, such as the LOTUScorpus.44

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 57

January–March 2009 57

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

Search and Semantic WebIncreasingly, as more companies and org-

anizations in Thailand, both public and pri-vate, start adopting digital solutions in placeof their original paper-based workflow, theneed for search technology becomes inevita-ble. Most of the available software tools fordeveloping information retrieval (IR) systemshave not been specifically designed to workwith Thai. Consequently, there are two possi-ble approaches for indexing Thai texts,including suffix array and inverted file index.The former approach treats text as a stringof characters and records character positions;the latter approach adopts the conventionalword-based paradigm applied for Latin-basedlanguages. Word segmentation is thereforenecessary in the latter approach. The stem-ming process is, however, inapplicable sinceThai words are considered noninflectional.

An early publication describing full-textsearch for Thai was published in 1999 by PraditMitrapiyanuruk et al.47 In 2000, SurapanMeknavin, a co-author of that publication, fo-cused his research on founding an Internet ser-vice business called SiamGuru (http://www.siamguru.com), which could be recognizedas the first Thai-specific search engine. Today,the research trend in Thai-language searchengines is moving toward the natural-language

and semantic searches for implementingquestion-answering (QA) systems.

The Semantic Web research initiatives inThailand can be classified into four majorcategories:

� information extraction (IE) and ontologylearning,

� metadata/ontology standards and tools,� knowledge representation (KR) and infer-

ences, and� applications and services.

A wide spectrum of techniques has beenresearched and developed since the early2000s. For example, a Thai translation ofthe Dublin Core metadata element sets48

was completed by Science and TechnologyKnowledge Services (STKS), which also pro-motes its uses in Thailand, and the XML De-clarative Description (XDD)49 language wasproposed by the Asian Institute of Technol-ogy to extend the capabilities of XML andRDF to support rule-based reasoning.

Language resources and industrialstandard lists

One of the most important languageresources is a dictionary. So Sethaputra is con-sidered the first Thai-English dictionarymade into a commercial electronic dictionary

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 58

Computers and the Thai Language

Table 2. State-of-the-art Thai large vocabulary continuous speech recognition (LVCSR) performance. The organization that provided the research results is listed after each “Task.”

Task

Newspaper reading (Carnegie Mellon University)45

Newspaper reading (Tokyo Institute of Technology)46

Vocabularysize (words)

~7,400

5,000

Perplexity

140

140

Word error rate(%)

14.0

11.6

Table 3. Available Thai dictionaries.

Dictionary

Royal Institute DictionaryNECTEC

LEXiTRONKDictThaiSaikam DictionaryLongdo Dictionary

Type

Monolingual with pronunciationBilingual (En-Th) with pronunciationBilingual (En-Th)Bilingual (Jp-Th)Multilingual (En-Th, Th-En, De-Th, Th-De, Jp-Th Th-Jp, Fr-Th, Th-Fr)

Source

http://rirs3.royin.go.th/dictionary.asp

http://lexitron.nectec.or.th/

http://kdictthai.sourceforge.net/http://saikam.nii.ac.jp/http://dict.longdo.com/

Availability

Web application

Publicly available

Publicly availableWeb applicationWeb, gadget, widget

Size (words)

133,524>600,000

37,01835,000 Thai53,000 English,

33,582

58 IEEE Annals of the History of Computing

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

product. LEXiTRON is another electronic dic-tionary with a long history.50 The first versionof LEXiTRON was launched by NECTEC in bothstandalone and Web-based systems in 1995.This first version contained approximately13,000 Thai words and 9,000 English words.Through user collaboration, the number ofentries has been more than doubled in thecurrent version, LEXiTRON 2.2. Table 3 listsother available dictionaries. Another languageresource necessary for language processing istext and speech corpora. Orchid and LOTUS,shown in Table 4, are considered two of thefirst official language corpora available pub-licly for Thai language research.

Concluding remarksThe history of the Thai language on com-

puters dates back to about 1965, but local re-search involvement began around 1975,when low-cost microcomputers became avail-able. Since the first standard Thai code forcomputers was developed in 1986, the com-puter processing speed has improved from8-bit, 8-MHz to 32-bit, 3.6-GHz (1,800times), and the RAM size for an entry-levelcomputer has increased from 256 Kbytes to1 Gbyte (3,900 times). NLP in real time is be-coming a reality. Research opportunities areopen to any students and researchers whocan afford these inexpensive computers. Thai-land’s I18N (internationalization) and L10N(localization) processes owe much to openstandards and many open source projects. Itis anticipated that openness in the interna-tional IT community will enable the Thai lan-guage to be more usable on any computers inthe future.

WithmoreITusers, improvement inhuman-computer interaction for the Thai language willbecome more desirable. Because computers arealsobecoming indispensable for the elderly,per-sons with disabilities, and those who are illiter-ate, many new developments will be possible.Speech I/O, handwriting input, sign languages,mobile-phone text entry, a voice-command sys-tem, speech-to-speech translation, voice search,and so on are the most likely efforts to be devel-oped and commercialized.

Acknowledgments

The authors would like to thank Trin Tantset-thi and Yuen Poovorawan for their personalcommunications, and to Choochart Harue-chaiyasak and Marut Buranarach from NECTEC

for their initial survey on search engines andSemantic Web technology.

References and Notes1. P. Soonthornpoct, From Freedom to Hell: A His-

tory of Foreign Interventions in Cambodian Politics

and Wars, Vantage Press, 2006.

2. M. Winship, ‘‘The Printing Press as an Agent of

Change? Early missionary printing in Thailand’’;

http://www.historycooperative.org/journals/cp/

vol-08/no-02/tales/.

3. NECTEC National Font Committee, The Time Line

of Thai Printing, Thai National Font Book (in

Thai), National Electronics and Computer Tech-

nology Center (NECTEC), Bangkok, 2001.

4. Antique Phonograph and Gramophone Thai

Society, ‘‘The Origins of Thai Typewriter’’; http://

www.talkingmachine.org/smithpremierorigin.html.

5. K. Hensch, ‘‘IBM History of Far Eastern Lan-

guages in Computing, Part 1: Requirements

and Initial Phonetic Product Solutions in the

1960s,’’ IEEE Annals of the History of Computing,

vol. 27, no. 1, 2005, pp. 17-26.

6. T. Koanantakool, ‘‘Compilation of all Thai Char-

acter Codes for Computers,’’ Microcomputer

Magazine, vol. 1, no. 9, Aug. 1984.

7. National Electronics and Computer Technology

Center, Thai Information Technology Standards;

http://www.nectec.or.th/it-standards/.

8. T. Koanantakool and the Thai API Consortium,

Computer Gub Pasa Thai [Computers and the

Thai Language], NECTEC, Oct. 1991 (in Thai).

9. The WTT specification was published in the book

Computers and the Thai Language in 1991. WTT,

a project code name, is a Thai acronym of

‘‘Wing Took Thee,’’ meaning ‘‘I run everywhere.’’

10. T. Koanantakool, ‘‘The Keyboard Layouts and

Input Method of the Thai Language,’’ Proc.

Symp. Natural Language Processing, Chulalong-

korn Univ., 1993.

11. S. Pattajoti, The Evolution of the Typewriter,

National Research Council, The Office of the

Prime Minister, 4 Nov. 1966 (in Thai).

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 59

Table 4. Thai text and speech resources.

Corpus (Organization)

Orchid (NECTEC)51

http://www.hlt.nectec.or.th/orchid/LOTUS (NECTEC)44

Purpose

Annotated 568,316 words of Thai junior encyclopedias and NECTEC technical papersWell-designed speech utterances for 5,000-word dictation systems

Type

Text

Speech

January–March 2009 59

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

12. T. Tantsetthi, ‘‘Thai Software Alert #1: Legiti-

mate Thai Charset on the Internet’’; http://

software.thai.net/alerts/tis-620/index.html.

13. P. Suriyawong, Y. Poovarawan, and C. Wongchai-

suwat, ‘‘A Development of Thai Kernel System,’’

Proc. Regional Workshop on Computer Processing of

Asian Languages (CPAL), 26-28 Sept. 1989.

14. T. Koanantakool, ‘‘A Brief History of ICT in

Thailand’’; http://www.bangkokpost.com/

20th_database/07Feb2007_data00.php.

15. Microsoft, ‘‘Thai OpenType Specification:

Creating and Supporting OpenType fonts for

the Thai Script’’; http://www.microsoft.com/

typography/otfntdev/thaiot/default.htm.

16. T. Karoonboonyanan, ‘‘Thai Locale’’; http://

linux.thai.net/~thep/th-locale/.

17. D. Londe and U. Warotamasikkhadit, ‘‘Compu-

terized Alphabetization of Thai, tech. memo TM-BA-

1000/000/01, SystemDevelopment Corp., 1969.

18. V. Lorchirachoonkul, Thai Sorting Algorithm, Thai

Algorithm: Design, Analysis and Implementation,

tech. report, NIDA and Electrical Generation

Authority of Thailand, Bangkok, 1979.

19. S. Khamthaidee, ‘‘Thai Style Algorithm: Sorting

Thai,’’ Computer Rev., vol. 91, Mar. 1992.

20. T. Karoonboonyanan, ‘‘Thai Sorting Algo-

rithms’’; http://linux.thai.net/~thep/tsort.html.

21. T. Karoonboonyanan, S. Raruenrom, and

P. Boonma, ‘‘Thai-English Bilingual Sorting’’;

http://linux.thai.net/~thep/blsort_utf.html.

22. T. Karoonboonyanan, S. Raruenrom, and

P. Boonma, ‘‘Sorting Thai Words with Punctua-

tion Marks,’’ NECTEC J., vol. 14, Jan.�Feb. 1997,

NECTEC, 1997.

23. A. LaBonte, ed., ISO/IEC DIS 14651: Interna-

tional String Ordering and Comparison—Method

for Comparing Character Strings and Description

of the Common Template Tailorable Ordering,

ISO/IEC JTC1/SC22/WG20 N731, 2000.

24. K. Karlsson, Ordering Rules for Thai and Lao,

ISO/IEC JTC1/SC22/WG20 N1077, 2003.

25. Y. Thairatananond, ‘‘Towards the Design of a

Thai Text Syllable Analyzer,’’ master’s thesis,

Asian Inst. of Technology, Pathumthani, 1981.

26. S. Charnyapornpong, ‘‘A Thai Syllable Separa-

tion Algorithm,’’ master’s thesis, Asian Inst. of

Technology, Pathumthani, 1983.

27. R. Varakulsiripunth et al., ‘‘Word Segmentation

in Thai Sentence by Longest Word Mapping,’’

Papers on Natural Language Processing: Multilin-

gual Machine Translation and Related Topics

(1987�1994), NECTEC, 1989.

28. S. Meknavin, P. Charoenpornsawat, and

B. Kijsirikul, ‘‘Feature-Based Thai Word Segmen-

tation,’’ Proc. Natural Language Processing Pacific

Rim Symp. (NLPRS 97), 1997, pp. 41-48.

29. W. Aroonmanakun, ‘‘Collocation and Thai Word

Segmentation,’’ Proc. Joint Int’l Conf. Symp.

Natural Language Processing (SNLP) and Oriental

Chapter of the Int’l Committee for the Coordina-

tion and Standardization of Speech Databases

and Assessment Techniques (O-COCOSDA),

2002, pp. 68-75.

30. V. Lorchirachoonkul, ‘‘A Thai Soundex System,’’

Information Processing and Management, vol. 18,

no. 5, 1982, pp. 243-255.

31. K. Tingsabadh and A.S. Abramson, Illustrations of

the IPA: Thai, Handbook of the International Pho-

netic Association, Cambridge Univ. Press, 1999.

32. J.C. Wells, ‘‘SAMPA for Thai’’; http://www.phon.

ucl.ac.uk/home/sampa/thai.htm.

33. P. Tarsaku, V. Sornlertlamvanich, and R. Thong-

prasirt, ‘‘Thai Grapheme-to-Phoneme Using

Probabilistic GLR Parser,’’ Proc. European Conf.

Speech Communication and Technology (Euro-

Speech), ISCA, 2001, pp. 1057-1060.

34. P. Hiranvanichakorn, T. Agui, and M. Nkajima,

‘‘A Recognition of Thai Characters,’’ Trans. IECE

Japan, vol. E 54, no. 12, 1982.

35. C. Kimpan et al., ‘‘Thai Character Pattern Rec-

ognition Preliminary,’’ Proc. Conf. L-05, College

of Science and Technology, Nihon Univ., 1980.

36. NECTEC, ‘‘Open TLE’’; http://www.opentle.org/.

37. V. Tanrudee, ‘‘Thai Character Handwriting Rec-

ognition System,’’ master’s thesis, Chulalong-

korn Univ., 1989.

38. G Softbiz Co., Ltd., ‘‘Thai-G’’; http://www.thai-g.

com/.

39. S. Luksaneeyanawin, ‘‘A Thai Text-to-Speech Sys-

tem,’’ Proc. Regional Workshop Computer Processing

of Asian Languages (CPAL), 1989, pp. 305-315.

40. P. Mittrapiyanurak et al., ‘‘Issues in Thai Text-to-

Speech Synthesis: The NECTEC Approach,’’ Proc.

NECTEC Ann. Conf., NECTEC, 2000, pp. 483-495.

41. IBM, ‘‘IBM Launches IBM Home Page Reader

V3.02 Thai, A Thai-speaking Web Browser—The

12th Language of the World,’’ press release,

12 Feb. 2003.

42. Thailand Association of the Blind, ‘‘Software for

the Blind’’; http://www.tab.or.th/.

43. T. Pathumthan, ‘‘Thai Speech Recognition Using

Syllable Units,’’ master’s thesis, Chulalongkorn

Univ., 1987 (in Thai).

44. S. Kasuriya et al., ‘‘Thai Speech Corpus for Speech

Recognition,’’ Proc. Int’l Conf. Int’l Committee for

the Coordination and Standardization of Speech

Databases and Assessments (O-COCOSDA),

2003, pp. 105-111.

45. S. Suebvisai et al., ‘‘Thai Automatic Speech Rec-

ognition,’’ Proc. IEEE Int’l Conf. Acoustics, Speech,

and Signal Processing (ICASSP), IEEE Press, 2005,

pp. 857-860.

46. I. Thienlikit, C. Wutiwiwatchai, and S. Furui,

‘‘Language Model Construction for Thai

LVCSR,’’ Reports of the Meeting of the Acoustic

Soc. of Japan, ASJ, 2004, pp. 131-132.

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 60

Computers and the Thai Language

60 IEEE Annals of the History of Computing

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.

47. P. Mitrapiyanuruk et al., ‘‘A Development of

Full-Text Search Engine for Large Scale Thai

Text Database,’’ Proc. Nat’l Science and Technol-

ogy Development Agency (NSTDA), annual meet-

ing, 1999, pp. 246-257 (in Thai).

48. Science and Technology Knowledge Service

(STKS), ‘‘Dublin Core Metadata Element Set

(Thai translation)’’; http://dublin.stks.or.th/.

49. V. Wuwongse et al., ‘‘XML Declarative Descrip-

tion: A Language for the Semantic Web,’’ IEEE In-

telligent Systems, vol. 16, no. 3, 2001, pp. 54-65.

50. NECTEC, ‘‘LEXiTRON Online Dictionary’’; http://

lexitron.nectec.or.th/.

51. V. Sornlertlamvanich, T. Charoenporn, and

H. Isahara, ORCHID: Thai part-of-speech tagged

corpus, tech. report TR-NECTEC-1997-001,

ISBN 9747576-98-8, NECTEC, 1997.

Hugh Thaweesak Koananta-kool is a vice president of theNational Science and Technol-ogy Development Agency.He received a PhD in digitalcommunications from the Im-perial College of Science andTechnology, London Univer-

sity. His major works are in IT standards ofThailand, the establishment of the Internetin Thailand, SchoolNet Thailand, and e-Government.Koanantakool served on Thailand’s National ITCommittee, the parliamentary commissions on

the Electronic Transactions Act of 2001 and theComputer-related Offences Act in 2007. Contacthim at [email protected].

Chai Wutiwiwatchai receiveda PhD from Tokyo Instituteof Technology in 2004. Cur-rently he is the director ofthe Human Language Tech-nology Laboratory in NECTEC,Thailand. His research inter-ests include speech process-

ing, natural language processing, and human�machine interaction. Contact him at [email protected].

Theppitak Karoonboonyananreceived a B.Eng. (honors) incomputer engineering fromChulalongkorn University in1993. His interests are soft-ware engineering, and freeand open source software phi-losophy. He has contributed

to the GNOME, Debian, and other projects forThai supporting infrastructure, and has maintainedupstream projects for Thai resources. Contact himat [email protected].

For further information on this or any other

computing topic, please visit our Digital Library

at http://computer.org/csdl.

[3B2-6] man2009010046.3d 12/2/09 13:47 Page 61

January–March 2009 61

Authorized licensed use limited to: National Science & Technology Development Agency. Downloaded on April 20, 2009 at 22:32 from IEEE Xplore. Restrictions apply.