
Localization and Language Technology Standards

Kavi Narayana Murthy, University of Hyderabad

ELITEX - 2007, New Delhi, 10-11 January 2007


Outline
Character Encoding Standards
Fonts, Glyphs, Mapping Standards
OS/Browser Support, Drivers
Transliteration, Romanization
Translation, Linguistic Resources
Speech and OCR Technologies
Enforcement

Goals
Functionality
  Whatever we can do with English, we must be able to do with our own languages and scripts with equal ease
Inter-operability, Platform Independence
  All applications must work seamlessly on all hardware and software platforms
Language and Script Independence
  Multi-lingual, Multi-Script Support

Standards
Even a poor standard is better than no standard
Standards save us a lot in the long run
Commercial forces promoting non-standard, proprietary, secret systems must not be allowed to succeed
Let us not say “Let the Market Decide”!!!

Character Encoding Standards
ISCII and Unicode
ISCII is a BIS Standard, Unicode is not
Unicode is based on ISCII
In some sense, Unicode is a step in the backward direction
Let us understand ISCII first

Language and Script
Do not confuse one for the other
Many-to-Many
Script is neither language nor font
Script and SuperScript
Phonetic Basis
Common SuperScript for all Indian languages (ILs)
Script Grammar

Language and Script
Sanskrit is written in Devanagari, Telugu, Kannada, Bangla etc. scripts
Devanagari is used for writing Sanskrit, Hindi, Marathi, etc.
English words are often written (transliterated) in local language scripts

Phonetic Basis
Words: Meanings, Sounds, Written Symbols
Meanings are supreme but difficult to quantify and encode
Sounds are the next best
  A ‘ka’ sound is a ‘ka’ sound, whatever be the language – Hence ‘Universal’
No need for ‘Spellings’
What we write is what we speak - directly

Orthography
Written symbols correspond with phonemes – basic sound units
Minor variations in sounds (allophones, co-articulation effects etc.) are not depicted in orthography
  t: Mountain, tea, truck, spilt, little
Special Symbols not to be confused with basic Characters

What is a Character?
Indian Languages:
  No ‘alphabet’, no letters, no spellings
  Phoneme-based
  Units are syllable-like: called ‘akshara’-s
  akshara-s very large in number
    Corpus studies not sufficient
  Made up of vowels, consonants etc.
  Not all sequences valid

Script Grammar
A Grammar for Scripts
Allows all valid sequences, only valid sequences
No need to code all possible akshara-s
Script grammar must be part of standards: ISCII includes one. Unicode?
Script Grammar to be enforced by software
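A script grammar of this kind can be sketched as a regular expression, which is equivalent to a finite state machine. The sketch below is an illustration only: it uses Unicode Devanagari code points and a deliberately simplified rule set (it is not the ISCII script grammar), and the names `AKSHARA`, `is_valid_akshara` and `is_valid_text` are hypothetical.

```python
import re

# Simplified Devanagari character classes (Unicode code points).
# Assumption: this small grammar is for illustration, not the ISCII rules.
CONSONANT = "\u0915-\u0939"   # ka .. ha
VOWEL = "\u0905-\u0914"       # independent vowels a .. au
MATRA = "\u093E-\u094C"       # dependent vowel signs
HALANT = "\u094D"             # vowel-killer sign (virama)
MODIFIER = "\u0901-\u0903"    # candrabindu, anusvara, visarga

# akshara := vowel [modifier]
#          | (consonant halant)* consonant (halant | [matra] [modifier])
AKSHARA = (
    rf"(?:[{VOWEL}][{MODIFIER}]?"
    rf"|(?:[{CONSONANT}]{HALANT})*[{CONSONANT}]"
    rf"(?:{HALANT}|[{MATRA}]?[{MODIFIER}]?))"
)


def is_valid_akshara(text):
    """True if text is exactly one akshara under the simplified grammar."""
    return re.fullmatch(AKSHARA, text) is not None


def is_valid_text(text):
    """True if text is a sequence of valid aksharas and whitespace."""
    return re.fullmatch(rf"(?:{AKSHARA}|\s)*", text) is not None
```

Because the grammar generates the valid sequences, no table of all possible akshara-s is needed. A vowel sign with no consonant before it, for example, is rejected by the grammar rather than by lookup.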

SuperScript
ILs: 10 Scripts with a nearly common sound system – all derived from the ancient ‘braahmi’ script
=> SuperScript
Super Set of all Phonemes
Common encoding: ISCII
Extendable to all languages of the

ISCII: (BIS – 1991: IS 13194)
128 codes more than sufficient
Uses the upper half (128-255) of the 8-bit code table; the ASCII half is untouched – allows mixing with English
SuperScript: Transliteration built-in
Long Standing: ISCII 1988, 1991
Well thought out and well designed
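Since ISCII leaves the ASCII half untouched, English and Indian-language text can be separated in a mixed byte stream by looking at the high bit alone. A minimal sketch, assuming 8-bit ISCII data; the function name `split_runs` and the sample bytes are hypothetical, and no particular ISCII code values are assumed.

```python
from itertools import groupby


def split_runs(data: bytes):
    """Split a mixed byte string into ("ascii", ...) and ("iscii", ...) runs.

    Bytes below 0x80 are plain ASCII; bytes from 0x80 upward fall in the
    half of the code table that ISCII uses.
    """
    runs = []
    for is_high, group in groupby(data, key=lambda b: b >= 0x80):
        runs.append(("iscii" if is_high else "ascii", bytes(group)))
    return runs


if __name__ == "__main__":
    # Arbitrary sample: ASCII text with two high-half bytes in the middle.
    print(split_runs(b"abc\xb3\xa4 d"))
```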

Why did ISCII fail to catch on?
Silent on Character-to-Font mapping
  A complex many-to-many mapping
  Fonts not standardized, fonts not available
Not registered, no OS/Browser Support (BIS – 1991: IS 13194)
Rationale not explained
Not publicized, not enforced

History
Proprietary, non-standard, secret font based encoding schemes
Promoted by commercial companies
Near Zero Inter-operability
Ad-hoc ISCII-to-font mapping schemes
Mapping schemes not made public
To be made Illegal and Punishable
Put India back by at least a decade!

Improving ISCII
Register - To get OS/Browser Support
Remove encoding of allophones, allographs
Script Grammar: a Finite State Machine (FSM) is enough, a Context Free Grammar (CFG) is not needed
Include Rationale, explanatory notes
Remove Attribute/Extension codes
Standardize ISCII-to-Font Mapping Scheme
Promote, Enforce

Character-to-Font Mapping
Complex scripts – not linear
Glyphs: shape units convenient for rendering
Poor correspondence with sound units
Many-to-Many mappings
Glyph selection, scaling, positioning
No Glyph Encoding Standard

From Character to Font
Must be provably complete and 100% consistent
Current systems are all ad-hoc – neither complete nor consistent
Finite State Transducers: Necessary and Sufficient
Without restricting Creativity and Flexibility
Simple, Efficient, Re-Usable
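The idea of a transducer from characters to glyphs can be illustrated with a small state machine. This is a sketch under stated assumptions, not the mapping scheme proposed in the talk: it uses Unicode Devanagari input, handles only half forms and the pre-base vowel sign i, keeps the current consonant cluster in a small buffer (clusters are short in practice), and the glyph labels `full`, `half`, `matra`, `sign` and `other` are hypothetical.

```python
HALANT = "\u094D"              # Devanagari virama
PRE_BASE_MATRAS = {"\u093F"}   # vowel sign i, drawn to the left of its cluster


def is_consonant(ch):
    return "\u0915" <= ch <= "\u0939"


def to_glyphs(text):
    """Map a character sequence to a list of (glyph_kind, character) pairs."""
    out = []
    cluster = []       # glyphs of the consonant cluster being built
    state = "START"    # START, CONS (after a consonant), HALANT (after C + halant)

    def flush():
        nonlocal state
        if state == "HALANT":
            # A halant not followed by a consonant stays visible.
            cluster.append(("sign", HALANT))
        out.extend(cluster)
        cluster.clear()
        state = "START"

    for ch in text:
        if is_consonant(ch):
            if state == "HALANT":
                # consonant + halant + consonant: previous one takes its half form
                _, prev = cluster[-1]
                cluster[-1] = ("half", prev)
            else:
                flush()
            cluster.append(("full", ch))
            state = "CONS"
        elif ch == HALANT and state == "CONS":
            state = "HALANT"
        elif ch in PRE_BASE_MATRAS and state == "CONS":
            out.append(("matra", ch))   # emitted before the cluster it follows
            flush()
        else:
            flush()
            out.append(("other", ch))
    flush()
    return out
```

For the input sequence sa, halant, tha, vowel sign i, the output order is the vowel sign glyph, the half form of sa, then the full form of tha. This shows why the character order and the glyph order cannot be the same list.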

Encoding Standards: Unicode
For Language/Script/SuperScript?
  CJK. Why not for ILs?
Script Grammar?
Character-to-Font: relegated to font level font effects
ISCII-88 Based, Has Errors
Once added, cannot be deleted!

ISCII or Unicode?
Unicode:
  To be with the World, to know and be known
  ‘Correcting’ Mistakes, Improving Standards
  Support (OS, Fonts, etc.), Education, Training
  Converting Legacy Data – A Huge Task
  ISCII-to-Unicode is not trivial
  Ignore BIS Standard and embrace what is not yet ‘standardized’?
Why not co-exist? – Internal and External

Keyboard Layouts, Drivers
Several de-facto standards and many variations in use
To select a few and standardize
So called Roman Phonetic Typing
  ILs through English! OK for oldies, not for future!
INSCRIPT: ISCII Standard, Good for newcomers
To strictly enforce Script Grammar

Document Encoding Standards
Plain Text: pure ISCII/UNICODE
  Mono-lingual Plain Text?
Annotated Text (Ex. Word Processors)
XML Style, Open, Readable formats to be encouraged
Proprietary, secret, non-standard encodings must be discouraged
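As an illustration of an open, readable, XML-style annotated text, the sketch below marks the language of each passage with the standard `xml:lang` attribute. The element names `doc` and `s` and the sample sentences are hypothetical; only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# The reserved XML namespace; ElementTree serializes it with the "xml" prefix.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_document(passages):
    """Build a small multi-lingual document from (language_code, text) pairs."""
    root = ET.Element("doc")
    for lang, text in passages:
        sentence = ET.SubElement(root, "s", {XML_LANG: lang})
        sentence.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_document([("en", "Hello"), ("hi", "\u0928\u092E\u0938\u094D\u0924\u0947")]))
```

Any XML parser can read such a file back and recover both the text and its language tags, which is the property a proprietary encoding does not give.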

Transliteration
Widely used, part of our Tradition
  Sanskrit texts in local scripts
  English, Hindi, Urdu words in local scripts
  Music Compositions
Automatic in ISCII. Unicode?
Quality of transliteration
To and From English?
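In Unicode, each major Indian script has its own 128-code block, and the blocks follow a largely parallel, ISCII-derived layout. A rough script-to-script transliteration can therefore be sketched by shifting code points from one block to another. This is an approximation only: the function name `transliterate` is hypothetical, the positions do not correspond perfectly across scripts, and characters with no assigned counterpart in the target block are left unchanged here.

```python
import unicodedata

# Starting code points of the Unicode blocks for some Indian scripts.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def transliterate(text, source, target):
    """Shift characters of the source script into the target script's block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + 0x80:
            candidate = chr(cp - src + dst)
            # "Cn" means unassigned: no counterpart at this position.
            if unicodedata.category(candidate) != "Cn":
                ch = candidate
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("\u0915\u092E\u0932", "devanagari", "telugu"))
```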

Romanization
Need:
  Where there is no support for local languages
    English dailies, posters, advertisements etc.
    Lack of support: OS/Browser/Fonts etc.
  Where users prefer Roman
A variety of ad-hoc schemes in use
  iTRANS, RTS, W-X, etc.
Standards badly wanted

Romanization
Multi-dimensional optimization problem
Case Mix-up
  26 Letters not sufficient
  52 nearly sufficient
  Not always supported
Storage space, Ease of Typing, Aesthetics
Scientific/Logical Design/Naturalness
  English-like – for the oldies: a, ee, oo, a, oa ???
  Futuristic: aa/ii/uu/ee/oo

Romanization
Clashes: a+u/au, k+h/kh, s’
Two way conversion, cyclic check
Ex. Long Vowels:
  a: –clashes with colon
  diacritic –not supported
  ipa –not understood –not supported
  A +single char. +saves space –ugly –difficult to type –case-mix-up
  aa +logical (like ee) +easy to type

Romanization: An Example
a aa i ii u uu R RR e ee ai o oo au M H
k kh g gh n~
c ch j jh n`
T TH D DH N
t th d dh n
p ph b bh m
y r l v s’ S s h L
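The cyclic check can be tried directly on the symbols of this example scheme. The sketch below splits a Roman string into symbols by longest match, then looks for pairs of symbols that do not survive a join-and-resplit round trip. The helper names `tokenize` and `find_clashes` are hypothetical, and the longest-match rule is an assumption about how such a scheme would be read back.

```python
# Symbols of the example scheme, in the order listed above.
SYMBOLS = (
    "a aa i ii u uu R RR e ee ai o oo au M H "
    "k kh g gh n~ c ch j jh n` T TH D DH N "
    "t th d dh n p ph b bh m y r l v s\u2019 S s h L"
).split()

# Try longer symbols first so that "kh" wins over "k" followed by "h".
BY_LENGTH = sorted(SYMBOLS, key=len, reverse=True)


def tokenize(roman):
    """Split a Roman string into scheme symbols by longest match."""
    tokens, pos = [], 0
    while pos < len(roman):
        for sym in BY_LENGTH:
            if roman.startswith(sym, pos):
                tokens.append(sym)
                pos += len(sym)
                break
        else:
            raise ValueError(f"no symbol matches at position {pos}")
    return tokens


def find_clashes():
    """Return symbol pairs (x, y) whose concatenation does not split back to [x, y]."""
    clashes = []
    for x in SYMBOLS:
        for y in SYMBOLS:
            try:
                round_trip_ok = tokenize(x + y) == [x, y]
            except ValueError:
                round_trip_ok = False
            if not round_trip_ok:
                clashes.append((x, y))
    return clashes


if __name__ == "__main__":
    print(tokenize("kamala"))
    print(find_clashes())
```

The pairs a+u and k+h named as clashes earlier both appear in the result, because `au` and `kh` are themselves symbols of the scheme. A scheme meant for two-way conversion needs either a separator or a different symbol choice for such cases.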

Translation
Create Material Afresh
Translate by Hand
Automatic/Machine Translation
Machine Aided Translation
English – Local Language Translation
Local – Local Language Translation

Translation
Resource Intensive
  Manpower, Time, Cost
  Quality/Uniformity
Standards, Bench-Mark Data, Testing and Evaluation Procedures
Dictionaries, Terminology Databases
Pan-Indian Terms/Sanskritize/Localize

Linguistic Resources
Dictionaries – General, Domain Specific
Terminological Databases
Thesauri, WordNets, Ontologies
Morphological Analyzers, Generators
Spell/Grammar/Style Checkers
Annotated Text and Speech Corpora

India: Future is in Speech
One Billion People, A Sixth of the World
More than 150 Languages, 22 Recognized
95% not comfortable with English
Computers, Current, Connectivity
Info Revolution benefits: Majority Deprived
10 M Computers, 100 M Phones
Future is in Speech

Speech
Natural
Easy, Fast
Hands-Free
No need to Learn
  Technology
  Language
Available to all

Text and Speech
Speech is Natural
Reading/Writing is learnt, Artificial
Some never learn – Illiterates
Oral Tradition
Speech is more permanent than Text!
“I did not steal that ring of gold”
Trust Yourself!

Speech Technologies
Speech Recognition: Speech to Text
Speech Synthesis: Text to Speech
Speaker Recognition, Verification, ID
Speech Coding/Decoding, Compression
Slow down, Speed up
Speech as Evidence

Applications
Telephone Dialing
Form Filling
Dictation Machine
Command and Control
Voice enabled Web
OCR+WP+TTS
MT: Cross-Lingual IR, S2S

OCR
OCR in Local Scripts Needed
  To digitize and save legacy data
  To compile/process/edit/refine data
For Printed Texts/Manuscripts
Old Data
  deterioration of paper
  old type fonts, problems of type-setting

Multi-Modal Interfaces
To reach out to 1 Billion People, we must get the best of many worlds:
  Speech Recognition and Synthesis
  Graphics and iconic Interfaces
  OCR Technologies
  Translation, CLIR
  Camera, Gestures, Touch Screen

Balance Between
Backward Compatibility and Future-Proof Designs
Quick Fix Solutions and Long Haul
One Standard or Several?
Economics and Business Sense versus Social Responsibilities
Acceptance versus Enforcement

The 3 Most Important Things
1. Develop/Refine/Update Standards
  Detailed Documentation
  Including rationale, issues, evaluation,
2. Education and Training
3. Enforcement
  Make use of non-standard methods illegal and punishable under law
Technical Workshops for detailing

Thank You!

Visit www.LanguageTechnologies.a
