Localization and Language Technology Standards
Kavi Narayana Murthy
University of Hyderabad
ELITEX - 2007
New Delhi, 10-11 January 2007
Kavi Narayana Murthy UoH
Outline
  Character Encoding Standards
  Fonts, Glyphs, Mapping Standards
  OS/Browser Support, Drivers
  Transliteration, Romanization
  Translation, Linguistic Resources
  Speech and OCR Technologies
  Enforcement
Goals
  Functionality
    Whatever we can do with English, we must be able to do with our own languages and scripts with equal ease
  Inter-operability, Platform Independence
    All Applications must work seamlessly on all hardware and software platforms
  Language and Script Independence
    Multi-lingual, Multi-Script Support
Standards
  Even a poor standard is better than no standard
  Standards save us a lot in the long run
  Commercial forces promoting non-standard, proprietary, secret systems must not be allowed to succeed
  Let us not say “Let the Market Decide”!!!
Character Encoding Standards
  ISCII and Unicode
  ISCII is a BIS Standard, Unicode is not
  Unicode is based on ISCII
  In some sense, Unicode is a step in the backward direction
  Let us understand ISCII first
Language and Script
  Do not confuse one for the other
  Many-to-Many
  Script is neither language nor font
  Script and SuperScript
  Phonetic Basis
  Common SuperScript for all ILs (Indian Languages)
  Script Grammar
Language and Script
  Sanskrit is written in Devanagari, Telugu, Kannada, Bangla etc. scripts
  Devanagari is used for writing Sanskrit, Hindi, Marathi, etc.
  English words are often written (transliterated) in local language scripts
Phonetic Basis
  Words: Meanings, Sounds, Written Symbols
  Meanings are supreme but difficult to quantify and encode
  Sounds are the next best
    A ‘ka’ sound is a ‘ka’ sound, whatever be the language – Hence ‘Universal’
  No need for ‘Spellings’
    What we write is what we speak - directly
Orthography
  Written symbols correspond with phonemes – basic sound units
  Minor variations in sounds (allophones, co-articulation effects etc.) are not depicted in orthography
    t: Mountain, tea, truck, spilt, little
  Special Symbols not to be confused with basic Characters
What is a Character?
  Indian Languages:
    No ‘alphabet’, not letters, no spellings
    Phoneme-based
    Units are syllable-like: called ‘akshara’-s
    akshara-s very large in number
      Corpus studies not sufficient
    Made up of vowels, consonants etc.
    Not all sequences valid
Script Grammar
  A Grammar for Scripts
  Allows all valid sequences, only valid sequences
  No need to code all possible akshara-s
  Script grammar must be part of standards: ISCII includes one. Unicode?
  Script Grammar to be enforced by software
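A script grammar of this kind can be sketched as a small finite state machine. The sketch below is only an illustration under stated assumptions: it works on Unicode Devanagari code points, and its character classes, state names, and transition table are a simplified toy grammar invented for this example, not the grammar given in ISCII. Anusvara, visarga, nukta and other signs are deliberately left out.

```python
# Toy script grammar for Devanagari as a finite state machine.
# The classes and transitions are illustrative assumptions, not the ISCII grammar.
CONSONANT, VOWEL, MATRA, HALANT = "C", "V", "M", "H"

def char_class(ch):
    """Map a character to its class, or None if this toy grammar does not cover it."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:   # consonants KA..HA
        return CONSONANT
    if 0x0905 <= cp <= 0x0914:   # independent vowels A..AU
        return VOWEL
    if 0x093E <= cp <= 0x094C:   # dependent vowel signs AA..AU
        return MATRA
    if cp == 0x094D:             # halant (virama)
        return HALANT
    return None

# state -> {character class -> next state}; a missing entry means "invalid here".
TRANSITIONS = {
    "start":  {VOWEL: "vowel", CONSONANT: "cons"},
    "vowel":  {VOWEL: "vowel", CONSONANT: "cons"},
    "cons":   {VOWEL: "vowel", CONSONANT: "cons", MATRA: "matra", HALANT: "halant"},
    "halant": {CONSONANT: "cons"},
    "matra":  {VOWEL: "vowel", CONSONANT: "cons"},
}
ACCEPTING = {"vowel", "cons", "halant", "matra"}

def is_valid(text):
    """Return True if every character follows a permitted transition."""
    state = "start"
    for ch in text:
        state = TRANSITIONS[state].get(char_class(ch))
        if state is None:
            return False
    return state in ACCEPTING
```

With this table, a consonant cluster followed by one vowel sign is accepted, while a vowel sign at the start of a word or two vowel signs in a row are rejected. Input software could run such a check on every keystroke.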
SuperScript
  ILs: 10 Scripts with a nearly common sound system – all derived from the ancient ‘braahmi’ script
  => SuperScript
    Super Set of all Phonemes
  Common encoding: ISCII
  Extendable to all languages of the world
ISCII: (BIS – 1991: IS 13194)
  128 codes more than sufficient
  Uses the upper half of the 8-bit code space (128-255); the ASCII half is untouched – allows mixing with English
  SuperScript: Transliteration built-in
  Long Standing: ISCII 1988, 1991
  Well thought out and well designed
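Because the ASCII half is left untouched, English and Indian-language text can be separated in a byte stream by looking at the high bit alone. A minimal sketch follows; the function name `split_runs` is made up for this example, and the upper-half bytes in the usage note are arbitrary placeholders, not a claim about which ISCII characters they stand for.

```python
from itertools import groupby

def split_runs(data: bytes):
    """Split a byte string into runs of ASCII bytes (< 0x80) and upper-half bytes (>= 0x80)."""
    runs = []
    for is_upper, group in groupby(data, key=lambda b: b >= 0x80):
        # Iterating over bytes yields ints, so bytes(group) rebuilds the run.
        runs.append(("upper" if is_upper else "ascii", bytes(group)))
    return runs
```

For example, `split_runs(b"Hi \xb3\xda ok")` gives three runs: the ASCII text `b"Hi "`, the two upper-half bytes, and the ASCII text `b" ok"`. No escape sequences or mode switches are needed to mix the two.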
Why did ISCII fail to catch on? (BIS – 1991: IS 13194)
  Silent on Character-to-Font mapping
    A complex many-to-many mapping
    Fonts not standardized, fonts not available
  Not registered, no OS/Browser Support
  Rationale not explained
  Not publicized, not enforced
History
  Proprietary, non-standard, secret font based encoding schemes
  Promoted by commercial companies
  Near Zero Inter-operability
  Ad-hoc ISCII-to-font mapping schemes
  Mapping schemes not made public
  To be made Illegal and Punishable
  Put India back by at least a decade!
Improving ISCII
  Register - To get OS/Browser Support
  Remove encoding of allophones, allographs
  Script Grammar: FSM enough, CFG - not needed
  Include Rationale, explanatory notes
  Remove Attribute/Extension codes
  Standardize ISCII-to-Font Mapping Scheme
  Promote, Enforce
Character-to-Font Mapping
  Complex scripts – not linear
  Glyphs: shape units convenient for rendering
    Poor correspondence with sound units
    Many-to-Many mappings
  Glyph selection, scaling, positioning
  No Glyph Encoding Standard
From Character to Font
  Must be provably complete and 100% consistent
  Current systems are all ad-hoc – neither complete nor consistent
  Finite State Transducers: Necessary and Sufficient
    Without restricting Creativity and Flexibility
    Simple, Efficient, Re-Usable
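To make the transducer idea concrete, here is a deliberately tiny sketch. It assumes Unicode Devanagari input and handles just one rendering rule: the short-i vowel sign (U+093F) is drawn to the left of the single consonant it follows. The function name and the choice to emit characters rather than font glyph codes are assumptions made for this example; a real mapping would also cover conjuncts, glyph selection and positioning.

```python
I_MATRA = "\u093f"  # Devanagari vowel sign I, drawn to the left of its consonant

def is_consonant(ch):
    return 0x0915 <= ord(ch) <= 0x0939  # KA..HA

def to_visual_order(text):
    """Map characters in spoken (logical) order to a list in drawing (visual) order.

    The transducer state is 'held': either None or the one consonant whose
    output is delayed until we know whether an I sign follows it.
    """
    output = []
    held = None
    for ch in text:
        if held is not None and ch == I_MATRA:
            output += [I_MATRA, held]   # emit the sign first, then the consonant
            held = None
            continue
        if held is not None:
            output.append(held)         # nothing to reorder, release the consonant
            held = None
        if is_consonant(ch):
            held = ch
        else:
            output.append(ch)
    if held is not None:
        output.append(held)             # flush at end of input
    return output
```

The set of states is finite (nothing held, or one of a fixed set of consonants held), every input has exactly one output, and no input is left unhandled, which is the sense of "complete and consistent" that this toy case can show.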
Encoding Standards: Unicode
  For Language/Script/SuperScript?
    CJK. Why not for ILs?
  Script Grammar?
  Character-to-Font: relegated to font level font effects
  ISCII-88 Based, Has Errors
    Once added, cannot be deleted!
ISCII or Unicode?
  Unicode:
    To be with the World, to know and be known
    ‘Correcting’ Mistakes, Improving Standards
    Support (OS, Fonts, etc.), Education, Training
    Converting Legacy Data – A Huge Task
      ISCII-to-Unicode is not trivial
  Ignore BIS Standard and embrace what is not yet ‘standardized’?
  Why not co-exist? – Internal and External Views
Keyboard Layouts, Drivers
  Several de-facto standards and many variations in use
    To select a few and standardize
  So called Roman Phonetic
    Typing ILs through English!
    OK for oldies, not for future!
  INSCRIPT: ISCII Standard, Good for newcomers
  To strictly enforce Script Grammar
Document Encoding Standards
  Plain Text: pure ISCII/UNICODE
    Mono-lingual Plain Text?
  Annotated Text (Ex. Word Processors)
    XML Style, Open, Readable formats to be encouraged
    Proprietary, secret, non-standard encodings must be discouraged
Transliteration
  Widely used, part of our Tradition
    Sanskrit texts in local scripts
    English, Hindi, Urdu words in local scripts
    Music Compositions
  Automatic in ISCII. Unicode?
  Quality of transliteration
  To and From English?
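In Unicode the Indic script blocks are laid out in parallel, following the ISCII ordering, so a first approximation to script-to-script transliteration is to shift each code point by the distance between block starts. The sketch below assumes that approach; the function name, the three-script table, and the fallback of keeping the source character when the target position is unassigned are choices made for this example. It says nothing about the quality of the result, which is the open question raised above.

```python
import unicodedata

# Start of each Unicode block; every block spans 0x80 code points.
BLOCK_START = {"devanagari": 0x0900, "telugu": 0x0C00, "kannada": 0x0C80}
BLOCK_SIZE = 0x80

def shift_script(text, source, target):
    """Move characters from the source block to the same offset in the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            candidate = chr(cp - src + dst)
            # Some offsets have no character in the target script:
            # unicodedata.name returns the default "" for those, so keep the original.
            out.append(candidate if unicodedata.name(candidate, "") else ch)
        else:
            out.append(ch)  # other scripts, digits, punctuation pass through
    return "".join(out)
```

For instance, `shift_script("\u0915\u092e\u0932", "devanagari", "telugu")` maps the three Devanagari letters ka, ma, la to the Telugu letters at the same offsets. Characters that exist in only one script are where this simple shift stops being enough.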
Romanization
  Need:
    Where there is no support for local languages
      English dailies, posters, advertisements etc.
      Lack of support: OS/Browser/Fonts etc.
    Where users prefer Roman
  A variety of ad-hoc schemes in use
    iTRANS, RTS, W-X, etc.
  Standards badly wanted
Romanization
  Multi-dimensional optimization problem
  Case Mix-up
    26 Letters not sufficient
    52 nearly sufficient
    Not always supported
  Storage space, Ease of Typing, Aesthetics
  Scientific/Logical Design/Naturalness
    English-like – for the oldies: a, ee, oo, a, oa ???
    Futuristic: aa/ii/uu/ee/oo
Romanization
  Clashes: a+u/au, k+h/kh, s’
  Two way conversion, cyclic check
  Ex. Long Vowels:
    a:         -clashes with colon
    diacritic  -not supported
    ipa        -not understood -not supported
    A          +single char. +saves space -ugly -difficult to type -case-mix-up
    aa         +logical (like ee) +easy to type
Romanization: An Example
  a aa i ii u uu R RR e ee ai o oo au M H
  k kh g gh n~
  c ch j jh n`
  T TH D DH N
  t th d dh n
  p ph b bh m
  y r l v s’ S s h L
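The clashes and the cyclic check mentioned earlier can be tried out on the symbols of this example scheme. The sketch below is an illustration only: it reads Roman text by greedy longest match, which is one possible reading strategy and an assumption here, and it writes the symbol s’ with a plain ASCII apostrophe. The function names are invented for the example.

```python
# Symbols of the example scheme; s' is written here with an ASCII apostrophe.
SYMBOLS = """a aa i ii u uu R RR e ee ai o oo au M H
k kh g gh n~ c ch j jh n` T TH D DH N t th d dh n
p ph b bh m y r l v s' S s h L""".split()

SYMBOL_SET = set(SYMBOLS)
LONGEST = max(len(s) for s in SYMBOLS)

def tokenize(text):
    """Split Roman text into scheme symbols, always taking the longest match."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(LONGEST, 0, -1):
            piece = text[i:i + size]
            if piece in SYMBOL_SET:
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"no symbol matches at position {i}: {text[i]!r}")
    return tokens

def survives_round_trip(units):
    """Cyclic check: write the units out, read them back, compare."""
    return tokenize("".join(units)) == list(units)
```

Here `survives_round_trip(["k", "a", "m", "a", "l", "a"])` is true, while `survives_round_trip(["a", "u"])` and `survives_round_trip(["k", "h"])` are false, because the written forms are read back as the single symbols au and kh. A scheme that must be reversible needs either a separator or symbols chosen so that no such pair can occur.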
Translation
  Create Material Afresh
  Translate by Hand
  Automatic/Machine Translation
  Machine Aided Translation
  English – Local Language Translation
  Local – Local Language Translation
Translation
  Resource Intensive
    Manpower, Time, Cost
  Quality/Uniformity
    Standards, Bench-Mark Data, Testing and Evaluation Procedures
    Dictionaries, Terminology Databases
    Pan-Indian Terms/Sanskritize/Localize
Linguistic Resources
  Dictionaries – General, Domain Specific
  Terminological Databases
  Thesauri, WordNets, Ontologies
  Morphological Analyzers, Generators
  Spell/Grammar/Style Checkers
  Annotated Text and Speech Corpora
India: Future is in Speech
  One Billion People, A Sixth of the World
  More than 150 Languages, 22 Recognized
  95% not comfortable with English
  Computers, Current, Connectivity
  Info Revolution benefits: Majority Deprived
  10 M Computers, 100 M Phones
  Future is in Speech
Speech
  Natural
  Easy, Fast
  Hands-Free
  No need to Learn
    Technology
    Language
  Available to all
Text and Speech
  Speech is Natural
  Reading/Writing is learnt, Artificial
  Some never learn – Illiterates
  Oral Tradition
  Speech is more permanent than Text!
  “I did not steal that ring of gold”
  Trust Yourself!
Speech Technologies
  Speech Recognition: Speech to Text
  Speech Synthesis: Text to Speech
  Speaker Recognition, Verification, ID
  Speech Coding/Decoding, Compression
  Slow down, Speed up
  Speech as Evidence
Applications
  Telephone Dialing
  Form Filling
  Dictation Machine
  Command and Control
  Voice enabled Web
  OCR+WP+TTS
  MT: Cross-Lingual IR, S2S
OCR
  OCR in Local Scripts Needed
    To digitize and save legacy data
    To compile/process/edit/refine data
  For Printed Texts/Manuscripts
  Old Data
    deterioration of paper
    old type fonts, problems of typesetting
Multi-Modal Interfaces
  To Reach out to 1 Billion People, we must get the best of many worlds:
    Speech Recognition and Synthesis
    Graphics and iconic Interfaces
    OCR Technologies
    Translation, CLIR
    Camera, Gestures, Touch Screen
Balance Between
  Backward Compatibility and Future-Proof Designs
  Quick Fix Solutions and Long Haul
  One Standard or Several?
  Economics and Business Sense versus Social Responsibilities
  Acceptance versus Enforcement
The 3 Most Important Things
  1. Develop/Refine/Update Standards
    Detailed Documentation
    Including rationale, issues, evaluation, etc.
  2. Education and Training
  3. Enforcement
    Make use of non-standard methods illegal and punishable under law
  Technical Workshops for detailing
Thank You!
Visit www.LanguageTechnologies.ac.in