kewal krishan - pan localization india.ppt.pdf · kewal krishan technical director ... english to...
Post on 05-Feb-2018
National Informatics Centre, Government of India
“Localization & Language Technology Standards”
Kewal Krishan
Technical Director & Member Secretary
e-Governance Standards Working Group on Localisation & Language Technology (eGS-WG-LLT)
E-Governance Standards
The Government of India has launched the National e-Governance Action Plan (NeGP) to support the growth of e-governance within the country. The Plan envisages creating the right environment to implement G2G, G2B, G2E and G2C services.
Many developments initiated by various government agencies are seemingly done in isolation. Different development platforms are used, and applications built on different platforms are seldom interoperable, so they are difficult to integrate even though many have similar features and functionality. In addition, no single agency is responsible for framing enforceable e-governance standards and processes that all developers must adhere to.
In view of the strategic and contemporary importance of standards for e-Governance, the Department of Information Technology has constituted an apex body to oversee the process of bringing out e-governance standards. The following five areas have been identified:
1. Network and Information Security
2. Metadata and Data Standards for Application Domains
3. Quality and Documentation
4. Localization and Language Technology Standards
5. Technical Standards and E-Governance Architecture
Language diversity in India
• India is a multi-lingual and multi-script country with
- Over 500 languages;
- 216 mother-tongues with > 10000 dialects in use;
- 22 constitutionally recognized languages
• Only 6% of the Indian population speak English
• Operating systems as well as software tools and application packages are in English
• Roadmap: promote and achieve e-Governance through the language in which the common man can speak, transact, understand, generate content and communicate with others
E-Governance
• Language plays a vital role in successful implementation of e-Governance
• Need for user-friendly interfaces.
• Especially targeted towards the rural environment
Need of Standards
• Interoperability and Information Sharing between systems.
• Interoperability between systems supplied by different vendors.
• “Platform-Independent Modelling” approach.
• Increased Adaptability & Flexibility.
Standardization
• Storage standards
• Font standards
• Inputting standards
• Transliteration / Roman equivalent
• Sorting order / sequence for Indian languages
• OCR standards
• Standards for website and e-mail
• Local search engine standards
• Availability of all constitutionally recognised Indian languages in all operating systems
• Strategy for conversion of data from ISCII to UNICODE
Make all Government services accessible to citizens, in their own language, through Common Services Centres
SCA – Service Centre Agencies
VLE – Village Level Entrepreneur
NLSA – National Level Service Agency
Scheduled Languages in descending order of strength - 2001 Census
No.  Language / Script                                      % Population
1.   Hindi / Devanagari                                     40.22%
2.   Bengali / Bangla                                       8.30%
3.   Telugu / Telugu                                        7.87%
4.   Marathi / Devanagari                                   7.45%
5.   Tamil / Tamil                                          6.32%
6.   Urdu / Perso-Arabic                                    5.18%
7.   Gujarati / Gujarati                                    4.85%
8.   Kannada / Kannada                                      3.91%
9.   Malayalam / Malayalam, Malayalam-2                     3.62%
10.  Oriya / Oriya                                          3.35%
11.  Punjabi / Gurumukhi, Shahmukhi                         2.79%
12.  Assamese / Bangla (Modified)                           1.56%
13.  Sindhi / Devanagari, Gujarati, Roman, Perso-Arabic     0.25%
14.  Nepali / Devanagari                                    0.25%
15.  Konkani / Devanagari, Kannada, Roman                   0.21%
16.  Manipuri / Bangla, Manipuri-new                        0.15%
17.  Kashmiri / Perso-Arabic                                0.01%
18.  Sanskrit / Devanagari                                  0.01%
19.  Bodo / Devanagari, Bangla (Modified)                   0.23%
20.  Dogri / Devanagari                                     0.1%
21.  Santhali / Devanagari, Ol Chiki                        0.56%
22.  Maithili / Meetei-Mayek                                2.8%
Areas & Issues identified
1. OS Support
i) Locales and Sorting
ii) User Interface
iii) Searching
iv) Rendering on PC in Application/Browser: Display, Layout
v) Character Encoding: Unicode, ISCII
vi) Inputting Methods
- Keyboard Layouts: Typewriter, Inscript, Phonetic
- Online Handwriting Recognition
- Text OCR
- Speech to Text
2. Content Creation
- Editors for Desktop & Web (Front Page, Quanta Plus, Dreamweaver, Rational Site Developer, W3Cindia.in compliant certified markup etc.), ODF and related standards
- Browser Support
3. Resources and Tools:
- Processing Resources: Spell Checker
- Language Resources: Dictionaries, Ontology, Glossary, Lexicon, Thesaurus
- Annotated Corpora: Text & Speech
- Machine Translation
- Transliteration (BARAHA software, freeware; International Components for Unicode, ICU)
- Database Support: Data Storage & Retrieval
4. Search Engine Support (Google, Yahoo, MSN, Lucene, Raftar, Khoj etc.)
5. Localized Applications
- Interoperability between Platforms and Technologies
So far we have organized ten brainstorming sessions (covering Tamil, Telugu, Malayalam, Kannada, Marathi, Gujarati, Oriya, Assamese, Bangla and Hindi) at the following locations:
1. Chennai - 20th Feb., 2006
2. Mumbai - 13th April, 2006
3. Trivandrum - 15th May, 2006
4. Kolkata - 2nd June, 2006
5. Bhubaneswar - 19th June, 2006
6. Guwahati - 26th June, 2006
7. Ahmedabad - 11th July, 2006
8. Bangalore - 08th August, 2006
9. Hyderabad - 20th September, 2006
10. Chandigarh - 29th September, 2006
First Working Group Meeting held in NIC, Delhi - 25th July, 2006
Technology Providers' meeting held in NIC, New Delhi – 15th November, 2006
Second Working Group Meeting held in NIC, Delhi - 09th January, 2007
My colleague Shri M.D. Kulkarni, Director, C-DAC, and I prepared the draft Roadmap for Localization & Language Technology Standards. It was later modified on the basis of feedback received from:
a) members who attended the brainstorming sessions and the First and Second Working Group meetings;
b) technology providers.
1. OS Support
1.1 OS Support under Windows, Linux, Mac OS
Current Issues/Status:
Present status:
• Windows 2000/XP: support is available for 13 of the 22 constitutionally recognized Indian languages (Bangla, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Punjabi, Sanskrit, Tamil, Telugu and Urdu).
• Windows Vista: 15/22; support added for Oriya and Assamese.
• Red Hat Linux: support is available for 9/22 (Bangla, Gujarati, Hindi, Marathi, Oriya, Punjabi, Malayalam, Tamil and Telugu).
• Mac OS X: support is available for 4/22 (Gujarati, Hindi, Punjabi and Tamil).
Support not available:
• Windows: Bodo, Dogri, Kashmiri, Maithili, Manipuri, Sindhi, Santhali.
• Red Hat Linux: Assamese, Bodo, Dogri, Kannada, Konkani, Kashmiri, Maithili, Manipuri, Nepali, Sindhi, Santhali, Sanskrit, Urdu.
• Mac OS X: 18 of the 22 constitutionally recognized languages.
Destination Desired: Support for all 22 constitutionally recognised languages must be available in all operating systems.
1.2 Locales Data
Current Issues/Status:
• Locale data is presently insufficient and does not accommodate requirements specific to Indian culture.
• Available for 12/22 languages under Windows and 9/22 languages under Linux.
• Nepali language locale data is available for Nepal as per its requirements.
Destination Desired:
• The Unicode Common Locale Data Repository (CLDR) is a public, open-source locale database. MCIT is a member of the Unicode Consortium. C-DAC and State authorities have access to it. It should be used to address enhancement needs. http://www.unicode.org/cldr/
1.3 Sorting
Current Issues/Status: Sort orders for official Indian languages are available as part of the Common Locale Data Repository (CLDR).
Destination Desired:
• Should be owned and managed by the various state and national authorities.
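As one illustration of the kind of India-specific requirement that locale data (item 1.2) has to capture, the Indian numbering system groups digits as 12,34,567 (lakh/crore grouping) rather than 1,234,567. The slides do not say which requirements are missing, so this is only an example. The sketch below hard-codes the grouping rule in Python; the function name `format_indian_grouping` is hypothetical and is not part of CLDR or of any product named in these slides. In a real system the grouping pattern would come from locale data.

```python
def format_indian_grouping(n: int) -> str:
    """Format an integer with Indian digit grouping.

    The last three digits form one group; every group before that
    has two digits (thousand, lakh, crore, ...).
    """
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits

    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two-digit groups off the right-hand side of the head.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])


if __name__ == "__main__":
    for value in (999, 100000, 1234567):
        print(value, "->", format_indian_grouping(value))
```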
1.4 Encoding
Current Issues/Status: Unicode characters are almost complete according to the respective language requirements.
Destination Desired:
1. Constant interaction with the Unicode Consortium for proper representation of Indian languages.
2. Standards must be enforced at State level.
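A practical detail of adopting Unicode as the encoding (item 1.4) is that the same visible letter can sometimes be stored as more than one code point sequence, so applications usually normalise text before storing or comparing it. The sketch below uses Python's standard `unicodedata` module. The choice of NFC and the helper name `same_text` are assumptions made for illustration and are not prescribed in these slides.

```python
import unicodedata


def normalize_nfc(text: str) -> str:
    """Return the text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return normalize_nfc(a) == normalize_nfc(b)


# Two encodings of the Devanagari letter QA:
precomposed = "\u0958"        # single code point, DEVANAGARI LETTER QA
decomposed = "\u0915\u093c"   # KA followed by SIGN NUKTA

if __name__ == "__main__":
    print("raw comparison equal:       ", precomposed == decomposed)
    print("normalised comparison equal:", same_text(precomposed, decomposed))
```

The raw comparison treats the two strings as different, while the normalised comparison treats them as the same text.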
1.5 Inputting Mechanism
a) Keyboard Layouts
b) Speech to Text
c) Handwriting Recognition, Text OCR etc.
Current Issues/Status (Present Status):
a) Keyboard Layouts: any inputting method can be used in a Unicode-enabled OS. The INSCRIPT keyboard layout is available at OS level.
b) Speech to Text: Shrutlekhan-Rajbhasha is available for the Hindi language.
c) Handwriting Recognition & OCR: technology under development.
Destination Desired:
• Typewriter keyboard and State-level language-specific requirements (KGP keyboard layout for Kannada, TAM99 keyboard layout for Tamil) should be supported at operating system level.
Note: Output of any user-specific keyboard layout must conform to the current version of Unicode.
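The note on keyboard layout output (item 1.5) can be checked mechanically in one narrow sense: every character an input method produces should be an assigned Unicode character, not a private-use or unassigned code point such as those used by some legacy font hacks. The following is a minimal sketch of that check using Python's standard `unicodedata` module. The function names are hypothetical, and the check is made against the Unicode version bundled with the Python interpreter (`unicodedata.unidata_version`), which may not be the current version.

```python
import unicodedata

# General categories that mark a code point as unsuitable for interchange:
# Cn = unassigned, Co = private use, Cs = surrogate.
NON_CONFORMING_CATEGORIES = {"Cn", "Co", "Cs"}


def non_conforming_chars(text: str) -> list:
    """List (index, 'U+XXXX') for each unassigned, private-use or surrogate code point."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) in NON_CONFORMING_CATEGORIES
    ]


def conforms_to_unicode(text: str) -> bool:
    """True if every character is assigned in the bundled Unicode database."""
    return not non_conforming_chars(text)


if __name__ == "__main__":
    print("Unicode database version:", unicodedata.unidata_version)
    print(conforms_to_unicode("\u0915\u093e"))      # KA + vowel sign AA
    print(non_conforming_chars("\u0915\ue000"))     # KA + a private-use code point
```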
1.6 Rendering of fonts in Application as well as in Browser
Current Issues/Status: For rendering OpenType fonts, a rasterisation engine needs to be built into the OS.
Destination Desired:
• A rasterisation engine is not implemented for all 22 scheduled Indian languages. It needs to be implemented by OS developers.
• Collation standards need to be defined.
1.7 Searching
Current Issues/Status:
• Character-level search is available.
• Contextual and intelligent search engines are not available.
Destination Desired: Intelligent search engines need to be developed.
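The gap described under item 1.7 (Searching) can be seen with a small Python sketch. Character-level search is modelled here as plain substring matching, which is an assumption about what the slides mean by the term. The sample sentences and the function name `char_level_search` are made up for illustration.

```python
def char_level_search(query: str, documents: list) -> list:
    """Return the indexes of documents that contain the query as an exact substring."""
    return [i for i, doc in enumerate(documents) if query in doc]


# Two sample Hindi sentences. The first uses the singular word for "river",
# the second uses an inflected plural form of the same word.
docs = [
    "नदी का पानी साफ है",
    "नदियों में पानी कम है",
]
query = "नदी"

if __name__ == "__main__":
    # Only the first sentence is matched: the plural form changes the
    # vowel sign, so it no longer contains the query character sequence.
    print(char_level_search(query, docs))
```

A search engine that is aware of word forms would be expected to return both sentences for this query, which is the kind of capability the slide lists as missing.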
2. Content Creation
2.1 Content Creation Editors for Desktop & Web
W3C: Markup languages
- Standard Generalized Markup Language (SGML), old standard
- Hypertext Markup Language (HTML)
- Extensible Markup Language (XML)
- Extensible Hypertext Markup Language (XHTML)
- XLIFF – XML Localization Interchange File Format
- TEI – Text Encoding Initiative
Current Issues/Status: A lot of tools are now available for content creation in Indian languages.
Destination Desired:
• Adoption of W3C specifications on every government website.
2.2 Browser Support
Current Issues/Status: IE6, IE7, Firefox, Netscape etc. support Indian languages.
3. Resources & Tools
3.1 Processing Resources: Spell Checker
Current Issues/Status: Available in most of the Indian languages but need improvement. The available spell checkers work only in specific applications (i.e. they cannot be plugged into other applications).
Destination Desired: A plug-in spell checker is required that works in all general-purpose application software.
3.2 Language Resources: Dictionaries, Glossary, Lexicon, Thesaurus, WordNet, Corpora: Text & Speech
- ISO: TermBase eXchange (TBX)
- ISO: Terminology Markup Framework (TMF)
- ISO: Lexical Resource Markup Framework (LRMF)
- EAGLES/ISLE: CES: Corpus Encoding Standards
- EAGLES/ISLE: XCES: XML based Corpus Encoding Standards
- EAGLES: MATE – Multilingual Annotation Tools Engineering
Current Issues/Status: No language resources conforming to standards are available.
Destination Desired: A national initiative is needed for the development of linguistic resources.
3.3 Machine Translation
Current Issues/Status: Research and development in MT has been underway at several organizations in India.
i) English to Indian language MT systems
ii) Indian language to Indian language MT systems
iii) English has been the language of choice in the foreign language category among the MT R&D community in India. Efforts are being made to build MT systems for the English-Hindi language pair. Research is being done on developing MT systems for Hindi and other Indian languages.
iv) University of Hyderabad is working on an English-Kannada MT system, using the Universal Clause Structure Grammar (UCSG). This is essentially a transfer-based approach, and will be used in all Government circulars.
v) Some organizations are also working in this area:
- IIT Kanpur (using Anglabharati approach)
- IIT Mumbai (using Universal Networking Language, UNL, approach)
- Super Infosoft Pvt (developed Anuvadak system)
- IBM, Gurgaon (using statistical approach)
- IIIT-Hyderabad and University of Hyderabad (developed Anusaraka – A Language Accessor)
vi) CDAC, AAI Group has developed the MT system MANTRA-RAJBHASHA (English to Hindi MT system) for the Administrative, Finance, Agriculture and Small Scale Industry domains.
Destination Desired: The interfaces for these technologies should be general purpose and not platform-specific.
3.5 Database Support: Data Storage & Retrieval
If the database is Unicode compliant, there is no issue with storage and retrieval. Most databases (SQL Server, Oracle, MySQL, DB2 etc.) support Unicode.
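The storage and retrieval claim in item 3.5 can be checked with a simple round trip: write text in several scripts, read it back, and compare. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a stand-in, since it needs no server. SQLite is not one of the databases named in the slides, and the table name `notices` and the sample strings are hypothetical.

```python
import sqlite3


def round_trip(texts: list) -> list:
    """Store the given strings in an in-memory database and read them back in order."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE notices (id INTEGER PRIMARY KEY, body TEXT)")
        # Parameterised inserts pass the Unicode strings through unchanged.
        conn.executemany(
            "INSERT INTO notices (body) VALUES (?)", [(t,) for t in texts]
        )
        rows = conn.execute("SELECT body FROM notices ORDER BY id").fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()


samples = ["सूचना", "தகவல்", "information"]

if __name__ == "__main__":
    print("round trip preserved text:", round_trip(samples) == samples)
```

With a server database the same test would also depend on the column type, collation and connection character set being configured for Unicode.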
3.4 Transliteration
Transliteration is available in only a few Indian languages.
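For the transliteration item (3.4), one well-known shortcut for script-to-script conversion is that the Unicode blocks for the major Indic scripts are laid out in parallel, so many characters can be mapped by adding a fixed offset. The sketch below converts Devanagari to Gujarati in that naive way, using only Python's standard library. It is an illustration under stated assumptions, not the method used by any tool named in the slides: it handles only the core letters, signs and digits, leaves characters with no counterpart unchanged, and does not cover Roman transliteration.

```python
import unicodedata

DEVANAGARI_BASE = 0x0900
GUJARATI_BASE = 0x0A80
OFFSET = GUJARATI_BASE - DEVANAGARI_BASE

# Only these Devanagari ranges are mapped: core signs, letters and vowel
# signs up to the virama, plus the digits. Other positions in the two
# blocks do not line up reliably.
MAPPED_RANGES = ((0x0901, 0x094D), (0x0966, 0x096F))


def devanagari_to_gujarati(text: str) -> str:
    """Naively transliterate Devanagari to Gujarati by code point offset."""
    out = []
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in MAPPED_RANGES):
            candidate = chr(cp + OFFSET)
            # Keep the original character when Gujarati has no assigned
            # character at the corresponding position.
            if unicodedata.category(candidate) != "Cn":
                out.append(candidate)
                continue
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    word = "\u092d\u093e\u0930\u0924"   # the word "Bharat" in Devanagari
    print(word, "->", devanagari_to_gujarati(word))
```

A production transliterator would need script-specific rules on top of this, which may be one reason coverage is limited to a few languages.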
4. Search Engine Supporting Indian Languages (Google, Yahoo etc.)
W3C
Presently, character-level search is available in all major search engines.
5. Localized Applications
Open Office: works on operating systems that have language support, such as Windows XP and Linux.
1. C-DAC GIST, Pune, has already taken up the localization of BharateeyaOpen Office for all 22 scheduled Indian languages.
2. Localized versions for Tamil, Hindi and Telugu are currently released.
3. Localized versions for Kannada, Punjabi, Urdu, Oriya, Assamese, Bengali, Malayalam and Gujarati are ready and awaiting release.
4. Localization for the rest of the languages is in progress.
Main Points highlighted :
1. Strategy for conversion of data from ISCII and other formats to UNICODE.
2. Long term goal for manpower development in language technology.
3. Release of free tools and tool kits for software developers to develop portals, databases etc. in Indian languages.
4. There should be a strategy for transparent interoperability.
5. Setu-dev as double-byte TTF fonts may be a standard that will work with all OSs, all browsers and all word processors.
6. Stop leakage of Govt. money on non-standard and proprietary items.
7. Chalk out a strategy for conversion of existing corpora.
8. Adoption of W3C standards for developing websites in Indian Languages.
9. Adoption of Open Document format for G2C interaction.
Thank You