[IEEE 2013 4th International Conference on Computer and Communication Technology (ICCCT) - India (2013.09.20-2013.09.22)]

Abstract-- Many research organizations in India and abroad have recently started developing translation systems for Indian languages using conventional approaches such as rule-based, example-based or hybrid. Very few have tried to exploit the universality of government and binding (GB) theory, which emphasizes a common phrase structure for all languages. In this paper, a machine translation system based on the rule-based approach is proposed. The system takes Hindi as the source language and English as the target language.

Index Terms-- GB theory, Hindi to English translation system, Indian language, example-based approach, government-and-binding theory, machine translation system, phrase structure, rule-based approach.

I. INTRODUCTION

The aim of this research work is to develop the English language text generator module of a Hindi to English Machine Aided Translation (MAT) system, named “Developing a System for Machine Translation from Hindi language to English language”. This is a rule-based MAT system with Hindi as the source language and English as the target language. The source language analyzer and the target language generator are the two major building blocks of a machine translation system. The challenge was to adopt the rule-based technology and develop the English language text generator. Machine Translation (MT) is a sub-field of Artificial Intelligence (AI) which translates text from one language, known as the source language, into text of another language, known as the target language [1]. Development of an MT system requires very close collaboration among linguists, professional translators and computer engineers. In the development process, there are two major goals: (a) accuracy of translation and (b) speed. Accuracy-wise, smart tools are to be developed for handling transfer grammar and translation standards, including equivalent words, expressions, phrases and styles in the target language. The grammar should be optimized with a view to obtaining a single correct parse and hence a single translated output. Speed-wise, the system calls for innovative use of corpus analysis, an efficient parsing algorithm, efficient data structures and run-time frequency-based rearrangement of the grammar, which substantially reduces the parsing and generation time. Grammar induction is a machine learning process for learning grammar from corpora [5].

II. SIGNIFICANCE OF RESEARCH WORK

The main significance of machine translation is that it is a computerized method that automates all or part of the process of translating from one human language to another [1,2]. Research in the area of machine translation has been going on in India and abroad for several decades. During the early 1990s, advanced research in the fields of Artificial Intelligence and Computational Linguistics brought promising developments in translation technology. This helped in the development of usable machine translation systems in certain well-defined domains. A fully automatic high quality machine translation system (FGH_MT) is extremely difficult to build. In fact, there is no system in the world which qualifies to be called FGH_MT. Many organizations such as IIT Kanpur, CDAC (Mumbai), CDAC (Pune) and IIIT (Hyderabad) have been engaged since 1990 in the development of MT systems under projects sponsored by the Department of Electronics (DoE), state governments and others [4]. Research on MT systems between Indian and foreign languages, and also between Indian languages, is going on in these institutions. Translation between structurally similar languages like Hindi and Punjabi is easier than between language pairs that have wide structural differences, like Hindi and English. Translation systems between closely related languages are easier to develop since they have many parts of their grammars and vocabularies in common [6]. The proposed work will demonstrate how machine translation is becoming more frequent in various computer applications.

III. LITERATURE SURVEY

Approaches used in Machine Translation Systems in India

Various researchers use different approaches for machine translation; some of them are discussed in the subsections of this section [6].

Developing a System for Machine Translation from Hindi language to English language

2013 4th International Conference on Computer and Communication Technology (ICCCT)

978-1-4799-1572-9/13/$31.00 ©2013 IEEE

Mrs. Shachi Mall
PhD Scholar, Deptt. of Computer Science and Engineering
Madan Mohan Malaviya Engineering College, Gorakhpur
[email protected]

Dr. Umesh Chandra Jaiswal
Associate Professor, Deptt. of Computer Science and Engineering
Madan Mohan Malaviya Engineering College, Gorakhpur
[email protected]


A. Robust Parser based Translation Processor
The Robust Parser based Translation Processor is a machine translation system which consists of the following three analysis models [8]:
• All information analysis model.
• Syntactic constraint analysis model.
• Semantic constraint analysis model.

B. Example Based Translation Processor
An Example Based Machine Translation (EBMT) system maintains a corpus consisting of translation examples between the source and target languages. An EBMT system has two modules:

1.1 Retrieval module

The retrieval module retrieves a similar sentence and its translation from the corpus for the given source sentence. For example, consider the English to Hindi translation of the sentence “Rama sings a song”. The retrieval module first identifies a list of approximately matching sentences in the corpus, using similarity measures based on word similarity or on syntactic and semantic similarity. From these, the system selects the sentence that most closely matches the input sentence, together with its translation. If the system selects “Rohit sings a song” and its translation “Rohit geet gaata hai” as the closest one, it replaces Rohit with Rama and gaata with gaati, and finally forms the translation “Rama geet gaati hai”.

1.2 Adaptation module

The adaptation module then adapts the retrieved translation to get the final corrected translation. Here the adaptation consists of word and suffix replacements. This method may not work in cases of translation divergence, where structurally similar sentences of the source language get translated into different structures [8].
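The two EBMT modules can be illustrated with a minimal Python sketch. It is not the implementation of any system surveyed here: the example base, the list of feminine names and the suffix table are hypothetical, and word-level similarity from the standard `difflib` module stands in for the similarity measures mentioned in the text.

```python
from difflib import SequenceMatcher

# Hypothetical example base: (English sentence, Hindi translation) pairs.
EXAMPLES = [
    ("Rohit sings a song", "Rohit geet gaata hai"),
    ("Rohit reads a book", "Rohit kitaab padhta hai"),
]

# Hypothetical adaptation tables (assumptions, not taken from the paper).
FEMININE_NAMES = {"Rama"}
SUFFIX_RULES = {"gaata": "gaati", "padhta": "padhti"}


def retrieve(sentence):
    """Retrieval module: return the example whose source side is closest."""
    words = sentence.split()
    return max(
        EXAMPLES,
        key=lambda pair: SequenceMatcher(None, words, pair[0].split()).ratio(),
    )


def adapt(sentence, example):
    """Adaptation module: word and suffix replacement on the retrieved translation.

    Assumes the input and the retrieved example have the same number of words.
    """
    src_words, ex_words = sentence.split(), example[0].split()
    target = example[1].split()
    for new, old in zip(src_words, ex_words):
        if new != old:
            # Word replacement: carry the differing source word over.
            target = [new if w == old else w for w in target]
            # Suffix replacement: verb agreement for a feminine subject.
            if new in FEMININE_NAMES:
                target = [SUFFIX_RULES.get(w, w) for w in target]
    return " ".join(target)


def translate(sentence):
    return adapt(sentence, retrieve(sentence))


print(translate("Rama sings a song"))  # Rama geet gaati hai
```

The sketch only covers the case where the input and the retrieved example share a structure; under translation divergence the word-by-word alignment in `adapt` would no longer hold, which is the limitation noted above.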

C. Anusaaraka
The “ANUSAARAKA” project for machine translation from one Indian language to another dates from 1995. It has been used for translation from Telugu, Kannada, Bengali, Punjabi and Marathi to Hindi and vice versa. The ANUSAARAKA system is a language access or machine translation system that works on the principles of Paninian Grammar (PG) [1, 2]. This is primarily used for generalizing the constituents and replacing them with an abstracted form obtained by identifying syntactic groups in the raw examples [22].

D. The Mantra (Machine Assisted Translation Tool)

The machine translation system “MANTRA”, dating from 1999, translates text from English to Hindi in a specific domain: office orders, administrative texts, etc.

E. Anubharti-II Technology

ANUBHARTI is an approach for machine aided translation that combines example-based and corpus-based approaches with some elementary grammatical analysis. In ANUBHARTI the traditional EBMT approach has been modified to reduce the requirement of a large example base. ANUBHARTI-II (2004) uses Hindi as the source language for translation to other Indian languages [1, 2].

F. UNL Based English-Hindi Machine Translation
In 2003, a machine translation system capable of translating from English to Hindi using UNL as the interlingua was developed at IIT Bombay [6]. UNL is an international project of the United Nations University, with the aim of creating an interlingua for all major human languages.

G. Sampark

In 2009 a machine translation system among Indian languages was proposed by a consortium of institutions (IIT Hyderabad, University of Hyderabad, C-DAC Noida, Anna University, KBC Chennai, IIT Kharagpur, IISc Bangalore, IIIT Allahabad, Tamil University and Jadavpur University). At present six systems are being released: Punjabi to Hindi, Hindi to Punjabi, Telugu to Tamil, Urdu to Hindi, Tamil to Hindi and Hindi to Telugu [7].

IV. OBJECTIVES

The objective of a rule-based machine translation (MT) system is to produce an English translation for a given Hindi text. This technique uses corpus management and a multi-lingual lexical database to translate from Hindi into English. Identifying the most important decision-making element in a sentence is necessary for generating a proper translation, and this can be done by using a corpus. A corpus is a body of language which is an indispensable resource in natural language computing. It is widely used to study the behavior of natural languages, develop statistical language computing rules and build lexical resources. In the field of machine translation, a good quality corpus is essential for good quality translation. Thus corpus collection and analysis form an essential task. The following subsections discuss the tools used for lexicon updating and domain-specific lexicon entry in the Hindi to English machine translation system.


A. Corpus collection and analysis
The corpus is acquired from web resources, journals, magazines, text books, etc. The collected corpus has to be processed for analysis. The process includes cleaning, unique word identification, sentence extraction, word extraction and parts-of-speech tagging. A corpus collected from the web may contain junk characters and extra spaces after words or sentences. Before analyzing the corpus, such characters and spaces should be removed.
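The cleaning step might look like the following Python sketch. The definition of a "junk character" used here (control characters, private-use code points and the Unicode replacement character) and the set of sentence delimiters are assumptions made for illustration; the paper does not specify them.

```python
import re
import unicodedata


def is_junk(ch):
    """Assumed junk: control chars, private-use chars, replacement char."""
    return unicodedata.category(ch) in {"Cc", "Co"} or ch == "\ufffd"


def clean_corpus(raw_text):
    """Remove junk characters and extra spaces from raw corpus text."""
    # Normalise Unicode so visually identical characters compare equal.
    text = unicodedata.normalize("NFC", raw_text)
    # Replace junk characters (including tabs and newlines) with a space.
    text = "".join(" " if is_junk(ch) else ch for ch in text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    # Drop spaces left in front of a delimiter (., ?, ! or the danda U+0964).
    text = re.sub(r"\s+([.?!\u0964])", r"\1", text)
    return text.strip()


print(clean_corpus("ram  aam\x00 khata hai .\n\n"))  # ram aam khata hai.
```

Format characters such as the zero-width joiner are deliberately kept, since they can be meaningful in Devanagari text.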

B. Unique word extraction
From the raw corpus, the words and their frequency counts are taken. Each word is then compared with the main lexicon. The compared words are subjected to stemming, which is mainly done to remove the plural markers. After stemming, the words are again compared with the main lexicon. If a word is not found in the lexicon, it is stored in a separate file. Manual verification is required afterwards to eliminate nonsensical words.
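A minimal Python sketch of this procedure is given below. The lexicon, the plural markers and the romanised Hindi words are hypothetical placeholders chosen for readability, and the stemmer is deliberately crude; a real system would work on Devanagari text with a proper morphological analyzer.

```python
from collections import Counter

# Hypothetical main lexicon and plural markers (romanised Hindi).
MAIN_LEXICON = {"kitaab", "aam", "hai"}
PLURAL_SUFFIXES = ("on", "en")


def stem(word):
    """Strip the first matching plural marker; an illustrative stemmer only."""
    for suffix in PLURAL_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def unknown_words(corpus_text, lexicon=MAIN_LEXICON):
    """Return frequency counts of the words that are missing from the lexicon."""
    counts = Counter(corpus_text.lower().split())
    missing = Counter()
    for word, freq in counts.items():
        if word in lexicon:        # first comparison: surface form
            continue
        if stem(word) in lexicon:  # second comparison: stemmed form
            continue
        missing[word] += freq
    return missing


print(unknown_words("kitaaben aamon hai xyzzy xyzzy"))  # Counter({'xyzzy': 2})
```

The returned counter would then be written to a separate file for the manual verification step described above.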

TABLE I
Review of the previous work

Title: 1. AnglaBangla (Machine translation system based on Angla machine translation Paradigm) [2]
Methodology: 1. Example Base. 2. Rule based. 3. Morphological Analyzer.
Conclusion: 1. A complete English to Bangla MAT (Machine Aided Translation) system named AnglaBangla has already been developed, which can translate all types of simple sentences. 2. Currently the system translates input sentences line by line.
Limitation: 1. Translation for simple sentences. 2. Many a time the system also gives more than one translation for a given English sentence. 3. The user gets confused about which one of them to select as a suitable translation of the particular English sentence. 4. Current systems are unable to produce output of the same quality as a human translator, particularly where the text to be translated uses colloquial language.

Title: 1.1. AnglaBangla (A Strategy or Morphological Analysis and Synthesis of Bangla) [2]
Methodology: 1. Preprocessor/Postprocessor. 2. Rule based. 3. Morphological analyzer. 4. Example Based. 5. Inflectional morphology. 6. Derivational morphology.
Conclusion: 1. Morphology is an important element of study in the development of a machine translation system. 2. The degree of performance of an MT system depends on a good morphological analyzer and synthesizer.
Limitation: 1. This is only a stepping-stone in the area of machine translation and it can be hoped that it would facilitate the development of MT systems in other languages also.

Title: 1.2. AnglaBangla (Development of Bangla Text Generator for Translation from English) []
Methodology: 1. It uses a pseudo-interlingua approach. 2. Lexical Database. 3. Morphological Synthesis of Verb.
Conclusion: 1. A complete English to Bangla MAT (Machine Aided Translation) system named AnglaBangla has already been developed, which can translate all types of simple sentences. 2. Currently the system translates input sentences line by line.
Limitation: 1. Translation for simple sentences. 2. Many a time the system also gives more than one translation for a given English sentence. 3. The user gets confused about which one of them to select as a suitable translation of the particular English sentence. 4. Current systems are unable to produce output of the same quality as a human translator, particularly where the text to be translated uses colloquial language. 5. Improvement in the output quality of an MT system can be achieved by human intervention: for example, some systems are able to translate more accurately if the user has unambiguously identified which words in the text are names. With the assistance of these techniques, MT has proven useful as a tool to assist human translators, and in some cases can even produce output that can be used "as is". In general, a number of hand-crafted rules have been implemented in the current text generator module.

Title: 2. Anglamalayalam (English to Malayalam MT system based on AnglaMT Paradigm) [3]
Methodology: 1. Corpus. 2. Corpus Acquisition System. 3. Rule base for Pattern invocation and Transformation. 4. Morph Analysis. 5. Pattern Directed Parsing. 6. Pseudo target for Dravidian Language.
Conclusion: 1. The transliteration system developed has an accuracy rate of 70%. 2. This module is a part of the Morph Analyzer. During a dictionary search, if a word is not found in the dictionary then the flow of the system is directed to the transliteration module for the transliteration of the unknown word. The transliteration module transliterates the unknown acronyms, named entities, etc. that are not found in the lexical database/stored tables into the roman notation assigned to Malayalam.
Limitation: 1. There are some limitations to this also. The name ‘sasi’ is pronounced as ‘SaSi’, so in such cases the rules fail and the only thing that can be done is to add such words to the lexicon. 2. A machine translation system for Malayalam, a language of the Dravidian family, using the AnglaBharati technology, with accuracy comparable to that of a language of the Indo-Aryan family. The system yields good accuracy for simple sentences. More research is needed to improve the system performance with complex sentences. 3. Bottlenecks were found in the form of named entities, noun phrases, verb phrases, etc. The problem was resolved to some extent. 4. After the first phase of the development it was found that extensive research is needed in the area of language modeling to reduce the number of alternate translations and to reorder the correct translations in the order of their acceptability. 5. The speed of the system is also another concern.

Title: 3. Telugu to English Translation using Direct Machine Translation Approach [4]
Methodology: 1. Rule-based. 2. Statistical. 3. Example-based. 4. Hybrid MT.
Conclusion: 1. Telugu being a free word-order structure language, MT from English to Telugu. 2. The accuracy of translation is as high as 90 percent over the given test statements. 3. The parsing of the lexicon, the splitting or stripping of suffixes, and their translation to English was very satisfactory.
Limitation: 1. The system was also tested using free-flowing sentences from various newspaper websites and failed to translate words which involve very complex elisions/inflections, are not available in the dictionary, or have many synonyms. 2. If the dictionary is built only with the standard version, it is certain that the accuracy of translation will drastically reduce. 3. Addition of more linguistic rules related to the handling of elisions/inflections and the word ordering system would enhance the accuracy of the proposed translation system.


V. IDENTIFICATION OF RESEARCH GAP AND PROBLEM

Machine translation has been addressed occasionally in the literature in the last few years, but no widely accepted solution seems to have emerged to date. A framework [19] was proposed which analyzes the errors manually. This scheme covers five top-level classes: missing words, incorrect words, unknown words, word order errors and punctuation errors. Another approach [20] classifies errors at the orthographic, morphological, lexical, semantic and syntactic levels. Some automatic methods for error analysis using base forms and parts-of-speech tags were also proposed [21]. This method has been used for estimating inflectional and reordering errors, and also for automatic error classification. The tool describes and classifies errors into five categories based on the hierarchy proposed in [22]. There is also a tool for automatic evaluation of MT output based on n-gram precision and recall.

The Universal Networking Language (UNL) has been proposed by the United Nations University for overcoming the language barrier. An English to Hindi MT system which uses UNL [31] has an English analyser which converts the sentence into UNL form; this is then given to a Hindi generator which generates the target sentence in Hindi. 95% of the UNL expressions were correctly converted to Hindi. This system does part-of-speech disambiguation and some sense disambiguation for postposition markers and wh-pronouns. The system handles language divergence in a better way [16]. Work is currently in progress on MT systems for English to Marathi and Bengali.

Another approach quantifies translation quality [23] based on the frequencies of different error categories. A linguistic approach [24] uses a classifier trained with a set of linguistic features to automatically detect incorrect segments in MT output.

An English to Bangla machine aided translation system, namely AnglaBangla, has been developed by adapting the AnglaBharati framework [2]. It can translate all types of simple sentences and some complex sentences. Rules have also been incorporated for command and request types of sentences, and they give correct results. Currently the system translates input sentences line by line. Many a time the system also gives more than one translation for a given English sentence. The user can select any one of them as a suitable translation of the particular English sentence. The user can also edit the translated output with the help of an on-screen Bangla keyboard, and can rank the translated output. The drawback of the system is that it handles only simple sentences. Current systems are unable to produce output of the same quality as a human translator, particularly where the text to be translated uses colloquial language. Again, the system needs a target language model at the output of the text generator to generate a perfect translation or to decrease the number of parsed outputs for a given input sentence, and the translation is only for Bangla.

Anglabharti [2] is an approach which distinguishes a type of evaluation whose purpose is to discover the reason(s) why a system did not produce the results it was expected to. Working on these lines, it is an MT system for translation from English to Indian languages which uses a pseudo-interlingua approach. The system analyses English sentences and creates an intermediate structure called PLIL (Pseudo Lingua for Indian Languages) [2]. It performs most of the disambiguation. The analysis phase requires 70% of the effort and the generation phase 30%, so with an additional effort of 30% a new translator for an Indian language could be built. A context-free-grammar-like structure is used to create the PLIL structure. It also uses statistical analysis of a corpus to identify the movement rules for the PLIL structure. Its beta version, AnglaHindi, for English to Hindi translation is available [17]. AnglaBharti-II has been implemented by incorporating an additional layer into the existing English to Hindi translation (AnglaBharti-II) and Hindi to English translation (AnuBharti-II) systems [2]. The system is claimed to produce satisfactory, acceptable results in more than 90% of the cases. Only in the case of polysemous verbs is the system unable to resolve their meaning, owing to the very shallow grammatical analysis used in the process. The MT systems developed so far have many shortcomings in terms of rule set, dictionary and translation methodology, and it is apparent from the survey that further work is needed in MT as a whole to produce intelligible translations.

VI. LIST OF PROBLEMS IN LITERATURE SURVEY

In the literature survey I found many drawbacks in the field of machine translation, which are listed below:

1. No tool has been developed which translates Hindi text into English text with the related parts of speech for complex Hindi sentences.

2. AnuBharti-II systems [2]: the system is claimed to produce satisfactory, acceptable results in more than 90% of the cases. Only in the case of polysemous verbs is the system unable to resolve their meaning, owing to the very shallow grammatical analysis used in the process.

3. AnuBharti-II systems produce results only for simple sentences, and translation is done line by line.

4. AnuBharti-II systems sometimes offer two alternative translations for a single word.

5. Telugu to English Translation using Direct Machine Translation Approach [4]: the system was also tested using free-flowing sentences from various newspaper websites and failed to translate words which involve very complex elisions/inflections, are not available in the dictionary, or have many synonyms.

6. The accuracy of translation is as high as 90 percent over the given test statements.

VII. EXTENDED WORK

In my research work the aim is to design and develop a tool that can translate a Hindi sentence into English, tag each Hindi word in the sentence with its related part of speech, count the total number of sentences in a given paragraph according to the delimiter, and count the words of each category (noun, pronoun, verb, adjective, etc.) in the given sentences. The translation process begins by using the Hindi to English dictionary for transliteration. The sentence is then grammatically rectified by using Microsoft Office Word; this process is built into the code. A rule-based approach is used, which has gained much interest in recent years. Rule-based systems provide direct translation, without using any intermediate representation. The advantage of these systems is that they are fully automatic and require less human labour.
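The counting part of the tool can be sketched in Python as follows. The look-up dictionary, its romanised Hindi entries and the choice of delimiters are assumptions made for illustration; the paper does not list them.

```python
import re
from collections import Counter

# Hypothetical look-up dictionary: romanised Hindi word -> part of speech.
POS_LOOKUP = {
    "ram": "Noun", "aam": "Noun", "vah": "Pronoun",
    "khata": "Verb", "hai": "Verb", "meetha": "Adjective",
}

# Assumed sentence delimiters: full stop, question mark, exclamation mark
# and the Devanagari danda (U+0964).
SENTENCE_DELIMITERS = r"[.?!\u0964]"


def split_sentences(paragraph):
    """Split a paragraph into non-empty sentences on the delimiters."""
    parts = re.split(SENTENCE_DELIMITERS, paragraph)
    return [p.strip() for p in parts if p.strip()]


def count_pos(sentences, lookup=POS_LOOKUP):
    """Count words per part of speech; unlisted words count as 'Unknown'."""
    counts = Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            counts[lookup.get(word, "Unknown")] += 1
    return counts


sentences = split_sentences("Ram aam khata hai. Vah meetha hai.")
print(len(sentences))        # 2
print(count_pos(sentences))
```

Words missing from the dictionary are tallied under a separate label so that they can later be added to the lexicon with their part of speech.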

VIII. EXPECTED IMPACT ON ACADEMICS/INDUSTRY

India is the largest democratic country in the world. More than 30 languages and approximately 2000 dialects are used for communication by the Indian people. Of these languages, Hindi and English are taken as the languages for official work, and there are 22 scheduled languages used by the different states for their administrative work and communication purposes. These 22 languages include Assamese, Bengali, Bodo, Dogri, Gujarati, etc. Machine translation is an important part of Natural Language Processing. It refers to using a machine to convert one natural language into another. Machine translation software essentially takes a text in one language and translates it into another language. Machine translation consists of a Language Model (LM) and a Translation Model (TM). The transfer grammar component is a major module in rule-based machine translation (MT) systems. Various MT systems have already been developed for most of the commonly used natural languages.

A. Academics
Machine translation tools help in academic fields like Translation Studies, the application of the theory of translation. Such a tool can be used in the Center for Applied Linguistics (CAL) to promote and improve the teaching and learning of languages, and to identify and solve problems related to language.

B. Industry
In a large multi-lingual society like India, there is a great demand for translation of documents from one language to another. Most of the state governments work in their respective regional languages, whereas the Union Government’s official documents and reports are in bilingual form (Hindi/English). For proper communication, these documents and reports need to be translated into the respective regional languages. Newspapers in regional languages are required to translate news received in English from international news agencies. With the limitations of human translators, most of this information (reports and documents) is missing and not percolating down. A machine assisted translation system or a translator's workstation would increase the efficiency of human translators. As is clear from the above, the market is largest for translation from English into Indian languages, primarily Hindi. Hence, it is no surprise that the majority of the Indian machine translation (MT) systems have been developed for Hindi-English translation. The benefit of my research work is a system which accepts Hindi text and generates the corresponding translation in English. Machine translation technology has a broad significance. The large number of non-English users accessing information on the web makes the need for MT much more significant. The rise and popularity of the internet has given users access to a vast variety of information, including written, audio and visual data from any part of the world. But the language barrier is still one of the major hindrances to sharing this information with all.

IX. PROPOSED METHODOLOGY

The rule-based approach is a machine-aided translation methodology specifically designed here for translating Hindi into English. English is an SVO language, while Indian languages are SOV and have relatively free word order. The rule-based system uses a corpus-based approach. It analyses Hindi sentences only once and creates an intermediate structure with most of the disambiguation performed. The intermediate language structure has the word and word-group order as per the structure of the group of target languages. The intermediate structure is then converted to each Indian language through a process of text generation. The rule-based system uses a context-free grammar structure for the analysis of Hindi as the source language for translation into English.

A. This is performed by the following steps:

1. The Hindi paragraph is tokenized into sentences according to the delimiter.

2. The rule-based approach is applied to each sentence according to the linguistic rules: the first verb in the sentence is located, and the sentence is then broken into two parts. The first part is the sentence up to and including the verb, and the second part is the remainder of the sentence after the verb.

3. Each part of the sentence is tokenized into words according to the delimiter.

4. Each word is grammatically tagged using a look-up dictionary of parts of speech.

5. The look-up dictionary provides the part of speech of each Hindi word and also its conversion into English.

6. After tagging and conversion of each word, the sentence is grammatically rectified using Microsoft Office Word; this process is built into the code.

In this research work on the Machine Translation system, I will try to build lexical resources so that the above process gives more accurate results.
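Steps 1 to 5 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy `LEXICON`, its tag names, the delimiter set and all function names are hypothetical, and the Hindi sentence is an assumed input chosen to correspond to the English sample "Ram is going home." The grammar rectification of step 6 is not modelled.

```python
import re

# Hypothetical toy look-up dictionary: Hindi word -> (part-of-speech tag, English gloss).
LEXICON = {
    "राम": ("NOUN", "Ram"),
    "घर": ("NOUN", "home"),
    "जा": ("VERB", "go"),
    "रहा": ("AUX", "-ing"),
    "है": ("AUX", "is"),
}

# Assumed sentence delimiters: danda, question mark, exclamation mark.
SENTENCE_DELIMITERS = r"[।?!]"


def split_sentences(paragraph):
    """Step 1: tokenize a paragraph into sentences on the delimiter."""
    parts = re.split(SENTENCE_DELIMITERS, paragraph)
    return [p.strip() for p in parts if p.strip()]


def split_at_first_verb(sentence):
    """Steps 2-3: split the words into (up to and including first verb, remainder)."""
    words = sentence.split()
    for i, word in enumerate(words):
        if LEXICON.get(word, ("UNK", word))[0] == "VERB":
            return words[: i + 1], words[i + 1:]
    return words, []  # no verb found: keep the sentence as a single part


def tag_and_gloss(words):
    """Steps 4-5: attach a part-of-speech tag and an English gloss to each word."""
    return [(w,) + LEXICON.get(w, ("UNK", w)) for w in words]


for sentence in split_sentences("राम घर जा रहा है।"):
    head, tail = split_at_first_verb(sentence)
    print(tag_and_gloss(head), tag_and_gloss(tail))
```

Words missing from the dictionary are tagged `UNK` and passed through unchanged, which makes them easy to collect later for lexicon updates.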

B. Developing lexical resources:

The following lexical resources will be created and built as part of a planned effort:

1. Electronic dictionary (Shabdanjali English-Hindi dictionary)
2. Transfer lexicon and grammar (TransLexGram)
3. Parts-of-Speech tagger corpora
4. Lexicon sentence extractor
5. Lexicon word extractor
6. Count sentence
7. Count word
8. Parts-of-Speech Tagger Formations
9. Parts-of-Speech Tagger resolution
10. Handle gender mismatch
11. Linguistic rules
12. Disambiguation of multiple Hindi words
13. Disambiguation of postpositions
14. Lexicon matching
15. Updating of undefined word with related parts of speech and translation
16. Multi-lingual Lexical Database and Corpus Management Tools
17. Microsoft Office Word 2007
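A few of the simpler resources in this list (counting sentences and words, lexicon matching, and updating undefined words) can be sketched as below. The function names, the lexicon layout and the sample data are hypothetical and serve only to illustrate the idea; the paper does not describe how these tools are built.

```python
def count_sentences_and_words(sentences):
    """Count sentences and whitespace-separated words."""
    return len(sentences), sum(len(s.split()) for s in sentences)


def find_undefined_words(sentences, lexicon):
    """Lexicon matching: collect words with no lexicon entry, in first-seen order."""
    undefined = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in lexicon and word not in undefined:
                undefined.append(word)
    return undefined


def add_entry(lexicon, word, pos, translation):
    """Record a part of speech and a translation for a previously undefined word."""
    lexicon[word] = (pos, translation)


# Hypothetical data for illustration only.
lexicon = {"राम": ("NOUN", "Ram")}
sentences = ["राम घर जा रहा है"]
missing = find_undefined_words(sentences, lexicon)  # words still to be defined
add_entry(lexicon, "घर", "NOUN", "home")
```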

X. EXPECTED OUTPUT

The various MT groups have used different formalisms best suited to their applications. Of these, transfer-based systems are more flexible and can be extended to language pairs in a multilingual environment. Direct translation is appropriate for structurally similar languages. The expected outcome of my research work is the translation of given Hindi text into English. Rule-based approaches have given rise to new optimism and to the exploration of data-driven techniques: the performance of the rule set, the dictionary and statistical translation techniques can be improved through a large parallel corpus and the use of linguistic knowledge in the model. Hybrid systems are found to perform better than systems built on a single component technology. A sample of the expected output from the developed Hindi-to-English system is given below:

Hindi corpus input

��� �� �� ��� ��

����� �� �� ���� ����� ����

Ram is going home.

XI. REFERENCES

1. Sinha, R. M. K., "ANGLABHARTI: A Multi-lingual Machine Aided Translation Project on Translation from English to Hindi," Proceedings of the IEEE International Conference, Sep 12-15, pp. 1609-1614, 2009.

2. Sinha, R. M. K., Sivaraman, K., Agrawal, A., Jain, R., Srivastava, R. and Jain, A, "AnglaBharti: A multilingual machine aided translation project on translation from English to Hindi," in Proceedings of the IEEE, Columbia, Canada, pp. 1609-1614, 1995.

3. Sinha, R. M. K., Sivaraman, "Hybridizing Rule-Based and Example-Based Approaches in Machine Aided Translation System," Proceedings of the IC-AI International Conference, Vegas, USA, June 26-29, 2000.

4. Prasad T. Venkateswara, Muthukumaran G. Mayil, "Telugu to English Translation using Direct Machine Translation Approach," in Proceedings of the International Journal of Science and Engineering Investigation, vol 2, ISSN: 2251-8843, Jan12, pp. 25-32, 2013.

5. Hermawan and A.T, "Natural Language Grammar Induction of Indonesian Language Corpora Using Genetic Algorithm," in Proceedings of the IALP International Conference, Nov 11-12, pp. 15-18, 2011.

6. H. Lu and L. Shixing, "Research on rule definition and engine for general text processing," Proceedings of the ICCSE International Conference, July 25-28, pp 10955-1100, 2009.

7. Lee Hagyu, "Korean-English machine translation based on idiom recognition," Proceedings of the IEEE 1993 International Conference, Oct 19-21, pp. 1058-1061.

8. Website source [Online]. Available: http://www.sampark.iiit.ac.in

9. Ren & F, "Dialogue machine translation system using multiple translation processors," Proceedings of the IEEE 2000 11th International Conference, Oct, pp. 143-152.

10. Gupta D and Chatterjee N, "Identification of Divergence for English to Hindi EBMT," Proceedings of MT SUMMIT IX, New Orleans, Louisiana, USA, 2003.

11. Bandyopadhyay S, "Use of Machine Translation in India," Proceedings of the AAMTJ 2004, International Conference, Oct 18-19, pp. 22-25, 2004.

12. Chaudhury S, Rao A and Sharma D. M, "Anusaaraka: An expert system based machine translation system," Proceedings of the NLP-KE International Conference, Aug 21-23, pp. 1-6, 2010.

13. Latha R. Nair and David Peter S, "Machine Translation Systems for Indian Languages," Proceedings of the International Journal of Computer Applications (0975 – 8887), Volume 39, No. 1, February 2012.

14. Dubey P, "Overcoming the Digital Divide through Machine Translation," Translation Journal, Volume 15, 2011.

15. Goyal V and Lehal G. S, "Advances in Machine Translation Systems," Language in India, Vol. 9, No. 11, pp. 138-150, 2009.

16. Ananthakrishnan, R., Bhattacharyya, P., Sasikumar, M. and Shah, R, "Some issues in automatic evaluation of English-Hindi MT: More blues for BLEU," in Proceedings of the 5th ICON, Hyderabad, India, 2007.

17. Ananthakrishnan, R., Kavitha, M., Hegde, J., Shekhar, C., Shah, R., Bade, S. and Sasikumar, M, "MaTra: A practical approach to fully-automatic indicative English-Hindi machine translation," in Proceedings of the Symposium on Modeling and Shallow Parsing of Indian Languages (MSPIL), IIT Bombay, India, 2006.

18. Banerjee, S. and Lavie, A, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, Ann Arbor, Michigan, pp. 65-72, 2005.

19. Baskaran, S., Bali, K., Choudhury, M., Bhattacharya, T., Bhattacharyya, P., Jha, G. N., Rajendran, S., Saravanan, K., Sobha, L. and Subbarao, K. V, "A Common Parts-of-Speech Tagset Framework for Indian Languages," in Proceedings of the LREC, Marrakech, Morocco, pp. 1331-1337, 2008.

20. Chatterjee, N., Johnson, A. and Krishna, M, "Some improvements over the BLEU metric for measuring the translation quality for Hindi," in Proceedings of the ICCTA, Kolkata, India, pp. 485-490, 2007.

21. The EAGLES MT Evaluation Working Group, in Proceedings of the EAGLES Evaluation of Natural Language Processing Systems, ISBN 87-90708-00-8, Center for Sprogteknologi, Copenhagen, 1996.

22. Chaudhury, S., Rao, A. and Sharma, D. M., "Anusaaraka: An Expert system based MT System," in Proceedings of the IEEE-NLPKE, Beijing, China, 2010.

23. Popović, M. and A. Burchardt, "From human to automatic error classification for machine translation output," in Proceedings of the EAMT, Leuven, Belgium, pp. 265-272, 2011.

24. Fishel, M., Sennrich, R., Popović, M. and Bojar, "TerrorCat: a Translation Error Categorization-based MT Quality Metric," in Proceedings of the Seventh Workshop on Statistical Machine Translation, Montréal, Canada, pp. 64-70, 2012.

25. Vilar, D., Xu, J., Fernando L. D'Haro, and Ney, H, "Error analysis of statistical machine translation output," in Proceedings of the LREC, Genoa, Italy, pp. 697-702, 2006.

26. Xiong, D., Zhang, M. and Li, H, "Error detection for statistical machine translation using linguistic features," in Proceedings of the ACL, Uppsala, Sweden, pp. 604-611, 2010.

27. Farrús, M., Costa-jussà, M. R., Mariño, J. B. and Fonollosa, "Linguistic-based evaluation criteria to identify statistical machine translation errors," in Proceedings of the EAMT, Saint Raphaël, France, pp. 52–57, 2010.

28. Sinha, R. M. K., "A Pseudo Lingua for Indian Languages (PLIL) for Translation from English," Technical Report, Language Technology Lab, Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, 2004.

29. R. E. Asher and T. C. Kumari, "Malayalam - Descriptive grammars," Routledge, pp. 317-319, 2007.

30. A. R. Rajaraja Varma, "Keralapaniniyam," DC Books, pp. 177-269, 2000.

31. Bhattacharyya P, Dave S, Parikh J, "Interlingua based English-Hindi Machine Translation and Language Divergence," in Proceedings of the Springer 2001, Volume 16, Issue 4, pp. 251-304.
