[ieee 2006 international conference on computer engineering and systems - cairo...


Page 1: [IEEE 2006 International Conference on Computer Engineering and Systems - Cairo (2006.11.5-2006.11.7)] 2006 International Conference on Computer Engineering and Systems - A Triliteral

A Triliteral Word Roots Extraction Using Neural Network For Arabic

Hasan Al-Serhan and Aladdin Ayesh
Centre for Computational Intelligence (CCI)
School of Computing
Faculty of Computing Sciences and Engineering
De Montfort University, Leicester
UNITED KINGDOM
Email: {alserhan,aayesh}@dmu.ac.uk

Abstract- Many existing Arabic stemming algorithms use a large set of rules. In many cases, they refer to a lookup table of patterns and roots, which requires a large storage space and time to access the information. This paper proposes a novel neural-network-based approach for stemming Arabic words. The approach attempts to exploit numerical relations between characters by using a Backpropagation Neural Network (BPNN). To our knowledge, no system in the literature uses a neural network to extract the stems of Arabic words.

Keywords: Neural Networks, Backpropagation, Stemming, Natural Language Processing, Arabic Language.

I. INTRODUCTION

Stemming is a process in which words are returned to their morphological root [1]. For example, the words computing, computer, computation, computerize, and computational could be mapped to a common stem, compute. In Arabic, stemming is the process of removing affixes from a given word to get the original root.

Affixes in Arabic are prefixes, suffixes and infixes. Prefixes are attached at the beginning of a word, suffixes are attached at the end of the word, and infixes are found in the middle of the word. For example, the Arabic word "الجامعات" ("aljam'at"), which means "universities", consists of the following parts: "ال" represents the prefix part, "ات" represents the suffix part, the "ا" in the middle represents the infix part, and the remaining letters "جمع" represent the root part.

Stemming can be used in many applications, such as data compression, spell checking, and information retrieval (IR) systems, where many studies have shown that using roots as index words gives much better results than using full words. It can also be used in text generation to generate different parts of speech of a given word by attaching certain affixes to the root verb. In spite of the rapid progress of research on other languages, Arabic still suffers from a shortage of research and development.

The purpose of this study is to show that numerical relations between characters can be extracted for the purpose of fast, accurate stemming. This is inspired by linguistic studies that identify the likely redundant characters in words. Our approach is to use neural networks for stemming Arabic words. The remainder of the paper is organized as follows. Section II gives an overview of the Arabic language. Section III reviews related work, and Section IV describes the numerical approach to root extraction. Finally, the experiments and results are provided in Section V and the conclusion in Section VI.

II. ARABIC LANGUAGE OVERVIEW

Arabic ranks sixth in the world's league table of languages, with an estimated 206 million native speakers. As the language of the Qur'an, the holy book of Islam, it is also widely used throughout the Muslim world. It belongs to the Semitic group of languages, which also includes Hebrew and Amharic, the main language of Ethiopia.

The Arabic alphabet is used in several languages, such as Persian, Malay, and Urdu [2]; the characters consist of letters, numbers, punctuation marks, spaces and special symbols (e.g. mathematical notations). It differs from English in its vowels and diacritic marks, which are special marks placed above or under the Arabic letters. However, most recent written Arabic texts are non-vowelised. Arabic is considered to be a member of a highly sophisticated category of natural languages, which has a very rich morphology. In this category, one root can generate several different words that have different meanings.

1-4244-0272-7/06/$20.00 ©2006 IEEE

The grammatical system of the Arabic language is based on a root-and-pattern structure, and Arabic is considered a root-based language with more than 10,000 roots [3]. A root in Arabic is the bare verb form. Roots can be triliteral, which account for the overwhelming majority of Arabic words, and, to a lesser extent, quadriliteral, pentaliteral, or hexaliteral. Each root generates increased verb forms and noun forms through the addition of derivational affixes [4].

III. RELATED WORKS

A. Stemming Techniques

Stemmers are commonly designed for a specific language, and their design requires some linguistic expertise in the language itself. Many stemmers have been implemented for many languages, including Malay [5], Latin [6], Indonesian [7], Swedish [8], Dutch [9], German [10], Slovene [11], Bulgarian [12] and Turkish [13]. Due to its non-concatenative nature, Arabic is very difficult to stem [14], so regrettably there are only a few stemmers for Arabic.

In the literature, stemmers can be classified into table lookup, linguistic, and combinational approaches.

The table lookup approach [15] utilizes a huge list that stores all valid Arabic words along with their morphological decompositions. This method does not use a stemming process: for a given Arabic word, it accesses the list and retrieves the associated root or stem. Consequently, the stems obtained are guaranteed to be highly accurate. However, building a table that includes all the words of the language is practically impossible.

The linguistic approach [16] attempts to simulate the behavior of a linguist by considering the Arabic morphological system and thoroughly analyzing Arabic words according to their morphological components. In such an approach, the prefix and suffix of a given word are removed by comparing its leading and trailing letters with a given list of affixes. Most of the published works in the literature are linguistic-based.

Finally, in the combinational approach [16], a given word is used to generate all combinations of its letters. These combinations are compared against predefined lists of Arabic roots. If a match is found, the stem and patterns are extracted.
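The linguistic and combinational approaches described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited stemmers: the affix lists, the root list and the function names `strip_affixes` and `combinational_roots` are hypothetical, and a real system would use far larger lists.

```python
from itertools import combinations

# Hypothetical affix and root lists, for illustration only.
PREFIXES = ["ال"]
SUFFIXES = ["ات"]
KNOWN_ROOTS = {"جمع", "درس", "كتب"}


def strip_affixes(word):
    """Linguistic approach: remove one listed prefix and one listed suffix."""
    for prefix in PREFIXES:
        if word.startswith(prefix):
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            word = word[: -len(suffix)]
            break
    return word


def combinational_roots(word, root_length=3):
    """Combinational approach: test every in-order letter combination
    of the word against the list of known roots."""
    candidates = ("".join(letters) for letters in combinations(word, root_length))
    return sorted({c for c in candidates if c in KNOWN_ROOTS})
```

Note that affix stripping alone leaves the infix in place, whereas the combinational search depends entirely on the coverage of the root list, which is the storage and lookup cost the paper sets out to avoid.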

B. Backpropagation Neural Network (BPNN)

Backpropagation Neural Network (BPNN) is one of the most common neural network architectures and has been used in a wide range of machine learning applications, such as character recognition [17].

Typically, a BPNN architecture consists of three layers [17]: input, hidden, and output. Fig. 1 shows a BPNN architecture in which the input layer, with n nodes, receives input data from an external source; the output layer, with m nodes, transmits the result of the neural network processing; and the hidden layer, with p nodes, provides the internal relations between the input and output layers.

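There are two main operations during backpropagation training: feedforward and error backpropagation. In the feedforward operation, an input pattern is propagated through the network to produce an output; in the error backpropagation operation, the difference between that output and the target is propagated backwards to adjust the weights.

The feedforward and error backpropagation operations are repeated alternately until the mean squared error drops below a critical value.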
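For a three-layer network, the two operations can be written out as follows. This is a sketch of the standard textbook formulation rather than equations given in the paper: the sigmoid activation $f$, the weight symbols $v_{ij}$ and $w_{jk}$, the bias terms $v_{0j}$ and $w_{0k}$, the target $t_k$ and the learning rate $r$ are assumptions introduced here, while $n$, $p$ and $m$ are the layer sizes named above.

```latex
% Feedforward (sigmoid activation assumed)
\begin{align}
  z_j &= f\Big(v_{0j} + \sum_{i=1}^{n} x_i\, v_{ij}\Big), \quad j = 1,\dots,p,
  \qquad f(a) = \frac{1}{1 + e^{-a}} \\
  y_k &= f\Big(w_{0k} + \sum_{j=1}^{p} z_j\, w_{jk}\Big), \quad k = 1,\dots,m
\end{align}
% Error backpropagation (gradient descent on the squared error)
\begin{align}
  \delta_k &= (t_k - y_k)\, y_k\, (1 - y_k) \\
  \delta_j &= z_j\, (1 - z_j) \sum_{k=1}^{m} \delta_k\, w_{jk} \\
  \Delta w_{jk} &= r\, \delta_k\, z_j, \qquad \Delta w_{0k} = r\, \delta_k \\
  \Delta v_{ij} &= r\, \delta_j\, x_i, \qquad \Delta v_{0j} = r\, \delta_j
\end{align}
```

Training stops once the mean squared error, $\frac{1}{m}\sum_{k}(t_k - y_k)^2$ averaged over the training patterns, falls below the chosen critical value.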

IV. NUMERICAL APPROACH TO ROOT EXTRACTION

As mentioned before, most stemming algorithms depend on root and pattern files, which require considerable space to store and time-consuming search operations. A neural network approach for extracting triliteral Arabic roots without referring to root or pattern files is proposed here and shown to work.

A backpropagation neural network architecture is used here with its standard learning function. The model was trained to accept an encoded Arabic word as input and to generate the correct root of that word as output.

Fig. 1. Backpropagation Neural Network (input pattern fed to n input nodes; output response read from m output nodes)


A. Data Analysis

Most Arabic words are derived from verb forms which extend or modify the meaning of the root form of the verb, giving many shades of meaning (see Table I). The simple or root form of the verb is called "المجرد" (the "stripped" verb), while the derived form is called "المزيد" (the "increased" verb). Table I shows some of the derived forms of the verb root "كتب" ("kataba"), which means "he wrote".

Several hundred Arabic words can be derived from a single root by adding affixes. Arabic linguists have identified the letters commonly used in these affixes and presented them in the following set: {irsl, ^s^C g ,U^J^sl}. Arabic linguistics experts have classified these letters according to how frequently they appear as affix letters [18]. The most frequently used are { then {l,&CL'}, and the least frequently used are {ir j J}. Based on this, we classify all Arabic letters into four classes and assign a numeric value to each class, as shown in Table II. These values are used as inputs to the network.

Each letter in the input word is encoded as three binary digits, as shown in Table III, so the input word is converted to a series of binary digits. Limiting an input word to a length of five letters gives an input matrix of 5 x 3, where each row contains the three-digit binary code for one letter.

The output matrix is also 5 x 3. If a letter of the input word belongs to the root, it is represented by three ones, "111"; otherwise it is represented by three zeros, "000".
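The encoding described above can be sketched in Python as follows. The class values in `LETTER_CLASS` are placeholders chosen for illustration, not the assignment in Table II, and the function names are hypothetical. The left-to-right matching used to locate root letters is also a simplifying assumption; it can mislabel a word in which an affix letter equal to a root letter appears before that root letter.

```python
WORD_LENGTH = 5

# Hypothetical class values (0-3); letters not listed default to class 0.
LETTER_CLASS = {"ا": 3, "م": 2, "ت": 1}


def encode_word(word):
    """Return the 5 x 3 input matrix: one 3-bit class code per letter."""
    if len(word) != WORD_LENGTH:
        raise ValueError("this sketch handles five-letter words only")
    rows = []
    for letter in word:
        value = LETTER_CLASS.get(letter, 0)
        rows.append([int(bit) for bit in format(value, "03b")])
    return rows


def target_matrix(word, root):
    """Return the 5 x 3 output matrix: 111 for root letters, 000 otherwise."""
    rows, next_root = [], 0
    for letter in word:
        is_root = next_root < len(root) and letter == root[next_root]
        if is_root:
            next_root += 1
        rows.append([1, 1, 1] if is_root else [0, 0, 0])
    return rows
```

Flattening either matrix row by row gives the 15 values presented to, or expected from, the network.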

TABLE I

SOME DERIVED FORMS OF THE ROOT "كتب"

Arabic Form    Pronunciation    Meaning
كاتب           kaatib           writer
الكتاب         alketab          the book
مكتب           maktab           desk
يكتب           yaktubu          he writes
أكتب           'aktaba          he dictated

TABLE III

ARABIC CODED LETTERS

B. Proposed BPNN Architecture

The architecture of the model is summarized in Fig. 2, and the model algorithm is given in Fig. 3. The input layer consists of 15 nodes (5 letters, each represented by 3 bits). A single hidden layer of 15 nodes is fully connected to both the input layer and the output layer. The output layer also consists of 15 nodes, representing the root output.

The network was trained using the standard backpropagation algorithm with learning rate r = 0.5. Training was performed until an acceptable convergence was reached, at 5000 epochs, with a mean squared error of less than 0.004.
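A 15-15-15 network trained in this way can be sketched in plain Python as below. The layer sizes, the learning rate of 0.5, the 5000-epoch limit and the 0.004 error threshold come from the text; the sigmoid activation, the bias weights, the initial weight range, the random seed and the names `BPNN`, `train_pattern` and `fit` are assumptions made for this illustration and are not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BPNN:
    """Three-layer network: n inputs, p hidden nodes, m outputs."""

    def __init__(self, n=15, p=15, m=15, rate=0.5, seed=0):
        rng = random.Random(seed)
        # Each row holds one node's incoming weights; the last entry is its bias.
        self.hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n + 1)] for _ in range(p)]
        self.output = [[rng.uniform(-0.5, 0.5) for _ in range(p + 1)] for _ in range(m)]
        self.rate = rate

    def forward(self, x):
        """Feedforward: return hidden activations z and outputs y."""
        z = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in self.hidden]
        y = [sigmoid(sum(w * v for w, v in zip(row, z + [1.0]))) for row in self.output]
        return z, y

    def train_pattern(self, x, t):
        """One feedforward pass followed by one error backpropagation pass."""
        z, y = self.forward(x)
        # Output error terms use the sigmoid derivative y * (1 - y).
        d_out = [(tk - yk) * yk * (1.0 - yk) for tk, yk in zip(t, y)]
        # Hidden error terms are computed before any weight is changed.
        d_hid = [zj * (1.0 - zj) * sum(d * row[j] for d, row in zip(d_out, self.output))
                 for j, zj in enumerate(z)]
        for row, d in zip(self.output, d_out):
            for j, v in enumerate(z + [1.0]):
                row[j] += self.rate * d * v
        for row, d in zip(self.hidden, d_hid):
            for i, v in enumerate(x + [1.0]):
                row[i] += self.rate * d * v
        return sum((tk - yk) ** 2 for tk, yk in zip(t, y)) / len(t)

    def fit(self, patterns, max_epochs=5000, target_mse=0.004):
        """Train until the epoch MSE drops below target_mse or epochs run out."""
        history = []
        for _ in range(max_epochs):
            mse = sum(self.train_pattern(x, t) for x, t in patterns) / len(patterns)
            history.append(mse)
            if mse < target_mse:
                break
        return history
```

Here `patterns` is a list of `(input, target)` pairs, each a flat list of 15 values; `fit` returns the per-epoch mean squared error so that convergence can be inspected.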

TABLE II

ARABIC LETTER CLASSIFICATION

Class    Value
         3
         2
         1
         0

Fig. 2. The Proposed Model Architecture

Letter    Decimal Code    Binary Code
          3               011
          0               000
          2               010
          3               011
          3               011

Fig. 3. The Model Algorithm

V. EXPERIMENTS AND RESULTS

The data for this study were taken from a generated lexicon. A shell program, implemented by the first author, generated a set of words from a given set of roots. The lexicon contains about 3000 Arabic words and their correct roots. A set of 500 randomly chosen words was selected from this lexicon, restricted to words of five letters. The correct output Arabic root for each word was available to the model during training. For example, for the word "مدارس" ("madars"), which means "schools", the five letters {م, د, ا, ر, س} served as a fixed input pattern, and the correct target output sequence was {0, د, 0, ر, س}, which is also five items long. When these letters are combined, the root "درس" ("drs") is extracted. The zero character "0" indicates that the corresponding letter belongs to the affix letters and not to the root.

The trained model was tested using a set of 200 randomly chosen five-letter words from the generated lexicon. Comparing the output of the trained network with the actual roots shows that the network produced 188 correct roots for the given test set, an accuracy rate of 94%. Table IV shows a sample of a test run.
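The evaluation step can be sketched as follows. The paper does not state how the 15 real-valued network outputs are turned back into letters, so the 0.5 threshold and the rule that all three units of a letter must fire are assumptions, as are the function names.

```python
def decode_root(word, outputs, threshold=0.5):
    """Keep each letter whose three output units all exceed the threshold."""
    root = []
    for index, letter in enumerate(word):
        units = outputs[3 * index: 3 * index + 3]
        if all(unit > threshold for unit in units):
            root.append(letter)
    return "".join(root)


def accuracy(predicted_roots, actual_roots):
    """Fraction of test words whose predicted root equals the actual root."""
    correct = sum(p == a for p, a in zip(predicted_roots, actual_roots))
    return correct / len(actual_roots)
```

With 188 matching roots out of 200 test words, `accuracy` returns 0.94, the figure reported above.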

TABLE IV

A TESTING SAMPLE

Word    Input Matrix       Root    Output Matrix
        000000000011010            111111111000000
        010000011000000            000111000111111
        000000001011010            111111111000000

VI. CONCLUSION

Most Arabic stemming algorithms rely on rule-based systems combined with a roots dictionary and pattern files. These algorithms are often accurate but computationally demanding. In this paper, we presented an alternative approach that uses backpropagation neural networks for stemming Arabic words.

The aim of this paper is to exploit numerical relations between the characters that Arabic linguists identify as affixes. Linguistic studies have shown that these characters, which are {irv 1,J,s, c,9,u, j, 1}, appear as affixes with different frequencies. Our approach exploits this fact by assigning numeric values that identify the class of each character. These numeric values are then used to encode a given Arabic word for presentation as input to the neural network.

Our experiments gave positive results, demonstrating that the approach provides a fast and accurate stemming algorithm. Currently, the process is limited to triliteral roots and five-letter Arabic words. Future work will concentrate on extending the approach to cover Arabic words and roots of all lengths. Further experiments with the neural network architecture, e.g. the number of neurons and layers, will be performed while generalizing the approach.

REFERENCES

[1] J. Allan and G. Kumaran, "Stemming in the language modeling framework," pp. 455-456, 2003. [Online]. Available: http://doi.acm.org/10.1145/860435.860548

[2] S. Al-Fedaghi and H. Al-Sadoun, "Morphological compression of Arabic text," in Information Processing & Management, 1990, pp. 303-316.

[3] R. A. Shalabi, G. Kannan, and H. Al-Serhan, "New approach for extracting Arabic roots," in ACIT '2003: Proceedings of The 2003 Arab conference on Information Technology, vol. 1, Alexandria, Egypt, December 2003, pp. 42-59.

[4] R. Al-Shalabi, "A computational morphology system for Arabic," in COLING-ACL98, 1998.

[5] S. Y. Tai, C. S. Ong, and N. A. Abullah, "On designing an automated Malaysian stemmer for the Malay language (poster session)," in IRAL '00: Proceedings of the fifth international workshop on Information retrieval with Asian languages. New York, NY, USA: ACM Press, 2000, pp. 207-208.

[6] M. Greengrass, A. M. Robertson, R. Schinke, and P. Willett, "Processing morphological variants in searches of Latin text," Information Research News, vol. 2, 1996.

[7] J. Asian, H. E. Williams, and S. M. M. Tahaghoghi, "Stemming Indonesian," in CRPIT '38: Proceedings of the Twenty-eighth Australasian conference on Computer Science. Darlinghurst, Australia: Australian Computer Society, Inc., 2005, pp. 307-314.

[8] V. Cavalli-Sforza, A. Soudi, and T. Mitamura, "Arabic morphology generation using a concatenative strategy," in Proceedings of the first conference on North American chapter of the Association for Computational Linguistics. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 86-93.

[9] W. Kraaij and R. Pohlmann, "Viewing stemming as recall enhancement," in Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'96, H.-P. Frei, D. Harman, P. Schauble, and R. Wilkinson, Eds. ACM, 1996, pp. 40-48.

[10] M. A. Fattah, F. Ren, and S. Kuroiwa, "Stemming to improve translation lexicon creation from bitexts," Information Processing and Management, vol. 42, no. 4, pp. 1003-1016, 2005. [Online]. Available: http://dx.doi.org/10.1016/j.ipm.2005.07.002

[11] M. Popovic and P. Willett, "The effectiveness of stemming for natural-language access to Slovene textual data," Journal of the American Society for Information Science & Technology (JASIST), vol. 43, pp. 384-390, 1992.

[12] P. Nakov, "Building an inflectional stemmer for Bulgarian," in International Conference on Computer Systems and Technologies - CompSysTech2003, 2003.

[13] F. Çuna Ekmekcioglu, M. F. Lynch, and P. Willett, "Stemming and n-gram matching for term conflation in Turkish texts," Information Research News, vol. 2, pp. 2-6, 1996. [Online]. Available: http://informationr.net/ir/2-2/paper13.html

[14] L. S. Larkey, L. Ballesteros, and M. E. Connell, "Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis," in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. Arabic Information Retrieval, 2002, pp. 275-282. [Online]. Available: http://doi.acm.org/10.1145/564376.564425

[15] K. Taghva, R. Elkhoury, and J. Coombs, "Arabic stemming without a root dictionary," International Conference on Information Technology: Coding and Computing (ITCC'05), vol. 1, pp. 152-157, 2005.

[16] I. A. Al-Kharashi and I. A. Al-Sughaiyer, "Rule merging in a rule-based Arabic stemmer," in COLING, 2002.

[17] L. V. Fausett, Fundamentals of Neural Networks. Prentice Hall, 1994.

[18] F. A. Qabaweh, Tasreef Al Asma' wa Al Afa'l. Al Maerf Library, 1988.
