

Applied Mathematical Sciences, Vol. 7, 2013, no. 88, 4387 - 4401

HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.36334

Automated Detection & Identification of Textual

Areas in Dematerialization Process

Mohammed MOUJABBIR

Department of Computer Systems
Faculty of Sciences and Technology, Mohammedia, Morocco
[email protected]

Mohammed RAMDANI

Department of Computer Systems
Faculty of Sciences and Technology, Mohammedia, Morocco

[email protected]

Copyright © 2013 Mohammed MOUJABBIR and Mohammed RAMDANI. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Document Management Systems (DMS) read information depending on a document's category and suggest items. While most existing research computes recommendations using methods such as classifiers, learning methods, and clustering methods, we conduct experiments to identify the best detection algorithm for our dataset. We evaluate our approach on 10 types of documents, focusing on recognizing every element of the textual sectors (the items contained in a text area). Results show that the algorithm based on similarity calculation provides an acceptable area-detection rate. We conclude that this research area needs deeper study.

Keywords: Document dematerialization, electronic invoicing dematerialization, automatic character recognition, Ontology, Similarity concept


1. Introduction

To improve the efficiency of DMS, especially the relevance of their results, future generations of such systems should integrate a multitude of functionalities: intelligence, adaptability, expressivity, indexing, etc.

Document dematerialization processes aim to improve the efficiency, affordability, and accessibility of all users' documents. Dematerialization activities, introduced to manage bureaucratic digital documents properly, are among the main tasks of e-government, e-commerce, and accounting work.

In this paper we present a novel method for the dematerialization process that detects all sector-area content by using a similarity concept able to correctly identify most of the areas present in a digital document. The simulation results show that the proposed algorithm provides an acceptable detection rate.

This paper is organized as follows. Section 2 presents related work. Section 3 describes the pre-processing and document representation phase. Section 4 presents our similarity-based method. Section 5 covers the experiments and discussion. Section 6 concludes the paper.

2. Related work

2.1 Relevant terms recognition

The state of the art in this domain relates to Natural Language Processing (NLP) techniques, including several approaches:

• Statistical Linguistics [7][11].

• Computational Linguistics [6][17][22][30], whose objective is the study and analysis of natural language and its functioning through computational tools and models.

For the analysis of textual areas in particular, specific disciplines have been developed, such as Corpus Linguistics and Textual and Lexical Statistics [1][12].

Term extraction is the first stage of automatic document processing and of deriving knowledge from texts. The second stage is the analysis of the lexical items that carry a specific meaning.

Terms generally convey the fundamental concepts of a specific knowledge domain (for example, the term football refers to the sport concept), and their relationships are used to build semantic meaning. Methods for term extraction from texts can be divided into three main categories:

Linguistic methods: they use linguistic knowledge and heuristic rules for identification, tokenisation, and harmonisation of tokens. They also use part-of-speech tagging, word stemming, lemmatization, and identification of phrase structures [1][12];


Statistical methods: they analyse word occurrences within texts to measure the "weight" of the selected term. Termhood [21] allows the assignment of weight; it represents the degree to which a term is related to a specific concept of the domain. Unithood measures the collocation strength of the units; evaluating unithood is based on measuring the significance of word co-occurrences [38][10].

Hybrid methods: they combine the two approaches: a linguistic step based on part-of-speech tags extracts a set of candidate terms, and a statistical step assigns a value to each candidate term [23].
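The unithood idea above can be sketched with a small illustrative computation. The following uses pointwise mutual information (PMI), a common proxy for collocation strength; it is not the exact measure of [38][10], and the function name and token handling are our own.

```python
import math
from collections import Counter

def pmi_unithood(tokens, pair):
    """Pointwise mutual information of an adjacent word pair, a common
    proxy for unithood (collocation strength): log2(P(w1 w2) / (P(w1) P(w2)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    p_pair = bigrams[pair] / (n - 1)       # probability of the adjacent pair
    p1 = unigrams[pair[0]] / n
    p2 = unigrams[pair[1]] / n
    return math.log2(p_pair / (p1 * p2))

tokens = "new york is a city new york is big".split()
pmi_unithood(tokens, ("new", "york"))      # strongly collocated pair, PMI > 0
```

A pair that co-occurs more often than its parts' frequencies would predict gets a high PMI and is a good candidate multi-word term.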

2.2 Segmentation

Segmentation can be performed by means of special instruments built from two fundamental components:

• Glossaries listing well-known expressions to be regarded as tokens;

• Grammars containing heuristic rules (in the form of regular expressions), which are manually written by experts.

The combined use of glossaries and grammars ensures high levels of precision. However, results depend on the kind of document, the textual content, and the language used: texts full of acronyms or abbreviations can increase the error rate. Consequently, the glossary and the grammatical rules should be adapted to the characteristics of the domain at issue. In the literature, several works focus on relevant knowledge extraction, and various systems address it by different means and methods. The main methods are described in the next sections.
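As a rough sketch of how a glossary and regular-expression rules can be combined for segmentation (the glossary entries and patterns below are invented for illustration; glossary entries are matched before generic word rules so known expressions stay whole):

```python
import re

# Hypothetical glossary of well-known expressions to be treated as single tokens.
GLOSSARY = ["T.V.A", "H.T", "N°"]

# Heuristic rules as regular expressions: glossary entries first,
# then dates, numbers, and plain words.
TOKEN_RE = re.compile(
    "|".join(re.escape(g) for g in GLOSSARY) +
    r"|\d{2}/\d{2}/\d{4}|\d+(?:[.,]\d+)?|\w+",
    re.UNICODE,
)

def segment(text):
    """Split a raw line into tokens using the glossary + grammar rules."""
    return TOKEN_RE.findall(text)

segment("FACTURE N° 08/2013 TOTAL T.V.A 280.00")
```

Because alternation in a regular expression tries branches left to right, "T.V.A" is emitted as one token instead of being broken at the periods.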

2.3 Systems based on the distributional properties of words

These systems analyse the distributions of words in order to recover the semantic distance between the concepts those words represent. The distance can be used for:

• Hierarchical clustering, to automatically derive hierarchies of concepts from texts [14][25].

• Formal Concept Analysis [9].

• Classification of words inside existing ontologies [4][28].

• Learning concept hierarchies [8][36][26][3].

2.4 Systems based on pattern extraction and matching

These systems rely on lexico-syntactic patterns to discover semantic relations between words in unrestricted texts. Hearst [18] pioneered the use of patterns to extract hypernymy relations; Berland and Charniak [5] applied the same technique to extract meronymy. Girju [16] has studied meronymic relation extraction, while Turney [35] has proposed a uniform approach for extracting different kinds of relations from text.

Hearst [19] proposes looking for co-occurrences of word pairs appearing in a specific relation inside WordNet. Turney [34] presents an unsupervised learning algorithm that mines large text corpora for patterns expressing implicit semantic relations.


2.5 Hybrid approaches

These approaches combine statistical and pattern-based techniques, as in Alfonseca and Manandhar [4]. Ryu and Choi [29] have proposed an algorithm for IS-A relation extraction from the English Wikipedia. Giovannetti et al. [15] propose a methodology that integrates manually defined lexico-syntactic patterns (a pattern-based approach) with statistical techniques.

2.6 Ontologies

An ontology is a formal explicit specification of a shared conceptualization of a domain of interest [33].

It aims at providing a formal and explicit description of the concepts in a domain of discourse. The main elements of an ontology are concepts, relationships, and instances:

• Concepts represent the categories and the classes of objects that are relevant in the domain of interest;

• Relationships serve to semantically connect concepts and instances;

• Instances represent the named and identifiable concrete objects in the domain of interest (i.e. the particular individuals that are classified by concepts and related by relationships).

Several approaches to ontology classification exist, proposed among others by Lassila and McGuinness [24] in 2001 and by Oberle [27] in 2006. In this paper we characterize an ontology using three dimensions:

• Number of conceptual elements in the domain.

• Degree of ambiguity in the conceptualization of the domain.

• Expressiveness of the formalism used to specify the ontology.

As we can see, none of the existing techniques meets all our requirements, so a novel approach is needed. In particular, existing approaches do not support processing documents in invoice format, nor processing documents in the presence of injected errors.

We draw on these methods to develop a new approach based on similarity calculation. This calculation takes several aspects into consideration: syntactic similarity, Synset similarity, and hierarchical similarity.

3. Preprocessing and document representation

After collecting a given dataset, it has to be processed before identification and detection tasks can be performed on its documents. This process consists of two main steps: the first applies pre-processing techniques; the second identifies and detects every area present in the document.

3.1 Preprocessing

This phase applies a number of techniques to clean and normalize the raw textual documents. We briefly describe the pre-processing techniques below.


• Tokenization: the main objective is to transform a text document into a sequence of tokens separated by white spaces. The output of this process is a text free from punctuation marks and special characters. The most common model is the bag-of-words, which splits a given text into single words. There is also the n-gram word model [13]: an n-gram is a sequence of n consecutive words in a document. If n equals 1 we speak of unigrams, if it equals 2 of bigrams, and so on. For example, in the sentence "Facture N° 35/2013", the unigrams are "Facture", "N°", "35", "/" and "2013", while the bigrams are "Facture N°", "N° 35", "35 /" and "/ 2013".
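The n-gram model described above can be sketched in a few lines (the function name is our own):

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Facture", "N°", "35", "/", "2013"]
unigrams = ngrams(tokens, 1)  # one tuple per token
bigrams = ngrams(tokens, 2)   # ("Facture","N°"), ("N°","35"), ("35","/"), ("/","2013")
```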

• Stemming: word stemming is a crude language-specific process that removes suffixes and variant forms in order to reduce words to their stem [2]. For example, stemming the two forms "Num" and "N°" yields the same stem, "Numéro".

• Stop-word elimination: stop words are function words that provide structure in language rather than content. A stop-word list can include, for example, ". # ^" and similar symbols.

3.2 Document representation

After the pre-processing stage, we build an adaptive textual format consisting of all terms that occur in the pre-processed documents. These terms can be words and/or expressions. The following figure shows an example of document representation.

Figure 1: document representation

4. The proposed method

As a pre-processing step, we removed all special characters from words: punctuation marks, separators, spaces, etc. These features do not contribute to the identification process; they have no discriminative power, so it is recommended to remove them.

In this study, we are interested in the detection and identification of textual areas in a given document, where each area is characterised by several features: position, morphology, information relevance, and vulnerability/sensitivity to recognition errors.

For our area-identification tasks, we use data from ABBYY FineReader [32], one of the strongest and most widely used optical character recognition (OCR) engines in document dematerialization. We test three different classifiers based on:

• Jaccard’s Coefficient [20].

• Grammatical Distance [31].

• Hierarchical similarity [37].

4.1 Critique

Using each variable independently cannot yield a satisfactory decision, because the problem addressed (automatic identification of areas) depends on several constraints:

• Syntactic constraints: the degree of syntactic similarity, ignoring the semantic goal.

• Morphological constraints: the degree of similarity within the sub-tree of the used ontology, also called the degree of correlation.

• Constraints related to Synsets: the degree of similarity between the terms entered in the ontological tree and the area to detect.

For these reasons, we propose in the next section a formula to evaluate the score of each area; it will be used in the proposed algorithm.

4.2 Hierarchical similarity

It is the quotient of the number of nodes of the area Zi over the number of nodes of the model area Z, where the counted nodes belong to the considered ontology. Formally:

SimHier(Z, Zi) = card(Nodes(Zi)) / card(Nodes(Z))


Figure 2: tree of designation area

Example: we consider the designation model area ZDesignation, with card(Nodes(ZDesignation)) = 6. Let us evaluate the following text areas:

• Z1 = "casa to rabat 5 2000.00 10 000.00": SimHier(ZDesignation, Z1) = 6/6 = 1, with casa ⇔ token11; to ⇔ token12; rabat ⇔ token13; 5 ⇔ token2; 2000.00 ⇔ token3; 10 000.00 ⇔ token4.

• Z2 = "rabat to 5 2000.00 10 000.00": SimHier(ZDesignation, Z2) = 5/6 = 0.8333.

• Z3 = "casa Le 10/02/2013": SimHier(ZDesignation, Z3) = 7/6 = 1.16 > 1, so we set SimHier(ZDesignation, Z3) = 0, i.e. this value is not considered.
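A minimal sketch of the hierarchical similarity with its cut-off for ratios above 1, using the node counts from the examples above (the function signature is our own; in the paper the counts come from the ontology tree):

```python
def sim_hier(model_nodes: int, area_nodes: int) -> float:
    """Hierarchical similarity: ratio of the candidate area's node count to
    the model area's node count; ratios above 1 are rejected (set to 0)."""
    ratio = area_nodes / model_nodes
    return ratio if ratio <= 1 else 0.0

sim_hier(6, 6)  # Z1: 1.0
sim_hier(6, 5)  # Z2: 0.8333...
sim_hier(6, 7)  # Z3: 0.0 (7/6 > 1, not considered)
```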

4.3 Synsets similarity

This similarity is the Jaccard coefficient between the subset of synsets of the model area Z and the subset of synsets of the zone Zi. Formally:

SimSyn(Zi, Z) = Jaccard(Zi, Z) = card(Syn(Zi) ∩ Syn(Z)) / card(Syn(Zi) ∪ Syn(Z))


Example: the model area is defined by Synset(Z) = {cinq, trente, mille, dirham, deux, cents, ...} with card(Syn(Z)) = 20. Let us evaluate the following text areas:

• Z1 = "cinq mille deux cents dirham": SimSyn(Z, Z1) = 5/20 = 0.25.

• Z2 = "cinq mille dex cents dirham" (with an OCR error): SimSyn(Z, Z2) = 4/20.

• Z3 = "facture N° 30/2013": SimSyn(Z, Z3) = 0/20.
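The Jaccard-based synset similarity can be sketched as follows; the synset strings below are placeholders standing in for the 20-element model set of the example:

```python
def sim_syn(model_synsets: set, area_synsets: set) -> float:
    """Jaccard coefficient between the model's synsets and the area's synsets."""
    union = model_synsets | area_synsets
    if not union:
        return 0.0
    return len(model_synsets & area_synsets) / len(union)

model = {f"mot{i}" for i in range(20)}            # hypothetical 20-synset model
area = {"mot0", "mot1", "mot2", "mot3", "mot4"}   # 5 synsets, all in the model
sim_syn(model, area)  # 5/20 = 0.25, as in the Z1 example
```

When every synset of the candidate area belongs to the model, the union equals the model set, so the formula reduces to intersection / card(Syn(Z)), matching the 5/20 of the example.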

4.4 Syntactic similarity

It is based on the syntactic stack: for each syntax error (an empty cell in the analysis table, or a mismatch between two tokens), we subtract 1/(number of empty cells in the analysis table) from the syntactic similarity.

4.4.1 Reminders

In order to define the language denoted by a grammar, we first define the alphabet and then the concept of derivation. A derivation is a relation that holds between two sentential forms, each a sequence of grammar symbols.

Formally, a grammar is a tuple G = (VT, VN, S, P), where:

• VT is a non-empty set of terminal symbols;

• VN is a non-empty set of non-terminal symbols, with VN ∩ VT = ∅;

• S is the initial symbol (axiom);

• P is a set of production rules.

Example: VT = {a, b, c}, VN = {S, T}, with productions S → aTb | c and T → cSS | S.

Let us compute the First and Next sets. Definitions:

• First(A): the set of all terminal symbols that can start a string derived from A.

• Next(A): the set of all terminals that can appear immediately to the right of A; the $ symbol belongs to the Next set of the axiom. The following table shows the First and Next sets.

      First    Next
S     a, c     $
T     a, c     b

Table 1: The First and Next sets

• Analysis table: a two-dimensional array M that indicates, for each non-terminal symbol A and terminal symbol a, the production rule to apply.

      a          b    c          $
S     S → aTb         S → c
T     T → S           T → cSS

Table 2: The analysis table


4.4.2 Example

We consider the following grammar:

S → id id Ref    (1)
Ref → chiffre sep chiffre    (2)

with VN = {S, Ref} and VT = {id, chiffre, sep}. We then generate the following tables:

      First      Next
S     id         $
Ref   chiffre    $

Table 3: The First and Next sets

      id     chiffre    sep
S     (1)
Ref          (2)

Table 4: The analysis table

There are four empty cells, i.e. four possible syntax errors.

• Let us analyse the sentence Z1 = "Facture N° 30/2013", tokenized as id id chiffre sep chiffre:

Stack                    Input                          Output
$ S                      id id chiffre sep chiffre $    S → id id Ref
$ Ref id id              id id chiffre sep chiffre $    Ref → chiffre sep chiffre
$ chiffre sep chiffre    chiffre sep chiffre $          the sentence is accepted

Table 5: The analysis process

So SimSyntax(Z, Z1) = 1.

• Let us analyse the sentence Z2 = "Facture 30 N° /2013", tokenized as id chiffre id sep chiffre:

Stack                    Input                          Output
$ S                      id chiffre id sep chiffre $    S → id id Ref
$ Ref id id              id chiffre id sep chiffre $    Syntax error: an id was expected instead of a number
$ Ref                    id sep chiffre $               Syntax error: a number was expected instead of an id
$ chiffre sep chiffre    chiffre sep chiffre $          the sentence is not accepted

Table 6: The analysis process

So SimSyntax(Z, Z2) = 1 - (1/4) - (1/4) = 1/2 = 0.5
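The error-penalty computation can be sketched as follows, assuming (per the examples above) that each syntax error costs 1 over the number of empty cells in the analysis table; the function name is ours:

```python
def sim_syntax(n_errors: int, n_empty_cells: int) -> float:
    """Syntactic similarity: start from a perfect score of 1 and subtract
    1/(number of empty cells in the analysis table) per syntax error."""
    return 1.0 - n_errors / n_empty_cells

sim_syntax(0, 4)  # Z1: no errors -> 1.0
sim_syntax(2, 4)  # Z2: two errors -> 0.5
```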

In the first case there are no errors; in the second there are two.

4.5 Area comparison

In order to detect every area contained in the document, we consider several grammatical rules covering lexical, syntactic, and semantic features. For each category of document we specify a set of rules, called a grammar. The similarity or association strength between sentences is measured by means of:


• Jaccard’s coefficient.

• Grammatical distance.

• Hierarchical similarity.

Jaccard's coefficient is used to measure the intersection of two sets: the grammatical rules and the candidate sentence. We cast area labelling as an identification problem: for a given area Zi, the problem is to find the model (i.e. the class) of that area, that is, the attribute to which it corresponds. The identification process is based on calculating a degree of similarity that takes multiple constraints into account: hierarchical similarity, syntactic similarity, and lexical similarity.

Formally, we use:

Sim(Zi, Z) = α · SimHier(Zi, Z) + β · SimSyntax(Zi, Z) + γ · SimSyn(Zi, Z)

where Zi is the textual area to identify. The label of Zi is then given by:

Z* = argmax over Z ∈ Set(M) of Sim(Zi, Z)

where Z* denotes the model area obtaining the greatest value of Sim(Zi, Z), and:

• α, β, and γ are fixed constants belonging to the interval ]0, 1[ with α + β + γ = 1;

• Set(M) denotes the set of model areas;

• SimHier(Zi, Z) refers to the similarity in number of nodes in the ontological tree;

• SimSyntax(Zi, Z) denotes the syntactic similarity, i.e. the degree of conformity between Z and Zi;

• SimSyn(Zi, Z) denotes the similarity of common Synsets.
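A sketch of the combined score and model selection, with hypothetical weights; the three component similarities are assumed to have been computed as in the previous subsections:

```python
def sim(alpha, beta, gamma, s_hier, s_syntax, s_syn):
    """Weighted combination of the three similarities; the weights lie in
    ]0, 1[ and must sum to 1."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return alpha * s_hier + beta * s_syntax + gamma * s_syn

def best_model(models, scores):
    """Return the model area with the greatest combined similarity (argmax)."""
    return max(models, key=lambda m: scores[m])

s = sim(0.4, 0.3, 0.3, 1.0, 0.5, 0.25)  # 0.4 + 0.15 + 0.075 = 0.625
best_model(["date", "designation"], {"date": 0.2, "designation": s})
```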

5. Experiments

In this section, we present the experiments performed for this study. First, we describe the experimental environment, i.e. some details about the pre-processing techniques, the identifiers, the validation method, and the performance measure used. Then we present the obtained results and discuss the findings.

5.1 Process of experimentation

We scanned 200 documents from 10 categories and used ABBYY FineReader to obtain the text format (Fig. 3). The considered categories are:


transport invoices, wood invoices, iron invoices, building-material invoices, spare-part invoices, fuel invoices, tire invoices, notary invoices, lawyer invoices, and phone invoices.

We note in this sample the existence of several recognition errors: syntactic, lexical, and semantic.

Figure 3: the text format

5.2 Algorithm of detection

In order to detect the type of every area constituting the text document, we use the score calculation in the following algorithm:

1. For a given zone Z, calculate score(Z).

2. If the score is satisfactory:

a. the area is properly recognized;

b. move to the next area.

3. Otherwise, merge Z with another area and go to step 2.
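The steps above can be read as the following greedy loop. This is a sketch under our own interpretation of step 3 ("together with another area" taken as concatenation with the next zone); `score` and `threshold` are placeholders for the similarity score and an acceptance criterion:

```python
def detect_areas(areas, models, score, threshold):
    """Score each zone against all model areas; if no model scores above
    the threshold, merge the zone with the next one and retry."""
    labelled, i = [], 0
    while i < len(areas):
        zone = areas[i]
        while True:
            model, s = max(((m, score(zone, m)) for m in models),
                           key=lambda t: t[1])
            if s >= threshold or i + 1 >= len(areas):
                break
            i += 1                      # merge with the next area and retry
            zone = zone + " " + areas[i]
        labelled.append((zone, model, s))
        i += 1
    return labelled
```

With a toy substring-based score, a zone that matches no model on its own gets merged with its neighbour until a model matches or the document ends.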

5.3 Results

We consider the following symbols for the existing model areas: Z1 = date area; Z2 = invoice number area; Z3 = customer address area; Z4 = designation label area; Z5 = designation item area; Z6 = pre-tax amount (HT) area; Z7 = total amount including tax (TTC) area; Z8 = amount-in-words area.

The following table shows the results found, including the zone detection rates.

(Content of Figure 3, interleaved here by text extraction: the raw OCR output of a sample invoice, with its recognition errors preserved and four areas flagged as errors.)

Client : SCIF
Alléé 2=s Cactus Ain Sebaâ- Casablanca-Maroc 3P : 2604
FACTURE N° 08/2013
Désignation / Nombre de voyages / Prix unitaire H.T. / Prix 'obi H.T.
Frais detransport du client Casablanca Safi : 2 üüü.üü 2 000.00
TOTAL TAXE HORS 2 000.00
T.V.A 14 Va 280.00
TOTAL T.T.C. 2 280.00
Arrètéfe la présente facture à lâiommede ;
Deux mille deux cent quatre vingt Dirham et 00 centimes.
Rue 20 Grouse 7 sut Moumen raa N*15 Casablanca T.V.A HT- ¿332252-i
»atexe >r- 3332252« R.C N* 3SS£5i GS V 351 53« 52
Casablanca le : 25/02/2013


Table 7: Obtained results

The identification rate obtained using the similarity method is 78.05%. Using the combined similarity coefficients gives a good rate of detection and identification of the text areas containing the useful data. Detecting an area allows recognition of all its components, and also makes the best interpretation possible.

6. Conclusion and future work

In this paper, we investigated the use of similarity techniques to detect contextual text areas in documents (invoice documents) in order to provide improvements and recommendations to DMS developers. We proposed detection and classification approaches centred on the similarity concept.

We evaluated our approach on our own dataset built from multiple providers' invoices. We focused on the text areas in which we aim to detect every token contained; these tokens will be used as label information in the interpretation phase.

We consider the result obtained on our corpus, though not high, to be promising given the challenges of this field mentioned above. Our results also show that this research direction, especially text-area detection, is challenging and still needs deeper, more fine-grained study.

References

[1] Bolasco S., della Ratta-Rinaldi F. (2004), Experiments on semantic categorisation of texts: analysis of positive and negative dimension. In: Purnelle G., Fairon C., Dister A. (eds.), Le poids des mots, Actes des 7es Journées Internationales d'Analyse Statistique des Données Textuelles, UCL, Presses Universitaires de Louvain, pp. 202-210.

      Z1    Z2    Z3    Z4    Z5    Z6    Z7    Z8
C1    20    15    21    34    14    20    23    29
C2    18    23    13    30    27    24    23    34
C3    14    29    26    19    29    22    24    11
C4    13    11    28    34    18    15    25    24
C5    27    15    14    19    18    34    24    16
C6    20    32    20    17    18    14    30    15
C7    29    17    27    18    28    15    13    25
C8    27    29    20    26    16    28    15    21
C9    33    34    16    23    19    16    23    24
C10   24    27    17    14    21    28    27    13
Avg   22.5  23.2  20.2  23.4  20.8  21.6  22.7  21.2


[2] A. Smeaton, « Information retrieval: Still butting heads with natural language processing? », Information Extraction A Multidisciplinary Approach to an Emerging Information Technology, p. 115–138, 1997.

[3] Agustini A., De Lima V., Gamallo P., Gasperin C., Lopes G. (2001), Using Syntactic Contexts for Measuring Word Similarity. In: Proceedings of the Workshop on Semantic Knowledge Acquisition and Categorisation, Helsinki, Finland.

[4] Alfonseca E., Manandhar S. (2002), Extending a Lexical Ontology by a Combination of Distributional Semantics Signatures. In: Proceedings of EKAW02, pp. 1-7;

[5] Berland M., Charniak E. (1999), Finding parts in very large corpora. In: Proceedings of the 37th ACL, pp. 57-64. University of Maryland.

[6] Biber D., Conrad S., Reppen R. (1998), Corpus linguistics. Invastigating language structure and use, Cambridge, Cambridge University Press.

[7] Butler C.S. (1985), Statistics in Linguistics, Oxford, Blackwell.

[8] Caraballo S. (1999), Automatic acquisition of a hypernym-labeled noun hierarchy from text. In: Proceedings of the 37th Annual Meeting of the ACL, pp. 120-126.

[9] Cimiano P., Staab S. (2004), Clustering concept hierarchies from text. In: Proceedings of LREC- 2004. Lisbon, Portugal;

[10] Daille B., Gaussier É., Langé J.M. (1994), Towards automatic extraction of monolingual and bilingual terminology. In: Proceedings of the 15th Conference on Computational Linguistics (Morristown, NJ, USA), Association for Computational Linguistics, pp. 515-521.

[11] De Mauro T. (1961), Statistica Linguistica., In: Enciclopedia Italiana, Appendice III, Vol. II, Ist. Enciclopedia Italiana.

[12] De Mauro T., Mancini F., Vedovelli M., Voghera M. (1993), Lessico di frequenza dell’italiano parlato, Etaslibri, Roma.

[13] C. E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, vol. 27, pp. 379-423, 1948.

[14] Faure D., Nedellec C. (1998), A corpus-based conceptual clustering method for verb frames and ontology acquisition. In: LREC workshop on Adapting lexical and corpus resources to sublanguages and applications. Granada, Spain;

[15] Giovannetti E., Marchi S., Montemagni S. (2008), Combining Statistical Techniques and Lexico-syntactic Patterns for Semantic Relations Extraction from Text. In: Aldo Gangemi, Johannes Keizer, Valentina Presutti, Heiko Stoermer (eds.), Proceedings of the 5th Workshop on Semantic Web Applications and Perspectives (SWAP 2008), Rome, 15-17 December 2008.

[16] Girju R., Badulescu A., Moldovan D. (2006), Automatic Discovery of Part-Whole Relations. In: Computational Linguistics, Volume 32, No. 1, pp. 83–135.

[17] Habert B., Nazarenko A., Salem A. (1997), Les linguistiques de corpus, Parigi,Colin.

[18] Hearst M. A. (1992), Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th International Conference on Computational Linguistics, pp. 539-545. Nantes, France.

[19] Hearst M. A. (1998), Automated Discovery of WordNet Relations, MIT Press, Cambridge MA.

[20] Yasin H., Yasin M. M., Yasin F. M. (2011), Automated Multiple Related Documents Summarization via Jaccard's Coefficient. International Journal of Computer Applications (0975-8887), Vol. 13, No. 3, January 2011.

[21] Kageura K., Umino B. (1996), Methods of Automatic Term Recognition: A Review. Terminology 3(2), pp. 259-289.

[22] Kennedy G. (1998), An introduction to corpus linguistics, Londra, Longman.

[23] Lame G. (2005), Using NLP Techniques to Identify Legal Ontology Components: Concepts and Relations. In: V.R. Benjamins et al. (eds.), Law and the Semantic Web, LNAI 3369, Springer-Verlag Berlin Heidelberg, pp. 169-184.

[24] Lassila O., McGuinness D. L. (2001), The Role of Frame-Based Representation on the Semantic Web. Linköping Electronic Articles in Computer and Information Science, Vol. 6, No. 005, http://www.ep.liu.se/ea/cis/2001/005/

[25] Lee L. (1997), Similarity-Based Approaches to Natural Language Processing, Ph.D. thesis;

[26] Maedche A., Staab S. (2000), Discovering conceptual relations from text. In: Proceedings of ECAI 2000. Technical Report 425, University of Karlsruhe;

[27] Oberle, D. Semantic Management of Middleware. New York: Springer. 2006.

[28] Pekar, V., Staab, S. (2003), Word classification based on combined measures of distributional and semantic similarity. In: Proceedings of EACL03. Budapest.

[29] Ryu P., Choi K. (2007), Automatic Acquisition of Ranked IS-A Relation from Unstructured Text. In: Proceedings of OntoLex07, pp. 67–77. Busan, Korea.

[30] Spina S. (2001), Fare i conti con le parole. Introduzione alla linguistica dei corpora, Perugia, Guerra.

[31] Stavros Papadopoulos, Introduction to Data Mining, lecture notes, Fall 2011.

[32] ABBYY® FineReader optical character recognition system, User's Guide, 2011.

[33] T. Gruber. A Translation Approach to Portable Ontologies. Knowledge Acquisition, vol. 5, no. 2, pp. 199-220,1993.

[34] Turney P.D. (2006), Expressing implicit semantic relations without supervision. In: Proceedings of Coling/ACL-06, pp. 313-320. Sydney, Australia.

[35] Turney P.D. (2008), A uniform approach to analogies, synonyms, antonyms, and associations. In: Proceedings of Coling 2008, pp. 905–912. Manchester, UK.

[36] Widdows D. (2003), Unsupervised methods for developing taxonomies by combining syntactic and statistical information. In Proc. of the HLT/NAACL 2003 - Vol. 1, 197-204.

[37] Wu Z., Palmer M. (1994), Verb semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133-138.


[38] Ziqi Zhang, Jose Iria C. B., Ciravegna F. (2008), A comparative evaluation of term recognition algorithms. In Proceedings of the Sixth International Language Resources and Evaluation (LREC08) (Marrakech, Morocco), ELRA, Ed.

Received: June 19, 2013