sentimatrix - named entity recognition for romanian language

KNOWLEDGE ENGINEERING: PRINCIPLES AND TECHNIQUESProceedings of the International Conference on Knowledge Engineering,Principles and Techniques, KEPT2011Cluj-Napoca (Romania), July 4–6, 2011, pp. 25–36

NAMED ENTITY RECOGNITION FOR ROMANIAN

ADRIAN IFTENE(1), DIANA TRANDABAT(1), MIHAI TOADER(2),

AND MARIUS CORICI(2)

Abstract. This paper presents a Named Entity Recognition system forRomanian, created using linguistic grammar-based techniques and a set ofresources. Our system’s architecture is based on two modules, the namedentity identification and the named entity classification module. After thenamed entity candidates are marked for each input text, each candidate isclassified into one of the considered categories, such as Person, Organiza-tion, Place, Country, etc. The system’s Upper Bound and its performancein real context are evaluated for each of the two modules (identificationand classification) and for each named entity type. The evaluation showpromising results, our system being comparable with the existing systemsfor Romanian, and even better for Person recognition.

Named Entity Recognition (NER) is a common natural language process-ing task dedicated to the discovery of textual expressions such as the namesof persons, organizations, locations, places, etc. Named entity recognition,although a seemingly simple task, faces a number of challenges. Entities mayfirstly be difficult to find, and once found, difficult to classify. In this paper,we present the development of a NER system for Romanian.

Even though the categories of named entities are predefined, there arevarying opinions on what categories should be regarded as named entities andhow broad those categories should be. The categories chosen for a particu-lar NER project may depend on the requirements of the project. If numericalclassification is important to a particular field, then the categories dealing withnumerical data may need to be more refined. Similarly, if geographical classi-fication is important, it may be necessary to classify each location entity as aparticular type of location. The NER system for Romanian presented in thispaper is intended to be part of a sentiment assessment system which monitorsuser feedback in rapport to an organization’s brand or product. Therefore, we

Received by the editors: 07.04.2011.2010 Mathematics Subject Classification. 68T50, 68P20, 91F20.Key words and phrases. Named Entity, Information Extraction.

c©2011 Babes-Bolyai University, Cluj-Napoca

25

26ADRIAN IFTENE(1), DIANA TRANDABAT(1), MIHAI TOADER(2), AND MARIUS CORICI(2)

tried to refine the named entities types with regard to companies and prod-ucts, so the categories we considered are: Person, Organization, Company,Region, Place, City, Country, Product, Brand, Model, and Publication.

Having such complex named entities types was justified also by the needof an NER module for our question answering and textual entailment systems.In question answering, there are questions having as expected answer type aperson (male or female), a city, region or organization. The systems presentedin [4], [5] and [3] show that the identification of these types in the retrievedparagraphs is a necessary step toward better results.

The next section presents some observations on previous published work,while the second section elaborates on our method. Section 3 presents theevaluation of our system, before finally discussing our conclusions.

1. Existing Work

NER systems use linguistic grammar-based techniques or statistical mod-els (an overview is presented in[10]). Hand-crafted grammar-based systemstypically obtain better precision, but at the cost of lower recall and months ofwork by experienced computational linguists. Besides, they are hard to adaptto new domains. Statistical NER systems typically require a large amount ofmanually annotated training data. Machine learning techniques, such as theones discussed in [8] or [9], allow systems-based adaptation two new domains,perform very well for coarse-grained classification, but require large trainingdata.

Named entity recognition for Romanian has been attacked in [1], [6] and[7]. There is also a NER gazetteer for Romanian included in GATE [2]. Thesystem presented in [7] is based on GATE system, with several componentsadded to the GATE platform: three gazetteers developed for Romanian and aJAPE transducer, also developed for the Romanian language. The gazetteersprovide the system with a list of entities (person names, locations, organiza-tions) and the JAPE (Java Annotations Pattern Engine) transducer providesa set of rules for matching specific patterns in the text aimed at identifyingthe named entities which appear in the text. The performance of the systemwas measured to be F-Measure of 32% for the identification of Person namedentities, 66% for the identification of Locations and 58% for the identificationof Organizations.

Another named entity recognition for Romanian is presented in [6], de-veloped by extensive use of Perl regular expressions that define sequences oftokens that constitute named entities. The considered named entities are:integers, real numbers, dates, times, names of persons (female and male), dif-ferent quantities, lengths, volumes and weights.

NAMED ENTITY RECOGNITION FOR ROMANIAN 27

In [1], an algorithm for the minimally supervised learning of named entityrecognizers given short name lists as seed data is presented. The algorithmuses hierarchically smoothed trie structures for modeling morphological andcontextual probabilities effectively in a language independent framework, over-coming the need for fixed token boundaries or history lengths. The systemachieves 70.5%-75.4% F-measure (measuring both named entity identificationand classification) when applied to Romanian text.

The system presented in this paper obtains comparable results for most ofour categories, and outperforms the existing approaches for Person recognition.

2. Our Solution

Nowadays, big companies and organizations spend time and money in or-der to find users’ opinions about their products, the impact of their marketingdecisions, or the overall feeling about their support and maintenance services.These analyses help in the process of establishing new trends, policies anddetermine in which areas investments must be made. One of the focuses ofour work is helping companies build such analyses in the context of users’sentiment identification. Therefore, the corpus we work on consists of articlesfrom newspapers, people’s blog, various entries from forums, and posts fromsocial networks.

But before extracting sentiments, named entities must be recognized. Inthe process of extracting named entities (NEs) we consider two steps: thefirst one is related to the identification of NEs and second one involves theclassification of the identified NEs.

2.1. Named Entities Identification. We took a rule-based approach forthe Named Entities Identification (NEI) task. The NEI module uses in apreprocessing step a text segmentator and a tokenizer. Given a text, we divideit into paragraphs, every paragraph is split into sentences, and every phraseis tokenized. Each token is annotated with two pieces of information: it’slemma (obtained from our resource with 76,760 word lemmas correspondingto 633,444 derived forms) and the normalized form (translated to the properdiacritics1).

Every token written with a capital letter is then considered to be a namedentity candidate. Here we consider also tokens which are near or betweenpunctuation signs like comma, point, question mark, quotes, single quotes,brackets, etc.

A special module was built for tokens with capital letters which are thefirst tokens in phrases. For this category, we consider two situations:

1In Romanian online texts, two diacritics are commonly used, but only one is acceptedby the official grammar.


(1) when this first token of a phrase is in our stop word list - we eliminateit from the named entities candidate list;

(2) when the first token of a phrase is in our common word list - in thiscase we have two possible situations:(a) when this common word is followed by lowercase words - we check

if the common word (the first word of the sentence) can be foundin the list of trigger words. These trigger words are cue wordswhich introduce NEs, for example: university, company for theOrganization type (this is the case for “[Universitatea] din Iasi”(En: [University] of Iasi)); doctor, professor for the Person type(“[Profesor] de Fizica” (En: [Professor] of Physics)); words suchas city, country for Location NEs (“[Tara] de Jos” (En: Low[Country])), etc. If the first word of the sentence is in this listwith trigger words, it is kept in the NEs candidate list. If theword is not in the list with trigger words, it is eliminated fromNEs candidates, as being just a common word written with capitalletter just due to its position.

(b) when this common word is followed by uppercase words - in thiscase the first word of the sentence is kept in the NEs candidate list,and in a further step it will be decided if it will be combined withthe following word in order to create a composed named entities(For example “[Doctor] Stomatolog” (En: [Doctor] Dentist)).

After we build the list with named entity candidates, we apply rules thatunify adjacent candidates in order to obtain composed named entities. Themost important rules are:

(1) Rules related to a person’s title - in these cases, we unify words likeDoctor, Profesor, Ministru, Presedinte, etc. (En: Doctor, Professor,Minister, President) next to adjacent candidates (Example of possibleresults are: “Doctorul Popescu” (En: Doctor Popescu));

(2) Rules related to the Organization type - we unify words like Universi-tate, Companie, Partid, etc. (En: University, Company, Party) nextto adjacent candidates (Example of possible results are “UniversitateaCuza” (En: Cuza University));

(3) Rules related to abbreviation words - we unify abbreviations such asS.R.L., S.C., S.A. next to adjacent candidates (for example “S.C.Travis”);

(4) Rules related to special punctuation signs - in these cases we unifycandidates separated by “&” or “-” (“Ana - Maria” for example);

(5) Rules related to candidates to named entities separated by stop words -in these cases we unify candidates separated by one or two specific stop


words. Examples of specific stop words are: din, de la, a, al, pentru,ın (En: of, from, the, of, for, in) (Few examples: “Universitatea dinIasi” (En: Iasi University), “BCR Banca pentru Locuinte” (En: BCRHousing Bank), “Directia pentru Sanatate si Securitate ın Munca”(En: Department of Health and Safety at Work));

(6) Rules for a specific model/product - candidates are combined withnumbers or with one or two letters, followed by digits (For example:“Portege R835”, “Qosimio X500-Q930”).

Some of these rules are used also in the classification process, namely therules related to Person, Organization and Model types. Beside uppercasewords which are automatically NE candidates, we also consider as possibleNE-trigger lowercase words expressing titles (e.g. profesor, avocat, doctor,etc. (En: professor, lawyer, doctor)).

2.2. Named Entities Classification. The named entities (NEs) resourcefor Romanian was build starting from the categories used in GATE [2]. Thus,we consider the following major categories: the “standard categories” of City,Organization, Company, Country, Person, and additional categories such asBrand, Product and Publication (for revues, newspapers, etc.).

NE type Number of entries

Brand 9Non-Romanian City 114,288Romanian City 14,277Non-Romanian Company 73,125Romanian Company 2,884Country 405Non-Romanian Organization 2,016Romanian Organization 1,987Non-Romanian Person Name 102,512Romanian Person Name 52,868Place 189,403Product 14,842Region 64Title 4,050

Total 572,730Table 1: Number of NEs of each type

For almost all major categories we consider subcategories (the most rep-resented ones are included in Table 1): for Cities we have as subcategoriesRomanian, European, American and Other Cities; for Companies we considerRomanian and non-Romanian Companies, but also Airlines, Banks, etc.; for


Organizations we consider Parties, Faculties, Universities, Ministries; for Per-sons we have the subcategories of Sportsmen, Politicians, Males, Females, etc.;for Cities we split our resources on Romanian and non-Romanian Cities andwe separate Romanian Cities on Counties. In the end, we have built a total of14 main categories with 98 subcategories. The distribution of categories thathave the highest number of NEs in our databases is presented in Table 1.

In the classification process we use some of rules used in the unification ofNEs candidates along with the resource of NEs and several rules specificallytailored for classification. Thus, after all NEs in the input text are identifiedand, if possible, compound NEs have been created, we apply the followingclassification rules:

(1) contextual rules - using contextual information, we are able to clas-sify candidate NEs in one of the categories Organization, Company,Person, City and Country by considering a mix between regular ex-pressions and trigger words. Thus, our regular expressions search forspecific triggers one or two words before the NE candidate. For exam-ple oras, capitala, targ, localitate (En: city, capital, town, locality) arethe triggers searched in order to classify a candidate NE into the Citytype; companie or corporatie (En: company, corporation) are the trig-gers for Company type; partid, banca, universitate (En: party, bank,university) are triggers for Organization type and all titles are triggersfor Person type;

(2) resource-based rules - if no triggers were found to indicate what typeof entity we have, we start searching our databases for the candidateentity. If the candidate NE is a compound one, we first try to findit as if (i.e. the complex NE) in our resources. If it cannot be foundas a complex entity, we split it back and try to find the first entityand assign its type to the whole complex. This rule was inspiredby the fact that in cases such as “Universitatea din Iasi” (En.“IasiUniversity”), the first word “Universitatea”) is the one that determinesthe Organization type of the complex NE, and not the last word, whichis a City.

3. Evaluation

This section presents the performance of our NER system. Sections 3.1and 3.2 discuss a first “development” evaluation step, where we wanted toevaluate the system’s performance when all needed resources were available(i.e. all NE are can be found in our resources). Therefore, if a NE was notfound in our resources, we added the NE to the database, or if additional ruleswere needed, we built them. This case represents the system’s upper bound.


The next sections, 3.3 and 3.4, present the evaluation of our system on anew corpus, for each module. Some NEs were not found in our resources, butthis time we no longer added them, nor have we built specific rules.

3.1. Upper Bound Named Entities Identification Evaluation. In theevaluation process, we manually annotated 48 files with a total of 24,244 wordsand with 1,638 NEs. Based on these development files, we incrementally builtour rules, both for NE identification and for NE classification. Also, we addedall missing NEs in our resources and built special rules for the untreatedcases. The evaluation on this corpus is presented in Table 2. Partial matchingrepresents the intersection between the gold NE and the NE identified by oursystem.

Metric Value Description

Entity Identification 95.12% The percentage of gold entities found byPrecision the system reported to all returned entitiesEntity Identification 96.40% The percentage of gold entities found byRecall the system reported to all gold entitiesEntity Identification 95.76% The equally weighted f - measureF-MeasurePartial Precision 1.81% The percentage of gold entities partially

returned by the system reported to allreturned entities

Partial Recall 1.83% The percentage of gold entities partiallyreturned by the system reported to allgold entities

Table 2: Upper bound evaluation for NE identification

The first main problem in NE identification is related to the agreementbetween annotators when different types of NEs are adjacent (For example:“PDL Dolj”). In these situations, some believe it would be a single entity,while others believe that two different entities should be considered (“PDL”and “Dolj”, in this case, or even an embedded “PDL Dolj” and “Dolj”), withdifferent types.

The second main problem in NE identification is related to the cases whenthe first word of a sentence is a common word and is not followed by wordswith capitalized letter. In these cases, the system is trained to leave the firstword of the sentence out of the candidate NE list. This is the case for “Maria”,a common Romanian female name which also means “to marry” or “Ana”,also a common Romanian female name, which can be found in the list ofcommon words with the meaning of “rope used on boats”. A total number of4,346 common words appear 5,622 times in one or more resources as NEs. In


other words, 1% of named entities is ambiguous with common words in ourdatabases.

Interesting is the case of “Aurora”, a female Romanian name which alsomeans “dawn”: it appears simultaneously in four different resources: cities,places, products and person names. For these cases we decided to build aspecial list with NEs that exist also in the list of common words, which arehowever statistically more frequent as NEs than as common words.

Another frequent problem in the identification of NEs is related to specialcharacters found at the beginning of a row. In these cases, a wrong identifi-cation is due to the fact that the sentences are not properly segmented, but,although not a problem of the system itself, it influences the system’s overallperformance.

3.2. Upper Bound Named Entities Classification Evaluation. For cor-rectly identified named entities, the percentage of the matched and partialmatched NEs that have been properly categorized is 95.71%.

The main problems in NEs classification are related to the fact that thereexist NEs that are in more than one list of NEs. A number of 5,243 NEsappears in more than two resources, summing up to 10,588 occurrences. Forexample, “Adevarul” exist both in the list of companies and in list of publi-cations, “Markel” exists in the list of person name and in list of companies,“Capital” is both in the list of places and in list of publications, “Baneasa” isboth a place and a city, “Acer” and “Nokia” are both a brand and a company,“Dacia” is a company, a brand, country, a city and a product, “David Jones” isa person and a brand, “Darian” is a company or a person, “G7” is a model andan organization, “Siret” is a region and a city, “Tudor Vladimirescu” is a cityand a person name, “Alexandria” is a place and a city, “Franklin Templeton”is a person name and a company, and so on.

In Table 3, we present the most represented NE types pairs in our databases,for which the same NE appears in both categories:

NE types Number of Common Entities

Company and Product 2,067City and Place 890Person and Place 609Company and Person 247Company and Organization 96

Table 3: Number of NE joint in the two categories

As we can see from Table 3, the most important problem is due to the factthat we have products that have the same name as the company that producethem. Another problem is due to the fact that we have the same names for


cities and places (this is due to the fact that in Romania, most districts havethe same name as the administrative city in the county).

Another problem in classification is related to cases when we have partialmatch on extracted NEs. This is the case when in the initial text we havesomething like “...a declarat ın exclusivitate pentru Capital Bjorn Hauge”(En: said exclusively for Capital Bjorn Hauge), with two gold entities, eachwith its type: “Capital” is a company and “Bjorn Hauge” is a person. Inthis case, due to our NE composition rules, our application extracts only onenamed entity “Capital Bjorn Hauge” which is not found in any class, and thusthe system assign to this NE group the class of the first NE. This is also thecase for “fostul sef al Vamii Halmeu Nicoleta Dobrescu” (En: former headof Halmeu Custom Nicoleta Dobrescu), for “ıntalnirea Basescu-Nazarbaev”(En: Basescu-Nazarbaev meeting), for “a precizat pentru SFin Sorin Blejnar”(En: said for SFin Sorin Blejnar) and for “autostrada Pitesti-Bucuresti” (En:Pitesti-Bucuresti highway).

3.3. Named Entities Identification Evaluation in Real Context. Fortesting the system in real context we created a new test corpus, unseen by oursystem, containing 38 files manually annotated with a total of 19,509 wordsand 1,215 NEs. The evaluation of the system with this test corpus is presentedin Table 4.Metric Value Description

Entity Identification 90.50% The percentage of gold entities found byPrecision the system reported to all returned entitiesEntity Identification 90.95% The percentage of gold entities found byRecall the system reported to all gold entitiesEntity Identification 90.72% The equally weighted f - measureF-MeasurePartial Precision 2.29% The percentage of gold entities partially

returned by the system reported to allthe extracted entities

Partial Recall 2.30% The percentage of gold entities partiallyreturned by the system reported to allgold entities

Table 4: Evaluation of the NE identification module

Besides the problems discussed in the upper bound evaluation, we foundadditional problems related to the extraction of entities of the type Title(which are usually written with lowercase letters) and are very dependent toour resource list. The problems related to Title account for 3.70% of the totalnumber of NEs error in this corpus (i.e. 45 from 144 titles weren’t extracted)and comes from the fact that we don’t have enough entities in our resources.


Is was the case for words such as “calugarita, sora, colonel, viceprimar, co-presedinti, ambasador, guvernator, premier, patron, tar, tovarasa, maiestate,europarlamentar” (En: nun, nurse, Colonel, vice mayor, co-president, Ambas-sador, governor, premier, employer, Tsar, comrades, majesty, EP member).

3.4. Named Entities Classification Evaluation in Real Context. Oursystem correctly classified (total or partial match) 66.73% of the NEs in ourtest corpus. The next table presents the error distribution on the named entitytypes were most errors occurred.

NE Type NEs wrongly Identified Total no. of NEs Percent

Company 28 89 31.46%Publication 15 25 60.00%Person 86 364 23.63%Product 34 65 52.31%Organization 72 220 32.73%Region 55 76 72.37%Undecided 23 36 63.88%

Table 5: Error distribution on NE types

Interesting is the case of Undecided entities, entities which are not classifiedin any of our types by human annotator in the test corpus (for example:“Stirile ProTV” (En: ProTV News), different laws or rules, TBC, ITP, PIN,etc.). In 13 of these cases, our system was not able to classify the extractedentity, similar to the gold annotation. However, in 23 cases, although theseNEs don’t exist in our resources, contextual rules were applied and the systemwrongly identified a type for them.

For Companies, Organization and Person types, the errors appear becausethe NEs were not found in our resources and no contextual rules could beapplied.

For Publication and Product types, the errors occurred because they fre-quently are marked interchangeable in the test corpus, since it is difficult todistinguish between them.

For Region type, the major cause of errors is due to the fact that respectiveNE exists also in resources for other type, such as City, Place, and Country.

An interesting example is the case of PNL, which does not exist in any ofour resources. In some cases, when it is proceeded by the word “partid” (En:party), it is correctly classified as Organization, but in all other cases, the sys-tem does not identify any type for it. Thus is a clear example where anaphoraresolution would greatly increase the system performance, since after findingthe type for one entity, this type could be transferred to all its references inthe anaphoric chain.


4. Conclusions

This paper presents a Named Entity Recognition system for Romanian,created using linguistic grammar-based techniques and a set of resources. Thearchitecture of our system involves two modules, named entity identificationand named entity classification module, successively applied. The goal of thedescribed system is to recognize named entities for Romanian, distinguishingbetween 14 NE types. Even if we consider so many categories, we still man-age to have comparable results (and even better for specific categories) withexisting systems for Romanian, which identify less NE types.

Future work will be related to the elimination of problems related to com-mon words that are at the beginning of sentences. To fix these problems, weintend to use statistical information about common words obtained from alarge corpus, such as the Romanian Wikipedia.

Another immediate future direction is related to anaphora, which couldbe of great benefit in order to transfer the type of one classified entity to allits referees. Thus, the undecided cases will reduce, and we could also considera voting system if the same entity has different types assigned in differentcontexts.

In order to help us increase the resources we have, we also consider de-signing an interface where people using our NER could add the entities wedid not identified, or classify them correctly if the system did not. However,we also need to find a sort of annotation evaluation technique to minimize thepossibility of errors to be introduced in the database.

Acknowledgment

The research presented in this paper was funded by the Sectoral Opera-tional Program for Human Resources Development through the project “De-velopment of the innovation capacity and increasing of the research impactthrough post-doctoral programs” POSDRU/89/1.5/S/49944. The authors ofthis paper thank the colleagues Alexandru Gınsca, Emanuela Boros, AugustoPerez, Dan Cristea from Faculty of Computer Science Iasi, for the help offeredin this project.

References

1. S. Cucerzan and D. Yarowsky, Language independent named entity recog-nition combining morphological and contextual evidence, In Proceedings ofthe Joint SIGDAT Conference on EMNLP and VLC, 1999, pp. 90–99.

2. H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, Gate: Aframework and graphical development environment for robust nlp tools and


applications, Proceedings of the 40th Anniversary Meeting of the Associ-ation for Computational Linguistics, 2002.

3. A. Iftene, D. Trandabat, M. Moruz, I. Pistol, M. Husarciuc, and D. Cristea,Question answering on english and romanian languages, Multilingual In-formation Access Evaluation, Text Retrieval Experiments (Carol Peters,Giorgio Di Nunzio, Mikko Kurimo, Thomas Mandl, Djamel Mostefa,Anselmo Pe nas, and Giovanna Roda, eds.), Lecture notes in ComputerScience, vol. 1, Springer Berlin / Heidelberg, 2010, pp. 229–236.

4. A. Iftene, D. Trandabat, I. Pistol, M. Moruz, A. Balahur-Dobrescu,D. Cotelea, I. Dornescu, I. Draghici, and D. Cristea, Uaic romanian ques-tion answering system for qa@clef, In Advances in Multilingual and Mul-timodal Information Retrieval, 8th Workshop of the Cross-Language Eval-uation Forum, CLEF 2007, Budapest, Hungary, September 19-21, 2007,Revised Selected Papers, Lecture notes in Computer Science, vol. 5152,2008, pp. 336–343.

5. A. Iftene, D. Trandabat, I. Pistol, M. Moruz, M. Husarciuc, and D. Cristea,Uaic participation at qa@clef2008, Evaluating Systems for Multilingualand Multimodal Information Access, 9th Workshop of the Cross-LanguageEvaluation Forum, CLEF 2008, Aarhus, Denmark, September 17-19, 2008,Revised Selected Papers, Lecture notes in Computer Science, vol. 5706,2009, pp. 448–451.

6. R. Ion, Word sense disambiguation methods applied to english and roma-nian, PhD Thesis, 2007.

7. L. M. Machison, Named entity recognition for romanian (roner), Proceed-ings of the International Conference on Knowledge Engineering, Principlesand Techniques, KEPT2009, 2009, pp. 53–56.

8. Y. Mehdad, V. Scurtu, and E. Stepanov, Italian named entity recognizerparticipation in ner task @ evalita 09, Proceedings of the 11th Conferenceof the Italian Association for Artificial Intelligence, 2009.

9. D. Nadeau, Semi-supervised named entity recognition: Learning to recog-nize 100 entity types with little supervision, PhD Thesis, 2007.

10. D. Nadeau and S. Sekine, A survey of named entity recognition and clas-sification, Linguisticae Investigationes 30 (2007), no. 1, 3–26, Publisher:John Benjamins Publishing Company.

(1) ”Al. I. Cuza” University of Iasi, Faculty of Computer Science, RomaniaE-mail address: [email protected],[email protected]

(2) Intelligentics, Cluj Napoca, RomaniaE-mail address: [email protected],[email protected]

sentimatrix - named entity recognition for romanian language

Documents