translation memory system optimization - diva portal820674/fulltext01.pdf · fish (previously owned...

DEGREE PROJECT, IN , SECOND LEVELCOMPUTER SCIENCE

STOCKHOLM, SWEDEN 2015

Translation Memory SystemOptimization

HOW TO EFFECTIVELY IMPLEMENTTRANSLATION MEMORY SYSTEMOPTIMIZATION

TING-HEY CHAU

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION (CSC)

EXAMENSARBETE VID CSC, KTH

Optimering av översättningsminnessystem Hur man effektivt implementerar en optimering i översättningsminnessystem

Translation Memory System Optimization How to effectively implement translation memory system optimization Chau, Ting-Hey

E-postadress vid KTH: [email protected]

Exjobb i: Datalogi

Handledare: Arnborg, Stefan

Examinator: Arnborg, Stefan

Uppdragsgivare: Excosoft AB

Datum: 2015-06-11

Abstract

Translation of technical manuals is expensive, especially when a largercompany needs to publish manuals for their whole product range inover 20 different languages. When a text segment (i.e. a phrase, sen-tence or paragraph) is manually translated, we would like to reuse thesetranslated segments in future translation tasks. A translated segment isstored with its corresponding source language, often called a languagepair in a Translation Memory System. A language pair in a TranslationMemory represents a Translation Entry also known as a TranslationUnit.

During a translation, when a text segment in a source documentmatches a segment in the Translation Memory, available target lan-guages in a Translation Unit will not require a human translation. Thepreviously translated segment can be inserted into the target document.Such functionality is provided in the single source publishing software,Skribenta developed by Excosoft.

Skribenta requires text segments in source documents to find anexact or a full match in the Translation Memory, in order to apply atranslation to a target language. A full match can only be achieved if asource segment is stored in a standardized form, which requires manualtagging of entities, and often reoccurring words such as model namesand product numbers.

This thesis investigates different ways to improve and optimize aTranslation Memory System. One way was to aid users with the workof manual tagging of entities, by developing Heuristic algorithms toapproach the problem of Named Entity Recognition (NER).

The evaluation results from the developed Heuristic algorithms werecompared with the result from an off the shelf NER tool developed byStanford. The results shows that the developed Heuristic algorithmsare able to achieve a higher F-Measure compare to the Stanford NER,and may be a great initial step to aid Excosofts’ users to improve theirTranslation Memories.

Referat

Optimering av översättningsminnessystem

Hur man effektivt implementerar en optimering i

översättningsminnessystem

Översättning av tekniska manualer är väldigt kostsamt, speciellt närstörre organisationer behöver publicera produktmanualer för hela derasutbud till över 20 olika språk. När en text (t.ex. en fras, mening, stycke)har blivit översatt så vill vi kunna återanvända den översatta texteni framtida översättningsprojekt och dokument. De översatta texternalagras i ett översättningsminne (Translation Memory). Varje text lagrasi sitt källspråk tillsammans med dess översättning på ett annat språk,så kallat målspråk. Dessa utgör då ett språkpar i ett översättningsmin-nessystem (Translation Memory System). Ett språkpar som lagras i ettöversättningsminne utgör en Translation Entry även kallat TranslationUnit.

Om man hittar en matchning när man söker på källspråket efteren given textsträng i översättningsminnet, får man upp översättning-ar på alla möjliga målspråk för den givna textsträngen. Dessa kan isin tur sättas in i måldokumentet. En sådan funktionalitet erbjuds ipubliceringsprogramvaran Skribenta, som har utvecklats av Excosoft.

För att utföra en översättning till ett målspråk kräver Skribenta atttext i källspråket hittar en exakt matchning eller en s.k. full match iöversättningsminnet. En full match kan bara uppnås om en text finnslagrad i standardform. Detta kräver manuell taggning av entiteter ochofta förekommande ord som modellnamn och produktnummer.

I denna uppsats undersöker jag hur man effektivt implementerar enoptimering i ett översättningsminnessystem, bland annat genom att un-derlätta den manuella taggningen av entitier. Detta har gjorts genomolika Heuristiker som angriper problemet med Named Entity Recogni-tion (NER).

Resultat från de utvecklade Heuristikerna har jämförts med resul-tatet från det NER-verktyg som har utvecklats av Stanford. Resultatenvisar att de Heuristiker som jag utvecklat uppnår ett högre F-Measurejämfört med Stanford NER och kan därför vara ett bra inledande stegför att hjälpa Excosofts användare att förbättra deras översättningsmin-nen.

Contents

1 Introduction 1

2 Background 32.1 Excosoft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2 Research question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.3 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.4 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Theory 73.1 Translation Memory System . . . . . . . . . . . . . . . . . . . . . . . 7

3.1.1 How Translation Memory Systems Work . . . . . . . . . . . . 83.1.2 Different Matches . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.2 Translation Memory Optimizations . . . . . . . . . . . . . . . . . . . 103.2.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . 103.2.2 Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.2.3 Translation Memory Database . . . . . . . . . . . . . . . . . 12

3.3 Named Entity Recognition . . . . . . . . . . . . . . . . . . . . . . . . 133.4 Concept of Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 Related Work 154.1 Controlled Language . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.2 Similar Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.3 Regular Expression In Translation Memory . . . . . . . . . . . . . . 164.4 Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

5 Methods 195.1 First step: White-space removal . . . . . . . . . . . . . . . . . . . . . 19

5.1.1 Identifying unnecessary characters . . . . . . . . . . . . . . . 195.1.2 White-space extraction . . . . . . . . . . . . . . . . . . . . . 20

5.2 Identifying Named Entities . . . . . . . . . . . . . . . . . . . . . . . 205.2.1 Intra-Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . 215.2.2 Inter-Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . 225.2.3 Stanford NER . . . . . . . . . . . . . . . . . . . . . . . . . . 22

6 Design and Implementation 256.1 Implementation of Eclipse Plug-in . . . . . . . . . . . . . . . . . . . 256.2 Implementation of quality filter . . . . . . . . . . . . . . . . . . . . . 266.3 Software Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

7 Results & Analysis 297.1 First Step: White-spaces . . . . . . . . . . . . . . . . . . . . . . . . . 29

7.1.1 Multiple White-space and Invisible Separators . . . . . . . . 297.1.2 White-space Extraction . . . . . . . . . . . . . . . . . . . . . 31

7.2 Heuristic NER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337.2.1 Intra-Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . 347.2.2 Inter-Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . . 357.2.3 Combined Inter-Intra-Heuristic . . . . . . . . . . . . . . . . . 36

7.3 Stanford NER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

8 Conclusions 39

Bibliography 43

Appendices 45

A Appendix 47A.1 Excosoft Embedded Tags . . . . . . . . . . . . . . . . . . . . . . . . 47A.2 Quality Assessment Prototype . . . . . . . . . . . . . . . . . . . . . . 48A.3 First Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49A.4 Heuristic NER & Stanford NER . . . . . . . . . . . . . . . . . . . . 49

Chapter 1

Introduction

Hiring professional translators for translation of technical documents is often veryexpensive and time consuming. That is why in the last couple of decades computerscientists and human translators have been working together trying to develop dif-ferent tools and methods to minimize the use of human translation. There are twobranches in this area, where one of them is Computer Aided Translation (CAT)tools, which includes Translation Memory System (TMS)1, while the other branchis Machine Translation (MT), aiming at general translation of text. Both branchesare part of the computer science field Natural Language Processing (NLP). It isimportant to distinguish them both, since the two of them have different goals andpurposes. The purpose of using MT tools is to completely eliminate the use ofhuman translators, while CAT tools were developed for human translators to maketheir work more effective by eliminating repetitive work. As of today the TMSshave become the standardized tool used by all major translation agencies [1].

Companies grow with the help of the Internet and will require more multilingualsupport in order to compete globally, which is why it is important to find a goodtool to manage translations. Technical documents tend to be repetitive, translatorswho use TMSs are able to reduce the cost by 15% to 30% and at the same timeimprove their productivity by 30% or even 50% according to Esselink [2].

So far many people have probably used a translation engine e.g. Yahoo! BabelFish (previously owned by AltaVista), Google Translate, etc., to translate a web-site in a foreign language, and as many might have noticed the results may vary,ranging from acceptable results to ones that are grammatically incorrect, which isusually caused by a word-for-word translation. But even with a grammatically in-correct translation, no one can deny that these translation tools have provided uswith an understanding of the website’s information, rather than not being able tounderstand the information on the website at all.

TMS and MT systems are no doubt great tools for minimizing the work of

1We will throughout the report distinguish the system from its core component, the databasestoring the translations, often called the “memory”. Which is why we will refer to the systems asTMSs, and to the database itself as Translation Memory (TM).

1

CHAPTER 1. INTRODUCTION

human translators [1, 3].

2

Chapter 2

Background

Back in the beginning of 1980, P.J. Arthern of the European Council Secretariat hadthe idea of having a computer application to utilize the computational power of thecomputer by letting it process natural language translations. He soon realized thatit was a very complex task to perform. Instead of a machine translation system, itwould be useful to have a word processing system which could remember if a newtext typed into the system already has been translated, and fetch the translationwhich has previously been done and then show it on the display or have it printedout automatically as he previously reported [4].

Arthern described a solution to what the day-to-day task translators had beendealing with for many decades. Many translators developed different strategiesto deal with this problem; card indexes, cut-and-paste, etc. The solution wouldeliminate the need for translators to translate repetitive texts and documents con-taining commonly used texts. The solution Arthern [4] mentioned became availablefor translators under the generic term Translation Memory (Translation Memory)which is a Computer aided translation tool; also called Computer Assisted Transla-tion (CAT) which includes the following three categories: Translation Memory tools,Terminology tools, Software localization tools. Usually the translation memory andterminology tools are combined in one tool-set for translation of documentation.

It is a known fact that Machine Translation (MT) tools are hard to develop,especially a generalized MT tool capable of translating many different types of doc-uments. Using MT systems for wrong types of documents will make a costly, ineffi-cient and time-consuming process according to Esselink [2]. Although MT systemsrequire a much larger initial investment compared to TMSs, it is recommended toadapt the MT system for the indented type of source documents in order to achievea return of investment. This can be done by identifying often used terminology andby using controlled language (CL) which minimizes the number of ambiguities insource documents, see Section 4.1 on page 15.

Many companies working with translation use some kind of TMS. Each TMShas its own additional features, but all TMSs have one thing in common, the qual-ity of a TMS is affected by their users. Different technical writers have different

3

CHAPTER 2. BACKGROUND

styles of writing. This affects the matching quality of a TMS; making it harder tore-use previously translated segments due to the tendency of different writers to usedifferent texts for the same meaning. A TMS is usually a system combined by threebasic functions; Translator module, Editor module and a Database.

A list of the different modules in a TMS

Translator module: This module processes the documents that are sent forpublication, where all formatting information is removed. Each and every segmentin source language is replaced with a target language segment in a new document.If no such translation exists, the TU is sent for human translation, when this hasbeen done the translated language segment is inserted into the TU and is then ableto replace the given segment. (The comparisons are done once all source segmentsare generalized, i.e., put on a standardized form permitting more exact matches tobe made)

Editor module: This module displays documents in the source language andsometimes also in a target language, allowing human translators to view both sourcelanguage and target language if such translation exists in the TM. This module caneither be an add-on tool for an existing word processor e.g. Microsoft Word or aseparate word processor, as the one shipped with Skribenta 4, named XML Editor.

Database: This module is the core of a TMS, it is simply the database thatcontains all the TUs and is often referred to as Translation Memory (TM). It man-ages all load and save operations.

See Example 2.1 for a provided example with two basic TUs with five differentlanguage segments.

2.1 Excosoft

The case study for this project was done for the company Excosoft AB. Excosoft isa Swedish company founded in 1986; they provide a single source publishing soft-ware. The software features a translation memory which offers writers potentialre-use of previously translated documents and version control. The software de-veloped by Excosoft goes by the name Skribenta and is used to manage technicaldocumentation, were there are often repetitions.

2.2 Research question

With the knowledge and basic understanding of how a TMS operates, experiencedtechnical writers are able to accomplish some TM improvements on their own, andmaximize the potential future re-use of existing TUs. Such knowledge minimizes

4

2.3. OBJECTIVE

TU Id ar fr-ca da en sv

198220 Select Su-per Rinseby pressingthe buttonunder thesymbol.

Sélectionnerle « Superrinçage »à l’aide dela touchesituée sousle symbole.

Vælg Su-perskyl vedat trykkepå knap-pen undersymbolet

Select SuperRinse bypressingthe buttonunder thesymbol.

Välj Super-sköljninggenom atttrycka påknappenunder sym-bolen.

198591 Move theplasticplugs onthe upperand loweredges ofthe door.Use a flatscrewdriverto removethe plugs.

Déplacer lesbouchonsen plastiquesitués surles bordssupérieuret inférieurde la porte.Utiliser untournevisà lameplate pourretirer lesbouchons.

Byt om påde plast-propper,der sidderpå lugensover- ogunderkant.Brug en fladskruetrækkertil at løsnepropperne.

Move theplastic plugson the upperand loweredges of thedoor. Use aflat-bladedscrewdriverto removethe plugs.

Byt platspå de plas-tpluggarsom finnspå luck-ans över-respektiveunderkant.Använd enflat skru-vmejsel föratt lossapluggarna.

Example 2.1. Basic TU examples with five different languages segments

the risk of creating similar and unnecessary TUs with the same semantic meaning.This has led us to the following research question:

How to effectively implement translation memory system optimization

2.3 Objective

The purpose of this master thesis is to research how to improve and optimize aTMS. Our objective is to identify different ways to improve pattern matching in thetranslation phase. Algorithms will be developed in order to identify TUs, which canbe generalized in different ways. Users should be allowed to access the algorithmsthrough a graphical user interface, and choose an appropriate correction for selectedTUs, that match a chosen algorithm. The graphical user interface will be referredto as the quality assessment prototype (see Chapter 6 on page 25).

5

CHAPTER 2. BACKGROUND

2.4 Delimitations

Many difficulties need to be solved in order to create a fully optimized TMS, whichis why some limitations are needed. The focus of this project was to find ways tooptimize a TM and its existing TUs.

To simplify the optimization task, some assumptions were made:

• Generalization rules follow the same structure as the algorithms in the Skribentasystem.

• Changes are only made on existing TUs and will therefore not affect existingdocuments.

• Developed algorithms should work for all source and target languages based onthe Latin alphabet1. The reason for why we need an algorithm to work withall target language text segments is because otherwise we would not be ableto go the other way around; translating a document from a target languageto the previous source language should work at all times.

The coverage of this study was limited to developing a quality assessment toolfor Excosofts software named Skribenta 4 by the time of implementation. Theexisting translation memory infrastructure was not within the scope of the project;it was provided by Excosoft AB.

2.5 Limitations

Limitations encountered during this project were that some algorithms only work forlanguages based on the Latin alphabet, one of them are date and time identification.In order to create an algorithm which will be able to identify and tag date and timein all possible languages, some knowledge is required in all languages. Unfortunatelythat is not possible, so in order to develop such generalized algorithms, one wouldbe required to hire linguists which therefore would be too time consuming for thisproject.

Limitation in the evaluation process which is caused by optimizations done toonly the TM. We will not be able to measure the pattern matching improvementsdone by the different algorithms, as that would require implementing improvementsin the other two parts of the TMS as well.

1http://en.wikipedia.org/wiki/Latin_alphabets

6

Chapter 3

Theory

The aim of this Chapter is to introduce the theory behind TMS, Named EntityRecognition and the underlying theory used in the developed algorithms.

3.1 Translation Memory System

The term Translation Memory System (TMS) is usually referred to a software toolthat contains a database of translation texts. A source language text segmentassociated with one or more target translation segments is called a language pair. Alanguage pair stored within a TMS is called a Translation Unit (TU) and sometimestranslation entry. TMS are solely developed to assist human translators in theirdaily translation work.

TMS are very popular and often used by companies with the need of publishingnew manuals and technical documents, with a short life cycle, to many differentlanguages. When a new product is developed or new features have been addedto an existing computer application, a new manual is required. The use of TMSsenables companies to re-use translations from previous versions, reducing the timeto market and the amount of human translation.

Esselink [2] states that the best results with TMSs may be achieved when thesource documents are created in a structured way, by avoiding wordiness, ambigui-ties and synonyms (see Section 4.1 on page 15). For those who consider using TMSsplease refer to the report written by Webb [5].

Esselink [2] suggest that the use of TMSs in translation projects may reducethe total translation cost by 15% to 30%, while O’Brien [6] suggest a productivityincrease between 10% to 70% depending on the stored content in the TM. Althoughit is a very broad range, it is hard to calculate a generalized result, which is notdepending on the type of the translated texts. But according to Somers [7], on theother hand, a 60% productivity increase may be possible, while a more reasonableaverage productivity gain around 30% may be expected when the software is used.

A survey done by Lagoudaki [1] in 2006, shows that 82.5% of all professionaltranslators who responded to the survey use TMSs, while 17.5% did not use any

7

CHAPTER 3. THEORY

TMSs at all, the survey showed that many translators were able to save time (86%),improve the terminology and consistency of translations (83%), improve the qualityof the translation output (70%), achieve cost savings (34%), while 31% thoughtTMSs were the best way to exchange resources, such as glossaries and TMs.

3.1.1 How Translation Memory Systems Work

TMSs work at sentence level, meaning the source documents are broken down intosmaller components such as sentences or segments. The term segment is often usedbecause in some cases, a chunk of text may not be a complete sentence, in the caseof headings or lists. A segment is the smallest unit of text that may be reused whenworking with TMSs.

It is important to remember that smaller units of text, such as an individualword are not used, since they may occur in different context and would thereforerequire a context-dependent translation. That is why word-for-word translationsnormally do not produce usable results, as such translations are too literal, whichis well described by Arnold [8].

When a new document in a source language is saved, the document is sent tothe translation module. The document is separated into segments and processed bya set of generalization rules, and then compared with the TM. If a segment alreadyexists in the TM it will not be added, otherwise a new TU is created in the TM.Available target language segments in a newly created TU, are empty until it is sentfor translation. When the translator module finds an exact or a full match in theTM, all previously translated target languages in the given TU will be available fortranslation. Please refer to subsection 3.1.2 on the next page for a short descriptionof different matches used in a TM.

In theory a TMS used for a long period of time will become better and betteras more TUs with translated target languages populates the TM, and after a whilereach a point were practically all future documents will find perfect matches in theTMS and not require any human translation.

But we know that this is not true, this has to do with the complexity of writtenlanguage, ambiguities, wordiness and language innovation is hard to deal with, seewhy in Section 4.1 on page 15.

If previous translation exists and no TMS has been used previously, the trans-lated documents need to be aligned, such an operation is called a translation align-ment1. Translation alignment will match source language segments to target lan-guage segments; this will create a new language pair between source and targetlanguage on aligned segments, which can later on be used in a TMS. According toEsselink [2] it is not uncommon that manual alignment is required, it all dependson how the previous documents were produced.

1Parallel text - http://en.wikipedia.org/wiki/Parallel_text

8

3.1. TRANSLATION MEMORY SYSTEM

3.1.2 Different Matches

Esselink [2] states that there are four different types of matches which can be foundwhen a newly written document is being compared with a TM; repetitions; fullmatches; fuzzy matches and no matches.

Repetition, also referred to as internal matches, means that multiple occurrenceof the same segment exists in a document.

Full match sometimes also referred to as an exact match or perfect match, mean-ing that no character in a segment can differ from an existing TU, not even a white-space or punctuation. However according to Browker [9], on the other hand, a fullmatch means that matching segments only differ in terms of variables or other pre-tagged entities. The full match definition used by Browker will be used in the restof the report.

Fuzzy match means that one or more characters can be different, the differenceis often computed by a string edit distance see Section 4.2 on page 15 for moredetailed description.

A list of different matches within a TM

Exact match: An exact match, also called perfect match, is found when a segmentin the new document is exactly 100% the same as a segment stored in the TM. Nocharacter can differ, this definition is well described by Browker [9].

Full match: A full match is a match where a segment differs from a stored seg-ment in the TM only by pre-tagged terms, which can be variable elements, namedentities, numbers, dates, times, currencies, measurements, and sometimes propernames, this is well documented by Browker [9].

Fuzzy match: With a match between 75-94%, already translated segments maylargely be re-used by translators [10, 11]. Translators may edit the suggested textand adapt it to the new content. Which is why many translation agencies usuallycharge less for fuzzy matches [12, 13, 14], the rates vary depending on the level ofthe fuzzy match.

Repetitions: The same segment occurs several times in a document. This seg-ment only needs to be translated once. When the segment is repeated, the TMS isable to automatically supply a translation.

No match:. No match is found or a match lower than 75% is found. If an MTsystem is used, the result must still be checked and usually edited by a human.

9

CHAPTER 3. THEORY

3.2 Translation Memory Optimizations

In order to fully take advantage of a TMS it is necessary to make sure that the TUsare kept as consistent as possible. It is therefore necessary to regularly reorganizeTUs with similar or identical content. According to Iverson [15] time invested inrewriting sections and removing unnecessary words not relevant to a product in adocument, will yield great results in TMSs. He also recommends rewritten sectionsto be compared with other similar documents, to ensure consistency between thedocuments. What he is referring to is often called Controlled Language (CL),see Section 4.1 on page 15.

Many companies who offer TMSs have realized that a paragraph-based segmen-tation is preferred before a sentence-based segmentation. Keeping shorter segmentsalong with the target language in a TMS will result in a higher number of both fulland exact matches. However the downside of using sentence-based segmentation, isthat many similar TUs are stored in the TM. Some sentences can be interpreted inmany different ways, which depends on the context where the sentence is used. Suchsentences are often context dependent, which is why matches provided by a sentence-based TMS needs to be verified by a translator. While Esselink [2] states that fewerexact matches are found by a paragraph-based TMS compared to a sentence-basedTMS. But when an exact match is found in a paragraph-based TMS, no additionalreview or proofreading is usually required; the translated paragraph in the targetlanguage may easily be re-used.

Iverson [15] also mentions the important tradeoff between the length of the TUsand storing many similar TUs. The chance of finding an exact match gets loweras the text segments in the TUs get longer. While short segments increase theprobability of creating ambiguities (e.g. multiple or conflicting matches), whichresults in a lower translation quality of the intended translation.

In short the benefits of using TMSs are to reduce translation cost, improveturnaround times (require less time to produce), and increase translation consis-tency.

3.2.1 Generalization

Newly written or modified segments in the editor module are passed on to the trans-lator module. The translator module will process the segments by going throughall existing generalization rules in a TMS, this process is called generalization. Ageneralization makes segments match if they differ only in named entities, dates,etc. Generalized segments are compared with existing TUs in a TM, with a goal offinding a full match. No human translation is necessary if an exact or full matchcan be found. In case no full match is found, a new TU is created. The TU canbe used as soon as the segment has been translated into target language(s). Fu-ture segments that match the previous stored segment will then no longer requirea human translation. Entities such as dates, names, serial number etc. should begeneralized with different XML tags, see Example 3.1 for a desired tagging.

10

3.2. TRANSLATION MEMORY OPTIMIZATIONS

English: MPU Flasher version 1.04

Should yield the following result:English: MPU Flasher version <value>1.04</value>

TU: MPU Flasher version <value/>

Example 3.1. A desired segment generalization

During comparison with previous segments, the content within tagged entitiesare ignored; in this case “1.04”.

In cases where a user misses out to tag segments with certain entities, the non-generalized segments will not be able to find a match, creating new very similarTUs, see Example 3.2 on the current page where similar TUs have been created.

TU ID en574062 Adjust the back check (illustration 3).574066 Adjust the back check (illustration 4).574070 Adjust the back check (illustration 6).574074 Adjust the back check (illustration 8).

Example 3.2. Basic example of a non-generalized TM

A new segment only differentiating by the numerical parameter value will find afull match. But this requires that the segment compared is generalized and containsa <value/> tag; in this case the tag is a version number. Cases where writers havemissed to select the numerical value will unfortunately not find a match. Thatis why it is interesting to explore a way to automatically identify and tag thesenamed entities, as this is an often reoccurring problem. This work is a well-knownfield within computer science, Named Entity Recognition (NER), see Section 3.3 onpage 13.

For a desired generalization, see Example 3.3.

Input: “MPU Flasher version 1.05”Required input: “MPU Flasher version <value/>”TU: “MPU Flasher version <value/>”

Example 3.3. Desired matching procedure.

Unless the numerical version number is identified and pre-tagged before thecomparison is done, the new segment “MPU Flasher version 1.05” will not finda match, and a new TU is created instead, which creates an unnecessary TU withsimilar content.

11

CHAPTER 3. THEORY

3.2.2 Translation

Documents sent for publication are passed on to the Translator module; this processis called a Translation. In case a TU is missing in a target language translation,that TU is sent for translation either by in-house translator or a translation agency.When the translated target language segments are inserted back into the TM, wecan publish the given document in the required target languages.

Every segment sent for publication needs to be compared with all TUs in theTM. The comparison assures that there does not exist a previous translation in theTM. It is hard to find exact matches in a TM, if new segments are not generalizedbefore they are stored in the TM. Given segment “MPU Flasher version 1.06”,no exact match can be found. A fuzzy match can be found instead, depending onhow calculations are done, the difference between the two similar segments mayvary. A character-by-character comparison using the Levensthein algorithm onlyrequires one substitution operation. The level of the fuzzy match can be calculatedin the following way (20+3)/(20+4)=0,958. A 96% fuzzy match means that thesesegments are very similar.

The best way to show fuzzy matches would probably be to notify the user howsimilar the TU is to an already existing TU in the TMS (ranging between 75-94%),if the user thinks the new segment can be replaced by an already existing one, andtherefore will not require a human translation for the given segment.

3.2.3 Translation Memory Database

This case study focuses on optimizations in the TM, by tagging and generalizingdifferent parts of a TUs text segment, we will be able to reduce the number ofsimilar TUs in the TM. In Example 3.4, a writer has missed to tag a text segment.

”040209 11:43:44””040209 11:43:18””040209 11:43:19”

Example 3.4. Similar and unnecessary TUs in the TM.

Instead of storing these very similar text segments as separate TUs, it would bepreferred to replace all of the above TU with one generalized TU were the date andtime are replaced with parameters. This would enable future generalized TUs tofind a match and be translated automatically. The date “040209” should be replacedwith <date/> and the time were the seconds digits differs, should be replaced with<time/>

Unfortunately tagging date and time in source language TUs in TM, will notyield a desired result. A problem likely to occur due to a cross-translation, isone or more parameters missing in either one of the segments. This problem can beavoided by generalizing all target language segments in a TU and not only its sourcelanguage. Applied changes in a segments source language should also be applied

12

3.3. NAMED ENTITY RECOGNITION

in target language segments. This will allow a proper cross-translation between alllanguages and not only from a source language to its target languages. E.g. if thefirst document’s source language was English, and the target language would beSwedish. A future document written in the source language Swedish translated tothe target language English, should be able to re-use previously translated TUs.

3.3 Named Entity Recognition

Named Entity Recognition (NER) is a task when documents, paragraphs or sen-tences are broken down into tokens, where each token is evaluated and classifiedinto predefined categories, such as locations, names of persons, organizations, quan-tities, monetary values, percentages, etc. To illustrate the entity classification, abasic example containing different entities is provided below:

“Google launched their cloud service Google Drive in April 24, 2012, which of-fers online storage and backup. Allowing users to store 15 GB free of charge, while1 TB costs US$9.99 per month.”

We would like to divide the example into five different entities: “organization”,“name”, “date”, “unit” and “currency”, the generated output from a NER systemcould look like this:

“<ORGANIZATION>Google</ORGANIZATION> launched their cloudservice <NAME>Google Drive</NAME> in <DATE>April 24, 2012</DATE>,which offers online storage and backup. Allowing users to store <UNIT>15GB</UNIT> free of charge, while <UNIT>1 TB</UNIT> costs<CURRENCY>US$9.99</CURRENCY> per month.”

The output from NER systems are generally not meant to be read by humans,they are often used in information extraction and text categorization.

There are three different methods for NER systems to learn how to identifynamed entities, which are supervised learning, semi-supervised learning and unsu-pervised learning. The different methods are well described in a recent report [16].

3.4 Concept of Evaluation

Evaluation of NER systems are usually done with three well-known metrics preci-sion, recall and F-Score, initially developed for Information Retrieval2. To illustratehow the different metrics are calculated, we may use the problem “information re-trieval”.

2http://en.wikipedia.org/wiki/Information_retrieval

13

CHAPTER 3. THEORY

Relevant Non-relevant

Retrieved true positives (TP) false positives (FP)Non retrieved false negatives (FN) true negatives (TN)

Example 3.5. Different evaluation outcome classes

Precision (P) is the fraction of the retrieved documents that are relevant.

Precision =|{Relevant documents} ∩ {Retrieved documents}|

|{Retrieved documents}|

Recall (R) is the fraction of relevant documents that are retrieved

Recall =|{Relevant documents} ∩ {Retrieved documents}|

|{Relevant documents}|

During evaluation each document is divided into four different classes: false posi-tive, false negative, true positive and true negative, see Example 3.5. A true positivematch occurs when a word is correctly identified as an entity. A true negative matchis a word which is correctly ignored, not an entity. A false positive match occurswhen a word is incorrectly identified. A false negative match occurs when a wordis incorrectly ignored.

F-Measure (F1) is the weighted harmonic mean of precision and recall, the tradi-tional F-measure or sometimes called balanced F-score.

F1-measure = 2 ×Precision × Recall

Precision + Recall

The balanced F-Measure above weights precision and recall equally. Which is whya different formula may be used to weight the importance of the two metrics.

Fβ-measure = (1 + β2) ×Precision × Recall

(β2 × Precision) + Recall

A value of β > 1 assigns a higher importance to recall. Assigning β = 2, F2 weightsrecall twice as much as precision, while F0.5 weights precision twice as much asrecall.

14

Chapter 4

Related Work

The aim of this Chapter is to examine related work in generalizing TM. It willdescribe different ways proven to be useful with TMSs.

4.1 Controlled Language

Written language in technical documents should be as clear and concise as possibleto avoid misinterpretations. That is why companies who regularly produce technicaldocuments have restricted their writers’ grammar, vocabulary, style and semantics,these restrictions are often called a Controlled Language (CL). The goal with CL isto get rid of unclear writing, such as ambiguous words, complex grammar, incom-plete sentences and vernacular. This allows writers to write sentences that are lesslikely to be misinterpreted [17], which makes it easier for a translator to make amore consistent translation that is easier to understand [18].

Previous research [19] shows that CL may be used to improve translation quality.Mitamura [18] observed that 95.6% of all sentences could be assigned with a singlemeaning representation, while [20] found that around 33% of duplicate sentencescould be removed.

With these kinds of improvements you might wonder why CL is not used by alltechnical writers and translators? That is because it is difficult and time consumingto learn a CL [21].

4.2 Similar Segments

Similar segment matching for TM is often referred to the generic term fuzzy match.Similar segments can be identified by using an algorithm that solves the well-knownproblem Minimum Edit Distance. The minimum edit distance between two stringscalculates the minimum number of edit operations (usually insertion, deletion andsubstitution of single characters) required to transform one string into another [22].

Given the words computer and commuter with the same character length in Ex-ample 4.1, one operation is required to transform the word computer to commuter,

15

CHAPTER 4. RELATED WORK

by substituting the letter “P” to the letter “M”. If we were to assign a particular costor weight to the different edit operations, we will have the Levenshtein distancebetween two sequences. Giving each of the three operators a cost of 1 (assuming asubstitution of a letter for itself, has zero cost), the Levenshtein distance betweencomputer and commuter is 1.

C O M P U T E RC O M M U T E R

S

Example 4.1. Minimum edit distance operation.

This method is often used in spell-checkers to identify possible corrections formiss-spelled words. Words with a low minimum edit distance are often presentedas possible corrections. While many different string distance algorithms exist, Lev-ensthein [23] was the first to report this method.

Companies providing a TMS without a fuzzy matcher should consider imple-menting a fuzzy matcher, to provide translators the ability to view similar text seg-ments from previous translations. Improving the ability to generate more consistenttranslations, especially if there is a possibility to change the similarity threshold ofthe fuzzy matcher [24], giving users the ability to find the best balance betweenprecision and recall. Previous research [25] shows that fuzzy matches over 70% maybe of use.

4.3 Regular Expression In Translation Memory

Previous research [26] supports that the full matches in TMSs may be improvedwith the use of regular expressions. Each rule which uses regular expression needs toconsist of three different search patterns; input segment; TU containing both sourcelanguage and desired target language; additional regular expression to replace theparts that do not match with the source language in the TU as well as the targetlanguage segment in the TU, and translate it to the target language. The term forthe last regular expression is called a transfer rule, since no translation is neededand the inputs are transferred to a different form.

Jassem and Gintrowicz [26] were able to developed transfer rules that allowedautomatic translation for these specific entities: various format of date and time;currency expressions; metric expressions; numbers and e-mail addresses.

4.4 Machine Translation

Today there are a lot of companies that have developed usable software that com-bines the benefit of TM with the advantages of MT.

In order to translate one language into another, one needs to understand thegrammar of both languages, including morphology (the grammar of word forms) and

16

4.4. MACHINE TRANSLATION

syntax (the grammar of sentence structure). But in order to understand syntax,one also had to understand the semantics and the lexicon (or “vocabulary”), andeven to understand something of the pragmatics of language use.

The requirement of understanding the grammar of both languages makes theprocess of developing a MT system more complex compared to a TMS. This is oneof the main reason why CLs were developed, see Section 4.1 Controlled Language onpage 15. According to [22] some impressive results have been achieved when com-biding CL with MT systems.

A wide variety of different MT systems exist today. The most commonlyused ones are; Rule-Based Machine Translation (RBMT), Example-Based MachineTranslation (EBMT) and a Statistical Machine Translation (SMT). Previous re-search by Biçici and Dymetman [27] show improved NIST1 and BLUE2 score froma system combined with a phrase-based SMT trained on the same domain as anexisting TM, compared to the stand-alone SMT and the TMS.

One of the better known translation systems, Google translate, uses an extremelylarge multilingual corpus.

1NIST is a method to evaluate the text quality translated by a MT system.2BLUE is a method used to evaluate the quality of text which has been translated by a MT.

This metric is reported to have a high correlation with human judgments of quality.

17

Chapter 5

Methods

The aim of this Chapter is to introduce the methods used in the algorithms devel-oped, with the goal of achieving a more generalized TM system.

5.1 First step: White-space removal

Since no fuzzy matcher exists in Excosofts TMS, the fist step was to identify andremove unnecessary characters in existing TUs. The company’s previous attemptto reduce unnecessary characters was to add a “Smart Space” catcher in their editormodule, alerting a user if more than two white-spaces were typed. The problemswith this feature are that it can be disabled and that previous TUs were not cor-rected in the TM. It was therefore a very crucial step for the company to identifyunnecessary characters, especially since they had embedded XML code in their TUs.Removing these unnecessary characters could achieve a more generalized TM, whichcould result in fewer similar TUs.

5.1.1 Identifying unnecessary characters

The most common unnecessary characters used were different types of white-spaces.The content of a segment stored in a TU is not affected, if one or several misplacedwhite-spaces are moved from one XML tag to a parent XML tag, see Example 5.1on the following page, where white-space within the tag should be moved toits parent tag, which in this case is the tag.

Performing such corrections can be done in two ways; string manipulation withregular expressions and DOM parsers1. DOM parsers are often used to read XMLdocuments and validate the node-tree. They also feature data extraction, easingthe task of XML manipulations. Qureshi [28] states that DOM parsers are slow andthe complexity of parsing an XML document depends on the following factors; theheight of the tree; total number of elements; total number of distinct elements andthe size of the XML document.

1A DOM parser is a standard way to process and read XML documents.

19

CHAPTER 5. METHODS

With that in mind, the preferred method chosen in this case was string manip-ulations with a DOM-like approach, identifying all opening, closing and self-closingtags.

Operation String

Input White-spaces_and_HTML

Generalized output White-spaces_and_HTML

Example 5.1. Moving white-space to achieve a generalized segment, will not affecta segments content.

5.1.2 White-space extraction

Changes applied to the input string will change the string length. When we findanother misplaced white-space, we need to know where previous changes have oc-curred, in order to insert a white-space at the accurate position.

That is why I implemented an integer array with the same size like the length ofthe input string; this array holds the information of the given index-position of theinput string, which from now will be called the offset. If a character at a specificindex has been removed, the status of that specific index will be marked as removedand the offset of the following characters on the right of the index will be subtractedwith one, since the input string length now contains one less character, the white-space character. An algorithm will then proceed to check if the parent XML tagto the right or left of the index, depending on if we are currently checking if therecontains a white-space in the beginning of a XML tag or in the back of a XML tag.If there exists a white-space already or any other text, no white-space is required tobe inserted again. Otherwise the currently checked parent might be “empty”. Byempty, we mean that the XML tag do not contain any characters within the XMLtag or a white-space.

The best approach is to check for multiple white-spaces before a white-spaceextraction is executed. Eliminating the requirement of multiple white-space filteringafter a white-space extraction is done. Processing unwanted white-spaces in thefollowing will eliminate cases were the reverse order would miss out, space extractionand multiple white-space removal. See Example 5.2, where some misplaced white-spaces are not removed and requires another iteration of white-space extraction.

Excosoft provides version control in their TMS, which means that each TU cancontain different text segment versions. The latest version of each TU was usedduring the evaluation of the algorithms.

5.2 Identifying Named Entities

The second step was to identify entities in existing TUs and tag entities with relevanttags. This procedure is currently done by the users. Users who have done the

20

5.2. IDENTIFYING NAMED ENTITIES

Operation String Status

Input ___Loop

Output __Loop One white-space is re-moved.

Input (Multiplewhite-space)

__Loop

Output(Multiplewhite-space)

_Loop Multiple white-spacesare replaced by one.

Action White-space is extracted outside of the current XML-tag, causing a un-allowed white-space case. This forcesus to re-run the algorithm.

Example 5.2. Improper white-space elimination order.

most work tagging entities in a TM, will achieve the highest re-use, allowing futuredocuments to finding more full matches.

During the development of the first step, I noticed that many entities oftenoccurred in almost every language text segment within a TU. Which led me tobelieve that an entity probably occurs in the same form in every language textsegment, e.g. “D7000“, a camera model manufactured by Nikon is probably spelledthe same way in all languages.

To investigate if the assumption was accurate, two approaches to identifyingentities within different TMs were examined. One approach involved an off theshelf NER tagger developed by Stanford2, while the other approach involved animplementation of heuristics based on different assumptions.

5.2.1 Intra-Heuristic

My initial approach was to divide the source language text segment into words(often called tokenization3) and compare each word with all words available in alllanguages within a TU, which is why this heuristic from here on will be referred toas the Intra-Heuristic. In case a word occurs in all available language text segments,that specific word has a high probability of being an entity.

Troubles I encountered with this approach were entities right next to characterswithout any contextual meaning in the TUs was limiting the recall of the entityrecognition. In order to increase recall, I had to find a way to improve the tok-enization. See Example 5.3 for a TU where undesired characters were causing lowerprecision for the Intra-Heuristic, due to the false positive words “(1” and “=”, thatoccurred in all language segments. To avoid these false positives, an exclusion filter

2Stanford CoreNLP http://nlp.stanford.edu/software/corenlp.shtml3Tokenizer - A tokenizer divided text into a sequence of tokens, which roughly correspond to

"words". http://nlp.stanford.edu/software/tokenizer.shtml

21

CHAPTER 5. METHODS

was added, excluding the following characters “.([,=:)]” at the beginning or atthe end of each word, before every word comparison.

See the improved result in Tables 7.5 and 7.6 on page 34 and on page 35.

TU Id en sv no ge

147612(1 = Lightson)

(1 = Ljusetär på)

(1 = Lys på)(1 =Beleuchtungein)

Example 5.3. TU with undesired characters, causing undesired false positivematches

5.2.2 Inter-Heuristic

Another idea came to mind during the evaluations of the Intra-Heuristic. Given aTM without a fully optimized generalization engine, it is very likely that over timesimilar non-generalized TUs are created. These similar TUs may differ in one singleword, which might be a potential entity that also needs to be tagged, in order togeneralize the TU.

A possible way to identify these entities may be to compare all source languagewords in a TU with all existing TUs source language text segments with the sameword length. The word length can be calculated by tokenizing a text segment into“words“.

See previously mentioned Example 3.2 on page 11, where the Inter-Heuristicalgorithm is able to list the digits [3,4,6 and 8] as potential entities, that could betagged. In the given example, the entities could also be matched with a simpleregular expression, matching digits only. While the Intra-Heuristic is also able toidentify other kinds of words, see Example 5.4 on this page.

TU ID en574212 ABB has a huge database.574219 Casco has a huge database.574226 Raysearch has a huge database.

Example 5.4. Basic example of a non-generalized TM

See the result in Table 7.7 on page 35.

5.2.3 Stanford NER

To compare the results achieved with the different heuristics, the same datasetswere evaluated with Stanford NER. Standford provides six different models whichare trained on a mixture of different domains such as: ACE23, MUC-6, MUC-7,CoNLL, Wikiner, Ontonotes and English Extra. The provided models have the

22

5.2. IDENTIFYING NAMED ENTITIES

ability to identify different entity classes4 listed below.

3 class: Location, Person, Organization4 class: Location, Person, Organization, Misc7 class: Time, Location, Organization, Person, Money, Percent, Date

Using an English caseless 3 class model, I was able to achieve a high precisionwith a less desired recall. To see if more entities could be identified, the followingStanford NER models were used to evaluate the datasets: english.all.3class.caseless,english.all.3class.distsim, english.nowiki.3class.caseless, english.conll.4class.caseless,english.muc.7class.caseless and english.muc.7class.distsim. See results in Table 7.9.

4Stanford NER - http://nlp.stanford.edu/software/CRF-NER.shtml

23

Chapter 6

Design and Implementation

This chapter describes implementation of the prototype developed for the company.The goal of the prototype was to offer a quality assessment tool for the existingapplication developed by the company. Two desired requirements for the prototypewere support for loading quality filters in runtime and ease of creating new filters.This was solved by developing the prototype as an Eclipse plug-in, extending apredefined extension point.

6.1 Implementation of Eclipse Plug-in

Eclipse applications are based on a core called the workbench. The workbench canbe extended with a set of views, perspectives, menu contributions, key bindings,etc. through an extension point. These extensions can either be part of the mainproject or developed as separate Eclipse plug-in projects. Eclipse Plug-in projectscan provide contributions (extensions) to pre-defined extension points. During im-plementation phase the company was migrating their application to Eclipse 4. Thiswas one of the main reasons I choose to develop the prototype as an Eclipse plug-in. The Eclipse workbench manages the attached extensions. An extension pointin the main application was created to enable extensions. Extended functionalityis displayed depending on whether any extensions are attached to the predefinedextension point. I chose to separate the prototype with the main application toease the company’s future migration to Eclipse 4. Existing functions of the graphi-cal user interface were refactored; table for displaying matching TUs; search filter;language filter; project filter and status filter.

The Glazed List library [ca.odell.glazedlists]1 was used to feature filter-ing of TUs from a source EventList. All TUs in an existing TM were loaded intothe EventList. The library provides a thread safe EventList which can be usedwithout calling Java lock() and unlock() synchronization methods.

A FilterList is generated depending on the filters selected. A MatcherEditor

is used to enable dynamic filtering of the elements in the table. When a filter is

1http://www.glazedlists.com/

25

CHAPTER 6. DESIGN AND IMPLEMENTATION

changed the MatcherEditor fires an event, which creates new Matcher instances.The Matcher instances are immutable, this guarantees the FilterList to call amatcher() method without synchronization, see Section 6.2 on this page for moreinformation about the matcher() method.

An extension point is defined in the plug-in prototype, allowing other plug-insto add functionality by providing an extension. An extension defines a quality filter,which contains two methods: matcher() and correct(). The method matcher()

matches against refactored filters and a predefined pattern provided in the extension.

ExtensionExtension point

Figure 6.1. Many extensions can be attached to the same extension-point.

Attached extension are listed in a drop-down list displaying available qualityfilters. The table is updated when a quality filter is selected. TUs displayed are theones who match the selected filters and the quality filter.

6.2 Implementation of quality filter

A quality filter consist of a matching method named matcher() and a correctionmethod named correct(). The matching filter matches TUs against a set of pre-defined patterns, if a match is found, the result is then displayed in a table.

A user has the ability to select one or several TU(s) which match the qualityfilter. Selected TU(s) are sent to the correction method were appropriate changesare applied.

The matching filter extends the boolean matcher() method from the Glazed List

library. The method takes one input parameter and returns true if the item matchesa filter. The input data type is a TU.

The number of times method matcher() and correct() needs to be executedis reduced by passing a whole TU instead of each language text segment separately.

Regular expression was chosen instead of FSA, as most pattern-matching isusually done using regular expressions. Pattern matching with FSA requires adictionary over all possible words. A TM usually contains many thousand TUs,where each TU can contain several sentences (a long segment). If we were to useFSA in this case, the dictionary would grow very quickly and require a lot of space.Hence, the FSA is not suitable for this purpose.

With that in mind, the built in Java library package [java.util.regex] wasused for the regular expression pattern matching.

26

6.3. SOFTWARE USED

6.3 Software Used

The prototype was developed in the programming language Java, using the IDEtool Eclipse RCP Development.

The TMs were stored in a MySQL database. The databases were accessedthrough Apache Tomcat (or simply Tomcat), which is a Java Servlet container.The container provides a web server environment for Java code.

27

Chapter 7

Results & Analysis

The aim of this chapter is to show the results achieved during the evaluation ofthe different methods introduced in Chapter 5. Metrics used in the evaluationare precision (P), recall (R) and F-Measure (F1), which are previously describedin Section 3.4 on page 13. At the end of each subsection, there is a discussion aboutthe results, and how they may be interpreted.

7.1 First Step: White-spaces

The datasets used in the first step consisted of three different client TM; AssaAbloy, Casco and Raysearch. The companies provide different kinds of products;Assa develops locks and provides security systems for doors, Casco produces awide range of different adhesives for DIY-users and professional constructors, whileRaysearch provides radiation machines used to treat cancer patients. See Table 7.1on the following page for a more detailed description of the datasets used in theevaluation. The characters from the embedded XML code were excluded whenthe results were gathered in this evaluation, the characters do not provide anycontextual content. Data shows that the average occurrence of space separators inthe different TMs is between 13-16%. The evaluation in the first step will focus onthe total number of space separators that we were able to identify and correct.

7.1.1 Multiple White-space and Invisible Separators

The algorithm multiple white-space separators used the regular expression \s{2,} toidentify multiple white-space characters. The test results can be found in Table 7.2on the next page. In order to save space, the regular expressions were shortened inthe tables. Repetitions can be matched with the following regular expression syntax{2,}.

The algorithm multiple separators used the regular expression \p{Z}{2,}, toidentify multiple Unicode separators1, test results identified by the algorithm may

1UNICODE CHARACTER DATABASE - http://unicode.org/reports/tr44/

29

CHAPTER 7. RESULTS & ANALYSIS

Test DatasetDatabase Assa Casco Raysearch

TUs 6 207 4 883 6 827Language texts 82 539 34 590 19 186Languages 14 14 8Characters 4 613 653 2 885 121 1 127 079Characters inembedded code

517 415 118 295 193 895

Space separators 607 063 406 247 177 213Average use of spaceseparators

13.16% 14.08% 15.72%

Table 7.1. The testing was done on the following three databases. The numbers arethe total sum of each database, counting the occurrence in all TUs and its languagetexts.

Multiple White-space and Invisible SeparatorsDatabase Assa Casco RaysearchRegular Expression \s \p{Z} \s \p{Z} \s \p{Z}

TUs with misplacedspaces

4.75% 5.64 8.64% 9.30% 2.30% 2.34%

Matching TUs 295 350 422 454 157 160Matching translationtexts

392 588 690 749 158 161

Misplaced spaceseparators (Withinmatching TUs)

13.77% 16.55% 36.79% 34.43% 7.62% 7.80%

Table 7.2. Results from the algorithms: multiple white-space and multiple separa-

tors.

also be found in Table 7.2. The last row in the table only shows the amount of cor-rectable characters in matching TUs. An interesting observation is that Raysearch(7.62-7.80%) had the lowest amount of correctable characters in matching TUs,while Assa (13.77-16.55%) almost had twice as many correctable characters com-pared to Raysearch. Casco (34.43-36.79%) had the highest amount of correctablecharacters in matching TUs.

The number of matching TUs increased in all three datasets, when the algorithmmultiple separators was used, compared to algorithm multiple white-space separa-tors. The two algorithms may be used to identify potential TUs, where writers aretrying to style documents with white-spaces instead of using appropriate indenta-tion. Even though the same visual representation is achieved, the future matchingin a TM will be affected. Correcting the TUs with multiple white-spaces and invis-ible separators may improve the matching in the future. However correcting thesematching TUs may not be enough, which is why I suggest two actions: Informing

30

7.1. FIRST STEP: WHITE-SPACES

Assa Casco Raysearch0

10

20

14.06%

23.31%

3.76%

10.56%

6.33%

3.73%Cor

rect

able

Spac

eSep

arat

ors

(%)

Misplaced space separators Space extraction possible

Figure 7.1. Results from the space extraction algorithm, matching misplaced spaceseparators next to opening and closing tags.

users about the use of multiple white-spaces in the editor module and assisting themby providing corrections; automatically correct segments before they are stored inthe TM.

7.1.2 White-space Extraction

The average correctable characters in TUs with misplaced space separator(s) in eachdatabase: Assa 14.06%, Casco 23.31% and Raysearch 3.76%, are found in Figure 7.1.The difference between the solid and lined bars in the figure, shows the amount ofcharacters that may be corrected by the previous algorithm multiple separators,which means the Assa dataset contained the highest amount of extractable spaceseparators.

The algorithm space extraction was later on divided into two parts and werepartially evaluated. One part identified space separators next to opening tags,while the other part identified space separators next to closing tags, the results arelocated in Figure 7.2 on the following page.

The results from Figures 7.3 and 7.4 on page 33 and on page 34, show thatthe majority of the misplaced separators in the evaluated datasets are more oftenoccurring in TUs with misplaced separators next to closing tags. Especially inthe Casco dataset, were separators next to opening tags occurred twice as often(49.96%), compared to separators next to closing tags (23.31%). The probabilityof the two events, matching spaces next to opening tags and closing tags, are non-mutually exclusive. This means that the two events overlap each other, which iswhy the results gathered in the two events cannot be added using simple addition.Some TUs may contain extractable white-spaces next to opening and closing tagsat the same time.

31



2

4

6

0.4%

2.52%

1.27%

2.58%

7.17%

3.06%

Cor

reta

ble

TU

s(%

)

Opening Tags Closing Tags

Figure 7.2. Total number of TUs found by the space extraction algorithm, onematching opening tags and one matching closing tags.

Space Extraction - Opening TagsDatabase Assa Casco Raysearch


0.40% 2.52% 1.27%

Matching TUs 25 123 87Matching translationtexts

57 175 91

Misplaced spaceseparators

8.52% 49.96% 4.01%

Space extractionpossible

8.42% 4.86% 4.01%

Table 7.3. Results from the space extraction algorithm matching opening tags.

It is possible that the number of space separators that can be extracted corre-sponds to the use of embedded code in the TUs. But it appears that assumptionmay not be entirely correct. The amount of embedded code is calculated by di-viding the characters in embedded code, with the total number of characters in allTUs in each dataset, which yields: Assa (11.21%), Casco (4.10%) and Raysearch(17.20%). The Casco dataset contained the lowest amount of embedded code, hadthe highest number of correctable TUs in Figure 7.2 and the second highest numberof correctable separators in Figure 7.1 on page 31. While the Raysearch datasetcontained TUs with the highest use of embedded code, had the lowest number ofcorrectable separators. The results indicate that Raysearch has the highest TMquality, in terms of white-spaces and separators of the three evaluated datasets.While Casco has lowest quality in terms of white-spaces and separators.

32

7.2. HEURISTIC NER


20

40

8.52%

49.96%

4.01%

8.42%

4.86% 4.01%

Spac

eE

xtra

ctio

nO

pen

ing

Tag

s

Misplaced space separators Space extraction possible

Figure 7.3. Space Extraction algorithm matching opening tags.

Space Extraction - Closing TagsDatabase Assa Casco Raysearch


2.58% 7.17% 3.06%


1,716 1,305 299

Misplaced spaceseparators

14.28% 22.08% 3.69%


10.63% 5.39% 3.66%

Table 7.4. Results from the space extraction algorithm matching closing tags.

7.2 Heuristic NER

Two datasets were used in the evaluation of the Heuristic NER algorithms. Onedataset contained 100 TUs chosen from different Excosoft client TM, this datasetwill from now on be called “Excosoft-extraction”. Each TU in the Excosoft-extractiondataset contained one to 14 different language segments. The other dataset con-tained 100 manually created TUs, which were extracted from a product manual forthe camera model Nikon D7000. The manually created dataset will from now on becalled “Nikon”. Each TUs in the Nikon dataset contained language segments in thefollowing three languages: English, German and Swedish. The dictionary stop-listused was a word-list combined by Mieliestronk’s2, which contained 58,000 English

2Mieliestronk’s list of more than 58 000 English words - www.mieliestronk.com/wordlist.html

33



10

20

14.28%

22.08%

3.69%

10.63%

5.39%

3.66%

Spac

eE

xtra

ctio

nC

losi

ng

Tag

s

Misplaced space separatorsSpace extraction possible

Figure 7.4. Space Extraction algorithm matching closing tags.

Dataset NikonDescription P (%) R (%) F1 (%)

No improvements 37,17 60,00 45,90Exclusion filter (EF) all 56,90 64,71 60,55Dictionary stop-list(DSL) case sensitive

52,29 55,88 54,03

Dictionary stop-list(DSL) case insensitive

41,41 58,57 48,52

EF + DSL 58,93 64,71 61,68EF + DSL caseinsensitive

66,67 62,75 64,65

Table 7.5. Results from the Intra-Heuristic algorithm and the Combined Inter-Intra-

Heuristic. The Intra-Heuristic algorithm identifies variables by comparing availablelanguage segments within a TU. The Combined Inter-Intra-Heuristic identifies vari-ables by comparing with existing TU.

words in lower case.

7.2.1 Intra-Heuristic

The highest precision of 98.33% and recall of 50,00%, were achieved with theIntra-Heuristic, when a case sensitive stop-list was used on the Excosoft-extractiondataset, see Table 7.6 on the next page. While the highest precision of 66.67% andrecall of 62.75%, were achieved with the Intra-Heuristic, when a case insensitivedictionary stop-list was used on the Nikon dataset, see Table 7.5.

All general results were improved with the use of a case insensitive dictionary

34

7.2. HEURISTIC NER

Dataset Excosoft-extractionDescription P (%) R (%) F1 (%)


70,31 50,56 58,82


68,85 47,19 56,00


98,21 46,61 63,22

Table 7.6. Results from the Intra-Heuristic algorithm. Identifies variables by com-paring available target language segments within a TU.



50,00 16,85 25,21


55,56 16,85 25,86


92,59 21,19 34,48

Table 7.7. Results from the Inter-Heuristic algorithm. Identifies variables by com-paring with existing TU.

stop-list. Except in the Excosoft-extraction dataset, were the case sensitive stop-list was able to achieve a higher precision and recall. This is because potentialentities are compared with a dictionary stop-list, were matching words are ignored.Examples of problematic words found in the dictionary are: “apple”, which is aname of a fruit, but it may also be the name of the technology company Apple Inc.,also known as Apple; “windows”, which may be multiple openings in a wall of abuilding, that allows admission of air or light, it may also be the name of a famousoperating system, “Windows” developed by Microsoft.

7.2.2 Inter-Heuristic

The highest precision of 92.59% and a recall of 21.19%, was achieved with the Inter-Heuristic, when a case insensitive dictionary stop-list was used on the Excosoft-extraction dataset, see Table 7.7. However the Inter-Heuristic was unable to identifyany entities in the Nikon dataset.

35




62,50 56,18 59,17


63,51 52,81 57,67


97,06 55,93 70,97

Table 7.8. Results from the Combined Inter-Intra-Heuristic algorithm. Identifiesvariables by comparing available target language segments within a TU and withexisting TUs.

A possible reason why no entities were detected in the Nikon dataset, maybe that the character length of the language segments that were extracted fromthe camera manual. The average character length in English segments from theExcosoft-extraction dataset was 42.94 characters, while Nikon had an average char-acter length of 145.45 characters, see Table A.3 on page 49. Comparing the averagecharacter length in other languages in the Excosoft-extraction dataset may not beaccurate, since some TUs do not contain target language segments. All TUs in theNikon dataset contained target language segments.

Shortening the language segments should in theory not affect the results. Areasonable assumption why no TUs match the Inter-Heuristic in the Nikon dataset,is that the extracted segments did not contain enough similar segments. Assum-ing more similar TUs are created over time, the number of TUs that match theInter-Heuristic will likely increase, unless TUs are regularly corrected and reorga-nized. The results shown in Tables 7.5 and 7.7 indicates that the Excosoft-extractiondataset contains more similar TUs compared to the Nikon dataset. The similar TUsidentified by the Inter-Heuristic should be generalized, which will likely lead to fewerTUs and a more generalized TM.

7.2.3 Combined Inter-Intra-Heuristic

Since the Inter-Heuristic was unable to identify any potential entities in the Nikondataset, the achieved results from the Combined Inter-Intra-Heuristic were thereforethe same as with the Intra-Heuristic in Table 7.5 on page 34. The Combined Inter-Intra-Heuristic algorithm was able to achieve the highest F-Measure of 72,54%,compared to all other algorithms developed in this project, see Table 7.8.

If we had to choose between a high precision or high recall, a high recall would bepreferred in this project, since all potential words are verified and manually taggedby a user. A higher precision was achieved when a case insensitive dictionary stop-

36

7.3. STANFORD NER

list was used, at the expense of a lower recall. Due to problematic words previouslymentioned in the analysis of the Intra-Heuristic in Subsection 7.2.1 on page 34.The case insensitive stop-list may be of better use if many regularly used wordswere identified as potential words, which will not likely occur in the Intra-Heuristicalgorithm, unless it is a borrowed word and spelled the same way in all existingtarget language segments.

A higher F-Measure was achieved with a case sensitive stop-list in Excosoft-extraction dataset, compared to a case insensitive stop-list. While the oppositeresults were achieved in the Nikon dataset. Which means the number of potentialwords ignored in the Nikon dataset were higher, than in the Excosoft-extraction.

7.3 Stanford NER

To evaluate the results achieved with the Stanford NER, all tagged entities wereextracted from the generated output of each TU. During the comparison with thecorrect annotated answers, which were created by myself, the category of the ex-tracted entities were ignored. The highest precision of 42.86% and a recall of 6.86%were achieved with model muc.7class.caseless on the Nikon dataset. Whilethe highest F1-Measure of 20.62% were achieved with the conll.4class.caseless

model on the Nikon dataset, see Table 7.9. The evaluation results achieved withthe Stanford NER on the Nikon dataset can also be found in Table 7.9. Even if thesame models were used during the evaluation, almost twice as high precisions wereachieved, compared to the results achieved on the Excosoft-extraction dataset.

Stanford NERDataset Excosoft-extraction NikonMetric P (%) R (%) F1 (%) P (%) R (%) F1 (%)

all.3class.caseless 33,33 2,86 5,26 100,00 11,24 20,20all.3class.caseless Imp. 33,33 1,96 3,70 100,00 9,32 17,05all.3class.distsim 22,22 2,86 5,06 100,00 14,61 25,49all.3class.distsim Imp. 22,22 1,96 3,60 100,00 11,86 21,21nowiki.3class.caseless 25,00 7,14 11,11 91,67 12,36 21,78nowiki.3class.caseless Imp. 25,00 4,90 8,20 92,31 10,17 18,32conll.4class.caseless 37,04 14,29 20,62 81,82 20,22 32,43conll.4class.caseless Imp. 31,25 9,80 14,93 82,61 16,10 26,95muc.7class.caseless 42,86 8,45 14,12 75,00 6,74 12,37muc.7class.caseless Imp. 46,67 6,86 11,97 80,00 6,78 12,50muc.7class.distsim 25,00 9,86 14,14 56,25 10,11 17,14muc.7class.distsim Imp. 27,59 7,92 12,31 63,16 10,17 17,52

Table 7.9. Evaluation results achieved with the Stanford NER from Excosoft-extraction and Nikon datasets. All models used were trained on different Englishtexts. To save space in the results table, the abbreviation “Imp” for the word im-provement is used, and the model names prefix “english” were removed.

37


As previously mentioned in the analysis of the Inter-Heuristic, the length ofthe TUs in the Nikon dataset, may have affected the precision and recall. Butit appears the average length of the TUs in Nikon dataset, were more favorablewith the Stanford NER. Short language segments have a higher information density,compared to long language segments. Which is probably why the precision achievedin the Nikon dataset, were almost two to three times higher than in the Excosoft-extraction dataset.

The information density in technical manuals is often higher than in newspaperarticles, especially manuals that are written with CL (see Section 4.1 on page 15).Previous research [29] show that classifiers developed for a specific type of text, yielda much lower score when applied on a different type of text. That is why trainingis probably needed to improve the precision and recall of the Stanford NER. Suchtask requires manually annotated corpora which is a complex and time-consumingtask. Since the entities used in Excosoft clients TM use different tags comparedto existing NER tags, new categories/classes need to be created, which will requireeven more training to achieve an automatic entity tagging in a TM. That in theend still might require verification by users, to ensure a correct TM generalization.

38

Chapter 8

Conclusions

The algorithms identifying misuse of white-space and invisible separators may au-tomatically correct TUs with such errors. In the future newly created TUs shouldautomatically be corrected with these algorithms, to ensure a consistent TM. IfTUs are automatically corrected, users should be informed of such action, whichwill probably lead to fewer users making such mistakes.

The study shows that the Stanford NER had troubles identifying entities in thetwo datasets. From the data used it was not possible to determine, if the lowerprecision and recall was caused by the information density in the English segments.Further studies are therefore necessary to determine the effects of the informationdensity, in text analyzed by the Stanford NER.

Stanford NER is unable to identify a number in numbered list as an entity,since a number is not an entity, and not part of the seven predefined entity classes.These kinds of numbers are cardinal numbers, which the Heuristic algorithms canidentify as potential words that may be of interest. This appears to be the one ofthe reasons, why the Heuristic algorithms were able to achieve a higher F-Measure,compared to the Stanford NER. Further studies may include different ways to im-prove the default classifier or creating a new classifier. Such task requires manuallyannotated corpora, which is a complex and time-consuming task. This was also thebiggest problem encountered in this project, acquiring a large annotated corporafrom technical manuals required manual annotation.

The limitation in the Inter-Heuristic algorithm is that it only works if a similarTU previously exists in TM. While the Intra-Heuristic only works if at least onetarget language segment previously exists in TM. Another limitation in the Heuristicalgorithms is the inability to identify numerous entity classes, which in this case isdone by the user. The evaluation results from the Heuristic algorithms indicatesthat the Heuristic algorithms appears to be a great initial step to aid Excosofts’users to improve their TMs, by generalizing as many TUs as possible. Furtherstudies may be to add an additional dictionary in the Heuristics, with names of theworld’s e.g. 100,000 largest organizations. If the words match the new dictionary, ithas a high possibility of being, an entity, and will therefore not require any further

39

CHAPTER 8. CONCLUSIONS

processing.The developed prototype allows Excosofts’ users to improve the quality of exist-

ing TMs with different algorithms developed in this project, which in the end willimprove the pattern matching in the translation phase. Further studies may includeadding the developed algorithms to Excosofts’ default generalization rules, and suit-able algorithms to the editor module allowing users to tag entities in newly writtendocuments. Further studies may be to develop other algorithms which may iden-tify other often occurring errors, such as grammatical errors. Different algorithmsgathering different kinds of statistics from an evaluated TM.

The biggest possible factor that may have affected the achieved results duringthe evaluation of the Stanford NER and the Heuristics, is the collection of the cor-rect annotated answers, often called a “gold standard” in machine learning. Thecollection was created by myself and should preferably have been created by a com-bination of different human annotators, creating a more averaged gold standard.

40

Glossary

CAT Computer Aiden Translation. 1

CL Controlled Language. 3, 10, 15

DOM Parser A DOM parser is a standard way to process and read XML docu-ments.. 19

language pair A language pair is a data element, containing a source languagewith one or multiple target languages. 7, 8

MT Machine Translation. 1

Named Entity Recognition Named Entity Recognition is a well known fieldwithin NLP. NER is a task that locates and classifies words into pre-definedentity categories such as organizations, locations, quantities, monetary values,etc. 41

Natural Language Processing Natural Language Processing is a field of com-putational linguistics, processing human (natural) languages in many differentways.. 1

NER See: Named Entity Recognition. 11, 13

NLP See: Natural Language Processing. 1

TM See: Translation Memory. 1, 4

TMS Translation Memory System, a software or tool consisting of three differentmodules; Generalization, Database, Translation. 1

Translation Memory Translation Memory, the database containing the previ-ously translated segments.. 3

TU Translation Unit, an element in the Translation Memory system database. 7

41

Bibliography

[1] Elina Lagoudaki. Translation Memories Survey 2006: Users’ perceptionsaround TM use. In Translating and the Computer 28, volume 28, page 11,London, 2006. Aslib.

[2] B. Esselink, A.S. de Vries, and S. O’Brien. A Practical Guide to Localiza-tion, pages 366–395. Language International World Directory. John BenjaminsPublishing Company, 2000.

[3] Jost Zetzche. Translation memory: state of the technology. MultiLingual 18,18(6), 2007.

[4] Arthern P.J. Aids unlimited: the scope for machine aids in a large organization.In Machine Aids for Translators: ASL1B Proceedings, volume 33 Iss: 7, pages309–319, Great Britain, 1981. MCB UP Ltd.

[5] Lynn E. Webb. Advantages and disadvantages of translation memory: Acost/benefit analysis. Technical report, Monterey, California, 1998.

[6] Sharon O’Brien. Practical Experience of Computer-Aided Translation Toolsin the Software Localization Industry, pages 115–122. St Jerome Publishing,Michigan, 1998.

[7] Harold Somers. Computers and Translation: A translator’s guide, volume 35,pages 31–47. John Benjamins Publishing Company, Manchester, England,2003.

[8] Doug Arnold. Why translation is difficult for computers, volume 35, chapter 8,pages 119–. John Benjamins Publishing Company, Colchester, England, 2003.

[9] Lynne Bowker. Computer-Aided Translation Technology. A Practical Introduc-tion. University of Ottawa Press, Ottawa, 2002.

[10] Ana Guerberof Arenas. What do professional translators think about post-editing? JoSTrans, 19:75–95, 2013.

[11] Wei Huangfu and Yushan Zhao. A Corpus-based Machine Translation Methodof Term Extraction in LSP Texts. Theory and Practice in Language Studies,4(1):46–51, 2014.

43

BIBLIOGRAPHY

[12] MPP Rates for Computer Assisted Translation .http://www.mpp-europe.com/tao100en.htm, October 2010.

[13] ITlocal | Pricing. http://www.i-t-local.com/en/process_pricing, 2014.

[14] Translation, InterSol, Inc. Translation .http://www.intersolinc.us/services/translation.html, 2014.

[15] Steve Iverson. Working With Translation Memory: When and How to use TMfor a successful translation project. MultiLingual Computing & Technology,14(7), 2003.

[16] Bowen Sun. Named entity recognition : Evaluation of Existing Systems, 2010.

[17] Gordon Farrington. AECMA Simplified English: An Overview of the Interna-tional Aircraft Maintenance Language. volume 1, pages 1–21, Leuven, 1996.Centre for Computational Linguistics.

[18] Teruko Mitamura. Controlled Language for Multilingual Machine Translation.In Proceedings of Machine Translation Summit VII, pages 13–17, Singapore,September 1999.

[19] Christine Kamprath, Eric Adolphson, Teruko Mitamura, and Eric Nyberg.Controlled language for multilingual document production: Experience withcaterpillar technical english. In CLAW 98: Proceedings of the Second Interna-tional Workshop on Controlled Language Applications, pages 51–61, May 1998.

[20] Johann Roturier. Assessing a set of Controlled Language rules: Can they im-prove the performance of commercial Machine Translation systems? volume 26,page 4, London, 2004. Aslib.

[21] Jeffrey Allen. Adapting the Concept of “Translation Memory” to “AuthoringMemory” for a Controlled Language Writing Environment. In Translating andthe Computer 21. Proceedings of the 21st Conference of Translating and theComputer, page Un numbered, London, November 1999. Aslib.

[22] Daniel Jurafsky and James H. Martin. Speech and language processing: An in-troduction to natural language processing, computational linguistics and speechrecognition. Pearson Education International, New Jersey, 2nd edition, 2009.

[23] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions andreversals. Soviet Physics Doklady., 10(8):707–710, February 1966.

[24] Lieve Macken. In search of the recurrent units of translation. LinguisticaAntverpiensia New Series, (8):195–212, 2009.

[25] Philipp Koehn and Jean Senellart. Convergence of translation memory and sta-tistical machine translation. In Proceedings of the Second Joint EM+/CNGL

44

http://www.mpp-europe.com/tao100en.htm

Workshop “Bringing MT to the User: Research on Integrating MT in the Trans-lation Industry", pages 21–31, 2010.

[26] Jacek Gintrowicz and Krzysztof Jassem. Using Regular Expressions in Trans-lation Memories. In T. Pełech-Pilichowski M. Ganzha, M. Paprzycki, editor,Proceedings of the International Multiconference on Computer Science and In-formation Technology, volume 2, pages 87–92, Wisła, Poland, October 15-172007.

[27] Ergun Biçici and Marc Dymetman. Dynamic Translation Memory: Using Sta-tistical Machine Translation to Improve Translation Memory Fuzzy Matches.In Computational Linguistics and Intelligent Text Processing, pages 454–465.Springer, 2008.

[28] Mustafa Hilal Qureshi. Determining the Complexity of XML Documents. The-sis, Oklahoma State University, Karachi, Pakistan, 2003.

[29] Thierry Poibeau and Leila Kosseim. Proper Name Extraction from Non-Journalistic Texts. Language and computers, 37(1):144–157, 2001.

45

Appendix A

Appendix

A.1 Excosoft Embedded Tags

Unit Often described numerical values, which often in-cludes unit conversions, e.g. 800 MM mm. Thenumerical value 800 in the unit [mm] can automati-cally be converted by existing rules in the translatormodule, as the language of the target document ischanged the unit is changed as well, without theneed of a human translation.

Variable Often used to replace a hard coded e.g. productname [Product.Name] which are located in the doc-ument.

Translation-hintSuperscriptSubscriptNo translate Describes a text sequence without the need of a

translationMonotypeLiteralInline-if Conditional block, if the condition is satisfied the

whole block is included in the document.CourierConvert

Table A.1. Different class attribute values in the phrase tag.

47

APPENDIX A. APPENDIX

A.2 Quality Assessment Prototype

A screenshot from the developed prototype’s graphical user interface in Figure A.1.

Fig

ure

A.1

.T

he

vie

wof

the

gra

ph

icalu

ser

inte

rface

.B

lock

sco

unts

,are

the

nu

mb

erof

matc

hin

gT

Us

wit

ha

chose

nalg

ori

thm

.

48

A.3. FIRST STEP

A.3 First Step

Additional data used in the result chapter.

Space ExtractionDatabase Assa Casco Raysearch


2.88% 7.33% 3.40%


1,766 1,320 326

Misused spaceseparators

14.06% 23.31% 3.76%


10.56% 6.33% 3.73%

Table A.2. Results from the white-space extraction algorithm that matched openingand closing tags.

A.4 Heuristic NER & Stanford NER

Dataset used by Heuristic NER and Stanford NERDescription Excosoft-extraction Nikon

Languages 14 3Average character length 42.94 145.45

Table A.3. The testing was done on the following two datasets. The averagecharacter length shows the average character length in English segments.

49

www.kth.se