
Andrejs Vasiļjevs

chairman of the board

andrejs@tilde.com

data is core

LOCALIZATION WORLD PARIS, JUNE 5, 2012

• Language technology developer

• Localization service provider

• Leadership in smaller languages

• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)

• 135 employees

• Strong R&D team

• 9 PhDs and candidates

MT: machine translation

disruptive INNOVATION

rule-based MT

• High quality translation in specialized domains

• Requires highly qualified linguists, researchers and software developers

• Time and resource consuming

• Difficult to evolve

statistical MT

• Translation and linguistic knowledge is derived from data

• Relatively easy and quick to develop

• Requires huge amounts of parallel and monolingual data

• Translation quality is inconsistent and can differ dramatically from domain to domain

MT paradigms

CHALLENGE

15 largest languages

50%

domains

IT, Aerospace, Agriculture, Automotive, Chemistry, Coal and mining industries, Communications, Culture, Defence, Education, Electronics, Energy, Finance, Food technology, Government affairs, Legal, Life sciences, Logistics, Marketing, Mechanical engineering, Medicine, Pharmaceuticals, Religion, Social affairs, Trade

one size fits all?

DATA

JRC-Acquis

The total body of European Union law applicable in the EU Member States

http://langtech.jrc.it/JRC-Acquis.html

DGT-TM

The DGT Multilingual Translation Memory of the Acquis Communautaire

http://langtech.jrc.it/DGT-TM.html

Opus

Parallel data collected from the Web by University of Uppsala

90 languages, 3800 language pairs, 2.7B parallel units

http://opus.lingfil.uu.se

open European language resource infrastructure

http://www.meta-net.eu

Data for SMT training

PLATFORM

Moses toolkit

[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \
    --corpus factored-corpus/proj-syndicate \
    --root-dir unfactored \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5

tokenize
    in: raw-stem
    out: tokenized-stem
    default-name: corpus/tok
    pass-unless: input-tokenizer output-tokenizer
    template-if: input-tokenizer IN.$input-extension OUT.$input-extension
    template-if: output-tokenizer IN.$output-extension OUT.$output-extension
    parallelizable: yes

working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data

build your own MT engine

Tilde / Coordinator, LATVIA

University of Edinburgh, UK

Uppsala University, SWEDEN

Copenhagen University, DENMARK

University of Zagreb, CROATIA

Moravia, CZECH REPUBLIC

SemLab, NETHERLANDS

• Cloud-based self-service MT factory

• Repository of parallel and monolingual corpora for MT generation

• Automated training of SMT systems from specified collections of data

• Users can specify particular training data collections and build customised MT engines from these collections

• Users can also use LetsMT! platform for tailoring MT system to their needs from their non-public data

• Stores SMT training data

• Supports different formats: TMX, XLIFF, PDF, DOC, plain text

• Converts to unified format

• Performs format conversions and alignment

Resource Repository
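The conversion step can be illustrated for one of the listed formats. This is a minimal sketch, not the LetsMT! implementation: it reads a TMX document with Python's standard XML parser and returns plain (source, target) segment pairs. The function name `tmx_to_pairs`, the sample document, and the choice of a list of tuples as the "unified format" are assumptions made for illustration. Language codes are matched exactly, so a file tagged `en-US` would need `"en-us"` as the argument.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX document string.

    Hypothetical helper: translation units missing either language are skipped.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Made-up sample; the second unit has no Latvian side and is dropped.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open file</seg></tuv>
      <tuv xml:lang="lv"><seg>Atvērt failu</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = tmx_to_pairs(SAMPLE_TMX, "en", "lv")
```

Formats such as PDF or DOC carry no segment pairing, which is why the slide lists alignment as a separate repository task.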

• Put users in control of their data

• Fully public or fully private should not be the only choice

• Data can be used for MT generation without exposing it

• Empower users to create custom MT engines from their data

user-driven machine translation

• Integration with CAT tools

• Integration in web pages

• Integration in web browsers

• API-level integration

integration

Integration of MT in SDL Trados

[Figure: platform architecture diagram. Stages: Sharing of training data, Training, Using. Components: SMT Resource Repository, SMT Resource Directory, Giza++ and Moses SMT toolkit, SMT Multi-Model Repository (trained SMT models), SMT System Directory, Moses decoder. Functions: Upload; Processing, Evaluation ...; Anonymous access; Authenticated access; System management, user authentication, access rights control ... Interfaces: Web page, Web service, Web page translation widget, CAT tools, Web browser plug-ins.]

use case: FORTERA

EVALUATION

• Keyboard-monitoring of post-editing (O'Brien, 2005)

• Productivity of MS Office localization (Schmidtke, 2008)

5-10% productivity gain for SP, FR, DE

• Adobe (Flournoy and Duran, 2009)

22%-51% productivity increase for RU, SP, FR

• Autodesk Moses SMT system (Plitt and Masselot, 2010)

74% average productivity increase for FR, IT, DE, SP

Previous Work

Evaluation at Tilde

• Latvian:

About 1.6 M native speakers

Highly inflectional: ~22 M possible word forms in total

Official EU language

• Tilde English – Latvian MT system

• IT Software Localization Domain

• Evaluation of translators’ productivity

English-Latvian data

Bilingual corpus                   Parallel units
Localization TM                    1 290 K
DGT-TM                             1 060 K
OPUS EMEA                            970 K
Fiction                              660 K
Dictionary data                      510 K
Web corpus                           900 K
Total                              5 370 K

Monolingual corpus                 Words
Latvian side of parallel corpus     60 M
News (web)                         250 M
Fiction                              9 M
Total, Latvian                     319 M

MT Integration into Localization Workflow

Evaluate original / assign Translator and Editor

Analyze against TMs

Translate using translation suggestions from TMs and MT

Evaluate translation quality / Edit

Fix errors

Ready translation

MT translate new sentences

• Key interest of localization industry is to increase productivity of translation process while maintaining required quality level

• Productivity was measured as the translation output of an average translator in words per hour

• 5 translators participated in the evaluation, including both experienced and new translators

Evaluation of Productivity
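The productivity measure described here reduces to simple arithmetic, sketched below under stated assumptions. The function names and the word and hour figures are hypothetical and are not taken from the evaluation; only the definition (translation output in words per hour, compared between two working modes) comes from the slide.

```python
def words_per_hour(words_translated, hours_spent):
    """Translation output in words per hour."""
    if hours_spent <= 0:
        raise ValueError("hours_spent must be positive")
    return words_translated / hours_spent


def productivity_increase_pct(baseline_wph, assisted_wph):
    """Relative change of assisted output over the baseline, in percent."""
    if baseline_wph <= 0:
        raise ValueError("baseline_wph must be positive")
    return (assisted_wph - baseline_wph) * 100.0 / baseline_wph


# Hypothetical figures for illustration only.
baseline = words_per_hour(2000, 4.0)   # working with TM suggestions only
assisted = words_per_hour(2600, 4.0)   # working with TM and MT suggestions
gain = productivity_increase_pct(baseline, assisted)
```

With these made-up inputs the baseline is 500 words per hour, the assisted rate is 650, and the gain is 30%.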

• Performed by human editors as part of their regular QA process

• The result of the translation process was evaluated; editors did not know whether MT had been applied to assist the translator

• Comparison to reference is not part of this evaluation

• Tilde standard QA assessment form was used covering the following text quality areas:

Accuracy

Spelling and grammar

Style

Terminology

Evaluation of Quality

QA Grades

Error Score (sum of weighted errors)    Resulting Quality Evaluation
0…9                                     Superior
10…29                                   Good
30…49                                   Mediocre
50…69                                   Poor
>70                                     Very poor

Tilde Localization QA assessment applied in the evaluation
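The grade bands can be expressed as a small lookup, sketched here as an illustration rather than as Tilde's actual QA tooling. Two points are assumptions: the table leaves a score of exactly 70 unassigned, and this sketch places it in "Very poor"; fractional scores are assigned to the band whose upper bound they have not yet reached. The function name `qa_grade` is hypothetical.

```python
def qa_grade(error_score):
    """Map an error score (sum of weighted errors) to a quality grade.

    Assumption: a score of exactly 70 is treated as "Very poor",
    because the source table covers 0-69 and ">70" only.
    """
    if error_score < 0:
        raise ValueError("error_score must be non-negative")
    if error_score < 10:
        return "Superior"
    if error_score < 30:
        return "Good"
    if error_score < 50:
        return "Mediocre"
    if error_score < 70:
        return "Poor"
    return "Very poor"


grades = [qa_grade(s) for s in (0, 9, 10, 29, 30, 49, 50, 69, 71)]
```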

Evaluation data

►54 documents in IT domain

►950-1050 adjusted words in each document

►Each document was split in half:

►the first part was translated using suggestions from TM only

►the second half was translated using suggestions from both TM and MT
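One way to picture the split is the sketch below, which cuts a list of segments at the point where the running word count first reaches half of the total. It is an assumption-laden illustration: the slide counts "adjusted words", whose weighting is not described, so plain whitespace-separated words stand in here, and the function name and sample segments are made up.

```python
def split_in_half(segments):
    """Split segments where the running word count first reaches half the total.

    Assumption: words are whitespace-separated tokens, not the weighted
    "adjusted words" used in the evaluation.
    """
    counts = [len(segment.split()) for segment in segments]
    half = sum(counts) / 2
    running = 0
    for index, count in enumerate(counts):
        running += count
        if running >= half:
            return segments[:index + 1], segments[index + 1:]
    # Only reached for an empty input list.
    return segments, []


# Made-up IT-domain segments: 4 + 3 words, then 6 + 1 words.
document = [
    "Click OK to continue",
    "Save the file",
    "Select a printer from the list",
    "Cancel",
]
tm_only_part, tm_and_mt_part = split_in_half(document)
```

Splitting on segment boundaries keeps each sentence intact, so the two parts are equal in size only as closely as the segment lengths allow.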

Latvian: 32.9%* productivity increase

* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, pp. 35-40, May 30-31, 2011, Leuven, Belgium

Evaluation at Moravia

► IT Localization domain

► Systems trained on the LetsMT! platform

► English – Czech translation

25.1% productivity increase

Error score increase from 19 to 27, still at the GOOD grade (<30)

►English – Polish translation

28.5% productivity increase

Error score increase from 16.8 to 23.6, still at the GOOD grade (<30)

productivity increase

Czech: 25.1%

Polish: 28.5%

Slovak*: 25%

* For Czech and Polish, formal evaluation was done by Moravia. For Slovak, the productivity increase was estimated by Fortera.

MORE DATA

corpora collection tools

comparability metrics

named entity recognition tools

terminology extraction tools

ACCURAT TOOLKIT

use case: AUTOMOTIVE MANUFACTURER

very small translation memories (just 3500 sentences)

no in-domain corpora in target languages

no money for expensive developments

?

Terminology extraction

Web crawling: parallel, monolingual

Parallel data extraction from comparable corpora

data collection workflow

TMs

Terminology glossary

Parallel phrases

Parallel Named Entities

Monolingual target language corpus

Resulting data

General domain data as a basis

Domain specific language model

Impose domain specific terminology, named entity translations

Add linguistic knowledge on top of statistical components

SMT Training

right data & right tools

tilde.com

technologies for smaller languages

The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456
