
data is core

Andrejs Vasiļjevs
Chairman of the Board
andrejs@tilde.com

LOCALIZATION WORLD PARIS, JUNE 5, 2012


• Language technology developer

• Localization service provider

• Leadership in smaller languages

• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)

• 135 employees

• Strong R&D team

• 9 PhDs and candidates


MT: machine translation


disruptive INNOVATION


MT paradigms

rule-based MT

• High quality translation in specialized domains
• Requires highly qualified linguists, researchers and software developers
• Time and resource consuming
• Difficult to evolve

statistical MT

• Translation and linguistic knowledge is derived from data
• Relatively easy and quick to develop
• Requires huge amounts of parallel and monolingual data
• Translation quality is inconsistent and can differ dramatically from domain to domain


CHALLENGE


15 largest languages

50%


domains

• IT
• Aerospace
• Agriculture
• Automotive
• Chemistry
• Coal and mining industries
• Communications
• Culture
• Defence
• Education
• Electronics
• Energy
• Finance
• Food technology
• Government affairs
• Legal
• Life sciences
• Logistics
• Marketing
• Mechanical engineering
• Medicine
• Pharmaceuticals
• Religion
• Social affairs
• Trade


one size fits all?


DATA


The total body of European Union law applicable in the EU Member States

JRC-Acquis http://langtech.jrc.it/JRC-Acquis.html


The DGT Multilingual Translation Memory of the Acquis Communautaire

DGT-TM

http://langtech.jrc.it/DGT-TM.html


Parallel data collected from the Web by the University of Uppsala

90 languages, 3800 language pairs

2.7 B parallel units

Opus

http://opus.lingfil.uu.se


open European language resource infrastructure

http://www.meta-net.eu


Data for SMT training


PLATFORM


Moses toolkit

    [ttable-file]
    0 0 5 /.../unfactored/model/phrase-table.0-0.gz

    % ls steps/1/LM_toy_tokenize.1* | cat
    steps/1/LM_toy_tokenize.1
    steps/1/LM_toy_tokenize.1.DONE
    steps/1/LM_toy_tokenize.1.INFO
    steps/1/LM_toy_tokenize.1.STDERR
    steps/1/LM_toy_tokenize.1.STDERR.digest
    steps/1/LM_toy_tokenize.1.STDOUT

    % train-model.perl \
        --corpus factored-corpus/proj-syndicate \
        --root-dir unfactored \
        --f de --e en \
        --lm 0:3:factored-corpus/surface.lm:0

    % moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

    use-berkeley = true
    alignment-symmetrization-method = berkeley
    berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
    berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
    berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
    berkeley-java-options = "-server -mx30000m -ea"
    berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
    berkeley-process-options = "-EMWordAligner.numThreads 8"
    berkeley-posterior = 0.5

    tokenize
        in: raw-stem
        out: tokenized-stem
        default-name: corpus/tok
        pass-unless: input-tokenizer output-tokenizer
        template-if: input-tokenizer IN.$input-extension OUT.$input-extension
        template-if: output-tokenizer IN.$output-extension OUT.$output-extension
        parallelizable: yes

    working-dir = /home/pkoehn/experiment
    wmt10-data = $working-dir/data


build your own MT engine


Tilde / Coordinator (LATVIA)

University of Edinburgh (UK)

Uppsala University (SWEDEN)

Copenhagen University (DENMARK)

University of Zagreb (CROATIA)

Moravia (CZECH REPUBLIC)

SemLab (NETHERLANDS)


• Cloud-based self-service MT factory

• Repository of parallel and monolingual corpora for MT generation

• Automated training of SMT systems from specified collections of data

• Users can specify particular training data collections and build customised MT engines from these collections

• Users can also use the LetsMT! platform to tailor MT systems to their needs using their non-public data


Resource Repository

• Stores SMT training data
• Supports different formats: TMX, XLIFF, PDF, DOC, plain text
• Converts to unified format
• Performs format conversions and alignment
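
The format conversion step can be illustrated with a minimal sketch that reads a TMX translation memory and flattens it into source/target pairs. This is not the LetsMT! implementation: the function name `tmx_to_pairs`, the pair-of-strings output format and the sample data are assumptions made for illustration, and language codes are matched exactly (so `en-US` would not match `en`).

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml:lang attribute to this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs found in a TMX string.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# Made-up sample translation unit.
SAMPLE = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open file</seg></tuv>
      <tuv xml:lang="lv"><seg>Atvērt failu</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = tmx_to_pairs(SAMPLE, "en", "lv")
```

Formats without explicit segment alignment (PDF, DOC, plain text) would additionally need sentence splitting and alignment, which this sketch does not attempt.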


• Put users in control of their data

• Fully public or fully private should not be the only choice

• Data can be used for MT generation without exposing it

• Empower users to create custom MT engines from their data

user-driven machine translation


integration

• Integration with CAT tools
• Integration in web pages
• Integration in web browsers
• API-level integration


Integration of MT in SDL Trados


[Architecture diagram; labels recovered from the slide]

• Training, Using, Sharing of training data
• Web page, Web service, Web page translation widget, CAT tools, Web browser plug-ins
• Upload, Anonymous access, Authenticated access
• System management, user authentication, access rights control ...
• SMT Resource Directory, SMT System Directory
• SMT Resource Repository, SMT Multi-Model Repository (trained SMT models)
• Giza++, Moses SMT toolkit: Processing, Evaluation ...
• Moses decoder


use case: FORTERA


EVALUATION


Previous Work

• Keyboard-monitoring of post-editing (O'Brien, 2005)

• Productivity of MS Office localization (Schmidtke, 2008)
  5-10% productivity gain for SP, FR, DE

• Adobe (Flournoy and Duran, 2009)
  22%-51% productivity increase for RU, SP, FR

• Autodesk Moses SMT system (Plitt and Masselot, 2010)
  74% average productivity increase for FR, IT, DE, SP


Evaluation at Tilde

• Latvian:
  About 1.6 M native speakers
  Highly inflectional: ~22 M possible word forms in total
  Official EU language

• Tilde English – Latvian MT system

• IT Software Localization Domain

• Evaluation of translators’ productivity


English-Latvian data

Bilingual corpus                   Parallel units
Localization TM                    1 290 K
DGT-TM                             1 060 K
OPUS EMEA                          970 K
Fiction                            660 K
Dictionary data                    510 K
Web corpus                         900 K
Total                              5 370 K

Monolingual corpus                 Words
Latvian side of parallel corpus    60 M
News (web)                         250 M
Fiction                            9 M
Total, Latvian                     319 M


MT Integration into Localization Workflow

[Workflow diagram; steps in order]

• Evaluate original / assign Translator and Editor
• Analyze against TMs
• MT translate new sentences
• Translate using translation suggestions from TMs and MT
• Evaluate translation quality / Edit
• Fix errors
• Ready translation
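
One way to picture the "analyze against TMs, then MT translate new sentences" steps is the sketch below: each source segment is first looked up in the translation memory, and only segments without a good enough match are sent to MT. The fuzzy-match threshold, the use of `difflib` as the similarity measure, and the stand-in `fake_mt` function are all assumptions for illustration; the slides do not say how Tilde's workflow scores TM matches.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off for accepting a TM match; not taken from the slides.
FUZZY_THRESHOLD = 0.75


def best_tm_match(segment, tm):
    """Return the most similar TM source segment and its similarity (0..1)."""
    best_src, best_score = None, 0.0
    for src in tm:
        score = SequenceMatcher(None, segment.lower(), src.lower()).ratio()
        if score > best_score:
            best_src, best_score = src, score
    return best_src, best_score


def suggest(segment, tm, mt_translate):
    """Offer a TM suggestion when the match is good enough, otherwise MT."""
    src, score = best_tm_match(segment, tm)
    if src is not None and score >= FUZZY_THRESHOLD:
        return ("TM", tm[src])
    return ("MT", mt_translate(segment))


def fake_mt(segment):
    """Stand-in for a call to an MT engine; returns a placeholder string."""
    return "<MT draft for: " + segment + ">"


# Made-up translation memory: source segment -> stored translation.
tm = {"Open file": "Atvērt failu", "Save file": "Saglabāt failu"}

tm_hit = suggest("Open file", tm, fake_mt)
mt_hit = suggest("Restart the application", tm, fake_mt)
```

In both cases the output is only a suggestion: the translator still edits it, and the editor evaluates the result in the following steps.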


Evaluation of Productivity

• The key interest of the localization industry is to increase the productivity of the translation process while maintaining the required quality level

• Productivity was measured as the translation output of an average translator in words per hour

• 5 translators participated in the evaluation, including both experienced and new translators
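
The productivity measure described here (words per hour) and the relative gain between two translation scenarios can be sketched as follows. The word counts and times are invented for illustration and are not the study's measurements.

```python
def words_per_hour(words, hours):
    """Translation output in words per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return words / hours


def relative_increase(baseline, assisted):
    """Percentage change of the assisted rate relative to the baseline rate."""
    return (assisted - baseline) / baseline * 100.0


# Hypothetical figures: the same amount of text translated with TM only
# and with TM plus MT suggestions.
baseline = words_per_hour(words=500, hours=1.0)
assisted = words_per_hour(words=500, hours=0.8)
gain = relative_increase(baseline, assisted)
print(f"{baseline:.0f} -> {assisted:.0f} words/hour ({gain:.1f}% increase)")
```

Averaging such rates over several translators, as the slide describes, gives a single figure that can be compared between scenarios.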


Evaluation of Quality

• Performed by human editors as part of their regular QA process

• The result of the translation process was evaluated; editors did not know whether MT had been applied to assist the translator

• Comparison to a reference is not part of this evaluation

• The Tilde standard QA assessment form was used, covering the following text quality areas:
  Accuracy
  Spelling and grammar
  Style
  Terminology


Tilde Localization QA assessment applied in the evaluation

QA Grades

Error Score (sum of weighted errors)    Resulting Quality Evaluation
0…9                                     Superior
10…29                                   Good
30…49                                   Mediocre
50…69                                   Poor
>70                                     Very poor
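
The grade bands above can be expressed as a small lookup, sketched here under two assumptions that the slide leaves open: fractional scores fall into the band whose upper bound they have not yet reached, and a score of exactly 70 is treated as "Very poor". The function name `qa_grade` is hypothetical.

```python
def qa_grade(error_score):
    """Map an error score (sum of weighted errors) to a QA grade.

    Band edges follow the slide; the handling of fractional scores and of
    exactly 70 is an assumption.
    """
    if error_score < 0:
        raise ValueError("error score must be non-negative")
    if error_score < 10:
        return "Superior"
    if error_score < 30:
        return "Good"
    if error_score < 50:
        return "Mediocre"
    if error_score < 70:
        return "Poor"
    return "Very poor"


# Scores of 19 and 27 both stay within the same band.
grades = [qa_grade(19), qa_grade(27)]
```

This makes it easy to check whether a change in error score also changes the grade, for example when comparing translations produced with and without MT suggestions.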


Evaluation data

►54 documents in IT domain

►950-1050 adjusted words in each document

►Each document was split in half:

►the first part was translated using suggestions from TM only

►the second half was translated using suggestions from both TM and MT


Latvian: 32.9%* productivity increase

* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p. 35-40, May 30-31, 2011, Leuven, Belgium


Evaluation at Moravia

► IT Localization domain

► Systems trained on the LetsMT platform

► English - Czech translation
  25.1% productivity increase
  Error score increase from 19 to 27, still at the GOOD grade (<30)

► English - Polish translation
  28.5% productivity increase
  Error score increase from 16.8 to 23.6, still at the GOOD grade (<30)


Productivity increase

Czech: 25.1%
Polish: 28.5%
Slovak*: 25%

* For Czech and Polish, formal evaluation was done by Moravia. For Slovak, the productivity increase was estimated by Fortera.


MORE DATA


corpora collection tools

comparability metrics

named entity recognition tools

terminology extraction tools

ACCURAT TOOLKIT


use case: AUTOMOTIVE MANUFACTURER


very small translation memories (just 3500 sentences)

no in-domain corpora in target languages

no money for expensive developments

?


data collection workflow

Terminology extraction

Web crawling (parallel, monolingual)

Parallel data extraction from comparable corpora


TMs

Terminology glossary

Parallel phrases

Parallel Named Entities

Monolingual target language corpus

Resulting data


General domain data as a basis

Domain specific language model

Impose domain specific terminology, named entity translations

Add linguistic knowledge on top of statistical components

SMT Training


right data & right tools


tilde.com

technologies for smaller languages

The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456