Post on 24-Dec-2015
Andrejs Vasiļjevs
chairman of the board
andrejs@tilde.com
data is core
LOCALIZATION WORLD PARIS, JUNE 5, 2012
• Language technology developer
• Localization service provider
• Leadership in smaller languages
• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)
• 135 employees
• Strong R&D team
• 9 PhDs and candidates
MT: machine translation
disruptive INNOVATION
MT paradigms
rule-based MT
• High quality translation in specialized domains
• Requires highly qualified linguists, researchers and software developers
• Time and resource consuming
• Difficult to evolve
statistical MT
• Translation and linguistic knowledge is derived from data
• Relatively easy and quick to develop
• Requires huge amounts of parallel and monolingual data
• Translation quality is inconsistent and can differ dramatically from domain to domain
CHALLENGE
15 largest languages
50%
domains
IT Aerospace
Agriculture Automotive
Chemistry Coal and mining industries
Communications Culture
Defence Education
Electronics Energy
Finance Food technology
Government affairs Legal
Life sciences Logistics
Marketing Mechanical engineering
Medicine Pharmaceuticals
Religion Social affairs
Trade
one size fits all?
DATA
JRC-Acquis
http://langtech.jrc.it/JRC-Acquis.html
The total body of European Union law applicable in the EU Member States
DGT-TM
http://langtech.jrc.it/DGT-TM.html
The DGT Multilingual Translation Memory of the Acquis Communautaire
OPUS
http://opus.lingfil.uu.se
Parallel data collected from the Web by University of Uppsala
90 languages, 3800 language pairs
2.7B parallel units
open European language resource infrastructure
http://www.meta-net.eu
Data for SMT training
PLATFORM
Moses toolkit
[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \
    --corpus factored-corpus/proj-syndicate \
    --root-dir unfactored \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5

tokenize
    in: raw-stem
    out: tokenized-stem
    default-name: corpus/tok
    pass-unless: input-tokenizer output-tokenizer
    template-if: input-tokenizer IN.$input-extension OUT.$input-extension
    template-if: output-tokenizer IN.$output-extension OUT.$output-extension
    parallelizable: yes

working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data
build your own MT engine
Tilde / Coordinator, LATVIA
University of Edinburgh, UK
Uppsala University, SWEDEN
Copenhagen University, DENMARK
University of Zagreb, CROATIA
Moravia, CZECH REPUBLIC
SemLab, NETHERLANDS
• Cloud-based self-service MT factory
• Repository of parallel and monolingual corpora for MT generation
• Automated training of SMT systems from specified collections of data
• Users can specify particular training data collections and build customised MT engines from these collections
• Users can also use the LetsMT! platform to tailor MT systems to their needs using their own non-public data
Resource Repository
• Stores SMT training data
• Supports different formats: TMX, XLIFF, PDF, DOC, plain text
• Converts to unified format
• Performs format conversions and alignment
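As a rough illustration of what "converts to unified format" can mean for one of the listed inputs, the sketch below reads translation units from a TMX document into plain (source, target) pairs. It is not the LetsMT! converter: the function name tmx_to_pairs, the pair-list output and the sample document are assumptions made for this example, and language codes are matched exactly, which real TMX files with codes such as "en-US" would need relaxed.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the predefined XML namespace; older TMX files use a plain "lang" attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs found in a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() concatenates all text inside <seg>, including nested inline elements
                segs[lang] = "".join(seg.itertext()).strip()
        # Keep only units that have both sides of the requested language pair
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Hypothetical two-unit sample; the second unit has no Latvian side and is skipped.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Save the file</seg></tuv>
      <tuv xml:lang="lv"><seg>Saglabāt failu</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Open</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = tmx_to_pairs(SAMPLE_TMX, "en", "lv")
```

Units missing one side are dropped rather than padded, since SMT training needs aligned pairs.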
• Put users in control of their data
• Fully public or fully private should not be the only choice
• Data can be used for MT generation without exposing it
• Empower users to create custom MT engines from their data
user-driven machine translation
integration
• Integration with CAT tools
• Integration in web pages
• Integration in web browsers
• API-level integration
Integration of MT in SDL Trados
[Figure: platform architecture diagram. Stages: sharing of training data, training, using. Components: SMT Resource Repository and SMT Resource Directory; Giza++ and Moses SMT toolkit; SMT Multi-Model Repository (trained SMT models) and SMT System Directory; Moses decoder. Flows: upload; processing, evaluation ...; anonymous access; authenticated access. Shared layer: system management, user authentication, access rights control ... Interfaces: web page, web service, web page translation widget, CAT tools, web browser plug-ins.]
use case FORTERA
EVALUATION
Previous Work
• Keyboard-monitoring of post-editing (O'Brien, 2005)
• Productivity of MS Office localization (Schmidtke, 2008)
  5-10% productivity gain for SP, FR, DE
• Adobe (Flournoy and Duran, 2009)
  22%-51% productivity increase for RU, SP, FR
• Autodesk Moses SMT system (Plitt and Masselot, 2010)
  74% average productivity increase for FR, IT, DE, SP
Evaluation at Tilde
• Latvian:
  About 1.6 M native speakers
  Highly inflectional: ~22 M possible word forms in total
  Official EU language
• Tilde English – Latvian MT system
• IT Software Localization Domain
• Evaluation of translators’ productivity
English-Latvian data
Bilingual corpus                   Parallel units
Localization TM                    1 290 K
DGT-TM                             1 060 K
OPUS EMEA                            970 K
Fiction                              660 K
Dictionary data                      510 K
Web corpus                           900 K
Total                              5 370 K
Monolingual corpus                 Words
Latvian side of parallel corpus     60 M
News (web)                         250 M
Fiction                              9 M
Total, Latvian                     319 M
MT Integration into Localization Workflow
Evaluate original / assign Translator and Editor
Analyze against TMs
Translate using translation suggestions from TMs and MT
Evaluate translation quality / Edit
Fix errors
Ready translation
MT translate new sentences
• Key interest of localization industry is to increase productivity of translation process while maintaining required quality level
• Productivity was measured as the translation output of an average translator in words per hour
• 5 translators participated in the evaluation, including both experienced and new translators
Evaluation of Productivity
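The words-per-hour measure and the relative gain reported later can be written out as a small calculation. This is only a sketch of the arithmetic: the function names and the word and hour figures are invented for illustration and are not data from the evaluation.

```python
def words_per_hour(words, hours):
    """Translation output in words per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return words / hours


def productivity_increase(baseline_wph, assisted_wph):
    """Relative change of assisted over baseline productivity, in percent."""
    if baseline_wph <= 0:
        raise ValueError("baseline productivity must be positive")
    return (assisted_wph - baseline_wph) / baseline_wph * 100.0


# Hypothetical figures: the same volume translated with TM only and with TM plus MT.
baseline = words_per_hour(words=6000, hours=12.0)  # TM only: 500 words per hour
assisted = words_per_hour(words=6000, hours=10.0)  # TM + MT: 600 words per hour
gain = productivity_increase(baseline, assisted)   # about 20 percent for these made-up numbers
```

Expressing the result relative to the TM-only baseline is what makes figures from different translators and language pairs comparable.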
• Performed by human editors as part of their regular QA process
• The result of the translation process was evaluated; editors did not know whether MT had been applied to assist the translator
• Comparison to reference is not part of this evaluation
• Tilde standard QA assessment form was used covering the following text quality areas:
Accuracy
Spelling and grammar
Style
Terminology
Evaluation of Quality
QA Grades
Error Score (sum of weighted errors)    Resulting Quality Evaluation
0…9                                     Superior
10…29                                   Good
30…49                                   Mediocre
50…69                                   Poor
>70                                     Very poor
Tilde Localization QA assessment applied in the evaluation
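The grading scheme can be sketched as a weighted sum followed by a band lookup. The band limits come from the QA Grades table; the per-category weights and error counts below are hypothetical, since the slides do not give Tilde's actual weights or how scores are normalized. Scores of 70 and above are treated as "Very poor" here, which is an assumption about the boundary.

```python
# Upper limits (exclusive) of each band, taken from the QA Grades table.
QA_BANDS = [(10, "Superior"), (30, "Good"), (50, "Mediocre"), (70, "Poor")]


def error_score(error_counts, weights):
    """Sum of weighted errors over the quality areas."""
    return sum(weights[kind] * count for kind, count in error_counts.items())


def qa_grade(score):
    """Map an error score to its quality grade."""
    if score < 0:
        raise ValueError("error score cannot be negative")
    for upper, grade in QA_BANDS:
        if score < upper:
            return grade
    return "Very poor"  # assumption: 70 and above


# Hypothetical weights and counts for the four quality areas named in the slides.
HYPOTHETICAL_WEIGHTS = {"accuracy": 3, "spelling_grammar": 2, "style": 1, "terminology": 2}
counts = {"accuracy": 2, "spelling_grammar": 3, "style": 4, "terminology": 1}

score = error_score(counts, HYPOTHETICAL_WEIGHTS)  # 2*3 + 3*2 + 4*1 + 1*2 = 18
grade = qa_grade(score)                            # "Good"
```

Comparing against exclusive upper limits also handles fractional scores, such as averaged error scores, without gaps between bands.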
Evaluation data
►54 documents in IT domain
►950-1050 adjusted words in each document
►Each document was split in half:
►the first part was translated using suggestions from TM only
►the second half was translated using suggestions from both TM and MT
productivity (%)
Latvian: 32.9%*
* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p. 35-40, May 30-31, 2011, Leuven, Belgium
Evaluation at Moravia
► IT Localization domain
► Systems trained on the LetsMT! platform
► English - Czech translation
  25.1% productivity increase
  Error score increase from 19 to 27, still at the GOOD grade (<30)
► English - Polish translation
  28.5% productivity increase
  Error score increase from 16.8 to 23.6, still at the GOOD grade (<30)
productivity (%)
Czech: 25.1%
Polish: 28.5%
Slovak*: 25%
* For Czech and Polish, formal evaluation was done by Moravia. For Slovak, the productivity increase was estimated by Fortera.
MORE DATA
corpora collection tools
comparability metrics
named entity recognition tools
terminology extraction tools
ACCURAT TOOLKIT
use case AUTOMOTIVE MANUFACTURER
very small translation memories (just 3500 sentences)
no in-domain corpora in target languages
no money for expensive developments
?
Terminology extraction
Web crawling: parallel, monolingual
Parallel data extraction from comparable corpora
data collection workflow
TMs
Terminology glossary
Parallel phrases
Parallel Named Entities
Monolingual target language corpus
Resulting data
General domain data as a basis
Domain specific language model
Impose domain specific terminology, named entity translations
Add linguistic knowledge on top of statistical components
SMT Training
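One possible way to impose domain-specific terminology at decoding time is to wrap glossary terms in XML markup that the Moses decoder can honour when started with its -xml-input option (for example in exclusive mode). The slides do not say how terminology was imposed in this use case, so the sketch below is an assumption: the function mark_terms, the tag name and the English-Latvian glossary entries are hypothetical. Note that a fixed target form ignores inflection, which matters for a highly inflected language.

```python
from xml.sax.saxutils import escape, quoteattr


def mark_terms(sentence, glossary, tag="term"):
    """Wrap glossary matches in <tag translation="..."> markup, longest match first.

    Glossary keys are lowercase, space-separated source phrases.
    """
    tokens = sentence.split()
    max_len = max((len(key.split()) for key in glossary), default=0)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i, then shorter ones
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            target = glossary.get(phrase.lower())
            if target is not None:
                out.append(f"<{tag} translation={quoteattr(target)}>{escape(phrase)}</{tag}>")
                i += n
                break
        else:
            # No glossary entry starts here: emit the token unchanged (XML-escaped)
            out.append(escape(tokens[i]))
            i += 1
    return " ".join(out)


# Hypothetical glossary for an automotive domain.
GLOSSARY = {"brake pad": "bremžu kluči", "engine": "dzinējs"}
marked = mark_terms("Replace the brake pad near the engine", GLOSSARY)
```

The marked sentence would then be passed to the decoder in place of the plain input; matching here is on whitespace tokens only, so real input would first need the same tokenization as the training data.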
right data & right tools
tilde.com
technologies for smaller languages
The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456