TAUS Open Source Machine Translation Showcase, Monaco, Andrejs Vasiļjevs, Tilde, 25 March 2012
Post on 26-Jun-2015
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE
Moses on the Cloud for Do-It-Yourself Machine Translation
By Andrejs Vasiļjevs
Chairman of the Board, Tilde
andrejs@tilde.com
• Language technology developer
• Localization service provider
• Leadership in smaller languages
• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)
• 135 employees
• Strong R&D team
• 9 PhDs and candidates
MT: machine translation
Disruptive INNOVATION
CHALLENGE
15 largest languages
50%
DATA
One size fits all?
just use Moses?
[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \
    --corpus factored-corpus/proj-syndicate \
    --root-dir unfactored \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5

tokenize
    in: raw-stem
    out: tokenized-stem
    default-name: corpus/tok
    pass-unless: input-tokenizer output-tokenizer
    template-if: input-tokenizer IN.$input-extension OUT.$input-extension
    template-if: output-tokenizer IN.$output-extension OUT.$output-extension
    parallelizable: yes

working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data
Build your own MT engine!
Customized MT
Tilde (Coordinator), LATVIA
University of Edinburgh, UK
Uppsala University, SWEDEN
Copenhagen University, DENMARK
University of Zagreb, CROATIA
Moravia, CZECH REPUBLIC
SemLab, NETHERLANDS
• Online collaborative platform for MT building from user-provided data
• Repository of parallel and monolingual corpora for MT generation
• Automated training of SMT systems from specified collections of data
• Users can specify particular training data collections and build customised MT engines from these collections
• Users can also use the LetsMT! platform to tailor MT systems to their needs using their own non-public data
• User-driven cloud-based MT factory, based on open-source MT tools
• Services for data collection, MT generation, customization and running of a variety of user-tailored MT systems
• Application in localization is among the key usage scenarios
• Strong synergy with FP7 project ACCURAT to advance data-driven machine translation for under-resourced languages and domains
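
To make the automated-training bullets above concrete, here is a minimal Python sketch of how a platform might turn a user's choice of corpus into a Moses training command. It uses only the train-model.perl options that appear on the "just use Moses?" slide of this deck (--corpus, --root-dir, --f, --e, --lm). The function name and its parameters are hypothetical illustrations and are not taken from the LetsMT! code base.

```python
import shlex


def build_training_command(corpus_stem, root_dir, src, tgt, lm_path, lm_order=3):
    """Assemble a train-model.perl argument list (hypothetical helper).

    The --lm value follows the factor:order:filename:type pattern seen in
    the deck's own example, with factor 0 and type 0 kept fixed here.
    """
    return [
        "train-model.perl",
        "--corpus", corpus_stem,
        "--root-dir", root_dir,
        "--f", src,
        "--e", tgt,
        "--lm", f"0:{lm_order}:{lm_path}:0",
    ]


# Rebuild the invocation shown in the deck from a user's selection.
cmd = build_training_command(
    corpus_stem="factored-corpus/proj-syndicate",
    root_dir="unfactored",
    src="de",
    tgt="en",
    lm_path="factored-corpus/surface.lm",
)
print(shlex.join(cmd))
```

Returning a list of arguments rather than a single string means the command can be handed to a job scheduler or to subprocess without extra shell quoting.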
Resource Repository
• Stores SMT training data
• Supports different formats: TMX, XLIFF, PDF, DOC, plain text
• Converts to unified format
• Performs format conversions and alignment
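
As an illustration of the format-conversion step listed for the Resource Repository, the following Python sketch pulls aligned sentence pairs out of a TMX document using only the standard library. It is a simplified example under stated assumptions: the function name, the sample data, and the choice of a plain list of tuples as the "unified format" are all hypothetical, and the real LetsMT! converters are not described in this deck.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX string.

    Translation units that lack either language are skipped.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# Invented sample: one complete unit and one unit missing its Latvian side.
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Open the file.</seg></tuv>
    <tuv xml:lang="lv"><seg>Atveriet failu.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Save</seg></tuv></tu>
</body></tmx>"""

pairs = tmx_to_pairs(SAMPLE_TMX, "en", "lv")
print(pairs)
```

Formats such as PDF or DOC carry no segment alignment, so they would need text extraction and a separate sentence-alignment step that this sketch does not cover.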
MT integration
• Integration with CAT tools
• Integration in web pages
• Integration in web browsers
• API-level integration
[Diagram: platform components]
Activities: Training; Using; Sharing of training data
Components: Giza++; Moses SMT toolkit; Moses decoder; SMT Resource Repository; SMT Multi-Model Repository (trained SMT models); SMT Resource Directory; SMT System Directory
Data flow labels: Upload; Processing, Evaluation ...
Access: Anonymous access; Authenticated access
System management, user authentication, access rights control ...
Interfaces: Web page; Web service; Web page translation widget; CAT tools; Web browser plug-ins
• User interface: webpage UI, web service API
• Application Logic
• Resource Repository: stores MT training data and trained models
• High-performance Computing Cluster: executes all computationally heavy tasks (SMT training, MT service, processing and aligning of training data, etc.)
System Architecture
[Diagram: layered system architecture]
• Clients: Web browsers; Widget; CAT tools; Browser plug-ins
• Interface Layer: Web Page UI; Public API
• Application Logic Layer: Resource Repository Adapter; SMT training
• Data Storage Layer (Resource Repository)
• High-performance Computing (HPC) Cluster: SGE
• Other labeled components: Translation; System DB; RR API; SVN; File Share; HPC frontend; CPU nodes
• Connection labels: http; https; html; REST; SOAP; TCP/IP
Productivity increase: 32.9%* (Latvian)
* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p. 35-40, May 30-31, 2011, Leuven, Belgium
% productivity*
25.1% | 28.5%
Czech | Polish
* LetsMT! Project Deliverable D6.4
New Moses features
• incremental training
• distributed language models
• interpolated language models for domain adaptation
• randomized language models to train using huge corpora
• translation of formatted texts
• running Moses decoder in a server mode
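
The idea behind interpolated language models for domain adaptation can be sketched in a few lines of Python: an in-domain model and a general model are mixed as p(w) = λ·p_in(w) + (1 − λ)·p_gen(w), and λ is chosen to minimise perplexity on held-out in-domain text. This is a toy illustration only. It uses add-one smoothed unigram models and a grid search, all names and the sample sentences are invented, and it does not reproduce how Moses or any language-model toolkit implements interpolation.

```python
import math
from collections import Counter


def unigram_lm(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(p_in, p_gen, lam):
    """Linear interpolation: lam * in-domain + (1 - lam) * general."""
    return {w: lam * p_in[w] + (1 - lam) * p_gen[w] for w in p_in}


def perplexity(lm, tokens):
    """Perplexity of a unigram model on a token sequence."""
    log_sum = sum(math.log(lm[w]) for w in tokens)
    return math.exp(-log_sum / len(tokens))


def best_lambda(p_in, p_gen, dev_tokens, steps=10):
    """Grid-search the weight that minimises held-out perplexity."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda lam: perplexity(interpolate(p_in, p_gen, lam), dev_tokens),
    )


# Invented toy data: a small in-domain corpus, a general corpus, a dev set.
in_domain = "click the button to save the file".split()
general = "the cat sat on the mat and the dog ran".split()
dev = "save the file".split()
vocab = sorted(set(in_domain) | set(general) | set(dev))

p_in = unigram_lm(in_domain, vocab)
p_gen = unigram_lm(general, vocab)
lam = best_lambda(p_in, p_gen, dev)
mixed = interpolate(p_in, p_gen, lam)
print(lam, perplexity(mixed, dev))
```

Because the grid includes λ = 0 and λ = 1, the selected mixture is never worse on the held-out set than either model on its own.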
tilde.com
technologies for smaller languages
The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456