statistical machine translation with the moses toolkitthe 11th conference of the association for m...
TRANSCRIPT
The 11th Conference of the Association for Machine Translation in the Americas
October 22 – 26, 2014 -- Vancouver, BC Canada
Tutorial on
Statistical Machine Translation
With the Moses Toolkit
Hieu Hoang, Matthias Huck, and Philipp Koehn
Association for Machine Translation in the Americas
http://www.amtaweb.org
Mose
sMach
ineTranslationwithOpen
SourceSoftware
HieuHoangandMatthiasHuck
Octob
er20
14
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
1Outline
09:30-10:15
Introduction
10:15-11:00
Hands-onSession—
youwill
needalaptop
11:00-11:30
Break
11:30-12:30
Adva
ncedTopics
Slides
dow
nloadable
from
http://www.statmt.org/moses/amta.2014.pdf
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
2BasicIdea
Sta
tistic
al
Mac
hine
Tr
ansl
atio
n S
yste
m
Trai
ning
Dat
aLi
ngui
stic
Too
ls
Sta
tistic
al
Mac
hine
Tr
ansl
atio
n S
yste
m
Tran
slat
ion
Sou
rce
Text
Training
Using
para
llel c
orpo
ram
onol
ingu
al c
orpo
radi
ctio
narie
s
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
3StatisticalMach
ineTranslationHistory
around1990
Pioneeringworkat
IBM,inspired
bysuccessin
speech
recogn
ition
1990s
Dom
inance
ofIBM’sword-based
models,supporttechnolog
ies
early2000s
Phrase-based
models
late
2000s
Tree-based
models
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
4MosesHistory
2002
Pharaohdecoder,precursor
toMoses
(phrase-based
models)
2005
Moses
startedby
HieuHoangandPhilippKoehn(factoredmodels)
2006
JHU
workshop
extendsMoses
sign
ificantly
2006-2012
Fundingby
EU
projects
EuroMatrix,
EuroMatrixP
lus
2009
Tree-based
modelsim
plementedin
Moses
2012-2015
MosesCoreproject.
Full-timestaff
tomaintain
andenhance
Moses
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
5Mosesin
Aca
dem
ia
•Built
byacadem
ics,foracadem
ics
•Reference
implementation
ofstateof
theart
–researchersdevelop
new
methodson
topof
Moses
–develop
ersre-implementpublished
methods
–usedby
other
researchersas
black
box
•Baselineto
beat
–researcherscomparetheirmethodagainst
Moses
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
6Dev
eloper
Community
•Maindevelop
mentat
University
ofEdinburgh,butalso:
–Fon
dazioneBrunoKessler
(Italy)
–CharlesUniversity
(Czech
Republic)
–DFKI(G
ermany)
–RWTH
Aachen
(Germany)
–others...
•Codeshared
ongithub.com
•Mainforum:Moses
supportmailinglist
•Mainevent:
MachineTranslationMarathon
–annual
open
sourceconvention
–presentation
ofnew
open
sourcetools
–hands-on
workon
new
open
sourceprojects
–summer
school
forstatisticalmachinetranslation
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
7Open
SourceComponen
ts
•Moses
distribution
usesexternal
open
sourcetools
–wordalignment:
giza++,mgiza,BerkeleyAligner,Fast
Align
–langu
agemodel:sr
ilm,irst
lm,randlm,kenlm
–scoring:
bleu,ter,meteor
•Other
usefultools
–sentence
aligner
–syntactic
parsers
–part-of-speech
tagg
ers
–morpholog
ical
analyzers
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
8Other
Open
SourceMT
Systems
•Joshua—
JohnsHop
kinsUniversity
http://joshua.sourceforge.net/
•CDec
—University
ofMaryland
http://cdec-decoder.org/
•Jane—
RWTH
Aachen
http://www.hltpr.rwth-aachen.de/jane/
•Phrasal—
Stanford
University
http://nlp.stanford.edu/phrasal/
•Verysimilartechnolog
y
–JoshuaandPhrasalim
plementedin
Java,othersin
C++
–Joshuasupports
only
tree-based
models
–Phrasalsupports
only
phrase-based
models
•Open
sourcingtoolsincreasingtrendin
NLPresearch
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
9Mosesin
Industry
•DistributedwithLGPL—
free
touse
•Com
petitivewithcommercial
SMT
solution
s(G
oogle,Microsoft,SDLLangu
ageWeaver,...)
•But:
–not
easy
touse
–requires
sign
ificantexpertise
forop
timal
perform
ance
–integrationinto
existingworkfl
ownot
straight-forward
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
10Case
Studies
Europea
nCommission
—usesMoses
in-hou
seto
aidhuman
translators
Autodesk
—show
edproductivityincreasesin
translatingmanualswhen
post-editingou
tput
from
acustom
-build
Moses
system
Systran
—develop
edstatisticalpost-editingusingMoses
AsiaOnlin
e—
offerstranslationtechnolog
yandservices
based
onMoses
Manyothers...
World
Trade
Organisation,
Adob
e,Sym
antec,
WIPO,
Sybase,
Safaba,
Bloom
berg,
Pangeanic,KatanMT,Capita,
...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
11Phrase-basedTranslation
•Foreign
inputissegm
entedin
phrases
•Eachphrase
istranslated
into
English
•Phrasesarereordered
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
12Phrase
TranslationOptions
heergeht
janicht
nach
hause
it , it
, he
is are
goes
go
yes
is, o
f cou
rse
not
do n
otdo
es n
otis
not
afte
rto
acco
rdin
g to
in
hous
eho
me
cham
ber
at h
ome
not
is n
otdo
es n
otdo
not
hom
eun
der
hous
ere
turn
hom
edo
not
it is
he w
ill b
eit
goes
he g
oes
is are
is a
fter
all
does
tofo
llow
ing
not a
fter
not t
o
, not
is n
otar
e no
tis
not
a
•Manytranslationop
tion
sto
choosefrom
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
13Phrase
TranslationOptions
heergeht
janicht
nach
hause
it , it
, he
is are
goes
go
yes
is, o
f cou
rse
not
do n
otdo
es n
otis
not
afte
rto
acco
rdin
g to
in
hous
eho
me
cham
ber
at h
ome
not
is n
otdo
es n
otdo
not
hom
eun
der
hous
ere
turn
hom
edo
not
it is
he w
ill b
eit
goes
he g
oes
is are
is a
fter
all
does
tofo
llow
ing
not a
fter
not t
ono
tis
not
are
not
is n
ot a
•Themachinetranslationdecoder
does
not
know
therigh
tansw
er–
pickingtherigh
ttranslationop
tion
s–
arrangingthem
intherigh
torder
→Searchprob
lem
solved
bybeam
search
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
14Decoding:Preco
mpute
TranslationOptions
ergeht
janicht
nach
hause
consultphrase
translationtable
forallinputphrases
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
15Decoding:Start
withInitialHyp
othesis
ergeht
janicht
nach
hause
initialhyp
othesis:noinputwordscovered,noou
tputproduced
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
16Decoding:Hyp
othesis
Exp
ansion
ergeht
janicht
nach
hause
are
pickanytranslationop
tion
,create
new
hyp
othesis
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
17Decoding:Hyp
othesis
Exp
ansion
ergeht
janicht
nach
hause
are ithe
create
hyp
otheses
forallother
translationop
tion
s
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
18Decoding:Hyp
othesis
Exp
ansion
ergeht
janicht
nach
hause
are ithe
goes
does
not
yes
go to
hom
e
hom
e
also
create
hyp
otheses
from
createdpartial
hyp
othesis
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
19Decoding:FindBestPath
ergeht
janicht
nach
hause
are ithe
goes
does
not
yes
go to
hom
e
hom
e
backtrack
from
highestscoringcomplete
hyp
othesis
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
20FactoredTranslation
•Factoredrepresention
ofwords
wor
dw
ord
part
-of-
spee
ch
Output
Input
mor
phol
ogy
part
-of-
spee
ch
mor
phol
ogy
wor
d cl
ass
lem
ma
wor
d cl
ass
lem
ma
......
•Goals
–generalization,e.g.
bytranslatinglemmas,not
surfaceform
s–
richer
model,e.g.
usingsyntaxforreordering,
langu
agemodeling
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
21FactoredModel
Example:
lem
ma
lem
ma
part
-of-
spee
ch
Output
Input
mor
phol
ogy
part
-of-
spee
ch
wor
dw
ord
Decom
posingthetranslationstep
Translatinglemmaandmorpholog
ical
inform
ationmorerobust
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
22Syn
tax-basedTranslation
String-to-String
String-to-T
ree
JohnmissesMary
JohnmissesMary
⇒Marie
manqueaJean
⇒S
NP
Marie
VP
V
manque
PP
P a
NP
Jean
Tree-to-String
Tree-to-T
ree
S
NP
John
VP
V
misses
NP
Mary
S
NP
John
VP
V
misses
NP
Mary⇒
S
NP
Marie
VP
V
manque
PP
P a
NP
Jean
⇒MariemanqueaJean
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
23Syn
tax-basedDecoding
Sie
PP
ER
will
VA
FIN
eine
AR
TTa
sse
NN
Kaf
fee
NN
trin
ken
VV
INF
NP
VP
S
VB
drin
k
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
24Syn
tax-basedDecoding
Sie
PP
ER
will
VA
FIN
eine
AR
TTa
sse
NN
Kaf
fee
NN
trin
ken
VV
INF
NP
VP
S
PR
Osh
eV
Bdr
ink
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
25Syn
tax-basedDecoding
Sie
PP
ER
will
VA
FIN
eine
AR
TTa
sse
NN
Kaf
fee
NN
trin
ken
VV
INF
NP
VP
S
PR
Osh
eV
Bdr
ink
NN
coffe
e
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
26Syn
tax-basedDecoding
Sie
PP
ER
will
VA
FIN
eine
AR
TTa
sse
NN
Kaf
fee
NN
trin
ken
VV
INF
NP
VP
S
PR
Osh
eV
Bdr
ink
NN
coffe
e
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
27Syn
tax-basedDecoding
Sie
PP
ER
will
VA
FIN
eine
AR
TTa
sse
NN
Kaf
fee
NN
trin
ken
VV
INF
NP
VP
S
PR
Osh
eV
Bdr
ink
NN |
cup
IN | of
NP
PP
NN
NP
DE
T | a
NN
coffe
e
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
28Syn
tax-basedDecoding
Sie
PP
ER
will
VA
FIN
eine
AR
TTa
sse
NN
Kaf
fee
NN
trin
ken
VV
INF
NP
VP
S
PR
Osh
eV
Bdr
ink
NN |
cup
IN | of
NP
PP
NN
NP
DE
T | a
VB
Z |w
ants
VB
VP
VP
NP
TO | to
NN
coffe
e
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
29Syn
tax-basedDecoding
Sie
PP
ER
will
VA
FIN
eine
AR
TTa
sse
NN
Kaf
fee
NN
trin
ken
VV
INF
NP
VP
S
PR
Osh
eV
Bdr
ink
NN |
cup
IN | of
NP
PP
NN
NP
DE
T | a
VB
Z |w
ants
VB
VP
VP
NP
TO | to
NN
coffe
e
S
PR
O V
P
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
30How
doIget
started?
•Collect
yourdata
–Paralleldata
∗Freelyavailable
data,
e.g.
Europarl,MultiUN,WIT3,
OPUS,...
∗TAUS,Lingu
isticDataCon
sortium
(LDC),...
–Mon
olingu
aldata
∗Com
mon
Crawl,WMT
New
sCrawlcorpus,...
•Set
upMoses
–Dow
nload
sourcecodeforMoses,GIZA++,MGIZA
–Com
pile,install
–Oruse
precom
piledMoses
packages(W
indow
s,Linux,
Mac
OSX)
–Moreinfo:http://www.statmt.org/moses/
•Train
system
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
31Data-drive
nMT
Training
ParallelTrainingData
TranslationModel
Mon
olingu
alTrainingData
Langu
ageModel
LM
Estim
ation
Tuning
Develop
mentSet
SMT
System
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
32MosesPipeline
Execute
alotof
scripts
tokenize<corpus.en>corpus.en.tok
lowercase<corpus.en.tok>corpus.en.lc
...
mert.perl....
moses...
mteval-v13.pl...
Change
apartof
theprocess,execute
everythingagain
tokenize<corpus.en>corpus.en.tok
lowercase<corpus.en.tok>corpus.en.lc
...
mert.perl....
moses...
mteval-v13.pl...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
33Phrase-basedModel
TrainingwithMoses
•Com
mandline
train-model.perl...
•Example
phrase
from
model
Bndnisse|||alliances|||11112.718||||||11
GeneralMusharrafbetratam|||generalMusharrafappearedon|||11112.718||||||11
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
34Phrase-basedDecodingwithMoses
•Com
mandline
moses-fmoses.ini-iin.txt>out.txt
•Advantages
–fast
—under
halfasecondper
sentence
forfast
configu
ration
–low
mem
oryrequirem
ents
—∼2
00MBRAM
forlowestconfigu
ration
–state-of-the-arttranslationqualityon
mosttasks,
especially
forrelatedlangu
agepairs
–robust
—does
not
rely
onanysyntactic
annotation
•Disadvantages
–poor
modelingof
lingu
istickn
owledge
andof
long-distance
dependencies
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
35HierarchicalModel
TrainingwithMoses
•Hierarchical
model:
form
ally
syntax-based,withou
tlingu
isticannotation(string-to-string)
•Com
mandline
train-model.perl-hierarchical...
•Example
rule
from
model
Bundnisse[X][X]Kraften[X]|||alliances[X][X]forces[X]|||11112.718|||1-1|||0.05263160.0526316
•Visualizationof
rule
X
Bundnisse
X1
Kraften
⇒X
alliances
X1
forces
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
36HierarchicalDecodingwithMoses
•Com
mandline
moses_chart-fmoses.ini-iin.txt>out.txt
•Advantages
–able
tomodel
non
-con
tigu
itieslikene...pas
→not
–betterat
medium-range
reordering
–ou
tperform
sphrase-based
system
swhen
translatingbetweenwidelydifferent
langu
ages,e.g.
Chinese-English
•Disadvantages
–morediskusage
—translationmodel
×10larger
than
phrase-based
–slow
er—
0.5-2sec/sent.forfastestconfigu
ration
–higher
mem
oryrequirem
ents
—morethan
1GBRAM
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
37Syn
tax-basedModel
TrainingwithMoses
•Com
mandline
train-model.perl-ghkm...
•Example
rule
from
model
[X][NP-SB]alsowanted[X][ADJA][X][NN][X]|||[X][NP-SB]wolltenauchdie[X][ADJA][X][NN][S-TOP]|||...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
38Syn
tax-basedDecodingwithMoses
•Com
mandline
moses_chart-fmoses.ini-iin.txt>out.txt
(likehierarchical)
•Advantage
–canuse
outsidelingu
isticinform
ation
–prom
ises
tosolveim
portantprob
lemsin
SMT,e.g.
long-range
reordering
•Disadvantages
–trainingslow
anddiffi
cultto
getrigh
t–
requires
syntactic
parse
annotation
∗syntactic
parsers
available
only
forsomelangu
ages
∗not
designed
formachinetranslation
∗unreliable
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
39Exp
erim
entManagem
entSystem
•EMSautomates
theentire
pipeline
•Oneconfigu
ration
file
forallsettings:record
ofallexperim
entaldetails
•Schedulerof
individual
stepsin
pipeline
–automatically
keepstrackof
dependencies
–parallelexecution
–crashdetection
–automatic
re-use
ofpriorresults
•Fastto
use
–setupanew
experim
entin
minutes
–setupavariationof
anexperim
entin
seconds
•Disadvantage:not
allMoses
featuresareintegrated
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
40How
does
itwork?
•Write
aconfigu
ration
file
(typically
byadaptingan
existingfile)
•Test:
experiment.perl-configconfig
•Execute:
experiment.perl-configconfig-exec
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
41
Wor
kflow
aut
omat
ical
lyge
nera
ted
byex
perim
ent.p
erl
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
42W
ebInterface
Listof
experim
ents
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
43ListofRuns
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
44Analysis:
BasicStatistics
•Basic
statistics
–n-gram
precision
–evaluationmetrics
–coverage
oftheinputin
corpusandtranslationmodel
–phrase
segm
entation
sused
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
45Analysis:
UnknownW
ords
grou
ped
bycountin
test
set
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
46Analysis:
OutputAnnotation
Color
highlightingto
indicaten-gram
overlapwithreference
translation
darkerbleu=
wordispartof
larger
n-gram
match
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
47Analysis:
InputAnnotation
•For
each
wordandphrase,colorcodingandstatson
–number
ofoccurrencesin
trainingcorpus
–number
ofdistinct
translationsin
translationmodel
–entropyof
conditionaltranslationprob
ability
distribution
φ(e|f)(normalized)
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
48Analysis:
BilingualConco
rdancer
translationof
inputphrase
intrainingdatacontext
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
49Analysis:
Alig
nmen
t
Phrase
alignmentof
thedecodingprocess
(red
border,interactive)
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
50Analysis:
TreeAlig
nmen
t
Usesnestedboxes
toindicatetree
structure
(red
border,yellow
shaded
spansin
focus,interactive)
forsyntaxmodel,non
-terminalsarealso
show
n
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
5151
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
52Analysis:
Comparisonof2Runs
Differentwordsarehighlighted
sortable
bymostim
provem
ent,deterioration
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
53
Hands-OnSession
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
54Adva
ncedFea
tures
•Faster
Training
•FasterDecoding
•Moses
Server
•Dataanddom
ainadaptation
•Instructionsto
decoder
•Inputform
ats
•Outputform
ats
•Increm
entalTraining
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
55Adva
ncedFea
tures
•Faster
Training
–Tokenization
–Tuning
–Alignment
–Phrase-Table
Extraction
–Train
langu
agemodel
...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
56Faster
Training
•Runstepsin
parallel(that
donot
dependon
each
other)
•MulticoreParallelization
.../train-model.perl-parallel
•EMS:
[TRAINING]
parallel=yes
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
57Adva
ncedFea
tures
•FasterTraining
–Toke
nization
–Tuning
–Alignment
–Phrase-Table
Extraction
–Train
langu
agemodel
...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
58Faster
Training
•Multi-threaded
tokenization
•Specifynumber
ofthreads
.../tokenizer.perl-threadsNUM
•EMS:
input-tokenizer="$moses-script-dir/tokenizer/tokenizer.perl
-threadsNUM"
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
59Adva
ncedFea
tures
•FasterTraining
–Tokenization
–Tuning
–Alignment
–Phrase-Table
Extraction
–Train
langu
agemodel
...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
60Faster
Training
•Multi-threaded
tokenization
•Specifynumber
ofthreads
.../mert-threadsNUM
•EMS:
tuning-settings="-threadsNUM"
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
61Adva
ncedFea
tures
•FasterTraining
–Tokenization
–Tuning
–Alig
nmen
t
–Phrase-Table
Extraction
–Train
langu
agemodel
...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
62Faster
Training
•WordAlignment
•Multi-threaded
–Use
MGIZA,not
GIZA++
.../train-model.perl-mgiza-mgiza-cpusNUM
EMS:
training-options="-mgiza-mgiza-cpusNUM"
•On:mem
ory-lim
ited
machines
–snt2coocprog
ram
requires
6GB+
mem
ory
–Reimplementation
uses10
MB,buttake
longerto
run
.../train-model.perl-snt2coocsnt2cooc.pl
EMS:
training-options="-snt2coocsnt2cooc.pl"
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
63Adva
ncedFea
tures
•FasterTraining
–Tokenization
–Tuning
–Alignment
–Phrase-T
able
Extraction
–Train
langu
agemodel
...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
64Faster
Training
•Phrase-Table
Extraction
–Split
trainingdatainto
NUM
equal
parts
–Extract
concurrently .../train-model.perl-coresNUM
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
65Faster
Training
•Sorting
–Relyheavily
onUnix
’sort’command
–may
take
50%+
oftranslationmodel
build
time
–Needto
optimizefor
∗speed
∗diskusage
–Dependenton
∗sort
version
∗Unix
version
∗available
mem
ory
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
66Faster
Training
•Plain
sorted
sort<extract.txt>extract.sorted.txt
•Optimized
forlargeserver
sort--buffer-size10G--parallel5
--batch-size253--compress-program[gzip/pigz]...
–Use
10GB
ofRAM
—themorethebetter
–5CPUs—
themorethebetter
–mergesort
atmost25
3files
–compressinterm
ediate
files—
less
diski/o
•In
Moses:
.../train-model.perl-sort-buffer-size10G-sort-parallel5
-sort-batch-size253-sort-compresspigz
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
67Adva
ncedFea
tures
•FasterTraining
–Tokenization
–Tuning
–Alignment
–Phrase-Table
Extraction
–Train
languagemodel
...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
68IR
STLM:Training
•Develop
edby
FBK-irst,Trento,Italy
•Specialized
trainingforlargecorpora
–parallelization
–reduce
mem
oryusage
•Quantization
ofprob
abilities
–reducesmem
orybutlose
accuracy
–prob
ability
stored
in1byte
insteadof
4bytes
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
69IR
STLM:Training
•Training:
build-lm.sh-i"gunzip-ccorpus.gz"-n3
-otrain.irstlm.gz-k10
–-n3
=n-gram
order
–-k10
=split
trainingprocedure
into
10steps
•EMS:
irst-dir=[IRSTpath]
lm-training="$moses-script-dir/generic/trainlm-irst.perl
-coresNUM-irst-dir$irst-dir"
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
70New
:KENLM
Training
•Can
trainvery
largelangu
agemodelswithlim
ited
RAM
(ondiskstream
ing)
lmplz-o[order]-S[memory]<text>text.lm
•-oorder
=n-gram
order
•-Smemory
=How
much
mem
oryto
use.
–NUM%
=percentage
ofphysical
mem
ory
–NUM[b/K/M/G/T]
=specified
amou
ntin
bytes,kilo
bytes,etc.
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
71Adva
ncedFea
tures
•FasterTraining
•Faster
Decoding
•Moses
Server
•Dataanddom
ainadaptation
•Instructionsto
decoder
•Inputform
ats
•Outputform
ats
•Increm
entalTraining
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
72Adva
ncedFea
tures
•FasterTraining
•Faster
Decoding
–Multi-threading
–Speedvs.Mem
ory
–Speedvs.Quality
...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
73Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
–Multi-threading
–Speedvs.Mem
ory
–Speedvs.Quality
...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
74Faster
Decoding
•Multi-threaded
decoding
.../moses--threadsNUM
•Easyspeed-up
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
75Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
–Multi-threading
–Spee
dvs.Mem
ory
–Speedvs.Quality
...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
76Spee
dvs.Mem
ory
Use
Typical
Europarlfile
sizes:
•Langu
agemodel
–17
0MB
(trigram
)–
412MB
(5-gram)
•Phrase
table
–11GB
•Lexicalized
reordering
–9.4G
B
→total=
20.8
GB
Wor
king
mem
ory
Tran
slat
ion
mod
el
Lang
uage
mod
el
Pro
cess
siz
e(R
AM
)
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
77Spee
dvs.Mem
ory
Use
•Loadinto
mem
ory
–longload
time
–largemem
oryusage
–fast
decoding
Wor
king
mem
ory
Tran
slat
ion
mod
el
Lang
uage
mod
el
Pro
cess
siz
e(R
AM
)
•Load-on-dem
and
–storeindexed
model
ondisk
–binaryform
at–
minim
alstart-uptime,
mem
oryusage
–slow
erdecoding
Dis
kca
che
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
78Create
BinaryTables
Phrase
Table:
Phrase-based
exportLC_ALL=C
catpt.txt|sort|./processPhraseTable-ttable00-\
-nscores4-outout.file
exportLC_ALL=C
./CreateOnDiskPt1141002pt.txtout.folder
Hierarchical
/Syntax
exportLC_ALL=C./CreateOnDiskPt1141002pt.txtout.folder
Lexical
ReorderingTable:
exportLC_ALL=C
processLexicalTable-inr-t.txt-outout.file
Langu
ageModels(later)
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
79SpecifyBinaryTables
Change
inifile
Phrase
Table
[feature]
PhraseDictionaryBinaryname=TranslationModel0table-limit=20\
num-features=4path=/.../phrase-table
Hierarchical
/Syntax
[feature]
PhraseDictionaryOnDiskname=TranslationModel0table-limit=20\
num-features=4path=/.../phrase-table
Lexical
ReorderingTable
automatically
detected
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
80Compact
Phrase
Table
•Mem
ory-effi
cientdatastructure
–phrase
table
6–7times
smallerthan
on-diskbinarytable
–lexicalreorderingtable
12–1
5times
smallerthan
on-diskbinarytable
•Storedin
RAM
•May
bemem
orymapped
•Train
withprocessPhraseTableMin
•SpecifywithPhraseDictionaryCompact
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
81IR
STLM
•Develop
edby
FBK-irst,Trento,Italy
•Createabinaryform
atwhichcanberead
from
diskas
needed
–reducesmem
orybutslow
erdecoding
•Quantization
ofprob
abilities
–reducesmem
orybutlose
accuracy
–prob
ability
stored
in1byte
insteadof
4bytes
•Not
multithreaded
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
82IR
STLM
inMoses
•Com
pile
thedecoder
withIRSTLM
library
./configure--with-irstlm=[rootdiroftheIRSTLMtoolkit]
•Createbinaryform
at:
compile-lmlanguage-model.srilmlanguage-model.blm
•Load-on-dem
and:
renamefile.mm
•Change
inifile
touse
IRSTLM
implementation
[feature]
IRSTLMname=LM0factor=0path=/.../lmorder=5
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
83KENLM
•Develop
edby
KennethHeafield(C
MU
/Edinburgh/Stanford)
•Fastest
andsm
allest
langu
agemodel
implementation
•Com
pile
from
LM
trained
withSRILM
build_binarymodel.lmmodel.binlm
•Specifyin
decoder
[feature]
KENLMname=LM0factor=0path=/.../model.binlmorder=5
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
84New
:OSM
(OperationsSeq
uen
ceModel)
•Amodel
that
–considerssourceandtarget
contextual
inform
ationacross
phrases
–integrates
translationandreorderinginto
asinglemodel
•Con
vert
abilingu
alsentence
toasequence
ofop
erations
–Translate(G
enerateaminim
altranslationunit)
–Reordering(Insert
agapor
Jump)
•P(e,f,a)=
N-gram
model
over
resultingop
erationsequences
•Overcom
esphrasalindependence
assumption
–Con
siderssourceandtarget
contextual
inform
ationacross
phrases
•Betterreorderingmodel
–Translationandreorderingdecisionsinfluence
each
–Handleslocalandlongdistance
reorderings
inaunified
manner
•Nospuriou
sphrasalsegm
entation
prob
lem
•Average
gain
of+0.40
onnew
s-test20
13across
10pairs
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
85New
:OSM
(OperationsSeq
uen
ceModel)
Sie
wür
den
gege
nS
iest
imm
en
The
y w
ould
vot
e ag
ains
t you
Op
erat
ion
so 1
Gen
erat
e (S
ie, T
hey)
o 2G
ener
ate
(wür
den,
wou
ld)
o 3In
sert
Gap
o 4G
ener
ate
(stim
men
, vot
e)
o 5Ju
mp
Bac
k (1
)
o 6G
ener
ate
(geg
en, a
gain
st)
o 7G
ener
ate
(Sie
, you
)
Sie
wür
den
stim
men
The
y
wou
ld
vot
e
o 1o 2
o 3o 4
Co
nte
xt W
ind
ow
Mo
del
:
p osm
(F,E
,A)
= p
(o1,
...,o
N)
= �
ip(
o i|o
i-n+
1...o
i-1)
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
86Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
–Multi-threading
–Speedvs.Mem
ory
–Spee
dvs.Quality
...
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
87Spee
dvs.Quality tim
e (s
econ
ds/w
ord)
quality (bleu)op
timum
for
win
ning
com
petit
ions
optim
umfo
r pr
actic
alus
e
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
88Spee
dvs.Quality
ergeht
janicht
nach
hause
are ithe
goes
does
not
yes
go to
hom
e
hom
e
•Decoder
search
createsvery
largenumber
ofpartial
translations(”hyp
otheses”)
•Decodingtime∼
number
ofhyp
otheses
created
•Translationquality∼
number
ofhyp
othesiscreated
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
89Hyp
othesis
Stack
s
are ithe
goes
does
not
yes
no w
ord
tran
slat
edon
e w
ord
tran
slat
edtw
o w
ords
tran
slat
edth
ree
wor
dstr
ansl
ated
•Phrase-based:Onestackper
number
ofinputwordscovered
•Number
ofhyp
othesiscreated=
sentence
length×
stacksize
×applicable
translationop
tion
s
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
90PruningParameters
•Regularbeam
search
–--stackNUM
max.number
ofhyp
otheses
contained
ineach
stack
–--ttable-limitNUM
max.num.of
translationop
tion
sper
inputphrase
–search
timerough
lylinearwithrespectto
each
number
•Cubepruning
(fixednumber
ofhyp
otheses
areadded
toeach
stack)
–--search-algorithm1
turnson
cubepruning
–--cube-pruning-pop-limitNUM
number
ofhyp
otheses
added
toeach
stack
–search
timerough
lylinearwithrespectto
pop
limit
–note:
stacksize
andtranslationtable
limithavelittleim
pactin
speed
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
91Syn
taxHyp
othesis
Stack
s
•Onestackper
inputwordspan
•Number
ofhyp
othesiscreated=
sentence
length2×
number
ofhyp
otheses
added
toeach
stack
--cube-pruning-pop-limitNUM
number
ofhyp
otheses
added
toeach
stack
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
92Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
•MosesServe
r•
Dataanddom
ainadaptation
•Instructionsto
decoder
•Inputform
ats
•Outputform
ats
•Increm
entalTraining
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
93MosesServe
r
•Moses
commandline:
.../moses-f[ini]<[inputfile]>[outputfile]
–Not
practicalforcommercial
use
•Moses
Server:
.../mosesserver-f[ini]--server-port[PORT]--server-log[LOG]
–AcceptHTTPinput.
XMLSOAPform
at•
Client:
–Com
municateviahttp
–Example
clients
inJava
andPerl
–Write
yourow
nclient
–Integrateinto
yourow
napplication
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
94Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
•Moses
Server
•Data
anddomain
adaptation
–Train
everythingtogether
–Secon
daryphrase
table
–Dom
ainindicator
features
–Interpolated
langu
agemodels
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
95Data
•Parallelcorpora→
translationmodel
–sentence-aligned
translated
texts
–translationmem
oriesareparallelcorpora
–diction
ariesareparallelcorpora
•Mon
olingu
alcorpora→
langu
agemodel
–text
inthetarget
langu
age
–billionsof
wordseasy
tohandle
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
96Domain
Adaptation
•Themoredata,
thebetter
•Themorein-dom
aindata,
thebetter
(evenin-dom
ainmon
olingu
aldatavery
valuable)
•Alwaystunetowardstarget
dom
ain
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
97Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
•Moses
Server
•Dataanddom
ainadaptation
–Train
everythingtogether
–Secon
daryphrase
table
–Dom
ainindicator
features
–Interpolated
langu
agemodels
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
98Default:Train
Eve
rythingTogether
•Easyto
implement
–Con
catenatenew
datawithexistingdata
–Retrain
•Disadvantages:
–Slower
trainingforlargeam
ountof
data
–Cannot
weigh
toldandnew
dataseparately
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
99Default:Train
Eve
rythingTogether
Specification
inEMS:
•Phrase-table
[CORPUS]
[CORPUS:in-domain]
raw-stem=....
[CORPUS:background]
raw-stem=....
•LM
[LM]
[LM:in-domain]
raw-corpus=....
[LM:background]
raw-corpus=....
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
100
Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
•Moses
Server
•Dataanddom
ainadaptation
–Train
everythingtogether
–Secondaryphrase
table
–Dom
ainindicator
features
–Interpolated
langu
agemodels
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
101
SecondaryPhrase
Table
•Train
initialphrase
table
andLM
onbaselinedata
•Train
secondaryphrase
table
andLM
new
/in-dom
aindata
•Use
bothin
Moses
–Secon
daryphrase
table
[feature]
PhraseDictionaryMemorypath=.../path.1
PhraseDictionaryMemorypath=.../path.2
[mapping]
0T0
1T1
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
102
SecondaryPhrase
Table
•–
Secon
daryLM
[feature]
KENLMpath=.../path.1
KENLMpath=.../path.2
•Can
give
differentweigh
tsforprim
aryandsecondarytables
•Not
integrated
into
theEMS
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
103
SecondaryPhrase
Table
•Terminolog
y/Glossarydatabase
–fixedtranslation
–per
client,project,etc
•Primaryphrase
table
–backoffto
’normal’phrase-table
ifnoglossary
term
[feature]
PhraseDictionaryMemorypath=.../glossary
PhraseDictionaryMemorypath=.../normal.phrase.table
[mapping]
0T0
1T1
[decoding-graph-backoff]
0 1
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
104
Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
•Moses
Server
•Dataanddom
ainadaptation
–Train
everythingtogether
–Secon
daryphrase
table
–Domain
indicatorfeatures
–Interpolated
langu
agemodels
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
105
Domain
IndicatorFea
tures
•Onetranslationmodel
•Flageach
phrase
pair’sorigin
–indicator:binaryflag
ifitoccurs
inspecificdom
ain
–ratio:
how
oftenitoccurs
inspecificdom
ainrelative
toall
–subset:
similarto
indicator,butifin
multipledom
ains,markedwithmultiple-
dom
ainfeature
•In
EMS:
[TRAINING]
domain-features="indicator"
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
106
Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
•Moses
Server
•Dataanddom
ainadaptation
–Train
everythingtogether
–Secon
daryphrase
table
–Dom
ainindicator
features
–Interpolatedlanguagemodels
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
107
InterpolatedLanguageModels
•Train
onelangu
agemodel
per
corpus
•Com
binethem
byweigh
tingeach
accordingto
itsim
portance
–weigh
tsob
tained
byop
timizingperplexity
ofresultinglangu
agemodel
ontuningset
(not
thesameas
machinetranslationquality)
–modelsarelinearlycombined
•EMSprovides
asection[INTERPOLATED-LM]that
needsto
becommentedou
t
•Alternative:
use
multiple
langu
agemodels
(disadvantage:larger
process,slow
er)
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
108
Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
•Moses
Server
•Dataanddom
ainadaptation
•Instructionsto
decoder
•Inputform
ats
•Outputform
ats
•Increm
entalTraining
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
109
SpecifyingTranslationswithXML
•Translationtablesfornumbers?
fe
p(f|e)
2003
2003
0.74
322003
2000
0.04
212003
year
0.02
122003
the
0.01
752003
...
...
•Instruct
thedecoder
withXMLinstruction
therevenuefor<numtranslation="20
03">
2003</num>
ishigher
than...
•Dealwithdifferentnumber
form
ats
ererzielte<numtranslation="17.55">
17,55</num>
Punkte
.
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
110
SpecifyingTranslationswithXML
./moses-xml-input[exclusive|inclusive|constraint]
therevenuefor<numtranslation="20
03">
2003</num>
ishigher
than...
Threetypes
ofXMLinput:
•Exclusive
Only
possible
translationisgivenin
XML
•Inclusive
Translationisgivenin
XMLisin
additionto
phrase-table
•Con
straint
Only
use
translationsfrom
phrase-table
ifitmatch
XMLspecification
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
111
ConstraintXML
•Specifically
fortranslatingterm
inolog
y
–consistentlytranslateparticularphrase
inadocument
–may
havelearned
larger
phrase
pairs
that
contain
term
inolog
yterm
•Example:
Microsoft<optiontranslation="W
indow
s">
Window
s</option>
8...
•Allowsuse
ofphrase
pairon
lyifmapsW
indow
sto
Window
s
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
112
Placeholders
•Translate:
–You
oweme100dollars
!–
You
oweme200dollars
!–
You
oweme9.56dollars
!
•Problem:needtranslationsfor
–100
–200
–9.56
•Som
ethings
arebetteroff
beinghandledby
simple
rules:
–Numbers
–Dates
–Currency
–Nam
edentities
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
113
Placeholders
•Input
You
oweme100dollars
!
•Replace
numberswith@num@
You
oweme@num@
dollars
!
•Specification
You
oweme<netranslation="@num@"entity="10
0">@num@</ne>
dollars
!
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
114
Walls
andZones
•Specification
ofreorderingconstraints
•Zon
e
sequence
tobetranslated
withou
treorderingwithou
tsidematerial
•Wall
hardreorderingconstraint,nowordsmay
bereordered
across
•Localwall
wallwithin
azone,
not
valid
outsidezone
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
115
Walls
andZones:Exa
mples
•Requiringthetranslationof
quoted
materialas
ablock
Hesaid<zone>
”yes
”</zone>.
•Hardreorderingconstraint
Number
1:<wall/>
thebeginning.
•Localhardreorderingconstraintwithin
zone
Anew
plan<zone>
(<wall/>
maybenotnew<wall/>
)</zone>
emerged
.
•Nesting
The<zone>
”new<zone>
(old)</zone>
”</zone>
proposal.
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
116
PreservingMarkup
•How
doyoutranslatethis:
<h1>
MyHomePage<
/h1>
Ireallylike
to<b>eat<
/b>
chicken!
•Solution
1:XMLtranslations,wallsandzones
<xtranslation="<h1>"/><wall/>
MyHomePage<wall/>
<xtranslation="</h1>"/>
Ireallylike
to<zone><xtranslation="<b>"/><wall/>
eat<wall/>
<xtranslation="</b
>"/></zone>
chicken
!
(note:
specialXMLcharacters
like<
and>
needto
beescaped)
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
117
PreservingMarkup
•Solution
2:Handle
marku
pexternally
–trackwordpositionsandtheirmarku
pI
really
like
to<b>eat<
/b>
chicken
!1
23
45
67
--
--
<b>
--
–translatewithou
tmarku
p Ireallylike
toeatchicken
!
–keep
wordalignmentto
source
Ich
esse
wirklich
gerne
Huhnchen
!1
52
3-4
67
–re-insert
marku
p Ich<b>esse</b
>wirklich
gerneHuhnchen
!
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
118
Transliteration
•Langu
ages
arewritten
indifferentscripts
–Russian,BulgarianandSerbian-written
inCyrillic
script
–Urdu,FarsiandPashto
-written
inArabic
script
–Hindi,MarathiandNepalese-written
inDevanagri
•Transliterationcanbeusedto
translateOOVsandNam
edEntities
•Problem:Transliterationcorpusisnot
alwaysavailable
•Naive
Solution
:
–Crawltrainingdatafrom
Wikipedia
titles
–Build
character-based
transliterationmodel
–Replace
OOVwordswith1-besttransliteration
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
119
Transliteration
•2methodsto
integrateinto
MT
•Post-decodingmethod
–Use
langu
agemodel
topickbesttransliteration
–Transliterationfeatures
•In-decodingmethod
–Integratetransliterationinsidedecoder
–Wordscanbetranslated
ORtransliterated
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
120
Transliteration
•EMS:
[TRAINING]
transliteration-module="yes"
•Post-processingmethod
post-decoding-transliteration="yes"
•In-decodingmethod
in-decoding-transliteration="yes"
transliteration-file=/listofwordstobetransliterated/
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
121
Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
•Moses
Server
•Dataanddom
ainadaptation
•Instructionsto
decoder
•Inputform
ats
•Outputform
ats
•Increm
entalTraining
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
122
Exa
mple:Missp
eltW
ords
•Misspeltsentence:
Theroom
was*exellentbutthehallway
was
*filty.
•Strategiesfordealingwithspellingerrors:
–Createcorrectsentence
withcorrection
×prob
lem:ifnot
correctedprop
erly,addsmoreerrors
–Createmanysentenceswithdifferentcorrection
s×
prob
lem:haveto
decodeeach
sentence,slow
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
123
ConfusionNetwork
Theroom
was*exellentbutthehallway
was
*filty.
Inputto
decoder:
Let
thedecoder
decide
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
124
Exa
mple:Diacritics
•Correct
sentence
•Som
ethinganon
-nativepersonmighttype
TrungQuoccanhbao
Myve
luat
tien
te
•Con
fusion
network
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
125
ConfusionNetwork
Specifica
tion
Argumenton
commandline
./moses-inputtype1
Inputto
moses
the1.0
room1.0
was1.0
excel0.33excellent0.33excellence0.33
but1.0
the1.0
hallway1.0
was1.0
guilty0.5filthy0.5
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
126
Lattice
Exa
mple:ChineseW
ord
Seg
men
tation
•Unsegm
entedsentence
•Incorrectsegm
ention
•Correct
segm
ention
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
127
Lattice
Inputto
decoder:
Let
thedecoder
decide
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
128
Exa
mple:CompoundSplitting
•Inputsentence
einen
wettbew
erbsbedingtenpreissturz
•Differentcompou
ndsplits
•Let
thedecoder
decide
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
129
LatticeSpecifica
tion
Com
mandlineargu
ment
Inputto
Moses
(PLFform
at-Python
Lattice
Format)
./moses-inputtype1
(
(
(’einen’,1.0,1),
),
(
(’wettbewerbsbedingten’,0.5,2),
(’wettbewerbs’,0.25,1),
(’wettbewerb’,0.25,1),
),
(
(’bedingten’,1.0,1),
),
(
(’preissturz’,0.5,2),
(’preis’,0.5,1),
),
(
(’sturz’,1.0,1),
),
)
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
130
Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
•Moses
Server
•Dataanddom
ainadaptation
•Instructionsto
decoder
•Inputform
ats
•Outputform
ats
•Increm
entalTraining
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
131
N-B
estList
•Input
esgibtverschiedeneanderemeinungen
.
•BestTranslation
therearevariousdifferentopinions.
•Nextninebesttranslationstherearevariousother
opinions.
therearedifferentdifferentopinions.
thereareother
differentop
inions.
weare
variousdifferentopinions.
therearevariousother
opinionsof.
itis
variousdifferentop
inions.
therearedifferentother
opinions.
itis
variousother
opinions.
itis
adifferentop
inions.
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
132
UsesofN-B
estLists
•Let
thetranslator
choosefrom
possible
translations
•Reranker
–addmorekn
owledge
sources
–cantake
glob
alview
–coherency
ofwholesentence
–coherency
ofdocument
•Usedto
tunecompon
entweigh
ts
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
133
N-B
estLists
inMoses
Argumentto
commandline
./moses-n-bestlistn-best.file.txt[distinct]100
Output
0|||therearevariousdifferentopinions.|||d:0lm:-21.6664w:-6...|||-113.734
0|||therearevariousotheropinions.|||d:0lm:-25.3276w:-6...|||-114.004
0|||therearedifferentdifferentopinions.|||d:0lm:-27.8429w:-6...|||-117.738
0|||thereareotherdifferentopinions.|||d:-4lm:-25.1666w:-6...|||-118.007
0|||wearevariousdifferentopinions.|||d:0lm:-28.1533w:-6...|||-118.142
0|||therearevariousotheropinionsof.|||d:0lm:-33.7616w:-7...|||-118.153
0|||itisvariousdifferentopinions.|||d:0lm:-29.8191w:-6...|||-118.222
0|||therearedifferentotheropinions.|||d:0lm:-30.426w:-6...|||-118.236
0|||itisvariousotheropinions.|||d:0lm:-32.6824w:-6...|||-118.395
0|||itisadifferentopinions.|||d:0lm:-20.1611w:-6...|||-118.434
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
134
Sea
rchGraph
•Input
ergehtja
nichtnach
hause
•Return
internal
structure
from
thedecoder
•Encodemillionsof
other
possible
translations
(every
paththrough
thegraph=
1translation)
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
135
UsesofSea
rchGraphs
•Let
thetranslator
choose
–Individual
words
orphrases
–’Sugg
est’nextphrase
•Reranker
•Used
totune
compon
ent
weigh
ts
–Morediffi
cultthan
with
n-bestlist
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
136
Sea
rchGraphsin
Moses
Argumentto
commandline
./moses-output-search-graphsearch-graph.file.txt
Argumentto
commandline
0hyp=0stack=0forward=36fscore=-113.734
0hyp=75stack=1back=0score=-104.943...covered=5-5out=.
0hyp=72stack=1back=0score=-8.846...covered=4-4out=opinions
0hyp=73stack=1back=0score=-10.661...covered=4-4out=opinionsof
•hyp
-hyp
othesisid
•stack-how
manywordshavebeentranslated
•score-totalweigh
tedscore
•covered-whichwordsweretranslated
bythishyp
othesis
•ou
t-target
phrase
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
137
Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
•Moses
Server
•Dataanddom
ainadaptation
•Instructionsto
decoder
•Inputform
ats
•Outputform
ats
•Increm
entalTraining
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
138
Adva
ncedFea
tures
•FasterTraining
•FasterDecoding
•Moses
Server
•Dataanddom
ainadaptation
•Instructionsto
decoder
•Inputform
ats
•Outputform
ats
•Increm
entalTraining
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
139
Increm
entalTraining
sour
ce te
xt
MT
tran
slat
ion
hum
an tr
ansl
atio
n
MT
engi
ne
post-edit
re-train
translate
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
140
Increm
entalTraining
•Increm
entalwordalignment
–requires
modified
versionof
GIZA++
(available
athttp://code.go
ogle.com
/p/inc-giza-pp/)
–on
lyworks
forHMM
alignment(not
thecommon
IBM
Model
4)
•Translationmodel
isdefined
byparallelcorpus
PhraseDictionaryBitextSampling\
path=/path/to/corpus\
L1=sourcelanguageextension\
L2=targetlanguageextension
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
141
Update
Word
Alig
nmen
t
•Usesoriginal
wordalignmentmodels
(withadditional
model
filesstored
aftertraining)
•Increm
entalGIZA++
loadsmodel
•New
sentence
pairs
isaligned
onthefly
•Typically,GIZA++
processes
arerunin
bothdirection
s,symmetrized
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
142
Update
TranslationModel
•Translationtable
isstored
asword-aligned
parallelcorpus
•Update=
addwordaligned
sentence
pair
•UpdatingarunningMoses
instance
viaXMLRPC
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
143
BeyondMoses:
CASMACAT
Workben
ch
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14
144
Ack
nowledgem
ents
Hoang,
Huck
andKoehn
MachineTranslationwithOpen
Sou
rceSoftware
Octob
er20
14