linear programming and the worst-case analysis of greedy

Institute of Natural Language Processing University of Stuttgart
www.uni-stuttgart.de
Hindi-Urdu Background
Share large proportion of vocabulary inherited from Sanskrit
– Most of the verbs and closed class words are the same
Have lived together for centuries and allowed mixing of word
inventories
Have similar sound system
An initial study on small BBC corpus of 5000 Hindi words revealed
that 62% of the types are transliterated
Some Hindi characters have multiple orthographic
equivalents in Urdu
– \s\ sound is represented by a and in Urdu
Sometimes multiple orthographic equivalents of a Hindi
word are all valid Urdu words:
– (sur@t d) <-> (Chapter of Koran) or (Face/Condition)
translate or transliterate
and transliterated to (Shanti)
pit them against regular translations on the fly
and hope that the language model is able to decide
–Whether to translate or transliterate given the context
–Which transliteration to choose given the context
model:
model and character-based model
pc(hi,ui), a joint character model
– ith Hindi character
Filtering is done with the help of edit distance algorithm
Costs of insertion, deletion and replacement are tuned on held-out
data
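A minimal sketch of edit-distance filtering with tunable operation costs. It assumes both words have already been mapped to a common romanized form, and the function names, the normalization by word length, and the 0.5 threshold are illustrative assumptions, not values from the slides.

```python
def weighted_edit_distance(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Levenshtein distance with a separate cost for each operation."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * del_cost          # delete every source character
    for j in range(1, cols):
        dist[0][j] = j * ins_cost          # insert every target character
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,
                dist[i][j - 1] + ins_cost,
                dist[i - 1][j - 1] + (0.0 if same else sub_cost),
            )
    return dist[-1][-1]


def keep_pair(source, target, max_normalized_distance=0.5, **costs):
    """Keep a candidate pair if its length-normalized distance is small enough.

    The threshold is a hypothetical parameter that would be tuned on held-out data.
    """
    longest = max(len(source), len(target), 1)
    return weighted_edit_distance(source, target, **costs) / longest <= max_normalized_distance
```

Tuning then amounts to searching over `ins_cost`, `del_cost`, `sub_cost` and the threshold for the combination that filters the held-out pairs best.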
Language Model
Smoothing
To control the tradeoff between LM-known and LM-unknown
transliterations
interpolate joint probabilities
No Reordering
25-best transliterations are computed at lower level
At higher level transliteration probabilities are interpolated with 20-best translation probabilities
Recombination and Histogram pruning with a stack size of 100
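The two search-space reductions named above can be sketched as follows. This is an illustration under assumptions: a hypothesis is represented as a hypothetical `(state, log_prob)` pair, where `state` stands for whatever the decoder needs for future scoring. Only the stack size of 100 comes from the slides.

```python
import heapq


def recombine(hypotheses):
    """Recombination: keep only the best-scoring hypothesis for each state."""
    best = {}
    for state, log_prob in hypotheses:
        if state not in best or log_prob > best[state]:
            best[state] = log_prob
    return list(best.items())


def histogram_prune(hypotheses, stack_size=100):
    """Histogram pruning: keep at most `stack_size` highest-scoring hypotheses."""
    return heapq.nlargest(stack_size, hypotheses, key=lambda hyp: hyp[1])


def prune_stack(hypotheses, stack_size=100):
    """Apply recombination first, then histogram pruning."""
    return histogram_prune(recombine(hypotheses), stack_size)
```

Recombining before pruning means that duplicate states do not take up slots in the stack.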
Word alignment using Giza++
And also for the extraction of transliteration corpus
Monolingual Urdu corpus consists of roughly 114K sentences
108K sentences: data obtained from Leipzig University
Rest is the Urdu part of the extracted parallel sentences
1400 test sentences
Split the test into two halves
Optimize on first and test on second
Then optimize on second and test on first
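The two-way split described above can be sketched as below. `optimize` and `evaluate` are hypothetical callables standing in for weight tuning and scoring, and the slides do not say how the two scores are combined, so the sketch returns both.

```python
def two_fold_evaluation(test_sentences, optimize, evaluate):
    """Tune on one half of the test set and score on the other, in both directions.

    optimize(sentences) -> weights           (hypothetical tuning step)
    evaluate(weights, sentences) -> score    (hypothetical scoring step)
    """
    middle = len(test_sentences) // 2
    first, second = test_sentences[:middle], test_sentences[middle:]
    score_on_second = evaluate(optimize(first), second)
    score_on_first = evaluate(optimize(second), first)
    return score_on_second, score_on_first
```

With 1400 test sentences each half holds 700, and no sentence is ever scored with weights that were tuned on it.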
Pb0 : Running Moses with default settings and no
reordering
Pb1 : All OOV words in the output of Pb0 are replaced by
transliterations
training corpus and retrain Moses
M1 = Conditional Model , M2 = Joint Model
M     Pb0    Pb1    Pb2    M1     M2
BLEU  14.3   16.25  16.13  18.6   17.05
Problem: Lots of errors occur because the data is sparse
and noisy
also transliterations
transliteration that has best probability given by pc(hi,ui)/pc(ui)
• Heuristic: In case of an unknown word we drop the
denominator pc(ui)
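A rough sketch of how the score pc(hi,ui)/pc(ui) and the heuristic above might fit together. The slides do not define "unknown" precisely, so membership in a hypothetical `known_words` set stands in for it. The function names and the candidate table are likewise illustrative assumptions.

```python
def transliteration_score(joint_prob, marginal_prob, is_known):
    """Score a candidate as joint / marginal; drop the denominator if unknown."""
    if is_known and marginal_prob > 0.0:
        return joint_prob / marginal_prob
    return joint_prob  # heuristic: the denominator is dropped


def best_transliteration(candidates, known_words):
    """Pick the highest-scoring candidate.

    candidates maps each Urdu candidate to a (joint_prob, marginal_prob) pair.
    known_words is a hypothetical vocabulary used to decide what counts as known.
    """
    return max(
        candidates,
        key=lambda urdu: transliteration_score(*candidates[urdu], urdu in known_words),
    )
```

Dropping the denominator keeps a candidate from being ranked by an unreliable marginal estimate.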
Problem: For TM-unknown transliteration options the interpolating factor λ cancels out
Transliterations are sometimes incorrectly favored
Heuristic: For TM-unknown words assign a probability β to the word prior
pw(ui)
high number of common vocabulary
     No Heuristic  H1     H2     H12
M1   18.6          18.86  18.97  19.35
M2   17.05         17.56  17.85  18.34
H3 H13 H23 H123
related languages – Thai – Lao
help us solve disambiguation problem
Joint-probability model works as well as conditional
probability model
Different Transliterations in Different Contexts
Lion is the king of jungle
Translate or Transliterate
Even then he can’t live peacefully
Om Shanti Om is Farha khan’s second film
Total extracted alignment pairs 107323
93176 1-1/1-N alignments
5743 N-1 alignments
8404 M-N alignments
A manual inspection of 1000 N-1 and M-N alignment pairs
showed
More than 70% are totally or partially wrong
Most of the correct 30% alignments can be broken into 1-1
and 1-N alignments
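As a quick check on the alignment counts above, the snippet below confirms that the three categories add up to the reported total and computes the share of each. Only the counts come from the slides, and the variable names are illustrative.

```python
# Alignment-pair counts as reported on the slide.
alignment_counts = {
    "1-1/1-N": 93176,
    "N-1": 5743,
    "M-N": 8404,
}

total_pairs = sum(alignment_counts.values())

# Fraction of all extracted pairs that falls into each category.
alignment_shares = {
    kind: count / total_pairs for kind, count in alignment_counts.items()
}

for kind, share in alignment_shares.items():
    print(f"{kind}: {share:.1%}")
```

The N-1 and M-N categories that the manual inspection found unreliable together account for 14147 of the extracted pairs.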
Derivational affixes vs. a (Beautiful)
Lack of training data – only 7000 sentences
