Motivation on Ontology Translation
Business Information query in English fixed assets@en ; Vermögensgegenstände@de ; activo fijo@es ...
Outline
• Statistical Machine Translation (SMT)
– Insights into SMT
• Translation Models (TM) / Language model (LM)
• Word alignment -> Phrase-Based SMT
• Ontology Translation and SMT
– Domain adaptation, Term Identification
– UNLP SMT Demos
• TeTra, OTTO, Iris
Insights into SMT
• Statistical Machine Translation
– Translation Model
• lexical correspondence between languages
• example entry: fixed asset | anlagevermögens | 0.003 0.003 0.029 0.102
– Language Model
• takes care of fluency (and lexical choice) in the target language
• example entry: -4.868038 anlagevermögens -0.1317768
Training Data (Parallel Corpora) for SMT
Source document Target document
• Sentence aligned parallel data
Training Data for SMT
• Parallel data
sentence one with some words | satz eins mit einigen wörtern
sentence two with more words | satz zwei mit weiteren wörtern
...
Word Alignment
Word Alignment Scenarios
Word Alignment with IBM Models
Model | Function of the model
IBM Model 1 | lexical translation
IBM Model 2 | adds absolute reordering model
IBM Model 3 | adds fertility model
IBM Model 4 | adds relative alignment model
IBM Model 5 | fixes deficiency
IBM Model 1
Training data:
Source Document | Target Document
das Haus | the house
das Buch | the book
ein Buch | a book
IBM Model 1
Training data: das haus | the house ; das buch | the book ; ein buch | a book
Translation probabilities t(e|f) per training iteration:
e | f | Initial | 1st iter. | 2nd iter. | 3rd iter. | ... | Final
the | das | 0.25 | 0.50 | 0.6364 | 0.7479 | ... | 1
book | das | 0.25 | 0.25 | 0.1818 | 0.1208 | ... | 0
house | das | 0.25 | 0.25 | 0.1818 | 0.1313 | ... | 0
the | buch | 0.25 | 0.25 | 0.1818 | 0.1208 | ... | 0
book | buch | 0.25 | 0.50 | 0.6364 | 0.7479 | ... | 1
a | buch | 0.25 | 0.25 | 0.1818 | 0.1313 | ... | 0
book | ein | 0.25 | 0.50 | 0.4286 | 0.3466 | ... | 0
a | ein | 0.25 | 0.50 | 0.5714 | 0.6534 | ... | 1
the | haus | 0.25 | 0.50 | 0.4286 | 0.3466 | ... | 0
house | haus | 0.25 | 0.50 | 0.5714 | 0.6534 | ... | 1
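The iteration table above can be reproduced with a few lines of expectation-maximisation. The following is a minimal sketch, assuming the standard IBM Model 1 update without a NULL word; the function name `train_ibm1` and the data layout are illustrative choices, not part of the original slides.

```python
from collections import defaultdict

# Toy corpus from the slides: (foreign words, English words)
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

def train_ibm1(corpus, iterations):
    """Return t[(e, f)], the probability of English word e given foreign word f."""
    e_vocab = {e for _, es in corpus for e in es}
    # Uniform start: 1 / |English vocabulary| for every co-occurring word pair
    t = {(e, f): 1.0 / len(e_vocab) for fs, es in corpus for e in es for f in fs}
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (e, f) links
        total = defaultdict(float)  # expected count of f being linked at all
        # E-step: distribute each English word over the foreign words of its sentence
        for fs, es in corpus:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    c = t[(e, f)] / norm
                    count[(e, f)] += c
                    total[f] += c
        # M-step: renormalise the expected counts per foreign word
        t = {(e, f): count[(e, f)] / total[f] for (e, f) in count}
    return t

for n in (1, 2, 3):
    t = train_ibm1(corpus, n)
    print(n, round(t[("the", "das")], 4), round(t[("house", "haus")], 4))
```

Each additional iteration moves probability mass towards the pairs that co-occur consistently, which is why t(the|das) rises while t(book|das) falls.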
IBM Model 1
Lexical (word) probabilities (after 10 iterations):
buch: book 0.9933, a 0.0046, the 0.0020
haus: house 0.9172, the 0.0827
das: the 0.9933, house 0.0046, book 0.0020
ein: a 0.9172, book 0.0827
IBM Model 1
Decoding (translating) using the lexical probabilities:
das haus: the house 0.25, the the 0.01
das buch: the book 0.25
ein buch: a book 0.25, book book 0.01
ein haus: a house 0.25, book house 0.01
Language Ambiguity in SMT
Source language | Target language
freundliche1 bank2 | friendly1 bank2
gemütliche3 bank4 | cosy3 bench4
freundliche1 | friendly1
gemütliche5 | cosy5
schlechte6 | bad6
Input: schlechte6 bank2
bad6 bench4 | 0.1239
bad6 bank2 | 0.1239
Generic Models in SMT
Source language | Target language
freundliche1 bank2 | friendly1 bank2
gemütliche3 bank4 | cosy3 bench4
freundliche1 | friendly1
gemütliche5 | cosy5
schlechte6 | bad6
grüne7 bank4 | green7 bench4
Input: schlechte6 bank2
bad6 bench4 | 0.1918
bad6 bank2 | 0.0581
Domain-Specific Models in SMT
Source language | Target language
freundliche1 bank2 | friendly1 bank2
gemütliche3 bank4 | cosy3 bench4
freundliche1 | friendly1
gemütliche5 | cosy5
schlechte6 | bad6
multinationale7 bank2 | multinational7 bank2
Input: schlechte6 bank2
bad6 bank2 | 0.1918
bad6 bench4 | 0.0581
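The effect of the extra training pair in the generic and domain-specific examples can be sketched with simple relative-frequency counts. This is an illustration only: it assumes every sentence pair is aligned word by word in order, drops the index digits from the slides, and does not reproduce the scores 0.1918 and 0.0581 shown above. The helper name `lexical_probs` is hypothetical.

```python
from collections import Counter, defaultdict

def lexical_probs(pairs):
    """Relative-frequency estimate of p(target word | source word),
    assuming a monotone one-to-one word alignment within each pair."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            counts[s][t] += 1
    return {s: {t: c / sum(ts.values()) for t, c in ts.items()}
            for s, ts in counts.items()}

# Sentence pairs shared by both slides
base = [
    ("freundliche bank", "friendly bank"),
    ("gemütliche bank", "cosy bench"),
    ("freundliche", "friendly"),
    ("gemütliche", "cosy"),
    ("schlechte", "bad"),
]
generic = base + [("grüne bank", "green bench")]
financial = base + [("multinationale bank", "multinational bank")]

p_generic = lexical_probs(generic)["bank"]
p_financial = lexical_probs(financial)["bank"]
print("generic:  ", p_generic)
print("financial:", p_financial)
```

A single additional in-domain sentence pair is enough to flip the preferred translation of "bank" in this toy setting, which is the intuition behind domain adaptation.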
From Word to Phrase Based SMT
Phrase pairs (source, target) of increasing length:
• Maria, Mary
• no, did not
• daba una bofetada, slap
• a la, the
• bruja, witch
• verde, green
• Maria no, Mary did not
• no daba una bofetada, did not slap
• daba una bofetada a la, slap the
• bruja verde, green witch
• Maria no daba una bofetada, Mary did not slap
• no daba una bofetada a la, did not slap the
• a la bruja verde, the green witch
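Phrase pairs like the ones listed for the "Maria no daba una bofetada a la bruja verde" example are typically read off a word alignment. The sketch below shows a simplified version of consistent phrase-pair extraction; the word alignment itself is an assumption made for illustration (the slides do not give it), and the function does not handle unaligned words.

```python
def extract_phrases(src, tgt, alignment, max_len=7):
    """Collect phrase pairs consistent with the word alignment:
    no word inside the pair may be aligned to a word outside it."""
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(len(src), s_start + max_len)):
            linked = [t for s, t in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Reject if a target word in the span links outside the source span
            if any(t_start <= t <= t_end and not s_start <= s <= s_end
                   for s, t in alignment):
                continue
            pairs.add((" ".join(src[s_start:s_end + 1]),
                       " ".join(tgt[t_start:t_end + 1])))
    return pairs

src = "Maria no daba una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
# Assumed alignment as (source index, target index) links
alignment = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (5, 4), (6, 4), (7, 6), (8, 5)]

phrases = extract_phrases(src, tgt, alignment)
for pair in sorted(phrases, key=lambda p: (len(p[0].split()), p[0])):
    print(pair)
```

Note that "daba" on its own is not extracted, because "slap" is also linked to "una" and "bofetada"; only the full phrase "daba una bofetada" is consistent with the alignment.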
Generic Models in SMT
Source language | Target language
freundliche1 bank2 | friendly1 bank2
gemütliche3 bank4 | cosy3 bench4
freundliche1 | friendly1
gemütliche5 | cosy5
schlechte6 | bad6
grüne7 bank4 | green7 bench4
Input: freundliche1 bank2
friendly1 bench4 | 0.1576
friendly1 bank2 | 0.0581
Phrase-Based SMT
Source language | Target language
schlechte_bank1 | bad_bank1
gemütliche_bank2 | cosy_bench2
freundliche3 | friendly3
gemütliche4 | cosy4
schlechte5 | bad5
grüne_bank6 | green_bench6
Input: schlechte_bank1
bad_bank1 | 1
Why are phrases better?
Prime Minister Ayrault said: "It's incredible that an allied country like the United States at this point goes as far as spying on private communications that have no strategic justification, no justification on the basis of national defence."
Premierminister | ayrault | sagte: |"es | ist unglaublich |, dass eine | verbündete | Land wie die | Vereinigten Staaten | an diesem Punkt | geht so | weit wie | Spionage | auf private | Mitteilungen | , dass | keine strategische |Gründe | , keine | Begründung | auf der Grundlage der nationalen | Verteidigung. |"
Decoding - Finding the best path
Best translation = the candidate with the highest Translation Model probability * Language Model probability (1)
(1) plus other model scores
Candidate translations with their scores:
he 's not home | -3.09794
he 's not home . | -3.26325
he is not home | -3.27113
he can 't go home | -3.32145
he 's not home , | -3.48073
he is not to go home | -3.48158
he is not home . | -3.48415
he won 't go home | -3.5298
it is not home | -3.55166
he 's not going home | -3.57796
- he 's not home | -3.58466
he is not go home | -3.59997
he 's not at home | -3.62157
he is not going home | -3.63351
he can 't go home . | -3.6497
he 's not go home | -3.65487
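The product of translation-model and language-model probabilities can be sketched in log space, where it becomes a sum. In the example below the translation-model scores are the tied values from the language ambiguity slide, while the language-model probabilities are hypothetical numbers chosen only to show how the language model can break the tie.

```python
import math

# Translation model: both candidates are tied (values from the ambiguity slide)
tm = {"bad bench": 0.1239, "bad bank": 0.1239}
# Language model: hypothetical probabilities, assumed for illustration only
lm = {"bad bench": 0.0001, "bad bank": 0.001}

def score(hypothesis):
    """log(TM * LM) = log TM + log LM; higher is better."""
    return math.log(tm[hypothesis]) + math.log(lm[hypothesis])

best = max(tm, key=score)
for h in sorted(tm, key=score, reverse=True):
    print(f"{h}\t{score(h):.4f}")
print("best:", best)
```

Working with log scores avoids numerical underflow when many small probabilities are multiplied, and it explains why ranked candidate lists usually show negative numbers.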
Outline
• Statistical Machine Translation (SMT)
– Insights into SMT
• Word alignment
• Word alignment -> Phrase-Based SMT
• Ontology Translation and SMT
– Domain adaptation, Term Identification
– UNLP SMT Demos:
• TeTra, OTTO, Iris
[Figure: SMT decoder with language tags @en, @de, @nl, @it, @es]
Terminological Injection into SMT
Changes in equity attributable to owners of parent
Änderungen im Eigenkapital , das den Eigentümern des Mutterunternehmens zuzurechnen ist
Google Translate: Änderungen im Eigenkapital, das den Eigentümern des Mutterunter
TeTra – Term Translation System
http://server1.nlp.insight-centre.org/tetra/
Financial Domain (IFRS Ontology)
Source term: Equity1 And2 Liabilities3
Generic translation model: Gerechtigkeit1a und2 Verbindlichkeiten3
Financial translation model: Eigenkapital1b und2 Schulden3
1a) Something that is just and fair.
1b) Ownership interest in a corporation, property, or other holding
Medical Domain (ICD Ontology)
Source term: Birth1 injury2 to3 scalp4
Generic translation model: Geburt1 schädigung2 zu3 skalpieren4
Medical translation model: Verletzung2 auf3 der3 Kopfhaut4
OTTO – OnTology TranslatiOn System
http://server1.nlp.insight-centre.org/otto/
IRIS – English-Irish Translation System
http://server1.nlp.insight-centre.org/iris/
Statistical Machine Translation for domain-specific vocabulary translation
Mihael Arčan