SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages
St. Petersburg, Russia
www.kit.edu
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
Nimm deine Lieblingsmusik überall hin mit kommt der iPod shuffle mit Speicher genug für hunderte von Songs alle wichtigen Songs fürs Training Wiedergabelisten Genius Mixes Podcasts und Hörbücher
Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation:
A Case Study on our German IT Corpus
Sebastian Leidig, Tim Schlippe, Tanja Schultz
2 15-May-2014
Motivation
Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios
From Microsoft's German website www.microsoft.de:
“Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an.”
“Internet Explorer nutzt Hardwarebeschleunigung. Websites werden schneller geladen, damit Sie noch reibungsloser surfen können.”
Motivation
With globalization, words from other languages enter a language without assimilating to its phonetic system
To build up lexical resources economically with automatic or semi-automatic methods:
detect such words and treat them separately
Overview
Input: word list (word1, word2, word3, word4, word5, word6)
Features:
grapheme perplexity
G2P confidence
Hunspell lookup (native)
Hunspell lookup (English)
Wiktionary lookup
Google hit count
Combination:
voting
decision tree
SVM
Output: classification
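The voting combination in this overview can be sketched as a simple majority vote over binary feature decisions. This is a minimal sketch and not the authors' implementation: the function names, the tie-breaking rule (a tie counts as native) and the example votes are assumptions made for illustration only.

```python
from typing import Dict, List


def majority_vote(votes: List[bool]) -> bool:
    """Return True (English) if strictly more than half of the features vote English."""
    return sum(votes) * 2 > len(votes)


def classify_word_list(feature_votes: Dict[str, List[bool]]) -> Dict[str, str]:
    """Map each word to "English" or "native" based on its feature votes."""
    return {
        word: "English" if majority_vote(votes) else "native"
        for word, votes in feature_votes.items()
    }


# Hypothetical votes, one per feature, in the order: grapheme perplexity,
# G2P confidence, Hunspell (native), Hunspell (English), Wiktionary, Google hit count.
example_votes = {
    "Software": [True, True, False, True, True, True],
    "Websites": [True, False, True, True, False, True],
    "geladen": [False, False, False, True, False, False],
}

labels = classify_word_list(example_votes)
for word, label in labels.items():
    print(word, label)
```

A decision tree or SVM would take the same per-feature decisions (or the underlying scores) as its input vector; voting has the advantage of needing no labeled training data.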
Outline
1. Motivation and Overview
2. Test Sets
3. Single Features
4. Combinations
5. Summary and Future Work
Test Sets - Domains
German IT website: www.microsoft.de
4.6k unique words
German general news: www.spiegel.de
6.6k unique words
Afrikaans: NCHLT corpus (Heerden, Davel, Barnard, 2013), (Basson, Davel, 2013)
9.4k unique words
Test Sets - Domains
Tag for “English”:
e.g. Software, Brain, …
Foreign hybrids:
Compound words, e.g. Schadsoftware, …
Grammatically adapted words, e.g. downloaden, …
Decisions based on:
Agreement of annotators
duden.de
Different word categories:
Abbreviations, e.g. UV, CIA, …
Other foreign words, e.g. Français, Niveau, …
Foreign words in different test sets
Single Features – Design Criteria
Features trained on commonly available resources: word lists, pronunciation dictionaries, spellchecker dictionaries, Wiktionary, Google
Thresholds without supervised training: comparison between English and native models
New approaches
Grapheme Perplexity
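One way to realize a grapheme perplexity feature without a supervised threshold is to train one grapheme n-gram model on English words and one on native words, then compare the perplexity each model assigns to a candidate word. The sketch below assumes bigram models with add-one smoothing and tiny toy word lists; the n-gram order, the smoothing and the training data are illustrative assumptions, not details taken from the talk.

```python
import math
from collections import Counter
from typing import Iterable, List


class GraphemeBigramModel:
    """Grapheme bigram model with add-one smoothing (an assumed setup)."""

    def __init__(self, words: Iterable[str]) -> None:
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.alphabet = {"</w>"}
        for word in words:
            symbols = self._symbols(word)
            self.alphabet.update(symbols[1:])
            for prev, cur in zip(symbols, symbols[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    @staticmethod
    def _symbols(word: str) -> List[str]:
        # Word boundary markers let the model learn typical beginnings and endings.
        return ["<w>"] + list(word.lower()) + ["</w>"]

    def perplexity(self, word: str) -> float:
        symbols = self._symbols(word)
        vocab_size = len(self.alphabet) + 1  # one extra slot for unseen graphemes
        log_prob = 0.0
        for prev, cur in zip(symbols, symbols[1:]):
            prob = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + vocab_size)
            log_prob += math.log(prob)
        return math.exp(-log_prob / (len(symbols) - 1))


def classify_by_perplexity(
    word: str, english_model: GraphemeBigramModel, native_model: GraphemeBigramModel
) -> str:
    """Tag a word as English if the English model is less perplexed than the native one."""
    if english_model.perplexity(word) < native_model.perplexity(word):
        return "English"
    return "native"


# Toy training lists (assumptions); real models would use large word lists.
english_model = GraphemeBigramModel(
    ["software", "download", "browser", "website", "update", "computer"]
)
native_model = GraphemeBigramModel(
    ["schneller", "geladen", "einfach", "beschleunigung", "können", "anzeigen"]
)

print(classify_by_perplexity("software", english_model, native_model))
print(classify_by_perplexity("geladen", english_model, native_model))
```

Because both models score the same word, the comparison needs no tuned threshold, which matches the design criterion of comparing English and native models.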
Grapheme-to-Phoneme Confidence
Phonetisaurus confidence
scores (costs)
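A possible reading of this feature is to run the same word through an English-trained and a native-trained G2P model and to compare the two costs. The sketch below only shows that comparison: the cost values are hypothetical placeholders rather than real Phonetisaurus output, and the convention that a lower cost means higher confidence is an assumption.

```python
from typing import Dict


def g2p_confidence_vote(
    word: str, english_costs: Dict[str, float], native_costs: Dict[str, float]
) -> bool:
    """Vote English if the English G2P model converts the word at lower cost."""
    return english_costs[word] < native_costs[word]


# Hypothetical costs for illustration; real values would come from the G2P decoder.
english_costs = {"Browser": 12.4, "Hardwarebeschleunigung": 61.8}
native_costs = {"Browser": 17.9, "Hardwarebeschleunigung": 43.2}

for word in english_costs:
    tag = "English" if g2p_confidence_vote(word, english_costs, native_costs) else "native"
    print(word, tag)
```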
Hunspell Lookup
English lookup: word list (word1, word2, word3, word4) → Hunspell dictionary lookup with derived word forms (spellchecker dictionary, English: Hunspell-en) → classification
German lookup: word list (word1, word2, word3, word4) → Hunspell dictionary lookup with derived word forms (spellchecker dictionary, German: Hunspell-de) → classification
2 features performed best
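The two lookups can be sketched as independent binary features. To stay self-contained, the sketch uses plain Python sets as stand-ins for the word forms derived from Hunspell-en and Hunspell-de; the set contents and function names are hypothetical, and a real system would query Hunspell with its affix rules instead.

```python
from typing import Set


def english_lookup_vote(word: str, english_forms: Set[str]) -> bool:
    """Vote English if the word is found among the English dictionary word forms."""
    return word.lower() in english_forms


def native_lookup_vote(word: str, native_forms: Set[str]) -> bool:
    """Vote English if the word is NOT found among the native dictionary word forms."""
    return word.lower() not in native_forms


# Stand-ins for derived word forms (assumptions for illustration).
english_forms = {"browser", "website", "websites", "multitasking", "explorer"}
native_forms = {"schneller", "geladen", "einfaches", "surfen", "browser"}

for word in ["Websites", "Browser", "geladen"]:
    print(word, english_lookup_vote(word, english_forms), native_lookup_vote(word, native_forms))
```

Keeping the two lookups as separate features lets a later combination step handle words such as "Browser", which may be listed in both dictionaries.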
Wiktionary Lookup
Check crowdsourced information from matrix language Wiktionary
Google Hit Count
Based on Alex, B. (2008), “Automatic Detection of English Inclusions in Mixed-lingual Data with an Application to Parsing”, University of Edinburgh
Result: Single Features
On the Spiegel-de test set, a higher ratio of the words classified as English is wrong
Result: Combination
Challenges
Performance after filtering difficult words (oracle)
Conclusion and Future Work
Features based on available sources
New approaches:
G2P confidence
Wiktionary
Further features:
Part-of-speech (POS)
Context, trigger words
Capitalization
Translate and compare
Thank you for your attention!