Post on 01-Apr-2015

SLTU 2014 – 4th Workshop on Spoken Language Technologies for Under-resourced Languages, St. Petersburg, Russia

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association (www.kit.edu)

Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an. Internet Explorer nutzt Hardwarebeschleunigung. Websites werden schneller geladen, damit Sie noch reibungsloser surfen können.

Nimm deine Lieblingsmusik überallhin mit kommt der iPod shuffle mit Speicher genug für hunderte von Songs alle wichtigen Songs fürs Training Wiedergabelisten Genius Mixes Podcasts und Hörbücher

Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation: A Case Study on our German IT Corpus

Sebastian Leidig, Tim Schlippe, Tanja Schultz

2 15-May-2014

Motivation

Combining Grapheme-to-Phoneme Converter Outputs for Enhanced Pronunciation Generation in Low-Resource Scenarios

From Microsoft's German website www.microsoft.de:

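“Zeigen Sie andere Apps für einfaches Multitasking neben dem Browser an.” (English: “Show other apps next to the browser for easy multitasking.”)

“Internet Explorer nutzt Hardwarebeschleunigung. Websites werden schneller geladen, damit Sie noch reibungsloser surfen können.” (English: “Internet Explorer uses hardware acceleration. Websites load faster so that you can browse even more smoothly.”)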

Motivation

With globalization, words from other languages enter a language without being assimilated to the phonetic system of that language.

To build up lexical resources economically with automatic or semi-automatic methods, such words need to be detected and treated separately.

Overview

[Figure: processing pipeline]

Input: word list (word1, word2, …)

Features: grapheme perplexity, G2P confidence, Hunspell lookup (native), Hunspell lookup (English), Wiktionary lookup, Google hit count

Combination: voting, decision tree, SVM

Output: classification
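
The simplest of the combination strategies named in the overview, voting, can be sketched as follows. This is only an illustrative sketch: the function name `majority_vote`, the feature keys, the example votes and the tie-breaking rule (a tie counts as "not English") are assumptions made here and are not taken from the slides.

```python
from typing import Mapping


def majority_vote(feature_votes: Mapping[str, bool]) -> bool:
    """Return True (English) if more than half of the features vote English.

    Assumption: each feature has already been reduced to a binary decision,
    and a tie is resolved as "not English".
    """
    if not feature_votes:
        raise ValueError("at least one feature vote is required")
    english_votes = sum(1 for vote in feature_votes.values() if vote)
    return 2 * english_votes > len(feature_votes)


# Hypothetical votes of the six single features for one word.
example_votes = {
    "grapheme_perplexity": True,
    "g2p_confidence": False,
    "hunspell_native": True,
    "hunspell_english": True,
    "wiktionary": False,
    "google_hit_count": True,
}
is_english = majority_vote(example_votes)  # 4 of 6 features vote English
```

A decision tree or an SVM would replace the fixed counting rule with a classifier trained on the same feature outputs.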

Outline

1. Motivation and Overview

2. Test Sets

3. Single Features

4. Combinations

5. Summary and Future Work

Test Sets - Domains

German IT website (www.microsoft.de): 4.6k unique words

German general news (www.spiegel.de): 6.6k unique words

Afrikaans NCHLT corpus (Heerden, Davel, Barnard, 2013; Basson, Davel, 2013): 9.4k unique words

Test Sets - Domains

Annotated word categories:

English: e.g. Software, Brain, …

Foreign hybrids: compound words (e.g. Schadsoftware, …) and grammatically adapted words (e.g. downloaden, …)

Abbreviations: e.g. UV, CIA, …

Other foreign words: e.g. Français, Niveau, …

Decisions based on: agreement of annotators, duden.de

Foreign words in different test sets

Single Features – Design Criteria

Features trained on commonly available resources: word lists, pronunciation dictionaries, spellchecker dictionaries, Wiktionary, Google

Thresholds without supervised training: comparison between English and native models

New approaches

Grapheme Perplexity
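
One way to picture this feature is to score a word with two grapheme-level language models, one trained on English words and one on native words, and to compare the two perplexities. The sketch below is an assumption-laden illustration, not the setup used in the talk: the class `GraphemeBigramModel`, the bigram order, the add-one smoothing, the toy training lists and the decision rule (English if the English model gives the lower perplexity) are all hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List


class GraphemeBigramModel:
    """Add-one smoothed grapheme bigram model (illustrative only)."""

    def __init__(self, words: Iterable[str]) -> None:
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.alphabet = {"</w>"}
        for word in words:
            symbols = self._symbols(word)
            self.alphabet.update(symbols[1:])
            for prev, cur in zip(symbols, symbols[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    @staticmethod
    def _symbols(word: str) -> List[str]:
        # Word boundary markers let the model learn typical onsets and endings.
        return ["<w>"] + list(word.lower()) + ["</w>"]

    def perplexity(self, word: str) -> float:
        symbols = self._symbols(word)
        vocab = len(self.alphabet) + 1  # one extra slot for unseen graphemes
        log_prob = 0.0
        for prev, cur in zip(symbols, symbols[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + vocab)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(symbols) - 1))


def looks_english(word: str, english: GraphemeBigramModel,
                  native: GraphemeBigramModel) -> bool:
    """Assumed rule: English if the English model is less 'surprised'."""
    return english.perplexity(word) < native.perplexity(word)


# Toy training data; real models would be trained on large word lists.
english_model = GraphemeBigramModel(["software", "browser", "download"])
native_model = GraphemeBigramModel(["schnell", "geladen", "surfen"])
```

With these toy models, `looks_english("browser", english_model, native_model)` is true and `looks_english("geladen", english_model, native_model)` is false. A threshold on the perplexity difference could replace the plain comparison.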

Grapheme-to-Phoneme Confidence

Phonetisaurus confidence scores (costs)

Hunspell Lookup

[Figure: two lookup pipelines, one per spellchecker dictionary]

English (Hunspell-en): word list → Hunspell dictionary lookup, deriving word forms → classification

German (Hunspell-de): word list → Hunspell dictionary lookup, deriving word forms → classification

These 2 features performed best.
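
The two lookups can be read as two independent binary features. The sketch below uses plain Python sets as hypothetical stand-ins for the word forms that Hunspell-en and Hunspell-de would accept; it does not call Hunspell and does not derive word forms from affix rules. The decision rules (found in the English dictionary means an English hint, missing from the native dictionary means an English hint) are assumptions made for illustration.

```python
from typing import Set

# Hypothetical stand-ins for the forms accepted by the two dictionaries.
ENGLISH_FORMS: Set[str] = {"software", "browser", "download", "downloads"}
GERMAN_FORMS: Set[str] = {"software", "schnell", "schneller", "geladen"}


def english_lookup_vote(word: str, english_forms: Set[str] = ENGLISH_FORMS) -> bool:
    """Vote English if the word is a known English form."""
    return word.lower() in english_forms


def native_lookup_vote(word: str, native_forms: Set[str] = GERMAN_FORMS) -> bool:
    """Vote English if the word is NOT a known native (German) form."""
    return word.lower() not in native_forms
```

A word such as "Software", which both toy sets contain, shows why the two lookups are kept as separate features: the English lookup votes English while the native lookup does not.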

Wiktionary Lookup

Check crowdsourced information in the matrix-language Wiktionary

Google Hit Count

Based on Alex, B. (2008), “Automatic Detection of English Inclusions in Mixed-lingual Data with an Application to Parsing”, PhD thesis, University of Edinburgh

Result: Single Features

Grapheme-to-Phoneme Confidence

Result: Single Features

On the Spiegel-de test set, a higher proportion of the words classified as English are misclassified.

Result: Combination

Challenges

Performance after filtering difficult words (oracle)

Conclusion and Future Work

Features based on available sources

New approaches: G2P confidence, Wiktionary

Further features: part-of-speech (POS); context, trigger words; capitalization; translate and compare

Thank you for your attention!
