Page 1: MM9 Speech Communication

P.1

Dep. of Communication Technology

Tom Brøndsted, Speech Communication 06

MM9 Speech Communication

• MM8 summary
  – Brush-up
  – Conclusions (what you hopefully learned!)
• MM9
  – Standard Speech API
  – Hello World

Page 2: MM9 Speech Communication

Types of Speech Recognisers

1. “rule grammar recognition” = “command & control recognition”

2. “dictation”, “large vocabulary recognition”

3. other types (e.g. “Speech Commands” on mobile phones, DTW)

From MM7

Page 3: MM9 Speech Communication

Exercise with Dictation

• Dictation is not “general recognition”
  – Dependent on the “topic” of the text data used for LM training
    • E.g. ViaVoice performs better for dictation of business letters than for dictation of fairy tales!
• Dictation performs better after adaptation to the user
  – Is not 100% speaker-independent!

Page 4: MM9 Speech Communication

Exercise with the calculator

• Speech recognition is not the same as speech understanding!
• Understanding requires
  – Parsing
  – Context analysis
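The difference can be illustrated with a minimal sketch: the recogniser only delivers a word string, and a separate parsing step has to turn that string into something the calculator can act on. The vocabulary, the three-word utterance pattern and the function name `understand` are assumptions made for illustration; they are not taken from the exercise, and context analysis is left out entirely.

```python
# Hypothetical mini-vocabulary for a spoken calculator (an assumption,
# not the vocabulary used in the exercise).
NUMBERS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
           "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
OPERATORS = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
}


def understand(recognised_words):
    """Parse a recognised word string '<number> <operator> <number>'.

    The recogniser's job ends with the word string; this function is the
    (very small) understanding step that maps it to a value.
    """
    tokens = recognised_words.lower().split()
    if len(tokens) != 3:
        raise ValueError("expected '<number> <operator> <number>'")
    left, operator, right = tokens
    if left not in NUMBERS or right not in NUMBERS or operator not in OPERATORS:
        raise ValueError("utterance is outside the toy vocabulary")
    return OPERATORS[operator](NUMBERS[left], NUMBERS[right])


if __name__ == "__main__":
    print(understand("two plus three"))
```

A word string such as "two plus" may be recognised perfectly and still cannot be understood, which is the point of the exercise.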

Page 5: MM9 Speech Communication

Dialogue System (text)

[Figure: dialogue system architecture. Source: James Allen: Natural Language Understanding, 1995. Labels on the figure: Recognition, Synthesis, Grammar & lexicon, Acoustic models]

Page 6: MM9 Speech Communication

Exercise JHVite 10

dansk advokat har afsløret afdelingsingeniør

dansk advokat har afsløret afhængig afdelingsingeniør

almindelig dansk advokat har afsløret afdelingsingeniør

afhængig afdelingsingeniør angrede

begejstret advokat dominerer

advokat afviser begejstret afdelingsingeniør

advokat angrede

dansk advokat angrede

advokat afviser en begejstret afdelingsingeniør

Page 7: MM9 Speech Communication

Exercise JHVite 10

$adjektiv = dansk | afhængig | begejstret | almindelig;

$substantiv = advokat | afdelingsingeniør;

$transverb = afviser | har afsløret ;

$intransverb = angrede | dominerer;

$det = en | den;

$np = [$det] {$adjektiv} $substantiv;

$vp = ($transverb [$np]) | $intransverb;

$s= $np $vp;

($s)

Page 8: MM9 Speech Communication

Exercise JHVite 10

• Use variables that correspond to normal grammatical categories (noun, verb, subject, predicate etc.)
• Test the grammar
  – Does it take all sentences of a test set into account?
  – Does it only generate sentences that are likely to be input to the system?
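The first test question can be checked mechanically. The sketch below re-expresses the exercise grammar ($np, $vp, $s) as a regular expression and runs the nine exercise sentences through it. This is only an illustration in plain Python under the assumption that words are separated by single spaces; it does not use HTK, and the names `accepts` and `TEST_SET` are invented for the example.

```python
import re

# Word classes copied from the exercise grammar.
ADJ = r"(?:dansk|afhængig|begejstret|almindelig)"
NOUN = r"(?:advokat|afdelingsingeniør)"
TRANS = r"(?:afviser|har afsløret)"
INTRANS = r"(?:angrede|dominerer)"
DET = r"(?:en|den)"

# $np = [$det] {$adjektiv} $substantiv;
NP = rf"(?:{DET} )?(?:{ADJ} )*{NOUN}"
# $vp = ($transverb [$np]) | $intransverb;
VP = rf"(?:{TRANS}(?: {NP})?|{INTRANS})"
# $s = $np $vp;
SENTENCE = re.compile(rf"{NP} {VP}")

TEST_SET = [
    "dansk advokat har afsløret afdelingsingeniør",
    "dansk advokat har afsløret afhængig afdelingsingeniør",
    "almindelig dansk advokat har afsløret afdelingsingeniør",
    "afhængig afdelingsingeniør angrede",
    "begejstret advokat dominerer",
    "advokat afviser begejstret afdelingsingeniør",
    "advokat angrede",
    "dansk advokat angrede",
    "advokat afviser en begejstret afdelingsingeniør",
]


def accepts(sentence):
    """Return True if the whole sentence is generated by the grammar."""
    normalised = " ".join(sentence.split())
    return SENTENCE.fullmatch(normalised) is not None


if __name__ == "__main__":
    for sentence in TEST_SET:
        print(accepts(sentence), sentence)
```

The second question (over-generation) needs human judgement: for instance, the grammar also accepts "advokat afviser", because the object $np is optional after a transitive verb.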

Page 9: MM9 Speech Communication

WHAT IS A SPEECH API?

[Diagram: (conservative) state-of-the-art speech technology (command and control speech recognition, dictation speech recognition, speech synthesis) is accessed through the SPEECH API by a SPEECH APPLICATION, e.g. a spoken language dialogue system. Further labels on the diagram: (grammar), (mark-up language)]

Page 10: MM9 Speech Communication

SAPI

• Microsoft + vendors (IBM etc.)
  – Cross-vendor API
  – Platform: Windows 32 systems (NT, 2K, XP)
  – COM interface, MS Visual C++ 4.0, and other MS products
• SAPI-compliant speech products:
  – MS Whisper (free!), + “any” modern speech recogniser/synthesizer for Windows

Page 11: MM9 Speech Communication

JSAPI

• Sun Microsystems + vendors (Apple Computer, Inc., AT&T, Dragon Systems, IBM, Novell, Inc., Philips, Texas Instruments Incorporated)
  – cross-vendor API
  – cross-platform API
  – Java
• JSAPI-compliant speech products:
  – ViaVoice for Linux (was free!) and Win32 systems, various speech synthesis systems

Page 12: MM9 Speech Communication

JSAPI packages

• three packages (collections of classes and interfaces)
  javax.speech
  javax.speech.synthesis
  javax.speech.recognition
• standard extension to the Java platform (the “x” in “javax”)
• Personal Java, Embedded Java

Page 13: MM9 Speech Communication

javax.speech

• centralized mechanism for
  – a) registering new speech engines, and
  – b) selecting available speech engines (from an application)
• a locale defines the supported language (e.g. de_CH = Swiss German)
• additional features define
  – names of speakers that have trained the recognizer,
  – available synthetic voices
• pausing/resuming, notification of events etc.

Page 14: MM9 Speech Communication

javax.speech.synthesis

Method speakPlainText of interface javax.speech.synthesis.Synthesizer
  – argument: simple orthographic text

Method speak of interface javax.speech.synthesis.Synthesizer
  – argument: JSML text, e.g.

<PARA>Message from <EMP>John Doe</EMP> regarding <BREAK/> <PROS RATE="-20%">magazine article</PROS>.</PARA>
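Because JSML is XML-style mark-up, the example above can be inspected with any XML parser. The following sketch uses only the Python standard library, not JSAPI; it lists the mark-up elements and derives the plain orthographic text that remains once the mark-up is stripped. The function name `describe` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# The JSML example from the slide, unchanged.
JSML = ('<PARA>Message from <EMP>John Doe</EMP> regarding <BREAK/> '
        '<PROS RATE="-20%">magazine article</PROS>.</PARA>')


def describe(jsml):
    """Return (elements, plain_text) for a JSML string.

    elements   -- list of (tag, attributes) in document order
    plain_text -- the text content with mark-up removed and
                  whitespace normalised
    """
    root = ET.fromstring(jsml)
    elements = [(element.tag, dict(element.attrib)) for element in root.iter()]
    plain_text = " ".join("".join(root.itertext()).split())
    return elements, plain_text


if __name__ == "__main__":
    elements, plain_text = describe(JSML)
    for tag, attributes in elements:
        print(tag, attributes)
    print(plain_text)
```

The plain text is roughly what would be passed as simple orthographic text, while the elements carry the extra information (emphasis, break, speaking rate) that only the JSML variant can express.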

Page 15: MM9 Speech Communication

javax.speech.recognition

Interface javax.speech.recognition.FinalRuleResult

Interface javax.speech.recognition.Result

• 1-best list/n-best list; for each item in list:
  – list of tokens (“words”)
  – list of tags
  – name of JSGF grammar accepting input
  – name of public rule accepting input

Page 16: MM9 Speech Communication

Java Speech Grammar Format (JSGF)

• EBNF-equivalent “traditional style” (like SAPI's CFG format)

plus!:
  – Java-adapted style (e.g. grammar URLs!)
  – “semantic tags” (synonymy, multilinguality)
  – weights (enabling n-gram statistics)
  – unification grammar-like “action tags” (Sun Microsystems proposal)

Page 17: MM9 Speech Communication

JSGF: JAVA-adapted style

• JSGF header: grammar name/import, e.g.
  grammar dk.mydomain.emailapplication.mailBrowser;
  import <dk.mydomain.ReusableGrammars.date>;
• documentation comments /** ... */
• public rules vs. non-public (“private”) rules
  public <s> = <np> <vp>;
  <np> = <det> <n>;
  <n> = man | woman | bird;

Page 18: MM9 Speech Communication

JSGF Tags

• handling synonymy:
  <country> = Australia {Oz} | (United States) {USA} | America {USA} | (U S of A) {USA};
• handling multilinguality:
  <greeting> = (howdy | good morning) {hi};
  <greeting> = (ohayo | ohayogozaimasu) {hi};
  <greeting> = (guten tag) {hi};
  <greeting> = (bon jour) {hi};
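Seen from the application, the tags mean that different spoken variants collapse onto one symbol, so the application logic only has to test for the tag. The sketch below mimics that mapping with plain dictionaries; it is not a JSGF parser, and the names `COUNTRY_TAGS`, `GREETING_TAGS` and `tag_for` are assumptions made for illustration.

```python
# Phrase-to-tag tables copied from the <country> and <greeting> rules.
COUNTRY_TAGS = {
    "australia": "Oz",
    "united states": "USA",
    "america": "USA",
    "u s of a": "USA",
}

GREETING_TAGS = {
    phrase: "hi"
    for phrase in ("howdy", "good morning",        # English
                   "ohayo", "ohayogozaimasu",      # Japanese
                   "guten tag",                    # German
                   "bon jour")                     # French
}


def tag_for(utterance, table):
    """Return the tag attached to the recognised phrase, or None."""
    key = " ".join(utterance.lower().split())
    return table.get(key)


if __name__ == "__main__":
    print(tag_for("United States", COUNTRY_TAGS))
    print(tag_for("guten Tag", GREETING_TAGS))
```

An application that reacts to the tag `hi` works unchanged whichever language-specific greeting rule produced it.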

Page 19: MM9 Speech Communication

JSGF Weights

• probabilistic grammars (e.g. bigrams, trigrams) in JSGF
  <size> = /10/ small | /2/ medium | /1/ large;
  equivalent to the probabilities
  P(small) = 10/13, P(medium) = 2/13, P(large) = 1/13
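The step from weights to probabilities is a plain normalisation: each weight is divided by the sum of the weights of all alternatives. A minimal sketch, using exact fractions so the result can be compared with the values on the slide (the function name `normalise` is an assumption):

```python
from fractions import Fraction


def normalise(weights):
    """Turn the weights of one set of alternatives into probabilities."""
    total = sum(weights.values())
    return {word: Fraction(weight, total) for word, weight in weights.items()}


if __name__ == "__main__":
    # Weights from the rule <size> = /10/ small | /2/ medium | /1/ large;
    for word, probability in normalise({"small": 10, "medium": 2, "large": 1}).items():
        print(word, probability)
```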

Page 20: MM9 Speech Communication

JSGF Action Tags (proposal)

• Unification grammar-like percolation mechanism, but no structure sharing/feature constraints

<_juliek> = (julie | "julie kay")

{ cat = properNoun; // The word is a proper noun.

email = juliek; // User's e-mail ID.

date = permanent; // Indicates permanent entry in address book.

};

<person> = ((<_rickc> | <_juliek> | <_sadams>){$user}) { this = $user; };

Page 21: MM9 Speech Communication

Language Models

Page 22: MM9 Speech Communication

N-grams

• Sentence: $S = w_1 w_2 \ldots w_Q$
• Ideal sentence probability:
  $P(S) = P(w_1 w_2 \ldots w_Q) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_Q \mid w_1 w_2 \ldots w_{Q-1})$
• Approximate conditional word probability:
  $P(w_Q \mid w_1 w_2 \ldots w_{Q-1}) \approx P(w_Q \mid w_{Q-N+1} \ldots w_{Q-1})$
  – where N is a constant “window” size:
    • Unigram (N=1), Bigram (N=2), Trigram (N=3)
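For N=2 the approximation can be tried out on a toy corpus. The sketch below estimates bigram probabilities as relative frequencies and multiplies them along the sentence. The sentence-start marker `<s>`, the three training sentences (borrowed from the JHVite exercise) and the function names are assumptions made for illustration; no smoothing is applied, so any unseen bigram gives probability zero.

```python
from collections import Counter

START = "<s>"  # assumed sentence-start marker, so the first word has a history


def train_bigrams(sentences):
    """Count histories and bigrams in a list of sentences."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = [START] + sentence.split()
        histories.update(words[:-1])            # every word that has a successor
        bigrams.update(zip(words, words[1:]))   # (previous word, word) pairs
    return histories, bigrams


def sentence_probability(sentence, histories, bigrams):
    """P(S) under the bigram approximation, with relative-frequency estimates."""
    words = [START] + sentence.split()
    probability = 1.0
    for previous, word in zip(words, words[1:]):
        if histories[previous] == 0:
            return 0.0
        probability *= bigrams[(previous, word)] / histories[previous]
    return probability


if __name__ == "__main__":
    corpus = ["advokat angrede",
              "dansk advokat angrede",
              "begejstret advokat dominerer"]
    histories, bigrams = train_bigrams(corpus)
    print(sentence_probability("dansk advokat dominerer", histories, bigrams))
```

The sentence "dansk advokat dominerer" does not occur in the toy corpus but still receives a non-zero probability, because each of its bigrams was seen; a word order such as "advokat dansk" gets probability zero.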

Page 23: MM9 Speech Communication

Trigram smoothing (Jelinek)

• Used when there are insufficient data for real trigrams

$P(w_3 \mid w_1 w_2) = p_1 \frac{F(w_1, w_2, w_3)}{F(w_1, w_2)} + p_2 \frac{F(w_1, w_2)}{F(w_1)} + p_3 \frac{F(w_1)}{\sum_i F(w_i)}$

Where:

F is the number of occurrences of the string in its argument

$\sum_i F(w_i)$ is the number of words in the corpus

p1, p2, p3 are positive values and p1+p2+p3=1
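A minimal sketch of such a weighted mixture of trigram, bigram and unigram relative frequencies follows. Two things in it are assumptions rather than content of the slide. First, the lower-order terms are taken as the relative frequency of w3 after w2 and of w3 alone, which is the usual form of linear interpolation and differs in indexing from the formula as printed. Second, the weights 0.6, 0.3 and 0.1 are arbitrary illustration values. The function names are invented for the example.

```python
from collections import Counter


def count_ngrams(words):
    """Return unigram, bigram and trigram counts F(.) for a word list."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    trigrams = Counter(zip(words, words[1:], words[2:]))
    return unigrams, bigrams, trigrams


def ratio(numerator, denominator):
    """Relative frequency; defined as 0 when the history was never seen."""
    return numerator / denominator if denominator else 0.0


def interpolated_trigram(w1, w2, w3, unigrams, bigrams, trigrams,
                         weights=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram and unigram estimates of w3.

    weights = (p1, p2, p3) must be positive and sum to 1; the default
    values are arbitrary.
    """
    p1, p2, p3 = weights
    corpus_size = sum(unigrams.values())
    return (p1 * ratio(trigrams[(w1, w2, w3)], bigrams[(w1, w2)])
            + p2 * ratio(bigrams[(w2, w3)], unigrams[w2])
            + p3 * ratio(unigrams[w3], corpus_size))


if __name__ == "__main__":
    words = "a b c a b d a b c".split()
    unigrams, bigrams, trigrams = count_ngrams(words)
    print(interpolated_trigram("a", "b", "c", unigrams, bigrams, trigrams))
    print(interpolated_trigram("c", "a", "d", unigrams, bigrams, trigrams))
```

The second call asks for a trigram that never occurs in the toy data; the unigram term still gives it a small non-zero probability, which is the purpose of the smoothing.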

Page 24: MM9 Speech Communication

Clustering words in N-grams

• N-grams of word classes, categorical N-grams:
  – Words are “replaced” by (semantic, syntactic) categories before training (e.g. “w_day” for Monday, Tuesday ...)
• Data-driven clustering
• Stemming (Porter)

• ….
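The first variant, replacing words by categories before training, can be sketched in a few lines. The class table below is hand-made and hypothetical apart from the “w_day” example; the function names are assumptions as well.

```python
from collections import Counter

# Hypothetical hand-made class table; only "w_day" comes from the slide.
WORD_CLASSES = {
    "monday": "w_day", "tuesday": "w_day", "wednesday": "w_day",
    "thursday": "w_day", "friday": "w_day",
}


def to_classes(sentence):
    """Replace every word that has a class by its class label."""
    return [WORD_CLASSES.get(word, word) for word in sentence.lower().split()]


def class_bigrams(sentences):
    """Count bigrams after the words have been mapped to classes."""
    counts = Counter()
    for sentence in sentences:
        tokens = to_classes(sentence)
        counts.update(zip(tokens, tokens[1:]))
    return counts


if __name__ == "__main__":
    counts = class_bigrams(["meet me on Monday", "call me on Tuesday"])
    print(counts[("on", "w_day")])
```

Both training sentences now contribute to the same bigram ("on", "w_day"), so a weekday that never occurred in the training data is still covered once it is listed in the class table.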

Page 25: MM9 Speech Communication

N-gram problems

• Long-distance dependencies exceeding N:
  [kommoden/bordet/stolene] i værelset på tredje etage skal males [rød/rødt/røde]
  (“[the chest of drawers/the table/the chairs] in the room on the third floor must be painted red”: the form of the adjective agrees with the subject, which lies far outside a trigram window)
• Stochastic grammars “freeze” human verbal behaviour at the state reflected in the training data. The verbal behaviour may change. Adaptive approach?
• Finding corpora that reflect how humans will communicate with the final system (human-human dialogues vs. WOZ experiments).