general speereo technology

Speereo Software, 2009

www.speereo.com

Speereo Speech Recognition Technologies

Konstantin Lamin

CEO

[email protected]

Oleg Maleev

CTO, VP of R&D

[email protected]

Daniel Ischenko

VP of Business Development

[email protected]

http://www.speereo.com/


User friendliness

Speech is the most natural way of communication for humans. A speech interface is therefore the most natural way to interact with a mobile device.

Mobility

While using a speech interface, the user's hands and eyes are free for any other activity.

Device novelty

A speech interface gives the user an easy-to-use device that is not burdened by numerous keys or large screens.

What are speech technologies needed for?

ASR is the conversion of a speech signal into text or control commands. ASR makes it possible to build devices with speech control, i.e. a speech interface.

Automatic Speech Recognition System (ASR)

[Diagram: Voice -> ASR -> Command ID]

Text-to-speech (TTS) is the conversion of text into a speech signal that follows the pronunciation norms of the language. It makes it possible to create "speaking" devices.

Speech Synthesizer (TTS)

[Diagram: Text -> TTS -> Speech]

Speech signal compression

Makes it possible to record a speech signal in a small amount of memory.

[Diagram: Speech -> Packing -> Packed data; Packed data -> Unpacking -> Speech]

ASR: Pentium 4 2.0 GHz, 64 MB, memory bandwidth 1.2 GB/s

TTS: Pentium 4 2.0 GHz, 100-500 MB

Standard solutions are not acceptable for embedded and mobile devices. Special approaches that reduce CPU and memory usage must therefore be applied.

Speech technology on desktop PC

Compactness: memory footprint under 1-2 MB.

Ability to run on a CPU under 100 MIPS. A 300 MHz XScale delivers 12 or more times less performance than a 2.0 GHz Pentium 4.

Low memory bandwidth (XScale delivers only 64 MB/s).

Requirements for embedded devices

(low footprint)

Simple, intuitive API that non-specialists in speech technology can use.

Scalable and portable software design.

Works with various operating systems, or on devices with no OS.

Software only! No additional hardware is required.

Embedded Speech SDK

Speaker-Dependent or Speaker-Independent?

Is training necessary? Users find mandatory training annoying.

Recognition at the phonetic level (dictionary of any size) or at the whole-word level (small dictionaries only)?

Large (>10000 words) or small vocabulary?

Can the set of recognizable commands be changed dynamically?

Speech Recognition Technology Characteristics

Speaker-Independent.

Flexible large vocabulary: the set of recognizable words and phrases can be changed on the fly.

Noise robustness: the device can be used in different conditions (in a car, outdoors, in a crowd).

Robustness to pronunciation variations, including non-native speakers.

Optimal System of Speech Control

Speereo Speech Recognition Engine

[Diagram components: Speech, Acoustic environment, Acoustic Front-End, Decoder, Phone Models, Very Large Vocabulary, Real-Time Transcriber, Recognition result]

Feature set of 41 coefficients.

Adaptation to the acoustic environment.

Special algorithm for automatic adaptation to the microphone type (far-field or close-talk), recording conditions and channel distortion.

Special algorithm for operation of the system in a car.

Acoustic Front-End

Continuous-density hidden Markov models (more precise).

Discrete hidden Markov models (faster).

For English: 63 HMMs comprising 2446 Gaussian mixture components.

HMM parameters were estimated statistically, subject to a priori phonetic constraints.

Enhanced decoder algorithm for faster operation.

Acoustic model Decoder

Converts written English words and phrases into a form suitable for recognition.

Unlimited dictionary. Out-of-Vocabulary problem solved.

Recognition of first and last names.

Recognition of geographic names.

Real-Time Transcriber

Test 1: Long phrases recognition

Test conditions: statistical sampling – 1680 utterances, 626

unique phrases. Language – English.

Recognition accuracy – 99.9%.

Test 2: Short words recognition

Test conditions: numerical vocabulary database (including

inarticulately pronounced words), 11 unique words.

Language – English: recognition accuracy – 99.2%.

Language – Russian: recognition accuracy – 98.5%.

Accuracy

Noise Robustness

Test 3: accuracy dependence on noise level

SNR (dB)       0      5      10     15     20     Clear
Accuracy (%)   98.2   98.4   98.3   98.6   98.7   99.2

Speereo Speech Engine demonstrates high noise robustness.

Test 4: long phrases recognition in noisy surroundings

Test conditions: statistical sampling – 1632 utterances, 626 unique phrases. Noise sample – moving vehicle with windows rolled down.

Language – English.

Recognition accuracy – 97.6%.

Due to special algorithms, Speereo Recognition Engine

demonstrates good robustness in a car.

Speereo Speech Engine in a Car

Comparison of Recognition Systems

[Bar chart: number of mistakes in tests 1 and 2 (lower is better), scale 0 to 80, for Philips, Microsoft, IBM and Speereo; series: Phrases, Digits]

The following products were used in testing: Philips FreeSpeech 2000, Microsoft Speech Recognition Engine 4.0, IBM ViaVoice 7.0, Speereo Speech Engine 2.0.

High accuracy speech recognition

Speaker-Independence

Large vocabulary (>100000 words)

Short latency

Noise robustness

Excellent compatibility

Ease of use

Speereo Speech Recognition Technology

Features

Speereo Speech Engine currently supports a wide variety of processors, such as SHx, TMPR39XX, NEC VR4122, MIPS, ARM, XScale, etc.

Speereo Speech Engine runs on CPUs from 40 MIPS (80 MIPS recommended) with 700 KB of memory or more.

CPU and Memory requirements

Simple API that requires no expertise in speech technology.

Supports Windows Mobile, Symbian, Java, other platforms and embedded devices with no OS.

For Windows Mobile and Symbian, ready-made audio input/output is provided, so there is no need to program it.

Supports smartphones based on Series 60, UIQ and Windows Mobile, and mobile devices with J2ME.

Speereo Speech Recognition SDK

Speereo Speech Engine Windows CE Version

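[Diagram: Audio Input-Output feeds the Speereo Speech Engine, which serves Application 1, Application 2, ... Application N. Each application passes its list of speech commands to the engine and receives the speech commands pronounced by the user.]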

The operation of SE can be divided into two major stages:

1. The application sets the operating mode of SE and, if necessary, sends SE the list of speech commands.

2. When the user pronounces a phrase (command), SE determines the most probable phrase from the list of received speech commands and sends its ID to the application.

The developer does not need to detect the moment a phrase is pronounced. It is enough to process the Speereo Speech Engine message that contains the ID of the command pronounced by the user.

Use of Speereo Speech Engine (SE)
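
The second stage can be pictured as picking the highest-scoring entry from the registered command list. The following is only an illustrative sketch, not the engine's actual algorithm: the `Phrase` struct, its `score` field and `PickBestPhrase` are hypothetical names, and the scores stand in for whatever the real decoder computes, which the slides do not describe.

```c
#include <stddef.h>

/* Hypothetical stand-in for one registered speech command. */
typedef struct {
    const char *text;   /* command in orthographic form */
    unsigned long id;   /* identifier returned to the application */
    double score;       /* placeholder for the decoder's match score */
} Phrase;

/* Return the ID of the highest-scoring phrase.
 * Returning 0 for an empty list is an assumption of this sketch. */
unsigned long PickBestPhrase(const Phrase *list, size_t count)
{
    if (count == 0) {
        return 0;
    }
    size_t best = 0;
    for (size_t i = 1; i < count; ++i) {
        if (list[i].score > list[best].score) {
            best = i;
        }
    }
    return list[best].id;
}
```

The application never sees the scores; it only receives the winning ID, which is why the command list can be replaced at any time without changing the message handling code.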

SE currently implements three recognition modes:

1. Recognition of phrases made up of words that are known to SE and included in the vocabulary.

2. Recognition of phrases containing words unknown to SE (mostly personal names, etc.). In this case the unknown words are transcribed automatically.

3. Recognition of numbers from 1 to 31. A special mode improves recognition accuracy for ordinal numbers.

Recognition modes

To use the speech interface, an application must first register itself with Speereo Speech Engine by calling the AddRegisterApplication function.

The function prototype is as follows:

UINT AddRegisterApplication (HWND hWnd)

where hWnd is the handle of the application window that receives messages from SE.

Speereo Speech Engine Initialization

The speech command list is built by calling the AddPhrase function once for each speech command.

void AddPhrase (LPCTSTR pszText, DWORD dwId)

Where pszText is a speech command in orthographic form and

dwId is the identifier of the speech command that will be

returned by SE if the speech command is pronounced.

Speech Commands List Creation

The WM_SRT_ACCEPTHYPO message carries the identifier of the recognized speech command in its wParam parameter.

SE sends the message to the application window whose hWnd was passed to AddRegisterApplication.

Example:

case WM_SRT_ACCEPTHYPO:
    MakeHypo (wParam);
    return TRUE;

Here MakeHypo is the developer's own function that implements the response to speech commands.

Response receipt from SE

AddPhrase (_T("Open Window"), ID_OPEN_WINDOW);

AddPhrase (_T("Close Window"), ID_CLOSE_WINDOW);

This passes two speech commands ("Open Window" and "Close Window") to SE with the identifiers ID_OPEN_WINDOW and ID_CLOSE_WINDOW respectively.

Defining Speech Commands Example

To build a speech interface into an application with Speereo Speech Engine, take the following three simple steps:

1. Initialize Speereo Speech Engine.

2. Define the list of speech commands.

3. Define the application's reaction to speech commands.

It’s That Simple!
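
The three steps can be sketched end to end as below. This is a minimal mock, not the real SDK: the Win32 types are replaced by stand-ins so the code compiles without Windows headers, the numeric value of `WM_SRT_ACCEPTHYPO` is a placeholder, `AddPhrase` takes a plain `const char *` instead of `LPCTSTR`, the meaning of the `AddRegisterApplication` return value is assumed, and `InitSpeechInterface`, `OnMessage` and `MockSpeak` are hypothetical helpers. `MockSpeak` matches text exactly where the real engine would recognize speech.

```c
#include <string.h>

/* Stand-ins for Win32 types; the message value is a placeholder. */
typedef void *HWND;
typedef unsigned int UINT;
typedef unsigned long DWORD;
typedef unsigned long WPARAM;
#define WM_SRT_ACCEPTHYPO 0x8001u

enum { ID_OPEN_WINDOW = 1, ID_CLOSE_WINDOW = 2 };

/* ---- Mock engine (the real SE performs the recognition) ---- */
#define MAX_PHRASES 16
static struct { const char *text; DWORD id; } g_phrases[MAX_PHRASES];
static int g_phraseCount = 0;
static HWND g_appWindow = NULL;

/* Assumption: a nonzero return value means success. */
UINT AddRegisterApplication(HWND hWnd)
{
    g_appWindow = hWnd;
    return 1;
}

void AddPhrase(const char *pszText, DWORD dwId)
{
    if (g_phraseCount < MAX_PHRASES) {
        g_phrases[g_phraseCount].text = pszText;
        g_phrases[g_phraseCount].id = dwId;
        ++g_phraseCount;
    }
}

/* ---- Application side ---- */
static int g_windowOpen = 0;

/* Step 3: react to a recognized command. */
void MakeHypo(WPARAM id)
{
    switch (id) {
    case ID_OPEN_WINDOW:  g_windowOpen = 1; break;
    case ID_CLOSE_WINDOW: g_windowOpen = 0; break;
    default: break;  /* unknown ID: ignore */
    }
}

/* Message handler: returns 1 when the message was handled. */
int OnMessage(UINT msg, WPARAM wParam)
{
    switch (msg) {
    case WM_SRT_ACCEPTHYPO:
        MakeHypo(wParam);
        return 1;
    default:
        return 0;
    }
}

/* Steps 1 and 2: register the window, then define the commands. */
void InitSpeechInterface(HWND hWnd)
{
    AddRegisterApplication(hWnd);
    AddPhrase("Open Window", ID_OPEN_WINDOW);
    AddPhrase("Close Window", ID_CLOSE_WINDOW);
}

/* Mock of "the user spoke": exact text match instead of recognition. */
int MockSpeak(const char *utterance)
{
    for (int i = 0; i < g_phraseCount; ++i) {
        if (strcmp(g_phrases[i].text, utterance) == 0) {
            return OnMessage(WM_SRT_ACCEPTHYPO, g_phrases[i].id);
        }
    }
    return 0;
}
```

In a real Windows CE application the `switch` in `OnMessage` would live inside the window procedure, and the engine rather than `MockSpeak` would post the message.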

1. Microphone and speaker controls.

2. Ability to interact with several applications simultaneously.

3. Ability to record sound and voice signals via the microphone with real-time compression.

4. Ability to play sound and voice signals for the user.

5. Selectable speech signal detector (continuous monitoring of the speech signal, or recognition launched on a key press).

Speereo Speech Engine Additional Features

Home appliances

Consumer electronics (audio/video systems)

Computer hardware and software (all operations)

Portable devices (mobile phones, smartphones)

Voice mail system

Other embedded devices

Using Speereo Speech Interface can greatly contribute to

functionality, accessibility, and innovative appeal of any

product by making it fully interactive, easy to control, and

therefore more productive and enjoyable.

Speereo Speech Engine

Implementation Possibilities

Example 1: operating a phonebook

Instead of selecting from the menu (Menu, Names, Search "Samantha", Call), the feature can be accessed with one short phrase: “Call Samantha”.

Send a voice message via e-mail/MMS:

Say “Send E-mail” or “Send MMS”. You will be prompted for the recipient's name. Once the name is spoken, the system finds it in the database and offers to send a voice message.

Example 2: Mobile Voice Interface

A voice interface for mobile services is in high demand in the mobile community.

Maps

Tickets booking

Information

Humor

Weather

Dictionaries

Exchange rates

E-Commerce

Example 3: GPS Voice Control

Speech menu

Map navigation

Search P.O.I.

Route indication

Speereo Voice Translator

Speaking in a foreign language? Nothing could be simpler!

Speereo Voice Translator is an innovative mobile phrase book that understands a phrase spoken in English (even with a strong accent) and immediately reads back the same phrase in Arabic, Chinese (Traditional or Simplified), Danish, English, Finnish, French, German, Italian, Korean, Polish, Russian, Spanish or Turkish.

http://www.speereovt.com/





Speereo Voice Organizer

Manage your personal information,

send e-mails and set your schedule

using only voice commands with our

Stylus Free Concept – Speereo Voice

Organizer!

Free your hands and keep using your mobile device: the application will find and dial numbers, write e-mails and remind you of your appointments, following your voice commands!

http://www.personal-secretary.com/

Use our unique skills!

A speech interface brings a new level of user convenience. We have the knowledge needed to implement speech technology successfully.

"A voice-operated scheduler is a very good idea and Speereo has made it

an impressive and enjoyable reality. Perhaps the best thing about Voice

Organizer is that you can access all of its features with one hand. One

touch of your Pocket PC's record button and your voice does all the work:

switching between days, week and month views of your events; adding new

events; or adding vocal notes to your phone contacts."

Voice Recognition Programs for the Pocket PC

By John Mierau, Pocket PC Magazine, November 2002, Vol. 5 No. 5

Speech Synthesizer Types (TTS)

[Diagram, whole-word TTS: Text, Phrases compiler, Words DB, Speech]

[Diagram, phonemic TTS: Text, Transcriber, Prosody, Phones, Phrases compiler, Phones DB, Speech]

TTS Requirements

Whole-word TTS

Predefined vocabulary (up to 2-3 thousand words) fixed at the system development stage.

CPU from 40 MIPS, RAM from 0.5 MB.

Requires a narrator to pronounce every word in the vocabulary.

Phonemic TTS

Large dictionaries possible (over 100 thousand words).

CPU from 80 MIPS, RAM from 2 MB.

Does not require tuning for a particular dictionary.

TTS Language Support

Whole-word TTS

Any language may be used. A narrator is needed to create the word database. Development time is 1-2 weeks, depending on the dictionary.

Phonemic TTS

English, Spanish, German and Italian are currently supported. Developing a new language takes 3 months.

Speech Compression Algorithms

Speech signal, 16 bit / 8 kHz

1 minute takes 960 KB in memory.

ADPCM (Adaptive Differential Pulse Code Modulation)

Records only the difference between samples and adjusts the coding scale dynamically. 1 minute takes 240 KB in memory. Any sound signal can be compressed.
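
Both figures follow from the sample rate and the number of bits stored per sample. A minimal sketch of the arithmetic, assuming 1 KB = 1000 bytes and assuming an ADPCM variant that stores 4 bits per sample (the slide gives only the 240 KB total, which matches a 4:1 ratio); `AudioBytes` is a name introduced here for illustration:

```c
/* Bytes needed to store `seconds` of mono audio at a fixed rate:
 * samples per second * bits per sample * seconds / 8 bits per byte. */
unsigned long AudioBytes(unsigned long sampleRateHz,
                         unsigned long bitsPerSample,
                         unsigned long seconds)
{
    return sampleRateHz * bitsPerSample * seconds / 8;
}
```

`AudioBytes(8000, 16, 60)` gives 960000 bytes and `AudioBytes(8000, 4, 60)` gives 240000 bytes, matching the 960 KB and 240 KB figures.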

Special Speech Compression Algorithms

Exploiting the properties of the speech signal yields higher compression ratios:

GSM compression

1 minute takes about 100 KB in memory. Optimal for compressing speech signals only.

Speereo advanced compression

1 minute takes about 10.25 KB in memory. More than 1.5 hours of speech can be recorded in 1 MB of space.
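
The "more than 1.5 hours in 1 MB" figure can be checked with integer arithmetic. This sketch assumes 1 MB = 1024 KB; `MinutesThatFit` is a hypothetical helper, and the rate is passed in hundredths of a KB per minute so that 10.25 KB can be expressed without floating point:

```c
/* Whole minutes of speech that fit into `capacityKB`, for a codec that
 * needs kbPerMinuteX100 / 100 KB per minute (scaled to avoid floats). */
unsigned long MinutesThatFit(unsigned long capacityKB,
                             unsigned long kbPerMinuteX100)
{
    return capacityKB * 100 / kbPerMinuteX100;
}
```

Under these assumptions `MinutesThatFit(1024, 1025)` gives 99 whole minutes, about 1 hour 39 minutes, against 10 minutes for GSM at roughly 100 KB per minute.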

Speereo Advanced Compression

Real-time compression and decompression with the Speereo algorithm requires a 60 MIPS processor and 200 KB of memory.

Real-time decompression alone requires a 40 MIPS processor and 200 KB of memory.

Speereo Compression Algorithms Usage

Playback of preinstalled voice commands on mobile and embedded devices (decompression only).

Creation of user voice commands on a PC, followed by their transfer to mobile and embedded devices (compression on the desktop PC, decompression on the embedded device).

Recording and playback of users' commands on mobile and embedded devices (compression and decompression on the embedded device).

Conclusion

Speereo Speech Technology for embedded devices:

Automatic Speech Recognition (ASR): from 40 MIPS (80 MIPS recommended), from 700 KB of memory.

Speech synthesizer (TTS): from 40 MIPS and 500 KB of memory (whole-word), or from 80 MIPS and 2 MB (phonemic).

Speech signal compression: from 40 MIPS, from 200 KB of memory.

Speereo Speech Technology

Technology that understands your language

QUESTIONS? COMMENTS?

Speereo Software UK

www.speereo.com

Konstantin Lamin

CEO

[email protected]

Oleg Maleev

CTO, VP of R&D

[email protected]

Daniel Ischenko

VP of Business Development

[email protected]

http://www.speereo.com/