Speech Recognition Seminar


1. INTRODUCTION

One of the most important inventions of the nineteenth century was the telephone. Then, at the midpoint of the twentieth century, the invention of the digital computer amplified the power of our minds, enabled us to think and work more efficiently, and made us more imaginative than we could ever have imagined. Now several new technologies have empowered us to teach computers to talk to us in our native languages and to listen to us when we speak (recognition); haltingly, computers have begun to understand what we say. Having given our computers both oral and aural abilities, we have been able to produce innumerable computer applications that further enhance our productivity. Such capabilities enable us to route phone calls automatically and to obtain and update computer-based information by telephone, using a group of activities collectively referred to as voice processing.

SPEECH TECHNOLOGY

Three primary speech technologies are used in voice processing applications: stored speech, text-to-speech and speech recognition. Stored speech involves the production of computer speech from an actual human voice that is stored in a computer's memory and used in any of several ways.

Speech can also be synthesized from plain text in a process known as text-to-speech, which also enables voice processing applications to read from textual databases.

Speech recognition is the process of deriving either a textual transcription or some form of meaning from a spoken input.

Speech analysis can be thought of as that part of voice processing that converts human speech to digital forms suitable for transmission or storage by computers.

Speech synthesis functions are essentially the inverse of speech analysis: they reconvert speech data from a digital form to one that is similar to the original recording and suitable for playback.

Speech analysis processes can also be referred to as digital speech encoding (or simply coding), and speech synthesis can be referred to as speech decoding.


2. EVOLUTION OF ASR METHODOLOGIES

Speech recognition research has been ongoing for more than 80 years. Over that period there have been at least four generations of approaches, and a fifth generation can be forecast based on current research themes. The five generations, and the technology themes associated with each of them, are as follows [5]:

Generation 1 (1930s to 1950s):

Use of ad hoc methods to recognize sounds, or small vocabularies of isolated words.

Generation 2 (1950s to 1960s):

Use of acoustic phonetic approaches to recognize phonemes, phones, or digit vocabularies.

Generation 3 (1960s to 1980s):

Use of pattern recognition approaches to speech recognition of small to medium-sized vocabularies

of isolated and connected word sequences, including use of linear predictive coding (LPC) as the

basic method of spectral analysis; use of LPC distance measures for pattern similarity scores; use of

dynamic programming methods for time aligning patterns; use of pattern recognition methods for

clustering multiple patterns into consistent reference patterns; use of vector quantization (VQ)

codebook methods for data reduction and reduced computation.

Generation 4 (1980s to 2010s):

Use of hidden Markov model (HMM) statistical methods for modelling speech dynamics and statistics in a continuous speech recognition system; use of forward-backward and segmental K-means training methods; use of Viterbi alignment methods; use of maximum likelihood (ML) and various other performance criteria and methods for optimizing statistical models; introduction of neural network (NN) methods for estimating conditional probability densities; use of adaptation methods that modify the parameters associated with either the speech signal or the statistical model so as to enhance the compatibility between model and data for increased recognition accuracy.

Generation 5 (2000s to 2010s):

Use of parallel processing methods to increase recognition decision reliability; combinations of

HMMs and acoustic-phonetic approaches to detect and correct linguistic irregularities; increased

robustness for recognition of speech in noise; machine learning of optimal combinations of models.


3. ISSUES IN SPEECH RECOGNITION

As we examine the progress made in implementing speech recognition and natural language

understanding systems over the years, we will see that there are a number of issues that need to be

addressed in order to define the operating range of each speech recognition system that is built. These

issues include the following [5]:

• Speech unit for recognition: ranging from words down to syllables and finally to phonemes or even

phones. Early systems investigated all these types of units with the goal of understanding their

robustness to context, speakers and speaking environments

• Vocabulary size: ranging from small (order of 2–100 words), medium (order of 100–1000 words), and large (anything above 1000 words, up to unlimited vocabularies). Early systems tackled primarily small-vocabulary recognition problems; modern speech recognizers are all large-vocabulary systems

• Task syntax: ranging from simple tasks with almost no syntax (every word in the vocabulary can

follow every other word) to highly complex tasks where the words follow a statistical n-gram

language model

• Task perplexity (the average word branching factor): ranging from low values (for simple tasks) to values on the order of 100 for complex tasks whose perplexity approaches that of a natural-language task (a small illustrative calculation follows this list)

• Speaking mode: ranging from isolated words (or short phrases), to connected word systems (e.g.,

sequences of digits that form identification codes or telephone numbers), to continuous speech

(including both read passages and spontaneous conversational speech)

• Speaker mode: ranging from speaker-trained systems to speaker-adaptive systems to speaker

independent systems, which can be used by anyone without any additional training. Most modern

ASR systems are speaker independent and are utilized in a range of telecommunication

applications. However, for dictation purposes, most systems are still largely speaker dependent and

adapt over time to each individual speaker.

• Speaking situation: ranging from human-to-machine dialogues to human-to-human dialogues (e.g.,

as might be needed for language translation systems)

• Speaking environment: ranging from a quiet room, to noisy places (e.g., offices, airline terminals),

and even outdoors (e.g., via the use of cellphones)

• Transducer: ranging from high-quality microphones to telephones (wire line) to cellphones (mobile) to array microphones (which track the speaker location electronically)
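To make the notion of task perplexity concrete, the short Python sketch below (with made-up probability values, not taken from any particular recognizer) computes perplexity as two raised to the per-word cross-entropy; for a task in which every one of N equally likely words can follow every other word, the perplexity equals N.

```python
import math

def perplexity(word_probs):
    """Perplexity of a word sequence, given the probability the
    language model assigned to each word (illustrative values only)."""
    cross_entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** cross_entropy       # average word branching factor

# A digit task with no grammar: every one of the 10 digits is equally
# likely at every position, so the perplexity equals the vocabulary size.
print(perplexity([0.1] * 7))                      # -> 10.0 (up to rounding)

# A constrained task: most words are predicted with high probability,
# so the effective branching factor is much smaller.
print(perplexity([0.5, 0.9, 0.25, 0.6, 0.8]))     # -> about 1.8
```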


4. SPEECH RECOGNITION BASICS

The following definitions are the basics needed for understanding speech recognition technology.

Utterance

An utterance is the vocalization (speaking) of a word or words that represent a single meaning to

the computer. Utterances can be a single word, a few words, a sentence, or even multiple sentences.

Speaker Dependency

Speaker dependent systems are designed around a specific speaker. They generally are more

accurate for the correct speaker, but much less accurate for other speakers. They assume the speaker

will speak in a consistent voice and tempo. Speaker independent systems are designed for a variety

of speakers. Adaptive systems usually start as speaker independent systems and utilize training

techniques to adapt to the speaker to increase their recognition accuracy.

Vocabularies

Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the SR

system. Generally, smaller vocabularies are easier for a computer to recognize, while larger

vocabularies are more difficult. Unlike normal dictionaries, each entry doesn't have to be a single

word. They can be as long as a sentence or two. Smaller vocabularies can have as few as one or two recognized utterances (e.g., "Wake Up"), while very large vocabularies can have a hundred thousand entries or more.

Accuracy

The ability of a recognizer can be examined by measuring its accuracy, or how well it recognizes utterances. This includes not only correctly identifying an utterance but also detecting when a spoken utterance is not in its vocabulary. Good ASR systems have an accuracy of 98% or more, although the acceptable accuracy of a system really depends on the application.
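Accuracy figures of this kind are normally derived from the word error rate: the minimum number of substitutions, insertions and deletions needed to turn the recognized word string into a reference transcription, divided by the number of reference words. The Python sketch below is a minimal illustration (the example strings are invented), not the scoring tool of any particular system.

```python
def word_error_rate(reference, hypothesis):
    """Minimum edit distance (substitutions + insertions + deletions)
    between the reference and recognized word strings, divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the road to hal", "the load to hal")
print(f"word error rate {wer:.0%}, accuracy {1 - wer:.0%}")   # 25%, 75%
```

An accuracy of 98% in this sense corresponds to a word error rate of 2%.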

Training

Some speech recognizers have the ability to adapt to a speaker. When the system has this ability, it

may allow training to take place. An ASR system is trained by having the speaker repeat standard

or common phrases and adjusting its comparison algorithms to match that particular speaker.

Training a recognizer usually improves its accuracy. Training can also be used by speakers who have difficulty speaking or pronouncing certain words. As long as the speaker can consistently repeat an utterance, ASR systems with training should be able to adapt.


5. TYPES OF SPEECH RECOGNITION

Speech recognition systems can be separated into several different classes according to the types of utterances they are able to recognize. These classes are based on the fact that one of the difficulties of ASR is determining when a speaker starts and finishes an utterance. Most packages can fit into more than one class, depending on which mode they are using.

Isolated Words:

Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on

BOTH sides of the sample window. This doesn't mean that the system accepts only single words, but it does require a single utterance at a time. Often, these systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually doing processing during the pauses). "Isolated utterance" might be a better name for this class.

Connected Words:

Connected word systems (or, more correctly, 'connected utterances') are similar to isolated-word systems, but allow separate utterances to be 'run together' with a minimal pause between them.

Continuous Speech

Continuous recognition is the next step. Recognizers with continuous speech capabilities are some

of the most difficult to create because they must utilize special methods to determine utterance

boundaries. Continuous speech recognizers allow users to speak almost naturally, while the

computer determines the content. Basically, it's computer dictation.

Spontaneous Speech

There appears to be a variety of definitions for what spontaneous speech actually is. At a basic

level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system

with spontaneous speech ability should be able to handle a variety of natural speech features such as

words being run together, "ums" and "ahs", and even slight stutters.

Voice Verification/Identification

Some ASR systems have the ability to identify specific users. This document doesn't cover verification or security systems.


6. SPEECH RECOGNITION

The days when you had to keep staring at the computer screen and frantically hit a key or click the mouse for the computer to respond to your commands may soon be a thing of the past. Today you can stretch out, relax, and tell your computer to do your bidding. Speech recognition is the process of deriving either a textual transcription or some form of meaning from a spoken input.

Speech recognition is the inverse process of synthesis: the conversion of speech to text. The speech recognition task is complex. It involves the computer taking the user's speech and interpreting what has been said. This allows the user to control the computer (or certain aspects of it) by voice, rather than having to use the mouse and keyboard, or alternatively just to dictate the contents of a document. This has been made possible by ASR (automatic speech recognition) technology.

ASR technology would be particularly welcomed by automated telephone exchange operators, doctors and others who seek freedom from tiresome conventional computer operation using the keyboard and mouse. It is suitable for applications in which computers are used to provide routine information and services. ASR's direct speech-to-text dictation offers a significant advantage over traditional transcription. With further refinement of the technology, typing in text may well become a thing of the past; ASR offers a solution to this fatigue-causing procedure by converting speech into text.

ASR technology is presently capable of achieving recognition accuracies of 95%–98%, but only under ideal conditions; the technology is still far from perfect in the uncontrolled real world. The roots of this technology can be traced to 1968, when the term Information Technology hadn't even been coined and Americans had only begun to realize the vast potential of computers. The Hollywood blockbuster 2001: A Space Odyssey featured a talking, listening computer, HAL-9000, which to date remains a celebrated figure both in science fiction and in the world of computing. Even today almost every speech recognition technologist dreams of designing a HAL-like computer with a clear voice and the ability to understand normal speech. Though ASR technology is still not as versatile as the imaginary HAL, it can nevertheless be used to make life easier. New application-specific standard products, interactive error-recovery techniques, and better voice-activated user interfaces allow the handicapped, the computer-illiterate, and rotary-dial phone owners to talk to computers. ASR, by offering a natural human interface to computers, finds applications in telephone call centres (such as airline flight information systems), learning devices, toys, etc.


6.1. HOW DOES THE ASR TECHNOLOGY WORK?

When a person speaks, compressed air from the lungs is forced through the vocal tract as a sound wave that varies with the lung pressure and the shape of the vocal tract. This acoustic wave is interpreted as speech when it falls upon a person's ear. In any machine that records or transmits the human voice, the sound wave is converted into an electrical analog signal using a microphone.

Fig. 6.1: Flow of speech recognition (a person speaks "THE ROAD TO HAL"; the electrical signal enters the computer; background noise is removed and the sound amplified; the words are broken up into phonemes; candidate character combinations are matched and chosen; language analysis selects the output "THE ROAD TO HAL")

When we speak into a telephone receiver, for instance, its microphone converts the acoustic wave into an electrical analog signal that is transmitted through the telephone network. The electrical signal's strength from the microphone varies in amplitude over time and is referred to as an analog signal or an analog waveform.


If the signal results from speech, it is known as a speech waveform. Speech waveforms

have the characteristic of being continuous in both time and amplitude.

A listener's ears and brain receive and process the analog speech waveforms to figure out the speech. ASR-enabled computers, too, work on the same principle by picking up acoustic cues for speech analysis and synthesis. Because it helps to understand the ASR technology better, let us dwell a little more on the acoustic process of the human articulatory system. In the vocal tract the process begins at the lungs. The variations in air pressure cause vibrations in the folds of skin that constitute the vocal cords. The elongated orifice between the vocal cords is called the glottis. As a result of the vibrations, repeated bursts of compressed air are released into the air as sound waves.

Articulators in the vocal tract are manipulated by the speaker to produce various effects. The vocal cords can be stiffened or relaxed to modify the rate of vibration, or they can be turned off and the

vibration eliminated while still allowing air to pass. The velum acts as a gate between the oral and the nasal

cavities. It can be closed to isolate or opened to couple the two cavities. The tongue, jaw, teeth, and lips can

be moved to change the shape of the oral cavity.

The nature of the sound pressure wave radiating outward from the lips depends upon these time-varying articulations and upon the absorptive qualities of the vocal tract's materials. The sound pressure wave exists as a continually moving disturbance of air. Particles move closer together as the pressure increases or move further apart as it decreases, each influencing its neighbor in turn as the wave propagates at the speed of sound. The amplitude of the wave at any position distant from the speaker is measured by the density of air molecules and grows weaker as the distance increases. When this wave falls upon the ear it is interpreted as sound with discernible timbre, pitch, and loudness.

Air under pressure from the lungs moves through the vocal tract and comes into contact with various obstructions, including the palate, tongue, teeth and lips. Some of its energy is absorbed by these obstructions; most is reflected. Reflections occur in all directions, so that parts of waves bounce around inside the cavities for some time, blending with other waves, dissipating energy and finally finding their way out through the nostrils or past the lips.

Some waves resonate inside the tract according to their frequency and the cavity’s shape at that

moment, combining with other reflections, reinforcing the wave energy before exiting. Energy in waves of

other, non-resonant frequencies is attenuated rather than amplified in its passage through the tract.

6.2. THE SPEECH RECOGNITION PROCESS


When a person speaks, compressed air from the lungs is forced through the vocal tract as a sound wave that varies with the lung pressure and the shape of the vocal tract. This acoustic wave is interpreted as speech when it falls upon a person's ear. Speech waveforms have the characteristic of being continuous in both time and amplitude.

Fig. 6.2: Steps in speech recognition

Fig. 6.3: Block diagram of the steps in speech recognition (voice input → analog electrical signal → digital signal → noise removal → breakdown into phonemes → matching and choosing → language analysis)

Any speech recognition system involves five major steps:

Converting sounds into electrical signals: when we speak into a microphone, it converts the sound waves into electrical signals. In any machine that records or transmits the human voice, the sound wave is converted into an electrical signal using a microphone. When we speak into a telephone receiver, for instance, its microphone converts the acoustic wave into an electrical analog signal that is transmitted through the telephone network. The electrical signal's strength varies in amplitude over time and is referred to as an analog signal or an analog waveform. The analog signal is then converted into a digital signal using a sound card.

Background noise removal: the ASR program removes background noise and retains only the words that you have spoken.


Breaking up words into phonemes: the words are broken down into individual sounds, known as phonemes, which are the smallest discernible sound units. For each small segment of time, a feature value is extracted from the wave; in this way the wave is divided into small parts, called phonemes.

Matching and choosing the character combination: this is the most complex phase. The program has a big dictionary of common words in the language. Each phoneme is matched against the stored sounds and converted into the appropriate character group. This is where the problem begins: the program checks and compares words that sound similar to what it has heard, and all these similar words are collected as candidates.

Language analysis: here the program checks whether the language allows a particular syllable to appear after another.

After that, there is a grammar check: the program tries to find out whether or not the combination of words makes any sense, typically using a grammar-check package.

Finally, the recognized words constituting the speech are assembled into text. Most speech recognition programs come with their own word processor, and some can work with other word processing packages like MS Word and WordPerfect.
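The five steps above can be pictured as a processing pipeline. The Python sketch below is only a schematic outline with placeholder stages (the dictionary entry, the phoneme strings and the threshold are invented); a real recognizer would use signal processing and statistical models in place of these stubs.

```python
# Schematic outline of the five steps; every stage is a placeholder stub.

def digitize(analog_samples):
    """Step 1: the sound card samples the analog microphone signal."""
    return list(analog_samples)

def remove_noise(samples):
    """Step 2: suppress background noise, keep only the spoken sound."""
    return [s for s in samples if abs(s) > 0.01]      # crude energy gate

def split_into_phonemes(samples):
    """Step 3: segment the waveform into phoneme-sized units (stub)."""
    return ["dh", "ax", "r", "ow", "d"]               # invented phonemes

def match_words(phonemes, dictionary):
    """Step 4: match phoneme groups against a pronunciation dictionary."""
    return [w for w, pron in dictionary.items() if pron == phonemes]

def language_analysis(candidates):
    """Step 5: keep the candidate the language model prefers (stub)."""
    return candidates[0] if candidates else ""

dictionary = {"the road": ["dh", "ax", "r", "ow", "d"]}   # invented entry
signal = [0.0, 0.2, -0.3, 0.005, 0.4]                     # invented samples
words = match_words(split_into_phonemes(remove_noise(digitize(signal))),
                    dictionary)
print(language_analysis(words))                           # -> "the road"
```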

6.3. VARIATIONS IN SPEECH

The speech recognition process is complicated because the production of phonemes and the transitions between them vary from person to person and even for the same person. Different people speak differently: accents, regional dialects, sex, age, speech impediments, emotional state, and other factors cause people to pronounce the same word in different ways. Phonemes are added, omitted, or substituted. For example, the word America is pronounced in parts of New England as "Amrica". The rate of speech also varies from person to person, depending upon a person's habits and regional background.

A word or a phrase spoken by the same individual differs from moment to moment: illness, tiredness, stress or other conditions cause subtle variations in the way a word is spoken at different times.

Also, the voice quality varies in accordance with the position of the person relative to the microphone, the

acoustic nature of the surroundings, or the quality of the recording devices. The resulting changes in the

waveform can drastically affect the performance of the recognizer.

6.4. VOCABULARIES FOR COMPUTERS


Each ASR system has an active vocabulary (a set of words from which the recognition engine tries to make sense of an utterance) and a total vocabulary size (the total number of words in all possible sets that can be culled from memory).

The vocabulary size and the system recognition latency (the allowable time to accurately recognize an utterance) determine the processing horsepower required of the recognition engine.

A typical active vocabulary set comprises approximately fourteen words plus "none of the above", which the recognizer chooses when none of the fourteen words is a good match. The recognition latency, when using a 4-MIPS processor, is about 0.5 second for a speaker-independent set. Processing power requirements increase dramatically for large-vocabulary recognition sets with thousands of words; real-time latencies with a vocabulary of a few thousand words are possible only through the use of Pentium-class processors. A small active vocabulary limits the system's search range, providing advantages in latency and search time, whereas a large total vocabulary enables a more versatile human interface but increases system memory requirements. A system with a small active vocabulary for each prompt usually provides faster, more accurate results; similar-sounding words in a vocabulary set cause recognition errors, while a unique sound for each word enhances the recognition engine's accuracy.

6.5. WHICH SYSTEM TO CHOOSE

In choosing a speech recognition system you should consider the degree of speaker independence it

offers. Speaker independent systems can provide high recognition accuracies for a wide range of users

without needing to adapt to each user's voice. Speaker-dependent systems require you to train the system to your voice to attain high accuracy. Speaker-adaptive systems, an intermediate category, are essentially speaker-independent but can adapt their templates for each user to improve accuracy.

ADVANTAGES OF SPEAKER INDEPENDENT SYSTEM

The advantage of a speaker-independent system is obvious: anyone can use the system without first training it. However, its drawbacks are not so obvious. One limitation is the work that goes into creating the vocabulary templates. To create reliable speaker-independent templates, someone must collect and process numerous speech samples. This is a time-consuming task, and creating these templates is not a one-time effort. Speaker-independent templates are language-dependent, and they are sensitive not only to dissimilar languages but also to the differences between, say, British and American English. Therefore, as part of your design activity, you would need to create a set of templates for each language or


a major dialect that your customers use. Speaker-independent systems also have a relatively fixed vocabulary because of the difficulty in creating a new template in the field at the user's site.

ADVANTAGE OF A SPEAKER-DEPENDENT SYSTEM:

A speaker dependent system requires the user to train the ASR system by providing examples of his

own speech. Training can be a tedious process, but the system has the advantage of using templates that refer

only to the specific user and not some vague average voice. The result is language independence. You can

say ja, si, or ya during training, as long as you are consistent. The drawback is that the speaker-dependent

system must do more than simply match incoming speech to the templates. It must also include resources

to create those templates.

WHICH IS BETTER:

For a given amount of processing power, a speaker dependent system tends to provide more

accurate recognition than a speaker-independent system. This does not mean that a speaker-dependent system is inherently better; the difference in performance stems from the speaker-independent template having to encompass wide speech variations.

6.6. TECHNIQUES IN VOGUE:

The most frequently used speech recognition technique involves template matching, in which vocabulary words are characterized in memory as templates: time-based sequences of spectral information taken from waveforms obtained during training.
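Because two utterances of the same word rarely have the same duration, template matching is usually scored with dynamic-programming time alignment (dynamic time warping). The Python sketch below is a minimal illustration that assumes each frame is reduced to a single feature value; real systems compare spectral vectors (e.g. LPC coefficients) with an appropriate distance measure, and the example templates here are invented.

```python
def dtw_distance(template, utterance):
    """Dynamic time warping: minimum total frame distance between two
    feature sequences, allowing either sequence to stretch in time."""
    n, m = len(template), len(utterance)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(template[i - 1] - utterance[j - 1])
            d[i][j] = cost + min(d[i - 1][j],       # advance template only
                                 d[i][j - 1],       # advance utterance only
                                 d[i - 1][j - 1])   # advance both
    return d[n][m]

# Invented one-dimensional "spectral" templates for two vocabulary words.
templates = {"yes": [1, 3, 4, 3, 1], "no": [2, 2, 5, 5, 2]}
utterance = [1, 1, 3, 4, 4, 3, 1]          # a slower rendition of "yes"
scores = {word: dtw_distance(t, utterance) for word, t in templates.items()}
print(min(scores, key=scores.get))         # -> yes
```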

As an alternative to template matching, feature-based designs have been used, in which a time sequence of the pertinent phonetic features is extracted from the speech waveform. Different modelling approaches are used, but models involving state diagrams have been found to give encouraging performance. In particular, hidden Markov models (HMMs) are frequently applied. With HMMs any speech unit can be modelled, and all knowledge sources can be included in a single, integrated model. Various types of HMMs have been implemented with differing results: some model each word in the vocabulary, while others model sub-word speech units.


7. HIDDEN MARKOV MODEL

A hidden Markov model can be used to model an unknown process that produces a sequence of

observable outputs at discrete intervals where the outputs are members of some finite alphabet. It might be

helpful to think of the unknown process as a black box about whose workings nothing is known except

that, at each interval, it issues one member chosen from the alphabet. These models are called "hidden"

Markov models precisely because the state sequence that produced the observable output is not known; it is "hidden." HMMs have been found to be especially apt for modelling speech processes.

CHOICE OF SPEECH UNITS

The amount of storage required and the amount of processing time for recognition are functions of

the number of units in the inventory, so selection of the unit will have a significant impact. Another

important consideration in selecting a speech unit concerns the ability to model contextual differences.

A further consideration concerns the ease with which adequate training can be provided.

MODELING SPEECH UNITS WITH HIDDEN MARKOV MODELS

Suppose we want to design a word-based, isolated word recognizer using discrete hidden Markov

models. Each word in the vocabulary is represented by an individual HMM, each with the same number of

states. A Word can be modelled as a sequence of syllables, phonemes, or other speech sounds that have a

temporal interpretation and can best be modelled with a left-to-right HMM whose states represent the

speech sounds. Assume the longest word in the vocabulary can be represented by a 10-state HMM. So,

using a 10-state HMM like that of the figure below for each word, let's assume states in the HMM represent

phonemes. The dotted lines in the figure are null transitions, so any state can be omitted and some words

modelled with fewer states. The duration of a phoneme is accommodated by having a state transition

returning to the same state. Thus, at a clock time, a state may return to itself and may do so at as many

clock times as required to correctly model the duration of that phoneme in the word, Except for beginning

and end states, which represent transitions into and out of the word, each state in the word model has a self-

transition. Assume, in our example, that the input speech waveform is coded into a string of spectral

vectors, one occurring every 10 milliseconds, and that vector quantization further transforms each spectral

vector to a single value that indexes a representative vector in the codebook. Each word in the vocabulary


will be trained through a number of repetitions by one or more talkers. As each word is trained, the

transitional and output probabilities of its HMM are adjusted to merge the latest word repetition into the

model. During training, the codebook is iterated with the objective of deriving one that’s optimum for the

defined vocabulary. When an unknown spoken word is to be recognized, it is transformed into a string of codebook indices. That string is then treated as an HMM observation sequence by the recognizer, which calculates, for each word model in the vocabulary, the probability of that HMM having generated the observations. The word corresponding to the word model with the highest probability is selected as the one recognized.
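The recognition decision described above is usually computed with the forward algorithm, which sums, over all hidden state paths, the probability that a given word's HMM produced the observed codebook-index string. The Python sketch below uses two invented two-state word models and a three-symbol codebook purely for illustration.

```python
def forward_probability(obs, start_p, trans_p, emit_p):
    """Forward algorithm: probability that this HMM generated the
    observation sequence, summed over all hidden state paths."""
    states = list(start_p)
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

# Two invented 2-state left-to-right word models over codebook symbols 0-2.
word_models = {
    "yes": dict(start_p={"s1": 1.0, "s2": 0.0},
                trans_p={"s1": {"s1": 0.5, "s2": 0.5},
                         "s2": {"s1": 0.0, "s2": 1.0}},
                emit_p={"s1": {0: 0.8, 1: 0.1, 2: 0.1},
                        "s2": {0: 0.1, 1: 0.8, 2: 0.1}}),
    "no":  dict(start_p={"s1": 1.0, "s2": 0.0},
                trans_p={"s1": {"s1": 0.5, "s2": 0.5},
                         "s2": {"s1": 0.0, "s2": 1.0}},
                emit_p={"s1": {0: 0.1, 1: 0.1, 2: 0.8},
                        "s2": {0: 0.1, 1: 0.8, 2: 0.1}}),
}

observations = [0, 0, 1, 1]        # codebook indices from vector quantization
scores = {w: forward_probability(observations, **m) for w, m in word_models.items()}
print(max(scores, key=scores.get))   # word model with the highest probability
```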

ACOUSTIC/PHONETIC EXAMPLE USING HIDDEN MARKOV MODEL

Every speech recognition system has its own architecture. Even those that are based on HMMs have

their individual designs, but all share some basic concepts and features, many of which are recognizable

even though the names are often different. A representative block diagram is given below. The input to a

recognizer represented by the figure arrives from the left in the form of a speech waveform, and an output

word or sequence of words emanates from the recognizer to the right.

It incorporates:

(A) SPECTRAL CODING: The purpose of spectral coding is to transform the signal into digital

form embodying speech features that facilitate subsequent recognition tasks. Besides spectral coding, this function is sometimes called spectrum analysis, acoustic parameterization, etc. Recognizers can work with time-domain coding, but spectrally coded parameters in the frequency domain have advantages and are widely used; hence the title "spectral coding."


Fig. 7.1: A hidden Markov model recognizer


(B) UNIT MATCHING: The objective of unit matching is to transcribe the output data stream from

the spectral coding module into a sequence of speech units. The function of this module is also referred to

as feature analysis, phonetic decoding, phonetic segmentation, phonetic processing, feature extraction, etc.

(C) LEXICAL DECODING: The function of this module is to match strings of speech units in the

unit matching module's output stream with words from the recognizer's lexicon. It outputs candidate words, usually in the form of a word lattice containing sets of alternative word choices.

(D) SYNTACTIC, SEMANTIC, AND OTHER ANALYSES: Analyses that follow lexical decoding all have the purpose of pruning the worst candidates passed along from the lexical decoding module until optimal word selections can be made. Various means, and various sources of intelligence, can be applied to this end. Acoustic information (stress, intonation, change of amplitude or pitch, relative location of formants, etc.) obtained from the waveform can be employed, but sources of intelligence from outside the waveform are also available. These include syntactic, semantic, and pragmatic information.
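As a simple illustration of how such knowledge can prune candidates, the Python sketch below rescores alternative word sequences from a (hypothetical) word lattice with an invented bigram language model and keeps the highest-scoring hypothesis.

```python
# Invented bigram log-probabilities; unseen bigrams get a flat penalty.
bigram_logp = {("the", "road"): -1.0, ("road", "to"): -0.7,
               ("the", "load"): -6.0, ("load", "to"): -5.0,
               ("to", "hal"): -2.0}
UNSEEN = -10.0

def sentence_score(words):
    """Sum of bigram log-probabilities over the word sequence."""
    return sum(bigram_logp.get(pair, UNSEEN) for pair in zip(words, words[1:]))

# Alternative word choices passed along from lexical decoding.
candidates = [["the", "road", "to", "hal"],
              ["the", "load", "to", "hal"],
              ["the", "road", "two", "hal"]]
best = max(candidates, key=sentence_score)
print(" ".join(best))   # -> "the road to hal"
```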


8. APPLICATIONS

The specific use of speech recognition technology will depend on the application. Some target

applications that are good candidates for integrating speech recognition include:

Games and Edutainment

Speech recognition offers game and edutainment developers the potential to bring their applications

to a new level of play. With games, for example, traditional computer-based characters could

evolve into characters that the user can actually talk to. While speech recognition enhances the

realism and fun in many computer games, it also provides a useful alternative to keyboard-based

control, and voice commands provide new freedom for the user in any sort of application, from

entertainment to office productivity.

Data Entry

Applications that require users to keyboard paper-based data into the computer (such as database

front-ends and spreadsheets) are good candidates for a speech recognition application. Reading data

directly to the computer is much easier for most users and can significantly speed up data entry.

While speech recognition technology cannot effectively be used to enter names, it can enter

numbers or items selected from a small (less than 100 items) list. Some recognizers can even handle

spelling fairly well. If an application has fields with mutually exclusive data types (for example,

one field allows "male" or "female", another is for age, and a third is for city), the speech

recognition engine can process the command and automatically determine which field to fill in.

Document Editing

This is a scenario in which one or both modes of speech recognition could be used to dramatically

improve productivity. Dictation would allow users to dictate entire documents without typing.

Command and control would allow users to modify formatting or change views without using the

mouse or keyboard. For example, a word processor might provide commands like "bold", "italic",

"change to Times New Roman font", "use bullet list text style," and "use 18 point type." A paint

package might have "select eraser" or "choose a wider brush."

Command and Control

ASR systems that are designed to perform functions and actions on the system are defined as

Command and Control systems. Utterances like "Open Netscape" and "Start a new xterm" will do

just that.

Telephony


Some PBX/Voice Mail systems allow callers to speak commands instead of pressing buttons to

send specific tones.

Wearable devices

Because inputs are limited for wearable devices, speaking is a natural possibility.

Medical/Disabilities

Many people have difficulty typing due to physical limitations such as repetitive strain injuries

(RSI), muscular dystrophy, and many others. For example, people with difficulty hearing could use

a system connected to their telephone to convert the caller's speech to text.

Embedded Applications

Some newer cellular phones include C&C speech recognition that allows utterances such as "Call Home". This could be a major factor in the future of ASR and Linux. Why can't I talk to my television yet?

9. LIMITATIONS OF SPEECH RECOGNITION

Each of the speech technologies of recognition and synthesis has its limitations. These limitations

or constraints on speech recognition systems focus on the idea of variability. Overcoming the tendency for

ASR systems to assign completely different labels to speech signals which a human being would judge to

be variants of the same signal has been a major stumbling block in developing the technology. The task has

been viewed as one of de-sensitising recognisers to variability. It is not entirely clear that this idea models

adequately the parallel process in human speech perception.

Human beings are extremely good at spotting similarities between input signals, whether they are speech

signals or some other kind of sensory input, like visual signals. The human being is essentially a pattern

seeking device, attempting all the while to spot identity rather than difference.

By contrast, traditional computer programming techniques make it relatively easy to spot differences, but

surprisingly difficult to spot similarity even when the variability is only slight. Much effort is being

devoted at the moment to developing techniques which can re-orientate this situation and turn the computer

into an efficient pattern spotting device.


10. MERITS

The uses of speech technology are wide ranging. Most effort at the moment centers around trying to

provide voice input and output for information systems - say, over the telephone network.

A relatively new refinement here is the provision of speech systems for accessing distributed information

of the kind presented on the Internet. The idea is to make this information available to people who do not

have, or do not want to have, access to screens and keyboards. Essentially researchers are trying to harness

the more natural use of speech as a means of direct access to systems which are more normally associated with the technological paraphernalia of computers.

Clearly a major use of the technology is to assist people who are disadvantaged in one way or another with

respect to producing or perceiving normal speech.

The eavesdropping potential referred to in the slide is not sinister. It simply means the provision of, say, a

speech recognition system for providing an input to a computer when the speaker has their hands engaged

on some other task and cannot manipulate a keyboard - for example, a surgeon giving a running

commentary on what he or she is doing. Another example might be a car mechanic on his or her back

underneath a vehicle interrogating a stores computer as to the availability of a particular spare part.

CONCLUSION

Speech recognition is a truly amazing human capacity, especially when you consider that normal

conversation requires the recognition of 10 to 15 phonemes per second. It should be of little surprise then

that attempts to make machine (computer) recognition systems have proven difficult. Despite these

problems, a variety of systems are becoming available that achieve some success, usually by addressing

one or two particular aspects of speech recognition. A variety of speech synthesis systems, on the other

hand, have been available for some time now. Though limited in capabilities and generally lacking the

“natural” quality of human speech, these systems are now a common component in our lives.


BIBLIOGRAPHY

[1] L. R. Rabiner and B. Juang, "Fundamentals of Speech Recognition", Pearson Education (Asia) Pte. Ltd., 2004.

[2] L. R. Rabiner and R. W. Schafer, "Digital Processing of Speech Signals", Pearson Education (Asia) Pte. Ltd., 2004.

[4] S. Young, "HMMs and Related Speech Recognition Technologies", Part E 27.

[5] L. Rabiner and B.-H. Juang, "Historical Perspective of the Field of ASR/NLU", Part E 26.

[6] http://en.wikipedia.org/wiki/Speech_recognition
