Automatic Speech Recognition

August 21, 2013 SEARCH USING VOICE AND IMAGE RECOGNITION Speech Recognition



CONTENTS

01 What is Voice?
02 Components of Sound
03 Why Voices are Different
04 Classification of Speech Sounds
05 Process of Speech Production
06 What is Voice Recognition?
07 ASR (Automatic Speech Recognition)
08 Types of ASR
09 Approaches to Speech Recognition
10 Process of Speech Recognition
11 How Speech Recognition Works
12 Approaches to Speech Recognition
13 Applications of Speech Processing


1. What is Voice?

The voice consists of sound made by a human being using the vocal folds for talking, singing, laughing, crying, screaming, etc. The human voice is specifically that part of human sound production in which the vocal folds (vocal cords) are the primary sound source. Generally speaking, the mechanism for generating the human voice can be subdivided into three parts: the lungs, the vocal folds within the larynx, and the articulators. The lungs (the pump) must produce adequate airflow and air pressure to vibrate the vocal folds (this air pressure is the fuel of the voice).

The vocal folds (vocal cords) are a vibrating valve that chops up the airflow from the lungs into audible pulses that form the laryngeal sound source. The muscles of the larynx adjust the length and tension of the vocal folds to 'fine tune' pitch and tone. The articulators (the parts of the vocal tract above the larynx, consisting of the tongue, palate, cheeks, lips, etc.) articulate and filter the sound emanating from the larynx, and to some degree can interact with the laryngeal airflow to strengthen or weaken it as a sound source.

The vocal folds, in combination with the articulators, are capable of producing highly intricate arrays of sound. The tone of voice may be modulated to suggest emotions such as anger, surprise, or happiness. Singers use the human voice as an instrument for creating music.

2. Components of Sound

There are nine (09) components of sound, given below:

1. Music components: Pitch, Timbre, Harmonics
2. Loudness and Rhythm
3. Sound envelope components: Attack, Sustain, Decay
4. Record and playback component: Speed


Different Terms

1. Compressions, in which particles are crowded together, appear as upward curves in the line.

2. Rarefactions, in which particles are spread apart, appear as downward curves in the line.

Three characteristics are used to describe a sound wave: wavelength, frequency, and amplitude.

3. Wavelength: the distance from the crest of one wave to the crest of the next.

4. Frequency: the number of waves that pass a point each second.

5. Amplitude: the measure of the amount of energy in a sound wave.

6. Pitch: how high or low a sound seems. A bird makes a high-pitched sound; a lion makes a low-pitched sound.
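The relationship between these quantities can be sketched in a few lines of Python. This is an illustrative example, not from the slides; the speed-of-sound constant and the 440 Hz tone are assumed values:

```python
# Relationship between frequency and wavelength: speed = frequency * wavelength.
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C (assumed value)

def wavelength(frequency_hz: float) -> float:
    """Return the wavelength in metres of a sound wave of the given frequency."""
    return SPEED_OF_SOUND / frequency_hz

# A 440 Hz tone (concert A) has a wavelength of about 0.78 m;
# higher pitch means higher frequency and shorter wavelength.
print(round(wavelength(440.0), 2))
```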


Sounds also differ in how loud or soft they are. The more energy a sound wave has, the louder the sound seems. The intensity of a sound is the amount of energy it has; you hear intensity as loudness. Remember that the amplitude, or height, of a sound wave is a measure of the amount of energy in the wave, so the greater the intensity of a sound, the greater the amplitude.

Pitch and loudness are two ways that sounds differ. Another way is in quality. Some sounds are pleasant and some are noise. Compare the two waves on the right: a pleasant sound has a regular wave pattern, repeated over and over, while the waves of noise are irregular and have no repeated pattern.

3. Why Voices are Different?

Voices differ because of:

INTENSITY (depends on amplitude)
PITCH (frequency)
TONE (pleasant or unpleasant)

1. Amplitude is a measure of energy. The more energy a wave has, the higher its amplitude. As amplitude increases, intensity also increases.


2. Intensity is the amount of energy a sound has over an area. The same sound is more intense if you hear it in a smaller area. In general, we call sounds with a higher intensity louder.

3. Pitch depends on the frequency of a sound wave. Frequency is the number of wavelengths that fit into one unit of time.


4. Classification of Speech Sounds

One can make broad divisions such as voiced and unvoiced sounds, or become more specific, such as front vowels, back vowels, semivowels, and so on.

The difference between voiced and unvoiced sounds becomes clear in these samples. The first two blocks demonstrate a dominant low-frequency sound wave, which is not present in the third block. This frequency is produced by the vibration of the larynx, or voice box. Although the exact frequency differs for each speaker (females tend to have a higher frequency), the dominant presence of a low-frequency sound wave is a surefire indicator of a voiced sound.

1. Voiced sounds: the vocal cords play an active role in production, e.g. /a/, /e/, /i/. They show a dominant low-frequency component from vocal-fold vibration.

2. Unvoiced sounds: produced when the vocal cords are inactive, e.g. /s/, /f/. The sound is built up by air pressure alone.
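The voiced/unvoiced distinction above can be sketched with a zero-crossing test: voiced frames are dominated by a low frequency and cross zero rarely, while unvoiced frames cross often. A minimal illustration with synthetic frames; the threshold and sampling rate are assumptions, not from the slides:

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def classify_frame(frame, zcr_threshold=0.25):
    """Label a frame 'voiced' (low zero-crossing rate) or 'unvoiced' (high)."""
    return "unvoiced" if zero_crossing_rate(frame) > zcr_threshold else "voiced"

sr = 8000
# A 100 Hz sine (larynx-like low frequency) crosses zero rarely: voiced-like.
voiced = [math.sin(2 * math.pi * 100 * n / sr) for n in range(240)]
# Alternating-sign "hiss" crosses zero at every sample: unvoiced-like.
unvoiced = [(-1) ** n * 0.1 for n in range(240)]
print(classify_frame(voiced), classify_frame(unvoiced))
```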


5. Process of Speech Production

6. What is Voice Recognition?

Voice recognition is the process of taking the spoken word as an input to a computer program. It is the process of converting voice into electrical signals, which are then transformed into coding patterns.

Voice recognition is "the technology by which sounds, words or phrases spoken by humans are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned". The concept could more generally be called "sound recognition".

Also called speech recognition, voice recognition is the ability of a computer, software program, or hardware device to decode the human voice into digitized speech that can be interpreted by the computer or hardware device. Voice recognition is commonly used to operate a device, perform commands, or write without having to operate a keyboard or mouse or press any buttons.

7. ASR (Automatic Speech Recognition)

ASR is the process of converting an acoustic signal, captured by a microphone or telephone, into a set of words. The recognized words can be the final result, as in applications such as command and control, data entry, and document preparation. They can also serve as input to further linguistic processing in order to achieve speech understanding.

The first ASR device was used in 1952 and recognized single digits spoken by a user (it was not computer driven). Today, ASR programs are used in many industries, including healthcare, the military (e.g. jets and helicopters), telecommunications, and personal computing (e.g. hands-free computing).

Evaluation of ASR

Acoustic Model

An acoustic model is created by taking audio recordings of speech and their text transcriptions, and using software to create statistical representations of the sounds that make up each word. It is used by a speech recognition engine to recognize speech.


Language Model

Language modeling is used in many natural language processing applications. In speech recognition, a language model tries to capture the properties of a language and to predict the next word in a speech sequence.
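Predicting the next word can be sketched with a simple bigram count model. This is a toy illustration of the idea, not the model the slides describe; the three-sentence corpus is invented:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count bigram and unigram frequencies over tokenized sentences."""
    bigrams = defaultdict(int)
    unigrams = defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] + sentence  # <s> marks the start of a sentence
        for a, b in zip(tokens, tokens[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return bigrams, unigrams

def predict_next(bigrams, unigrams, word):
    """Return the most likely word to follow `word`: argmax of count(word, w) / count(word)."""
    candidates = {b: c / unigrams[word] for (a, b), c in bigrams.items() if a == word}
    return max(candidates, key=candidates.get)

corpus = [["recognize", "speech"], ["recognize", "speech"], ["recognize", "words"]]
bi, uni = train_bigram(corpus)
print(predict_next(bi, uni, "recognize"))  # "speech" follows "recognize" most often
```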

8. Basic Types of Speech Recognition Systems

1. Speaker-dependent: the user must provide samples of his or her speech before using the system. The voice recognition must be trained before it can be used; this often requires that the user read a series of words and phrases so the computer can understand the user's voice. Speaker-dependent software works by learning the unique characteristics of a single person's voice, in a way similar to voice recognition. New users must first "train" the software by speaking to it, so the computer can analyze how the person talks. This often means users have to read a few pages of text to the computer before they can use the speech recognition software.

2. Speaker-independent: no speaker enrollment is necessary. The voice recognition software recognizes most users' voices with no training. Speaker-independent software is designed to recognize anyone's voice, so no training is involved. This makes it the only real option for applications such as interactive voice response systems, where businesses cannot ask callers to read pages of text before using the system. The downside is that speaker-independent software is generally less accurate than speaker-dependent software.

Other types

1. Discrete speech recognition: the user must pause between each word so that the recognizer can identify each separate word.

2. Continuous speech recognition: the recognizer can understand a normal rate of speaking.

3. Natural language: the system not only understands the voice but can also return answers to questions or other queries that are asked.

9. Approaches to ASR

Template matching
Knowledge-based (or rule-based) approach
Statistical approach: noisy channel model + machine learning

1. Template matching

Template matching is SPEAKER-DEPENDENT: it matches the voice against already saved templates, so the system must be trained first, and the user must speak the same words that are available as templates. Recognition accuracy can be about 98 percent. Store examples of units (words, phonemes), then find the example that most closely fits the input: extract features from the speech signal, and the rest is "just" a complex similarity matching problem, using solutions developed for all sorts of applications. This works well for discrete utterances and a single user, but it is hard to distinguish very similar templates, and accuracy quickly degrades when the input differs from the templates. It therefore needs techniques to mitigate this degradation: more subtle matching techniques, and multiple templates which are aggregated.
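A classic way to match an input against a stored template even when the speaking rate differs is dynamic time warping (DTW), which these slides mention later in connection with MATLAB. A minimal sketch over 1-D feature sequences; the toy sequences are invented for illustration:

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences.
    Allows one sequence to stretch or compress in time relative to the other."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: step in a, step in b, or step in both.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

template = [1, 2, 3, 3, 2, 1]          # stored template
utterance = [1, 1, 2, 3, 3, 3, 2, 1]   # same "word" spoken more slowly
other = [3, 3, 1, 1, 3, 3]             # a different "word"
print(dtw_distance(template, utterance) < dtw_distance(template, other))  # True
```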

2. Rule-based approach

The rule-based approach is SPEAKER-INDEPENDENT. It first processes the given voice as input, using LPC (Linear Predictive Coding), and attempts to find similarities between the expected input and the digitized input. Recognition accuracy for speaker-independent systems is somewhat less than for speaker-dependent systems, usually between 90 and 95 percent. It uses knowledge of phonetics and linguistics to guide the search process: templates are replaced by rules expressing everything (anything) that might help to decode, such as phonetics, phonology, phonotactics, syntax, and pragmatics.

The typical approach is based on a "blackboard" architecture: at each decision point, lay out the possibilities, then apply rules to determine which sequences are permitted. Performance is poor due to the difficulty of expressing rules, the difficulty of making rules interact, and the difficulty of knowing how to improve the system.

3. Statistical approach

The statistical approach can be seen as an extension of the template-based approach, using more powerful mathematical and statistical tools. It is sometimes seen as an "anti-linguistic" approach; Fred Jelinek (IBM, 1988): "Every time I fire a linguist my system improves." Collect a large corpus of transcribed speech recordings, train the computer to learn the correspondences ("machine learning"), and at run time apply statistical processes to search through the space of all possible solutions and pick the statistically most likely one.
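The "pick the statistically most likely one" step is usually formulated as noisy-channel decoding: choose the word sequence W maximizing P(A|W) * P(W), where P(W) comes from the language model and P(A|W) from the acoustic model. A toy sketch; every probability below is made up for illustration:

```python
# Toy noisy-channel decoder over two candidate transcriptions.
# All probabilities are invented illustrative values.
language_model = {"recognize speech": 0.6, "wreck a nice beach": 0.4}   # P(W)
acoustic_model = {"recognize speech": 0.3, "wreck a nice beach": 0.35}  # P(A|W)

def decode():
    """Return the candidate maximizing P(A|W) * P(W)."""
    return max(language_model, key=lambda w: language_model[w] * acoustic_model[w])

print(decode())  # 0.6 * 0.3 = 0.18 beats 0.4 * 0.35 = 0.14
```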


10. Process of Speech Recognition

Vocal tract: consists of the laryngeal pharynx, oral pharynx, oral cavity, nasal cavity, and nasal pharynx.

Spectrum analysis: MFCCs (Mel-frequency cepstral coefficients) are used to produce voice features; DTW (dynamic time warping) selects the pattern that best matches the database (implemented in MATLAB).

11. How Speech Recognition Works

Divide the sound wave into evenly spaced blocks, transforming the PCM digital audio into a better acoustic representation. Process each block for important characteristics, such as strength across various frequency ranges, number of zero crossings, and total energy. Apply a "grammar" so the speech recognizer knows what phonemes to expect; a grammar could be anything from a context-free grammar to a full-blown natural language. Using this characteristic vector, attempt to associate each block with a phone, the most basic unit of speech, producing a string of phones; that is, figure out which phonemes are spoken.


Find the word whose model is the most likely match to the string of phones that was produced; that is, convert the phonemes into words.

1. Speech Detection

The first task is to identify the presence of a speech signal. This task is easy if the signal is clean; however, the signal frequently contains background noise, resulting from a noisy microphone, a fan running in the room, etc. The signals obtained were in fact found to contain some noise. I used two criteria to identify the presence of a spoken word: first, the total energy is measured, and second, the number of zero crossings is counted. Both of these were found to be necessary, as voiced sounds tend to have a high volume (and thus a high total energy) but a low overall frequency (and thus a low number of zero crossings), while unvoiced sounds were found to have a high frequency but a low volume. Only background noise was found to have both low energy and low frequency. The method was found to successfully detect the beginning and end of the several words tested. Note that this is not sufficient for the general case, as fluent speech tends to have pauses, even in the middle of words (such as in the word 'acquire', between the 'c' and 'q').
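The two criteria above (high energy OR many zero crossings, with only background noise failing both) can be sketched per frame as follows. The thresholds and synthetic signals are illustrative assumptions, not values from the original project:

```python
import math

def frame_features(frame):
    """Total energy and zero-crossing count for one frame of samples."""
    energy = sum(s * s for s in frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return energy, crossings

def is_speech(frame, energy_threshold=1.0, crossing_threshold=20):
    """A frame counts as speech if it is loud (voiced) OR high-frequency (unvoiced);
    only background noise is both quiet and low-frequency."""
    energy, crossings = frame_features(frame)
    return energy > energy_threshold or crossings > crossing_threshold

sr = 8000
# Loud 120 Hz tone: high energy, few crossings -> detected as (voiced) speech.
loud_voiced = [0.8 * math.sin(2 * math.pi * 120 * n / sr) for n in range(240)]
# Quiet, slowly varying hum: low energy AND few crossings -> background noise.
quiet_noise = [0.01 * (-1) ** (n // 37) for n in range(240)]
print(is_speech(loud_voiced), is_speech(quiet_noise))
```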

2. Blocking

The second task is blocking. Older speech recognition systems first attempted to detect where the phones would start and finish, and then blocked the signal by placing one phone in each block. However, phones can blend together in many circumstances, and this method generally could not reliably detect the correct boundaries. Most modern systems simply separate the signal into blocks of a fixed length. These blocks tend to overlap, so that phones which cross block boundaries will not be missed. This project uses blocks which are 30 msec in length (containing 600 samples), and which shift by 10 msec increments.
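The blocking scheme above (600 samples per 30 ms implies a 20 kHz sampling rate, with a 200-sample shift for the 10 ms increments) can be sketched directly:

```python
def make_blocks(samples, block_len=600, shift=200):
    """Split a signal into overlapping fixed-length blocks.
    Defaults match the text: 30 ms blocks (600 samples) shifted by
    10 ms (200 samples) at the implied 20 kHz sampling rate."""
    blocks = []
    start = 0
    while start + block_len <= len(samples):
        blocks.append(samples[start:start + block_len])
        start += shift
    return blocks

signal = list(range(2000))  # 100 ms of dummy samples at 20 kHz
blocks = make_blocks(signal)
print(len(blocks), len(blocks[0]))  # 8 overlapping blocks of 600 samples each
```

Because each block overlaps its neighbours by 400 samples, a phone straddling a block boundary still appears whole in at least one block.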

The next important step in the processing of the signal is to obtain a frequency spectrum of each block. The information in the frequency spectrum is often enough to identify the phone. The purpose of the frequency spectrum is to identify the formants, which are the peaks in the spectrum; vowels are often uniquely identified by their first two formants. This experiment has shown that the identification of formants is not a trivial task. One method to obtain a frequency spectrum is to apply an FFT to each block. The resulting information can be examined manually to find the peaks, but it is quite noisy, which makes it difficult for a computer to identify the peaks. Very useful data can still be obtained, often by measuring the strength across various frequency ranges.
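Measuring strength across frequency ranges can be sketched as follows. A naive DFT is used here for clarity (an FFT computes the same spectrum faster); the 500 Hz test tone and band edges are illustrative assumptions:

```python
import cmath
import math

def dft_magnitudes(frame):
    """Naive DFT magnitude spectrum up to the Nyquist bin (fine for short blocks)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2)]

def band_strength(mags, sample_rate, lo_hz, hi_hz):
    """Sum of spectral magnitudes whose bin frequency falls in [lo_hz, hi_hz)."""
    n = 2 * len(mags)  # original frame length
    return sum(m for k, m in enumerate(mags) if lo_hz <= k * sample_rate / n < hi_hz)

sr = 8000
frame = [math.sin(2 * math.pi * 500 * t / sr) for t in range(160)]  # pure 500 Hz tone
mags = dft_magnitudes(frame)
# Nearly all the energy falls in the band containing 500 Hz.
print(band_strength(mags, sr, 400, 600) > band_strength(mags, sr, 1000, 2000))
```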

Consider the frequency spectrum of a different speaker saying the 's' in 'yes'. The important feature to note is the presence of a peak in the 100-150 bin range (which scales to 3600-5400 Hz). This peak is a feature of the letter 's'. Each spectrum has a peak there, although it is at a different strength in each one. (Any data in the 0-10 range is likely to be noise.) In many cases, the overall strength in that range is quite low compared with the strength of the lower frequencies. This is a feature of the voiced sounds, although the exact frequencies vary with the speaker. The important features visible in this spectrum are the existence of a formant in the 80-100 range while the 'y' is spoken, and then later the existence of formants at both ~70 and ~50 simultaneously while the 'e' is spoken.

This is the frequency spectrum produced by another speaker while saying the 'ye' of 'yes'. Notice here that the 'y' and 'e' overlap substantially. Often, consonants will take on the frequencies of the vowels which follow them, and must be identified by characteristics other than their frequencies alone. Here, the 'y' may be identified by the transition from the higher frequency into the frequency of the vowel which follows.

Another method used to obtain a frequency spectrum is Linear Predictive Coding (LPC). This is the most successful method in widespread use today. The idea behind LPC is that the values of the signal can be expressed as a linear combination of the preceding values. That is, if s(i) is the amplitude at time i,

s(i) = a1*s(i-1) + a2*s(i-2) + ... + ap*s(i-p)

When the input data is filled in, this becomes a system of linear equations which can be solved to determine the values of a1 through ap. These values then produce a very clean spectrum, which clearly identifies the formants.
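Under the assumption above that s(i) is a linear combination of preceding samples, the coefficients can be recovered by least squares. A minimal order-2 sketch that solves the 2x2 normal equations directly (real LPC uses higher orders and the Levinson-Durbin recursion; the sinusoidal test signal is illustrative):

```python
import math

def lpc_order2(s):
    """Fit s(i) ~ a1*s(i-1) + a2*s(i-2) by least squares (order-2 LPC)."""
    # Accumulate the sums for the 2x2 normal equations.
    r11 = r12 = r22 = b1 = b2 = 0.0
    for i in range(2, len(s)):
        x1, x2, y = s[i - 1], s[i - 2], s[i]
        r11 += x1 * x1; r12 += x1 * x2; r22 += x2 * x2
        b1 += x1 * y;  b2 += x2 * y
    det = r11 * r22 - r12 * r12
    a1 = (b1 * r22 - b2 * r12) / det
    a2 = (r11 * b2 - r12 * b1) / det
    return a1, a2

# A pure sinusoid satisfies s(i) = 2*cos(w)*s(i-1) - s(i-2) exactly,
# so order-2 LPC should recover a1 = 2*cos(w) and a2 = -1.
w = 2 * math.pi * 500 / 8000
s = [math.sin(w * i) for i in range(200)]
a1, a2 = lpc_order2(s)
print(round(a1, 3), round(a2, 3))  # close to 2*cos(w) ~ 1.848 and -1.0
```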

3. Other Features

Plosives (b, p, d, t, g, k) can generally be identified by a pause followed by a sudden increase in energy of short duration. Nasals (n, m, ng) are often characterized by a single formant of low frequency, and if followed by a vowel, their formants tend to have a wide spectrum. The 'h' is characterized by a building unvoiced sound followed by a sudden sustained increase in energy at the formants of the vowel which follows. Unvoiced fricatives (th, s, sh, f) are characterized by a low-energy, wide-band, high-frequency spectrum. Their voiced counterparts (dh, z, zh, v) have an additional formant in the low-frequency spectrum. Affricates (j, ch) are often described as a plosive which turns into a fricative (d-zh and t-sh respectively). Glides, or semivowels (w, l, r, y), may be the most difficult to characterize, because they are highly situation dependent. They are followed by vowels, unless they appear at the end of a word, and behave much like a transition from another vowel into the vowel which follows. In this project we noted how the 'y' transitions from its characteristic frequency into the frequencies of the 'e' which follows it. There may be no clear distinction where one ends and the other begins.

4. Word Identification

Although this project used a very simple identification method to differentiate between two words, real word identification has many obstacles to overcome. Because we chose to divide our signal into blocks of a set duration, we do not know how many blocks a given phone may occupy. Some phones may only be recognized as the transition from one phone to another. Some phones may be missing or improperly identified. All of these notions are captured by a model known as the Hidden Markov Model (HMM). An HMM is basically a finite automaton in which each transition has a probability associated with it.

A given vocabulary word has an HMM which is designed to model the many possible strings of phones which may be produced by the utterance of the word. Each expected phone is generally represented by a state in the HMM, while each possible phone at every stage has an arc. This means that a 'y' may be represented as a 'y' or an 'i' arc, while both lead to the 'y' state. Self loops account for the possibility of a phone stretching over several blocks. Missed phones are also allowed, as an arc may jump over a state. Each arc is then assigned a probability to complete the HMM.

Then, on an input signal, a dynamic programming algorithm called the Viterbi algorithm is applied to identify which HMM is the most likely match for the input signal.
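The Viterbi scoring step can be sketched over a toy two-state, left-to-right word HMM for 'ye'. Every probability below is invented for illustration; the self loop on the 'y' state is what absorbs a phone that stretches over several blocks:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Probability of the best state path through an HMM for the observation sequence."""
    probs = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        # For each state, keep the best predecessor, then emit the observation.
        probs = {s: max(probs[prev] * trans_p[prev][s] for prev in states) * emit_p[s][obs]
                 for s in states}
    return max(probs.values())

def score_word(word_hmm, phones):
    return viterbi(phones, *word_hmm)

# Toy left-to-right HMM for the word "ye": state 'y' then state 'e'.
states = ["y", "e"]
start_p = {"y": 1.0, "e": 0.0}
trans_p = {"y": {"y": 0.5, "e": 0.5},   # self loop: 'y' may span several blocks
           "e": {"y": 0.0, "e": 1.0}}
# Emissions: the 'y' state may also emit the similar phone 'i'.
emit_p = {"y": {"y": 0.7, "i": 0.2, "e": 0.1},
          "e": {"y": 0.05, "i": 0.05, "e": 0.9}}
hmm = (states, start_p, trans_p, emit_p)

# 'y' stretched over two blocks then 'e' scores far better than the reversed order.
print(score_word(hmm, ["y", "y", "e"]) > score_word(hmm, ["e", "e", "y"]))  # True
```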

12. Approaches to Speech Recognition

Acoustic-phonetic approach
Pattern recognition approach (HMM)
Artificial intelligence approach (neural networks)

1. Pattern Recognition Approach

"A pattern is the opposite of a chaos; it is an entity, vaguely defined, that could be given a name."

A pattern is an object, process, or event. A class (or category) is a set of patterns that share common attributes (features), usually from the same information source. During recognition (or classification), classes are assigned to the objects. A classifier is a machine that performs this task.

2. Neural Network Approach

The classifier is represented as a network of cells modeling the neurons of the human brain (the connectionist approach).


3. Language Model

13. Applications of Speech Processing

Medical transcription
Military
Telephony and other domains
Serving the disabled
Home automation
Automobile audio systems
Telematics