voice operated switch

FINAL YEAR PROJECT REPORT ON

SPEECH / VOICE OPERATED SWITCH

Submitted to: Electrical & Electronics Dept.

Hindustan College of Science & Technology

Submitted By:

1. Acknowledgement

2. Introduction

3. Abstract

4. Key components: Hardware, Software

5. Description: Hardware (Circuits, Flow Diagram); Software (Initial Problem; How to Compare Recordings: Frequency Domain, Finding a Norm; Algorithm Instructions)

6. Operation

7. Advantages

8. Limitations

9. Applications

10. Conclusion

11. Bibliography

It gives us immense pleasure to present this technical paper on the topic CONTROLLING OF DEVICE THROUGH VOICE RECOGNITION USING MATLAB, undertaken during the final year of our Bachelor of Technology. We would like to express our deep and sincere gratitude to our project guide for the support in developing this document and for providing excellent guidance, encouragement and inspiration throughout the project work. We would also like to extend our special thanks to the Head of Department and all the staff of the Electrical and Electronics Department for their immense support in all kinds of work.

By :-

Speech is the most natural way for humans to communicate. While this has been true since the dawn of civilization, the invention and widespread use of the telephone, audio-phonic storage media, radio, and television have given even further importance to speech communication and speech processing. Advances in digital signal processing technology have led to the use of speech processing in many different application areas such as speech compression, enhancement, synthesis, and recognition.

The concept of a machine that can recognize the human voice has long been an accepted feature of science fiction. From "Star Trek" to Iron Man's Jarvis (2012), and in passages such as "Actually he was not used to writing by hand. Apart from very short notes, it was usual to dictate everything into the speak writer", it has been commonly assumed that one day it will be possible to converse naturally with an advanced computer-based system.

Voice recognition technology enables severely deaf or hearing-impaired people, who cannot recognize a voice merely by amplifying the sound with hearing aids, to see the recognized words. This is expected to be a remarkable innovation for the quality of life of the hearing-impaired. Lowering the gate length from the current size of 40nm to 10nm in semiconductor process technology within the next 20 years will bring about a reduction in the size of hearing aids. The theory was so simple that a voice was generated through the trachea and the speech was decoded in the brain. Even the voice spectrogram was not considered.

Voice Production/Perception Process

This switch works on a voice command (such as Play, Run, ON, OFF, End Program or Cancel) to operate a desired appliance.

Voice recognition (by a machine) is a very complex problem. Vocalizations vary in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed. Voice is also distorted by background noise, echoes and electrical characteristics.

Voice recognition is different from speech recognition.

SPEECH RECOGNITION (SR) is the translation of spoken words into text. It is also known as "automatic speech recognition", "computer speech recognition", "speech to text", or just "STT". Some SR systems use "training" where an individual speaker reads sections of text into the SR system. These systems analyse the person's specific voice and use it to fine tune the recognition of that person's speech, resulting in more accurate transcription. Systems that do not use training are called "Speaker Independent" systems. Systems that use training are called "Speaker Dependent" systems.

Speech recognition applications include voice user interfaces such as voice dialling (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), domotic appliance control, search (e.g., find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), speech-to-text processing (e.g., word processors or emails), and aircraft (usually termed Direct Voice Input).

Hardware

Computer system

Microphone

DB 9 connector

MAX232

AT89S8253 controller

Relay driver (ULN 2803)

Resistors, LEDs, PCB, relays, etc.

Software

Operating System

MATLAB

HARDWARE

In hardware, the components are a serial port, a MAX232 voltage level converter and a controller to take input and generate output. To drive the relays we have used the ULN 2803 IC, which contains an array of 8 Darlington transistor pairs. Darlington pairs are capable of providing the larger current needed to drive the relays. 8 LEDs are used as indicators, each corresponding to one of the 8 data lines of the data port.

CIRCUITS

RS-232 level converter circuit for UART communication

ULN-based relay driver circuit

Programmer circuit for the microcontroller

Flow Diagram

Block diagram of hardware

SOFTWARE

The software section is based entirely on MATLAB. In our interface we have used MATLAB for voice recognition. It can be used in three different modes, viz. 'text to speech', 'speech to text' and 'voice command recognizer'. We have used it in the third mode. In this mode of operation we can add predefined commands. It listens for a command and matches it against the given list. If a match occurs, it generates an event corresponding to the match. This event is passed as input to the controller, which sets its output accordingly and thus controls the system.
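The step from a recognized command to the controller's output can be sketched as follows. This is only an illustration written in Python rather than MATLAB: the command table, the choice of data lines and the idea of packing the 8 relay states into one byte are assumptions, not details taken from the project. Actually transmitting the byte over the serial port would need a serial library and is not shown.

```python
# Hypothetical mapping from a recognized command to (data line, switch on?).
# The project does not state which command drives which line.
COMMAND_ACTIONS = {
    "ON": (0, True),
    "OFF": (0, False),
    "RUN": (1, True),
    "CANCEL": (1, False),
}


def apply_command(port_state, command):
    """Return the new 8-bit relay state after a recognized command.

    port_state is an integer 0..255, one bit per data line.
    Unknown commands leave the state unchanged.
    """
    action = COMMAND_ACTIONS.get(command.strip().upper())
    if action is None:
        return port_state
    line, switch_on = action
    mask = 1 << line
    if switch_on:
        return (port_state | mask) & 0xFF
    return port_state & ~mask & 0xFF


state = 0
for spoken in ["on", "run", "hello", "off"]:
    state = apply_command(state, spoken)
    print(f"{spoken!r:8} -> port byte {state:08b}")
```

Keeping the whole port in one byte means a single value describes every relay, so the controller never needs to remember earlier commands.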

Initial Problem

A human can easily recognize a familiar voice; however, getting a computer to distinguish a particular voice among others is a more difficult task. Several problems arise immediately when trying to write a voice recognition algorithm. The majority of these difficulties are due to the fact that it is almost impossible to say a word exactly the same way on two different occasions. Factors that continuously change in human speech include how fast the word is spoken and which parts of the word are emphasized. Furthermore, even if a word could in fact be said the same way on different occasions, we would still be left with another major dilemma: in order to analyse two sound files in the time domain, the recordings would have to be aligned so that both begin at precisely the same moment.

How to Compare Recordings

Frequency Domain

Given the difficulties mentioned in the above paragraph, it became quite evident that any voice analysis in the time domain would be extremely impractical. Instead, an analysis of the frequency spectrum of a voice (which remains predominantly unchanged as speech is slightly varied) turned out to be a more viable option. Converting all recordings into the frequency domain (by applying the Discrete Fourier Transform) greatly simplified the process of comparing two recordings. That being said, working in the frequency domain also introduced a new set of issues that required attention.
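The alignment argument can be checked with a small experiment. The sketch below is in Python rather than MATLAB and uses a made-up two-tone signal, not a real recording. It computes the DFT magnitudes directly from the definition and compares a signal with a delayed copy of itself. The delay here is circular, for which the magnitude spectrum is exactly unchanged; for a real recording padded with silence the match would only be approximate.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Return |X[k]| for k = 0..N-1 using the direct O(N^2) DFT definition."""
    n = len(samples)
    mags = []
    for k in range(n):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def circular_shift(samples, delay):
    """Delay a signal by `delay` samples, wrapping around at the end."""
    delay %= len(samples)
    return samples[-delay:] + samples[:-delay]


# Hypothetical test signal: two tones at bins 5 and 12, 64 samples long.
N = 64
signal = [
    math.sin(2 * math.pi * 5 * t / N) + 0.5 * math.sin(2 * math.pi * 12 * t / N)
    for t in range(N)
]
delayed = circular_shift(signal, 9)

# Largest sample-by-sample difference in the time domain.
time_domain_gap = max(abs(a - b) for a, b in zip(signal, delayed))
# Largest bin-by-bin difference between the two magnitude spectra.
spectrum_gap = max(
    abs(a - b) for a, b in zip(dft_magnitudes(signal), dft_magnitudes(delayed))
)

print("time-domain difference:", time_domain_gap)
print("spectrum difference:   ", spectrum_gap)
```

The two signals differ greatly sample by sample, while their magnitude spectra agree to within rounding error, which is why the comparison is done on spectra.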

Finding a Norm

Due to the nature of human speech, all data pertaining to frequencies above 600 Hz can safely be discarded. Therefore, once a recording is converted into the frequency domain, it can simply be regarded as a vector in 600-dimensional Euclidean space. At this point, a comparison between two vectors can easily be carried out by normalizing the vectors (giving them length 1) and then computing the norm of the difference between the two (the difference between two vectors in R^600 is, of course, computed componentwise). Unfortunately, exactly which norm to use is not immediately clear. After carefully comparing and contrasting the Taxicab, Euclidean, and Maximum norms, it became clear that the Euclidean norm most accurately measured the closeness between different frequency spectra. Once the norm function was chosen, all that remained was to decide exactly how small the norm of the difference of two vectors had to be in order to determine that both recordings originated from the same person.
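The three candidate norms can be written out as below. This is a minimal Python sketch with three-component toy vectors standing in for the 600-component spectra; the function names are illustrative only.

```python
import math


def normalize(v):
    """Scale a vector to Euclidean length 1."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / length for x in v]


def taxicab(u, v):
    """Taxicab (L1) norm of u - v: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Euclidean (L2) norm of u - v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def maximum(u, v):
    """Maximum (L-infinity) norm of u - v: largest absolute difference."""
    return max(abs(a - b) for a, b in zip(u, v))


# Toy spectra; real ones would have 600 components.
first = normalize([3.0, 4.0, 0.0])
second = normalize([4.0, 3.0, 0.0])
louder_first = normalize([6.0, 8.0, 0.0])  # same shape, twice the volume

for name, norm in [("taxicab", taxicab), ("euclidean", euclidean), ("maximum", maximum)]:
    print(name, norm(first, second), norm(first, louder_first))
```

Normalizing first makes the comparison independent of loudness: the louder copy of the first vector sits at distance zero from it under all three norms.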

Algorithm Instructions

The following is a short synopsis regarding the proper execution of the software.

Short Description

As mentioned before, all files pertaining to the project can be accessed using the link: Voice Recognition. As soon as the file is opened, the following folders will be accessible:

David's Recordings

Matlab Files

The contents of these folders will now be discussed in more detail. The folder Matlab Files contains 10 audio recordings of David Roberts saying his name 'David'. Moreover, the folder contains the two m-files project.m and voicerec.m.

Project.m is the voice recognition algorithm that accomplishes the goals of the class project. The script file project.m can be executed by typing 'project' in the command window. Please make sure that the directory in Matlab is set to the directory that contains project.m and the 10 audio recordings g1.wav through g10.wav. Once project.m is run in Matlab, it will request that you "Enter the name that must be recognized". Since the recordings in that folder are of David Roberts, type in 'David'. Next, the program will inform you that you have 2 seconds to say the name 'David'. After recording, Matlab will play back the sample and give you the option to try again or to proceed if satisfied. A plot is then generated depicting how the normalized frequency spectrum of your voice (top window) compares to the average normalized vector of David's voice (bottom window). See Figure 1 for an example. At this point, the algorithm makes a comparison and displays 'YOU ARE NOT DAVID!!!!' in the command window if you do not fall within 2 standard deviations of the normal average voice. If you do fall within 2 standard deviations, the command window displays 'HELLO DAVID!!!'.

Figure 1: Example of a Frequency Spectra Comparison.
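One way to read the 2-standard-deviation test is sketched below in Python. It is an interpretation, not the code of project.m: the training spectra are normalized and averaged, the distance of each training spectrum from that average is measured, and a new spectrum is accepted if its distance is no more than the mean training distance plus two standard deviations. The four-component vectors are invented for illustration.

```python
import math
import statistics


def normalize(v):
    """Scale a vector to Euclidean length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def distance(u, v):
    """Euclidean norm of the difference of two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_profile(training_spectra):
    """Return (average normalized spectrum, acceptance threshold)."""
    unit = [normalize(s) for s in training_spectra]
    average = [sum(column) / len(unit) for column in zip(*unit)]
    spread = [distance(u, average) for u in unit]
    # Assumed rule: accept anything within mean + 2 standard deviations.
    threshold = statistics.mean(spread) + 2 * statistics.pstdev(spread)
    return average, threshold


def is_same_speaker(spectrum, profile):
    """Compare a new spectrum against a stored profile."""
    average, threshold = profile
    return distance(normalize(spectrum), average) <= threshold


# Invented training spectra (the project uses 10 recordings of 600 values).
training = [
    [1.0, 2.0, 3.0, 1.0],
    [1.1, 2.0, 2.9, 1.0],
    [0.9, 2.1, 3.0, 1.1],
    [1.0, 1.9, 3.1, 0.9],
]
profile = build_profile(training)

if is_same_speaker([2.0, 4.0, 6.0, 2.0], profile):
    print("HELLO DAVID!!!")
if not is_same_speaker([3.0, 1.0, 1.0, 3.0], profile):
    print("YOU ARE NOT DAVID!!!!")
```

Because the threshold is derived from the speaker's own recordings, a speaker whose voice varies more from take to take automatically gets a wider acceptance region.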

The second m-file in that folder is voicerec.m. This script file is executed by typing 'voicerec' in the command window. Running voicerec.m will prompt the user to record their name 10 times. The recordings are then saved as g1.wav through g10.wav in the directory. The ten new recordings will therefore replace the recordings of David Roberts. Doing this converts project.m into a voice recognition algorithm for the user's voice (as opposed to the voice of David Roberts). In this case, the user's name should be entered as the voice to be recognized (instead of 'David') when running project.m. Lastly, since voicerec.m replaces g1.wav through g10.wav in the directory, back-up copies of David Roberts' voice are conveniently stored in the folder David's Recordings.

To recognize voice commands efficiently, different parameters of speech such as pitch, amplitude pattern or power/energy can be used. Here the power of the speech signal is used. First the voice commands are captured with a microphone connected directly to the PC. The analog voice signals are then sampled using MATLAB. As speech signals generally lie in the range of 300 Hz to 4000 Hz, according to the Nyquist sampling theorem the sampling rate should be greater than or equal to 8000 samples/second.

$$F_S = 2 F_m \tag{1}$$

Where FS is the sampling frequency and Fm is the highest frequency present in the signal being sampled (here 4000 Hz). After sampling, the discrete data obtained is passed through a band pass filter with a pass band of 300 - 4000 Hz. The purpose of the band pass filter is to eliminate the noise that lies at low frequencies (below 300 Hz); above 4000 Hz there is generally no speech signal. This algorithm for voice recognition is built on speech templates. The templates consist of the power of the discrete signal. To create a template, the power of each sample is calculated and the accumulated power of 250 subsequent samples is then represented by one value.
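Template creation as described above can be sketched in a few lines of Python. The one-second recording length is an assumption; it is chosen because 8000 samples in blocks of 250 give 32 values, which matches the 32 sample points used in the comparison step. The test signal is synthetic.

```python
def power_template(samples, block_size=250):
    """Sum the squared samples over consecutive blocks of block_size.

    A trailing partial block, if any, is dropped.
    """
    template = []
    for start in range(0, len(samples) - block_size + 1, block_size):
        block = samples[start:start + block_size]
        template.append(sum(x * x for x in block))
    return template


# Synthetic one-second "recording" at 8000 samples/second:
# half a second of silence followed by half a second at amplitude 0.5.
recording = [0.0] * 4000 + [0.5] * 4000

template = power_template(recording)
print(len(template), "template values")
print("first:", template[0], "last:", template[-1])
```

Reducing 8000 samples to 32 power values keeps the coarse loudness contour of the word while discarding fine detail that changes from one utterance to the next.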

For recognition of commands, a dictionary is first created that consists of templates of all the commands the device has to follow (such as ON and OFF). To create the dictionary, the same command is recorded several times and a template is created each time. The final template is the average of all these templates, which is then stored.
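Building the dictionary by averaging can be sketched as below, in Python. The templates are shortened to three values and the numbers are invented; in the described system each template would hold 32 power values.

```python
def average_template(templates):
    """Average several equal-length templates element by element."""
    if not templates:
        raise ValueError("at least one template is required")
    length = len(templates[0])
    if any(len(t) != length for t in templates):
        raise ValueError("all templates must have the same length")
    return [sum(column) / len(templates) for column in zip(*templates)]


def build_dictionary(templates_by_command):
    """Map each command name to the average of its recorded templates."""
    return {
        command: average_template(templates)
        for command, templates in templates_by_command.items()
    }


# Invented templates for two commands, two takes each.
recorded = {
    "ON": [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    "OFF": [[0.0, 0.0, 2.0], [0.0, 2.0, 2.0]],
}
dictionary = build_dictionary(recorded)
print(dictionary)
```

Averaging several takes smooths out variation between utterances, so the stored template represents the typical shape of the command rather than any single recording.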

After creating the dictionary of templates, the command to be followed is captured with the microphone and a template of the input command signal is created. The template of the received command is then compared with the templates of the dictionary using the Euclidean distance. It is the accumulation of the square of each difference between the value of the dictionary template and that of the command template at each sample point. Strictly speaking this is the squared Euclidean distance, but omitting the square root does not change which template is closest. The formula can be given as

$$\text{Euclidean Distance} = \sum_{i=1}^{32} \left(\text{dic}[i] - \text{com}[i]\right)^2 \tag{2}$$

Where i indexes the sample points, of which there are 32 in the proposed algorithm. After calculating the Euclidean distance for each dictionary template, the distances are sorted in ascending order to find the smallest one. This distance corresponds to a particular dictionary template, and therefore to a particular dictionary command. The device then takes that command to be the one given by the operator and performs the task accordingly. If the command given by the operator does not match any of the dictionary commands, the device should not follow that command.

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems.
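The template matching step of Eq. (2) can be sketched in Python as follows. The text does not say how a non-matching command is detected, so the `reject_above` cut-off is a hypothetical addition, and the three-value templates are invented stand-ins for the 32-value ones.

```python
def squared_distance(dic, com):
    """Eq. (2): sum of squared differences between two templates."""
    if len(dic) != len(com):
        raise ValueError("templates must have the same length")
    return sum((d - c) ** 2 for d, c in zip(dic, com))


def recognize(command_template, dictionary, reject_above=None):
    """Return the name of the closest dictionary command, or None.

    reject_above is a hypothetical cut-off: if even the best match is
    farther away than this, the command is treated as unknown.
    """
    ranked = sorted(
        (squared_distance(template, command_template), name)
        for name, template in dictionary.items()
    )
    best_distance, best_name = ranked[0]
    if reject_above is not None and best_distance > reject_above:
        return None
    return best_name


# Invented dictionary of averaged templates.
dictionary = {"ON": [2.0, 3.0, 4.0], "OFF": [0.0, 1.0, 2.0]}

print(recognize([2.1, 2.9, 4.2], dictionary))
print(recognize([0.2, 1.1, 1.8], dictionary))
print(recognize([10.0, 10.0, 10.0], dictionary, reject_above=1.0))
```

Without a cut-off the nearest template always wins, however poor the match, which is why some form of rejection is needed for the device to ignore commands outside the dictionary.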

At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers.

SPEECH PROCESSING

Speech processing is the study of speech signals and of the methods used to process them. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of digital speech signals.

It is also closely tied to natural language processing (NLP), as its input can come from / output can go to NLP applications. E.g. text-to-speech synthesis may use a syntactic parser on its input text and speech recognition's output may be used by e.g. information extraction techniques. The main applications of speech processing are the recognition, synthesis and compression of human speech.

Speech processing includes the following areas of study:

Speech recognition (also called voice recognition), which deals with analysis of the linguistic content of a speech signal and its conversion into a computer-readable format.

Speaker recognition, where the aim is to recognize the identity of the speaker.

Speech coding, a specialized form of data compression, is important in the telecommunication area.

Voice analysis for medical purposes, such as analysis of vocal loading and dysfunction of the vocal cords.

Speech synthesis: the artificial synthesis of speech, which usually means computer-generated speech. Advances in this area improve the computer's usability for the visually impaired.

Speech enhancement: enhancing the intelligibility and/or perceptual quality of a speech signal, like audio noise reduction for audio signals.

Speech compression is important in the telecommunications area for increasing the amount of information which can be transferred, stored, or heard, for a given set of time and space constraints.

Speaker diarization is the process of determining who spoke when in a signal.

The media information of an audio file comprises fields such as the following:

Audio #1

ID : 2

Format : AAC

Format/Info : Advanced Audio Codec

Format profile : HE-AAC / LC

Codec ID : A_AAC

Duration : 1h 59mn

Channel(s) : 6 channels

Channel positions : Front: L C R, Side: L R, LFE

Sampling rate : 48.0 KHz / 24.0 KHz

Compression mode : Lossy

Delay relative to video : 31ms

Title : Mafiaking

Language : Hindi

Default : Yes

Forced : No

Audio #2

ID : 3

Format : AAC

Format/Info : Advanced Audio Codec

Format profile : HE-AAC / LC

Codec ID : A_AAC

Duration : 1h 59mn

Channel(s) : 6 channels

Channel positions : Front: L C R, Side: L R, LFE

Sampling rate : 48.0 KHz / 24.0 KHz

Compression mode : Lossy

Delay relative to video : 31ms

Title : Mafiaking

Language : English

Default : No

Forced : No

(a) Speaker identification

(b) Speaker verification

All speaker recognition systems have to serve two distinct phases. The first is referred to as the enrollment sessions or training phase, while the second is referred to as the operation sessions or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is additionally computed from the training samples. During the testing phase (Figure 1), the input speech is matched against the stored reference models and a recognition decision is made.

ADVANTAGES

Very easy to control the appliances

Even an ordinary person can control it

Very cost effective

Easy to design

LIMITATIONS

It can understand pure English only.

Voice interference can lead to undesired operation.

Even noise can activate it.

APPLICATIONS

Aerospace (e.g. space exploration, spacecraft, etc.): NASA's Mars Polar Lander used speech recognition technology from Sensory, Inc. in the Mars Microphone on the lander

Automatic translation

Automotive speech recognition (e.g., OnStar, Ford Sync)

Court reporting (real-time speech writing)

Hands-free computing: speech recognition computer user interface

Home automation

Interactive voice response

Mobile telephony, including mobile email

Multimodal interaction

Pronunciation evaluation in computer-aided language learning applications

Robotics

Speech-to-text reporter (transcription of speech into text, video captioning, court reporting)

Telematics (e.g., vehicle navigation systems)

Transcription (digital speech-to-text)

Video games, with Tom Clancy's EndWar and Lifeline as working examples

An economical system for reliable recognition of voice has been designed and developed. The system can be made highly efficient and effective if stringent environmental conditions are maintained. The setup for maintaining these environmental conditions will be a one-time investment for any real-life application.

