IVR (Interactive Voice Response)
with Pattern Recognition
ABSTRACT
In the present era of information technology, information is just a
telephone call away. However, applications such as telephone banking need
extra security to make them a reliable service for the public. Entering a
PIN code or password via telephone is not enough; additional user-specific
information is required to protect the user's identity more effectively. In
this paper, we propose an approach that uses Interactive Voice Response
(IVR) with pattern recognition based on neural networks. After entering the
correct password, the user is asked to provide a voice sample, which is
used to verify his identity. Adding voice pattern recognition to the
authentication process can potentially further enhance the security level:
since both checks are applied together, the probability of misuse is lower.
The developed system is fully compatible with the landline telephone system.
INTRODUCTION
In telephony, Interactive Voice Response, or IVR, is a computerized system that allows a
person, typically a telephone caller, to select an option from a voice menu. Applications
such as checking bank account balances, making transfers, and accessing the databases of
strategic organizations require a high level of security. In such applications the
information to be provided is secured by the use of a Personal Identification Number
(PIN). However, a PIN alone is not secure and is prone to tampering and misuse.
To overcome this problem, a pattern recognition approach based on neural networks is
proposed. User-specific patterns such as fingerprints, the retina, facial features, DNA
sequence identification and voice can be used for authentication. Among these, however,
voice authentication is readily available and the most suitable for this application. The
speaker recognition area has a long and rich scientific basis, with over 30 years of
research, development and evaluations.
Inherent in speaker identity verification is the general assumption that, at some level
of scrutiny, no two individuals have exactly the same voice characteristics.
In the proposed approach, besides entering the PIN code, the user is also asked to be
recognized through his voice signature, which further enhances secure access to various
applications.
The results, measured on false accept and false reject criteria, are promising, and the
system offers a quick response time. It can potentially play an effective role alongside
the existing authentication techniques used for identity verification when accessing
secured services through the telephone or similar media. In the proposed model, speaker-
specific features are extracted and a Multilayer Perceptron (MLP) is used for feature
matching.
INTERACTIVE VOICE RESPONSE SYSTEM
2. WHAT IS IVRS
INTERACTIVE VOICE RESPONSE SYSTEM (IVRS) is an important
development in the field of interactive communication which makes use of the most
modern technology available today. IVRS is a unique blend of the communication and
software fields, incorporating the best features of both streams of technology. IVRS is
an electronic system through which information related to a particular organization is
made available over telephone lines anywhere in the world.
IVRS provides a friendly and faster self-service alternative to speaking with
customer service agents. It finds large-scale use in the enquiry systems of railways,
banks, universities, tourism, industry, etc. It is the easiest and most flexible mode of
interactive communication, because pressing a few numbers on the telephone set provides
the user with a wide range of information on the desired topic. IVRS also reduces the
cost of servicing customers.
In telecommunications, IVRS allows customers to interact with a company’s
database via a telephone keypad or by speech recognition, after which they can service
their own inquiries by following the IVR dialogue. IVR systems can respond with
prerecorded or dynamically generated audio to further direct users on how to proceed.
IVR applications can be used to control almost any function where the interface can be
broken down into a series of simple interactions. IVR systems deployed in the network
are sized to handle large call volumes.
The use of IVR and voice automation enables a company to improve its customer
service and lower its costs, because callers' queries can be resolved without queuing
and without incurring the cost of a live agent, who can instead be directed to deal with
more demanding areas of the service. If the caller does not find the information they
need, or requires further assistance, the call can then be transferred to an agent. This
makes for a more efficient system in which agents have more time to deal with complex
interactions.
When an IVR system answers multiple phone numbers the use of DNIS ensures that the
correct application and language is executed. A single large IVR system can handle calls
for thousands of applications, each with its own phone numbers and script.
IVR also enables customer prioritization. In a system wherein individual customers may
have a different status, the service will automatically prioritize the individual's call and
move customers to the front of a specific queue. Prioritization could also be based on
the DNIS and call reason.
IVR technology is also being introduced into automobile systems for hands-free
operation.
2.1 IVRS Block Diagram
Fig. 2.1
The IVRS on the whole consists of the user telephone, the telephone
connection between the user and the IVRS, and the personal computer which
stores the database. The interactive voice response system consists of the
following parts.
2.1.1 Hardware Section
1. Relay: For switching between the ring detector and the DTMF decoder.
2. Ring detector: To detect the presence of incoming calls.
3. DTMF decoder: To convert the DTMF tones to 4-bit BCD codes.
4. Microcontroller: To accept the BCD codes, process them and transmit them serially to
the PC.
5. Level Translator: To provide the interface between the PC and the microcontroller.
6. Personal Computer: To store the database and to carry out the text-to-speech
conversion.
7. Audio Amplifier: To amplify the sound card output and to act as a
buffer between the telephone line and the sound card.
2.1.2 Software Selection
1. Visual Basic 6.0
2. Oracle 8.0
3. Microsoft Agent
2.2 Operations of IVRS
The user dials the phone number connected to the IVRS. The call is
taken over by the IVRS after a delay of 12 seconds, during which the call can
be attended by the operator. If the ring detector output is low after 12
seconds, it is ensured that the phone has not been picked up by the operator.
The microcontroller then switches the relay to the DTMF decoder and sends a
signal via RS-232 to the PC to play the wave file welcoming the user to the
IVRS. The user is also informed of the various codes present in the system,
which the user dials in order to access the necessary information.
Thirty seconds are given to the user to press the codes; failure to do so
results in the relay switching back. The DTMF decoder converts the codes
pressed by the user to BCD. The BCD output is passed to the input pins of the
microcontroller and stored in the microcontroller memory. After these codes
have been received, they are transmitted serially to the serial port of the PC
via the MAX232 IC. Any hardware failure in transmission results in the
lighting of an LED, and the relay is switched back.
The serial port of the PC is continually polled by the software
(Visual Basic and the Microsoft Agent program), and the received code words
are put into a text box from the input buffer. The received personal
identification number (PIN) is compared with the stored database to determine
the result. The corresponding wave file is played by the Sound Blaster card,
which is coupled to the telephone line through the audio amplifier. The
amplifier, connected between the Sound Blaster and the telephone line,
amplifies the card's output, drives the telephone line, and acts as a buffer
for the card.
2.3 Advantages of IVRS
1. The addition of speech recognition capabilities helps IVRS owners derive
more benefit from their investment in existing IVRS resources.
2. What motivates organizations to embrace speech solutions is the potential
for dramatic reductions in operational cost.
3. Increased automation frees the customer service agents from routine
administrative tasks and reduces costs related to customer service staffing.
That is, fewer agents are able to serve more customers.
4. Resources that have been developed to support an internet presence can
support an IVRS as well. Thus organizations can use some of the same
data modules built for their intranets in speech-enabled IVRS applications.
This can deliver a high degree of code reuse.
PATTERN RECOGNITION
INTRODUCTION
Automatic (machine) recognition, description, classification, and grouping of
patterns are important problems in a variety of engineering and scientific disciplines
such as biology, psychology, medicine, marketing, computer vision, artificial
intelligence, and remote sensing.
A pattern could be a fingerprint image, a handwritten cursive word, a human face,
or a speech signal. Given a pattern, its recognition/classification may consist of one of the
following two tasks: 1) supervised classification (e.g., discriminant analysis) in which the
input pattern is identified as a member of a predefined class, 2) unsupervised
classification (e.g., clustering) in which the pattern is assigned to a hitherto unknown
class. The recognition problem here is being posed as a classification or categorization
task, where the classes are either defined by the system designer (in supervised
classification) or are learned based on the similarity of patterns (in unsupervised
classification).
Applications of pattern recognition include data mining (identifying a “pattern”, e.g.,
a correlation or an outlier in millions of multidimensional patterns), document
classification (efficiently searching text documents), financial forecasting,
organization and retrieval of multimedia databases, and biometrics. The rapidly growing
and increasingly available computing
power, while enabling faster processing of huge data sets, has also facilitated the use of
elaborate and diverse methods for data analysis and classification. At the same time,
demands on automatic pattern recognition systems are rising enormously due to the
availability of large databases and stringent performance requirements (speed, accuracy,
and cost). The design of a pattern recognition system essentially involves the following
three aspects:
1) data acquisition and preprocessing,
2) data representation, and
3) decision making.
The problem domain dictates the choice of sensor(s), preprocessing technique,
representation scheme, and the decision making model. It is generally agreed that a
well-defined and sufficiently constrained recognition problem (small intraclass
variations and large interclass variations) will lead to a compact pattern representation
and a simple decision making strategy.
Learning from a set of examples (training set) is an important and desired
attribute of most pattern recognition systems. The four best known approaches for pattern
recognition are:
1) template matching,
2) statistical classification,
3) syntactic or structural matching, and
4) neural networks.
3.1 Voice Recognition
Voice recognition is different from speech recognition.
Speech recognition (also known as automatic speech recognition or computer speech
recognition) converts spoken words to text. The term "voice recognition" (also called
speaker recognition) refers to recognition systems that must be trained to a
particular speaker.
Speaker recognition, which can be classified into identification and verification, is the
process of automatically recognizing who is speaking on the basis of individual
information included in speech waves. This technique makes it possible to use the
speaker's voice to verify their identity and control access to services such as voice
dialing, banking by telephone, telephone shopping, database access services,
information services, voice mail, security control for confidential information areas, and
remote access to computers.
Fig. 3.1 shows the basic components of speaker identification and verification
systems. Speaker identification is the process of determining which registered speaker
provides a given utterance. Speaker verification, on the other hand, is the process of
accepting or rejecting the identity claim of a speaker. Most applications in which a voice
is used as the key to confirm the identity of a speaker are classified as speaker
verification.
Speaker recognition methods can also be divided into text-dependent and text-
independent methods. The former require the speaker to say key words or sentences
having the same text for both training and recognition trials, whereas the latter do not
rely on a specific text being spoken.
Fig. 3.1. Basic structure of speaker recognition systems: (a) speaker
identification; (b) speaker verification.
Both text-dependent and text-independent methods, however, share a problem. These
systems can be easily deceived because someone who plays back the recorded voice of a
registered speaker saying the key words or sentences can be accepted as the registered
speaker. To cope with this problem, there are methods in which a small set of words,
such as digits, are used as key words and each user is prompted to utter a given sequence
of key words that is randomly chosen every time the system is used. Yet even this
method is not completely reliable, since it can be deceived with advanced electronic
recording equipment that can reproduce key words in a requested order. Therefore, a text-
prompted speaker recognition method has recently been proposed.
NEURAL NETWORKS
4.1 What is a Neural Network?
An Artificial Neural Network (ANN) is an information processing paradigm that
is inspired by the way biological nervous systems, such as the brain, process information.
The key element of this paradigm is the novel structure of the information processing
system. It is composed of a large number of highly interconnected processing elements
(neurons) working in unison to solve specific problems. ANNs, like people, learn by
example. An ANN is configured for a specific application, such as pattern recognition or
data classification, through a learning process. Learning in biological systems involves
adjustments to the synaptic connections that exist between the neurons. This is true of
ANNs as well.
4.2 Why use neural networks?
Neural networks, with their remarkable ability to derive meaning from complicated or
imprecise data, can be used to extract patterns and detect trends that are too complex to
be noticed by either humans or other computer techniques. A trained neural network can
be thought of as an "expert" in the category of information it has been given to analyse.
This expert can then be used to provide projections given new situations of interest and
answer "what if" questions. Other advantages include:
1. Adaptive learning: An ability to learn how to do tasks based on the data given for
training or initial experience.
2. Self-Organization: An ANN can create its own organization or representation of
the information it receives during learning time.
3. Fault Tolerance via Redundant Information Coding: Partial destruction of a
network leads to the corresponding degradation of performance. However, some
network capabilities may be retained even with major network damage.
4.3 Pattern Recognition - an example
An important application of neural networks is pattern recognition. Pattern
recognition can be implemented by using a feed-forward (Fig. 4.1) neural network that
has been trained accordingly. During training, the network is trained to associate outputs
with input patterns. When the network is used, it identifies the input pattern and tries to
output the associated output pattern. The power of neural networks comes to life when a
pattern that has no output associated with it, is given as an input. In this case, the network
gives the output that corresponds to a taught input pattern that is least different from the
given pattern.
Fig. 4.1.
For example:
The network of Fig. 4.1 is trained to recognise the patterns T and H. The associated
patterns are all black and all white respectively as shown below.
If we represent black squares with 0 and white squares with 1 then the truth tables for the
3 neurones after generalisation are;
Top neuron
X11: 0 0 0 0 1 1 1 1
X12: 0 0 1 1 0 0 1 1
X13: 0 1 0 1 0 1 0 1
OUT: 0 0 1 1 0 0 1 1

Middle neuron
X21: 0 0 0 0 1 1 1 1
X22: 0 0 1 1 0 0 1 1
X23: 0 1 0 1 0 1 0 1
OUT: 1 0/1 1 0/1 0/1 0 0/1 0

Bottom neuron
X31: 0 0 0 0 1 1 1 1
X32: 0 0 1 1 0 0 1 1
X33: 0 1 0 1 0 1 0 1
OUT: 1 0 1 1 0 0 1 0
From the tables, it can be seen that the following associations can be extracted:
In this case, it is obvious that the output should be all blacks since the input pattern is
almost the same as the 'T' pattern.
Here also, it is obvious that the output should be all whites since the input pattern is
almost the same as the 'H' pattern.
Here, the top row is 2 errors away from a T and 3 from an H, so the top output is
black. The middle row is 1 error away from both T and H, so the output is random. The
bottom row is 1 error away from T and 2 away from H, so the output is black. The
total output of the network is still in favor of the T shape.
4.4 Feed-forward networks
Feed-forward ANNs (Fig. 4.1) allow signals to travel one way only: from input to
output. There is no feedback (loops), i.e. the output of any layer does not affect that
same layer. Feed-forward ANNs tend to be straightforward networks that associate
inputs with outputs. They are extensively used in pattern recognition. This type of
organisation is also referred to as bottom-up or top-down.
4.5 The Back-Propagation Algorithm
In order to train a neural network to perform some task, we must adjust the
weights of each unit in such a way that the error between the desired output and the
actual output is reduced. This process requires that the neural network compute the
error derivative of the weights (EW). In other words, it must calculate how the error
changes as each weight is increased or decreased slightly. The back propagation
algorithm is the most widely used method for determining the EW.
The back-propagation algorithm is easiest to understand if all the units in the
network are linear. The algorithm computes each EW by first computing the EA, the rate
at which the error changes as the activity level of a unit is changed. For output units, the
EA is simply the difference between the actual and the desired output. To compute the
EA for a hidden unit in the layer just before the output layer, we first identify all the
weights between that hidden unit and the output units to which it is connected. We
then multiply those weights by the EAs of those output units and add the products. This
sum equals the EA for the chosen hidden unit. After calculating all the EAs in the hidden
layer just before the output layer, we can compute in like fashion the EAs for other
layers, moving from layer to layer in a direction opposite to the way activities propagate
through the network. This is what gives back-propagation its name. Once the EA has
been computed for a unit, it is straightforward to compute the EW for each incoming
connection of the unit. The EW is the product of the EA and the activity through the
incoming connection.
Note that for non-linear units, the back-propagation algorithm includes an extra step.
Before back-propagating, the EA must be converted into the EI, the rate at which the
error changes as the total input received by a unit is changed.
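For the all-linear case described above, the EA and EW computations can be sketched directly. The one-hidden-layer layout, sizes, and function names below are illustrative assumptions, not the paper's implementation.

```python
def backprop_linear(x, w_hidden, w_out, target):
    """EAs and EWs for a one-hidden-layer network of purely linear units."""
    # Forward pass: linear units simply sum their weighted inputs.
    h = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
    y = [sum(w * hi for w, hi in zip(row, h)) for row in w_out]

    # EA for output units: difference between actual and desired output.
    ea_out = [yi - ti for yi, ti in zip(y, target)]

    # EA for a hidden unit: weights to the output units times those units'
    # EAs, summed over all output units it connects to.
    ea_hidden = [sum(w_out[o][j] * ea_out[o] for o in range(len(ea_out)))
                 for j in range(len(h))]

    # EW: product of the unit's EA and the activity through the connection.
    ew_out = [[ea * hi for hi in h] for ea in ea_out]
    ew_hidden = [[ea * xi for xi in x] for ea in ea_hidden]
    return ew_hidden, ew_out
```

With one input, one hidden unit and one output unit (x = [1.0], hidden weight 2.0, output weight 3.0, target 0.0), the output is 6.0, its EA is 6.0, the hidden EA is 3.0 x 6.0 = 18.0, and the EWs are 12.0 (output) and 18.0 (hidden), as the sum-and-multiply rules above prescribe.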
SPEAKER RECOGNITION USING ANN
5.1 Modes
At the highest level, all speaker recognition systems contain two modules: feature
extraction and feature matching. They operate in two modes: training and
recognition/testing. Both modes include feature extraction and feature matching. In
training mode, speaker models are created for the database; this is also called the
enrollment mode, in which speakers are enrolled in the database. In this mode, useful
features are extracted from the speech signal and the model is trained.
The objective of the model is generalization of the speaker's voice beyond the
training material, so that any unknown speech signal can be classified as the intended
speaker or an imposter. In recognition mode, the system makes a decision about the
unknown speaker's identity claim. In this mode, features are extracted from the speech
signal of the unknown speaker using the same technique as in the training mode, and the
speaker model from the database is used to calculate a similarity score. Finally, a
decision is made based on the similarity score. For speaker verification, the identity
claim is either accepted or rejected.
Two types of errors occur in a speaker verification system:
False Reject (FR) and False Accept (FA). When a true speaker is rejected by the speaker
recognition system, it is called an FR. Similarly, an FA occurs when an imposter is
recognized as a true speaker. The input pattern used for verification can be either
text-dependent or text-independent. For a text-dependent speech pattern, the speaker is
asked to utter a prescribed text, whereas in the text-independent case the user is free
to speak any text. The text-independent speech pattern is considered more flexible, as
the user is not required to memorize the text.
5.2 Methodology Adopted
Fig. 5.1 Methodology Adopted
5.3 Feature Extractor
The speech samples from a single speaker are recorded. Five samples for each word are
used for training the neural networks. The LPC Cepstrum coefficients of each word are
extracted and the K-means vector quantization is applied to get the reduced trajectories.
The feature extraction consists of the following steps:
5.3.1 Speech Sampling:
The speech was recorded and sampled using an off-the-shelf relatively
inexpensive dynamic microphone and a standard PC sound card. The incoming signal
was sampled at 22,050 Hertz with 16 bits of precision.
5.3.2 Endpoint Detection:
A fast and robust technique for accurately locating the endpoints
of isolated words has been used. This technique utilizes frame energy to acquire the
reference points. The algorithm takes frames of 100 samples, calculates the energy of
each frame, and averages the energies over all frames to obtain a reference energy
value. The energy per frame is calculated as:
P[i] = Sum_{k=1..n} s[k]^2        (1)
where s[k] are the speech samples in frame i and n is the frame size. Similarly, P is
calculated for all m frames and an average is taken to obtain the final energy value E:
E = (Sum_{k=1..m} P[k]) / m        (2)
The threshold is set at (constant * E) as the detection criterion.
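A minimal sketch of the frame-energy computation and threshold above. The 100-sample frame size follows the text; the scaling constant 0.2 is an illustrative assumption, since the text leaves the constant unspecified.

```python
def frame_energies(samples, frame_size=100):
    """P[i]: sum of squared samples in each frame (eq. 1)."""
    return [sum(s * s for s in samples[i:i + frame_size])
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def endpoint_threshold(samples, const=0.2, frame_size=100):
    """Detection threshold: constant times E, the mean frame energy (eq. 2)."""
    p = frame_energies(samples, frame_size)
    e = sum(p) / len(p)
    return const * e
```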
5.3.3 Pre-emphasis:
As is common in speech recognizers, a pre-emphasis filter was applied to the
digitized speech to spectrally flatten the signal and diminish the effects of finite numeric
precision in further calculations. This type of filter boosts the magnitude of the high
frequency components, leaving relatively untouched the lower ones.
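A sketch of such a first-order pre-emphasis filter. The coefficient 0.95 is a common choice assumed here, as the text does not specify one.

```python
def pre_emphasis(signal, alpha=0.95):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    Boosts high-frequency components, leaving lower ones relatively untouched."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]
```

On a constant (purely low-frequency) signal the filter output collapses to roughly (1 - alpha) of the input, which is the spectral flattening the text describes.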
5.3.4 Framing and Windowing:
After the signal was sampled, the utterances were isolated, and the spectrum was
flattened, each signal was divided into a sequence of data blocks, each block spanning
300 samples, and separated by 100 samples. Next, each block was multiplied by a
Hamming window, which had the same width as that of the block, to lessen the leakage
effects.
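The framing and windowing step can be sketched as follows. "Separated by 100 samples" is read here as a 100-sample hop between 300-sample blocks (a 200-sample overlap); that reading is an assumption.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def frame_and_window(signal, block=300, hop=100):
    """Split the signal into overlapping blocks and multiply each block by a
    Hamming window of the same width to lessen leakage effects."""
    win = hamming(block)
    return [[s * w for s, w in zip(signal[i:i + block], win)]
            for i in range(0, len(signal) - block + 1, hop)]
```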
5.3.5 LPC Analysis:
Then, a vector of 12 Linear Predictive Coding (LPC) cepstrum coefficients was
obtained from each data block using Durbin's method.
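Durbin's method itself is not reproduced in the text. The sketch below is the standard Levinson-Durbin recursion operating on autocorrelation values of a frame; the subsequent conversion of the LPC coefficients to cepstrum coefficients is omitted.

```python
def levinson_durbin(r, order):
    """Durbin's recursion: solve for LPC coefficients a[1..order] from
    autocorrelation values r[0..order]."""
    a = [0.0] * (order + 1)   # a[i] is the i-th LPC coefficient; a[0] unused
    err = r[0]                # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err
```

For an autocorrelation sequence [1.0, 0.5, 0.25] at order 2, the recursion yields the first-order predictor a = [0.5, 0.0] with residual error 0.75, since the second lag adds no new information.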
5.3.6 Vector Quantization:
The dimensionality of the LPC cepstrum vectors is reduced using the
Vector Quantization technique. A total of 36 coefficients is obtained after vector
quantization. For the vector quantization, the K-means algorithm is used.
The way in which a set of L training vectors can be clustered into a set of M codebook
vectors is the following:
1. Initialization: Arbitrarily choose M vectors (initially out of the training set of L
vectors) as the initial set of code words in the codebook.
2. Nearest-Neighbor Search: For each training vector, find the code word in the
current codebook that is closest (in terms of spectral distance) and assign that
vector to the corresponding cell.
3. Centroid Update: Update the code word in each cell using the centroids of the
training vectors assigned to that cell.
4. Iteration: Repeat the steps 2 and 3 until the average distance falls below a preset
threshold.
After the VQ stage only 3 vectors of size 12 are left. The output of this last stage is the
final feature used throughout.
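The four steps above can be sketched as follows. A fixed iteration count stands in for the average-distance threshold test of step 4, and the seed and iteration count are illustrative assumptions.

```python
import random

def kmeans_codebook(vectors, m, iters=20, seed=0):
    """Cluster a set of L training vectors into an M-entry codebook."""
    rng = random.Random(seed)
    # 1. Initialization: choose M training vectors as the initial code words.
    codebook = rng.sample(vectors, m)
    for _ in range(iters):
        # 2. Nearest-neighbor search: assign each training vector to the
        #    cell of its closest code word (squared Euclidean distance).
        cells = [[] for _ in range(m)]
        for v in vectors:
            nearest = min(range(m),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(v, codebook[c])))
            cells[nearest].append(v)
        # 3. Centroid update: replace each code word by its cell's centroid.
        for c, cell in enumerate(cells):
            if cell:
                codebook[c] = [sum(col) / len(cell) for col in zip(*cell)]
        # 4. Iteration: repeated a fixed number of times here; the text
        #    repeats until the average distance falls below a threshold.
    return codebook
```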
5.4 Recognizer
The recognizer block is built using the neural network approach. The two types of
neural networks used are Multilayer Perceptrons (MLP) and Recurrent Neural Networks
(RNN). A neural network is a collection of layers of "neurons", simulating the human
brain structure. Each neuron takes input from each neuron in the previous layer (or
from the outside world, if it is in the first layer), sums this input, and passes the
result to the next layer. Each connection between layers, however, has a certain
weight. Every time the neural network processes some input, it adjusts these weights to
bring the output closer to a given desired value. After several repetitions (each
repetition is an iteration), the network can produce the correct output given a loose
approximation of the input.
5.5 Results
Two different approaches were used for recognition. For each word, five different
training samples were used and the networks were trained. The recognition accuracies
were then calculated by recording further samples of the words.
5.5.1 MLP Approach
Fig. 5.2 Architecture of a multi-layer perceptron with two hidden layers
The MLP had 36 input nodes, 36 hidden neurons, and 1 output neuron. The output
of the neurons was inside the interval [-1, +1]; the hyperbolic tangent sigmoid was
used as the threshold function. Each neuron had an extra connection whose input was
kept constant and equal to one (the literature usually refers to this connection as
the bias or threshold). The weights were initialized with random values selected
within a small interval. The MLP was trained using the error back-propagation method.
TEST:
Around fifty seconds of speech data from the intended speaker was collected for
training the neural network. In the testing phase, a 10% tolerance is allowed for the
intended speaker, i.e., if the network output is within 10% of the target value, the
speaker is accepted as the intended speaker; otherwise it is rejected. The test data
consists of fifty (50) speech samples (other than those used for training the neural
network) of the speaker for whom the network is trained and 125 samples of imposter
speech. The imposter speech data was collected from 13 persons (male). Out of 50
samples of the intended speaker
41 were recognized, so the false reject rate is 18%. Similarly, out of 125 imposter
trials only 17 were falsely accepted, so the false accept rate is about 14%. Table 1
summarizes the details of the test results.
The performance measure is expressed as the half total error rate (HTER), defined as:
HTER = (FA + FR)/2        (3)
where FA and FR are the false acceptance and false rejection rates, respectively. It
was introduced into practice in the international speaker recognition evaluation
campaigns organized by NIST and NFI/TNO. In our case, the HTER comes out to be 16%,
which is promising.
5.5.2 Recurrent Neural Network Approach
Fig. 5.3 Structure of Recurrent neural network
While training an Elman network, the following occurs at each epoch:
1. The entire input sequence is presented to the network, and its outputs are
calculated and compared with the target sequence to generate an error sequence.
2. For each time step, the error is back-propagated to find the gradients of the
errors for each weight and bias. This gradient is actually an approximation, since
the contributions of weights and biases to errors via the delayed recurrent
connection are ignored.
3. This gradient is then used to update the weights with the back-propagation
training function chosen by the user.
5.6 Comparison
Some comments concerning the MLP approach can also be made:
1. Its recognition accuracies were better than the ones obtained with the RNN
approach. Even though its performance was better, it is still below the limits
required for practical applications.
2. The input layer consists of 36 neurons. The hidden layer consists of 36 neurons,
giving 36 x 36 = 1296 weights (1296 floating point values). The output layer
consists of only 1 neuron, with 36 weights (36 floating point values). In total,
each MLP requires 1332 floating point values.
A few comments concerning the RNN can also be made in the light of the items used
for comparison above:
1. It achieved only 80% of recognition accuracy.
2. In terms of memory requirement, it is the best. The fully connected RNN with 10
hidden neurons and 1 output neuron requires only 360 floating point values.
CONCLUSION
An Interactive Voice Response (IVR) system based on a neural network approach has been
proposed that incorporates user-specific features extracted from the voice, with a
Multilayer Perceptron (MLP) used for feature matching. The preliminary results show the
promise of the approach, which can potentially add security to applications involving
access to bank services etc. via telephone. Further work will focus on reducing the
false accept and false reject error rates of the pattern recognition.