IVR (Interactive Voice Response)
with Pattern Recognition
ABSTRACT
In the present era of information technology, information is just a
telephone call away. However, applications such as telephone banking need
extra security to make them a reliable service for the public. Entering a
PIN code or password via telephone is not enough; additional user-specific
information is required to protect the user's identity more effectively. In
this paper, we propose an approach that uses Interactive Voice Response
(IVR) with pattern recognition based on neural networks. After entering the
correct password, the user is asked to provide a voice sample, which is
used to verify his identity. Adding voice pattern recognition to the
authentication process can potentially further enhance the security level:
since both checks are applied together, the probability of misuse is lower.
The developed system is fully compatible with the landline telephone system.
INTRODUCTION
In telephony, Interactive Voice Response, or IVR, is a computerized system that allows a
person, typically a telephone caller, to select an option from a voice menu. Applications
such as checking bank account balances, making transfers, and accessing the databases of
strategic organizations require a high level of security. In such applications the
information to be provided is secured by the use of a Personal Identification Number
(PIN). However, a PIN alone is not secure and is prone to tampering and misuse.
To overcome this problem, a pattern recognition approach based on neural networks is
proposed. User-specific patterns such as fingerprints, the retina, facial features, DNA
sequence identification and voice can be used for authentication. Among these, however,
voice authentication is readily available and the most suitable for this application. The
speaker recognition area has a long and rich scientific basis, with over 30 years of
research, development and evaluations.
Inherent in speaker identity verification is the general assumption that, at some level
of scrutiny, no two individuals have exactly the same voice characteristics.
In the proposed approach, besides entering the PIN code, the user is also asked to be
recognized through his voice signature, which further enhances secure access to various
applications.
The results, measured on false accept and false reject criteria, are promising, and the
system offers a quick response time. It can potentially play an effective role alongside
the existing authentication techniques used for identity verification when accessing
secured services through the telephone or similar media. In the proposed model, speaker-
specific features are extracted and a Multilayer Perceptron (MLP) is used for feature
matching.
INTERACTIVE VOICE RESPONSE SYSTEM
2. WHAT IS IVRS
INTERACTIVE VOICE RESPONSE SYSTEM (IVRS) is an important
development in the field of interactive communication which makes use of the most
modern technology available today. IVRS is a unique blend of the communication and
software fields, incorporating the best features of both streams of technology. IVRS is
an electronic system through which information related to a particular organization is
made available over telephone lines anywhere in the world.
IVRS provides a friendly and faster self-service alternative to speaking with
customer service agents. It finds large-scale use in the enquiry systems of railways,
banks, universities, tourism, industry, etc. It is the easiest and most flexible mode of
interactive communication, because pressing a few numbers on the telephone set provides
the user with a wide range of information on the desired topic. IVRS also reduces the
cost of servicing customers.
In telecommunications, IVRS allows customers to interact with a company’s
database via a telephone keypad or by speech recognition, after which they can service
their own inquiries by following the IVR dialogue. IVR systems can respond with
prerecorded or dynamically generated audio to further direct users on how to proceed.
IVR applications can be used to control almost any function where the interface can be
broken down into a series of simple interactions. IVR systems deployed in the network
are sized to handle large call volumes.
The use of IVR and voice automation enables a company to improve its customer
service and lower its costs, because callers' queries can be resolved without queuing
and without incurring the cost of a live agent, who can instead be directed to deal with
more demanding areas of the service. If the caller does not find the information they
need, or requires further assistance, the call can then be transferred to an agent. This
makes for a more efficient system in which agents have more time to deal with complex
interactions.
When an IVR system answers multiple phone numbers the use of DNIS ensures that the
correct application and language is executed. A single large IVR system can handle calls
for thousands of applications, each with its own phone numbers and script.
IVR also enables customer prioritization. In a system wherein individual customers may
have a different status, the service will automatically prioritize the individual's call and
move customers to the front of a specific queue. Prioritization could also be based on
the DNIS and call reason.
IVR technology is also being introduced into automobile systems for hands-free
operation.
2.1 IVRS Block Diagram
Fig. 2.1
The IVRS on the whole consists of the user telephone, the telephone
connection between the user and the IVRS, and the personal computer which
stores the database. The interactive voice response system consists of the
following parts.
2.1.1 Hardware Section
1. Relay: For switching between the ring detector and the DTMF decoder.
2. Ring detector: To detect the presence of incoming calls.
3. DTMF decoder: To convert the DTMF tones to 4-bit BCD codes.
4. Microcontroller: To accept the BCD codes, process them and transmit them serially to
the PC.
5. Level Translator: To provide the interface between the PC and the microcontroller.
6. Personal Computer: To store the database and to carry out the text-to-speech
conversion.
7. Audio Amplifier: To amplify the sound card output and to act as a
buffer between the telephone line and the sound card.
2.1.2 Software Selection
1. Visual Basic 6.0
2. Oracle 8.0
3. Microsoft Agent
2.2 Operations of IVRS
The user dials the phone number connected to the IVRS. The call is
taken over by the IVRS after a delay of 12 seconds, during which the call can
be attended by the operator. If the ring detector output is low after 12
seconds, it is ensured that the phone has not been picked up by the operator.
The microcontroller then switches the relay to the DTMF decoder and sends a
signal via RS-232 to the PC to play the wave file welcoming the user to the
IVRS. The user is also informed of the various codes present in the system,
which the user dials in order to access the necessary information.
Thirty seconds are given to the user to press the codes; failure to do so
results in the relay switching back. The DTMF decoder converts the codes
pressed by the user to BCD. The BCD output is passed to the input pins of the
microcontroller and stored in the microcontroller memory. After these codes
have been received, they are transmitted serially to the serial port of the PC
via the MAX232 IC. Any hardware failure in transmission results in the
lighting of an LED, and the relay is switched back.
The serial port of the PC is continually polled by the software
(Visual Basic and the Microsoft Agent program), and the received code words
are put into a text box from the input buffer. The received personal
identification number (PIN) is compared with the stored database to determine
the result. The corresponding wave file is played by the Sound Blaster card,
which is coupled to the telephone line through the audio amplifier. The
amplifier, connected between the Sound Blaster and the telephone line,
amplifies the card's output, drives the telephone line, and acts as a buffer
for the card.
2.3 Advantages of IVRS
1. The addition of speech recognition capabilities helps IVRS owners derive
more benefit from their investment in existing IVRS resources.
2. What motivates organizations to embrace speech solutions is the potential
for dramatic reductions in operational cost.
3. Increased automation frees the customer service agents from routine
administrative tasks and reduces costs related to customer service staffing.
That is, fewer agents are able to serve more customers.
4. Resources that have been developed to support an internet presence can
support an IVRS as well. Thus organizations can use some of the same
data modules built for their intranets in speech-enabled IVRS applications.
This can deliver a high degree of code reuse.
PATTERN RECOGNITION
INTRODUCTION
Automatic (machine) recognition, description, classification, and grouping of
patterns are important problems in a variety of engineering and scientific disciplines
such as biology, psychology, medicine, marketing, computer vision, artificial
intelligence, and remote sensing.
A pattern could be a fingerprint image, a handwritten cursive word, a human face,
or a speech signal. Given a pattern, its recognition/classification may consist of one of the
following two tasks: 1) supervised classification (e.g., discriminant analysis) in which the
input pattern is identified as a member of a predefined class, 2) unsupervised
classification (e.g., clustering) in which the pattern is assigned to a hitherto unknown
class. The recognition problem here is being posed as a classification or categorization
task, where the classes are either defined by the system designer (in supervised
classification) or are learned based on the similarity of patterns (in unsupervised
classification).
Applications of pattern recognition include data mining (identifying a “pattern”, e.g.,
a correlation or an outlier in millions of multidimensional patterns), document
classification (efficiently searching text documents), financial forecasting,
organization and retrieval of multimedia databases, and biometrics. The rapidly growing
and increasingly available computing
power, while enabling faster processing of huge data sets, has also facilitated the use of
elaborate and diverse methods for data analysis and classification. At the same time,
demands on automatic pattern recognition systems are rising enormously due to the
availability of large databases and stringent performance requirements (speed, accuracy,
and cost). The design of a pattern recognition system essentially involves the following
three aspects:
1) data acquisition and preprocessing,
2) data representation, and
3) decision making.
The problem domain dictates the choice of sensor(s), preprocessing technique,
representation scheme, and the decision making model. It is generally agreed that a
well-defined and sufficiently constrained recognition problem (small intraclass
variations and large interclass variations) will lead to a compact pattern representation
and a simple decision making strategy.
Learning from a set of examples (training set) is an important and desired
attribute of most pattern recognition systems. The four best known approaches for pattern
recognition are:
1) template matching,
2) statistical classification,
3) syntactic or structural matching, and
4) neural networks.
3.1 Voice Recognition
Voice recognition is different from speech recognition.
Speech recognition (also known as automatic speech recognition or computer speech
recognition) converts spoken words to text. The term "voice recognition" (also called
speaker recognition) refers to recognition systems that must be trained to a
particular speaker.
Speaker recognition, which can be classified into identification and verification, is the
process of automatically recognizing who is speaking on the basis of individual
information included in speech waves. This technique makes it possible to use the
speaker's voice to verify their identity and control access to services such as voice
dialing, banking by telephone, telephone shopping, database access services,
information services, voice mail, security control for confidential information areas, and
remote access to computers.
Fig. 3.1 shows the basic components of speaker identification and verification
systems. Speaker identification is the process of determining which registered speaker
provides a given utterance. Speaker verification, on the other hand, is the process of
accepting or rejecting the identity claim of a speaker. Most applications in which a voice
is used as the key to confirm the identity of a speaker are classified as speaker
verification.
Speaker recognition methods can also be divided into text-dependent and text-
independent methods. The former require the speaker to say key words or sentences
having the same text for both training and recognition trials, whereas the latter do not
rely on a specific text being spoken.
Fig. 3.1. Basic structure of speaker recognition systems: (a) speaker
identification; (b) speaker verification.
Both text-dependent and text-independent methods, however, share a problem. These
systems can be easily deceived because someone who plays back the recorded voice of a
registered speaker saying the key words or sentences can be accepted as the registered
speaker. To cope with this problem, there are methods in which a small set of words,
such as digits, are used as key words and each user is prompted to utter a given sequence
of key words that is randomly chosen every time the system is used. Yet even this
method is not completely reliable, since it can be deceived with advanced electronic
recording equipment that can reproduce key words in a requested order. Therefore, a text-
prompted speaker recognition method has recently been proposed.
NEURAL NETWORKS
4.1 What is a Neural Network?
An Artificial Neural Network (ANN) is an information processing paradigm that
is inspired by the way biological nervous systems, such as the brain, process information.
The key element of this paradigm is the novel structure of the information processing
system. It is composed of a large number of highly interconnected processing elements
(neurons) working in unison to solve specific problems. ANNs, like people, learn by
example. An ANN is configured for a specific application, such as pattern recognition or
data classification, through a learning process. Learning in biological systems involves
adjustments to the synaptic connections that exist between the neurons. This is true of
ANNs as well.
4.2 Why use neural networks?
Neural networks, with their remarkable ability to derive meaning from complicated or
imprecise data, can be used to extract patterns and detect trends that are too complex to
be noticed by either humans or other computer techniques. A trained neural network can
be thought of as an "expert" in the category of information it has been given to analyse.
This expert can then be used to provide projections given new situations of interest and
answer "what if" questions. Other advantages include:
1. Adaptive learning: An ability to learn how to do tasks based on the data given for
training or initial experience.
2. Self-Organization: An ANN can create its own organization or representation of
the information it receives during learning time.
3. Fault Tolerance via Redundant Information Coding: Partial destruction of a
network leads to the corresponding degradation of performance. However, some
network capabilities may be retained even with major network damage.
4.3 Pattern Recognition - an example
An important application of neural networks is pattern recognition. Pattern
recognition can be implemented by using a feed-forward (Fig. 4.1) neural network that
has been trained accordingly. During training, the network is trained to associate outputs
with input patterns. When the network is used, it identifies the input pattern and tries to
output the associated output pattern. The power of neural networks comes to life when a
pattern that has no output associated with it, is given as an input. In this case, the network
gives the output that corresponds to a taught input pattern that is least different from the
given pattern.
Fig. 4.1.
For example:
The network of Fig. 4.1 is trained to recognise the patterns T and H. The associated
patterns are all black and all white respectively as shown below.
If we represent black squares with 0 and white squares with 1 then the truth tables for the
3 neurones after generalisation are;
Top neuron
X11: 0 0 0 0 1 1 1 1
X12: 0 0 1 1 0 0 1 1
X13: 0 1 0 1 0 1 0 1
OUT: 0 0 1 1 0 0 1 1

Middle neuron
X21: 0 0 0 0 1 1 1 1
X22: 0 0 1 1 0 0 1 1
X23: 0 1 0 1 0 1 0 1
OUT: 1 0/1 1 0/1 0/1 0 0/1 0

Bottom neuron
X31: 0 0 0 0 1 1 1 1
X32: 0 0 1 1 0 0 1 1
X33: 0 1 0 1 0 1 0 1
OUT: 1 0 1 1 0 0 1 0
From the tables, it can be seen that the following associations can be extracted:
In this case, it is obvious that the output should be all blacks since the input pattern is
almost the same as the 'T' pattern.
Here also, it is obvious that the output should be all whites since the input pattern is
almost the same as the 'H' pattern.
Here, the top row is 2 errors away from a T and 3 from an H, so the top output is
black. The middle row is 1 error away from both T and H, so the output is random. The
bottom row is 1 error away from T and 2 away from H, so the output is black. The
total output of the network is still in favor of the T shape.
4.4 Feed-forward networks
Feed-forward ANNs (Fig. 4.1) allow signals to travel one way only: from input to
output. There is no feedback (loops), i.e. the output of any layer does not affect that
same layer. Feed-forward ANNs tend to be straightforward networks that associate
inputs with outputs. They are extensively used in pattern recognition. This type of
organisation is also referred to as bottom-up or top-down.
4.5 The Back-Propagation Algorithm
In order to train a neural network to perform some task, we must adjust the
weights of each unit in such a way that the error between the desired output and the
actual output is reduced. This process requires that the neural network compute the
error derivative of the weights (EW). In other words, it must calculate how the error
changes as each weight is increased or decreased slightly. The back propagation
algorithm is the most widely used method for determining the EW.
The back-propagation algorithm is easiest to understand if all the units in the
network are linear. The algorithm computes each EW by first computing the EA, the rate
at which the error changes as the activity level of a unit is changed. For output units, the
EA is simply the difference between the actual and the desired output. To compute the
EA for a hidden unit in the layer just before the output layer, we first identify all the
weights between that hidden unit and the output units to which it is connected. We
then multiply those weights by the EAs of those output units and add the products. This
sum equals the EA for the chosen hidden unit. After calculating all the EAs in the hidden
layer just before the output layer, we can compute in like fashion the EAs for other
layers, moving from layer to layer in a direction opposite to the way activities propagate
through the network. This is what gives back-propagation its name. Once the EA has
been computed for a unit, it is straightforward to compute the EW for each incoming
connection of the unit. The EW is the product of the EA and the activity through the
incoming connection.
Note that for non-linear units, the back-propagation algorithm includes an extra step.
Before back-propagating, the EA must be converted into the EI, the rate at which the
error changes as the total input received by a unit is changed.
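For the all-linear case described above, the EA and EW computations can be sketched directly. The one-hidden-layer layout, sizes, and function names below are illustrative assumptions, not the paper's implementation.

```python
def backprop_linear(x, w_hidden, w_out, target):
    """EAs and EWs for a one-hidden-layer network of purely linear units."""
    # Forward pass: linear units simply sum their weighted inputs.
    h = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
    y = [sum(w * hi for w, hi in zip(row, h)) for row in w_out]

    # EA for output units: difference between actual and desired output.
    ea_out = [yi - ti for yi, ti in zip(y, target)]

    # EA for a hidden unit: weights to the output units times those units'
    # EAs, summed over all output units it connects to.
    ea_hidden = [sum(w_out[o][j] * ea_out[o] for o in range(len(ea_out)))
                 for j in range(len(h))]

    # EW: product of the unit's EA and the activity through the connection.
    ew_out = [[ea * hi for hi in h] for ea in ea_out]
    ew_hidden = [[ea * xi for xi in x] for ea in ea_hidden]
    return ew_hidden, ew_out
```

With one input, one hidden unit and one output unit (x = [1.0], hidden weight 2.0, output weight 3.0, target 0.0), the output is 6.0, its EA is 6.0, the hidden EA is 3.0 x 6.0 = 18.0, and the EWs are 12.0 (output) and 18.0 (hidden), as the sum-and-multiply rules above prescribe.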
SPEAKER RECOGNITION USING ANN
5.1 Modes
At the highest level, all speaker recognition systems contain two modules: feature
extraction and feature matching. They operate in two modes: training and
recognition/testing. Both modes include feature extraction and feature matching. In
training mode, speaker models are created for the database; this is also called the
enrollment mode, in which speakers are enrolled in the database. In this mode, useful
features are extracted from the speech signal and the model is trained.
The objective of the model is generalization of the speaker's voice beyond the
training material, so that any unknown speech signal can be classified as the intended
speaker or an imposter. In recognition mode, the system makes a decision about the
unknown speaker's identity claim. In this mode, features are extracted from the speech
signal of the unknown speaker using the same technique as in the training mode, and the
speaker model from the database is used to calculate a similarity score. Finally, a
decision is made based on the similarity score. For speaker verification, the identity
claim is either accepted or rejected.
Two types of errors occur in a speaker verification system:
False Reject (FR) and False Accept (FA). When a true speaker is rejected by the speaker
recognition system, it is called an FR. Similarly, an FA occurs when an imposter is
recognized as a true speaker. The input pattern used for verification can be either
text-dependent or text-independent. For a text-dependent speech pattern, the speaker is
asked to utter a prescribed text, whereas in the text-independent case the user is free
to speak any text. The text-independent speech pattern is considered more flexible, as
the user is not required to memorize the text.
5.2 Methodology Adopted
Fig. 5.1 Methodology Adopted
5.3 Feature Extractor
The speech samples from a single speaker are recorded. Five samples for each word are
used for training the neural networks. The LPC Cepstrum coefficients of each word are
extracted and the K-means vector quantization is applied to get the reduced trajectories.
The feature extraction consists of the following steps:
5.3.1 Speech Sampling:
The speech was recorded and sampled using an off-the-shelf relatively
inexpensive dynamic microphone and a standard PC sound card. The incoming signal
was sampled at 22,050 Hertz with 16 bits of precision.
5.3.2 Endpoint Detection:
A fast and robust technique for accurately locating the endpoints
of isolated words has been used. This technique utilizes frame energy to acquire the
reference points. The algorithm takes frames of 100 samples, calculates the energy of
each frame, and averages the energies over all frames to obtain a reference energy
value. The energy per frame is calculated as:
P[i] = Sum_{k=1..n} s[k]^2        (1)
where s[k] are the speech samples in frame i and n is the frame size. Similarly, P is
calculated for all m frames and an average is taken to obtain the final energy value E:
E = (Sum_{k=1..m} P[k]) / m        (2)
The threshold is set at (constant * E) as the detection criterion.
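A minimal sketch of the frame-energy computation and threshold above. The 100-sample frame size follows the text; the scaling constant 0.2 is an illustrative assumption, since the text leaves the constant unspecified.

```python
def frame_energies(samples, frame_size=100):
    """P[i]: sum of squared samples in each frame (eq. 1)."""
    return [sum(s * s for s in samples[i:i + frame_size])
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def endpoint_threshold(samples, const=0.2, frame_size=100):
    """Detection threshold: constant times E, the mean frame energy (eq. 2)."""
    p = frame_energies(samples, frame_size)
    e = sum(p) / len(p)
    return const * e
```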
5.3.3 Pre-emphasis:
As is common in speech recognizers, a pre-emphasis filter was applied to the
digitized speech to spectrally flatten the signal and diminish the effects of finite numeric
precision in further calculations. This type of filter boosts the magnitude of the high
frequency components, leaving relatively untouched the lower ones.
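A sketch of such a first-order pre-emphasis filter. The coefficient 0.95 is a common choice assumed here, as the text does not specify one.

```python
def pre_emphasis(signal, alpha=0.95):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].
    Boosts high-frequency components, leaving lower ones relatively untouched."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]
```

On a constant (purely low-frequency) signal the filter output collapses to roughly (1 - alpha) of the input, which is the spectral flattening the text describes.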
5.3.4 Framing and Windowing:
After the signal was sampled, the utterances were isolated, and the spectrum was
flattened, each signal was divided into a sequence of data blocks, each block spanning
300 samples, and separated by 100 samples. Next, each block was multiplied by a
Hamming window, which had the same width as that of the block, to lessen the leakage
effects.
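The framing and windowing step can be sketched as follows. "Separated by 100 samples" is read here as a 100-sample hop between 300-sample blocks (a 200-sample overlap); that reading is an assumption.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def frame_and_window(signal, block=300, hop=100):
    """Split the signal into overlapping blocks and multiply each block by a
    Hamming window of the same width to lessen leakage effects."""
    win = hamming(block)
    return [[s * w for s, w in zip(signal[i:i + block], win)]
            for i in range(0, len(signal) - block + 1, hop)]
```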
5.3.5 LPC Analysis:
Then, a vector of 12 Linear Predictive Coding (LPC) cepstrum coefficients was
obtained from each data block using Durbin's method.
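Durbin's method itself is not reproduced in the text. The sketch below is the standard Levinson-Durbin recursion operating on autocorrelation values of a frame; the subsequent conversion of the LPC coefficients to cepstrum coefficients is omitted.

```python
def levinson_durbin(r, order):
    """Durbin's recursion: solve for LPC coefficients a[1..order] from
    autocorrelation values r[0..order]."""
    a = [0.0] * (order + 1)   # a[i] is the i-th LPC coefficient; a[0] unused
    err = r[0]                # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err
```

For an autocorrelation sequence [1.0, 0.5, 0.25] at order 2, the recursion yields the first-order predictor a = [0.5, 0.0] with residual error 0.75, since the second lag adds no new information.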
5.3.6 Vector Quantization:
The dimensionality of the LPC cepstrum vectors is reduced using the
Vector Quantization technique. A total of 36 coefficients is obtained after vector
quantization. For the vector quantization, the K-means algorithm is used.
The way in which a set of L training vectors can be clustered into a set of M codebook
vectors is the following:
1. Initialization: Arbitrarily choose M vectors (initially out of the training set of L
vectors) as the initial set of code words in the codebook.
2. Nearest-Neighbor Search: For each training vector, find the code word in the
current codebook that is closest (in terms of spectral distance) and assign that
vector to the corresponding cell.
3. Centroid Update: Update the code word in each cell using the centroids of the
training vectors assigned to that cell.
4. Iteration: Repeat the steps 2 and 3 until the average distance falls below a preset
threshold.
After the VQ stage only 3 vectors of size 12 are left. The output of this last stage is the
final feature used throughout.
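The four steps above can be sketched as follows. A fixed iteration count stands in for the average-distance threshold test of step 4, and the seed and iteration count are illustrative assumptions.

```python
import random

def kmeans_codebook(vectors, m, iters=20, seed=0):
    """Cluster a set of L training vectors into an M-entry codebook."""
    rng = random.Random(seed)
    # 1. Initialization: choose M training vectors as the initial code words.
    codebook = rng.sample(vectors, m)
    for _ in range(iters):
        # 2. Nearest-neighbor search: assign each training vector to the
        #    cell of its closest code word (squared Euclidean distance).
        cells = [[] for _ in range(m)]
        for v in vectors:
            nearest = min(range(m),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(v, codebook[c])))
            cells[nearest].append(v)
        # 3. Centroid update: replace each code word by its cell's centroid.
        for c, cell in enumerate(cells):
            if cell:
                codebook[c] = [sum(col) / len(cell) for col in zip(*cell)]
        # 4. Iteration: repeated a fixed number of times here; the text
        #    repeats until the average distance falls below a threshold.
    return codebook
```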
5.4 Recognizer
The recognizer block is built using the neural network approach. The two types of
neural networks used are Multilayer Perceptrons (MLP) and Recurrent Neural Networks
(RNN). A neural network is a collection of layers of "neurons", simulating the human
brain structure. Each neuron takes input from each neuron in the previous layer (or
from the outside world, if it is in the first layer), sums this input, and passes the
result to the next layer. Each connection between layers, however, has a certain
weight. Every time the neural network processes some input, it adjusts these weights to
bring the output closer to a given desired value. After several repetitions (each
repetition is an iteration), the network can produce the correct output given a loose
approximation of the input.
5.5 Results
Two different approaches were used for recognition. For each word, five different
training samples were used and the networks were trained. The recognition accuracies
were then calculated by recording further samples of the words.
5.5.1 MLP Approach
Fig. 5.2 Architecture of a multi-layer perceptron with two hidden layers
The MLP had 36 input nodes, 36 hidden neurons, and 1 output neuron. The output
of the neurons was inside the interval [-1, +1]; the hyperbolic tangent sigmoid was
used as the threshold function. Each neuron had an extra connection whose input was
kept constant and equal to one (the literature usually refers to this connection as
the bias or threshold). The weights were initialized with random values selected
within a small interval. The MLP was trained using the error back-propagation method.
TEST:
Around fifty seconds of speech data from the intended speaker was collected for
training the neural network. In the testing phase, a 10% tolerance is allowed for the
intended speaker, i.e., if the network output is within 10% of the target value, the
speaker is accepted as the intended speaker; otherwise it is rejected. The test data
consists of fifty (50) speech samples (other than those used for training the neural
network) of the speaker for whom the network is trained and 125 samples of imposter
speech. The imposter speech data was collected from 13 persons (male). Out of 50
samples of the intended speaker
41 were recognized, so the false reject rate is 18%. Similarly, out of 125 imposter
trials only 17 were falsely accepted, so the false accept rate is about 14%. Table 1
summarizes the details of the test results.
The performance measure is expressed as the half total error rate (HTER), defined as:
HTER = (FA + FR)/2        (3)
where FA and FR are the false acceptance and false rejection rates, respectively. It
was introduced into practice in the international speaker recognition evaluation
campaigns organized by NIST and NFI/TNO. In our case, the HTER comes out to be 16%,
which is promising.
5.5.2 Recurrent Neural Network Approach
Fig. 5.3 Structure of Recurrent neural network
While training an Elman network, the following occurs at each epoch:
1. The entire input sequence is presented to the network, and its outputs are
calculated and compared with the target sequence to generate an error sequence.
2. For each time step, the error is back-propagated to find the gradients of the
errors for each weight and bias. This gradient is actually an approximation, since
the contributions of weights and biases to errors via the delayed recurrent
connection are ignored.
3. This gradient is then used to update the weights with the back-propagation
training function chosen by the user.
5.6 Comparison
Some comments concerning the MLP approach can also be made:
1. Its recognition accuracies were better than the ones obtained with the RNN
approach. Even though its performance was better, it is still below the limits
required for practical applications.
2. The input layer consists of 36 neurons. The hidden layer consists of 36 neurons,
giving 36 x 36 = 1296 weights (1296 floating point values). The output layer
consists of only 1 neuron, with 36 weights (36 floating point values). In total,
each MLP requires 1332 floating point values.
A few comments concerning the RNN can also be made in the light of the items used
for comparison above:
1. It achieved only 80% of recognition accuracy.
2. In terms of memory requirement, it is the best. The fully connected RNN with 10
hidden neurons and 1 output neuron requires only 360 floating point values.
CONCLUSION
An Interactive Voice Response (IVR) system based on a neural network approach has been
proposed that incorporates user-specific features extracted from the voice, with a
Multilayer Perceptron (MLP) used for feature matching. The preliminary results show the
promise of the approach, which can potentially add security to applications involving
access to bank services etc. via telephone. Further work will focus on reducing the
false accept and false reject error rates of the pattern recognition.