CHAPTER 1
INTRODUCTION
1.1 DESIGN INTRODUCTION
For the past several decades, designers have processed speech for a
wide variety of applications ranging from mobile communications to
automatic reading machines. Speech recognition reduces the overhead
caused by alternate communication methods. Speech has not been used
much in the field of electronics and computers due to the complexity and
variety of speech signals and sounds. However, with modern processes,
algorithms, and methods we can process speech signals easily and
recognize the text.
1.2 INTRODUCTION
Our project aimed at developing a Real Time Speech Recognition
Engine on an FPGA using the Altera DE2 board. The system was designed
to recognize the word being spoken into the microphone. Both industry
and academia have spent considerable effort in this field developing
software and hardware to come up with a robust solution. However,
because of the large number of accents spoken around the world, the
problem remains an active area of research. Speech recognition finds
numerous applications including health care, artificial intelligence,
human-computer interaction, Interactive Voice Response systems, the military,
avionics, etc. Another important application lies in helping
physically challenged people interact with the world in a better way.
We implemented a Real Time Speech Recognition Engine that takes
as input the time-domain signal from a microphone and performs
frequency-domain feature extraction on the sample to identify the word
being spoken. Our design exploits the fact that most words, spoken
across the various accents around the world, share common frequency-domain
features that can be used to identify them. Speech recognition has long
been a point of keen interest for researchers around the globe; while
various methodologies have been developed, it remains an unsolved, yet
intriguing, problem.
CHAPTER 2
LITERATURE SURVEY
2.1 EXISTING SYSTEM
The existing system can only recognize speech; it cannot display the
recognized text. It uses either a discrete or a continuous Hidden
Markov Model, but not both. Using SRAM, it can implement and recognize
only 500 words, and its recognition speed is slow. Although the existing
system can be applied to various practical purposes, it suffers from
these limitations, which is why we propose speech-to-text conversion.
2.2 PROPOSED SYSTEM
The proposed system provides many more advantages and makes better use
of the HMM: our system uses both the discrete and the continuous forms of
the Hidden Markov Model. The system not only performs recognition but
also displays the text on a liquid crystal display. With proper training
and in a closed environment, we can achieve much higher accuracy, and the
Viterbi algorithm helps us find the most likely text for the speech. This
system has practical applications for deaf persons, and further improvement
of our system could help educate many deaf people; speech-to-visual
conversion may even become possible in the near future.
CHAPTER 3
BACKGROUND THEORY
3.1 SPEECH RECOGNITION PRINCIPLE
Figure 3.1 Speech waves
Speech recognition systems can be classified into several models by
the types of utterances they recognize. These classes take into
consideration the ability to determine the instants when the speaker
starts and finishes an utterance. In our project we aimed to implement
an Isolated Word Recognition system, which uses a rectangular window
over the word being spoken. These types of systems have
"Listen/Not-Listen" states, where they require the speaker to wait
between utterances.
A desktop microphone is not appropriate for the realization of
this project, since desktop microphones tend to pick up more ambient
noise, which hinders accurate detection of speech. Using a headset-style
microphone minimizes the ambient noise. Since speech recognition is
heavily dependent on processing speed, because of the large amount of
signal processing involved, implementing it on an FPGA was a natural
choice and a motivation behind this project. Also, the memory available
on the Altera DE2 development board was enough to easily and
successfully implement the design for a word of length nearly 1
second.
Speech recognition engines are broadly classified into two types,
namely pattern recognition and acoustic-phonetic systems. While the
former use known/trained patterns to determine a match, the latter use
attributes of the human body to compare speech features (phonetics such
as vowel sounds). Pattern recognition systems combine well with current
computing techniques and tend to have higher accuracy.
3.1.1 Flow chart
Figure 3.1.1 Speech recognition principle
The system recognizes the spoken digit using a maximum likelihood
estimate, i.e., a Viterbi decoder. The input speech sample is preprocessed to
extract the feature vector. Then, the nearest codebook vector index for each
frame is sent to the digit models. The system chooses the model that has
the maximum probability of a match.
3.2 DATA ACQUISITION
The speech signal is essentially analog in nature. Hence, it
must be converted to digital data in order to be read and processed. We
used the inbuilt ADC of the Wolfson CODEC to sample our signal at an
8 kHz frequency, producing a 16-bit signed digital output. Once a
word was detected, we acquired the FFT over the next 32
blocks of data and read their power coefficients in Nios II.
Figure 3.2 ADC Wave Form.
3.3 DETECTION
The system must know when a spoken word is input. Thus, a
detection algorithm has been devised. This is done by continually
computing the difference of the absolute average of two adjacent sound
windows (sets of consecutive sound data), and comparing it to a predefined
threshold.
The detector algorithm can be broken down as follows:
1. The absolute average w1 of a sound window of length W is computed
from the sound samples si, starting at sa and ending at sb, as shown in Eq. 1.

   w1 = (1/W) * Σ(i=a..b) |si|                                   (1)
2. The absolute average w2 of the second window is computed from the sound
samples si, starting at sb and ending at sc, as shown in Eq. 2.

   w2 = (1/W) * Σ(i=b..c) |si|                                   (2)
3. The difference between w2 and w1 is compared to the threshold value
Th. If it is larger, the spoken word is considered to start at sc. Else, the
algorithm goes on to step 4.
4. The average of the oldest window (w1) is discarded, and replaced by w2.
Then, the algorithm goes back to step 2.
Note that the threshold value Th has been experimentally determined in
the MATLAB implementation (see Appendix A). Nevertheless, it may vary
depending on the sound acquisition setup (i.e., position of the
microphone, noise level, etc.). Finally, the length of the word is fixed
to 1.024 s for convenience.
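The four detection steps can be sketched in C as a sanity check. This is a sketch, not the project's actual code: the window length W of 1024 samples and the threshold TH of 40 are placeholders (the real threshold was tuned experimentally in MATLAB), and `detect_word_start` is a hypothetical helper name.

```c
#include <stdlib.h>

#define W  1024   /* window length (samples) */
#define TH 40     /* placeholder threshold; tuned experimentally in practice */

/* Absolute average of one window of n samples. */
static long abs_avg(const short *s, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += labs((long)s[i]);
    return sum / n;
}

/* Steps 1-4 above: slide two adjacent windows over the signal and
   report the index where the word is considered to start (s_c),
   or -1 if no window pair exceeds the threshold. */
int detect_word_start(const short *samples, int len) {
    if (len < 2 * W) return -1;
    long w1 = abs_avg(samples, W);                 /* step 1 */
    for (int a = 0; a + 2 * W <= len; a += W) {
        long w2 = abs_avg(samples + a + W, W);     /* step 2 */
        if (w2 - w1 > TH)                          /* step 3 */
            return a + 2 * W;                      /* word starts at s_c */
        w1 = w2;                                   /* step 4: discard w1 */
    }
    return -1;
}
```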
3.4 FREQUENCY CONTENT
Once the word is detected, it is mapped to the frequency domain by
computing its Discrete Fourier Transform (DFT) using the Fast Fourier
Transform (FFT) algorithm. Since the length of a word is 1.024 s and the
sound is sampled at 5 kHz, five 1024-point FFTs are required to fully
characterize a single word. In the MATLAB implementation, these are
stored in each row of a 1024 x 5 matrix. This matrix constitutes the
“fingerprint”. Note that, for the sake of simplicity, only the real part of the
DFT is kept. In the training mode, the user defines how many times a word
is trained. The frequency content of each is averaged by adding their
fingerprints together and dividing the final sum by the number of times the
word has been trained. This generates the “reference fingerprint”.
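The reference-fingerprint averaging amounts to an element-wise mean over the training fingerprints; a sketch, where `make_reference` and the float storage format are illustrative assumptions:

```c
#define FRAMES 5     /* one row per 1024-point FFT */
#define NFFT   1024

/* Element-wise average of n_train training fingerprints into the
   reference fingerprint. */
void make_reference(float train[][FRAMES][NFFT], int n_train,
                    float ref[FRAMES][NFFT]) {
    for (int r = 0; r < FRAMES; r++)
        for (int c = 0; c < NFFT; c++) {
            float sum = 0.0f;
            for (int t = 0; t < n_train; t++)
                sum += train[t][r][c];
            ref[r][c] = sum / (float)n_train;
        }
}
```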
3.5 DISTANCE
The comparison between a word's fingerprint and the reference
fingerprint is done by taking the Euclidean distance between them. To do
this, they are considered as five 1024-dimensional vectors (one for each
matrix row), and the average of their respective Euclidean distances is
computed. This is shown in Eq. 3, where D is the distance, and a_ni and
b_ni are the i-th components of the fingerprints. The n index points to
each of the five vector pairs.
   D = (1/5) * Σ(n=1..5) sqrt( Σ(i=1..1024) (a_ni - b_ni)^2 )         (3)
If the distance is less than a preset maximum (maxDis), then the
analyzed word is considered to match the reference word. Note that
maxDis is experimentally set to 140 in the MATLAB implementation (see
appendix B). Similarly to the Th parameter, this value depends on the
sound acquisition setup and may need to be varied in order to achieve
accurate speech recognition.
CHAPTER 4
HARDWARE IMPLEMENTATION
In order to implement the speech recognition algorithm on the Altera
DE2 board, it is broken down into modules. These are then mapped to
combinational logic and finite-state machines (FSM), using the Quartus II
software package.
4.1 WOLFSON INTERFACE
The board has a Wolfson WM8731 Coder-Decoder (CODEC),
which acts as the ADC. This audio chip has a microphone jack, and is
connected in a master-slave configuration with the FPGA (the latter being
the master). In order for the master to control the CODEC and acquire the
digital data, three modules have been created: the I2C bus controller, the
clock module, and a sound fetcher.
4.1.1 I2C Bus Controller
Three tasks need to be performed on the CODEC to modify its
internal settings: “de-mute” the microphone input, boost the microphone
volume, and change the default sound path (so that the microphone is given
priority over other inputs). To do this, the FPGA communicates with the
Wolfson via the I2C (Inter-Integrated Circuit) protocol using two pins:
'SDIN' (the data line) and 'SCLK' (the bus clock), as seen in Fig. 4.1.1.
Figure 4.1.1 Two-line I2C bus protocol for the Wolfson WM8731
The contents of the data line are sent in the same order as seen above
(after a start condition): 'RADDR', 'R/W', 'ACK', 'DATAB[15-9]', and
'DATAB[8-0]', which stand respectively for “base address”, “Read/Write”,
“acknowledge”, “control address”, and “control data”. The last block
modifies the settings. For instance, if 'DATAB[0]' is '1', the volume is
boosted. The base and control addresses are used to specify which internal
CODEC registers need to be accessed. “Read/Write” will always be set to
zero (i.e. write), since the Wolfson is write-only.
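The bit layout just described can be modeled with small packing helpers. This is a sketch; the 7-bit base address constant 0x1A and the example register values are assumptions to be checked against the WM8731 data sheet.

```c
#include <stdint.h>

#define WM8731_BASE_ADDR 0x1A  /* 7-bit device address (assumed) */
#define WRITE_BIT        0x0   /* R/W is always 0: the chip is write-only */

/* First byte on SDIN after the start condition: base address + R/W. */
uint8_t i2c_addr_byte(void) {
    return (uint8_t)((WM8731_BASE_ADDR << 1) | WRITE_BIT);
}

/* DATAB[15:0]: 7-bit control address in bits 15-9, 9-bit control
   data in bits 8-0. */
uint16_t i2c_data_word(uint8_t ctrl_addr, uint16_t ctrl_data) {
    return (uint16_t)(((ctrl_addr & 0x7F) << 9) | (ctrl_data & 0x1FF));
}
```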
To signify a start condition, 'SDIN' goes from high to low while the
clock is maintained high. The same applies for a stop condition, except the
transition is low-to-high. Finally, the 'ACK' signal is sent from the
CODEC to the FPGA, as opposed to all the other data line contents. This
introduces the need for 'SDIN' to be implemented as a bi-directional pin,
which requires the use of a tri-state buffer. An FSM is created to
implement the bus interface between the FPGA and the Wolfson. Note
that, because 'SCLK' must be between 0 Hz and 400 kHz, 'ADCLRC'
(48.83 kHz) is used (see section 4.1.3). For start and stop conditions,
'ADCLRC' is overridden by the FSM so that 'SCLK' is held at '1'.
4.1.2 Sound Fetcher
After the Wolfson digitizes the input, it presents the data
('ADCDAT') serially as seen in Fig.4.1.2a. This is the Integrated Interchip
Sound (I2S) standard. Two clocks are needed: 'ADCLRC' (the left-right
clock for ADC data), and 'BCLK' (the bit-stream clock). The CODEC will
place the most significant bit (MSB) on the 'ADCDAT' line so that it can
be fetched on the second rising 'BCLK' edge following a high-to-low
transition of 'ADCLRC'. The left and right channel distinction is used for
stereo sound. Since this project deals with mono sound, the data is fetched
when 'ADCLRC' is low (left channel).
Figure 4.1.2a ADCDAT output convention used by the Wolfson WM8731 (I2S)
Figure 4.1.2b Circuit schematic of the overall ADCDAT fetcher
The FSM in Fig 4.1.2b ('ADCDAT_fetcher_FSM') is used to keep
track of the events on the clocks (e.g. rising edges) in order to know the
exact moment one can start and stop to fetch. Because the data is presented
serially, the FSM communicates with a serial-to-parallel register
('LPM_SHIFTREG'), which outputs this data in parallel form.
Table 4.1.2 Two's complement quantization from 3 bits to 2 bits

    Decimal number   Binary (2's comp.)   Quantized decimal   Quantized binary (2's comp.)
          3                011                    1                     01
          2                010                    1                     01
          1                001                    0                     00
          0                000                    0                     00
         -1                111                   -1                     11
         -2                110                   -1                     11
         -3                101                   -2                     10
         -4                100                   -2                     10

The next step is to quantize. The 'ADCDAT' word length is 24 bits
in two's complement form. As said in section 1.2, the objective is to reduce
the length to 8 bits. In order to see how signed binary numbers can be
quantized, Table 4.1.2 illustrates a quantization from 3 bits to 2 bits.

A closer look at the second and fourth columns reveals that, in order
to quantize, it is only necessary to keep the two MSBs. Note that this is
possible because the two's complement scheme is used. Consequently,
when going from 24 bits to 8 bits, only the eight most significant bits
need to be kept.

The last D-type flip-flop ('LPM_DFF/downsampler_ff') reduces the
output data rate from 48 kHz to 5 kHz. In order to do that, it is controlled
by the two modules (a counter and an FSM) in the top right corner of Fig.
4.1.2b, which generate two pulses. Both pulses occur at a 5 kHz frequency.
The first instructs the flip-flop to fetch the data. The second pulse is an
output 'READY' signal that happens half a period after the first. Its
purpose is to make sure that the rest of the circuit fetches the data only
after it has been properly latched.
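The 24-to-8-bit quantization amounts to keeping the eight most significant bits while preserving the sign, as the table above illustrates for 3 -> 2 bits. A minimal C sketch (the function name is illustrative):

```c
#include <stdint.h>

/* Quantize a raw 24-bit two's complement sample to 8 bits by keeping
   the eight most significant bits. */
int8_t quantize_24_to_8(uint32_t raw24) {
    int32_t s = (int32_t)(raw24 << 8) >> 8;   /* sign-extend bit 23 */
    return (int8_t)(s >> 16);                 /* keep the top 8 bits */
}
```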
4.1.3 Clock Module
The FPGA is clocked at 50 MHz [1]. Because it acts as the
Wolfson's master, it must feed the latter with various clocks: the main
audio chip clock ('XCK'), 'ADCLRC', and 'BCLK'. According to the
Wolfson data sheets, both 'ADCLRC' and 'XCK' are dependent on the
sampling frequency. Since the latter is 48 kHz, 'ADCLRC' must also be 48
kHz. 'XCK' is 12.288 MHz [4]. 'BCLK' must be at least 2.4 MHz, because
it needs to yield 25 rising clock edges (1 to wait for the MSB and 24 to
fetch each 'ADCDAT' bit) within half the period of 'ADCLRC' (i.e. within
10.42 μs).
Figure 4.1.3 Block diagram of clock module
To implement all three clocks, a single clock module was devised.
As seen in Fig. 4.1.3, it takes the 50 MHz clock as an input. Using a 2-bit
counter, it then proceeds to divide it by 2^2, yielding a 12.5 MHz 'XCK'
signal. Similarly, 'ADCLRC' and 'BCLK' are output using respectively 10-
bit and 3-bit counters (to divide by 2^10 and 2^3). This produces 48.83 kHz,
and 6.25 MHz signals (the latter being greater than 2.4 MHz). Even though
those values are approximations of the ideal ones specified in the data
sheets, they are close enough for practical purposes [3].
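The divider arithmetic above is easy to check: an n-bit counter divides the master clock by 2^n. A small helper (illustrative, not project code):

```c
/* Frequency obtained by dividing a master clock with an n-bit counter. */
double divided_clock_hz(double master_hz, int counter_bits) {
    return master_hz / (double)(1 << counter_bits);
}
```

With a 50 MHz input this yields 12.5 MHz ('XCK', 2 bits), 48.83 kHz ('ADCLRC', 10 bits), and 6.25 MHz ('BCLK', 3 bits).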
4.2 FFT
The discrete Fourier transform (DFT) plays an important role in the
analysis, design, and implementation of discrete-time signal processing
algorithms and systems because efficient algorithms exist for the
computation of the DFT. These efficient algorithms are called Fast Fourier
Transform (FFT) algorithms. In terms of multiplications and additions, the
FFT algorithms can be orders of magnitude more efficient than competing
algorithms.
It is well known that the DFT takes N^2 complex multiplications and
N^2 complex additions for a complex N-point transform. Thus, direct
computation of the DFT is inefficient. The basic idea of the FFT algorithm
is to break up an N-point DFT transform into successive smaller and
smaller transforms known as butterflies (basic computational elements).
The small transforms used can be 2-point DFTs known as Radix-2, 4-point
DFTs known as Radix-4, or other points. A two-point butterfly requires 1
complex multiplication and 2 complex additions, and a 4-point butterfly
requires 3 complex multiplications and 8 complex additions. Therefore, the
Radix-2 FFT reduces the complexity of an N-point DFT down to
(N/2)*log2(N) complex multiplications and N*log2(N) complex additions,
since there are log2(N) stages and each stage has N/2 2-point butterflies.
For the Radix-4 FFT, there are log4(N) stages and each stage has N/4
4-point butterflies. Thus, the total number of complex multiplications is
(3N/4)*log4(N) = (3N/8)*log2(N) and the number of required complex
additions is 8*(N/4)*log4(N) = N*log2(N).
Overall, the Radix-4 FFT requires only 75% as many complex
multiplications as the Radix-2 FFT, although it uses the same number of
complex additions. These additional savings make it a widely used FFT
algorithm. Thus, we would like to use the Radix-4 FFT if the number of
points is a power of 4. However, if the number of points is a power of 2
but not a power of 4, the Radix-2 algorithm must be used to complete the
whole FFT process. Here we discuss only the Radix-4 FFT algorithm.
Now, let’s consider an example to demonstrate how FFTs are used
in real applications. In the 3GPP-LTE (Long Term Evolution), M-point
DFT and Inverse DFT (IDFT) are used to convert the signal between
frequency domain and time domain. 3GPP-LTE aims to provide an
uplink speed of up to 50 Mbps and a downlink speed of up to 100 Mbps. For
this purpose, 3GPP-LTE physical layer uses Orthogonal Frequency
Division Multiple Access (OFDMA) on the downlink and Single Carrier -
Frequency Division Multiple Access (SC-FDMA) on the uplink.
In order to map the sound data from the time domain to the
frequency domain, the Altera IP Megacore FFT module is used. The
module is configured so as to produce a 1024-point FFT. It is not only
capable of taking a streaming data input in natural order, but it can also
output the transformed data in natural order, with a maximum latency of
1024 clock cycles once all the data (1024 data samples) has been received.
4.3 DETECTOR
The absolute values of the first 1024 samples that constitute a
window are accumulated (summed together). The sum is then shifted right
by 10 bits in order to divide by 1024 (since 2^10 = 1024), thus producing the
average value of the window.
The difference between that average and the one from the previous
window (stored in 'Register 1') is then computed. 'Register 2' is used to
control the comparator's input in order to ensure the comparison with the
user-defined 9-bit threshold takes place when all the samples of the
window have been processed. Once done, the contents of 'Register 1' are
replaced by the newer window average.
Figure 4.3 Detector
An FSM is needed in order to control when to do this average
swapping, when to enable 'Register 2', when to determine whether a count
of 1024 samples has been reached, and when to clear the accumulator to
restart the summation. It also accepts a 'RESET' signal that
asynchronously clears the accumulator.
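The accumulate-and-shift averaging used by the detector can be expressed in one line; a sketch:

```c
#include <stdint.h>

/* Divide the 1024-sample accumulator by shifting right 10 bits,
   since 2^10 = 1024 (valid for non-negative sums). */
uint32_t window_average(uint32_t abs_sum) {
    return abs_sum >> 10;
}
```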
4.4 MEMORY MANAGEMENT
In order to store the reference fingerprint, the 512 kB SRAM
module built in the board is used. There are three memory modules on the
Altera DE2: a 4 MB Flash memory chip, an 8 MB SDRAM chip and a 512
kB SRAM chip. While the Flash module provides a vast amount of
non-volatile storage, it is very slow with respect to the main system clock. It
also requires a controller capable of dealing with its timing constraints.
The SDRAM chip is very fast and has a very large storage capacity, but it
requires a very sophisticated controller to be operated. This makes the
SRAM chip an obvious choice. Even though it is neither the fastest nor the
largest, it has ten times the storage capacity needed for this
project, and it is fast enough (since it can perform a read or write operation
in less than 20 ns, i.e. a system clock period) so as to avoid any timing
issues. Moreover, it is a fairly simple device and can be easily controlled.
Figure 4.4 512 kB SRAM chip block diagram
The SRAM memory module is depicted in Fig. 4.4 with its inputs and
outputs. Note that the 'Data' pins are bidirectional and require a tri-state
buffer to be properly driven. The chip storage is divided into 2^18 16-bit
blocks, which can be directly addressed through the 18 'Address' lines.
This is not convenient for the implementation, since the data stored is
8 bits wide.
4.4.1 Memory Controller
“Memory Controller” shown in Fig. 4.4.1a has four user inputs
('ADDR', 'DATA_IN', 'MODE', and 'ENABLE'), one user output
('DATA_OUT'), and seven inputs/outputs (depicted in green) that connect
directly to the SRAM chip ('Low Byte Mask', 'High Byte Mask', 'Output
Enable', 'Write Enable', 'Chip Enable', 'Address', and 'Data'). The controller
simplifies the communication to the SRAM chip by splitting the
bidirectional pins and allowing each 8-bit memory block to be directly
accessed (see its detailed schematics in 4.4.1b).
Figure 4.4.1a Memory controller block diagram
The pins are split by using Altera's “bustri” (tri-state buffer) and each
8-bit block can be accessed using the 'High Byte Mask' and the 'Low Byte
Mask' according to the least significant bit of 'ADDR'. As a result, the user
sees an 8-bit data input ('DATA_IN'), a separate 8-bit data output
('DATA_OUT') and 19 address lines ('ADDR') which double the original
address space.
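The address-splitting scheme can be modeled in C. Which address parity selects which byte mask is an assumption here; on the real chip the masks are active-low.

```c
#include <stdint.h>

typedef struct {
    uint32_t word_addr;  /* 18-bit address driven on 'Address' */
    int high_byte;       /* 1 if the high byte is selected (assumed: odd addresses)  */
    int low_byte;        /* 1 if the low byte is selected (assumed: even addresses) */
} sram_access_t;

/* Map a 19-bit byte address onto the 18-bit word address plus the
   byte-mask selection. */
sram_access_t map_byte_address(uint32_t addr19) {
    sram_access_t a;
    a.word_addr = addr19 >> 1;          /* drop the byte-select bit */
    a.high_byte = (int)(addr19 & 1);
    a.low_byte  = (int)(!(addr19 & 1));
    return a;
}
```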
Figure 4.4.1b Schematic diagram of memory controller
4.4.2 Memory Batch Operator
In order to sequentially access the memory, a 'Memory Batch
Operator' module was devised. As shown in Fig. 4.4.2, it takes six inputs
('START_ADDR', 'END_ADDR', 'DATA_IN', 'MODE', 'DATA_READY',
and 'ENABLE') and has 5 outputs ('DATA_OUT', 'ADDR',
'MEM_MODE', 'MEM_ENABLE', and 'DONE'). It operates on the rising
edge of a clock signal ('CLK').
Figure 4.4.2 Memory batch operator block diagram
The module works as follows:
1. Whenever the 'ENABLE' input goes high, it fetches the starting and
ending addresses as specified on the 'START_ADDR' and 'END_ADDR'
inputs, and readies to start writing or reading (according to the 'MODE'
input) at the starting address. This takes two clock cycles.
2. Whenever the 'DATA_READY' signal is asserted, the module goes to
the next address and reads (the data can be read from the 'DATA_OUT'
lines of the memory controller) or writes (the data from the 'DATA_IN'
input lines).
3. If the module reaches the ending address, it signals 'DONE' until the
'ENABLE' input is low and goes back to step 1. Else, it goes back to
step 2.
Note that on each step, the module takes care of sending the appropriate
signals to the memory controller in order to perform the desired action.
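A software model of the three steps, with a plain byte array standing in for the memory controller; the names and return convention are illustrative:

```c
#include <stdint.h>

enum { MODE_READ = 0, MODE_WRITE = 1 };

/* Walk the addresses start..end inclusive, writing from or reading
   into 'buf'; the return value plays the role of the 'DONE' signal
   by reporting how many locations were touched. */
int batch_operate(uint8_t *mem, uint32_t start, uint32_t end,
                  uint8_t *buf, int mode) {
    int count = 0;
    for (uint32_t a = start; a <= end; a++, count++) {
        if (mode == MODE_WRITE)
            mem[a] = buf[count];
        else
            buf[count] = mem[a];
    }
    return count;
}
```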
4.5 DISTANCE (HMM)
The distance module illustrated in Fig. 4.5a has four inputs ('A', 'B',
'ENABLE', and 'RST') and one output ('DISTANCE'). It computes the
distance between two arbitrarily sized vectors by adding and accumulating
the squared difference of the 'A' and 'B' inputs on each rising edge of a
clock signal ('CLK') while the 'ENABLE' input is high. In order to clear
the accumulated distance, the asynchronous 'RST' signal must be
asserted.
Figure 4.5a Distance block diagram
Figure 4.5b Schematic diagram of distance
4.6 HMM TRAINING
An important part of speech-to-text conversion using pattern
recognition is training. Training involves creating a pattern representative
of the features of a class using one or more test patterns that correspond to
speech sounds of the same class. The resulting pattern (generally called a
reference pattern) is an example or template, derived from some type of
averaging technique. It can also be a model that characterizes the reference
pattern statistics. Our system uses speech samples from three individuals
during training.
A model commonly used for speech recognition is the HMM, which
is a statistical model used for modeling an unknown system using an
observed output sequence. The system trains the HMM for each digit in the
vocabulary using the Baum-Welch algorithm. The codebook index created
during preprocessing is the observation vector for the HMM model.
After preprocessing the input speech samples to extract feature
vectors, the system builds the codebook. The codebook is the reference
code space that we can use to compare input feature vectors. The weighted
cepstrum matrices for various users and digits are compared with the
codebook. The nearest corresponding codebook vector indices are sent to
the Baum-Welch algorithm for training an HMM model.
The HMM characterizes the system using three matrices:
A — the state transition probability distribution.
B — the observation symbol probability distribution.
π — the initial state distribution.
Any digit is completely characterized by its corresponding A, B, and
π matrices. The A, B, and π matrices are modeled using the Baum-Welch
algorithm, which is an iterative procedure (we limit the iterations to 20).
The Baum-Welch algorithm gives three matrices for each digit,
corresponding to the three users with whom we created the vocabulary
set. The A, B, and π matrices are averaged over the users to generalize
them for user-independent recognition.
For the design to recognize the same digit uttered by a user for which
the design has not been trained, the zero probabilities in the B matrix are
replaced with a low value so that it gives a non-zero value on recognition.
To some extent, this arrangement overcomes the problem of less training
data.
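The zero-probability replacement can be sketched as a simple floor over the B matrix; the matrix dimensions and the floor value here are illustrative assumptions:

```c
#define NSTATES 5
#define NSYMS   8
#define FLOOR_P 1e-4   /* illustrative "low value" */

/* Replace exact zeros in the observation probability matrix B so
   that unseen observations never force the likelihood to zero. */
void floor_b_matrix(double b[NSTATES][NSYMS]) {
    for (int i = 0; i < NSTATES; i++)
        for (int j = 0; j < NSYMS; j++)
            if (b[i][j] == 0.0)
                b[i][j] = FLOOR_P;
}
```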
Training is a one-time process. Due to the complexity and resource
requirements, it is performed using standalone PC application software that
we created by compiling our C program into an executable. For
recognition, we compile the same C program but target it to run on the
Nios II processor instead. We were able to accomplish this cross-
compilation because of the wide support for the C language in the Nios II
processor IDE.
4.6.1 HMM-Based Recognition
Recognition or pattern classification is the process of comparing the
unknown test pattern with each sound class reference pattern and
computing a measure of similarity (distance) between the test pattern and
each reference pattern. The digit is recognized using a maximum likelihood
estimate, such as the Viterbi decoding algorithm, which implies that the
digit whose model has the maximum probability is the spoken digit.
Preprocessing, feature vector extraction, and codebook generation are the same
as in HMM training. The input speech sample is preprocessed and the
feature vector is extracted. Then, the index of the nearest codebook vector
for each frame is sent to all digit models. The model with the maximum
probability is chosen as the recognized digit.
After preprocessing in the Nios II processor, the required data is
passed to the hardware for Viterbi decoding. Viterbi decoding is
computationally intensive so we implemented it in the FPGA for better
execution speed, taking advantage of hardware/software co-design. We
wrote the Viterbi decoder in Verilog HDL and included it as a custom
instruction in the Nios II processor. Data passes through the dataa and
datab ports and the prefix port is used for control operations. The custom
instruction copies or adds two floating-point numbers from dataa and
datab, depending on the prefix input. The output (result) is sent back to the
Nios II processor for further maximum likelihood estimation.
4.6.2 Flowchart Of HMM
The system trains the HMM for each digit in the vocabulary. The
same weighted cepstrum matrices for various users and digits are compared
with the codebook, and their corresponding nearest codebook vector indices
are sent to the Baum-Welch algorithm to train a model for the input index
sequence. The codebook index is the observation vector for the HMM
model.
Figure 4.6.2 Flowchart of HMM
The Baum-Welch model is an iterative procedure and our system
limits the iterations to 20. After training, we have three models for each
digit that correspond to the three users in our vocabulary set. We find the
average of the A, B, and π matrices over the users to generalize the models.
4.7 LCD
Figure 4.7 LCD
A liquid crystal display (LCD) is a thin, flat electronic visual
display that uses the light modulating properties of liquid crystals (LCs).
LCs do not emit light directly. They are used in a wide range of
applications, including computer monitors, television, instrument
panels, aircraft cockpit displays, signage, etc. They are common in
consumer devices such as video players, gaming devices, clocks,
watches, calculators, and telephones. LCDs have displaced cathode ray
tube (CRT) displays in most applications. They are usually more compact,
lightweight, portable, less expensive, more reliable, and easier on the eyes.
They are available in a wider range of screen sizes than CRT and plasma
displays, and since they do not use phosphors, they cannot suffer image
burn-in. LCDs are more energy efficient and offer safer disposal than
CRTs. Their low electrical power consumption enables them to be used
in battery-powered electronic equipment. An LCD is an electronically
modulated optical device made up of any number of pixels filled with
liquid crystals and arrayed in front of a light source (backlight) or
reflector to produce images in colour or monochrome.
4.8 LED
Figure 4.8 LED
A light-emitting diode (LED) is a semiconductor light source. LEDs
are used as indicator lamps in many devices, and are increasingly used
for lighting. Introduced as a practical electronic component in 1962, early
LEDs emitted low-intensity red light, but modern versions are available
across the visible, ultraviolet and infrared wavelengths, with very high
brightness.
When a light-emitting diode is forward biased (switched
on), electrons are able to recombine with electron holes within the device,
releasing energy in the form of photons. This effect is
called electroluminescence and the color of the light (corresponding to the
energy of the photon) is determined by the energy gap of the
semiconductor. An LED is often small in area (less than 1 mm^2), and
integrated optical components may be used to shape its radiation pattern.
LEDs present many advantages over incandescent light sources
including lower energy consumption, longer lifetime, improved robustness,
smaller size, faster switching, and greater durability and reliability. LEDs
powerful enough for room lighting are relatively expensive and require
more precise current and heat management than compact fluorescent
lamp sources of comparable output.
Light-emitting diodes are used in applications as diverse as
replacements for aviation lighting, automotive lighting (particularly brake
lamps, turn signals and indicators) as well as in traffic signals. The compact
size, the possibility of narrow bandwidth, switching speed, and extreme
reliability of LEDs has allowed new text and video displays and sensors to
be developed, while their high switching rates are also useful in advanced
communications technology. Infrared LEDs are also used in the remote
control units of many commercial products including televisions, DVD
players, and other domestic appliances.
CHAPTER 5
ARCHITECTURE
5.1 SYSTEM CONTROLLER
Figure 5.1 Overall Diagram
Fig 5.1 shows how the modules discussed in this chapter interact
with each other. Most of the signals pass through the “System Controller”
module. It controls the datapath by coordinating the modules so that the
data can flow. It deals primarily with the training phase of the algorithm,
since it is much more complex than the sound recognition phase. For
instance, once a sound has been detected, the system controller is notified.
Then, it waits for the FFT to output the data before notifying the 'Average'
module that it should start operating. Finally, it instructs the memory controller
to store the averaged data.
5.2 NIOS II PROCESSOR
We used the “NIOS II Software Build Tools for Eclipse” software
for writing our C program. The C program executes our algorithm for the
speech recognition. We have included the complete code in the code listing
section. The overall operation of the code can be described as follows.
The code executes an infinite loop, as it is always either waiting for
input or processing it. It begins by asserting the fftstart signal, which
starts the memory loading and the FFT operation. It then polls the
fftcomplete signal to detect the end of the FFT operation. Once the FFT is
complete, it deasserts the fftstart signal so that the values stored in the
FFT memory do not change before they are copied to SDRAM. It then
checks the fftlevel signal to determine whether a significant input level is
present at the MIC input, indicating the start of a voice command. We
found experimentally that an fftlevel value greater than 60 corresponds to
an actual voice command, while values below this represent either silence
or noise.
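The polling sequence described above can be sketched in C. The accessor functions here are hypothetical stand-ins for the memory-mapped fftstart, fftcomplete and fftlevel signals (the real program reads and writes NIOS II registers), so this is only a sketch of the control flow:

```c
/* Hypothetical stand-ins for the memory-mapped hardware signals; the
   real program accesses NIOS II registers instead. */
static int fft_polls = 0;
static void set_fftstart(int v) { (void)v; }                    /* drive fftstart */
static int  read_fftcomplete(void) { return ++fft_polls >= 3; } /* FFT done after a few polls */
static int  read_fftlevel(void) { return 75; }                  /* simulated MIC level */

#define VOICE_THRESHOLD 60 /* found experimentally: fftlevel > 60 means a real command */

/* Run one FFT frame and report whether it contains a voice command. */
int detect_voice_frame(void)
{
    set_fftstart(1);            /* start memory loading and the FFT */
    while (!read_fftcomplete()) /* poll until the FFT operation ends */
        ;
    set_fftstart(0);            /* freeze the FFT memory before copying to SDRAM */
    return read_fftlevel() > VOICE_THRESHOLD;
}
```

With the simulated level of 75, the function reports that a voice command is present.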
Once the start of the command is detected, we continuously store the
FFT output of the next 32 chunks of the voice sample, each chunk being
32 ms long. We store these values in a large array named fftcoeff, of size
8192. This is done with a for loop that iterates 32 times, each iteration
initiating the FFT module and then storing the FFT output into the
fftcoeff array at the appropriate location. At this point we have the power
spectrum of the spoken word; feature extraction can now determine which
word was spoken.
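The capture loop can be sketched as follows. The 256-coefficient chunk size is inferred from the 8192-element fftcoeff array divided over 32 chunks, and read_fft_bin is a hypothetical stand-in for copying one FFT output value out of FFT memory:

```c
#define NUM_CHUNKS 32   /* 32 chunks of 32 ms each, about 1 s of speech */
#define CHUNK_SIZE 256  /* 8192 / 32 coefficients per chunk (inferred) */

static int fftcoeff[NUM_CHUNKS * CHUNK_SIZE]; /* power spectrum of the word */

/* Hypothetical stand-in for reading one FFT output bin from FFT memory. */
static int read_fft_bin(int i) { return i; }

void capture_word(void)
{
    for (int chunk = 0; chunk < NUM_CHUNKS; ++chunk) {
        /* the real code restarts the FFT module here for the next 32 ms frame */
        for (int i = 0; i < CHUNK_SIZE; ++i)
            fftcoeff[chunk * CHUNK_SIZE + i] = read_fft_bin(i);
    }
}
```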
The first step is to convert the spectrum to the mel scale. We defined
the melcepstrum_conversion function, which converts the input power
spectrum to the mel scale and extracts 12 coefficients from it. The
function takes the fftcoeff array as input and outputs the mel array,
performing the shifting described in the theory section. We applied the
mel shifting to the entire 1 s of speech instead of to each 32 ms chunk.
Doing separate mel shifting for each 32 ms chunk would have been more
efficient but would have required sophisticated synchronization, and it
was not required for the operation we are trying to achieve.
The next step is to compute the discrete cosine transform of these
spectral points and obtain the MFCCs. We defined the dct function, which
takes the mel_array (12 coefficients) as input and outputs the mfcc_array
(12 coefficients).
Next we identify the spoken word based on the DCT coefficients.
Since the first two coefficients contain the most information, we take the
sum of the first two coefficients of the DCT output and store it in a
variable named sum_mel. Since our implementation differentiates between
the words 'Yes' and 'No', we noticed experimentally that this value was
always above 59 for the word 'Yes' and between 50 and 58 for the word
'No'. The program compares the sum_mel variable with these thresholds to
determine whether the spoken word is 'Yes' or 'No'. It then lights the
appropriate LEDs, and the hardware displays the word 'Yes' or 'No' on the
7-segment display.
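The decision rule amounts to a pair of threshold comparisons on sum_mel; a minimal sketch using the thresholds quoted above (above 59 for 'Yes', 50 to 58 for 'No'):

```c
/* Classify a word from its MFCCs using the experimentally found
   thresholds: sum_mel > 59 -> "Yes", 50..58 -> "No". */
const char *classify(const double *mfcc_array)
{
    double sum_mel = mfcc_array[0] + mfcc_array[1]; /* first two DCT coefficients */
    if (sum_mel > 59.0)
        return "Yes";
    if (sum_mel >= 50.0 && sum_mel <= 58.0)
        return "No";
    return "Unknown"; /* outside both experimentally observed ranges */
}
```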
CHAPTER 6
FIELD PROGRAMMABLE GATE ARRAY (FPGA)
6.1 FPGA
A field-programmable gate array (FPGA) is an integrated circuit
designed to be configured by the customer or designer after manufacturing
—hence "field-programmable". The FPGA configuration is generally
specified using a hardware description language (HDL), similar to that
used for an application-specific integrated circuit (ASIC) (circuit diagrams
were previously used to specify the configuration, as they were for ASICs,
but this is increasingly rare). FPGAs can be used to implement any logical
function that an ASIC could perform. The ability to update the
functionality after shipping, partial reconfiguration of a portion of the
design, and the low non-recurring engineering costs relative to an ASIC
design (notwithstanding the generally higher unit cost) offer advantages
for many applications. FPGAs contain programmable logic components
called "logic blocks", and a hierarchy of reconfigurable interconnects that
allow the blocks to be "wired together"—somewhat like a one-chip
programmable breadboard. Logic blocks can be configured to perform
complex combinational functions, or merely simple logic gates like AND
and XOR. In most FPGAs, the logic blocks also include memory
elements, which may be simple flip-flops or more complete blocks of
memory.
6.1.1 Introduction
The area of field programmable gate array (FPGA) design is
evolving at a rapid pace. The increase in the complexity of FPGA
architectures means that they can now be used in far more applications
than before. The newer FPGAs are steering away from the plain vanilla
"logic only" architecture to one with embedded dedicated blocks for
specialized applications. With so many choices available, the designer not
only has to familiarize himself with the various architectures and their
strengths, but he also needs a way to quickly estimate the performance of
his design when targeted to the different technologies. This section briefly
outlines the latest offerings from the key FPGA vendors and in its latter
half discusses the importance of using the right synthesis tool in order to
target the same design to these various technologies.
Definitions of relevant terminology are as follows.
FPD (Field-Programmable Device): a general term that refers to any
type of integrated circuit used for implementing digital hardware, where
the chip can be configured by the end user to realize different designs.
Programming of such a device often involves placing the chip into a
special programming unit, but some chips can also be configured
"in-system". Another name for FPDs is programmable logic devices
(PLDs); although PLDs encompass the same types of chips as FPDs, we
prefer the term FPD because historically the word PLD has referred to
relatively simple types of devices.
PLA (Programmable Logic Array): a relatively small FPD that contains
two levels of logic, an AND-plane and an OR-plane, where both levels are
programmable (note: although PLA structures are sometimes embedded
into full-custom chips, we refer here only to those PLAs that are provided
as separate integrated circuits and are user-programmable).
PAL (Programmable Array Logic): a relatively small FPD that has a
programmable AND-plane followed by a fixed OR-plane.
SPLD: any type of Simple PLD, usually either a PLA or a PAL.
CPLD: a more Complex PLD that consists of an arrangement of multiple
SPLD-like blocks on a single chip. Alternative names sometimes adopted
for this style of chip are Enhanced PLD (EPLD), Super PAL, Mega PAL,
and others.
FPGA — a Field-Programmable Gate Array is an FPD featuring a
general structure that allows very high logic capacity. Whereas CPLDs
feature logic resources with a wide number of inputs (AND-planes),
FPGAs offer narrower logic resources. FPGAs also offer a higher ratio
of flip-flops to logic resources than do CPLDs.
6.1.2 The FPGA Landscape
In the semiconductor industry, the programmable logic segment is
the best indicator of the progress of technology. No other segment has such
varied offerings as field programmable gate arrays. It is no wonder that
FPGAs were among the first semiconductor products to move to the
0.13µm technology, and again recently to 90nm technology. This rapidly
changing technology means that more complex functionality is being
designed.
Figure 6.1.2 Structure Of An FPGA
The players in the current programmable logic market are Altera,
Atmel, Actel, Cypress, Lattice, QuickLogic and Xilinx. Some of the larger
and more popular device families are: Stratix™ from Altera; Accelerator
from Actel; ispXPGA™ from Lattice; and Virtex™ from Xilinx. Between
these FPGA devices, many major electronics applications such as
communications, video, image and digital signal processing, storage area
networks and aerospace are covered. While the architecture of each FPGA
is unique, the basic combination of functional blocks remains the same:
LUTs + registers + carry-chain + wide MUX. It is important to be aware of
the required resources for a design and to cross-reference this with what is
available. Sometimes, however, it is also the supported configuration that is
important for a design's requirement. For example, the capability of a
dedicated RAM to function in a particular mode might not be supported by
all vendors.
6.1.3 FPGA Synthesis: The Vendor-Independent Approach
The present-day FPGAs offer the necessary features for successfully
completing most complex designs. Table 6.1.3 highlights the amount of
key resources available in the largest device offered by each FPGA vendor.
Clock management forms a very important part of any digital design and
this functionality is facilitated by on-chip phase locked loop (PLLs or
DLLs) circuitry. Dedicated memory blocks offer data storage and can be
configured as basic single-port RAMs, ROMs (read only memory), FIFOs
(first in first out), or CAMs (content addressable memory). Data processing
or the logic fabric of these FPGAs varies widely in size with the biggest
Xilinx Virtex-II Pro™ offering up to 100K LUT4s. Interfacing the FPGA
with backplanes, high-speed buses, and memories is made possible by
support for various single-ended and differential I/O standards.
Many of the major electronics applications such as communications,
video, image and digital signal processing; storage area networks and
aerospace are covered between the above-mentioned FPGA devices.
Although all of these FPGAs can perform the key functions required by
these applications, each of them is individually better suited for certain
target segments. For example, although Virtex-II and the Stratix both offer
dedicated multiplier blocks, the existence of the adders in the dedicated
DSP block may enable the Stratix device to target DSP applications more
effectively due to its ability to create efficient MAC (multiply-accumulate)
blocks. In a similar manner, for programmable systems applications
requiring embedded processors, the Virtex-II Pro™ with its 32-bit RISC
processor (PowerPC 405) would be an ideal choice.
Features | Xilinx Virtex-II Pro | Altera Stratix | Actel Accelerator | Lattice ispXPGA
Clock management | DCM, up to 12 | PLL, up to 12 | PLL, up to 8 | sysCLOCK PLL, up to 8
Embedded memory blocks | Block RAM, up to 10 Mbit | TriMatrix memory, up to 10 Mbit | Embedded RAM, up to 338K | sysMEM blocks, up to 414K
Data processing | CLBs and 18-bit x 18-bit multipliers | LEs and embedded multipliers | Logic modules (C-cell & R-cell) | PFU based
Programmable I/Os | SelectIO | Advanced I/O support | Advanced I/O support | sysIO
Special features | Embedded PowerPC 405 cores | DSP blocks | Per-pin FIFOs for bus applications | sysHSI for high-speed serial interface
Table 6.1.3 Features Offered In FPGAs
6.1.4 Applications of FPGAs
FPGAs have gained rapid acceptance and growth over the past
decade because they can be applied to a very wide range of applications. A
list of typical applications includes: random logic, integrating multiple
SPLDs, device controllers, communication encoding and filtering, small to
medium sized systems with SRAM blocks, and many more.
Other interesting applications of FPGAs are prototyping of designs
later to be implemented in gate arrays, and also emulation of entire large
hardware systems. The former of these applications might be possible
using only a single large FPGA (which corresponds to a small Gate Array
in terms of capacity), and the latter would entail many FPGAs connected
by some sort of interconnect; for emulation of hardware, QuickTurn
[Wolff90] (and others) has developed products that comprise many FPGAs
and the necessary software to partition and map circuits.
Another promising area for FPGA application, which is only
beginning to be developed, is the usage of FPGAs as custom computing
machines. This involves using the programmable parts to “execute”
software, rather than compiling the software for execution on a regular
CPU.
CHAPTER 7
ALTERA CYCLONE II DE2 KIT
7.1 LAYOUT AND COMPONENTS
A photograph of the DE2-70 board is shown in Figure 7.1. It depicts
the layout of the board and indicates the location of the connectors and key
components.
Figure 7.1 Altera DE2 kit
The DE2-70 board has many features that allow the user to
implement a wide range of designs, from simple circuits to various
multimedia projects.
The following hardware is provided on the DE2-70 board
Altera Cyclone® II 2C70 FPGA device
Altera Serial Configuration device - EPCS16
USB Blaster (on board) for programming and user API control; both
JTAG and Active Serial(AS) programming modes are supported
2-Mbyte SSRAM
Two 32-Mbyte SDRAM
8-Mbyte Flash memory
SD Card socket
4 pushbutton switches
18 toggle switches
18 red user LEDs
9 green user LEDs
50-MHz oscillator and 28.63-MHz oscillator for clock sources
24-bit CD-quality audio CODEC with line-in, line-out, and
microphone-in jacks
VGA DAC (10-bit high-speed triple DACs) with VGA-out connector
Two TV decoders (NTSC/PAL/SECAM) and TV-in connectors
10/100 Ethernet Controller with a connector
USB Host/Slave Controller with USB type A and type B connectors
RS-232 transceiver and 9-pin connector
PS/2 mouse/keyboard connector
IrDA transceiver
1 SMA connector
Two 40-pin Expansion Headers with diode protection
In addition to these hardware features, the DE2-70 board has
software support for standard I/O interfaces and a control panel facility for
accessing various components. Also, software is provided for a number of
demonstrations that illustrate the advanced capabilities of the DE2-70
board.
In order to use the DE2-70 board, the user has to be familiar with the
Quartus II software. The necessary knowledge can be acquired by reading
the tutorials Getting Started with Altera’s DE2-70 Board and Quartus II
Introduction (which exists in three versions based on the design entry
method used, namely Verilog, VHDL or schematic entry). These tutorials
are provided in the directory DE2_70_tutorials on the DE2-70 System
CD-ROM that accompanies the DE2-70 board, and can also be found on
Altera's DE2-70 web pages.
7.2 BLOCK DIAGRAM OF THE DE2-70 BOARD
Figure 7.2.1 gives the block diagram of the DE2-70 board. To provide
maximum flexibility for the user, all connections are made through the
Cyclone II FPGA device. Thus, the user can configure the FPGA to
implement any system design.
Following is more detailed information about the blocks in Figure 7.2.1.
7.2.1 Cyclone II 2C70 FPGA
68,416 LEs.
250 M4K RAM blocks.
1,152,000 total RAM bits.
150 embedded multipliers.
4 PLLs.
622 user I/O pins.
FineLine BGA 896-pin package.
Figure 7.2.1. Block Diagram Of The DE2-70 Board.
7.2.2 Serial Configuration Device And USB Blaster Circuit
Altera’s EPCS16 Serial Configuration device.
On-board USB Blaster for programming and user API control.
JTAG and AS programming modes are supported.
7.2.3 SSRAM
2-Mbyte standard synchronous SRAM.
Organized as 512K x 36 bits.
Accessible as memory for the Nios II processor and by the DE2-70
Control Panel.
7.2.4 SDRAM
Two 32-Mbyte Single Data Rate Synchronous Dynamic RAM
memory chips.
Organized as 4M x 16 bits x 4 banks.
Accessible as memory for the Nios II processor and by the DE2-70
Control Panel.
7.2.5 Flash Memory
8-Mbyte NOR Flash memory.
Supports both byte and word mode access.
Accessible as memory for the Nios II processor and by the DE2-70
Control Panel.
7.2.6 SD Card Socket
Provides SPI and 1-bit SD mode for SD Card access.
Accessible as memory for the Nios II processor with the DE2-70 SD
Card Driver.
7.2.7 Pushbutton Switches
4 pushbutton switches.
Debounced by a Schmitt trigger circuit.
Normally high; generates one active-low pulse when the switch is
pressed.
7.2.8 Toggle Switches
18 toggle switches for user inputs.
A switch causes logic 0 when in the down position (closest to the edge
of the DE2-70 board) and logic 1 when in the up position.
7.2.9 Clock Inputs
50-MHz oscillator.
28.63-MHz oscillator.
SMA external clock input.
7.2.10 Audio CODEC
Wolfson WM8731 24-bit sigma-delta audio CODEC.
Line-level input, line-level output, and microphone input jacks
Sampling frequency: 8 to 96 kHz.
Applications for MP3 players and recorders, PDAs, smart phones,
voice recorders, etc.
7.2.11 VGA Output
Uses the ADV7123 240-MHz triple 10-bit high-speed video DAC.
With 15-pin high-density D-sub connector.
Supports up to 1600 x 1200 at 100-Hz refresh rate.
7.2.12 NTSC/PAL/ SECAM TV Decoder Circuit
Uses two ADV7180 Multi-format SDTV Video Decoders.
Supports worldwide NTSC/PAL/SECAM color demodulation.
One 10-bit ADC, 4X over-sampling for CVBS.
Supports Composite Video (CVBS) RCA jack input.
Supports digital output formats: 8-bit ITU-R BT.656 YCrCb 4:2:2
output + HS, VS, and FIELD.
Applications: DVD recorders, LCD TV, Set-top boxes, Digital TV,
Portable video devices, and TV PIP (picture in picture) display.
7.2.13 10/100 Ethernet Controller
Integrated MAC and PHY with a general processor interface.
Supports 100Base-T and 10Base-T applications.
Supports full-duplex operation at 10 Mb/s and 100 Mb/s, with auto-
MDIX.
Fully compliant with the IEEE 802.3u Specification.
Supports IP/TCP/UDP checksum generation and checking.
Supports back-pressure mode for half-duplex mode flow control.
7.2.14 USB Host/Slave Controller
Complies fully with Universal Serial Bus Specification Rev. 2.0.
Supports data transfer at full-speed and low-speed.
Supports both USB host and device.
Two USB ports (one type A for a host and one type B for a device).
Provides a high-speed parallel interface to most available processors;
supports Nios II with a Terasic driver.
7.2.15 Serial Ports
One RS-232 port.
One PS/2 port.
DB-9 serial connector for the RS-232 port.
PS/2 connector for connecting a PS2 mouse or keyboard to the DE2-
70 board.
7.2.16 IrDA Transceiver
Contains a 115.2-kb/s infrared transceiver.
32 mA LED drive current.
Integrated EMI shield.
IEC825-1 Class 1 eye safe.
Edge detection input.
7.2.17 Two 40-pin Expansion Headers
72 Cyclone II I/O pins, as well as 8 power and ground lines, are
brought out to two 40-pin expansion connectors.
40-pin header is designed to accept a standard 40-pin ribbon cable
used for IDE hard drives.
Diode and resistor protection is provided.
CHAPTER 8
EXPERIMENTAL RESULTS
The machine is trained three times with the word "help". The word
"help" is recognized 90.9% of the time, whereas "held" is correctly
ignored (100% correct) when the first person speaks. However, these
percentages are respectively 45.5% and 0% when the second person
speaks. If, during the training phase, the first person inputs the word twice
and the second person once, their percentages become respectively 72.7%
and 45.5% when saying "help". When saying "held", the machine correctly
assesses that they are not saying "help" in all cases. This data was
collected by saying "help" 11 times and "held" twice.
Word | Verdict | Correct? | Correctness
help | Same | Yes | 90.9 % ("help")
help | Different | No |
help | Same | Yes |
help | Same | Yes |
help | Same | Yes |
help | Same | Yes |
help | Same | Yes |
help | Same | Yes |
help | Same | Yes |
help | Same | Yes |
help | Same | Yes |
held | Different | Yes | 100 % ("held")
held | Different | Yes |
Overall | | | 92.3 %
Table 8 Experimental Results
This indicates that the training works properly, because the first
person's correctness decreases when his participation in the training
decreases (from three times to two). On the other hand, the second
person's correctness increases when he participates in the training.
Since the fingerprints are analyzed in the time domain, the system is
much more sensitive to the speed, the intonation and the surrounding noise
when a word is input. Thus, the above results should be taken with
caution, because the words were spoken very close to the microphone,
and in a somewhat similar way each time. Nonetheless, the results seem
conclusive: despite a potential lack of accuracy, the machine is
functional.
CHAPTER 9
ADVANTAGES
The user can talk and write freely. The system understands,
analyzes and creates all the elements that are presented.
The user leads and controls the dialogue. He or she can interact
by canceling or substituting previous functions and sentences.
The technology understands, analyzes and creates all the element
representations using a grammatical analysis strategy, ensuring the
correct interpretation and management of the full semantic capacity
of natural language.
The platform offers real-time interaction in massive environments
using acute memory management strategies.
The solution adapts completely to the user's profile and previous
dialogue history, with the aim of customizing all interactive
processes.
CHAPTER 10
APPLICATIONS
Interactive voice response systems (IVRS).
Voice dialing in mobile phones and telephones.
Hands-free dialing in wireless Bluetooth headsets.
PIN and numeric password entry modules.
Value added service (VAS) providers.
Automated teller machines (ATMs).
If we increased the system's vocabulary using phoneme-based
recognition for a particular language, e.g., English, the system could be
used to replace standard input devices such as keyboards, touch pads, etc.
in IT- and electronics-based applications in various fields. The design has
a wide market opportunity ranging from mobile service providers to ATM
makers. Some of the targeted users for the system include:
Mobile operators.
Home and office security device providers.
ATM manufacturers.
Mobile phone and Bluetooth headset manufacturers.
Telephone service providers.
Manufacturers of instruments for disabled persons.
PC users.
CHAPTER 11
FUTURE IMPROVEMENTS
Given more time, we would have liked to implement a more robust
system with a larger dictionary of words. Also, using the mel scale
followed by the DCT is a weaker approach to the speech recognition
problem. Algorithms built on Hidden Markov Models, which are
predominantly statistical techniques that treat a speech signal as piecewise
stationary over a window of about 10 ms, are preferred instead. Using
such algorithms, we could enhance the accuracy of the system and make
it more robust.
CHAPTER 12
CONCLUSION
After applying the background theory and scripting a VLSI prototype,
we have shown that a speech recognition system can indeed be
successfully implemented using FPGA technology. The experimental
results show that the algorithm is accurate and fast enough for consumer
product applications. Despite only a partial hardware implementation due
to technical difficulties, the system remains functional.
Besides producing a full implementation (by including an FFT
module and thus being able to analyze words in the frequency spectrum),
other improvements can be made to the system. For instance, allowing a
variable length for the input sounds would drastically improve its
performance on very short or very long words. Also, adding support for
training several words would be rather simple and would increase the
system's flexibility.
REFERENCE
[1] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition.
Tsinghua University Press.
[2] E. Trentin and M. Gori, "A survey of hybrid ANN/HMM models for
automatic speech recognition," Neurocomputing, vol. 37, no. 1,
pp. 91-126, 2001.
[3] V. Steinbiss, B.-H. Tran, and H. Ney, "Improvements in beam
search," in Proc. ICSLP, 1994, pp. 2143-2146.
[4] D. Llorens and F. Casacuberta, "An experimental study of histogram
pruning in speech recognition," in Proceedings of the VIII SNRFAI,
vol. 5, p. 6, 1999.
[5] M. Mosleh, S. Setayeshi, and M. Kheyrandish, "Accelerating speech
recognition algorithm with synergic hidden Markov model and
genetic algorithm based on cellular automata," in Proc. ICSPS 2009,
pp. 1-7, 2009.
[6] K. Kuo, "Dual-ALU structure processor for speech recognition," in
Proceedings of the 2006 IEEE/SMC International Conference on
System of Systems Engineering, USA, 2006, pp. 193-196.