st final report tommorow 4-4-2011 report


Upload: sureshkumarscool

Post on 28-Nov-2014


Page 1: ST Final Report TOMMOROW 4-4-2011 Report

1

CHAPTER 1

INTRODUCTION

1.1 DESIGN INTRODUCTION

For the past several decades, designers have processed speech for a

wide variety of applications ranging from mobile communications to

automatic reading machines. Speech recognition reduces the overhead

caused by alternate communication methods. Speech has not been used

much in the field of electronics and computers due to the complexity and

variety of speech signals and sounds. However, with modern processes,

algorithms, and methods we can process speech signals easily and

recognize the text.

1.2 INTRODUCTION

Our project aimed at developing a Real Time Speech Recognition

Engine on an FPGA using Altera DE2 board. The system was designed so

as to recognize the word being spoken into the microphone. Both industry

and academia have spent a considerable effort in this field for developing

software and hardware to come up with a robust solution. However, because of the large number of accents spoken around the world, this conundrum remains an active area of research. Speech recognition finds numerous applications, including health care, artificial intelligence, human-computer interaction, Interactive Voice Response systems, military, avionics, etc. Another important application lies in helping physically challenged people interact with the world in a better way.

We implemented a Real Time Speech Recognition Engine that takes

as an input the time domain signal from a microphone and performs the

frequency domain feature extraction on the sample to identify the word

being spoken. Our design exploits the fact that most of the words spoken

across various accents around the world have some common frequency

domain features that can be used to identify the word. Speech Recognition

has always been a conundrum and a point of keen interest for researchers

all around the globe. While various methodologies have been developed to solve this issue, it remains an unsolved yet intriguing problem.

CHAPTER 2

LITERATURE SURVEY

2.1 EXISTING SYSTEM

The existing system can only recognize speech; it cannot display the recognized text. It can use only a discrete or a continuous Hidden Markov Model (HMM). Using SRAM, it can implement and recognize only 500 words, and its recognition speed is slow. The existing system can be applied for various practical purposes, but it suffers from various losses, which is why we use speech-to-text conversion.

2.2 PROPOSED SYSTEM

The proposed system offers more advantages and makes better use of the HMM. Our system uses both the discrete and the continuous form of the hidden Markov model. The system not only recognizes speech but also displays the text on a liquid crystal display. With proper training and in a closed environment, we can achieve much higher accuracy in the text, and the Viterbi algorithm helps us find the most likely text in the speech. This system has a practical application for deaf persons; further improvement of our system could help educate many deaf people, and even speech-to-visual conversion may be possible in the near future.

CHAPTER 3

BACKGROUND THEORY

3.1 SPEECH RECOGNITION PRINCIPLE

Figure 3.1 Speech waves

Speech recognition systems can be classified into several models by

describing the types of utterances to be recognized. These classes shall take

into consideration the ability to determine the instance when the speaker

starts and finishes the utterance. In our project we aimed to implement an Isolated Word Recognition System, which usually uses a rectangular window over the word being spoken. These types of systems have "Listen/Not-Listen" states, where they require the speaker to wait between utterances.

A desktop microphone is not appropriate for this project, since it tends to pick up more ambient noise, which hampers accurate detection of speech. A headset-style microphone allows the ambient noise to be minimized. Since speech recognition is heavily dependent on processing speed because of the large amount of signal processing involved, implementing it on an FPGA was a good choice and the motivation behind this project. Also, the memory available on the Altera DE2 development board was enough to easily and successfully implement the design for a word of length nearly 1 second.

Speech recognition engines are broadly classified into two types, namely pattern recognition and acoustic-phonetic systems. While the former uses known/trained patterns to determine a match, the latter uses attributes of the human body to compare speech features (phonetics such as vowel sounds). Pattern recognition systems combine with current computing techniques and tend to have higher accuracy.

3.1.1 Flow chart

Figure 3.1.1 speech recognition principle

The system recognizes the spoken digit using a maximum likelihood

estimate, i.e., a Viterbi decoder. The input speech sample is preprocessed to

extract the feature vector. Then, the nearest codebook vector index for each

frame is sent to the digit models. The system chooses the model that has

the maximum probability of a match.

3.2 DATA ACQUISITION

The speech signal is essentially analog in nature. Hence, the signal must be converted to digital data in order to be read and processed. We used the built-in ADC of the Wolfson CODEC to sample our signal at 8 kHz, producing a 16-bit signed digital output. Once a word was detected, we computed the FFT over the next 32 blocks of data and read their power coefficients in Nios II.

Figure 3.2 ADC Wave Form.

3.3 DETECTION

The system must know when a spoken word is input. Thus, a

detection algorithm has been devised. This is done by continually

computing the difference of the absolute average of two adjacent sound

windows (sets of consecutive sound data), and comparing it to a predefined

threshold.

The detector algorithm can be broken down as follows:

1. The absolute average w1 of a sound window of length W is computed from the sound samples s_i starting at s_a and ending at s_b, as shown in Eq. 1.

\[ w_1 = \frac{1}{W} \sum_{i=a}^{b} |s_i| \tag{1} \]

2. The average of the second window w2 is computed from the sound samples s_i starting at s_b and ending at s_c, as shown in Eq. 2.

\[ w_2 = \frac{1}{W} \sum_{i=b}^{c} |s_i| \tag{2} \]

3. The difference between w2 and w1 is compared to the threshold value

Th. If it is larger, the spoken word is considered to start at sc. Else, the

algorithm goes on to step 4.

4. The average of the oldest window (w1) is discarded, and replaced by w2.

Then, the algorithm goes back to step 2.

Note that the threshold value Th has been experimentally determined in the MATLAB implementation (see appendix A). Nevertheless, it may vary depending on the sound acquisition setup (e.g. position of the microphone, noise level, etc.). Finally, the length of the word is fixed to 1.024 s for convenience.
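
The four detector steps can be sketched in Python as follows. This is a minimal illustration, not the MATLAB implementation from appendix A: the function name `detect_word_start`, the use of non-overlapping adjacent windows, and returning the sample index at the end of the second window (the position s_c) are assumptions made for this sketch.

```python
def detect_word_start(samples, window, threshold):
    """Return the index where a word is considered to start, or None."""

    def abs_average(start):
        # Absolute average of one window of `window` samples (Eq. 1 and Eq. 2).
        block = samples[start:start + window]
        return sum(abs(s) for s in block) / window

    if len(samples) < 2 * window:
        return None
    w1 = abs_average(0)                      # step 1: first window
    start = window
    while start + window <= len(samples):
        w2 = abs_average(start)              # step 2: adjacent window
        if w2 - w1 > threshold:              # step 3: compare with Th
            return start + window            # word starts at the end of window 2
        w1 = w2                              # step 4: the oldest average is replaced
        start += window
    return None
```

With eight silent samples followed by eight loud ones, a window of 4 and a threshold of 50, the sketch reports the start at index 12, i.e. at the end of the first loud window.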

3.4 FREQUENCY CONTENT

Once the word is detected, it is mapped to the frequency domain by computing its Discrete Fourier Transform (DFT) using the Fast Fourier Transform (FFT) algorithm. Since the length of a word is 1.024 s and the sound is sampled at 5 kHz, a word contains 5120 samples, so five 1024-point FFTs are required to fully characterize a single word. In the MATLAB implementation, these are stored in the rows of a 5 x 1024 matrix. This matrix constitutes the “fingerprint”. Note that, for the sake of simplicity, only the real part of the DFT is kept. In the training mode, the user defines how many times a word is trained. The frequency content of the repetitions is averaged by adding their fingerprints together and dividing the final sum by the number of times the word has been trained. This generates the “reference fingerprint”.
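
As an illustration of how a fingerprint and a reference fingerprint could be built, here is a small pure-Python sketch. The recursive radix-2 FFT stands in for whatever FFT routine the MATLAB implementation calls, and the names `fft`, `fingerprint` and `reference_fingerprint` are hypothetical.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fingerprint(word, block=1024):
    """Split a word into blocks and keep the real part of each block's DFT."""
    rows = []
    for start in range(0, len(word) - block + 1, block):
        spectrum = fft(word[start:start + block])
        rows.append([c.real for c in spectrum])
    return rows


def reference_fingerprint(fingerprints):
    """Average several fingerprints of the same word element by element."""
    count = len(fingerprints)
    n_rows = len(fingerprints[0])
    n_cols = len(fingerprints[0][0])
    return [[sum(fp[r][c] for fp in fingerprints) / count
             for c in range(n_cols)] for r in range(n_rows)]
```

For a 5120-sample word and the default block size, `fingerprint` returns five rows of 1024 values each, matching the matrix described above.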

3.5 DISTANCE

The comparison between a word's fingerprint and the reference

fingerprint is done by taking the Euclidean distance between them. To do this, they are considered as five 1024-dimensional vectors (one for each matrix row), and the average of their respective Euclidean distances is

computed. This is shown in Eq. 3, where D is the distance, and ani and bni

are the ith components of the fingerprints. The n index points to each of the

five vector pairs.

\[ D = \frac{1}{5} \sum_{n=1}^{5} \sqrt{\sum_{i=1}^{1024} (a_{ni} - b_{ni})^2} \tag{3} \]

If the distance is less than a preset maximum (maxDis), then the

analyzed word is considered to match the reference word. Note that

maxDis is experimentally set to 140 in the MATLAB implementation (see

appendix B). Similarly to the Th parameter, this value depends on the

sound acquisition setup and may need to be varied in order to achieve

accurate speech recognition.
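
Eq. 3 and the matching rule can be written out as a short Python sketch. The function names are hypothetical, and the default of 140 simply mirrors the maxDis value quoted above; it is expected to need tuning for a different acquisition setup.

```python
import math


def fingerprint_distance(a, b):
    """Average Euclidean distance between corresponding rows (Eq. 3)."""
    total = 0.0
    for row_a, row_b in zip(a, b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(row_a, row_b)))
    return total / len(a)


def matches(a, b, max_dis=140.0):
    """True when the distance is below the preset maximum (maxDis)."""
    return fingerprint_distance(a, b) < max_dis
```

The sketch divides by the number of rows rather than a hard-coded 5, so it reduces to Eq. 3 when both fingerprints have five rows.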

CHAPTER 4

HARDWARE IMPLEMENTATION

In order to implement the speech recognition algorithm in the Altera

DE2 board, it is broken down into modules. These are then mapped to

combinational logic and finite-state machines (FSM), using the Quartus II

software package.

4.1 WOLFSON INTERFACE

The board has a Wolfson WM8731 Coder-Decoder (CODEC),

which acts as the ADC. This audio chip has a microphone jack, and is

connected in a master-slave configuration with the FPGA (the latter being

the master). In order for the master to control the CODEC and acquire the

digital data, three modules have been created: the I2C bus controller, the

clock module, and a sound fetcher.

4.1.1 I2C Bus Controller

Three tasks need to be performed on the CODEC to modify its

internal settings: “de-mute” the microphone input, boost the microphone

volume, and change the default sound path (so that the microphone is given

priority over other inputs). To do this, the FPGA communicates with the

Wolfson via the I2C (Inter-Integrated Circuit) protocol using two pins:

'SDIN' (the data line), and 'SCLK' (the bus clock), as seen in Fig 4.1.1

Figure 4.1.1 Two line I2C bus protocol for the wolfson WM8731

The contents of the data line are sent in the same order as seen above

(after a start condition): 'RADDR', 'R/W', 'ACK', 'DATAB[15-9]', and

'DATAB[8-0]', which stand respectively for “base address”, “Read/Write”,

“acknowledge”, “control address”, and “control data”. The last block

modifies the settings. For instance, if 'DATAB[0]' is '1', the volume is

boosted. The base and control addresses are used to specify which internal

CODEC registers need to be accessed. “Read/Write” will always be set to

zero (i.e. write), since the Wolfson is write-only.
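
To make the frame layout concrete, the sketch below packs a 7-bit control address and 9-bit control data into the 16-bit 'DATAB' word and lists the bytes the master shifts out, most significant bit first. It is an illustration only: the function names are hypothetical, the device (base) address is left as a parameter rather than taken from the report, and the acknowledge slots driven by the CODEC are not modelled.

```python
def control_word(reg_addr, reg_data):
    """Pack DATAB[15-9] (control address) and DATAB[8-0] (control data)."""
    if not 0 <= reg_addr < 2 ** 7:
        raise ValueError("control address must fit in 7 bits")
    if not 0 <= reg_data < 2 ** 9:
        raise ValueError("control data must fit in 9 bits")
    return (reg_addr << 9) | reg_data


def i2c_write_bytes(device_addr, reg_addr, reg_data):
    """Return the three bytes sent after the start condition for one write."""
    word = control_word(reg_addr, reg_data)
    first = (device_addr << 1) | 0      # base address followed by R/W = 0 (write)
    return [first, word >> 8, word & 0xFF]
```

For example, with an assumed 7-bit device address of 0x1A, control address 4 and control data 5, the bytes are 0x34, 0x08 and 0x05.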

To signify a start condition, 'SDIN' goes from high to low while the

clock is maintained high. The same applies for a stop condition, except the

transition is low-to-high. Finally, the 'ACK' signal is sent from the

CODEC to the FPGA, as opposed to all the other data line contents. This

introduces the need for 'SDIN' to be implemented as a bi-directional pin,

which requires the use of a tri-state buffer. An FSM is created to

implement the bus interface between the FPGA and the Wolfson. Note that, because 'SCLK' must be between 0 Hz and 400 kHz, 'ADCLRC' (48.83 kHz) is used as the bus clock (see section 4.1.3). For start and stop conditions, 'ADCLRC' is overridden by the FSM, so that 'SCLK' is held at '1'.

4.1.2 Sound Fetcher

After the Wolfson digitizes the input, it presents the data ('ADCDAT') serially as seen in Fig. 4.1.2a. This is the Inter-IC Sound (I2S) standard. Two clocks are needed: 'ADCLRC' (the left-right

clock for ADC data), and 'BCLK' (the bit-stream clock). The CODEC will

place the most significant bit (MSB) on the 'ADCDAT' line so that it can

be fetched on the second rising 'BCLK' edge following a high-to-low

transition of 'ADCLRC'. The left and right channel distinction is used for

stereo sound. Since this project deals with mono sound, the data is fetched

when 'ADCLRC' is low (left channel).

Figure 4.1.2a ADCDAT output convention used by the Wolfson WM8731 (I2S)

Figure 4.1.2b Circuit schematic of the overall ADCDAT fetcher

The FSM in Fig 4.1.2b ('ADCDAT_fetcher_FSM') is used to keep

track of the events on the clocks (e.g. rising edges) in order to know the

exact moment one can start and stop to fetch. Because the data is presented

serially, the FSM communicates with a serial-to-parallel register

('LPM_SHIFTREG'), which outputs this data in parallel form.

Table 4.1.2 Two’s complement quantization from 3 bits to 2 bits

The next step is to quantize. The 'ADCDAT' word length is 24 bits in two's complement form. As said in section 1.2, the objective is to reduce the length to 8 bits. In order to see how signed binary numbers can be quantized, Table 4.1.2 illustrates a quantization from 3 bits to 2 bits.

A closer look at the second and fourth columns reveals that, in order

to quantize, it is only necessary to keep the two MSBs. Note that this is

possible because the two's complement scheme is used. Consequently,

when going from 24 bits to 8 bits, only the first eight most significant bits

need to be kept.
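
Keeping only the most significant bits of a two's complement number is equivalent to an arithmetic right shift. The following Python sketch (the function name `quantize` is hypothetical) reproduces the 3-bit to 2-bit example and applies the same rule to the 24-bit to 8-bit case.

```python
def quantize(value, in_bits, out_bits):
    """Keep the out_bits most significant bits of a two's complement value."""
    lowest = -(1 << (in_bits - 1))
    highest = (1 << (in_bits - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError("value does not fit in in_bits")
    # Python's >> on integers is an arithmetic shift, so the sign is preserved.
    return value >> (in_bits - out_bits)


three_to_two = [quantize(v, 3, 2) for v in (3, 2, 1, 0, -1, -2, -3, -4)]
```

Here `three_to_two` evaluates to `[1, 1, 0, 0, -1, -1, -2, -2]`, and the extreme 24-bit values map to 127 and -128.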

Decimal   Binary        Quantized   Quantized binary
number    (2's comp.)   decimal     (2's comp.)
   3        011            1           01
   2        010            1           01
   1        001            0           00
   0        000            0           00
  -1        111           -1           11
  -2        110           -1           11
  -3        101           -2           10
  -4        100           -2           10

The last D-type flip-flop ('LPM_DFF/downsampler_ff') reduces the output data rate from 48 kHz to 5 kHz. In order to do that, it is controlled by the two modules (a counter and an FSM) in the top right corner of Fig. 4.1.2b, which generate two pulses. Both pulses occur at a 5 kHz frequency. The first instructs the flip-flop to fetch the data. The second pulse is an output 'READY' signal that happens half a period after the first. Its purpose is to make sure that the rest of the circuit will fetch the data after it has been properly latched.

4.1.3 Clock Module

The FPGA is clocked at 50 MHz [1]. Because it acts as the

Wolfson's master, it must feed the latter with various clocks: the main

audio chip clock ('XCK'), 'ADCLRC', and 'BCLK'. According to the

Wolfson data sheets, both 'ADCLRC' and 'XCK' are dependent on the

sampling frequency. Since the latter is 48 kHz, 'ADCLRC' must also be 48

kHz. 'XCK' is 12.288 MHz [4]. 'BCLK' must be at least 2.4 MHz, because

it needs to yield 25 rising clock edges (1 to wait for the MSB and 24 to

fetch each 'ADCDAT' bit) within half the period of 'ADCLRC' (i.e. within

10.42 μs).

Figure 4.1.3 Block diagram of clock module

To implement all three clocks, a single clock module was devised. As seen in Fig. 4.1.3, it takes the 50 MHz clock as an input. Using a 2-bit counter, it then proceeds to divide it by 2^2, yielding a 12.5 MHz 'XCK' signal. Similarly, 'ADCLRC' and 'BCLK' are output using respectively 10-bit and 3-bit counters (to divide by 2^10 and 2^3). This produces 48.83 kHz and 6.25 MHz signals (the latter being greater than 2.4 MHz). Even though those values are approximations of the ideal ones specified in the data sheets, they are close enough for practical purposes [3].
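
The clock arithmetic in this section can be checked with a few lines of Python. This is only a numerical sketch with hypothetical names; it models each counter as an ideal power-of-two divider and does not describe the actual Quartus II design.

```python
def divided_clock(source_hz, counter_bits):
    """Frequency at the top bit of a free-running counter: source / 2**bits."""
    return source_hz / (1 << counter_bits)


def min_bclk_hz(sample_rate_hz, edges=25):
    """Lowest BCLK giving `edges` rising edges within half an ADCLRC period."""
    half_period = 1.0 / (2 * sample_rate_hz)
    return edges / half_period


SYSTEM_CLOCK_HZ = 50_000_000
xck = divided_clock(SYSTEM_CLOCK_HZ, 2)       # 12.5 MHz
bclk = divided_clock(SYSTEM_CLOCK_HZ, 3)      # 6.25 MHz
adclrc = divided_clock(SYSTEM_CLOCK_HZ, 10)   # about 48.83 kHz
```

`min_bclk_hz(48_000)` gives 2.4 MHz, which confirms that the 6.25 MHz 'BCLK' leaves a comfortable margin.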

4.3 FFT

The discrete Fourier transform (DFT) plays an important role in the

analysis, design, and implementation of discrete-time signal processing

algorithms and systems because efficient algorithms exist for the

computation of the DFT. These efficient algorithms are called Fast Fourier

Transform (FFT) algorithms. In terms of multiplications and additions, the

FFT algorithms can be orders of magnitude more efficient than competing

algorithms.

It is well known that the DFT takes N^2 complex multiplications and N^2 complex additions for a complex N-point transform. Thus, direct computation of the DFT is inefficient. The basic idea of the FFT algorithm is to break up an N-point DFT into successively smaller transforms known as butterflies (basic computational elements). The small transforms used can be 2-point DFTs, known as Radix-2, 4-point DFTs, known as Radix-4, or other sizes. A 2-point butterfly requires 1 complex multiplication and 2 complex additions, and a 4-point butterfly requires 3 complex multiplications and 8 complex additions. Therefore, the Radix-2 FFT reduces the complexity of an N-point DFT down to (N/2) log2(N) complex multiplications and N log2(N) complex additions, since there are log2(N) stages and each stage has N/2 2-point butterflies. For the Radix-4 FFT, there are log4(N) stages and each stage has N/4 4-point butterflies. Thus, the total number of complex multiplications is (3N/4) log4(N) = (3N/8) log2(N) and the number of required complex additions is 8(N/4) log4(N) = N log2(N).
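
The operation counts above are easy to tabulate. The following Python sketch uses hypothetical helper names and simply evaluates the formulas from this section for a given transform size.

```python
def _log2_exact(n):
    """Return log2(n) for a power of two, otherwise raise ValueError."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


def direct_dft_ops(n):
    return {"mult": n * n, "add": n * n}


def radix2_ops(n):
    stages = _log2_exact(n)                 # log2(N) stages of N/2 butterflies
    return {"mult": (n // 2) * stages, "add": n * stages}


def radix4_ops(n):
    log2n = _log2_exact(n)
    if log2n % 2:
        raise ValueError("n must be a power of four")
    stages = log2n // 2                     # log4(N) stages of N/4 butterflies
    return {"mult": 3 * (n // 4) * stages, "add": 8 * (n // 4) * stages}
```

For N = 1024 this gives 1,048,576 multiplications for the direct DFT, 5120 for Radix-2 and 3840 for Radix-4, with 10,240 additions for both FFT variants.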

Above all, the radix-4 FFT requires only 75% as many complex

multiplies as the radix-2 FFT, although it uses the same number of complex

additions. These additional savings make it a widely-used FFT algorithm.

Thus, we would like to use the Radix-4 FFT if the number of points is a power of 4. However, if the number of points is a power of 2 but not a power of 4, the Radix-2 algorithm must be used to complete the whole FFT process. In this application note, we will only discuss the Radix-4 FFT algorithm.

Now, let’s consider an example to demonstrate how FFTs are used

in real applications. In the 3GPP-LTE (Long Term Evolution), M-point

DFT and Inverse DFT (IDFT) are used to convert the signal between

frequency domain and time domain. 3GPP-LTE aims to provide for an

uplink speed of up to 50Mbps and a downlink speed of up to 100Mbps. For

this purpose, 3GPP-LTE physical layer uses Orthogonal Frequency

Division Multiple Access (OFDMA) on the downlink and Single Carrier -

Frequency Division Multiple Access (SC-FDMA) on the uplink.

In order to map the sound data from the time domain to the

frequency domain, the Altera IP Megacore FFT module is used. The

module is configured so as to produce a 1024-point FFT. It is not only

capable of taking a streaming data input in natural order, but it can also

output the transformed data in natural order, with a maximum latency of

1024 clock cycles once all the data (1024 data samples) has been received.

4.2 DETECTOR

The absolute values of the first 1024 samples that constitute a window are accumulated (summed together). Then the sum is shifted right by 10 bits in order to divide by 1024 (since 2^10 = 1024), thus producing the average value of the window.

The difference between that average and the one from the previous window (stored in 'Register 1') is then computed. 'Register 2' is used to control the comparator's input in order to ensure the comparison with the user-defined 9-bit threshold takes place when all the samples of the window have been processed. Once done, the contents of 'Register 1' are replaced by the newer window average.
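
A behavioural model of this datapath, written in Python for illustration only, is shown below. The function names are hypothetical, the FSM and 'Register 2' gating are not modelled, and `prev_avg` stands for the contents of 'Register 1'.

```python
WINDOW = 1024


def window_average(samples):
    """Accumulate |sample| over one window, then divide by 1024 with a shift."""
    if len(samples) != WINDOW:
        raise ValueError("a window holds exactly 1024 samples")
    acc = 0
    for s in samples:
        acc += abs(s)
    return acc >> 10            # shifting right by 10 divides by 2**10 = 1024


def detector_step(prev_avg, samples, threshold):
    """Return (word_detected, new_avg); new_avg replaces the stored average."""
    avg = window_average(samples)
    return (avg - prev_avg) > threshold, avg
```

Feeding a loud window after a quiet one returns `(True, avg)`, while two equally loud windows in a row return `(False, avg)`.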

Figure 4.3 Detector

An FSM is needed in order to control when to do this average

swapping, when to enable 'Register 2', when to determine if a count of

1024 samples has been reached, and when to clear the accumulator to

restart the summation. It also accepts a 'RESET' signal that asynchronously clears the accumulator.

4.4 MEMORY MANAGEMENT

In order to store the reference fingerprint, the 512 kB SRAM

module built in the board is used. There are three memory modules on the

Altera DE2: a 4 MB Flash memory chip, an 8 MB SDRAM chip and a 512

kB SRAM chip. While the Flash module provides a vast amount of non-volatile storage, it is very slow with respect to the main system clock. It

also requires a controller capable of dealing with its timing constraints.

The SDRAM chip is very fast and has a very large storage capacity, but it

requires a very sophisticated controller to be operated. This makes the

SRAM chip an obvious choice. Even though it is neither the fastest nor the largest, it has ten times the storage capacity required for this project, and it is fast enough (since it can perform a read or write operation in less than 20 ns, i.e. a system clock period) to avoid any timing issues. Moreover, it is a fairly simple device and can be easily controlled.

Figure 4.4 512 kB SRAM chip block diagram

The SRAM memory module is depicted in Fig. 4.4 with its inputs and outputs. Note that the 'Data' pins are bidirectional and require a tri-state buffer to be properly driven. The chip storage is divided into 2^18 16-bit blocks, which can be directly addressed through the 18 'Address' lines. This is not convenient for the implementation, since the data stored is 8 bits wide.

4.4.1 Memory Controller

“Memory Controller”, shown in Fig. 4.4.1a, has four user inputs ('ADDR', 'DATA_IN', 'MODE', and 'ENABLE'), one user output ('DATA_OUT') and seven inputs/outputs (depicted in green) that connect directly to the SRAM chip ('Low Byte Mask', 'High Byte Mask', 'Output Enable', 'Write Enable', 'Chip Enable', 'Address', and 'Data'). The controller simplifies the communication to the SRAM chip by splitting the bidirectional pins and allowing each 8-bit memory block to be directly accessed (see its detailed schematic in Fig. 4.4.1b).

Figure 4.4.1a Memory controller block diagram

The pins are split by using Altera's “bustri” (tri-state buffer) and each

8-bit block can be accessed using the 'High Byte Mask' and the 'Low Byte

Mask' according to the least significant bit of 'ADDR'. As a result, the user

sees an 8-bit data input ('DATA_IN'), a separate 8-bit data output

('DATA_OUT') and 19 address lines ('ADDR') which double the original

address space.
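
The byte addressing scheme can be modelled in a few lines of Python. This is a sketch under assumptions: the class and function names are hypothetical, and mapping an odd 'ADDR' to the high byte (rather than the low byte) is a choice made for the example, not something stated in the report.

```python
def decode_byte_address(addr):
    """Split a 19-bit byte address into an 18-bit word address and a byte select."""
    if not 0 <= addr < 2 ** 19:
        raise ValueError("ADDR must fit in 19 bits")
    return addr >> 1, addr & 1      # (word address, 1 selects the high byte)


class ByteWideSram:
    """Model of 16-bit words accessed one byte at a time."""

    def __init__(self, words=2 ** 18):
        self.words = [0] * words

    def write(self, addr, value):
        word_addr, high = decode_byte_address(addr)
        old = self.words[word_addr]
        if high:    # only the high byte is updated, the low byte is masked
            self.words[word_addr] = (old & 0x00FF) | ((value & 0xFF) << 8)
        else:       # only the low byte is updated, the high byte is masked
            self.words[word_addr] = (old & 0xFF00) | (value & 0xFF)

    def read(self, addr):
        word_addr, high = decode_byte_address(addr)
        word = self.words[word_addr]
        return (word >> 8) & 0xFF if high else word & 0xFF
```

Writing 0x12 to address 0 and 0x34 to address 1 leaves the single 16-bit word 0x3412 in the model, and each byte reads back independently.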

Figure 4.4.1b Schematic diagram of memory controller


4.4.2 Memory Batch Operator

In order to sequentially access the memory, a 'Memory Batch Operator' module was devised. As shown in Fig. 4.4.2, it takes 6 inputs ('START_ADDR', 'END_ADDR', 'DATA_IN', 'MODE', 'DATA_READY', and 'ENABLE') and has 5 outputs ('DATA_OUT', 'ADDR', 'MEM_MODE', 'MEM_ENABLE', and 'DONE'). It operates on the rising edge of a clock signal ('CLK').

Figure 4.4.2 Memory batch operator block diagram

The module works as follows:

1. Whenever the 'ENABLE' input goes high, it fetches the starting and ending addresses as specified in the 'START_ADDR' and 'END_ADDR' inputs, and readies to start writing or reading (according to the 'MODE' input) at the starting address. This takes two clock cycles.

2. Whenever the 'DATA_READY' signal is asserted, the module goes to the next address and reads (the data can be read from the 'DATA_OUT' lines of the memory controller) or writes (the data from the 'DATA_IN' input lines).

3. If the module reaches the ending address, then it signals 'DONE' until the 'ENABLE' input is low and goes back to step 1. Else, it goes back to step 2. Note that on each step, the module takes care of sending the appropriate signals to the memory controller in order to perform the desired action.
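
The sequencing described in the steps above can be approximated in software as a loop over an address range. The Python sketch below is a simplification with hypothetical names: it ignores clock cycles and handshaking, treats each loop iteration as one 'DATA_READY' pulse, and assumes that both the starting and the ending address are included in the batch.

```python
def batch_operate(memory, start_addr, end_addr, mode, data_in=None):
    """Visit start_addr..end_addr in order, writing data_in or collecting reads."""
    out = []
    source = iter(data_in or [])
    for addr in range(start_addr, end_addr + 1):   # one address per DATA_READY
        if mode == "write":
            memory[addr] = next(source)
        elif mode == "read":
            out.append(memory[addr])
        else:
            raise ValueError("mode must be 'read' or 'write'")
    return out      # returning corresponds to signalling DONE
```

A write batch followed by a read batch over the same range returns the data that was written, which is how the reference fingerprint would be stored and later fetched.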

4.5 DISTANCE(HMM)

The distance module illustrated in Fig. 4.5a has four inputs ('A', 'B', 'ENABLE', and 'RST') and one output, 'Distance'. It computes the distance between two arbitrarily sized vectors by adding and accumulating the squared difference of the 'A' and 'B' inputs on each rising edge of a clock signal 'CLK' while the 'ENABLE' input is high. In order to clear the accumulated distance, the asynchronous 'RST' signal must be asserted.

Figure 4.5a distance block diagram

Figure 4.5b Schematic diagram of distance

4.6 HMM TRAINING

An important part of speech-to-text conversion using pattern

recognition is training. Training involves creating a pattern representative

of the features of a class using one or more test patterns that correspond to

speech sounds of the same class. The resulting pattern (generally called a

reference pattern) is an example or template, derived from some type of

averaging technique. It can also be a model that characterizes the reference

pattern statistics. Our system uses speech samples from three individuals

during training.

A model commonly used for speech recognition is the HMM, which

is a statistical model used for modeling an unknown system using an

observed output sequence. The system trains the HMM for each digit in the

vocabulary using the Baum-Welch algorithm. The codebook index created

during preprocessing is the observation vector for the HMM model.

After preprocessing the input speech samples to extract feature

vectors, the system builds the codebook. The codebook is the reference

code space that we can use to compare input feature vectors. The weighted

cepstrum matrices for various users and digits are compared with the

codebook. The nearest corresponding codebook vector indices are sent to

the Baum-Welch algorithm for training an HMM model.
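
Finding the nearest codebook vector for each frame is a vector quantization step. A minimal Python sketch follows; the function names are hypothetical, and the squared Euclidean distance is an assumption, since the report does not state which distance measure is used.

```python
def nearest_codebook_index(vector, codebook):
    """Index of the codebook vector closest to `vector`."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize_frames(frames, codebook):
    """Turn a sequence of feature vectors into a sequence of codebook indices."""
    return [nearest_codebook_index(frame, codebook) for frame in frames]
```

The resulting list of indices plays the role of the observation sequence that is passed on to HMM training or recognition.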

The HMM characterizes the system using three matrices:

A: The state transition probability distribution.

B: The observation symbol probability distribution.

π: The initial state distribution.

Any digit is completely characterized by its corresponding A, B, and π matrices. The A, B, and π matrices are modeled using the Baum-Welch algorithm, which is an iterative procedure (we limit the iterations to 20). The Baum-Welch algorithm gives 3 matrices for each digit corresponding to the 3 users with whom we created the vocabulary set. The A, B, and π matrices are averaged over the users to generalize them for user-independent recognition.

For the design to recognize the same digit uttered by a user for which

the design has not been trained, the zero probabilities in the B matrix are

replaced with a low value so that it gives a non-zero value on recognition.

To some extent, this arrangement overcomes the problem of limited training data.
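
The two post-processing steps, averaging the per-user matrices and replacing zero probabilities, might look as follows in Python. The function names, the floor value of 1e-6 and the row renormalization after flooring are assumptions of this sketch; the report only states that zeros are replaced with a low value.

```python
def average_matrices(matrices):
    """Element-wise mean of equally sized matrices (one matrix per user)."""
    count = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[r][c] for m in matrices) / count for c in range(cols)]
            for r in range(rows)]


def floor_zero_probabilities(b_matrix, floor=1e-6):
    """Replace zero entries with a small value, then renormalize each row."""
    result = []
    for row in b_matrix:
        floored = [p if p > 0 else floor for p in row]
        total = sum(floored)
        result.append([p / total for p in floored])
    return result
```

After flooring, every observation symbol has a non-zero probability in every state, so an unseen speaker cannot drive a model's score to exactly zero.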

Training is a one-time process. Due to the complexity and resource

requirements, it is performed using standalone PC application software that

we created by compiling our C program into an executable. For

recognition, we compile the same C program but target it to run on the

Nios II processor instead. We were able to accomplish this cross-

compilation because of the wide support for the C language in the Nios II

processor IDE.

4.6.1 HMM-Based Recognition

Recognition or pattern classification is the process of comparing the

unknown test pattern with each sound class reference pattern and

computing a measure of similarity (distance) between the test pattern and

each reference pattern. The digit is recognized using a maximum likelihood

estimate, such as the Viterbi decoding algorithm, which implies that the

digit whose model has the maximum probability is the spoken digit.

Preprocessing, feature vector extraction, and codebook generation are the same as in HMM training. The input speech sample is preprocessed and the

feature vector is extracted. Then, the index of the nearest codebook vector

for each frame is sent to all digit models. The model with the maximum

probability is chosen as the recognized digit.
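
The recognition step can be illustrated with a compact Viterbi scorer in Python. This is a software sketch, not the Verilog custom instruction: the names are hypothetical, `models` is assumed to map each digit label to its (A, B, π) matrices, and working with log probabilities is a choice made here to avoid numerical underflow.

```python
import math


def viterbi_log_prob(obs, a, b, pi):
    """Log probability of the best state path for a sequence of codebook indices."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(pi)
    # Initialization: start in state s and emit the first symbol.
    delta = [log(pi[s]) + log(b[s][obs[0]]) for s in range(n_states)]
    # Recursion: best predecessor for each state, then emit the next symbol.
    for symbol in obs[1:]:
        delta = [max(delta[r] + log(a[r][s]) for r in range(n_states))
                 + log(b[s][symbol]) for s in range(n_states)]
    return max(delta)


def recognize(obs, models):
    """Return the label whose model gives the highest Viterbi score."""
    return max(models, key=lambda label: viterbi_log_prob(obs, *models[label]))
```

Each digit model is scored against the same observation sequence, and the label with the largest score is reported as the spoken digit.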

After preprocessing in the Nios II processor, the required data is

passed to the hardware for Viterbi decoding. Viterbi decoding is

computationally intensive so we implemented it in the FPGA for better

execution speed, taking advantage of hardware/software co-design. We

wrote the Viterbi decoder in Verilog HDL and included it as a custom

instruction in the Nios II processor. Data passes through the dataa and

datab ports and the prefix port is used for control operations. The custom

instruction copies or adds two floating-point numbers from dataa and

datab, depending on the prefix input. The output (result) is sent back to the

Nios II processor for further maximum likelihood estimation.

4.6.2 Flowchart Of HMM

The system trains the HMM for each digit in the vocabulary. The

same weighted cepstrum matrices for various users and digits are compared

with the codebook, and their corresponding nearest codebook vector indices are sent to the Baum-Welch algorithm to train a model for the input index

sequence. The codebook index is the observation vector for the HMM

model.

Figure 4.6.2 flowchart of HMM

The Baum-Welch algorithm is an iterative procedure and our system limits the iterations to 20. After training, we have three models for each digit that correspond to the three users in our vocabulary set. We find the average of the A, B, and π matrices over the users to generalize the models.

4.7 LCD

Figure 4.7 LCD

A liquid crystal display (LCD) is a thin, flat electronic visual

display that uses the light modulating properties of liquid crystals (LCs).

LCs do not emit light directly. They are used in a wide range of

applications, including computer monitors, television, instrument

panels, aircraft cockpit displays, signage, etc. They are common in

consumer devices such as video players, gaming devices, clocks,

watches, calculators, and telephones. LCDs have displaced cathode ray

tube (CRT) displays in most applications. They are usually more compact,

lightweight, portable, less expensive, more reliable, and easier on the eyes.

They are available in a wider range of screen sizes than CRT and plasma

displays, and since they do not use phosphors, they cannot suffer image

burn-in. LCDs are more energy efficient and offer safer disposal than

CRTs. Their low electrical power consumption enables them to be used

in battery-powered electronic equipment. An LCD is an electronically modulated

optical device made up of any number of pixels filled with liquid

crystals and arrayed in front of a light source (backlight) or reflector to

produce images in colour or monochrome.

4.8 LED

Figure 4.8 LED

A light-emitting diode (LED) is a semiconductor light source. LEDs

are used as indicator lamps in many devices, and are increasingly used

for lighting. Introduced as a practical electronic component in 1962, early

LEDs emitted low-intensity red light, but modern versions are available

across the visible, ultraviolet and infrared wavelengths, with very high

brightness.

When a light-emitting diode is forward biased (switched

on), electrons are able to recombine with electron holes within the device,

releasing energy in the form of photons. This effect is

called electroluminescence and the color of the light (corresponding to the

energy of the photon) is determined by the energy gap of the

semiconductor. An LED is often small in area (less than 1 mm²), and

integrated optical components may be used to shape its radiation pattern.

LEDs present many advantages over incandescent light sources

including lower energy consumption, longer lifetime, improved robustness,

smaller size, faster switching, and greater durability and reliability. LEDs

powerful enough for room lighting are relatively expensive and require

more precise current and heat management than compact fluorescent

lamp sources of comparable output.

Light-emitting diodes are used in applications as diverse as

replacements for aviation lighting, automotive lighting (particularly brake

lamps, turn signals and indicators) as well as in traffic signals. The compact

size, the possibility of narrow bandwidth, switching speed, and extreme

reliability of LEDs have allowed new text and video displays and sensors to

be developed, while their high switching rates are also useful in advanced

communications technology. Infrared LEDs are also used in the remote

control units of many commercial products including televisions, DVD

players, and other domestic appliances.

CHAPTER 5

ARCHITECTURE

5.1 SYSTEM CONTROLLER

Figure 5.1 Overall Diagram

Fig 5.1 shows how the modules discussed in this chapter interact

with each other. Most of the signals pass through the “System Controller”

module. It controls the datapath by coordinating the modules so that the

data can flow. It deals primarily with the training phase of the algorithm,

since it is much more complex than the sound recognition phase. For

instance, once a sound has been detected, the system controller is notified.

Then it waits for the FFT to output the data before notifying the 'Average'

module that it should start operating. Finally, it instructs the memory controller

to store the averaged data.

5.2 NIOS II PROCESSOR

We used the “NIOS II Software Build Tools for Eclipse” software

for writing our C program. The C program executes our algorithm for the

speech recognition. We have included the complete code in the code listing

section. The overall operation of the code can be described as follows.

The code executes an infinite loop, since it is always either waiting for

input or processing it. It starts a frame by asserting the fftstart signal, which

starts the memory loading and the FFT operation. It then polls the

fftcomplete signal to detect the end of the FFT operation. Once the FFT is

complete, it drives the fftstart signal low so that the values stored in the

FFT memory do not change before it copies them to SDRAM. It then

checks the fftlevel signal to determine whether a significant input level is

present at the MIC input, which indicates the start of a voice

command. We found experimentally that an fftlevel value greater

than 60 corresponds to an actual voice command, while a value below

this represents either silence or noise.
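The handshake described above can be sketched in C as follows. The signal names fftstart, fftcomplete and fftlevel and the threshold of 60 come from the text; everything else is an assumption for illustration. In particular, struct fft_io is a hypothetical abstraction of the hardware access, and the sim_ functions are stand-ins that let the sketch run on a PC rather than on the board.

```c
#include <stdbool.h>

#define FFTLEVEL_THRESHOLD 60u  /* from the text: above 60 means speech */

/* Hypothetical hardware interface. On the board these would be reads and
 * writes of the corresponding control signals. */
struct fft_io {
    void     (*set_fftstart)(bool level);
    bool     (*fftcomplete)(void);
    unsigned (*fftlevel)(void);
    void     (*copy_to_sdram)(void);
};

/* Runs one FFT frame and reports whether its level indicates speech. */
bool capture_frame(const struct fft_io *io)
{
    io->set_fftstart(true);           /* start memory loading and the FFT */
    while (!io->fftcomplete()) {
        /* busy-wait until the FFT module signals completion */
    }
    io->set_fftstart(false);          /* freeze the FFT memory contents */
    io->copy_to_sdram();              /* safe to copy the values now */
    return io->fftlevel() > FFTLEVEL_THRESHOLD;
}

/* Simulated signals, for trying the sketch without hardware. */
static unsigned sim_level;       /* value returned as fftlevel */
static int      sim_busy_polls;  /* polls before fftcomplete goes high */
static bool     sim_start;       /* last value written to fftstart */
static int      sim_copies;      /* number of SDRAM copies performed */

static void     sim_set_fftstart(bool level) { sim_start = level; }
static bool     sim_fftcomplete(void)        { return sim_busy_polls-- <= 0; }
static unsigned sim_fftlevel(void)           { return sim_level; }
static void     sim_copy_to_sdram(void)      { ++sim_copies; }

static const struct fft_io sim_io = {
    sim_set_fftstart, sim_fftcomplete, sim_fftlevel, sim_copy_to_sdram
};
```

The main loop would call capture_frame repeatedly and move on to recording only when it returns true.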

Once we detect the start of the command, we store the FFT output of

the next 32 chunks of voice samples, with each chunk being 32 ms long (about

1 s of speech in total). We store these values in a large array named fftcoeff

of size 8192. This is performed by a for loop that iterates 32 times; each

iteration initiates the FFT module as described above and then stores the FFT

output in the fftcoeff array at the appropriate location. We now have the

power spectrum of the word that has been spoken. The next stage is

feature extraction, which determines the word spoken.
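The indexing into fftcoeff can be sketched as follows. The array name and its size of 8192 are from the text; the figure of 256 values per chunk is inferred from 8192 / 32, and the float element type and the helper name store_chunk are assumptions.

```c
#include <stddef.h>
#include <string.h>

#define NUM_CHUNKS   32                          /* from the text */
#define CHUNK_BINS   256                         /* assumed: 8192 / 32 */
#define FFTCOEFF_LEN (NUM_CHUNKS * CHUNK_BINS)   /* 8192 */

/* Copies the FFT output of one chunk into its slot of fftcoeff.
 * Returns 0 on success, -1 if the chunk index is out of range. */
int store_chunk(float fftcoeff[FFTCOEFF_LEN], size_t chunk,
                const float bins[CHUNK_BINS])
{
    if (chunk >= NUM_CHUNKS)
        return -1;

    /* chunk c occupies elements [c * 256, c * 256 + 255] */
    memcpy(&fftcoeff[chunk * CHUNK_BINS], bins, CHUNK_BINS * sizeof(float));
    return 0;
}
```

In the recording loop, the chunk index would simply be the loop counter running from 0 to 31.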

The first step is to convert the spectrum to the mel scale. We defined the

melcepstrum_conversion function, which converts the input power

spectrum to the mel scale. We extracted 12 coefficients from this spectrum.

We pass the fftcoeff array as input and obtain the mel_array as output. The C

module does the shifting as described in the theory section. We did the mel

shifting for the entire 1 s of speech instead of for each 32 ms chunk. It

would have been more efficient to do separate mel shifting for each

32 ms chunk, but that would have required sophisticated synchronization and

was not needed for the operation we are trying to

achieve.

The next step is to compute the discrete cosine transform of these

spectral points and obtain the MFCCs. We defined the dct function, which

takes the mel_array (12 coefficients) as input and outputs the

mfcc_array (12 coefficients).

Next we identify the spoken word based on the DCT coefficients.

Since the first two coefficients contain the most information, we take

the sum of the first two coefficients of the dct output and store it in the variable

named sum_mel. Since our implementation differentiates

between the words ‘Yes’ and ‘No’, we observed experimentally that this

value was always above 59 for the word Yes and between 50 and 58

for the word No. The program compares the sum_mel variable with these

values to determine whether the spoken word is ‘Yes’ or ‘No’. It

then lights the appropriate LEDs, and the hardware displays the

word ‘Yes’ or ‘No’ on the 7-segment display.
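The decision rule can be sketched as follows. The variable name sum_mel and the thresholds 59, 50 and 58 are from the text; the enum, the function name, and the treatment of values that fall outside both ranges as "unknown" are assumptions made for the example.

```c
typedef enum { WORD_UNKNOWN, WORD_NO, WORD_YES } word_t;

/* Classifies a word from the first two DCT coefficients. */
word_t classify_word(const double mfcc_array[])
{
    double sum_mel = mfcc_array[0] + mfcc_array[1];

    if (sum_mel > 59.0)
        return WORD_YES;                     /* above 59: 'Yes' */
    if (sum_mel >= 50.0 && sum_mel <= 58.0)
        return WORD_NO;                      /* 50 to 58: 'No' */
    return WORD_UNKNOWN;                     /* assumed: reject anything else */
}
```

A fixed-threshold rule like this is simple to implement but depends on the speaker and the microphone distance used when the thresholds were measured.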

CHAPTER 6

FIELD PROGRAMMABLE GATE ARRAY (FPGA)

6.1 FPGA

A field-programmable gate array (FPGA) is an integrated circuit

designed to be configured by the customer or designer after manufacturing

—hence "field-programmable". The FPGA configuration is generally

specified using a hardware description language (HDL), similar to that

used for an application-specific integrated circuit (ASIC) (circuit diagrams

were previously used to specify the configuration, as they were for ASICs,

but this is increasingly rare). FPGAs can be used to implement any logical

function that an ASIC could perform. The ability to update the

functionality after shipping, partial reconfiguration of a portion of the

design, and the low non-recurring engineering costs relative to an ASIC

design (notwithstanding the generally higher unit cost) offer advantages

for many applications. FPGAs contain programmable logic components

called "logic blocks", and a hierarchy of reconfigurable interconnects that

allow the blocks to be "wired together"—somewhat like a one-chip

programmable breadboard. Logic blocks can be configured to perform

complex combinational functions, or merely simple logic gates like AND

and XOR. In most FPGAs, the logic blocks also include memory

elements, which may be simple flip-flops or more complete blocks of

memory.

6.1.1 Introduction

The area of field programmable gate array (FPGA) design is

evolving at a rapid pace. The increase in the complexity of FPGA

architectures means that they can now be used in far more applications than

before. The newer FPGAs are steering away from the plain vanilla type

"logic only" architecture to one with embedded dedicated blocks for

specialized applications. With so many choices available, the designer not

only has to familiarize himself with the various architectures and their

strengths, but he also needs a way to quickly estimate the performance of

his design when targeted to the different technologies. This section briefly

outlines the latest offerings from the key FPGA vendors and in its latter

half discusses the importance of using the right synthesis tool in order to

target the same design to these various technologies.

Definitions of relevant terminology:

Field-Programmable Device (FPD): a general term that refers to any type of integrated circuit used for implementing digital hardware, where the chip can be configured by the end user to realize different designs. Programming of such a device often involves placing the chip into a special programming unit, but some chips can also be configured “in-system”. Another name for FPDs is programmable logic devices (PLDs); although PLDs encompass the same types of chips as FPDs, we prefer the term FPD because historically the word PLD has referred to relatively simple types of devices.

PLA: a Programmable Logic Array (PLA) is a relatively small FPD that contains two levels of logic, an AND-plane and an OR-plane, where both levels are programmable (note: although PLA structures are sometimes embedded into full-custom chips, we refer here only to those PLAs that are provided as separate integrated circuits and are user-programmable).

PAL: a Programmable Array Logic (PAL) is a relatively small FPD that has a programmable AND-plane followed by a fixed OR-plane.

SPLD: refers to any type of Simple PLD, usually either a PLA or PAL.

CPLD: a more Complex PLD that consists of an arrangement of multiple SPLD-like blocks on a single chip. Alternative names (that will not be used in this paper) sometimes adopted for this style of chip are Enhanced PLD (EPLD), Super PAL, Mega PAL, and others.

FPGA: a Field-Programmable Gate Array is an FPD featuring a general structure that allows very high logic capacity. Whereas CPLDs feature logic resources with a wide number of inputs (AND planes), FPGAs offer narrower logic resources. FPGAs also offer a higher ratio of flip-flops to logic resources than do CPLDs.

6.1.2 The FPGA Landscape

In the semiconductor industry, the programmable logic segment is

the best indicator of the progress of technology. No other segment has such

varied offerings as field programmable gate arrays. It is no wonder that

FPGAs were among the first semiconductor products to move to the

0.13µm technology, and again recently to 90nm technology. This rapidly

changing technology means that more complex functionality is being

designed.

Figure 6.1.2 Structure Of An FPGA

The players in the current programmable logic market are Altera,

Atmel, Actel, Cypress, Lattice, QuickLogic and Xilinx. Some of the larger

and more popular device families are: Stratix™ from Altera; Axcelerator

from Actel; ispXPGA™ from Lattice and Virtex™ from Xilinx. Between

these FPGA devices, many major electronics applications such as

communications, video, image and digital signal processing, storage area

networks and aerospace are covered. While the architecture of each FPGA

is unique, the basic combination of the functional block remains the same:

LUTs + registers + carry-chain + wide MUX. It is important to be aware of

the required resources for a design and to cross-reference this with what is

available. Sometimes, however, it is also the supported configuration that is

important for a design's requirement. For example, the capability of a

dedicated RAM to function in a particular mode might not be supported by

all vendors.

6.1.3 FPGA Synthesis: The Vendor-Independent Approach

The present-day FPGAs offer the necessary features for successfully

completing most complex designs. Table 6.1.3 highlights the amount of

key resources available in the largest device offered by each FPGA vendor.

Clock management forms a very important part of any digital design and

this functionality is facilitated by on-chip phase-locked loop (PLL) or

delay-locked loop (DLL) circuitry. Dedicated memory blocks offer data storage and can be

configured as basic single-port RAMs, ROMs (read only memory), FIFOs

(first in first out), or CAMs (content addressable memory). Data processing

or the logic fabric of these FPGAs varies widely in size with the biggest

Xilinx Virtex-II Pro™ offering up to 100K LUT4s. The ability to interface

the FPGA with backplanes, high-speed buses, and memories is possible by

the availability of various single-ended and differential I/O standards

support.

Many of the major electronics applications such as communications,

video, image and digital signal processing; storage area networks and

aerospace are covered between the above-mentioned FPGA devices.

Although all of these FPGAs can perform the key functions required by

these applications, each of them is individually better suited for certain

target segments. For example, although Virtex-II and the Stratix both offer

dedicated multiplier blocks, the existence of the adders in the dedicated

DSP block may enable the Stratix device to target DSP applications more

effectively due to its ability to create efficient MAC (multiply-accumulate)

blocks. In a similar manner, for programmable systems applications

requiring embedded processors, the Virtex-II Pro™ with its 32-bit RISC

processor (PowerPC 405) would be an ideal choice.

Features | Xilinx Virtex-II Pro | Altera Stratix | Actel Axcelerator | Lattice ispXPGA

Clock management | DCM, up to 12 | PLL, up to 12 | PLL, up to 8 | sysCLOCK PLL, up to 8

Embedded memory blocks | Block RAM, up to 10 Mbit | TriMatrix memory, up to 10 Mbit | Embedded RAM, up to 338K | sysMEM blocks, up to 414K

Data processing | CLBs and 18-bit x 18-bit multipliers | LEs and embedded multipliers | Logic modules (C-cell and R-cell) | PFU based

Programmable I/Os | SelectIO | Advanced I/O support | Advanced I/O support | sysIO

Special features | Embedded PowerPC 405 cores | DSP blocks | Per-pin FIFOs for bus applications | sysHSI for high-speed serial interface

Table 6.1.3 Features Offered In FPGA

6.1.4 Applications of FPGAs

FPGAs have gained rapid acceptance and growth over the past

decade because they can be applied to a very wide range of applications. A

list of typical applications includes: random logic, integrating multiple

SPLDs, device controllers, communication encoding and filtering, small to

medium sized systems with SRAM blocks, and many more.

Other interesting applications of FPGAs are prototyping of designs

later to be implemented in gate arrays, and also emulation of entire large

hardware systems. The former of these applications might be possible

using only a single large FPGA (which corresponds to a small Gate Array

in terms of capacity), and the latter would entail many FPGAs connected

by some sort of interconnect; for emulation of hardware, QuickTurn

[Wolff90] (and others) has developed products that comprise many FPGAs

and the necessary software to partition and map circuits.

Another promising area for FPGA application, which is only

beginning to be developed, is the usage of FPGAs as custom computing

machines. This involves using the programmable parts to “execute”

software, rather than compiling the software for execution on a regular

CPU.

CHAPTER 7

ALTERA CYCLONE II DE2 KIT

7.1 LAYOUT AND COMPONENTS

A photograph of the DE2-70 board is shown in Figure 7.1. It depicts

the layout of the board and indicates the location of the connectors and key

components.

Figure 7.1 Altera DE2 kit

The DE2-70 board has many features that allow the user to

implement a wide range of designed circuits, from simple circuits to

various multimedia projects.

The following hardware is provided on the DE2-70 board:

Altera Cyclone® II 2C70 FPGA device

Altera Serial Configuration device - EPCS16

USB Blaster (on board) for programming and user API control; both

JTAG and Active Serial (AS) programming modes are supported

2-Mbyte SSRAM

Two 32-Mbyte SDRAM

8-Mbyte Flash memory

SD Card socket

4 pushbutton switches

18 toggle switches

18 red user LEDs

9 green user LEDs

50-MHz oscillator and 28.63-MHz oscillator for clock sources

24-bit CD-quality audio CODEC with line-in, line-out, and

microphone-in jacks

VGA DAC (10-bit high-speed triple DACs) with VGA-out connector

2 TV decoders (NTSC/PAL/SECAM) and TV-in connector

10/100 Ethernet Controller with a connector

USB Host/Slave Controller with USB type A and type B connectors

RS-232 transceiver and 9-pin connector

PS/2 mouse/keyboard connector

IrDA transceiver

1 SMA connector

Two 40-pin Expansion Headers with diode protection

In addition to these hardware features, the DE2-70 board has

software support for standard I/O interfaces and a control panel facility for

accessing various components. Also, software is provided for a number of

demonstrations that illustrate the advanced capabilities of the DE2-70

board.

In order to use the DE2-70 board, the user has to be familiar with the

Quartus II software. The necessary knowledge can be acquired by reading

the tutorials Getting Started with Altera’s DE2-70 Board and Quartus II

Introduction (which exists in three versions based on the design entry

method used, namely Verilog, VHDL or schematic entry). These tutorials

are provided in the directory DE2_70_tutorials on the DE2-70 System CD-ROM

that accompanies the DE2-70 board and can also be found on

Altera’s DE2-70 web pages.

7.2 BLOCK DIAGRAM OF THE DE2-70 BOARD

Figure 7.2.1 gives the block diagram of the DE2-70 board. To provide

maximum flexibility for the user, all connections are made through the

Cyclone II FPGA device. Thus, the user can configure the FPGA to

implement any system design.

Following is more detailed information about the blocks in Figure 7.2.1:

7.2.1 Cyclone II 2C70 FPGA

68,416 LEs.

250 M4K RAM blocks.

1,152,000 total RAM bits.

150 embedded multipliers.

4 PLLs.

622 user I/O pins.

FineLine BGA 896-pin package.

Figure 7.2.1. Block Diagram Of The DE2-70 Board.

7.2.2 Serial Configuration Device And USB Blaster Circuit

Altera’s EPCS16 Serial Configuration device.

On-board USB Blaster for programming and user API control.

JTAG and AS programming modes are supported.

7.2.3 SSRAM

2-Mbyte standard synchronous SRAM.

Organized as 512K x 36 bits.

Accessible as memory for the Nios II processor and by the DE2-70 Control Panel.

7.2.4 SDRAM

Two 32-Mbyte Single Data Rate Synchronous Dynamic RAM

memory chips.

Organized as 4M x 16 bits x 4 banks.

Accessible as memory for the Nios II processor and by the DE2-70

Control Panel.
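As a quick check of the SDRAM figures above, the organization 4M x 16 bits x 4 banks can be converted to bytes. The helper name capacity_bytes is hypothetical, and "M" is taken to mean 2^20 words.

```c
#include <stdint.h>

/* Capacity in bytes of a memory organized as words x bits x banks. */
uint64_t capacity_bytes(uint64_t words, uint64_t bits_per_word,
                        uint64_t banks)
{
    return words * bits_per_word * banks / 8u;
}
```

For one chip, capacity_bytes(4 * 1024 * 1024, 16, 4) gives 33,554,432 bytes, i.e. 32 Mbyte, so the two chips together provide 64 Mbyte.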

7.2.5 Flash Memory

8-Mbyte NOR Flash memory.

Supports both byte and word mode access.

Accessible as memory for the Nios II processor and by the DE2-70

Control Panel.

7.2.6 SD Card Socket

Provides SPI and 1-bit SD mode for SD Card access.

Accessible as memory for the Nios II processor with the DE2-70 SD

Card Driver.

7.2.7 Pushbutton Switches

4 pushbutton switches.

Debounced by a Schmitt trigger circuit.

Normally high; generates one active-low pulse when the switch is

pressed.

7.2.8 Toggle Switches

18 toggle switches for user inputs.

A switch causes logic 0 when in the down (closest to the edge of the

DE2-70 board) position and logic 1 when in the up position.

7.2.9 Clock Inputs

50-MHz oscillator.

28.63-MHz oscillator.

SMA external clock input.

7.2.10 Audio CODEC

Wolfson WM8731 24-bit sigma-delta audio CODEC.

Line-level input, line-level output, and microphone input jacks.

Sampling frequency: 8 to 96 kHz.

Applications for MP3 players and recorders, PDAs, smart phones,

voice recorders, etc.

7.2.11 VGA Output

Uses the ADV7123 240-MHz triple 10-bit high-speed video DAC.

With 15-pin high-density D-sub connector.

Supports up to 1600 x 1200 at 100-Hz refresh rate.

7.2.12 NTSC/PAL/SECAM TV Decoder Circuit

Uses two ADV7180 Multi-format SDTV Video Decoders.

Supports worldwide NTSC/PAL/SECAM color demodulation.

One 10-bit ADC, 4X over-sampling for CVBS.

Supports Composite Video (CVBS) RCA jack input.

Supports digital output formats: 8-bit ITU-R BT.656 YCrCb 4:2:2

output + HS, VS, and FIELD.

Applications: DVD recorders, LCD TV, Set-top boxes, Digital TV,

Portable video devices, and TV PIP (picture in picture) display.

7.2.13 10/100 Ethernet Controller

Integrated MAC and PHY with a general processor interface.

Supports 100Base-T and 10Base-T applications.

Supports full-duplex operation at 10 Mb/s and 100 Mb/s, with auto-

MDIX.

Fully compliant with the IEEE 802.3u Specification.

Supports IP/TCP/UDP checksum generation and checking.

Supports back-pressure mode for half-duplex mode flow control.

7.2.14 USB Host/Slave Controller

Complies fully with Universal Serial Bus Specification Rev. 2.0.

Supports data transfer at full-speed and low-speed.

Supports both USB host and device.

Two USB ports (one type A for a host and one type B for a device).

Provides a high-speed parallel interface to most available processors;

supports Nios II with a Terasic driver.

7.2.15 Serial Ports

One RS-232 port.

One PS/2 port.

DB-9 serial connector for the RS-232 port.

PS/2 connector for connecting a PS/2 mouse or keyboard to the DE2-70

board.

7.2.16 IrDA Transceiver

Contains a 115.2-kb/s infrared transceiver.

32 mA LED drive current.

Integrated EMI shield.

IEC825-1 Class 1 eye safe.

Edge detection input.

7.2.17 Two 40-pin Expansion Headers

72 Cyclone II I/O pins, as well as 8 power and ground lines, are

brought out to two 40-pin expansion connectors.

40-pin header is designed to accept a standard 40-pin ribbon cable

used for IDE hard drives.

Diode and resistor protection is provided.

CHAPTER 8

EXPERIMENTAL RESULTS

The machine is trained three times with the word to be recognized. The word “help”

is then recognized 90.9% of the time, whereas “held” is correctly rejected (100%

correct) when the first person speaks. However, these percentages are respectively 45.5%

and 0% when the second person speaks. If, during the training phase, the first person inputs two

words and the second person one, their recognition rates (when

saying “help”) become 72.7% and 45.5% respectively. When saying “held”, the machine

correctly assesses that they are not saying “help” in all cases. This data

was collected by saying “help” 11 times and “held” two times.

Word | Verdict | Correct? | Correctness

help | Same | Yes | 90.9 %

help | Different | No |

help | Same | Yes |

help | Same | Yes |

help | Same | Yes |

help | Same | Yes |

help | Same | Yes |

help | Same | Yes |

help | Same | Yes |

help | Same | Yes |

help | Same | Yes |

held | Different | Yes | 100 %

held | Different | Yes |

Overall correctness (both words): 92.3 %

Table 8 Experimental Results
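The percentages in Table 8 follow directly from the trial counts: 10 correct out of 11 for “help” and 2 out of 2 for “held”. The 92.3 % figure appears to be the overall rate, 12 correct out of 13 trials; that reading is our interpretation of the table. A small helper (hypothetical name percent_correct) reproduces the arithmetic:

```c
/* Percentage of correct verdicts; returns 0 when there are no trials. */
double percent_correct(unsigned correct, unsigned total)
{
    return total == 0 ? 0.0 : 100.0 * (double)correct / (double)total;
}
```

For example, percent_correct(10, 11) is approximately 90.9 and percent_correct(12, 13) is approximately 92.3. With only 13 trials, each single verdict shifts the overall figure by almost 8 percentage points, which is one more reason to read these results cautiously.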

This indicates that the training works properly, because the

first person's correctness decreases when his participation in

the training decreases (from three times to two). On the other hand, the second

person's correctness increases when he participates in the training.

Since the fingerprints are analyzed in the time domain, the system is

much more sensitive to the speed, the intonation and the surrounding noise

when a word is input. Thus, the above results should be taken with

caution, because the words were spoken very close to the microphone

and in a somewhat similar way each time. Nonetheless, the results seem

conclusive. Thus, despite a potential lack of accuracy, the machine is

functional.

CHAPTER 9

ADVANTAGES

The user can talk and write freely. The system understands,

analyzes and creates all the elements that are presented.

The user leads and controls the dialogue. He or she can interact

by canceling or substituting previous functions and sentences.

The technology understands, analyzes and creates all the

elements representation using a grammatical analysis strategy,

assuring the right interpretation and management of all the

semantic capacity of natural language.

The platform offers real-time interaction in massive

environments using acute memory management strategies.

The solution completely adapts to the user profile and previous

history dialogues with the aim to customize all interactive

processes.

CHAPTER 10

APPLICATIONS

Interactive voice response system (IVRS).

Voice-dialing in mobile phones and telephones.

Hands-free dialing in wireless Bluetooth headsets.

PIN and numeric password entry modules.

Value added service (VAS) providers.

Automated teller machines (ATMs).

If we increased the system’s vocabulary using phoneme-based recognition for a particular language, e.g., English, the system could be used to replace standard input devices such as keyboards, touch pads, etc. in IT- and electronics-based applications in various fields. The design has a wide market opportunity ranging from mobile service providers to ATM makers. Some of the targeted users for the system include:

Mobile operators.

Home and office security device providers.

ATM manufacturers.

Mobile phone and Bluetooth headset manufacturers.

Telephone service providers.

Manufacturers of instruments for disabled persons.

PC users.

CHAPTER 11

FUTURE IMPROVEMENTS

Given more time, we would have liked to implement a more robust

system with a larger dictionary of words. Also, using the mel scale

followed by a DCT is a relatively weak approach to the speech recognition

problem. Algorithms built on Hidden Markov Models are preferred instead; these are

predominantly statistical techniques

that treat a speech signal as a piecewise stationary signal over a window of

10 ms. Using these algorithms, we could enhance the

accuracy of the system and make it more robust.

CHAPTER 12

CONCLUSION

After applying the background theory and scripting a VLSI prototype, we conclude that a

speech recognition system can indeed be successfully implemented

using FPGA technology. The experimental and theoretical results show that the

algorithm is accurate and fast enough for consumer product applications.

Despite only partial hardware implementation due to technical difficulties,

the system remains functional.

Besides producing a full implementation (by including an FFT

module and thus being able to analyze words in the frequency spectrum),

other improvements can be made to the system. For instance, allowing the

use of a variable length for the input sounds would drastically improve its

performance on very short or very long words. Also, adding support for

training several words would be rather simple and would increase the

system flexibility.
