
Speech Recognition of Highly Inflective Languages

BARTOSZ ZIÓŁKO

Ph.D. Thesis

This thesis is submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy.

Artificial Intelligence Group
Pattern Recognition and Computer Vision Group
Department of Computer Science
University of York
United Kingdom

2008


Abstract

This PhD thesis combines various topics in speech recognition. There are two central hypotheses. The first is that it is useful to incorporate phoneme segmentation information in speech recognition, and that this can be achieved by applying the discrete wavelet transform. The second is that adding semantics to language models for speech recognition improves recognition accuracy.

The research starts by analysing the differences between English and Polish from a speech recognition point of view. English is a very typical positional language, while Polish is highly inflective. Part of the research focuses on the aspects of well-known solutions for English that should be changed, because of these linguistic differences, to improve recognition of Polish: mainly phoneme segmentation and semantic analysis. Phoneme statistics for Polish were gathered by the author, and a toolkit designed for English was applied to Polish.

Phoneme segmentation is more likely to be successful in Polish than in English because phonemes are easier to distinguish. A method based on the discrete wavelet transform was designed and tested by the PhD candidate.

Another part of the research focuses on finding new ways of modelling a natural language. Semantic analysis is crucial for Polish because syntax models are not very effective and are difficult to train, due to the non-positionality of Polish. This part of the thesis describes an unsuccessful approach that used part-of-speech taggers for language modelling in speech recognition, and a much better bag-of-words model. The latter is inspired by the well-known latent semantic analysis. It is, however, easier to train and does not require calculations on big matrices. The difference lies in a completely new approach to smoothing information in a word-topic matrix. Because of the morphological nature of the Polish language, this method captures not only semantic content but also some grammatical structure.


Contents

1 Introduction
1.1 Contribution
1.2 Thesis Overview
1.2.1 Introduction and Literature Review
1.2.2 Linguistic Aspects of Highly Inflective Languages Using Polish as an Example
1.2.3 Phoneme Segmentation and Acoustic Models
1.2.4 Language Modelling
2 Literature Review
2.1 History of Speech Recognition
2.2 Linguistic Rudiments of Speech Analysis
2.3 Speech Processing
2.3.1 Spectrum
2.4 Speech Segmentation
2.5 Phoneme Segmentation
2.6 Speech Parametrisation
2.6.1 Parametrisation Methods Based on Linear Prediction Coefficients
2.6.2 Parametrisation Methods Based on Filter Banks
2.6.3 Test Corpora and Baselines
2.6.4 Comparison of the Methods
2.7 Speech Modelling
2.8 Natural Language Modelling
2.9 Semantic Modelling
2.10 Academic Applications
3 Linguistic Aspects of Polish
3.1 Analysis of Polish from the Speech Recognition Point of View
3.2 Triphone Statistics of Polish Language
3.3 Description of a Problem Solution
3.4 Methods, Software and Hardware
3.4.1 Grapheme to Phoneme Transcription
3.4.2 Corpora Used
3.4.3 Results
3.5 Analysis of Phonetic Similarities in Wrong Recognitions of the Polish Language
3.6 Experimental Results on Applying HTK to Polish
3.7 Conclusion
4 Phoneme Segmentation
4.1 Analysis Using the Discrete Wavelet Transform
4.2 General Description of the Segmentation Method
4.3 Phoneme Detection Algorithm
4.4 Fuzzy Sets for Recall and Precision
4.5 Algorithm of Speech Segmentation Evaluation
4.6 Comparison to Other Evaluation Methods
4.7 Experimental Results of DWT Segmentation Method
4.8 Evaluation for Different Types of Phoneme Transitions
4.9 LogitBoost WEKA Classifier Speech Segmentation
4.10 Experimental Results for LogitBoost
4.11 Conclusion
5 Language Models
5.1 POS Tagging
5.2 Applying POS Taggers for Language Modelling in Speech Recognition
5.3 Experimental Results of Applying POS Tags in ASR
5.4 Bag-of-words Modelling
5.5 Experimental Setup
5.6 Training Algorithm
5.7 Process of Finding the Most Similar Topics
5.8 Example in English
5.9 Recognition Using Bag-of-words Model
5.10 Preliminary Experiment
5.11 K-means On-line Clustering
5.12 Experiment on Parliament Transcripts
5.13 Preprocessing of Training Corpora
5.14 Experiment with Literature Training Corpus
5.15 Word Prediction Model and Evaluation with Perplexity
5.16 Conclusion
6 Conclusions and Future Research
Appendices
List of References


List of Tables

2.1 Phoneme transcription in English - BEEP dictionary
2.2 Phoneme transcription in Polish - SAMPA
2.3 Comparison of the efficiency of the described methods. Asterisks mark methods appended to baselines (they could be used with most of the other methods). The methods without asterisks are new sets of features, different from the baselines
2.4 Speech recognition applications available on the Internet
3.1 Phonemes in Polish (SAMPA; Demenko et al. (2003))
3.2 Most common Polish diphones
3.3 Most common Polish triphones
3.4 Word recognition correctness for different speakers (the model was trained on adult male speakers only)
3.5 Errors in different types of utterances (for all speakers)
3.6 Errors in sentences (speakers AK1C1 and AK2C1 respectively)
3.7 Errors in digits
3.8 Errors in the most often wrongly recognised names and commands
3.9 Errors in the most often wrongly recognised names and commands (2nd part)
3.10 Names which appeared most commonly as wrong recognitions in the above statistics
3.11 Errors in the pronounced alphabet
4.1 Characteristics of the discrete wavelet transform levels and their envelopes
4.2 Types of events associated with a phoneme boundary. Mathematical conditions are based on the power envelope p_m^en(n), rate-of-change information r_m(n), a threshold p of the distance between r_m(n) and p_m^en(n), a threshold p_min of minimal p_m^en(n), and β = 1. Values in the last four columns are for different DWT levels (the first for level d1, the second for level d2, the third for levels d3 to d5 and the last for level d6)
4.3 Comparison of fuzzy recall and precision with commonly used methods based on insertions and deletions for an exemplar word
4.4 Comparison of the proposed method using different wavelets
4.5 Comparison of some other segmentation strategies and the proposed method
4.6 Recall for different types of phoneme transitions
4.7 Precision for different types of phoneme transitions
4.8 F-score for different types of phoneme transitions. Scores above 0.5 are bolded
4.9 Experimental results for the LogitBoost classifier. The rows labelled boundary are for classifying segments representing boundaries; the rows labelled phoneme present grades for classifying segments inside phonemes which are not boundaries. From a practical point of view the boundary labels are the important ones; the grades for phoneme labels are given for reference
5.1 Results of applying the POS tagger to language modelling. First a sentence in Polish is given, then the position of the correct recognition in the 10-best list, followed by a description of the tagger grade for the correct recognition
5.2 Results of applying the POS tagger to language modelling. First a sentence in Polish is given, then the position of the correct recognition in the 10-best list, followed by a description of the tagger grade for the correct recognition (2nd part)
5.3 Results of applying the POS tagger on its training corpus. The first version of each sentence is the correct one, the second is a recognition using just HTK and the third uses HTK and POS tagging; the numbers of differences from the correct sentence were counted and summarised
5.4 Matrix S for the example with 4 topics and a row of S' for topic 3
5.5 Matrix D for the presented example
5.6 Experimental results for the pure HTK audio model, the audio model with LSA and the audio model with our bag-of-words model
5.7 The 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations
5.8 The 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations (2nd part)
5.9 The 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations (3rd part)
5.10 SED script for text preprocessing
5.11 Experimental results for the pure HTK audio model, the audio model with LSA and the audio model with our bag-of-words model trained on literature
5.12 Experimental results for the pure HTK audio model, the audio model with LSA and the audio model with our bag-of-words model trained on the enlarged literature corpus
5.13 Text corpora


List of Figures

2.1 Toy dog Rex - the first working speech recognition system (USA, 1920)
2.2 Scheme of a speech recognition system
2.3 Typical current services offered by call centres with ASR (above) and their future (below)
2.4 Speech audibility and average human hearing band (Tadeusiewicz, 1988)
2.5 The example of Fourier spectrum amplitude
2.6 Frequency spectrum of speech in a linear and a non-linear scale
2.7 The cepstrum is the Fourier transform of the log of the power spectrum
2.8 The types of speech segmentation
2.9 Comparison of the frames produced by constant segmentation and phoneme segmentation
2.10 The list of speech feature extraction method types, grouped in two avenues: based on linear prediction coefficients (with PLP as the main one) and on filter bank analysis (with MFCC as the main one)
2.11 fMPE transformation matrix from the original low-dimensional feature vector into a high-dimensional one
2.12 Mel frequency cepstrum coefficients
3.1 Phonemes in Polish in the SAMPA alphabet
3.2 Frequency of diphones in Polish (each phoneme separately)
3.3 Space of triphones in Polish
3.4 Phoneme occurrences distribution
4.1 The wavelet transform outperforms the STFT because it has higher resolution for higher frequencies
4.2 The discrete Meyer wavelet - dmey
4.3 Subband amplitude DWT spectra of the Polish word 'osiem' (Eng. 'eight'). The number of samples depends on the resolution level
4.4 Segmentation of the Polish word 'osiem' (Eng. 'eight') based on DWT sub-bands. Dotted lines are hand segmentation boundaries, dashed lines are automatic segmentation boundaries, bold lines are envelopes and thin lines are smoothed rate-of-change
4.5 The event function versus time in ms for the word presented in Fig. 4.4. High event scores mean that a phoneme boundary is more likely
4.6 Simple examples of the four events described in Table 4.2. They are characteristic of phoneme boundaries. The images present the power envelope p_m^en(n) and rate-of-change information (derivative) r_m(n)
4.7 The general scheme of set G with correct boundaries and set A with detected ones. Elements of set A have a grade f(x) standing for the probability of being a correct boundary. In set G there can be elements which were not detected (in the left part of the set)
4.8 An example of phoneme segmentation of a single word. In the lower part, hand segmentation is drawn; boundaries are represented by two indexes close to each other (sometimes overlapping). The upper columns present an example of segmentation of the word done by a segmentation algorithm. All of the calculated boundaries are quite accurate but never perfect
4.9 Fuzzy membership
4.10 F-score of phoneme boundary detection for transitions between several types of phonemes. Phoneme types 1-10 are explained in section 4.8 (1 - stops, 2 - nasal consonants, etc.)
5.1 Histogram of POS tagger probabilities for hypotheses which are correct recognitions
5.2 Histogram of POS tagger probabilities for hypotheses which are wrong recognitions
5.3 Ratio of correct recognitions to all recognitions for different probabilities from the POS tagger
5.4 Undirected, complete graph illustrating similarities between sentences
5.5 Histogram of probabilities received from the bag-of-words model for hypotheses which are correct recognitions
5.6 Histogram of probabilities received from the bag-of-words model for hypotheses which are wrong recognitions
5.7 Ratio of correct recognitions to all of them for different probabilities received from the bag-of-words model


Acknowledgments

I would like to begin by thanking my parents, not only for their unstinting support throughout my educational career, but also for encouraging me to pursue my PhD.

I feel very lucky that I had two supervisors to guide me through my research. I would like to thank Dr Suresh Manandhar and Dr Richard C. Wilson for their continued support, advice and constructive feedback. I am glad that we published many papers together and that, thanks to them, I participated in several conferences. They were not only teachers but also good friends, helping me in my life in a new country, which was often less surprising and easier to understand because of them.

Appreciation goes to my assessor Dr Adrian Bors for his regular feedback on the progress of my research.

I had the privilege of meeting many interesting people in the department. This provided an excellent environment for inventing my methods and algorithms. I am grateful for all the seminars and the many small discussions in the corridors of our department. In particular, thanks go to Thimal Jasooriya for sitting in front of me for three long years and patiently answering all questions like 'Hey, how do you do this in LaTeX?' or 'Where is room 103?'. I also appreciate Ioannis Klapaftis's help regarding grammar parsers and collocation graphs. I would like to thank Pierre Andrews for improving my knowledge not only of NLP but also of photography. Many thanks to Marcelo Romero Huertas. And finally, I am very glad that I met Marek Grzes, with whom I had so many exciting conversations about travels all over the world and who was a strong support for me on days when I had personal problems. Many thanks too to all the other members of the department I have met during my studies.

My PhD would not have been completed without the help of many people outside the department. I would like to thank Professor Zdzisław Brzeźniak for our mathematical discussions over coffee. Appreciation goes to Professor Grażyna Demenko for providing the PolPhone software and to Dr Stefan Grocholewski for CORPORA. I would like to thank Dr Adam Przepiórkowski and Dr Maciej Piasecki for their help in the part of the research about POS taggers. I am also very glad of my close cooperation with Jakub Gałka in our research. Finally, many thanks to my father, Professor Mariusz Ziółko, for much useful feedback about my research papers and this thesis.


List of the candidate's publications. Parts of some of them were used in this thesis.

Conferences:

• M.P. Sellars, G.E. Athanasiadou, B. Ziółko, S.D. Greaves, A. Hopper, Simulation of Broadband FWA Networks in High-rise Cities with Linear Antenna Polarisation, The 14th IEEE 2003 International Symposium on Personal, Indoor and Mobile Radio Communications Proceedings - PIMRC, pp. 371-5, Beijing, China, 2003.

• M. Ziółko, P. Sypka, B. Ziółko, Compression of Transmultiplexed Acoustic Signals, Proceedings of The 2004 International TICSP Workshop on Spectral and Multirate Signal Processing, pp. 81-6, Vienna, 2004.

• B. Ziółko, M. Ziółko, M. Nowak, P. Sypka, A suggestion of multiple-access method for 4G system, Proceedings of the 47th International Symposium ELMAR-2005, pp. 327-30, Zadar, Croatia, 2005.

• M. Ziółko, B. Ziółko, A. Dziech, Transcription as a Speech Compression Method in Transmultiplexer System, 5th WSEAS International Conference on Multimedia, Internet and Video Technologies, Corfu, Greece, 2005.

• B. Ziółko, M. Ziółko, M. Nowak, Design of Integer Filters for Transmultiplexer Perfect Reconstruction, Proceedings of the 13th European Signal Processing Conference EUSIPCO, Antalya, Turkey, 2005.

• M. Ziółko, M. Nowak, B. Ziółko, Transmultiplexer Integer-to-Integer Filter Banks, Proceedings of The First IFIP International Conference in Central Asia on Internet, The Next Generation of Mobile, Wireless and Optical Communications Networks, Bishkek, Kyrgyzstan, 2005.

• P. Sypka, B. Ziółko, M. Ziółko, Integer-to-Integer Filters in Image Transmultiplexers, Proceedings of the 2006 Second International Symposium on Communications, Control and Signal Processing, ISCCSP, Marrakech, Morocco, 2006.

• P. Sypka, M. Ziółko and B. Ziółko, Lossy Compression Approach to Transmultiplexed Images, 48th International Symposium ELMAR-2006, Zadar, Croatia, 2006.

• B. Ziółko, S. Manandhar, R.C. Wilson, Phoneme segmentation of speech, Proceedings of ICPR 2006, Hong Kong, 2006.

• B. Ziółko, S. Manandhar, R.C. Wilson, M. Ziółko, Wavelet method of speech segmentation, Proceedings of EUSIPCO 2006, Florence, Italy, 2006.

• P. Sypka, M. Ziółko, B. Ziółko, Robustness of Transmultiplexed Images, International Conference Mixed Design of Integrated Circuits and Systems MIXDES, Gdynia, 2006.

• B. Ziółko, J. Gałka, S. Manandhar, R.C. Wilson, M. Ziółko, The use of statistics of Polish phonemes in speech recognition, Speech Signal Annotation, Processing and Synthesis, Poznan, 2006.

• P. Sypka, M. Ziółko and B. Ziółko, Lossless JPEG-Base Compression of Transmultiplexed Images, Proceedings of the 12th Digital Signal Processing Workshop, pp. 531-534, Wyoming, 2006.

• M. Ziółko, P. Sypka, B. Ziółko, Application of 1-D Transmultiplexer to Images Transmission, Proceedings of the 32nd Annual Conference of the IEEE Industrial Electronics Society IECON, pp. 3564-3567, Paris, France, 2006.

• M. Kotti, C. Kotropoulos, B. Ziółko, I. Pitas, V. Moschou, A Framework for Dialogue Detection in Movies, Proceedings of the Multimedia Content Representation, Classification and Security International Workshop, MRCS, Lecture Notes in Computer Science, vol. 4105, pp. 371-378, Istanbul, Turkey, 2006.

• P. Sypka, M. Ziółko, B. Ziółko, Approach of JPEG2000 Compression Standard to Transmultiplexed Images, Proceedings of Visualization, Imaging, and Image Processing, VIIP, Palma de Mallorca, Spain, 2006.

• B. Ziółko, J. Gałka, S. Manandhar, R.C. Wilson, M. Ziółko, Triphone Statistics for Polish Language, Proceedings of the 3rd Language and Technology Conference, Poznan, Poland, 2007.

• B. Ziółko, S. Manandhar, R.C. Wilson, Fuzzy Recall and Precision for Speech Segmentation Evaluation, Proceedings of the 3rd Language and Technology Conference, Poznan, Poland, 2007.

• B. Ziółko, S. Manandhar, R.C. Wilson, M. Ziółko, LogitBoost Weka Classifier Speech Segmentation, Proceedings of the 2008 IEEE International Conference on Multimedia and Expo, Hannover, Germany, 2008.

• B. Ziółko, S. Manandhar, R.C. Wilson, M. Ziółko, Language Model Based on POS Tagger, Proceedings of SIGMAP 2008, the International Conference on Signal Processing and Multimedia Applications, Porto, Portugal, 2008.

• B. Ziółko, S. Manandhar, R.C. Wilson, M. Ziółko, J. Gałka, Application of HTK to the Polish Language, Proceedings of the IEEE International Conference on Audio, Language and Image Processing, Shanghai, 2008.

• B. Ziółko, S. Manandhar, R.C. Wilson, M. Ziółko, Semantic Modelling for Speech Recognition, Proceedings of Speech Analysis, Synthesis and Recognition: Applications in Systems for Homeland Security, Piechowice, Poland, 2008.

• B. Ziółko, S. Manandhar, R.C. Wilson, Bag-of-words Modelling for Speech Recognition, Proceedings of the International Conference on Future Computer and Communication, Kuala Lumpur, Malaysia, 2009.

• B. Ziółko, M. Ziółko, Linguistic Calculations on Cyfronet High Performance Computers, Proceedings of the Conference of the High Performance Computers' Users, Zakopane, Poland, 2009.

• B. Ziółko, J. Gałka, M. Ziółko, Phone, diphone and triphone statistics for Polish language, Proceedings of SPECOM 2009, St. Petersburg, Russia, 2009.

• B. Ziółko, J. Gałka, M. Ziółko, Phoneme ngrams based on a Polish newspaper corpus, Proceedings of WORLDCOMP'09, Las Vegas, USA, 2009.

• B. Ziółko, J. Gałka, M. Ziółko, Phonetic statistics from an Internet articles corpus of Polish language, Proceedings of Intelligent Information Systems, Krakow, Poland, 2009.

Journals:

• M.P. Sellars, G.E. Athanasiadou, B. Ziółko, S.D. Greaves, Opposite-sector uplink interference in broadband FWA networks in high-rise cities, The IEE Electronics Letters, vol. 40, no. 17, pp. 1070-1, 2004.

• M. Ziółko, A. Dziech, R. Baran, P. Sypka, B. Ziółko, Transmultiplexing System for Compression of Selected Signals, WSEAS Transactions on Communications, issue 12, vol. 4, pp. 1427-1434, December 2005.

• M. Dyrek, J. Gałka and B. Ziółko, Measures on Wavelet Segmentation of Speech, International Journal of Circuits, Systems and Signal Processing, NAUN, 2008.

• J. Gałka and B. Ziółko, Study of Performance Evaluation Methods for Non-Uniform Speech Segmentation, International Journal of Circuits, Systems and Signal Processing, NAUN, 2008.


List of Abbreviations

AMI - Augmented Multi-party Interaction

ANN - Artificial Neural Network

ASR - Automatic Speech Recognition

BEEP - British English Phonemic Transcription Dictionary

CML - Conditional Maximum Likelihood

CMU - Carnegie Mellon University

CUED - Cambridge University Engineering Department

DARPA - Defence Advanced Research Projects Agency

DBNs - Dynamic Bayesian Networks

DCT - Discrete Cosine Transform

DWT - Discrete Wavelet Transform

FBE - Filter Bank Energy

FFT - Fast Fourier Transform

fMPE - feature-space Minimum Phone Error

GSM - Global System for Mobile

HLDA - Heteroscedastic Linear Discriminant Analysis

HMM - Hidden Markov Model

HTK - Hidden Markov Model Toolkit

IIS - Improved Iterative Scaling

LFCCs - Linear Frequency Cepstrum Coefficients

LM - Language Models

LPCC - Linear Prediction Coefficients

LSA - Latent Semantic Analysis

MaxEnt - Maximum Entropy

MFCC - Mel Frequency Cepstrum Coefficients

MFMGDCCs - Mel Frequency Modified Group Delay Cepstral Coefficients

MFPSCCs - Mel Frequency Product Spectrum Cepstral Coefficients

MGDCCs - Modified Group Delay Cepstral Coefficients

MLLR - Maximum Likelihood Linear Regression

MMSE - Minimum Mean Square Error

MPE - Minimum Phone Error

PLP - Perceptual Linear Predictive

PMF - Probability Mass Function

POS - Part Of Speech

RASTA - Relative Spectral

RCs - Reflection Coefficients

SAT - Speaker Adaptive Training

SED - Stream Editor

SHLDA - Smoothed Heteroscedastic Linear Discriminant Analysis

SNR - Signal to Noise Ratio


SPLICE - Stereo-based Piecewise Linear Compensation for Environments

STFT - Short Time Fourier Transform

SVD - Singular Value Decomposition

TIMIT - Texas Instruments/Massachusetts Institute of Technology

VTLN - Vocal Tract Length Normalisation

WER - Word Error Rate

Declaration

This thesis has not previously been accepted in substance for any degree and is not being concurrently submitted in candidature for any degree other than Doctor of Philosophy of the University of York. This thesis is the result of my own investigations, except where otherwise stated. Other sources are acknowledged by explicit references.

I hereby give consent for my thesis, if accepted, to be made available for photocopying and for inter-library loan, and for the title and summary to be made available to outside organisations.


Chapter 1

Introduction

As information technology has an impact on more and more aspects of our lives with every year, the problem of communication between human beings and information processing devices becomes increasingly important. Up to now, such communication has been almost entirely through keyboards and screens, but speech is the most widely used, most natural and fastest means of communication for people. Moreover, mobile computing devices are becoming increasingly small. The lower limit on their size lies not in integrated circuit design but, simply, in what a human can operate with their fingers. There are also more and more hands-free computer systems, such as in-car systems. We must redefine traditional methods of human-computer and human-machine interaction. Unfortunately, machine capabilities for interpreting speech are still poor in comparison to what a human can achieve, even though we can predict that automatic speech recognition (ASR) will become a very pervasive technology (Alewine et al., 2004).

1.1 Contribution

An aim of our research was to improve the accuracy of speech recognition and to find the elements which might be especially effective in the ASR of highly inflective and non-positional languages like Polish. English is a very different language in some respects, and some of these differences affect speech recognition systems. The part-of-speech (POS) structure is much more regular in English than in Polish, which means it is much more predictable. A word can change its POS role depending on its position; for example, we understand a noun located to the left of another noun as an adjective. In Polish such a change is marked by morphology rather than position. English has many reduced forms, including pronouncing many vowels weakly as /ə/ and skipping several letters in longer words. There are also some Polish phonemes which do not exist in English, and the other way around. As this is a wide field, research was conducted on chosen elements.

As a part of our research we carried out practical linguistic studies on the differences between Polish and English. Phonetic statistics for Polish were collected and analysed. These statistics helped in further work. Among other things, a hidden Markov model toolkit (HTK) system for Polish was trained and tested. The model we created was trained from real data for all biphones in Polish and, by HTK scripts, for all triphones in a synthesised way, using the statistics that we collected. The system can be adapted to any vocabulary; however, it does not work efficiently for large vocabulary tasks.

One of the possible improvements in ASR lies in detecting phoneme boundaries. This information is typically ignored in existing solutions: speech is usually analysed in frames of constant length, whereas analysing separate phonemes would be much more accurate. One can quite easily locate phoneme boundaries by observing spectrograms or discrete wavelet transform (DWT) spectra of speech; however, it is very difficult to give an exact algorithm to find them. Constant segmentation benefits from simplicity of implementation and the simple comparison of blocks of the same length. However, it is perceptually unnatural: human phonetic categorisation is very poor for such short segments (Morgan et al., 2005), and phonemes have different lengths. Moreover, boundary effects introduce additional distortions, and framing creates more boundaries than phoneme segmentation. We have to consider these boundary effects, which can cause errors; obviously, a smaller number of boundaries means smaller errors due to these effects. Constant segmentation therefore risks losing information about the phonemes by merging different sounds into single blocks, losing phoneme length information and losing the complexity of individual phonemes. Phoneme duration can also be used as an additional parameter in speech recognition, improving the accuracy of the whole process (Stober and Hess, 1998).
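To make the contrast concrete, the following minimal sketch (in Python; the frame length, hop and boundary positions are arbitrary illustrations, not values from this thesis) shows the two ways of slicing a signal:

    import numpy as np

    def constant_frames(signal, frame_len=400, hop=160):
        """Slice a signal into fixed-length, overlapping frames (the usual approach)."""
        return [signal[i:i + frame_len]
                for i in range(0, len(signal) - frame_len + 1, hop)]

    def phoneme_frames(signal, boundaries):
        """Slice a signal at (known or detected) phoneme boundaries instead."""
        edges = [0] + list(boundaries) + [len(signal)]
        return [signal[a:b] for a, b in zip(edges[:-1], edges[1:])]

    # A one-second signal at 16 kHz with three hypothetical phoneme boundaries.
    x = np.random.randn(16000)
    print(len(constant_frames(x)))                                    # many equal-length frames
    print([len(f) for f in phoneme_frames(x, [2500, 7800, 12000])])   # few, variable-length segments

The constant scheme produces many equal-length frames regardless of content; the phoneme scheme produces a few variable-length segments, ideally one per sound.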

There has been very little interest in using POS tags in ASR, so we investigated their application. POS tag trigrams, a matrix grading possible neighbourhoods or a probabilistic tagger can be created and used to predict the word being recognised, based on the left context analysed by a POS tagger. Another innovation in speech recognition is based on semantic analysis as the very last step of the process. It can be applied as an additional measure for choosing a non-first hypothesis from an n-best list of audio model recognition hypotheses, if the first one does not fit the semantic content. It is not possible to recognise speech using acoustic information only. The human perception system is based upon catching context, structure and understanding, combined with recognition. It is much easier to recognise and repeat a heard sentence without any errors if it is in a language we understand than if it is in a language we are not familiar with. Language modelling can therefore improve recognition considerably.

We decided to focus on using information which has not been used, or not commonly used, in speech recognition until now. POS tags have rarely been applied, as English can be modelled efficiently using context-free grammars. In the case of Polish, it is very difficult to provide tree structures which represent all possible sentences, as the order of words can vary significantly. We expected that Polish could be modelled using POS tags, because some tags are much more probable than others in a given context. Unfortunately, experiments showed that POS information is too ambiguous to be used in the way we proposed.

Semantic analysis is generally very difficult, due to information sparsity problems. We believe this is why it has not been used very commonly in existing ASR systems: language models based on grammatical structure were quite efficient for English, so there was no necessity for semantic analysis. In the case of Polish, semantic information has to be included in a language model due to syntactic irregularities. A bag-of-words model was invented. It applies word-topic statistics to re-rank a list of hypotheses from lower-level models.
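The following is a minimal sketch of this re-ranking idea, under loose assumptions: the word-topic counts and the Polish words are invented for illustration, and the real model's training, smoothing and scoring are described in Chapter 5.

    from collections import Counter

    # Hypothetical word-topic counts; in the real model these are trained from a corpus.
    word_topic = {
        "sejm":   Counter(politics=40, sport=1),
        "mecz":   Counter(politics=1,  sport=30),
        "ustawa": Counter(politics=25, sport=0),
    }

    def topic_score(hypothesis, topic):
        """Sum word-topic counts over the words of one hypothesis (a bag of words)."""
        return sum(word_topic.get(w, Counter())[topic] for w in hypothesis.split())

    def rerank(nbest, topic):
        """Re-order an n-best list from a lower-level model by semantic fit to a topic."""
        return sorted(nbest, key=lambda h: topic_score(h, topic), reverse=True)

    nbest = ["mecz ustawa", "sejm ustawa"]   # hypotheses from the audio model
    print(rerank(nbest, "politics"))         # ['sejm ustawa', 'mecz ustawa']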


1.2 Thesis Overview

We investigated several new elements of ASR systems, with special interest in highly inflective and non-positional languages like Polish. This includes non-constant segmentation for acoustic modelling. We have analysed some aspects of Polish, as a representative of highly inflective languages, to choose the best approach to ASR for this language. Apart from this, we investigate introducing POS tagging and semantic information analysis into ASR systems.

1.2.1 Introduction and Literature Review

In the first chapter we will introduce the general aspects of the research areas that are involved in ASR. Specifically, we pay attention to previous work concerning signal processing methods like the DWT, speech segmentation and parametrisation, pattern recognition, language modelling (for example hidden Markov models (HMM) and n-grams) and natural language processing (NLP), mainly lexical semantics, POS tagging and latent semantic analysis (LSA). Some literature in linguistics, mathematical analysis, probability and information theory is also considered.

1.2.2 Linguistic Aspects of Highly Inflective Languages Using Polish as an Example

This chapter will focus on the linguistic background (Ostaszewska and Tambor, 2000) which is useful for ASR. Linguists have provided many basic assumptions in the methodology of recognising English. As we aim at creating an ASR system for Polish, a similar analysis should be done, because these two languages vary in some aspects. This chapter will summarise phonological knowledge about sounds in Polish, pronunciation rules and grammatical phenomena related to rich morphology.

A Polish text corpus was analysed to find information about phoneme statistics. We were especially interested in triphones, as they are commonly used in many speech processing applications like the HTK speech recogniser. An attempt to create the full list of triphones for the Polish language is presented. A vast amount of phonetically transcribed text was analysed to obtain the frequency of triphone occurrences. A distribution of the frequency of triphone occurrence and other phenomena are presented. The standard phonetic alphabet for Polish and methods of providing phonetic transcriptions are described as well. The ASR system for Polish based on HTK is described, with detailed analysis of the errors it committed.

1.2.3 Phoneme Segmentation and Acoustic Models

Speech has to be split into units to be analysed. The most common way is to use constant-length framing with overlapping. Phoneme segmentation is another approach, which may greatly improve acoustic models if phoneme boundaries are detected correctly. We will present our own segmentation method, an evaluation method and the way to apply them in ASR.

The localisation of phoneme boundaries is useful in several speech analysis tasks, and in particular for speech recognition. Here it enables the use of more accurate acoustic models, since the lengths of phonemes are known and more accurate information is provided for parametrisation. Our method compares the values of power envelopes and their first derivatives for six frequency subbands. Specific scenarios which are typical of phoneme boundaries are searched for. Discrete times with such events are noted and graded using a distribution-like event function. The final decision on the localisation of boundaries is taken by analysing the event function; boundaries are therefore extracted using information from all the subbands. The method was developed on a small set of hand-segmented Polish words and tested on another, large corpus containing 16425 utterances. A recall and precision measure specifically designed to measure the quality of speech segmentation was adapted by using fuzzy sets; with this, an f-score of 72.49% was obtained. A statistical classification method was also used to check which features are useful, and served as a baseline for the comparison of the new method.
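A minimal sketch of the subband idea follows; it assumes the PyWavelets library ('dmey' is the discrete Meyer wavelet, cf. Fig. 4.2), a simple moving-average power envelope, and placeholder window sizes and thresholds. It illustrates the approach rather than the tuned event-function algorithm of Chapter 4.

    import numpy as np
    import pywt

    def subband_events(signal, wavelet="dmey", levels=6, win=50, thresh=0.1):
        """Vote for candidate phoneme boundaries from DWT subband power envelopes.

        For each detail subband d1..d6, compute a smoothed power envelope and its
        rate of change; places where the rate of change is large get one vote.
        """
        coeffs = pywt.wavedec(signal, wavelet, level=levels)  # [a6, d6, d5, ..., d1]
        event = np.zeros(len(signal))
        for d in coeffs[1:]:
            env = np.convolve(d ** 2, np.ones(win) / win, mode="same")  # power envelope
            rate = np.abs(np.gradient(env))                             # rate of change
            # Upsample this subband's rate-of-change to the original time axis.
            votes = np.interp(np.linspace(0, 1, len(signal)),
                              np.linspace(0, 1, len(rate)), rate)
            event += votes > thresh * votes.max()                       # one vote per subband
        return event  # high values suggest a phoneme boundary

    x = np.random.randn(16000)                      # stand-in for one spoken word
    peaks = np.flatnonzero(subband_events(x) >= 4)  # samples where at least 4 subbands agree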

1.2.4 Language Modelling

Language models are necessary for any large vocabulary speech recogniser. There are two main types of information which can be used to support the modelling of a language: syntactic and semantic. One way to apply syntactic modelling is to use POS taggers. Morphological information can be statistically analysed to provide the probability of a sequence of words using their POS tags.

This chapter covers methods of POS tagging and the POS-tagged data available for Polish. We present our own method of applying taggers and POS tag statistics to ASR as a part of language modelling. Unfortunately, experiments showed that this type of modelling is not effective.

Semantic analysis can be done in many different ways and has already been applied in ASR. However, this kind of modelling is difficult due to the data sparsity problem. The literature always mentions semantic analysis as a necessary step in ASR, but it is very difficult to find research papers which report the exact impact of applying semantic methods on recognition. We investigate LSA and present our own method, which was shown in experiments to be more effective. The invented model differs from LSA in the way the word-topic matrix is smoothed. Our method trains a model faster than the widely known LSA and is more efficient.
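As an illustration of the syntactic side, here is a minimal sketch of grading a hypothesis by POS tag trigram statistics; the tag names and counts are invented for this example, and the actual tagger-based grading is described in Chapter 5.

    from collections import Counter

    def trigrams(tags):
        """All consecutive tag triples of a tag sequence."""
        return zip(tags, tags[1:], tags[2:])

    # Hypothetical tag-trigram counts harvested from a POS-tagged corpus.
    trigram_counts = Counter({("adj", "noun", "verb"): 120,
                              ("noun", "adj", "verb"): 95,
                              ("verb", "verb", "verb"): 1})

    def tag_sequence_score(tags, alpha=1):
        """Relative plausibility of a tag sequence: summed, add-alpha smoothed counts."""
        return sum(trigram_counts[t] + alpha for t in trigrams(tags))

    # Grade the tag sequences of two recognition hypotheses.
    print(tag_sequence_score(["adj", "noun", "verb"]))   # plausible sequence
    print(tag_sequence_score(["verb", "verb", "verb"]))  # implausible sequence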


Chapter 2

Literature Review

This chapter presents the history of research on speech recognition and some details of more up-to-date publications. ASR is a very wide area, so only a selection of topics from this field is presented, namely those studied during the author's PhD.

2.1 History of Speech Recognition

To begin with, we should define what an ASR system is. Because of the variety of applied methods and approaches, it is difficult to define it by describing how it works. It is better to say that an ASR system is software which changes an acoustic signal into a sequence of symbols: speech is the input, while the sequence of written words is the output. Obviously this definition covers a vast area of applications. We can distinguish systems trained for a given user only from those which are speaker independent. A system can be dedicated to continuous speech or to discrete word recognition. Some applications assume that speech is clear (or rather clear enough), while some are designed to work in a factory or at an airport, where noise is a crucial issue. Finally, the size of the vocabulary is a feature of a system. There are quite different approaches for speech recognition with a small, limited vocabulary and with a large vocabulary (especially with an unlimited dictionary).

To give a proper background, we would like to set speech recognition research in time. The invention of the phonograph in 1870 by Alexander Graham Bell can be considered the very first step towards creating an ASR system. More precisely, the phonograph was the first audio recording tool which transferred acoustic waves into electrical waves, allowing further processing. Another important milestone was set by the Swiss linguist Ferdinand de Saussure, who described general rules of linguistics, which were collected and printed by his students and colleagues after his death in 1916 (de Saussure, 1916). His ideas became the rudiments of modern linguistics and NLP. Then, quite surprisingly, we can speak about the first working ASR system in 1920. It was a celluloid toy dog developed by Walker Balke and National Company Inc., presented in Fig. 2.1. The dog was attached to the turntable of a phonograph and could jump out of its kennel when detecting its own name, 'Rex'. The mechanism was controlled by a resonant reed: in fact it detected the phoneme /e/ with a metal bar arranged to form a bridge, sensitive to acoustic energy at 500 Hz, which vibrated the bar, interrupting the current and releasing the dog.

Figure 2.1: Toy dog Rex - the first working speech recognition system (USA, 1920)

In 1952 Bell Labs created a digit recogniser (Davis et al., 1952). It was based on analysis of the spectrum divided into two frequency bands (above and below 900 Hz). It recognised digits with an error rate of less than 2%, provided the user did not change the position of the head relative to the microphone between training and testing. In the sixties there were two important inventions: the fast Fourier transform (FFT) (Tukey et al., 1963) and the HMM (Rabiner, 1989), which have had a crucial impact on current ASR systems. Growing interest in speech recognition resulted in the launch of the ARPA Speech Understanding Project in 1971. This ambitious and well-funded project ($15M) targeted connected word recognition with a vocabulary size of around 1000 words. It resulted in the CMU Harpy system (Lowerre, 1976) with 5% sentence error. Thanks to the project, the seventies were a time of rapid improvements in ASR. The Viterbi algorithm for model training was developed between 1967 and 1973 (Viterbi, 1967; Forney, 1973). In 1975, linear predictive coding, the first successful speech parameterisation method, was invented (Makhoul, 1975). Further research in speech recognition has a larger impact on this dissertation, so it will be described in more detail in the following sections.

The general scheme of ASR was created in the eighties and has survived until now with just small differences. All the most important steps are presented in Fig. 2.2, which is based on (Rabiner and Juang, 1993).

Figure 2.2: Scheme of a speech recognition system

Figure 2.3: Typical current services offered by call centres with ASR (above) and their future (below)

Our research is focused on segmentation and semantic analysis, so these will be described in detail. Some other topics are very closely connected, so they have also been described. Topics which are not crucial for our research have been skipped because of the limit on the thesis size. The first of them is the whole large field of pre-processing, including noise reduction, feature compensation and missing-feature approaches. There are too many papers about these topics to describe that step of speech recognition even succinctly; many of them are very well summarised in (Raj and Stern, 2005).

ASR can save around 60% of the time spent working with a computer, through automatic transcription and dictation rather than typing, as we are able to speak three times faster than we can type. Sophisticated ASR systems are becoming more important, as customer services need to be more friendly while the costs of running call centres need to be kept at a minimum level (Fig. 2.3). An ASR system may also introduce an incredibly efficient lossy compression scheme for communications, if recognition is seen as coding and speech synthesis as decoding.

2.2 Linguistic Rudiments of Speech Analysis

It is essential to understand the rudiments of the speech generation process in order to do research on digital speech analysis. Speech signals consist of sound sequences which we interpret as representations of information. Phonetics is the science which classifies these sounds. Most languages, including English and Polish, can be described in terms of a set of distinctive sounds - phonemes. Both languages consist of around 40 phonemes; however, some of them exist in English and do not exist in Polish, and the other way round. They are grouped into vowels and consonants (nasals, stops and fricatives). The British English phoneme transcription presented in Table 2.1 is based on the BEEP dictionary (Beep dictionary, 2000), which is commonly used by speech recognisers like HTK (Young et al., 2005). It contains 20 vowels and 24 consonants. Polish phoneme transcription is typically presented in SAMPA notation (Ostaszewska and Tambor, 2000), as in Table 2.2, with 37 or 39 phonemes.

Table 2.1: Phoneme transcription in English - BEEP dictionary

    transcription  example   transcription  example
    aa   odd                 ae   at
    ah   hut                 ao   ought
    aw   cow                 ax   abaft (first vowel, schwa)
    ay   hide                ea   wear
    eh   Ed                  er   hurt
    ey   ate                 ia   fortieth
    ih   it                  iy   teen
    oh   mob                 ow   lobe
    oy   toy                 ua   intellectual
    uh   nook                uw   two
    p    pick                b    bet
    t    tip                 d    dee
    f    fee                 v    vise
    th   thick               dh   thee (eth)
    s    sick                z    zip
    sh   ship                zh   seizure
    ch   cheese              jh   jeep
    k    key                 ng   rang (engma)
    g    green               m    me
    n    new                 l    lee
    r    ream                w    win
    y    you                 hh   he

Table 2.2: Phoneme transcription in Polish - SAMPA

    i   I   e   a    o    u   e~  o~
    j   l   w   r    m    n   n'  N
    v   f   x   z    s    z'  s'  Z
    S   dz  ts  dz'  ts'  dZ  tS  b
    p   d   t   g    k

Irregularities of pronunciation and linguistic rules are a real challenge for speech recognition. Many words sound similar, especially in English; they are called homophones (e.g. night and knight). What is more, there are even whole sentences which sound very similar (e.g. 'I helped Apple wreck a nice beach' and 'I helped Apple recognise speech'). Another problem is caused by the context dependency of phonemes. As we said, there are around 40 different phonemes, but in fact all of them vary at the beginning and at the end, depending on the neighbouring phonemes. Such triples are the so-called triphones. With 40 phonemes there are 40^3 = 64000 ordered triples; around 40% of the possible phoneme combinations actually occur, which gives 25600 possible patterns to recognise. There are no trivial methods for such a number. Unfortunately, this is not the only problem. Phoneme boundaries overlap each other, and there is co-articulation of phonemes and words. Intonation and sentence stress play an important role in interpretation: the utterances 'go!', 'go?' and 'go.' can clearly be distinguished by a human but are difficult for a computer. In naturally spoken language there are no pauses between words, so it is difficult for a computer to decide where boundaries lie. This is why a general speech recognition system requires human knowledge and experience, as well as advanced pattern recognition and artificial intelligence.

2.3 Speech Processing

Speech carries information. This is quite obvious, but we often forget that our brain has to decode speech on many different levels to extract the actual information; we have to do the same using computers. We understand speech processing as the representation and transformation of the waveform signal. For practical reasons this is usually done in the frequency domain, where the coded information is easier to find.

2.3.1 Spectrum

Originally, a spectrum was what is now called a spectre, for example a phantom or an apparition. In the 17th century the word spectrum was introduced into optics, referring to the range of colours observed when white light was dispersed through a prism. A sound spectrum is a representation of a sound in terms of the amount of vibration at each individual frequency. It is usually presented as a graph of either power or pressure as a function of frequency. The power or pressure is measured in decibels and the frequency is measured in vibrations per second - Hertz [Hz].

It is important for any research on speech that speech is quite a specific audio signal, which can be distinguished by its pressure and frequency, as presented in Fig. 2.4, copied from (Tadeusiewicz, 1988). There is no point in analysing other frequencies. Similarly, a given range of acoustic pressure can be expected. We can limit the analysis to the subband of around 80-8000 Hz. This observation has already been used very successfully, for example in GSM mobile phones.

In 1807, Jean Baptiste Joseph Fourier described his method of analysing heat propagation. It was very controversial and was graded negatively by a committee of the Paris Institute which included many famous mathematicians. The first objection, made by Lagrange and Laplace in 1808, was to Fourier's expansions of functions as trigonometrical series, what we now call the Fourier series. Other objections concerned the equations of heat transfer. The Fourier spectrum (Fig. 2.5) is currently a basic and very common tool for analysing many types of stationary signals. A stationary signal is a signal that repeats into infinity with the same periodicity. The spectral representation of a signal is calculated as

    s(f) = ∫_{-∞}^{+∞} s(t) exp(−2πjft) dt.    (2.1)


Figure 2.4: Speech audibility and average human hearing band (Tadeusiewicz, 1988)

Figure 2.5: An example of Fourier spectrum amplitude


Figure 2.6: Frequency spectrum of speech in a linear and a non-linear scale

Function s(f) defines the notion of global frequency f in a signal. It is computed as inner products of the signal and the trigonometric functions cos(2\pi ft) - j sin(2\pi ft) (from the Euler equation), which serve as basis functions of infinite duration (2.1). Any non-stationarity is spread out over the whole frequency range in s(f). Therefore, non-stationary signals require changes in the analysis method.

A non-stationary signal has to be windowed to be analysed by the Fourier transform. The original method was improved in 1965 by Cooley and Tukey (Cooley and Tukey, 1965), who found an algorithm to calculate the spectrum in fewer steps. It is known as the fast Fourier transform (FFT). The transform is then calculated locally for a given window over which the signal is approximately stationary, by repeating that part and creating a periodic function. This approach is usually called the short time Fourier transform (STFT). Another way is to modify the basis functions used in the Fourier transform (trigonometric functions) into other ones, more concentrated in time and less in frequency. This way of thinking leads to wavelet transforms.
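To make the windowing idea concrete, here is a minimal STFT sketch in Python; the signal x, the sampling rate fs and the frame and hop sizes are illustrative assumptions, not parameters from any cited system.

```python
import numpy as np

def stft(x, fs, frame_ms=30, hop_ms=10):
    """Window the signal, then take the FFT of each locally stationary frame."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(frame)  # tapering reduces boundary distortions
    frames = [x[i:i + frame] * window
              for i in range(0, len(x) - frame + 1, hop)]
    # one magnitude spectrum per frame
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])
```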

Human perception systems work on a non-linear scale; for example, it is much easier to perceive a candle in a dark room than in a lit one. Perception depends on background and reference. This is why we can say that the natural scale for humans is the logarithmic one. The most common consequence of this fact is the use of decibels [dB]. For the same reason, we sometimes use the mel frequency scale in speech analysis, rather than the standard linear one in Hz. Frequency in mels is defined as

f_{mel} = 1000 \log_2 \left( 1 + \frac{f_{Hz}}{1000} \right). \qquad (2.2)

The comparison of the two frequency scales is presented in Fig. 2.6.
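Eq. (2.2) translates into one line of code; the function name hz_to_mel is chosen here for illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel frequency as in Eq. (2.2): 1000 * log2(1 + f_Hz / 1000)."""
    return 1000.0 * np.log2(1.0 + f_hz / 1000.0)

print(hz_to_mel(1000.0))  # -> 1000.0, the scale is almost linear below 1 kHz
print(hz_to_mel(8000.0))  # -> ~3170, higher frequencies are compressed
```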

The need for nonlinearity in ASR led to the creation of the expression 'cepstrum'. It is derived from 'spectrum' by reversing the first four letters. This term was introduced by Tukey et al. in 1963 (Tukey et al., 1963). It has come to be the accepted terminology for the inverse Fourier transform of the logarithm of the power spectrum of a signal, \int_{-\infty}^{\infty} \log |s(f)|^2 \exp(2\pi jft) \, df. It was simplified by changing the inverse transform into a forward one, which does not change the basic idea (Rabiner and Schafer, 1978).
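Following this definition, a minimal cepstrum sketch (the small constant guarding the logarithm is an implementation assumption):

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse Fourier transform of the log power spectrum of a frame."""
    spectrum = np.fft.fft(frame)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)  # guard against log(0)
    return np.real(np.fft.ifft(log_power))
```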


Figure 2.7: The cepstrum is the Fourier transform of the log of the power spectrum

Figure 2.8: The types of speech segmentation (silence and speech, phonetic features, to fit transcription, words, speakers, syllables, phonemes)

2.4 Speech Segmentation

In the vast majority of approaches to speech recognition, the speech signals need to be divided

into segments before recognition can take place. The properties of the signal contained in each

segment are then assumed to be constant, or in other words to be characteristic of a single part of

speech. Speech segmentation is easier than image segmentation (Nasios and Bors, 2005), as it has to be done in one dimension only.

There are different meanings of segmentation though (Fig. 2.8). Very often it is used for word segmentation. It can be done by Viterbi and Forward-Backward Segmentation (Demuynck and Laureys, 2002). Another applied method (Subramanya et al., 2005) is based on the mean and variance of spectral entropy. Another issue covered by the same name, segmentation, is separating silence and speech in an audio recording (Zheng and Yan, 2004). That method uses so-called TRAPS-based segmentation and Gaussian mixture based segmentation (Nasios and Bors, 2006). Segmentation here means mainly removing non-speech events and additionally clustering according to speaker identities, environmental and channel conditions. Another possible segmentation is by phonetic features (not necessarily phonemes) (Tan et al., 1994), by applying wavelet analysis, which will be described in more detail in this dissertation. There also exists research on syllable segmentation (Villing et al., 2004). Another meaning is segmenting according to partially correct transcriptions (Cardinal et al., 2005); in this case segmentation is combined with recognition. Finally, we can understand segmentation as a process of breaking audio into phonemes (Grayden and Scordilis, 1994), where segmentation was conducted by analysing filter bank energy contours. In our research (Ziołko et al., 2006a,b), we find phoneme segmentation to be the most important, and this is why we will use the word 'segmentation' to mean phoneme segmentation unless stated otherwise. Phoneme segmentation and its usefulness in speech recognition will be described in more detail in the next chapter.


Naturally, if a frame contains the end of one phoneme and the beginning of another, it will cause recognition difficulties. Segmentation methods currently used in ASR are not particularly sophisticated. For example, they do not consider where phonemes begin and end; this causes conflicting information to appear at the boundaries of phonemes. Non-uniform phoneme segmentation can be useful in ASR for more accurate modelling (Glass, 2003).

2.5 Phoneme Segmentation

Constant-time segmentation, or framing, for example into 23.2 ms blocks (Young, 1996), is commonly used to divide the speech signal for processing. This method benefits from simplicity of implementation and easy comparison of blocks, which are of the same length. However, it is perceptually unnatural, because of the variation in the duration of real phonemes. In fact, human phonetic categorisation is also very poor for such short segments (Morgan et al., 2005). Moreover, boundary effects provide additional distortions (which are partially reduced by applying a Hamming window), and framing with such short segments creates many more boundaries than there are phonemes in the speech. These boundary effects can cause errors in speech recognition because of the mixing of two phonemes in a single frame. A smaller number of boundaries means a smaller number of errors due to the aforementioned effects. Constant segmentation therefore, while straightforward and efficient, risks losing valuable information about the phonemes due to the merging of different sounds into a single block and because the complexity of individual phonemes cannot be represented in short frames. The length of a phoneme can also be used as an additional parameter in speech recognition, improving the accuracy of the whole process. A comparison of applying constant framing and phoneme segmentation is presented in Fig. 2.9. Models based on processing information over long time ranges have already been introduced. The RASTA (RelAtive SpecTrAl) methodology (Hermansky and Morgan, 1994) is based on relative spectral analysis, and the TRAPs (TempoRAl Patterns) approach (Morgan et al., 2005) is based on multi-layer perceptrons with the temporal trajectory of logarithmic spectral energy as the input vector. It allows the generation of class posterior probability estimates.
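As a toy illustration of the boundary argument above, constant 23.2 ms framing of a 3 s utterance produces far more boundaries than phoneme-level segments would; the 80 ms average phoneme length used below is an assumed round figure, not a measured one.

```python
def n_segments(duration_s, segment_s):
    """Number of segments of a fixed length covering an utterance."""
    return int(duration_s / segment_s)

duration = 3.0                       # a 3 s utterance
print(n_segments(duration, 0.0232))  # -> 129 constant 23.2 ms frames
print(n_segments(duration, 0.080))   # -> 37 phoneme-length segments (assumed)
```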

A number of approaches have been suggested (Stober and Hess, 1998; Grayden and Scordilis, 1994; Weinstein et al., 1975; Zue, 1985; Toledano et al., 2003) to find phoneme boundaries from the time-varying speech signal properties. These approaches utilise features derived from acoustic knowledge of the phonemes. For example, the solution presented in (Grayden and Scordilis, 1994) analyses a number of different subbands in the signal using its spectra. Phoneme boundaries are extracted by comparing the percentage of signal power in different subbands. The Toledano et al. (Toledano et al., 2003) approach is based on spectral variation functions. Such methods need to be optimised for particular phoneme data and cannot be performed in isolation from phoneme recognition itself. Neural networks (NN) (Suh and Lee, 1996) have also been tested, but they require time consuming training. Segmentation can be applied by the segment models (SM) (Ostendorf et al., 1996; Russell and Jackson, 2005) instead of the HMM. The SM solution differs from the HMM by searching paths through sequences of frames of different lengths rather than single frames. It

means that segmentation and recognition are conducted at the same time and there is a set of possible observation lengths. In a general SM, the segmentation is associated with a likelihood and in fact describes the likelihood of a particular segmentation of an utterance. The SM for a given label is also characterised by a family of output densities which gives information about observation sequences of different lengths. These features of the SM solution allow the location of boundaries only at several fixed positions which are dependent on the framing (at integer multiples of the frame length).

Figure 2.9: Comparison of the frames produced by constant segmentation and phoneme segmentation

The typical approach to phoneme segmentation for creating speech corpora is to apply dynamic programming (Rabiner and Juang, 1993; Holmes, 2001). Dynamic programming is a tool which is guaranteed to find the cumulative distance along the optimum path without having to calculate the distance along all possible paths. In speech segmentation it is used for time alignment of boundaries. The common practice is to provide a transcription done by professional phoneticians for one of the speakers in the given corpus. Then it is possible to automatically create phoneme segmentation of the same utterances for other speakers. This method is very accurate but demands a transcription and hand segmentation to start with. For this reason it is not very useful for any application other than creating a corpus.

Figure 2.10: The list of speech feature extraction method types, grouped in two avenues: based on linear prediction coefficients (with PLP as the main one) and filter bank analysis (with MFCC as the main one).

There are several speech segmentation methods and several approaches to most of them, so it is natural to want to compare them. Surprisingly, evaluation methods for speech segmentation are quite simple and do not consider all scenarios. There are several suggestions of evaluation methods, but they are usually developed for given solutions, are not very universal and lose some accuracy in their simplifications. Typically, evaluation is based on counting the number of insertions, deletions and substitutions of the automatic segmentations with respect to a hand-checked reference transcription. The automatic word segmentation (Demuynck and Laureys, 2002) was evaluated by counting the number of boundaries for which the deviation between automatic and manual segmentation exceeded thresholds of 35, 70 and 100 ms. The syllable segmentation (Villing et al., 2004) was evaluated by counting the number of insertion and deletion errors within a tolerance of 50 ms before and after a reference boundary. Some authors do not publish any details about such a tolerance, or do not give a tolerance at all but use generally the same method (Grayden and Scordilis, 1994). This insertion and deletion approach has a few flaws. First of all, the value of the tolerance is questionable and cannot be set with any exact explanation. It is rather chosen using experience, quite often experience in the results of a given speech segmentation method and experiments. What is more, such methods treat different inaccuracies as simply correct or wrong detections (or grade them on a larger scale) without considering 'how wrong' the detection really is. Unfortunately, this is not the last of the problems. A tolerance is set, like 50 ms (Villing et al., 2004) for syllables, according to the statistically average length of a segment. The disadvantage of this approach is that speech segments, whatever they are, words, syllables or phonemes, vary greatly in their length. This is why a shift of 50 ms in boundary location is not the same for a 100 ms long syllable as for a 300 ms long one. Different speech segmentation methods were compared by us in (Gałka and Ziołko, 2008).
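A minimal sketch of the insertion/deletion evaluation scheme criticised above, with boundary times in seconds and the 50 ms tolerance; matching each automatic boundary to the nearest unmatched reference boundary is one possible strategy, not the method of any particular cited paper.

```python
def evaluate(auto, ref, tol=0.050):
    """Count hits, insertions and deletions under a fixed tolerance."""
    matched, hits = set(), 0
    for b in auto:
        # nearest reference boundary not yet matched
        cands = [(abs(b - r), i) for i, r in enumerate(ref) if i not in matched]
        if cands:
            d, i = min(cands)
            if d <= tol:
                matched.add(i)
                hits += 1
    return hits, len(auto) - hits, len(ref) - hits  # hits, insertions, deletions

print(evaluate([0.10, 0.31, 0.58], [0.12, 0.30, 0.50]))  # -> (2, 1, 1)
```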

2.6 Speech Parametrisation

Speech parametrisation is a representation of the spectral envelope of an audio signal which can be used in further processing. The two most common parametrisation methods are mel-frequency cepstral coefficients (MFCC) (Davis and Mermelstein, 1980) and perceptual linear predictive (PLP) coefficients (Hermansky, 1990).

2.6.1 Parametrisation Methods Based on Linear Prediction Coefficients

PLP (Rabiner, 1989) has become one of the standard speech parametrisation methods (Fig. 2.10), and is used as a baseline for part of the new research. Because of its importance, there have been further improvements to the method, some of which are described below.


Figure 2.11: fMPE transformation matrix from original low-dimensional feature vector into high-dimensional one

Misra et al. (Misra et al., 2004) suggest normalising the spectrum into a probability mass function (PMF), or more strictly speaking a PMF-like function. Such a representation allows the calculation of entropy. Voice and non-voice segments are easily detected, even with a low signal-to-noise ratio (SNR). A hidden Markov model / artificial neural network (HMM/ANN) hybrid system was used in the experiments. Because the PLP features are the only baseline provided and a novel hybrid system is used, it is difficult to compare the results with many other papers. The results suggest that the entropy features are less efficient than PLP, but it is possible to improve a system based on the PLP by using entropy to create extra parameters. Entropy is a good choice to measure the gross peakiness of a data spectrum.
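A sketch of the entropy feature under these assumptions: the magnitude spectrum is normalised into a PMF-like function whose entropy is low for peaky (voiced) spectra and high for flat, noise-like ones. The guard constants are illustrative.

```python
import numpy as np

def spectral_entropy(frame):
    spectrum = np.abs(np.fft.rfft(frame))
    pmf = spectrum / (spectrum.sum() + 1e-12)   # PMF-like normalisation
    return -np.sum(pmf * np.log2(pmf + 1e-12))  # low for peaky spectra
```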

Deng et al. (Deng et al., 2005) present and compare two feature extraction and compensation algorithms which improve the PLP, and possibly other methods. The first one is the feature-space minimum phone error (fMPE) (Fig. 2.11) and the second is the stereo-based piecewise linear compensation for environments (SPLICE).

The fMPE is an improvement to the PLP. It is based on adding an additional high-dimensional feature vector containing conditional probabilities of each feature given the whole original low-dimensional feature vector. The high-dimensional feature vector is projected by a transformation matrix into the subspace of the same dimension as the original vector (Fig. 2.11). The transformation matrix is created by reestimation via minimising the discriminative objective function known as the minimum phone error by gradient descent. The training is conducted by an iterative scheme of retraining the HMM parameters using the fMPE feature sets via maximum likelihood. There are different possible decomposition schemes of the fMPE. One of them may be interpreted as a compensation for the original features by adding a large number of bias vectors, each of which is computed as a full-rank rotation of a small set of posterior probabilities. Approximations can easily be made to remove the numerical problems in maximum-likelihood estimation. Another decomposition scheme is interpreted as compensating for the original PLP cepstral features by a frame-dependent bias vector. The fMPE can be understood as a compensation vector which consists of the linear weighted sum of a set of frame-independent correction vectors. The weight is then the conditional probability associated with the corresponding correction vector. The fMPE algorithm is empirical in its nature.
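The core projection step can be sketched as follows; the dimensions are typical but assumed, and the random matrix M is only a placeholder for the transformation matrix that fMPE actually estimates by gradient descent on the MPE criterion.

```python
import numpy as np

d, D = 39, 1000                          # low- and high-dimensional sizes (assumed)
rng = np.random.default_rng(0)
M = rng.normal(scale=0.01, size=(d, D))  # placeholder for the trained matrix
x_t = rng.normal(size=d)                 # original feature vector
h_t = rng.dirichlet(np.ones(D))          # high-dimensional posterior-like vector
y_t = x_t + M @ h_t                      # features plus frame-dependent bias
```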


Figure 2.12: Mel frequency cepstrum coefficients

The SPLICE is also a method of compensation. It assumes that an ideally clean speech feature vector is 'piecewise linearly' related to the corresponding analysed noisy one. Which 'piece' of the local approximation is used for the piecewise linear approximation to the non-linear relationship between the noisy and clean speech feature vectors is determined by an index. With such an assumption the SPLICE compensation is calculated using the minimum mean square error (MMSE). This gives conditional probabilities corresponding to the ones in the fMPE algorithm. In contrast to the fMPE, the compensation by addition is a natural consequence of the MMSE optimisation rule.

The PLP has found several applications. The transcription of conference room meetings is described in (Hain et al., 2005). It is based on the augmented multi-party interaction (AMI) system using the HTK as the HMM for modelling and N-gram based language models. Phonetic decision tree state clustered triphone models with the standard left-to-right three-state topology are used for acoustic modelling. States are represented by mixtures of 16 Gaussians. Coefficients obtained by applying the PLP can be transformed into other types of parameters (cepstral coefficients) for further analysis. However, there is some ambiguity in the paper regarding the features. First, it is stated that 12 mel-frequency PLP coefficients, with first and second order derivatives, were used by front-ends as parameters to form a 39 dimensional feature vector. Then, it is said that the smoothed heteroscedastic linear discriminant analysis (SHLDA) reduces a 52 dimensional (standard vector plus third derivatives) vector to 39 dimensions. Cepstral mean and variance normalisation is performed on complete channels. The vocal tract length normalisation (VTLN) gives speaker adaptation. The maximum likelihood criterion estimates warp factors. The UNISYN pronunciation lexicon was used. The method for feature extraction is not very novel, but the complexity of the system and the results of experiments on a large amount of data are impressive. The AMI is a global approach to a large vocabulary ASR system.


2.6.2 Parametrisation Methods Based on Filter Banks

Davis and Mermelstein (Davis and Mermelstein, 1980) suggested a new approach to speech parametrisation in 1980. They described and compared two groups of parametric representations: one based on the Fourier spectrum (the MFCCs and the linear frequency cepstrum coefficients, LFCCs) and another based on the linear prediction spectrum (linear prediction coefficients, LPCs, the reflection coefficients, RCs, and the cepstrum coefficients derived from the linear prediction coefficients, LPCCs). The MFCCs proved to be the best of them, and are computed using triangular bandpass filters organised in a bank to filter different frequencies. The filters' characteristics overlap each other in such a way that the next filter begins at the middle, best-passing frequency of the previous one (Fig. 2.12). The MFCCs are computed as the sums over filters

MFCC_i = \sum_{k=1}^{20} X_k \cos\left( i \left( k - \frac{1}{2} \right) \frac{\pi}{20} \right), \quad i = 1, 2, \ldots, M. \qquad (2.3)

The method was improved by setting 12 basic coefficients, energy, and first and second derivatives of these, which gives a set of 39 features (Young, 1996). This now seems to be the most common parametrisation and a baseline for new research in ASR. Some improvements of MFCCs and new approaches based on filter banks are described below.
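Eq. (2.3) translates directly into code; the vector X of filter-bank log-energies (one value per triangular filter) is assumed to be precomputed.

```python
import numpy as np

def mfcc(X, M=12):
    """Cosine transform of filter outputs X_k as in Eq. (2.3)."""
    K = len(X)  # number of filters, e.g. 20
    return np.array([
        # (k + 0.5) is (k - 1/2) with the 0-based index k
        sum(X[k] * np.cos(i * (k + 0.5) * np.pi / K) for k in range(K))
        for i in range(1, M + 1)
    ])
```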

Most researchers believe that the phase spectrum information is not useful in speech recognition. Zhu and Paliwal (Zhu and Paliwal, 2004) argue that this is a wrong assumption. The phase spectrum information is less important than the magnitude spectrum, but it can still be useful. They use the product of the power spectrum and the group delay function (GDF). They compared a standard set of 39 parameters based on the MFCCs (12 MFCCs + energy, first and second derivatives of these) with three new approaches: modified-group-delay cepstral coefficients (MGDCCs), mel-frequency modified-group-delay cepstral coefficients (MFMGDCCs) and mel-frequency product spectrum cepstral coefficients (MFPSCCs). MFCCs are the best for an absolutely clean signal and MFPSCCs are the best for noisy signals. MFPSCCs are calculated in four steps (Zhu and Paliwal, 2004), sketched in code after the list:

1. Compute the FFT spectrum of the speech signal x(n) and of the speech signal values multiplied by their indexes, nx(n).

2. Compute the product spectrum of the two.

3. Apply a mel-frequency filter bank to the product spectrum in order to get the filter-bank energies (FBEs).

4. Compute the discrete cosine transform (DCT) (Ahmed et al., 1974) of the log FBEs to get the MFPSCCs.
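A sketch of these four steps, assuming a frame x and a precomputed mel filter bank matrix fbank (filters by FFT bins); taking the magnitude of the product spectrum before the filter bank is a simplification made here to keep the logarithm defined.

```python
import numpy as np
from scipy.fftpack import dct

def mfpscc(x, fbank, n_ceps=12):
    n = np.arange(len(x))
    X = np.fft.rfft(x)                  # step 1: spectrum of x(n)
    Y = np.fft.rfft(n * x)              #         and of n*x(n)
    product = np.real(X * np.conj(Y))   # step 2: product spectrum
    fbes = fbank @ np.abs(product)      # step 3: filter-bank energies
    return dct(np.log(fbes + 1e-12), norm='ortho')[:n_ceps]  # step 4
```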

MGDCCs and MFMGDCCs are calculated by applying the so-called modified GDF (MGDF) to a smoothed spectrum calculated using the FFT. Computing the DCT provides the features. In the case of MFMGDCCs, mel-frequency filter banks are additionally applied before computing the DCT. Both methods were evaluated as less efficient than MFCCs by the authors.


Zhu and Paliwal used an HMM for modelling. In the calculation of all the features, the speech signal was framed using a Hamming window every 10 ms with a 30 ms frame. A pre-emphasis filter was applied. The mel filter bank was designed with 23 frequency bands in the range from 64 Hz to 4 kHz.

Another interesting approach is given by Ishizuka and Miyazaki (Ishizuka and Miyazaki, 2004). Their method focuses on feature extraction that represents the aperiodicity of speech. The method is based on gammatone filter banks, framing, autocorrelation and comb filters. First the signal is filtered by the gammatone filter banks, which are designed using the equivalent rectangular bandwidth scale to choose the centre frequencies and bandwidths of the filters. Each bank consists of 24 filters. Various comb filters are designed for the outputs of the gammatone filters. They support separation of the output into its periodic and aperiodic features in subbands. Aperiodicity and periodicity power vectors are calculated. The DCT is used to extract parametrisation features from the vectors. The method has the accuracy of the MFCCs without noise and is better in noisy conditions. The HTK (Young, 1996) is used as the HMM pattern classifier.

The Centre for Speech Technology Research at the University of Edinburgh has introduced an innovative method of parametrisation. King and Taylor (King and Taylor, 2000) describe a linguistically motivated structural approach to continuous speech recognition based on a symbolic representation of distinctive phonological features. As part of further research, syllable classification using articulatory-acoustic features was conducted (M. Wester, 2003). The speech is firstly analysed using MFCCs, but then it is parametrised using features which are based on so-called multivalued features, namely: front-back (front, back, nil, silence), place of articulation (labial, labiodental, dental, alveolar, velar, glottal, high, mid, low, silence), manner of articulation (approximant, fricative, nasal, stop, vowel, silence), roundness (rounded, unrounded, nil, silence), static (static, dynamic, silence) and voicing (voiced, voiceless, silence). This is a parametrisation based strictly on classical phonology. The speech is represented by a sequence of symbolic matrices, each identifying a phone in terms of its distinctive phonological features. The NN was used for language modelling. The phonological approach is described in many other papers of the group. Methods of language modelling are also described, for example comparing the use of NNs and dynamic Bayesian networks (DBNs) for phonological feature recognition.

Yapanel and Dharanipragada (Yapanel and Dharanipragada, 2003) present a method based on the minimum variance distortionless response (MVDR) spectrum estimation and a trajectory smoothing technique. It was applied to reduce the variance in the feature vectors. The method is based on using specially designed FIR filters and it aims at the statistical stability of the spectrum estimation rather than the spectral resolution limit. Reduction of bias and variance is of special interest. The method was first described in 2001 and it differs from the classical MFCCs solution by applying the shortly described technique as an additional block following window filtering. In (Yapanel and Dharanipragada, 2003) additional perceptually modified autocorrelation estimates are obtained based on the PLP technique (Hermansky, 1990). The MVDR coefficients are calculated from these autocorrelation estimates. Thanks to incorporating perceptual information, the autocorrelation estimates are more reliable, because of perceptual smoothing of the spectrum. The MVDR estimation is then more robust. But this is not the only advantage of using such smoothing; additionally,


the dimensionality of the MVDR estimation is reduced. As a result, the MVDR method is faster with such a modification. The method was named by the authors perceptual MVDR-based cepstral coefficients (PMCCs).

Farooq and Datta (Farooq and Datta, 2004) describe the opportunity of using the DWT instead of the STFT to parametrise speech. The paper compares Daubechies wavelets of orders 2, 6 and 20, and two sets of subbands with 6 and 8 bands. The method analyses 32 ms frames using 28 or 36 features (depending on the number of subbands). Linear discriminant analysis (LDA) with the Mahalanobis distance measure classifier was used for phoneme classification. Evaluation of the method is done with 52 MFCC features as a baseline. The method was evaluated under noiseless conditions and with noise. Vowels were found more difficult to recognise than fricatives and stops. In most cases the DWT method is superior to the MFCCs even though it uses fewer features.
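A sketch in this spirit using the PyWavelets package: log-energies of Daubechies subbands for one frame. The subband log-energy feature is a simplified stand-in for the exact feature set of the paper.

```python
import numpy as np
import pywt

def dwt_features(frame, wavelet='db6', levels=5):
    """Log-energy of each wavelet subband (levels=5 gives 6 subbands)."""
    coeffs = pywt.wavedec(frame, wavelet, level=levels)
    return np.array([np.log(np.sum(c ** 2) + 1e-12) for c in coeffs])
```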

The Speech Research Group at the University of Cambridge describes a 2003 CU-HTK large vocabulary speech recognition system for conversational telephone speech (CTS) (Evermann et al., 2004) which uses MFCCs as feature vectors. The system has a multi-pass, multi-branch structure. The multi-branch architecture combines results from a few separate, similar systems with different parameters by separate lattice rescoring. Based on the Levenshtein distance metric, different word sequences are generated in branches instead of one best hypothesis. The output of all branches is combined using a system combination based on a confusion network. The CU-HTK CTS system consists of two main stages: lattice generation with adapted models and lattice multi-pass rescoring in multiple branches. Lattices restrict the search space in the subsequent rescoring stage. Additionally, the generation of lattices provides control for adaptation in each of the branches of the rescoring stage. In the lattice generation, the gain from performing the VTLN by warping the filter bank is very substantial. A multi-pass scheme is used for lattice generation. The first pass generates a transcription using the heteroscedastic linear discriminant analysis (HLDA), the minimum phone error (MPE) trained triphones and a word 4-gram language model (LM). Speakers gain the VTLN warp factor in this step. The second pass uses MPE VTLN HLDA triphones to create small lattices. In the third and last pass they are used in the lattice maximum likelihood linear regression (MLLR). Word lattices are generated with the word 4-gram LM interpolated with the class trigram. Speaker adaptive training (SAT) and single pronunciation dictionaries were used. A word-based 4-gram language model was trained on the acoustic transcriptions. That system seems to be the most complete, if not the only ready, complex academic solution for large vocabulary speech recognition.

Hifny et al. (Hifny et al., 2005) extend the classical HMM and MFCCs solution using the maximum entropy (MaxEnt) principle to estimate posterior probabilities more efficiently. Entropy measure information of acoustic constraints is used in an unbiased distribution to replace Gaussian mixture models. They use discriminative MaxEnt models for modelling acoustic variability, trained using the conditional maximum likelihood (CML) criterion, which maximises the likelihood of the empirical model estimated from the training data with respect to the hypothesised MaxEnt model. Exact parameters are numerically estimated using a modified version of the improved iterative scaling (IIS) algorithm. The difference lies in supporting constraints that may take negative values.


The idea of the IIS is to use an auxiliary function bounding the change in divergence after each iteration. Parametric constraints model the high variability of the observed acoustic signal and do not make the assumption of a Gaussian distribution of the data, which is not strictly true in practical applications. They exist if acoustic features are used directly. Currently, in many fields, researchers are trying to overcome model dependence on Gaussian assumptions. In the opinion of the authors, the hybrid MaxEnt/HMM method may replace hybrid ANN/HMM solutions, which are currently very popular, using the MaxEnt modelling to estimate the posterior probabilities over the states. The experiments were conducted using MFCC features. The conclusion might be that in standard speech recognition solutions (MFCCs and ANN/HMM models) there is a lack of use of entropy information. This conclusion corresponds very well to the paper (Misra et al., 2004), described earlier, which also points to the lack of use of entropy in existing solutions as a flaw. Both papers prove that adding entropy to existing solutions improves them, one (Misra et al., 2004) for PLP and the other (Hifny et al., 2005) for MFCCs.

2.6.3 Test Corpora and Baselines

The lack of a standard baseline method and test corpus for speech recognition is an important issue. Information about the evaluation experiments published in the described research papers is presented below. It is easy to observe that the databases and baselines are often different and the provided information about them often covers different issues. It is very difficult to compare different methods of parametrisation if they are evaluated using different baselines and modelling.

The Aurora2 database was used to evaluate the performance in (Zhu and Paliwal, 2004). The source speech is TIDigits, a connected digits task spoken by American English speakers, sampled at 8 kHz. It contains clean and multi-condition training sets and three test sets. 39 parameters based on the MFCCs are used as a baseline, with an HMM, not described in detail, as the language model. Aurora2 was also used to test SPLICE (Deng et al., 2005). PMCCs (Yapanel and Dharanipragada, 2003) were evaluated using Aurora2 as well, and in addition an automotive speech recognition application was used. They were compared to MFCCs, PLP and standard MVDR. An HMM was used as the model.

Tests in (Ishizuka and Miyazaki, 2004) were carried out on vowels from Japanese newspaper sentences spoken by a male speaker, and on the Japanese noisy digit recognition database Aurora-2J. The HTK was used for feature classification and the standard 39 MFCCs as the baseline.

The Misra et al. method (Misra et al., 2004) was tested on the Numbers95 database of US English connected digits telephone speech. There are 30 words in the database, represented by 27 phonemes. Training was conducted on clean data. Noise from the Noisex92 database was added to the testing data. The PLP features are used as the baseline. There are 3330 utterances for training and 1143 utterances for testing. The HMM/NN hybrid system was used in the experiments.

A very impressive amount of training and test data was used by the Cambridge Speech Research Group (Evermann et al., 2004). The training data consists of 296 hours of speech from the LDC (Switchboard I, Call Home English and Switchboard Cellular) plus 67 hours of Switchboard (Cellular and Switchboard II phase 2). Transcriptions were provided by MSState University for the LDC (carefully) and by the BBN commercial transcription service (quickly) for an additional 67 hours. Additionally, Broadcast News data (427M words of text) and 62M words of 'conversational texts' were collected from the Internet (www.ldc.upenn.edu/Fisher/).

The paper (Hain et al., 2005) presenting the development of the AMI meeting transcription system describes and uses many speech corpora for evaluation: SWBD/CHE, Fisher, BBC-THISL, HUB4-LM96, SDR99-Newswire, Enron email, ICSI meeting, NIST, ISL and AMI. The last four are typical meeting corpora. Results for different corpora and their sizes are compared. It uses elements of the HTK for training and decoding.

The Centre for Speech Technology Research at the University of Edinburgh (King and Taylor, 2000; M. Wester, 2003) carried out experiments on the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database (read continuous speech from North American speakers). 3696 training utterances from 462 different speakers and 1344 test utterances from 168 speakers were used. 39 phone classes are used instead of the original 61. The same database was used to evaluate the MaxEnt/HMM model (Hifny et al., 2005). The same reduction of phone classes took place. 420 speakers were used for the training set. Farooq and Datta (Farooq and Datta, 2004) also evaluated their methods using the TIMIT database, using vowels (/aa/, /ax/, /iy/), unvoiced fricatives (/f/, /sh/ and /s/) and unvoiced stops (/p/, /t/ and /k/) from the dialect regions of New England and the northern part of the USA. Data from 114 speakers (including 37 females) was used for training and from 37 speakers (including 12 females) for testing.

The fMPE (Deng et al., 2005) is evaluated using the DARPA EARS rich-transcription-2004 conversational telephone speech recognition task. The baseline in this case is just the set of coefficients to which the fMPE is appended, with an HMM used as the model.

As has been said, there are two typical baselines for feature evaluation: the MFCC and the PLP. The first one is more popular. It has to be mentioned that in several papers other baselines are used, especially incomplete MFCC. This makes comparing currently researched parametrisation methods a difficult task. Unfortunately, it is not the only problem. The HTK is the most typical method for speech modelling, but not the only one, and it should be stressed that the HTK is a running project with new versions available quite frequently. It can easily be imagined that different researchers use different versions, which are better or worse according to their date of release, but authors do not give any details about the version they are using. What is more, quite a few experiments are based on HMMs and HMM/ANN hybrid solutions other than the HTK (or the authors just do not give all the details), or just on ANNs. Differences in the results of experiments can be caused by worse or better parametrisation as well as by changes in a model.

One of the reasons why there is no standard test corpus might be that all of them are commercial and there seems to be no satisfactory, free evaluation data set for speech recognition. This is an issue which prevents standardisation of tests. Another point is that ASR research is conducted for different languages, so variety is inevitable because of the language preferences of researchers. Still, different sizes, complexity and variety of words in test corpora cause difficulties in comparing different approaches. To avoid such problems, there should be two freely available corpora: one with a small vocabulary, like digits, mainly for fast tests during research, and the other with a large vocabulary for final results.


Table 2.3: Comparison of the efficiency of the described methods. Asterisks mark methods appended to baselines (they could be used with most of the other methods). The methods without asterisks are new sets of features, different to the baselines.

Method                                                | Comparison to MFCC | Comparison to PLP
MFPSCCs (Zhu and Paliwal, 2004)                       | 2%                 | ?
Ishizuka* (Ishizuka and Miyazaki, 2004)               | 17%                | ?
Phonological (King and Taylor, 2000; M. Wester, 2003) | (no straightforward comparison)
Spectral Entropy* (Misra et al., 2004)                | ?                  | 15%
DWT (Farooq and Datta, 2004)                          | 2% (52 MFCC)       | ?
fMPE* (Deng et al., 2005)                             | ?                  | 13%
SPLICE* (Deng et al., 2005)                           | ?                  | 29%
PMCCs* (Yapanel and Dharanipragada, 2003)             | 20%                | 11%

2.6.4 Comparison of the Methods

It is very difficult to compare different methods because of the reasons presented in the previous section. However, we have tried to do at least an approximation of it. We compare methods according to the baselines which the authors gave, by presenting the average improvement in comparison to the baseline (Table 2.3). We do not see any way to compare methods with different baselines. The methods can be grouped into two categories. One of them covers basic features which replace the baseline (Zhu and Paliwal, 2004; Farooq and Datta, 2004; Hain et al., 2005). The other consists of elements appended to classical ones (Ishizuka and Miyazaki, 2004; Misra et al., 2004; Deng et al., 2005; Yapanel and Dharanipragada, 2003), and these are marked by asterisks in Table 2.3. The first group gives less improvement. It has to be stressed that methods in the second group are additional elements and as such they may be used in connection with methods of the first group to give even better results. The phonological approach (M. Wester, 2003; King and Taylor, 2000) has not been compared with any baseline. Work on the phonological features continues and results improve, but no clear comparison with the MFCCs or the PLP was found. As one of the authors explained in an email conversation, the system is not ready for word recognition, and because a main reason for using articulatory features to mediate between the acoustic signal and words is to get around the problem of 'beads on a string' (describing words as a simple concatenation of phones), using the phone error rate would be pointless.

New sets of features are not much better than the baselines. The largest improvement comes from adding extra elements and improving existing parametrisation. The methods marked with asterisks could give outstanding results if combined. However, some of them might be dependent on each other and in fact use the same information. The highest improvement among the reviewed methods is given by the SPLICE (Deng et al., 2005).

Based on the Yapanel results (Yapanel and Dharanipragada, 2003) (the only method compared with both baselines) we can calculate that the PLP method gives around 8% improvement compared to the MFCCs. This evaluation depends on the database used in the experiment and the exact value is questionable. Still, it allows us to assume that the PLP is a slightly better method than the MFCCs.


2.7 Speech Modelling

Speech and language modelling is based on stochastic processes. To define them, let us assume the existence of a probabilistic space and an infinite number of random variables in the space. E is the space of process states, and T stands for the domain of a stochastic process. The set of random variables S(t) \in E, S = \{S(t), t \in T\}, is a stochastic process.

A stochastic process S = \{S(t), t \in T\} is called a Markov process if it fulfils

P\{S(t_{n+1}) = s_{n+1} \mid S(t_n) = s_n, \ldots, S(t_1) = s_1\} = P\{S(t_{n+1}) = s_{n+1} \mid S(t_n) = s_n\}. \qquad (2.4)

It means that a Markov process keeps a memory of only the last event. The whole future run of the process depends only on the current event. A Markov chain is a Markov process with a discrete space of states. The domain may be continuous or discrete. The concept of Markov chains can be extended to include the case where the observation is a probabilistic function of a state. The HMM is a doubly embedded stochastic process with an underlying stochastic process that is hidden and can only be observed through another set of stochastic processes that produce the sequence of observations.

The HMM (Rabiner, 1989; Li et al., 2005) is a statistical model where the system being modelled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters, based on this assumption. Speech recognition systems are generally based on HMMs (Young et al., 2005) or hybrid solutions with ANNs (Young, 1996; Holmes, 2001). The statistical model gives the probability of an observed sequence of acoustic data by the application of Bayes' rule

P(\mathrm{word} \mid \mathrm{acoustic}) = \frac{P(\mathrm{acoustic} \mid \mathrm{word}) \, P(\mathrm{word})}{P(\mathrm{acoustic})}, \qquad (2.5)

where P(acoustic|word) comes from an acoustic model, P(word) is given by a language model (or a combination of several language models) and P(acoustic) is used for normalisation purposes only, so it can be skipped as long as we deliver normalisation in another way or we accept the fact that the final result is not a probability function, as it may not take values from 0 to 1 and the sum over all hypotheses is not equal to 1. We can easily accept this if we are interested only in the argument of the maximum of the result and do not need proper probability values. The Bayes rule can be similarly applied to phonemes, words, and syntactic and semantic information. Introducing an additional hidden dynamic state gives a model of spatial correlations and leads to better results (Frankel and King, in press).
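A toy illustration of the decision rule implied by Eq. (2.5): P(acoustic) is the same for every candidate word, so the recogniser maximises the numerator alone. The log-probabilities below are invented for the homophone example from the beginning of the chapter.

```python
import math

acoustic = {'night': -12.1, 'knight': -12.3}  # log P(acoustic|word), invented
language = {'night': math.log(0.01), 'knight': math.log(0.0001)}  # log P(word)

best = max(acoustic, key=lambda w: acoustic[w] + language[w])
print(best)  # -> 'night': the language model resolves the homophone
```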

The HMM is very popular but there are some other approaches to language modelling. One of them is the support vector machine (SVM), a classifier that estimates decision surfaces directly rather than modelling a probability distribution across the training data. As the SVM cannot model temporal speech structure efficiently, it works best in a hybrid solution with the HMM (Ganapathiraju et al., 2004).

Another model which has started to be popular in speech recognition is based on dynamic Bayesian networks (DBNs) (Wester et al., 2004; Frankel and King., 2005). Typical Bayes nets are directed acyclic graphs where each node represents a random variable. Missing edges imply conditional independence, which factors the joint distribution of all random variables into a set of simpler probability distributions. DBNs consist of instances of Bayesian networks repeated over time, with dependencies across time. DBNs were proposed as a model for articulatory feature recognition. In a classical HMM framework, parameters are obtained by the maximum likelihood approach. The variational Bayesian estimation and clustering (Watanabe et al., 2004) is another approach. It does not use maximum likelihood parameters but the posterior distribution.

There are other models (Venkataraman, 2001; Ma and Deng, 2004; Wester, 2003) for modelling acoustic parameters or elements of language. In all models we have to make many assumptions, like statistical dependence and independence (King, 2003). One has to be very careful not to commit a simplification which might result in a wrong model.

Another issue is the training process of a model. The most popular algorithms are based on the forward-backward procedure (Rabiner, 1989; X. Huang, 2001) for evaluation of an HMM, the Viterbi algorithm (Rabiner, 1989; Viterbi, 1967; Forney, 1973) for decoding an HMM and Baum-Welch for estimating HMM parameters (Rabiner, 1989; X. Huang, 2001). All of them need human supervision and might be quite costly in time. There are also methods based on active learning (Riccardi and Hakkani-Tur, 2005) in which applying adaptive learning may cut down the need for supervision.

2.8 Natural Language Modelling

Analysing semantic and syntactic content is one of the topics of NLP (Manning, 1999). Words can be connected in a large number of ways, including: by relations to other words, in terms of decomposition into semantic primitives, and in terms of non-linguistic cognitive constructs (perception, action and emotion). There are hierarchical and non-hierarchical relations. Some hierarchical relations are: is-a (a tree is a plant), has-a (a computer has a screen), and relations for scales of degree. Non-hierarchical relations include synonyms and antonyms. There are some word affinities and disaffinities in the semantic relations regarding the expressed concept. They are difficult to describe in a mathematical way but may be exploited by speech recognition systems. A crucial problem is the context-dependent meaning of words. For example, 'bank' is a bank of a river and a bank to keep money in. Authors of dictionaries try to identify distinct senses of entries, but it is very difficult to put an exact boundary between senses of a word and to disambiguate senses in practical contexts. Another problem is that natural languages are not static. Some additional meanings of words can change quite frequently (X. Huang, 2001).

Language regularities are very often modelled by n-grams (X. Huang, 2001). Let us assume a word string W consisting of n words w_1, w_2, w_3, \ldots, w_n. P(W) is a probability distribution over word strings W that reflects how often W occurs. It can be decomposed as

P(W) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) \ldots P(w_n \mid w_1, \ldots, w_{n-1}). \qquad (2.6)

For calculation time reasons, the dependence is limited to a fixed number of words backwards. Probably the most popular are trigram models using P(w_i \mid w_{i-2}, w_{i-1}), as the dependence on the previous two words is very strong, while the model complication is not very high. Such models need statistics collected over a vast amount of text. It means that many dependencies can be averaged. Adaptive language models (Bellegarda, 1997; Jelinek et al., 1991; Mahajan et al., 1999) deal with this flaw by a semantic approach to n-grams. Several different models can be created for different topics and different types of texts, organised in a domain or topic-clustered language model. A system then detects the topic of the recognised text and uses the cluster of the n-gram model associated with this topic. It is possible to combine several clusters at once and to change the topic during recognition of different parts of the same text. Latent semantic indexing (Bellegarda, 2000) improves the traditional n-gram model by searching for co-occurrences across much larger spans, regarding semantic roles rather than the simple word distance.
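A minimal maximum-likelihood trigram sketch of Eq. (2.6) with the history truncated to two words; unlike any practical system it uses no smoothing, so unseen trigrams simply get probability zero.

```python
from collections import Counter

def train_trigram(words):
    """P(w3 | w1, w2) estimated as count(w1 w2 w3) / count(w1 w2)."""
    tri = Counter(zip(words, words[1:], words[2:]))
    bi = Counter(zip(words, words[1:]))
    return lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

p = train_trigram('we recognise speech and we recognise text'.split())
print(p('we', 'recognise', 'speech'))  # -> 0.5
```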

We are mainly interested in lexical semantics, which is the study of systematic, meaning-related structures of individual words. This field shows how ambiguous natural language can be. We will start by defining typical semantic notions (Jurafsky and Martin, 2000). A lexeme is an individual entry in the lexicon. It corresponds to a word but has a stricter meaning - a pairing of a particular orthographic and phonological form with some form of symbolic meaning representation - a sense. In most traditional dictionaries lexeme senses are surprisingly circular - blood may be defined as the red liquid flowing in veins, and red as the colour of blood. The usage of such structures is possible only if a user has some basic knowledge about the world and meanings. Computers and artificial intelligence do not have it. This is why avoiding this circularity was one of the main issues in creating the lexical database WordNet (Fellbaum, 1999). It contains three separate databases: one for nouns, one for verbs and a third for adjectives and adverbs. WordNet is based on relations among lexemes. Homonymy is a relation that holds between words that have the same form with unrelated meanings. Items with such a relation are homonyms. Words with the same pronunciation and different spelling are homophones. On the contrary, homographs have the same orthographic form but different sounds. Polysemy is an occurrence of multiple meanings within a single lexeme. So we can say that a bank of a river and a bank to keep money in are rather homonyms, while a blood bank and a bank to keep money in are rather polysemes. Obviously, the distinction between homonymy and polysemy is not fully sharp. A Polish example of homonyms is the two meanings of the word 'zamek'. The first one is a castle and the second one is a lock. We can typically separate them by investigating lexeme history and etymology (origin). A bank to keep something in has an Italian origin, while a bank of a river has a Scandinavian one. Synonymy is defined as the coexistence of different lexemes with the same meaning, which also leaves many open questions. An example of synonyms is the Polish words 'kolor' and 'barwa'. The first one means colour and the second one might be translated as hue, but in Polish it can easily replace the first one. Hyponymy is a relation between a pair of lexemes with similar but not identical senses, where the sense of one is more specific than the sense of the other.

There are several problems with applying semantic analysis. The first of them is the use of metaphors. They are especially common in literature, but also in spoken language and sometimes even in documents. Words and phrases used to present completely different kinds of concepts than their lexical senses are a serious challenge. Metonymy is a related issue: it is the use of lexemes to denote concepts by naming some other related concept. We can use the word 'kill' to describe stopping some process in a more dramatic way, like 'killing' processes in Linux or 'killing a sale of a rival company'. Finally, the problem is that existing semantic algorithms are dedicated to written text which is expected to be correct. Spoken language is characterised by a higher level of mistakes and abbreviations, while a user expects a transcription produced by a speech recogniser to be of written text quality.

There is very little research on semantic analysis for ASR, but there are some other fields which might be useful in our research, like word disambiguation (Banerjee and Pedersen, 2003) and automatic hypertext construction (Green, 1999). One of the interesting issues is topic signatures. Experiments show that it is possible to accurately approximate the link distance between synsets (a semantic distance based on the internal structure of WordNet) with topic signatures (Agirre et al., 2001, 2004). Clean signatures can be constructed from the WWW using filtering techniques like ExRetriever and Infomap (Cuadros et al., 2005).

There are several methods of measuring the relatedness of concepts in WordNet. The Similarity package provides six measures of similarity (Pedersen et al., 2004). The lch measure searches for the shortest path between two concepts, and scales it. The wup finds the path length to the root (shared ancestor) node from the least common subsumer of the measured concepts. The measure path equals the inverse of the shortest path length between two concepts. The res, lin and jcn are based on information content - a corpus based measure of the specificity of a concept. The package also contains three measures of relatedness (Pedersen et al., 2004). The hso classifies relations as having direction, so it is path based. The lesk and vector measures use the text of the gloss (definition) of a concept as its representation (Banerjee and Pedersen, 2003). This can be realised by counting shared words in glosses. Strings containing several words bring much more information due to entropy theory, so a score is the number of neighbouring words in the overlapping description raised to the second power. If several strings are shared, their scores are summed. Glosses of related senses can also be used to improve accuracy. There are other semantic similarity measures as well, like (Seco et al., 2004), which is based on the hierarchical structure only. Semantic similarity can also be measured using Roget's Thesaurus instead of WordNet (Jarmasz and Szpakowicz, 2003). That method is based on calculating all paths between two words using Roget's taxonomy.
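For illustration, the path measure is exposed by NLTK's WordNet interface (assuming the wordnet corpus has been downloaded with nltk.download('wordnet')); the word pair echoes the is-a example given earlier, and taking the first synset of each word is a simplification.

```python
from nltk.corpus import wordnet as wn

tree = wn.synsets('tree')[0]
plant = wn.synsets('plant')[0]
# inverse of the shortest is-a path length between the two synsets
print(tree.path_similarity(plant))
```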

Semantic analysis can improve the quality of ASR results. This is the highest information level in the linguistic model. Semantics deals with the study of meaning, including the ways meaning is structured in language and changes in meaning and form over time. The majority of the latest papers describing a general speech recognition scheme include semantic analysis. But there is no working system (known to the author) using lexical semantics, and there is little research on applying any semantic analysis to speech recognition. Semantic analysis is much more often used in written text analysis to retrieve information. There are two main approaches (X. Huang, 2001). The first is based on semantic roles:

• agent - cause or initiator of the action

• patient - undergoer of the action

• instrument - how the action is accomplished


• goal - to whom the action is directed

• result - result of the action

• location - location of the action

We can predict the location and order of different semantic roles in sentences. Some of them have to be present, others are optional. We can also associate specific words with a few roles. This allows us to detect an incorrect structure in recognised text. Such semantic analysis can be used in speech recognition (Bellegarda, 2000) instead of n-gram models.

The other approach is lexical semantics. Some words occur together in texts very often; others appear close to each other very rarely (Agirre et al., 2001, 2004). Such statistics have already been collected, for example in the semantic dictionary WordNet (Fellbaum, 1999). Words form a set of trees, and the number of branches between two nodes may stand for their semantic closeness. There are other possible measures as well. It is possible to detect words which do not fit the general semantic content of a recognised hypothesis.

2.9 Semantic Modelling

It is not efficient to recognise speech using acoustic information only. The human perception system catches context, structure and understanding combined with the recognition procedure. It is much easier for a human being to recognise and repeat a heard sentence without any errors if it is in a language we understand, compared to a sentence in an unfamiliar language, which is just a sequence of sounds. Similarly, it is much easier to recognise sentences from a familiar domain or topic than sentences from an unfamiliar context. Language modelling can therefore improve recognition considerably. Semantic analysis can be done in many different ways and has already been applied to ASR. However, this kind of modelling is difficult due to the data sparsity problem. The ASR literature regularly mentions semantic analysis as a necessary step, but it is very difficult to find research papers which provide exact recognition results when semantic methods are applied.

Latent semantic analysis (LSA) (Bellegarda, 1997, 1998; T. Hofmann, 1999) is an NLP technique patented in 1988. It assumes that the meaning of a small part of text, like a paragraph or a sentence, can be approximated by the sum of the meanings of its words. LSA uses a word-paragraph matrix which describes the occurrences of words in topics: a sparse matrix whose rows correspond to topics and whose columns typically correspond to words that appear in the topics. The elements of the matrix are proportional to the number of times the words appear in each document, where rare words are upweighted to reflect their relative importance. LSA is performed using singular value decomposition (SVD). LSA has already found a few applications. One of them is the automatic grading of essays and answers (Kakkonen et al., 2006; Kanejiya et al., 2003). LSA can also be used to model global word relationships for junk e-mail filtering or pronunciation modelling (Bellegarda, 80). Another possible application is word completion (Miller and Wolf, 2006). LSA can be combined with the n-gram model (Coccaro and Jurafsky, 1998;


Table 2.4: Speech recognition applications available on the Internet

HTK (Young, 1996; Evermann et al., 2004) - htk.eng.cam.ac.uk
Edinburgh Speech Tools - www.cstr.ed.ac.uk/projects/speech_tools
SPRACH (Hermansky, 1990) - www.icsi.berkeley.edu/dpwe/projects/sprach/sprachcore.html
AMI (Hain et al., 2005) - www.amiproject.org/business/index.htm
CMU Sphinx (Lamere et al., 2004) - cmusphinx.sourceforge.net/html/cmusphinx.php
CMU Let's Go (Eskenazi et al., 2008) - http://www.speech.cs.cmu.edu/letsgo/
Snorri - www.loria.fr/~laprie
Snack Speech Toolkit - http://www.speech.kth.se/snack/
Praat (Boersma, 1996) - www.fon.hum.uva.nl/praat/
CSLU OGI Toolkit - http://cslu.cse.ogi.edu/toolkit/
Sonic ASR - cslr.colorado.edu/beginweb/speech_recognition/sonic.html

Gronqvist, 2005) or a maximum entropy model (Deng and Khudanpur, 2003). LSA can also be applied to bigrams of words in topics rather than to single words (Y.-C. Tam, 2008). Such a model is more difficult to train but can improve results if combined with a regular LSA model. There are other methods of analysing semantic information, like topic signatures (Agirre et al., 2001, 2004) and maximum entropy language models (Khudanpur and Wu, 1999; Wu and Khudanpur, 2000). The idea of topic signatures is to store concepts in context vectors. There are simple methods to acquire them automatically for any concept hierarchy, and they were used to approximate link distances in WordNet. Maximum entropy language models combine dependency information from sources like syntactic relationships, topic cohesiveness and collocation frequency. They evolved from n-grams; the difference is that they store not only n words but also other information, like the n preceding exposed head-words of the syntactic partial parse, the n non-terminal labels of the partial parse and a topic.
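The LSA machinery described above can be sketched in a few lines of linear algebra. The sketch below builds a small word-document count matrix, upweights rare words by inverse document frequency (one common choice; the exact weighting varies between papers), and keeps a rank-k SVD truncation; the toy documents and k = 2 are illustrative only.

```python
import numpy as np

docs = ["kot pije mleko", "kot goni mysz",             # toy topic: animals
        "sejm uchwala ustawe", "sejm odrzuca ustawe"]  # toy topic: parliament
vocab = sorted({w for d in docs for w in d.split()})

# word-document count matrix
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[vocab.index(w), j] += 1

# upweight rare words: inverse document frequency
df = (A > 0).sum(axis=1)
A *= np.log(len(docs) / df)[:, None]

# rank-k truncation of the SVD; words and documents share a k-dim space
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]            # one row per word
doc_vecs = (Vt[:k, :] * s[:k, None]).T  # one row per document
print(word_vecs.shape, doc_vecs.shape)
```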

2.10 Academic Applications

There are a few academic applications of speech recognition; we list some of them in Table 2.4. Edinburgh Speech Tools is not a complete ASR system but rather a toolbox for speech analysis with many elements useful in speech recognition, for example an n-gram language model. SPRACH is a full package based on PLP, including, for example, ANN training and recognition, feature calculation and sound file manipulation, plus all the GUI components and tools. AMI targets computer-enhanced multi-modal interaction in the context of meetings, including ASR. The CMU Sphinx Group (Lamere et al., 2004) offers packages for speech-enabled applications, very useful for speech modelling in ASR. CMU also provides the spoken language system Let's Go (Eskenazi et al., 2008), which includes ASR. Snorri is dedicated to assisting researchers in the fields of ASR, phonetics, perception and signal processing. Similar opportunities are provided by the Snack Sound Toolkit, which uses scripting languages like Python. Praat (Boersma, 1996) covers speech analysis, labelling, segmentation and learning algorithms. The CSLU OGI Toolkit helps in building interactive


language systems for human-computer interaction. SONIC is a speech recogniser developed by the University of Colorado; it is available only to registered and approved users.

The HTK (Young et al., 2005) is a toolkit using HMMs, mainly for ASR research; its other applications include research into speech synthesis, character recognition and DNA sequencing. We used version 3.3 in our research. HTK consists of many modules and tools, all available in C source form. HTK provides facilities for speech analysis, HMM training, testing and results analysis. The system fits every recognition hypothesis to one of the elements of a user-provided dictionary by comparing it with the phonetic transcriptions of words. The toolkit supports HMMs using both continuous density mixture Gaussians and discrete distributions. HTK was originally developed at the Machine Intelligence Laboratory of the Cambridge University Engineering Department (CUED). It was sold to Entropic Research Laboratory Inc. and later to Microsoft. Currently it is licensed back to CUED and under continuous development.


Chapter 3

Linguistic Aspects of Polish

English is the most common language of ASR research, with Chinese and Japanese as two other common languages. This thesis is focused on ASR of Polish, which is the most widely spoken Slavic language in the EU and one of the most common inflective languages. There is quite little research and no working continuous Polish ASR system. To create such a system, successes in other languages have to be built upon. As Polish and English belong to the same Indo-European group, we focused on existing solutions for English ASR. There are differences between these languages which have a larger or smaller impact on ASR, and these differences should result in some variations in the algorithms.

3.1 Analysis of Polish from the Speech Recognition Point of View

We searched for differences between English and Polish which seem to be important in ASR. It is important to consider such linguistic aspects while designing an ASR system.

• English has a large number of homophones. What is more, many combinations of different words have similar pronunciations. Polish has fewer homophones.

• The pronunciations of vowels in English are very similar. If a vowel is not stressed, it is usually pronounced as /ə/ or /ɪ/. What is more, both of these phonemes have quite similar sounds and spectra. This means that unstressed vowels are almost indistinguishable in English, which contrasts with Polish.

• Modern English emerged as a mixture of around thirty languages. This resulted in quite simple general rules (which was necessary for a language to be widely accepted by different peoples) but many irregularities (as a kind of residue), especially in pronunciation. Modern Polish is strongly based on Latin. Contrary to English, this resulted in very complicated grammar rules and morphology but quite few irregularities in pronunciation.

• English is a positional language, while Polish is an inflectional one. The meaning of a word in English depends strongly on its position in a sentence. In Polish, position is of secondary importance; the exact meaning of a word depends mainly on morphology.


For example, in English the sentences ’Mike hit Andrew’ and ’Andrew hit Mike’ mean quite different things. In Polish (using similar Polish names) ’Michał uderzył Andrzeja’, ’Michał Andrzeja uderzył’, ’Andrzeja Michał uderzył’ and ’Andrzeja uderzył Michał’ are all acceptable and mean almost the same. However, all but the first stress some part of the information and sound quite strange without a special context. To swap who hit and who was hit, we have to use a different ending: ’Andrzej uderzył Michała’. This means that syntax modelling is very difficult for Polish and possibly not as necessary as for English. On the other hand, analysing morphology seems to be crucial in the case of ASR for Polish.

• In English, conjugation and declension are relatively simple and adjectives do not need any type of agreement. In Polish there are groups of different ways of conjugation and declension. Each verb typically has a different form for each combination of gender (there are three basic genders in Polish, although linguists distinguish 8 categories), person and singular or plural number. Each noun has 7 forms (cases) depending on its position and relation to other words in the sentence. Adjectives and numerals agree with the nouns they describe. There is no general rule of word agreement, like adding ’s’ or ’es’ in English; different groups of words have their own types of endings. Verbs have 47 inflection forms (excluding participles), adjectives 44, numerals up to 49, adverbs 3, nouns and pronouns 14. A single word in Polish may have even several hundred topically correlated derived forms (for example, some verbs have almost 200 forms, including conjugation of participle, perfect and imperfect forms). This fact makes building a full dictionary of the Polish language for an ASR system very difficult. Even if it is possible, its size may cause very serious delays in the work of the ASR system.

• English is well known to have a vast vocabulary, due to the large number of dialects and versions of English spoken all around the world. Another reason is that English is a mixture of several languages, so there are words which mean almost the same but came from different sources. The Polish dictionary seems to be smaller in this respect.

• Polish has a few phonemes which are rare in other languages and do not exist in English. They sound very different from other phonemes. More specifically, they have much higher frequency content and sound to non-Polish speakers almost like rustles or hums. These phonemes are very easily detectable and, as such, can additionally be used as a kind of boundary between blocks of other phonemes.

3.2 Triphone Statistics of the Polish Language

Statistical linguistics at the word and sentence level has been studied for several languages (Agirre et al., 2001; Bellegarda, 2000). However, similar research on phonemes is rare (Denes, 1962; Yannakoudakis and Hutton, 1992; Basztura, 1992). The frequency of appearance of phonetic units is an important topic in itself for every language. It can also be used in several speech processing applications, for example modelling in LVCSR or coding and compression.


Models of triphones which are not present in the training corpus of a speech recogniser can be prepared using phonetic decision trees (Young et al., 2005). The list of possible triphones has to be provided for a particular language, along with a categorisation of its phonemes. The triphone statistics can also be used to generate hypotheses in the recognition of out-of-dictionary words, including names and addresses.

3.3 Description of a problem solution

The problem is to find triphone statistics for the Polish language. Our first attempt at this task has already been published (Ziołko et al., 2007). The task was conducted on a corpus containing mainly Parliament transcriptions, amounting to around 50 megabytes of text. It was then repeated on Mars, a Cyfronet computer cluster, for around 2 gigabytes of data.

Context-dependent modelling can significantly improve speech recognition quality. Each phoneme varies slightly depending on its context, namely its neighbouring phonemes, due to the natural phenomenon of coarticulation. This means that there are no clear boundaries between phonemes; they overlap each other, which results in interference of their acoustical properties. Speech recognisers based on triphone models rather than phoneme models are much more complex but give better results (Young, 1996). Consider, as an example, different ways of transcribing the word ’above’: the phoneme model is ax b ah v, while the triphone one is *-ax+b ax-b+ah b-ah+v ah-v+*. In case a specific triphone is not present, it can be replaced by a phonetically similar triphone (phonemes of the same phonetic group interfere with their neighbours in a similar way) using phonetic decision trees (Young et al., 2005), or by diphones (applying only the left or right context) (Rabiner and Juang, 1993).
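As a small illustration of this notation, the sketch below expands a phoneme sequence into HTK-style l-c+r triphones; the function name and the use of '*' for word boundaries follow the example above rather than any HTK tool.

```python
def to_triphones(phones):
    # pad with '*' so word-initial and word-final phonemes get a context
    padded = ["*"] + list(phones) + ["*"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

# the word "above" from the example in the text
print(to_triphones(["ax", "b", "ah", "v"]))
# ['*-ax+b', 'ax-b+ah', 'b-ah+v', 'ah-v+*']
```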

3.4 Methods, software and hardware

Sophisticated rules and methods are necessary to obtain phonetic information from orthographic text data; simplifications could cause errors (Ostaszewska and Tambor, 2000). Transcription of text into phonetic data was first applied using PolPhone (Demenko et al., 2003). The extended SAMPA phonetic alphabet was applied, with 39 symbols (plus space) and pronunciation rules for the cities Poznan and Krakow. We used our own digit symbols corresponding to SAMPA symbols, instead of the typical ones, to distinguish phonemes more easily while analysing the resulting phonetic transcriptions. The stream editor (SED) was applied to change the original phoneme transcriptions into digits with the following script:

    s/##/#/g      s/w~/2/g      s/d^z/6/g
    s/t^s'/8/g    s/s'/5/g      s/t^S/0/g
    s/d^z'/X/g    s/z'/4/g      s/d^Z/9/g
    s/j~/1/g      s/t^s/7/g     s/n'/3/g

Statistics can now be collected simply by counting the number of occurrences of each phoneme, phoneme pair and phoneme triple in the analysed text, where each phoneme is just a symbol (a single letter or a digit). Matlab was used to analyse the phonetic transcription of the text corpora. The


Table 3.1: Phonemes in Polish (SAMPA, Demenko et al. (2003))

SAMPA   example   transcr.   occurrences        %     % (Basztura, 1992)
#       #         #          283 296 436   15.256     4.7
a       pat       pat        151 160 947    8.141     9.7
e       test      test       146 364 208    7.882    10.6
o       pot       pot        141 975 325    7.646     8.0
t       test      test        68 851 605    3.708     4.8
r       ryk       rIk         68 797 073    3.705     3.2
n       nasz      naS         68 056 439    3.665     4.0
i       PIT       pit         67 212 728    3.620     3.4
j       jak       jak         61 265 911    3.299     4.4
I       typ       tIp         58 930 672    3.174     3.8
v       wilk      vilk        58 247 951    3.137     2.9
s       syk       sIk         54 359 454    2.927     2.8
u       puk       puk         51 503 621    2.774     2.8
p       pik       pik         51 228 649    2.759     3.0
m       mysz      mIS         48 760 010    2.626     3.2
k       kit       kit         44 892 420    2.418     2.5
d       dym       dIm         44 406 412    2.391     2.1
l       luk       luk         40 189 121    2.164     1.9
n'      kon       kon'        34 092 610    1.840     2.4
z       zbir      zbir        30 924 282    1.665     1.5
w       łyk       wIk         30 194 178    1.626     1.8
f       fan       fan         25 308 167    1.363     1.3
g       gen       gen         24 910 462    1.341     1.3
t^s     cyk       t^sIk       24 789 080    1.335     1.2
b       bit       bit         24 212 663    1.304     1.5
x       hymn      xImn        21 407 209    1.153     1.0
S       szyk      SIk         20 756 164    1.118     1.9
s'      swit      s'vit       17 220 321    0.927     1.6
Z       zyto      ZIto        16 409 930    0.884     1.3
t^S     czyn      t^SIn       15 429 711    0.831     1.2
t^s'    cma       t^s'ma      11 945 381    0.643     1.2
w~      ciaza     ts'ow~Za    10 814 216    0.582     0.6
c       kiedy     cjedy       10 581 296    0.570     0.7
d^z'    dzwig     d^z'vik      9 995 596    0.538     0.7
N       pek       peNk         4 880 260    0.262     0.1
d^z     dzwon     d^zvon'      4 212 857    0.227     0.2
J       giełda    Jjewda       3 680 888    0.198     0.1
z'      zle       z'le         3 390 372    0.183     0.2
j~      wiez      vjej~s'      1 527 778    0.082     0.1
d^Z     dzem      d^Zem          693 838    0.037     0.1


Figure 3.1: Phonemes in Polish in the SAMPA alphabet (occurrences in %, by phoneme class).

calculations were conducted on Mars in Cyfronet, Krakow. We analysed more than 2 gigabytes of data. Text data for Polish are still being collected and will be included in the statistics in the future.

Mars is a computing cluster with the following specification: IBM Blade Center HS21, 112 Intel dual-core processors, 8 GB RAM per core, 5 TB of disk storage and 1192 Gflops. It runs Red Hat Linux. Mars uses the Portable Batch System (PBS) to queue tasks and to split computing power so as to optimise times for all users. A user has to declare the expected time of every task; for example, a short task is up to 24 hours of calculations and a long one up to 300 hours. Tasks are submitted by simple commands with scripts, and the cluster starts particular tasks when computing resources become available. One process needs around 100 hours to analyse a 45-megabyte text file.
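Conceptually, each counting job reduces to n-gram counting over a string in which every phoneme is a single symbol. A minimal Python sketch follows (the thesis used Matlab on the cluster; the function name and toy transcription here are illustrative):

```python
from collections import Counter

def phoneme_ngrams(text: str, n: int) -> Counter:
    """Count phoneme n-grams; each phoneme is one character."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

corpus = "#pat#pot#"                  # toy transcription; '#' is a space
unigrams = phoneme_ngrams(corpus, 1)
diphones = phoneme_ngrams(corpus, 2)
# drop *#* triples, whose outer phonemes belong to two different words
triphones = {t: c for t, c in phoneme_ngrams(corpus, 3).items()
             if t[1] != "#"}
print(unigrams.most_common(3), diphones.most_common(3), triphones)
```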

3.4.1 Grapheme to Phoneme Transcription

Two main approaches are used for the automatic transcription of texts into phonemic forms. The classical approach is based on phonetic grammatical rules specified by a human (Steffen-Batog and Nowakowski, 1993) or by a machine learning process (Daelemans and van den Bosch, 1997). The second solution utilises graphemic-phonetic dictionaries. Both methods were used in PolPhone to cover typical and exceptional transcriptions. Polish phonetic transcription rules are relatively easy


to formalise because of their regularity.

The necessity of investigating a large text corpus pointed to the use of the Polish phonetic transcription system PolPhone (Jassem, 1996; Demenko et al., 2003). In this system, strings of Polish characters are converted into their phonetic SAMPA representations. Extended SAMPA (Table 3.1) is used to deal with the nuances of the Polish phonetic system. The transcription process is performed by a table-based system which implements the rules of transcription. A matrix $T \in S^{m \times n}$ is a transcription table, where $S$ is a set of strings and the cells meet the requirements listed precisely in Demenko et al. (2003). The first element $t_{1,1}$ of each table contains the currently processed character of the input string. One table is defined for every character (or character substring). The first column $\{t_{i,1}\}_{i=1}^{m}$ of each table contains all possible character strings that could precede the currently transcribed character. The first row $\{t_{1,j}\}_{j=1}^{n}$ contains all possible character strings that can follow the currently transcribed character. All possible phonetic transcription results are stored in the remaining cells $\{t_{i,j}\}_{i=2,j=2}^{m,n}$. A particular element $t_{i,j}$ is chosen as the transcription result if $t_{i,1}$ matches the substring preceding $t_{1,1}$ and $t_{1,j}$ matches the substring following $t_{1,1}$. This basic scheme is extended to cover overlapping phonetic contexts. If more than one result is possible, the longer context is chosen for transcription, which increases accuracy. Exceptions are handled by additional tables in a similar manner.
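A minimal sketch of this table lookup follows. The tiny rule table is hypothetical (PolPhone's actual tables are far larger and handle multi-character contexts and exceptions); it only shows the mechanism of matching preceding and following contexts and preferring the longer match.

```python
# (preceding context, following context) -> phonetic result, per character
TABLES = {
    "b": {
        ("", ""): "b",     # default
        ("", "#"): "p",    # word-final devoicing: 'b' before '#' -> /p/
    },
}

def transcribe_char(text: str, k: int) -> str:
    """Transcribe text[k] using the table for that character."""
    table = TABLES[text[k]]
    best, best_len = None, -1
    for (pre, post), result in table.items():
        if text[:k].endswith(pre) and text[k + 1:].startswith(post):
            if len(pre) + len(post) > best_len:   # prefer longer context
                best, best_len = result, len(pre) + len(post)
    return best

print(transcribe_char("chleb#", 4))   # 'b' at a word end -> 'p'
```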

Specific transcription rules were designed by a human expert in an iterative process of testing and updating rules. The text corpora used in the design process consisted of various sample texts (newspaper articles) and a few thousand words and phrases, including special cases and exceptions.

3.4.2 Corpora Used

Several newspaper articles in Polish were used as input data in our experiment. They come from the Rzeczpospolita newspaper from the years 1993-2002. They cover mainly political and economic issues, so they contain quite a lot of names and places, including foreign ones, which may influence the results slightly; for example, q appeared once, even though it does not exist in Polish. In total, 879 megabytes (103 655 666 words) were included in the process.

Several hundred thousand Internet articles in Polish made up another corpus. They all come from a high-quality website, where all content is reviewed and controlled by moderators. They are of an encyclopedic type, so they also contain many names, including foreign ones. In total, 754 megabytes (96 679 304 words) were included in the process.

The third corpus consists of several literature books in Polish. Some of them are translations from other languages, so they also contain foreign words. The corpus includes 490 megabytes (68 144 446 words) of text.

3.4.3 Results

A total of around 1 856 900 000 phonemes were analysed. They are grouped into 40 categories (including space). In fact one more, namely q, was detected, appearing in a foreign name. Since q is not part of the Polish alphabet, it was not included in the phoneme distribution presented in Table 3.1. The frequency of the space (noted as #) was 15.26%. The average number


Figure 3.2: Frequency of diphones in Polish, shown as the probability of transition [%] between first and second phoneme classes (each phoneme separately).

of phonemes per word is 6.6, including one space. Exactly 1 271 different diphones (Fig. 3.2 and Table 3.2) out of 1 560 possible combinations were found, which constitutes 81%.

21 961 different triphones (see Table 3.3) were detected. Combinations like *#*, where * is any phoneme and # is a space, were removed; such triples should not be considered triphones because the first and the second * belong to two different words. The list of the most common triphones is presented in Table 3.3. Assuming 40 different phonemes (including space) and subtracting the mentioned *#* combinations, there are 62 479 possible triples, of which we found 21 961 different triphones. This leads to the conclusion that around 35% of the possible triples were detected as triphones, the vast majority of them at least 10 times.

Young (1996) estimates that in English 60-70% of possible triples exist as triphones. However, in his estimation there is no space between words, which changes the distribution considerably. Some triphones may not occur inside words but may occur at the junction of the end of one word and the beginning of another. As the next step of our research, we have started to calculate such statistics without the empty space. It is also expected that the numbers of triphones differ between languages. Some values are similar to statistics given by Jassem a few decades ago and reprinted in Basztura (1992). We applied computer clusters, so our statistics were calculated on much more data and are more representative.

Fig. 3.2 shows some symmetry, but the probability of a diphone αβ is usually different from the


Figure 3.3: Space of triphones in Polish

probability of βα. The mentioned quasi-symmetry results from the fact that high values of the probability of α and (or) β often give high probabilities of the products αβ and βα as well. Similar effects can be observed for triphones. The data presented here illustrate the well-known fact that probabilities of triphones (see Table 3.3) cannot be calculated from the diphone probabilities (see Table 3.2); the conditional probabilities between diphones have to be known.

Besides the frequency of triphone occurrence, we are also interested in the distribution of these frequencies, presented on a logarithmic scale in Fig. 3.4. We obtained a different distribution than in the previous experiment (Ziołko et al., 2007) because a larger number of words was analysed. We found around 500 triphones which occurred once and around 300 which occurred two or three times; each occurrence count up to 10 was then observed for 100 to 150 triphones. This supports the hypothesis that one can reach a situation where new triphones no longer appear and the distribution of occurrences changes only as a result of more data being analysed. Some threshold can then be set and the rarest triphones removed as errors caused by unusual Polish word combinations, acronyms, slang and other variations of dictionary words, onomatopoeic words, foreign words, errors in phonetisation and typographical errors in the text corpus.

The entropy
$$H = -\sum_{i=1}^{40} p(i) \log_2 p(i), \qquad (3.1)$$

where $p(i)$ is the probability of a particular phoneme, is used as a measure of the disorder of a linguistic system. It describes how many bits are needed on average to describe a phoneme. According to Jassem, as reprinted in Basztura (1992), the entropy for Polish is 4.7506 bits/phoneme. From our calculations, the entropy is 4.6335 for phonemes, 8.3782 for diphones and 11.5801 for triphones.
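The computation behind these numbers is a direct application of (3.1) to the relative frequencies of phonemes, diphones or triphones. A minimal sketch, reusing n-gram counts such as those gathered above (the toy counts are illustrative):

```python
import math
from collections import Counter

def entropy(counts: Counter) -> float:
    """Entropy in bits of the distribution estimated from counts, (3.1)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

phoneme_counts = Counter({"#": 15, "a": 8, "e": 8, "o": 7, "t": 4})
print(f"{entropy(phoneme_counts):.4f} bits/phoneme")
```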


Table 3.2: Most common Polish diphones

diphone   occurrences       %      diphone   occurrences       %
e#        43 557 832    2.346      on        12 854 255    0.692
a#        38 690 469    2.084      #k        12 529 124    0.675
#p        31 014 275    1.671      ta        12 449 178    0.671
je        28 499 593    1.535      #n        12 316 393    0.663
i#        24 271 474    1.307      va        11 413 878    0.615
o#        23 552 591    1.269      ko        11 168 294    0.602
#v        20 678 007    1.114      #i        10 515 253    0.566
y#        19 018 563    1.024      aw        10 514 514    0.566
na        18 384 584    0.990      u#        10 379 234    0.559
#s        17 321 614    0.933      #f        10 265 162    0.553
po        16 870 118    0.909      #b        10 167 482    0.548
#z        16 619 556    0.895      #r        10 137 129    0.546
ov        16 206 857    0.873      ja        10 097 444    0.544
st        15 895 694    0.856      ar         9 818 127    0.529
n'e       14 851 771    0.800      x#         9 811 211    0.528
#o        14 104 742    0.760      do         9 779 666    0.527
#t        13 910 147    0.749      er         9 724 692    0.524
ra        13 713 928    0.739      te         9 618 998    0.518
#m        13 657 073    0.736      #j         9 398 210    0.506
ro        13 597 891    0.732      v#         9 251 288    0.498
#d        13 103 398    0.706      #a         9 143 021    0.492
m#        12 968 346    0.698      to         9 043 529    0.487

Figure 3.4: Distribution of triphone occurrences (horizontal axis: triphones; vertical axis: logarithm of the number of occurrences of a triphone).


Table 3.3: Most common Polish triphones

triphone   occurrences       %      triphone   occurrences       %
#po        12 531 515    0.675      wa#         3 262 204    0.176
#na         9 587 483    0.516      do#         3 210 532    0.173
n'e#        9 178 080    0.494      #ma         3 209 675    0.173
na#         8 588 806    0.463      jon         3 082 879    0.166
ow~#        6 778 259    0.365      e#z         3 054 967    0.165
#do         6 751 495    0.364      a#v         3 028 787    0.163
#za         6 429 379    0.346      #z#         2 928 164    0.158
ej#         6 390 911    0.344      ka#         2 871 230    0.155
je#         6 388 032    0.344      #sp         2 818 515    0.152
#pS         6 173 458    0.333      ont^s       2 754 934    0.148
go#         5 990 895    0.323      e#s         2 737 210    0.147
#i#         5 945 409    0.320      i#p         2 725 414    0.147
ego         5 742 711    0.309      o#p         2 719 121    0.146
ova         5 560 749    0.300      #Ze         2 701 194    0.145
vje         5 433 154    0.293      #ja         2 670 034    0.144
#v#         5 317 078    0.286      ta#         2 618 595    0.141
#je         5 311 716    0.286      ent         2 612 166    0.141
#n'e        5 292 103    0.285      #to         2 567 269    0.138
sta         4 983 295    0.268      to#         2 557 630    0.138
#s'e        4 861 117    0.262      pro         2 548 979    0.137
yx#         4 858 960    0.262      pra         2 539 424    0.137
#vy         4 763 697    0.257      #pa         2 503 153    0.135
s'e#        4 746 280    0.256      #re         2 502 443    0.135
pSe         4 728 565    0.255      ost         2 490 304    0.134
e#p         4 727 840    0.255      #ty         2 452 830    0.132
#f#         4 660 745    0.251      t^se#       2 436 864    0.131
em#         4 514 478    0.243      #mj         2 397 741    0.129
#pr         4 428 341    0.239      ku#         2 383 231    0.128
#ko         4 216 459    0.227      e#m         2 379 510    0.128
a#p         4 155 732    0.224      ja#         2 353 638    0.127
ci#         3 965 693    0.214      e#o         2 343 622    0.126
ne#         3 958 262    0.213      a#s         2 336 272    0.126
cje         3 916 595    0.211      #vj         2 329 962    0.125
n'a#        3 888 279    0.209      #mo         2 320 091    0.125
#ro         3 785 754    0.204      nyx         2 299 719    0.124
mje         3 760 340    0.203      os't^s'     2 295 365    0.124
#st         3 745 320    0.202      ovy         2 284 782    0.123
aw#         3 596 680    0.194      sci         2 282 887    0.123
ny#         3 580 425    0.193      ove         2 262 277    0.122
#te         3 449 304    0.186      li#         2 255 403    0.121
e#v         3 313 798    0.178      ovj         2 251 294    0.121
Ze#         3 309 352    0.178      mi#         2 243 432    0.121
ym#         3 300 273    0.178      uv#         2 236 507    0.120



3.5 Analysis of Phonetic Similarities in Wrong Recognitions of the Polish Language

A speech recognition system for Polish based on HTK is presented. It was trained on 365 utterances, each spoken by 26 males. Errors in recognition were analysed in detail in an attempt to find the reasons and scenarios of wrong recognitions.

We aim to provide a large vocabulary ASR system for Polish. There is very little research on this topic, and there is no system which works at the sentence level with a relatively rich dictionary. Polish differs from the languages most commonly used in ASR, like English, Japanese and Chinese, in the same way as all Slavic languages: it is highly inflective and non-positional. These disadvantages are compensated by an important feature of the Polish language: the relation between phonemes and the transcription is more distinct.

We used the HTK (Rabiner, 1989; Young, 1996) as the basis of the recognition engine. While this solution seems to work well, it is necessary to add extra tools at the grammar and semantic levels if a large dictionary is to be used while retaining very good recognition.

Mel-frequency cepstral coefficients (MFCCs) (Davis and Mermelstein, 1980; Young, 1996) were calculated for parametrisation. 12 MFCCs plus an energy term, with first and second derivatives, were used, giving a standard set of 39 elements. We used 25 ms windows for audio framing and a pre-emphasis filter coefficient of 0.97. Segments were windowed using the Hamming method. All 37 different phonemes were distinguished using the phonetic transcription provided with the corpus. As shown in the previous chapter, HTK is a standard for ASR, and its technical details are considered the state of the art of ASR; HTK is widely used as a model (Hain et al., 2005; Zhu and Paliwal, 2004; Ishizuka and Miyazaki, 2004; Evermann et al., 2004). We used the HTK settings suggested in the tutorial in (Young et al., 2005), apart from the sentence model, which we did not use at all because of the linguistic differences between English and Polish: the order of words in Polish is too irregular for this kind of model. In this experiment we simply treated sentences as if they were words, which means we put them in the dictionary. Obviously we used a different dictionary and list of phonemes than in the English example in the tutorial. All other settings were as suggested in (Young et al., 2005).
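For reference, a similar 39-element front end can be approximated outside HTK. The sketch below uses librosa with the parameters stated above (25 ms Hamming windows, pre-emphasis 0.97, 12 cepstra plus an energy-like zeroth coefficient, first and second derivatives); the 10 ms frame shift, the number of mel filters and the file name are our assumptions, and HTK's exact configuration differs in details.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)      # placeholder file name
y = librosa.effects.preemphasis(y, coef=0.97)

win = int(0.025 * sr)                             # 25 ms analysis window
hop = int(0.010 * sr)                             # assumed 10 ms shift
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=win,
                            win_length=win, hop_length=hop,
                            window="hamming", n_mels=26)
features = np.vstack([mfcc,                       # 13 static coefficients
                      librosa.feature.delta(mfcc),            # deltas
                      librosa.feature.delta(mfcc, order=2)])  # delta-deltas
print(features.shape)                             # (39, n_frames)
```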

Errors in speech recognition can have many different causes (Greenberg et al., 2000). Some of them appear because of phonetic similarities of different types, although there are errors which cannot be explained by acoustic similarities, and we want to find other possible reasons for these errors. The results are presented with a very deep analysis of which utterances were wrongly recognised and what they were recognised as. This knowledge may help in future ASR system design and in preparing data for corpora and model training.

There are three general types of errors: random, systematic and gross. Random (or indeterminate) errors are caused by uncontrollable fluctuations of the voice that affect parametrisation and


experimental results. Systematic (or determinate) errors are instrumental, methodological or personal mistakes causing lopsided data, consistently deviated in one direction from the true value. The detection of such errors is the most important, because the model then has to be altered. Gross errors are caused by experimenter carelessness or equipment failure, which are quite unlikely here, as we used professionally recorded data which had already been used by other researchers.

Our system was trained on part of a set called CORPORA (Grocholewski, 1995), created under the supervision of Stefan Grocholewski at the Institute of Computer Science, Poznan University of Technology, in 1997. Speech files in CORPORA were recorded with a sampling frequency f0 = 16 kHz, equivalent to a sampling period t0 = 62.5 µs. Speech was recorded in an office with a working computer in the background, which makes the corpus not perfectly clean. The signal-to-noise ratio (SNR) is not stated in the description of the corpus; it can be assumed that the SNR is very high for actual speech, but minor noise is detectable during periods of silence. The database contains 365 utterances (33 single letters, 10 digits, 200 names, 8 short computer commands and 114 simple sentences), each spoken by 11 females, 28 males and 6 children (45 people), giving 16 425 utterances in total. One set spoken by a male and one by a female were hand-segmented; the rest were segmented by a dynamic programming algorithm using a model trained on the hand-segmented ones. The optimisation was used to fit borders using the existing hand segmentation of the same utterance spoken by two different people. All available utterances of 26 male speakers were used for training, considering each of them as a single word in the HTK model. We created a decision tree to find the contexts making the largest difference to the acoustics, which should distinguish clusters using the rules of Polish phonology and phonetics (Kepinski, 2005), to create tied-state triphones.

In all our experiments involving HTK, some preprocessing of the data is necessary because of the special letters in Polish. The first step of this process is to change all upper-case letters into lower-case letters. Then all Polish special letters are replaced by corresponding standard capital letters; for example, ó is changed into O.
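A sketch of this preprocessing step is given below. The thesis only gives the example of ó, so the full mapping of the remaining Polish diacritics to capital letters is our assumption.

```python
# hypothetical mapping: each Polish diacritic -> a standard capital letter
POLISH_TO_ASCII = str.maketrans({
    "ą": "A", "ć": "C", "ę": "E", "ł": "L", "ń": "N",
    "ó": "O", "ś": "S", "ź": "X", "ż": "Z",
})

def preprocess(text: str) -> str:
    # first lower-case everything, then replace the special letters
    return text.lower().translate(POLISH_TO_ASCII)

print(preprocess("Michał uderzył Andrzeja"))   # michaL uderzyL andrzeja
```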

3.6 Experimental Results on Applying HTK to Polish

As we mentioned already, the system was trained on 9490 utterances, 365 for each of 26 male speakers. The orthographic dictionary contains 365 elements, but due to differences in pronunciation between speakers, the final version of the dictionary, working on phonetic transcriptions, contains 1030 entries.

We started the recognition evaluation using data from the only male speaker who was not used in training (Table 3.4). Only 6 out of 365 utterances were substituted, giving a correctness of 98.36%. Audio files of females, boys and girls were also recognised to check the correlation between the parameterisations of different ages and genders; these speakers were also used instead of adding noise to the male speaker's recordings. We obtained correctness values of 79.73%, 95.34% and 92.05% for adult female speakers. Child male speakers were recognised with correctness values of 60.55%, 95.07% and 75.62%, and we noted correctness values of 88.22% and 84.11% for girls. All non-adult-male speakers gave clearly worse results; however, there is no obvious difference between the degradation in results related to age


Table 3.4: Word recognition correctness for different speakers (the model was trained on adult male speakers only)

speaker   age     gender   substitutions   correctness [%]
AO1M1     adult   male           6             98.36
AF1K1     adult   female        74             79.73
BC1K1     adult   female        17             95.34
BW1K1     adult   female        29             92.05
AK1C1     child   male         144             60.55
AK2C1     child   male          89             75.62
CK1C1     child   male          18             95.07
LK1D1     child   female        43             88.22
ZK1D1     child   female        58             84.11

Table 3.5: Errors in different types of utterances (for all speakers)

type                  errors   utterances recognised   % of errors
sentences                  2                    1026             0
digits                    21                      90            23
alphabet                 130                     297            44
names and commands       312                    1872            17

or gender. Even girl speakers, for whom both age and gender differed from the training speakers, were recognised with a similar number of errors as speakers of just a different gender or age.

The types of errors were carefully analysed. First, we checked the percentage of correctly and wrongly recognised utterances depending on the type of utterance (Table 3.5). It can be clearly seen that smaller units are much more difficult to recognise: 44% errors for one-syllable units (spoken letters of the alphabet), 23% and 17% for single words, and almost no errors for sentences, even though we also evaluated the system on speakers of gender and age not used during training. This suggests that recognition based on MFCC parameterisation only is not enough; the context has to be used to allow the HMM models to work correctly (or a much better parameterisation, if possible).

All sentences were treated as single words during training and testing. The recognition of sentences is at an exceptional level, especially considering that we used many speakers of gender and age not used during training. The only two wrong recognitions are quite bizarre. In the first case the sentence meaning ‘He cleans sparrows in the zoo’ was recognised as the female name Helena. In the second case the sentence ‘Ups, it was more grey than yours’ was recognised as ‘A horse went on a poor road’. In both cases the correct transcription and the wrong recognition are

Table 3.6: Errors in sentences (speakers AK1C1 and AK2C1, respectively)

correct transcription              wrong recognition
On myje wroble w zoo               Helena
Oj bardziej niz wasz był szary     Kon droga marna szedł


Table 3.7: Errors in digits

digit        no.   wrong recognitions
0 zero         4   Zofia, Iwona, ce, Bozena
3 trzy         4   ce(2), zero, Joanna
1 jeden        3   Urban(2), Izabela
4 cztery       3   o, ge(2)
2 dwa          2   Diana, Anna
8 osiem        2   Franciszek, Alicja
5 piec         1   Rudolf
6 szesc        1   zero
7 siedem       1   Zenon
9 dziewiec     1   Diana

phonetically very different and very easily distinguishable for a human listener.

There are several interesting detailed observations in the patterns of wrong recognitions. Only one name was recognised as a sentence and quite few were recognised as spoken letters (Tables 3.8 and 3.9); the majority of wrong hypotheses were simple words. This means that the efficiency of the model depends on the length of the utterance: it works better for longer ones.

A very interesting fact is that even when names are recognised wrongly, their gender is still correct most of the time. 79 female names were recognised as other female names (out of those presented in Table 3.8), with only 17 female names recognised as male names. One clue might be that the vast majority of female names in Polish end with ’a’. However, such phonological similarity is probably not strong enough to account for this effect, and it is difficult to explain the phenomenon fully. A similar pattern was found in the case of male names: 50 male names were wrongly recognised as other male names and only 14 male names were recognised as female names.

There are some pairs of phonologically similar names, like Lucjan and Łucjan, or Mariola and Marian, which were quite commonly mistaken for each other. However, most wrong recognitions seem to have no such explanation. What is more, some wrong detections with large phonological differences appear quite frequently: for example, Barbara was recognised wrongly three times, each time as Marzena. It has to be stressed that many pairs of very similar words were recognised quite correctly; the name Maria was only twice recognised as Marian, and Marian as Maria just once. We can conclude that phonological similarities can cause wrong detections but do not seem to be a major source of them.

Table 3.10 shows names which appeared as wrong hypotheses for the errors listed in the other tables. There is an interesting tendency for these words to be recognised correctly most of the time when audio with their content was analysed. This suggests that some utterances are generally more probable than others in the recognition of the whole set, correct or not; we can say that they are represented more strongly in the language models. In a similar way, names which were wrongly recognised rarely appear in Table 3.10, because they are weakly represented. It has to be stressed that all utterances were used 26 times during the training. The best example of this behaviour is the name Łucjan, which was recognised for virtually all test speakers as Lucjan. The


Table 3.8: Errors in the most often wrongly recognised names and commands

word        no.   wrong recognitions
Łucjan        9   Lucjan(9)
Nina          7   Lidia, Emilia, Anna(2), Łucja, Urszula, Julian
Dorota        6   Beata(4), Renata, Danuta
Jan           6   Jerzy, Łucjan(2), Daniel, Diana, Leon
nie           6   zle, Lech(2), u(2), o
cofnij        5   Teofil(3), Rafał(2)
Dominik       5   Jan, Daniel(3), Jakub
Ewa           5   Anna, Helena, Olenka, Eliza, Helena
Maria         5   Mariola, Marian(2), Klaudia, Marzenka
Regina        5   Joanna, Romuald, el, Emilia, Aniela
Wacław        5   Lucyna(2), Jarosław
Ziuta         5   Julita(2), Joanna, Jolanta, Olga
Emilia        4   Aniela(2), el, ku
Emil          4   ku, el(3)
Gerard        4   Eugenia, Bozena, Leonard, de
Julia         4   Urszula, Julian(2), Joanna
Lech          4   zero, Joanna, u, te
Łucja         4   Lucjan(2), Urszula(2)
Sabina        4   Celina(2), Halina(2)
Teodor        4   Adam(3), Joanna
Alina         3   Emilia, Alicja, Urszula
Barbara       3   Marzena(3)
Benon         3   Damian(2), Marian
Bernard       3   Gerard, Beata, Leonard
Cecylia       3   Apolonia(2), Wacław
Celina        3   Karol, zle, Mariola
Damian        3   Daniel(2), Benon
Daria         3   Marta, Daniel, Bozena
Eliza         3   Alina(2), Lucjan
Felicja       3   Łucja, Urszula, Alicja
Hanna         3   Helena, Marian, Halina
Henryk        3   Alfred, Romuald, Hubert
Irena         3   Ireneusz, Urszula, Karolina
Iwona         3   Izabela, Maria, Zuzanna
Izydor        3   jeden, Romuald, Bogdan
Jerzy         3   zle, u, Leszek
Janusz        3   Ireneusz, Lech, Rudolf
Karolina      3   Mariola, Pelagia, Alina
Monika        3   Olenka, Łukasz


Table 3.9: Errors in the most often wrongly recognised names and commands (2nd part)

word        no.   wrong recognitions
Marek         3   Romuald(2), Marta
Mariola       3   Marian(2), Maria
Pelagia       3   Karolina(2), ten chor dusiłem licznie
Paulina       3   Mariola(2), Karolina
Sławomir      3   Hanna, Mariola, Karol
Seweryn       3   Karolina, Cezary, Zenon
Wojciech      3   Walenty, Monika, Alicja
Wanda         3   Halina, Marzena, Mariola
Weronika      3   Dorota, Renata, Danuta
zle           3   Julian, Joanna, Zofia
Zenon         3   Marian(2), Benon

Table 3.10: Names which appeared most commonly as wrong recognitions in the above statistics

name       no.    name       no.    name       no.
Lucjan      14    Alina        3    Aniela       2
Marian       8    Bozena       3    Apolonia     2
Urszula      8    Diana        3    Benon        2
Daniel       7    Emilia       3    Celina       2
Joanna       7    Helena       3    Damian       2
Mariola      7    Ireneusz     3    Danuta       2
Beata        5    Rudolf       3    Izabela      2
Karolina     5    Julita       2    Maria        2
Marzena      5    Karol        2    Marta        2
Romuald      5    Lech         2    Olenka       2
Alicja       4    Leonard      2    Renata       2
Anna         4    Leszek       2    Urban        2
Halina       4    Łucja        2    Zenon        2
Julian       4    Łucjan       2    Zofia        2


Table 3.11: Errors in the pronounced alphabet

letter   errors   letter   errors   letter   errors
en            9   ce            5   a             2
em            8   e             5   es            2
er            8   ka            5   zet           2
pe            8   be            4   eł            1
ce            7   de            4   ku            1
a             6   ge            4   u             1
en            6   i             4   wu            1
te            6   o             4   el            1
y             6   zet           3   e             1
esz           6   es            3   ef            1
zet           6

name Lucjan was always recognised correctly. What is more, Lucjan was provided as a hypothesis for several other names, including Jan, which was recognised as Lucjan for two different speakers. In this example the name Lucjan was provided as a recognised word 23 times (including correct ones) and Łucjan twice, in both cases incorrectly.

Table 3.11 presents wrongly recognised letters of the alphabet. We have already mentioned that this group is the most likely to contain errors, because its elements are very short and the HMM model cannot use all its advantages. We can also observe that sonorants (n, m, r) tend to be the most difficult to recognise. The letters ha and jot were recognised correctly for all speakers.

3.7 Conclusion

Polish and English were compared with respect to approaches to ASR of the two languages. 250 000 000 words from different corpora (newspaper articles, the Internet and literature) were analysed, and statistics of Polish phonemes, diphones and triphones were created. They are not fully complete, but the corpora were large enough that they can be successfully applied in NLP applications and speech processing. The collected statistics are the largest body of this type of computational linguistic knowledge for Polish. Polish is one of the most common Slavic languages; it has several phonemes that English lacks, and the statistics of its phonemes are also different. The most popular and standard ASR toolkit, HTK, was trained for the Polish language and tested with a deep analysis of the errors that occurred.


Chapter 4

Phoneme Segmentation

Speech signals typically need to be divided into small frames before recognition can begin. Analysis of these frames can then determine the likelihood of a particular phoneme being present within each frame. Speech is non-stationary in the sense that frequency components change continuously over time, but it is generally assumed to be a stationary process within a single frame. Segmentation methods currently used in speech recognition usually do not consider where phonemes begin and end, which causes complications at the boundaries of phonemes. However, non-uniform phoneme segmentation has already been found useful in ASR for more accurate modelling (Glass, 2003).

A phoneme segmentation method is presented in this chapter which is more sophisticated than the one described in (Ziołko et al., 2006b): more scenarios are covered and the results are evaluated in a better way. Experiments were performed on the much larger CORPORA set, which was described in the previous chapter. The method is based on analysing the envelopes and the rate-of-change of the DWT subband power.

4.1 Analysis Using the Discrete Wavelet Transform

The human hearing system uses frequency processing in the first step of sound analysis. While the details are still not fully understood, it is clear that a frequency-based analysis of speech reveals important information. This encourages us to use the DWT as a method of speech analysis, since the DWT may be more similar to the human hearing system than other methods (Wang and Narayanan, 2005; Daubechies, 1992). Details of the wavelet transformation are beyond the scope of this thesis, but we present a brief overview of the method here. The wavelet transformation provides a time-frequency spectrum. The original speech signal s(n) and its wavelet spectrum are of 16-bit accuracy. In order to obtain the DWT (Daubechies, 1992), the coefficients of the series

$$s_{m+1}(n) = \sum_i c_{m+1,i}\,\phi_{m+1,i}(n) \qquad (4.1)$$


are computed, where $\phi_{m+1,i}$ is the $i$th wavelet function at the $(m+1)$th resolution level. Due to the orthogonality of the wavelet functions,
$$c_{m+1,i} = \sum_{n \in D_{m+1,i}} s(n)\,\phi_{m+1,i}(n), \qquad (4.2)$$
where
$$D_{m+1,i} = \{\, n : \phi_{m+1,i}(n) \neq 0 \,\} \qquad (4.3)$$

are the supports of $\phi_{m+1,i}$. The coefficients of the lower level are calculated by applying the well-known (Daubechies, 1992; Rioul and Vetterli, 1991) formulae
$$c_{m,k} = \sum_i h_{i-2k}\, c_{m+1,i}, \qquad (4.4)$$
$$d_{m,k} = \sum_i g_{i-2k}\, c_{m+1,i}, \qquad (4.5)$$

where $h_i$ and $g_i$ are constant coefficients which depend on the scaling function $\phi$ and the wavelet $\psi$ (e.g. the functions presented in Fig. 4.2, which characterise dmey, the discrete Meyer wavelet). The speech spectrum is decomposed using the digital filtering and downsampling procedures defined by (4.4) and (4.5). This means that, given the wavelet coefficients $c_{m+1,i}$ of the $(m+1)$th resolution level, (4.4) and (4.5) are applied to compute the coefficients of the $m$th resolution level. The elements of the DWT for a particular level may be collected into a vector, for example $\mathbf{d}_m = (d_{m,1}, d_{m,2}, \ldots)^T$. The coefficients of other resolution levels are calculated recursively by applying formulae (4.4) and (4.5). The multiresolution analysis gives a hierarchical and fast scheme for the computation of the wavelet coefficients of a given speech signal $s$. In this way the values
$$\mathrm{DWT}(s) = \{\mathbf{d}_M, \mathbf{d}_{M-1}, \ldots, \mathbf{d}_1, \mathbf{c}_1\} \qquad (4.6)$$
of the DWT for $M+1$ levels are obtained. Each signal

$$s_{m+1}(n) = s_m(n) + s^d_m(n) \quad \text{for all } n \in \mathbb{Z} \qquad (4.7)$$
at resolution level $m+1$ is split into the approximation (coarse signal)
$$s_m(n) = \sum_k c_{m,k}\,\phi_{m,k}(n) \qquad (4.8)$$
at the lower, $m$th resolution level and the high-frequency details
$$s^d_m(n) = \sum_k d_{m,k}\,\psi_{m,k}(n). \qquad (4.9)$$

The wavelet transformation can be viewed as a tree. The root of the tree consists of the coefficients of the wavelet series (4.1) of the original speech signal. The first level of the tree is the result of one step of (4.5). Subsequent levels in the tree are constructed by recursively applying (4.4)


Figure 4.1: Wavelet transform outperforms STFT because it has higher resolution for higher frequencies.

and (4.5) to split the spectrum into the low (approximation $c_{m,n}$) and high (detail $d_{m,n}$) parts. Experiments undertaken by us show that decomposing the speech signal into six levels is sufficient (see Fig. 4.3) to cover the frequency band of the human voice (see Table 4.1); the energy of the speech signal above 8 kHz and below 125 Hz is very low and can be neglected. The same experiment was conducted using 7 subbands and worse results were obtained.

There is a wide variety of possible basis functions from which a DWT can be derived. To determine the optimal choice of wavelet, we analysed six different wavelet functions: Meyer (Fig. 4.2), Haar, Daubechies wavelets of 3 different orders and symlets. Our results show that the discrete Meyer wavelet gives the best results.
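A minimal sketch of such a six-level analysis with PyWavelets' discrete Meyer wavelet follows; the sine test signal is a stand-in for a real speech frame, and the comment maps PyWavelets' output order onto the notation of (4.6).

```python
import numpy as np
import pywt

fs = 16000
t = np.arange(2 ** 13) / fs                # length a power of 2, as assumed
signal = np.sin(2 * np.pi * 440 * t)       # stand-in for a speech signal

# wavedec returns [c1, d1, d2, ..., d6] in the numbering used here:
# the first detail vector is the coarsest (lowest-frequency) level d1
coeffs = pywt.wavedec(signal, "dmey", level=6)
c1, details = coeffs[0], coeffs[1:]
for m, d in enumerate(details, start=1):
    print(f"d{m}: {len(d)} coefficients")
```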

4.2 General Description of the Segmentation Method

Phonemes are characterised by differing frequency content, and so we would expect changes in the power of different wavelet resolution levels between phonemes. Clearly, it would be easiest to analyse the absolute value of the rate-of-change of power and expect it to be large at the beginning and at the end of phonemes. However, this does not uniquely define start and end points, for two reasons. Firstly, the power can rise over a considerable length of time at the start of a phoneme, leading to an ambiguous start time. Secondly, there may also be rapid changes in power in the middle of a segment. A better method of detecting phoneme boundaries relies on power transitions between the DWT subbands. Our approach (Ziołko et al., 2006b) is based on a six-level DWT analysis (i.e. M = 6) of a speech signal (Fig. 4.3).


Figure 4.2: The discrete Meyer wavelet (dmey) and its scaling function.

Figure 4.3: Subband amplitude DWT spectra (levels d1 to d6) of the Polish word ’osiem’ (Eng. ’eight’). The number of samples depends on the resolution level.


Table 4.1: Characteristics of the discrete wavelet transform levels and their envelopes

Level   Band (kHz)     No. of samples   Window
d6      8 - 4                32              5
d5      4 - 2                16              5
d4      2 - 1                 8              5
d3      1 - 0.5               4              3
d2      0.5 - 0.25            2              3
d1      0.25 - 0.125          1              3

The number $2^{-M+m-1}N$ of wavelet spectrum samples at the $m$th level (where $m = 1, \ldots, M$) depends on the length $N$ of the speech signal in the time domain, assuming $N$ is a power of 2. Table 4.1 presents their number at each level relative to the lowest resolution level. The power waveform
$$p_m(n) = \sum_{j=1}^{2^{m-1}} d^2_{m,\,j-1+n2^{m-1}}, \quad \text{where } n = 0, \ldots, 2^{-M}N - 1, \qquad (4.10)$$
is computed so as to obtain an equal number of power samples for all subbands.
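A sketch of (4.10) follows: squared coefficients at level m are summed in groups of 2^(m-1), so every subband yields the same number of power samples. It assumes the ideal dyadic lengths of the equation (an orthogonal DWT of a power-of-two signal); PyWavelets' 'dmey' filters add a few boundary coefficients, which are trimmed here for simplicity.

```python
import numpy as np

def subband_power(details):
    """details = [d1, ..., dM] (coarse to fine); returns an M x n array."""
    n_out = min(len(d) // 2 ** i for i, d in enumerate(details))
    powers = []
    for m, d in enumerate(details, start=1):
        w = 2 ** (m - 1)                 # coefficients per power sample
        p = np.square(np.asarray(d)[:n_out * w]).reshape(n_out, w).sum(axis=1)
        powers.append(p)
    return np.vstack(powers)
```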

The DWT subband power shows rapid variations (see Fig. 4.3) and, despite the smoothing in (4.10), the power waveforms still change rapidly. The first order differences of the power are inevitably noisy, so we calculate envelopes p^{en}_m(n) of the power fluctuations in each subband by choosing the highest values of p_m(n) in a window of a given size ω (see Table 4.1), obtaining a power envelope (Fig. 4.4). A smoothed differencing operator is used: the subband power p_m is convolved with the mask [1, 2, -2, -1] to obtain smoothed rate-of-change information r_m(n).

In order to improve accuracy, a minimum threshold p_min was introduced for the subband DWT power. This threshold was chosen experimentally as 0.0002 for the test corpus. It prevents us from analysing noise where the power of the speech signal is very small (for example in areas of 'silence'), even though noise is very low in the test corpus. The parameter p_min can easily be chosen for another corpus by analysing a part of it that contains noise only; p_min can then be set to 110% of the noise power.

The start and end of a phoneme should be marked by an initially small, but rapidly rising power level in one or more of the DWT levels. In other words, the derivative can be expected to be approximately as large as the power. Phoneme boundaries can therefore be detected by searching for discrete times n for which the inequality

    p >= | \beta |r_m(n)| - p^{en}_m(n) |    (4.11)

holds. The constant p is a threshold which accounts for the time scale and sensitivity of the crossing points; we found that setting p to 0.1 gave the best results. The rate-of-change function r_m is multiplied by a scaling factor β, approximately equal to 1, which allows the power envelope to be compared with the product β|r_m(n)|.
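The envelope, the smoothed differencing and condition (4.11) can be sketched together; the sliding-maximum envelope and the convolution call are an assumed reading of the text, with p = 0.1 and p_min = 0.0002 taken from it:

```python
import numpy as np

MASK = np.array([1.0, 2.0, -2.0, -1.0])       # smoothed differencing mask

def power_envelope(p, window):
    """Highest value of p_m(n) in a sliding window (sizes per Table 4.1)."""
    padded = np.pad(p, window // 2, mode='edge')
    return np.array([padded[i:i + window].max() for i in range(len(p))])

def rate_of_change(p):
    """Convolve the subband power with the [1, 2, -2, -1] mask."""
    return np.convolve(p, MASK, mode='same')

def candidate_points(p, window, thr=0.1, p_min=0.0002, beta=1.0):
    """Sketch of condition (4.11): discrete times where beta*|r_m(n)| comes
    within thr of the power envelope, which must itself exceed p_min."""
    env = power_envelope(p, window)
    r = np.abs(rate_of_change(p))
    return np.where((np.abs(beta * r - env) <= thr) & (env > p_min))[0]
```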


Figure 4.4: Segmentation of the Polish word 'osiem' (eng. eight) based on DWT subbands (panels d6 to d1). Dotted lines are hand segmentation boundaries; dashed lines are automatic segmentation boundaries; bold lines are envelopes and thin lines are smoothed rate-of-change

4.3 Phoneme Detection Algorithm

Without any additional refinement, the above method may not be able to detect the phoneme

boundaries precisely. There are several reasons for this. First, the exact locations of the boundaries

may vary slightly between subbands, and for some phonemes, only one frequency band may show

significant variations in power, while for others several subbands may show variations in power.

Sometimes analysis will detect slightly separate boundaries for different subbands. Secondly,

despite smoothing the derivative, there may be a number of transitions which represent the same

boundary. This problem was approached by noting the transitions and other situations which are

likely to happen for phoneme boundaries using e(n), which will be referred to as an event function.

Such an approach let us consider several scenarios and aspects of potential phoneme boundaries.

It also allows us to improve the method easily by adding additional events to the existing list.

The suggested events are presented in Table 4.2 and explained in detail later. Surprisingly, pre-emphasis filtering was found to degrade quality, so it was not used in the final version of the algorithm:

1. Normalise a speech signal by dividing by its maximum value in an analysed fragment of

speech.

2. Decompose a signal into six levels of the DWT.

3. Calculate (4.10) in all frequency subbands to obtain the power representations p_m(n) of the mth subband.

4. Calculate the envelopes p^{en}_m (Fig. 4.4) of the power fluctuations in each subband by choosing the highest values of p_m in a window of a given size ω, according to Table 4.1.

5. Calculate the rate-of-change function (Fig. 4.4) r_m(n) by filtering p_m(n) with the [1, 2, -2, -1] mask.

6. Create an event function e(n) = 0 for all n. In the next step its value will be increased to record events for which r_m(n) and p^{en}_m(n) look like a phoneme boundary at a given n.

7. Analyse r_m(n) and p^{en}_m(n) for each DWT subband to find the discrete times n for which the event conditions described in Table 4.2 hold. Add the value of the event importance (as per Table 4.2) to the event function e(n) (Fig. 4.5) for the given discrete time n. If several events occur for a single discrete time, sum the event importances of all of them. Repeat the step for all discrete times n. In this way we obtain a boundary distribution-like function

        e(n) = { 0            if no condition is fulfilled for n,
               { \sum_i w_i   otherwise,                             (4.12)

   where w_i are the importance weights (see Table 4.2) of the events that occurred at n in all subbands.

8. Search for a discrete time n, starting from 1, for which the event function is higher than a decision threshold τ. A threshold value of τ = 4 was chosen experimentally.

9. Find all the discrete times t_i for which

        e(t_i) > \tau - 1,
        t_i > n,
        t_{i+1} - t_i < \alpha,                                      (4.13)

   where n is the last index analysed in the previous step and α is associated with the minimal phoneme length (α = 4 gives approximately 20 ms). Organise all the discrete times t_i into separate groups of those fulfilling the above conditions.


Table 4.2: Types of events associated with a phoneme boundary. The mathematical conditions are based on the power envelope p^{en}_m(n), the rate-of-change information r_m(n), a threshold p on the distance between r_m(n) and p^{en}_m(n), a threshold p_min on the minimal p^{en}_m(n), and β = 1. The four importance values are for different DWT levels (the first for level d1, the second for d2, the third for levels d3 to d5 and the last for level d6)

    Description                    Mathematical condition                       Importance (d1, d2, d3-d5, d6)
    Quasi-crossing point           |β|r_m(n)| - p^{en}_m(n)| < p and            1 3 4 1
                                   (|β|r_m(n+1)| - p^{en}_m(n+1)| > p or
                                    |β|r_m(n-1)| - p^{en}_m(n-1)| > p) and
                                   p^{en}_m(n) > p_min
    Crossing point, first case     β|r_m(n)| > p^{en}_m(n) + p and              1 3 4 1
                                   β|r_m(n+1)| < p^{en}_m(n+1) - p and
                                   p^{en}_m(n) > 5 p_min
    Crossing point, second case    β|r_m(n)| < p^{en}_m(n) - p and              1 3 4 1
                                   β|r_m(n+1)| > p^{en}_m(n+1) + p and
                                   p^{en}_m(n) > 5 p_min
    Rate-of-change higher than     β|r_m(n)| > p^{en}_m(n) and                  1 2 2 1
    power envelope                 p^{en}_m(n) > 2 p_min

10. Calculate the weighted mean discrete time b from the discrete times grouped in the previous step,

        b = \sum_i t_i w_i / \sum_i w_i .                            (4.14)

    The index b is the detected phoneme boundary in the discrete timing of DWT level d1, which was used in the algorithm as the common timing for all other subbands by summing samples.

11. Repeat the previous three steps for the next discrete times n, until the largest n with a non-zero value of the event function e(n) has been processed.
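Steps 8-11 can be sketched as below, under the assumption that the scan runs left to right and a group is closed once the gap to the next above-threshold time reaches α:

```python
import numpy as np

def detect_boundaries(e, tau=4.0, alpha=4):
    """Sketch of steps 8-11: open a group when e(n) exceeds tau, extend it
    with times exceeding tau - 1 (the hysteresis rule) lying within alpha
    samples of the previous one, and return the weighted mean (4.14)
    of each group as a detected boundary."""
    boundaries, times, weights = [], [], []

    def close_group():
        if times:
            boundaries.append(np.average(times, weights=weights))
            times.clear()
            weights.clear()

    for n, w in enumerate(e):
        if times and n - times[-1] >= alpha:
            close_group()                     # gap too long: finish group
        if times:
            if w > tau - 1:                   # lowered in-group threshold
                times.append(n)
                weights.append(w)
        elif w > tau:                         # start a new group
            times.append(n)
            weights.append(w)
    close_group()
    return boundaries
```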

Table 4.2 describes the events which can be expected to occur in the power of DWT subbands.

Some of them are more crucial than others. In our previously published work (Ziołko et al., 2006b)

only the first of them was used. Additionally, different weights were given to events with respect

to the subband in which they occur. This is a perceptually motivated idea which was used very successfully in PLP (Hermansky, 1990). According to that study, information in relatively high and low frequency subbands is not as important to the human ear as information in the bands from 345

Hz to 2756 Hz. Briefly, the Hermansky solution (Hermansky, 1990; Hermansky and Morgan,

1994) used a window to modify speech, attenuating frequencies not crucial for the human ear and amplifying the most important ones. The same aim was followed in our solution by giving low weights to events occurring in detectable but less important frequencies, and higher weights to the middle of the human hearing bands. Six DWT subbands were used. The third, fourth and fifth

were grouped together as the middle and most crucial ones. As a result, four columns with importance values (weights) are presented in Table 4.2: the first for the d1 level, the second for the d2 level, the third for the levels from d3 to d5 and the last for the d6 level.

Figure 4.5: The event function e(n) versus time (in ms) for the word presented in Fig. 4.4. High event scores mean that a phoneme boundary is more likely

There are four possible events presented in Fig. 4.6 and described in Table 4.2. Some of them

are quite similar. It has to be stressed that for some discrete times and subbands more than one

event can occur (typically two, very rarely more). In this case the weights of all such events are added to the event function e(n). In all cases, the values of the rate-of-change information |r_m(n)| are multiplied by the scaling factor β, equal to 1. The first event is called the quasi-crossing point. It is the most general and common one. The mathematical condition for this event detects discrete times at which the power envelope p^{en}_m(n) and the absolute value of the rate-of-change information |r_m(n)| cross or approach each other very closely (to within the threshold p). Additionally, the power envelope p^{en}_m(n) has to be higher than the threshold p_min.

The second and third events are twin events and represent rarer cases, namely the crossing of the power envelope p^{en}_m(n) and the absolute value of the rate-of-change |r_m(n)| when p^{en}_m(n) is five times higher than the minimum threshold p_min. The second and third cases are thus used to detect and note more specific situations than the first one, because typically fulfilling one of those conditions means fulfilling the first one as well. As we sum all event importances for a given n, this causes a higher value of the event function e(n) than the first event alone would. In these cases, one of the functions p^{en}_m(n) and |r_m(n)| starts at a higher level than the other and drops below it, suggesting a phoneme boundary very clearly.

Figure 4.6: Simple examples of the four events described in Table 4.2. They are characteristic of phoneme boundaries. The images present the power envelope p^{en}_m(n) and the rate-of-change information (derivative) r_m(n)

The fourth event is also quite rare and covers situations where the DWT spectrum changes very rapidly, which happens at changes in speech content such as phoneme boundaries. In this situation the level of p^{en}_m(n) can be relatively low. We search for discrete times where the absolute value of the rate-of-change information |r_m(n)| is higher than the power envelope p^{en}_m(n), and p^{en}_m(n) is higher than double the minimum threshold. The fourth event is different, because it does not describe anything similar to the crossings used in the general description of the method in the previous section. However, such a high |r_m(n)| also indicates that a phoneme boundary may occur. It is a less strict and more general condition, so a lower weight was given.

The values of the thresholds in the first three events were chosen to make the second and third events more difficult to fulfil than the first one. The threshold in the fourth event type was chosen experimentally.
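The four conditions of Table 4.2 translate directly into predicates; the sketch below checks them at one discrete time in one subband (handling of the array ends is left out for brevity):

```python
def events_at(env, r, n, p=0.1, p_min=0.0002, beta=1.0):
    """Sketch of the four event conditions of Table 4.2 at time n in one
    subband; env is the power envelope and r the absolute rate-of-change."""
    dist = lambda k: abs(beta * r[k] - env[k])
    found = []
    if dist(n) < p and (dist(n + 1) > p or dist(n - 1) > p) \
            and env[n] > p_min:
        found.append('quasi-crossing')
    if beta * r[n] > env[n] + p and beta * r[n + 1] < env[n + 1] - p \
            and env[n] > 5 * p_min:
        found.append('crossing-1')
    if beta * r[n] < env[n] - p and beta * r[n + 1] > env[n + 1] + p \
            and env[n] > 5 * p_min:
        found.append('crossing-2')
    if beta * r[n] > env[n] and env[n] > 2 * p_min:
        found.append('rate-above-envelope')
    return found

# Importance weights per event and subband group (d1, d2, d3-d5, d6),
# copied from Table 4.2; e(n) accumulates these over events and subbands.
WEIGHTS = {'quasi-crossing':      (1, 3, 4, 1),
           'crossing-1':          (1, 3, 4, 1),
           'crossing-2':          (1, 3, 4, 1),
           'rate-above-envelope': (1, 2, 2, 1)}
```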

The method is designed so that it is easy to improve by introducing additional conditions. A new condition can simply add or subtract (negative events, which imply that a boundary did not occur, are not included in this solution but are generally possible) values to e(n) at the discrete times where it is fulfilled. Another aspect of the 'intelligence' of the method is that, even though it consists of several conditions, its sensitivity can easily be changed by setting another decision threshold. The decision threshold is lowered by 1 when finding the subsequent discrete times of a group (compared to the first one) due to a hysteresis rule. The application of hysteresis to the threshold produces better results.

The algorithm is implemented in the Matlab environment and is not optimised for time efficiency. In its current version it needs 14 minutes to segment the whole corpus using the Haar wavelet (the lowest order of filters) and 20 minutes using the discrete Meyer wavelet (the highest order of filters, namely 50). The corpus has 16425 utterances (some of them are sentences), which gives 0.05 s per utterance for the Haar version and 0.07 s for the Meyer one. Properly optimised code in C++ would be much more time efficient. The experiment was conducted on a computer with an AMD Athlon 64 3500+ processor (990 MHz) and 1.00 GB of RAM.

The method was developed on a set of 50 hand-segmented Polish words with the sampling frequency f0 = 11025 Hz, equivalent to a sampling period t0 = 90.7 µs. In order to assess the quality of our results, the method was tested on CORPORA. None of the CORPORA utterances were in the original set used during development. Hand segmentation was done by different people for the small development set and for CORPORA.


Figure 4.7: The general scheme of the set G of correct boundaries and the set A of detected ones. Elements of set A have a grade f(x) standing for the probability of being a correct boundary. Set G can contain elements which were not detected (in the left part of the set)

4.4 Fuzzy Sets for Recall and Precision

Fuzzy logic is a tool for embedding structured human knowledge into workable algorithms. In a narrow sense, fuzzy logic is a logical system aimed at providing a model for modes of human reasoning that are approximate rather than exact. In a wider sense, it is treated as a fuzzy set theory of classes with unsharp boundaries (Kecman, 2001). Fuzzy logic has found many applications in artificial intelligence, because it allows numerical and symbolic processing of human-like knowledge. This kind of processing is needed for the proper evaluation of many types of segmentation. In our case we are interested in the location of speech-unit boundaries, for example phonemes (Fig. 4.8). Detected boundaries may be shifted more or less with respect to a manual segmentation. This 'more or less' makes a crucial difference and cannot be described mathematically in Boolean logic. Fuzzy logic introduces the opportunity of grading detected boundary locations in a more sensitive and human-like way.

Our segmentation evaluation method is based on the well-known recall and precision evaluation. However, in our approach, calculated boundary locations are elements of a fuzzy set and a binary T-norm operation describes their memberships. A T-norm is defined as a function T : [0, 1] × [0, 1] → [0, 1] which satisfies commutativity, monotonicity and associativity, and for which 1 acts as the identity element. As usual in recall and precision, one set contains the relevant elements; the other is the set of retrieved boundaries. We calculate an evaluation grade using the number of elements in each of them and in their intersection. Comparing the number of relevant boundaries with the number of elements in the intersection gives precision. In a Boolean version of the evaluation method this is information about how many correct boundaries were found; by using fuzzy logic we evaluate not only how many boundaries were detected, but also how accurately they were detected. Comparing the number of retrieved elements with the intersection gives recall, which grades wrong detections. Here fuzzy logic allows us to evaluate not only the number of wrong detections but also their degree of incorrectness. Each retrieved boundary has a probability factor which represents how likely it is to be correct.

Figure 4.8: An example of phoneme segmentation of a single word. Hand segmentation is drawn in the lower part; boundaries are represented by two indexes close to each other (sometimes overlapping). The upper columns present the segmentation of the word produced by a segmentation algorithm. All of the calculated boundaries are quite accurate but never perfect

4.5 Algorithm of Speech Segmentation Evaluation

In this section we present an example of applying the approach described in the previous section to phoneme speech segmentation (Fig. 4.8). Due to the described features, such segmentation and its evaluation are particularly useful in ASR. In this case we have to make three assumptions:

• Hand segmentation (ground truth) is given as a set of narrow ranges. Neighbouring pho-

nemes overlap each other in these ranges.

• Detected boundaries are represented as a set of single indexes.

• We assume the perfect detection of silence. Silence segments may be of almost any length; including them in the evaluation would therefore cause serious inaccuracy. This is why we skip silence segments in the evaluation.

The method proceeds as follows:

1. Assign the first and last detected boundaries the same values as the hand-segmented ones (typically the first and the last index). This has to be done because of the third assumption.

2. Start by matching the closest detected and hand-segmented boundaries. They need to be matched in pairs; each boundary may have only one matched boundary from the other set. Do the following steps for each ith detected boundary a_i, starting from the first.


3. Calculate grades of being relevant and retrieved. All matched pairs are elements of two sets, of which one is fuzzy. All non-matched detected and hand-segmented boundaries are elements of one set only. Let G denote the set of relevant (correct) elements and let A denote the ordered set containing the retrieved (detected) boundaries. For each segmentation boundary x in A we define a fuzzy membership function f(x) that describes the degree to which x has been accurately segmented. There are three different scenarios for calculating the membership function f(x):

• A hand-segmented boundary not matched with any detected boundary is an element of set G only.

• A detected boundary x not matched with any hand-segmented boundary is an element of set A and has f(x) = 0. The last detected boundary in Fig. 4.8 is such a case.

• If a detected boundary x is inside a hand-segmented boundary range, the boundary is an element of both sets A and G. The other probabilistic factor is Boolean and represents membership of the set of hand-segmentation boundaries. We use the algebraic product of these two probabilistic grades as a T-norm to find a membership grade of the intersection. In the situation where x is inside the hand-segmented boundary range, f(x) = 1.

• Otherwise it is a fuzzy case and f(x) = (a - b)/a, where a stands for half of the length of the phoneme in which the detected boundary is situated and b stands for the distance between the hand-segmented boundary and the detected one (Fig. 4.9). All boundaries in Fig. 4.8 apart from the last one are examples of this case, which shows how useful fuzzy logic can be in segmentation evaluation.

Figure 4.9: Fuzzy membership (the membership grade falls from 1 at the phoneme midpoint to 0 at its start/end point)

4. Fuzzy precision can be calculated as

        P = \sum_{x \in A} f(x) / |G| .                              (4.15)

5. Fuzzy recall equals

        R = \sum_{x \in A} f(x) / |A| .                              (4.16)

Recall and precision can be used to give a single evaluation grade in many different ways, depending on which of them is more important. A widely used way is to calculate the f-score (van Rijsbergen, 1979)

        F = (\beta^2 + 1) P R / (\beta^2 P + R),                     (4.17)

where β is a parameter of the f-score. Often β = 1, that is, precision and recall are given equal weights. Higher β values favour recall over precision.
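The fuzzy grades can be sketched directly from (4.15)-(4.17), keeping the thesis's own convention of dividing by |G| for precision and by |A| for recall; the input representation (pairs of the half-phoneme-length a and the offset b) is an assumption:

```python
def fuzzy_evaluation(matched, n_unmatched_detected, n_unmatched_reference,
                     beta=1.0):
    """matched: (a, b) pairs, a = half the length of the phoneme containing
    the detected boundary, b = its distance from the reference boundary
    (b = 0 inside the reference boundary range, giving f(x) = 1)."""
    f = [max(0.0, (a - b) / a) for a, b in matched]    # memberships f(x)
    f += [0.0] * n_unmatched_detected                  # wrong detections
    size_A = len(f)                                    # retrieved set |A|
    size_G = len(matched) + n_unmatched_reference      # relevant set |G|
    P = sum(f) / size_G                                # Eq. (4.15)
    R = sum(f) / size_A                                # Eq. (4.16)
    F = (beta ** 2 + 1) * P * R / (beta ** 2 * P + R)  # Eq. (4.17)
    return P, R, F
```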


4.6 Comparison to Other Evaluation Methods

Evaluation methods are always subjective and there is no way to grade them statistically. This is why it is difficult to compare evaluation methods and judge which one is better. Because it cannot be proved that our method outperforms the others, we present an example which might explain why we believe so. There is no standard method, but all evaluations are based on insertions and deletions with some tolerances. Let us compare the use of such methods with the fuzzy recall and precision for the example presented in Fig. 4.8. The indexes come from the segmentation method (Ziołko et al., 2006b); one index unit corresponds to 5.8 ms. The very first and last boundaries are not included, due to the assumption that they are perfectly detected. Table 4.3 lists the membership function f(x) for all boundaries. In the lower rows, insertions and deletions with all possible tolerances are marked. The symbol X stands for a boundary with a deletion or insertion at a given tolerance, while ✓ stands for a boundary accepted as correct at a given tolerance. The number of insertions and deletions is given in brackets in the first column.

As we use only a single word, the results are the same for many tolerance levels; for a larger corpus this does not happen. It is clearly visible that counting insertions and deletions is less accurate, unless one uses tolerance levels with a resolution equal to the index resolution. In particular, using a single tolerance level smooths away information about boundary detections: perfectly accurate detections are graded in the same way as imperfect ones that merely fall within the tolerance. Using several tolerance levels improves the quality of the evaluation, but is still just a step towards a high resolution evaluation method such as the suggested fuzzy recall and precision. Another issue is the length of phonemes. A method based on tolerances gives a grade without comparing the tolerance with the length of a given phoneme. In other words, our method is better because the membership function f(x) is calculated from the percentage of the phoneme length by which a boundary was missed, and not from a constant tolerance value. In the presented example, phoneme lengths vary from 11 index units (64 ms) to 47 (273 ms). For example, a tolerance of 3 (17 ms) is effectively much higher for the shortest unit than for the longest one. There is no such flaw in our method. The algorithm was implemented in C++. The final grades for the given word are: precision 0.813901, recall 0.697629, f-score 0.751293.

4.7 Experimental Results of DWT Segmentation Method

Our first set of results looks at the usefulness of the six wavelet functions for analysing phoneme boundaries. The results obtained for different wavelets (see Table 4.4) show the differences in their efficiency. They suggest that the discrete Meyer wavelet (Fig. 4.2) (Abry, 1997) performs best in this case, probably because of its symmetry in the time domain, which helps in the synchronisation of the subbands. Desynchronisation in the time domain can be caused by ripples in the frequency domain. An experiment using two wavelets (Meyer and sym6), one after another, was also conducted. As might be expected, it improved the results only a little, while almost doubling the time of calculation. Analysing seven subbands was also checked, where the seventh covered the band from 125 Hz down to 62.5 Hz.

The accuracy of our phoneme detection technique was then compared with some standard framing techniques (see Table 4.5), such as constant segmentation methods, where the speech is broken into fixed-length segments, and with the speech signal segmented randomly.


Table 4.3: Comparison of fuzzy recall and precision with commonly used methods based on insertions and deletions for an exemplar word

    beg     9     56    89    113   156   196   -
    end     10    58    90    114   158   198   -
    auto    15    59    97    112   159   195   206

    fuzzy recall and precision
    f(x)    0.78  0.93  0.36  0.91  0.95  0.95  0

    insertions and deletions without tolerance
    Ins(7)  X     X     X     X     X     X     X
    Del(6)  X     X     X     X     X     X     -

    with tolerance from 1 (5.8 ms) to 4 (23.2 ms) - same results
    Ins(3)  X     ✓     X     ✓     ✓     ✓     X
    Del(2)  X     ✓     X     ✓     ✓     ✓     -

    with tolerance 5 (29 ms) or 6 (34.8 ms)
    Ins(2)  ✓     ✓     X     ✓     ✓     ✓     X
    Del(1)  ✓     ✓     X     ✓     ✓     ✓     -

    with tolerance 7 (40.6 ms) or higher
    Ins(1)  ✓     ✓     ✓     ✓     ✓     ✓     X
    Del(0)  ✓     ✓     ✓     ✓     ✓     ✓     -

Table 4.4: Comparison of the proposed method using different wavelets

    Method             av. recall   av. precision   f-score
    Meyer              0.7096       0.7408          0.7249
    db2                0.6770       0.7562          0.7144
    db6                0.7029       0.7414          0.7217
    db20               0.7034       0.7408          0.7216
    sym6               0.7015       0.7426          0.7215
    haar               0.6377       0.8042          0.7113
    Meyer+sym6         0.6825       0.7936          0.7339
    Meyer, 7 subbands  0.6449       0.6714          0.6579

Table 4.5: Comparison of some other segmentation strategies and the proposed method

    Method          av. recall   av. precision   f-score
    Const 23.2 ms   0.9651       0.1431          0.2493
    Const 92.8 ms   0.7635       0.4659          0.5787
    SVM             0.50         0.33            0.40
    Wavelet         0.7096       0.7408          0.7249


The accuracy of constant segmentation for many multiples of 5.8 ms (the time between neighbouring discrete times) was evaluated, but we present results only for 23.2 ms, as it corresponds to the typical frame length in speech recognition, and for 92.8 ms, for which the result is the best of all constant segmentations. We also trained an SVM using powers and derivatives from the DWT subbands. The features for the SVM included the analysed part of speech as well as its left and right context. No other phoneme segmentation method available for comparison was found. While constant segmentation is able to find most of the boundaries with a 23.2 ms frame, this is only at the expense of very short segments and many irrelevant boundaries. The overall score of our method is much superior to the constant segmentation approach.

Several researchers claim that syllables are better basic units for ASR than phonemes (Frankel et al., 2007). This is probably true in terms of their content, but it does not seem to hold for detecting unit boundaries. Our method is not perfect, but the observed DWT spectra of speech clearly show that boundaries between phonemes can be extracted. Boundaries between syllables do not seem to differ from phoneme boundaries in the observed DWT spectra, while obviously there are fewer syllable boundaries than phoneme ones. It is therefore difficult to detect syllable boundaries without also finding phoneme boundaries when analysing DWT spectra.

4.8 Evaluation for Different Types of Phoneme Transitions

Errors in phoneme segmentation depend on the type of transition being detected. The evaluations differ between groups of phonemes, because some phonemes have similar spectra while others differ a lot. These differences depend on the acoustic properties of phonemes (Kepinski, 2005). The transitions which are more likely to cause errors should be analysed with more care, for example by applying several segmentation methods and considering all their results.

The following types of phonemes exist in Polish (Kepinski, 2005):

1. Stops (/p/, /b/, /t/, /d/, /k/, /g/)

2. Nasal consonants (/m/, /n/, /ni/, /N/)

3. Mouth vowels (/i/, /y/, /e/, /a/, /o/, /u/)

4. Nasal vowels (/ę/, /ą/)

5. Palatal consonants (Polish 'glajdy') (/j/, /ł/)

6. Unstables (Polish 'płynne') (/l/, /r/)

7. Fricatives (/w/, /f/, /h/, /z/, /s/, /zi/, /si/, /rz/, /sz/)

8. Closed fricatives (/dz/, /c/, /dzi/, /ci/, /drz/, /cz/)

9. Silence in the beginnings and ends of recordings

10. Silence inside words


Figure 4.10: F-score of phoneme boundary detection for transitions between the types of phonemes listed above. Phoneme types 1-10 are explained in section 4.8 (1 - stops, 2 - nasal consonants, etc.)

This division is based on the acoustic properties of phonemes. We do not have enough statistical data to calculate results for transitions between all 39 individual phonemes, but it can be assumed that transitions between phonemes of two particular groups face similar problems due to co-articulation and other natural phonetic phenomena. Tables 4.6, 4.7, 4.8 and Fig. 4.10 present the evaluation of phoneme segmentation with regard to the transitions between the types listed above. A value of 0 means that there was no transition of this type.

Table 4.6: Recall for different types of phoneme transitions.

    Type   1      2      3      4      5      6      7      8      9      10
    1      0.7204 0.6101 0.5114 0.5776 0.5818 0.5007 0.5877 0.6456 0.5210 0.4194
    2      0.6015 0.5555 0.4686 0.5812 0.5474 0.5087 0.5817 0.6062 0.5658 0.2129
    3      0.4886 0.4493 0.5069 0.0821 0.4605 0.3776 0.4218 0.5872 0.4741 0.3712
    4      0.5089 0.4816 0.5384 0      0.4215 0.4388 0.4380 0.5015 0.3692 0.2155
    5      0.6403 0.5790 0.4534 0.5362 0.5942 0.5520 0.5829 0.6072 0.5563 0.0702
    6      0.5624 0.5445 0.4690 0.5553 0.5428 0.4768 0.5781 0.5558 0.5885 0.2630
    7      0.6148 0.5320 0.4389 0.5299 0.4641 0.4708 0.5203 0.5911 0.5784 0.4661
    8      0.6216 0.5593 0.4771 0.5424 0.4281 0.5288 0.5372 0.6387 0.5169 0.1388
    9      1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0      0.0227
    10     0.0399 0.1399 0.4180 0      0.0335 0.0643 0.0835 0.0289 0      0

All silences before speech are marked as perfectly detected due to the evaluation algorithm. Apart from that, silences were not detected very well. The reason is that the segmentation method is tuned to phoneme boundaries and not to speech-silence transitions. Other very efficient methods for this task are already established (Zheng and Yan, 2004).


Table 4.7: Precision for different types of phoneme transitions.

    Type   1      2      3      4      5      6      7      8      9      10
    1      0.6927 0.5788 0.4783 0.5299 0.5465 0.4741 0.5599 0.6094 0.3115 0.4108
    2      0.5523 0.4858 0.4021 0.4996 0.4952 0.4783 0.5375 0.5569 0.3928 0.2129
    3      0.4171 0.3692 0.4433 0.0771 0.3963 0.3033 0.3470 0.5207 0.2899 0.3423
    4      0.4199 0.4124 0.4735 0      0.3789 0.4073 0.3405 0.4222 0.1987 0.1826
    5      0.5943 0.5465 0.3731 0.4688 0.5554 0.5252 0.5488 0.5443 0.3811 0.0645
    6      0.4838 0.4976 0.3987 0.4811 0.4811 0.4271 0.5303 0.5174 0.4203 0.2630
    7      0.5762 0.4875 0.3732 0.4835 0.4150 0.4208 0.4798 0.5324 0.4158 0.4452
    8      0.5573 0.4938 0.4154 0.4926 0.3511 0.4869 0.4809 0.5692 0.3209 0.1333
    9      1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0      0.0227
    10     0.0365 0.1399 0.4035 0      0.0310 0.0620 0.0835 0.0289 0      0

Table 4.8: F-score for different types of phoneme transitions. Scores above 0.5 were marked in bold in the original.

    Type   1      2      3      4      5      6      7      8      9      10
    1      0.7063 0.5940 0.4943 0.5528 0.5636 0.4870 0.5734 0.6270 0.3899 0.4150
    2      0.5759 0.5183 0.4328 0.5373 0.5200 0.4931 0.5587 0.5805 0.4637 0.2129
    3      0.4500 0.4053 0.4730 0.0795 0.4260 0.3364 0.3807 0.5519 0.3598 0.3562
    4      0.4601 0.4443 0.5038 0      0.3991 0.4225 0.3831 0.4584 0.2583 0.1977
    5      0.6164 0.5623 0.4093 0.5002 0.5742 0.5383 0.5654 0.5740 0.4523 0.0672
    6      0.5202 0.5200 0.4310 0.5155 0.5101 0.4506 0.5532 0.5359 0.4904 0.2630
    7      0.5949 0.5088 0.4034 0.5056 0.4382 0.4444 0.4992 0.5602 0.4838 0.4555
    8      0.5877 0.5245 0.4441 0.5163 0.3858 0.5070 0.5075 0.6019 0.3960 0.1360
    9      1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0      0.0227
    10     0.0382 0.1399 0.4106 0      0.0322 0.0632 0.0835 0.0289 0      0

DWT has also been tested for the parametrisation of speech (Farooq and Datta, 2004). The unvoiced stops (/p/, /t/, /k/) were found more difficult to recognise than vowels (/aa/, /ax/, /iy/) and unvoiced fricatives. In our case stops did not cause difficult problems for locating them correctly. Actually, the highest f-score (0.7063) was obtained for boundaries between two stops and the second highest (0.6270) between stops and closed fricatives. Transitions from palatal consonants to stops were also evaluated highly (0.6164). Transitions between two closed fricatives were another group that was easy to detect (0.6019).

The most difficult transitions to detect were those from mouth vowels to any type apart from closed fricatives, especially to nasal vowels (0.0795), unstables (0.3364) and fricatives (0.3807). Transitions to mouth vowels were also difficult to locate correctly. The only exception was from nasal vowels to mouth vowels (0.5038), which is surprisingly high compared to 0.0795 for the transition in the opposite direction. Another group of boundaries with low f-scores were transitions from nasal vowels, apart from the mentioned transition to mouth vowels; especially difficult were transitions to fricatives (0.3831) and palatal consonants (0.3991). There are no transitions from one nasal vowel to another. The transitions from closed fricatives to palatal consonants, from unstables to unstables, and from fricatives to palatal consonants, unstables and other fricatives were also difficult to detect properly.

According to our results it is relatively easy to find a boundary between phonemes of the same group, if such a transition is possible; the f-score for such boundaries is usually above 0.5. This is slightly surprising and counterintuitive, because phonemes of the same group typically have similar spectra and could be expected to be difficult to differentiate.

Tables 4.6, 4.7 and 4.8 are not symmetric. This is not very surprising, because phoneme spectra are not symmetric: their starts and ends can vary significantly. This is why it might be easier to locate the beginning of a particular phoneme than its end.

The gained statistical knowledge can improve the quality of segmentation. In large vocabulary continuous speech recognition, recognition follows segmentation. If a phoneme which is known to cause segmentation errors is detected, its boundaries can be re-evaluated by another, more sophisticated (or simply different) method. Then another segmentation decision can be taken, leading to a better final recognition.

4.9 LogitBoost WEKA Classifier Speech Segmentation

WEKA is graphical data mining and machine learning software providing many classifiers. The procedure called 'boosting' is an important classification methodology. The WEKA LogitBoost classifier is based on the well-known AdaBoost procedure (Friedman et al., 1999). The AdaBoost procedure trains classifiers on weighted versions of the training samples, giving higher weights to those which are misclassified. This part of the procedure is conducted for a sequence of weighted samples; afterwards the final classifier is defined as a linear combination of the classifiers from each stage. Logistic boosting (Friedman et al., 1999) uses an adaptive Newton algorithm to fit an additive multiple logistic regression model, so it calls a classifier repeatedly in series. A distribution of weights is updated each time; in this way it indicates the importance of the examples in the data set for the classification. The main point of being adaptive is that, on each round, the weights of incorrectly classified examples are increased, so the new classifier focuses more on those examples. Logistic regression fits data to a logistic curve to predict the probability of occurrence of an event.

There were many more non-boundary points in the feature space than points which really represent boundaries. This is why we cloned all sets of features representing phoneme boundaries 30 times, to keep a similar ratio of boundaries and non-boundaries. We used 70% of all feature points as training data and 30% for testing in every experiment.
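The data preparation can be sketched as below; the names are illustrative, and the sketch mirrors the procedure described above (cloning the boundary vectors before the random 70/30 split):

```python
import numpy as np

def balance_and_split(boundary_feats, other_feats, clones=30,
                      train_frac=0.7, seed=0):
    """Sketch: repeat every boundary feature vector 30 times to balance the
    classes, then shuffle and split 70/30 into training and test sets."""
    X = np.vstack([np.repeat(boundary_feats, clones, axis=0), other_feats])
    y = np.concatenate([np.ones(len(boundary_feats) * clones),
                        np.zeros(len(other_feats))])
    order = np.random.default_rng(seed).permutation(len(X))
    cut = int(train_frac * len(X))
    return (X[order[:cut]], y[order[:cut]],    # training data
            X[order[cut:]], y[order[cut:]])    # test data
```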

4.10 Experimental Results for LogitBoost

Seven different sets of features were tested with the same classifier and the same test data, to check which features are useful; the differences between the sets are described below. The classification was evaluated using the popular precision and recall measures (van Rijsbergen, 1979), which are presented in tables, and by the percentage of properly classified instances, which is given in the text for all cases. Two evaluations are provided for every set of features to help in grading the method, because we did not manage to find any other similar system to present as a baseline.

We started with one left and one right context subset of features to describe the part of the signal surrounding the analysed frame. We included the first and second derivatives, both of them smoothed; different subbands were smoothed using different windows (see Tab. 4.1). We found this method to be the most efficient in our previous experiments (Ziołko et al., 2006a). That gives 54 features in total. 64% of test instances were correctly classified. The more exact results using the recall and precision evaluation are presented in Tab. 4.9. The final measure is the f-score, presented separately for sets of features describing frames with boundaries and without. The second group is named phonemes in Tab. 4.9, as those are segments from inside phonemes, far from boundaries. From a practical point of view we are interested in detecting boundaries, so the evaluation of the classification of boundary frames is crucial. For the first set of features the most important grade is therefore the f-score of 0.45 (Tab. 4.9).

Table 4.9: Experimental results for the LogitBoost classifier. The rows labelled boundary are for classifying segments representing boundaries. The rows labelled phoneme present grades for classifying segments inside phonemes which are not boundaries. From a practical point of view the boundary labels are the important ones; the grades for phoneme labels are given for reference only.

    set of features                                 label      precision   recall   f-score
    Basic                                           boundary   0.583       0.366    0.45
                                                    phoneme    0.659       0.824    0.732
    Without smoothing the second derivative         boundary   0.588       0.386    0.466
                                                    phoneme    0.665       0.818    0.733
    Normalisation by whole energy value             boundary   0.551       0.077    0.135
                                                    phoneme    0.607       0.958    0.743
    By max in a subband for a given utterance       boundary   0.59        0.317    0.413
                                                    phoneme    0.649       0.851    0.737
    With wider context                              boundary   0.618       0.447    0.519
                                                    phoneme    0.682       0.811    0.741
    Even wider context but without 2nd derivative   boundary   0.699       0.162    0.263
                                                    phoneme    0.703       0.966    0.814
    Asymmetric context                              boundary   0.609       0.2      0.302
                                                    phoneme    0.712       0.939    0.81

We managed to improve the results slightly by leaving the second derivative unsmoothed; there were no other changes in the set of features. 64% of test instances were correctly classified, like for the previous set of features, but the more exact evaluation presented in Tab. 4.9 indicates some improvement through a higher f-score, namely 0.466.

In the next approach, we kept the same number and type of features, but the subband features were normalised by dividing by the whole energy. In that way, 60.384% of test instances were correctly classified, with an f-score of only 0.135 (Tab. 4.9).

We also tried a normalisation approach in which features were divided by the maximum in a given subband for the analysed utterance. 63.6347% of test instances were correctly classified, but the f-score is also quite low, namely 0.413 (Tab. 4.9). Surprisingly, neither normalisation method improved the results.

Finally, we experimented with a wider left and right context, adding more subsets of features for the signal around the analysed frame. We got 66% of test instances correctly classified by including two contexts to the left and two to the right. In that case we had a set of 90 features with a relatively high f-score of 0.519 (Tab. 4.9).

To use an even wider context, namely three to the left and three to the right, we had to skip the second derivative, because the number of features would have been too large for WEKA to operate on. In that way we had a set of 84 features. 70% of test instances were correctly classified, but the recall for boundary frames was very low, just 0.162, which caused the f-score to be only 0.263 (Tab. 4.9). This means that, in general, this set of features is not effective.

A context of three to the left and one to the right was also checked. In that experiment we used the second derivatives, so we had 90 features. We obtained a correctness of 70%, but the f-score for boundaries was again quite low, only 0.302 (Tab. 4.9).

4.11 Conclusion

ASR systems could be improved if an efficient phoneme segmentation method were found. Innovative segmentation software was designed and implemented in Matlab. An f-score of 0.72 was achieved for the phoneme segmentation task by analysing the envelopes of discrete Meyer wavelet subband powers and their derivatives. It is a very good result compared to 0.4 for the SVM, 0.58 for constant segmentation and 0.46 for the LogitBoost WEKA classifier. The DWT is a good tool for analysing speech and extracting segments for further analysis. It achieves better results than all baselines, including the WEKA machine learning LogitBoost classifier, for which several sets of features were tested and compared. Segmentation evaluation was also analysed and some flaws of typical approaches were identified; it was suggested that segmentation evaluation could be improved by the application of fuzzy logic.

Segmentation is a subfield of speech analysis which has not been investigated enough in ASR. Our solution shows a new direction of possible improvements in ASR for any language. Segmentation allows more precise modelling. Systems based on framing and HMMs miss some of the information in the speech, which could be used in recognition if efficient phoneme segmentation were done first. This information, once lost, cannot be recovered in the further steps, which results in worse efficiency of the whole system.

There are types of phoneme transitions which are more difficult to detect than others. The average f-score for our DWT-based segmentation method varies from 0.0795 to 0.7063 for transitions between different acoustic types of phonemes. The experiments support the hypothesis that, in general, it is more difficult to locate the boundaries of vowels than of other phonemes. One of the reasons can be that vowel spectra are often less distinctive than others. Another reason might be that vowels are relatively short compared to other types of phonemes.

The DWT is one of the most perceptually motivated analysis tools. It enables the extraction of subbands important to the human ear. It outperforms the STFT because the size of the DWT window changes depending on the frequency subband, as presented in Fig. 4.1. In the STFT, low and high frequencies are analysed with the same resolution. This is not efficient, because a relatively short frame is needed to analyse high frequencies, while it has to be proportionally longer for low frequencies. The DWT modifies these lengths automatically, while in the case of the STFT it is necessary to calculate a mel-frequency based cepstrum rather than the regular spectrum from the FFT.


Chapter 5

Language Models

Language modelling is a weak point of ASR. Most of the time n-grams are still the most efficient models; even though they are such a simple solution, it is difficult to train any better model because of data sparsity. Several experiments were conducted on n-best lists of hypotheses received from an HTK acoustic model, to re-rank the lists and improve recognition. The POS tagger model was presented in (Ziołko et al., 2008a) and the first results using a semantic model in (Ziołko et al., 2008b).

So far the most popular and often most effective language model is the n-gram model (2.6), described in the literature review chapter. The n-gram is very simple in its nature: it counts possible sequences of words and uses them to provide probabilities. It is quite unusual that there is no more sophisticated method which would perform better than n-grams by applying more complicated methods and calculations.

We did not find any published papers on applying POS tagging in ASR. This is why we decided to check whether it can be successfully used in language modelling instead of n-grams. It was quite a promising idea, as the grammatical structure of sentences can be described using POS tags, while they provide a much smaller set of model elements, because several words can be modelled by the same POS tag. One of the problems very often experienced with n-grams is a lack of training data caused by too many possible words. The situation is even worse in inflective languages, as was described for the example of Russian (Whittaker and Woodland, 2003), where the authors claim that 430,000 words are needed for Russian to provide the same vocabulary coverage as 65,000 for English. A similar situation can be expected for all inflective languages.

Language models can be based on the order of words in sentences, like n-grams, where words are processed as a sequence. Another approach is to process words as a set, where the order is lost. This approach is often called bag-of-words, because we can imagine taking an ordered sequence of words, putting them in a bag and shaking it. This is a visualisation of modelling methods like LSA. In most cases it is used to capture semantic knowledge. In the case of inflective languages the word order is not crucial, so losing the information about the order is not very destructive to the method, while it allows one to reduce the amount of data necessary for training.
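As a toy illustration of the bag-of-words idea (the example sentence is made up):

```python
from collections import Counter

sentence = ["ala", "ma", "kota", "i", "ala", "ma", "psa"]

# The bag-of-words representation keeps only the counts; the order is lost.
bag = Counter(sentence)
print(bag)   # Counter({'ala': 2, 'ma': 2, 'kota': 1, 'i': 1, 'psa': 1})
```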

This chapter describes the language modelling part of the research. The methods presented here are designed for inflective languages and tested on Polish, but some of them could be applied to any other language as well. The first model is based on a probabilistic POS tagger. This approach was unsuccessful, but we present it to document the experiment and discuss why we believe it reduced recognition. Most of the chapter then focuses on a bag-of-words model designed by the candidate. The model has some similarities to LSA in its general concept, but differs considerably in realisation, allowing calculations on much more data than LSA.

5.1 POS Tagging

POS tagging (Brill, 1995) is the process of marking up words as corresponding to particular parts of speech, based on both their definitions and their context, i.e. their relationships with other words in a phrase, sentence or paragraph (Brill, 1994; Cozens, 1998). POS tagging is more than providing a list of words with their parts of speech, because many words represent more than one part of speech at different times. The first major corpus of English for computer analysis was the Brown Corpus (Kucera and Francis, 1967). It consists of about 1,000,000 words, made up of 500 samples from randomly chosen publications. In the mid 1980s, researchers in Europe began to use HMMs to disambiguate parts of speech when working to tag the Lancaster-Oslo-Bergen Corpus (Johansson et al., 1978). HMMs involve counting cases and making a table of the probabilities of certain sequences. For example, once an article has been recognised, the next word is a noun with a probability of 40%, an adjective with 40%, and a number with 20%. Markov models are a common method for assigning POS tags. The methods discussed so far involve operations on a pre-existing corpus to find tag probabilities. Unsupervised tagging is also possible by bootstrapping: such techniques use an untagged corpus as their training data and produce the tagset by induction, that is, they observe patterns in word structures and provide POS types. These two categories can be further subdivided into rule-based, stochastic and neural approaches. Some current major algorithms for POS tagging include the Viterbi algorithm (Viterbi, 1967; Forney, 1973), the Brill tagger (Brill, 1995) and the Baum-Welch algorithm (L. E. Baum and Weiss, 1970), also known as the forward-backward algorithm. The HMM and visible Markov model taggers can both be implemented using the Viterbi algorithm.
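A toy version of such a transition table, using the illustrative numbers from the article example above (the table and its values are hypothetical):

```python
# Hypothetical transition probabilities P(next tag | previous tag).
transitions = {"article": {"noun": 0.4, "adjective": 0.4, "number": 0.2}}

def tag_sequence_probability(tags, table):
    """Multiply bigram transition probabilities along a tag sequence."""
    p = 1.0
    for prev, cur in zip(tags, tags[1:]):
        p *= table.get(prev, {}).get(cur, 0.0)
    return p

print(tag_sequence_probability(["article", "noun"], transitions))  # 0.4
```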

POS tagging of Polish was started by the governmental research institute IPI PAN. They created a relatively large corpus which is partly hand tagged and partly automatically tagged (Przepiorkowski, 2004; A.Przepiorkowski, 2006; Debowski, 2003; Przepiorkowski and Wolinski, 2003). The tagging was later improved by Piasecki, by focusing on hand-written and automatically acquired rules rather than trigrams (Piasecki, 2006). The best and latest version of the tagger has an accuracy of 93.44%, which is low compared to taggers for other languages. This might be one of the reasons for the outcome of our experiment.

5.2 Applying POS Taggers for Language Modelling in Speech Recognition

There is very little interest in using POS tags in ASR, so their usefulness was investigated. POS tag trigrams, a matrix grading possible neighbourhoods, or a probabilistic tagger can be created and used to predict a word being recognised, based on the left context analysed by a tagger. It is very difficult to provide the tree structures necessary for context-free grammars, which represent all possible sentences, in the case of Polish, as the order of words can vary significantly. Some POS tags are much more probable in the context of some others, which can be used in language modelling.

Experiments on applying morphological information to ASR of the Polish language were undertaken using the best available POS tagger for Polish (Piasecki, 2006; Przepiorkowski, 2004). The results were unsatisfactory, probably because of high ambiguity: an average word in Polish has two POS tags, which gives too many possible combinations for a sentence. Briefly speaking, applying POS tagging to the modelling of Polish is a process of guessing based on uncertain information.

HTK (Young, 1996; Young et al., 2005) was used to provide a 10-best list of acoustic hypotheses for sentences from CORPORA. The hypotheses were constructed as arbitrary combinations of any words from the corpus and are provided as ordered lists of words. This model was trained in a way which allowed all possible combinations of all words in the dictionary, to have more variation and to give a language model the opportunity to improve recognition. The probabilities of those hypotheses were then calculated using the POS tagger (Piasecki, 2006). The acoustic model can easily be combined with language models using Bayes' rule by multiplying both probabilities (2.5).

5.3 Experimental Results of Applying POS Tags in ASR

Trigrams of tags were calculated using transcriptions of spoken language and existing tagging tools; the results were saved in XML. We received significant help from Dr Maciej Piasecki and his group from the Technical University of Wrocław in this step of the research.

The results were compared giving different weights to the probabilities from the HTK acoustic model and from the POS tagger language model. In all situations, the combined probability gave worse results than the pure HTK acoustic model. Histograms of probabilities for correct and wrong recognitions were also calculated, and they showed no useful correlation. Some example sentences were also analysed and described by a human supervisor; they are presented in Table 5.1.

In total 331 occurrences were analysed. Only 282 of them had the correct recognition anywhere in the 10-best list. The average HTK probability of correct sentences was 0.1105. Exactly 244 of all occurrences had a correct hypothesis in the first position of the 10-best list, so 73.72% of occurrences were correctly recognised using only the HTK acoustic model. Only 53 occurrences were recognised when applying probabilities from the POS tagger, even when the HTK probabilities were made 4 times more important than those from the POS tagger. The weight was applied by raising the HTK probability to the power of 4. This gives 16.01% of correct recognitions for the model with POS tag probabilities, which is a very disappointing result.
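The weighted combination can be sketched as follows; the triple format of the hypothesis list is an assumption, not the HTK output format:

```python
def combined_score(p_acoustic, p_language, acoustic_weight=4.0):
    """Bayes-style combination: multiply the two probabilities, weighting
    the acoustic model by raising its probability to a power (4 above)."""
    return (p_acoustic ** acoustic_weight) * p_language

def rerank(nbest, acoustic_weight=4.0):
    """Re-rank a 10-best list of (sentence, p_acoustic, p_language)."""
    return sorted(nbest,
                  key=lambda h: combined_score(h[1], h[2], acoustic_weight),
                  reverse=True)
```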

The POS tagger was trained on a different corpus from the one used in the experiment described above. This is why we decided to conduct an additional experiment. We recorded 11 sentences from the POS tagger training corpus. They were recognised by HTK, providing a 10-best list, and used in an experiment similar to the one described above. The amount of data is not enough to provide statistical results, but observations of the individual sentences (Table 5.3) lead to the same conclusion as in the main experiment.


Table 5.1: Results of applying the POS tagger to language modelling. First, a sentence in Polish is given, then the position of the correct recognition in the 10-best list, followed by a description of the tagger grade for the correct recognition

Lubic czardaszowy plas

1, Tagger grade is very low.

Cudzy brzuch i buzia w drzewie

4, Tagger grade is higher than for wrong recognitions.

W zadzy zejde z gwozdzia

There is no correct sentence in the 10 best list.

Krociowych sum nie zal mi

1, Tagger grade is higher than or similar to the other recognitions in the top 6, but lower than the 7th.

Moc czuc kazdy odczynnik

6, Tagger grade is lower than for most of the wrong recognitions, including the first two hypotheses.

However, the wrong recognition with the highest probability is grammatically correct.

On łom kładzie lampy i kołpak

7, Tagger grade is low.

Rybactwo smutnieje on sie smieje

There is no correct sentence in the 10 best list.

On liczne tasmy w cuglach da

2, Tagger grade is low, but still highest in the first 5 hypotheses.

Ten chor dusiłem licznie

There is no correct sentence in the 10 best list.

Chciałbym wpasc nas sesje

There is no correct sentence in the 10 best list.

Zołtko wlazło i co zrobic

There is no correct sentence in the 10 best list.

Wor rur zelaznych wazył

3, Tagger grade is lower than for the sentence in the first position.

U nas ludzie zwa to fuchy

There is no correct sentence in the 10 best list.

On myje wroble w zoo

There is no correct sentence in the 10 best list.

Bos cały w wisniowym soku

3, Tagger grade is higher than for the 7 top hypotheses.

Na czczo chleby i pyry z dzemem

There is no correct sentence in the 10 best list.

Lech byc podlejszym chce

1, Tagger grade is the lowest in the top 5 hypotheses, but most of them are grammatically correct.


Table 5.2: Results of applying the POS tagger to language modelling. First, a sentence in Polish is given, then the position of the correct recognition in the 10-best list. A description of the tagger grade for the correct recognition follows (2nd part)

Zre jez zioła jak dzem John

1, Tagger grade is higher than for top 4 hypotheses.

Masz dzisiaj rozyczke zielona

1, Tagger grade is lower than for the second hypothesis, which makes no sense but is morphologically correct.

Wez daj im soli drogi dyzmo

2, Tagger grade is very close to that of the most probable hypothesis, which is also grammatically correct.

Wez masz ramki opolskie

1, Tagger grade is higher than for the second hypothesis but lower than for the third one.

Dzgnał nas cicho pod zamkiem

1, Tagger grade is highest of all.

Tam spi wojsko z bronia

6, Tagger grade is the second highest of all; the highest belongs to the hypothesis which is acoustically 5th.

Nie odchodz bo zona idzie

3, Tagger grade is the highest, but equal to three others, which have lower acoustic probabilities.

Tym mozna atakowac

5, Tagger grade is higher than for the acoustically most probable sentence but lower than

for all others between 1 and 5; however, all of them are grammatically correct.

Zmyslny kot psotny ujdzie

1, Tagger grade is higher than for the second and third hypotheses.

Niech pan sunie na wschod

4, Tagger grade is higher than for the 7 acoustically most probable hypotheses.

conclusion as in the main experiment. The recognitions found using HTK only had fewer errors for 6 sentences; for 5 sentences the number of errors was the same. One sentence was correctly recognised by both models, and one more was correctly recognised using just the HTK acoustic model.

5.4 Bag-of-words Modelling

A new method of language modelling for ASR is presented (Ziołko et al., 2008b). The method has some similarities to LSA, but it does not need as much memory and gave better experimental results, reported as the percentage of correctly recognised sentences from a corpus. The main difference lies in the choice of similar topics influencing the matrix which describes the probability of words appearing in topics.

Recently, graph-based methods (Harary, 1969; Veronis, 2004; Agirre et al., 2006) have become more and more popular. In our algorithm, graphs are used instead of SVD to smooth information between different topics. Graphs help us to locate and grade similar topics. An important advantage of our method is that it does not need much memory at once to process


Table 5.3: Results of applying the POS tagger on its training corpus. The first version of each sentence is the correct one, the second is the recognition using just HTK, and the third uses HTK and POS tagging. Then the number of differences compared to the correct sentence was counted and summarised

i do licha coscie mi wczoraj dali takiego ze teraz ledwo wiem jak sie nazywam
i do i w coscie mi wczoraj dali takiego ze teraz ledwo wiem nie sie nazywam
i do i w coscie w wczoraj dali takiego ze teraz ledwo wiem nie sie nazywam
htk is better

nie mowiac o tym kim ja jestem skineła głowa zawstydzona
nie w wiem nocy nocy nie jestem skineła głowa zawstydzona
nie w wiem nocy nocy nie jestem skineła bo w w w zawstydzona
same number of errors htk is better

to okropne obudzic sie po nocy spedzonej z kims czyjego imienia sie nie pamieta
to okropne obudzic sie minut spedzonej z kims czyjego imienia cie nie pamieta
to okropne obudzic w nocy spedzonej w kims czyjego imienia cie nie pamieta
htk is better

pare minut temu nie pamietałam nawet ze jestem w innym swiecie
pare minut temu nie pamieta nawet jestem innym swiecie
pare minut temu nie pamieta nawet w jestem innym swiecie
same number of errors

polez teraz spokojnie zasłonie okno bo widze ze swiatło cie razi
polez teraz spokojnie zasłonie o okno bo widze cie swiatło cie razi
polez z teraz spokojnie zasłonie o okno bo widze cie swiatło cie razi
htk is better same number of errors

zobaczysz wszystko bedzie dobrze pamietasz ze opusciła sanktuarium
zobaczysz wszystko bedzie dobrze pamieta ze opusciła sanktuarium
zobaczysz wszystko bedzie dobrze pamieta ze opusciła sanktuarium w
same number of errors htk is better

o tak pamietała wszystko powrociło z pełna wyrazistoscia
o tak pamieta wszystko powrociło pełna wyrazistoscia
o tak w pamieta wszystko powrociło pełna wyrazistoscia
htk is better

w koncu tyle razy o tym myslała i wciaz nie mogła pojac jak do tego doszło
koncu cie teraz nocy myslała w wciaz nie tego tym cie bedzie do doszło
koncu cie teraz nocy myslała wciaz nie tego swiecie bedzie do doszło
same number of errors



Figure 5.1: Histogram of POS tagger probabilities for hypotheses which are correct recognitions

any amount of data. This is in contrast to LSA, which is quite limited in real applications for this reason. In LSA, the SVD is conducted on the entire matrix, which means that a model with a few thousand words and a few hundred topics might be a challenge for the memory of a regular PC. Our method does not need to perform operations on the entire matrix. There are other approaches to this issue, such as applying the generalised Hebbian algorithm (Gorrell and Webb, 2005).

The main aspect of modelling in our method is semantic analysis applied as the very last step of the recognition process, which is an important innovation in ASR. It can be used as an additional measure to promote non-first-choice word recognition hypotheses when the first choices do not fit the semantic context. However, the method extracts some syntax information as well. It was designed for Polish, which is highly inflective and not a positional language. For this reason, only particular endings can occur in the context of the endings of other part-of-speech elements of a sentence. For example, we can expect feminine adjectives with feminine nouns. In the same way, in English we can expect I in the same sentence as am, and you in the same sentence as are, etc. In Polish all verbs have this kind of inflection; however, the differences between forms are usually only in the endings, unlike the forms of to be in English.



Figure 5.2: Histogram of POS tagger probabilities for hypotheses which are wrong recognitions


Figure 5.3: Ratio of correct recognitions to all for different probabilities from POS tagger


5.5 Experimental Setup

Semantic analysis might be much more crucial in non-positional languages than in English, due to irregularities in the positional structure of words. Language models based on context-free grammars are quite unsuccessful for non-positional languages. Research on applying LSA in ASR has been done (Bellegarda, 1997) for English only.

HTK (Young, 1996; Young et al., 2005) was used to provide 100-best lists of acoustic hypotheses for sentences from the test corpora. The MFCCs (Davis and Mermelstein, 1980; Young, 1996) were calculated for parametrisation with a standard set of 39 features. 37 different phonemes were distinguished using a phonetic transcription provided with CORPORA. Several experiments were conducted to evaluate the method. The first one was deliberately simple, to give a general view only. The audio model was trained on the male speakers of CORPORA (Grocholewski, 1995). The corpus was organised as follows: all single letters are combined into one topic, all digits into another, and names and commands separately into two more. Every sentence is also treated as a topic. In this way, 118 topics are provided. They consist of 659 different words in total. In the preliminary experiment we used 114 simple sentences, spoken by a male not included in the training set, as the testing set. All other utterances are obviously too short to be used in language modelling.

In the following experiments HTK was also used to provide 100-best lists. The main difference was the division between training and testing corpora. Training data was collected from the internet and from ebooks from several sources described later in detail. Testing sentences were created by the author and recorded on a desktop PC with a regular microphone and some, but very little, background noise.

5.6 Training Algorithm

The entire algorithm is illustrated on a simple English example in one of the following sections. Several versions of the algorithm were applied and tested. Some of the differences are presented in the following sections together with experimental results. Here, we describe the final version, which performs best. The training algorithm starts with creating the matrix

S = [s_{ik}],  (5.1)

representing semantic relations, where the rows i = 1, ..., I represent topics and the columns k = 1, ..., K represent words. Each matrix value s_{ik} is the number of times word k occurs in topic i. Some words are so common that they appear in almost all topics; by the entropy rule, the appearance of such words carries little semantic information, while words which appear only in certain topics say more about the semantic content. This is why all values of (5.1) are divided by the sum for the given word over all topics, to normalise. In this way the importance of commonly appearing words is reduced for each topic. A measure of similarity between two topics is

d_{ij} = \sum_{k=1}^{K} s_{ik} s_{jk}.  (5.2)


Figure 5.4: Undirected, complete graph illustrating similarities between sentences

It has to be normalised according to the formula

d'_{ij} = d_{ij} / \max_{i,j}\{d_{ij}\}.  (5.3)

As a result, values 0 \le d'_{ij} \le 1 are obtained.
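Before turning to the graph analysis, a minimal sketch of steps (5.1)-(5.3) follows (Python, for illustration only; the thesis implementation was in Matlab, and the function names are ours). The maximum in (5.3) is taken here over i \ne j, which matches the worked example of the later English-example section; for that example the column normalisation is skipped.

import numpy as np

def build_similarities(topics, vocab):
    # topics: list of word lists; vocab: list of the K distinct words
    index = {w: k for k, w in enumerate(vocab)}
    S = np.zeros((len(topics), len(vocab)))
    for i, topic in enumerate(topics):
        for word in topic:
            if word in index:
                S[i, index[word]] += 1.0     # s_ik of (5.1)
    col = S.sum(axis=0)
    S = S / np.where(col > 0, col, 1.0)      # damp words common to many topics
    D = S @ S.T                              # d_ij of (5.2)
    off = D - np.diag(np.diag(D))
    Dn = D / max(off.max(), 1e-12)           # d'_ij of (5.3), max over i != j
    return S, Dn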

These topic similarities are analysed as follows:

1. Create an undirected, complete graph (Fig. 5.4) with topics as nodes and d'_{ij} as the weights of the edges. Let us define the path weight

p_{ij} = \prod_{(a,b) \in P(i,j)} d'_{ab},  (5.4)

where P(i, j) is the sequence of edges in the path from i to j. In the simplest case of a single edge from i to j, the path weight is d'_{ij}. For a multiple-edge path, it is the product of the similarities of all edges on the path (5.4). If there are several paths, we always take the one with the largest product as the path weight (5.4).

2. For each node, we need to find the n nodes with the highest path weights to the given, analysed topic node. This allows us to define a list N of semantically related topics, which consists of those n nodes with their measures. The exact implementation of this part is presented in the next section.

3. The matrix S has to be recalculated to include the impact of similar topics. Smoothed word-topic relations are expressed by the matrix

S' = [s'_{ik}].  (5.5)

For each topic in matrix (5.1), we add the values of the topics from its list of related topics, multiplied by the measure for the given pair of topics. The elements of S' are

s'_{ik} = s_{ik} + \alpha^{-1} \sum_{j \in N} p_{ij} s_{jk}.  (5.6)

The coefficient \alpha is a smoothing factor which weights the influence of other topics on matrix S'. N is the list of similar topics found in step 2. Matrix element (5.6) is a measure of the likelihood that the kth word appears in the ith topic.

Matrix (5.5) stores counts of words present in particular topics. They can be represented as

C(word_k, s_i) = c.  (5.7)

We should not assume that any word can have zero probability of appearing in any topic. This is why we replace all zeros in (5.5) with the small value s'_{min} = 0.01. If (5.7) were normalised to have values between 0 and 1, it would be probabilistic information of the type

P(word_k | s_i) = p.  (5.8)

The sum of the values in (5.5) is not equal to 1, which is why the values (5.7) are not probabilities according to the definition. However, (5.7) satisfies the other conditions of a probability and can often be treated as if it were (5.8). For this reason, the sum of all values in (5.5) is calculated and every value in (5.5) is divided by it. In this way the s'_{ik} become probabilities, as their sum equals 1. In the following sections we will assume that (5.8), rather than (5.7), is stored in (5.5) and that the s'_{ik} are probabilities.
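The smoothing and normalisation of (5.5)-(5.6) can be sketched as follows (again an illustrative Python fragment with our own names, not the thesis code; it assumes the lists of similar topics have already been found by the search described in the next section):

import numpy as np

def smooth_and_normalise(S, similar, alpha=2.0, s_min=0.01):
    # similar: dict mapping topic index i to a list of (j, p_ij) pairs,
    # the n most similar topics with their path weights
    S_prime = S.astype(float).copy()
    for i, neighbours in similar.items():
        for j, p_ij in neighbours:
            S_prime[i] += p_ij * S[j] / alpha   # s'_ik of (5.6)
    S_prime[S_prime == 0] = s_min               # no zero likelihoods
    return S_prime / S_prime.sum()              # values now sum to 1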

5.7 Process of Finding The Most Similar Topics

A group of the longest paths, where distance is calculated using a product over edges rather than a sum, has to be found in the 2nd point of the algorithm described in the previous section. It can be achieved by implementing the following algorithm:

1. Find the n single-edge paths with the highest measures d'_{ij}.

2. Check whether the two-edge path P(i,m), starting from node i with the highest measure d'_{ij} found in the step above and going through j to any other node m, has a better measure p_{im} than the lowest of the n solutions found in the step above. If it does, then replace the lowest one with m in the list of n similar topics.

3. Conduct the step above for all other single-edge paths from the list, apart from the lowest, nth element.

4. If there are any non-single-edge paths P(i, j) on the list in a position other than the nth, repeat a process similar to step 2. Check whether, after adding any other edge, the measure of the path p_{ij} is higher than the measure of the nth position. Then replace the previous path with the new, longer path with the higher p_{ij}.
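As an implementation note (not the thesis code): because every d' lies in [0, 1], a path weight can never grow when an edge is appended, so the maximum-product path weights that the greedy procedure above approximates can also be computed exactly with a Dijkstra-style search. A minimal Python sketch, with names of our own choosing:

import heapq

def most_similar(Dn, i, n):
    # Dn: matrix of normalised similarities d'; returns the n topics with
    # the largest maximum-product path weight p_ij from topic i (5.4).
    I = len(Dn)
    best = [0.0] * I
    best[i] = 1.0
    heap = [(-1.0, i)]
    while heap:
        neg_p, u = heapq.heappop(heap)
        p = -neg_p
        if p < best[u]:
            continue                 # stale entry
        for v in range(I):
            if v == u:
                continue
            q = p * Dn[u][v]         # extend the path by edge (u, v)
            if q > best[v]:
                best[v] = q
                heapq.heappush(heap, (-q, v))
    ranked = sorted((j for j in range(I) if j != i),
                    key=lambda j: best[j], reverse=True)
    return [(j, best[j]) for j in ranked[:n]]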

It can be proved that the process is exhaustive in one direction (from the analysed topic). Let us name the analysed topic i, and the set of the n topics most similar to i, found in the first step of the process (using the measure d'_{ij}), N_1. Let l be the element of N_1 with the lowest measure of similarity d'_{ij}. As a result of the algorithm presented above, we obtain

d'_{i n_1} > d'_{ij} \quad \forall n_1 \in N_1, \forall j \notin N_1.  (5.9)


Table 5.4: Matrix S for the example with 4 topics, and the row of S' for topic 3

      big  John  has   house  black  aggr.  cat   small  mouse  is  mammal
1     1    1     1     1      0      0      0     0      0      0   0
2     1    1     1     0      1      1      1     0      0      0   0
3     0    0     1     0      1      1      1     1      1      0   0
4     0    0     0     0      0      0      0     1      1      1   1

3'    7/8  7/8   15/8  1/2    11/8   11/8   11/8  1      1      0   0

Table 5.5: Matrix D for the presented example

     1   2   3   4
1    4   3   1   0
2    3   6   4   0
3    1   4   6   2
4    0   0   2   4

Let us define the set N_2 = T \setminus (\{i_a\} \cup N_1) of topics not included in the list of similar topics, where T is the set of all topics and \{i_a\} is the one-element set containing the analysed topic i_a. From definition (5.3),

0 \le d'_{ij} \le 1 \quad \forall i, j \in \{1, ..., I\},  (5.10)

therefore

d'_{ij} d'_{jk} \le d'_{ij} \quad \forall j \in N_2,  (5.11)

where k is any topic. From (5.9) and (5.11),

d'_{i n_1} > d'_{ij} d'_{jk} \quad \forall j \in N_2.  (5.12)

As the same reasoning can be applied for further iterations (three-edge paths and so on), (5.11) and (5.12) prove that the process is exhaustive in one direction. It can skip some solutions leading from other topics to the analysed one, but this is even preferable from a linguistic point of view, because we do not want topics assigned as similar to many other topics just because they have a very strong link to one other topic.

5.8 Example in English

Let us consider an example of a corpus consisting of 4 sentences, each of them treated as a separate topic: Big John has a house. Big John has a black, aggressive cat. The black aggressive cat has a small mouse. The small mouse is a mammal.

The articles a and the were skipped, as they have no semantic content and do not exist in Polish, which was our experimental language. We count all other words, which creates the matrices S


(Tab. 5.4) and D (Tab. 5.5). The following topic similarities are obtained: d'_{12} = 3/4, d'_{13} = 1/4, d'_{14} = 0, d'_{23} = 1, d'_{24} = 0, d'_{34} = 1/2. They construct the graph in Fig. 5.4. Then the list of topics similar to topic three, N_1 = {2, 4}, can be found by applying the first step of the process to the graph. Topic 4 is l in this example, the topic with the lowest measure in N_1, namely 1/2. In the next step, the p_{ij} are calculated for two-edge paths starting at node 3 and going through 2. There are two of them. The first one is the path 3-2-4, where p_{34} = 1 \cdot 0 = 0. The second one is the path 3-2-1, where p_{31} = 1 \cdot 3/4 = 3/4 > d'_{34}. This is why topic 4 is replaced by topic 1, and the final list of topics similar to 3 is {2, 1}. Then, assuming \alpha = 2, we can calculate the row for topic 3 of S' (Tab. 5.4, last row).
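The similarities above are easy to verify mechanically. A minimal Python check (illustrative only; it uses the raw counts of Table 5.4, without the column normalisation, as in the preliminary experiment):

import numpy as np

S = np.array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],    # Big John has a house
              [1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0],    # ... black aggressive cat
              [0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0],    # ... has a small mouse
              [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])   # small mouse is a mammal
D = S @ S.T                                  # reproduces Table 5.5
Dn = D / (D - np.diag(np.diag(D))).max()     # normalised by max d_ij = 4
print(Dn[0, 1], Dn[0, 2], Dn[1, 2], Dn[2, 3])   # 0.75 0.25 1.0 0.5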

5.9 Recognition Using Bag-of-words Model

The recognition task can be described as

s_i = \arg\max_s P(s | word_{k_1}, ..., word_{k_m}),  (5.13)

where s is any topic and word_{k_1}, ..., word_{k_m} is the set of recognised words which appeared in a sentence. It classifies the bag-of-words as one of the realisations of one of the topics in matrix (5.5).

Recognition can be conducted by finding the most coherent topic for the set of words W in a provided hypothesis. It is carried out by finding, over the rows, the maximum of the product of the elements of (5.5) from the columns representing the words of the hypothesis,

P_{sem} = \max_i \frac{\prod_{k \in W} s'_{ik}}{|W|},  (5.14)

where |W| is the cardinality of the set of words W in the sentence. The row i for which the maximum is found is assumed to represent the topic of the sentence being recognised. The calculated measure P_{sem} can be used as an additional weight in speech recognition due to Bayes' theorem. The values of the probability p_{htk} obtained from the HTK model tend to be very similar for all hypotheses in the 100-best list of a particular utterance. This is why an extra weighting w was introduced, to favour the probabilities from the audio model over the p_{sem} received from the semantic model. The final measure can be obtained by applying Bayes' theorem:

p = p_{htk}^{w} \, p_{sem}.  (5.15)
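Together, (5.14) and (5.15) amount to a naive-Bayes-style scoring of each hypothesis. A minimal sketch, assuming S' is stored as a NumPy array (the names are ours, not the thesis code):

import numpy as np

def semantic_score(S_prime, word_ids):
    # word_ids: column indices of the recognised words of the hypothesis.
    # Product of s'_ik over the hypothesis words for every topic row,
    # maximised over topics and normalised by sentence length (5.14).
    products = S_prime[:, word_ids].prod(axis=1)
    return products.max() / len(word_ids)

def final_score(p_htk, p_sem, w=20.0):
    # combined measure p = p_htk^w * p_sem of (5.15)
    return (p_htk ** w) * p_sem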

5.10 Preliminary Experiment

The first experiment (Ziołko et al., 2008b) was conducted on CORPORA, using the same data for training and testing, to evaluate the implementation and estimate the algorithm's chances of success without spending several days training a proper model. Because the model was small, it was easy to compare different values of the parameters n, α and w. Results for recognition based on the audio model only are also included. LSA was used as the baseline to evaluate the results of our method. Experiments with several different w for the semantic model based on LSA were


conducted. Values in the range between 23 and 26 gave the best results, presented in Tab. 5.6. 45 utterances did not have a hypothesis with the correct sentence anywhere in their 100-best lists. This is why the maximal number of utterances which could be recognised was 69.

The experiment shows that our semantic model is useful, even though the results might be so outstanding due to the small number of words in the corpus and the use of the same corpus for training and testing. The same corpus was used for both tasks because phoneme segmentation in the corpus is needed to use HTK, and CORPORA is the only Polish corpus which provides it. Nevertheless, the comparison of 53% correct recognitions for the best configurations of our model with 36% for LSA and 29% for the audio model only is impressive. The analysed results for different configurations showed that the choice of n, the length of the list of topics related to an analysed topic, is not as important as the ratio between n and α, the smoothing factor weighting the impact of related topics. The ratio n/α should be kept around 2/3, in this case, to provide the best results. The audio model importance weight w is also crucial, as the information from the HTK model is very important and can effectively be ignored if w has too small a value.

It has to be stressed that this was a preliminary experiment. Our aim was to check whether it was worth investing more time in research on this model. This is why we used little data and the same set for training and testing. Some elements of the algorithm were not used in this experiment; for example, the values in (5.1) were not normalised to be probabilities. We do not claim that the calculated model can be used for any practical task. One more reason is that it was trained on CORPORA, which has no semantic connotations. On the other hand, it has to be stressed that for Polish this model keeps some grammar information as well, even though it was designed as a semantic one. For example, we can expect words with morphology related to one gender in a given sentence, which will be noted in matrix S. The results were promising, so more sophisticated experiments using transcriptions from the Polish Parliament, literature, a journal and Wikipedia as training corpora were conducted; they are described in the following sections.

Another way of demonstrating the usefulness of our bag-of-words model is by calculating the histogram p_{sem_c} of probabilities received from the semantic model for hypotheses which are correct recognitions (Fig. 5.5) and the histogram p_{sem_w} of probabilities received from the semantic model for hypotheses which are wrong recognitions (Fig. 5.6). The ratio p_{sem_c}/(p_{sem_c} + p_{sem_w}) is presented in Fig. 5.7. It clearly shows a correlation between a high probability from the bag-of-words model and the correctness of a recognition.
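The ratio of Fig. 5.7 can be computed directly from the two histograms. A small sketch, assuming the per-hypothesis probabilities have been collected into two arrays (names ours):

import numpy as np

def correctness_ratio(p_correct, p_wrong, bins=10):
    # per-bin ratio of correct hypotheses to all hypotheses, as in Fig. 5.7
    edges = np.linspace(0.0, 1.0, bins + 1)
    hc, _ = np.histogram(p_correct, bins=edges)
    hw, _ = np.histogram(p_wrong, bins=edges)
    total = hc + hw
    return np.where(total > 0, hc / np.maximum(total, 1), 0.0)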

5.11 K-means On-line Clustering

The number of topics is limited to around 1000. If a large choice of words in the model is expected, then the number of topics has to be kept low to save memory. This is why it is necessary to overcome the limitation on the number of topics for any real application. It was done by clustering topics into representatives of several topics. The k-means clustering algorithm was used for this aim. However, it was not possible to apply it directly to all topics at once because of the huge amount of data (millions of sentences). This is why we invented an algorithm which we call on-line clustering.


Table 5.6: Experimental results for the pure HTK audio model, the audio model with LSA, and the audio model with our bag-of-words model

        n    α    w    recognised sentences   %
LSA               25   41                     0.36
HTK                    33                     0.29
        3    1    50   48                     0.42
        3    2    50   46                     0.40
        3    3    50   46                     0.40
        7    1    50   35                     0.31
        7    3    50   45                     0.39
        7    5    50   46                     0.40
        5    1    20   44                     0.39
        5    2    20   55                     0.48
        5    3    20   60                     0.53
        5    4    20   59                     0.52
        5    5    20   59                     0.52
        3    2    20   61                     0.53
        3    1    20   50                     0.44
        7    6    20   59                     0.52
        7    5    20   61                     0.53
        7    4    20   59                     0.52
        8    4    20   57                     0.5
        8    5    20   61                     0.53
        8    6    20   60                     0.53
        9    1    20   28                     0.25
        9    3    20   49                     0.43
        9    5    20   57                     0.5
        9    6    20   61                     0.53
        9    7    20   59                     0.52
        11   5    20   54                     0.47
        11   7    20   60                     0.53
        11   8    20   60                     0.53
        11   9    20   58                     0.51
        9    6    10   58                     0.51
        9    6    15   60                     0.53
        9    6    17   60                     0.53
        9    6    18   61                     0.53
        9    6    19   61                     0.53
        9    6    20   61                     0.53
        9    6    22   59                     0.52
        9    6    25   58                     0.51



Figure 5.5: Histogram of probabilities received from the bag-of-words model for hypotheses which are correct recognitions


Figure 5.6: Histogram of probabilities received from the bag-of-words model for hypotheses which are wrong recognitions


Figure 5.7: Ratio of correct recognitions to all of them for different probabilities received from the bag-of-words model


The general scheme is to collect n topics from the training data. The algorithm is initialised heuristically. The topics are then clustered into n/2 topics using the k-means clustering algorithm, which iterates the following two steps until convergence. The first is to compute the membership of each data point x in the clusters by choosing the nearest centroid. The second is to recompute the location of each centroid according to its members. When the k-means converge and new topics are chosen, n/2 new topics can be added from new training data, and clustering is repeated to reduce the set again. This loop is applied as long as there is new training data to be included. At the very end, an additional clustering is conducted to limit the number of topics to n/4.

Every time, the information on how many sentences are represented by a particular topic is stored and used as weights when the means are calculated and topics are combined as a result of clustering. Thanks to that, the order in which sentences are fed into the training system does not matter for the shape of the final clusters. Unfortunately, it is not possible to cluster all sentences at once because of data sparsity; a sketch of the scheme is given below.
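The sketch below (illustrative Python with random initialisation; the thesis implementation was in Matlab and its heuristic initialisation is not reproduced here) assumes the topics arrive as batches of equal-width NumPy row vectors, each new topic initially representing one sentence:

import numpy as np

def weighted_kmeans(X, w, k, iters=50, seed=0):
    # Plain weighted k-means: X (m x K) topic vectors, w sentence counts.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centroids[None]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)            # nearest centroid per topic
        for c in range(k):
            mask = labels == c
            if mask.any():                   # weighted centroid update
                centroids[c] = np.average(X[mask], axis=0, weights=w[mask])
    counts = np.array([w[labels == c].sum() for c in range(k)])
    return centroids, counts

def online_cluster(batches, n):
    # Keep at most n topics, halving by clustering whenever the buffer
    # fills; a final pass reduces the set to n/4 topics. Assumes at
    # least n/4 topics arrive in total.
    X = np.empty((0, batches[0].shape[1]))
    w = np.empty(0)
    for batch in batches:
        X = np.vstack([X, batch])
        w = np.concatenate([w, np.ones(len(batch))])
        if len(X) >= n:
            X, w = weighted_kmeans(X, w, n // 2)
    return weighted_kmeans(X, w, n // 4)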

Formula (5.8) holds for topics which represent several sentences in the same way as for those which represent just one sentence. However, it is not possible to calculate the probability of a word given a combined topic from the probabilities of the topics it represents. A new version of matrix (5.5), for the clustered topics, has to be calculated and used instead. This means that the process of collecting statistical data by creating (5.1) has to be finished before the described algorithm is run. Once (5.5) has been created, new statistical data cannot be added to it. If this has to be done, the new data should be added to (5.1) and (5.5) recalculated from the beginning.

5.12 Experiment on Parliament Transcripts

A set of 44 sentences was created using words and language similar to those expected to be used in a parliament. They were also designed so that the most common words from the training corpus are used and so that some of the words from the testing set appear in a few sentences. They were recorded, and the HTK recognition experiment was conducted on them using a triphone model trained on CORPORA but with the vocabulary limited to the words in these 44 sentences. In this way, HTK provided a 100-best list of hypotheses for each of the sentences. They were used in the same way as in the previously described experiment.

Matrix (5.1) was created by analysing transcriptions of the Polish Parliament meetings in the years 2005-2007. They are the biggest corpus of transcribed Polish speech. There are differences in sentence construction between spoken and written language; this is one of the reasons why we decided to use this corpus for training. Another is that our model is likely to be a part of an ASR system used by the Police and Courts, so we are interested in research on very formal language. None of the testing sentences was intentionally taken from these transcriptions; however, it was not checked that they do not appear there. The testing set consists of 198 words, and those words were included in matrix (5.1). Because of data sparsity, the k-means on-line clustering algorithm described above was used to combine several topics. Every topic is a set of words between two dots in the training corpus. In the ideal case, topics are sentences. In reality, dots are used in Polish


Table 5.7: 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations

platforma obywatelska wymaga funkcjonowania klubu w czasie obrad sejmu

Civic Platform expects the club to operate during parliament proceedings.

dlaczego poseL wojciech polega na opinii zarzAdu
Why does MP Wojciech trust the board opinion?
Latwo skierowaC czynnoSci do sAdu
It is easy to move actions to court.
wniosek rolniczego zwiAzku znajduje siE w ministerstwie
The petition of the agricultural union is in the ministry.
projekt samorzAdu ma wysokie oczekiwania finansowe
The municipality project has high financial expectations.
fundusz spoLeczny podjAL dziaLania w ramach obecnego prawa cywilnego

The communal foundation took steps according to existing civil law.

koalicja chce komisji sejmowej do oceny dziaLalnoSci posLa jana

The coalition wants a parliament commission for evaluation of MP Jan activity.

dzisiaj piEC paN poprze ministra w waZnym gLosowaniu w sejmie
Five women will support the Minister in an important vote today.
poseL ludwik dorn byl na waZnym gLosowaniu po duZym posiLku

MP Ludwik Dorn participated in an important vote after a large meal.

bOg ocenia polskE za powaZne przestEpstwa sektora finansowego w kraju i za granicA

God judges Poland for crucial crimes of the financial sector in the country and abroad.

poseL tadeusz cymaNski faktycznie wyraziL sprzeciw wobec rozwoju paNstwa polskiego

MP Tadeusz Cymanski expressed a protest against development of the Polish country indeed.

tak mi dopomOZ bOg

God, help me. (traditional formula added after an oath)

poseL andrzej lepper zajmuje siE rzAdem jak nikt inny

MP Andrzej Lepper takes care of the government like no one else.

uchwaLa rzAdowa dotyczAca handlu i inwestycji przedsiEbiorstw paNstwowych

w rynek nieruchomoSci

The government act on trade and investments of public enterprises in the estate market.

panie marszaLku wysoka izbo
Mr speaker, House. (common way to start a speech in the Polish Parliament)

poseL ludwik dorn chce podziEkowaC komisji
MP Ludwik Dorn wants to thank the commission.
bezpieczeNstwo jest bardzo waZne
The safety is very important.
minister Srodowiska powiedziaL waZne rzeczy
The Minister of Environment said important things.


Table 5.8: 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations (2nd part)

narOd rzeczpospolitej polskiej chce pieniEdzy
The nation of Republic of Poland wants money.
rodziny powinny byC najwaZniejsze
Families should be the most important.
resort bezpieczeNstwa ma wysokie uprawnienia
The department of security has high authority.
odpowiednie uprawnienia sA bardzo waZne
Proper authorities are very important.
kilkanaScie przedsiEbiorstw potrzebuje nowych dochodOw
Over a dozen of enterprises need new incomes.
poseL andrzej lepper zwrOciL dokumenty do sejmu
MP Andrzej Lepper returned documents to the Parliament.
krajowa komisja popiera nowA ustawE
The national commission supports the new act.
narOd rzeczpospolitej polskiej ma waZne oczekiwania od sejmu

The nation of the Republic of Poland has important expectations from the Parliament.

praktyka wskazuje co innego
Real life shows something else.
czterech posLow nie mogLo zostaC
Four MPs were not able to stay.
na sLuZbie siE pracuje
You work on a duty.
sprzeciwiam siE
I speak against.
wnoszE o przerwE w obradach
I ask for a break in the proceedings.
proszE o ciszE
I ask for silence.
wznowienie obrad nastApi po godzinnej przerwie
The proceedings will be reopened after an hour break.
to jest skandal
It is a scandal.


Table 5.9: 44 sentences in the exact transcription used for testing by HTK and the bag-of-words model, with English translations (3rd part)

nie pozwolimy na to
We will not allow it.
obrady przy zamkniEtych drzwiach
Closed proceedings.
matki potrzebujA becikowe
Mothers need a support.
przechodzimy do konkretOw na temat ustawy o ubezpieczeniach spoLecznych

We move to details on the act on public insurances.

duZA frekwencja w trakcie gLosowania
High attendance during a vote.
zgromadzenie narodowe zadecyduje o przyszLoSci tej ustawy
The National Assembly will decide about the future of this act.
komisja zbierze siE po przerwie
The commission will gather after a break.
proszE mOwiC wolniej
Speak slower please.
zacznijmy od budowania podstaw
Let's start from building the foundations.
zgLoszono wiele poprawek do tej ustawy
Many corrections to this act were declared.

to mark abbreviations and ordering numbers, which influenced the content of the topics. The training corpus consisted of around 800,000 topics.

At the end of the training process, all topics were clustered into 500 final topics. Then the values of matrix (5.1) were normalised for each word over all topics, to increase the importance of words which appeared in few topics and decrease the importance of words which appeared in many. Then matrix (5.5) was created. The HTK hypotheses were rearranged using information from (5.5) in the same way as in the previous experiment.

The results of this experiment were negative; the model did not improve recognition. The quality of the training data is blamed for these results. The transcriptions contained many comments and other elements which are not sentences. Moreover, the transcriptions were copied from pdf files into a text file, which degraded quality slightly: all syllables fi in the corpus were changed into dots, and some parts were rearranged in an inappropriate way. What is more, a dot is quite frequently used in Polish to mark the ends of abbreviations, and is put after numbers if they denote order, like 1st or 2nd in English. All these dots were treated by our algorithm in the same way as dots marking the ends of sentences. This is why the topics were quite often not proper sentences, as expected by our algorithm. We decided to conduct another experiment using literature for training. The quality of ebooks is better than that of the transcriptions. They are available in txt and doc files, and abbreviations and numbers are much rarer in literature than in the Parliament transcripts.


5.13 Preprocessing of Training Corpora

The experiment on the Parliament transcripts taught us that text data has to be preprocessed more thoroughly before it can be used for model training. There are three main issues which have to be faced. First, Matlab, which is used for model training, does not recognise the special Polish letters; this is why they have to be replaced by single characters. Secondly, several special characters should be erased to keep the corpus cleaner. Thirdly, some dots have to be removed from the corpus, as they do not represent ends of sentences.

We started by replacing all capital letters in the corpus with lower case, as capitals are redundant for this experiment. Capital letters can then be used to represent the special Polish characters. The second issue was faced by removing (or, for some of them, replacing with an empty space) all characters from the list: , ” “ : ( ) ; + - \/ ' # & =. Then question and exclamation marks ?! were changed into dots. Dots were removed if they followed certain abbreviations; a dot is put after an abbreviation in Polish if the abbreviation finishes with a consonant. All short forms from the list were replaced by a full form, or by an abbreviation without a dot if several morphological forms are represented by one abbreviation. An empty space was put at the beginning of each string to be searched for, to avoid matching the ends of other words.

It is more and more common in Polish to put dots after digits denoting order, like th in English. This is why all dots following digits were also removed from the corpora. Two consecutive dots were replaced with just one, and the same was done with three dots. Finally, all doubled and tripled spaces were replaced by a single one, as a final cleaning of the corpora. In the beginning we did these operations using Matlab and Word for Windows. Later, the process was automated using SED. Another preprocessing step we had to perform was removing html and xml tags from some of the texts. This task was also accomplished in SED, which is a simple stream editor under Linux. It takes and filters row after row from a default input, which in our case was a text file. It then applies changes to the text according to commands in a specific order and sends the result to an output. The script presented in Table 5.10 was used for all changes apart from removing html tags.

5.14 Experiment with Literature Training Corpus

Another experiment, on a larger scale, was conducted using literature to train the model. This attempt was more successful than the previous one; however, the results are still unsatisfactory. The improvement compared to the transcripts might be caused by the fact that the language in literature is much more proper than in the transcripts, where spoken language was written down. It would be an interesting observation that written language should be used for training even though spoken language is being recognised. With some configurations, a 3% improvement was noted (Tab. 5.11). The low efficiency was probably caused by using too little data for training; the very bad results for LSA support this hypothesis. The perplexity of the corpus is sufficiently large and equals 9,031.

As the next step to improve our model, we started to normalise all values in matrix (5.5), so that its values, and the final grades, are probabilities, which we had not done in


Table 5.10: SED script for text preprocessing

s/A/a/g    s/ [%]/ procent/g                                  s/ tzw[.]/ tzw/g
s/B/b/g    s/ [$]/ dolar/g                                    s/ im[.]/ imienia/g
s/C/c/g    s/nbsp/ /g                                         s/ lit[.]/ litera/g
s/D/d/g    s/[.] [.]/./g                                      s/ ang[.]/ ang/g
s/E/e/g    s/ ust[.]/ ustawa/g                                s/ Lac[.]/ Lac/g
s/F/f/g    s/ ub[.]/ ub/g                                     s/ gr[.]/ gr/g
s/G/g/g    s/[(]//g                                           s/ poL[.]/ poLowa/g
s/H/h/g    s/[)]//g                                           s/ zm[.]/ zmarLy/g
s/I/i/g    s/[;]//g                                           s/ ur[.]/ urodzony/g
s/J/j/g    s/[¡]//g                                           s/ wyd[.]/ wyd/g
s/K/k/g    s/[#]//g                                           s/ r[.]/ r/g
s/L/l/g    s/[&]//g                                           s/ r [.]/ roku/g
s/M/m/g    s/[|]//g                                           s/ sp[.]/ spOLka/g
s/N/n/g    s/[*]//g                                           s/ ul[.]/ ulica/g
s/O/o/g    s/[ ]//g                                           s/ pkt[.]/ pkt/g
s/P/p/g    s/[’]//g                                           s/[.]jpg/ jpg/g
s/R/r/g    s/[!]/./g                                          s/[.]png/ png/g
s/S/s/g    s/[?]/./g                                          s/[.]exe/ exe/g
s/T/t/g    s/[@]/ /g                                          s/[.]bmp/ bmp/g
s/U/u/g    s/0[.]/0/g                                         s/[.]pdf/ pdf/g
s/W/w/g    s/1[.]/1/g                                         s/[.]html/ htm/g
s/Y/y/g    s/2[.]/2/g                                         s/[.]pl/ pl/g
s/V/v/g    s/3[.]/3/g                                         s/[.]com/ com/g
s/X/x/g    s/3[.]/3/g                                         s/ w[.]/ w/g
s/Z/z/g    s/4[.]/4/g                                         s/ a[.]/ a/g
s/ł/L/g    s/5[.]/5/g                                         s/ b[.]/ b/g
s/ś/S/g    s/6[.]/6/g                                         s/ c[.]/ c/g
s/ń/N/g    s/7[.]/7/g                                         s/ d[.]/ d/g
s/ć/C/g    s/8[.]/8/g                                         s/ e[.]/ e/g
s/ó/O/g    s/9[.]/9/g                                         s/ f[.]/ f/g
s/ę/E/g    s/ godz[.]/ godz/g                                 s/ g[.]/ g/g
s/ż/Z/g    s/ art[.]/ art/g                                   s/ h[.]/ h/g
s/ź/X/g    s/ tys[.]/ tys/g                                   s/ i[.]/ i/g
s/ą/A/g    s/ ok[.]/ ok/g                                     s/ j[.]/ j/g
s/Ł/L/g    s/ m[.]in[.]/ miEdzy innymi/g                      s/ k[.]/ k/g
s/Ś/S/g    s/ m[.] in[.]/ miEdzy innymi/g                     s/ l[.]/ l/g
s/Ń/N/g    s/ n[.]p[.]m[.]/ nad poziomem morza/g              s/ L[.]/ L/g
s/Ć/C/g    s/ p[.]p[.]m[.]/ pod poziomem morza/g              s/ m[.]/ m/g
s/Ó/O/g    s/ p[.]n[.]e[.]/ przed naszA erA/g                 s/ n[.]/ n/g
s/Ę/E/g    s/ n[.]e[.]/ naszej ery/g                          s/ o[.]/ o/g
s/Ż/Z/g    s/ przyp. tLum./ przypis tLumacza/g                s/ p[.]/ p/g
s/Ź/X/g    s/ z o[.] o[.]/ z ograniczonA odpowiedzialnoSciA/g s/ s[.]/ s/g
s/Ą/A/g    s/ z o[.]o[.]/ z ograniczonA odpowiedzialnoSciA/g  s/ t[.]/ t/g
s/,//g     s/ orygin[.]/ oryginalnie/g                        s/ u[.]/ u/g
s/[-]//g   s/ proc[.]/ procent/g                              s/ z[.]/ z/g
s/[+]//g   s/ tj[.]/ to jest/g                                s/www[.]/www /g
s/[/]//g   s/ szt[.]/ sztuk/g                                 s/  / /g
s/[=]//g   s/ np[.]/ na przykLad/g                            s/  / /g
s/[\]//g   s/ ww[.]/ wyZej wym/g                              s/[.][.][.]/./g
s/[”]//g   s/ ds[.]/ do spraw/g                               s/[.][.]/./g
s/[:]//g   s/ wLaSc[.]/ wLaSc/g


Table 5.11: Experimental results for the pure HTK audio model, the audio model with LSA, and the audio model with our bag-of-words model trained on literature

        n    α    w    recognised sentences   %
LSA               26   8                      18
HTK                    16                     35
        30   20   20   17                     38

Table 5.12: Experimental results for the pure HTK audio model, the audio model with LSA, and the audio model with our bag-of-words model trained on the enlarged literature corpus

        n   α   w    ranking of the correct hypothesis   % improvement
LSA             30   12.36                               -19
HTK                  10.39                               0
        3   3   25   8.95                                14

the previous experiment. We also added new text to the training data. Additionally, we decided that counting the number of properly recognised sentences is not the best way to evaluate the method. We started to look at the average position of the correct hypothesis in the n-best list before and after applying our model. This evaluates all sentences, and not just those for which a correct hypothesis was moved to the first position from a lower one, as in the earlier evaluation method. We compared our model with LSA as the baseline; it performed better again (Tab. 5.12). This supports the conclusion that the model is at least better than LSA, because it needs less data to be trained. The different parameters with which our model performs best are probably explained by the fact that matrix (5.1) is calculated using more data. Thanks to that, there are fewer zeros in (5.1), and there is no need to smooth it as much by including the impact of many similar topics; only the most similar ones were used in that case.

We also collected more data for training, using the Rzeczpospolita journal and the Polish Wikipedia. The first corpus can be downloaded from Dawid Weiss's website as a set of html files; the researcher states that the journal agreed to the use of these resources for any academic research. The second was collected from the Internet using C++ software and has a very high perplexity, namely 16,436. However, adding this data did not improve the performance of the method. Table 5.13 shows the size and complexity of all the corpora we used in this research.

Table 5.13: Text corpora

Content                  MBytes   Mwords   Perplexity
Parliament transcripts   58       8        4,013
Literature               490      68       9,031
Rzeczpospolita journal   879      104      8,918
Wikipedia                754      97       16,436


5.15 Word Prediction Model and Evaluation with Perplexity

There are two main ways to evaluate language models. The first is to measure recognition error. The second is perplexity, which, for a probability model, is defined via cross entropy (Brown et al., 1992) as

2^{-\sum_{x=1}^{N} p(x) \log_2 q(x)},  (5.16)

where p(x) is the probability of a correct recognition according to a ground truth distribution. Here it is assumed to be uniform, which leads to p(x) = 1/N, where N is the number of test samples; q(x) is the probability of a correct recognition according to the probability distribution of the tested language model.
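With the uniform ground truth assumed above, (5.16) reduces to the geometric mean of the inverse model probabilities. A minimal sketch (illustrative Python; names ours):

import math

def perplexity(q_values):
    # q_values: model probabilities of the correct event for each of the
    # N test samples; p(x) = 1/N is assumed uniform, as in (5.16)
    N = len(q_values)
    cross_entropy = -sum(math.log2(q) for q in q_values) / N
    return 2.0 ** cross_entropy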

The first measure, usually given as the word error rate (WER), is an accuracy; briefly, it describes how correct the highest-probability hypothesis is. Perplexity is a measure of how probable the observed data is according to the model. Our model is designed to be implemented in a working ASR system, and this is why accuracy is a more important evaluation for us than perplexity. Even so, perplexity is a very popular measure, and many NLP researchers recommend reporting both evaluations. It has to be stressed that the previously described bag-of-words model cannot provide perplexity as such. The reason is that our model does not provide the probability of an event, such as a word following a given history of words. The model provides us with a grade of how coherent a set of words is. The perplexity of our model cannot be given, as the model uses the probability of a topic given all the words in a sentence, and there is no ground truth for this probability distribution to be used in (5.16). The topics in our model are not listed and named; they are not real topics, but representations of sentences grouped in an unsupervised process.

5.16 Conclusion

The POS tagger from Dr Piasecki (Piasecki, 2006) was applied as an extra language model to the problem of improving ASR. Although this is the most effective tagger for Polish, with an accuracy of 93.44%, the results were not good: it reduced the recognition rate by 57% when applied as an LM to an ASR system based on HTK. We believe this is because POS tag information for Polish is too ambiguous.

Another language model, inspired by LSA, was designed, implemented in Matlab and applied to improve ASR recognition of sentences. It is mainly a semantic model, but because of the inflective nature of the Polish language, it covers some syntax modelling as well. The semantic model uses statistics of how words appear together in sentences via a word-topic matrix, where the topics can be seen as sentence examples and patterns. The order of words is not kept, though; this is why we call it the bag-of-words model. An almost 300,000,000-word corpus was available for the task of training the model; however, some texts decreased the efficiency of the model. After several experiments, an improvement in recognition of 14% was achieved compared to a system without a language model, and of 33% compared to LSA. The average ranking position of the correct recognition in the entire n-best list of hypotheses was used as the evaluation grade. We believe that the bag-of-words model is effective because of the non-positional nature of the Polish language. The overall conclusion from this part of our research is that POS taggers are not useful in ASR of

The overall conclusion from this part of our research is that POS taggers are not useful in ASR of


Polish, but the bag-of-words model based on a word-topic matrix helps in the ASR task for Polish.

The ineffectiveness of applying POS tagging in language modelling of Polish was experimentally supported. The main contribution presented in this chapter is the successful model based on a word-topic matrix, which was invented, implemented and tested. It can be trained with less data than the baseline and has better predictive power.

The method could be improved by stemming the training corpora first. Stemming for Polish can be applied using Morfeusz (Wolinski, 2004), a morphological analyser implemented by Marcin Wolinski applying Zygmunt Saloni's rules. It would reduce data sparsity and improve results. The method can be applied to any other language for which LSA is useful; however, it is tuned to Polish and other Slavic languages because they are non-positional, and the bag-of-words philosophy fits the logic of these languages very well. We plan to train the bag-of-words model on a larger training corpus. The more data one can use, the better the performance of a language model that can be achieved. We believe this is especially true in this case, because LSA is known to be effective, yet it reduced recognition here when trained on the available data. LSA is a challenging baseline, and this is why we believe our method will be very good when trained on large enough data, which we plan to do.

The work on the bag-of-words model will be extended. Several possible combinations will be tested on larger corpora than those described here. We are in the process of obtaining more literature books, newspaper articles and high-quality websites. We will optimise the bag-of-words algorithm, especially how to save memory while working on matrix (5.1), and implement it in C. The method will be tested not only with sentences as topics, but also with paragraphs, articles and chapters. In all cases we will compare versions trained on the original corpora and on stemmed ones. We will also combine the bag-of-words model with n-grams to catch some extra information and achieve as high a recognition rate as possible.


Chapter 6

Conclusions and Future Research

It is difficult to predict success in research. In the case of ASR it is even more difficult, as revolutionary and effective solutions have been anticipated for approximately 25 years but have not, as yet, materialised. However, there is still important progress in all aspects of ASR. Our study of different parametrisation methods has highlighted a few aspects which might be especially successful in the near future. One of the obviously good avenues of research is the perceptual approach. The idea was conceived by Hermansky in improving the already popular LPC into PLP. Many other methods also give better results because they are perceptually motivated. Human hearing and speaking systems have been tuned to each other by millennia of evolution. It means that we have to simulate the processes in the human ear and brain to recognise and understand signals created by the human speech system. In fact, all ASR methods are perceptually motivated to some extent, but some specifically model perceptual features. Wavelets, for example, give good opportunities due to their non-uniform bandwidths. Phonological approaches also try to simulate the processes in the human ear in more detail.

Another issue, which will definitely become more important, is the differences in parametrisation of speech for different languages. The beginnings of research in ASR were based on English. Currently it has become quite popular to try to recognise other languages, like Japanese, Chinese, Arabic, German, French, Turkish, Finnish, the Slavic languages and many more. There are obvious differences between them, but the methods very often repeat the scheme applied for English. This might be an important encumbrance, because English is in fact quite an unusual language. It has a few features important for ASR which mark it out even from other western European languages, not to mention others. The huge majority of unstressed vowels are pronounced in a very similar way, which causes a large number of homophones. Conjugation is relatively simple, and declension of nouns and adjectives almost does not exist. Languages also differ in the widths of their possible frequency bands; for example, there are phonemes in Polish with frequencies much higher than any in English. It is quite common that people find some phonemes especially difficult to use while learning a new foreign language. This observation should be taken into consideration by researchers working on non-English languages.

Table 2.3 shows clearly that it is very difficult to find a new parametrisation method which

would outperform the baseline. It is usually much more successful to append new elements, or



to further process a commonly known parametrisation. This suggests that it might be impossible

to find any new crucial parametrisation method and success can be obtained rather by additional

processing of features or better modelling.

The statistics of phonemes, diphones and triphones were collected for Polish using a large corpus of mainly spoken formal language. A summary of the data was presented and interesting phenomena in the statistics were described. Triphone statistics play an important role in ASR. They are used to improve the proper transcription of the analysed speech segments. 28% of possible triples were detected as triphones, but many of them appeared very rarely. A majority of the rare triphones came from foreign or twisted words.

Most ASR systems do not use information about the boundaries of phonetic units such as phonemes. A method based on the DWT to find such boundaries was presented. The method is language agnostic, as it does not rely on any phonetic models but purely on the analysis of the power spectrum, and hence is applicable to any language. For the same reason it can easily be introduced into most existing systems, as it does not depend on any exact configuration or training of the speech model. It can also be used to provide additional information or a primal hypothesis for segmentation methods based on models, as in (Ostendorf et al., 1996). Our method is intelligent in the sense that it can easily be improved or adapted for specific applications, noisy data, etc. by introducing additional conditions or changing weights. The algorithm can find most of the boundaries with high accuracy. The use of several wavelet functions was compared, and our results show that Meyer wavelets are better than the others. Fuzzy recall and precision measures were introduced for segmentation in order to evaluate the method with more sensitivity, grading errors more smoothly than commonly used evaluation methods. Our results give approximately 0.72 f-score for Meyer and most of the other wavelets.

The precise evaluation method was described. It adapts the standard and very useful recall and precision scheme for applications where evaluation has to consider more detail. Speech segmentation is such a field, and so are many other types of segmentation, because the correctness of audio or image segmentation is typically not binary. This is why fuzzy sets proved useful in the task of segmentation evaluation. General rules for applying fuzzy logic to recall and precision were presented, together with an exact algorithm for phoneme segmentation evaluation as an example.
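
As an illustration, the sketch below grades each boundary match with a triangular membership function of the time error rather than a binary hit or miss; the triangular shape and the 20 ms tolerance are assumptions made for this example, not the exact membership functions used in this work.

import numpy as np

def fuzzy_recall_precision(detected, reference, tol=0.02):
    """Fuzzy recall and precision for boundary detection (a sketch).

    Each match is graded by a triangular membership of the time
    error: 1 for an exact hit, falling linearly to 0 at `tol` seconds."""
    def grades(points, targets):
        err = np.abs(np.subtract.outer(points, targets)).min(axis=1)
        return np.clip(1.0 - err / tol, 0.0, 1.0)

    detected = np.asarray(detected, float)
    reference = np.asarray(reference, float)
    recall = grades(reference, detected).mean()
    precision = grades(detected, reference).mean()
    f_score = 2 * precision * recall / (precision + recall + 1e-12)
    return recall, precision, f_score

# Example: one detection 10 ms off grades as a half hit, not a miss.
print(fuzzy_recall_precision([0.10, 0.31, 0.55], [0.10, 0.30, 0.55]))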

It seems that POS tags are too ambiguous to be used effectively in modelling Polish for ASR. According to our experiments, they actually reduce the number of correct recognitions. Even though POS information is important in Polish, the ambiguity of forms means that other language models have to be used.

A new method inspired by LSA was presented. Its advantage is that the smoothing of information in a matrix representing word-topic relations is based, for every topic, on a limited number of most closely related topics rather than on all of them as in LSA. Our model was still better than LSA, which actually reduced recognition with the available training data. The bag-of-words model can also be trained with less data than LSA. The performance was improved in comparison to the audio model. In the experiment with the best algorithm and most of the training data, we graded the method by the average position of the correct hypothesis in the n-best list. The improvement was 14% compared to using the HTK audio model only. LSA with the same training data reduced recognition.
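
The sketch below illustrates the two ingredients, assuming a word-topic count matrix and an n-best list with acoustic scores; the cosine similarity, the number of neighbouring topics k, and the interpolation weight are illustrative assumptions rather than the exact choices made in this thesis.

import numpy as np

def smooth_word_topic(W, k=5):
    """Smooth a word-topic count matrix over each topic's k most
    similar topics only, rather than over all topics as in LSA."""
    unit = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12)
    sim = unit.T @ unit                     # topic-topic cosine similarity
    S = np.empty_like(W, dtype=float)
    for t in range(W.shape[1]):
        nearest = np.argsort(sim[t])[-k:]   # k closest topics (incl. t)
        S[:, t] = W[:, nearest].mean(axis=1)
    return S

def rescore_nbest(hypotheses, acoustic_scores, S, vocab, weight=0.1):
    """Re-rank an n-best list by adding a semantic score (affinity of
    the hypothesis words to their best shared topic) to the acoustic
    score; `weight` is an illustrative interpolation constant."""
    def semantic(words):
        rows = [S[vocab[w]] for w in words if w in vocab]
        return float(np.max(np.sum(rows, axis=0))) if rows else 0.0
    scored = [(a + weight * semantic(h.split()), h)
              for h, a in zip(hypotheses, acoustic_scores)]
    return max(scored)[1]

In this setting, the average rank of the correct hypothesis in the re-ordered n-best list serves as the evaluation measure, as described above.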

The author’s research on ASR will be continued. He now works as a research assistant in an ASR project for AGH University of Science and Technology and the Polish Platform of Homeland Security. He is responsible for designing language models in the project, where he will apply his PhD experience and experiment with the bag-of-words method on a larger scale. It will probably be combined with n-grams and applied to subword units provided by a POS tagger in order to reduce the size of the dictionary. The author’s segmentation method has already been improved by other people in the project and is now being implemented in C++ for an ASR system which is going to be used in courts and during police interrogations. The paper on triphone statistics was judged very good by the 3rd LTC conference committee, which requested a revised version for a journal. The statistics will be collected again using a larger corpus and will be published in the new paper.

List of References

Abry, P. (1997). Ondelettes et turbulence (Eng. Wavelets and turbulence). Diderot ed., Paris.

Agirre, E., Alfonseca, E., and de Lacalle, O. L. (2004). Approximating hierarchy-based similarity

for WordNet nominal synsets using topic signatures. Proceedings of the 2nd Global WordNet

Conference. Brno, Czech Republic.

Agirre, E., Ansa, O., Martínez, D., and Hovy, E. (2001). Enriching WordNet concepts with topic

signatures. Proceedings of the SIGLEX Workshop on WordNet and Other Lexical Resources:

Applications, Extensions and Customizations.

Agirre, E., Martínez, D., de Lacalle, O. L., and Soroa, A. (2006). Two graph-based algorithms

for state-of-the-art WSD. Proceedings of the 2006 Conference on Empirical Methods in Natural

Language Processing, Sydney, pages 585–593.

Ahmed, N., Natarajan, T., and Rao, K. R. (1974). Discrete cosine transform. IEEE Transactions

on Computers, Jan:90–93.

Alewine, N., Ruback, H., and Deligne, S. (2004). Pervasive speech recognition. IEEE Pervasive

Computing, October-December:78–81.

Przepiorkowski, A. (2006). The potential of the IPI PAN corpus. Poznan Studies in Contemporary

Linguistics, 41:31–48.

Banerjee, S. and Pedersen, T. (2003). Extended gloss overlaps as a measure of semantic related-

ness. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence,

pages 805–810.

Basztura, C. (1992). Rozmawiac z komputerem (Eng. To speak with computers). Format.

Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occur-

ring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist.,

41(1):164–171.

Beep dictionary (2000). www.speech.cs.cmu.edu/comp.speech/Section1/Lexical/beep.html.

Bellegarda, J. (1998). A multispan language modeling framework for large vocabulary speech

recognition. IEEE Transactions on Speech and Audio Processing, 6(5):456–467.

Bellegarda, J. R. (1997). A latent semantic analysis framework for large-span language modeling.

Proceedings of Eurospeech, 3:1451–1454.

Bellegarda, J. R. (2000). Large vocabulary speech recognition with multispan statistical language

models. IEEE Transactions on Speech and Audio Processing, 8(1):76–84.

Bellegarda, J. R. (2005). Latent semantic mapping. IEEE Signal Processing Magazine,

September:70–80.

Boersma, P. (1996). Praat, a system for doing phonetics by computer. Glot International,

5(9/10):341–345.

Brill, E. (1994). Some advances in transformation-based part of speech tagging. Proceedings of

the Twelfth National Conference on Artificial Intelligence (AAAI).

Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A

case study in part of speech tagging. Computational Linguistics, December:543–565.

Brown, P. F., Della Pietra, V. J., Della Pietra, S. A., Mercer, R. L., and Lai, J. C. (1992). An estimate

of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.

Cardinal, P., Boulianne, G., and Comeau, M. (2005). Segmentation of recordings based on partial

transcriptions. Proceedings of Interspeech, pages 3345–3348.

Coccaro, N. and Jurafsky, D. (1998). Towards better integration of semantic predictors in statistical

language modeling. Proceedings of ICSLP-98, Sydney.

Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine calculation of complex

Fourier series. Math. Comput., 19:297–301.

Cozens, S. (1998). Primitive part-of-speech tagging using word length and sentential structure.

Computation and Language.

Cuadros, M., Padro, L., and Rigau, G. (2005). Comparing methods for automatic acquisition of

topic signatures. Proceedings of the International Conference on Recent Advances in Natural

Language Processing (RANLP).

Daelemans, W. and van den Bosch, A. (1997). Language-independent data-oriented grapheme-to-

phoneme conversion. Progress in Speech Synthesis, New York: Springer-Verlag.

Daubechies, I. (1992). Ten lectures on Wavelets. Society for Industrial and Applied Mathematics,

Philadelphia, Pennsylvania.

Davis, K. H., Biddulph, R., and Balashek, S. (1952). Automatic recognition of spoken digits.

Journal of the Acoustical Society of America, 24(6):637–642.

Davis, S. B. and Mermelstein, P. (1980). Comparison of parametric representations for mono-

syllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics,

Speech and Signal Processing, ASSP-28(4):357–366.

Debowski, Ł. (2003). A reconfigurable stochastic tagger for languages with complex tag structure.

The Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL.

de Saussure, F. (1916). Cours de linguistique générale. Lausanne and Paris: Payot.

Demenko, G., Wypych, M., and Baranowska, E. (2003). Implementation of grapheme-to-phoneme

rules and extended SAMPA alphabet in Polish text-to-speech synthesis. Speech and Language

Technology, PTFon, Poznan, 7(17).

Demuynck, K. and Laureys, T. (2002). A comparison of different approaches to automatic speech

segmentation. Proceedings of the 5th International Conference on Text, Speech and Dialogue,

pages 277–284.

Denes, P. B. (1962). Statistics of spoken English. The Journal of the Acoustical Society of America,

34:1978–1979.

Deng, L., Wu, J., Droppo, J., and Acero, A. (2005). Analysis and comparison of two speech

feature extraction/compensation algorithms. IEEE Signal Processing Letters, 12(6):477–480.

Deng, Y. and Khudanpur, S. (2003). Latent semantic information in maximum entropy language

models for conversational speech recognition. Proceedings of the HLT-NAACL 03, pages 56–63.

Eskenazi, M., Black, A., Raux, A., and Langner, B. (2008). Let’s go lab: a platform for evaluation

of spoken dialog systems with real world users. Proceedings of Interspeech, Brisbane.

Evermann, G., Chan, H. Y., Gales, M. J. F., Hain, T., Liu, X., Mrva, D., Wang, L., and Woodland,

P. C. (2004). Development of the 2003 CU-HTK conversational telephone speech transcription

system. Proceedings of ICASSP Interspeech, pages I–249–252.

Farooq, O. and Datta, S. (2004). Wavelet based robust subband features for phoneme recognition.

IEE Proceedings: Vision, Image and Signal Processing, 151(3):187–193.

Fellbaum, C. (1999). Wordnet. An Electronic Lexical Database. Massachusetts Institute of Tech-

nology, US.

Forney, G. D. (1973). The Viterbi algorithm. Proceedings IEEE, 61:268–273.

Frankel, J. and King, S. (2005). A hybrid ANN/DBN approach to articulatory feature recognition.

Proceedings of Eurospeech.

Frankel, J. and King, S. (2007, in press). Speech recognition using linear dynamic models. IEEE

Transactions on Speech and Audio Processing.

Frankel, J., Wester, M., and King, S. (2007). Articulatory feature recognition using dynamic

Bayesian networks. Computer Speech and Language, 21(4):620–640.

Friedman, J., Hastie, T., and Tibshirani, R. (1999). Additive logistic regression: A statistical view

of boosting. Technical report, Department of Statistics, Stanford University.

Gałka, J. and Ziołko, B. (2008). Study of performance evaluation methods for non-uniform speech

segmentation. International Journal Of Circuits, Systems And Signal Processing, NAUN.

Ganapathiraju, A., Hamaker, J. E., and Picone, J. (2004). Applications of support vector machines

to speech recognition. IEEE Transactions on Signal Processing, 52(8):2348–2355.

Glass, J. (2003). A probabilistic framework for segment-based speech recognition. Computer

Speech and Language, 17:137–152.

Gorrell, G. and Webb, B. (2005). Generalized Hebbian algorithm for incremental latent semantic

analysis. Proceedings of Interspeech.

Grayden, D. B. and Scordilis, M. S. (1994). Phonemic segmentation of fluent speech. Proceedings

of ICASSP, Adelaide, pages 73–76.

Green, S. J. (1999). Lexical semantics and automatic hypertext construction. ACM Computing

Surveys (CSUR), 31.

Greenberg, S., Chang, S., and Hollenback, J. (2000). An introduction to the diagnostic evaluation

of switchboard- corpus automatic speech recognition systems. Proceedings of NIST Speech

Transcription Workshop.

Grocholewski, S. (1995). Załozenia akustycznej bazy danych dla jezyka polskiego na nosniku

CD-ROM (Eng. Assumptions of an acoustic database for the Polish language). Mat. I KK: Głosowa

komunikacja człowiek-komputer, Wrocław, pages 177–180.

Gronqvist, L. (2005). An evaluation of bi- and trigram enriched latent semantic vector models.

ACM Proceedings of ELECTRA Workshop - Methodologies and Evaluation of Lexical Cohesion

Techniques in Real-world Applications, Salvador, Brazil, pages 57–62.

Hain, T., Dines, J., Garau, G., Karafiat, M., Moore, D., Wan, V., Ordelman, R., and Renals, S.

(2005). Transcription of conference room meetings: an investigation. Proceedings of ICSLP

Interspeech.

Harary, F. (1969). Graph Theory. Addison-Wesley.

Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the

Acoustical Society of America, 87(4):1738–1752.

Hermansky, H. and Morgan, N. (1994). RASTA processing of speech. IEEE Transactions on

Speech and Audio Processing, 2(4):578–589.

Hifny, Y., Renals, S., and Lawrence, N. D. (2005). A hybrid MaxEnt/HMM based ASR system.

Proceedings of ICSLP Interspeech.

Holmes, J. N. (2001). Speech Synthesis and Recognition. London: Taylor and Francis.

Huang, W. and Lippman, R. (1988). Neural net and traditional classifiers. Neural Information

Processing Systems, D. Anderson, ed., pages 387–396.

Ishizuka, K. and Miyazaki, N. (2004). Speech feature extraction method representing periodicity

and aperiodicity in sub bands for robust speech recognition. Proceedings of ICASSP, pages

I–141–144.

Jarmasz, M. and Szpakowicz, S. (2003). Roget’s thesaurus and semantic similarity. Proceedings

of Conference on Recent Advances in Natural Language Processing (RANLP), pages 212–219.

Jassem, K. (1996). A phonemic transcription and syllable division rule engine. Onomastica-

Copernicus Research Colloquium, Edinburgh.

Jelinek, F., Merialdo, B., Roukos, S., and Strauss, M. (1991). A dynamic language model for

speech recognition. Fourth DARPA Speech and Natural Language Workshop, pages 293–295.

Johansson, S., Leech, G., and Goodluck, H. (1978). Manual of Information to Accompany the

Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers. Department

of English, University of Oslo.

Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing. Prentice-Hall, Inc., New

Jersey.

Kakkonen, T., Myller, N., and Sutinen, E. (2006). Applying part-of-speech enhanced LSA to

automatic essay grading. Proceedings of the 4th IEEE International Conference on Information

Technology: Research and Education (ITRE 2006), Tel Aviv, Israel, pages 500–504.

Kanejiya, D., Kumar, A., and Prasad, S. (2003). Automatic evaluation of students’ answers using

syntactically enhanced LSA. Proceedings of the HLT-NAACL 03 workshop on Building edu-

cational applications using natural language processing, 2:53–60.

Kecman, V. (2001). Learning and Soft Computing. Massachusetts Institute of Technology, US.

Kepinski, M. (2005). Kontekstowe zwiazki cech w sygnale mowy polskiej (Eng. Contextual feature

relations in Polish speech signal), PhD Thesis. AGH University of Science and Technology,

Krakow.

Khudanpur, S. and Wu, J. (1999). A maximum entropy language model integrating n-grams and

topic dependencies for conversational speech recognition. Proceedings of the IEEE Internatio-

nal Conference on Acoustics, Speech and Signal Processing (ICASSP), Phoenix, AZ.

King, S. (2003). Dependence and independence in automatic speech recognition and synthesis.

Journal of Phonetics, 31(3-4):407–411.

King, S. and Taylor, P. (2000). Detection of phonological features in continuous speech using

neural networks. Computer Speech and Language, 14(4):333–353.

Kucera, H. and Francis, W. (1967). Computational Analysis of Present Day American English.

Brown University Press, Providence.

Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., and Wolf, P. (2004). The CMU

Sphinx-4 speech recognition system. Sun Microsystems.

Li, H.-Z., Liu, Z.-Q., and Zhu, X.-H. (2005). Hidden Markov models with factored Gaussian

mixture densities. Elsevier Pattern Recognition, 38:2022–2031.

Lowerre, B. T. (1976). The HARPY Speech Recognition System, PhD thesis. Carnegie-Mellon

University, Pittsburgh.

Ma, J. Z. and Deng, L. (2004). Target-directed mixture dynamic models for spontaneous speech

recognition. IEEE Transactions on Speech and Audio Processing, 12(1).

Mahajan, M., Beeferman, D., and Huang, X. D. (1999). Improved topic-dependent language mode-

ling using information retrieval techniques. Proceedings of ICASSP, pages 541–544.

Makhoul, J. (1975). Spectral linear prediction: properties and applications. IEEE Transactions,

ASSP-23:283–296.

Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing.

MIT Press, Cambridge, MA.

Miller, T. and Wolf, E. (2006). Word completion with latent semantic analysis. 18th International

Conference on Pattern Recognition, ICPR, Hong Kong, 1:1252–1255.

Misra, H., Ikbal, S., Bourlard, H., and Hermansky, H. (2004). Spectral entropy based feature for

robust ASR. Proceedings of ICASSP, pages I–193–196.

Morgan, N., Zhu, Q., Stolcke, A., Sonmez, K., Sivadas, S., Shinozaki, T., Ostendorf, M., Jain, P.,

Hermansky, H., Ellis, D., Doddington, G., Chen, B., Cretin, O., Bourlard, H., and Athineos, M.

(2005). Pushing the envelope - aside. IEEE Signal Processing Magazine, 22:81–88.

Wester, M. (2003). Syllable classification using articulatory-acoustic features. Proceedings of

Eurospeech.

Nasios, N. and Bors, A. (2005). Finding the number of clusters for nonparametric segmentation.

Lecture Notes in Computer Science, 3691:213–221.

Nasios, N. and Bors, A. (2006). Variational learning for Gaussian mixture models. IEEE Transac-

tions on Systems, Man and Cybernetics - Part B: Cybernetics, 36(4):849–862.

Ostaszewska, D. and Tambor, J. (2000). Fonetyka i fonologia wspołczesnego jezyka polskiego

(Eng. Phonetics and phonology of the modern Polish language). PWN.

Ostendorf, M., Digalakis, V. V., and Kimball, O. A. (1996). From HMM’s to segment models: A

unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and

Audio Processing, 4:360–378.

Pedersen, T., Patwardhan, S., and Michelizzi, J. (2004). Wordnet::similarity - measuring the re-

latedness of concepts. Proceedings of the Nineteenth National Conference on Artificial Intelli-

gence (AAAI-2004), pages 1024–1025.

Piasecki, M. (2006). Hand-written and automatically extracted rules for Polish tagger. Lecture

Notes in Artificial Intelligence, Springer, P. Sojka, I. Kopecek, K. Pala, eds., Proceedings of

Text, Speech, Dialogue, pages 205–212.

Przepiorkowski, A. (2004). The IPI PAN Corpus: Preliminary version. IPI PAN.

Przepiorkowski, A. and Wolinski, M. (2003). The unbearable lightness of tagging: A case study

in morphosyntactic tagging of Polish. Proceedings of the 4th International Workshop on Lin-

guistically Interpreted Corpora (LINC-03), EACL 2003.

Rabiner, L. and Juang, B. H. (1993). Fundamentals of speech recognition. PTR Prentice-Hall,

Inc., New Jersey.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech

recognition. Proceedings of the IEEE, 77(2):257–286.

Rabiner, L. R. and Schafer, R. W. (1978). Digital Processing of Speech Signals. Prentice Hall,

Englewood Cliffs.

Raj, B. and Stern, R. M. (2005). Missing-feature approaches in speech recognition. IEEE Signal

Processing Magazine, September:101–116.

Riccardi, G. and Hakkani-Tur, D. (2005). Active learning: Theory and applications to automatic

speech recognition. IEEE Transactions on Speech and Audio Processing, 13(4):504–511.

Rioul, O. and Vetterli, M. (1991). Wavelets and signal processing. IEEE Signal Processing Ma-

gazine, 8:11–38.

Russell, M. and Jackson, P. J. B. (2005). A multiple-level linear/linear segmental HMM with a

formant-based intermediate layer. Computer Speech and Language, 19:205–225.

Seco, N., Veale, T., and Hayes, J. (2004). An intrinsic information content metric for semantic

similarity in wordnet. Proceedings of ECAI’2004, the 16th European Conference on Artificial

Intelligence.

Steffen-Batog, M. and Nowakowski, P. (1993). An algorithm for phonetic transcription of ortho-

graphic texts in Polish. Studia Phonetica Posnaniensia, 3.

Stober, K. and Hess, W. (1998). Additional use of phoneme duration hypotheses in automatic

speech segmentation. Proceedings of ICSLP, Sydney, pages 1595–1598.

Subramanya, A., Bilmes, J., and Chen, C. P. (2005). Focused word segmentation for ASR. Pro-

ceedings of Interspeech 2005, pages 393–396.

Suh, Y. and Lee, Y. (1996). Phoneme segmentation of continuous speech using multi-layer per-

ceptron. In Proceedings of ICSLP, Philadelphia, pages 1297–1300.

Tadeusiewicz, R. (1988). Sygnał mowy (Eng. Speech Signal). Wydawnictwo Komunikacji i

Łacznosci.

Tan, B. T., Lang, R., Schroder, H., Spray, A., and Dermody, P. (1994). Applying wavelet analysis

to speech segmentation and classification. H. H. Szu, editor, Wavelet Applications, volume Proc.

SPIE 2242, pages 750–761.

Hofmann, T. (1999). Probabilistic latent semantic analysis. Proceedings of Uncertainty in Artificial

Intelligence, UAI’99, Stockholm.

Toledano, D., Gomez, L., and Grande, L. (2003). Automatic phonetic segmentation. IEEE Tran-

sactions on Speech and Audio Processing, 11(6):617–625.

Tukey, J. W., Bogert, B. P., and Healy, M. J. R. (1963). The quefrency analysis of time series for

echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe-cracking. Proceedings of

the Symposium on Time Series Analysis (M. Rosenblatt, Ed), pages 209–243.

van Rijsbergen, C. J. (1979). Information Retrieval. London: Butterworths.

Venkataraman, A. (2001). A statistical model for word discovery in transcribed speech. Compu-

tational Linguistics, 27.

Veronis, J. (2004). Hyperlex: lexical cartography for information retrieval. Computer Speech and

Language, 18(3):223–252.

Villing, R., Timoney, J., Ward, T., and Costello, J. (2004). Automatic blind syllable segmentation

for continuous speech. Proceedings of ISSC 2004, Belfast.

Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum deco-

ding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.

Wang, D. and Narayanan, S. (2005). Piecewise linear stylization of pitch via wavelet analysis.

Proceedings of Interspeech, Lisboa, pages 3277–3280.

Watanabe, S., Minami, Y., Nakamura, A., and Ueda, N. (2004). Variational Bayesian estimation

and clustering for speech recognition. IEEE Transactions on Speech and Audio Processing,

12(4).

Weinstein, C. J., McCandless, S. S., Mondshein, L. F., and Zue, V. W. (1975). A system for

acoustic-phonetic analysis of continuous speech. IEEE Transactions on Acoustics, Speech and

Signal Processing, 23:54–67.

Wester, M. (2003). Pronunciation modeling for ASR - knowledge-based and data-derived methods.

Computer Speech and Language, 17:69–85.

Wester, M., Frankel, J., and King, S. (2004). Asynchronous articulatory feature recognition using

dynamic Bayesian networks. Proceedings of IEICE Beyond HMM Workshop.

Whittaker, E. and Woodland, P. (2003). Language modelling for Russian and English using words

and classes. Computer Speech and Language, 17:87–104.

Wolinski, M. (2004). System znacznikow morfosyntaktycznych w korpusie IPI PAN (Eng. The

system of morphological tags used in IPI PAN corpus). POLONICA, XII:39–54.

Wu, J. and Khudanpur, S. (2000). Efficient training methods for maximum entropy language

modelling. Proceedings of 6th International Conference on Spoken Language Technologies

(ICSLP-00).

Tam, Y.-C. and Schultz, T. (2008). Correlated bigram LSA for unsupervised language model

adaptation. Proceedings of Neural Information Processing Systems (NIPS), Vancouver.

Yannakoudakis, E. J. and Hutton, P. J. (1992). An assessment of n-phoneme statistics in phoneme

guessing algorithms which aim to incorporate phonotactic constraints. Speech Communication,

11:581–602.

Yapanel, U. and Dharanipragada, S. (2003). Perceptual MVDR-based cepstral coefficients

(PMCCs) for robust speech recognition. Proceedings of ICASSP.

Young, S. (1996). Large vocabulary continuous speech recognition: a review. IEEE Signal Pro-

cessing Magazine, 13(5):45–57.

Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Po-

vey, D., Valtchev, V., and Woodland, P. (2005). HTK Book. Cambridge University Engineering

Department, UK.

Zheng, C. and Yan, Y. (2004). Fusion based speech segmentation in DARPA SPINE2 task. Pro-

ceedings of ICASSP, Montreal, pages I–885–888.

Zhu, D. and Paliwal, K. K. (2004). Product of power spectrum and group delay function for speech

recognition. Proceedings of ICASSP.

Ziołko, B., Gałka, J., Manandhar, S., Wilson, R., and Ziołko, M. (2007). Triphone statistics for

Polish language. Proceedings of 3rd Language and Technology Conference, Poznan.

Ziołko, B., Manandhar, S., and Wilson, R. C. (2006a). Phoneme segmentation of speech. Procee-

dings of 18th International Conference on Pattern Recognition.

Ziołko, B., Manandhar, S., Wilson, R. C., and Ziołko, M. (2006b). Wavelet method of speech seg-

mentation. Proceedings of 14th European Signal Processing Conference EUSIPCO, Florence.

Ziołko, B., Manandhar, S., Wilson, R. C., and Ziołko, M. (2008a). Language model based on POS

tagger. Proceedings of SIGMAP 2008 the International Conference on Signal Processing and

Multimedia Applications, Porto.

Ziołko, B., Manandhar, S., Wilson, R. C., and Ziołko, M. (2008b). Semantic modelling for speech

recognition. Proceedings of Speech Analysis, Synthesis and Recognition. Applications in Sys-

tems for Homeland Security, Piechowice, Poland.

Zue, V. W. (1985). The use of speech knowledge in automatic speech recognition. Proceedings of

the IEEE, 73:1602–1615.