

Arabic text to speech synthesis based on neural networks for MFCC estimation

Ilyes Rebai
MIRACL: Multimedia InfoRmation system and Advanced Computing Laboratory
Sfax University, Tunisia
Route de Tunis km 10, technopole, Sfax, Tunisia
[email protected]

Yassine BenAyed
MIRACL: Multimedia InfoRmation system and Advanced Computing Laboratory
Sfax University, Tunisia
Route de Tunis km 10, technopole, Sfax, Tunisia
[email protected]

Abstract— With the increasing number of users of text to speech applications, high quality speech synthesis is required. However, few studies address Arabic text to speech, and compared with other languages such as English and French the quality of synthesized Arabic speech is still poor. For these reasons, we propose in this paper an Arabic text to speech synthesis system based on statistical parametric synthesis. Mel Frequency Cepstral Coefficients (MFCC), energy and pitch are predicted using back propagation artificial neural networks and then transformed into speech using a Mel Log Spectrum Approximation filter. In Arabic written text, the short vowels, called diacritic marks, are often omitted, so a diacritization system is proposed to resolve this problem. Units of different sizes, namely phonemes, diphones and triphones, are considered in the speech database. The MFCC neural network architecture and an objective evaluation with the MFCC distortion measure are given in this paper.

Keywords— speech synthesis; statistical parametric synthesis; Mel Frequency Cepstral Coefficients; text diacritization

I. INTRODUCTION

Text To Speech (TTS) is the process of generating speech as output from text as input. The two popular methods for speech synthesis are concatenative synthesis [1] and Statistical Parametric Synthesis (SPS) [2]. In general, little work has been done for the Arabic language compared with other languages such as English and French. Most of this work is based on the first approach; only a few systems use statistical parametric synthesis.

Recent research shows that the SPS method is capable of synthesizing a high quality voice [2][3]. Furthermore, only a few resources are needed to build a complete TTS system. In addition, many embedded systems have been developed in recent years; they are widely used and are characterized by limited memory and computation resources compared with a computer.

Hence, SPS is an interesting approach for building a text to speech system for the Arabic language. It consists of two main phases. In the training phase, features are extracted from the speech database and learned by a machine learning model together with the corresponding text features: Mel Frequency Cepstral Coefficients (MFCC) are used as spectral features, and pitch, energy and duration as prosodic features. In the second phase, the spectral and prosodic features are predicted, using the models obtained in the first phase, in order to synthesize speech. A text analyzer transforms the text to be synthesized into acoustic and phonetic features, which the machine learning model takes as input to predict the spectral and prosodic features. Finally, the predicted MFCC, pitch and energy are transformed into speech using the Mel Log Spectrum Approximation (MLSA) filter [4].

The speech database is very important for both concatenative and statistical parametric systems. It contains units of natural pre-recorded speech. In concatenative synthesis systems, the utterance is produced by concatenating appropriate units selected from the inventory; in SPS systems, as noted above, the database is used only in the training phase, to extract features for each unit. Frequently, the inventory is built with fixed-length units, typically phones and diphones [5]. In the concatenative approach, recent research uses a dictionary of units of various lengths, and as a result the speech quality is higher than with fixed-length units [6][7].

In this paper, we propose a text to speech synthesis system based on the statistical parametric approach, with artificial neural networks as the learning technique for the generation of MFCC, energy, pitch and duration; neural networks are known for their capability to resolve non-linear mapping problems such as, in this case, the mapping from text to speech. We also propose a speech database with non-uniform units, namely phonemes, diphones and triphones, to obtain smoother synthesized speech. In addition, a diacritization system is developed to predict the diacritic marks of the input text so that words are pronounced correctly.

The remaining sections are organized as follows. Section 2 presents the Arabic language and its phonetic characteristics. The complete text to speech synthesis system is shown in section 3. The implementations of the two sub-systems, the diacritization system and the speech synthesis system, are explained in sections 4 and 5 respectively. The evaluation of the speech quality of our system is given in section 5.4. Finally, the conclusion is presented in section 6.

II. THE ARABIC SPEECH SPECIFICITIES

The Arabic language is one of the languages written from right to left. It has 34 phonemes classified as follows: 28 consonants, 3 long vowels (ا، و، ي) and 3 short vowels (ـَ، ـُ، ـِ),


called diacritic marks [5]. Two phonemes (و and ي) can be semi-vowels, depending on the context in which they appear in the word, and with full text diacritization they are easily recognized as vowel or consonant. In addition to the three diacritics, the sukun (ـْ), written above the consonant, can also be used in Arabic text; it is usually not written, because each consonant without a short vowel is considered to carry a sukun.

The International Phonetic Alphabet (IPA) classifies the Arabic consonants according to the place and manner of articulation. Table 1 shows this classification for the Arabic language.

TABLE I. ARABIC CONSONANTS

Stops         Voiced:   ب (bilabial), د, ض (alveo-dental), ج (palatal)
              Unvoiced: ت, ط (alveo-dental), ك (velar), ق (uvular), ء (glottal)
Fricatives    Voiced:   ذ, ظ (inter-dental), ز (alveo-dental), غ (velar), ع (pharyngeal)
              Unvoiced: ف (labio-dental), ث (inter-dental), س, ص (alveo-dental), ش (palatal), خ (velar), ح (pharyngeal), هـ (glottal)
Nasals        Voiced:   م (bilabial), ن (alveo-dental)
Trill         Voiced:   ر (alveo-dental)
Lateral       Voiced:   ل (alveo-dental)
Semi-vowels   Voiced:   و (bilabial), ي (palatal)

A. Syllables

The Arabic language includes the notion of the syllable, which plays an important role in linguistic terms: syllables are easily detectable and small in number. Arabic words contain one or more syllables. Every syllable begins with a consonant followed by a vowel. Short and long vowels are represented by (v) and (vv) respectively. A syllable that ends with a consonant (c) is closed, and a syllable that ends with a vowel (v) is open. There are six Arabic syllable types, as shown in table 2; (cv) is the most frequent syllable, whereas (cvvcc) is very rare [8].
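For illustration (this sketch is ours, not the paper's code), the six syllable types and the open/closed distinction can be encoded directly in Python:

# A minimal sketch: check a c/v pattern string against the six legal Arabic
# syllable types and report whether the syllable is open or closed.
ARABIC_SYLLABLE_TYPES = {"cv", "cvc", "cvcc", "cvv", "cvvc", "cvvcc"}

def classify_syllable(pattern: str) -> str:
    """Return 'open' or 'closed' for a legal pattern such as 'cvvc'."""
    if pattern not in ARABIC_SYLLABLE_TYPES:
        raise ValueError(f"{pattern!r} is not a legal Arabic syllable type")
    return "open" if pattern.endswith("v") else "closed"

print(classify_syllable("cv"))     # open   (the most frequent type)
print(classify_syllable("cvvcc"))  # closed (the rarest type)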

B. Diacritization

The process of syllabification requires full text diacritization in order to obtain the right syllabification and, above all, the correct pronunciation. For example, كَتَبَ (write) and كُتُبٌ (books) have the same spelling, but the difference in diacritics gives different word meanings and also different syllable types: cv, cv, cv for كَتَبَ and cv, cv, cvc for كُتُبٌ. So, the input text should carry diacritic marks. The main problem in Arabic, however, is that text is usually written without these marks [9][10].

Many systems have been developed to handle this problem, using several methods. The most common method is based on standard Arabic dictionaries. Another method uses a morphological analyzer, without the need for a dictionary [11]. A third method is based on a learning system with fast response and little data [12].

The main drawback of the dictionary approach is that the Arabic vocabulary remains incomplete even with the largest dictionary; moreover, a dictionary needs a large amount of memory to be stored and significant time to access and respond. The morphological analyzer, while avoiding the use of a dictionary, suffers from the high processing time of the disambiguation process [11]. The third method, which has the advantages of a lower memory requirement and a fast response time, is the one used in this paper.

In this work, the diacritization system is built on an artificial neural network, a Multi-Layer Perceptron (MLP). The system is presented in section 4.

TABLE II. ARABIC SYLLABLES

Syllable   Example   English meaning
CV         و         And
CVC        أخ        Brother
CVCC       صفر       Zero
CVV        في        In
CVVC       باب       Door
CVVCC      حار       Hot

III. OVERVIEW OF THE PROPOSED METHOD

In this paper, the text to speech system is composed of two sub-systems, a diacritization system and a speech synthesis system, detailed in sections 4 and 5 respectively. The complete system is shown in Fig. 1. The first system takes the text entry as input and produces a fully diacritized text as output, which then becomes the input of the second system. In this stage, the text is transformed into acoustic and phonetic features, then into spectral and prosodic features, and finally into continuous speech as output. Both systems are based on fully connected multilayer perceptron neural networks trained with the back propagation algorithm, which is simple, fast and capable of performing a non-linear mapping.

Figure 1. Text to speech synthesis system (text → diacritization system → diacritized text → speech synthesis system → speech)


IV. DIACRITIZATION SYSTEM

Before the text enters the Neural Network (NN), some preprocessing steps have to be fulfilled. First, anything except Arabic graphemes and the punctuation marks ( . ! ? ) is removed. Next, abbreviations and numbers are expanded into full words, and special characters and symbols are transformed into the corresponding Arabic words. For example, the number 2 is expanded as "إثنان" (two), "إلخ" as "إلى آخره" (etcetera), and "$20" as "عشرون دولارا" (twenty dollars). Finally, the text is segmented into sentences, each sentence is segmented into words, and each word is transformed into input vectors for the NN.
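A minimal sketch of these normalization steps follows (our illustration; the Unicode range and the expansion table are assumptions, not specified in the paper):

import re

# Keep Arabic letters/diacritics and the three punctuation marks, expand
# digits through a (hypothetical) lookup table, then split into sentences
# and words.
ARABIC_BLOCK = "\u0600-\u06FF"            # Arabic Unicode block (assumption)
KEEP = re.compile(rf"[^{ARABIC_BLOCK}\s.!?]")

DIGITS = {"2": " إثنان "}                 # illustrative entry from the paper

def normalize(text: str) -> list[list[str]]:
    text = "".join(DIGITS.get(ch, ch) for ch in text)    # expand numbers
    text = KEEP.sub(" ", text)                           # drop other symbols
    sentences = re.split(r"[.!?]", text)                 # sentence split
    return [s.split() for s in sentences if s.strip()]   # word split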

A. Input vector

The features extracted from the text for each grapheme include the current grapheme, represented with 5 bits, and the 3 adjacent graphemes on the right and on the left, i.e. 6 × 5 bits. Furthermore, the current grapheme position in the word (C.GPW) and the current word position in the sentence (C.WPS) are each coded with 3 bits. Finally, whether a shadda is written above the grapheme is coded with 1 bit, for a total of 42 input bits.

Note that the number of adjacent graphemes was determined experimentally; 3 adjacent graphemes on each side give the best result.
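To make the encoding concrete, a minimal sketch follows (our illustration; the grapheme-to-code assignment, the zero padding at word boundaries and the one-hot begin/middle/end position coding are assumptions consistent with, but not stated in, the paper):

# Build the 42-dimensional input vector: 7 graphemes x 5 bits (current plus
# 3 neighbours on each side), 3 bits C.GPW, 3 bits C.WPS, 1 shadda bit.

def bits(value: int, width: int) -> list[int]:
    return [int(b) for b in format(value, f"0{width}b")]

def position_bits(index: int, length: int) -> list[int]:
    """One-hot begin/middle/end coding (assumed, mirroring Table IV)."""
    if index == 0:
        return [1, 0, 0]
    if index == length - 1:
        return [0, 0, 1]
    return [0, 1, 0]

def encode(word, sentence, i, grapheme_code, has_shadda):
    """Vector for grapheme i of `word` within the word list `sentence`."""
    vec = []
    for k in range(i - 3, i + 4):               # 7 graphemes, 5 bits each
        code = grapheme_code.get(word[k]) if 0 <= k < len(word) else 0
        vec += bits(code or 0, 5)               # 0 used as padding (assumed)
    vec += position_bits(i, len(word))                          # C.GPW
    vec += position_bits(sentence.index(word), len(sentence))   # C.WPS
    vec.append(1 if has_shadda else 0)                          # shadda flag
    assert len(vec) == 42
    return vec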

B. Output vector

The NN predicts, for each input, one of the seven diacritic marks defined in this work; accordingly, 7 neurons are defined in the output layer. The neural network outputs and their corresponding marks are shown in table 3.

TABLE III. NEURAL NETWORK OUTPUT

Each of the seven one-hot output codes (0000001, 0000010, 0000100, 0001000, 0010000, 0100000, 1000000) corresponds to one of the seven diacritic marks defined in this work.

C. Neural network architecture and results

The NN architecture consists of 42 input nodes, two hidden layers with 50 nodes in the first and 200 nodes in the second, and 7 output nodes, with the sigmoid activation function (S) in both the hidden and output layers.

Figure 2. Neural network architecture of the diacritization system (input layer: 3 left graphemes, current grapheme, 3 right graphemes, C.GPW, C.WPS, shadda; two hidden layers; output layer)

The architecture is shown in Fig. 2. The corpus contains 8074 vectors: 7268 are used for training and 806 for the test phase. The error rate obtained on the test set is 17%.
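A minimal PyTorch sketch of this architecture follows (our illustration; the loss function and optimizer are assumptions, since the paper only specifies back propagation):

import torch.nn as nn

# The 42-50-200-7 MLP of Fig. 2, sigmoid activations throughout.
model = nn.Sequential(
    nn.Linear(42, 50),  nn.Sigmoid(),   # first hidden layer
    nn.Linear(50, 200), nn.Sigmoid(),   # second hidden layer
    nn.Linear(200, 7),  nn.Sigmoid(),   # one output per diacritic mark
)
# Illustrative training step (assumed loss):
# loss = nn.BCELoss()(model(x), target_one_hot)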

In order to decrease the error rate, the text produced by the diacritization system goes through a second process, based on a set of rules, that corrects some of the mistakes of the NN. After this second process, the error rate decreases by 5 to 10%.

V. SPEECH SYNTHESIS SYSTEM

Figure 3. Speech synthesis system. Training part: the speech database and the text analyzer provide speech parameters (MFCC, energy, pitch, duration) and acoustic/phonetic text features used to train ANNs 1, 2, 3 and the duration model. Synthesis part: the text analyzer output feeds the trained ANN models, and the predicted MFCC, energy and F0 are passed to the MLSA filter to produce speech.

A. Database

The speech database consists of a set of speech recordings from a single female speaker. It is composed of 434 sentences (382 declarative, 17 exclamatory and 35 interrogative) containing 1346 words in all, from 1 to 10 words per sentence, with 1066 distinct words and 5348 units. 90% of the sentences are used in the training phase and 10% for testing. The SPTK toolkit [13] is used to extract the parameters from the speech: 24 MFCC coefficients, pitch and energy, with a 25 ms frame length and a 10 ms frame shift.
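For illustration, such an extraction can be scripted around SPTK's command-line tools; the sample rate (16 kHz, so 25 ms = 400 samples and 10 ms = 160 samples) and the exact flag values below are our assumptions, not values given in the paper:

import subprocess

# Hedged sketch of SPTK-based mel-cepstral extraction from a raw recording.
CMD = (
    "x2x +sf < speech.raw | "                 # 16-bit samples to float
    "frame -l 400 -p 160 | "                  # 25 ms frames, 10 ms shift
    "window -l 400 -L 512 | "                 # windowing, zero-padded to 512
    "mcep -l 512 -m 24 -a 0.42 > speech.mcep" # c(0)..c(24); c(0) is discarded
)
subprocess.run(CMD, shell=True, check=True)
# Pitch can be extracted analogously with SPTK's `pitch` tool, and the
# per-frame energy computed directly from the windowed samples.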


B. Speech segmentation

In this paper, we propose a database with non-uniform units, namely phonemes, diphones (two successive phonemes) and triphones (three successive phonemes; a long vowel is regarded as two phonemes), in order to obtain a smoothed trajectory and improve the quality of the synthesized speech. Each unit must contain exactly one consonant, with or without vowels. The rules for the segmentation of speech are as follows:

1) If a consonant is followed by another consonant, the first consonant represents an acoustic unit (c).

2) If a consonant and a vowel are followed by another consonant, the first consonant and the vowel represent an acoustic unit (cv).

3) If a consonant and a long vowel are followed by another consonant, the first consonant and the long vowel represent an acoustic unit (cvv).

For example, the word "الباب" (the door) is composed of < c, v, c, c, vv, c >, and after segmentation we obtain 4 units: cv, c, cvv, c.
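These three rules translate directly into a short scanning procedure; the following sketch is our reading of them:

# Scan the c/v/vv sequence, start a unit at each consonant, and absorb one
# following vowel (short or long) if present.
def segment(phones: list[str]) -> list[str]:
    units, i = [], 0
    while i < len(phones):
        assert phones[i] == "c", "every unit starts with a consonant"
        if i + 1 < len(phones) and phones[i + 1] in ("v", "vv"):
            units.append("c" + phones[i + 1])   # rules 2 and 3: cv / cvv
            i += 2
        else:
            units.append("c")                   # rule 1: bare consonant
            i += 1
    return units

print(segment(["c", "v", "c", "c", "vv", "c"]))  # ['cv', 'c', 'cvv', 'c']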

C. The proposed method

The speech synthesis system is shown in Fig. 3. Three neural networks are used to predict, respectively, the 24 MFCC coefficients, the energy and the pitch extracted from the speech. In addition, a fourth neural network estimates the duration of each unit. In the synthesis phase, the MFCC, pitch and energy are predicted using the models obtained from the training phase and are given to the MLSA filter to produce the synthesized speech. The duration value is used only as a text feature.
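The final reconstruction step can be illustrated with pysptk, a Python wrapper of SPTK that the paper does not use; everything below (analysis settings, excitation construction, inclusion of c(0)) is an assumption-laden sketch, not the authors' implementation:

import numpy as np
import pysptk
from pysptk.synthesis import MLSADF, Synthesizer

FS, HOP, ALPHA, ORDER = 16000, 160, 0.42, 24   # assumed analysis settings

def reconstruct(mc: np.ndarray, f0: np.ndarray) -> np.ndarray:
    """mc: (T, ORDER+1) mel cepstra incl. c(0); f0: (T,) Hz, 0 if unvoiced."""
    pitch = np.where(f0 > 0, FS / np.maximum(f0, 1), 0.0)  # period in samples
    excitation = pysptk.excite(pitch, HOP)                 # pulse/noise source
    b = pysptk.mc2b(mc, alpha=ALPHA)                       # MLSA coefficients
    return Synthesizer(MLSADF(order=ORDER, alpha=ALPHA), HOP).synthesis(excitation, b)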

1) Input and output features: The text is transformed into acoustic and phonetic features by a text analyzer. Table 4 shows the input features. The input parameters to the neural network contain the current, previous and next unit features, articulation features and syllable types, together with the current unit position in the syllable and in the word, the current syllable position in the word, the current word position in the sentence, the type of sentence, the unit duration value and temporal features. Fifty-five time-index neurons are used to represent the relative position (the temporal variation) of the current frame within the unit and help to smooth the transitions between neighboring frames. The value of time index i during frame j is calculated through Eq. 1, where the scale constant β = 0.2 [14]:

O_i = exp(-β (i - j)²)    (1)

On the output layer, the 24 MFCC coefficients are predicted for each frame of 25 ms length with a 10 ms shift.
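Eq. 1 can be evaluated for all 55 time-index neurons at once; in the sketch below (ours), the mapping of frame j onto the 0–54 index range is an assumption about how a unit's frames are normalized:

import numpy as np

def time_index(j: int, n_frames: int, size: int = 55, beta: float = 0.2):
    """Gaussian time-index activations of Eq. 1 for frame j of a unit."""
    i = np.arange(size)
    pos = j * (size - 1) / max(n_frames - 1, 1)   # assumed frame-to-index map
    return np.exp(-beta * (i - pos) ** 2)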

TABLE IV. FEATURES EXTRACTED FROM TEXT

1- Overall features:

Features                                 Number of bits
Current unit features                    9 bits (bin)
Previous unit features                   9 bits (bin)
Next unit features                       9 bits (bin)
Current unit articulation features       25 bits (bin)
Previous unit articulation features      25 bits (bin)
Next unit articulation features          25 bits (bin)
Current syllable type                    3 bits (bin)
Previous syllable type                   3 bits (bin)
Next syllable type                       3 bits (bin)
Current unit position in the syllable    3 bits (bin)
Current unit position in the word        3 bits (bin)
Current syllable position in the word    3 bits (bin)
Current word position in the sentence    3 bits (bin)
Type of sentence                         3 bits (bin)
Temporal features                        55 bits (float)
Unit duration                            1 bit (float)
Total                                    182 bits

2- Unit articulation features:

Features                  Feature values                          Number of bits
Unit type                 C/CV/CVV                                3 bits (bin)
Consonant voicing         Voiced/Unvoiced                         2 bits (bin)
Consonant fluency         Fluency/Desisting                       2 bits (bin)
Consonant plosiveness     Plosiveness/Fricativeness/Middle        3 bits (bin)
Consonant height          Elevation/Lowering                      2 bits (bin)
Consonant adhesion        Adhesion/Separation                     2 bits (bin)
Lip rounding              Yes/No                                  2 bits (bin)
Place of articulation     Bilabial/Labio-dental/Inter-dental/
                          Alveo-dental/Palatal/Velar/Uvular/
                          Pharyngeal/Glottal                      9 bits (bin)
Total                                                             25 bits

3- Unit features:

Features          Feature values                   Number of bits
Grapheme          ا, ب, ت, …, ه, و, ي              5 bits (bin)
Diacritic mark    Fatha/Dhamma/Kasra/Soukoun       4 bits (bin)
Total                                              9 bits


4- Other features:

Features                      Feature values                            Number of bits
Unit position in syllable     Begin/Middle/End                          3 bits (bin)
Syllable position in word     Begin/Middle/End                          3 bits (bin)
Unit position in word         Begin/Middle/End                          3 bits (bin)
Word position in sentence     Begin/Middle/End                          3 bits (bin)
Syllable type                 CV/CVC/CVV/CVCC/CVVC/CVVCC                3 bits (bin)
Type of sentence              Declarative/Interrogative/Exclamatory     3 bits (bin)

Also, three other neural networks are used to predict energy, pitch and duration at the output layer.

2) Architecture of the MFCC neural network: The main goal of this model is to generate the 24 MFCC coefficients from the input features. Our architecture is composed of 4 layers: 182 nodes in the input layer, 24 nodes in the first hidden layer, 75 nodes in the second hidden layer and 24 nodes in the output layer, with a Gaussian activation function (G) in the hidden layers and a linear activation function (L) in the output layer. The neural network is trained for 1000 iterations.
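A PyTorch sketch of this network follows (our illustration; PyTorch has no built-in Gaussian activation, and exp(-x²) is an assumed form of the Gaussian unit):

import torch
import torch.nn as nn

class Gaussian(nn.Module):
    def forward(self, x):
        return torch.exp(-x ** 2)   # assumed Gaussian activation

mfcc_net = nn.Sequential(
    nn.Linear(182, 24), Gaussian(),   # first hidden layer (G)
    nn.Linear(24, 75),  Gaussian(),   # second hidden layer (G)
    nn.Linear(75, 24),                # linear output: 24 MFCCs per frame (L)
)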

D. Evaluation

To evaluate the synthesis system, the Mel Cepstral Distortion (MCD) [15] is used as an objective measure. The MCD is defined as follows:

MCD = (10 / ln 10) · (1/T) Σ_{t=1..T} √( 2 Σ_{i=1..24} (Wav_i^targ(t) − Wav_i^es(t))² )

where Wav^targ and Wav^es represent the sets of MFCCs of the target waveform and of the estimated waveform respectively, excluding silent frames, and Wav_i(t) indicates the value of MFCC i in frame index t. The MFCC coefficients used in this work exclude the zeroth coefficient. The MCD value obtained is 6.66. Table 5 shows some neural network architectures and their corresponding MCD values.
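The measure is straightforward to compute from the two MFCC sequences; a minimal sketch, assuming silent frames have already been removed and the zeroth coefficient dropped:

import numpy as np

def mcd(target: np.ndarray, estimated: np.ndarray) -> float:
    """target, estimated: (T, 24) aligned MFCC arrays (c1..c24, no c0)."""
    diff2 = np.sum((target - estimated) ** 2, axis=1)      # per-frame distance
    return float(np.mean((10.0 / np.log(10)) * np.sqrt(2.0 * diff2)))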

TABLE V. NEURAL NETWORK ARCHITECTURES

Architecture             MCD value
182 – 70 (S) – 24 (L)    6.81
182 – 75 (S) – 24 (L)    6.79
182 – 80 (S) – 24 (L)    6.86
182 – 85 (S) – 24 (L)    6.87
182 – 90 (S) – 24 (L)    6.97

VI. CONCLUSION

In this research, a complete text to speech synthesis system

was designed for Arabic language. Two sub-systems are

proposed: diacritization system to predict the vowelization of

the input text which is mostly unvowelized and speech

synthesis system to produce a high quality speech based on

statistical parameter synthesis. Artificial neural networks have

been used to build both systems. Different lengths of units

were used to build our speech corpus, which are phonemes,

diphones and triphones. Our text to speech system is an

unlimited vocabulary system with automatic vowelization.

REFERENCES

[1] G. Al-Said and M. Abdallah, "An Arabic Text-To-Speech System Based on Artificial Neural Networks", Journal of Computer Science, vol. 5, no. 3, pp. 207-213, 2009.

[2] E. V. Raghavendra, P. Vijayaditya and K. Prahallad, "Speech synthesis using artificial neural networks", in Proc. 2010 National Conference on Communications (NCC), 2010.

[3] A. Black, H. Zen and K. Tokuda, "Statistical parametric speech synthesis", in Proceedings of ICASSP, pp. 1229-1232, 2007.

[4] S. Imai, K. Sumita and C. Furuichi, "Mel log spectrum approximation (MLSA) filter for speech synthesis", IECE Trans. A, no. 2, pp. 122-129, 1983.

[5] T. S. Fares, A. F. Hegzay and A. H. Khalil, "Investigating an Arabic Text-To-Speech System Based on Diphone Concatenation", IJICIS, vol. 7, no. 1, pp. 49-69, 2007.

[6] M. Elshafei, H. Al-Muhtaseb and M. Al-Ghamdi, "Techniques for high quality Arabic speech synthesis", Information Sciences, pp. 255-267, 2002.

[7] T. Saidane, A. Haddad, M. Zrigui and M. Ben Ahmed, "Réalisation d'un système hybride de synthèse de la parole arabe utilisant un dictionnaire de polyphones" [Implementation of a hybrid Arabic speech synthesis system using a polyphone dictionary], Traitement Automatique de l'Arabe, 2004.

[8] A. Zaki, A. Rajouani and M. Najim, "Synthèse des variations de la fréquence fondamentale de la parole arabe à partir du texte" [Synthesis of the fundamental-frequency variations of Arabic speech from text], in Proceedings of the 19th Colloque sur le traitement du signal et des images, pp. 463-466, 2003.

[9] N. Fatehy, "An Integrated Morphological and Syntactic Arabic Language Processor Based on a Novel Lexicon Search Technique", master's thesis, Faculty of Engineering, Cairo University, 1995.

[10] M. Attia, "Theory and Implementation of a Large-Scale Arabic Phonetic Transcriptor, and Applications", PhD thesis, Dept. of Electronics and Electrical Communications, Faculty of Engineering, Cairo University, 2005.

[11] M. Al Badrashiny, "Automatic diacritizer for Arabic texts", PhD thesis, Dept. of Electronics and Electrical Communications, Faculty of Engineering, Cairo University, 2009.

[12] M. Alghamdi, M. Khursheed, M. Elshafei et al., "Automatic Arabic Text Diacritizer", King Abdulaziz City for Science and Technology (KACST), Riyadh, and King Fahd University of Petroleum and Minerals, Dammam, Saudi Arabia, 2006.

[13] SPTK, "Speech Signal Processing Toolkit", http://sp-tk.sourceforge.net/, Version 3.6, 2012.

[14] C. Fatima and G. Mhania, "Towards a high quality Arabic speech synthesis system based on neural networks and residual excited vocal tract model", Signal, Image and Video Processing, vol. 2, no. 1, pp. 73-87, 2008.

[15] J. Kominek, T. Schultz and A. Black, "Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion", in Proc. SLTU, Hanoi, Vietnam, 2008.