
© 2014, IJARCSSE All Rights Reserved Page | 531

Volume 4, Issue 5, May 2014 ISSN: 2277 128X

International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com

Development of Speech Database for Hindi Text-To-Speech System Considering Syllable as a Basic Unit

Arun Kumar C*, Shreekanth T, Udayashankara V

Dept. of ECE, SJCE Mysore, Karnataka, India; Dept. of ECE, SJCE Mysore, Karnataka, India; Dept. of IT, SJCE Mysore, Karnataka, India

Abstract: The objective of a Text-to-Speech (TTS) system is to convert orthographic text into intelligible, natural-sounding speech. Unit selection plays a vital role in achieving this. Phonemes, diphones, allophones and syllables are the basic units of a speech system. Using the phoneme as the basic unit of a concatenation-based TTS system produces a large number of concatenation points, which lowers the quality of the speech output. Using the syllable as the basic unit for database building produces fewer concatenation points and therefore higher-quality speech output. This work describes the building of a standard text database required to build a syllable-level speech database that takes into account the position of a syllable in a word, i.e. Start, Middle and End. The database consists of 1326 standard and non-standard words, covering 442 syllable units in each of the Start, Middle and End positions.

Keywords: Speech synthesis, Concatenative synthesis, Text processing, Speech generation, Hindi TTS system.

I. INTRODUCTION

The ultimate goal of Text-To-Speech (TTS) synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech [2]. This generally involves two steps:

1. Text processing.
2. Speech generation.

The objective of the text processing component is to process the given input text and produce an appropriate sequence of phonemic and syllabic units. These units are realized by the speech generation component either by synthesis from parameters or by selection of units from a large speech corpus [3]. For natural-sounding speech synthesis, it is essential that the text processing component produce an appropriate sequence of syllabic units corresponding to an arbitrary input text [4].

Phonemes, diphones, allophones and syllables are the basic units of speech. The phoneme is the smallest unit used in a speech synthesis system; its sound is not modified by other letters.

A syllable is a cluster of consonants (C) and vowels (V). A syllable contains one vowel and any number of consonants.

1. A single vowel can act as a syllable (i.e. V).
2. Possible syllable patterns include V, C*V, V*C, C*V*C, C*C*V, C*C*C*V*C*C*C, etc.
3. A consonant before the vowel is called the 'Onset', i.e. (C*V).
4. A consonant after the vowel is called the 'Coda', i.e. (V*C).

The databases developed for Text-to-Speech synthesis systems generally use phonemes or syllables as the basic concatenative unit. Such databases are built or collected from LDCIL and have been used by many researchers for continuous speech synthesis and recognition systems. Most of the work has been carried out for the Chinese, Punjabi and English languages; little work has been done for other Indian languages. Table II shows various databases built by researchers for TTS systems.

A speech database has been developed for a Text-to-Speech synthesis system in the Kannada language at Mysore. The basic unit selected for speech synthesis in this project was the phoneme. This speech database consists of a total of 1,605 phonemes. The phonemes were recorded using the PRAAT tool on the Windows operating system, at a sampling frequency of 16,000 Hz, using a standard microphone in the lab. The recorded phonemes include vowels, semivowels, stops, fricatives, nasals, etc. [1].

A Punjabi speech database has been developed for a Text-to-Speech synthesis system at the Department of Computer Science, Punjabi University, Patiala. Syllables were recorded for this database because the researchers selected the syllable as the basic unit of concatenation. The database consists of 3,312 syllables, which account for more than 99% of the cumulative percentage frequency in the selected corpus. These syllables were selected after analyzing all possible syllables of a Punjabi corpus containing nearly 2,33,009 unique words and more than four million words in total; 9,317 of the syllables were valid, and 3,312 of these were selected. The selected syllables were recorded from a speaker using a standard microphone in a studio environment [10].


Arun et al., International Journal of Advanced Research in Computer Science and Software Engineering 4(5),

May - 2014, pp. 531-549


A Text-to-Speech synthesis system for four Indian languages, Hindi, Odiya, Bengali and Telugu, has been developed at the Department of Computer Science and Application, Utkal University, Bhubaneswar. To develop the speech corpora, native speakers of each of the four languages were recruited and asked to read the text in a laboratory environment without background noise. The system uses concatenation of syllables for the development of the speech database [11].

The following section describes the syllable rules involved in word segmentation and concatenation-based Text-to-Speech synthesis.

A. Syllable Rules

1. When a nasal such as /n'/, or a half-pronounced /m/ or /n/ sound, immediately follows a vowel, it is treated as part of the vowel and of the same syllable. For example, /n'/ in san'sthaa is part of the syllable containing /sa/ [10].

2. When there are three or more consonants between two consecutive vowels, the first consonant is part of the coda of the previous syllable, while the remaining consonants form the onset of the next syllable [10].

E.g. a b c d e (a and e are vowels; b, c and d are consonants)

/ab/ = Coda (V*C)

/cde/ = Onset (C*C*V)

3. When there are exactly two consonants between two vowels, the first consonant is part of the coda of the previous syllable and the second is the onset of the next syllable [10].

E.g. a m m a (a is a vowel; m is a consonant)

/am/ = Coda (V*C)

/ma/ = Onset (C*V)

4. When the second of the two consonants is a member of the set {/r/, /s/, /sh/, /shh/}, both consonants are part of the onset of the next syllable [10].

E.g. y a a t r a

/yaa/ = syllable 1

/tra/ = syllable 2
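
Rules 2 to 4 can be sketched in code. The following is a minimal Python illustration, not the authors' implementation. It assumes the word is already split into romanized phoneme tokens, it uses a hypothetical vowel inventory, and it places a single consonant between two vowels in the onset of the next syllable, a case the rules above do not cover. Rule 1 (nasals) is not handled separately.

```python
# Hypothetical romanized vowel inventory (an assumption for this sketch).
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au"}
# Second consonants that pull the whole two-consonant cluster into the onset (rule 4).
ONSET_JOINERS = {"r", "s", "sh", "shh"}


def syllabify(phones):
    """Split a list of phoneme tokens into syllables using rules 2 to 4."""
    vowel_idx = [i for i, p in enumerate(phones) if p in VOWELS]
    if not vowel_idx:
        return ["".join(phones)] if phones else []
    boundaries = []  # index at which each syllable after the first begins
    for left, right in zip(vowel_idx, vowel_idx[1:]):
        cluster = phones[left + 1:right]
        if len(cluster) <= 1:
            start = left + 1  # assumption: a lone consonant is an onset
        elif len(cluster) == 2 and cluster[1] in ONSET_JOINERS:
            start = left + 1  # rule 4: both consonants go to the onset
        else:
            start = left + 2  # rules 2 and 3: first consonant is the coda
        boundaries.append(start)
    cuts = [0] + boundaries + [len(phones)]
    return ["".join(phones[a:b]) for a, b in zip(cuts, cuts[1:])]


print(syllabify(["y", "aa", "t", "r", "a"]))  # the rule 4 example
print(syllabify(["a", "m", "m", "a"]))        # the rule 3 example
```

Leading consonants stay with the first syllable and trailing consonants with the last one, which is a simplification chosen for this sketch.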

Hindi has 5 short vowels, 5 long vowels, two diphthongs, four semivowels and 33 consonants. Hindi has a one-to-one correspondence between its spoken and written forms. The phonemes are divided into two types: vowels (swaras) and consonants (vyanjanas). Together they constitute the alphabet set (varnamala). Vowels are the letters that exist independently [10]. They are:

अ आ इ ई उ ऊ ऋ ए ऐ ओ औ

Consonants depend on vowels to take their independent form. They are shown below:

क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ल व श ष स ह

The combination of a consonant and a vowel forms a syllable (C*V), also called kagunitha.

E.g. क + आ = का, i.e. C + V = (CV)

Hindi is syllabic in nature. Hence, building the speech database for a TTS system with the syllable as the basic unit is the better choice [4].




B. Concatenative Synthesis

Concatenative synthesis plays back the waveform with the matching phone string. An utterance is synthesized by concatenating several speech fragments. Unlike synthesis-by-rule, it requires neither rules nor manual tuning. Moreover, each segment is completely natural, so very natural output can be expected. However, speech segments are greatly affected by co-articulation, so concatenating two segments that were not originally adjacent can produce spectral or prosodic discontinuities. Spectral discontinuities occur when the formants at the concatenation point do not match. Prosodic discontinuities occur when the pitch at the concatenation point does not match. A listener rates synthetic speech that contains large discontinuities as poor, even if each segment is very natural. Several factors contribute to the lack of naturalness in the output of speech synthesis systems, such as intonation and rhythm, variability along the prosodic parameters, and incorrect segmental rendering. The only task in this method is building an error-free speech database suitable for concatenation of speech units [1]. Prosody and intonation are also important for natural-sounding speech.

In Hindi, words can be composed of basic characters as well as complex clusters of C*V*C. For the latter, rules are needed to break the word into syllables. The work presented in this paper therefore uses simple rules for syllabification, i.e. rules for grouping clusters of C*V*C, based on heuristic analysis of several words in Telugu and Hindi [10]. A concatenation-based TTS system that uses the phoneme as the basic unit produces low-quality speech output because of the large number of concatenation points, which cause glitches. Using the syllable as the basic unit of concatenation reduces the number of concatenation points and so avoids much of this error.

This paper therefore describes how to build an error-free text and speech database for Hindi, as required to develop a concatenation-based TTS system.
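
The effect of unit size on the number of concatenation points can be illustrated with a small Python sketch. It only counts joins: a sequence of n units needs n - 1 joins. The romanized segmentation of "yaatra" follows the rule 4 example; the function name is hypothetical.

```python
def concatenation_points(units):
    """Number of joins needed to concatenate a sequence of units."""
    return max(len(units) - 1, 0)


# Romanized example word from the syllable rules: "yaatra"
phoneme_units = ["y", "aa", "t", "r", "a"]  # assumed phoneme segmentation
syllable_units = ["yaa", "tra"]             # syllables from the rule 4 example

print(concatenation_points(phoneme_units))   # joins with phoneme units
print(concatenation_points(syllable_units))  # joins with syllable units
```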

II. STRUCTURE OF TEXT AND SPEECH DATABASE

During speech synthesis, the required syllable units are fetched from the speech database, concatenated and finally processed to obtain quality speech output. Creating an error-free database of syllable units is therefore most important. The sound and duration of a syllable change slightly depending on its position in the speech. A syllable can occur at three different positions [1]:

1. At the start of a word (Start).
2. Between two phonemes (Middle).
3. At the end of a word (End).

For this purpose, a text database of 1326 words covering the complete syllable (C*V) set was prepared. It was compiled manually using a standard Hindi dictionary [12], text books and guidance from various researchers. The resulting text corpus of 1326 unique standard and non-standard words is used for building the speech database.

This text corpus, shown in Table I, covers the required syllable set in all possible positions of occurrence, i.e. Start, Middle and End. Many rarely occurring syllables, such as ञ, यर, र,् ङ, छ् etc., are included as they are so that all syllables are covered for documentation purposes.

The speech database is recorded with PRAAT [8], a utility for the Windows operating system. The prepared words were recorded with the PRAAT tool at a sampling frequency of 16 kHz with 16-bit samples [1].
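
A recorded file can be checked against this format before it enters the database. The following Python sketch uses only the standard `wave` module. The helper names and the silent demo files are hypothetical and serve only to make the example runnable; the mono assumption comes from the mono recording option used in the procedure.

```python
import os
import tempfile
import wave

EXPECTED_RATE_HZ = 16000   # sampling frequency stated in the text
EXPECTED_SAMPLE_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 1      # mono recording


def matches_recording_format(path):
    """Return True if the WAV file is 16 kHz, 16-bit, mono."""
    with wave.open(path, "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getsampwidth() == EXPECTED_SAMPLE_BYTES
                and wav.getnchannels() == EXPECTED_CHANNELS)


def write_silence(path, rate_hz, seconds=0.1):
    """Write a short silent 16-bit mono WAV file (demo input only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 16000)
    write_silence(bad, 8000)
    good_ok = matches_recording_format(good)
    bad_ok = matches_recording_format(bad)

print(good_ok, bad_ok)
```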

The following example shows the process of building the speech database. Suppose the required syllable is बा. Three words, बायत, आबाय and अमबा, are recorded with the PRAAT tool using a standard microphone. Each word is recorded and saved to the list, and बा is then extracted from each recorded word so that all three positions are covered. The extracted syllables are stored in their respective directories according to their position of occurrence. Figures 5, 6 and 7 show the labeling process, and Figures 1 to 8 show the steps involved in using the PRAAT tool during speech database building.

A. Procedure to build speech database

The following steps show how to use the PRAAT utility to build the speech database required to implement a concatenation-based TTS system.

Step 1: Open the PRAAT utility and select the 'Record mono Sound' option from the 'New' menu in the menu bar.

Fig. 1 PRAAT Tool




Step 2: Select the 16000 Hz sampling frequency and press Record to start recording the required sound.

Fig. 2 Selecting sampling frequency and recording

Step 3: Start recording and utter the word that covers the required unit. After recording, stop and save the sound to the list.

Fig. 3 Recording and save to list

Step 4: Create a TextGrid and start labelling the speech waveform by selecting the View & Edit option.

Fig. 5 'बा' Starting position

Fig. 6 'बा' Middle position




Fig. 7 'बा' End position

Step 5: Extract the labeled sound files using the 'Extract all non-empty interval' option.

Fig. 8 Extract labeled speech unit

After all the labeled files have been extracted from the uttered sound, they are saved in their respective directories, as shown in Figure 9.

Fig. 9 Directories named Start, Middle and End

The 'बा' unit from the starting position is saved in the Start directory, the one from the middle position in the Middle directory, and the one from the end position in the End directory.
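
The directory layout described above can be sketched in Python as follows. Only the directory names Start, Middle and End come from the text; the root directory name, the file naming scheme and the `.wav` extension are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

POSITIONS = ("Start", "Middle", "End")  # directory names from Fig. 9


def create_library(library_root):
    """Create the Start, Middle and End directories if they are missing."""
    for position in POSITIONS:
        (Path(library_root) / position).mkdir(parents=True, exist_ok=True)


def unit_path(library_root, position, unit_name):
    """Build the path of a syllable unit inside its position directory."""
    if position not in POSITIONS:
        raise ValueError(f"unknown position: {position}")
    return Path(library_root) / position / f"{unit_name}.wav"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "SoundLibrary"  # hypothetical root directory name
    create_library(root)
    created = sorted(p.name for p in root.iterdir())
    example = unit_path(root, "Middle", "baa").relative_to(root)

print(created)
print(example.as_posix())
```

Keeping one directory per position means the same unit name can exist three times without any clash, once for each position of occurrence.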

The speech database consists of a total of 1326 syllable (C*V) units. Each position has 429 syllables and 13 independent vowels, so the three positions together give [(429*3) + (13*3)] = 1326 units of speech data.
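
The unit count can be verified with a few lines of Python. The figures 429, 13 and the three positions are taken from the text. The decomposition of 429 into 33 consonants with 13 forms each is an assumption of this sketch, based on the 33 consonants stated earlier and the number of rows listed per consonant in Table I.

```python
CONSONANTS = 33           # consonant count stated in the text
FORMS_PER_CONSONANT = 13  # assumption: rows listed per consonant in Table I
INDEPENDENT_VOWELS = 13   # stated in the text
POSITIONS = 3             # Start, Middle, End

syllables_per_position = CONSONANTS * FORMS_PER_CONSONANT
units_per_position = syllables_per_position + INDEPENDENT_VOWELS
total_units = units_per_position * POSITIONS

print(syllables_per_position, units_per_position, total_units)
```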

TABLE I: TEXT CORPUS

FRONT MID BACK

कभर कीकय खटाक कायण खकाय खटाका ककयण चककत साकक कीकय पकीय धभकी कुमया तकुवा पऩ ॊकू कूकना ककून गुडाकू





क्रतक प्रक्रत क्र केयर याकेश तडके कैसा डकैती जाकै कोभर डकोटा भाको कोडी सकौय जाकौ कॊ कड सकॊ द कॊ क् क् क्

खकाय चखना देख खाकक भखान रेखा खखडकी भुखखमा साखख खीजना भखीय ऩयखी खुचय सखुर जाखु खून सखून जाखू ख्रऩा भाख्रत सख्र खेवा भखेर जाख ेखैयात सखैय राख ैखोवा जाखोय जाखो खौवा भुखौटा भाखौ खॊजय जखॊ सजखॊ ख् ख् ख् गगन गगन डग गात तगादा दगा गगयता फगगमा भागग गीदड दगीरा दागी गुजय झगुरी जागु गूथना फगूरा गागू ग्रह सग्रह जाग्र गेरी बॊगेडी जागे गैरयी दगैर जागै गोदात बगोडा जागो गौयव रगौय जागौ गॊदगी भगॊदा जागॊ ग् ग् ग् घटक फघय फघ घातक प्रघान साघा घघचपऩच सघघर याघघ घीना सॊघीम सघी घुटन सघुर भाघु घूभना सघून जाघू घ्रत सघ्रऩ जाघ्र घेयना भघेय दाघे घैरा भघैर सघै




घोखना सघोना भाघो घौद सघौय सघौ घॊट रघॊट घॊ घ् घ् घ् ङ ङ ङ ङा ङा ङा ङङ ङङ ङङ ङी ङी ङी ङु ङु ङु ङू ङू ङू ङ्र ङ्र ङ्र ङे ङे ङे ङै ङै ङै ङो ङो ङो ङौ ङौ ङौ ङॊ ङॊ ङॊ ङ् ङ् ङ् चक दचक ऩेच चाऩ ऩॊचाट ऩायचा गचकट ऩेगचया सगच चीरय ऩेचीदा प्रऩॊची चुका सचुक रचु चूक कचूय वाचू च्रभा सच्रभ वाच्र चटेा सचते याच ेचैतन्म सचैन चाचै चोकय कचोट चाचो चौवा कचौडी याचौ

चॊन्द्न्िका भचॊद भचॊ च् च् च्

छकाय ऩाछना ऩाछ छागर बफछाना ऩीछा घछकना बफघछमा छाघछ छीजन सछीन ऩॊछी छुवा बफछुवा वाछु छूटना सछूत राछू छ्र छ्र छ्र

छेडना सछेन ऩीछे छैत सछैत वाछै छोयी बफछोह वाछो छौका बफचौना वाछौ छॊगा सछॊद साछॊ




छ् छ् छ् जकड ऩूजना पौज जागीय खजाना ऩूजा न्द्जगय ऩून्द्जत फान्द्ज जीतना सजीत ऩाजी जुवायी बफजुभ जाजु जूट बफजूका काजू ज्र ज्र ज्र जेठ सजेन जाजे जैपवक बफजैरा जाजै जोखखभी घजोय राजो जौहय वजौय राजौ जॊगदाय सजॊग रजॊ ज् ज् ज्

झकोरा झझक जाझ झाडना फझावू साझा खझप्ना सखझना साखझ झीखना सझीन साझी झुटाना सझुना भाझु झूट जाझूना भाझू झ्र झ्र झ्र

झरेना साझरे ऩाझ ेझैर सझैरा साझै झोरी सझोरा ताझो झौय कझौय साझौ झॊकाय सझॊक झॊ झ् झ् झ् ञ ञ ञ ञा ञा ञा गञ गञ गञ ञी ञी ञी ञु ञु ञु ञू ञू ञू ञ्र ञ्र ञ्र ञ े ञ े ञ ेञै ञै ञै ञो ञो ञो ञौ ञौ ञौ ञॊ ञॊ ञॊ ञ् ञ् ञ्

टकयाव ऩाटर ऩाट टाऩना पऩटाया ऩाटा




टटकट बफटटमा फाटट टीकाकाय सटीक ऩाटी टुकडा भटुक जाटु टूटना भटूक राटू ट्र ट्र ट्र

टेकना सटेरा जाटे टैक्सी सटैय जाटै टोकन सटोरा भाटो टौर सटौर याटौ टॊकाय टॊ जाटॊ ट् ट् ट्

ठकाय ऩाठक ऩाठ ठाकुय सठाऩ ऩाठा टठग्ना गटठमा ऩाटठ ठीकडा गठीरा ऩाठी ठुनका घनठुय कठुय ठूरा सठूय ऩाठू ठ्र ठ्र ठ्र

ठेकेढाय सठेक भाठे ठैभ भठैर जाठै

ठोकना सठोक साठो ठौय कठौय जाठौ ठॊडा ठॊ ठॊ ठ् ठ् ठ् डफर सडक अखड डाककमा बफडार अगडा ङडमो अङडभ छोङड डीजर सडीर खखचडी डुफकी सडुर जाडु डूफना सडूक झाडू ड्र ड्र ड्र

डमेयी भडरे साड ेडनैा भडरै जाडै डोभनी अडोस साडो डौर सडौर ताडौ डॊका फडॊग याडॊ ड् ड् ड्

ढकना गाढन साढ ढाना गढाना गढा टढरावी गटढमा गटढ ढीरना सढीर साढी ढुरना सढुर साढु




ढूह सढूह साढू ढ्र ढ्र ढ्र ढेय सढेय साढे ढैम गढैमा साढै ढोका भढोवा वाढो ढौयी ऩढौसी जाढौ ढॊगा ढॊ ढॊ ढ् ढ् ढ् ण ण ण णा णा णा खण खण खण णी णी णी णु णु णु णू णू णू ण्र ण्र ण्र णे णे णे णै णै णै णो णो णो णौ णौ णौ णॊ णॊ णॊ ण् ण् ण्

तकना भतरी उगचत तागना बफताना अॊधता घतजाया इघतका अघत तीखा त्रतीम इभयती तुकाॊत भातुर भातु तूफ़ान भातूक सातू त्रतीम सॊत्रप्त त्र तमेीस भातभे सात ेतैनात नतैभ वात ैतोड सतोर भातो तौरना अतौर आतौ तॊगी भतॊगी भतॊ त् त् अत्

थकना थुथना अकथ थाऩना भथानी साथा गथेटाय भगथत अगथ थीभ भथीभ साथी थुथना राथुय साथु थूकना थाथु भाथू थ्र थ्र थ्र थेर भाथेन साथे




थैरा भथैन साथै थोडा हथोड जाथो थौडा हथौडा भाथौ थॊडा भाथॊगी प्रीथॊ थ् थ् थ् दकाय फॊदय नाद दाता बफदाय बफदा टदखना फॊटदश भटद दीभी भदीय फॊदी दकुडा भदरु भद ुदधू फॊदकू जाद ूिड आित साि

देखना बफदेश सादे दैघनक वदैन सादै दोगरा भादोन सादो दौड फदौना वादौ दॊगा वदॊती दॊ द् द् द्

धगडा फॊधन अध धाना फॊधान वाधा गधक अगधक आगध धीभय फाॉधीत राधी धुक फॊधुता साधु धूऩ सधूय वाधू ध्र ध्र ध्र धेना अधेड साधे

धैमरवान अधैमर साधै धोखा सधोना आधो धौखना भधौना साधौ धॊधा वधॊती रयधॊ ध् ध् ध्

नकटा ऩनही अॊकन नाका ऩनाह अधाना घनकट सघनह यानी नीका ऩनीयी अॊजनी नुकीरा अनुजा अनु नूतन कनूत जानू न्रशॊस न्र न्र नेती जानेक अॊजाने नैघतक फनैरा जानै नोचना भनोज भानो नौकय कनौज भानौ




नॊगरा भानॊद भानॊ न् न् न् ऩकड तऩना आकॊ ऩ ऩाठ क्रऩार ु ऩाऩा पऩटायी कपऩर सीपऩ ऩीच सऩीठ छऩी ऩुकाय सऩुदर काऩु ऩूजना अऩूणर ऩाऩु प्र प्र प्र

ऩेखना सऩेया ताऩे ऩैतान ऩाऩैना साऩै ऩोटा सऩोरा ऩाऩो ऩौनी फऩौती साऩौ ऩॊककर सऩॊत सोऩॊ ऩ् ऩ् ऩ्

पटना आपत वप पाटक सपाना इजापा कपकय भाकपमा काकप पीका अपीभ भापी पुरका सपुर सापु पूटना सपूना सापू फ्रतोश नफ्रत फ्र पेनी सपेद रापे पैरना छपैर कापै पोडना सपोड भापो पौज सफ़ौर कापौ पॊ की पॊ पॊ प् प् प् फनाभ फफय अजफ फाहय आफादी गुडॊफा बफकना अॊबफका अॊबफ फीजी सफीर खयाफी फुकचा फफुजा साफु फूकना फफूर साफू ब्रॊगेश ब्र ब्र फेकस सफेये साफे फैठक भफैय काफै फोतर वफोत साफो फौछाय धफौय वाफौ फॊडर प्रफॊद शुफॊ फ् फ् फ्

बकोस बबक साब




बायत आबाय अमबा भबखायी भभबय आभब बीतय भाबीर छाबी बुकडी भाबुन आबु बूगोर बबूत बाब ुभ्रघत सुभ्रत भ्र बेदक सबेद राबे बैमा सबैद वाबै बोग आबोग वाबो बौचक खबौद वाबौ बॊजन भबॊज भाबॊ ब् ब् ब्

भकान गभक मतीभ भाधुयी आभाद भाभा भभचरी आभभश साभभ भीठा आभीन भाभी भुखौटा अभुख ऩाभु भूसा अभूर साभ ूम्रग अम्रत कम्र भेमय सभेत जाभे भैरा धभैर याभै भोटा आभोद साभो भौजा अभौर सभौ भॊजन आभॊत्रण भॊ भ् भ् भ्

मतीभ ऩामर भम माचक आमात भामा घमभान घम भाघम मीश्वय मी बाशामी मुग आमुध आमु मूनानी सामूर यामू य्र य्र य्र मेन भामेर सामे मै मै मै

मोगी आमोग भामो मौनती समौर घाडमो मॊबत्रक भाटॊक मॊ म् म् म् यकफा आयसी माय याकेट आयाजी भाया रयमाज ऩरयणत ऩरय यीछ ऩयीस ऩयी




रुकना ऩरुभा ऩारु रूऩा ऩरूर भरू यर यर यर

येखीम आयेख ऩये यैमत सयैर ऩायै योकड आयोऩ कयो यौजा भयौदा जायौ यॊग सयॊग भयॊ य् य् य्

रकडी भरफा पर राट भरारा रैरा

भरखना भभरक भाभर रीडय भरीदा म्रणारी रुकना सरुका भार ुरूभ आरूचा ऩल्रु ल्र ल्र ल्र

रेखन आरेख ऩहरे रैरा सरैभ जारै रोटन अरोक भारो रौकी अरौककक जारौ रॊऩट ऩरॊग सरॊ र् र् र्

वकीर अवभ मुव वाटटका आवाज यवा पवकट आपवरा छपव वीयाना सवीद यवी वुजा सवुय वावु वूपय येवूय कावू व्र आव्रत व्र

वेदना आवेग कयवे वैतार चवैमा भावै वोटय अवोक सावो वौभा वौ वौ वॊटक वॊ वॊ व् व् व्

शकुनी भशक आक्रोश शाकीम भशान शीशा भशकवा आभशक खुभश शीशभ भशीन शीशी शुदा अशुब आशु शूरना बत्रशूर ऩाशू श्रगार श्र श्र




शेखय भभशेर राश ेशैरा अशैक अक्शै शोभशत अशोक आशो शौहय कशौय भशौ शॊककत बत्रशॊकु शॊ श् श् श् ष ष ष षा षा षा पष पष पष षी षी षी षु षु षु षू षू षू ष्र ष्र ष्र षे षे षे षै षै षै षो षो षो षौ षौ षौ षॊ षॊ षॊ ष् ष् ष्

सकर ककसकी तीस साभभर कसाना बासा भसकट काभसभ शाभस सीखना ऩसीना ऩायसी सुहास जासुभ ऩासु सूऩय जासूय रास ूस्र स्र स्र

सेठानी कसेत बासे सैकडा ऩसैना जासै सोता ऩासोभ ऩासो सौगात भसौना हासौ सॊकट फसॊती रासॊ स् स् स्

हभाया सहया भह हात सुहास साहा टहन्दी भटहरा कटह हीयक सोहीर भाही हुवा गहुना साहु हूयना जाहूभ माहू ह्रतॊत्री ह्र ह्र हेकड भहेश कहे हैयान सहैगा राहै होटर कना ऩाहो




हौरा डहौना जाहौ हॊत हॊ साहॊ ह् ह् ह्

III. TEXT PROCESSING

Text processing is the primary step involved in building a Hindi TTS system. Once the orthographic text is available, it must be pre-processed before synthesis [4]. The main intention behind text processing is to resolve any ambiguity between two characters. Every language has a corresponding Unicode range, and every character has its own identifying code. These identification codes are used in the pre-processing program to distinguish characters unambiguously [1]. The pre-processing program can be written in MATLAB, Java and many other programming languages; here it is implemented using .NET.

TABLE II. COMPARISON OF DATABASE

Sl. No   Developed by                            Unit       Language                          Corpus
1        SJ College of Engineering, Mysore [1]   Phoneme    Kannada                           1605
2        Utkal University [11]                   Syllable   Hindi, Odiya, Bengali & Telugu    9317
3        Punjabi University [10]                 Syllable   Punjabi                           3312
4        Carnegie Mellon University [9]          Syllable   Hindi                             2344
5        RIT, Maharashtra [13]                   Phoneme    Konkani                           3000

To resolve the ambiguities in interpreting Hindi characters, consonants and vowels are grouped into different classes in the program [1]. The classification of vowels and consonants is shown below.

TABLE III. CONSONANT

Alphabet   Unicode   Decimal Equivalent
क          0915      2325
ख          0916      2326
ग          0917      2327
घ          0918      2328
ङ          0919      2329

TABLE IX. INDEPENDENT VOWEL

Alphabet   Unicode   Decimal Equivalent
अ          0905      2309
आ          0906      2310
इ          0907      2311
ई          0908      2312
उ          0909      2313
ऊ          090A      2314
ऋ          090B      2315
ए          090F      2319
ऐ          0910      2320
ओ          0913      2323
औ          0914      2324

The remaining consonants are grouped in the same way as Tables IV, V, VI, VII and VIII. The dependent vowel signs, which combine with consonants to form syllables, are then grouped.

TABLE X. DEPENDENT VOWEL SIGN

Alphabet   Unicode   Decimal Equivalent
           093E      2366
ा          093F      2367
ि          0940      2368
ी          0941      2369
ु          0942      2370
ू          0943      2371
ृ          0947      2375
े          0948      2376
ै          094B      2379
ो          094C      2380
ौ          094D      2381

TABLE XI. PADDING

Alphabet   Unicode   Digits Padded
           093E      ---
ा          093F      01
ि          0940      02
ी          0941      03
ु          0942      04
ू          0943      05
ृ          0947      06
े          0948      07
ै          094B      08
ो          094C      09
ौ          094D      10
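
The "Decimal Equivalent" column is the hexadecimal Unicode value written in base 10. A minimal Python sketch of that conversion is shown below; the function name is hypothetical.

```python
def decimal_equivalent(hex_code):
    """Convert a four-digit hexadecimal Unicode value to its decimal form."""
    return int(hex_code, 16)


print(decimal_equivalent("0915"))  # value listed for क in Table III
print(decimal_equivalent("0905"))  # value listed for अ in Table IX
print(ord("\u0915"))               # the same number, read from the character itself
```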

The pre-processor program reads the entered text character by character and generates a modified Unicode file as output. The modified Unicode is stored in a text file and imported directly into the MATLAB program for further processing.

A. Rules applied during Pre-processing

1. If the character belongs to the independent vowel group shown in Table IX, the decimal value of its Unicode is padded directly with zeroes. E.g. if the character read is अ, its decimal Unicode value 2309 is padded with two zeroes, and the modified Unicode is 230900.

2. If the character read belongs to the consonant group shown in Table III, the next character is checked. If the next character belongs to the dependent vowel sign group, the consonant's Unicode is padded with the corresponding two-digit value from Table XI. E.g. if the character entered is रु, it is divided into र, whose Unicode is 2352, and ु, whose Unicode is 2370 and whose padding value from Table XI is 04. The modified Unicode value is therefore 235204.

3. If the character belongs to the consonant group shown in Table III and the next character also belongs to the consonant group, the Unicode is left unchanged. E.g. if the character read is ण, whose Unicode is 2339, and the next character read is also a consonant, the Unicode remains 2339.

4. If the entered word is अरुण, its modified Unicode output is 230900 235204 2339. The spaces between the codes separate the individual characters of the entered word.

5. If the entered sentence is अरुण कुमार, its modified Unicode output is 230900 235204 2339 101010 232504 235001 2352. The code 101010 represents the space between two words and is used to separate words during sentence formation.
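
The rules above can be sketched in Python. This is an illustration only, not the authors' .NET program. It assumes that the padding digits are assigned to the vowel signs as the worked examples imply (ा gives 01, ु gives 04), that the character classes can be tested by Devanagari code point ranges, and that any character outside the rules is skipped. All function and constant names are hypothetical.

```python
# Padding digits for dependent vowel signs, keyed by the sign itself.
# Assumption: digits follow the worked examples (aa sign -> 01, u sign -> 04).
VOWEL_SIGN_PADDING = {
    "\u093E": "01", "\u093F": "02", "\u0940": "03", "\u0941": "04",
    "\u0942": "05", "\u0943": "06", "\u0947": "07", "\u0948": "08",
    "\u094B": "09", "\u094C": "10",
}
WORD_SEPARATOR = "101010"  # code that marks the space between two words


def is_independent_vowel(ch):
    """Devanagari independent vowels (range check is an assumption)."""
    return "\u0905" <= ch <= "\u0914"


def is_consonant(ch):
    """Devanagari consonants (range check is an assumption)."""
    return "\u0915" <= ch <= "\u0939"


def modified_unicode(text):
    """Apply pre-processing rules 1 to 5 and return the space-separated codes."""
    codes = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == " ":
            codes.append(WORD_SEPARATOR)                         # rule 5
        elif is_independent_vowel(ch):
            codes.append(f"{ord(ch)}00")                         # rule 1
        elif is_consonant(ch) and nxt in VOWEL_SIGN_PADDING:
            codes.append(f"{ord(ch)}{VOWEL_SIGN_PADDING[nxt]}")  # rule 2
            i += 1                                               # sign consumed
        elif is_consonant(ch):
            codes.append(str(ord(ch)))                           # rule 3
        # any other character is skipped in this sketch
        i += 1
    return " ".join(codes)


# The example sentence from rule 5, written with escapes for clarity.
sentence = "\u0905\u0930\u0941\u0923 \u0915\u0941\u092E\u093E\u0930"
print(modified_unicode(sentence))
```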

IV. SPEECH SYNTHESIS

Speech synthesis and processing are implemented in MATLAB. After the database is built, selecting an appropriate algorithm for the concatenation-based TTS system is very important. According to recent studies, the direct waveform concatenation algorithm is best suited for speech synthesis [8].

The MATLAB program uses the modified Unicode file generated by the pre-processing program. The program reads the modified Unicode file number by number and fetches the appropriate phonemes and syllables from the database. The spaces in the modified Unicode file are used to determine the directory from which each syllable should be fetched, i.e. Start, Middle or End. For example, consider the word हभाया. Its syllable units are fetched separately from the respective directories and concatenated using a suitable algorithm. Fig. 11 shows the concatenated speech output.

Fig. 11 Concatenated output
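
One way to derive the Start, Middle or End directory from the modified Unicode is sketched below in Python. It is not the authors' MATLAB program. It assumes the first code of a word maps to Start, the last to End and the rest to Middle, and that a one-unit word is fetched from Start; the text does not specify that last case.

```python
WORD_SEPARATOR = "101010"  # word-boundary code from the pre-processing rules


def assign_positions(modified_unicode):
    """Pair each unit code with the directory it would be fetched from."""
    result = []
    word = []
    # A trailing separator flushes the last word through the same code path.
    for code in modified_unicode.split() + [WORD_SEPARATOR]:
        if code != WORD_SEPARATOR:
            word.append(code)
            continue
        for i, unit in enumerate(word):
            if i == 0:
                position = "Start"  # also used for one-unit words (assumption)
            elif i == len(word) - 1:
                position = "End"
            else:
                position = "Middle"
            result.append((unit, position))
        word = []
    return result


for unit, position in assign_positions("230900 235204 2339 101010 232504 235001 2352"):
    print(unit, position)
```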

After concatenation, the output is smoothed using a moving-average window. This increases the quality of the speech output.
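
Direct waveform concatenation followed by moving-average smoothing can be sketched in plain Python as follows. The window length, the centered window and the toy sample values are assumptions chosen for illustration; the text does not state the parameters used in the MATLAB implementation.

```python
def concatenate(units):
    """Join unit waveforms (lists of samples) end to end."""
    samples = []
    for unit in units:
        samples.extend(unit)
    return samples


def moving_average(samples, window=3):
    """Smooth samples with a centered moving average; edges use a shorter window."""
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed


# Toy waveforms with an abrupt jump at the join
first = [0.0, 0.0, 0.0]
second = [0.3, 0.3, 0.3]
joined = concatenate([first, second])
smoothed = moving_average(joined, window=3)
print(joined)
print(smoothed)
```

In the smoothed output the jump at the join is spread over neighbouring samples instead of occurring between two adjacent samples, which is the kind of glitch reduction the smoothing step aims for.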

V. CONCLUSION

This paper discusses the design and development of a Hindi text and speech database for a concatenation-based TTS system that uses the syllable as the basic unit. This technique provides very high-quality speech output that is reasonably natural and equivalent to the voice of the original speaker. The proposed approach minimizes the co-articulation effect and the prosody mismatch between adjacent concatenated units. Considering the position of the syllable during database building helps to reduce glitches during concatenation, gives continuity in the concatenated speech and improves the quality of the speech output compared with concatenation done without considering position and duration.

REFERENCES

[1]. Ravi D J and Sudarshan Patilkulkarni (2011), "A Novel Approach to Develop Speech Database for Kannada Text-to-Speech System", Int. J. on Recent Trends in Engineering & Technology, Vol. 05, No. 01.

[2]. Marian Macchi (1993), "Issues in Text-to-Speech Synthesis".

[3]. Kishore S P and Black A (2003), "Unit Size in Unit Selection Speech Synthesis", in Proceedings of Eurospeech, September, pp. 1317-1320.

[4]. Paul Taylor (2009), "Text-to-Speech Synthesis", Cambridge University Press.

[5]. Lemmety S (1999), "Review of Speech Synthesis Technology", M.S. Thesis, Dept. Elec. and Comm. Engg., Helsinki University of Technology.

[6]. Thomas S (2007), "Natural Sounding Text-to-Speech Synthesis Based on Syllable Like Units", M.S. Thesis, Indian Institute of Madras.

[7]. Arun Kumar C and Shreekanth T (2014), "A Comprehensive Review on Concatenation Based Text to Speech Synthesis for Indian Language", Int. J. Elec & Electr. Eng & Telecoms, Vol. 3, No. 2, April 2014, ISSN 2319-2518.

[8]. PRAAT: A tool for phonetic analysis and sound manipulations by Boersma and Weenink, 1992-2001. www.praat.org

[9]. S P Kishore and Alan W Black, "Unit Size in Unit Selection Speech Synthesis", EUROSPEECH 2003, Geneva.

[10]. Parminder Singh and Gurpreet Singh Lehal (2006), "Text-To-Speech Synthesis System for Punjabi Language", in Proceedings of International Conference on Multidisciplinary Information Sciences and Technologies, Merida, Spain.

[11]. Sanghamitra Mohanty (2011), "Syllable Based Indian Language Text To Speech System", International Journal of Advances in Engineering & Technology, Vol. 1, Issue 2.

[12]. Badri Nath Kapoor, "Practical Hindi-English Dictionary", January 1, 2004.

[13]. Pukhraj P. Shrishrimal, Ratnadeep R. Deshmukh and Vishal B. Waghmare, "Indian Language Speech Database: A Review", International Journal of Computer Applications (0975 – 888), Volume 47, No. 5, June 2012.
