
  • © 2014, IJARCSSE All Rights Reserved Page | 531

    Volume 4, Issue 5, May 2014 ISSN: 2277 128X

    International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com

    Development of Speech Database for Hindi Text-To-Speech System Considering Syllable as a Basic Unit

    Arun Kumar C* (Dept. of ECE), Shreekanth T (Dept. of ECE), Udayashankara V (Dept. of IT)

    SJCE Mysore, Karnataka, India

    Abstract: The objective of a text-to-speech (TTS) system is to convert orthographic text into intelligible, natural-sounding speech. The choice of the basic unit plays a vital role in achieving this. Phoneme, diphone, allophone and syllable are the basic units of a speech system. Using the phoneme as the basic unit in a concatenation-based TTS system results in a larger number of concatenation points, which lowers the quality of the speech output. Using the syllable as the basic unit for database building results in fewer concatenation points and higher-quality speech output. This work therefore describes the construction of the standard text database required to build a syllable-level speech database that takes into account the position of a syllable in a word, i.e. Start, Middle and End. The database consists of 1326 standard and non-standard words and 442 syllables in each of the Start, Middle and End positions.

    Keywords: Speech synthesis, Concatenative synthesis, Text processing, Speech generation, Hindi TTS system.

    I. INTRODUCTION

    The ultimate goal of Text-To-Speech (TTS) synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech [2]. This generally involves two steps:

    1. Text processing.
    2. Speech generation.

    The objective of the text processing component is to process the given input text and produce an appropriate sequence of phonemic and syllabic units. These units are realized by the speech generation component either by synthesis from parameters or by selection of units from a large speech corpus [3]. For natural-sounding speech synthesis, it is essential that the text processing component produce an appropriate sequence of syllabic units corresponding to an arbitrary input text [4].

    Phoneme, diphone, allophone and syllable are the basic units of speech. The phoneme is the smallest unit in a speech synthesis system; no other letter can modify its sound.

    A syllable is a cluster of consonants and vowels. It should contain one vowel and any number of consonants.

    1. A single vowel can act as a syllable (i.e. V).
    2. Possible syllable patterns include V, C*V, V*C, C*V*C, C*C*V, C*C*C*V*C*C*C, etc.
    3. A consonant before the vowel is called the 'Onset', i.e. (C*V).
    4. A consonant after the vowel is called the 'Coda', i.e. (V*C).

    The databases developed for text-to-speech synthesis systems generally use phonemes or syllables as the basic concatenative unit. Databases of this type have been built or collected from LDCIL and used by many researchers for continuous speech synthesis and recognition systems. Most of this work has been carried out for Chinese, Punjabi and English; little has been done for other Indian languages. Table II lists various databases built by researchers for TTS systems.

    A speech database has been developed for a text-to-speech synthesis system for the Kannada language at Mysore. The basic unit selected for speech synthesis in that project was the phoneme. The database consists of a total of 1,605 phonemes. The phonemes were recorded using the PRAAT utility on the Windows operating system, at a sampling frequency of 16,000 Hz, with a standard microphone in the lab. The recorded phonemes include vowels, semivowels, stops, fricatives, nasals, etc. [1].

    A Punjabi speech database has been developed for a text-to-speech synthesis system at the Department of Computer Science, Punjabi University, Patiala. The researchers selected the syllable as the basic unit of concatenation. The database consists of 3,312 syllables, which account for more than 99% of the cumulative percentage frequency in the selected corpus. These syllables were selected after analyzing all possible syllables of a Punjabi corpus containing nearly 2,33,009 unique words and more than four million words in total; 9,317 valid syllables were found, of which 3,312 were selected. The selected syllables were recorded from a speaker using a standard microphone in a studio environment [10].

  • Arun et al., International Journal of Advanced Research in Computer Science and Software Engineering 4(5),

    May - 2014, pp. 531-549

    A text-to-speech synthesis system for four Indian languages, Hindi, Odiya, Bengali and Telugu, has been developed at the Department of Computer Science and Application, Utkal University, Bhubaneswar. To build the speech corpora, native speakers of each of the four languages were recruited and asked to read the text in a laboratory environment without background noise. The system uses the concatenation of syllables as the approach for developing the speech database [11].

    The following subsections describe the syllable rules involved in word segmentation and concatenation-based text-to-speech synthesis.

    A. Syllable Rules

    1. When a nasal such as /n'/, or a half-pronounced /m/ or /n/ sound, immediately follows a vowel, it is treated as part of the vowel and hence of the same syllable. For example, /n'/ in san'sthaa is part of the syllable containing /sa/ [10].

    2. When there are three or more consonants between two consecutive vowels, the first consonant is part of the coda of the previous syllable, while the remaining consonants form the onset of the next syllable [10].

    E.g. a b c d e (a and e are vowels; b, c and d are consonants)

    /ab/ = Coda (V*C)

    /cde/ = Onset (C*C*V)

    3. When there are exactly two consonants between two vowels, the first consonant is part of the coda of the previous syllable and the second is the onset of the next syllable [10].

    E.g. a m m a (a is a vowel; m is a consonant)

    /am/ = Coda (V*C)

    /ma/ = Onset (C*V)

    4. As an exception to rule 3, when the second consonant is a member of the set {/r/, /s/, /sh/, /shh/}, both consonants are part of the onset of the next syllable [10].

    E.g. y a a t r a

    /yaa/ = syllable 1

    /tra/ = syllable 2
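
    The four rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the romanized phone inventory, the function names and the handling of zero or one intervocalic consonant (which the rules do not cover) are assumptions made here.

```python
# Hypothetical romanized inventory; the paper does not define one.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au"}
ONSET_JOINERS = {"r", "s", "sh", "shh"}   # rule 4 set
NASAL_MARKS = {"n'"}                      # rule 1: nasal stays with the vowel


def coda_length(cluster):
    """Number of consonants in the cluster that close the previous syllable."""
    n = len(cluster)
    if n == 0:
        return 0
    if n >= 3:
        k = 1                                   # rule 2
    elif n == 2:
        k = 0 if cluster[1] in ONSET_JOINERS else 1  # rules 4 and 3
    else:
        k = 0                                   # assumption: lone consonant is an onset
    if cluster[0] in NASAL_MARKS:
        k = max(k, 1)                           # rule 1
    return k


def syllabify(phones):
    """Split a list of phones into syllables (each returned as a string)."""
    vowel_idx = [i for i, p in enumerate(phones) if p in VOWELS]
    if not vowel_idx:
        return ["".join(phones)] if phones else []
    syllables, start = [], 0
    for cur, nxt in zip(vowel_idx, vowel_idx[1:]):
        end = cur + 1 + coda_length(phones[cur + 1:nxt])
        syllables.append("".join(phones[start:end]))
        start = end
    syllables.append("".join(phones[start:]))
    return syllables


print(syllabify(["y", "aa", "t", "r", "a"]))          # ['yaa', 'tra']
print(syllabify(["a", "m", "m", "a"]))                # ['am', 'ma']
print(syllabify(["s", "a", "n'", "s", "th", "aa"]))   # ["san'", 'sthaa']
```

    The first two calls reproduce the examples given with rules 4 and 3, and the third keeps /n'/ with /sa/ as described in rule 1.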

    Hindi has 5 vowels, 5 long vowels, two diphthongs, four semivowels and 33 consonants. The written form of Hindi has a one-to-one correspondence with the spoken language. The phonemes are divided into two types, vowels (swaras) and consonants (vyanjanas); together they constitute the alphabet set (varnamala). Vowels are the independently existing letters, also called swaras [10]. They are:

    अ आ इ ई उ ऊ ऋ ए ऐ ओ औ

    Consonants are letters that depend on vowels to take their independent form. They are shown below:

    क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ल व श ष स ह

    A consonant combined with a vowel therefore forms a syllable of type (C*V), also called kagunitha.

    E.g. क + आ = का, i.e. C + V = (CV)
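
    The C + V composition can be illustrated with a short Python sketch. In written Devanagari the independent vowel is replaced by its dependent sign when it follows a consonant; the mapping table and the function name `compose_cv` below are illustrative choices, not part of the original work.

```python
# Independent vowel -> dependent vowel sign written after a consonant.
# The inherent vowel अ adds no sign at all.
VOWEL_TO_SIGN = {
    "अ": "",
    "आ": "ा",
    "इ": "ि",
    "ई": "ी",
    "उ": "ु",
    "ऊ": "ू",
    "ऋ": "ृ",
    "ए": "े",
    "ऐ": "ै",
    "ओ": "ो",
    "औ": "ौ",
}


def compose_cv(consonant, vowel):
    """Return the written C*V syllable for a consonant and an independent vowel."""
    return consonant + VOWEL_TO_SIGN[vowel]


print(compose_cv("क", "आ"))                      # का
print([compose_cv("क", v) for v in VOWEL_TO_SIGN])  # the C*V series for क
```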

    Hindi is syllabic in nature. Hence, building the speech database for a TTS system with the syllable as the basic unit is the better choice [4].

    B. Concatenative Synthesis

    Concatenative synthesis simply plays back the waveform with the matching phone string. An utterance is synthesized by concatenating several speech fragments; unlike synthesis by rule, it requires neither rules nor manual tuning. Moreover, each segment is completely natural, so very natural output can be expected. Speech segments are, however, greatly affected by coarticulation, so if two segments that were not originally adjacent are concatenated, spectral or prosodic discontinuities can arise. Spectral discontinuities occur when the formants at the concatenation point do not match. Prosodic discontinuities occur when the pitch at the concatenation point does not match. Listeners rate synthetic speech that contains large discontinuities as poor, even if each segment is very natural. A number of factors contribute to the lack of naturalness in the output of speech synthesis systems, such as intonation and rhythm, variability along the prosodic parameters and incorrect segmental rendering. The only task in this method is building an error-free speech database suitable for concatenation of speech units [1]. Prosody and intonation are also important for natural-sounding speech.

    In Hindi, words can be composed of basic characters as well as complex clusters of C*V*C. For the latter, rules are needed to break a word into syllables. The work depicted in this paper therefore derives certain simple rules for syllabification, i.e. rules for grouping clusters of C*V*C, based on heuristic analysis of several words in Telugu and Hindi [10]. A concatenation-based TTS system that uses the phoneme as the basic unit produces low-quality speech output because of the large number of concatenation points, which result in glitches. To avoid this, the syllable is considered as the basic unit of concatenation.

    This paper therefore describes how to build an error-free text and speech database for Hindi, as required to develop a concatenation-based TTS system.

    II. STRUCTURE OF TEXT AND SPEECH DATABASE

    During speech synthesis, the required syllable units are fetched from the speech database, concatenated and finally processed to obtain quality speech output. Creating an error-free database of syllable units is therefore most important. The sound and duration of a syllable change slightly with its position of occurrence in speech. A syllable can occur at three different positions [1]:

    1. At the start of a word (Start).
    2. In between two phonemes (Middle).
    3. At the end of a word (End).

    For this purpose, a text database of 1326 words covering the full syllable (C*V) set was prepared manually using a standard Hindi dictionary [12], textbooks and guidance from various researchers. The resulting text corpus of 1326 standard and non-standard unique words is the basis for building the speech database.

    The text corpus, shown in Table I, covers the required syllable set in all possible positions of occurrence, i.e. Start, Middle and End. Many rarely occurring syllables, such as ञ, यर, र्, ङ, छ्, etc., are included as they are so that all syllables are covered for documentation purposes.
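
    A coverage requirement of this kind can be checked mechanically. The sketch below assumes that each corpus word is already available as a list of syllables; the romanized sample words and the function names are hypothetical and serve only to illustrate the Start/Middle/End bookkeeping.

```python
from collections import defaultdict

POSITIONS = ("Start", "Middle", "End")


def position_of(index, count):
    """Position label of the syllable at `index` in a word of `count` syllables."""
    if index == 0:
        return "Start"          # assumption: a one-syllable word counts as Start
    if index == count - 1:
        return "End"
    return "Middle"


def coverage(words):
    """Map each syllable to the set of positions in which the corpus contains it."""
    seen = defaultdict(set)
    for syllables in words:
        for i, syllable in enumerate(syllables):
            seen[syllable].add(position_of(i, len(syllables)))
    return seen


def missing_positions(words, targets):
    """For each target syllable, list the positions the corpus does not cover."""
    seen = coverage(words)
    report = {}
    for target in targets:
        gaps = [p for p in POSITIONS if p not in seen.get(target, set())]
        if gaps:
            report[target] = gaps
    return report


# Hypothetical, already syllabified words.
corpus = [["ka", "mal"], ["pa", "ka", "na"], ["naa", "ka"]]
print(missing_positions(corpus, ["ka", "ma"]))
```

    Here "ka" occurs in all three positions and is therefore not reported, while "ma" is reported as missing everywhere.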

    The speech database is recorded with PRAAT [9], a utility program for the Windows operating system. The prepared words were recorded in PRAAT at a sampling frequency of 16 kHz with 16-bit samples [1].

    The following example shows the process of building the speech database. If the required syllable is बा, then the three words बायत, आबाय and अमबा are recorded in PRAAT using a standard microphone. Each word is recorded and saved to the list, and बा is extracted from each recorded word, giving the syllable in all three possible positions. The extracted syllables are then stored in their respective directories according to their position of occurrence. Figures 5, 6 and 7 show the labeling process, and Figures 1 to 8 show the steps involved in using PRAAT during speech database building.

    A. Procedure to build speech database

    The steps below show how to use the PRAAT utility to build the speech database required to implement a concatenation-based TTS system.

    Step 1: Open PRAAT and select the 'Record mono Sound' option from the 'New' menu in the menu bar.

    Fig. 1 PRAAT Tool

    Step 2: Select a sampling frequency of 16000 Hz and press Record to start recording the required sound.

    Fig. 2 Selecting sampling frequency and recording

    Step 3: Utter the word that covers the required unit while recording. After recording, press Stop and save the sound to the list.

    Fig. 3 Recording and save to list

    Step 4: Create a TextGrid and start labelling the speech waveform by selecting the 'View & Edit' option.

    Fig. 5 'बा' Starting position

    Fig. 6 'बा' Middle position

    Fig. 7 'बा' End position

    Step 5: Extract the labeled sound files using the 'Extract all non-empty interval' option.

    Fig. 8 Extract labeled speech unit
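
    The effect of this extraction step can be mimicked outside PRAAT. The sketch below does not call PRAAT; it assumes the labels are available as hypothetical (start, end, label) tuples in seconds and the recording as a plain list of samples, and it keeps only the intervals that carry a label.

```python
def extract_labeled_intervals(samples, sample_rate, intervals):
    """Return (label, samples) pairs for every interval with a non-empty label.

    `intervals` is a list of (start_seconds, end_seconds, label) tuples,
    an assumed stand-in for one tier of a label file.
    """
    units = []
    for start, end, label in intervals:
        if not label.strip():
            continue                      # unlabeled stretch, skipped
        first = int(round(start * sample_rate))
        last = int(round(end * sample_rate))
        units.append((label, samples[first:last]))
    return units


# One second of dummy audio at 16000 Hz with a single labeled stretch.
recording = list(range(16000))
tier = [(0.0, 0.25, ""), (0.25, 0.5, "baa"), (0.5, 1.0, "")]
for label, unit in extract_labeled_intervals(recording, 16000, tier):
    print(label, len(unit))               # baa 4000
```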

    After all the labeled segments have been extracted from the uttered sound, they are saved in their respective directories as shown in Figure 9.

    Fig. 9 Directories named Start, Middle and End

    'बा' in the starting position is saved in the Start directory, 'बा' in the middle position in the Middle directory, and 'बा' in the end position in the End directory.
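
    A directory layout of this kind can be reproduced with the Python standard library. The sketch below writes one unit as a mono, 16-bit, 16 kHz WAV file under a Start, Middle or End folder; the function name, the file naming and the use of a temporary folder as the sound library are assumptions for illustration only.

```python
import struct
import tempfile
import wave
from pathlib import Path

POSITIONS = ("Start", "Middle", "End")


def save_unit(root, position, name, samples, sample_rate=16000):
    """Write one extracted unit as a mono 16-bit PCM WAV file under root/position."""
    if position not in POSITIONS:
        raise ValueError(f"unknown position: {position!r}")
    folder = Path(root) / position
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{name}.wav"
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)               # mono
        wav.setsampwidth(2)               # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return path


# Demonstration in a throwaway folder standing in for the sound library.
with tempfile.TemporaryDirectory() as library:
    saved = save_unit(library, "Start", "unit_0001", [0, 1000, -1000, 0])
    with wave.open(str(saved), "rb") as wav:
        print(saved.parent.name, wav.getframerate(), wav.getsampwidth() * 8)
```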

    The complete speech database consists of a total of 1326 syllable (C*V) units. Each position has 429 syllables and 13 independent vowels. Hence, from all three positions, a total of [(429*3) + (13*3)] = 1326 units of speech data is built.
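
    The count can be written out step by step. The figure of 429 is consistent with 33 consonants each combined with 13 vowel forms, although the text does not state this breakdown explicitly, so that first line should be read as an assumption.

```latex
\begin{align*}
  429 &= 33 \times 13 && \text{(assumed breakdown: consonants} \times \text{vowel forms)} \\
  N_{\text{position}} &= 429 + 13 = 442 \\
  N_{\text{total}} &= (429 \times 3) + (13 \times 3) = 1287 + 39 = 1326 = 3 \times 442
\end{align*}
```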

    TABLE I: TEXT CORPUS

    FRONT MID BACK

    कभर कीकय खटाक कायण खकाय खटाका ककयण चककत साकक कीकय पकीय धभकी कुमया तकुवा पऩ ॊकू कूकना ककून गुडाकू

    क्रतक प्रक्रत क्र केयर याकेश तडके कैसा डकैती जाकै कोभर डकोटा भाको कोडी सकौय जाकौ कॊ कड सकॊ द कॊ क् क् क्

    खकाय चखना देख खाकक भखान रेखा खखडकी भुखखमा साखख खीजना भखीय ऩयखी खुचय सखुर जाखु खून सखून जाखू ख्रऩा भाख्रत सख्र खेवा भखेर जाख ेखैयात सखैय राख ैखोवा जाखोय जाखो खौवा भुखौटा भाखौ खॊजय जखॊ सजखॊ ख् ख् ख् गगन गगन डग गात तगादा दगा गगयता फगगमा भागग गीदड दगीरा दागी गुजय झगुरी जागु गूथना फगूरा गागू ग्रह सग्रह जाग्र गेरी बॊगेडी जागे गैरयी दगैर जागै गोदात बगोडा जागो गौयव रगौय जागौ गॊदगी भगॊदा जागॊ ग् ग् ग् घटक फघय फघ घातक प्रघान साघा घघचपऩच सघघर याघघ घीना सॊघीम सघी घुटन सघुर भाघु घूभना सघून जाघू घ्रत सघ्रऩ जाघ्र घेयना भघेय दाघे घैरा भघैर सघै

    घोखना सघोना भाघो घौद सघौय सघौ घॊट रघॊट घॊ घ् घ् घ् ङ ङ ङ ङा ङा ङा ङङ ङङ ङङ ङी ङी ङी ङु ङु ङु ङू ङू ङू ङ्र ङ्र ङ्र ङे ङे ङे ङै ङै ङै ङो ङो ङो ङौ ङौ ङौ ङॊ ङॊ ङॊ ङ् ङ् ङ् चक दचक ऩेच चाऩ ऩॊचाट ऩायचा गचकट ऩेगचया सगच चीरय ऩेचीदा प्रऩॊची चुका सचुक रचु चूक कचूय वाचू च्रभा सच्रभ वाच्र चटेा सचते याच ेचैतन्म सचैन चाचै चोकय कचोट चाचो चौवा कचौडी याचौ

    चॊन्द्न्िका भचॊद भचॊ च् च् च्

    छकाय ऩाछना ऩाछ छागर बफछाना ऩीछा घछकना बफघछमा छाघछ छीजन सछीन ऩॊछी छुवा बफछुवा वाछु छूटना सछूत राछू छ्र छ्र छ्र

    छेडना सछेन ऩीछे छैत सछैत वाछै छोयी बफछोह वाछो छौका बफचौना वाछौ छॊगा सछॊद साछॊ

    छ् छ् छ् जकड ऩूजना पौज जागीय खजाना ऩूजा न्द्जगय ऩून्द्जत फान्द्ज जीतना सजीत ऩाजी जुवायी बफजुभ जाजु जूट बफजूका काजू ज्र ज्र ज्र जेठ सजेन जाजे जैपवक बफजैरा जाजै जोखखभी घजोय राजो जौहय वजौय राजौ जॊगदाय सजॊग रजॊ ज् ज् ज्

    झकोरा झझक जाझ झाडना फझावू साझा खझप्ना सखझना साखझ झीखना सझीन साझी झुटाना सझुना भाझु झूट जाझूना भाझू झ्र झ्र झ्र

    झरेना साझरे ऩाझ ेझैर सझैरा साझै झोरी सझोरा ताझो झौय कझौय साझौ झॊकाय सझॊक झॊ झ् झ् झ् ञ ञ ञ ञा ञा ञा गञ गञ गञ ञी ञी ञी ञु ञु ञु ञू ञू ञू ञ्र ञ्र ञ्र ञ े ञ े ञ ेञै ञै ञै ञो ञो ञो ञौ ञौ ञौ ञॊ ञॊ ञॊ ञ् ञ् ञ्

    टकयाव ऩाटर ऩाट टाऩना पऩटाया ऩाटा

    टटकट बफटटमा फाटट टीकाकाय सटीक ऩाटी टुकडा भटुक जाटु टूटना भटूक राटू ट्र ट्र ट्र

    टेकना सटेरा जाटे टैक्सी सटैय जाटै टोकन सटोरा भाटो टौर सटौर याटौ टॊकाय टॊ जाटॊ ट् ट् ट्

    ठकाय ऩाठक ऩाठ ठाकुय सठाऩ ऩाठा टठग्ना गटठमा ऩाटठ ठीकडा गठीरा ऩाठी ठुनका घनठुय कठुय ठूरा सठूय ऩाठू ठ्र ठ्र ठ्र

    ठेकेढाय सठेक भाठे ठैभ भठैर जाठै

    ठोकना सठोक साठो ठौय कठौय जाठौ ठॊडा ठॊ ठॊ ठ् ठ् ठ् डफर सडक अखड डाककमा बफडार अगडा ङडमो अङडभ छोङड डीजर सडीर खखचडी डुफकी सडुर जाडु डूफना सडूक झाडू ड्र ड्र ड्र

    डमेयी भडरे साड ेडनैा भडरै जाडै डोभनी अडोस साडो डौर सडौर ताडौ डॊका फडॊग याडॊ ड् ड् ड्

    ढकना गाढन साढ ढाना गढाना गढा टढरावी गटढमा गटढ ढीरना सढीर साढी ढुरना सढुर साढु

    ढूह सढूह साढू ढ्र ढ्र ढ्र ढेय सढेय साढे ढैम गढैमा साढै ढोका भढोवा वाढो ढौयी ऩढौसी जाढौ ढॊगा ढॊ ढॊ ढ् ढ् ढ् ण ण ण णा णा णा खण खण खण णी णी णी णु णु णु णू णू णू ण्र ण्र ण्र णे णे णे णै णै णै णो णो णो णौ णौ णौ णॊ णॊ णॊ ण् ण् ण्

    तकना भतरी उगचत तागना बफताना अॊधता घतजाया इघतका अघत तीखा त्रतीम इभयती तुकाॊत भातुर भातु तूफ़ान भातूक सातू त्रतीम सॊत्रप्त त्र तमेीस भातभे सात ेतैनात नतैभ वात ैतोड सतोर भातो तौरना अतौर आतौ तॊगी भतॊगी भतॊ त् त् अत्

    थकना थुथना अकथ थाऩना भथानी साथा गथेटाय भगथत अगथ थीभ भथीभ साथी थुथना राथुय साथु थूकना थाथु भाथू थ्र थ्र थ्र थेर भाथेन साथे

    थैरा भथैन साथै थोडा हथोड जाथो थौडा हथौडा भाथौ थॊडा भाथॊगी प्रीथॊ थ् थ् थ् दकाय फॊदय नाद दाता बफदाय बफदा टदखना फॊटदश भटद दीभी भदीय फॊदी दकुडा भदरु भद ुदधू फॊदकू जाद ूिड आित साि

    देखना बफदेश सादे दैघनक वदैन सादै दोगरा भादोन सादो दौड फदौना वादौ दॊगा वदॊती दॊ द् द् द्

    धगडा फॊधन अध धाना फॊधान वाधा गधक अगधक आगध धीभय फाॉधीत राधी धुक फॊधुता साधु धूऩ सधूय वाधू ध्र ध्र ध्र धेना अधेड साधे

    धैमरवान अधैमर साधै धोखा सधोना आधो धौखना भधौना साधौ धॊधा वधॊती रयधॊ ध् ध् ध्

    नकटा ऩनही अॊकन नाका ऩनाह अधाना घनकट सघनह यानी नीका ऩनीयी अॊजनी नुकीरा अनुजा अनु नूतन कनूत जानू न्रशॊस न्र न्र नेती जानेक अॊजाने नैघतक फनैरा जानै नोचना भनोज भानो नौकय कनौज भानौ

    नॊगरा भानॊद भानॊ न् न् न् ऩकड तऩना आकॊ ऩ ऩाठ क्रऩार ु ऩाऩा पऩटायी कपऩर सीपऩ ऩीच सऩीठ छऩी ऩुकाय सऩुदर काऩु ऩूजना अऩूणर ऩाऩु प्र प्र प्र

    ऩेखना सऩेया ताऩे ऩैतान ऩाऩैना साऩै ऩोटा सऩोरा ऩाऩो ऩौनी फऩौती साऩौ ऩॊककर सऩॊत सोऩॊ ऩ् ऩ् ऩ्

    पटना आपत वप पाटक सपाना इजापा कपकय भाकपमा काकप पीका अपीभ भापी पुरका सपुर सापु पूटना सपूना सापू फ्रतोश नफ्रत फ्र पेनी सपेद रापे पैरना छपैर कापै पोडना सपोड भापो पौज सफ़ौर कापौ पॊ की पॊ पॊ प् प् प् फनाभ फफय अजफ फाहय आफादी गुडॊफा बफकना अॊबफका अॊबफ फीजी सफीर खयाफी फुकचा फफुजा साफु फूकना फफूर साफू ब्रॊगेश ब्र ब्र फेकस सफेये साफे फैठक भफैय काफै फोतर वफोत साफो फौछाय धफौय वाफौ फॊडर प्रफॊद शुफॊ फ् फ् फ्

    बकोस बबक साब

    बायत आबाय अमबा भबखायी भभबय आभब बीतय भाबीर छाबी बुकडी भाबुन आबु बूगोर बबूत बाब ुभ्रघत सुभ्रत भ्र बेदक सबेद राबे बैमा सबैद वाबै बोग आबोग वाबो बौचक खबौद वाबौ बॊजन भबॊज भाबॊ ब् ब् ब्

    भकान गभक मतीभ भाधुयी आभाद भाभा भभचरी आभभश साभभ भीठा आभीन भाभी भुखौटा अभुख ऩाभु भूसा अभूर साभ ूम्रग अम्रत कम्र भेमय सभेत जाभे भैरा धभैर याभै भोटा आभोद साभो भौजा अभौर सभौ भॊजन आभॊत्रण भॊ भ् भ् भ्

    मतीभ ऩामर भम माचक आमात भामा घमभान घम भाघम मीश्वय मी बाशामी मुग आमुध आमु मूनानी सामूर यामू य्र य्र य्र मेन भामेर सामे मै मै मै

    मोगी आमोग भामो मौनती समौर घाडमो मॊबत्रक भाटॊक मॊ म् म् म् यकफा आयसी माय याकेट आयाजी भाया रयमाज ऩरयणत ऩरय यीछ ऩयीस ऩयी

    रुकना ऩरुभा ऩारु रूऩा ऩरूर भरू यर यर यर

    येखीम आयेख ऩये यैमत सयैर ऩायै योकड आयोऩ कयो यौजा भयौदा जायौ यॊग सयॊग भयॊ य् य् य्

    रकडी भरफा पर राट भरारा रैरा

    भरखना भभरक भाभर रीडय भरीदा म्रणारी रुकना सरुका भार ुरूभ आरूचा ऩल्रु ल्र ल्र ल्र

    रेखन आरेख ऩहरे रैरा सरैभ जारै रोटन अरोक भारो रौकी अरौककक जारौ रॊऩट ऩरॊग सरॊ र् र् र्

    वकीर अवभ मुव वाटटका आवाज यवा पवकट आपवरा छपव वीयाना सवीद यवी वुजा सवुय वावु वूपय येवूय कावू व्र आव्रत व्र

    वेदना आवेग कयवे वैतार चवैमा भावै वोटय अवोक सावो वौभा वौ वौ वॊटक वॊ वॊ व् व् व्

    शकुनी भशक आक्रोश शाकीम भशान शीशा भशकवा आभशक खुभश शीशभ भशीन शीशी शुदा अशुब आशु शूरना बत्रशूर ऩाशू श्रगार श्र श्र

    शेखय भभशेर राश ेशैरा अशैक अक्शै शोभशत अशोक आशो शौहय कशौय भशौ शॊककत बत्रशॊकु शॊ श् श् श् ष ष ष षा षा षा पष पष पष षी षी षी षु षु षु षू षू षू ष्र ष्र ष्र षे षे षे षै षै षै षो षो षो षौ षौ षौ षॊ षॊ षॊ ष् ष् ष्

    सकर ककसकी तीस साभभर कसाना बासा भसकट काभसभ शाभस सीखना ऩसीना ऩायसी सुहास जासुभ ऩासु सूऩय जासूय रास ूस्र स्र स्र

    सेठानी कसेत बासे सैकडा ऩसैना जासै सोता ऩासोभ ऩासो सौगात भसौना हासौ सॊकट फसॊती रासॊ स् स् स्

    हभाया सहया भह हात सुहास साहा टहन्दी भटहरा कटह हीयक सोहीर भाही हुवा गहुना साहु हूयना जाहूभ माहू ह्रतॊत्री ह्र ह्र हेकड भहेश कहे हैयान सहैगा राहै होटर कना ऩाहो

    हौरा डहौना जाहौ हॊत हॊ साहॊ ह् ह् ह्

    III. TEXT PROCESSING

    Text processing is the primary step in building a Hindi TTS system. Once the orthographic text is available, it must be pre-processed before synthesis [4]. The main intention behind text processing is to resolve any ambiguity between two characters. Every language has a corresponding Unicode range developed by language research centers, and every character has its own identification code. These identification codes are used in the pre-processing program to distinguish characters and resolve confusion between two characters [1]. The pre-processing program can be written in MATLAB, Java and many other programming languages; here it is implemented using a .NET programming language.

    TABLE II COMPARISON OF DATABASE

    Sl. No  Developed by                           Unit      Language                        Corpus
    1       SJ College of Engineering, Mysore [1]  Phoneme   Kannada                         1605
    2       Utkal University [11]                  Syllable  Hindi, Odiya, Bengali & Telugu  9317
    3       Punjabi University [10]                Syllable  Punjabi                         3312
    4       Carnegie Mellon University [9]         Syllable  Hindi                           2344
    5       RIT, Maharashtra [13]                  Phoneme   Konkani                         3000

    To resolve ambiguities in interpreting the Hindi alphabet, consonants and vowels are grouped into different classes and programmed [1]. The classification of vowels and consonants is shown below.

    TABLE III. CONSONANT

    Alphabets  Unicode  Decimal Equivalent
    क          0915     2325
    ख          0916     2326
    ग          0917     2327
    घ          0918     2328
    ङ          0919     2329

    TABLE IX. INDEPENDENT VOWEL

    Alphabets  Unicode  Decimal Equivalent
    अ          0905     2309
    आ          0906     2310
    इ          0907     2311
    ई          0908     2312
    उ          0909     2313
    ऊ          090A     2314
    ऋ          090B     2315
    ए          090F     2319
    ऐ          0910     2320
    ओ          0913     2323
    औ          0914     2324
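
    The "Unicode" and "Decimal Equivalent" columns of Tables III and IX are two notations for the same code point, which a short Python check makes explicit. The helper name `unicode_row` is an illustrative choice.

```python
def unicode_row(ch):
    """Return (character, 4-digit hexadecimal code, decimal code) for one character."""
    code = ord(ch)
    return ch, f"{code:04X}", code


# Rows of the consonant table (Table III) followed by a few independent vowels.
for ch in "कखगघङअआऊ":
    print(*unicode_row(ch))
```

    For क this yields 0915 and 2325, and for अ it yields 0905 and 2309, matching the table rows.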

    Similarly, all other consonants are considered and grouped as Tables IV, V, VI, VII and VIII. The dependent vowel signs, which support syllable formation, are grouped next.

    TABLE X: DEPENDENT VOWEL SIGN

    Alphabets  Unicode  Decimal Equivalent
               093E     2366
    ा          093F     2367
    ि          0940     2368
    ी          0941     2369
    ु          0942     2370
    ू          0943     2371
    ृ          0947     2375
    े          0948     2376
    ै          094B     2379
    ो          094C     2380
    ौ          094D     2381

    TABLE XI: PADDING

    Alphabets  Unicode  Digits Padded
               093E     ---
    ा          093F     01
    ि          0940     02
    ी          0941     03
    ु          0942     04
    ू          0943     05
    ृ          0947     06
    े          0948     07
    ै          094B     08
    ो          094C     09
    ौ          094D     10

    The pre-processor program reads the entered text character by character and generates a modified Unicode file as output. This output is stored in a text file and imported directly into the MATLAB program for further processing.

    A. Rules applied during Pre-processing

    1. If the character belongs to the independent vowel group shown in Table IX, the decimal value of its Unicode is padded directly with two zeroes. E.g. if the character read is अ, its Unicode value 2309 is padded with two zeroes, giving the modified Unicode 230900.

    2. If the character read belongs to the consonant group shown in Table III, the next character is checked. If the next character belongs to the dependent vowel sign group, the consonant's Unicode is padded with the corresponding two-digit value from Table X. E.g. if the character entered is रु, it is divided into र, whose Unicode is 2352, and ु, whose Unicode is 2370 and whose padding value from Table X is 04. The modified Unicode value is therefore 235204.

    3. If the character belongs to the consonant group shown in Table III and the next character also belongs to the consonant group, the Unicode is left unchanged. E.g. if the character read is ण, whose Unicode is 2339, and the next character read is also a consonant, the Unicode remains 2339.

    4. If the entered word is अरुण, its modified Unicode output is 230900 235204 2339. The spaces between the Unicode values separate the individual characters of the entered word.

    5. If the entered sentence is अरुण कुमार, its modified Unicode output is 230900 235204 2339 101010 232504 235001 2352. The value 101010 acts as the space between two words and is used to separate words during sentence formation.
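    The five rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' pre-processor: the function name, the code-point ranges used for the independent vowel and consonant groups, and the fallback for characters outside those groups are assumptions. The padding digits are keyed on the vowel sign itself, following the worked examples (ु gives 04, ा gives 01); the other entries are inferred from the order of the signs in Table X.

```python
# Padding digits per dependent vowel sign. Only the entries for U+0941 (04)
# and U+093E (01) are confirmed by the worked examples; the rest are assumed
# to follow the order of the signs in Table X.
VOWEL_SIGN_PAD = {
    "\u093E": "01", "\u093F": "02", "\u0940": "03", "\u0941": "04",
    "\u0942": "05", "\u0943": "06", "\u0947": "07", "\u0948": "08",
    "\u094B": "09", "\u094C": "10",
}
WORD_SEPARATOR = "101010"  # stands for the space between two words (rule 5)


def is_independent_vowel(ch):
    # Assumed range for the independent vowel group of Table IX.
    return "\u0905" <= ch <= "\u0914"


def is_consonant(ch):
    # Assumed range for the consonant group.
    return "\u0915" <= ch <= "\u0939"


def to_modified_unicode(text):
    """Convert Devanagari text to the space-separated modified Unicode form."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == " ":
            tokens.append(WORD_SEPARATOR)
        elif is_independent_vowel(ch):
            tokens.append(f"{ord(ch)}00")               # rule 1
        elif is_consonant(ch):
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt in VOWEL_SIGN_PAD:
                tokens.append(f"{ord(ch)}{VOWEL_SIGN_PAD[nxt]}")  # rule 2
                i += 1                                  # vowel sign consumed
            else:
                tokens.append(str(ord(ch)))             # rule 3
        else:
            # Assumption: anything else is emitted as its plain decimal code.
            tokens.append(str(ord(ch)))
        i += 1
    return " ".join(tokens)


print(to_modified_unicode("अरुण कुमार"))
# 230900 235204 2339 101010 232504 235001 2352
```

    Consuming the vowel sign together with its consonant keeps one output number per written unit, which is what lets the later stage read the file number by number.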

    IV. SPEECH SYNTHESIS

    Speech synthesis and processing are implemented in MATLAB. Once the database has been built, selecting an appropriate algorithm among the concatenation-based TTS approaches is very important. According to recent studies, the direct waveform concatenation algorithm is best suited to speech synthesis [8].

    The MATLAB program takes the modified Unicode file generated by the pre-processing program as its input. It reads the file number by number and fetches the appropriate phonemes and syllables from the database. The spaces in the modified Unicode file are used to determine the directory from which each syllable should be fetched, i.e. Start, Middle or End. As an example, consider the word हमारा. Its syllable units are fetched separately from the respective databases and concatenated using a suitable algorithm. Fig. 11 shows the concatenated speech output.

    Fig. 11 Concatenated output
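    The position lookup described above can be sketched as follows. The paper implements this stage in MATLAB and does not give its details, so everything here is an assumption for illustration: each number in the modified Unicode file is treated as one unit, the first unit of a word is taken from Start, the last from End, the others from Middle, a one-unit word is taken from Start, and the `Start/230900.wav` style file layout is hypothetical.

```python
WORD_SEPARATOR = "101010"  # marks the space between two words


def assign_positions(modified_unicode):
    """Return (position, token) pairs for every unit in the input string."""
    # Split the token stream into words at the separator value.
    words, current = [], []
    for token in modified_unicode.split():
        if token == WORD_SEPARATOR:
            if current:
                words.append(current)
            current = []
        else:
            current.append(token)
    if current:
        words.append(current)

    # Label each unit by where it sits inside its word.
    plan = []
    for word in words:
        last = len(word) - 1
        for index, token in enumerate(word):
            if index == 0:
                position = "Start"   # also used for one-unit words (assumption)
            elif index == last:
                position = "End"
            else:
                position = "Middle"
            plan.append((position, token))
    return plan


def unit_path(position, token):
    # Hypothetical layout: one directory per position, one file per unit.
    return f"{position}/{token}.wav"


for position, token in assign_positions("230900 235204 2339"):
    print(unit_path(position, token))
# Start/230900.wav
# Middle/235204.wav
# End/2339.wav
```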

    After concatenation, the output is smoothed using moving average windowing. This increases the quality of the speech output.
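    A minimal sketch of direct waveform concatenation followed by moving average smoothing is given below. The paper does not state the window length or whether smoothing is applied to the whole signal or only near the joins, so the centred window over the whole signal, the default length of 5 samples, and the function names are assumptions.

```python
def concatenate(units):
    """Join unit waveforms (lists of samples) end to end."""
    output = []
    for unit in units:
        output.extend(unit)
    return output


def moving_average(signal, window=5):
    """Centred moving average; the window shrinks at the signal edges.

    The window is treated as odd: an even value behaves like window + 1.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


# Two toy "units" with an abrupt jump at the join.
joined = concatenate([[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]])
print(moving_average(joined, window=3))
# [0.0, 0.0, 1.0, 2.0, 3.0, 3.0]
```

    In the toy example the step at the join is spread over two samples, which is the kind of glitch reduction the smoothing stage is meant to provide. A longer window smooths more but also dulls the speech itself.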

    V. CONCLUSION

    This paper discusses the design and development of a Hindi text and speech database for a concatenation-based TTS system that uses the syllable as the basic unit. This technique provides high quality speech output that is reasonably natural and comparable to the voice of the original speaker. The proposed approach minimizes the co-articulation effect and the prosody mismatch between adjacent concatenated units. Considering the position of the syllable during database building reduces glitches at concatenation points, gives continuity in the concatenated speech, and improves output quality compared with concatenation done without considering the position and duration of the character.

    REFERENCES

    [1]. Ravi D J and Sudarshan Patilkulkarni (2011), “A Novel Approach to Develop Speech Database for Kannada Text-to Speech System”, Int. J. on Recent Trends in Engineering & Technology, Vol. 05, No. 01.

    [2]. Marian Macchi (1993), “Issues in Text-to-Speech Synthesis”.

    [3]. Kishore S P and Black A (2003), “Unit Size in Unit Selection Speech Synthesis”, in Proceedings of Eurospeech, September, pp. 1317-1320.

    [4]. Paul Taylor (2009), “Text-to-Speech Synthesis”, Cambridge University Press.

    [5]. Lemmetty S (1999), “Review of Speech Synthesis Technology”, M.S. Thesis, Dept. Elec. and Comm. Engg., Helsinki University of Technology.

    [6]. Thomas S (2007), “Natural Sounding Text-to-Speech Synthesis Based on Syllable Like Units”, M.S. Thesis, Indian Institute of Technology Madras.

    [7]. Arun Kumar C and Shreekanth T (2014), “A Comprehensive Review on Concatenation Based Text to Speech Synthesis for Indian Language”, Int. J. Elec & Electr. Eng & Telecoms, Vol. 3, No. 2, April 2014, ISSN 2319 – 2518.

    [8]. Boersma and Weenink (1992-2001), “PRAAT: A Tool for Phonetic Analysis and Sound Manipulations”, www.praat.org

    [9]. Kishore S P and Alan W Black (2003), “Unit Size in Unit Selection Speech Synthesis”, EUROSPEECH 2003, Geneva.

    [10]. Parminder Singh and Gurpreet Singh Lehal (2006), “Text-To Speech Synthesis System for Punjabi Language”, in Proceedings of International Conference on Multidisciplinary Information Sciences and Technologies, Merida, Spain.

    [11]. Sanghamitra Mohanty (2011), “Syllable Based Indian Language Text To Speech System”, International Journal of Advances in Engineering & Technology, Vol. 1, Issue 2.

    [12]. Badri Nath Kapoor (2004), “Practical Hindi-English Dictionary”, January 1, 2004.

    [13]. Pukhraj P. Shrishrimal, Ratnadeep R. Deshmukh and Vishal B. Waghmare (2012), “Indian Language Speech Database: A Review”, International Journal of Computer Applications (0975 – 888), Volume 47, No. 5, June 2012.