This Project Implements a Speech Synthesizer Utilizing a Source

8/8/2019 · http://slidepdf.com/reader/full/this-project-implements-a-speech-synthesizer-utilizing-a-source

This project implements a speech synthesizer utilizing a source-filter model that resembles actual speech production. A literature survey on the anatomy of human speech production, source-filter models, and vocal-tract characterization by linear prediction coefficients was performed. While we discovered many ways to generate synthetic speech, we chose an all-pole synthesizer derived from linear predictive coding (LPC) analysis. This method is attractive because linear prediction enables us to characterize the vocal tract very efficiently. (Also, we just wanted to try something we've never seen before.) With some coding in MATLAB, we demonstrated that we could make the computer produce the sounds /s/, /a/, and /i/. The concatenation of these sounds could produce utterances such as "saucy" and "seesaw." Although the words are far from natural-sounding speech, we have demonstrated that these easy techniques are powerful enough to do some very simple speech synthesis.



Anatomy of Speech Production

We have to first understand how human speech production works in order to create a model for machines. Our understanding of the anatomy of speech production can help us create a model for machine speech.

In general, a speech signal is an air pressure wave that travels from the speaker's mouth to the listener's ear. Figure 1 is a schematic of the anatomy of speech production. The lungs produce the initial air pressure essential for the speech signal; the pharyngeal cavity, oral cavity, and nasal cavity shape the final waveform that is perceived as speech.

The pharyngeal cavity and oral cavity (collectively known as the vocal tract) contract and relax to create all sorts of sounds through resonance. The nasal cavity opens another air passage to create what we perceive as nasal sounds (i.e., /m/, /n/). Together, these cavities characterize the sounds we produce.

In order to research the sounds that humans make when they talk, it is important to know what kinds of sounds humans are capable of producing. In linguistics, the basic unit of sound is called a phoneme. There are two general categories of phonemes: vowels and consonants.

Vowels are formed by pushing air from the lungs through the vocal tract while the vocal cords vibrate. The position of the tongue and lips changes the shape of the vocal tract, which creates different resonances. The vowel chart seen in Figure 2 not only corresponds to the position of the tongue in the mouth, but is also related to the first two formant frequencies of a vowel. The back chamber in the mouth amplifies the first formant and the front chamber amplifies the second formant. For example, the vowel /i/ is a high front vowel, which means that the back chamber is large and the front chamber is small. Indeed, we find that the first formant is quite low and the second formant is quite high.

Figure 2

Consonants are different from vowels because they involve an occlusion of the vocal tract. This occlusion is created at the beginning of the consonant and is then released in some fashion. By changing the place of articulation, manner of articulation, and voicing of the sound produced, humans can create different consonants. Place of articulation refers to the place in our mouth or throat that our tongue touches. In English, either the lips come together or else the tongue makes contact with the teeth, the alveolar ridge, the palate, or the velum.

Manner of articulation refers to the way in which the release happens. A stop is produced when there is a complete blockage of air through the vocal tract, like when the sounds at the beginning of the words 'pat' and 'bat' are produced. If the occlusion is semi-released, a fricative is produced, such as 's' and 'z'. Finally, if our vocal cords vibrate during the production of the sound, the phoneme is said to be voiced (b and z), whereas if the vocal cords are not vibrating, the phoneme is said to be unvoiced (p and s). The pitch of a person's voice, or their fundamental frequency, is determined by the frequency at which the vocal cords vibrate during the production of voiced phonemes.

Speech production is basically a source-filter model. The source is the air provided by the lungs; the filter is the spectral shaping performed by the vocal tract. The convolution of the two in the time domain produces the final utterance. Because these are two separate processes, the source excitation and the filter implementation can be analyzed separately. Optimizing both the source sub-model and the filter sub-model can improve the quality of the synthesized utterance.

Source Excitation

The source excitation can be one of two types. For voiced speech, the vocal folds close and open periodically to make the air from the lungs into an "impulse train." This impulse train creates the pitch for a sustained vowel. For unvoiced speech, the source is white noise. In this case, the air that flows out of the lungs and through the throat and mouth produces a relatively random sound.

Acoustic Tube and Transmission Line Model

In its simplest form, the vocal tract can be modeled as a lossless acoustic tube. The cross-sectional area of the tube and the speed of the air determine the sound pressure and volume velocity, which in turn determine the characteristics of speech. The vocal tract can also be modeled as a transmission line. The acoustical resistance, mass, and compliance are distributed along the tube in the same manner as the resistance, inductance, and capacitance of a transmission line.

The single acoustic tube/transmission line model is not adequate to model the wide range of sounds in human speech. Since speech production is characterized by a changing vocal-tract shape, it is more appropriate to use concatenated acoustic tubes or cascading transmission lines as models. The specific shape of a vocal-tract configuration and how it changes over time determines the actual word utterances we perceive as speech.
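The report gives no numbers here, but the concatenated-tube idea can be sketched with a toy calculation. In the standard lossless-tube model, the reflection coefficient at the junction between adjacent tube sections depends on their cross-sectional areas; the area profile below is purely hypothetical, and the code is Python rather than the project's MATLAB:

```python
# Reflection coefficients at the junctions of a concatenated lossless-tube
# model of the vocal tract. Area values are made up for illustration only.

def reflection_coefficients(areas):
    """k_i = (A[i+1] - A[i]) / (A[i+1] + A[i]) at each tube junction."""
    return [(areas[i + 1] - areas[i]) / (areas[i + 1] + areas[i])
            for i in range(len(areas) - 1)]

# A hypothetical area profile (cm^2) from glottis to lips.
areas = [2.6, 8.0, 10.5, 3.2, 1.0]
print(reflection_coefficients(areas))
```

For physically positive areas, every coefficient lies strictly between -1 and 1, which is what keeps the cascaded-tube filter stable.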

The following figure shows how the source/filter approach applies to the human vocal production system. The lungs, vocal folds, and trachea all belong to the "source" side of the model. The various cavities, the velum, and the tongue hump are part of the filter end.

 Vocal-Tract Characterization by Linear Prediction Coefficients 

Although the acoustic tube and transmission line model is relatively accurate, its complexity gives us problems if we want to do speech synthesis in real time. We cannot perform calculations on multiple acoustic tubes or cascaded transmission lines very easily. Fortunately, another model exists which greatly simplifies the computation. Notice that in the transmission line model, a signal flows from source to load via a series of delays. Also, the signal that finally reaches the load (lips) is a linear combination of many reflected and transmitted waves that arise at the transmission line junctions.

This strongly implies that we can model the output of the vocal tract as the summation of past outputs and current inputs. If we let y_approx be the output of the vocal tract, then by simply neglecting the inputs we get the following equation:

y_approx[n] = a_1 y[n-1] + a_2 y[n-2] + ... + a_p y[n-p]

This is the linear prediction (LP) approximation! The a_j's are called the LP coefficients; their weighted sum characterizes a difference equation. Now, if we take the z-transform, we can get the transfer function:

H(z) = 1 / (1 - a_1 z^-1 - a_2 z^-2 - ... - a_p z^-p)

This H(z) is the system response of an all-pole filter! If we pass the excitation source through this filter, the signal will be shaped into our desired utterance. This is the method we use in this project to synthesize speech.
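To make the relationship between the difference equation and the all-pole transfer function concrete, here is a minimal sketch (in Python rather than the project's MATLAB) of the recursion y[n] = x[n] + sum_j a_j y[n-j]; the single coefficient 0.5 is an arbitrary stable value, not one taken from the report:

```python
def allpole_filter(a, x):
    """Run input x through H(z) = 1 / (1 - sum_j a[j-1] * z^-j)."""
    y = []
    for n in range(len(x)):
        acc = x[n]
        for j, aj in enumerate(a, start=1):
            if n - j >= 0:
                acc += aj * y[n - j]
        y.append(acc)
    return y

# The impulse response of the one-pole filter 1 / (1 - 0.5 z^-1)
# is the geometric sequence 1, 0.5, 0.25, ...
impulse = [1.0] + [0.0] * 5
print(allpole_filter([0.5], impulse))  # [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
```

Feeding an impulse through the recursion recovers the filter's impulse response, confirming that the coefficients alone characterize the system.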

Discrete-Time Model of Synthesis Using Linear Prediction

Figure 4 is a model of speech production using LP analysis. The excitation signal is either an impulse train or white noise. For voiced speech, the excitation is an impulse train with period equal to the pitch period of the speech. This impulse train is passed through a glottal pulse filter that models the air from the lungs and vocal folds. After all, the impulses from our vocal folds more closely resemble smooth pulses rather than an impulse train. For unvoiced speech, a white noise signal is produced. There is a switch that routes the desired source to the filter.

Figure 5

The vocal tract filter is characterized by the LP coefficients. The radiation filter models the propagation of the sound once it leaves the lips, but it is neglected in our simplified model. Besides this, the main difference is that we have simplified the pole-zero filter that fully characterizes the vocal tract model into an all-pole filter. The motivation for this is ease of calculation. Fortunately, this simplification is justified for simple speech synthesis, as the poles are the ones that determine the essential formant peaks in voiced signals. However, by removing the zeros, we are essentially taking out the nasal cavity, the alternate air passageway, in our simplified model.
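The two excitation sources in the model above are easy to sketch. The sampling rate and pitch below are illustrative choices, not values from the report, and the code is Python rather than the project's MATLAB:

```python
import random

def impulse_train(n_samples, pitch_period):
    """Voiced excitation: unit impulses every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Unvoiced excitation: uniform random samples in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

# At an assumed 8 kHz sampling rate, a 100 Hz pitch is a period of 80 samples.
voiced = impulse_train(400, 80)   # impulses at n = 0, 80, 160, 240, 320
unvoiced = white_noise(400)
print(sum(voiced))  # 5.0
```

The "switch" in Figure 4 then simply selects one of these two vectors as the input to the vocal-tract filter, depending on whether the target phoneme is voiced or unvoiced.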

In our project, we created a talking computer in MATLAB using the scheme above. Our procedure is as follows:

A. Acquire LPC parameters
1. Record a sample speech signal
2. Select a target phoneme (sound)
3. Use the MATLAB lpc function to determine LPC coefficients for the phonemes

B. Produce synthetic sounds
1. Create a source consisting of a pulse train or white noise
2. Derive an all-pole filter from the LPC coefficients
3. Pass the source through the all-pole filter to generate the sound

C. Concatenate sounds into words
1. Adjust the length of each sound segment
2. Append the sound vectors in MATLAB
3. Play the resulting whole-word vector

See MATLAB code.
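The steps above can be sketched end to end. The report used MATLAB's lpc and filter functions; the version below substitutes a pure-Python Levinson-Durbin recursion and a direct synthesis loop, and stands in a synthetic damped cosine for the recorded phoneme, since the actual recordings are not available:

```python
import math

def lpc(x, p):
    """Levinson-Durbin solution for LP coefficients a_1..a_p
    in the model x[n] ~ sum_j a_j * x[n-j]."""
    r = [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(p + 1)]
    a, err = [], r[0]
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - 1 - j] for j in range(len(a)))) / err
        a = [a[j] - k * a[i - 2 - j] for j in range(len(a))] + [k]
        err *= 1.0 - k * k
    return a

def synthesize(a, excitation):
    """Pass an excitation through the all-pole filter defined by a."""
    y = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for j, aj in enumerate(a, start=1):
            if n - j >= 0:
                acc += aj * y[n - j]
        y.append(acc)
    return y

# A. "Record" a stand-in phoneme and extract its LPC coefficients.
recorded = [0.9 ** n * math.cos(2 * math.pi * 0.05 * n) for n in range(200)]
a = lpc(recorded, 4)

# B. Produce a synthetic voiced sound: impulse train through the filter.
pulse_train = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]
sound = synthesize(a, pulse_train)

# C. Concatenate segments into a "word" (here the same segment twice).
word = sound + sound
print(len(word))  # 800
```

The autocorrelation method used here guarantees a stable (minimum-phase) all-pole filter, which is one reason the report's simple scheme works at all.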


FFT of /a/ (as in "pot"), showing the power spectral density and formant peaks. The panel on the left is the actual recorded signal; on the right is the approximation via LPC.


Spectrogram of /s/ (as in "send"), showing the frequency content. The panel on the left is the actual recorded signal; on the right is the approximation via LPC.

As can be seen from comparing the left and right panes, the signals synthesized using LPC provide a somewhat close approximation to actual speech.
