

SPEECH ANALYSIS-SYNTHESIS BASED ON THE PERIODIC-APERIODIC DECOMPOSITION AND AUTOREGRESSIVE PARAMETERIZATION

Tran Duy Hoa*, Nguyen Huu Mong**, Thai Trung Kien***

Abstract: Speech analysis-synthesis based on the periodic-aperiodic decomposition and autoregressive parameterization is presented in this paper. A time-varying discrete Fourier transform is used to decompose each voiced frame into two parts: a periodic part and an aperiodic (noise) part. The periodic part is a sum of sinusoidal components, each represented by a magnitude and a phase. An autoregressive method is used to model the spectral envelope and to obtain the magnitude parameters. In the synthesis process, the periodic and aperiodic parts are synthesized separately and added together. The results show that the model fits the speech model in [2, 3] and can be used in voice conversion, speech synthesis, and speech coding.

Keywords: Periodic-aperiodic decomposition, autoregressive parameterization.

1. INTRODUCTION

In recent decades, several analysis-synthesis models of speech have been proposed, for instance the sinusoidal model and the harmonic plus noise model. Sinusoidal models are based on the well-known assumption that the speech signal can be represented as a sum of sine waves with amplitudes and frequencies, as in [4] by Macon (1996). Sinusoidal models have also been used successfully in singing voice synthesis [5]. Besides that, the Harmonic plus Noise Model (HNM) was introduced by Stylianou [6]. A fundamental principle underlying HNM is the concept of the maximum voiced frequency, which is estimated for each frame. In voiced and mixed voiced/unvoiced frames, harmonic components can only be obtained up to this frequency; the higher-frequency components are regarded as noise. Synthesis in HNM is performed in an overlap-add manner, with the overlapping windows centered on the frame center of gravity to ensure phase coherence [8]. The noise component measured during analysis is generated by passing Gaussian white noise through an LP filter and is time-modulated using the power measured in each analysis frame. HNM has been used for text-to-speech (TTS) synthesis as well as for voice conversion. In TTS, the results show that HNM consistently outperforms TD-PSOLA in all of the evaluated features except computational load [7]. However, the maximum voiced frequency is a problem when this value is fixed at 4 kHz for speech signals sampled at 16 kHz.

The analysis-synthesis system based on the periodic-aperiodic decomposition has been used for speech coding in [2, 3]. It gives high-quality synthesized speech without smoothing problems at the frame boundaries. However, when this model is applied to voice conversion, the training and transformation processes suffer from high dimensionality, and the parameters cannot be used for time alignment because the number of harmonics differs between frames. To overcome these problems, speech analysis-synthesis based on the periodic-aperiodic decomposition with autoregressive (AR) parameterization is considered in this paper.

2. THE VOICE ANALYSIS-SYNTHESIS SYSTEM BASED ON THE PERIODIC-APERIODIC DECOMPOSITION

In this research, the speech signal is assumed to be composed of two parts, a periodic (harmonic) part and an aperiodic (noise) part, as in [2, 3]:

s(n) = h(n) + r(n) (1)

The periodic part accounts for the quasi-periodic components of the speech signal, while the noise part accounts for the non-periodic components. For each speech frame, the periodic part is assumed to consist of a set of sinusoids and is given by:

h(n) = \sum_{k=1}^{K(n)} A_n(k)\,\cos\big(\theta_k(n) + \varphi_k\big), \qquad \theta_k(n) = 2\pi n k f_0        (2)

where A_n(k) and \varphi_k are the magnitude and phase of the k-th harmonic at time n, f_0 is the fundamental frequency (normalized by the sampling frequency), and K(n) is the number of harmonics. If the signal frame is periodic, these harmonic coefficients can be obtained from its discrete samples by the Discrete Fourier Transform (DFT):

H(k) = \frac{2}{N}\sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi n k f_0}        (3)

The amplitudes and phases can then be expressed as A_n(k) = |H(k)| and \varphi_k = \arg H(k).
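To make equations (2) and (3) concrete, the following sketch (an illustration in Python with numpy, not the implementation used in this work) estimates the magnitudes and phases of a frame at multiples of a fixed f0 and re-synthesizes the periodic part; it assumes the frame covers an integer number of pitch periods:

    import numpy as np

    def harmonic_analysis(frame, f0_hz, fs):
        """Magnitude A_k and phase phi_k at the harmonics of f0 (cf. eq. 3)."""
        n = np.arange(len(frame))
        num_harmonics = int(fs / 2 // f0_hz)          # harmonics below Nyquist
        mags, phases = [], []
        for k in range(1, num_harmonics + 1):
            # DFT of the frame evaluated at the k-th harmonic frequency
            H_k = 2.0 / len(frame) * np.sum(frame * np.exp(-2j * np.pi * k * f0_hz / fs * n))
            mags.append(np.abs(H_k))
            phases.append(np.angle(H_k))
        return np.array(mags), np.array(phases)

    def harmonic_synthesis(mags, phases, f0_hz, fs, length):
        """Re-synthesize the periodic part h(n) as a sum of sinusoids (cf. eq. 2)."""
        n = np.arange(length)
        h = np.zeros(length)
        for k, (a_k, phi_k) in enumerate(zip(mags, phases), start=1):
            h += a_k * np.cos(2 * np.pi * k * f0_hz / fs * n + phi_k)
        return h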

In equation (3) the fundamental frequency f_0 is assumed to be constant within the frame; however, the fundamental frequency of a speech signal changes over time. To solve this problem, the Time-Varying Discrete Fourier Transform (TVDFT) was proposed in [2, 3], which gives a more accurate periodic-aperiodic decomposition of speech. The instantaneous fundamental frequency within a frame is defined as:


f_0(n) = f_0 + \frac{\Delta f_0}{N}\, n, \qquad n = 0, \ldots, N-1        (4)

where \Delta f_0 is the deviation of the fundamental frequency between two adjacent frames and N is the frame length. Equation (3) then becomes:

H(k) = \frac{2}{N}\sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi k \sum_{m=0}^{n} f_0(m)}        (5)

The main advantage of analysis based on the TVDFT is that it gives a clearer spectral representation of the speech signal.
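As a rough illustration of the TVDFT idea (a sketch under the assumption of the linearly interpolated f0 of equation (4); it is not claimed to reproduce the exact kernel of [2, 3]), the harmonic coefficients can be computed against the cumulative phase of the time-varying fundamental:

    import numpy as np

    def tvdft_analysis(frame, f0_start_hz, f0_end_hz, fs):
        """Harmonic magnitudes/phases with f0 varying linearly across the frame."""
        N = len(frame)
        n = np.arange(N)
        f0_inst = f0_start_hz + (f0_end_hz - f0_start_hz) * n / N      # cf. eq. (4)
        # Cumulative phase of the fundamental: running sum of the instantaneous frequency
        phase0 = 2 * np.pi * np.cumsum(f0_inst) / fs
        num_harmonics = int(fs / 2 // max(f0_start_hz, f0_end_hz))
        mags, phases = [], []
        for k in range(1, num_harmonics + 1):
            H_k = 2.0 / N * np.sum(frame * np.exp(-1j * k * phase0))
            mags.append(np.abs(H_k))
            phases.append(np.angle(H_k))
        return np.array(mags), np.array(phases)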

The noise part is given by:

r(n) = s(n) – h(n) (6)

The noise part is analyzed by linear predictive coding (LPC), and in synthesis it is regenerated by passing Gaussian white noise through the LP filter.
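A minimal sketch of this step, assuming an autocorrelation-based LPC fit (scipy is used for the Toeplitz solve and for the all-pole filtering):

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lpc_coefficients(x, order):
        """Autocorrelation-method LPC: prediction coefficients a_k and gain G."""
        r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        gain = np.sqrt(max(r[0] - np.dot(a, r[1:order + 1]), 1e-12))
        return a, gain

    def synthesize_noise(residual, order, length):
        """Model the noise part r(n) by LPC and regenerate it from white noise."""
        a, gain = lpc_coefficients(residual, order)
        white = np.random.randn(length)
        # All-pole LP synthesis filter G / (1 - sum_k a_k z^-k)
        return lfilter([gain], np.concatenate(([1.0], -a)), white)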

3. AUTOREGRESSIVE PARAMETERIZATION

The periodic-aperiodic decomposition model has been shown to be capable of high-quality speech processing, particularly in the area of pitch modification. As shown in figure 1, the harmonic waveform re-synthesized using the original magnitudes fits the speech waveform, while the harmonic waveform re-synthesized with the AR model does not.

(7)

In addition, the signal synthesized by equation (2) is perceptually almost indistinguishable from the original speech signal. However, the model has a problem: its dimensionality is high. For instance, if a frame of speech has fundamental frequency F0 = 100 Hz and sampling rate Fs = 16000 Hz, then it requires K = 160 magnitude parameters. Moreover, the number of magnitude parameters depends on the fundamental frequency, so it differs between frames. In voice conversion, in order to transform the spectral envelope, the number of parameters must be the same for every frame. For this reason, the periodic-aperiodic decomposition model needs to be parameterized by an equivalent speech model; the source-filter model is chosen. The source-filter model is represented by the parameters describing the transfer function of the vocal tract. Two types of the source-filter model are useful for speech processing: the all-pole model, known as the autoregressive (AR) model, and the pole-zero model, known as the autoregressive moving average (ARMA) model.


The AR model of the vocal tract is well known in speech processing as the linear predictive coding (LPC) model; it is an all-pole model of the vocal tract given by the infinite impulse response (IIR) filter [1]:

H(z) = \frac{G}{1 - \sum_{k=1}^{N_A} a_k z^{-k}}        (8)

where N_A is the order of the AR model, and the gain G and the coefficients a_k are the AR parameters (the LPC parameters). The AR model has the frequency response:

P(k) = \frac{G}{1 - \sum_{m=1}^{N_A} a_m e^{-j 2\pi k f_0 m}}, \qquad k = 1, \ldots, K(n)        (9)

The amplitudes in equation (2) can be approximated as A_n(k) \approx |P(k)|, which means that the periodic part can be re-synthesized as

h(n) = \sum_{k=1}^{K(n)} |P(k)|\, \cos\big(\theta_k(n) + \varphi_k\big)        (10)

However, as figure 1 shows, the levels of the re-synthesized and original magnitudes are different. From this observation, a gain factor is added into equation (2) when it is used with the AR model. It is represented as

h(n) = g \sum_{k=1}^{K(n)} A_k \cos\big(\theta_k(n) + \varphi_k\big)

where the amplitudes and phases can be expressed as A_k = |P(k)| and \varphi_k = \arg H(k),

and g is a gain factor, which is defined as:

g = \sqrt{\frac{\sum_{k=1}^{K(n)} A_n(k)^2}{\sum_{k=1}^{K(n)} |P(k)|^2}}        (11)

The results in figure 2 show that the harmonic waveform re-synthesized by AR with the gain factor fits the original speech waveform and its periodic part.
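The AR parameterization and the gain correction can be sketched as follows (illustrative only; it reuses lpc_coefficients from the sketch above, and the energy-matching gain is an assumed form of equation (11), not necessarily the paper's exact definition):

    import numpy as np

    def ar_harmonic_magnitudes(a, gain, f0_hz, fs, num_harmonics):
        """Sample the AR spectral envelope |P(k)| at the harmonic frequencies (cf. eq. 9)."""
        k = np.arange(1, num_harmonics + 1)
        z = np.exp(-2j * np.pi * np.outer(k, np.arange(1, len(a) + 1)) * f0_hz / fs)
        return np.abs(gain / (1.0 - z @ a))

    def gain_factor(orig_mags, ar_mags):
        """Gain g matching the energy of the AR magnitudes to the original magnitudes
        (one plausible reading of eq. 11)."""
        return np.sqrt(np.sum(np.asarray(orig_mags) ** 2) / np.sum(np.asarray(ar_mags) ** 2))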


Fig. 1 Speech waveform (solid line), harmonic waveform re-synthesized by AR, and harmonic waveform re-synthesized using the original amplitudes

Fig. 2 Speech waveform (solid line), harmonic waveform re-synthesized by AR with gain, and harmonic waveform re-synthesized using the original amplitudes

4. EXPERIMENTAL RESULTS

The synthesis phase is implemented using equation (1) with the parameters obtained in the analysis phase. The speech analysis scheme is shown in figure 3. The periodic part and the noise part are synthesized separately; adding the synthesized harmonic part and the synthesized noise part then gives the synthesized speech, as shown in figure 4. Each voiced frame is described by the fundamental frequency, the number of harmonics, the LPC coefficients, and the phases. Each unvoiced frame is represented only by its autocorrelation coefficients and its gain.
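A high-level sketch of how one voiced frame could be reconstructed from these parameters (hypothetical glue code that reuses harmonic_synthesis and ar_harmonic_magnitudes from the sketches above; windowing and overlap-add between frames are omitted):

    import numpy as np
    from scipy.signal import lfilter

    def synthesize_voiced_frame(f0_hz, fs, env_a, env_gain, phases, gain_g,
                                noise_a, noise_gain, frame_len):
        """Periodic part from the AR envelope and gain, noise part from an LP filter."""
        num_harmonics = len(phases)
        mags = gain_g * ar_harmonic_magnitudes(env_a, env_gain, f0_hz, fs, num_harmonics)
        periodic = harmonic_synthesis(mags, phases, f0_hz, fs, frame_len)
        noise = lfilter([noise_gain], np.concatenate(([1.0], -noise_a)),
                        np.random.randn(frame_len))
        return periodic + noise                      # eq. (1): s(n) = h(n) + r(n)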


Fig. 3 Speech analysis scheme

In the experiments, we use speech signals of a female and a male speaker, sampled at 16000 Hz. The speech signal is divided into frames of 480 samples with a hop size of 240 samples. The order of the autoregressive model is chosen in the range [10, 40]. Pitch detection and voiced/unvoiced estimation use the method in [9]. The root mean square (RMS) error and the log spectral distortion (SD) are calculated in order to measure the difference between periodic parts. Three kinds of periodic part are estimated:

- the periodic part synthesized by equation (2);
- the periodic part synthesized by AR without gain;
- the periodic part synthesized by AR with gain.

Besides that, the log spectral distortions are computed between the periodic part and the periodic part by AR without gain, and between the periodic part and the periodic part by AR with gain.
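For reference, the two error measures could be computed along these lines (a sketch; the exact spectral-distortion definition used in the paper is not given, so a frame-wise log distortion over the harmonic magnitudes is assumed):

    import numpy as np

    def rms_error(x, y):
        """Root mean square error between two waveforms of equal length."""
        return np.sqrt(np.mean((np.asarray(x) - np.asarray(y)) ** 2))

    def log_spectral_distortion(mags_ref, mags_test, eps=1e-12):
        """Log spectral distortion (in dB) between two sets of harmonic magnitudes."""
        d = 20.0 * np.log10((np.asarray(mags_ref) + eps) / (np.asarray(mags_test) + eps))
        return np.sqrt(np.mean(d ** 2))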

Fig. 4 Speech synthesis scheme


Fig.5 Root mean square and log spectral distortion of a female voice

Fig.6 Root mean square and log spectral distortion of a male voice

Table 1. Number of spectral envelope coefficients when changing the fundamental frequency

F0 (Hz)          100  110  120  130  140  150  160  170  180
Old model         40   36   33   30   28   26   25   23   22
Proposed model    10   10   10   10   10   10   10   10   10

As shown in figures 5 and 6, the model with gain gives better results than the model without gain. In a listening test, the quality of the synthesized speech is very high, and listeners cannot distinguish it from speech synthesized by the periodic-aperiodic decomposition model. In our experiment, an LPC order of 20 is chosen, which means the spectral envelope is represented by 20 coefficients. Table 1 shows the difference in the number of spectral envelope coefficients between the periodic-aperiodic decomposition model and the proposed model when the fundamental frequency of the speech signal (sampled at 16 kHz) changes.


CONCLUSION

The aim of this paper is to investigate an analysis-synthesis system based on the periodic-aperiodic decomposition with a source-filter model. The periodic-aperiodic decomposition model with autoregressive parameterization and gain was evaluated; the results show that the quality of the synthesized speech is very high and that the proposed model fits the periodic-aperiodic decomposition model while the order of its spectral envelope is fixed and reduced. The research can be used in future analysis-synthesis systems such as speech synthesis and speech coding.

ACKNOWLEDGMENT

This work is supported by project 118/2013/HĐ – NĐT (2013-2014), funded by the Ministry of Science and Technology of Vietnam.

REFERENCES

[1] Rabiner, Lawrence R, and Schafer, Ronald W. “Digital Processing of Speech Signals”. Bell Laboratories, 1978.

[2] Sercov V., Petrovsky A. "An improved speech model with allowance for time-varying pitch harmonic amplitudes and frequencies in low bit-rate MBE coders", in Proc. of the European Conf. on Speech Communication and Technology EUROSPEECH, pp. 1479-1482, Budapest, Hungary, 1999.

[3] Piotr Zubrycki, Alexander Pavlovec, Alexander Petrovsky. "Analysis-by-Synthesis Parameters Estimation in the Harmonic Coding Framework by Pitch Tracking DFT", XI Symposium AES "New Trends in Video and Audio", Białystok, Poland, September 20-22, 2006.

[4] M. W. Macon and M. A. Clements. "Speech Concatenation and Synthesis Using an Overlap-Add Sinusoidal Model", Proc. ICASSP'96, Vol. I, pp. 361-364, Atlanta, GA, 1996.

[5] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George. "A Singing Voice Synthesis System Based on Sinusoidal Modeling", Proc. ICASSP'97, Vol. I, pp. 435-439, Munich, Germany, 1997.

[6] Yannis Stylianou. "Concatenative Speech Synthesis using Harmonic Plus Noise Model", IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 1, January 2001.

[7] Syrdal, A., Y. Stylianou, L. Garisson, A. Conkie and J. Schroeter. "TD-PSOLA versus Harmonic plus Noise Model in diphone based speech synthesis", Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP98), pp. 273-276, Seattle, USA, 1998.

[8] Stylianou, Y. "Removing phase mismatches in concatenative speech synthesis", Third ESCA Speech Synthesis Workshop, Nov. 1998.


[9] Thai Trung Kien. "Pitch detection method based on harmonic to noise ratio and average magnitude difference function for voice conversion", PRIP2007, Minsk, Belarus, 22-24 May, 2007.

Address: * Tran Duy Hoa, Truong Cao Dang Su Pham Tay Ninh. Email: [email protected]

** Nguyen Huu Mong, Hoc vien KTQS. Email: [email protected]

*** Thai Trung Kien, Vien CNTT – Vien KH&CNQS. Email: [email protected]
