
Department of Electrical & Electronic Engineering
Part IV Project Report 2003

Speech Processing Workstation

Author: Ho Kwong Lai
Project Partner: Octavian Cheng
Supervisor: Dr Waleed Abdulla


Abstract

This report presents the work and details involved in building a speech processing workstation as a final year project. The objective of the workstation is to assist researchers working in speech processing; it also serves as an educational aid for students studying speech processing. The workstation contains a variety of speech processing techniques for characterising a speech signal, and the results of these techniques are presented in graphical as well as audio form. Benchmarked against several similar speech workstations, our workstation was found to be flexible, easy to understand and to contain a wide variety of signal processing tools. This report first briefly describes the background of the speech production model and some relevant characteristics of speech signals. It then describes the basic components of the workstation interface and how the speech processing tools are accessed. The remainder of the report gives detailed descriptions of the concepts, methods and results of the speech processing tools implemented in the workstation. The workstation was successfully developed and 28 well-known speech processing techniques were implemented. However, it was found that the workstation could be greatly improved if more models of the speech production organs were implemented. A simulation of the speech processing workstation is included on a CD-ROM in the appendix. The simulation demonstrates how the workstation interface is used, where each processing tool is located and how it is accessed.


Declaration of Originality

This report is my own unaided work and was not copied from, nor written in collaboration with, any other person.

Signed: _____________________________


Table of Contents

1.0 Introduction
2.0 Speech Production Model
3.0 Workstation Overview and Basic Tools
  3.1 Basic Tools
4.0 Energy and Zero Crossings
5.0 Speech Framing, Windowing and Overlapping
  5.1 Framing
  5.2 Windowing
  5.3 Overlapping
6.0 Spectrogram
7.0 Fast Fourier Transform (FFT)
8.0 Linear Predictive Coding (LPC)
  8.1 LPC Spectrum
  8.2 LPC 3-Dimension
  8.3 LPC Z-Plane
9.0 Cepstrum Analysis
  9.1 LP Cepstrum Coefficients
  9.2 LP Cepstrum Spectrum
  9.3 LP Cepstrum Spectra 3D
10.0 Formants
11.0 Pitch
12.0 Vocal Tract Modeling
13.0 Speech Clippings
  13.1 Center Clipping
  13.2 Hard Clipping
14.0 Envelope
15.0 Speech Reconstruction using Envelope and Hard Clipping
16.0 Multi Level Crossings
17.0 Filter Design
18.0 Bark Scale Filter Banks
19.0 Re-Sampling Speech Signal
  19.1 Down Sampling
  19.2 Up Sampling
20.0 Autocorrelation
21.0 Determining Voiced and Unvoiced Speech
22.0 Mean
23.0 Histogram
24.0 Vertical and Horizontal Flipping of a Speech Signal
  24.1 Vertical Flip
  24.2 Horizontal Flip
25.0 Future Expectation
26.0 Conclusions
27.0 Bibliography
28.0 Reference
29.0 Appendix


1.0 Introduction

A speech signal is a highly redundant and non-stationary signal, which makes it very challenging to characterise. The aim of this project is to develop a speech processing workstation to assist students and researchers engaged in speech processing. The workstation provides an educational aid that lets students become practically familiar with speech processing techniques: students studying signal processing find it difficult to implement tasks by relying only on theories and algorithms on paper, without a visual understanding of the concepts. Researchers, on the other hand, need to write pages of software to implement a speech processing algorithm, so it is useful to supply a software tool for applying speech processing algorithms efficiently. The software embedded in the workstation includes signal processing and statistical modeling techniques, the main notion behind which is to extract the relevant features and characteristics of a speech signal. Matlab was chosen as our programming environment because it offers several advantages. Firstly, Matlab contains a variety of signal processing and statistical tools which are well suited to developing the workstation; users can generate a variety of signal and parameter plots and experiment with the effect of manipulating algorithm parameters. Secondly, Matlab code is compact, which simplifies algorithm understanding. Thirdly, Matlab is used widely in academic institutions to support linear systems and DSP courses. Lastly, Matlab is compatible with a variety of operating systems such as DOS, Mac, UNIX and Windows. Many other toolboxes were investigated during the development of this workstation, including 'Cool Edit 2000', 'Wavelets', 'Audio Spectrum Version 7', 'MAD', 'Colea' and 'J-DSP'. However, none of these toolboxes contains all of the speech processing techniques; most target specific signal processing techniques only. Our workstation was therefore developed to include all the techniques used in these workstations. Furthermore, the software structure is designed to be flexible so that new processing tools can easily be added. A simulation of the speech processing workstation is included on a CD-ROM in the appendix. The simulation demonstrates how the workstation interface is used, where each processing tool is located and how it is accessed.


2.0 Speech Production Model

The fundamentals of speech production and its parameters need to be addressed before any of the speech processing analyses used in the workstation can be described. The speech production mechanism can be modeled by the linear separable equivalent circuit shown in Fig. 2.1. In this model, a sound source G(ω) is input to an articulation filter (the vocal tract) to produce the speech. (Ifeachor E.C & Jervis B.W, 1993)

The sound source G(ω) is either a train of impulses (voiced) or random noise (unvoiced). Voiced sounds include /a/, /e/, /i/, /o/ and /u/; unvoiced sounds are noise-generated sounds such as /t/ and /s/. The articulation H(ω) is a transfer function which models the vocal tract of the human vocal organs. The output speech wave S(ω) is the product of the sound source and the articulation, given by Eq. (2.1):

S(ω) = G(ω)H(ω)    (2.1)

This equation is in the frequency domain; the corresponding time-domain output is the convolution of the input with the filter's impulse response.

Fig. 2.1 The equivalent circuit model of the speech production mechanism.


3.0 Workstation Overview and Basic Tools

The aim of the workstation is to provide educational and research assistance in speech processing. The workstation is designed to be user friendly and highly flexible: each analysis is separated into its own module, so that future additions or improvements to any module can be made without modifying the whole workstation. The main interface of the workstation is shown in Fig. 3.1.

Fig. 3.1 Main interface of the workstation toolbox

3.1 Basic Tools

The workstation is divided into three sections: inputs, processing and outputs. It provides two input methods, located at the top of the workstation (as shown in Fig. 3.2): reading from a wav file and recording through a microphone.

Fig. 3.2 Workstation input area

The user can specify the filename by typing into the text area or by selecting the wave file visually with the browse button. The user can also record speech through a


microphone in the recording frame area. To do this, the user must first specify the length of the speech (in milliseconds) and the sampling frequency (in Hz); recording starts when the user presses the 'record' button. When a wave file is loaded into the toolbox, its sampling rate and speech length are displayed under the input area. The processing area is located on the right-hand side of the workstation toolbox. The top and bottom frames contain tools for selecting and adjusting the speech signal. The user can highlight a specific area of the speech by clicking the mouse at the start and end points of the region of interest; many of the analytical tools will then operate on the highlighted speech only. To make the highlighting more accurate, the user can use the speech length text box located under the input frame, which adjusts an already highlighted segment to the length entered in the box. The top frame of the processing area has 8 buttons. The first four are 'zoom in', 'select all', 'save segment' and 'undo'. 'Zoom in' magnifies the selected part of the speech waveform. 'Select all' highlights the whole speech waveform so that the analytical tools work on the whole speech. 'Save segment' saves the highlighted part of a speech wave; this is very useful because the user can build a speech database in which specific words from different people and different sentences are stored. 'Undo' restores the signal to the previous step after any tool that modifies the speech signal; these tools include the clippings, silence, remove, and horizontal and vertical flipping discussed later. The top frame also includes four other buttons, 'silent', 'remove', 'horizontal flip' and 'vertical flip', each of which is discussed further in this report. The lower frame of the analytical toolbox contains the speech clippings, also covered later. The centre frame is the analytic dropdown menu: the user chooses an analytical tool from the menu and presses the plot button, and the results are either displayed on the output graphs on the left or opened in another toolbox for further analysis. In many cases the user must specify parameters for the calculations; these are entered in the text areas around the analytic frame. The user can listen to the waveform at any time by clicking the play button at the top right of the workstation, choosing to play either the whole speech signal or only the highlighted part.


4.0 Energy and Zero Crossings

The first tool in the workstation produces the energy and zero-crossing plots of a speech waveform. Energy and zero crossings are essential in many areas of signal processing. In a speech waveform, a relatively low energy level corresponds to a high number of zero crossings and vice versa, which means that unvoiced speech (the weaker fricatives such as /f/ and /s/) has a higher number of zero crossings. The technique is very useful for distinguishing strong fricatives from weak fricatives, and hence for finding voiced and unvoiced speech. The formula adopted in the workstation for the energy calculation is shown in Eq. (4.1) (Rabiner L.R & Schafer R.W, 1987):

E = Σ_{n=0}^{N-1} x²[n]    (4.1)

Here x[n] is the sequence of data in the speech signal and N is the number of samples. Applied to the whole signal at once, this equation gives very little information about the signal itself, so short-time energy is used instead: the whole speech is divided into multiple segments called frames, each typically 20 to 30 milliseconds long, and the formula is applied to each frame to give the short-time energy. In the workstation, the user can input the frame size (in time or in samples) using the frame length textbox. The zero-crossing calculation is based on a threshold voltage, pre-defined as zero. The number of zero crossings is found by counting the number of times the signal passes through the threshold (positive to negative or vice versa) within a frame. This tool is run by choosing 'energy and zero crossings' from the analytical dropdown menu; the user must also enter the necessary parameters such as frame length, overlapping percentage and type of windowing. Fig. 4.1 shows the energy and zero-crossing plots for the speech waveform 'transform'.
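As an illustration, the frame-by-frame computation can be sketched in Matlab along the following lines (a minimal sketch rather than the workstation's actual code; the input file name, 20 ms frame length and 70% overlap are assumptions taken from Fig. 4.1, and audioread replaces the older wavread):

    % Short-time energy and zero-crossing rate, frame by frame.
    [x, fs] = audioread('transform.wav');        % hypothetical input file
    frameLen = round(0.02 * fs);                 % 20 ms frames
    hop      = round(0.3 * frameLen);            % 70% overlap between frames
    w        = hamming(frameLen);
    nFrames  = floor((length(x) - frameLen) / hop) + 1;
    E  = zeros(nFrames, 1);
    ZC = zeros(nFrames, 1);
    for m = 1:nFrames
        seg   = x((m-1)*hop + (1:frameLen)) .* w;    % windowed frame
        E(m)  = sum(seg.^2);                         % Eq. (4.1) applied per frame
        ZC(m) = sum(abs(diff(sign(seg))) > 0);       % sign changes = zero crossings
    end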



Fig. 4.1 Energy and zero crossings for the waveform 'transform'. Frame size = 20 ms, overlapping percentage = 70%, window type = Hamming.


5.0 Speech Framing, Windowing and Overlapping

5.1 Framing

In most processing tools it is not appropriate to treat the speech signal as a whole when performing calculations. A speech signal is therefore often separated into a number of segments called frames; this process of separation is known as framing. Fig. 5.1 illustrates how a signal might be divided into frames. Each frame has the same number of samples, although the last frame may vary slightly. In many cases the length N of each frame satisfies the equation:

N = 2^n, where n is an integer    (5.1)

Fig. 5.1 A signal is divided into 3 frames

5.2 Windowing

There are many types of window that can be used in digital signal processing, including the Hamming, Hanning, Kaiser and Chebyshev windows. The purpose of windowing is to make the frame prominent in some regions while suppressing it in others. A typical window is the Hamming window, whose intensity is much greater around the centre than at the edges. When this window is multiplied point by point with a frame, the edges of the frame become insignificant, so calculations on the frame are not affected by the end data. Hamming windows have low amplitudes at the edges in the time domain; correspondingly, the side lobes become lower at higher frequencies, as shown in Fig. 5.2.



These properties compensate for the effect of spectral leakage when a signal is divided into frames: the lowered side lobes of these windows suppress the unwanted leakage contributed to the signal's spectrum. In the frequency representation it is desirable for the window to have a low noise bandwidth, which is achieved by reducing the side-lobe amplitudes. Our workstation includes several well-known window types: Rectangular, Hamming, Bartlett, Blackman, Kaiser, Bohman, Chebyshev, Hanning and Gaussian. Most of these windows have fixed properties; however, the Chebyshev and Kaiser windows were given defaults of 60 dB and 42 dB of side-lobe attenuation respectively. The frequency response of these windows can be viewed by choosing 'window freq response' from the dropdown menu; the user also specifies the type of window to be analysed in the window drop box.

Fig. 5.2 Left: Hamming window in the time domain. Bottom: frequency response of the Hamming window.


However, windowing has a negative effect: as the frames are multiplied by the windows, most of the data at the edges of each frame becomes insignificant, causing a loss of information. A method is therefore needed to compensate for this loss.

5.3 Overlapping

When data is framed and windowed, the samples at the ends of each frame are likely to be reduced to nearly zero, representing a loss of information. One approach to this problem is to overlap adjacent frames. Overlapping allows adjacent frames to include portions of the current frame's data, so the edges of the current frame become the centre data of the adjacent frames. Typically around 60% overlap is sufficient to recover the lost information. Fig. 5.3 shows how a signal is framed, windowed and overlapped, and a Matlab sketch of the operation follows below.
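The framing, overlapping and windowing operations can be sketched in Matlab as follows (a minimal sketch using the Signal Processing Toolbox buffer function; the 256-sample frame, 60% overlap and Hamming window are illustrative choices):

    % Split x into overlapping, windowed frames, one frame per column.
    frameLen = 256;
    overlap  = round(0.6 * frameLen);                % ~60% overlap
    frames   = buffer(x, frameLen, overlap, 'nodelay');
    frames   = frames .* repmat(hamming(frameLen), 1, size(frames, 2));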

Fig. 5.3 Signal is framed. Each frame overlaps the previous frame. And windows are multiplied into each frame.


6.0 Spectrogram

Signals such as speech are composed of many different ranges of frequencies, so a frequency representation is necessary for interpreting a speech signal. The spectrogram is a well-known frequency representation of the original speech signal. Fig. 6.1 shows the spectrogram of the speech signal 'transform': the vertical axis corresponds to frequency, the horizontal axis to time, and the intensity of the pattern at any instant of time corresponds to the energy level.

The spectrogram shows the user how much energy the speech carries across the frequency scale. It is a useful tool for detecting voiced and unvoiced areas and for identifying the frequencies of which the speech is composed. Many other kinds of research can be performed using spectrograms; the amount of information a spectrogram gives is enormous, and many speech researchers can identify plain English directly from spectrograms. The spectrogram is accessed in the workstation by choosing 'spectrogram' under the analytical dropdown menu, and the resulting plot is displayed at the bottom left of the analytical result graphs. Since the spectrogram is also computed using frames and windows, these parameters must be entered as well.
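A plot similar to Fig. 6.1 can be produced with the Signal Processing Toolbox spectrogram function (a minimal sketch; older Matlab releases provided specgram instead):

    % Spectrogram with 20 ms Hamming frames and 60% overlap, as in Fig. 6.1.
    frameLen = round(0.02 * fs);
    noverlap = round(0.6 * frameLen);
    nfft     = 2^nextpow2(frameLen);
    spectrogram(x, hamming(frameLen), noverlap, nfft, fs, 'yaxis');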

Fig. 6.1 Spectrogram of speech ‘transform’. Frame size is 20ms. Type of window is Hamming. Overlapping percentage is 60%.


7.0 Fast Fourier Transform (FFT)

As described in the previous section on the spectrogram, the analysis of a speech signal in the frequency domain is of great use. The Fourier transform converts a discrete signal x[n] from a time-domain representation into a frequency-domain representation X(e^{jω}):

X(e^{jω}) = Σ_{n=-∞}^{∞} x[n] e^{-jωn}    (7.1)

For a finite frame the sequence can be assumed periodic with period N, and the transform reduces to the discrete Fourier transform:

X[k] = Σ_{n=0}^{N-1} x[n] e^{-j2πkn/N},  k = 0, 1, …, N-1    (7.2)

Computing X[k] directly is very inefficient and time consuming, with a computation time proportional to N². The set of computational algorithms known as the fast Fourier transform (FFT) decreases this dramatically, to a time proportional to N log₂ N. The FFT is based on the observation that multiplication takes longer to process than addition, and it reorganises the calculation so that a series of multiplications is replaced by a series of additions for faster computation. Fig. 7.1 is a three-dimensional FFT of the word 'transform'. The x and y axes correspond to the time and frequency of each segment of speech respectively; the z axis is the magnitude of the respective frequency in decibels (dB).
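The magnitude spectrum of a single windowed frame can be computed with Matlab's built-in fft (a minimal sketch; the frame choice and dB scaling are illustrative):

    % Magnitude spectrum (dB) of one windowed frame.
    seg  = x(1:frameLen) .* hamming(frameLen);
    nfft = 2^nextpow2(frameLen);              % power-of-two FFT length
    X    = fft(seg, nfft);
    f    = (0:nfft/2 - 1) * fs / nfft;        % frequency axis in Hz
    plot(f, 20*log10(abs(X(1:nfft/2)) + eps));
    xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');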


Fig. 7.1 FFT of the speech waveform ‘transform’. Frame size = 20ms. Overlapping = 70%. Window type = Hamming


8.0 Linear Predictive Coding (LPC)

The speech production mechanism can be modeled by the linear separable equivalent circuit shown in Fig. 2.1: the production of speech is equivalent to a sound source passing through a transfer function. The sound source G(ω) is either a train of impulses (voiced) or random noise (unvoiced). The articulation H(ω) is a transfer function which models the vocal tract resonances and anti-resonances. The output speech wave S(ω) is the product of the sound source and the articulation:

S(ω) = G(ω)H(ω)    (8.1)

The vocal tract, shown in red in Fig. 8.1, is the region from the vocal cords to the lips.

Fig 8.1 (Red) Vocal tract of the human vocal organs. Picture from http://www.phon.ox.ac.uk/~jcoleman/phonation.htm

There are a variety of techniques for modeling the vocal tract articulation filter H(ω). One of the most common is linear predictive coding (LPC). This technique is a powerful tool which can also be used for pitch and formant detection on speech; these concepts are introduced in sections 10.0 and 11.0. The term linear predictive refers to the fact that the present sample value s[n] can be linearly predicted from the previous sample values s[n-k] (Rabiner L.R & Schafer R.W, 1987). The equation of linear prediction is shown in Eq. (8.2):

ŝ[n] = Σ_{k=1}^{p} α_k s[n-k]    (8.2)


This linear prediction introduces an error into the sequence of speech samples, known as the residual error e[n]:

e[n] = s[n] - ŝ[n] = s[n] - Σ_{k=1}^{p} α_k s[n-k]    (8.3)

Transforming this equation into the z-domain gives

E(z) = S(z)[1 - Σ_{k=1}^{p} α_k z^{-k}]    (8.4)

or

E(z) = A(z)S(z)    (8.5)

where

A(z) = 1 - Σ_{k=1}^{p} α_k z^{-k}    (8.6)

The steady-state system function of the articulation transfer filter is found to be (Ifeachor E.C & Jervis B.W, 1993)

H(z) = G / (1 - Σ_{k=1}^{p} a_k z^{-k}) = G / U(z)    (8.7)

where the gain G of the filter is given by (Ifeachor E.C & Jervis B.W, 1993)

G² = R(0) - Σ_{k=1}^{p} a_k R(k)    (8.8)

in which R(k) is the autocorrelation of the speech frame. Comparing Eqs. (8.6) and (8.7), the speech signal obeys the LPC model of Eq. (8.7) exactly when α_k = a_k; the inverse filter A(z) is then identical to U(z) in the transfer filter. Based on these equations, the LPC analysis minimizes the residual e[n] by adjusting the LP coefficients a_k. There are two ways of estimating the LP coefficients: the covariance method and the autocorrelation method. The autocorrelation method was adopted in the design of the LPC analysis because of its guaranteed stability and faster computation compared with the covariance method. This method of estimating the LP coefficients using



autocorrelation assumes that, for a sequence of data, the error e[n] can be minimized over a finite length by dividing the speech data into finite-length windows. This allows windows such as the Hamming and Kaiser windows to be used to reduce the end data to zero. Furthermore, by limiting the segment, the computation of the LP coefficients a_k can be simplified to the use of autocorrelation functions (Ifeachor E.C & Jervis B.W, 1993).

8.1 LPC Spectrum

The LPC spectrum is the magnitude frequency response of the vocal tract model. Fig. 8.2 is the LPC spectrum of a voiced speech segment extracted from the word 'transform'. The blue line is the FFT of the segmented speech and the red line is its LPC spectrum. The LPC curve is much smoother than the FFT because the speech is generated by feeding the source signal into the transfer function (the LPC curve): since the source is either a train of impulses (voiced) or random noise (unvoiced), the FFT shows the source multiplied into the LPC curve. The smoothness of the LPC spectrum can be adjusted by changing the LPC order p in Eq. (8.7): a lower order gives a smoother spectrum, while a higher order follows the fine structure of the FFT more closely.
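A plot like Fig. 8.2 can be sketched with the Signal Processing Toolbox lpc function, which implements the autocorrelation method, and freqz for the model response (a minimal sketch; the order p = 12 and the FFT length are illustrative):

    % LPC envelope (red) overlaid on the frame's FFT (blue).
    p      = 12;                              % LPC order
    seg    = x(1:frameLen) .* hamming(frameLen);
    [a, g] = lpc(seg, p);                     % A(z) coefficients and error variance
    nfft   = 512;
    [H, f] = freqz(sqrt(g), a, nfft, fs);     % H(z) = G/A(z), Eq. (8.7)
    X      = fft(seg, 2*nfft);
    plot(f, 20*log10(abs(X(1:nfft)) + eps), 'b'); hold on;
    plot(f, 20*log10(abs(H) + eps), 'r'); hold off;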

8.2 LPC 3-Dimension

The LPC spectra of the whole speech can be displayed in a 3-dimensional plot, which shows the magnitude and frequency components of the whole

Fig. 8.2 The LPC (red) and FFT (blue) of a voiced speech signal.


signal. Fig. 8.3 shows the LPC of the whole speech signal in 3 dimensions. The x and y axes are the frequency and time axes respectively; the z axis is the magnitude of the LPC in decibels.

8.3 LPC Z-Plane

The LPC z-plane shows characteristics similar to those of the LPC spectrum. Fig. 8.4 is the LPC z-plane of the voiced speech signal of Fig. 8.2. Each pole (marked by x) corresponds to one term of the filter equation Eq. (8.7). Each pole either lies on the real axis or belongs to a conjugate pair placed symmetrically about the real axis, because the filter coefficients are real; for the LPC model to be stable, all poles must lie inside the unit circle. Each pole corresponds to a local maximum of the LPC spectrum curve, and these local maxima are also known as formants. The closer a pole is to the unit circle, the larger the magnitude of the corresponding formant, and the angle of the pole from the real axis corresponds to the frequency of that local maximum.
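The pole plot and the pole-angle-to-frequency conversion can be sketched as follows (a minimal sketch; a and fs are the LP coefficients and sampling rate from the previous sketch):

    % Poles of H(z) and their corresponding frequencies in Hz.
    rts = roots(a);                            % poles of the LPC model
    zplane([], rts);                           % z-plane plot as in Fig. 8.4
    rts = rts(imag(rts) > 0);                  % keep one pole per conjugate pair
    fPoles = sort(angle(rts) * fs / (2*pi));   % pole angles -> frequencies (Hz)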

Fig. 8.3. The LPC of the speech signal in 3 dimensions.


Fig. 8.4 Z-plane of a voiced speech.


9.0 Cepstrum Analysis

Cepstrum analysis is another method of modeling the vocal tract system. The vocal tract articulation equivalent filter is given by Eq. (9.1):

S(ω) = G(ω)H(ω)    (9.1)

Taking the logarithm of both sides gives

log|S(ω)| = log|G(ω)| + log|H(ω)|    (9.2)

The cepstrum C(τ), or the cepstral coefficients, is the inverse Fourier transform of log|S(ω)| (Ifeachor E.C & Jervis B.W, 1993).

C(τ) = F⁻¹{log|S(ω)|} = F⁻¹{log|G(ω)|} + F⁻¹{log|H(ω)|}    (9.3)

where F denotes the Fourier transform. The first term on the right-hand side of Eq. (9.3), F⁻¹{log|G(ω)|}, gives rise to a peak in the high-quefrency region. Quefrency is the independent parameter of the cepstrum: since the cepstrum is the inverse transform of a frequency-domain function, quefrency becomes a time-domain parameter. The second term, F⁻¹{log|H(ω)|}, is concentrated in the low-quefrency region. The fundamental period (pitch) of the source g(t) can therefore be extracted from the peak in the high-quefrency region, while the Fourier transform of the low-quefrency elements produces the spectral envelope h(t) (the vocal tract). This process of extracting the pitch and the vocal tract using the cepstral method is shown in the block diagram of Fig. 9.1.


9.1 LP Cepstrum Coefficients

The cepstral coefficients C(τ) are defined as the inverse Fourier transform of the logarithmic amplitude spectrum log|S(ω)|. The method adopted here obtains them by converting the LPC coefficients a_k using the recursion of Eq. (9.4) (Ifeachor E.C & Jervis B.W, 1993):

c_1 = a_1;   c_n = a_n + Σ_{k=1}^{n-1} (k/n) c_k a_{n-k},   1 < n ≤ p    (9.4)

This cepstrum is referred to as the LPC cepstrum since it is derived from the LPC model. Fig. 9.2 shows the plot of the LP cepstrum coefficients.
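The recursion of Eq. (9.4) can be sketched in Matlab as follows (a minimal sketch; note that lpc returns the polynomial [1, -a_1, ..., -a_p], so the predictor coefficients must be recovered first):

    % LPC-to-cepstrum recursion, Eq. (9.4).
    ak   = -a(2:end);               % predictor coefficients a_k from lpc()
    p    = length(ak);
    c    = zeros(1, p);
    c(1) = ak(1);
    for n = 2:p
        k = 1:n-1;
        c(n) = ak(n) + sum((k/n) .* c(k) .* ak(n-k));
    end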

Fig. 9.1. Block diagram of cepstrum analysis for extracting the spectral envelope (vocal tract) and the fundamental period (pitch). The sampled sequence is windowed and passed through |DFT|, log and IDFT stages; a cepstral window (liftering) then separates the low-quefrency elements, whose DFT gives the spectral envelope, from the high-quefrency elements, whose peak is extracted as the fundamental period.



9.2 LP Cepstrum Spectrum

Using the LP cepstrum coefficients, the spectral envelope (vocal tract) can be modeled from the first few coefficients (typically 12 to 14). Fig. 9.3 is the LP cepstrum spectrum of the same segment of speech as in Fig. 8.2. The vocal tract model obtained here (also called the FFT cepstrum) is similar to the vocal tract model found by LPC (the LPC cepstrum).
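For comparison, the FFT-cepstrum form of the envelope mentioned above can be sketched by liftering the real cepstrum (a minimal sketch; the FFT length and the 14-coefficient cutoff are illustrative, and seg is a windowed frame as in the earlier sketches):

    % Spectral envelope from the low-quefrency cepstrum (liftering).
    nfft = 512;
    S    = fft(seg, nfft);
    C    = real(ifft(log(abs(S) + eps)));      % real cepstrum, as in Eq. (9.3)
    L    = 14;                                 % number of coefficients kept
    keep = zeros(nfft, 1);
    keep([1:L+1, nfft-L+1:nfft]) = 1;          % symmetric low-quefrency window
    env  = real(fft(C .* keep));               % smoothed log-magnitude spectrum
    fHz  = (0:nfft/2 - 1) * fs / nfft;
    plot(fHz, (20/log(10)) * env(1:nfft/2));   % natural log -> dB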

9.3 LP Cepstrum Spectra 3D

The LP cepstrum spectra can also be displayed in 3 dimensions for the whole speech. Fig. 9.4 is the LP cepstrum spectra 3D plot of the speech waveform 'transform'.

Fig. 9.2 Cepstrum coefficients of the speech segment

Fig. 9.3 The cepstrum spectrum (green) and FFT (blue) of a voiced speech signal.


Fig. 9.4. The LP cepstrum spectrum of the speech signal in 3 dimensions.


10.0 Formants

The resonant frequencies of a speech signal, that is, the bumps in the frequency response curve of the vocal tract (Figs. 8.2 and 9.3), are called formants. They are usually referred to as F1, F2, F3 and so on. There are a variety of methods for finding the formants of a speech signal; the method adopted in this workstation follows (Markel J.D, 1973). The basic idea of the algorithm is to extract the first three local maxima from the LPC spectrum. For a male speaker the formants lie in the range 0-3 kHz; the range used in the workstation is 0-3.5 kHz so that female formants are also included. The first three formants may not all lie within this range, however. The method suggested in (Markel J.D, 1973) is that whenever a frame yields fewer formant candidates than trajectories, each candidate is compared with the previous set of formants and assigned to the nearest previous trajectory (minimum difference), and the remaining unfound formants are assigned their previous values. For example, if at some time the candidates are P_i (i = 1, 2, 3, …), each candidate is compared with the first formant of the previous frame, and the one with the lowest difference becomes the new first formant (new F1 = P_i). The process continues in order to find F2 and F3. If fewer than three candidates are available, the unassigned formants take their previous values (new F1 = old F1). A Matlab sketch of this matching rule is given below. Fig. 10.1 shows a plot of the formants for the speech signal 'Fourier': for voiced speech the formant tracks are smooth and continuous, whereas in unvoiced speech they are randomly distributed.
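A hedged sketch of the nearest-previous-trajectory rule; cand is assumed to hold the candidate formant frequencies (in Hz, e.g. LPC pole angles below 3.5 kHz) for the current frame, and prevF the previous frame's [F1 F2 F3]:

    % Assign candidates to the nearest previous formant trajectory.
    newF = prevF;                              % unfound formants keep old values
    for i = 1:3
        if isempty(cand), break; end
        [~, j] = min(abs(cand - prevF(i)));    % candidate nearest trajectory i
        newF(i) = cand(j);
        cand(j) = [];                          % use each candidate only once
    end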

Fig. 10.1 The first three formants for the speech ‘Fourier’


11.0 Pitch

The cepstrum is the inverse Fourier transform of the logarithmic amplitude spectrum log|S(ω)|, as shown in Eq. (9.3). The low quefrencies can be used to recover the vocal tract, whereas the high quefrencies can be used to recover the fundamental period (pitch). The pitch is found by locating the largest cepstrum coefficient in the high-quefrency region. For male and female voices the pitch was found in testing to lie in the range 70 Hz to 250 Hz, so the maximum cepstrum coefficient within this range is sought. If this coefficient has a magnitude over 0.07, the corresponding frequency is taken as the pitch. Fig. 11.1 shows the pitch of the speech signal 'Fourier'; it can be seen that no pitch occurs in the unvoiced and silent areas.
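The per-frame estimate can be sketched as follows (a minimal sketch using the 70-250 Hz range and 0.07 threshold quoted above; seg and nfft are as in the earlier sketches, and C(1) sits at quefrency zero):

    % Cepstral pitch estimate for one frame.
    C = real(ifft(log(abs(fft(seg, nfft)) + eps)));
    lagMin = floor(fs/250);                    % shortest pitch period (samples)
    lagMax = ceil(fs/70);                      % longest pitch period (samples)
    [pk, i] = max(C(lagMin+1 : lagMax+1));     % largest high-quefrency peak
    if pk > 0.07
        pitch = fs / (lagMin + i - 1);         % quefrency lag -> Hz
    else
        pitch = 0;                             % unvoiced or silent frame
    end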

Fig. 11.1 The Pitch for the speech signal ‘Fourier’


12.0 Vocal Tract Modeling

The vocal tract (see Fig. 8.1) can be approximated as a series of acoustic tubes with different cross-section areas (Fig. 12.1). This model is widely used to demonstrate the movement of the vocal tract during the production of speech.

The ratio between two adjacent cross-section areas is determined by the reflection coefficient, as given by the following formula (Ifeachor E.C & Jervis B.W, 1993):

A_{k-1}/A_k = (1 + r_k) / (1 - r_k)    (12.1)

If the first cross-section area is assumed to be 1, the first reflection coefficient gives the second cross-section area, and each subsequent area follows in turn; the number of coefficients therefore equals the number of cross-section areas. Matlab VoiceBox provides two functions to convert the LPC coefficients to reflection coefficients, so the order of the LPC determines the number of tubes. Once every cross-section area is found, the vocal tract model is rendered as a 3-dimensional cylindrical shape. This model is a very useful tool for displaying the shape of the vocal tract for various speech waveforms. Fig. 12.2 shows the toolbox displaying the vocal tract movement of the continuous speech waveform 'AH'.
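A hedged sketch of the area computation, using the Signal Processing Toolbox poly2rc to convert the LP coefficients to reflection coefficients (VoiceBox provides equivalent functions) and then applying Eq. (12.1) tube by tube:

    % Tube cross-section areas from LP coefficients via Eq. (12.1).
    rc = poly2rc(a);                   % reflection coefficients of A(z)
    A  = ones(length(rc) + 1, 1);      % first cross-section area set to 1
    for k = 1:length(rc)
        A(k+1) = A(k) * (1 - rc(k)) / (1 + rc(k));
    end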

Fig. 12.1. The vocal tract modelled as a series of acoustic tubes with cross-section areas A1-A6, running from the glottis (vocal cords) to the lips.


Fig. 12.2 The vocal tract model for the continuous speech waveform ‘AH’.


13.0 Speech Clippings

Center clipping and hard clipping are two functions commonly used in speech processing. Two attributes, frequency and amplitude, are often discussed for their importance in speech signals, and the center and hard clipping functions are introduced to identify the significance of these attributes. The two functions are accessed via the buttons in the bottom-right frame. The user first specifies the threshold voltages (boundaries) for the clipping by toggling the enable button 'on' and then using the slider bar or the text area; once the boundaries are satisfactory, the user chooses the clipping type and presses clip to perform the clipping.

13.1 Center Clipping

Center clipping is a method which identifies the importance of frequencies in a speech signal; it reveals the relationship between zero crossings and the speech characteristics. Our workstation toolbox includes two kinds of center clipping: shifted and un-shifted. Center clipping uses two threshold voltages, specified by the user, which determine the boundaries of the clipping. Samples between the boundaries are set to zero, whereas samples outside the boundaries are either shifted towards zero by the threshold value (shifted) or left untouched (un-shifted). Center clipping generates some unwanted noise, but the noise is insignificant when it comes to recognising the words and the voice of the speaker, and the envelope of the speech is retained. Fig. 13.1 shows the speech signal 'transform' center clipped at 0.2 V, shifted and un-shifted.
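Both centre-clipping variants can be sketched in a few lines (a minimal sketch assuming a symmetric ±T boundary):

    % Center clipping at threshold T.
    T = 0.2;
    yUnshifted = x .* (abs(x) > T);               % zero out samples inside +/-T
    yShifted   = sign(x) .* max(abs(x) - T, 0);   % and shift the rest towards zero by T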


13.2 Hard Clipping

Hard clipping is used to analyse the importance of amplitude in a speech signal. Two kinds of hard clipping are included: stretched and un-stretched. Hard clipping also uses two threshold voltages: the speech data between the levels is left unmodified, whereas the data beyond the voltage levels is either held at the threshold (un-stretched) or set to +1/-1 (stretched). Fig. 13.2 shows the speech signal 'transform' hard clipped at 0.2 V, stretched and un-stretched.
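Both hard-clipping variants can be sketched similarly (a minimal sketch with a symmetric ±T boundary):

    % Hard clipping at threshold T.
    yUnstretched = max(min(x, T), -T);    % saturate at the threshold
    yStretched   = x;
    yStretched(x >  T) =  1;              % clipped samples pushed to +/-1
    yStretched(x < -T) = -1;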

Fig. 13.1 Top - un-shifted center clipping. Bottom - shifted center clipping.


Hard clipping also introduces some noise. Stretched hard clipping makes the waveform louder, while un-stretched hard clipping makes it quieter. Un-stretched hard clipping is also a good method for analysing the importance of zero crossings: if the threshold voltages are set to a low value, e.g. ±0.1 V, the clipped result contains little more than the zero-crossing information, yet when this waveform is played the words of the speech can still be recognised.

Fig 13.2. Top - Un-stretched Hard Clipping. Bottom - Stretched Hard Clipping.


14.0 Envelope

The envelope of a signal, also called the modulating signal, is the outer shape of the signal. Fig. 14.1 shows the envelope of the speech 'transform'.

The speech signal is first divided into frames of 256 samples, without windowing or overlapping. In each frame a simple loop finds the maximum and minimum values: the maxima of successive frames are linked together to form the upper envelope curve, and likewise the minima are linked to produce the lower envelope curve. This analysis is performed by choosing 'envelope' under the analytic drop box.
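This procedure translates directly into Matlab (a minimal sketch of the method just described):

    % Upper and lower envelope from non-overlapping 256-sample frames.
    N  = 256;
    nF = floor(length(x) / N);
    upperEnv = zeros(nF, 1);
    lowerEnv = zeros(nF, 1);
    for m = 1:nF
        seg = x((m-1)*N + (1:N));
        upperEnv(m) = max(seg);
        lowerEnv(m) = min(seg);
    end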

Fig. 14.1 Envelope of the speech signal ‘transform’


15.0 Speech Reconstruction using Envelope and Hard Clipping

In many communications developments the size of the transmitted signal is critical: it is essential to reduce the data sent over the transmission link as much as possible while retaining the necessary information. Our workstation therefore implements a toolbox for speech reconstruction from the envelope and the hard-clipped signal. Fig. 15.1 shows the speech reconstruction toolbox. The user first loads the speech into the toolbox, and its time-domain waveform is displayed together with the corresponding envelope. The user can then adjust the clipping threshold level using the hard-clip slider bar. Reconstruction occurs when the user presses the 'reconstruction' button, and both the original and the reconstructed waveforms can be played and listened to.

The envelope and hard-clipping processes used in the reconstruction are the same as those used in envelope detection and stretched hard clipping, and the two results are multiplied together to form the reconstruction. However, a problem occurs

Fig. 15.1. Toolbox for speech reconstruction using envelope and hard clipping


at low amplitudes. When the multiplication is applied to the low-amplitude parts of the speech, the product of the envelope and the hard-clipped signal has an even lower amplitude. This problem is critical for unvoiced speech sounds, whose voltage amplitudes are relatively low. To maintain these areas during the reconstruction, the multiplication is modified: it is applied only where the hard-clipped data is at ±1 in amplitude. That is, data which was not clipped remains as it is, while data which was clipped is multiplied by the envelope.
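A hedged sketch of this modified rule (env is a hypothetical per-sample envelope magnitude and T the hard-clip threshold; the report does not give the exact implementation):

    % Envelope x hard-clip reconstruction, preserving unclipped samples.
    c = x;  c(x > T) = 1;  c(x < -T) = -1;     % stretched hard clip
    clipped = abs(x) > T;
    y = x;                                     % unclipped samples pass through
    y(clipped) = env(clipped) .* c(clipped);   % clipped samples are rebuilt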


16.0 Multi Level Crossings

Multi-level crossings are similar to zero crossings, except that the threshold can be adjusted to 5 different levels. A sub-toolbox was constructed in the workstation for finding the multi-level crossings (Fig. 16.1). The user loads the signal, which is displayed at the top of the toolbox, and then adjusts the 5 threshold levels using the 5 slider bars. Once the 5 threshold voltages are chosen, the calculation starts when the user presses the 'start' button.

The hard-clipping tool showed that the words of a speech signal can be extracted using the zero crossings, and the reconstruction from envelope and clipped signal showed that the voice of the speaker can also be identified. As a research tool, if crossings are collected at multiple levels, the original signal could possibly be reconstructed from these crossings. The toolbox therefore includes a button with which the user can save the crossings data to a text file.
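Counting the crossings of one threshold level can be sketched as follows (a minimal sketch; v stands for one of the five user-set levels):

    % Sample indices at which the signal crosses level v.
    v   = 0.1;                                  % one of the 5 threshold levels
    s   = sign(x - v);                          % sample position relative to v
    idx = find(s(1:end-1) .* s(2:end) < 0);     % sign change = one crossing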

Fig. 16.1. Toolbox for finding multi-level crossings


17.0 Filter Design

A filter is essentially a system or network that selectively changes the wave shape of a signal, in its amplitude-frequency and/or phase-frequency characteristics. Most commonly, the objective of filtering is to improve the quality of a signal by reducing or removing noise; filtering also allows relevant information to be extracted, or a signal to be separated into two or more signals. These are common objectives in the efficient use of communication transmission. In the workstation, a digital filter toolbox was implemented to perform the 4 standard kinds of filtering: low pass, high pass, band pass and band stop. The characteristics of digital filters are usually discussed in the frequency domain, and the specification for each filter usually takes the form of a tolerance scheme. Fig. 17.1 illustrates the scheme for a low pass filter.

Fig. 17.1. Tolerance scheme for a low pass filter

In the low pass tolerance scheme, the magnitude response has a peak deviation of Apass/2 in the pass band and a maximum deviation of σs in the stop band. The width of the transition band determines how sharp the filter is. The magnitude response decreases monotonically from the end of the pass band (Fpass) to the start of the stop band (Fstop). The schemes for the high pass, band pass and band stop filters, shown in Fig. 17.2, are similar in character to the low pass scheme.



Fig. 17.2. Tolerance schemes for, top: a high pass filter; middle: a band pass filter; bottom: a band stop filter.

In the workstation toolbox, the user accesses the filter design by choosing the 'filter design' option under the analytical dropdown menu. A sub-toolbox then opens in which the filter design takes place, as shown in Fig. 17.3. The user begins by loading the speech waveform into the toolbox by pressing 'load'; this original waveform can be heard via the 'play' button.



In the filter design, the user may choose any of the 4 filter types from the dropdown menu: low pass, high pass, band pass or band stop. The user must also specify parameters such as the cutoff frequency Fpass, the transition band width, the pass band ripple and the stop band ripple. When the parameters are set, the user performs the filtering by pressing 'apply'. The resulting waveform is displayed and can be heard by pressing the 'play' button.
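An equivalent band pass design matching the example of Fig. 17.3 can be sketched with the Signal Processing Toolbox fir1 function (the report does not state the workstation's exact design method, and the filter order here is purely illustrative):

    % FIR band pass filter, 1000-2500 Hz pass band.
    edges = [1000 2500] / (fs/2);        % band edges normalised to Nyquist
    b = fir1(200, edges, 'bandpass');    % order 200 chosen for illustration
    y = filter(b, 1, x);                 % filtered speech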

Fig. 17.3. Filter design workstation. The speech 'transform' is band pass filtered with cutoff frequencies of 1000 Hz and 2500 Hz and a transition width of 100 Hz.


18.0 Bark Scale Filter Banks

The Bark scale comprises the first 24 critical bands of human hearing, ranging from 1 to 24 Bark. The published band edges, in Hertz, are [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500], and the published band centres are [50, 150, 250, 350, 450, 570, 700, 840, 1000, 1170, 1370, 1600, 1850, 2150, 2500, 2900, 3400, 4000, 4800, 5800, 7000, 8500, 10500, 13500] (Abel J.S & Smith J.O, 1999). The scale therefore covers sampling rates of up to 31 kHz. The purpose of the filter banks is to show the importance of each Bark-scale frequency range and the information the human ear obtains in each range. The filter bank is built as a simple all-pass filter whose magnitude gain can be adjusted in each Bark band. The implemented toolbox is a great research and educational tool for understanding the ear's critical Bark bands. Fig. 18.1 shows the toolbox for designing the filter banks: the user first inputs the speech signal into the toolbox, then uses the dropdown menu to select the Bark bands to be changed.

Fig. 18.1, Bark scale filter bank toolbox.


The centre two graphs show the frequency response of a single band pass filter and the output of the speech wave when passed through that filter. The bottom two graphs describe the filter bank: the composite frequency response shows the gain in each band of the filter bank. The user can adjust the gain with the slider bar or by entering the gain in dB in the text box; the accumulated gain adjustments are listed in the dialog on the top right. To apply the filter bank to the input speech, the user presses the 'apply filter bank' button. The original signal is then filtered with the composite frequency response of the filter bank, and the output is displayed on the bottom right of the toolbox. This resultant waveform can also be played back.
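As an illustration of the processing behind the 'apply filter bank' button, the following Matlab sketch builds one band pass filter per Bark band and sums the gain-weighted band outputs, approximating the composite frequency response described above. It assumes a speech vector x at sampling rate Fs; the FIR order of 256 and the example gain value are arbitrary choices for illustration, not the workstation's actual settings.

% Bark-scale filter bank sketch: per-band filtering with adjustable gains
edges = [0 100 200 300 400 510 630 770 920 1080 1270 1480 1720 2000 ...
         2320 2700 3150 3700 4400 5300 6400 7700 9500 12000 15500];
gain_dB = zeros(1, 24);          % one adjustable gain per Bark band
gain_dB(5) = -6;                 % e.g. attenuate band 5 (400-510 Hz)

y = zeros(size(x));
for k = 1:24
    lo = max(edges(k), 1);               % avoid a 0 Hz band edge
    hi = min(edges(k+1), Fs/2 - 1);      % stay below the Nyquist rate
    if lo >= hi, continue, end           % skip bands above Fs/2
    b = fir1(256, [lo hi]/(Fs/2));       % linear-phase band pass filter
    y = y + 10^(gain_dB(k)/20) * filter(b, 1, x);   % weighted band output
end
sound(y, Fs);                            % listen to the re-weighted speech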


19.0 Re-Sampling Speech Signal

In communications and digital processing, errors are inherent in the digital-analogue-digital conversion process. These include quantization and aliasing errors, which degrade the signal, so it is best to perform as much of the processing as possible in the digital domain. Up and down sampling are efficient techniques for changing the sampling frequency of a signal digitally; their main attraction is that they allow the strengths of conventional DSP to be exploited (Ifeachor E.C. & Jervis B.W., 1993). Decimation and interpolation are the fundamental operations of multi-rate signal processing, and they allow the sampling frequency to be decreased or increased without significant undesirable errors such as quantization and aliasing.

19.1 Down Sampling

Reducing the sampling rate is also called down sampling. Down sampling is achieved by discarding M-1 out of every M samples of a filtered signal, where M is the decimation factor. Before any samples are discarded, the original signal must first be passed through an anti-aliasing filter. The idea is to prevent aliasing by band limiting the original signal to less than Fs/2M (where Fs is the original sampling frequency). Down sampling can be performed with the 'decimate' function in Matlab, which includes both the anti-alias filtering and the down sampling. The input-output relationship of the decimation process is shown in the block diagram of Fig. 19.1.

Fig. 19.1. Block diagram of decimation by a factor of M.

Fig. 19.2 shows a speech signal down sampled by a factor of 4. The resultant signal has lost most of the unvoiced speech. The number of samples in the speech signal has decreased by a factor of 4, and the sampling rate has become 1/4 of the original sampling rate.

Fig. 19.2. The speech signal 'transform' down sampled by a factor of 4.
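A minimal Matlab sketch of this operation, assuming a speech vector x at sampling rate Fs:

M   = 4;                  % decimation factor
xd  = decimate(x, M);     % anti-alias low pass filter, then keep every Mth sample
Fsd = Fs / M;             % new, lower sampling rate
sound(xd, Fsd);           % listen to the down sampled speech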


19.2 Up Sampling

Increasing the sampling rate is also called up sampling, and it is performed by interpolation. The signal is first expanded so that L-1 samples are inserted after each sample (where L is the up sampling factor). The signal is then low pass filtered to remove the image frequencies created. Matlab's internal function 'interp' performs both the sample insertion and the filtering. The process is represented by the block diagram of Fig. 19.3.

Fig. 19.3. Block diagram of interpolation by a factor of L
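A corresponding Matlab sketch for up sampling, again assuming a speech vector x at rate Fs:

L   = 4;                  % interpolation factor
xu  = interp(x, L);       % insert L-1 samples per input sample, then low pass filter
Fsu = Fs * L;             % new, higher sampling rate
sound(xu, Fsu);           % listen to the up sampled speech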



Fig. 19.4 shows a speech signal up sampled by a factor of 4. The number of samples in the speech signal has increased by a factor of 4, and the sampling rate has become 4 times the original sampling rate. The resultant signal suffers no loss of information; the underlying signal is unchanged.

Fig. 19.4. The speech signal 'transform' up sampled by a factor of 4.


20.0 Autocorrelation

Autocorrelation is a measure of similarity. A speech signal typically comprises voiced and unvoiced regions; unvoiced sounds include /t/, /f/, /s/ etc. If a voiced region is zoomed in closely, we find that it is quasi-periodic, and autocorrelation can be used to analyse it. When the speech is periodic, the autocorrelation function produces a graph that is smooth and widely spread, whereas the autocorrelation function of non-periodic speech is narrower and rougher (Fig. 20.1).

Autocorrelation is a powerful technique and a useful building block for other signal processing tasks, such as calculating the pitch and finding voiced and unvoiced regions. Autocorrelation is identical to the cross-correlation of a signal with itself; therefore, in our workstation, autocorrelation is performed using the Matlab function 'xcorr' (cross-correlation). The resultant autocorrelation function is normalised to one at zero lag, which allows comparison between different autocorrelation functions.
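A short sketch of this computation for one speech frame (frame is assumed to be a vector of samples from either a voiced or an unvoiced region):

r = xcorr(frame, 'coeff');       % autocorrelation, normalised to 1 at zero lag
r = r(length(frame):end);        % keep the non-negative lags only
plot(r);                         % smooth and wide for voiced, narrow for unvoiced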

Fig. 20.1: Left – autocorrelation function of an unvoiced speech segment. Right – autocorrelation function of a voiced speech segment.


21.0 Determining Voiced and Unvoiced Speech

Speech sounds can be classified into two distinct classes according to their mode of excitation. Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract (Rabiner L.R. & Schafer R.W., 1987). Unvoiced sounds, also called fricatives, are generated by forming a constriction at some point in the vocal tract (usually toward the mouth end) and forcing air through the constriction at high enough velocity to produce turbulence. This creates a broad-spectrum noise source that excites the vocal tract (Rabiner L.R. & Schafer R.W., 1987).

The algorithm for finding voiced and unvoiced speech is taken from Rabiner L.R. & Schafer R.W. (1987). The speech signal is first segmented into 30 ms frames with 15 ms of overlap between adjacent frames. To detect silence, the peak signal level in each frame is compared with a threshold determined by measuring the peak level of 50 ms of background noise; a few tests set this value at 0.003 V. If the peak level in the frame is below the threshold, the frame is classified as silence and no further action is taken. If it is above the threshold, the frame contains speech.

Next, to classify the speech as voiced or unvoiced, a clipping level is needed. This level is a fixed percentage (75%) of the minimum of the maximum absolute values in the first and last 100 samples of the frame. Using this clipping level, the frame is processed by a centre clipper and the autocorrelation function is computed over the expected range of pitch periods. The largest peak of the autocorrelation function is located and compared with a fixed threshold (0.35). If the peak falls below the threshold, the frame is classed as unvoiced; if it is above, the frame is voiced and the pitch period is defined as the location of the largest peak.

The resultant classification depends heavily on the frame size and the thresholds, so the raw outcome can be unsmooth. We have therefore implemented a smoothing algorithm on the classification results. Any pattern Voiced->Unvoiced->Voiced in which the unvoiced run is shorter than two frame lengths is changed so that the unvoiced segment becomes voiced; the same rule applies to other patterns such as Unvoiced->Voiced->Unvoiced and Silence->Unvoiced->Silence. Where the very beginning of the speech is classified as Silence->Silence->Voiced->Unvoiced, the voiced segment is changed to unvoiced; this also applies to S->S->U->V, and at the end of the signal to U->V->S->S and V->U->S->S. Fig. 21.1 shows the speech signal 'transform' classified into silence, voiced and unvoiced regions.
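The following Matlab sketch reconstructs the per-frame decision described above. The silence and voicing thresholds (0.003 and 0.35) are the values quoted in the text; the pitch search range of 60-400 Hz is an assumption introduced here for illustration, as the report does not state the exact range used. frame is a 30 ms speech frame (at least 200 samples) and Fs the sampling rate.

silenceThr = 0.003;        % peak level threshold for silence (from the text)
voicedThr  = 0.35;         % normalised autocorrelation threshold (from the text)

if max(abs(frame)) < silenceThr
    label = 'silence';
else
    % Clipping level: 75% of the smaller of the peak levels in the
    % first and last 100 samples of the frame
    cl = 0.75 * min(max(abs(frame(1:100))), max(abs(frame(end-99:end))));

    % Centre clipper: zero inside [-cl, cl], shifted towards zero outside
    clipped = (frame - cl) .* (frame > cl) + (frame + cl) .* (frame < -cl);

    r = xcorr(clipped, 'coeff');
    r = r(length(frame):end);                  % non-negative lags only
    lags = round(Fs/400):round(Fs/60);         % assumed pitch range, 60-400 Hz
    [pk, idx] = max(r(lags));
    if pk > voicedThr
        label = 'voiced';
        pitchPeriod = lags(1) + idx - 1;       % pitch period in samples
    else
        label = 'unvoiced';
    end
end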


Fig. 21.1. The speech signal 'transform' classified into silence, voiced and unvoiced regions.


22.0 Mean

The mean is a useful tool for filtering out many of the high frequencies. In the workstation, the mean value of a time series x(n) over a frame of N samples is given by Eq. (22.1):

\bar{x} = \frac{1}{N} \sum_{n=0}^{N-1} x(n)    (22.1)

The result of the equation is not meaningful if x(n) spans the full speech signal, so a short-time calculation must be used instead. The full speech signal is divided into frames whose length can be defined by the user; the user can also select the overlap percentage and the window type. Fig. 22.1 shows the short-time mean of the speech 'transform'.

Most of the unvoiced speech has disappeared and only the lower-frequency speech remains. From this result we can state that silence and unvoiced speech comprise largely random data whose mean is close to zero, while voiced speech produces a non-zero mean value.
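A minimal sketch of the short-time calculation, assuming a speech vector x, a rectangular window and illustrative values for the frame length and overlap (both user-selectable in the workstation):

N   = 240;                             % frame length, e.g. 30 ms at 8 kHz
hop = N/2;                             % 50% overlap between frames
nFrames = floor((length(x) - N)/hop) + 1;
m = zeros(nFrames, 1);
for k = 1:nFrames
    seg  = x((k-1)*hop + (1:N));       % k-th frame
    m(k) = mean(seg);                  % Eq. (22.1) for this frame
end
plot(m);                               % near zero for silence/unvoiced frames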


Fig. 22.1. The speech signal 'transform' and the short-time mean of the speech.


23.0 Histogram

The histogram is a statistical tool that is useful for determining the distribution of sample values in a speech signal. The user can enter the number of bins into which the histogram is divided; this determines the number of intervals between -1 and +1. Fig. 23.1 shows the histogram of the speech waveform 'transform' distributed into 50 bins.
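A sketch of the underlying Matlab call, with the bin centres spread across the [-1, 1] amplitude range (nbins stands for the user-selected bin count):

nbins   = 50;
centers = linspace(-1, 1, nbins);   % bin centres across the amplitude range
hist(x, centers);                   % plot the distribution of sample values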

Fig. 23.1. Histogram of the speech signal ‘transform’


24.0 Vertical and Horizontal Flipping of a Speech Signal

24.1 Vertical Flip

Flipping a signal vertically is identical to shifting its phase by 180 degrees. This tool demonstrates the phase sensitivity of the human ear when the resultant signal is played. To activate the tool, the user selects the area of the speech to be flipped; the flip takes place when the user presses the 'vertical flip' button located on the top right of the workstation. The resultant speech sounds no different from the un-flipped speech, indicating that human ears are insensitive to phase variation.

24.2 Horizontal Flip

This button flips the selected part of the speech signal horizontally: the speech samples are reversed in time, so that the first sample is relocated to the end. This demonstration tests what type of speech information is retained after the flip. To activate the tool, the user selects the area of the speech to be flipped; the flip takes place when the user presses the 'horizontal flip' button. When the resultant sound is played, the words of the speech cannot be recognised, but the voice of the speaker is still retained.
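Each flip reduces to one line of Matlab. In the sketch below, i1 and i2 stand for the sample indices of the user-selected region (names introduced here for illustration), and x is assumed to be a column vector of speech samples at rate Fs:

xv = x;
xv(i1:i2) = -x(i1:i2);            % vertical flip: polarity inversion (180 degree phase shift)

xh = x;
xh(i1:i2) = flipud(x(i1:i2));     % horizontal flip: reverse the selection in time

sound(xv, Fs);                    % sounds identical to the original
sound(xh, Fs);                    % words lost, speaker identity retained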


25.0 Future Expectations

Although the speech processing workstation was fully developed, it could be greatly improved by adding more modelling toolboxes. The workstation implements only a vocal tract simulation, yet speech production does not consist of the vocal tract alone: the vocal cords are also important speech production organs, and their vibration would be a valuable feature to investigate.


26.0 Conclusions

The speech processing workstation can be a very useful tool for researchers working in speech processing areas, because such researchers often need to run speech processing algorithms to find the characteristics of a speech signal and would otherwise have to write pages of software code to perform them. The workstation is also a valuable educational tool for students studying speech signals: students benefit greatly when they can learn speech processing techniques in both visual and acoustic form.

The speech workstation was developed successfully. It implements 28 well known speech processing techniques in one software program. Each tool allows the user to modify the algorithm parameters and provides visual and acoustic output. The 28 techniques were discussed in this report in terms of their relevance to characterising speech; in addition, their fundamental concepts and basic methods of computation were also discussed. However, it was found that the workstation could be greatly improved by implementing more modelling tools for the speech production organs.

Finally, a simulation showing how to use the speech processing workstation is included on a CD-ROM in the appendix. The simulation demonstrates how the workstation interface is used, and shows where each processing tool is located and how it is accessed.


27.0 Bibliography

Abel J.S. & Smith J.O. (1999). The Bark and ERB Bilinear Transforms. IEEE Transactions on Speech and Audio Processing, December.
Ifeachor E.C. & Jervis B.W. (1993). Digital Signal Processing: A Practical Approach. Addison-Wesley Publishers Ltd, USA.
Markel J.D. (1973). Application of a Digital Inverse Filter for Automatic Formant and F0 Analysis. IEEE Transactions on Audio and Electroacoustics, Vol. AU-21, No. 3.
Rabiner L.R. & Schafer R.W. (1987). Digital Processing of Speech Signals. Bell Laboratories, USA.




28.0 Appendix

The accompanying CD-ROM contains a simulation of using the speech processing workstation.