
Page 1: Speech and Audio Processing and Coding

1

Speech and Audio Processing and Coding

Dr Wenwu Wang

Centre for Vision Speech and Signal Processing

Department of Electronic Engineering

[email protected]

http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html

Page 2: Speech and Audio Processing and Coding

2

Course components

• Components

• Speech Processing: Wk 1-3, by Wenwu Wang

• Speech Coding: Wk 4-6, by Ahmet Kondoz

• Audio: Wk 8-9, by Ahmet Kondoz; Wk 10-11, by Wenwu Wang

• Assessment

• 15% coursework + 85% exam

• 6 exam questions, 2 for each of the above components; you only need to answer 3, one from each of the three components

• 1 coursework in speech processing

Page 3: Speech and Audio Processing and Coding

3

Outline of speech analysis

• Introduction & applications

• Speech production and modelling

• Speech perception

• Signal processing techniques

• Autocorrelation, Fourier transform of speech, spectral properties, convolution, periodicity estimation

• Linear prediction and inverse filtering of speech

• Transfer function, linear prediction, estimation of linear prediction coefficients, LPC order, inverse filtering, prediction gain

• Cepstral deconvolution

• Real cepstrum, complex cepstrum, quefrency, pitch estimation via cepstrum, comparison of the spectrum envelope obtained by cepstrum with that obtained by LPC

Page 4: Speech and Audio Processing and Coding

4

Alternative textbooks to speech analysis and audio perception

• Digital Processing of Speech Signals, by Lawrence R. Rabiner & Ronald W. Schafer

• Signals and Systems, by Alan V. Oppenheim & Alan S. Willsky

• An Introduction to the Psychology of Hearing, by Brian C. J. Moore

• Acoustics and Psychoacoustics, by David M. Howard & James Angus

Page 5: Speech and Audio Processing and Coding

5

Introduction & applications

• What is speech & language?

• Speech: the act of speaking; the natural exercise of the vocal organs; the utterance of words and sentences; oral expression of thought or feeling (Oxford English Dictionary)

• Language: the whole body of words and of methods of combination of words used by a nation, people, or race; words and the methods of combining them for the expression of thought (Oxford English Dictionary)

• Difference: speech is communicated orally (i.e. by mouth), while language consists of rules for combining words to convey meaning, which can be communicated through non-oral mechanisms such as writing, hand signals, pictures, Morse code, etc.

• Connections: speech is the oral communication of meaningful information through the rules of a specific language.

Page 6: Speech and Audio Processing and Coding

6

Introduction & applications (cont.)

• From the viewpoint of speech processing/communication engineers

• Speech is regarded as a sequence of small speech-like units, e.g. words or phonemes.

• Often the main objective is to enhance, encode, communicate, or recognise these units from real speech, or to synthesise them from text.

• Our focus here will be on the physical attributes of speech rather than the semantic level, as retrieval of semantic content from speech is a much more difficult and largely unsolved problem.

• A speech communication system should generally allow the effective transmission of speech in any language. Whilst rules of language and vocabularies differ greatly, the physical attributes of speech have much more in common between different languages.

• Transmission of semantic information between speakers assumes huge amounts of prior information. For example, "it's raining cats and dogs" is talking about the weather rather than animals, but this is not obvious to computers.

• We will consider speech as the oral production of sound by natural exercising of the vocal organs, resulting in a sequence of different phonemes or words.

Page 7: Speech and Audio Processing and Coding

7

Introduction & applications (cont.)

• Applications of speech processing

• Speech telecommunications and encoding

• Preservation of the message content and perceptual quality of the transmitted speech

• Minimising the bandwidth required for speech transmission

• Speech enhancement

• Deals with the restoration of degraded speech caused by, e.g., additive noise, reverberation, echoes, interfering speech or background sounds (the cocktail party effect).

• Techniques include adaptive filtering, spectral subtraction, Wiener filtering, harmonic selection, blind source separation, and computational auditory scene analysis (see the sketch below).
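One of these techniques, spectral subtraction, can be sketched in a few lines. The following Python example (NumPy/SciPy) is a minimal illustration, not part of the original course material; the frame length, over-subtraction factor, spectral floor, and leading-noise assumption are all illustrative choices.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_seconds=0.25, alpha=2.0, floor=0.05):
    """Basic magnitude spectral subtraction.

    Assumes the first `noise_seconds` of the recording are noise-only,
    and uses them to estimate the noise magnitude spectrum.
    """
    f, t, X = stft(noisy, fs=fs, nperseg=512, noverlap=384)
    mag, phase = np.abs(X), np.angle(X)

    # Estimate the noise magnitude spectrum from the leading noise-only frames.
    hop = 512 - 384
    n_noise_frames = max(1, int(noise_seconds * fs / hop))
    noise_mag = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)

    # Subtract the (over-estimated) noise spectrum and apply a spectral floor.
    clean_mag = np.maximum(mag - alpha * noise_mag, floor * noise_mag)

    # Resynthesise using the noisy phase.
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs,
                        nperseg=512, noverlap=384)
    return enhanced
```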

Page 8: Speech and Audio Processing and Coding

A Real World Application

Scenario: the cocktail party problem

[Figure: Speaker1 produces the source signal s1(t) and Speaker2 produces s2(t); Microphone1 and Microphone2 capture the mixture signals x1(t) and x2(t).]
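To make the scenario concrete, here is a minimal sketch (Python/NumPy, my own illustration rather than course material) of the simplest instantaneous two-source, two-microphone mixing model x = A s. Real cocktail-party recordings, as in the later examples, are convolutive mixtures, but the basic idea is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, duration = 16000, 2.0
t = np.arange(int(fs * duration)) / fs

# Two toy "sources": a quasi-periodic tone and a noise burst (stand-ins for speech).
s1 = np.sin(2 * np.pi * 220 * t)
s2 = rng.standard_normal(t.size) * (t > 1.0)
S = np.vstack([s1, s2])            # sources, shape (2, N)

# Instantaneous mixing: each microphone records a weighted sum of the sources.
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])         # illustrative mixing matrix
X = A @ S                          # mixtures x1(t), x2(t), shape (2, N)

# Blind source separation tries to recover S from X without knowing A.
print(X.shape)
```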

Page 9: Speech and Audio Processing and Coding

Underdetermined Blind Source Separation

[Audio demo: example sources, their mixtures, and the estimated sources.]

Page 10: Speech and Audio Processing and Coding

Blind Separation of Reverberant Speech – Mixtures with Different Reverberation Times

[Demo: speaker mixtures at RT60 = 30 ms, 150 ms, and 400 ms, each processed by convolutive ICA (ConvICA), followed by an estimated ideal binary mask (IBM) and a smoothed IBM.]

Page 11: Speech and Audio Processing and Coding

Blind Separation of Real Recordings – Male Speech with TV on

[Demo: sensor signals and the source signals separated by convolutive ICA, convolutive ICA + IBM, and convolutive ICA + IBM + cepstral smoothing.]

Page 12: Speech and Audio Processing and Coding

12

Introduction & applications (cont.)

Applications in speech processing

• Speech and speaker recognition

• Speech recognition involves the automatic conversion of the message content of the speech signal into a written format. Its performance depends highly on signal quality, speaker identity, language, the size of the word dictionary, etc.

• Speaker recognition involves the identification of the speaker based on the statistical properties of the speech signal, which can be thought of as biometric data.

• Applications include voice interaction or dictation with computers, speaker verification for security, automated telephony, spoken language translation, linguistic studies, etc.

• Speaker diarisation

• It studies the question of "who speaks and when".

• It is usually done through speaker segmentation (finding the speaker change points in an audio stream) and speaker clustering (grouping segments together based on speaker identity or characteristics).

Page 13: Speech and Audio Processing and Coding

13

Introduction & applications (cont.)

• Applications in speech processing

• Speech synthesis

• It refers to the artificial production of speech from text.

• Applications include speech-driven interfaces with computers, automated telephony, general announcements in train stations, airports, etc.

• Quality measures include naturalness (how much the output sounds like the speech of a real person) and intelligibility (how easily the output can be understood).

• Speech analysis

• Applications include user feedback during language learning, speech therapy, linguistic studies, etc.

Page 14: Speech and Audio Processing and Coding

14

Linguistics

• A note on linguistics

• Linguistics is the scientific study of human language. Linguistic structures are pairings of meaning and sound. Various sub-parts of linguistic structure exist, which can be arranged hierarchically from sound to meaning (with overlap):

• Phonetics: the study of the sounds of human language.

• Phonology: the study of the distinctive sound units within a given language and of commonalities across languages.

• Morphology: the study of the internal structure of words.

• Syntax: the study of how words combine to form phrases and sentences.

• Semantics: the study of the meaning of words and sentences.

• Pragmatics: the study of how utterances are used in communicative acts.

• Discourse analysis: the study of how sentences are organised into texts.

• Phoneme

• The smallest unit of speech in a language that distinguishes one word from another.

• Continuous speech can be thought of as a concatenation of different phonemes. In English, phonemes are split into consonants (either voiced or unvoiced) and vowels (all voiced).

Page 15: Speech and Audio Processing and Coding

15

Speech Production

• Many problems in speech processing can be aided by a basic understanding of the speech production mechanism and the anatomy of the vocal tract.

• Common speech coding schemes such as linear predictive coding (LPC) provide a fairly crude model of the speech production process.

• More complex physical models of the speech production process, e.g. taking into account the precise shape and absorption characteristics of the vocal tract, have resulted in more natural sounding speech synthesisers.

Page 16: Speech and Audio Processing and Coding

16

Speech Production (cont.)

• Vocal organs

• Vocal tract: begins at the vocal cords, or glottis, and ends at the lips.

• Nasal tract: extends between the velum and the nostrils.

• Lungs, diaphragm, and trachea are situated below the vocal cords, serving as an excitation mechanism by directing air from the lungs through the trachea and the glottis.

Page 17: Speech and Audio Processing and Coding

17

Speech Production (cont.)

Sagittal section of the

vocal tract

Page 18: Speech and Audio Processing and Coding

18

Speech Production (cont.)

A sketch of the vocal tract

Page 19: Speech and Audio Processing and Coding

19

Speech Production (cont.)

• Voiced and unvoiced speech

• Voiced speech: the flow of air from the lungs causes a quasi-periodic vibration of the vocal cords (or vocal folds), and the sound transmitted along the vocal tract is unimpeded.

• Unvoiced speech: the flow of air through the vocal apparatus is either cut off or impeded, e.g. by forming a constriction using the tongue or lips.

Page 20: Speech and Audio Processing and Coding

20

Speech Production (cont.)

• Voicing process

• Air forced through the glottis results in a quasi-periodic oscillation of the vocal cords, as shown in the accompanying figure.

• In stages 2-4, the increasing pressure in the trachea forces the glottis open.

• As air travels through the glottis, the air pressure between the vocal cords decreases.

• The decreasing pressure forces the vocal cords to snap together again, at the lower edge first, as in stages 6-10.

• The resulting perception is called pitch, whose frequency is around 85-155 Hz for adult male speakers and 165-255 Hz for adult female speakers.

Page 21: Speech and Audio Processing and Coding

21

Speech Production (cont.)

Beth's First Laryngoscopy – Vocal Cords in Action

• An interesting video clip from YouTube, available at:

• http://www.youtube.com/watch?v=iYpDwhpILkQ

Page 22: Speech and Audio Processing and Coding

22

Speech Production (cont.)

• Unvoiced speech

• Unvoiced sounds are caused by a partial or total constriction at some point in the vocal tract, such as at the lips.

• Air can be forced through this constriction to create turbulence or noise (e.g. the fricative sounds \s\ and \f\), or the constriction can be opened suddenly to create a burst of turbulence (e.g. the plosives \p\ and \t\).

• There is no vibration of the vocal cords.

• The excitation signal can be regarded as a random noise source, as opposed to a periodic sequence of glottal pulses in the voiced case.

Page 23: Speech and Audio Processing and Coding

23

Speech Production (cont.)

[Figure: typical voiced and unvoiced speech waveforms and their spectra. Sources: from Rice/PROJECTS00/vocode/]

Page 24: Speech and Audio Processing and Coding

24

Speech Production (cont.)

• Formants

• The vocal tract (on occasion coupled with the nasal tract) can be regarded as a resonant system that performs some spectral modification of the excitation signal (i.e. a quasi-periodic series of glottal pulses for voiced sounds, or turbulent air flow/noise for unvoiced sounds) before it is released from the lips.

• Modes or peaks in the frequency response of this resonant system are known as formants, and they occur at the formant frequencies.

• Anti-resonances (minima in the frequency response) can also exist, e.g. for nasal consonants.

• In speech, the most perceptually important formants are the lowest three. However, trained singers are sometimes able to place more energy in higher formants (e.g. around 3000 Hz for male operatic singers).

Page 25: Speech and Audio Processing and Coding

25

Speech Production (cont.)

The first three formants of "ah" are shown in the above spectrum. The vertical lines denote harmonics due to the vibration of the vocal cords (i.e. multiples of the fundamental frequency). The vocal tract acts as a resonant system through which the harmonics pass to generate the vowel's characteristic spectral shape.

Page 26: Speech and Audio Processing and Coding

26

Speech Production (cont.) – Vowels

• Vowels are all voiced. The vowel phonemes of BBC English are shown below:

Page 27: Speech and Audio Processing and Coding

27

Speech Production (cont.) – Vowels

• Vowels can be characterised by the articulatory parameters shown in the vowel quadrilateral in the following figure: height (close/mid/open: the vertical position of the tongue relative to the roof of the mouth), backness (front/central/back: the horizontal tongue position relative to the back of the mouth), and roundedness (whether the lips are rounded or not).

Page 28: Speech and Audio Processing and Coding

28

Speech Production (cont.) – Vowels

• Vowels can also be characterised in terms of their average formant frequencies, which are related to the articulatory parameters but differ more from speaker to speaker.

• The first three formant frequencies are given in the following table for various vowels (averaged over male speakers):

Page 29: Speech and Audio Processing and Coding

29

Speech Production (cont.) – Consonants

• The point or place of constriction in the vocal tract, typically between the tongue and a stationary articulator (e.g. the teeth or the roof of the mouth), gives a consonant its characteristic sound.

• Consonants can be categorised into:

• Fricatives: produced by forcing air through a narrow channel at some point in the oral pathway. They can be either voiced, e.g. \v\, \z\, or unvoiced, e.g. \f\, \s\. Sibilants are a particular subset of fricatives, in which the air is directed over the edge of the teeth, e.g. \s\ and \z\.

• Stops: produced by building up pressure behind a complete constriction in the vocal tract and suddenly releasing the pressure, e.g. voiced \b\, \d\ and \g\, and unvoiced \p\, \t\ and \k\. The term plosive is reserved for oral (non-nasal) stops, such as \p\ in pit and \d\ in dog.

• Nasals: the mouth is completely constricted at some point, the velum is lowered so that the nasal tract is opened, and sound is radiated from the nostrils, e.g. \m\, \n\.

• Affricates: can be modelled as a concatenation of a stop and a fricative, e.g. \dzh\.

• Approximants: the vocal tract is narrowed, but enough space is left for air to flow without much audible turbulence, e.g. \w\, \l\.

Page 30: Speech and Audio Processing and Coding

30

Speech Production (cont.) – Consonant phonemes in BBC English:

Page 31: Speech and Audio Processing and Coding

31

Speech Production (cont.)

• Vowels, consonants, and formant frequencies can be seen in the figure to the right:

• Waveform and spectrogram of the word "zoo"

Page 32: Speech and Audio Processing and Coding

32

Modelling of speech production

• Accurate model

• The production of speech is rather complex, and to model it accurately a physical model would have to account for the following:

• The nature of the glottal excitation, e.g. periodic/aperiodic

• Time variation of the vocal tract shape

• Losses due to heat conduction, viscous friction and the absorption characteristics of the vocal tract walls

• Nasal coupling

• Radiation of sound from the lips

• Reference:

• L. R. Rabiner and R. W. Schafer, "Digital Processing of Speech Signals," Prentice-Hall, 1978.

Page 33: Speech and Audio Processing and Coding

33

Modelling of speech production

• Lossless tube model

• One of the simplest models of the vocal tract is a tube of non-uniform, time-varying cross-section with plane wave propagation along the axis of the tube, assuming no losses due to viscosity or thermal conduction:

Page 34: Speech and Audio Processing and Coding

34

Modelling of speech production – Source-filter model

• The source, or excitation, is the signal arriving from the glottis: either a quasi-periodic sequence of glottal pulses or a broad-band noise signal (typically treated as Gaussian noise).

• The combined response of the vocal tract, nasal tract, and lips is modelled as a time-varying linear filter.

• The output of the model is the convolution of the excitation with the impulse response of the linear filter. The filter response typically has a number of poles and zeros, or may be an all-pole filter as in LPC.

• This is an approximation and is widely used in speech coding for its simplicity. The filter parameters can be estimated easily from real speech and subsequently used for speech synthesis with the same source-filter model at the receiver end of the transmission channel. A minimal synthesis sketch is given below.
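The sketch below (Python with NumPy/SciPy; the filter coefficients, pitch value and frame length are illustrative assumptions, not taken from the lecture) shows the source-filter idea: a pulse-train or noise excitation passed through a fixed all-pole filter standing in for the vocal tract. In LPC the filter coefficients would instead be estimated from real speech, frame by frame.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                     # sampling rate (Hz), illustrative
n = fs // 2                   # half a second of output

# Excitation: a quasi-periodic glottal pulse train for a "voiced" sound ...
pitch_hz = 120
excitation_voiced = np.zeros(n)
excitation_voiced[::fs // pitch_hz] = 1.0

# ... or white noise for an "unvoiced" sound.
excitation_unvoiced = np.random.default_rng(0).standard_normal(n)

# Time-invariant all-pole filter standing in for the vocal tract
# (illustrative, stable coefficients; in LPC these are estimated from speech).
a = [1.0, -1.3, 0.9]
voiced = lfilter([1.0], a, excitation_voiced)
unvoiced = lfilter([1.0], a, excitation_unvoiced)
```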

Page 35: Speech and Audio Processing and Coding

35

Modelling of speech production

Model fit to the real data (where PSD denotes the power spectral density function) [Sources: from Rice/PROJECTS00/vocode/]

Page 36: Speech and Audio Processing and Coding

36

Speech perception

• The auditory system, and subsequent higher-level cognitive processing, is responsible for speech perception. It has an entirely different structure to the organs of speech production, and is not at all like an inverse model of speech production.

• The study of the auditory system is split into physiological aspects (i.e. relating to the physical/mechanical processing of sound) and psychological aspects (i.e. relating to processing in the brain).

• Psychoacoustics is a general term for the study of how humans perceive sound, covering both physiological and psychological aspects (to be covered in wk 10-11).

Page 37: Speech and Audio Processing and Coding

37

Speech perception (cont.)

• Outer ear

• It contains the pinna and the auditory canal. The pinna helps to direct sound into the auditory canal and is used to localise the direction of a sound source. The sound travelling through the canal causes the tympanic membrane (eardrum) to vibrate.

• Middle ear

• It contains three bones (or ossicles): the malleus (or hammer), incus (or anvil) and stapes (or stirrup). The arrangement of these bones amplifies the sound being transmitted to the fluid-filled cochlea in the inner ear.

• Since the surface area of the eardrum is many times that of the stapes footplate, sound energy striking the eardrum is concentrated onto the smaller footplate. The angles between the ossicles are such that a greater force is applied to the cochlea than that transmitted to the hammer. The middle ear can also be considered an impedance matching device.

• Inner ear

• It contains two sensory systems: the vestibular apparatus and the cochlea. The former is responsible for balance and contains the vestibule and semi-circular canals. Sound transmitted to the inner ear causes movement of fluid within the cochlea. The hair cells within the cochlea are stimulated by this movement and convert the vibration into electrical potentials, which are then transmitted as neural impulses along the auditory nerve towards the brain.

Page 38: Speech and Audio Processing and Coding

38

Speech perception (cont.)

Anatomy of the human ear

Page 39: Speech and Audio Processing and Coding

39

Speech perception (cont.)

• Hair cells along the cochlea are frequency-selective, with hair cells at the end near the oval window being receptive to high frequencies, and those near the apex being receptive to low frequencies.

• The cochlea performs a kind of spectral analysis whose resolution is non-linear. In other words, a difference of ∆f = 10 Hz between two sinusoidal components around 100 Hz is easily noticeable, whereas at 5 kHz it is imperceptible.

• The frequency sensitivity of the cochlea is roughly logarithmic above around 500 Hz, i.e. the relative frequency resolution ∆f/f of the cochlea is roughly constant.

• As a consequence of this non-linear frequency selectivity of the ear, the mel-frequency scale was designed as a perceptual scale of pitches judged by listeners to be equal in distance from one another. In other words, any three frequencies equally spaced in mels will appear roughly equally spaced in perceived pitch.

Page 40: Speech and Audio Processing and Coding

40

Speech perception (cont.)

• Cognitive processing of sound is still an active area of research with many open questions.

• The study of how humans process and interpret their sound environment is termed auditory scene analysis.

• Computer simulation of the auditory scene analysis process is known as computational auditory scene analysis.

Page 41: Speech and Audio Processing and Coding

41

Speech perception (cont.)

• To convert frequency in Hz into mel, we use:

f_mel = 1127.01048 · ln(1 + f_Hz / 700)

• And vice versa:

f_Hz = 700 · (e^(f_mel / 1127.01048) − 1)
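A minimal Python sketch of these two conversions (the function names are my own, not from the course):

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mel using the formula above."""
    return 1127.01048 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse conversion: mel back to Hz."""
    return 700.0 * (math.exp(f_mel / 1127.01048) - 1.0)

# Equal steps in mel are perceived as roughly equal steps in pitch,
# even though the corresponding steps in Hz grow with frequency.
print(hz_to_mel(1000))             # ~1000 mel
print(mel_to_hz(hz_to_mel(440)))   # ~440 Hz (round trip)
```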

Page 42: Speech and Audio Processing and Coding

42

Speech perception (cont.)

Mel-frequency scale

Page 43: Speech and Audio Processing and Coding

43

Speech perception (cont.)

• The sensation of loudness is also frequency-dependent.

• When listening to two equal-amplitude sinusoids/pure tones at 50 Hz and 1 kHz, the 1 kHz sinusoid will be heard as louder.

• The range of hearing is roughly between 20 Hz and 20 kHz (although these limits tend to reduce with age, especially at the high-frequency end). Outside these limits, nothing is heard at all.

• The unit of measurement of loudness level is the phon; by definition, two sine waves that have equal phons are equally loud.

Page 44: Speech and Audio Processing and Coding

44

Speech perception (cont.)

Page 45: Speech and Audio Processing and Coding

45

Digital encoding of speech

• Processing of speech has moved almost entirely into the digital domain.

• Speech is initially a variation in air pressure, which is converted into a continuous voltage by a microphone.

• Digital encoding of speech has several advantages, such as:

• Digital signals can be stored for long periods of time and transmitted over noisy channels relatively uncorrupted.

• Digital signals can be encrypted by scrambling the bits, which are then unscrambled at the receiver.

• Digital speech can be encoded and compressed for efficient transmission and storage.

Page 46: Speech and Audio Processing and Coding

46

Analog-to-digital (A/D) conversion

• A/D conversion consists of two stages:

• Sampling

• A continuous signal x(t) can be sampled into a discrete signal x[n] = x(nT), every T seconds, using a sample-and-hold circuit. The sampling rate is defined as the number of samples obtained in one second and is measured in Hertz (Hz), i.e.

f_s = 1 / T

• Quantisation

• The value of each sample is represented using a finite number of bits. Each possible combination of n bits denotes a quantisation level. The difference between the sampled value and the quantised value is the quantisation error. A minimal sketch of both stages is given below.
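A minimal NumPy sketch of sampling and uniform quantisation (the sampling rate, test signal, and bit depth are illustrative choices, not course values):

```python
import numpy as np

fs = 8000                       # sampling rate f_s = 1/T (Hz), illustrative
bits = 12                       # quantiser resolution, illustrative
T = 1.0 / fs

# Sampling: evaluate a continuous-time signal x(t) at t = nT.
n = np.arange(int(0.01 * fs))                 # 10 ms of samples
x = 0.8 * np.sin(2 * np.pi * 200 * n * T)     # "continuous" signal sampled at nT

# Quantisation: map each sample to one of 2**bits uniform levels in [-1, 1).
step = 2.0 / (2 ** bits)
x_q = np.round(x / step) * step

quantisation_error = x - x_q
print(np.max(np.abs(quantisation_error)) <= step / 2)   # True: error within half a step
```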

Page 47: Speech and Audio Processing and Coding

47

A/D conversion (cont.)

• In practice, different bit depths are usually used for different audio signals

• Digital speech

• The dynamic range of clean speech is around 40 dB (between the threshold of hearing and the loudest normal speech sound). Background noise becomes obtrusive when the SNR is worse than about 30 dB. Therefore, a 70 dB dynamic range provides reasonable quality, which is equivalent to 12-bit resolution (roughly 6 dB/bit; see the check below).

• Commercial CD quality music

• 16 bits are usually used, i.e. 65536 levels, which corresponds to a 96 dB dynamic range.

• Digital mixing consoles, music effects units, and audio processing software

• It is common to use 24 bits or higher.
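The "roughly 6 dB/bit" figure follows from the dynamic range of a b-bit uniform quantiser, about 20·log10(2^b) ≈ 6.02·b dB. A quick check in Python (my own illustration):

```python
import math

for b in (12, 16, 24):
    # Dynamic range of a b-bit uniform quantiser: 20*log10(2**b) ≈ 6.02*b dB
    print(b, round(20 * math.log10(2 ** b), 1))   # 12 -> 72.2, 16 -> 96.3, 24 -> 144.5
```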

Page 48: Speech and Audio Processing and Coding

48

A/D conversion (cont.)

• Reconstruction conditions

• Aliasing and the Nyquist criterion

• To allow perfect reconstruction of the original signal, the sampling rate should be at least twice the highest frequency in the signal.

• A lower sampling rate causes aliasing, demonstrated in the sketch below.
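A small NumPy demonstration of aliasing (the frequencies are chosen purely for illustration): a 5 kHz tone sampled at 8 kHz, i.e. below the required 10 kHz, yields exactly the same samples as a 3 kHz tone.

```python
import numpy as np

fs = 8000                        # sampling rate below 2 x 5 kHz, so aliasing occurs
n = np.arange(64)

tone_5k = np.cos(2 * np.pi * 5000 * n / fs)   # 5 kHz tone, under-sampled
tone_3k = np.cos(2 * np.pi * 3000 * n / fs)   # 3 kHz alias (8 kHz - 5 kHz)

# The two sampled sequences are identical: the 5 kHz tone "aliases" to 3 kHz.
print(np.allclose(tone_5k, tone_3k))          # True
```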

Page 49: Speech and Audio Processing and Coding

49

Aliasing effect in spectral domain

Page 50: Speech and Audio Processing and Coding

50

Anti-aliasing

• To avoid aliasing, the A/D converter usually incorporates an anti-aliasing filter (a low-pass filter) before sampling, with a cut-off frequency near the Nyquist frequency (half of the sampling rate).

• In practice, it is difficult to design a low-pass filter with a steep cut-off. A non-ideal filter is used instead, and the sampling rate is usually chosen to be more than twice the highest frequency in the signal. For some typical applications, the sampling rates are usually chosen as follows:

• In telecommunication networks, 8 kHz (the signal is band-limited to [300, 3400] Hz)

• Wideband speech coding, 16 kHz (natural quality speech is band-limited to [50, 7000] Hz)

• Commercial CD music, 44.1 kHz (the audible frequency range reaches up to 20 kHz)

• Oversampling can be advantageous in some applications, to relax the sharp cut-off requirements on anti-aliasing filters.

Page 51: Speech and Audio Processing and Coding

51

D/A converter

• D/A conversion consists of two stages:

• Deglitching

• A process to convert the digital speech represented by bits into a continuous voltage signal, similar to the sample-and-hold operation in A/D conversion.

• Interpolating filter

• A low-pass filter is then used to remove the sharp edges (which cause high-frequency noise) in the output voltage.

• According to the sampling theorem, the ideal low-pass filter with which the analog signal can be perfectly recovered uses a sinc impulse response:

g(t) = sin(πt/T) / (πt/T)

x̂(t) = Σ_{n=−∞}^{+∞} x[n] · g(t − n/f_s)

• In practice, the sinc function is truncated to a limited interval, instead of the infinite sum; a minimal sketch follows below.
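A minimal NumPy sketch of this truncated sinc interpolation (the truncation length and test signal are illustrative assumptions, not course values):

```python
import numpy as np

fs = 8000                                   # sampling rate (Hz), illustrative
T = 1.0 / fs
n = np.arange(200)
x = np.sin(2 * np.pi * 440 * n * T)         # sampled signal x[n]

def reconstruct(t, samples, T, span=30):
    """Approximate x_hat(t) = sum_n x[n] * g(t - nT), truncating the sum
    to the samples within +/- span of t."""
    centre = int(round(t / T))
    ns = np.arange(max(0, centre - span), min(len(samples), centre + span + 1))
    # np.sinc(u) = sin(pi*u)/(pi*u), which matches g(t) above with u = (t - nT)/T
    return np.sum(samples[ns] * np.sinc((t - ns * T) / T))

# Evaluate the reconstruction between two sample instants.
t_query = 100.5 * T
print(reconstruct(t_query, x, T), np.sin(2 * np.pi * 440 * t_query))  # close values
```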