speech production utterance: should we chase acoustic waveform

Speech Production

Utterance: "Should we chase"

Acoustic waveform

Production of speech:

Glottal source

Wednesday, July 27, 2011 6:18 AM

Class-SP-1.4-print1

• Respiration <= Lungs
• Phonation <= Vocal cords
• Articulation <= Vocal tract

Respiration

• A simple but important part of speech production. Respiration provides the airflow and pressure source required for speech production. The lungs primarily serve breathing: inspiration and expiration.
• In most languages, sounds are formed during expiration (“egressive” sounds).
• Total lung capacity is 4-5 litres. The volume velocity of air leaving the lungs is about 0.2 litre/sec during sustained sounds.
• Increased airflow rate => increase in sound amplitude.

• Respiration: the airflow for speech production (lungs).
• Phonation: generation of the basic sound by vibration of the vocal cords (glottis). The otherwise smooth airflow is disturbed, causing sound.
• Articulation: changing the spectrum of the sound (vocal tract). It gives rise to different types of sound. The variation is generated by adjusting the nature and shape of the mouth cavity.


Vocal folds: anatomy and physiology

A pair of elastic structures of tendon, muscle and mucous membrane situated in the larynx. The variable opening between the folds is the “glottis”. In normal breathing, the cords are parted to allow free passage of air.

Observing vocal fold motion:

• electro-glottography
• video photography (see track9)

The vocal cords function chiefly in two modes:

1. With phonation: periodic opening-closing motion => periodic waveform.

2. Without phonation: vocal folds are kept slightly parted => aperiodic (noisy) waveform.

Phonation (vocal cord vibration) is an involuntary action: the folds are not opened and closed by a muscle contraction on each cycle. It occurs when

(a) the vocal cords are elastic and close together, and

(b) there is sufficient difference between sub-glottal and supra-glottal pressure.

Anatomical views of Larynx and vocal folds <www.mayoclinic.com>

Phonation

Glottis


The aerodynamics…..

Electro-glottograph (EGG): impedance is monitored via a high-frequency current between electrodes across the throat.

EGG is based on the principle that tissue is a moderate conductor of electricity whereas air is a poor one. A high-frequency current is passed between electrodes positioned on either side of the thyroid cartilage and the electrical impedance is monitored => area of opening vs time.

Show EGG waveform (correlate of glottal opening).

But more typically, we show the glottal volume velocity (cc/sec vs time). This is not directly obtained from the glottal opening because of source-tract interaction (loading) effects. Instead, a Rothenberg flow mask is used to measure the flow at the mouth opening, and the formants are then removed by inverse filtering.


Glottal pulses are not truly periodic but exhibit jitter and shimmer due to neurologic, biomechanical and aerodynamic disturbances.

Jitter: period-to-period variations in duration; normally < 1%

Shimmer: period-to-period variations in amplitude; normally < 6%
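The two measures can be illustrated with a small sketch. It assumes one common definition (the mean absolute difference between consecutive cycles, as a percentage of the mean value); other definitions exist, and the function names and the five-cycle data below are hypothetical, chosen only for illustration.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference as a percentage of the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


def jitter_percent(periods_s):
    """Jitter: perturbation of the glottal cycle durations (seconds)."""
    return relative_perturbation(periods_s)


def shimmer_percent(peak_amplitudes):
    """Shimmer: perturbation of the per-cycle peak amplitudes."""
    return relative_perturbation(peak_amplitudes)


# Hypothetical measurements for five consecutive glottal cycles (about 125 Hz).
periods = [0.00800, 0.00804, 0.00798, 0.00802, 0.00800]
amplitudes = [1.00, 0.97, 1.02, 0.99, 1.01]

print(f"jitter  = {jitter_percent(periods):.2f} %")
print(f"shimmer = {shimmer_percent(amplitudes):.2f} %")
```

With these made-up values both measures fall inside the normal ranges quoted above.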

Not normally directly perceptible but add to naturalness of the voice.

High jitter-shimmer => roughness

Glottal flow signal can be approximated by 2 poles near dc. [K. N. Stevens, "On the quantal nature of speech," J. Phonet., 17, 3-46 (1989).]

Voice quality is altered by modifying the glottal vibration pattern.

Voice quality changes can be non-phonemic or phonemic.
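The two-pole approximation of the glottal flow noted above can be checked numerically. The sketch below assumes a discrete-time form G(z) = 1 / (1 - a z^-1)^2 with a real double pole a just below 1 (that is, near dc); the pole value 0.99 and the 16 kHz sampling rate are illustrative assumptions, not values from the notes. It shows the resulting spectral tilt of roughly -12 dB per octave.

```python
import cmath
import math


def two_pole_gain_db(freq_hz, fs_hz=16000.0, pole=0.99):
    """Gain in dB of G(z) = 1 / (1 - pole * z^-1)^2 evaluated at freq_hz."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    one_pole = 1.0 / (1.0 - pole * cmath.exp(-1j * w))
    return 20.0 * math.log10(abs(one_pole) ** 2)


def tilt_db_per_octave(freq_hz, fs_hz=16000.0, pole=0.99):
    """Change in gain when the frequency is doubled from freq_hz."""
    return (two_pole_gain_db(2.0 * freq_hz, fs_hz, pole)
            - two_pole_gain_db(freq_hz, fs_hz, pole))


for f in (250.0, 500.0, 1000.0):
    print(f"{f:6.0f} Hz -> {2 * f:6.0f} Hz: {tilt_db_per_octave(f):6.2f} dB")
```

Each pole contributes about -6 dB per octave above its corner frequency, which is why two poles near dc give the steep low-pass shape of the glottal source spectrum.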

Rate of Vibration of the vocal cords

The average rate is inversely proportional to the length of the vocal folds. This length is correlated with neck circumference.

Voluntary control: By means of muscle contractions, the vocal folds can be varied in length (tension), thickness and position configuration.

Folds are relaxed (short) and thick -> low pitch

Folds are tense (long) and thin -> high pitch

Male: 80 - 160 Hz

Female: 160 - 320 Hz


Types of Phonation: non-phonemic; speaker-dependent or controlled

• Normal: or modal quality; can change with changing speed of glottal closure.
• Breathy / Whisper: incomplete closure with the posterior portion of the glottis always open; the airflow has a periodic + noisy component; the extent of breathiness depends on the proportion of time the vocal folds are open.
• Creaky / Hoarse: folds are closed with a small part vibrating with irregular period.
• Falsetto: folds are thin and don't close completely; only the central part vibrates, at a high rate.

Pathological voices are rough and hoarse, and are quantified by measures of aperiodicity, including breath noise.


"Phonemic" voice quality

We can divide all speech sounds, based on whether they are produced with vocal fold vibration or without it (folds held open with a narrow constriction), into the categories:

• Voiced sounds
• Unvoiced sounds

            Vowels       Fricatives    Plosives
Voiced      normal       z, j, v       b, d, g
Unvoiced    whispered    s, sh, f      p, t, k

Other source of sound in glottis: Aspiration noise

Electronic Larynx


Articulation

The sound produced at the larynx passes through the vocal tract which alters the sound quality based on the selected positions of the articulators (tongue, jaw, lips, velum) changing the shape of the vocal tract "resonator".

From the UNSW acoustics site.

We can use the known expressions for the resonances of a tube of given length and end (open/closed) conditions.

(These known expressions come from applying Newton's 2nd law to sound propagation in the medium, which gives the constant of proportionality in the Simple Harmonic Motion differential equation.)

From: Ladefoged, Acoustic Phonetics

Tube model for vocal tract:

Good approximation for the sound /uh/ as in "burn"

Vocal tract acoustics

To appreciate the role of the vocal tract, change your mouth shape while phonating at constant pitch and amplitude.

We can now see how we can independently control the larynx (source) and vocal tract articulators (filter) for different sounds.

Vocal tract

Monday, August 20, 2012 1:25 PM


For a uniform tube closed at the glottis and open at the lips, f = (2n-1)c/4L. For L = 17.5 cm, c = 350 m/s => f = 500, 1500, 2500, ... Hz

Tube approximation for /a/ as in "cart"

For L1 = L2 = 8.75 cm => f = 1000, 3000, 5000… Hz
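The resonance values quoted for the single-tube and two-tube approximations can be reproduced with the standard quarter-wavelength formula f_n = (2n-1)c/4L for a tube closed at one end and open at the other. The sketch below is illustrative: the function name is hypothetical, and c = 350 m/s is an assumed speed of sound chosen because it reproduces the round values in the notes (with c = 340 m/s the first resonance of a 17.5 cm tube comes out near 486 Hz instead).

```python
def quarter_wave_resonances(length_m, count=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * k - 1) * speed_of_sound / (4.0 * length_m)
            for k in range(1, count + 1)]


# Neutral vowel: one uniform tube of 17.5 cm.
neutral = quarter_wave_resonances(0.175)

# /a/ approximation: two tubes of 8.75 cm each, treated as uncoupled,
# so each one contributes the same set of resonances.
half_tube = quarter_wave_resonances(0.0875)

print("17.5 cm tube:", [round(f) for f in neutral])
print("8.75 cm tube:", [round(f) for f in half_tube])
```

Treating the two /a/ tubes as uncoupled is a simplification; coupling between them shifts the paired resonances apart.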

Other vowels; role of tongue and lips. Tongue position and height create the vocal tract cavities. Rounding of the lips changes the length.

Nasal sounds: Branched resonator

In reality, there are perturbations in the above values due to the coupling between the tubes. E.g. the /a/ tube resonances nominally at 1000 Hz are really at about 900 and 1100 Hz.


Damped resonator: spectrum, waveform

Nasal cavity: closure of the oral cavity + radiation of sound through the nasal cavity.

The oral cavity acts as a side-branch resonator, introducing zeros (anti-resonances) based on its length.

Nasalised vowels: both oral and nasal cavities are open and coupled, but the oral cavity is more open. Thus the nasal cavity acts like an anti-resonator.

Laterals, fricatives


Laterals (l,r) have a side-cavity that introduces anti-resonances.

<- pocket of air above tongue

<- main cavity curves around tongue

Unvoiced consonants: There is a turbulent flow of air through a constriction within the vocal tract. This constriction creates a frication noise source that excites primarily the portion of the vocal tract in front of it. Depending on the place of the constriction we have different sounds: sh, s, f.

Effect of losses in the vocal tract:

In the ideal lossless tube model, resonances and anti-resonances have zero bandwidth. But in practice, there are losses in the speech production system such as:

• yielding (not rigid) walls that vibrate at low frequencies,
• viscous friction between the air and walls and heat conduction through walls,
• large yielding surface area of nasal cavity,
• sound radiation at the lips.

Nasal consonants:


Also applies to musical instruments...

Lip radiation: The lips form a small opening, so that diffraction (bending) of large wavelengths (low frequencies) takes place while high frequencies are directed to the front => lip radiation is modeled by a high-pass filter.
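A minimal sketch of such a high-pass filter follows. It assumes the commonly used first-difference model y[n] = x[n] - x[n-1], i.e. R(z) = 1 - z^-1; the notes do not specify a particular filter, so this choice, the function names and the 16 kHz sampling rate are assumptions made for illustration.

```python
import cmath
import math


def lip_radiation(x):
    """First-difference filter y[n] = x[n] - x[n-1] (assumed radiation model)."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - prev)
        prev = sample
    return y


def lip_radiation_gain_db(freq_hz, fs_hz=16000.0):
    """Gain in dB of R(z) = 1 - z^-1 at freq_hz (freq_hz must be above 0 Hz)."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    return 20.0 * math.log10(abs(1.0 - cmath.exp(-1j * w)))


# A constant (dc) input is blocked after the first sample.
print(lip_radiation([1.0, 1.0, 1.0, 1.0]))

# The gain rises by about 6 dB for each doubling of frequency.
for f in (100.0, 200.0, 400.0):
    print(f"{f:5.0f} Hz: {lip_radiation_gain_db(f):7.2f} dB")
```

The +6 dB per octave rise partly offsets the falling spectrum of the glottal source in the radiated speech signal.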


B = -σ/π,   ω = 2πF = 2π(1/T)

Source-filter model of speech production

For a given formant frequency F_i Hz and bandwidth B_i Hz, we have, for sampling period T, the pole angle θ_i and pole radius r_i:

θ_i = 2π F_i T

r_i = e^{-π B_i T}

Digital resonator
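The mapping from formant frequency and bandwidth to a second-order digital resonator can be sketched as follows. The sketch assumes the usual all-pole form H(z) = 1 / (1 - 2 r cos(θ) z^-1 + r^2 z^-2) with unnormalised gain; the function names and the example values (F = 500 Hz, B = 60 Hz, 10 kHz sampling) are illustrative assumptions.

```python
import cmath
import math


def resonator_coefficients(formant_hz, bandwidth_hz, fs_hz):
    """Return (a1, a2) of the denominator 1 + a1*z^-1 + a2*z^-2."""
    T = 1.0 / fs_hz                              # sampling period
    theta = 2.0 * math.pi * formant_hz * T       # pole angle
    r = math.exp(-math.pi * bandwidth_hz * T)    # pole radius
    return -2.0 * r * math.cos(theta), r * r


def resonator_gain(freq_hz, a1, a2, fs_hz):
    """Magnitude of H(z) = 1 / (1 + a1*z^-1 + a2*z^-2) at freq_hz."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs_hz)
    return abs(1.0 / (1.0 + a1 * z_inv + a2 * z_inv * z_inv))


def resonator_filter(x, a1, a2):
    """Apply y[n] = x[n] - a1*y[n-1] - a2*y[n-2] to the sequence x."""
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample - a1 * y1 - a2 * y2
        y.append(out)
        y1, y2 = out, y1
    return y


fs = 10000.0
a1, a2 = resonator_coefficients(500.0, 60.0, fs)
impulse_response = resonator_filter([1.0] + [0.0] * 99, a1, a2)

for f in (440.0, 500.0, 560.0):
    print(f"{f:5.0f} Hz: gain {resonator_gain(f, a1, a2, fs):6.2f}")
```

The impulse response is a decaying sinusoid: a smaller bandwidth moves the poles closer to the unit circle, giving a sharper spectral peak and a slower decay.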

For consonant phones:


Acoustic phonetics: the differentiation of sounds on an acoustic basis. The acoustic differences are more evident in the spectrum than in the time domain.

<---- Voicing and manner


Speech sounds can be analysed from different points of view: articulatory, acoustic, phonetic and perceptual.

Articulatory phonetics relates linguistic features of sounds to positions and movements of the speech organs.

This knowledge is limited by the lack of data on the motion of the vocal tract. Visual and X-ray means do not provide 3-D images. MRI is good but limited to the study of sustained sounds.

A phoneme is the smallest contrastive linguistic unit of speech, i.e. the smallest unit that can distinguish meaning. A phone is the corresponding acoustic unit (the realisation of the phoneme).

The duration of a phone may vary from 30 ms to 100 ms.

Every language has a defined phone inventory. We come across these phones when we look up the pronunciation of a word in a dictionary (e.g. an English dictionary).

Vowels of Amer. English

Examples of pronunciation…

Articulatory Phonetics

Tuesday, July 26, 2016 6:45 PM

Class-SP-1.5 Page 1

Phones are the basic speech sounds and are completely described based on a small set of attributes or features. Thus phones can be classified in multiple ways.

A major classification of phones is based on their role in a syllable as:

• Vowels
• Consonants

Classification based on Articulation:

Phones in each class have common articulatory configuration. The articulation of a phone has a source component and a tract component.

The Source component comprises the Voicing and Manner of Articulation (MoA).

(a) Whether the glottis vibrates:
    (i) Voiced sound
    (ii) Unvoiced sound

(b) Manner of articulation:
    (i) whether there is a constriction in the vocal tract, and its type
    (ii) whether the velum is open or closed

Tract component: described by the shape of the vocal tract in terms of where and what type of narrowing occurs. This depends on the positions of the articulators.

Vowels: always voiced and relatively steady.

Vowel quality is determined by the shape of the oral cavity, controlled by tongue and lip positions.

Distinguished by tongue height and backness.

Vowel quadrilateral: Articulatory and Acoustic interpretations

A syllable is a complex unit of phones made up of nuclear and marginal elements.


[Figure axes: tongue position; F1 (200 - 800 Hz); F2 (2500 - 1000 Hz)]

Consonants are classified based on:

(i) voicing and MoA

(ii) place of articulation (PoA)

Voicing and MoA:

Voicing ->
• Vocal fold vibration
• Aspiration source
• Frication source

MoA: Vowels, Fricatives, Stops and Affricates, Nasal consonants, Glides and Semi-vowels

PoA: classification of consonants by PoA

Active, passive articulators

Bb: p30

Bb: p32

Bb: p.28


A history of Phonetics … (From: SPAU 3343 Phonetics and Phonology, William Katz, Ph.D., University of Texas at Dallas)

History of phonetics

07 August 2019 09:34


EE679: Speech Processing

Instructor: Preeti Rao

Articulatory classification of phonemes (Indo-Aryan, English)

Consonant Articulation: Consonants are articulated by introducing a constriction in the vocal tract. Consonants are described via (i) voicing, (ii) manner of articulation, and (iii) place of articulation.

(i) Voicing: Voiced (glottal vibration) or Unvoiced

(ii) Manner of Articulation – It represents the manner of restriction of the airflow from the glottis.

a) Stops or Plosives – Airflow is completely stopped for some time and then released abruptly, leading to a plosive burst followed by the onset of the next phone.

e.g. /b/, /p/, /t/, /d/, /t̪/, /d̪/, /g/, /k/

b) Nasals – Sound is produced by lowering the velum and allowing air to pass into the nasal cavity. They are always voiced.

e.g. /m/, /n/, /ng/

c) Fricatives – Continuous airflow through a narrow constriction in the vocal tract, leading to a characteristic hissing sound resulting from the turbulent airflow.

e.g. /f/ in fat, /v/ in very, /ð/ in then, /θ/ in thin, /z/ in zoos

श , ष , स are some examples from Hindi.

Stops that are followed immediately by fricatives are called affricates. They can be considered a mixture of stops and fricatives.

e.g. /dʒ/ (ज) as in dodge, /tʃ/ (च) as in chirp

d) Approximants – The two articulators are close together but not close enough to cause turbulent airflow.

e.g. /j/ as in you

e) Taps/Flaps – Quick motion of the tongue against the alveolar ridge.

e.g. /ɾ/ as in butter

f) Glides and Semi-vowels – There is movement of the tongue from one vowel position to another.

e.g. य in you, र , ल , व in watch

g) ळ (between retroflex & glides/liquids)

(iii) Place of Articulation – It is the point at which maximum restriction to airflow occurs in the vocal tract.

While uttering a consonant, two articulators are involved. They are called the active articulator and the passive articulator. The active articulator is the dynamic articulator, which moves towards the stationary passive articulator to produce the desired constriction for a given consonant. Based on which two articulators are used, phones are classified as:

a) Labial – Constriction is made by pressing two lips together.

Active articulator: Lower lip Passive articulator: Upper lip

e.g. प , फ , ब , भ , म (unvoiced, unvoiced aspirated, voiced, voiced aspirated, nasal respectively).

/p/, /ph/, /b/, /bh/, /m/

b) Labiodental – Pressing the lower lip against the upper row of teeth and letting the air flow through the space in the upper teeth.

Active articulator: Lower lip

Passive articulator: Upper front teeth

e.g. / f /, / v /

c) Dental – Placing the tongue against the back of the upper teeth.

Active articulator: Tongue tip

Passive articulator: Upper teeth

e.g. त , थ , द , ध , न (unvoiced, unvoiced aspirated, voiced, voiced aspirated, nasal respectively).

/t̪/, /t̪h/, /d̪/, /d̪h/, /n/

d) Alveolar – Placing the tip of the tongue against the alveolar ridge, the portion of the roof of the mouth just behind the upper teeth.

Active articulator: Tongue tip

Passive articulator: Alveolar ridge (the ridge behind the upper teeth)

e.g. /t/ as in table

e) Post-alveolar: e.g. /t/ as in train

f) Retroflex

Active articulator: Tongue tip (curled back)

e.g. ट , ठ , ड , ढ , ण (unvoiced, unvoiced aspirated, voiced, voiced aspirated, nasal respectively).

/ʈ/, /ʈh/, /ɖ/, /ɖh/, /ɳ/

g) Palato-alveolar: These sounds are between retroflex & palatal.

h) Palatal – Sounds made with the blade of the tongue against the rising back of the alveolar ridge, palate.

Active articulator: Front of tongue

Passive articulator: Hard palate

e.g. च , छ , ज , झ , ञ (unvoiced, unvoiced aspirated, voiced, voiced aspirated, nasal respectively).

/tʃ/,

i) Velar – Pressing the back of the tongue up against the velum, a movable muscular flap at the very back of the roof of the mouth.

Active articulator: Back of tongue

Passive articulator: Soft palate (velum)

e.g. क , ख , ग , घ , ङ (unvoiced, unvoiced aspirated, voiced, voiced aspirated, nasal respectively).

/k/, /kh/, /g/, /gh/, /ng/

j) Uvular

k) Glottal – Stop made by closing the glottis, the space between the vocal cords.

e.g. ह (/h/)

Figure 1 Place of Articulation based Classification

Broad Classification based on Manner of Articulation:

1) Sonorants (“can be sung”; no friction noise)

• Always voiced
• Excitation is only at glottis
• Can have aspiration as well
• Continuous & intense
• e.g. vowels, nasals, liquids, glides, semivowels, diphthongs

2) Obstruents

• Vocal tract is constricted somewhere
• Weak & non-periodic
• Primary excitation at vocal tract constriction
• Complete closure & subsequent release of constriction
• e.g. stops & affricates, fricatives
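The articulatory description above (voicing, manner, place) and the broad sonorant/obstruent split can be represented as a small lookup table. The sketch below is only an illustration: the table covers a handful of the labial, labiodental and velar consonants listed in these notes, and the names `Consonant`, `CONSONANTS` and `broad_class` are hypothetical.

```python
from typing import NamedTuple


class Consonant(NamedTuple):
    voiced: bool
    manner: str   # manner of articulation (MoA)
    place: str    # place of articulation (PoA)


# A few entries taken from the place-of-articulation lists in these notes.
CONSONANTS = {
    "p": Consonant(False, "stop", "labial"),
    "b": Consonant(True, "stop", "labial"),
    "m": Consonant(True, "nasal", "labial"),
    "f": Consonant(False, "fricative", "labiodental"),
    "v": Consonant(True, "fricative", "labiodental"),
    "k": Consonant(False, "stop", "velar"),
    "g": Consonant(True, "stop", "velar"),
    "ng": Consonant(True, "nasal", "velar"),
}

# Manners grouped by the broad classification.
OBSTRUENT_MANNERS = {"stop", "affricate", "fricative"}
SONORANT_MANNERS = {"nasal", "liquid", "glide", "semivowel", "approximant"}


def broad_class(symbol):
    """Return 'sonorant' or 'obstruent' for a consonant in the table."""
    manner = CONSONANTS[symbol].manner
    if manner in OBSTRUENT_MANNERS:
        return "obstruent"
    if manner in SONORANT_MANNERS:
        return "sonorant"
    raise ValueError(f"unclassified manner: {manner}")


for symbol, features in CONSONANTS.items():
    voicing = "voiced" if features.voiced else "unvoiced"
    print(f"/{symbol}/: {voicing} {features.place} {features.manner}, "
          f"{broad_class(symbol)}")
```

Every sonorant in the table is voiced, consistent with the "always voiced" property stated for sonorants.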
