
CS 224S / LINGUIST 285: Spoken Language Processing

Andrew Maas, Stanford University

Spring 2017, Lecture 2: Phonetics

Original slides by Dan Jurafsky

Homework 1
• Out after lecture today. Due in 1 week.
• PDF handout linked on the website syllabus.
• You’ll need to download PRAAT; details are in the homework.

Phonetics
• ARPAbet
  • An alphabet for transcribing American English phonetic sounds.
• Articulatory Phonetics
  • How speech sounds are made by articulators (moving organs) in the mouth.
• Acoustic Phonetics
  • Acoustic properties of speech sounds

ARPAbet
• http://www.stanford.edu/class/cs224s/arpabet.html
• The CMU Pronouncing Dictionary
  • http://www.speech.cs.cmu.edu/cgi-bin/cmudict
• What about other languages?
  • International Phonetic Alphabet:
  • http://en.wikipedia.org/wiki/International_Phonetic_Alphabet

ARPAbet Vowels

 #   b_d          ARPA
 1   bead         iy
 2   bid          ih
 3   bayed        ey
 4   bed          eh
 5   bad          ae
 6   bod(y)       aa
 7   bawd         ao
 8   Budd(hist)   uh
 9   bode         ow
10   booed        uw
11   bud          ah
12   bird         er
13   bide         ay
14   bowed        aw
15   Boyd         oy

https://corpus.linguistics.berkeley.edu/acip/

Note: Many speakers pronounce “Buddhist” with the vowel [uw] as in “booed”, so for them [uh] is instead the vowel in “put” or “book”.

The Speech Chain (Denes and Pinson)

[Figure labels: SPEAKER, HEARER]

Speech Production Process
• Respiration
  • We (normally) speak while breathing out. Respiration provides airflow: the “pulmonic egressive airstream”.
• Phonation
  • The airstream sets the vocal folds in motion. Vibration of the vocal folds produces sound, which is then modulated by:
• Articulation and Resonance
  • Shape of the vocal tract, characterized by:
    • Oral tract: teeth, soft palate (velum), hard palate, tongue, lips, uvula
    • Nasal tract

Text adapted from Sharon Rose

Nasal Cavity

Pharynx

Vocal Folds (within the Larynx)

Trachea

Lungs

Text copyright J. J. Ohala, Sept 2001, from Sharon Rose slide

Sagittal section of the vocal tract (Techmer 1880)

From Mark Liberman’s website, from Ultimate Visual Dictionary

From Mark Liberman’s Web Site, from Language Files (7th ed)

Figure of Ken Stevens, labels from Peter Ladefoged’s web site

USC’s SAIL Lab, Shri Narayanan

Larynx and Vocal Folds
• The Larynx (voice box)
  • A structure made of cartilage and muscle
  • Located above the trachea (windpipe) and below the pharynx (throat)
  • Contains the vocal folds
  • (adjective for larynx: laryngeal)
• Vocal Folds (older term: vocal cords)
  • Two bands of muscle and tissue in the larynx
  • Can be set in motion to produce sound (voicing)

Text from slides by Sharon Rose UCSD LING 111 handout

The larynx, external structure, from front

Figure thnx to John Coleman!!

Vertical slice through larynx, as seen from back

Figure thnx to John Coleman!!

Voicing:

• Air comes up from the lungs
• It forces its way through the vocal cords, pushing them open (2, 3, 4)
• This causes the air pressure in the glottis to fall, since:
  • when gas runs through a constricted passage, its velocity increases (Venturi tube effect)
  • this increase in velocity results in a drop in pressure (Bernoulli principle)
• Because of the drop in pressure, the vocal cords snap together again (6-10)
• Single cycle: ~1/100 of a second.

Figure & text from John Coleman’s web site

Voicelessness
• When the vocal cords are open, air passes through unobstructed
• Voiceless sounds: p / t / k / s / f / sh / th / ch
• If the air moves very quickly, the turbulence causes a different kind of phonation: whisper

Vocal folds open during breathing

From Mark Liberman’s web site, from Ultimate Visual Dictionary

Vocal Fold Vibration

UCLA Phonetics Lab Demo

Consonants and Vowels
• Consonants: phonetically, sounds with audible noise produced by a constriction
• Vowels: phonetically, sounds with no audible noise produced by a constriction
• (It’s more complicated than this, since we have to consider syllabic function, but this will do for now)

Text adapted from John Coleman

Place of Articulation
• Consonants are classified according to the location where the airflow is most constricted.
• This is called place of articulation
• Three major kinds of place of articulation:
  • Labial (with lips)
  • Coronal (using tip or blade of tongue)
  • Dorsal (using back of tongue)

Places of articulation

[Figure labels: labial; dental; alveolar; post-alveolar/palatal; velar; uvular; pharyngeal; laryngeal/glottal]

Figure thanks to Jennifer Venditti

Coronal place

[Figure labels: dental; alveolar; post-alveolar/palatal]

Dental: th / dh

Alveolar: t / d / s / z / l

Post-alveolar/palatal: sh / zh / y

Dorsal Place

[Figure labels: velar; uvular; pharyngeal]

Velar: k / g / ng

Manner of Articulation
• Stop: complete closure of the articulators, so no air escapes through the mouth
  • Oral stop: the soft palate (velum) is raised, so no air escapes through the nose. Air pressure builds up behind the closure and explodes when released
    • p, t, k, b, d, g
  • Nasal stop: oral closure, but the soft palate is lowered, so air escapes through the nose
    • m, n, ng

Oral vs. Nasal Sounds

Thanks to Jong-bok Kim for this figure!

More on manner of articulation of consonants
• Fricatives
  • Close approximation of two articulators, resulting in turbulent airflow between them, producing a hissing sound.
  • f, v, s, z, th, dh
• Approximant
  • Not-quite-so-close approximation of two articulators, so no turbulence
  • y, r
• Lateral approximant
  • Obstruction of the airstream along the center of the oral tract, with opening around the sides of the tongue.
  • l

Text from Ladefoged “A Course in Phonetics”

More on manner of articulation of consonants
• Tap or flap
  • Tongue makes a single tap against the alveolar ridge
  • dx in “butter”
• Affricate
  • Stop immediately followed by a fricative
  • ch, jh

Articulatory parameters for English consonants (in ARPAbet)

                         PLACE OF ARTICULATION
MANNER     bilabial  labiodental  interdental  alveolar  palatal  velar  glottal
stop       p b                                 t d                k g    q
fric.                f v          th dh        s z       sh zh           h
affric.                                                  ch jh
nasal      m                                   n                  ng
approx     w                                   l/r       y
flap                                           dx

VOICING: in each pair, the first symbol is voiceless and the second is voiced.
Table from Jennifer Venditti
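The consonant table can also be read as a lookup from symbol to (place, manner, voicing). The sketch below is one possible encoding, not part of the course materials: the names CONSONANTS and natural_class are illustrative, and the voicing values assume the usual chart convention (first symbol of a pair voiceless, second voiced; nasals, approximants and the flap voiced; q and h voiceless).

```python
# Features transcribed from the consonant table: (place, manner, voiced).
# Assumption: the voicing values follow the usual chart convention and are
# not stated symbol by symbol in the slides.
CONSONANTS = {
    "p": ("bilabial", "stop", False),
    "b": ("bilabial", "stop", True),
    "t": ("alveolar", "stop", False),
    "d": ("alveolar", "stop", True),
    "k": ("velar", "stop", False),
    "g": ("velar", "stop", True),
    "q": ("glottal", "stop", False),
    "f": ("labiodental", "fricative", False),
    "v": ("labiodental", "fricative", True),
    "th": ("interdental", "fricative", False),
    "dh": ("interdental", "fricative", True),
    "s": ("alveolar", "fricative", False),
    "z": ("alveolar", "fricative", True),
    "sh": ("palatal", "fricative", False),
    "zh": ("palatal", "fricative", True),
    "h": ("glottal", "fricative", False),
    "ch": ("palatal", "affricate", False),
    "jh": ("palatal", "affricate", True),
    "m": ("bilabial", "nasal", True),
    "n": ("alveolar", "nasal", True),
    "ng": ("velar", "nasal", True),
    "w": ("bilabial", "approximant", True),
    "l": ("alveolar", "approximant", True),
    "r": ("alveolar", "approximant", True),
    "y": ("palatal", "approximant", True),
    "dx": ("alveolar", "flap", True),
}


def natural_class(place=None, manner=None, voiced=None):
    """Return the sorted symbols matching every feature that is not None."""
    return sorted(
        symbol
        for symbol, (p, m, v) in CONSONANTS.items()
        if (place is None or p == place)
        and (manner is None or m == manner)
        and (voiced is None or v == voiced)
    )


print(natural_class(manner="stop", voiced=False))  # the voiceless stops
print(natural_class(place="velar"))                # everything made at the velum
```

Querying by feature rather than by symbol is the form most pronunciation-modelling code ends up needing, which is why the table is stored as tuples instead of nested rows.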

Tongue position for vowels

Vowels


IY AA UW

Fig. from Eric Keller

American English Vowel Space

[Figure: vowel chart with axes FRONT to BACK and HIGH to LOW, showing iy, ih, eh, ae, aa, ao, uw, uh, ah, ax, ix, ux]

Figure from Jennifer Venditti. Red: vowels; blue: diphthongs.

[iy] vs. [uw]

Figure from Jennifer Venditti, from a lecture given by Rochelle Newman

[ae] vs. [aa]

Figure from Jennifer Venditti, from a lecture given by Rochelle Newman

Where to go for more info
• Ladefoged, Peter. 1993. A Course in Phonetics
• Mark Liberman’s site
  • http://www.ling.upenn.edu/courses/Spring_2001/ling001/phonetics.html
• John Coleman’s site
  • http://www.phon.ox.ac.uk/%7Ejcoleman/mst_mphil_phonetics_course_index.html
• Jennifer Smith’s resource page
  • http://www.unc.edu/~jlsmith/pht-url.html

Sound waves are longitudinal waves

[Figure panels: particle displacement; pressure]

Figures from Dan Russell

Remember High School Physics: Simple Periodic Waves (sine waves)

[Figure: one cycle marked on a sine wave; time axis 0 to 0.02 s, amplitude -0.99 to 0.99]

• Characterized by:
  • period: T
  • amplitude: A
  • phase: φ
• Fundamental frequency, in cycles per second, or Hz
  • F0 = 1/T

To listen to sine waves: http://www.szynalski.com/tone-generator/

Simple periodic waves
• Computing the frequency of a wave:
  • 5 cycles in 0.5 seconds = 10 cycles/second = 10 Hz
• Amplitude:
  • 1
• Equation:
  • Y = A sin(2πft)
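As a minimal sketch of the sine-wave equation Y = A sin(2πft), the Python below samples a 10 Hz wave of amplitude 1 for 0.5 seconds and recomputes the frequency from the cycle count. The function names and the 8000 Hz sample rate are assumptions chosen for illustration, not values from the slides.

```python
import math


def sine_wave(amplitude, freq_hz, duration_s, sample_rate=8000, phase=0.0):
    """Sample Y = A sin(2*pi*f*t + phase) at t = 0, 1/sr, 2/sr, ..."""
    n_samples = int(round(duration_s * sample_rate))
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * (i / sample_rate) + phase)
        for i in range(n_samples)
    ]


def frequency_from_cycles(cycles, seconds):
    """Frequency in Hz is simply cycles per second."""
    return cycles / seconds


wave = sine_wave(amplitude=1.0, freq_hz=10.0, duration_s=0.5)
freq = frequency_from_cycles(5, 0.5)  # 5 cycles in 0.5 s
period = 1.0 / freq                   # T = 1 / F0

print(len(wave), freq, period)
```

The first positive peak falls a quarter period in (t = T/4 = 0.025 s), which is a quick sanity check that amplitude, frequency and sampling are consistent.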

Speech sound waves
• A little piece from the waveform of the vowel [iy]
• X axis: time.
• Y axis:
  • Amplitude = air pressure at that time
  • +: compression
  • 0: normal air pressure
  • -: rarefaction

Back to waves: Fundamental frequency
• Waveform of the vowel [iy]
• Frequency: 10 repetitions / 0.03875 seconds = 258 Hz
• This is the rate at which the vocal folds open and close, hence voicing
• Each peak corresponds to an opening of the vocal folds
• The lowest frequency component of the complex wave is called the fundamental frequency of the wave, or F0

She just had a baby
• Note that vowels all have regular amplitude peaks
• Stop consonant
  • Closure followed by release
  • Notice the silence followed by slight bursts of emphasis: very clear for [b] of “baby”
• Fricative: noisy. [sh] of “she” at beginning

Fricative

Back to freshman physics: Waves have different frequencies

[Figure: two sine waves, 100 Hz and 1000 Hz; time axis 0 to 0.02 s, amplitude -0.99 to 0.99]

Complex waves: Adding a 100 Hz and 1000 Hz wave together

[Figure: summed waveform; time axis 0 to 0.05 s, amplitude -0.9654 to 0.99]

Spectrum

[Figure: amplitude plotted against frequency in Hz, with components at 100 and 1000]

Frequency components (100 and 1000 Hz) on x-axis
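To make the step from waveform to spectrum concrete, here is a hedged sketch that adds a 100 Hz and a 1000 Hz sine wave and recovers the two components with a naive discrete Fourier transform. The 4000 Hz sample rate and the 0.5 amplitudes are assumptions chosen so that both components fall exactly on a frequency bin; real analysis tools use an FFT and windowing instead.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Amplitude per frequency bin below Nyquist, via a naive O(n^2) DFT.

    The 2/n scaling returns the sine amplitude for non-zero bins
    (bin 0, the DC term, is over-scaled by 2 but is ~0 here).
    """
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        magnitudes.append(2.0 * abs(acc) / n)
    return magnitudes


SAMPLE_RATE = 4000   # Hz, assumed for this example
DURATION_S = 0.05    # matches the 0 to 0.05 s time axis of the figure
n_samples = int(round(SAMPLE_RATE * DURATION_S))

# Complex wave: sum of a 100 Hz and a 1000 Hz sine, each with amplitude 0.5.
wave = [
    0.5 * math.sin(2 * math.pi * 100 * i / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE)
    for i in range(n_samples)
]

magnitudes = dft_magnitudes(wave)
bin_hz = SAMPLE_RATE / n_samples  # width of one frequency bin
peaks = [k * bin_hz for k, m in enumerate(magnitudes) if m > 0.1]
print(peaks)  # frequencies (Hz) that carry energy
```

Only the bins at 100 Hz and 1000 Hz rise above the threshold, which is the two-spike spectrum the slide describes.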

Spectra continued
• Fourier analysis: any wave can be represented as the (infinite) sum of sine waves of different frequencies (amplitude, phase)

Spectrum of one instant in an actual sound wave: many components across the frequency range

[Figure: spectrum; frequency axis 0 to 5000 Hz, magnitude axis 0 to 40]

Part of [ae] waveform from “had”
• Note the complex wave repeating nine times in the figure
• Plus a smaller wave which repeats 4 times for every large pattern
• The large wave has a frequency of 250 Hz (9 times in 0.036 seconds)
• The small wave is roughly 4 times this, or roughly 1000 Hz
• Two little tiny waves on top of each peak of the 1000 Hz wave
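Counting repetitions by eye gives 9 / 0.036 s = 250 Hz. The same estimate can be automated. The sketch below uses autocorrelation, a common F0 estimation technique that the slides do not name, on a synthetic stand-in for the [ae] waveform: a 250 Hz wave plus a weaker 1000 Hz wave. The sample rate, amplitudes, search range and function name are all assumptions for illustration.

```python
import math


def autocorr_f0(samples, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate F0 as the lag (within [fmin, fmax]) with the highest autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


SAMPLE_RATE = 16000  # Hz, assumed
n_samples = 1600     # 0.1 s of signal

# Synthetic stand-in for the vowel: strong 250 Hz wave plus weaker 1000 Hz wave.
wave = [
    1.0 * math.sin(2 * math.pi * 250 * i / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE)
    for i in range(n_samples)
]

f0 = autocorr_f0(wave, SAMPLE_RATE)
print(f0)  # estimated fundamental frequency in Hz
```

The 1000 Hz component repeats four times per large cycle, but only a shift of one full 250 Hz period lines up both components at once, so the estimate lands on the fundamental rather than on the faster wave.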

Back to spectrum
• The spectrum represents these frequency components
• Computed by the Fourier transform
• x-axis shows frequency, y-axis shows magnitude (in decibels)
• Peaks at 930 Hz, 1860 Hz, and 3020 Hz.

Seeing formants: the spectrogram


Formants
• Vowels are largely distinguished by 2 characteristic pitches.
• One of them (the higher of the two) goes downward throughout the series iy ih eh ae aa ao ou u
• The other goes up for the first four vowels and then down for the next four.
• These are called the “formants” of the vowels: the lower is the 1st formant, the higher is the 2nd formant.

Spectrogram: spectrum + time dimension

How to read spectrograms
• bab: closure of the lips lowers all formants, so there is a rapid increase in all formants at the beginning of “bab”
• dad: first formant increases, but F2 and F3 fall slightly
• gag: F2 and F3 come together; this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials

From Ladefoged “A Course in Phonetics”

She came back and started again
• 1. lots of high-freq energy
• 3. closure for k
• 4. burst of aspiration for k
• 5. ey vowel; faint 1100 Hz formant is nasalization
• 6. bilabial nasal
• 7. short b closure, voicing barely visible.
• 8. ae; note upward transitions after bilabial stop at beginning
• 9. note F2 and F3 coming together for “k”

From Ladefoged “A Course in Phonetics”

Praat example
• http://www.fon.hum.uva.nl/praat/

Different vowels have different formants
• Every time the vocal cords open and close, the pulse of air from the lungs acts as a sharp tap on the air in the vocal tract.
• This sets the air in the vocal cavity vibrating, producing different harmonics

Vocal Fold Cycles

The vocal source at 150 Hz

The harmonics

Source-filter model of vowels
• Any body of air will vibrate in a way that depends on its size and shape.
• Vocal tract as “amplifier”: amplifies certain harmonics
• Formants are the result of different shapes of the vocal tract.

The oral cavity amplifies some harmonics

Source-filter model of speech production

Input Filter Output

Glottal spectrum; vocal tract frequency response function

Figures and text from Ratree Wayland slide from his website

Source and filter are independent, so:
• Different vowels can have the same pitch
• The same vowel can have different pitch
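The independence of source and filter can be sketched in a few lines. In the toy model below the source contributes harmonics at multiples of F0 and the vocal tract contributes a gain curve with peaks at the formants. The bell-shaped gain function, the flat source amplitude, the 100 Hz bandwidth and the formant values are simplifying assumptions for illustration, not measurements or the lecture's own model.

```python
def harmonics(f0, fmax=3000.0):
    """Frequencies of the source harmonics: f0, 2*f0, 3*f0, ... up to fmax."""
    return [k * f0 for k in range(1, int(fmax // f0) + 1)]


def tract_gain(freq, formants, bandwidth=100.0):
    """Toy filter: one bell-shaped resonance peak per formant (assumed shape)."""
    return sum(1.0 / (1.0 + ((freq - f) / bandwidth) ** 2) for f in formants)


def output_spectrum(f0, formants, fmax=3000.0):
    """Map each harmonic to its amplitude after filtering.

    Assumes a flat source (every harmonic has amplitude 1), which a real
    glottal source does not have.
    """
    return {h: tract_gain(h, formants) for h in harmonics(f0, fmax)}


# Illustrative formant sets for two different vowel shapes (made-up values).
VOWEL_A = [500.0, 1500.0, 2500.0]
VOWEL_B = [300.0, 2300.0, 3000.0]

same_vowel_low = output_spectrum(100.0, VOWEL_A)
same_vowel_high = output_spectrum(150.0, VOWEL_A)
other_vowel = output_spectrum(150.0, VOWEL_B)

# Changing F0 moves the harmonics but leaves the filter curve untouched...
print(same_vowel_low[300.0] == same_vowel_high[300.0])
# ...while changing the vowel keeps the harmonics and reshapes their amplitudes.
print(sorted(other_vowel) == sorted(same_vowel_high))
```

Keeping the two functions separate mirrors the model: `harmonics` depends only on pitch and `tract_gain` only on vocal tract shape.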

From Mark Liberman’s web site

Resonances of the vocal tract
• The human vocal tract as an open tube
• Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube.

[Figure: tube with a closed end and an open end; length 17.5 cm]

Figure from Ladefoged (1996), p. 117

Resonances of the vocal tract
• The human vocal tract as an open tube
• Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube.

[Figure: tube with a closed end and an open end; length 17.5 cm]

Figure from W. Barry, Speech Science slides

Resonances of the vocal tract
• If the vocal tract is a cylindrical tube open at one end:
  • Standing waves form in tubes
  • Waves will resonate if their wavelength corresponds to the dimensions of the tube
  • Constraint: the pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end.
• The next slide shows what lengths of wave can fit into a tube with this constraint
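For a uniform tube closed at one end and open at the other, the waves that satisfy this constraint are the odd quarter-wavelength modes, which gives resonances at F_n = (2n - 1) c / (4L). The sketch below applies that standard formula to the 17.5 cm tube from the figure. The speed of sound (350 m/s) and the function name are assumptions, and a real vocal tract is not a uniform tube.

```python
def tube_resonances(length_m, n_resonances=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    F_n = (2n - 1) * c / (4 * L); speed_of_sound in m/s is an assumed value.
    """
    return [
        (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        for n in range(1, n_resonances + 1)
    ]


# 17.5 cm tube, as in the figure: resonances near 500, 1500 and 2500 Hz.
for n, freq in enumerate(tube_resonances(0.175), start=1):
    print(f"F{n} is about {freq:.0f} Hz")
```

Halving the tube length doubles every resonance, which is one reason shorter vocal tracts tend to have higher formants.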

From Sundberg

Defining Intonation
• Ladd (1996), “Intonational Phonology”
• “The use of suprasegmental phonetic features to convey sentence-level pragmatic meanings”
  • Suprasegmental = above and beyond the segment/phone
    • F0
    • Intensity (energy)
    • Duration
  • Sentence-level pragmatic meanings, i.e. meanings that apply to phrases or utterances as a whole; not lexical stress, not lexical tone.

Pitch track


Pitch is not Frequency
• Pitch is the mental sensation, or perceptual correlate, of F0
• The relationship between pitch and F0 is not linear
  • Human pitch perception is most accurate between 100 Hz and 1000 Hz
  • Linear in this range
  • Logarithmic above 1000 Hz
• The mel scale is one model of this F0-pitch mapping
  • A mel is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in pitch are separated by an equal number of mels
  • Frequency in mels = 1127 ln(1 + f/700)
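The mel formula is easy to check numerically. The sketch below implements it as written, together with its algebraic inverse; the function names are illustrative assumptions.

```python
import math


def hz_to_mel(freq_hz):
    """Frequency in mels = 1127 ln(1 + f / 700)."""
    return 1127.0 * math.log(1.0 + freq_hz / 700.0)


def mel_to_hz(mels):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (math.exp(mels / 1127.0) - 1.0)


for freq in (100, 500, 1000, 2000, 4000, 8000):
    print(f"{freq:>5} Hz -> {hz_to_mel(freq):7.1f} mel")
```

With this constant, 1000 Hz maps to roughly 1000 mel, and equal steps in Hz shrink in mels as frequency rises: the 1000 Hz step from 7000 to 8000 Hz covers far fewer mels than the step from 1000 to 2000 Hz.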

Plot of Intensity

Three aspects of prosody
• Prominence: some syllables/words are more prominent than others
• Structure/boundaries: sentences have prosodic structure
  • Some words group naturally together
  • Others have a noticeable break or disjuncture between them
• Tune: the intonational melody of an utterance.

From Ladd (1996)

Prosodic Boundaries

I met Mary and Elena’s mother at the mall yesterday.
I met Mary and Elena’s mother at the mall yesterday.

French [bread and cheese]
[French bread] and [cheese]

Slide from Jennifer Venditti

Intonational tunes

Yes-No question tune

Rise from the main accent to the end of the sentence.

[Figures: F0 contours on a 50 to 550 Hz axis, one per accent placement]

are LEGUMES a good source of vitamins
are legumes a GOOD source of vitamins
are legumes a good source of VITAMINS


WH-questions

[I know that many natural foods are healthy, but ...]

WHAT are a good source of vitamins

[Figure: F0 contour on a 50 to 400 Hz axis]

WH-questions typically have falling contours, like statements.


Broad focus

“Tell me something about the world.”

legumes are a good source of vitamins

[Figure: F0 contour on a 50 to 400 Hz axis]

In the absence of narrow focus, English tends to mark the first and last ‘content’ words with perceptually prominent accents.

Rising statements

“Tell me something I didn’t already know.”

[... does this statement qualify?]

[Figure: F0 contour on a 50 to 550 Hz axis]

High-rising statements can signal that the speaker is seeking approval.


Yes-No question

are legumes a good source of VITAMINS

[Figure: F0 contour on a 50 to 550 Hz axis]



‘Surprise-redundancy’ tune

[How many times do I have to tell you ...]

[Figure: F0 contour on a 50 to 400 Hz axis]

Low beginning followed by a gradual rise to a high at the end.


‘Contradiction’ tune

“I’ve heard that linguini is a good source of vitamins.”

linguini isn’t a good source of vitamins

[... how could you think that?]

[Figure: F0 contour on a 50 to 400 Hz axis]

Sharp fall at the beginning, flat and low, then rising at the end.


Thinking about F0

Graphic representation of F0

legumes are a good source of VITAMINS

[Figure: F0 (in Hertz, 50 to 400) plotted against time]


The ‘ripples’

legumes are a good source of VITAMINS

[Figure: F0 contour on a 50 to 400 Hz axis, with segments marked [t], [s], [s]]

F0 is not defined for consonants without vocal fold vibration.


The ‘ripples’

legumes are a good source of VITAMINS

[Figure: F0 contour on a 50 to 400 Hz axis, with segments marked [v], [g], [g], [z]]

... and F0 can be perturbed by consonants with an extreme constriction in the vocal tract.


Abstraction of the F0 contour

[Figure: F0 contour on a 100 to 400 Hz axis]

Our perception of the intonation contour abstracts away from these perturbations.


The ‘waves’ and the ‘swells’

[Figure: F0 contour on a 100 to 400 Hz axis]

‘wave’ = accent

‘swell’ = phrase


Prominence: Placement of Pitch Accents

Stress vs. accent
• Stress is a structural property of a word
  • It marks a potential (arbitrary) location for an accent to occur, if there is one.
• Accent is a property of a word in context
  • It is a way to mark intonational prominence in order to ‘highlight’ important words in the discourse.

(x)              (x)        (accented syll)
 x                x         stressed syll
 x          x     x         full vowels
 x  x  x    x  x  x   x     syllables
 vi ta mins Ca li for nia


Stress vs. accent (2)
• The speaker decides to make the word vitamin more prominent by accenting it.
• Lexical stress tells us that this prominence will appear on the first syllable, hence VItamin.
• So prosodic prominence is a function of:
  • lexicon
  • context
• I’m a little surPRISED to hear it CHARacterized as upBEAT

Which word receives an accent?
• It depends on the context.
  • The ‘new’ information in the answer to a question is often accented,
  • while the ‘old’ information is usually not.
• Q1: What types of foods are a good source of vitamins?
  • A1: LEGUMES are a good source of vitamins.
• Q2: Are legumes a source of vitamins?
  • A2: Legumes are a GOOD source of vitamins.
• Q3: I’ve heard that legumes are healthy, but what are they a good source of?
  • A3: Legumes are a good source of VITAMINS.
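The new-versus-old pattern in these three question-answer pairs can be mimicked with a toy heuristic: treat every word of the answer that did not occur in the question as ‘new’ and capitalize it. This is only an illustration of the idea, not a real accent-prediction method; the function names are hypothetical and the questions are typed with straight apostrophes.

```python
import re


def words(text):
    """Lowercased word tokens (letters and straight apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def mark_new_information(question, answer):
    """Uppercase answer words absent from the question; lowercase the rest."""
    given = set(words(question))
    marked = []
    for token in answer.split():
        key = "".join(words(token))  # strip punctuation for the lookup
        is_new = bool(key) and key not in given
        marked.append(token.upper() if is_new else token.lower())
    return " ".join(marked)


ANSWER = "Legumes are a good source of vitamins."
QUESTIONS = [
    "What types of foods are a good source of vitamins?",
    "Are legumes a source of vitamins?",
    "I've heard that legumes are healthy, but what are they a good source of?",
]

for question in QUESTIONS:
    print(mark_new_information(question, ANSWER))
```

Real accent placement depends on much more than word overlap (synonyms, contrast, syntax), so the heuristic breaks down quickly outside examples like these.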


Same ‘tune’, different alignment

The main rise-fall accent (= “I assert this”) shifts locations.

[Figures: F0 contours on a 50 to 400 Hz axis, one per accent placement]

LEGUMES are a good source of vitamins
Legumes are a GOOD source of vitamins



Levels of prominence
• Most phrases have more than one accent
• The last accent in a phrase is perceived as more prominent
  • Called the Nuclear Accent
• Emphatic accents like the nuclear accent are often used for semantic purposes, such as indicating that a word is contrastive, or the semantic focus.
  • The kind of thing you use ***s for in IM, or capitalized letters
  • ‘I know SOMETHING interesting is sure to happen,’ she said to herself.
• Can also have words that are less prominent than usual
  • Reduced words, especially function words.
• Often use 4 classes of prominence:
  • Emphatic accent, pitch accent, unaccented, reduced

Intonational phrasing/boundaries

A single intonation phrase

[Figure: F0 contour on a 50 to 400 Hz axis]

Broad focus statement consisting of one intonation phrase (that is, one intonation tune spans the whole unit).


Multiple phrases

[Figure: F0 contour on a 50 to 400 Hz axis]

Utterances can be ‘chunked’ up into smaller phrases in order to signal the importance of information in each unit.


Phrasing sometimes helps disambiguate
• Global ambiguity:

The old men and women stayed home.

Sally saw the man with the binoculars.

John doesn’t drink because he’s unhappy.


Phrasing can disambiguate
• Global ambiguity:

The old men and women stayed home.
The old men % and women % stayed home.

Sally saw % the man with the binoculars.
Sally saw the man % with the binoculars.

John doesn’t drink because he’s unhappy.
John doesn’t drink % because he’s unhappy.


Phrasing sometimes helps disambiguate
• Temporary ambiguity:

When Madonna sings the song ...
When Madonna sings the song is a hit.

When Madonna sings % the song is a hit.
When Madonna sings the song % it’s a hit.

[from Speer & Kjelgaard (1992)]



[Figure: F0 contour on a 50 to 400 Hz axis; labels: “Mary & Elena’s mother”, “mall”]

I met Mary and Elena’s mother at the mall yesterday

One intonation phrase with relatively flat overall pitch range.



[Figure: F0 contour on a 50 to 400 Hz axis; labels: “Mary”, “Elena’s mother”, “mall”]

I met Mary and Elena’s mother at the mall yesterday

Separate phrases, with expanded pitch movements.


Using Intonation in Spoken Language Processing
1) Prominence/Accent: tells us about the focus of the utterance
2) Tune: whether the utterance is a question or a statement; important for affect extraction
3) Boundaries: can help parsing

More phonetic structure
• Syllables
  • Composed of vowels and consonants. Not well defined. Something like a “vowel nucleus with some of its surrounding consonants”.

More phonetic structure
• Stress
  • Some syllables have more energy than others
  • Stressed syllables versus unstressed syllables
    • (an) ‘INsult vs. (to) in’SULT
    • (an) ‘OBject vs. (to) ob’JECT
  • Simple model: every multi-syllabic word has one syllable with “primary stress”
    • We can represent it by using the number “1” on the vowel (and an implicit unmarking on the other vowels)
    • “table”: t ey1 b ax l
    • “machine”: m ax sh iy1 n
  • Also possible: “secondary stress”, marked with a “2”
    • ih2 n f ax r m ey1 sh ax n
  • Third category: reduced: schwa
    • ax
