CS 224S / LINGUIST 285
Spoken Language Processing
Andrew Maas, Stanford University
Spring 2017
Lecture 2: Phonetics
Original slides by Dan Jurafsky
Homework 1
• Out after lecture today. Due in 1 week.
• PDF handout linked on website syllabus.
• You’ll need to download PRAAT; details are in the homework.
Phonetics
• ARPAbet
  • An alphabet for transcribing American English phonetic sounds.
• Articulatory Phonetics
  • How speech sounds are made by articulators (moving organs) in the mouth.
• Acoustic Phonetics
  • Acoustic properties of speech sounds
ARPAbet
• http://www.stanford.edu/class/cs224s/arpabet.html
• The CMU Pronouncing Dictionary
  • http://www.speech.cs.cmu.edu/cgi-bin/cmudict
• What about other languages?
  • International Phonetic Alphabet:
  • http://en.wikipedia.org/wiki/International_Phonetic_Alphabet
ARPAbet Vowels

 #  b_d         ARPA    #  b_d    ARPA
 1  bead        iy      9  bode   ow
 2  bid         ih     10  booed  uw
 3  bayed       ey     11  bud    ah
 4  bed         eh     12  bird   er
 5  bad         ae     13  bide   ay
 6  bod(y)      aa     14  bowed  aw
 7  bawd        ao     15  Boyd   oy
 8  Budd(hist)  uh
https://corpus.linguistics.berkeley.edu/acip/
Note: Many speakers pronounce Buddhist with the vowel uw as in booed, so for them [uh] is instead the vowel in “put” or “book”.
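A minimal sketch of the vowel table as a lookup, in case it helps when transcribing by hand. The dictionary name and helper function are illustrative choices, not part of the course materials; the word/symbol pairs are copied from the ARPAbet vowel table.

```python
# Example word -> ARPAbet vowel symbol, copied from the ARPAbet vowel table.
ARPABET_VOWELS = {
    "bead": "iy", "bid": "ih", "bayed": "ey", "bed": "eh",
    "bad": "ae", "bod(y)": "aa", "bawd": "ao", "Budd(hist)": "uh",
    "bode": "ow", "booed": "uw", "bud": "ah", "bird": "er",
    "bide": "ay", "bowed": "aw", "Boyd": "oy",
}


def words_for_vowel(symbol):
    """Return the example words whose vowel is the given ARPAbet symbol."""
    return [word for word, vowel in ARPABET_VOWELS.items() if vowel == symbol]


if __name__ == "__main__":
    print(words_for_vowel("iy"))
    print(ARPABET_VOWELS["bird"])
```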
The Speech Chain (Denes and Pinson)
[Figure labels: SPEAKER; HEARER]
Speech Production Process
• Respiration:
  • We (normally) speak while breathing out. Respiration provides airflow: the “pulmonic egressive airstream”.
• Phonation
  • Airstream sets vocal folds in motion. Vibration of vocal folds produces sounds. Sound is then modulated by:
• Articulation and Resonance
  • Shape of vocal tract, characterized by:
    • Oral tract
      • Teeth, soft palate (velum), hard palate
      • Tongue, lips, uvula
    • Nasal tract
Text adapted from Sharon Rose
[Figure labels: Nasal Cavity; Pharynx; Vocal Folds (within the Larynx); Trachea; Lungs]
Text copyright J. J. Ohala, Sept 2001, from Sharon Rose slide
Sagittal section of the vocal tract (Techmer 1880)
From Mark Liberman’s website, from Ultimate Visual Dictionary
From Mark Liberman’s Web Site, from Language Files (7th ed)
Figure of Ken Stevens, labels from Peter Ladefoged’s web site
USC’s SAIL Lab, Shri Narayanan
Tamil
Larynx and Vocal Folds
• The Larynx (voice box)
  • A structure made of cartilage and muscle
  • Located above the trachea (windpipe) and below the pharynx (throat)
  • Contains the vocal folds
  • (adjective for larynx: laryngeal)
• Vocal Folds (older term: vocal cords)
  • Two bands of muscle and tissue in the larynx
  • Can be set in motion to produce sound (voicing)
Text from slides by Sharon Rose UCSD LING 111 handout
The larynx, external structure, from front
Figure thanks to John Coleman
Vertical slice through larynx, as seen from back
Figure thanks to John Coleman
Voicing:
• Air comes up from lungs
• Forces its way through vocal cords, pushing them open (2, 3, 4)
• This causes air pressure in glottis to fall, since:
  • when gas runs through a constricted passage, its velocity increases (Venturi tube effect)
  • this increase in velocity results in a drop in pressure (Bernoulli principle)
• Because of drop in pressure, vocal cords snap together again (6-10)
• Single cycle: ~1/100 of a second.
Figure & text from John Coleman’s web site
Voicelessness
• When vocal cords are open, air passes through unobstructed
• Voiceless sounds: p/t/k/s/f/sh/th/ch
• If the air moves very quickly, the turbulence causes a different kind of phonation: whisper
Vocal folds open during breathing
From Mark Liberman’s web site, from Ultimate Visual Dictionary
Vocal Fold Vibration
UCLA Phonetics Lab Demo
Consonants and Vowels
• Consonants: phonetically, sounds with audible noise produced by a constriction
• Vowels: phonetically, sounds with no audible noise produced by a constriction
• (it’s more complicated than this, since we have to consider syllabic function, but this will do for now)
Text adapted from John Coleman
Place of Articulation
• Consonants are classified according to the location where the airflow is most constricted.
• This is called place of articulation
• Three major kinds of place of articulation:
  • Labial (with lips)
  • Coronal (using tip or blade of tongue)
  • Dorsal (using back of tongue)
Places of articulation
[Figure labels: labial; dental; alveolar; post-alveolar/palatal; velar; uvular; pharyngeal; laryngeal/glottal]
Figure thanks to Jennifer Venditti
Coronal place
[Figure labels: dental; alveolar; post-alveolar/palatal]
Figure thanks to Jennifer Venditti
Dental: th/dh
Alveolar: t/d/s/z/l
Post: sh/zh/y
Dorsal Place
[Figure labels: velar; uvular; pharyngeal]
Figure thanks to Jennifer Venditti
Velar: k/g/ng
Manner of Articulation
• Stop: complete closure of articulators, so no air escapes through mouth
  • Oral stop: palate is raised, no air escapes through nose. Air pressure builds up behind closure, explodes when released
    • p, t, k, b, d, g
  • Nasal stop: oral closure, but palate is lowered, air escapes through nose.
    • m, n, ng
Oral vs. Nasal Sounds
Thanks to Jong-bok Kim for this figure!
More on Manner of articulation of consonants
• Fricatives
  • Close approximation of two articulators, resulting in turbulent airflow between them, producing a hissing sound.
  • f, v, s, z, th, dh
• Approximant
  • Not quite-so-close approximation of two articulators, so no turbulence
  • y, r
• Lateral approximant
  • Obstruction of airstream along center of oral tract, with opening around sides of tongue.
  • l
Text from Ladefoged “A Course in Phonetics”
More on manner of articulation of consonants
• Tap or flap
  • Tongue makes a single tap against the alveolar ridge
  • dx in “butter”
• Affricate
  • Stop immediately followed by a fricative
  • ch, jh
Articulatory parameters for English consonants (in ARPAbet)

                          PLACE OF ARTICULATION
MANNER     bilabial  labiodental  interdental  alveolar  palatal  velar  glottal
stop       p b                                 t d                k g    q
fric.                f v          th dh        s z       sh zh           h
affric.                                                  ch jh
nasal      m                                   n                  ng
approx     w                                   l/r       y
flap                                           dx

VOICING: in each pair, the voiceless consonant is on the left and the voiced one on the right.
Table from Jennifer Venditti
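One way to make the three articulatory parameters (place, manner, voicing) concrete is a small lookup. This is only a sketch: it covers just the voiceless/voiced pairs from the consonant table, and the names `CONSONANT_PAIRS` and `describe` are made up for illustration.

```python
# (place, manner) -> (voiceless, voiced) ARPAbet symbols.
# Only the paired stops, fricatives, and affricates from the table are listed.
CONSONANT_PAIRS = {
    ("bilabial", "stop"): ("p", "b"),
    ("alveolar", "stop"): ("t", "d"),
    ("velar", "stop"): ("k", "g"),
    ("labiodental", "fricative"): ("f", "v"),
    ("interdental", "fricative"): ("th", "dh"),
    ("alveolar", "fricative"): ("s", "z"),
    ("palatal", "fricative"): ("sh", "zh"),
    ("palatal", "affricate"): ("ch", "jh"),
}


def describe(symbol):
    """Return (voicing, place, manner) for a paired consonant symbol."""
    for (place, manner), (voiceless, voiced) in CONSONANT_PAIRS.items():
        if symbol == voiceless:
            return ("voiceless", place, manner)
        if symbol == voiced:
            return ("voiced", place, manner)
    raise KeyError(symbol)


if __name__ == "__main__":
    print(describe("zh"))
    print(describe("p"))
```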
Tongue position for vowels
Vowels
[Figure panels: IY, AA, UW]
Fig. from Eric Keller
American English Vowel Space
[Figure: vowel chart; horizontal axis FRONT to BACK, vertical axis HIGH to LOW; vowels shown: iy, ih, eh, ae, aa, ao, uw, uh, ah, ax, ix, ux]
Figure from Jennifer Venditti. Red: Vowels, Blue: Diphthongs
[iy] vs. [uw]
Figure from Jennifer Venditti, from a lecture given by Rochelle Newman
[ae] vs. [aa]
Figure from Jennifer Venditti, from a lecture given by Rochelle Newman
Where to go for more info
• Ladefoged, Peter. 1993. A Course in Phonetics
• Mark Liberman’s site
  • http://www.ling.upenn.edu/courses/Spring_2001/ling001/phonetics.html
• John Coleman’s site
  • http://www.phon.ox.ac.uk/%7Ejcoleman/mst_mphil_phonetics_course_index.html
• Jennifer Smith’s resource page
  • http://www.unc.edu/~jlsmith/pht-url.html
Sound waves are longitudinal waves
[Figure labels: particle displacement; pressure]
Figures from Dan Russell
Remember High School Physics
Simple Periodic Waves (sine waves)
[Figure: one cycle of a sine wave; x-axis: Time (s), 0 to 0.02; y-axis: –0.99 to 0.99]
• Characterized by:
  • period: T
  • amplitude: A
  • phase: φ
• Fundamental frequency in cycles per second, or Hz
  • F0 = 1/T
To listen to sine waves: http://www.szynalski.com/tone-generator/
Simple periodic waves
• Computing the frequency of a wave:
  • 5 cycles in .5 seconds = 10 cycles/second = 10 Hz
• Amplitude:
  • 1
• Equation:
  • Y = A sin(2πft)
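A small sketch of the equation Y = A sin(2πft) for the wave on this slide (A = 1, f = 10 Hz, 0.5 seconds). The sample rate of 1000 samples per second and the function names are assumptions chosen for illustration; cycles are counted as upward zero crossings.

```python
import math


def sine_wave(amplitude, freq_hz, duration_s, sample_rate=1000):
    """Sample y = A * sin(2*pi*f*t) at evenly spaced times t = i / sample_rate."""
    n_samples = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def count_cycles(samples):
    """Count upward zero crossings; each one marks the start of a cycle."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a <= 0 < b)


if __name__ == "__main__":
    wave = sine_wave(amplitude=1.0, freq_hz=10.0, duration_s=0.5)
    cycles = count_cycles(wave)
    print(cycles, "cycles in 0.5 s =", cycles / 0.5, "Hz")
```

Dividing the cycle count by the duration recovers the frequency, which is the same arithmetic as on the slide.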
Speech sound waves
• A little piece from the waveform of the vowel [iy]
• X axis: time.
• Y axis:
  • Amplitude = air pressure at that time
  • +: compression
  • 0: normal air pressure
  • -: rarefaction
Back to waves: Fundamental frequency
• Waveform of the vowel [iy]
• Frequency: 10 repetitions / .03875 seconds = 258 Hz
• This is the rate at which the vocal folds vibrate, hence voicing
• Each peak corresponds to an opening of the vocal folds
• The lowest frequency of the complex wave is called the fundamental frequency of the wave, or F0
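The arithmetic above is just repetitions divided by elapsed time, which is the same relation as F0 = 1/T. A minimal sketch (the function names are illustrative):

```python
def f0_from_repetitions(n_repetitions, duration_s):
    """Fundamental frequency in Hz from counted repetitions over a time span."""
    return n_repetitions / duration_s


def f0_from_period(period_s):
    """Fundamental frequency in Hz from the period T of one cycle: F0 = 1/T."""
    return 1.0 / period_s


if __name__ == "__main__":
    # The [iy] waveform: 10 repetitions in 0.03875 seconds.
    print(round(f0_from_repetitions(10, 0.03875)))
    # One vocal fold cycle of about 1/100 of a second.
    print(round(f0_from_period(1 / 100)))
```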
She just had a baby
• Note that vowels all have regular amplitude peaks
• Stop consonant
  • Closure followed by release
  • Notice the silence followed by slight bursts of emphasis: very clear for [b] of “baby”
• Fricative: noisy. [sh] of “she” at beginning
Fricative
Back to freshman physics: Waves have different frequencies
[Figure: two sine waves, 100 Hz and 1000 Hz; x-axis: Time (s), 0 to 0.02; y-axis: –0.99 to 0.99]
Complex waves: Adding a 100 Hz and 1000 Hz wave together
[Figure: summed waveform; x-axis: Time (s), 0 to 0.05; y-axis: –0.9654 to 0.99]
Spectrum
[Figure: spectrum with components at 100 and 1000; x-axis: Frequency in Hz; y-axis: Amplitude]
Frequency components (100 and 1000 Hz) on x-axis
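To see where those two components come from, here is a sketch that adds a 100 Hz and a 1000 Hz sine wave and then measures the amplitude at each frequency with a plain discrete Fourier transform. The sample rate (8000 Hz), the signal length (0.1 s), and all names are assumptions for illustration; a real analysis would normally use an FFT library.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)
N = 800             # 0.1 s of signal, so DFT bins are 10 Hz apart

# Complex wave: sum of a 100 Hz and a 1000 Hz sine, each with amplitude 1.
signal = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE)
          + math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE)
          for n in range(N)]


def dft_magnitude(x, k):
    """Magnitude of DFT bin k, scaled so a unit-amplitude sine gives about 1.0."""
    total = sum(x[n] * cmath.exp(-2j * math.pi * k * n / len(x))
                for n in range(len(x)))
    return 2 * abs(total) / len(x)


def bin_to_hz(k):
    """Center frequency of DFT bin k."""
    return k * SAMPLE_RATE / N


# Amplitude spectrum up to half the sample rate.
spectrum = {bin_to_hz(k): dft_magnitude(signal, k) for k in range(N // 2)}
peaks = sorted(freq for freq, mag in spectrum.items() if mag > 0.5)

if __name__ == "__main__":
    print(peaks)
```

Only the bins at the two component frequencies carry significant amplitude; every other bin is close to zero.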
Spectra continued
• Fourier analysis: any wave can be represented as the (infinite) sum of sine waves of different frequencies (amplitude, phase)
Spectrum of one instant in an actual soundwave: many components across frequency range
[Figure: spectrum; x-axis: Frequency (Hz), 0 to 5000; y-axis: 0 to 40]
Part of [ae] waveform from “had”
• Note complex wave repeating nine times in figure
• Plus smaller waves which repeat 4 times for every large pattern
• Large wave has frequency of 250 Hz (9 times in .036 seconds)
• Small wave roughly 4 times this, or roughly 1000 Hz
• Two little tiny waves on top of peak of 1000 Hz waves
Back to spectrum
• Spectrum represents these freq components
• Computed by Fourier transform
• x-axis shows frequency, y-axis shows magnitude (in decibels)
• Peaks at 930 Hz, 1860 Hz, and 3020 Hz.
Seeing formants: the spectrogram
Formants
• Vowels largely distinguished by 2 characteristic pitches.
• One of them (the higher of the two) goes downward throughout the series iy ih eh ae aa ao ou u
• The other goes up for the first four vowels and then down for the next four.
• These are called “formants” of the vowels; lower is 1st formant, higher is 2nd formant.
Spectrogram: spectrum + time dimension
How to read spectrograms
• bab: closure of lips lowers all formants: so rapid increase in all formants at beginning of “bab”
• dad: first formant increases, but F2 and F3 slight fall
• gag: F2 and F3 come together: this is a characteristic of velars. Formant transitions take longer in velars than in alveolars or labials
From Ladefoged “A Course in Phonetics”
She came back and started again
• 1. lots of high-freq energy
• 3. closure for k
• 4. burst of aspiration for k
• 5. ey vowel; faint 1100 Hz formant is nasalization
• 6. bilabial nasal
• 7. short b closure, voicing barely visible.
• 8. ae; note upward transitions after bilabial stop at beginning
• 9. note F2 and F3 coming together for “k”
From Ladefoged “A Course in Phonetics”
Praat example
• http://www.fon.hum.uva.nl/praat/
Different vowels have different formants
• Every time the vocal cords open and close, the pulse of air from the lungs acts as a sharp tap on the air in the vocal tract.
• This sets the air in the vocal cavity vibrating, producing different harmonics
Vocal Fold Cycles
The vocal source at 150 Hz
The harmonics
Source filter model of vowels
• Any body of air will vibrate in a way that depends on its size and shape.
• Vocal tract as “amplifier”; amplifies certain harmonics
• Formants are result of different shapes of vocal tract.
The oral cavity amplifies some harmonics
Source-filter model of speech production
[Figure: Input (glottal spectrum) → Filter (vocal tract frequency response function) → Output]
Figures and text from Ratree Wayland slide from his website
Source and filter are independent, so:
• Different vowels can have same pitch
• The same vowel can have different pitch
From Mark Liberman’s Web site
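A rough sketch of why source and filter are independent. The source contributes harmonics at integer multiples of F0; the filter contributes resonances that stay put when F0 changes. The formant values below are hypothetical round numbers for illustration, and `nearest_harmonic` is a simplification of the real filtering, which scales every harmonic by the vocal tract's frequency response.

```python
def harmonics(f0, max_hz):
    """Harmonics of the glottal source: integer multiples of F0 up to max_hz."""
    return [f0 * k for k in range(1, int(max_hz // f0) + 1)]


def nearest_harmonic(formant_hz, f0):
    """The source harmonic that lies closest to a given vocal tract resonance."""
    k = max(1, round(formant_hz / f0))
    return k * f0


# Hypothetical formant frequencies (Hz) for one fixed vocal tract shape.
FORMANTS = [500.0, 1500.0, 2500.0]

if __name__ == "__main__":
    for f0 in (100.0, 150.0):
        # Same vowel (same formants), different pitch (different F0).
        print(f0, [nearest_harmonic(f, f0) for f in FORMANTS])
```

Changing F0 moves the harmonics, but the resonances they are measured against do not move, which is the sense in which the same vowel can have different pitch.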
Resonances of the vocal tract
• The human vocal tract as an open tube
• Air in a tube of a given length will tend to vibrate at resonance frequency of tube.
[Figure: tube with a closed end and an open end; length 17.5 cm]
Figure from Ladefoged (1996) p 117
Resonances of the vocal tract
[Figure: tube with a closed end and an open end; length 17.5 cm]
Figure from W. Barry Speech Science slides
Resonances of the vocal tract
• If vocal tract is cylindrical tube open at one end
  • Standing waves form in tubes
  • Waves will resonate if their wavelength corresponds to dimensions of tube
  • Constraint: Pressure differential should be maximal at (closed) glottal end and minimal at (open) lip end.
• Next slide shows what kind of length of waves can fit into a tube with this constraint
From Sundberg
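For a uniform tube closed at one end and open at the other, the waves that satisfy this constraint have wavelengths of 4L, 4L/3, 4L/5, and so on, giving resonances F_n = (2n - 1)c / 4L. The sketch below plugs in the 17.5 cm tube length; the speed of sound (about 350 m/s) is an assumed round value, not taken from the slides.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed: roughly 350 m/s, expressed in cm/s
TUBE_LENGTH_CM = 17.5          # tube length used in the vocal tract figure


def tube_resonances(length_cm, count=3, c=SPEED_OF_SOUND_CM_S):
    """Resonances (Hz) of a tube closed at one end: F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * c / (4 * length_cm) for n in range(1, count + 1)]


if __name__ == "__main__":
    print(tube_resonances(TUBE_LENGTH_CM))
```

Under these assumptions the first three resonances come out at 500, 1500, and 2500 Hz.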
Defining Intonation
• Ladd (1996) “Intonational phonology”
• “The use of suprasegmental phonetic features
  • Suprasegmental = above & beyond the segment/phone
    • F0
    • Intensity (energy)
    • Duration
• to convey sentence-level pragmatic meanings”
  • I.e. meanings that apply to phrases or utterances as a whole, not lexical stress, not lexical tone.
Pitch track
Pitch is not Frequency
• Pitch is the mental sensation or perceptual correlate of F0
• Relationship between pitch and F0 is not linear;
  • human pitch perception is most accurate between 100 Hz and 1000 Hz.
  • Linear in this range
  • Logarithmic above 1000 Hz
• Mel scale is one model of this F0-pitch mapping
  • A mel is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in pitch are separated by an equal number of mels
  • Frequency in mels = 1127 ln(1 + f/700)
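The mel formula is easy to try out directly. In this sketch `hz_to_mel` is the formula from the slide; `mel_to_hz` is its algebraic inverse, added here for convenience and not part of the slides.

```python
import math


def hz_to_mel(f_hz):
    """Frequency in mels = 1127 * ln(1 + f / 700)."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


if __name__ == "__main__":
    for f in (100, 500, 1000, 2000, 4000):
        print(f, "Hz ->", round(hz_to_mel(f)), "mels")
```

With these constants 1000 Hz maps to approximately 1000 mels, and equal steps in Hz above that point correspond to progressively smaller steps in mels.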
Plot of Intensity
Three aspects of prosody
• Prominence: some syllables/words are more prominent than others
• Structure/boundaries: sentences have prosodic structure
  • Some words group naturally together
  • Others have a noticeable break or disjuncture between them
• Tune: the intonational melody of an utterance.
From Ladd (1996)
Prosodic Boundaries
I met Mary and Elena’s mother at the mall yesterday.
I met Mary and Elena’s mother at the mall yesterday.
French [bread and cheese]
[French bread] and [cheese]
Slide from Jennifer Venditti
Intonational tunes
Yes-No question tune
are LEGUMES a good source of vitamins
Rise from the main accent to the end of the sentence.
[Figure: pitch contour; y-axis 50 to 550]
Slide from Jennifer Venditti
Yes-No question tune
are legumes a GOOD source of vitamins
Rise from the main accent to the end of the sentence.
[Figure: pitch contour; y-axis 50 to 550]
Slide from Jennifer Venditti
Yes-No question tune
are legumes a good source of VITAMINS
Rise from the main accent to the end of the sentence.
[Figure: pitch contour; y-axis 50 to 550]
Slide from Jennifer Venditti
WH-questions
[Figure: pitch contour; y-axis 50 to 400]
WHAT are a good source of vitamins
WH-questions typically have falling contours, like statements.
[I know that many natural foods are healthy, but ...]
Slide from Jennifer Venditti
Broad focus
legumes are a good source of vitamins
“Tell me something about the world.”
[Figure: pitch contour; y-axis 50 to 400]
Slide from Jennifer Venditti
In the absence of narrow focus, English tends to mark the first and last ‘content’ words with perceptually prominent accents.
Rising statements
[Figure: pitch contour; y-axis 50 to 550]
legumes are a good source of vitamins
High-rising statements can signal that the speaker is seeking approval.
“Tell me something I didn’t already know.”
[... does this statement qualify?]
Slide from Jennifer Venditti
Yes-No question
[Figure: pitch contour; y-axis 50 to 550]
are legumes a good source of VITAMINS
Rise from the main accent to the end of the sentence.
Slide from Jennifer Venditti
‘Surprise-redundancy’ tune
legumes are a good source of vitamins
Low beginning followed by a gradual rise to a high at the end.
[How many times do I have to tell you ...]
[Figure: pitch contour; y-axis 50 to 400]
Slide from Jennifer Venditti
‘Contradiction’ tune
[Figure: pitch contour; y-axis 50 to 400]
linguini isn’t a good source of vitamins
Sharp fall at the beginning, flat and low, then rising at the end.
“I’ve heard that linguini is a good source of vitamins.”
[... how could you think that?]
Slide from Jennifer Venditti
Thinking about F0
Graphic representation of F0
legumes are a good source of VITAMINS
[Figure: F0 contour; x-axis: time; y-axis: F0 (in Hertz), 50 to 400]
Slide from Jennifer Venditti
The ‘ripples’
legumes are a good source of VITAMINS
[Figure: F0 contour; y-axis 50 to 400; segments labeled [t], [s], [s]]
F0 is not defined for consonants without vocal fold vibration.
Slide from Jennifer Venditti
The ‘ripples’
legumes are a good source of VITAMINS
[Figure: F0 contour; y-axis 50 to 400; segments labeled [v], [g], [g], [z]]
... and F0 can be perturbed by consonants with an extreme constriction in the vocal tract.
Slide from Jennifer Venditti
Abstraction of the F0 contour
legumes are a good source of VITAMINS
[Figure: F0 contour; y-axis 50 to 400]
Our perception of the intonation contour abstracts away from these perturbations.
Slide from Jennifer Venditti
The ‘waves’ and the ‘swells’
legumes are a good source of VITAMINS
[Figure: F0 contour; y-axis 50 to 400]
‘wave’ = accent
‘swell’ = phrase
Slide from Jennifer Venditti
Prominence: Placement of Pitch Accents
Stress vs. accent
• Stress is a structural property of a word
  • it marks a potential (arbitrary) location for an accent to occur, if there is one.
• Accent is a property of a word in context
  • it is a way to mark intonational prominence in order to ‘highlight’ important words in the discourse.

(x)                   (x)        (accented syll)
 x                     x         stressed syll
 x            x        x         full vowels
 x   x   x    x   x    x    x    syllables
 vi  ta  mins Ca  li   for  nia
Slide from Jennifer Venditti
Stress vs. accent (2)
• The speaker decides to make the word vitamin more prominent by accenting it.
• Lexical stress tells us that this prominence will appear on the first syllable, hence VItamin.
• So prosodic prominence is a function of
  • lexicon
  • context
• I’m a little surPRISED to hear it CHARacterized as upBEAT
Which word receives an accent?
• It depends on the context.
  • The ‘new’ information in the answer to a question is often accented
  • while the ‘old’ information is usually not.
• Q1: What types of foods are a good source of vitamins?
• A1: LEGUMES are a good source of vitamins.
• Q2: Are legumes a source of vitamins?
• A2: Legumes are a GOOD source of vitamins.
• Q3: I’ve heard that legumes are healthy, but what are they a good source of?
• A3: Legumes are a good source of VITAMINS.
Slide from Jennifer Venditti
Same ‘tune’, different alignment
[Figure: pitch contour; y-axis 50 to 400]
LEGUMES are a good source of vitamins
The main rise-fall accent (= “I assert this”) shifts locations.
Slide from Jennifer Venditti
Same ‘tune’, different alignment
[Figure: pitch contour; y-axis 50 to 400]
Legumes are a GOOD source of vitamins
The main rise-fall accent (= “I assert this”) shifts locations.
Slide from Jennifer Venditti
Same ‘tune’, different alignment
legumes are a good source of VITAMINS
[Figure: pitch contour; y-axis 50 to 400]
The main rise-fall accent (= “I assert this”) shifts locations.
Slide from Jennifer Venditti
Levels of prominence
• Most phrases have more than one accent
• The last accent in a phrase is perceived as more prominent
  • Called the Nuclear Accent
• Emphatic accents like nuclear accent often used for semantic purposes, such as indicating that a word is contrastive, or the semantic focus.
  • The kind of thing you mark with ***s in IM, or capitalized letters
  • ‘I know SOMETHING interesting is sure to happen,’ she said to herself.
• Can also have words that are less prominent than usual
  • Reduced words, especially function words.
• Often use 4 classes of prominence:
  • Emphatic accent, pitch accent, unaccented, reduced
Intonational phrasing/boundaries
A single intonation phrase
[Figure: pitch contour; y-axis 50 to 400]
legumes are a good source of vitamins
Broad focus statement consisting of one intonation phrase(that is, one intonation tune spans the whole unit).
Slide from Jennifer Venditti
Multiple phrases
[Figure: pitch contour; y-axis 50 to 400]
legumes are a good source of vitamins
Utterances can be ‘chunked’ up into smaller phrases in order to signal the importance of information in each unit.
Slide from Jennifer Venditti
Phrasing sometimes helps disambiguate
• Global ambiguity:
The old men and women stayed home.
Sally saw the man with the binoculars.
John doesn’t drink because he’s unhappy.
Slide from Jennifer Venditti
Phrasing can disambiguate
• Global ambiguity:
The old men and women stayed home.
The old men % and women % stayed home.
Sally saw % the man with the binoculars.
Sally saw the man % with the binoculars.
John doesn’t drink because he’s unhappy.
John doesn’t drink % because he’s unhappy.
Slide from Jennifer Venditti
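If utterances are annotated with % at intonational phrase boundaries, as in the examples above, the phrase grouping can be read off mechanically. A minimal sketch; the function name is illustrative, and treating % as a plain-text delimiter is an assumption about how such annotations might be stored.

```python
def split_phrases(utterance, boundary="%"):
    """Split an utterance into intonation phrases at each boundary mark."""
    return [chunk.strip() for chunk in utterance.split(boundary) if chunk.strip()]


if __name__ == "__main__":
    print(split_phrases("The old men % and women % stayed home."))
    print(split_phrases("Sally saw the man % with the binoculars."))
```

The two placements of % in the "Sally saw" example yield different groupings, which is what lets a parser choose between the two readings.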
Phrasing sometimes helps disambiguate
• Temporary ambiguity:
  When Madonna sings the song ...
Slide from Jennifer Venditti
Phrasing sometimes helps disambiguate
• Temporary ambiguity:
  When Madonna sings the song is a hit.
Slide from Jennifer Venditti
Phrasing sometimes helps disambiguate
• Temporary ambiguity:
  When Madonna sings % the song is a hit.
  When Madonna sings the song % it’s a hit.
[from Speer & Kjelgaard (1992)]
Slide from Jennifer Venditti
Phrasing sometimes helps disambiguate
[Figure: pitch contour; y-axis 50 to 400]
I met Mary and Elena’s mother at the mall yesterday
[Figure labels: Mary & Elena’s mother; mall]
One intonation phrase with relatively flat overall pitch range.
Slide from Jennifer Venditti
Phrasing sometimes helps disambiguate
[Figure: pitch contour; y-axis 50 to 400]
I met Mary and Elena’s mother at the mall yesterday
[Figure labels: Mary; Elena’s mother; mall]
Separate phrases, with expanded pitch movements.
Slide from Jennifer Venditti
Using Intonation in Spoken Language Processing
1) Prominence/Accent: Tells us about focus of utterance
2) Tune: whether utterance is question/statement, important for affect extraction
3) Boundaries: can help parsing
More phonetic structure
• Syllables
  • Composed of vowels and consonants. Not well defined. Something like a “vowel nucleus with some of its surrounding consonants”.
More phonetic structure
• Stress
  • Some syllables have more energy than others
  • Stressed syllables versus unstressed syllables
    • (an) ‘INsult vs. (to) in’SULT
    • (an) ‘OBject vs. (to) ob’JECT
  • Simple model: every multi-syllabic word has one syllable with:
    • “primary stress”
    • We can represent by using the number “1” on the vowel (and an implicit unmarking on the other vowels)
      • “table”: t ey1 b ax l
      • “machine”: m ax sh iy1 n
  • Also possible: “secondary stress”, marked with a “2”
    • ih-2 n f axr m ey-1 sh ax n
  • Third category: reduced: schwa:
    • ax
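A small sketch of reading stress marks off a transcription written this way. It assumes phones are separated by spaces and that a trailing 1 or 2 (optionally preceded by a hyphen) marks primary or secondary stress; the function names are illustrative.

```python
def parse_stress(pronunciation):
    """Split 't ey1 b ax l' into (phone, stress) pairs; stress is 1, 2, or None."""
    pairs = []
    for token in pronunciation.split():
        if token[-1] in "12":
            # Drop the digit and any hyphen joining it to the phone (e.g. 'ey-1').
            pairs.append((token[:-1].rstrip("-"), int(token[-1])))
        else:
            pairs.append((token, None))
    return pairs


def primary_stress_vowel(pronunciation):
    """Return the vowel carrying primary stress, or None if none is marked."""
    for phone, stress in parse_stress(pronunciation):
        if stress == 1:
            return phone
    return None


if __name__ == "__main__":
    print(primary_stress_vowel("t ey1 b ax l"))
    print(primary_stress_vowel("m ax sh iy1 n"))
    print(parse_stress("ih-2 n f axr m ey-1 sh ax n"))
```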