12/8/20151 introduction to the course and to speech synthesis julia hirschberg
04/21/23 1
Introduction to the Course and to Speech Synthesis
Julia Hirschberg
Applications for Speech Technologies
• Speech synthesis (TTS): AT&T, IBM (Jeopardy 2/14-16), SitePal
• Speech recognition (ASR): Nuance
• Speech to Speech Translation
• Speech Search: Google Voice Search
• Homeland Security: Deception Detection, Dialect and Language ID, Speaker ID, trust
• Spoken Dialogue Systems:
– Over-the-phone services: Voice Actions for Android
– Tutoring systems: KTH’s Ville
– Amtrak Julie (or here)
Text-to-Speech Synthesis
• Course syllabus and readings (Jurafsky & Martin, Chapter 8, link from the syllabus)
• Course project:
– Build your own SDS using the Festival and HTK Toolkits, or
– Evaluate 3 current TTS systems to see how better knowledge of linguistics could improve them
– Honor policy on syllabus
Speech Synthesis: Then and Now
• Then: Early speech synthesizers
• Now: Overview of Modern TTS Systems
• Think about:
– What needs to be modeled to create artificial speech?
– How do we evaluate a synthesizer?
The First ‘Speaking Machine’
• Wolfgang von Kempelen, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, 1791 (machine still in the Deutsches Museum and playable)
• First to produce whole words, phrases – in many languages
• First experimental phonetician:
– Therapeutic applications: how do humans produce speech?
• First machine which could produce whole words
• Took 3 weeks to learn to ‘play’ in Latin, French or Italian
– German harder due to consonant clusters, closed syllables
• Parts:
– Bellows: lungs (operated with right forearm; counterweight for inhale; auxiliary bellows to simulate stop release)
– Wind box, mouth (cover for unvoiced sounds), nostrils (cover except for nasal)
  • Thumb in mouth [l]
  • Hissing whistle to make sibilants
– Vocal cords: ivory reed
  • Can’t change length on the fly, so monotone only
  • Wire dropped on reed simulated [r]
Joseph Faber’s Euphonia, 1846
• Constructed 1835 w/pedal and keyboard control
– Whispered and ordinary speech
– Model of tongue, pharyngeal cavity with manipulable shape
– Singing too: “God Save the Queen”
• Riesz’s 1937 synthesizer with almost natural vocal tract shape
• Forerunners of Modern Articulatory Synthesis: George Rosen’s DAVO synthesizer (1958) at MIT
• First notable electronic synthesizer
• Presented at World’s Fair in NY, 1939
• Requires much training to ‘play’
• Purpose: coding/compression
– Reduce bandwidth needed to transmit speech, so many phone calls can be sent over single line
• First attempt to synthesise speech by breaking it down into component sounds and reproducing sound patterns electronically
• Produced two sounds:
– Tone generated by a radio valve to produce the voiced sounds
– Hissing noise produced by gas discharge tube to create sibilants
– These passed through filters and an amplifier that mixed and modulated them
• Answers:
– These days a chicken leg is a rare dish.
– It’s easy to tell the depth of a well.
– Four hours of steady work faced us.
• Goal: Understand perceptual effect of spectral details
• Last used for an experimental study by Robert Remez in 1976!
• Inverted spectrogram: from spectral information to speech
• Lamp produces light ray directed against rotating disk with 50 concentric tracks whose transparency varies systematically to produce 50 partials (pure tones) w/f0 of 120 Hz
– Transparencies represent sound pressures
– Light projected against spectrogram
– Variation in light converted into variation in sound pressure
– Spectrogram passed thru light on rollers to reproduce the speech of the spectrogram
• Can create artificial spectrograms to produce new speech
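The inverted-spectrogram idea above can be approximated in software as additive synthesis. The following is a minimal sketch, not a model of the original optical hardware: it assumes each spectrogram column is a list of 50 amplitudes (one per partial), and the sample rate and the function names `playback_frame` and `playback` are hypothetical choices made for illustration.

```python
import math

F0_HZ = 120.0         # fundamental frequency stated on the slide
NUM_PARTIALS = 50     # one pure tone per track on the disk
SAMPLE_RATE = 16000   # assumption: above 2 * 50 * 120 Hz = 12000 Hz


def playback_frame(amplitudes, num_samples, start_sample=0):
    """Sum the harmonics of F0_HZ, each weighted by its amplitude in one column."""
    if len(amplitudes) != NUM_PARTIALS:
        raise ValueError("expected one amplitude per partial")
    samples = []
    for n in range(start_sample, start_sample + num_samples):
        t = n / SAMPLE_RATE
        value = sum(
            amp * math.sin(2 * math.pi * (k + 1) * F0_HZ * t)
            for k, amp in enumerate(amplitudes)
            if amp  # skip silent partials
        )
        samples.append(value)
    return samples


def playback(spectrogram, samples_per_column):
    """Turn a list of spectrogram columns into one continuous waveform."""
    waveform = []
    for i, column in enumerate(spectrogram):
        # Keep the time index running so the partials stay phase-continuous.
        waveform.extend(playback_frame(column, samples_per_column, i * samples_per_column))
    return waveform
```

Because every partial is a multiple of 120 Hz, the output has a fixed pitch regardless of the spectrogram, which matches the fixed f0 described above.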
Formant/Resonance/Acoustic Synthesis
• Parametric or resonance synthesis
– Specify minimal parameters, e.g. f0 and first 3 formants
– Pass electronic source signal thru filter
  • Harmonic tone for voiced sounds
  • Aperiodic noise for unvoiced
  • Filter simulates the different resonances of the vocal tract
• E.g.
– Walter Lawrence’s Parametric Artificial Talker (1953) for vowels and consonants
– Gunnar Fant’s Orator Verbis Electris (1953) for vowels
– Formant synthesis download (M$demo)
Examples
• Walter Lawrence’s Parametric Artificial Talker (1953) for vowels and consonants
• Gunnar Fant’s Orator Verbis Electris (1953) for vowels
• Formant synthesis download (M$demo)
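The source-filter recipe for formant synthesis (f0 plus the first 3 formants) can be sketched in a few lines. This is an illustrative sketch, not the design of any of the synthesizers named above: it uses a standard two-pole digital resonator per formant, and the sample rate, the function names, and the formant and bandwidth values in the usage line are assumptions chosen only to produce a vowel-like sound.

```python
import math
import random

SAMPLE_RATE = 16000  # assumption for this sketch


def resonator(signal, freq_hz, bandwidth_hz):
    """Two-pole resonator with unity gain at 0 Hz; models one formant."""
    period = 1.0 / SAMPLE_RATE
    c = -math.exp(-2 * math.pi * bandwidth_hz * period)
    b = 2 * math.exp(-math.pi * bandwidth_hz * period) * math.cos(2 * math.pi * freq_hz * period)
    a = 1 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def source(num_samples, f0_hz=None, seed=0):
    """Impulse train (harmonic tone) if f0_hz is given, otherwise aperiodic noise."""
    if f0_hz is None:
        rng = random.Random(seed)
        return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
    pitch_period = round(SAMPLE_RATE / f0_hz)
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]


def synthesize(formants, num_samples, f0_hz=None):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = source(num_samples, f0_hz)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    return signal


# Illustrative (assumed) formant values for an [a]-like vowel, voiced at 120 Hz.
vowel = synthesize([(700, 80), (1200, 90), (2600, 120)], 1600, f0_hz=120)
```

Calling `synthesize` without `f0_hz` swaps the harmonic source for noise, which is the voiced/unvoiced distinction made in the bullets above.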
Synthesis by Computer
• Beginnings ~1960; dominant from 1970 on
Concatenative Synthesis
• Most common type today
• First practical application in 1936: British Phone company’s Talking Clock
– Optical storage for words, part-words, phrases
– Concatenated to tell time
• E.g.
• And a ‘similar’ example from Radio Free Vestibule (1994)
• Bell Labs TTS (1977) (1985)
Variants of Concatenative Synthesis
• Inventory units
– Diphone synthesis (e.g. Festival)
– Microsegment synthesis
– “Unit Selection” – large, variable units
• Issues
– How well do units fit together?
– What is the perceived acoustic quality of the concatenated units?
– Is post-processing on the output possible, to improve quality?
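To make the diphone inventory concrete, here is a small sketch of how a phone sequence maps onto diphone units (each unit spans the transition from one phone to the next). The phone labels, the `pau` silence symbol, and the function names are assumptions for illustration; this is not the Festival API.

```python
def to_diphones(phones, silence="pau"):
    """Pad with silence and pair each phone with its right-hand neighbour."""
    padded = [silence, *phones, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def missing_units(diphones, inventory):
    """List the diphones that a recorded inventory does not cover."""
    return [unit for unit in diphones if unit not in inventory]


units = to_diphones(["h", "e", "l", "ou"])
# units == ["pau-h", "h-e", "e-l", "l-ou", "ou-pau"]
```

An utterance of N phones needs N + 1 diphones, and any unit missing from the inventory has to be recorded or backed off, which is one reason the size and coverage of the inventory matter.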
Overview: Synthesizer I/O
• Front end: From input to control parameters
– Acoustic/phonetic representations, naturally occurring text, constrained mark-up language, semantic/conceptual representations
• Back end: From control parameters to waveform
– Articulatory, formant/acoustic, concatenative (diphone, unit-selection/corpus, HMM) synthesis
TTS Production Levels
Knowledge
• World Knowledge
• Syntax, semantics, lexicon
• Phonetics/phonology
• Acoustics/signal processing
Task
• Text Normalization
• Pronunciation, intonation assignment
• Duration, f0, durations
• Waveform production
Text Normalization Issues
• Numbers
– In 2011 she sold 2010 shares and deposited $42 in her 401(k) before calling 911.
• Abbreviations
– The NAACP just elected a new president.
– NAACL just elected a new president.
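The number example shows that the same kind of digit string needs a different reading depending on context. The sketch below is a toy rule-based normalizer written for this one slide, not a general technique: the rules (a four-digit number after "in" is a year, `$` marks dollars), the `LEXICON` entries and their spoken forms, and all function names are assumptions for illustration.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hand-listed exceptions (assumed pronunciations): digit strings and acronyms.
LEXICON = {
    "401(k)": "four oh one k",
    "911": "nine one one",
    "NAACP": "N double A C P",
    "NAACL": "nackle",
}


def cardinal(n):
    """Spell out a non-negative integer below one million."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return cardinal(thousands) + " thousand" + (" " + cardinal(rest) if rest else "")


def year(n):
    """Read a four-digit year as two pairs; years like 1905 or 2000 fall back to cardinal."""
    high, low = divmod(n, 100)
    if low < 10:
        return cardinal(n)
    return cardinal(high) + " " + cardinal(low)


def normalize(text):
    """Expand numbers and listed abbreviations token by token."""
    tokens = text.split()
    spoken = []
    for i, token in enumerate(tokens):
        core = token.rstrip(".,;!?")
        tail = token[len(core):]
        previous = tokens[i - 1].lower() if i else ""
        if core in LEXICON:
            word = LEXICON[core]
        elif re.fullmatch(r"\$\d+", core):
            amount = int(core[1:])
            word = cardinal(amount) + (" dollar" if amount == 1 else " dollars")
        elif re.fullmatch(r"\d{4}", core) and previous == "in":
            word = year(int(core))
        elif re.fullmatch(r"\d+", core):
            word = cardinal(int(core))
        else:
            word = core
        spoken.append(word + tail)
    return " ".join(spoken)
```

On the slide's sentence this produces "In twenty eleven she sold two thousand ten shares and deposited forty-two dollars in her four oh one k before calling nine one one." The rules are brittle by design: "in 2010 shares" would be misread as a year, which illustrates why normalization needs more context than a single preceding word.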
Pronunciation Issues
• Lexicon:
– comb, tomb
• Proper Names
– Punxsutawney Phil
– Djokovitz
• Word sense ambiguity:
– desert
– bass
– Nice
Intonation Assignment Issues
• Phrasing: Use punctuation?
– 234-5682
– He was born in Independence, MO.
• Accent: Accent content words, not function words?
– I threw out the trash.
• Contour
– Did he do it?
– How did he do it?
– And so then how did he do it?
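The accent question above ("accent content words, not function words?") corresponds to a simple baseline that can be written down directly. This is a sketch of that baseline only: the function-word list is a small hand-picked assumption, and `assign_accents` is a hypothetical name.

```python
import re

# Assumption: a tiny closed-class word list, far from complete.
FUNCTION_WORDS = {
    "a", "an", "the", "i", "you", "he", "she", "it", "we", "they",
    "of", "in", "on", "out", "to", "and", "or", "but",
    "did", "do", "was", "is", "how", "so", "then",
}


def assign_accents(sentence):
    """Mark each word True (accented) unless it is in the function-word list."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return [(word, word.lower() not in FUNCTION_WORDS) for word in words]


accents = assign_accents("I threw out the trash.")
# accents == [("I", False), ("threw", True), ("out", False), ("the", False), ("trash", True)]
```

The baseline leaves "out" unaccented because it is listed as a function word, even though here it is a verb particle that a speaker might well accent; cases like this are presumably why the slide poses the rule as a question.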
Phonological Specification and Realization
• Task: Produce a phonological representation from phonetic and intonational assignment
• Align phones and f0 contour
• Specify durations and intensity
• Select/create acoustic realization from this specification:
– Acoustic transformation
– Concatenation: diphone, unit selection
– HMM synthesis
How Human does TTS Sound?
• Festival concatenative:
• Acuvoice concatenative:
• HMM synthesis (Rob Donovan):
• Rhetorical unit selection
– (acquired by Nuance)
• AT&T Labs Naturally Speaking
Next Class
• Text Normalization techniques