EE2F1 Speech & Audio Technology
Sept. 26, 2002
SLIDE 1
THE UNIVERSITY OF BIRMINGHAM
ELECTRONIC, ELECTRICAL & COMPUTER ENGINEERING
Digital Systems & Vision Processing
EE2F1 Multimedia (1): Speech & Audio Technology
Lecture 7: Speech Synthesis (1)
Martin Russell
Electronic, Electrical & Computer Engineering
School of Engineering
The University of Birmingham
SLIDE 2
Stages in “text-to-speech” synthesis
Text normalisation
Text-to-phone conversion
Linguistic analysis
Semantic analysis
Conversion of phone-sequence to sequence of synthesiser control parameters
Synthesis of acoustic speech signal
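As an illustration of the first stage only, here is a minimal sketch of text normalisation, assuming it amounts to rewriting non-word tokens (here, just digit strings) as words. The `DIGITS` table and the `normalise` function are hypothetical names for this example, and a real system would also handle abbreviations, dates, currency and so on.

```python
# Hypothetical lookup table for this sketch only.
DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalise(text: str) -> str:
    """Replace each all-digit token with its digits spelled out as words."""
    words = []
    for token in text.split():
        if token.isascii() and token.isdigit():
            # Spell the number digit by digit (a simplifying assumption).
            words.extend(DIGITS[d] for d in token)
        else:
            words.append(token)
    return " ".join(words)

print(normalise("The next train to arrive at platform 2"))
```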
SLIDE 3
Approaches to synthesis
Final stage is to convert ‘phone’ or word sequence into a sequence of synthesiser control parameters
Two main approaches:
– Waveform concatenation
– Model-based speech synthesis (includes articulatory synthesis)
SLIDE 4
Waveform Concatenation
Join together, or concatenate, stored sections of real speech
Sections may correspond to whole words, or sub-word units
Early systems based on whole words
– E.g.: Speaking clock - UK telephone system, 1936
Storage and access are major issues
Speech quality requires data-rates of 16,000 to 32,000 bits per second (bps)
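To see why storage is an issue, the quoted data-rates can be converted into storage per minute of recorded speech. This is a rough sketch: the function name and the one-minute duration are assumptions chosen for illustration.

```python
def storage_bytes(bit_rate_bps: int, duration_s: float) -> float:
    """Bytes needed to store speech recorded at a constant bit-rate."""
    return bit_rate_bps * duration_s / 8  # 8 bits per byte

for rate in (16_000, 32_000):
    kilobytes = storage_bytes(rate, 60) / 1000
    print(f"{rate} bps -> {kilobytes:.0f} kB per minute of speech")
```

At these rates one minute of speech takes 120 kB to 240 kB, before any allowance for multiple examples of each unit.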
SLIDE 5
1936 “Speaking Clock”
From John Holmes, “Speech synthesis and recognition”, courtesy of British Telecommunications plc
SLIDE 6
Whole word concatenation (1)
Whole word concatenation can give good quality speech (as in speaking clock), but has many disadvantages:
– pronunciation of a word is influenced by neighbouring words (co-articulation)
– prosodic effects like intonation, rate-of-speaking and amplitude are also influenced by context
– interpretation of a sentence is strongly influenced by details of how individual words are spoken, such as which word is stressed (“Mary didn’t buy Sam a pizza”)
SLIDE 7
Whole word concatenation (2)
Disadvantages (continued):
– words must be extracted from the right sort of sentence
– most suitable for applications where structure of the sentence is constrained, e.g., announcements, lists…
– may need to record more than one example of each word, e.g., raised pitch at end of a list, pre-pause lengthening…
SLIDE 8
Example – original recording
The next train to arrive at platform 2 will call at Bromsgrove, Droitwich Spa, Worcester Foregate Street and Malvern Link
SLIDE 9
Example – trivial concatenative synthesis
The next train to arrive at platform 2 will call at Malvern Link, Worcester Foregate Street, Droitwich Spa and Bromsgrove
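The reordering in this example can be sketched as a plain join of stored waveforms. This is a minimal illustration only: the sample values are placeholders rather than real recordings, and `concatenate_units` and `gap_samples` are hypothetical names.

```python
from typing import Dict, List

def concatenate_units(order: List[str], store: Dict[str, List[float]],
                      gap_samples: int = 0) -> List[float]:
    """Join stored waveforms in the requested order, with optional silence between them."""
    out: List[float] = []
    for i, name in enumerate(order):
        if i > 0:
            out.extend([0.0] * gap_samples)  # silence between units
        out.extend(store[name])
    return out

# Placeholder "recordings": each entry would really hold audio samples cut from the original sentence.
store = {
    "Bromsgrove": [0.1, 0.2],
    "Droitwich Spa": [0.3, 0.4, 0.5],
    "Malvern Link": [0.6],
}
reordered = concatenate_units(["Malvern Link", "Droitwich Spa", "Bromsgrove"],
                              store, gap_samples=2)
print(len(reordered))
```

Nothing in this join adjusts pitch or duration, so each unit keeps the prosody of its position in the original sentence.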
SLIDE 10
Example repeated
Original recording
‘Concatenative synthesis’
SLIDE 11
Whole word concatenation (3)
Disadvantages (continued):
– to add new words the original speaker must be found, or all words must be re-recorded
– even with specialist facilities, selection and extraction of suitable words is labour intensive and time consuming
SLIDE 12
Sub-word concatenation (1)
Limitations of word-based methods suggest concatenative speech synthesis based on sub-word units
Need well-annotated, phonetically-balanced corpus of speech recordings
Extract fragments from waveforms in the corpus which represent ‘basic units’ of speech, and can be concatenated and used for speech synthesis
SLIDE 13
Sub-word concatenation (2)
Difficulties include:
– identification of a set of suitable units
– careful annotation of large amounts of data
– derivation of a good method for concatenation
SLIDE 14
Sub-word concatenation (3)
Sub-word concatenation overcomes difficulties with adding new words to the application vocabulary, but other problems are exacerbated.
In particular, coarticulation and pitch continuity problems occur within, as well as between, words.
Necessary to use several examples of each phone (corresponding roughly to different allophones).
SLIDE 15
Sub-word concatenation (4)
Natural to select fragments that characterise the phone target values, but modelling transitions between these targets is a significant problem
SLIDE 16
Example: sub-word concatenation
“stack” (original)
“task” sub-word concatenative synthesis
SLIDE 17
Transitional units (1)
Central regions of many speech sounds are approximately stationary and less susceptible to coarticulation effects.
Hence select fragments which characterise transitions between phones, rather than phone targets.
e.g., diphone - transition between two phones.
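A phone sequence can be turned into the diphones needed to synthesise it by pairing each phone with its successor. This is a minimal sketch: the phone symbols for “stack” are illustrative, and padding the utterance with a silence symbol (`sil`) is an assumption of this example.

```python
from typing import List, Tuple

def phones_to_diphones(phones: List[str], silence: str = "sil") -> List[Tuple[str, str]]:
    """Pad the phone sequence with silence and return each adjacent pair."""
    padded = [silence] + phones + [silence]
    return list(zip(padded, padded[1:]))

# Illustrative phone symbols for the word "stack".
print(phones_to_diphones(["s", "t", "a", "k"]))
```

Each pair names one stored fragment running from the middle of the first phone to the middle of the second, so the joins fall in the approximately stationary central regions.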
SLIDE 18
Transitional units (2)
There are contextually-induced differences between instantiations of the central region of a phone, which cause discontinuities if they are not attended to.
Possible solutions are:
– use several different examples of each diphone
– store short transition regions, and
– interpolate between end values
SLIDE 19
Transitional units (3)
Coping with coarticulation effects by modelling transitions and
– (a) using multiple examples to cope with variation in the instantiation of the phone centres, and
– (b) by interpolation between short transition regions
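Option (b) can be sketched as linear interpolation of a parameter track across the gap between two stored transition regions. The choice of parameter (for example a formant frequency in Hz), the values, and the function names are all assumptions made for illustration.

```python
from typing import List

def interpolate_gap(end_value: float, start_value: float, n_frames: int) -> List[float]:
    """Return n_frames values spaced evenly, strictly between the two end values."""
    step = (start_value - end_value) / (n_frames + 1)
    return [end_value + step * (i + 1) for i in range(n_frames)]

def join_with_interpolation(first: List[float], second: List[float],
                            n_frames: int) -> List[float]:
    """Join two transition regions, filling the phone centre by interpolation."""
    return first + interpolate_gap(first[-1], second[0], n_frames) + second

# Hypothetical parameter tracks (Hz) for two short transition regions.
track = join_with_interpolation([500.0, 520.0], [600.0, 610.0], n_frames=3)
print(track)
```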
SLIDE 20
More on prosody
Discontinuity in the fundamental frequency exacerbated for sub-word methods.
Can use source-filter model to separate excitation signal from vocal-tract shape.
Vocal-tract shape descriptions can then be concatenated and an appropriately smooth fundamental frequency pattern can be added separately.
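Because the fundamental frequency pattern is specified separately, it can be made continuous across unit joins. The sketch below places excitation instants (pitch marks) from a smooth F0 contour by stepping forward one local pitch period at a time. The linearly falling contour and the function names are assumptions for illustration.

```python
from typing import Callable, List

def pitch_marks(f0_at: Callable[[float], float], duration_s: float) -> List[float]:
    """Place excitation instants so that each gap equals the local pitch period."""
    marks: List[float] = []
    t = 0.0
    while t < duration_s:
        marks.append(t)
        t += 1.0 / f0_at(t)  # pitch period in seconds at time t
    return marks

# Hypothetical contour: F0 falls linearly from 120 Hz to 100 Hz over 0.5 s.
def falling_f0(t: float) -> float:
    return 120.0 - 40.0 * t

marks = pitch_marks(falling_f0, 0.5)
print(len(marks), marks[:3])
```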
SLIDE 21
PSOLA: Pitch Synchronous Overlap and Add
PSOLA (Charpentier, 1986)
Most successful current approach to concatenative synthesis
In PSOLA, the end regions of windowed waveform samples are overlapped pitch-synchronously and added
BT’s Laureate is an example
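The overlap-and-add step can be sketched as follows. This is a simplified illustration, not the published algorithm: it assumes the pitch marks are already known and evenly spaced, and it uses Hann windows spanning two pitch periods. All function names are hypothetical.

```python
import math
from typing import List

def hann(n: int) -> List[float]:
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def extract_grain(signal: List[float], centre: int, half_width: int) -> List[float]:
    """Windowed segment centred on a pitch mark, zero-padded at the signal edges."""
    window = hann(2 * half_width + 1)
    grain = []
    for k, w in enumerate(window):
        idx = centre - half_width + k
        sample = signal[idx] if 0 <= idx < len(signal) else 0.0
        grain.append(sample * w)
    return grain

def overlap_add(grains: List[List[float]], centres: List[int], length: int) -> List[float]:
    """Sum each grain into the output, centred on its synthesis pitch mark."""
    out = [0.0] * length
    for grain, centre in zip(grains, centres):
        half = len(grain) // 2
        for k, value in enumerate(grain):
            idx = centre - half + k
            if 0 <= idx < length:
                out[idx] += value
    return out

# Toy input: a constant signal with a pitch mark every 10 samples.
period = 10
signal = [1.0] * 41
marks = list(range(0, 41, period))
grains = [extract_grain(signal, m, period) for m in marks]
resynth = overlap_add(grains, marks, len(signal))
print(max(abs(y - 1.0) for y in resynth))
```

With the synthesis marks equal to the analysis marks, the overlapping Hann windows sum to one and the input is recovered. Moving the synthesis marks is what changes pitch or duration.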
SLIDE 22
PSOLA
From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001
SLIDE 23
Speech modification using PSOLA
In addition to speech synthesis from segments, there are two other common applications of PSOLA:
– Pitch modification
– Duration modification
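Both modifications come down to choosing where to place the windowed segments in the output and which original segment to use at each position. The sketch below computes only that plan, assuming a constant pitch period and nearest-segment selection. `synthesis_plan` and its parameters are hypothetical names.

```python
from typing import List, Tuple

def synthesis_plan(analysis_marks: List[int], pitch_factor: float,
                   duration_factor: float) -> List[Tuple[int, int]]:
    """Return (synthesis_position, analysis_segment_index) pairs.

    Assumes a constant pitch period, taken from the first two analysis marks.
    """
    period = analysis_marks[1] - analysis_marks[0]
    new_period = period / pitch_factor  # closer marks give higher pitch
    total = (analysis_marks[-1] - analysis_marks[0]) * duration_factor
    plan = []
    t = 0.0
    while t <= total:
        # Map the output time back onto the original time axis.
        source_time = analysis_marks[0] + t / duration_factor
        nearest = min(range(len(analysis_marks)),
                      key=lambda i: abs(analysis_marks[i] - source_time))
        plan.append((round(t), nearest))
        t += new_period
    return plan

marks = [0, 100, 200, 300, 400]
raised = synthesis_plan(marks, pitch_factor=2.0, duration_factor=1.0)
slowed = synthesis_plan(marks, pitch_factor=1.0, duration_factor=2.0)
print(raised)
print(slowed)
```

Doubling the pitch halves the spacing between segments and reuses segments to keep the duration. Doubling the duration keeps the spacing and reuses segments to fill the extra time.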
SLIDE 24
Increasing pitch using PSOLA
From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001
SLIDE 25
Decreasing pitch using PSOLA
From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001
SLIDE 26
The ‘Laureate’ System
The BT “Laureate” system is a modern, PSOLA-based synthesiser
See Edington et al. (1996a), also look at the web site
Demonstration
SLIDE 27
PSOLA strengths and weaknesses
Strengths
– Produces good quality speech
Weaknesses
– Large, annotated corpus needed for each ‘voice’
– Requires accurate pitch peak detection
– Inflexible – new voices can only be produced by recording and labelling significant speech corpora from new speakers
Automatic annotation of corpora using techniques from speech recognition
SLIDE 28
Summary
Concatenative speech synthesis
Whole word concatenation
Importance of prosody
Sub-word concatenation
Choice of sub-word units
PSOLA