
1st and 2nd Generation Synthesis
• Speech Synthesis Generation
  – First: Ground Up Synthesis
  – Second: Data Driven Synthesis by Concatenation
• Input (Sequence of)
  – Phonetic symbols
  – Duration
  – F0 contours
  – Amplification factors
• Data
  – Rule-based parameters
  – Linear Prediction: Stored diphone parameters

Early Synthesis History
• Klatt, 1987
  – “Review of text-to-speech conversion for English”
  – http://americanhistory.si.edu/archives/speechsynthesis/dk_737b.htm
  – Audio: http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html
• Milestones
  – 1939 World's Fair, Voder, Dudley
  – First TTS, Umeda, 1968
  – Low rate resynthesis, Speak and Spell, Wiggins, 1980
  – Natural sounding resynthesis, multi-pulse Linear Prediction, Atal, 1982
  – Natural sounding synthesis, Klatt, 1986

Formant Synthesizer Design
• Concept
  – Create individual components for each synthesizer unit
  – Feed the system with a set of parameters
• Advantage
  – If the parameters are set properly, perfectly natural sounding speech is created
• Disadvantages
  – The combination of parameters becomes obscure
  – Parameter settings do not enable an automated algorithm
Demo Program: http://www.asel.udel.edu/speech/tutorials/synthesis/

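Formant Synthesizer
• IIR filter: $y_n = b_0 x_n + a_1 y_{n-1} + a_2 y_{n-2}$
• Transfer function: $H(z) = b_0 / (1 - a_1 z^{-1} - a_2 z^{-2})$
• Factored into poles (with $b_0 = 1$): $H(z) = 1 / \{(1 - p_1 z^{-1})(1 - p_2 z^{-1})\}$
• Because the poles are a conjugate pair, $p_{1,2} = r e^{\pm i\theta}$:
  $H(z) = 1 / \{(1 - r e^{i\theta} z^{-1})(1 - r e^{-i\theta} z^{-1})\}$
  $= 1 / (1 - r e^{-i\theta} z^{-1} - r e^{i\theta} z^{-1} + r^2 z^{-2})$
  $= 1 / (1 - r(e^{-i\theta} + e^{i\theta}) z^{-1} + r^2 z^{-2})$
  $= 1 / (1 - 2r\cos\theta\, z^{-1} + r^2 z^{-2})$
• The filter: $y_n = x_n + 2r\cos\theta\, y_{n-1} - r^2 y_{n-2}$ (so $a_1 = 2r\cos\theta$ and $a_2 = -r^2$)
• Parameters (θ controls formant frequency; r controls bandwidth)
  – $\theta = 2\pi f / F$, $r = e^{-\pi\beta / F}$
  – β = desired bandwidth, F = sampling rate, f = formant frequency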

Design for individual formant Components
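A single formant component can be sketched in a few lines of Python. This is a minimal illustration, assuming the recursion $y_n = x_n + 2r\cos\theta\, y_{n-1} - r^2 y_{n-2}$ that follows from the denominator $1 - 2r\cos\theta\, z^{-1} + r^2 z^{-2}$. The function names and the 500 Hz / 60 Hz / 16 kHz values are illustrative choices, not taken from the slides.

```python
import math

def resonator_coefficients(f, bandwidth, sample_rate):
    """Return (a1, a2) for y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    theta = 2.0 * math.pi * f / sample_rate            # pole angle sets the formant frequency
    r = math.exp(-math.pi * bandwidth / sample_rate)   # pole radius sets the bandwidth
    return 2.0 * r * math.cos(theta), -r * r

def resonate(x, f, bandwidth, sample_rate):
    """Filter the sample list x through one two-pole formant resonator."""
    a1, a2 = resonator_coefficients(f, bandwidth, sample_rate)
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y

# Impulse response of an illustrative 500 Hz formant, 60 Hz wide, at 16 kHz.
impulse = [1.0] + [0.0] * 159
response = resonate(impulse, 500.0, 60.0, 16000.0)
```

The impulse response is a decaying sinusoid: a narrower bandwidth pushes r toward 1 and makes the ringing last longer.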

Parallel or Cascade
• Cascaded connections
  – Lose control over components because skirts of poles interact
• Parallel connections
  – Add filtered signals together to maintain component control
• System Input Parameters
  – A1,2,3 = Amplitudes
  – F1,2,3 = Frequencies
  – BW1,2,3 = Bandwidths
  – Gain = Output multiplier
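The two connection schemes can be compared with a small sketch. It assumes each formant is a two-pole resonator with pole angle 2πf/F and radius e^(-πβ/F). The three formant frequencies, bandwidths, amplitudes, and the gain are illustrative values only, and all function names are hypothetical.

```python
import math

def two_pole(x, f, bw, fs):
    """One formant resonator: y[n] = x[n] + 2r*cos(theta)*y[n-1] - r^2*y[n-2]."""
    theta = 2.0 * math.pi * f / fs
    r = math.exp(-math.pi * bw / fs)
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        out = s + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y

def cascade(x, formants, fs):
    """Feed each resonator's output into the next one (no per-formant amplitude)."""
    for f, bw in formants:
        x = two_pole(x, f, bw, fs)
    return x

def parallel(x, formants, amplitudes, gain, fs):
    """Filter the same input in every branch, scale by A1..A3, sum, apply Gain."""
    out = [0.0] * len(x)
    for (f, bw), amp in zip(formants, amplitudes):
        branch = two_pole(x, f, bw, fs)
        out = [o + amp * b for o, b in zip(out, branch)]
    return [gain * o for o in out]

FS = 16000.0
FORMANTS = [(700.0, 60.0), (1200.0, 90.0), (2600.0, 120.0)]  # (F, BW) pairs, illustrative
impulse = [1.0] + [0.0] * 319
serial_out = cascade(impulse, FORMANTS, FS)
parallel_out = parallel(impulse, FORMANTS, [1.0, 0.5, 0.25], 0.5, FS)
```

In the parallel form each amplitude scales exactly one branch, which is the component control the slide refers to. In the cascade form the relative formant levels fall out of the combined transfer function instead.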

Periodic Source
• Flanagan model
  – Explicit periodic function
    $u[n] = \tfrac{1}{2}(1 - \cos(\pi n / L))$ if $0 \le n \le L$
    $u[n] = \cos(\pi (n - L) / (2M))$ if $L < n \le L + M$
    $u[n] = 0$ otherwise
• Liljencrants-Fant model (figure)
  – Flow rises from 0 to amplitude Av at time Tp
  – Te is where the derivative reaches E
  – Te is the glottal closing instant
  – The open quotient Oq = Te / T0
  – The ratio between the opening and closing phase is αm
  – Abrupt closure after maximum excitation between OqT0 and T0

Glottis approximation formulas
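The explicit pulse formula is easy to evaluate directly. The sketch below assumes the closing phase covers L < n ≤ L + M, so that the cosine term returns to zero at the end of the pulse. The values L = 40, M = 16, and a period of 100 samples are illustrative, and the function names are hypothetical.

```python
import math

def glottal_pulse(L, M):
    """One pulse: raised-cosine opening over L samples, cosine closing over M samples."""
    pulse = []
    for n in range(L + M + 1):
        if n <= L:
            pulse.append(0.5 * (1.0 - math.cos(math.pi * n / L)))
        else:
            pulse.append(math.cos(math.pi * (n - L) / (2.0 * M)))
    return pulse

def pulse_train(L, M, period, n_periods):
    """Repeat the pulse every `period` samples; F0 = sample_rate / period."""
    one = glottal_pulse(L, M)
    if period < len(one):
        raise ValueError("period must be long enough to hold one whole pulse")
    cycle = one + [0.0] * (period - len(one))   # closed phase: flow stays at zero
    return cycle * n_periods

pulse = glottal_pulse(40, 16)
source = pulse_train(40, 16, 100, 3)
```

Shortening `period` raises the pitch without changing the pulse shape, while changing the ratio of L to M changes how abruptly the glottis closes.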

Radiation From the Lips
• Actual modeling of the lips is very complicated
• Rule based synthesizers want to use specific formulas for simulation
• Experiments show
  – Lip radiation contains at least one anti-resonance (a zero in the transfer function)
  – The approximation formula often used: $R(z) = 1 - \alpha z^{-1}$ where $0.95 \le \alpha \le 0.98$
  – This turns out to be the same formula used for preemphasis
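In the time domain, R(z) = 1 − αz⁻¹ is the one-tap difference y[n] = x[n] − α·x[n−1]. A minimal sketch follows, with α = 0.97 chosen from the stated range and a hypothetical function name.

```python
def lip_radiation(x, alpha=0.97):
    """Apply R(z) = 1 - alpha*z^-1, i.e. y[n] = x[n] - alpha*x[n-1]."""
    y, previous = [], 0.0
    for sample in x:
        y.append(sample - alpha * previous)
        previous = sample
    return y

# A constant (0 Hz) input is almost cancelled after the first sample,
# while an input alternating at the Nyquist rate is almost doubled.
constant_out = lip_radiation([1.0, 1.0, 1.0, 1.0])
alternating_out = lip_radiation([1.0, -1.0, 1.0, -1.0])
```

The two test inputs show the high-pass behaviour that makes the same filter usable for preemphasis.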

Consonants and Nasals
• Nasals
  – One resonator models the oral cavity
  – Another resonator models the nasal cavity
  – Add a zero in series with resonators
  – Outputs added to generate output
• Fricatives
  – Source either noise or glottis or both
  – One set of resonators models the cavity in front of the place of constriction
  – Another set models the cavity behind the point of constriction
  – Outputs added together

The Klatt Synthesizer

Klatt Parameters

Evaluation of Formant Synthesizers
• Quality
  – Speech produced is understandable
  – Output sounds metallic (not natural)
• Problems
  – System uses lumped parameters (like components of a spring); it is not distributed (like the vocal tract)
  – Individually valid assumptions are invalid when joined together in a system
  – Speech subtleties are too complex for the formant model
  – Transitions between sounds are not modeled
  – Formants are not present in obstruent sounds

Classical Linear Prediction (LP Synthesis)
• Concept
  – Use the all-pole tube model of Linear Prediction
  – $Y(z) = X(z) / (1 - a_1 z^{-1} - a_2 z^{-2} - \dots - a_P z^{-P})$ leads to the linear prediction formula $y_n = x_n + a_1 y_{n-1} + a_2 y_{n-2} + \dots + a_P y_{n-P}$
• Improvements over formant synthesis
  – Obtain parameters directly from speech, not from experimentation or human intervention
  – The glottal filter is subsumed in the LP equation, so synthesizing the glottal source becomes unnecessary
• Tradeoffs
  – Lose modularity and physical interpretations of coefficients
  – Lack of zeros makes modeling nasals and fricatives difficult
  – Modeling transitions between sounds is problematic
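The LP recursion $y_n = x_n + a_1 y_{n-1} + \dots + a_P y_{n-P}$ can be sketched as an all-pole filter driven by an impulse train. The coefficient values below are made up for illustration; in a real system they would come from LP analysis of recorded speech. The function names are hypothetical.

```python
def lp_synthesize(excitation, a):
    """All-pole synthesis: y[n] = x[n] + sum over k of a[k-1] * y[n-k], k = 1..P."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y.append(acc)
    return y

def impulse_train(n_samples, period):
    """Unit impulses every `period` samples; a shorter period means a higher F0."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

A = [1.2, -0.5]                                     # illustrative P = 2 coefficients
low_pitch = lp_synthesize(impulse_train(400, 100), A)
high_pitch = lp_synthesize(impulse_train(400, 50), A)
```

Both outputs share the same filter, so only the excitation rate differs between them.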

LP diphone-concatenation Synthesis
• Diphone definition: the unit that starts from the middle of one phone and ends at the middle of the next phone
• Concept
  – Capture and store the vocal tract dynamics of each frame
  – Alter the F0 by changing the impulse rate
  – Alter duration as needed
  – Concatenate stored frames together to accomplish synthesis
• Input: array of {phone symbol, F0 value, duration}

LP difficulties
• Boundary point transition artifacts
  – Approach: Interpolate the LP parameters between adjacent frames
  – The output has a metallic or buzzy quality because the LP filter does not entirely capture the characteristics of the source. The residual contains spikes at each pitch period
• Experiment to resynthesize a speech waveform
  – Resynthesize with residual: speech sounds perfect
  – Resynthesize without residual
    • Same pitch and duration: sounds degraded but okay
    • Alter pitch: speech becomes buzzy
    • Alter duration: degraded but okay
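The resynthesis experiment can be mimicked on a toy signal. The sketch assumes the LP coefficients are already known (the values are made up) and skips the analysis step that would normally estimate them. It inverse-filters the signal to get the residual, then resynthesizes once with the residual and once with a bare impulse train. All names are hypothetical.

```python
import math

def lp_residual(signal, a):
    """Inverse filter: e[n] = y[n] - sum over k of a[k-1] * y[n-k]."""
    residual = []
    for n, s in enumerate(signal):
        prediction = sum(ak * signal[n - k]
                         for k, ak in enumerate(a, start=1) if n - k >= 0)
        residual.append(s - prediction)
    return residual

def lp_synthesize(excitation, a):
    """All-pole synthesis: y[n] = x[n] + sum over k of a[k-1] * y[n-k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x + sum(ak * y[n - k]
                      for k, ak in enumerate(a, start=1) if n - k >= 0)
        y.append(acc)
    return y

A = [1.2, -0.5]                                    # illustrative coefficients
toy = [math.sin(0.2 * n) + 0.5 * math.sin(0.5 * n) for n in range(200)]

residual = lp_residual(toy, A)
with_residual = lp_synthesize(residual, A)         # reproduces the input
impulses = [1.0 if n % 40 == 0 else 0.0 for n in range(200)]
without_residual = lp_synthesize(impulses, A)      # same filter, simplified source
```

Driving the filter with the true residual undoes the inverse filter exactly, which is why that condition sounds perfect. Everything the impulse train leaves out is information that only the residual carried.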

Articulatory Synthesis
• Kempelen
  – Mechanical device with tubes, bellows, and pipes
  – Played as one plays a musical instrument
• Digital version
  – Controls are the tubes, not the formants
  – Can obtain LP tube parameters from the LP filter
• Difficulties
  – Difficult to obtain values that shape the tubes
  – The glottis and lip radiation still need to be modeled
  – Existing models produce poor speech
• Current Applicable Research
  – Articulatory physiology, gestures, audio-visual synthesis, talking heads

The oldest approach: mimic the vocal tract components

2nd Generation Synthesis by Concatenation
• Comparisons to 1st generation models
  – Input
    • Still explicitly defines the F0 contour and duration and phonetic symbols
  – Output
    • Source waveform generated from a database of diphones (one diphone per phone)
    • Discards impulse pulses and noise generators
  – Concatenation
    • Pitch and duration algorithms glue together diphones

Extension of 1st Generation LP-Concatenation

Diphone Inventory
• Requirements
  – If 40 phonemes
    • 40 left phones and 40 right phones can combine in 40 × 40 = 1600 ways
    • A phonotactic grammar can reduce the database size
  – Pick long units rather than short ones (it is easier to shorten duration than lengthen it)
  – Normalize the phases of the diphones
  – All diphones should have equal pitch
• Finding diphone sound waves to build the inventory
  – Search a corpus (if one exists)
  – Specifically record words containing the diphones
  – Record nonsense words (logotomes) with desired features
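The inventory size follows from enumerating ordered phone pairs. In the sketch below the phone labels are placeholders, and the filter that stands in for a phonotactic grammar is a toy rule (no phone followed by itself), not a real constraint of any language.

```python
from itertools import product

def diphone_inventory(phones, allowed=None):
    """List every (left, right) phone pair, optionally filtered by a phonotactic rule."""
    return [(left, right)
            for left, right in product(phones, repeat=2)
            if allowed is None or allowed(left, right)]

PHONES = ["p%d" % i for i in range(40)]            # placeholder labels for 40 phonemes

full = diphone_inventory(PHONES)
reduced = diphone_inventory(PHONES, allowed=lambda l, r: l != r)   # toy rule only
```

With 40 phones the unfiltered list holds 1600 pairs. Each rule that excludes impossible sequences removes recordings that never need to be made.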

Pitch-synchronous overlap and add (PSOLA)
• PSOLA is a time domain algorithm
• Pseudo code
  – Find the exact pitch periods in a speech signal
  – Create overlapping frames centered on epochs extending back and forward one pitch period
  – Apply Hamming window
  – Add waves back
    • Closer together for higher pitch, further apart for lower pitch
    • Remove frames to shorten or insert frames to lengthen
• Undetectable if epochs are accurately found. Why?
  – We are not altering the vocal filter, but changing the amplitude and spacing of the input

Purpose: Modify pitch or timing of a signal

PSOLA Illustrations
(figures) Pitch (window and add); Duration (insert or remove)

PSOLA Epochs
• PSOLA requires an exact marking of pitch points in a time domain signal
• Pitch mark
  – Marking any part within a pitch period is okay as long as the algorithm marks the same point for every frame
  – The most common marking point is the instant of glottal closure, which shows up as a quick time domain descent
• Create an array of sample numbers that comprises the analysis epoch sequence P = {p1, p2, …, pn}
• Estimate the local pitch period at epoch k as $(p_{k+1} - p_{k-1})/2$
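Given an epoch array, the local pitch period can be estimated from the neighbouring marks. The sketch below assumes the central estimate (p[k+1] − p[k−1]) / 2 for interior epochs and falls back to the single adjacent gap at the two ends. The function names and the 16 kHz rate are illustrative assumptions.

```python
def local_pitch_periods(epochs):
    """Estimate the pitch period (in samples) at each epoch from its neighbours."""
    if len(epochs) < 2:
        raise ValueError("need at least two epochs")
    periods = []
    last = len(epochs) - 1
    for k in range(len(epochs)):
        if k == 0:
            periods.append(float(epochs[1] - epochs[0]))
        elif k == last:
            periods.append(float(epochs[last] - epochs[last - 1]))
        else:
            periods.append((epochs[k + 1] - epochs[k - 1]) / 2.0)
    return periods

def periods_to_f0(periods, sample_rate):
    """Convert periods in samples to F0 in Hz."""
    return [sample_rate / p for p in periods]

periods = local_pitch_periods([0, 100, 200, 310])
f0 = periods_to_f0(periods, 16000.0)
```

Averaging across two gaps smooths small marking errors, which matters because PSOLA places its windows directly on these marks.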

PSOLA pseudo code
Identify the epochs using an array of sample indices, P
For each input object
    Extract the desired F0, phoneme, and duration
    speech = looked up phoneme sound wave from stored data
    Identify the epochs in the phoneme with array, P
    Break up the phoneme into frames
    If F0 value differs from that of the phoneme
        Window each frame into an array of frames
        speech = overlap and add frames using desired F0
    If duration is larger than desired
        Delete extra frames from speech at regular intervals
    Else if duration is smaller than desired
        Duplicate frames at regular intervals in speech
Note: Multiple F0 points in a phoneme require multiple input objects
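The pseudo code can be turned into a small time-domain sketch. This is a simplified illustration under several assumptions, not a production PSOLA: epochs are supplied rather than detected, each frame spans from the previous epoch to the next one, pitch and duration are scaled by constant factors, and frame duplication or deletion happens implicitly by picking the nearest analysis frame for each synthesis epoch. All names are hypothetical.

```python
import math

def hamming(length):
    """Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def extract_frames(signal, epochs):
    """One windowed frame per interior epoch, reaching one period to each side."""
    frames = []
    for k in range(1, len(epochs) - 1):
        start, centre, end = epochs[k - 1], epochs[k], epochs[k + 1]
        window = hamming(end - start + 1)
        frame = [signal[start + i] * window[i] for i in range(end - start + 1)]
        frames.append((centre - start, frame))      # (offset of epoch inside frame, samples)
    return frames

def psola(signal, epochs, pitch_factor=1.0, duration_factor=1.0):
    """Overlap-add frames at a new spacing (pitch) over a new length (duration)."""
    frames = extract_frames(signal, epochs)
    centres = epochs[1:-1]
    out = [0.0] * int(len(signal) * duration_factor)
    t = centres[0] * duration_factor                # first synthesis epoch
    while t < len(out):
        source_time = t / duration_factor           # matching point in the input
        k = min(range(len(centres)), key=lambda i: abs(centres[i] - source_time))
        offset, frame = frames[k]
        period = (epochs[k + 2] - epochs[k]) / 2.0  # local period at centres[k]
        begin = int(round(t)) - offset
        for i, value in enumerate(frame):
            if 0 <= begin + i < len(out):
                out[begin + i] += value
        t += period / pitch_factor                  # closer together = higher pitch
    return out

# Toy input: one unit spike per pitch period, epochs marked exactly on the spikes.
marks = list(range(50, 1000, 100))
toy = [0.0] * 1000
for p in marks:
    toy[p] = 1.0
octave_up = psola(toy, marks, pitch_factor=2.0)
twice_as_long = psola(toy, marks, duration_factor=2.0)
```

On the toy spike train the main spikes end up 50 samples apart after the octave shift, and stay 100 apart over a doubled length after the duration change. Real speech needs accurate epoch detection first, which this sketch does not attempt.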

PSOLA Evaluation
• Advantages
  – As a time domain algorithm, it is unlikely that any other approach is more efficient (O(N))
  – If pitch and timing differences are within 25%, listeners cannot detect the alterations
• Disadvantages
  – Epoch marking must be exact
  – Only pitch and timing changes are possible
  – If used with unit selection, several hundred megabytes of storage could be needed

LP - PSOLA
• Concept
  – If the synthesizer uses linear prediction to compress phoneme sound waves, the residual portion of the signal is already available for additional waveform modifications
• Algorithm
  – Mark the epoch points of the LP residual and overlap/combine with the PSOLA approach
• Analysis
  – Resulting speech is competitive with PSOLA, but not superior

Sinusoidal Models
• Definition (linear regression): Statistically estimate relationships between variables that are related in a linear fashion
• Advantage: The algorithm is less sensitive to finding exact pitch points
• General approach
  1. Filter the noise component from the signal
  2. Successively match the signal against a high frequency sinusoidal wave, subtracting the match from the wave
  3. The lowest remaining wave is F0
  4. Use a PSOLA type algorithm to alter pitch and duration

Find contributing sinusoids in a signal using linear regression techniques
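Matching a signal against one sinusoid is a two-parameter least-squares problem: find a and b that minimise the error of a·cos(ωn) + b·sin(ωn). The sketch below solves the 2×2 normal equations directly and returns the residual so the next sinusoid can be fitted to what is left. It assumes the candidate frequency is already known; the function name and the test frequencies are illustrative.

```python
import math

def fit_sinusoid(signal, freq, sample_rate):
    """Least-squares fit of a*cos(wn) + b*sin(wn); returns (a, b, residual)."""
    w = 2.0 * math.pi * freq / sample_rate
    c = [math.cos(w * n) for n in range(len(signal))]
    s = [math.sin(w * n) for n in range(len(signal))]
    cc = sum(v * v for v in c)
    ss = sum(v * v for v in s)
    cs = sum(u * v for u, v in zip(c, s))
    xc = sum(x * v for x, v in zip(signal, c))
    xs = sum(x * v for x, v in zip(signal, s))
    det = cc * ss - cs * cs            # zero for degenerate cases such as freq = 0
    a = (xc * ss - xs * cs) / det
    b = (xs * cc - xc * cs) / det
    residual = [x - a * ci - b * si for x, ci, si in zip(signal, c, s)]
    return a, b, residual

FS = 8000.0
mix = [0.8 * math.cos(2.0 * math.pi * 200.0 * n / FS)
       + 0.3 * math.sin(2.0 * math.pi * 400.0 * n / FS) for n in range(400)]

a1, b1, rest = fit_sinusoid(mix, 200.0, FS)      # remove the 200 Hz component
a2, b2, rest2 = fit_sinusoid(rest, 400.0, FS)    # then fit what remains
```

The amplitude of each fitted component is `math.hypot(a, b)`. Repeating the fit on successive residuals peels the signal apart one sinusoid at a time.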

MBROLA
• Overview
  – PSOLA synthesis has very poor (very hoarse) quality if the pitch points are not correctly marked
  – MBROLA addresses this issue by preprocessing the database of phonemes
    • Ensure that all phonemes have the same phase
    • Force all phonemes to have the same pitch
    • Overlap and synthesis then works with complete accuracy
Home Page: http://tcts.fpms.ac.be/synthesis/mbrola/

Issues and Discussion
• Concatenation Synthesis
  – Micro-concatenation problems:
    • Joining phonemes can cause clicks at the boundary
      – Solution: Tapering waveforms at the edges
    • Joining segments with mismatched phases
      – Solution: force all segments to be phase aligned
    • Optimal coupling points
      – Solution: algorithms for matching trajectories
      – Solution: interpolate LP parameters
  – Macro-concatenation: ensure a natural spectral envelope
    • Requires an accurate F0 contour
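The tapering solution to boundary clicks can be sketched as a short cross-fade: the end of the left segment fades out while the start of the right segment fades in. The linear ramp, the overlap length, and the function name are illustrative assumptions. The sketch also assumes the two segments are already phase aligned.

```python
def crossfade_join(left, right, overlap):
    """Join two sample lists, blending `overlap` samples with a linear taper."""
    if overlap == 0:
        return left + right
    if overlap > min(len(left), len(right)):
        raise ValueError("overlap longer than a segment")
    out = left[:-overlap]
    tail = left[len(left) - overlap:]
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)          # weight of the right segment, rising
        out.append((1.0 - w) * tail[i] + w * right[i])
    out.extend(right[overlap:])
    return out

# A hard step from 1.0 to 0.0 becomes a ramp over four samples.
joined = crossfade_join([1.0] * 10, [0.0] * 10, 4)
```

Because the two weights always sum to one, segments that already agree in the overlap region pass through unchanged.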
