embedded concatenative text-to-speech - ibm...

IBM Labs in Haifa © 2004 IBM Corporation

Embedded Concatenative Text-to-Speech

Ron Hoory, Zvi Kons, Dan Chazan, Slava Shechtman
Media Services and Technologies Group
October 14, 2004

Why Text-to-Speech? Why Concatenative?

• Text-to-speech eliminates the need to prerecord all possible messages
• The alternative, recorded prompts, is much less flexible:
  • Cannot synthesize words or phrases outside the inventory
  • Adding new prompts is expensive
  • No expression: prosody cannot be controlled, especially when combining prompts
• Text-to-speech (TTS) can synthesize arbitrary text
• In concatenative text-to-speech (CTTS), small segments of speech are selected from a large speech database and concatenated together

The role of TTS in a conversational system

• Critical component of the conversational interface
• Only way to present information in eyes-busy/unavailable situations (car, phone)
• Quality of a conversational system is often equated with TTS quality

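[Diagram: conversational system loop. Input side: voice → Speech Recognition → text → Natural Language Understanding → ‘meaning’ → Dialog Manager. Output side: Dialog Manager → ‘meaning’ → Natural Language Generation → text → Speech Synthesis → voice.]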

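How does a CTTS system work?

[Diagram: CTTS pipeline, divided into a Front-End and a Back-End. Stages in order: Normalization, Text to Unit Conversion, Text to Prosody Targets, Segment Selection, Post-Search Modification. Input: text. Resources: prosody models, database. Output: speech.]

Example: "We visited Rodeo Dr." becomes "We visited Rodayo Drive."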

Language dependency and other considerations

• The front-end is mostly language dependent, relying on language rules, pronunciation dictionaries, etc.
• The back-end is mostly language independent, except for the speech database (a.k.a. “voice”)
• The voice needs to be recorded:
  • In the desired language and accent, e.g., “Canadian French”
  • With the desired speaker (“voice talent”):
    • male/female
    • low/high pitch
    • slow/fast speaking rate
  • In a professional recording studio with professional equipment
  • With a sampling rate above the target sampling rate (usually 22 kHz)

Text Normalization

• Language-independent text cleaning (HTML tags, etc.)
• Language-dependent normalization for dates, times, numbers, currency, phone numbers, addresses, abbreviations
• Examples:
  • St. Martin St. becomes Saint Martin Street
  • Dr. King Dr. becomes Doctor King Drive
  • 1 oz. becomes one ounce
  • 2 oz. becomes two ounces
  • $5 million becomes five million dollars
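
The normalization examples above can be reproduced with a few context-sensitive rewrite rules. The following is only a minimal sketch, not the IBM front-end: the lookup tables, the function name `normalize`, and the "capitalized word follows" heuristic are all assumptions made for illustration.

```python
import re

# Hypothetical, tiny lookup tables for illustration only.
NUMBER_WORDS = {"1": "one", "2": "two", "5": "five"}
UNIT_WORDS = {"oz.": ("ounce", "ounces")}  # (singular, plural)


def _number(digits: str) -> str:
    """Spell out a number if it is in the toy table, else keep the digits."""
    return NUMBER_WORDS.get(digits, digits)


def _expand_unit(match: re.Match) -> str:
    """Expand '<number> <unit>' and pick singular or plural by the number."""
    digits, unit = match.group(1), match.group(2)
    singular, plural = UNIT_WORDS[unit]
    return f"{_number(digits)} {singular if digits == '1' else plural}"


def normalize(text: str) -> str:
    """Expand a few currency amounts, units, and ambiguous abbreviations."""
    # "$5 million" -> "five million dollars"
    text = re.sub(
        r"\$(\d+) (million|billion)",
        lambda m: f"{_number(m.group(1))} {m.group(2)} dollars",
        text,
    )
    # "1 oz." -> "one ounce", "2 oz." -> "two ounces"
    text = re.sub(r"(\d+) (oz\.)", _expand_unit, text)
    # Heuristic (assumption): "St."/"Dr." directly before a capitalized word
    # is a title; any remaining occurrence is treated as a street type.
    text = re.sub(r"\bSt\.(?= [A-Z])", "Saint", text)
    text = re.sub(r"\bSt\.", "Street", text)
    text = re.sub(r"\bDr\.(?= [A-Z])", "Doctor", text)
    text = re.sub(r"\bDr\.", "Drive", text)
    return text
```

The capitalization heuristic is deliberately naive: it would misread a street abbreviation that ends one sentence and is followed by the capitalized start of the next, so a real front-end would also need sentence-boundary detection.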

Possible Concatenation Units

• Words
• Syllables
• Demi-syllables
• Diphones
• Augmented diphones
• Phone units
• Subphone units

Concatenation Units in the IBM CTTS system

• HMM state-sized segments (3 states per phone)
• Segments are classified according to their phonetic context:
  • Phonetic context is determined by a binary decision tree with questions on neighboring phones
  • Segments are labeled according to the leaves of the context-dependent decision tree
  • Typically 10-20 database occurrences per leaf label
• Text is first converted to phones using a pronunciation dictionary and then to leaves

[Diagram: the three HMM states S1, S2, S3 of a phone and the decision-tree leaves L1, L2, L3]
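
To make the phone-to-leaf conversion concrete, here is a minimal sketch of walking a binary decision tree whose questions look at the neighboring phones. The phone classes, the toy tree, the leaf names, and the `"SIL"` padding symbol are all hypothetical; the slides do not specify the actual questions or label format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

# Hypothetical phone class used by the toy questions below.
VOWELS = {"AA", "AE", "IY", "UW", "EH"}


@dataclass
class Node:
    """Internal tree node: a yes/no question on the (left, right) neighbors."""
    question: Callable[[str, str], bool]
    yes: Union["Node", str]  # subtree or leaf label
    no: Union["Node", str]


def find_leaf(tree: Union[Node, str], left: str, right: str) -> str:
    """Descend from the root until a leaf label (a plain string) is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(left, right) else node.no
    return node


def phones_to_leaves(
    phones: List[str], trees: Dict[Tuple[str, int], Union[Node, str]]
) -> List[str]:
    """Map each phone to three leaf labels, one per HMM state."""
    leaves = []
    for i, phone in enumerate(phones):
        left = phones[i - 1] if i > 0 else "SIL"
        right = phones[i + 1] if i + 1 < len(phones) else "SIL"
        for state in (1, 2, 3):
            # Fall back to a single context-independent leaf if no tree exists.
            tree = trees.get((phone, state), f"{phone}_{state}")
            leaves.append(find_leaf(tree, left, right))
    return leaves


# Toy tree for state 1 of phone "T" (illustrative only).
TREE_T_S1 = Node(
    question=lambda left, right: left in VOWELS,
    yes=Node(lambda left, right: right in VOWELS, yes="T_1_a", no="T_1_b"),
    no="T_1_c",
)
```

With this toy tree, `phones_to_leaves(["AE", "T", "IY"], {("T", 1): TREE_T_S1})` yields nine labels, and the first state of "T" lands in leaf `T_1_a` because both neighbors are vowels.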

Prosody modeling

• Prosody is critical for obtaining the right pronunciation and intonation
• Wrong prosody can cause speech to sound unnatural or even unintelligible
• Prosody targets typically include:
  • Pitch
  • Phone durations
  • Energy
• Prosody parameters can be trained to match the target speaker's prosody

How can prosody affect naturalness?

• Expressiveness is a very important factor in speech naturalness. Controlling prosody can generate expression
• Neutral prosody: [audio example]
• Expressive prosody: [audio example]

Segment Selection and Post-Search Modification

• Each segment is selected from all the database candidates labeled with the target leaf label
• Dynamic programming is used to optimize the series of selected segments by minimizing a cost function
• The cost function weights:
  • Proximity to prosody targets (pitch, duration)
  • Continuity between consecutive chosen segments:
    • Spectrum continuity
    • Pitch continuity
• Post-search modifications are carried out to modify the pitch, duration and energy to match the target prosody
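
The dynamic-programming search described above can be sketched as a Viterbi-style pass over the candidate lists. This is an illustrative simplification under stated assumptions: segments are plain dictionaries with hypothetical `"pitch"` and `"duration"` fields, the join cost covers pitch continuity only (spectral continuity is omitted), and the weights `w_target` and `w_join` are invented parameters.

```python
from typing import Dict, List

Segment = Dict[str, float]  # hypothetical fields: "pitch", "duration"


def target_cost(seg: Segment, target: Segment) -> float:
    """Distance of a candidate from the prosody target (pitch, duration)."""
    return abs(seg["pitch"] - target["pitch"]) + abs(
        seg["duration"] - target["duration"]
    )


def join_cost(prev: Segment, cur: Segment) -> float:
    """Pitch discontinuity between two consecutive segments."""
    return abs(prev["pitch"] - cur["pitch"])


def select_segments(
    candidates: List[List[Segment]],
    targets: List[Segment],
    w_target: float = 1.0,
    w_join: float = 1.0,
) -> List[int]:
    """Return, per position, the index of the candidate on the cheapest path.

    candidates[t] holds the (non-empty) candidate list for position t.
    """
    # best[t][k]: minimal total cost of any path ending at candidate k of t.
    best = [[w_target * target_cost(c, targets[0]) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for t in range(1, len(targets)):
        row, ptr = [], []
        for cur in candidates[t]:
            costs = [
                best[t - 1][j] + w_join * join_cost(prev, cur)
                for j, prev in enumerate(candidates[t - 1])
            ]
            j_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_best] + w_target * target_cost(cur, targets[t]))
            ptr.append(j_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [k]
    for t in range(len(targets) - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    return path[::-1]
```

Raising `w_join` relative to `w_target` makes the search prefer smooth joins over an exact prosody match, which is the trade-off the cost function weights control.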

IBM high quality CTTS with super-voices

• Building CTTS voices includes:
  • Voice recording using a predefined script
  • Limited manual work for “cleaning”
  • Intensive automatic processing
• IBM super-voices:
  • Very large recording script:
    • Usually 10,000 sentences are read by the speaker
    • 15 hours of audio, 11 hours of speech excluding silence
  • Script reflects typical scenarios
  • Professional recording studio and professional speakers
  • Three-stage audition process for final voices

Footprint and environments

• Size of the voice dataset is a crucial parameter for quality and naturalness
• The CTTS system can operate in various environments:
  • Server: typical footprint of 500-1000 MB
  • Desktop: typical footprint of 50-100 MB
  • Embedded: typical footprint of 5-10 MB
• The embedded concatenative text-to-speech (eCTTS) challenge: can we reduce the size of the voice by two orders of magnitude without severely degrading the quality?

Why eCTTS?

• Server-based CTTS requires a connection (wired or wireless) to the server, which is not always available
• Device manufacturers and car manufacturers usually prefer embedded applications running locally
• Even with the growing amount of resources available on embedded devices and in-car systems, small-footprint eCTTS is required:
  • Memory and processing power are important factors in the price of embedded devices
  • Typically, the system includes many other components
  • Sometimes several languages must be supported

eCTTS in the automotive market

• IBM's eCTTS is part of the Honda speech interface in 2005 high-end cars
• Embedded ViaVoice includes embedded TTS and speech recognition, providing a full conversational system
• Main usage is for navigation applications

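How does eCTTS work?

[Diagram: eCTTS pipeline. Blocks: TTS Front End, Segment Selection, Segment Adjustment & Concatenation, Feature Reconstruction, with a Speech Dataset as the data source. Signals labeled in the diagram: leaves, feature vectors, features, pitch, energy, duration, speech.]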

How is the 100x size reduction achieved?

• Reducing the number of segments by segment preselection
• Reusing the same data for several purposes
• Using a more efficient speech model
• Data compression

[Figure: voice dataset size before and after reduction, in MB]

Segment Preselection

• The process:
  1. A voice dataset is built with all the segments in place (typically 1M segments)
  2. A large number (100K) of sentences are synthesized and statistics on the selected segments are collected
  3. The fraction of segments that were most frequently selected is chosen
• Typically 7-10% are chosen, resulting in ~100K segments and a coverage of 70-80% of the selections made during the statistics collection
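
The preselection steps above amount to counting how often each segment is picked and keeping the most popular ones. The sketch below assumes the selection log is simply an iterable of integer segment IDs; the function name `preselect` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def preselect(
    selections: Iterable[int], keep_fraction: float, total_segments: int
) -> Tuple[Set[int], float]:
    """Keep the most frequently selected segments and report their coverage.

    selections: segment IDs chosen while synthesizing the sample sentences.
    keep_fraction: share of the full dataset to retain (e.g. 0.07 to 0.10).
    total_segments: number of segments in the full voice dataset.
    """
    counts = Counter(selections)
    n_keep = max(1, int(total_segments * keep_fraction))
    kept = {seg for seg, _ in counts.most_common(n_keep)}
    total = sum(counts.values())
    # Coverage: share of all logged selections that the kept set satisfies.
    coverage = sum(counts[seg] for seg in kept) / total if total else 0.0
    return kept, coverage
```

For instance, if a 10-segment dataset logs segment 0 five times, segment 1 three times, and segments 2 and 3 once each, keeping 20% retains segments 0 and 1 and covers 8 of the 10 logged selections.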

Speech Model

• Frequency-domain sinusoidal model, with amplitude and phase computed for every pitch harmonic (voiced frames)
• Accurate representation of the spectral envelope, which is used in segment selection, pitch modification and reconstruction

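[Figure: spectral envelope versus frequency, sampled at the pitch harmonics; harmonic $i$ is labeled with its complex value $A_i e^{j\varphi_i}$ (amplitude and phase)]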
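
As a rough illustration of what a harmonic sinusoidal model stores, the sketch below rebuilds one voiced frame as a sum of cosines at multiples of the pitch frequency. It is not the IBM reconstruction algorithm: the function name, the argument layout, and the absence of windowing or overlap-add between frames are simplifying assumptions.

```python
import math
from typing import List, Sequence


def synthesize_voiced_frame(
    amplitudes: Sequence[float],
    phases: Sequence[float],
    f0_hz: float,
    sample_rate_hz: float,
    n_samples: int,
) -> List[float]:
    """Return s[n] = sum_i A_i * cos(2*pi*i*f0*n/fs + phi_i), i = 1, 2, ...

    amplitudes[k] and phases[k] describe harmonic number k + 1.
    """
    frame = []
    for n in range(n_samples):
        t = n / sample_rate_hz
        sample = sum(
            a * math.cos(2.0 * math.pi * (k + 1) * f0_hz * t + phi)
            for k, (a, phi) in enumerate(zip(amplitudes, phases))
        )
        frame.append(sample)
    return frame
```

Because only one amplitude and one phase per harmonic are kept, a frame with a 100 Hz pitch at a 22 kHz sampling rate needs on the order of a hundred values, and changing `f0_hz` while resampling the envelope is what makes pitch modification natural in this representation.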

Demonstration

Language        Example *
German          [audio example]
Italian         [audio example]
UK English      [audio example]
US English      [audio example]

* All voices are 22 kHz/10 MB except the German male, which is 11 kHz/8 MB
