IBM Labs in Haifa © 2004 IBM Corporation Embedded Concatenative Text-to-Speech Ron Hoory, Zvi Kons, Dan Chazan, Slava Shechtman Media Services and Technologies Group October 14, 2004


Page 1: Embedded Concatenative Text-to-Speech - IBM …research.ibm.com/haifa/Workshops/summerseminar2004/...

IBM Labs in Haifa © 2004 IBM Corporation

Embedded Concatenative Text-to-Speech

Ron Hoory, Zvi Kons, Dan Chazan, Slava Shechtman
Media Services and Technologies Group
October 14, 2004


Why Text-to-Speech? Why Concatenative?

• Text-to-speech eliminates the need to prerecord all possible messages
• The alternative, recorded prompts, is much less flexible:
    • Cannot synthesize words or phrases outside the inventory
    • Adding new prompts is expensive
    • No expression: prosody cannot be controlled, especially when combining prompts
• Text-to-speech (TTS) can synthesize arbitrary text
• In concatenative text-to-speech (CTTS), small segments of speech are selected from a large speech database and concatenated together


The role of TTS in a conversational system

• Critical component of the conversational interface
• Only way to present information in eyes-busy/unavailable situations (car, phone)
• Quality of a conversational system is often equated with TTS quality

[Figure: conversational system. Input path: voice -> Speech Recognition -> text -> Natural Language Understanding -> ‘meaning’ -> Dialog Manager. Output path: Dialog Manager -> ‘meaning’ -> Natural Language Generation -> text -> Speech Synthesis -> voice.]


How does a CTTS system work?

[Figure: CTTS block diagram. Front-End: text -> Normalization -> Text to Unit Conversion and Text to Prosody Targets (using prosody models). Back-End: Segment Selection (from the database) -> Post-Search Modification -> speech. Front-end example: "We visited Rodeo Dr." becomes "We visited Rodayo Drive."]


Language dependency and other considerations

• The front-end is mostly language dependent, relying on language rules, pronunciation dictionaries, etc.
• The back-end is mostly language independent, except for the speech database (a.k.a. “voice”)
• The voice needs to be recorded:
    • In the desired language and accent, e.g., “Canadian French”
    • With the desired speaker (“voice talent”):
        • male/female
        • low/high pitch
        • slow/fast speaking rate
    • In a professional recording studio with professional equipment
    • With a sampling rate above the target sampling rate (usually 22 kHz)


Text Normalization

• Language-independent text cleaning (HTML tags, etc.)
• Language-dependent normalization for dates, time, numbers, currency, phone numbers, addresses, abbreviations
• Examples:
    • St. Martin St. becomes Saint Martin Street
    • Dr. King Dr. becomes Doctor King Drive
    • 1 oz. becomes one ounce
    • 2 oz. becomes two ounces
    • $5 million becomes five million dollars
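The normalization examples above can be reproduced with a few regular-expression rules. The following is only a minimal sketch, not the IBM front-end: the lookup tables and the `normalize` function are hypothetical, they cover only the cases listed on the slide, and the "capitalized word follows" test for telling a title from a street suffix is a crude assumption.

```python
import re

# Hypothetical, illustration-only tables; a real front-end would rely on
# much larger language-specific rules and dictionaries.
NUMBER_WORDS = {"1": "one", "2": "two", "5": "five"}
TITLES = {"St.": "Saint", "Dr.": "Doctor"}      # abbreviation before a name
SUFFIXES = {"St.": "Street", "Dr.": "Drive"}    # abbreviation after a name


def normalize(text):
    """Expand the abbreviations, units and currency from the slide examples."""
    # "$5 million" -> "five million dollars"
    text = re.sub(
        r"\$(\d+) (million|billion)",
        lambda m: f"{NUMBER_WORDS.get(m.group(1), m.group(1))} {m.group(2)} dollars",
        text,
    )
    # "1 oz." -> "one ounce", "2 oz." -> "two ounces"
    text = re.sub(
        r"\b(\d+) oz\.",
        lambda m: f"{NUMBER_WORDS.get(m.group(1), m.group(1))} "
                  f"{'ounce' if m.group(1) == '1' else 'ounces'}",
        text,
    )

    # Position decides the reading: followed by a capitalized word the
    # abbreviation is treated as a title, otherwise as a street suffix.
    def expand(m):
        abbr, following = m.group(1), m.group(2)
        table = TITLES if following else SUFFIXES
        return table[abbr] + (following or "")

    return re.sub(r"\b(St\.|Dr\.)( (?=[A-Z]))?", expand, text)


if __name__ == "__main__":
    for example in ["St. Martin St.", "Dr. King Dr.", "1 oz.", "2 oz.", "$5 million"]:
        print(example, "->", normalize(example))
```

The same abbreviation maps to different words depending on context, which is why this stage is language dependent and cannot be reduced to a flat lookup table.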


Possible Concatenation Units

• Words
• Syllables
• Demi-syllables
• Diphones
• Augmented diphones
• Phone units
• Subphone units


Concatenation Units in the IBM CTTS system

• HMM state-sized segments (3 states per phone)
• Segments are classified according to their phonetic context:
    • Phonetic context is determined by a binary decision tree with questions on neighboring phones
    • Segments are labeled according to the leaves of the context-dependent decision tree
    • Typically 10-20 database occurrences per leaf label
• Text is first converted to phones using a pronunciation dictionary and then to leaves

[Figure: HMM states S1, S2, S3 and decision-tree leaves L1, L2, L3]
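The phone-to-leaf conversion can be pictured as walking a small binary tree for each of the three HMM states of a phone. The sketch below is purely illustrative: the phone classes, the two questions, the leaf naming scheme and the `toy_tree` builder are all made up for the example, and real trees are trained from data and ask many more questions.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

# Hypothetical phone classes used by the toy questions below.
VOWELS = {"AA", "AE", "IY", "UW"}
NASALS = {"M", "N"}


@dataclass
class Node:
    question: Callable[[str, str], bool]   # asked about (left, right) neighbors
    yes: Union["Node", str]                # subtree, or a leaf label (string)
    no: Union["Node", str]


def find_leaf(tree, left, right):
    """Walk the binary tree until a leaf label (a plain string) is reached."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(left, right) else tree.no
    return tree


def toy_tree(phone, state):
    """Build a made-up two-question tree for one HMM state of one phone."""
    prefix = f"{phone}_s{state}"
    return Node(
        lambda left, right: left in VOWELS,
        f"{prefix}_L1",
        Node(lambda left, right: right in NASALS, f"{prefix}_L2", f"{prefix}_L3"),
    )


def phones_to_leaves(phones: List[str]) -> List[str]:
    """Map a phone sequence to leaf labels, three states per phone."""
    padded = ["SIL"] + list(phones) + ["SIL"]   # silence as utterance context
    leaves = []
    for i in range(1, len(padded) - 1):
        left, phone, right = padded[i - 1], padded[i], padded[i + 1]
        for state in (1, 2, 3):
            leaves.append(find_leaf(toy_tree(phone, state), left, right))
    return leaves


if __name__ == "__main__":
    print(phones_to_leaves(["K", "AE", "N"]))
```

Each leaf label then indexes the handful of database segments recorded in a similar phonetic context.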


Prosody modeling

• Prosody is critical for obtaining the right pronunciation and intonation
• Wrong prosody can cause speech to sound unnatural or even unintelligible
• Prosody targets typically include:
    • Pitch
    • Phone durations
    • Energy
• Prosody parameters can be trained to match the target speaker's prosody


How can prosody affect naturalness?

• Expressiveness is a very important factor in speech naturalness. Controlling prosody can generate expression
• Neutral prosody: [example omitted]
• Expressive prosody: [example omitted]


Segment Selection and Post-Search Modification

• Each segment is selected from all the database candidates labeled with the target leaf label
• Dynamic programming is used to optimize the series of selected segments by minimizing a cost function
• The cost function weights:
    • Proximity to prosody targets (pitch, duration)
    • Continuity between consecutive segments chosen:
        • Spectrum continuity
        • Pitch continuity
• Post-search modifications are carried out to adjust the pitch, duration and energy to match the target prosody
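The dynamic-programming search can be sketched as a Viterbi-style pass over the candidate lists. This is an assumption-laden toy, not the IBM cost function: the `Target` and `Segment` types, the weights `w_target` and `w_join`, and the absolute-difference costs are invented for illustration, and the continuity cost uses pitch only, since spectral continuity would require feature vectors that are not modeled here.

```python
from typing import List, NamedTuple


class Target(NamedTuple):
    pitch: float      # Hz, from the prosody model
    duration: float   # ms


class Segment(NamedTuple):
    seg_id: int
    pitch: float
    duration: float


def select_segments(targets: List[Target], candidates: List[List[Segment]],
                    w_target: float = 1.0, w_join: float = 1.0) -> List[Segment]:
    """Return the candidate sequence with the lowest total cost."""
    def target_cost(t, c):
        # Proximity to the prosody targets (pitch, duration).
        return abs(t.pitch - c.pitch) + abs(t.duration - c.duration)

    def join_cost(a, b):
        # Pitch continuity only; a stand-in for the full continuity cost.
        return abs(a.pitch - b.pitch)

    # best[i][k]: cheapest cost of any path ending at candidates[i][k]
    best = [[w_target * target_cost(targets[0], c) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[i]:
            costs = [best[i - 1][j] + w_join * join_cost(p, c)
                     for j, p in enumerate(candidates[i - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_min] + w_target * target_cost(targets[i], c))
            ptr.append(j_min)
        best.append(row)
        back.append(ptr)

    # Backtrack from the cheapest final candidate.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    return path[::-1]


if __name__ == "__main__":
    targets = [Target(102, 50), Target(120, 50)]
    candidates = [[Segment(0, 100, 50), Segment(1, 108, 50)],
                  [Segment(2, 120, 50)]]
    print([s.seg_id for s in select_segments(targets, candidates)])
```

In the usage example the first candidate is closer to its prosody target, but the second joins more smoothly to the next segment, so the search trades a slightly worse target match for better continuity.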


IBM high quality CTTS with super-voices

• Building CTTS voices includes:
    • Voice recording using a predefined script
    • Limited manual work for “cleaning”
    • Intensive automatic processing
• IBM Super-voices:
    • Very large recording script:
        • Usually 10,000 sentences are read by the speaker
        • 15 hours of audio, 11 hours of speech excluding silence
    • Script reflects typical scenarios
    • Professional recording studio and professional speakers
    • Three-stage audition process for final voices


Footprint and environments

• The size of the voice dataset is a crucial parameter for quality and naturalness
• The CTTS system can operate in various environments:
    • Server: typical footprint of 500-1000 MB
    • Desktop: typical footprint of 50-100 MB
    • Embedded: typical footprint of 5-10 MB
• The embedded concatenative text-to-speech (eCTTS) challenge: can we reduce the size of the voice by two orders of magnitude without severely degrading the quality?


Why eCTTS?

• Server-based CTTS requires a connection (wired or wireless) to the server, which is not always available
• Device manufacturers and car manufacturers usually prefer embedded applications running locally
• Even with the growing amount of resources available on embedded devices and in-car systems, a small-footprint eCTTS is required:
    • Memory and processing power are important factors in the price of embedded devices
    • Typically, the system includes many other components
    • Sometimes several languages must be supported


eCTTS in the automotive market

• IBM’s eCTTS is part of the Honda speech interface in 2005 high-end cars
• Embedded ViaVoice includes embedded TTS and speech recognition, providing a full conversational system
• Main usage is for navigation applications


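How does eCTTS work?

[Figure: eCTTS block diagram. Text enters the TTS Front End, which outputs leaves together with pitch, energy and duration targets. Segment Selection draws feature vectors from the Speech Dataset; the selected features pass through Segment Adjustment & Concatenation and then Feature Reconstruction, which outputs speech.]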


How is the x100 size reduction achieved?

• Reducing the number of segments by segment preselection
• Reusing the same data for several purposes
• Using a more efficient speech model
• Data compression

[Figure: voice dataset size reduction, in MB]


Segment Preselection

• The process:
    1. A voice dataset is built with all the segments in place (typically 1M segments)
    2. A large number (100K) of sentences are synthesized and statistics on the selected segments are collected
    3. The fraction of segments that were selected most frequently is kept
• Typically 7-10% are chosen, resulting in ~100K segments and a coverage of 70-80% of the selections made during the statistics collection
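The three preselection steps amount to counting how often each segment is selected and keeping the most frequent ones. The sketch below assumes the selections are already available as lists of segment ids, one list per synthesized sentence; the `preselect` function and its parameter names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def preselect(selections: Iterable[List[int]], keep_fraction: float,
              total_segments: int) -> Tuple[Set[int], float]:
    """Keep the most frequently selected segments and report coverage.

    selections     -- segment ids chosen for each synthesized sentence
    keep_fraction  -- share of the full dataset to keep (e.g. 0.07 to 0.10)
    total_segments -- number of segments in the full voice dataset
    """
    # Step 2: collect selection statistics over all synthesized sentences.
    counts = Counter(seg for sentence in selections for seg in sentence)
    # Step 3: keep the most frequently selected fraction of the dataset.
    n_keep = max(1, int(total_segments * keep_fraction))
    kept = {seg for seg, _ in counts.most_common(n_keep)}
    # Coverage: share of all selections that fall on a kept segment.
    total = sum(counts.values())
    covered = sum(n for seg, n in counts.items() if seg in kept)
    return kept, (covered / total if total else 0.0)


if __name__ == "__main__":
    toy = [[1, 2, 3], [1, 2, 4], [1, 5, 2], [1, 6]]
    kept, coverage = preselect(toy, keep_fraction=0.2, total_segments=10)
    print(sorted(kept), round(coverage, 3))
```

Because selection frequencies are highly skewed, a small fraction of the segments can account for most of the selections, which is what makes the 7-10% figure compatible with 70-80% coverage.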


Speech Model

• Frequency-domain sinusoidal model, with amplitude and phase computed for every pitch harmonic (voiced frames)
• Accurate representation of the spectral envelope, which is used in segment selection, pitch modification and reconstruction

[Figure: spectral envelope plotted against frequency, with harmonics spaced at the pitch frequency; harmonic i is represented by the complex value $A_i e^{j\phi_i}$]
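For a voiced frame, a sinusoidal model of this kind reconstructs the waveform as a sum of harmonics of the pitch frequency, s(t) = sum over i of A_i cos(2 pi i f0 t + phi_i). The sketch below shows only that textbook reconstruction formula; the function name, the default sample rate of 22050 Hz and the frame length are assumptions, and the actual system's envelope coding and frame handling are not described in the slides.

```python
import math
from typing import List, Sequence


def synthesize_voiced_frame(f0: float, amplitudes: Sequence[float],
                            phases: Sequence[float],
                            sample_rate: int = 22050,
                            n_samples: int = 256) -> List[float]:
    """Sum the pitch harmonics of one voiced frame.

    amplitudes[i] and phases[i] describe harmonic i + 1, whose frequency
    is (i + 1) * f0.
    """
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        frame.append(sum(
            a * math.cos(2.0 * math.pi * (i + 1) * f0 * t + phi)
            for i, (a, phi) in enumerate(zip(amplitudes, phases))
        ))
    return frame


if __name__ == "__main__":
    frame = synthesize_voiced_frame(100.0, [1.0, 0.5], [0.0, 0.0])
    print(len(frame), frame[0])
```

Storing one amplitude and one phase per harmonic is far more compact than storing raw samples, and changing `f0` while resampling the envelope is what makes pitch modification straightforward in this representation.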


Demonstration

Language      Example *
German        [audio]
Italian       [audio]
UK English    [audio]
US English    [audio]

* All voices are 22 kHz/10 MB except the German male voice, which is 11 kHz/8 MB