dynamic calls with text to speech

Fundamentals of Text To Speech in UC

Patrick Dexter

Thank you. My name is Patrick Dexter, I’m with a company called Cepstral, and today I’ll be talking about Text To Speech voices. We’ll discuss how a TTS voice is made, the component parts of text to speech software, and how that fits into Unified Communications software like Elastix.

Cepstral

Text To Speech innovator

Founded in 2001

Focus on North and South America

Elastix Partner since 2011

To give you some background information, Cepstral is a commercial company spun out of Carnegie Mellon University in Pittsburgh, Pennsylvania. We have customers all around the world, from announcements at train stations in New Zealand and Australia to thousands of concurrent ports in large call centers in Canada. Our main customer base is in North and South America, and we’ve been a proud partner of Elastix since 2011.

@Cepstral_LLC

Our marketing department wouldn’t let me do this presentation without giving you our Twitter address. It’s also useful: if you have any questions about this presentation or TTS in general, tweet them to me and I’ll respond.

What is Text To Speech?

So what is Text To Speech? Text To Speech is the ability to create audio that was never recorded before. There are far too many words to record them all, and new ones are being created every day. We see this all the time in telephone systems. You need to tell a caller the amount of money they have in an account, or that their package will be delivered to a specific address. Information is constantly changing, so you need a way to get it to your callers.

Fun History of TTS

Before we dive into more details about Text To Speech I want to show you one of the earliest speech synthesis devices.

The machine pictured on this slide is a replica of the first speech synthesizer, originally developed by Wolfgang von Kempelen in the late 1700s. Interestingly, this machine from the 1840s was viewed and studied by Alexander Graham Bell, who created his own version and used many of the ideas when he invented the telephone! So speech synthesis and the telephone have been used together since the very beginning.

Text To Speech Technologies

So there are several different competing technologies that are used to create Text To Speech voices.

• Formant
• Diphone
• Statistical Parametric

If you’re familiar with Text To Speech you’ve heard of some of these.

Formant synthesis creates mathematical models of the tissue in the mouth and lungs. It has a very small footprint but requires a great deal of computing power to operate. To me it sounds like an opera singer doing scales of aaaaahhhhhhs. Formant synthesizers are good at doing vowel sounds in a range of pitches.

Diphone voices are quite robotic. This is the Stephen Hawking voice: it’s easily understood but doesn’t sound like a person.

Statistical parametric voices are sometimes called HMM voices for their use of Hidden Markov Models. They build a model of speech from a corpus and then use that model to generate the new audio.

Unit Selection Synthesis

But the primary technology being used in commercial Text To Speech systems today is Unit Selection synthesis. You’re already familiar with this if your mobile phone talks back to you. It’s what Siri and Google Now use.

Unit Selection voices provide the most human-like experience today, and that’s because they are made with the recordings of actual people. These recordings are identified and labeled to create a database of sounds.

In this sense Unit Selection is similar to a ransom note.

We’ve all seen these in movies and TV shows. You cut up letters and rearrange them to form new words. At the most basic level this is what Unit Selection synthesis is all about. In English there are 26 letters, in Spanish 29, so all we have to do is record about 30 things and we have a Unit Selection Text To Speech voice, right? Well, no. Unfortunately it’s not this easy.

Unit Selection Synthesis

You can’t just record the alphabet because human speech is not made up of letters. We use letters to write down speech, but when spoken, speech is made up of sounds which we call phonemes. How these phonemes are pronounced can vary quite a lot depending on what you’re saying and, importantly, where in the sentence that sound occurs. So much so that there are specialized alphabets specifically for phonemes.

agua

So let’s take a look at phonemes. I tried to modify this presentation with Spanish examples when possible.

Agua is a word that even with my limited Spanish skills I can pronounce.

a1 g xu0 a0

And here is the phonetic spelling of that word. If you’re curious, this is based on the Carnegie Mellon University phonetic set. There are several other phonetic alphabets (IPA and SAMPA are popular), but we use a variation of the CMU alphabet for our voices.

a1 g xu0 a0

OK, so looking at this first phoneme, we are depicting the ah sound with a 1, where the 1 denotes that the vowel is stressed.

a1 g xu0 a0

The guh sound is represented by the g.

a1 g xu0 a0

And here’s one that should be new: xu with a 0. This is where phonetic alphabets start to differ from a regular alphabet. If we just used the u to describe this sound it wouldn’t work.

aquí

Here’s a very similar word. It has a u in it but it’s pronounced completely differently: aquí versus agua. Do you hear that whuu sound?

a1 g xu0 a0

That whuuu sound is identified by this x u phone, and the 0 marks it as being unstressed.

a1 g xu0 a0

The a 0 now completes the word to give us the whhhuuaaa sound.
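The notation on these slides can be read mechanically. The following is a minimal Python sketch that splits a transcription such as "a1 g xu0 a0" into phones and stress marks. The `Phone` type and `parse_phones` function are names invented for this illustration, and the pattern only covers the symbols shown on the slides, not the full CMU-style phone set.

```python
import re
from typing import NamedTuple, Optional


class Phone(NamedTuple):
    symbol: str            # phone label, e.g. "xu"
    stress: Optional[int]  # 1 = stressed, 0 = unstressed, None = no stress mark


def parse_phones(transcription: str) -> list[Phone]:
    """Split a space-separated transcription into (symbol, stress) pairs."""
    phones = []
    for token in transcription.split():
        match = re.fullmatch(r"([a-z]+)([01])?", token)
        if match is None:
            raise ValueError(f"unrecognised phone: {token!r}")
        symbol, stress = match.groups()
        phones.append(Phone(symbol, int(stress) if stress is not None else None))
    return phones


agua = parse_phones("a1 g xu0 a0")
for phone in agua:
    print(phone.symbol, phone.stress)
```

Only the vowels carry a stress digit here, so the consonant g comes back with `None`.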

Create a new unit selection TTS voice

I think the really cool thing about Unit Selection voices is that we need to select a specific person to record a new voice, and that their voice will live on, theoretically forever.

So how do we grab all of those phonemes? We lock that lucky voice actor in a sound booth and force them to record hours and hours worth of carefully worded scripts. These scripts are designed to capture as many of the phoneme interactions as possible.

Nowadays I wish I used cheese to coax them out, because bacon can be awkward.

Here’s an actual sentence from our English script. This sentence is grammatically correct, which helps the voice actor to read it in a natural tone. Once this has been recorded, the audio file is segmented into individual phonemes, and the location of the phonemes in the syllables, words, common phrases, and then sentences is noted. This allows us to better match sounds when the software runs. We’ll see an example of this in a minute.

Labeling

Pitch Duration Position Diphones

The phonemes are then labelled with acoustic parameters or context factors like the fundamental frequency or pitch, time duration, position in the syllable, and neighboring phonemes. Because of all of these different possible interactions, when we’re done we’ll have hundreds of thousands of examples, or units, of these phonemes in our database.
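One way to picture a labelled unit is as a small record holding those context factors. This is a sketch only: the field names, the `build_index` helper and every number in it are made up for illustration and are not Cepstral’s actual database layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str          # phoneme label, e.g. "xu"
    recording: str      # which recorded word or sentence it was cut from
    pitch_hz: float     # fundamental frequency
    duration_ms: float  # how long the sound lasts
    position: str       # "start", "middle" or "end" of the syllable
    prev_phone: str     # left neighbour ("" at a boundary)
    next_phone: str     # right neighbour ("" at a boundary)


def build_index(units):
    """Group labelled units by phoneme so candidates can be looked up quickly."""
    index = defaultdict(list)
    for unit in units:
        index[unit.phone].append(unit)
    return index


# Invented measurements for three of the phones in "agua".
units = [
    Unit("a", "agua", 125.0, 90.0, "start", "", "g"),
    Unit("xu", "agua", 118.0, 70.0, "middle", "g", "a"),
    Unit("a", "agua", 110.0, 95.0, "end", "xu", ""),
]
index = build_index(units)
print(len(index["a"]), "examples of the a phoneme")
```

The same phoneme appears twice with different labels, which is why a real database ends up with so many units.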

agua salida sola jamón hora

Now that we have a database filled with units we can select them to create new audio that was never said by the voice talent. Here’s a group of words, most likely recorded at different times of the day or even days or years apart. We have to use the same voice talent for a single voice, so based on our own testing and customer feedback we’ll record new material and build that into the voice in order to provide more natural synthesis.

Let’s go through this and create a new word by selecting units from these recordings.

a gxua salida sola jamón hora

g xu a

Starting off with our old friend agua, we’ll take the last bit of that. You can see that we’re grabbing several phonemes at the same time. The TTS engine is looking for units that will match up best. So if it can find phonemes that were near each other already the audio will probably sound more natural. What we want is for the phonemes from different recordings to join together smoothly. If they don’t there’s a jump that the two sounds will have to make and that’s when you hear the glitches in TTS audio that make it sound robotic. So getting smooth joins is of paramount importance. That’s one of the reasons why we need hundreds of thousands of these phonemes.

agua sali da sola jamón hora

g xu a d a

Continuing along with our example. Now we’ll add in a group of phonemes from the next word.

agua salida so la jamón hora

g xu a d a l a

We’ll continue to do this to build out the new word.

agua salida sola xa món hora

g xu a d a l a x a

In an actual TTS engine this selection will only take milliseconds. This is how the software can be used in a telephone or unified communications system: it operates faster than real time. The engine will also be performing a number of other calculations that we’ll look at in a minute. To me it’s still amazing that Text To Speech even works at all.

agua salida sola jamón ho ra

g xu a d a l a x a r a

And finishing up gives us

g xu a0 d a0 l a1 x a0 r a

Guadalajara

The Mexican city of Guadalajara.

Now this is just a simple example of creating a new word. In real life a TTS engine is looking at features like phrase boundaries and whether the phoneme occurs at the beginning, middle, or end of a word. Going even further, where in the original recording was the word? All of these attributes influence how that phoneme is said.

Hay agua en Marte.

Beber el agua.

The whuuuuuu phonemes from agua in these two sentences are labeled differently. Typically at the end of a sentence the pitch descends: Beber el agua. I know my Spanish is very, very bad, but in el agua the l bleeds into the a. It would be difficult to take that a phoneme and use it in Marte, for example.

The perfect unit required at synthesis time may not be available in the database, so a selection must be performed to choose, from amongst the many slightly mismatched units, the best available sequence of units to concatenate. The more units we have, the greater the chance that we’ll find that perfect unit.
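The Guadalajara walkthrough can be sketched as a small search problem. The Python below is a toy, not how a production engine scores units: the only cost is a join cost of 1 whenever two neighbouring units were not adjacent in the same recording, and a dynamic-programming pass picks the sequence with the fewest joins. The phone spellings for salida, sola, jamón and hora are assumptions extrapolated from the slides, and `candidates`, `join_cost` and `select_units` are invented names.

```python
# Toy recordings: word -> phones (spellings partly assumed, see lead-in).
RECORDINGS = {
    "agua": "a g xu a",
    "salida": "s a l i d a",
    "sola": "s o l a",
    "jamón": "x a m o n",
    "hora": "o r a",
}


def candidates(phone):
    """Every recorded unit of a phone, as (recording, position, phone)."""
    return [
        (word, i, p)
        for word, phones in RECORDINGS.items()
        for i, p in enumerate(phones.split())
        if p == phone
    ]


def join_cost(left, right):
    """0 if the two units were neighbours in the same recording, else 1."""
    contiguous = left[0] == right[0] and left[1] + 1 == right[1]
    return 0 if contiguous else 1


def select_units(target):
    """Return (total join cost, unit sequence) with the fewest joins."""
    if not target:
        raise ValueError("target must contain at least one phone")
    best = {}  # unit -> (cost of the cheapest path ending in that unit, path)
    for n, phone in enumerate(target):
        options = candidates(phone)
        if not options:
            raise LookupError(f"no recorded unit for phone {phone!r}")
        step = {}
        for unit in options:
            if n == 0:
                step[unit] = (0, [unit])
            else:
                cost, path = min(
                    ((c + join_cost(prev, unit), p) for prev, (c, p) in best.items()),
                    key=lambda item: item[0],
                )
                step[unit] = (cost, path + [unit])
        best = step
    return min(best.values(), key=lambda item: item[0])


cost, path = select_units("g xu a d a l a x a r a".split())
print(cost, [word for word, _, _ in path])
```

With these five recordings the cheapest path takes g xu a from agua, d a from salida, l a from sola, x a from jamón and r a from hora, which is the same cut shown on the slides. A real engine would add a target cost for pitch, duration and position mismatches on top of the join cost.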

User Lexicons

We’ve been talking about the research end of speech synthesis, but knowing all of this also helps with production applications.

Being familiar with phonemes and phonetic alphabets provides both you and the end users of Text To Speech software with the ability to customize the voice through a user lexicon.

User Lexicons

word = phonemes
word = phonemes
word = phonemes
word = phonemes
word = phonemes

User lexicons are lookup tables that replace words in the text with user-defined pronunciations. These can be specialized acronyms that are specific to a company, or people’s names, which is often very useful when using an English voice to pronounce names from Spanish or other languages. Lexicons fine-tune the audio to make sure that it’s as understandable as possible.
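As a rough illustration of the lookup-table idea, here is a Python sketch in which a dictionary overrides the default pronunciation. The two entries reuse the phonetic spellings from the earlier slides. The `pronounce` and `default_pronunciation` functions are hypothetical stand-ins, and this is not the file format that Cepstral’s lexicon tooling actually reads.

```python
# word -> phonemes, in the notation used on the earlier slides.
USER_LEXICON = {
    "agua": "a1 g xu0 a0",
    "guadalajara": "g xu a0 d a0 l a1 x a0 r a",
}


def default_pronunciation(word):
    """Placeholder for the engine's built-in rules: one 'phone' per letter."""
    return " ".join(word.lower())


def pronounce(word, lexicon=USER_LEXICON):
    """Use the user's entry when there is one, otherwise fall back."""
    return lexicon.get(word.lower(), default_pronunciation(word))


print(pronounce("Guadalajara"))
print(pronounce("sol"))
```

The lookup is case-insensitive so that a name at the start of a sentence still matches its entry.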

Text Normalization

I mentioned earlier that in milliseconds the engine performs other calculations as well. One of the more important calculations is called Text Normalization. This actually happens first in the process. So let’s take a look at what that means and why it is a challenge for all Text To Speech engines.

7/10

Let’s say you had a piece of text that said this. What could it mean?

7/10

7 de octubre

Here in Colombia it could be today’s date, October 7th.

7/10

MM/DD/YYYY

In the United States the date format is different.

7/10

July 10th

So this exact same text would be July 10th to me. It’s a bit absurd. I think we’re one of the only countries to use this format; it probably has something to do with our hatred of the metric system as well. I guess we just like to be difficult. Moving on.

7/10

7 dividido por 10

Or we could look at it another way and it would be a math problem. These are the types of issues that Text Normalization has to solve. Unless the software knows what the sentence means it can’t properly pronounce the words to convey that information clearly. One of the keys to figuring these things out is identifying and analyzing them in the context of the sentence as a whole.
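The three readings of 7/10 can be written down as a tiny Python function. This sketch assumes the caller already knows which convention applies, which is exactly the hard part a real normalizer has to infer from context. The `read_slash_pair` name and the convention labels are invented for the example.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def read_slash_pair(text, convention):
    """Expand 'N/M' as a day-first date, a month-first date, or a fraction."""
    first, second = (int(part) for part in text.split("/"))
    if convention == "fraction":
        return f"{first} divided by {second}"
    if convention == "day_first":      # e.g. Colombia: day/month
        day, month = first, second
    elif convention == "month_first":  # e.g. United States: month/day
        month, day = first, second
    else:
        raise ValueError(f"unknown convention: {convention!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"{text!r} is not a valid {convention} date")
    return f"{MONTHS[month - 1]} {day}"


for convention in ("day_first", "month_first", "fraction"):
    print(convention, "->", read_slash_pair("7/10", convention))
```

The same four characters produce three different spoken strings, so the choice of convention has to come from the surrounding sentence or the voice’s language.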

This is a courtesy call to remind Patrick Dexter of an appointment with Dr. Steel on 10/07/2015 at 10:30 am.

About a week before my dentist appointment I receive a phone call reminding me to floss so I don’t get yelled at by the guy with sharp stabby things in my mouth.

If you have a service that provides outbound phone calls, this message is exactly the type of automation that Text To Speech is perfect for.


In a high call volume environment there’s no way you could record every name. And on the call you really do want to identify a specific person - in this case Patrick Dexter. You don’t want the person to show up to the appointment with their son or daughter when it’s actually their appointment.


But even with saying the person’s name, there’s so much here that a computer can easily make a mistake on. The text to speech software needs to identify this text, D R period, not as durrrr followed by the end of the sentence, but as the abbreviation for Doctor. Doctor Steel is what a person reading this would think, so that’s what the engine has to say.


So getting back to our first example: the text to speech engine should be able to correctly interpret this text as a date. It will look at the sentence and know that, at least in English, on is a preposition that typically comes before a date, so that’s a very good clue, and it’s English so the format is month, day, year. This gives you an idea of the type of rules that are built into TTS software. Some of the fun research being done in the field of computational linguistics is on how to apply more artificial intelligence to this process rather than strict rule-based decision trees.


And we’re not done yet. At the end of our sentence there’s again something ambiguous: the time of the appointment. If the TTS engine reads this as 10 colon 30 ammmm it will cause confusion.
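To make the three fixes concrete, here is a rule-based Python sketch applied to the reminder sentence from the slide. Each rule is a regular expression written for this one example: the title rule only knows "Dr.", the date rule relies on the "on" clue and assumes month/day/year, and the time rule just spells out the pieces. A real engine has far more rules than this, and none of these function names come from Cepstral’s software.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def expand_title(text):
    """'Dr.' followed by a capitalised name is a title, not a sentence end."""
    return re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)


def expand_date(text):
    """Read 'on MM/DD/YYYY' as a US-style date, using 'on' as the clue."""
    def repl(match):
        month, day, year = int(match[1]), int(match[2]), match[3]
        if not 1 <= month <= 12:
            return match[0]  # leave anything that is not a plausible date
        return f"on {MONTHS[month - 1]} {day}, {year}"
    return re.sub(r"\bon (\d{1,2})/(\d{1,2})/(\d{4})\b", repl, text)


def expand_time(text):
    """Turn '10:30 am' into words a voice can read: '10 30 A M'."""
    def repl(match):
        hour, minute, half = int(match[1]), match[2], match[3].lower()
        spoken_minute = "o'clock" if minute == "00" else minute
        return f"{hour} {spoken_minute} {'A M' if half == 'am' else 'P M'}"
    return re.sub(r"\b(\d{1,2}):(\d{2})\s*([ap]m)\b", repl, text, flags=re.IGNORECASE)


def normalize(text):
    return expand_time(expand_date(expand_title(text)))


message = (
    "This is a courtesy call to remind Patrick Dexter of an appointment "
    "with Dr. Steel on 10/07/2015 at 10:30 am."
)
print(normalize(message))
```

Note that the final period survives as a real sentence end, while the period in "Dr." is consumed by the title rule.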

Heteronym

Another issue is this lovely thing. Does anyone know what a heteronym is? It’s an evil part of the English language where a single spelling can mean two different things and be pronounced differently! I don’t believe Spanish has heteronyms, but if it does I’d love to find out more.

@Cepstral_LLC

This is where that Twitter handle becomes relevant.

Bass

The word bass can be a fish.

Bass

Or it can be pronounced bass and in music mean the deep low end. As in the bass clef versus the treble clef.

Object

This is a fun word that can be either a verb or a noun.

Me opongo a ese objeto.

In Spanish this sentence may make sense. I’m really hoping; I used Google Translate. But you have two different words for the verb and the noun.

I object to that object.

But in English the sentence is: I object to that object. Do you hear the two different pronunciations of the exact same word? ob-JECT versus OB-ject. If you say both the same way to an English speaker, it doesn’t make sense.

I object (verb) to that object.

The engine can figure out the part of speech to help determine the pronunciation. Here object is a verb.

I object to that object (noun).

And here it is a noun. This functionality is called a part of speech tagger and it’s very helpful in a text to speech engine.
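A toy version of that idea in Python: guess the part of speech of a heteronym from the word before it, then pick the matching pronunciation. The two-rule `guess_pos` function is a deliberately crude stand-in for a real part of speech tagger, and the respellings ob-JECT and OB-ject are informal stress marks rather than the engine’s phonetic alphabet.

```python
HETERONYMS = {
    "object": {"verb": "ob-JECT", "noun": "OB-ject"},
}
PRONOUNS = {"i", "you", "we", "they"}


def guess_pos(previous_word):
    """Crude rule: after a subject pronoun expect a verb, otherwise a noun."""
    return "verb" if previous_word in PRONOUNS else "noun"


def respell(sentence):
    """Replace each known heteronym with the pronunciation for its guessed role."""
    out = []
    previous = ""
    for token in sentence.split():
        core = token.strip(".,!?")
        variants = HETERONYMS.get(core.lower())
        if variants:
            token = token.replace(core, variants[guess_pos(previous)])
        out.append(token)
        previous = core.lower()
    return " ".join(out)


print(respell("I object to that object."))
```

The first object follows the pronoun I and is read as a verb; the second follows that and is read as a noun.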

$10 per day

Currency interpretation is also important and is something we see all of the time in outbound phone call campaigns. Maybe this is a phone call to remind someone of a fine they will have to pay, or a utility bill. We’d read this as 10 dollars or 10 pesos per day.

$10.5 million per day

We’d read this as 10 point 5 million dollars per day, not 10 period 5 dollars million per day. So the engine has to look at the entire sentence, both backwards and forwards, in order to understand not only what the text is but how all of the words interact.

Text Normalization occurs first in order to determine what the text as a whole means. Periods and commas are interpreted to add pauses; numbers and dates are converted into formats that are more friendly to the ear. Abbreviations and so much more are all figured out so that the human-computer interaction can occur.
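The look-ahead for currency can be sketched with one regular expression in Python. The pattern below reads past the number to see whether a magnitude word follows, so that the unit lands after "million" rather than before it. It is an illustration only: `expand_currency` is an invented name, the unit is passed in by the caller, and amounts with cents such as $10.50 would need a separate rule.

```python
import re

# Dollar sign, whole part, optional decimals, optional magnitude word.
CURRENCY = re.compile(r"\$(\d+)(?:\.(\d+))?(?:\s+(thousand|million|billion))?")


def expand_currency(text, unit="dollars"):
    """Move the currency unit to where a person would say it."""
    def repl(match):
        whole, fraction, magnitude = match.groups()
        parts = [whole]
        if fraction:
            parts += ["point", " ".join(fraction)]  # read decimals digit by digit
        if magnitude:
            parts.append(magnitude)
        parts.append(unit)
        return " ".join(parts)
    return CURRENCY.sub(repl, text)


print(expand_currency("$10 per day"))
print(expand_currency("$10.5 million per day"))
print(expand_currency("$10 per day", unit="pesos"))
```

The second call only gets the word order right because the pattern consumes "million" before emitting the unit.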

1

To put all those pieces together, we have our text to speech software here.

1. The text is sent to it.

2

2. Text Normalization occurs, trying to figure out all of those dates and currencies, the parts of speech and heteronym information. User lexicons are checked to see if there are custom pronunciations.

3

3. The best possible units are selected from hundreds of thousands of examples based on all of those acoustic parameters like pitch, duration and position relative to other phonemes.

4

4. The units are all joined together to generate a waveform.

5

5. And the audio is output to the user. Magic!

In the world of telephones and Unified Communications, Text To Speech isn’t just magic though. It’s an incredibly powerful tool, giving you the ability to deliver information to your callers. Let’s take a look at an example of this.

Since Elastix is built on top of Asterisk we can use existing tools like ODBC database connections for grabbing variable information, and the open source module app_swift for linking Cepstral into Asterisk.

app_swift

We'll be looking at app_swift, which is specific to Cepstral text to speech.

MRCP

But there's also a protocol called MRCP if you want to use other TTS engines, and MRCP is also used to add in speech recognition. It’s great for larger installs as well, where you may have multiple Elastix servers that all need to share TTS resources. Again, reach me on Twitter or see me after the talk with any questions on this.

exten=>n,Swift("Hello! Thank you for calling Cepstral."|4000|3)

exten=>n,Set(CALL_TRANSFER=${FILTER(0-9,${SWIFT_DTMF})})

Here’s an app_swift example. Swift is the name of Cepstral’s TTS engine, so the module adds the Swift command to your dialplan. It says a simple greeting and then uses Asterisk functionality to listen for a DTMF tone.

We’ve had customers with hundreds of menu options in their IVR that have the entire thing read in real time by TTS voices, allowing them to make menu updates and changes on the fly. Want to add a new IVR option? Simply reload extensions.conf and the menu changes. No recording of prompts, no uploading wav files. Very easy to maintain.

exten => 123,8,Set(BALANCE=${ODBC_BALANCE(${ACCOUNT})})

exten => 123,9,Swift(${BALANCE})

Like I said before, with ODBC you can also create very complex database-driven systems right in the dialplan. Or you can use AGI to do this.

Here’s a very simple example of querying a database for an account balance.

exten => 123,8,Set(BALANCE=${ODBC_BALANCE(${ACCOUNT})})

exten => 123,9,Swift(${BALANCE})

And then we use the Swift command right in the dialplan to read that back to the caller. Because you have the $ symbol in there, Cepstral will perform the text normalization we discussed to read this off as a person would say it.

You can really expand what’s possible to automate inbound or outbound calls. Do you have a technical support line? Have the caller identify themselves with a ticket number and read back the last note that a customer service rep left for them, or a status update on a known system outage.

https://vimeo.com/84233208

There's a fantastic video available from Elastix training that goes into detail on how to install and configure TTS on Elastix and how to set up an AGI script that makes use of the TTS software like this. The video shows you a demo app for employees to find out more information about outstanding loans.

I'll tweet a link to the video, and I do recommend that you watch it. It’s in Spanish, and you can pause and rewatch it over and over. And now that you know how the text to speech software works, it will make more sense if you’re adding TTS to your Elastix systems.

https://vimeo.com/84233208

• Text To Speech automates the delivery of information.

• Grow IVR usage without adding call center employees

To end: Text To Speech is a powerful tool that can read any information that you have in your systems. Not only is it useful for traditional IVR systems, but if you have a call center agent or customer service rep reading information to a caller, Text To Speech can automate that, allowing you to grow call volumes without adding agents. It also frees agents up to handle more difficult tasks that can’t be automated.

¡Gracias!

Thank you very much for the opportunity to speak to all of you today.
