Speech technology and its applications: a technical overview




by DENIS JOHNSTON

Abstract: The first stage of capturing a speech signal is to change it from an acoustic signal into an electrical signal, using microphone and amplifier. This signal must be digitized before the computer can handle it, using sampling techniques. Speech is produced by capturing human speech patterns and analysing them for later reproduction by mechanical means. Text-to-speech synthesis is used in driving voice synthesizers. Speech recognition is based on the matching of frames of speech. Dynamic programming is a significant step forward in speech recognition. An application is currently being developed using speech recognition on a PABX.

Keywords: data processing, computer communications, voice processing.

Speech technology is about getting speech into and out of computers. Speech, however, is an acoustic signal and even when turned into an electrical signal cannot be directly handled by a digital computer. The 'digitizing' of speech and the techniques of digital signal processing therefore underpin all modern speech technology applications. Once speech is in a digitized form, it can be stored, transformed and operated on just like any other data file.

This paper begins by discussing the fundamental processes involved in getting speech in and out of a computer, and then describes briefly some of the ideas behind speech data rate reduction, speech synthesis and speech recognition. Finally an application in which some of these techniques are combined is described.

Representing speech in a computer

The first stage of capturing a speech signal is to change it from an acoustic signal into an electrical signal. This is done with a microphone and amplifier. This signal is a continuously varying electrical signal. Figure 1 shows what a word looks like displayed this way.


Figure 1. A word displayed as a continuously varying electrical signal

Denis Johnston is head of the Speech Recognizer Applications, Assessment and Echo Control Group, British Telecom Research Laboratories.

Before a computer can handle this signal it must be digitized. One way of doing this is to sample the analogue electrical signal and measure its instantaneous value. A fragment of an analogue signal and its sampled form are shown in Figures 2a and 2b.

The continuous signal has now been represented by a sequence of numbers and this is a step closer to getting the signal into a form which can be handled by the computer.
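As a minimal sketch of this sampling step (the test signal and values are illustrative, not taken from the article), the following Python fragment measures the instantaneous value of a continuous tone 8 000 times a second:

```python
import numpy as np

# A 440 Hz tone stands in for the speech signal; sampling measures its
# instantaneous value at regular instants, 8000 times a second.
sample_rate = 8000
t = np.arange(0, 0.01, 1.0 / sample_rate)   # sampling instants for 10 ms
samples = np.sin(2 * np.pi * 440.0 * t)     # the sequence of numbers

print(samples[:8])                          # the first few sample values
```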

Figure 2a. An original fragment of an analogue signal

Figure 2b. Sampled form of the analogue signal

Sampling takes place at twice the rate of the highest frequency in the signal. This rate (known as the Nyquist sampling rate) guarantees that one will always be able to reconstitute the original signal from the samples. This also implies that all the essential information in that signal has been captured.

Although speech frequencies can go higher, most information in a speech signal is contained at frequencies below 7 kHz. Even if it is low-pass filtered at 4 kHz, the speech remains highly intelligible and easy to understand.

For many speech applications (including telephone systems), this is adequate. So, after filtering to exclude all frequencies above 4 kHz, the speech is sampled 8 000 times a second, using an analogue-to-digital (A/D) converter.

There are many types of A/D converter and Figure 3 shows a few examples of commercially available devices.

Sampling at the Nyquist rate ensures that all the information is captured in the signal, provided that the level of the signal at each sampling instant is measured exactly. This is not practically possible and the signal level can only be estimated to within one of perhaps 256 'quantized' levels. As 256 levels can be represented using 8 bits, a device with such resolution is known as an 8 bit A/D converter.

Although speech coded with this resolution would be intelligible, the quantizing would make it sound very noisy. A resolution of one part in 4096 (12 bits) is more typically used for speech.
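A small sketch of the effect of resolution (my own illustration, not from the article): quantize the same waveform to 8 and to 12 bits and compare the resulting quantizing noise.

```python
import numpy as np

def quantize(signal, bits):
    """Round each sample (assumed in [-1, 1]) to one of 2**bits levels."""
    step = 2.0 / (2 ** bits)
    return np.round(signal / step) * step

t = np.arange(8000) / 8000.0
wave = 0.8 * np.sin(2 * np.pi * 200.0 * t)   # a stand-in for speech

for bits in (8, 12):
    noise = wave - quantize(wave, bits)      # quantizing error
    snr = 10 * np.log10(np.mean(wave ** 2) / np.mean(noise ** 2))
    print(f"{bits} bit resolution: quantizing SNR about {snr:.0f} dB")
```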

So, in summary, to get good quality speech, the analogue waveform is sampled at least 8 000 times a second and each sample is represented by 12 bits. This is equivalent to capturing data at 12 kbyte/s, 720 kbyte/min or 43.2 Mbyte/h. Clearly, digitally-stored speech uses up a great deal of memory and it is not surprising that many techniques for coding the speech more efficiently have been developed.
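The storage arithmetic can be checked directly:

```python
samples_per_s = 8000
bits_per_sample = 12
kbyte_per_s = samples_per_s * bits_per_sample / 8 / 1000
print(kbyte_per_s, "kbyte/s")                 # 12.0
print(kbyte_per_s * 60, "kbyte/min")          # 720.0
print(kbyte_per_s * 3600 / 1000, "Mbyte/h")   # 43.2
```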


Figure 3. Samples of A/D converters

Reducing the bit rate

One technique which gives 12 bit quality for only 8 bits is called companding. It is widely used in telephony transmission, where the capacity of a channel is limited by the number of bits which can be transmitted per second. This technique relies on the fact that the human ear is much less sensitive to the quantizing noise when the signal is high than when it is low. The signal is therefore quantized non-linearly. At low signal levels the step size of the A/D converter is small. At high levels it is much greater. With such techniques, 8 bits per sample (companded) can give the same subjective quality as 12 bits (linearly coded).
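The article does not name a specific companding law, so as an illustration here is a sketch of mu-law, one standard non-linear quantizing curve used in PCM telephony:

```python
import numpy as np

MU = 255.0  # the constant for 8 bit mu-law companding

def compress(x):
    """Companding curve: fine quantizing steps at low level, coarse at high."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def expand(y):
    """Inverse curve applied after transmission or storage."""
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

x = np.array([-0.9, -0.1, -0.01, 0.01, 0.1, 0.9])
coded = np.round(compress(x) * 127) / 127   # 8 bit companded samples
print(expand(coded))                        # close to x, closest near zero
```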

Another approach, which reduces the number of bits per sample to four while maintaining almost the same level of quality, is adaptive differential pulse code modulation (ADPCM). In this technique only the difference between adjacent speech samples is transmitted. Also, the quantizing step size is adjusted according to the current nature of the signal. For example, if the signal is high level or changing rapidly, then the quantizing step interval is large. If changing slowly, or of low level, then a small quantizing step is used.
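A much-simplified sketch of the idea follows; the 4 bit code range is from the text, but the adaptation rule here is invented for illustration (real ADPCM codecs use standardized step tables):

```python
def adpcm_encode(samples):
    """Toy differential coder: 4 bit codes for the difference between each
    sample and a running prediction, with an adaptive quantizing step."""
    codes, predicted, step = [], 0.0, 0.01
    for s in samples:
        code = max(-8, min(7, round((s - predicted) / step)))  # 4 bit code
        codes.append(code)
        predicted += code * step          # the decoder tracks the same value
        # adapt: grow the step while the signal changes rapidly, shrink it
        # while the signal is low level or changing slowly
        step = min(0.5, step * 1.5) if abs(code) >= 6 else max(1e-4, step * 0.9)
    return codes
```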

If such a system is combined with a speech detector, which tells the computer only to store data when speech is present, no storage space is wasted in storing silence (which is treated separately). Using the above techniques it is possible to code good quality speech at as low a rate as 2 kbyte/s.

How is speech produced?

Speech is a highly redundant signal. To explain why this is so, and to introduce some of the ideas in speech synthesis, it is useful to examine how the human voice works. Figure 4a shows the physiology of the human vocal tract. Figure 4b shows schematically a mechanical analogue of this.

The human vocal tract can be considered as a tube which can be varied in effective length and cross-section by moving the tongue, jaw and lips. The source of the sound is located in the larynx, where air from the lungs is forced through the vocal cords. These may vibrate in a manner similar to the reed on a wind instrument, or may be prevented from vibrating but used to generate noise (as occurs during whispering, for example). Very early attempts at mechanical speech synthesis used these techniques. Figure 5 shows a replica of von Kempelen's speaking machine (1791), where air from a bellows (the lung) was forced through a bagpipe reed (the larynx) into a leather tube (the vocal tract) which could be squeezed into shape by the operator. Additional knobs and whistles (literally) were added to provide other sounds.

Figure 4a. Physiology of the human vocal tract

Figure 4b. Mechanical analogue of the human vocal tract

Figure 5. Replica of von Kempelen's speaking machine (1791)

The important thing to note about all of these speech-generating devices, including the human vocal tract, is that they are mechanical in nature. An immediate consequence of this is that they cannot change position very quickly. This gives some important clues as to how to reduce the number of bits required to code the speech.

For example, the speech waveform tends to repeat itself over a short period of time because of the inertia of the vocal tract and because the vocal cords are constrained to output a periodic pattern. Figure 6 shows a close-up of a vowel waveform. The repetitive nature of the signal, illustrated by the regular spacing of both the pitch and harmonic structure, can be clearly seen.

Figure 6. Close-up of a vowel waveform

What is needed is some way of capturing these repeated patterns, describing them in a simple form for storage and then reconstructing the original signal from them.

One of the first devices to attempt this was the vocoder. Although the early vocoders were analogue devices, Figure 7 shows a version in which the data is digitized using a number of A/D converters.

Figure 7. Vocoder which digitizes data using a number of A/D converters

The vocoder detected if the speech signal was voiced or unvoiced. If it was voiced, the pitch frequency was measured. If unvoiced, the energy of the noise was estimated. The speech was then filtered by a bank of band-pass filters and the energy in each band measured. Because the overall shape of the vocal tract could not change very quickly, the output from these filters could not change very quickly either. It was, therefore, valid to measure the energy in each frequency band at a much lower rate than that required to sample the original speech waveform. Typically, a rate of 100 times/s was used.

To reproduce the speech, the reverse process was used, with voicing being provided by a pulse generator, noise by a noise generator and the overall spectral shape controlled by a series of variable-gain amplifiers. A rough calculation for this system shows that if there were 20 filters (channels) and each was sampled 100 times/s, then about 2 000 samples/s were sufficient to code the speech. With each sample coded at 8 bits, the overall bit rate was about 16 kbit/s. A few more bits/s were required to code the pitch information, but some channels did not need to be coded to the same precision as others. Additional bit rate reduction methods (such as the differential coding described above) could reduce the bit rate further.
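As a sketch of the analysis half of such a vocoder, the fragment below cuts sampled speech into 10 ms frames and measures the energy in 20 bands. The FFT-based band split stands in for a real bank of band-pass filters, and the frame length and equal-width bands are my own assumptions:

```python
import numpy as np

def band_energies(frame, n_bands=20):
    """Energy in each of n_bands frequency bands for one analysis frame,
    a crude stand-in for the vocoder's bank of band-pass filters."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    return [band.sum() for band in np.array_split(spectrum, n_bands)]

def analyse(samples, sample_rate=8000, frames_per_s=100):
    """Measure the band energies 100 times/s, as in the vocoder."""
    frame_len = sample_rate // frames_per_s          # 80 samples per frame
    return [band_energies(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# 20 channels x 100 frames/s = 2000 values/s; at 8 bits each, about 16 kbit/s.
```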

Speech synthesis

When researchers looked closely at the output of the filter banks in the vocoder and started to build up a picture of the overall spectrum of the signal, they noticed that 'snapshots' of the filter-bank outputs (the 'spectral frames') tended to have shapes which reflected peaks (formants) in the vocal tract (see Figure 8).

Figure 8. Formants in the vocal tract

If these peaks could be detected separately then it might be possible to reduce the bit rate even further by using variable filters which generated the particular spectral shape required. What is more, it would be possible to manipulate the pitch frequency, noise levels and filter positions artificially. If that could be done, then genuinely synthetic speech could be produced. Figure 9 shows the layout of a simple formant synthesizer which does just that.

Figure 9. Layout of a simple formant synthesizer which artificially manipulates pitch frequency, noise levels and filter positions

Figure 9 is oversimplified and such a device would be primarily suited only to producing the simpler vowel-like sounds. In practical implementations, extra filters and branches must be added to include nasal sounds and those sounds (such as s, t, p etc.) which depend upon the air turbulence set up at the teeth and lips. To drive the synthesizer a number of control parameters can be varied, e.g. the pitch frequency, the pitch amplitude, the formant frequencies and the formant bandwidths.
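A minimal sketch of one such building block follows: a pitch-source pulse train shaped by two formant resonators in cascade. The second-order resonator is a standard digital realization of one formant; the formant frequencies and bandwidths here are illustrative values for a vowel, and the gain normalization is only approximate:

```python
import math

def resonator(signal, centre_hz, bandwidth_hz, sample_rate=8000):
    """Second-order digital resonator: boosts energy near centre_hz,
    acting as one formant filter in a formant synthesizer."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2 * math.pi * centre_hz / sample_rate
    a1, a2 = 2 * r * math.cos(theta), -r * r
    gain = 1 - r                         # rough amplitude normalization
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# Voiced excitation: a 100 Hz pulse train (the 'pitch source') shaped by
# two formants, roughly an 'ah'-like vowel.
pulses = [1.0 if n % 80 == 0 else 0.0 for n in range(8000)]  # 100 Hz at 8 kHz
vowel = resonator(resonator(pulses, 700, 130), 1100, 150)
```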

This is where text-to-speech synthesis comes in. The ideal would be to drive the synthesizer using normal written text. Many programs have been written to do this. The text is first converted into a phonetic form using a series of rules, in combination with look-up tables which handle those words which do not conform to universal rules, e.g. through, trough, hiccough etc. The phonetic components are then transferred into a set of parameters which are fed into the synthesizer proper. Additional rules are also applied to modify the pitch frequency and overall stress on each utterance. A single-board text-to-speech synthesizer developed at British Telecom Research Laboratories (BTRL), in which all these are combined, is shown in Figure 10.

Figure 10. BTRL text-to-speech converter
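A sketch of that first stage is below. The exception words come from the text; the phoneme strings and the trivial letter-to-sound rule are placeholders for illustration, not a real rule set:

```python
# Irregular words are handled by a look-up table; everything else goes
# through letter-to-sound rules. Phoneme spellings here are invented.
EXCEPTIONS = {
    "through":  "th r uu",
    "trough":   "t r o f",
    "hiccough": "h i k uh p",
}

def to_phonemes(word):
    word = word.lower()
    if word in EXCEPTIONS:                  # words that break the rules
        return EXCEPTIONS[word]
    return " ".join(letter_to_sound(c) for c in word)

def letter_to_sound(letter):
    """Stand-in for a real rule set, which would consider context."""
    return letter
```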

Speech recognition

At first sight, speech recognition seems to be fairly straightforward. All that seems necessary is to store the digitized version of every word in the language in a ‘library’. An unknown word would be similarly digitized and then compared sample by sample against each of the stored words. The one that ‘matched’ would be chosen.

A little thought, however, should indicate that it is not as easy as that. No two words are ever spoken identically, even by the same speaker. The level of speaking, the length of each word, the stress on each word and many other factors mean that comparing waveforms on a sample-by-sample basis does not work. Instead, the waveform of speech must first be decomposed into features, typically spectrally based frames, before the matching can take place.

How a speech recognizer works

The idea of speech being synthesized via a sequence of 'spectral frames' has already been discussed. Although there are exceptions, most speech recognizers use spectrally-based features similar to these. Figure 11 shows a typical speech recognizer architecture. The 'front end' here is a filter bank very similar to that used in the vocoder.

Figure 11. Typical speech recognizer architecture

Instead of storing the original waveform of each word in the library, the word is stored as a sequence of spectral frames. These are commonly known as word templates.

However, there is still the problem of matching an unknown word against all the templates in the library. This is really the problem of pattern matching. It is the overall shape of the template and the relationship between different parts of it which are important, rather than any absolute values.

One of the most significant steps forward in the science of speech recognition was the development of dynamic programming methods, often known as dynamic time warping (DTW) when applied to speech template matching.

Dynamic programming for speech recognition

One of the most difficult problems in speech recognition is that the rate of speaking is highly variable. Sometimes people speak a word quickly and at other times slowly. To explain how dynamic programming helps to solve this problem, consider the three 'patterns' shown in Figure 12.

Figure 12. Patterns in dynamic programming: A = 7 5 3 5 7 9 11 8; B = 8 4 6 10 12 9; C = 4 7 9 10 12 11 9 6

These three trajectories are to be compared. The three patterns have the same average value, but examining them by eye it seems that in terms of 'shape' A and B are similar whereas C is different. On the other hand, A and C are the same length, but B is shorter.

Dynamic programming allows expansion and compression in one dimension so that one pattern is optimally aligned with another. The first stage is to calculate all the 'distances'. Figure 13 shows how the distances between samples are tabulated for patterns A and B and also for A and C.

As any alignment requires that both patterns start and finish at their respective endpoints, it is necessary to find that alignment which best fits in between.

A/B differences (rows: samples of A; columns: samples of B):

       B:  8  4  6 10 12  9
  A:  7    1  3  1  3  5  2
      5    3  1  1  5  7  4
      3    5  1  3  7  9  6
      5    3  1  1  5  7  4
      7    1  3  1  3  5  2
      9    1  5  3  1  3  0
     11    3  7  5  1  1  2
      8    0  4  2  2  4  1

A/C differences (rows: samples of A; columns: samples of C):

       C:  4  7  9 10 12 11  9  6
  A:  7    3  0  2  3  5  4  2  1
      5    1  2  4  5  7  6  4  1
      3    1  4  6  7  9  8  6  3
      5    1  2  4  5  7  6  4  1
      7    3  0  2  3  5  4  2  1
      9    5  2  0  1  3  2  0  3
     11    7  4  2  1  1  0  2  5
      8    4  1  1  2  4  3  1  2

Figure 13. Distances between samples for patterns A and B, and patterns A and C


By eye it is possible to trace the path which seems to represent the minimum distance through each matrix.

How can this be performed algorithmically? The solution is shown below, where part of the matrix is shown with letters rather than numbers.

.  .  .  .
.  a  b  .
.  c  d  .
.  .  .  .

A new matrix is formed in which each element (starting from the top left) is replaced such that:

new value of d = old value of d + the minimum of a, b or c

Applying this rule to the two previous matrices, new matrices are obtained, as in Figure 14.

Cumulative A/B distances:

   1  4  5  8 13 15
   4  2  3  8 15 17
   9  3  5 10 17 21
  12  4  4  9 16 20
  13  7  5  7 12 14
  14 12  8  6  9  9
  17 19 13  7  7  9
  17 21 15  9 11  8

Cumulative A/C distances:

   3  3  5  8 13 17 19 20
   4  5  7 10 15 19 21 20
   5  8 11 14 19 23 25 23
   6  7 11 16 21 25 27 24
   9  6  8 11 16 20 22 23
  14  8  6  7 10 12 12 15
  21 12  8  7  8  8 10 15
  25 13  9  9 11 11  9 11

Figure 14. New matrices obtained by applying the rule to the distance matrices of Figure 13

The bottom right hand element now contains a value which indicates the minimum accumulated distance between the two trajectories, given that they have been allowed to expand and contract in one dimension. Looking at the last entry in each matrix, it can be seen that the left hand matrix has a lower accumulated distance (8) than the right (11) and is hence the better match.
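The same rule in code, as a minimal sketch: using the absolute difference as the per-sample distance reproduces the matrices above, and the bottom right hand element is the returned score.

```python
def dtw_distance(a, b):
    """Minimum accumulated distance between two patterns, allowing one to
    expand and contract in time against the other (dynamic time warping)."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # new value of d = old value of d + the minimum of a, b or c
            d[i][j] = local + min(d[i - 1][j - 1],   # a: diagonal move
                                  d[i - 1][j],       # b: vertical move
                                  d[i][j - 1])       # c: horizontal move
    return d[n][m]

A = [7, 5, 3, 5, 7, 9, 11, 8]
B = [8, 4, 6, 10, 12, 9]
C = [4, 7, 9, 10, 12, 11, 9, 6]
print(dtw_distance(A, B), dtw_distance(A, C))   # 8 and 11: A matches B better
```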


An application

As yet, there are very few commercial applications where speech recognition, synthesis and coding are combined. Many experimental systems, however, have been developed. There is, for example, an experimental private automatic branch exchange (PABX) in which speech coding and speech recognition were used. This equipment was taken to the stage where a field trial was performed and the results of that are now being used for further development of the system.

The overall system consisted of a PABX, a minicomputer, a speech recognizer and a low bit rate coder working at 16 kbit/s. Figure 15 shows the system arrangement.


Figure 15. System configuration for experimental PABX/speech coder and recognizer

To use the system, users took part in enrolment sessions in which they were shown the facilities of the system. At the same time, the opportunity was taken to get each user to 'train' the speech recognizer with the required vocabulary. This vocabulary was quite short and contained words such as 'Shortcode', 'Enter' and the digits, as well as phrases such as 'Divert all calls' or 'Divert on busy'. It was also trained with each user saying his or her own name.

In use, the system was operated as follows. On picking up the telephone, each user received a spoken message: 'Please speak your name'. On hearing the name, the recognizer compared it to all the names stored in its 'name' library and identified the current user. At this point, the recognizer was reloaded with the templates of that particular user. The user could now say things like 'Dial', in which case the system would prompt for digits or, more usefully, could request facilities by saying 'Facilities'. If this was followed by the phrase 'Divert all calls' the system would respond with 'Divert to which number?' and so on. At each stage the user could either speak or key in a number and so carry on an, albeit limited, conversation with the machine.
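A sketch of the dialogue logic just described, where the state names, the transition set and all prompt wording beyond the quoted messages are illustrative rather than taken from the trial system:

```python
PROMPTS = {
    "name":       "Please speak your name",
    "ready":      "Say 'Dial' or 'Facilities'",
    "facilities": "Which facility?",
    "divert":     "Divert to which number?",
}

def next_state(state, utterance):
    """One recognized word or phrase moves the dialogue to its next state."""
    if state == "name":                    # the name identifies the user and
        return "ready"                     # that user's templates are loaded
    if state == "ready" and utterance == "Facilities":
        return "facilities"
    if state == "facilities" and utterance == "Divert all calls":
        return "divert"
    return state                           # unrecognized input: re-prompt
```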

In addition to this, each user was able to store up to 100 different shortcode entries. To make a shortcode it was only necessary to say 'Shortcode' followed by 'Make', and the system would request the name of the shortcode followed by the number. Subsequently, to dial that number, the user needed only to say the name and the system would look up the person's stored number and dial it.

Of course, this is exactly how the telephone system operated 50 years ago, where the operator was the speech recognizer, the speech synthesizer and the storage unit.

Acknowledgement

Acknowledgement is made to the Director of Research, British Telecommunications, for permission to publish this paper.

British Telecom Research Laboratories, R18.3.2, Martlesham Heath, Ipswich IP5 7RE, UK.
