EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

Post on 01-Apr-2015

Page 1: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

1

Multimedia Communications (371) Speech and Image Communications (348)

John Mason

Engineering

Swansea University

Page 2: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

2

Features in speech

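[Figure: the speech waveform (against time) passes through acquisition (frames of 20 to 30 ms, sampling frequency 8 kHz) and feature extraction, giving a sequence of feature vectors X1, ..., Xi, ...]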

Page 4: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

4

Speech production

[Diagram: Air from the lungs -> Vocal folds -> Vocal tract -> Speech]

Page 5: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

5

LPC Short and Long

Spectral envelope reflects morphological characteristics of the vocal tract

[Diagram: noise -> H1(z) -> H2(z) -> synthesised speech, drawn alongside: air from the lungs -> vocal folds -> vocal tract -> speech]

Page 6: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

6

Features: building of statistical model

[Figure: repeated panels labelled T1 and T2; the plotted content is not recoverable from the transcript]

Page 7: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

7

VT Shape & Some Vowels - Ladefoged ‘62

Page 8: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

8

Speech Processing - Applications

Why?
• Communications
• Synthesis
• Recognition (speech & speaker)

How?
• Frame-based
• Systems approach

Page 9: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

9

Some Books

Flanagan - 'Speech Analysis, Synthesis and Perception', Springer-Verlag - a classic!

Furui - several books on recognition

Parsons - 'Voice and Speech Processing', McGraw-Hill - one of the first textbooks on computer speech processing

O'Shaughnessy - 'Speech Communication: Human and Machine', Addison-Wesley

Rabiner & Juang - 'Fundamentals of Speech Recognition', Prentice Hall, 1993

Ramachandran & Mammone (eds) - 'Modern Methods of Speech Processing', Kluwer Academic, 1995

Page 10: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

10

Speech Communications

Person-to-Person

Person-to-Machine: speech/speaker recognition

Machine-to-Person: speech synthesis

Page 11: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

11

(Electronic) Speech Communications

perhaps separated by long distance (or in time)

Page 12: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

12

Telephony & Broadcasting

[Diagram: Acoustic air path -> Electronic link (channel / transmission path) -> Acoustic air path]

Page 13: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

13

Speech Comms: Telephony

Electronic Link

Channel Transmission Path

Microphone -> ADC -> Analysis -> Coding -> Transmitter

Receiver -> Decoding -> (Re-)Synthesis -> DAC -> Loudspeaker

Page 14: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

14

Speech Bit Rates

[Diagram: Message creation -> Language coding -> Human acoustic generation -> Transmission (acoustic space) -> Human hearing extraction -> Language decoding -> Message realisation]

Approx. bit rate in bps, rising from the message level to the acoustic level: tens, hundreds, thousands, tens of thousands

Page 15: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

15

Criteria in Speech Comms.

Quality versus Bit-rate

[Figure: quality (Poor, Fair, Good, Excellent) against bit rate (4, 8, 16, 32, 64 kbps), with GSM, ADPCM and CELP marked]

4 Quality Measures: intelligibility, loudness, naturalness, ease-of-listening

Page 16: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

16

Low Bit Rate Speech Coding

Compandent http://www.compandent.com/

Page 17: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

17

Speech Processing

The three main application areas are:
• Speech Comms. (the 'electronic link')
• Automatic Speech/Speaker recognition
• Speech Synthesis

Much of the underlying analysis is common, e.g. linear predictive coding

Page 18: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

18

What does speech look like?

Page 19: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

19

What does speech look like?

[Figure: speech waveform, sample index 0 to 7000]

Dynamic Range - for flexibility and robustness

Time-varying - to convey information

Page 20: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

20

Frame-based Analysis

[Figure: speech waveform, sample index 0 to 7000]

To capture time variations:
• 20-30 ms frames - 'centi-second' labelling
• spectral analysis: FFT, filter-bank, Linear Predictive Coding
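
The framing step above can be sketched in a few lines. This is only an illustration under stated assumptions: the 20 ms frame, 10 ms hop (100 frames/second), 8 kHz sampling rate, Hamming window and the 1 kHz test tone are example choices, and the function names are hypothetical rather than taken from the lecture.

```python
import cmath
import math

def frames(signal, fs=8000, frame_ms=20, hop_ms=10):
    """Split a list of samples into overlapping fixed-length frames."""
    size = fs * frame_ms // 1000          # 160 samples at 8 kHz
    hop = fs * hop_ms // 1000             # 80 samples -> 100 frames/second
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def magnitude_spectrum(frame):
    """Plain O(N^2) DFT magnitude of a windowed frame, bins 0..N/2."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, hamming(n))]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

fs = 8000
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(fs // 10)]  # 100 ms test tone
fr = frames(tone, fs)
spec = magnitude_spectrum(fr[0])
peak_bin = max(range(len(spec)), key=spec.__getitem__)
peak_hz = peak_bin * fs / len(fr[0])      # frequency of the strongest bin
```

A real system would use an FFT rather than the direct DFT shown here; the direct form is used only to keep the sketch free of dependencies.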

Page 21: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

21

Speech Analysis/Coding

Two general cases:
• Waveform coders
• Source (voice) coders (vo-coders)

Source coders, e.g. linear predictive coding (LPC):
• Model the source, i.e. the vocal tract (VT)
• Linear, time-varying model of VT, plus excitation

[Diagram: excitation e_n (voiced or unvoiced) -> H(z) -> speech s_n]

Page 22: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

22

Systems Approach

[Diagram: Excitation (voiced, with pitch f0, or unvoiced) -> Vocal tract -> Speech; this is represented by a model with time-varying parameters whose output is speech]

Page 23: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

23

LPC Analysis/Synthesis

Synthesis: input: excitation; output: speech
[Diagram: e_n, E(z) -> H(z) (impulse response h_n) -> s_n, S(z)]

Analysis: input: speech; output: excitation
[Diagram: s_n, S(z) -> 1/H(z) -> e_n, E(z)]

Page 24: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

24

‘Perfect’ Analysis/Synthesis

[Diagram: analysis: s_n, S(z) -> 1/H(z) -> e_n, E(z); synthesis: e_n, E(z) -> H(z) -> s_n, S(z)]

Input sn and output sn are identical (within arithmetic limits)

Page 25: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

25

Practical Analysis/Synthesis

Source coding: s_n -> Analysis -> Coding -> De-coding -> Synthesis -> s_n

LPC-based systems (e.g. CELP):
Analysis: s_n -> 1/H(z) -> e_n
Re-synthesis: e_n -> Ĥ(z) -> s_n

Page 26: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

26

Practical Analysis/Synthesis

[Diagram: Sending: s_n, S(z) -> 1/H(z) -> e_n, E(z) -> Transmission -> Receiving: e_n, E(z) -> H(z) -> s_n, S(z)]

Parameters for transmission:
• Input / excitation e_n
• Source model H(z)

Thus Analysis must derive these parameters, and Synthesis must use them to re-generate speech

Page 27: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

27

Linear Predictive Coding - LPC

Principle of linear prediction: the next value (or sample) in a series, i.e. at time n, is predicted or estimated by a weighted sum of previous values, i.e. those at time n-1, n-2, ...

Thus for a predictor of order p, we have:

$$\hat{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p}$$

Page 28: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

28

Linear Prediction

$$\hat{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + \dots + a_p s_{n-p} = \sum_{i=1}^{p} a_i s_{n-i}$$

Transforming to the z-domain gives:

$$\hat{S}(z) = a_1 z^{-1} S(z) + a_2 z^{-2} S(z) + \dots + a_p z^{-p} S(z)$$

$$E(z) = S(z) - \{a_1 z^{-1} S(z) + a_2 z^{-2} S(z) + \dots + a_p z^{-p} S(z)\} = A(z) S(z)$$

where $A(z) = (1 - a_1 z^{-1} - a_2 z^{-2} - \dots - a_p z^{-p})$

Page 29: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

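29

LPC Error Terms

Error is simply the difference between predicted and actual values:

$$e_n = s_n - \hat{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$$

$$E(z) = S(z) - \hat{S}(z) = S(z) - \{a_1 z^{-1} S(z) + a_2 z^{-2} S(z) + \dots + a_p z^{-p} S(z)\} = A(z) S(z)$$

where $A(z) = (1 - a_1 z^{-1} - a_2 z^{-2} - \dots - a_p z^{-p})$, so that

$$\frac{E(z)}{S(z)} = 1 - A'(z)$$

[Diagram: s_n feeds A'(z); the output of A'(z) is subtracted from s_n at a summing node to give e_n]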

Page 30: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

30

Synthesis

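[Diagram: e_n -> H(z) -> s_n, with H(z) realised as a feedback loop: A'(z) acts on past outputs and its output is added (+) to e_n to give s_n]

Parameters updated at frame rate

NB 'hat' of approximation omitted for simplicity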

Page 31: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

31

Analysis for Synthesis

The Analysis and Synthesis must match. What is needed for the Synthesis?

Answer: e_n, the excitation, and H(z), the system

Thus the Analysis must derive these terms (from s_n):

The speech signal, s_n, is analysed to give e_n and H(z), i.e. the A'(z) parameters, for transmission.

[Diagram: Synthesis: e_n -> H(z) -> s_n]

[Diagram: Analysis: s_n, S(z) -> 1/H(z) -> e_n, E(z), realised as s_n minus the output of A'(z) at a +/- summing node]

Page 32: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

32

Derivation of LPC Coefficients - A(z)

Recall:

$$e_n = s_n - \hat{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$$

where the a_i are the p prediction coefficients. The principle behind LPC is to find a set of p coefficients, a_1, a_2, a_3, ... a_p, which in some sense minimises the error signal e_n over a frame of speech of N samples. This leads to a set of p coefficients for each frame.

$$E_n = \sum_{n=0}^{N-1} e_n^2 = \sum_{n=0}^{N-1} (s_n - \hat{s}_n)^2 = \sum_{n=0}^{N-1} \left( s_n - \sum_{i=1}^{p} a_i s_{n-i} \right)^2$$

Page 33: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

33

Derivation of A(z) – (2)

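Minimisation of E_n is achieved by setting the p partial derivatives to zero:

$$\frac{\partial E_n}{\partial a_i} = 0 \quad \text{for } i = 1, 2, \dots, p$$

From which:

$$r_{j0} - \sum_{k=1}^{p} a_k r_{jk} = 0 \quad \text{for } j = 1, 2, \dots, p, \qquad \text{where: } r_{jk} = \sum_{n=0}^{N-1} s_{n-j} s_{n-k}$$

In matrix form:

$$\mathbf{r} - R\mathbf{a} = 0 \quad \text{or} \quad \mathbf{a} = R^{-1}\mathbf{r}$$

The matrix [R] is Toeplitz symmetric, offering numerically efficient inversion techniques, Durbin's recursion algorithm being one of the most popular.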
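
As a rough illustration of solving the normal equations above, the following sketch uses the autocorrelation method with the Levinson-Durbin recursion. It assumes the frame is windowed and treated as zero outside its N samples, so that the correlation depends only on the lag |j - k| (which is what makes [R] Toeplitz). The function names and the test frame are hypothetical, and the returned coefficients follow the predictor convention (the predicted sample is the weighted sum of past samples).

```python
import math

def autocorr(frame, p):
    """Autocorrelation r(0..p) of a frame treated as zero outside its samples."""
    n = len(frame)
    return [sum(frame[t] * frame[t - k] for t in range(k, n)) for k in range(p + 1)]

def levinson_durbin(r, p):
    """Solve sum_k a_k r(|j-k|) = r(j), j = 1..p.

    Returns (a_1..a_p, residual energy after order-p prediction).
    """
    a = [0.0] * (p + 1)                   # a[0] is unused
    err = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Hypothetical test frame: a decaying resonance, Hamming-windowed.
N = 160
frame = [math.sin(0.3 * n) * 0.98 ** n for n in range(N)]
window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
r = autocorr([s * w for s, w in zip(frame, window)], 2)
a, err = levinson_durbin(r, 2)
```

The recursion costs on the order of p squared operations per frame rather than the p cubed of a general matrix inversion, which is the efficiency the Toeplitz structure buys.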

Page 34: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

34

Derivation of A(z) – (3)

• When N is very large, r is the set of autocorrelation coefficients of s
• s comes from e convolved with h (excitation & vocal tract)
• we are interested here in separating e and h
• the predictor order, p, is small to reflect the short-term periodicities (formants)
• with higher predictor orders we will get the longer-term periodicities (pitch)
• 2 practical problems with evaluating a:
  - matrix singularities in R^-1
  - unstable resultant H(z)
• in practice both are solved by windowing, i.e. shaping the frame (Hamming)

Page 35: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

35

Speech Signal Characteristics

• Duration
• Dynamic Range
• Periodicities: vocal tract, pitch

Frame-based Analysis
• frame size: quasi-stationary, capture transition; typically 20 - 30 ms
• frame rate: task dependent; more means more bandwidth/computation; up to 100 frames/second

Page 36: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

36

Harmonic Structures and Periodicities

Harmonic Structures & Periodicities give potential for data reduction

LPC is one way of gaining this compression

Speech has two obvious separate structures

vocal tract resonances

pitch

Page 37: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

37

Harmonic Structures and Periodicities

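Short Term

[Diagram: excitation e_n (voiced or unvoiced) -> vocal tract H(z) -> speech s_n; waveform sketch marking the pitch period Tp and the short-term prediction span of p samples]

Short term prediction:

$$\hat{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$$

$$e_n = s_n - \hat{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$$

with the a_i chosen to minimise $E_n = \sum_n (e_n)^2$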

Page 38: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

38

Harmonic Structures and Periodicities

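[Diagram: excitation e_n (voiced or unvoiced) -> pitch filter Hlt(z) -> ep_n -> vocal tract filter Hst(z) -> speech s_n; waveform sketch marking the pitch period Tp and the long-term prediction span of P samples]

Long term prediction:

$$\hat{s}_n = \sum_{i=1}^{P} a_i s_{n-i}$$

$$e_n = s_n - \hat{s}_n = s_n - \sum_{i=1}^{P} a_i s_{n-i}$$

with the a_i chosen to minimise $E_n = \sum_n (e_n)^2$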

Page 39: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

39

Harmonic Structures and Periodicities

Two structures: short-term (formants) & long-term - pitch (excitation)

[Diagram: e_n -> Hlt(z) -> ep_n -> Hst(z) -> s_n]

e.g. 20 ms frame: 160 samples @ 8 kHz

Per-frame parameters shown in the figure: a_i (e.g. p = 3), a_i (e.g. p = 10), Gain, k

NB Representations of these parameters are transmitted

Page 40: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

40

Practical Coding Systems

Waveform & Source Coders (Vocoders)

Source Coders (Vocoders):
• 2 periodicities/redundancies in source: short-term (formants), long-term - pitch
• Excitation e_n

[Diagram: e_n -> Hlt(z) -> ep_n -> Hst(z) -> s_n]

Page 41: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

41

‘Perfect’ Analysis/Synthesis (1)

[Diagram: analysis: s_n, S(z) -> 1/H(z) -> e_n, E(z); synthesis: e_n, E(z) -> H(z) -> s_n, S(z)]

Input sn and output sn are identical (within arithmetic limits)

Page 42: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

42

‘Perfect’ Analysis/Synthesis (2)

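[Diagram: analysis: s_n, S(z) -> 1/H(z) -> e_n, E(z); synthesis: e_n, E(z) -> H(z) -> s_n, S(z)]

Substituting H(z) = 1/(1 - A'(z)):

[Diagram: analysis: s_n, S(z) -> 1 - A'(z) -> e_n, E(z); synthesis: e_n, E(z) -> 1/(1 - A'(z)) -> s_n, S(z)]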

Page 43: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

43

‘Perfect’ Analysis/Synthesis (3)

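[Diagram: analysis: s_n -> 1 - A'(z) -> e_n; synthesis: e_n -> 1/(1 - A'(z)) -> s_n]

[Diagram: analysis filter as a tapped delay line. The original speech s_n passes through delays Z^-1 giving s_{n-1}, ..., s_{n-i}, ..., s_{n-p}; these are weighted by a_1, ..., a_i, ..., a_p and summed, and the sum is subtracted (+/-) from s_n to give the residual e_n]

$$e_n = s_n - \hat{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$$

Note the minus sign: in Matlab it is combined with the a_i

What determines p?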

Page 44: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

44

‘Perfect’ Analysis/Synthesis (4)

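[Diagram: analysis: s_n -> 1 - A'(z) -> e_n; synthesis: e_n -> 1/(1 - A'(z)) -> s_n]

[Diagram: Original speech -> Residual -> Re-synth. In the analysis filter, s_n passes through delays Z^-1 giving s_{n-1}, ..., s_{n-i}, ..., s_{n-p}, weighted by a_1, ..., a_i, ..., a_p; the weighted sum is subtracted (+/-) from s_n to give e_n. In the synthesis filter, the same tapped delay line acts on the output s_n and its weighted sum is added (+) to e_n]

Note: no minus sign in the synthesis filter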
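
The two tapped-delay-line structures can be written directly as difference equations. The sketch below is illustrative only: the coefficients and the test signal are hypothetical, and the same coefficients are used unquantised in both directions, which is what makes the round trip 'perfect' to within arithmetic limits.

```python
import math

def analyse(s, a):
    """Analysis filter 1 - A'(z): e_n = s_n - sum_i a_i * s_{n-i} (samples before n = 0 taken as zero)."""
    p = len(a)
    return [s[n] - sum(a[i] * s[n - 1 - i] for i in range(p) if n - 1 - i >= 0)
            for n in range(len(s))]

def synthesise(e, a):
    """Synthesis filter 1/(1 - A'(z)): s_n = e_n + sum_i a_i * s_{n-i}."""
    p = len(a)
    s = []
    for n in range(len(e)):
        s.append(e[n] + sum(a[i] * s[n - 1 - i] for i in range(p) if n - 1 - i >= 0))
    return s

a = [1.8, -0.9]                                            # hypothetical stable 2nd-order predictor
speech = [math.sin(0.3 * n) * 0.99 ** n for n in range(200)]  # stand-in for a speech frame
residual = analyse(speech, a)
rebuilt = synthesise(residual, a)
max_err = max(abs(x - y) for x, y in zip(speech, rebuilt))
```

In a practical coder the residual and the coefficients are quantised before transmission, so the rebuilt signal is only similar to the input rather than identical.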

Page 45: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

45

Practical System

[Diagram: s_n, S(z) -> 1/H(z) -> e_n, E(z) -> Transmitted data frame -> e_n, E(z) -> H(z) -> s_n, S(z)]

Input s_n and output s_n are "similar"

What does the Transmitted Data Frame contain?

Page 46: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

46

Analysis-by-Synthesis: LPAS

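Integrated encoder & decoder at the encoder

[Diagram: LPAS encoder: an adaptive encoder drives a basic decoder; the decoder output is subtracted (+/-) from the input s_n and the weighted error is fed back to adapt the encoder]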

Page 47: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

47

Log Spectral Estimates

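Comparisons between frames are very important in many situations. Log spectral estimates are the most common (though in Comms. an approximation is used to reduce computation).

$$D = \frac{1}{B}\int_0^B \left(S(w) - S'(w)\right)^2 dw \approx \frac{1}{N}\sum_{k=0}^{N/2-1} \left(S_k - S'_k\right)^2$$

where $S_k = \log(\mathrm{DFT}(s_n))$ or $\log(H(z))\big|_{z=e^{jw}}$, and S and S' are the log spectra of the two frames being compared

In Comms, computation is expensive and parameter vector approximations to D are used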
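
A minimal sketch of the discrete form of this distance follows. It assumes the log is taken of the DFT magnitude, uses the natural log, and clamps very small magnitudes to a floor to avoid log(0); those choices, the function names and the test frame are assumptions for illustration.

```python
import cmath
import math

def log_spectrum(frame, floor=1e-12):
    """Natural-log DFT magnitude for bins 0 .. N/2 - 1 (direct O(N^2) DFT)."""
    n = len(frame)
    return [math.log(max(abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                                 for t in range(n))), floor))
            for k in range(n // 2)]

def log_spectral_distance(f1, f2):
    """D = (1/N) * sum over k = 0 .. N/2 - 1 of (S_k - S'_k)^2 for two equal-length frames."""
    s1, s2 = log_spectrum(f1), log_spectrum(f2)
    return sum((x - y) ** 2 for x, y in zip(s1, s2)) / len(f1)

frame = [0.9 ** t for t in range(64)]          # hypothetical broadband test frame
louder = [2.0 * x for x in frame]              # same shape, doubled amplitude
d_same = log_spectral_distance(frame, frame)
d_gain = log_spectral_distance(frame, louder)  # every bin differs by log(2)
```

Doubling the amplitude shifts every log-magnitude bin by the same amount, so the distance here reflects a pure gain change and not a change in spectral shape.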

Page 48: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

48

Some Standards

Standard   Application         Coder      Bit rate (kb/s)
GSM        European Cellular   RPE-LTP    13
FS1016     Secure Voice        CELP       4.8
IS54       NA Cellular         VSELP      7.95
IS96       NA Cellular         QCELP      1-8
JDC-FR     Japanese Cellular   VSELP      6.7
JDC-HR     Japanese Cellular   PSI-CELP   3.67
G.728      (terrestrial)       LD-CELP    16

Page 51: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

51

CELP eg

[Diagram: CB index -> codebook -> Gain -> e_n -> Hlt(z) (long-term coefficients, pitch) -> Hst(z) (short-term coefficients, formants) -> s_n]

Excitation e_n is represented by its address, i.e. the CB index

Page 52: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

52

CELP – LPAS (Encoder)

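[Diagram: basic decoder: CB index -> codebook -> Gain -> e_n -> Hlt(z) (long-term coefficients, pitch) -> Hst(z) (short-term coefficients, formants) -> s_n]

Excitation e_n is represented by its address, i.e. the CB index

[Diagram: LPAS encoder: the adaptive encoder drives the basic decoder; the decoder output is subtracted (+/-) from the input s_n and the weighted error is fed back to adapt the encoder]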

Page 53: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

53

Conversion of LPC Parameters

• $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_p z^{-p}$, and the a_i are to be transmitted (Tx'd)

• Line Spectral Frequencies (LSFs) present a clever way of representing the LPC coefficients, the a_i of A(z)

• The a_i are floating point numbers and their accuracy is important

• Factorising A(z) tends to give complex roots in the z-domain

• LSFs map these complex roots on to the unit circle

LSFs:
• Lead to efficient coding
• Ensure a minimum phase filter
• Bit errors are spectrum localised, minimising loss of speech quality

[Figure: z-plane with imaginary axis jy, a complex root pair marked x, the angle θ of a root on the unit circle, and the sampling frequency ws]

LSF = ws · θ / 2π

Page 54: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

54

Line Spectral Frequencies

• Consider

$$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$$

and

$$Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$$

then P(z) and Q(z) lead to what are known as LSFs

• Clearly if P(z) and Q(z) are known then A(z) can be found: $A(z) = \{P(z) + Q(z)\} / 2$

• Roots of P(z) and Q(z) lie on the unit circle in the z-domain. The locations give the LSFs, hence P(z) and Q(z), and whence A(z)

Page 55: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

55

LSF Evaluation

Consider one pair of complex roots, A1(z):

$$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$$

$$P_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + z^{-3}(1 + a_1 z + a_2 z^2) = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$$

$$Q_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - z^{-3}(1 + a_1 z + a_2 z^2) = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$$

The roots at z = -1 and z = +1, from the factors (z + 1) and (z - 1), are discarded

It follows that the LSFs, $\phi_1$ & $\phi_2$, are given by:

$$\cos(\phi_1) = -(a_1 + a_2 - 1)/2$$

and

$$\cos(\phi_2) = -(a_1 - a_2 + 1)/2$$

Show: $a_1 = -(\cos(\phi_1) + \cos(\phi_2))$ and $a_2 = (\cos(\phi_2) - \cos(\phi_1) + 1)$
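
The second-order relations above can be checked numerically in both directions. This is a small sketch with hypothetical function names; it assumes the arguments of the inverse cosine lie within [-1, 1], as they do for the stable examples used in these slides.

```python
import math

def lpc_to_lsf2(a1, a2):
    """LSFs (radians) of A(z) = 1 + a1 z^-1 + a2 z^-2, from the quadratic factors of P(z) and Q(z)."""
    lsf1 = math.acos(-(a1 + a2 - 1) / 2)   # from P(z)
    lsf2 = math.acos(-(a1 - a2 + 1) / 2)   # from Q(z)
    return lsf1, lsf2

def lsf_to_lpc2(lsf1, lsf2):
    """Inverse mapping: recover a1 and a2 from the two LSFs."""
    a1 = -(math.cos(lsf1) + math.cos(lsf2))
    a2 = math.cos(lsf2) - math.cos(lsf1) + 1
    return a1, a2

lsf_a = lpc_to_lsf2(-1.8, 0.9)
lsf_b = lpc_to_lsf2(0.0, 0.5)
back = lsf_to_lpc2(*lsf_a)     # should return approximately (-1.8, 0.9)
```

The first result is the pair of angles for a1 = -1.8, a2 = 0.9 and the second for a1 = 0, a2 = 0.5; converting the first pair back recovers the original coefficients to within rounding.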

Page 56: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

56

LSF Test Example

$$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} = (z^2 + a_1 z + a_2) z^{-2} = (z^2 + 2\cos(\theta) w_n z + w_n^2) z^{-2}$$

where $w_n$ is the radius and θ is the angle from π. So: radius = $\sqrt{a_2}$ and $\phi = \pi - \theta$

Note: in P & Q all $w_n^2$ terms (of the multiple 2nd orders) are unity

EG 1: $a_2 = 1$, then $\cos(\phi_1) = -(a_1 + a_2 - 1)/2 = -a_1/2$

roots are already on the circle and do not move (unstable system, not practical)

EG 2: $a_1 = 0$, then $\cos(\phi_1) = -(a_1 + a_2 - 1)/2 = -(a_2 - 1)/2$

and $\cos(\phi_2) = -(a_1 - a_2 + 1)/2 = -(-a_2 + 1)/2 = -\cos(\phi_1)$

so the LSFs are symmetric about π/2, i.e. a quarter of the sampling frequency

Page 57: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

57

LSF Review & Example (1)

LSFs/LSPs are defined as:

$$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$$

and

$$Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$$

thus $A(z) = \{P(z) + Q(z)\} / 2$

Page 58: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

58

LSF Review & Example (2)

For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:

$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$$

$$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$$

cf: $(s^2 + (2\cos(\theta) w_n) s + w_n^2)$

Page 59: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

59

LSF Review & Example (3)

For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:

$$P(z) = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$$

$$Q(z) = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$$

cf: $(s^2 + (2\cos(\theta) w_n) s + w_n^2)$

Thus: $(a_1 + a_2 - 1) = 2\cos(\theta_1) = -2\cos(\phi_1)$

and $(a_1 - a_2 + 1) = -2\cos(\phi_2)$

So, given:
i) LPC coeffs. a_1 and a_2, then the LSFs $\phi_1$ & $\phi_2$ can be found
ii) LSFs $\phi_1$ & $\phi_2$, then the LPC coeffs. a_1 and a_2 can be found

[Figure: z-plane plot (axes -0.5 to 1 and 0 to 1) marking the roots of P(z) and Q(z) and the angles $\phi_1$ and $\phi_2$]

Page 60: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

60

LSF Review & Example (4)

For a second order, with P(z) corresponding to the first root and Q(z) to the second root:

$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$$

For the second pair of $\phi_i$, 1.37 and 1.77:

$$P(z) = (z^2 - 2\cos(1.37) z + 1)(z + 1) z^{-3} = (z^3 + (1 - 2\cos(1.37)) z^2 + (1 - 2\cos(1.37)) z + 1) z^{-3}$$

Likewise

$$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$$

$$= (z^2 - 2\cos(1.77) z + 1)(z - 1) z^{-3} = (z^3 + (-1 - 2\cos(1.77)) z^2 + (1 + 2\cos(1.77)) z - 1) z^{-3}$$

Then

$$A(z) = \{P(z) + Q(z)\} / 2 = (z^3 - (\cos(1.37) + \cos(1.77)) z^2 + (1 - \cos(1.37) + \cos(1.77)) z) z^{-3}$$

Page 61: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

61

LSF Examples

LPC coeffs.       LSFs
a1      a2        φ1              φ2
0       0.5       1.31812         1.82348
-1.8    0.9       0.31756         0.554811
+1.8    0.9       π - 0.554811    π - 0.31756
                  2.2274          2.3743

[Figure: z-plane plots with axes from -1 to 1]

Page 62: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

62

LSF Examples

LPC coeffs.       LSFs
a1      a2        φ1              φ2
0       0.5       1.31812         1.82348
-1.8    0.9       0.31756         0.554811
+1.8    0.9       π - 0.554811    π - 0.31756
                  2.2274          2.3743

$$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$$

$$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$$

$$= (z^2 + (-1.8 + 0.9 - 1) z + 1)(z + 1) z^{-3} = (z^2 - 1.9 z + 1)(z + 1) z^{-3}$$

cf: $(z^2 + (2\cos(\theta) w_n) z + w_n^2)$

thus $\cos(\theta) = -1.9/2$, or $\theta = 2.824$, and $\phi_1 = \pi - \theta = 0.318$

Page 63: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

63

Example Bit Allocation

Bit allocation     Voiced   Unvoiced
V/U decision       1        1
Excitation         11       11
Sync               1        1
Φ1 = 0.3176        5        5
Φ2 = 0.5548        5        5
Φ3 = 1.4454        5        5
Φ4 = 1.6961        5        5
Φ5                 4        0
Φ6                 4        0
Φ7                 4        0
Φ8                 4        0
Φ9                 3        0
Φ10                2        0
Error check        0        21
Total / frame      54       54
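
As a quick check of the arithmetic in the table, the per-frame totals can be summed and converted to a bit rate. The frame duration is not stated on this slide; the 20 ms used below is an assumption borrowed from the frame sizes quoted elsewhere in these notes, and the dictionary keys are just labels.

```python
# Bits per field, copied from the example allocation table.
voiced = {"V/U decision": 1, "Excitation": 11, "Sync": 1,
          "LSF 1-4": 4 * 5, "LSF 5-8": 4 * 4, "LSF 9": 3, "LSF 10": 2,
          "Error check": 0}
unvoiced = {"V/U decision": 1, "Excitation": 11, "Sync": 1,
            "LSF 1-4": 4 * 5, "LSF 5-10": 0,
            "Error check": 21}

bits_voiced = sum(voiced.values())
bits_unvoiced = sum(unvoiced.values())

frame_ms = 20                                  # ASSUMPTION: not given on the slide
rate_bps = bits_voiced * 1000 / frame_ms       # bits per second at that frame length
```

Both frame types come to the same total, so the coder runs at a constant rate; unvoiced frames spend the bits freed from the higher LSFs on error checking.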

Page 64: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

64

Codebooks & VQ

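[Figure: a sequence over time of p-element parameter vectors is matched against a codebook of N = 2^L vectors, each with p elements; only the index i (0 ... N-1) is sent, and an identical book at the receiver restores the sequence of vectors over time]

Data reduction: (p x B) to L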

Page 65: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

65

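Codebook Compression

Principle:
• representative data sets
• data vector is replaced / represented by the "nearest" vector, chosen from a "codebook", a closed set of vectors

Examples:
• LPC parameter sets
• Excitation, as in CELP

[Figure: codebook of N = 2^k vectors, each with M elements, addressed by index i; the index selects A(z), and hence H(z), in the synthesis e_n -> H(z) -> s_n]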

Page 66: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

66

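Codebook Compression - CELP

[Figure: a y ms vector of e_n drives H(z) to give y ms of s_n; the vector is taken from a codebook of time-domain samples with P entries, selected by its start point]

• e_n are time-domain samples (integers)
• R samples per second (e.g. 8000 Hz)
• Frame rate governs vector size
• Codebook of time-domain samples, P = 2^j
• Bit rate = j/y bits/ms

NB e_n also includes gain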

Page 67: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

67

Codebook Compression of H(z)

[Figure: A[z] at time t, a vector with M elements produced every x ms along the time axis, is replaced by the index i of its nearest codebook entry]

• Vector with M elements, every x ms
• Codebook with N = 2^k vectors
• Bit rate = k/x bits per ms (not a function of M)

In practice A[z] is converted to LSFs.
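
The encode and decode steps of vector quantisation can be sketched as a nearest-neighbour search followed by a table lookup. The codebook values, the input vector and the 20 ms update interval below are all hypothetical, and squared Euclidean distance is assumed as the measure of "nearest".

```python
import math

def nearest_index(vector, codebook):
    """Index of the codebook entry with the smallest squared Euclidean distance to vector."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])))

# Hypothetical codebook: N = 4 = 2^2 vectors, each with M = 2 elements (e.g. two LSFs).
codebook = [[0.3, 0.6], [1.3, 1.8], [2.2, 2.4], [0.8, 1.6]]
k = int(math.log2(len(codebook)))      # bits needed per index
x_ms = 20                              # ASSUMPTION: one vector every 20 ms

index = nearest_index([1.25, 1.9], codebook)   # encoder: send only this index
decoded = codebook[index]                      # decoder: look up the identical book
bit_rate = k / x_ms                            # bits per ms, independent of M
```

Only the k-bit index crosses the channel, which is why the bit rate depends on the codebook size and the update interval but not on the vector length M.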

Page 68: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

68

Codebook Generation

1) Initialise: form a single centroid of all training data, N = 1

2) Repeat
     Split centroids: N -> 2N
     Repeat
       Cluster data to nearest centroid
     until convergence
   until N large enough
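
The split-and-cluster procedure above can be sketched as follows. This is one possible reading, not necessarily the exact variant used in the course: centroids are split by a small additive perturbation (the value of eps is an assumption), distance is squared Euclidean, and the training data are made up.

```python
def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest(v, book):
    """Index of the nearest codebook vector (squared Euclidean distance)."""
    return min(range(len(book)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(v, book[i])))

def train_codebook(data, target_size, eps=0.01, max_iter=50):
    book = [centroid(data)]                              # 1) single centroid, N = 1
    while len(book) < target_size:                       # 2) repeat until N large enough
        # Split every centroid into two slightly perturbed copies: N -> 2N
        book = [[c + s * eps for c in cv] for cv in book for s in (+1, -1)]
        for _ in range(max_iter):                        # cluster until convergence
            cells = [[] for _ in book]
            for v in data:
                cells[nearest(v, book)].append(v)
            # Empty cells keep their old centroid
            new_book = [centroid(c) if c else old for c, old in zip(cells, book)]
            if new_book == book:
                break
            book = new_book
    return book

# Hypothetical training data: two well-separated clusters.
data = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
book = train_codebook(data, target_size=2)
```

Because the size doubles at each split, the final codebook size is a power of two, matching the N = 2^k convention used for the index width.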

Page 69: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

69

VQ Performance on Unseen Data

Ramachandran & Mammone (eds) 'Modern Methods of Speech Processing', Kluwer Academic, 1995

Page 70: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

70

Ramachandran & Mammone (eds) 'Modern Methods of Speech Processing', Kluwer Academic, 1995

VQ Performance on Unseen Data

Page 71: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

71

LPC & FFT Spectra

[Figure: waveform (amplitude -1 to 1) against time (0 to 25.6 ms), and LPC and FFT magnitude spectra (dB, -40 to 40) against frequency (kHz, 0 to Fs/2)]

LPC roots: -0.6651 ± 0.6695i, -0.0560 ± 0.9709i, 0.7228 ± 0.6225i, 0.8714 ± 0.3694i, 0.5758, -0.4200

LSFs:
of Q(z)    of P(z)
2.3743     2.2274
1.6540     1.5997
0.8261     0.6954
0.6106     0.3937

Page 72: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

72

LPC Spectra & LSFs

[Figure: LPC magnitude spectrum (dB, -40 to 40) against frequency (kHz, 0 to Fs/2), and a z-plane plot (axes -1 to 1) of the roots]

LPC roots: -0.6651 ± 0.6695i, -0.0560 ± 0.9709i, 0.7228 ± 0.6225i, 0.8714 ± 0.3694i, 0.5758, -0.4200

LSFs:
of Q(z)    of P(z)
2.3743     2.2274
1.6540     1.5997
0.8261     0.6954
0.6106     0.3937

Page 73: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

73

LPC & FFT Spectra - 2nd Order

[Figure: waveform (amplitude -1 to 1) against time (0 to 25.6 ms), and LPC and FFT magnitude spectra (dB, -40 to 40) against frequency (kHz, 0 to Fs/2)]

A(z): 1.5537, -0.8276
Roots: 0.7769 ± 0.4733i

$$H(0) = \frac{K}{1 - (1.5537 - 0.8276)} \qquad H(w_s/2) = \frac{K}{1 - (-1.5537 - 0.8276)}$$

$$\frac{H(0)}{H(w_s/2)} = \frac{K/0.274}{K/3.38} = 21.8\ \mathrm{dB}$$
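
The numbers on this slide can be reproduced by evaluating the all-pole filter at z = 1 (zero frequency) and z = -1 (half the sampling frequency). The sketch below treats the two quoted values as predictor coefficients and sets the gain K to 1, since K cancels in the ratio; the variable names are illustrative.

```python
import cmath
import math

a = [1.5537, -0.8276]          # predictor coefficients quoted on the slide

def gain(z):
    """|H(z)| for H(z) = 1 / (1 - a1 z^-1 - a2 z^-2), with K = 1."""
    return abs(1 / (1 - sum(ai * z ** -(i + 1) for i, ai in enumerate(a))))

h_dc = gain(1)                 # z = 1, zero frequency
h_nyq = gain(-1)               # z = -1, half the sampling frequency
ratio_db = 20 * math.log10(h_dc / h_nyq)

# Poles: roots of z^2 - a1 z - a2 = 0
disc = cmath.sqrt(a[0] ** 2 + 4 * a[1])
root = (a[0] + disc) / 2       # the other root is its complex conjugate
```

The pole pair sits close to the positive real axis, which is why the response at zero frequency is so much larger than at half the sampling frequency.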

Page 74: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

74

GSM

• Groupe Special Mobile - EU
• First digital cellular system in the world
• See Hodge 1990
• Based on TDMA & FDMA at 900 MHz, and RPE-LPC (i.e. it is an 'LPAS' system)
• Now at 1800 MHz
• Carriers at 200 kHz
• Supporting 8 TDMA time slots each
• Time slots: 577 µs - 156.25 bit slots
• 8 time slots form 1 GSM frame of 4.62 ms
• Modulation: Gaussian minimum shift keying
• 26 bit training in every time slot
• Round-trip delay ~ 80 ms
• EU: GSM; US: D-AMPS
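
The timing figures above hang together arithmetically, as this short check shows. It takes the slot length as the rounded 577 µs and the slot capacity as 156.25 bit periods (the usual GSM figure); the variable names are illustrative.

```python
slot_us = 577                 # time slot duration in microseconds (rounded)
slots_per_frame = 8
bits_per_slot = 156.25        # bit periods per time slot (usual GSM figure)

frame_ms = slots_per_frame * slot_us / 1000          # duration of one TDMA frame
gross_rate_kbps = bits_per_slot / slot_us * 1000     # channel bit rate on one carrier
```

The gross carrier rate of roughly 270 kbps is shared between the 8 time slots, and it includes training bits and guard time as well as coded speech.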

Page 75: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

75

Other Related Topics

Spectral Lifting: $H(z) = (1 - a z^{-1})$

Codebook Training

Spectral Differences between 2 frames

Cepstra

Modeling Speech Space - HMM’s

Page 76: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

76

Pre-Emphasis Example

[Figure Q1: two waveform panels, (a) and (b), each covering 30 ms, with amplitude axes marked -8000 to 8000 and -1 to 1]

Page 77: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

77

Pre-Emphasis Example

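[Figure: z-plane with imaginary axis jy and a zero at a on the real axis; the distance from the zero to the point ws/2 (z = -1) is 1 + a, which approaches 2 as a approaches 1]

G(ws/2) = 1 + a

G(0) = 1 - a

For G(ws/2) > G(0), a must be > 0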
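
The two gains can be confirmed by running a first-order pre-emphasis filter on a constant input and on an input alternating at half the sampling frequency. The filter form y_n = s_n - a s_{n-1} follows the spectral lifting expression quoted earlier in these notes; the value a = 0.95 is an assumption chosen for illustration.

```python
def pre_emphasise(s, a=0.95):
    """First-order pre-emphasis, G(z) = 1 - a z^-1: y_n = s_n - a * s_{n-1}."""
    return [s[n] - a * (s[n - 1] if n > 0 else 0.0) for n in range(len(s))]

a = 0.95                                   # ASSUMPTION: a typical value, not from the slide
gain_dc = abs(1 - a)                       # G(0)
gain_nyq = abs(1 + a)                      # G(ws/2)

low = pre_emphasise([1.0] * 8, a)                          # zero-frequency input
high = pre_emphasise([(-1.0) ** n for n in range(8)], a)   # input at half the sampling frequency
```

After the first sample, the constant input is attenuated to 1 - a while the alternating input is amplified to 1 + a, so with a > 0 the filter lifts the high-frequency end of the spectrum.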

Page 78: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

78

Z-plane to Magnitude Spectrum

[Figure: z-plane plot (real part against imaginary part, both -1 to 1) marking the distance 1 + a = 2 to the point ws/2, alongside the magnitude spectrum (dB, -30 to 50) against frequency (kHz, 0 to Fs/2)]

Page 80: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

EG-348_371_09

80

ST & LT Prediction

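[Diagram: speech s_n -> STP (short-term predictor, 1 - A'(z)) -> residual e_n -> LTP (long-term predictor, 1 - A'(z)) -> e'_n. Each stage is a tapped delay line: delays Z^-1 give s_{n-1}, ..., s_{n-i}, ..., s_{n-p}, which are weighted by a_1, ..., a_i, ..., a_p and subtracted (+/-) from the stage input]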