EG-348_371_09
Multimedia Communications (371) Speech and Image Communications (348)
John Mason
Engineering
Swansea University

Features in speech
[Figure: Acquisition (frame: 20/30 ms, sampling frequency: 8 kHz) followed by Feature extraction, giving a sequence of feature vectors X1 ... Xi ... over time]

Speech production
Air from the lungs -> Vocal folds -> Vocal tract -> Speech

LPC Short and Long
Spectral envelope reflects morphological characteristics of the vocal tract
Production: Air from the lungs -> Vocal folds -> Vocal tract -> Speech
Model: noise -> H1(z) -> H2(z) -> synthesised speech

Features: building of statistical model
[Figure: feature vectors grouped into repeated T1 / T2 blocks used to build the statistical model]
VT Shape & Some Vowels - Ladefoged ‘62

Speech Processing - Applications
Why?
• Communications
• Synthesis
• Recognition: Speech & Speaker
How?
• Frame-based
• Systems approach

Some Books
• Flanagan, ‘Speech Analysis, Synthesis and Perception’, Springer-Verlag - a classic!
• Furui - several books on recognition
• Parsons, ‘Voice and Speech Processing’, McGraw Hill - one of the first textbooks on computer speech processing
• O’Shaughnessy, ‘Speech Communication: Human and Machine’, Addison-Wesley
• Rabiner & Juang, ‘Fundamentals of Speech Recognition’, Prentice Hall, 1993
• Ramachandran & Mammone (eds), ‘Modern Methods of Speech Processing’, Kluwer Academic, 1995

Speech Communications
• Person-to-Person
• Person-to-Machine: speech/speaker recognition
• Machine-to-Person: speech synthesis

(Electronic) Speech Communications
perhaps separated by long distance (or in time)

Telephony & Broadcasting
Acoustic Air Path -> Electronic Link (Channel Transmission Path) -> Acoustic Air Path

Speech Comms: Telephony
Electronic Link (Channel Transmission Path)
Sending: Microphone -> ADC -> Analysis -> Coding -> Transmitter
Receiving: Receiver -> Decoding -> (re-)Synthesis -> DAC -> Loudspeaker

Speech Bit Rates
Speaker: Message Creation -> Language Coding -> Human Acoustic Generation
Transmission through Acoustic Space
Listener: Human Hearing Extraction -> Language Decoding -> Message Realisation
Approx. bit rate in bps along the chain: tens, hundreds, thousands, tens of thousands

Criteria in Speech Comms.
Quality versus Bit-rate
[Figure: Quality (Poor, Fair, Good, Excellent) against bit rate (4, 8, 16, 32, 64 kbps), with curves for GSM, ADPCM and CELP]
4 Quality Measures:
• intelligibility
• loudness
• naturalness
• ease-of-listening

Low Bit Rate Speech Coding
Compandent: http://www.compandent.com/

Speech Processing
The three main application areas are:
• Speech Comms. (the ‘electronic link’)
• Automatic Speech/Speaker recognition
• Speech Synthesis
Much of the underlying analysis is common, eg linear predictive coding

What does speech look like?
[Figure: speech plot, horizontal axis 0 to 7000]
• Dynamic Range - for flexibility and robustness
• Time-varying - to convey information

Frame-based Analysis
[Figure: speech plot, horizontal axis 0 to 7000]
To capture time variations:
• 20-30 ms frames - ‘centi-second’ labelling
• spectral analysis: FFT, Filter-bank, Linear Predictive Coding
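
The frame-based analysis above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function names `frame_signal` and `log_spectra`, the 10 ms hop (100 frames/second) and the 440 Hz test tone are all assumptions chosen for the example.

```python
import numpy as np

def frame_signal(s, fs=8000, frame_ms=20, hop_ms=10):
    """Split a signal into overlapping, Hamming-windowed frames."""
    frame_len = int(fs * frame_ms / 1000)   # 160 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)           # 10 ms hop -> 100 frames/second
    n_frames = 1 + (len(s) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([s[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])

def log_spectra(frames):
    """Log magnitude spectrum of every frame (one row per frame)."""
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spec + 1e-10)             # small floor avoids log(0)

# Hypothetical test signal: one second of a 440 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
s = np.sin(2 * np.pi * 440 * t)

frames = frame_signal(s, fs)
S = log_spectra(frames)
print(frames.shape, S.shape)
```

Each row of `S` is one short-term spectral estimate; with 160-sample frames the FFT bins are 50 Hz apart, so the tone shows up near bin 9.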

Speech Analysis/Coding
Two general cases:
• Waveform coders
• Source (voice) coders (vocoders)
Source coders, eg linear predictive coding (LPC):
• Model the source, ie the vocal tract (VT)
• Linear, time-varying model of VT, plus excitation
[Figure: excitation e_n (voiced or unvoiced) -> H(z) -> speech s_n]

Systems Approach
[Figure: Excitation (voiced, with pitch f0, or unvoiced) -> Vocal Tract -> Speech, represented as Excitation -> Model with time-varying parameters -> Speech]

LPC Analysis/Synthesis
Synthesis: input is the excitation, output is the speech
e_n, E(z) -> H(z) (impulse response h_n) -> s_n, S(z)
Analysis: input is the speech, output is the excitation
s_n, S(z) -> 1/H(z) -> e_n, E(z)

‘Perfect’ Analysis/Synthesis
Analysis: s_n, S(z) -> 1/H(z) -> e_n, E(z)
Synthesis: e_n, E(z) -> H(z) -> s_n, S(z)
Input s_n and output s_n are identical (within arithmetic limits)

Practical Analysis/Synthesis
Source coding: s_n -> Analysis -> Coding -> De-coding -> Synthesis -> s_n
LPC-based systems (eg CELP):
• Analysis: s_n -> 1/H(z) -> e_n
• Re-synthesis: e_n -> $\hat{H}(z)$ -> s_n

Practical Analysis/Synthesis
Sending: s_n, S(z) -> 1/H(z) -> e_n, E(z)
Transmission
Receiving: e_n, E(z) -> H(z) -> s_n, S(z)
Parameters for transmission:
• Input / excitation e_n
• Source model H(z)
Thus Analysis must derive these parameters, and Synthesis must use them to re-generate speech

Linear Predictive Coding - LPC
Principle of linear prediction: the next value (or sample) in a series, ie at time n, is predicted or estimated by a weighted sum of previous values, ie those at times n-1, n-2, ...
Thus for a predictor of order p, we have:
$\hat{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p}$

Linear Prediction
$\hat{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + \dots + a_p s_{n-p} = \sum_{i=1}^{p} a_i s_{n-i}$
Transforming to the z-domain gives:
$\hat{S}(z) = a_1 z^{-1} S(z) + a_2 z^{-2} S(z) + \dots + a_p z^{-p} S(z)$
$E(z) = S(z) - \{a_1 z^{-1} S(z) + a_2 z^{-2} S(z) + \dots + a_p z^{-p} S(z)\}$
$E(z) = A(z) S(z)$
where $A(z) = (1 - a_1 z^{-1} - a_2 z^{-2} - \dots - a_p z^{-p})$

LPC Error Terms
Error is simply the difference between predicted and actual values:
$e_n = s_n - \hat{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$
$E(z) = S(z) - \hat{S}(z)$
$= S(z) - \{a_1 z^{-1} S(z) + a_2 z^{-2} S(z) + \dots + a_p z^{-p} S(z)\}$
$= A(z) S(z)$
where $A(z) = (1 - a_1 z^{-1} - a_2 z^{-2} - \dots - a_p z^{-p})$
$\frac{E(z)}{S(z)} = A(z) = 1 - A'(z)$
[Figure: s_n is passed through A'(z) and the result is subtracted from s_n to give e_n]

Synthesis
e_n -> H(z) -> s_n
[Figure: feedback form of H(z): past outputs pass through A'(z) and are added to e_n to give s_n]
Parameters updated at frame rate
NB ‘hat’ of approximation omitted for simplicity

Analysis for Synthesis
The Analysis and Synthesis must match. What is needed for the Synthesis?
Answer: e_n - the excitation, and H(z) - the system
Thus the Analysis must derive these terms (from s_n):
The speech signal s_n is analysed to give e_n and H(z), ie the A'(z) parameters, for transmission.
Synthesis: e_n -> H(z) -> s_n
Analysis: s_n, S(z) -> 1/H(z) -> e_n, E(z)
[Figure: Analysis realised by subtracting the output of A'(z) from s_n to give e_n]

Derivation of LPC Coefficients - A(z)
Recall:
$e_n = s_n - \hat{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$
where a_i are the p prediction coefficients. The principle behind LPC is to find a set of p coefficients, a_1, a_2, a_3, ... a_p, which in some sense minimises the error signal e_n over a frame of speech of N samples. This leads to a set of p coefficients for each frame.
$E_n^2 = \sum_{n=0}^{N-1} (s_n - \hat{s}_n)^2 = \sum_{n=0}^{N-1} \left( s_n - \sum_{i=1}^{p} a_i s_{n-i} \right)^2$

Derivation of A(z) - (2)
Minimisation of E_n is achieved by setting the p partial derivatives to zero:
$\frac{\partial E_n^2}{\partial a_i} = 0$ for i = 1, 2, ... p
From which:
$r_j - \sum_{k=1}^{p} a_k r_{jk} = 0$ where: $r_{jk} = \sum_{n} s_{n-j} s_{n-k}$ and $r_j = r_{j0}$
In matrix form:
$\mathbf{r} - \mathbf{R}\mathbf{a} = 0$ or $\mathbf{a} = \mathbf{R}^{-1}\mathbf{r}$
The matrix [R] is symmetric Toeplitz, offering numerically efficient inversion techniques - Durbin’s recursion algorithm being one of the most popular.
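
As a rough sketch of how the normal equations can be solved, the following Python implements the autocorrelation method with a Levinson-Durbin style recursion and checks it against a direct matrix solve. The function names, the random test frame and the choice p = 10 are assumptions made for illustration; the sign convention is the predictor form used above, $\hat{s}_n = \sum a_i s_{n-i}$.

```python
import numpy as np

def autocorr(frame, p):
    """Autocorrelation lags r[0..p] of one (windowed) frame."""
    n = len(frame)
    return np.array([np.dot(frame[:n - k], frame[k:]) for k in range(p + 1)])

def levinson_durbin(r, p):
    """Solve R a = r for predictor coefficients a_1..a_p (R Toeplitz).

    Returns (a, err), where err is the final prediction error energy.
    """
    a = np.zeros(p + 1)                      # a[0] is unused
    err = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        err *= (1.0 - k * k)
    return a[1:], err

# Hypothetical frame: 160 samples of noise, Hamming windowed.
rng = np.random.default_rng(0)
frame = rng.standard_normal(160) * np.hamming(160)
p = 10
r = autocorr(frame, p)
a, err = levinson_durbin(r, p)

# Cross-check against a direct solve of the same Toeplitz system.
R = np.array([[r[abs(j - k)] for k in range(p)] for j in range(p)])
a_direct = np.linalg.solve(R, r[1:])
print(np.allclose(a, a_direct))
```

The recursion needs on the order of p^2 operations per frame rather than the p^3 of a general solver, which is the efficiency the Toeplitz structure offers.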

Derivation of A(z) - (3)
• When N is very large, r holds the autocorrelation coefficients of s
• s comes from e convolved with h (excitation & vocal tract); we are interested here in separating e and h
• the predictor order, p, is small to reflect the short-term periodicities (formants)
• with higher predictor orders we will get the longer-term periodicities (pitch)
• 2 practical problems with evaluating a:
  - matrix singularities in R^-1
  - unstable resultant H(z)
• in practice both are solved by windowing, ie shaping the frame, eg with a Hamming window

Speech Signal Characteristics
• Duration
• Dynamic Range
• Periodicities: vocal tract, pitch
Frame-based Analysis
• frame size: short enough to be quasi-stationary and to capture transitions, typically 20-30 ms
• frame rate: task dependent; more frames mean more bandwidth/computation - up to 100 frames/second
Harmonic Structures and Periodicities
Harmonic Structures & Periodicities give potential for data reduction
LPC is one way of gaining this compression
Speech has two obvious separate structures
vocal tract resonances
pitch

Harmonic Structures and Periodicities
Short-term prediction (order p):
$\hat{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$
$e_n = s_n - \hat{s}_n$
$e_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$
$E = \sum_n (e_n)^2 \rightarrow 0$
[Figure: excitation e_n (voiced, with pitch period Tp, or unvoiced) -> H(z) (vocal tract) -> speech s_n]

Harmonic Structures and Periodicities
Long-term prediction (order P):
$\hat{s}_n = \sum_{i=1}^{P} a_i s_{n-i}$
$e_n = s_n - \hat{s}_n$
$e_n = s_n - \sum_{i=1}^{P} a_i s_{n-i}$
$E = \sum_n (e_n)^2 \rightarrow 0$
[Figure: excitation e_n (voiced or unvoiced) -> Hlt(z) (pitch, period Tp) -> ep_n -> Hst(z) (vocal tract) -> speech s_n]

Harmonic Structures and Periodicities
Two structures: short-term (formants) & long-term - pitch (excitation)
e_n -> Hlt(z) -> ep_n -> Hst(z) -> s_n
eg 20 ms frame = 160 samples @ 8 kHz
• Hlt(z): a_i, eg p = 3
• Hst(z): a_i, eg p = 10
• Gain
• k
NB Representations of these parameters are transmitted

Practical Coding Systems
Waveform & Source Coders (Vocoders)
Source Coders (Vocoders):
• 2 periodicities/redundancies in source: short-term (formants), long-term - pitch
• Excitation e_n
e_n -> Hlt(z) -> ep_n -> Hst(z) -> s_n

‘Perfect’ Analysis/Synthesis (1)
Analysis: s_n, S(z) -> 1/H(z) -> e_n, E(z)
Synthesis: e_n, E(z) -> H(z) -> s_n, S(z)
Input s_n and output s_n are identical (within arithmetic limits)

‘Perfect’ Analysis/Synthesis (2)
Analysis: s_n, S(z) -> 1/H(z) -> e_n, E(z)
Synthesis: e_n, E(z) -> H(z) -> s_n, S(z)
In terms of A'(z):
Analysis: s_n, S(z) -> 1 - A'(z) -> e_n, E(z)
Synthesis: e_n, E(z) -> 1/(1 - A'(z)) -> s_n, S(z)
Cascade: s_n -> 1 - A'(z) -> e_n -> 1/(1 - A'(z)) -> s_n

‘Perfect’ Analysis/Synthesis (3)
s_n -> 1 - A'(z) -> e_n -> 1/(1 - A'(z)) -> s_n
[Figure: analysis filter as a tapped delay line. The original speech s_n passes through z^-1 delays to give s_{n-1}, ..., s_{n-i}, ..., s_{n-p}; these are weighted by a_1, ..., a_i, ..., a_p, summed, and subtracted from s_n to give the residual e_n]
$e_n = s_n - \hat{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i}$
Note - minus sign: in Matlab it is combined with the a_i
What determines p?

‘Perfect’ Analysis/Synthesis (4)
s_n -> 1 - A'(z) -> e_n -> 1/(1 - A'(z)) -> s_n
[Figure: original speech -> residual -> re-synthesised speech. Analysis: delayed samples s_{n-1}, ..., s_{n-i}, ..., s_{n-p}, weighted by a_1, ..., a_i, ..., a_p, are subtracted from s_n to give e_n. Re-synthesis: the same weighted sum of past outputs is added to e_n to give s_n. Note: no minus sign in the synthesis filter]
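
The ‘perfect’ analysis/synthesis loop can be checked numerically. The sketch below is an illustration under stated assumptions: the function names, the random input and the second-order predictor a = [1.8, -0.9] (ie A(z) = 1 - 1.8 z^-1 + 0.9 z^-2, Matlab-style a1 = -1.8, a2 = 0.9) are chosen for the example only.

```python
import numpy as np

def analysis(s, a):
    """Inverse filter 1 - A'(z): e_n = s_n - sum_i a_i s_{n-i}."""
    p = len(a)
    e = np.zeros(len(s))
    for n in range(len(s)):
        pred = sum(a[i] * s[n - 1 - i] for i in range(p) if n - 1 - i >= 0)
        e[n] = s[n] - pred          # minus sign: prediction is subtracted
    return e

def synthesis(e, a):
    """All-pole filter 1/(1 - A'(z)): s_n = e_n + sum_i a_i s_{n-i}."""
    p = len(a)
    s = np.zeros(len(e))
    for n in range(len(e)):
        pred = sum(a[i] * s[n - 1 - i] for i in range(p) if n - 1 - i >= 0)
        s[n] = e[n] + pred          # no minus sign: prediction is added back
    return s

# Assumed example values, not taken from a real speech frame.
a = np.array([1.8, -0.9])
rng = np.random.default_rng(1)
s = rng.standard_normal(200)

e = analysis(s, a)       # residual
s_rec = synthesis(e, a)  # re-synthesised signal
print(np.allclose(s, s_rec))
```

With the unquantised residual and the same coefficients at both ends, the output matches the input to within arithmetic limits; a practical coder departs from this only because e_n and the a_i are coded approximately.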

Practical System
Sending: s_n, S(z) -> 1/H(z) -> e_n, E(z) -> Transmitted Data Frame
Receiving: Transmitted Data Frame -> e_n, E(z) -> H(z) -> s_n, S(z)
Input s_n and output s_n are “similar”
What does the Transmitted Data Frame contain?

Analysis-by-Synthesis: LPAS
Integrated encoder & decoder at the encoder
[Figure: LPAS Encoder. The output of a basic decoder is subtracted from the input s_n, and the weighted error drives the adaptive encoder]

Log Spectral Estimates
Comparisons between frames are very important in many situations. Log spectral estimates are the most common (though in Comms an approximation is used to reduce computation).
$D = \frac{1}{B} \int_0^B \left( S_1(w) - S_2(w) \right)^2 dw \approx \frac{1}{N} \sum_{k=0}^{N/2-1} \left( S_{1,k} - S_{2,k} \right)^2$
where $S_k = \log(|\mathrm{DFT}(s_n)|)$ or $\log(|H(z)|)$ at $z = e^{jw}$, and the subscripts 1 and 2 denote the two frames being compared
In Comms, computation is expensive and parameter-vector approximations to D are used
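
A frame-to-frame log spectral comparison of this kind might look as follows in Python. This is only a sketch: the function names, the 256-point FFT, the Hamming window and the use of a plain mean over the FFT bins are assumptions, and the exact normalisation used in the lecture may differ.

```python
import numpy as np

def log_spectrum(frame, n_fft=256):
    """Log magnitude spectrum of one Hamming-windowed frame."""
    windowed = frame * np.hamming(len(frame))
    return np.log(np.abs(np.fft.rfft(windowed, n_fft)) + 1e-10)

def log_spectral_distance(frame_a, frame_b, n_fft=256):
    """Mean squared difference between two log spectra."""
    diff = log_spectrum(frame_a, n_fft) - log_spectrum(frame_b, n_fft)
    return float(np.mean(diff ** 2))

# Hypothetical frames: 160 samples each (20 ms at 8 kHz).
rng = np.random.default_rng(2)
x = rng.standard_normal(160)
y = rng.standard_normal(160)

d_same = log_spectral_distance(x, x)
d_diff = log_spectral_distance(x, y)
print(d_same, d_diff)
```

The cost per comparison is dominated by the two FFTs, which is why a coder working in real time would rather compare short parameter vectors.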

Some Standards

| Standard | Application | Coder | Bit rate (kb/s) |
|---|---|---|---|
| GSM | European Cellular | RPE-LTP | 13 |
| FS1016 | Secure Voice | CELP | 4.8 |
| IS54 | NA Cellular | VSELP | 7.95 |
| IS96 | NA Cellular | QCELP | 1-8 |
| JDC-FR | Japanese Cellular | VSELP | 6.7 |
| JDC-HR | Japanese Cellular | PSI-CELP | 3.67 |
| G.728 | (terrestrial) | LD-CELP | 16 |

CELP eg
CB Index -> Gain -> e_n -> Hlt(z) -> Hst(z) -> s_n
• Hlt(z): long-term coefficients (pitch)
• Hst(z): short-term coefficients (formants)
Excitation e_n is represented by an address, ie the CB (codebook) Index

CELP - LPAS (Encoder)
CB Index -> Gain -> e_n -> Hlt(z) -> Hst(z) -> s_n
• Hlt(z): long-term coefficients (pitch)
• Hst(z): short-term coefficients (formants)
Excitation e_n is represented by an address, ie the CB (codebook) Index
[Figure: the chain above acts as the basic decoder; its output is subtracted from the input s_n, and the weighted error drives the adaptive encoder]

Conversion of LPC Parameters
• $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_p z^{-p}$ and the a_i are to be transmitted
• Line Spectral Frequencies (LSFs) present a clever way of representing the LPC coefficients, the a_i of A(z)
• The a_i are floating point numbers and their accuracy is important
• Factorising A(z) tends to give complex roots in the z-domain
• LSFs map these complex roots on to the unit circle
LSFs:
• Lead to efficient coding
• Ensure a minimum phase filter
• Bit errors are spectrum localised, minimising loss of speech quality
[Figure: z-plane (imaginary axis jy) with a root pair marked x on the unit circle at angle θ; ws is the sampling frequency]
LSF = ws · θ / 2π

Line Spectral Frequencies
• Consider
$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$
and
$Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$
where n is the order of A(z); P(z) and Q(z) then lead to what are known as LSFs
• Clearly if P(z) and Q(z) are known then A(z) can be found: $A(z) = \{P(z) + Q(z)\} / 2$
• Roots of P(z) and Q(z) lie on the unit circle in the z-domain. The locations give: the LSFs, then P(z) and Q(z), and whence A(z)

LSF Evaluation
Consider one pair of complex roots, A1(z):
$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
$P_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + z^{-3}(1 + a_1 z + a_2 z^2) = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
$Q_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - z^{-3}(1 + a_1 z + a_2 z^2) = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
The fixed roots at z = +1 and z = -1 (angles 0 and π) are discarded
It follows that the LSFs, φ1 & φ2, are given by:
$\cos(\phi_1) = -(a_1 + a_2 - 1)/2$
and $\cos(\phi_2) = -(a_1 - a_2 + 1)/2$
Show: $a_1 = -(\cos(\phi_1) + \cos(\phi_2))$ and
$a_2 = \cos(\phi_2) - \cos(\phi_1) + 1$
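
The second-order relations above are easy to check numerically. The sketch below uses hypothetical function names and takes the pair a1 = -1.8, a2 = 0.9 that appears in the LSF Examples table; it assumes the arguments of `acos` stay within [-1, 1], which holds for that example.

```python
import math

def lpc2_to_lsf(a1, a2):
    """LSFs (radians) of A(z) = 1 + a1 z^-1 + a2 z^-2."""
    phi1 = math.acos(-(a1 + a2 - 1) / 2)   # from the quadratic factor of P(z)
    phi2 = math.acos(-(a1 - a2 + 1) / 2)   # from the quadratic factor of Q(z)
    return phi1, phi2

def lsf_to_lpc2(phi1, phi2):
    """Inverse mapping: recover a1 and a2 from the two LSFs."""
    a1 = -(math.cos(phi1) + math.cos(phi2))
    a2 = math.cos(phi2) - math.cos(phi1) + 1
    return a1, a2

phi1, phi2 = lpc2_to_lsf(-1.8, 0.9)
a1, a2 = lsf_to_lpc2(phi1, phi2)
print(round(phi1, 5), round(phi2, 6))
print(round(a1, 6), round(a2, 6))
```

The round trip recovers the original coefficients, which is the result the ‘Show’ exercise asks for.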

LSF Test Example
$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
$= (z^2 + a_1 z + a_2) z^{-2}$
$= (z^2 + 2\cos(\theta) w_n z + w_n^2) z^{-2}$
where w_n is the radius and θ is the angle from π. So: radius = $\sqrt{a_2}$ & φ = π - θ
Note: in P & Q all w_n^2 terms (of the multiple 2nd orders) are unity
EG 1: a_2 = 1, then $\cos(\phi_1) = -(a_1 + a_2 - 1)/2 = -a_1/2$
roots already on circle and do not move (unstable system, not practical)
EG 2: a_1 = 0, then $\cos(\phi_1) = -(a_1 + a_2 - 1)/2 = -(a_2 - 1)/2$
$\cos(\phi_2) = -(a_1 - a_2 + 1)/2 = -(-a_2 + 1)/2$
so the LSFs are symmetric about ws/4 (an angle of π/2)

LSF Review & Example (1)
LSFs/LSPs are defined via:
$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$
and $Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$
thus $A(z) = \{P(z) + Q(z)\} / 2$

LSF Review & Example (2)
For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3}$
$= (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^2) z^{-3}$
$= (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
cf: $(s^2 + (2\cos(\theta) w_n) s + w_n^2)$

LSF Review & Example (3)
For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:
$P(z) = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
$Q(z) = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
cf: $(s^2 + (2\cos(\theta) w_n) s + w_n^2)$
Thus: $(a_1 + a_2 - 1) = 2\cos(\theta_1) = -2\cos(\phi_1)$
& $(a_1 - a_2 + 1) = -2\cos(\phi_2)$
So, given:
i) LPC coeffs. a_1 and a_2, the LSFs φ1 & φ2 can be found
ii) LSFs φ1 & φ2, the LPC coeffs. a_1 and a_2 can be found
[Figure: z-plane plot of the roots of P(z) and Q(z) on the unit circle, at angles φ1 and φ2]

LSF Review & Example (4)
For a second order A(z), with P(z) corresponding to the first root and Q(z) to the second root:
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
For the second pair of φ_i, 1.37 and 1.77:
$P(z) = (z^2 - 2\cos(1.37) z + 1)(z + 1) z^{-3} = (z^3 + (1 - 2\cos(1.37)) z^2 + (1 - 2\cos(1.37)) z + 1) z^{-3}$
Likewise
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 - a_2 + 1) z + 1)(z - 1) z^{-3}$
$= (z^2 - 2\cos(1.77) z + 1)(z - 1) z^{-3} = (z^3 + (-1 - 2\cos(1.77)) z^2 + (1 + 2\cos(1.77)) z - 1) z^{-3}$
Then
$A(z) = \{P(z) + Q(z)\} / 2 = (z^3 - (\cos(1.37) + \cos(1.77)) z^2 + (1 - \cos(1.37) + \cos(1.77)) z) z^{-3}$

LSF Examples

| a1 | a2 | φ1 | φ2 |
|---|---|---|---|
| 0 | 0.5 | 1.31812 | 1.82348 |
| -1.8 | 0.9 | 0.31756 | 0.554811 |
| +1.8 | 0.9 | π - 0.554811 | π - 0.31756 |
| | | 2.2274 | 2.3743 |

[Figures: z-plane root plots, axes from -1 to 1]

LSF Examples
Worked example for a_1 = -1.8, a_2 = 0.9:
$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3}$
$= (z^2 + (a_1 + a_2 - 1) z + 1)(z + 1) z^{-3}$
$= (z^2 + (-1.8 + 0.9 - 1) z + 1)(z + 1) z^{-3}$
$= (z^2 - 1.9 z + 1)(z + 1) z^{-3}$
cf: $(z^2 + (2\cos(\theta) w_n) z + w_n^2)$
thus $\cos(\theta) = -1.9/2$ or θ = 2.824, and φ1 = π - θ = 0.318

Example Bit Allocation

| Bit allocation | Voiced | Unvoiced |
|---|---|---|
| V/U decision | 1 | 1 |
| Excitation | 11 | 11 |
| Sync | 1 | 1 |
| Φ1 (= 0.3176) | 5 | 5 |
| Φ2 (= 0.5548) | 5 | 5 |
| Φ3 (= 1.4454) | 5 | 5 |
| Φ4 (= 1.6961) | 5 | 5 |
| Φ5 | 4 | 0 |
| Φ6 | 4 | 0 |
| Φ7 | 4 | 0 |
| Φ8 | 4 | 0 |
| Φ9 | 3 | 0 |
| Φ10 | 2 | 0 |
| Error check | 0 | 21 |
| Total / frame | 54 | 54 |

Codebooks & VQ
[Figure: a sequence over time of p-element vectors is replaced by a sequence of indices i (0 ... N-1) into a codebook of N = 2^L vectors; the receiver holds an identical book]
Data reduction: (p x B) to L

Codebook Compression
Principle
• representative data sets
• data vector is replaced / represented by the “nearest” vector, chosen from a “codebook” - a closed set of vectors
Examples
• LPC parameter sets
• Excitation, as in CELP
[Figure: codebook of N = 2^k vectors, each of M elements; the index i selects A(z) for H(z) in e_n -> H(z) -> s_n]
![Page 66: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/66.jpg)
Codebook Compression - CELP
[Figure labels: codebook with P entries; $e_n$ (y ms) → H(z) → $s_n$ (y ms)]
$e_n$ are time-domain samples (integers)
R samples per second (e.g. 8000 Hz)
Frame rate governs vector size
$P = 2^j$
Bit rate = j/y bits/ms
Codebook of time-domain samples (figure labels: start point, y ms)
NB $e_n$ also includes gain
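The two relations on this slide, vector size set by the frame length and bit rate set by the index size, reduce to simple arithmetic. The sketch below uses hypothetical values (y = 5 ms, j = 10) together with the 8000 Hz rate quoted as an example; the function names are illustrative and the gain bits mentioned in the NB are not counted.

```python
def celp_vector_length(sample_rate_hz, frame_ms):
    """Number of excitation samples e_n in one y-ms codebook vector."""
    return round(sample_rate_hz * frame_ms / 1000)

def celp_index_bit_rate(index_bits, frame_ms):
    """Bit rate j / y in bits per ms (numerically equal to kbit/s)."""
    return index_bits / frame_ms

# Hypothetical example: R = 8000 Hz, y = 5 ms, j = 10 bits (P = 1024 entries).
samples_per_vector = celp_vector_length(8000, 5)
codebook_entries = 2 ** 10
index_rate = celp_index_bit_rate(10, 5)
```

With these values each codebook entry holds 40 samples and the index costs 2 bits/ms, regardless of how many samples the entry contains.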
![Page 67: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/67.jpg)
Codebook Compression of H(z)
[Figure: A[z] at time t, taken every x ms along the time axis; each vector of M elements is mapped to a codebook index, i]
Vector with M elements, every x ms
Codebook with $N = 2^k$ vectors
Bit rate = k/x bits per ms (not a function of M)
In practice A[z] is converted to LSFs.
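The encode and decode steps implied here can be sketched as a nearest-neighbour search followed by a table lookup. The slide only says "nearest", so the squared Euclidean distance below is an assumption, and the tiny four-entry codebook is made up for illustration.

```python
import numpy as np

def vq_encode(vectors, codebook):
    """Return, for each M-element vector, the index i of the nearest codebook entry."""
    vectors = np.asarray(vectors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance between every vector and every codebook entry.
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def vq_decode(indices, codebook):
    """Replace each index by the corresponding codebook vector."""
    return np.asarray(codebook, dtype=float)[indices]

def vq_bit_rate(k, x_ms):
    """Bit rate k / x in bits per ms; note that M does not appear."""
    return k / x_ms

# Made-up codebook with N = 2**2 entries and M = 2 elements per vector.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
indices = vq_encode([[0.1, -0.1], [0.9, 1.2]], codebook)
decoded = vq_decode(indices, codebook)
```

Only the indices are transmitted, which is why the bit rate depends on k and x but not on the vector length M.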
![Page 68: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/68.jpg)
Codebook Generation
1) Initialise: form a single centroid of all training data, N = 1
2) Repeat
     Split centroids: N -> 2N
     Repeat
       Cluster data to nearest centroid
     until convergence
   until N large enough
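The splitting procedure above can be sketched as follows. Several details are assumptions that the slide leaves open: squared Euclidean distance, splitting by a small additive perturbation scaled to the data spread, a relative-distortion stopping rule, and leaving a centroid unchanged if its cluster is empty. The names `train_codebook`, `eps` and `tol` are illustrative.

```python
import numpy as np

def nearest(data, codebook):
    """Index of, and squared distance to, the nearest centroid for each data vector."""
    d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2.min(axis=1)

def train_codebook(data, n_bits, eps=0.01, tol=1e-6, max_iter=100):
    """Grow a codebook from N = 1 to N = 2**n_bits by repeated splitting."""
    data = np.asarray(data, dtype=float)
    codebook = data.mean(axis=0, keepdims=True)   # 1) single centroid, N = 1
    delta = eps * data.std(axis=0)                # perturbation used for splitting
    while codebook.shape[0] < 2 ** n_bits:        # 2) until N large enough
        codebook = np.concatenate([codebook + delta, codebook - delta])  # N -> 2N
        prev = np.inf
        for _ in range(max_iter):                 # cluster until convergence
            idx, d2 = nearest(data, codebook)
            for i in range(codebook.shape[0]):
                members = data[idx == i]
                if len(members):                  # empty cluster: keep old centroid
                    codebook[i] = members.mean(axis=0)
            dist = d2.mean()
            if prev - dist <= tol * dist:
                break
            prev = dist
    return codebook

rng = np.random.default_rng(0)
training = rng.normal(size=(500, 2))              # synthetic training data
cb1 = train_codebook(training, 0)
cb8 = train_codebook(training, 3)
```

On the training data the average distortion falls as the codebook grows; behaviour on unseen data needs a separate test set.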
![Page 69: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/69.jpg)
VQ Performance on Unseen Data
Ramachandran & Mammone (eds), ‘Modern Methods of Speech Processing’, Kluwer Academic, 1995
![Page 70: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/70.jpg)
VQ Performance on Unseen Data
Ramachandran & Mammone (eds), ‘Modern Methods of Speech Processing’, Kluwer Academic, 1995
![Page 71: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/71.jpg)
LPC & FFT Spectra
[Figure: waveform against Time (ms), 0 to 25.6 ms; LPC and FFT spectra, Magnitude (dB) against Frequency (kHz), 0 to Fs/2]
LPC Roots: -0.6651 ± 0.6695i, -0.0560 ± 0.9709i, 0.7228 ± 0.6225i, 0.8714 ± 0.3694i, 0.5758, -0.4200
LSFs

| $\omega_2$ of Q(z) | $\omega_1$ of P(z) |
|---|---|
| 2.3743 | 2.2274 |
| 1.6540 | 1.5997 |
| 0.8261 | 0.6954 |
| 0.6106 | 0.3937 |
![Page 72: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/72.jpg)
LPC Spectra & LSFs
[Figure: LPC spectrum, Magnitude (dB) against Frequency (kHz), 0 to Fs/2; z-plane plot with both axes from -1 to 1]
LPC Roots: -0.6651 ± 0.6695i, -0.0560 ± 0.9709i, 0.7228 ± 0.6225i, 0.8714 ± 0.3694i, 0.5758, -0.4200
LSFs

| $\omega_2$ of Q(z) | $\omega_1$ of P(z) |
|---|---|
| 2.3743 | 2.2274 |
| 1.6540 | 1.5997 |
| 0.8261 | 0.6954 |
| 0.6106 | 0.3937 |
![Page 73: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/73.jpg)
LPC & FFT Spectra - 2nd Order
[Figure: waveform against Time (ms), 0 to 25.6 ms; LPC and FFT spectra, Magnitude (dB) against Frequency (kHz), 0 to Fs/2]
A(z): 1.5537 -0.8276
Roots: 0.7769 ± 0.4733i

$$H(0) = \frac{K}{1 - (1.5537 - 0.8276)}$$

$$H(\omega_s/2) = \frac{K}{1 - (-1.5537 - 0.8276)}$$

$$\frac{H(0)}{H(\omega_s/2)} = \frac{K/0.274}{K/3.38} = 21.8\ \text{dB}$$
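The 21.8 dB figure and the quoted roots can be reproduced numerically. The sketch assumes the sign convention implied by the slide's arithmetic, H(z) = K / (1 - a1 z^-1 - a2 z^-2) with a1 = 1.5537 and a2 = -0.8276; the helper name `lpc_gain` is illustrative.

```python
import numpy as np

a = np.array([1.5537, -0.8276])   # A(z) coefficients quoted on the slide

def lpc_gain(w, a, K=1.0):
    """|H(e^{jw})| for H(z) = K / (1 - sum_i a_i z^-i), w in radians per sample."""
    i = np.arange(1, len(a) + 1)
    return K / abs(1.0 - np.sum(a * np.exp(-1j * w * i)))

# Ratio of the gain at zero frequency to the gain at half the sampling frequency.
ratio_db = 20 * np.log10(lpc_gain(0.0, a) / lpc_gain(np.pi, a))

# Poles of H(z): roots of z^2 - a1 z - a2.
poles = np.roots(np.concatenate([[1.0], -a]))
```

K cancels in the ratio, so the result depends only on the two coefficients.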
![Page 74: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/74.jpg)
GSM
Groupe Spécial Mobile - EU
First digital cellular system in world
See Hodge 1990
Based on TDMA & FDMA at 900 MHz, and RPE-LPC (i.e. it is an ‘LPAS’ system)
Now at 1800 MHz
Carriers at 200 kHz
Supporting 8 TDMA time slots each
Time slots: 577 µs - 156.25 bit slots
8 time slots form 1 GSM frame of 4.62 ms
Modulation: Gaussian minimum shift keying (GMSK)
26 bit training in every time slot
Round-trip delay ~ 80 ms
EU: GSM US: D-AMPS
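The timing figures on this slide are linked by simple arithmetic, sketched below. It takes 156.25 bit periods per 577 µs slot (the standard GSM value) and derives the frame length and the gross bit rate on one carrier; the variable names are illustrative.

```python
SLOT_US = 577.0            # time slot duration in microseconds
BITS_PER_SLOT = 156.25     # bit periods per time slot
SLOTS_PER_FRAME = 8        # TDMA time slots per GSM frame

frame_ms = SLOTS_PER_FRAME * SLOT_US / 1000.0     # frame duration in ms
bit_period_us = SLOT_US / BITS_PER_SLOT           # duration of one bit in microseconds
gross_rate_kbps = 1000.0 / bit_period_us          # gross bit rate per carrier in kbit/s
```

This gives a frame of 4.616 ms, which rounds to the 4.62 ms quoted, and a gross rate of roughly 270.8 kbit/s shared by the 8 slots.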
![Page 75: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/75.jpg)
Other Related Topics
Spectral Lifting: $H(z) = (1 - a z^{-1})$
Codebook Training
Spectral Differences between 2 frames
Cepstra
Modeling Speech Space - HMM’s
![Page 76: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/76.jpg)
Pre-Emphasis Example
[Figure Q1: two waveform panels, (a) and (b), each 30 ms long; amplitude axes marked -8000 to 8000 and -1 to 1]
![Page 77: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/77.jpg)
Pre-Emphasis Example
[Figure: z-plane (imaginary axis jy) with a zero at a; labels: 1 + a = 2, $\omega_s/2$]
$G(\omega_s/2) = 1 + a$
$G(0) = 1 - a$
For $G(\omega_s/2) > G(0)$, a must be > 0
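The two gains can be confirmed with a first-order filter. The sketch assumes the pre-emphasis filter G(z) = 1 - a z^-1, which is consistent with G(0) = 1 - a and G(ωs/2) = 1 + a; the value a = 0.9 is a hypothetical choice for illustration.

```python
import numpy as np

def pre_emphasis(x, a):
    """y[n] = x[n] - a * x[n-1], i.e. G(z) = 1 - a z^-1, with x[-1] taken as 0."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= a * x[:-1]
    return y

def pre_emphasis_gain(w, a):
    """|G(e^{jw})| at angular frequency w in radians per sample."""
    return abs(1.0 - a * np.exp(-1j * w))

a = 0.9                                             # hypothetical coefficient
g_dc = pre_emphasis_gain(0.0, a)                    # gain at zero frequency
g_nyq = pre_emphasis_gain(np.pi, a)                 # gain at half the sampling frequency
y_dc = pre_emphasis(np.ones(6), a)                  # constant input
y_alt = pre_emphasis([1, -1, 1, -1, 1, -1], a)      # input alternating at ws/2
```

After the first sample the constant input is scaled by 1 - a and the alternating input by 1 + a, so any a > 0 lifts the high-frequency end relative to the low end.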
![Page 78: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/78.jpg)
Z-plane to Magnitude Spectrum
[Figure: z-plane plot (Real Part against Imaginary Part, both -1 to 1) and magnitude spectrum (Magnitude (dB) against Frequency (kHz), 0 to Fs/2); labels: 1 + a = 2, $\omega_s/2$]
![Page 79: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/79.jpg)
LPC Short and Long
Spectral envelope reflects morphological characteristics of the vocal tract
noise → H1(z) → H2(z) → synthesised speech
Air from the lungs → Vocal fold → Vocal tract → Speech
![Page 80: EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University](https://reader036.vdocuments.us/reader036/viewer/2022070307/551ae1515503465e7d8b47a0/html5/thumbnails/80.jpg)
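ST & LT Prediction
[Block diagram: speech $s_n$ → 1 - A'(z) → residual $e_n$ → 1 - A'(z) → $e'_n$; stages labelled STP and LTP. Each predictor is drawn as a chain of $Z^{-1}$ delays with taps $a_1$ … $a_i$ … $a_p$ acting on $s_{n-1}$ … $s_{n-i}$ … $s_{n-p}$, with the weighted sum subtracted (+/-) from the input]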
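The short-term prediction stage of the diagram can be sketched as an inverse filter. The code assumes A'(z) = a1 z^-1 + … + ap z^-p, so that the residual is e_n = s_n - Σ a_i s_(n-i); the synthesis function and the second-order coefficients (borrowed from the 2nd-order example earlier in the deck) are there only to check that the two filters invert each other. Long-term prediction is not modelled.

```python
import numpy as np

def stp_residual(s, a):
    """Short-term prediction residual e[n] = s[n] - sum_i a[i-1] * s[n-i]."""
    s = np.asarray(s, dtype=float)
    e = s.copy()
    for i, ai in enumerate(a, start=1):
        e[i:] -= ai * s[:-i]          # samples before n = 0 are taken as zero
    return e

def synthesise(e, a):
    """Inverse operation: s[n] = e[n] + sum_i a[i-1] * s[n-i]."""
    e = np.asarray(e, dtype=float)
    s = np.zeros_like(e)
    for n in range(len(e)):
        s[n] = e[n] + sum(ai * s[n - i]
                          for i, ai in enumerate(a, start=1) if n - i >= 0)
    return s

a = [1.5537, -0.8276]                 # example predictor coefficients
excitation = np.zeros(50)
excitation[0] = 1.0                   # single impulse as excitation
speech_like = synthesise(excitation, a)
residual = stp_residual(speech_like, a)
```

Passing the synthesised signal back through 1 - A'(z) recovers the impulse, which is the sense in which the residual is the excitation left after the spectral envelope is removed.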