1/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
EE E6820: Speech & Audio Processing & Recognition
/
ering
E6820 SAPR - Dan Ellis L01 - 2006-01-19
Lecture 1: Introduction & DSP
Sound and information
Course structure
DSP review: Timescale modification
Dan Ellis <[email protected]>http://www.ee.columbia.edu/~dpwe/e6820
Columbia University Dept. of Electrical EngineSpring 2006
1
2
3
2/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Sound and information
voltage
1
on
air
r
ge
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Sound is air pressure variation
• Transducers convert air pressure ↔↔↔↔
Mechanical vibrati
Pressure waves in
Motion of senso
Time-varying volta
+ + + +
t
v(t)
3/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
What use is sound?
vantage
ioncornersral sounds’
.5 5
.5 5
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Footsteps examples:
• Hearing confers an evolutionary ad- useful information, complements vis- ...at a distance, in the dark, around - listeners are highly adapted to ‘natu
(including speech)
0 0.5 1 1.5 2 2.5 3 3.5 4 4-0.5
0
0.5
0 0.5 1 1.5 2 2.5 3 3.5 4 4-0.5
0
0.5
time / s
4/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
The scope of audio processing
abstract
E6820 SAPR - Dan Ellis L01 - 2006-01-19
AUDIO
PROCESSING
natural
man-made
simple
5/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
The acoustic communication chain
l)
r decoder
!
tion
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Sound is an information bearer
• Received sound reflects source(s) plus effect of environment (channe
message signal channel receive
synthesis audioprocessing recogni
6/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Levels of abstraction
between
erent tasks
xplicit ...
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Much processing concerns shiftinglevels of abstraction
• Different representations serve diff- separating aspects, making things e
sound p(t)
representation(e.g. t-f energy)
‘information’
abstract
concrete
An
alys
is
Syn
thesis
7/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Course structure
rocessingls
2
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Goals:- survey topics in sound analysis & p- develop an intuition for sound signa- learn some specific technologies
• Course structure:- weekly assignments (25%)- midterm event (25%)- final project (50%)
• Text:Speech and Audio Signal ProcessingBen Gold & Nelson Morgan, Wiley, 2000 ISBN: 0-471-35154-7
8/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Web-based
820/
mples, ...
etc.
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Course website:http://www.ee.columbia.edu/~dpwe/e6
for lecture notes, problem sets, exa
• + student web pages for homework
9/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Course outline
rytion
L10:Music
retrieval
L12:Multimediaindexing
E6820 SAPR - Dan Ellis L01 - 2006-01-19
Fundamentals
L1:DSP
L2:Acoustics
L3:Pattern
recognition
L4:Audito
percep
Audio processing
L5:Signalmodels
L6:Music
analysis/synthesis
L7:Audio
compression
L8:Spatial sound& rendering
Applications
L9:Speech
recognition
L11:Signal
separation
10/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Weekly Assignments
g Toolbox)ing
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Research papers- journal & conference publications- summarize & discuss in class- written summaries on web page
• Practical experiments- MATLAB-based (+ Signal Processin- direct experience of sound process- skills for project
• Book sections
11/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Final Project
of grade)
s
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Most significant part of course (50%
• Oral proposals mid-semester; Presentations in final class+ website
• Scope- practical (Matlab recommended)- identify a problem; try some solution- evaluation
• Topic- few restrictions within world of audio- investigate other resources- develop in discussion with me
• Copying
12/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Examples of past projects
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Automatic Prosody Classification
• Model-based note transcription
13/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
DSP review: Digital Signals
3
time
ng
T
ε
E6820 SAPR - Dan Ellis L01 - 2006-01-19
- sampling interval T,
sampling frequency
- quantizer
xd[n] = Q( xc(nT ) )
Discrete-time samplilimits bandwidth
Discrete-levelquantization
limits dynamic range
ΩT2πT------=
Q y( ) ε y εÚ⋅=
14/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
The speech signal: time domain
ound types:
1.92
2.46 2.48
2.6 time/s
dime
: transiente”
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Speech is a sequence of different s
-0.2
-0.1
0
0.1
0.2
1.38 1.4 1.42-0.1
0
0.1
1.52 1.54 1.56 1.58-0.1
0
0.1
1.86 1.88 1.9-0.05
0
0.05
2.42 2.44
-0.02
0
0.02
1.4 1.6 1.8 2 2.2 2.4
watch thin as aahas
Vowel: periodic“has”
Fricative: aperiodic“watch”
Glide: smooth transition“watch”
Stop burst“dim
15/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Timescale modification (TSM)
slower’?y
s?
pling rate
)tor
time/s
r = 2
M
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Can we modify a sound to make it ‘i.e. speech pronounced more slowl- e.g. to help comprehension, analysi- or more quickly for ‘speed listening’
• Why not just slow it down?
-
- equiv. to playback at a different sam
xs t( ) xotr-- =
(>1 → slower, r = slowdown fac
2.35 2.4 2.45 2.5 2.55 2.6-0.1
-0.05
0
0.05
0.1
2.35 2.4 2.45 2.5 2.55 2.6-0.1
-0.05
0
0.05
0.1
Original
2x slower
16/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Time-domain TSM
e structure
mr---- L n+
e / s
e / s
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Problem: want to preserve local timbut alter global time structure
• Repeat segments- but: artefacts from abrupt edges
• Cross-fade & overlap
ym
mL n+[ ] ym 1–
mL n+[ ] w n[ ] x⋅+=
2.35 2.4 2.45 2.5 2.55 2.6-0.1
0
0.1
4.7 4.75 4.8 4.85 4.9 4.95-0.1
0
0.1
1
1
1 1 2 2 3 3 4 4 5 5 6
2
2
3
3
4
4
5 6
6
5
tim
tim
17/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Synchronous Overlap-Add (SOLA)
window to
ion
:
L n Km+ +
mr---- L n K+ +
mr---- L n K+ +
2
------------------------------------------
M
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Idea: Allow some leeway in placingoptimize alignment of waveforms
• Hence,
where Km chosen by cross-correlat
1
2
Km maximizes alignment of 1 and 2
ym
mL n+[ ] ym 1–
mL n+[ ] w n[ ] x mr----⋅+=
Km
ym 1–
mL n+[ ] x⋅n 0=
Nov∑
ym 1–
mL n+[ ]( )2
∑ x
∑
-----------------------------------------------------------------------------0 K KU ≤ ≤
argmax =
18/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
The Fourier domain
us x)
k5 6 74
0.5 0 0.5 1 1.5
t1/k
04 0.006 0.008time / sec
x(t)
0 6000 8000freq / Hz
|X(jΩ)|
E6820 SAPR - Dan Ellis L01 - 2006-01-19
Fourier Series (periodic continuous x)
Fourier Transform (aperiodic continuo
x t( ) ck ejkΩ0t
⋅k∑=
ck1
2πT---------- x t( ) e
jkΩ0– t⋅ td
T 2Ú–
T 2Ú∫=
Ω02πT------=
1 2 3
|ck|1.0
1.5 1 1
0.5
0
0.5
x(t)
x t( )1
2π------ X jΩ( ) e
jΩt⋅ Ωd∫=
X jΩ( ) x t( ) ejΩt–
⋅ td∫=
0 0.002 0.0
level / dB
-0.01
0
0.01
0.02
0 2000 400-80
-60
-40
-20
19/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Discrete-time Fourier
led x)
n2 3 4 5 6 7
)|
ω2π 3π 4π 5π
n]
k
|X(ejω)|]|n4 5 6 7
E6820 SAPR - Dan Ellis L01 - 2006-01-19
DT Fourier Transform (aperiodic samp
Discrete Fourier Transform (N-point x)
x n[ ]1
2π------ X e
jω( ) e
jωn⋅ ωd
π–
π∫=
X ejω
( ) x n[ ] ejωn–
⋅∑=
-1 1
0 π
|X(ejω
1
2
3
x [
x n[ ] X k[ ] ej2πkn
N-------------
⋅k∑=
X k[ ] x n[ ] ej2πkn
N-------------–
⋅n∑=
|X[k
k=1...
1 2 3
x [n]
20/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Sampling and aliasing
tinuous tants:
uctuations
pectrum:
10
) n I∈∀
Ω
seband” l
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Discrete-time signals equal the contime signal at discrete sampling ins
• Sampling cannot represent rapid fl
• Nyquist limit (ΩΩΩΩT/2) from periodic s
xd n[ ] xc nT( )=
0 1 2 3 4 5 6 7 8 9 1
0.5
0
0.5
1
ΩM2πT------+
Tn sin ΩMTn(sin=
ΩM−ΩT ΩT −ΩMΩT - ΩM−ΩT + ΩM
Gp(jΩ)Ga(jΩ) “alias” of “ba
signa
21/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Speech sounds in the Fourier domain
0(power)
nts
2000 3000 4000
2000 3000 4000
0.1
2000 3000
-40
2000 3000 4000
time domain frequency domain
freq / Hz
B
E6820 SAPR - Dan Ellis L01 - 2006-01-19
- dB = 20·log10(amplitude) = 10·log1
• Voiced spectrum has pitch + forma
1.52 1.54 1.56 1.58-0.1
0
0.1
2.42 2.44 2.46 2.48
-0.02
0
0.02
0 1000-100
-80
-60
-40
0 1000-100
-80
-60
1.37 1.38 1.39 1.4 1.41 1.42-0.1
0
0 1000-100
-80
-60
1.86 1.87 1.88 1.89 1.9 1.91-0.05
0
0.05
0 1000-100
-80
-60
Vowel: periodic“has”
Fricative: aperiodic“watch”
Glide: transition“watch”
Stop: transient“dime”
time / s
ener
gy
/ d
22/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Short-time Fourier Transform
e and freq
time / s
2πk n mL–( )N
--------------------------------( )
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Want to localize energy in both tim→break sound into short-time pieces
calculate DFT of each one
• Mathematically:
2.35 2.4 2.45
0
4000
3000
2000
1000
2.5 2.55 2.6-0.1
0
0.1
freq
/ H
z
k →
short-timewindow
DFT
m = 0 m = 1 m = 2 m = 3
L 2L 3L
X k m,[ ] x n[ ] w n mL–[ ] j–exp⋅ ⋅n 0=N 1–
∑=
23/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
The Spectrogram
mage:
time / s
time / s
intensity / dB
-50
-40
-30
-20
-10
0
10
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Plot STFT as a grayscale iX k m,[ ]fr
eq /
Hz
2.35 2.4 2.45 2.5 2.55 2.60
1000
2000
3000
4000
freq
/ H
z
0
1000
2000
3000
4000
0
0.1
0 0.5 1 1.5 2 2.5
24/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Time-frequency tradeoff
cy
.6 time / s
level/ dB
-50
-40
-30
-20
-10
0
10
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Longer window w[n] gains frequenresolution at cost of time resolution
1.4 1.6 1.8 2 2.2 2.4 2
freq
/ H
z
0
1000
2000
3000
4000
freq
/ H
z
0
1000
2000
3000
4000
0
0.2W
indo
w =
256
pt
Nar
row
ban
dW
indo
w =
48
ptW
ideb
and
25/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Speech sounds on the Spectrogram
an rmants
6 time/s
dime
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Most popular speech visualization
• Wideband (short window) better thnarrowband (long window) to see fo
freq
/ H
z
0
1000
2000
3000
4000
1.4 1.6 1.8 2 2.2 2.4 2.
watch thin as aahas
Vo
wel
: pe
riodi
c“h
as”
Fri
c've
: ap
erio
dic
“wat
ch”
Glid
e: tr
ansi
tion
“wat
ch”
Sto
p:
tran
sien
t“d
ime”
26/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
TSM with the Spectrogram
1.4
1.4
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Just stretch out the spectrogram?
- how to resynthesize?
spectrogram is only
Time
Fre
quen
cy
0 0.2 0.4 0.6 0.8 1 1.20
1000
2000
3000
4000
Time
Fre
quen
cy
0 0.2 0.4 0.6 0.8 1 1.20
1000
2000
3000
4000
Y k m,[ ]
27/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
The Phase Vocoder
domain
gram:
een slices:
aligned
M
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Timescale modification in the STFT
• Magnitude from ‘stretched’ spectro
- e.g. by linear interpolation
• But preserve phase increment betw
- e.g. by discrete differentiator
• Does right thing for single sinusoid- keeps overlapped parts of sinusoid
Y k m,[ ] X kmr----,=
θY k m,[ ] θX kmr----,=
time
28/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
General issues in TSM
m
re
parate!ement
ption...
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Time window- stretching a narrowband spectrogra
• Malleability of different sounds- vowels stretch well, stops lose natu
• Not a well-formed problem?- want to alter time without frequency
... but time and frequency are not se- ‘satisfying’ result is a subjective judg→solution depends on auditory perce
29/29COLUMBIA UNIVERSITY
IN THE CITY OF NEW YORK
Summary
n
E6820 SAPR - Dan Ellis L01 - 2006-01-19
• Information in sound- lots of it, multiple levels of abstractio
• Course overview- survey of audio processing topics- practicals, readings, project
• DSP review- digital signals, time domain- Fourier domain, STFT
• Timescale modification- properties of the speech signal- time-domain- phase vocoder