new audio datarepresentationsjuhan/gct634/slides/02 audio... · 2020. 9. 8. · score (symbolic)...

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

Audio Data Representations

Juhan Nam

Types of Music Data

● Audio
  ○ MP3, WAV

● Score (symbolic)
  ○ MIDI, typesetting script languages (e.g., MusicXML)

● Image
  ○ Score (scanned image), album/playlist cover, performance video

● Text
  ○ Metadata, tags, lyrics, reviews

● User Data
  ○ Listening history, rating

Types of Music Data

● Audio
  ○ MP3, WAV

● Score (symbolic)
  ○ MIDI, typesetting script languages (e.g., MusicXML)

● Image
  ○ Score (scanned image), album/playlist cover, performance video

● Text
  ○ Metadata, tags, lyrics, reviews

● User Data
  ○ Listening history, favorites or scores

Types of Audio Data Representations

● Waveform (digital audio samples): sampling and quantization

● Spectrogram: short-time Fourier transform

● Mel-spectrogram: human pitch perception

● Constant-Q transform: transform into musical (chromatic) scale

Digital Audio Chain

[Figure: digital audio chain. Microphone → lowpass filters → sampling → quantization (analog-to-digital conversion) → storage and processing of the bit stream (…0 0 1 0 1 0…) → digital-to-analog conversion → lowpass filters → amplifier → loudspeaker]

Sampling and Quantization

[Figure: the digital audio chain from the previous slide, repeated]

Sampling

● Convert continuous-time signals to discrete-time signals by periodically picking up the instantaneous values
  ○ Represented as a sequence of numbers
  ○ Sampling period ($T_s$): the amount of time between samples
  ○ Sampling rate: $f_s = 1/T_s$
  ○ Signal notation: $x(t) \rightarrow x(nT_s)$

[Figure: a continuous-time signal with samples spaced $T_s$ apart]
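As a minimal sketch of the notation $x(t) \rightarrow x(nT_s)$, the snippet below samples a sinusoid at multiples of the sampling period. The function name `sample_sinusoid` and the 1 kHz / 8 kHz values are illustrative assumptions, not part of the slides.

```python
import math

def sample_sinusoid(freq_hz, fs_hz, num_samples, amplitude=1.0):
    """Return x(n*Ts) for x(t) = amplitude * sin(2*pi*freq_hz*t), with Ts = 1/fs_hz."""
    ts = 1.0 / fs_hz  # sampling period in seconds
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * n * ts)
            for n in range(num_samples)]

# A 1 kHz tone sampled at 8 kHz: eight samples cover exactly one period
x = sample_sinusoid(freq_hz=1000.0, fs_hz=8000.0, num_samples=8)
print([round(v, 3) for v in x])
```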

Sampling Theorem

● What is an appropriate sampling rate?
  ○ Too high: increases the data size in the digital domain
  ○ Too low: the original signal cannot be reconstructed

● Sampling Theorem
  ○ The sampling rate must be greater than twice the maximum frequency in the signal in order to reconstruct the original signal: $f_s > 2 f_m$ ($f_s$: sampling rate, $f_m$: maximum frequency of the signal)
  ○ Half the sampling rate is called the Nyquist frequency ($f_s/2$)

Sampling in the Frequency Domain

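[Figure: spectrum of a signal band-limited to $\pm f_m$, and its spectrum after sampling. When $f_s > 2 f_m$, the copies centered at $\pm f_s$ (from $f_s - f_m$ to $f_s + f_m$ and from $-f_s - f_m$ to $-f_s + f_m$) do not overlap the original band. When $f_s < 2 f_m$, the copies overlap it and produce aliases.]

The high-frequency content above the Nyquist frequency is folded over (aliasing).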
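The folding can be checked numerically. In this sketch (the helper names and the 8 kHz rate are assumptions chosen for illustration), a 5 kHz cosine sampled at 8 kHz yields the same samples as a 3 kHz cosine, because 5 kHz lies above the 4 kHz Nyquist frequency and folds to $f_s - 5\,\text{kHz}$.

```python
import math

def sample_cosine(freq_hz, fs_hz, num_samples):
    """Samples of cos(2*pi*freq_hz*t) taken at t = n / fs_hz."""
    return [math.cos(2.0 * math.pi * freq_hz * n / fs_hz) for n in range(num_samples)]

def alias_frequency(freq_hz, fs_hz):
    """Frequency in [0, fs/2] at which a real sinusoid of freq_hz appears after sampling."""
    f = freq_hz % fs_hz
    return fs_hz - f if f > fs_hz / 2.0 else f

fs = 8000.0
below = sample_cosine(3000.0, fs, 16)   # below the Nyquist frequency
above = sample_cosine(5000.0, fs, 16)   # above the Nyquist frequency
max_diff = max(abs(a - b) for a, b in zip(below, above))
print(alias_frequency(5000.0, fs), max_diff)
```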

Sampling Rate

● Determined by the bandwidth of signals or hearing limits
  ○ Music: 44.1 kHz (CD, consumer) or 48/96/192 kHz (professional)
  ○ Speech communication: 8 kHz

[Figure: music and speech examples]

Sampling Rate Conversion (Resampling)

● We often increase or decrease the sampling rate
  ○ 44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or even lower rates to reduce the data size
  ○ Computed by signal interpolation
    ■ In down-sampling, preceded by a low-pass filter to avoid aliasing noise
    ■ Windowed sinc function
    ■ https://ccrma.stanford.edu/~jos/resample/

[Figure: down-sampling and up-sampling examples, from https://ccrma.stanford.edu/~jos/resample/]
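A rough sketch of down-sampling by a factor of two: a windowed-sinc low-pass filter removes content above the new Nyquist frequency, then every second sample is kept. The tap count, the Hann taper, and the function names are assumptions made for this example. It is not the algorithm of the linked resampling page, which handles arbitrary rate ratios.

```python
import math

def windowed_sinc_lowpass(cutoff, num_taps):
    """FIR low-pass taps; cutoff is normalized so that 1.0 is the Nyquist frequency."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = n - mid
        # Ideal low-pass impulse response (a sinc), guarded at t == 0
        h = cutoff if t == 0 else math.sin(math.pi * cutoff * t) / (math.pi * t)
        # A Hann window tapers the truncated sinc
        w = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(h * w)
    gain = sum(taps)
    return [h / gain for h in taps]  # normalize to unity gain at DC

def downsample_by_2(x, num_taps=63):
    """Low-pass filter at half the original Nyquist frequency, then keep every 2nd sample."""
    taps = windowed_sinc_lowpass(cutoff=0.5, num_taps=num_taps)
    filtered = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(taps):
            if 0 <= n - k < len(x):
                acc += h * x[n - k]
        filtered.append(acc)
    return filtered[::2]

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

# Frequencies in cycles per sample at the original rate (Nyquist = 0.5)
low_tone = [math.sin(2.0 * math.pi * 0.05 * n) for n in range(512)]
high_tone = [math.sin(2.0 * math.pi * 0.40 * n) for n in range(512)]
low_out = downsample_by_2(low_tone)[40:-40]    # drop samples near the signal edges
high_out = downsample_by_2(high_tone)[40:-40]
print(rms(low_out), rms(high_out))
```

The low tone survives with nearly its original level, while the high tone, which would otherwise alias, is strongly attenuated before decimation.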

Quantization

● Discretizing the amplitude of real-valued signals
  ○ Round the amplitude to the nearest discrete step
  ○ The discrete steps are determined by the number of bits (bit depth)
    ■ $N$ bits cover the range $-2^{N-1}$ to $2^{N-1}-1$: 8 bits (-128 to 127), 16 bits (-32768 to 32767)

[Figure: quantization steps between $-2^{N-1}$ and $2^{N-1}-1$]
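A small sketch of the rounding and range described above, assuming the input amplitude is normalized to [-1.0, 1.0). The function name `quantize` and the clipping at the top of the range are choices made for this example.

```python
def quantize(x, num_bits):
    """Round x in [-1.0, 1.0) to a signed integer code of num_bits bits."""
    levels = 2 ** (num_bits - 1)
    code = int(round(x * levels))
    # Clip to the representable range [-2^(N-1), 2^(N-1) - 1]
    return max(-levels, min(levels - 1, code))

print(quantize(0.5, 8), quantize(-1.0, 16), quantize(1.0, 16))
```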

Quantization

● Determined by the dynamic range of signals
  ○ Each additional bit adds about 6 dB of dynamic range: $N$ bits → $6N$ dB
  ○ Music (CD): 16 bits (consumer) → 96 dB
  ○ Speech communication: 8 bits → 48 dB

[Figure: music and speech examples]
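The 6 dB-per-bit rule follows from $20\log_{10}(2^N) \approx 6.02N$. A minimal check, with `dynamic_range_db` as a hypothetical helper name:

```python
import math

def dynamic_range_db(num_bits):
    """Ratio between the largest amplitude and one quantization step, in dB."""
    return 20.0 * math.log10(2 ** num_bits)

for bits in (8, 16):
    print(bits, "bits:", round(dynamic_range_db(bits), 2), "dB")
```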

Loading Audio Files

● Check the sampling rate and bit depth
  ○ You can check them using audio software such as Audacity

● Resample (usually down-sample) if necessary
  ○ Librosa can resample audio files while loading them
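Besides audio software, the sampling rate and bit depth of a WAV file can be read programmatically. The sketch below uses only Python's standard `wave` module and writes its own short test file so that it is self-contained. The file name and the helper `describe_wav` are assumptions for illustration.

```python
import os
import tempfile
import wave

def describe_wav(path):
    """Return (sampling rate in Hz, bit depth, number of channels) from a WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), 8 * wav.getsampwidth(), wav.getnchannels()

# Write one second of 16-bit mono silence at 22.05 kHz as a stand-in audio file
path = os.path.join(tempfile.mkdtemp(), "example.wav")
with wave.open(path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)        # 2 bytes = 16 bits per sample
    wav.setframerate(22050)
    wav.writeframes(b"\x00\x00" * 22050)

info = describe_wav(path)
print(info)
```

With Librosa installed, `librosa.load(path, sr=22050)` returns the samples resampled to the requested rate, which covers the resampling step mentioned on the slide.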

Waveform

● The waveform is a natural representation of audio, but it is limited for analyzing the content
  ○ It mainly shows the temporal energy

Spectrogram

● 2D image representation of audio using the short-time Fourier transform
  ○ x-axis: time, y-axis: frequency, color: magnitude
  ○ It is common to use a dB scale (a log scale) for the magnitude
  ○ Easy to match what you hear to what you see

Computing Spectrogram

● For each short segment (frame)
  ○ Take a window (one frame)
  ○ Compute the DFT (FFT)
  ○ Convert the result to polar coordinates
    ■ Magnitude and phase
  ○ Compress the magnitude
    ■ $20\log_{10} X_{mag}$: decibel
  ○ Shift by a hop size

● Spectrogram parameters
  ○ Window size (FFT size)
  ○ Hop size
  ○ Window type

[Figure: short-time Fourier transform (STFT). Consecutive frames $x(l-1)$ and $x(l)$, each one window size long and one hop size apart, are windowed, transformed by the DFT into $X_{mag}$, and passed through magnitude compression.]
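The frame-by-frame procedure can be sketched in plain Python as follows. This is a slow reference version under assumed settings (Hann window, window size 64, hop size 32, an 8 kHz test tone). A practical implementation would use an FFT routine from a numerical library instead of the direct DFT.

```python
import cmath
import math

def hann(n_points):
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_points) for n in range(n_points)]

def dft(frame):
    """Direct DFT, keeping bins 0..N/2 (sufficient for real-valued input)."""
    n_points = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points // 2 + 1)]

def spectrogram_db(x, window_size, hop_size, floor=1e-10):
    window = hann(window_size)
    frames = []
    for start in range(0, len(x) - window_size + 1, hop_size):  # shift by the hop size
        frame = [x[start + n] * window[n] for n in range(window_size)]  # windowing
        magnitude = [abs(c) for c in dft(frame)]                        # polar: magnitude
        frames.append([20.0 * math.log10(max(m, floor)) for m in magnitude])  # dB
    return frames  # frames[t][k]: time frame t, frequency bin k

fs = 8000.0
x = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(512)]
spec = spectrogram_db(x, window_size=64, hop_size=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin, peak_bin * fs / 64)
```

The 1 kHz tone peaks at bin 8, which corresponds to $8 \cdot f_s / 64 = 1000$ Hz.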

Discrete Fourier Transform (DFT)

● Find the frequency (sinusoidal) components of $x(n)$

● Represent $x(n)$ as $x(n) = \sum_{k=0}^{N-1} A(k)\cos\left(\frac{2\pi k n}{N} + \phi(k)\right)$
  ○ $A(k)$: amplitude (or magnitude) of the sinusoid
  ○ $\phi(k)$: phase of the sinusoid
  ○ $N$: size of the DFT or the input segment
  ○ $k$: frequency bin index (0 to $N-1$); $\frac{k}{N} f_s$ is the frequency at each frequency bin

● The DFT provides a way of finding $A(k)$ and $\phi(k)$

[Figure: album cover of Pink Floyd's "The Dark Side of the Moon"]


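● Use the orthogonality of sinusoids (for $k$ and $l$ other than 0 and $N/2$)
  ○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right)\cos\left(\frac{2\pi l n}{N}\right) = N/2$ if $k = l$ or $k = N - l$ (equivalent to $-l$), and $0$ otherwise
  ○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right)\sin\left(\frac{2\pi l n}{N}\right) = 0$
  ○ $\sum_{n=0}^{N-1} \sin\left(\frac{2\pi k n}{N}\right)\sin\left(\frac{2\pi l n}{N}\right) = N/2$ if $k = l$, $-N/2$ if $k = N - l$ (equivalent to $-l$), and $0$ otherwise

● The inner product (or correlation) between two sinusoids:
  ○ If the frequencies are the same (including opposite signs), it is non-zero
  ○ Otherwise, it is zero (they are orthogonal to each other)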
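These relations are easy to verify numerically. The following sketch evaluates the sums for $N = 16$. The helper `inner_product` and the chosen bin indices are assumptions for illustration.

```python
import math

def inner_product(f, k, g, l, n_points):
    """Sum over n of f(2*pi*k*n/N) * g(2*pi*l*n/N), with f and g being math.cos or math.sin."""
    return sum(f(2.0 * math.pi * k * n / n_points) * g(2.0 * math.pi * l * n / n_points)
               for n in range(n_points))

N = 16
same = inner_product(math.cos, 3, math.cos, 3, N)           # k = l
mirrored = inner_product(math.sin, 3, math.sin, N - 3, N)   # k = N - l
different = inner_product(math.cos, 3, math.cos, 5, N)      # k != l
mixed = inner_product(math.cos, 3, math.sin, 3, N)          # cosine against sine
print(round(same, 6), round(mirrored, 6), round(different, 6), round(mixed, 6))
```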


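● Inner product of the input with the sinusoids (constant scale factors omitted)
  ○ $X_{re}(l) = \sum_{n=0}^{N-1} x(n)\cos\left(\frac{2\pi l n}{N}\right) = \sum_{n=0}^{N-1}\left(\sum_{k=0}^{N-1} A(k)\cos\left(\frac{2\pi k n}{N} + \phi(k)\right)\right)\cos\left(\frac{2\pi l n}{N}\right) = A(l)\cos\phi(l)$
  ○ $X_{im}(l) = -\sum_{n=0}^{N-1} x(n)\sin\left(\frac{2\pi l n}{N}\right) = -\sum_{n=0}^{N-1}\left(\sum_{k=0}^{N-1} A(k)\cos\left(\frac{2\pi k n}{N} + \phi(k)\right)\right)\sin\left(\frac{2\pi l n}{N}\right) = A(l)\sin\phi(l)$

● The magnitude and phase
  ○ $X_{mag}(k) = A(k) = \sqrt{X_{re}^2(k) + X_{im}^2(k)}$, $X_{phase}(k) = \phi(k) = \tan^{-1}\left(\frac{X_{im}(k)}{X_{re}(k)}\right)$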

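● The definition of the DFT can be simplified using complex sinusoids
  ○ $X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi k n}{N}} = X_{re}(k) + jX_{im}(k) = A(k)e^{j\phi(k)}$
  ○ Euler's identity: $e^{j\frac{2\pi k n}{N}} = \cos\left(\frac{2\pi k n}{N}\right) + j\sin\left(\frac{2\pi k n}{N}\right)$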
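As a sketch of how the complex DFT recovers amplitude and phase, the code below analyzes a single sinusoid placed exactly on bin $k_0$. With the unnormalized sum used here, the bin value equals $\frac{N}{2}A e^{j\phi}$, so the magnitude is rescaled by $2/N$. The test values ($N = 32$, $A = 0.8$, $\phi = 0.5$, $k_0 = 5$) are arbitrary assumptions.

```python
import cmath
import math

def dft(x):
    """X(k) = sum_n x(n) * exp(-j*2*pi*k*n/N), computed directly."""
    n_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]

N, A, phi, k0 = 32, 0.8, 0.5, 5
x = [A * math.cos(2.0 * math.pi * k0 * n / N + phi) for n in range(N)]
X = dft(x)

amplitude = 2.0 * abs(X[k0]) / N   # undo the N/2 scaling of the unnormalized DFT
phase = cmath.phase(X[k0])         # angle of X_re + j*X_im
print(round(amplitude, 6), round(phase, 6))
```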


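● Can be viewed as matrix multiplication

● In practice, we use an FFT algorithm instead of direct multiplication
  ○ Divide the matrix into small matrices recursively
  ○ Complexity reduction: $O(N^2) \rightarrow O(N\log_2 N)$

[Figure: the DFT as a matrix multiplication. $x(n)$ is multiplied by the real and imaginary parts of the matrix $W$ of complex sinusoids $s_k^*(n) = e^{-j 2\pi k n / N}$ to give $X_{re}$ and $X_{im}$, which are converted to polar form, $X_{mag}$ and $X_{phase}$.]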
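The recursive splitting can be illustrated with a textbook radix-2 decimation-in-time FFT, compared against the direct $O(N^2)$ sum. This is a teaching sketch that assumes the input length is a power of two. It is not how optimized FFT libraries are written.

```python
import cmath
import math

def dft_direct(x):
    """O(N^2) reference: one inner product per frequency bin."""
    n_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]

def fft(x):
    """Radix-2 FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return list(x)
    even = fft(x[0::2])   # DFT of the even-indexed samples
    odd = fft(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * n_points
    for k in range(n_points // 2):
        t = cmath.exp(-2j * math.pi * k / n_points) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n_points // 2] = even[k] - t
    return out

x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(16)]
max_error = max(abs(a - b) for a, b in zip(fft(x), dft_direct(x)))
print(max_error)
```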


● When the DFT is applied to musical sounds
  ○ A musical tone with pitch has a periodic waveform
  ○ The DFT shows a harmonic spectrum (harmonic overtones)
  ○ Pitch information can also be extracted
  ○ The magnitude spectrum is generally sparser than the waveform

[Figure: waveform $x(n)$ of a musical tone and its magnitude spectrum $X_{mag}(k)$, with peaks at $F_0$, $2F_0$, and $3F_0$]

Effect of Window Type

● Types of window functions
  ○ Trade-off between the width of the main lobe and the level of the side lobes
  ○ The Hann window is the most widely used in music analysis

[Figure: rectangular, triangular, Hann, and Blackman windows (amplitude, 0 to 1) and the spectra of windowed single sinusoids (magnitude in dB)]
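To make the side-lobe trade-off concrete, this sketch windows a sinusoid whose frequency falls between two bins and measures how much energy leaks into bins far from the peak. The window formulas are the standard ones. The helper names, the window length of 64, and the 10.5-bin test frequency are assumptions for illustration.

```python
import cmath
import math

def rectangular(n_points):
    return [1.0] * n_points

def hann(n_points):
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_points) for n in range(n_points)]

def blackman(n_points):
    return [0.42 - 0.5 * math.cos(2.0 * math.pi * n / n_points)
            + 0.08 * math.cos(4.0 * math.pi * n / n_points) for n in range(n_points)]

def spectrum_db(window, freq_bins):
    """Magnitude spectrum (dB relative to its peak) of a windowed cosine at freq_bins."""
    n_points = len(window)
    x = [window[n] * math.cos(2.0 * math.pi * freq_bins * n / n_points)
         for n in range(n_points)]
    mags = [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                    for n in range(n_points)))
            for k in range(n_points // 2 + 1)]
    peak = max(mags)
    return [20.0 * math.log10(max(m / peak, 1e-12)) for m in mags]

def leakage_db(window, freq_bins=10.5, first_bin=20):
    """Highest level found from first_bin up to N/2, well away from the main lobe."""
    return max(spectrum_db(window, freq_bins)[first_bin:])

rect_leak = leakage_db(rectangular(64))
hann_leak = leakage_db(hann(64))
blackman_leak = leakage_db(blackman(64))
print(round(rect_leak, 1), round(hann_leak, 1), round(blackman_leak, 1))
```

The rectangular window has the narrowest main lobe but leaks the most into distant bins, while the tapered windows trade a wider main lobe for much lower side lobes.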

Effect of Window Size

● Trade-off between time and frequency resolution
  ○ Short window: low frequency resolution and high time resolution
  ○ Long window: high frequency resolution and low time resolution

[Figure: spectrograms computed with Hop=128, N=256 and with Hop=128, N=4096]
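The trade-off can be quantified by the bin spacing $f_s/N$ and the window duration $N/f_s$. The slide does not state the sampling rate of the two examples, so the 22.05 kHz used below is an assumption.

```python
def stft_resolution(window_size, fs_hz):
    """Return (frequency bin spacing in Hz, window duration in seconds)."""
    return fs_hz / window_size, window_size / fs_hz

for n in (256, 4096):
    bin_hz, duration_s = stft_resolution(n, fs_hz=22050.0)
    print(f"N={n}: {bin_hz:.2f} Hz per bin, {1000.0 * duration_s:.1f} ms per window")
```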

Human Ears

● Is the human ear a spectrum analyzer?
  ○ Our ear has a complicated pathway from the eardrum to the auditory nerve
  ○ The cochlea in the inner ear is a bandpass filter bank
  ○ The membrane resonates at a different position depending on the frequency of the input. The resonance frequency increases on a log scale along the membrane

[Figure: (unrolled) cochlear membrane]

Human Pitch Perception

● Pitch resolution
  ○ The just noticeable difference (JND) increases as the frequency goes up

● Mel scale
  ○ Approximates the human pitch resolution based on the pitch ratio of tones
  ○ Most widely used for speech and music analysis
  ○ A log frequency scale: $m = 2595\log_{10}(1 + f/700)$
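The mel formula and its algebraic inverse, as a short sketch. The function names are illustrative, and the inverse is derived from the formula above rather than taken from the slides.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

for f in (100.0, 1000.0, 8000.0):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

Equal steps in mel correspond to increasingly wide steps in Hz, which mirrors the growing JND at high frequencies.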

Computing Mel-Spectrogram

● Mapping linear frequency to the mel scale
  ○ A mel-scaled filter bank is used: linear interpolation around each center frequency with the corresponding bandwidth skirt
  ○ The high-frequency range is zoomed out and the low-frequency range is relatively zoomed in; the number of frequency bins is usually smaller

[Figure: spectrogram (1024 freq. bins), mel-scaled filter bank (center frequency and bandwidth), and mel-spectrogram (128 mel bins)]
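One common way to build such a filter bank is with triangular filters whose edges are equally spaced on the mel scale. The sketch below follows that idea under assumed parameters (10 filters, FFT size 256, 8 kHz sampling rate). Libraries differ in details such as normalization, so this is not a reproduction of any particular implementation.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(num_filters, fft_size, fs_hz):
    """Return num_filters rows of triangular weights over FFT bins 0..fft_size/2."""
    num_bins = fft_size // 2 + 1
    max_mel = hz_to_mel(fs_hz / 2.0)
    # num_filters + 2 points equally spaced in mel: lower edge, centers, upper edge
    points_hz = [mel_to_hz(max_mel * i / (num_filters + 1))
                 for i in range(num_filters + 2)]
    bin_hz = [k * fs_hz / fft_size for k in range(num_bins)]
    bank = []
    for i in range(1, num_filters + 1):
        lo, center, hi = points_hz[i - 1], points_hz[i], points_hz[i + 1]
        row = []
        for f in bin_hz:
            if lo < f <= center:
                row.append((f - lo) / (center - lo))   # rising slope
            elif center < f < hi:
                row.append((hi - f) / (hi - center))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def apply_filter_bank(bank, magnitudes):
    """Map one linear-frequency magnitude frame to mel bands."""
    return [sum(w * m for w, m in zip(row, magnitudes)) for row in bank]

bank = mel_filter_bank(num_filters=10, fft_size=256, fs_hz=8000.0)
widths = [sum(1 for w in row if w > 0.0) for row in bank]
print(widths)  # number of FFT bins covered by each filter
```

The filters cover more FFT bins as the center frequency rises, which is the "zoomed-out" high-frequency range described above.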

Musical scale

● Musical tuning system
  ○ Equal temperament: frequency ratio of $1 : 2^{1/12}$ per semitone
  ○ Music note number ($m$) and frequency ($f$) in Hz: $f = 440 \cdot 2^{(m-69)/12}$, $m = 12\log_2\left(\frac{f}{440}\right) + 69$

https://newt.phys.unsw.edu.au/jw/notes.html
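The two conversion formulas translate directly into code. In this sketch the function names are illustrative, and note number 69 is A4 at 440 Hz as in the formulas above.

```python
import math

def midi_to_hz(m):
    return 440.0 * 2.0 ** ((m - 69) / 12.0)

def hz_to_midi(f_hz):
    return 12.0 * math.log2(f_hz / 440.0) + 69.0

print(round(midi_to_hz(60), 2))   # middle C (note 60)
print(hz_to_midi(880.0))          # one octave above A4
```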

Review

● Now we know that we need a log scale for music

● The log-scale filter bank will look like this

[Figure: log-scaled filter bank (center frequency and bandwidth)]

● Question:
  ○ Can we obtain the log-frequency-scale spectrogram directly from waveforms using a time-frequency representation?

Constant-Q Transform

● Time-frequency representation that uses a set of sinusoidal kernels with log-spaced frequencies
  ○ As the frequency increases, the sinusoidal kernels become shorter (the bandwidth becomes wider) to keep Q (= frequency/bandwidth) constant
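A sketch of how the kernel parameters scale, assuming the common choice $Q = 1/(2^{1/B} - 1)$ for $B$ bins per octave (bandwidth equal to the spacing between adjacent bins) and a kernel length of $Q f_s / f_k$ samples. These formulas, the 55 Hz starting frequency, and the 22.05 kHz rate are assumptions that are not stated on the slide.

```python
def cqt_parameters(f_min_hz, num_bins, bins_per_octave, fs_hz):
    """Return Q and a list of (center frequency, bandwidth, kernel length) per bin."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    params = []
    for k in range(num_bins):
        f_k = f_min_hz * 2.0 ** (k / bins_per_octave)   # log-spaced center frequency
        bandwidth = f_k / q                             # grows with frequency
        length = int(round(q * fs_hz / f_k))            # shrinks with frequency
        params.append((f_k, bandwidth, length))
    return q, params

q, params = cqt_parameters(f_min_hz=55.0, num_bins=24, bins_per_octave=12, fs_hz=22050.0)
print(round(q, 3), params[0], params[-1])
```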

Figure 1. The upper panel illustrates the real part of the transform bases (temporal kernel) that can be used to calculate the CQT over two octaves, with 12 bins per octave. The lower panel shows the absolute values of the corresponding spectral kernel.

3.1 Algorithm of Brown and Puckette

Let us assume that we want to calculate the CQT transform coefficients $X^{CQ}(k, n)$ as defined by (1) at one point $n$ of an input signal $x(n)$. A direct implementation of (1) obviously requires calculating inner products of the input signal with each of the transform bases. The upper panel of Fig. 1 illustrates the real part of the transform bases $a_k(n)$, assuming here for simplicity only $B = 12$ bins per octave and a frequency range of two octaves.

A computationally more efficient implementation is obtained by utilizing the identity

$\sum_{n=0}^{N-1} x(n)\, a^*(n) = \sum_{j=0}^{N-1} X(j)\, A^*(j)$    (7)

where $X(j)$ denotes the discrete Fourier transform (DFT) of $x(n)$ and $A(j)$ denotes the DFT of $a(n)$. Equation (7) holds for any discrete signals $x(n)$ and $a(n)$ and stems from Parseval's theorem [3].

Using (7), the CQT transform in (1) can be written as

$X^{CQ}(k, N/2) = \sum_{j=0}^{N} X(j)\, A_k^*(j)$    (8)

where $A_k(j)$ is the complex-valued $N$-point DFT of the transform basis $a_k(n)$ so that the bases $a_k(n)$ are centered at the point $N/2$ within the transform frame. Following the terminology of [9], we will refer to $A_k(j)$ as the spectral kernels and to $a_k(n)$ as the temporal kernels. The lower panel of Fig. 1 illustrates the absolute values of the spectral kernels $A_k(j)$ corresponding to temporal kernels $a_k(n)$ in the upper panel.

As observed by Brown and Puckette, the spectral kernels $A_k(j)$ are sparse: most of the values being near zero because they are Fourier transforms of modulated sinusoids. Therefore the summation in (8) can be limited to values near the peak in the spectral kernel to achieve sufficient numerical accuracy, omitting near-zero values in $A_k(j)$. This is the main idea of the efficient CQT transform proposed in [9]. It is also easy to see that the summing has to be carried out for positive frequencies only, followed by multiplication by two.

For convenience, we store the spectral kernels $A_k(j)$ as columns in matrix $A$. The transform in (8) can then be written in matrix form as

$X^{CQ} = A^* X$    (9)

where $A^*$ denotes the conjugate transpose of $A$. Matrices $X$ and $X^{CQ}$ have only one column each, containing the DFT values $X(j)$ and the corresponding CQT coefficients, respectively.
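The identity behind (7) and the sparsity of the spectral kernel can be checked on a toy example. With the unnormalized DFT convention used earlier in these slides, the frequency-domain sum carries a $1/N$ factor, which the sketch includes explicitly. The kernel here (a Hann-windowed complex sinusoid on bin 3, $N = 32$) is a hypothetical stand-in and not the kernel design of the paper.

```python
import cmath
import math

def dft(x):
    n_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * j * n / n_points)
                for n in range(n_points))
            for j in range(n_points)]

N = 32
x = [math.sin(0.37 * n) + 0.2 * math.cos(1.3 * n) for n in range(N)]
# Temporal kernel a(n): Hann-windowed complex sinusoid centered on bin 3
a = [(0.5 - 0.5 * math.cos(2.0 * math.pi * n / N)) * cmath.exp(2j * math.pi * 3 * n / N)
     for n in range(N)]

X, A = dft(x), dft(a)
time_sum = sum(xn * an.conjugate() for xn, an in zip(x, a))
freq_sum = sum(Xj * Aj.conjugate() for Xj, Aj in zip(X, A)) / N

# The spectral kernel is sparse: only a few bins near the peak are non-negligible
peak = max(abs(Aj) for Aj in A)
significant = [j for j, Aj in enumerate(A) if abs(Aj) > 1e-9 * peak]
sparse_sum = sum(X[j] * A[j].conjugate() for j in significant) / N
print(significant, abs(time_sum - freq_sum), abs(time_sum - sparse_sum))
```

Summing over only the significant bins gives the same coefficient, which is the source of the computational saving described in the excerpt.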

3.2 Processing One Octave at a Time

There are two remaining problems with the method outlined in the previous subsection. Firstly, when a wide range of frequencies is considered (for example, eight octaves from 60 Hz to 16 kHz), quite long DFT transform blocks are required and the spectral kernel is no longer very sparse, since the frequency responses of higher frequency bins are wider as can be seen from Fig. 1. Secondly, in order to analyze all parts of the input signal adequately, the CQT transform for the highest frequency bins has to be calculated at least every $N_K/2$ samples apart, where $N_K$ is the window length for the highest CQT bin. Both of these factors reduce the computational efficiency of the method.

We propose two extensions to address the above problems. The first is processing by octaves.² We use a spectral kernel matrix $A$ which produces the CQT for the highest octave only. After computing the highest-octave CQT bins over the entire signal, the input signal is lowpass filtered and downsampled by factor two, and then the same process is repeated to calculate the CQT bins for the next octave, using exactly the same DFT block size and spectral kernel (see (8)). This is repeated iteratively until the desired number of octaves has been covered. Figure 2 illustrates this process.

Since the spectral kernel $A$ now represents frequency bins that are at maximum one octave apart, the length of the DFT block can be made quite short (according to $N_k$ of the lowest CQT bin) and the matrix $A$ is very sparse even for the highest-frequency bins.

Another computational efficiency improvement is obtained by using several temporally translated versions of the transform bases $a_k(n)$ within the same spectral kernel

² We want to credit J. Brown for mentioning this possibility already in [8], although octave-by-octave processing was not implemented in [8, 9].


(Schörkhuber and Klapuri, 2010)

Constant-Q IIR Filter Bank

● Musically designed constant-Q transform
  ○ 88 IIR bandpass filters
  ○ The center frequency corresponds to the pitch of each piano note
  ○ The bandwidth is set to have constant Q with +/- 25 cents around the center (100 cents = 1 semitone)

(Müller, 2011)
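The filter specifications implied by these bullets can be tabulated as below. The sketch assumes the 88 piano keys map to MIDI notes 21 (A0) to 108 (C8) with A4 = 440 Hz, and it only computes center frequencies and band edges. The IIR filter design itself is not shown.

```python
def piano_filter_specs():
    """Return (midi note, center Hz, lower edge Hz, upper edge Hz, Q) for 88 piano notes."""
    specs = []
    for midi in range(21, 109):  # A0 (21) to C8 (108)
        center = 440.0 * 2.0 ** ((midi - 69) / 12.0)
        low = center * 2.0 ** (-25.0 / 1200.0)    # 25 cents below the center
        high = center * 2.0 ** (25.0 / 1200.0)    # 25 cents above the center
        specs.append((midi, center, low, high, center / (high - low)))
    return specs

specs = piano_filter_specs()
print(len(specs), specs[0][1], round(specs[-1][1], 2), round(specs[0][4], 2))
```

Because the band edges are a fixed number of cents from the center, the ratio of center frequency to bandwidth is the same for every filter.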

Example: Constant-Q Transform

● Chromatic music scale
  ○ The harmonics of notes increase linearly in the constant-Q transform

[Figure: spectrogram and constant-Q transform of a chromatic scale]
