GCT634/AI613: Musical Applications of Machine Learning (Fall 2020) Audio Data Representations Juhan Nam

Post on 23-Oct-2020


  • GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

    Audio Data Representations

    Juhan Nam

  • Types of Music Data

    ● Audio
      ○ MP3, WAV

    ● Score (symbolic)
      ○ MIDI, typesetting script languages (e.g., MusicXML)

    ● Image
      ○ Score (scanned image), album/playlist cover, performance video

    ● Text
      ○ Metadata, tags, lyrics, reviews

    ● User Data
      ○ Listening history, ratings (favorites or scores)

  • Types of Audio Data Representations

    ● Waveform (digital audio samples): sampling and quantization

    ● Spectrogram: short-time Fourier transform

    ● Mel-spectrogram: human pitch perception

    ● Constant-Q transform: transform into musical (chromatic) scale

  • Digital Audio Chain

    [Figure: digital audio chain. Microphone → lowpass filter → sampling → quantization (analog-to-digital conversion) → storage and processing of the bit stream (…0 0 1 0 1 0…) → digital-to-analog conversion → lowpass filter → amplifier → loudspeaker]

  • Sampling and Quantization

    [Figure: digital audio chain. Microphone → lowpass filter → sampling → quantization (analog-to-digital conversion) → storage and processing of the bit stream (…0 0 1 0 1 0…) → digital-to-analog conversion → lowpass filter → amplifier → loudspeaker]

  • Sampling

    ● Convert continuous-time signals to discrete-time signals by periodically picking up the instantaneous values
      ○ Represented as a sequence of numbers
      ○ Sampling period ($T_s$): the amount of time between samples
      ○ Sampling rate: $f_s = 1/T_s$

    Signal notation: $x(t) \rightarrow x(nT_s)$
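The notation $x(t) \rightarrow x(nT_s)$ can be illustrated with a short NumPy sketch. The 440 Hz sine, the 8 kHz rate, and the 10 ms duration are arbitrary values chosen for illustration, not values from the slides.

```python
import numpy as np

fs = 8000            # sampling rate in Hz (assumed for illustration)
Ts = 1.0 / fs        # sampling period in seconds
duration = 0.01      # seconds of signal to sample (assumed)

def x_continuous(t):
    """Stand-in for the continuous-time signal x(t): a 440 Hz sine."""
    return np.sin(2 * np.pi * 440.0 * t)

n = np.arange(int(round(duration * fs)))  # sample indices 0, 1, 2, ...
x = x_continuous(n * Ts)                  # x(n * Ts): one number per sample
```

The result `x` is just a sequence of numbers; the time axis is implied by the index `n` and the sampling period.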

  • Sampling Theorem

    ● What is an appropriate sampling rate?
      ○ Too high: increases the data size in the digital domain
      ○ Too low: the original signal cannot be reconstructed

    ● Sampling Theorem
      ○ The sampling rate must be greater than twice the maximum frequency in the signal in order to reconstruct the original signal: $f_s > 2 f_m$ ($f_s$: sampling rate, $f_m$: maximum frequency of the signal)
      ○ Half the sampling rate is called the Nyquist frequency ($f_s/2$)

  • Sampling in the Frequency Domain

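    [Figure: three spectra on a frequency axis. (1) The original signal, band-limited to $-f_m \ldots f_m$. (2) The sampled signal when $f_s > 2 f_m$: the spectral copies around $\pm f_s$ (from $f_s - f_m$ to $f_s + f_m$ and from $-f_s - f_m$ to $-f_s + f_m$) do not overlap the original band. (3) The sampled signal when $f_s < 2 f_m$: the copies overlap the original band, and the overlapping regions are marked "Alias".]

    The high-frequency content above the Nyquist frequency is folded over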
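The folding can be checked numerically. In the sketch below, the helper `aliased_frequency` and the 8 kHz / 7 kHz values are illustrative assumptions: a 7 kHz cosine sampled at 8 kHz produces exactly the same samples as a 1 kHz cosine.

```python
import numpy as np

def aliased_frequency(f, fs):
    """Frequency (Hz) at which a real sinusoid of frequency f appears after sampling at fs."""
    f_folded = f % fs                      # sampling cannot tell f from f + k * fs
    return min(f_folded, fs - f_folded)    # fold around the Nyquist frequency fs / 2

fs = 8000                                   # sampling rate (assumed)
n = np.arange(64)
x_high = np.cos(2 * np.pi * 7000 * n / fs)  # above the Nyquist frequency (4 kHz)
x_low = np.cos(2 * np.pi * 1000 * n / fs)   # its alias below the Nyquist frequency
samples_match = np.allclose(x_high, x_low)
```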

  • Sampling Rate

    ● Determined by the bandwidth of signals or hearing limits
      ○ Music (CD): 44.1 kHz (consumer) or 48/96/192 kHz (professional)
      ○ Speech communication: 8 kHz

    [Figure: music and speech examples]

  • Sampling Rate Conversion (Resampling)

    ● We often increase or decrease the sampling rate
      ○ 44.1 kHz CD-quality music is often down-sampled to 22.05 kHz or even lower rates to reduce the data size
      ○ Computed by signal interpolation
        ■ In down-sampling, preceded by a low-pass filter to avoid aliasing noise
        ■ Windowed sinc function
        ■ https://ccrma.stanford.edu/~jos/resample/

    [Figure: down-sampling and up-sampling]
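As a minimal sketch of down-sampling from 44.1 kHz to 22.05 kHz, the example below uses `scipy.signal.resample_poly`, which applies a low-pass FIR filter before decimation. This is one possible tool, not necessarily the method behind the slides; the 440 Hz test tone is an assumption for illustration.

```python
import numpy as np
from scipy.signal import resample_poly

fs_in = 44100
fs_out = 22050
n = np.arange(fs_in)                        # one second of audio
x = np.sin(2 * np.pi * 440.0 * n / fs_in)   # 440 Hz test tone (assumed input)

# Rate change by the ratio up/down = 1/2; the anti-aliasing low-pass
# filter is applied inside resample_poly before samples are dropped.
y = resample_poly(x, up=1, down=2)
```

The output has half as many samples and still represents the same one-second tone.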

  • Quantization

    ● Discretizing the amplitude of real-valued signals
      ○ Round the amplitude to the nearest discrete step
      ○ The number of discrete steps is determined by the number of bits (bit depth)
        ■ $N$ bits cover the range $-2^{N-1}$ to $2^{N-1}-1$: 8 bits ($-128$ to $127$), 16 bits ($-32768$ to $32767$)

    [Figure: quantized waveform with the quantization step marked; the amplitude axis runs from $-2^{N-1}$ to $2^{N-1}-1$]
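A minimal sketch of uniform quantization, assuming the input is a floating-point signal scaled to the range −1 to 1 (the function name `quantize` and that scaling are illustrative choices):

```python
import numpy as np

def quantize(x, n_bits):
    """Round samples in [-1, 1] to signed integers with n_bits of resolution."""
    max_level = 2 ** (n_bits - 1)
    q = np.round(np.asarray(x) * max_level)
    # Clip so that +1.0 maps to the largest representable value, 2^(N-1) - 1.
    return np.clip(q, -max_level, max_level - 1).astype(np.int64)

levels_8 = quantize([-1.0, 0.0, 1.0], 8)
levels_16 = quantize([-1.0, 0.0, 1.0], 16)
```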

  • Quantization

    ● Determined by the dynamic range of signals
      ○ Each additional bit increases the dynamic range by about 6 dB: $N$ bits → about $6N$ dB
      ○ Music (CD): 16 bits (consumer) → 96 dB
      ○ Speech communication: 8 bits → 48 dB

    [Figure: music and speech examples]
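The 6 dB-per-bit rule follows from $20\log_{10} 2 \approx 6.02$. A short sketch (the function name is an illustrative choice):

```python
import math

def dynamic_range_db(n_bits):
    """Ratio between the largest amplitude and one quantization step, in dB."""
    return 20 * math.log10(2 ** n_bits)

db_per_bit = dynamic_range_db(1)
db_cd = dynamic_range_db(16)
db_speech = dynamic_range_db(8)
```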

  • Loading Audio Files

    ● Check the sampling rate and bit depth
      ○ You can check them using audio software such as Audacity

    ● Do resampling (usually down-sampling) if necessary
      ○ Librosa provides resampling when loading audio files
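The sampling rate and bit depth of a WAV file can also be checked in code. The sketch below uses only Python's standard `wave` module and writes a temporary test file so that it is self-contained; the file name and the helper `describe_wav` are hypothetical. With Librosa, `librosa.load(path, sr=...)` takes the target rate as the `sr` argument and resamples on load.

```python
import os
import tempfile
import wave

import numpy as np

def describe_wav(path):
    """Return basic format information of a PCM WAV file."""
    with wave.open(path, "rb") as f:
        return {
            "sampling_rate": f.getframerate(),
            "bit_depth": 8 * f.getsampwidth(),   # bytes per sample -> bits
            "channels": f.getnchannels(),
            "duration_sec": f.getnframes() / f.getframerate(),
        }

# Write a one-second, 16-bit mono test tone to inspect (illustrative data).
fs = 44100
tone = np.sin(2 * np.pi * 440.0 * np.arange(fs) / fs)
pcm = (tone * 32767).astype("<i2")
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
with wave.open(path, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(fs)
    f.writeframes(pcm.tobytes())

info = describe_wav(path)
```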

  • Waveform

    ● The waveform is a natural representation of audio, but it is limited for analyzing the content
      ○ It mainly shows how the energy changes over time

  • Spectrogram

    ● 2D-image representation of audio using the short-time Fourier transform
      ○ x-axis: time, y-axis: frequency, color: magnitude response
      ○ It is common to use a dB scale (a log scale) for the magnitude
      ○ Easy to match what you hear to what you see

  • Computing Spectrogram

    ● For each short segment (frame)
      ○ Take a window (one frame)
      ○ Compute the DFT (FFT)
      ○ Convert the result to polar coordinates
        ■ Magnitude and phase
      ○ Compress the magnitude
        ■ $20\log_{10} X_{mag}$: decibel
      ○ Shift by a hop size

    ● Spectrogram parameters
      ○ Window size (FFT size)
      ○ Hop size
      ○ Window type

    [Figure: short-time Fourier transform (STFT). Consecutive frames $x(l-1)$ and $x(l)$, each one window size long and spaced by the hop size, are windowed, transformed by the DFT into $X_{mag}$, and passed through magnitude compression.]
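The steps above can be sketched with NumPy as follows. The function name, the Hann window, the default sizes, and the small constant added before the logarithm are illustrative assumptions rather than settings from the slides.

```python
import numpy as np

def spectrogram_db(x, win_size=1024, hop_size=256):
    """Magnitude spectrogram in dB, shape (frequency bins, frames)."""
    window = np.hanning(win_size)                       # window type
    n_frames = 1 + (len(x) - win_size) // hop_size
    frames = np.stack([
        x[i * hop_size : i * hop_size + win_size] * window   # take one frame and window it
        for i in range(n_frames)                              # shift by the hop size
    ])
    spectrum = np.fft.rfft(frames, axis=1)              # DFT (FFT) of each frame
    magnitude = np.abs(spectrum)                        # polar form: magnitude ...
    phase = np.angle(spectrum)                          # ... and phase (not used further here)
    mag_db = 20 * np.log10(magnitude + 1e-10)           # compress the magnitude to decibels
    return mag_db.T

fs = 8000
x = np.sin(2 * np.pi * 1000.0 * np.arange(fs) / fs)     # 1 kHz test tone (assumed input)
S = spectrogram_db(x)
```

For this test tone every frame peaks at bin 128, since 1000 Hz divided by the bin spacing 8000/1024 Hz is exactly 128.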

  • Discrete Fourier Transform (DFT)

    ● Find the frequency (sinusoidal) components of $x(n)$

    ● Represent $x(n)$ as a sum of sinusoids: $x(n) = \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right)$
      ○ $A(k)$: amplitude (or magnitude) of the sinusoid
      ○ $\phi(k)$: phase of the sinusoid
      ○ $N$: size of the DFT or the input segment
      ○ $k$: frequency bin index (0 to $N-1$); $\frac{k}{N} f_s$ is the frequency at each frequency bin

    ● DFT provides the way of finding $A(k)$ and $\phi(k)$

    [Image: Pink Floyd, "The Dark Side of the Moon"]

  • Discrete Fourier Transform (DFT)

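    ● Use the orthogonality of sinusoids
      ○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right)\cos\left(\frac{2\pi l n}{N}\right) = \begin{cases} N/2 & \text{if } k = l \text{ or } k = N - l \text{ (equivalent to } -l\text{)} \\ 0 & \text{otherwise} \end{cases}$
      ○ $\sum_{n=0}^{N-1} \cos\left(\frac{2\pi k n}{N}\right)\sin\left(\frac{2\pi l n}{N}\right) = 0$
      ○ $\sum_{n=0}^{N-1} \sin\left(\frac{2\pi k n}{N}\right)\sin\left(\frac{2\pi l n}{N}\right) = \begin{cases} N/2 & \text{if } k = l \\ -N/2 & \text{if } k = N - l \text{ (equivalent to } -l\text{)} \\ 0 & \text{otherwise} \end{cases}$

    ● The inner product (or correlation) between the two sinusoids:
      ○ If the frequencies are the same (including different signs), it is non-zero
      ○ Otherwise, it is zero (they are orthogonal to each other)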
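These orthogonality relations can be verified numerically. The sketch below uses $N = 64$ and the bins 5, 7, and 59 ($= N - 5$) as arbitrary illustrative choices.

```python
import numpy as np

N = 64
n = np.arange(N)

def cos_k(k):
    return np.cos(2 * np.pi * k * n / N)

def sin_k(k):
    return np.sin(2 * np.pi * k * n / N)

# Inner products between sinusoids at frequency bins k and l.
cc_same = cos_k(5) @ cos_k(5)        # k = l
cc_mirror = cos_k(5) @ cos_k(59)     # k = N - l
cc_other = cos_k(5) @ cos_k(7)       # different frequencies
cs_same = cos_k(5) @ sin_k(5)        # cosine against sine
ss_same = sin_k(5) @ sin_k(5)        # k = l
ss_mirror = sin_k(5) @ sin_k(59)     # k = N - l
ss_other = sin_k(5) @ sin_k(7)       # different frequencies
```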

  • Discrete Fourier Transform (DFT)

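    ● Inner product of the input with sinusoids (by orthogonality, only the terms with $k = l$ or $k = N - l$ survive)
      ○ $X_{re}(l) = \sum_{n=0}^{N-1} x(n) \cos\left(\frac{2\pi l n}{N}\right) = \sum_{n=0}^{N-1} \left( \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right) \right) \cos\left(\frac{2\pi l n}{N}\right) = A(l) \cos\phi(l)$
      ○ $X_{im}(l) = -\sum_{n=0}^{N-1} x(n) \sin\left(\frac{2\pi l n}{N}\right) = -\sum_{n=0}^{N-1} \left( \sum_{k=0}^{N-1} \frac{1}{N} A(k) \cos\left(\frac{2\pi k n}{N} + \phi(k)\right) \right) \sin\left(\frac{2\pi l n}{N}\right) = A(l) \sin\phi(l)$

    ● The magnitude and phase
      ○ $X_{mag}(k) = A(k) = \sqrt{X_{re}^2(k) + X_{im}^2(k)}$, $X_{phase}(k) = \phi(k) = \tan^{-1}\left(\frac{X_{im}(k)}{X_{re}(k)}\right)$

    ● The definition of DFT can be simplified using complex sinusoids
      ○ $X(k) = \sum_{n=0}^{N-1} x(n) e^{-j\frac{2\pi k n}{N}} = X_{re}(k) + j X_{im}(k) = A(k) e^{j\phi(k)}$
      ○ Euler's identity: $e^{j\frac{2\pi k n}{N}} = \cos\left(\frac{2\pi k n}{N}\right) + j \sin\left(\frac{2\pi k n}{N}\right)$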
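As a numerical check of these formulas, the sketch below computes $X_{re}$ and $X_{im}$ by inner products for a single cosine and compares them with `np.fft.fft`. The test values ($N = 64$, bin 5, amplitude 0.8, phase 0.6) are assumptions for illustration. Under the $\frac{1}{N}$ convention used above, a cosine of amplitude $a$ at bin $k_0$ corresponds to $A(k_0) = Na/2$, because bins $k_0$ and $N - k_0$ share it.

```python
import numpy as np

N = 64
n = np.arange(N)
k0, a, phi = 5, 0.8, 0.6                          # test sinusoid (assumed values)
x = a * np.cos(2 * np.pi * k0 * n / N + phi)

k = np.arange(N)[:, None]                         # one row per frequency bin
X_re = np.cos(2 * np.pi * k * n / N) @ x          # inner products with cosines
X_im = -np.sin(2 * np.pi * k * n / N) @ x         # inner products with sines (negated)

X_mag = np.sqrt(X_re ** 2 + X_im ** 2)
X_phase = np.arctan2(X_im, X_re)                  # four-quadrant form of tan^-1(X_im / X_re)

X_fft = np.fft.fft(x)                             # complex-sinusoid definition
```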

  • Discrete Fourier Transform (DFT)

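    ● Can be viewed as matrix multiplication

    ● In practice, we use an FFT algorithm instead of direct multiplication
      ○ Divide the matrix into small matrices recursively
      ○ Complexity reduction: $O(N^2) \rightarrow O(N \log_2 N)$

    [Figure: DFT as matrix multiplication. The input $x(n)$ is multiplied by the matrices $W_{cos}$ and $W_{sin}$ to give $X_{re}$ and $X_{im}$, which are converted to polar form as $X_{mag}$ and $X_{phase}$. The rows are the conjugate complex sinusoids $s_k^*(n) = e^{-j 2\pi k n / N}$.]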
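A small sketch of the matrix view, assuming the standard DFT matrix with entries $e^{-j 2\pi k n / N}$: the direct $O(N^2)$ product matches the FFT result. The random input is only for illustration.

```python
import numpy as np

N = 16
rng = np.random.default_rng(0)
x = rng.standard_normal(N)                       # arbitrary real input

k = np.arange(N)[:, None]                        # row index: frequency bin
n = np.arange(N)[None, :]                        # column index: time sample
W = np.exp(-2j * np.pi * k * n / N)              # N x N DFT matrix

X_direct = W @ x                                 # direct multiplication, O(N^2)
X_fft = np.fft.fft(x)                            # FFT algorithm, O(N log N)
```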

  • Discrete Fourier Transform (DFT)

    ● When DFT is applied to musical sounds
      ○ A musical tone with pitch has a periodic waveform
      ○ DFT shows the harmonic spectrum (harmonic overtones)
      ○ Pitch information can also be extracted
      ○ The magnitude is generally more sparse than the waveform

    [Figure: waveform $x(n)$ and its magnitude spectrum $X_{mag}(k)$, with peaks at $F_0$, $2F_0$, $3F_0$]
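A synthetic example of the harmonic spectrum, assuming a tone with $F_0 = 220$ Hz and three harmonics (values chosen so that each harmonic falls exactly on a DFT bin):

```python
import numpy as np

fs = 8000
n = np.arange(fs)                                 # one second, so bins are 1 Hz apart
f0 = 220.0                                        # fundamental frequency (assumed)
x = sum((1.0 / h) * np.sin(2 * np.pi * f0 * h * n / fs) for h in (1, 2, 3))

magnitude = np.abs(np.fft.rfft(x))
peak_bins = np.sort(np.argsort(magnitude)[-3:])   # the three strongest bins
peak_freqs = peak_bins * fs / len(x)              # bin index -> frequency in Hz
```

Only three of the 4001 bins carry significant energy, which illustrates why the magnitude spectrum is sparser than the waveform.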

  • Effect of Window Type

    ● Types of window functions
      ○ Trade-off between the width of the main lobe and the level of the side lobes
      ○ The Hann window is the most widely used in music analysis

    [Figure: rectangular, triangular, Hann, and Blackman windows. For each window: amplitude over time and magnitude response in dB. Caption: spectra of windowed single sinusoids.]
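The side-lobe side of this trade-off can be measured numerically. The sketch below estimates the peak side-lobe level of each window from a zero-padded FFT; the window length of 64 and the helper `peak_sidelobe_db` are illustrative assumptions.

```python
import numpy as np

def peak_sidelobe_db(window, n_fft=8192):
    """Highest side-lobe level relative to the main-lobe peak, in dB."""
    spectrum = np.abs(np.fft.rfft(window, n_fft))
    spectrum_db = 20 * np.log10(spectrum / spectrum.max() + 1e-12)
    # Walk down the main lobe until the magnitude starts rising again.
    i = 1
    while i < len(spectrum_db) - 1 and spectrum_db[i] < spectrum_db[i - 1]:
        i += 1
    return spectrum_db[i:].max()

M = 64
sidelobes = {
    "rectangular": peak_sidelobe_db(np.ones(M)),
    "triangular": peak_sidelobe_db(np.bartlett(M)),
    "hann": peak_sidelobe_db(np.hanning(M)),
    "blackman": peak_sidelobe_db(np.blackman(M)),
}
```

Windows with lower side lobes pay for them with a wider main lobe, so the ordering of side-lobe levels is the reverse of the ordering of frequency sharpness.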

  • Effect of Window Size

    ● Trade-off between time and frequency resolutions
      ○ Short window: low frequency resolution and high time resolution
      ○ Long window: high frequency resolution and low time resolution

    [Figure: two spectrograms, one with Hop=128, N=256 and one with Hop=128, N=4096]
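The trade-off can be put in numbers: the bin spacing is $f_s/N$ and one window spans $N/f_s$ seconds. The slide does not state the sampling rate; 22050 Hz is assumed here for illustration.

```python
def stft_resolution(fs, n):
    """Return (frequency bin spacing in Hz, window duration in seconds)."""
    return fs / n, n / fs

fs = 22050                                   # assumed sampling rate
short_df, short_dt = stft_resolution(fs, 256)
long_df, long_dt = stft_resolution(fs, 4096)
```

With these values the short window gives bins about 86 Hz apart over about 12 ms, while the long window gives bins about 5.4 Hz apart over about 186 ms.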

  • Human Ears

    ● Is the human ear a spectrum analyzer?
      ○ Our ear has a complicated pathway from the ear drum to the auditory nerve
      ○ The cochlea in the inner ear is a bandpass filter bank
      ○ The membrane resonates at a different position depending on the frequency of the input. The resonance frequency increases in a log scale along the membrane

    [Figure: (unrolled) cochlear membrane]

  • Human Pitch Perception

    ● Pitch Resolution
      ○ Just noticeable difference (JND) increases as the frequency goes up

    ● Mel scale
      ○ Approximates the human pitch resolution based on the pitch ratio of tones
      ○ Most widely used for speech and music analysis
      ○ A log frequency scale: $m = 2595 \log_{10}(1 + f/700)$
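The mel formula and its algebraic inverse, $f = 700\,(10^{m/2595} - 1)$, can be written directly; the function names are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    """Mel value for a frequency in Hz: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

mel_1k = hz_to_mel(1000.0)
roundtrip = mel_to_hz(hz_to_mel([100.0, 440.0, 8000.0]))
```

The constants are chosen so that 1000 Hz maps to approximately 1000 mel.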

  • Computing Mel-Spectrogram

    ● Mapping linear frequency to mel scale
      ○ A mel-scaled filter bank is used: linear interpolation on the center frequency with the corresponding bandwidth skirt
      ○ The high-frequency range is zoomed out and the low-frequency range is relatively zoomed in; the number of frequency bins is usually smaller

    [Figure: spectrogram (1024 freq. bins), mel-scaled filter bank (center frequency and bandwidth of each filter), and mel-spectrogram (128 mel bins)]
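A minimal sketch of a mel-scaled filter bank with triangular filters, assuming filter edges equally spaced on the mel scale between 0 Hz and the Nyquist frequency and no area normalization. Library implementations may differ in these details; the names and parameter values are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_mels):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)      # frequency of each FFT bin
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)                           # equally spaced in mel
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lower, center, upper = hz_edges[m : m + 3]
        rising = (fft_freqs - lower) / (center - lower)       # skirt up to the center
        falling = (upper - fft_freqs) / (upper - center)      # skirt down from the center
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

bank = mel_filter_bank(sr=22050, n_fft=1024, n_mels=40)
# A mel-spectrogram is then bank @ magnitude_spectrogram (bins x frames).
```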

  • Musical scale

    ● Musical tuning system
      ○ Equal temperament: frequency ratio of $1 : 2^{1/12}$ per semitone
      ○ Music note ($m$) and frequency ($f$) in Hz: $f = 440 \cdot 2^{(m-69)/12}$, $m = 12 \log_2\left(\frac{f}{440}\right) + 69$

    https://newt.phys.unsw.edu.au/jw/notes.html
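The two conversion formulas in code, with $m$ interpreted as a MIDI-style note number (69 = A4 = 440 Hz); the function names are illustrative.

```python
import math

def note_to_hz(m):
    """Frequency in Hz of note number m under equal temperament."""
    return 440.0 * 2.0 ** ((m - 69) / 12.0)

def hz_to_note(f):
    """Note number (possibly fractional) of a frequency in Hz."""
    return 12.0 * math.log2(f / 440.0) + 69.0

a4 = note_to_hz(69)
middle_c = note_to_hz(60)
a5_note = hz_to_note(880.0)
semitone_ratio = note_to_hz(70) / note_to_hz(69)
```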

  • Review

    ● Now we know that we need a log scale for music

    ● The log-scale filter bank will look like this

    [Figure: log-scaled filter bank, showing the center frequency and bandwidth of each filter]

    ● Question:
      ○ Can we obtain the log-frequency scale spectrogram directly from waveforms using a time-frequency representation?

  • Constant-Q Transform

    ● Time-frequency representation which uses a set of sinusoidal kernels with log-spaced frequencies
      ○ As the frequency increases, the length of the sinusoidal kernels becomes shorter (the bandwidth becomes wider) to keep a constant Q (= frequency/bandwidth)

    Figure 1. The upper panel illustrates the real part of the transform bases (temporal kernel) that can be used to calculate the CQT over two octaves, with 12 bins per octave. The lower panel shows the absolute values of the corresponding spectral kernel.

    3.1 Algorithm of Brown and Puckette

    Let us assume that we want to calculate the CQT transform coefficients $X^{CQ}(k, n)$ as defined by (1) at one point $n$ of an input signal $x(n)$. A direct implementation of (1) obviously requires calculating inner products of the input signal with each of the transform bases. The upper panel of Fig. 1 illustrates the real part of the transform bases $a_k(n)$, assuming here for simplicity only $B = 12$ bins per octave and a frequency range of two octaves.

    A computationally more efficient implementation is obtained by utilizing the identity

    $\sum_{n=0}^{N-1} x(n)\, a^*(n) = \sum_{j=0}^{N-1} X(j)\, A^*(j)$ (7)

    where $X(j)$ denotes the discrete Fourier transform (DFT) of $x(n)$ and $A(j)$ denotes the DFT of $a(n)$. Equation (7) holds for any discrete signals $x(n)$ and $a(n)$ and stems from Parseval's theorem [3].

    Using (7), the CQT transform in (1) can be written as

    $X^{CQ}(k, N/2) = \sum_{j=0}^{N} X(j)\, A_k^*(j)$ (8)

    where $A_k(j)$ is the complex-valued $N$-point DFT of the transform basis $a_k(n)$ so that the bases $a_k(n)$ are centered at the point $N/2$ within the transform frame. Following the terminology of [9], we will refer to $A_k(j)$ as the spectral kernels and to $a_k(n)$ as the temporal kernels. The lower panel of Fig. 1 illustrates the absolute values of the spectral kernels $A_k(j)$ corresponding to temporal kernels $a_k(n)$ in the upper panel.

    As observed by Brown and Puckette, the spectral kernels $A_k(j)$ are sparse: most of the values being near zero because they are Fourier transforms of modulated sinusoids. Therefore the summation in (8) can be limited to values near the peak in the spectral kernel to achieve sufficient numerical accuracy, omitting near-zero values in $A_k(j)$. This is the main idea of the efficient CQT transform proposed in [9]. It is also easy to see that the summing has to be carried out for positive frequencies only, followed by multiplication by two.

    For convenience, we store the spectral kernels $A_k(j)$ as columns in matrix $A$. The transform in (8) can then be written in matrix form as

    $X^{CQ} = A^* X$ (9)

    where $A^*$ denotes the conjugate transpose of $A$. Matrices $X$ and $X^{CQ}$ have only one column each, containing the DFT values $X(j)$ and the corresponding CQT coefficients, respectively.

    3.2 Processing One Octave at a Time

    There are two remaining problems with the method outlined in the previous subsection. Firstly, when a wide range of frequencies is considered (for example, eight octaves from 60 Hz to 16 kHz), quite long DFT transform blocks are required and the spectral kernel is no longer very sparse, since the frequency responses of higher frequency bins are wider as can be seen from Fig. 1. Secondly, in order to analyze all parts of the input signal adequately, the CQT transform for the highest frequency bins has to be calculated at least every $N_K/2$ samples apart, where $N_K$ is the window length for the highest CQT bin. Both of these factors reduce the computational efficiency of the method.

    We propose two extensions to address the above problems. The first is processing by octaves [footnote 2]. We use a spectral kernel matrix $A$ which produces the CQT for the highest octave only. After computing the highest-octave CQT bins over the entire signal, the input signal is lowpass filtered and downsampled by factor two, and then the same process is repeated to calculate the CQT bins for the next octave, using exactly the same DFT block size and spectral kernel (see (8)). This is repeated iteratively until the desired number of octaves has been covered. Figure 2 illustrates this process.

    Since the spectral kernel $A$ now represents frequency bins that are at maximum one octave apart, the length of the DFT block can be made quite short (according to $N_k$ of the lowest CQT bin) and the matrix $A$ is very sparse even for the highest-frequency bins.

    Another computational efficiency improvement is obtained by using several temporally translated versions of the transform bases $a_k(n)$ within the same spectral kernel

    [Footnote 2] We want to credit J. Brown for mentioning this possibility already in [8], although octave-by-octave processing was not implemented in [8, 9].


    (Schorkhuber and Klapuri, 2010)
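To make the constant-Q idea concrete, here is a direct (inefficient) sketch that computes one CQT frame by taking inner products with windowed complex sinusoids. It is not the efficient spectral-kernel or octave-by-octave algorithm described above. The assumptions are: center frequencies $f_k = f_{min} \cdot 2^{k/B}$, a bandwidth equal to the spacing to the next bin so that $Q = 1/(2^{1/B} - 1)$, kernel length $N_k = \lceil Q f_s / f_k \rceil$, and a Hann window. All names and parameter values are illustrative.

```python
import numpy as np

def naive_cqt_frame(x, fs, f_min=55.0, bins_per_octave=12, n_bins=48):
    """CQT coefficients for one frame centered in x (direct inner products)."""
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)     # constant Q = f_k / bandwidth
    freqs = f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    center = len(x) // 2
    coeffs = np.zeros(n_bins, dtype=complex)
    for k, f_k in enumerate(freqs):
        n_k = int(np.ceil(q * fs / f_k))                 # kernel gets shorter as f_k rises
        start = center - n_k // 2
        if start < 0 or start + n_k > len(x):
            raise ValueError("signal too short for the lowest-frequency kernel")
        n = np.arange(n_k)
        kernel = np.hanning(n_k) * np.exp(-2j * np.pi * f_k * n / fs) / n_k
        coeffs[k] = np.sum(x[start : start + n_k] * kernel)
    return freqs, coeffs

fs = 8000
x = np.sin(2 * np.pi * 440.0 * np.arange(4096) / fs)     # A4 test tone (assumed input)
freqs, coeffs = naive_cqt_frame(x, fs)
strongest_bin = int(np.argmax(np.abs(coeffs)))
```

With `f_min` = 55 Hz and 12 bins per octave, 440 Hz lies three octaves up, so the strongest response is at bin 36.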

  • Constant-Q IIR Filter Bank

    ● Musically designed constant-Q transform
      ○ 88 IIR bandpass filters
      ○ The center frequency corresponds to the pitch of each piano note
      ○ The bandwidth is set to have constant Q, with +/- 25 cents around the center (100 cents = 1 semitone)

    (Müller, 2011)
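The center frequencies and band edges of such a filter bank can be computed as below. This sketch assumes the 88 piano notes correspond to MIDI note numbers 21 to 108 with A4 = 440 Hz, and it only derives the band edges; it does not design the IIR filters themselves.

```python
import numpy as np

midi_notes = np.arange(21, 109)                       # 88 piano keys, A0 to C8 (assumed numbering)
center_hz = 440.0 * 2.0 ** ((midi_notes - 69) / 12.0)

cents = 25.0                                          # half-width of each band in cents
lower_hz = center_hz * 2.0 ** (-cents / 1200.0)       # 1200 cents = 1 octave
upper_hz = center_hz * 2.0 ** (cents / 1200.0)

q_factor = center_hz / (upper_hz - lower_hz)          # frequency / bandwidth
```

Because the band edges are fixed ratios of the center frequency, the resulting Q is the same for every filter.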

  • Example: Constant-Q Transform

    ● Chromatic music scale
      ○ The notes and their harmonics rise along straight lines in the constant-Q transform

    [Figure: spectrogram and constant-Q transform of a chromatic scale]