[ieee 2010 ieee international conference on acoustics, speech and signal processing - dallas, tx,...

MDCT SPECTRUM SEPARATION: CATCHING THE FINE SPECTRAL STRUCTURES FOR STEREO CODING

Shuhua Zhang, Weibei Dou, Ping Chi, and Huazhong Yang

Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

ABSTRACT

The spectrum of a sinusoid under the Modified Discrete Cosine Transform (MDCT), when separated by bin parity into an even subspectrum and an odd subspectrum, shows a distinctive property: the subspectral shapes are independent of the sinusoid phase, which contributes only a scaling. Based on this finding, we propose an Even-Odd (EO) scheme for stereo coding: the even and odd subspectra are partitioned separately into subbands to capture the fine spectral structures of sinusoidal and rich-tone signals. The scheme reduces the coding noise by 0–20 dB for music signals. When integrated into an MDCT-domain KLT-based stereo coder, the scheme boosts subjective listening test (MUSHRA) scores. This coder, called KLT-EO, matches Parametric Stereo (PS) in quality at a slightly higher bitrate, but without the 20 ms algorithmic delay that the PS stereo processing incurs.

Index Terms: Modified discrete cosine transform, stereo coding, sinusoidal analysis, spectrum separation

1. INTRODUCTION

The Modified Discrete Cosine Transform (MDCT [1, 2]) is widely used in audio coding, for example in MP3, AAC, and Ogg Vorbis. Despite the 50% overlap between adjacent transform blocks, the MDCT is still critically sampled, so it smooths out blocking artifacts without a bitrate penalty.

Apart from this elegance in signal representation, the MDCT poses great difficulty for spectral processing because of frequency-component aliasing; e.g., gain control [3] that works well in the DFT domain produces significant spectral distortion. For this reason, a complex-modulated Quadrature Mirror Filterbank (QMF) replaces the MDCT in Spectral Band Replication (SBR [4]) and Parametric Stereo (PS [5]), and the Modified Discrete Sine Transform (MDST) is used alongside the MDCT in the modified distortion metric for AAC [6] and in MDCT-domain spatial audio coding [7].

Neither strategy is free. Each brings higher complexity, which may be mitigated by high-performance hardware, or additional algorithmic delay, which persists regardless of hardware advances. Pure MDCT-domain processing is therefore attractive, provided that the coding quality is satisfactory.

The root of this difficulty is that the MDCT basis is not shift-invariant [8]. We intend to tackle this problem for pure MDCT-domain stereo coding. A critical observation is that although the complete spectrum of a sinusoid changes shape as the phase varies (Fig. 1 left), the even and odd subspectra have fixed shapes and vary only in scaling (Fig. 1 middle and right). In other words, the subspectral shapes are shift-invariant.

This work was supported by the National Science Foundation of China (NSFC 60832002). The authors thank Dr. Chen for her arrangement of the subjective test.

[Figure 1: three panels plotted against k − ⌊κ⌋, one trace per phase value. Left: bins −6 to 5; middle: even bins −6 to 4; right: odd bins −5 to 5.]

Fig. 1. MDCT spectra of tones with the same frequency but different phases. k is the frequency index and κ the tone frequency. Left, the complete spectra around κ; middle and right, the even and odd subspectra corresponding to the left. The subspectra vary only in scaling, but after being merged, they lose this regularity.

A stereo sinusoid may have different phases in the left and right channels, so the spectral shapes of the two channels may not be identical. A single interchannel intensity ratio is therefore not sufficient to register the fine spectral structures. This is the pitfall of Intensity Stereo (IS [9]) coding, which is useless below 1.6 kHz because our ears are sensitive to the fine spectral structures in this range. Instead, we may use one intensity ratio for each subspectrum to capture the fine structures. In this way, IS could possibly be extended to the full frequency range while retaining satisfactory quality.

2. MDCT SPECTRA OF SINUSOIDS

We study the MDCT spectral structures of sinusoidal signals with general symmetric window functions. Two techniques are used:

• expanding the window function in the basis of the type-IV Discrete Sine Transform (DST-IV), so that a windowed sinusoid equals a sum of component sinusoids, which removes the sine-window restriction in [8];

• for the spectra of all these component sinusoids, separating out a common phase-dependent factor and leaving the remaining part independent of phase, i.e., shift-invariant.

In analogy to the DFT, the shift-invariant factor corresponds to amplitude and the phase-dependent factor corresponds to phase.

978-1-4244-4296-6/10/$25.00 ©2010 IEEE. ICASSP 2010.

In the following, Latin characters are reserved for integer variables and Greek characters for real variables.

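The MDCT maps $x(n) \in \mathbb{R}^{2M}$ to $X(k) \in \mathbb{R}^{M}$ by

$$X(k) = \sum_{n=0}^{2M-1} w(n)\,x(n)\cos\left[\frac{\pi}{M}\left(n + \frac{M}{2} + \frac{1}{2}\right)\left(k + \frac{1}{2}\right)\right], \qquad (1)$$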

where $w(n) \in \mathbb{R}^{2M}$ is the window function (prototype filter) satisfying the Princen-Bradley condition [1]:

$$\begin{cases} w(n) = w(2M-1-n) & \text{(symmetry)} \\ w(n)^2 + w(n+M)^2 = 1 & \text{(perfect reconstruction)} \end{cases} \qquad (2)$$

Common choices of w(n) include the sine window and the Kaiser-Bessel Derived (KBD) window.
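As a concrete reading of (1) and (2), the following minimal sketch evaluates the MDCT directly and checks both window conditions for the sine window. The function names and the small block size are illustrative assumptions, and the O(M²) loop is chosen for clarity rather than speed.

```python
import math

def sine_window(M):
    """Sine window of length 2M: w(n) = sin(pi * (n + 1/2) / (2M))."""
    return [math.sin(math.pi * (n + 0.5) / (2 * M)) for n in range(2 * M)]

def mdct(x, w):
    """Direct O(M^2) evaluation of (1) for one block of 2M samples."""
    M = len(x) // 2
    return [
        sum(w[n] * x[n] * math.cos(math.pi / M * (n + M / 2 + 0.5) * (k + 0.5))
            for n in range(2 * M))
        for k in range(M)
    ]

def satisfies_princen_bradley(w, tol=1e-12):
    """Check the symmetry and perfect reconstruction conditions in (2)."""
    M = len(w) // 2
    symmetric = all(abs(w[n] - w[2 * M - 1 - n]) < tol for n in range(2 * M))
    complementary = all(abs(w[n] ** 2 + w[n + M] ** 2 - 1.0) < tol for n in range(M))
    return symmetric and complementary

M = 8  # illustrative block size, far smaller than the M = 1024 used in AAC
w = sine_window(M)
x = [math.sin(math.pi / M * 2.3 * n + 0.4) for n in range(2 * M)]
X = mdct(x, w)
print(satisfies_princen_bradley(w), len(X))
```

A rectangular window of ones is symmetric but fails the second condition, which the same check reports.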

The basis vectors of the DST-IV are $s_l(n) = \sin\left[\frac{\pi}{M}\left(n + \frac{1}{2}\right)\left(l + \frac{1}{2}\right)\right]$ for $l = 0, 1, \dots, M-1$, which are complete and orthogonal in $\mathbb{R}^{M}$. Note that $s_0(n)$ coincides with the sine window. By extending $n$ to $0, 1, \dots, 2M-1$, $s_l(n)$ becomes a symmetric vector of length $2M$, and all these extended vectors form a complete orthogonal basis for the symmetric subspace of $\mathbb{R}^{2M}$. Therefore, $w(n)$ can be uniquely expanded as

$$w(n) = \alpha_0 s_0(n) + \alpha_1 s_1(n) + \cdots + \alpha_{M-1} s_{M-1}(n), \qquad (3)$$

where $\alpha_l = \frac{1}{\sqrt{M}}\langle w(n), s_l(n)\rangle$ and $\sum_{l=0}^{M-1}\alpha_l^2 = 1$ due to the perfect reconstruction condition in (2). For the sine window, $\alpha_0 = 1$ and $\alpha_l = 0$ if $l > 0$; for the KBD and other windows, only the first several $\alpha_l$ are significant, e.g., $\sum_{l>4}\alpha_l^2 = 8.78\times10^{-7}$ for the long KBD window ($M = 1024$) used in AAC.
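The expansion (3) can be checked numerically. The sketch below assumes the plain dot product over all 2M samples, for which each extended basis vector has squared norm M, so the coefficient is computed as ⟨w, s_l⟩/M; this normalization is an assumption of the sketch. The Vorbis power-complementary window is used only as a second illustrative window that satisfies (2).

```python
import math

def dst4_basis(l, M):
    """Extended DST-IV basis vector s_l(n) for n = 0 .. 2M-1."""
    return [math.sin(math.pi / M * (n + 0.5) * (l + 0.5)) for n in range(2 * M)]

def vorbis_window(M):
    """Power-complementary window of Ogg Vorbis (illustrative choice)."""
    return [math.sin(0.5 * math.pi * math.sin(math.pi * (n + 0.5) / (2 * M)) ** 2)
            for n in range(2 * M)]

def expansion_coefficients(w):
    """Project w onto each s_l; sum_n s_l(n)^2 = M over the 2M samples."""
    M = len(w) // 2
    return [sum(a * b for a, b in zip(w, dst4_basis(l, M))) / M for l in range(M)]

M = 16
alpha_sine = expansion_coefficients(dst4_basis(0, M))   # the sine window is s_0
alpha_vorbis = expansion_coefficients(vorbis_window(M))

# Rebuild the Vorbis window from its coefficients to confirm the expansion.
rebuilt = [sum(alpha_vorbis[l] * dst4_basis(l, M)[n] for l in range(M))
           for n in range(2 * M)]
print([round(a, 6) for a in alpha_vorbis[:4]])
```

With this normalization the sine window gives α0 = 1 and zero elsewhere, and the squared coefficients of any window satisfying (2) sum to one.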

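Let $x(n) = \sin\left[\frac{\pi}{M}\kappa n + \varphi\right]$ be a sinusoid, assuming without loss of generality $\kappa \in [0, M)$ and $\varphi \in [0, 2\pi)$. By (3), $x(n)$ windowed by $w(n)$ can be expanded to

$$\begin{aligned} w(n)x(n) &= \alpha_0 s_0(n)x(n) + \cdots + \alpha_{M-1}s_{M-1}(n)x(n) \\ &= \frac{1}{2}\sum_{l=0}^{M-1}\alpha_l\left\{\sin\left[\frac{\pi}{M}\left(\kappa - l - \frac{1}{2}\right)n + \varphi - \phi_l\right] + \sin\left[\frac{\pi}{M}\left(\kappa + l + \frac{1}{2}\right)n + \varphi + \phi_l\right]\right\}, \end{aligned} \qquad (4)$$

where $\phi_l \equiv \frac{\pi}{2M}\left(l + \frac{1}{2}\right) - \frac{\pi}{2}$. For each of the component sinusoids in (4), its MDCT coefficients can be derived by summing a $2M$-term sine series. The sine summation formula says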

$$\sum_{n=0}^{2M-1}\sin\left[\frac{\pi}{M}\xi n + \psi\right] = D_{2M}(\xi)\sin\left[\left(1 - \frac{1}{2M}\right)\pi\xi + \psi\right], \qquad (5)$$

where $D_{2M}(\xi) \equiv \sin(\pi\xi)/\sin(\pi\xi/(2M))$, known as the Dirichlet function or periodic sinc function. The right side of (5) is arranged such that the first factor depends only on $\xi$ and the second factor depends also on $\psi$. They correspond to the shift-invariant factor and the phase-dependent factor mentioned above, respectively, and lead to a representation of the overall spectrum in a similar form. Denote $V_l(\xi) \equiv D_{2M}(\xi + l) + D_{2M}(\xi - l - 1)$; because $\sin[\pi(\xi - l - 1)] = -\sin[\pi(\xi + l)]$, the two terms largely cancel away from the main lobe. By (1), (4), and (5), the overall MDCT spectrum of the windowed $x(n)$ is
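The summation formula (5) is easy to confirm numerically. The sketch below compares the direct 2M-term sum with the closed form for non-integer ξ; the function names and test values are illustrative assumptions.

```python
import math

def dirichlet(xi, M):
    """D_2M(xi) = sin(pi * xi) / sin(pi * xi / (2M)); xi must not be a multiple of 2M."""
    return math.sin(math.pi * xi) / math.sin(math.pi * xi / (2 * M))

def sine_sum_direct(xi, psi, M):
    """Left side of (5): the 2M-term sine series."""
    return sum(math.sin(math.pi / M * xi * n + psi) for n in range(2 * M))

def sine_sum_closed(xi, psi, M):
    """Right side of (5): shift-invariant factor times phase-dependent factor."""
    return dirichlet(xi, M) * math.sin((1 - 1 / (2 * M)) * math.pi * xi + psi)

M = 32
for xi, psi in [(5.37, 0.9), (-2.6, 2.0), (0.25, 0.0)]:
    print(xi, psi, sine_sum_direct(xi, psi, M) - sine_sum_closed(xi, psi, M))
```

The printed differences should be at rounding level, since (5) is an exact identity.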

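$$X(k) = A(\kappa - k)\sin\left[\theta - \frac{3\pi}{2}k\right] - A(\kappa + k + 1)\cos\left[\theta + \frac{3\pi}{2}k\right], \qquad (6)$$

where

$$\begin{cases} A(\xi) \equiv \dfrac{1}{4}\displaystyle\sum_{l=0}^{M-1}(-1)^l\alpha_l V_l(\xi) \\[2mm] \theta \equiv \left(1 - \dfrac{1}{2M}\right)\pi\kappa + \varphi - \dfrac{3\pi}{4} \end{cases} \qquad (7)$$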

Using the first-order Taylor expansion of the sine function, we see that $A(\xi)$ decays to 0 on the order of $1/\xi^2$. Since $\kappa, k \in [0, M)$, except at the boundaries of the spectrum we have $|\kappa - k| < |\kappa + k + 1|$ and $A(\kappa - k) \gg A(\kappa + k + 1)$. Therefore (6) can be approximated by

$$X(k) \approx A(\kappa - k)\sin\left[\theta - \frac{3\pi}{2}k\right]. \qquad (8)$$

Here $A(\kappa - k)$ is the shift-invariant factor, controlling the envelope of $X(k)$, and $\sin\left[\theta - \frac{3\pi}{2}k\right]$ is the phase-dependent factor, controlling the fluctuation of $X(k)$. Similar results hold for DCT types I–IV (the proof is similar and omitted for brevity).
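For the sine window only α0 is nonzero, so the envelope reduces to A(ξ) = ¼[D_2M(ξ) + D_2M(ξ − 1)], the form under which A decays as 1/ξ². The sketch below, under that assumption, compares the closed form (6) and the approximation (8) with a direct MDCT; the names and parameter values are illustrative.

```python
import math

def mdct_sine_window(x):
    """Direct evaluation of (1) with the sine window."""
    M = len(x) // 2
    return [
        sum(math.sin(math.pi * (n + 0.5) / (2 * M)) * x[n]
            * math.cos(math.pi / M * (n + M / 2 + 0.5) * (k + 0.5))
            for n in range(2 * M))
        for k in range(M)
    ]

def dirichlet(xi, M):
    return math.sin(math.pi * xi) / math.sin(math.pi * xi / (2 * M))

def envelope(xi, M):
    """A(xi) for the sine window (alpha_0 = 1, all other alpha_l = 0)."""
    return 0.25 * (dirichlet(xi, M) + dirichlet(xi - 1, M))

def closed_form(kappa, phi, M, keep_mirror_term=True):
    """Equation (6); dropping the mirror term gives the approximation (8)."""
    theta = (1 - 1 / (2 * M)) * math.pi * kappa + phi - 0.75 * math.pi
    spectrum = []
    for k in range(M):
        value = envelope(kappa - k, M) * math.sin(theta - 1.5 * math.pi * k)
        if keep_mirror_term:
            value -= envelope(kappa + k + 1, M) * math.cos(theta + 1.5 * math.pi * k)
        spectrum.append(value)
    return spectrum

M, kappa, phi = 16, 5.3, 0.7  # non-integer kappa keeps D_2M away from 0/0
x = [math.sin(math.pi / M * kappa * n + phi) for n in range(2 * M)]
direct = mdct_sine_window(x)
exact = closed_form(kappa, phi, M)
approx = closed_form(kappa, phi, M, keep_mirror_term=False)
print(max(abs(a - b) for a, b in zip(direct, exact)))
print(abs(direct[5] - approx[5]))
```

The first printed value should be at rounding level, and the second shows how little the mirror term contributes at the bin next to κ.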

3. EVEN-ODD MDCT SPECTRUM SEPARATION

3.1. Linearity of Even and Odd Subspectra

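Let $x_0(n) = \sin\left[\frac{\pi}{M}\kappa n + \varphi_0\right]$ and $x_1(n) = \sin\left[\frac{\pi}{M}\kappa n + \varphi_1\right]$ be two sinusoids with the same frequency but different phases. By (8), their MDCT spectra are

$$\begin{cases} X_0(k) \approx A(\kappa - k)\sin\left[\theta_0 - \frac{3\pi}{2}k\right] \\ X_1(k) \approx A(\kappa - k)\sin\left[\theta_1 - \frac{3\pi}{2}k\right] \end{cases} \qquad (9)$$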

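where $\theta_0$ and $\theta_1$ are defined as $\theta$ in (7). Due to the term $\frac{3\pi}{2}k$, the phase-dependent factors have a period of 4 along $k$:

$$\begin{array}{c|cccccc} k & \cdots & 4\lfloor\kappa/4\rfloor - 2 & 4\lfloor\kappa/4\rfloor - 1 & 4\lfloor\kappa/4\rfloor & 4\lfloor\kappa/4\rfloor + 1 & \cdots \\ \hline X_0 & \cdots & -\sin\theta_0 & -\cos\theta_0 & \sin\theta_0 & \cos\theta_0 & \cdots \\ X_1 & \cdots & -\sin\theta_1 & -\cos\theta_1 & \sin\theta_1 & \cos\theta_1 & \cdots \end{array}$$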

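and $X_0(k)/X_1(k)$ shows a simple but non-trivial pattern: it is almost constant on alternating bins $k$,

$$\frac{X_0(k)}{X_1(k)} \approx \begin{cases} \dfrac{\sin\theta_0}{\sin\theta_1}, & \text{for } k = 2q \\[2mm] \dfrac{\cos\theta_0}{\cos\theta_1}, & \text{for } k = 2q + 1 \end{cases} \qquad (10)$$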

where q ∈ Z. Thus the even subspectra, or the odd subspectra, of sinusoids with the same frequency but different phases are linearly related, in other words fully correlated. This explains the property shown in Fig. 1.
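The parity pattern in (10) can be observed directly. The sketch below assumes the sine window and illustrative values of M, κ and the two phases. It prints the ratio X0(k)/X1(k) on the bins around κ next to the two predicted constants.

```python
import math

def mdct_sine_window(x):
    """Direct evaluation of (1) with the sine window."""
    M = len(x) // 2
    return [
        sum(math.sin(math.pi * (n + 0.5) / (2 * M)) * x[n]
            * math.cos(math.pi / M * (n + M / 2 + 0.5) * (k + 0.5))
            for n in range(2 * M))
        for k in range(M)
    ]

def tone(kappa, phi, M):
    return [math.sin(math.pi / M * kappa * n + phi) for n in range(2 * M)]

def theta(kappa, phi, M):
    """Phase term defined in (7)."""
    return (1 - 1 / (2 * M)) * math.pi * kappa + phi - 0.75 * math.pi

M, kappa = 64, 20.3                      # illustrative values
phi0, phi1 = 0.2 * math.pi, math.pi      # two different phases
X0 = mdct_sine_window(tone(kappa, phi0, M))
X1 = mdct_sine_window(tone(kappa, phi1, M))

t0, t1 = theta(kappa, phi0, M), theta(kappa, phi1, M)
even_prediction = math.sin(t0) / math.sin(t1)
odd_prediction = math.cos(t0) / math.cos(t1)

ratios = {k: X0[k] / X1[k] for k in range(18, 23)}
for k, r in ratios.items():
    predicted = even_prediction if k % 2 == 0 else odd_prediction
    print(k, round(r, 4), round(predicted, 4))
```

Far from κ the neglected mirror term in (6) becomes comparable to the main term, so the ratios drift there; the bins that carry most of the energy follow (10) closely.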

3.2. Musical Signals and Subband Partition

For music signals, a frequent situation is that tones and overtones of the same frequency but different phases are present in different channels. Locally in the frequency domain, these tonal components are approximately sinusoids, and their even and odd subspectra are approximately linearly related. To exploit this linearity for stereo and multichannel coding, we partition the even and odd subspectra separately into subbands (Fig. 2(b)); we call this the even-odd scheme, or simply the EO scheme. The subbands approximate the Bark frequency scale [10], becoming broader toward higher frequencies, to accommodate psychoacoustics. Traditionally, subbands are composed of consecutive coefficients (Fig. 2(a)).

Under the EO scheme, one scaling parameter per subband is sufficient to register the difference between channels of a tonal stereo signal.
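The two partitions differ only in how the bins of an interval are grouped. A minimal sketch, with hypothetical helper names and an interval given as a half-open bin range:

```python
def traditional_subbands(start, stop):
    """Split bins [start, stop) into a first-half and a second-half subband."""
    mid = (start + stop) // 2
    return [list(range(start, mid)), list(range(mid, stop))]

def even_odd_subbands(start, stop):
    """Split bins [start, stop) into an even-bin and an odd-bin subband."""
    bins = range(start, stop)
    return [[k for k in bins if k % 2 == 0], [k for k in bins if k % 2 == 1]]

print(traditional_subbands(8, 16))  # [[8, 9, 10, 11], [12, 13, 14, 15]]
print(even_odd_subbands(8, 16))     # [[8, 10, 12, 14], [9, 11, 13, 15]]
```

Both schemes yield the same number of subbands, and therefore the same number of parameters, per interval.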


[Figure 2: (a) Subband 2b holds the consecutive bins 2k − 4 to 2k − 1 and Subband 2b + 1 holds the bins 2k to 2k + 3; (b) Subband 2b holds the even bins 2k − 4, 2k − 2, 2k, 2k + 2 and Subband 2b + 1 holds the odd bins 2k − 3, 2k − 1, 2k + 1, 2k + 3.]

Fig. 2. Partition of an MDCT spectrum into subbands. (a) The traditional scheme, (b) the even-odd scheme.

3.3. Coding Gain of the EO Scheme

If incoherent non-tonal components arise in different channels, the cross-channel correlation decreases. In this case, the Karhunen-Loève Transform (KLT [9]) optimally exploits the remaining correlation: two spectral vectors $X_0$ and $X_1$ from the same subband but different channels are orthogonally transformed to a main vector $Y_0$ and a minor vector $Y_1$ as

$$\begin{pmatrix} Y_0 \\ Y_1 \end{pmatrix} = \begin{pmatrix} \cos\beta & \sin\beta \\ -\sin\beta & \cos\beta \end{pmatrix}\begin{pmatrix} X_0 \\ X_1 \end{pmatrix} \qquad (11)$$

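where $\tan 2\beta = 2\langle X_0, X_1\rangle/(\|X_0\|^2 - \|X_1\|^2)$ to minimize $\|Y_1\|^2$. If only $Y_0$ and $\beta$ are available to a decoder, the power of the coding error, equal to $\|Y_1\|^2$, will be

$$\varepsilon(B) = \frac{1}{2}\left(\|X_0\|^2 + \|X_1\|^2\right) - \frac{1}{2}\sqrt{\left(\|X_0\|^2 - \|X_1\|^2\right)^2 + 4\langle X_0, X_1\rangle^2}. \qquad (12)$$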
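The relation between the rotation (11) and the error power (12) can be checked on any pair of vectors. The sketch below takes 2β = atan2(2⟨X0, X1⟩, ‖X0‖² − ‖X1‖²), which selects the branch that minimizes ‖Y1‖²; the function names and sample vectors are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def klt_angle(x0, x1):
    """Rotation angle beta that minimizes the energy of the minor vector Y1."""
    return 0.5 * math.atan2(2 * dot(x0, x1), dot(x0, x0) - dot(x1, x1))

def klt_rotate(x0, x1, beta):
    """Apply the rotation (11) and return (Y0, Y1)."""
    c, s = math.cos(beta), math.sin(beta)
    y0 = [c * a + s * b for a, b in zip(x0, x1)]
    y1 = [-s * a + c * b for a, b in zip(x0, x1)]
    return y0, y1

def residual_power(x0, x1):
    """Closed-form coding-error power from (12)."""
    a, b, c = dot(x0, x0), dot(x1, x1), dot(x0, x1)
    return 0.5 * (a + b) - 0.5 * math.sqrt((a - b) ** 2 + 4 * c * c)

x0 = [1.0, 2.0, -1.0]
x1 = [0.5, 1.5, 0.3]
y0, y1 = klt_rotate(x0, x1, klt_angle(x0, x1))
print(dot(y1, y1), residual_power(x0, x1))
```

When the two vectors are collinear the residual vanishes, which is the situation the EO scheme creates for tonal subbands.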

Generally, the larger $\langle X_0, X_1\rangle^2$ is, the smaller $\varepsilon(B)$ is. Let $B_0, B_1, \dots, B_{p-1}$ be the subbands partitioned by the EO scheme, and $\tilde{B}_0, \tilde{B}_1, \dots, \tilde{B}_{p-1}$ be the subbands partitioned by the traditional scheme. We define the coding gain as the ratio of the overall coding-error power of the traditional scheme to that of the EO scheme:

$$G = 10\log_{10}\frac{\sum_{b=0}^{p-1}\varepsilon(\tilde{B}_b)}{\sum_{b=0}^{p-1}\varepsilon(B_b)}. \qquad (13)$$

Due to (10), the subspectra of sinusoids from different channels are almost fully correlated, but the complete spectra are not. We expect high coding gain for pure sinusoids and rich-tone signals. Numerical simulation supported the expectation.
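A small-scale version of this expectation can be sketched as follows. It assumes the sine window, M = 64, uniform 8-bin intervals in place of the Bark-like intervals, and one stereo sinusoid with two fixed phases; none of these values come from the experiments, and the gain is taken as traditional error power over EO error power.

```python
import math

def mdct_sine_window(x):
    """Direct evaluation of (1) with the sine window."""
    M = len(x) // 2
    return [
        sum(math.sin(math.pi * (n + 0.5) / (2 * M)) * x[n]
            * math.cos(math.pi / M * (n + M / 2 + 0.5) * (k + 0.5))
            for n in range(2 * M))
        for k in range(M)
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual_power(x0, x1):
    """Coding-error power (12) for one subband."""
    a, b, c = dot(x0, x0), dot(x1, x1), dot(x0, x1)
    return 0.5 * (a + b) - 0.5 * math.sqrt((a - b) ** 2 + 4 * c * c)

def even_odd_partition(intervals):
    return [[k for k in range(s, e) if k % 2 == p] for s, e in intervals for p in (0, 1)]

def traditional_partition(intervals):
    bands = []
    for s, e in intervals:
        mid = (s + e) // 2
        bands += [list(range(s, mid)), list(range(mid, e))]
    return bands

def total_error(X0, X1, bands):
    return sum(residual_power([X0[k] for k in band], [X1[k] for k in band])
               for band in bands)

def coding_gain_db(X0, X1, intervals):
    """Gain (13): traditional error power over EO error power, in dB."""
    return 10 * math.log10(total_error(X0, X1, traditional_partition(intervals))
                           / total_error(X0, X1, even_odd_partition(intervals)))

M, kappa = 64, 20.3
intervals = [(s, s + 8) for s in range(0, M, 8)]  # uniform stand-in for Bark bands
left = [math.sin(math.pi / M * kappa * n + 0.2 * math.pi) for n in range(2 * M)]
right = [math.sin(math.pi / M * kappa * n + math.pi) for n in range(2 * M)]
gain = coding_gain_db(mdct_sine_window(left), mdct_sine_window(right), intervals)
print(round(gain, 1))
```

The gain for this single frame is large and positive; its exact value depends on the assumed block size, interval layout and phases.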

We compute the distribution of the gain with a constant frame length of M = 1024 (AAC long window). Each MDCT spectrum is first partitioned into 24 intervals that approximately follow the Bark scale; each interval then forms either an even subband and an odd subband (Fig. 2(b)) or a first-half subband and a second-half subband (Fig. 2(a)). In both cases we have p = 48 subbands, with which the gain is computed for each frame. For stereo sinusoids with uniform random phases in [0, 2π), the gain is mostly around 90 dB (Fig. 3(a)); for speech (Fig. 3(b)) and music (Fig. 3(c)(d)), it is mostly within 0–20 dB. Since lower coding noise generally leads to higher audio quality, the EO scheme will probably boost stereo coding performance for rich-tone signals.

[Figure 3: four histograms of probability density (0 to 0.3) versus coding gain in dB; panel (a) spans 70 to 100 dB, and panels (b), (c) and (d) span −5 to 25 dB.]

Fig. 3. Probability distribution of the coding gain for stereo signals. (a) Sinusoids with uniformly distributed initial phase, (b) female speech moving, (c) pitch pipe, (d) trumpet solo and orchestra.

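[Figure 4: two block diagrams. Encoder: L and R each pass through T/M and S; the even (e) and odd (o) subbands of the two channels enter K/D; the downmixed subspectra pass through M and C/C to give C/B, while β passes through P/C to give P/B. Decoder: C/B passes through C/D and S; the even (e) and odd (o) subbands enter K/U together with β from P/B via P/D; the two outputs each pass through M and M/T to give L and R.]

Fig. 4. The KLT-EO structure: MDCT-domain stereo coding based on the KLT and the EO scheme. Top: encoder; bottom: decoder.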

4. EXPERIMENTS

We construct an MDCT-domain stereo coder from the KLT and the EO scheme, called KLT-EO, for two reasons: the KLT is the optimal orthogonal transform for compression of correlated signals [9], and the EO scheme increases the subband cross-channel spectral correlation for sinusoidal and rich-tone signals.

On the encoder side of KLT-EO (Fig. 4 top), signal blocks from the left (L) and right (R) channels first go through the time-to-MDCT-domain mapping (T/M); the spectra are then separated (S) into even subbands (e) and odd subbands (o). On each subband, the two spectral vectors from the two channels are downmixed to one vector by the KLT (K/D), i.e., only the main vector Y0 in (11) is kept, and the rotation angle (β) is quantized and Huffman coded (P/C) to form the parameter bitstream (P/B). The downmixed subspectra are merged (M) into a complete MDCT spectrum, which is sent to a core coder (C/C) such as AAC to generate the core bitstream (C/B). On the decoder side (Fig. 4 bottom), this process is inverted: C/C changes to the core decoder (C/D), P/C to the parameter decoder (P/D), K/D to KLT upmixing (K/U), which is the inverse of (11) with Y1 set to 0, and T/M to the MDCT-to-time-domain mapping (M/T).
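The S, K/D, M and K/U stages can be sketched on a pair of spectra as follows. The sketch leaves out the time/MDCT mapping, the quantization and Huffman coding of β, and the core coder, and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def even_odd_subbands(intervals):
    """S: one even-bin and one odd-bin subband per interval [start, stop)."""
    return [[k for k in range(s, e) if k % 2 == p] for s, e in intervals for p in (0, 1)]

def encode(X0, X1, intervals):
    """K/D and M: keep only Y0 of (11) per subband, merged into one spectrum."""
    downmix, betas = [0.0] * len(X0), []
    for band in even_odd_subbands(intervals):
        x0, x1 = [X0[k] for k in band], [X1[k] for k in band]
        beta = 0.5 * math.atan2(2 * dot(x0, x1), dot(x0, x0) - dot(x1, x1))
        for k, a, b in zip(band, x0, x1):
            downmix[k] = math.cos(beta) * a + math.sin(beta) * b
        betas.append(beta)  # unquantized here; the coder quantizes and Huffman codes it
    return downmix, betas

def decode(downmix, betas, intervals):
    """K/U: inverse of (11) with Y1 = 0, applied per subband."""
    X0, X1 = [0.0] * len(downmix), [0.0] * len(downmix)
    for band, beta in zip(even_odd_subbands(intervals), betas):
        for k in band:
            X0[k] = math.cos(beta) * downmix[k]
            X1[k] = math.sin(beta) * downmix[k]
    return X0, X1

# Toy spectra: the right channel is a parity-dependent scaling of the left one,
# the ideal tonal case described by (10).
left = [1.0, -2.0, 0.5, 3.0, -1.5, 0.25, 2.0, -0.75]
right = [2.0 * v if k % 2 == 0 else -0.5 * v for k, v in enumerate(left)]
intervals = [(0, 4), (4, 8)]
downmix, betas = encode(left, right, intervals)
left_hat, right_hat = decode(downmix, betas, intervals)
print(max(abs(a - b) for a, b in zip(left + right, left_hat + right_hat)))
```

In this ideal case the two channels are recovered from one downmixed spectrum plus one angle per subband; real signals leave a residual given by (12).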


[Figure 5: bar chart of MUSHRA scores (0 to 100) for the conditions ref, ps20, fs48, eo48, mono and 3.5k, grouped into Speech and Vocal, Single Instrument, and Multi Instruments. Bar labels in extraction order, with no recoverable mapping to conditions: 91, 97, 88, 73, 74, 84, 83, 47, 81, 82, 76, 87, 56, 39, 28, 12, 21, 32.]

Fig. 5. MUSHRA scores, mean and 95% confidence interval. ‘ref’ is for reference; ‘ps20’ for PS with 20 subbands; ‘fs48’ for KLT-FS with 48 subbands; ‘eo48’ for KLT-EO with 48 subbands; ‘mono’ for mono downmixed; and ‘3.5k’ for 3.5 kHz low-pass filtered.

To evaluate the performance of the coder, we arranged a MUSHRA [11] subjective listening test. Using headphones, twelve young subjects graded nine groups of hidden references, anchors, and processed sequences from 0 (worst) to 100 (transparent) based on the distortion they perceived against the given references. The reference set consisted of three speech and vocal, three single-instrument, and three multi-instrument stereo test sequences from 3GPP and MPEG, all sampled at 48 kHz. Two types of anchors were used: 3.5 kHz low-pass filtered references and mono-downmixed references.

To single out stereo processing performance, core coding was bypassed. For each 1024-point frame, KLT-EO produces 48 rotation angles, one from each subband, corresponding to a parameter bitrate of 4.1–5.2 kb/s after quantization and Huffman coding. For comparison, the test sequences were also processed by a plain MDCT-domain KLT stereo coder without the EO scheme (called KLT-FS) but otherwise identical to our coder, and by the 20-subband mode of the PS coder in 3GPP EAAC+ [12], stripped of core coding and SBR, with a parameter bitrate of 1.9–2.9 kb/s.
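The parameter bitrate can be related to an average code length per angle. The short calculation below assumes that each frame advances by 1024 samples at the 48 kHz sampling rate of the test material; the per-angle figures are derived here and are not reported in the experiments.

```python
sample_rate_hz = 48_000      # sampling rate of the test sequences
frame_advance = 1024         # assumed hop size: one MDCT frame per 1024 new samples
angles_per_frame = 48        # one rotation angle per subband

frames_per_second = sample_rate_hz / frame_advance          # 46.875
angles_per_second = angles_per_frame * frames_per_second    # 2250.0

for bitrate_bps in (4100, 5200):
    bits_per_angle = bitrate_bps / angles_per_second
    print(bitrate_bps, round(bits_per_angle, 2))
```

Under this assumption the reported 4.1–5.2 kb/s corresponds to roughly 1.8 to 2.3 bits per rotation angle on average.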

Fig. 5 presents the MUSHRA scores in three groups. For speech and vocal signals, KLT-EO and KLT-FS perform similarly; for multi-instrument signals, KLT-EO performs slightly better than KLT-FS; for the single-instrument signals (pitch pipe, glockenspiel, and plucked strings), KLT-EO outperforms KLT-FS by a large margin of 29 MUSHRA points. The single-instrument signals are rich in tonal components, to which our ears are very sensitive. The spectrum separation in KLT-EO is critical to catching the fine spectral structure of this kind of signal and leads to the performance boost. Even with more subbands and parameters, we experienced severe harmonic distortion in KLT-FS, which makes the sound tremble and is most pronounced for the pitch pipe signal.

Generally, KLT-EO performs as well as PS, but at a bitrate about 2 kb/s higher because of the larger number of subbands used to compensate for the lack of coherence processing. A distinct advantage of our coder is that it adds no algorithmic delay. PS has to budget 20 ms to bridge the QMF and the MDCT (this delay is currently shared with SBR in EAAC+). In KLT-EO, the stereo processing and the core coding (AAC) both work in the MDCT domain, so this delay is avoided. This amounts to a significant delay reduction for real-time two-way communication, while the bitrate increase is relatively small.

5. CONCLUSIONS

We have proved a distinctive property of the MDCT spectra of sinusoids: the even and odd subspectra have phase-independent shapes. This translates into a performance boost for stereo coding by exploiting the cross-channel correlation of the MDCT subspectra, which is most pronounced for rich-tone stereo signals.

Apart from low-delay MDCT-domain stereo coding, the property may be used to enhance traditional coding schemes, such as Intensity Stereo for lower coding noise and Mid/Side Stereo for lower side-channel power. The property is also shared by DCTs of types I–IV, so the spectrum separation will work in these domains for compression of correlated signals with rich tonal components.

6. REFERENCES

[1] J. Princen and A. Bradley, “Analysis/synthesis filter bank design based on time domain aliasing cancellation,” IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 5, pp. 1153–1161, Oct. 1986.

[2] H. Malvar, “Lapped transforms for efficient transform/subband coding,” IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 6, pp. 969–978, June 1990.

[3] F. Kuech and B. Edler, “Aliasing reduction for modified discrete cosine transform domain filtering and its application to speech enhancement,” in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., New York, Oct. 21–24, 2007, pp. 131–134.

[4] P. Ekstrand, “Bandwidth extension of audio signals by spectral band replication,” in Proc. 1st IEEE Benelux Workshop on Model Based Process. and Coding of Audio (MPCA-2002), Leuven, Nov. 2002, pp. 53–58.

[5] J. Breebaart, S. van de Par, A. Kohlrausch, and E. Schuijers, “Parametric coding of stereo audio,” EURASIP J. Appl. Signal Process., pp. 1305–1322, Sept. 2005.

[6] V. Melkote and K. Rose, “A modified distortion metric for audio coding,” in Proc. IEEE Int. Conf. Audio Speech Signal Process., Taipei, Apr. 2009, pp. 17–21.

[7] S. Chen, R. Hu, and S. Zhang, “Estimating spatial cues for audio coding in MDCT domain,” in Proc. IEEE Int. Conf. Multimedia Expo, July 2009, pp. 53–56.

[8] L. Daudet and M. Sandler, “MDCT analysis of sinusoids: exact results and applications to coding artifacts reduction,” IEEE Trans. Speech Audio Process., vol. 12, no. 3, pp. 302–312, May 2004.

[9] R.G. van der Waal and R.N.J. Veldhuis, “Subband coding of stereophonic digital audio signals,” in Proc. IEEE Int. Conf. Audio Speech Signal Process., Toronto, Apr. 1991, vol. 5, pp. 3601–3604.

[10] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models. Berlin Heidelberg, Germany: Springer-Verlag, 1990.

[11] ITU-R BS.1534-1: Method for the subjective assessment of intermediate quality levels of coding systems, ITU, 2003.

[12] 3GPP TS 26.410: General audio codec audio processing functions; enhanced aacPlus general audio codec; floating-point ANSI-C code, 2008 [online]. Available: http://www.3gpp.org/ftp/Specs/html-info/26410.htm
