arXiv:2112.01525v1 [cs.CV] 2 Dec 2021

Co-domain Symmetry for Complex-Valued Deep Learning

Utkarsh Singhal    Yifei Xing    Stella X. Yu
UC Berkeley / ICSI

{s.utkarsh, xingyifei2016, stellayu}@berkeley.edu

Abstract

We study complex-valued scaling as a type of symmetry natural and unique to complex-valued measurements and representations. Deep Complex Networks (DCN) extends real-valued algebra to the complex domain without addressing complex-valued scaling. SurReal takes a restrictive manifold view of complex numbers, adopting a distance metric to achieve complex-scaling invariance while losing rich complex-valued information.

We analyze complex-valued scaling as a co-domain transformation and design novel equivariant and invariant neural network layer functions for this special transformation. We also propose novel complex-valued representations of RGB images, where complex-valued scaling indicates hue shift or correlated changes across color channels.

Benchmarked on MSTAR, CIFAR10, CIFAR100, and SVHN, our co-domain symmetric (CDS) classifiers deliver higher accuracy, better generalization, robustness to co-domain transformations, and lower model bias and variance than DCN and SurReal with far fewer parameters.

1. Introduction

Symmetry is one of the most powerful tools in the deep learning repertoire. Naturally occurring symmetries lead to structured variation in natural data, and so modeling these symmetries greatly simplifies learning [1]. A key factor behind the success of Convolutional Neural Networks [2] is their ability to capture the translational symmetry of image data. Similarly, PointNet [3] captures permutation symmetry of point-clouds. These symmetries are formalized as invariance or equivariance to a group of transformations [4]. This view is taken to model a general class of spatial symmetries, including 2D/3D rotations [5-7]. However, this line of research has primarily focused on real-valued data.

We explore complex-valued data which arise naturally in 1) remote sensing such as synthetic aperture radar (SAR), medical imaging such as magnetic resonance imaging (MRI), and radio frequency communications; 2) spectral representations of real-valued data such as the Fourier Transform [8, 9]; and 3) physics and engineering applications [10]. In deep learning, complex-valued models have shown several benefits over their real-valued counterparts: larger representational capacity [11], more robust embedding [12] and associative memory [13], more efficient multi-task learning [14], and higher quality MRI image reconstruction [15]. We approach complex-valued deep learning from a symmetry perspective: Which symmetries are inherent in complex-valued data, and how do we exploit them in modeling?

Figure 1. An image is a function from the domain R^2 to the co-domain C^N. Image transformations like rotation and translation act on the domain, mapping points in R^2 to other points, while leaving the underlying function values intact. Previous works like [5, 6] aim to produce architectures invariant to domain transformations. Co-domain transformations like color distortion or complex-valued scaling, on the other hand, act on the function values only.

One type of symmetry inherent to complex-valued data is complex-valued scaling ambiguity [18]. For example, consider a complex-valued MRI or SAR signal z. Due to the nature of signal acquisition, z could be subject to global magnitude scaling and phase offset represented by a complex-valued scalar s, thus becoming s · z.

A complex-valued classifier takes input z and ideally should focus on discriminating among instances from different classes, not on the instance-wise variation s · z caused by complex-valued scaling. Formally, function f is called complex-scale invariant if f(s · z) = f(z) and called complex-scale equivariant if f(s · z) = s · f(z).

[Figure 2 panels: (a) Color variations due to complex-scaling in our proposed LAB encoding; (b) t-SNE embedding for DCN [16]; (c) Visualization of our LAB encoding; (d) Model confidence for a single example; (e) t-SNE embedding for our model; (f) Accuracy under color jitter; (g) Comparison against phase normalization.]

Figure 2. Our method learns invariant features with respect to complex-scaling of the input. All examples are from CIFAR10 with our LAB encoding, undergoing multiplication by a unit complex number. (b, e) t-SNE embedding trajectories from DCN [16] and our model. Each color represents a different example. Embeddings form tight clusters for our model, and irregular overlapping curves for DCN. (c) Visualization of our complex-valued embedding of LAB information. The L* channel is visualized as a grayscale image, and the complex-valued a* + ib* is visualized as a color image. (d) Model confidence of the correct class for a single example. Higher confidence means larger radius. DCN predictions are highly variable, while our model is robust to complex-scaling and thus constant. (f) Accuracy under complex-scaling and color jitter. Red bars represent complex-rotations sampled from different rotation ranges. Blue bars represent color jitter (as used in [17]). Our method maintains high accuracy across complex-rotations and color jitter, whereas DCN and the real-valued CNN fail. SurReal [18] is robust, but has low overall accuracy. Our method combines high accuracy with robustness. (g) Average accuracy under different rotation ranges, comparing DCN with phase normalization (dotted blue line) and without phase normalization (solid blue line) against our method. The color encoding has a complicated phase distribution, and phase normalization fails to estimate the amount of rotation, resulting in poor accuracy. Our method is thus more suitable for complicated phase distributions.

We distinguish two types of image transformations, viewing an image as a function defined over spatial locations. Complex-valued scaling of a complex-valued image is a transformation in the co-domain of the image function, as opposed to a spatial transformation in the domain of the image (Figure 1). Formally, I : R^D → C^K denotes a complex-valued image of K channels in D-dimensional space, where R (C) denotes the set of real (complex) numbers. Some common (D, K) are (2, 1) for grayscale images, (2, 3) for RGB, and (3, 6+) for diffusion tensor images.

1. Domain transformation T : R^D → R^D transforms the spatial coordinates of an image, resulting in a spatially warped image I(T(x)), x ∈ R^D. Translation, rotation, and scaling are examples of domain transformations.

2. Co-domain transformation T′ : C^K → C^K maps the pixel value to another, resulting in a color-adjusted image T′(I(x)), x ∈ R^D. Complex-valued scaling and color distortions are examples of co-domain transformations.

Complex-valued scaling thus presents not only a practical setting but also a case study for co-domain transformations.

Existing complex-valued deep learning methods such as DCN [16] are sensitive to complex-valued scaling (Fig. 2b). A pre-processing trick to remove such scaling ambiguity is to simply normalize all the pixel values by setting their average phase to 0 and magnitude to 1, but this process introduces artifacts when the phase distribution varies greatly with the content of the image (Figure 2g). SurReal [18] applies manifold-valued deep learning to complex-valued data, but this framework only captures the manifold aspect and not the complex algebra of complex-valued data. Thus, a more general, principled method is needed. We propose novel layer functions for complex-valued deep learning by studying how they preserve co-domain symmetry. Specifically, we study whether each layer-wise transformation achieves equivariance or invariance to complex-valued scaling.

Our contributions: 1) We derive complex-scaling equivariant and invariant versions of common layers used in computer vision pipelines. Our model circumvents the limitations of SurReal [18] and scales to larger models and datasets. 2) Our experiments on the MSTAR, CIFAR10, CIFAR100, and SVHN datasets demonstrate a significant gain in generalization and robustness. 3) We introduce novel complex-valued encodings of color, demonstrating the utility of using complex-valued representations for real-valued data. Complex-scaling invariance under our LAB encoding automatically leads to color distortion robustness without the need for color jitter augmentation.

2. Related Work

Complex-Valued Processing: Complex numbers are ubiquitous in mathematics, physics, and engineering [10, 19, 20]. Traditional complex-valued data analysis involves higher-order statistics [21, 22]. [11] demonstrates the higher representational capacity of complex-valued processing on the XOR problem. [23] proposes a sparse coding layer utilizing complex basis functions. [24] proposes a biologically meaningful complex-valued model. [25, 26] encode data features in a complex vector and learn a metric for this embedding. [15] applies complex-valued neural networks to MRI image reconstruction. [27] investigates the role of critical points in complex neural networks. [28] demonstrates that complex networks have smaller generalization error than real-valued ones, and offers an overview of their convergence and stability. We refer the reader to [16] for a more detailed account of complex-valued deep learning.

Transformation Equivariance/Invariance: An important line of work [5, 6, 29, 30] aims to develop convolutional layers equivariant to domain transformations like rotation/scaling. [6] introduces a principled method for producing group-equivariant layers for finite groups. [5] extends this work to Lie groups on continuous data. [31] uses circular harmonics to produce deep neural networks equivariant to rotation and translation. [32] attempts to produce a general theory of group-equivariant CNNs on the Euclidean space and the sphere. [33] further extends the framework to local gauge transformations on the manifold. This class of methods is well-suited for domain transformations. However, complex-valued scaling is a co-domain transformation, and these methods are not applicable. [7] generalizes neurons to R^3 vectors with 3D rotations as a co-domain transformation, introducing rotation-equivariant layers for point-clouds. In contrast, our method handles both the complex algebra and the geometry of complex-valued scaling.

Complex-Valued Scaling: Despite the increasing research interest in complex-valued neural networks, the problem of effectively dealing with complex-scaling ambiguity remains open. [16, 34-36] propose an extension of real neural architectures to the complex field by redefining basic building blocks such as complex convolution, batch normalization, and non-linear activation functions. However, these methods are not robust against complex-valued scaling. SurReal [18] tackles the problem of complex-scale invariance by adopting a manifold-based view of complex numbers. It models each complex number as an element of a manifold where complex-scaling corresponds to translation, and uses tools from manifold-valued learning to create complex-scale invariant models. This results in better generalization to unseen complex-valued data with leaner models. However, the SurReal framework is restrictive (Sec. 3.1), and the complex-valued processing is forced to be linear (Sec. 3.2), preventing SurReal from scaling to large datasets (Table 1).

3. Equivariance and Invariance for Complex-Valued Deep Learning

3.1. Equivariant Convolution

In contrast to domain transformations, which require group-specific convolution layers for equivariance [5, 6], any linear layer is equivariant to complex-valued scaling: for a linear function L : C^n → C^n with an input vector x ∈ C^n and a complex scalar s ∈ C, L(s · x) = s · L(x). While a bias term is useful in real-valued neural networks, it turns the convolution layer into an affine function, destroying complex-scale equivariance. Thus, we remove the bias term from the complex-valued convolution used in DCN [16], restoring its equivariance. Additionally, we use Gauss' multiplication trick to speed up the convolution by 25%.

Given a complex-valued input feature map z = a + ib ∈ C^{C×H×W} with C channels and H × W pixels, and given a convolutional filter of size K × K with weight W = X + iY ∈ C^{C×C×K×K}, we define the Complex-Scale Equivariant Convolution as:

Econv(z; W) = W ∗ z = (X + iY) ∗ (a + ib) = t1 − t2 + i(t3 − t1 − t2)

where t1 = X ∗ a, t2 = Y ∗ b, t3 = (X + Y) ∗ (a + b), and X ∗ a represents the convolution operation on a with weight X. In contrast, SurReal uses the weighted Fréchet Mean (wFM), a restricted convolution where the weights are constrained to be real-valued, positive, and to sum to 1. This restrictive formulation drastically reduces accuracy.
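As a concrete sketch of this construction, the PyTorch module below implements a bias-free complex convolution with the three-multiplication trick; the class name EConv2d, the weight initialization, and the interface (separate real and imaginary tensors) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EConv2d(nn.Module):
    """Complex-scale equivariant convolution: a complex linear map with no bias term.

    The filter W = X + iY is stored as two real tensors, and W * z is computed with
    three real convolutions (t1, t2, t3) instead of four.
    """
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.X = nn.Parameter(0.05 * torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.Y = nn.Parameter(0.05 * torch.randn(out_ch, in_ch, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, a, b):                 # input z = a + ib, each part of shape (N, C, H, W)
        t1 = F.conv2d(a, self.X, stride=self.stride, padding=self.padding)
        t2 = F.conv2d(b, self.Y, stride=self.stride, padding=self.padding)
        t3 = F.conv2d(a + b, self.X + self.Y, stride=self.stride, padding=self.padding)
        return t1 - t2, t3 - t1 - t2         # real and imaginary parts of W * z
```

Because the layer is linear and has no bias, scaling the input by any s ∈ C scales both output parts by the same s, which is exactly the equivariance property used throughout this section.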


Figure 3. Our equivariant non-linearity, E{N}, works in four stages. We compute the channel mean m of the input feature f and normalize it to retain only phase information. This normalized mean vector m̂ is equivariant to input phase. We multiply f by m̂∗ to cancel the input phase, resulting in a phase-invariant feature f ⊙ m̂∗. We feed this feature to the non-linearity N and multiply by m̂ to restore the previously cancelled input phase. The result is equivariant in phase, and also equivariant in magnitude if N is.

3.2. Equivariant Non-Linearity

Non-linear activation functions such as ReLU are necessary to construct deep hierarchical representations. For complex-valued data, several alternatives have been studied [16, 18, 36, 37]. CReLU, the most prominent example, computes ReLU independently on the real and imaginary parts of the input. Tangent ReLU (TReLU) [18] uses the polar representation, thresholding both magnitude and phase.

However, these non-linearities are not complex-scale equivariant. DCN [16] uses CReLU, failing to be robust against complex-scaling. SurReal does not use non-linearities in its complex-valued stages (see Tables I & II in [18]). SurReal's complex processing is thus fully linear, preventing it from scaling to large datasets (see Table 1).

We introduce a general class of equivariant non-linearities which act on the relative phase information between features (Fig. 3). Given a complex-valued input feature map f ∈ C^{C×H×W} with C channels and H × W pixels, and given any non-linear activation function N : C → C, we compute an equivariant version of N (denoted E{N}) as:

fout = E{N}(f) = m̂ ⊙ N(f ⊙ m̂∗)    (1)

where ⊙ denotes element-wise multiplication, m̂∗ denotes the complex conjugate of m̂, and m̂ is the normalized per-pixel mean feature:

m(x, y) = (1/C) Σ_{c=1}^{C} f(c, x, y);    m̂(x, y) = m(x, y) / |m(x, y)|    (2)

The normalized mean m̂ is equivariant to input phase and invariant to input magnitude. As a result, the product f ⊙ m̂∗(x, y) is invariant to phase and equivariant to magnitude. If N is equivariant to magnitude (e.g., CReLU), the overall layer E{N} is equivariant to both phase and magnitude.
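A minimal sketch of Eq. (1) on PyTorch complex tensors follows; the helper names (crelu, equivariant_nonlinearity) and the small eps guard against zero-magnitude means are our own assumptions.

```python
import torch

def crelu(z):
    # CReLU: apply ReLU to the real and imaginary parts independently.
    return torch.complex(torch.relu(z.real), torch.relu(z.imag))

def equivariant_nonlinearity(f, N=crelu, eps=1e-8):
    """E{N}(f) = m_hat * N(f * conj(m_hat)) for a complex feature map f of shape (C, H, W)."""
    m = f.mean(dim=0, keepdim=True)        # per-pixel channel mean, shape (1, H, W)
    m_hat = m / (m.abs() + eps)            # keep only the phase of the mean
    return m_hat * N(f * m_hat.conj())
```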

3.3. Equivariant Pooling

In real-valued networks, max-pooling selects the largest activations from a set of neighboring activations. However, for complex numbers, this scheme selects for phase 0 ≤ φ ≤ π/2 over other phases even though they may encode similar salience. Additionally, this process destroys complex-scale equivariance. To remove this dependence on phase, we select the pixels with the highest magnitude. The result is equivariant to both magnitude and phase.
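One way to realize this magnitude-based pooling in PyTorch is sketched below; the function name and the separate real/imaginary gather are our own implementation choices.

```python
import torch
import torch.nn.functional as F

def equivariant_max_pool(z, kernel_size=2, stride=2):
    """Pool a complex map (N, C, H, W) by picking, in each window, the entry with
    the largest magnitude; the selected complex value is returned unchanged."""
    _, idx = F.max_pool2d(z.abs(), kernel_size, stride, return_indices=True)
    idx_flat = idx.flatten(2)                              # indices into the flattened H*W axis
    pick = lambda t: t.flatten(2).gather(2, idx_flat).view(idx.shape)
    return torch.complex(pick(z.real), pick(z.imag))
```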

3.4. Phase-Equivariant Batch Normalization

We follow Deng et al. [7], computing Batch Normalization [38] only on the magnitude of each complex-valued feature, thus preserving the phase information. Given an input feature map f ∈ C^{C×H×W}, we compute:

fBN = BN(|f|) ⊙ (f / |f|)    (3)

where BN refers to real-valued BatchNorm and ⊙ is element-wise multiplication. This layer is equivariant to phase and invariant to input magnitude.
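A sketch of Eq. (3) as a PyTorch module; the class name and the eps guard against zero magnitude are assumptions on our part.

```python
import torch
import torch.nn as nn

class MagnitudeBatchNorm(nn.Module):
    """BatchNorm on |f| only; the unit-phase factor f / |f| passes through untouched."""
    def __init__(self, channels, eps=1e-8):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)
        self.eps = eps

    def forward(self, z):                    # z: complex tensor of shape (N, C, H, W)
        mag = z.abs()
        return self.bn(mag) * (z / (mag + self.eps))
```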

3.5. Complex-Valued Invariant Layers

In order to produce invariant complex-valued features, we introduce the Division Layer and the Conjugate Multiplication Layer. Given two complex-valued features z1, z2 ∈ C^{H×W}, we define:

Div(z1, z2) = (|z1| / (|z2| + ε)) · exp{i(∠z1 − ∠z2)}    (4)

Conj(z1, z2) = z1 z2∗    (5)

In practice, the denominator for division can be small, so we offset the magnitude of the denominator by ε = 10^{-7}.

While the division layer induces invariance to all complex-valued scaling, the conjugate layer only induces invariance to phase. This layer also captures some second-degree interactions similar to a bilinear layer [39]. In contrast to our layers, which capture relative phase and magnitude offsets of input features, SurReal's Distance Layer achieves invariance by extracting real-valued distances between features, discarding rich complex-valued information in the process.
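Both layers are one-liners on complex tensors; the sketch below mirrors Eqs. (4)-(5), with function names of our choosing.

```python
import torch

def division_layer(z1, z2, eps=1e-7):
    """Complex-scale invariant: scaling both inputs by s cancels in magnitude and phase."""
    mag = z1.abs() / (z2.abs() + eps)
    return torch.polar(mag, torch.angle(z1) - torch.angle(z2))

def conjugate_layer(z1, z2):
    """Invariant to a common phase but not to magnitude: (s z1)(s z2)* = |s|^2 z1 z2*."""
    return z1 * z2.conj()
```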

3.6. Generalized Tangent ReLU

In practice, TReLU slows down convergence compared to CReLU. We remedy this through three modifications: a) a learned complex-valued scaling factor for each input channel, enabling the layer to adapt to input magnitude and phase; b) a hyperparameter r to control the magnitude threshold (notably, r = 0 produces a phase-only version of Tangent ReLU, which is equivariant to input magnitude); and c) a learned scaling constant for the output phase of each channel, allowing the non-linearity to adapt the output phase distribution. Our proposed method generalizes TReLU both as a transformation and as a thresholding function. It is defined as:

GTReLU(x; r, c, ω) = max(r, |c · x|) · exp{iω ∠(c · x)+}


Figure 4. Our Division Layer is complex-scale invariant, and preserves more information than the Distance Transform [18]. Consider 4 complex numbers (colored dots) and the reference (black dot). Complex-scaling changes the orientation and size of the trapezoid. While the manifold distances [18] from the black dot to the colored dots are identical, the Division Layer output is invariant and distinct for each dot (the same colored dots around 0°).

[Figure 5 panels: (a) Input features; (b) Scaling factor; (c) Thresholding; (d) Phase-scaling.]

Figure 5. Our Generalized Tangent ReLU transforms the input in three stages: (a) given input complex vectors, it multiplies each channel with a learned scaling factor, (b) thresholds the input magnitude and phase with hyperparameter r, and (c) scales the phase to adapt the output distribution.

where x ∈ C is a scalar input, r ∈ R is the magnitude threshold, c ∈ C and ω ∈ R are learned scaling factors, and x+ = ReLU(x) = max(x, 0) (Figure 5).
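The module below is a sketch of GTReLU with per-channel parameters; the initialization (c = 1, ω = 1) and the treatment of r as a fixed real hyperparameter are our reading of the description above, not the released implementation.

```python
import torch
import torch.nn as nn

class GTReLU(nn.Module):
    """Generalized Tangent ReLU: learned complex channel scale c, magnitude threshold r,
    ReLU on the phase, and a learned per-channel phase scale omega."""
    def __init__(self, channels, r=0.0):
        super().__init__()
        self.c = nn.Parameter(torch.ones(1, channels, 1, 1, dtype=torch.cfloat))
        self.omega = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.r = r

    def forward(self, z):                         # z: complex tensor of shape (N, C, H, W)
        cz = self.c * z
        mag = torch.clamp(cz.abs(), min=self.r)   # max(r, |c·z|)
        phase = torch.relu(torch.angle(cz))       # ReLU on the phase of c·z
        return torch.polar(mag, self.omega * phase)
```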

3.7. Complex Features → Real-Valued Outputs

In order to produce real-valued outputs from complex-valued features, we propose using feature distances, which are commonly used for prototype-based classification. Given a complex-valued feature vector f ∈ C^{D_emb} with embedding dimension D_emb, we compute the distance of f to every class prototype vector p_i ∈ C^{D_emb}. The input is then classified as belonging to the class of the nearest prototype p_i. Formally, we define the logits returned by the network as the negative distance to the prototype scaled by a learned distance scaling factor α ∈ R+. The i-th logit L_i ∈ R is:

L_i = −α · d(f, p_i)    (6)

Since the features are complex-valued, a suitable metric is the manifold distance (equivalent to [18]):

d(z1, z2) = √( (ln |z1| − ln |z2|)² + arc(∠z1, ∠z2)² )    (7)

where z1, z2 ∈ C. It amplifies the effect of phase differences, which would otherwise be suppressed by large variations in magnitude. Alternatively, a simple metric is the Euclidean distance. In practice, we use BatchNorm on the input features before computing distances to accelerate convergence.

Invariant Distance Layer: Similar to the Distance Layer [18], this layer can be made complex-scale invariant by multiplying the prototypes with an equivariant feature map:

L_i = −α · d(f, p_i ⊙ m)    (8)

where m is the mean activation averaged over channels.
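The sketch below implements the prototype logits of Eq. (6) with the manifold distance of Eq. (7). Since Eq. (7) is written for scalars, summing the per-element distances over the embedding dimension is our own aggregation choice, and the class and function names are illustrative.

```python
import torch
import torch.nn as nn

def manifold_distance(z1, z2, eps=1e-8):
    """Eq. (7) applied element-wise, then summed over the last (embedding) dimension."""
    dlog = torch.log(z1.abs() + eps) - torch.log(z2.abs() + eps)
    dphi = torch.angle(z1) - torch.angle(z2)
    dphi = torch.atan2(torch.sin(dphi), torch.cos(dphi))    # shortest arc, wrapped to (-pi, pi]
    return torch.sqrt(dlog ** 2 + dphi ** 2 + eps).sum(dim=-1)

class PrototypeDistanceHead(nn.Module):
    """Logit i = -alpha * d(f, p_i) for learned complex prototypes p_i."""
    def __init__(self, num_classes, emb_dim):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, emb_dim, dtype=torch.cfloat))
        self.alpha = nn.Parameter(torch.ones(()))

    def forward(self, f):                         # f: complex tensor of shape (N, emb_dim)
        d = manifold_distance(f.unsqueeze(1), self.prototypes.unsqueeze(0))
        return -self.alpha * d                    # (N, num_classes) real-valued logits
```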

3.8. Composing Equivariant and Invariant Layers

We introduce two patterns of model composition based on our proposed layers. Type I models use a complex-valued invariant layer in the early stages to achieve complex-scale invariance, and Type E models use equivariant layers throughout the model, retaining more information.

Type I: These models consist of a complex-valued invariant layer (Division/Conjugate) in the early stages, producing complex-scale invariant features which can be used by later stages without any architectural restrictions.

Type E: These models use equivariant layers throughout the network, relying on equivariant convolutions, non-linear activations, and pooling layers, followed by an invariant distance layer to obtain invariant predictions. This model preserves the phase information through equivariant layers and thus can typically achieve higher accuracy. However, this class of models is less flexible than Type I.

4. Complex-Valued Color Encodings

Spectral representations like the Fourier Transform are invaluable for analyzing real-valued signals. However, their usefulness in convolutional neural networks is undermined by the fact that CNNs are designed to tackle spatially homogeneous and translation-invariant domains, while Fourier data is neither. We propose two complex-valued color encodings which capture hue shift and channel correlations respectively, demonstrating the utility of complex-valued representations for real-valued data.

Our first, so-called "Sliding" encoding takes an [R, G, B] image and encodes it with two complex-valued channels:

[R, G, B] → [R + iG, G + iB]    (9)

The complex phase in this encoding corresponds to the ratio of R, G, B values, so the phase in this encoding captures the correlation between the adjacent color channels.

Our second proposed encoding uses L*a*b*, a perceptually uniform color representation with luminance represented by the L channel and chromaticity by the a and b channels. [40] uses this color space for image colorization.


Figure 6. Our CIFARnet models demonstrate two methods of constructing complex-scale invariant models. Green arrows represent equivariant features, and blue arrows represent invariant features. Top: the Type I architecture uses a Division Layer in early stages, producing complex-scale invariant features which can be used with any following layers. Bottom: Type E uses equivariant layers throughout the network, retaining phase information until the final Invariant Prototype Distance Layer. This class of models is more restrictive but can achieve higher accuracy (see Table 1) as a consequence of retaining more information.

We use it to represent color as a two-channel, complex-valued representation, with the first channel containing the luminance (L* channel), and the second channel containing chromaticity (a* and b* channels) as a* + ib* (Figure 2c):

[R, G, B] → [L*, a* + ib*]    (10)

Color distortions as co-domain transformations: Color distortion can be approximated with complex-valued scaling of our LAB color representation (Figure 2a). A complex-scale invariant network is thus automatically robust to color distortions without the need for data augmentation.
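The two encodings amount to a small pre-processing step per image. The sketch below covers both for an RGB image with values in [0, 1]; the function names are ours, and lab_encoding assumes scikit-image is available for the RGB to L*a*b* conversion.

```python
import numpy as np
from skimage.color import rgb2lab   # assumes scikit-image is installed

def sliding_encoding(rgb):
    """Eq. (9): [R, G, B] -> [R + iG, G + iB] for an (H, W, 3) array; output (2, H, W) complex."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.stack([r + 1j * g, g + 1j * b])

def lab_encoding(rgb):
    """Eq. (10): [R, G, B] -> [L*, a* + i b*]; channel 0 carries luminance with zero imaginary part."""
    lab = rgb2lab(rgb)                           # (H, W, 3) with channels L*, a*, b*
    return np.stack([lab[..., 0] + 0j, lab[..., 1] + 1j * lab[..., 2]])
```

Multiplying the a* + ib* channel by a unit complex number rotates chroma and thus shifts hue, which is the correspondence between complex scaling and color distortion illustrated in Figure 2a.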

5. Experiments

We conduct three kinds of experiments. Accuracy: 1) classification of naturally complex-valued images, 2) real-valued images with real and complex representations. Robustness: against complex scaling and color distortion. Generalization: 1) bias-variance analysis, 2) generalization on smaller training sets, 3) feature redundancy analysis.

5.1. Complex-Valued Dataset: MSTAR

MSTAR contains 15,716 complex-valued synthetic aperture radar (SAR) images divided into 11 classes [41]. Each image has one channel and size 128 × 128. We discard the last "clutter" class and follow [42], training on the depression angle 17° and testing on 15°. We train each model on varying proportions of the dataset to evaluate the accuracy and generalization capabilities of each model.

SurReal: We replicate the architecture described in Table 1 of [18]. Since the paper does not mention the learning rate, we use the same learning rate and batch size as our model. DCN: We use the author-provided code¹, creating a complex ResNet with CReLU and 10 blocks per stage. By default, this model accepts 32 × 32 images, so we append 2 × [ComplexConv, ComplexBatchNorm] with stride 2 to downsample the input. The model is trained for 200 epochs using SGD with batch size 64 and the learning rate schedule in [16]. We select the epoch with the best validation accuracy. Real-valued baseline: We use a 3-stage ResNet with 3 layers per residual block and convert the complex input into two real-valued channels.

CDS: We use a Type I model based on SurReal [18]. We extract equivariant features using an initial equivariant block containing EConv, Eq. GTReLU, and Eq. MaxPool layers, and then obtain complex-scale invariant features by using a Division Layer. These features are then fed to a real-valued ResNet. For more details about every model, please refer to the supplementary materials.

Training: We optimize both SurReal and CDS models using the AdamW optimizer [43, 44] with learning rate 10^{-3}, momentum (0.9, 0.99), weight decay 0.1, and batch size 256 for 2.5 × 10^5 iterations. We validate every 1000 steps, picking the model with the best validation accuracy.

5.2. Real-valued Datasets: CIFAR10/100, SVHN

Datasets: CIFAR10 [48] (and CIFAR100) consists of 10 (100) classes containing 6000 (600) images each. Both CIFAR10 and CIFAR100 are partitioned into 50000 training images and 10000 test images. SVHN [49] consists of house number images from Google Street View, divided into 10 classes with 73,257 training digits and 26,032 testing digits.

¹ https://github.com/ChihebTrabelsi/deep_complex_networks

[Figure 7 panels: (a) Scaling laws for MSTAR [45], plotting error rate against relative dataset size r, with fitted trend lines Real-valued CNN: −0.13r − 1.12, DCN: −0.57r − 2.07, SurReal: −0.76r − 3.03, Ours: −0.64r − 3.04; (b) Class-wise bias and variance for CIFAR10 [46]; (c) Filter similarity histogram for conv2.]

Figure 7. Our model generalizes across various dataset sizes, has lower bias and variance, and avoids learning redundant filters. (a): We produce trend curves (similar to [45]) for the MSTAR accuracy table (Table 2). Our method has the lowest test error for measured dataset sizes, a trend that is predicted to scale to even smaller sizes. (b): We followed [46] for CIFAR10 models with LAB encoding. Classes are ordered in ascending order of bias for our model. Our model consistently shows the lowest bias for each class, and the lowest variance for 9 out of 10 classes, indicating overall superior generalization ability. (c): Filter similarity histogram from the conv2 layer of each CIFARnet model, following [47]. Our distribution mean is closest to 0, indicating our method achieves the least redundant filters.

Method            # Param   CIFAR10 (RGB / LAB / Sliding)   CIFAR100 (RGB / LAB / Sliding)   SVHN (RGB / LAB / Sliding)
DCN [16]          66,858    65.17 / 58.64 / 63.83           32.52 / 27.36 / 28.87            85.26 / 84.43 / 87.44
SurReal [18]      35,274    50.68 / 53.02 / 54.61           23.57 / 25.97 / 26.66            80.51 / 53.48 / 80.79
Real-valued CNN   34,282    64.43 / 63 / 63.43              31.93 / 31.72 / 31.93            87.47 / 84.93 / 87.37
Ours (Type-I)     24,241    69.23 / 67.17 / 68.7            36.92 / 37.81 / 38.51            89.39 / 88.86 / 90.25
Ours (Type-E)     25,745    68.48 / 67.58 / 69.19           41.83 / 39.55 / 42.08            77.19 / 74.21 / 88.39

Table 1. Our models outperform the baselines' CIFARnet versions on real-valued datasets. The Type-I model performs better on easier datasets like SVHN, and Type-E performs better on difficult datasets like CIFAR100. In contrast, SurReal does not scale to large datasets.

Models: To ensure equal footing for each model, all networks in this experiment are based off CIFARNet, i.e., 3 convolution layers (stride 2) and 2 fully connected layers. We also replace average pooling with a depthwise-separable convolution as a learnable pooling layer. All models are optimized with AdamW [43, 44] using momentum (0.99, 0.999), for 5 × 10^4 steps with batch size 256, learning rate 10^{-3}, weight decay 0.1, and validated every 1000 iterations. DCN: We use ComplexConv for convolutions and CReLU as the non-linearity. We do not use Residual Blocks or Complex BatchNorm from [16] to ensure fairness. SurReal: We use wFM for convolutions and use the Distance Transform after Layer 3 to extract invariant real-valued features. Real-Valued CNN: We use the CIFARNet architecture, converting each complex input channel into two real-valued channels. CDS: We evaluate two models. Type I: We use EConv for convolutions and GTReLU (r = 0) for non-linearity. We use a Division layer after the first EConv to achieve invariance. The final fully-connected layer is replaced with a Prototype Distance layer to predict class logits (Figure 6). Type E: We use EConv for convolutions and Equivariant GTReLU for non-linearity. The final FC layer is replaced with an Invariant Prototype Distance layer to predict logits (Figure 6), and the prototype distance inputs are normalized with Equivariant BatchNorm to preserve equivariance.

CDS-Large: We train a 1.7M parameter Type I model on CIFAR10 with the LAB encoding, and compare it against an equivalently sized DCN (WS with CReLU from [16]). CDS-Large is based on the simplified 4-stage ResNet provided by Page et al. [50] for DAWNBench [51]. We use the conjugate layer after the first EConv to get complex-scale invariant features and feed them to the Complex ResNet. Like DCN, we optimize the model using SGD with horizontal flipping and random cropping augmentations with a varying learning rate schedule (see supplementary material for more details).

5.3. Model Performance Analysis

Accuracy and scalability: Our approach achieves the complex-scale invariance of manifold-based methods while retaining high accuracy and scalability. On MSTAR, our model beats the baselines across a diverse range of splits with less than half the parameters used by SurReal (Table 2). On the smallest training split (5% training data), our model shows a gain of 19.7% against DCN and the real-valued CNN and 8.4% against SurReal. On the largest split (100% training data), our model beats the real-valued CNN by 29.2%, DCN by 7%, and SurReal by 1.2%, demonstrating our advantage across an extensive range of dataset sizes.


Model          # Params   5%     10%    50%    90%    100%
Real           33,050     47.4   46.6   60.6   73     66.9
SurReal [18]   63,690     61.1   68     90.3   95.6   94.9
DCN [16]       863,587    49.8   47     81.9   89.1   89.1
Ours           29,536     69.5   78.3   91.3   95.2   96.1

Table 2. Our method achieves the best accuracy and generalization with the fewest parameters. We report accuracy on varying proportions of the MSTAR training data. The performance gap is wider for smaller train-sets, with the real-valued CNN and DCN failing to generalize.


On CIFAR10, CIFAR100, and SVHN under different encodings, our models obtain the highest accuracy across every setting (Table 1). Unlike SurReal, our model scales to these large classification datasets while retaining complex-scale invariance. For the complex-valued color encodings, which require precise processing of phase information, our model consistently beats the baselines by 4%-8%. These results highlight the advantage of our approach for precise complex-valued processing across a variety of real-valued datasets.

Phase normalization and color jitter: A natural pre-processing trick to address complex-scaling invariance is to compute the average input phase φ̂ and to scale the input by e^{−iφ̂} to cancel it. We test this approach by applying random complex-valued scaling with different rotation ranges and comparing DCN's accuracy with and without phase normalization against our method (Figure 2g). When the input phase distribution is simple (e.g., phase set to 0), phase normalization successfully protects DCN against complex-valued scaling. However, for complicated phase distributions such as the LAB encoding, this method fails. Our method succeeds in both situations, and this robustness transfers to color jitter (as used by [17], see Figure 2f). Our model is more robust against color jitter without data augmentation.
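For reference, this baseline amounts to a two-line pre-processing step; the sketch below uses the angle of the mean pixel value as one reasonable estimate of φ̂, which is an assumption on our part.

```python
import torch

def phase_normalize(z):
    """Estimate a global phase from the complex mean and cancel it before classification."""
    phi_hat = torch.angle(z.mean())
    return z * torch.polar(torch.ones_like(phi_hat), -phi_hat)
```

When the phase distribution is multimodal, as with the LAB encoding, φ̂ is a poor estimate of the applied rotation, which is why this trick fails in Figure 2g.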

Bias and variance analysis: While model accuracy across different datasets is useful, a better measure of the generalization of supervised models is the bias-variance decomposition. We follow [46]: given model f, dataset D, ground truth Y, and instance x, [46] defines the bias-variance decomposition of the prediction error (per instance) as:

Error(x; f) = E[(f(x; D) − Y)²]    (11)
            = Bias(x; f) + Var(x; f) + R(x)    (12)

where bias measures the accuracy of the predictions with respect to the ground truth, variance measures the stability of the predictions, and R denotes the irreducible error. Using the 0-1 loss L0−1, [46] calculates the bias and variance terms (per instance per model) for the classification task as:

[Figure 8. (a) CIFAR10 LAB accuracy for large models:
Method     #Params   %Acc
DCN [16]   1.7M      92.8
Ours       1.7M      93.7
(b) Robustness curve for large models: accuracy (%) versus rotation range (0.0π to 1.0π) for Ours and DCN.]

Figure 8. Our model beats DCN while additionally achieving complex-scale invariance. a) We train DCN and CDS on CIFAR10 with the LAB encoding, achieving higher accuracy. This result is consistent with Table 1, with a smaller margin due to the larger capacity of big models. b) DCN is susceptible to complex-valued scaling. Similar to Figure 2g, we plot average accuracy under different rotation ranges. DCN accuracy degrades under complex-valued scaling, while our method is robust.


Bias(x; h) = L0−1(y_m; t)    (13)

Var(x; h) = (1/n) Σ_{k=1}^{n} L0−1(y^(k); y_m)    (14)

where y_m is the mode, or the main prediction. We compute this metric for each instance, averaging bias and variance over classes. Compared on CIFAR10 with the LAB encoding, our model achieves the lowest bias among all classes and the lowest variance among 9 out of 10 classes (Figure 7).
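Concretely, for one test instance the two terms of Eqs. (13)-(14) can be computed from the predictions of n independently trained models as in this sketch (function name ours):

```python
import numpy as np

def bias_variance_01(predictions, target):
    """predictions: length-n array of predicted class labels for one instance,
    one per independently trained model; target: the ground-truth label."""
    preds = np.asarray(predictions)
    labels, counts = np.unique(preds, return_counts=True)
    y_m = labels[counts.argmax()]              # main (mode) prediction
    bias = float(y_m != target)                # L01(y_m, t)
    variance = float(np.mean(preds != y_m))    # (1/n) sum_k L01(y^(k), y_m)
    return bias, variance
```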

Generalization from less training data: [45] derives empirical trends for the scaling of language models under different conditions, including the overfitting regime where the training set is small compared to the number of parameters. We produce similar trend curves for the MSTAR test results by fitting linear regression curves to the log accuracy and dataset size reported in Table 2. We plot the results in Figure 7. The extrapolated least-squares linear fit suggests our model might continue to generalize better on yet smaller datasets.

Feature redundancy comparison: [47] shows that common CNN architectures learn highly correlated filters. This increases model size and reduces the ability to capture diversity. We follow [47], measuring correlations between guided backpropagation maps of different filters in layer 2 for each model on CIFAR10 with the LAB encoding. We find that our model displays the highest filter diversity. This observation is consistent with the higher test accuracy, lower bias and variance, and leaner models from the previous experiments.

Scaling to large models: Small models are essential for applications such as edge computing, thus motivating leaner models (Tables 1 and 2). However, the best models on large real-valued datasets, like ViT-H/14 [52] with 99.5% test accuracy on CIFAR10, have millions of parameters. We test the scalability of our approach by comparing CDS-Large with a DCN model of equivalent size on CIFAR10 with the LAB encoding (Figure 8a). While we focus on leaner complex-scale invariant models, our method beats DCN while additionally achieving complex-scale invariance even for large models. This observation is consistent with our results for small models (Table 1), showing the effectiveness of our method for diverse model sizes.

Summary: We analyze complex-scaling as a co-domain transformation and derive equivariant/invariant versions of commonly used layers. We also present novel complex encodings. Our approach combines complex-valued algebra with complex-scaling geometry, resulting in leaner and more robust models with better accuracy and generalization.

Acknowledgements: The authors gratefully acknowledge DARPA, AFRL, and NGA for funding various applications of this research. We also thank Matthew Tancik, Ren Ng, Connelly Barnes, and Claudia Tischler for their thoughtful comments.

References

[1] Maurice Weiler, Fred A. Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant CNNs. In CVPR, 2018.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[3] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation, 2017.
[4] Ethan Eade. Lie groups for computer vision. Cambridge Univ., Cambridge, UK, Tech. Rep, 2, 2014.
[5] Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to Lie groups on arbitrary continuous data. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3165–3176. PMLR, 2020.
[6] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990–2999. PMLR, 2016.
[7] Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacchi, and Leonidas J. Guibas. Vector neurons: A general framework for SO(3)-equivariant networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12200–12209, October 2021.
[8] Stéphane Mallat. Understanding deep convolutional networks. Philosophical Transactions A, 374:20150203, 2016.
[9] Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.
[10] Tristan Needham. Visual Complex Analysis. Oxford University Press, 1998.
[11] Tohru Nitta. The computational power of complex-valued neuron. In Joint International Conference ICANN/ICONIP, 2003.
[12] Stella Yu. Angular embedding: A robust quadratic criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):158–173, 2012.
[13] Ivo Danihelka, Greg Wayne, Benigno Uria, Nal Kalchbrenner, and Alex Graves. Associative long short-term memory. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1986–1994. PMLR, 2016.
[14] Brian Cheung, Alexander Terekhov, Yubei Chen, Pulkit Agrawal, and Bruno Olshausen. Superposition of many models into one. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[15] Muneer Ahmad Dedmari, Sailesh Conjeti, Santiago Estrada, Phillip Ehses, Tony Stöcker, and Martin Reuter. Complex fully convolutional neural networks for MR image reconstruction. In Machine Learning for Medical Image Reconstruction, pages 30–38, Cham, 2018. Springer International Publishing.
[16] Chiheb Trabelsi, Olexa Bilaniuk, Ying Zhang, Dmitriy Serdyuk, Sandeep Subramanian, João Felipe Santos, Soroush Mehri, Negar Rostamzadeh, Yoshua Bengio, and Christopher J Pal. Deep complex networks. arXiv preprint arXiv:1705.09792, 2017.
[17] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, volume 30, pages 1195–1204. Curran Associates, Inc., 2017.
[18] Rudrasis Chakraborty, Yifei Xing, and Stella Yu. SurReal: Complex-valued deep learning as principled transformations on a rotational Lie group. arXiv preprint arXiv:1910.11334, 2019.
[19] Alan V Oppenheim. Discrete-Time Signal Processing. Pearson Education India, 1999.
[20] Jon Mathews and Robert Lee Walker. Mathematical Methods of Physics, volume 501. WA Benjamin, New York, 1970.
[21] Witold Kinsner and Warren Grieder. Amplification of signal features using variance fractal dimension trajectory. Int. J. Cogn. Inform. Nat. Intell., pages 1–17, 2010.
[22] Juergen Reichert. Automatic classification of communication signals using higher order statistics. In ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 221–224. IEEE, 1992.
[23] Charles F Cadieu and Bruno A Olshausen. Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24(4):827–866, 2012.
[24] David P Reichert and Thomas Serre. Neuronal synchrony in complex-valued deep networks. arXiv preprint arXiv:1312.6115, 2013.
[25] Stella X. Yu. Angular embedding: from jarring intensity differences to perceived luminance. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2302–2309, 2009.
[26] Stella X. Yu. Angular embedding: A robust quadratic criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):158–173, 2012.
[27] T Nitta. On the critical points of the complex-valued neural network. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02), volume 3, pages 1099–1103. IEEE, 2002.
[28] Akira Hirose and Shotaro Yoshida. Generalization characteristics of complex-valued feedforward neural networks in relation to signal coherence. IEEE Transactions on Neural Networks and Learning Systems, 23(4):541–551, 2012.
[29] Daniel E Worrall and Max Welling. Deep scale-spaces: Equivariance over scale. arXiv preprint arXiv:1905.11697, 2019.
[30] Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5048–5057, 2017.
[31] Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5028–5037, 2017.
[32] Taco Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant CNNs on homogeneous spaces. arXiv preprint arXiv:1811.02017, 2018.
[33] Taco Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral CNN. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2019.
[34] Zhimian Zhang, Haipeng Wang, Feng Xu, and Ya-Qiu Jin. Complex-valued convolutional neural network and its application in polarimetric SAR image classification. IEEE Transactions on Geoscience and Remote Sensing, 55(12):7177–7188, 2017.
[35] Patrick Virtue, Stella X. Yu, and Michael Lustig. Better than real: Complex-valued neural networks for MRI fingerprinting. In International Conference on Image Processing, 2017.
[36] Patrick Virtue. Complex-valued Deep Learning with Applications to Magnetic Resonance Image Synthesis. PhD thesis, UC Berkeley, 2018.
[37] Simone Scardapane, Steven Van Vaerenbergh, Amir Hussain, and Aurelio Uncini. Complex-valued neural networks with nonparametric activation functions. IEEE Transactions on Emerging Topics in Computational Intelligence, 4(2):140–150, 2020.
[38] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456. PMLR, 2015.
[39] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 317–326, 2016.
[40] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
[41] Eric R Keydel, Shung Wu Lee, and John T Moore. MSTAR extended operating conditions: A tutorial. In Algorithms for Synthetic Aperture Radar Imagery III, volume 2757, pages 228–242. International Society for Optics and Photonics, 1996.
[42] Jiayun Wang, Patrick Virtue, and Stella Yu. Successive embedding and classification loss for aerial image classification, 2019.
[43] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2017.
[44] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[45] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.
[46] Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella Yu. Long-tailed recognition by routing diverse distribution-aware experts. In International Conference on Learning Representations, 2021.
[47] Xudong Wang and Stella X Yu. Tied block convolution: Leaner and better CNNs with shared thinner filters. arXiv preprint arXiv:2009.12021, 2020.
[48] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[49] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[50] David Page. How to train your ResNet. https://github.com/davidcpage/cifar10-fast, 2020.
[51] Cody A. Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Re, and Matei Zaharia. DAWNBench: An end-to-end deep learning benchmark and competition. In NeurIPS ML Systems Workshop, 2017.
[52] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[53] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

6. Supplementary Material

Table of contents:

1. Extensive bias-variance evaluation, including plots for both LAB and RGB encodings.
2. Limitations of wFM as a convolutional layer in CNNs.
3. Ablation tests.
4. Architecture details for models used in our experiments.

6.1. Bias-Variance Evaluation

In this section, we run extensive evaluations for bias and variance on each encoding for the CIFAR10 dataset with CIFARNet models (Figure 9). In each case, our model (Type E) achieves the lowest bias for each class. We also achieve the lowest variance for 8 (out of 10) classes for RGB, 9 classes for LAB, and all classes for the Sliding encoding. Our overall bias and variance are the lowest of all models, indicating higher generalization ability.

6.2. Limitations of wFM

We demonstrate the theoretical and empirical limitations of wFM. From a theoretical perspective, we show that wFM using the Manifold Distance Metric of SurReal processes magnitude and phase separately and is thus unable to process the joint distribution. Our experiments show that in practice, wFM results in a significant loss in accuracy compared to real-valued and complex-valued convolutional filters.

Decomposability: We discuss the weighted Fréchet Mean for complex-valued neural networks and show that the magnitude and phase computations are decoupled. wFM is defined as the minimum of weighted distances to a given set of points. In specific cases (like the Euclidean distance metric), closed-form solutions (like the Euclidean weighted mean) exist, but there is no general closed-form solution.

Layer Type   Params   Acc (%)
Complex      66,858   58.6
Real         34,282   63.4
wFM          42,154   52.2

Table 3. wFM results in significant reductions in accuracy. We train models on CIFAR10 with the LAB encoding and tabulate the resulting test accuracy. Compared to a real-valued CNN, wFM results in significantly lower accuracy due to its restricted formulation.

Given {z_i}_{i=1}^{K} ⊂ C and {w_i}_{i=1}^{K} ⊂ (0, 1] with Σ_i w_i = 1 and b ∈ R, we aspire to compute:

arg min_{m ∈ C} Σ_{i=1}^{K} w_i d²(z_i, m)    (15)

Using the manifold distance metric, the expression reduces to:

arg min_{m ∈ C} Σ_{i=1}^{K} w_i ( (ln |z_i| − ln |m|)² + arc²(z_i, m) )    (16)

Note that the objective is a sum of (log |z_i| − log |m|)² and arc²(z_i, m), where the first term depends only on the magnitude of m, and the second term depends only on the phase of m. Thus, the magnitude and phase can be solved independently of each other. The first can be further simplified:

r* = arg min_{r ∈ R} Σ_{i=1}^{K} w_i (log |z_i| − r)²    (17)

⟹ r* = Σ_{i=1}^{K} w_i · log |z_i|    (18)

⟹ log |m*| = Σ_{i=1}^{K} w_i · log |z_i|    (19)

Thus, wFM simply computes a weighted sum of log magnitudes and solves a separate minimization problem to find the phase. This restricts the representational power of a wFM layer compared to that of a convolution.
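As a small numerical check of Eqs. (18)-(19), the closed-form weighted sum of log magnitudes matches a brute-force search over r; the random points, weights, and grid below are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5) + 1j * rng.normal(size=5)      # sample complex points z_i
w = rng.random(5); w /= w.sum()                       # weights w_i in (0, 1], summing to 1

closed_form = np.sum(w * np.log(np.abs(z)))           # Eq. (18): r* = sum_i w_i log|z_i|
rs = np.linspace(closed_form - 1.0, closed_form + 1.0, 20001)
obj = (w * (np.log(np.abs(z))[None, :] - rs[:, None]) ** 2).sum(axis=1)
print(closed_form, rs[obj.argmin()])                  # the two agree up to grid resolution
```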

wFM experiments: We compare wFM against real-valued and complex-valued convolutions on CIFAR10 using a CIFARnet architecture. We find that the lower representational capacity of wFM leads to significant reductions in accuracy (Table 3).

6.3. Ablation tests

In this section, we run ablation tests for our Type I model on CIFAR10 under the LAB encoding to measure the impact of layer choices on final performance. Specifically, we benchmark the Complex Invariant Layer, the Invariant Distance Layer, and different thresholds for the Generalized Tangent ReLU layer (Table 4). We find that the Division layer beats the Conjugate layer in terms of accuracy and that Manifold Distance achieves higher accuracy than our baseline of Euclidean Distance. We also note that among GTReLU thresholds, r = 0.1 performs the best but is not equivariant to magnitude. We also note that higher thresholds like r = 1 result in lower accuracy.

[Figure 9 panels: (a) RGB encoding; (b) LAB encoding; (c) Sliding encoding. Each panel shows class-wise bias and class-wise variance for Ours, DCN, Real-valued, and SurReal.]

Figure 9. Our method demonstrates the lowest overall bias and variance on CIFAR10. We followed [46] for CIFAR10 models under each encoding. Classes are ordered in ascending order of bias for our model. Our model consistently shows the lowest bias for each class in every encoding, and the lowest variance in 8, 9, and 10 out of 10 classes with the RGB, LAB, and Sliding encodings respectively.

Method               Accuracy
Division Layer       67.17
Conjugate Layer      66.73
Euclidean Distance   67.17
Manifold Distance    68.54
GTReLU (r = 0)       67.17
GTReLU (r = 0.1)     68.14
GTReLU (r = 1)       49.15

Table 4. Ablation test results for our Type-I model on CIFAR10 with the LAB encoding. We find that the Division Layer, Manifold Distance, and a GTReLU threshold of r = 0.1 perform the best.

6.4. Architecture Details

In this section, we discuss details of the architectures used in our experiments.

CIFARnet architectures: For CIFARnet architectures, please refer to Tables 5-9. Please note that our replication of wFM [18] uses the (log mag, sin θ, cos θ) encoding for the manifold values, and uses the weighted average formulation.

MSTAR architectures: For DCN, please refer to [16], and for the downsampling block, see Table 11. Our SurReal replication is based on Table I in [18], and our model is based on the SurReal architecture (see Table 13). We use the same real-valued ResNet as the real-valued baseline (see Table 12). In order to pass the complex-valued features into the real-valued ResNet, we convert complex features to real-valued using the (log mag, sin θ, cos θ) encoding, treating each as a separate real-valued channel (resulting in 15 real-valued channels from 5 complex-valued channels).

CDS-Large: For the model architecture, please see Table 10. We train this model with SGD, using momentum 0.9, weight decay constant 5 × 10^{-4}, and a piece-wise linear learning rate schedule starting at 0.01, increasing to 0.2 by epoch 10, then decreasing to 0.01 by epoch 100, 0.001 by 120, 0.0001 by 150, and staying constant until 200. To ensure a fair comparison, we use horizontal flips and random cropping augmentation as used in [16]. All models are implemented in PyTorch [53].

Table 5. SurReal CIFAR Model Architecture

Layer Type        Input Shape    Kernel   Stride   Padding   Output Shape
Complex CONV      [3, 32, 32]    3×3      2        1         [16, 16, 16]
G-transport       [16, 16, 16]   -        -        -         [16, 16, 16]
Complex CONV      [16, 16, 16]   3×3      2        1         [32, 8, 8]
G-transport       [32, 8, 8]     -        -        -         [32, 8, 8]
Complex CONV      [32, 8, 8]     3×3      2        1         [64, 4, 4]
G-transport       [64, 4, 4]     -        -        -         [64, 4, 4]
Distance Layer    [64, 4, 4]     -        -        -         [64, 4, 4]
Average Pooling   [64, 4, 4]     4×4      -        -         [64, 1, 1]
FC                [64]           -        -        -         [128]
ReLU              [128]          -        -        -         [128]
FC                [128]          -        -        -         [10]

Table 6. DCN CIFAR Model Architecture

Layer Type        Input Shape    Kernel   Stride   Padding   Output Shape
Complex CONV      [3, 32, 32]    3×3      2        1         [16, 16, 16]
CReLU             [16, 16, 16]   -        -        -         [16, 16, 16]
Complex CONV      [16, 16, 16]   3×3      2        1         [32, 8, 8]
CReLU             [32, 8, 8]     -        -        -         [32, 8, 8]
Complex CONV      [32, 8, 8]     3×3      2        1         [64, 4, 4]
CReLU             [64, 4, 4]     -        -        -         [64, 4, 4]
Average Pooling   [64, 4, 4]     4×4      -        -         [64, 1, 1]
Complex-to-Real   [64, 1, 1]     4×4      -        -         [128]
FC                [128]          -        -        -         [128]
ReLU              [128]          -        -        -         [128]
FC                [128]          -        -        -         [10]

Table 7. Our (Type-E) CIFAR Model Architecture

Layer Type                     Input Shape    Kernel   Stride   Padding   Output Shape
Econv                          [3, 32, 32]    3×3      2        1         [16, 16, 16]
Eq. GTReLU                     [16, 16, 16]   -        -        -         [16, 16, 16]
Econv                          [16, 16, 16]   3×3      2        1         [32, 8, 8]
Eq. GTReLU                     [32, 8, 8]     -        -        -         [32, 8, 8]
Econv                          [32, 8, 8]     3×3      2        1         [64, 4, 4]
Eq. GTReLU                     [64, 4, 4]     -        -        -         [64, 4, 4]
Average Pooling                [64, 4, 4]     4×4      -        -         [64, 1, 1]
Equivariant FC                 [64]           -        -        -         [128]
Invariant Prototype Distance   [128]          -        -        -         [10]

Table 8. Our (Type-I) CIFAR Model Architecture

Layer Type           Input Shape    Kernel   Stride   Padding   Output Shape
Econv                [3, 32, 32]    3×3      2        1         [16, 16, 16]
Division Layer       [16, 16, 16]   3×3      -        1         [16, 16, 16]
GTReLU               [16, 16, 16]   -        -        -         [16, 16, 16]
Econv                [16, 16, 16]   3×3      2        1         [32, 8, 8]
GTReLU               [32, 8, 8]     -        -        -         [32, 8, 8]
Econv                [32, 8, 8]     3×3      2        1         [64, 4, 4]
GTReLU               [64, 4, 4]     -        -        -         [64, 4, 4]
Average Pooling      [64, 4, 4]     4×4      -        -         [64, 1, 1]
Equivariant FC       [64]           -        -        -         [128]
Prototype Distance   [128]          -        -        -         [10]

Table 9. 2-Channel Real-Valued CIFAR Model Architecture

Layer Type        Input Shape    Kernel   Stride   Padding   Output Shape
CONV              [3, 32, 32]    3×3      2        1         [16, 16, 16]
ReLU              [16, 16, 16]   -        -        -         [16, 16, 16]
CONV              [16, 16, 16]   3×3      2        1         [32, 8, 8]
ReLU              [32, 8, 8]     -        -        -         [32, 8, 8]
CONV              [32, 8, 8]     3×3      2        1         [64, 4, 4]
ReLU              [64, 4, 4]     -        -        -         [64, 4, 4]
Average Pooling   [64, 4, 4]     4×4      -        -         [64, 1, 1]
FC                [64]           -        -        -         [128]
ReLU              [128]          -        -        -         [128]
FC                [128]          -        -        -         [10]

Table 10. Our CDS-Large Model Architecture

Layer Type            Input Shape     Kernel   Stride   Padding   Output Shape
Econv                 [3, 32, 32]     3×3      1        1         [64, 32, 32]
Conjugate Layer       [64, 32, 32]    1×1      -        -         [64, 32, 32]
Econv (Groups=2)      [64, 32, 32]    3×3      1        1         [64, 32, 32]
ComplexBatchNorm      [64, 32, 32]    -        -        -         [64, 32, 32]
CReLU                 [64, 32, 32]    -        -        -         [64, 32, 32]
Econv (Groups=2)      [64, 32, 32]    3×3      1        1         [128, 32, 32]
ComplexBatchNorm      [128, 32, 32]   -        -        -         [128, 32, 32]
CReLU                 [128, 32, 32]   -        -        -         [128, 32, 32]
Eq. MaxPool           [128, 32, 32]   2×2      -        -         [128, 16, 16]
ResBlock (groups=2)   [128, 16, 16]   -        -        -         [128, 16, 16]
Econv (Groups=4)      [128, 16, 16]   3×3      1        1         [256, 16, 16]
ComplexBatchNorm      [256, 16, 16]   -        -        -         [256, 16, 16]
CReLU                 [256, 16, 16]   -        -        -         [256, 16, 16]
Eq. MaxPool           [256, 16, 16]   2×2      -        -         [256, 8, 8]
Econv (Groups=2)      [256, 8, 8]     3×3      1        1         [512, 8, 8]
ComplexBatchNorm      [512, 8, 8]     -        -        -         [512, 8, 8]
CReLU                 [512, 8, 8]     -        -        -         [512, 8, 8]
Eq. MaxPool           [512, 8, 8]     2×2      -        -         [512, 4, 4]
ResBlock (groups=4)   [512, 4, 4]     -        -        -         [512, 4, 4]
Eq. MaxPool           [512, 4, 4]     2×2      -        -         [512, 1, 1]
Fully Connected       [1024]          -        -        -         [10]

Table 11. DCN Down-sampling Block for MSTAR

Layer Type         Input Shape     Kernel   Stride   Padding   Output Shape
Complex CONV       [1, 128, 128]   3×3      2        1         [12, 64, 64]
ComplexBatchNorm   [12, 64, 64]    -        -        -         [12, 64, 64]
Complex CONV       [1, 64, 64]     3×3      2        1         [12, 32, 32]
ComplexBatchNorm   [12, 32, 32]    -        -        -         [12, 32, 32]

Table 12. MSTAR Real-valued Model Architecture

Layer Type       Input Shape     Kernel   Stride   Output Shape
CONV             [2, 100, 100]   5×5      1        [30, 96, 96]
GroupNorm+ReLU   [30, 96, 96]    -        -        [30, 96, 96]
ResBlock         [30, 96, 96]    -        -        [40, 96, 96]
MaxPool          [40, 96, 96]    2×2      2        [40, 48, 48]
CONV             [40, 48, 48]    5×5      3        [50, 15, 15]
GroupNorm+ReLU   [50, 15, 15]    -        -        [50, 15, 15]
ResBlock         [50, 15, 15]    -        -        [60, 15, 15]
CONV             [60, 15, 15]    2×2      1        [70, 14, 14]
GroupNorm+ReLU   [70, 14, 14]    -        -        [70, 14, 14]
AveragePool      [70, 14, 14]    -        -        [70]
FC               [70]            -        -        [30]
ReLU             [30]            -        -        [30]
FC               [30]            -        -        [10]

Table 13. Our MSTAR Model Architecture

Layer Type         Input Shape     Kernel   Stride   Padding   Output Shape
Econv (Groups=5)   [1, 100, 100]   5×5      1        0         [5, 96, 96]
Eq. GTReLU         [5, 96, 96]     -        -        -         [5, 96, 96]
Eq. MaxPool        [5, 96, 96]     2×2      2        -         [5, 48, 48]
Econv              [5, 48, 48]     3×3      2        0         [5, 23, 23]
Eq. GTReLU         [5, 23, 23]     -        -        -         [5, 23, 23]
Division Layer     [5, 23, 23]     3×3      -        -         [5, 21, 21]
Complex-to-Real    [5, 21, 21]     -        -        -         [15, 21, 21]
CONV (Groups=5)    [15, 21, 21]    5×5      1        -         [30, 17, 17]
GroupNorm+ReLU     [30, 17, 17]    -        -        -         [30, 17, 17]
ResBlock           [30, 17, 17]    -        -        -         [40, 17, 17]
MaxPool            [40, 17, 17]    2×2      2        -         [40, 8, 8]
CONV (Groups=5)    [40, 8, 8]      5×5      3        -         [50, 2, 2]
GroupNorm+ReLU     [50, 2, 2]      -        -        -         [50, 2, 2]
ResBlock           [50, 2, 2]      -        -        -         [60, 2, 2]
CONV (Groups=5)    [60, 2, 2]      2×2      1        -         [70, 1, 1]
GroupNorm+ReLU     [70, 1, 1]      -        -        -         [70, 1, 1]
FC                 [70]            -        -        -         [30]
ReLU               [30]            -        -        -         [30]
FC                 [30]            -        -        -         [10]