
End-to-End Auditory Object Recognition via Inception Nucleus

Mohammad K. Ebrahimpour 1,3, Timothy Shea 3, Andreea Danielescu 3, David C. Noelle 1,2, Christopher T. Kello 2

1 Electrical Engineering and Computer Science, UC Merced; 2 Cognitive and Information Sciences, UC Merced; 3 Accenture Labs

May 2020


    Auditory Object Recognition

    What is auditory object recognition?

It is the process by which we automatically assign labels to sounds based on their characteristics.

• Some of the categories we are trying to classify: sirens, dog barking, drilling, engine idling, gun shot.


    Traditional Auditory Object Recognition Pipeline

Raw Audio → MFCC → Classifier
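To make this pipeline concrete, here is a minimal sketch of the traditional approach, assuming librosa and scikit-learn; the file list and labels (train_files, train_labels) are hypothetical placeholders.

```python
# A sketch of the traditional pipeline (not this paper's method):
# hand-engineered MFCC features feed a generic classifier.
import librosa
import numpy as np
from sklearn.svm import SVC

def mfcc_features(path, sr=8000, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr)                  # load and resample
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                          # average over time

# train_files and train_labels are hypothetical placeholders.
X = np.stack([mfcc_features(f) for f in train_files])
clf = SVC().fit(X, train_labels)
```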


    Contributions:

    Two main contributions of this paper:

    Feature Learning

    Removing hand-engineered features by learning more specific features.

    Number of Parameters

Reducing the number of parameters in our model while improving performance.


    Motivation

    Flaws in traditional approaches

• The traditional pipeline needs substantial hand-engineered feature extraction.

• Engineered features need to be tuned for every problem.
• Engineered features are not necessarily the optimal features.


    Motivation (Continued)

    Solution: Learning features and labels simultaneously!

• Deep neural networks have repeatedly proven that they are very effective at feature learning.

• Deep neural networks can map raw input to labels with nested transformation functions.

    Advantages of learning features and labels simultaneously!

• This approach eliminates the need for feature engineering.
• This method is able to obtain more accurate results due to features learned for the problem at hand.
• This approach is more suitable for real-time applications.


    The Network Architecture

1D Conv layers → Inception Nucleus → 2D Conv layers → GAP → Softmax


    1D Conv. layers

    Input preparation

We read audio recordings at an 8 kHz sampling rate.

    1D Convolutions

Since the input is a 1D signal, we apply 1D convolution layers directly on top of it to extract features.
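As a minimal sketch (assuming Keras), the first layer of Table 1 applies a strided 1D convolution directly to the raw waveform; the input shape and padding choice here are assumptions:

```python
# A 1D convolution over the raw waveform (~4 s at 8 kHz -> 32000 samples),
# following the "Conv1D,32,80,4" spec in Table 1.
import tensorflow as tf

waveform = tf.keras.Input(shape=(32000, 1))
features = tf.keras.layers.Conv1D(filters=32, kernel_size=80, strides=4,
                                  activation="relu")(waveform)
```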


    Inception Nucleus

• The inception nucleus is built from several “1D Conv” branches applied in parallel to the same input.
• Each of these “1D Conv” branches uses a different kernel size, chosen to capture richer features.
• We concatenate the resulting feature maps channel-wise to build the output receptive field (see the sketch below).
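A sketch of the inception nucleus, assuming Keras. Kernel sizes 4, 8, and 16 follow Fig. 1; each branch is shown as a single convolution for brevity, although Table 1 stacks two convolutions in the larger-kernel branches.

```python
# Parallel 1D convolutions with different kernel sizes, concatenated
# channel-wise (the inception nucleus of Fig. 1).
import tensorflow as tf

def inception_nucleus(x, filters=64, stride=4):
    branches = [
        tf.keras.layers.Conv1D(filters, k, strides=stride, padding="same",
                               activation="relu")(x)
        for k in (4, 8, 16)
    ]
    return tf.keras.layers.Concatenate(axis=-1)(branches)
```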


    Transpose Layer

    Transposing the receptive field

• The receptive field of the inception nucleus layer lies in R^{m×1×n}. We transpose the result of the inception nucleus layer to get a tensor in R^{m×n×1}.
• This step is crucial: it yields a grayscale image with width = m and height = n. With this technique we are able to apply 2D convolutions, which lets us extract richer features.
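A sketch of the transpose step, assuming Keras; the concrete sizes m = 2000 and n = 192 are hypothetical, chosen only to match the layer specs elsewhere in the deck.

```python
# Rearrange the inception-nucleus output so it reads as a grayscale image
# of width m and height n, with a single channel.
import tensorflow as tf

x = tf.keras.Input(shape=(2000, 1, 192))             # R^{m x 1 x n}
image_like = tf.keras.layers.Permute((1, 3, 2))(x)   # -> R^{m x n x 1}
```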


    Typical 2D Conv. layers

    Receptive field as a grayscale image

• By the transposing trick, we are able to apply 2D convolutions instead of 1D convolutions.
• Our experiments, along with the audio classification literature, suggest that 2D convolutions yield stronger features.
• In our method, we used VGG-standard 2D convolution layers with a kernel size of 3 × 3 and a stride of 1.
• Each convolution layer is followed by max pooling with a kernel size of 2 × 2 and a stride of 2.
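A sketch of one such VGG-style block, assuming Keras; the padding="same" choice is an assumption.

```python
# 3x3 convolution with stride 1, followed by 2x2 max pooling with stride 2.
import tensorflow as tf

def vgg_block(x, filters):
    x = tf.keras.layers.Conv2D(filters, (3, 3), strides=1, padding="same",
                               activation="relu")(x)
    return tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)
```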


    Global Average Pooling (GAP) Layer

GAP
• We borrowed the GAP concept from the computer vision literature.
• The GAP layer aggregates the spatial activations of the last convolutional layer. It is computed as follows:

    GAP(k) = \frac{1}{m \cdot n} \sum_{i,j} F(i, j, k)    (1)

where m and n are the width and height of the last feature map and k indexes its filters.

• Using a GAP layer reduces the number of parameters dramatically. It can also be viewed as a regularizer, and thus it helps generalization.
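A quick sketch of Eq. (1), assuming Keras running eagerly: GlobalAveragePooling2D matches a plain mean over the two spatial axes.

```python
import tensorflow as tf

feature_map = tf.random.normal((1, 4, 4, 10))        # (batch, m, n, k)
gap = tf.keras.layers.GlobalAveragePooling2D()(feature_map)
manual = tf.reduce_mean(feature_map, axis=[1, 2])    # Eq. (1), per channel
assert float(tf.reduce_max(tf.abs(gap - manual))) < 1e-6
```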


    The Network Architecture

1D Conv layers → Inception Nucleus → 2D Conv layers → GAP → Softmax


The Details of the Network Architecture

Table 1. Our proposed deep neural network architectures. Each column belongs to one network, with its number of parameters given next to its name. The convolutional layer parameters are denoted as “conv (1D or 2D), (number of channels), (kernel size), (stride).” Layers with batch normalization are denoted with BN.

Inception Nucleus Nets Configurations

Input (32000 × 1), all networks.

First stage:
    Inception (289K):     Conv1D,32,80,4
    Inception-FA (789K):  Conv1D,32,80,4
    Inception-FI (479K):  Inception Nucleus: Conv1D,32,60,4; Conv1D,[32,80,4]×2; Conv1D,[32,100,4]×2
    Inception-BN (292K):  Conv1D,32,80,4 with BN

Inception nucleus stage:
    Inception:     Conv1D,64,4,4; Conv1D,[64,8,4]×2; Conv1D,[64,16,4]×2
    Inception-FA:  Conv1D,64,20,4; Conv1D,[64,40,4]×2; Conv1D,[64,60,4]×2
    Inception-FI:  Conv1D,64,4,4; Conv1D,[64,8,4]×2; Conv1D,[64,16,4]×2
    Inception-BN:  Conv1D,64,4,4 BN; Conv1D,[64,8,4]×2 BN; Conv1D,[64,16,4]×2 BN

Shared tail (all networks; the BN variant adds BN to each Conv2D):
    Max Pooling 1D, 64,10,1
    Reshape (put the channels first)
    Conv2D,32,3×3,1
    Max Pooling 2D, 32,2×2,2
    Conv2D,64,3×3,1
    Conv2D,64,3×3,1
    Max Pooling 2D, 64,2×2,2
    Conv2D,128,3×3,1
    Max Pooling 2D, 128,2×2,2
    Conv2D,10,1×1,1
    Global Average Pooling
    Softmax

Fig. 1. Inception nucleus. The input comes from the previous layer and is passed to the 1D convolutional layers with kernel sizes of 4, 8, and 16 to capture a variety of features. The convolutional layer parameters are denoted as “conv1D, (number of channels), (kernel size), (stride).” All of the receptive fields are concatenated channel-wise in the concatenation layer.

Our model outperforms the current state-of-the-art networks on the UrbanSound8k dataset by 10.4 percentage points. Thus, our CNN is a strong candidate for low-power, always-on sound classification applications. In addition, we analyze the learned representations, using visualizations to reveal wavelet-like transforms in early layers, supporting deeper representations that are discriminative and meaningful, even with reduced dimensionality.

2. PROPOSED METHOD

Our proposed end-to-end neural network takes time-domain waveform data (not engineered representations) and processes it through several 1D convolutions, the inception nucleus, and 2D convolutions to map the input to the desired outputs. The details of the proposed architectures are described in Table 1. The overall design can be summarized as follows:

Fully Convolutional Network. We propose an inception nucleus convolution layer that contains a series of 1D convolutional layers followed by nonlinearities (i.e., a ReLU layer) to reduce the sensitivity of the architecture to kernel size. Convolutional networks are well suited for audio signals for the following reasons. First, similar to images, we want our network to be translation invariant, to reduce the number of parameters efficiently. Second, convolutional networks allow us to stack layers, which gives us the opportunity to detect higher-level concepts through a series of lower-level detectors. We used Global Average Pooling (GAP) in our architectures to aggregate the spatial information in the last convolutional layer and map this information onto class labels. GAP greatly reduces the number of parameters, making the network relatively light to implement.

Variable-Length Input/Output. Since sounds vary in temporal length, we want our network to handle variable-length inputs. To do this, we use a fully convolutional network. As convolutional layers are invariant to location, we can convolve each layer based on the length of the input. The input layer to our network is a 1D array representing the audio waveform, denoted X ∈ R^{32000×1}, since the audio files are about 4 seconds long and the sampling rate was set to 8 kHz. The network is designed to learn a set of parameters, ω, to map the input to the prediction, Ŷ, based on
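Reading the reconstructed Table 1, a minimal end-to-end sketch of the baseline Inception column might look as follows, assuming Keras. Padding choices, the single-conv branches, and the reshape bookkeeping are assumptions, since the table does not pin them down.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_nucleus(x, filters, stride=4, kernels=(4, 8, 16)):
    # Parallel 1D convs, concatenated channel-wise (branches simplified
    # to a single conv each; Table 1 stacks two in the larger branches).
    branches = [layers.Conv1D(filters, k, strides=stride, padding="same",
                              activation="relu")(x) for k in kernels]
    return layers.Concatenate()(branches)

inp = tf.keras.Input(shape=(32000, 1))               # ~4 s at 8 kHz
x = layers.Conv1D(32, 80, strides=4, padding="same",
                  activation="relu")(inp)            # Conv1D,32,80,4
x = inception_nucleus(x, 64)                         # kernels 4/8/16
x = layers.MaxPooling1D(pool_size=10, strides=1, padding="same")(x)
m, n = x.shape[1], x.shape[2]                        # (time, channels)
x = layers.Reshape((m, n, 1))(x)                     # grayscale "image"
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2, strides=2)(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2, strides=2)(x)
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2, strides=2)(x)
x = layers.Conv2D(10, 1)(x)                          # one map per class
x = layers.GlobalAveragePooling2D()(x)               # GAP over space
out = layers.Activation("softmax")(x)
model = tf.keras.Model(inp, out)
```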


    Dataset

• We evaluated our model on the UrbanSound8K [1] dataset.
• The dataset contains 8,732 audio samples, each 4 seconds or less in length.
• There are 10 classes in this dataset: sirens, dog barking, drilling, engine idling, gun shot, air conditioning, children playing, jackhammer, street music, car horn.

[1] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the ACM International Conference on Multimedia, 2014.


    Comparison

• We compared our proposed inception nucleus method with the state-of-the-art methods.

nested mapping functions, given by Eq. (1):

    \hat{Y} = F(X \mid \omega) = f_k(\dots f_2(f_1(X \mid \omega_1) \mid \omega_2) \mid \dots \omega_k)    (1)

where k is the number of hidden layers and f_i is a typical convolution layer followed by a pooling operation.

Inception Nucleus Layer. We propose the use of an inception nucleus to produce a more robust architecture for sound classification. This approach also makes the architecture less sensitive to idiosyncratic variance in audio files. A schematic representation of the inception nucleus appears in Fig. 1. The inputs to the inception nucleus are the feature maps of the previous layer. Three 1D convolutions with different kernels are then applied to the inputs to capture a variety of features. We test the following kernel sizes in our experiments: 4, 8, 16, 20, 40, 60, 80, 100 (see Section 3). After obtaining the feature maps from our convolutional layers, we concatenate the receptive fields in a channel-wise manner.

Reshape. After applying 1D convolutions to the waveforms and obtaining low-level features, the feature map L lies in R^{1×m×n}. We can treat L as a grayscale image with width = m, height = n, and channel = 1. For simplicity, we transpose the tensor L to L' ∈ R^{m×n×1}. From here, we apply normal 2D convolutions with the VGG-standard kernel size of 3 × 3 and stride = 1 [1]. The pooling layers have kernel size 2 × 2 and stride = 2. We also implemented the inception nucleus with batch normalization to analyze the effect of batch normalization on our approach, as explained in Section 3.

Global Average Pooling (GAP). In the last convolutional layer we compute GAP to aggregate the most abstract features over the spatial dimensions and reduce the number of outputs to the class labels. We use GAP instead of max pooling to reduce the number of parameters and to avoid adding fully connected layers at the end of the network. It has been noted in the computer vision literature that aggregating features across spatial locations and channels keeps important information while reducing the number of parameters [14, 15]. We intentionally did not use fully connected layers with a softmax activation function, to avoid overfitting, since fully connected layers greatly increase the number of parameters. GAP is computed as follows:

    \mathrm{GAP}_c = \frac{1}{w \times h} \sum_{i,j} A(i, j, c)    (2)

where w, h, and c are the width, height, and channel of the last feature map (A).

3. EXPERIMENTAL RESULTS

We tested our network on the UrbanSound8k dataset, which contains 10 kinds of environmental sounds in urban areas, such as drilling, car horn, and children playing [16]. The dataset consists of 8,732 audio clips of 4 seconds or less,

Table 2. Accuracy of different approaches on the UrbanSound8k dataset. The first column gives the name of the method, the second column the accuracy of the model on the test set, and the third column the number of parameters. Our proposed method has the fewest parameters and achieves the highest test accuracy.

Model                           Test      # Parameters
M3-fc [9]                       46.82%    129M
M5-fc [9]                       62.76%    18M
M11-fc [9]                      68.29%    1.8M
M18-fc [9]                      64.93%    8.7M
M3-Big [9]                      57.55%    0.5M
RCNN [20]                       71.68%    3.7M
ACLNet [11]                     65.32%    2M
EnvNet-v2 [21]                  78%       101M
PiczakCNN [22]                  73%       26M
VGG [23]                        70%       77M
Inception Nucleus-BN (Ours)     83.2%     292K
Inception Nucleus-FA (Ours)     70.9%     789K
Inception Nucleus-FI (Ours)     75.3%     479K
Inception Nucleus (Ours)        88.4%     289K

totalling 9.7 hours. We padded zeros to the samples that were less than 4 seconds. To speed computation, the audio waveforms were down-sampled to 8 kHz and standardized to zero mean and unit variance. We shuffled the training data to enhance variability in the training set.

We trained the CNN models using the Adam [17] optimizer, a variant of stochastic gradient descent that adaptively tunes the step size for each dimension. We used Glorot weight initialization [18] and trained each model with batch size 32 for up to 300 epochs, until convergence. To avoid overfitting, all weight parameters were penalized by their ℓ2 norm, using a λ coefficient of 0.0001. Our models were implemented in Keras [19] and trained using a GeForce GTX 1080 Ti GPU.

Table 2 provides classification performance on the test set, along with the number of parameters used, for the UrbanSound8k dataset. The table shows that our CNN outperformed the other methods in terms of test classification accuracy, with the fewest parameters. Preliminary simulations revealed that fully connected layers at the end of the network caused overfitting due to an explosion in the number of weight parameters. These preliminary results led us to use a fully convolutional network with a reduced number of parameters. We note that the deeper networks (M5, M11, and M18) can improve performance if their architectures are well designed. However, our inception nucleus model is 88.4% accurate, which outperforms the reported test accuracy of CNNs on spectrogram input using the same dataset by a large margin [22]. Also, Inception Nucleus-FI achieves very good results in terms of both accuracy and number of parameters.
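A sketch of the preprocessing and training setup described above, assuming librosa and Keras; model, train_files, and y_train are hypothetical placeholders, and the per-layer ℓ2 penalty is indicated in a comment rather than wired into every layer.

```python
import librosa
import numpy as np
import tensorflow as tf

def preprocess(path, target_sr=8000, length=32000):
    y, _ = librosa.load(path, sr=target_sr)                # down-sample to 8 kHz
    y = np.pad(y, (0, max(0, length - len(y))))[:length]   # zero-pad to 4 s
    return (y - y.mean()) / (y.std() + 1e-8)               # standardize

# train_files and y_train are hypothetical placeholders; `model` is the
# network sketched earlier. The L2 penalty would be attached per layer,
# e.g. Conv1D(..., kernel_regularizer=tf.keras.regularizers.l2(1e-4)).
x_train = np.stack([preprocess(f) for f in train_files])[..., np.newaxis]
model.compile(optimizer=tf.keras.optimizers.Adam(),        # adaptive steps
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=300, shuffle=True)
```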


    Lower Level Representation Analysis

    Shallower layers

Our experiments suggest that the very first layer learns wavelet-like features that capture different frequencies.


    Higher Level Representation Analysis

    Deeper Layers

Our experiments suggest that the high-level representations are perceptually meaningful and relatively separable, even in a low-dimensional space.


    Take Away ...

• Using the Inception Nucleus helps the network learn richer features.
• Global Average Pooling reduces the number of parameters and improves generalization.
• Given relatively small datasets, very deep models don't generalize well.
• Batch normalization didn't help generalization here.
• End-to-end auditory object recognition boosts performance by learning features specific to the problem.
• Representation analysis suggests that shallower layers learn wavelet-like features and deeper layers learn perceptually meaningful representations.
• A smaller number of parameters means the technique may be useful for networks running on mobile or edge devices.


Questions?!

Code is available!
• The full implementation of the paper is accessible in the following GitHub repository: https://github.com/mkebrahimpour/e2e-Inception_Nucleus
