

Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2019

Watermarking in Audio using Deep Learning

Lukas Tegendal


Master of Science Thesis in Electrical Engineering

Watermarking in Audio using Deep Learning

Lukas Tegendal

LiTH-ISY-EX--19/5246--SE

Supervisor: Abdelrahman Eldesokey, ISY, Linköpings universitet

Andreas Rossholm

Examiner: Lasse Alfredsson, ISY, Linköpings universitet

Computer Vision Laboratory
Department of Electrical Engineering

Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2019 Lukas Tegendal


Abstract

Watermarking is a technique used to mark the ownership of media such as audio or images by embedding a watermark, e.g. copyright information, into the media. A good watermarking method should perform this embedding without affecting the quality of the media. Recent methods for watermarking in images use deep learning to embed and extract the watermark. In this thesis, we investigate watermarking in the audible frequencies of audio using deep learning. More specifically, we try to create a watermarking method for audio that is robust to noise in the carrier, and that allows for the extraction of the embedded watermark from the audio after being played over-the-air. The proposed method consists of two deep convolutional neural networks trained end-to-end on music with simulated noise. Experiments show that the proposed method successfully creates watermarks robust to simulated noise with moderate quality reductions, but it is not robust to the real-world noise introduced after playing and recording the audio over-the-air.



Contents

1 Introduction
  1.1 Problem Definition
  1.2 Objectives
  1.3 Delimitations
  1.4 Organisation of the thesis

2 Background
  2.1 Watermarking
  2.2 Prerequisites
    2.2.1 Short Time Fourier Transform
    2.2.2 Neural Network Architectures
  2.3 Watermarking Evaluation

3 Method
  3.1 Data
    3.1.1 Audio
    3.1.2 Messages
  3.2 Architecture
    3.2.1 Encoder
    3.2.2 Decoder
    3.2.3 Noise
  3.3 Training

4 Results
  4.1 Quality
  4.2 Robustness
    4.2.1 No Noise
    4.2.2 All Noise Types
    4.2.3 Gaussian Noise
    4.2.4 Dropout Noise
    4.2.5 Crop & Roll Noise
    4.2.6 Music Noise
  4.3 Evaluation Over-the-air
  4.4 Message representation
    4.4.1 Difference between models
    4.4.2 Difference between messages

5 Discussion
  5.1 Results
  5.2 Method

6 Conclusions
  6.1 Future work

A Encoded spectrograms

Bibliography


1 Introduction

1.1 Problem Definition

Piracy of licensed and copyrighted audio and music has been a problem for decades. By bypassing authorized distribution channels, the content is shared without paying the creators and the rightsholders. To mitigate the usage of this illegally obtained content, one must be able to prove that the content does not come from legitimate sources. One tool for doing so is to tag the content with some message or metadata before distributing it. When encountering content that is suspected to be illegally distributed, the message can be used to track its origin. This technique is called watermarking.

Watermarking is when the audio itself is modified to carry the message. This poses many challenges since the modifications to the audio should not be perceptible to the listener, while still being detectable when extracting the message. Traditional approaches to watermarking are sensitive to noise. If the encoded audio is modified or affected by noise it will no longer be possible to extract the watermark. When the content owner wants to detect and identify the watermark from the carrier, they need access to the digital audio file. If the watermarking method used is robust to the noise introduced when the audio is played over-the-air, the watermark could be extracted by recording the played audio.

Recent methods for watermarking in images use machine learning to embed the message in the image. With these methods it is possible to make the watermarking more robust to different types of noise. If these methods could be adapted to audio, it would be possible to create robust watermarks in the audio domain as well.


1.2 Objectives

The goal of this thesis is to develop a method for robust watermarking of audio using deep learning. More specifically, robust in the sense that a watermark can be extracted after playing and recording the audio over-the-air. Over-the-air means that the audio is played from a speaker, and then recorded with a microphone somewhere else in the room. The encoding should be made in the audible frequencies, due to limitations in consumer grade hardware. The task includes constructing a pipeline of an encoder and a decoder, trained jointly using simulated noise. The challenges to be solved include designing an encoder and a decoder, and training these together with a differentiable simulation of noise in an over-the-air channel. An overview of such a system is shown in Figure 1.1.

The method should perform according to two different metrics. First, the method should be robust to noise: after playing and recording the audio over-the-air, the decoder should still be able to extract the encoded message. Second, the distortion introduced during encoding of the carrier should be minimized.

These objectives can be formulated as the following research questions, which we try to answer in this work.

1. How can a model for robust audio watermarking be constructed using deep learning?

2. What is the message reconstruction accuracy and the audio distortion level when different types of noise are added, for such a watermarking design?

3. What is the message reconstruction accuracy after playing the encoded audio over-the-air?

Figure 1.1: Overview of a watermarking pipeline

1.3 Delimitations

It is assumed that the encoded audio has not been modified before being played. This also includes the assumption that the audio is limited to one channel and uncompressed. The thesis focuses on noise introduced by the over-the-air channel, and the scope is therefore limited to this setting.

A field closely related to watermarking is steganography. In steganography the goal is to embed a message in the media, with the additional requirement that the perturbations of the carrier are undetectable by a third party, e.g. through a statistical analysis of the encoded file. In this work we do not consider this factor and focus solely on making the changes imperceptible when listening.

Due to the overhead of a qualitative listening study, audio quality will only be measured quantitatively using appropriate metrics.

1.4 Organisation of the thesis

The thesis has the following structure. Chapter 2 describes the background, the problem, and work related to the focus of this thesis, including a section covering prerequisites in audio processing and machine learning relevant for this work. Chapter 3 covers the network architecture used for the watermarking pipeline, a description of the different noise types used for simulation, and details on the training of the model. In Chapter 4, the results from the model evaluation are presented. Chapter 5 contains a discussion of the results and the method. Chapter 6 contains the conclusions of the thesis and suggestions for future work.


2 Background

This chapter contains an introduction to watermarking and related work in this area of research. An introduction to the Short Time Fourier Transform is given together with a description of relevant neural network architectures. Finally, the evaluation criteria used in this thesis are described.

2.1 Watermarking

A watermark is a message or metadata embedded in media such as images or audio. The watermark is embedded in a way that should not affect the perceived quality of the media.

The message is a text string that is embedded into an image or an audio clip, also called the carrier. The typical use case of watermarking consists of two steps. In the first step, encoding, the message is hidden in the carrier; this is performed before the audio or image is distributed. In the second step, decoding, the rightsholder checks an already distributed audio file for an encoded watermark.

Watermarking is not a new idea and has been approached as a pure signal processing problem for a long time. Classical approaches include more or less hand-crafted solutions to encode hidden information in image and audio files. These approaches include LSB encoding [17], phase coding [2] and spread spectrum watermarking [5]. Lin and Abdulla [12] present a survey of different traditional approaches to watermarking.

In recent years another approach for watermarking has surfaced, using machine learning to learn the representation of information in the media. Baluja [1] proposes a method for steganography in images, where the transferred message is another image. The method allows for a lossy transfer of very large messages, although the changes to the carrier are not imperceptible in some cases. The method shows the potential and impressive capabilities of machine learning approaches to watermarking and steganography.

Another method for steganography and watermarking in images is proposed by Zhu et al. in [18]. They let the carrier be distorted by noise after encoding, forcing the learned representation of the encoded message to be robust to the introduced noise. In this way they manage to make the watermark robust to lossy JPEG compression.

A subject closely related to watermarking is adversarial examples, an area that has been frequently researched in recent years. An adversarial example is some data (e.g. an image or an audio clip) that has been imperceptibly modified to fool a secondary system (e.g. an image classifier or an automatic speech recognition system). The similarities to watermarking are striking: the modified data is the carrier, and the secondary system is the decoder. Rather than fooling the secondary system, one uses the carrier modifications to carry the watermark. One major difference is that in the watermarking system, the secondary system is trained together with the encoder. This area was initially studied by Szegedy et al. in [16], followed by Goodfellow et al. in [6]. Kurakin et al. [11] show that these adversarial properties of high dimensional neural networks can remain robust to noise in the physical world.

When it comes to audio adversarial examples, most research tries to fool automatic speech recognition systems. Carlini and Wagner [3] propose a method that can successfully fool an automatic speech recognition system, but the samples do not stay adversarial when played over-the-air. Qin et al. [14] further investigate this task with the goal of making the samples stay adversarial over-the-air. To do this they simulate room acoustics during training and use a psychoacoustic audio quality model. They show that their perturbations of the samples are imperceptible to humans, but they are not fully robust when played over-the-air.

2.2 Prerequisites

2.2.1 Short Time Fourier Transform

How should the audio be represented when processed? Raw audio is sampled at a high rate, normally 44100 Hz for music. This leads to very high dimensional data and dependencies between samples stretching far apart.

An alternative for representing the audio involves the use of the Short Time Fourier Transform, or STFT. Segments of the waveform are extracted and the corresponding Fourier Transform for each segment is calculated. The segments are combined into a 2D representation of the audio with time on one axis and frequencies on the other axis. The complex representation is often expressed as amplitude and phase, where the amplitude is also called the spectrogram of the audio. The values in the spectrogram represent the signal energy in the corresponding frequency range, in each segment. The size of the segments is given by the window size, and the distance between the start of each segment is given by the hop length. It is common to let the hop length be smaller than the window size so that the windows overlap each other. In Figure 2.1, segments are sampled with 50% overlap.


Figure 2.1: Segments are sampled from a waveform with 50% overlap.

From the STFT, the waveform can be reconstructed without loss using the inverse STFT. This makes it possible to process the audio in the STFT domain, to use benefits such as the spectral representation and reduced temporal dependencies, and then reconstruct it to a playable waveform. Figure 2.2 illustrates such a pipeline.


Figure 2.2: Pipeline for audio processing using the STFT
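As a concrete illustration of this pipeline, the sketch below round-trips an audio clip through the STFT and back. It assumes the librosa library, which the thesis does not mandate, and uses placeholder values for the file name, window size and hop length.

```python
# Minimal STFT round-trip sketch for the pipeline in Figure 2.2 (librosa is an
# assumed choice; the file name and transform parameters are placeholders).
import numpy as np
import librosa

y, sr = librosa.load("track.wav", sr=44100, mono=True)

window_size, hop_length = 2048, 1024              # 50% overlap, as in Figure 2.1
stft = librosa.stft(y, n_fft=window_size, hop_length=hop_length)

amplitude = np.abs(stft)                          # the spectrogram
phase = np.angle(stft)

# ... modify `amplitude` here, e.g. to embed a watermark ...

reconstructed = librosa.istft(amplitude * np.exp(1j * phase),
                              hop_length=hop_length, length=len(y))
```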

2.2.2 Neural Network Architectures

In both the method by Baluja [1] and that of Zhu et al. [18], the encoders are inspired by autoencoders. An autoencoder is a network architecture where the network is supposed to minimize the difference between the input and the output of the network, typically under some constraint such as a dimensionality reduction or added noise. In the suggested watermarking methods, the constraint is that the encoded watermark should be recoverable by the decoder. More information on autoencoders can be found in [7].

The U-net architecture is an example of an autoencoder, initially presented by Ronneberger et al. [15] as a method for segmentation of medical images. The model aims to preserve low level details by adding skip connections between the down- and upsampling parts of the autoencoder. The reduction of spatial dimensions is achieved by multiple layers of strided convolution, and the upscaling by corresponding layers of transposed convolution. An example of a U-net architecture is shown in Figure 2.3.


Figure 2.3: Overview of the U-net architecture

In this thesis, the encoder in the watermarking pipeline works in a spectrogram-to-spectrogram manner. It is a challenge to make a deep network while still preserving details, which causes problems when processing audio: small changes in the STFT can lead to large differences in the reconstructed audio. Jansson et al. [9] use the U-net architecture for singing voice separation in audio. They propose that the U-net's ability to preserve low level details in images makes it work well for audio reproduction.
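The skip-connection idea can be sketched as below. The thesis does not publish code or name a framework, so this is a minimal U-net-style example written in PyTorch with illustrative layer sizes, not the exact architecture used later.

```python
# Minimal U-net-style autoencoder sketch in PyTorch (assumed framework);
# channel counts and depth are illustrative only.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        self.downs = nn.ModuleList()
        in_ch = 1
        for out_ch in channels:                    # strided convolutions halve H and W
            self.downs.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.2)))
            in_ch = out_ch
        self.ups = nn.ModuleList()
        for out_ch in list(reversed(channels))[1:] + [1]:
            self.ups.append(nn.Sequential(         # transposed convolutions double H and W
                nn.ConvTranspose2d(in_ch, out_ch, kernel_size=5, stride=2,
                                   padding=2, output_padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU()))
            in_ch = out_ch * 2                     # the skip connection doubles the channels
        self.final = nn.Conv2d(in_ch, 1, kernel_size=5, padding=2)

    def forward(self, x):
        skips = []
        for down in self.downs:
            skips.append(x)                        # keep low-level details for later
            x = down(x)
        for up, skip in zip(self.ups, reversed(skips)):
            x = torch.cat([up(x), skip], dim=1)    # skip connection by concatenation
        return self.final(x)

out = TinyUNet()(torch.zeros(2, 1, 256, 64))       # spectrogram-sized dummy input
```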

2.3 Watermarking Evaluation

The model will be trained separately with the different types of noise, as well as with all types of noise combined. The model trained without noise will be used as a baseline. The evaluation is based on two criteria: one to measure the distortion of the carrier and one to measure how well the message is decoded. The decoded message is represented by a binary string, and the decoding performance is given by the binary accuracy, i.e. the fraction of correctly decoded bits. The binary accuracy A is given by

\[ A = 1 - \frac{1}{N} \sum |M_{\mathrm{orig}} - M_{\mathrm{pred}}| \qquad (2.1) \]

where M_orig is the correct message, M_pred is the predicted message, and N is the message length in bits.

The distortion of the carrier is given by the mean absolute error. A higher distortion means that larger changes have been made to the carrier. If the changes are too large they will be perceptible. The mean absolute error D is given by

\[ D = \frac{1}{N} \sum |I_{\mathrm{orig}} - I_{\mathrm{enc}}| \qquad (2.2) \]

where I_orig is the original carrier, I_enc is the encoded carrier, and N is the number of data points.
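For reference, the two criteria can be computed as below; the snippet assumes messages and spectrograms are stored as numpy arrays, which is an implementation choice rather than something stated in the thesis.

```python
# Illustrative implementation of the evaluation criteria in Eq. (2.1) and (2.2),
# assuming numpy arrays: messages contain bits in {0, 1}, spectrograms are floats.
import numpy as np

def binary_accuracy(m_orig: np.ndarray, m_pred: np.ndarray) -> float:
    """Fraction of correctly decoded bits."""
    return 1.0 - float(np.mean(np.abs(m_orig - m_pred)))

def carrier_distortion(i_orig: np.ndarray, i_enc: np.ndarray) -> float:
    """Mean absolute error between the original and encoded spectrograms."""
    return float(np.mean(np.abs(i_orig - i_enc)))

# Example: a 64-bit message decoded with two bit errors gives accuracy 62/64.
rng = np.random.default_rng(0)
message = rng.integers(0, 2, size=64)
decoded = message.copy()
decoded[:2] ^= 1
print(binary_accuracy(message, decoded))   # 0.96875
```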


3 Method

In this chapter, the method used in the watermarking pipeline is described. First, the data used when training and evaluating the model is described. Second, the model architecture is described, split into encoder, decoder and noise. Finally, the training methodology is described, including an explanation of the loss function used.

3.1 Data

The data used to train the model consists of two parts, audio and messages.

3.1.1 Audio

The data used is based on 2500 of the most listened-to tracks of a commercial music service. From each track, an approximately 30 second long clip is used. The audio is sampled at 44100 Hz, converted from MP3 to WAV, and converted from stereo to mono. The tracks are shuffled and split into 1500 training tracks, 500 validation tracks and 500 test tracks. The test tracks are set aside and are not used during development or training.

The 30 second clips are concatenated and then split into segments of 88200 samples each, corresponding to 2 seconds. The Short Time Fourier Transform is calculated with window size 2800 and hop length 1400. This results in an STFT with 1401 frequency bins and 64 time steps, separated into amplitude and phase. The model will only utilize the amplitude in frequency bins 11 to 267, corresponding to frequencies between approximately 157 Hz and 4205 Hz. This is important to make the method work when playing and recording the audio over-the-air with consumer grade hardware, such as laptops or smartphones. The lower frequencies are lost over-the-air, and the higher frequencies will be lost if the hardware has a limited bandwidth. The rest of the amplitudes and the phase are left unchanged during the encoding. When reconstructing the audio after the encoding, the encoded audio is merged with the unchanged parts, as illustrated in Figure 3.1. The decoder uses the same window of the amplitude as the encoder.


Figure 3.1: Only a window of the amplitude of the STFT is processed by the encoder.

Amplitudes are normalized by dividing each sample x_i by the standard deviation of the entire training set as

\[ \hat{x}_i = \frac{x_i}{\mathrm{std}(x_{\mathrm{training}})} \qquad (3.1) \]

where std(x_training) is the standard deviation of the flattened amplitudes of the training set. This normalization scales the amplitudes but keeps them non-negative.
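A sketch of this preprocessing is given below. It assumes librosa for the STFT and a precomputed training-set standard deviation; the thesis does not name its tooling, and the exact bin slicing at the band edges is an interpretation of the text.

```python
# Preprocessing sketch for Section 3.1.1 (librosa assumed; the path and the
# training standard deviation are placeholders).
import numpy as np
import librosa

SR, SEGMENT = 44100, 88200        # 2-second segments
N_FFT, HOP = 2800, 1400           # 1401 frequency bins, 64 time steps per segment
BIN_LO, BIN_HI = 11, 267          # roughly 157 Hz to 4205 Hz (256 bins)

def preprocess(path: str, train_std: float):
    y, _ = librosa.load(path, sr=SR, mono=True)
    results = []
    for start in range(0, len(y) - SEGMENT + 1, SEGMENT):
        stft = librosa.stft(y[start:start + SEGMENT], n_fft=N_FFT, hop_length=HOP)
        amplitude, phase = np.abs(stft), np.angle(stft)
        # Only this band of the amplitude is fed to the encoder; the remaining
        # amplitudes and the phase are kept untouched for reconstruction.
        window = amplitude[BIN_LO:BIN_HI] / train_std     # normalization, Eq. (3.1)
        results.append((window, amplitude, phase))
    return results
```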

From the training, validation and test sets, 10% of the data are set aside and used as audio noise, as described in Section 3.2.3. The noise samples are repeated to match the size of the corresponding dataset.

After separating the spectrograms used for noise, the training set consists of 20230 spectrograms, the validation set of 6741 spectrograms, and the test set of 6744 spectrograms.

3.1.2 Messages

Random binary messages M of length L, with M ∈ {0, 1}^L, are created. To prevent the model from overfitting, each spectrogram in the training set is repeated 5 times with different messages, resulting in a training set of size 101150. For the validation and test sets, only one message is used.
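As an illustration, the training pairs described above could be generated as follows; the helper name and the fixed seed are illustrative choices, not part of the thesis.

```python
# Illustrative generation of random binary messages for Section 3.1.2.
import numpy as np

rng = np.random.default_rng(0)
L, REPEATS = 64, 5                       # message length and repetitions per spectrogram

def training_pairs(spectrograms):
    pairs = []
    for spec in spectrograms:
        for _ in range(REPEATS):         # 20230 spectrograms * 5 = 101150 pairs
            message = rng.integers(0, 2, size=L).astype(np.float32)
            pairs.append((spec, message))
    return pairs
```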


3.2 Architecture

The model architecture is based on three major parts: the encoder, which is used to modify the audio; the noise, which distorts the encoded audio; and the decoder, which decodes the message from the encoded and distorted audio.

3.2.1 Encoder

The encoder architecture is based on the U-net architecture used by Jansson et al. in [9]. An overview of the encoder architecture is given in Figure 3.2. The downsampling part of the U-net consists of 5 blocks. Each block contains one layer of strided 2D convolution with stride 2 and kernel size 5 × 5, batch normalization [8], and Leaky ReLU activations [13] with a leakiness of 0.2. After the 5 blocks, the dimension is reduced to 8 × 2 × 256. In this bottleneck of the network, the message is inserted. Similar to Zhu et al. [18], the message is repeated in the spatial dimensions to match the size of the bottleneck and extended in the channel dimension, leading to a block of size 8 × 2 × 64, as visualised in Figure 3.3. This block is concatenated with the bottleneck, followed by another layer of 2D convolution with stride 1, kernel size 5 × 5 and ReLU activation. To restore the data to the original dimensionality, an upsampling part consisting of 5 blocks is used. Each block consists of a transposed 2D convolution with stride 2 and kernel size 5 × 5, batch normalization and ReLU activations. The first 3 blocks have a layer of dropout with a dropout probability of 50% after the batch normalization layer. At the end of each block, the layer of corresponding size in the downsampling step is concatenated. A final layer of 2D convolution with stride 1, kernel size 5 × 5 and ReLU activation is used to reduce the number of channels to 1. A code sketch of the message insertion step is given after Figure 3.3.


Figure 3.2: Encoder architecture with output dimensions specified for each layer.



Figure 3.3: The message is repeated and stacked before being concatenated into the encoder.
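The repeat-and-stack step in Figure 3.3 and the concatenation at the bottleneck can be sketched as follows; PyTorch is an assumed framework and the (batch, channels, height, width) tensor layout is an implementation choice.

```python
# Sketch of the bottleneck message insertion (Figures 3.2 and 3.3); shapes
# follow Section 3.2.1, PyTorch is an assumed framework.
import torch
import torch.nn as nn

def insert_message(bottleneck: torch.Tensor, message: torch.Tensor) -> torch.Tensor:
    """bottleneck: (B, 256, 8, 2) features, message: (B, 64) bits in {0, 1}."""
    b, _, h, w = bottleneck.shape
    # Repeat each bit over the 8 x 2 spatial grid, giving a (B, 64, 8, 2) block,
    # then concatenate it with the bottleneck along the channel dimension.
    msg_block = message.view(b, -1, 1, 1).expand(b, message.shape[1], h, w)
    return torch.cat([bottleneck, msg_block], dim=1)      # (B, 320, 8, 2)

fuse = nn.Sequential(                                     # convolution after the concat
    nn.Conv2d(256 + 64, 256, kernel_size=5, stride=1, padding=2),
    nn.ReLU())

features = torch.randn(4, 256, 8, 2)
message = torch.randint(0, 2, (4, 64)).float()
fused = fuse(insert_message(features, message))           # (4, 256, 8, 2)
```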

3.2.2 Decoder

The decoder works as a multi-label classifier. An overview of the architecture is given in Figure 3.4. It takes a spectrogram as input and outputs 64 predictions between 0 and 1, each corresponding to a bit. The network consists of 6 blocks of strided 2D convolution, batch normalization and Leaky ReLU activation with a leakiness of 0.2. The strides are set so that the spatial dimension after the last convolution is 64 × 1. The last layer is a fully connected layer with 64 output neurons and sigmoid activations. The outputs will be between 0 and 1, and the predicted bit is generated by rounding the output value. The decoder loss is given by the binary cross entropy between the network output predictions and the encoded message.

Figure 3.4: Decoder architecture with output dimensions specified for each layer.
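A sketch of such a decoder is shown below, again in PyTorch as an assumed framework. The per-block strides and channel counts are not fully specified in the text, so the choices here are illustrative; they reduce a 256 × 64 spectrogram to 64 sigmoid outputs.

```python
# Decoder sketch for Section 3.2.2: strided convolution blocks followed by a
# fully connected layer with 64 sigmoid outputs (strides/channels illustrative).
import torch
import torch.nn as nn

def block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2))

decoder = nn.Sequential(
    block(1, 32), block(32, 64), block(64, 128),
    block(128, 128), block(128, 128), block(128, 128),   # 6 blocks: 256x64 -> 4x1
    nn.Flatten(),
    nn.Linear(128 * 4 * 1, 64),
    nn.Sigmoid())

spectrograms = torch.randn(8, 1, 256, 64)
probs = decoder(spectrograms)                            # (8, 64) bit probabilities
target = torch.randint(0, 2, (8, 64)).float()
loss = nn.BCELoss()(probs, target)                       # binary cross entropy
bits = (probs > 0.5).int()                               # predicted message
```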

3.2.3 Noise

Between the encoder and the decoder, several different types of noise are applied to the encoded audio. All noise types are applied directly to the spectrograms of the audio. To be able to include the noise when training the model, the noise models have to be differentiable. The different types of noise used are listed below; a sketch of how they can be implemented follows the list.

Dropout: 50% of the pixels in the spectrogram, selected randomly, are set to 0. This forces redundancy in the representation of the encoded message.

Gaussian Noise: Gaussian noise with mean 0 and standard deviation 0.2 is added to the spectrogram. The purpose of this noise is to introduce robustness to general noise introduced in the over-the-air channel.

Background Music: The spectrogram of a separate audio clip is added on top of the spectrogram. The noise spectrogram is scaled with a factor of 0.3. This simulates background noise in an over-the-air channel.

Roll & Crop: The spectrogram is shifted along the temporal axis. At the borders the spectrogram wraps around, so that the part of the spectrogram that is shifted out to the right is added back at the beginning. After the roll, the spectrogram is cropped by 50% along the temporal axis. The shift is selected randomly between 0 and 64. In order to not change the input size to the decoder, the spectrogram is zero-padded to the right to match the size before the crop. The purpose of this noise is to reduce the need to match the segments between encoding and decoding. When playing audio over-the-air, the recorded 2 second segments will not be synchronised with the encoded ones.

Combined: All the noise types are combined, ordered as Gaussian Noise, Dropout, Background Music and lastly Roll & Crop. The purpose of using all noise types together is to create a model that is robust to the noise introduced when playing and recording the audio over-the-air.
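The sketch below shows how these noise types can be simulated as differentiable operations on spectrogram tensors; it uses PyTorch as an assumed framework and keeps the parameters from the list above.

```python
# Differentiable noise simulations for Section 3.2.3 (PyTorch assumed).
# `spec` and `music` are spectrogram batches of shape (B, 1, 256, 64).
import torch

def dropout_noise(spec, p=0.5):
    mask = (torch.rand_like(spec) > p).float()       # randomly zero 50% of the pixels
    return spec * mask

def gaussian_noise(spec, std=0.2):
    return spec + std * torch.randn_like(spec)       # zero-mean Gaussian noise

def music_noise(spec, music, scale=0.3):
    return spec + scale * music                      # background music on top

def roll_and_crop(spec, shift=None):
    t = spec.shape[-1]
    if shift is None:
        shift = int(torch.randint(0, t + 1, (1,)))   # random shift between 0 and 64
    rolled = torch.roll(spec, shifts=shift, dims=-1) # wrap around the temporal axis
    cropped = rolled[..., : t // 2]                  # keep 50% along the time axis
    pad = torch.zeros_like(rolled[..., t // 2:])     # zero-pad back to the input size
    return torch.cat([cropped, pad], dim=-1)

def combined(spec, music):
    # Order from the list above: Gaussian, Dropout, Background Music, Roll & Crop.
    return roll_and_crop(music_noise(dropout_noise(gaussian_noise(spec)), music))
```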

3.3 Training

The model is trained using the Adam optimizer [10] with standard parameters, except for the learning rate, which is lowered to 0.0002. The model is trained for 200 epochs with a batch size of 128.

The loss function is defined as the sum of the encoder loss and the decoder loss, which are weighted and scaled to help convergence. The mean absolute error in the encoder loss and the binary cross entropy in the decoder loss do not have any obvious connection, so the losses need to be scaled to give a good balance between them. It is easy for the model to match the encoder input to its output, making this converge faster than the decoder.

To counter this, the decoder loss must be given a higher weight during the early stages of training. The loss weights are updated dynamically with the objective of reducing the relative difference between the losses.

Two target weights, w_encoder and w_decoder, set the desired loss balance. The losses L are multiplied with the weights w to give weighted losses L_weighted = wL. In the beginning of the training, these are adjusted to give the decoder loss a higher weight, allowing the model to make compromises to the audio reconstruction to make it possible to embed the message representation. Initially w_encoder = 1.0 and w_decoder = 2.0. At the end of each of the first 10 epochs, w_encoder is increased by 0.1 and w_decoder is decreased by 0.1. After epoch 10, w_encoder = 2.0 and w_decoder = 1.0, giving more emphasis to the encoder reconstruction quality.

A second method is used to update the loss weights mid-epoch, to force the losses to be balanced according to the target weights. β_encoder and β_decoder are the average losses over the last 10 mini-batches, scaled by the target weights, defined as

\[ \beta_{\mathrm{encoder}} = w_{\mathrm{encoder}} \cdot \frac{1}{10} \sum_{i=0}^{9} L_{\mathrm{encoder},\,b-i} \qquad (3.2) \]

and

\[ \beta_{\mathrm{decoder}} = w_{\mathrm{decoder}} \cdot \frac{1}{10} \sum_{i=0}^{9} L_{\mathrm{decoder},\,b-i} \qquad (3.3) \]

where L_encoder,n and L_decoder,n are the encoder and decoder losses after mini-batch n, respectively, and b is the current mini-batch.

Using the average losses, the loss weight α is updated every 10 mini-batches as

\[ \alpha = \frac{\beta_{\mathrm{encoder}}}{\beta_{\mathrm{encoder}} + \beta_{\mathrm{decoder}}} \qquad (3.4) \]

Given the loss weight α, the final combined loss function is calculated as

\[ L_{\mathrm{combined}} = \alpha L_{\mathrm{encoder}} + (1 - \alpha) L_{\mathrm{decoder}} \qquad (3.5) \]
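The weighting scheme can be summarised in code as below; this is a sketch with assumed variable names, where the losses are scalars (e.g. PyTorch tensors) produced each mini-batch.

```python
# Sketch of the dynamic loss weighting in Section 3.3 (Eqs. 3.2-3.5);
# class and variable names are illustrative.
from collections import deque

class LossBalancer:
    def __init__(self, w_encoder=1.0, w_decoder=2.0, window=10):
        self.w_encoder, self.w_decoder = w_encoder, w_decoder
        self.enc_hist = deque(maxlen=window)    # last 10 encoder losses
        self.dec_hist = deque(maxlen=window)    # last 10 decoder losses
        self.alpha = 0.5

    def end_of_epoch(self, epoch):
        # During the first 10 epochs the target weights move from (1.0, 2.0),
        # favouring the decoder, to (2.0, 1.0), favouring the encoder.
        if epoch < 10:
            self.w_encoder += 0.1
            self.w_decoder -= 0.1

    def combined_loss(self, enc_loss, dec_loss, step):
        self.enc_hist.append(float(enc_loss))
        self.dec_hist.append(float(dec_loss))
        if step % 10 == 0 and len(self.enc_hist) == self.enc_hist.maxlen:
            beta_enc = self.w_encoder * sum(self.enc_hist) / len(self.enc_hist)  # Eq. (3.2)
            beta_dec = self.w_decoder * sum(self.dec_hist) / len(self.dec_hist)  # Eq. (3.3)
            self.alpha = beta_enc / (beta_enc + beta_dec)                        # Eq. (3.4)
        return self.alpha * enc_loss + (1 - self.alpha) * dec_loss               # Eq. (3.5)
```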


4 Results

In this chapter, the results from the evaluation of the models are presented. A total of 6 models have been trained: one for each of the 4 noise types, one without noise, and one with all noise types combined. These models have been evaluated for audio quality and robustness. For the models trained with Crop & Roll noise, the spectrograms are cropped but not rolled when evaluated without this type of noise. The model trained with all noise types combined has also been tested for robustness when playing over-the-air. This model is also analysed with respect to the message representation.

4.1 Quality

To assess the quality of the encoded and reconstructed audio, the mean absolute error between the original and encoded spectrograms is calculated as described in Section 2.3. In Figure 4.1 and Table 4.1, the results for the evaluated models are presented. The values can be compared with the standard deviation of the spectrograms, which is approximately equal to 1. The models trained without noise or with one noise type have a very low mean absolute error, corresponding to small modifications of the encoded spectrogram. The model trained on all noise types has a larger error.

4.2 Robustness

To assess the robustness of the models, the binary accuracy of the decoded message is calculated with different types and amounts of noise added between the encoder and decoder.


Figure 4.1: Mean absolute errors for the spectrograms encoded with each model.

Table 4.1: Mean absolute errors for the spectrograms encoded with each model.

Noise type     Mean absolute error
No noise       0.00097
Gaussian       0.00476
Crop & Roll    0.00381
Dropout        0.00761
Music          0.00220
Combined       0.03577

4.2.1 No Noise

All models are evaluated without any noise introduced between the encoder and the decoder. The results are presented in Figure 4.2 and Table 4.2. The results show that all models except the one trained on all noise types reach almost 100% binary accuracy. The model trained on all noise types performs slightly worse.

4.2.2 All Noise Types

To further evaluate the robustness, all models are evaluated using all noise types combined. The noise levels used are the same as the ones used for training. The results are shown in Figure 4.3 and Table 4.3, and show that all models except the one trained on all noise types do not reach an accuracy higher than random chance. The model trained on all noise types reaches a binary accuracy higher than 99%.

Figure 4.3: The binary accuracy for all models when trained with the noise type indicated on the x-axis and evaluated with all noise types combined.


Figure 4.2: The binary accuracy for all models when trained with the noise type indicated on the x-axis and evaluated without noise.

Table 4.2: The binary accuracy for all models when evaluated without noise.

Noise type     Binary accuracy
No noise       99.991%
Gaussian       99.990%
Crop & Roll    99.965%
Dropout        99.987%
Music          99.998%
Combined       85.782%

Table 4.3: The binary accuracy for all models when evaluated with all noise types combined.

Noise type     Binary accuracy
No noise       50.093%
Gaussian       50.168%
Crop & Roll    50.280%
Dropout        50.278%
Music          50.031%
Combined       99.234%

4.2.3 Gaussian Noise

One of the noise types used is added Gaussian noise. To evaluate the impact of this noise on the robustness, the baseline model trained without noise, the model trained with only Gaussian noise, and the model trained on all noise types combined are evaluated with Gaussian noise. Figure 4.4 shows the binary accuracy as a function of the standard deviation of the noise. The models trained on only Gaussian noise and on combined noise have similar performance, except for very low noise levels, where the combined noise model has worse performance. The model trained without noise sees a large reduction in binary accuracy even with low levels of noise.

Figure 4.4: Binary accuracy for models evaluated with Gaussian noise with different standard deviations.


4.2.4 Dropout Noise

To evaluate the robustness to Dropout noise, the models trained on no noise, on only Dropout noise, and on all noise types are evaluated with different dropout probabilities.

Figure 4.5 shows the binary accuracy as a function of the dropout probability. The model trained on Dropout noise has a high binary accuracy up to a probability of around 0.7. The model trained on combined noise has worse performance, but stays above random chance. The model trained without noise becomes no better than random chance even for low probabilities.

Figure 4.5: Binary accuracy for models evaluated with different probabilities of dropout.

4.2.5 Crop & Roll Noise

To evaluate how Roll & Crop noise affects the robustness, the models trained without noise, with only Roll & Crop noise, and the model trained on all noise types are evaluated with and without Roll & Crop noise. The results are given in Figure 4.6 and Table 4.4 and show that adding Crop & Roll noise has no impact on performance for the models trained with all noise types and with Crop & Roll noise. The combined model has a lower binary accuracy than the specialised model. The model trained without noise has a binary accuracy close to 100% when evaluated without the noise, but does not perform better than random chance when the noise is applied.

Figure 4.6: Binary accuracy from evaluation with and without Roll & Crop noise.

Table 4.4: Binary accuracy from evaluation with and without Roll & Crop noise.

Model          Without Roll & Crop    With Roll & Crop
No noise       99.991%                51.543%
Crop & Roll    99.965%                99.933%
Combined       85.782%                85.913%

4.2.6 Music Noise

The last noise type used and evaluated is Background Music noise. Figure 4.7 shows the results for the models when another music track is added to the encoded audio. The scaling is the factor that the noise spectrogram is multiplied by before adding it to the encoded audio. The results show that the added Music noise has a low impact on the result. All models show good performance even when the scaling is higher than the 0.3 used for training. The model trained without noise has a slightly lower performance when evaluated without noise.

Figure 4.7: The binary accuracy for models when evaluated with music noise with different scaling.


4.3 Evaluation Over-the-air

To evaluate if it is possible to extract the encoded message after playing and recording the audio over-the-air, we play and record a few tracks over-the-air in the physical world.

10 tracks are selected randomly from the test set. Each of the tracks is split into 14 blocks of 88200 samples each, encoded using the model trained on all noise types combined, and then put back together. All 14 blocks are encoded with the same message. The tracks are then played from a laptop speaker and recorded with a handheld smartphone. The recorded audio is preprocessed the same way as the test data and then fed into the decoder. Since the data is cropped, approximately 28 predictions can be generated per track. The average binary accuracy of these is given in Table 4.5. The binary accuracy is between 40% and 60% for all tracks, and the average is close to 50%.

Figure 4.8 shows a histogram with the distribution of the number of correct bits of all 279 predictions. The distribution is centered around 32 bits.

Table 4.5: Average binary accuracy for each of the 10 tracks used for evaluation.

Track       Correct bits (average)    Percentage
Track 1     30.92                     48.32%
Track 2     31.0                      48.43%
Track 3     29.46                     46.03%
Track 4     28.21                     44.08%
Track 5     36.35                     56.80%
Track 6     37.96                     59.31%
Track 7     35.18                     54.97%
Track 8     35.10                     54.85%
Track 9     32.14                     50.22%
Track 10    28.07                     43.86%
All         32.43                     50.67%

4.4 Message representation

To visualise how the message is represented in the encoded spectrograms, the residuals between the encoded and original spectrograms have been plotted for different models, messages and spectrograms.

4.4.1 Difference between models

To visualise how the messages are represented in the encoded spectrograms for the different models, the encoded spectrograms X̂ from one test sample are shown in Figure 4.9. The spectrograms are plotted as log(1 + X̂) for increased visibility.


Figure 4.8: The distribution of the number of correct bits for all predictions.

The absolute value of the residual between the encoded spectrogram X̂ and the original spectrogram X is shown in Figure 4.10 and Figure 4.11, where the residuals have been scaled to log(1 + 10|X̂ − X|) and log(1 + 50|X̂ − X|) respectively. The residuals show distinct differences in how the messages are represented for the different models. Corresponding plots for 3 other samples are given in Appendix A.

4.4.2 Difference between messages

The output of the encoder depends on both the input spectrogram and the message to be encoded. To visualise how these two factors affect the encoder output, two different spectrograms are encoded with the same message. To see how a change of message affects the output, the first spectrogram is also encoded with the binary complement of the first message. For this test, the model trained with all noise types combined is used. The residuals, plotted as log(1 + 10|X̂ − X|), are shown in Figure 4.12. The two spectrograms encoded with the same message have similar looking residuals, while the residual for the first spectrogram encoded with the binary complement of the original message looks different.


Figure 4.9: Encoded spectrograms for all the models.

Figure 4.10: Residuals for the encoded spectrograms, scaled with the factor 10.

Figure 4.11: Residuals for the encoded spectrograms, scaled with the factor 50.

Figure 4.12: The residuals of two spectrograms encoded with the same message (Track 1 and Track 2, Message A), and the residual of the first spectrogram (Track 1) encoded with the binary complement of the message.


5 Discussion

This chapter contains a discussion of the results and the method.

5.1 Results

The results show that the system is able to encode and extract the message with all noise types. When training on only one noise type, or without noise, the decoding can be performed with very high robustness and only small modifications of the carrier.

On the other hand, when all noise types are combined, the encoding still succeeds, but the quality reduction of the audio becomes significant. When listening to the encoded audio, a static background noise can be heard, especially if the carrier audio has silent parts.

When evaluating a model with the same amount and type of noise it was trained on, it has a very high robustness. Interestingly, the model trained on all noise types combined performs worse when it is evaluated on smaller amounts of noise, e.g. when only one noise type is used, or without noise. This indicates that the model might have created dependencies on the noise to extract the message, similar to overfitting. This might reduce the performance when playing over-the-air, if the simulated noise does not correspond to the real-world noise.

In the evaluation of the models using music noise, it is seen that all evaluated models perform fairly well even with a scale factor larger than the one used for training, even the model trained without noise. This indicates that this noise type might not be suitable for the task. The randomly selected tracks used when training might be very weak and generally non-predictable.


One of the goals of this work is to make the watermarking robust to noise. Different types of noise have been used and evaluated, but as seen with the model trained on all noise types, it is not obvious that the models perform well when the noise differs from that used when training. The results for the model trained with Dropout noise show that the dropout probability can be increased to a level higher than used for training with a relatively small reduction in binary accuracy. The same can be seen for the model trained on Gaussian noise in Figure 4.4. These noise types create models less sensitive to the amount of noise in the channel.

Looking at the residuals of the encoded messages in Figure 4.12, it is clear that the encoding is mainly based on the message rather than the spectrogram. Different spectrograms only give small changes, but changing the message makes a big difference. The fact that the message representation does not adapt much when the spectrogram changes leads to audible errors in the reconstructed audio. It is possible that a more adaptable and dynamic representation would hide the message better.

The test when playing the audio over-the-air shows an average binary accuracy close to 50%, i.e. not better than random chance. Looking at individual tracks, some of them reach a slightly higher binary accuracy close to 60%. This shows some potential, but more work has to be done to achieve robust results. It is possible that some improvements could be achieved by better preprocessing of the recorded data, such as denoising and volume normalisation.

5.2 Method

It is clear that the encoder manages to embed the message in the audio while still preserving details in the audio. Looking at the residuals of the encoded audio spectrograms, the message representation does not appear to be very complex; it adds or removes energy in some frequency bands. This raises the question of whether it makes sense to use such a complex model, with over 5 million weights, to create a system with such a simple representation. The model has not been optimised regarding hyperparameters due to lack of time. It is possible that a better optimised model could perform better, or achieve similar results with fewer weights.

Looking at the residuals in Figure 4.11, it is clear that the Crop & Roll noise introduces lines in the residual. When combining this noise with the other types, the result shows noticeable lines along the temporal axis, creating static noise in the background of the original audio. If the time syncing problem could be solved in other ways than adding the Crop & Roll noise, it might be possible to achieve better results with lower quality reductions.

Another factor that is likely to affect the simple message representation is the choice of quality metric for the encoder. The mean absolute error penalises small and large errors equally, making the large errors possible. It is at the same time reasonable to believe that a few large errors are more robust to noise than multiple small ones. The mean absolute error does not necessarily correspond to perceived audio quality, and a possible area of improvement would be to use a better loss function, for example the psychoacoustic model used in the work by Qin et al. [14] that was released towards the end of this thesis.

An alternative approach to creating a deeper representation that is less obvious could be to add an adversary to the model, as used by Zhu et al. [18]. The clear lines in the residual produced by our model would likely be detected and penalised by the adversary. This hypothesis is also supported by [18], where the distortions are much more visible when the adversary is not included in the model.

A drawback of this approach would be increased convergence problems. The adversary adds a third loss that has to be balanced with the two losses in the current model. More sophisticated methods for loss balancing could be investigated, such as the gradient normalization method proposed by Chen et al. [4].

Another area of improvement is the noise simulations. The results show that the model has a tendency to overfit on the noise used in training. A simple remedy could be to vary the amount of noise used during training, but there are also possibilities for big improvements by using more realistic noise. The current methods are good for introducing redundancy in the message representations, but more realistic noise, such as the acoustic room simulations used by Qin et al. [14] or microphone and speaker responses, has the potential to work better when playing over-the-air.

The evaluation of performance when playing over-the-air is performed on a small set of tracks with only one setup of equipment and environment. This limits the ability to draw any strong conclusions about the performance in the general case. To create comparable results, a more thorough evaluation needs to be performed.


6 Conclusions

In this thesis, a method for watermarking in audio using deep learning has been proposed. The method consists of two deep neural networks, an encoder and a decoder, that have been trained end-to-end using music and simulated noise. A baseline model is trained without any simulated noise and is compared to models trained using different types of noise, such as dropout, Gaussian noise, background music and crop and roll, as well as all these noise types combined.

The method performs well when only one type of noise is used, where the message can be extracted with high robustness and small quality reductions. The models trained with simulated noise outperform the baseline model trained without noise regarding robustness to simulated noise. When using multiple noise types, the model is also able to extract the encoded message, but the reduction of audio quality is larger and the model does not generalise very well with respect to the amounts of noise. Even though the model is fairly robust to simulated noise, it is not possible to extract the encoded message after playing and recording the encoded audio over-the-air in the real world.

The results also suggest that the model does not embed the message in the audio in a very complex way. The encoding does not make big adaptations to different spectrograms, and the resulting errors therefore lead to static noise in the encoded audio.

6.1 Future work

Multiple areas of possible future work exist. One problem to look into would be to use a better quality metric for the encoder, e.g. using a psychoacoustic model. This has the potential to result in fewer audible errors and a smarter, more dynamic message representation that adapts to different content.

Another improvement would be to add an adversary to the model, as suggested by Zhu et al. [18]. It is likely that this would require better methods for loss balancing, but the result would hopefully be better hidden messages.

A third area of work to look into would be to use better noise simulations. The noise types used in this thesis do not necessarily correlate with the real world. However, we do show that the suggested method works for simulated noise, and if this noise correlated better with a real over-the-air channel, it is possible that the over-the-air robustness would be higher.

Finally, it could be possible to improve and optimise the network architecture to reduce the network size and complexity.


Appendix


A Encoded spectrograms

Complementary figures to Section 4.4.


Figure A.1: Spectrogram A, encoded spectrograms for all the models.

Figure A.2: Spectrogram A, residuals for the encoded spectrograms, scaled with the factor 10.

Figure A.3: Spectrogram A, residuals for the encoded spectrograms, scaled with the factor 50.

Figure A.4: Spectrogram B, encoded spectrograms for all the models.

Figure A.5: Spectrogram B, residuals for the encoded spectrograms, scaled with the factor 10.

Figure A.6: Spectrogram B, residuals for the encoded spectrograms, scaled with the factor 50.

Figure A.7: Spectrogram C, encoded spectrograms for all the models.

Figure A.8: Spectrogram C, residuals for the encoded spectrograms, scaled with the factor 10.

Figure A.9: Spectrogram C, residuals for the encoded spectrograms, scaled with the factor 50.


Bibliography

[1] Shumeet Baluja. Hiding images in plain sight: Deep steganography. In Neural Information Processing Systems, 2017. URL http://www.esprockets.com/papers/nips2017.pdf.

[2] Walter Bender, Daniel Gruhl, Norishige Morimoto, and Anthony Lu. Techniques for data hiding. IBM Systems Journal, 35(3.4):313–336, 1996.

[3] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. arXiv preprint arXiv:1801.01944, 2018.

[4] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. arXiv preprint arXiv:1711.02257, 2017.

[5] Ingemar J. Cox, Joe Kilian, F. Thomson Leighton, and Talal Shamoon. Secure spread spectrum watermarking for multimedia. IEEE Transactions on Image Processing, 6(12):1673–1687, 1997.

[6] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.

[7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, 2016.

[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015. URL http://jmlr.org/proceedings/papers/v37/ioffe15.pdf.

[9] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In International Society for Music Information Retrieval Conference (ISMIR), 2017.

[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[11] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[12] Yiqing Lin and Waleed H. Abdulla. Principles of psychoacoustics. In Audio Watermark, pages 15–49. Springer, 2015.

[13] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 3, 2013.

[14] Yao Qin, Nicholas Carlini, Ian Goodfellow, Garrison Cottrell, and Colin Raffel. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. arXiv preprint arXiv:1903.10346, 2019.

[15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[16] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[17] Ron G. Van Schyndel, Andrew Z. Tirkel, and Charles F. Osborne. A digital watermark. In Proceedings of the 1st International Conference on Image Processing, volume 2, pages 86–90. IEEE, 1994.

[18] Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei. HiDDeN: Hiding data with deep networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 657–672, 2018.